The TDP MMU has a performance regression compared to the legacy
MMU when CR0 changes often. This was reported for the grsecurity
kernel, which uses CR0.WP to implement kernel W^X. In that case,
each change to CR0.WP unloads the MMU and causes a lot of unnecessary
work. When running nested, this can even cause the L1 to hardly
make progress, as the L0 hypervisor is overwhelmed by the amount
of MMU work that is needed.
The root cause of the issue is that the "MMU role" in KVM is a mess
that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
and the shadow page table format. Whenever something is different
between the MMU and the CPU, it is stored as an extra field in struct
kvm_mmu---and for extra bonus complication, sometimes the same thing
is stored in both the role and an extra field.
So, this is the "no functional change intended" part of the changes
required to fix the performance regression. It separates neatly
the shadow page table format ("MMU role") from the guest page table
format ("CPU role"), and removes the duplicate fields. The next
step then is to avoid unloading the MMU as long as the MMU role
stays the same.
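To make the end state concrete, here is a rough sketch of the idea (member
names as they appear later in the series; an illustration, not the final
code): a CR0/CR4/EFER change only forces real work when the shadow page
table format, i.e. the MMU role, actually changes.

/*
 * Rough sketch only: after the split, struct kvm_mmu carries both a
 * cpu_role (guest paging setup) and an mmu_role (shadow page table
 * format), and reinitialization can stop early when neither changed.
 */
static void sketch_update_roles(struct kvm_mmu *mmu,
                                union kvm_mmu_role new_cpu_role,
                                union kvm_mmu_page_role new_mmu_role)
{
        if (new_cpu_role.as_u64 == mmu->cpu_role.as_u64 &&
            new_mmu_role.word == mmu->mmu_role.word)
                return;                         /* nothing to recompute */

        mmu->cpu_role = new_cpu_role;   /* guest paging setup */
        mmu->mmu_role = new_mmu_role;   /* shadow page table format; only a
                                           change here needs new roots */
}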
Please review!
Paolo
Paolo Bonzini (23):
KVM: MMU: pass uses_nx directly to reset_shadow_zero_bits_mask
KVM: MMU: nested EPT cannot be used in SMM
KVM: MMU: remove valid from extended role
KVM: MMU: constify uses of struct kvm_mmu_role_regs
KVM: MMU: pull computation of kvm_mmu_role_regs to kvm_init_mmu
KVM: MMU: load new PGD once nested two-dimensional paging is
initialized
KVM: MMU: remove kvm_mmu_calc_root_page_role
KVM: MMU: rephrase unclear comment
KVM: MMU: remove "bool base_only" arguments
KVM: MMU: split cpu_role from mmu_role
KVM: MMU: do not recompute root level from kvm_mmu_role_regs
KVM: MMU: remove ept_ad field
KVM: MMU: remove kvm_calc_shadow_root_page_role_common
KVM: MMU: cleanup computation of MMU roles for two-dimensional paging
KVM: MMU: cleanup computation of MMU roles for shadow paging
KVM: MMU: remove extended bits from mmu_role
KVM: MMU: remove redundant bits from extended role
KVM: MMU: fetch shadow EFER.NX from MMU role
KVM: MMU: simplify and/or inline computation of shadow MMU roles
KVM: MMU: pull CPU role computation to kvm_init_mmu
KVM: MMU: store shadow_root_level into mmu_role
KVM: MMU: use cpu_role for root_level
KVM: MMU: replace direct_map with mmu_role.direct
arch/x86/include/asm/kvm_host.h | 13 +-
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 408 ++++++++++++--------------------
arch/x86/kvm/mmu/mmu_audit.c | 6 +-
arch/x86/kvm/mmu/paging_tmpl.h | 12 +-
arch/x86/kvm/mmu/tdp_mmu.c | 4 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 12 +-
10 files changed, 178 insertions(+), 284 deletions(-)
--
2.31.1
reset_shadow_zero_bits_mask has a very unintuitive way of deciding
whether the shadow pages will use the NX bit. The function is used in
two cases, shadow paging and shadow NPT; shadow paging has a use for
EFER.NX and needs to force it enabled, while shadow NPT only needs it
depending on L1's setting.
The actual root problem here is that is_efer_nx, despite being part
of the "base" role, only matches the format of the shadow pages in the
NPT case. For now, just remove the ugly variable initialization and move
the call to reset_shadow_zero_bits_mask out of shadow_mmu_init_context.
The parameter can then be removed after the root problem in the role
is fixed.
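As an aside, the practical effect of uses_nx is on the reserved-bit checks:
if the shadow pages cannot use NX, bit 63 must be treated as reserved in
the SPTEs.  Conceptually (illustration only, not KVM's code):

/* Illustration only: the reserved high bit implied by uses_nx. */
static inline u64 nx_reserved_bit_sketch(bool uses_nx)
{
        return uses_nx ? 0 : BIT_ULL(63);
}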
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++-------------
1 file changed, 13 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 296f8723f9ae..9424ae90f1ef 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4410,18 +4410,9 @@ static inline u64 reserved_hpa_bits(void)
* follow the features in guest.
*/
static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context)
+ struct kvm_mmu *context,
+ bool uses_nx)
{
- /*
- * KVM uses NX when TDP is disabled to handle a variety of scenarios,
- * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
- * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
- * The iTLB multi-hit workaround can be toggled at any time, so assume
- * NX can be used by any non-nested shadow MMU to avoid having to reset
- * MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
- */
- bool uses_nx = is_efer_nx(context) || !tdp_enabled;
-
/* @amd adds a check on bit of SPTEs, which KVM shouldn't use anyways. */
bool is_amd = true;
/* KVM doesn't use 2-level page tables for the shadow MMU. */
@@ -4829,8 +4820,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
reset_guest_paging_metadata(vcpu, context);
context->shadow_root_level = new_role.base.level;
-
- reset_shadow_zero_bits_mask(vcpu, context);
}
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
@@ -4841,6 +4830,16 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
kvm_calc_shadow_mmu_root_page_role(vcpu, regs, false);
shadow_mmu_init_context(vcpu, context, regs, new_role);
+
+ /*
+ * KVM uses NX when TDP is disabled to handle a variety of scenarios,
+ * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
+ * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
+ * The iTLB multi-hit workaround can be toggled at any time, so assume
+ * NX can be used by any non-nested shadow MMU to avoid having to reset
+ * MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
+ */
+ reset_shadow_zero_bits_mask(vcpu, context, true);
}
static union kvm_mmu_role
@@ -4872,6 +4871,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
__kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
shadow_mmu_init_context(vcpu, context, &regs, new_role);
+ reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
--
2.31.1
The level field of the MMU role can act as a marker for validity
instead: it is guaranteed to be nonzero so a zero value means the role
is invalid and the MMU properties will be computed again.
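In other words (sketch only, with the field names used in this patch):
zeroing the level guarantees that the next role comparison fails, so the
"nothing changed" fast path is never taken for a stale role.

/*
 * Sketch: a computed role always has a nonzero level, so a zeroed level
 * can never compare equal to a freshly computed role and kvm_init_mmu()
 * will recompute the MMU properties.
 */
static void sketch_invalidate_role(struct kvm_mmu *mmu)
{
        mmu->mmu_role.base.level = 0;
}

static bool sketch_role_unchanged(struct kvm_mmu *mmu,
                                  union kvm_mmu_role new_role)
{
        return new_role.as_u64 == mmu->mmu_role.as_u64;
}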
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +---
arch/x86/kvm/mmu/mmu.c | 9 +++------
2 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e7e5bd9a984d..4ec7d1e3aa36 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -342,8 +342,7 @@ union kvm_mmu_page_role {
* kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
* relevant to the current MMU configuration. When loading CR0, CR4, or EFER,
* including on nested transitions, if nothing in the full role changes then
- * MMU re-configuration can be skipped. @valid bit is set on first usage so we
- * don't treat all-zero structure as valid data.
+ * MMU re-configuration can be skipped.
*
* The properties that are tracked in the extended role but not the page role
* are for things that either (a) do not affect the validity of the shadow page
@@ -360,7 +359,6 @@ union kvm_mmu_page_role {
union kvm_mmu_extended_role {
u32 word;
struct {
- unsigned int valid:1;
unsigned int execonly:1;
unsigned int cr0_pg:1;
unsigned int cr4_pae:1;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b0065ae3cea8..0039b2f21286 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4683,8 +4683,6 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
ext.efer_lma = ____is_efer_lma(regs);
}
- ext.valid = 1;
-
return ext;
}
@@ -4891,7 +4889,6 @@ kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
role.ext.word = 0;
role.ext.execonly = execonly;
- role.ext.valid = 1;
return role;
}
@@ -5039,9 +5036,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
* problem is swept under the rug; KVM's CPUID API is horrific and
* it's all but impossible to solve it without introducing a new API.
*/
- vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
- vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
- vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
+ vcpu->arch.root_mmu.mmu_role.base.level = 0;
+ vcpu->arch.guest_mmu.mmu_role.base.level = 0;
+ vcpu->arch.nested_mmu.mmu_role.base.level = 0;
kvm_mmu_reset_context(vcpu);
/*
--
2.31.1
direct_map is always equal to the role's direct field:
- for shadow paging, direct_map is true if CR0.PG=0, and mmu_role.direct is
  copied from cpu_role.base.direct
- for TDP, it is always true and mmu_role.direct is also always true
- for shadow EPT, it is always false and mmu_role.direct is also always
false
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++----------------
arch/x86/kvm/x86.c | 12 ++++++------
4 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c86a2beee92a..647b3f6d02d0 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -432,7 +432,6 @@ struct kvm_mmu {
gpa_t root_pgd;
union kvm_mmu_role cpu_role;
union kvm_mmu_page_role mmu_role;
- bool direct_map;
struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5a6541d6a424..ce55fad99671 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2045,7 +2045,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
int direct,
unsigned int access)
{
- bool direct_mmu = vcpu->arch.mmu->direct_map;
+ bool direct_mmu = vcpu->arch.mmu->mmu_role.direct;
union kvm_mmu_page_role role;
struct hlist_head *sp_list;
unsigned quadrant;
@@ -2147,7 +2147,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
if (iterator->level >= PT64_ROOT_4LEVEL &&
vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
- !vcpu->arch.mmu->direct_map)
+ !vcpu->arch.mmu->mmu_role.direct)
iterator->level = PT32E_ROOT_LEVEL;
if (iterator->level == PT32E_ROOT_LEVEL) {
@@ -2523,7 +2523,7 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t gpa;
int r;
- if (vcpu->arch.mmu->direct_map)
+ if (vcpu->arch.mmu->mmu_role.direct)
return 0;
gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
@@ -3255,7 +3255,8 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
if (free_active_root) {
if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
- (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
+ (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
+ mmu->mmu_role.direct)) {
mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
} else if (mmu->pae_root) {
for (i = 0; i < 4; ++i) {
@@ -3558,7 +3559,8 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
* equivalent level in the guest's NPT to shadow. Allocate the tables
* on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
*/
- if (mmu->direct_map || mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
+ if (mmu->mmu_role.direct ||
+ mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
mmu->mmu_role.level < PT64_ROOT_4LEVEL)
return 0;
@@ -3647,7 +3649,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
int i;
struct kvm_mmu_page *sp;
- if (vcpu->arch.mmu->direct_map)
+ if (vcpu->arch.mmu->mmu_role.direct)
return;
if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
@@ -3872,7 +3874,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
arch.gfn = gfn;
- arch.direct_map = vcpu->arch.mmu->direct_map;
+ arch.direct_map = vcpu->arch.mmu->mmu_role.direct;
arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
return kvm_setup_async_pf(vcpu, cr2_or_gpa,
@@ -4090,7 +4092,6 @@ static void nonpaging_init_context(struct kvm_mmu *context)
context->gva_to_gpa = nonpaging_gva_to_gpa;
context->sync_page = nonpaging_sync_page;
context->invlpg = NULL;
- context->direct_map = true;
}
static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
@@ -4641,7 +4642,6 @@ static void paging64_init_context(struct kvm_mmu *context)
context->gva_to_gpa = paging64_gva_to_gpa;
context->sync_page = paging64_sync_page;
context->invlpg = paging64_invlpg;
- context->direct_map = false;
}
static void paging32_init_context(struct kvm_mmu *context)
@@ -4650,7 +4650,6 @@ static void paging32_init_context(struct kvm_mmu *context)
context->gva_to_gpa = paging32_gva_to_gpa;
context->sync_page = paging32_sync_page;
context->invlpg = paging32_invlpg;
- context->direct_map = false;
}
static union kvm_mmu_role
@@ -4735,7 +4734,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
context->page_fault = kvm_tdp_page_fault;
context->sync_page = nonpaging_sync_page;
context->invlpg = NULL;
- context->direct_map = true;
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
@@ -4852,7 +4850,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
context->gva_to_gpa = ept_gva_to_gpa;
context->sync_page = ept_sync_page;
context->invlpg = ept_invlpg;
- context->direct_map = false;
+
update_permission_bitmask(context, true);
context->pkru_mask = 0;
reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
@@ -4967,13 +4965,13 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
int r;
- r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
+ r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->mmu_role.direct);
if (r)
goto out;
r = mmu_alloc_special_roots(vcpu);
if (r)
goto out;
- if (vcpu->arch.mmu->direct_map)
+ if (vcpu->arch.mmu->mmu_role.direct)
r = mmu_alloc_direct_roots(vcpu);
else
r = mmu_alloc_shadow_roots(vcpu);
@@ -5176,7 +5174,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_PF;
- bool direct = vcpu->arch.mmu->direct_map;
+ bool direct = vcpu->arch.mmu->mmu_role.direct;
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
return RET_PF_RETRY;
@@ -5207,7 +5205,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
* paging in both guests. If true, we simply unprotect the page
* and resume the guest.
*/
- if (vcpu->arch.mmu->direct_map &&
+ if (vcpu->arch.mmu->mmu_role.direct &&
(error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 657aa646871e..b910fa34e57e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7978,7 +7978,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
return false;
- if (!vcpu->arch.mmu->direct_map) {
+ if (!vcpu->arch.mmu->mmu_role.direct) {
/*
* Write permission should be allowed since only
* write access need to be emulated.
@@ -8011,7 +8011,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
kvm_release_pfn_clean(pfn);
/* The instructions are well-emulated on direct mmu. */
- if (vcpu->arch.mmu->direct_map) {
+ if (vcpu->arch.mmu->mmu_role.direct) {
unsigned int indirect_shadow_pages;
write_lock(&vcpu->kvm->mmu_lock);
@@ -8079,7 +8079,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
vcpu->arch.last_retry_eip = ctxt->eip;
vcpu->arch.last_retry_addr = cr2_or_gpa;
- if (!vcpu->arch.mmu->direct_map)
+ if (!vcpu->arch.mmu->mmu_role.direct)
gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
@@ -8359,7 +8359,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
ctxt->exception.address = cr2_or_gpa;
/* With shadow page tables, cr2 contains a GVA or nGPA. */
- if (vcpu->arch.mmu->direct_map) {
+ if (vcpu->arch.mmu->mmu_role.direct) {
ctxt->gpa_available = true;
ctxt->gpa_val = cr2_or_gpa;
}
@@ -12196,7 +12196,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
int r;
- if ((vcpu->arch.mmu->direct_map != work->arch.direct_map) ||
+ if ((vcpu->arch.mmu->mmu_role.direct != work->arch.direct_map) ||
work->wakeup_all)
return;
@@ -12204,7 +12204,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
if (unlikely(r))
return;
- if (!vcpu->arch.mmu->direct_map &&
+ if (!vcpu->arch.mmu->mmu_role.direct &&
work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
return;
--
2.31.1
The init_kvm_*mmu functions, with the exception of shadow NPT,
do not need to know the full values of CR0/CR4/EFER; they only
need to know the bits that make up the "role". This cleanup
however will take quite a few incremental steps. As a start,
pull the common computation of the struct kvm_mmu_role_regs
into their caller: all of them extract the struct from the vcpu
as the very first step.
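For reference, the snapshot that gets passed down is just the handful of
registers that feed the role; roughly (simplified sketch, the real
vcpu_to_role_regs() only reads the role-relevant CR0/CR4 bits):

/* Simplified illustration of the register snapshot. */
struct kvm_mmu_role_regs_sketch {
        unsigned long cr0;
        unsigned long cr4;
        u64 efer;
};

static struct kvm_mmu_role_regs_sketch sketch_role_regs(struct kvm_vcpu *vcpu)
{
        struct kvm_mmu_role_regs_sketch regs = {
                .cr0  = kvm_read_cr0(vcpu),
                .cr4  = kvm_read_cr4(vcpu),
                .efer = vcpu->arch.efer,
        };

        return regs;
}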
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3add9d8b0630..577e70509510 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4736,12 +4736,12 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
return role;
}
-static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
+static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
+ const struct kvm_mmu_role_regs *regs)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
- struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
union kvm_mmu_role new_role =
- kvm_calc_tdp_mmu_root_page_role(vcpu, &regs, false);
+ kvm_calc_tdp_mmu_root_page_role(vcpu, regs, false);
if (new_role.as_u64 == context->mmu_role.as_u64)
return;
@@ -4755,7 +4755,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
- context->root_level = role_regs_to_root_level(®s);
+ context->root_level = role_regs_to_root_level(regs);
if (!is_cr0_pg(context))
context->gva_to_gpa = nonpaging_gva_to_gpa;
@@ -4803,7 +4803,7 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
}
static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
- struct kvm_mmu_role_regs *regs,
+ const struct kvm_mmu_role_regs *regs,
union kvm_mmu_role new_role)
{
if (new_role.as_u64 == context->mmu_role.as_u64)
@@ -4824,7 +4824,7 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
}
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
- struct kvm_mmu_role_regs *regs)
+ const struct kvm_mmu_role_regs *regs)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
union kvm_mmu_role new_role =
@@ -4845,7 +4845,7 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
static union kvm_mmu_role
kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
- struct kvm_mmu_role_regs *regs)
+ const struct kvm_mmu_role_regs *regs)
{
union kvm_mmu_role role =
kvm_calc_shadow_root_page_role_common(vcpu, regs, false);
@@ -4930,12 +4930,12 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
-static void init_kvm_softmmu(struct kvm_vcpu *vcpu)
+static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
+ const struct kvm_mmu_role_regs *regs)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
- struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
- kvm_init_shadow_mmu(vcpu, &regs);
+ kvm_init_shadow_mmu(vcpu, regs);
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
@@ -4959,10 +4959,9 @@ kvm_calc_nested_mmu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *
return role;
}
-static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
+static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
{
- struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
- union kvm_mmu_role new_role = kvm_calc_nested_mmu_role(vcpu, &regs);
+ union kvm_mmu_role new_role = kvm_calc_nested_mmu_role(vcpu, regs);
struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
if (new_role.as_u64 == g_context->mmu_role.as_u64)
@@ -5002,12 +5001,14 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
void kvm_init_mmu(struct kvm_vcpu *vcpu)
{
+ struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
+
if (mmu_is_nested(vcpu))
- init_kvm_nested_mmu(vcpu);
+ init_kvm_nested_mmu(vcpu, &regs);
else if (tdp_enabled)
- init_kvm_tdp_mmu(vcpu);
+ init_kvm_tdp_mmu(vcpu, &regs);
else
- init_kvm_softmmu(vcpu);
+ init_kvm_softmmu(vcpu, &regs);
}
EXPORT_SYMBOL_GPL(kvm_init_mmu);
--
2.31.1
Now that the MMU role is separate from the CPU role, it contains a
truthful description of the format of the shadow pages. This includes
whether the shadow pages use the NX bit, so use the MMU role instead of
hardcoding it in the callers of reset_shadow_zero_bits_mask.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b3856551607d..bba712d1a6d7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4398,13 +4398,13 @@ static inline u64 reserved_hpa_bits(void)
* follow the features in guest.
*/
static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
- struct kvm_mmu *context,
- bool uses_nx)
+ struct kvm_mmu *context)
{
/* @amd adds a check on bit of SPTEs, which KVM shouldn't use anyways. */
bool is_amd = true;
/* KVM doesn't use 2-level page tables for the shadow MMU. */
bool is_pse = false;
+ bool uses_nx = context->mmu_role.efer_nx;
struct rsvd_bits_validate *shadow_zero_check;
int i;
@@ -4810,7 +4810,7 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
* NX can be used by any non-nested shadow MMU to avoid having to reset
* MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
*/
- reset_shadow_zero_bits_mask(vcpu, context, true);
+ reset_shadow_zero_bits_mask(vcpu, context);
}
static union kvm_mmu_page_role
@@ -4834,7 +4834,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
union kvm_mmu_page_role mmu_role = kvm_calc_shadow_npt_root_page_role(vcpu, cpu_role);
shadow_mmu_init_context(vcpu, context, cpu_role, mmu_role);
- reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
+ reset_shadow_zero_bits_mask(vcpu, context);
kvm_mmu_new_pgd(vcpu, nested_cr3);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
--
2.31.1
The root_level can be found in the cpu_role (in fact the field
is superfluous and could be removed, but one thing at a time).
Since there is only one usage left of role_regs_to_root_level,
inline it into kvm_calc_cpu_role.
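For readers following along, the mapping itself is unchanged; written out
as a standalone sketch, identical to the removed helper (the !CR0.PG case
is handled earlier in kvm_calc_cpu_role, which is why it does not reappear
in the hunk below):

/* Same mapping as the removed role_regs_to_root_level(), for reference. */
static int root_level_sketch(const struct kvm_mmu_role_regs *regs)
{
        if (!____is_cr0_pg(regs))
                return 0;                       /* paging disabled */
        if (____is_efer_lma(regs))
                return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL
                                             : PT64_ROOT_4LEVEL;
        if (____is_cr4_pae(regs))
                return PT32E_ROOT_LEVEL;        /* PAE paging */
        return PT32_ROOT_LEVEL;                 /* 2-level 32-bit paging */
}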
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f98444e1d834..74789295f922 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -253,19 +253,6 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
return regs;
}
-static int role_regs_to_root_level(const struct kvm_mmu_role_regs *regs)
-{
- if (!____is_cr0_pg(regs))
- return 0;
- else if (____is_efer_lma(regs))
- return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL :
- PT64_ROOT_4LEVEL;
- else if (____is_cr4_pae(regs))
- return PT32E_ROOT_LEVEL;
- else
- return PT32_ROOT_LEVEL;
-}
-
static inline bool kvm_available_flush_tlb_with_range(void)
{
return kvm_x86_ops.tlb_remote_flush_with_range;
@@ -4673,7 +4660,13 @@ kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
role.base.smep_andnot_wp = ____is_cr4_smep(regs) && !____is_cr0_wp(regs);
role.base.smap_andnot_wp = ____is_cr4_smap(regs) && !____is_cr0_wp(regs);
role.base.has_4_byte_gpte = !____is_cr4_pae(regs);
- role.base.level = role_regs_to_root_level(regs);
+
+ if (____is_efer_lma(regs))
+ role.base.level = ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
+ else if (____is_cr4_pae(regs))
+ role.base.level = PT32E_ROOT_LEVEL;
+ else
+ role.base.level = PT32_ROOT_LEVEL;
role.ext.cr0_pg = 1;
role.ext.cr4_pae = ____is_cr4_pae(regs);
@@ -4766,7 +4759,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
- context->root_level = role_regs_to_root_level(regs);
+ context->root_level = cpu_role.base.level;
if (!is_cr0_pg(context))
context->gva_to_gpa = nonpaging_gva_to_gpa;
--
2.31.1
On 2/4/22 19:32, David Matlack wrote:
>> - vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>> - vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>> - vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
>> + vcpu->arch.root_mmu.mmu_role.base.level = 0;
>> + vcpu->arch.guest_mmu.mmu_role.base.level = 0;
>> + vcpu->arch.nested_mmu.mmu_role.base.level = 0;
> I agree this will work but I think it makes the code more difficult to
> follow (and I start worrying that some code that relies on level being
> accurate will creep in in the future). At minimum we should extend the
> comment here to describe why level is being changed.
>
> I did a half-assed attempt to pass something like "bool force_role_reset"
> down to the MMU initialization functions as an alternative but it very
> quickly got out of hand.
>
> What about just changing `valid` to `cpuid_stale` and flipping the meaning?
> kvm_mmu_after_set_cpuid() would set the cpuid_stale bit and then reset
> the MMUs.
>
For now I'll swap this patch with one that clears the whole word, but
keep the ext bit as described in my other reply.
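Something along these lines, conceptually (sketch only; the reworked patch
is not part of this thread and the exact fields may differ):

/*
 * Conceptual sketch: zero the whole base page-role word (which is always
 * nonzero when valid) while an explicit valid bit stays in the extended
 * role.
 */
vcpu->arch.root_mmu.mmu_role.base.word = 0;
vcpu->arch.root_mmu.mmu_role.ext.valid = 0;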
Paolo
On 2/4/22 18:59, David Matlack wrote:
>> + reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
>
> Out of curiousity, how does KVM mitigate iTLB multi-hit when shadowing
> NPT and the guest has not enabled EFER.NX?
You got me worried for a second but iTLB multihit is Intel-only, isn't it?
Paolo
Remove another duplicate field of struct kvm_mmu. This time it's
the root level for page table walking; we were already initializing
it mostly as cpu_role.base.level, but the field still existed;
remove it.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu/mmu.c | 22 +++++++++-------------
arch/x86/kvm/mmu/mmu_audit.c | 6 +++---
arch/x86/kvm/mmu/paging_tmpl.h | 4 ++--
4 files changed, 14 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 867fc82f1de5..c86a2beee92a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -432,7 +432,6 @@ struct kvm_mmu {
gpa_t root_pgd;
union kvm_mmu_role cpu_role;
union kvm_mmu_page_role mmu_role;
- u8 root_level;
bool direct_map;
struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4d1fa87718f8..5a6541d6a424 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2146,7 +2146,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
iterator->level = vcpu->arch.mmu->mmu_role.level;
if (iterator->level >= PT64_ROOT_4LEVEL &&
- vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
+ vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
!vcpu->arch.mmu->direct_map)
iterator->level = PT32E_ROOT_LEVEL;
@@ -3255,7 +3255,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
if (free_active_root) {
if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
- (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
+ (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
} else if (mmu->pae_root) {
for (i = 0; i < 4; ++i) {
@@ -3453,7 +3453,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
* On SVM, reading PDPTRs might access guest memory, which might fault
* and thus might sleep. Grab the PDPTRs before acquiring mmu_lock.
*/
- if (mmu->root_level == PT32E_ROOT_LEVEL) {
+ if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
for (i = 0; i < 4; ++i) {
pdptrs[i] = mmu->get_pdptr(vcpu, i);
if (!(pdptrs[i] & PT_PRESENT_MASK))
@@ -3477,7 +3477,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
* Do we shadow a long mode page table? If so we need to
* write-protect the guests page table root.
*/
- if (mmu->root_level >= PT64_ROOT_4LEVEL) {
+ if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, root_gfn, 0,
mmu->mmu_role.level, false);
mmu->root_hpa = root;
@@ -3516,7 +3516,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
- if (mmu->root_level == PT32E_ROOT_LEVEL) {
+ if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
if (!(pdptrs[i] & PT_PRESENT_MASK)) {
mmu->pae_root[i] = INVALID_PAE_ROOT;
continue;
@@ -3558,7 +3558,7 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
* equivalent level in the guest's NPT to shadow. Allocate the tables
* on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
*/
- if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
+ if (mmu->direct_map || mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
mmu->mmu_role.level < PT64_ROOT_4LEVEL)
return 0;
@@ -3655,7 +3655,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
- if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
+ if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
hpa_t root = vcpu->arch.mmu->root_hpa;
sp = to_shadow_page(root);
@@ -4146,7 +4146,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
* later if necessary.
*/
if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
- mmu->root_level >= PT64_ROOT_4LEVEL)
+ mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL)
return cached_root_available(vcpu, new_pgd, new_role);
return false;
@@ -4335,7 +4335,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
{
__reset_rsvds_bits_mask(&context->guest_rsvd_check,
vcpu->arch.reserved_gpa_bits,
- context->root_level, is_efer_nx(context),
+ context->cpu_role.base.level, is_efer_nx(context),
guest_can_use_gbpages(vcpu),
is_cr4_pse(context),
guest_cpuid_is_amd_or_hygon(vcpu));
@@ -4739,7 +4739,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
- context->root_level = cpu_role.base.level;
if (!is_cr0_pg(context))
context->gva_to_gpa = nonpaging_gva_to_gpa;
@@ -4769,7 +4768,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
paging64_init_context(context);
else
paging32_init_context(context);
- context->root_level = cpu_role.base.level;
reset_guest_paging_metadata(vcpu, context);
}
@@ -4854,7 +4852,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
context->gva_to_gpa = ept_gva_to_gpa;
context->sync_page = ept_sync_page;
context->invlpg = ept_invlpg;
- context->root_level = level;
context->direct_map = false;
update_permission_bitmask(context, true);
context->pkru_mask = 0;
@@ -4889,7 +4886,6 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role new_ro
g_context->get_guest_pgd = get_cr3;
g_context->get_pdptr = kvm_pdptr_read;
g_context->inject_page_fault = kvm_inject_page_fault;
- g_context->root_level = new_role.base.level;
/*
* L2 page tables are never shadowed, so there is no need to sync
diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index f31fdb874f1f..eb9c59fcb957 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -59,11 +59,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
return;
- if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
+ if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
hpa_t root = vcpu->arch.mmu->root_hpa;
sp = to_shadow_page(root);
- __mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->root_level);
+ __mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->cpu_role.base.level);
return;
}
@@ -119,7 +119,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
hpa = pfn << PAGE_SHIFT;
if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
- "ent %llxn", vcpu->arch.mmu->root_level, pfn,
+ "ent %llxn", vcpu->arch.mmu->cpu_role.base.level, pfn,
hpa, *sptep);
}
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 847c4339e4d9..dd0b6f83171f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -361,7 +361,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
trace_kvm_mmu_pagetable_walk(addr, access);
retry_walk:
- walker->level = mmu->root_level;
+ walker->level = mmu->cpu_role.base.level;
pte = mmu->get_guest_pgd(vcpu);
have_ad = PT_HAVE_ACCESSED_DIRTY(mmu);
@@ -656,7 +656,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
WARN_ON_ONCE(gw->gfn != base_gfn);
direct_access = gw->pte_access;
- top_level = vcpu->arch.mmu->root_level;
+ top_level = vcpu->arch.mmu->cpu_role.base.level;
if (top_level == PT32E_ROOT_LEVEL)
top_level = PT32_ROOT_LEVEL;
/*
--
2.31.1
On Fri, Feb 04, 2022 at 06:56:58AM -0500, Paolo Bonzini wrote:
> The level field of the MMU role can act as a marker for validity
> instead: it is guaranteed to be nonzero so a zero value means the role
> is invalid and the MMU properties will be computed again.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 4 +---
> arch/x86/kvm/mmu/mmu.c | 9 +++------
> 2 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e7e5bd9a984d..4ec7d1e3aa36 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -342,8 +342,7 @@ union kvm_mmu_page_role {
> * kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
> * relevant to the current MMU configuration. When loading CR0, CR4, or EFER,
> * including on nested transitions, if nothing in the full role changes then
> - * MMU re-configuration can be skipped. @valid bit is set on first usage so we
> - * don't treat all-zero structure as valid data.
> + * MMU re-configuration can be skipped.
> *
> * The properties that are tracked in the extended role but not the page role
> * are for things that either (a) do not affect the validity of the shadow page
> @@ -360,7 +359,6 @@ union kvm_mmu_page_role {
> union kvm_mmu_extended_role {
> u32 word;
> struct {
> - unsigned int valid:1;
> unsigned int execonly:1;
> unsigned int cr0_pg:1;
> unsigned int cr4_pae:1;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b0065ae3cea8..0039b2f21286 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4683,8 +4683,6 @@ static union kvm_mmu_extended_role kvm_calc_mmu_role_ext(struct kvm_vcpu *vcpu,
> ext.efer_lma = ____is_efer_lma(regs);
> }
>
> - ext.valid = 1;
> -
> return ext;
> }
>
> @@ -4891,7 +4889,6 @@ kvm_calc_shadow_ept_root_page_role(struct kvm_vcpu *vcpu, bool accessed_dirty,
> /* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
> role.ext.word = 0;
> role.ext.execonly = execonly;
> - role.ext.valid = 1;
>
> return role;
> }
> @@ -5039,9 +5036,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> * problem is swept under the rug; KVM's CPUID API is horrific and
> * it's all but impossible to solve it without introducing a new API.
> */
> - vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
> - vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
> - vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
> + vcpu->arch.root_mmu.mmu_role.base.level = 0;
> + vcpu->arch.guest_mmu.mmu_role.base.level = 0;
> + vcpu->arch.nested_mmu.mmu_role.base.level = 0;
I agree this will work but I think it makes the code more difficult to
follow (and I start worrying that some code that relies on level being
accurate will creep in in the future). At minimum we should extend the
comment here to describe why level is being changed.
I did a half-assed attempt to pass something like "bool force_role_reset"
down to the MMU initialization functions as an alternative but it very
quickly got out of hand.
What about just changing `valid` to `cpuid_stale` and flipping the meaning?
kvm_mmu_after_set_cpuid() would set the cpuid_stale bit and then reset
the MMUs.
> kvm_mmu_reset_context(vcpu);
>
> /*
> --
> 2.31.1
>
>
On Fri, Feb 04, 2022 at 06:56:56AM -0500, Paolo Bonzini wrote:
> reset_shadow_zero_bits_mask has a very unintuitive way of deciding
> whether the shadow pages will use the NX bit. The function is used in
> two cases, shadow paging and shadow NPT; shadow paging has a use for
> EFER.NX and needs to force it enabled, while shadow NPT only needs it
> depending on L1's setting.
>
> The actual root problem here is that is_efer_nx, despite being part
> of the "base" role, only matches the format of the shadow pages in the
> NPT case. For now, just remove the ugly variable initialization and move
> the call to reset_shadow_zero_bits_mask out of shadow_mmu_init_context.
> The parameter can then be removed after the root problem in the role
> is fixed.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: David Matlack <[email protected]>
(I agree this commit makes no functional change.)
> ---
> arch/x86/kvm/mmu/mmu.c | 26 +++++++++++++-------------
> 1 file changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 296f8723f9ae..9424ae90f1ef 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4410,18 +4410,9 @@ static inline u64 reserved_hpa_bits(void)
> * follow the features in guest.
> */
> static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
> - struct kvm_mmu *context)
> + struct kvm_mmu *context,
> + bool uses_nx)
> {
> - /*
> - * KVM uses NX when TDP is disabled to handle a variety of scenarios,
> - * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
> - * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
> - * The iTLB multi-hit workaround can be toggled at any time, so assume
> - * NX can be used by any non-nested shadow MMU to avoid having to reset
> - * MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
> - */
> - bool uses_nx = is_efer_nx(context) || !tdp_enabled;
> -
> /* @amd adds a check on bit of SPTEs, which KVM shouldn't use anyways. */
> bool is_amd = true;
> /* KVM doesn't use 2-level page tables for the shadow MMU. */
> @@ -4829,8 +4820,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
>
> reset_guest_paging_metadata(vcpu, context);
> context->shadow_root_level = new_role.base.level;
> -
> - reset_shadow_zero_bits_mask(vcpu, context);
> }
>
> static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
> @@ -4841,6 +4830,16 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
> kvm_calc_shadow_mmu_root_page_role(vcpu, regs, false);
>
> shadow_mmu_init_context(vcpu, context, regs, new_role);
> +
> + /*
> + * KVM uses NX when TDP is disabled to handle a variety of scenarios,
> + * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
> + * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
> + * The iTLB multi-hit workaround can be toggled at any time, so assume
> + * NX can be used by any non-nested shadow MMU to avoid having to reset
> + * MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
> + */
> + reset_shadow_zero_bits_mask(vcpu, context, true);
> }
>
> static union kvm_mmu_role
> @@ -4872,6 +4871,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
> __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
>
> shadow_mmu_init_context(vcpu, context, &regs, new_role);
> + reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
Out of curiousity, how does KVM mitigate iTLB multi-hit when shadowing
NPT and the guest has not enabled EFER.NX?
> }
> EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
>
> --
> 2.31.1
>
>
On Fri, Feb 04, 2022 at 06:57:17AM -0500, Paolo Bonzini wrote:
> Remove another duplicate field of struct kvm_mmu. This time it's
> the root level for page table walking; we were already initializing
> it mostly as cpu_role.base.level, but the field still existed;
> remove it.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
nit: How about the following for the shortlog?
KVM: MMU: Replace root_level with cpu_role.base.level
Otherwise,
Reviewed-by: David Matlack <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/mmu/mmu.c | 22 +++++++++-------------
> arch/x86/kvm/mmu/mmu_audit.c | 6 +++---
> arch/x86/kvm/mmu/paging_tmpl.h | 4 ++--
> 4 files changed, 14 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 867fc82f1de5..c86a2beee92a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -432,7 +432,6 @@ struct kvm_mmu {
> gpa_t root_pgd;
> union kvm_mmu_role cpu_role;
> union kvm_mmu_page_role mmu_role;
> - u8 root_level;
> bool direct_map;
> struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4d1fa87718f8..5a6541d6a424 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2146,7 +2146,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
> iterator->level = vcpu->arch.mmu->mmu_role.level;
>
> if (iterator->level >= PT64_ROOT_4LEVEL &&
> - vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
> + vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
> !vcpu->arch.mmu->direct_map)
> iterator->level = PT32E_ROOT_LEVEL;
>
> @@ -3255,7 +3255,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>
> if (free_active_root) {
> if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
> - (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> + (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
> } else if (mmu->pae_root) {
> for (i = 0; i < 4; ++i) {
> @@ -3453,7 +3453,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> * On SVM, reading PDPTRs might access guest memory, which might fault
> * and thus might sleep. Grab the PDPTRs before acquiring mmu_lock.
> */
> - if (mmu->root_level == PT32E_ROOT_LEVEL) {
> + if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
> for (i = 0; i < 4; ++i) {
> pdptrs[i] = mmu->get_pdptr(vcpu, i);
> if (!(pdptrs[i] & PT_PRESENT_MASK))
> @@ -3477,7 +3477,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> * Do we shadow a long mode page table? If so we need to
> * write-protect the guests page table root.
> */
> - if (mmu->root_level >= PT64_ROOT_4LEVEL) {
> + if (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
> root = mmu_alloc_root(vcpu, root_gfn, 0,
> mmu->mmu_role.level, false);
> mmu->root_hpa = root;
> @@ -3516,7 +3516,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> for (i = 0; i < 4; ++i) {
> WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
>
> - if (mmu->root_level == PT32E_ROOT_LEVEL) {
> + if (mmu->cpu_role.base.level == PT32E_ROOT_LEVEL) {
> if (!(pdptrs[i] & PT_PRESENT_MASK)) {
> mmu->pae_root[i] = INVALID_PAE_ROOT;
> continue;
> @@ -3558,7 +3558,7 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
> * equivalent level in the guest's NPT to shadow. Allocate the tables
> * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
> */
> - if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
> + if (mmu->direct_map || mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
> mmu->mmu_role.level < PT64_ROOT_4LEVEL)
> return 0;
>
> @@ -3655,7 +3655,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
>
> vcpu_clear_mmio_info(vcpu, MMIO_GVA_ANY);
>
> - if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
> + if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
> hpa_t root = vcpu->arch.mmu->root_hpa;
> sp = to_shadow_page(root);
>
> @@ -4146,7 +4146,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> * later if necessary.
> */
> if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
> - mmu->root_level >= PT64_ROOT_4LEVEL)
> + mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL)
> return cached_root_available(vcpu, new_pgd, new_role);
>
> return false;
> @@ -4335,7 +4335,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
> {
> __reset_rsvds_bits_mask(&context->guest_rsvd_check,
> vcpu->arch.reserved_gpa_bits,
> - context->root_level, is_efer_nx(context),
> + context->cpu_role.base.level, is_efer_nx(context),
> guest_can_use_gbpages(vcpu),
> is_cr4_pse(context),
> guest_cpuid_is_amd_or_hygon(vcpu));
> @@ -4739,7 +4739,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
> context->get_guest_pgd = get_cr3;
> context->get_pdptr = kvm_pdptr_read;
> context->inject_page_fault = kvm_inject_page_fault;
> - context->root_level = cpu_role.base.level;
>
> if (!is_cr0_pg(context))
> context->gva_to_gpa = nonpaging_gva_to_gpa;
> @@ -4769,7 +4768,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
> paging64_init_context(context);
> else
> paging32_init_context(context);
> - context->root_level = cpu_role.base.level;
>
> reset_guest_paging_metadata(vcpu, context);
> }
> @@ -4854,7 +4852,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> context->gva_to_gpa = ept_gva_to_gpa;
> context->sync_page = ept_sync_page;
> context->invlpg = ept_invlpg;
> - context->root_level = level;
> context->direct_map = false;
> update_permission_bitmask(context, true);
> context->pkru_mask = 0;
> @@ -4889,7 +4886,6 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role new_ro
> g_context->get_guest_pgd = get_cr3;
> g_context->get_pdptr = kvm_pdptr_read;
> g_context->inject_page_fault = kvm_inject_page_fault;
> - g_context->root_level = new_role.base.level;
>
> /*
> * L2 page tables are never shadowed, so there is no need to sync
> diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
> index f31fdb874f1f..eb9c59fcb957 100644
> --- a/arch/x86/kvm/mmu/mmu_audit.c
> +++ b/arch/x86/kvm/mmu/mmu_audit.c
> @@ -59,11 +59,11 @@ static void mmu_spte_walk(struct kvm_vcpu *vcpu, inspect_spte_fn fn)
> if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
> return;
>
> - if (vcpu->arch.mmu->root_level >= PT64_ROOT_4LEVEL) {
> + if (vcpu->arch.mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL) {
> hpa_t root = vcpu->arch.mmu->root_hpa;
>
> sp = to_shadow_page(root);
> - __mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->root_level);
> + __mmu_spte_walk(vcpu, sp, fn, vcpu->arch.mmu->cpu_role.base.level);
> return;
> }
>
> @@ -119,7 +119,7 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
> hpa = pfn << PAGE_SHIFT;
> if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
> audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
> - "ent %llxn", vcpu->arch.mmu->root_level, pfn,
> + "ent %llxn", vcpu->arch.mmu->cpu_role.base.level, pfn,
> hpa, *sptep);
> }
>
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 847c4339e4d9..dd0b6f83171f 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -361,7 +361,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>
> trace_kvm_mmu_pagetable_walk(addr, access);
> retry_walk:
> - walker->level = mmu->root_level;
> + walker->level = mmu->cpu_role.base.level;
> pte = mmu->get_guest_pgd(vcpu);
> have_ad = PT_HAVE_ACCESSED_DIRTY(mmu);
>
> @@ -656,7 +656,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> WARN_ON_ONCE(gw->gfn != base_gfn);
> direct_access = gw->pte_access;
>
> - top_level = vcpu->arch.mmu->root_level;
> + top_level = vcpu->arch.mmu->cpu_role.base.level;
> if (top_level == PT32E_ROOT_LEVEL)
> top_level = PT32_ROOT_LEVEL;
> /*
> --
> 2.31.1
>
>
On Mon, Feb 7, 2022 at 3:27 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Feb 07, 2022, David Matlack wrote:
> > On Fri, Feb 04, 2022 at 06:56:55AM -0500, Paolo Bonzini wrote:
> > > The TDP MMU has a performance regression compared to the legacy
> > > MMU when CR0 changes often. This was reported for the grsecurity
> > > kernel, which uses CR0.WP to implement kernel W^X. In that case,
> > > each change to CR0.WP unloads the MMU and causes a lot of unnecessary
> > > work. When running nested, this can even cause the L1 to hardly
> > > make progress, as the L0 hypervisor is overwhelmed by the amount
> > > of MMU work that is needed.
> > >
> > > The root cause of the issue is that the "MMU role" in KVM is a mess
> > > that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
> > > and the shadow page table format. Whenever something is different
> > > between the MMU and the CPU, it is stored as an extra field in struct
> > > kvm_mmu---and for extra bonus complication, sometimes the same thing
> > > is stored in both the role and an extra field.
> > >
> > > So, this is the "no functional change intended" part of the changes
> > > required to fix the performance regression. It separates neatly
> > > the shadow page table format ("MMU role") from the guest page table
> > > format ("CPU role"), and removes the duplicate fields.
> >
> > What do you think about calling this the guest_role instead of cpu_role?
> > There is a bit of a precedent for using "guest" instead of "cpu" already
> > for this type of concept (e.g. guest_walker), and I find it more
> > intuitive.
>
> Haven't looked at the series yet, but I'd prefer not to use guest_role, it's
> too similar to is_guest_mode() and kvm_mmu_role.guest_mode. E.g. we'd end up with
>
> static union kvm_mmu_role kvm_calc_guest_role(struct kvm_vcpu *vcpu,
> const struct kvm_mmu_role_regs *regs)
> {
> union kvm_mmu_role role = {0};
>
> role.base.access = ACC_ALL;
> role.base.smm = is_smm(vcpu);
> role.base.guest_mode = is_guest_mode(vcpu);
> role.base.direct = !____is_cr0_pg(regs);
>
> ...
> }
>
> and possibly
>
> if (guest_role.guest_mode)
> ...
>
> which would be quite messy. Maybe vcpu_role if cpu_role isn't intuitive?
I agree it's a little odd. But actually it's somewhat intuitive (the
guest is in guest-mode, i.e. we're running a nested guest).
Ok I'm stretching a little bit :). But if the trade-off is just that
"guest_role.guest_mode" requires a clarifying comment, while the rest of
the code gets more readable (cpu_role is used a lot more than
role.guest_mode), it still might be worth it.
Do not lead init_kvm_*mmu into the temptation of poking
into struct kvm_mmu_role_regs, by passing the CPU role to it
directly.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 01027da82e23..6f9d876ce429 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4721,11 +4721,9 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
return role;
}
-static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
- const struct kvm_mmu_role_regs *regs)
+static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
- union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
union kvm_mmu_page_role mmu_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
@@ -4779,10 +4777,9 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
}
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
- const struct kvm_mmu_role_regs *regs)
+ union kvm_mmu_role cpu_role)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
- union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
union kvm_mmu_page_role mmu_role;
mmu_role = cpu_role.base;
@@ -4874,20 +4871,19 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
- const struct kvm_mmu_role_regs *regs)
+ union kvm_mmu_role cpu_role)
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
- kvm_init_shadow_mmu(vcpu, regs);
+ kvm_init_shadow_mmu(vcpu, cpu_role);
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
}
-static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
+static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role new_role)
{
- union kvm_mmu_role new_role = kvm_calc_cpu_role(vcpu, regs);
struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
if (new_role.as_u64 == g_context->cpu_role.as_u64)
@@ -4928,13 +4924,14 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, const struct kvm_mmu_role
void kvm_init_mmu(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
+ union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
if (mmu_is_nested(vcpu))
- init_kvm_nested_mmu(vcpu, &regs);
+ init_kvm_nested_mmu(vcpu, cpu_role);
else if (tdp_enabled)
- init_kvm_tdp_mmu(vcpu, &regs);
+ init_kvm_tdp_mmu(vcpu, cpu_role);
else
- init_kvm_softmmu(vcpu, &regs);
+ init_kvm_softmmu(vcpu, cpu_role);
}
EXPORT_SYMBOL_GPL(kvm_init_mmu);
--
2.31.1
On Mon, Feb 07, 2022 at 10:10:39PM +0000, David Matlack wrote:
> On Fri, Feb 04, 2022 at 06:57:06AM -0500, Paolo Bonzini wrote:
> > The root_level can be found in the cpu_role (in fact the field
> > is superfluous and could be removed, but one thing at a time).
> > Since there is only one usage left of role_regs_to_root_level,
> > inline it into kvm_calc_cpu_role.
> >
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 23 ++++++++---------------
> > 1 file changed, 8 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index f98444e1d834..74789295f922 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -253,19 +253,6 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
> > return regs;
> > }
> >
> > -static int role_regs_to_root_level(const struct kvm_mmu_role_regs *regs)
> > -{
> > - if (!____is_cr0_pg(regs))
> > - return 0;
> > - else if (____is_efer_lma(regs))
> > - return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL :
> > - PT64_ROOT_4LEVEL;
> > - else if (____is_cr4_pae(regs))
> > - return PT32E_ROOT_LEVEL;
> > - else
> > - return PT32_ROOT_LEVEL;
> > -}
> > -
> > static inline bool kvm_available_flush_tlb_with_range(void)
> > {
> > return kvm_x86_ops.tlb_remote_flush_with_range;
> > @@ -4673,7 +4660,13 @@ kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
> > role.base.smep_andnot_wp = ____is_cr4_smep(regs) && !____is_cr0_wp(regs);
> > role.base.smap_andnot_wp = ____is_cr4_smap(regs) && !____is_cr0_wp(regs);
> > role.base.has_4_byte_gpte = !____is_cr4_pae(regs);
> > - role.base.level = role_regs_to_root_level(regs);
> > +
> > + if (____is_efer_lma(regs))
> > + role.base.level = ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
> > + else if (____is_cr4_pae(regs))
> > + role.base.level = PT32E_ROOT_LEVEL;
> > + else
> > + role.base.level = PT32_ROOT_LEVEL;
>
> Did you mean to drop the !CR0.PG case?
Oh never mind I see the !CR0.PG case is handled above.
Reviewed-by: David Matlack <[email protected]>
>
> >
> > role.ext.cr0_pg = 1;
> > role.ext.cr4_pae = ____is_cr4_pae(regs);
> > @@ -4766,7 +4759,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
> > context->get_guest_pgd = get_cr3;
> > context->get_pdptr = kvm_pdptr_read;
> > context->inject_page_fault = kvm_inject_page_fault;
> > - context->root_level = role_regs_to_root_level(regs);
> > + context->root_level = cpu_role.base.level;
> >
> > if (!is_cr0_pg(context))
> > context->gva_to_gpa = nonpaging_gva_to_gpa;
> > --
> > 2.31.1
> >
> >
On Fri, Feb 04, 2022 at 06:57:15AM -0500, Paolo Bonzini wrote:
> Do not lead init_kvm_*mmu into the temptation of poking
> into struct kvm_mmu_role_regs, by passing the CPU role to it
> directly.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 21 +++++++++------------
> 1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 01027da82e23..6f9d876ce429 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4721,11 +4721,9 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
> return role;
> }
>
> -static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
> - const struct kvm_mmu_role_regs *regs)
> +static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
> {
> struct kvm_mmu *context = &vcpu->arch.root_mmu;
> - union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
> union kvm_mmu_page_role mmu_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
>
> if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
> @@ -4779,10 +4777,9 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
> }
>
> static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
> - const struct kvm_mmu_role_regs *regs)
> + union kvm_mmu_role cpu_role)
> {
> struct kvm_mmu *context = &vcpu->arch.root_mmu;
> - union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
> union kvm_mmu_page_role mmu_role;
>
> mmu_role = cpu_role.base;
> @@ -4874,20 +4871,19 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
>
> static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
> - const struct kvm_mmu_role_regs *regs)
> + union kvm_mmu_role cpu_role)
> {
> struct kvm_mmu *context = &vcpu->arch.root_mmu;
>
> - kvm_init_shadow_mmu(vcpu, regs);
> + kvm_init_shadow_mmu(vcpu, cpu_role);
>
> context->get_guest_pgd = get_cr3;
> context->get_pdptr = kvm_pdptr_read;
> context->inject_page_fault = kvm_inject_page_fault;
> }
>
> -static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
> +static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role new_role)
> {
> - union kvm_mmu_role new_role = kvm_calc_cpu_role(vcpu, regs);
> struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
>
> if (new_role.as_u64 == g_context->cpu_role.as_u64)
> @@ -4928,13 +4924,14 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu, const struct kvm_mmu_role
> void kvm_init_mmu(struct kvm_vcpu *vcpu)
> {
> struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
> + union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
WDYT about also inlining vcpu_to_role_regs() in kvm_calc_cpu_role()?
>
> if (mmu_is_nested(vcpu))
> - init_kvm_nested_mmu(vcpu, &regs);
> + init_kvm_nested_mmu(vcpu, cpu_role);
> else if (tdp_enabled)
> - init_kvm_tdp_mmu(vcpu, &regs);
> + init_kvm_tdp_mmu(vcpu, cpu_role);
> else
> - init_kvm_softmmu(vcpu, &regs);
> + init_kvm_softmmu(vcpu, cpu_role);
> }
> EXPORT_SYMBOL_GPL(kvm_init_mmu);
>
> --
> 2.31.1
>
>
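For concreteness, a rough sketch of what the suggested inlining could look like (hypothetical, not part of the posted series; the helper and ____is_* accessors are the ones used in the patch above):

static union kvm_mmu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
	union kvm_mmu_role role = {0};

	role.base.access = ACC_ALL;
	role.base.smm = is_smm(vcpu);
	role.base.guest_mode = is_guest_mode(vcpu);
	/* ... all remaining CR0/CR4/EFER-derived bits, read via &regs ... */

	return role;
}

kvm_init_mmu() could then presumably drop its local kvm_mmu_role_regs entirely, since after this patch the snapshot's only remaining user there is kvm_calc_cpu_role().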
On Sat, Feb 05, 2022, Paolo Bonzini wrote:
> On 2/4/22 18:59, David Matlack wrote:
> > > + reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
> >
> > Out of curiousity, how does KVM mitigate iTLB multi-hit when shadowing
> > NPT and the guest has not enabled EFER.NX?
>
> You got me worried for a second but iTLB multihit is Intel-only, isn't it?
AFAIK, yes, big Core only. arch/x86/kernel/cpu/common.c sets NO_ITLB_MULTIHIT
for all AMD, Hygon, and Atom CPUs.
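As a hedged aside (my illustration, not code from this thread): the whitelist works by never setting the synthetic bug bit on those parts, so a runtime check for whether the mitigation is even relevant reduces to something like:

	/* Illustrative only: true on affected big-Core Intel parts, false
	 * wherever cpu/common.c applied NO_ITLB_MULTIHIT (AMD, Hygon, Atom). */
	if (boot_cpu_has_bug(X86_BUG_ITLB_MULTIHIT))
		pr_info("iTLB multi-hit mitigation is relevant on this CPU\n");

(KVM's nx_huge_pages default keys off that same bug bit.)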
On Fri, Feb 04, 2022 at 06:57:06AM -0500, Paolo Bonzini wrote:
> The root_level can be found in the cpu_role (in fact the field
> is superfluous and could be removed, but one thing at a time).
> Since there is only one usage left of role_regs_to_root_level,
> inline it into kvm_calc_cpu_role.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 23 ++++++++---------------
> 1 file changed, 8 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f98444e1d834..74789295f922 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -253,19 +253,6 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
> return regs;
> }
>
> -static int role_regs_to_root_level(const struct kvm_mmu_role_regs *regs)
> -{
> - if (!____is_cr0_pg(regs))
> - return 0;
> - else if (____is_efer_lma(regs))
> - return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL :
> - PT64_ROOT_4LEVEL;
> - else if (____is_cr4_pae(regs))
> - return PT32E_ROOT_LEVEL;
> - else
> - return PT32_ROOT_LEVEL;
> -}
> -
> static inline bool kvm_available_flush_tlb_with_range(void)
> {
> return kvm_x86_ops.tlb_remote_flush_with_range;
> @@ -4673,7 +4660,13 @@ kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
> role.base.smep_andnot_wp = ____is_cr4_smep(regs) && !____is_cr0_wp(regs);
> role.base.smap_andnot_wp = ____is_cr4_smap(regs) && !____is_cr0_wp(regs);
> role.base.has_4_byte_gpte = !____is_cr4_pae(regs);
> - role.base.level = role_regs_to_root_level(regs);
> +
> + if (____is_efer_lma(regs))
> + role.base.level = ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL;
> + else if (____is_cr4_pae(regs))
> + role.base.level = PT32E_ROOT_LEVEL;
> + else
> + role.base.level = PT32_ROOT_LEVEL;
Did you mean to drop the !CR0.PG case?
>
> role.ext.cr0_pg = 1;
> role.ext.cr4_pae = ____is_cr4_pae(regs);
> @@ -4766,7 +4759,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
> context->get_guest_pgd = get_cr3;
> context->get_pdptr = kvm_pdptr_read;
> context->inject_page_fault = kvm_inject_page_fault;
> - context->root_level = role_regs_to_root_level(regs);
> + context->root_level = cpu_role.base.level;
>
> if (!is_cr0_pg(context))
> context->gva_to_gpa = nonpaging_gva_to_gpa;
> --
> 2.31.1
>
>
On Mon, Feb 07, 2022, David Matlack wrote:
> On Fri, Feb 04, 2022 at 06:56:55AM -0500, Paolo Bonzini wrote:
> > The TDP MMU has a performance regression compared to the legacy
> > MMU when CR0 changes often. This was reported for the grsecurity
> > kernel, which uses CR0.WP to implement kernel W^X. In that case,
> > each change to CR0.WP unloads the MMU and causes a lot of unnecessary
> > work. When running nested, this can even cause the L1 to hardly
> > make progress, as the L0 hypervisor is overwhelmed by the amount
> > of MMU work that is needed.
> >
> > The root cause of the issue is that the "MMU role" in KVM is a mess
> > that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
> > and the shadow page table format. Whenever something is different
> > between the MMU and the CPU, it is stored as an extra field in struct
> > kvm_mmu---and for extra bonus complication, sometimes the same thing
> > is stored in both the role and an extra field.
> >
> > So, this is the "no functional change intended" part of the changes
> > required to fix the performance regression. It separates neatly
> > the shadow page table format ("MMU role") from the guest page table
> > format ("CPU role"), and removes the duplicate fields.
>
> What do you think about calling this the guest_role instead of cpu_role?
> There is a bit of a precedent for using "guest" instead of "cpu" already
> for this type of concept (e.g. guest_walker), and I find it more
> intuitive.
Haven't looked at the series yet, but I'd prefer not to use guest_role, it's
too similar to is_guest_mode() and kvm_mmu_role.guest_mode. E.g. we'd end up with
static union kvm_mmu_role kvm_calc_guest_role(struct kvm_vcpu *vcpu,
const struct kvm_mmu_role_regs *regs)
{
union kvm_mmu_role role = {0};
role.base.access = ACC_ALL;
role.base.smm = is_smm(vcpu);
role.base.guest_mode = is_guest_mode(vcpu);
role.base.direct = !____is_cr0_pg(regs);
...
}
and possibly
if (guest_role.guest_mode)
...
which would be quite messy. Maybe vcpu_role if cpu_role isn't intuitive?
Inline kvm_calc_mmu_role_common into its sole caller, and simplify it
by removing the computation of unnecessary bits.
Extended bits are unnecessary because page walking uses the CPU role,
and EFER.NX/CR0.WP can be set to one unconditionally---matching the
format of shadow pages rather than the format of guest pages.
The MMU role for two-dimensional paging does still depend on the CPU role,
even if only barely so, due to SMM and guest mode; for consistency,
pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying
the vcpu with is_smm or is_guest_mode.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 39 ++++++++-------------------------------
1 file changed, 8 insertions(+), 31 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 19abf1e4cee9..1650fc291284 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4683,33 +4683,6 @@ kvm_calc_cpu_role(struct kvm_vcpu *vcpu, const struct kvm_mmu_role_regs *regs)
return role;
}
-static union kvm_mmu_role kvm_calc_mmu_role_common(struct kvm_vcpu *vcpu,
- const struct kvm_mmu_role_regs *regs)
-{
- union kvm_mmu_role role = {0};
-
- role.base.access = ACC_ALL;
- if (____is_cr0_pg(regs)) {
- role.ext.cr0_pg = 1;
- role.base.efer_nx = ____is_efer_nx(regs);
- role.base.cr0_wp = ____is_cr0_wp(regs);
-
- role.ext.cr4_pae = ____is_cr4_pae(regs);
- role.ext.cr4_smep = ____is_cr4_smep(regs);
- role.ext.cr4_smap = ____is_cr4_smap(regs);
- role.ext.cr4_pse = ____is_cr4_pse(regs);
-
- /* PKEY and LA57 are active iff long mode is active. */
- role.ext.cr4_pke = ____is_efer_lma(regs) && ____is_cr4_pke(regs);
- role.ext.cr4_la57 = ____is_efer_lma(regs) && ____is_cr4_la57(regs);
- role.ext.efer_lma = ____is_efer_lma(regs);
- }
- role.base.smm = is_smm(vcpu);
- role.base.guest_mode = is_guest_mode(vcpu);
-
- return role;
-}
-
static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
{
/* tdp_root_level is architecture forced level, use it if nonzero */
@@ -4725,10 +4698,15 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
static union kvm_mmu_role
kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
- const struct kvm_mmu_role_regs *regs)
+ union kvm_mmu_role cpu_role)
{
- union kvm_mmu_role role = kvm_calc_mmu_role_common(vcpu, regs);
+ union kvm_mmu_role role = {0};
+ role.base.access = ACC_ALL;
+ role.base.cr0_wp = true;
+ role.base.efer_nx = true;
+ role.base.smm = cpu_role.base.smm;
+ role.base.guest_mode = cpu_role.base.guest_mode;
role.base.ad_disabled = (shadow_accessed_mask == 0);
role.base.level = kvm_mmu_get_tdp_level(vcpu);
role.base.direct = true;
@@ -4742,8 +4720,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
- union kvm_mmu_role mmu_role =
- kvm_calc_tdp_mmu_root_page_role(vcpu, regs);
+ union kvm_mmu_role mmu_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
mmu_role.as_u64 == context->mmu_role.as_u64)
--
2.31.1
mmu_role represents the role of the root of the page tables.
It does not need any extended bits, as those govern only KVM's
page table walking; the is_* functions used for page table
walking always use the CPU role.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++-----------------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
4 files changed, 34 insertions(+), 35 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 795b345361c8..121eefdb9991 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -433,7 +433,7 @@ struct kvm_mmu {
hpa_t root_hpa;
gpa_t root_pgd;
union kvm_mmu_role cpu_role;
- union kvm_mmu_role mmu_role;
+ union kvm_mmu_page_role mmu_role;
u8 root_level;
u8 shadow_root_level;
bool direct_map;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 817e6cc916fc..0cb46a74e561 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2045,7 +2045,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
int collisions = 0;
LIST_HEAD(invalid_list);
- role = vcpu->arch.mmu->mmu_role.base;
+ role = vcpu->arch.mmu->mmu_role;
role.level = level;
role.direct = direct;
role.access = access;
@@ -3278,7 +3278,7 @@ void kvm_mmu_free_guest_mode_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
* This should not be called while L2 is active, L2 can't invalidate
* _only_ its own roots, e.g. INVVPID unconditionally exits.
*/
- WARN_ON_ONCE(mmu->mmu_role.base.guest_mode);
+ WARN_ON_ONCE(mmu->mmu_role.guest_mode);
for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
root_hpa = mmu->prev_roots[i].hpa;
@@ -4146,7 +4146,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
{
- union kvm_mmu_page_role new_role = vcpu->arch.mmu->mmu_role.base;
+ union kvm_mmu_page_role new_role = vcpu->arch.mmu->mmu_role;
if (!fast_pgd_switch(vcpu, new_pgd, new_role)) {
kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
return;
@@ -4696,21 +4696,21 @@ static inline int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
return max_tdp_level;
}
-static union kvm_mmu_role
+static union kvm_mmu_page_role
kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu,
union kvm_mmu_role cpu_role)
{
- union kvm_mmu_role role = {0};
+ union kvm_mmu_page_role role = {0};
- role.base.access = ACC_ALL;
- role.base.cr0_wp = true;
- role.base.efer_nx = true;
- role.base.smm = cpu_role.base.smm;
- role.base.guest_mode = cpu_role.base.guest_mode;
- role.base.ad_disabled = (shadow_accessed_mask == 0);
- role.base.level = kvm_mmu_get_tdp_level(vcpu);
- role.base.direct = true;
- role.base.has_4_byte_gpte = false;
+ role.access = ACC_ALL;
+ role.cr0_wp = true;
+ role.efer_nx = true;
+ role.smm = cpu_role.base.smm;
+ role.guest_mode = cpu_role.base.guest_mode;
+ role.ad_disabled = (shadow_accessed_mask == 0);
+ role.level = kvm_mmu_get_tdp_level(vcpu);
+ role.direct = true;
+ role.has_4_byte_gpte = false;
return role;
}
@@ -4720,14 +4720,14 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
- union kvm_mmu_role mmu_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
+ union kvm_mmu_page_role mmu_role = kvm_calc_tdp_mmu_root_page_role(vcpu, cpu_role);
if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
- mmu_role.as_u64 == context->mmu_role.as_u64)
+ mmu_role.word == context->mmu_role.word)
return;
context->cpu_role.as_u64 = cpu_role.as_u64;
- context->mmu_role.as_u64 = mmu_role.as_u64;
+ context->mmu_role.word = mmu_role.word;
context->page_fault = kvm_tdp_page_fault;
context->sync_page = nonpaging_sync_page;
context->invlpg = NULL;
@@ -4749,7 +4749,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
reset_tdp_shadow_zero_bits_mask(context);
}
-static union kvm_mmu_role
+static union kvm_mmu_page_role
kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
union kvm_mmu_role role)
{
@@ -4760,19 +4760,19 @@ kvm_calc_shadow_mmu_root_page_role(struct kvm_vcpu *vcpu,
else
role.base.level = PT64_ROOT_4LEVEL;
- return role;
+ return role.base;
}
static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
union kvm_mmu_role cpu_role,
- union kvm_mmu_role mmu_role)
+ union kvm_mmu_page_role mmu_role)
{
if (cpu_role.as_u64 == context->cpu_role.as_u64 &&
- mmu_role.as_u64 == context->mmu_role.as_u64)
+ mmu_role.word == context->mmu_role.word)
return;
context->cpu_role.as_u64 = cpu_role.as_u64;
- context->mmu_role.as_u64 = mmu_role.as_u64;
+ context->mmu_role.word = mmu_role.word;
if (!is_cr0_pg(context))
nonpaging_init_context(context);
@@ -4783,7 +4783,7 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
context->root_level = cpu_role.base.level;
reset_guest_paging_metadata(vcpu, context);
- context->shadow_root_level = mmu_role.base.level;
+ context->shadow_root_level = mmu_role.level;
}
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
@@ -4791,7 +4791,7 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
{
struct kvm_mmu *context = &vcpu->arch.root_mmu;
union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, regs);
- union kvm_mmu_role mmu_role =
+ union kvm_mmu_page_role mmu_role =
kvm_calc_shadow_mmu_root_page_role(vcpu, cpu_role);
shadow_mmu_init_context(vcpu, context, cpu_role, mmu_role);
@@ -4807,13 +4807,12 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
reset_shadow_zero_bits_mask(vcpu, context, true);
}
-static union kvm_mmu_role
+static union kvm_mmu_page_role
kvm_calc_shadow_npt_root_page_role(struct kvm_vcpu *vcpu,
union kvm_mmu_role role)
{
role.base.level = kvm_mmu_get_tdp_level(vcpu);
-
- return role;
+ return role.base;
}
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
@@ -4826,7 +4825,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
.efer = efer,
};
> union kvm_mmu_role cpu_role = kvm_calc_cpu_role(vcpu, &regs);
- union kvm_mmu_role mmu_role = kvm_calc_shadow_npt_root_page_role(vcpu, cpu_role);
+ union kvm_mmu_page_role mmu_role = kvm_calc_shadow_npt_root_page_role(vcpu, cpu_role);
shadow_mmu_init_context(vcpu, context, cpu_role, mmu_role);
reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
@@ -4866,7 +4865,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
if (new_role.as_u64 != context->cpu_role.as_u64) {
/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
context->cpu_role.as_u64 = new_role.as_u64;
- context->mmu_role.as_u64 = new_role.as_u64;
+ context->mmu_role.word = new_role.base.word;
context->shadow_root_level = level;
@@ -4968,9 +4967,9 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.root_mmu.cpu_role.base.level = 0;
vcpu->arch.guest_mmu.cpu_role.base.level = 0;
vcpu->arch.nested_mmu.cpu_role.base.level = 0;
- vcpu->arch.root_mmu.mmu_role.base.level = 0;
- vcpu->arch.guest_mmu.mmu_role.base.level = 0;
- vcpu->arch.nested_mmu.mmu_role.base.level = 0;
+ vcpu->arch.root_mmu.mmu_role.level = 0;
+ vcpu->arch.guest_mmu.mmu_role.level = 0;
+ vcpu->arch.nested_mmu.mmu_role.level = 0;
kvm_mmu_reset_context(vcpu);
/*
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 1b5c7d03f94b..847c4339e4d9 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1025,7 +1025,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
*/
static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
{
- union kvm_mmu_page_role mmu_role = vcpu->arch.mmu->mmu_role.base;
+ union kvm_mmu_page_role mmu_role = vcpu->arch.mmu->mmu_role;
int i;
bool host_writable;
gpa_t first_pte_gpa;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8def8f810cb0..dd4c78833016 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -209,7 +209,7 @@ static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
{
- union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role.base;
+ union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role;
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_page *root;
--
2.31.1
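A minimal sketch (mine, not taken from the series) of why the extended bits are dead weight in mmu_role: shadow pages are keyed on the 32-bit union kvm_mmu_page_role alone, so only the base bits ever reach comparisons like the one in kvm_mmu_get_page(). The fragment below reuses that function's locals (level, direct, access, sp, sp_list):

	/* Modeled loosely on kvm_mmu_get_page(); the CPU role's ext bits
	 * never reach this word comparison. */
	union kvm_mmu_page_role role = vcpu->arch.mmu->mmu_role;

	role.level = level;
	role.direct = direct;
	role.access = access;

	for_each_valid_sp(vcpu->kvm, sp, sp_list) {
		if (sp->role.word != role.word)
			continue;
		/* an existing shadow page with a matching format was found */
	}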
On Fri, Feb 04, 2022 at 06:57:18AM -0500, Paolo Bonzini wrote:
> direct_map is always equal to the role's direct field:
>
> - for shadow paging, direct_map is true if CR0.PG=0 and mmu_role.direct is
> copied from cpu_role.base.direct
>
> - for TDP, it is always true and mmu_role.direct is also always true
>
> - for shadow EPT, it is always false and mmu_role.direct is also always
> false
>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: David Matlack <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++----------------
> arch/x86/kvm/x86.c | 12 ++++++------
> 4 files changed, 21 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c86a2beee92a..647b3f6d02d0 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -432,7 +432,6 @@ struct kvm_mmu {
> gpa_t root_pgd;
> union kvm_mmu_role cpu_role;
> union kvm_mmu_page_role mmu_role;
> - bool direct_map;
> struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
>
> /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5a6541d6a424..ce55fad99671 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2045,7 +2045,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
> int direct,
> unsigned int access)
> {
> - bool direct_mmu = vcpu->arch.mmu->direct_map;
> + bool direct_mmu = vcpu->arch.mmu->mmu_role.direct;
> union kvm_mmu_page_role role;
> struct hlist_head *sp_list;
> unsigned quadrant;
> @@ -2147,7 +2147,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
>
> if (iterator->level >= PT64_ROOT_4LEVEL &&
> vcpu->arch.mmu->cpu_role.base.level < PT64_ROOT_4LEVEL &&
> - !vcpu->arch.mmu->direct_map)
> + !vcpu->arch.mmu->mmu_role.direct)
> iterator->level = PT32E_ROOT_LEVEL;
>
> if (iterator->level == PT32E_ROOT_LEVEL) {
> @@ -2523,7 +2523,7 @@ static int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva)
> gpa_t gpa;
> int r;
>
> - if (vcpu->arch.mmu->direct_map)
> + if (vcpu->arch.mmu->mmu_role.direct)
> return 0;
>
> gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL);
> @@ -3255,7 +3255,8 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>
> if (free_active_root) {
> if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
> - (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> + (mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
> + mmu->mmu_role.direct)) {
> mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
> } else if (mmu->pae_root) {
> for (i = 0; i < 4; ++i) {
> @@ -3558,7 +3559,8 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
> * equivalent level in the guest's NPT to shadow. Allocate the tables
> * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
> */
> - if (mmu->direct_map || mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
> + if (mmu->mmu_role.direct ||
> + mmu->cpu_role.base.level >= PT64_ROOT_4LEVEL ||
> mmu->mmu_role.level < PT64_ROOT_4LEVEL)
> return 0;
>
> @@ -3647,7 +3649,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
> int i;
> struct kvm_mmu_page *sp;
>
> - if (vcpu->arch.mmu->direct_map)
> + if (vcpu->arch.mmu->mmu_role.direct)
> return;
>
> if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
> @@ -3872,7 +3874,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>
> arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
> arch.gfn = gfn;
> - arch.direct_map = vcpu->arch.mmu->direct_map;
> + arch.direct_map = vcpu->arch.mmu->mmu_role.direct;
> arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
>
> return kvm_setup_async_pf(vcpu, cr2_or_gpa,
> @@ -4090,7 +4092,6 @@ static void nonpaging_init_context(struct kvm_mmu *context)
> context->gva_to_gpa = nonpaging_gva_to_gpa;
> context->sync_page = nonpaging_sync_page;
> context->invlpg = NULL;
> - context->direct_map = true;
> }
>
> static inline bool is_root_usable(struct kvm_mmu_root_info *root, gpa_t pgd,
> @@ -4641,7 +4642,6 @@ static void paging64_init_context(struct kvm_mmu *context)
> context->gva_to_gpa = paging64_gva_to_gpa;
> context->sync_page = paging64_sync_page;
> context->invlpg = paging64_invlpg;
> - context->direct_map = false;
> }
>
> static void paging32_init_context(struct kvm_mmu *context)
> @@ -4650,7 +4650,6 @@ static void paging32_init_context(struct kvm_mmu *context)
> context->gva_to_gpa = paging32_gva_to_gpa;
> context->sync_page = paging32_sync_page;
> context->invlpg = paging32_invlpg;
> - context->direct_map = false;
> }
>
> static union kvm_mmu_role
> @@ -4735,7 +4734,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
> context->page_fault = kvm_tdp_page_fault;
> context->sync_page = nonpaging_sync_page;
> context->invlpg = NULL;
> - context->direct_map = true;
> context->get_guest_pgd = get_cr3;
> context->get_pdptr = kvm_pdptr_read;
> context->inject_page_fault = kvm_inject_page_fault;
> @@ -4852,7 +4850,7 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> context->gva_to_gpa = ept_gva_to_gpa;
> context->sync_page = ept_sync_page;
> context->invlpg = ept_invlpg;
> - context->direct_map = false;
> +
> update_permission_bitmask(context, true);
> context->pkru_mask = 0;
> reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
> @@ -4967,13 +4965,13 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> {
> int r;
>
> - r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
> + r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->mmu_role.direct);
> if (r)
> goto out;
> r = mmu_alloc_special_roots(vcpu);
> if (r)
> goto out;
> - if (vcpu->arch.mmu->direct_map)
> + if (vcpu->arch.mmu->mmu_role.direct)
> r = mmu_alloc_direct_roots(vcpu);
> else
> r = mmu_alloc_shadow_roots(vcpu);
> @@ -5176,7 +5174,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> void *insn, int insn_len)
> {
> int r, emulation_type = EMULTYPE_PF;
> - bool direct = vcpu->arch.mmu->direct_map;
> + bool direct = vcpu->arch.mmu->mmu_role.direct;
>
> if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
> return RET_PF_RETRY;
> @@ -5207,7 +5205,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> * paging in both guests. If true, we simply unprotect the page
> * and resume the guest.
> */
> - if (vcpu->arch.mmu->direct_map &&
> + if (vcpu->arch.mmu->mmu_role.direct &&
> (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
> kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
> return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 657aa646871e..b910fa34e57e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7978,7 +7978,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF)))
> return false;
>
> - if (!vcpu->arch.mmu->direct_map) {
> + if (!vcpu->arch.mmu->mmu_role.direct) {
> /*
> * Write permission should be allowed since only
> * write access need to be emulated.
> @@ -8011,7 +8011,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> kvm_release_pfn_clean(pfn);
>
> /* The instructions are well-emulated on direct mmu. */
> - if (vcpu->arch.mmu->direct_map) {
> + if (vcpu->arch.mmu->mmu_role.direct) {
> unsigned int indirect_shadow_pages;
>
> write_lock(&vcpu->kvm->mmu_lock);
> @@ -8079,7 +8079,7 @@ static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
> vcpu->arch.last_retry_eip = ctxt->eip;
> vcpu->arch.last_retry_addr = cr2_or_gpa;
>
> - if (!vcpu->arch.mmu->direct_map)
> + if (!vcpu->arch.mmu->mmu_role.direct)
> gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2_or_gpa, NULL);
>
> kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> @@ -8359,7 +8359,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> ctxt->exception.address = cr2_or_gpa;
>
> /* With shadow page tables, cr2 contains a GVA or nGPA. */
> - if (vcpu->arch.mmu->direct_map) {
> + if (vcpu->arch.mmu->mmu_role.direct) {
> ctxt->gpa_available = true;
> ctxt->gpa_val = cr2_or_gpa;
> }
> @@ -12196,7 +12196,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> {
> int r;
>
> - if ((vcpu->arch.mmu->direct_map != work->arch.direct_map) ||
> + if ((vcpu->arch.mmu->mmu_role.direct != work->arch.direct_map) ||
> work->wakeup_all)
> return;
>
> @@ -12204,7 +12204,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
> if (unlikely(r))
> return;
>
> - if (!vcpu->arch.mmu->direct_map &&
> + if (!vcpu->arch.mmu->mmu_role.direct &&
> work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
> return;
>
> --
> 2.31.1
>
On Fri, Feb 04, 2022 at 06:56:55AM -0500, Paolo Bonzini wrote:
> The TDP MMU has a performance regression compared to the legacy
> MMU when CR0 changes often. This was reported for the grsecurity
> kernel, which uses CR0.WP to implement kernel W^X. In that case,
> each change to CR0.WP unloads the MMU and causes a lot of unnecessary
> work. When running nested, this can even cause the L1 to hardly
> make progress, as the L0 hypervisor is overwhelmed by the amount
> of MMU work that is needed.
>
> The root cause of the issue is that the "MMU role" in KVM is a mess
> that mixes the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.)
> and the shadow page table format. Whenever something is different
> between the MMU and the CPU, it is stored as an extra field in struct
> kvm_mmu---and for extra bonus complication, sometimes the same thing
> is stored in both the role and an extra field.
>
> So, this is the "no functional change intended" part of the changes
> required to fix the performance regression. It separates neatly
> the shadow page table format ("MMU role") from the guest page table
> format ("CPU role"), and removes the duplicate fields.
What do you think about calling this the guest_role instead of cpu_role?
There is a bit of a precedent for using "guest" instead of "cpu" already
for this type of concept (e.g. guest_walker), and I find it more
intuitive.
> The next
> step then is to avoid unloading the MMU as long as the MMU role
> stays the same.
>
> Please review!
>
> Paolo
>
> Paolo Bonzini (23):
> KVM: MMU: pass uses_nx directly to reset_shadow_zero_bits_mask
> KVM: MMU: nested EPT cannot be used in SMM
> KVM: MMU: remove valid from extended role
> KVM: MMU: constify uses of struct kvm_mmu_role_regs
> KVM: MMU: pull computation of kvm_mmu_role_regs to kvm_init_mmu
> KVM: MMU: load new PGD once nested two-dimensional paging is
> initialized
> KVM: MMU: remove kvm_mmu_calc_root_page_role
> KVM: MMU: rephrase unclear comment
> KVM: MMU: remove "bool base_only" arguments
> KVM: MMU: split cpu_role from mmu_role
> KVM: MMU: do not recompute root level from kvm_mmu_role_regs
> KVM: MMU: remove ept_ad field
> KVM: MMU: remove kvm_calc_shadow_root_page_role_common
> KVM: MMU: cleanup computation of MMU roles for two-dimensional paging
> KVM: MMU: cleanup computation of MMU roles for shadow paging
> KVM: MMU: remove extended bits from mmu_role
> KVM: MMU: remove redundant bits from extended role
> KVM: MMU: fetch shadow EFER.NX from MMU role
> KVM: MMU: simplify and/or inline computation of shadow MMU roles
> KVM: MMU: pull CPU role computation to kvm_init_mmu
> KVM: MMU: store shadow_root_level into mmu_role
> KVM: MMU: use cpu_role for root_level
> KVM: MMU: replace direct_map with mmu_role.direct
>
> arch/x86/include/asm/kvm_host.h | 13 +-
> arch/x86/kvm/mmu.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 408 ++++++++++++--------------------
> arch/x86/kvm/mmu/mmu_audit.c | 6 +-
> arch/x86/kvm/mmu/paging_tmpl.h | 12 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 4 +-
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/x86.c | 12 +-
> 10 files changed, 178 insertions(+), 284 deletions(-)
>
> --
> 2.31.1
>
mmu_role.level is always the same value as shadow_root_level:
- kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
- the level argument when going through kvm_init_shadow_ept_mmu
- it's assigned directly from new_role.base.level when going
through shadow_mmu_init_context
Remove the duplication and get the level directly from the role.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 -
arch/x86/kvm/mmu.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 36 +++++++++++++++------------------
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/svm/svm.c | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
6 files changed, 20 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b0085c54786c..867fc82f1de5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -433,7 +433,6 @@ struct kvm_mmu {
union kvm_mmu_role cpu_role;
union kvm_mmu_page_role mmu_role;
u8 root_level;
- u8 shadow_root_level;
bool direct_map;
struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 51faa2c76ca5..43b99308cb0e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -112,7 +112,7 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
return;
static_call(kvm_x86_load_mmu_pgd)(vcpu, root_hpa,
- vcpu->arch.mmu->shadow_root_level);
+ vcpu->arch.mmu->mmu_role.level);
}
struct kvm_page_fault {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6f9d876ce429..4d1fa87718f8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2143,7 +2143,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
{
iterator->addr = addr;
iterator->shadow_addr = root;
- iterator->level = vcpu->arch.mmu->shadow_root_level;
+ iterator->level = vcpu->arch.mmu->mmu_role.level;
if (iterator->level >= PT64_ROOT_4LEVEL &&
vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
@@ -3254,7 +3254,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
&invalid_list);
if (free_active_root) {
- if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
+ if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
(mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
} else if (mmu->pae_root) {
@@ -3329,7 +3329,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- u8 shadow_root_level = mmu->shadow_root_level;
+ u8 shadow_root_level = mmu->mmu_role.level;
hpa_t root;
unsigned i;
int r;
@@ -3479,7 +3479,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
*/
if (mmu->root_level >= PT64_ROOT_4LEVEL) {
root = mmu_alloc_root(vcpu, root_gfn, 0,
- mmu->shadow_root_level, false);
+ mmu->mmu_role.level, false);
mmu->root_hpa = root;
goto set_root_pgd;
}
@@ -3495,7 +3495,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
* the shadow page table may be a PAE or a long mode page table.
*/
pm_mask = PT_PRESENT_MASK | shadow_me_mask;
- if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
+ if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL) {
pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
if (WARN_ON_ONCE(!mmu->pml4_root)) {
@@ -3504,7 +3504,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
- if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
+ if (mmu->mmu_role.level == PT64_ROOT_5LEVEL) {
if (WARN_ON_ONCE(!mmu->pml5_root)) {
r = -EIO;
goto out_unlock;
@@ -3529,9 +3529,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
mmu->pae_root[i] = root | pm_mask;
}
- if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
+ if (mmu->mmu_role.level == PT64_ROOT_5LEVEL)
mmu->root_hpa = __pa(mmu->pml5_root);
- else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
+ else if (mmu->mmu_role.level == PT64_ROOT_4LEVEL)
mmu->root_hpa = __pa(mmu->pml4_root);
else
mmu->root_hpa = __pa(mmu->pae_root);
@@ -3547,7 +3547,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *mmu = vcpu->arch.mmu;
- bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;
+ bool need_pml5 = mmu->mmu_role.level > PT64_ROOT_4LEVEL;
u64 *pml5_root = NULL;
u64 *pml4_root = NULL;
u64 *pae_root;
@@ -3559,7 +3559,7 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
* on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
*/
if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
- mmu->shadow_root_level < PT64_ROOT_4LEVEL)
+ mmu->mmu_role.level < PT64_ROOT_4LEVEL)
return 0;
/*
@@ -4145,7 +4145,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
* having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
* later if necessary.
*/
- if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
+ if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
mmu->root_level >= PT64_ROOT_4LEVEL)
return cached_root_available(vcpu, new_pgd, new_role);
@@ -4408,17 +4408,17 @@ static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
struct rsvd_bits_validate *shadow_zero_check;
int i;
- WARN_ON_ONCE(context->shadow_root_level < PT32E_ROOT_LEVEL);
+ WARN_ON_ONCE(context->mmu_role.level < PT32E_ROOT_LEVEL);
shadow_zero_check = &context->shadow_zero_check;
__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
- context->shadow_root_level, uses_nx,
+ context->mmu_role.level, uses_nx,
guest_can_use_gbpages(vcpu), is_pse, is_amd);
if (!shadow_me_mask)
return;
- for (i = context->shadow_root_level; --i >= 0;) {
+ for (i = context->mmu_role.level; --i >= 0;) {
shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
}
@@ -4445,7 +4445,7 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
if (boot_cpu_is_amd())
__reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
- context->shadow_root_level, false,
+ context->mmu_role.level, false,
boot_cpu_has(X86_FEATURE_GBPAGES),
false, true);
else
@@ -4456,7 +4456,7 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
if (!shadow_me_mask)
return;
- for (i = context->shadow_root_level; --i >= 0;) {
+ for (i = context->mmu_role.level; --i >= 0;) {
shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
}
@@ -4735,7 +4735,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
context->page_fault = kvm_tdp_page_fault;
context->sync_page = nonpaging_sync_page;
context->invlpg = NULL;
- context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
context->direct_map = true;
context->get_guest_pgd = get_cr3;
context->get_pdptr = kvm_pdptr_read;
@@ -4773,7 +4772,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
context->root_level = cpu_role.base.level;
reset_guest_paging_metadata(vcpu, context);
- context->shadow_root_level = mmu_role.level;
}
static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
@@ -4852,8 +4850,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
context->cpu_role.as_u64 = new_role.as_u64;
context->mmu_role.word = new_role.base.word;
- context->shadow_root_level = level;
-
context->page_fault = ept_page_fault;
context->gva_to_gpa = ept_gva_to_gpa;
context->sync_page = ept_sync_page;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dd4c78833016..9fb6d983bae9 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1697,7 +1697,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
gfn_t gfn = addr >> PAGE_SHIFT;
int leaf = -1;
- *root_level = vcpu->arch.mmu->shadow_root_level;
+ *root_level = vcpu->arch.mmu->mmu_role.level;
tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
leaf = iter.level;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7b5345a66117..5a1d552b535b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3815,7 +3815,7 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
hv_track_root_tdp(vcpu, root_hpa);
cr3 = vcpu->arch.cr3;
- } else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
+ } else if (vcpu->arch.mmu->mmu_role.level >= PT64_ROOT_4LEVEL) {
cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
} else {
/* PCID in the guest should be impossible with a 32-bit MMU. */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8ac5a6fa7720..5e2c865a04ff 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2965,7 +2965,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
if (enable_ept)
ept_sync_context(construct_eptp(vcpu, root_hpa,
- mmu->shadow_root_level));
+ mmu->mmu_role.level));
else
vpid_sync_context(vmx_get_current_vpid(vcpu));
}
--
2.31.1
On Fri, Feb 04, 2022 at 06:57:16AM -0500, Paolo Bonzini wrote:
> mmu_role.level is always the same value as shadow_root_level:
>
> - kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
>
> - the level argument when going through kvm_init_shadow_ept_mmu
>
> - it's assigned directly from new_role.base.level when going
> through shadow_mmu_init_context
>
> Remove the duplication and get the level directly from the role.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
nit: How about the following for the shortlog?
KVM: MMU: Replace shadow_root_level with mmu_role.level
Otherwise,
Reviewed-by: David Matlack <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 1 -
> arch/x86/kvm/mmu.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 36 +++++++++++++++------------------
> arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/vmx/vmx.c | 2 +-
> 6 files changed, 20 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b0085c54786c..867fc82f1de5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -433,7 +433,6 @@ struct kvm_mmu {
> union kvm_mmu_role cpu_role;
> union kvm_mmu_page_role mmu_role;
> u8 root_level;
> - u8 shadow_root_level;
> bool direct_map;
> struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 51faa2c76ca5..43b99308cb0e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -112,7 +112,7 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
> return;
>
> static_call(kvm_x86_load_mmu_pgd)(vcpu, root_hpa,
> - vcpu->arch.mmu->shadow_root_level);
> + vcpu->arch.mmu->mmu_role.level);
> }
>
> struct kvm_page_fault {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6f9d876ce429..4d1fa87718f8 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2143,7 +2143,7 @@ static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterato
> {
> iterator->addr = addr;
> iterator->shadow_addr = root;
> - iterator->level = vcpu->arch.mmu->shadow_root_level;
> + iterator->level = vcpu->arch.mmu->mmu_role.level;
>
> if (iterator->level >= PT64_ROOT_4LEVEL &&
> vcpu->arch.mmu->root_level < PT64_ROOT_4LEVEL &&
> @@ -3254,7 +3254,7 @@ void kvm_mmu_free_roots(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> &invalid_list);
>
> if (free_active_root) {
> - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> + if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
> (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
> mmu_free_root_page(kvm, &mmu->root_hpa, &invalid_list);
> } else if (mmu->pae_root) {
> @@ -3329,7 +3329,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, gva_t gva,
> static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> - u8 shadow_root_level = mmu->shadow_root_level;
> + u8 shadow_root_level = mmu->mmu_role.level;
> hpa_t root;
> unsigned i;
> int r;
> @@ -3479,7 +3479,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> */
> if (mmu->root_level >= PT64_ROOT_4LEVEL) {
> root = mmu_alloc_root(vcpu, root_gfn, 0,
> - mmu->shadow_root_level, false);
> + mmu->mmu_role.level, false);
> mmu->root_hpa = root;
> goto set_root_pgd;
> }
> @@ -3495,7 +3495,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> * the shadow page table may be a PAE or a long mode page table.
> */
> pm_mask = PT_PRESENT_MASK | shadow_me_mask;
> - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
> + if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL) {
> pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
>
> if (WARN_ON_ONCE(!mmu->pml4_root)) {
> @@ -3504,7 +3504,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> }
> mmu->pml4_root[0] = __pa(mmu->pae_root) | pm_mask;
>
> - if (mmu->shadow_root_level == PT64_ROOT_5LEVEL) {
> + if (mmu->mmu_role.level == PT64_ROOT_5LEVEL) {
> if (WARN_ON_ONCE(!mmu->pml5_root)) {
> r = -EIO;
> goto out_unlock;
> @@ -3529,9 +3529,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> mmu->pae_root[i] = root | pm_mask;
> }
>
> - if (mmu->shadow_root_level == PT64_ROOT_5LEVEL)
> + if (mmu->mmu_role.level == PT64_ROOT_5LEVEL)
> mmu->root_hpa = __pa(mmu->pml5_root);
> - else if (mmu->shadow_root_level == PT64_ROOT_4LEVEL)
> + else if (mmu->mmu_role.level == PT64_ROOT_4LEVEL)
> mmu->root_hpa = __pa(mmu->pml4_root);
> else
> mmu->root_hpa = __pa(mmu->pae_root);
> @@ -3547,7 +3547,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
> static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
> {
> struct kvm_mmu *mmu = vcpu->arch.mmu;
> - bool need_pml5 = mmu->shadow_root_level > PT64_ROOT_4LEVEL;
> + bool need_pml5 = mmu->mmu_role.level > PT64_ROOT_4LEVEL;
> u64 *pml5_root = NULL;
> u64 *pml4_root = NULL;
> u64 *pae_root;
> @@ -3559,7 +3559,7 @@ static int mmu_alloc_special_roots(struct kvm_vcpu *vcpu)
> * on demand, as running a 32-bit L1 VMM on 64-bit KVM is very rare.
> */
> if (mmu->direct_map || mmu->root_level >= PT64_ROOT_4LEVEL ||
> - mmu->shadow_root_level < PT64_ROOT_4LEVEL)
> + mmu->mmu_role.level < PT64_ROOT_4LEVEL)
> return 0;
>
> /*
> @@ -4145,7 +4145,7 @@ static bool fast_pgd_switch(struct kvm_vcpu *vcpu, gpa_t new_pgd,
> * having to deal with PDPTEs. We may add support for 32-bit hosts/VMs
> * later if necessary.
> */
> - if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
> + if (mmu->mmu_role.level >= PT64_ROOT_4LEVEL &&
> mmu->root_level >= PT64_ROOT_4LEVEL)
> return cached_root_available(vcpu, new_pgd, new_role);
>
> @@ -4408,17 +4408,17 @@ static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
> struct rsvd_bits_validate *shadow_zero_check;
> int i;
>
> - WARN_ON_ONCE(context->shadow_root_level < PT32E_ROOT_LEVEL);
> + WARN_ON_ONCE(context->mmu_role.level < PT32E_ROOT_LEVEL);
>
> shadow_zero_check = &context->shadow_zero_check;
> __reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
> - context->shadow_root_level, uses_nx,
> + context->mmu_role.level, uses_nx,
> guest_can_use_gbpages(vcpu), is_pse, is_amd);
>
> if (!shadow_me_mask)
> return;
>
> - for (i = context->shadow_root_level; --i >= 0;) {
> + for (i = context->mmu_role.level; --i >= 0;) {
> shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
> shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
> }
> @@ -4445,7 +4445,7 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
>
> if (boot_cpu_is_amd())
> __reset_rsvds_bits_mask(shadow_zero_check, reserved_hpa_bits(),
> - context->shadow_root_level, false,
> + context->mmu_role.level, false,
> boot_cpu_has(X86_FEATURE_GBPAGES),
> false, true);
> else
> @@ -4456,7 +4456,7 @@ reset_tdp_shadow_zero_bits_mask(struct kvm_mmu *context)
> if (!shadow_me_mask)
> return;
>
> - for (i = context->shadow_root_level; --i >= 0;) {
> + for (i = context->mmu_role.level; --i >= 0;) {
> shadow_zero_check->rsvd_bits_mask[0][i] &= ~shadow_me_mask;
> shadow_zero_check->rsvd_bits_mask[1][i] &= ~shadow_me_mask;
> }
> @@ -4735,7 +4735,6 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu, union kvm_mmu_role cpu_role)
> context->page_fault = kvm_tdp_page_fault;
> context->sync_page = nonpaging_sync_page;
> context->invlpg = NULL;
> - context->shadow_root_level = kvm_mmu_get_tdp_level(vcpu);
> context->direct_map = true;
> context->get_guest_pgd = get_cr3;
> context->get_pdptr = kvm_pdptr_read;
> @@ -4773,7 +4772,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
> context->root_level = cpu_role.base.level;
>
> reset_guest_paging_metadata(vcpu, context);
> - context->shadow_root_level = mmu_role.level;
> }
>
> static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
> @@ -4852,8 +4850,6 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
> context->cpu_role.as_u64 = new_role.as_u64;
> context->mmu_role.word = new_role.base.word;
>
> - context->shadow_root_level = level;
> -
> context->page_fault = ept_page_fault;
> context->gva_to_gpa = ept_gva_to_gpa;
> context->sync_page = ept_sync_page;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index dd4c78833016..9fb6d983bae9 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1697,7 +1697,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> gfn_t gfn = addr >> PAGE_SHIFT;
> int leaf = -1;
>
> - *root_level = vcpu->arch.mmu->shadow_root_level;
> + *root_level = vcpu->arch.mmu->mmu_role.level;
>
> tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> leaf = iter.level;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7b5345a66117..5a1d552b535b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3815,7 +3815,7 @@ static void svm_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> hv_track_root_tdp(vcpu, root_hpa);
>
> cr3 = vcpu->arch.cr3;
> - } else if (vcpu->arch.mmu->shadow_root_level >= PT64_ROOT_4LEVEL) {
> + } else if (vcpu->arch.mmu->mmu_role.level >= PT64_ROOT_4LEVEL) {
> cr3 = __sme_set(root_hpa) | kvm_get_active_pcid(vcpu);
> } else {
> /* PCID in the guest should be impossible with a 32-bit MMU. */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 8ac5a6fa7720..5e2c865a04ff 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2965,7 +2965,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
>
> if (enable_ept)
> ept_sync_context(construct_eptp(vcpu, root_hpa,
> - mmu->shadow_root_level));
> + mmu->mmu_role.level));
> else
> vpid_sync_context(vmx_get_current_vpid(vcpu));
> }
> --
> 2.31.1
>
>
On Mon, Feb 7, 2022 at 8:10 AM Sean Christopherson <[email protected]> wrote:
>
> On Sat, Feb 05, 2022, Paolo Bonzini wrote:
> > On 2/4/22 18:59, David Matlack wrote:
> > > > + reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
> > >
> > > Out of curiousity, how does KVM mitigate iTLB multi-hit when shadowing
> > > NPT and the guest has not enabled EFER.NX?
> >
> > You got me worried for a second but iTLB multihit is Intel-only, isn't it?
>
> AFAIK, yes, big Core only. arch/x86/kernel/cpu/common.c sets NO_ITLB_MULTIHIT
> for all AMD, Hygon, and Atom CPUs.
Ah that's right, it's Intel-only. Thanks!
On Fri, Feb 04, 2022, Paolo Bonzini wrote:
> Paolo Bonzini (23):
> KVM: MMU: pass uses_nx directly to reset_shadow_zero_bits_mask
> KVM: MMU: nested EPT cannot be used in SMM
> KVM: MMU: remove valid from extended role
> KVM: MMU: constify uses of struct kvm_mmu_role_regs
> KVM: MMU: pull computation of kvm_mmu_role_regs to kvm_init_mmu
> KVM: MMU: load new PGD once nested two-dimensional paging is
> initialized
> KVM: MMU: remove kvm_mmu_calc_root_page_role
> KVM: MMU: rephrase unclear comment
> KVM: MMU: remove "bool base_only" arguments
> KVM: MMU: split cpu_role from mmu_role
> KVM: MMU: do not recompute root level from kvm_mmu_role_regs
> KVM: MMU: remove ept_ad field
> KVM: MMU: remove kvm_calc_shadow_root_page_role_common
> KVM: MMU: cleanup computation of MMU roles for two-dimensional paging
> KVM: MMU: cleanup computation of MMU roles for shadow paging
> KVM: MMU: remove extended bits from mmu_role
> KVM: MMU: remove redundant bits from extended role
> KVM: MMU: fetch shadow EFER.NX from MMU role
> KVM: MMU: simplify and/or inline computation of shadow MMU roles
> KVM: MMU: pull CPU role computation to kvm_init_mmu
> KVM: MMU: store shadow_root_level into mmu_role
> KVM: MMU: use cpu_role for root_level
> KVM: MMU: replace direct_map with mmu_role.direct
Heresy! Everyone knows the one true way is "KVM: x86/mmu:"
$ glo | grep "KVM: MMU:" | wc -l
740
$ glo | grep "KVM: x86/mmu:" | wc -l
403
Dammit, I'm the heathen...
I do think we should use x86/mmu though. VMX and SVM (and nVMX and nSVM) are ok
because they're unlikely to collide with other architectures, but every arch has
an MMU...
On Fri, Feb 04, 2022, Paolo Bonzini wrote:
> The level field of the MMU role can act as a marker for validity
> instead: it is guaranteed to be nonzero so a zero value means the role
> is invalid and the MMU properties will be computed again.
Nope, it's not guaranteed to be non-zero:
static int role_regs_to_root_level(struct kvm_mmu_role_regs *regs)
{
if (!____is_cr0_pg(regs))
return 0; <=============================================
else if (____is_efer_lma(regs))
return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL :
PT64_ROOT_4LEVEL;
else if (____is_cr4_pae(regs))
return PT32E_ROOT_LEVEL;
else
return PT32_ROOT_LEVEL;
}
static union kvm_mmu_role
kvm_calc_nested_mmu_role(struct kvm_vcpu *vcpu, struct kvm_mmu_role_regs *regs)
{
union kvm_mmu_role role;
role = kvm_calc_shadow_root_page_role_common(vcpu, regs, false);
/*
* Nested MMUs are used only for walking L2's gva->gpa, they never have
* shadow pages of their own and so "direct" has no meaning. Set it
* to "true" to try to detect bogus usage of the nested MMU.
*/
role.base.direct = true;
role.base.level = role_regs_to_root_level(regs);
return role;
}
static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
{
struct kvm_mmu_role_regs regs = vcpu_to_role_regs(vcpu);
union kvm_mmu_role new_role = kvm_calc_nested_mmu_role(vcpu, &regs);
struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
if (new_role.as_u64 == g_context->mmu_role.as_u64) <== theoretically can get a false positive
return;
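To spell the false positive out (my reading of the scenario behind the annotation above, as a hypothetical sequence):

	/*
	 * 1) The vCPU runs with CR0.PG=0, so the computed nested role has
	 *    level == 0 and is cached in g_context.
	 * 2) Userspace later changes the guest CPUID; the proposed invalidation
	 *    zeroes only .level, which is already 0 for this cached role.
	 * 3) The next init_kvm_nested_mmu() computes the same value, takes the
	 *    early return above, and keeps MMU properties that should have been
	 *    recomputed.
	 */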
On Fri, Feb 04, 2022, David Matlack wrote:
> On Fri, Feb 04, 2022 at 06:56:56AM -0500, Paolo Bonzini wrote:
> > reset_shadow_zero_bits_mask has a very unintuitive way of deciding
> > whether the shadow pages will use the NX bit. The function is used in
> > two cases, shadow paging and shadow NPT; shadow paging has a use for
> > EFER.NX and needs to force it enabled, while shadow NPT only needs it
> > depending on L1's setting.
> >
> > The actual root problem here is that is_efer_nx, despite being part
> > of the "base" role, only matches the format of the shadow pages in the
> > NPT case. For now, just remove the ugly variable initialization and move
> > the call to reset_shadow_zero_bits_mask out of shadow_mmu_init_context.
> > The parameter can then be removed after the root problem in the role
> > is fixed.
> >
> > Signed-off-by: Paolo Bonzini <[email protected]>
>
> Reviewed-by: David Matlack <[email protected]>
>
> (I agree this commit makes no functional change.)
There may not be a functional change, but it drops an optimization and contributes
to making future code/patches more fragile, since it becomes harder to understand
the relationship between shadow_mmu_init_context() and __kvm_mmu_new_pgd().
> > @@ -4829,8 +4820,6 @@ static void shadow_mmu_init_context(struct kvm_vcpu *vcpu, struct kvm_mmu *conte
> >
> > reset_guest_paging_metadata(vcpu, context);
> > context->shadow_root_level = new_role.base.level;
> > -
> > - reset_shadow_zero_bits_mask(vcpu, context);
This is guarded by:
if (new_role.as_u64 == context->mmu_role.as_u64)
return;
> > }
> >
> > static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
> > @@ -4841,6 +4830,16 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
> > kvm_calc_shadow_mmu_root_page_role(vcpu, regs, false);
> >
> > shadow_mmu_init_context(vcpu, context, regs, new_role);
> > +
> > + /*
> > + * KVM uses NX when TDP is disabled to handle a variety of scenarios,
> > + * notably for huge SPTEs if iTLB multi-hit mitigation is enabled and
> > + * to generate correct permissions for CR0.WP=0/CR4.SMEP=1/EFER.NX=0.
> > + * The iTLB multi-hit workaround can be toggled at any time, so assume
> > + * NX can be used by any non-nested shadow MMU to avoid having to reset
> > + * MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
> > + */
> > + reset_shadow_zero_bits_mask(vcpu, context, true);
Whereas this will compute the mask even if the role doesn't change. I say that
matters later on because then this sequence:
shadow_mmu_init_context(vcpu, context, &regs, new_role);
reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
__kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
becomes even more difficult to untangle.
And looking at where this series ends up, I don't understand the purpose of this
change. Patch 18 essentially reverts this patch, and I see nothing in between
that will break without the temporary change. That patch becomes:
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02e6d256805d..f9c96de1189d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4408,7 +4408,7 @@ static void reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu,
* NX can be used by any non-nested shadow MMU to avoid having to reset
* MMU contexts. Note, KVM forces EFER.NX=1 when TDP is disabled.
*/
- bool uses_nx = is_efer_nx(context) || !tdp_enabled;
+ bool uses_nx = context->mmu_role.efer_nx;
/* @amd adds a check on bit of SPTEs, which KVM shouldn't use anyways. */
bool is_amd = true;
though it needs to update the comment as well.
> > }
> >
> > static union kvm_mmu_role
> > @@ -4872,6 +4871,7 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
> > __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
> >
> > shadow_mmu_init_context(vcpu, context, &regs, new_role);
> > + reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
>
> Out of curiosity, how does KVM mitigate iTLB multi-hit when shadowing
> NPT and the guest has not enabled EFER.NX?
>
> > }
> > EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
> >
> > --
> > 2.31.1
> >
> >
On Mon, Feb 07, 2022, David Matlack wrote:
> On Mon, Feb 7, 2022 at 3:27 PM Sean Christopherson <[email protected]> wrote:
> > > What do you think about calling this the guest_role instead of cpu_role?
> > > There is a bit of a precedent for using "guest" instead of "cpu" already
> > > for this type of concept (e.g. guest_walker), and I find it more
> > > intuitive.
> >
> > Haven't looked at the series yet, but I'd prefer not to use guest_role, it's
> > too similar to is_guest_mode() and kvm_mmu_role.guest_mode. E.g. we'd end up with
> >
> > static union kvm_mmu_role kvm_calc_guest_role(struct kvm_vcpu *vcpu,
> > const struct kvm_mmu_role_regs *regs)
> > {
> > union kvm_mmu_role role = {0};
> >
> > role.base.access = ACC_ALL;
> > role.base.smm = is_smm(vcpu);
> > role.base.guest_mode = is_guest_mode(vcpu);
> > role.base.direct = !____is_cr0_pg(regs);
> >
> > ...
> > }
> >
> > and possibly
> >
> > if (guest_role.guest_mode)
> > ...
> >
> > which would be quite messy. Maybe vcpu_role if cpu_role isn't intuitive?
>
> I agree it's a little odd. But actually it's somewhat intuitive (the
> guest is in guest-mode, i.e. we're running a nested guest).
>
> Ok I'm stretching a little bit :). But if the trade-off is just
> "guest_role.guest_mode" requires a clarifying comment, but the rest of
> the code gets more readable (cpu_role is used a lot more than
> role.guest_mode), it still might be worth it.
It's not just guest_mode, we also have guest_mmu, e.g. we'd end up with
vcpu->arch.root_mmu.guest_role.base.level
vcpu->arch.guest_mmu.guest_role.base.level
vcpu->arch.nested_mmu.guest_role.base.level
In a vacuum, I 100% agree that guest_role is better than cpu_role or vcpu_role,
but the term "guest" has already been claimed for "L2" in far too many places.
While we're behind the bikeshed... the resulting:
union kvm_mmu_role cpu_role;
union kvm_mmu_page_role mmu_role;
is a mess. Again, I really like "mmu_role" in a vacuum, but juxtaposed with
union kvm_mmu_role cpu_role;
it's super confusing, e.g. I expected
union kvm_mmu_role mmu_role;
Nested EPT is a good example of complete confusion, because we compute kvm_mmu_role,
compare it to cpu_role, then shove it into both cpu_role and mmu_role. It makes
sense once you reason about what it's doing, but on the surface it's confusing.
struct kvm_mmu *context = &vcpu->arch.guest_mmu;
u8 level = vmx_eptp_page_walk_level(new_eptp);
union kvm_mmu_role new_role =
kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
execonly, level);
if (new_role.as_u64 != context->cpu_role.as_u64) {
/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
context->cpu_role.as_u64 = new_role.as_u64;
context->mmu_role.word = new_role.base.word;
Maybe this?
union kvm_mmu_vcpu_role vcpu_role;
union kvm_mmu_page_role mmu_role;
and some sample usage?
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d25f8cb2e62b..9f9b97c88738 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4836,13 +4836,16 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
{
struct kvm_mmu *context = &vcpu->arch.guest_mmu;
u8 level = vmx_eptp_page_walk_level(new_eptp);
- union kvm_mmu_role new_role =
+ union kvm_mmu_vcpu_role new_role =
kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
execonly, level);
- if (new_role.as_u64 != context->cpu_role.as_u64) {
- /* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
- context->cpu_role.as_u64 = new_role.as_u64;
+ if (new_role.as_u64 != context->vcpu_role.as_u64) {
+ /*
+ * EPT, and thus nested EPT, does not consume CR0, CR4, nor
+ * EFER, so the mmu_role is a strict subset of the vcpu_role.
+ */
+ context->vcpu_role.as_u64 = new_role.as_u64;
context->mmu_role.word = new_role.base.word;
context->page_fault = ept_page_fault;
And while I'm on a soapbox.... am I the only one that absolutely detests the use
of "context" and "g_context"? I'd be all in favor of renaming those to "mmu"
throughout the code as a prep to this series.
I also think we should move the initialization of guest_mmu => mmu into the MMU
helpers. Pulling the mmu from guest_mmu but then relying on the caller to wire
up guest_mmu => mmu so that e.g. kvm_mmu_new_pgd() works is gross and confused
the heck out of me. E.g.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d25f8cb2e62b..4e7fe9758ce8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4794,7 +4794,7 @@ static void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu,
void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
unsigned long cr4, u64 efer, gpa_t nested_cr3)
{
- struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+ struct kvm_mmu *mmu = &vcpu->arch.guest_mmu;
struct kvm_mmu_role_regs regs = {
.cr0 = cr0,
.cr4 = cr4 & ~X86_CR4_PKE,
@@ -4806,6 +4806,8 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
mmu_role = cpu_role.base;
mmu_role.level = kvm_mmu_get_tdp_level(vcpu);
+ vcpu->arch.mmu = &vcpu->arch.guest_mmu;
+
shadow_mmu_init_context(vcpu, context, cpu_role, mmu_role);
kvm_mmu_new_pgd(vcpu, nested_cr3);
}
@@ -4834,12 +4836,14 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
int huge_page_level, bool accessed_dirty,
gpa_t new_eptp)
{
- struct kvm_mmu *context = &vcpu->arch.guest_mmu;
+ struct kvm_mmu *mmu = &vcpu->arch.guest_mmu;
u8 level = vmx_eptp_page_walk_level(new_eptp);
union kvm_mmu_role new_role =
kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
execonly, level);
+ vcpu->arch.mmu = mmu;
+
if (new_role.as_u64 != context->cpu_role.as_u64) {
/* EPT, and thus nested EPT, does not consume CR0, CR4, nor EFER. */
context->cpu_role.as_u64 = new_role.as_u64;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 1218b5a342fc..d0f8eddb32be 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -98,8 +98,6 @@ static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
WARN_ON(mmu_is_nested(vcpu));
- vcpu->arch.mmu = &vcpu->arch.guest_mmu;
-
/*
* The NPT format depends on L1's CR4 and EFER, which is in vmcb01. Note,
* when called via KVM_SET_NESTED_STATE, that state may _not_ match current
On 2/9/22 23:54, Sean Christopherson wrote:
> Nope, it's not guaranteed to be non-zero:
>
> static int role_regs_to_root_level(struct kvm_mmu_role_regs *regs)
> {
> if (!____is_cr0_pg(regs))
> return 0; <=============================================
> else if (____is_efer_lma(regs))
> return ____is_cr4_la57(regs) ? PT64_ROOT_5LEVEL :
> PT64_ROOT_4LEVEL;
> else if (____is_cr4_pae(regs))
> return PT32E_ROOT_LEVEL;
> else
> return PT32_ROOT_LEVEL;
> }
>
Yes, see my reply to David. At the end of the series the assumption is
correct:
- level is always nonzero in mmu_role
- one of level or direct (which is !CR0.PG) is always nonzero in cpu_role
So the patch can be kept but it has to be moved much later.
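A minimal standalone sketch of that invariant (the struct and helper below are
simplified stand-ins for kvm_mmu_role_regs and role_regs_to_root_level(), not
the real KVM definitions):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct regs { bool cr0_pg, efer_lma, cr4_la57, cr4_pae; };

/* mirrors role_regs_to_root_level(): returns 0 only when CR0.PG=0 */
static int root_level(const struct regs *r)
{
	if (!r->cr0_pg)
		return 0;
	if (r->efer_lma)
		return r->cr4_la57 ? 5 : 4;	/* PT64_ROOT_5LEVEL : PT64_ROOT_4LEVEL */
	return r->cr4_pae ? 3 : 2;		/* PT32E_ROOT_LEVEL : PT32_ROOT_LEVEL */
}

int main(void)
{
	struct regs r = { .cr0_pg = false };
	bool direct = !r.cr0_pg;		/* direct is !CR0.PG */

	/* the cpu_role invariant: one of level or direct is always nonzero */
	assert(root_level(&r) || direct);
	printf("level=%d direct=%d\n", root_level(&r), direct);
	return 0;
}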
Paolo
On 2/10/22 02:11, Sean Christopherson wrote:
> In a vacuum, I 100% agree that guest_role is better than cpu_role or vcpu_role,
> but the term "guest" has already been claimed for "L2" in far too many places.
>
> While we're behind the bikeshed... the resulting:
>
> union kvm_mmu_role cpu_role;
> union kvm_mmu_page_role mmu_role;
>
> is a mess. Again, I really like "mmu_role" in a vacuum, but juxtaposed with
>
> union kvm_mmu_role cpu_role;
>
> it's super confusing, e.g. I expected
>
> union kvm_mmu_role mmu_role;
What about
union kvm_mmu_page_role root_role;
union kvm_mmu_paging_mode cpu_mode;
? I already have to remove ".base" from all accesses to mmu_role, so
it's not much extra churn.
Paolo
On 2/9/22 23:31, Sean Christopherson wrote:
> Heresy! Everyone knows the one true way is "KVM: x86/mmu:"
>
> $ glo | grep "KVM: MMU:" | wc -l
> 740
> $ glo | grep "KVM: x86/mmu:" | wc -l
> 403
>
> Dammit, I'm the heathen...
>
> I do think we should use x86/mmu though. VMX and SVM (and nVMX and nSVM) are ok
> because they're unlikely to collide with other architectures, but every arch has
> an MMU...
>
Sure, I can adjust my habits.
Paolo
On 2/10/22 17:55, Sean Christopherson wrote:
>> union kvm_mmu_page_role root_role;
>> union kvm_mmu_paging_mode cpu_mode;
>
> I'd prefer to not use "paging mode", the SDM uses that terminology to refer to
> the four paging modes. My expectation given the name is that the union would
> track only CR0.PG, EFER.LME, CR4.PAE, and CR4.PSE[*].
Yeah, I had started with kvm_mmu_paging_flags, but cpu_flags was an even
worse name than kvm_mmu_paging_mode.
Anyway, now that I have done _some_ replacement, it's a matter of sed -i
on the patch files once you or someone else come up with a good moniker.
I take it that "root_role" passed your filter successfully.
Paolo
> I'm out of ideas at the moment, I'll keep chewing on this while reviewing...
>
> [*] Someone at Intel rewrote the SDM and eliminated Mode B, a.k.a. PSE 36-bit
> physical paging, it's now just part of "32-bit paging". But 5-level paging is
considered its own paging mode?!?! Lame. I guess they really want to have
> exactly four paging modes...
On Thu, Feb 10, 2022, Paolo Bonzini wrote:
> On 2/10/22 02:11, Sean Christopherson wrote:
> > In a vacuum, I 100% agree that guest_role is better than cpu_role or vcpu_role,
> > but the term "guest" has already been claimed for "L2" in far too many places.
> >
> > While we're behind the bikeshed... the resulting:
> >
> > union kvm_mmu_role cpu_role;
> > union kvm_mmu_page_role mmu_role;
> >
> > is a mess. Again, I really like "mmu_role" in a vacuum, but juxtaposed with
> >
> > union kvm_mmu_role cpu_role;
> >
> > it's super confusing, e.g. I expected
> >
> > union kvm_mmu_role mmu_role;
>
> What about
>
> union kvm_mmu_page_role root_role;
> union kvm_mmu_paging_mode cpu_mode;
>
> ? I already have to remove ".base" from all accesses to mmu_role, so it's
> not much extra churn.
I'd prefer to not use "paging mode", the SDM uses that terminology to refer to
the four paging modes. My expectation given the name is that the union would
track only CR0.PG, EFER.LME, CR4.PAE, and CR4.PSE[*].
I'm out of ideas at the moment, I'll keep chewing on this while reviewing...
[*] Someone at Intel rewrote the SDM and eliminated Mode B, a.k.a. PSE 36-bit
physical paging, it's now just part of "32-bit paging". But 5-level paging is
considered its own paging mode?!?! Lame. I guess they really want to have
exactly four paging modes...
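(As a standalone sketch, not KVM code: the paging-mode decision falls out of
those bits roughly like this; CR4.LA57 is included only because of the 5-level
case, and CR4.PSE is omitted since it only selects large pages within 32-bit
paging.)

#include <stdbool.h>
#include <stdio.h>

static const char *paging_mode(bool cr0_pg, bool cr4_pae, bool efer_lme,
			       bool cr4_la57)
{
	if (!cr0_pg)
		return "no paging";
	if (!cr4_pae)
		return "32-bit paging";	/* now also covers the old PSE-36 "Mode B" */
	if (!efer_lme)
		return "PAE paging";
	return cr4_la57 ? "5-level paging" : "4-level paging";
}

int main(void)
{
	printf("%s\n", paging_mode(true, true, true, false));	/* 4-level paging */
	return 0;
}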
On 2/10/22 01:30, Sean Christopherson wrote:
> There may not be a functional change, but it drops an optimization and contributes
> to making future code/patches more fragile due to making it harder to understand
> the relationship between shadow_mmu_init_context() and __kvm_mmu_new_pgd().
Fair enough, I'll drop this patch.
Regarding the relationship between shadow_mmu_init_context() and
kvm_mmu_new_pgd(), it's a lot more subtle than it looks. Please take a
look first at the optimization series, since it removes that subtlety.
Paolo
On Thu, Feb 10, 2022, Paolo Bonzini wrote:
> On 2/10/22 17:55, Sean Christopherson wrote:
> > > union kvm_mmu_page_role root_role;
> > > union kvm_mmu_paging_mode cpu_mode;
> >
> > I'd prefer to not use "paging mode", the SDM uses that terminology to refer to
> > the four paging modes. My expectation given the name is that the union would
> > track only CR0.PG, EFER.LME, CR4.PAE, and CR4.PSE[*].
>
> Yeah, I had started with kvm_mmu_paging_flags, but cpu_flags was an even
> worse name than kvm_mmu_paging_mode.
We could always do s/is_guest_mode/is_nested_mode or something to that effect.
It would take some retraining, but I feel like we've been fighting the whole
"guest mode" thing over and over.
> Anyway, now that I have done _some_ replacement, it's a matter of sed -i on
> the patch files once you or someone else come up with a good moniker.
>
> I take it that "root_role" passed your filter successfully.
Yep, works for me. I almost suggested it, too, but decided I liked mmu_role
marginally better. I like root_role because it ties in with root_hpa and root_pgd.
On Wed, Feb 9, 2022 at 2:31 PM Sean Christopherson <[email protected]> wrote:
> On Fri, Feb 04, 2022, Paolo Bonzini wrote:
> > KVM: MMU: replace direct_map with mmu_role.direct
>
> Heresy! Everyone knows the one true way is "KVM: x86/mmu:"
>
> $ glo | grep "KVM: MMU:" | wc -l
> 740
> $ glo | grep "KVM: x86/mmu:" | wc -l
> 403
>
> Dammit, I'm the heathen...
>
> I do think we should use x86/mmu though. VMX and SVM (and nVMX and nSVM) are ok
> because they're unlikely to collide with other architectures, but every arch has
> an MMU...
Can you document these rules/preferences somewhere? Even better if we
can enforce them with checkpatch :)