__kvm_mmu_new_pgd looks at the MMU's root_level and shadow_root_level
via fast_pgd_switch. It makes no sense to call it before updating
these fields, even though it was done like that ever since nested
VMX grew the ability to use fast CR3 switch (commit 50c28f21d045,
"kvm: x86: Use fast CR3 switch for nested VMX").
Pull it to the end of the initialization of the shadow nested MMUs.
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 41 +++++++++++++++++++----------------------
1 file changed, 19 insertions(+), 22 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 577e70509510..b8ab16323629 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4869,10 +4869,9 @@ void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
new_role = kvm_calc_shadow_npt_root_page_role(vcpu, &regs);
- __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
-
shadow_mmu_init_context(vcpu, context, &regs, new_role);
reset_shadow_zero_bits_mask(vcpu, context, is_efer_nx(context));
+ __kvm_mmu_new_pgd(vcpu, nested_cr3, new_role.base);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_npt_mmu);
@@ -4906,27 +4905,25 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
kvm_calc_shadow_ept_root_page_role(vcpu, accessed_dirty,
execonly, level);
- __kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
-
- if (new_role.as_u64 == context->mmu_role.as_u64)
- return;
-
- context->mmu_role.as_u64 = new_role.as_u64;
-
- context->shadow_root_level = level;
-
- context->ept_ad = accessed_dirty;
- context->page_fault = ept_page_fault;
- context->gva_to_gpa = ept_gva_to_gpa;
- context->sync_page = ept_sync_page;
- context->invlpg = ept_invlpg;
- context->root_level = level;
- context->direct_map = false;
+ if (new_role.as_u64 != context->mmu_role.as_u64) {
+ context->mmu_role.as_u64 = new_role.as_u64;
+
+ context->shadow_root_level = level;
+
+ context->ept_ad = accessed_dirty;
+ context->page_fault = ept_page_fault;
+ context->gva_to_gpa = ept_gva_to_gpa;
+ context->sync_page = ept_sync_page;
+ context->invlpg = ept_invlpg;
+ context->root_level = level;
+ context->direct_map = false;
+ update_permission_bitmask(context, true);
+ context->pkru_mask = 0;
+ reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
+ reset_ept_shadow_zero_bits_mask(context, execonly);
+ }
- update_permission_bitmask(context, true);
- context->pkru_mask = 0;
- reset_rsvds_bits_mask_ept(vcpu, context, execonly, huge_page_level);
- reset_ept_shadow_zero_bits_mask(context, execonly);
+ __kvm_mmu_new_pgd(vcpu, new_eptp, new_role.base);
}
EXPORT_SYMBOL_GPL(kvm_init_shadow_ept_mmu);
--
2.31.1
On Fri, Feb 04, 2022 at 06:57:01AM -0500, Paolo Bonzini wrote:
> __kvm_mmu_new_pgd looks at the MMU's root_level and shadow_root_level
> via fast_pgd_switch.
Those checks are just for performance, correct (to skip iterating through
the list of roots)?
Either way, it's probably worth including a Fixes tag below.
> It makes no sense to call it before updating
> these fields, even though it was done like that ever since nested
> VMX grew the ability to use fast CR3 switch (commit 50c28f21d045,
> "kvm: x86: Use fast CR3 switch for nested VMX").
>
> Pull it to the end of the initialization of the shadow nested MMUs.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
Reviewed-by: David Matlack <[email protected]>
On 2/4/22 20:18, David Matlack wrote:
> On Fri, Feb 04, 2022 at 06:57:01AM -0500, Paolo Bonzini wrote:
>> __kvm_mmu_new_pgd looks at the MMU's root_level and shadow_root_level
>> via fast_pgd_switch.
> Those checks are just for performance, correct (to skip iterating through
> the list of roots)?
>
> Either way, it's probably worth including a Fixes tag below.
>
There's no bug because __kvm_mmu_new_pgd is passed a correct new_role.
But it's unnecessarily complex as shown by patches 7 and 9.
Paolo
On 2/7/22 14:50, Paolo Bonzini wrote:
>>>
>> Those checks are just for performance, correct (to skip iterating through
>> the list of roots)?
>>
>> Either way, it's probably worth including a Fixes tag below.
>
> There's no bug because __kvm_mmu_new_pgd is passed a correct new_role.
> But it's unnecessarily complex as shown by patches 7 and 9.
root_level is not part of the role, but we're still safe because
changing it requires flipping CR0.PG to 0 and back to 1. This calls
kvm_mmu_reset_context and unloads all roots, so it's not possible to
have a cached PGD with the wrong root level. On the first call the
root_level might be incorrect, but then again there's no cached PGD.
(Also, shadow_root_level is either always or never >= PT64_ROOT_4LEVEL,
so it's okay; the one exception is a guest_mmu that was never set up,
and in that case no cached PGD exists anyway.)
Paolo
On 2/4/22 20:18, David Matlack wrote:
> On Fri, Feb 04, 2022 at 06:57:01AM -0500, Paolo Bonzini wrote:
>> __kvm_mmu_new_pgd looks at the MMU's root_level and shadow_root_level
>> via fast_pgd_switch.
>
> Those checks are just for performance, correct (to skip iterating through
> the list of roots)?
>
> Either way, it's probably worth including a Fixes tag below.
It's actually a much bigger mess---though it's working as intended,
except that in some cases the caching is suboptimal.
Basically, you have to call __kvm_mmu_new_pgd *before* kvm_init_mmu
because of how fast_pgd_switch takes care of stashing the current root
in the cache. A PAE root is never placed in the cache exactly because
mmu->root_level and mmu->shadow_root_level hold the value for the
_current_ root. In that case, fast_pgd_switch does not look for a
cached PGD (even if according to the role it could be there).
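To make that concrete, the gate in fast_pgd_switch is roughly the
following (a paraphrased sketch, not the exact source):

    struct kvm_mmu *mmu = vcpu->arch.mmu;

    /*
     * Only 64-bit roots ever go through the cache; PAE roots bail out
     * here, which is why root_level/shadow_root_level must still
     * describe the *current* root at this point.
     */
    if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
        mmu->root_level >= PT64_ROOT_4LEVEL)
            return cached_root_available(vcpu, new_pgd, new_role);

    return false;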
__kvm_mmu_new_pgd then frees the current root using kvm_mmu_free_roots.
kvm_mmu_free_roots *also* uses mmu->root_level and
mmu->shadow_root_level to distinguish whether the page table uses a
single root or 4 PAE roots. Because kvm_init_mmu can overwrite
mmu->root_level, kvm_mmu_free_roots must also be called before kvm_init_mmu.
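The distinction it draws is along these lines (again just a sketch;
"pae_root" is shorthand for wherever the four PAE roots are stored):

    if (mmu->shadow_root_level >= PT64_ROOT_4LEVEL &&
        (mmu->root_level >= PT64_ROOT_4LEVEL || mmu->direct_map)) {
            /* single 64-bit root: free/zap mmu->root_hpa */
    } else {
            /* 32-bit or PAE guest paging: free the four pae_root entries */
    }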
I do wonder if the use of mmu->shadow_root_level was intentional (it
certainly escaped me when first reviewing fast PGD switching), but
fortunately none of this is necessary. PAE roots can be identified from
!to_shadow_page(root_hpa), so the better fix is to do that:
- in fast_pgd_switch/cached_root_available, you need to account for two
possible transformations of the cache: either the old entry becomes the
MRU of the prev_roots as in the current code, or the old entry cannot be
cached. This is 20-30 more lines of code.
- in kvm_mmu_free_roots, just use to_shadow_page instead of
mmu->root_level and mmu->shadow_root_level (sketched below).
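Something along these lines, assuming the usual to_shadow_page() and
root_hpa helpers:

    struct kvm_mmu_page *sp = to_shadow_page(mmu->root_hpa);

    if (sp) {
            /* root backed by a shadow page: a single 64-bit root */
    } else {
            /* special root, e.g. PAE, with no backing kvm_mmu_page */
    }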
Once this is in place, the original bug is fixed simply by calling
kvm_mmu_new_pgd after kvm_init_mmu. kvm_mmu_reset_context is not doing
that now, to avoid having to figure out the new role, but it can do so
easily after the above change.
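That is, a (hypothetical) nested-entry caller would simply do, in this
order, with new_pgd being the L2 CR3/EPTP to switch to:

    kvm_init_mmu(vcpu);              /* recompute role, root_level, ... */
    kvm_mmu_new_pgd(vcpu, new_pgd);  /* only now look up/cache the root */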
I'll keep this cpu_role refactoring around, but strictly speaking it
becomes a separate change from the optimization.
Paolo