Reviewing the eager page splitting code made me realize that burning 14
rmap entries for nested TDP MMUs is extremely wasteful due to the per-vCPU
caches allocating 40 entries by default. For nested TDP, aliasing L2 gfns
to L1 gfns is quite rare and is not performance critical (it's exclusively
pre-boot behavior for sane setups).
Patch 1 fixes a bug where pte_list_desc is neither correctly aligned nor
sized on 32-bit kernels. The primary motivation for the fix is to be able
to add a compile-time assertion on the size being a multiple of the cache
line size; I doubt anyone cares about the performance/memory impact.
Patch 2 tweaks MMU setup to support a dynamic pte_list_desc size.
Patch 3 reduces the number of sptes per pte_list_desc to 2 for nested TDP
MMUs, i.e. allocates the bare minimum to prioritize the memory footprint
over performance for sane setups (a rough sketch follows the patch
descriptions).
Patch 4 fills the pte_list_desc cache if and only if rmaps are in use,
i.e. doesn't allocate pte_list_desc when using the TDP MMU until nested
TDP is used.
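A rough sketch of the patch 3 approach (untested and illustrative only; the
sizing math, the helper contents, and the local names are approximations,
not the actual patch):

	/* Set by kvm_configure_mmu(), i.e. by vendor hardware setup. */
	static u32 nr_sptes_per_pte_list;

	void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
			       int tdp_max_root_level, int tdp_huge_page_level)
	{
		tdp_enabled = enable_tdp;
		tdp_root_level = tdp_forced_root_level;
		max_tdp_level = tdp_max_root_level;

		/*
		 * Aliasing L2=>L1 gfns is rare when shadowing nested TDP, so
		 * two sptes per descriptor suffices; shadow paging keeps the
		 * full-sized descriptor.
		 */
		nr_sptes_per_pte_list = enable_tdp ? 2 : PTE_LIST_EXT;

		/* existing max_huge_page_level logic is unchanged */
	}

	int kvm_mmu_hardware_setup(void)
	{
		u32 desc_size;

		/* Vendor code must have called kvm_configure_mmu() by now. */
		if (WARN_ON_ONCE(!nr_sptes_per_pte_list))
			return -EIO;

		desc_size = offsetof(struct pte_list_desc, sptes) +
			    nr_sptes_per_pte_list * sizeof(u64 *);

		pte_list_desc_cache = kmem_cache_create("pte_list_desc",
							desc_size, 0,
							SLAB_ACCOUNT, NULL);
		if (!pte_list_desc_cache)
			return -ENOMEM;

		/* remaining cache/shrinker setup as in patch 2 */
		return 0;
	}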
Sean Christopherson (4):
KVM: x86/mmu: Track the number of entries in a pte_list_desc with a ulong
KVM: x86/mmu: Defer "full" MMU setup until after vendor
hardware_setup()
KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP
KVM: x86/mmu: Topup pte_list_desc cache iff VM is using rmaps
arch/x86/include/asm/kvm_host.h | 5 ++-
arch/x86/kvm/mmu/mmu.c | 78 +++++++++++++++++++++++----------
arch/x86/kvm/x86.c | 17 ++++---
3 files changed, 70 insertions(+), 30 deletions(-)
base-commit: 4b88b1a518b337de1252b8180519ca4c00015c9e
--
2.37.0.rc0.161.g10f37bed90-goog
Use an "unsigned long" instead of a "u64" to track the number of entries
in a pte_list_desc's sptes array. Both sizes are overkill as the number
of entries would easily fit into a u8; the goal is purely to get sptes[]
aligned and to size the struct as a whole to be a multiple of a cache
line (64 bytes).
Using a u64 on 32-bit kernels fails on both counts as "more" is only
4 bytes. Dropping "spte_count" to 4 bytes on 32-bit kernels fixes both
the alignment and the overall size.
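For reference, a rough 32-bit layout sketch, assuming the stock i386 ABI
(4-byte pointers and longs, u64 aligned to 4 bytes) and PTE_LIST_EXT == 14:

	/*
	 *                u64 spte_count               ulong spte_count
	 *  more          4 bytes at offset 0          4 bytes at offset 0
	 *  spte_count    8 bytes at offset 4          4 bytes at offset 4
	 *  sptes[14]     56 bytes at offset 12        56 bytes at offset 8
	 *  sizeof        68 (not a 64-byte multiple)  64 (one cache line)
	 */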
Add a compile-time assert to ensure the size of pte_list_desc stays a
multiple of the cache line size on modern CPUs (hardcoded because
L1_CACHE_BYTES is configurable via CONFIG_X86_L1_CACHE_SHIFT).
Fixes: 13236e25ebab ("KVM: X86: Optimize pte_list_desc with per-array counter")
Cc: Peter Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bd74a287b54a..17ac30b9e22c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -117,15 +117,17 @@ module_param(dbg, bool, 0644);
/*
* Slight optimization of cacheline layout, by putting `more' and `spte_count'
* at the start; then accessing it will only use one single cacheline for
- * either full (entries==PTE_LIST_EXT) case or entries<=6.
+ * either full (entries==PTE_LIST_EXT) case or entries<=6. On 32-bit kernels,
+ * the entire struct fits in a single cacheline.
*/
struct pte_list_desc {
struct pte_list_desc *more;
/*
- * Stores number of entries stored in the pte_list_desc. No need to be
- * u64 but just for easier alignment. When PTE_LIST_EXT, means full.
+ * The number of valid entries in sptes[]. Use an unsigned long to
+ * naturally align sptes[] (a u8 for the count would suffice). When
+ * equal to PTE_LIST_EXT, this particular list is full.
*/
- u64 spte_count;
+ unsigned long spte_count;
u64 *sptes[PTE_LIST_EXT];
};
@@ -5640,6 +5642,9 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
tdp_root_level = tdp_forced_root_level;
max_tdp_level = tdp_max_root_level;
+ BUILD_BUG_ON_MSG((sizeof(struct pte_list_desc) % 64),
+ "pte_list_desc is not a multiple of cache line size (on modern CPUs)");
+
/*
* max_huge_page_level reflects KVM's MMU capabilities irrespective
* of kernel support, e.g. KVM may be capable of using 1GB pages when
--
2.37.0.rc0.161.g10f37bed90-goog
Topup the per-vCPU pte_list_desc caches if and only if the VM is using
rmaps, i.e. KVM is not using the TDP MMU or KVM is shadowing a nested TDP
MMU. This avoids wasting 1280 bytes per vCPU (40 cache entries * 32 bytes
per pte_list_desc after the previous patch) when KVM is using the TDP MMU
and L1 is not utilizing nested TDP.
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2db328d28b7b..fcbdd780075f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -646,11 +646,13 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
{
int r;
- /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
- r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
- 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
- if (r)
- return r;
+ if (kvm_memslots_have_rmaps(vcpu->kvm)) {
+ /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
+ r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
+ 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
+ if (r)
+ return r;
+ }
r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
PT64_ROOT_MAX_LEVEL);
if (r)
--
2.37.0.rc0.161.g10f37bed90-goog
Defer MMU setup, and in particular allocation of pte_list_desc_cache,
until after the vendor's hardware_setup() has run, i.e. until after the
MMU has been configured by vendor code. This will allow a future commit
to dynamically size pte_list_desc's array of sptes based on whether or
not KVM is using TDP.
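The resulting setup order is roughly (sketch, only the relevant calls shown):

	kvm_arch_init()
	  kvm_mmu_vendor_module_init()    /* resets PTE masks, no allocations */

	kvm_arch_hardware_setup()
	  ops->hardware_setup()           /* vendor code, calls kvm_configure_mmu() */
	  kvm_ops_update(ops)
	  kvm_mmu_hardware_setup()        /* creates the MMU caches */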
Alternatively, the setup could be done in kvm_configure_mmu(), but that
would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
and error paths, i.e. doesn't actually save code and is arguably uglier.
Note, keep the reset of PTE masks where it is to ensure that the masks
are reset before the vendor's hardware_setup() runs, i.e. before the
vendor code has a chance to manipulate the masks, e.g. VMX modifies masks
even before calling kvm_configure_mmu().
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 5 +++--
arch/x86/kvm/mmu/mmu.c | 12 ++++++++----
arch/x86/kvm/x86.c | 17 +++++++++++------
3 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88a3026ee163..c670a9656257 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1711,8 +1711,9 @@ static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm)
((vcpu) && (vcpu)->arch.handling_intr_from_guest)
void kvm_mmu_x86_module_init(void);
-int kvm_mmu_vendor_module_init(void);
-void kvm_mmu_vendor_module_exit(void);
+void kvm_mmu_vendor_module_init(void);
+int kvm_mmu_hardware_setup(void);
+void kvm_mmu_hardware_unsetup(void);
void kvm_mmu_destroy(struct kvm_vcpu *vcpu);
int kvm_mmu_create(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 17ac30b9e22c..ceb81e04aea3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
* loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
* to be reset when a potentially different vendor module is loaded.
*/
-int kvm_mmu_vendor_module_init(void)
+void kvm_mmu_vendor_module_init(void)
{
- int ret = -ENOMEM;
-
/*
* MMU roles use union aliasing which is, generally speaking, an
* undefined behavior. However, we supposedly know how compilers behave
@@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
+ /* Reset the PTE masks before the vendor module's hardware setup. */
kvm_mmu_reset_all_pte_masks();
+}
+
+int kvm_mmu_hardware_setup(void)
+{
+ int ret = -ENOMEM;
pte_list_desc_cache = kmem_cache_create("pte_list_desc",
sizeof(struct pte_list_desc),
@@ -6723,7 +6727,7 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
mmu_free_memory_caches(vcpu);
}
-void kvm_mmu_vendor_module_exit(void)
+void kvm_mmu_hardware_unsetup(void)
{
mmu_destroy_caches();
percpu_counter_destroy(&kvm_total_used_mmu_pages);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 031678eff28e..735543df829a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9204,9 +9204,7 @@ int kvm_arch_init(void *opaque)
}
kvm_nr_uret_msrs = 0;
- r = kvm_mmu_vendor_module_init();
- if (r)
- goto out_free_percpu;
+ kvm_mmu_vendor_module_init();
kvm_timer_init();
@@ -9226,8 +9224,6 @@ int kvm_arch_init(void *opaque)
return 0;
-out_free_percpu:
- free_percpu(user_return_msrs);
out_free_x86_emulator_cache:
kmem_cache_destroy(x86_emulator_cache);
out:
@@ -9252,7 +9248,6 @@ void kvm_arch_exit(void)
cancel_work_sync(&pvclock_gtod_work);
#endif
kvm_x86_ops.hardware_enable = NULL;
- kvm_mmu_vendor_module_exit();
free_percpu(user_return_msrs);
kmem_cache_destroy(x86_emulator_cache);
#ifdef CONFIG_KVM_XEN
@@ -11937,6 +11932,10 @@ int kvm_arch_hardware_setup(void *opaque)
kvm_ops_update(ops);
+ r = kvm_mmu_hardware_setup();
+ if (r)
+ goto out_unsetup;
+
kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
@@ -11960,12 +11959,18 @@ int kvm_arch_hardware_setup(void *opaque)
kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
kvm_init_msr_list();
return 0;
+
+out_unsetup:
+ static_call(kvm_x86_hardware_unsetup)();
+ return r;
}
void kvm_arch_hardware_unsetup(void)
{
kvm_unregister_perf_callbacks();
+ kvm_mmu_hardware_unsetup();
+
static_call(kvm_x86_hardware_unsetup)();
}
--
2.37.0.rc0.161.g10f37bed90-goog
On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> Defer MMU setup, and in particular allocation of pte_list_desc_cache,
> until after the vendor's hardware_setup() has run, i.e. until after the
> MMU has been configured by vendor code. This will allow a future commit
> to dynamically size pte_list_desc's array of sptes based on whether or
> not KVM is using TDP.
>
> Alternatively, the setup could be done in kvm_configure_mmu(), but that
> would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
> and error paths, i.e. doesn't actually save code and is arguably uglier.
>
> Note, keep the reset of PTE masks where it is to ensure that the masks
> are reset before the vendor's hardware_setup() runs, i.e. before the
> vendor code has a chance to manipulate the masks, e.g. VMX modifies masks
> even before calling kvm_configure_mmu().
>
> Signed-off-by: Sean Christopherson <[email protected]>
[...]
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 17ac30b9e22c..ceb81e04aea3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
> * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
> * to be reset when a potentially different vendor module is loaded.
> */
> -int kvm_mmu_vendor_module_init(void)
> +void kvm_mmu_vendor_module_init(void)
> {
> - int ret = -ENOMEM;
> -
> /*
> * MMU roles use union aliasing which is, generally speaking, an
> * undefined behavior. However, we supposedly know how compilers behave
> @@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
> BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
> BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
>
> + /* Reset the PTE masks before the vendor module's hardware setup. */
> kvm_mmu_reset_all_pte_masks();
> +}
> +
> +int kvm_mmu_hardware_setup(void)
> +{
Instead of putting this code in a new function and calling it after
hardware_setup(), we could put it in kvm_configure_mmu().
This will result in a larger patch diff, but it eliminates a subtle
and non-trivial-to-verify dependency ordering between
kvm_configure_mmu() and kvm_mmu_hardware_setup() and it will co-locate
the initialization of nr_sptes_per_pte_list and the code that uses it to
create pte_list_desc_cache in a single function.
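Concretely, the alternative could look something like this (illustrative
only; assumes kvm_configure_mmu() is changed to return an int so allocation
failures can be propagated, and that the descriptor is sized off of
nr_sptes_per_pte_list):

	int kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
			      int tdp_max_root_level, int tdp_huge_page_level)
	{
		tdp_enabled = enable_tdp;
		tdp_root_level = tdp_forced_root_level;
		max_tdp_level = tdp_max_root_level;

		/* ... existing max_huge_page_level setup ... */

		nr_sptes_per_pte_list = enable_tdp ? 2 : PTE_LIST_EXT;

		pte_list_desc_cache = kmem_cache_create("pte_list_desc",
					offsetof(struct pte_list_desc, sptes) +
					nr_sptes_per_pte_list * sizeof(u64 *),
					0, SLAB_ACCOUNT, NULL);
		return pte_list_desc_cache ? 0 : -ENOMEM;
	}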
On Sat, Jun 25, 2022, David Matlack wrote:
> On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> > Alternatively, the setup could be done in kvm_configure_mmu(), but that
> > would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
> > and error paths, i.e. doesn't actually save code and is arguably uglier.
> [...]
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 17ac30b9e22c..ceb81e04aea3 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
> > * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
> > * to be reset when a potentially different vendor module is loaded.
> > */
> > -int kvm_mmu_vendor_module_init(void)
> > +void kvm_mmu_vendor_module_init(void)
> > {
> > - int ret = -ENOMEM;
> > -
> > /*
> > * MMU roles use union aliasing which is, generally speaking, an
> > * undefined behavior. However, we supposedly know how compilers behave
> > @@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
> > BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
> > BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
> >
> > + /* Reset the PTE masks before the vendor module's hardware setup. */
> > kvm_mmu_reset_all_pte_masks();
> > +}
> > +
> > +int kvm_mmu_hardware_setup(void)
> > +{
>
> Instead of putting this code in a new function and calling it after
> hardware_setup(), we could put it in kvm_configure_mmu().
Ya, I noted that as an alternative in the changelog but obviously opted to not
do the allocation in kvm_configure_mmu(). I view kvm_configure_mmu() as a necessary
evil. Ideally vendor code wouldn't call into the MMU during initialization, and
common x86 would fully dictate the order of calls for MMU setup. We could force
that, but it'd require something gross like filling a struct passed into
ops->hardware_setup(), and probably would be less robust (more likely to omit a
"required" field).
In other words, I like the explicit kvm_mmu_hardware_setup() call from common x86,
e.g. to show that vendor code needs to do setup before the MMU, and so that MMU
setup isn't buried in a somewhat arbitrary location in vendor hardware setup.
I'm not dead set against handling this in kvm_configure_mmu() (though I'd probably
vote to rename it to kvm_mmu_hardware_setup()) if anyone has a super strong opinion.
> This will result in a larger patch diff, but it eliminates a subtle
> and non-trivial-to-verify dependency ordering between
Verification is "trivial" in that this WARN will fire if the order is swapped:
if (WARN_ON_ONCE(!nr_sptes_per_pte_list))
return -EIO;
> kvm_configure_mmu() and kvm_mmu_hardware_setup() and it will co-locate
> the initialization of nr_sptes_per_pte_list and the code that uses it to
> create pte_list_desc_cache in a single function.
On Mon, Jun 27, 2022 at 03:40:49PM +0000, Sean Christopherson wrote:
> On Sat, Jun 25, 2022, David Matlack wrote:
> > On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> > > Alternatively, the setup could be done in kvm_configure_mmu(), but that
> > > would require vendor code to call e.g. kvm_unconfigure_mmu() in teardown
> > > and error paths, i.e. doesn't actually save code and is arguably uglier.
> > [...]
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 17ac30b9e22c..ceb81e04aea3 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -6673,10 +6673,8 @@ void kvm_mmu_x86_module_init(void)
> > > * loaded as many of the masks/values may be modified by VMX or SVM, i.e. need
> > > * to be reset when a potentially different vendor module is loaded.
> > > */
> > > -int kvm_mmu_vendor_module_init(void)
> > > +void kvm_mmu_vendor_module_init(void)
> > > {
> > > - int ret = -ENOMEM;
> > > -
> > > /*
> > > * MMU roles use union aliasing which is, generally speaking, an
> > > * undefined behavior. However, we supposedly know how compilers behave
> > > @@ -6687,7 +6685,13 @@ int kvm_mmu_vendor_module_init(void)
> > > BUILD_BUG_ON(sizeof(union kvm_mmu_extended_role) != sizeof(u32));
> > > BUILD_BUG_ON(sizeof(union kvm_cpu_role) != sizeof(u64));
> > >
> > > + /* Reset the PTE masks before the vendor module's hardware setup. */
> > > kvm_mmu_reset_all_pte_masks();
> > > +}
> > > +
> > > +int kvm_mmu_hardware_setup(void)
> > > +{
> >
> > Instead of putting this code in a new function and calling it after
> > hardware_setup(), we could put it in kvm_configure_mmu().
>
> Ya, I noted that as an alternative in the changelog but obviously opted to not
> do the allocation in kvm_configure_mmu().
Doh! My mistake. The idea to use kvm_configure_mmu() came to me while
reviewing patch 3 and I totally forgot about that blurb in the commit
message when I came back here to leave the suggestion.
> I view kvm_configure_mmu() as a necessary
> evil. Ideally vendor code wouldn't call into the MMU during initialization, and
> common x86 would fully dictate the order of calls for MMU setup. We could force
> that, but it'd require something gross like filling a struct passed into
> ops->hardware_setup(), and probably would be less robust (more likely to omit a
> "required" field).
>
> In other words, I like the explicit kvm_mmu_hardware_setup() call from common x86,
> e.g. to show that vendor code needs to do setup before the MMU, and so that MMU
> setup isn't buried in a somewhat arbitrary location in vendor hardware setup.
Agreed, but if we're not going to get rid of kvm_configure_mmu(), we're
stuck with vendor-specific code calling into the MMU code during
hardware setup either way.
>
> I'm not dead set against handling this in kvm_configure_mmu() (though I'd probably
> vote to rename it to kvm_mmu_hardware_setup()) if anyone has a super strong opinion.
Your call. I'll put in a vote for using kvm_configure_mmu() and renaming
to kvm_mmu_hardware_setup().
>
> > This will result in a larger patch diff, but it eliminates a subtle
> > and non-trivial-to-verify dependency ordering between
>
> Verification is "trivial" in that this WARN will fire if the order is swapped:
>
> if (WARN_ON_ONCE(!nr_sptes_per_pte_list))
> return -EIO;
Ah, I missed that; that's good. Although I was thinking more from a code
readability standpoint.
>
> > kvm_configure_mmu() and kvm_mmu_hardware_setup() and it will co-locate
> > the initialization of nr_sptes_per_pte_list and the code that uses it to
> > create pte_list_desc_cache in a single function.
On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> @@ -11937,6 +11932,10 @@ int kvm_arch_hardware_setup(void *opaque)
>
> kvm_ops_update(ops);
>
> + r = kvm_mmu_hardware_setup();
> + if (r)
> + goto out_unsetup;
> +
> kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
>
> if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> @@ -11960,12 +11959,18 @@ int kvm_arch_hardware_setup(void *opaque)
> kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> kvm_init_msr_list();
> return 0;
> +
> +out_unsetup:
> + static_call(kvm_x86_hardware_unsetup)();
Should this be kvm_mmu_hardware_unsetup()? Or did I miss something?..
--
Peter Xu
On Tue, Jul 12, 2022, Peter Xu wrote:
> On Fri, Jun 24, 2022 at 11:27:33PM +0000, Sean Christopherson wrote:
> > @@ -11937,6 +11932,10 @@ int kvm_arch_hardware_setup(void *opaque)
> >
> > kvm_ops_update(ops);
> >
> > + r = kvm_mmu_hardware_setup();
> > + if (r)
> > + goto out_unsetup;
> > +
> > kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
> >
> > if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
> > @@ -11960,12 +11959,18 @@ int kvm_arch_hardware_setup(void *opaque)
> > kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
> > kvm_init_msr_list();
> > return 0;
> > +
> > +out_unsetup:
> > + static_call(kvm_x86_hardware_unsetup)();
>
> Should this be kvm_mmu_hardware_unsetup()? Or did I miss something?..
kvm_mmu_hardware_unsetup() isn't needed here. This path is reached only if
kvm_mmu_hardware_setup() fails, i.e. the common code doesn't have anything
to unwind.

The vendor call is not shown in the patch diff, but it runs before this, as:

	r = ops->hardware_setup();
	if (r != 0)
		return r;

There are no existing error paths after that runs, which is why the vendor
unsetup call is new.
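Putting the pieces together, the relevant flow in kvm_arch_hardware_setup()
ends up looking roughly like this (sketch only, unchanged lines paraphrased):

	r = ops->hardware_setup();
	if (r != 0)
		return r;		/* nothing to unwind yet */

	kvm_ops_update(ops);

	r = kvm_mmu_hardware_setup();
	if (r)
		goto out_unsetup;	/* undo only the vendor setup */

	/* ... rest of setup, which has no error paths ... */
	return 0;

out_unsetup:
	static_call(kvm_x86_hardware_unsetup)();
	return r;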