2023-01-18 15:08:15

by Mathias Krause

Subject: [PATCH v2 0/3] KVM: MMU: performance tweaks for heavy CR0.WP users

v1: https://lore.kernel.org/kvm/[email protected]/

This series is a resurrection of the missing pieces of Paolo's previous
attempt[1] to avoid needless MMU root unloading. The performance gap
between the TDP and the legacy MMU is still there, and it is especially
noticeable under grsecurity, which implements kernel W^X by toggling
CR0.WP very frequently.
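
For readers unfamiliar with that pattern, here is a minimal, purely
illustrative sketch of the guest-side code such kernels run (hypothetical
helper, not grsecurity's actual implementation; note that mainline's
write_cr0() pins CR0.WP, so this only works in a patched kernel). The
point is that every protected write costs two CR0 writes, each of which
is intercepted and causes a VM exit:

#include <asm/special_insns.h>          /* read_cr0(), write_cr0() */
#include <asm/processor-flags.h>        /* X86_CR0_WP */

/* hypothetical example: patch one word of read-only kernel data */
static void write_ro_word(unsigned long *addr, unsigned long val)
{
        /* caller is assumed to have preemption/interrupts disabled */
        unsigned long cr0 = read_cr0();

        write_cr0(cr0 & ~X86_CR0_WP);   /* VM exit: CR0 write, WP cleared */
        *addr = val;                    /* supervisor write bypasses R/O PTE */
        write_cr0(cr0);                 /* VM exit: CR0.WP restored */
}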

Patches 1-13 and 17 of the old series were merged but, unfortunately,
the remaining parts never saw a v3. I therefore picked them up, took
Sean's feedback into account[2] and simplified the whole approach to
explicitly handle only the case we care about most.

Patch 1 is a v3 of [3], addressing Sean's feedback.

Patch 2 is specifically useful for grsecurity, as handle_cr() is by far
*the* top vmexit reason.

Patch 3 is the most important one, as it skips unloading the MMU roots
when only CR0.WP gets toggled. It's the only one that changed in v2: the
case is now handled directly in kvm_post_set_cr0() and restricted to the
direct MMU role, as Sean kindly suggested.

While patches 1 and 2 already bring small performance improvements, the
big gain comes from patch 3.

This series builds on top of kvm.git/queue, namely commit de60733246ff
("Merge branch 'kvm-hw-enable-refactor' into HEAD").

Thanks,
Mathias

[1] https://lore.kernel.org/kvm/[email protected]/
[2] https://lore.kernel.org/kvm/YhATewkkO%[email protected]/
[3] https://lore.kernel.org/kvm/YhAB1d1%[email protected]/

Mathias Krause (2):
KVM: VMX: avoid retpoline call for control register caused exits
KVM: x86: do not unload MMU roots when only toggling CR0.WP

Paolo Bonzini (1):
KVM: x86/mmu: avoid indirect call for get_cr3

arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++-----------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/vmx/vmx.c | 2 ++
arch/x86/kvm/x86.c | 9 +++++++++
4 files changed, 32 insertions(+), 12 deletions(-)

--
2.39.0


2023-01-18 15:08:52

by Mathias Krause

Subject: [PATCH v2 1/3] KVM: x86/mmu: avoid indirect call for get_cr3

From: Paolo Bonzini <[email protected]>

Most of the time, calls to get_guest_pgd result in calling
kvm_read_cr3 (the only exception is nested TDP). Hardcode
the default instead of using the get_cr3 function, avoiding
a retpoline if retpolines are enabled.

Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
---
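A generic sketch of the technique with made-up names (it assumes a
kernel build environment for IS_ENABLED()); the KVM-specific variant is
the kvm_mmu_get_guest_pgd() helper added below:

typedef unsigned long (*op_fn)(void *ctx);

/* stand-in for the target hit by practically all calls */
static unsigned long common_op(void *ctx)
{
        return 0;
}

static inline unsigned long dispatch(void *ctx, op_fn op)
{
        /* devirtualize the overwhelmingly common case ... */
        if (IS_ENABLED(CONFIG_RETPOLINE) && op == common_op)
                return common_op(ctx);  /* direct call, no retpoline thunk */

        /* ... and keep the indirect call for the rare targets */
        return op(ctx);
}
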
arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++-----------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
2 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aeb240b339f5..505768631614 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -241,6 +241,20 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
return regs;
}

+static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
+{
+ return kvm_read_cr3(vcpu);
+}
+
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+ return kvm_read_cr3(vcpu);
+
+ return mmu->get_guest_pgd(vcpu);
+}
+
static inline bool kvm_available_flush_tlb_with_range(void)
{
return kvm_x86_ops.tlb_remote_flush_with_range;
@@ -3722,7 +3736,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
int quadrant, i, r;
hpa_t root;

- root_pgd = mmu->get_guest_pgd(vcpu);
+ root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
root_gfn = root_pgd >> PAGE_SHIFT;

if (mmu_check_root(vcpu, root_gfn))
@@ -4172,7 +4186,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
arch.token = alloc_apf_token(vcpu);
arch.gfn = gfn;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
- arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
+ arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);

return kvm_setup_async_pf(vcpu, cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
@@ -4191,7 +4205,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
return;

if (!vcpu->arch.mmu->root_role.direct &&
- work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+ work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;

kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
@@ -4592,11 +4606,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
}
EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);

-static unsigned long get_cr3(struct kvm_vcpu *vcpu)
-{
- return kvm_read_cr3(vcpu);
-}
-
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
@@ -5147,7 +5156,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
context->page_fault = kvm_tdp_page_fault;
context->sync_page = nonpaging_sync_page;
context->invlpg = NULL;
- context->get_guest_pgd = get_cr3;
+ context->get_guest_pgd = get_guest_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;

@@ -5297,7 +5306,7 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu,

kvm_init_shadow_mmu(vcpu, cpu_role);

- context->get_guest_pgd = get_cr3;
+ context->get_guest_pgd = get_guest_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
}
@@ -5311,7 +5320,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu,
return;

g_context->cpu_role.as_u64 = new_mode.as_u64;
- g_context->get_guest_pgd = get_cr3;
+ g_context->get_guest_pgd = get_guest_cr3;
g_context->get_pdptr = kvm_pdptr_read;
g_context->inject_page_fault = kvm_inject_page_fault;

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e5662dbd519c..78448fb84bd6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -324,7 +324,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
trace_kvm_mmu_pagetable_walk(addr, access);
retry_walk:
walker->level = mmu->cpu_role.base.level;
- pte = mmu->get_guest_pgd(vcpu);
+ pte = kvm_mmu_get_guest_pgd(vcpu, mmu);
have_ad = PT_HAVE_ACCESSED_DIRTY(mmu);

#if PTTYPE == 64
--
2.39.0

2023-01-18 15:52:19

by Mathias Krause

Subject: [PATCH v2 2/3] KVM: VMX: avoid retpoline call for control register caused exits

Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
retpoline from vmx.c exit handlers") and avoid a retpoline call for
control register accesses as well.

This speeds up guests that make heavy use of such accesses, like
grsecurity kernels toggling CR0.WP to implement kernel W^X.

Signed-off-by: Mathias Krause <[email protected]>
---
SVM may gain from a similar change as well; however, I have no AMD box
to test this on.
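
For reference, an untested sketch of what such an SVM change might look
like; it assumes svm_invoke_exit_handler() still carries the
CONFIG_RETPOLINE fast-path chain and that cr_interception() is the
handler wired up for CR exits (names not verified, treat as a starting
point only):

        /* hypothetical addition to svm_invoke_exit_handler()'s fast-path
         * chain, mirroring the VMX change below */
        else if (exit_code == SVM_EXIT_WRITE_CR0)
                return cr_interception(vcpu);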

arch/x86/kvm/vmx/vmx.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c788aa382611..c8198c8a9b55 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return handle_external_interrupt(vcpu);
else if (exit_reason.basic == EXIT_REASON_HLT)
return kvm_emulate_halt(vcpu);
+ else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
+ return handle_cr(vcpu);
else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
return handle_ept_misconfig(vcpu);
#endif
--
2.39.0

2023-01-18 15:54:13

by Mathias Krause

Subject: [PATCH v2 3/3] KVM: x86: do not unload MMU roots when only toggling CR0.WP

There is no need to unload the MMU roots for a direct MMU role when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.
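
To illustrate the point (simplified, not KVM's actual permission_fault()
logic): CR0.WP only changes whether supervisor-mode writes honor a PTE's
R/W bit, it never changes which frame a PTE maps, so the precomputed
permission bitmap is the only state that needs refreshing:

#include <stdbool.h>

/* illustrative only: the x86 write-permission rule CR0.WP influences */
static bool write_permitted(bool pte_writable, bool user_mode, bool cr0_wp)
{
        if (user_mode)
                return pte_writable;    /* user writes always honor R/W */

        /* supervisor writes ignore R/W when CR0.WP == 0 */
        return pte_writable || !cr0_wp;
}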

One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.

The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):

                     legacy     TDP   shadow
kvm.git/queue        11.55s  13.91s    75.2s
kvm.git/queue+patch   7.32s   7.31s    74.6s

For the legacy MMU this is ~36% faster, for the TDP MMU even ~47%
faster. Moreover, TDP and legacy MMU now both have roughly the same
runtime, which removes the need to disable the TDP MMU for grsecurity.

Shadow MMU sees no measurable difference and is still slow, as expected.

[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git

Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
---
v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
the direct MMU role -- Sean

I re-ran the benchmark and it's even faster than with my original
patch, as the critical path is now handled first and inline. Thanks a
lot for the suggestion, Sean!

arch/x86/kvm/x86.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 508074e47bc0..f09bfc0a3cc1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);

void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
{
+ /*
+ * Toggling just CR0.WP doesn't invalidate page tables per se, only the
+ * permission bits.
+ */
+ if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
+ kvm_init_mmu(vcpu);
+ return;
+ }
+
if ((cr0 ^ old_cr0) & X86_CR0_PG) {
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
--
2.39.0

2023-02-01 19:47:48

by Mathias Krause

Subject: Re: [PATCH v2 0/3] KVM: MMU: performance tweaks for heavy CR0.WP users

I just sent a v3, which should appear here any time soon:

https://lore.kernel.org/kvm/[email protected]/

Thanks,
Mathias