v2: https://lore.kernel.org/kvm/[email protected]/
This series is a resurrection of the missing pieces of Paolo's previous
attempt[1] to avoid needless MMU roots unloading. The performance gap
between TDP and legacy MMU still exists, and it is especially noticeable
under grsecurity, which implements kernel W^X by toggling CR0.WP very
frequently.
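For reference, the guest-side pattern causing all those CR0 exits boils
down to something like the sketch below. This is a simplified illustration
only, not the actual grsecurity code, and example_set_ro_data() is a
made-up name:

  static void example_set_ro_data(unsigned long *dst, unsigned long val)
  {
          unsigned long cr0 = native_read_cr0();

          /* Clear CR0.WP so ring 0 may write to read-only pages ... */
          asm volatile("mov %0, %%cr0" : : "r" (cr0 & ~X86_CR0_WP) : "memory");
          *dst = val;             /* ... patch the otherwise R/O data ... */
          /*
           * ... and set CR0.WP again. Both MOVs to CR0 exit to the host
           * as long as CR0.WP is intercepted.
           */
          asm volatile("mov %0, %%cr0" : : "r" (cr0) : "memory");
  }

Every such open/close pair costs two CR0 write exits, which is exactly
what patches 2, 3 and 6 are targeting.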
Patches 1-13 and 17 of the old series had been merged, but, unfortunately,
the remaining parts never saw a v3. I therefore picked these up, took
Sean's feedback into account[2] and simplified the whole approach to
explicitly handle just the case we care most about.
Patch 1 is a v3 of [3], addressing Sean's feedback.
Patch 2 is specifically useful for grsecurity, as handle_cr() is by far
*the* top vmexit reason.
Patch 3 is the most important one, as it skips unloading the MMU roots for
CR0.WP toggling.
Sean was suggesting another change on top of v2 of this series, to skip
intercepting CR0.WP writes completely for VMX[4]. That turned out to be
yet another performance boost and is implemented in patch 6.
While patches 1 and 2 bring small performance improvements already, the
big gains come from patches 3 and 6.
I used 'ssdd 10 50000' from rt-tests[5] as a micro-benchmark, running on a
grsecurity L1 VM. Below table shows the results (runtime in seconds, lower
is better):
                legacy     TDP  shadow
kvm.git/queue   11.55s  13.91s   75.2s
+ patches 1-3    7.32s   7.31s   74.6s
+ patches 4-6    4.89s   4.89s   73.4s
This series builds on top of kvm.git/queue, namely commit de60733246ff
("Merge branch 'kvm-hw-enable-refactor' into HEAD").
Patches 1-3 didn't change from v2, besides minor changelog mangling.
Patches 4-6 are new to v3.
Thanks,
Mathias
[1] https://lore.kernel.org/kvm/[email protected]/
[2] https://lore.kernel.org/kvm/YhATewkkO%[email protected]/
[3] https://lore.kernel.org/kvm/YhAB1d1%[email protected]/
[4] https://lore.kernel.org/kvm/[email protected]/
[5] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Mathias Krause (5):
KVM: VMX: Avoid retpoline call for control register caused exits
KVM: x86: Do not unload MMU roots when only toggling CR0.WP
KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
KVM: x86/mmu: Fix comment typo
KVM: VMX: Make CR0.WP a guest owned bit
Paolo Bonzini (1):
KVM: x86/mmu: Avoid indirect call for get_cr3
arch/x86/kvm/kvm_cache_regs.h | 3 ++-
arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++-----------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/spte.c | 2 +-
arch/x86/kvm/pmu.c | 4 ++--
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/nested.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
arch/x86/kvm/vmx/vmx.h | 8 ++++++++
arch/x86/kvm/x86.c | 9 +++++++++
10 files changed, 58 insertions(+), 21 deletions(-)
--
2.39.1
From: Paolo Bonzini <[email protected]>
Most of the time, calls to get_guest_pgd result in calling
kvm_read_cr3 (the only exception is nested TDP). Hardcode
the default instead of using the get_cr3 function, avoiding
a retpoline if retpolines are enabled.
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++-----------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
2 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index aeb240b339f5..505768631614 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -241,6 +241,20 @@ static struct kvm_mmu_role_regs vcpu_to_role_regs(struct kvm_vcpu *vcpu)
return regs;
}
+static unsigned long get_guest_cr3(struct kvm_vcpu *vcpu)
+{
+ return kvm_read_cr3(vcpu);
+}
+
+static inline unsigned long kvm_mmu_get_guest_pgd(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *mmu)
+{
+ if (IS_ENABLED(CONFIG_RETPOLINE) && mmu->get_guest_pgd == get_guest_cr3)
+ return kvm_read_cr3(vcpu);
+
+ return mmu->get_guest_pgd(vcpu);
+}
+
static inline bool kvm_available_flush_tlb_with_range(void)
{
return kvm_x86_ops.tlb_remote_flush_with_range;
@@ -3722,7 +3736,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
int quadrant, i, r;
hpa_t root;
- root_pgd = mmu->get_guest_pgd(vcpu);
+ root_pgd = kvm_mmu_get_guest_pgd(vcpu, mmu);
root_gfn = root_pgd >> PAGE_SHIFT;
if (mmu_check_root(vcpu, root_gfn))
@@ -4172,7 +4186,7 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
arch.token = alloc_apf_token(vcpu);
arch.gfn = gfn;
arch.direct_map = vcpu->arch.mmu->root_role.direct;
- arch.cr3 = vcpu->arch.mmu->get_guest_pgd(vcpu);
+ arch.cr3 = kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu);
return kvm_setup_async_pf(vcpu, cr2_or_gpa,
kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
@@ -4191,7 +4205,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
return;
if (!vcpu->arch.mmu->root_role.direct &&
- work->arch.cr3 != vcpu->arch.mmu->get_guest_pgd(vcpu))
+ work->arch.cr3 != kvm_mmu_get_guest_pgd(vcpu, vcpu->arch.mmu))
return;
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true);
@@ -4592,11 +4606,6 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd)
}
EXPORT_SYMBOL_GPL(kvm_mmu_new_pgd);
-static unsigned long get_cr3(struct kvm_vcpu *vcpu)
-{
- return kvm_read_cr3(vcpu);
-}
-
static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
unsigned int access)
{
@@ -5147,7 +5156,7 @@ static void init_kvm_tdp_mmu(struct kvm_vcpu *vcpu,
context->page_fault = kvm_tdp_page_fault;
context->sync_page = nonpaging_sync_page;
context->invlpg = NULL;
- context->get_guest_pgd = get_cr3;
+ context->get_guest_pgd = get_guest_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
@@ -5297,7 +5306,7 @@ static void init_kvm_softmmu(struct kvm_vcpu *vcpu,
kvm_init_shadow_mmu(vcpu, cpu_role);
- context->get_guest_pgd = get_cr3;
+ context->get_guest_pgd = get_guest_cr3;
context->get_pdptr = kvm_pdptr_read;
context->inject_page_fault = kvm_inject_page_fault;
}
@@ -5311,7 +5320,7 @@ static void init_kvm_nested_mmu(struct kvm_vcpu *vcpu,
return;
g_context->cpu_role.as_u64 = new_mode.as_u64;
- g_context->get_guest_pgd = get_cr3;
+ g_context->get_guest_pgd = get_guest_cr3;
g_context->get_pdptr = kvm_pdptr_read;
g_context->inject_page_fault = kvm_inject_page_fault;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e5662dbd519c..78448fb84bd6 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -324,7 +324,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
trace_kvm_mmu_pagetable_walk(addr, access);
retry_walk:
walker->level = mmu->cpu_role.base.level;
- pte = mmu->get_guest_pgd(vcpu);
+ pte = kvm_mmu_get_guest_pgd(vcpu, mmu);
have_ad = PT_HAVE_ACCESSED_DIRTY(mmu);
#if PTTYPE == 64
--
2.39.1
Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
retpoline from vmx.c exit handlers") and avoid a retpoline call for
control register accesses as well.
This speeds up guests that make heavy use of it, like grsecurity
kernels toggling CR0.WP to implement kernel W^X.
Signed-off-by: Mathias Krause <[email protected]>
---
Meanwhile I got my hands on an AMD system and while doing a similar change
for SVM gives a small measurable win (1.1% faster for grsecurity guests),
it would provide nothing for other guests, as the change I was testing was
specifically targeting CR0-caused exits.
A more general approach would instead cover CR3 and, maybe, CR4 as well.
However, that would require a lot more exit code compares, likely
negating the gains in the general case. So this tweak is VMX only.
arch/x86/kvm/vmx/vmx.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c788aa382611..c8198c8a9b55 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
return handle_external_interrupt(vcpu);
else if (exit_reason.basic == EXIT_REASON_HLT)
return kvm_emulate_halt(vcpu);
+ else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
+ return handle_cr(vcpu);
else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
return handle_ept_misconfig(vcpu);
#endif
--
2.39.1
Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
want to know the state of certain bits instead of the whole register.
This not only makes the intend cleaner, it also avoids a VMREAD in case
the tested bits aren't guest owned.
Signed-off-by: Mathias Krause <[email protected]>
---
arch/x86/kvm/pmu.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d939d3b84e6f..d9922277df67 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
if (!pmc)
return 1;
- if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
+ if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
(static_call(kvm_x86_get_cpl)(vcpu) != 0) &&
- (kvm_read_cr0(vcpu) & X86_CR0_PE))
+ (kvm_read_cr0_bits(vcpu, X86_CR0_PE)))
return 1;
*data = pmc_read_counter(pmc) & mask;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c8198c8a9b55..d3b49e0b6c32 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5487,7 +5487,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
break;
case 3: /* lmsw */
val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f;
- trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val);
+ trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val));
kvm_lmsw(vcpu, val);
return kvm_skip_emulated_instruction(vcpu);
@@ -7547,7 +7547,7 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
- if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
+ if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
cache = MTRR_TYPE_WRBACK;
else
--
2.39.1
There is no need to unload the MMU roots for a direct MMU role when only
CR0.WP has changed -- the paging structures are still valid, only the
permission bitmap needs to be updated.
One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
implement kernel W^X.
The optimization brings a huge performance gain for this case as the
following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
grsecurity L1 VM shows (runtime in seconds, lower is better):
                      legacy     TDP  shadow
kvm.git/queue         11.55s  13.91s   75.2s
kvm.git/queue+patch    7.32s   7.31s   74.6s
For legacy MMU this is ~36% faster, for TDP MMU even ~47% faster. Also
TDP and legacy MMU now both have around the same runtime, which eliminates
the need to disable the TDP MMU for grsecurity.
Shadow MMU sees no measurable difference and is still slow, as expected.
[1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
Co-developed-by: Sean Christopherson <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
---
v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
the direct MMU role -- Sean
I re-ran the benchmark and it's even faster than with my patch, as the
critical path is now the first one handled and is now inline. Thanks a
lot for the suggestion, Sean!
arch/x86/kvm/x86.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 508074e47bc0..f09bfc0a3cc1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
{
+ /*
+ * Toggling just CR0.WP doesn't invalidate page tables per se, only the
+ * permission bits.
+ */
+ if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
+ kvm_init_mmu(vcpu);
+ return;
+ }
+
if ((cr0 ^ old_cr0) & X86_CR0_PG) {
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
--
2.39.1
Fix a small comment typo in make_spte().
Signed-off-by: Mathias Krause <[email protected]>
---
arch/x86/kvm/mmu/spte.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fce6f047399f..95441ffdccc8 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -164,7 +164,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
/*
* For simplicity, enforce the NX huge page mitigation even if not
* strictly necessary. KVM could ignore the mitigation if paging is
- * disabled in the guest, as the guest doesn't have an page tables to
+ * disabled in the guest, as the guest doesn't have any page tables to
* abuse. But to safely ignore the mitigation, KVM would have to
* ensure a new MMU is loaded (or all shadow pages zapped) when CR0.PG
* is toggled on, and that's a net negative for performance when TDP is
--
2.39.1
Guests like grsecurity that make heavy use of CR0.WP to implement kernel
level W^X will suffer from the implied VMEXITs.
For a direct MMU role there is no need to intercept a guest change of
CR0.WP, so simply make it a guest owned bit if we can do so.
This implies that a read of a guest's CR0.WP bit might need a VMREAD.
However, the only potentially affected user seems to be kvm_init_mmu(),
which is a heavy operation to begin with. Moreover, most callers already
cache the full value of CR0 anyway, so no additional VMREAD is needed.
The only exception is nested_vmx_load_cr3().
Add a new module parameter 'lazycr0' to allow users to revert back to
the old behaviour by loading kvm-intel.ko with 'lazycr0=0'.
This change is VMX-specific, as SVM has no such fine-grained control
register intercept control.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Mathias Krause <[email protected]>
---
Initially I wanted to implement the scheme Sean sketched[1]: having a
threshold where we would switch from eager to lazy CR0.WP tracking after
toggling the bit often enough, make the bit guest owned afterwards and
VMREAD CR0 when needed. However, when starting to look for users that
would be affected, I only found kvm_init_mmu() (via kvm_init_mmu() ->
vcpu_to_role_regs() -> kvm_read_cr0_bits(KVM_MMU_CR0_ROLE_BITS)). It has
only these three interesting callers:
1/ kvm_mmu_reset_context(), which isn't all that interesting, as it's a
heavy weight operation anyway and many of the control flows leading
to it already cache the value of CR0, so no additional VMREAD is
needed,
2/ nested_vmx_load_cr3() and
3/ kvm_post_set_cr0(), only when CR0.WP was toggled and the MMU is in
direct mode (optimization introduced by patch 3).
The last case's most interesting caller is likely kvm_set_cr0(), which
already caches the written CR0 value, thereby eliminating the need for
another VMREAD in vcpu_to_role_regs().
That's why I went with the much simpler approach and always allow CR0.WP
to be guest owned if EPT is enabled as well.
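For completeness, the mechanism behind "guest owned" here is the VMCS CR0
guest/host mask: bits set in CR0_GUEST_HOST_MASK stay host owned and guest
attempts to change them still trap, while bits cleared from the mask are
read from and written to the guest's CR0 directly by hardware. With this
patch, init_vmcs() effectively does (excerpt of the hunk below, comment
mine):

  /* CR0.TS and, if lazycr0 is enabled, CR0.WP no longer exit on writes. */
  vmx->vcpu.arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
  vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);

so a guest toggling CR0.WP no longer causes a VMEXIT at all, it merely
means a later kvm_read_cr0_bits(vcpu, X86_CR0_WP) might need a VMREAD.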
There's nothing we can do for SVM, though :/
[1] https://lore.kernel.org/kvm/[email protected]/
---
arch/x86/kvm/kvm_cache_regs.h | 3 ++-
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/nested.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 9 ++++++++-
arch/x86/kvm/vmx/vmx.h | 8 ++++++++
5 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index c09174f73a34..495ae0204933 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,8 @@
#include <linux/kvm_host.h>
-#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
+#define KVM_LAZY_CR0_GUEST_BITS X86_CR0_WP
+#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | KVM_LAZY_CR0_GUEST_BITS)
#define KVM_POSSIBLE_CR4_GUEST_BITS \
(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \
| X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 45162c1bcd8f..41d48a3a651e 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -12,6 +12,7 @@
extern bool __read_mostly enable_vpid;
extern bool __read_mostly flexpriority_enabled;
extern bool __read_mostly enable_ept;
+extern bool __read_mostly enable_lazy_cr0;
extern bool __read_mostly enable_unrestricted_guest;
extern bool __read_mostly enable_ept_ad_bits;
extern bool __read_mostly enable_pml;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 557b9c468734..2a0010ca7277 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4478,7 +4478,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
* CR0_GUEST_HOST_MASK is already set in the original vmcs01
* (KVM doesn't change it);
*/
- vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vcpu->arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
vmx_set_cr0(vcpu, vmcs12->host_cr0);
/* Same as above - no reason to call set_cr4_guest_host_mask(). */
@@ -4629,7 +4629,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
*/
vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
- vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vcpu->arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d3b49e0b6c32..1969360d2744 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -91,6 +91,9 @@ module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO);
bool __read_mostly enable_ept = 1;
module_param_named(ept, enable_ept, bool, S_IRUGO);
+bool __read_mostly enable_lazy_cr0 = 1;
+module_param_named(lazycr0, enable_lazy_cr0, bool, S_IRUGO);
+
bool __read_mostly enable_unrestricted_guest = 1;
module_param_named(unrestricted_guest,
enable_unrestricted_guest, bool, S_IRUGO);
@@ -4765,7 +4768,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
/* 22.2.1, 20.8.1 */
vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
- vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vmx->vcpu.arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
set_cr4_guest_host_mask(vmx);
@@ -8370,6 +8373,10 @@ static __init int hardware_setup(void)
return -EOPNOTSUPP;
}
+ /* Need EPT for lazy CR0.WP synchronization. */
+ if (!enable_ept)
+ enable_lazy_cr0 = 0;
+
if (!cpu_has_vmx_ept_ad_bits() || !enable_ept)
enable_ept_ad_bits = 0;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index a3da84f4ea45..e899c2291a3f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -640,6 +640,14 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
(1 << VCPU_EXREG_EXIT_INFO_1) | \
(1 << VCPU_EXREG_EXIT_INFO_2))
+static inline unsigned long vmx_guest_owned_cr0_bits(void)
+{
+ unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ if (!enable_lazy_cr0)
+ bits &= ~KVM_LAZY_CR0_GUEST_BITS;
+ return bits;
+}
+
static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
{
return container_of(kvm, struct kvm_vmx, kvm);
--
2.39.1
On Wed, 1 Feb 2023 20:46:02 +0100
Mathias Krause <[email protected]> wrote:
> Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
> want to know the state of certain bits instead of the whole register.
>
> This not only makes the intend cleaner, it also avoids a VMREAD in case
> the tested bits aren't guest owned.
^
The patch comment is a little confusing. Not sure if I misunderstood here:
Check the code of kvm_read_cr0_bits
static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
{
ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
!kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);
return vcpu->arch.cr0 & mask;
}
I suppose the condition that can avoid a VMREAD is to avoid the vmread in
static_call(kvm_x86_cache_reg):
The conditions that do not trigger a vmread are:
1) The test bits are guest_owned_bits and cache register is available.
2) The test bits are *not* guest_owned bits.
I agree that this makes the intent cleaner, but I'm not sure the latter
statement in the patch comment is true. If the tested bits are not guest
owned, it will not reach static_call(kvm_x86_cache_reg).
>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> arch/x86/kvm/pmu.c | 4 ++--
> arch/x86/kvm/vmx/vmx.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index d939d3b84e6f..d9922277df67 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
> if (!pmc)
> return 1;
>
> - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
> + if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
> (static_call(kvm_x86_get_cpl)(vcpu) != 0) &&
> - (kvm_read_cr0(vcpu) & X86_CR0_PE))
> + (kvm_read_cr0_bits(vcpu, X86_CR0_PE)))
> return 1;
>
> *data = pmc_read_counter(pmc) & mask;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c8198c8a9b55..d3b49e0b6c32 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -5487,7 +5487,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
> break;
> case 3: /* lmsw */
> val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f;
> - trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val);
> + trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val));
> kvm_lmsw(vcpu, val);
>
> return kvm_skip_emulated_instruction(vcpu);
> @@ -7547,7 +7547,7 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>
> - if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> + if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> cache = MTRR_TYPE_WRBACK;
> else
On Wed, 1 Feb 2023 20:46:01 +0100
Mathias Krause <[email protected]> wrote:
> There is no need to unload the MMU roots for a direct MMU role when only
> CR0.WP has changed -- the paging structures are still valid, only the
> permission bitmap needs to be updated.
>
> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
> implement kernel W^X.
>
Wouldn't it be better to factor out update_permission_bitmask and
update_pkru_bitmask into a common function and call it from here? That way
we could also skip a bunch of if..else blocks and the recalculation of the
rsvd mask and shadow_zero_bit masks.
I suppose this is a critical path according to the patch comments and
kvm_init_mmu() is a non-critical path. Is it better to separate
them now to save maintenance effort in the future? E.g. something heavier
might be introduced into the kvm_init_mmu() path and slow down this path.
> The optimization brings a huge performance gain for this case as the
> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
> grsecurity L1 VM shows (runtime in seconds, lower is better):
>
> legacy TDP shadow
> kvm.git/queue 11.55s 13.91s 75.2s
> kvm.git/queue+patch 7.32s 7.31s 74.6s
>
> For legacy MMU this is ~36% faster, for TTP MMU even ~47% faster. Also
> TDP and legacy MMU now both have around the same runtime which vanishes
> the need to disable TDP MMU for grsecurity.
>
> Shadow MMU sees no measurable difference and is still slow, as expected.
>
> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>
> Co-developed-by: Sean Christopherson <[email protected]>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
> the direct MMU role -- Sean
>
> I re-ran the benchmark and it's even faster than with my patch, as the
> critical path is now the first one handled and is now inline. Thanks a
> lot for the suggestion, Sean!
>
> arch/x86/kvm/x86.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 508074e47bc0..f09bfc0a3cc1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>
> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
> {
> + /*
> + * Toggling just CR0.WP doesn't invalidate page tables per se, only the
> + * permission bits.
> + */
> + if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
> + kvm_init_mmu(vcpu);
> + return;
> + }
> +
> if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> kvm_clear_async_pf_completion_queue(vcpu);
> kvm_async_pf_hash_reset(vcpu);
On 07.02.23 14:05, Zhi Wang wrote:
> On Wed, 1 Feb 2023 20:46:02 +0100
> Mathias Krause <[email protected]> wrote:
>
>> Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
>> want to know the state of certain bits instead of the whole register.
>>
>> This not only makes the intend cleaner, it also avoids a VMREAD in case
~~~~~~
Oh, this should have been "intent". Will fix in v4, if there's a need for one.
>> the tested bits aren't guest owned.
> ^
> The patch comment is a little confusing. Not sure if I misunderstood here:
Sorry, let's try to clarify.
> Check the code of kvm_read_cr0_bits
>
> static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
> {
> ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
> if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
> !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
> static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);
> return vcpu->arch.cr0 & mask;
> }
>
> I suppose the conditions that can avoids a VMREAD is to avoid the vmread in
> static_call(kvm_x86_cache_reg):
Correct, that's what this patch is trying to do: It tries to avoid the
static_call(kvm_x86_cache_reg)(...) by making the compiler aware of the
actually used bits in 'mask'. If those don't intersect with the guest
owned bits, the first part of the condition won't be true and we can
simply make use of 'vcpu->arch.cr0'.
Maybe it gets clearer when looking at kvm_read_cr0() too which is just this:
static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
{
return kvm_read_cr0_bits(vcpu, ~0UL);
}
So the 'mask' passed to kvm_read_cr0_bits() will always include all
(possible) guest owned bits (KVM_POSSIBLE_CR0_GUEST_BITS & ~0UL ==
KVM_POSSIBLE_CR0_GUEST_BITS) and the compiler cannot do the optimization
mentioned above.
If we, however, use kvm_read_cr0_bits(..., MASK) directly instead of
using kvm_read_cr0() & MASK, it can, e.g. for all bits not in
KVM_POSSIBLE_CR0_GUEST_BITS & vcpu->arch.cr0_guest_owned_bits.
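Spelled out for one of the call sites this patch touches (manual expansion
on my side, using the kvm_read_cr0_bits() body quoted above):

  kvm_read_cr0_bits(vcpu, X86_CR0_PE)
    -> tmask = X86_CR0_PE & KVM_POSSIBLE_CR0_GUEST_BITS;    /* == 0 */
    -> if (0 && !kvm_register_is_available(...))            /* dead code */
               static_call(kvm_x86_cache_reg)(...);         /* never emitted */
    -> return vcpu->arch.cr0 & X86_CR0_PE;                  /* no VMREAD */

  kvm_read_cr0(vcpu) & X86_CR0_PE      /* == kvm_read_cr0_bits(vcpu, ~0UL) */
    -> tmask = ~0UL & KVM_POSSIBLE_CR0_GUEST_BITS;           /* != 0 */
    -> the static_call() branch stays and may end up doing the VMREAD
       before the result gets masked with X86_CR0_PE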
> Conditions are not triggering vmread:
>
> 1) The test bits are guest_owned_bits and cache register is available.
> 2) The test bits are *not* guest_owned bits.
For case 1 the patch would make only a minor difference, by concluding
earlier that it can simply make use of vcpu->arch.cr0. But it's case 2
I'm after.
If you look up KVM_POSSIBLE_CR0_GUEST_BITS, which is the upper bound for
guest owned CR0 bits, you'll find before patch 6:
#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
and after patch 6:
#define KVM_LAZY_CR0_GUEST_BITS X86_CR0_WP
#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS|KVM_LAZY_CR0_GUEST_BITS)
So the upper bound would be 'X86_CR0_TS|X86_CR0_WP'. Every bit outside
that set can directly be read from the 'vcpu' cached register value and
that's (mostly) the case for the users this patch is changing, see below.
> I agree that this makes the intend cleaner, but not sure the later statement
> is true in the patch comment. If the test bits are not guest owned, it will
> not reach static_call(kvm_x86_cache_reg).
Correct, but that's no different from what I'm saying. My description
just treats 'static_call(kvm_x86_cache_reg)' as mentally equivalent to
VMREAD, which abstracts the static_call quite well, IMHO. But maybe I
should clarify that 'tested bits' means the bits used by the changed call
site? Though, I think that's rather obvious from the change itself. I can
factor in the caching aspect, though.
Maybe something like this?:
This not only makes the intent cleaner, it also avoids a potential
VMREAD in case the tested bits aren't guest owned.
I've added "potential" but left the remainder as is.
>>
>> Signed-off-by: Mathias Krause <[email protected]>
>> ---
>> arch/x86/kvm/pmu.c | 4 ++--
>> arch/x86/kvm/vmx/vmx.c | 4 ++--
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index d939d3b84e6f..d9922277df67 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
>> if (!pmc)
>> return 1;
>>
>> - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
>> + if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
X86_CR4_PCE & KVM_POSSIBLE_CR4_GUEST_BITS == X86_CR4_PCE, therefore can
only be optimized if X86_CR4_PCE would be dropped from
'vcpu->arch.cr4_guest_owned_bits' as well. But AFAICS we don't do that.
So here you're right that this only clears up the intent, not the actual
behavior at runtime.
>> (static_call(kvm_x86_get_cpl)(vcpu) != 0) &&
>> - (kvm_read_cr0(vcpu) & X86_CR0_PE))
>> + (kvm_read_cr0_bits(vcpu, X86_CR0_PE)))
X86_CR0_PE & KVM_POSSIBLE_CR0_GUEST_BITS == 0, therefore this can be
optimized.
>> return 1;
>>
>> *data = pmc_read_counter(pmc) & mask;
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index c8198c8a9b55..d3b49e0b6c32 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -5487,7 +5487,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
>> break;
>> case 3: /* lmsw */
>> val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f;
>> - trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val);
>> + trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val));
~0xful & KVM_POSSIBLE_CR0_GUEST_BITS is 0 prior to patch 6 and
X86_CR0_WP afterwards, therefore this might be optimized, depending on
the runtime setting of 'enable_lazy_cr0', possibly capping the guest
owned CR0 bits to exclude X86_CR0_WP again.
>> kvm_lmsw(vcpu, val);
>>
>> return kvm_skip_emulated_instruction(vcpu);
>> @@ -7547,7 +7547,7 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>> if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>>
>> - if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
>> + if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
X86_CR0_CD & KVM_POSSIBLE_CR0_GUEST_BITS == 0, therefore this can be
optimized as well.
>> if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>> cache = MTRR_TYPE_WRBACK;
>> else
>
Thanks,
Mathias
On 07.02.23 14:36, Zhi Wang wrote:
> On Wed, 1 Feb 2023 20:46:01 +0100
> Mathias Krause <[email protected]> wrote:
>
>> There is no need to unload the MMU roots for a direct MMU role when only
>> CR0.WP has changed -- the paging structures are still valid, only the
>> permission bitmap needs to be updated.
>>
>> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
>> implement kernel W^X.
>>
>
> Wouldn't it be better to factor out update_permission_bitmask and
> update_pkru_bitmask in a common function and call it from here? So that
> we can also skip: bunches of if..else..., recalculation of the rsvd mask
> and shadow_zero_bit masks.
Probably, yes. But I dislike the fact that this would imply that we know
about the details of how kvm_init_mmu() and, moreover, init_kvm_tdp_mmu()
are implemented, and I'd rather avoid that so as not to introduce bugs
or regressions via future code changes in either one of them. By calling
out to kvm_init_mmu() we avoid that implicitly, as a future change, for
sure, must check all callers and would find this location. If we instead
simply extract the (as of now) required bits, that might go unnoticed.
That said, I agree that there's still room for improvements.
> I suppose this is a critical path according to the patch comments and
> kvm_init_mmu() is a non-critical path. Is it better to seperate
> them now for saving the maintanence efforts in future? E.g. something heavier
> might be introduced into the kvm_init_mmu() path and slows down this path.
I'll look into what can be done about it. But this change is a first
step that can be further optimized via follow up changes.
As you can see from the numbers below, it's already way faster than what
we have right now, so I'd rather land this (imperfect) change sooner
than later and gradually improve on it. Such follow-ups will, however,
likely only bring minor speedups compared to this change, so they're less
important, IMHO.
The question is really what's better from a maintenance point of view:
Keeping the call to the commonly used kvm_init_mmu() function or
special-casing even further? I fear the latter might regress more easily,
but YMMV, of course.
>
>> The optimization brings a huge performance gain for this case as the
>> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
>> grsecurity L1 VM shows (runtime in seconds, lower is better):
>>
>> legacy TDP shadow
>> kvm.git/queue 11.55s 13.91s 75.2s
>> kvm.git/queue+patch 7.32s 7.31s 74.6s
>>
>> For legacy MMU this is ~36% faster, for TTP MMU even ~47% faster. Also
>> TDP and legacy MMU now both have around the same runtime which vanishes
>> the need to disable TDP MMU for grsecurity.
>>
>> Shadow MMU sees no measurable difference and is still slow, as expected.
>>
>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Mathias Krause <[email protected]>
>> ---
>> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
>> the direct MMU role -- Sean
>>
>> I re-ran the benchmark and it's even faster than with my patch, as the
>> critical path is now the first one handled and is now inline. Thanks a
>> lot for the suggestion, Sean!
>>
>> arch/x86/kvm/x86.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 508074e47bc0..f09bfc0a3cc1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>>
>> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>> {
>> + /*
>> + * Toggling just CR0.WP doesn't invalidate page tables per se, only the
>> + * permission bits.
>> + */
>> + if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
>> + kvm_init_mmu(vcpu);
>> + return;
>> + }
>> +
>> if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>> kvm_clear_async_pf_completion_queue(vcpu);
>> kvm_async_pf_hash_reset(vcpu);
>
On Wed, 8 Feb 2023 10:11:30 +0100
Mathias Krause <[email protected]> wrote:
> On 07.02.23 14:05, Zhi Wang wrote:
> > On Wed, 1 Feb 2023 20:46:02 +0100
> > Mathias Krause <[email protected]> wrote:
> >
> >> Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
> >> want to know the state of certain bits instead of the whole register.
> >>
> >> This not only makes the intend cleaner, it also avoids a VMREAD in case
> ~~~~~~
> Oh, this should have been "intent". Will fix in v4, if there's a need for.
>
> >> the tested bits aren't guest owned.
> > ^
> > The patch comment is a little confusing. Not sure if I misunderstood here:
>
> Sorry, lets try to clarify.
>
> > Check the code of kvm_read_cr0_bits
> >
> > static inline ulong kvm_read_cr0_bits(struct kvm_vcpu *vcpu, ulong mask)
> > {
> > ulong tmask = mask & KVM_POSSIBLE_CR0_GUEST_BITS;
> > if ((tmask & vcpu->arch.cr0_guest_owned_bits) &&
> > !kvm_register_is_available(vcpu, VCPU_EXREG_CR0))
> > static_call(kvm_x86_cache_reg)(vcpu, VCPU_EXREG_CR0);
> > return vcpu->arch.cr0 & mask;
> > }
> >
> > I suppose the conditions that can avoids a VMREAD is to avoid the vmread in
> > static_call(kvm_x86_cache_reg):
>
> Correct, that's what this patch is trying to do: It tries to avoid the
> static_call(kvm_x86_cache_reg)(...) by making the compiler aware of the
> actually used bits in 'mask'. If those don't intersect with the guest
> owned bits, the first part of the condition wont be true and we simply
> can make use of 'vcpu->arch.cr0'.
>
> Maybe it gets clearer when looking at kvm_read_cr0() too which is just this:
>
> static inline ulong kvm_read_cr0(struct kvm_vcpu *vcpu)
> {
> return kvm_read_cr0_bits(vcpu, ~0UL);
> }
>
> So the 'mask' passed to kvm_read_cr0_bits() will always include all
> (possible) guest owned bits (KVM_POSSIBLE_CR0_GUEST_BITS & ~0UL ==
> KVM_POSSIBLE_CR0_GUEST_BITS) and the compiler cannot do the optimization
> mentioned above.
>
> If we, however, use kvm_read_cr0_bits(..., MASK) directly instead of
> using kvm_read_cr0() & MASK, it can, like for all bits not in
> KVM_POSSIBLE_CR0_GUEST_BITS & vcpu->arch.cr0_guest_owned_bits.
>
> > Conditions are not triggering vmread:
> >
> > 1) The test bits are guest_owned_bits and cache register is available.
> > 2) The test bits are *not* guest_owned bits.
>
> For case 1 the patch would make only a minor difference, by concluding
> earlier that it can simply make use of vcpu->arch.cr0. But it's case 2
> I'm after.
>
Thanks for the explanation. Now I got it.
> If you look up KVM_POSSIBLE_CR0_GUEST_BITS, which is the upper bound for
> guest owned CR0 bits, you'll find before patch 6:
>
> #define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
>
> and after patch 6:
>
> #define KVM_LAZY_CR0_GUEST_BITS X86_CR0_WP
> #define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS|KVM_LAZY_CR0_GUEST_BITS)
>
> So the upper bound would be 'X86_CR0_TS|X86_CR0_WP'. Every bit outside
> that set can directly be read from the 'vcpu' cached register value and
> that's (mostly) the case for the users this patch is changing, see below.
>
> > I agree that this makes the intend cleaner, but not sure the later statement
> > is true in the patch comment. If the test bits are not guest owned, it will
> > not reach static_call(kvm_x86_cache_reg).
>
> Correct, but that's no different from what I'm saying. My description
> just set 'static_call(kvm_x86_cache_reg)' mentally equivalent to VMREAD,
> which abstracts the static_call quite well, IMHO. But maybe I should
> clarify that 'tested bits' means the bits used by the changed call side?
> Though, I think that's rather obvious from the change itself. I can
> factor in the caching aspect, though.
>
> Maybe something like this?:
>
> This not only makes the intent cleaner, it also avoids a potential
> VMREAD in case the tested bits aren't guest owned.
>
> I've added "potential" but left the remainder as is.
>
> >>
> >> Signed-off-by: Mathias Krause <[email protected]>
> >> ---
> >> arch/x86/kvm/pmu.c | 4 ++--
> >> arch/x86/kvm/vmx/vmx.c | 4 ++--
> >> 2 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> >> index d939d3b84e6f..d9922277df67 100644
> >> --- a/arch/x86/kvm/pmu.c
> >> +++ b/arch/x86/kvm/pmu.c
> >> @@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
> >> if (!pmc)
> >> return 1;
> >>
> >> - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
> >> + if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
>
> X86_CR4_PCE & KVM_POSSIBLE_CR4_GUEST_BITS == X86_CR4_PCE, therefore can
> only be optimized if X86_CR4_PCE would be dropped from
> 'vcpu->arch.cr4_guest_owned_bits' as well. But AFAICS we don't do that.
> So here you're right that this only clears up the intent, not the actual
> behavior at runtime.
>
> >> (static_call(kvm_x86_get_cpl)(vcpu) != 0) &&
> >> - (kvm_read_cr0(vcpu) & X86_CR0_PE))
> >> + (kvm_read_cr0_bits(vcpu, X86_CR0_PE)))
>
> X86_CR0_PE & KVM_POSSIBLE_CR0_GUEST_BITS == 0, therefore this can be
> optimized.
>
> >> return 1;
> >>
> >> *data = pmc_read_counter(pmc) & mask;
> >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >> index c8198c8a9b55..d3b49e0b6c32 100644
> >> --- a/arch/x86/kvm/vmx/vmx.c
> >> +++ b/arch/x86/kvm/vmx/vmx.c
> >> @@ -5487,7 +5487,7 @@ static int handle_cr(struct kvm_vcpu *vcpu)
> >> break;
> >> case 3: /* lmsw */
> >> val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f;
> >> - trace_kvm_cr_write(0, (kvm_read_cr0(vcpu) & ~0xful) | val);
> >> + trace_kvm_cr_write(0, (kvm_read_cr0_bits(vcpu, ~0xful) | val));
>
> ~0xful & KVM_POSSIBLE_CR0_GUEST_BITS is 0 prior to patch 6 and
> X86_CR0_WP afterwards, therefore this might be optimized, depending on
> the runtime setting of 'enable_lazy_cr0', possibly capping the guest
> owned CR0 bits to exclude X86_CR0_WP again.
>
> >> kvm_lmsw(vcpu, val);
> >>
> >> return kvm_skip_emulated_instruction(vcpu);
> >> @@ -7547,7 +7547,7 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >> if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >> return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >>
> >> - if (kvm_read_cr0(vcpu) & X86_CR0_CD) {
> >> + if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
>
> X86_CR0_CD & KVM_POSSIBLE_CR0_GUEST_BITS == 0, therefore this can be
> optimized as well.
>
> >> if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >> cache = MTRR_TYPE_WRBACK;
> >> else
> >
>
> Thanks,
> Mathias
On 01.02.23 20:45, Mathias Krause wrote:
> v2: https://lore.kernel.org/kvm/[email protected]/
>
> This series is a resurrection of the missing pieces of Paolo's previous
> attempt[1] to avoid needless MMU roots unloading. The performance gap
> between TDP and legacy MMU is still existent, especially noticeable under
> grsecurity which implements kernel W^X by toggling CR0.WP, which happens
> very frequently.
>
> Patches 1-13 and 17 of the old series had been merged, but, unfortunately,
> the remaining parts never saw a v3. I therefore took care of these, took
> Sean's feedback into account[2] and simplified the whole approach to just
> handle the case we care most about explicitly.
>
> Patch 1 is a v3 of [3], addressing Sean's feedback.
>
> Patch 2 is specifically useful for grsecurity, as handle_cr() is by far
> *the* top vmexit reason.
>
> Patch 3 is the most important one, as it skips unloading the MMU roots for
> CR0.WP toggling.
>
> Sean was suggesting another change on top of v2 of this series, to skip
> intercepting CR0.WP writes completely for VMX[4]. That turned out to be
> yet another performance boost and is implemenmted in patch 6.
>
> While patches 1 and 2 bring small performance improvements already, the
> big gains come from patches 3 and 6.
>
> I used 'ssdd 10 50000' from rt-tests[5] as a micro-benchmark, running on a
> grsecurity L1 VM. Below table shows the results (runtime in seconds, lower
> is better):
>
> legacy TDP shadow
> kvm.git/queue 11.55s 13.91s 75.2s
> + patches 1-3 7.32s 7.31s 74.6s
> + patches 4-6 4.89s 4.89s 73.4s
>
> This series builds on top of kvm.git/queue, namely commit de60733246ff
> ("Merge branch 'kvm-hw-enable-refactor' into HEAD").
>
> Patches 1-3 didn't change from v2, beside minor changlog mangling.
>
> Patches 4-6 are new to v3.
>
> Thanks,
> Mathias
>
> [1] https://lore.kernel.org/kvm/[email protected]/
> [2] https://lore.kernel.org/kvm/YhATewkkO%[email protected]/
> [3] https://lore.kernel.org/kvm/YhAB1d1%[email protected]/
> [4] https://lore.kernel.org/kvm/[email protected]/
> [5] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>
> Mathias Krause (5):
> KVM: VMX: Avoid retpoline call for control register caused exits
> KVM: x86: Do not unload MMU roots when only toggling CR0.WP
> KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
> KVM: x86/mmu: Fix comment typo
> KVM: VMX: Make CR0.WP a guest owned bit
>
> Paolo Bonzini (1):
> KVM: x86/mmu: Avoid indirect call for get_cr3
>
> arch/x86/kvm/kvm_cache_regs.h | 3 ++-
> arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++-----------
> arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> arch/x86/kvm/mmu/spte.c | 2 +-
> arch/x86/kvm/pmu.c | 4 ++--
> arch/x86/kvm/vmx/capabilities.h | 1 +
> arch/x86/kvm/vmx/nested.c | 4 ++--
> arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
> arch/x86/kvm/vmx/vmx.h | 8 ++++++++
> arch/x86/kvm/x86.c | 9 +++++++++
> 10 files changed, 58 insertions(+), 21 deletions(-)
Ping!
Anything I can do to help getting this series reviewed and hopefully merged?
Thanks,
Mathias
On Mon, Mar 06, 2023, Mathias Krause wrote:
> On 01.02.23 20:45, Mathias Krause wrote:
> > Mathias Krause (5):
> > KVM: VMX: Avoid retpoline call for control register caused exits
> > KVM: x86: Do not unload MMU roots when only toggling CR0.WP
> > KVM: x86: Make use of kvm_read_cr*_bits() when testing bits
> > KVM: x86/mmu: Fix comment typo
> > KVM: VMX: Make CR0.WP a guest owned bit
> >
> > Paolo Bonzini (1):
> > KVM: x86/mmu: Avoid indirect call for get_cr3
> >
> > arch/x86/kvm/kvm_cache_regs.h | 3 ++-
> > arch/x86/kvm/mmu/mmu.c | 31 ++++++++++++++++++++-----------
> > arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
> > arch/x86/kvm/mmu/spte.c | 2 +-
> > arch/x86/kvm/pmu.c | 4 ++--
> > arch/x86/kvm/vmx/capabilities.h | 1 +
> > arch/x86/kvm/vmx/nested.c | 4 ++--
> > arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++---
> > arch/x86/kvm/vmx/vmx.h | 8 ++++++++
> > arch/x86/kvm/x86.c | 9 +++++++++
> > 10 files changed, 58 insertions(+), 21 deletions(-)
>
> Ping!
>
> Anything I can do to help getting this series reviewed and hopefully merged?
I'm slowly getting there...
https://lore.kernel.org/kvm/Y%[email protected]
On Wed, Feb 08, 2023, Mathias Krause wrote:
> On 07.02.23 14:36, Zhi Wang wrote:
> > On Wed, 1 Feb 2023 20:46:01 +0100
> > Mathias Krause <[email protected]> wrote:
> > I suppose this is a critical path according to the patch comments and
> > kvm_init_mmu() is a non-critical path. Is it better to seperate
> > them now for saving the maintanence efforts in future? E.g. something heavier
> > might be introduced into the kvm_init_mmu() path and slows down this path.
>
> I'll look into what can be done about it. But this change is a first
> step that can be further optimized via follow up changes.
>
> As you can see from the numbers below, it's already way faster that what
> we have right now, so I'd rather land this (imperfect) change sooner
> than later and gradually improve on it. This will, however, likely only
> bring minor speedups compared to this change, so they're less important,
> IMHO.
>
> The question is really what's better from a maintenance point of view:
> Keeping the call to the commonly used kvm_init_mmu() function or special
> case even further? I fear the latter might regress easier, but YMMV, of
> course.
Agreed. Unless the performance benefits of getting super precise are significant,
I would much rather keep things simpler and reduce the risk of introducing bugs.
Bugs in this area in particular have a nasty habit of being really good at hiding.
On Wed, Feb 01, 2023, Mathias Krause wrote:
> Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
> retpoline from vmx.c exit handlers") and avoid a retpoline call for
> control register accesses as well.
>
> This speeds up guests that make heavy use of it, like grsecurity
> kernels toggling CR0.WP to implement kernel W^X.
I would rather drop this patch for VMX and instead unconditionally make CR0.WP
guest owned when TDP (EPT) is enabled, i.e. drop the module param from patch 6.
> Signed-off-by: Mathias Krause <[email protected]>
> ---
>
> Meanwhile I got my hands on a AMD system and while doing a similar change
> for SVM gives a small measurable win (1.1% faster for grsecurity guests),
Mostly out of curiosity...
Is the 1.1% roughly aligned with the gains for VMX? If VMX sees a significantly
larger improvement, any idea why SVM doesn't benefit as much? E.g. did you double
check that the kernel was actually using RETPOLINE?
> it would provide nothing for other guests, as the change I was testing was
> specifically targeting CR0 caused exits.
>
> A more general approach would instead cover CR3 and, maybe, CR4 as well.
> However, that would require a lot more exit code compares, likely
> vanishing the gains in the general case. So this tweak is VMX only.
I don't think targeting CR0 exits is a reason to not do this for SVM. With
NPT enabled, CR3 isn't intercepted, and CR4 exits should be very rare. If the
performance benefits are marginal (I don't have a good frame of reference for the
1.1%), then _that's_ a good reason to leave SVM alone. But not giving CR3 and CR4
priority is a non-issue.
> arch/x86/kvm/vmx/vmx.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index c788aa382611..c8198c8a9b55 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> return handle_external_interrupt(vcpu);
> else if (exit_reason.basic == EXIT_REASON_HLT)
> return kvm_emulate_halt(vcpu);
> + else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
> + return handle_cr(vcpu);
> else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
> return handle_ept_misconfig(vcpu);
> #endif
> --
> 2.39.1
>
Can you tweak the shortlog to something like this? I want to make it very clear
that this applies only to the TDP case (more below). I did a double take when I
first read the subject :-)
KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
On Wed, Feb 01, 2023, Mathias Krause wrote:
> There is no need to unload the MMU roots for a direct MMU role when only
> CR0.WP has changed -- the paging structures are still valid, only the
> permission bitmap needs to be updated.
>
> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
> implement kernel W^X.
>
> The optimization brings a huge performance gain for this case as the
> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
> grsecurity L1 VM shows (runtime in seconds, lower is better):
>
> legacy TDP shadow
> kvm.git/queue 11.55s 13.91s 75.2s
> kvm.git/queue+patch 7.32s 7.31s 74.6s
>
> For legacy MMU this is ~36% faster, for TTP MMU even ~47% faster. Also
> TDP and legacy MMU now both have around the same runtime which vanishes
> the need to disable TDP MMU for grsecurity.
>
> Shadow MMU sees no measurable difference and is still slow, as expected.
>
> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>
> Co-developed-by: Sean Christopherson <[email protected]>
No need for this, I just threw a snippet at you as part of code review. And IMO,
if someone goes through the pain of running benchmarks, they get full credit no
matter what ;-)
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
> the direct MMU role -- Sean
>
> I re-ran the benchmark and it's even faster than with my patch, as the
> critical path is now the first one handled and is now inline. Thanks a
> lot for the suggestion, Sean!
>
> arch/x86/kvm/x86.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 508074e47bc0..f09bfc0a3cc1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>
> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
> {
> + /*
> + * Toggling just CR0.WP doesn't invalidate page tables per se, only the
> + * permission bits.
> + */
> + if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
Past me was wrong, which is a very good thing in this case. Per the APM,
The host hCR0.WP bit is ignored under nested paging.
which means that CR0.WP doesn't need to be incorporated into the role. Ha! And
really-past me even wrote a very nice comment to call that out in commit 31e96bc63655
("KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU state").
Double ha! That's all moot, because if this code is reached for a nested MMU,
it means L2 is active and the CR0 being changed is gCR0, not L1's hCR0.
So more simply, this can be
if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP)
or if we want to exempt non-paging mode for the shadow MMU as well...
if ((cr0 ^ old_cr0) == X86_CR0_WP && (tdp_enabled || !(cr0 & X86_CR0_PG)))
Actually, if we bother to check CR0.PG, then we might as well get greedy and
skip _all_ updates when paging is disabled. E.g. end up with this over two
patches? First one exempts the tdp_enabled case, second one completely exempts
paging disabled.
/*
* CR0.WP is incorporated into the MMU role, but only for non-nested,
* indirect shadow MMUs. If paging is disabled, no updates are needed
* as there are no permission bits to emulate. If TDP is enabled, the
* MMU's metadata needs to be updated, e.g. so that emulating guest
* translations does the right thing, but there's no need to unload the
* root as CR0.WP doesn't affect SPTEs when TDP is enabled.
*/
if ((cr0 ^ old_cr0) == X86_CR0_WP) {
if (!(cr0 & X86_CR0_PG))
return;
if (tdp_enabled) {
kvm_init_mmu(vcpu);
return;
}
}
On Wed, Feb 01, 2023, Mathias Krause wrote:
> Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
> want to know the state of certain bits instead of the whole register.
>
> This not only makes the intend cleaner, it also avoids a VMREAD in case
> the tested bits aren't guest owned.
>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
> arch/x86/kvm/pmu.c | 4 ++--
> arch/x86/kvm/vmx/vmx.c | 4 ++--
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index d939d3b84e6f..d9922277df67 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
> if (!pmc)
> return 1;
>
> - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
> + if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
Purely as an FYI, I proposed adding helpers to query single CR0/CR4 bits in a
separate thread[*]. No need to do anything on your end, I'll plan on applying
this patch first and will handle whatever conflicts arise.
[*] https://lore.kernel.org/all/[email protected]
On Wed, Feb 01, 2023, Mathias Krause wrote:
> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
> level W^X will suffer from the implied VMEXITs.
>
> For a direct MMU role there is no need to intercept a guest change of
> CR0.WP, so simply make it a guest owned bit if we can do so.
>
> This implies that a read of a guest's CR0.WP bit might need a VMREAD.
> However, the only potentially affected user seems to be kvm_init_mmu()
> which is a heavy operation to begin with. But also most callers already
> cache the full value of CR0 anyway, so no additional VMREAD is needed.
> The only exception is nested_vmx_load_cr3().
>
> Add a new module parameter 'lazycr0' to allow users to revert back to
> the old behaviour by loading kvm-intel.ko with 'lazycr0=0'.
>
> This change is VMX-specific, as SVM has no such fine grained control
> register intercept control.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Mathias Krause <[email protected]>
> ---
>
> Initially I wanted to implement the scheme Sean sketched[1]: having a
> threshold where we would switch from eager to lazy CR0.WP tracking after
> toggling the bit often enough, make the bit guest owned afterwards and
> VMREAD CR0 when needed. However, when starting to look for users that
> would be affected, I only found kvm_init_mmu() (via kvm_init_mmu() ->
> vcpu_to_role_regs() -> kvm_read_cr0_bits(KVM_MMU_CR0_ROLE_BITS)). It has
> only these three interesting callers:
> 1/ kvm_mmu_reset_context(), which isn't all that interesting, as it's a
> heavy weight operation anyway and many of the control flows leading
> to it already cache the value of CR0, so no additional VMREAD is
> needed,
> 2/ nested_vmx_load_cr3() and
> 3/ kvm_post_set_cr0(), only when CR0.WP was toggled and the MMU is in
> direct mode (optimization introduced by patch 3).
>
> The last case's most interesting caller is likely kvm_set_cr0(), which
> already caches the written CR0 value, thereby vanishes the need for
> another VMREAD in vcpu_to_role_regs().
>
> That's why I went with the much simpler approach and always allow CR0.WP
> to be guest owned if EPT is enabled as well.
Nice!
> There's nothing we can do for SVM, though :/
:/ indeed
> [1] https://lore.kernel.org/kvm/[email protected]/
> ---
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d3b49e0b6c32..1969360d2744 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -91,6 +91,9 @@ module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO);
> bool __read_mostly enable_ept = 1;
> module_param_named(ept, enable_ept, bool, S_IRUGO);
>
> +bool __read_mostly enable_lazy_cr0 = 1;
> +module_param_named(lazycr0, enable_lazy_cr0, bool, S_IRUGO);
Unless someone crawls out of the woodworks to object, let's omit the module param
and make this unconditional. We typically add module params for behavior where
there are legitimate downsides even if KVM is bug free, or for features that are
dependent on hardware. E.g. testing shadow paging without a knob to disable EPT
would require access to really ancient CPUs.
The one exception that comes to mind is force_flush_and_sync_on_reuse, but TLB
bugs tend to be subtle and hard to hit, whereas if we break something with CR0.WP
emulation, the breakage should be immediate and obvious.
> bool __read_mostly enable_unrestricted_guest = 1;
> module_param_named(unrestricted_guest,
> enable_unrestricted_guest, bool, S_IRUGO);
> @@ -4765,7 +4768,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> /* 22.2.1, 20.8.1 */
> vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>
> - vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> + vmx->vcpu.arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
> vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>
> set_cr4_guest_host_mask(vmx);
> @@ -8370,6 +8373,10 @@ static __init int hardware_setup(void)
> return -EOPNOTSUPP;
> }
>
> + /* Need EPT for lazy CR0.WP synchronization. */
> + if (!enable_ept)
> + enable_lazy_cr0 = 0;
Heh, just realized that this code would be broken if nested TDP wasn't exempt
from including CR0.WP in the MMU role. Better to be lucky than good :-)
And similar to kvm_post_set_cr0(), the CR0.PG=0 case _could_ let
CR0.WP be guest-owned, but I don't think that's worth doing as it introduces a
subtle dependency on CR0 being up-to-date (or passed in).
So this?
---
arch/x86/kvm/kvm_cache_regs.h | 2 +-
arch/x86/kvm/vmx/nested.c | 4 ++--
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 18 ++++++++++++++++++
4 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 4c91f626c058..e50d353b5c1c 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -4,7 +4,7 @@
#include <linux/kvm_host.h>
-#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
+#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP)
#define KVM_POSSIBLE_CR4_GUEST_BITS \
(X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \
| X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 7c4f5ca405c7..a0c92a2b3f65 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4478,7 +4478,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
* CR0_GUEST_HOST_MASK is already set in the original vmcs01
* (KVM doesn't change it);
*/
- vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
vmx_set_cr0(vcpu, vmcs12->host_cr0);
/* Same as above - no reason to call set_cr4_guest_host_mask(). */
@@ -4629,7 +4629,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
*/
vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
- vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index da65d90984ae..136adccffc4b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4773,7 +4773,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
/* 22.2.1, 20.8.1 */
vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
- vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+ vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
set_cr4_guest_host_mask(vmx);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 2acdc54bc34b..423e9d3c9c40 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
(1 << VCPU_EXREG_EXIT_INFO_1) | \
(1 << VCPU_EXREG_EXIT_INFO_2))
+static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
+{
+ unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
+
+ /*
+ * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
+ * in order to construct shadow PTEs with the correct protections.
+ * Note! CR0.WP technically can be passed through to the guest if
+ * paging is disabled, but checking CR0.PG would generate a cyclical
+ * dependency of sorts due to forcing the caller to ensure CR0 holds
+ * the correct value prior to determining which CR0 bits can be owned
+ * by L1. Keep it simple and limit the optimization to EPT.
+ */
+ if (!enable_ept)
+ bits &= ~X86_CR0_WP;
+ return bits;
+}
+
static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
{
return container_of(kvm, struct kvm_vmx, kvm);
base-commit: 0b39948a802b5e76d65989b47ae36fe0dfbc10ad
--
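For context on what "guest owned" means at the VMCS level, here is the
init_vmcs() pair from the diff above with explanatory comments added (the
comments are illustrative and not part of the posted patch):

	/* Bits L1 may toggle freely; everything else stays host owned. */
	vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();

	/*
	 * A set bit in CR0_GUEST_HOST_MASK is host owned: guest reads of it
	 * are served from the CR0 read shadow and writes that change it cause
	 * a CR-access VM exit.  A clear bit (e.g. CR0.WP with EPT) is guest
	 * owned: the guest modifies it without exiting, so KVM must VMREAD
	 * CR0 to observe its current value.
	 */
	vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);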
On Wed, Mar 15, 2023 at 02:38:33PM -0700, Sean Christopherson wrote:
> On Wed, Feb 01, 2023, Mathias Krause wrote:
> > Complement commit 4289d2728664 ("KVM: retpolines: x86: eliminate
> > retpoline from vmx.c exit handlers") and avoid a retpoline call for
> > control register accesses as well.
> >
> > This speeds up guests that make heavy use of it, like grsecurity
> > kernels toggling CR0.WP to implement kernel W^X.
>
> I would rather drop this patch for VMX and instead unconditionally make CR0.WP
> guest owned when TDP (EPT) is enabled, i.e. drop the module param from patch 6.
That's fine with me. As EPT usually implies TDP (unless either was
explicitly disabled), that should be no limitation, and as the non-EPT
case only saw a very small gain from this change anyway (less than 1%),
we can drop it.
>
> > Signed-off-by: Mathias Krause <[email protected]>
> > ---
> >
> > Meanwhile I got my hands on a AMD system and while doing a similar change
> > for SVM gives a small measurable win (1.1% faster for grsecurity guests),
>
> Mostly out of curiosity...
>
> Is the 1.1% roughly aligned with the gains for VMX? If VMX sees a significantly
> larger improvement, any idea why SVM doesn't benefit as much? E.g. did you double
> check that the kernel was actually using RETPOLINE?
I measured the runtime of the ssdd test I used before and got 3.98s for
a kernel with the whole series applied and 3.94s with the below change
on top:
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d13cf53e7390..2a471eae11c6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3369,6 +3369,10 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
return intr_interception(vcpu);
else if (exit_code == SVM_EXIT_HLT)
return kvm_emulate_halt(vcpu);
+ else if (exit_code == SVM_EXIT_READ_CR0 ||
+ exit_code == SVM_EXIT_WRITE_CR0 ||
+ exit_code == SVM_EXIT_CR0_SEL_WRITE)
+ return cr_interception(vcpu);
else if (exit_code == SVM_EXIT_NPF)
return npf_interception(vcpu);
#endif
Inspecting svm_invoke_exit_handler() on the host with perf confirmed that
it takes the direct call to cr_interception() most of the time and thereby
avoids the retpoline for it:
(My version of perf is, apparently, unable to detect tail calls properly
and therefore lacks symbol information for the jump targets in the below
assembly dump, so I added these manually.)
Percent│
│
│ ffffffffc194c410 <load0>: # svm_invoke_exit_handler
5.00 │ nop
7.44 │ push %rbp
10.43 │ cmp $0x403,%rsi
5.86 │ mov %rdi,%rbp
1.23 │ push %rbx
2.11 │ mov %rsi,%rbx
4.60 │ jbe 7a
│ 16: [svm_handle_invalid_exit() path removed]
4.59 │ 7a: mov -0x3e6a5b00(,%rsi,8),%rax
4.52 │ test %rax,%rax
│ je 16
6.25 │ cmp $0x7c,%rsi
│ je dd
4.18 │ cmp $0x64,%rsi
│ je f2
3.26 │ cmp $0x60,%rsi
│ je ca
4.57 │ cmp $0x78,%rsi
│ je f9
1.27 │ test $0xffffffffffffffef,%rsi
│ je c3
1.67 │ cmp $0x65,%rsi
│ je c3
│ cmp $0x400,%rsi
│ je 13d
│ pop %rbx
│ pop %rbp
│ jmp 0xffffffffa0487d80 # __x86_indirect_thunk_rax
│ int3
11.68 │ c3: pop %rbx
10.01 │ pop %rbp
10.47 │ jmp 0xffffffffc19482a0 # cr_interception
│ ca: incq 0x1940(%rdi)
│ mov $0x1,%eax
│ pop %rbx
0.42 │ pop %rbp
│ ret
│ int3
│ int3
│ int3
│ int3
│ dd: mov 0x1a20(%rdi),%rax
│ cmpq $0x0,0x78(%rax)
│ je 100
│ pop %rbx
│ pop %rbp
│ jmp 0xffffffffc185af20 # kvm_emulate_wrmsr
│ f2: pop %rbx
│ pop %rbp
0.42 │ jmp 0xffffffffc19472b0 # interrupt_window_interception
│ f9: pop %rbx
│ pop %rbp
│ jmp 0xffffffffc185a6a0 # kvm_emulate_halt
│100: pop %rbx
│ pop %rbp
│ jmp 0xffffffffc18602a0 # kvm_emulate_rdmsr
│107: mov %rbp,%rdi
│ mov $0x10,%esi
│ call kvm_register_read_raw
│ mov 0x24(%rbp),%edx
│ mov %rax,%rcx
│ mov %rbx,%r8
│ mov %gs:0x2ac00,%rax
│ mov 0x95c(%rax),%esi
│ mov $0xffffffffc195dc28,%rdi
│ call _printk
│ jmp 31
│13d: pop %rbx
│ pop %rbp
│ jmp 0xffffffffc1946b90 # npf_interception
What's clear from the above (or so I hope!) is that cr_interception() is *the*
dominant VM exit reason for my test run, and by taking the shortcut via a
direct call it avoids the retpoline dance, which likely explains the ~1.1%
performance gain (even in the face of three additional compare instructions).
However, as these three extra instructions probably "hurt" all other workloads
(that don't toggle CR0.WP as often as a grsecurity kernel does), I didn't
include the above change as a patch of the series. If you think it's worth it
nonetheless, as VM exits shouldn't happen often anyway, I can do a proper
patch.
>
> > it would provide nothing for other guests, as the change I was testing was
> > specifically targeting CR0 caused exits.
> >
> > A more general approach would instead cover CR3 and, maybe, CR4 as well.
> > However, that would require a lot more exit code compares, likely
> > vanishing the gains in the general case. So this tweak is VMX only.
>
> I don't think targeting on CR0 exits is a reason to not do this for SVM. With
> NPT enabled, CR3 isn't intercepted, and CR4 exits should be very rare. If the
> performance benefits are marginal (I don't have a good frame of reference for the
> 1.1%), then _that's_ a good reason to leave SVM alone. But not giving CR3 and CR4
> priority is a non-issue.
Ok. But yeah, the win isn't all that big either, less so in real
workloads that won't exercise this code path so often.
>
> > arch/x86/kvm/vmx/vmx.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index c788aa382611..c8198c8a9b55 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6538,6 +6538,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> > return handle_external_interrupt(vcpu);
> > else if (exit_reason.basic == EXIT_REASON_HLT)
> > return kvm_emulate_halt(vcpu);
> > + else if (exit_reason.basic == EXIT_REASON_CR_ACCESS)
> > + return handle_cr(vcpu);
> > else if (exit_reason.basic == EXIT_REASON_EPT_MISCONFIG)
> > return handle_ept_misconfig(vcpu);
> > #endif
> > --
> > 2.39.1
> >
On 15.03.23 23:11, Sean Christopherson wrote:
> Can you tweak the shortlog to something like this? I want to make it very clear
> that this applies only to the TDP case (more below). I did a double take when I
> first read the subject :-)
>
> KVM: x86: Do not unload MMU roots when only toggling CR0.WP with TDP enabled
Sure, will do!
>
> On Wed, Feb 01, 2023, Mathias Krause wrote:
>> There is no need to unload the MMU roots for a direct MMU role when only
>> CR0.WP has changed -- the paging structures are still valid, only the
>> permission bitmap needs to be updated.
>>
>> One heavy user of toggling CR0.WP is grsecurity's KERNEXEC feature to
>> implement kernel W^X.
>>
>> The optimization brings a huge performance gain for this case as the
>> following micro-benchmark running 'ssdd 10 50000' from rt-tests[1] on a
>> grsecurity L1 VM shows (runtime in seconds, lower is better):
>>
>> legacy TDP shadow
>> kvm.git/queue 11.55s 13.91s 75.2s
>> kvm.git/queue+patch 7.32s 7.31s 74.6s
>>
>> For legacy MMU this is ~36% faster, for TDP MMU even ~47% faster. Also,
>> TDP and legacy MMU now both have around the same runtime, which removes
>> the need to disable TDP MMU for grsecurity.
>>
>> Shadow MMU sees no measurable difference and is still slow, as expected.
>>
>> [1] https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
>>
>> Co-developed-by: Sean Christopherson <[email protected]>
>
> No need for this, I just threw a snippet at you as part of code review. And IMO,
> if someone goes through the pain of running benchmarks, they get full credit no
> matter what ;-)
Reviewers (and in your case maintainers) get far too little credit, so
I'd rather keep that tag, if you don't mind too much ;)
>
>> Signed-off-by: Mathias Krause <[email protected]>
>> ---
>> v2: handle the CR0.WP case directly in kvm_post_set_cr0() and only for
>> the direct MMU role -- Sean
>>
>> I re-ran the benchmark and it's even faster than with my patch, as the
>> critical path is now the first one handled and is now inline. Thanks a
>> lot for the suggestion, Sean!
>>
>> arch/x86/kvm/x86.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 508074e47bc0..f09bfc0a3cc1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -902,6 +902,15 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>>
>> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>> {
>> + /*
>> + * Toggling just CR0.WP doesn't invalidate page tables per se, only the
>> + * permission bits.
>> + */
>> + if (vcpu->arch.mmu->root_role.direct && (cr0 ^ old_cr0) == X86_CR0_WP) {
>
> Past me was wrong, which is a very good thing in this case. Per the APM,
>
> The host hCR0.WP bit is ignored under nested paging.
>
See what you did? You even went that far and re-read the manual.
Definitely deserves credit!
> which means that CR0.WP doesn't need to be incorporated into the role. Ha! And
> really-past me even wrote a very nice comment to call that out in commit 31e96bc63655
> ("KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU state").
>
> Double ha! That's all moot, because if this code is reached for a nested MMU,
> it means L2 is active and the CR0 being changed is gCR0, not L1's hCR0.
>
> So more simply, this can be
>
> if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP)
Looks much simpler, indeed. But it might deserve a little comment itself
on why it's fine to test 'tdp_enabled' only...
>
> or if we want to exempt non-paging mode for the shadow MMU as well...
>
> if ((cr0 ^ old_cr0) == X86_CR0_WP && (tdp_enabled || !(cr0 & X86_CR0_PG)))
>
> Actually, if we bother to check CR0.PG, then we might as well get greedy and
> skip _all_ updates when paging is disabled. E.g. end up with this over two
> patches? First one exempts the tdp_enabled case, second one completely exempts
> paging disabled.
>
...and there it is already! :D
> /*
> * CR0.WP is incorporated into the MMU role, but only for non-nested,
> * indirect shadow MMUs. If paging is disabled, no updates are needed
> * as there are no permission bits to emulate. If TDP is enabled, the
> * MMU's metadata needs to be updated, e.g. so that emulating guest
> * translations does the right thing, but there's no need to unload the
> * root as CR0.WP doesn't affect SPTEs when TDP is enabled.
> */
> if ((cr0 ^ old_cr0) == X86_CR0_WP) {
> if (!(cr0 & X86_CR0_PG))
> return;
>
> if (tdp_enabled) {
> kvm_init_mmu(vcpu);
> return;
> }
> }
Thanks, Sean! Sounds all very good to me. I'll cook up these commits, do
some more tests and send a v4 tomorrow, if time allows.
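As a rough sketch (derived purely from Sean's snippet above, so the final v4
shape is only an assumption), the first of those two patches would boil down
to an early exit in kvm_post_set_cr0() along these lines:

	/*
	 * Toggling only CR0.WP with TDP enabled doesn't change the SPTEs, so
	 * refresh the MMU metadata (permission bitmaps etc.) via
	 * kvm_init_mmu() instead of unloading the roots.
	 */
	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
		kvm_init_mmu(vcpu);
		return;
	}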
On 15.03.23 23:30, Sean Christopherson wrote:
> On Wed, Feb 01, 2023, Mathias Krause wrote:
>> Guests like grsecurity that make heavy use of CR0.WP to implement kernel
>> level W^X will suffer from the implied VMEXITs.
>>
>> For a direct MMU role there is no need to intercept a guest change of
>> CR0.WP, so simply make it a guest owned bit if we can do so.
>>
>> This implies that a read of a guest's CR0.WP bit might need a VMREAD.
>> However, the only potentially affected user seems to be kvm_init_mmu()
>> which is a heavy operation to begin with. But also most callers already
>> cache the full value of CR0 anyway, so no additional VMREAD is needed.
>> The only exception is nested_vmx_load_cr3().
>>
>> Add a new module parameter 'lazycr0' to allow users to revert back to
>> the old behaviour by loading kvm-intel.ko with 'lazycr0=0'.
>>
>> This change is VMX-specific, as SVM has no such fine grained control
>> register intercept control.
>>
>> Suggested-by: Sean Christopherson <[email protected]>
>> Signed-off-by: Mathias Krause <[email protected]>
>> ---
>>
>> Initially I wanted to implement the scheme Sean sketched[1]: having a
>> threshold where we would switch from eager to lazy CR0.WP tracking after
>> toggling the bit often enough, make the bit guest owned afterwards and
>> VMREAD CR0 when needed. However, when starting to look for users that
>> would be affected, I only found kvm_init_mmu() (via kvm_init_mmu() ->
>> vcpu_to_role_regs() -> kvm_read_cr0_bits(KVM_MMU_CR0_ROLE_BITS)). It has
>> only these three interesting callers:
>> 1/ kvm_mmu_reset_context(), which isn't all that interesting, as it's a
>> heavy weight operation anyway and many of the control flows leading
>> to it already cache the value of CR0, so no additional VMREAD is
>> needed,
>> 2/ nested_vmx_load_cr3() and
>> 3/ kvm_post_set_cr0(), only when CR0.WP was toggled and the MMU is in
>> direct mode (optimization introduced by patch 3).
>>
>> The last case's most interesting caller is likely kvm_set_cr0(), which
>> already caches the written CR0 value, thereby eliminating the need for
>> another VMREAD in vcpu_to_role_regs().
>>
>> That's why I went with the much simpler approach and always allow CR0.WP
>> to be guest owned if EPT is enabled as well.
>
> Nice!
>
>> There's nothing we can do for SVM, though :/
>
> :/ indeed
>
>> [1] https://lore.kernel.org/kvm/[email protected]/
>> ---
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d3b49e0b6c32..1969360d2744 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -91,6 +91,9 @@ module_param_named(flexpriority, flexpriority_enabled, bool, S_IRUGO);
>> bool __read_mostly enable_ept = 1;
>> module_param_named(ept, enable_ept, bool, S_IRUGO);
>>
>> +bool __read_mostly enable_lazy_cr0 = 1;
>> +module_param_named(lazycr0, enable_lazy_cr0, bool, S_IRUGO);
>
> Unless someone crawls out of the woodworks to object, let's omit the module param
> and make this unconditional. We typically add module params for behavior where
> there are legitimate downsides even if KVM is bug free, or for features that are
> dependent on hardware. E.g. testing shadow paging without a knob to disable EPT
> would require access to really ancient CPUs.
>
> The one exception that comes to mind is force_flush_and_sync_on_reuse, but TLB
> bugs tend to be subtle and hard to hit, whereas if we break something with CR0.WP
> emulation, the breakage should be immediate and obvious.
>
>> bool __read_mostly enable_unrestricted_guest = 1;
>> module_param_named(unrestricted_guest,
>> enable_unrestricted_guest, bool, S_IRUGO);
>> @@ -4765,7 +4768,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
>> /* 22.2.1, 20.8.1 */
>> vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>>
>> - vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
>> + vmx->vcpu.arch.cr0_guest_owned_bits = vmx_guest_owned_cr0_bits();
>> vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>>
>> set_cr4_guest_host_mask(vmx);
>> @@ -8370,6 +8373,10 @@ static __init int hardware_setup(void)
>> return -EOPNOTSUPP;
>> }
>>
>> + /* Need EPT for lazy CR0.WP synchronization. */
>> + if (!enable_ept)
>> + enable_lazy_cr0 = 0;
>
> Heh, just realized that this code would be broken if nested TDP wasn't exempt
> from including CR0.WP in the MMU role. Better to be lucky than good :-)
=:)
>
> And similar to kvm_post_set_cr0(), the CR0.PG=0 case _could_ let
> CR0.WP be guest-owned, but I don't think that's worth doing as it introduces a
> subtle dependency on CR0 being up-to-date (or passed in).
And it has no real use case, IMHO. Aside from the academic exercise, why
would any sane operating system^W^W"system software" that runs non-paged
touch CR0.WP at all?
>
> So this?
>
> ---
> arch/x86/kvm/kvm_cache_regs.h | 2 +-
> arch/x86/kvm/vmx/nested.c | 4 ++--
> arch/x86/kvm/vmx/vmx.c | 2 +-
> arch/x86/kvm/vmx/vmx.h | 18 ++++++++++++++++++
> 4 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index 4c91f626c058..e50d353b5c1c 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -4,7 +4,7 @@
>
> #include <linux/kvm_host.h>
>
> -#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS
> +#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP)
> #define KVM_POSSIBLE_CR4_GUEST_BITS \
> (X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \
> | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 7c4f5ca405c7..a0c92a2b3f65 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4478,7 +4478,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> * CR0_GUEST_HOST_MASK is already set in the original vmcs01
> * (KVM doesn't change it);
> */
> - vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> + vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
> vmx_set_cr0(vcpu, vmcs12->host_cr0);
>
> /* Same as above - no reason to call set_cr4_guest_host_mask(). */
> @@ -4629,7 +4629,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> */
> vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
>
> - vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> + vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
> vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
>
> vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index da65d90984ae..136adccffc4b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4773,7 +4773,7 @@ static void init_vmcs(struct vcpu_vmx *vmx)
> /* 22.2.1, 20.8.1 */
> vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
>
> - vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> + vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits();
> vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits);
>
> set_cr4_guest_host_mask(vmx);
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 2acdc54bc34b..423e9d3c9c40 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
> (1 << VCPU_EXREG_EXIT_INFO_1) | \
> (1 << VCPU_EXREG_EXIT_INFO_2))
>
> +static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
> +{
> + unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;
> +
> + /*
> + * CR0.WP needs to be intercepted when KVM is shadowing legacy paging
> + * in order to construct shadow PTEs with the correct protections.
> + * Note! CR0.WP technically can be passed through to the guest if
> + * paging is disabled, but checking CR0.PG would generate a cyclical
> + * dependency of sorts due to forcing the caller to ensure CR0 holds
> + * the correct value prior to determining which CR0 bits can be owned
> + * by L1. Keep it simple and limit the optimization to EPT.
> + */
> + if (!enable_ept)
> + bits &= ~X86_CR0_WP;
> + return bits;
> +}
> +
> static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
> {
> return container_of(kvm, struct kvm_vmx, kvm);
>
> base-commit: 0b39948a802b5e76d65989b47ae36fe0dfbc10ad
LGTM!
On 15.03.23 23:18, Sean Christopherson wrote:
> On Wed, Feb 01, 2023, Mathias Krause wrote:
>> Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
>> want to know the state of certain bits instead of the whole register.
>>
>> This not only makes the intent cleaner, it also avoids a VMREAD in case
>> the tested bits aren't guest owned.
>>
>> Signed-off-by: Mathias Krause <[email protected]>
>> ---
>> arch/x86/kvm/pmu.c | 4 ++--
>> arch/x86/kvm/vmx/vmx.c | 4 ++--
>> 2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index d939d3b84e6f..d9922277df67 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
>> if (!pmc)
>> return 1;
>>
>> - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
>> + if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
>
> Purely as an FYI, I proposed adding helpers to query single CR0/CR4 bits in a
> separate thread[*]. No need to do anything on your end, I'll plan on applying
> this patch first and will handle whatever conflicts arise.
>
> [*] https://lore.kernel.org/all/[email protected]
Unfortunately, not all users of kvm_read_cr*_bits() only want to read a
single bit. There are a very few that read bit masks -- but you're
probably fully aware of this.
Thanks,
Mathias
On Mon, Mar 20, 2023, Mathias Krause wrote:
> On 15.03.23 23:18, Sean Christopherson wrote:
> > On Wed, Feb 01, 2023, Mathias Krause wrote:
> >> Make use of the kvm_read_cr{0,4}_bits() helper functions when we only
> >> want to know the state of certain bits instead of the whole register.
> >>
> >> This not only makes the intent cleaner, it also avoids a VMREAD in case
> >> the tested bits aren't guest owned.
> >>
> >> Signed-off-by: Mathias Krause <[email protected]>
> >> ---
> >> arch/x86/kvm/pmu.c | 4 ++--
> >> arch/x86/kvm/vmx/vmx.c | 4 ++--
> >> 2 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> >> index d939d3b84e6f..d9922277df67 100644
> >> --- a/arch/x86/kvm/pmu.c
> >> +++ b/arch/x86/kvm/pmu.c
> >> @@ -439,9 +439,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
> >> if (!pmc)
> >> return 1;
> >>
> >> - if (!(kvm_read_cr4(vcpu) & X86_CR4_PCE) &&
> >> + if (!(kvm_read_cr4_bits(vcpu, X86_CR4_PCE)) &&
> >
> > Purely as an FYI, I proposed adding helpers to query single CR0/CR4 bits in a
> > separate thread[*]. No need to do anything on your end, I'll plan on applying
> > this patch first and will handle whatever conflicts arise.
> >
> > [*] https://lore.kernel.org/all/[email protected]
>
> Unfortunately, not all users of kvm_read_cr*_bits() only want to read a
> single bit. There are a very few that read bit masks -- but you're
> probably fully aware of this.
Yeah, we won't be able to get rid of kvm_read_cr*_bits() entirely; the goal is
purely to make code that queries a single bit slightly more readable/obvious.
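To make the distinction concrete, a hedged sketch of the two kinds of call
sites (KVM_MMU_CR0_ROLE_BITS is the multi-bit example mentioned earlier in the
thread; the single-bit helper name is assumed, not final):

	/* Multi-bit mask query -- keeps using kvm_read_cr*_bits(). */
	unsigned long cr0_role_bits = kvm_read_cr0_bits(vcpu, KVM_MMU_CR0_ROLE_BITS);

	/* Single-bit query -- the case the proposed helpers aim to simplify. */
	bool pce = !!kvm_read_cr4_bits(vcpu, X86_CR4_PCE);
	/* ... which would become: kvm_is_cr4_bit_set(vcpu, X86_CR4_PCE) */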