2020-03-20 21:33:10

by Sean Christopherson

Subject: [PATCH v3 00/37] KVM: x86: TLB flushing fixes and enhancements

VMX TLB flushing cleanup series to fix a variety of bugs, and to avoid
unnecessary TLB flushes on nested VMX transitions.

1) Nested VMX doesn't properly flush all ASIDs/contexts on system events,
e.g. on an mmu_notifier invalidation, all contexts for L1 *and* L2 need
to be invalidated, but KVM generally flushes only L1 or L2 (or just L1).

2) #1 is largely benign because nested VMX always flushes the new
context on nested VM-Entry/VM-Exit.

High level overview:

a) Fix the main TLB flushing bug with a big hammer.

b) Fix a few other flushing related bugs.

c) Clean up vmx_tlb_flush(), i.e. what was v1 of this series.

d) Reintroduce current-ASID/context flushing to regain some of the
precision that got blasted away by the big hammer in (a).

e) Fix random code paths that unnecessarily trigger TLB flushes on
nested VMX transitions.

f) Stop flushing on every nested VMX transition.

g) Extra cleanup.


v3:
- Fix freeing of roots during INVVPID; I botched things when tweaking
Junaid's original patch.
- Move "last vpid02" logic into nested VMX flushing helper. [Paolo]
- Split "skip tlb flush" and "skip mmu sync" logic during fast roots
switch. [Paolo]
- Unconditionally skip tlb flush during fast roots switch on nested VMX
transitions, i.e. let nested_vmx_transition_tlb_flush() do the work.
This avoids flushing when EPT=0 and VPID=1. [Paolo]
- Do more cr3->pgd conversions in related code that drove me bonkers
when trying to figure out what was broken with VPID. I think this
knocks off the last code that uses "cr3" for variables/functions that
work with cr3 or eptp.

v2:
- Basically a new series.

v1:
- https://patchwork.kernel.org/cover/11394987/

Junaid Shahid (2):
KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT
KVM: x86: Sync SPTEs when injecting page/EPT fault into L1

Sean Christopherson (35):
KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush
KVM: nVMX: Validate the EPTP when emulating INVEPT(EXTENT_CONTEXT)
KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1
KVM: x86: Export kvm_propagate_fault() (as
kvm_inject_emulated_page_fault)
KVM: x86: Consolidate logic for injecting page faults to L1
KVM: VMX: Skip global INVVPID fallback if vpid==0 in
vpid_sync_context()
KVM: VMX: Use vpid_sync_context() directly when possible
KVM: VMX: Move vpid_sync_vcpu_addr() down a few lines
KVM: VMX: Handle INVVPID fallback logic in vpid_sync_vcpu_addr()
KVM: VMX: Drop redundant capability checks in low level INVVPID
helpers
KVM: nVMX: Use vpid_sync_vcpu_addr() to emulate INVVPID with address
KVM: x86: Move "flush guest's TLB" logic to separate kvm_x86_ops hook
KVM: VMX: Clean up vmx_flush_tlb_gva()
KVM: x86: Drop @invalidate_gpa param from kvm_x86_ops' tlb_flush()
KVM: SVM: Wire up ->tlb_flush_guest() directly to svm_flush_tlb()
KVM: VMX: Move vmx_flush_tlb() to vmx.c
KVM: nVMX: Move nested_get_vpid02() to vmx/nested.h
KVM: VMX: Introduce vmx_flush_tlb_current()
KVM: SVM: Document the ASID logic in svm_flush_tlb()
KVM: x86: Rename ->tlb_flush() to ->tlb_flush_all()
KVM: nVMX: Add helper to handle TLB flushes on nested VM-Enter/VM-Exit
KVM: x86: Introduce KVM_REQ_TLB_FLUSH_CURRENT to flush current ASID
KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes
KVM: nVMX: Selectively use TLB_FLUSH_CURRENT for nested
VM-Enter/VM-Exit
KVM: nVMX: Reload APIC access page on nested VM-Exit only if necessary
KVM: VMX: Retrieve APIC access page HPA only when necessary
KVM: VMX: Don't reload APIC access page if its control is disabled
KVM: x86/mmu: Move fast_cr3_switch() side effects to
__kvm_mmu_new_cr3()
KVM: x86/mmu: Add separate override for MMU sync during fast CR3
switch
KVM: x86/mmu: Add module param to force TLB flush on root reuse
KVM: nVMX: Skip MMU sync on nested VMX transition when possible
KVM: nVMX: Don't flush TLB on nested VMX transition
KVM: nVMX: Free only the affected contexts when emulating INVEPT
KVM: x86: Replace "cr3" with "pgd" in "new cr3/pgd" related code
KVM: VMX: Clean cr3/pgd handling in vmx_load_mmu_pgd()

arch/x86/include/asm/kvm_host.h | 25 +++-
arch/x86/kvm/mmu/mmu.c | 145 +++++++++----------
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/svm.c | 19 ++-
arch/x86/kvm/vmx/nested.c | 249 ++++++++++++++++++++++----------
arch/x86/kvm/vmx/nested.h | 7 +
arch/x86/kvm/vmx/ops.h | 32 ++--
arch/x86/kvm/vmx/vmx.c | 119 ++++++++++++---
arch/x86/kvm/vmx/vmx.h | 19 +--
arch/x86/kvm/x86.c | 67 ++++++---
arch/x86/kvm/x86.h | 6 +
11 files changed, 448 insertions(+), 242 deletions(-)

--
2.24.1


2020-03-20 21:33:13

by Sean Christopherson

Subject: [PATCH v3 16/37] KVM: x86: Drop @invalidate_gpa param from kvm_x86_ops' tlb_flush()

Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now
that all callers pass %true for said param, or ignore the param (SVM has
an internal call to svm_flush_tlb() in svm_flush_tlb_guest() that somewhat
arbitrarily passes %false).

Remove __vmx_flush_tlb() as it is no longer used.

No functional change intended.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/svm.c | 10 ++++----
arch/x86/kvm/vmx/vmx.c | 4 ++--
arch/x86/kvm/vmx/vmx.h | 42 ++++++++++-----------------------
arch/x86/kvm/x86.c | 6 ++---
6 files changed, 24 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c08f4c0bf4d1..a5dfab4642d6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1105,7 +1105,7 @@ struct kvm_x86_ops {
unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);

- void (*tlb_flush)(struct kvm_vcpu *vcpu, bool invalidate_gpa);
+ void (*tlb_flush)(struct kvm_vcpu *vcpu);
int (*tlb_remote_flush)(struct kvm *kvm);
int (*tlb_remote_flush_with_range)(struct kvm *kvm,
struct kvm_tlb_range *range);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5ae620881bbc..a87b8f9f3b1f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5177,7 +5177,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
if (r)
goto out;
kvm_mmu_load_pgd(vcpu);
- kvm_x86_ops->tlb_flush(vcpu, true);
+ kvm_x86_ops->tlb_flush(vcpu);
out:
return r;
}
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 396f42753489..62fa45dcb6a4 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -385,7 +385,7 @@ module_param(dump_invalid_vmcb, bool, 0644);
static u8 rsm_ins_bytes[] = "\x0f\xaa";

static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
-static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa);
+static void svm_flush_tlb(struct kvm_vcpu *vcpu);
static void svm_complete_interrupts(struct vcpu_svm *svm);
static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate);
static inline void avic_post_state_restore(struct kvm_vcpu *vcpu);
@@ -2692,7 +2692,7 @@ static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
return 1;

if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
- svm_flush_tlb(vcpu, true);
+ svm_flush_tlb(vcpu);

vcpu->arch.cr4 = cr4;
if (!npt_enabled)
@@ -3630,7 +3630,7 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
svm->nested.intercept = nested_vmcb->control.intercept;

- svm_flush_tlb(&svm->vcpu, true);
+ svm_flush_tlb(&svm->vcpu);
svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
svm->vcpu.arch.hflags |= HF_VINTR_MASK;
@@ -5626,7 +5626,7 @@ static int svm_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
return 0;
}

-static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
+static void svm_flush_tlb(struct kvm_vcpu *vcpu)
{
struct vcpu_svm *svm = to_svm(vcpu);

@@ -5645,7 +5645,7 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)

static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
{
- svm_flush_tlb(vcpu, false);
+ svm_flush_tlb(vcpu);
}

static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 43c0d4706f9a..477bdbc52ed0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6079,7 +6079,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
if (flexpriority_enabled) {
sec_exec_control |=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
- vmx_flush_tlb(vcpu, true);
+ vmx_flush_tlb(vcpu);
}
break;
case LAPIC_MODE_X2APIC:
@@ -6097,7 +6097,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu, hpa_t hpa)
{
if (!is_guest_mode(vcpu)) {
vmcs_write64(APIC_ACCESS_ADDR, hpa);
- vmx_flush_tlb(vcpu, true);
+ vmx_flush_tlb(vcpu);
}
}

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 3770ae111e6a..bab5d62ad964 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -503,46 +503,28 @@ static inline struct vmcs *alloc_vmcs(bool shadow)

u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa);

-static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
- bool invalidate_gpa)
-{
- if (enable_ept && (invalidate_gpa || !enable_vpid)) {
- if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
- return;
- ept_sync_context(construct_eptp(vcpu,
- vcpu->arch.mmu->root_hpa));
- } else {
- vpid_sync_context(vpid);
- }
-}
-
-static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
+static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

/*
- * Flush all EPTP/VPID contexts if the TLB flush _may_ have been
- * invoked via kvm_flush_remote_tlbs(), which always passes %true for
- * @invalidate_gpa. Flushing remote TLBs requires all contexts to be
- * flushed, not just the active context.
+ * Flush all EPTP/VPID contexts, as the TLB flush _may_ have been
+ * invoked via kvm_flush_remote_tlbs(). Flushing remote TLBs requires
+ * all contexts to be flushed, not just the active context.
*
* Note, this also ensures a deferred TLB flush with VPID enabled and
* EPT disabled invalidates the "correct" VPID, by nuking both L1 and
* L2's VPIDs.
*/
- if (invalidate_gpa) {
- if (enable_ept) {
- ept_sync_global();
- } else if (enable_vpid) {
- if (cpu_has_vmx_invvpid_global()) {
- vpid_sync_vcpu_global();
- } else {
- vpid_sync_vcpu_single(vmx->vpid);
- vpid_sync_vcpu_single(vmx->nested.vpid02);
- }
+ if (enable_ept) {
+ ept_sync_global();
+ } else if (enable_vpid) {
+ if (cpu_has_vmx_invvpid_global()) {
+ vpid_sync_vcpu_global();
+ } else {
+ vpid_sync_vcpu_single(vmx->vpid);
+ vpid_sync_vcpu_single(vmx->nested.vpid02);
}
- } else {
- __vmx_flush_tlb(vcpu, vmx->vpid, false);
}
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0b90ec2c93cf..84cbd7ca1e18 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2696,10 +2696,10 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
vcpu->arch.time = 0;
}

-static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
+static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
{
++vcpu->stat.tlb_flush;
- kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa);
+ kvm_x86_ops->tlb_flush(vcpu);
}

static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -8223,7 +8223,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (kvm_check_request(KVM_REQ_LOAD_MMU_PGD, vcpu))
kvm_mmu_load_pgd(vcpu);
if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
- kvm_vcpu_flush_tlb(vcpu, true);
+ kvm_vcpu_flush_tlb(vcpu);
if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
r = 0;
--
2.24.1

2020-03-20 21:33:15

by Sean Christopherson

Subject: [PATCH v3 04/37] KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT

From: Junaid Shahid <[email protected]>

Free all roots when emulating INVVPID for L1 and EPT is disabled, as
outstanding changes to the page tables managed by L1 need to be
recognized. Because L1 and L2 share an MMU when EPT is disabled, and
because VPID is not tracked by the MMU role, all roots in the current
MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or
VM-Exit could do a fast CR3 switch (without a flush/sync) and consume
stale SPTEs.

Fixes: 5c614b3583e7b ("KVM: nVMX: nested VPID emulation")
Signed-off-by: Junaid Shahid <[email protected]>
[sean: ported to upstream KVM, reworded the comment and changelog]
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 9624cea4ed9f..bc74fbbf33c6 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5250,6 +5250,20 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}

+ /*
+ * Sync the shadow page tables if EPT is disabled; L1 is invalidating
+ * linear mappings for L2 (tagged with L2's VPID). Free all roots as
+ * VPIDs are not tracked in the MMU role.
+ *
+ * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share
+ * an MMU when EPT is disabled.
+ *
+ * TODO: sync only the affected SPTEs for INDIVIDUAL_ADDR.
+ */
+ if (!enable_ept)
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu,
+ KVM_MMU_ROOTS_ALL);
+
return nested_vmx_succeed(vcpu);
}

--
2.24.1

2020-03-20 21:33:15

by Sean Christopherson

Subject: [PATCH v3 01/37] KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush

Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB
flushes require all contexts to be invalidated, not just the active
contexts, e.g. all mappings in all contexts for a given HVA need to be
invalidated on an mmu_notifier invalidation. Similarly, the instigator
of the deferred TLB flush may be expecting all contexts to be flushed,
e.g. vmx_vcpu_load_vmcs().

Without nested VMX, flushing only the current EPTP/VPID context isn't
problematic because KVM uses a constant VPID for each vCPU, and
mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP
for L1. In the rare case where a different EPTP is created or reused,
KVM (currently) unconditionally flushes the new EPTP context prior to
entering the guest.

With nested VMX, KVM conditionally uses a different VPID for L2, and
unconditionally uses a different EPTP for L2. Because KVM doesn't
_intentionally_ guarantee L2's EPTP/VPID context is flushed on nested
VM-Enter, it'd be possible for a malicious L1 to attack the host and/or
different VMs by exploiting the lack of flushing for L2.

1) Launch nested guest from malicious L1.

2) Nested VM-Enter to L2.

3) Access target GPA 'g'. CPU inserts TLB entry tagged with L2's ASID
mapping 'g' to host PFN 'x'.

4) Nested VM-Exit to L1.

5) L1 triggers kernel same-page merging (ksm) by duplicating/zeroing
the page for PFN 'x'.

6) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and
remaps the page to PFN 'y'. mmu_notifier sends invalidate command,
and KVM flushes the TLB only for L1's ASID.

7) Host kernel reallocates PFN 'x' to some other task/guest.

8) Nested VM-Enter to L2. KVM does not invalidate L2's EPTP or VPID.

9) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its
stale TLB entry.

However, current KVM unconditionally flushes L1's EPTP/VPID context on
nested VM-Exit. But that behavior is mostly unintentional; KVM doesn't
go out of its way to flush EPTP/VPID on nested VM-Enter/VM-Exit; rather,
a TLB flush is guaranteed to occur prior to re-entering L1 due to
__kvm_mmu_new_cr3() always being called with skip_tlb_flush=false. On
nested VM-Enter, this happens via kvm_init_shadow_ept_mmu() (nested EPT
enabled) or in nested_vmx_load_cr3() (nested EPT disabled). On nested
VM-Exit it occurs via nested_vmx_load_cr3().

This also fixes a bug where a deferred TLB flush in the context of L2,
with EPT disabled, would flush L1's VPID instead of L2's VPID, as
vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode().

Cc: Vitaly Kuznetsov <[email protected]>
Cc: Ben Gardon <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Junaid Shahid <[email protected]>
Cc: Liran Alon <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: John Haxby <[email protected]>
Reviewed-by: Liran Alon <[email protected]>
Fixes: efebf0aaec3d ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/vmx.h | 28 +++++++++++++++++++++++++++-
1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index be93d597306c..d6d67b816ebe 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -518,7 +518,33 @@ static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,

static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
{
- __vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid, invalidate_gpa);
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+ /*
+ * Flush all EPTP/VPID contexts if the TLB flush _may_ have been
+ * invoked via kvm_flush_remote_tlbs(), which always passes %true for
+ * @invalidate_gpa. Flushing remote TLBs requires all contexts to be
+ * flushed, not just the active context.
+ *
+ * Note, this also ensures a deferred TLB flush with VPID enabled and
+ * EPT disabled invalidates the "correct" VPID, by nuking both L1 and
+ * L2's VPIDs.
+ */
+ if (invalidate_gpa) {
+ if (enable_ept) {
+ ept_sync_global();
+ } else if (enable_vpid) {
+ if (cpu_has_vmx_invvpid_global()) {
+ vpid_sync_vcpu_global();
+ } else {
+ WARN_ON_ONCE(!cpu_has_vmx_invvpid_single());
+ vpid_sync_vcpu_single(vmx->vpid);
+ vpid_sync_vcpu_single(vmx->nested.vpid02);
+ }
+ }
+ } else {
+ __vmx_flush_tlb(vcpu, vmx->vpid, false);
+ }
}

static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)
--
2.24.1

2020-03-20 21:33:15

by Sean Christopherson

Subject: [PATCH v3 11/37] KVM: VMX: Handle INVVPID fallback logic in vpid_sync_vcpu_addr()

Directly invoke vpid_sync_context() to do a global INVVPID when the
individual address variant is not supported instead of deferring such
behavior to the caller. This allows for additional consolidation of
code as the logic is basically identical to the emulation of the
individual address variant in handle_invvpid().

No functional change intended.

Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/ops.h | 12 +++++-------
arch/x86/kvm/vmx/vmx.c | 3 +--
2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/ops.h b/arch/x86/kvm/vmx/ops.h
index dd7ab61bfcc1..39122699cfeb 100644
--- a/arch/x86/kvm/vmx/ops.h
+++ b/arch/x86/kvm/vmx/ops.h
@@ -276,17 +276,15 @@ static inline void vpid_sync_context(int vpid)
vpid_sync_vcpu_global();
}

-static inline bool vpid_sync_vcpu_addr(int vpid, gva_t addr)
+static inline void vpid_sync_vcpu_addr(int vpid, gva_t addr)
{
if (vpid == 0)
- return true;
+ return;

- if (cpu_has_vmx_invvpid_individual_addr()) {
+ if (cpu_has_vmx_invvpid_individual_addr())
__invvpid(VMX_VPID_EXTENT_INDIVIDUAL_ADDR, vpid, addr);
- return true;
- }
-
- return false;
+ else
+ vpid_sync_context(vpid);
}

static inline void ept_sync_global(void)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ba49323a89d8..ba24bbda2c12 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2853,8 +2853,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
{
int vpid = to_vmx(vcpu)->vpid;

- if (!vpid_sync_vcpu_addr(vpid, addr))
- vpid_sync_context(vpid);
+ vpid_sync_vcpu_addr(vpid, addr);

/*
* If VPIDs are not supported or enabled, then the above is a no-op.
--
2.24.1

2020-03-20 21:33:26

by Sean Christopherson

Subject: [PATCH v3 02/37] KVM: nVMX: Validate the EPTP when emulating INVEPT(EXTENT_CONTEXT)

Signal VM-Fail for the single-context variant of INVEPT if the specified
EPTP is invalid. Per the INEVPT pseudocode in Intel's SDM, it's subject
to the standard EPT checks:

If VM entry with the "enable EPT" VM execution control set to 1 would
fail due to the EPTP value then VMfail(Invalid operand to INVEPT/INVVPID);

Fixes: bfd0a56b90005 ("nEPT: Nested INVEPT")
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8578513907d7..f3774cef4fd4 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5156,8 +5156,12 @@ static int handle_invept(struct kvm_vcpu *vcpu)
}

switch (type) {
- case VMX_EPT_EXTENT_GLOBAL:
case VMX_EPT_EXTENT_CONTEXT:
+ if (!nested_vmx_check_eptp(vcpu, operand.eptp))
+ return nested_vmx_failValid(vcpu,
+ VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+ fallthrough;
+ case VMX_EPT_EXTENT_GLOBAL:
/*
* TODO: Sync the necessary shadow EPT roots here, rather than
* at the next emulated VM-entry.
--
2.24.1

2020-03-20 21:33:36

by Sean Christopherson

Subject: [PATCH v3 05/37] KVM: x86: Export kvm_propagate_fault() (as kvm_inject_emulated_page_fault)

Export the page fault propagation helper so that VMX can use it to
correctly emulate TLB invalidation on page faults in an upcoming patch.

In the (hopefully) not-too-distant future, SGX virtualization will also
want access to the helper for injecting page faults to the correct level
(L1 vs. L2) when emulating ENCLS instructions.

Rename the function to kvm_inject_emulated_page_fault() to clarify that
it is (a) injecting a fault and (b) only for page faults. WARN if it's
invoked with an exception other than PF_VECTOR.

Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 8 ++++++--
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9a183e9d4cb1..328b1765ff76 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1447,6 +1447,8 @@ void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
+bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+ struct x86_exception *fault);
int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gfn_t gfn, void *data, int offset, int len,
u32 access);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e54c6ad628a8..64ed6e6e2b56 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -611,8 +611,11 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
}
EXPORT_SYMBOL_GPL(kvm_inject_page_fault);

-static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
+bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+ struct x86_exception *fault)
{
+ WARN_ON_ONCE(fault->vector != PF_VECTOR);
+
if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
else
@@ -620,6 +623,7 @@ static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fau

return fault->nested_page_fault;
}
+EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);

void kvm_inject_nmi(struct kvm_vcpu *vcpu)
{
@@ -6373,7 +6377,7 @@ static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
{
struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
if (ctxt->exception.vector == PF_VECTOR)
- return kvm_propagate_fault(vcpu, &ctxt->exception);
+ return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);

if (ctxt->exception.error_code_valid)
kvm_queue_exception_e(vcpu, ctxt->exception.vector,
--
2.24.1

2020-03-20 21:33:47

by Sean Christopherson

Subject: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
changes to the EPT tables managed by L1 need to be recognized, and
relying on KVM to always flush L2's EPTP context on nested VM-Enter is
dangerous.

Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
TLB flush if necessary, e.g. if L1 has never entered L2 then there is
nothing to be done.

Nuking all L2 roots is overkill for the single-context variant, but it's
the safe and easy bet. A more precise zap mechanism will be added in
the future. Add a TODO to call out that KVM only needs to invalidate
affected contexts.

Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")
Reported-by: Jim Mattson <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/vmx/nested.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index f3774cef4fd4..9624cea4ed9f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5160,12 +5160,12 @@ static int handle_invept(struct kvm_vcpu *vcpu)
if (!nested_vmx_check_eptp(vcpu, operand.eptp))
return nested_vmx_failValid(vcpu,
VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+
+ /* TODO: sync only the target EPTP context. */
fallthrough;
case VMX_EPT_EXTENT_GLOBAL:
- /*
- * TODO: Sync the necessary shadow EPT roots here, rather than
- * at the next emulated VM-entry.
- */
+ kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu,
+ KVM_MMU_ROOTS_ALL);
break;
default:
BUG_ON(1);
--
2.24.1

2020-03-23 14:53:43

by Vitaly Kuznetsov

Subject: Re: [PATCH v3 02/37] KVM: nVMX: Validate the EPTP when emulating INVEPT(EXTENT_CONTEXT)

Sean Christopherson <[email protected]> writes:

> Signal VM-Fail for the single-context variant of INVEPT if the specified
> EPTP is invalid. Per the INVEPT pseudocode in Intel's SDM, it's subject
> to the standard EPT checks:
>
> If VM entry with the "enable EPT" VM execution control set to 1 would
> fail due to the EPTP value then VMfail(Invalid operand to INVEPT/INVVPID);
>
> Fixes: bfd0a56b90005 ("nEPT: Nested INVEPT")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 8578513907d7..f3774cef4fd4 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -5156,8 +5156,12 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> }
>
> switch (type) {
> - case VMX_EPT_EXTENT_GLOBAL:
> case VMX_EPT_EXTENT_CONTEXT:
> + if (!nested_vmx_check_eptp(vcpu, operand.eptp))
> + return nested_vmx_failValid(vcpu,
> + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);

I was going to ask "and we don't seem to check that current nested VMPTR
is valid, how can we know that nested_vmx_failValid() is the right
VMfail() to use" but then I checked our nested_vmx_failValid() and there
is a fallback there:

if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
return nested_vmx_failInvalid(vcpu);

so this is a non-issue. My question, however, transforms into "would it
make sense to introduce nested_vmx_fail() implementing the logic from the
SDM:

VMfail(ErrorNumber):
IF VMCS pointer is valid
THEN VMfailValid(ErrorNumber);
ELSE VMfailInvalid;
FI;

to assist an innocent reader of the code?"

> + fallthrough;
> + case VMX_EPT_EXTENT_GLOBAL:
> /*
> * TODO: Sync the necessary shadow EPT roots here, rather than
> * at the next emulated VM-entry.

Reviewed-by: Vitaly Kuznetsov <[email protected]>

--
Vitaly

2020-03-23 15:25:32

by Vitaly Kuznetsov

Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

Sean Christopherson <[email protected]> writes:

> Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
> changes to the EPT tables managed by L1 need to be recognized, and
> relying on KVM to always flush L2's EPTP context on nested VM-Enter is
> dangerous.
>
> Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
> TLB flush if necessary, e.g. if L1 has never entered L2 then there is
> nothing to be done.
>
> Nuking all L2 roots is overkill for the single-context variant, but it's
> the safe and easy bet. A more precise zap mechanism will be added in
> the future. Add a TODO to call out that KVM only needs to invalidate
> affected contexts.
>
> Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")
> Reported-by: Jim Mattson <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index f3774cef4fd4..9624cea4ed9f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -5160,12 +5160,12 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> if (!nested_vmx_check_eptp(vcpu, operand.eptp))
> return nested_vmx_failValid(vcpu,
> VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> +
> + /* TODO: sync only the target EPTP context. */
> fallthrough;
> case VMX_EPT_EXTENT_GLOBAL:
> - /*
> - * TODO: Sync the necessary shadow EPT roots here, rather than
> - * at the next emulated VM-entry.
> - */
> + kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu,
> + KVM_MMU_ROOTS_ALL);
> break;

An ignorant reader may wonder "and how do we know that L1 actually uses
EPT" as he may find out that guest_mmu is not being used otherwise. The
answer to the question will likely be "if L1 doesn't use EPT for some of
its guests then there's nothing we should do here as we will be
resetting root_mmu when switching to/from them". Hope the ignorant
reviewer typing this is not very wrong :-)

> default:
> BUG_ON(1);

Reviewed-by: Vitaly Kuznetsov <[email protected]>

--
Vitaly

2020-03-23 15:36:03

by Vitaly Kuznetsov

Subject: Re: [PATCH v3 04/37] KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT

Sean Christopherson <[email protected]> writes:

> From: Junaid Shahid <[email protected]>
>
> Free all roots when emulating INVVPID for L1 and EPT is disabled, as
> outstanding changes to the page tables managed by L1 need to be
> recognized. Because L1 and L2 share an MMU when EPT is disabled, and
> because VPID is not tracked by the MMU role, all roots in the current
> MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or
> VM-Exit could do a fast CR3 switch (without a flush/sync) and consume
> stale SPTEs.
>
> Fixes: 5c614b3583e7b ("KVM: nVMX: nested VPID emulation")
> Signed-off-by: Junaid Shahid <[email protected]>
> [sean: ported to upstream KVM, reworded the comment and changelog]
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/nested.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 9624cea4ed9f..bc74fbbf33c6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -5250,6 +5250,20 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
> return kvm_skip_emulated_instruction(vcpu);
> }
>
> + /*
> + * Sync the shadow page tables if EPT is disabled; L1 is invalidating
> + * linear mappings for L2 (tagged with L2's VPID). Free all roots as
> + * VPIDs are not tracked in the MMU role.
> + *
> + * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share
> + * an MMU when EPT is disabled.
> + *
> + * TODO: sync only the affected SPTEs for INDIVIDUAL_ADDR.
> + */
> + if (!enable_ept)
> + kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu,
> + KVM_MMU_ROOTS_ALL);
> +

This is related to my remark on the previous patch; the comment above
makes me think I'm missing something obvious, enlighten me please)

My understanding is that L1 and L2 will share arch.root_mmu not only
when EPT is globally disabled; we seem to switch between
root_mmu/guest_mmu only when nested_cpu_has_ept(vmcs12), and different
L2 guests may differ on this. Do we need to handle this somehow?

> return nested_vmx_succeed(vcpu);
> }

--
Vitaly

2020-03-23 15:46:36

by Sean Christopherson

Subject: Re: [PATCH v3 02/37] KVM: nVMX: Validate the EPTP when emulating INVEPT(EXTENT_CONTEXT)

On Mon, Mar 23, 2020 at 03:51:17PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > Signal VM-Fail for the single-context variant of INVEPT if the specified
> > EPTP is invalid. Per the INVEPT pseudocode in Intel's SDM, it's subject
> > to the standard EPT checks:
> >
> > If VM entry with the "enable EPT" VM execution control set to 1 would
> > fail due to the EPTP value then VMfail(Invalid operand to INVEPT/INVVPID);
> >
> > Fixes: bfd0a56b90005 ("nEPT: Nested INVEPT")
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/vmx/nested.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 8578513907d7..f3774cef4fd4 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -5156,8 +5156,12 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> > }
> >
> > switch (type) {
> > - case VMX_EPT_EXTENT_GLOBAL:
> > case VMX_EPT_EXTENT_CONTEXT:
> > + if (!nested_vmx_check_eptp(vcpu, operand.eptp))
> > + return nested_vmx_failValid(vcpu,
> > + VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
>
> I was going to ask "and we don't seem to check that current nested VMPTR
> is valid, how can we know that nested_vmx_failValid() is the right
> VMfail() to use" but then I checked our nested_vmx_failValid() and there
> is a fallback there:
>
> if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
> return nested_vmx_failInvalid(vcpu);
>
> so this is a non-issue. My question, however, transforms into "would it
> make sense to introduce nested_vmx_fail() implementing the logic from the
> SDM:
>
> VMfail(ErrorNumber):
> IF VMCS pointer is valid
> THEN VMfailValid(ErrorNumber);
> ELSE VMfailInvalid;
> FI;
>

Hmm, I wouldn't be opposed to such a wrapper. It would pair with
nested_vmx_succeed().
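
For reference, a rough sketch of what such a wrapper could look like,
reusing the fallback check from nested_vmx_failValid() quoted above (the
nested_vmx_fail() name is Vitaly's suggestion, not code from this
series):

	static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		/*
		 * failValid writes the error number to the current VMCS,
		 * which can't be done if there isn't a current VMCS.
		 */
		if (vmx->nested.current_vmptr == -1ull && !vmx->nested.hv_evmcs)
			return nested_vmx_failInvalid(vcpu);

		return nested_vmx_failValid(vcpu, vm_instruction_error);
	}

Callers like handle_invept() would then simply do

	return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);

without having to reason about the VMCS pointer state.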

>
> > + fallthrough;
> > + case VMX_EPT_EXTENT_GLOBAL:
> > /*
> > * TODO: Sync the necessary shadow EPT roots here, rather than
> > * at the next emulated VM-entry.
>
> Reviewed-by: Vitaly Kuznetsov <[email protected]>
>
> --
> Vitaly
>

2020-03-23 15:50:00

by Vitaly Kuznetsov

Subject: Re: [PATCH v3 05/37] KVM: x86: Export kvm_propagate_fault() (as kvm_inject_emulated_page_fault)

Sean Christopherson <[email protected]> writes:

> Export the page fault propagation helper so that VMX can use it to
> correctly emulate TLB invalidation on page faults in an upcoming patch.
>
> In the (hopefully) not-too-distant future, SGX virtualization will also
> want access to the helper for injecting page faults to the correct level
> (L1 vs. L2) when emulating ENCLS instructions.
>
> Rename the function to kvm_inject_emulated_page_fault() to clarify that
> it is (a) injecting a fault and (b) only for page faults. WARN if it's
> invoked with an exception other than PF_VECTOR.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 ++
> arch/x86/kvm/x86.c | 8 ++++++--
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9a183e9d4cb1..328b1765ff76 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1447,6 +1447,8 @@ void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
> void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
> void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
> void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
> +bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> + struct x86_exception *fault);
> int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> gfn_t gfn, void *data, int offset, int len,
> u32 access);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e54c6ad628a8..64ed6e6e2b56 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -611,8 +611,11 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> }
> EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
>
> -static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> +bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> + struct x86_exception *fault)
> {
> + WARN_ON_ONCE(fault->vector != PF_VECTOR);
> +
> if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
> vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
> else
> @@ -620,6 +623,7 @@ static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fau
>
> return fault->nested_page_fault;
> }
> +EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);

We don't seem to use the return value a lot; actually,
inject_emulated_exception() seems to be the only one, and the rest just
call it without checking the return value. Judging by the new name, I'd
guess that the function returns whether it was able to inject the
exception or not, but this doesn't seem to be the case. My suggestion
would then be to make it return 'void' and return
'fault->nested_page_fault' separately in inject_emulated_exception().

>
> void kvm_inject_nmi(struct kvm_vcpu *vcpu)
> {
> @@ -6373,7 +6377,7 @@ static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
> {
> struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> if (ctxt->exception.vector == PF_VECTOR)
> - return kvm_propagate_fault(vcpu, &ctxt->exception);
> + return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
>
> if (ctxt->exception.error_code_valid)
> kvm_queue_exception_e(vcpu, ctxt->exception.vector,

With or without the change suggested above,

Reviewed-by: Vitaly Kuznetsov <[email protected]>

--
Vitaly

2020-03-23 15:54:14

by Sean Christopherson

Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Mon, Mar 23, 2020 at 04:24:05PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
> > changes to the EPT tables managed by L1 need to be recognized, and
> > relying on KVM to always flush L2's EPTP context on nested VM-Enter is
> > dangerous.
> >
> > Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
> > TLB flush if necessary, e.g. if L1 has never entered L2 then there is
> > nothing to be done.
> >
> > Nuking all L2 roots is overkill for the single-context variant, but it's
> > the safe and easy bet. A more precise zap mechanism will be added in
> > the future. Add a TODO to call out that KVM only needs to invalidate
> > affected contexts.
> >
> > Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")
> > Reported-by: Jim Mattson <[email protected]>
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/vmx/nested.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index f3774cef4fd4..9624cea4ed9f 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -5160,12 +5160,12 @@ static int handle_invept(struct kvm_vcpu *vcpu)
> > if (!nested_vmx_check_eptp(vcpu, operand.eptp))
> > return nested_vmx_failValid(vcpu,
> > VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
> > +
> > + /* TODO: sync only the target EPTP context. */
> > fallthrough;
> > case VMX_EPT_EXTENT_GLOBAL:
> > - /*
> > - * TODO: Sync the necessary shadow EPT roots here, rather than
> > - * at the next emulated VM-entry.
> > - */
> > + kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu,
> > + KVM_MMU_ROOTS_ALL);
> > break;
>
> An ignorant reader may wonder "and how do we know that L1 actually uses
> EPT" as he may find out that guest_mmu is not being used otherwise. The
> answer to the question will likely be "if L1 doesn't use EPT for some of
> its guests then there's nothing we should do here as we will be
> resetting root_mmu when switching to/from them". Hope the ignorant
> reviewer typing this is not very wrong :-)

A different way to put it would be:

KVM never uses root_mmu to hold nested EPT roots.

Invalidating too much is functionally ok, though sub-optimal for performance.
Invalidating too little is what we really care about.

FWIW, VMX currently uses guest_mmu iff nested EPT is enabled. In theory,
KVM could be enhanced to also use guest_mmu when nested-TDP is disabled,
e.g. to enable VMX to preserve L1's root_mmu when emulating INVVPID. That
would likely be a decent performance boost for nested VMX+VPID without
nested EPT, but I'm guessing that the cross-section of users that care
about nested performance and don't use nested EPT is quite small.

> > default:
> > BUG_ON(1);
>
> Reviewed-by: Vitaly Kuznetsov <[email protected]>
>
> --
> Vitaly
>

2020-03-23 16:06:00

by Sean Christopherson

Subject: Re: [PATCH v3 04/37] KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT

On Mon, Mar 23, 2020 at 04:34:17PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > From: Junaid Shahid <[email protected]>
> >
> > Free all roots when emulating INVVPID for L1 and EPT is disabled, as
> > outstanding changes to the page tables managed by L1 need to be
> > recognized. Because L1 and L2 share an MMU when EPT is disabled, and
> > because VPID is not tracked by the MMU role, all roots in the current
> > MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or
> > VM-Exit could do a fast CR3 switch (without a flush/sync) and consume
> > stale SPTEs.
> >
> > Fixes: 5c614b3583e7b ("KVM: nVMX: nested VPID emulation")
> > Signed-off-by: Junaid Shahid <[email protected]>
> > [sean: ported to upstream KVM, reworded the comment and changelog]
> > Signed-off-by: Sean Christopherson <[email protected]>
> > ---
> > arch/x86/kvm/vmx/nested.c | 14 ++++++++++++++
> > 1 file changed, 14 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 9624cea4ed9f..bc74fbbf33c6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -5250,6 +5250,20 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
> > return kvm_skip_emulated_instruction(vcpu);
> > }
> >
> > + /*
> > + * Sync the shadow page tables if EPT is disabled; L1 is invalidating
> > + * linear mappings for L2 (tagged with L2's VPID). Free all roots as
> > + * VPIDs are not tracked in the MMU role.
> > + *
> > + * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share
> > + * an MMU when EPT is disabled.
> > + *
> > + * TODO: sync only the affected SPTEs for INDIVIDUAL_ADDR.
> > + */
> > + if (!enable_ept)
> > + kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu,
> > + KVM_MMU_ROOTS_ALL);
> > +
>
> This is related to my remark on the previous patch; the comment above
> makes me think I'm missing something obvious, enlighten me please)
>
> My understanding is that L1 and L2 will share arch.root_mmu not only
> when EPT is globally disabled; we seem to switch between
> root_mmu/guest_mmu only when nested_cpu_has_ept(vmcs12), and different
> L2 guests may differ on this. Do we need to handle this somehow?

guest_mmu is used iff nested EPT is enabled, which requires enable_ept=1.
enable_ept is global and cannot be changed without reloading kvm_intel.

This most definitely over-invalidates, e.g. it blasts away L1's page
tables. But, fixing that requires tracking VPID in mmu_role and/or adding
support for using guest_mmu when L1 isn't using TDP, i.e. nested EPT is
disabled. Assuming the vast majority of nested deployments enable EPT in
L0, the cost of both options likely outweighs the benefits.

> > return nested_vmx_succeed(vcpu);
> > }
>
> --
> Vitaly
>

2020-03-23 16:26:09

by Jim Mattson

Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Fri, Mar 20, 2020 at 2:29 PM Sean Christopherson
<[email protected]> wrote:
>
> Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
> changes to the EPT tables managed by L1 need to be recognized, and
> relying on KVM to always flush L2's EPTP context on nested VM-Enter is
> dangerous.
>
> Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
> TLB flush if necessary, e.g. if L1 has never entered L2 then there is
> nothing to be done.
>
> Nuking all L2 roots is overkill for the single-context variant, but it's
> the safe and easy bet. A more precise zap mechanism will be added in
> the future. Add a TODO to call out that KVM only needs to invalidate
> affected contexts.
>
> Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")

The bug existed well before the commit indicated in the "Fixes" line.

2020-03-23 16:26:32

by Sean Christopherson

Subject: Re: [PATCH v3 05/37] KVM: x86: Export kvm_propagate_fault() (as kvm_inject_emulated_page_fault)

On Mon, Mar 23, 2020 at 04:47:49PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index e54c6ad628a8..64ed6e6e2b56 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -611,8 +611,11 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> > }
> > EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
> >
> > -static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> > +bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> > + struct x86_exception *fault)
> > {
> > + WARN_ON_ONCE(fault->vector != PF_VECTOR);
> > +
> > if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
> > vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
> > else
> > @@ -620,6 +623,7 @@ static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fau
> >
> > return fault->nested_page_fault;
> > }
> > +EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
>
> We don't seem to use the return value a lot; actually,
> inject_emulated_exception() seems to be the only one, and the rest just
> call it without checking the return value. Judging by the new name, I'd
> guess that the function returns whether it was able to inject the
> exception or not, but this doesn't seem to be the case. My suggestion
> would then be to make it return 'void' and return
> 'fault->nested_page_fault' separately in inject_emulated_exception().

Oooh, I like that idea. The return from the common helper also confuses me
every time I look at it.
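
Something along these lines, i.e. a void helper with the caller reading
fault->nested_page_fault directly (a rough sketch of the suggestion, not
code from this series; the else branch is assumed from the surrounding
code, it isn't visible in the quoted hunk):

	void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
					    struct x86_exception *fault)
	{
		WARN_ON_ONCE(fault->vector != PF_VECTOR);

		/* Inject the fault at the correct level (L1 vs. L2). */
		if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
			vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
		else
			vcpu->arch.mmu->inject_page_fault(vcpu, fault);
	}
	EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);

with inject_emulated_exception() doing

	if (ctxt->exception.vector == PF_VECTOR) {
		kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
		return ctxt->exception.nested_page_fault;
	}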

> > void kvm_inject_nmi(struct kvm_vcpu *vcpu)
> > {
> > @@ -6373,7 +6377,7 @@ static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
> > {
> > struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> > if (ctxt->exception.vector == PF_VECTOR)
> > - return kvm_propagate_fault(vcpu, &ctxt->exception);
> > + return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
> >
> > if (ctxt->exception.error_code_valid)
> > kvm_queue_exception_e(vcpu, ctxt->exception.vector,
>
> With or without the change suggested above,
>
> Reviewed-by: Vitaly Kuznetsov <[email protected]>
>
> --
> Vitaly
>

2020-03-23 16:29:59

by Sean Christopherson

Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Mon, Mar 23, 2020 at 09:24:25AM -0700, Jim Mattson wrote:
> On Fri, Mar 20, 2020 at 2:29 PM Sean Christopherson
> <[email protected]> wrote:
> >
> > Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
> > changes to the EPT tables managed by L1 need to be recognized, and
> > relying on KVM to always flush L2's EPTP context on nested VM-Enter is
> > dangerous.
> >
> > Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
> > TLB flush if necessary, e.g. if L1 has never entered L2 then there is
> > nothing to be done.
> >
> > Nuking all L2 roots is overkill for the single-context variant, but it's
> > the safe and easy bet. A more precise zap mechanism will be added in
> > the future. Add a TODO to call out that KVM only needs to invalidate
> > affected contexts.
> >
> > Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")
>
> The bug existed well before the commit indicated in the "Fixes" line.

Ah, my bad. A cursory glance at commit b119019847fbc makes that quite
obvious. This should be

Fixes: bfd0a56b9000 ("nEPT: Nested INVEPT")

2020-03-23 16:35:23

by Vitaly Kuznetsov

Subject: Re: [PATCH v3 04/37] KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT

Sean Christopherson <[email protected]> writes:

> On Mon, Mar 23, 2020 at 04:34:17PM +0100, Vitaly Kuznetsov wrote:
>> Sean Christopherson <[email protected]> writes:
>>
>> > From: Junaid Shahid <[email protected]>
>> >
>> > Free all roots when emulating INVVPID for L1 and EPT is disabled, as
>> > outstanding changes to the page tables managed by L1 need to be
>> > recognized. Because L1 and L2 share an MMU when EPT is disabled, and
>> > because VPID is not tracked by the MMU role, all roots in the current
>> > MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or
>> > VM-Exit could do a fast CR3 switch (without a flush/sync) and consume
>> > stale SPTEs.
>> >
>> > Fixes: 5c614b3583e7b ("KVM: nVMX: nested VPID emulation")
>> > Signed-off-by: Junaid Shahid <[email protected]>
>> > [sean: ported to upstream KVM, reworded the comment and changelog]
>> > Signed-off-by: Sean Christopherson <[email protected]>
>> > ---
>> > arch/x86/kvm/vmx/nested.c | 14 ++++++++++++++
>> > 1 file changed, 14 insertions(+)
>> >
>> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> > index 9624cea4ed9f..bc74fbbf33c6 100644
>> > --- a/arch/x86/kvm/vmx/nested.c
>> > +++ b/arch/x86/kvm/vmx/nested.c
>> > @@ -5250,6 +5250,20 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
>> > return kvm_skip_emulated_instruction(vcpu);
>> > }
>> >
>> > + /*
>> > + * Sync the shadow page tables if EPT is disabled; L1 is invalidating
>> > + * linear mappings for L2 (tagged with L2's VPID). Free all roots as
>> > + * VPIDs are not tracked in the MMU role.
>> > + *
>> > + * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share
>> > + * an MMU when EPT is disabled.
>> > + *
>> > + * TODO: sync only the affected SPTEs for INDIVIDUAL_ADDR.
>> > + */
>> > + if (!enable_ept)
>> > + kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu,
>> > + KVM_MMU_ROOTS_ALL);
>> > +
>>
>> This is related to my remark on the previous patch; the comment above
>> makes me think I'm missing something obvious, enlighten me please)
>>
>> My understanding is that L1 and L2 will share arch.root_mmu not only
>> when EPT is globally disabled; we seem to switch between
>> root_mmu/guest_mmu only when nested_cpu_has_ept(vmcs12), and different
>> L2 guests may differ on this. Do we need to handle this somehow?
>
> guest_mmu is used iff nested EPT is enabled, which requires enable_ept=1.
> enable_ept is global and cannot be changed without reloading kvm_intel.
>
> This most definitely over-invalidates, e.g. it blasts away L1's page
> tables. But, fixing that requires tracking VPID in mmu_role and/or adding
> support for using guest_mmu when L1 isn't using TDP, i.e. nested EPT is
> disabled. Assuming the vast majority of nested deployments enable EPT in
> L0, the cost of both options likely outweighs the benefits.
>

Yes but my question rather was: what if global 'enable_ept' is true but
nested EPT is not being used by L1, don't we still need to do
kvm_mmu_free_roots(&vcpu->arch.root_mmu) here?

--
Vitaly

2020-03-23 16:38:41

by Jim Mattson

Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Mon, Mar 23, 2020 at 9:28 AM Sean Christopherson
<[email protected]> wrote:
>
> On Mon, Mar 23, 2020 at 09:24:25AM -0700, Jim Mattson wrote:
> > On Fri, Mar 20, 2020 at 2:29 PM Sean Christopherson
> > <[email protected]> wrote:
> > >
> > > Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
> > > changes to the EPT tables managed by L1 need to be recognized, and
> > > relying on KVM to always flush L2's EPTP context on nested VM-Enter is
> > > dangerous.
> > >
> > > Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
> > > TLB flush if necessary, e.g. if L1 has never entered L2 then there is
> > > nothing to be done.
> > >
> > > Nuking all L2 roots is overkill for the single-context variant, but it's
> > > the safe and easy bet. A more precise zap mechanism will be added in
> > > the future. Add a TODO to call out that KVM only needs to invalidate
> > > affected contexts.
> > >
> > > Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")
> >
> > The bug existed well before the commit indicated in the "Fixes" line.
>
> Ah, my bad. A cursory glance at commit b119019847fbc makes that quite
> obvious. This should be
>
> Fixes: bfd0a56b9000 ("nEPT: Nested INVEPT")

Actually, I think that things were fine back then (though we
gratuitously flushed L1's TLB as a result of an emulated INVEPT). The
problem started when we stopped flushing the TLB on every emulated
VM-entry (i.e. L1 -> L2 transitions). I'm not sure what that commit
was, but I think you referenced it in an earlier email.

2020-03-23 16:45:43

by Sean Christopherson

Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Mon, Mar 23, 2020 at 09:36:22AM -0700, Jim Mattson wrote:
> On Mon, Mar 23, 2020 at 9:28 AM Sean Christopherson
> <[email protected]> wrote:
> >
> > On Mon, Mar 23, 2020 at 09:24:25AM -0700, Jim Mattson wrote:
> > > On Fri, Mar 20, 2020 at 2:29 PM Sean Christopherson
> > > <[email protected]> wrote:
> > > >
> > > > Free all L2 (guest_mmu) roots when emulating INVEPT for L1. Outstanding
> > > > changes to the EPT tables managed by L1 need to be recognized, and
> > > > relying on KVM to always flush L2's EPTP context on nested VM-Enter is
> > > > dangerous.
> > > >
> > > > Similar to handle_invpcid(), rely on kvm_mmu_free_roots() to do a remote
> > > > TLB flush if necessary, e.g. if L1 has never entered L2 then there is
> > > > nothing to be done.
> > > >
> > > > Nuking all L2 roots is overkill for the single-context variant, but it's
> > > > the safe and easy bet. A more precise zap mechanism will be added in
> > > > the future. Add a TODO to call out that KVM only needs to invalidate
> > > > affected contexts.
> > > >
> > > > Fixes: b119019847fbc ("kvm: nVMX: Remove unnecessary sync_roots from handle_invept")
> > >
> > > The bug existed well before the commit indicated in the "Fixes" line.
> >
> > Ah, my bad. A cursory glance at commit b119019847fbc makes that quite
> > obvious. This should be
> >
> > Fixes: bfd0a56b9000 ("nEPT: Nested INVEPT")
>
> Actually, I think that things were fine back then (though we
> gratuitously flushed L1's TLB as a result of an emulated INVEPT). The
> problem started when we stopped flushing the TLB on every emulated
> VM-entry (i.e. L1 -> L2 transitions). I'm not sure what that commit
> was, but I think you referenced it in an earlier email.

Hmm, true. I was thinking it was the original commit because it didn't
operate on guest_mmu, but guest_mmu didn't exist back then. So I think

Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")

would be appropriate?

2020-03-23 16:50:44

by Sean Christopherson

Subject: Re: [PATCH v3 04/37] KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT

On Mon, Mar 23, 2020 at 05:33:08PM +0100, Vitaly Kuznetsov wrote:
> Sean Christopherson <[email protected]> writes:
>
> > On Mon, Mar 23, 2020 at 04:34:17PM +0100, Vitaly Kuznetsov wrote:
> >> Sean Christopherson <[email protected]> writes:
> >>
> >> > From: Junaid Shahid <[email protected]>
> >> >
> >> > Free all roots when emulating INVVPID for L1 and EPT is disabled, as
> >> > outstanding changes to the page tables managed by L1 need to be
> >> > recognized. Because L1 and L2 share an MMU when EPT is disabled, and
> >> > because VPID is not tracked by the MMU role, all roots in the current
> >> > MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or
> >> > VM-Exit could do a fast CR3 switch (without a flush/sync) and consume
> >> > stale SPTEs.
> >> >
> >> > Fixes: 5c614b3583e7b ("KVM: nVMX: nested VPID emulation")
> >> > Signed-off-by: Junaid Shahid <[email protected]>
> >> > [sean: ported to upstream KVM, reworded the comment and changelog]
> >> > Signed-off-by: Sean Christopherson <[email protected]>
> >> > ---
> >> > arch/x86/kvm/vmx/nested.c | 14 ++++++++++++++
> >> > 1 file changed, 14 insertions(+)
> >> >
> >> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> >> > index 9624cea4ed9f..bc74fbbf33c6 100644
> >> > --- a/arch/x86/kvm/vmx/nested.c
> >> > +++ b/arch/x86/kvm/vmx/nested.c
> >> > @@ -5250,6 +5250,20 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
> >> > return kvm_skip_emulated_instruction(vcpu);
> >> > }
> >> >
> >> > + /*
> >> > + * Sync the shadow page tables if EPT is disabled, L1 is invalidating
> >> > + * linear mappings for L2 (tagged with L2's VPID). Free all roots as
> >> > + * VPIDs are not tracked in the MMU role.
> >> > + *
> >> > + * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share
> >> > + * an MMU when EPT is disabled.
> >> > + *
> >> > + * TODO: sync only the affected SPTEs for INDIVIDUAL_ADDR.
> >> > + */
> >> > + if (!enable_ept)
> >> > + kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu,
> >> > + KVM_MMU_ROOTS_ALL);
> >> > +
> >>
> >> This is related to my remark on the previous patch; the comment above
> >> makes me think I'm missing something obvious, enlighten me please)
> >>
> >> My understanding is that L1 and L2 will share arch.root_mmu not only
> >> when EPT is globally disabled; we seem to switch between
> >> root_mmu/guest_mmu only when nested_cpu_has_ept(vmcs12), and different
> >> L2 guests may differ on this. Do we need to handle this somehow?
> >
> > guest_mmu is used iff nested EPT is enabled, which requires enable_ept=1.
> > enable_ept is global and cannot be changed without reloading kvm_intel.
> >
> > This most definitely over-invalidates, e.g. it blasts away L1's page
> > tables. But, fixing that requires tracking VPID in mmu_role and/or adding
> > support for using guest_mmu when L1 isn't using TDP, i.e. nested EPT is
> > disabled. Assuming the vast majority of nested deployments enable EPT in
> > L0, the cost of both options likely outweighs the benefits.
> >
>
> Yes but my question rather was: what if global 'enable_ept' is true but
> nested EPT is not being used by L1, don't we still need to do
> kvm_mmu_free_roots(&vcpu->arch.root_mmu) here?

No, because L0 isn't shadowing the L1->L2 page tables, i.e. there can't be
unsync'd SPTEs for L2. The vpid_sync_*() above flushes the TLB for L2's
effective VPID, which is all that's required.
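
For context, since the quoted hunk begins below them: the vpid_sync_*()
calls referenced here sit just above in handle_invvpid(). Roughly, as
reconstructed from this series (vpid02 is L2's effective VPID; a sketch,
not a verbatim quote of the code):

	switch (type) {
	case VMX_VPID_EXTENT_INDIVIDUAL_ADDR:
		/* Invalidate a single GLA mapping for L2's effective VPID. */
		vpid_sync_vcpu_addr(vpid02, operand.gla);
		break;
	case VMX_VPID_EXTENT_SINGLE_CONTEXT:
	case VMX_VPID_EXTENT_SINGLE_NON_GLOBAL:
	case VMX_VPID_EXTENT_ALL_CONTEXT:
		/* Invalidate all mappings for L2's effective VPID. */
		vpid_sync_context(vpid02);
		break;
	default:
		WARN_ON_ONCE(1);
		break;
	}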

2020-03-23 16:58:23

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v3 04/37] KVM: nVMX: Invalidate all roots when emulating INVVPID without EPT

Sean Christopherson <[email protected]> writes:

> On Mon, Mar 23, 2020 at 05:33:08PM +0100, Vitaly Kuznetsov wrote:
>> Sean Christopherson <[email protected]> writes:
>>
>> > On Mon, Mar 23, 2020 at 04:34:17PM +0100, Vitaly Kuznetsov wrote:
>> >> Sean Christopherson <[email protected]> writes:
>> >>
>> >> > From: Junaid Shahid <[email protected]>
>> >> >
>> >> > Free all roots when emulating INVVPID for L1 and EPT is disabled, as
>> >> > outstanding changes to the page tables managed by L1 need to be
>> >> > recognized. Because L1 and L2 share an MMU when EPT is disabled, and
>> >> > because VPID is not tracked by the MMU role, all roots in the current
>> >> > MMU (root_mmu) need to be freed, otherwise a future nested VM-Enter or
>> >> > VM-Exit could do a fast CR3 switch (without a flush/sync) and consume
>> >> > stale SPTEs.
>> >> >
>> >> > Fixes: 5c614b3583e7b ("KVM: nVMX: nested VPID emulation")
>> >> > Signed-off-by: Junaid Shahid <[email protected]>
>> >> > [sean: ported to upstream KVM, reworded the comment and changelog]
>> >> > Signed-off-by: Sean Christopherson <[email protected]>
>> >> > ---
>> >> > arch/x86/kvm/vmx/nested.c | 14 ++++++++++++++
>> >> > 1 file changed, 14 insertions(+)
>> >> >
>> >> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> >> > index 9624cea4ed9f..bc74fbbf33c6 100644
>> >> > --- a/arch/x86/kvm/vmx/nested.c
>> >> > +++ b/arch/x86/kvm/vmx/nested.c
>> >> > @@ -5250,6 +5250,20 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
>> >> > return kvm_skip_emulated_instruction(vcpu);
>> >> > }
>> >> >
>> >> > + /*
>> >> > + * Sync the shadow page tables if EPT is disabled, L1 is invalidating
>> >> > + * linear mappings for L2 (tagged with L2's VPID). Free all roots as
>> >> > + * VPIDs are not tracked in the MMU role.
>> >> > + *
>> >> > + * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share
>> >> > + * an MMU when EPT is disabled.
>> >> > + *
>> >> > + * TODO: sync only the affected SPTEs for INDIVIDUAL_ADDR.
>> >> > + */
>> >> > + if (!enable_ept)
>> >> > + kvm_mmu_free_roots(vcpu, &vcpu->arch.root_mmu,
>> >> > + KVM_MMU_ROOTS_ALL);
>> >> > +
>> >>
>> >> This is related to my remark on the previous patch; the comment above
>> >> makes me think I'm missing something obvious, enlighten me please)
>> >>
>> >> My understanding is that L1 and L2 will share arch.root_mmu not only
>> >> when EPT is globally disabled; we seem to switch between
>> >> root_mmu/guest_mmu only when nested_cpu_has_ept(vmcs12), and different
>> >> L2 guests may differ on this. Do we need to handle this somehow?
>> >
>> > guest_mmu is used iff nested EPT is enabled, which requires enable_ept=1.
>> > enable_ept is global and cannot be changed without reloading kvm_intel.
>> >
>> > This most definitely over-invalidates, e.g. it blasts away L1's page
>> > tables. But, fixing that requires tracking VPID in mmu_role and/or adding
>> > support for using guest_mmu when L1 isn't using TDP, i.e. nested EPT is
>> > disabled. Assuming the vast majority of nested deployments enable EPT in
>> > L0, the cost of both options likely outweighs the benefits.
>> >
>>
>> Yes but my question rather was: what if global 'enable_ept' is true but
>> nested EPT is not being used by L1, don't we still need to do
>> kvm_mmu_free_roots(&vcpu->arch.root_mmu) here?
>
> No, because L0 isn't shadowing the L1->L2 page tables, i.e. there can't be
> unsync'd SPTEs for L2. The vpid_sync_*() above flushes the TLB for L2's
> effective VPID, which is all that's required.

Ah, stupid me, it's actually EPT and not nested EPT which we care about
here. Thank you for the clarification!

Reviewed-by: Vitaly Kuznetsov <[email protected]>

--
Vitaly

2020-03-23 23:48:29

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 02/37] KVM: nVMX: Validate the EPTP when emulating INVEPT(EXTENT_CONTEXT)

On 23/03/20 16:45, Sean Christopherson wrote:
>> My question, however, transforms into "would it
>> make sense to introduce nested_vmx_fail() implementing the logic from
>> SDM:
>>
>> VMfail(ErrorNumber):
>> IF VMCS pointer is valid
>> THEN VMfailValid(ErrorNumber);
>> ELSE VMfailInvalid;
>> FI;
>>
> Hmm, I wouldn't be opposed to such a wrapper. It would pair with
> nested_vmx_succeed().
>

Neither would I.

Paolo
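
For illustration, a minimal sketch of such a wrapper, assuming the
existing nested_vmx_failInvalid()/nested_vmx_failValid() helpers and the
-1ull "no current VMCS" convention; this mirrors the SDM pseudocode
above and is not committed code (an enlightened VMCS would need similar
treatment):

static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	/* VMfail is VMfailValid iff there is a current (valid) VMCS. */
	if (vmx->nested.current_vmptr == -1ull)
		return nested_vmx_failInvalid(vcpu);

	return nested_vmx_failValid(vcpu, vm_instruction_error);
}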

2020-03-23 23:51:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On 23/03/20 17:44, Sean Christopherson wrote:
> So I think
>
> Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
>
> would be appropriate?
>

Yes. I changed it and also added the comment

+ /*
+ * Nested EPT roots are always held through guest_mmu,
+ * not root_mmu.
+ */

which isn't unlike what you suggested elsewhere in the thread.

Paolo
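
In context, the freed roots and the new comment would land in
handle_invept() roughly as follows; a sketch assembled from the patch
and this thread (the nested_vmx_check_eptp() name is an assumption based
on patch 02), so treat it as approximate rather than the committed code:

	switch (type) {
	case VMX_EPT_EXTENT_CONTEXT:
		if (!nested_vmx_check_eptp(vcpu, operand.eptp))
			return nested_vmx_failValid(vcpu,
				VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);

		/* TODO: invalidate only the affected EPTP context(s). */
		fallthrough;
	case VMX_EPT_EXTENT_GLOBAL:
		/*
		 * Nested EPT roots are always held through guest_mmu,
		 * not root_mmu.
		 */
		kvm_mmu_free_roots(vcpu, &vcpu->arch.guest_mmu,
				   KVM_MMU_ROOTS_ALL);
		break;
	default:
		BUG();
	}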

2020-03-23 23:58:15

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v3 05/37] KVM: x86: Export kvm_propagate_fault() (as kvm_inject_emulated_page_fault)

On 23/03/20 17:24, Sean Christopherson wrote:
>> We don't seem to use the return value a lot, actually;
>> inject_emulated_exception() seems to be the only one, the rest just call
>> it without checking the return value. Judging by the new name, I'd guess
>> that the function returns whether it was able to inject the exception or
>> not but this doesn't seem to be the case. My suggestion would then be to
>> make it return 'void' and return 'fault->nested_page_fault' separately
>> in inject_emulated_exception().
> Oooh, I like that idea. The return from the common helper also confuses me
> every time I look at it.
>

Separate patch, please. I'm not sure it makes a great difference though.

Paolo
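
As a hypothetical sketch of that separate patch (names taken from the
patch under discussion, assuming the embedded emulate_ctxt of this era;
not committed code):

void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
				    struct x86_exception *fault)
{
	/* ... inject into L1, or reflect into L2 for a nested fault ... */
}

static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
{
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;

	if (ctxt->exception.vector == PF_VECTOR) {
		kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
		/* Return what the caller wants, not "was it injected". */
		return ctxt->exception.nested_page_fault;
	}

	/* ... non-#PF exception paths unchanged ... */
	return false;
}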

2020-03-24 00:12:50

by Jim Mattson

[permalink] [raw]
Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Mon, Mar 23, 2020 at 4:51 PM Paolo Bonzini <[email protected]> wrote:
>
> On 23/03/20 17:44, Sean Christopherson wrote:
> > So I think
> >
> > Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
> >
> > would be appropriate?
> >
>
> Yes.

I think it was actually commit efebf0aaec3d ("KVM: nVMX: Do not flush
TLB on L1<->L2 transitions if L1 uses VPID and EPT").

2020-03-25 11:24:26

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v3 16/37] KVM: x86: Drop @invalidate_gpa param from kvm_x86_ops' tlb_flush()

Sean Christopherson <[email protected]> writes:

> Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now
> that all callers pass %true for said param, or ignore the param (SVM has
> an internal call to svm_flush_tlb() in svm_flush_tlb_guest that somewhat
> arbitrarily passes %false).
>
> Remove __vmx_flush_tlb() as it is no longer used.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/svm.c | 10 ++++----
> arch/x86/kvm/vmx/vmx.c | 4 ++--
> arch/x86/kvm/vmx/vmx.h | 42 ++++++++++-----------------------
> arch/x86/kvm/x86.c | 6 ++---
> 6 files changed, 24 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c08f4c0bf4d1..a5dfab4642d6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1105,7 +1105,7 @@ struct kvm_x86_ops {
> unsigned long (*get_rflags)(struct kvm_vcpu *vcpu);
> void (*set_rflags)(struct kvm_vcpu *vcpu, unsigned long rflags);
>
> - void (*tlb_flush)(struct kvm_vcpu *vcpu, bool invalidate_gpa);
> + void (*tlb_flush)(struct kvm_vcpu *vcpu);
> int (*tlb_remote_flush)(struct kvm *kvm);
> int (*tlb_remote_flush_with_range)(struct kvm *kvm,
> struct kvm_tlb_range *range);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5ae620881bbc..a87b8f9f3b1f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5177,7 +5177,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> if (r)
> goto out;
> kvm_mmu_load_pgd(vcpu);
> - kvm_x86_ops->tlb_flush(vcpu, true);
> + kvm_x86_ops->tlb_flush(vcpu);
> out:
> return r;
> }
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 396f42753489..62fa45dcb6a4 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -385,7 +385,7 @@ module_param(dump_invalid_vmcb, bool, 0644);
> static u8 rsm_ins_bytes[] = "\x0f\xaa";
>
> static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
> -static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa);
> +static void svm_flush_tlb(struct kvm_vcpu *vcpu);
> static void svm_complete_interrupts(struct vcpu_svm *svm);
> static void svm_toggle_avic_for_irq_window(struct kvm_vcpu *vcpu, bool activate);
> static inline void avic_post_state_restore(struct kvm_vcpu *vcpu);
> @@ -2692,7 +2692,7 @@ static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> return 1;
>
> if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
> - svm_flush_tlb(vcpu, true);
> + svm_flush_tlb(vcpu);
>
> vcpu->arch.cr4 = cr4;
> if (!npt_enabled)
> @@ -3630,7 +3630,7 @@ static void enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb_gpa,
> svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
> svm->nested.intercept = nested_vmcb->control.intercept;
>
> - svm_flush_tlb(&svm->vcpu, true);
> + svm_flush_tlb(&svm->vcpu);
> svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
> if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
> svm->vcpu.arch.hflags |= HF_VINTR_MASK;
> @@ -5626,7 +5626,7 @@ static int svm_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> return 0;
> }
>
> -static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
> +static void svm_flush_tlb(struct kvm_vcpu *vcpu)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
>
> @@ -5645,7 +5645,7 @@ static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>
> static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
> {
> - svm_flush_tlb(vcpu, false);
> + svm_flush_tlb(vcpu);
> }
>
> static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 43c0d4706f9a..477bdbc52ed0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6079,7 +6079,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> if (flexpriority_enabled) {
> sec_exec_control |=
> SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> - vmx_flush_tlb(vcpu, true);
> + vmx_flush_tlb(vcpu);
> }
> break;
> case LAPIC_MODE_X2APIC:
> @@ -6097,7 +6097,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu, hpa_t hpa)
> {
> if (!is_guest_mode(vcpu)) {
> vmcs_write64(APIC_ACCESS_ADDR, hpa);
> - vmx_flush_tlb(vcpu, true);
> + vmx_flush_tlb(vcpu);
> }
> }
>
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 3770ae111e6a..bab5d62ad964 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -503,46 +503,28 @@ static inline struct vmcs *alloc_vmcs(bool shadow)
>
> u64 construct_eptp(struct kvm_vcpu *vcpu, unsigned long root_hpa);
>
> -static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
> - bool invalidate_gpa)
> -{
> - if (enable_ept && (invalidate_gpa || !enable_vpid)) {
> - if (!VALID_PAGE(vcpu->arch.mmu->root_hpa))
> - return;
> - ept_sync_context(construct_eptp(vcpu,
> - vcpu->arch.mmu->root_hpa));
> - } else {
> - vpid_sync_context(vpid);
> - }
> -}
> -
> -static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
> +static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
>
> /*
> - * Flush all EPTP/VPID contexts if the TLB flush _may_ have been
> - * invoked via kvm_flush_remote_tlbs(), which always passes %true for
> - * @invalidate_gpa. Flushing remote TLBs requires all contexts to be
> - * flushed, not just the active context.
> + * Flush all EPTP/VPID contexts, as the TLB flush _may_ have been
> + * invoked via kvm_flush_remote_tlbs(). Flushing remote TLBs requires
> + * all contexts to be flushed, not just the active context.
> *
> * Note, this also ensures a deferred TLB flush with VPID enabled and
> * EPT disabled invalidates the "correct" VPID, by nuking both L1 and
> * L2's VPIDs.
> */
> - if (invalidate_gpa) {
> - if (enable_ept) {
> - ept_sync_global();
> - } else if (enable_vpid) {
> - if (cpu_has_vmx_invvpid_global()) {
> - vpid_sync_vcpu_global();
> - } else {
> - vpid_sync_vcpu_single(vmx->vpid);
> - vpid_sync_vcpu_single(vmx->nested.vpid02);
> - }
> + if (enable_ept) {
> + ept_sync_global();
> + } else if (enable_vpid) {
> + if (cpu_has_vmx_invvpid_global()) {
> + vpid_sync_vcpu_global();
> + } else {
> + vpid_sync_vcpu_single(vmx->vpid);
> + vpid_sync_vcpu_single(vmx->nested.vpid02);
> }
> - } else {
> - __vmx_flush_tlb(vcpu, vmx->vpid, false);
> }
> }
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0b90ec2c93cf..84cbd7ca1e18 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2696,10 +2696,10 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
> vcpu->arch.time = 0;
> }
>
> -static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
> +static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
> {
> ++vcpu->stat.tlb_flush;
> - kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa);
> + kvm_x86_ops->tlb_flush(vcpu);
> }
>
> static void record_steal_time(struct kvm_vcpu *vcpu)
> @@ -8223,7 +8223,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> if (kvm_check_request(KVM_REQ_LOAD_MMU_PGD, vcpu))
> kvm_mmu_load_pgd(vcpu);
> if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
> - kvm_vcpu_flush_tlb(vcpu, true);
> + kvm_vcpu_flush_tlb(vcpu);
> if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
> vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
> r = 0;

Reviewed-by: Vitaly Kuznetsov <[email protected]>

--
Vitaly

2020-03-30 18:39:06

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 03/37] KVM: nVMX: Invalidate all EPTP contexts when emulating INVEPT for L1

On Mon, Mar 23, 2020 at 05:12:04PM -0700, Jim Mattson wrote:
> On Mon, Mar 23, 2020 at 4:51 PM Paolo Bonzini <[email protected]> wrote:
> >
> > On 23/03/20 17:44, Sean Christopherson wrote:
> > > So I think
> > >
> > > Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu")
> > >
> > > would be appropriate?
> > >
> >
> > Yes.
>
> I think it was actually commit efebf0aaec3d ("KVM: nVMX: Do not flush
> TLB on L1<->L2 transitions if L1 uses VPID and EPT").

Hmm, commit efebf0aaec3d only changed flushing behavior; it didn't
affect KVM's behavior with respect to refreshing unsync'd SPTEs, i.e.
reloading guest_mmu.

It's somewhat of a moot point, because _technically_ there is no bug since,
at the time of this fix, KVM always flushes and reloads on nested VM-Enter.

2021-08-03 01:46:48

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH v3 01/37] KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush

(I'm replying to a very old email, so many CCs are dropped.)

On Sat, Mar 21, 2020 at 5:33 AM Sean Christopherson
<[email protected]> wrote:
>
> Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
> a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB
> flushes require all contexts to be invalidated, not just the active
> contexts, e.g. all mappings in all contexts for a given HVA need to be
> invalidated on a mmu_notifier invalidation. Similarly, the instigator
> of the deferred TLB flush may be expecting all contexts to be flushed,
> e.g. vmx_vcpu_load_vmcs().
>
> Without nested VMX, flushing only the current EPTP/VPID context isn't
> problematic because KVM uses a constant VPID for each vCPU, and

Hello, Sean

Is the patch optimized for cases where nested VMX is active?
I think the non-nested cases are normal cases.

Although the related code has been changed, the logic of the patch
still applies today. Would it be better to restore the optimization
for the normal (non-nested) cases?

Thanks
Lai

> mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP
> for L1. In the rare case where a different EPTP is created or reused,
> KVM (currently) unconditionally flushes the new EPTP context prior to
> entering the guest.
>
> With nested VMX, KVM conditionally uses a different VPID for L2, and
> unconditionally uses a different EPTP for L2. Because KVM doesn't
> _intentionally_ guarantee L2's EPTP/VPID context is flushed on nested
> VM-Enter, it'd be possible for a malicious L1 to attack the host and/or
> different VMs by exploiting the lack of flushing for L2.
>
> 1) Launch nested guest from malicious L1.
>
> 2) Nested VM-Enter to L2.
>
> 3) Access target GPA 'g'. CPU inserts TLB entry tagged with L2's ASID
> mapping 'g' to host PFN 'x'.
>
> 4) Nested VM-Exit to L1.
>
> 5) L1 triggers kernel same-page merging (ksm) by duplicating/zeroing
> the page for PFN 'x'.
>
> 6) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and
> remaps the page to PFN 'y'. mmu_notifier sends invalidate command,
> KVM flushes TLB only for L1's ASID.
>
> 7) Host kernel reallocates PFN 'x' to some other task/guest.
>
> 8) Nested VM-Enter to L2. KVM does not invalidate L2's EPTP or VPID.
>
> 9) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its
> stale TLB entry.
>
> However, current KVM unconditionally flushes L1's EPTP/VPID context on
> nested VM-Exit. But, that behavior is mostly unintentional, KVM doesn't
> go out of its way to flush EPTP/VPID on nested VM-Enter/VM-Exit, rather
> a TLB flush is guaranteed to occur prior to re-entering L1 due to
> __kvm_mmu_new_cr3() always being called with skip_tlb_flush=false. On
> nested VM-Enter, this happens via kvm_init_shadow_ept_mmu() (nested EPT
> enabled) or in nested_vmx_load_cr3() (nested EPT disabled). On nested
> VM-Exit it occurs via nested_vmx_load_cr3().
>
> This also fixes a bug where a deferred TLB flush in the context of L2,
> with EPT disabled, would flush L1's VPID instead of L2's VPID, as
> vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode().
>
> Cc: Vitaly Kuznetsov <[email protected]>
> Cc: Ben Gardon <[email protected]>
> Cc: Jim Mattson <[email protected]>
> Cc: Junaid Shahid <[email protected]>
> Cc: Liran Alon <[email protected]>
> Cc: Boris Ostrovsky <[email protected]>
> Cc: John Haxby <[email protected]>
> Reviewed-by: Liran Alon <[email protected]>
> Fixes: efebf0aaec3d ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT")
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
> arch/x86/kvm/vmx/vmx.h | 28 +++++++++++++++++++++++++++-
> 1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index be93d597306c..d6d67b816ebe 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -518,7 +518,33 @@ static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid,
>
> static inline void vmx_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
> {
> - __vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid, invalidate_gpa);
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> + /*
> + * Flush all EPTP/VPID contexts if the TLB flush _may_ have been
> + * invoked via kvm_flush_remote_tlbs(), which always passes %true for
> + * @invalidate_gpa. Flushing remote TLBs requires all contexts to be
> + * flushed, not just the active context.
> + *
> + * Note, this also ensures a deferred TLB flush with VPID enabled and
> + * EPT disabled invalidates the "correct" VPID, by nuking both L1 and
> + * L2's VPIDs.
> + */
> + if (invalidate_gpa) {
> + if (enable_ept) {
> + ept_sync_global();
> + } else if (enable_vpid) {
> + if (cpu_has_vmx_invvpid_global()) {
> + vpid_sync_vcpu_global();
> + } else {
> + WARN_ON_ONCE(!cpu_has_vmx_invvpid_single());
> + vpid_sync_vcpu_single(vmx->vpid);
> + vpid_sync_vcpu_single(vmx->nested.vpid02);
> + }
> + }
> + } else {
> + __vmx_flush_tlb(vcpu, vmx->vpid, false);
> + }
> }
>
> static inline void decache_tsc_multiplier(struct vcpu_vmx *vmx)
> --
> 2.24.1
>

2021-08-03 15:41:20

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 01/37] KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush

On Tue, Aug 03, 2021, Lai Jiangshan wrote:
> (I'm replying to a very old email, so many CCs are dropped.)
>
> On Sat, Mar 21, 2020 at 5:33 AM Sean Christopherson
> <[email protected]> wrote:
> >
> > Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
> > a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB
> > flushes require all contexts to be invalidated, not just the active
> > contexts, e.g. all mappings in all contexts for a given HVA need to be
> > invalidated on a mmu_notifier invalidation. Similarly, the instigator
> > of the deferred TLB flush may be expecting all contexts to be flushed,
> > e.g. vmx_vcpu_load_vmcs().
> >
> > Without nested VMX, flushing only the current EPTP/VPID context isn't
> > problematic because KVM uses a constant VPID for each vCPU, and
>
> Hello, Sean
>
> Is the patch optimized for cases where nested VMX is active?

Well, this patch isn't, but KVM has since been optimized to do full EPT/VPID
flushes only when "necessary". Necessary in quotes because the two uses can
technically be further optimized, but doing so would incur significant complexity.

Use #1 is remote flushes from the MMU, which don't strictly require a global flush,
but KVM would need to propagate more information (mmu_role?) in order for responding
vCPUs to determine what contexts need to be flushed. And practically speaking,
for MMU flushes there's no meaningful difference when using TDP without nested
guests as the common case will be that each vCPU has a single active EPTP and
that EPTP will be affected by the MMU changes, i.e. needs to be flushed.

Use #2 is in VMX's pCPU migration path. Again, not strictly necessary as KVM could
theoretically track which pCPUs have run a particular vCPU and when that pCPU last
flushed EPT contexts, but fully solving the problem would be quite complex. Since
pCPU migration is always going to be a slow path, the extra complexity would be
very difficult to justify.

> I think the non-nested cases are normal cases.
>
> Although the related code has been changed, the logic of the patch
> still applies today. Would it be better to restore the optimization
> for the normal (non-nested) cases?

As above, vmx_flush_tlb_all() hasn't changed, but the callers have.

2021-08-04 03:56:49

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [PATCH v3 01/37] KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush

On Tue, Aug 3, 2021 at 11:39 PM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Aug 03, 2021, Lai Jiangshan wrote:
> > (I'm replying to a very old email, so many CCs are dropped.)
> >
> > On Sat, Mar 21, 2020 at 5:33 AM Sean Christopherson
> > <[email protected]> wrote:
> > >
> > > Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
> > > a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB
> > > flushes require all contexts to be invalidated, not just the active
> > > contexts, e.g. all mappings in all contexts for a given HVA need to be
> > > invalidated on a mmu_notifier invalidation. Similarly, the instigator
> > > of the deferred TLB flush may be expecting all contexts to be flushed,
> > > e.g. vmx_vcpu_load_vmcs().
> > >
> > > Without nested VMX, flushing only the current EPTP/VPID context isn't
> > > problematic because KVM uses a constant VPID for each vCPU, and
> >
> > Hello, Sean
> >
> > Is the patch optimized for cases where nested VMX is active?
>
> Well, this patch isn't, but KVM has since been optimized to do full EPT/VPID
> flushes only when "necessary". Necessary in quotes because the two uses can
> technically be further optimized, but doing so would incur significant complexity.

Hello, thanks for your reply.

I know there might be a lot of possible optimizations to be considered, many of
which are too complicated to be implemented.

The optimization I considered yesterday is "ept_sync_global() vs.
ept_sync_context(this_vcpu's)" in the case where the VM is using EPT and
doesn't allow nested VMs. (I failed to express this yesterday.)

In this case, the vCPU uses only a single root_hpa, and I think an ept
sync of that single context is enough for both cases you listed below.

When the context is flushed, the TLB is clean for the vCPU to run.

If kvm changes mmu->root_hpa, it is kvm's responsibility to request
another flush, which is already implemented.

In other words, KVM_REQ_TLB_FLUSH == KVM_REQ_TLB_FLUSH_CURRENT in this case.
And before this patch, kvm flushed only the single context rather than doing
a global flush.

>
> Use #1 is remote flushes from the MMU, which don't strictly require a global flush,
> but KVM would need to propagate more information (mmu_role?) in order for responding
> > vCPUs to determine what contexts need to be flushed. And practically speaking,
> for MMU flushes there's no meaningful difference when using TDP without nested
> guests as the common case will be that each vCPU has a single active EPTP and
> that EPTP will be affected by the MMU changes, i.e. needs to be flushed.

I don't see when we need "to determine what contexts", since the vCPU is
using only one context in this case (that is the assumption in my mind).
Could you please correct me if I'm wrong?

Thanks,
Lai.

>
> Use #2 is in VMX's pCPU migration path. Again, not strictly necessary as KVM could
> theoretically track which pCPUs have run a particular vCPU and when that pCPU last
> flushed EPT contexts, but fully solving the problem would be quite complex. Since
> pCPU migration is always going to be a slow path, the extra complexity would be
> very difficult to justify.
>
> > I think the non-nested cases are normal cases.
> >
> > Although the related code has been changed, the logic of the patch
> > still applies today. Would it be better to restore the optimization
> > for the normal (non-nested) cases?
>
> As above, vmx_flush_tlb_all() hasn't changed, but the callers have.
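
For reference, the current-context flavor being contrasted with
vmx_flush_tlb_all() looks roughly like this after the series; a sketch
of vmx_flush_tlb_current() reconstructed from the patches, so treat the
details as approximate:

static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	u64 root_hpa = vcpu->arch.mmu->root_hpa;

	/* No flush required if the current context is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	if (enable_ept)
		/* Single-context INVEPT of only the active EPTP. */
		ept_sync_context(construct_eptp(vcpu, root_hpa));
	else if (!is_guest_mode(vcpu))
		vpid_sync_context(to_vmx(vcpu)->vpid);
	else
		/* L2 is active, flush L2's effective VPID (vpid02). */
		vpid_sync_context(nested_get_vpid02(vcpu));
}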

2021-08-04 15:37:28

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 01/37] KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush

On Wed, Aug 04, 2021, Lai Jiangshan wrote:
> The optimization I considered yesterday is "ept_sync_global() vs.
> ept_sync_context(this_vcpu's)" in the case where the VM is using EPT and
> doesn't allow nested VMs. (I failed to express this yesterday.)
>
> In this case, the vCPU uses only a single root_hpa,

This is not strictly guaranteed. kvm_mmu_page_role tracks efer.NX, cr0.wp, and
cr4.SMEP/SMAP (if cr0.wp=0), which means that KVM will create a different root
if the guest toggles any of those bits. I'm pretty sure that can be changed and
will look into doing so in the near future[*], but even that wouldn't guarantee
a single root.

SMM is also incorporated in the page role and will result in different roots
for SMM vs. non-SMM. This is mandatory because SMM has its own memslot view.

A CPUID.MAXPHYADDR change can also change the role, but in this case zapping all
roots will always be the correct/desired behavior.

[*] https://lkml.kernel.org/r/[email protected]

> and I think an ept sync of that single context is enough for both cases you listed below.
>
> When the context is flushed, the TLB is clean for the vCPU to run.
>
> If kvm changes mmu->root_hpa, it is kvm's responsibility to request
> another flush, which is already implemented.

KVM needs to flush when it allocates a new root, largely because it has no way
of knowing if some other entity previously created a CR3/EPTP at that HPA, but
KVM isn't strictly required to flush when switching to a previous/cached root.

Currently this is a moot point because kvm_post_set_cr0(), kvm_post_set_cr4(),
set_efer(), and kvm_smm_changed() all do kvm_mmu_reset_context() instead of
attempting a fast PGD switch, but I am hoping to change this as well, at least
for the non-SMM cases.

> In other words, KVM_REQ_TLB_FLUSH == KVM_REQ_TLB_FLUSH_CURRENT in this case.
> And before this patch, kvm flushed only the single context rather than doing
> a global flush.
>
> >
> > Use #1 is remote flushes from the MMU, which don't strictly require a global flush,
> > but KVM would need to propagate more information (mmu_role?) in order for responding
> > vCPUs to determine what contexts need to be flushed. And practically speaking,
> > for MMU flushes there's no meaningful difference when using TDP without nested
> > guests as the common case will be that each vCPU has a single active EPTP and
> > that EPTP will be affected by the MMU changes, i.e. needs to be flushed.
>
> I don't see when we need "to determine what contexts", since the vCPU is
> using only one context in this case (that is the assumption in my mind).
> Could you please correct me if I'm wrong?

As it exists today, I believe you're correct that KVM will only ever have a
single reachable TDP root, but only because of overzealous kvm_mmu_reset_context()
usage. The SMM case in particular could be optimized to not zap all roots (whether
or not it's worth optimizing is another question).

All that said, the easiest way to query the number of reachable roots would be to
check the previous/cached root.

But, even if we can guarantee there's exactly one reachable root, I would be
surprised if doing INVEPT.context instead of INVEPT.global actually provided any
meaningful performance benefit. Using INVEPT.context is safe if and only if there
are no other TLB entries for this vCPU, and KVM must invalidate on pCPU migration,
so there can't be collateral damage in that sense.

That leaves the latency of INVEPT as the only possible performance delta, and that
will be uarch specific. It's entirely possible INVEPT.global is slower, but again
I would be surprised if it is so much slower than INVEPT.context that it actually
impacts guest performance given that its use is limited to slow paths.
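
For reference, the two INVEPT flavors being weighed here are thin
wrappers around the INVEPT instruction; roughly, as sketched from KVM's
VMX helpers (details approximate):

static inline void ept_sync_global(void)
{
	/* Invalidate mappings for every EPTP context. */
	__invept(VMX_EPT_EXTENT_GLOBAL, 0, 0);
}

static inline void ept_sync_context(u64 eptp)
{
	/* Single-context INVEPT, falling back to global if unsupported. */
	if (cpu_has_vmx_invept_context())
		__invept(VMX_EPT_EXTENT_CONTEXT, eptp, 0);
	else
		ept_sync_global();
}

Both execute the same INVEPT instruction with a different descriptor
type, which is why any latency delta between them is purely
microarchitectural, per the point above.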