LinuxLists.cc - [PATCH MANUALSEL 5.15 1/4] KVM: x86: do not report a vCPU as preempted outside instruction boundaries

2022-06-14 03:06:40

Subject: [PATCH MANUALSEL 5.15 1/4] KVM: x86: do not report a vCPU as preempted outside instruction boundaries

From: Paolo Bonzini <[email protected]>

[ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]

If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access. A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.

To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary. A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.

It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed. However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses. Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.

Reported-by: Jann Horn <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/svm/svm.c | 2 ++
arch/x86/kvm/vmx/vmx.c | 1 +
arch/x86/kvm/x86.c | 22 ++++++++++++++++++++++
4 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49d814b2a341..a35f5e23fc2a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -642,6 +642,7 @@ struct kvm_vcpu_arch {
u64 ia32_misc_enable_msr;
u64 smbase;
u64 smi_count;
+ bool at_instruction_boundary;
bool tpr_access_reporting;
bool xsaves_enabled;
u64 ia32_xss;
@@ -1271,6 +1272,8 @@ struct kvm_vcpu_stat {
u64 nested_run;
u64 directed_yield_attempted;
u64 directed_yield_successful;
+ u64 preemption_reported;
+ u64 preemption_other;
u64 guest_mode;
};

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 26f2da1590ed..5b51156712f7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4263,6 +4263,8 @@ static int svm_check_intercept(struct kvm_vcpu *vcpu,

static void svm_handle_exit_irqoff(struct kvm_vcpu *vcpu)
{
+ if (to_svm(vcpu)->vmcb->control.exit_code == SVM_EXIT_INTR)
+ vcpu->arch.at_instruction_boundary = true;
}

static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 16a660a0ed5f..08f9f05a6cb2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6390,6 +6390,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
return;

handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+ vcpu->arch.at_instruction_boundary = true;
}

static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23905ba3058a..fbf10fa99507 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -277,6 +277,8 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
STATS_DESC_COUNTER(VCPU, nested_run),
STATS_DESC_COUNTER(VCPU, directed_yield_attempted),
STATS_DESC_COUNTER(VCPU, directed_yield_successful),
+ STATS_DESC_COUNTER(VCPU, preemption_reported),
+ STATS_DESC_COUNTER(VCPU, preemption_other),
STATS_DESC_ICOUNTER(VCPU, guest_mode)
};

@@ -4368,6 +4370,19 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
struct kvm_memslots *slots;
static const u8 preempted = KVM_VCPU_PREEMPTED;

+ /*
+ * The vCPU can be marked preempted if and only if the VM-Exit was on
+ * an instruction boundary and will not trigger guest emulation of any
+ * kind (see vcpu_run). Vendor specific code controls (conservatively)
+ * when this is true, for example allowing the vCPU to be marked
+ * preempted if and only if the VM-Exit was due to a host interrupt.
+ */
+ if (!vcpu->arch.at_instruction_boundary) {
+ vcpu->stat.preemption_other++;
+ return;
+ }
+
+ vcpu->stat.preemption_reported++;
if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

@@ -9936,6 +9951,13 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.l1tf_flush_l1d = true;

for (;;) {
+ /*
+ * If another guest vCPU requests a PV TLB flush in the middle
+ * of instruction emulation, the rest of the emulation could
+ * use a stale page translation. Assume that any code after
+ * this point can start executing an instruction.
+ */
+ vcpu->arch.at_instruction_boundary = false;
if (kvm_vcpu_running(vcpu)) {
r = vcpu_enter_guest(vcpu);
} else {
--
2.35.1

2022-06-14 03:35:02

by Sasha Levin

[permalink] [raw]

Subject: [PATCH MANUALSEL 5.15 2/4] KVM: x86: do not set st->preempted when going back to user space

From: Paolo Bonzini <[email protected]>

[ Upstream commit 54aa83c90198e68eee8b0850c749bc70efb548da ]

Similar to the Xen path, only change the vCPU's reported state if the vCPU
was actually preempted. The reason for KVM's behavior is that for example
optimistic spinning might not be a good idea if the guest is doing repeated
exits to userspace; however, it is confusing and unlikely to make a difference,
because well-tuned guests will hardly ever exit KVM_RUN in the first place.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
---
arch/x86/kvm/x86.c | 26 ++++++++++++++------------
arch/x86/kvm/xen.h | 6 ++++--
2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fbf10fa99507..16c1c5403b52 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4412,19 +4412,21 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
int idx;

- if (vcpu->preempted && !vcpu->arch.guest_state_protected)
- vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
+ if (vcpu->preempted) {
+ if (!vcpu->arch.guest_state_protected)
+ vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);

- /*
- * Take the srcu lock as memslots will be accessed to check the gfn
- * cache generation against the memslots generation.
- */
- idx = srcu_read_lock(&vcpu->kvm->srcu);
- if (kvm_xen_msr_enabled(vcpu->kvm))
- kvm_xen_runstate_set_preempted(vcpu);
- else
- kvm_steal_time_set_preempted(vcpu);
- srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ /*
+ * Take the srcu lock as memslots will be accessed to check the gfn
+ * cache generation against the memslots generation.
+ */
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ if (kvm_xen_msr_enabled(vcpu->kvm))
+ kvm_xen_runstate_set_preempted(vcpu);
+ else
+ kvm_steal_time_set_preempted(vcpu);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ }

static_call(kvm_x86_vcpu_put)(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index cc0cf5f37450..a7693a286e40 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -97,8 +97,10 @@ static inline void kvm_xen_runstate_set_preempted(struct kvm_vcpu *vcpu)
* behalf of the vCPU. Only if the VMM does actually block
* does it need to enter RUNSTATE_blocked.
*/
- if (vcpu->preempted)
- kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable);
+ if (WARN_ON_ONCE(!vcpu->preempted))
+ return;
+
+ kvm_xen_update_runstate_guest(vcpu, RUNSTATE_runnable);
}

/* 32-bit compatibility definitions, also used natively in 32-bit build */
--
2.35.1

2022-06-15 15:07:46

by Jann Horn

[permalink] [raw]

Subject: Re: [PATCH MANUALSEL 5.15 1/4] KVM: x86: do not report a vCPU as preempted outside instruction boundaries

On Tue, Jun 14, 2022 at 4:11 AM Sasha Levin <[email protected]> wrote:
>
> From: Paolo Bonzini <[email protected]>
>
> [ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]
>
> If a vCPU is outside guest mode and is scheduled out, it might be in the
> process of making a memory access. A problem occurs if another vCPU uses
> the PV TLB flush feature during the period when the vCPU is scheduled
> out, and a virtual address has already been translated but has not yet
> been accessed, because this is equivalent to using a stale TLB entry.
>
> To avoid this, only report a vCPU as preempted if sure that the guest
> is at an instruction boundary. A rescheduling request will be delivered
> to the host physical CPU as an external interrupt, so for simplicity
> consider any vmexit *not* instruction boundary except for external
> interrupts.
>
> It would in principle be okay to report the vCPU as preempted also
> if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
> vmentry/vmexit overhead unnecessarily, and optimistic spinning is
> also unlikely to succeed. However, leave it for later because right
> now kvm_vcpu_check_block() is doing memory accesses. Even
> though the TLB flush issue only applies to virtual memory address,
> it's very much preferrable to be conservative.
>
> Reported-by: Jann Horn <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Sasha Levin <[email protected]>

This feature was introduced in commit f38a7b75267f1f (first in 4.16).
I think the fix has to be applied all the way back to there (so
additionally to what you already did, it'd have to be added to 4.19,
5.4 and 5.10)?

But it doesn't seem to apply cleanly to those older branches. Paolo,
are you going to send stable backports of this?

2022-06-15 15:27:59

by Jann Horn

[permalink] [raw]

Subject: Re: [PATCH MANUALSEL 5.15 1/4] KVM: x86: do not report a vCPU as preempted outside instruction boundaries

On Wed, Jun 15, 2022 at 4:53 PM Jann Horn <[email protected]> wrote:
>
> On Tue, Jun 14, 2022 at 4:11 AM Sasha Levin <[email protected]> wrote:
> >
> > From: Paolo Bonzini <[email protected]>
> >
> > [ Upstream commit 6cd88243c7e03845a450795e134b488fc2afb736 ]
> >
> > If a vCPU is outside guest mode and is scheduled out, it might be in the
> > process of making a memory access. A problem occurs if another vCPU uses
> > the PV TLB flush feature during the period when the vCPU is scheduled
> > out, and a virtual address has already been translated but has not yet
> > been accessed, because this is equivalent to using a stale TLB entry.
> >
> > To avoid this, only report a vCPU as preempted if sure that the guest
> > is at an instruction boundary. A rescheduling request will be delivered
> > to the host physical CPU as an external interrupt, so for simplicity
> > consider any vmexit *not* instruction boundary except for external
> > interrupts.
> >
> > It would in principle be okay to report the vCPU as preempted also
> > if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
> > vmentry/vmexit overhead unnecessarily, and optimistic spinning is
> > also unlikely to succeed. However, leave it for later because right
> > now kvm_vcpu_check_block() is doing memory accesses. Even
> > though the TLB flush issue only applies to virtual memory address,
> > it's very much preferrable to be conservative.
> >
> > Reported-by: Jann Horn <[email protected]>
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > Signed-off-by: Sasha Levin <[email protected]>
>
> This feature was introduced in commit f38a7b75267f1f (first in 4.16).
> I think the fix has to be applied all the way back to there (so
> additionally to what you already did, it'd have to be added to 4.19,
> 5.4 and 5.10)?
>
> But it doesn't seem to apply cleanly to those older branches. Paolo,
> are you going to send stable backports of this?

Also, I think the same thing applies for "KVM: x86: do not set
st->preempted when going back to user space"?