2019-08-05 02:04:25

by Wanpeng Li

Subject: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

From: Wanpeng Li <[email protected]>

After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings in the
VMs after stress testing:

INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
Call Trace:
flush_tlb_mm_range+0x68/0x140
tlb_flush_mmu.part.75+0x37/0xe0
tlb_finish_mmu+0x55/0x60
zap_page_range+0x142/0x190
SyS_madvise+0x3cd/0x9c0
system_call_fastpath+0x1c/0x21

swait_active() remains true until finish_swait() is called in
kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
by the kvm_vcpu_on_spin() loop. This greatly increases the probability
that kvm_arch_vcpu_runnable(vcpu) is checked and returns true; when APICv
is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via
vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that is spinning
on a taken lock.
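
To make the failure mode concrete, here is a simplified sketch of the call
chain (illustrative only; the chain is abbreviated, and the function names are
taken from this patch's context rather than quoted from the exact upstream
code). The essential point is that a VMCS write always targets whatever VMCS
is currently loaded on the pCPU doing the write:

    /*
     * Simplified sketch, not the exact upstream call chain.
     * kvm_vcpu_on_spin() runs on the spinning vCPU's pCPU, so any VMCS
     * write reached from the runnable check updates the VMCS loaded on
     * *this* pCPU (the spinner's), not the yield candidate's.
     *
     *   kvm_vcpu_on_spin(me)
     *     kvm_arch_vcpu_runnable(candidate)            // x86
     *       kvm_vcpu_has_events(candidate)
     *         ... -> vmx_sync_pir_to_irr(candidate)    // APICv enabled
     *                -> candidate's RVI written into the currently
     *                   loaded (spinner's) VMCS
     */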

This patch fixes it by conservatively checking only a subset of events.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: [email protected]
Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: Wanpeng Li <[email protected]>
---
v3 -> v4:
* just test KVM_REQ_*
* rename the hook to apicv_has_pending_interrupt
* wrap with #ifdef CONFIG_KVM_ASYNC_PF
v2 -> v3:
* check conservatively a subset of events
v1 -> v2:
* checking swait_active(&vcpu->wq) for involuntary preemption

arch/mips/kvm/mips.c | 5 +++++
arch/powerpc/kvm/powerpc.c | 5 +++++
arch/s390/kvm/kvm-s390.c | 5 +++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/svm.c | 6 ++++++
arch/x86/kvm/vmx/vmx.c | 6 ++++++
arch/x86/kvm/x86.c | 16 ++++++++++++++++
include/linux/kvm_host.h | 1 +
virt/kvm/arm/arm.c | 5 +++++
virt/kvm/kvm_main.c | 16 +++++++++++++++-
10 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 2cfe839..95a4642 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -98,6 +98,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return !!(vcpu->arch.pending_exceptions);
}

+bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
+{
+ return kvm_arch_vcpu_runnable(vcpu);
+}
+
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
return false;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 0dba7eb..3e34d5f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -50,6 +50,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
return !!(v->arch.pending_exceptions) || kvm_request_pending(v);
}

+bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
+{
+ return kvm_arch_vcpu_runnable(vcpu);
+}
+
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
return false;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 3f520cd8..5623b23 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -3102,6 +3102,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_s390_vcpu_has_irq(vcpu, 0);
}

+bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
+{
+ return kvm_arch_vcpu_runnable(vcpu);
+}
+
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
return !(vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE);
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7b0a4ee..25ffa7c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1175,6 +1175,7 @@ struct kvm_x86_ops {
int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
uint32_t guest_irq, bool set);
void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
+ bool (*apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);

int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
bool *expired);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7eafc69..1b4384f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5190,6 +5190,11 @@ static void svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
kvm_vcpu_wake_up(vcpu);
}

+static bool svm_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
+{
+ return false;
+}
+
static void svm_ir_list_del(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
{
unsigned long flags;
@@ -7314,6 +7319,7 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {

.pmu_ops = &amd_pmu_ops,
.deliver_posted_interrupt = svm_deliver_avic_intr,
+ .apicv_has_pending_interrupt = svm_apicv_has_pending_interrupt,
.update_pi_irte = svm_update_pi_irte,
.setup_mce = svm_setup_mce,

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 074385c..59871b6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6117,6 +6117,11 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
return max_irr;
}

+static bool vmx_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
+{
+ return pi_test_on(vcpu_to_pi_desc(vcpu));
+}
+
static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
@@ -7726,6 +7731,7 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
.sync_pir_to_irr = vmx_sync_pir_to_irr,
.deliver_posted_interrupt = vmx_deliver_posted_interrupt,
+ .apicv_has_pending_interrupt = vmx_apicv_has_pending_interrupt,

.set_tss_addr = vmx_set_tss_addr,
.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c6d951c..f715efb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9698,6 +9698,22 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
}

+bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
+{
+ if (READ_ONCE(vcpu->arch.pv.pv_unhalted))
+ return true;
+
+ if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
+ kvm_test_request(KVM_REQ_SMI, vcpu) ||
+ kvm_test_request(KVM_REQ_EVENT, vcpu))
+ return true;
+
+ if (vcpu->arch.apicv_active && kvm_x86_ops->apicv_has_pending_interrupt(vcpu))
+ return true;
+
+ return false;
+}
+
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
return vcpu->arch.preempted_in_kernel;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5c5b586..9e4c2bb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -872,6 +872,7 @@ int kvm_arch_check_processor_compat(void);
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu);

#ifndef __KVM_HAVE_ARCH_VM_ALLOC
/*
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index acc4324..2927895 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -444,6 +444,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
&& !v->arch.power_off && !v->arch.pause);
}

+bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
+{
+ return kvm_arch_vcpu_runnable(vcpu);
+}
+
bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
return vcpu_mode_priv(vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887f3b0..e121423 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2477,6 +2477,20 @@ static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
#endif
}

+static bool vcpu_runnable(struct kvm_vcpu *vcpu)
+{
+ /* It is called outside vcpu_load/vcpu_put */
+ if (kvm_arch_dy_runnable(vcpu))
+ return true;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+ if (!list_empty_careful(&vcpu->async_pf.done))
+ return true;
+#endif
+
+ return false;
+}
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
{
struct kvm *kvm = me->kvm;
@@ -2506,7 +2520,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (vcpu == me)
continue;
- if (swait_active(&vcpu->wq) && !kvm_arch_vcpu_runnable(vcpu))
+ if (swait_active(&vcpu->wq) && !vcpu_runnable(vcpu))
continue;
if (yield_to_kernel_mode && !kvm_arch_vcpu_in_kernel(vcpu))
continue;
--
2.7.4


2019-08-05 02:04:27

by Wanpeng Li

Subject: [PATCH v4 2/6] KVM: LAPIC: Don't need to wake up vCPU twice after timer fires

From: Wanpeng Li <[email protected]>

kvm_set_pending_timer() already takes care of waking up a sleeping vCPU that
has a pending timer, so there is no need to check for this again in
apic_timer_expired().
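
For reference, the x86 kvm_set_pending_timer() already kicks the target vCPU,
roughly along these lines (a sketch of the helper as of this series, not a
verbatim copy):

    void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
    {
    	kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
    	/* kicks the vCPU and wakes it if it is blocked in kvm_vcpu_block() */
    	kvm_vcpu_kick(vcpu);
    }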

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/lapic.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 0aa1586..685d17c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1548,7 +1548,6 @@ static void kvm_apic_inject_pending_timer_irqs(struct kvm_lapic *apic)
static void apic_timer_expired(struct kvm_lapic *apic)
{
struct kvm_vcpu *vcpu = apic->vcpu;
- struct swait_queue_head *q = &vcpu->wq;
struct kvm_timer *ktimer = &apic->lapic_timer;

if (atomic_read(&apic->lapic_timer.pending))
@@ -1566,13 +1565,6 @@ static void apic_timer_expired(struct kvm_lapic *apic)

atomic_inc(&apic->lapic_timer.pending);
kvm_set_pending_timer(vcpu);
-
- /*
- * For x86, the atomic_inc() is serialized, thus
- * using swait_active() is safe.
- */
- if (swait_active(q))
- swake_up_one(q);
}

static void start_sw_tscdeadline(struct kvm_lapic *apic)
--
2.7.4

2019-08-05 02:04:37

by Wanpeng Li

Subject: [PATCH v4 4/6] KVM: X86: Use IPI shorthands in kvm guest when supported

From: Wanpeng Li <[email protected]>

IPI shorthand is now supported by the Linux apic/x2apic driver. Switch to the
'all excluding self' and 'all including self' destination shorthands in the
kvm guest to avoid splitting the target mask into several PV IPI hypercalls.
This patch removes kvm_send_ipi_all() and kvm_send_ipi_allbutself(), since the
callers in the APIC code already check apic_use_ipi_shorthand and fall back to
->send_IPI_mask and ->send_IPI_mask_allbutself when it is false.
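
Roughly, that fallback in the generic APIC layer looks like the sketch below.
Only apic_use_ipi_shorthand and the ->send_IPI_mask_allbutself callback come
from the description above; the function name, the static-key usage and the
send_shorthand_ipi() helper are illustrative assumptions, not the exact tip
tree code:

    static void send_IPI_allbutself_sketch(int vector)
    {
    	if (static_branch_likely(&apic_use_ipi_shorthand))
    		/* one "all excluding self" shorthand IPI covers every other CPU */
    		send_shorthand_ipi(APIC_DEST_ALLBUT, vector);  /* hypothetical helper */
    	else
    		/* no shorthand: fall back to the mask op, e.g. KVM's PV IPI path */
    		apic->send_IPI_mask_allbutself(cpu_online_mask, vector);
    }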

Cc: Thomas Gleixner <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Nadav Amit <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
Note: rebased against the tip tree's x86/apic branch, but only KVM code is modified
v1 -> v2:
* remove kvm_send_ipi_all() and kvm_send_ipi_allbutself()

arch/x86/kernel/kvm.c | 12 ------------
1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b7f34fe..96626d8 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -505,16 +505,6 @@ static void kvm_send_ipi_mask_allbutself(const struct cpumask *mask, int vector)
__send_ipi_mask(local_mask, vector);
}

-static void kvm_send_ipi_allbutself(int vector)
-{
- kvm_send_ipi_mask_allbutself(cpu_online_mask, vector);
-}
-
-static void kvm_send_ipi_all(int vector)
-{
- __send_ipi_mask(cpu_online_mask, vector);
-}
-
/*
* Set the IPI entry points
*/
@@ -522,8 +512,6 @@ static void kvm_setup_pv_ipi(void)
{
apic->send_IPI_mask = kvm_send_ipi_mask;
apic->send_IPI_mask_allbutself = kvm_send_ipi_mask_allbutself;
- apic->send_IPI_allbutself = kvm_send_ipi_allbutself;
- apic->send_IPI_all = kvm_send_ipi_all;
pr_info("KVM setup pv IPIs\n");
}

--
2.7.4

2019-08-05 02:04:38

by Wanpeng Li

Subject: [PATCH v4 6/6] KVM: X86: Add pv tlb shootdown tracepoint

From: Wanpeng Li <[email protected]>

Add pv tlb shootdown tracepoint.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/trace.h | 19 +++++++++++++++++++
arch/x86/kvm/x86.c | 2 ++
2 files changed, 21 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index ce6ee34..84f32d3 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1487,6 +1487,25 @@ TRACE_EVENT(kvm_pv_send_ipi,
TP_printk("vector %d min 0x%x ipi_bitmap_low 0x%lx ipi_bitmap_high 0x%lx",
__entry->vector, __entry->min, __entry->ipi_bitmap_low, __entry->ipi_bitmap_high)
);
+
+TRACE_EVENT(kvm_pv_tlb_flush,
+ TP_PROTO(unsigned int vcpu_id, bool need_flush_tlb),
+ TP_ARGS(vcpu_id, need_flush_tlb),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, vcpu_id )
+ __field( bool, need_flush_tlb )
+ ),
+
+ TP_fast_assign(
+ __entry->vcpu_id = vcpu_id;
+ __entry->need_flush_tlb = need_flush_tlb;
+ ),
+
+ TP_printk("vcpu %u need_flush_tlb %s", __entry->vcpu_id,
+ __entry->need_flush_tlb ? "true" : "false")
+);
+
#endif /* _TRACE_KVM_H */

#undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f715efb..7a302cd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2452,6 +2452,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
* Doing a TLB flush here, on the guest's behalf, can avoid
* expensive IPIs.
*/
+ trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
+ vcpu->arch.st.steal.preempted & KVM_VCPU_FLUSH_TLB);
if (xchg(&vcpu->arch.st.steal.preempted, 0) & KVM_VCPU_FLUSH_TLB)
kvm_vcpu_flush_tlb(vcpu, false);

--
2.7.4

2019-08-05 02:04:58

by Wanpeng Li

Subject: [PATCH v4 5/6] KVM: LAPIC: Add pv ipi tracepoint

From: Wanpeng Li <[email protected]>

Add pv ipi tracepoint.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
arch/x86/kvm/lapic.c | 2 ++
arch/x86/kvm/trace.h | 25 +++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 685d17c..df5cd07 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -570,6 +570,8 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
irq.level = (icr & APIC_INT_ASSERT) != 0;
irq.trig_mode = icr & APIC_INT_LEVELTRIG;

+ trace_kvm_pv_send_ipi(irq.vector, min, ipi_bitmap_low, ipi_bitmap_high);
+
if (icr & APIC_DEST_MASK)
return -KVM_EINVAL;
if (icr & APIC_SHORT_MASK)
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index b5c831e..ce6ee34 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1462,6 +1462,31 @@ TRACE_EVENT(kvm_hv_send_ipi_ex,
__entry->vector, __entry->format,
__entry->valid_bank_mask)
);
+
+/*
+ * Tracepoints for kvm_pv_send_ipi.
+ */
+TRACE_EVENT(kvm_pv_send_ipi,
+ TP_PROTO(u32 vector, u32 min, unsigned long ipi_bitmap_low, unsigned long ipi_bitmap_high),
+ TP_ARGS(vector, min, ipi_bitmap_low, ipi_bitmap_high),
+
+ TP_STRUCT__entry(
+ __field(u32, vector)
+ __field(u32, min)
+ __field(unsigned long, ipi_bitmap_low)
+ __field(unsigned long, ipi_bitmap_high)
+ ),
+
+ TP_fast_assign(
+ __entry->vector = vector;
+ __entry->min = min;
+ __entry->ipi_bitmap_low = ipi_bitmap_low;
+ __entry->ipi_bitmap_high = ipi_bitmap_high;
+ ),
+
+ TP_printk("vector %d min 0x%x ipi_bitmap_low 0x%lx ipi_bitmap_high 0x%lx",
+ __entry->vector, __entry->min, __entry->ipi_bitmap_low, __entry->ipi_bitmap_high)
+);
#endif /* _TRACE_KVM_H */

#undef TRACE_INCLUDE_PATH
--
2.7.4

2019-08-05 02:05:20

by Wanpeng Li

Subject: [PATCH v4 3/6] KVM: Check preempted_in_kernel for involuntary preemption

From: Wanpeng Li <[email protected]>

preempted_in_kernel is updated in the preempt notifier when involuntary
preemption occurs, so it can be stale by the time voluntarily preempted vCPUs
are taken into account by the kvm_vcpu_on_spin() loop. This patch therefore
checks preempted_in_kernel only for involuntarily preempted vCPUs.

Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
virt/kvm/kvm_main.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e121423..2ae9e84 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2522,7 +2522,8 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
continue;
if (swait_active(&vcpu->wq) && !vcpu_runnable(vcpu))
continue;
- if (yield_to_kernel_mode && !kvm_arch_vcpu_in_kernel(vcpu))
+ if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
+ !kvm_arch_vcpu_in_kernel(vcpu))
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
@@ -4219,7 +4220,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
{
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

- vcpu->preempted = false;
+ WRITE_ONCE(vcpu->preempted, false);
WRITE_ONCE(vcpu->ready, false);

kvm_arch_sched_in(vcpu, cpu);
@@ -4233,7 +4234,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

if (current->state == TASK_RUNNING) {
- vcpu->preempted = true;
+ WRITE_ONCE(vcpu->preempted, true);
WRITE_ONCE(vcpu->ready, true);
}
kvm_arch_vcpu_put(vcpu);
--
2.7.4

2019-08-05 23:18:28

by Paolo Bonzini

Subject: Re: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

On 05/08/19 04:03, Wanpeng Li wrote:
> From: Wanpeng Li <[email protected]>
>
> After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
> five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
> on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings in the
> VMs after stress testing:
>
> INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
> Call Trace:
> flush_tlb_mm_range+0x68/0x140
> tlb_flush_mmu.part.75+0x37/0xe0
> tlb_finish_mmu+0x55/0x60
> zap_page_range+0x142/0x190
> SyS_madvise+0x3cd/0x9c0
> system_call_fastpath+0x1c/0x21
>
> swait_active() remains true until finish_swait() is called in
> kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
> by the kvm_vcpu_on_spin() loop. This greatly increases the probability
> that kvm_arch_vcpu_runnable(vcpu) is checked and returns true; when APICv
> is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via
> vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that is spinning
> on a taken lock.
>
> This patch fixes it by conservatively checking only a subset of events.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Radim Krčmář <[email protected]>
> Cc: Christian Borntraeger <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: [email protected]
> Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
> Signed-off-by: Wanpeng Li <[email protected]>
> ---
> v3 -> v4:
> * just test KVM_REQ_*
> * rename the hook to apicv_has_pending_interrupt
> * wrap with #ifdef CONFIG_KVM_ASYNC_PF
> v2 -> v3:
> * check conservatively a subset of events
> v1 -> v2:
> * checking swait_active(&vcpu->wq) for involuntary preemption
>
> arch/mips/kvm/mips.c | 5 +++++
> arch/powerpc/kvm/powerpc.c | 5 +++++
> arch/s390/kvm/kvm-s390.c | 5 +++++
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/svm.c | 6 ++++++
> arch/x86/kvm/vmx/vmx.c | 6 ++++++
> arch/x86/kvm/x86.c | 16 ++++++++++++++++
> include/linux/kvm_host.h | 1 +
> virt/kvm/arm/arm.c | 5 +++++
> virt/kvm/kvm_main.c | 16 +++++++++++++++-
> 10 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> index 2cfe839..95a4642 100644
> --- a/arch/mips/kvm/mips.c
> +++ b/arch/mips/kvm/mips.c
> @@ -98,6 +98,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> return !!(vcpu->arch.pending_exceptions);
> }
>
> +bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)

Using a __weak definition for the default implementation is a bit more
concise. Queued with that change.
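
I.e., keep a single default in common code that architectures can override,
along these lines (a sketch of the suggestion, not necessarily the exact code
that was applied):

    bool __weak kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
    {
    	/* architectures with a cheaper or stricter check override this */
    	return kvm_arch_vcpu_runnable(vcpu);
    }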

Paolo

2019-08-06 00:38:46

by Wanpeng Li

Subject: Re: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

On Tue, 6 Aug 2019 at 07:17, Paolo Bonzini <[email protected]> wrote:
>
> On 05/08/19 04:03, Wanpeng Li wrote:
> > From: Wanpeng Li <[email protected]>
> >
> > After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
> > five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
> > on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings in the
> > VMs after stress testing:
> >
> > INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
> > Call Trace:
> > flush_tlb_mm_range+0x68/0x140
> > tlb_flush_mmu.part.75+0x37/0xe0
> > tlb_finish_mmu+0x55/0x60
> > zap_page_range+0x142/0x190
> > SyS_madvise+0x3cd/0x9c0
> > system_call_fastpath+0x1c/0x21
> >
> > swait_active() remains true until finish_swait() is called in
> > kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
> > by the kvm_vcpu_on_spin() loop. This greatly increases the probability
> > that kvm_arch_vcpu_runnable(vcpu) is checked and returns true; when APICv
> > is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via
> > vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that is spinning
> > on a taken lock.
> >
> > This patch fixes it by conservatively checking only a subset of events.
> >
> > Cc: Paolo Bonzini <[email protected]>
> > Cc: Radim Krčmář <[email protected]>
> > Cc: Christian Borntraeger <[email protected]>
> > Cc: Marc Zyngier <[email protected]>
> > Cc: [email protected]
> > Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
> > Signed-off-by: Wanpeng Li <[email protected]>
> > ---
> > v3 -> v4:
> > * just test KVM_REQ_*
> > * rename the hook to apicv_has_pending_interrupt
> > * wrap with #ifdef CONFIG_KVM_ASYNC_PF
> > v2 -> v3:
> > * check conservatively a subset of events
> > v1 -> v2:
> > * checking swait_active(&vcpu->wq) for involuntary preemption
> >
> > arch/mips/kvm/mips.c | 5 +++++
> > arch/powerpc/kvm/powerpc.c | 5 +++++
> > arch/s390/kvm/kvm-s390.c | 5 +++++
> > arch/x86/include/asm/kvm_host.h | 1 +
> > arch/x86/kvm/svm.c | 6 ++++++
> > arch/x86/kvm/vmx/vmx.c | 6 ++++++
> > arch/x86/kvm/x86.c | 16 ++++++++++++++++
> > include/linux/kvm_host.h | 1 +
> > virt/kvm/arm/arm.c | 5 +++++
> > virt/kvm/kvm_main.c | 16 +++++++++++++++-
> > 10 files changed, 65 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
> > index 2cfe839..95a4642 100644
> > --- a/arch/mips/kvm/mips.c
> > +++ b/arch/mips/kvm/mips.c
> > @@ -98,6 +98,11 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
> > return !!(vcpu->arch.pending_exceptions);
> > }
> >
> > +bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
>
> Using a __weak definition for the default implementation is a bit more
> concise. Queued with that change.

Thank you, Paolo! Btw, how about the other 5 patches?

Regards,
Wanpeng Li

2019-08-06 06:21:50

by Paolo Bonzini

Subject: Re: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

On 06/08/19 02:35, Wanpeng Li wrote:
> Thank you, Paolo! Btw, how about the other 5 patches?

Queued everything else too.

Paolo

> Regards,
> Wanpeng Li

2019-08-22 02:41:34

by Wanpeng Li

Subject: Re: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

On Tue, 6 Aug 2019 at 14:20, Paolo Bonzini <[email protected]> wrote:
>
> On 06/08/19 02:35, Wanpeng Li wrote:
> > Thank you, Paolo! Btw, how about the other 5 patches?
>
> Queued everything else too.

How about patches 4/6~5/6? They are not in kvm/queue. :)

Regards,
Wanpeng Li

2019-08-22 09:52:05

by Wanpeng Li

Subject: Re: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

On Thu, 22 Aug 2019 at 16:35, Paolo Bonzini <[email protected]> wrote:
>
> On 22/08/19 02:46, Wanpeng Li wrote:
> > On Tue, 6 Aug 2019 at 14:20, Paolo Bonzini <[email protected]> wrote:
> >>
> >> On 06/08/19 02:35, Wanpeng Li wrote:
> >>> Thank you, Paolo! Btw, how about the other 5 patches?
> >>
> >> Queued everything else too.
> >
> > How about patches 4/6~5/6? They are not in kvm/queue. :)
>
> I queued 4.
>
> For patch 5, I don't really see the benefit since the hypercall
> arguments are already traced.

Agreed, thank you.

Regards,
Wanpeng Li

2019-08-22 10:45:51

by Paolo Bonzini

Subject: Re: [PATCH v4 1/6] KVM: Fix leak vCPU's VMCS value into other pCPU

On 22/08/19 02:46, Wanpeng Li wrote:
> On Tue, 6 Aug 2019 at 14:20, Paolo Bonzini <[email protected]> wrote:
>>
>> On 06/08/19 02:35, Wanpeng Li wrote:
> >>> Thank you, Paolo! Btw, how about the other 5 patches?
>>
>> Queued everything else too.
>
> How about patches 4/6~5/6? They are not in kvm/queue. :)

I queued 4.

For patch 5, I don't really see the benefit since the hypercall
arguments are already traced.
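
For reference, the generic hypercall tracepoint already records these values
when the PV IPI hypercall is handled; a sketch of the call site, written from
memory rather than quoted exactly:

    /* in kvm_emulate_hypercall(), before dispatching on the hypercall nr */
    trace_kvm_hypercall(nr, a0, a1, a2, a3);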

Paolo