2016-10-14 18:21:38

by Paolo Bonzini

Subject: [PATCH 0/5] KVM: x86: cleanup and minimal speedup for APICv

On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
posted interrupts turn out to be slower than interrupt injection via
KVM_REQ_EVENT.

These patches save 40 cycles on kvm-unit-tests "inl" tests (1-2%), but I
suspect the effect is bigger on workloads that inject interrupts heavily.

Patches 1-2 are micro-optimization of the APICv vmentry code.

The interesting one is patch 3. The aim is for APICv to not use
KVM_REQ_EVENT at all for interrupts, therefore turning APICv's weakness
(having to look at PIR on every vmentry) into a strength (because
checking PIR.ON is cheaper than processing KVM_REQ_EVENT). I find that
the new code overall makes more sense than the old one, and I like that
more of the logic is contained within vmx_deliver_posted_interrupt and
vmx_sync_pir_to_irr. KVM_REQ_EVENT did protect from mistakes, so the new
code is more delicate, but I still prefer that to duct tape. :)
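
As a rough summary of where patches 1-3 take the interrupt path (a simplified
sketch of the series, not literal code):

    before:  vmx_deliver_posted_interrupt: set PIR bit, set ON,
                                           KVM_REQ_EVENT, kick
             vcpu_enter_guest:             KVM_REQ_EVENT -> full event
                                           injection path

    after:   vmx_deliver_posted_interrupt: set PIR bit, set ON,
                                           posted-interrupt IPI or kick
             vcpu_enter_guest:             vmx_sync_pir_to_irr -> cheap
                                           PIR.ON check + RVI update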

Patches 4 and 5 are other cleanups.

Paolo

Paolo Bonzini (5):
KVM: x86: avoid atomic operations on APICv vmentry
KVM: x86: do not scan IRR twice on APICv vmentry
KVM: x86: do not use KVM_REQ_EVENT for APICv interrupt injection
KVM: x86: remove unnecessary sync_pir_to_irr
KVM: vmx: clear pending interrupts on KVM_SET_LAPIC

arch/x86/kvm/lapic.c | 45 +++++++++++++++++++++++++++++----------------
arch/x86/kvm/lapic.h | 4 ++--
arch/x86/kvm/vmx.c | 44 ++++++++++++++++++++++++++++++++------------
arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++--------------------
4 files changed, 88 insertions(+), 50 deletions(-)

--
1.8.3.1


2016-10-14 18:22:47

by Paolo Bonzini

Subject: [PATCH 3/5] KVM: x86: do not use KVM_REQ_EVENT for APICv interrupt injection

Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.

Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.

However, without the KVM_REQ_EVENT safety net KVM needs to be much
more careful about races between vmx_deliver_posted_interrupt and
vcpu_enter_guest. First, the IPI for posted interrupts may be issued
between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
If that happens, kvm_trigger_posted_interrupt returns true, but
smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is
entered with PIR.ON, but the posted interrupt IPI has not been sent
and the interrupt is only delivered to the guest on the next vmentry
(if any). To fix this, disable interrupts before setting vcpu->mode.
This ensures that the IPI is delayed until the guest enters non-root mode;
it is then trapped by the processor causing the interrupt to be injected.

Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called
but it (correctly) doesn't do anything because it sees vcpu->mode ==
OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no
posted interrupt IPI is pending. This time, the fix for this is to move
the PIR->IRR synchronization and RVI update after setting IN_GUEST_MODE.

Both issues were previously masked by the liberal usage of KVM_REQ_EVENT.
In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
in another vmentry which would inject the interrupt.
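
The resulting ordering in vcpu_enter_guest is roughly the following (a
condensed sketch of the x86.c hunks below, with unrelated work omitted):

    local_irq_disable();                    /* delay any posted interrupt IPI */
    vcpu->mode = IN_GUEST_MODE;             /* senders now see us "in guest"  */
    smp_mb__after_srcu_read_unlock();       /* order mode write vs. PIR reads */
    if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
        kvm_x86_ops->sync_pir_to_irr(vcpu); /* catch interrupts posted before */
                                            /* the mode write (kick case)     */
    /* ... final request/signal checks, then vmentry ... */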

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/lapic.c | 9 +++------
arch/x86/kvm/vmx.c | 8 +++++---
arch/x86/kvm/x86.c | 44 +++++++++++++++++++++++++-------------------
3 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a24492a88643..c0928173b2e3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -361,12 +361,8 @@ EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
{
struct kvm_lapic *apic = vcpu->arch.apic;
- int max_irr;

- max_irr = __kvm_apic_update_irr(pir, apic->regs);
-
- kvm_make_request(KVM_REQ_EVENT, vcpu);
- return max_irr;
+ return __kvm_apic_update_irr(pir, apic->regs);
}
EXPORT_SYMBOL_GPL(kvm_apic_update_irr);

@@ -403,7 +399,8 @@ static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
if (unlikely(vcpu->arch.apicv_active)) {
/* try to update RVI */
apic_clear_vector(vec, apic->regs + APIC_IRR);
- kvm_make_request(KVM_REQ_EVENT, vcpu);
+ kvm_x86_ops->hwapic_irr_update(vcpu,
+ apic_find_highest_irr(apic));
} else {
apic->irr_pending = false;
apic_clear_vector(vec, apic->regs + APIC_IRR);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c985ec8736d3..1e4ff8c2cdbe 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4850,9 +4850,11 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
if (pi_test_and_set_pir(vector, &vmx->pi_desc))
return;

- r = pi_test_and_set_on(&vmx->pi_desc);
- kvm_make_request(KVM_REQ_EVENT, vcpu);
- if (r || !kvm_vcpu_trigger_posted_interrupt(vcpu))
+ /* If a previous notification has sent the IPI, nothing to do. */
+ if (pi_test_and_set_on(&vmx->pi_desc))
+ return;
+
+ if (!kvm_vcpu_trigger_posted_interrupt(vcpu))
kvm_vcpu_kick(vcpu);
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d29be9d9db4..7a67de964baa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6601,19 +6601,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_hv_process_stimers(vcpu);
}

- /*
- * KVM_REQ_EVENT is not set when posted interrupts are set by
- * VT-d hardware, so we have to update RVI unconditionally.
- */
- if (kvm_lapic_enabled(vcpu)) {
- /*
- * Update architecture specific hints for APIC
- * virtual interrupt delivery.
- */
- if (vcpu->arch.apicv_active)
- kvm_x86_ops->sync_pir_to_irr(vcpu);
- }
-
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
kvm_apic_accept_events(vcpu);
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
@@ -6657,20 +6644,39 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
kvm_x86_ops->prepare_guest_switch(vcpu);
if (vcpu->fpu_active)
kvm_load_guest_fpu(vcpu);
+
+ /*
+ * By disabling IRQs before setting IN_GUEST_MODE, we delay any posted
+ * interrupt IPI to after guest entry. This ensures that the IPI
+ * results in virtual interrupt delivery.
+ */
+ local_irq_disable();
vcpu->mode = IN_GUEST_MODE;

srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

/*
- * We should set ->mode before check ->requests,
- * Please see the comment in kvm_make_all_cpus_request.
- * This also orders the write to mode from any reads
- * to the page tables done while the VCPU is running.
- * Please see the comment in kvm_flush_remote_tlbs.
+ * 1) We should set ->mode before checking ->requests. Please see
+ * the comment in kvm_make_all_cpus_request.
+ *
+ * 2) For APICv, we should set ->mode before checking PIR.ON. This
+ * pairs with the memory barrier implicit in pi_test_and_set_on
+ * (see vmx_deliver_posted_interrupt).
+ *
+ * 3) This also orders the write to mode from any reads to the page
+ * tables done while the VCPU is running. Please see the comment
+ * in kvm_flush_remote_tlbs.
*/
smp_mb__after_srcu_read_unlock();

- local_irq_disable();
+ if (kvm_lapic_enabled(vcpu)) {
+ /*
+ * This handles the case where a posted interrupt was
+ * notified with kvm_vcpu_kick.
+ */
+ if (vcpu->arch.apicv_active)
+ kvm_x86_ops->sync_pir_to_irr(vcpu);
+ }

if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
|| need_resched() || signal_pending(current)) {
--
1.8.3.1


2016-10-14 18:22:45

by Paolo Bonzini

Subject: [PATCH 4/5] KVM: x86: remove unnecessary sync_pir_to_irr

Synchronizing PIR to IRR is not needed in most callers of
apic_find_highest_irr. Move it to the one place that matters,
interrupt acknowledgement.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/lapic.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index c0928173b2e3..255b5e1658f6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -382,8 +382,6 @@ static inline int apic_find_highest_irr(struct kvm_lapic *apic)
if (!apic->irr_pending)
return -1;

- if (apic->vcpu->arch.apicv_active)
- kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
result = apic_search_irr(apic);
ASSERT(result == -1 || result >= 16);

@@ -1973,6 +1971,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
return -1;

apic_update_ppr(apic);
+
+ if (apic->vcpu->arch.apicv_active)
+ kvm_x86_ops->sync_pir_to_irr(apic->vcpu);
highest_irr = apic_find_highest_irr(apic);
if ((highest_irr == -1) ||
((highest_irr & 0xF0) <= kvm_lapic_get_reg(apic, APIC_PROCPRI)))
@@ -2206,6 +2207,11 @@ void kvm_lapic_sync_to_vapic(struct kvm_vcpu *vcpu)
if (!test_bit(KVM_APIC_CHECK_VAPIC, &vcpu->arch.apic_attention))
return;

+ /* We don't get here when APICv is active, since that blocks
+ * TPR access vmexits.
+ */
+ WARN_ON_ONCE(vcpu->arch.apicv_active);
+
tpr = kvm_lapic_get_reg(apic, APIC_TASKPRI) & 0xff;
max_irr = apic_find_highest_irr(apic);
if (max_irr < 0)
--
1.8.3.1


2016-10-14 18:22:43

by Paolo Bonzini

Subject: [PATCH 5/5] KVM: vmx: clear pending interrupts on KVM_SET_LAPIC

Pending interrupts might be in the PI descriptor when the
LAPIC is restored from an external state; we do not want
them to be injected.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/lapic.c | 3 +--
arch/x86/kvm/vmx.c | 9 +++++++++
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 255b5e1658f6..271cb9c22777 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2088,8 +2088,7 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
1 : count_vectors(apic->regs + APIC_ISR);
apic->highest_isr_cache = -1;
if (vcpu->arch.apicv_active) {
- if (kvm_x86_ops->apicv_post_state_restore)
- kvm_x86_ops->apicv_post_state_restore(vcpu);
+ kvm_x86_ops->apicv_post_state_restore(vcpu);
kvm_x86_ops->hwapic_irr_update(vcpu,
apic_find_highest_irr(apic));
kvm_x86_ops->hwapic_isr_update(vcpu,
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1e4ff8c2cdbe..290d80f9fcbd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8617,6 +8617,14 @@ static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
vmx_hwapic_irr_update(vcpu, max_irr);
}

+static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+ pi_clear_on(&vmx->pi_desc);
+ memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
+}
+
static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
@@ -11327,6 +11335,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.get_enable_apicv = vmx_get_enable_apicv,
.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
.load_eoi_exitmap = vmx_load_eoi_exitmap,
+ .apicv_post_state_restore = vmx_apicv_post_state_restore,
.hwapic_irr_update = vmx_hwapic_irr_update,
.hwapic_isr_update = vmx_hwapic_isr_update,
.sync_pir_to_irr = vmx_sync_pir_to_irr,
--
1.8.3.1

2016-10-14 18:25:30

by Paolo Bonzini

Subject: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

Calling apic_find_highest_irr results in IRR being scanned twice,
once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
compute the new RVI on the fly.
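
(As a purely illustrative example: if IRR word 1 ends up as 0x00000081 after
folding in the PIR, and the higher IRR words are clear, then __fls(0x00000081)
is 7 and vec is 32 for that word, so the loop reports max_irr = 39 directly,
without a second pass over the IRR.)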

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/lapic.c | 25 +++++++++++++++++--------
arch/x86/kvm/lapic.h | 4 ++--
arch/x86/kvm/vmx.c | 24 +++++++++++++-----------
arch/x86/kvm/x86.c | 3 +--
4 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 63a442aefc12..a24492a88643 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -317,7 +317,7 @@ static int find_highest_vector(void *bitmap)
vec >= 0; vec -= APIC_VECTORS_PER_REG) {
reg = bitmap + REG_POS(vec);
if (*reg)
- return fls(*reg) - 1 + vec;
+ return __fls(*reg) + vec;
}

return -1;
@@ -337,27 +337,36 @@ static u8 count_vectors(void *bitmap)
return count;
}

-void __kvm_apic_update_irr(u32 *pir, void *regs)
+int __kvm_apic_update_irr(u32 *pir, void *regs)
{
- u32 i, pir_val;
+ u32 i, vec;
+ u32 pir_val, irr_val;
+ int max_irr = -1;

- for (i = 0; i <= 7; i++) {
+ for (i = vec = 0; i <= 7; i++, vec += 32) {
pir_val = READ_ONCE(pir[i]);
+ irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
if (pir_val) {
- pir_val = xchg(&pir[i], 0);
- *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
+ irr_val |= xchg(&pir[i], 0);
+ *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
}
+ if (irr_val)
+ max_irr = __fls(irr_val) + vec;
}
+
+ return max_irr;
}
EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);

-void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ int max_irr;

- __kvm_apic_update_irr(pir, apic->regs);
+ max_irr = __kvm_apic_update_irr(pir, apic->regs);

kvm_make_request(KVM_REQ_EVENT, vcpu);
+ return max_irr;
}
EXPORT_SYMBOL_GPL(kvm_apic_update_irr);

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index f60d01c29d51..fc72828c3782 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -70,8 +70,8 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
int short_hand, unsigned int dest, int dest_mode);

-void __kvm_apic_update_irr(u32 *pir, void *regs);
-void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
+int __kvm_apic_update_irr(u32 *pir, void *regs);
+int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
struct dest_map *dest_map);
int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7c79d6c6b6ed..c985ec8736d3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4856,17 +4856,6 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
kvm_vcpu_kick(vcpu);
}

-static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
-{
- struct vcpu_vmx *vmx = to_vmx(vcpu);
-
- if (!pi_test_on(&vmx->pi_desc))
- return;
-
- pi_clear_on(&vmx->pi_desc);
- kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
-}
-
/*
* Set up the vmcs's constant host-state fields, i.e., host-state fields that
* will not change in the lifetime of the guest.
@@ -8613,6 +8602,19 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
}
}

+static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+ struct vcpu_vmx *vmx = to_vmx(vcpu);
+ int max_irr;
+
+ if (!pi_test_on(&vmx->pi_desc))
+ return;
+
+ pi_clear_on(&vmx->pi_desc);
+ max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
+ vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
{
if (!kvm_vcpu_apicv_active(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ee8a91a78c3..3d29be9d9db4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6611,8 +6611,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* virtual interrupt delivery.
*/
if (vcpu->arch.apicv_active)
- kvm_x86_ops->hwapic_irr_update(vcpu,
- kvm_lapic_find_highest_irr(vcpu));
+ kvm_x86_ops->sync_pir_to_irr(vcpu);
}

if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
--
1.8.3.1


2016-10-14 18:25:55

by Paolo Bonzini

Subject: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
posted interrupts turn out to be slower than interrupt injection via
KVM_REQ_EVENT.

This patch optimizes a bit the IRR update, avoiding expensive atomic
operations in the common case where PI.ON=0 at vmentry or the PIR vector
is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
measured by kvm-unit-tests' inl_from_qemu test (20 runs):

          | enable_apicv=1  | enable_apicv=0
          | mean     stdev  | mean     stdev
----------|-----------------|------------------
before    | 5826     32.65  | 5765     47.09
after     | 5809     43.42  | 5777     77.02

Of course, any change in the right column is just placebo effect. :)
The savings are bigger if interrupts are frequent.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/lapic.c | 6 ++++--
arch/x86/kvm/vmx.c | 9 ++++++++-
2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 23b99f305382..63a442aefc12 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
u32 i, pir_val;

for (i = 0; i <= 7; i++) {
- pir_val = xchg(&pir[i], 0);
- if (pir_val)
+ pir_val = READ_ONCE(pir[i]);
+ if (pir_val) {
+ pir_val = xchg(&pir[i], 0);
*((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
+ }
}
}
EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2577183b40d9..7c79d6c6b6ed 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
(unsigned long *)&pi_desc->control);
}

+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+ clear_bit(POSTED_INTR_ON,
+ (unsigned long *)&pi_desc->control);
+}
+
static inline int pi_test_on(struct pi_desc *pi_desc)
{
return test_bit(POSTED_INTR_ON,
@@ -4854,9 +4860,10 @@ static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);

- if (!pi_test_and_clear_on(&vmx->pi_desc))
+ if (!pi_test_on(&vmx->pi_desc))
return;

+ pi_clear_on(&vmx->pi_desc);
kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
}

--
1.8.3.1


2016-10-14 18:51:10

by Nadav Amit

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 23b99f305382..63a442aefc12 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
> u32 i, pir_val;
>
> for (i = 0; i <= 7; i++) {
> - pir_val = xchg(&pir[i], 0);
> - if (pir_val)
> + pir_val = READ_ONCE(pir[i]);

Out of curiosity, do you really need this READ_ONCE?

2016-10-14 18:56:49

by Paolo Bonzini

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 23b99f305382..63a442aefc12 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
> > u32 i, pir_val;
> >
> > for (i = 0; i <= 7; i++) {
> > - pir_val = xchg(&pir[i], 0);
> > - if (pir_val)
> > + pir_val = READ_ONCE(pir[i]);
>
> Out of curiosity, do you really need this READ_ONCE?

The answer can only be "depends on the compiler's whims". :)
If you think of READ_ONCE as a C11 relaxed atomic load, then yes.

Paolo
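
To illustrate the kind of transformation READ_ONCE rules out (a generic
sketch, not a claim about this particular loop; do_something() is just a
placeholder):

    u32 v = pir[i];             /* plain load: the compiler may tear it,    */
                                /* refetch it, or fuse it with nearby loads */
    if (v)
        do_something(v);        /* placeholder; the compiler may legally    */
                                /* re-read pir[i] here instead of reusing v */

    u32 w = READ_ONCE(pir[i]);  /* exactly one untorn load of the current   */
                                /* value, i.e. the C11 relaxed load above   */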

2016-10-14 19:45:02

by Nadav Amit

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry


> On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <[email protected]> wrote:
>>>
>>> for (i = 0; i <= 7; i++) {
>>> - pir_val = xchg(&pir[i], 0);
>>> - if (pir_val)
>>> + pir_val = READ_ONCE(pir[i]);
>>
>> Out of curiosity, do you really need this READ_ONCE?
>
> The answer can only be "depends on the compiler's whims". :)
> If you think of READ_ONCE as a C11 relaxed atomic load, then yes.

Hm.. So the idea is to make the code "race-free” in the sense
that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?

If that is the case, I think there are many other cases that need to be
changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.



2016-10-15 07:47:50

by Paolo Bonzini

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry


> > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <[email protected]> wrote:
> >>>
> >>> for (i = 0; i <= 7; i++) {
> >>> - pir_val = xchg(&pir[i], 0);
> >>> - if (pir_val)
> >>> + pir_val = READ_ONCE(pir[i]);
> >>
> >> Out of curiosity, do you really need this READ_ONCE?
> >
> > The answer can only be "depends on the compiler's whims". :)
> > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
>
> Hm.. So the idea is to make the code "race-free” in the sense
> that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
>
> If that is the case, I think there are many other cases that need to be
> changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.

There is no documentation for this in the kernel tree unfortunately.
But yes, I think we should do that. Using READ_ONCE/WRITE_ONCE around
memory barriers is a start.

Paolo
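
For completeness, the usual shape of that convention is something like this
(a generic sketch, not code from this series):

    /* writer */
    WRITE_ONCE(data, val);
    smp_wmb();              /* order the data write before the flag write */
    WRITE_ONCE(flag, 1);

    /* reader */
    if (READ_ONCE(flag)) {
        smp_rmb();          /* order the flag read before the data read   */
        val = READ_ONCE(data);
    }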

2016-10-16 02:31:17

by Michael S. Tsirkin

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
>
> > > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <[email protected]> wrote:
> > >>>
> > >>> for (i = 0; i <= 7; i++) {
> > >>> - pir_val = xchg(&pir[i], 0);
> > >>> - if (pir_val)
> > >>> + pir_val = READ_ONCE(pir[i]);
> > >>
> > >> Out of curiosity, do you really need this READ_ONCE?
> > >
> > > The answer can only be "depends on the compiler's whims". :)
> > > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> >
> > Hm.. So the idea is to make the code "race-free” in the sense
> > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> >
> > If that is the case, I think there are many other cases that need to be
> > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
>
> There is no documentation for this in the kernel tree unfortunately.
> But yes, I think we should do that. Using READ_ONCE/WRITE_ONCE around
> memory barriers is a start.
>
> Paolo

I'm beginning to think that if a value is always (maybe except for init
where we don't much care about the code size anyway) accessed through
*_ONCE macros, we should just mark it volatile and be done with it. The
code will look cleaner, and there will be less space for errors
like forgetting *_ONCE macros.

Would such code (where all accesses are done through
READ_ONCE/WRITE_ONCE otherwise) be an exception to
volatile-considered-harmful.txt rules?

Cc Paul and Jonathan (for volatile-considered-harmful.txt).


--
MST

2016-10-16 03:21:47

by Michael S. Tsirkin

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Fri, Oct 14, 2016 at 08:21:27PM +0200, Paolo Bonzini wrote:
> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> posted interrupts turn out to be slower than interrupt injection via
> KVM_REQ_EVENT.
>
> This patch optimizes a bit the IRR update, avoiding expensive atomic
> operations in the common case where PI.ON=0 at vmentry or the PIR vector
> is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>
> | enable_apicv=1 | enable_apicv=0
> | mean stdev | mean stdev
> ----------|-----------------|------------------
> before | 5826 32.65 | 5765 47.09
> after | 5809 43.42 | 5777 77.02
>
> Of course, any change in the right column is just placebo effect. :)
> The savings are bigger if interrupts are frequent.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 6 ++++--
> arch/x86/kvm/vmx.c | 9 ++++++++-
> 2 files changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 23b99f305382..63a442aefc12 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
> u32 i, pir_val;
>
> for (i = 0; i <= 7; i++) {
> - pir_val = xchg(&pir[i], 0);
> - if (pir_val)
> + pir_val = READ_ONCE(pir[i]);
> + if (pir_val) {
> + pir_val = xchg(&pir[i], 0);
> *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
> + }
> }
> }
> EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);

gcc doesn't seem to unroll this loop and it's
probably worth unrolling it

The following seems to do the trick for me on upstream - I didn't
benchmark it though. Is there a kvm unit test for interrupts?

--->

kvm: unroll the loop in __kvm_apic_update_irr.

This is a hot data path in interrupt-rich workloads, worth unrolling.

Signed-off-by: Michael S. Tsirkin <[email protected]>


diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b62c852..0c3462c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -337,7 +337,8 @@ static u8 count_vectors(void *bitmap)
return count;
}

-void __kvm_apic_update_irr(u32 *pir, void *regs)
+void __attribute__((optimize("unroll-loops")))
+__kvm_apic_update_irr(u32 *pir, void *regs)
{
u32 i, pir_val;


2016-10-17 11:07:54

by Paolo Bonzini

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry



On 16/10/2016 05:21, Michael S. Tsirkin wrote:
> On Fri, Oct 14, 2016 at 08:21:27PM +0200, Paolo Bonzini wrote:
>> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> posted interrupts turn out to be slower than interrupt injection via
>> KVM_REQ_EVENT.
>>
>> This patch optimizes a bit the IRR update, avoiding expensive atomic
>> operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
>> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>>
>> | enable_apicv=1 | enable_apicv=0
>> | mean stdev | mean stdev
>> ----------|-----------------|------------------
>> before | 5826 32.65 | 5765 47.09
>> after | 5809 43.42 | 5777 77.02
>>
>> Of course, any change in the right column is just placebo effect. :)
>> The savings are bigger if interrupts are frequent.
>>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>> arch/x86/kvm/lapic.c | 6 ++++--
>> arch/x86/kvm/vmx.c | 9 ++++++++-
>> 2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 23b99f305382..63a442aefc12 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
>> u32 i, pir_val;
>>
>> for (i = 0; i <= 7; i++) {
>> - pir_val = xchg(&pir[i], 0);
>> - if (pir_val)
>> + pir_val = READ_ONCE(pir[i]);
>> + if (pir_val) {
>> + pir_val = xchg(&pir[i], 0);
>> *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
>> + }
>> }
>> }
>> EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
>
> gcc doesn't seem to unroll this loop and it's
> probably worth unrolling it
>
> The following seems to do the trick for me on upstream - I didn't
> benchmark it though. Is there a kvm unit test for interrupts?

No, not yet. Purely from a branch-prediction point of view I wouldn't
be so sure that it's worth unrolling it. Because of the xchg you cannot
make the code branchless, and having a branch per iteration puts some
pressure on the BTB.

This is only a hot path in workloads that have a lot of interrupts and a
lot of vmexits. If it's always the same interrupt (actually the same
group of 32 interrupts) that triggers, then the branch predictor will
predict the history very easily (e.g. 00001000 00001000 00001000...).

Paolo

> --->
>
> kvm: unroll the loop in __kvm_apic_update_irr.
>
> This is a hot data path in interrupt-rich workloads, worth unrolling.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
>
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b62c852..0c3462c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -337,7 +337,8 @@ static u8 count_vectors(void *bitmap)
> return count;
> }
>
> -void __kvm_apic_update_irr(u32 *pir, void *regs)
> +void __attribute__((optimize("unroll-loops")))
> +__kvm_apic_update_irr(u32 *pir, void *regs)
> {
> u32 i, pir_val;
>

2016-10-18 06:04:36

by Wanpeng Li

Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

2016-10-15 2:21 GMT+08:00 Paolo Bonzini <[email protected]>:
> Calling apic_find_highest_irr results in IRR being scanned twice,
> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change

s/from/to

> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
> compute the new RVI on the fly.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/kvm/lapic.c | 25 +++++++++++++++++--------
> arch/x86/kvm/lapic.h | 4 ++--
> arch/x86/kvm/vmx.c | 24 +++++++++++++-----------
> arch/x86/kvm/x86.c | 3 +--
> 4 files changed, 33 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 63a442aefc12..a24492a88643 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -317,7 +317,7 @@ static int find_highest_vector(void *bitmap)
> vec >= 0; vec -= APIC_VECTORS_PER_REG) {
> reg = bitmap + REG_POS(vec);
> if (*reg)
> - return fls(*reg) - 1 + vec;
> + return __fls(*reg) + vec;
> }
>
> return -1;
> @@ -337,27 +337,36 @@ static u8 count_vectors(void *bitmap)
> return count;
> }
>
> -void __kvm_apic_update_irr(u32 *pir, void *regs)
> +int __kvm_apic_update_irr(u32 *pir, void *regs)
> {
> - u32 i, pir_val;
> + u32 i, vec;
> + u32 pir_val, irr_val;
> + int max_irr = -1;
>
> - for (i = 0; i <= 7; i++) {
> + for (i = vec = 0; i <= 7; i++, vec += 32) {
> pir_val = READ_ONCE(pir[i]);
> + irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
> if (pir_val) {
> - pir_val = xchg(&pir[i], 0);
> - *((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
> + irr_val |= xchg(&pir[i], 0);
> + *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
> }
> + if (irr_val)
> + max_irr = __fls(irr_val) + vec;
> }
> +
> + return max_irr;
> }
> EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
>
> -void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
> +int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
> {
> struct kvm_lapic *apic = vcpu->arch.apic;
> + int max_irr;
>
> - __kvm_apic_update_irr(pir, apic->regs);
> + max_irr = __kvm_apic_update_irr(pir, apic->regs);
>
> kvm_make_request(KVM_REQ_EVENT, vcpu);
> + return max_irr;
> }
> EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
>
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index f60d01c29d51..fc72828c3782 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -70,8 +70,8 @@ int kvm_lapic_reg_read(struct kvm_lapic *apic, u32 offset, int len,
> bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
> int short_hand, unsigned int dest, int dest_mode);
>
> -void __kvm_apic_update_irr(u32 *pir, void *regs);
> -void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
> +int __kvm_apic_update_irr(u32 *pir, void *regs);
> +int kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
> int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
> struct dest_map *dest_map);
> int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 7c79d6c6b6ed..c985ec8736d3 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4856,17 +4856,6 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
> kvm_vcpu_kick(vcpu);
> }
>
> -static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> -{
> - struct vcpu_vmx *vmx = to_vmx(vcpu);
> -
> - if (!pi_test_on(&vmx->pi_desc))
> - return;
> -
> - pi_clear_on(&vmx->pi_desc);
> - kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
> -}
> -
> /*
> * Set up the vmcs's constant host-state fields, i.e., host-state fields that
> * will not change in the lifetime of the guest.
> @@ -8613,6 +8602,19 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> }
> }
>
> +static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + int max_irr;
> +
> + if (!pi_test_on(&vmx->pi_desc))
> + return;
> +
> + pi_clear_on(&vmx->pi_desc);
> + max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
> + vmx_hwapic_irr_update(vcpu, max_irr);
> +}
> +
> static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> {
> if (!kvm_vcpu_apicv_active(vcpu))
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3ee8a91a78c3..3d29be9d9db4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6611,8 +6611,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> * virtual interrupt delivery.
> */
> if (vcpu->arch.apicv_active)
> - kvm_x86_ops->hwapic_irr_update(vcpu,
> - kvm_lapic_find_highest_irr(vcpu));
> + kvm_x86_ops->sync_pir_to_irr(vcpu);
> }
>
> if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> --
> 1.8.3.1
>
>

2016-10-19 17:57:18

by Paul E. McKenney

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Sun, Oct 16, 2016 at 05:29:24AM +0300, Michael S. Tsirkin wrote:
> On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> >
> > > > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <[email protected]> wrote:
> > > >>>
> > > >>> for (i = 0; i <= 7; i++) {
> > > >>> - pir_val = xchg(&pir[i], 0);
> > > >>> - if (pir_val)
> > > >>> + pir_val = READ_ONCE(pir[i]);
> > > >>
> > > >> Out of curiosity, do you really need this READ_ONCE?
> > > >
> > > > The answer can only be "depends on the compiler's whims". :)
> > > > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> > >
> > > Hm.. So the idea is to make the code "race-free” in the sense
> > > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> > >
> > > If that is the case, I think there are many other cases that need to be
> > > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
> >
> > There is no documentation for this in the kernel tree unfortunately.
> > But yes, I think we should do that. Using READ_ONCE/WRITE_ONCE around
> > memory barriers is a start.
> >
> > Paolo
>
> I'm beginning to think that if a value is always (maybe except for init
> where we don't much care about the code size anyway) accessed through
> *_ONCE macros, we should just mark it volatile and be done with it. The
> code will look cleaner, and there will be less space for errors
> like forgetting *_ONCE macros.
>
> Would such code (where all accesses are done through
> READ_ONCE/WRITE_ONCE otherwise) be an exception to
> volatile-considered-harmful.txt rules?
>
> Cc Paul and Jonathan (for volatile-considered-harmful.txt).

One concern would be the guy reading the code, to whom it might not
be obvious that the underlying access was volatile, especially if
the reference was a few levels down.

Thanx, Paul

2016-10-26 19:53:51

by Radim Krčmář

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-14 20:21+0200, Paolo Bonzini:
> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> posted interrupts turn out to be slower than interrupt injection via
> KVM_REQ_EVENT.
>
> This patch optimizes a bit the IRR update, avoiding expensive atomic
> operations in the common case where PI.ON=0 at vmentry or the PIR vector
> is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>
> | enable_apicv=1 | enable_apicv=0
> | mean stdev | mean stdev
> ----------|-----------------|------------------
> before | 5826 32.65 | 5765 47.09
> after | 5809 43.42 | 5777 77.02
>
> Of course, any change in the right column is just placebo effect. :)
> The savings are bigger if interrupts are frequent.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> (unsigned long *)&pi_desc->control);
> }
>
> +static inline void pi_clear_on(struct pi_desc *pi_desc)
> +{
> + clear_bit(POSTED_INTR_ON,
> + (unsigned long *)&pi_desc->control);
^^
bad whitespace.

> +}

We should add an explicit smp_mb__after_atomic() for extra correctness,
because clear_bit() does not guarantee a memory barrier and we must make
sure that pir reads can't be reordered before it.
x86 clear_bit() currently uses locked instruction, though.

> +
> static inline int pi_test_on(struct pi_desc *pi_desc)
> {
> return test_bit(POSTED_INTR_ON,

Other than that,

Reviewed-by: Radim Krčmář <[email protected]>

2016-10-26 19:59:06

by Radim Krčmář

Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

2016-10-14 20:21+0200, Paolo Bonzini:
> Calling apic_find_highest_irr results in IRR being scanned twice,
> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
> compute the new RVI on the fly.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Reviewed-by: Radim Krčmář <[email protected]>

2016-10-26 20:06:18

by Radim Krčmář

Subject: Re: [PATCH 3/5] KVM: x86: do not use KVM_REQ_EVENT for APICv interrupt injection

2016-10-14 20:21+0200, Paolo Bonzini:
> Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU
> is blocked", 2015-09-18) the posted interrupt descriptor is checked
> unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to
> trigger the scan and, if NMIs or SMIs are not involved, we can avoid
> the complicated event injection path.
>
> Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
> there since APICv was introduced.
>
> However, without the KVM_REQ_EVENT safety net KVM needs to be much
> more careful about races between vmx_deliver_posted_interrupt and
> vcpu_enter_guest. First, the IPI for posted interrupts may be issued
> between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
> If that happens, kvm_trigger_posted_interrupt returns true, but
> smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is
> entered with PIR.ON, but the posted interrupt IPI has not been sent
> and the interrupt is only delivered to the guest on the next vmentry
> (if any). To fix this, disable interrupts before setting vcpu->mode.
> This ensures that the IPI is delayed until the guest enters non-root mode;
> it is then trapped by the processor causing the interrupt to be injected.
>
> Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
> and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called
> but it (correctly) doesn't do anything because it sees vcpu->mode ==
> OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no
> posted interrupt IPI is pending. This time, the fix for this is to move
> the PIR->IRR synchronization and RVI update after setting IN_GUEST_MODE.
>
> Both issues were previously masked by the liberal usage of KVM_REQ_EVENT.
> In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
> in another vmentry which would inject the interrupt.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Reviewed-by: Radim Krčmář <[email protected]>

2016-10-26 20:08:17

by Radim Krčmář

Subject: Re: [PATCH 5/5] KVM: vmx: clear pending interrupts on KVM_SET_LAPIC

2016-10-14 20:21+0200, Paolo Bonzini:
> Pending interrupts might be in the PI descriptor when the
> LAPIC is restored from an external state; we do not want
> them to be injected.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Reviewed-by: Radim Krčmář <[email protected]>

2016-10-26 20:28:44

by Radim Krčmář

Subject: Re: [PATCH 4/5] KVM: x86: remove unnecessary sync_pir_to_irr

2016-10-14 20:21+0200, Paolo Bonzini:
> Synchronizing PIR to IRR is not needed in most callers of
> apic_find_highest_irr. Move it to the one place that matters,
> interrupt acknowledgement.
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Reviewed-by: Radim Krčmář <[email protected]>

> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> @@ -1973,6 +1971,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
> return -1;
>
> apic_update_ppr(apic);
> +
> + if (apic->vcpu->arch.apicv_active)
> + kvm_x86_ops->sync_pir_to_irr(apic->vcpu);

Btw. do all callers of kvm_apic_has_interrupt() need it?

Thanks.

> highest_irr = apic_find_highest_irr(apic);
> if ((highest_irr == -1) ||
> ((highest_irr & 0xF0) <= kvm_lapic_get_reg(apic, APIC_PROCPRI)))

2016-10-26 21:42:56

by Michael S. Tsirkin

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> 2016-10-14 20:21+0200, Paolo Bonzini:
> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> > posted interrupts turn out to be slower than interrupt injection via
> > KVM_REQ_EVENT.
> >
> > This patch optimizes a bit the IRR update, avoiding expensive atomic
> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
> > is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> >
> > | enable_apicv=1 | enable_apicv=0
> > | mean stdev | mean stdev
> > ----------|-----------------|------------------
> > before | 5826 32.65 | 5765 47.09
> > after | 5809 43.42 | 5777 77.02
> >
> > Of course, any change in the right column is just placebo effect. :)
> > The savings are bigger if interrupts are frequent.
> >
> > Signed-off-by: Paolo Bonzini <[email protected]>
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> > (unsigned long *)&pi_desc->control);
> > }
> >
> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
> > +{
> > + clear_bit(POSTED_INTR_ON,
> > + (unsigned long *)&pi_desc->control);
> ^^
> bad whitespace.
>
> > +}
>
> We should add an explicit smp_mb__after_atomic() for extra correctness,
> because clear_bit() does not guarantee a memory barrier and we must make
> sure that pir reads can't be reordered before it.
> x86 clear_bit() currently uses locked instruction, though.

smp_mb__after_atomic is empty on x86 so it's
a documentation thing, not a correctness thing anyway.

> > +
> > static inline int pi_test_on(struct pi_desc *pi_desc)
> > {
> > return test_bit(POSTED_INTR_ON,
>
> Other than that,
>
> Reviewed-by: Radim Krčmář <[email protected]>

2016-10-26 21:50:53

by Michael S. Tsirkin

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Wed, Oct 19, 2016 at 04:45:48AM -0700, Paul E. McKenney wrote:
> On Sun, Oct 16, 2016 at 05:29:24AM +0300, Michael S. Tsirkin wrote:
> > On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> > >
> > > > > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <[email protected]> wrote:
> > > > >>>
> > > > >>> for (i = 0; i <= 7; i++) {
> > > > >>> - pir_val = xchg(&pir[i], 0);
> > > > >>> - if (pir_val)
> > > > >>> + pir_val = READ_ONCE(pir[i]);
> > > > >>
> > > > >> Out of curiosity, do you really need this READ_ONCE?
> > > > >
> > > > > The answer can only be "depends on the compiler's whims". :)
> > > > > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> > > >
> > > > Hm.. So the idea is to make the code "race-free” in the sense
> > > > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> > > >
> > > > If that is the case, I think there are many other cases that need to be
> > > > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
> > >
> > > There is no documentation for this in the kernel tree unfortunately.
> > > But yes, I think we should do that. Using READ_ONCE/WRITE_ONCE around
> > > memory barriers is a start.
> > >
> > > Paolo
> >
> > I'm beginning to think that if a value is always (maybe except for init
> > where we don't much care about the code size anyway) accessed through
> > *_ONCE macros, we should just mark it volatile and be done with it. The
> > code will look cleaner, and there will be less space for errors
> > like forgetting *_ONCE macros.
> >
> > Would such code (where all accesses are done through
> > READ_ONCE/WRITE_ONCE otherwise) be an exception to
> > volatile-considered-harmful.txt rules?
> >
> > Cc Paul and Jonathan (for volatile-considered-harmful.txt).
>
> One concern would be the guy reading the code, to whom it might not
> be obvious that the underlying access was volatile, especially if
> the reference was a few levels down.
>
> Thanx, Paul

So the guy reading the code will think access is unsafe, check it up and
realise it's fine. Is this a big deal? Isn't it better than the
reverse where one forgets to use READ_ONCE and gets a runtime bug?

My point is that this text:

The key point to understand with regard to volatile is that its purpose is
to suppress optimization, which is almost never what one really wants to do.

doesn't seem to apply anymore since we have hundreds of users of
READ_ONCE the point of which is exactly to suppress optimization.

I'm guessing this was written when we only had barrier() which is quite
unlike READ_ONCE in that it's not tied to a specific memory reference.

--
MST

2016-10-26 21:52:58

by Michael S. Tsirkin

Subject: Re: [PATCH 0/5] KVM: x86: cleanup and minimal speedup for APICv

On Fri, Oct 14, 2016 at 08:21:26PM +0200, Paolo Bonzini wrote:
> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> posted interrupts turn out to be slower than interrupt injection via
> KVM_REQ_EVENT.
>
> These patches save 40 cycles on kvm-unit-tests "inl" tests (1-2%), but I
> suspect the effect is bigger on workloads that inject interrupts heavily.
>
> Patches 1-2 are micro-optimization of the APICv vmentry code.
>
> The interesting one is patch 3. The aim is for APICv to not use
> KVM_REQ_EVENT at all for interrupts, therefore turning APICv's weakness
> (having to look at PIR on every vmentry) into a strength (because
> checking PIR.ON is cheaper than processing KVM_REQ_EVENT). I find that
> the new code overall makes more sense than the old one, and I like that
> more of the logic is contained within vmx_deliver_posted_interrupt and
> vmx_sync_pir_to_irr. KVM_REQ_EVENT did protect from mistakes, so the new
> code is more delicate, but I still prefer that to duct tape. :)
>
> Patches 4 and 5 are other cleanups.

FWIW, series

Acked-by: Michael S. Tsirkin <[email protected]>

I still think we should make pir volatile but we
can argue about that later.

> Paolo
>
> Paolo Bonzini (5):
> KVM: x86: avoid atomic operations on APICv vmentry
> KVM: x86: do not scan IRR twice on APICv vmentry
> KVM: x86: do not use KVM_REQ_EVENT for APICv interrupt injection
> KVM: x86: remove unnecessary sync_pir_to_irr
> KVM: vmx: clear pending interrupts on KVM_SET_LAPIC
>
> arch/x86/kvm/lapic.c | 45 +++++++++++++++++++++++++++++----------------
> arch/x86/kvm/lapic.h | 4 ++--
> arch/x86/kvm/vmx.c | 44 ++++++++++++++++++++++++++++++++------------
> arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++--------------------
> 4 files changed, 88 insertions(+), 50 deletions(-)
>
> --
> 1.8.3.1

2016-10-27 16:44:06

by Radim Krčmář

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-27 00:42+0300, Michael S. Tsirkin:
> On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>> 2016-10-14 20:21+0200, Paolo Bonzini:
>> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> > posted interrupts turn out to be slower than interrupt injection via
>> > KVM_REQ_EVENT.
>> >
>> > This patch optimizes a bit the IRR update, avoiding expensive atomic
>> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> > is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
>> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>> >
>> > | enable_apicv=1 | enable_apicv=0
>> > | mean stdev | mean stdev
>> > ----------|-----------------|------------------
>> > before | 5826 32.65 | 5765 47.09
>> > after | 5809 43.42 | 5777 77.02
>> >
>> > Of course, any change in the right column is just placebo effect. :)
>> > The savings are bigger if interrupts are frequent.
>> >
>> > Signed-off-by: Paolo Bonzini <[email protected]>
>> > ---
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>> > (unsigned long *)&pi_desc->control);
>> > }
>> >
>> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> > +{
>> > + clear_bit(POSTED_INTR_ON,
>> > + (unsigned long *)&pi_desc->control);
>> > +}
>>
>> We should add an explicit smp_mb__after_atomic() for extra correctness,
>> because clear_bit() does not guarantee a memory barrier and we must make
>> sure that pir reads can't be reordered before it.
>> x86 clear_bit() currently uses locked instruction, though.
>
> smp_mb__after_atomic is empty on x86 so it's
> a documentation thing, not a correctness thing anyway.

All atomics currently contain a barrier, but the code is also
future-proofing, not just documentation: implementation of clear_bit()
could drop the barrier and smp_mb__after_atomic() would then become a
real barrier.

Adding dma_mb__after_atomic() would be even better as this bug could
happen even on a uniprocessor with an assigned device, but people who
buy a SMP chip to run a UP kernel deserve it.

2016-10-27 16:51:38

by Michael S. Tsirkin

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
> 2016-10-27 00:42+0300, Michael S. Tsirkin:
> > On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> >> 2016-10-14 20:21+0200, Paolo Bonzini:
> >> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> >> > posted interrupts turn out to be slower than interrupt injection via
> >> > KVM_REQ_EVENT.
> >> >
> >> > This patch optimizes a bit the IRR update, avoiding expensive atomic
> >> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
> >> > is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
> >> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> >> >
> >> > | enable_apicv=1 | enable_apicv=0
> >> > | mean stdev | mean stdev
> >> > ----------|-----------------|------------------
> >> > before | 5826 32.65 | 5765 47.09
> >> > after | 5809 43.42 | 5777 77.02
> >> >
> >> > Of course, any change in the right column is just placebo effect. :)
> >> > The savings are bigger if interrupts are frequent.
> >> >
> >> > Signed-off-by: Paolo Bonzini <[email protected]>
> >> > ---
> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >> > (unsigned long *)&pi_desc->control);
> >> > }
> >> >
> >> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
> >> > +{
> >> > + clear_bit(POSTED_INTR_ON,
> >> > + (unsigned long *)&pi_desc->control);
> >> > +}
> >>
> >> We should add an explicit smp_mb__after_atomic() for extra correctness,
> >> because clear_bit() does not guarantee a memory barrier and we must make
> >> sure that pir reads can't be reordered before it.
> >> x86 clear_bit() currently uses locked instruction, though.
> >
> > smp_mb__after_atomic is empty on x86 so it's
> > a documentation thing, not a correctness thing anyway.
>
> All atomics currently contain a barrier, but the code is also
> future-proofing, not just documentation: implementation of clear_bit()
> could drop the barrier and smp_mb__after_atomic() would then become a
> real barrier.
>
> Adding dma_mb__after_atomic() would be even better as this bug could
> happen even on a uniprocessor with an assigned device, but people who
> buy a SMP chip to run a UP kernel deserve it.

Not doing dma so does not seem to make sense ...
Why do you need a barrier on a UP kernel?

--
MST

2016-10-27 17:06:17

by Radim Krčmář

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

2016-10-27 19:51+0300, Michael S. Tsirkin:
> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
>> > On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>> >> 2016-10-14 20:21+0200, Paolo Bonzini:
>> >> > On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>> >> > posted interrupts turn out to be slower than interrupt injection via
>> >> > KVM_REQ_EVENT.
>> >> >
>> >> > This patch optimizes a bit the IRR update, avoiding expensive atomic
>> >> > operations in the common case where PI.ON=0 at vmentry or the PIR vector
>> >> > is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
>> >> > measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>> >> >
>> >> > | enable_apicv=1 | enable_apicv=0
>> >> > | mean stdev | mean stdev
>> >> > ----------|-----------------|------------------
>> >> > before | 5826 32.65 | 5765 47.09
>> >> > after | 5809 43.42 | 5777 77.02
>> >> >
>> >> > Of course, any change in the right column is just placebo effect. :)
>> >> > The savings are bigger if interrupts are frequent.
>> >> >
>> >> > Signed-off-by: Paolo Bonzini <[email protected]>
>> >> > ---
>> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> >> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>> >> > (unsigned long *)&pi_desc->control);
>> >> > }
>> >> >
>> >> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> >> > +{
>> >> > + clear_bit(POSTED_INTR_ON,
>> >> > + (unsigned long *)&pi_desc->control);
>> >> > +}
>> >>
>> >> We should add an explicit smp_mb__after_atomic() for extra correctness,
>> >> because clear_bit() does not guarantee a memory barrier and we must make
>> >> sure that pir reads can't be reordered before it.
>> >> x86 clear_bit() currently uses locked instruction, though.
>> >
>> > smp_mb__after_atomic is empty on x86 so it's
>> > a documentation thing, not a correctness thing anyway.
>>
>> All atomics currently contain a barrier, but the code is also
>> future-proofing, not just documentation: implementation of clear_bit()
>> could drop the barrier and smp_mb__after_atomic() would then become a
>> real barrier.
>>
>> Adding dma_mb__after_atomic() would be even better as this bug could
>> happen even on a uniprocessor with an assigned device, but people who
>> buy a SMP chip to run a UP kernel deserve it.
>
> Not doing dma so does not seem to make sense ...

IOMMU does -- it writes to the PIR and sets ON asynchronously.

> Why do you need a barrier on a UP kernel?

If pi_clear_on() doesn't contain a memory barrier (possible future),
then we have the following race: (pir[0] begins as 0.)

 KVM                            | IOMMU
-------------------------------+-------------
 pir_val = ACCESS_ONCE(pir[0]) |
                               | pir[0] = 123
                               | pi_set_on()
 pi_clear_on()                 |
 if (pir_val)                  |

ACCESS_ONCE() does not prevent the CPU from prefetching pir[0] (ACCESS_ONCE
does nothing in this patch), so if there was 0 in pir[0] before the IOMMU
wrote to it, then our optimization to avoid the xchg would yield a false
negative and the interrupt would be lost.
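
Spelled out, the suggestion from earlier in the thread would look roughly
like this (a sketch against the patch-1 version of the function; as noted,
on today's x86 clear_bit() already uses a locked instruction and
smp_mb__after_atomic() compiles down to a plain barrier()):

    static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
    {
        struct vcpu_vmx *vmx = to_vmx(vcpu);

        if (!pi_test_on(&vmx->pi_desc))
            return;

        pi_clear_on(&vmx->pi_desc);
        /*
         * Keep the pir[] reads below from being ordered before the ON
         * clear, even if clear_bit() ever stops implying a barrier.
         */
        smp_mb__after_atomic();
        kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
    }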

2016-10-28 09:39:53

by Paolo Bonzini

Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry



On 27/10/2016 19:06, Radim Krčmář wrote:
> 2016-10-27 19:51+0300, Michael S. Tsirkin:
>> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
>>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
>>>> On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>>>>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>>>>> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
>>>>>> posted interrupts turn out to be slower than interrupt injection via
>>>>>> KVM_REQ_EVENT.
>>>>>>
>>>>>> This patch optimizes a bit the IRR update, avoiding expensive atomic
>>>>>> operations in the common case where PI.ON=0 at vmentry or the PIR vector
>>>>>> is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
>>>>>> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
>>>>>>
>>>>>> | enable_apicv=1 | enable_apicv=0
>>>>>> | mean stdev | mean stdev
>>>>>> ----------|-----------------|------------------
>>>>>> before | 5826 32.65 | 5765 47.09
>>>>>> after | 5809 43.42 | 5777 77.02
>>>>>>
>>>>>> Of course, any change in the right column is just placebo effect. :)
>>>>>> The savings are bigger if interrupts are frequent.
>>>>>>
>>>>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>>>>> ---
>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>>>>>> (unsigned long *)&pi_desc->control);
>>>>>> }
>>>>>>
>>>>>> +static inline void pi_clear_on(struct pi_desc *pi_desc)
>>>>>> +{
>>>>>> + clear_bit(POSTED_INTR_ON,
>>>>>> + (unsigned long *)&pi_desc->control);
>>>>>> +}
>>>>>
>>>>> We should add an explicit smp_mb__after_atomic() for extra correctness,
>>>>> because clear_bit() does not guarantee a memory barrier and we must make
>>>>> sure that pir reads can't be reordered before it.
>>>>> x86 clear_bit() currently uses locked instruction, though.
>>>>
>>>> smp_mb__after_atomic is empty on x86 so it's
>>>> a documentation thing, not a correctness thing anyway.
>>>
>>> All atomics currently contain a barrier, but the code is also
>>> future-proofing, not just documentation: implementation of clear_bit()
>>> could drop the barrier and smp_mb__after_atomic() would then become a
>>> real barrier.
>>>
>>> Adding dma_mb__after_atomic() would be even better as this bug could
>>> happen even on a uniprocessor with an assigned device, but people who
>>> buy a SMP chip to run a UP kernel deserve it.
>>
>> It's not doing DMA, so that does not seem to make sense ...
>
> IOMMU does -- it writes to the PIR and sets ON asynchronously.

I can use either __smp_mb__after_atomic or virt_mb__after_atomic. The
difference is documentation only, since all of them are
compiler-barriers only on x86.

Preferences?

Thanks,

Paolo

>> Why do you need a barrier on a UP kernel?
>
> If pi_clear_on() doesn't contain a memory barrier (possible future),
> then we have the following race: (pir[0] begins as 0.)
>
> KVM | IOMMU
> -------------------------------+-------------
> pir_val = ACCESS_ONCE(pir[0]) |
> | pir[0] = 123
> | pi_set_on()
> pi_clear_on() |
> if (pir_val) |
>
> ACCESS_ONCE() does not prevent the CPU from prefetching pir[0] (ACCESS_ONCE
> does nothing in this patch), so if there was 0 in pir[0] before IOMMU
> wrote to it, then our optimization to avoid the xchg would yield a false
> negative and the interrupt would be lost.
>

2016-10-28 22:04:13

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 1/5] KVM: x86: avoid atomic operations on APICv vmentry

On Fri, Oct 28, 2016 at 11:39:44AM +0200, Paolo Bonzini wrote:
>
>
> On 27/10/2016 19:06, Radim Krčmář wrote:
> > 2016-10-27 19:51+0300, Michael S. Tsirkin:
> >> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
> >>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
> >>>> On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> >>>>> 2016-10-14 20:21+0200, Paolo Bonzini:
> >>>>>> On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
> >>>>>> posted interrupts turn out to be slower than interrupt injection via
> >>>>>> KVM_REQ_EVENT.
> >>>>>>
> >>>>>> This patch optimizes a bit the IRR update, avoiding expensive atomic
> >>>>>> operations in the common case where PI.ON=0 at vmentry or the PIR vector
> >>>>>> is mostly zero. This saves at least 20 cycles (1%) per vmexit, as
> >>>>>> measured by kvm-unit-tests' inl_from_qemu test (20 runs):
> >>>>>>
> >>>>>> | enable_apicv=1 | enable_apicv=0
> >>>>>> | mean stdev | mean stdev
> >>>>>> ----------|-----------------|------------------
> >>>>>> before | 5826 32.65 | 5765 47.09
> >>>>>> after | 5809 43.42 | 5777 77.02
> >>>>>>
> >>>>>> Of course, any change in the right column is just placebo effect. :)
> >>>>>> The savings are bigger if interrupts are frequent.
> >>>>>>
> >>>>>> Signed-off-by: Paolo Bonzini <[email protected]>
> >>>>>> ---
> >>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>>>> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >>>>>> (unsigned long *)&pi_desc->control);
> >>>>>> }
> >>>>>>
> >>>>>> +static inline void pi_clear_on(struct pi_desc *pi_desc)
> >>>>>> +{
> >>>>>> + clear_bit(POSTED_INTR_ON,
> >>>>>> + (unsigned long *)&pi_desc->control);
> >>>>>> +}
> >>>>>
> >>>>> We should add an explicit smp_mb__after_atomic() for extra correctness,
> >>>>> because clear_bit() does not guarantee a memory barrier and we must make
> >>>>> sure that pir reads can't be reordered before it.
> >>>>> x86 clear_bit() currently uses locked instruction, though.
> >>>>
> >>>> smp_mb__after_atomic is empty on x86 so it's
> >>>> a documentation thing, not a correctness thing anyway.
> >>>
> >>> All atomics currently contain a barrier, but the code is also
> >>> future-proofing, not just documentation: implementation of clear_bit()
> >>> could drop the barrier and smp_mb__after_atomic() would then become a
> >>> real barrier.
> >>>
> >>> Adding dma_mb__after_atomic() would be even better as this bug could
> >>> happen even on a uniprocessor with an assigned device, but people who
> >>> buy a SMP chip to run a UP kernel deserve it.
> >>
> >> It's not doing DMA, so that does not seem to make sense ...
> >
> > IOMMU does -- it writes to the PIR and sets ON asynchronously.
>
> I can use either __smp_mb__after_atomic or virt_mb__after_atomic. The
> difference is documentation only, since all of them are
> compiler-barriers only on x86.

A comment is also an option.

> Preferences?
>
> Thanks,
>
> Paolo

virt_ is for a VM guest; please don't use it for host-side code.
I thought that was clear enough, but maybe I should add
more documentation.

> >> Why do you need a barrier on a UP kernel?
> >
> > If pi_clear_on() doesn't contain a memory barrier (possible future),
> > then we have the following race: (pir[0] begins as 0.)
> >
> > KVM | IOMMU
> > -------------------------------+-------------
> > pir_val = ACCESS_ONCE(pir[0]) |
> > | pir[0] = 123
> > | pi_set_on()
> > pi_clear_on() |
> > if (pir_val) |
> >
> > ACCESS_ONCE() does not prevent the CPU from prefetching pir[0] (ACCESS_ONCE
> > does nothing in this patch), so if there was 0 in pir[0] before IOMMU
> > wrote to it, then our optimization to avoid the xchg would yield a false
> > negative and the interrupt would be lost.
> >


2016-11-03 13:30:42

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry



On 26/10/2016 21:59, Radim Krčmář wrote:
> 2016-10-14 20:21+0200, Paolo Bonzini:
>> Calling apic_find_highest_irr results in IRR being scanned twice,
>> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
>> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
>> compute the new RVI on the fly.
>>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>
> Reviewed-by: Radim Krčmář <[email protected]>

Nope, this breaks nested VMX exit on external interrupt. For now I'm
testing only patch 1 and will push that one only to kvm/next.

Paolo
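
For context on the change being debugged here, the single-pass IRR merge
plus RVI computation that the patch aims at looks conceptually like the
sketch below (register accessors and offsets are paraphrased, not the
actual lapic.c code; it assumes pir[] is eight u32 words mirrored against
the 256-bit APIC_IRR):

static int apic_update_irr_sketch(struct kvm_lapic *apic, u32 *pir)
{
	int max_irr = -1;
	int i;

	for (i = 0; i <= 7; i++) {
		u32 *irr = (u32 *)(apic->regs + APIC_IRR + i * 0x10);
		u32 pir_val = READ_ONCE(pir[i]);

		if (pir_val)
			*irr |= xchg(&pir[i], 0);
		/* Track the highest pending vector while merging. */
		if (*irr)
			max_irr = __fls(*irr) + i * 32;
	}
	return max_irr;
}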

2016-11-03 13:53:47

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

On Thu, Nov 03, 2016 at 02:30:31PM +0100, Paolo Bonzini wrote:
>
>
> On 26/10/2016 21:59, Radim Krčmář wrote:
> > 2016-10-14 20:21+0200, Paolo Bonzini:
> >> Calling apic_find_highest_irr results in IRR being scanned twice,
> >> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
> >> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
> >> compute the new RVI on the fly.
> >>
> >> Signed-off-by: Paolo Bonzini <[email protected]>
> >> ---
> >
> > Reviewed-by: Radim Krčmář <[email protected]>
>
> Nope, this breaks nested VMX exit on external interrupt. For now I'm
> testing only patch 1 and will push that one only to kvm/next.
>
> Paolo

Do you know why? I don't see it yet.

2016-11-03 15:04:02

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

2016-11-03 14:30+0100, Paolo Bonzini:
> On 26/10/2016 21:59, Radim Krčmář wrote:
>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>> Calling apic_find_highest_irr results in IRR being scanned twice,
>>> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
>>> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
>>> compute the new RVI on the fly.
>>>
>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>> ---
>>
>> Reviewed-by: Radim Krčmář <[email protected]>
>
> Nope, this breaks nested VMX exit on external interrupt. For now I'm
> testing only patch 1 and will push that one only to kvm/next.

Hm, does it also happen with this change?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4b20f7304b9c..6be110e53e9e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -941,7 +941,7 @@ struct kvm_x86_ops {
void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
- void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
+ int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f8157a36ab09..ffa541d38343 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4383,9 +4383,9 @@ static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
return;
}

-static void svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+static int svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
- return;
+ return -1; // XXX
}

static void svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f272ccfddc48..2211e1774733 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8493,17 +8493,15 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
}
}

-static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
- int max_irr;

if (!pi_test_on(&vmx->pi_desc))
return;

pi_clear_on(&vmx->pi_desc);
- max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
- vmx_hwapic_irr_update(vcpu, max_irr);
+ return kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
}

static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 20d572c18293..e8789f3a9bfd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6675,7 +6675,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* notified with kvm_vcpu_kick.
*/
if (vcpu->arch.apicv_active)
- kvm_x86_ops->sync_pir_to_irr(vcpu);
+ kvm_x86_ops->hwapic_irr_update(vcpu, kvm_x86_ops->sync_pir_to_irr(vcpu));
}

if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests

2016-11-03 16:00:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry



On 03/11/2016 16:03, Radim Krčmář wrote:
> 2016-11-03 14:30+0100, Paolo Bonzini:
>> On 26/10/2016 21:59, Radim Krčmář wrote:
>>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>>> Calling apic_find_highest_irr results in IRR being scanned twice,
>>>> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
>>>> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
>>>> compute the new RVI on the fly.
>>>>
>>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>>> ---
>>>
>>> Reviewed-by: Radim Krčmář <[email protected]>
>>
>> Nope, this breaks nested VMX exit on external interrupt. For now I'm
>> testing only patch 1 and will push that one only to kvm/next.
>
> Hm, does it also happen with this change?

Probably not but I wanted to understand why. :)

Paolo

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4b20f7304b9c..6be110e53e9e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -941,7 +941,7 @@ struct kvm_x86_ops {
> void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa);
> void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
> - void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> + int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> int (*get_tdp_level)(void);
> u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f8157a36ab09..ffa541d38343 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4383,9 +4383,9 @@ static void svm_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> return;
> }
>
> -static void svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +static int svm_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> {
> - return;
> + return -1; // XXX
> }
>
> static void svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f272ccfddc48..2211e1774733 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -8493,17 +8493,15 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> }
> }
>
> -static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> - int max_irr;
>
> if (!pi_test_on(&vmx->pi_desc))
> return;
>
> pi_clear_on(&vmx->pi_desc);
> - max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
> - vmx_hwapic_irr_update(vcpu, max_irr);
> + return kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
> }
>
> static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 20d572c18293..e8789f3a9bfd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6675,7 +6675,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> * notified with kvm_vcpu_kick.
> */
> if (vcpu->arch.apicv_active)
> - kvm_x86_ops->sync_pir_to_irr(vcpu);
> + kvm_x86_ops->hwapic_irr_update(vcpu, kvm_x86_ops->sync_pir_to_irr(vcpu));
> }
>
> if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
>

2016-11-03 16:01:23

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry



On 03/11/2016 14:53, Michael S. Tsirkin wrote:
> > Nope, this breaks nested VMX exit on external interrupt. For now I'm
> > testing only patch 1 and will push that one only to kvm/next.
>
> Do you know why? I don't see it yet.

Nope, have not debugged it yet.

Paolo

2016-11-03 18:08:00

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

2016-11-03 17:00+0100, Paolo Bonzini:
> On 03/11/2016 16:03, Radim Krčmář wrote:
>> 2016-11-03 14:30+0100, Paolo Bonzini:
>>> On 26/10/2016 21:59, Radim Krčmář wrote:
>>>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>>>> Calling apic_find_highest_irr results in IRR being scanned twice,
>>>>> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
>>>>> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
>>>>> compute the new RVI on the fly.
>>>>>
>>>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>>>> ---
>>>>
>>>> Reviewed-by: Radim Krčmář <[email protected]>
>>>
>>> Nope, this breaks nested VMX exit on external interrupt. For now I'm
>>> testing only patch 1 and will push that one only to kvm/next.

Which hypervisor is being nested?

>> Hm, does it also happen with this change?
>
> Probably not but I wanted to understand why. :)

Yeah, I'm also looking forward to knowing the funny coincidence.

I think a bug is likely for hypervisors that don't enable
PIN_BASED_EXT_INTR_MASK. The bug would trigger when
kvm_cpu_has_interrupt() in vmx_check_nested_events() in
kvm_arch_vcpu_runnable() queues the interrupt ...
but I didn't see how this would have caused a problem. :)

2016-11-03 18:18:07

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry



On 03/11/2016 19:07, Radim Krčmář wrote:
> 2016-11-03 17:00+0100, Paolo Bonzini:
>> On 03/11/2016 16:03, Radim Krčmář wrote:
>>> 2016-11-03 14:30+0100, Paolo Bonzini:
>>>> On 26/10/2016 21:59, Radim Krčmář wrote:
>>>>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>>>>> Calling apic_find_highest_irr results in IRR being scanned twice,
>>>>>> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
>>>>>> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
>>>>>> compute the new RVI on the fly.
>>>>>>
>>>>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>>>>> ---
>>>>>
>>>>> Reviewed-by: Radim Krčmář <[email protected]>
>>>>
>>>> Nope, this breaks nested VMX exit on external interrupt. For now I'm
>>>> testing only patch 1 and will push that one only to kvm/next.
>
> Which hypervisor is being nested?

vmx.flat. :)

> I think a bug is likely for hypervisors that don't enable
> PIN_BASED_EXT_INTR_MASK. The bug would trigger when
> kvm_cpu_has_interrupt() in vmx_check_nested_events() in
> kvm_arch_vcpu_runnable() queues the interrupt ...
> but I didn't see how this would have caused a problem. :)

Ironically, _not_ enabling PIN_BASED_EXT_INTR_MASK and not using HALT
activity state is the only case that passes of the four that vmx.flat tests.

Paolo

2016-11-03 18:29:44

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

2016-11-03 19:18+0100, Paolo Bonzini:
> On 03/11/2016 19:07, Radim Krčmář wrote:
>> 2016-11-03 17:00+0100, Paolo Bonzini:
>>> On 03/11/2016 16:03, Radim Krčmář wrote:
>>>> 2016-11-03 14:30+0100, Paolo Bonzini:
>>>>> On 26/10/2016 21:59, Radim Krčmář wrote:
>>>>>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>>>>>> Calling apic_find_highest_irr results in IRR being scanned twice,
>>>>>>> once in vmx_sync_pir_from_irr and once in apic_search_irr. Change
>>>>>>> sync_pir_from_irr to do the RVI write and kvm_apic_update_irr to
>>>>>>> compute the new RVI on the fly.
>>>>>>>
>>>>>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>>>>>> ---
>>>>>>
>>>>>> Reviewed-by: Radim Krčmář <[email protected]>
>>>>>
>>>>> Nope, this breaks nested VMX exit on external interrupt. For now I'm
>>>>> testing only patch 1 and will push that one only to kvm/next.
>>
>> Which hypervisor is being nested?
>
> vmx.flat. :)
>
>> I think a bug is likely for hypervisors that don't enable
>> PIN_BASED_EXT_INTR_MASK. The bug would trigger when
>> kvm_cpu_has_interrupt() in vmx_check_nested_events() in
>> kvm_arch_vcpu_runnable() queues the interrupt ...
>> but I didn't see how this would have caused a problem. :)
>
> Ironically, _not_ enabling PIN_BASED_EXT_INTR_MASK and not using HALT
> activity state is the only case that passes of the four that vmx.flat tests.

Heh, the behavior is nice

PASS: direct interrupt + hlt
FAIL: intercepted interrupt + hlt
FAIL: direct interrupt + activity state hlt
FAIL: intercepted interrupt + activity state hlt

2016-11-03 20:16:44

by Radim Krčmář

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry

[Oh, I got distracted and sent without finishing ...]

2016-11-03 19:29+0100, Radim Krčmář:
> 2016-11-03 19:18+0100, Paolo Bonzini:
>> On 03/11/2016 19:07, Radim Krčmář wrote:
>>> I think a bug is likely for hypervisors that don't enable
>>> PIN_BASED_EXT_INTR_MASK. The bug would trigger when
>>> kvm_cpu_has_interrupt() in vmx_check_nested_events() in
>>> kvm_arch_vcpu_runnable() queues the interrupt ...
>>> but I didn't see how this would have caused a problem. :)
>>
>> Ironically, _not_ enabling PIN_BASED_EXT_INTR_MASK and not using HALT
>> activity state is the only case that passes of the four that vmx.flat tests.
>
> Heh, the behavior is nice
>
> PASS: direct interrupt + hlt
> FAIL: intercepted interrupt + hlt
> FAIL: direct interrupt + activity state hlt
> FAIL: intercepted interrupt + activity state hlt

but the 3rd one is racy, so I sometimes also get

PASS: direct interrupt + hlt
FAIL: intercepted interrupt + hlt
PASS: direct interrupt + activity state hlt
FAIL: intercepted interrupt + activity state hlt

The 1st and 3rd run with extint exiting disabled and the 2nd and 4th with it enabled ...
but that would mean that we have a bug in a path that gets called in both
cases, so calling vmx_hwapic_irr_update() isn't a problem ...
and suddenly the bug becomes obvious:

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> +static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + int max_irr;
> +
> + if (!pi_test_on(&vmx->pi_desc))

We don't call vmx_hwapic_irr_update() when returning early.

> + return;
> +
> + pi_clear_on(&vmx->pi_desc);
> + max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
> + vmx_hwapic_irr_update(vcpu, max_irr);
> +}
> +
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> @@ -6611,8 +6611,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> * virtual interrupt delivery.
> */
> if (vcpu->arch.apicv_active)
> - kvm_x86_ops->hwapic_irr_update(vcpu,
> - kvm_lapic_find_highest_irr(vcpu));
> + kvm_x86_ops->sync_pir_to_irr(vcpu);
> }
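
In other words the RVI write has to happen on the early-return path too.
A minimal sketch of that shape (the unconditional vmx_hwapic_irr_update
call and the kvm_lapic_find_highest_irr fallback are one possible
arrangement, not the committed fix; kvm_apic_update_irr is assumed to
return the new max_irr, as in the earlier int-returning proposal):

static int vmx_sync_pir_to_irr_sketch(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	int max_irr;

	if (pi_test_on(&vmx->pi_desc)) {
		pi_clear_on(&vmx->pi_desc);
		max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
	} else {
		/*
		 * PIR did not change, but IRR may have; refresh RVI from
		 * the current IRR instead of skipping the update and
		 * leaving RVI stale.
		 */
		max_irr = kvm_lapic_find_highest_irr(vcpu);
	}
	vmx_hwapic_irr_update(vcpu, max_irr);
	return max_irr;
}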

2016-11-04 09:38:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 2/5] KVM: x86: do not scan IRR twice on APICv vmentry



On 03/11/2016 21:16, Radim Krčmář wrote:
> > + if (!pi_test_on(&vmx->pi_desc))
>
> We don't call vmx_hwapic_irr_update() when returning early.

This might be a good start, but the early return is on purpose: IRR is not changing and
the invariant _should_ be that RVI=highest-bit(IRR):

- IRR cleared by processor: see SDM 29.2.2 Virtual-interrupt delivery

- IRR set by processor: see SDM 29.6 Posted-interrupt processing

- IRR set by KVM: ON=1 so it doesn't exit here

- IRR cleared by KVM: might indeed be buggy here, but the next patch
does add a

+ kvm_x86_ops->hwapic_irr_update(vcpu,
+ apic_find_highest_irr(apic));

to apic_clear_irr, which doesn't fix the bug (and doesn't fix it even if
backported here).


So we're missing a place where IRR has changed but RVI is not being
updated. It should be related to vmx_check_nested_events and
kvm_cpu_has_interrupt as you said, but I cannot really see it.

Paolo

>> > + return;
>> > +
>> > + pi_clear_on(&vmx->pi_desc);
>> > + max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
>> > + vmx_hwapic_irr_update(vcpu, max_irr);
>> > +}
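
To make the "IRR cleared by KVM" case concrete, the apic_clear_irr change
mentioned above would keep the invariant on that path by refreshing RVI
right after clearing the bit; roughly as below (paraphrasing lapic.c,
helper names approximate):

static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
{
	struct kvm_vcpu *vcpu = apic->vcpu;

	if (unlikely(vcpu->arch.apicv_active)) {
		/*
		 * Clearing IRR behind the processor's back must be followed
		 * by an RVI refresh, or RVI != highest-bit(IRR) until the
		 * next update.
		 */
		apic_clear_vector(vec, apic->regs + APIC_IRR);
		kvm_x86_ops->hwapic_irr_update(vcpu,
					       apic_find_highest_irr(apic));
	} else {
		apic->irr_pending = false;
		apic_clear_vector(vec, apic->regs + APIC_IRR);
		if (apic_search_irr(apic) != -1)
			apic->irr_pending = true;
	}
}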