2023-07-18 09:33:31

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 0/3] Fix 'Spurious APIC interrupt (vector 0xFF) on CPU#n' issue

Recently we found an issue which causes these error messages

to be sometimes logged if the guest has VFIO device attached:



'Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen'



It was traced to the incorrect APICv inhibition bug which started with

'KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base'

(All these issues are now fixed)



However, there are valid cases for the APICv to be inhibited and it should not

cause spurious interrupts to be injected to the guest.



After some debug, the root cause was found and it is that __kvm_apic_update_irr

doesn't set irr_pending which later triggers a int->unsigned char conversion

bug which leads to the wrong 0xFF injection.



This also leads to an unbounded delay in injecting the interrupt and hurts

performance.



In addition to that, I also noticed that __kvm_apic_update_irr is not atomic

in regard to IRR, which can lead to an even harder to debug bug.



Best regards,

Maxim Levitsky



Maxim Levitsky (3):

KVM: x86: VMX: __kvm_apic_update_irr must update the IRR atomically

KVM: x86: VMX: set irr_pending in kvm_apic_update_irr

KVM: x86: check the kvm_cpu_get_interrupt result before using it



arch/x86/kvm/lapic.c | 23 +++++++++++++++--------

arch/x86/kvm/x86.c | 10 +++++++---

2 files changed, 22 insertions(+), 11 deletions(-)



--

2.26.3






2023-07-18 09:41:37

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 1/3] KVM: x86: VMX: __kvm_apic_update_irr must update the IRR atomically

If APICv is inhibited, then IPIs from peer vCPUs are done by
atomically setting bits in IRR.

This means, that when __kvm_apic_update_irr copies PIR to IRR,
it has to modify IRR atomically as well.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/lapic.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e542cf285b5184..b3f57e0f0d64ae 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -627,15 +627,19 @@ bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr)

for (i = vec = 0; i <= 7; i++, vec += 32) {
pir_val = READ_ONCE(pir[i]);
- irr_val = *((u32 *)(regs + APIC_IRR + i * 0x10));
+ irr_val = READ_ONCE(*((u32 *)(regs + APIC_IRR + i * 0x10)));
+
if (pir_val) {
+ pir_val = xchg(&pir[i], 0);
+
+ while (!try_cmpxchg(((u32 *)(regs + APIC_IRR + i * 0x10)),
+ &irr_val, irr_val | pir_val));
+
prev_irr_val = irr_val;
- irr_val |= xchg(&pir[i], 0);
- *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
- if (prev_irr_val != irr_val) {
- max_updated_irr =
- __fls(irr_val ^ prev_irr_val) + vec;
- }
+ irr_val |= pir_val;
+
+ if (prev_irr_val != irr_val)
+ max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
}
if (irr_val)
*max_irr = __fls(irr_val) + vec;
--
2.26.3


2023-07-18 09:53:49

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 3/3] KVM: x86: check the kvm_cpu_get_interrupt result before using it

The code was blindly assuming that kvm_cpu_get_interrupt never returns -1
when there is a pending interrupt.

While this should be true, a bug in KVM can still cause this.

If -1 is returned, the code before this patch was converting it to 0xFF,
and 0xFF interrupt was injected to the guest, which results in an issue
which was hard to debug.

Add WARN_ON_ONCE to catch this case and skip the injection
if this happens again.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/x86.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c0778ca39650f4..37162c589e8d0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10190,9 +10190,13 @@ static int kvm_check_and_inject_events(struct kvm_vcpu *vcpu,
if (r < 0)
goto out;
if (r) {
- kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu), false);
- static_call(kvm_x86_inject_irq)(vcpu, false);
- WARN_ON(static_call(kvm_x86_interrupt_allowed)(vcpu, true) < 0);
+ int irq = kvm_cpu_get_interrupt(vcpu);
+
+ if (!WARN_ON_ONCE(irq == -1)) {
+ kvm_queue_interrupt(vcpu, irq, false);
+ static_call(kvm_x86_inject_irq)(vcpu, false);
+ WARN_ON(static_call(kvm_x86_interrupt_allowed)(vcpu, true) < 0);
+ }
}
if (kvm_cpu_has_injectable_intr(vcpu))
static_call(kvm_x86_enable_irq_window)(vcpu);
--
2.26.3


2023-07-18 09:54:06

by Maxim Levitsky

[permalink] [raw]
Subject: [PATCH 2/3] KVM: x86: VMX: set irr_pending in kvm_apic_update_irr

When the APICv is inhibited, the irr_pending optimization is used.

Therefore, when kvm_apic_update_irr sets bits in the IRR,
it must set irr_pending to true as well.

Signed-off-by: Maxim Levitsky <[email protected]>
---
arch/x86/kvm/lapic.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b3f57e0f0d64ae..613830ba2d8cef 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -653,8 +653,11 @@ EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr)
{
struct kvm_lapic *apic = vcpu->arch.apic;
+ bool irr_updated = __kvm_apic_update_irr(pir, apic->regs, max_irr);

- return __kvm_apic_update_irr(pir, apic->regs, max_irr);
+ if (unlikely(!apic->apicv_active && irr_updated))
+ apic->irr_pending = true;
+ return irr_updated;
}
EXPORT_SYMBOL_GPL(kvm_apic_update_irr);

--
2.26.3


2023-07-18 11:58:30

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/3] KVM: x86: VMX: __kvm_apic_update_irr must update the IRR atomically

On 7/18/23 11:13, Maxim Levitsky wrote:
> + irr_val = READ_ONCE(*((u32 *)(regs + APIC_IRR + i * 0x10)));

Let's separate out the complicated arithmetic, as it recurs below too:

u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);

> + while (!try_cmpxchg(((u32 *)(regs + APIC_IRR + i * 0x10)),
> + &irr_val, irr_val | pir_val));
> +
> prev_irr_val = irr_val;
> - irr_val |= xchg(&pir[i], 0);
> - *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
> - if (prev_irr_val != irr_val) {
> - max_updated_irr =
> - __fls(irr_val ^ prev_irr_val) + vec;
> - }
> + irr_val |= pir_val;
> +
> + if (prev_irr_val != irr_val)
> + max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;

We can write this a bit more cleanly too, and avoid unnecessary
try_cmpxchg too:

prev_irr_val = irr_val;
do
irr_val = prev_irr_val | pir_val;
while (prev_irr_val != irr_val &&
!try_cmpxchg(p_irr, &prev_irr_val, irr_val));

if (prev_irr_val != irr_val)
max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;

If this looks okay to you, I'll queue the patches for -rc3 and also Cc
them for inclusion in stable kernels.

Paolo


2023-07-18 13:47:53

by Maxim Levitsky

[permalink] [raw]
Subject: Re: [PATCH 1/3] KVM: x86: VMX: __kvm_apic_update_irr must update the IRR atomically

У вт, 2023-07-18 у 13:41 +0200, Paolo Bonzini пише:
> On 7/18/23 11:13, Maxim Levitsky wrote:
> > + irr_val = READ_ONCE(*((u32 *)(regs + APIC_IRR + i * 0x10)));
>
> Let's separate out the complicated arithmetic, as it recurs below too:
>
> u32 *p_irr = (u32 *)(regs + APIC_IRR + i * 0x10);

No objections at all for this change, I wanted to have a minimal patch.

>
> > + while (!try_cmpxchg(((u32 *)(regs + APIC_IRR + i * 0x10)),
> > + &irr_val, irr_val | pir_val));
> > +
> > prev_irr_val = irr_val;
> > - irr_val |= xchg(&pir[i], 0);
> > - *((u32 *)(regs + APIC_IRR + i * 0x10)) = irr_val;
> > - if (prev_irr_val != irr_val) {
> > - max_updated_irr =
> > - __fls(irr_val ^ prev_irr_val) + vec;
> > - }
> > + irr_val |= pir_val;
> > +
> > + if (prev_irr_val != irr_val)
> > + max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
>
> We can write this a bit more cleanly too, and avoid unnecessary

To be honest as far as I see, no matter what to do with this function, it is still
a bit complicated IMHO:

The root cause of the complexity in this function is that it does two things at the same time -
copies both the new bits to IRR and also counts the max_irr.

It would be so much cleaner to first copy new bits from PIR to irr (and that can be done
with 'lock or' or even by setting each bit with atomic bit set (in this way the setting of the bits
will be pretty much the same as what other users of IRR do (set bit atomically + set irr_pending).

And then let the common code count the max_irr.

I doubt this will affect performance in any way, but I don't have a good way to measure it,
so I won't be arguing about it.

On the other hand, I am thinking now that maybe I should make the cmpxchg conditional on
apicv beeing inhibited, as otherwise it works for nothing and actually might affect performance.

This though might in theory cause a race if a sender incorrectly thinks that this's vCPU APICv is
inhibited or not.

It probalby doesn't matter as the only reason for APICv to be inhibited is that AutoEOI thing which
should happen just once when the guest boots.


I also have another idea - I can make the IPI senders still set bits in the PIR even if APICv is inhibited,
then there is no race to worry about although then the bits will always have to be copied from PIR to IRR
(but then again APICv inhibition is rare).



> try_cmpxchg too:
>
> prev_irr_val = irr_val;
> do
> irr_val = prev_irr_val | pir_val;
> while (prev_irr_val != irr_val &&
> !try_cmpxchg(p_irr, &prev_irr_val, irr_val));
>
> if (prev_irr_val != irr_val)
> max_updated_irr = __fls(irr_val ^ prev_irr_val) + vec;
>
> If this looks okay to you, I'll queue the patches for -rc3 and also Cc
> them for inclusion in stable kernels.

No objections for this change as well.

Best regards,
Maxim Levitsky

>
> Paolo
>