2020-05-02 09:29:53

by Suthikulpanit, Suravee

[permalink] [raw]
Subject: [PATCH v2] kvm: ioapic: Introduce arch-specific check for lazy update EOI mechanism

commit f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI") introduces
the following regression.

BUG: stack guard page was hit at 000000008f595917 \
(stack is 00000000bdefe5a4..00000000ae2b06f5)
kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
RIP: 0010:kvm_set_irq+0x51/0x160 [kvm]
Call Trace:
irqfd_resampler_ack+0x32/0x90 [kvm]
kvm_notify_acked_irq+0x62/0xd0 [kvm]
kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
ioapic_set_irq+0x20e/0x240 [kvm]
kvm_ioapic_set_irq+0x5c/0x80 [kvm]
kvm_set_irq+0xbb/0x160 [kvm]
? kvm_hv_set_sint+0x20/0x20 [kvm]
irqfd_resampler_ack+0x32/0x90 [kvm]
kvm_notify_acked_irq+0x62/0xd0 [kvm]
kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
ioapic_set_irq+0x20e/0x240 [kvm]
kvm_ioapic_set_irq+0x5c/0x80 [kvm]
kvm_set_irq+0xbb/0x160 [kvm]
? kvm_hv_set_sint+0x20/0x20 [kvm]
....

This is due to re-entrancy of the lazy update EOI logic
when enable APICv with VFIO pass-through device, which
sets up kvm_irqfd() w/ KVM_IRQFD_FLAG_RESAMPLE.

Fixes by adding re-entrancy check logic.

Reported-by: [email protected]
Suggested-by: Paolo Bonzini <[email protected]>
Link: https://www.spinics.net/lists/kvm/msg213512.html
Fixes: f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207489
Signed-off-by: Suravee Suthikulpanit <[email protected]>
---
arch/x86/kvm/ioapic.c | 11 +++++++++++
arch/x86/kvm/ioapic.h | 1 +
2 files changed, 12 insertions(+)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 750ff0b..a68f624 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -188,6 +188,14 @@ static void ioapic_lazy_update_eoi(struct kvm_ioapic *ioapic, int irq)
struct kvm_vcpu *vcpu;
union kvm_ioapic_redirect_entry *entry = &ioapic->redirtbl[irq];

+ /*
+ * Re-entrancy check due to KVM_IRQFD_FLAG_RESAMPLE can cause
+ * stack overflow where irqfd_resampler_ack() repeatedly calls
+ * ioapic_set_irq().
+ */
+ if (atomic_cmpxchg(&ioapic->in_lazy_update_eoi, 0, 1))
+ return;
+
kvm_for_each_vcpu(i, vcpu, ioapic->kvm) {
if (!kvm_apic_match_dest(vcpu, NULL, APIC_DEST_NOSHORT,
entry->fields.dest_id,
@@ -205,6 +213,8 @@ static void ioapic_lazy_update_eoi(struct kvm_ioapic *ioapic, int irq)
irq);
break;
}
+
+ atomic_set(&ioapic->in_lazy_update_eoi, 0);
}

static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
@@ -707,6 +717,7 @@ int kvm_ioapic_init(struct kvm *kvm)
kvm->arch.vioapic = NULL;
kfree(ioapic);
}
+ atomic_set(&ioapic->in_lazy_update_eoi, 0);

return ret;
}
diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
index 2fb2e3c..898a967 100644
--- a/arch/x86/kvm/ioapic.h
+++ b/arch/x86/kvm/ioapic.h
@@ -91,6 +91,7 @@ struct kvm_ioapic {
struct delayed_work eoi_inject;
u32 irq_eoi[IOAPIC_NUM_PINS];
u32 irr_delivered;
+ atomic_t in_lazy_update_eoi;
};

#ifdef DEBUG
--
1.8.3.1


2020-05-02 17:22:13

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2] kvm: ioapic: Introduce arch-specific check for lazy update EOI mechanism

On 02/05/20 11:24, Suravee Suthikulpanit wrote:
> This is due to re-entrancy of the lazy update EOI logic
> when enable APICv with VFIO pass-through device, which
> sets up kvm_irqfd() w/ KVM_IRQFD_FLAG_RESAMPLE.
>
> Fixes by adding re-entrancy check logic.

This does not explain why this is the right fix.

The questions to answer are: what is causing the re-entrancy? and why
is dropping the second EOI update safe?

The answer to the latter could well be "because we've already processed
it", but the answer to the former is more important.

The re-entrancy happens because the irq state is the OR of
the interrupt state and the resamplefd state. That is, we don't
want to show the state as 0 until we've had a chance to set the
resamplefd. But if the interrupt has _not_ gone low then we get an
infinite loop.

So the actual root cause is that this is a level-triggered interrupt,
otherwise irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID
high and then low and you wouldn't have the infinite loop. But in the
case of level-triggered interrupts the VMEXIT already happens because
TMR is set; only edge-triggered interrupts need the lazy invocation
of the ack notifier. So this should be the fix:

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 7668fed1ce65..ca2d73cd00a3 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -225,12 +225,12 @@ static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
}

/*
- * AMD SVM AVIC accelerate EOI write and do not trap,
- * in-kernel IOAPIC will not be able to receive the EOI.
+ * AMD SVM AVIC accelerate EOI write iff the interrupt is level
+ * triggered, in-kernel IOAPIC will not be able to receive the EOI.
* In this case, we do lazy update of the pending EOI when
* trying to set IOAPIC irq.
*/
- if (kvm_apicv_activated(ioapic->kvm))
+ if (edge && kvm_apicv_activated(ioapic->kvm))
ioapic_lazy_update_eoi(ioapic, irq);

/*

Did I miss anything in the above analysis with respect to AVIC?

Paolo

2020-05-04 06:35:20

by Suthikulpanit, Suravee

[permalink] [raw]
Subject: Re: [PATCH v2] kvm: ioapic: Introduce arch-specific check for lazy update EOI mechanism

Paolo,

On 5/3/20 12:19 AM, Paolo Bonzini wrote:
> On 02/05/20 11:24, Suravee Suthikulpanit wrote:
> ....
> The questions to answer are: what is causing the re-entrancy? and why
> is dropping the second EOI update safe?
>
> The answer to the latter could well be "because we've already processed
> it", but the answer to the former is more important.
>
> The re-entrancy happens because the irq state is the OR of
> the interrupt state and the resamplefd state. That is, we don't
> want to show the state as 0 until we've had a chance to set the
> resamplefd. But if the interrupt has _not_ gone low then we get an
> infinite loop.

I'm not too familiar w/ the resamplefd. I must have missed this part.
Could you please point out to me where the OR logic is?

> So the actual root cause is that this is a level-triggered interrupt,
> otherwise irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID
> high and then low and you wouldn't have the infinite loop.

Okay.

> But in the case of level-triggered interrupts the VMEXIT already happens because
> TMR is set; only edge-triggered interrupts need the lazy invocation
> of the ack notifier.

For AVIC, EOI write for level-triggered would have also be trapped.
And yes, edge-triggered needs lazy ack notifier.

> So this should be the fix:
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 7668fed1ce65..ca2d73cd00a3 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -225,12 +225,12 @@ static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
> }
>
> /*
> - * AMD SVM AVIC accelerate EOI write and do not trap,
> - * in-kernel IOAPIC will not be able to receive the EOI.
> + * AMD SVM AVIC accelerate EOI write iff the interrupt is level
> + * triggered, in-kernel IOAPIC will not be able to receive the EOI.

Actually, it should be "AMD SVM AVIC accelerate EOI write iff the interrupt is _edge_ triggered".

> * In this case, we do lazy update of the pending EOI when
> * trying to set IOAPIC irq.
> */
> - if (kvm_apicv_activated(ioapic->kvm))
> + if (edge && kvm_apicv_activated(ioapic->kvm))
> ioapic_lazy_update_eoi(ioapic, irq);
>
> /*
>
> Did I miss anything in the above analysis with respect to AVIC?


For AMD:
Tested-by: Suravee Suthikulpanit <[email protected]>

Thanks,
Suravee

2020-05-04 10:23:55

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v2] kvm: ioapic: Introduce arch-specific check for lazy update EOI mechanism

On 04/05/20 08:31, Suravee Suthikulpanit wrote:
>>
>>
>> The re-entrancy happens because the irq state is the OR of
>> the interrupt state and the resamplefd state.? That is, we don't
>> want to show the state as 0 until we've had a chance to set the
>> resamplefd.? But if the interrupt has _not_ gone low then we get an
>> infinite loop.
>
> I'm not too familiar w/ the resamplefd.? I must have missed this part.
> Could you please point out to me where the OR logic is?

It's because kvm_ioapic_set_irq gets the actual state of the interrupt
line from __kvm_irq_line_state before calling ioapic_set_irq.

>> But in the case of level-triggered interrupts the VMEXIT already
>> happens because TMR is set; only edge-triggered interrupts need
>> the lazy invocation of the ack notifier.?
>
> For AVIC, EOI write for level-triggered would have also be trapped.

Yes, I was talking about AVIC only there.

> And yes, edge-triggered needs lazy ack notifier.
>
>> So this should be the fix:
>>
>> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>> index 7668fed1ce65..ca2d73cd00a3 100644
>> --- a/arch/x86/kvm/ioapic.c
>> +++ b/arch/x86/kvm/ioapic.c
>> @@ -225,12 +225,12 @@ static int ioapic_set_irq(struct kvm_ioapic
>> *ioapic, unsigned int irq,
>> ????? }
>> ? ????? /*
>> -???? * AMD SVM AVIC accelerate EOI write and do not trap,
>> -???? * in-kernel IOAPIC will not be able to receive the EOI.
>> +???? * AMD SVM AVIC accelerate EOI write iff the interrupt is level
>> +???? * triggered, in-kernel IOAPIC will not be able to receive the EOI.
>
> Actually, it should be "AMD SVM AVIC accelerate EOI write iff the
> interrupt is _edge_ triggered".

Of course, thanks.

Paolo