2019-03-28 20:32:11

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

This is embarassing but we have another Windows/Hyper-V issue to workaround
in KVM (or QEMU). Hope "RFC" makes it less offensive.

It was noticed that Hyper-V guest on q35 KVM/QEMU VM hangs on boot if e.g.
'piix4-usb-uhci' device is attached. The problem with this device is that
it uses level-triggered interrupts.

The 'hang' scenario develops like this:
1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
of them is level-triggered. KVM injects the edge-triggered one and
requests immediate exit to inject the level-triggered:

kvm_set_irq: gsi 23 level 1 source 0
kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0

2) Hyper-V requires one of its VMs to run to handle the situation but
immediate exit happens:

kvm_entry: vcpu 0
kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
kvm_entry: vcpu 0
kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0

3) Hyper-V doesn't want to deal with the second irq (as its VM still didn't
process the first one) so it just does 'EOI' for level-triggered interrupt
and this causes re-injection:

kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
kvm_eoi: apicid 0 vector 80
kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
kvm_set_irq: gsi 23 level 1 source 0
kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
kvm_entry: vcpu 0

4) As we arm preemtion timer again we get stuck in the infinite loop:

kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
kvm_entry: vcpu 0
kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
kvm_entry: vcpu 0
kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
kvm_eoi: apicid 0 vector 80
kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
kvm_set_irq: gsi 23 level 1 source 0
kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
kvm_entry: vcpu 0
kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
...

How does this work on hardware? I don't really know but it seems that we
were dealing with similar issues before, see commit 184564efae4d7 ("kvm:
ioapic: conditionally delay irq delivery duringeoi broadcast"). In case EOI
doesn't always cause an *immediate* interrupt re-injection some progress
can be made.

An alternative approach would be to port the above mentioned commit to
QEMU's ioapic implementation. I'm, however, not sure that level-triggered
interrupts is a must to trigger the issue.

HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I haven't
looked at SVM yet but their immediate exit implementation does
smp_send_reschedule(), I'm not sure this can cause the above described
lockup.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/vmx/vmcs.h | 2 ++
arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index cb6079f8a227..983dfc60c30c 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -54,6 +54,8 @@ struct loaded_vmcs {
bool launched;
bool nmi_known_unmasked;
bool hv_timer_armed;
+ unsigned long hv_timer_lastrip;
+ int hv_timer_lastrip_cnt;
/* Support for vnmi-less CPUs */
int soft_vnmi_blocked;
ktime_t entry_time;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 617443c1c6d7..8a49ec14dd3a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
vmx->loaded_vmcs->hv_timer_armed = true;
}

+#define HV_TIMER_THROTTLE 100
+#define HV_TIMER_DELAY 1000
+
static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
u64 tscl;
u32 delta_tsc;

+ /*
+ * Some guests are buggy and may behave in a way that causes KVM to
+ * always request immediate exit (e.g. of a nested guest). At the same
+ * time guest progress may be required to resolve the condition.
+ * Throttling below makes sure we never get stuck completely.
+ */
if (vmx->req_immediate_exit) {
- vmx_arm_hv_timer(vmx, 0);
+ unsigned long rip = kvm_rip_read(vcpu);
+
+ if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
+ ++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
+ } else {
+ vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
+ vmx->loaded_vmcs->hv_timer_lastrip = rip;
+ }
+
+ if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
+ vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
+ else
+ vmx_arm_hv_timer(vmx, 0);
+
return;
}

--
2.20.1



2019-03-29 02:02:01

by Liran Alon

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests



> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov <[email protected]> wrote:
>
> This is embarassing but we have another Windows/Hyper-V issue to workaround
> in KVM (or QEMU). Hope "RFC" makes it less offensive.
>
> It was noticed that Hyper-V guest on q35 KVM/QEMU VM hangs on boot if e.g.
> 'piix4-usb-uhci' device is attached. The problem with this device is that
> it uses level-triggered interrupts.
>
> The 'hang' scenario develops like this:
> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
> of them is level-triggered. KVM injects the edge-triggered one and
> requests immediate exit to inject the level-triggered:
>
> kvm_set_irq: gsi 23 level 1 source 0
> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0

There is no immediate-exit here.
Here QEMU just set two pending irqs: vector 80 and vector 96.
Because vCPU 0 is running at non-root-mode, KVM emulates an exit from L2 to L1 on EXTERNAL_INTERRUPT.
Note that EXTERNAL_INTERRUPT is emulated on vector 0x60==96 which is the higher vector which is pending which is correct.

BTW, I don’t know why both are set in LAPIC as edge-triggered and not level-triggered.
But it can be seen from trace pattern that these interrupts are both level-triggered. (See QEMU’s ioapic_service()).
How did you deduce that one is edge-triggered and the other is level-triggered?

>
> 2) Hyper-V requires one of its VMs to run to handle the situation but
> immediate exit happens:
>
> kvm_entry: vcpu 0
> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
> kvm_entry: vcpu 0
> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0

I assume that as part of Hyper-V VMExit handler for EXTERNAL_INTERRUPT, it will forward the interrupt to the host.
As done in KVM vcpu_enter_guest() calling kvm_x86_ops->handle_external_intr().
Because vmcs->vm_exit_intr_info specifies vector 96, we are still left with vector 80 pending.

I also assume that Hyper-V utilise VM_EXIT_ACK_INTR_ON_EXIT and thus vector 96 is cleared from LAPIC IRR
and the bit in LAPIC ISR for vector 96 is set.
This is emulated by L0 KVM at nested_vmx_vmexit() -> kvm_cpu_get_interrupt().

I further assume that at the point that vector 96 runs in L1, interrupts are disabled.
Afterwards I would expect L1 to enable interrupts (Similar to vcpu_enter_guest() calling local_irq_enable() after kvm_x86_ops->handle_external_intr()).

I would expect Hyper-V handler for vector 96 at some point to do EOI such that when interrupts are later enabled, vector 80 will also get injected.
All of this before attempting to resume back into L2.

However, it can be seen that indeed at this resume, you receive, after an immediate-exit, an injection of EXTERNAL_INTERRUPT on vector 0x50==80.
As if Hyper-V never enabled interrupts after handling vector 96 before doing a resume again to L2.

This is still valid of course but just a bit bizarre and inefficient. Oh well. :)

>
> 3) Hyper-V doesn't want to deal with the second irq (as its VM still didn't
> process the first one)

Both interrupts are for L1 not L2.

> so it just does 'EOI' for level-triggered interrupt
> and this causes re-injection:
>
> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
> kvm_eoi: apicid 0 vector 80
> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
> kvm_set_irq: gsi 23 level 1 source 0
> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> kvm_entry: vcpu 0

What happens here is that Hyper-V as a response to second EXTERNAL_INTERRUPT on vector 80,
it invokes vector 80 handler which performs EOI which is configured in ioapic_exit_bitmap to cause EOI_INDUCED exit to L0.
The EOI_INDUCED handler will reach handle_apic_eoi_induced() -> kvm_apic_set_eoi_accelerated() -> kvm_ioapic_send_eoi() -> kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT),
which will cause the exit on KVM_EXIT_IOAPIC_EOI to QEMU as required.

As part of QEMU handling for this exit (ioapic_eoi_broadcast()), it will note that pin’s irr is still set (irq-line was not lowered by vector 80 interrupt handler before EOI),
and thus vector 80 is re-injected by IOAPIC at ioapic_service().

If this is indeed a level-triggered interrupt, then it seems buggy to me that vector 80 handler haven’t lowered the irq-line before EOI.
I would suggest adding a trace to QEMU’s ioapic_set_irq() for when vector=80 and level=0 and ioapic_eoi_broadcast() to verify if this is what happens.
Another option is that even though vector 80 handler lowered irq-line, it got re-set by device between when it lowered the irq-line and until it did an EOI. Which is legit.

If the former is what happens and vector 80 handler haven’t lowered the irq-line before EOI, this is the real root-cause of your issue at hand here.

>
> 4) As we arm preemtion timer again we get stuck in the infinite loop:
>
> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
> kvm_entry: vcpu 0
> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
> kvm_entry: vcpu 0
> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
> kvm_eoi: apicid 0 vector 80
> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
> kvm_set_irq: gsi 23 level 1 source 0
> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> kvm_entry: vcpu 0
> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
> ...
>
> How does this work on hardware? I don't really know but it seems that we
> were dealing with similar issues before, see commit 184564efae4d7 ("kvm:
> ioapic: conditionally delay irq delivery duringeoi broadcast"). In case EOI
> doesn't always cause an *immediate* interrupt re-injection some progress
> can be made.

This commit tries to handle with a bug exactly as I have described above.
In case a level-triggered interrupt handler in guest performs EOI before it lowers the irq-line, it will be re-injected in a loop forever with the level-triggered interrupt.
The reason this doesn’t happen on real hardware is because there is a very short delay between when an EOI is performed to IOAPIC and until IOAPIC attempts to serve a new interrupt.

To handle this, this commit delays the evaluation of IOAPIC new pending interrupts when it receives an EOI for a level-triggered interrupt.

>
> An alternative approach would be to port the above mentioned commit to
> QEMU's ioapic implementation. I'm, however, not sure that level-triggered
> interrupts is a must to trigger the issue.

This seems like the right solution to me.
It seems to me that I have actually pinpointed it in this analysis directly to level-triggered interrupts and it is a must to trigger the issue.

This also explains why the issue doesn’t reproduce when you are using in-kernel irqchip.
It also explains why "piix4-usb-uhci” must be present to re-produce this issue as probably the bug is in that device interrupt handler.
(We should probably report this bug to Microsoft. Do we have any contacts for this?)

Do you agree with above analysis?
-Liran

>
> HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I haven't
> looked at SVM yet but their immediate exit implementation does
> smp_send_reschedule(), I'm not sure this can cause the above described
> lockup.
>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>

As mentioned above, I think this patch should be dropped.

> ---
> arch/x86/kvm/vmx/vmcs.h | 2 ++
> arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
> 2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> index cb6079f8a227..983dfc60c30c 100644
> --- a/arch/x86/kvm/vmx/vmcs.h
> +++ b/arch/x86/kvm/vmx/vmcs.h
> @@ -54,6 +54,8 @@ struct loaded_vmcs {
> bool launched;
> bool nmi_known_unmasked;
> bool hv_timer_armed;
> + unsigned long hv_timer_lastrip;
> + int hv_timer_lastrip_cnt;
> /* Support for vnmi-less CPUs */
> int soft_vnmi_blocked;
> ktime_t entry_time;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 617443c1c6d7..8a49ec14dd3a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
> vmx->loaded_vmcs->hv_timer_armed = true;
> }
>
> +#define HV_TIMER_THROTTLE 100
> +#define HV_TIMER_DELAY 1000
> +
> static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
> {
> struct vcpu_vmx *vmx = to_vmx(vcpu);
> u64 tscl;
> u32 delta_tsc;
>
> + /*
> + * Some guests are buggy and may behave in a way that causes KVM to
> + * always request immediate exit (e.g. of a nested guest). At the same
> + * time guest progress may be required to resolve the condition.
> + * Throttling below makes sure we never get stuck completely.
> + */
> if (vmx->req_immediate_exit) {
> - vmx_arm_hv_timer(vmx, 0);
> + unsigned long rip = kvm_rip_read(vcpu);
> +
> + if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
> + ++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
> + } else {
> + vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
> + vmx->loaded_vmcs->hv_timer_lastrip = rip;
> + }
> +
> + if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
> + vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
> + else
> + vmx_arm_hv_timer(vmx, 0);
> +
> return;
> }
>
> --
> 2.20.1
>


2019-03-29 09:17:15

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

Liran Alon <[email protected]> writes:

>> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov <[email protected]> wrote:
>>
>> This is embarassing but we have another Windows/Hyper-V issue to workaround
>> in KVM (or QEMU). Hope "RFC" makes it less offensive.
>>
>> It was noticed that Hyper-V guest on q35 KVM/QEMU VM hangs on boot if e.g.
>> 'piix4-usb-uhci' device is attached. The problem with this device is that
>> it uses level-triggered interrupts.
>>
>> The 'hang' scenario develops like this:
>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>> of them is level-triggered. KVM injects the edge-triggered one and
>> requests immediate exit to inject the level-triggered:
>>
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>
> There is no immediate-exit here.
> Here QEMU just set two pending irqs: vector 80 and vector 96.
> Because vCPU 0 is running at non-root-mode, KVM emulates an exit from L2 to L1 on EXTERNAL_INTERRUPT.
> Note that EXTERNAL_INTERRUPT is emulated on vector 0x60==96 which is the higher vector which is pending which is correct.
>
> BTW, I don’t know why both are set in LAPIC as edge-triggered and not level-triggered.
> But it can be seen from trace pattern that these interrupts are both level-triggered. (See QEMU’s ioapic_service()).
> How did you deduce that one is edge-triggered and the other is
> level-triggered?

"kvm_apic_accept_irq" event misreports level-triggered interrupts as
edge-triggered, see my "KVM: x86: avoid misreporting level-triggered
irqs as edge-triggered in tracing" patch.

Other than that I double-checked, both in Qemu and KVM (and there's a
lot of additional debug prints stripped) and I'm certain there's no
disagreement anywhere: gsi 23/vec 80 is a level-triggered interrupt.

>
>>
>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>> immediate exit happens:
>>
>> kvm_entry: vcpu 0
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>
> I assume that as part of Hyper-V VMExit handler for EXTERNAL_INTERRUPT, it will forward the interrupt to the host.
> As done in KVM vcpu_enter_guest() calling kvm_x86_ops->handle_external_intr().
> Because vmcs->vm_exit_intr_info specifies vector 96, we are still left with vector 80 pending.
>
> I also assume that Hyper-V utilise VM_EXIT_ACK_INTR_ON_EXIT and thus vector 96 is cleared from LAPIC IRR
> and the bit in LAPIC ISR for vector 96 is set.
> This is emulated by L0 KVM at nested_vmx_vmexit() -> kvm_cpu_get_interrupt().
>
> I further assume that at the point that vector 96 runs in L1, interrupts are disabled.
> Afterwards I would expect L1 to enable interrupts (Similar to vcpu_enter_guest() calling local_irq_enable() after kvm_x86_ops->handle_external_intr()).
>
> I would expect Hyper-V handler for vector 96 at some point to do EOI such that when interrupts are later enabled, vector 80 will also get injected.
> All of this before attempting to resume back into L2.
>
> However, it can be seen that indeed at this resume, you receive, after an immediate-exit, an injection of EXTERNAL_INTERRUPT on vector 0x50==80.
> As if Hyper-V never enabled interrupts after handling vector 96 before doing a resume again to L2.
>
> This is still valid of course but just a bit bizarre and
> inefficient. Oh well. :)

Reverse-engineering is always fun :-)

>
>>
>> 3) Hyper-V doesn't want to deal with the second irq (as its VM still didn't
>> process the first one)
>
> Both interrupts are for L1 not L2.
>
>> so it just does 'EOI' for level-triggered interrupt
>> and this causes re-injection:
>>
>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>> kvm_eoi: apicid 0 vector 80
>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_entry: vcpu 0
>
> What happens here is that Hyper-V as a response to second EXTERNAL_INTERRUPT on vector 80,
> it invokes vector 80 handler which performs EOI which is configured in ioapic_exit_bitmap to cause EOI_INDUCED exit to L0.
> The EOI_INDUCED handler will reach handle_apic_eoi_induced() -> kvm_apic_set_eoi_accelerated() -> kvm_ioapic_send_eoi() -> kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT),
> which will cause the exit on KVM_EXIT_IOAPIC_EOI to QEMU as required.
>
> As part of QEMU handling for this exit (ioapic_eoi_broadcast()), it will note that pin’s irr is still set (irq-line was not lowered by vector 80 interrupt handler before EOI),
> and thus vector 80 is re-injected by IOAPIC at ioapic_service().
>
> If this is indeed a level-triggered interrupt, then it seems buggy to me that vector 80 handler haven’t lowered the irq-line before EOI.
> I would suggest adding a trace to QEMU’s ioapic_set_irq() for when vector=80 and level=0 and ioapic_eoi_broadcast() to verify if this is what happens.
> Another option is that even though vector 80 handler lowered irq-line, it got re-set by device between when it lowered the irq-line and until it did an EOI. Which is legit.
>
> If the former is what happens and vector 80 handler haven’t lowered
> the irq-line before EOI, this is the real root-cause of your issue at
> hand here.

Yes, exactly - and sorry if I wasn't clear about that. Hyper-V does EOI
for a still pending level-triggered interrupt (as it can't actually fix
the reason - in needs Windows partition - L2 VM - to run and do the
job).

As a pure speculation I can guess: they don't want to keep a queue of
interrupts which need L2 handling and they count on the fact that doing
EOI with interrupts disabled in L1 -> VMRESUME will actually allow L2 VM
to run for some time and make some progress.

>
>>
>> 4) As we arm preemtion timer again we get stuck in the infinite loop:
>>
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>> kvm_eoi: apicid 0 vector 80
>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_entry: vcpu 0
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> ...
>>
>> How does this work on hardware? I don't really know but it seems that we
>> were dealing with similar issues before, see commit 184564efae4d7 ("kvm:
>> ioapic: conditionally delay irq delivery duringeoi broadcast"). In case EOI
>> doesn't always cause an *immediate* interrupt re-injection some progress
>> can be made.
>
> This commit tries to handle with a bug exactly as I have described above.
> In case a level-triggered interrupt handler in guest performs EOI before it lowers the irq-line, it will be re-injected in a loop forever with the level-triggered interrupt.
> The reason this doesn’t happen on real hardware is because there is a very short delay between when an EOI is performed to IOAPIC and until IOAPIC attempts to serve a new interrupt.
>
> To handle this, this commit delays the evaluation of IOAPIC new pending interrupts when it receives an EOI for a level-triggered interrupt.
>
>>
>> An alternative approach would be to port the above mentioned commit to
>> QEMU's ioapic implementation. I'm, however, not sure that level-triggered
>> interrupts is a must to trigger the issue.
>
> This seems like the right solution to me.
> It seems to me that I have actually pinpointed it in this analysis
> directly to level-triggered interrupts and it is a must to trigger the
> issue.

Yes, in this particular case it is directly linked to (mis-)handling of
a level-triggered interrupt.

>
> This also explains why the issue doesn’t reproduce when you are using in-kernel irqchip.
> It also explains why "piix4-usb-uhci” must be present to re-produce
> this issue as probably the bug is in that device interrupt handler.

Level-triggered interrupts are rather rare but still present, I think
that any device using them can cause similar hang.

> (We should probably report this bug to Microsoft. Do we have any
> contacts for this?)

Yes, I just wanted to get KVM people's opinion first.

>
> Do you agree with above analysis?

Thank you for it! I think we're in violent agreement here :-)

> -Liran
>
>>
>> HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I haven't
>> looked at SVM yet but their immediate exit implementation does
>> smp_send_reschedule(), I'm not sure this can cause the above described
>> lockup.
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>
> As mentioned above, I think this patch should be dropped.
>

Yes, of course, I never expected it to be merged. I am, however, afraid
that we'll have to get back to this discussion if another scenario with
'immediate eoi storm' appears. I can't think of anything in particular
right away so I'm going to implement the throttling for level-triggred
interrupt re-assertion in Qemu first.

>> ---
>> arch/x86/kvm/vmx/vmcs.h | 2 ++
>> arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
>> 2 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
>> index cb6079f8a227..983dfc60c30c 100644
>> --- a/arch/x86/kvm/vmx/vmcs.h
>> +++ b/arch/x86/kvm/vmx/vmcs.h
>> @@ -54,6 +54,8 @@ struct loaded_vmcs {
>> bool launched;
>> bool nmi_known_unmasked;
>> bool hv_timer_armed;
>> + unsigned long hv_timer_lastrip;
>> + int hv_timer_lastrip_cnt;
>> /* Support for vnmi-less CPUs */
>> int soft_vnmi_blocked;
>> ktime_t entry_time;
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 617443c1c6d7..8a49ec14dd3a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>> vmx->loaded_vmcs->hv_timer_armed = true;
>> }
>>
>> +#define HV_TIMER_THROTTLE 100
>> +#define HV_TIMER_DELAY 1000
>> +
>> static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
>> {
>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>> u64 tscl;
>> u32 delta_tsc;
>>
>> + /*
>> + * Some guests are buggy and may behave in a way that causes KVM to
>> + * always request immediate exit (e.g. of a nested guest). At the same
>> + * time guest progress may be required to resolve the condition.
>> + * Throttling below makes sure we never get stuck completely.
>> + */
>> if (vmx->req_immediate_exit) {
>> - vmx_arm_hv_timer(vmx, 0);
>> + unsigned long rip = kvm_rip_read(vcpu);
>> +
>> + if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
>> + ++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
>> + } else {
>> + vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
>> + vmx->loaded_vmcs->hv_timer_lastrip = rip;
>> + }
>> +
>> + if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
>> + vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
>> + else
>> + vmx_arm_hv_timer(vmx, 0);
>> +
>> return;
>> }
>>
>> --
>> 2.20.1
>>
>

--
Vitaly

2019-03-29 12:10:28

by Liran Alon

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests



> On 29 Mar 2019, at 12:14, Vitaly Kuznetsov <[email protected]> wrote:
>
> Liran Alon <[email protected]> writes:
>
>>> On 28 Mar 2019, at 22:31, Vitaly Kuznetsov <[email protected]> wrote:
>>>
>>> This is embarassing but we have another Windows/Hyper-V issue to workaround
>>> in KVM (or QEMU). Hope "RFC" makes it less offensive.
>>>
>>> It was noticed that Hyper-V guest on q35 KVM/QEMU VM hangs on boot if e.g.
>>> 'piix4-usb-uhci' device is attached. The problem with this device is that
>>> it uses level-triggered interrupts.
>>>
>>> The 'hang' scenario develops like this:
>>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>>> of them is level-triggered. KVM injects the edge-triggered one and
>>> requests immediate exit to inject the level-triggered:
>>>
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>>
>> There is no immediate-exit here.
>> Here QEMU just set two pending irqs: vector 80 and vector 96.
>> Because vCPU 0 is running at non-root-mode, KVM emulates an exit from L2 to L1 on EXTERNAL_INTERRUPT.
>> Note that EXTERNAL_INTERRUPT is emulated on vector 0x60==96 which is the higher vector which is pending which is correct.
>>
>> BTW, I don’t know why both are set in LAPIC as edge-triggered and not level-triggered.
>> But it can be seen from trace pattern that these interrupts are both level-triggered. (See QEMU’s ioapic_service()).
>> How did you deduce that one is edge-triggered and the other is
>> level-triggered?
>
> "kvm_apic_accept_irq" event misreports level-triggered interrupts as
> edge-triggered, see my "KVM: x86: avoid misreporting level-triggered
> irqs as edge-triggered in tracing" patch.
>
> Other than that I double-checked, both in Qemu and KVM (and there's a
> lot of additional debug prints stripped) and I'm certain there's no
> disagreement anywhere: gsi 23/vec 80 is a level-triggered interrupt.
>
>>
>>>
>>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>>> immediate exit happens:
>>>
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>
>> I assume that as part of Hyper-V VMExit handler for EXTERNAL_INTERRUPT, it will forward the interrupt to the host.
>> As done in KVM vcpu_enter_guest() calling kvm_x86_ops->handle_external_intr().
>> Because vmcs->vm_exit_intr_info specifies vector 96, we are still left with vector 80 pending.
>>
>> I also assume that Hyper-V utilise VM_EXIT_ACK_INTR_ON_EXIT and thus vector 96 is cleared from LAPIC IRR
>> and the bit in LAPIC ISR for vector 96 is set.
>> This is emulated by L0 KVM at nested_vmx_vmexit() -> kvm_cpu_get_interrupt().
>>
>> I further assume that at the point that vector 96 runs in L1, interrupts are disabled.
>> Afterwards I would expect L1 to enable interrupts (Similar to vcpu_enter_guest() calling local_irq_enable() after kvm_x86_ops->handle_external_intr()).
>>
>> I would expect Hyper-V handler for vector 96 at some point to do EOI such that when interrupts are later enabled, vector 80 will also get injected.
>> All of this before attempting to resume back into L2.
>>
>> However, it can be seen that indeed at this resume, you receive, after an immediate-exit, an injection of EXTERNAL_INTERRUPT on vector 0x50==80.
>> As if Hyper-V never enabled interrupts after handling vector 96 before doing a resume again to L2.
>>
>> This is still valid of course but just a bit bizarre and
>> inefficient. Oh well. :)
>
> Reverse-engineering is always fun :-)

We can report this to Microsoft as-well. :)

>
>>
>>>
>>> 3) Hyper-V doesn't want to deal with the second irq (as its VM still didn't
>>> process the first one)
>>
>> Both interrupts are for L1 not L2.
>>
>>> so it just does 'EOI' for level-triggered interrupt
>>> and this causes re-injection:
>>>
>>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>>> kvm_eoi: apicid 0 vector 80
>>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_entry: vcpu 0
>>
>> What happens here is that Hyper-V as a response to second EXTERNAL_INTERRUPT on vector 80,
>> it invokes vector 80 handler which performs EOI which is configured in ioapic_exit_bitmap to cause EOI_INDUCED exit to L0.
>> The EOI_INDUCED handler will reach handle_apic_eoi_induced() -> kvm_apic_set_eoi_accelerated() -> kvm_ioapic_send_eoi() -> kvm_make_request(KVM_REQ_IOAPIC_EOI_EXIT),
>> which will cause the exit on KVM_EXIT_IOAPIC_EOI to QEMU as required.
>>
>> As part of QEMU handling for this exit (ioapic_eoi_broadcast()), it will note that pin’s irr is still set (irq-line was not lowered by vector 80 interrupt handler before EOI),
>> and thus vector 80 is re-injected by IOAPIC at ioapic_service().
>>
>> If this is indeed a level-triggered interrupt, then it seems buggy to me that vector 80 handler haven’t lowered the irq-line before EOI.
>> I would suggest adding a trace to QEMU’s ioapic_set_irq() for when vector=80 and level=0 and ioapic_eoi_broadcast() to verify if this is what happens.
>> Another option is that even though vector 80 handler lowered irq-line, it got re-set by device between when it lowered the irq-line and until it did an EOI. Which is legit.
>>
>> If the former is what happens and vector 80 handler haven’t lowered
>> the irq-line before EOI, this is the real root-cause of your issue at
>> hand here.
>
> Yes, exactly - and sorry if I wasn't clear about that. Hyper-V does EOI
> for a still pending level-triggered interrupt (as it can't actually fix
> the reason - in needs Windows partition - L2 VM - to run and do the
> job).
>
> As a pure speculation I can guess: they don't want to keep a queue of
> interrupts which need L2 handling and they count on the fact that doing
> EOI with interrupts disabled in L1 -> VMRESUME will actually allow L2 VM
> to run for some time and make some progress.

It seems very unlikely to me that IOAPIC will delay re-evaluation of irq-lines after an EOI so much that
CPU will actually be able to get back into VMX non-root-mode and execute a bunch of code that will
eventually lower the irq-line.

If L1 requires L2 to handle the level-triggered interrupt, then L1 should not EOI the interrupt but
instead just use vmcs12 to inject interrupt to L2 and allow L2 interrupt handler to first lower the
irq-line and only then perform EOI.

Let’s discuss about this with Microsoft and see what they have to say :)
(Let’s open a private email thread with them regarding this. Feel free to Cc me)

>
>>
>>>
>>> 4) As we arm preemtion timer again we get stuck in the infinite loop:
>>>
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
>>> kvm_eoi: apicid 0 vector 80
>>> kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> ...
>>>
>>> How does this work on hardware? I don't really know but it seems that we
>>> were dealing with similar issues before, see commit 184564efae4d7 ("kvm:
>>> ioapic: conditionally delay irq delivery duringeoi broadcast"). In case EOI
>>> doesn't always cause an *immediate* interrupt re-injection some progress
>>> can be made.
>>
>> This commit tries to handle with a bug exactly as I have described above.
>> In case a level-triggered interrupt handler in guest performs EOI before it lowers the irq-line, it will be re-injected in a loop forever with the level-triggered interrupt.
>> The reason this doesn’t happen on real hardware is because there is a very short delay between when an EOI is performed to IOAPIC and until IOAPIC attempts to serve a new interrupt.
>>
>> To handle this, this commit delays the evaluation of IOAPIC new pending interrupts when it receives an EOI for a level-triggered interrupt.
>>
>>>
>>> An alternative approach would be to port the above mentioned commit to
>>> QEMU's ioapic implementation. I'm, however, not sure that level-triggered
>>> interrupts is a must to trigger the issue.
>>
>> This seems like the right solution to me.
>> It seems to me that I have actually pinpointed it in this analysis
>> directly to level-triggered interrupts and it is a must to trigger the
>> issue.
>
> Yes, in this particular case it is directly linked to (mis-)handling of
> a level-triggered interrupt.
>
>>
>> This also explains why the issue doesn’t reproduce when you are using in-kernel irqchip.
>> It also explains why "piix4-usb-uhci” must be present to re-produce
>> this issue as probably the bug is in that device interrupt handler.
>
> Level-triggered interrupts are rather rare but still present, I think
> that any device using them can cause similar hang.

Why? This kind of hang should occur only if guest have buggy level-triggered interrupt handler.
If handler is written properly such that it lowers irq-line and only then does EOI, there is no reason
it will hang. (If between the irq-line lowering and the EOI the line gets asserted again, then it is legit
that there will be another interrupt as it should be handled).

-Liran

>
>> (We should probably report this bug to Microsoft. Do we have any
>> contacts for this?)
>
> Yes, I just wanted to get KVM people's opinion first.
>
>>
>> Do you agree with above analysis?
>
> Thank you for it! I think we're in violent agreement here :-)
>
>> -Liran
>>
>>>
>>> HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary. I haven't
>>> looked at SVM yet but their immediate exit implementation does
>>> smp_send_reschedule(), I'm not sure this can cause the above described
>>> lockup.
>>>
>>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>>
>> As mentioned above, I think this patch should be dropped.
>>
>
> Yes, of course, I never expected it to be merged. I am, however, afraid
> that we'll have to get back to this discussion if another scenario with
> 'immediate eoi storm' appears. I can't think of anything in particular
> right away so I'm going to implement the throttling for level-triggred
> interrupt re-assertion in Qemu first.
>
>>> ---
>>> arch/x86/kvm/vmx/vmcs.h | 2 ++
>>> arch/x86/kvm/vmx/vmx.c | 24 +++++++++++++++++++++++-
>>> 2 files changed, 25 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
>>> index cb6079f8a227..983dfc60c30c 100644
>>> --- a/arch/x86/kvm/vmx/vmcs.h
>>> +++ b/arch/x86/kvm/vmx/vmcs.h
>>> @@ -54,6 +54,8 @@ struct loaded_vmcs {
>>> bool launched;
>>> bool nmi_known_unmasked;
>>> bool hv_timer_armed;
>>> + unsigned long hv_timer_lastrip;
>>> + int hv_timer_lastrip_cnt;
>>> /* Support for vnmi-less CPUs */
>>> int soft_vnmi_blocked;
>>> ktime_t entry_time;
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index 617443c1c6d7..8a49ec14dd3a 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
>>> vmx->loaded_vmcs->hv_timer_armed = true;
>>> }
>>>
>>> +#define HV_TIMER_THROTTLE 100
>>> +#define HV_TIMER_DELAY 1000
>>> +
>>> static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
>>> {
>>> struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> u64 tscl;
>>> u32 delta_tsc;
>>>
>>> + /*
>>> + * Some guests are buggy and may behave in a way that causes KVM to
>>> + * always request immediate exit (e.g. of a nested guest). At the same
>>> + * time guest progress may be required to resolve the condition.
>>> + * Throttling below makes sure we never get stuck completely.
>>> + */
>>> if (vmx->req_immediate_exit) {
>>> - vmx_arm_hv_timer(vmx, 0);
>>> + unsigned long rip = kvm_rip_read(vcpu);
>>> +
>>> + if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
>>> + ++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
>>> + } else {
>>> + vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
>>> + vmx->loaded_vmcs->hv_timer_lastrip = rip;
>>> + }
>>> +
>>> + if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
>>> + vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
>>> + else
>>> + vmx_arm_hv_timer(vmx, 0);
>>> +
>>> return;
>>> }
>>>
>>> --
>>> 2.20.1
>>>
>>
>
> --
> Vitaly


2019-03-29 14:19:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

On 28/03/19 21:31, Vitaly Kuznetsov wrote:
>
> The 'hang' scenario develops like this:
> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
> of them is level-triggered. KVM injects the edge-triggered one and
> requests immediate exit to inject the level-triggered:
>
> kvm_set_irq: gsi 23 level 1 source 0
> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>
> 2) Hyper-V requires one of its VMs to run to handle the situation but
> immediate exit happens:
>
> kvm_entry: vcpu 0
> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
> kvm_entry: vcpu 0
> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0

I supposed before this there was an eoi for vector 96?

The main issue with your patch is that the preemption timer is buggy on
some processors (it runs too fast) and on those processors we shouldn't
use it with nonzero deadline. In particular because it runs too fast it
may not hide the bug.

I think level-triggered interrupts are required for the bug to show.
Edge-triggered interrupts usually have to be acknowledged with a device
register before the host device will trigger another interrupt; or at
least the interrupt event, for example an incoming network packet, must
happen again. This way, when the guest hangs it puts some back pressure
on the host.

I think we should do in QEMU the same fix that was done in the in-kernel
IOAPIC.

Paolo

2019-03-29 14:41:48

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

Paolo Bonzini <[email protected]> writes:

> On 28/03/19 21:31, Vitaly Kuznetsov wrote:
>>
>> The 'hang' scenario develops like this:
>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>> of them is level-triggered. KVM injects the edge-triggered one and
>> requests immediate exit to inject the level-triggered:
>>
>> kvm_set_irq: gsi 23 level 1 source 0
>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>>
>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>> immediate exit happens:
>>
>> kvm_entry: vcpu 0
>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>> kvm_entry: vcpu 0
>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>
> I supposed before this there was an eoi for vector 96?

AFAIR: no, it seems that it is actually the VM it is trying to resume
(Windows partition?) which needs to do some work and with the preemtion
timer of 0 we don't allow it to.

>
> The main issue with your patch is that the preemption timer is buggy on
> some processors (it runs too fast) and on those processors we shouldn't
> use it with nonzero deadline. In particular because it runs too fast it
> may not hide the bug.
>
> I think level-triggered interrupts are required for the bug to show.
> Edge-triggered interrupts usually have to be acknowledged with a device
> register before the host device will trigger another interrupt; or at
> least the interrupt event, for example an incoming network packet, must
> happen again. This way, when the guest hangs it puts some back pressure
> on the host.
>
> I think we should do in QEMU the same fix that was done in the in-kernel
> IOAPIC.

Yes, I have this in my plan. Stay tuned!

--
Vitaly

2019-03-29 15:02:15

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

On 29/03/19 15:40, Vitaly Kuznetsov wrote:
> Paolo Bonzini <[email protected]> writes:
>
>> On 28/03/19 21:31, Vitaly Kuznetsov wrote:
>>>
>>> The 'hang' scenario develops like this:
>>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>>> of them is level-triggered. KVM injects the edge-triggered one and
>>> requests immediate exit to inject the level-triggered:
>>>
>>> kvm_set_irq: gsi 23 level 1 source 0
>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>>>
>>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>>> immediate exit happens:
>>>
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>> kvm_entry: vcpu 0
>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>
>> I supposed before this there was an eoi for vector 96?
>
> AFAIR: no, it seems that it is actually the VM it is trying to resume
> (Windows partition?) which needs to do some work and with the preemtion
> timer of 0 we don't allow it to.

kvm_apic_accept_irq placed IRQ 96 in IRR, and Hyper-V should be running
with "acknowledge interrupt on exit" since int_info is nonzero in
kvm_nested_vmexit_inject.

Therefore, at the kvm_nested_vmexit_inject tracepoint KVM should have
set bit 96 in ISR; and because PPR is now 96, interrupt 80 should have
never been delivered. Unless 96 is an auto-EOI interrupt, in which case
this comment would apply

/*
* For auto-EOI interrupts, there might be another pending
* interrupt above PPR, so check whether to raise another
* KVM_REQ_EVENT.
*/

IIRC there was an enlightenment to tell Windows "I support auto-EOI but
please don't use it". If this is what's happening, that would also fix it.

Thanks,

Paolo

2019-03-29 15:15:43

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

Paolo Bonzini <[email protected]> writes:

> On 29/03/19 15:40, Vitaly Kuznetsov wrote:
>> Paolo Bonzini <[email protected]> writes:
>>
>>> On 28/03/19 21:31, Vitaly Kuznetsov wrote:
>>>>
>>>> The 'hang' scenario develops like this:
>>>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>>>> of them is level-triggered. KVM injects the edge-triggered one and
>>>> requests immediate exit to inject the level-triggered:
>>>>
>>>> kvm_set_irq: gsi 23 level 1 source 0
>>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>>>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>>>>
>>>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>>>> immediate exit happens:
>>>>
>>>> kvm_entry: vcpu 0
>>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>>> kvm_entry: vcpu 0
>>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>>
>>> I supposed before this there was an eoi for vector 96?
>>
>> AFAIR: no, it seems that it is actually the VM it is trying to resume
>> (Windows partition?) which needs to do some work and with the preemtion
>> timer of 0 we don't allow it to.
>
> kvm_apic_accept_irq placed IRQ 96 in IRR, and Hyper-V should be running
> with "acknowledge interrupt on exit" since int_info is nonzero in
> kvm_nested_vmexit_inject.
>
> Therefore, at the kvm_nested_vmexit_inject tracepoint KVM should have
> set bit 96 in ISR; and because PPR is now 96, interrupt 80 should have
> never been delivered. Unless 96 is an auto-EOI interrupt, in which case
> this comment would apply
>
> /*
> * For auto-EOI interrupts, there might be another pending
> * interrupt above PPR, so check whether to raise another
> * KVM_REQ_EVENT.
> */
>
> IIRC there was an enlightenment to tell Windows "I support auto-EOI but
> please don't use it". If this is what's happening, that would also fix it.
>

Oh, that's actually an interesting thought, thanks!

Indeed, there is CPUID 0x40000004.EAX Bit 9: Recommend deprecating
AutoEOI which we don't currently set. I'll try and report back.

--
Vitaly

2019-03-29 15:33:55

by Liran Alon

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests



> On 29 Mar 2019, at 18:01, Paolo Bonzini <[email protected]> wrote:
>
> On 29/03/19 15:40, Vitaly Kuznetsov wrote:
>> Paolo Bonzini <[email protected]> writes:
>>
>>> On 28/03/19 21:31, Vitaly Kuznetsov wrote:
>>>>
>>>> The 'hang' scenario develops like this:
>>>> 1) Hyper-V boots and QEMU is trying to inject two irq simultaneously. One
>>>> of them is level-triggered. KVM injects the edge-triggered one and
>>>> requests immediate exit to inject the level-triggered:
>>>>
>>>> kvm_set_irq: gsi 23 level 1 source 0
>>>> kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
>>>> kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
>>>> kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
>>>> kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
>>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0
>>>>
>>>> 2) Hyper-V requires one of its VMs to run to handle the situation but
>>>> immediate exit happens:
>>>>
>>>> kvm_entry: vcpu 0
>>>> kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
>>>> kvm_entry: vcpu 0
>>>> kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
>>>> kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
>>>> kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
>>>
>>> I supposed before this there was an eoi for vector 96?
>>
>> AFAIR: no, it seems that it is actually the VM it is trying to resume
>> (Windows partition?) which needs to do some work and with the preemtion
>> timer of 0 we don't allow it to.
>
> kvm_apic_accept_irq placed IRQ 96 in IRR, and Hyper-V should be running
> with "acknowledge interrupt on exit" since int_info is nonzero in
> kvm_nested_vmexit_inject.
>
> Therefore, at the kvm_nested_vmexit_inject tracepoint KVM should have
> set bit 96 in ISR; and because PPR is now 96, interrupt 80 should have
> never been delivered. Unless 96 is an auto-EOI interrupt, in which case
> this comment would apply
>
> /*
> * For auto-EOI interrupts, there might be another pending
> * interrupt above PPR, so check whether to raise another
> * KVM_REQ_EVENT.
> */
>
> IIRC there was an enlightenment to tell Windows "I support auto-EOI but
> please don't use it". If this is what's happening, that would also fix it.
>
> Thanks,
>
> Paolo

Paolo I am not sure this is the case here.
Please read my other replies in this email thread.

I think this is just a standard issue of a level-triggered interrupt handler in L1 (Hyper-V) that performs EOI before it lowers the irq-line.
I don’t think vector 96 is even related to the issue at hand here. This is why after it was already handled, the loop of EXTERNAL_INTERRUPT
happens on vector 80 and not vector 96.

In addition, there is a missing optimisation from Hyper-V that after it handles an EXTERNAL_INTERRUPT exit, it doesn’t enable interrupts
to receive other pending host interrupts (In our case, the pending vector 80) and will therefore only receive it once it enters back to L2
which will cause another EXTERNAL_INTERRUPT exit but this time on vector 80.

-Liran



2019-03-29 17:15:58

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

On 29/03/19 16:32, Liran Alon wrote:
> Paolo I am not sure this is the case here. Please read my other
> replies in this email thread.
>
> I think this is just a standard issue of a level-triggered interrupt
> handler in L1 (Hyper-V) that performs EOI before it lowers the
> irq-line. I don’t think vector 96 is even related to the issue at
> hand here. This is why after it was already handled, the loop of
> EXTERNAL_INTERRUPT happens on vector 80 and not vector 96.

Hmm... Vitaly, what machine were you testing on---does it have APIC-v?
If not, then you should have seen either an EOI for irq 96 or a TPR
below threshold vmexit. However, if it has APIC-v then you wouldn't
have seen any of this (you only see the EOI for irq 80 because it's
level triggered) and Liran is probably right.

Paolo

> In addition, there is a missing optimisation from Hyper-V that after
> it handles an EXTERNAL_INTERRUPT exit, it doesn’t enable interrupts
> to receive other pending host interrupts (In our case, the pending
> vector 80) and will therefore only receive it once it enters back to
> L2 which will cause another EXTERNAL_INTERRUPT exit but this time on
> vector 80.


2019-04-01 08:41:59

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

Paolo Bonzini <[email protected]> writes:

> On 29/03/19 16:32, Liran Alon wrote:
>> Paolo I am not sure this is the case here. Please read my other
>> replies in this email thread.
>>
>> I think this is just a standard issue of a level-triggered interrupt
>> handler in L1 (Hyper-V) that performs EOI before it lowers the
>> irq-line. I don’t think vector 96 is even related to the issue at
>> hand here. This is why after it was already handled, the loop of
>> EXTERNAL_INTERRUPT happens on vector 80 and not vector 96.
>
> Hmm... Vitaly, what machine were you testing on---does it have APIC-v?
> If not, then you should have seen either an EOI for irq 96 or a TPR
> below threshold vmexit. However, if it has APIC-v then you wouldn't
> have seen any of this (you only see the EOI for irq 80 because it's
> level triggered) and Liran is probably right.
>

It does, however, the issue is reproducible with and without
it. Moreover, I think the second simultaneous IRQ is just a red herring;
Here is another trace (enable_apicv). Posting it non-stripped and hope
your eyes will catch something I'm missing:

[001] 513675.736316: kvm_exit: reason VMRESUME rip 0xfffff80002cae115 info 0 0
[001] 513675.736321: kvm_entry: vcpu 0
[001] 513675.736565: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff80362dcd26d info 0 800000ec
[001] 513675.736566: kvm_nested_vmexit: rip fffff80362dcd26d reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 800000ec int_info_err 0
[001] 513675.736568: kvm_entry: vcpu 0
[001] 513675.736650: kvm_exit: reason EPT_VIOLATION rip 0xfffff80362dcd230 info 182 0
[001] 513675.736651: kvm_nested_vmexit: rip fffff80362dcd230 reason EPT_VIOLATION info1 182 info2 0 int_info 0 int_info_err 0
[001] 513675.736651: kvm_page_fault: address 261200000 error_code 182

-> injecting

[008] 513675.737059: kvm_set_irq: gsi 23 level 1 source 0
[008] 513675.737061: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
[008] 513675.737062: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
[001] 513675.737233: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
[001] 513675.737239: kvm_entry: vcpu 0
[001] 513675.737243: kvm_exit: reason EOI_INDUCED rip 0xfffff80002c85e1a info 50 0

-> immediate EOI causing re-injection (even preemption timer is not
involved here).

[001] 513675.737244: kvm_eoi: apicid 0 vector 80
[001] 513675.737245: kvm_fpu: unload
[001] 513675.737246: kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
[001] 513675.737256: kvm_set_irq: gsi 23 level 1 source 0
[001] 513675.737259: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
[001] 513675.737260: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
[001] 513675.737264: kvm_fpu: load
[001] 513675.737265: kvm_entry: vcpu 0
[001] 513675.737271: kvm_exit: reason VMRESUME rip 0xfffff80002cae115 info 0 0
[001] 513675.737278: kvm_entry: vcpu 0
[001] 513675.737282: kvm_exit: reason PREEMPTION_TIMER rip 0xfffff80362dcc2d0 info 0 0
[001] 513675.737283: kvm_nested_vmexit: rip fffff80362dcc2d0 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
[001] 513675.737285: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
[001] 513675.737289: kvm_entry: vcpu 0
[001] 513675.737293: kvm_exit: reason EOI_INDUCED rip 0xfffff80002c85e1a info 50 0
[001] 513675.737293: kvm_eoi: apicid 0 vector 80
[001] 513675.737294: kvm_fpu: unload
[001] 513675.737295: kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
[001] 513675.737299: kvm_set_irq: gsi 23 level 1 source 0
[001] 513675.737299: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
[001] 513675.737300: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
[001] 513675.737302: kvm_fpu: load
[001] 513675.737303: kvm_entry: vcpu 0
[001] 513675.737307: kvm_exit: reason VMRESUME rip 0xfffff80002cae115 info 0 0

...

--
Vitaly

2019-04-01 08:52:30

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

Paolo Bonzini <[email protected]> writes:

>
> IIRC there was an enlightenment to tell Windows "I support auto-EOI but
> please don't use it". If this is what's happening, that would also fix it.
>

Unfortunately this doesn't save the day, Hyper-V behaves the
same. I'm more and move convinced that the secord IRQ was completely
unrelated:

[011] 515003.517644: kvm_exit: reason EPT_MISCONFIG rip 0xfffff80ca6bf1a5e info 0 0
[011] 515003.517645: kvm_nested_vmexit: rip fffff80ca6bf1a5e reason EPT_MISCONFIG info1 0 info2 0 int_info 0 int_info_err 0
[011] 515003.517649: kvm_emulate_insn: 0:fffff80ca6bf1a5e: 8b 08
[011] 515003.517652: vcpu_match_mmio: gva 0xffffd300eb0a8008 gpa 0x91044008 Read GPA
[011] 515003.517653: kvm_mmio: mmio unsatisfied-read len 4 gpa 0x91044008 val 0x0
[011] 515003.517654: kvm_fpu: unload
[011] 515003.517655: kvm_userspace_exit: reason KVM_EXIT_MMIO (6)

-> injecting

[007] 515003.517664: kvm_set_irq: gsi 23 level 1 source 0
[007] 515003.517666: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
[007] 515003.517667: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
[011] 515003.517688: kvm_fpu: load
[011] 515003.517689: kvm_mmio: mmio read len 4 gpa 0x91044008 val 0x1
[011] 515003.517692: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
[011] 515003.517696: kvm_entry: vcpu 0

-> immediate EOI

[011] 515003.517702: kvm_exit: reason EOI_INDUCED rip 0xfffff80002427e1a info 50 0
[011] 515003.517703: kvm_eoi: apicid 0 vector 80
[011] 515003.517704: kvm_fpu: unload
[011] 515003.517705: kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
[011] 515003.517710: kvm_set_irq: gsi 23 level 1 source 0
[011] 515003.517711: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
[011] 515003.517712: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
[011] 515003.517714: kvm_fpu: load
[011] 515003.517716: kvm_entry: vcpu 0
[011] 515003.517720: kvm_exit: reason VMRESUME rip 0xfffff80002450115 info 0 0
[011] 515003.517726: kvm_entry: vcpu 0
[011] 515003.517729: kvm_exit: reason PREEMPTION_TIMER rip 0xfffff80ca6bf1a60 info 0 0
[011] 515003.517730: kvm_nested_vmexit: rip fffff80ca6bf1a60 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
[011] 515003.517732: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
[011] 515003.517735: kvm_entry: vcpu 0
[011] 515003.517739: kvm_exit: reason EOI_INDUCED rip 0xfffff80002427e1a info 50 0
[011] 515003.517739: kvm_eoi: apicid 0 vector 80
[011] 515003.517740: kvm_fpu: unload
[011] 515003.517741: kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
[011] 515003.517743: kvm_set_irq: gsi 23 level 1 source 0
[011] 515003.517744: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
[011] 515003.517744: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
[011] 515003.517746: kvm_fpu: load
[011] 515003.517747: kvm_entry: vcpu 0
[011] 515003.517751: kvm_exit: reason VMRESUME rip 0xfffff80002450115 info 0 0

...

--
Vitaly

2019-04-01 10:09:38

by Liran Alon

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests



> On 1 Apr 2019, at 11:39, Vitaly Kuznetsov <[email protected]> wrote:
>
> Paolo Bonzini <[email protected]> writes:
>
>> On 29/03/19 16:32, Liran Alon wrote:
>>> Paolo I am not sure this is the case here. Please read my other
>>> replies in this email thread.
>>>
>>> I think this is just a standard issue of a level-triggered interrupt
>>> handler in L1 (Hyper-V) that performs EOI before it lowers the
>>> irq-line. I don’t think vector 96 is even related to the issue at
>>> hand here. This is why after it was already handled, the loop of
>>> EXTERNAL_INTERRUPT happens on vector 80 and not vector 96.
>>
>> Hmm... Vitaly, what machine were you testing on---does it have APIC-v?
>> If not, then you should have seen either an EOI for irq 96 or a TPR
>> below threshold vmexit. However, if it has APIC-v then you wouldn't
>> have seen any of this (you only see the EOI for irq 80 because it's
>> level triggered) and Liran is probably right.
>>
>
> It does, however, the issue is reproducible with and without
> it. Moreover, I think the second simultaneous IRQ is just a red herring;
> Here is another trace (enable_apicv). Posting it non-stripped and hope
> your eyes will catch something I'm missing:
>
> [001] 513675.736316: kvm_exit: reason VMRESUME rip 0xfffff80002cae115 info 0 0
> [001] 513675.736321: kvm_entry: vcpu 0
> [001] 513675.736565: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfffff80362dcd26d info 0 800000ec
> [001] 513675.736566: kvm_nested_vmexit: rip fffff80362dcd26d reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 800000ec int_info_err 0
> [001] 513675.736568: kvm_entry: vcpu 0
> [001] 513675.736650: kvm_exit: reason EPT_VIOLATION rip 0xfffff80362dcd230 info 182 0
> [001] 513675.736651: kvm_nested_vmexit: rip fffff80362dcd230 reason EPT_VIOLATION info1 182 info2 0 int_info 0 int_info_err 0
> [001] 513675.736651: kvm_page_fault: address 261200000 error_code 182
>
> -> injecting
>
> [008] 513675.737059: kvm_set_irq: gsi 23 level 1 source 0
> [008] 513675.737061: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> [008] 513675.737062: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> [001] 513675.737233: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
> [001] 513675.737239: kvm_entry: vcpu 0
> [001] 513675.737243: kvm_exit: reason EOI_INDUCED rip 0xfffff80002c85e1a info 50 0
>
> -> immediate EOI causing re-injection (even preemption timer is not
> involved here).
>
> [001] 513675.737244: kvm_eoi: apicid 0 vector 80
> [001] 513675.737245: kvm_fpu: unload
> [001] 513675.737246: kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
> [001] 513675.737256: kvm_set_irq: gsi 23 level 1 source 0
> [001] 513675.737259: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> [001] 513675.737260: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> [001] 513675.737264: kvm_fpu: load
> [001] 513675.737265: kvm_entry: vcpu 0
> [001] 513675.737271: kvm_exit: reason VMRESUME rip 0xfffff80002cae115 info 0 0
> [001] 513675.737278: kvm_entry: vcpu 0
> [001] 513675.737282: kvm_exit: reason PREEMPTION_TIMER rip 0xfffff80362dcc2d0 info 0 0
> [001] 513675.737283: kvm_nested_vmexit: rip fffff80362dcc2d0 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
> [001] 513675.737285: kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
> [001] 513675.737289: kvm_entry: vcpu 0
> [001] 513675.737293: kvm_exit: reason EOI_INDUCED rip 0xfffff80002c85e1a info 50 0
> [001] 513675.737293: kvm_eoi: apicid 0 vector 80
> [001] 513675.737294: kvm_fpu: unload
> [001] 513675.737295: kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
> [001] 513675.737299: kvm_set_irq: gsi 23 level 1 source 0
> [001] 513675.737299: kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
> [001] 513675.737300: kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
> [001] 513675.737302: kvm_fpu: load
> [001] 513675.737303: kvm_entry: vcpu 0
> [001] 513675.737307: kvm_exit: reason VMRESUME rip 0xfffff80002cae115 info 0 0
>
> ...
>
> --
> Vitaly

So to sum-up: This matches what I mentioned in my previous emails right?
That vector 96 is not related, and the only issue here is that level-triggered interrupt handler for vector 80 is doing EOI before lowering the irq-line.
Which cause vector 80 to be injected in infinite loop.
And this is not even related to being a nested virtualization workload. It’s just an issue in Hyper-V (L1) interrupt handler for vector 80.

Therefore the only action-items are:
1) Microsoft to fix Hyper-V vector 80 interrupt handler to lower irq-line before EOI.
2) Patch QEMU IOAPIC implementation to have a mechanism similar to KVM to delay injection of level-triggered interrupt
in case we are injecting the same interrupt for X times in a row.

-Liran









2019-04-01 10:43:15

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemtion timer to assist buggy guests

Liran Alon <[email protected]> writes:

> So to sum-up: This matches what I mentioned in my previous emails right?
> That vector 96 is not related, and the only issue here is that level-triggered interrupt handler for vector 80 is doing EOI before lowering the irq-line.
> Which cause vector 80 to be injected in infinite loop.
> And this is not even related to being a nested virtualization
> workload. It’s just an issue in Hyper-V (L1) interrupt handler for
> vector 80.

Yes, it may be related to how communications between Hyper-V and Windows
partition are organized but it's a pure speculation.

>
> Therefore the only action-items are:
> 1) Microsoft to fix Hyper-V vector 80 interrupt handler to lower irq-line before EOI.
> 2) Patch QEMU IOAPIC implementation to have a mechanism similar to KVM to delay injection of level-triggered interrupt
> in case we are injecting the same interrupt for X times in a row.

I'm intended to do 2) and we already let MS know there's an issue.

Thanks for your help with the investigation guys!

--
Vitaly