From: Wanpeng Li <[email protected]>
------------[ cut here ]------------
WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
Call Trace:
vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
? kvm_arch_vcpu_load+0x62/0x230 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
do_syscall_64+0x8f/0x750
? trace_hardirqs_on_thunk+0x1a/0x1c
entry_SYSCALL64_slow_path+0x25/0x25
This can be reproduced by booting an L1 guest with the 'noapic' grub
parameter, which tells the kernel not to make use of any IOAPICs that may be
present in the system.
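
For context, the warning fires in the acknowledge-interrupt-on-exit path of
nested_vmx_vmexit(), which looks roughly like this (paraphrased from the
4.13-era code, so a sketch rather than an exact quote):

	if (nested_exit_intr_ack_set(vcpu) &&
	    exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT &&
	    nested_exit_on_intr(vcpu)) {
		int irq = kvm_cpu_get_interrupt(vcpu);
		/* This is the WARN above: we got here for an irq window
		 * request although there is no interrupt to acknowledge. */
		WARN_ON(irq < 0);
		vmcs12->vm_exit_intr_info = irq |
			INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
	}
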
The external_intr argument of vmx_check_nested_events() is actually the
req_int_win value passed down from vcpu_enter_guest(), which means that L0's
userspace is requesting an irq window. I observed a scenario where
(!kvm_cpu_has_interrupt(vcpu) && L0's userspace requests an irq window) is
true, i.e. there is no interrupt that L1 needs to inject into L2, so we
should not attempt to emulate "Acknowledge interrupt on exit" just to satisfy
the irq window request in this scenario.
The SDM says that with acknowledge interrupt on exit, bit 31 of the VM-exit
interrupt information field (valid interrupt) is always set to 1 on
EXIT_REASON_EXTERNAL_INTERRUPT. We don't want to break hypervisors that
expect a valid interrupt in that case, so we should do a userspace VM exit
when the window is open and then inject the userspace interrupt with a
VM exit.
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
v2 -> v3:
* request an irq window and don't nested vmexit
v1 -> v2:
* update patch description
* check nested_exit_intr_ack_set() first
arch/x86/kvm/vmx.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 80b20e8..9ef2ec3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10761,7 +10761,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 		return 0;
 	}
 
-	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
+	if ((kvm_cpu_has_interrupt(vcpu) ||
+	    (external_intr && !nested_exit_intr_ack_set(vcpu))) &&
 	    nested_exit_on_intr(vcpu)) {
 		if (vmx->nested.nested_run_pending)
 			return -EBUSY;
--
2.7.4
2017-08-02 03:48-0700, Wanpeng Li:
> From: Wanpeng Li <[email protected]>
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -10761,7 +10761,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
> return 0;
> }
>
> - if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
> + if ((kvm_cpu_has_interrupt(vcpu) ||
> + (external_intr && !nested_exit_intr_ack_set(vcpu))) &&
I think it would be safer to also add something like the second hunk I
posted (that also takes nested_exit_on_intr() into account).
The issue is that we're allowing L2's GUEST_RFLAGS and
GUEST_INTERRUPTIBILITY_INFO to disable userspace interrupt injection
even though neither affects delivery of interrupts into L1.
This means that L2 can block/postpone the delivery to L1 by doing "cli;
busy_loop/normal_critical_section".
Thanks.
> nested_exit_on_intr(vcpu)) {
> if (vmx->nested.nested_run_pending)
> return -EBUSY;
> --
> 2.7.4
>
2017-08-03 4:26 GMT+08:00 Radim Krčmář <[email protected]>:
> 2017-08-02 03:48-0700, Wanpeng Li:
>> From: Wanpeng Li <[email protected]>
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> @@ -10761,7 +10761,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
>> return 0;
>> }
>>
>> - if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
>> + if ((kvm_cpu_has_interrupt(vcpu) ||
>> + (external_intr && !nested_exit_intr_ack_set(vcpu))) &&
>
> I think it would be safer to also add something like the second hunk I
> posted (that also takes nested_exit_on_intr() into account).
>
> The issue is that we're allowing L2's GUEST_RFLAGS and
> GUEST_INTERRUPTIBILITY_INFO to disable userspace interrupt injection
> even though neither affects delivery of interrupts into L1.
> This means that L2 can block/postpone the delivery to L1 by doing "cli;
> busy_loop/normal_critical_section".
Ouch! My fault: the v3 patch can result in an L1 guest softlockup both with
and without the second hunk. I had only tested the patch with an L2 Windows
guest yesterday, but the softlockup can happen when L2 is a Linux guest. So
should we still take v2 for the moment?
Regards,
Wanpeng Li
>
> Thanks.
>
>> nested_exit_on_intr(vcpu)) {
>> if (vmx->nested.nested_run_pending)
>> return -EBUSY;
>> --
>> 2.7.4
>>
2017-08-03 07:01+0800, Wanpeng Li:
> 2017-08-03 4:26 GMT+08:00 Radim Krčmář <[email protected]>:
> > 2017-08-02 03:48-0700, Wanpeng Li:
> >> From: Wanpeng Li <[email protected]>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> @@ -10761,7 +10761,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
> >> return 0;
> >> }
> >>
> >> - if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
> >> + if ((kvm_cpu_has_interrupt(vcpu) ||
> >> + (external_intr && !nested_exit_intr_ack_set(vcpu))) &&
> >
> > I think it would be safer to also add something like the second hunk I
> > posted (that also takes nested_exit_on_intr() into account).
> >
> > The issue is that we're allowing L2's GUEST_RFLAGS and
> > GUEST_INTERRUPTIBILITY_INFO to disable userspace interrupt injection
> > even though neither affects delivery of interrupts into L1.
> > This means that L2 can block/postpone the delivery to L1 by doing "cli;
> > busy_loop/normal_critical_section".
>
> Ouch! My fault: the v3 patch can result in an L1 guest softlockup both with
> and without the second hunk. I had only tested the patch with an L2 Windows
> guest yesterday, but the softlockup can happen when L2 is a Linux guest. So
> should we still take v2 for the moment?
Sure, that one is an improvement over the current situation (I guess it
doesn't break any hypervisor).
I'll just add a comment about its incorrectness.