From: Vitaly Kuznetsov
To: kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář, Liran Alon, Sean Christopherson, linux-kernel@vger.kernel.org
Subject: [PATCH RFC] KVM: x86: vmx: throttle immediate exit through preemption timer to assist buggy guests
Date: Thu, 28 Mar 2019 21:31:10 +0100
Message-Id: <20190328203110.20655-1-vkuznets@redhat.com>
This is embarrassing, but we have another Windows/Hyper-V issue to work around in KVM (or QEMU). Hopefully "RFC" makes it less offensive.

It was noticed that a Hyper-V guest on a q35 KVM/QEMU VM hangs on boot if e.g. a 'piix4-usb-uhci' device is attached. The problem with this device is that it uses level-triggered interrupts.

The 'hang' scenario develops like this:

1) Hyper-V boots and QEMU tries to inject two irqs simultaneously. One of them is level-triggered. KVM injects the edge-triggered one and requests an immediate exit to inject the level-triggered one:

  kvm_set_irq: gsi 23 level 1 source 0
  kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
  kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
  kvm_msi_set_irq: dst 0 vec 96 (Fixed|physical|edge)
  kvm_apic_accept_irq: apicid 0 vec 96 (Fixed|edge)
  kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000060 int_info_err 0

2) Hyper-V requires one of its VMs to run to handle the situation, but the immediate exit happens first:

  kvm_entry: vcpu 0
  kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
  kvm_entry: vcpu 0
  kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
  kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
  kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0

3) Hyper-V doesn't want to deal with the second irq (as its VM still hasn't processed the first one), so it just does an 'EOI' for the level-triggered interrupt and this causes re-injection:

  kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
  kvm_eoi: apicid 0 vector 80
  kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
  kvm_set_irq: gsi 23 level 1 source 0
  kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
  kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
  kvm_entry: vcpu 0

4) As we arm the preemption timer again, we get stuck
in the infinite loop:

  kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
  kvm_entry: vcpu 0
  kvm_exit: reason PREEMPTION_TIMER rip 0xfffff8022f3d8350 info 0 0
  kvm_nested_vmexit: rip fffff8022f3d8350 reason PREEMPTION_TIMER info1 0 info2 0 int_info 0 int_info_err 0
  kvm_nested_vmexit_inject: reason EXTERNAL_INTERRUPT info1 0 info2 0 int_info 80000050 int_info_err 0
  kvm_entry: vcpu 0
  kvm_exit: reason EOI_INDUCED rip 0xfffff80006a17e1a info 50 0
  kvm_eoi: apicid 0 vector 80
  kvm_userspace_exit: reason KVM_EXIT_IOAPIC_EOI (26)
  kvm_set_irq: gsi 23 level 1 source 0
  kvm_msi_set_irq: dst 0 vec 80 (Fixed|physical|level)
  kvm_apic_accept_irq: apicid 0 vec 80 (Fixed|edge)
  kvm_entry: vcpu 0
  kvm_exit: reason VMRESUME rip 0xfffff80006a40115 info 0 0
  ...

How does this work on hardware? I don't really know, but it seems that we were dealing with similar issues before; see commit 184564efae4d7 ("kvm: ioapic: conditionally delay irq delivery during eoi broadcast"). If EOI doesn't always cause an *immediate* interrupt re-injection, some progress can be made.

An alternative approach would be to port the above-mentioned commit to QEMU's ioapic implementation. I'm, however, not sure that level-triggered interrupts are a must to trigger the issue.

HV_TIMER_THROTTLE/HV_TIMER_DELAY are more or less arbitrary.

I haven't looked at SVM yet, but its immediate exit implementation does smp_send_reschedule(); I'm not sure whether that can cause the above-described lockup.
Signed-off-by: Vitaly Kuznetsov
---
 arch/x86/kvm/vmx/vmcs.h |  2 ++
 arch/x86/kvm/vmx/vmx.c  | 24 +++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index cb6079f8a227..983dfc60c30c 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -54,6 +54,8 @@ struct loaded_vmcs {
 	bool launched;
 	bool nmi_known_unmasked;
 	bool hv_timer_armed;
+	unsigned long hv_timer_lastrip;
+	int hv_timer_lastrip_cnt;
 	/* Support for vnmi-less CPUs */
 	int soft_vnmi_blocked;
 	ktime_t entry_time;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 617443c1c6d7..8a49ec14dd3a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6321,14 +6321,36 @@ static void vmx_arm_hv_timer(struct vcpu_vmx *vmx, u32 val)
 	vmx->loaded_vmcs->hv_timer_armed = true;
 }
 
+#define HV_TIMER_THROTTLE 100
+#define HV_TIMER_DELAY 1000
+
 static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 tscl;
 	u32 delta_tsc;
 
+	/*
+	 * Some guests are buggy and may behave in a way that causes KVM to
+	 * always request immediate exit (e.g. of a nested guest). At the same
+	 * time guest progress may be required to resolve the condition.
+	 * Throttling below makes sure we never get stuck completely.
+	 */
 	if (vmx->req_immediate_exit) {
-		vmx_arm_hv_timer(vmx, 0);
+		unsigned long rip = kvm_rip_read(vcpu);
+
+		if (vmx->loaded_vmcs->hv_timer_lastrip == rip) {
+			++vmx->loaded_vmcs->hv_timer_lastrip_cnt;
+		} else {
+			vmx->loaded_vmcs->hv_timer_lastrip_cnt = 0;
+			vmx->loaded_vmcs->hv_timer_lastrip = rip;
+		}
+
+		if (vmx->loaded_vmcs->hv_timer_lastrip_cnt > HV_TIMER_THROTTLE)
+			vmx_arm_hv_timer(vmx, HV_TIMER_DELAY);
+		else
+			vmx_arm_hv_timer(vmx, 0);
+
 		return;
 	}
-- 
2.20.1