Message-ID: <08d7e86a4d46e24090812bb1dcf8555c3d0f9e5f.camel@redhat.com>
Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction
From: mlevitsk@redhat.com
To: Sean Christopherson, Michal Wilczynski
Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org, zhi.a.wang@intel.com,
	artem.bityutskiy@linux.intel.com, yuan.yao@intel.com, Zheyu Ma
Date: Tue, 23 Jan 2024 09:48:08 -0500
In-Reply-To:
References: <20231222164543.918037-1-michal.wilczynski@intel.com>

On Tue, 2024-01-02 at 11:57 -0800, Sean Christopherson wrote:
> +Maxim
> 
> On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> > Syzkaller found a warning triggered in nested_vmx_vmexit().
> > vmx->nested.nested_run_pending is non-zero, even though we're in
> > nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
> > considered a bug. However, in this particular scenario, the kernel
> > behavior seems correct.
> > 
> > Syzkaller scenario:
> > 1) Set up VCPU's
> > 2) Run some code with KVM_RUN in L2 as a nested guest
> > 3) Return from KVM_RUN
> > 4) Inject KVM_SMI into the VCPU
> > 5) Change the EFER register with KVM_SET_SREGS to value 0x2501
> > 6) Run some code on the VCPU using KVM_RUN
> > 7) Observe following behavior:
> > 
> > kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000
> > kvm_entry: vcpu 0, rip 0x8000
> > kvm_entry: vcpu 0, rip 0x8000
> > kvm_entry: vcpu 0, rip 0x8002
> > kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
> > kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000
> >                     nested_rip: 0x0000000000000000 int_ctl: 0x00000000
> >                     event_inj: 0x00000000 nested_ept=n guest cr3: 0x0000000000002000
> > kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
> >                           ext_inf2: 0x0000000000000000 ext_int: 0x00000000
> >                           ext_int_err: 0x00000000
> > 
> > What happened here is an SMI was injected immediately and the handler was
> > called at address 0x8000; all is good. Later, an RSM instruction is
> > executed in an emulator to return to the nested VM. em_rsm() is called,
> > which leads to emulator_leave_smm().
> > A part of this function calls the VMX/SVM
> > callback, in this case vmx_leave_smm(). It attempts to set up a pending
> > reentry to the guest VM by calling nested_vmx_enter_non_root_mode() and sets
> > vmx->nested.nested_run_pending to one. Unfortunately, later in
> > emulator_leave_smm(), rsm_load_state_64() fails to write the invalid EFER to
> > the MSR. This results in em_rsm() calling the triple_fault callback. At this
> > point it's clear that KVM should perform the vmexit, but
> > vmx->nested.nested_run_pending is left set to 1. To fix this, reset the
> > vmx->nested.nested_run_pending flag in the triple_fault handler.
> > 
> > TL;DR (courtesy of Yuan Yao)
> > Clear nested_run_pending in case of RSM failure on return from L2 SMM.
> 
> KVM doesn't emulate SMM for L2.  On an injected SMI, KVM forces a synthetic nested
> VM-Exit to get from L2 to L1, and then emulates SMM in the context of L1.
> 
> > The pending VMENTRY to L2 should be cancelled, since such a failure leads
> > to a triple fault which should be injected to L1.
> > 
> > Possible alternative approach:
> > While the proposed approach works, the concern is that it would be
> > simpler, and more readable, to cancel the nested_run_pending in em_rsm().
> > This would, however, require introducing a new callback, e.g.
> > post_leave_smm(), that would cancel nested_run_pending in case of a
> > failure to resume from SMM.
> > 
> > Additionally, while the proposed code fixes a VMX-specific issue, SVM
> > might suffer from a similar problem, as it also uses its own
> > nested_run_pending variable.
> > 
> > Reported-by: Zheyu Ma
> > Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
> 
> Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")
> 
> > Signed-off-by: Michal Wilczynski
> > ---
> >  arch/x86/kvm/vmx/nested.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index c5ec0ef51ff7..44432e19eea6 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> >  
> >  static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> >  	kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > +
> > +	/* In case of a triple fault, cancel the nested reentry. This may occur
> 
> 	/*
> 	 * Multi-line comments should look like this.  Blah blah blab blah blah
> 	 * blah blah blah blah.
> 	 */
> 
> > +	 * when the RSM instruction fails while attempting to restore the state
> > +	 * from SMRAM.
> > +	 */
> > +	vmx->nested.nested_run_pending = 0;
> 
> Argh.  KVM's handling of SMIs while L2 is active is complete garbage.  As explained
> by the comment in vmx_enter_smm(), the L2<->SMM transitions should have a completely
> custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.
> 
> 	/*
> 	 * TODO: Implement custom flows for forcing the vCPU out/in of L2 on
> 	 * SMI and RSM.  Using the common VM-Exit + VM-Enter routines is wrong as
> 	 * SMI and RSM only modify state that is saved and restored via SMRAM.
> 	 * E.g. most MSRs are left untouched, but many are modified by VM-Exit
> 	 * and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
> 	 */
> 
> The Fixes: commit above added a hack on top of the hack.  But it's not entirely
> clear from the changelog exactly what was being fixed.
> 
>     While RSM induced VM entries are not full VM entries,
>     they still need to be followed by actual VM entry to complete it,
>     unlike setting the nested state.
> 
>     This patch fixes boot of hyperv and SMM enabled
>     windows VM running nested on KVM, which fail due
>     to this issue combined with lack of dirty bit setting.
> 
> My first guess would be event injection, but that shouldn't be relevant to RSM.
> Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized at instruction
> boundaries, but except for SMIs (see below), KVM naturally defers injection until
> an instruction boundary by virtue of delivering events via the VMCS/VMCB, i.e.
> by waiting to deliver events until successfully re-entering the guest.
> 
> Nested VM-Enter is a special snowflake because KVM needs to finish injecting events
> from vmcs12 before injecting any synthetic events, i.e. nested_run_pending ensures
> that KVM wouldn't clobber/override an L2 event coming from L1.  In other words,
> nested_run_pending is much more specific than just needing to wait for an instruction
> boundary.
> 
> So while the "wait until the CPU is at an instruction boundary" applies to RSM,
> it's not immediately obvious to me why setting nested_run_pending is necessary.

If nested_run_pending is not set, then the nested entry can be aborted by an
event (e.g. another #SMI, or just an interrupt, since after VM entry
interrupts are usually enabled and usually intercepted). It is even possible
for a nested migration to happen right after the RSM is executed but before
the actual VMENTER is executed.

> And setting nested_run_pending *after* calling nested_vmx_enter_non_root_mode()
> is nasty.  nested_vmx_enter_non_root_mode() and its helpers use nested_run_pending
> to determine whether or not to enforce various consistency checks and other
> behaviors.  And a more minor issue is that stat.nested_run will be incorrectly
> incremented.
> 
> As a stop gap, something like this patch is not awful, though I would strongly
> prefer to be more precise and not clear it on all triple faults.  We've had KVM
> bugs where KVM prematurely synthesizes triple fault on an actual nested VM-Enter,
> and those would be covered up by this fix.
> 
> But due to nested_run_pending being (unnecessarily) buried in vendor structs, it
> might actually be easier to do a cleaner fix.  E.g.
> add yet another flag to track
> that a hardware VM-Enter needs to be completed in order to complete instruction
> emulation.
> 
> And as alluded to above, there's another bug lurking.  Events that are *emulated*
> by KVM must not be emulated until KVM knows the vCPU is at an instruction boundary.
> Specifically, enter_smm() shouldn't be invoked while KVM is in the middle of
> instruction emulation (even if "emulation" is just setting registers and skipping
> the instruction).  Theoretically, that could be fixed by honoring the existing
> at_instruction_boundary flag for SMIs, but that'd be a rather large change and
> at_instruction_boundary is nowhere near accurate enough to use right now.
> 
> Anyways, before we do anything, I'd like to get Maxim's input on what exactly was
> addressed by 759cbd59674a.

There is a good explanation in commit 759cbd59674a and in commit
e8efa4ff0037, which comes before it.

In a nutshell, there were two problems that together created the conditions
for the guest boot failure:

1. When a non-intercepted #SMI arrives while L2 is running, it's almost like
a VM exit from L2 to L1; however, instead of restoring the host state, KVM
jumps to the SMI handler in real mode in L1. Thus, to be able to run a
regular VM exit handler later, KVM stores the host state (which is present
in vmcb01) to the HSAVE area, and on RSM restores it from HSAVE back to
vmcb01.

The bug was that KVM wasn't marking vmcb01 as dirty when restoring the host
state. Usually this is not a problem, because in this case we resume L2 and
thus switch to vmcb02.

2.
Due to the lack of setting nested_run_pending, the above switch to vmcb02
might not complete if there are pending events. This is allowed because RSM
doesn't have an interrupt shadow; thus, once we have finished emulating RSM,
we are in the L2 guest, likely with interrupts enabled.

If the switch to vmcb02 doesn't complete, we end up still running with
vmcb01, with lots of fields that were changed but not marked as dirty. Now,
usually this works because the CPU isn't that aggressive in caching vmcb
fields, but if the whole thing is running nested under KVM, then KVM always
caches a few 'rare' vmcb fields, like the IDTR, so we end up with a stale
IDTR, and on the next interrupt the guest explodes.

PS: this is one of the mails that I forgot to send because I moved to Canada
recently and was offline for some time due to that.

Best regards,
	Maxim Levitsky