Date: Tue, 2 Jan 2024 11:57:26 -0800
In-Reply-To: <20231222164543.918037-1-michal.wilczynski@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20231222164543.918037-1-michal.wilczynski@intel.com>
Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction
From: Sean Christopherson
To: Michal Wilczynski
Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org, zhi.a.wang@intel.com,
	artem.bityutskiy@linux.intel.com, yuan.yao@intel.com, Zheyu Ma,
	Maxim Levitsky

+Maxim

On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> Syzkaller found a warning triggered in nested_vmx_vmexit().
> vmx->nested.nested_run_pending is non-zero, even though we're in
> nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
> considered a bug. However in this particular scenario, the kernel
> behavior seems correct.
>
> Syzkaller scenario:
> 1) Set up VCPU's
> 2) Run some code with KVM_RUN in L2 as a nested guest
> 3) Return from KVM_RUN
> 4) Inject KVM_SMI into the VCPU
> 5) Change the EFER register with KVM_SET_SREGS to value 0x2501
> 6) Run some code on the VCPU using KVM_RUN
> 7) Observe following behavior:
>
> kvm_smm_transition:   vcpu 0: entering SMM, smbase 0x30000
> kvm_entry:            vcpu 0, rip 0x8000
> kvm_entry:            vcpu 0, rip 0x8000
> kvm_entry:            vcpu 0, rip 0x8002
> kvm_smm_transition:   vcpu 0: leaving SMM, smbase 0x30000
> kvm_nested_vmenter:   rip: 0x0000000000008002 vmcs: 0x0000000000007000
>                       nested_rip: 0x0000000000000000 int_ctl: 0x00000000
>                       event_inj: 0x00000000 nested_ept=n guest
>                       cr3: 0x0000000000002000
> kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
>                       ext_inf2: 0x0000000000000000 ext_int: 0x00000000
>                       ext_int_err: 0x00000000
>
> What happened here is an SMI was injected immediately and the handler was
> called at address 0x8000; all is good. Later, an RSM instruction is
> executed in an emulator to return to the nested VM. em_rsm() is called,
> which leads to emulator_leave_smm(). A part of this function calls VMX/SVM
> callback, in this case vmx_leave_smm(). It attempts to set up a pending
> reentry to guest VM by calling nested_vmx_enter_non_root_mode() and sets
> vmx->nested.nested_run_pending to one. Unfortunately, later in
> emulator_leave_smm(), rsm_load_state_64() fails to write invalid EFER to
> the MSR. This results in em_rsm() calling triple_fault callback. At this
> point it's clear that the KVM should call the vmexit, but
> vmx->nested.nested_run_pending is left set to 1. To fix this reset the
> vmx->nested.nested_run_pending flag in triple_fault handler.
>
> TL;DR (courtesy of Yuan Yao)
> Clear nested_run_pending in case of RSM failure on return from L2 SMM.

KVM doesn't emulate SMM for L2.  On an injected SMI, KVM forces a synthetic
nested VM-Exit to get from L2 to L1, and then emulates SMM in the context
of L1.

> The pending VMENTRY to L2 should be cancelled due to such failure leads
> to triple fault which should be injected to L1.
>
> Possible alternative approach:
> While the proposed approach works, the concern is that it would be
> simpler, and more readable to cancel the nested_run_pending in em_rsm().
> This would, however, require introducing new callback e.g,
> post_leave_smm(), that would cancel nested_run_pending in case of a
> failure to resume from SMM.
>
> Additionally, while the proposed code fixes VMX specific issue, SVM also
> might suffer from similar problem as it also uses it's own
> nested_run_pending variable.
>
> Reported-by: Zheyu Ma
> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com

Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")

> Signed-off-by: Michal Wilczynski
> ---
>  arch/x86/kvm/vmx/nested.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..44432e19eea6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  
>  static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
>  	kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +
> +	/* In case of a triple fault, cancel the nested reentry. This may occur

	/*
	 * Multi-line comments should look like this.  Blah blah blab blah blah
	 * blah blah blah blah.
	 */

> +	 * when the RSM instruction fails while attempting to restore the state
> +	 * from SMRAM.
> +	 */
> +	vmx->nested.nested_run_pending = 0;

Argh.  KVM's handling of SMIs while L2 is active is complete garbage.  As
explained by the comment in vmx_enter_smm(), the L2<->SMM transitions should
have a completely custom flow and not piggyback/usurp nested VM-Exit/VM-Entry:

	/*
	 * TODO: Implement custom flows for forcing the vCPU out/in of L2 on
	 * SMI and RSM.  Using the common VM-Exit + VM-Enter routines is wrong
	 * SMI and RSM only modify state that is saved and restored via SMRAM.
	 * E.g. most MSRs are left untouched, but many are modified by VM-Exit
	 * and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
	 */

The Fixes: commit above added a hack on top of the hack.  But it's not
entirely clear from its changelog exactly what was being fixed:

    While RSM induced VM entries are not full VM entries,
    they still need to be followed by actual VM entry to complete it,
    unlike setting the nested state.

    This patch fixes boot of hyperv and SMM enabled
    windows VM running nested on KVM, which fail due
    to this issue combined with lack of dirty bit setting.

My first guess would be event injection, but that shouldn't be relevant to
RSM.  Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized at
instruction boundaries, but except for SMIs (see below), KVM naturally defers
injection until an instruction boundary by virtue of delivering events via
the VMCS/VMCB, i.e. by waiting to deliver events until successfully
re-entering the guest.

Nested VM-Enter is a special snowflake because KVM needs to finish injecting
events from vmcs12 before injecting any synthetic events, i.e.
nested_run_pending ensures that KVM wouldn't clobber/override an L2 event
coming from L1.  In other words, nested_run_pending is much more specific
than just needing to wait for an instruction boundary.

So while the "wait until the CPU is at an instruction boundary" rule applies
to RSM, it's not immediately obvious to me why setting nested_run_pending is
necessary.

And setting nested_run_pending *after* calling
nested_vmx_enter_non_root_mode() is nasty.  nested_vmx_enter_non_root_mode()
and its helpers use nested_run_pending to determine whether or not to enforce
various consistency checks and other behaviors.  And a more minor issue is
that stat.nested_run will be incorrectly incremented.

As a stop gap, something like this patch is not awful, though I would
strongly prefer to be more precise and not clear nested_run_pending on all
triple faults.  We've had KVM bugs where KVM prematurely synthesizes triple
fault on an actual nested VM-Enter, and those would be covered up by this
fix.

But due to nested_run_pending being (unnecessarily) buried in vendor structs,
it might actually be easier to do a cleaner fix.  E.g. add yet another flag
to track that a hardware VM-Enter needs to be completed in order to complete
instruction emulation.

And as alluded to above, there's another bug lurking.  Events that are
*emulated* by KVM must not be emulated until KVM knows the vCPU is at an
instruction boundary.  Specifically, enter_smm() shouldn't be invoked while
KVM is in the middle of instruction emulation (even if "emulation" is just
setting registers and skipping the instruction).  Theoretically, that could
be fixed by honoring the existing at_instruction_boundary flag for SMIs, but
that'd be a rather large change and at_instruction_boundary is nowhere near
accurate enough to use right now.

Anyways, before we do anything, I'd like to get Maxim's input on what
exactly was addressed by 759cbd59674a.