Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp1289294rwd; Wed, 31 May 2023 11:49:14 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ7zPQpgT4lyauqGRUmFjAe59NrEn3sNHiMsHoRgTa6ttZZ0PGDn+jcQl/EafcgoRedopejv X-Received: by 2002:aa7:88d4:0:b0:64f:35c8:8584 with SMTP id k20-20020aa788d4000000b0064f35c88584mr8132124pff.18.1685558954473; Wed, 31 May 2023 11:49:14 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1685558954; cv=none; d=google.com; s=arc-20160816; b=WxbqW+GAAH7n7qcT/4YI/TUhfN50kqYIfp0Hc/qkri/3o1scDkCeMqOZFWftFF6QBa ynsB/k5hGbMWuAx4EtlzhPPCcY7S5X5yWv/FWORS9x+Qmf5LPhErIOQj0uigoyR1M28Z m6B5glUsdJoYPttApt201fM7XrqoCPwpV1PWToZl9xStsfLD7hS4UTPm9/5xTOCLwHGW 5x9EwkWE38AmsrO7VfCBqMEpXwOsWO3qCKp2pqxogZq3Bk+edKkf0Ry9xf87VUi96sdc KAPZw3XFOPJ3VDQ2jg1Po+cVoVa627ScWd6fgP5EWnLSzWb7+/mg/5gPvBuA9qL3nKoR sYTQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=vSkNdiMP7jY6tlUh6I5KPB+Z+BpvSyNnxaiAhEVXtE0=; b=uLeWewsOHylZz2zuYc6L9Bt0/xuuxtmxclMRHFdxeb0byk5hMxFe0QOKPKNjgpGAoY VaYnNvHYeQMUQciqrtJzZUn8Pd5o/2StUiwwjeb4b3iDHVtzFuwzmW7JtxAaUTszW0CP oDdb7ZeePtNvzicz72On3AvqOTtngtItICoJOhsq1xziSZ4xX8I+k0y/B7BuEJhanrPL y3k4ch9CK2GcJMhKys+W33Gy7HrXNuUjw5OYoHjS1/3DVFEmUPGKhV0zdvfz6g3Ng+tW SJM3fCXG9iWmPJmmZzqLvq/ku12eUKPBtiN2wVXQbxzBxVNmGDJP6zWXtZEBt1mod5Mk Cimg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b="EpHT/8ZR"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id l2-20020a622502000000b006466813eba9si4051365pfl.47.2023.05.31.11.49.00; Wed, 31 May 2023 11:49:14 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b="EpHT/8ZR"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229899AbjEaSIe (ORCPT + 99 others); Wed, 31 May 2023 14:08:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49984 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229491AbjEaSIc (ORCPT ); Wed, 31 May 2023 14:08:32 -0400 Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 26A7898 for ; Wed, 31 May 2023 11:08:30 -0700 (PDT) Received: by mail-pl1-x62f.google.com with SMTP id d9443c01a7336-1b025aaeddbso14225ad.1 for ; Wed, 31 May 2023 11:08:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1685556509; x=1688148509; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=vSkNdiMP7jY6tlUh6I5KPB+Z+BpvSyNnxaiAhEVXtE0=; b=EpHT/8ZRve1Wq26t+TAKx7SJQNT+fVySv2OMKjZYQ2b+in9M3Hk7qwSrnDtgSaSAdA BTw0q6kgnioUg/lB6FVO3YhgBjF+Z79W8RhGrcwzlDC8iQOw3SMWDxXNwvgqYDLD9P1B CBr+ezEAZy09kHGYPWtTSyMGlxosRk18+QdCs3QnlQ7X9CifFrugJEQIhYDiH1ZwvsTe SATBHTtIkDlFDDjxhjOnFIrn9OCTTpKlVMDP1Sa3qo1PISxR2g1ly4R1vZIFthzRQ95/ m4QEEl1mjkFjqsHHO+ieTqeF0XF2RzwhuieNLRQrsSIgoC+jGzcnTq272IfcrkINunLy v8Bg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685556509; x=1688148509; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=vSkNdiMP7jY6tlUh6I5KPB+Z+BpvSyNnxaiAhEVXtE0=; b=NSRf2dnzLHXaSeQanJl+MAt+MT8gnCGd4YzaZdD9JXpbPRTIoQqA7j5cOryJEPEX3J X7WjueRnzrnh0BAlZTJ7KLuyP5sWD0722CUBm6vlbpKW8NmDZzywjFgfhKPJ+3mCCo0c 2fU2rV/Hc8BmyuKVDHQmNeTAuVybm2ckoEmyunfakRmFcHzYnF+ZtFBySiICqh1ycEq9 B/lE/OvjpCx5tyxQi2XhLGJyr3ieEXrpcOq9S7nD20YnC9eYlvxExnoMpJ1JnZUlhFY6 9vBH2VedcSzeChNICkbmOhwN5aN36V9rIz755NkN0LWKmQ4WfswpENuP4W3KKNQh9lhF aMuQ== X-Gm-Message-State: AC+VfDzgw3U/vULBqQsEUInfp1Zq+NrJ8Nd/WUbEEzANIAfrdEvYNu1w MdJScM4ge/U1sxBi037bn+lfpbj3yMbr/k7yZYLR8A== X-Received: by 2002:a17:902:d141:b0:1b0:26c0:757d with SMTP id t1-20020a170902d14100b001b026c0757dmr8147plt.22.1685556509329; Wed, 31 May 2023 11:08:29 -0700 (PDT) MIME-Version: 1.0 References: <20230531144128.73814-1-jon@nutanix.com> <9d37ab07-97c6-5245-6939-9c1090b4b3a9@redhat.com> <3cb96907-0d58-ba60-ed0e-6c17c69bd0f8@redhat.com> In-Reply-To: <3cb96907-0d58-ba60-ed0e-6c17c69bd0f8@redhat.com> From: Jim Mattson Date: Wed, 31 May 2023 11:08:17 -0700 Message-ID: Subject: Re: [PATCH v4] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS To: Waiman Long Cc: Jon Kohler , Sean Christopherson , Paolo Bonzini , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "x86@kernel.org" , "H. Peter Anvin" , Josh Poimboeuf , Kees Cook , Andrea Arcangeli , "kvm@vger.kernel.org" , "linux-kernel@vger.kernel.org" , Josh Poimboeuf Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED,USER_IN_DEF_DKIM_WL, USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 31, 2023 at 10:22=E2=80=AFAM Waiman Long w= rote: > > On 5/31/23 13:13, Jon Kohler wrote: > > > >> On May 31, 2023, at 1:02 PM, Waiman Long wrote: > >> > >> On 5/31/23 10:41, Jon Kohler wrote: > >>> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on > >>> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS= ) > >>> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in > >>> the MSR bitmap. Note: this logic is only for eIBRS, as Intel's guidan= ce > >>> has long been that eIBRS only needs to be set once, so most guests wi= th > >>> eIBRS awareness should behave nicely. We would not want to accidental= ly > >>> regress misbehaving guests on pre-eIBRS systems, who might be spammin= g > >>> IBRS MSR without the hypervisor being able to see it today. > >>> > >>> eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR > >>> once or twice per vCPU on boot, so it is far better to take those > >>> VM exits on boot than having to read and save this msr on every > >>> single VM exit forever. This outcome was suggested on Andrea's commit > >>> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=3Dprc= tl spectre_v2_user=3Dprctl") > >>> however, since interception is still unilaterally disabled, the rdmsr > >>> tax is still there even after that commit. > >>> > >>> This is a significant win for eIBRS enabled systems as this rdmsr > >>> accounts for roughly ~50% of time for vmx_vcpu_run() as observed > >>> by perf top disassembly, and is in the critical path for all > >>> VM-Exits, including fastpath exits. > >>> > >>> Opportunistically update comments for both MSR_IA32_SPEC_CTRL and > >>> MSR_IA32_PRED_CMD to make it clear how L1 vs L2 handling works. > >>> > >>> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disabl= e=3Dprctl spectre_v2_user=3Dprctl") > >>> Signed-off-by: Jon Kohler > >>> Cc: Sean Christopherson > >>> Cc: Andrea Arcangeli > >>> Cc: Kees Cook > >>> Cc: Josh Poimboeuf > >>> Cc: Waiman Long > >>> --- > >>> v1 > >>> - https://urldefense.proofpoint.com/v2/url?u=3Dhttps-3A__lore.kerne= l.org_all_20220512174427.3608-2D1-2Djon-40nutanix.com_&d=3DDwICaQ&c=3Ds883G= pUCOChKOHiocYtGcg&r=3DNGPRGGo37mQiSXgHKm5rCQ&m=3DPT1QjB8Lk_a3baDOwHBfedQG67= HsVDmOdmcWHlr5PrT8WyuS9e6PfHF5JxLxD0zw&s=3DjNnloZQgh0KG-n36uwVC0dJTmokvqsQd= YQCWYI8hVvM&e=3D v1 -> v2: > >>> - https://urldefense.proofpoint.com/v2/url?u=3Dhttps-3A__lore.kerne= l.org_all_20220520195303.58692-2D1-2Djon-40nutanix.com_&d=3DDwICaQ&c=3Ds883= GpUCOChKOHiocYtGcg&r=3DNGPRGGo37mQiSXgHKm5rCQ&m=3DPT1QjB8Lk_a3baDOwHBfedQG6= 7HsVDmOdmcWHlr5PrT8WyuS9e6PfHF5JxLxD0zw&s=3DRwi5NoHwaezlmzzLiGGCuI6QHuGQZ1B= VK2hs6-SZvzU&e=3D - Addressed comments on approach from Sean. > >>> v2 -> v3: > >>> - https://urldefense.proofpoint.com/v2/url?u=3Dhttps-3A__lore.kerne= l.org_kvm_20220520204115.67580-2D1-2Djon-40nutanix.com_&d=3DDwICaQ&c=3Ds883= GpUCOChKOHiocYtGcg&r=3DNGPRGGo37mQiSXgHKm5rCQ&m=3DPT1QjB8Lk_a3baDOwHBfedQG6= 7HsVDmOdmcWHlr5PrT8WyuS9e6PfHF5JxLxD0zw&s=3DR2Ykxdv-DyeVGLWd8_pLpu43zEsnWzp= yvvBPEZ9lz-Y&e=3D - Addressed comments on approach from Sean. > >>> v3 -> v4: > >>> - Fixed inline code comments from Sean. > >>> > >>> arch/x86/kvm/vmx/vmx.c | 35 ++++++++++++++++++++++++----------- > >>> 1 file changed, 24 insertions(+), 11 deletions(-) > >>> > >>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > >>> index 44fb619803b8..5e643ac897bc 100644 > >>> --- a/arch/x86/kvm/vmx/vmx.c > >>> +++ b/arch/x86/kvm/vmx/vmx.c > >>> @@ -2260,20 +2260,33 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu,= struct msr_data *msr_info) > >>> return 1; > >>> > >>> vmx->spec_ctrl =3D data; > >>> - if (!data) > >>> + > >>> + /* > >>> + * Disable interception on the first non-zero write, exce= pt if > >>> + * eIBRS is advertised to the guest and the guest is enab= ling > >>> + * _only_ IBRS. On eIBRS systems, kernels typically set = IBRS > >>> + * once at boot and never touch it post-boot. All other = bits, > >>> + * and IBRS on non-eIBRS systems, are often set on a per-= task > >>> + * basis, i.e. change frequently, so the benefit of avoid= ing > >>> + * VM-exits during guest context switches outweighs the c= ost of > >>> + * RDMSR on every VM-Exit to save the guest's value. > >>> + */ > >>> + if (!data || > >>> + (data =3D=3D SPEC_CTRL_IBRS && > >>> + (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL))) > >>> break; > >>> > >>> /* > >>> - * For non-nested: > >>> - * When it's written (to non-zero) for the first time, pa= ss > >>> - * it through. > >>> - * > >>> - * For nested: > >>> - * The handling of the MSR bitmap for L2 guests is done i= n > >>> - * nested_vmx_prepare_msr_bitmap. We should not touch the > >>> - * vmcs02.msr_bitmap here since it gets completely overwr= itten > >>> - * in the merging. We update the vmcs01 here for L1 as we= ll > >>> - * since it will end up touching the MSR anyway now. > >>> + * Update vmcs01.msr_bitmap even if L2 is active, i.e. di= sable > >>> + * interception for the vCPU on the first write regardles= s of > >>> + * whether the WRMSR came from L1 or L2. vmcs02's bitmap= is a > >>> + * combination of vmcs01 and vmcs12 bitmaps, and will be > >>> + * recomputed by nested_vmx_prepare_msr_bitmap() on the n= ext > >>> + * nested VM-Enter. Note, this does mean that future WRM= SRs > >>> + * from L2 will be intercepted until the next nested VM-E= xit if > >>> + * L2 was the first to write, but L1 exposing the MSR to = L2 > >>> + * without first writing it is unlikely and not worth the > >>> + * extra bit of complexity. > >>> */ > >>> vmx_disable_intercept_for_msr(vcpu, > >>> MSR_IA32_SPEC_CTRL, > >> I have 2 comments. > >> > >> 1) Besides the IBRS, STIBP & SSBD bits, the SPEC_CTRL MSR may have the= RRSBA_DIS_S bit set in the future. I am not aware of any current Intel pro= cessors having this capability yet, but a future Intel processor may have t= his and the above patch will have to be modified accordingly. It looks like= that the RRSBA_DIS_S bit will be set once. > > Agreed. Once that becomes pubic with future processors, this code can b= e fixed up in a fairly trivial manner. I don=E2=80=99t have any access to s= aid future processors, so I=E2=80=99d like to keep it as-is today rather th= an project it out too far. Is that ok? > That is certainly OK. I am just raising a question here. How difficult would it be to do a back of the envelope cost/benefit analysis, rather than relying on heuristics based on today's typical guest behavior? Say that it's a minimum of 1000 cycles to intercept this WRMSR. The tradeoff is the cost of a RDMSR on every VM-exit. How long does a RDMSR take these days? On the order of 50 cycles? So, if the guest consistently writes IA32_SPEC_CTRL more often than once every 20 VM-exits, it's better not to intercept it. Most of the bookkeeping could be handled in the WRMSR(IA32_SPEC_CTRL) intercept. The only overhead we'd incur on every VM-exit would be the cost of incrementing a VM-exit counter. It's a bit more complicated, but it directly addresses the issue, and it's more future-proof.