Received: by 2002:a05:6358:11c7:b0:104:8066:f915 with SMTP id i7csp1818298rwl; Thu, 30 Mar 2023 01:48:24 -0700 (PDT) X-Google-Smtp-Source: AKy350YgZ6TW1+FBPLGmQyw8gQnWhXA2EavOeIsHSPgrQJeU6fSmmQaOKGJ5hgS+ZU79XQpjllNu X-Received: by 2002:a17:906:519d:b0:939:5398:768e with SMTP id y29-20020a170906519d00b009395398768emr20326235ejk.7.1680166104647; Thu, 30 Mar 2023 01:48:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680166104; cv=none; d=google.com; s=arc-20160816; b=b2P+WwmJQriZKhgB7yrWUmyDOXldu8I/W0jABztAMrRQ14ZGhRL4mOn6Zjx7rNLKBQ BsuvB273eIR1h4Sf3lLzz+FFvdJZdq1bKfdLaD86chRceXfmZrz5GRq7MEWHwQ5tUZNb 8XUcc7NMDDJ5DYFyM9Da50p/2dSaXDs+j9ihKeHjPFjbk1ikH7dTIypMSqRA41riz944 gxLLWAp7cOi3AqMBY4JofQBNQmLuhPF6YM90N5WI6P+mV7FbD9t5aiIdAk4nyBLoHTNt MPl//VTk/mFDKQKsZ0s9zHAAR9okbnA1/uY69qzWFst3rdGHftos5CiASQnjFZ7yJ6R3 RyPQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=vlpuagoHdNHyoDKKcfJ1S9LVllL6EOMCzIQgoSyOcDw=; b=dza8KVSxYvmegiRYySJTXTEC03MF1HnT5oo0n0Ukj8mkMv/8exSNud0BoMCu99IBBw mg/LZdkPdmhwRCh8lct/z9Go5SnQpuYkw7u3KK/LMl9Zf4auLzsskyewZ0zxzDdvw7CH TAKk/6Y3vBzOpQsQTLefSrLJ2GQ890zrqgRc5/kVCvqJ8tB4RO8JZSHCjALlsK5aNYe/ /bFJ47w65ZxmeaN7xQfRBLkdAS7Cfdc6r0mXADuv+5uj7aJR/WQnUfqPUgJUSjOlxP9N FZy5qGzuCeMMmYjGBmpLBfO5MlJLyyAG3cmpw9r03B/m4HofREgtUH1IuYuwSAVqsd8G TgJQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@grsecurity.net header.s=grsec header.b=Tx7bJzrQ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=grsecurity.net Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id d11-20020a1709067f0b00b0093026fb89e0si17500132ejr.545.2023.03.30.01.47.57; Thu, 30 Mar 2023 01:48:24 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@grsecurity.net header.s=grsec header.b=Tx7bJzrQ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=grsecurity.net Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230110AbjC3Ipu (ORCPT + 99 others); Thu, 30 Mar 2023 04:45:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39542 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230165AbjC3Ipl (ORCPT ); Thu, 30 Mar 2023 04:45:41 -0400 Received: from mail-ed1-x52e.google.com (mail-ed1-x52e.google.com [IPv6:2a00:1450:4864:20::52e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E37D27D84 for ; Thu, 30 Mar 2023 01:45:16 -0700 (PDT) Received: by mail-ed1-x52e.google.com with SMTP id ek18so73557467edb.6 for ; Thu, 30 Mar 2023 01:45:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=grsecurity.net; s=grsec; t=1680165914; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=vlpuagoHdNHyoDKKcfJ1S9LVllL6EOMCzIQgoSyOcDw=; b=Tx7bJzrQxgZuGXRYSDpUU6tW0DHmt2iTn6ML51phHbdXfKxP8AaIs4F6Wu0CMU2E8k TAyetfg9DZEHcSJF6/mUGYM4lpnqW9Z6jKPTkWhlX+u2FNpgbVSzV1c5UwASf+i/ucID qpQd1ptvMyraYyazc1t299COIfXj86PJWnqWTiApvUTNAivQL9jSLNAbcn4+YwCYaSeD cPls0ICIaZVH7QNEN3UHPJDrPSkJ++IMf6xeCyroBPSfDQOpCFs+DfwX2EHT2qJhadNH N1PrrLkl88YEEu7OCgbPZSXxK/lHmt2qyol14Smh0MFP9mxNWPtW48jJoEasrsNXYXTV /21w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680165914; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=vlpuagoHdNHyoDKKcfJ1S9LVllL6EOMCzIQgoSyOcDw=; b=pioF2DH4eeykWNATJKrKqkDRufvt+bmPJBrjkPbNujZZstxS6/auINiX/iFFSWKBs4 1DE4ZTADxguYcfFEXuGH/mHOsCojOOtT3TaNRwWn5DpSxgP53ly6/WR8pAJiUShur041 BHMKJCVbdaGT+PIWup0/JXlm5aBKopgF4E1psAXLz9uBU6brG0lIN0p016tVardCbJws 1zJRdDa6EBwJwWlAn1/eW/JRAwe5LL3SotFP7dp9deSi5Z8dlWchzFiLojCr/VEGk1Mz oB3XuxBZYEtQGssjODMeTbtG9PUW/aiVy3PJUm3H96s3ltgs3FO9plt4pYI3vb+RLe+X eyhA== X-Gm-Message-State: AAQBX9cXQvQZoRqRHpoCxaa72Szgf4o/RvkRtif/aU2e6FzZsH5P4LI5 Z9hxSMrJs/ikVmrMPG0XAJGGUw== X-Received: by 2002:a17:907:6d0b:b0:930:9cec:2fb9 with SMTP id sa11-20020a1709076d0b00b009309cec2fb9mr25165267ejc.77.1680165913889; Thu, 30 Mar 2023 01:45:13 -0700 (PDT) Received: from ?IPV6:2003:f6:af13:8b00:c684:f698:a009:dbf2? (p200300f6af138b00c684f698a009dbf2.dip0.t-ipconnect.de. [2003:f6:af13:8b00:c684:f698:a009:dbf2]) by smtp.gmail.com with ESMTPSA id qh7-20020a170906eca700b0092bea699124sm17411276ejb.106.2023.03.30.01.45.13 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 30 Mar 2023 01:45:13 -0700 (PDT) Message-ID: <677169b4-051f-fcae-756b-9a3e1bb9f8fe@grsecurity.net> Date: Thu, 30 Mar 2023 10:45:12 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0 Subject: Re: [PATCH v4 6/6] KVM: VMX: Make CR0.WP a guest owned bit Content-Language: en-US, de-DE To: kvm@vger.kernel.org, Sean Christopherson , Paolo Bonzini Cc: linux-kernel@vger.kernel.org References: <20230322013731.102955-1-minipli@grsecurity.net> <20230322013731.102955-7-minipli@grsecurity.net> From: Mathias Krause In-Reply-To: <20230322013731.102955-7-minipli@grsecurity.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=0.1 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,NICE_REPLY_A, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 22.03.23 02:37, Mathias Krause wrote: > Guests like grsecurity that make heavy use of CR0.WP to implement kernel > level W^X will suffer from the implied VMEXITs. > > With EPT there is no need to intercept a guest change of CR0.WP, so > simply make it a guest owned bit if we can do so. > > This implies that a read of a guest's CR0.WP bit might need a VMREAD. > However, the only potentially affected user seems to be kvm_init_mmu() > which is a heavy operation to begin with. But also most callers already > cache the full value of CR0 anyway, so no additional VMREAD is needed. > The only exception is nested_vmx_load_cr3(). > > This change is VMX-specific, as SVM has no such fine grained control > register intercept control. Just a heads up! We did more tests, especially with the backports we did internally already, and ran into a bug when running a nested guest on an ESXi host. Setup is like: ESXi (L0) -> Linux (L1) -> Linux (L2) The Linux system, especially the kernel, is the same for L1 and L2. It's a grsecurity kernel, so makes use of toggling CR0.WP at runtime. The bug we see is that when L2 disables CR0.WP and tries to write to an r/o memory region (implicitly to the r/o GDT via LTR in our use case), this triggers a fault (EPT violation?) that gets ignored by L1, as, apparently, everything is fine from its point of view. I suspect the L0 VMM to be at fault here, as the VMCS structures look good, IMO. Here is a dump of vmx->loaded_vmcs in handle_triple_fault(): […] VMX: TRIPLE FAULT! […] VMCS ffff8883a9f18000, last attempted VM-entry on CPU 8 […] *** Guest State *** […] CR0: actual=0x0000000080040033, shadow=0x0000000080050033, gh_mask=fffffffffffefff7 CR0 in the L2 VM has CR0.WP disabled. However, it is set in the shadow CR0 but masked out via CR0_MASK, so should be read as the guest's value, according to the SDM. I also tried masking the shadow CR0 value in vmx_set_cr0(), but that makes no difference. […] CR4: actual=0x00000000003220f0, shadow=0x00000000003208b0, gh_mask=fffffffffffff871 […] CR3 = 0x0000000002684000 […] PDPTR0 = 0x0000000007d39001 PDPTR1 = 0x00000000033b5001 […] PDPTR2 = 0x000000000238c001 PDPTR3 = 0x0000000001c54001 […] RSP = 0xfffffe8040087f50 RIP = 0xffffffff8105d435 […] RFLAGS=0x00010006 DR7 = 0x0000000000000400 […] Sysenter RSP=0000000000000000 CS:RIP=0000:0000000000000000 […] CS: sel=0x0010, attr=0x0a09b, limit=0xffffffff, base=0x0000000000000000 […] DS: sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000 […] SS: sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000 […] ES: sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000 […] FS: sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0x0000000000000000 […] GS: sel=0x0000, attr=0x1c001, limit=0xffffffff, base=0xffff888232e00000 […] GDTR: limit=0x00000097, base=0xfffffe0000201000 […] LDTR: sel=0x0000, attr=0x00082, limit=0x0000ffff, base=0x0000000000000000 […] IDTR: limit=0x000001ff, base=0xffffffff84004000 […] TR: sel=0x0000, attr=0x0008b, limit=0x0000ffff, base=0x0000000000000000 […] EFER= 0x0000000000000d01 (effective) […] PAT = 0x0007040600070406 […] DebugCtl = 0x0000000000000000 DebugExceptions = 0x0000000000000000 […] Interruptibility = 00000000 ActivityState = 00000000 […] *** Host State *** […] RIP = 0xffffffff86d28db6 RSP = 0xfffffe8040927d50 […] CS=0010 SS=0018 DS=0000 ES=0000 FS=0000 GS=0000 TR=0040 […] FSBase=0000639563fce700 GSBase=ffff88881a800000 TRBase=fffffe0001003000 […] GDTBase=fffffe0001001000 IDTBase=fffffe0000000000 […] CR0=0000000080050033 CR3=00000000026a0004 CR4=00000000007626f0 The "host" (which is our L1 VMM, I guess) has CR0.WP enabled and that is what I think confuses ESXi to enforce the read-only property to the L2 guest as well -- for unknown reasons so far. […] Sysenter RSP=fffffe0001003000 CS:RIP=0010:ffffffff810031a0 […] PAT = 0x0407050600070106 […] *** Control State *** […] PinBased=0000003f CPUBased=b5a06dfa SecondaryExec=001034ee […] EntryControls=000053ff ExitControls=000befff […] ExceptionBitmap=00060042 PFECmask=00000000 PFECmatch=00000000 […] VMEntry: intr_info=00000000 errcode=00000003 ilen=00000000 […] VMExit: intr_info=00000000 errcode=00000000 ilen=00000000 […] reason=00000002 qualification=0000000000000000 […] IDTVectoring: info=00000000 errcode=00000000 […] TSC Offset = 0xffffad7a1057f4cc […] TPR Threshold = 0x00 […] virt-APIC addr = 0x000000015716b000 […] EPT pointer = 0x0000000583c1e05e […] PLE Gap=00000080 Window=00001000 […] Virtual processor ID = 0x0002 I tried to reproduce the bug with different KVM based L0 VMMs (with and without this series; vanilla and grsecurity kernels) but no luck. That's why I'm suspecting a ESXi bug. I'm leaning to make CR0.WP guest owned only iff we're running on bare metal or the VMM is KVM to not play whack-a-mole for all the VMMs that might have similar bugs. (Will try to test a few others here as well.) However, that would prevent them from getting the performance gain, so I'd rather have this fixed / worked around in KVM instead. Any ideas how to investigate this further? Thanks, Mathias PS: ...should have left the chicken bit of v3 to be able to disable the feature by a module parameter ;) > > Suggested-and-co-developed-by: Sean Christopherson > Signed-off-by: Mathias Krause > --- > arch/x86/kvm/kvm_cache_regs.h | 2 +- > arch/x86/kvm/vmx/nested.c | 4 ++-- > arch/x86/kvm/vmx/vmx.c | 2 +- > arch/x86/kvm/vmx/vmx.h | 18 ++++++++++++++++++ > 4 files changed, 22 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h > index 4c91f626c058..e50d353b5c1c 100644 > --- a/arch/x86/kvm/kvm_cache_regs.h > +++ b/arch/x86/kvm/kvm_cache_regs.h > @@ -4,7 +4,7 @@ > > #include > > -#define KVM_POSSIBLE_CR0_GUEST_BITS X86_CR0_TS > +#define KVM_POSSIBLE_CR0_GUEST_BITS (X86_CR0_TS | X86_CR0_WP) > #define KVM_POSSIBLE_CR4_GUEST_BITS \ > (X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \ > | X86_CR4_OSXMMEXCPT | X86_CR4_PGE | X86_CR4_TSD | X86_CR4_FSGSBASE) > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c > index f63b28f46a71..61d940fc91ba 100644 > --- a/arch/x86/kvm/vmx/nested.c > +++ b/arch/x86/kvm/vmx/nested.c > @@ -4481,7 +4481,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, > * CR0_GUEST_HOST_MASK is already set in the original vmcs01 > * (KVM doesn't change it); > */ > - vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS; > + vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); > vmx_set_cr0(vcpu, vmcs12->host_cr0); > > /* Same as above - no reason to call set_cr4_guest_host_mask(). */ > @@ -4632,7 +4632,7 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu) > */ > vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx)); > > - vcpu->arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS; > + vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); > vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW)); > > vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK); > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 8fc1a0c7856f..e501f6864a72 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -4790,7 +4790,7 @@ static void init_vmcs(struct vcpu_vmx *vmx) > /* 22.2.1, 20.8.1 */ > vm_entry_controls_set(vmx, vmx_vmentry_ctrl()); > > - vmx->vcpu.arch.cr0_guest_owned_bits = KVM_POSSIBLE_CR0_GUEST_BITS; > + vmx->vcpu.arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); > vmcs_writel(CR0_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr0_guest_owned_bits); > > set_cr4_guest_host_mask(vmx); > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h > index 2acdc54bc34b..423e9d3c9c40 100644 > --- a/arch/x86/kvm/vmx/vmx.h > +++ b/arch/x86/kvm/vmx/vmx.h > @@ -640,6 +640,24 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64) > (1 << VCPU_EXREG_EXIT_INFO_1) | \ > (1 << VCPU_EXREG_EXIT_INFO_2)) > > +static inline unsigned long vmx_l1_guest_owned_cr0_bits(void) > +{ > + unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS; > + > + /* > + * CR0.WP needs to be intercepted when KVM is shadowing legacy paging > + * in order to construct shadow PTEs with the correct protections. > + * Note! CR0.WP technically can be passed through to the guest if > + * paging is disabled, but checking CR0.PG would generate a cyclical > + * dependency of sorts due to forcing the caller to ensure CR0 holds > + * the correct value prior to determining which CR0 bits can be owned > + * by L1. Keep it simple and limit the optimization to EPT. > + */ > + if (!enable_ept) > + bits &= ~X86_CR0_WP; > + return bits; > +} > + > static __always_inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm) > { > return container_of(kvm, struct kvm_vmx, kvm);