Date: Thu, 15 Feb 2024 10:45:27 -0800
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240215010004.1456078-1-seanjc@google.com> <20240215010004.1456078-2-seanjc@google.com>
Subject: Re: [PATCH 1/2] KVM: x86: Mark target gfn of emulated atomic instruction as dirty
From: Sean Christopherson
To:
David Matlack
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Pasha Tatashin, Michael Krebs

On Thu, Feb 15, 2024, David Matlack wrote:
> On Wed, Feb 14, 2024 at 5:00 PM Sean Christopherson wrote:
> >
> > When emulating an atomic access on behalf of the guest, mark the target
> > gfn dirty if the CMPXCHG by KVM is attempted and doesn't fault.  This
> > fixes a bug where KVM effectively corrupts guest memory during live
> > migration by writing to guest memory without informing userspace that the
> > page is dirty.
> >
> > Marking the page dirty got unintentionally dropped when KVM's emulated
> > CMPXCHG was converted to do a user access.  Before that, KVM explicitly
> > mapped the guest page into kernel memory, and marked the page dirty during
> > the unmap phase.
> >
> > Mark the page dirty even if the CMPXCHG fails, as the old data is written
> > back on failure, i.e. the page is still written.  The value written is
> > guaranteed to be the same because the operation is atomic, but KVM's ABI
> > is that all writes are dirty logged regardless of the value written.  And
> > more importantly, that's what KVM did before the buggy commit.
> >
> > Huge kudos to the folks on the Cc list (and many others), who did all the
> > actual work of triaging and debugging.
> >
> > Fixes: 1c2361f667f3 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
>
> I'm only half serious but... Should we just revert this commit?

No.

> This commit claims that kvm_vcpu_map() is unsafe because it can race
> with mremap(). But there are many other places where KVM uses
> kvm_vcpu_map() (e.g. nested VMX). It seems like KVM is just not
> compatible with mremap() until we address all the users of
> kvm_vcpu_map(). Patching _just_ emulator_cmpxchg_emulated() seems
> silly but maybe I'm missing some context on what led to commit
> 1c2361f667f3 being written.
The short version is that it's a rather trivial vector for userspace to trigger
UAF.  E.g. map non-refcounted memory into a guest and then unmap the memory.  We
tried to fix the nVMX usage, but that proved to be impractical[1].  We haven't
forced the issue because it's not obvious that there's meaningful exposure in
practice, e.g. unless userspace is hiding regular memory from the kernel *and*
oversubscribing VMs, a benign userspace won't be affected.  But at the same time,
we don't have high confidence that the unsafe behavior can't be exploited in
practice.

What I am pushing for now is an off-by-default module param to let userspace
opt-in to unsafe mappings such as these[2].  Because if KVM starts allowing
non-refcounted struct page memory, the ability to exploit these flaws skyrockets.
(Though this reminds me that I need to take another look at whether or not
allowing non-refcounted struct page memory is actually necessary).

[1] https://lore.kernel.org/all/ZBEEQtmtNPaEqU1i@google.com
[2] https://lore.kernel.org/all/20230911021637.1941096-4-stevensd@google.com

> kvm_vcpu_map/unmap() might not be the best interface, but it serves as
> a common choke-point for mapping guest memory to access in KVM. This
> is helpful for avoiding missed dirty logging updates (obviously) and
> will be even more helpful if we add support for freezing guest memory
> and "KVM Userfault" (as discussed in the 1/3 PUCK). I think we all
> agree we should do more of this (common choke points), not less. If
> there's a usecase for mremap()ing guest memory, we should make
> kvm_vcpu_map() play nice with mmu_notifiers.

I agree, but KVM needs to use __try_cmpxchg_user() anyways, when updating guest
A/D bits in FNAME(update_accessed_dirty_bits)().  And that one we *definitely*
don't want to revert; see commit 2a8859f373b0 ("KVM: x86/mmu: do compare-and-exchange
of gPTE via the user address") for details on how broken the previous code was.
In other words, reverting to kvm_vcpu_{un,}map() *probably* isn't wildly unsafe,
but it also doesn't really buy us anything, and long term we have line of sight
to closing the holes for good.  And unlike the nVMX code, where it's reasonable
for KVM to disallow using non-refcounted memory for VMCS pages, disallowing such
memory for emulated atomic accesses is less reasonable.

Rather than revert, to make this more robust in the longer term, we can add a
wrapper in KVM to mark the gfn dirty.  I didn't do it here because I was hustling
to get this minimal fix posted.  E.g.

--
Subject: [PATCH] KVM: x86: Provide a wrapper for __try_cmpxchg_user() to mark
 the gfn dirty

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/mmu/paging_tmpl.h |  4 ++--
 arch/x86/kvm/x86.c             | 25 +++++++++---------------
 arch/x86/kvm/x86.h             | 19 +++++++++++++++++++
 3 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 4d4e98fe4f35..a8123406fe99 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -246,11 +246,11 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
 		if (unlikely(!walker->pte_writable[level - 1]))
 			continue;
 
-		ret = __try_cmpxchg_user(ptep_user, &orig_pte, pte, fault);
+		ret = kvm_try_cmpxchg_user(ptep_user, &orig_pte, pte, fault,
+					   vcpu, table_gfn);
 		if (ret)
 			return ret;
 
-		kvm_vcpu_mark_page_dirty(vcpu, table_gfn);
 		walker->ptes[level - 1] = pte;
 	}
 	return 0;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ec9781d6122..bedb51fbbad3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7946,8 +7946,9 @@ static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
 				   exception, &write_emultor);
 }
 
-#define emulator_try_cmpxchg_user(t, ptr, old, new) \
-	(__try_cmpxchg_user((t __user *)(ptr), (t *)(old), *(t *)(new), efault ## t))
+#define emulator_try_cmpxchg_user(t, ptr, old, new, vcpu, gfn) \
+	(kvm_try_cmpxchg_user((t __user *)(ptr), (t *)(old), *(t *)(new), \
+			      efault ## t, vcpu, gfn))
 
 static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 				     unsigned long addr,
@@ -7960,6 +7961,7 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 	u64 page_line_mask;
 	unsigned long hva;
 	gpa_t gpa;
+	gfn_t gfn;
 	int r;
 
 	/* guests cmpxchg8b have to be emulated atomically */
@@ -7990,18 +7992,19 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 
 	hva += offset_in_page(gpa);
 
+	gfn = gpa_to_gfn(gpa);
 	switch (bytes) {
 	case 1:
-		r = emulator_try_cmpxchg_user(u8, hva, old, new);
+		r = emulator_try_cmpxchg_user(u8, hva, old, new, vcpu, gfn);
 		break;
 	case 2:
-		r = emulator_try_cmpxchg_user(u16, hva, old, new);
+		r = emulator_try_cmpxchg_user(u16, hva, old, new, vcpu, gfn);
 		break;
 	case 4:
-		r = emulator_try_cmpxchg_user(u32, hva, old, new);
+		r = emulator_try_cmpxchg_user(u32, hva, old, new, vcpu, gfn);
 		break;
 	case 8:
-		r = emulator_try_cmpxchg_user(u64, hva, old, new);
+		r = emulator_try_cmpxchg_user(u64, hva, old, new, vcpu, gfn);
 		break;
 	default:
 		BUG();
@@ -8009,16 +8012,6 @@ static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
 
 	if (r < 0)
 		return X86EMUL_UNHANDLEABLE;
-
-	/*
-	 * Mark the page dirty _before_ checking whether or not the CMPXCHG was
-	 * successful, as the old value is written back on failure.  Note, for
-	 * live migration, this is unnecessarily conservative as CMPXCHG writes
-	 * back the original value and the access is atomic, but KVM's ABI is
-	 * that all writes are dirty logged, regardless of the value written.
-	 */
-	kvm_vcpu_mark_page_dirty(vcpu, gpa_to_gfn(gpa));
-
 	if (r)
 		return X86EMUL_CMPXCHG_FAILED;
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..2fabc7cd7e39 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -290,6 +290,25 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 	return !(kvm->arch.disabled_quirks & quirk);
 }
 
+
+
+/*
+ * Mark the page dirty even if the CMPXCHG fails (but didn't fault), as the
+ * old value is written back on failure.  Note, for live migration, this is
+ * unnecessarily conservative as CMPXCHG writes back the original value and the
+ * access is atomic, but KVM's ABI is that all writes are dirty logged,
+ * regardless of the value written.
+ */
+#define kvm_try_cmpxchg_user(ptr, oldp, nval, label, vcpu, gfn)		\
+({									\
+	int ret;							\
+									\
+	ret = __try_cmpxchg_user(ptr, oldp, nval, label);		\
+	if (ret >= 0)							\
+		kvm_vcpu_mark_page_dirty(vcpu, gfn);			\
+	ret;								\
+})
+
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);

base-commit: 6769ea8da8a93ed4630f1ce64df6aafcaabfce64
--