Date: Mon, 6 May 2024 18:34:53 -0700
In-Reply-To: <20240506.ohwe7eewu0oB@digikod.net>
References: <20240503131910.307630-1-mic@digikod.net>
 <20240503131910.307630-4-mic@digikod.net>
 <20240506.ohwe7eewu0oB@digikod.net>
Subject: Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy
 configuration and violation
From: Sean Christopherson
To: Mickaël Salaün
Cc: Borislav Petkov, Dave Hansen, H. Peter Anvin, Ingo Molnar, Kees Cook,
 Paolo Bonzini, Thomas Gleixner, Vitaly Kuznetsov, Wanpeng Li,
 Rick P Edgecombe, Alexander Graf, Angelina Vu, Anna Trikalinou, Chao Peng,
 Forrest Yuan Yu, James Gowans, James Morris, John Andersen,
 Madhavan T. Venkataraman, Marian Rotariu, Mihai Donțu, Nicușor Cîțu,
 Thara Gopinath, Trilok Soni, Wei Liu, Will Deacon, Yu Zhang, Ștefan Șicleru,
 dev@lists.cloudhypervisor.org, kvm@vger.kernel.org,
 linux-hardening@vger.kernel.org, linux-hyperv@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-security-module@vger.kernel.org,
 qemu-devel@nongnu.org, virtualization@lists.linux-foundation.org,
 x86@kernel.org, xen-devel@lists.xenproject.org

On Mon, May 06, 2024, Mickaël Salaün wrote:
> On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > ---
> > >
> > > Changes since v1:
> > > * New patch. Making user space aware of Heki properties was requested by
> > >   Sean Christopherson.
> >
> > No, I suggested having userspace _control_ the pinning[*], not merely be notified
> > of pinning.
> >
> > : IMO, manipulation of protections, both for memory (this patch) and CPU state
> > : (control registers in the next patch) should come from userspace.  I have no
> > : objection to KVM providing plumbing if necessary, but I think userspace needs
> > : to have full control over the actual state.
> > :
> > : One of the things that caused Intel's control register pinning series to stall
> > : out was how to handle edge cases like kexec() and reboot.  Deferring to userspace
> > : means the kernel doesn't need to define policy, e.g. when to unprotect memory,
> > : and avoids questions like "should userspace be able to overwrite pinned control
> > : registers".
> > :
> > : And like the confidential VM use case, keeping userspace in the loop is a big
> > : benefit, e.g. the guest can't circumvent protections by coercing userspace into
> > : writing to protected memory.
> >
> > I stand by that suggestion, because I don't see a sane way to handle things like
> > kexec() and reboot without having a _much_ more sophisticated policy than would
> > ever be acceptable in KVM.
> >
> > I think that can be done without KVM having any awareness of CR pinning whatsoever.
> > E.g. userspace just needs the ability to intercept CR writes and inject #GPs.  Off
> > the cuff, I suspect the uAPI could look very similar to MSR filtering.  E.g. I bet
> > userspace could enforce MSR pinning without any new KVM uAPI at all.
> >
> > [*] https://lore.kernel.org/all/ZFUyhPuhtMbYdJ76@google.com
>
> OK, I had concern about the control not directly coming from the guest,
> especially in the case of pKVM and confidential computing, but I get your

Hardware-based CoCo is completely out of scope, because KVM has zero visibility
into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
shouldn't intercept CR0/CR4 for SNP guests).

And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM might
help enforce policies that are defined by hardware/firmware, but KVM doesn't
define any of its own.

If pKVM on x86 comes along, then KVM will likely get in the business of defining
policy, but until that happens, KVM needs to stay firmly out of the picture.

> point.  It should indeed be quite similar to the MSR filtering on the
> userspace side, except that we need another interface for the guest to
> request such change (i.e. self-protection).
>
> Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> forward the request to userspace with a VM exit instead?  That would
> also enable userspace to get the request and directly configure the CR
> pinning with the same VM exit.

No?  Maybe?  I strongly suspect that full support will need a richer set of APIs
than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated SMM,
and so on and so forth.  And that's just for CR pinning.

And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
delegation or restriction, i.e. there's no way for the guest to communicate to
the hypervisor that a less privileged component is allowed to perform some
action, nor is there a way for the guest to say some chunk of CPL0 code *isn't*
allowed to request a transition.  Delegation and restriction all have to be done
out-of-band.

It'd probably be more annoying to set up initially, but I think a synthetic
device with an MMIO-based interface would be more powerful and flexible in the
long run.  Then userspace can evolve without needing to wait for KVM to catch up.

Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
Linux already knows what assets it wants to (un)protect and when.  What's missing
is a way for the guest kernel to effectively deprivilege and re-authenticate
itself as needed.  We've been tossing around the idea of paired VMs+vCPUs to
support VTLs and SEV's VMPLs; what if we usurped/piggybacked those ideas, with a
bit of pKVM mixed in?

Borrowing VTL terminology, where VTL0 is the least privileged, userspace launches
the VM at VTL0.  At some point, the guest triggers the deprivileging sequence and
userspace creates VTL1.  Userspace also provides a way for VTL0 to restrict
access to its memory, e.g. to effectively make the page tables for the kernel's
direct map writable only from VTL1, to make kernel text RO (or XO), etc.  And
VTL0 could then also completely remove its access to code that changes CR0/CR4.

It would obviously require a _lot_ more upfront work, e.g. to isolate the kernel
text that modifies CR0/CR4 so that it can be removed from VTL0, but that should
be doable with annotations, e.g. tag relevant functions with __magic or whatever,
throw them in a dedicated section, and then free/protect the section(s) at the
appropriate time.

KVM would likely need to provide the ability to switch VTLs (or whatever they get
called), and host userspace would need to provide a decent amount of the backend
mechanisms and "core" policies, e.g. to manage VTL0 memory, tear down (turn off?)
VTL1 on kexec(), etc.  But everything else could live in the guest kernel itself.
E.g. to have CR pinning play nice with kexec(), toss the relevant kexec() code
into VTL1.  That way VTL1 can verify the kexec() target and tear itself down
before jumping into the new kernel.

This is very off the cuff and hand-wavy, e.g. I don't have much of an idea what
it would take to harden kernel text patching, but keeping the policy in the guest
seems like it'd make everything more tractable than trying to define an ABI
between Linux and a VMM that is rich and flexible enough to support all the
fancy things Linux does (and will do in the future).

Am I crazy?  Or maybe reinventing whatever that McAfee thing was that led to
Intel implementing EPTP switching?
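
To make the annotation idea above a bit more concrete (tag the functions that
modify CR0/CR4 and throw them in a dedicated section), here is a minimal
userspace sketch, assuming GCC or Clang on an ELF target such as Linux.  The
__cr_pin tag, the section name cr_pin_text, and the fake_cr4 variable are all
made up for illustration; none of this is existing kernel or KVM code, and the
step of actually freeing/protecting the section from VTL0 is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical tag (not an existing kernel macro): place a function in a
 * dedicated ELF section.  noinline keeps the compiler from copying the body
 * into callers that live in other sections.
 */
#define __cr_pin __attribute__((section("cr_pin_text"), noinline))

/*
 * ELF linkers (GNU ld, LLD) define __start_<name> and __stop_<name> for any
 * section whose name is a valid C identifier, which gives the section bounds.
 */
extern const char __start_cr_pin_text[];
extern const char __stop_cr_pin_text[];

/* Stand-in for a control register; a real kernel would write CR4 here. */
static uint64_t fake_cr4;

__cr_pin void set_cr4_bits(uint64_t bits)
{
	fake_cr4 |= bits;
}

__cr_pin void clear_cr4_bits(uint64_t bits)
{
	fake_cr4 &= ~bits;
}

/* Deliberately untagged, so it stays in the regular .text section. */
uint64_t read_fake_cr4(void)
{
	return fake_cr4;
}

/* True if addr lies inside the dedicated section. */
bool in_cr_pin_section(uintptr_t addr)
{
	return addr >= (uintptr_t)__start_cr_pin_text &&
	       addr <  (uintptr_t)__stop_cr_pin_text;
}

/* Sanity check: every tagged function landed in the dedicated section. */
void cr_pin_selfcheck(void)
{
	assert(in_cr_pin_section((uintptr_t)set_cr4_bits));
	assert(in_cr_pin_section((uintptr_t)clear_cr4_bits));
}
```

With the tagged code gathered between __start_cr_pin_text and
__stop_cr_pin_text, whatever mechanism ends up removing VTL0's access would have
a single contiguous range to operate on.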