From: Fuad Tabba
Date: Mon, 31 Jul 2023 14:46:50 +0100
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
To: Sean Christopherson
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
    Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
    "Matthew Wilcox (Oracle)", Andrew Morton, Paul Moore, James Morris,
    "Serge E. Hallyn", kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
    linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
    linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-security-module@vger.kernel.org,
    linux-kernel@vger.kernel.org, Chao Peng, Jarkko Sakkinen, Yu Zhang,
    Vishal Annapurve, Ackerley Tng, Maciej Szmigiero, Vlastimil Babka,
    David Hildenbrand, Quentin Perret, Michael Roth, Wang, Liam Merwick,
    Isaku Yamahata, "Kirill A. Shutemov"

Hi Sean,

On Thu, Jul 27, 2023 at 6:13 PM Sean Christopherson wrote:
>
> On Thu, Jul 27, 2023, Fuad Tabba wrote:
> > Hi Sean,
> >
> > ...
> >
> > > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp,
> > >          case KVM_GET_STATS_FD:
> > >                  r = kvm_vm_ioctl_get_stats_fd(kvm);
> > >                  break;
> > > +        case KVM_CREATE_GUEST_MEMFD: {
> > > +                struct kvm_create_guest_memfd guest_memfd;
> > > +
> > > +                r = -EFAULT;
> > > +                if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
> > > +                        goto out;
> > > +
> > > +                r = kvm_gmem_create(kvm, &guest_memfd);
> > > +                break;
> > > +        }
> >
> > I'm thinking line of sight here: by having this as a vm ioctl (rather
> > than a system ioctl), would it complicate making it possible in the
> > future to share/donate memory between VMs?
>
> Maybe, but I hope not?
>
> There would still be a primary owner of the memory, i.e. the memory
> would still need to be allocated in the context of a specific VM. And
> the primary owner should be able to restrict privileges, e.g. allow a
> different VM to read but not write memory.
>
> My current thinking is to (a) tie the lifetime of the backing pages to
> the inode, i.e. allow allocations to outlive the original VM, and (b)
> create a new file each time memory is shared/donated with a different
> VM (or other entity in the kernel).
>
> That should make it fairly straightforward to provide different
> permissions, e.g. track them per-file, and I think should also avoid
> the need to change the memslot binding logic since each VM would have
> its own view/bindings.
>
> Copy+pasting a relevant snippet from a lengthier response in a
> different thread[*]:
>
>   Conceptually, I think KVM should bind to the file. The inode is
>   effectively the raw underlying physical storage, while the file is
>   the VM's view of that storage.

I'm not aware of any existing implementation of sharing memory between
VMs in KVM (afaik, there hasn't been a need for one). The following is
me thinking out loud rather than any strong opinion on my part.

If an allocation can outlive the original VM, then why associate it
with that (or a) VM to begin with? Wouldn't it be more flexible if it
were a system-level construct, which is effectively what it was in
previous iterations of this series? That doesn't rule out binding to
the file, and keeping the inode as the underlying physical storage.
The binding of a VM to a guestmem object could then happen implicitly
with KVM_SET_USER_MEMORY_REGION2, as in the sketch above, or we could
have a new ioctl specifically for handling binding.

Cheers,
/fuad

> Practically, I think that gives us a clean, intuitive way to handle
> intra-host migration. Rather than transfer ownership of the file,
> instantiate a new file for the target VM, using the gmem inode from
> the source VM, i.e. create a hard link. That'd probably require new
> uAPI, but I don't think that will be hugely problematic. KVM would
> need to ensure the new VM's guest_memfd can't be mapped until
> KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
> memslots/bindings are identical), but that should be easy enough to
> enforce.
>
> That way, a VM, its memslots, and its SPTEs are tied to the file,
> while allowing the memory and the *contents* of memory to outlive the
> VM, i.e. be effectively transferred to the new target VM. And we'll
> maintain the invariant that each guest_memfd is bound 1:1 with a
> single VM.
>
> As above, that should also help us draw the line between mapping
> memory into a VM (file), and freeing/reclaiming the memory (inode).
>
> There will be extra complexity/overhead as we'll have to play nice
> with the possibility of multiple files per inode, e.g. to zap mappings
> across all files when punching a hole, but the extra complexity is
> quite small, e.g. we can use address_space.private_list to keep track
> of the guest_memfd instances associated with the inode.
>
> Setting aside TDX and SNP for the moment, as it's not clear how
> they'll support memory that is "private" but shared between multiple
> VMs, I think per-VM files would work well for sharing gmem between two
> VMs. E.g. it would allow a given page to be bound to a different gfn
> for each VM, and would allow having different permissions for each
> file (e.g. to allow fallocate() only from the original owner).
>
> [*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com