Date: Thu, 2 Nov 2023 10:37:29 -0700
References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com> <6642c379-1023-4716-904f-4bbf076744c2@redhat.com>
Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
From: Sean Christopherson
To: David Matlack
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexander Viro, Christian Brauner, "Matthew Wilcox (Oracle)",
 Andrew Morton, kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun, Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata, Mickaël Salaün, Vlastimil Babka, Vishal Annapurve, Ackerley Tng, Maciej Szmigiero, David Hildenbrand, Quentin Perret, Michael Roth, Wang, Liam Merwick, Isaku Yamahata, "Kirill A . Shutemov"

On Thu, Nov 02, 2023, David Matlack wrote:
> On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson wrote:
> >
> > On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > > On 10/31/23 23:39, David Matlack wrote:
> > > > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > > > using guest_memfd for shared mappings?
> > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed. What's
> > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
> > > > > ensure there are no outstanding references when converting back to private.
> > > > >
> > > > > For TDX/SNP, assuming we don't find a performant and robust way to do in-place
> > > > > conversions, a second fd+offset pair would be needed.
> > > > Is there a way to support non-in-place conversions within a single guest_memfd?
> > >
> > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > > memory. The hook would invalidate now-private parts if they have a VMA,
> > > causing a SIGSEGV/EFAULT if the host touches them.
> > >
> > > It would forbid mappings from multiple gfns to a single offset of the
> > > guest_memfd, because then the shared vs. private attribute would be tied to
> > > the offset. This should not be a problem; for example, in the case of SNP,
> > > the RMP already requires a single mapping from host physical address to
> > > guest physical address.
> >
> > I don't see how this can work. It's not a M:1 scenario (where M is multiple gfns),
> > it's a 1:N scenario (where N is multiple offsets). The *gfn* doesn't change on
> > a conversion, what needs to change to do non-in-place conversion is the pfn, which
> > is effectively the guest_memfd+offset pair.
> >
> > So yes, we *could* support non-in-place conversions within a single guest_memfd,
> > but it would require a second offset,
>
> Why can't KVM free the existing page at guest_memfd+offset and
> allocate a new one when doing non-in-place conversions?

Oh, I see what you're suggesting. Eww.

It's certainly possible, but it would largely defeat the purpose of adding
guest_memfd in the first place.

For TDX and SNP, the goal is to provide a simple, robust mechanism for isolating
guest private memory so that it's all but impossible for the host to access private
memory. As things stand, memory for a given guest_memfd is either private or shared
(assuming we support a second guest_memfd per memslot). I.e. there's no need to
track whether a given page/folio in the guest_memfd is private vs. shared.

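As a rough illustration of the 1:N point above, the sketch below keeps the gfn
fixed and selects one of two fd+offset pairs depending on whether the page is
private or shared. Everything here is hypothetical and for illustration only:
struct backing, struct slot_sketch, and resolve_gfn() are made-up names, not
KVM's memslot layout or API, and a 4KiB page size is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12	/* assumption: 4KiB base pages */

/* Hypothetical: one backing location, i.e. a file descriptor plus byte offset. */
struct backing {
	int fd;
	uint64_t offset;
};

/* Hypothetical slot: one gfn range, two possible backings (the 1:N case). */
struct slot_sketch {
	uint64_t base_gfn;
	uint64_t npages;
	struct backing priv;	/* guest_memfd that only ever holds private memory */
	struct backing shared;	/* second fd+offset pair, only ever shared memory */
};

/*
 * Resolve a gfn to its fd+offset.  The gfn is the same before and after a
 * conversion; only the backing that is selected changes.
 */
static inline bool resolve_gfn(const struct slot_sketch *s, uint64_t gfn,
			       bool is_private, struct backing *out)
{
	const struct backing *b;

	if (gfn < s->base_gfn || gfn - s->base_gfn >= s->npages)
		return false;

	b = is_private ? &s->priv : &s->shared;
	out->fd = b->fd;
	out->offset = b->offset + ((gfn - s->base_gfn) << PAGE_SHIFT_SKETCH);
	return true;
}
```

Because each file is wholly private or wholly shared in this sketch, nothing
needs to be tracked per page inside either file.
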
We could use memory attributes, but that further complicates things when intrahost
migration (and potentially other multi-user scenarios) comes along, i.e. when KVM
supports linking multiple guest_memfd files to a single inode. We'd have to ensure
that all "struct kvm" instances have identical PRIVATE attributes for a given
*offset* in the inode. I'm not even sure how feasible that is for intrahost
migration, and that's the *easy* case, because IIRC it's already a hard requirement
that the source and destination have identical gfn=>guest_memfd bindings, i.e. KVM
can somewhat easily reason about gfn attributes.

But even then, that only helps with the actual migration of the VM, e.g. we'd still
have to figure out how to deal with .mmap() and other shared vs. private actions
when linking a new guest_memfd file against an existing inode.

I haven't seen the pKVM patches for supporting .mmap(), so maybe this is already
a solved problem, but I'd honestly be quite surprised if it all works correctly
if/when KVM supports multiple files per inode.

And I don't see what value non-in-place conversions would add. The value added
by in-place conversions, aside from the obvious preservation of data, which isn't
relevant to TDX/SNP, is that it doesn't require freeing and reallocating memory
to avoid double-allocating for private vs. shared. That's especially nice
when hugepages are being used because reconstituting a hugepage "only" requires
zapping SPTEs.

But if KVM is freeing the private page, it's the same as punching a hole, probably
quite literally, when mapping the gfn as shared. In every way I can think of, it's
worse. E.g. it's more complex for KVM, and the PUNCH_HOLE => allocation operations
must be serialized.

Regarding double-allocating, I really, really think we should solve that in the
guest. I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granularity
and avoid 4KiB conversions.
4KiB conversions aren't just a memory utilization problem, they're also a
performance problem, e.g. they shatter hugepages (which KVM doesn't yet support
recovering) and increase TLB pressure for both stage-1 and stage-2 mappings.
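
To make the 2MiB-granularity idea concrete, here is a minimal guest-side sketch
that checks whether a conversion range would split a 2MiB region and, if so,
widens it to 2MiB boundaries. The names (struct gpa_range, expand_to_2m(),
splits_hugepage()) are hypothetical and not taken from any existing guest code.
Widening is only safe if the guest owns the entire 2MiB region, e.g. because
shared buffers are carved out of a 2MiB-aligned pool; that is an assumption of
the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_2M (UINT64_C(1) << 21)

/* Hypothetical: half-open guest physical address range [start, end). */
struct gpa_range {
	uint64_t start;
	uint64_t end;
};

static inline uint64_t align_down_2m(uint64_t x)
{
	return x & ~(SKETCH_SZ_2M - 1);
}

static inline uint64_t align_up_2m(uint64_t x)
{
	return (x + SKETCH_SZ_2M - 1) & ~(SKETCH_SZ_2M - 1);
}

/* True if converting this range would leave a 2MiB region partially converted. */
static inline bool splits_hugepage(struct gpa_range r)
{
	return (r.start & (SKETCH_SZ_2M - 1)) || (r.end & (SKETCH_SZ_2M - 1));
}

/*
 * Widen a request to whole 2MiB regions.  Assumption: the caller owns every
 * page in the widened range, otherwise unrelated data would be converted too.
 */
static inline struct gpa_range expand_to_2m(struct gpa_range r)
{
	struct gpa_range out = { align_down_2m(r.start), align_up_2m(r.end) };

	return out;
}
```

For example, a single 4KiB request at 0x201000 widens to [0x200000, 0x400000),
so the host sees one hugepage-sized conversion rather than a shattered mapping.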