Date: Wed, 30 Mar 2022 17:58:41 +0000
From: Sean Christopherson
To: Quentin Perret
Cc: Steven Price, Chao Peng, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org, "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton, Mike Rapoport, "Maciej S .
Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang, "Kirill A . Shutemov", luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, maz@kernel.org, will@kernel.org
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> <88620519-029e-342b-0a85-ce2a20eaf41b@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 30, 2022, Quentin Perret wrote:
> On Wednesday 30 Mar 2022 at 09:58:27 (+0100), Steven Price wrote:
> > On 29/03/2022 18:01, Quentin Perret wrote:
> > > Is implicit sharing a thing? E.g., if a guest makes a memory access in
> > > the shared gpa range at an address that doesn't have a backing memslot,
> > > will KVM check whether there is a corresponding private memslot at the
> > > right offset with a hole punched and report a KVM_EXIT_MEMORY_ERROR? Or
> > > would that just generate an MMIO exit as usual?
> >
> > My understanding is that the guest needs some way of tagging whether a
> > page is expected to be shared or private. On the architectures I'm aware
> > of this is done by effectively stealing a bit from the IPA space and
> > pretending it's a flag bit.
>
> Right, and that is in fact the main point of divergence we have I think.
> While I understand this might be necessary for TDX and the likes, this
> makes little sense for pKVM. This would effectively embed into the IPA a
> purely software-defined non-architectural property/protocol although we
> don't actually need to: we (pKVM) can reasonably expect the guest to
> explicitly issue hypercalls to share pages in-place. So I'd be really
> keen to avoid baking in assumptions about that model too deep in the
> host mm bits if at all possible.

There is no assumption about stealing PA bits baked into this API. Even
within x86 KVM, I consider it a hard requirement that the common flows not
assume the private vs. shared information is communicated through the PA.

> > > I'm overall inclined to think that while this abstraction works nicely
> > > for TDX and the likes, it might not suit pKVM all that well in the
> > > current form, but it's close.
> > >
> > > What do you think of extending the model proposed here to also address
> > > the needs of implementations that support in-place sharing? One option
> > > would be to have KVM notify the private-fd backing store when a page is
> > > shared back by a guest, which would then allow host userspace to mmap
> > > that particular page in the private fd instead of punching a hole.
> > >
> > > This should retain the main property you're after: private pages that
> > > are actually mapped in the guest SPTE aren't mmap-able, but all the
> > > others are fair game.
> > >
> > > Thoughts?
> > How do you propose this works if the page shared by the guest then needs
> > to be made private again? If there's no hole punched then it's not
> > possible to just repopulate the private-fd. I'm struggling to see how
> > that could work.
>
> Yes, some discussion might be required, but I was thinking about
> something along those lines:
>
> - a guest requests a shared->private page conversion;
>
> - the conversion request is routed all the way back to the VMM;
>
> - the VMM is expected to either decline the conversion (which may be
>   fatal for the guest if it can't handle this), or to tear-down its
>   mappings (via munmap()) of the shared page, and accept the
>   conversion;
>
> - upon return from the VMM, KVM will be expected to check how many
>   references to the shared page are still held (probably by asking the
>   fd backing store) to check that userspace has indeed torn down its
>   mappings. If all is fine, KVM will instruct the hypervisor to
>   repopulate the private range of the guest, otherwise it'll return an
>   error to the VMM;
>
> - if the conversion has been successful, the guest can resume its
>   execution normally.
>
> Note: this should still allow to use the hole-punching method just fine
> on systems that require it. The invariant here is just that KVM (with
> help from the backing store) is now responsible for refusing to
> instruct the hypervisor (or TDX module, or RMM, or whatever) to map a
> private page if there are existing mappings to it.
>
> > Having said that; if we can work out a way to safely
> > mmap() pages from the private-fd there's definitely some benefits to be
> > had - e.g. it could be used to populate the initial memory before the
> > guest is started.
>
> Right, so assuming the approach proposed above isn't entirely bogus,
> this might now become possible by having the VMM mmap the private-fd,
> load the payload, and then unmap it all, and only then instruct the
> hypervisor to use this as private memory.

Hard "no" on mapping the private-fd. Having the invariant that the
private-fd can never be mapped greatly simplifies the responsibilities of
the backing store, as well as the interface between the private-fd and the
in-kernel consumers of the memory (KVM in this case).
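
To make the difference in backing-store responsibilities concrete, here is a
toy model in plain Python. It is not kernel code and does not reflect the
API in this series; every class and method name is hypothetical and exists
only for illustration. The first class models the "never mappable"
invariant, the second models the mapping-count check that the quoted
proposal would require at conversion time.

```python
PAGE_SIZE = 4096


class NeverMappableStore:
    """Hypothetical private-fd model: userspace can never map it."""

    def __init__(self, npages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(npages)]

    def mmap(self, index):
        # The invariant: mapping always fails, so no mapping state exists.
        raise PermissionError("private-fd cannot be mapped")

    def get_page(self, index):
        # Stand-in for the in-kernel consumer (KVM here) fetching a page.
        return self.pages[index]


class RefcountedStore:
    """Hypothetical model of the alternative: mappable while shared."""

    def __init__(self, npages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(npages)]
        self.is_private = [False] * npages
        self.map_count = [0] * npages

    def mmap(self, index):
        if self.is_private[index]:
            raise PermissionError("page is private")
        self.map_count[index] += 1

    def munmap(self, index):
        if self.map_count[index] == 0:
            raise ValueError("page is not mapped")
        self.map_count[index] -= 1

    def try_make_private(self, index):
        # Conversion must be refused while any userspace mapping remains.
        if self.map_count[index] > 0:
            return False
        self.is_private[index] = True
        return True
```

In this sketch the second store has to carry per-page state and get the
check right on every conversion, while the first has nothing to track.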

What is the use case for shared->private conversion? x86, both TDX and SNP,
effectively do have a flavor of shared->private conversion; SNP can
definitely be in-place, and I think TDX too. But the only use case in x86 is
to populate the initial guest image, and due to other performance
bottlenecks, it's strongly recommended to keep the initial image as small as
possible. Based on your previous response about the guest firmware loading
the full guest image, my understanding is that pKVM will also utilize a
minimal initial image.

As a result, true in-place conversion to reduce the number of memcpy()s is
low priority, i.e. not planned at this time. Unless the use case expects to
convert large swaths of memory, the simplest approach would be to have pKVM
memcpy() between the private and shared backing pages during conversion.

In-place conversion that preserves data needs to be a separate and/or
additional hypercall, because "I want to map this page as private/shared" is
very, very different than "I want to map this page as private/shared and
consume/expose non-zero data". I.e. the host is guaranteed to get an explicit
request to do the memcpy(), so there shouldn't be a need to implicitly allow
this on any conversion.
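
A minimal sketch of that distinction, again as a toy Python model rather
than anything resembling a real hypercall interface. The class and method
names are made up, and zeroing the destination on a plain conversion is an
assumption of the sketch. The plain conversion carries no data across; the
copy happens only on a separate, explicit request.

```python
PAGE_SIZE = 4096


class ToyGuestPage:
    """One guest page with separate private and shared backing pages."""

    def __init__(self):
        self.private = bytearray(PAGE_SIZE)
        self.shared = bytearray(PAGE_SIZE)
        self.is_private = True

    def _backing(self, private):
        return self.private if private else self.shared

    def data(self):
        # The backing page the guest currently sees.
        return self._backing(self.is_private)

    def convert(self, to_private):
        # "Map this page as private/shared": no data is carried over,
        # the destination backing page starts out zeroed (assumption).
        dst = self._backing(to_private)
        dst[:] = bytes(PAGE_SIZE)
        self.is_private = to_private

    def convert_with_copy(self, to_private):
        # Separate, explicit request: "... and consume/expose non-zero
        # data". The slice assignment stands in for the memcpy().
        src = self._backing(not to_private)
        dst = self._backing(to_private)
        dst[:] = src
        self.is_private = to_private
```

With this split, a caller that only ever issues `convert()` never exposes
private contents, and the copy is visible as its own request.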