From: Fuad Tabba
Date: Fri, 30 Sep 2022 17:19:00 +0100
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Sean Christopherson
Cc: Chao Peng, David Hildenbrand, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
    Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    x86@kernel.org, "H. Peter Anvin", Hugh Dickins, Jeff Layton,
    "J. Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
    Steven Price, "Maciej S. Szmigiero", Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", luto@kernel.org,
    jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
    aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
    Quentin Perret, Michael Roth, mhocko@suse.com, Muchun Song,
    wei.w.wang@intel.com, Will Deacon, Marc Zyngier
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
    <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
    <20220926142330.GC2658254@chaop.bj.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding
> > > > >    non-zero memory into the guest (after pre-boot phase).
> > > > >
> > > > > 2. Be mutually exclusive with shared<=>private conversions, and is
> > > > >    allowed if and only if the entire gfn range of the associated
> > > > >    memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

:D

> Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
> granularity X"?
> With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it doesn't
know where the guest might want to place the DMA pool. If I understand
this correctly, for this to work, all memslots would need to be the same
size, and sharing would always need to happen at that granularity.

Moreover, for something like a small DMA pool this might scale, but I'm
not sure about potential future workloads (e.g., multimedia in-place
sharing).

> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level?
> > > That can cause VMA fragmentation if the conversion is frequent, as I see
> > > it. Even with a KVM ioctl for mapping as mentioned below, I think there
> > > will be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since
> mapped==guppable, the end result is the same.
>
> Emphasis above because pKVM still needs to unmap the memory _somewhere_.
> IIUC, the current approach is to do that only in the stage-2 page tables,
> i.e. only in the context of the hypervisor. Which is also the source of the
> gup() problems; the untrusted kernel is blissfully unaware that the memory
> is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so
> that the kernel can protect itself will incur fragmentation in the VMAs. Well,
> unless all of guest memory becomes unguppable, but that's likely not a viable
> option.

Actually, for pKVM, there is no need for the guest memory to be GUP'able
at all if we use the new inaccessible_get_pfn(). This of course goes back
to what I'd mentioned before in v7: it seems that representing the memslot
memory as a file descriptor should be orthogonal to whether the memory is
shared or private, rather than a private_fd for private memory and the
userspace_addr for shared memory. The host can then map or unmap the
shared/private memory using the fd, which gives it more freedom, for
example in choosing to unmap shared memory when it is not needed.

Cheers,
/fuad