Date: Thu, 24 Mar 2022 15:51:48 +0000
From: Quentin Perret
To: Chao Peng
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet,
    Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    x86@kernel.org, "H . Peter Anvin", Hugh Dickins, Jeff Layton,
    "J . Bruce Fields", Andrew Morton, Mike Rapoport, Steven Price,
    "Maciej S . Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang,
    "Kirill A . Shutemov", luto@kernel.org, jun.nakajima@intel.com,
    dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com,
    maz@kernel.org, will@kernel.org
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Chao,

+CC Will and Marc for visibility.

On Thursday 10 Mar 2022 at 22:08:58 (+0800), Chao Peng wrote:
> This is the v5 of this series which tries to implement the fd-based KVM
> guest private memory. The patches are based on latest kvm/queue branch
> commit:
>
>   d5089416b7fb KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2
>
> Introduction
> ------------
> In general this patch series introduce fd-based memslot which provides
> guest memory through memory file descriptor fd[offset,size] instead of
> hva/size. The fd can be created from a supported memory filesystem
> like tmpfs/hugetlbfs etc. which we refer as memory backing store. KVM
> and the the memory backing store exchange callbacks when such memslot
> gets created. At runtime KVM will call into callbacks provided by the
> backing store to get the pfn with the fd+offset. Memory backing store
> will also call into KVM callbacks when userspace fallocate/punch hole
> on the fd to notify KVM to map/unmap secondary MMU page tables.
>
> Comparing to existing hva-based memslot, this new type of memslot allows
> guest memory unmapped from host userspace like QEMU and even the kernel
> itself, therefore reduce attack surface and prevent bugs.
>
> Based on this fd-based memslot, we can build guest private memory that
> is going to be used in confidential computing environments such as Intel
> TDX and AMD SEV. When supported, the memory backing store can provide
> more enforcement on the fd and KVM can use a single memslot to hold both
> the private and shared part of the guest memory.
>
> mm extension
> ---------------------
> Introduces new MFD_INACCESSIBLE flag for memfd_create(), the file created
> with these flags cannot read(), write() or mmap() etc via normal
> MMU operations. The file content can only be used with the newly
> introduced memfile_notifier extension.
>
> The memfile_notifier extension provides two sets of callbacks for KVM to
> interact with the memory backing store:
>   - memfile_notifier_ops: callbacks for memory backing store to notify
>     KVM when memory gets allocated/invalidated.
>   - memfile_pfn_ops: callbacks for KVM to call into memory backing store
>     to request memory pages for guest private memory.
>
> The memfile_notifier extension also provides APIs for memory backing
> store to register/unregister itself and to trigger the notifier when the
> bookmarked memory gets fallocated/invalidated.
>
> memslot extension
> -----------------
> Add the private fd and the fd offset to existing 'shared' memslot so that
> both private/shared guest memory can live in one single memslot. A page in
> the memslot is either private or shared. A page is private only when it's
> already allocated in the backing store fd, all the other cases it's treated
> as shared, this includes those already mapped as shared as well as those
> having not been mapped. This means the memory backing store is the place
> which tells the truth of which page is private.
>
> Private memory map/unmap and conversion
> ---------------------------------------
> Userspace's map/unmap operations are done by fallocate() ioctl on the
> backing store fd.
>   - map: default fallocate() with mode=0.
>   - unmap: fallocate() with FALLOC_FL_PUNCH_HOLE.
> The map/unmap will trigger above memfile_notifier_ops to let KVM map/unmap
> secondary MMU page tables.

I recently came across this series which is interesting for the
Protected KVM work that's currently ongoing in the Android world (see
[1], [2] or [3] for more details). The idea is similar in a number of
ways to the Intel TDX stuff (from what I understand, but I'm clearly not
understanding it all so, ...) or the Arm CCA solution, but using stage-2
MMUs instead of encryption; and leverages the caveat of the nVHE
KVM/arm64 implementation to isolate the control of stage-2 MMUs from the
host.

For Protected KVM (and I suspect most other confidential computing
solutions), guests have the ability to share some of their pages back
with the host kernel using a dedicated hypercall. This is necessary for
e.g. virtio communications, so these shared pages need to be mapped back
into the VMM's address space. I'm a bit confused about how that would
work with the approach proposed here. What is going to be the approach
for TDX?

It feels like the most 'natural' thing would be to have a KVM exit
reason describing which pages have been shared back by the guest, and to
then allow the VMM to mmap those specific pages in response in the
memfd. Is this something that has been discussed or considered?

Thanks,
Quentin

[1] https://lwn.net/Articles/836693/
[2] https://www.youtube.com/watch?v=wY-u6n75iXc
[3] https://www.youtube.com/watch?v=54q6RzS9BpQ&t=10862s