Date: Thu, 16 Sep 2021 15:36:27 +0800
From: Chao Peng
To: "Kirill A. Shutemov"
Cc: Andy Lutomirski, Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov,
 Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, Borislav Petkov, Andrew Morton, Joerg Roedel,
 Andi Kleen, David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
 Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86@kernel.org,
 linux-mm@kvack.org, linux-coco@lists.linux.dev, Kuppuswamy Sathyanarayanan,
 David Hildenbrand, Dave Hansen, Yu Zhang
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Message-ID: <20210916073627.GA18399@chaop.bj.intel.com>
References: <20210824005248.200037-1-seanjc@google.com>
 <20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name>
 <20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name>
 <02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org>
 <20210910171811.xl3lms6xoj3kx223@box.shutemov.name>
 <20210915195857.GA52522@chaop.bj.intel.com>
 <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>
In-Reply-To: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>

On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that
> > > > >> problem?
> > > > >
> > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I dunno. I will
> > > > > look closer into locking next.
> > > >
> > > > We can decisively eliminate this sort of failure by making the switch
> > > > happen at open time instead of after. For a memfd-like API, this would
> > > > be straightforward. For a filesystem, it would take a bit more thought.
> > >
> > > I think it should work fine as long as we check seals after i_size in the
> > > read path. See the comment in shmem_file_read_iter().
> > >
> > > Below is an updated version. I think it should be good enough to start
> > > integrating with KVM.
> > >
> > > I also attach a test case that consists of a kernel patch and a userspace
> > > program. It demonstrates how it can be integrated into KVM code.
> > >
> > > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > > called after the owner (struct kvm) has been freed. It happens because
> > > the memfd can outlive KVM. So the callback has to check whether such an
> > > owner exists, then check that there's a memslot with such an inode.
> >
> > Would introducing memfd_unregister_guest() fix this?
>
> I considered this, but it gets complex quickly.
>
> At what point does it get called? On KVM memslot destroy?

I meant when the VM gets destroyed.

> What if multiple KVM slots share the same memfd? Add a refcount into the
> memfd for how many times an owner registered it?
>
> It would leave us in a strange state: the memfd refcounts owners (struct
> kvm) and the KVM memslot pins the struct file. A weird refcount exchange
> program.
>
> I hate it.

But yes, agreed, things will get much more complex in practice.

> > > I guess it should be okay: we have vm_list we can check the owner
> > > against. We may consider replacing vm_list with something more scalable
> > > if the number of VMs gets too high.
> > >
> > > Any comments?
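To make sure I read the caveat correctly: before dereferencing `owner`, the
callback first confirms the owner is still on vm_list. A toy userspace model
of that check (all names here are hypothetical stand-ins, not the real kernel
code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvm. */
struct kvm_model { int id; };

#define MAX_VMS 4
/* Model of vm_list: the set of still-live VMs (lock elided in this sketch). */
static struct kvm_model *vm_list[MAX_VMS];

static bool owner_is_live(void *owner)
{
	for (int i = 0; i < MAX_VMS; i++)
		if (vm_list[i] == owner)
			return true;
	return false;
}

/* Model of guest_ops::invalidate_page_range(): since the memfd can outlive
 * the VM, the callback must not touch `owner` until it has confirmed the
 * owner is still on vm_list (and then that a memslot maps this inode). */
static int invalidate_page_range(void *owner, long start, long end)
{
	(void)start;
	(void)end;
	if (!owner_is_live(owner))
		return -1;	/* owner already freed: drop the callback */
	/* ... look up memslot by inode, zap entries for [start, end) ... */
	return 0;
}
```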
> > >
> > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > > index 4f1600413f91..3005e233140a 100644
> > > --- a/include/linux/memfd.h
> > > +++ b/include/linux/memfd.h
> > > @@ -4,13 +4,34 @@
> > >
> > >  #include
> > >
> > > +struct guest_ops {
> > > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > > +				      pgoff_t start, pgoff_t end);
> > > +};
> >
> > I can see there are two scenarios to invalidate page(s): when punching a
> > hole or when ftruncating to 0. In either case KVM should already have been
> > called with the necessary information from userspace, via the memory slot
> > punch hole syscall or the memory slot delete syscall, so I wonder whether
> > this callback is really needed.
>
> So what do you propose? Forbid truncate/punch from userspace and make KVM
> handle punch hole/truncate from within the kernel? I think that's a
> layering violation.

As far as I understand the flow for punch hole/truncate in this design,
there will be two steps for userspace:
  1. punch hole/delete the kvm memory slot, and then
  2. punch hole/truncate on the memory backing store fd.

In concept we can do whatever is needed for invalidation in either step.
If we can do the invalidation in step 1 then we don't need to bother with
this callback. This is what I meant, but I agree the current callback can
also work.

> > > +
> > > +struct guest_mem_ops {
> > > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > > +	void (*put_unlock_pfn)(unsigned long pfn);
> >
> > Same as above, I'm not clear on when put_unlock_pfn() would be called;
> > I'm thinking the page can be put-and-unlocked when userspace punches a
> > hole or ftruncates to 0 on the fd.
>
> No. put_unlock_pfn() has to be called after the pfn is in the SEPT. This
> way we close the race between SEPT population and truncate/punch.
> get_lock_pfn() would stop truncate until put_unlock_pfn() is called.

Okay, makes sense.

Thanks,
Chao
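P.S. To double-check my understanding of the get_lock_pfn()/put_unlock_pfn()
ordering, here is a toy single-threaded model of the contract (names, the pfn
value, and the pin counter are all hypothetical; in the real shmem code
truncate would block rather than fail):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a per-page pin that truncate/punch must respect. */
static int page_pins;			/* pfns currently held via get_lock_pfn() */
static bool page_present = true;	/* false once the hole is punched */

static unsigned long get_lock_pfn(void)
{
	if (!page_present)
		return 0;	/* hole already punched: nothing to map */
	page_pins++;		/* holds off truncate until put_unlock_pfn() */
	return 0x1234;		/* hypothetical pfn */
}

static void put_unlock_pfn(void)
{
	page_pins--;		/* called only after the pfn is in the SEPT */
}

/* Truncate/punch must not proceed while any pfn is locked; this is what
 * closes the race between SEPT population and truncate/punch. */
static bool try_truncate(void)
{
	if (page_pins > 0)
		return false;	/* the kernel would wait here, not fail */
	page_present = false;
	return true;
}
```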