Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Date:   Fri, 12 May 2023 11:01:10 -0700
In-Reply-To: <20230512002124.3sap3kzxpegwj3n2@amd.com>
Mime-Version: 1.0
References: <ZD1oevE8iHsi66T2@google.com> <658018f9-581c-7786-795a-85227c712be0@redhat.com>
 <ZD12htq6dWg0tg2e@google.com> <1ed06a62-05a1-ebe6-7ac4-5b35ba272d13@redhat.com>
 <ZD2bBB00eKP6F8kz@google.com> <9efef45f-e9f4-18d1-0120-f0fc0961761c@redhat.com>
 <ZD86E23gyzF6Q7AF@google.com> <5869f50f-0858-ab0c-9049-4345abcf5641@redhat.com>
 <ZEM5Zq8oo+xnApW9@google.com> <20230512002124.3sap3kzxpegwj3n2@amd.com>
Message-ID: <ZF5+5g5hI7xyyIAS@google.com>
Subject: Re: Rename restrictedmem => guardedmem? (was: Re: [PATCH v10 0/9]
 KVM: mm: fd-based approach for supporting KVM)
From:   Sean Christopherson <seanjc@google.com>
To:     Michael Roth <michael.roth@amd.com>
Cc:     David Hildenbrand <david@redhat.com>,
        Chao Peng <chao.p.peng@linux.intel.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Vitaly Kuznetsov <vkuznets@redhat.com>,
        Jim Mattson <jmattson@google.com>,
        Joerg Roedel <joro@8bytes.org>,
        "Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
        Vlastimil Babka <vbabka@suse.cz>,
        Vishal Annapurve <vannapurve@google.com>,
        Yu Zhang <yu.c.zhang@linux.intel.com>,
        "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
        dhildenb@redhat.com, Quentin Perret <qperret@google.com>,
        tabba@google.com, wei.w.wang@intel.com,
        Mike Rapoport <rppt@kernel.org>,
        Liam Merwick <liam.merwick@oracle.com>,
        Isaku Yamahata <isaku.yamahata@gmail.com>,
        Jarkko Sakkinen <jarkko@kernel.org>,
        Ackerley Tng <ackerleytng@google.com>, kvm@vger.kernel.org,
        linux-kernel@vger.kernel.org, Hugh Dickins <hughd@google.com>,
        Christian Brauner <brauner@kernel.org>
Content-Type: text/plain; charset="us-ascii"
Precedence: bulk

On Thu, May 11, 2023, Michael Roth wrote:
> On Fri, Apr 21, 2023 at 06:33:26PM -0700, Sean Christopherson wrote:
> > 
> > Code is available here if folks want to take a look before any kind of formal
> > posting:
> > 
> > 	https://github.com/sean-jc/linux.git x86/kvm_gmem_solo
> 
> Hi Sean,
> 
> I've been working on getting the SNP patches ported to this but I'm having
> some trouble working out a reasonable scheme for how to work the
> RMPUPDATE hooks into the proposed design.
> 
> One of the main things is kvm_gmem_punch_hole(): this can free pages
> back to the host whenever userspace feels like it. Pages that are still
> marked private in the RMP table will blow up the host if they aren't returned
> to the normal state before handing them back to the kernel. So I'm trying to
> add a hook, orchestrated by kvm_arch_gmem_invalidate(), to handle that,
> e.g.:
> 
>   static long kvm_gmem_punch_hole(struct file *file, int mode, loff_t offset,
>                                   loff_t len)
>   {
>           struct kvm_gmem *gmem = file->private_data;
>           pgoff_t start = offset >> PAGE_SHIFT;
>           pgoff_t end = (offset + len) >> PAGE_SHIFT;
>           struct kvm *kvm = gmem->kvm;
>   
>           /*
>            * Bindings must be stable across invalidation to ensure the start+end
>            * are balanced.
>            */
>           filemap_invalidate_lock(file->f_mapping);
>           kvm_gmem_invalidate_begin(kvm, gmem, start, end);
>   
>           /* Handle arch-specific cleanups before releasing pages */
>           kvm_arch_gmem_invalidate(kvm, gmem, start, end);
>           truncate_inode_pages_range(file->f_mapping, offset, offset + len);
>   
>           kvm_gmem_invalidate_end(kvm, gmem, start, end);
>           filemap_invalidate_unlock(file->f_mapping);
>   
>           return 0;
>   }
> 
> But there's another hook, kvm_arch_gmem_set_mem_attributes(), needed to put
> the page in its intended state in the RMP table prior to mapping it into the
> guest's NPT.

IMO, this approach is wrong.  kvm->mem_attr_array is the source of truth for whether
userspace wants _guest_ physical pages mapped private vs. shared, but the attributes
array has zero insight into the _host_ physical pages.  I.e. SNP shouldn't hook
kvm_mem_attrs_changed(), because operating on the RMP from that code is fundamentally
wrong.

A good analogy is moving a memslot (ignoring that AFAIK no VMM actually moves
memslots, but it's a good analogy for KVM internals).  KVM needs to zap all mappings
for the old memslot gfn, but KVM does not create mappings for the new memslot gfn.
Same for changing attributes; unmap, but never map.

As for the unmapping side of things, kvm_unmap_gfn_range() will unmap all relevant
NPT entries, and the elevated mmu_invalidate_in_progress will prevent KVM from
establishing a new NPT mapping.  And mmu_invalidate_in_progress will reach '0' only
after both truncation _and_ kvm_vm_ioctl_set_mem_attributes() complete, i.e. KVM
can create new mappings only when both kvm->mem_attr_array and any relevant
guest_mem bindings have reached steady state.
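
To make the counting concrete, here is a throwaway userspace model of the
"elevated count blocks new mappings" rule.  All toy_* names are made up for
illustration, and the model ignores locking, memory barriers and gfn ranges,
so treat it as a sketch of the idea, not of KVM's actual code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the relevant bits of struct kvm; not the real layout. */
struct toy_kvm {
	unsigned long invalidate_seq;	/* bumped whenever an invalidation ends */
	long invalidate_in_progress;	/* number of in-flight invalidations */
};

/* Start of an invalidation, e.g. truncation or an attribute change. */
static void toy_invalidate_begin(struct toy_kvm *kvm)
{
	kvm->invalidate_in_progress++;
}

/* End of an invalidation; the seq bump makes older snapshots stale. */
static void toy_invalidate_end(struct toy_kvm *kvm)
{
	kvm->invalidate_seq++;
	kvm->invalidate_in_progress--;
}

/*
 * Fault path: the caller snapshots invalidate_seq before resolving the pfn.
 * A mapping may be installed only if nothing is in flight and nothing
 * completed since the snapshot was taken.
 */
static bool toy_can_install_mapping(const struct toy_kvm *kvm,
				    unsigned long seq_snapshot)
{
	if (kvm->invalidate_in_progress)
		return false;
	return kvm->invalidate_seq == seq_snapshot;
}

/* Truncation and an attribute update overlap; returns 0 on success. */
int toy_invalidate_demo(void)
{
	struct toy_kvm kvm = { 0, 0 };
	unsigned long seq = kvm.invalidate_seq;

	toy_invalidate_begin(&kvm);	/* truncation starts */
	toy_invalidate_begin(&kvm);	/* attribute change starts */
	toy_invalidate_end(&kvm);	/* truncation done */
	assert(!toy_can_install_mapping(&kvm, seq));	/* one still in flight */

	toy_invalidate_end(&kvm);	/* attribute change done */
	assert(!toy_can_install_mapping(&kvm, seq));	/* stale snapshot, retry */

	seq = kvm.invalidate_seq;	/* fresh snapshot after steady state */
	assert(toy_can_install_mapping(&kvm, seq));
	return 0;
}
```

The point of the demo is only that the fault path cannot win until _both_
invalidations have called their end(), regardless of which one finishes first.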

That leaves the question of when/where to do RMP updates.  Off the cuff, I think
RMP updates (and I _think_ also TDX page conversions) should _always_ be done in
the context of either (a) file truncation (make host owned, a.k.a. TDX reclaim)
or (b) allocating a new page/folio in guest_mem, a.k.a. kvm_gmem_get_folio().
Under the hood, even though the gfn is the same, the backing pfn is different, i.e.
installing a shared mapping should _never_ need to touch the RMP because pages
coming from the normal (non-guest_mem) pool must already be host owned.
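
A very rough userspace sketch of that rule, purely to illustrate where the
ownership transitions would live.  Everything here (toy_host, toy_gmem, the
toy_* helpers, the two-state "RMP") is hypothetical and is not the SNP or
guest_mem implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PFNS	8
#define TOY_NR_INDEX	4
#define TOY_NO_PFN	((size_t)-1)

enum toy_rmp_state { TOY_HOST_OWNED = 0, TOY_GUEST_PRIVATE };

/* Toy host: one RMP state and one "allocated" flag per pfn. */
struct toy_host {
	enum toy_rmp_state rmp[TOY_NR_PFNS];
	bool in_use[TOY_NR_PFNS];
};

/* Toy guest_mem file: file index -> backing pfn, or TOY_NO_PFN. */
struct toy_gmem {
	size_t pfn[TOY_NR_INDEX];
};

/* Normal page allocator; anything it hands out must be host owned. */
static size_t toy_alloc_pfn(struct toy_host *host)
{
	for (size_t pfn = 0; pfn < TOY_NR_PFNS; pfn++) {
		if (!host->in_use[pfn]) {
			assert(host->rmp[pfn] == TOY_HOST_OWNED);
			host->in_use[pfn] = true;
			return pfn;
		}
	}
	return TOY_NO_PFN;
}

/* Freeing a page that is still guest private is the "host blows up" case. */
static void toy_free_pfn(struct toy_host *host, size_t pfn)
{
	assert(host->rmp[pfn] == TOY_HOST_OWNED);
	host->in_use[pfn] = false;
}

/* (b) Allocation: the only place a page becomes guest private. */
static size_t toy_gmem_get_folio(struct toy_host *host, struct toy_gmem *gmem,
				 size_t index)
{
	if (gmem->pfn[index] == TOY_NO_PFN) {
		size_t pfn = toy_alloc_pfn(host);

		if (pfn == TOY_NO_PFN)
			return TOY_NO_PFN;
		host->rmp[pfn] = TOY_GUEST_PRIVATE;
		gmem->pfn[index] = pfn;
	}
	return gmem->pfn[index];
}

/* (a) Truncation: the only place a page goes back to host owned. */
static void toy_gmem_truncate(struct toy_host *host, struct toy_gmem *gmem,
			      size_t index)
{
	size_t pfn = gmem->pfn[index];

	if (pfn == TOY_NO_PFN)
		return;
	host->rmp[pfn] = TOY_HOST_OWNED;	/* reclaim before freeing */
	toy_free_pfn(host, pfn);
	gmem->pfn[index] = TOY_NO_PFN;
}

/* Private page, shared page, then truncation; returns 0 on success. */
int toy_rmp_demo(void)
{
	struct toy_host host = { 0 };
	struct toy_gmem gmem;
	size_t i, private_pfn, shared_pfn;

	for (i = 0; i < TOY_NR_INDEX; i++)
		gmem.pfn[i] = TOY_NO_PFN;

	private_pfn = toy_gmem_get_folio(&host, &gmem, 0);
	assert(private_pfn != TOY_NO_PFN);
	assert(host.rmp[private_pfn] == TOY_GUEST_PRIVATE);

	/* A shared mapping gets a different, already host-owned page. */
	shared_pfn = toy_alloc_pfn(&host);
	assert(shared_pfn != private_pfn);
	assert(host.rmp[shared_pfn] == TOY_HOST_OWNED);

	toy_gmem_truncate(&host, &gmem, 0);
	assert(host.rmp[private_pfn] == TOY_HOST_OWNED);
	assert(!host.in_use[private_pfn]);
	return 0;
}
```

Note that nothing in the sketch keys off of the attributes array; the "RMP" is
touched only when a guest_mem page is allocated or truncated.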

> Currently I'm calling that hook via kvm_vm_ioctl_set_mem_attributes(), just
> after kvm->mem_attr_array is updated based on the ioctl. The reasoning there
> is that KVM MMU can then rely on the existing mmu_invalidate_seq logic to
> ensure both the state in the mem_attr_array and the RMP table are in sync and
> up-to-date once MMU lock is acquired and MMU is ready to map it, or retry
> #NPF otherwise.
> 
> But for kvm_gmem_punch_hole(), kvm_vm_ioctl_set_mem_attributes() can potentially
> result in something like the following sequence if I implement things as above:
> 
>   CPU0: kvm_gmem_punch_hole():
>           kvm_gmem_invalidate_begin()
>           kvm_arch_gmem_invalidate()         // set pages to default/shared state in RMP table before free'ing
>   CPU1: kvm_vm_ioctl_set_mem_attributes():
>           kvm_arch_gmem_set_mem_attributes() // maliciously set pages to private in RMP table
>   CPU0:   truncate_inode_pages_range()       // HOST BLOWS UP TOUCHING PRIVATE PAGES
>           kvm_gmem_invalidate_end()
> 
> One quick and lazy solution is to rely on the fact that
> kvm_vm_ioctl_set_mem_attributes() holds the kvm->slots_lock throughout the
> entire begin()/end() portion of the invalidation sequence, and to similarly
> hold the kvm->slots_lock throughout the begin()/end() sequence in
> kvm_gmem_punch_hole() to prevent any interleaving.
> 
> But I'd imagine overloading kvm->slots_lock is not the proper approach. But
> would introducing a similar mutex to keep these operations grouped/atomic be
> a reasonable approach to you, or should we be doing something else entirely
> here?
> 
> Keep in mind that RMP updates can't be done while holding the kvm->mmu_lock
> spinlock, because we also need to unmap pages from the directmap, which can
> lead to scheduling-while-atomic BUG()s[1], so that's another constraint we
> need to work around.
> 
> Thanks!
> 
> -Mike
> 
> [1] https://lore.kernel.org/linux-coco/20221214194056.161492-7-michael.roth@amd.com/T/#m45a1af063aa5ac0b9314d6a7d46eecb1253bba7a