From: Sean Christopherson
To: Isaku Yamahata
Cc: Michael Roth, isaku.yamahata@intel.com, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, Paolo Bonzini, erdemaktas@google.com,
    Sagi Shahar, David Matlack, Kai Huang, Zhi Wang, chen.bo@intel.com,
    linux-coco@lists.linux.dev, Chao Peng, Ackerley Tng, Vishal Annapurve
Subject: Re: [RFC PATCH 5/6] KVM: Add flags to struct kvm_gfn_range
Date: Fri, 23 Jun 2023 12:37:42 -0700
In-Reply-To: <20230620190443.GU2244082@ls.amr.corp.intel.com>
References: <20230620162835.xsmaao63brira7as@amd.com>
    <20230620190443.GU2244082@ls.amr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org
On Tue, Jun 20, 2023, Isaku Yamahata wrote:
> On Tue, Jun 20, 2023 at 11:28:35AM -0500, Michael Roth wrote:
> > On Thu, Jun 15, 2023 at 01:12:18PM -0700, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata
> > >
> > > TDX and SEV-SNP need to know the reason for a callback to
> > > kvm_unmap_gfn_range(): MMU notifier, the set memory attributes ioctl, or
> > > a KVM gmem callback. The callback handler changes its behavior or does
> > > additional housekeeping depending on the reason. For the MMU notifier,
> > > it's zapping shared PTEs. For set memory attributes, it's the conversion
> > > of memory attributes (private <=> shared). For KVM gmem, it's punching a
> > > hole in the range or releasing the file.
> >
> > I think it's still an open topic that we need to hear more from Sean about:
> >
> > https://lore.kernel.org/lkml/20230522235838.ov3722lcusotzlvo@amd.com/
> >
> > but I *think* we were leaning toward decoupling the act of invalidating
> > GFNs from the act of invalidating/freeing gmem pages.

Yes.  Ignore any comments I made about not introducing new hooks, I messed up
and forgot the full context.

> > One concrete example of where this separation makes sense is with
> > hole-punching. SNP has unique platform-specific stuff it has to do before
> > freeing a gmem page back to the host. If we try to plumb this through
> > kvm_unmap_gfn_range() via a special flag then it's a little awkward
> > because:
> >
> > a) Presumably that hole-punch would have occurred after a preceding
> >    KVM_SET_MEMORY_ATTRIBUTES was issued to switch the page to shared
> >    state in the xarray. So all it should really need to do is handle
> >    the platform-specific behavior, like updating the RMP table in the
> >    case of SNP. None of the other details like GFN ranges are relevant
> >    there; RMP updates only need the PFN, so we end up walking memslots
> >    to do GFN->PFN translations when it would actually be much more
> >    efficient to translate the hole-punched FD offset range directly to
> >    the folios backing those ranges.
> >
> > b) It places an unnecessary dependency on having an active memslot to do
> >    those translations. This ends up not being an issue with the current
> >    version of the gmem patchset because release() happens *before*
> >    gmem_unbind(), so there is a memslot associated with the ranges at
> >    gmem_release() time, but in the initial version of gmem it was the
> >    reverse, so if things ever changed again in this regard we'd once
> >    again have to completely rework how to issue these platform-specific
> >    invalidation callbacks.
> >
> > I really *really* like having a separate, simple invalidation mechanism
> > in place that just converts FD offsets to PFNs and then passes those on
> > to platform-defined handlers to clean up pages before freeing them back
> > to the system. It's versatile in that it can be called pretty much
> > anywhere regardless of where we are in the KVM lifecycle, it's robust in
> > that it doesn't rely on unnecessary outside dependencies, and it avoids
> > adding unneeded complexity to paths like kvm_unmap_gfn_range().
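As a rough sketch of the shape being described here -- kvm_gmem_invalidate()
and kvm_arch_gmem_invalidate() are illustrative stand-in names, not the actual
gmem API:

	/*
	 * Sketch only: walk the folios backing a punched range of the gmem
	 * file and hand their PFNs to an arch hook for cleanup (e.g. RMP
	 * updates on SNP).  No GFNs, no memslots, no dependency on KVM MMU
	 * state.
	 */
	static void kvm_gmem_invalidate(struct file *file, pgoff_t start,
					pgoff_t end)
	{
		pgoff_t index = start;

		while (index < end) {
			struct folio *folio;

			folio = filemap_get_folio(file->f_mapping, index);
			if (IS_ERR_OR_NULL(folio)) {
				index++;
				continue;
			}

			/*
			 * Arch-defined cleanup (stand-in name) before the
			 * pages are freed back to the host.
			 */
			kvm_arch_gmem_invalidate(folio_pfn(folio),
						 folio_nr_pages(folio));

			index = folio_next_index(folio);
			folio_put(folio);
		}
	}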
> > That's the approach taken with the SNP hypervisor v9 series, with the
> > gmem hook being introduced here:
> >
> > https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m3ad8245235a27ed0f41c359c191dcda6c77af043
> >
> > and the SEV-SNP implementation of that hook being here:
> >
> > https://lore.kernel.org/kvm/20230612042559.375660-1-michael.roth@amd.com/T/#m6ac04b44722dbc07839011816e94fadf5ad6794e
> >
> > Would a similar approach work for TDX? At least WRT cleaning up pages
> > before returning them back to the host? If we could isolate that
> > requirement/handling from all the other aspects of invalidations it
> > really seems like it would cause us fewer headaches down the road.
>
> In short, TDX KVM doesn't need an extra callback for invalidating/freeing
> gmem pages. The kvm_unmap_gfn_range() callback works. Instead, TDX needs the
> attributes (private-or-shared) for it.

Just because TDX doesn't strictly need a separate callback doesn't mean it
wouldn't be useful and beneficial.  SNP doesn't "need" a separate callback
either, i.e. we could make kvm_unmap_gfn_range() work, but it would be ugly
and inflexible.  E.g. as Mike pointed out in the other thread, reclaiming
physical memory when SPTEs are zapped is suboptimal if the memory is not
actually discarded.

 : There's also cases like userspaces that opt to not discard memory after
 : conversions because they highly favor performance over memory usage. In
 : those cases it would make sense to defer marking the pages as shared in
 : the RMP until the FALLOC_FL_PUNCH_HOLE, rather than triggering it via
 : KVM MMU invalidation path after a conversion.

And to some extent, I would even argue that TDX does "need" the separate hook,
because doing PAGE.WBINVD while holding mmu_lock for write is going to be
slooow.  Even if the PAGE.WBINVD isn't redundant, i.e. the memory is never
converted back to private, deferring the cache flush until the backing store
is freed is a win for guest performance during conversions.
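For illustration, a sketch of how a TDX flavor of the arch hook assumed above
could defer the flush to free time; tdh_phymem_page_wbinvd() is an assumed
wrapper for the TDH.PHYMEM.PAGE.WBINVD SEAMCALL, not a confirmed interface:

	/*
	 * Sketch only: flushing here, when the backing store is actually
	 * freed, keeps the slow SEAMCALL out of the mmu_lock-protected zap
	 * path.
	 */
	void kvm_arch_gmem_invalidate(kvm_pfn_t pfn, unsigned long nr_pages)
	{
		unsigned long i;

		for (i = 0; i < nr_pages; i++)
			tdh_phymem_page_wbinvd(pfn_to_hpa(pfn + i));
	}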
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 1a47cedae8a1..c049c0aa44d6 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -256,12 +256,21 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
> > >  #endif
> > >
> > >  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER
> > > +
> > > +#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR	BIT(0)
> >
> > Can you go into more detail on why special handling is needed for
> > SET_MEM_ATTR?
>
> When in TDX, the VMM zaps a private page from the encrypted page table and
> the VMM adds the page back at the same GPA, it results in zeroing the page,
> and the guest needs to accept the page again. When converting a page from
> shared to private, KVM needs to zap only the shared pages, so the callback
> needs to know whether this zap is for converting shared-to-private or
> private-to-shared.

That doesn't answer Mike's question.  You answered why KVM needs to know
whether to zap shared vs. private, but not why SET_MEM_ATTR is a special case.

> > > +#define KVM_GFN_RANGE_FLAGS_GMEM_PUNCH_HOLE	BIT(1)
> > > +#define KVM_GFN_RANGE_FLAGS_GMEM_RELEASE	BIT(2)
> >
> > Would the need to distinguish between PUNCH_HOLE/RELEASE go away in the
> > TDX case if you take the above approach? For SNP, the answer is yes. If
> > that's also the case for TDX I think that would be another argument in
> > favor of decoupling these from the existing KVM MMU invalidation path.
>
> TDX doesn't need a gmem_invalidate callback. TDX doesn't need the difference
> between punch hole and release. So in short, TDX needs
> KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR and KVM_GFN_RANGE_FLAGS_GMEM.

TDX needs to know what flavor of mappings, i.e. EPT vs. S-EPT, are in scope;
TDX doesn't need to know who/what initiated a zap.  And for that, a simple
private vs. shared flag would suffice.

However, looking at kvm_mem_attrs_changed() again, I think invoking
kvm_unmap_gfn_range() from generic KVM code is a mistake and shortsighted.
Zapping in response to *any* attribute change is very private/shared centric.
E.g. if/when we extend attributes to provide per-page RWX protections, zapping
existing SPTEs in response to granting *more* permissions may not be necessary
or even desirable.

And for SNP, isn't zapping unnecessary?  KVM needs to update the RMP, but
there's no need to zap NPT entries.  Or do RMP lookups need to be blocked
while the RMP is being updated?

Regardless of what SNP needs to do on attribute changes, rather than using
kvm_unmap_gfn_range(), I think kvm_mem_attrs_changed() should look something
like:

	struct kvm_gfn_range gfn_range;
	struct kvm_memory_slot *slot;
	struct kvm_memslots *slots;
	struct kvm_memslot_iter iter;
	bool flush = false;
	int i;

	gfn_range.may_block = true;
	gfn_range.attrs = attrs;

	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
		slots = __kvm_memslots(kvm, i);

		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
			slot = iter.slot;
			gfn_range.start = max(start, slot->base_gfn);
			gfn_range.end = min(end, slot->base_gfn + slot->npages);
			if (gfn_range.start >= gfn_range.end)
				continue;
			gfn_range.slot = slot;

			flush |= kvm_arch_set_memory_attributes(kvm, &gfn_range);
		}
	}

	if (flush)
		kvm_flush_remote_tlbs(kvm);

At that point, a single "is_private" or "is_shared" flag is awkward and
arguably wrong, because the changing attributes affect both.  The fact that
TDX only needs to zap one or the other is an implementation detail, not a
fundamental truth of the update itself.

We could just have kvm_arch_set_memory_attributes() take individual params
instead of taking a kvm_gfn_range pointer, but that's a missed opportunity IMO
as this really is just another variation of a gfn-based notification to arch
MMU code.

Going with my union suggestion from the MGLRU thread[*], I think we should aim
for making kvm_gfn_range look like this:

	struct kvm_gfn_range {
		struct kvm_memory_slot *slot;
		gfn_t start;
		gfn_t end;
		union {
			struct test_clear_young_metadata *metadata;
			unsigned long attributes;
			pte_t pte;
			unsigned long callback_arg; /* needs a better name */
		};
		bool only_private;
		bool only_shared;
		bool may_block;
	};

That way kvm_arch_set_memory_attributes() can communicate that it affects both
shared and private mappings, i.e. can leave it to TDX to precisely zap only
the necessary tree.  It's a bit unfortunate that the existing mmu_notifier
usage would need to explicitly say "only_shared", but that's literally one
line of code in __kvm_handle_hva_range(), and on the plus side it would
clearly document that hva-based notifications are shared-only.

[*] https://lore.kernel.org/all/ZItNoeWriZgLUaon@google.com
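For reference, a sketch of what that one line could look like, assuming the
union layout above and the existing __kvm_handle_hva_range() structure
(surrounding code abridged):

	/* In __kvm_handle_hva_range(), when filling in gfn_range: */
	gfn_range.pte = range->pte;
	gfn_range.may_block = range->may_block;
	/* hva-based notifications only ever affect shared mappings. */
	gfn_range.only_shared = true;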