2024-03-01 17:29:52

by Isaku Yamahata

Subject: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

From: Isaku Yamahata <[email protected]>

The objective of this RFC patch series is to develop a uAPI aimed at
(pre)populating guest memory for various use cases and underlying VM
technologies.

- Pre-populate guest memory to mitigate excessive KVM page faults during guest
boot [1], a need not limited to any specific technology.

- Pre-populate guest memory (including encryption and measurement) for
confidential guests [2]: SEV-SNP, TDX, and SW_PROTECTED VMs, and potentially
pKVM and other technologies.

The patches are organized as follows.
- 1: documentation on the uAPI KVM_MAP_MEMORY.
- 2: architecture-independent implementation part.
- 3-4: refactoring of the x86 KVM MMU as preparation.
- 5: x86 helper function to map a guest page.
- 6: x86 KVM arch implementation.
- 7: add x86-ops necessary for TDX and SEV-SNP.
- 8: selftest for validation.

Discussion points:

uAPI design:
- access flags
Access flags are needed for guest memory population. We have options for
how to expose them in the uAPI.
- option 1. Introduce access flags, possibly with the addition of a private
access flag.
- option 2. Omit access flags from the uAPI.
Allow the kernel to deduce the necessary flags based on the memory slot and
its memory attributes.
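
For illustration, option 2's kernel-side deduction could look roughly like
the following sketch (not part of this series; slot/gfn/vcpu stand for the
slot and gfn being populated, and it assumes the existing memslot and
memory-attribute helpers plus the in-flight PFERR_PRIVATE_ACCESS bit):

	/* Populate with write access unless the slot is read-only. */
	u32 error_code = PFERR_WRITE_MASK;

	if (slot->flags & KVM_MEM_READONLY)
		error_code &= ~PFERR_WRITE_MASK;
	/* Deduce private access from the gfn's memory attributes. */
	if (kvm_slot_can_be_private(slot) &&
	    kvm_mem_is_private(vcpu->kvm, gfn))
		error_code |= PFERR_PRIVATE_ACCESS;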

- SEV-SNP and byte vs. page granularity
The SEV counterpart is KVM_SEV_LAUNCH_UPDATE_DATA, which requires memory
regions to be 16-byte aligned rather than page aligned. Should we define
struct kvm_memory_mapping in bytes rather than pages?

struct kvm_sev_launch_update_data {
__u64 uaddr;
__u32 len;
};
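
A byte-granularity variant of the proposed structure might then look like
the following (purely illustrative; the field names are made up):

struct kvm_memory_mapping {
	__u64 base_address;	/* GPA in bytes instead of base_gfn */
	__u64 size;		/* length in bytes instead of nr_pages */
	__u64 flags;
	__u64 source;
};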

- TDX and measurement
The TDX counterparts are TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
extends the TD measurement with the page contents.
Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
TDH.MR.EXTEND (see the sketch below).
Option 2. Don't handle extend here. Let the TDX vendor-specific API
KVM_MEMORY_ENCRYPT_OP handle it with a subcommand like
KVM_TDX_EXTEND_MEMORY.
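
With option 1, the uAPI delta would be a single new flag bit, for example
(hypothetical; the next free bit after the four flags in patch 1):

#define KVM_MEMORY_MAPPING_FLAG_EXTEND	_BITULL(4)

TDX would then issue TDH.MR.EXTEND for the mapped pages when the flag is
set.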

- TDX and struct kvm_memory_mapping::source
While the current patch series doesn't utilize the
kvm_memory_mapping::source member, TDX needs it to specify the source of
memory contents.
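
For reference, the expected userspace flow for the proposed ioctl would look
roughly like the following sketch (vcpu_fd, gpa, and size are placeholders;
error handling abbreviated):

struct kvm_memory_mapping mapping = {
	.base_gfn = gpa >> 12,		/* PAGE_SHIFT */
	.nr_pages = size >> 12,
	.flags = KVM_MEMORY_MAPPING_FLAG_WRITE,
	.source = 0,			/* no initial contents */
};
int ret;

do {
	/*
	 * On -EAGAIN, KVM has advanced base_gfn/nr_pages (and source,
	 * if non-zero) to the remaining range; simply retry.
	 */
	ret = ioctl(vcpu_fd, KVM_MAP_MEMORY, &mapping);
} while (ret < 0 && errno == EAGAIN);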

Implementation:
- x86 KVM MMU
In the x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(), which is not
confined to the KVM TDP MMU. We could restrict it to the KVM TDP MMU and
introduce an optimized version.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com

Thanks,

Isaku Yamahata (8):
KVM: Document KVM_MAP_MEMORY ioctl
KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
KVM: x86/mmu: Introduce initializer macro for struct kvm_page_fault
KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()
KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest
memory
KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()
KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()
KVM: selftests: x86: Add test for KVM_MAP_MEMORY

Documentation/virt/kvm/api.rst | 36 +++++
arch/x86/include/asm/kvm-x86-ops.h | 2 +
arch/x86/include/asm/kvm_host.h | 6 +
arch/x86/kvm/mmu.h | 3 +
arch/x86/kvm/mmu/mmu.c | 30 ++++
arch/x86/kvm/mmu/mmu_internal.h | 70 +++++----
arch/x86/kvm/x86.c | 83 +++++++++++
include/linux/kvm_host.h | 4 +
include/uapi/linux/kvm.h | 15 ++
tools/include/uapi/linux/kvm.h | 14 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/map_memory_test.c | 136 ++++++++++++++++++
virt/kvm/kvm_main.c | 74 ++++++++++
13 files changed, 448 insertions(+), 26 deletions(-)
create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c


base-commit: 6a108bdc49138bcaa4f995ed87681ab9c65122ad
--
2.25.1



2024-03-01 17:31:33

by Isaku Yamahata

Subject: [RFC PATCH 4/8] KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()

From: Isaku Yamahata <[email protected]>

In preparation for an ioctl to pre-populate guest memory, factor
kvm_mmu_do_page_fault() into three parts: initialization of struct
kvm_page_fault, the call to the fault handler, and the surrounding logic
for error checking and stats updates. This makes it possible to implement
a wrapper that calls only the fault handler.

No functional change intended.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 72ef09fc9322..aac52f0fdf54 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -302,6 +302,24 @@ enum {
.pfn = KVM_PFN_ERR_FAULT, \
.hva = KVM_HVA_ERR_BAD, }

+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+ int r;
+
+ if (vcpu->arch.mmu->root_role.direct) {
+ fault->gfn = fault->addr >> PAGE_SHIFT;
+ fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
+ }
+
+ if (IS_ENABLED(CONFIG_RETPOLINE) && fault->is_tdp)
+ r = kvm_tdp_page_fault(vcpu, fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, fault);
+
+ return r;
+}
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 err, bool prefetch, int *emulation_type)
{
@@ -310,11 +328,6 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
KVM_MAX_HUGEPAGE_LEVEL);
int r;

- if (vcpu->arch.mmu->root_role.direct) {
- fault.gfn = fault.addr >> PAGE_SHIFT;
- fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
- }
-
/*
* Async #PF "faults", a.k.a. prefetch faults, are not faults from the
* guest perspective and have already been counted at the time of the
@@ -323,10 +336,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;

- if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
- r = kvm_tdp_page_fault(vcpu, &fault);
- else
- r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+ r = __kvm_mmu_do_page_fault(vcpu, &fault);

if (fault.write_fault_to_shadow_pgtable && emulation_type)
*emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
--
2.25.1


2024-03-01 17:36:38

by Isaku Yamahata

Subject: [RFC PATCH 7/8] KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()

From: Isaku Yamahata <[email protected]>

In the case of TDX, the memory contents need to be provided for encryption
when populating guest memory before running the guest. Add hooks in
kvm_arch_vcpu_map_memory() for KVM_MAP_MEMORY before/after calling
kvm_mmu_map_page(). TDX KVM will use these hooks.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/include/asm/kvm-x86-ops.h | 2 ++
arch/x86/include/asm/kvm_host.h | 6 ++++++
arch/x86/kvm/x86.c | 34 ++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 3942b74c1b75..fc4e11d40733 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -137,6 +137,8 @@ KVM_X86_OP(complete_emulated_msr)
KVM_X86_OP(vcpu_deliver_sipi_vector)
KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
KVM_X86_OP_OPTIONAL(get_untagged_addr)
+KVM_X86_OP_OPTIONAL(pre_mmu_map_page);
+KVM_X86_OP_OPTIONAL(post_mmu_map_page);

#undef KVM_X86_OP
#undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e7b1a00e265..301fedd6b156 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1805,6 +1805,12 @@ struct kvm_x86_ops {
unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);

gva_t (*get_untagged_addr)(struct kvm_vcpu *vcpu, gva_t gva, unsigned int flags);
+
+ int (*pre_mmu_map_page)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u32 *error_code, u8 *max_level);
+ void (*post_mmu_map_page)(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
};

struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6025c0e12d89..ba8bf35f1c9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5811,6 +5811,36 @@ int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
return kvm_mmu_reload(vcpu);
}

+static int kvm_pre_mmu_map_page(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping,
+ u32 error_code, u8 *max_level)
+{
+ int r = 0;
+
+ if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM ||
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM) {
+ if (mapping->source)
+ r = -EINVAL;
+ } else if (kvm_x86_ops.pre_mmu_map_page)
+ r = static_call(kvm_x86_pre_mmu_map_page)(vcpu, mapping,
+ &error_code,
+ max_level);
+ else
+ r = -EOPNOTSUPP;
+
+ return r;
+}
+
+static void kvm_post_mmu_map_page(struct kvm_vcpu *vcpu, struct kvm_memory_mapping *mapping)
+{
+ if (vcpu->kvm->arch.vm_type == KVM_X86_DEFAULT_VM ||
+ vcpu->kvm->arch.vm_type == KVM_X86_SW_PROTECTED_VM)
+ return;
+
+ if (kvm_x86_ops.post_mmu_map_page)
+ static_call(kvm_x86_post_mmu_map_page)(vcpu, mapping);
+}
+
int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
struct kvm_memory_mapping *mapping)
{
@@ -5842,8 +5872,12 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
else
max_level = PG_LEVEL_4K;

+ r = kvm_pre_mmu_map_page(vcpu, mapping, error_code, &max_level);
+ if (r)
+ return r;
r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
max_level, &goal_level);
+ kvm_post_mmu_map_page(vcpu, mapping);
if (r)
return r;

--
2.25.1


2024-03-01 17:36:56

by Isaku Yamahata

Subject: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

From: Isaku Yamahata <[email protected]>

Add documentation for the KVM_MAP_MEMORY ioctl.

It pre-populates guest memory, and can potentially initialize the memory
contents with encryption and measurement, depending on the underlying
technology.

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0b5a33ee71ee..33d2b63f7dbf 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).

See KVM_SET_USER_MEMORY_REGION2 for additional details.

+4.143 KVM_MAP_MEMORY
+------------------------
+
+:Capability: KVM_CAP_MAP_MEMORY
+:Architectures: none
+:Type: vcpu ioctl
+:Parameters: struct kvm_memory_mapping(in/out)
+:Returns: 0 on success, <0 on error
+
+KVM_MAP_MEMORY populates guest memory without running vcpu.
+
+::
+
+ struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+ };
+
+ /* For kvm_memory_mapping:: flags */
+ #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
+ #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
+ #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
+ #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
+
+KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
+not zero and it's supported (depending on underlying technology), the guest
+memory content is populated with the source. The flags field supports three
+flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
+KVM_MEMORY_MAPPING_FLAG_USER. Which corresponds to fault code for kvm page
+fault to populate guest memory. write fault, fetch fault and user fault.
+When it returned, the input is updated. If nr_pages is large, it may
+return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero)
+to point the remaining range.
+
5. The kvm_run structure
========================

--
2.25.1


2024-03-01 17:43:18

by Isaku Yamahata

Subject: [RFC PATCH 8/8] KVM: selftests: x86: Add test for KVM_MAP_MEMORY

From: Isaku Yamahata <[email protected]>

Add a test case to exercise KVM_MAP_MEMORY and run the guest to access the
pre-populated area. It tests the KVM_MAP_MEMORY ioctl for KVM_X86_DEFAULT_VM
and KVM_X86_SW_PROTECTED_VM. The other VM types are just placeholders for
the future.

Signed-off-by: Isaku Yamahata <[email protected]>
---
tools/include/uapi/linux/kvm.h | 14 ++
tools/testing/selftests/kvm/Makefile | 1 +
.../selftests/kvm/x86_64/map_memory_test.c | 136 ++++++++++++++++++
3 files changed, 151 insertions(+)
create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index c3308536482b..ea8d3cf840ab 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -2227,4 +2227,18 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+#define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
+#define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
+#define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
+#define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
+
+struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index da20e6bb43ed..baef461ed38a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -142,6 +142,7 @@ TEST_GEN_PROGS_x86_64 += set_memory_region_test
TEST_GEN_PROGS_x86_64 += steal_time
TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
TEST_GEN_PROGS_x86_64 += system_counter_offset_test
+TEST_GEN_PROGS_x86_64 += x86_64/map_memory_test

# Compiled outputs used by test targets
TEST_GEN_PROGS_EXTENDED_x86_64 += x86_64/nx_huge_pages_test
diff --git a/tools/testing/selftests/kvm/x86_64/map_memory_test.c b/tools/testing/selftests/kvm/x86_64/map_memory_test.c
new file mode 100644
index 000000000000..9480c6c89226
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/map_memory_test.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2024, Intel, Inc
+ *
+ * Author:
+ * Isaku Yamahata <isaku.yamahata at gmail.com>
+ */
+#include <linux/sizes.h>
+
+#include <test_util.h>
+#include <kvm_util.h>
+#include <processor.h>
+
+/* Arbitrarily chosen value. Pick 3G */
+#define TEST_GVA 0xc0000000
+#define TEST_GPA TEST_GVA
+#define TEST_SIZE (SZ_2M + PAGE_SIZE)
+#define TEST_NPAGES (TEST_SIZE / PAGE_SIZE)
+#define TEST_SLOT 10
+
+static void guest_code(uint64_t base_gpa)
+{
+ volatile uint64_t val __used;
+ int i;
+
+ for (i = 0; i < TEST_NPAGES; i++) {
+ uint64_t *src = (uint64_t *)(base_gpa + i * PAGE_SIZE);
+
+ val = *src;
+ }
+
+ GUEST_DONE();
+}
+
+static void map_memory(struct kvm_vcpu *vcpu, u64 base_gfn, u64 nr_pages,
+ u64 source, bool should_success)
+{
+ struct kvm_memory_mapping mapping = {
+ .base_gfn = base_gfn,
+ .nr_pages = nr_pages,
+ .flags = KVM_MEMORY_MAPPING_FLAG_WRITE,
+ .source = source,
+ };
+ int ret;
+
+ do {
+ ret = __vcpu_ioctl(vcpu, KVM_MAP_MEMORY, &mapping);
+ } while (ret && errno == EAGAIN);
+
+ if (should_success) {
+ __TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_MAP_MEMORY", ret, vcpu->vm);
+ } else {
+ __TEST_ASSERT_VM_VCPU_IOCTL(ret && errno == EFAULT,
+ "KVM_MAP_MEMORY", ret, vcpu->vm);
+ }
+}
+
+static void __test_map_memory(unsigned long vm_type, bool private, bool use_source)
+{
+ const struct vm_shape shape = {
+ .mode = VM_MODE_DEFAULT,
+ .type = vm_type,
+ };
+ struct kvm_vcpu *vcpu;
+ struct kvm_run *run;
+ struct kvm_vm *vm;
+ struct ucall uc;
+ u64 source;
+
+ vm = vm_create_shape_with_one_vcpu(shape, &vcpu, guest_code);
+ vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+ TEST_GPA, TEST_SLOT, TEST_NPAGES,
+ private ? KVM_MEM_GUEST_MEMFD : 0);
+ virt_map(vm, TEST_GVA, TEST_GPA, TEST_NPAGES);
+
+ if (private)
+ vm_mem_set_private(vm, TEST_GPA, TEST_SIZE);
+
+ source = use_source ? TEST_GVA: 0;
+ map_memory(vcpu, TEST_GPA / PAGE_SIZE, SZ_2M / PAGE_SIZE, source, true);
+ source = use_source ? TEST_GVA + SZ_2M: 0;
+ map_memory(vcpu, (TEST_GPA + SZ_2M) / PAGE_SIZE, 1, source, true);
+
+ source = use_source ? TEST_GVA + TEST_SIZE : 0;
+ map_memory(vcpu, (TEST_GPA + TEST_SIZE) / PAGE_SIZE, 1, source, false);
+
+ vcpu_args_set(vcpu, 1, TEST_GVA);
+ vcpu_run(vcpu);
+
+ run = vcpu->run;
+ TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+ "Wanted KVM_EXIT_IO, got exit reason: %u (%s)",
+ run->exit_reason, exit_reason_str(run->exit_reason));
+
+ switch (get_ucall(vcpu, &uc)) {
+ case UCALL_ABORT:
+ REPORT_GUEST_ASSERT(uc);
+ break;
+ case UCALL_DONE:
+ break;
+ default:
+ TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+ break;
+ }
+
+ kvm_vm_free(vm);
+}
+
+static void test_map_memory(unsigned long vm_type, bool use_source)
+{
+ if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type))) {
+ pr_info("Skipping tests for vm_type 0x%lx\n", vm_type);
+ return;
+ }
+
+ __test_map_memory(vm_type, false, use_source);
+ __test_map_memory(vm_type, true, use_source);
+}
+
+int main(int argc, char *argv[])
+{
+ TEST_REQUIRE(kvm_check_cap(KVM_CAP_MAP_MEMORY));
+
+ __test_map_memory(KVM_X86_DEFAULT_VM, false, false);
+ test_map_memory(KVM_X86_SW_PROTECTED_VM, false);
+#ifdef KVM_X86_SEV_VM
+ test_map_memory(KVM_X86_SEV_VM, false);
+#endif
+#ifdef KVM_X86_SEV_ES_VM
+ test_map_memory(KVM_X86_SEV_ES_VM, false);
+#endif
+#ifdef KVM_X86_TDX_VM
+ test_map_memory(KVM_X86_TDX_VM, true);
+#endif
+ return 0;
+}
--
2.25.1


2024-03-01 17:44:03

by Isaku Yamahata

Subject: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

From: Isaku Yamahata <[email protected]>

Wire the KVM_MAP_MEMORY ioctl to kvm_mmu_map_page() to populate guest
memory.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b8cb69b04fa..6025c0e12d89 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4660,6 +4660,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
case KVM_CAP_IRQFD_RESAMPLE:
case KVM_CAP_MEMORY_FAULT_INFO:
+ case KVM_CAP_MAP_MEMORY:
r = 1;
break;
case KVM_CAP_EXIT_HYPERCALL:
@@ -5805,6 +5806,54 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
}
}

+int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
+{
+ return kvm_mmu_reload(vcpu);
+}
+
+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ u8 max_level, goal_level = PG_LEVEL_4K;
+ u32 error_code;
+ int r;
+
+ error_code = 0;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
+ error_code |= PFERR_WRITE_MASK;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_EXEC)
+ error_code |= PFERR_FETCH_MASK;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_USER)
+ error_code |= PFERR_USER_MASK;
+ if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) {
+#ifdef PFERR_PRIVATE_ACCESS
+ error_code |= PFERR_PRIVATE_ACCESS;
+#else
+ return -OPNOTSUPP;
+#endif
+ }
+
+ if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
+ mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
+ max_level = PG_LEVEL_1G;
+ else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
+ mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
+ max_level = PG_LEVEL_2M;
+ else
+ max_level = PG_LEVEL_4K;
+
+ r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
+ max_level, &goal_level);
+ if (r)
+ return r;
+
+ if (mapping->source)
+ mapping->source += KVM_HPAGE_SIZE(goal_level);
+ mapping->base_gfn += KVM_PAGES_PER_HPAGE(goal_level);
+ mapping->nr_pages -= KVM_PAGES_PER_HPAGE(goal_level);
+ return r;
+}
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
--
2.25.1


2024-03-01 17:52:50

by Isaku Yamahata

Subject: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

From: Isaku Yamahata <[email protected]>

Add a new ioctl, KVM_MAP_MEMORY, in the KVM common code. It iterates over
the memory range and calls an arch-specific function. Add stub functions
as weak symbols.

[1] https://lore.kernel.org/kvm/[email protected]/
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Isaku Yamahata <[email protected]>
---
include/linux/kvm_host.h | 4 +++
include/uapi/linux/kvm.h | 15 ++++++++
virt/kvm/kvm_main.c | 74 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 93 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9807ea98b568..afbed288d625 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2445,4 +2445,8 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
}
#endif /* CONFIG_KVM_PRIVATE_MEM */

+int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping);
+
#endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..f5d6b481244f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -917,6 +917,7 @@ struct kvm_enable_cap {
#define KVM_CAP_MEMORY_ATTRIBUTES 233
#define KVM_CAP_GUEST_MEMFD 234
#define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_MAP_MEMORY 236

struct kvm_irq_routing_irqchip {
__u32 irqchip;
@@ -1548,4 +1549,18 @@ struct kvm_create_guest_memfd {
__u64 reserved[6];
};

+#define KVM_MAP_MEMORY _IOWR(KVMIO, 0xd5, struct kvm_memory_mapping)
+
+#define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
+#define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
+#define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
+#define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
+
+struct kvm_memory_mapping {
+ __u64 base_gfn;
+ __u64 nr_pages;
+ __u64 flags;
+ __u64 source;
+};
+
#endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1fd9cb5d037..d77c9b79d76b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4419,6 +4419,69 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}

+__weak int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
+{
+ return -EOPNOTSUPP;
+}
+
+__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ return -EOPNOTSUPP;
+}
+
+static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ bool added = false;
+ int idx, r = 0;
+
+ if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
+ KVM_MEMORY_MAPPING_FLAG_EXEC |
+ KVM_MEMORY_MAPPING_FLAG_USER |
+ KVM_MEMORY_MAPPING_FLAG_PRIVATE))
+ return -EINVAL;
+ if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
+ !kvm_arch_has_private_mem(vcpu->kvm))
+ return -EINVAL;
+
+ /* Sanity check */
+ if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
+ !mapping->nr_pages ||
+ mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ r = kvm_arch_vcpu_pre_map_memory(vcpu);
+ if (r)
+ return r;
+
+ while (mapping->nr_pages) {
+ if (signal_pending(current)) {
+ r = -ERESTARTSYS;
+ break;
+ }
+
+ if (need_resched())
+ cond_resched();
+
+ r = kvm_arch_vcpu_map_memory(vcpu, mapping);
+ if (r)
+ break;
+
+ added = true;
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ if (added && mapping->nr_pages > 0)
+ r = -EAGAIN;
+
+ return r;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -4620,6 +4683,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
break;
}
+ case KVM_MAP_MEMORY: {
+ struct kvm_memory_mapping mapping;
+
+ r = -EFAULT;
+ if (copy_from_user(&mapping, argp, sizeof(mapping)))
+ break;
+ r = kvm_vcpu_map_memory(vcpu, &mapping);
+ if (copy_to_user(argp, &mapping, sizeof(mapping)))
+ r = -EFAULT;
+ break;
+ }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}
--
2.25.1


2024-03-01 17:53:00

by Isaku Yamahata

Subject: [RFC PATCH 3/8] KVM: x86/mmu: Introduce initializer macro for struct kvm_page_fault

From: Isaku Yamahata <[email protected]>

Another function will initialize struct kvm_page_fault. Add an initializer
macro to unify initialization of the big struct.

No functional change intended.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu/mmu_internal.h | 44 +++++++++++++++++++--------------
1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..72ef09fc9322 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -279,27 +279,35 @@ enum {
RET_PF_SPURIOUS,
};

+#define KVM_PAGE_FAULT_INIT(_vcpu, _cr2_or_gpa, _err, _prefetch, _max_level) { \
+ .addr = (_cr2_or_gpa), \
+ .error_code = (_err), \
+ .exec = (_err) & PFERR_FETCH_MASK, \
+ .write = (_err) & PFERR_WRITE_MASK, \
+ .present = (_err) & PFERR_PRESENT_MASK, \
+ .rsvd = (_err) & PFERR_RSVD_MASK, \
+ .user = (_err) & PFERR_USER_MASK, \
+ .prefetch = (_prefetch), \
+ .is_tdp = \
+ likely((_vcpu)->arch.mmu->page_fault == kvm_tdp_page_fault), \
+ .nx_huge_page_workaround_enabled = \
+ is_nx_huge_page_enabled((_vcpu)->kvm), \
+ \
+ .max_level = (_max_level), \
+ .req_level = PG_LEVEL_4K, \
+ .goal_level = PG_LEVEL_4K, \
+ .is_private = \
+ kvm_mem_is_private((_vcpu)->kvm, (_cr2_or_gpa) >> PAGE_SHIFT), \
+ \
+ .pfn = KVM_PFN_ERR_FAULT, \
+ .hva = KVM_HVA_ERR_BAD, }
+
static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
u32 err, bool prefetch, int *emulation_type)
{
- struct kvm_page_fault fault = {
- .addr = cr2_or_gpa,
- .error_code = err,
- .exec = err & PFERR_FETCH_MASK,
- .write = err & PFERR_WRITE_MASK,
- .present = err & PFERR_PRESENT_MASK,
- .rsvd = err & PFERR_RSVD_MASK,
- .user = err & PFERR_USER_MASK,
- .prefetch = prefetch,
- .is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
- .nx_huge_page_workaround_enabled =
- is_nx_huge_page_enabled(vcpu->kvm),
-
- .max_level = KVM_MAX_HUGEPAGE_LEVEL,
- .req_level = PG_LEVEL_4K,
- .goal_level = PG_LEVEL_4K,
- .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT),
- };
+ struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, cr2_or_gpa, err,
+ prefetch,
+ KVM_MAX_HUGEPAGE_LEVEL);
int r;

if (vcpu->arch.mmu->root_role.direct) {
--
2.25.1


2024-03-01 17:53:25

by Isaku Yamahata

Subject: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

From: Isaku Yamahata <[email protected]>

Introduce a helper function to call the KVM fault handler. This allows a
new ioctl to invoke the fault handler to populate guest memory without
exposing RET_PF_* enums or other KVM MMU internal definitions.

Signed-off-by: Isaku Yamahata <[email protected]>
---
arch/x86/kvm/mmu.h | 3 +++
arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 33 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 60f21bb4c27b..48870c5e08ec 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
}

+int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+ u8 max_level, u8 *goal_level);
+
/*
* Check if a given access (described through the I/D, W/R and U/S bits of a
* page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e4cc7f764980..7d5e80d17977 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
return direct_page_fault(vcpu, fault);
}

+int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
+ u8 max_level, u8 *goal_level)
+{
+ struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
+ false, max_level);
+ int r;
+
+ r = __kvm_mmu_do_page_fault(vcpu, &fault);
+
+ if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
+ return -EFAULT;
+
+ switch (r) {
+ case RET_PF_RETRY:
+ return -EAGAIN;
+
+ case RET_PF_FIXED:
+ case RET_PF_SPURIOUS:
+ *goal_level = fault.goal_level;
+ return 0;
+
+ case RET_PF_CONTINUE:
+ case RET_PF_EMULATE:
+ case RET_PF_INVALID:
+ default:
+ return -EIO;
+ }
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_page);
+
static void nonpaging_init_context(struct kvm_mmu *context)
{
context->page_fault = nonpaging_page_fault;
--
2.25.1


2024-03-07 00:31:14

by David Matlack

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Wire the KVM_MAP_MEMORY ioctl to kvm_mmu_map_page() to populate guest
> memory.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 49 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3b8cb69b04fa..6025c0e12d89 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4660,6 +4660,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_MEMORY_FAULT_INFO:
> + case KVM_CAP_MAP_MEMORY:
> r = 1;
> break;
> case KVM_CAP_EXIT_HYPERCALL:
> @@ -5805,6 +5806,54 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> }
> }
>
> +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> +{
> + return kvm_mmu_reload(vcpu);
> +}

Why is this here and not in kvm_arch_vcpu_map_memory()?

> +
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + u8 max_level, goal_level = PG_LEVEL_4K;
> + u32 error_code;
> + int r;
> +
> + error_code = 0;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
> + error_code |= PFERR_WRITE_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_EXEC)
> + error_code |= PFERR_FETCH_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_USER)
> + error_code |= PFERR_USER_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) {
> +#ifdef PFERR_PRIVATE_ACCESS
> + error_code |= PFERR_PRIVATE_ACCESS;
> +#else
> + return -OPNOTSUPP;

-EOPNOTSUPP

> +#endif
> + }
> +
> + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> + max_level = PG_LEVEL_1G;
> + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> + max_level = PG_LEVEL_2M;
> + else
> + max_level = PG_LEVEL_4K;

Is there a requirement that KVM must not map memory outside of the
requested region?

> +
> + r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), error_code,
> + max_level, &goal_level);
> + if (r)
> + return r;
> +
> + if (mapping->source)
> + mapping->source += KVM_HPAGE_SIZE(goal_level);
> + mapping->base_gfn += KVM_PAGES_PER_HPAGE(goal_level);
> + mapping->nr_pages -= KVM_PAGES_PER_HPAGE(goal_level);
> + return r;
> +}
> +
> long kvm_arch_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> --
> 2.25.1
>

2024-03-07 00:37:07

by David Matlack

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 6, 2024 at 4:31 PM David Matlack <[email protected]> wrote:
>
> On 2024-03-01 09:28 AM, [email protected] wrote:
> >
> > + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> > + max_level = PG_LEVEL_1G;
> > + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> > + max_level = PG_LEVEL_2M;
> > + else
> > + max_level = PG_LEVEL_4K;
>
> Is there a requirement that KVM must not map memory outside of the
> requested region?

And if so, what if the requested region is already mapped with a larger page?

2024-03-07 00:39:10

by David Matlack

Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On 2024-03-01 09:28 AM, [email protected] wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..7d5e80d17977 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> + u8 max_level, u8 *goal_level)
> +{
> + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> + false, max_level);
> + int r;
> +
> + r = __kvm_mmu_do_page_fault(vcpu, &fault);

If TDP is disabled, __kvm_mmu_do_page_fault() will interpret @gpa as a
GVA, no? And if the vCPU is in guest-mode, __kvm_mmu_do_page_fault() will
interpret @gpa as an nGPA, right?

2024-03-07 00:44:10

by David Matlack

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
> +
> + struct kvm_memory_mapping {
> + __u64 base_gfn;
> + __u64 nr_pages;
> + __u64 flags;
> + __u64 source;
> + };
> +
> + /* For kvm_memory_mapping:: flags */
> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
> +
> +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> +not zero and it's supported (depending on underlying technology), the guest
> +memory content is populated with the source.

What does "populated with the source" mean?

> The flags field supports three
> +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> +KVM_MEMORY_MAPPING_FLAG_USER.

There are 4 flags.

2024-03-07 00:59:12

by David Matlack

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..d77c9b79d76b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;
> +
> + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> + KVM_MEMORY_MAPPING_FLAG_EXEC |
> + KVM_MEMORY_MAPPING_FLAG_USER |
> + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> + return -EINVAL;
> + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> + !kvm_arch_has_private_mem(vcpu->kvm))
> + return -EINVAL;
> +
> + /* Sanity check */
> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||
> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = kvm_arch_vcpu_pre_map_memory(vcpu);
> + if (r)
> + return r;
> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())

nit: Is need_resched() superfluous when calling cond_resched()?

> + cond_resched();
> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;

This can overwrite the return value from kvm_arch_vcpu_map_memory().

2024-03-07 01:03:06

by David Matlack

Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On 2024-03-01 09:28 AM, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Implementation:
> - x86 KVM MMU
> In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> version.

Restricting to TDP MMU seems like a good idea. But I'm not quite sure
how to reliably do that from a vCPU context. Checking for TDP being
enabled is easy, but what if the vCPU is in guest-mode?

Perhaps we can just return an error out to userspace if the vCPU is in
guest-mode or TDP is disabled, and make it userspace's problem to do
memory mapping before loading any vCPU state.

2024-03-07 01:29:49

by Isaku Yamahata

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Wed, Mar 06, 2024 at 04:43:51PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> > +
> > + struct kvm_memory_mapping {
> > + __u64 base_gfn;
> > + __u64 nr_pages;
> > + __u64 flags;
> > + __u64 source;
> > + };
> > +
> > + /* For kvm_memory_mapping:: flags */
> > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
> > +
> > +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> > +not zero and it's supported (depending on underlying technology), the guest
> > +memory content is populated with the source.
>
> What does "populated with the source" mean?

source is a user pointer, and the memory contents of source are copied
into base_gfn (and will be encrypted).


> > The flags field supports three
> > +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> > +KVM_MEMORY_MAPPING_FLAG_USER.
>
> There are 4 flags.

Oops. Let me update it.


KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory. The backend may encrypt
it.

The `flags` field supports four flags: KVM_MEMORY_MAPPING_FLAG_WRITE,
KVM_MEMORY_MAPPING_FLAG_EXEC, KVM_MEMORY_MAPPING_FLAG_USER, and
KVM_MEMORY_MAPPING_FLAG_PRIVATE. The first three correspond to the fault code
for the KVM page fault used to populate guest memory: write fault, fetch fault,
and user fault. KVM_MEMORY_MAPPING_FLAG_PRIVATE is applicable only for guest
memory with guest_memfd, and populates guest memory that has the
KVM_MEMORY_ATTRIBUTE_PRIVATE memory attribute set.

When the ioctl returns, the input values are updated. If `nr_pages` is large,
it may return -EAGAIN and update the values (`base_gfn` and `nr_pages`, and
`source` if not zero) to point to the remaining range.

--
Isaku Yamahata <[email protected]>

2024-03-07 01:35:05

by Isaku Yamahata

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 06, 2024 at 04:30:58PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Wire the KVM_MAP_MEMORY ioctl to kvm_mmu_map_page() to populate guest
> > memory.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/x86.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 49 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 3b8cb69b04fa..6025c0e12d89 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -4660,6 +4660,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> > case KVM_CAP_VM_DISABLE_NX_HUGE_PAGES:
> > case KVM_CAP_IRQFD_RESAMPLE:
> > case KVM_CAP_MEMORY_FAULT_INFO:
> > + case KVM_CAP_MAP_MEMORY:
> > r = 1;
> > break;
> > case KVM_CAP_EXIT_HYPERCALL:
> > @@ -5805,6 +5806,54 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> > }
> > }
> >
> > +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> > +{
> > + return kvm_mmu_reload(vcpu);
> > +}
>
> Why is the here and not kvm_arch_vcpu_map_memory()?

We can push kvm_mmu_reload() down into kvm_arch_vcpu_map_memory(), inside
the gpa loop. Probably the inefficiency won't matter.

kvm_mmu_reload()
loop on gpa
    kvm_arch_vcpu_map_memory()

=>

loop on gpa
    kvm_arch_vcpu_map_memory()
        kvm_mmu_reload()
--
Isaku Yamahata <[email protected]>

2024-03-07 01:52:09

by Isaku Yamahata

Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 06, 2024 at 04:36:25PM -0800,
David Matlack <[email protected]> wrote:

> On Wed, Mar 6, 2024 at 4:31 PM David Matlack <[email protected]> wrote:
> >
> > On 2024-03-01 09:28 AM, [email protected] wrote:
> > >
> > > + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> > > + max_level = PG_LEVEL_1G;
> > > + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> > > + max_level = PG_LEVEL_2M;
> > > + else
> > > + max_level = PG_LEVEL_4K;
> >
> > Is there a requirement that KVM must not map memory outside of the
> > requested region?
>
> And if so, what if the requested region is already mapped with a larger page?

Yes. We'd like to map the exact gpa range for the SNP or TDX case. We don't
want to map zeros into the surrounding range. For SNP or TDX, mapping a page
to a GPA is a one-time operation, and it updates the measurement.

Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
and they are within the same 2M range. Map GPA1 first. If GPA2 is also mapped
with zeros by a 2M page, the following mapping of GPA2 fails. Even if the
mapping of GPA2 succeeds, the measurement may already have been updated when
mapping GPA1.
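
As a concrete illustration (addresses made up):

GPA1 = 0x1000000	/* 2M-aligned; KVM may install a 2M page covering
			   [0x1000000, 0x1200000) */
GPA2 = 0x1100000	/* falls inside that same 2M page */

Mapping GPA1 first can thus implicitly map GPA2 with zeros, so the later
KVM_MAP_MEMORY of GPA2 either fails or finds its contents and measurement
already fixed by the GPA1 operation.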

It's the userspace VMM's responsibility to map a GPA range at most once for
SNP or TDX. Is this too strict a requirement for the default VM use case of
mitigating KVM page faults at guest boot? If so, what about a flag like
EXACT_MAPPING or something?
--
Isaku Yamahata <[email protected]>

2024-03-07 02:10:09

by Isaku Yamahata

Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Mar 06, 2024 at 04:53:41PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Implementation:
> > - x86 KVM MMU
> > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > version.
>
> Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> how to reliably do that from a vCPU context. Checking for TDP being
> enabled is easy, but what if the vCPU is in guest-mode?

As you pointed out in the other mail, legacy KVM MMU support or guest-mode
will be troublesome. The use case I assumed is pre-population before the
guest runs, so guest-mode wouldn't matter. I didn't add an explicit check
for it, though.

Is there any use case while vcpus are running?


> Perhaps we can just return an error out to userspace if the vCPU is in
> guest-mode or TDP is disabled, and make it userspace's problem to do
> memory mapping before loading any vCPU state.

If the use case for the default VM or the SW_PROTECTED VM is to avoid
excessive KVM page faults at guest boot, erroring out on guest-mode or
disabled TDP wouldn't matter.
--
Isaku Yamahata <[email protected]>

2024-03-07 02:53:07

by Isaku Yamahata

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Wed, Mar 06, 2024 at 04:49:19PM -0800,
David Matlack <[email protected]> wrote:

> > + cond_resched();
> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
>
> This can overwrite the return value from kvm_arch_vcpu_map_memory().

This is to convert ERESTARTSYS into EAGAIN. I'll drop this check
and document EINTR so that the caller is aware of partial population.
--
Isaku Yamahata <[email protected]>

2024-03-07 12:30:24

by Huang, Kai

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Fri, 2024-03-01 at 09:28 -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Adds documentation of KVM_MAP_MEMORY ioctl.
>
> It pre-populates guest memory. And potentially do initialized memory
> contents with encryption and measurement depending on underlying
> technology.
>
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 0b5a33ee71ee..33d2b63f7dbf 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).
>
> See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> +4.143 KVM_MAP_MEMORY
> +------------------------
> +
> +:Capability: KVM_CAP_MAP_MEMORY
> +:Architectures: none
> +:Type: vcpu ioctl

I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.

What happens in that case?

> +:Parameters: struct kvm_memory_mapping(in/out)
> +:Returns: 0 on success, <0 on error
> +
> +KVM_MAP_MEMORY populates guest memory without running vcpu.
> +
> +::
> +
> + struct kvm_memory_mapping {
> + __u64 base_gfn;
> + __u64 nr_pages;
> + __u64 flags;
> + __u64 source;
> + };
> +
> + /* For kvm_memory_mapping:: flags */
> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)

I am not sure what's the good of having "FLAG_USER"?

This ioctl is called from userspace, thus I think we can just treat this always
as user-fault?

> + #define KVM_MEMORY_MAPPING_FLAG_PRIVATE _BITULL(3)
> +
> +KVM_MAP_MEMORY populates guest memory in the underlying mapping. If source is
> +not zero and it's supported (depending on underlying technology), the guest
> +memory content is populated with the source. The flags field supports three
> +flags: KVM_MEMORY_MAPPING_FLAG_WRITE, KVM_MEMORY_MAPPING_FLAG_EXEC, and
> +KVM_MEMORY_MAPPING_FLAG_USER. Which corresponds to fault code for kvm page
> +fault to populate guest memory. write fault, fetch fault and user fault.
> +When it returned, the input is updated. If nr_pages is large, it may
> +return -EAGAIN and update the values (base_gfn and nr_pages. source if not zero)
> +to point the remaining range.
> +
> 5. The kvm_run structure
> ========================
>

2024-03-07 12:45:34

by Huang, Kai

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory


>
> +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu);

No explanation of why this is needed, and why it only takes @vcpu as input w/o
having the @mapping.

> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping);
> +
>

[...]

> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;
> +
> + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> + KVM_MEMORY_MAPPING_FLAG_EXEC |
> + KVM_MEMORY_MAPPING_FLAG_USER |
> + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> + return -EINVAL;
> + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> + !kvm_arch_has_private_mem(vcpu->kvm))
> + return -EINVAL;
> +
> + /* Sanity check */
> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||
> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = kvm_arch_vcpu_pre_map_memory(vcpu);
> + if (r)
> + return r;

Returning w/o unloading the vcpu and releasing the SRCU.

> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;
> + break;
> + }
> +
> + if (need_resched())
> + cond_resched();

need_resched() is not needed.

And normally I think we just put it at the end of the loop.

> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;

Why do we need @added?

I assume kvm_arch_vcpu_map_memory() can internally update
mapping->nr_pages but still return -E<WHATEVER>. So when that happens in
the first call of kvm_arch_vcpu_map_memory(), @added won't get a chance to
turn true.

> +
> + return r;
> +}
> +
> static long kvm_vcpu_ioctl(struct file *filp,
> unsigned int ioctl, unsigned long arg)
> {
> @@ -4620,6 +4683,17 @@ static long kvm_vcpu_ioctl(struct file *filp,
> r = kvm_vcpu_ioctl_get_stats_fd(vcpu);
> break;
> }
> + case KVM_MAP_MEMORY: {
> + struct kvm_memory_mapping mapping;
> +
> + r = -EFAULT;
> + if (copy_from_user(&mapping, argp, sizeof(mapping)))
> + break;
> + r = kvm_vcpu_map_memory(vcpu, &mapping);
> + if (copy_to_user(argp, &mapping, sizeof(mapping)))
> + r = -EFAULT;
> + break;
> + }
> default:
> r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
> }

2024-03-07 20:42:18

by Isaku Yamahata

Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Thu, Mar 07, 2024 at 12:45:16PM +0000,
"Huang, Kai" <[email protected]> wrote:

>
> >
> > +int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu);
>
> No explanation of why this is needed, and why it only takes @vcpu as input w/o
> having the @mapping.
>
> > +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping);
> > +
> >
>
> [...]
>
> > +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + bool added = false;
> > + int idx, r = 0;
> > +
> > + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> > + KVM_MEMORY_MAPPING_FLAG_EXEC |
> > + KVM_MEMORY_MAPPING_FLAG_USER |
> > + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> > + return -EINVAL;
> > + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> > + !kvm_arch_has_private_mem(vcpu->kvm))
> > + return -EINVAL;
> > +
> > + /* Sanity check */
> > + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> > + !mapping->nr_pages ||
> > + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> > + return -EINVAL;
> > +
> > + vcpu_load(vcpu);
> > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > + r = kvm_arch_vcpu_pre_map_memory(vcpu);
> > + if (r)
> > + return r;
>
> Returning w/o unloading the vcpu and releasing the SRCU.

Oops, will fix.


> > +
> > + while (mapping->nr_pages) {
> > + if (signal_pending(current)) {
> > + r = -ERESTARTSYS;
> > + break;
> > + }
> > +
> > + if (need_resched())
> > + cond_resched();
>
> need_resched() is not needed.
>
> And normally I think we just put it at the end of the loop.

Ok, will move it.


> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
>
> Why do we need @added?
>
> I assume the kvm_arch_vcpu_map_memory() can internally update the mapping-
> >nr_pages but still return -E<WHATEVER>. So when that happens in the first call
> of kvm_arch_vcpu_map_memory(), @added won't get chance to turn to true.

I intended to tell the caller whether the range was partially processed.
Anyway, this seems moot. Let's drop this if clause. Then it's the caller's
responsibility to check for errors and partial population, and to
optionally loop over the remaining region.
--
Isaku Yamahata <[email protected]>

2024-03-07 20:54:13

by Isaku Yamahata

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 12:30:04PM +0000,
"Huang, Kai" <[email protected]> wrote:

> On Fri, 2024-03-01 at 09:28 -0800, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Adds documentation of KVM_MAP_MEMORY ioctl.
> >
> > It pre-populates guest memory. And potentially do initialized memory
> > contents with encryption and measurement depending on underlying
> > technology.
> >
> > Suggested-by: Sean Christopherson <[email protected]>
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > Documentation/virt/kvm/api.rst | 36 ++++++++++++++++++++++++++++++++++
> > 1 file changed, 36 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 0b5a33ee71ee..33d2b63f7dbf 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -6352,6 +6352,42 @@ a single guest_memfd file, but the bound ranges must not overlap).
> >
> > See KVM_SET_USER_MEMORY_REGION2 for additional details.
> >
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
>
> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.
>
> What happens in that case?

Each vcpu can handle the ioctl simultaneously. If we assume the TDP MMU,
each vcpu calls the KVM fault handler simultaneously while holding the
mmu_lock for read. If gfn ranges overlap, a vcpu will get 0 (success) or
EAGAIN.


> > +:Parameters: struct kvm_memory_mapping(in/out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > +
> > +::
> > +
> > + struct kvm_memory_mapping {
> > + __u64 base_gfn;
> > + __u64 nr_pages;
> > + __u64 flags;
> > + __u64 source;
> > + };
> > +
> > + /* For kvm_memory_mapping:: flags */
> > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
>
> I am not sure what's the good of having "FLAG_USER"?
>
> This ioctl is called from userspace, thus I think we can just treat this always
> as user-fault?

The point is how to emulate a KVM page fault as if the vcpu caused it, not
that we call the ioctl from user context.
--
Isaku Yamahata <[email protected]>

2024-03-08 00:22:34

by Huang, Kai

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl


>>>
>>> +4.143 KVM_MAP_MEMORY
>>> +------------------------
>>> +
>>> +:Capability: KVM_CAP_MAP_MEMORY
>>> +:Architectures: none
>>> +:Type: vcpu ioctl
>>
>> I think "vcpu ioctl" means theoretically it can be called on multiple vcpus.
>>
>> What happens in that case?
>
> Each vcpu can handle the ioctl simultaneously.

Not sure whether it is implied, but should we document that it can be
called simultaneously?

Also, I believe this is only supposed to be called before the VM starts to
run? I think we should document that too.

This is userspace ABI, we need to be explicit on how it is supposed to
be called from userspace.

Btw, I believe there should be some justification in the changelog for why
this should be a vcpu ioctl().

[...]

>>> +:Parameters: struct kvm_memory_mapping(in/out)
>>> +:Returns: 0 on success, <0 on error
>>> +
>>> +KVM_MAP_MEMORY populates guest memory without running vcpu.
>>> +
>>> +::
>>> +
>>> + struct kvm_memory_mapping {
>>> + __u64 base_gfn;
>>> + __u64 nr_pages;
>>> + __u64 flags;
>>> + __u64 source;
>>> + };
>>> +
>>> + /* For kvm_memory_mapping:: flags */
>>> + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
>>> + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
>>> + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
>>
>> I am not sure what's the good of having "FLAG_USER"?
>>
>> This ioctl is called from userspace, thus I think we can just treat this always
>> as user-fault?
>
> The point is how to emulate a KVM page fault as if the vcpu caused it, not
> that we call the ioctl from user context.

Sorry, I don't quite follow. What's wrong if KVM just appends the #PF
USER error bit before it calls into the fault handler?

My question is, since this is ABI, you have to specify how userspace is
supposed to use this. Maybe I am missing something, but I don't see how
USER should be used here.

2024-03-08 00:57:20

by David Matlack

Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > +:Returns: 0 on success, <0 on error
> > > > +
> > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > +
> > > > +::
> > > > +
> > > > + struct kvm_memory_mapping {
> > > > + __u64 base_gfn;
> > > > + __u64 nr_pages;
> > > > + __u64 flags;
> > > > + __u64 source;
> > > > + };
> > > > +
> > > > + /* For kvm_memory_mapping:: flags */
> > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > >
> > > I am not sure what's the good of having "FLAG_USER"?
> > >
> > > This ioctl is called from userspace, thus I think we can just treat this always
> > > as user-fault?
> >
> > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > treat the ioctl as an access from user context.
>
> Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> error bit before it calls into the fault handler?
>
> My question is, since this is ABI, you have to tell how userspace is
> supposed to use this. Maybe I am missing something, but I don't see how
> USER should be used here.

If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
is meaningless, PFERR_USER_MASK is only relevant for shadow paging.

KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
populated with writes (which avoids just faulting in the zero-page for
anon or tmpfs backed memslots), while also allowing populating read-only
memslots.

I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.

2024-03-08 01:28:42

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024, David Matlack wrote:
> On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > +:Returns: 0 on success, <0 on error
> > > > > +
> > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > +
> > > > > +::
> > > > > +
> > > > > + struct kvm_memory_mapping {
> > > > > + __u64 base_gfn;
> > > > > + __u64 nr_pages;
> > > > > + __u64 flags;
> > > > > + __u64 source;
> > > > > + };
> > > > > +
> > > > > + /* For kvm_memory_mapping:: flags */
> > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > >
> > > > I am not sure what's the good of having "FLAG_USER"?
> > > >
> > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > as user-fault?
> > >
> > > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > > treat the ioctl as an access from user context.
> >
> > Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> > error bit before it calls into the fault handler?
> >
> > My question is, since this is ABI, you have to tell how userspace is
> > supposed to use this. Maybe I am missing something, but I don't see how
> > USER should be used here.
>
> If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> is meaningless, PFERR_USER_MASK is only relevant for shadow paging.

+1

> KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> populated with writes (which avoids just faulting in the zero-page for
> anon or tmpfs backed memslots), while also allowing populating read-only
> memslots.
>
> I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.

It would be mildly interesting for something like the NX hugepage mitigation.

For the initial implementation, I don't think the ioctl() should specify
protections, period.

VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
And for guest_memfd, finer grained control in general, and long term compatibility
with other features that are in-flight or proposed, I would rather userspace specify
RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
pain too.

KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
and add W if all underlying protections allow it.

That way we can defer dealing with things like XO and RW *if* KVM ever does gain
support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
for XO memory.

And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
regions.

We can always expand the uAPI, but taking away functionality is much harder, if
not impossible.
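
To make "behave like a read fault" concrete, here is a minimal sketch in
terms of the series' own kvm_mmu_map_page() helper (illustrative only; the
exact plumbing is an assumption, not part of the posted patches):

    /*
     * Sketch: populate one mapping as a read access.  With neither
     * PFERR_WRITE_MASK nor PFERR_FETCH_MASK set in the error code, the
     * fault handler installs an RX mapping and adds W only where the
     * memslot and backing store allow it.
     */
    u8 level;
    int r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn),
                             0 /* error_code: plain read */,
                             KVM_MAX_HUGEPAGE_LEVEL, &level);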

2024-03-08 02:20:05

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 05:28:20PM -0800,
Sean Christopherson <[email protected]> wrote:

> On Thu, Mar 07, 2024, David Matlack wrote:
> > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > +:Returns: 0 on success, <0 on error
> > > > > > +
> > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > +
> > > > > > +::
> > > > > > +
> > > > > > + struct kvm_memory_mapping {
> > > > > > + __u64 base_gfn;
> > > > > > + __u64 nr_pages;
> > > > > > + __u64 flags;
> > > > > > + __u64 source;
> > > > > > + };
> > > > > > +
> > > > > > + /* For kvm_memory_mapping:: flags */
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > > >
> > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > >
> > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > as user-fault?
> > > >
> > > > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > > > treat the ioctl as an access from user context.
> > >
> > > Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> > > error bit before it calls into the fault handler?
> > >
> > > My question is, since this is ABI, you have to tell how userspace is
> > > supposed to use this. Maybe I am missing something, but I don't see how
> > > USER should be used here.
> >
> > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
>
> +1
>
> > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > populated with writes (which avoids just faulting in the zero-page for
> > anon or tmpfs backed memslots), while also allowing populating read-only
> > memslots.
> >
> > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
>
> It would be mildly interesting for something like the NX hugepage mitigation.
>
> For the initial implementation, I don't think the ioctl() should specify
> protections, period.
>
> VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> And for guest_memfd, finer grained control in general, and long term compatibility
> with other features that are in-flight or proposed, I would rather userspace specify
> RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
> pain too.
>
> KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
> and add W if all underlying protections allow it.
>
> That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> for XO memory.
>
> And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> regions.
>
> We can always expand the uAPI, but taking away functionality is much harder, if
> not impossible.

Ok, let me drop all the flags. Here is the updated one.

4.143 KVM_MAP_MEMORY
------------------------

:Capability: KVM_CAP_MAP_MEMORY
:Architectures: none
:Type: vcpu ioctl
:Parameters: struct kvm_memory_mapping(in/out)
:Returns: 0 on success, < 0 on error

Errors:

====== =============================================================
EINVAL vcpu state is not in TDP MMU mode or is in guest mode.
Currently, this ioctl is restricted to TDP MMU.
EAGAIN The region is only processed partially. The caller should
issue the ioctl with the updated parameters.
EINTR An unmasked signal is pending. The region may be processed
partially. If `nr_pages` > 0, the caller should issue the
ioctl with the updated parameters.
====== =============================================================

KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple
vcpus can call this ioctl simultaneously. It may return EAGAIN due to race
conditions.

::

struct kvm_memory_mapping {
__u64 base_gfn;
__u64 nr_pages;
__u64 flags;
__u64 source;
};

KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
`nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
`source` is not NULL and the underlying technology supports it, the memory
contents of `source` are copied into the guest memory. The backend may encrypt
it. `flags` must be zero. It's reserved for future use.

When the ioctl returns, the input values are updated. If `nr_pages` is large,
the ioctl may return EAGAIN (or EINTR for a pending signal) and update the
values (`base_gfn`, `nr_pages`, and `source` if not zero) to point to the
remaining range.
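
A caller-side sketch of the retry contract described above (illustrative;
it assumes the proposed struct kvm_memory_mapping and KVM_MAP_MEMORY
definitions are visible via <linux/kvm.h>):

    #include <errno.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Map [gfn, gfn + npages) before the first KVM_RUN. */
    static int prepopulate(int vcpu_fd, __u64 gfn, __u64 npages)
    {
            struct kvm_memory_mapping m = {
                    .base_gfn = gfn,
                    .nr_pages = npages,
            };

            while (m.nr_pages) {
                    if (ioctl(vcpu_fd, KVM_MAP_MEMORY, &m) == 0)
                            continue;       /* nr_pages is now 0 */
                    /*
                     * On EAGAIN/EINTR the kernel has advanced base_gfn,
                     * nr_pages (and source) to the remaining range, so
                     * the struct can simply be re-submitted.
                     */
                    if (errno != EAGAIN && errno != EINTR)
                            return -1;      /* fatal error */
            }
            return 0;
    }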

--
Isaku Yamahata <[email protected]>

2024-03-10 23:13:03

by Michael Roth

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Thu, Mar 07, 2024 at 06:19:41PM -0800, Isaku Yamahata wrote:
> On Thu, Mar 07, 2024 at 05:28:20PM -0800,
> Sean Christopherson <[email protected]> wrote:
>
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > > +:Returns: 0 on success, <0 on error
> > > > > > > +
> > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > > +
> > > > > > > +::
> > > > > > > +
> > > > > > > + struct kvm_memory_mapping {
> > > > > > > + __u64 base_gfn;
> > > > > > > + __u64 nr_pages;
> > > > > > > + __u64 flags;
> > > > > > > + __u64 source;
> > > > > > > + };
> > > > > > > +
> > > > > > > + /* For kvm_memory_mapping:: flags */
> > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_WRITE _BITULL(0)
> > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_EXEC _BITULL(1)
> > > > > > > + #define KVM_MEMORY_MAPPING_FLAG_USER _BITULL(2)
> > > > > >
> > > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > >
> > > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > > as user-fault?
> > > > >
> > > > > The point is to emulate a KVM page fault as if the vcpu had caused it, not to
> > > > > treat the ioctl as an access from user context.
> > > >
> > > > Sorry I don't quite follow. What's wrong if KVM just appends the #PF USER
> > > > error bit before it calls into the fault handler?
> > > >
> > > > My question is, since this is ABI, you have to tell how userspace is
> > > > supposed to use this. Maybe I am missing something, but I don't see how
> > > > USER should be used here.
> > >
> > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> >
> > +1
> >
> > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > > populated with writes (which avoids just faulting in the zero-page for
> > > anon or tmpfs backed memslots), while also allowing populating read-only
> > > memslots.
> > >
> > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> >
> > It would be mildly interesting for something like the NX hugepage mitigation.
> >
> > For the initial implementation, I don't think the ioctl() should specify
> > protections, period.
> >
> > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> > And for guest_memfd, finer grained control in general, and long term compatibility
> > with other features that are in-flight or proposed, I would rather userspace specify
> > RWX protections via KVM_SET_MEMORY_ATTRIBUTES. Oh, and dirty logging would be a
> > pain too.
> >
> > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> > we can simply define KVM_MAP_MEMORY to behave like a read fault. E.g. map RX,
> > and add W if all underlying protections allow it.
> >
> > That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> > for XO memory.
> >
> > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> > regions.
> >
> > We can always expand the uAPI, but taking away functionality is much harder, if
> > not impossible.
>
> Ok, let me drop all the flags. Here is the updated one.
>
> 4.143 KVM_MAP_MEMORY
> ------------------------
>
> :Capability: KVM_CAP_MAP_MEMORY
> :Architectures: none
> :Type: vcpu ioctl
> :Parameters: struct kvm_memory_mapping(in/out)
> :Returns: 0 on success, < 0 on error
>
> Errors:
>
> ====== =============================================================
> EINVAL vcpu state is not in TDP MMU mode or is in guest mode.
> Currently, this ioctl is restricted to TDP MMU.
> EAGAIN The region is only processed partially. The caller should
> issue the ioctl with the updated parameters.
> EINTR An unmasked signal is pending. The region may be processed
> partially. If `nr_pages` > 0, the caller should issue the
> ioctl with the updated parameters.
> ====== =============================================================
>
> KVM_MAP_MEMORY populates guest memory before the VM starts to run. Multiple
> vcpus can call this ioctl simultaneously. It may return EAGAIN due to race
> conditions.
>
> ::
>
> struct kvm_memory_mapping {
> __u64 base_gfn;
> __u64 nr_pages;
> __u64 flags;
> __u64 source;
> };
>
> KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
> `nr_pages`) in the underlying mapping. `source` is an optional user pointer. If
> `source` is not NULL and the underlying technology supports it, the memory
> contents of `source` are copied into the guest memory. The backend may encrypt
> it. `flags` must be zero. It's reserved for future use.
>
> When the ioctl returns, the input values are updated. If `nr_pages` is large,
> the ioctl may return EAGAIN (or EINTR for a pending signal) and update the
> values (`base_gfn`, `nr_pages`, and `source` if not zero) to point to the
> remaining range.

If this intended to replace SNP_LAUNCH_UPDATE, then to be useable for SNP
guests userspace also needs to pass along the type of page being added,
which are currently defined as:

#define KVM_SEV_SNP_PAGE_TYPE_NORMAL 0x1
#define KVM_SEV_SNP_PAGE_TYPE_ZERO 0x3
#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED 0x4
#define KVM_SEV_SNP_PAGE_TYPE_SECRETS 0x5
#define KVM_SEV_SNP_PAGE_TYPE_CPUID 0x6

So I guess the main question is, do we bite the bullet now and introduce
infrastructure for vendor-specific parameters, or should we attempt to define
these as cross-vendor/cross-architecture types and hide the vendor-specific
stuff from userspace?
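
One possible strawman for carrying such vendor-specific parameters (none
of these fields exist in the posted series; purely illustrative):

    struct kvm_memory_mapping {
            __u64 base_gfn;
            __u64 nr_pages;
            __u64 flags;
            __u64 source;
            union {                         /* selected by VM type */
                    struct {
                            __u8 page_type; /* KVM_SEV_SNP_PAGE_TYPE_* */
                            __u8 pad[7];
                    } snp;
                    __u64 reserved[2];
            };
    };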

There are a couple other bits of vendor-specific information that would be
needed for a total drop-in replacement for SNP_LAUNCH_UPDATE, but I think
we could do without these:

sev_fd: handle for /dev/sev which is used to issue SEV firmware calls
as-needed for various KVM ioctls
- can likely bind this to SNP context during SNP_LAUNCH_UPDATE and
avoid needing to pass it in for subsequent calls
error code: return parameter which passes SEV firmware error codes to
userspace for informational purposes
- can probably live without this

-Mike

>
> --
> Isaku Yamahata <[email protected]>

2024-03-11 01:06:19

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl


>
> struct kvm_memory_mapping {
> __u64 base_gfn;
> __u64 nr_pages;
> __u64 flags;
> __u64 source;
> };

From your next patch the @source must be 4K aligned. If it is intended
please document this too.

[...]

> The backend may encrypt
> it.

To me the "backend" is a little bit confusing. It doesn't mean the
@source, correct? Maybe just the "underlying physical pages may be
encrypted" etc. But it may be only my issue.

2024-03-11 01:08:27

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

> Maybe just the "underlying physical pages may be
> encrypted" etc.

Sorry, perhaps "encrypted" -> "crypto-protected"?

2024-03-11 03:21:59

by Michael Roth

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Fri, Mar 01, 2024 at 09:28:42AM -0800, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> The objective of this RFC patch series is to develop a uAPI aimed at
> (pre)populating guest memory for various use cases and underlying VM
> technologies.
>
> - Pre-populate guest memory to mitigate excessive KVM page faults during guest
> boot [1], a need not limited to any specific technology.
>
> - Pre-populating guest memory (including encryption and measurement) for
> confidential guests [2]. SEV-SNP, TDX, and SW-PROTECTED VM. Potentially
> other technologies and pKVM.
>
> The patches are organized as follows.
> - 1: documentation on uAPI KVM_MAP_MEMORY.
> - 2: architecture-independent implementation part.
> - 3-4: refactoring of x86 KVM MMU as preparation.
> - 5: x86 Helper function to map guest page.
> - 6: x86 KVM arch implementation.
> - 7: Add x86-ops necessary for TDX and SEV-SNP.
> - 8: selftest for validation.
>
> Discussion point:
>
> uAPI design:
> - access flags
> Access flags are needed for the guest memory population. We have options for
> their exposure to uAPI.
> - option 1. Introduce access flags, possibly with the addition of a private
> access flag.
> - option 2. Omit access flags from UAPI.
> Allow the kernel to deduce the necessary flag based on the memory slot and
> its memory attributes.
>
> - SEV-SNP and byte vs. page size
> The SEV correspondence is SEV_LAUNCH_UPDATE_DATA, which dictates memory
> regions to be 16-byte aligned, not page sized. Should we define struct
> kvm_memory_mapping in bytes rather than page size?

For SNP it's multiples of page size only, and I'm not sure it's worth
trying to work in legacy SEV support, since that pulls in other
requirements, like how the backing memory also needs to be pre-pinned via
a separate KVM ioctl, which would need to be documented as an SEV-specific
requirement for this interface... it just doesn't seem worth it. And SEV
would still benefit from the more basic functionality of pre-mapping
pages, just like any other guest.

That said, it would be a shame if we needed to clutter up the API as
soon as some user came along that required non-page-sized values. That
seems unlikely for the pre-mapping use case, but for CoCo maybe it's
worth checking whether pKVM would have any requirements like that? Or just
going with byte-sized parameters to play it safe?

>
> struct kvm_sev_launch_update_data {
> __u64 uaddr;
> __u32 len;
> };
>
> - TDX and measurement
> The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> extends its measurement by the page contents.
> Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> TDH.MEM.EXTEND
> Option 2. Don't handle extend. Let TDX vendor specific API
> KVM_MEMORY_ENCRYPT_OP to handle it with the subcommand like
> KVM_TDX_EXTEND_MEMORY.

For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
additional measurements via SNP_LAUNCH_FINISH, and down the road when live
migration support is added that flow will be a bit different. So
personally I think it's better to leave it separate for now. Maybe down the
road some common measurement/finalize interface can deprecate some of
the other MEMORY_ENCRYPT ioctls.

Even with this more narrowly-defined purpose it's really unfortunate we
have to bake any CoCo stuff into this interface at all. It would be great
if all it did was simply pre-map entries into nested page tables to boost
boot performance, and we had some separate CoCo-specific API to handle
initial loading/encryption of guest data. I understand with Secure EPT
considerations why we sort of need it here for TDX, but it already ends
up being awkward for the SNP_LAUNCH_UPDATE use-case because there's not
really any good reason for that handling to live inside KVM MMU hooks
like with TDX. So we'll probably end up putting it in a pre/post hook
where all the handling is completely separate from the TDX flow, and in
the process complicating the userspace API with the additional parameters
needed for that, even though other guest types are likely never to need
them.

(Alternatively we can handle SNP_LAUNCH_UPDATE via KVM MMU hooks like with
tdx_mem_page_add(), but then we'll need to plumb additional parameters
down into the KVM MMU code for stuff like the SNP page type. But maybe
it's worth it to have some level of commonality for x86 at least?)

But I'd be hesitant to bake more requirements into this pre-mapping
interface, it feels like we're already overloading it as is.

>
> - TDX and struct kvm_memory_mapping:source
> While the current patch series doesn't utilize
> kvm_memory_mapping::source member. TDX needs it to specify the source of
> memory contents.

SNP will need it too FWIW.

-Mike

>
> Implementation:
> - x86 KVM MMU
> In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> version.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/6a4c029af70d41b63bcee3d6a1f0c2377f6eb4bd.1690322424.git.isaku.yamahata@intel.com
>
> Thanks,
>
> Isaku Yamahata (8):
> KVM: Document KVM_MAP_MEMORY ioctl
> KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory
> KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault
> KVM: x86/mmu: Factor out kvm_mmu_do_page_fault()
> KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest
> memory
> KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()
> KVM: x86: Add hooks in kvm_arch_vcpu_map_memory()
> KVM: selftests: x86: Add test for KVM_MAP_MEMORY
>
> Documentation/virt/kvm/api.rst | 36 +++++
> arch/x86/include/asm/kvm-x86-ops.h | 2 +
> arch/x86/include/asm/kvm_host.h | 6 +
> arch/x86/kvm/mmu.h | 3 +
> arch/x86/kvm/mmu/mmu.c | 30 ++++
> arch/x86/kvm/mmu/mmu_internal.h | 70 +++++----
> arch/x86/kvm/x86.c | 83 +++++++++++
> include/linux/kvm_host.h | 4 +
> include/uapi/linux/kvm.h | 15 ++
> tools/include/uapi/linux/kvm.h | 14 ++
> tools/testing/selftests/kvm/Makefile | 1 +
> .../selftests/kvm/x86_64/map_memory_test.c | 136 ++++++++++++++++++
> virt/kvm/kvm_main.c | 74 ++++++++++
> 13 files changed, 448 insertions(+), 26 deletions(-)
> create mode 100644 tools/testing/selftests/kvm/x86_64/map_memory_test.c
>
>
> base-commit: 6a108bdc49138bcaa4f995ed87681ab9c65122ad
> --
> 2.25.1
>
>

2024-03-11 17:23:39

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Fri, Mar 01, 2024, [email protected] wrote:
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d1fd9cb5d037..d77c9b79d76b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4419,6 +4419,69 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> return fd;
> }
>
> +__weak int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + bool added = false;
> + int idx, r = 0;

Pointless initialization of 'r'.

> +
> + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> + KVM_MEMORY_MAPPING_FLAG_EXEC |
> + KVM_MEMORY_MAPPING_FLAG_USER |
> + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> + return -EINVAL;
> + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> + !kvm_arch_has_private_mem(vcpu->kvm))
> + return -EINVAL;
> +
> + /* Sanity check */

Pointless comment.

> + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> + !mapping->nr_pages ||

> + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> + return -EINVAL;
> +
> + vcpu_load(vcpu);
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = kvm_arch_vcpu_pre_map_memory(vcpu);

This hook is unnecessary; x86's kvm_mmu_reload() is optimized for the happy path
where the MMU is already loaded. Just make the call from kvm_arch_vcpu_map_memory().

> + if (r)
> + return r;

Which is a good thing, because this leaks the SRCU lock.

> +
> + while (mapping->nr_pages) {
> + if (signal_pending(current)) {
> + r = -ERESTARTSYS;

Why -ERESTARTSYS instead of -EINTR? The latter is KVM's typical response to a
pending signal.

> + break;
> + }
> +
> + if (need_resched())

No need to manually check need_resched(), the below is a _conditional_ resched.
The reason KVM explicitly checks need_resched() in MMU flows is because KVM needs
to drop mmu_lock before rescheduling, i.e. calling cond_resched() directly would
try to schedule() while holding a spinlock.

> + cond_resched();
> +
> + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> + if (r)
> + break;
> +
> + added = true;
> + }
> +
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + vcpu_put(vcpu);
> +
> + if (added && mapping->nr_pages > 0)
> + r = -EAGAIN;

No, this clobbers 'r', which might hold a fatal error code. I don't see any
reason for common code to ever force -EAGAIN, it can't possibly know if trying
again is reasonable.

> +
> + return r;
> +}

2024-03-11 17:25:23

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 3/8] KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault

On Fri, Mar 01, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Another function will initialize struct kvm_page_fault. Add initializer
> macro to unify the big struct initialization.
>
> No functional change intended.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 44 +++++++++++++++++++--------------
> 1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0669a8a668ca..72ef09fc9322 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -279,27 +279,35 @@ enum {
> RET_PF_SPURIOUS,
> };
>
> +#define KVM_PAGE_FAULT_INIT(_vcpu, _cr2_or_gpa, _err, _prefetch, _max_level) { \
> + .addr = (_cr2_or_gpa), \
> + .error_code = (_err), \
> + .exec = (_err) & PFERR_FETCH_MASK, \
> + .write = (_err) & PFERR_WRITE_MASK, \
> + .present = (_err) & PFERR_PRESENT_MASK, \
> + .rsvd = (_err) & PFERR_RSVD_MASK, \
> + .user = (_err) & PFERR_USER_MASK, \
> + .prefetch = (_prefetch), \
> + .is_tdp = \
> + likely((_vcpu)->arch.mmu->page_fault == kvm_tdp_page_fault), \
> + .nx_huge_page_workaround_enabled = \
> + is_nx_huge_page_enabled((_vcpu)->kvm), \
> + \
> + .max_level = (_max_level), \
> + .req_level = PG_LEVEL_4K, \
> + .goal_level = PG_LEVEL_4K, \
> + .is_private = \
> + kvm_mem_is_private((_vcpu)->kvm, (_cr2_or_gpa) >> PAGE_SHIFT), \
> + \
> + .pfn = KVM_PFN_ERR_FAULT, \
> + .hva = KVM_HVA_ERR_BAD, }
> +

Oof, no. I would much rather refactor kvm_mmu_do_page_fault() as needed than
have to maintain a macro like this.

2024-03-11 17:29:11

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On Fri, Mar 01, 2024, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Introduce a helper function to call kvm fault handler. This allows
> a new ioctl to invoke kvm fault handler to populate without seeing
> RET_PF_* enums or other KVM MMU internal definitions.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> arch/x86/kvm/mmu.h | 3 +++
> arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 60f21bb4c27b..48870c5e08ec 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> }
>
> +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> + u8 max_level, u8 *goal_level);
> +
> /*
> * Check if a given access (described through the I/D, W/R and U/S bits of a
> * page fault error code pfec) causes a permission fault with the given PTE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e4cc7f764980..7d5e80d17977 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> + u8 max_level, u8 *goal_level)
> +{
> + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> + false, max_level);
> + int r;
> +
> + r = __kvm_mmu_do_page_fault(vcpu, &fault);
> +
> + if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
> + return -EFAULT;

This clobbers a non-zero 'r'. And KVM returns -EIO if the VM is bugged/dead, not
-EFAULT. I also don't see why KVM needs to explicitly check is_error_noslot_pfn(),
that should be funneled to RET_PF_EMULATE.

> +
> + switch (r) {
> + case RET_PF_RETRY:
> + return -EAGAIN;
> +
> + case RET_PF_FIXED:
> + case RET_PF_SPURIOUS:
> + *goal_level = fault.goal_level;
> + return 0;
> +
> + case RET_PF_CONTINUE:
> + case RET_PF_EMULATE:

-EINVAL would be more appropriate for RET_PF_EMULATE.

> + case RET_PF_INVALID:

CONTINUE and INVALID should be WARN conditions.

> + default:
> + return -EIO;
> + }
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_map_page);

Unnecessary export.

> +
> static void nonpaging_init_context(struct kvm_mmu *context)
> {
> context->page_fault = nonpaging_page_fault;
> --
> 2.25.1
>

2024-03-11 22:26:53

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 2/8] KVM: Add KVM_MAP_MEMORY vcpu ioctl to pre-populate guest memory

On Mon, Mar 11, 2024 at 10:23:28AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Mar 01, 2024, [email protected] wrote:
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index d1fd9cb5d037..d77c9b79d76b 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4419,6 +4419,69 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
> > return fd;
> > }
> >
> > +__weak int kvm_arch_vcpu_pre_map_memory(struct kvm_vcpu *vcpu)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
> > + struct kvm_memory_mapping *mapping)
> > +{
> > + bool added = false;
> > + int idx, r = 0;
>
> Pointless initialization of 'r'.
>
> > +
> > + if (mapping->flags & ~(KVM_MEMORY_MAPPING_FLAG_WRITE |
> > + KVM_MEMORY_MAPPING_FLAG_EXEC |
> > + KVM_MEMORY_MAPPING_FLAG_USER |
> > + KVM_MEMORY_MAPPING_FLAG_PRIVATE))
> > + return -EINVAL;
> > + if ((mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) &&
> > + !kvm_arch_has_private_mem(vcpu->kvm))
> > + return -EINVAL;
> > +
> > + /* Sanity check */
>
> Pointless comment.
>
> > + if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
> > + !mapping->nr_pages ||
>
> > + mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
> > + return -EINVAL;
> > +
> > + vcpu_load(vcpu);
> > + idx = srcu_read_lock(&vcpu->kvm->srcu);
> > + r = kvm_arch_vcpu_pre_map_memory(vcpu);
>
> This hook is unnecessary; x86's kvm_mmu_reload() is optimized for the happy path
> where the MMU is already loaded. Just make the call from kvm_arch_vcpu_map_memory().
>
> > + if (r)
> > + return r;
>
> Which is a good thing, because this leaks the SRCU lock.
>
> > +
> > + while (mapping->nr_pages) {
> > + if (signal_pending(current)) {
> > + r = -ERESTARTSYS;
>
> Why -ERESTARTSYS instead of -EINTR? The latter is KVM's typical response to a
> pending signal.
>
> > + break;
> > + }
> > +
> > + if (need_resched())
>
> No need to manually check need_resched(), the below is a _conditional_ resched.
> The reason KVM explicitly checks need_resched() in MMU flows is because KVM needs
> to drop mmu_lock before rescheduling, i.e. calling cond_resched() directly would
> try to schedule() while holding a spinlock.
>
> > + cond_resched();
> > +
> > + r = kvm_arch_vcpu_map_memory(vcpu, mapping);
> > + if (r)
> > + break;
> > +
> > + added = true;
> > + }
> > +
> > + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > + vcpu_put(vcpu);
> > +
> > + if (added && mapping->nr_pages > 0)
> > + r = -EAGAIN;
>
> No, this clobbers 'r', which might hold a fatal error code. I don't see any
> reason for common code to ever force -EAGAIN, it can't possibly know if trying
> again is reasonable.

Thanks for the review. With those changes included, the hunk is as follows.


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d1fd9cb5d037..342269ef9f13 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4419,6 +4419,47 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
return fd;
}

+__weak int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ return -EOPNOTSUPP;
+}
+
+static int kvm_vcpu_map_memory(struct kvm_vcpu *vcpu,
+ struct kvm_memory_mapping *mapping)
+{
+ int idx, r;
+
+ if (mapping->flags)
+ return -EINVAL;
+
+ if (!IS_ALIGNED(mapping->source, PAGE_SIZE) ||
+ mapping->base_gfn + mapping->nr_pages <= mapping->base_gfn)
+ return -EINVAL;
+
+ vcpu_load(vcpu);
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ r = 0;
+ while (mapping->nr_pages) {
+ if (signal_pending(current)) {
+ r = -EINTR;
+ break;
+ }
+
+ r = kvm_arch_vcpu_map_memory(vcpu, mapping);
+ if (r)
+ break;
+
+ cond_resched();
+ }
+
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ vcpu_put(vcpu);
+
+ return r;
+}
+
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
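
For the x86 side, folding in the kvm_mmu_reload() point from above, the
arch hook could then look roughly like the following (a sketch under those
assumptions, not the posted patch; alignment and partial-range handling
are deliberately simplified):

    int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
                                 struct kvm_memory_mapping *mapping)
    {
            u8 level;
            int r;

            /* Cheap no-op on the happy path where the MMU is loaded. */
            r = kvm_mmu_reload(vcpu);
            if (r)
                    return r;

            /* Read-like fault: no WRITE/FETCH bits in the error code. */
            r = kvm_mmu_map_page(vcpu, gfn_to_gpa(mapping->base_gfn), 0,
                                 KVM_MAX_HUGEPAGE_LEVEL, &level);
            if (r)
                    return r;

            /* Advance the range by what was actually mapped. */
            mapping->base_gfn += KVM_PAGES_PER_HPAGE(level);
            mapping->nr_pages -= min(mapping->nr_pages,
                                     (u64)KVM_PAGES_PER_HPAGE(level));
            return 0;
    }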
--
Isaku Yamahata <[email protected]>

2024-03-11 22:56:25

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 3/8] KVM: x86/mmu: Introduce initialier macro for struct kvm_page_fault

On Mon, Mar 11, 2024 at 10:24:50AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Mar 01, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Another function will initialize struct kvm_page_fault. Add initializer
> > macro to unify the big struct initialization.
> >
> > No functional change intended.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu_internal.h | 44 +++++++++++++++++++--------------
> > 1 file changed, 26 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 0669a8a668ca..72ef09fc9322 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -279,27 +279,35 @@ enum {
> > RET_PF_SPURIOUS,
> > };
> >
> > +#define KVM_PAGE_FAULT_INIT(_vcpu, _cr2_or_gpa, _err, _prefetch, _max_level) { \
> > + .addr = (_cr2_or_gpa), \
> > + .error_code = (_err), \
> > + .exec = (_err) & PFERR_FETCH_MASK, \
> > + .write = (_err) & PFERR_WRITE_MASK, \
> > + .present = (_err) & PFERR_PRESENT_MASK, \
> > + .rsvd = (_err) & PFERR_RSVD_MASK, \
> > + .user = (_err) & PFERR_USER_MASK, \
> > + .prefetch = (_prefetch), \
> > + .is_tdp = \
> > + likely((_vcpu)->arch.mmu->page_fault == kvm_tdp_page_fault), \
> > + .nx_huge_page_workaround_enabled = \
> > + is_nx_huge_page_enabled((_vcpu)->kvm), \
> > + \
> > + .max_level = (_max_level), \
> > + .req_level = PG_LEVEL_4K, \
> > + .goal_level = PG_LEVEL_4K, \
> > + .is_private = \
> > + kvm_mem_is_private((_vcpu)->kvm, (_cr2_or_gpa) >> PAGE_SHIFT), \
> > + \
> > + .pfn = KVM_PFN_ERR_FAULT, \
> > + .hva = KVM_HVA_ERR_BAD, }
> > +
>
> Oof, no. I would much rather refactor kvm_mmu_do_page_fault() as needed than
> have to maintain a macro like this.

Ok, I updated it as follows.

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0669a8a668ca..e57cc3c56a6d 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -279,8 +279,8 @@ enum {
RET_PF_SPURIOUS,
};

-static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
- u32 err, bool prefetch, int *emulation_type)
+static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u32 err, bool prefetch, int *emulation_type)
{
struct kvm_page_fault fault = {
.addr = cr2_or_gpa,
@@ -307,6 +307,21 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
}

+ if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
+ r = kvm_tdp_page_fault(vcpu, &fault);
+ else
+ r = vcpu->arch.mmu->page_fault(vcpu, &fault);
+
+ if (fault.write_fault_to_shadow_pgtable && emulation_type)
+ *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ return r;
+}
+
+static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
+ u32 err, bool prefetch, int *emulation_type)
+{
+ int r;
+
/*
* Async #PF "faults", a.k.a. prefetch faults, are not faults from the
* guest perspective and have already been counted at the time of the
@@ -315,13 +330,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
if (!prefetch)
vcpu->stat.pf_taken++;

- if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
- r = kvm_tdp_page_fault(vcpu, &fault);
- else
- r = vcpu->arch.mmu->page_fault(vcpu, &fault);
-
- if (fault.write_fault_to_shadow_pgtable && emulation_type)
- *emulation_type |= EMULTYPE_WRITE_PF_TO_SP;
+ r = __kvm_mmu_do_page_fault(vcpu, cr2_or_gpa, err, prefetch, emulation_type);

/*
* Similar to above, prefetch faults aren't truly spurious, and the
--
2.43.2

--
Isaku Yamahata <[email protected]>

2024-03-11 22:57:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On Mon, Mar 11, 2024 at 10:29:01AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Fri, Mar 01, 2024, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Introduce a helper function to call kvm fault handler. This allows
> > a new ioctl to invoke kvm fault handler to populate without seeing
> > RET_PF_* enums or other KVM MMU internal definitions.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > arch/x86/kvm/mmu.h | 3 +++
> > arch/x86/kvm/mmu/mmu.c | 30 ++++++++++++++++++++++++++++++
> > 2 files changed, 33 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 60f21bb4c27b..48870c5e08ec 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> > __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> > }
> >
> > +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > + u8 max_level, u8 *goal_level);
> > +
> > /*
> > * Check if a given access (described through the I/D, W/R and U/S bits of a
> > * page fault error code pfec) causes a permission fault with the given PTE
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e4cc7f764980..7d5e80d17977 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> >
> > +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > + u8 max_level, u8 *goal_level)
> > +{
> > + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> > + false, max_level);
> > + int r;
> > +
> > + r = __kvm_mmu_do_page_fault(vcpu, &fault);
> > +
> > + if (is_error_noslot_pfn(fault.pfn) || vcpu->kvm->vm_bugged)
> > + return -EFAULT;
>
> This clobbers a non-zero 'r'. And KVM returns -EIO if the VM is bugged/dead, not
> -EFAULT. I also don't see why KVM needs to explicitly check is_error_noslot_pfn(),
> that should be funneled to RET_PF_EMULATE.

I'll drop this check.

> > +
> > + switch (r) {
> > + case RET_PF_RETRY:
> > + return -EAGAIN;
> > +
> > + case RET_PF_FIXED:
> > + case RET_PF_SPURIOUS:
> > + *goal_level = fault.goal_level;
> > + return 0;
> > +
> > + case RET_PF_CONTINUE:
> > + case RET_PF_EMULATE:
>
> -EINVAL would be more appropriate for RET_PF_EMULATE.
>
> > + case RET_PF_INVALID:
>
> CONTINUE and INVALID should be WARN conditions.

Will update them.
--
Isaku Yamahata <[email protected]>

2024-03-11 23:52:03

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Sun, Mar 10, 2024, Michael Roth wrote:
> On Fri, Mar 01, 2024 at 09:28:42AM -0800, [email protected] wrote:
> > struct kvm_sev_launch_update_data {
> > __u64 uaddr;
> > __u32 len;
> > };
> >
> > - TDX and measurement
> > The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> > extends its measurement by the page contents.
> > Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> > TDH.MEM.EXTEND
> > Option 2. Don't handle extend. Let TDX vendor specific API
> > KVM_MEMORY_ENCRYPT_OP to handle it with the subcommand like
> > KVM_TDX_EXTEND_MEMORY.
>
> For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> migration support is added that flow will be a bit different. So
> personally I think it's better to leave it separate for now.

+1. The only reason to do EXTEND at the same time as PAGE.ADD would be to
optimize setups that want the measurement to be extended with the contents of a
page immediately after the measurement is extended with the mapping metadata for
said page. And AFAIK, the only reason to prefer that approach is for backwards
compatibility, which is not a concern for KVM. I suppose maaaybe some memory
locality performance benefits, but that seems like a stretch.

<time passes>

And I think this whole conversation is moot, because I don't think there's a need
to do PAGE.ADD during KVM_MAP_MEMORY[*]. If KVM_MAP_MEMORY does only the SEPT.ADD
side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.

> But I'd be hesitant to bake more requirements into this pre-mapping
> interface, it feels like we're already overloading it as is.

Agreed. After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
should be as "pure" of a mapping operation as we can make it. It'd be a little
weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs. But
that's really just a reflection of S-EPT, so it's arguably not even a bad thing.

[*] https://lore.kernel.org/all/[email protected]

2024-03-12 01:33:48

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Mon, Mar 11, 2024 at 04:44:27PM -0700,
Sean Christopherson <[email protected]> wrote:

> On Sun, Mar 10, 2024, Michael Roth wrote:
> > On Fri, Mar 01, 2024 at 09:28:42AM -0800, [email protected] wrote:
> > > struct kvm_sev_launch_update_data {
> > > __u64 uaddr;
> > > __u32 len;
> > > };
> > >
> > > - TDX and measurement
> > > The TDX correspondence is TDH.MEM.PAGE.ADD and TDH.MR.EXTEND. TDH.MR.EXTEND
> > > extends its measurement by the page contents.
> > > Option 1. Add an additional flag like KVM_MEMORY_MAPPING_FLAG_EXTEND to issue
> > > TDH.MEM.EXTEND
> > > Option 2. Don't handle extend. Let TDX vendor specific API
> > > KVM_MEMORY_ENCRYPT_OP to handle it with the subcommand like
> > > KVM_TDX_EXTEND_MEMORY.
> >
> > For SNP this happens unconditionally via SNP_LAUNCH_UPDATE, and with some
> > additional measurements via SNP_LAUNCH_FINISH, and down the road when live
> > migration support is added that flow will be a bit different. So
> > personally I think it's better to leave it separate for now.
>
> +1. The only reason to do EXTEND at the same time as PAGE.ADD would be to
> optimize setups that want the measurement to be extended with the contents of a
> page immediately after the measurement is extended with the mapping metadata for
> said page. And AFAIK, the only reason to prefer that approach is for backwards
> compatibility, which is not a concern for KVM. I suppose maaaybe some memory
> locality performance benefits, but that seems like a stretch.
>
> <time passes>
>
> And I think this whole conversation is moot, because I don't think there's a need
> to do PAGE.ADD during KVM_MAP_MEMORY[*]. If KVM_MAP_MEMORY does only the SEPT.ADD
> side of things, then both @source (PAGE.ADD) and the EXTEND flag go away.
>
> > But I'd be hesitant to bake more requirements into this pre-mapping
> > interface, it feels like we're already overloading it as is.
>
> Agreed. After being able to think more about this ioctl(), I think KVM_MAP_MEMORY
> should be as "pure" of a mapping operation as we can make it. It'd be a little
> weird that using KVM_MAP_MEMORY is required for TDX VMs, but not other VMs. But
> that's really just a reflection of S-EPT, so it's arguably not even a bad thing.
>
> [*] https://lore.kernel.org/all/[email protected]

Let me try removing source from struct kvm_memory_mapping. With the unit in
bytes instead of pages, it becomes:
struct kvm_memory_mapping {
__u64 base_address;
__u64 size;
__u64 flags;
};

SNP won't need any changes. Should KVM_MAP_MEMORY always return an error for
SNP? (I'll leave that to Roth.)
TDX will have TDX_INIT_MEM_REGION with a new implementation.
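
If the interface does move to byte units, the common-code sanity check
would presumably still enforce page granularity for page-mapping users;
an illustrative sketch:

    if (!mapping->size ||
        !IS_ALIGNED(mapping->base_address, PAGE_SIZE) ||
        !IS_ALIGNED(mapping->size, PAGE_SIZE) ||
        mapping->base_address + mapping->size < mapping->base_address)
            return -EINVAL; /* empty, misaligned, or wrapping range */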
--
Isaku Yamahata <[email protected]>

2024-03-12 01:34:21

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

On Mon, Mar 11, 2024 at 02:08:01PM +1300,
"Huang, Kai" <[email protected]> wrote:

> Maybe just the "underlying physical pages may be
> encrypted" etc.
>
> Sorry, perhaps "encrypted" -> "crypto-protected"?

Thanks for clarifying. Based on the discussion[1], I'm going to remove the
'source' member. So I'll eliminate those sentences.

[1] https://lore.kernel.org/all/[email protected]/
--
Isaku Yamahata <[email protected]>

2024-03-12 14:20:36

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 12, 2024, Kai Huang wrote:
> > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > pre-built, but when they are built is irrelevant, so long as they are in place
> > before PAGE.ADD.
> >
> > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> >
> > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > the TDP MMU.
>
> One small question:
>
> What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> return error when it finds the region hasn't been pre-populated?

Return an error. I don't love the idea of bleeding so many TDX details into
userspace, but I'm pretty sure that ship sailed a long, long time ago.

2024-03-12 21:41:39

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, 2024-03-12 at 07:20 -0700, Sean Christopherson wrote:
> On Tue, Mar 12, 2024, Kai Huang wrote:
> > > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > > pre-built, but when they are built is irrelevant, so long as they are in place
> > > before PAGE.ADD.
> > >
> > > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> > >
> > > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > > the TDP MMU.
> >
> > One small question:
> >
> > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> > return error when it finds the region hasn't been pre-populated?
>
> Return an error. I don't love the idea of bleeding so many TDX details into
> userspace, but I'm pretty sure that ship sailed a long, long time ago.

In this case, IIUC the KVM_MAP_MEMORY ioctl() will be mandatory for TDX
(presumably also SNP) guests, but _optional_ for other VMs. Not sure whether
this is ideal.

And just want to make sure I understand the background correctly:

The KVM_MAP_MEMORY ioctl() is supposed to be generic, and it should be able to
be used by any VM, not just CoCo VMs (including SW_PROTECTED ones)?

But it is only supposed to be used by the VMs which use guest_memfd()? Because
IIUC for normal VMs using mmap() we already have MAP_POPULATE for this purpose.

Looking at [*], it doesn't say what kind of VM the sender was trying to use.

Therefore, can we interpret the KVM_MAP_MEMORY ioctl() as effectively being for CoCo VMs?
SW_PROTECTED VMs can also use guest_memfd(), but I believe nobody is going to
use it seriously.

[*] https://lore.kernel.org/all/[email protected]/



2024-03-12 21:48:31

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, 2024-03-12 at 21:41 +0000, Huang, Kai wrote:
> On Tue, 2024-03-12 at 07:20 -0700, Sean Christopherson wrote:
> > On Tue, Mar 12, 2024, Kai Huang wrote:
> > > > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > > > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > > > pre-built, but when they are built is irrelevant, so long as they are in place
> > > > before PAGE.ADD.
> > > >
> > > > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > > > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > > > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> > > >
> > > > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > > > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > > > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > > > the TDP MMU.
> > >
> > > One small question:
> > >
> > > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> > > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> > > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> > > return error when it finds the region hasn't been pre-populated?
> >
> > Return an error. I don't love the idea of bleeding so many TDX details into
> > userspace, but I'm pretty sure that ship sailed a long, long time ago.
>
> In this case, IIUC the KVM_MAP_MEMORY ioctl() will be mandatory for TDX
> (presumably also SNP) guests, but _optional_ for other VMs. Not sure whether
> this is ideal.
>
> And just want to make sure I understand the background correctly:
>
> The KVM_MAP_MEMORY ioctl() is supposed to be generic, and it should be able to
> be used by any VM, not just CoCo VMs (including SW_PROTECTED ones)?
>
> But it is only supposed to be used by the VMs which use guest_memfd()? Because
> IIUC for normal VMs using mmap() we already have MAP_POPULATE for this purpose.
>
> Looking at [*], it doesn't say what kind of VM the sender was trying to use.
>
> Therefore, can we interpret the KVM_MAP_MEMORY ioctl() as effectively being for CoCo VMs?
> SW_PROTECTED VMs can also use guest_memfd(), but I believe nobody is going to
> use it seriously.
>
> [*] https://lore.kernel.org/all/[email protected]/
>
>

Hmm.. Just after sending I realized that MAP_POPULATE only pre-populates the
page table on the host side, not the EPT...

So KVM_MAP_MEMORY can indeed be used by _ALL_ VMs. My bad :-(
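
To illustrate the distinction (userspace sketch; memfd and size are
assumptions):

    #include <sys/mman.h>

    /*
     * MAP_POPULATE pre-faults the *host* page tables for the VMA that
     * backs the memslot...
     */
    void *hva = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_POPULATE, memfd, 0);
    /*
     * ...but the stage-2 (EPT/NPT) entries are still created lazily on
     * first guest access.  KVM_MAP_MEMORY is what pre-builds those,
     * which is why it can help any VM type.
     */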

2024-03-12 23:03:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 12, 2024, Kai Huang wrote:
> On Tue, 2024-03-12 at 21:41 +0000, Huang, Kai wrote:
> > On Tue, 2024-03-12 at 07:20 -0700, Sean Christopherson wrote:
> > > On Tue, Mar 12, 2024, Kai Huang wrote:
> > > > > Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> > > > > PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> > > > > pre-built, but when they are built is irrelevant, so long as they are in place
> > > > > before PAGE.ADD.
> > > > >
> > > > > Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> > > > > which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> > > > > but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
> > > > >
> > > > > Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> > > > > to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> > > > > simply need to verify that the pfn from guest_memfd() is the same as what's in
> > > > > the TDP MMU.
> > > >
> > > > One small question:
> > > >
> > > > What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
> > > > populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
> > > > then we still need to do the real map. Or we can make KVM_TDX_INIT_MEM_REGION
> > > > return error when it finds the region hasn't been pre-populated?
> > >
> > > Return an error. I don't love the idea of bleeding so many TDX details into
> > > userspace, but I'm pretty sure that ship sailed a long, long time ago.
> >
> > In this case, IIUC the KVM_MAP_MEMORY ioctl() will be mandatory for TDX
> > (presumbly also SNP) guests, but _optional_ for other VMs. Not sure whether
> > this is ideal.

No, just TDX. SNP's RMP purely works with pfns, i.e. the enforcement layer comes
into play *after* the stage-2 page table walks. KVM can zap NPTs for SNP VMs at
will.

> > And just want to make sure I understand the background correctly:
> >
> > The KVM_MAP_MEMORY ioctl() is supposed to be generic, and it should be able to
> > be used by any VM, not just CoCo VMs (including SW_PROTECTED ones)?
> >
> > But it is only supposed to be used by the VMs which use guest_memfd()? Because
> > IIUC for normal VMs using mmap() we already have MAP_POPULATE for this purpose.
> >
> > Looking at [*], it doesn't say what kind of VM the sender was trying to use.
> >
> > Therefore, can we interpret the KVM_MAP_MEMORY ioctl() as effectively being for CoCo VMs?
> > SW_PROTECTED VMs can also use guest_memfd(), but I believe nobody is going to
> > use it seriously.
> >
> > [*] https://lore.kernel.org/all/[email protected]/
> >
> >
>
> Hmm.. Just after sending I realized that MAP_POPULATE only pre-populates the
> page table on the host side, not the EPT...

Yep, exactly.

> So KVM_MAP_MEMORY can indeed be used by _ALL_ VMs. My bad :-(

2024-03-12 12:38:33

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()


>
> Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
> PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
> pre-built, but when they are built is irrelevant, so long as they are in place
> before PAGE.ADD.
>
> Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
> which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
> but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?
>
> Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
> to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
> simply need to verify that the pfn from guest_memfd() is the same as what's in
> the TDP MMU.

One small question:

What if the memory region passed to KVM_TDX_INIT_MEM_REGION hasn't been pre-
populated? If we want to make KVM_TDX_INIT_MEM_REGION work with these regions,
then we still need to do the real mapping. Or we can make KVM_TDX_INIT_MEM_REGION
return an error when it finds the region hasn't been pre-populated?

>
> Or if we want to make things more robust for userspace, set a software-available
> flag in the leaf TDP MMU SPTE to indicate that the page is awaiting PAGE.ADD.
> That way tdp_mmu_map_handle_target_level() wouldn't treat a fault as spurious
> (KVM will see the SPTE as PRESENT, but the S-EPT entry will be !PRESENT).
>
> Then KVM_MAP_MEMORY doesn't need to support @source, KVM_TDX_INIT_MEM_REGION
> doesn't need to fake a page fault and doesn't need to temporarily stash the
> source_pa in KVM, and KVM_MAP_MEMORY could be used to fully pre-map TDX memory.
>
> I believe the only missing piece is a way for the TDX code to communicate that
> hugepages are disallowed.
>

2024-03-11 23:26:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Fri, Mar 01, 2024, [email protected] wrote:
> +int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
> + struct kvm_memory_mapping *mapping)
> +{
> + u8 max_level, goal_level = PG_LEVEL_4K;

goal_level is misleading. kvm_page_fault.goal_level is appropriate, because for
the majority of the structure's lifetime, it is the level KVM is trying to map,
not the level that KVM has mapped.

For this variable, maybe "mapped_level" or just "level"


> + u32 error_code;

u64 when this gets rebased...

> + int r;
> +
> + error_code = 0;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_WRITE)
> + error_code |= PFERR_WRITE_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_EXEC)
> + error_code |= PFERR_FETCH_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_USER)
> + error_code |= PFERR_USER_MASK;
> + if (mapping->flags & KVM_MEMORY_MAPPING_FLAG_PRIVATE) {
> +#ifdef PFERR_PRIVATE_ACCESS
> + error_code |= PFERR_PRIVATE_ACCESS;

..because PFERR_PRIVATE_ACCESS is in bits 63:32.

> +#else
> + return -OPNOTSUPP;

Broken code. And I don't see any reason for this to exist, PFERR_PRIVATE_ACCESS
will be unconditionally #defined.

> +#endif
> + }
> +
> + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> + max_level = PG_LEVEL_1G;
> + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> + max_level = PG_LEVEL_2M;
> + else
> + max_level = PG_LEVEL_4K;

Hrm, I would much prefer to allow KVM to map more than what is requested, i.e.
not artificially restrict the hugepage size. Restricting the mapping size is
just giving userspace another way to shoot themselves in the foot. E.g. if
userspace prefaults the wrong size, KVM will likely either run with degraded
performance or immediately zap and rebuild the too-small SPTE.

Ugh, and TDX's S-EPT *must* be populated with 4KiB entries to start. On one hand,
allowing KVM to map a larger page (but still bounded by the memslot(s)) won't affect
TDX's measurement, because KVM never uses a larger page. On the other hand, TDX
would need a way to restrict the mapping.

Another complication is that TDH.MEM.PAGE.ADD affects the measurement. We can
push the ordering requirements to userspace, but for all intents and purposes,
calls to this ioctl() would need to be serialized for doing PAGE.ADD, but could
run in parallel for every other use case, including PAGE.AUG and pre-mapping
shared memory for TDX.

Hrm.

Wait. KVM doesn't *need* to do PAGE.ADD from deep in the MMU. The only inputs to
PAGE.ADD are the gfn, pfn, tdr (vm), and source. The S-EPT structures need to be
pre-built, but when they are built is irrelevant, so long as they are in place
before PAGE.ADD.

Crazy idea. For TDX S-EPT, what if KVM_MAP_MEMORY does all of the SEPT.ADD stuff,
which doesn't affect the measurement, and even fills in KVM's copy of the leaf EPTE,
but tdx_sept_set_private_spte() doesn't do anything if the TD isn't finalized?

Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
to do PAGE.ADD. KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
simply need to verify that the pfn from guest_memfd() is the same as what's in
the TDP MMU.

Or if we want to make things more robust for userspace, set a software-available
flag in the leaf TDP MMU SPTE to indicate that the page is awaiting PAGE.ADD.
That way tdp_mmu_map_handle_target_level() wouldn't treat a fault as spurious
(KVM will see the SPTE as PRESENT, but the S-EPT entry will be !PRESENT).

Then KVM_MAP_MEMORY doesn't need to support @source, KVM_TDX_INIT_MEM_REGION
doesn't need to fake a page fault and doesn't need to temporarily stash the
source_pa in KVM, and KVM_MAP_MEMORY could be used to fully pre-map TDX memory.

I believe the only missing piece is a way for the TDX code to communicate that
hugepages are disallowed.
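A rough sketch of that idea, with hypothetical helpers (is_td_finalized() and
tdx_mem_page_aug() are stand-ins for whatever the TDX series ends up providing):

static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
				     enum pg_level level, kvm_pfn_t pfn)
{
	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

	/*
	 * Pre-finalization, KVM_MAP_MEMORY only pre-builds the S-EPT
	 * structures and KVM's mirror SPTE; the measurement-affecting
	 * PAGE.ADD is deferred to KVM_TDX_INIT_MEM_REGION.
	 */
	if (!is_td_finalized(kvm_tdx))
		return 0;

	/* Post-finalization: AUG the page as for any runtime fault. */
	return tdx_mem_page_aug(kvm, gfn, level, pfn);
}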

2024-03-19 15:54:06

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 5/8] KVM: x86/mmu: Introduce kvm_mmu_map_page() for prepopulating guest memory

On Wed, Mar 06, 2024 at 04:38:53PM -0800,
David Matlack <[email protected]> wrote:

> On 2024-03-01 09:28 AM, [email protected] wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e4cc7f764980..7d5e80d17977 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4659,6 +4659,36 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > return direct_page_fault(vcpu, fault);
> > }
> >
> > +int kvm_mmu_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> > + u8 max_level, u8 *goal_level)
> > +{
> > + struct kvm_page_fault fault = KVM_PAGE_FAULT_INIT(vcpu, gpa, error_code,
> > + false, max_level);
> > + int r;
> > +
> > + r = __kvm_mmu_do_page_fault(vcpu, &fault);
>
> If TDP is disabled, __kvm_mmu_do_page_fault() will interpret @gpa as a
> GVA, no? And if the vCPU is in guest-mode, __kvm_mmu_do_page_fault() will
> interpret @gpa as a nGPA, right?

Just to close the discussion.
As we discussed at [1], I'd like to restrict the API to the TDP MMU only
(with the next version). If the vCPU is in guest-mode or legacy MMU mode, it
will get an error.

[1] https://lore.kernel.org/kvm/[email protected]/
--
Isaku Yamahata <[email protected]>
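A minimal sketch of such a guard at the top of the arch prepopulation path
(the exact predicate is an assumption, not the posted code):

	/*
	 * Restrict prepopulation to the TDP MMU: reject legacy/shadow
	 * MMUs and nested guest-mode, where the address to populate
	 * would not be an L1 GPA.
	 */
	if (!tdp_mmu_enabled || !vcpu->arch.mmu->root_role.direct ||
	    is_guest_mode(vcpu))
		return -EOPNOTSUPP;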

2024-03-19 16:33:29

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Mar 06, 2024 at 06:09:54PM -0800,
Isaku Yamahata <[email protected]> wrote:

> On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> David Matlack <[email protected]> wrote:
>
> > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > Implementation:
> > > - x86 KVM MMU
> > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > version.
> >
> > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > how to reliably do that from a vCPU context. Checking for TDP being
> > enabled is easy, but what if the vCPU is in guest-mode?
>
> As you pointed out in the other mail, legacy KVM MMU support or guest-mode will
> be troublesome. The use case I assumed is pre-population before the guest runs,
> so guest-mode wouldn't matter. I didn't add an explicit check for it, though.
>
> Any use case while vCPUs are running?
>
>
> > Perhaps we can just return an error out to userspace if the vCPU is in
> > guest-mode or TDP is disabled, and make it userspace's problem to do
> > memory mapping before loading any vCPU state.
>
> If the use case for a default VM or sw-protected VM is to avoid excessive KVM page
> faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.

Any input? If no further input, I assume the primary use case is pre-population
before the guest runs.
--
Isaku Yamahata <[email protected]>

2024-03-19 16:38:46

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Wed, Mar 06, 2024 at 05:51:51PM -0800,
Isaku Yamahata <[email protected]> wrote:

> On Wed, Mar 06, 2024 at 04:36:25PM -0800,
> David Matlack <[email protected]> wrote:
>
> > On Wed, Mar 6, 2024 at 4:31 PM David Matlack <[email protected]> wrote:
> > >
> > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > >
> > > > + if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_1G)) &&
> > > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_1G))
> > > > + max_level = PG_LEVEL_1G;
> > > > + else if (IS_ALIGNED(mapping->base_gfn, KVM_PAGES_PER_HPAGE(PG_LEVEL_2M)) &&
> > > > + mapping->nr_pages >= KVM_PAGES_PER_HPAGE(PG_LEVEL_2M))
> > > > + max_level = PG_LEVEL_2M;
> > > > + else
> > > > + max_level = PG_LEVEL_4K;
> > >
> > > Is there a requirement that KVM must not map memory outside of the
> > > requested region?
> >
> > And if so, what if the requested region is already mapped with a larger page?
>
> Yes. We'd like to map the exact GPA range for the SNP or TDX case. We don't
> want to map zeros in the surrounding range. For SNP or TDX, mapping a page to
> a GPA is a one-time operation, and it updates the measurement.
>
> Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
> and they are within the same 2M range. Map GPA1 first. If GPA2 is also mapped
> to zeros as part of a 2M page, the subsequent mapping of GPA2 fails. Even if
> the mapping of GPA2 succeeds, the measurement may be updated when mapping GPA1.
>
> It's the userspace VMM's responsibility to map a GPA range at most once for
> SNP or TDX. Is this too strict a requirement for the default VM use case of
> mitigating KVM page faults at guest boot? If so, what about a flag like
> EXACT_MAPPING or something?

I'm thinking as follows. What do you think?

- Allow mapping larger than requested with a gmem_max_level hook:
This depends on the following patch. [1]
The gmem_max_level hook allows the vendor backend to determine the max level.
By default (for a default VM or sw-protected VM), it allows KVM_MAX_HUGEPAGE_LEVEL
mapping. TDX allows only 4KB mapping. (A sketch of such a hook follows this mail.)

[1] https://lore.kernel.org/kvm/[email protected]/
[PATCH v11 30/35] KVM: x86: Add gmem hook for determining max NPT mapping level

- Pure mapping without CoCo operations:
As Sean suggested at [2], make KVM_MAP_MEMORY a pure mapping operation without
CoCo operations. In the case of TDX, the API doesn't issue TDX-specific
operations like TDH.PAGE.ADD() and TDH.MR.EXTEND(). We need a TDX-specific API.

[2] https://lore.kernel.org/kvm/[email protected]/

- KVM_MAP_MEMORY on an already mapped area, potentially with a large page:
It succeeds; no error. It doesn't care whether the GPA is backed by a large
page or not. Because the use case is pre-population before the guest runs, it
doesn't matter whether the given GPA was already mapped, or at what large page
level it is backed.

Do you want an error like -EEXIST?

--
Isaku Yamahata <[email protected]>
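For the first bullet above, a sketch of how such a hook could clamp the mapping
level inside KVM_MAP_MEMORY (the hook name follows [1]; the signature is a
guess, and the kvm_x86_ops wiring is illustrative only):

	u8 max_level = KVM_MAX_HUGEPAGE_LEVEL;
	int r;

	/* Let the vendor backend cap the level, e.g. TDX forcing 4KB. */
	if (kvm_x86_ops.gmem_max_level) {
		r = kvm_x86_ops.gmem_max_level(kvm, pfn, gfn, is_private,
					       &max_level);
		if (r)
			return r;
	}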

2024-04-03 18:54:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> Isaku Yamahata <[email protected]> wrote:
>
> > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > David Matlack <[email protected]> wrote:
> >
> > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > > From: Isaku Yamahata <[email protected]>
> > > >
> > > > Implementation:
> > > > - x86 KVM MMU
> > > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > version.
> > >
> > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > how to reliably do that from a vCPU context. Checking for TDP being
> > > enabled is easy, but what if the vCPU is in guest-mode?
> >
> > As you pointed out in the other mail, legacy KVM MMU support or guest-mode
> > will be troublesome.

Why is shadow paging troublesome? I don't see any obvious issues with effectively
prefetching into a shadow MMU with read fault semantics. It might be pointless
and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.

Testing is the biggest gap I see, as using the ioctl() for shadow paging will
essentially require a live guest, but that doesn't seem like it'd be too hard to
validate. And unless we lock down the ioctl() to only be allowed on vCPUs that
have never done KVM_RUN, we need that test coverage anyways.

And I don't think it makes sense to try and lock down the ioctl(), because for
the enforcement to have any meaning, KVM would need to reject the ioctl() if *any*
vCPU has run, and adding that code would likely add more complexity than it solves.

> > The use case I assumed is pre-population before the guest runs, so guest-mode
> > wouldn't matter. I didn't add an explicit check for it, though.

KVM shouldn't have an explicit is_guest_mode() check, the support should be a
property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
using legacy shadow paging, not TDP).

> > Any use case while vCPUs are running?
> >
> >
> > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > memory mapping before loading any vCPU state.
> >
> > If the use case for a default VM or sw-protected VM is to avoid excessive KVM page
> > faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
>
> Any input? If no further input, I assume the primary use case is pre-population
> before the guest runs.

Pre-populating is the primary use case, but that could happen if L2 is active,
e.g. after live migration.

I'm not necessarily opposed to initially adding support only for the TDP MMU, but
if the delta to also support the shadow MMU is relatively small, my preference
would be to add the support right away. E.g. to give us confidence that the uAPI
can work for multiple MMUs, and so that we don't have to write documentation for
x86 to explain exactly when it's legal to use the ioctl().

2024-04-03 22:00:36

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Apr 03, 2024 at 11:30:21AM -0700,
Sean Christopherson <[email protected]> wrote:

> On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > Isaku Yamahata <[email protected]> wrote:
> >
> > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > David Matlack <[email protected]> wrote:
> > >
> > > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > > > From: Isaku Yamahata <[email protected]>
> > > > >
> > > > > Implementation:
> > > > > - x86 KVM MMU
> > > > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > > version.
> > > >
> > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > enabled is easy, but what if the vCPU is in guest-mode?
> > >
> > > As you pointed out in the other mail, legacy KVM MMU support or guest-mode
> > > will be troublesome.
>
> Why is shadow paging troublesome? I don't see any obvious issues with effectively
> prefetching into a shadow MMU with read fault semantics. It might be pointless
> and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.

The populating address for shadow paging is a GVA, not a GPA. I'm not sure
that's what userspace wants. If it's a userspace problem, I'm fine.


> Testing is the biggest gap I see, as using the ioctl() for shadow paging will
> essentially require a live guest, but that doesn't seem like it'd be too hard to
> validate. And unless we lock down the ioctl() to only be allowed on vCPUs that
> have never done KVM_RUN, we need that test coverage anyways.

So far I have tried only the TDP MMU case. I can try the other MMU types.


> And I don't think it makes sense to try and lock down the ioctl(), because for
> the enforcement to have any meaning, KVM would need to reject the ioctl() if *any*
> vCPU has run, and adding that code would likely add more complexity than it solves.
>
> > > The use case I assumed is pre-population before the guest runs, so guest-mode
> > > wouldn't matter. I didn't add an explicit check for it, though.
>
> KVM shouldn't have an explicit is_guest_mode() check, the support should be a
> property of the underlying MMU, and KVM can use the TDP MMU for L2 (if L1 is
> using legacy shadow paging, not TDP).

I see. So the type of the populating address can vary depending on the vCPU
mode. It's userspace's problem which address (GVA, L1 GPA, L2 GPA) is used.


> > > Any use case while vCPUs are running?
> > >
> > >
> > > > Perhaps we can just return an error out to userspace if the vCPU is in
> > > > guest-mode or TDP is disabled, and make it userspace's problem to do
> > > > memory mapping before loading any vCPU state.
> > >
> > > If the use case for a default VM or sw-protected VM is to avoid excessive KVM page
> > > faults at guest boot, an error on guest-mode or disabled TDP wouldn't matter.
> >
> > Any input? If no further input, I assume the primary use case is pre-population
> > before the guest runs.
>
> Pre-populating is the primary use case, but that could happen if L2 is active,
> e.g. after live migration.
>
> I'm not necessarily opposed to initially adding support only for the TDP MMU, but
> if the delta to also support the shadow MMU is relatively small, my preference
> would be to add the support right away. E.g. to give us confidence that the uAPI
> can work for multiple MMUs, and so that we don't have to write documentation for
> x86 to explain exactly when it's legal to use the ioctl().

If we call kvm_mmu.page_fault() without caring which address will be
populated, I don't see a big difference.
--
Isaku Yamahata <[email protected]>

2024-04-03 22:43:02

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8] KVM: Prepopulate guest memory API

On Wed, Apr 03, 2024, Isaku Yamahata wrote:
> On Wed, Apr 03, 2024 at 11:30:21AM -0700,
> Sean Christopherson <[email protected]> wrote:
>
> > On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> > > On Wed, Mar 06, 2024 at 06:09:54PM -0800,
> > > Isaku Yamahata <[email protected]> wrote:
> > >
> > > > On Wed, Mar 06, 2024 at 04:53:41PM -0800,
> > > > David Matlack <[email protected]> wrote:
> > > >
> > > > > On 2024-03-01 09:28 AM, [email protected] wrote:
> > > > > > From: Isaku Yamahata <[email protected]>
> > > > > >
> > > > > > Implementation:
> > > > > > - x86 KVM MMU
> > > > > > In x86 KVM MMU, I chose to use kvm_mmu_do_page_fault(). It's not confined to
> > > > > > KVM TDP MMU. We can restrict it to KVM TDP MMU and introduce an optimized
> > > > > > version.
> > > > >
> > > > > Restricting to TDP MMU seems like a good idea. But I'm not quite sure
> > > > > how to reliably do that from a vCPU context. Checking for TDP being
> > > > > enabled is easy, but what if the vCPU is in guest-mode?
> > > >
> > > > As you pointed out in the other mail, legacy KVM MMU support or guest-mode
> > > > will be troublesome.
> >
> > Why is shadow paging troublesome? I don't see any obvious issues with effectively
> > prefetching into a shadow MMU with read fault semantics. It might be pointless
> > and wasteful, as the guest PTEs need to be in place, but that's userspace's problem.
>
> The populating address for shadow paging is a GVA, not a GPA. I'm not sure
> that's what userspace wants. If it's a userspace problem, I'm fine.

/facepalm

> > Pre-populating is the primary use case, but that could happen if L2 is active,
> > e.g. after live migration.
> >
> > I'm not necessarily opposed to initially adding support only for the TDP MMU, but
> > if the delta to also support the shadow MMU is relatively small, my preference
> > would be to add the support right away. E.g. to give us confidence that the uAPI
> > can work for multiple MMUs, and so that we don't have to write documentation for
> > x86 to explain exactly when it's legal to use the ioctl().
>
> If we call kvm_mmu.page_fault() without caring which address will be
> populated, I don't see a big difference.

Ignore me, I completely spaced that shadow MMUs don't operate on an L1 GPA. I
100% agree that restricting this to TDP, at least for the initial merge, is the
way to go. A uAPI where the type of address varies based on the vCPU mode and
MMU type would be super ugly, and probably hard to use.

At that point, I don't have a strong preference as to whether or not direct
legacy/shadow MMUs are supported. That said, I think it can (probably should?)
be done in a way where it more or less Just Works, e.g. by having a function hook
in "struct kvm_mmu".

2024-04-03 23:15:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] KVM: x86: Implement kvm_arch_{, pre_}vcpu_map_memory()

On Tue, Mar 19, 2024, Isaku Yamahata wrote:
> On Wed, Mar 06, 2024 at 05:51:51PM -0800,
> > Yes. We'd like to map the exact GPA range for the SNP or TDX case. We don't
> > want to map zeros in the surrounding range. For SNP or TDX, mapping a page to
> > a GPA is a one-time operation, and it updates the measurement.
> >
> > Say we'd like to populate GPA1 and GPA2 with the initial guest memory image,
> > and they are within the same 2M range. Map GPA1 first. If GPA2 is also mapped
> > to zeros as part of a 2M page, the subsequent mapping of GPA2 fails. Even if
> > the mapping of GPA2 succeeds, the measurement may be updated when mapping GPA1.
> >
> > It's the userspace VMM's responsibility to map a GPA range at most once for
> > SNP or TDX. Is this too strict a requirement for the default VM use case of
> > mitigating KVM page faults at guest boot? If so, what about a flag like
> > EXACT_MAPPING or something?
>
> I'm thinking as follows. What do you think?
>
> - Allow mapping larger than requested with a gmem_max_level hook:

I don't see any reason to allow userspace to request a mapping level. If the
prefetch is defined to have read fault semantics, KVM has all the wiggle room it
needs to do the optimal/sane thing, without having to worry about reconciling
userspace's desired mapping level.

> This depends on the following patch. [1]
> The gmem_max_level hook allows the vendor backend to determine the max level.
> By default (for a default VM or sw-protected VM), it allows
> KVM_MAX_HUGEPAGE_LEVEL mapping. TDX allows only 4KB mapping.
>
> [1] https://lore.kernel.org/kvm/[email protected]/
> [PATCH v11 30/35] KVM: x86: Add gmem hook for determining max NPT mapping level
>
> - Pure mapping without CoCo operations:
> As Sean suggested at [2], make KVM_MAP_MEMORY a pure mapping operation without
> CoCo operations. In the case of TDX, the API doesn't issue TDX-specific
> operations like TDH.PAGE.ADD() and TDH.MR.EXTEND(). We need a TDX-specific API.
>
> [2] https://lore.kernel.org/kvm/[email protected]/
>
> - KVM_MAP_MEMORY on an already mapped area, potentially with a large page:
> It succeeds; no error. It doesn't care whether the GPA is backed by a large
> page or not. Because the use case is pre-population before the guest runs, it
> doesn't matter whether the given GPA was already mapped, or at what large page
> level it is backed.
>
> Do you want an error like -EEXIST?

No error. As above, I think the ioctl() should behave like a read fault, i.e.
be an expensive nop if there's nothing to be done.

For VMA-based memory, userspace can operate on the userspace address. E.g. if
userspace wants to break CoW, it can do that by writing from userspace. And if
userspace wants to "request" a certain mapping level, it can do that by MADV_*.

For guest_memfd, there are no protections (everything is RWX, for now), and when
hugepage support comes along, userspace can simply manipulate the guest_memfd
instance as needed.
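Concretely, the userspace-side knobs described above might look like this for
VMA-based memory (a sketch; hva and size denote the memslot's userspace
mapping):

#include <string.h>
#include <sys/mman.h>

	/* "Request" a 2M mapping level for the backing VMA. */
	madvise(hva, size, MADV_HUGEPAGE);

	/* Break CoW / allocate backing pages by writing from userspace. */
	memset(hva, 0, size);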