2024-04-25 22:17:17

by Sean Christopherson

Subject: Re: [PATCH v14 09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT

On Thu, Apr 25, 2024, Michael Roth wrote:
> On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > +{
> > > +        u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > +        u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > +        struct vcpu_svm *svm = to_svm(vcpu);
> > > +
> > > +        if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > +                set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > +                return 1; /* resume guest */
> > > +        }
> > > +
> > > +        vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > +        vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > +        vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > +        vcpu->run->vmgexit.psc_msr.op = op;
> >
> > Argh, no.
> >
> > This is the same crud that TDX tried to push[*]. Use KVM's existing user exits,
> > and extend as *needed*. There is no good reason page state change requests need
> > *two* exit reasons. The *only* thing KVM supports right now is private<=>shared
> > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > KVM_EXIT_MEMORY_FAULT.
> >
> > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > and that handling any outliers by performing multiple exits to userspace will
> > provide sufficient performance.
>
> That does tend to be the case. We won't have as much granularity with
> the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> expected to be for the entire range anyway, and if that fails for
> whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> for cases where the entries aren't contiguous however, which would
> involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> not a huge deal since it doesn't seem to be a common case.

If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
into a buffer, but I suspect it will be simpler to just have KVM loop until the
PSC request is complete.
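
Something along these lines is what I'm picturing (a rough sketch only: the
psc_buffer/psc_entry layout, the "huge" flag, and the snp_complete_psc()
completion hook are stand-ins I'm making up here, the only real ABI in play
is the KVM_EXIT_HYPERCALL/KVM_HC_MAP_GPA_RANGE plumbing):

static int snp_begin_psc(struct vcpu_svm *svm, struct psc_buffer *psc)
{
        struct kvm_vcpu *vcpu = &svm->vcpu;
        u16 idx = psc->hdr.cur_entry;
        struct psc_entry *entry = &psc->entries[idx];
        u64 gfn = entry->gfn;
        u64 npages = entry->huge ? 512 : 1;
        u8 op = entry->op;

        /* Coalesce adjacent entries with the same operation into one range. */
        while (++idx <= psc->hdr.end_entry) {
                entry = &psc->entries[idx];

                if (entry->op != op || entry->gfn != gfn + npages)
                        break;

                npages += entry->huge ? 512 : 1;
        }

        /* Page-size flags elided; 2M entries could also be reflected in args[2]. */
        vcpu->run->exit_reason = KVM_EXIT_HYPERCALL;
        vcpu->run->hypercall.nr = KVM_HC_MAP_GPA_RANGE;
        vcpu->run->hypercall.args[0] = gfn_to_gpa(gfn);
        vcpu->run->hypercall.args[1] = npages;
        vcpu->run->hypercall.args[2] = op == SNP_PAGE_STATE_PRIVATE ?
                                       KVM_MAP_GPA_RANGE_ENCRYPTED :
                                       KVM_MAP_GPA_RANGE_DECRYPTED;

        /*
         * Made-up completion hook: it would advance hdr.cur_entry past the
         * coalesced entries and re-invoke this until cur_entry > end_entry,
         * then write the result back to the GHCB and resume the guest.
         */
        vcpu->arch.complete_userspace_io = snp_complete_psc;

        return 0; /* exit to userspace */
}

Userspace handles each exit exactly like a guest-issued KVM_HC_MAP_GPA_RANGE,
i.e. with KVM_SET_MEMORY_ATTRIBUTES, so no new exit reason is needed.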

> KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> flexibility to just issue that directly within a guest rather than
> relying on SNP/TDX specific hcalls. I don't know if that approach is
> practical for a real guest, but it could be useful for having re-usable
> guest code in KVM selftests that "just works" for all variants of
> SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
>
> I think there is some potential baggage there with the previous SEV live
> migration use cases. There's some potential that existing guest kernels
> will use it once it gets advertised and issue these calls alongside
> GHCB-based page-state changes. It might make sense to use one of the
> reserved bits to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> hardware/software-protected VMs and not interchangeable with calls that
> were used for SEV live migration stuff.

I don't think I follow, what exactly wouldn't be interchangeable, and why?

> If this seems reasonable I'll give it a go and see what it looks like.
>
> >
> > And the non-MSR version that comes in a later patch is a complete mess. It kicks
> > the PSC out to userspace without *any* validation. As I complained in the TDX
> > thread, that will create an unmaintainable ABI for KVM.
> >
> > KVM needs to have its own, well-defined ABI. Splitting functionality between
> > KVM and userspace at seemingly random points is not maintainable.
> >
> > E.g. if/when KVM supports UNSMASH, upgrading to the new KVM would arguably break
> > userspace as PSC requests that previously exited would suddenly be handled by
> > KVM. Maybe. It's impossible to review this because there's no KVM ABI; KVM is
> > little more than a dumb pipe parroting information to userspace.
>
> It leans on the GHCB spec to avoid re-inventing structs/documentation
> for things like Page State Change buffers, but we do have as much control
> as we want over how much we farm out versus lock into the KVM ABI. For
> instance, the accompanying Documentation/ update mentions we only send a
> subset of GHCB requests that need to be handled by userspace, so we
> could handle SMASH/UNSMASH in KVM without breaking expectations (or, if
> SMASH/UNSMASH were intermixed with PSCs, the documentation could state
> that only PSC opcodes may be updated by userspace).
>
> But I'm certainly not arguing it wouldn't be better to have a
> guest-agnostic alternative if we can reach an agreement on that, and
> KVM_HC_MAP_GPA_RANGE seems like it could work.

Yeah, I want to at least _try_ to achieve common ground, because the basic
functionality of all this stuff is the exact same.


2024-04-26 17:17:22

by Michael Roth

Subject: Re: [PATCH v14 09/22] KVM: SEV: Add support to handle MSR based Page State Change VMGEXIT

On Thu, Apr 25, 2024 at 03:13:40PM -0700, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Michael Roth wrote:
> > On Wed, Apr 24, 2024 at 01:59:48PM -0700, Sean Christopherson wrote:
> > > On Sun, Apr 21, 2024, Michael Roth wrote:
> > > > +static int snp_begin_psc_msr(struct kvm_vcpu *vcpu, u64 ghcb_msr)
> > > > +{
> > > > +        u64 gpa = gfn_to_gpa(GHCB_MSR_PSC_REQ_TO_GFN(ghcb_msr));
> > > > +        u8 op = GHCB_MSR_PSC_REQ_TO_OP(ghcb_msr);
> > > > +        struct vcpu_svm *svm = to_svm(vcpu);
> > > > +
> > > > +        if (op != SNP_PAGE_STATE_PRIVATE && op != SNP_PAGE_STATE_SHARED) {
> > > > +                set_ghcb_msr(svm, GHCB_MSR_PSC_RESP_ERROR);
> > > > +                return 1; /* resume guest */
> > > > +        }
> > > > +
> > > > +        vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
> > > > +        vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_PSC_MSR;
> > > > +        vcpu->run->vmgexit.psc_msr.gpa = gpa;
> > > > +        vcpu->run->vmgexit.psc_msr.op = op;
> > >
> > > Argh, no.
> > >
> > > This is the same crud that TDX tried to push[*]. Use KVM's existing user exits,
> > > and extend as *needed*. There is no good reason page state change requests need
> > > *two* exit reasons. The *only* thing KVM supports right now is private<=>shared
> > > conversions, and that can be handled with either KVM_HC_MAP_GPA_RANGE or
> > > KVM_EXIT_MEMORY_FAULT.
> > >
> > > The non-MSR flavor can batch requests, but I'm willing to bet that the overwhelming
> > > majority of requests are contiguous, i.e. can be combined into a range by KVM,
> > > and that handling any outliers by performing multiple exits to userspace will
> > > provide sufficient performance.
> >
> > That does tend to be the case. We won't have as much granularity with
> > the per-entry error codes, but KVM_SET_MEMORY_ATTRIBUTES would be
> > expected to be for the entire range anyway, and if that fails for
> > whatever reason then we KVM_BUG_ON() anyway. We do have to have handling
> > for cases where the entries aren't contiguous however, which would
> > involve multiple KVM_EXIT_HYPERCALLs until everything is satisfied. But
> > not a huge deal since it doesn't seem to be a common case.
>
> If it was less complex overall, I wouldn't be opposed to KVM marshalling everything
> into a buffer, but I suspect it will be simpler to just have KVM loop until the
> PSC request is complete.

Agreed. But *if* we decided to introduce a buffer, where would you
suggest adding it? The kvm_run union fields are set to 256 bytes, and
we'd need close to 4K to handle a full GHCB PSC buffer in 1 go. Would
additional storage at the end of struct kvm_run be acceptable?
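
Roughly what I had in mind (hand-wavy; the trailing field name/size is made
up, only the existing fixed-size union and the kvm_sync_regs union at the
end are what's actually there today):

struct kvm_run {
        /* ... */
        union {
                /* exit-reason-specific payloads, e.g. hypercall, memory_fault */
                /* ... */
                /* Fix the size of the union. */
                char padding[256];
        };

        /* ... kvm_sync_regs union currently sits at the end ... */

        /*
         * Hypothetical: extra storage big enough to pass a full GHCB Page
         * State Change buffer (~4K) out to userspace in one exit.
         */
        char psc_data[4096];
};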

>
> > KVM_HC_MAP_GPA_RANGE seems like a nice option because we'd also have the
> > flexibility to just issue that directly within a guest rather than
> > relying on SNP/TDX specific hcalls. I don't know if that approach is
> > practical for a real guest, but it could be useful for having re-usable
> > guest code in KVM selftests that "just works" for all variants of
> > SNP/TDX/sw-protected. (though we'd still want stuff that exercises
> > SNP/TDX->KVM_HC_MAP_GPA_RANGE translation).
> >
> > I think there is some potential baggage there with the previous SEV live
> > migration use cases. There's some potential that existing guest kernels
> > will use it once it gets advertised and issue these calls alongside
> > GHCB-based page-state changes. It might make sense to use one of the
> > reserved bits to denote this flavor of KVM_HC_MAP_GPA_RANGE as being for
> > hardware/software-protected VMs and not interchangeable with calls that
> > were used for SEV live migration stuff.
>
> I don't think I follow, what exactly wouldn't be interchangeable, and why?

For instance, if KVM_FEATURE_MIGRATION_CONTROL is advertised, then when
amd_enc_status_change_finish() is triggered as a result of
set_memory_encrypted(), we'd see

1) a GHCB PSC for SNP, which will get forwarded to userspace via
KVM_HC_MAP_GPA_RANGE
2) KVM_HC_MAP_GPA_RANGE issued directly by the guest.

In that case, we'd be duplicating PSCs but it wouldn't necessarily hurt
anything. But ideally we'd be able to distinguish the 2 cases so we
could rightly treat 1) as only being expected for SNP, and 2) as only
being expected for SEV/SEV-ES.

I'm also concerned there's potential for more serious issues, for instance
kvm_init_platform() has a loop that resets all E820_TYPE_RAM ranges to
encrypted, which may step on PSCs that SNP had to do earlier on for
setting up shared pages for things like GHCB buffers. For the SEV live
migration use case it's not a huge deal if a shared page gets processed
as private; it's just slower because it would need to go through the SEV
firmware APIs. There's also more leeway in that updates can happen at
opportunistic times, since there's a KVM_MIGRATION_READY state that can
be set via MSR_KVM_MIGRATION_CONTROL once everything is set up.

For SNP, all those associated KVM_HC_MAP_GPA_RANGE calls take immediate
effect, so they need to be in lock-step with the actual state of each
page before it is accessed, and there is no leeway for stuff like leaving
a shared page as private.

I also think a lot of that was before lazy acceptance support, which
happens out-of-band versus these pv_ops.mmu.notify_page_* hooks, and kexec
support for SNP guests has some notion of tracking state across kexec
boundaries that might get clobbered too.

I haven't looked too closely, but it seems like a good idea to at
least set a bit for kvm->arch.has_private_mem use cases. If it's not
needed, then worst case we have a bit that userspace doesn't necessarily
need to care about, but for SNP just filtering out the duplicate
requests seems like enough to justify having it.
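
E.g. something like this (flag name and bit position completely made up),
where KVM sets the bit when it synthesizes the exit on behalf of a GHCB/TDX
page-state change (or, more broadly, for any has_private_mem VM), so
userspace can tell it apart from a guest-issued KVM_HC_MAP_GPA_RANGE:

/*
 * Hypothetical flag in the attributes argument of KVM_HC_MAP_GPA_RANGE
 * exits: set for hardware/software-protected VMs (or when KVM forwards
 * a GHCB/TDX page-state change), clear for the existing SEV live
 * migration usage where the guest issues the hypercall directly.
 */
#define KVM_MAP_GPA_RANGE_PSC_FWD	(1ULL << 8)

        /* In the SNP PSC -> KVM_HC_MAP_GPA_RANGE forwarding path: */
        vcpu->run->hypercall.args[2] |= KVM_MAP_GPA_RANGE_PSC_FWD;

Userspace that doesn't care can ignore it, but for SNP it's enough to
filter out the duplicate requests coming from the live-migration path.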

-Mike