Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) client-ip=2620:137:e000::3:6;
Date:   Thu, 30 Nov 2023 17:51:28 -0800
In-Reply-To: <3ad69657ba8e1b19d150db574193619cf0cb34df.camel@redhat.com>
Mime-Version: 1.0
References: <20231110235528.1561679-1-seanjc@google.com> <20231110235528.1561679-4-seanjc@google.com>
 <3ad69657ba8e1b19d150db574193619cf0cb34df.camel@redhat.com>
Message-ID: <ZWk8IMZamuemfwXG@google.com>
Subject: Re: [PATCH 3/9] KVM: x86: Initialize guest cpu_caps based on guest CPUID
From:   Sean Christopherson <seanjc@google.com>
To:     Maxim Levitsky <mlevitsk@redhat.com>
Cc:     Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org,
        linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="us-ascii"
Precedence: bulk

On Sun, Nov 19, 2023, Maxim Levitsky wrote:
> On Fri, 2023-11-10 at 15:55 -0800, Sean Christopherson wrote:
> > +/*
> > + * This isn't truly "unsafe", but all callers except kvm_cpu_after_set_cpuid()
> > + * should use __cpuid_entry_get_reg(), which provides compile-time validation
> > + * of the input.
> > + */
> > +static u32 cpuid_get_reg_unsafe(struct kvm_cpuid_entry2 *entry, u32 reg)
> > +{
> > +	switch (reg) {
> > +	case CPUID_EAX:
> > +		return entry->eax;
> > +	case CPUID_EBX:
> > +		return entry->ebx;
> > +	case CPUID_ECX:
> > +		return entry->ecx;
> > +	case CPUID_EDX:
> > +		return entry->edx;
> > +	default:
> > +		WARN_ON_ONCE(1);
> > +		return 0;
> > +	}
> > +}

...

> >  static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  {
> >  	struct kvm_lapic *apic = vcpu->arch.apic;
> >  	struct kvm_cpuid_entry2 *best;
> >  	bool allow_gbpages;
> > +	int i;
> >  
> > -	memset(vcpu->arch.cpu_caps, 0, sizeof(vcpu->arch.cpu_caps));
> > +	BUILD_BUG_ON(ARRAY_SIZE(reverse_cpuid) != NR_KVM_CPU_CAPS);
> > +
> > +	/*
> > +	 * Reset guest capabilities to userspace's guest CPUID definition, i.e.
> > +	 * honor userspace's definition for features that don't require KVM or
> > +	 * hardware management/support (or that KVM simply doesn't care about).
> > +	 */
> > +	for (i = 0; i < NR_KVM_CPU_CAPS; i++) {
> > +		const struct cpuid_reg cpuid = reverse_cpuid[i];
> > +
> > +		best = kvm_find_cpuid_entry_index(vcpu, cpuid.function, cpuid.index);
> > +		if (best)
> > +			vcpu->arch.cpu_caps[i] = cpuid_get_reg_unsafe(best, cpuid.reg);
> 
> Why not just use __cpuid_entry_get_reg? 
> 
> cpuid.reg comes from read/only 'reverse_cpuid' anyway, and in fact
> it seems that all callers of __cpuid_entry_get_reg, take the reg value from
> x86_feature_cpuid() which also takes it from 'reverse_cpuid'.
> 
> So if the compiler is smart enough to not complain in these cases, I don't
> see why this case is different.

It's because the input isn't a compile-time constant, and so the BUILD_BUG() in
the default path will fire.  All of the compile-time assertions in reverse_cpuid.h
rely on the feature being a constant value, which allows the compiler to optimize
away the dead paths, i.e. turn __cpuid_entry_get_reg()'s switch statement into
simple pointer arithmetic and thus omit the BUILD_BUG() code.

> Also why not to initialize guest_caps = host_caps & userspace_cpuid?
>
> If this was the default we won't need any guest_cpu_cap_restrict and such,
> instead it will just work.

Hrm, I definitely like the idea.  Unfortunately, unless we do an audit of all
~120 uses of guest_cpuid_has(), restricting those based on kvm_cpu_caps might
break userspace.

Aside from purging the governed feature nomenclature, the main goal of this series
provide a way to do fast lookups of all known guest CPUID bits without needing to
opt-in on a feature-by-feature basis, including for features that are fully
controlled by userspace.

It's definitely doable, but I'm not all that confident that the end result would
be a net positive, e.g. I believe we would need to special case things like the
feature bits that gate MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.  MOVBE and RDPID
are other features that come to mind, where KVM emulates the feature in software
but it won't be set in kvm_cpu_caps.

Oof, and MONITOR and MWAIT too, as KVM deliberately doesn't advertise those to
userspace.

So yeah, I'm not opposed to trying that route at some point, but I really don't
want to do that in this series as the risk of subtly breaking something is super
high.

> Special code will only be needed in few more complex cases, like forced exposed
> of a feature to a guest due to a virtualization hole.
> 
> 
> > +		else
> > +			vcpu->arch.cpu_caps[i] = 0;
> > +	}
> >  
> >  	/*
> >  	 * If TDP is enabled, let the guest use GBPAGES if they're supported in
> > @@ -342,8 +380,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  	 */
> >  	allow_gbpages = tdp_enabled ? boot_cpu_has(X86_FEATURE_GBPAGES) :
> >  				      guest_cpuid_has(vcpu, X86_FEATURE_GBPAGES);
> > -	if (allow_gbpages)
> > -		guest_cpu_cap_set(vcpu, X86_FEATURE_GBPAGES);
> > +	guest_cpu_cap_change(vcpu, X86_FEATURE_GBPAGES, allow_gbpages);
> 
> IMHO the original code was more readable, now I need to look up the
> 'guest_cpu_cap_change()' to understand what is going on.

The change is "necessary".  The issue is that with the caps 0-initialied, the
!allow_gbpages could simply do nothing.  Now, KVM needs to explicitly clear the
flag, i.e. would need to do:

	if (allow_gbpages)
		guest_cpu_cap_set(vcpu, X86_FEATURE_GBPAGES);
	else
		guest_cpu_cap_clear(vcpu, X86_FEATURE_GBPAGES);

I don't much love the name either, but it pairs with cpuid_entry_change() and I
want to keep the kvm_cpu_cap, cpuid_entry, and guest_cpu_cap APIs in sync as far
as the APIs go.  The only reason kvm_cpu_cap_change() doesn't exist is because
there aren't any flows that need to toggle a bit.

> >  static __always_inline bool guest_cpu_cap_has(struct kvm_vcpu *vcpu,
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 8a99a73b6ee5..5827328e30f1 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4315,14 +4315,14 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  	 * XSS on VM-Enter/VM-Exit.  Failure to do so would effectively give
> >  	 * the guest read/write access to the host's XSS.
> >  	 */
> > -	if (boot_cpu_has(X86_FEATURE_XSAVE) &&
> > -	    boot_cpu_has(X86_FEATURE_XSAVES) &&
> > -	    guest_cpuid_has(vcpu, X86_FEATURE_XSAVE))
> > -		guest_cpu_cap_set(vcpu, X86_FEATURE_XSAVES);
> > +	guest_cpu_cap_change(vcpu, X86_FEATURE_XSAVES,
> > +			     boot_cpu_has(X86_FEATURE_XSAVE) &&
> > +			     boot_cpu_has(X86_FEATURE_XSAVES) &&
> > +			     guest_cpuid_has(vcpu, X86_FEATURE_XSAVE));
> 
> In theory this change does change behavior, now the X86_FEATURE_XSAVE will
> be set iff the condition is true, but before it was set *if* the condition was true.

No, before it was set if and only if the condition was true, because in that case
caps were 0-initialized, i.e. this was/is the only way for XSAVE to be set.

> > -	guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_NRIPS);
> > -	guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_TSCRATEMSR);
> > -	guest_cpu_cap_check_and_set(vcpu, X86_FEATURE_LBRV);
> > +	guest_cpu_cap_restrict(vcpu, X86_FEATURE_NRIPS);
> > +	guest_cpu_cap_restrict(vcpu, X86_FEATURE_TSCRATEMSR);
> > +	guest_cpu_cap_restrict(vcpu, X86_FEATURE_LBRV);
> 
> One of the main reasons I don't like governed features is this manual list.

To be fair, the manual lists predate the governed features.

> I want to reach the point that one won't need to add anything manually,
> unless there is a good reason to do so, and there are only a few exceptions
> when the guest cap is set, while the host's isn't.

Yeah, agreed.