2024-01-24 00:39:53

by Mingwei Zhang

[permalink] [raw]
Subject: [PATCH 0/2] minor fix on perf_capabilities in KVM/x86

This simple series fixes the value setup for vcpu->arch.perf_capabilities
to fix an issue in live migration and remove an inline helper in Intel pmu
specific PMU code to make the code simple.

Mingwei Zhang (2):
KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled
KVM: x86/pmu: Remove vcpu_get_perf_capabilities()

arch/x86/kvm/cpuid.c | 3 +++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++------------
2 files changed, 7 insertions(+), 12 deletions(-)


base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
--
2.43.0.429.g432eaa2c6b-goog



2024-01-24 00:40:33

by Mingwei Zhang

[permalink] [raw]
Subject: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
Without this, there is an issue in live migration. In particular, to
migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
the target is unable to set the value. This creates confusions on the user
side.

Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
wrong on the kvm_get_msr_common() which just reads
vcpu->arch.perf_capabilities.

Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
early in VM setup time.

Cc: Aaron Lewis <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/cpuid.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index adba49afb5fe..416bee03c42a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);

+ /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
+ vcpu->arch.perf_capabilities = 0;
kvm_pmu_refresh(vcpu);
vcpu->arch.cr4_guest_rsvd_bits =
__cr4_reserved_bits(guest_cpuid_has, vcpu);
--
2.43.0.429.g432eaa2c6b-goog


2024-01-24 00:40:43

by Mingwei Zhang

[permalink] [raw]
Subject: [PATCH 2/2] KVM: x86/pmu: Remove vcpu_get_perf_capabilities()

Remove vcpu_get_perf_capabilities() helper and directly use the
vcpu->arch.perf_capabilities which now contains the true value of
IA32_PERF_CAPABILITIES if exposed to guest (and 0 otherwise). This should
slightly improve performance by avoiding the runtime check of
X86_FEATURE_PDCM.

Signed-off-by: Mingwei Zhang <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++------------
1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index a6216c874729..7cbee2d16ed9 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -158,17 +158,9 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
return &counters[array_index_nospec(idx, num_counters)];
}

-static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
-{
- if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
- return 0;
-
- return vcpu->arch.perf_capabilities;
-}
-
static inline bool fw_writes_is_enabled(struct kvm_vcpu *vcpu)
{
- return (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_FW_WRITES) != 0;
+ return (vcpu->arch.perf_capabilities & PMU_CAP_FW_WRITES) != 0;
}

static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
@@ -207,13 +199,13 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_FIXED_CTR_CTRL:
return kvm_pmu_has_perf_global_ctrl(pmu);
case MSR_IA32_PEBS_ENABLE:
- ret = vcpu_get_perf_capabilities(vcpu) & PERF_CAP_PEBS_FORMAT;
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
break;
case MSR_IA32_DS_AREA:
ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
break;
case MSR_PEBS_DATA_CFG:
- perf_capabilities = vcpu_get_perf_capabilities(vcpu);
+ perf_capabilities = vcpu->arch.perf_capabilities;
ret = (perf_capabilities & PERF_CAP_PEBS_BASELINE) &&
((perf_capabilities & PERF_CAP_PEBS_FORMAT) > 3);
break;
@@ -577,7 +569,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
bitmap_set(pmu->all_valid_pmc_idx,
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);

- perf_capabilities = vcpu_get_perf_capabilities(vcpu);
+ perf_capabilities = vcpu->arch.perf_capabilities;
if (cpuid_model_is_consistent(vcpu) &&
(perf_capabilities & PMU_CAP_LBR_FMT))
x86_perf_get_lbr(&lbr_desc->records);
--
2.43.0.429.g432eaa2c6b-goog


2024-01-24 15:51:54

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> Without this, there is an issue in live migration. In particular, to
> migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> the target is unable to set the value. This creates confusions on the user
> side.
>
> Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> wrong on the kvm_get_msr_common() which just reads
> vcpu->arch.perf_capabilities.
>
> Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> early in VM setup time.
>
> Cc: Aaron Lewis <[email protected]>
> Signed-off-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/cpuid.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index adba49afb5fe..416bee03c42a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
>
> + /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> + if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> + vcpu->arch.perf_capabilities = 0;

No, this is just papering over the underlying bug. KVM shouldn't be stuffing
vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
actually does something like that, but KVM overwriting guest state is almost
never a good thing.

I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
already fudging around this in our internal kernels, so I don't think there's a
need to carry a hack-a-fix for the destination kernel.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 27e23714e960..fdef9d706d61 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)

kvm_async_pf_hash_reset(vcpu);

- vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
kvm_pmu_init(vcpu);

vcpu->arch.pending_external_vector = -1;

> kvm_pmu_refresh(vcpu);
> vcpu->arch.cr4_guest_rsvd_bits =
> __cr4_reserved_bits(guest_cpuid_has, vcpu);
> --
> 2.43.0.429.g432eaa2c6b-goog
>

2024-01-24 15:52:17

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 2/2] KVM: x86/pmu: Remove vcpu_get_perf_capabilities()

On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> Remove vcpu_get_perf_capabilities() helper and directly use the
> vcpu->arch.perf_capabilities which now contains the true value of
> IA32_PERF_CAPABILITIES if exposed to guest (and 0 otherwise). This should
> slightly improve performance by avoiding the runtime check of
> X86_FEATURE_PDCM.

I have a generic in-progress series[*] to more or less solve the performance woes
with guest_cpuid_has(). I would rather keep the current code, even though it's
somewhat odd, as it's possible there are setups that rely on KVM checking PDCM.
E.g. if userspace sets MSRs *after* CPUID and plugs in a non-zero PERF_CAPABILITES.

[*] https://lore.kernel.org/all/[email protected]

> Signed-off-by: Mingwei Zhang <[email protected]>
> ---
> arch/x86/kvm/vmx/pmu_intel.c | 16 ++++------------
> 1 file changed, 4 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index a6216c874729..7cbee2d16ed9 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -158,17 +158,9 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
> return &counters[array_index_nospec(idx, num_counters)];
> }
>
> -static inline u64 vcpu_get_perf_capabilities(struct kvm_vcpu *vcpu)
> -{
> - if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> - return 0;
> -
> - return vcpu->arch.perf_capabilities;
> -}
> -
> static inline bool fw_writes_is_enabled(struct kvm_vcpu *vcpu)
> {
> - return (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_FW_WRITES) != 0;
> + return (vcpu->arch.perf_capabilities & PMU_CAP_FW_WRITES) != 0;
> }
>
> static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
> @@ -207,13 +199,13 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
> case MSR_CORE_PERF_FIXED_CTR_CTRL:
> return kvm_pmu_has_perf_global_ctrl(pmu);
> case MSR_IA32_PEBS_ENABLE:
> - ret = vcpu_get_perf_capabilities(vcpu) & PERF_CAP_PEBS_FORMAT;
> + ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
> break;
> case MSR_IA32_DS_AREA:
> ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
> break;
> case MSR_PEBS_DATA_CFG:
> - perf_capabilities = vcpu_get_perf_capabilities(vcpu);
> + perf_capabilities = vcpu->arch.perf_capabilities;
> ret = (perf_capabilities & PERF_CAP_PEBS_BASELINE) &&
> ((perf_capabilities & PERF_CAP_PEBS_FORMAT) > 3);
> break;
> @@ -577,7 +569,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> bitmap_set(pmu->all_valid_pmc_idx,
> INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
>
> - perf_capabilities = vcpu_get_perf_capabilities(vcpu);
> + perf_capabilities = vcpu->arch.perf_capabilities;
> if (cpuid_model_is_consistent(vcpu) &&
> (perf_capabilities & PMU_CAP_LBR_FMT))
> x86_perf_get_lbr(&lbr_desc->records);
> --
> 2.43.0.429.g432eaa2c6b-goog
>

2024-01-24 21:26:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Wed, Jan 24, 2024, Aaron Lewis wrote:
> On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> > > Without this, there is an issue in live migration. In particular, to
> > > migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> > > non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> > > the target is unable to set the value. This creates confusions on the user
> > > side.
> > >
> > > Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> > > value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> > > wrong on the kvm_get_msr_common() which just reads
> > > vcpu->arch.perf_capabilities.
> > >
> > > Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> > > early in VM setup time.
> > >
> > > Cc: Aaron Lewis <[email protected]>
> > > Signed-off-by: Mingwei Zhang <[email protected]>
> > > ---
> > > arch/x86/kvm/cpuid.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > index adba49afb5fe..416bee03c42a 100644
> > > --- a/arch/x86/kvm/cpuid.c
> > > +++ b/arch/x86/kvm/cpuid.c
> > > @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> > > vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
> > >
> > > + /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> > > + if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> > > + vcpu->arch.perf_capabilities = 0;
> >
> > No, this is just papering over the underlying bug. KVM shouldn't be stuffing
> > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
> > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
> > actually does something like that, but KVM overwriting guest state is almost
> > never a good thing.
> >
> > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
> > already fudging around this in our internal kernels, so I don't think there's a
> > need to carry a hack-a-fix for the destination kernel.
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 27e23714e960..fdef9d706d61 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> >
> > kvm_async_pf_hash_reset(vcpu);
> >
> > - vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
>
> Yeah, that will fix the issue we are seeing. The only thing that's
> not clear to me is if userspace should expect KVM to set this or if
> KVM should expect userspace to set this. How is that generally
> decided?

By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
To be consistent with KVM's CPUID module at vCPU creation, which is completely
empty (vCPU has no PMU and no PDCM support) KVM *must* zero
vcpu->arch.perf_capabilities.

If userspace wants a non-zero value, then userspace needs to set CPUID to enable
PDCM and set MSR_IA32_PERF_CAPABILITIES.

MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too
should be excised.

In a perfect world, KVM would also zero-initialize vcpu->arch.msr_platform_info,
but that one is less obviously broken and also less obviously safe to remove.

commit e53d88af63ab4104e1226b8f9959f1e9903da10b
Author: Jim Mattson <[email protected]>
AuthorDate: Tue Oct 30 12:20:21 2018 -0700
Commit: Paolo Bonzini <[email protected]>
CommitDate: Fri Dec 14 18:00:01 2018 +0100

kvm: x86: Don't modify MSR_PLATFORM_INFO on vCPU reset

If userspace has provided a different value for this MSR (e.g with the
turbo bits set), the userspace-provided value should survive a vCPU
reset. For backwards compatibility, MSR_PLATFORM_INFO is initialized
in kvm_arch_vcpu_setup.

Signed-off-by: Jim Mattson <[email protected]>
Reviewed-by: Drew Schmitt <[email protected]>
Cc: Abhiroop Dabral <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

In other words, KVM shouldn't define the vCPU model beyond the absolute bare
minimum that is required by the x86 architecture (as of P6 CPUs, which is more
or less the oldest CPU KVM can reasonably virtualize without carrying useless code).

2024-01-24 22:25:10

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Wed, Jan 24, 2024, Sean Christopherson wrote:
> On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid.
> > > > Without this, there is an issue in live migration. In particular, to
> > > > migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a
> > > > non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on
> > > > the target is unable to set the value. This creates confusions on the user
> > > > side.
> > > >
> > > > Fundamentally, it is because vcpu->arch.perf_capabilities as the cached
> > > > value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing
> > > > wrong on the kvm_get_msr_common() which just reads
> > > > vcpu->arch.perf_capabilities.
> > > >
> > > > Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e.
> > > > early in VM setup time.
> > > >
> > > > Cc: Aaron Lewis <[email protected]>
> > > > Signed-off-by: Mingwei Zhang <[email protected]>
> > > > ---
> > > > arch/x86/kvm/cpuid.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> > > > index adba49afb5fe..416bee03c42a 100644
> > > > --- a/arch/x86/kvm/cpuid.c
> > > > +++ b/arch/x86/kvm/cpuid.c
> > > > @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > > vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> > > > vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu);
> > > >
> > > > + /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */
> > > > + if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> > > > + vcpu->arch.perf_capabilities = 0;
> > >
> > > No, this is just papering over the underlying bug. KVM shouldn't be stuffing
> > > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
> > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
> > > actually does something like that, but KVM overwriting guest state is almost
> > > never a good thing.
> > >
> > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
> > > already fudging around this in our internal kernels, so I don't think there's a
> > > need to carry a hack-a-fix for the destination kernel.
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 27e23714e960..fdef9d706d61 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > >
> > > kvm_async_pf_hash_reset(vcpu);
> > >
> > > - vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> >
> > Yeah, that will fix the issue we are seeing. The only thing that's
> > not clear to me is if userspace should expect KVM to set this or if
> > KVM should expect userspace to set this. How is that generally
> > decided?
>
> By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> To be consistent with KVM's CPUID module at vCPU creation, which is completely
> empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> vcpu->arch.perf_capabilities.
>
> If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> PDCM and set MSR_IA32_PERF_CAPABILITIES.
>
> MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too
> should be excised.
>
hmm, does that mean KVM just allows an invalid vcpu state exist from
host point of view? I think this makes a lot of confusions on migration
where VMM on the source believes that a non-zero value from KVM_GET_MSRS
is valid and the VMM on the target will find it not true.

If we follow the suggestion by removing the initial value at vCPU
creation time, then I think it breaks the existing VMM code, since that
requires VMM to explicitly set the MSR, which I am not sure we do today.

The following code below is different. The key difference is that the
following code preserves a valid value, but this case is to not preserve
an invalid value.

Thanks.
-Mingwei

> In a perfect world, KVM would also zero-initialize vcpu->arch.msr_platform_info,
> but that one is less obviously broken and also less obviously safe to remove.
>
> commit e53d88af63ab4104e1226b8f9959f1e9903da10b
> Author: Jim Mattson <[email protected]>
> AuthorDate: Tue Oct 30 12:20:21 2018 -0700
> Commit: Paolo Bonzini <[email protected]>
> CommitDate: Fri Dec 14 18:00:01 2018 +0100
>
> kvm: x86: Don't modify MSR_PLATFORM_INFO on vCPU reset
>
> If userspace has provided a different value for this MSR (e.g with the
> turbo bits set), the userspace-provided value should survive a vCPU
> reset. For backwards compatibility, MSR_PLATFORM_INFO is initialized
> in kvm_arch_vcpu_setup.
>
> Signed-off-by: Jim Mattson <[email protected]>
> Reviewed-by: Drew Schmitt <[email protected]>
> Cc: Abhiroop Dabral <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
>
> In other words, KVM shouldn't define the vCPU model beyond the absolute bare
> minimum that is required by the x86 architecture (as of P6 CPUs, which is more
> or less the oldest CPU KVM can reasonably virtualize without carrying useless code).

2024-01-24 23:03:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > No, this is just papering over the underlying bug. KVM shouldn't be stuffing
> > > > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
> > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
> > > > actually does something like that, but KVM overwriting guest state is almost
> > > > never a good thing.
> > > >
> > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
> > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > need to carry a hack-a-fix for the destination kernel.
> > > >
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 27e23714e960..fdef9d706d61 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > >
> > > > kvm_async_pf_hash_reset(vcpu);
> > > >
> > > > - vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > >
> > > Yeah, that will fix the issue we are seeing. The only thing that's
> > > not clear to me is if userspace should expect KVM to set this or if
> > > KVM should expect userspace to set this. How is that generally
> > > decided?
> >
> > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > To be consistent with KVM's CPUID module at vCPU creation, which is completely
> > empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> > vcpu->arch.perf_capabilities.
> >
> > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> >
> > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too
> > should be excised.
> >
> hmm, does that mean KVM just allows an invalid vcpu state exist from
> host point of view?

Yes.

https://lore.kernel.org/all/[email protected]

> I think this makes a lot of confusions on migration where VMM on the source
> believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> target will find it not true.

Yes, but seeing a non-zero value is a KVM bug that should be fixed.

> If we follow the suggestion by removing the initial value at vCPU
> creation time, then I think it breaks the existing VMM code, since that
> requires VMM to explicitly set the MSR, which I am not sure we do today.

Yeah, I'm hoping we can squeak by without breaking existing setups.

I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").

Frankly, if our VMM doesn't do the same, then it's wildly busted. Relying on
KVM to define the vCPU is irresponsible, to put it nicely.

> The following code below is different. The key difference is that the
> following code preserves a valid value, but this case is to not preserve
> an invalid value.

But it's a completely different fix. I referenced that commit to call out that
the need for the commit and changelog suggests that someone (*cough* us) is relying
on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very long time.
That doesn't mean it's the correct KVM behavior, just that it's much riskier to
change.

2024-01-25 00:15:01

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Wed, Jan 24, 2024, Sean Christopherson wrote:
> On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > > No, this is just papering over the underlying bug. KVM shouldn't be stuffing
> > > > > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
> > > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
> > > > > actually does something like that, but KVM overwriting guest state is almost
> > > > > never a good thing.
> > > > >
> > > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
> > > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > > need to carry a hack-a-fix for the destination kernel.
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 27e23714e960..fdef9d706d61 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > >
> > > > > kvm_async_pf_hash_reset(vcpu);
> > > > >
> > > > > - vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > > >
> > > > Yeah, that will fix the issue we are seeing. The only thing that's
> > > > not clear to me is if userspace should expect KVM to set this or if
> > > > KVM should expect userspace to set this. How is that generally
> > > > decided?
> > >
> > > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > > To be consistent with KVM's CPUID module at vCPU creation, which is completely
> > > empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> > > vcpu->arch.perf_capabilities.
> > >
> > > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> > >
> > > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too
> > > should be excised.
> > >
> > hmm, does that mean KVM just allows an invalid vcpu state exist from
> > host point of view?
>
> Yes.
>
> https://lore.kernel.org/all/[email protected]
>
> > I think this makes a lot of confusions on migration where VMM on the source
> > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > target will find it not true.
>
> Yes, but seeing a non-zero value is a KVM bug that should be fixed.
>
How about adding an entry in vmx_get_msr() for
MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.

Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
the initial permitted value.

The benefit is that it is not enforcing the VMM to explicitly set the
value. In fact, there are several platform MSRs which has initial value
that VMM may rely on instead of explicitly setting.
MSR_IA32_PERF_CAPABILITIES is only one of them.


> > If we follow the suggestion by removing the initial value at vCPU
> > creation time, then I think it breaks the existing VMM code, since that
> > requires VMM to explicitly set the MSR, which I am not sure we do today.
>
> Yeah, I'm hoping we can squeak by without breaking existing setups.
>
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
>
> Frankly, if our VMM doesn't do the same, then it's wildly busted. Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.
>
> > The following code below is different. The key difference is that the
> > following code preserves a valid value, but this case is to not preserve
> > an invalid value.
>
> But it's a completely different fix. I referenced that commit to call out that
> the need for the commit and changelog suggests that someone (*cough* us) is relying
> on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very long time.
> That doesn't mean it's the correct KVM behavior, just that it's much riskier to
> change.

2024-01-26 18:33:56

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Thu, Jan 25, 2024, Mingwei Zhang wrote:
> On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > I think this makes a lot of confusions on migration where VMM on the source
> > > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > > target will find it not true.
> >
> > Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> >
> How about adding an entry in vmx_get_msr() for
> MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
> with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.
> Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
> the initial permitted value.

Hrm, I don't hate it as a stopgap. But if we are the only people that are affected,
because again I'm pretty sure QEMU is fine, I would rather we just fix things in
our VMM and/or internal kernel.

Long term, I want some form of fix for the initialization code, even if that means
adding a quirk to let userspace opt out of KVM setting default values for platform
MSRs.

Side topic, vmx_set_msr() should check X86_FEATURE_PDCM, not just the PMU version.

> The benefit is that it is not enforcing the VMM to explicitly set the
> value. In fact, there are several platform MSRs which has initial value
> that VMM may rely on instead of explicitly setting.
> MSR_IA32_PERF_CAPABILITIES is only one of them.

Yeah, and all of those are broken. AFAICT, the bad behavior got introduced for
MSR_PLATFORM_INFO, and then people kept copy+pasting that broken pattern :-(

2024-01-26 19:34:47

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Fri, Jan 26, 2024, Mingwei Zhang wrote:
> +Frederick Mayle +Steven Moreland
>
> On Fri, Jan 26, 2024 at 10:33 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Jan 25, 2024, Mingwei Zhang wrote:
> > > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > > I think this makes a lot of confusions on migration where VMM on the source
> > > > > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > > > > target will find it not true.
> > > >
> > > > Yes, but seeing a non-zero value is a KVM bug that should be fixed.
> > > >
> > > How about adding an entry in vmx_get_msr() for
> > > MSR_IA32_PERF_CAPABILITIES and check pmu_version? This basically pairs
> > > with the implementation in vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES.
> > > Doing so allows KVM_GET_MSRS return 0 for the MSR instead of returning
> > > the initial permitted value.
> >
> > Hrm, I don't hate it as a stopgap. But if we are the only people that are affected,
> > because again I'm pretty sure QEMU is fine, I would rather we just fix things in
> > our VMM and/or internal kernel.
>
> It is not just QEMU. crossvm is another open source VMM that suffers
> from this one.

Does CrosVM support migration or some other form of save/restore (RR?)? And if
so, does CrosVM do that in conjunction with hiding the vPMU from the guest?

Because if not, then I think we can squeak by.

2024-01-29 14:40:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On 1/26/24 20:30, Mingwei Zhang wrote:
>> Hrm, I don't hate it as a stopgap. But if we are the only people that are affected,
>> because again I'm pretty sure QEMU is fine, I would rather we just fix things in
>> our VMM and/or internal kernel.
> It is not just QEMU. crossvm is another open source VMM that suffers
> from this one.

Can you explain the symptoms in both Google's internal VMM and crosvm?

Thanks,

Paolo


2024-01-29 14:44:40

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On 1/24/24 23:51, Sean Christopherson wrote:
>> If we follow the suggestion by removing the initial value at vCPU
>> creation time, then I think it breaks the existing VMM code, since that
>> requires VMM to explicitly set the MSR, which I am not sure we do today.
> Yeah, I'm hoping we can squeak by without breaking existing setups.
>
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
>
> Frankly, if our VMM doesn't do the same, then it's wildly busted. Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.

Yes, I tend to agree.

What QEMU does goes from the squeaky clean to the very debatable
depending on the parameters you give it.

With "-cpu Haswell" and similar, it will provide values for all CPUID
and MSR bits that match as much as possible values from an actual CPU
model. It will complain if there are some values that do not match[1].

With "-cpu host", it will copy values from KVM_GET_SUPPORTED_CPUID and
from the feature MSRs, but only for features that it knows about.

With "-cpu host,migratable=no", it will copy values from
KVM_GET_SUPPORTED_CPUID and from the feature MSRs, but only for *feature
words* (CPUID registers, or MSRs) that it knows about. This is where it
becomes debatable, because a CPUID bit could be added without QEMU
knowing the corresponding MSR. In this case, the user probably expects
the MSR to have a nonzero. On one hand I agree that it would be
irresponsible, on the other hand that's the point of "-cpu
host,migratable=no".

If you want to proceed with the change, I don't have any problem with
considering it a QEMU bug that it doesn't copy over to the guest any
unknown leaves or MSRs.

Paolo

[1] Unfortunately it's not fatal because there are way way too many
models, and also because until recently TCG lacked AVX---and therefore
could only emulate completely some very old CPU models. But with "-cpu
Haswell,enforce" then everything's clean.


2024-01-31 19:43:20

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [PATCH 1/2] KVM: x86/pmu: Reset perf_capabilities in vcpu to 0 if PDCM is disabled

On Mon, Jan 29, 2024, Paolo Bonzini wrote:
> On 1/24/24 23:51, Sean Christopherson wrote:
> > > If we follow the suggestion by removing the initial value at vCPU
> > > creation time, then I think it breaks the existing VMM code, since that
> > > requires VMM to explicitly set the MSR, which I am not sure we do today.
> > Yeah, I'm hoping we can squeak by without breaking existing setups.
> >
> > I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> > since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> > ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
> >
> > Frankly, if our VMM doesn't do the same, then it's wildly busted. Relying on
> > KVM to define the vCPU is irresponsible, to put it nicely.
>
> Yes, I tend to agree.

Discussed with Sean offline. Yes, I also agree that this should be
handled at VMM level. MSR_IA32_PERF_CAPABILITIES should be regarded as
part of the CPUID, or sort of. The diff is that its own
"KVM_GET_SUPPORTED_CPUID" (ie., the default value) should come from
KVM_GET_MSRS of the device ioctl.

Providing the default value for MSR_IA32_PERF_CAPABILITIES is really
making things messed. KVM has to always guard access to the cached guest
value with the checking of X86_FEATURE_PDCM. I believe
guest_cpuid_has(vcpu, X86_FEATURE_PDCM) will take runtime cost.

>
> What QEMU does goes from the squeaky clean to the very debatable depending
> on the parameters you give it.
>
> With "-cpu Haswell" and similar, it will provide values for all CPUID and
> MSR bits that match as much as possible values from an actual CPU model. It
> will complain if there are some values that do not match[1].
>
> With "-cpu host", it will copy values from KVM_GET_SUPPORTED_CPUID and from
> the feature MSRs, but only for features that it knows about.
>
> With "-cpu host,migratable=no", it will copy values from
> KVM_GET_SUPPORTED_CPUID and from the feature MSRs, but only for *feature
> words* (CPUID registers, or MSRs) that it knows about. This is where it
> becomes debatable, because a CPUID bit could be added without QEMU knowing
> the corresponding MSR. In this case, the user probably expects the MSR to
> have a nonzero. On one hand I agree that it would be irresponsible, on the
> other hand that's the point of "-cpu host,migratable=no".
>
> If you want to proceed with the change, I don't have any problem with
> considering it a QEMU bug that it doesn't copy over to the guest any unknown
> leaves or MSRs.
>
reply from another thread: CrosVM issue is not related to this one. It
might have something to do with KVM_GET_MSR_INDEX_LIST. I will come up
details later.
> Paolo
>
> [1] Unfortunately it's not fatal because there are way way too many models,
> and also because until recently TCG lacked AVX---and therefore could only
> emulate completely some very old CPU models. But with "-cpu
> Haswell,enforce" then everything's clean.
>