2024-01-26 12:06:24

by Zhang, Xiong Y

Subject: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

From: Dapeng Mi <[email protected]>

Implement the save/restore of PMU state for the passthrough PMU on Intel. In
passthrough mode, KVM exclusively owns the PMU hardware while control flow is
within the scope of the passthrough PMU. Thus, KVM needs to save the host PMU
state and gain full ownership of the PMU hardware. Conversely, the host
regains ownership of the PMU hardware from KVM when control flow leaves the
scope of the passthrough PMU.

Implement PMU context switches for Intel CPUs and opportunistically use
rdpmcl() instead of rdmsrl() when reading counters, since the former has
lower latency on Intel CPUs.

Co-developed-by: Mingwei Zhang <[email protected]>
Signed-off-by: Mingwei Zhang <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0d58fe7d243e..f79bebe7093d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)

static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u32 i;
+
+ if (pmu->version != 2) {
+ pr_warn("only PerfMon v2 is supported for passthrough PMU");
+ return;
+ }
+
+ /* Global ctrl register is already saved at VM-exit. */
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
+ /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
+ if (pmu->global_status)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
+
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmc = &pmu->gp_counters[i];
+ rdpmcl(i, pmc->counter);
+ rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
+ /*
+ * Clear the hardware PERFMON_EVENTSELx MSR and its counter to avoid
+ * information leakage and to keep this guest GP counter from being
+ * accidentally enabled when the host enables global ctrl.
+ */
+ if (pmc->eventsel)
+ wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
+ if (pmc->counter)
+ wrmsrl(MSR_IA32_PMC0 + i, 0);
+ }
+
+ rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+ /*
+ * Clear the hardware FIXED_CTR_CTRL MSR to avoid information leakage
+ * and to keep these guest fixed counters from being accidentally
+ * enabled when the host enables global ctrl.
+ */
+ if (pmu->fixed_ctr_ctrl)
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmc = &pmu->fixed_counters[i];
+ rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
+ if (pmc->counter)
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
+ }
}

static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ u64 global_status;
+ int i;
+
+ if (pmu->version != 2) {
+ pr_warn("only PerfMon v2 is supported for passthrough PMU");
+ return;
+ }
+
+ /* Clear host global_ctrl and global_status MSR if non-zero. */
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
+ if (global_status)
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
+
+ wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
+
+ for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
+ pmc = &pmu->gp_counters[i];
+ wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
+ wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
+ }
+
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmc = &pmu->fixed_counters[i];
+ wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
+ }
}

struct kvm_pmu_ops intel_pmu_ops __initdata = {
--
2.34.1



2024-04-11 21:27:00

by Sean Christopherson

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> {
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + u32 i;
> +
> + if (pmu->version != 2) {
> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> + return;
> + }
> +
> + /* Global ctrl register is already saved at VM-exit. */
> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> + if (pmu->global_status)
> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> + pmc = &pmu->gp_counters[i];
> + rdpmcl(i, pmc->counter);
> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> + /*
> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> + * leakage and also avoid this guest GP counter get accidentally
> + * enabled during host running when host enable global ctrl.
> + */
> + if (pmc->eventsel)
> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> + if (pmc->counter)
> + wrmsrl(MSR_IA32_PMC0 + i, 0);
> + }
> +
> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> + /*
> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> + * also avoid these guest fixed counters get accidentially enabled
> + * during host running when host enable global ctrl.
> + */
> + if (pmu->fixed_ctr_ctrl)
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> + pmc = &pmu->fixed_counters[i];
> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> + if (pmc->counter)
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> + }

For the next RFC, please make sure that it includes AMD support. Mostly because
I'm pretty sure all of this code can be in common x86. The fixed counters are
ugly, but pmu->nr_arch_fixed_counters is guaranteed to be '0' on AMD, so it's
_just_ ugly, i.e. not functionally problematic.
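
For reference, a rough sketch of what the counter-save portion could look like
in common x86 code per that suggestion (function name and placement are
illustrative, not from the series; the fixed-counter loop is simply a no-op on
AMD because nr_arch_fixed_counters is 0 there):

	static void kvm_pmu_save_guest_counters(struct kvm_vcpu *vcpu)
	{
		struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
		u32 i;

		for (i = 0; i < pmu->nr_arch_gp_counters; i++)
			rdpmcl(i, pmu->gp_counters[i].counter);

		/* Never executes on AMD: nr_arch_fixed_counters is 0. */
		for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
			rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
			       pmu->fixed_counters[i].counter);
	}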

2024-04-11 21:44:44

by Sean Christopherson

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> From: Dapeng Mi <[email protected]>
>
> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> and gains the full HW PMU ownership. On the contrary, host regains the
> ownership of PMU HW from KVM when control flow leaves the scope of
> passthrough PMU.
>
> Implement PMU context switches for Intel CPUs and opptunistically use
> rdpmcl() instead of rdmsrl() when reading counters since the former has
> lower latency in Intel CPUs.
>
> Co-developed-by: Mingwei Zhang <[email protected]>
> Signed-off-by: Mingwei Zhang <[email protected]>
> Signed-off-by: Dapeng Mi <[email protected]>
> ---
> arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 73 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 0d58fe7d243e..f79bebe7093d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>
> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)

I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().

> {
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + u32 i;
> +
> + if (pmu->version != 2) {
> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> + return;
> + }
> +
> + /* Global ctrl register is already saved at VM-exit. */
> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> + if (pmu->global_status)
> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> +
> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> + pmc = &pmu->gp_counters[i];
> + rdpmcl(i, pmc->counter);
> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> + /*
> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> + * leakage and also avoid this guest GP counter get accidentally
> + * enabled during host running when host enable global ctrl.
> + */
> + if (pmc->eventsel)
> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> + if (pmc->counter)
> + wrmsrl(MSR_IA32_PMC0 + i, 0);

This doesn't make much sense. The kernel already has full access to the guest,
I don't see what is gained by zeroing out the MSRs just to hide them from perf.

Similarly, if perf enables a counter via PERF_GLOBAL_CTRL without first restoring
the event selector, we've got problems.

Same thing for the fixed counters below. Can't this just be?

	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
		rdpmcl(i, pmu->gp_counters[i].counter);

	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
		       pmu->fixed_counters[i].counter);

> + }
> +
> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> + /*
> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> + * also avoid these guest fixed counters get accidentially enabled
> + * during host running when host enable global ctrl.
> + */
> + if (pmu->fixed_ctr_ctrl)
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> + pmc = &pmu->fixed_counters[i];
> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> + if (pmc->counter)
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> + }
> }
>
> static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> {
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + u64 global_status;
> + int i;
> +
> + if (pmu->version != 2) {
> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> + return;
> + }
> +
> + /* Clear host global_ctrl and global_status MSR if non-zero. */
> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?

> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> + if (global_status)
> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);

This seems especially silly, isn't the full MSR being written below? Or am I
misunderstanding how these things work?

> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> +
> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> + pmc = &pmu->gp_counters[i];
> + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> + }
> +
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> + pmc = &pmu->fixed_counters[i];
> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> + }
> }
>
> struct kvm_pmu_ops intel_pmu_ops __initdata = {
> --
> 2.34.1
>

2024-04-11 22:20:10

by Jim Mattson

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Thu, Apr 11, 2024 at 2:44 PM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > From: Dapeng Mi <[email protected]>
> >
> > Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> > passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> > the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> > and gains the full HW PMU ownership. On the contrary, host regains the
> > ownership of PMU HW from KVM when control flow leaves the scope of
> > passthrough PMU.
> >
> > Implement PMU context switches for Intel CPUs and opptunistically use
> > rdpmcl() instead of rdmsrl() when reading counters since the former has
> > lower latency in Intel CPUs.
> >
> > Co-developed-by: Mingwei Zhang <[email protected]>
> > Signed-off-by: Mingwei Zhang <[email protected]>
> > Signed-off-by: Dapeng Mi <[email protected]>
> > ---
> > arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 73 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > index 0d58fe7d243e..f79bebe7093d 100644
> > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >
> > static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>
> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
>
> > {
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + u32 i;
> > +
> > + if (pmu->version != 2) {
> > + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > + return;
> > + }
> > +
> > + /* Global ctrl register is already saved at VM-exit. */
> > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> > + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> > + if (pmu->global_status)
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> > +
> > + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > + pmc = &pmu->gp_counters[i];
> > + rdpmcl(i, pmc->counter);
> > + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> > + /*
> > + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> > + * leakage and also avoid this guest GP counter get accidentally
> > + * enabled during host running when host enable global ctrl.
> > + */
> > + if (pmc->eventsel)
> > + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> > + if (pmc->counter)
> > + wrmsrl(MSR_IA32_PMC0 + i, 0);
>
> This doesn't make much sense. The kernel already has full access to the guest,
> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>
> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> the event selector, we gots problems.
>
> Same thing for the fixed counters below. Can't this just be?
>
> for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> rdpmcl(i, pmu->gp_counters[i].counter);
>
> for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> pmu->fixed_counters[i].counter);
>
> > + }
> > +
> > + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > + /*
> > + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> > + * also avoid these guest fixed counters get accidentially enabled
> > + * during host running when host enable global ctrl.
> > + */
> > + if (pmu->fixed_ctr_ctrl)
> > + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> > + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > + pmc = &pmu->fixed_counters[i];
> > + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> > + if (pmc->counter)
> > + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> > + }
> > }
> >
> > static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> > {
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + u64 global_status;
> > + int i;
> > +
> > + if (pmu->version != 2) {
> > + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > + return;
> > + }
> > +
> > + /* Clear host global_ctrl and global_status MSR if non-zero. */
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>
> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>
> > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > + if (global_status)
> > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>
> This seems especially silly, isn't the full MSR being written below? Or am I
> misunderstanding how these things work?

LOL! You expect CPU design to follow basic logic?!?

Writing a 1 to a bit in IA32_PERF_GLOBAL_STATUS_SET sets the
corresponding bit in IA32_PERF_GLOBAL_STATUS to 1.

Writing a 0 to a bit in IA32_PERF_GLOBAL_STATUS_SET is a nop.

To clear a bit in IA32_PERF_GLOBAL_STATUS, you need to write a 1 to
the corresponding bit in IA32_PERF_GLOBAL_STATUS_RESET (aka
IA32_PERF_GLOBAL_OVF_CTRL).
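
Expressed as code, the asymmetry described above looks roughly like this (a
sketch of the MSR semantics only, not code taken from the patch):

	/* Set bit 2 of IA32_PERF_GLOBAL_STATUS: write a 1 to the _SET MSR. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, BIT_ULL(2));

	/* Writing 0s to the _SET MSR is a nop; it cannot clear anything. */

	/* Clear bit 2: write a 1 to OVF_CTRL (aka GLOBAL_STATUS_RESET). */
	wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, BIT_ULL(2));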

> > + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> > +
> > + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > + pmc = &pmu->gp_counters[i];
> > + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> > + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> > + }
> > +
> > + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > + pmc = &pmu->fixed_counters[i];
> > + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> > + }
> > }
> >
> > struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > --
> > 2.34.1
> >

2024-04-11 23:31:50

by Sean Christopherson

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Thu, Apr 11, 2024, Jim Mattson wrote:
> On Thu, Apr 11, 2024 at 2:44 PM Sean Christopherson <[email protected]> wrote:
> > > + /* Clear host global_ctrl and global_status MSR if non-zero. */
> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >
> > Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >
> > > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > > + if (global_status)
> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >
> > This seems especially silly, isn't the full MSR being written below? Or am I
> > misunderstanding how these things work?
>
> LOL! You expect CPU design to follow basic logic?!?
>
> Writing a 1 to a bit in IA32_PERF_GLOBAL_STATUS_SET sets the
> corresponding bit in IA32_PERF_GLOBAL_STATUS to 1.
>
> Writing a 0 to a bit in to IA32_PERF_GLOBAL_STATUS_SET is a nop.
>
> To clear a bit in IA32_PERF_GLOBAL_STATUS, you need to write a 1 to
> the corresponding bit in IA32_PERF_GLOBAL_STATUS_RESET (aka
> IA32_PERF_GLOBAL_OVF_CTRL).

If only C had a way to annotate what the code is doing. :-)

> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);

IIUC, that means this should be:

	if (pmu->global_status)
		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);

or even better:

	toggle = pmu->global_status ^ global_status;
	if (global_status & toggle)
		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
	if (pmu->global_status & toggle)
		wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
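
For completeness, a self-contained sketch of how that reconciliation might be
wrapped up (the helper name is illustrative; it assumes the hardware status is
read here rather than earlier, as in the patch):

	static void pmu_restore_guest_global_status(struct kvm_pmu *pmu)
	{
		u64 hw_status, toggle;

		/* Current hardware status, left over from the host. */
		rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, hw_status);

		/* Bits that differ between hardware and the guest's view. */
		toggle = pmu->global_status ^ hw_status;

		/* Clear bits set in hardware but not in the guest's view. */
		if (hw_status & toggle)
			wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, hw_status & toggle);

		/* Set bits the guest expects but hardware doesn't have. */
		if (pmu->global_status & toggle)
			wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET,
			       pmu->global_status & toggle);
	}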

2024-04-13 03:34:59

by Mingwei Zhang

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Sat, Apr 13, 2024, Mi, Dapeng wrote:
>
> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > From: Dapeng Mi <[email protected]>
> > >
> > > Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> > > passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> > > the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> > > and gains the full HW PMU ownership. On the contrary, host regains the
> > > ownership of PMU HW from KVM when control flow leaves the scope of
> > > passthrough PMU.
> > >
> > > Implement PMU context switches for Intel CPUs and opptunistically use
> > > rdpmcl() instead of rdmsrl() when reading counters since the former has
> > > lower latency in Intel CPUs.
> > >
> > > Co-developed-by: Mingwei Zhang <[email protected]>
> > > Signed-off-by: Mingwei Zhang <[email protected]>
> > > Signed-off-by: Dapeng Mi <[email protected]>
> > > ---
> > > arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 73 insertions(+)
> > >
> > > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > > index 0d58fe7d243e..f79bebe7093d 100644
> > > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > > @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> > > static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> > I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> Yeah. It looks clearer.
> >
> > > {
> > > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > + struct kvm_pmc *pmc;
> > > + u32 i;
> > > +
> > > + if (pmu->version != 2) {
> > > + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > > + return;
> > > + }
> > > +
> > > + /* Global ctrl register is already saved at VM-exit. */
> > > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> > > + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> > > + if (pmu->global_status)
> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> > > +
> > > + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > > + pmc = &pmu->gp_counters[i];
> > > + rdpmcl(i, pmc->counter);
> > > + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> > > + /*
> > > + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> > > + * leakage and also avoid this guest GP counter get accidentally
> > > + * enabled during host running when host enable global ctrl.
> > > + */
> > > + if (pmc->eventsel)
> > > + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> > > + if (pmc->counter)
> > > + wrmsrl(MSR_IA32_PMC0 + i, 0);
> > This doesn't make much sense. The kernel already has full access to the guest,
> > I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>
> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> unexpectedly on host later since Host perf always enable all validate bits
> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>
> Yeah, the clearing for PMCx MSR should be unnecessary.
>

Why is clearing the PMCx MSRs unnecessary? Do we want to leak counter
values to the host? NO. Not in cloud usage.

Please make changes to this patch with **extreme** caution.

According to our past experience, if there is a bug somewhere,
there is a catch here (normally).

Thanks.
-Mingwei
>
> >
> > Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> > the event selector, we gots problems.
> >
> > Same thing for the fixed counters below. Can't this just be?
> >
> > for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> > rdpmcl(i, pmu->gp_counters[i].counter);
> >
> > for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> > rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> > pmu->fixed_counters[i].counter);
> >
> > > + }
> > > +
> > > + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > > + /*
> > > + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> > > + * also avoid these guest fixed counters get accidentially enabled
> > > + * during host running when host enable global ctrl.
> > > + */
> > > + if (pmu->fixed_ctr_ctrl)
> > > + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> > > + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > > + pmc = &pmu->fixed_counters[i];
> > > + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> > > + if (pmc->counter)
> > > + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> > > + }
> > > }
> > > static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> > > {
> > > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > + struct kvm_pmc *pmc;
> > > + u64 global_status;
> > > + int i;
> > > +
> > > + if (pmu->version != 2) {
> > > + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> > > + return;
> > > + }
> > > +
> > > + /* Clear host global_ctrl and global_status MSR if non-zero. */
> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> > Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>
> As previous comments say, host perf always enable all counters in
> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> ensure all counters in disabled state and the later counter manipulation
> (writing MSR) won't cause any race condition or unexpected behavior on HW.
>
>
> >
> > > + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> > > + if (global_status)
> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> > This seems especially silly, isn't the full MSR being written below? Or am I
> > misunderstanding how these things work?
>
> I think Jim's comment has already explain why we need to do this.
>
> >
> > > + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> > > +
> > > + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> > > + pmc = &pmu->gp_counters[i];
> > > + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> > > + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> > > + }
> > > +
> > > + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> > > + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> > > + pmc = &pmu->fixed_counters[i];
> > > + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> > > + }
> > > }
> > > struct kvm_pmu_ops intel_pmu_ops __initdata = {
> > > --
> > > 2.34.1
> > >

2024-04-13 04:36:22

by Mi, Dapeng

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/12/2024 7:31 AM, Sean Christopherson wrote:
> On Thu, Apr 11, 2024, Jim Mattson wrote:
>> On Thu, Apr 11, 2024 at 2:44 PM Sean Christopherson <[email protected]> wrote:
>>>> + /* Clear host global_ctrl and global_status MSR if non-zero. */
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>>>
>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>>>> + if (global_status)
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>>> This seems especially silly, isn't the full MSR being written below? Or am I
>>> misunderstanding how these things work?
>> LOL! You expect CPU design to follow basic logic?!?
>>
>> Writing a 1 to a bit in IA32_PERF_GLOBAL_STATUS_SET sets the
>> corresponding bit in IA32_PERF_GLOBAL_STATUS to 1.
>>
>> Writing a 0 to a bit in to IA32_PERF_GLOBAL_STATUS_SET is a nop.
>>
>> To clear a bit in IA32_PERF_GLOBAL_STATUS, you need to write a 1 to
>> the corresponding bit in IA32_PERF_GLOBAL_STATUS_RESET (aka
>> IA32_PERF_GLOBAL_OVF_CTRL).
> If only C had a way to annotate what the code is doing. :-)
>
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> IIUC, that means this should be:
>
> if (pmu->global_status)
> wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>
> or even better:
>
> toggle = pmu->global_status ^ global_status;
> if (global_status & toggle)
> wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status & toggle);
> if (pmu->global_status & toggle)
> wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status & toggle);
Thanks, that looks better. Since PMU v4+, the MSR
CORE_PERF_GLOBAL_OVF_CTRL has been renamed to CORE_PERF_GLOBAL_STATUS_RESET
and supports more bits. CORE_PERF_GLOBAL_STATUS_RESET is easier to
understand just from its name than CORE_PERF_GLOBAL_OVF_CTRL, so I would
prefer to use that name in the next version.

2024-04-13 06:51:44

by Mi, Dapeng

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> On Sat, Apr 13, 2024, Mi, Dapeng wrote:
>> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>>> From: Dapeng Mi <[email protected]>
>>>>
>>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
>>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
>>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
>>>> and gains the full HW PMU ownership. On the contrary, host regains the
>>>> ownership of PMU HW from KVM when control flow leaves the scope of
>>>> passthrough PMU.
>>>>
>>>> Implement PMU context switches for Intel CPUs and opptunistically use
>>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
>>>> lower latency in Intel CPUs.
>>>>
>>>> Co-developed-by: Mingwei Zhang <[email protected]>
>>>> Signed-off-by: Mingwei Zhang <[email protected]>
>>>> Signed-off-by: Dapeng Mi <[email protected]>
>>>> ---
>>>> arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index 0d58fe7d243e..f79bebe7093d 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>>>> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
>> Yeah. It looks clearer.
>>>> {
>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>> + struct kvm_pmc *pmc;
>>>> + u32 i;
>>>> +
>>>> + if (pmu->version != 2) {
>>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>> + return;
>>>> + }
>>>> +
>>>> + /* Global ctrl register is already saved at VM-exit. */
>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>>>> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>>>> + if (pmu->global_status)
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>>>> +
>>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>> + pmc = &pmu->gp_counters[i];
>>>> + rdpmcl(i, pmc->counter);
>>>> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>>>> + /*
>>>> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>>>> + * leakage and also avoid this guest GP counter get accidentally
>>>> + * enabled during host running when host enable global ctrl.
>>>> + */
>>>> + if (pmc->eventsel)
>>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>>>> + if (pmc->counter)
>>>> + wrmsrl(MSR_IA32_PMC0 + i, 0);
>>> This doesn't make much sense. The kernel already has full access to the guest,
>>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
>> unexpectedly on host later since Host perf always enable all validate bits
>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>>
>> Yeah, the clearing for PMCx MSR should be unnecessary.
>>
> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> values to the host? NO. Not in cloud usage.

No, this place is clearing the guest counter value, not the host counter
value. The host always has a way to see the guest value in a normal VM if
it wants to. I don't see the necessity; it's just overkill and introduces
extra overhead from writing MSRs.


>
> Please make changes to this patch with **extreme** caution.
>
> According to our past experience, if there is a bug somewhere,
> there is a catch here (normally).
>
> Thanks.
> -Mingwei
>>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
>>> the event selector, we gots problems.
>>>
>>> Same thing for the fixed counters below. Can't this just be?
>>>
>>> for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>>> rdpmcl(i, pmu->gp_counters[i].counter);
>>>
>>> for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>> rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
>>> pmu->fixed_counters[i].counter);
>>>
>>>> + }
>>>> +
>>>> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>> + /*
>>>> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>>>> + * also avoid these guest fixed counters get accidentially enabled
>>>> + * during host running when host enable global ctrl.
>>>> + */
>>>> + if (pmu->fixed_ctr_ctrl)
>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> + pmc = &pmu->fixed_counters[i];
>>>> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>>>> + if (pmc->counter)
>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>>>> + }
>>>> }
>>>> static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>>>> {
>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>> + struct kvm_pmc *pmc;
>>>> + u64 global_status;
>>>> + int i;
>>>> +
>>>> + if (pmu->version != 2) {
>>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>> + return;
>>>> + }
>>>> +
>>>> + /* Clear host global_ctrl and global_status MSR if non-zero. */
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>> As previous comments say, host perf always enable all counters in
>> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
>> ensure all counters in disabled state and the later counter manipulation
>> (writing MSR) won't cause any race condition or unexpected behavior on HW.
>>
>>
>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>>>> + if (global_status)
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>>> This seems especially silly, isn't the full MSR being written below? Or am I
>>> misunderstanding how these things work?
>> I think Jim's comment has already explain why we need to do this.
>>
>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>>>> +
>>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>> + pmc = &pmu->gp_counters[i];
>>>> + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
>>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>>>> + }
>>>> +
>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>> + pmc = &pmu->fixed_counters[i];
>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>>>> + }
>>>> }
>>>> struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>>> --
>>>> 2.34.1
>>>>

2024-04-13 07:16:54

by Mi, Dapeng

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/12/2024 5:26 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>> {
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> + struct kvm_pmc *pmc;
>> + u32 i;
>> +
>> + if (pmu->version != 2) {
>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>> + return;
>> + }
>> +
>> + /* Global ctrl register is already saved at VM-exit. */
>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>> + if (pmu->global_status)
>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>> +
>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> + pmc = &pmu->gp_counters[i];
>> + rdpmcl(i, pmc->counter);
>> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>> + /*
>> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>> + * leakage and also avoid this guest GP counter get accidentally
>> + * enabled during host running when host enable global ctrl.
>> + */
>> + if (pmc->eventsel)
>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>> + if (pmc->counter)
>> + wrmsrl(MSR_IA32_PMC0 + i, 0);
>> + }
>> +
>> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>> + /*
>> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>> + * also avoid these guest fixed counters get accidentially enabled
>> + * during host running when host enable global ctrl.
>> + */
>> + if (pmu->fixed_ctr_ctrl)
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> + pmc = &pmu->fixed_counters[i];
>> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>> + if (pmc->counter)
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>> + }
> For the next RFC, please make that it includes AMD support. Mostly because I'm
> pretty all of this code can be in common x86. The fixed counters are ugly,
> but pmu->nr_arch_fixed_counters is guaranteed to '0' on AMD, so it's _just_ ugly,
> i.e. not functionally problematic.

Sure. I believe Mingwei will integrate the AMD support patches in the next
version. Yeah, I agree part of the code could be moved into common x86/pmu,
but there is still some vendor-specific code, so we still keep a
vendor-specific callback.



2024-04-13 09:01:52

by Mi, Dapeng

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>> From: Dapeng Mi <[email protected]>
>>
>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
>> and gains the full HW PMU ownership. On the contrary, host regains the
>> ownership of PMU HW from KVM when control flow leaves the scope of
>> passthrough PMU.
>>
>> Implement PMU context switches for Intel CPUs and opptunistically use
>> rdpmcl() instead of rdmsrl() when reading counters since the former has
>> lower latency in Intel CPUs.
>>
>> Co-developed-by: Mingwei Zhang <[email protected]>
>> Signed-off-by: Mingwei Zhang <[email protected]>
>> Signed-off-by: Dapeng Mi <[email protected]>
>> ---
>> arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 73 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 0d58fe7d243e..f79bebe7093d 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>>
>> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
Yeah. It looks clearer.
>
>> {
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> + struct kvm_pmc *pmc;
>> + u32 i;
>> +
>> + if (pmu->version != 2) {
>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>> + return;
>> + }
>> +
>> + /* Global ctrl register is already saved at VM-exit. */
>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>> + if (pmu->global_status)
>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>> +
>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> + pmc = &pmu->gp_counters[i];
>> + rdpmcl(i, pmc->counter);
>> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>> + /*
>> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>> + * leakage and also avoid this guest GP counter get accidentally
>> + * enabled during host running when host enable global ctrl.
>> + */
>> + if (pmc->eventsel)
>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>> + if (pmc->counter)
>> + wrmsrl(MSR_IA32_PMC0 + i, 0);
> This doesn't make much sense. The kernel already has full access to the guest,
> I don't see what is gained by zeroing out the MSRs just to hide them from perf.

It's necessary to clear the EVENTSELx MSRs for both GP and fixed
counters. Consider this case: the guest uses GP counter 2, but the host
doesn't use it. If the EVENTSEL2 MSR is not cleared here, GP counter 2
would be enabled unexpectedly on the host later, since host perf always
enables all valid bits in the PERF_GLOBAL_CTRL MSR. That would cause
issues.

Yeah, the clearing of the PMCx MSRs should be unnecessary.
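
For illustration, a minimal sketch of the save path under that suggestion,
i.e. clearing only the selector/control MSRs and leaving the counter values
in place (same locals as the patch; not final code, just the shape of it):

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		pmc = &pmu->gp_counters[i];
		rdpmcl(i, pmc->counter);
		rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
		/* Selector must go so host GLOBAL_CTRL can't re-enable it. */
		if (pmc->eventsel)
			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
	}

	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
	if (pmu->fixed_ctr_ctrl)
		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
		       pmu->fixed_counters[i].counter);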


>
> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> the event selector, we gots problems.
>
> Same thing for the fixed counters below. Can't this just be?
>
> for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> rdpmcl(i, pmu->gp_counters[i].counter);
>
> for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> pmu->fixed_counters[i].counter);
>
>> + }
>> +
>> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>> + /*
>> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>> + * also avoid these guest fixed counters get accidentially enabled
>> + * during host running when host enable global ctrl.
>> + */
>> + if (pmu->fixed_ctr_ctrl)
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> + pmc = &pmu->fixed_counters[i];
>> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>> + if (pmc->counter)
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>> + }
>> }
>>
>> static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>> {
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> + struct kvm_pmc *pmc;
>> + u64 global_status;
>> + int i;
>> +
>> + if (pmu->version != 2) {
>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>> + return;
>> + }
>> +
>> + /* Clear host global_ctrl and global_status MSR if non-zero. */
>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?

As the previous comments say, host perf always enables all counters in
PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here
is to ensure all counters are in the disabled state so that the later
counter manipulation (writing MSRs) won't cause any race condition or
unexpected behavior on the HW.


>
>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>> + if (global_status)
>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> This seems especially silly, isn't the full MSR being written below? Or am I
> misunderstanding how these things work?

I think Jim's comment has already explained why we need to do this.

>
>> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>> +
>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>> + pmc = &pmu->gp_counters[i];
>> + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>> + }
>> +
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>> + pmc = &pmu->fixed_counters[i];
>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>> + }
>> }
>>
>> struct kvm_pmu_ops intel_pmu_ops __initdata = {
>> --
>> 2.34.1
>>

2024-04-15 06:07:10

by Mingwei Zhang

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
>
>
> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> > On Sat, Apr 13, 2024, Mi, Dapeng wrote:
> >> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> >>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >>>> From: Dapeng Mi <[email protected]>
> >>>>
> >>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> >>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> >>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> >>>> and gains the full HW PMU ownership. On the contrary, host regains the
> >>>> ownership of PMU HW from KVM when control flow leaves the scope of
> >>>> passthrough PMU.
> >>>>
> >>>> Implement PMU context switches for Intel CPUs and opptunistically use
> >>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
> >>>> lower latency in Intel CPUs.
> >>>>
> >>>> Co-developed-by: Mingwei Zhang <[email protected]>
> >>>> Signed-off-by: Mingwei Zhang <[email protected]>
> >>>> Signed-off-by: Dapeng Mi <[email protected]>
> >>>> ---
> >>>> arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> >>>> 1 file changed, 73 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> index 0d58fe7d243e..f79bebe7093d 100644
> >>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
> >>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >>>> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> >>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> >> Yeah. It looks clearer.
> >>>> {
> >>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>> + struct kvm_pmc *pmc;
> >>>> + u32 i;
> >>>> +
> >>>> + if (pmu->version != 2) {
> >>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>> + return;
> >>>> + }
> >>>> +
> >>>> + /* Global ctrl register is already saved at VM-exit. */
> >>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> >>>> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> >>>> + if (pmu->global_status)
> >>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> >>>> +
> >>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>> + pmc = &pmu->gp_counters[i];
> >>>> + rdpmcl(i, pmc->counter);
> >>>> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> >>>> + /*
> >>>> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> >>>> + * leakage and also avoid this guest GP counter get accidentally
> >>>> + * enabled during host running when host enable global ctrl.
> >>>> + */
> >>>> + if (pmc->eventsel)
> >>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> >>>> + if (pmc->counter)
> >>>> + wrmsrl(MSR_IA32_PMC0 + i, 0);
> >>> This doesn't make much sense. The kernel already has full access to the guest,
> >>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
> >> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> >> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> >> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> >> unexpectedly on host later since Host perf always enable all validate bits
> >> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> >>
> >> Yeah, the clearing for PMCx MSR should be unnecessary .
> >>
> > Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > values to the host? NO. Not in cloud usage.
>
> No, this place is clearing the guest counter value instead of host
> counter value. Host always has method to see guest value in a normal VM
> if he want. I don't see its necessity, it's just a overkill and
> introduce extra overhead to write MSRs.
>

I am curious how the perf subsystem solves this problem. Does the perf
subsystem in the host only scrub the selector but not the counter value
when doing the context switch?

>
> >
> > Please make changes to this patch with **extreme** caution.
> >
> > According to our past experience, if there is a bug somewhere,
> > there is a catch here (normally).
> >
> > Thanks.
> > -Mingwei
> >>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> >>> the event selector, we gots problems.
> >>>
> >>> Same thing for the fixed counters below. Can't this just be?
> >>>
> >>> for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> >>> rdpmcl(i, pmu->gp_counters[i].counter);
> >>>
> >>> for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> >>> rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> >>> pmu->fixed_counters[i].counter);
> >>>
> >>>> + }
> >>>> +
> >>>> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>> + /*
> >>>> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> >>>> + * also avoid these guest fixed counters get accidentially enabled
> >>>> + * during host running when host enable global ctrl.
> >>>> + */
> >>>> + if (pmu->fixed_ctr_ctrl)
> >>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> >>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>> + pmc = &pmu->fixed_counters[i];
> >>>> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> >>>> + if (pmc->counter)
> >>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> >>>> + }
> >>>> }
> >>>> static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >>>> {
> >>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>> + struct kvm_pmc *pmc;
> >>>> + u64 global_status;
> >>>> + int i;
> >>>> +
> >>>> + if (pmu->version != 2) {
> >>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>> + return;
> >>>> + }
> >>>> +
> >>>> + /* Clear host global_ctrl and global_status MSR if non-zero. */
> >>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >>> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >> As previous comments say, host perf always enable all counters in
> >> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> >> ensure all counters in disabled state and the later counter manipulation
> >> (writing MSR) won't cause any race condition or unexpected behavior on HW.
> >>
> >>
> >>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> >>>> + if (global_status)
> >>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >>> This seems especially silly, isn't the full MSR being written below? Or am I
> >>> misunderstanding how these things work?
> >> I think Jim's comment has already explain why we need to do this.
> >>
> >>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> >>>> +
> >>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>> + pmc = &pmu->gp_counters[i];
> >>>> + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> >>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> >>>> + }
> >>>> +
> >>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>> + pmc = &pmu->fixed_counters[i];
> >>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> >>>> + }
> >>>> }
> >>>> struct kvm_pmu_ops intel_pmu_ops __initdata = {
> >>>> --
> >>>> 2.34.1
> >>>>

2024-04-15 14:16:46

by Mi, Dapeng

Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
>>
>> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
>>> On Sat, Apr 13, 2024, Mi, Dapeng wrote:
>>>> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
>>>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
>>>>>> From: Dapeng Mi <[email protected]>
>>>>>>
>>>>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
>>>>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
>>>>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
>>>>>> and gains the full HW PMU ownership. On the contrary, host regains the
>>>>>> ownership of PMU HW from KVM when control flow leaves the scope of
>>>>>> passthrough PMU.
>>>>>>
>>>>>> Implement PMU context switches for Intel CPUs and opptunistically use
>>>>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
>>>>>> lower latency in Intel CPUs.
>>>>>>
>>>>>> Co-developed-by: Mingwei Zhang <[email protected]>
>>>>>> Signed-off-by: Mingwei Zhang <[email protected]>
>>>>>> Signed-off-by: Dapeng Mi <[email protected]>
>>>>>> ---
>>>>>> arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 73 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> index 0d58fe7d243e..f79bebe7093d 100644
>>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
>>>>>> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
>>>>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
>>>> Yeah. It looks clearer.
>>>>>> {
>>>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>>>> + struct kvm_pmc *pmc;
>>>>>> + u32 i;
>>>>>> +
>>>>>> + if (pmu->version != 2) {
>>>>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>>>> + return;
>>>>>> + }
>>>>>> +
>>>>>> + /* Global ctrl register is already saved at VM-exit. */
>>>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
>>>>>> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
>>>>>> + if (pmu->global_status)
>>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
>>>>>> +
>>>>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>>>> + pmc = &pmu->gp_counters[i];
>>>>>> + rdpmcl(i, pmc->counter);
>>>>>> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
>>>>>> + /*
>>>>>> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
>>>>>> + * leakage and also avoid this guest GP counter get accidentally
>>>>>> + * enabled during host running when host enable global ctrl.
>>>>>> + */
>>>>>> + if (pmc->eventsel)
>>>>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
>>>>>> + if (pmc->counter)
>>>>>> + wrmsrl(MSR_IA32_PMC0 + i, 0);
>>>>> This doesn't make much sense. The kernel already has full access to the guest,
>>>>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
>>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
>>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
>>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
>>>> unexpectedly on host later since Host perf always enable all validate bits
>>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>>>>
>>>> Yeah, the clearing for PMCx MSR should be unnecessary .
>>>>
>>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
>>> values to the host? NO. Not in cloud usage.
>> No, this place is clearing the guest counter value instead of host
>> counter value. Host always has method to see guest value in a normal VM
>> if he want. I don't see its necessity, it's just a overkill and
>> introduce extra overhead to write MSRs.
>>
> I am curious how the perf subsystem solves the problem? Does perf
> subsystem in the host only scrubbing the selector but not the counter
> value when doing the context switch?

When a context switch happens, the perf code schedules out the old events
and schedules in the new events. When scheduling out, the ENABLE bit of the
EVENTSELx MSR is cleared; when scheduling in, the EVENTSELx and PMCx MSRs
are overwritten with the new event's attr.config and sample_period
respectively. Of course, this only covers the case where there are new
events to be programmed on the PMC. If there are no new events, the PMCx
MSR keeps its stale value and won't be cleared.

Anyway, I don't see any reason that the PMCx MSR must be cleared.
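
A simplified, hypothetical illustration of that schedule-out/schedule-in
behavior (made-up helper names, counter width masking ignored; this is not
the actual perf code path):

	/* Schedule out: drop only the enable bit, leave the counter alone. */
	static void sched_out_gp_counter(int idx, u64 eventsel)
	{
		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx,
		       eventsel & ~ARCH_PERFMON_EVENTSEL_ENABLE);
	}

	/* Schedule in: reprogram both the counter and the selector. */
	static void sched_in_gp_counter(int idx, u64 config, u64 left)
	{
		/* Program the counter so it overflows after "left" events. */
		wrmsrl(MSR_IA32_PMC0 + idx, (u64)-left);
		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx,
		       config | ARCH_PERFMON_EVENTSEL_ENABLE);
	}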



>
>>> Please make changes to this patch with **extreme** caution.
>>>
>>> According to our past experience, if there is a bug somewhere,
>>> there is a catch here (normally).
>>>
>>> Thanks.
>>> -Mingwei
>>>>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
>>>>> the event selector, we gots problems.
>>>>>
>>>>> Same thing for the fixed counters below. Can't this just be?
>>>>>
>>>>> for (i = 0; i < pmu->nr_arch_gp_counters; i++)
>>>>> rdpmcl(i, pmu->gp_counters[i].counter);
>>>>>
>>>>> for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>>>>> rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
>>>>> pmu->fixed_counters[i].counter);
>>>>>
>>>>>> + }
>>>>>> +
>>>>>> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>>>> + /*
>>>>>> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
>>>>>> + * also avoid these guest fixed counters get accidentially enabled
>>>>>> + * during host running when host enable global ctrl.
>>>>>> + */
>>>>>> + if (pmu->fixed_ctr_ctrl)
>>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>>>> + pmc = &pmu->fixed_counters[i];
>>>>>> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
>>>>>> + if (pmc->counter)
>>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
>>>>>> + }
>>>>>> }
>>>>>> static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>>>> + struct kvm_pmc *pmc;
>>>>>> + u64 global_status;
>>>>>> + int i;
>>>>>> +
>>>>>> + if (pmu->version != 2) {
>>>>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
>>>>>> + return;
>>>>>> + }
>>>>>> +
>>>>>> + /* Clear host global_ctrl and global_status MSR if non-zero. */
>>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>>>>> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
>>>> As previous comments say, host perf always enable all counters in
>>>> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
>>>> ensure all counters in disabled state and the later counter manipulation
>>>> (writing MSR) won't cause any race condition or unexpected behavior on HW.
>>>>
>>>>
>>>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
>>>>>> + if (global_status)
>>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
>>>>> This seems especially silly, isn't the full MSR being written below? Or am I
>>>>> misunderstanding how these things work?
>>>> I think Jim's comment has already explain why we need to do this.
>>>>
>>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
>>>>>> +
>>>>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
>>>>>> + pmc = &pmu->gp_counters[i];
>>>>>> + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
>>>>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
>>>>>> + }
>>>>>> +
>>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
>>>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
>>>>>> + pmc = &pmu->fixed_counters[i];
>>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
>>>>>> + }
>>>>>> }
>>>>>> struct kvm_pmu_ops intel_pmu_ops __initdata = {
>>>>>> --
>>>>>> 2.34.1
>>>>>>

2024-04-15 16:45:40

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <[email protected]> wrote:
>
>
> On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
> >>
> >> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> >>> On Sat, Apr 13, 2024, Mi, Dapeng wrote:
> >>>> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> >>>>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >>>>>> From: Dapeng Mi <[email protected]>
> >>>>>>
> >>>>>> Implement the save/restore of PMU state for pasthrough PMU in Intel. In
> >>>>>> passthrough mode, KVM owns exclusively the PMU HW when control flow goes to
> >>>>>> the scope of passthrough PMU. Thus, KVM needs to save the host PMU state
> >>>>>> and gains the full HW PMU ownership. On the contrary, host regains the
> >>>>>> ownership of PMU HW from KVM when control flow leaves the scope of
> >>>>>> passthrough PMU.
> >>>>>>
> >>>>>> Implement PMU context switches for Intel CPUs and opptunistically use
> >>>>>> rdpmcl() instead of rdmsrl() when reading counters since the former has
> >>>>>> lower latency in Intel CPUs.
> >>>>>>
> >>>>>> Co-developed-by: Mingwei Zhang <[email protected]>
> >>>>>> Signed-off-by: Mingwei Zhang <[email protected]>
> >>>>>> Signed-off-by: Dapeng Mi <[email protected]>
> >>>>>> ---
> >>>>>> arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> >>>>>> 1 file changed, 73 insertions(+)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> >>>>>> index 0d58fe7d243e..f79bebe7093d 100644
> >>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
> >>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> >>>>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >>>>>> static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> >>>>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> >>>> Yeah. It looks clearer.
> >>>>>> {
> >>>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>>>> + struct kvm_pmc *pmc;
> >>>>>> + u32 i;
> >>>>>> +
> >>>>>> + if (pmu->version != 2) {
> >>>>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>>>> + return;
> >>>>>> + }
> >>>>>> +
> >>>>>> + /* Global ctrl register is already saved at VM-exit. */
> >>>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> >>>>>> + /* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> >>>>>> + if (pmu->global_status)
> >>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> >>>>>> +
> >>>>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>>>> + pmc = &pmu->gp_counters[i];
> >>>>>> + rdpmcl(i, pmc->counter);
> >>>>>> + rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> >>>>>> + /*
> >>>>>> + * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> >>>>>> + * leakage and also avoid this guest GP counter get accidentally
> >>>>>> + * enabled during host running when host enable global ctrl.
> >>>>>> + */
> >>>>>> + if (pmc->eventsel)
> >>>>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> >>>>>> + if (pmc->counter)
> >>>>>> + wrmsrl(MSR_IA32_PMC0 + i, 0);
> >>>>> This doesn't make much sense. The kernel already has full access to the guest,
> >>>>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
> >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> >>>> unexpectedly on host later since Host perf always enable all validate bits
> >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> >>>>
> >>>> Yeah, the clearing for PMCx MSR should be unnecessary .
> >>>>
> >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> >>> values to the host? NO. Not in cloud usage.
> >> No, this place is clearing the guest counter value instead of host
> >> counter value. Host always has method to see guest value in a normal VM
> >> if he want. I don't see its necessity, it's just a overkill and
> >> introduce extra overhead to write MSRs.
> >>
> > I am curious how the perf subsystem solves the problem? Does perf
> > subsystem in the host only scrubbing the selector but not the counter
> > value when doing the context switch?
>
> When context switch happens, perf code would schedule out the old events
> and schedule in the new events. When scheduling out, the ENABLE bit of
> EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> and PMCx MSRs would be overwritten with new event's attr.config and
> sample_period separately. Of course, these is only for the case when
> there are new events to be programmed on the PMC. If no new events, the
> PMCx MSR would keep stall value and won't be cleared.
>
> Anyway, I don't see any reason that PMCx MSR must be cleared.
>

I don't have a strong opinion on the upstream version. But since both
the mediated vPMU and perf are clients of PMU HW, leaving PMC values
uncleared when transitioning out of the vPMU boundary is technically
leaking info.

Alternatively, doing the clearing at the vcpu loop boundary should be
sufficient if performance overhead is a concern.

Thanks
-Mingwei
>
>
> >
> >>> Please make changes to this patch with **extreme** caution.
> >>>
> >>> According to our past experience, if there is a bug somewhere,
> >>> there is a catch here (normally).
> >>>
> >>> Thanks.
> >>> -Mingwei
> >>>>> Similarly, if perf enables a counter if PERF_GLOBAL_CTRL without first restoring
> >>>>> the event selector, we gots problems.
> >>>>>
> >>>>> Same thing for the fixed counters below. Can't this just be?
> >>>>>
> >>>>> for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> >>>>> rdpmcl(i, pmu->gp_counters[i].counter);
> >>>>>
> >>>>> for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> >>>>> rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> >>>>> pmu->fixed_counters[i].counter);
> >>>>>
> >>>>>> + }
> >>>>>> +
> >>>>>> + rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>>>> + /*
> >>>>>> + * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> >>>>>> + * also avoid these guest fixed counters get accidentially enabled
> >>>>>> + * during host running when host enable global ctrl.
> >>>>>> + */
> >>>>>> + if (pmu->fixed_ctr_ctrl)
> >>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> >>>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>>>> + pmc = &pmu->fixed_counters[i];
> >>>>>> + rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> >>>>>> + if (pmc->counter)
> >>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> >>>>>> + }
> >>>>>> }
> >>>>>> static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >>>>>> {
> >>>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>>>> + struct kvm_pmc *pmc;
> >>>>>> + u64 global_status;
> >>>>>> + int i;
> >>>>>> +
> >>>>>> + if (pmu->version != 2) {
> >>>>>> + pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>>>> + return;
> >>>>>> + }
> >>>>>> +
> >>>>>> + /* Clear host global_ctrl and global_status MSR if non-zero. */
> >>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >>>>> Why? PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >>>> As previous comments say, host perf always enable all counters in
> >>>> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> >>>> ensure all counters in disabled state and the later counter manipulation
> >>>> (writing MSR) won't cause any race condition or unexpected behavior on HW.
> >>>>
> >>>>
> >>>>>> + rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> >>>>>> + if (global_status)
> >>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >>>>> This seems especially silly, isn't the full MSR being written below? Or am I
> >>>>> misunderstanding how these things work?
> >>>> I think Jim's comment has already explain why we need to do this.
> >>>>
> >>>>>> + wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> >>>>>> +
> >>>>>> + for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>>>> + pmc = &pmu->gp_counters[i];
> >>>>>> + wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> >>>>>> + wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> >>>>>> + }
> >>>>>> +
> >>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>>>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>>>> + pmc = &pmu->fixed_counters[i];
> >>>>>> + wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> >>>>>> + }
> >>>>>> }
> >>>>>> struct kvm_pmu_ops intel_pmu_ops __initdata = {
> >>>>>> --
> >>>>>> 2.34.1
> >>>>>>

2024-04-15 17:39:08

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <[email protected]> wrote:
> > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
> > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > >>>>
> > >>>> Yeah, the clearing for PMCx MSR should be unnecessary .
> > >>>>
> > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > >>> values to the host? NO. Not in cloud usage.
> > >> No, this place is clearing the guest counter value instead of host
> > >> counter value. Host always has method to see guest value in a normal VM
> > >> if he want. I don't see its necessity, it's just a overkill and
> > >> introduce extra overhead to write MSRs.
> > >>
> > > I am curious how the perf subsystem solves the problem? Does perf
> > > subsystem in the host only scrubbing the selector but not the counter
> > > value when doing the context switch?
> >
> > When context switch happens, perf code would schedule out the old events
> > and schedule in the new events. When scheduling out, the ENABLE bit of
> > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > and PMCx MSRs would be overwritten with new event's attr.config and
> > sample_period separately. Of course, these is only for the case when
> > there are new events to be programmed on the PMC. If no new events, the
> > PMCx MSR would keep stall value and won't be cleared.
> >
> > Anyway, I don't see any reason that PMCx MSR must be cleared.
> >
>
> I don't have a strong opinion on the upstream version. But since both
> the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> uncleared when transition out of the vPMU boundary is leaking info
> technically.

I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
in the guest's TCB, which is what I would consider a true leak. I'm objecting
to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
PMCs when saving guest state without coordinating with perf in any way.

I am ok if we start with (or default to) a "safe" implementation that zeroes all
PMCs when switching to host context, but I want KVM and perf to work together to
do the context switches, e.g. so that we don't end up with code where KVM writes
to all PMC MSRs and that perf also immediately writes to all PMC MSRs.

One of my biggest complaints with the current vPMU code is that the roles and
responsibilities between KVM and perf are poorly defined, which leads to suboptimal
and hard to maintain code.

Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
state to userspace processes that have RDPMC permissions, as the PMCs might not
be dirty from perf's perspective (see perf_clear_dirty_counters()).

Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
or if KVM is just being paranoid.

2024-04-18 21:21:37

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Mon, Apr 15, 2024, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <[email protected]> wrote:
> > > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
> > > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > > >>>>
> > > >>>> Yeah, the clearing for PMCx MSR should be unnecessary .
> > > >>>>
> > > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > > >>> values to the host? NO. Not in cloud usage.
> > > >> No, this place is clearing the guest counter value instead of host
> > > >> counter value. Host always has method to see guest value in a normal VM
> > > >> if he want. I don't see its necessity, it's just a overkill and
> > > >> introduce extra overhead to write MSRs.
> > > >>
> > > > I am curious how the perf subsystem solves the problem? Does perf
> > > > subsystem in the host only scrubbing the selector but not the counter
> > > > value when doing the context switch?
> > >
> > > When context switch happens, perf code would schedule out the old events
> > > and schedule in the new events. When scheduling out, the ENABLE bit of
> > > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > > and PMCx MSRs would be overwritten with new event's attr.config and
> > > sample_period separately. Of course, these is only for the case when
> > > there are new events to be programmed on the PMC. If no new events, the
> > > PMCx MSR would keep stall value and won't be cleared.
> > >
> > > Anyway, I don't see any reason that PMCx MSR must be cleared.
> > >
> >
> > I don't have a strong opinion on the upstream version. But since both
> > the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> > uncleared when transition out of the vPMU boundary is leaking info
> > technically.
>
> I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
> in the guest's TCB, which is what I would consider a true leak. I'm objecting
> to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
> PMCs when saving guest state without coordinating with perf in any way.

Agree. Blindly clearing PMCs is the baseline implementation. I am also
thinking about what coordination between perf and KVM could look like.

>
> I am ok if we start with (or default to) a "safe" implementation that zeroes all
> PMCs when switching to host context, but I want KVM and perf to work together to
> do the context switches, e.g. so that we don't end up with code where KVM writes
> to all PMC MSRs and that perf also immediately writes to all PMC MSRs.

Sure. Point taken.
>
> One my biggest complaints with the current vPMU code is that the roles and
> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> and hard to maintain code.

Right.
>
> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> state to userspace processes that have RDPMC permissions, as the PMCs might not
> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>

ah. This is a good point.

switch_mm_irqs_off() =>
  cr4_update_pce_mm() =>
    /*
     * Clear the existing dirty counters to
     * prevent the leak for an RDPMC task.
     */
    perf_clear_dirty_counters()

So perf does clear dirty counter values on process context switch. This
is nice to know.

perf_clear_dirty_counters() clears the counter values according to
cpuc->dirty, except for the assigned counters.
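
Roughly, as a simplified paraphrase for discussion (the real function
also bails out early when the dirty bitmap is empty and skips the
metrics/pseudo counter indices):

/* Simplified paraphrase of perf_clear_dirty_counters(), not the real code. */
static void sketch_clear_dirty_counters(struct cpu_hw_events *cpuc)
{
	int i;

	/* Counters that still have an event assigned are left alone. */
	for (i = 0; i < cpuc->n_events; i++)
		__clear_bit(cpuc->assign[i], cpuc->dirty);

	/* Everything else that was dirtied gets zeroed. */
	for_each_set_bit(i, cpuc->dirty, X86_PMC_IDX_MAX) {
		if (i >= INTEL_PMC_IDX_FIXED)
			wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 +
			       (i - INTEL_PMC_IDX_FIXED), 0);
		else
			wrmsrl(x86_pmu_event_addr(i), 0);
	}

	bitmap_zero(cpuc->dirty, X86_PMC_IDX_MAX);
}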

> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> or if KVM is just being paranoid.

There is a difference between KVM and perf subsystem on PMU context
switch. The latter has the notion of "perf_events", while the former
currently does not. It is quite hard for KVM to know which counters are
really "in use".

Another point I want to raise with you is that the KVM PMU context switch
and the perf PMU context switch happen at different times:

- The former is a context switch between guest/host state of the same
process, happening at VM-enter/exit boundary.
- The latter is a context switch between two host-level processes.
- The former happens before the latter.
- Current design has no PMC partitioning between host/guest due to
arch limitation.

From the above, I feel that it might be impossible to combine them or to
add coordination? Unless we do the KVM PMU context switch at vcpu loop
boundary...

Thanks.
-Mingwei

2024-04-18 21:41:18

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024, Sean Christopherson wrote:
> > On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> > > On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <[email protected]> wrote:
> > > > On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
> > > > > On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
> > > > >>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> > > > >>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
> > > > >>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
> > > > >>>> unexpectedly on host later since Host perf always enable all validate bits
> > > > >>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
> > > > >>>>
> > > > >>>> Yeah, the clearing for PMCx MSR should be unnecessary .
> > > > >>>>
> > > > >>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
> > > > >>> values to the host? NO. Not in cloud usage.
> > > > >> No, this place is clearing the guest counter value instead of host
> > > > >> counter value. Host always has method to see guest value in a normal VM
> > > > >> if he want. I don't see its necessity, it's just a overkill and
> > > > >> introduce extra overhead to write MSRs.
> > > > >>
> > > > > I am curious how the perf subsystem solves the problem? Does perf
> > > > > subsystem in the host only scrubbing the selector but not the counter
> > > > > value when doing the context switch?
> > > >
> > > > When context switch happens, perf code would schedule out the old events
> > > > and schedule in the new events. When scheduling out, the ENABLE bit of
> > > > EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
> > > > and PMCx MSRs would be overwritten with new event's attr.config and
> > > > sample_period separately. Of course, these is only for the case when
> > > > there are new events to be programmed on the PMC. If no new events, the
> > > > PMCx MSR would keep stall value and won't be cleared.
> > > >
> > > > Anyway, I don't see any reason that PMCx MSR must be cleared.
> > > >
> > >
> > > I don't have a strong opinion on the upstream version. But since both
> > > the mediated vPMU and perf are clients of PMU HW, leaving PMC values
> > > uncleared when transition out of the vPMU boundary is leaking info
> > > technically.
> >
> > I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
> > in the guest's TCB, which is what I would consider a true leak. I'm objecting
> > to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
> > PMCs when saving guest state without coordinating with perf in any way.
>
> Agree. blindly clearing PMCs is the basic implementation. I am thinking
> about what coordination between perf and KVM as well.
>
> >
> > I am ok if we start with (or default to) a "safe" implementation that zeroes all
> > PMCs when switching to host context, but I want KVM and perf to work together to
> > do the context switches, e.g. so that we don't end up with code where KVM writes
> > to all PMC MSRs and that perf also immediately writes to all PMC MSRs.
>
> Sure. Point taken.
> >
> > One my biggest complaints with the current vPMU code is that the roles and
> > responsibilities between KVM and perf are poorly defined, which leads to suboptimal
> > and hard to maintain code.
>
> Right.
> >
> > Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
> > state to userspace processes that have RDPMC permissions, as the PMCs might not
> > be dirty from perf's perspective (see perf_clear_dirty_counters()).
> >
>
> ah. This is a good point.
>
> switch_mm_irqs_off() =>
> cr4_update_pce_mm() =>
> /*
> * Clear the existing dirty counters to
> * prevent the leak for an RDPMC task.
> */

FYI, to elaborate on "an RDPMC task":

When you do "echo 2 > /sys/devices/cpu/rdpmc", the kernel sets CR4.PCE to 1.

Once that is done, the rdpmc instruction is no longer a privileged
instruction; all tasks are allowed to execute it from userspace.
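
For illustration, once CR4.PCE is set, an otherwise unprivileged
userspace program can read a counter directly (standalone sketch,
counter index 0 chosen arbitrarily):

#include <stdint.h>
#include <stdio.h>

/* Read PMC 'idx' with the rdpmc instruction; needs CR4.PCE to be set. */
static inline uint64_t read_pmc(uint32_t idx)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	/* If CR4.PCE is clear this faults (SIGSEGV) instead of returning. */
	printf("pmc0 = %llu\n", (unsigned long long)read_pmc(0));
	return 0;
}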

Thanks.
-Mingwei
> perf_clear_dirty_counters()
>
> So perf does clear dirty counter values on process context switch. This
> is nice to know.
>
> perf_clear_dirty_counters() clear the counter values according to
> cpuc->dirty except for those assigned counters.
>
> > Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
> > overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
> > or if KVM is just being paranoid.
>
> There is a difference between KVM and perf subsystem on PMU context
> switch. The latter has the notion of "perf_events", while the former
> currently does not. It is quite hard for KVM to know which counters are
> really "in use".
>
> Another point I want to raise up to you is that, KVM PMU context switch
> and Perf PMU context switch happens at different timing:
>
> - The former is a context switch between guest/host state of the same
> process, happening at VM-enter/exit boundary.
> - The latter is a context switch beteen two host-level processes.
> - The former happens before the latter.
> - Current design has no PMC partitioning between host/guest due to
> arch limitation.
>
> From the above, I feel that it might be impossible to combine them or to
> add coordination? Unless we do the KVM PMU context switch at vcpu loop
> boundary...
>
> Thanks.
> -Mingwei

2024-04-19 02:21:40

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/19/2024 5:21 AM, Mingwei Zhang wrote:
> On Mon, Apr 15, 2024, Sean Christopherson wrote:
>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>> On Mon, Apr 15, 2024 at 3:04 AM Mi, Dapeng <[email protected]> wrote:
>>>> On 4/15/2024 2:06 PM, Mingwei Zhang wrote:
>>>>> On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <[email protected]> wrote:
>>>>>>>> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
>>>>>>>> Considering this case, Guest uses GP counter 2, but Host doesn't use it. So
>>>>>>>> if the EVENTSEL2 MSR is not cleared here, the GP counter 2 would be enabled
>>>>>>>> unexpectedly on host later since Host perf always enable all validate bits
>>>>>>>> in PERF_GLOBAL_CTRL MSR. That would cause issues.
>>>>>>>>
>>>>>>>> Yeah, the clearing for PMCx MSR should be unnecessary .
>>>>>>>>
>>>>>>> Why is clearing for PMCx MSR unnecessary? Do we want to leaking counter
>>>>>>> values to the host? NO. Not in cloud usage.
>>>>>> No, this place is clearing the guest counter value instead of host
>>>>>> counter value. Host always has method to see guest value in a normal VM
>>>>>> if he want. I don't see its necessity, it's just a overkill and
>>>>>> introduce extra overhead to write MSRs.
>>>>>>
>>>>> I am curious how the perf subsystem solves the problem? Does perf
>>>>> subsystem in the host only scrubbing the selector but not the counter
>>>>> value when doing the context switch?
>>>> When context switch happens, perf code would schedule out the old events
>>>> and schedule in the new events. When scheduling out, the ENABLE bit of
>>>> EVENTSELx MSR would be cleared, and when scheduling in, the EVENTSELx
>>>> and PMCx MSRs would be overwritten with new event's attr.config and
>>>> sample_period separately. Of course, these is only for the case when
>>>> there are new events to be programmed on the PMC. If no new events, the
>>>> PMCx MSR would keep stall value and won't be cleared.
>>>>
>>>> Anyway, I don't see any reason that PMCx MSR must be cleared.
>>>>
>>> I don't have a strong opinion on the upstream version. But since both
>>> the mediated vPMU and perf are clients of PMU HW, leaving PMC values
>>> uncleared when transition out of the vPMU boundary is leaking info
>>> technically.
>> I'm not objecting to ensuring guest PMCs can't be read by any entity that's not
>> in the guest's TCB, which is what I would consider a true leak. I'm objecting
>> to blindly clearing all PMCs, and more specifically objecting to *KVM* clearing
>> PMCs when saving guest state without coordinating with perf in any way.
> Agree. blindly clearing PMCs is the basic implementation. I am thinking
> about what coordination between perf and KVM as well.
>
>> I am ok if we start with (or default to) a "safe" implementation that zeroes all
>> PMCs when switching to host context, but I want KVM and perf to work together to
>> do the context switches, e.g. so that we don't end up with code where KVM writes
>> to all PMC MSRs and that perf also immediately writes to all PMC MSRs.
> Sure. Point taken.
>> One my biggest complaints with the current vPMU code is that the roles and
>> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
>> and hard to maintain code.
> Right.
>> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
>> state to userspace processes that have RDPMC permissions, as the PMCs might not
>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>
> ah. This is a good point.
>
> switch_mm_irqs_off() =>
> cr4_update_pce_mm() =>
> /*
> * Clear the existing dirty counters to
> * prevent the leak for an RDPMC task.
> */
> perf_clear_dirty_counters()
>
> So perf does clear dirty counter values on process context switch. This
> is nice to know.
>
> perf_clear_dirty_counters() clear the counter values according to
> cpuc->dirty except for those assigned counters.
>
>> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
>> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
>> or if KVM is just being paranoid.
> There is a difference between KVM and perf subsystem on PMU context
> switch. The latter has the notion of "perf_events", while the former
> currently does not. It is quite hard for KVM to know which counters are
> really "in use".
>
> Another point I want to raise up to you is that, KVM PMU context switch
> and Perf PMU context switch happens at different timing:
>
> - The former is a context switch between guest/host state of the same
> process, happening at VM-enter/exit boundary.
> - The latter is a context switch beteen two host-level processes.
> - The former happens before the latter.
> - Current design has no PMC partitioning between host/guest due to
> arch limitation.
>
> From the above, I feel that it might be impossible to combine them or to
> add coordination? Unless we do the KVM PMU context switch at vcpu loop
> boundary...

It seems there are two ways to clear the PMCx MSRs.

a) KVM clears the PMCx MSRs of the counters used only by the guest when
the VM exits and the guest MSRs are saved. This can be seen as a portion
of the next-step optimization in which only guest-used MSRs are saved
and restored. It would avoid saving/restoring unnecessary MSRs and
decrease the performance impact.

b) The perf subsystem clears these guest-used MSRs, but perf doesn't
know which MSRs were touched by the guest, so KVM has to provide an API
to pass that information to the perf subsystem. Another issue is that
perf does the clearing at task context switch. That may be too late: a
user can read the guest counter value via the rdpmc instruction as long
as the vCPU process is allowed to use rdpmc from user space.

We had a rough internal discussion on this. Option a) seems to be the
better one since it looks simpler and has a clearer boundary. We would
implement it together with the optimization mentioned in option a).
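
A rough sketch of what option a) could look like in the save path,
building on intel_save_pmu_context() from this patch. The
guest_used_mask bitmap is a made-up field that does not exist in the
patchset; it stands in for whatever bookkeeping KVM would use to track
which counters the guest actually programmed:

/* Hypothetical: scrub only the counters the guest actually used. */
static void intel_save_guest_pmu_context(struct kvm_vcpu *vcpu)
{
	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
	struct kvm_pmc *pmc;
	unsigned int i;

	for_each_set_bit(i, pmu->guest_used_mask, pmu->nr_arch_gp_counters) {
		pmc = &pmu->gp_counters[i];
		rdpmcl(i, pmc->counter);
		rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
		/* Clear only what the guest touched. */
		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
		wrmsrl(MSR_IA32_PMC0 + i, 0);
	}
}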


>
> Thanks.
> -Mingwei
>

2024-04-22 02:14:44

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/16 上午6:45, Sean Christopherson wrote:
> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <[email protected]> wrote:
>>> One my biggest complaints with the current vPMU code is that the roles and
>>> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
>>> and hard to maintain code.
>>>
>>> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
>>> state to userspace processes that have RDPMC permissions, as the PMCs might not
>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>
>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
>>> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
>>> or if KVM is just being paranoid.
>>
>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>
> I don't think this is a statement we want to make, as it opens a discussion
> that we won't win. Nor do I think it's one we *need* to make. KVM doesn't need
> to be on equal footing with perf in terms of owning/managing PMU hardware, KVM
> just needs a few APIs to allow faithfully and accurately virtualizing a guest PMU.
>
>> Faithful cleaning (blind cleaning) has to be the baseline
>> implementation, until both clients agree to a "deal" between them.
>> Currently, there is no such deal, but I believe we could have one via
>> future discussion.
>
> What I am saying is that there needs to be a "deal" in place before this code
> is merged. It doesn't need to be anything fancy, e.g. perf can still pave over
> PMCs it doesn't immediately load, as opposed to using cpu_hw_events.dirty to lazily
> do the clearing. But perf and KVM need to work together from the get go, ie. I
> don't want KVM doing something without regard to what perf does, and vice versa.
>
There is a similar issue on the LoongArch vPMU, where the VM can access
PMU hardware directly and the PMU HW is shared between guest and host.
Besides context switch, there are other places where the perf core will
access PMU HW, such as tick timer/hrtimer/IPI function calls, and KVM
can only intercept the context switch.

Can we add a callback handler in struct kvm_guest_cbs? Just like this:
@@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
 	.state			= kvm_guest_state,
 	.get_ip			= kvm_guest_get_ip,
 	.handle_intel_pt_intr	= NULL,
+	.lose_pmu		= kvm_guest_lose_pmu,
 };

By the way, I do not know whether the callback handler should be
triggered in the perf core or in the specific PMU HW driver. In the ARM
PMU HW driver it is triggered in the driver itself, e.g. in
kvm_vcpu_pmu_resync_el0(), but I think it would be better if it is done
in the perf core.
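
Purely as an illustration of the idea (neither the .lose_pmu hook nor
kvm_guest_lose_pmu() exists today, and the pmu_passthrough_active flag
and kvm_pmu_save_guest_context() helper below are made up), the KVM side
might look something like:

static void kvm_guest_lose_pmu(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (!vcpu || !vcpu->arch.pmu_passthrough_active)
		return;

	/* Save guest PMU state and hand the HW back to the host. */
	kvm_pmu_save_guest_context(vcpu);
	vcpu->arch.pmu_passthrough_active = false;
}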

Regards
Bibo Mao


2024-04-23 01:02:11

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 上午1:01, Sean Christopherson wrote:
> On Mon, Apr 22, 2024, maobibo wrote:
>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson <[email protected]> wrote:
>>>>> One my biggest complaints with the current vPMU code is that the roles and
>>>>> responsibilities between KVM and perf are poorly defined, which leads to suboptimal
>>>>> and hard to maintain code.
>>>>>
>>>>> Case in point, I'm pretty sure leaving guest values in PMCs _would_ leak guest
>>>>> state to userspace processes that have RDPMC permissions, as the PMCs might not
>>>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>>>
>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing so makes the
>>>>> overall code brittle because it's not clear whether KVM _needs_ to clear PMCs,
>>>>> or if KVM is just being paranoid.
>>>>
>>>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>>>
>>> I don't think this is a statement we want to make, as it opens a discussion
>>> that we won't win. Nor do I think it's one we *need* to make. KVM doesn't need
>>> to be on equal footing with perf in terms of owning/managing PMU hardware, KVM
>>> just needs a few APIs to allow faithfully and accurately virtualizing a guest PMU.
>>>
>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>> implementation, until both clients agree to a "deal" between them.
>>>> Currently, there is no such deal, but I believe we could have one via
>>>> future discussion.
>>>
>>> What I am saying is that there needs to be a "deal" in place before this code
>>> is merged. It doesn't need to be anything fancy, e.g. perf can still pave over
>>> PMCs it doesn't immediately load, as opposed to using cpu_hw_events.dirty to lazily
>>> do the clearing. But perf and KVM need to work together from the get go, ie. I
>>> don't want KVM doing something without regard to what perf does, and vice versa.
>>>
>> There is similar issue on LoongArch vPMU where vm can directly pmu hardware
>> and pmu hw is shard with guest and host. Besides context switch there are
>> other places where perf core will access pmu hw, such as tick
>> timer/hrtimer/ipi function call, and KVM can only intercept context switch.
>
> Two questions:
>
> 1) Can KVM prevent the guest from accessing the PMU?
>
> 2) If so, KVM can grant partial access to the PMU, or is it all or nothing?
>
> If the answer to both questions is "yes", then it sounds like LoongArch *requires*
> mediated/passthrough support in order to virtualize its PMU.

Hi Sean,

Thanks for your quick response.

Yes, KVM can prevent the guest from accessing the PMU and can grant
partial or full access to the PMU. It is only that, if one PMU event is
granted to the VM, the host cannot access that PMU event any more; there
must be a PMU event switch if the host wants to.

>
>> Can we add callback handler in structure kvm_guest_cbs? just like this:
>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs
>> = {
>> .state = kvm_guest_state,
>> .get_ip = kvm_guest_get_ip,
>> .handle_intel_pt_intr = NULL,
>> + .lose_pmu = kvm_guest_lose_pmu,
>> };
>>
>> By the way, I do not know should the callback handler be triggered in perf
>> core or detailed pmu hw driver. From ARM pmu hw driver, it is triggered in
>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>> but I think it will be better if it is done in perf core.
>
> I don't think we want to take the approach of perf and KVM guests "fighting" over
> the PMU. That's effectively what we have today, and it's a mess for KVM because
> it's impossible to provide consistent, deterministic behavior for the guest. And
> it's just as messy for perf, which ends up having wierd, cumbersome flows that
> exists purely to try to play nice with KVM.
With the existing PMU core code, the PMU HW may be accessed by the host
in a tick timer interrupt or IPI function call interrupt while the VM is
running and the PMU has already been granted to the guest. KVM cannot
intercept host IPI/timer interrupts, there is no PMU context switch, so
there will be a problem.

Regards
Bibo Mao


2024-04-23 02:55:32

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>
> On 4/23/2024 9:01 AM, maobibo wrote:
>>
>>
>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>> <[email protected]> wrote:
>>>>>>> One my biggest complaints with the current vPMU code is that the
>>>>>>> roles and
>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>> leads to suboptimal
>>>>>>> and hard to maintain code.
>>>>>>>
>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>> _would_ leak guest
>>>>>>> state to userspace processes that have RDPMC permissions, as the
>>>>>>> PMCs might not
>>>>>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>>>>>
>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing
>>>>>>> so makes the
>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>> to clear PMCs,
>>>>>>> or if KVM is just being paranoid.
>>>>>>
>>>>>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>>>>>
>>>>> I don't think this is a statement we want to make, as it opens a
>>>>> discussion
>>>>> that we won't win.  Nor do I think it's one we *need* to make.  KVM
>>>>> doesn't need
>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>> hardware, KVM
>>>>> just needs a few APIs to allow faithfully and accurately
>>>>> virtualizing a guest PMU.
>>>>>
>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>> Currently, there is no such deal, but I believe we could have one via
>>>>>> future discussion.
>>>>>
>>>>> What I am saying is that there needs to be a "deal" in place before
>>>>> this code
>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>> still pave over
>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>> cpu_hw_events.dirty to lazily
>>>>> do the clearing.  But perf and KVM need to work together from the
>>>>> get go, ie. I
>>>>> don't want KVM doing something without regard to what perf does,
>>>>> and vice versa.
>>>>>
>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>> hardware
>>>> and pmu hw is shard with guest and host. Besides context switch
>>>> there are
>>>> other places where perf core will access pmu hw, such as tick
>>>> timer/hrtimer/ipi function call, and KVM can only intercept context
>>>> switch.
>>>
>>> Two questions:
>>>
>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>
>>>   2) If so, KVM can grant partial access to the PMU, or is it all or
>>> nothing?
>>>
>>> If the answer to both questions is "yes", then it sounds like
>>> LoongArch *requires*
>>> mediated/passthrough support in order to virtualize its PMU.
>>
>> Hi Sean,
>>
>> Thank for your quick response.
>>
>> yes, kvm can prevent guest from accessing the PMU and grant partial or
>> all to access to the PMU. Only that if one pmu event is granted to VM,
>> host can not access this pmu event again. There must be pmu event
>> switch if host want to.
>
> PMU event is a software entity which won't be shared. did you mean if a
> PMU HW counter is granted to VM, then Host can't access the PMU HW
> counter, right?
Yes, if a PMU HW counter/control is granted to the VM, its value comes
from the guest and is not meaningful for the host. The host PMU core
does not know that it has been granted to the VM; the host still thinks
that it owns the PMU.

Just like the FPU registers, the PMU is shared by the VM and the host at
different times and is switched lazily. But if an IPI or timer interrupt
uses the FPU registers on the host, there will be the same issue.

Regards
Bibo Mao
>
>
>>
>>>
>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>>> this:
>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>> kvm_guest_cbs
>>>> = {
>>>>          .state                  = kvm_guest_state,
>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>          .handle_intel_pt_intr   = NULL,
>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>   };
>>>>
>>>> By the way, I do not know should the callback handler be triggered
>>>> in perf
>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>> triggered in
>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>> but I think it will be better if it is done in perf core.
>>>
>>> I don't think we want to take the approach of perf and KVM guests
>>> "fighting" over
>>> the PMU.  That's effectively what we have today, and it's a mess for
>>> KVM because
>>> it's impossible to provide consistent, deterministic behavior for the
>>> guest.  And
>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>> flows that
>>> exists purely to try to play nice with KVM.
>> With existing pmu core code, in tick timer interrupt or IPI function
>> call interrupt pmu hw may be accessed by host when VM is running and
>> pmu is already granted to guest. KVM can not intercept host IPI/timer
>> interrupt, there is no pmu context switch, there will be problem.
>>
>> Regards
>> Bibo Mao
>>


2024-04-23 03:26:40

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>
> On 4/23/2024 10:53 AM, maobibo wrote:
>>
>>
>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>> the roles and
>>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>>> leads to suboptimal
>>>>>>>>> and hard to maintain code.
>>>>>>>>>
>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>> _would_ leak guest
>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>> the PMCs might not
>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>
>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>> doing so makes the
>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>>> to clear PMCs,
>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>
>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>> PMU HW.
>>>>>>>
>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>> discussion
>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>> KVM doesn't need
>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>> hardware, KVM
>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>> virtualizing a guest PMU.
>>>>>>>
>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>> one via
>>>>>>>> future discussion.
>>>>>>>
>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>> before this code
>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>> still pave over
>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>> do the clearing.  But perf and KVM need to work together from the
>>>>>>> get go, ie. I
>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>> and vice versa.
>>>>>>>
>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>>> hardware
>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>> there are
>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>> context switch.
>>>>>
>>>>> Two questions:
>>>>>
>>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>>
>>>>>   2) If so, KVM can grant partial access to the PMU, or is it all
>>>>> or nothing?
>>>>>
>>>>> If the answer to both questions is "yes", then it sounds like
>>>>> LoongArch *requires*
>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>
>>>> Hi Sean,
>>>>
>>>> Thank for your quick response.
>>>>
>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>> to VM, host can not access this pmu event again. There must be pmu
>>>> event switch if host want to.
>>>
>>> PMU event is a software entity which won't be shared. did you mean if
>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>> counter, right?
>> yes, if PMU HW counter/control is granted to VM. The value comes from
>> guest, and is not meaningful for host.  Host pmu core does not know
>> that it is granted to VM, host still think that it owns pmu.
>
> That's one issue this patchset tries to solve. Current new mediated x86
> vPMU framework doesn't allow Host or Guest own the PMU HW resource
> simultaneously. Only when there is no !exclude_guest event on host,
> guest is allowed to exclusively own the PMU HW resource.
>
>
>>
>> Just like FPU register, it is shared by VM and host during different
>> time and it is lately switched. But if IPI or timer interrupt uses FPU
>> register on host, there will be the same issue.
>
> I didn't fully get your point. When IPI or timer interrupt reach, a
> VM-exit is triggered to make CPU traps into host first and then the host
> interrupt handler is called. Or are you complaining the executing
> sequence of switching guest PMU MSRs and these interrupt handler?
It is not necessary to save/restore the PMU HW at every VM exit; it had
better be saved/restored lazily, such as only when the vcpu thread is
sched-out/sched-in, otherwise the cost will be a little expensive.
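
For example, the switch could be hooked into the existing points where a
vcpu thread is scheduled in and out; kvm_arch_vcpu_load() and
kvm_arch_vcpu_put() are the existing per-arch hooks for that, while the
two PMU helpers below are hypothetical:

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
	/* ... existing sched-in work ... */
	kvm_pmu_restore_guest_context(vcpu);	/* hypothetical helper */
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	kvm_pmu_save_guest_context(vcpu);	/* hypothetical helper */
	/* ... existing sched-out work ... */
}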

I know little about the perf core. However, there is PMU HW access in
interrupt context. That means PMU HW access in normal context should be
done with IRQs disabled, otherwise there may be nested PMU HW accesses.
Is that true?

>
>
>>
>> Regards
>> Bibo Mao
>>>
>>>
>>>>
>>>>>
>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>>>>> this:
>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>> kvm_guest_cbs
>>>>>> = {
>>>>>>          .state                  = kvm_guest_state,
>>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>>          .handle_intel_pt_intr   = NULL,
>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>   };
>>>>>>
>>>>>> By the way, I do not know should the callback handler be triggered
>>>>>> in perf
>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>> triggered in
>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>> but I think it will be better if it is done in perf core.
>>>>>
>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>> "fighting" over
>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>> for KVM because
>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>> the guest.  And
>>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>>> flows that
>>>>> exists purely to try to play nice with KVM.
>>>> With existing pmu core code, in tick timer interrupt or IPI function
>>>> call interrupt pmu hw may be accessed by host when VM is running and
>>>> pmu is already granted to guest. KVM can not intercept host
>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>> problem.
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>
>>


2024-04-23 03:55:54

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>
> On 4/23/2024 10:53 AM, maobibo wrote:
>>
>>
>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>> the roles and
>>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>>> leads to suboptimal
>>>>>>>>> and hard to maintain code.
>>>>>>>>>
>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>> _would_ leak guest
>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>> the PMCs might not
>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>
>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>> doing so makes the
>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>>> to clear PMCs,
>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>
>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>> PMU HW.
>>>>>>>
>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>> discussion
>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>> KVM doesn't need
>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>> hardware, KVM
>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>> virtualizing a guest PMU.
>>>>>>>
>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>> one via
>>>>>>>> future discussion.
>>>>>>>
>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>> before this code
>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>> still pave over
>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>> do the clearing.  But perf and KVM need to work together from the
>>>>>>> get go, ie. I
>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>> and vice versa.
>>>>>>>
>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>>> hardware
>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>> there are
>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>> context switch.
>>>>>
>>>>> Two questions:
>>>>>
>>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>>
>>>>>   2) If so, KVM can grant partial access to the PMU, or is it all
>>>>> or nothing?
>>>>>
>>>>> If the answer to both questions is "yes", then it sounds like
>>>>> LoongArch *requires*
>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>
>>>> Hi Sean,
>>>>
>>>> Thank for your quick response.
>>>>
>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>> to VM, host can not access this pmu event again. There must be pmu
>>>> event switch if host want to.
>>>
>>> PMU event is a software entity which won't be shared. did you mean if
>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>> counter, right?
>> yes, if PMU HW counter/control is granted to VM. The value comes from
>> guest, and is not meaningful for host.  Host pmu core does not know
>> that it is granted to VM, host still think that it owns pmu.
>
> That's one issue this patchset tries to solve. Current new mediated x86
> vPMU framework doesn't allow Host or Guest own the PMU HW resource
> simultaneously. Only when there is no !exclude_guest event on host,
> guest is allowed to exclusively own the PMU HW resource.
>
>
>>
>> Just like FPU register, it is shared by VM and host during different
>> time and it is lately switched. But if IPI or timer interrupt uses FPU
>> register on host, there will be the same issue.
>
> I didn't fully get your point. When IPI or timer interrupt reach, a
> VM-exit is triggered to make CPU traps into host first and then the host
yes, it is.

> interrupt handler is called. Or are you complaining the executing
> sequence of switching guest PMU MSRs and these interrupt handler?
In our vPMU implementation, it is OK if the vPMU is switched in the
VM-exit path; however, there is a problem if the vPMU is switched in the
vcpu thread sched-out/sched-in path, since IPI/timer irq interrupts
access PMU registers in host mode.

In general it would be better if the switch is done at vcpu thread
sched-out/sched-in, unless there is a requirement to profile the KVM
hypervisor. Even if there is such a requirement, it is only one option.
In most conditions it is better if the time spent in a VM context exit
is small.

>
>
>>
>> Regards
>> Bibo Mao
>>>
>>>
>>>>
>>>>>
>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>>>>> this:
>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>> kvm_guest_cbs
>>>>>> = {
>>>>>>          .state                  = kvm_guest_state,
>>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>>          .handle_intel_pt_intr   = NULL,
>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>   };
>>>>>>
>>>>>> By the way, I do not know should the callback handler be triggered
>>>>>> in perf
>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>> triggered in
>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>> but I think it will be better if it is done in perf core.
>>>>>
>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>> "fighting" over
>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>> for KVM because
>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>> the guest.  And
>>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>>> flows that
>>>>> exists purely to try to play nice with KVM.
>>>> With existing pmu core code, in tick timer interrupt or IPI function
>>>> call interrupt pmu hw may be accessed by host when VM is running and
>>>> pmu is already granted to guest. KVM can not intercept host
>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>> problem.
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>
>>


2024-04-23 04:24:30

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Mon, Apr 22, 2024 at 8:55 PM maobibo <[email protected]> wrote:
>
>
>
> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
> >
> > On 4/23/2024 10:53 AM, maobibo wrote:
> >>
> >>
> >> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
> >>>
> >>> On 4/23/2024 9:01 AM, maobibo wrote:
> >>>>
> >>>>
> >>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
> >>>>> On Mon, Apr 22, 2024, maobibo wrote:
> >>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
> >>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
> >>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>> One my biggest complaints with the current vPMU code is that
> >>>>>>>>> the roles and
> >>>>>>>>> responsibilities between KVM and perf are poorly defined, which
> >>>>>>>>> leads to suboptimal
> >>>>>>>>> and hard to maintain code.
> >>>>>>>>>
> >>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
> >>>>>>>>> _would_ leak guest
> >>>>>>>>> state to userspace processes that have RDPMC permissions, as
> >>>>>>>>> the PMCs might not
> >>>>>>>>> be dirty from perf's perspective (see
> >>>>>>>>> perf_clear_dirty_counters()).
> >>>>>>>>>
> >>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
> >>>>>>>>> doing so makes the
> >>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
> >>>>>>>>> to clear PMCs,
> >>>>>>>>> or if KVM is just being paranoid.
> >>>>>>>>
> >>>>>>>> So once this rolls out, perf and vPMU are clients directly to
> >>>>>>>> PMU HW.
> >>>>>>>
> >>>>>>> I don't think this is a statement we want to make, as it opens a
> >>>>>>> discussion
> >>>>>>> that we won't win. Nor do I think it's one we *need* to make.
> >>>>>>> KVM doesn't need
> >>>>>>> to be on equal footing with perf in terms of owning/managing PMU
> >>>>>>> hardware, KVM
> >>>>>>> just needs a few APIs to allow faithfully and accurately
> >>>>>>> virtualizing a guest PMU.
> >>>>>>>
> >>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
> >>>>>>>> implementation, until both clients agree to a "deal" between them.
> >>>>>>>> Currently, there is no such deal, but I believe we could have
> >>>>>>>> one via
> >>>>>>>> future discussion.
> >>>>>>>
> >>>>>>> What I am saying is that there needs to be a "deal" in place
> >>>>>>> before this code
> >>>>>>> is merged. It doesn't need to be anything fancy, e.g. perf can
> >>>>>>> still pave over
> >>>>>>> PMCs it doesn't immediately load, as opposed to using
> >>>>>>> cpu_hw_events.dirty to lazily
> >>>>>>> do the clearing. But perf and KVM need to work together from the
> >>>>>>> get go, ie. I
> >>>>>>> don't want KVM doing something without regard to what perf does,
> >>>>>>> and vice versa.
> >>>>>>>
> >>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
> >>>>>> hardware
> >>>>>> and pmu hw is shard with guest and host. Besides context switch
> >>>>>> there are
> >>>>>> other places where perf core will access pmu hw, such as tick
> >>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
> >>>>>> context switch.
> >>>>>
> >>>>> Two questions:
> >>>>>
> >>>>> 1) Can KVM prevent the guest from accessing the PMU?
> >>>>>
> >>>>> 2) If so, KVM can grant partial access to the PMU, or is it all
> >>>>> or nothing?
> >>>>>
> >>>>> If the answer to both questions is "yes", then it sounds like
> >>>>> LoongArch *requires*
> >>>>> mediated/passthrough support in order to virtualize its PMU.
> >>>>
> >>>> Hi Sean,
> >>>>
> >>>> Thank for your quick response.
> >>>>
> >>>> yes, kvm can prevent guest from accessing the PMU and grant partial
> >>>> or all to access to the PMU. Only that if one pmu event is granted
> >>>> to VM, host can not access this pmu event again. There must be pmu
> >>>> event switch if host want to.
> >>>
> >>> PMU event is a software entity which won't be shared. did you mean if
> >>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
> >>> counter, right?
> >> yes, if PMU HW counter/control is granted to VM. The value comes from
> >> guest, and is not meaningful for host. Host pmu core does not know
> >> that it is granted to VM, host still think that it owns pmu.
> >
> > That's one issue this patchset tries to solve. Current new mediated x86
> > vPMU framework doesn't allow Host or Guest own the PMU HW resource
> > simultaneously. Only when there is no !exclude_guest event on host,
> > guest is allowed to exclusively own the PMU HW resource.
> >
> >
> >>
> >> Just like FPU register, it is shared by VM and host during different
> >> time and it is lately switched. But if IPI or timer interrupt uses FPU
> >> register on host, there will be the same issue.
> >
> > I didn't fully get your point. When IPI or timer interrupt reach, a
> > VM-exit is triggered to make CPU traps into host first and then the host
> yes, it is.

This is correct. And this is one of the points that we debated
internally: whether we should do the PMU context switch at the vcpu loop
boundary or at the VM Enter/Exit boundary. The (host-level) timer
interrupt can force a VM Exit, which I think happens every 4ms or 1ms,
depending on configuration.

One of the key reasons we currently propose this is that it is the
same boundary as the legacy PMU, i.e., it would be simple to propose
from the perf subsystem perspective.

Performance-wise, doing the PMU context switch at the vcpu boundary would
be way better in general. But the downside is that the perf subsystem
loses the capability to profile the majority of the KVM code (functions)
when the guest PMU is enabled.

>
> > interrupt handler is called. Or are you complaining the executing
> > sequence of switching guest PMU MSRs and these interrupt handler?
> In our vPMU implementation, it is ok if vPMU is switched in vm exit
> path, however there is problem if vPMU is switched during vcpu thread
> sched-out/sched-in path since IPI/timer irq interrupt access pmu
> register in host mode.

Oh, the IPI/timer irq handlers will access PMU registers? I thought
only the host-level NMI handler would access the PMU MSRs, since the PMI
is registered under NMI.

In that case, you should disable IRQs during the vcpu context switch. For
NMI, we prevent its handler from accessing the PMU registers. In
particular, we use a per-cpu variable to guard that. So, the
host-level PMI handler for the perf subsystem will check the variable
before proceeding.
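
A minimal sketch of that per-cpu guard idea (the names below are
illustrative only, not the actual symbols used by this patchset):

static DEFINE_PER_CPU(bool, kvm_guest_owns_pmu);

/* Called by KVM around the region in which the guest owns the PMU HW. */
static inline void kvm_set_guest_owns_pmu(bool owned)
{
	__this_cpu_write(kvm_guest_owns_pmu, owned);
}

/* Host-level PMI handler (runs as NMI) checks the guard before touching MSRs. */
static int host_pmi_handler(void)
{
	if (__this_cpu_read(kvm_guest_owns_pmu))
		return 0;	/* PMU is owned by the guest; leave the PMU MSRs alone */

	/* ... normal perf PMI handling, rdmsrl()/wrmsrl() on PMU MSRs ... */
	return 1;
}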

>
> In general it will be better if the switch is done in vcpu thread
> sched-out/sched-in, else there is requirement to profile kvm
> hypervisor.Even there is such requirement, it is only one option. In
> most conditions, it will better if time of VM context exit is small.
>
Performance-wise, agreed, but there will be debate on the perf
functionality loss at the host level.

Maybe (just maybe) it is possible to do the PMU context switch at the vcpu
boundary normally, but do it at the VM Enter/Exit boundary when the host
is profiling the KVM kernel module. So, dynamically adjusting the PMU
context switch location could be an option.
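
A rough sketch of what that dynamic adjustment could look like (all names
below are hypothetical; whether perf would accept such a mode is exactly
the open question):

/* Hypothetical count of host events that must keep counting across KVM code. */
static atomic_t nr_host_include_guest_events;

static bool host_profiling_kvm(void)
{
	return atomic_read(&nr_host_include_guest_events) > 0;
}

static void mediated_pmu_vm_exit(struct kvm_vcpu *vcpu)
{
	if (host_profiling_kvm())
		kvm_pmu_save_guest_context(vcpu);	/* eager: on every VM Exit */
	/* otherwise defer the save until vcpu sched-out */
}

static void mediated_pmu_vm_enter(struct kvm_vcpu *vcpu)
{
	if (host_profiling_kvm())
		kvm_pmu_restore_guest_context(vcpu);	/* eager: on every VM Enter */
	/* otherwise the guest context was already restored at vcpu sched-in */
}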

> >
> >
> >>
> >> Regards
> >> Bibo Mao
> >>>
> >>>
> >>>>
> >>>>>
> >>>>>> Can we add callback handler in structure kvm_guest_cbs? just like
> >>>>>> this:
> >>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
> >>>>>> kvm_guest_cbs
> >>>>>> = {
> >>>>>> .state = kvm_guest_state,
> >>>>>> .get_ip = kvm_guest_get_ip,
> >>>>>> .handle_intel_pt_intr = NULL,
> >>>>>> + .lose_pmu = kvm_guest_lose_pmu,
> >>>>>> };
> >>>>>>
> >>>>>> By the way, I do not know should the callback handler be triggered
> >>>>>> in perf
> >>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
> >>>>>> triggered in
> >>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
> >>>>>> but I think it will be better if it is done in perf core.
> >>>>>
> >>>>> I don't think we want to take the approach of perf and KVM guests
> >>>>> "fighting" over
> >>>>> the PMU. That's effectively what we have today, and it's a mess
> >>>>> for KVM because
> >>>>> it's impossible to provide consistent, deterministic behavior for
> >>>>> the guest. And
> >>>>> it's just as messy for perf, which ends up having wierd, cumbersome
> >>>>> flows that
> >>>>> exists purely to try to play nice with KVM.
> >>>> With existing pmu core code, in tick timer interrupt or IPI function
> >>>> call interrupt pmu hw may be accessed by host when VM is running and
> >>>> pmu is already granted to guest. KVM can not intercept host
> >>>> IPI/timer interrupt, there is no pmu context switch, there will be
> >>>> problem.
> >>>>
> >>>> Regards
> >>>> Bibo Mao
> >>>>
> >>
>

2024-04-23 05:11:47

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/23/2024 9:01 AM, maobibo wrote:
>
>
> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>> On Mon, Apr 22, 2024, maobibo wrote:
>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>> <[email protected]> wrote:
>>>>>> One my biggest complaints with the current vPMU code is that the
>>>>>> roles and
>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>> leads to suboptimal
>>>>>> and hard to maintain code.
>>>>>>
>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>> _would_ leak guest
>>>>>> state to userspace processes that have RDPMC permissions, as the
>>>>>> PMCs might not
>>>>>> be dirty from perf's perspective (see perf_clear_dirty_counters()).
>>>>>>
>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in doing
>>>>>> so makes the
>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>> to clear PMCs,
>>>>>> or if KVM is just being paranoid.
>>>>>
>>>>> So once this rolls out, perf and vPMU are clients directly to PMU HW.
>>>>
>>>> I don't think this is a statement we want to make, as it opens a
>>>> discussion
>>>> that we won't win.  Nor do I think it's one we *need* to make.  KVM
>>>> doesn't need
>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>> hardware, KVM
>>>> just needs a few APIs to allow faithfully and accurately
>>>> virtualizing a guest PMU.
>>>>
>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>> implementation, until both clients agree to a "deal" between them.
>>>>> Currently, there is no such deal, but I believe we could have one via
>>>>> future discussion.
>>>>
>>>> What I am saying is that there needs to be a "deal" in place before
>>>> this code
>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>> still pave over
>>>> PMCs it doesn't immediately load, as opposed to using
>>>> cpu_hw_events.dirty to lazily
>>>> do the clearing.  But perf and KVM need to work together from the
>>>> get go, ie. I
>>>> don't want KVM doing something without regard to what perf does,
>>>> and vice versa.
>>>>
>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>> hardware
>>> and pmu hw is shard with guest and host. Besides context switch
>>> there are
>>> other places where perf core will access pmu hw, such as tick
>>> timer/hrtimer/ipi function call, and KVM can only intercept context
>>> switch.
>>
>> Two questions:
>>
>>   1) Can KVM prevent the guest from accessing the PMU?
>>
>>   2) If so, KVM can grant partial access to the PMU, or is it all or
>> nothing?
>>
>> If the answer to both questions is "yes", then it sounds like
>> LoongArch *requires*
>> mediated/passthrough support in order to virtualize its PMU.
>
> Hi Sean,
>
> Thank for your quick response.
>
> yes, kvm can prevent guest from accessing the PMU and grant partial or
> all to access to the PMU. Only that if one pmu event is granted to VM,
> host can not access this pmu event again. There must be pmu event
> switch if host want to.

A PMU event is a software entity which won't be shared. Did you mean that
if a PMU HW counter is granted to the VM, then the host can't access that
PMU HW counter?


>
>>
>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>> this:
>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>> kvm_guest_cbs
>>> = {
>>>          .state                  = kvm_guest_state,
>>>          .get_ip                 = kvm_guest_get_ip,
>>>          .handle_intel_pt_intr   = NULL,
>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>   };
>>>
>>> By the way, I do not know should the callback handler be triggered
>>> in perf
>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>> triggered in
>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>> but I think it will be better if it is done in perf core.
>>
>> I don't think we want to take the approach of perf and KVM guests
>> "fighting" over
>> the PMU.  That's effectively what we have today, and it's a mess for
>> KVM because
>> it's impossible to provide consistent, deterministic behavior for the
>> guest.  And
>> it's just as messy for perf, which ends up having wierd, cumbersome
>> flows that
>> exists purely to try to play nice with KVM.
> With existing pmu core code, in tick timer interrupt or IPI function
> call interrupt pmu hw may be accessed by host when VM is running and
> pmu is already granted to guest. KVM can not intercept host IPI/timer
> interrupt, there is no pmu context switch, there will be problem.
>
> Regards
> Bibo Mao
>

2024-04-23 06:09:22

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 下午12:23, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 8:55 PM maobibo <[email protected]> wrote:
>>
>>
>>
>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>
>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>> the roles and
>>>>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>
>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>> the PMCs might not
>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>
>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>> doing so makes the
>>>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>
>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>> PMU HW.
>>>>>>>>>
>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>> discussion
>>>>>>>>> that we won't win. Nor do I think it's one we *need* to make.
>>>>>>>>> KVM doesn't need
>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>> hardware, KVM
>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>
>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>> one via
>>>>>>>>>> future discussion.
>>>>>>>>>
>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>> before this code
>>>>>>>>> is merged. It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>> still pave over
>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>> do the clearing. But perf and KVM need to work together from the
>>>>>>>>> get go, ie. I
>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>> and vice versa.
>>>>>>>>>
>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>>>>> hardware
>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>> there are
>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>> context switch.
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>> 1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>
>>>>>>> 2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>> or nothing?
>>>>>>>
>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>> LoongArch *requires*
>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thank for your quick response.
>>>>>>
>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>> event switch if host want to.
>>>>>
>>>>> PMU event is a software entity which won't be shared. did you mean if
>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>> counter, right?
>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>> guest, and is not meaningful for host. Host pmu core does not know
>>>> that it is granted to VM, host still think that it owns pmu.
>>>
>>> That's one issue this patchset tries to solve. Current new mediated x86
>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>> simultaneously. Only when there is no !exclude_guest event on host,
>>> guest is allowed to exclusively own the PMU HW resource.
>>>
>>>
>>>>
>>>> Just like FPU register, it is shared by VM and host during different
>>>> time and it is lately switched. But if IPI or timer interrupt uses FPU
>>>> register on host, there will be the same issue.
>>>
>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>> VM-exit is triggered to make CPU traps into host first and then the host
>> yes, it is.
>
> This is correct. And this is one of the points that we had debated
> internally whether we should do PMU context switch at vcpu loop
> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
> force VM Exit, which I think happens every 4ms or 1ms, depending on
> configuration.
>
> One of the key reasons we currently propose this is because it is the
> same boundary as the legacy PMU, i.e., it would be simple to propose
> from the perf subsystem perspective.
>
> Performance wise, doing PMU context switch at vcpu boundary would be
> way better in general. But the downside is that perf sub-system lose
> the capability to profile majority of the KVM code (functions) when
> guest PMU is enabled.
>
>>
>>> interrupt handler is called. Or are you complaining the executing
>>> sequence of switching guest PMU MSRs and these interrupt handler?
>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>> path, however there is problem if vPMU is switched during vcpu thread
>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>> register in host mode.
>
> Oh, the IPI/timer irq handler will access PMU registers? I thought
> only the host-level NMI handler will access the PMU MSRs since PMI is
> registered under NMI.
>
> In that case, you should disable IRQ during vcpu context switch. For
> NMI, we prevent its handler from accessing the PMU registers. In
> particular, we use a per-cpu variable to guard that. So, the
> host-level PMI handler for perf sub-system will check the variable
> before proceeding.

The perf core will access the PMU HW from the tick timer/hrtimer/IPI
function call paths. For example, perf_event_task_tick() is called from
the tick timer, and there are event_function_call(event, __perf_event_xxx,
&value) calls in kernel/events/core.c.

https://lore.kernel.org/lkml/[email protected]/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
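
For reference, a simplified view of the host-mode path being referred to
(exact helpers vary by kernel version):

/*
 * scheduler tick (no VM exit involved when the guest owns the PMU directly)
 *   -> perf_event_task_tick()
 *        -> frequency adjustment / unthrottling of active events
 *             -> event->pmu->stop() / event->pmu->start()
 *                  -> arch PMU driver reprograms counter/control registers
 */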


>
>>
>> In general it will be better if the switch is done in vcpu thread
>> sched-out/sched-in, else there is requirement to profile kvm
>> hypervisor.Even there is such requirement, it is only one option. In
>> most conditions, it will better if time of VM context exit is small.
>>
> Performance wise, agree, but there will be debate on perf
> functionality loss at the host level.
>
> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> boundary normally, but doing it at VM Enter/Exit boundary when host is
> profiling KVM kernel module. So, dynamically adjusting PMU context
> switch location could be an option.
>
>>>
>>>
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Can we add callback handler in structure kvm_guest_cbs? just like
>>>>>>>> this:
>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>>> kvm_guest_cbs
>>>>>>>> = {
>>>>>>>> .state = kvm_guest_state,
>>>>>>>> .get_ip = kvm_guest_get_ip,
>>>>>>>> .handle_intel_pt_intr = NULL,
>>>>>>>> + .lose_pmu = kvm_guest_lose_pmu,
>>>>>>>> };
>>>>>>>>
>>>>>>>> By the way, I do not know should the callback handler be triggered
>>>>>>>> in perf
>>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>>> triggered in
>>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>>
>>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>>> "fighting" over
>>>>>>> the PMU. That's effectively what we have today, and it's a mess
>>>>>>> for KVM because
>>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>>> the guest. And
>>>>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>>>>> flows that
>>>>>>> exists purely to try to play nice with KVM.
>>>>>> With existing pmu core code, in tick timer interrupt or IPI function
>>>>>> call interrupt pmu hw may be accessed by host when VM is running and
>>>>>> pmu is already granted to guest. KVM can not intercept host
>>>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>>>> problem.
>>>>>>
>>>>>> Regards
>>>>>> Bibo Mao
>>>>>>
>>>>
>>


2024-04-23 07:17:40

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/23/2024 10:53 AM, maobibo wrote:
>
>
> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>
>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>
>>>
>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>> <[email protected]> wrote:
>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>> the roles and
>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>> leads to suboptimal
>>>>>>>> and hard to maintain code.
>>>>>>>>
>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>> _would_ leak guest
>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>> the PMCs might not
>>>>>>>> be dirty from perf's perspective (see
>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>
>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>> doing so makes the
>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>> to clear PMCs,
>>>>>>>> or if KVM is just being paranoid.
>>>>>>>
>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>> PMU HW.
>>>>>>
>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>> discussion
>>>>>> that we won't win.  Nor do I think it's one we *need* to make. 
>>>>>> KVM doesn't need
>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>> hardware, KVM
>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>> virtualizing a guest PMU.
>>>>>>
>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>> one via
>>>>>>> future discussion.
>>>>>>
>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>> before this code
>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>> still pave over
>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>> cpu_hw_events.dirty to lazily
>>>>>> do the clearing.  But perf and KVM need to work together from the
>>>>>> get go, ie. I
>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>> and vice versa.
>>>>>>
>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>> hardware
>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>> there are
>>>>> other places where perf core will access pmu hw, such as tick
>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>> context switch.
>>>>
>>>> Two questions:
>>>>
>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>
>>>>   2) If so, KVM can grant partial access to the PMU, or is it all
>>>> or nothing?
>>>>
>>>> If the answer to both questions is "yes", then it sounds like
>>>> LoongArch *requires*
>>>> mediated/passthrough support in order to virtualize its PMU.
>>>
>>> Hi Sean,
>>>
>>> Thank for your quick response.
>>>
>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>> or all to access to the PMU. Only that if one pmu event is granted
>>> to VM, host can not access this pmu event again. There must be pmu
>>> event switch if host want to.
>>
>> PMU event is a software entity which won't be shared. did you mean if
>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>> counter, right?
> yes, if PMU HW counter/control is granted to VM. The value comes from
> guest, and is not meaningful for host.  Host pmu core does not know
> that it is granted to VM, host still think that it owns pmu.

That's one issue this patchset tries to solve. The current new mediated
x86 vPMU framework doesn't allow the host and guest to own the PMU HW
resource simultaneously. Only when there is no !exclude_guest event on the
host is the guest allowed to exclusively own the PMU HW resource.
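
Expressed as code, the gating condition is roughly the following
(hypothetical helper, not the patchset's actual implementation):

/* Hypothetical count of host events that do NOT set attr.exclude_guest. */
static atomic_t nr_include_guest_events;

static bool guest_can_own_pmu(void)
{
	/*
	 * Events without exclude_guest must keep counting while the guest
	 * runs, so only when none exist may the guest take the whole PMU.
	 */
	return atomic_read(&nr_include_guest_events) == 0;
}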


>
> Just like FPU register, it is shared by VM and host during different
> time and it is lately switched. But if IPI or timer interrupt uses FPU
> register on host, there will be the same issue.

I didn't fully get your point. When an IPI or timer interrupt arrives, a
VM-exit is triggered so that the CPU traps into the host first, and then
the host interrupt handler is called. Or are you concerned about the
execution ordering between switching the guest PMU MSRs and these
interrupt handlers?


>
> Regards
> Bibo Mao
>>
>>
>>>
>>>>
>>>>> Can we add callback handler in structure kvm_guest_cbs?  just like
>>>>> this:
>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>> kvm_guest_cbs
>>>>> = {
>>>>>          .state                  = kvm_guest_state,
>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>          .handle_intel_pt_intr   = NULL,
>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>   };
>>>>>
>>>>> By the way, I do not know should the callback handler be triggered
>>>>> in perf
>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>> triggered in
>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>> but I think it will be better if it is done in perf core.
>>>>
>>>> I don't think we want to take the approach of perf and KVM guests
>>>> "fighting" over
>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>> for KVM because
>>>> it's impossible to provide consistent, deterministic behavior for
>>>> the guest.  And
>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>> flows that
>>>> exists purely to try to play nice with KVM.
>>> With existing pmu core code, in tick timer interrupt or IPI function
>>> call interrupt pmu hw may be accessed by host when VM is running and
>>> pmu is already granted to guest. KVM can not intercept host
>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>> problem.
>>>
>>> Regards
>>> Bibo Mao
>>>
>

2024-04-23 08:47:01

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/23/2024 11:26 AM, maobibo wrote:
>
>
> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>
>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>
>>>
>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>
>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>> the roles and
>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>> which leads to suboptimal
>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>
>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>> _would_ leak guest
>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>> the PMCs might not
>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>
>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>> doing so makes the
>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>> _needs_ to clear PMCs,
>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>
>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>> PMU HW.
>>>>>>>>
>>>>>>>> I don't think this is a statement we want to make, as it opens
>>>>>>>> a discussion
>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>> KVM doesn't need
>>>>>>>> to be on equal footing with perf in terms of owning/managing
>>>>>>>> PMU hardware, KVM
>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>> virtualizing a guest PMU.
>>>>>>>>
>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>> them.
>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>> one via
>>>>>>>>> future discussion.
>>>>>>>>
>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>> before this code
>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>>> still pave over
>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>> do the clearing.  But perf and KVM need to work together from
>>>>>>>> the get go, ie. I
>>>>>>>> don't want KVM doing something without regard to what perf
>>>>>>>> does, and vice versa.
>>>>>>>>
>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>> pmu hardware
>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>> there are
>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>> context switch.
>>>>>>
>>>>>> Two questions:
>>>>>>
>>>>>>   1) Can KVM prevent the guest from accessing the PMU?
>>>>>>
>>>>>>   2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>> or nothing?
>>>>>>
>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>> LoongArch *requires*
>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>
>>>>> Hi Sean,
>>>>>
>>>>> Thank for your quick response.
>>>>>
>>>>> yes, kvm can prevent guest from accessing the PMU and grant
>>>>> partial or all to access to the PMU. Only that if one pmu event is
>>>>> granted to VM, host can not access this pmu event again. There
>>>>> must be pmu event switch if host want to.
>>>>
>>>> PMU event is a software entity which won't be shared. did you mean
>>>> if a PMU HW counter is granted to VM, then Host can't access the
>>>> PMU HW counter, right?
>>> yes, if PMU HW counter/control is granted to VM. The value comes
>>> from guest, and is not meaningful for host.  Host pmu core does not
>>> know that it is granted to VM, host still think that it owns pmu.
>>
>> That's one issue this patchset tries to solve. Current new mediated
>> x86 vPMU framework doesn't allow Host or Guest own the PMU HW
>> resource simultaneously. Only when there is no !exclude_guest event
>> on host, guest is allowed to exclusively own the PMU HW resource.
>>
>>
>>>
>>> Just like FPU register, it is shared by VM and host during different
>>> time and it is lately switched. But if IPI or timer interrupt uses
>>> FPU register on host, there will be the same issue.
>>
>> I didn't fully get your point. When IPI or timer interrupt reach, a
>> VM-exit is triggered to make CPU traps into host first and then the
>> host interrupt handler is called. Or are you complaining the
>> executing sequence of switching guest PMU MSRs and these interrupt
>> handler?
> It is not necessary to save/restore PMU HW at every vm exit, it had
> better be lately saved/restored, such as only when vcpu thread is
> sched-out/sched-in, else the cost will be a little expensive.

I doubt this optimization of deferring the guest PMU state save/restore to
the vCPU task switching boundary would really land in KVM, since it would
make the host lose the capability to profile KVM, and it seems Sean
objects to this.


>
> I know little about perf core. However there is PMU HW access in
> interrupt mode. That means PMU HW access should be irq disabled in
> general mode, else there may be nested PMU HW access. Is that true?

I had no idea that the timer irq handler would access PMU MSRs before.
Could you please point me to the code so that I can look at it first? Thanks.


>
>>
>>
>>>
>>> Regards
>>> Bibo Mao
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just
>>>>>>> like this:
>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>> kvm_guest_cbs
>>>>>>> = {
>>>>>>>          .state                  = kvm_guest_state,
>>>>>>>          .get_ip                 = kvm_guest_get_ip,
>>>>>>>          .handle_intel_pt_intr   = NULL,
>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>   };
>>>>>>>
>>>>>>> By the way, I do not know should the callback handler be
>>>>>>> triggered in perf
>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>> triggered in
>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>
>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>> "fighting" over
>>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>>> for KVM because
>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>> the guest.  And
>>>>>> it's just as messy for perf, which ends up having wierd,
>>>>>> cumbersome flows that
>>>>>> exists purely to try to play nice with KVM.
>>>>> With existing pmu core code, in tick timer interrupt or IPI
>>>>> function call interrupt pmu hw may be accessed by host when VM is
>>>>> running and pmu is already granted to guest. KVM can not intercept
>>>>> host IPI/timer interrupt, there is no pmu context switch, there
>>>>> will be problem.
>>>>>
>>>>> Regards
>>>>> Bibo Mao
>>>>>
>>>
>

2024-04-23 08:52:11

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 下午4:24, Mi, Dapeng wrote:
>
> On 4/23/2024 3:10 PM, Mingwei Zhang wrote:
>> On Mon, Apr 22, 2024 at 11:45 PM Mi, Dapeng
>> <[email protected]> wrote:
>>>
>>> On 4/23/2024 2:08 PM, maobibo wrote:
>>>>
>>>> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>>>>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>>>>
>>>>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>>>>
>>>>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>>>>> the roles and
>>>>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>>>>>>> _needs_
>>>>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>>>>> PMU HW.
>>>>>>>>>>>>> I don't think this is a statement we want to make, as it
>>>>>>>>>>>>> opens a
>>>>>>>>>>>>> discussion
>>>>>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>>>>>> KVM doesn't need
>>>>>>>>>>>>> to be on equal footing with perf in terms of
>>>>>>>>>>>>> owning/managing PMU
>>>>>>>>>>>>> hardware, KVM
>>>>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>>>>> one via
>>>>>>>>>>>>>> future discussion.
>>>>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>>>>> before this code
>>>>>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf
>>>>>>>>>>>>> can
>>>>>>>>>>>>> still pave over
>>>>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>>>>> do the clearing.  But perf and KVM need to work together from
>>>>>>>>>>>>> the
>>>>>>>>>>>>> get go, ie. I
>>>>>>>>>>>>> don't want KVM doing something without regard to what perf
>>>>>>>>>>>>> does,
>>>>>>>>>>>>> and vice versa.
>>>>>>>>>>>>>
>>>>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>>>>>>> pmu
>>>>>>>>>>>> hardware
>>>>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>>>>> there are
>>>>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>>>>> context switch.
>>>>>>>>>>> Two questions:
>>>>>>>>>>>
>>>>>>>>>>>     1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>>>>
>>>>>>>>>>>     2) If so, KVM can grant partial access to the PMU, or is
>>>>>>>>>>> it all
>>>>>>>>>>> or nothing?
>>>>>>>>>>>
>>>>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>>>>> LoongArch *requires*
>>>>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>>>> Hi Sean,
>>>>>>>>>>
>>>>>>>>>> Thank for your quick response.
>>>>>>>>>>
>>>>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant
>>>>>>>>>> partial
>>>>>>>>>> or all to access to the PMU. Only that if one pmu event is
>>>>>>>>>> granted
>>>>>>>>>> to VM, host can not access this pmu event again. There must be
>>>>>>>>>> pmu
>>>>>>>>>> event switch if host want to.
>>>>>>>>> PMU event is a software entity which won't be shared. did you
>>>>>>>>> mean if
>>>>>>>>> a PMU HW counter is granted to VM, then Host can't access the
>>>>>>>>> PMU HW
>>>>>>>>> counter, right?
>>>>>>>> yes, if PMU HW counter/control is granted to VM. The value comes
>>>>>>>> from
>>>>>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>>>> That's one issue this patchset tries to solve. Current new mediated
>>>>>>> x86
>>>>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>>>>
>>>>>>>
>>>>>>>> Just like FPU register, it is shared by VM and host during
>>>>>>>> different
>>>>>>>> time and it is lately switched. But if IPI or timer interrupt uses
>>>>>>>> FPU
>>>>>>>> register on host, there will be the same issue.
>>>>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>>>>> VM-exit is triggered to make CPU traps into host first and then the
>>>>>>> host
>>>>>> yes, it is.
>>>>> This is correct. And this is one of the points that we had debated
>>>>> internally whether we should do PMU context switch at vcpu loop
>>>>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>>>>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>>>>> configuration.
>>>>>
>>>>> One of the key reasons we currently propose this is because it is the
>>>>> same boundary as the legacy PMU, i.e., it would be simple to propose
>>>>> from the perf subsystem perspective.
>>>>>
>>>>> Performance wise, doing PMU context switch at vcpu boundary would be
>>>>> way better in general. But the downside is that perf sub-system lose
>>>>> the capability to profile majority of the KVM code (functions) when
>>>>> guest PMU is enabled.
>>>>>
>>>>>>> interrupt handler is called. Or are you complaining the executing
>>>>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>>>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>>>>> path, however there is problem if vPMU is switched during vcpu thread
>>>>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>>>>> register in host mode.
>>>>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>>>>> only the host-level NMI handler will access the PMU MSRs since PMI is
>>>>> registered under NMI.
>>>>>
>>>>> In that case, you should disable  IRQ during vcpu context switch. For
>>>>> NMI, we prevent its handler from accessing the PMU registers. In
>>>>> particular, we use a per-cpu variable to guard that. So, the
>>>>> host-level PMI handler for perf sub-system will check the variable
>>>>> before proceeding.
>>>> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
>>>> such as function perf_event_task_tick() is called in tick timer, there
>>>> are  event_function_call(event, __perf_event_xxx, &value) in file
>>>> kernel/events/core.c.
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
>>>>
>>>>
>>> Just go through functions (not sure if all),  whether
>>> perf_event_task_tick() or the callbacks of event_function_call() would
>>> check the event->state first, if the event is in
>>> PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs would not be touched really.
>>> In this new proposal, all host events with exclude_guest attribute would
>>> be put on PERF_EVENT_STATE_INACTIVE sate if guest own the PMU HW
>>> resource. So I think it's fine.
>>>
>> Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
>> If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
>> access the global ctrl MSR.
>
> I don't think there is any event with PERF_EVENT_STATE_ACTIVE state on
> host when guest owns the PMU HW resource.
>
> In current solution, VM would fail to create if there is any system-wide
> event without exclude_guest attribute. If VM is created successfully and
> when vm-entry happens, the helper perf_guest_enter() would put all host
> events with exclude_guest attribute into PERF_EVENT_STATE_INACTIVE state
> and block host to create system-wide events without exclude_guest
> attribute.
I do not know the perf subsystem well. Can the perf event state be kept
unchanged? After VM entry, the HW perf counters are allocated to the VM,
and the HW counter function for the host should already be stopped. It
seems that the host perf core does not need to perceive VM enter/exit.


2024-04-23 12:12:42

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/23 下午12:23, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 8:55 PM maobibo <[email protected]> wrote:
>>
>>
>>
>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>
>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>
>>>>
>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>
>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>> the roles and
>>>>>>>>>>> responsibilities between KVM and perf are poorly defined, which
>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>
>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>> the PMCs might not
>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>
>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>> doing so makes the
>>>>>>>>>>> overall code brittle because it's not clear whether KVM _needs_
>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>
>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>> PMU HW.
>>>>>>>>>
>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>> discussion
>>>>>>>>> that we won't win. Nor do I think it's one we *need* to make.
>>>>>>>>> KVM doesn't need
>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>> hardware, KVM
>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>
>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>> implementation, until both clients agree to a "deal" between them.
>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>> one via
>>>>>>>>>> future discussion.
>>>>>>>>>
>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>> before this code
>>>>>>>>> is merged. It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>> still pave over
>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>> do the clearing. But perf and KVM need to work together from the
>>>>>>>>> get go, ie. I
>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>> and vice versa.
>>>>>>>>>
>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly pmu
>>>>>>>> hardware
>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>> there are
>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>> context switch.
>>>>>>>
>>>>>>> Two questions:
>>>>>>>
>>>>>>> 1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>
>>>>>>> 2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>> or nothing?
>>>>>>>
>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>> LoongArch *requires*
>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>
>>>>>> Hi Sean,
>>>>>>
>>>>>> Thank for your quick response.
>>>>>>
>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>> event switch if host want to.
>>>>>
>>>>> PMU event is a software entity which won't be shared. did you mean if
>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>> counter, right?
>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>> guest, and is not meaningful for host. Host pmu core does not know
>>>> that it is granted to VM, host still think that it owns pmu.
>>>
>>> That's one issue this patchset tries to solve. Current new mediated x86
>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>> simultaneously. Only when there is no !exclude_guest event on host,
>>> guest is allowed to exclusively own the PMU HW resource.
>>>
>>>
>>>>
>>>> Just like FPU register, it is shared by VM and host during different
>>>> time and it is lately switched. But if IPI or timer interrupt uses FPU
>>>> register on host, there will be the same issue.
>>>
>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>> VM-exit is triggered to make CPU traps into host first and then the host
>> yes, it is.
>
> This is correct. And this is one of the points that we had debated
> internally whether we should do PMU context switch at vcpu loop
> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
> force VM Exit, which I think happens every 4ms or 1ms, depending on
> configuration.
>
> One of the key reasons we currently propose this is because it is the
> same boundary as the legacy PMU, i.e., it would be simple to propose
> from the perf subsystem perspective.
>
> Performance wise, doing PMU context switch at vcpu boundary would be
> way better in general. But the downside is that perf sub-system lose
> the capability to profile majority of the KVM code (functions) when
> guest PMU is enabled.
>
>>
>>> interrupt handler is called. Or are you complaining the executing
>>> sequence of switching guest PMU MSRs and these interrupt handler?
>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>> path, however there is problem if vPMU is switched during vcpu thread
>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>> register in host mode.
>
> Oh, the IPI/timer irq handler will access PMU registers? I thought
> only the host-level NMI handler will access the PMU MSRs since PMI is
> registered under NMI.
>
> In that case, you should disable IRQ during vcpu context switch. For
> NMI, we prevent its handler from accessing the PMU registers. In
> particular, we use a per-cpu variable to guard that. So, the
> host-level PMI handler for perf sub-system will check the variable
> before proceeding.
>
>>
>> In general it will be better if the switch is done in vcpu thread
>> sched-out/sched-in, else there is requirement to profile kvm
>> hypervisor.Even there is such requirement, it is only one option. In
>> most conditions, it will better if time of VM context exit is small.
>>
> Performance wise, agree, but there will be debate on perf
> functionality loss at the host level.
>
> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> boundary normally, but doing it at VM Enter/Exit boundary when host is
> profiling KVM kernel module. So, dynamically adjusting PMU context
> switch location could be an option.
If there are two VMs with the PMU enabled in both, but the host PMU is not
in use, the PMU context switch should be done in the vcpu thread sched-out
path.

If the host PMU is used as well, we can choose whether the PMU switch
should be done in the VM-exit path or in the vcpu thread sched-out path.

>
>>>
>>>
>>>>
>>>> Regards
>>>> Bibo Mao
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Can we add callback handler in structure kvm_guest_cbs? just like
>>>>>>>> this:
>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>>> kvm_guest_cbs
>>>>>>>> = {
>>>>>>>> .state = kvm_guest_state,
>>>>>>>> .get_ip = kvm_guest_get_ip,
>>>>>>>> .handle_intel_pt_intr = NULL,
>>>>>>>> + .lose_pmu = kvm_guest_lose_pmu,
>>>>>>>> };
>>>>>>>>
>>>>>>>> By the way, I do not know should the callback handler be triggered
>>>>>>>> in perf
>>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>>> triggered in
>>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>>
>>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>>> "fighting" over
>>>>>>> the PMU. That's effectively what we have today, and it's a mess
>>>>>>> for KVM because
>>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>>> the guest. And
>>>>>>> it's just as messy for perf, which ends up having wierd, cumbersome
>>>>>>> flows that
>>>>>>> exists purely to try to play nice with KVM.
>>>>>> With existing pmu core code, in tick timer interrupt or IPI function
>>>>>> call interrupt pmu hw may be accessed by host when VM is running and
>>>>>> pmu is already granted to guest. KVM can not intercept host
>>>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>>>> problem.
>>>>>>
>>>>>> Regards
>>>>>> Bibo Mao
>>>>>>
>>>>
>>


2024-04-23 12:42:17

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/23/2024 2:08 PM, maobibo wrote:
>
>
> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <[email protected]> wrote:
>>>
>>>
>>>
>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>
>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>
>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>> the roles and
>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>>>> which
>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>
>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>
>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>>>> _needs_
>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>
>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>> PMU HW.
>>>>>>>>>>
>>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>>> discussion
>>>>>>>>>> that we won't win.  Nor do I think it's one we *need* to make.
>>>>>>>>>> KVM doesn't need
>>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>>> hardware, KVM
>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>
>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>>> them.
>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>> one via
>>>>>>>>>>> future discussion.
>>>>>>>>>>
>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>> before this code
>>>>>>>>>> is merged.  It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>>> still pave over
>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>> do the clearing.  But perf and KVM need to work together from
>>>>>>>>>> the
>>>>>>>>>> get go, ie. I
>>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>>> and vice versa.
>>>>>>>>>>
>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>>>> pmu
>>>>>>>>> hardware
>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>> there are
>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>> context switch.
>>>>>>>>
>>>>>>>> Two questions:
>>>>>>>>
>>>>>>>>    1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>
>>>>>>>>    2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>>> or nothing?
>>>>>>>>
>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>> LoongArch *requires*
>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>
>>>>>>> Hi Sean,
>>>>>>>
>>>>>>> Thank for your quick response.
>>>>>>>
>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>>> event switch if host want to.
>>>>>>
>>>>>> PMU event is a software entity which won't be shared. did you
>>>>>> mean if
>>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>>> counter, right?
>>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>>> guest, and is not meaningful for host.  Host pmu core does not know
>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>
>>>> That's one issue this patchset tries to solve. Current new mediated
>>>> x86
>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>
>>>>
>>>>>
>>>>> Just like FPU register, it is shared by VM and host during different
>>>>> time and it is lately switched. But if IPI or timer interrupt uses
>>>>> FPU
>>>>> register on host, there will be the same issue.
>>>>
>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>> VM-exit is triggered to make CPU traps into host first and then the
>>>> host
>>> yes, it is.
>>
>> This is correct. And this is one of the points that we had debated
>> internally whether we should do PMU context switch at vcpu loop
>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>> configuration.
>>
>> One of the key reasons we currently propose this is because it is the
>> same boundary as the legacy PMU, i.e., it would be simple to propose
>> from the perf subsystem perspective.
>>
>> Performance wise, doing PMU context switch at vcpu boundary would be
>> way better in general. But the downside is that perf sub-system lose
>> the capability to profile majority of the KVM code (functions) when
>> guest PMU is enabled.
>>
>>>
>>>> interrupt handler is called. Or are you complaining the executing
>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>> path, however there is problem if vPMU is switched during vcpu thread
>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>> register in host mode.
>>
>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>> only the host-level NMI handler will access the PMU MSRs since PMI is
>> registered under NMI.
>>
>> In that case, you should disable  IRQ during vcpu context switch. For
>> NMI, we prevent its handler from accessing the PMU registers. In
>> particular, we use a per-cpu variable to guard that. So, the
>> host-level PMI handler for perf sub-system will check the variable
>> before proceeding.
>
> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
> such as function perf_event_task_tick() is called in tick timer, there
> are  event_function_call(event, __perf_event_xxx, &value) in file
> kernel/events/core.c.
>
> https://lore.kernel.org/lkml/[email protected]/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
>

I just went through these functions (not sure if all of them). Whether it is
perf_event_task_tick() or the callbacks of event_function_call(), they check
event->state first; if the event is in PERF_EVENT_STATE_INACTIVE, the PMU HW
MSRs are not actually touched. In this new proposal, all host events with the
exclude_guest attribute would be put into PERF_EVENT_STATE_INACTIVE state while
the guest owns the PMU HW resource. So I think it's fine.
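
For illustration, the pattern in those paths is roughly the following
(simplified sketch, not the actual kernel code):

/*
 * Simplified sketch of the state check the perf-core callbacks do before
 * touching hardware: an event that is not ACTIVE never reaches the PMU MSRs.
 */
static void example_perf_callback(struct perf_event *event)
{
        if (event->state != PERF_EVENT_STATE_ACTIVE)
                return;                  /* INACTIVE: no MSR access at all */

        event->pmu->read(event);         /* only ACTIVE events touch HW */
}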


>
>
>>
>>>
>>> In general it will be better if the switch is done in vcpu thread
>>> sched-out/sched-in, else there is requirement to profile kvm
>>> hypervisor.Even there is such requirement, it is only one option. In
>>> most conditions, it will better if time of VM context exit is small.
>>>
>> Performance wise, agree, but there will be debate on perf
>> functionality loss at the host level.
>>
>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>> profiling KVM kernel module. So, dynamically adjusting PMU context
>> switch location could be an option.
>>
>>>>
>>>>
>>>>>
>>>>> Regards
>>>>> Bibo Mao
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Can we add callback handler in structure kvm_guest_cbs?  just
>>>>>>>>> like
>>>>>>>>> this:
>>>>>>>>> @@ -6403,6 +6403,7 @@ static struct perf_guest_info_callbacks
>>>>>>>>> kvm_guest_cbs
>>>>>>>>> = {
>>>>>>>>>           .state                  = kvm_guest_state,
>>>>>>>>>           .get_ip                 = kvm_guest_get_ip,
>>>>>>>>>           .handle_intel_pt_intr   = NULL,
>>>>>>>>> +       .lose_pmu               = kvm_guest_lose_pmu,
>>>>>>>>>    };
>>>>>>>>>
>>>>>>>>> By the way, I do not know should the callback handler be
>>>>>>>>> triggered
>>>>>>>>> in perf
>>>>>>>>> core or detailed pmu hw driver. From ARM pmu hw driver, it is
>>>>>>>>> triggered in
>>>>>>>>> pmu hw driver such as function kvm_vcpu_pmu_resync_el0,
>>>>>>>>> but I think it will be better if it is done in perf core.
>>>>>>>>
>>>>>>>> I don't think we want to take the approach of perf and KVM guests
>>>>>>>> "fighting" over
>>>>>>>> the PMU.  That's effectively what we have today, and it's a mess
>>>>>>>> for KVM because
>>>>>>>> it's impossible to provide consistent, deterministic behavior for
>>>>>>>> the guest.  And
>>>>>>>> it's just as messy for perf, which ends up having wierd,
>>>>>>>> cumbersome
>>>>>>>> flows that
>>>>>>>> exists purely to try to play nice with KVM.
>>>>>>> With existing pmu core code, in tick timer interrupt or IPI
>>>>>>> function
>>>>>>> call interrupt pmu hw may be accessed by host when VM is running
>>>>>>> and
>>>>>>> pmu is already granted to guest. KVM can not intercept host
>>>>>>> IPI/timer interrupt, there is no pmu context switch, there will be
>>>>>>> problem.
>>>>>>>
>>>>>>> Regards
>>>>>>> Bibo Mao
>>>>>>>
>>>>>
>>>
>
>

2024-04-23 15:38:26

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/23/2024 3:10 PM, Mingwei Zhang wrote:
> On Mon, Apr 22, 2024 at 11:45 PM Mi, Dapeng <[email protected]> wrote:
>>
>> On 4/23/2024 2:08 PM, maobibo wrote:
>>>
>>> On 2024/4/23 下午12:23, Mingwei Zhang wrote:
>>>> On Mon, Apr 22, 2024 at 8:55 PM maobibo <[email protected]> wrote:
>>>>>
>>>>>
>>>>> On 2024/4/23 上午11:13, Mi, Dapeng wrote:
>>>>>> On 4/23/2024 10:53 AM, maobibo wrote:
>>>>>>>
>>>>>>> On 2024/4/23 上午10:44, Mi, Dapeng wrote:
>>>>>>>> On 4/23/2024 9:01 AM, maobibo wrote:
>>>>>>>>>
>>>>>>>>> On 2024/4/23 上午1:01, Sean Christopherson wrote:
>>>>>>>>>> On Mon, Apr 22, 2024, maobibo wrote:
>>>>>>>>>>> On 2024/4/16 上午6:45, Sean Christopherson wrote:
>>>>>>>>>>>> On Mon, Apr 15, 2024, Mingwei Zhang wrote:
>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 10:38 AM Sean Christopherson
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>> One my biggest complaints with the current vPMU code is that
>>>>>>>>>>>>>> the roles and
>>>>>>>>>>>>>> responsibilities between KVM and perf are poorly defined,
>>>>>>>>>>>>>> which
>>>>>>>>>>>>>> leads to suboptimal
>>>>>>>>>>>>>> and hard to maintain code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Case in point, I'm pretty sure leaving guest values in PMCs
>>>>>>>>>>>>>> _would_ leak guest
>>>>>>>>>>>>>> state to userspace processes that have RDPMC permissions, as
>>>>>>>>>>>>>> the PMCs might not
>>>>>>>>>>>>>> be dirty from perf's perspective (see
>>>>>>>>>>>>>> perf_clear_dirty_counters()).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Blindly clearing PMCs in KVM "solves" that problem, but in
>>>>>>>>>>>>>> doing so makes the
>>>>>>>>>>>>>> overall code brittle because it's not clear whether KVM
>>>>>>>>>>>>>> _needs_
>>>>>>>>>>>>>> to clear PMCs,
>>>>>>>>>>>>>> or if KVM is just being paranoid.
>>>>>>>>>>>>> So once this rolls out, perf and vPMU are clients directly to
>>>>>>>>>>>>> PMU HW.
>>>>>>>>>>>> I don't think this is a statement we want to make, as it opens a
>>>>>>>>>>>> discussion
>>>>>>>>>>>> that we won't win. Nor do I think it's one we *need* to make.
>>>>>>>>>>>> KVM doesn't need
>>>>>>>>>>>> to be on equal footing with perf in terms of owning/managing PMU
>>>>>>>>>>>> hardware, KVM
>>>>>>>>>>>> just needs a few APIs to allow faithfully and accurately
>>>>>>>>>>>> virtualizing a guest PMU.
>>>>>>>>>>>>
>>>>>>>>>>>>> Faithful cleaning (blind cleaning) has to be the baseline
>>>>>>>>>>>>> implementation, until both clients agree to a "deal" between
>>>>>>>>>>>>> them.
>>>>>>>>>>>>> Currently, there is no such deal, but I believe we could have
>>>>>>>>>>>>> one via
>>>>>>>>>>>>> future discussion.
>>>>>>>>>>>> What I am saying is that there needs to be a "deal" in place
>>>>>>>>>>>> before this code
>>>>>>>>>>>> is merged. It doesn't need to be anything fancy, e.g. perf can
>>>>>>>>>>>> still pave over
>>>>>>>>>>>> PMCs it doesn't immediately load, as opposed to using
>>>>>>>>>>>> cpu_hw_events.dirty to lazily
>>>>>>>>>>>> do the clearing. But perf and KVM need to work together from
>>>>>>>>>>>> the
>>>>>>>>>>>> get go, ie. I
>>>>>>>>>>>> don't want KVM doing something without regard to what perf does,
>>>>>>>>>>>> and vice versa.
>>>>>>>>>>>>
>>>>>>>>>>> There is similar issue on LoongArch vPMU where vm can directly
>>>>>>>>>>> pmu
>>>>>>>>>>> hardware
>>>>>>>>>>> and pmu hw is shard with guest and host. Besides context switch
>>>>>>>>>>> there are
>>>>>>>>>>> other places where perf core will access pmu hw, such as tick
>>>>>>>>>>> timer/hrtimer/ipi function call, and KVM can only intercept
>>>>>>>>>>> context switch.
>>>>>>>>>> Two questions:
>>>>>>>>>>
>>>>>>>>>> 1) Can KVM prevent the guest from accessing the PMU?
>>>>>>>>>>
>>>>>>>>>> 2) If so, KVM can grant partial access to the PMU, or is it all
>>>>>>>>>> or nothing?
>>>>>>>>>>
>>>>>>>>>> If the answer to both questions is "yes", then it sounds like
>>>>>>>>>> LoongArch *requires*
>>>>>>>>>> mediated/passthrough support in order to virtualize its PMU.
>>>>>>>>> Hi Sean,
>>>>>>>>>
>>>>>>>>> Thank for your quick response.
>>>>>>>>>
>>>>>>>>> yes, kvm can prevent guest from accessing the PMU and grant partial
>>>>>>>>> or all to access to the PMU. Only that if one pmu event is granted
>>>>>>>>> to VM, host can not access this pmu event again. There must be pmu
>>>>>>>>> event switch if host want to.
>>>>>>>> PMU event is a software entity which won't be shared. did you
>>>>>>>> mean if
>>>>>>>> a PMU HW counter is granted to VM, then Host can't access the PMU HW
>>>>>>>> counter, right?
>>>>>>> yes, if PMU HW counter/control is granted to VM. The value comes from
>>>>>>> guest, and is not meaningful for host. Host pmu core does not know
>>>>>>> that it is granted to VM, host still think that it owns pmu.
>>>>>> That's one issue this patchset tries to solve. Current new mediated
>>>>>> x86
>>>>>> vPMU framework doesn't allow Host or Guest own the PMU HW resource
>>>>>> simultaneously. Only when there is no !exclude_guest event on host,
>>>>>> guest is allowed to exclusively own the PMU HW resource.
>>>>>>
>>>>>>
>>>>>>> Just like FPU register, it is shared by VM and host during different
>>>>>>> time and it is lately switched. But if IPI or timer interrupt uses
>>>>>>> FPU
>>>>>>> register on host, there will be the same issue.
>>>>>> I didn't fully get your point. When IPI or timer interrupt reach, a
>>>>>> VM-exit is triggered to make CPU traps into host first and then the
>>>>>> host
>>>>> yes, it is.
>>>> This is correct. And this is one of the points that we had debated
>>>> internally whether we should do PMU context switch at vcpu loop
>>>> boundary or VM Enter/exit boundary. (host-level) timer interrupt can
>>>> force VM Exit, which I think happens every 4ms or 1ms, depending on
>>>> configuration.
>>>>
>>>> One of the key reasons we currently propose this is because it is the
>>>> same boundary as the legacy PMU, i.e., it would be simple to propose
>>>> from the perf subsystem perspective.
>>>>
>>>> Performance wise, doing PMU context switch at vcpu boundary would be
>>>> way better in general. But the downside is that perf sub-system lose
>>>> the capability to profile majority of the KVM code (functions) when
>>>> guest PMU is enabled.
>>>>
>>>>>> interrupt handler is called. Or are you complaining the executing
>>>>>> sequence of switching guest PMU MSRs and these interrupt handler?
>>>>> In our vPMU implementation, it is ok if vPMU is switched in vm exit
>>>>> path, however there is problem if vPMU is switched during vcpu thread
>>>>> sched-out/sched-in path since IPI/timer irq interrupt access pmu
>>>>> register in host mode.
>>>> Oh, the IPI/timer irq handler will access PMU registers? I thought
>>>> only the host-level NMI handler will access the PMU MSRs since PMI is
>>>> registered under NMI.
>>>>
>>>> In that case, you should disable IRQ during vcpu context switch. For
>>>> NMI, we prevent its handler from accessing the PMU registers. In
>>>> particular, we use a per-cpu variable to guard that. So, the
>>>> host-level PMI handler for perf sub-system will check the variable
>>>> before proceeding.
>>> perf core will access pmu hw in tick timer/hrtimer/ipi function call,
>>> such as function perf_event_task_tick() is called in tick timer, there
>>> are event_function_call(event, __perf_event_xxx, &value) in file
>>> kernel/events/core.c.
>>>
>>> https://lore.kernel.org/lkml/[email protected]/T/#m15aeb79fdc9ce72dd5b374edd6acdcf7a9dafcf4
>>>
>> Just go through functions (not sure if all), whether
>> perf_event_task_tick() or the callbacks of event_function_call() would
>> check the event->state first, if the event is in
>> PERF_EVENT_STATE_INACTIVE, the PMU HW MSRs would not be touched really.
>> In this new proposal, all host events with exclude_guest attribute would
>> be put on PERF_EVENT_STATE_INACTIVE sate if guest own the PMU HW
>> resource. So I think it's fine.
>>
> Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
> If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
> access the global ctrl MSR.

I don't think there is any event in PERF_EVENT_STATE_ACTIVE state on the
host when the guest owns the PMU HW resource.

In the current solution, VM creation would fail if there is any system-wide
event without the exclude_guest attribute. If the VM is created successfully,
then at vm-entry the helper perf_guest_enter() would put all host events with
the exclude_guest attribute into PERF_EVENT_STATE_INACTIVE state and block the
host from creating system-wide events without the exclude_guest attribute.
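
Roughly speaking, the vm-entry side would look like the sketch below (the
helper and flag names here are made up for illustration; the real v2 code
may differ):

/* Sketch only, not the actual patch code. */
static int perf_guest_enter_sketch(void)
{
        /* Refuse if a !exclude_guest system-wide event already exists. */
        if (cpu_has_include_guest_system_wide_events())  /* hypothetical check */
                return -EBUSY;

        /* Park every host exclude_guest event: state -> INACTIVE. */
        sched_out_exclude_guest_events();                /* hypothetical helper */

        /* From now on, new !exclude_guest system-wide events are rejected. */
        this_cpu_write(perf_in_guest_mode, true);        /* hypothetical flag */
        return 0;
}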



2024-04-23 16:51:37

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

> > Is there any event in the host still having PERF_EVENT_STATE_ACTIVE?
> > If so, hmm, it will reach perf_pmu_disable(event->pmu), which will
> > access the global ctrl MSR.
>
> I don't think there is any event with PERF_EVENT_STATE_ACTIVE state on
> host when guest owns the PMU HW resource.
>
> In current solution, VM would fail to create if there is any system-wide
> event without exclude_guest attribute. If VM is created successfully and
> when vm-entry happens, the helper perf_guest_enter() would put all host
> events with exclude_guest attribute into PERF_EVENT_STATE_INACTIVE state
> and block host to create system-wide events without exclude_guest attribute.
>

Yeah, that's perfect.

Thanks.
-Mingwei

2024-04-23 17:20:16

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

> >
> > Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> > boundary normally, but doing it at VM Enter/Exit boundary when host is
> > profiling KVM kernel module. So, dynamically adjusting PMU context
> > switch location could be an option.
> If there are two VMs with pmu enabled both, however host PMU is not
> enabled. PMU context switch should be done in vcpu thread sched-out path.
>
> If host pmu is used also, we can choose whether PMU switch should be
> done in vm exit path or vcpu thread sched-out path.
>

host PMU is always enabled, i.e., Linux currently does not support the KVM
PMU running standalone. I guess what you mean is there are no active
perf_events on the host side. Allowing the PMU context switch to drift
from the vm-enter/exit boundary to the vcpu loop boundary by checking
host-side events might be a good option. We can keep the discussion, but I
won't propose that in v2.

I guess we are off topic. Sean's suggestion is that we should put
"perf" and "kvm" together while doing the context switch. I think this
is quite reasonable regardless of the PMU context switch location.

To execute this, I am thinking about adding a parameter or return
value to perf_guest_enter() so that once it returns to KVM, KVM
knows which counters are active/inactive/cleared on the host
side. Knowing that, KVM can do the context switch more efficiently.
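
Something like the sketch below, for example (the return value and the
field it lands in are made up here just to show the idea):

/* Sketch: hypothetical interface, not an existing API. */
static void kvm_pmu_enter_guest_context(struct kvm_vcpu *vcpu)
{
        /*
         * Hypothetical: perf_guest_enter() returns a bitmap of the HW
         * counters that host (exclude_guest) events were occupying when
         * they were parked, so KVM can limit its save/restore work.
         */
        vcpu->arch.pmu.host_counter_bitmap = perf_guest_enter();
}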

Thanks.
-Mingwei

2024-04-24 01:07:20

by Bibo Mao

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024/4/24 上午1:02, Mingwei Zhang wrote:
>>>
>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>> switch location could be an option.
>> If there are two VMs with pmu enabled both, however host PMU is not
>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>
>> If host pmu is used also, we can choose whether PMU switch should be
>> done in vm exit path or vcpu thread sched-out path.
>>
>
> host PMU is always enabled, ie., Linux currently does not support KVM
> PMU running standalone. I guess what you mean is there are no active
> perf_events on the host side. Allowing a PMU context switch drifting
> from vm-enter/exit boundary to vcpu loop boundary by checking host
> side events might be a good option. We can keep the discussion, but I
> won't propose that in v2.
>
> I guess we are off topic. Sean's suggestion is that we should put
> "perf" and "kvm" together while doing the context switch. I think this
> is quite reasonable regardless of the PMU context switch location.
>
> To execute this, I am thinking about adding a parameter or return
> value to perf_guest_enter() so that once it returns back to KVM, KVM
> gets to know which counters are active/inactive/cleared from the host
> side. Knowing that, KVM can do the context switch more efficiently.
yeap, that sounds great.

Regards
Bibo Mao

>
> Thanks.
> -Mingwei
>


2024-04-24 13:07:06

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>> switch location could be an option.
>> If there are two VMs with pmu enabled both, however host PMU is not
>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>
>> If host pmu is used also, we can choose whether PMU switch should be
>> done in vm exit path or vcpu thread sched-out path.
>>
> host PMU is always enabled, ie., Linux currently does not support KVM
> PMU running standalone. I guess what you mean is there are no active
> perf_events on the host side. Allowing a PMU context switch drifting
> from vm-enter/exit boundary to vcpu loop boundary by checking host
> side events might be a good option. We can keep the discussion, but I
> won't propose that in v2.

I doubt this deferring is really doable. It still makes the host lose most
of its capability to profile KVM. Per my understanding, most of the KVM
overhead happens in the vcpu loop, more precisely in VM-exit handling. We
have no idea when the host wants to create a perf event to profile KVM; it
could be at any time.


>
> I guess we are off topic. Sean's suggestion is that we should put
> "perf" and "kvm" together while doing the context switch. I think this
> is quite reasonable regardless of the PMU context switch location.
>
> To execute this, I am thinking about adding a parameter or return
> value to perf_guest_enter() so that once it returns back to KVM, KVM
> gets to know which counters are active/inactive/cleared from the host
> side. Knowing that, KVM can do the context switch more efficiently.
>
> Thanks.
> -Mingwei
>

2024-04-24 15:33:33

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Wed, Apr 24, 2024, Dapeng Mi wrote:
>
> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> > > > Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> > > > boundary normally, but doing it at VM Enter/Exit boundary when host is
> > > > profiling KVM kernel module. So, dynamically adjusting PMU context
> > > > switch location could be an option.
> > > If there are two VMs with pmu enabled both, however host PMU is not
> > > enabled. PMU context switch should be done in vcpu thread sched-out path.
> > >
> > > If host pmu is used also, we can choose whether PMU switch should be
> > > done in vm exit path or vcpu thread sched-out path.
> > >
> > host PMU is always enabled, ie., Linux currently does not support KVM
> > PMU running standalone. I guess what you mean is there are no active
> > perf_events on the host side. Allowing a PMU context switch drifting
> > from vm-enter/exit boundary to vcpu loop boundary by checking host
> > side events might be a good option. We can keep the discussion, but I
> > won't propose that in v2.
>
> I suspect if it's really doable to do this deferring. This still makes host
> lose the most of capability to profile KVM. Per my understanding, most of
> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
> We have no idea when host want to create perf event to profile KVM, it could
> be at any time.

No, the idea is that KVM will load host PMU state asap, but only when host PMU
state actually needs to be loaded, i.e. only when there are relevant host events.

If there are no host perf events, KVM keeps guest PMU state loaded for the entire
KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
event exists (or comes along), KVM context switches the PMU at VM-Enter/VM-Exit,
i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
for the guest while host perf events are active.

My original sketch: https://lore.kernel.org/all/[email protected]
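
In code form, that decision is roughly the following (illustrative only,
all names are invented, not what the final patches would look like):

/* Illustrative only: the decision KVM makes when leaving the guest. */
static void kvm_pmu_put_guest_context(struct kvm_vcpu *vcpu)
{
        if (host_perf_events_active())            /* hypothetical query into perf */
                kvm_pmu_switch_to_host(vcpu);     /* swap at VM-Exit, host profiles KVM */
        else
                vcpu->arch.pmu.defer_host_reload = true;  /* hypothetical flag: keep
                                                              guest PMU for the KVM_RUN loop */
}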

2024-04-25 11:02:58

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>> switch location could be an option.
>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>
>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>> done in vm exit path or vcpu thread sched-out path.
>>>>
>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>> PMU running standalone. I guess what you mean is there are no active
>>> perf_events on the host side. Allowing a PMU context switch drifting
>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>> side events might be a good option. We can keep the discussion, but I
>>> won't propose that in v2.
>> I suspect if it's really doable to do this deferring. This still makes host
>> lose the most of capability to profile KVM. Per my understanding, most of
>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>> We have no idea when host want to create perf event to profile KVM, it could
>> be at any time.
> No, the idea is that KVM will load host PMU state asap, but only when host PMU
> state actually needs to be loaded, i.e. only when there are relevant host events.
>
> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> for the guest while host perf events are active.

I see. So KVM needs to provide a callback which is called in the IPI
handler. The KVM callback needs to switch the PMU state before perf really
enables the host event and touches the PMU MSRs. And only perf events with
the exclude_guest attribute are allowed to be created on the host. Thanks.
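
I.e., something along the lines of the sketch below (the callback name is
invented here, similar in spirit to the .lose_pmu callback mentioned
earlier in this thread):

/* Rough sketch of the ordering, not actual code. */
static void __perf_event_enable_sketch(struct perf_event *event)
{
        /*
         * Hypothetical callback: let KVM put the guest PMU context back to
         * the host context *before* any PMU MSR is programmed.
         */
        if (guest_cbs && guest_cbs->put_guest_pmu)
                guest_cbs->put_guest_pmu();

        /* ... only now schedule the event onto the hardware ... */
}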


>
> My original sketch: https://lore.kernel.org/all/[email protected]

2024-04-25 20:18:16

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
> > On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <[email protected]> wrote:
> >>
> >>
> >> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
> >>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
> >>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
> >>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
> >>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
> >>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
> >>>>>>> switch location could be an option.
> >>>>>> If there are two VMs with pmu enabled both, however host PMU is not
> >>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
> >>>>>>
> >>>>>> If host pmu is used also, we can choose whether PMU switch should be
> >>>>>> done in vm exit path or vcpu thread sched-out path.
> >>>>>>
> >>>>> host PMU is always enabled, ie., Linux currently does not support KVM
> >>>>> PMU running standalone. I guess what you mean is there are no active
> >>>>> perf_events on the host side. Allowing a PMU context switch drifting
> >>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
> >>>>> side events might be a good option. We can keep the discussion, but I
> >>>>> won't propose that in v2.
> >>>> I suspect if it's really doable to do this deferring. This still makes host
> >>>> lose the most of capability to profile KVM. Per my understanding, most of
> >>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
> >>>> We have no idea when host want to create perf event to profile KVM, it could
> >>>> be at any time.
> >>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
> >>> state actually needs to be loaded, i.e. only when there are relevant host events.
> >>>
> >>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
> >>> KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
> >>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
> >>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
> >>> for the guest while host perf events are active.
> >>
> >> I see. So KVM needs to provide a callback which needs to be called in
> >> the IPI handler. The KVM callback needs to be called to switch PMU state
> >> before perf really enabling host event and touching PMU MSRs. And only
> >> the perf event with exclude_guest attribute is allowed to create on
> >> host. Thanks.
> >
> > Do we really need a KVM callback? I think that is one option.
> >
> > Immediately after VMEXIT, KVM will check whether there are "host perf
> > events". If so, do the PMU context switch immediately. Otherwise, keep
> > deferring the context switch to the end of vPMU loop.
> >
> > Detecting if there are "host perf events" would be interesting. The
> > "host perf events" refer to the perf_events on the host that are
> > active and assigned with HW counters and that are saved when context
> > switching to the guest PMU. I think getting those events could be done
> > by fetching the bitmaps in cpuc.
>
> The cpuc is ARCH specific structure. I don't think it can be get in the
> generic code. You probably have to implement ARCH specific functions to
> fetch the bitmaps. It probably won't worth it.
>
> You may check the pinned_groups and flexible_groups to understand if
> there are host perf events which may be scheduled when VM-exit. But it
> will not tell the idx of the counters which can only be got when the
> host event is really scheduled.
>
> > I have to look into the details. But
> > at the time of VMEXIT, kvm should already have that information, so it
> > can immediately decide whether to do the PMU context switch or not.
> >
> > oh, but when the control is executing within the run loop, a
> > host-level profiling starts, say 'perf record -a ...', it will
> > generate an IPI to all CPUs. Maybe that's when we need a callback so
> > the KVM guest PMU context gets preempted for the host-level profiling.
> > Gah..
> >
> > hmm, not a fan of that. That means the host can poke the guest PMU
> > context at any time and cause higher overhead. But I admit it is much
> > better than the current approach.
> >
> > The only thing is that: any command like 'perf record/stat -a' shot in
> > dark corners of the host can preempt guest PMUs of _all_ running VMs.
> > So, to alleviate that, maybe a module parameter that disables this
> > "preemption" is possible? This should fit scenarios where we don't
> > want guest PMU to be preempted outside of the vCPU loop?
> >
>
> It should not happen. For the current implementation, perf rejects all
> the !exclude_guest system-wide event creation if a guest with the vPMU
> is running.
> However, it's possible to create an exclude_guest system-wide event at
> any time. KVM cannot use the information from the VM-entry to decide if
> there will be active perf events in the VM-exit.

Hmm, why not? If there is any exclude_guest system-wide event,
perf_guest_enter() can return something to tell KVM "hey, some active
host events are swapped out. they are originally in counter #2 and
#3". If so, at the time when perf_guest_enter() returns, KVM will ack
that and keep it in its pmu data structure.

Now, when doing context switching back to host at just VMEXIT, KVM
will check this data and see if host perf context has something active
(of course, they are all exclude_guest events). If not, defer the
context switch to the vcpu boundary. Otherwise, do the proper PMU context
switching by respecting the occupied counter positions on the host
side, i.e., avoid doubling the work on the KVM side.

Kan, any suggestion on the above approach? Totally understand that
there might be some difficulty, since perf subsystem works in several
layers and obviously fetching low-level mapping is arch specific work.
If that is difficult, we can split the work in two phases: 1) phase
#1, just ask perf to tell kvm if there are active exclude_guest events
swapped out; 2) phase #2, ask perf to tell their (low-level) counter
indices.

Thanks.
-Mingwei

>
> The perf_guest_exit() will reload the host state. It's impossible to
> save the guest state after that. We may need a KVM callback. So perf can
> tell KVM whether to save the guest state before perf reloads the host state.
>
> Thanks,
> Kan
> >>
> >>
> >>>
> >>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom
> >

2024-04-25 21:47:37

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Thu, Apr 25, 2024, Kan Liang wrote:
> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> > On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
> >> It should not happen. For the current implementation, perf rejects all
> >> the !exclude_guest system-wide event creation if a guest with the vPMU
> >> is running.
> >> However, it's possible to create an exclude_guest system-wide event at
> >> any time. KVM cannot use the information from the VM-entry to decide if
> >> there will be active perf events in the VM-exit.
> >
> > Hmm, why not? If there is any exclude_guest system-wide event,
> > perf_guest_enter() can return something to tell KVM "hey, some active
> > host events are swapped out. they are originally in counter #2 and
> > #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> > that and keep it in its pmu data structure.
>
> I think it's possible that someone creates !exclude_guest event after

I assume you mean an exclude_guest=1 event? Because perf should be in a state
where it rejects exclude_guest=0 events.

> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> will schedule the event in the next perf_guest_exit(). KVM will not know it.

Ya, the creation of an event on a CPU that currently has guest PMU state loaded
is what I had in mind when I suggested a callback in my sketch:

: D. Add a perf callback that is invoked from IRQ context when perf wants to
: configure a new PMU-based events, *before* actually programming the MSRs,
: and have KVM's callback put the guest PMU state

It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
into KVM.

2024-04-25 22:12:36

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <[email protected]> wrote:
>>
>>
>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>> switch location could be an option.
>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>
>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>
>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>> side events might be a good option. We can keep the discussion, but I
>>>>> won't propose that in v2.
>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>> be at any time.
>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>
>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>> KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>> for the guest while host perf events are active.
>>
>> I see. So KVM needs to provide a callback which needs to be called in
>> the IPI handler. The KVM callback needs to be called to switch PMU state
>> before perf really enabling host event and touching PMU MSRs. And only
>> the perf event with exclude_guest attribute is allowed to create on
>> host. Thanks.
>
> Do we really need a KVM callback? I think that is one option.
>
> Immediately after VMEXIT, KVM will check whether there are "host perf
> events". If so, do the PMU context switch immediately. Otherwise, keep
> deferring the context switch to the end of vPMU loop.
>
> Detecting if there are "host perf events" would be interesting. The
> "host perf events" refer to the perf_events on the host that are
> active and assigned with HW counters and that are saved when context
> switching to the guest PMU. I think getting those events could be done
> by fetching the bitmaps in cpuc.

The cpuc is an arch-specific structure. I don't think it can be accessed in
the generic code. You would probably have to implement arch-specific functions
to fetch the bitmaps. It's probably not worth it.

You may check the pinned_groups and flexible_groups to understand whether
there are host perf events which may be scheduled at VM-exit. But that will
not tell you the idx of the counters, which can only be known when the host
event is really scheduled.
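
E.g., a check like the one below (just a sketch; IIRC perf core already has
a perf_event_groups_empty() helper that does essentially this per group):

/* Sketch: are there candidate host events waiting to be scheduled? */
static bool ctx_has_candidate_events(struct perf_event_context *ctx)
{
        return !RB_EMPTY_ROOT(&ctx->pinned_groups.tree) ||
               !RB_EMPTY_ROOT(&ctx->flexible_groups.tree);
}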

> I have to look into the details. But
> at the time of VMEXIT, kvm should already have that information, so it
> can immediately decide whether to do the PMU context switch or not.
>
> oh, but when the control is executing within the run loop, a
> host-level profiling starts, say 'perf record -a ...', it will
> generate an IPI to all CPUs. Maybe that's when we need a callback so
> the KVM guest PMU context gets preempted for the host-level profiling.
> Gah..
>
> hmm, not a fan of that. That means the host can poke the guest PMU
> context at any time and cause higher overhead. But I admit it is much
> better than the current approach.
>
> The only thing is that: any command like 'perf record/stat -a' shot in
> dark corners of the host can preempt guest PMUs of _all_ running VMs.
> So, to alleviate that, maybe a module parameter that disables this
> "preemption" is possible? This should fit scenarios where we don't
> want guest PMU to be preempted outside of the vCPU loop?
>

It should not happen. For the current implementation, perf rejects all
the !exclude_guest system-wide event creation if a guest with the vPMU
is running.
However, it's possible to create an exclude_guest system-wide event at
any time. KVM cannot use the information from the VM-entry to decide if
there will be active perf events in the VM-exit.

The perf_guest_exit() will reload the host state. It's impossible to
save the guest state after that. We may need a KVM callback. So perf can
tell KVM whether to save the guest state before perf reloads the host state.

Thanks,
Kan
>>
>>
>>>
>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom
>

2024-04-26 00:22:31

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
>>
>>
>>
>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <[email protected]> wrote:
>>>>
>>>>
>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>>>> switch location could be an option.
>>>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>>>
>>>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>>>
>>>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>>>> side events might be a good option. We can keep the discussion, but I
>>>>>>> won't propose that in v2.
>>>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>>>> be at any time.
>>>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>>>
>>>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>>>> KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
>>>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>>>> for the guest while host perf events are active.
>>>>
>>>> I see. So KVM needs to provide a callback which needs to be called in
>>>> the IPI handler. The KVM callback needs to be called to switch PMU state
>>>> before perf really enabling host event and touching PMU MSRs. And only
>>>> the perf event with exclude_guest attribute is allowed to create on
>>>> host. Thanks.
>>>
>>> Do we really need a KVM callback? I think that is one option.
>>>
>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>> deferring the context switch to the end of vPMU loop.
>>>
>>> Detecting if there are "host perf events" would be interesting. The
>>> "host perf events" refer to the perf_events on the host that are
>>> active and assigned with HW counters and that are saved when context
>>> switching to the guest PMU. I think getting those events could be done
>>> by fetching the bitmaps in cpuc.
>>
>> The cpuc is ARCH specific structure. I don't think it can be get in the
>> generic code. You probably have to implement ARCH specific functions to
>> fetch the bitmaps. It probably won't worth it.
>>
>> You may check the pinned_groups and flexible_groups to understand if
>> there are host perf events which may be scheduled when VM-exit. But it
>> will not tell the idx of the counters which can only be got when the
>> host event is really scheduled.
>>
>>> I have to look into the details. But
>>> at the time of VMEXIT, kvm should already have that information, so it
>>> can immediately decide whether to do the PMU context switch or not.
>>>
>>> oh, but when the control is executing within the run loop, a
>>> host-level profiling starts, say 'perf record -a ...', it will
>>> generate an IPI to all CPUs. Maybe that's when we need a callback so
>>> the KVM guest PMU context gets preempted for the host-level profiling.
>>> Gah..
>>>
>>> hmm, not a fan of that. That means the host can poke the guest PMU
>>> context at any time and cause higher overhead. But I admit it is much
>>> better than the current approach.
>>>
>>> The only thing is that: any command like 'perf record/stat -a' shot in
>>> dark corners of the host can preempt guest PMUs of _all_ running VMs.
>>> So, to alleviate that, maybe a module parameter that disables this
>>> "preemption" is possible? This should fit scenarios where we don't
>>> want guest PMU to be preempted outside of the vCPU loop?
>>>
>>
>> It should not happen. For the current implementation, perf rejects all
>> the !exclude_guest system-wide event creation if a guest with the vPMU
>> is running.
>> However, it's possible to create an exclude_guest system-wide event at
>> any time. KVM cannot use the information from the VM-entry to decide if
>> there will be active perf events in the VM-exit.
>
> Hmm, why not? If there is any exclude_guest system-wide event,
> perf_guest_enter() can return something to tell KVM "hey, some active
> host events are swapped out. they are originally in counter #2 and
> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> that and keep it in its pmu data structure.

I think it's possible that someone creates !exclude_guest event after
the perf_guest_enter(). The stale information is saved in the KVM. Perf
will schedule the event in the next perf_guest_exit(). KVM will not know it.

>
> Now, when doing context switching back to host at just VMEXIT, KVM
> will check this data and see if host perf context has something active
> (of course, they are all exclude_guest events). If not, deferring the
> context switch to vcpu boundary. Otherwise, do the proper PMU context
> switching by respecting the occupied counter positions on the host
> side, i.e., avoid doubling the work on the KVM side.
>
> Kan, any suggestion on the above approach?

I think we can only know the accurate event list at perf_guest_exit().
You may check the pinned_groups and flexible_groups, which tell if there
are candidate events.

> Totally understand that
> there might be some difficulty, since perf subsystem works in several
> layers and obviously fetching low-level mapping is arch specific work.
> If that is difficult, we can split the work in two phases: 1) phase
> #1, just ask perf to tell kvm if there are active exclude_guest events
> swapped out; 2) phase #2, ask perf to tell their (low-level) counter
> indices.
>

If you want an accurate counter mask, changes in the arch-specific code are
required. Two phases sound good to me.

Besides the perf changes, I think KVM should also track which counters need
to be saved/restored. That information can be obtained from the EventSel
interception.
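
E.g., something like this (the bitmap name is made up, just to show the idea):

/* Sketch: track which GP counters the guest actually uses. */
static void kvm_pmu_note_eventsel_write(struct kvm_pmu *pmu, u32 idx, u64 data)
{
        if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
                __set_bit(idx, pmu->guest_used_counters);   /* hypothetical bitmap */
        else
                __clear_bit(idx, pmu->guest_used_counters);
        /* The context-switch path then only saves/restores the bits set here. */
}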

Thanks,
Kan
>>
>> The perf_guest_exit() will reload the host state. It's impossible to
>> save the guest state after that. We may need a KVM callback. So perf can
>> tell KVM whether to save the guest state before perf reloads the host state.
>>
>> Thanks,
>> Kan
>>>>
>>>>
>>>>>
>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom
>>>
>

2024-04-26 03:12:53

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng <[email protected]> wrote:
>
>
> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
> > On Thu, Apr 25, 2024, Kan Liang wrote:
> >> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> >>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
> >>>> It should not happen. For the current implementation, perf rejects all
> >>>> the !exclude_guest system-wide event creation if a guest with the vPMU
> >>>> is running.
> >>>> However, it's possible to create an exclude_guest system-wide event at
> >>>> any time. KVM cannot use the information from the VM-entry to decide if
> >>>> there will be active perf events in the VM-exit.
> >>> Hmm, why not? If there is any exclude_guest system-wide event,
> >>> perf_guest_enter() can return something to tell KVM "hey, some active
> >>> host events are swapped out. they are originally in counter #2 and
> >>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> >>> that and keep it in its pmu data structure.
> >> I think it's possible that someone creates !exclude_guest event after
> > I assume you mean an exclude_guest=1 event? Because perf should be in a state
> > where it rejects exclude_guest=0 events.
>
> Suppose should be exclude_guest=1 event, the perf event without
> exclude_guest attribute would be blocked to create in the v2 patches
> which we are working on.
>
>
> >
> >> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> >> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> > Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> > is what I had in mind when I suggested a callback in my sketch:
> >
> > : D. Add a perf callback that is invoked from IRQ context when perf wants to
> > : configure a new PMU-based events, *before* actually programming the MSRs,
> > : and have KVM's callback put the guest PMU state
>
>
> when host creates a perf event with exclude_guest attribute which is
> used to profile KVM/VMM user space, the vCPU process could work at three
> places.
>
> 1. in guest state (non-root mode)
>
> 2. inside vcpu-loop
>
> 3. outside vcpu-loop
>
> Since the PMU state has already been switched to host state, we don't
> need to consider the case 3 and only care about cases 1 and 2.
>
> when host creates a perf event with exclude_guest attribute to profile
> KVM/VMM user space, an IPI is triggered to enable the perf event
> eventually like the following code shows.
>
> event_function_call(event, __perf_event_enable, NULL);
>
> For case 1, a vm-exit is triggered and KVM starts to process the
> vm-exit and then run IPI irq handler, exactly speaking
> __perf_event_enable() to enable the perf event.
>
> For case 2, the IPI irq handler would preempt the vcpu-loop and call
> __perf_event_enable() to enable the perf event.
>
> So IMO KVM just needs to provide a callback to switch guest/host PMU
> state, and __perf_event_enable() calls this callback before really
> touching PMU MSRs.

OK, in this case, do we still need KVM to query perf for whether there are
active exclude_guest events? Yes, because there is an ordering issue. The
above covers the case where host-level perf profiling starts while a VM is
already running: the IPI can invoke the callback and trigger preemption, and
KVM should switch the context from guest to host. What if it is the other
way around, i.e., host-level profiling runs first and then the VM runs?

In this case, just before entering the vcpu loop, KVM should check whether
there is an active host event and save that into a pmu data structure. If
there is none, do the context switch early (so that KVM saves a huge amount
of unnecessary PMU context switches later). Otherwise, keep the host PMU
context until vm-enter. At the time of vm-exit, do the check again using the
data stored in the pmu structure. If there is an active event, do the context
switch to the host PMU; otherwise defer it until exiting the vcpu loop. Of
course, in the meantime, if any perf profiling starts and causes the IPI, the
irq handler calls the callback and preempts the guest PMU context. If that
happens, the PMU context switch at the vcpu loop exit boundary is skipped
since it is already done. Also note that the irq could come at any time, so
the PMU context switch in all 4 locations needs to check the state flag (and
skip the context switch if needed).

So this requires that vcpu->pmu carry two pieces of state information: 1) a
flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1, just
a boolean; phase #2, a bitmap of occupied counters).
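
Roughly (sketch only, names invented):

/* Sketch of the two pieces of per-vcpu PMU switch state described above. */
struct kvm_pmu_switch_state {
        bool    guest_state_loaded;     /* like TIF_NEED_FPU_LOAD: guest PMU currently in HW? */
        bool    host_has_active_events; /* phase #1: any active host exclude_guest event */
        u64     host_counter_bitmap;    /* phase #2: which HW counters the host occupies */
};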

This is a non-trivial optimization of the PMU context switch. I am
thinking about splitting it into the following phases:

1) lazy PMU context switch, i.e., wait until the guest touches a PMU MSR
for the 1st time.
2) fast PMU context switch on the KVM side, i.e., KVM checks the event
selector values (enabled/disabled) and selectively switches PMU state
(reducing rd/wr of MSRs).
3) dynamic PMU context switch boundary, i.e., KVM can dynamically choose
the PMU context switch boundary depending on existing active host-level
events.
3.1) more accurate dynamic PMU context switch, i.e., KVM checks the
host-level counter positions and further reduces the number of MSR
accesses.
4) guest PMU context preemption, i.e., any new host-level perf
profiling can immediately preempt the guest PMU in the vcpu loop
(instead of waiting for the next PMU context switch in KVM).

Thanks.
-Mingwei
>
> >
> > It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
> > kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
> > into KVM.

2024-04-26 05:26:56

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/26/2024 4:43 AM, Liang, Kan wrote:
>
> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
>>>
>>>
>>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <[email protected]> wrote:
>>>>>
>>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>>> Maybe, (just maybe), it is possible to do PMU context switch at vcpu
>>>>>>>>>> boundary normally, but doing it at VM Enter/Exit boundary when host is
>>>>>>>>>> profiling KVM kernel module. So, dynamically adjusting PMU context
>>>>>>>>>> switch location could be an option.
>>>>>>>>> If there are two VMs with pmu enabled both, however host PMU is not
>>>>>>>>> enabled. PMU context switch should be done in vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>>> If host pmu is used also, we can choose whether PMU switch should be
>>>>>>>>> done in vm exit path or vcpu thread sched-out path.
>>>>>>>>>
>>>>>>>> host PMU is always enabled, ie., Linux currently does not support KVM
>>>>>>>> PMU running standalone. I guess what you mean is there are no active
>>>>>>>> perf_events on the host side. Allowing a PMU context switch drifting
>>>>>>>> from vm-enter/exit boundary to vcpu loop boundary by checking host
>>>>>>>> side events might be a good option. We can keep the discussion, but I
>>>>>>>> won't propose that in v2.
>>>>>>> I suspect if it's really doable to do this deferring. This still makes host
>>>>>>> lose the most of capability to profile KVM. Per my understanding, most of
>>>>>>> KVM overhead happens in the vcpu loop, exactly speaking in VM-exit handling.
>>>>>>> We have no idea when host want to create perf event to profile KVM, it could
>>>>>>> be at any time.
>>>>>> No, the idea is that KVM will load host PMU state asap, but only when host PMU
>>>>>> state actually needs to be loaded, i.e. only when there are relevant host events.
>>>>>>
>>>>>> If there are no host perf events, KVM keeps guest PMU state loaded for the entire
>>>>>> KVM_RUN loop, i.e. provides optimal behavior for the guest. But if a host perf
>>>>>> events exists (or comes along), the KVM context switches PMU at VM-Enter/VM-Exit,
>>>>>> i.e. lets the host profile almost all of KVM, at the cost of a degraded experience
>>>>>> for the guest while host perf events are active.
>>>>> I see. So KVM needs to provide a callback which needs to be called in
>>>>> the IPI handler. The KVM callback needs to be called to switch PMU state
>>>>> before perf really enabling host event and touching PMU MSRs. And only
>>>>> the perf event with exclude_guest attribute is allowed to create on
>>>>> host. Thanks.
>>>> Do we really need a KVM callback? I think that is one option.
>>>>
>>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>>> deferring the context switch to the end of vPMU loop.
>>>>
>>>> Detecting if there are "host perf events" would be interesting. The
>>>> "host perf events" refer to the perf_events on the host that are
>>>> active and assigned with HW counters and that are saved when context
>>>> switching to the guest PMU. I think getting those events could be done
>>>> by fetching the bitmaps in cpuc.
>>> The cpuc is ARCH specific structure. I don't think it can be get in the
>>> generic code. You probably have to implement ARCH specific functions to
>>> fetch the bitmaps. It probably won't worth it.
>>>
>>> You may check the pinned_groups and flexible_groups to understand if
>>> there are host perf events which may be scheduled when VM-exit. But it
>>> will not tell the idx of the counters which can only be got when the
>>> host event is really scheduled.
>>>
>>>> I have to look into the details. But
>>>> at the time of VMEXIT, kvm should already have that information, so it
>>>> can immediately decide whether to do the PMU context switch or not.
>>>>
>>>> oh, but when the control is executing within the run loop, a
>>>> host-level profiling starts, say 'perf record -a ...', it will
>>>> generate an IPI to all CPUs. Maybe that's when we need a callback so
>>>> the KVM guest PMU context gets preempted for the host-level profiling.
>>>> Gah..
>>>>
>>>> hmm, not a fan of that. That means the host can poke the guest PMU
>>>> context at any time and cause higher overhead. But I admit it is much
>>>> better than the current approach.
>>>>
>>>> The only thing is that: any command like 'perf record/stat -a' shot in
>>>> dark corners of the host can preempt guest PMUs of _all_ running VMs.
>>>> So, to alleviate that, maybe a module parameter that disables this
>>>> "preemption" is possible? This should fit scenarios where we don't
>>>> want guest PMU to be preempted outside of the vCPU loop?
>>>>
>>> It should not happen. For the current implementation, perf rejects all
>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>> is running.
>>> However, it's possible to create an exclude_guest system-wide event at
>>> any time. KVM cannot use the information from the VM-entry to decide if
>>> there will be active perf events in the VM-exit.
>> Hmm, why not? If there is any exclude_guest system-wide event,
>> perf_guest_enter() can return something to tell KVM "hey, some active
>> host events are swapped out. they are originally in counter #2 and
>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>> that and keep it in its pmu data structure.
> I think it's possible that someone creates !exclude_guest event after
> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>
>> Now, when doing context switching back to host at just VMEXIT, KVM
>> will check this data and see if host perf context has something active
>> (of course, they are all exclude_guest events). If not, deferring the
>> context switch to vcpu boundary. Otherwise, do the proper PMU context
>> switching by respecting the occupied counter positions on the host
>> side, i.e., avoid doubling the work on the KVM side.
>>
>> Kan, any suggestion on the above approach?
> I think we can only know the accurate event list at perf_guest_exit().
> You may check the pinned_groups and flexible_groups, which tell if there
> are candidate events.
>
>> Totally understand that
>> there might be some difficulty, since perf subsystem works in several
>> layers and obviously fetching low-level mapping is arch specific work.
>> If that is difficult, we can split the work in two phases: 1) phase
>> #1, just ask perf to tell kvm if there are active exclude_guest events
>> swapped out; 2) phase #2, ask perf to tell their (low-level) counter
>> indices.
>>
> If you want an accurate counter mask, the changes in the arch specific
> code is required. Two phases sound good to me.
>
> Besides perf changes, I think the KVM should also track which counters
> need to be saved/restored. The information can be get from the EventSel
> interception.

Yes, that's another optimization from the guest's point of view. It's on
our to-do list.
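
A rough sketch of what that to-do item could look like (the
passthrough_in_use bitmap and the helper name below are hypothetical,
not code from this series):

static void intel_pmu_track_eventsel(struct kvm_pmu *pmu, u32 idx, u64 data)
{
	/*
	 * Remember which GP counters the guest actually enables, based on
	 * the intercepted EVENTSELx writes, so the save/restore paths can
	 * skip counters that were never used.
	 */
	if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
		__set_bit(idx, pmu->passthrough_in_use);	/* hypothetical field */
	else
		__clear_bit(idx, pmu->passthrough_in_use);
}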


>
> Thanks,
> Kan
>>> The perf_guest_exit() will reload the host state. It's impossible to
>>> save the guest state after that. We may need a KVM callback. So perf can
>>> tell KVM whether to save the guest state before perf reloads the host state.
>>>
>>> Thanks,
>>> Kan
>>>>>
>>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@googlecom

2024-04-26 06:56:56

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/26/2024 5:46 AM, Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Kan Liang wrote:
>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
>>>> It should not happen. For the current implementation, perf rejects all
>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>> is running.
>>>> However, it's possible to create an exclude_guest system-wide event at
>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>> there will be active perf events in the VM-exit.
>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>> host events are swapped out. they are originally in counter #2 and
>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>> that and keep it in its pmu data structure.
>> I think it's possible that someone creates !exclude_guest event after
> I assume you mean an exclude_guest=1 event? Because perf should be in a state
> where it rejects exclude_guest=0 events.

Yes, it should be an exclude_guest=1 event; creation of perf events
without the exclude_guest attribute will be blocked in the v2 patches
which we are working on.
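
Just to illustrate the shape of that v2 check (perf_passthrough_vpmu_active()
below is a hypothetical helper, not the actual v2 code):

static int perf_reject_non_exclude_guest(struct perf_event_attr *attr)
{
	/*
	 * While a passthrough-vPMU guest can own the PMU, only events that
	 * explicitly exclude the guest are allowed to be created.
	 */
	if (!attr->exclude_guest && perf_passthrough_vpmu_active())
		return -EOPNOTSUPP;

	return 0;
}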


>
>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> is what I had in mind when I suggested a callback in my sketch:
>
> : D. Add a perf callback that is invoked from IRQ context when perf wants to
> : configure a new PMU-based events, *before* actually programming the MSRs,
> : and have KVM's callback put the guest PMU state


When the host creates a perf event with the exclude_guest attribute to
profile KVM/VMM user space, the vCPU thread could be running in one of
three places.

1. in guest state (non-root mode)

2. inside vcpu-loop

3. outside vcpu-loop

In case 3 the PMU state has already been switched back to the host
state, so we only need to consider cases 1 and 2.

When the host creates such an exclude_guest event, an IPI is eventually
triggered to enable it, as the following call shows:

event_function_call(event, __perf_event_enable, NULL);

For case 1, a VM-exit is triggered; KVM processes the VM-exit and then
runs the IPI IRQ handler, i.e. __perf_event_enable(), to enable the perf
event.

For case 2, the IPI IRQ handler preempts the vcpu-loop and calls
__perf_event_enable() to enable the perf event.

So IMO KVM just needs to provide a callback to switch the guest/host PMU
state, and __perf_event_enable() would call this callback before actually
touching the PMU MSRs.
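
A minimal sketch of that callback, assuming perf invokes it from the
IPI/IRQ context of __perf_event_enable() before any PMU MSR is
programmed (the callback name and the pmu_guest_state_loaded flag are
made up; intel_save_pmu_context() is the function added by this patch):

void kvm_perf_put_guest_pmu(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	/* Nothing to do if no guest PMU state is currently loaded. */
	if (!vcpu || !vcpu->arch.pmu_guest_state_loaded)	/* hypothetical flag */
		return;

	/* Save the guest PMU state and hand the hardware back to the host. */
	intel_save_pmu_context(vcpu);
	vcpu->arch.pmu_guest_state_loaded = false;
}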

>
> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
> into KVM.

2024-04-26 11:12:38

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/26/2024 11:12 AM, Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng <[email protected]> wrote:
>>
>> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
>>> On Thu, Apr 25, 2024, Kan Liang wrote:
>>>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
>>>>>> It should not happen. For the current implementation, perf rejects all
>>>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>>>> is running.
>>>>>> However, it's possible to create an exclude_guest system-wide event at
>>>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>>>> there will be active perf events in the VM-exit.
>>>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>>>> host events are swapped out. they are originally in counter #2 and
>>>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>>>> that and keep it in its pmu data structure.
>>>> I think it's possible that someone creates !exclude_guest event after
>>> I assume you mean an exclude_guest=1 event? Because perf should be in a state
>>> where it rejects exclude_guest=0 events.
>> Suppose should be exclude_guest=1 event, the perf event without
>> exclude_guest attribute would be blocked to create in the v2 patches
>> which we are working on.
>>
>>
>>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
>>> is what I had in mind when I suggested a callback in my sketch:
>>>
>>> : D. Add a perf callback that is invoked from IRQ context when perf wants to
>>> : configure a new PMU-based events, *before* actually programming the MSRs,
>>> : and have KVM's callback put the guest PMU state
>>
>> when host creates a perf event with exclude_guest attribute which is
>> used to profile KVM/VMM user space, the vCPU process could work at three
>> places.
>>
>> 1. in guest state (non-root mode)
>>
>> 2. inside vcpu-loop
>>
>> 3. outside vcpu-loop
>>
>> Since the PMU state has already been switched to host state, we don't
>> need to consider the case 3 and only care about cases 1 and 2.
>>
>> when host creates a perf event with exclude_guest attribute to profile
>> KVM/VMM user space, an IPI is triggered to enable the perf event
>> eventually like the following code shows.
>>
>> event_function_call(event, __perf_event_enable, NULL);
>>
>> For case 1, a vm-exit is triggered and KVM starts to process the
>> vm-exit and then run IPI irq handler, exactly speaking
>> __perf_event_enable() to enable the perf event.
>>
>> For case 2, the IPI irq handler would preempt the vcpu-loop and call
>> __perf_event_enable() to enable the perf event.
>>
>> So IMO KVM just needs to provide a callback to switch guest/host PMU
>> state, and __perf_event_enable() calls this callback before really
>> touching PMU MSRs.
> ok, in this case, do we still need KVM to query perf if there are
> active exclude_guest events? yes? Because there is an ordering issue.
> The above suggests that the host-level perf profiling comes when a VM
> is already running, there is an IPI that can invoke the callback and
> trigger preemption. In this case, KVM should switch the context from
> guest to host. What if it is the other way around, ie., host-level
> profiling runs first and then VM runs?
>
> In this case, just before entering the vcpu loop, kvm should check
> whether there is an active host event and save that into a pmu data
> structure. If none, do the context switch early (so that KVM saves a
> huge amount of unnecessary PMU context switches in the future).
> Otherwise, keep the host PMU context until vm-enter. At the time of
> vm-exit, do the check again using the data stored in pmu structure. If
> there is an active event do the context switch to the host PMU,
> otherwise defer that until exiting the vcpu loop. Of course, in the
> meantime, if there is any perf profiling started causing the IPI, the
> irq handler calls the callback, preempting the guest PMU context. If
> that happens, at the time of exiting the vcpu boundary, PMU context
> switch is skipped since it is already done. Of course, note that the
> irq could come at any time, so the PMU context switch in all 4
> locations need to check the state flag (and skip the context switch if
> needed).
>
> So this requires vcpu->pmu has two pieces of state information: 1) the
> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> just a boolean; phase #2, bitmap of occupied counters).

I haven't had a chance to look at the details of the vFPU implementation
yet, so I'm not sure what exactly we need on the vPMU side, a flag or a
callback. Anyway, those are implementation details; we can look at them
when we start implementing.

>
> This is a non-trivial optimization on the PMU context switch. I am
> thinking about splitting them into the following phases:
>
> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> for the 1st time.
> 2) fast PMU context switch on KVM side, i.e., KVM checking event
> selector value (enable/disable) and selectively switch PMU state
> (reducing rd/wr msrs)
> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> context switch boundary depending on existing active host-level
> events.
> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> host-level counter position and further reduces the number of msr
> accesses.
> 4) guest PMU context preemption, i.e., any new host-level perf
> profiling can immediately preempt the guest PMU in the vcpu loop
> (instead of waiting for the next PMU context switch in KVM).

Great! Now we have a clear overall picture of the optimizations.
BTW, optimizations 1 and 2 are already on our original to-do list; we
plan to do them after RFC v2 is ready.


>
> Thanks.
> -Mingwei
>>> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
>>> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
>>> into KVM.

2024-04-26 13:56:27

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-04-25 5:46 p.m., Sean Christopherson wrote:
> On Thu, Apr 25, 2024, Kan Liang wrote:
>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <[email protected]> wrote:
>>>> It should not happen. For the current implementation, perf rejects all
>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>> is running.
>>>> However, it's possible to create an exclude_guest system-wide event at
>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>> there will be active perf events in the VM-exit.
>>>
>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>> host events are swapped out. they are originally in counter #2 and
>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>> that and keep it in its pmu data structure.
>>
>> I think it's possible that someone creates !exclude_guest event after
>
> I assume you mean an exclude_guest=1 event? Because perf should be in a state
> where it rejects exclude_guest=0 events.
>

Right.

>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>
> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> is what I had in mind when I suggested a callback in my sketch:
>
> : D. Add a perf callback that is invoked from IRQ context when perf wants to
> : configure a new PMU-based events, *before* actually programming the MSRs,
> : and have KVM's callback put the guest PMU state
>
> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
> into KVM.

Yes, a callback should be required. I think it should be invoked right
before switching back to the host perf events, so that there is an
accurate list of active events at that point.

Thanks,
Kan

2024-04-26 18:42:36

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Fri, Apr 26, 2024 at 7:10 AM Liang, Kan <[email protected]> wrote:
>
>
>
> On 2024-04-25 11:12 p.m., Mingwei Zhang wrote:
> >>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
> >>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
> >>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
> >>> is what I had in mind when I suggested a callback in my sketch:
> >>>
> >>> : D. Add a perf callback that is invoked from IRQ context when perf wants to
> >>> : configure a new PMU-based events, *before* actually programming the MSRs,
> >>> : and have KVM's callback put the guest PMU state
> >>
> >> when host creates a perf event with exclude_guest attribute which is
> >> used to profile KVM/VMM user space, the vCPU process could work at three
> >> places.
> >>
> >> 1. in guest state (non-root mode)
> >>
> >> 2. inside vcpu-loop
> >>
> >> 3. outside vcpu-loop
> >>
> >> Since the PMU state has already been switched to host state, we don't
> >> need to consider the case 3 and only care about cases 1 and 2.
> >>
> >> when host creates a perf event with exclude_guest attribute to profile
> >> KVM/VMM user space, an IPI is triggered to enable the perf event
> >> eventually like the following code shows.
> >>
> >> event_function_call(event, __perf_event_enable, NULL);
> >>
> >> For case 1, a vm-exit is triggered and KVM starts to process the
> >> vm-exit and then run IPI irq handler, exactly speaking
> >> __perf_event_enable() to enable the perf event.
> >>
> >> For case 2, the IPI irq handler would preempt the vcpu-loop and call
> >> __perf_event_enable() to enable the perf event.
> >>
> >> So IMO KVM just needs to provide a callback to switch guest/host PMU
> >> state, and __perf_event_enable() calls this callback before really
> >> touching PMU MSRs.
> > ok, in this case, do we still need KVM to query perf if there are
> > active exclude_guest events? yes? Because there is an ordering issue.
> > The above suggests that the host-level perf profiling comes when a VM
> > is already running, there is an IPI that can invoke the callback and
> > trigger preemption. In this case, KVM should switch the context from
> > guest to host. What if it is the other way around, ie., host-level
> > profiling runs first and then VM runs?
> >
> > In this case, just before entering the vcpu loop, kvm should check
> > whether there is an active host event and save that into a pmu data
> > structure.
>
> KVM doesn't need to save/restore the host state. Host perf has the
> information and will reload the values whenever the host events are
> rescheduled. But I think KVM should clear the registers used by the host
> to prevent the value leaks to the guest.

Right, KVM needs to know about the host state to optimize its own PMU
context switch. If the host is using a counter with, say, index 1, then
KVM may not need to zap the value of counter #1, since the perf side will
override it anyway.
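
A sketch of that phase #2 idea (host_used_counters is a hypothetical
bitmap that perf would hand back to KVM at perf_guest_enter() time, not
an existing field):

static void intel_zap_unused_gp_counters(struct kvm_pmu *pmu)
{
	u32 i;

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		/* perf will reprogram host-owned counters anyway, skip them. */
		if (test_bit(i, pmu->host_used_counters))	/* hypothetical field */
			continue;
		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
		wrmsrl(MSR_IA32_PMC0 + i, 0);
	}
}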

>
> > If none, do the context switch early (so that KVM saves a
> > huge amount of unnecessary PMU context switches in the future).
> > Otherwise, keep the host PMU context until vm-enter. At the time of
> > vm-exit, do the check again using the data stored in pmu structure. If
> > there is an active event do the context switch to the host PMU,
> > otherwise defer that until exiting the vcpu loop. Of course, in the
> > meantime, if there is any perf profiling started causing the IPI, the
> > irq handler calls the callback, preempting the guest PMU context. If
> > that happens, at the time of exiting the vcpu boundary, PMU context
> > switch is skipped since it is already done. Of course, note that the
> > irq could come at any time, so the PMU context switch in all 4
> > locations need to check the state flag (and skip the context switch if
> > needed).
> >
> > So this requires vcpu->pmu has two pieces of state information: 1) the
> > flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> > just a boolean; phase #2, bitmap of occupied counters).
> >
> > This is a non-trivial optimization on the PMU context switch. I am
> > thinking about splitting them into the following phases:
> >
> > 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> > for the 1st time.
> > 2) fast PMU context switch on KVM side, i.e., KVM checking event
> > selector value (enable/disable) and selectively switch PMU state
> > (reducing rd/wr msrs)
> > 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> > context switch boundary depending on existing active host-level
> > events.
> > 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> > host-level counter position and further reduces the number of msr
> > accesses.
> > 4) guest PMU context preemption, i.e., any new host-level perf
> > profiling can immediately preempt the guest PMU in the vcpu loop
> > (instead of waiting for the next PMU context switch in KVM).
>
> I'm not quit sure about the 4.
> The new host-level perf must be an exclude_guest event. It should not be
> scheduled when a guest is using the PMU. Why do we want to preempt the
> guest PMU? The current implementation in perf doesn't schedule any
> exclude_guest events when a guest is running.

Right. The grey area is the code within the KVM_RUN loop but _outside_
of the guest. This part of the code is on the "host" side. However, for
efficiency reasons, KVM defers the PMU context switch by retaining the
guest PMU MSR values within the loop. Optimization 4 allows the host
side to immediately profile this part instead of waiting for the vCPU to
reach the PMU context switch locations, which gives more accurate
results.

Do we want to preempt that? I think it depends. For regular cloud
usage, we don't. But for other usages where we want to prioritize
KVM/VMM profiling over the guest vPMU, it is useful.

My current opinion is that optimization 4 is something nice to have,
but we should allow people to turn it off, just like we can choose to
disable kernel preemption.
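
The knob I have in mind could be as simple as the following (the name is
hypothetical; when it is off, the perf callback would only mark the guest
PMU as needing a switch instead of preempting it right away):

/* Allow admins to opt out of immediate guest-PMU preemption (sketch only). */
static bool preempt_guest_pmu = true;
module_param(preempt_guest_pmu, bool, 0444);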

Thanks.
-Mingwei
>
> Thanks,
> Kan

2024-04-26 19:48:05

by Sean Christopherson

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Fri, Apr 26, 2024, Kan Liang wrote:
> > Optimization 4
> > allows the host side to immediately profiling this part instead of
> > waiting for vcpu to reach to PMU context switch locations. Doing so
> > will generate more accurate results.
>
> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
> definition of the exclude_guest. Without 4, it brings some random blind
> spots, right?

+1, I view it as a hard requirement. It's not an optimization, it's about
accuracy and functional correctness.

What _is_ an optimization is keeping guest state loaded while KVM is in its
run loop, i.e. initial mediated/passthrough PMU support could land upstream with
unconditional switches at entry/exit. The performance of KVM would likely be
unacceptable for any production use cases, but that would give us motivation to
finish the job, and it doesn't result in random, hard to diagnose issues for
userspace.
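
In other words, the initial version could do no more than the following
around every VM-Enter/VM-Exit, using the save/restore helpers added by
this patch (the wrapper below is illustrative, not an actual KVM hook):

static void vmx_pmu_switch_sketch(struct kvm_vcpu *vcpu)
{
	intel_restore_pmu_context(vcpu);	/* load guest PMU state */
	/* ... VM-Enter, run the guest, VM-Exit ... */
	intel_save_pmu_context(vcpu);		/* save guest state, return PMU to host */
}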

> > Do we want to preempt that? I think it depends. For regular cloud
> > usage, we don't. But for any other usages where we want to prioritize
> > KVM/VMM profiling over guest vPMU, it is useful.
> >
> > My current opinion is that optimization 4 is something nice to have.
> > But we should allow people to turn it off just like we could choose to
> > disable preempt kernel.
>
> The exclude_guest means everything but the guest. I don't see a reason
> why people want to turn it off and get some random blind spots.

2024-04-26 20:17:38

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-04-25 11:12 p.m., Mingwei Zhang wrote:
>>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
>>> is what I had in mind when I suggested a callback in my sketch:
>>>
>>> : D. Add a perf callback that is invoked from IRQ context when perf wants to
>>> : configure a new PMU-based events, *before* actually programming the MSRs,
>>> : and have KVM's callback put the guest PMU state
>>
>> when host creates a perf event with exclude_guest attribute which is
>> used to profile KVM/VMM user space, the vCPU process could work at three
>> places.
>>
>> 1. in guest state (non-root mode)
>>
>> 2. inside vcpu-loop
>>
>> 3. outside vcpu-loop
>>
>> Since the PMU state has already been switched to host state, we don't
>> need to consider the case 3 and only care about cases 1 and 2.
>>
>> when host creates a perf event with exclude_guest attribute to profile
>> KVM/VMM user space, an IPI is triggered to enable the perf event
>> eventually like the following code shows.
>>
>> event_function_call(event, __perf_event_enable, NULL);
>>
>> For case 1, a vm-exit is triggered and KVM starts to process the
>> vm-exit and then run IPI irq handler, exactly speaking
>> __perf_event_enable() to enable the perf event.
>>
>> For case 2, the IPI irq handler would preempt the vcpu-loop and call
>> __perf_event_enable() to enable the perf event.
>>
>> So IMO KVM just needs to provide a callback to switch guest/host PMU
>> state, and __perf_event_enable() calls this callback before really
>> touching PMU MSRs.
> ok, in this case, do we still need KVM to query perf if there are
> active exclude_guest events? yes? Because there is an ordering issue.
> The above suggests that the host-level perf profiling comes when a VM
> is already running, there is an IPI that can invoke the callback and
> trigger preemption. In this case, KVM should switch the context from
> guest to host. What if it is the other way around, ie., host-level
> profiling runs first and then VM runs?
>
> In this case, just before entering the vcpu loop, kvm should check
> whether there is an active host event and save that into a pmu data
> structure.

KVM doesn't need to save/restore the host state. Host perf has the
information and will reload the values whenever the host events are
rescheduled. But I think KVM should clear the registers used by the host
to prevent their values from leaking to the guest.

> If none, do the context switch early (so that KVM saves a
> huge amount of unnecessary PMU context switches in the future).
> Otherwise, keep the host PMU context until vm-enter. At the time of
> vm-exit, do the check again using the data stored in pmu structure. If
> there is an active event do the context switch to the host PMU,
> otherwise defer that until exiting the vcpu loop. Of course, in the
> meantime, if there is any perf profiling started causing the IPI, the
> irq handler calls the callback, preempting the guest PMU context. If
> that happens, at the time of exiting the vcpu boundary, PMU context
> switch is skipped since it is already done. Of course, note that the
> irq could come at any time, so the PMU context switch in all 4
> locations need to check the state flag (and skip the context switch if
> needed).
>
> So this requires vcpu->pmu has two pieces of state information: 1) the
> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> just a boolean; phase #2, bitmap of occupied counters).
>
> This is a non-trivial optimization on the PMU context switch. I am
> thinking about splitting them into the following phases:
>
> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> for the 1st time.
> 2) fast PMU context switch on KVM side, i.e., KVM checking event
> selector value (enable/disable) and selectively switch PMU state
> (reducing rd/wr msrs)
> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> context switch boundary depending on existing active host-level
> events.
> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> host-level counter position and further reduces the number of msr
> accesses.
> 4) guest PMU context preemption, i.e., any new host-level perf
> profiling can immediately preempt the guest PMU in the vcpu loop
> (instead of waiting for the next PMU context switch in KVM).

I'm not quite sure about 4.
The new host-level perf event must be an exclude_guest event. It should
not be scheduled while a guest is using the PMU. Why would we want to
preempt the guest PMU? The current perf implementation doesn't schedule
any exclude_guest events when a guest is running.

Thanks,
Kan

2024-04-27 00:31:56

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-04-26 2:41 p.m., Mingwei Zhang wrote:
>>> So this requires vcpu->pmu has two pieces of state information: 1) the
>>> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
>>> just a boolean; phase #2, bitmap of occupied counters).
>>>
>>> This is a non-trivial optimization on the PMU context switch. I am
>>> thinking about splitting them into the following phases:
>>>
>>> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
>>> for the 1st time.
>>> 2) fast PMU context switch on KVM side, i.e., KVM checking event
>>> selector value (enable/disable) and selectively switch PMU state
>>> (reducing rd/wr msrs)
>>> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
>>> context switch boundary depending on existing active host-level
>>> events.
>>> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
>>> host-level counter position and further reduces the number of msr
>>> accesses.
>>> 4) guest PMU context preemption, i.e., any new host-level perf
>>> profiling can immediately preempt the guest PMU in the vcpu loop
>>> (instead of waiting for the next PMU context switch in KVM).
>> I'm not quit sure about the 4.
>> The new host-level perf must be an exclude_guest event. It should not be
>> scheduled when a guest is using the PMU. Why do we want to preempt the
>> guest PMU? The current implementation in perf doesn't schedule any
>> exclude_guest events when a guest is running.
> right. The grey area is the code within the KVM_RUN loop, but
> _outside_ of the guest. This part of the code is on the "host" side.
> However, for efficiency reasons, KVM defers the PMU context switch by
> retaining the guest PMU MSR values within the loop.

I assume you mean the optimization of moving the context switch from
VM-exit/entry boundary to the vCPU boundary.

> Optimization 4
> allows the host side to immediately profiling this part instead of
> waiting for vcpu to reach to PMU context switch locations. Doing so
> will generate more accurate results.

If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
definition of exclude_guest. Without 4, it introduces some random blind
spots, right?

>
> Do we want to preempt that? I think it depends. For regular cloud
> usage, we don't. But for any other usages where we want to prioritize
> KVM/VMM profiling over guest vPMU, it is useful.
>
> My current opinion is that optimization 4 is something nice to have.
> But we should allow people to turn it off just like we could choose to
> disable preempt kernel.

The exclude_guest attribute means everything but the guest. I don't see
a reason why people would want to turn it off and get some random blind
spots.

Thanks,
Kan

2024-04-28 06:02:34

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

On Sat, Apr 27, 2024 at 5:59 PM Mi, Dapeng <[email protected]> wrote:
>
>
> On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> > On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <[email protected]> wrote:
> >> On Fri, Apr 26, 2024, Kan Liang wrote:
> >>>> Optimization 4
> >>>> allows the host side to immediately profiling this part instead of
> >>>> waiting for vcpu to reach to PMU context switch locations. Doing so
> >>>> will generate more accurate results.
> >>> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
> >>> definition of the exclude_guest. Without 4, it brings some random blind
> >>> spots, right?
> >> +1, I view it as a hard requirement. It's not an optimization, it's about
> >> accuracy and functional correctness.
> > Well. Does it have to be a _hard_ requirement? no? The irq handler
> > triggered by "perf record -a" could just inject a "state". Instead of
> > immediately preempting the guest PMU context, perf subsystem could
> > allow KVM defer the context switch when it reaches the next PMU
> > context switch location.
> >
> > This is the same as the preemption kernel logic. Do you want me to
> > stop the work immediately? Yes (if you enable preemption), or No, let
> > me finish my job and get to the scheduling point.
> >
> > Implementing this might be more difficult to debug. That's my real
> > concern. If we do not enable preemption, the PMU context switch will
> > only happen at the 2 pairs of locations. If we enable preemption, it
> > could happen at any time.
>
> IMO I don't prefer to add a switch to enable/disable the preemption. I
> think current implementation is already complicated enough and
> unnecessary to introduce an new parameter to confuse users. Furthermore,
> the switch could introduce an uncertainty and may mislead the perf user
> to read the perf stats incorrectly. As for debug, it won't bring any
> difference as long as no host event is created.
>
That's OK; this is about opinions and brainstorming. Adding a parameter
to disable preemption comes from the cloud usage perspective. The
conflict of opinions is about which one you prioritize: the guest PMU or
the host PMU. If you stand on the guest vPMU usage side, do you want
anyone on the host to fire off a profiling command and cause turbulence?
No. If you stand on the host PMU side and want to profile the VMM/KVM,
you definitely want accuracy and no delay at all.

Thanks.
-Mingwei
>
> >
> >> What _is_ an optimization is keeping guest state loaded while KVM is in its
> >> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
> >> unconditional switches at entry/exit. The performance of KVM would likely be
> >> unacceptable for any production use cases, but that would give us motivation to
> >> finish the job, and it doesn't result in random, hard to diagnose issues for
> >> userspace.
> > That's true. I agree with that.
> >
> >>>> Do we want to preempt that? I think it depends. For regular cloud
> >>>> usage, we don't. But for any other usages where we want to prioritize
> >>>> KVM/VMM profiling over guest vPMU, it is useful.
> >>>>
> >>>> My current opinion is that optimization 4 is something nice to have.
> >>>> But we should allow people to turn it off just like we could choose to
> >>>> disable preempt kernel.
> >>> The exclude_guest means everything but the guest. I don't see a reason
> >>> why people want to turn it off and get some random blind spots.

2024-04-28 06:57:50

by Mi, Dapeng

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU


On 4/27/2024 11:04 AM, Mingwei Zhang wrote:
> On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <[email protected]> wrote:
>> On Fri, Apr 26, 2024, Kan Liang wrote:
>>>> Optimization 4
>>>> allows the host side to immediately profiling this part instead of
>>>> waiting for vcpu to reach to PMU context switch locations. Doing so
>>>> will generate more accurate results.
>>> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
>>> definition of the exclude_guest. Without 4, it brings some random blind
>>> spots, right?
>> +1, I view it as a hard requirement. It's not an optimization, it's about
>> accuracy and functional correctness.
> Well. Does it have to be a _hard_ requirement? no? The irq handler
> triggered by "perf record -a" could just inject a "state". Instead of
> immediately preempting the guest PMU context, perf subsystem could
> allow KVM defer the context switch when it reaches the next PMU
> context switch location.
>
> This is the same as the preemption kernel logic. Do you want me to
> stop the work immediately? Yes (if you enable preemption), or No, let
> me finish my job and get to the scheduling point.
>
> Implementing this might be more difficult to debug. That's my real
> concern. If we do not enable preemption, the PMU context switch will
> only happen at the 2 pairs of locations. If we enable preemption, it
> could happen at any time.

IMO I would prefer not to add a switch to enable/disable the preemption.
The current implementation is already complicated enough, and it's
unnecessary to introduce a new parameter that would confuse users.
Furthermore, such a switch would introduce uncertainty and may mislead
perf users into reading the perf stats incorrectly. As for debugging, it
makes no difference as long as no host event is created.


>
>> What _is_ an optimization is keeping guest state loaded while KVM is in its
>> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
>> unconditional switches at entry/exit. The performance of KVM would likely be
>> unacceptable for any production use cases, but that would give us motivation to
>> finish the job, and it doesn't result in random, hard to diagnose issues for
>> userspace.
> That's true. I agree with that.
>
>>>> Do we want to preempt that? I think it depends. For regular cloud
>>>> usage, we don't. But for any other usages where we want to prioritize
>>>> KVM/VMM profiling over guest vPMU, it is useful.
>>>>
>>>> My current opinion is that optimization 4 is something nice to have.
>>>> But we should allow people to turn it off just like we could choose to
>>>> disable preempt kernel.
>>> The exclude_guest means everything but the guest. I don't see a reason
>>> why people want to turn it off and get some random blind spots.

2024-04-29 13:09:01

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-04-26 11:04 p.m., Mingwei Zhang wrote:
> On Fri, Apr 26, 2024 at 12:46 PM Sean Christopherson <[email protected]> wrote:
>>
>> On Fri, Apr 26, 2024, Kan Liang wrote:
>>>> Optimization 4
>>>> allows the host side to immediately profiling this part instead of
>>>> waiting for vcpu to reach to PMU context switch locations. Doing so
>>>> will generate more accurate results.
>>>
>>> If so, I think the 4 is a must to have. Otherwise, it wouldn't honer the
>>> definition of the exclude_guest. Without 4, it brings some random blind
>>> spots, right?
>>
>> +1, I view it as a hard requirement. It's not an optimization, it's about
>> accuracy and functional correctness.
>
> Well. Does it have to be a _hard_ requirement? no? The irq handler
> triggered by "perf record -a" could just inject a "state". Instead of
> immediately preempting the guest PMU context, perf subsystem could
> allow KVM defer the context switch when it reaches the next PMU
> context switch location.

It depends on where the upcoming PMU context switch location is.
If it's the upcoming VM-exit/entry boundary, deferring should be fine:
because it's an exclude_guest event, nothing should be counted while a VM
is running. If it's the upcoming vCPU boundary, no; there may be several
VM-exits/entries before the upcoming vCPU switch, and we may lose some
results.
>
> This is the same as the preemption kernel logic. Do you want me to
> stop the work immediately? Yes (if you enable preemption), or No, let
> me finish my job and get to the scheduling point.

I don't think that's necessary. Just making sure that the counters are
scheduled at the upcoming VM-exit/entry boundary should be fine.
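
Roughly, the deferred variant would only need something like this (the
flag and both function names are hypothetical; intel_save_pmu_context()
is the helper from this patch):

/* perf->KVM callback: just record that host events want the PMU. */
void kvm_perf_request_host_pmu(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (vcpu)
		vcpu->arch.host_pmu_pending = true;	/* hypothetical flag */
}

/* Called right after VM-exit, before handling the exit reason. */
static void kvm_maybe_switch_to_host_pmu(struct kvm_vcpu *vcpu)
{
	if (vcpu->arch.host_pmu_pending) {
		intel_save_pmu_context(vcpu);
		vcpu->arch.host_pmu_pending = false;
	}
}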

Thanks,
Kan
>
> Implementing this might be more difficult to debug. That's my real
> concern. If we do not enable preemption, the PMU context switch will
> only happen at the 2 pairs of locations. If we enable preemption, it
> could happen at any time.
>
>>
>> What _is_ an optimization is keeping guest state loaded while KVM is in its
>> run loop, i.e. initial mediated/passthrough PMU support could land upstream with
>> unconditional switches at entry/exit. The performance of KVM would likely be
>> unacceptable for any production use cases, but that would give us motivation to
>> finish the job, and it doesn't result in random, hard to diagnose issues for
>> userspace.
>
> That's true. I agree with that.
>
>>
>>>> Do we want to preempt that? I think it depends. For regular cloud
>>>> usage, we don't. But for any other usages where we want to prioritize
>>>> KVM/VMM profiling over guest vPMU, it is useful.
>>>>
>>>> My current opinion is that optimization 4 is something nice to have.
>>>> But we should allow people to turn it off just like we could choose to
>>>> disable preempt kernel.
>>>
>>> The exclude_guest means everything but the guest. I don't see a reason
>>> why people want to turn it off and get some random blind spots.
>

2024-05-01 21:06:18

by Liang, Kan

[permalink] [raw]
Subject: Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU



On 2024-05-01 1:43 p.m., Mingwei Zhang wrote:
> One of the things on top of the mind is that: there seems to be no way
> for the perf subsystem to express this: "no, your host-level profiling
> is not interested in profiling the KVM_RUN loop when our guest vPMU is
> actively running".

exclude_hv? Although it seems that option is not well supported on x86
for now.
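
For reference, that is just the existing perf_event_attr bit, e.g.
(though, as said, x86 largely ignores it today):

struct perf_event_attr attr = {
	.type       = PERF_TYPE_HARDWARE,
	.config     = PERF_COUNT_HW_CPU_CYCLES,
	.exclude_hv = 1,	/* request no counting in hypervisor context */
};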

Thanks,
Kan