2023-11-03 23:06:09

by Sean Christopherson

Subject: [PATCH v2 0/6] KVM: x86/pmu: Clean up emulated PMC event handling

The ultimate goal of this series is to track emulated counter events using
a dedicated variable instead of trying to track the previous counter value.
Tracking the previous counter value is flawed as it takes a snapshot at
every emulated event, but only checks for overflow prior to VM-Enter, i.e.
KVM could miss an overflow if KVM ever supports emulating event types that
can occur multiple times in a single VM-Exit.
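
As a sketch of that first problem (plain arithmetic, not KVM code; an 8-bit
counter width is assumed purely for readability), two emulated events between
overflow checks leave only the second snapshot behind:

  counter = 0xff
  prev_counter = 0xff; counter = 0x00   /* first emulated event wraps */
  prev_counter = 0x00; counter = 0x01   /* second event clobbers the snapshot */
  /* check prior to VM-Enter: 0x01 < 0x00 is false => the wrap is missed */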

And as Mingwei root caused, emulating overflow while the perf event is
running can result in duplicate overflow events, e.g. if the perf event
overflows between taking and checking the snapshot. This bug is largely
masked now that KVM correctly sets LVT_MASK when delivering PMIs, but it's
still a bug, e.g. could cause problems if there are other side effects.
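
And a standalone userspace model of that race (again assuming a hypothetical
8-bit counter; real PMCs are 48 or 64 bits wide, and the real accumulation
happens when KVM pauses the perf event):

  #include <stdint.h>
  #include <stdio.h>

  #define MASK 0xffull    /* stand-in for pmc_bitmask(pmc) */

  int main(void)
  {
          uint64_t counter = 0xfd, prev_counter;

          /* Emulated event: snapshot, then bump the shared counter. */
          prev_counter = counter;
          counter = (counter + 1) & MASK;

          /*
           * The still-running perf event counts five more events, wrapping
           * its sample period; perf raises a PMI for that hardware overflow.
           * KVM then pauses the event and folds its count into the counter.
           */
          counter = (counter + 5) & MASK;

          /* The snapshot check fires as well => a second, duplicate PMI. */
          if (counter < prev_counter)
                  printf("emulated overflow signaled (duplicate)\n");
          return 0;
  }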

Patches 1-5 are (some loosely, some tightly) related fixes and cleanups to
simplify the emulated counter approach implementation. The fixes are
tagged for stable as userspace could cause some weirdness around perf
events, but I doubt any real world VMM is actually affected.

Dapeng, I intentionally omitted your Reviewed-by from the last patch as
the change from v1 isn't trivial.

v2:
- Collect reviews. [Dapeng]
- Emulate overflow *after* pausing perf event. [Mingwei]

v1: https://lore.kernel.org/all/[email protected]

Sean Christopherson (6):
KVM: x86/pmu: Move PMU reset logic to common x86 code
KVM: x86/pmu: Reset the PMU, i.e. stop counters, before refreshing
KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)
KVM: x86/pmu: Remove manual clearing of fields in kvm_pmu_init()
KVM: x86/pmu: Update sample period in pmc_write_counter()
KVM: x86/pmu: Track emulated counter events instead of previous
counter

arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 +-
arch/x86/include/asm/kvm_host.h | 17 ++-
arch/x86/kvm/pmu.c | 140 +++++++++++++++++++++----
arch/x86/kvm/pmu.h | 47 +--------
arch/x86/kvm/svm/pmu.c | 17 ---
arch/x86/kvm/vmx/pmu_intel.c | 22 ----
arch/x86/kvm/x86.c | 1 -
7 files changed, 137 insertions(+), 109 deletions(-)


base-commit: 45b890f7689eb0aba454fc5831d2d79763781677
--
2.42.0.869.gea05f2083d-goog


2023-11-03 23:06:14

by Sean Christopherson

Subject: [PATCH v2 2/6] KVM: x86/pmu: Reset the PMU, i.e. stop counters, before refreshing

Stop all counters and release all perf events before refreshing the vPMU,
i.e. before reconfiguring the vPMU to respond to changes in the vCPU
model.

Clear need_cleanup in kvm_pmu_reset() as well so that KVM doesn't
prematurely stop counters, e.g. if KVM enters the guest and enables
counters before the vCPU is scheduled out.

Cc: [email protected]
Reviewed-by: Dapeng Mi <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/pmu.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 027e9c3c2b93..dc8e8e907cfb 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -657,25 +657,14 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}

-/* refresh PMU settings. This function generally is called when underlying
- * settings are changed (such as changes of PMU CPUID by guest VMs), which
- * should rarely happen.
- */
-void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
-{
- if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
- return;
-
- bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
- static_call(kvm_x86_pmu_refresh)(vcpu);
-}
-
void kvm_pmu_reset(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
int i;

+ pmu->need_cleanup = false;
+
bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX);

for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
@@ -695,6 +684,26 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu)
static_call_cond(kvm_x86_pmu_reset)(vcpu);
}

+
+/*
+ * Refresh the PMU configuration for the vCPU, e.g. if userspace changes CPUID
+ * and/or PERF_CAPABILITIES.
+ */
+void kvm_pmu_refresh(struct kvm_vcpu *vcpu)
+{
+ if (KVM_BUG_ON(kvm_vcpu_has_run(vcpu), vcpu->kvm))
+ return;
+
+ /*
+ * Stop/release all existing counters/events before realizing the new
+ * vPMU model.
+ */
+ kvm_pmu_reset(vcpu);
+
+ bitmap_zero(vcpu_to_pmu(vcpu)->all_valid_pmc_idx, X86_PMC_IDX_MAX);
+ static_call(kvm_x86_pmu_refresh)(vcpu);
+}
+
void kvm_pmu_init(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
--
2.42.0.869.gea05f2083d-goog

2023-11-03 23:06:14

by Sean Christopherson

Subject: [PATCH v2 3/6] KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)

Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed
only for RESET, which is really just the same thing as vCPU creation,
and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't
possibly be any work to do.

Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if
guest CPUID is empty, but everything that is at all dynamic is guaranteed
to be '0'/NULL, e.g. it should be impossible for KVM to have already
created a perf event.
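
Roughly, the creation-time ordering looks like this (a sketch pieced together
from the changelog and the x86.c hunk below, not verbatim code):

  kvm_arch_vcpu_create()
      kvm_pmu_init()                  /* zeroes the vPMU, then refreshes it */
      ...
      kvm_vcpu_reset(vcpu, false)     /* !init_event, i.e. RESET */
          kvm_pmu_reset()             /* dropped here: nothing to undo yet */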

Reviewed-by: Dapeng Mi <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/pmu.c | 2 +-
arch/x86/kvm/pmu.h | 1 -
arch/x86/kvm/x86.c | 1 -
3 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index dc8e8e907cfb..458e836c6efe 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -657,7 +657,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}

-void kvm_pmu_reset(struct kvm_vcpu *vcpu)
+static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index a46aa9b25150..db9a12c0a2ef 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -243,7 +243,6 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
-void kvm_pmu_reset(struct kvm_vcpu *vcpu);
void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f1..efbf52a9dc83 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12207,7 +12207,6 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
}

if (!init_event) {
- kvm_pmu_reset(vcpu);
vcpu->arch.smbase = 0x30000;

vcpu->arch.msr_misc_features_enables = 0;
--
2.42.0.869.gea05f2083d-goog

2023-11-03 23:06:20

by Sean Christopherson

Subject: [PATCH v2 4/6] KVM: x86/pmu: Remove manual clearing of fields in kvm_pmu_init()

Remove code that unnecessarily clears event_count and need_cleanup in
kvm_pmu_init(), the entire kvm_pmu is zeroed just a few lines earlier.
Vendor code doesn't set event_count or need_cleanup during .init(), and
if either VMX or SVM did set those fields it would be a flagrant bug.

Reviewed-by: Dapeng Mi <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/pmu.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 458e836c6efe..c06090196b00 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -710,8 +710,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)

memset(pmu, 0, sizeof(*pmu));
static_call(kvm_x86_pmu_init)(vcpu);
- pmu->event_count = 0;
- pmu->need_cleanup = false;
kvm_pmu_refresh(vcpu);
}

--
2.42.0.869.gea05f2083d-goog

2023-11-03 23:06:47

by Sean Christopherson

Subject: [PATCH v2 5/6] KVM: x86/pmu: Update sample period in pmc_write_counter()

Update a PMC's sample period in pmc_write_counter() to deduplicate code
across all callers of pmc_write_counter(). Opportunistically move
pmc_write_counter() into pmc.c now that it's doing more work. WRMSR isn't
such a hot path that an extra CALL+RET pair will be problematic, and the
order of function definitions needs to be changed anyways, i.e. now is a
convenient time to eat the churn.
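
For reference, the arithmetic behind the sample period update that
pmc_write_counter() now performs (see get_sample_period() in the hunk below),
assuming a hypothetical 48-bit counter for the example:

  /* guest writes 0xFFFFFFFFFFF0 to the PMC */
  sample_period = (-0xFFFFFFFFFFF0) & 0xFFFFFFFFFFFF = 0x10

i.e. perf is asked to fire after 16 more events, exactly when the
guest-visible counter would overflow; a write of '0' yields
pmc_bitmask(pmc) + 1, a full period.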

Reviewed-by: Dapeng Mi <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/kvm/pmu.c | 27 +++++++++++++++++++++++++++
arch/x86/kvm/pmu.h | 25 +------------------------
arch/x86/kvm/svm/pmu.c | 1 -
arch/x86/kvm/vmx/pmu_intel.c | 2 --
4 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c06090196b00..3725d001239d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -161,6 +161,15 @@ static u64 pmc_get_pebs_precise_level(struct kvm_pmc *pmc)
return 1;
}

+static u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
+{
+ u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
+
+ if (!sample_period)
+ sample_period = pmc_bitmask(pmc) + 1;
+ return sample_period;
+}
+
static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
bool exclude_user, bool exclude_kernel,
bool intr)
@@ -268,6 +277,24 @@ static void pmc_stop_counter(struct kvm_pmc *pmc)
}
}

+static void pmc_update_sample_period(struct kvm_pmc *pmc)
+{
+ if (!pmc->perf_event || pmc->is_paused ||
+ !is_sampling_event(pmc->perf_event))
+ return;
+
+ perf_event_period(pmc->perf_event,
+ get_sample_period(pmc, pmc->counter));
+}
+
+void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
+{
+ pmc->counter += val - pmc_read_counter(pmc);
+ pmc->counter &= pmc_bitmask(pmc);
+ pmc_update_sample_period(pmc);
+}
+EXPORT_SYMBOL_GPL(pmc_write_counter);
+
static int filter_cmp(const void *pa, const void *pb, u64 mask)
{
u64 a = *(u64 *)pa & mask;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index db9a12c0a2ef..cae85e550f60 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -74,11 +74,7 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
return counter & pmc_bitmask(pmc);
}

-static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
-{
- pmc->counter += val - pmc_read_counter(pmc);
- pmc->counter &= pmc_bitmask(pmc);
-}
+void pmc_write_counter(struct kvm_pmc *pmc, u64 val);

static inline bool pmc_is_gp(struct kvm_pmc *pmc)
{
@@ -128,25 +124,6 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
return NULL;
}

-static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
-{
- u64 sample_period = (-counter_value) & pmc_bitmask(pmc);
-
- if (!sample_period)
- sample_period = pmc_bitmask(pmc) + 1;
- return sample_period;
-}
-
-static inline void pmc_update_sample_period(struct kvm_pmc *pmc)
-{
- if (!pmc->perf_event || pmc->is_paused ||
- !is_sampling_event(pmc->perf_event))
- return;
-
- perf_event_period(pmc->perf_event,
- get_sample_period(pmc, pmc->counter));
-}
-
static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 3fd47de14b38..b6a7ad4d6914 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -161,7 +161,6 @@ static int amd_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
if (pmc) {
pmc_write_counter(pmc, data);
- pmc_update_sample_period(pmc);
return 0;
}
/* MSR_EVNTSELn */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 90c1f7f07e53..a6216c874729 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -437,11 +437,9 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
!(msr & MSR_PMC_FULL_WIDTH_BIT))
data = (s64)(s32)data;
pmc_write_counter(pmc, data);
- pmc_update_sample_period(pmc);
break;
} else if ((pmc = get_fixed_pmc(pmu, msr))) {
pmc_write_counter(pmc, data);
- pmc_update_sample_period(pmc);
break;
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
reserved_bits = pmu->reserved_bits;
--
2.42.0.869.gea05f2083d-goog

2023-11-03 23:06:47

by Sean Christopherson

Subject: [PATCH v2 1/6] KVM: x86/pmu: Move PMU reset logic to common x86 code

Move the common (or at least "ignored") aspects of resetting the vPMU to
common x86 code, along with the stop/release helpers that are now used only
by common pmu.c.

There is no need to manually handle fixed counters as all_valid_pmc_idx
tracks both fixed and general purpose counters, and resetting the vPMU is
far from a hot path, i.e. the extra bit of overhead to get the PMC from the
index is a non-issue.

Zero fixed_ctr_ctrl in common code even though it's Intel specific.
Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed
counters via all_valid_pmc_idx, but not clearing the associated control
bits, would be odd/confusing.

Make the .reset() hook optional as SVM no longer needs vendor specific
handling.

Cc: [email protected]
Reviewed-by: Dapeng Mi <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm-x86-pmu-ops.h | 2 +-
arch/x86/kvm/pmu.c | 40 +++++++++++++++++++++++++-
arch/x86/kvm/pmu.h | 18 ------------
arch/x86/kvm/svm/pmu.c | 16 -----------
arch/x86/kvm/vmx/pmu_intel.c | 20 -------------
5 files changed, 40 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-pmu-ops.h b/arch/x86/include/asm/kvm-x86-pmu-ops.h
index 6c98f4bb4228..058bc636356a 100644
--- a/arch/x86/include/asm/kvm-x86-pmu-ops.h
+++ b/arch/x86/include/asm/kvm-x86-pmu-ops.h
@@ -22,7 +22,7 @@ KVM_X86_PMU_OP(get_msr)
KVM_X86_PMU_OP(set_msr)
KVM_X86_PMU_OP(refresh)
KVM_X86_PMU_OP(init)
-KVM_X86_PMU_OP(reset)
+KVM_X86_PMU_OP_OPTIONAL(reset)
KVM_X86_PMU_OP_OPTIONAL(deliver_pmi)
KVM_X86_PMU_OP_OPTIONAL(cleanup)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 9ae07db6f0f6..027e9c3c2b93 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -250,6 +250,24 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
return true;
}

+static void pmc_release_perf_event(struct kvm_pmc *pmc)
+{
+ if (pmc->perf_event) {
+ perf_event_release_kernel(pmc->perf_event);
+ pmc->perf_event = NULL;
+ pmc->current_config = 0;
+ pmc_to_pmu(pmc)->event_count--;
+ }
+}
+
+static void pmc_stop_counter(struct kvm_pmc *pmc)
+{
+ if (pmc->perf_event) {
+ pmc->counter = pmc_read_counter(pmc);
+ pmc_release_perf_event(pmc);
+ }
+}
+
static int filter_cmp(const void *pa, const void *pb, u64 mask)
{
u64 a = *(u64 *)pa & mask;
@@ -654,7 +672,27 @@ void kvm_pmu_refresh(struct kvm_vcpu *vcpu)

void kvm_pmu_reset(struct kvm_vcpu *vcpu)
{
- static_call(kvm_x86_pmu_reset)(vcpu);
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ int i;
+
+ bitmap_zero(pmu->reprogram_pmi, X86_PMC_IDX_MAX);
+
+ for_each_set_bit(i, pmu->all_valid_pmc_idx, X86_PMC_IDX_MAX) {
+ pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, i);
+ if (!pmc)
+ continue;
+
+ pmc_stop_counter(pmc);
+ pmc->counter = 0;
+
+ if (pmc_is_gp(pmc))
+ pmc->eventsel = 0;
+ }
+
+ pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
+
+ static_call_cond(kvm_x86_pmu_reset)(vcpu);
}

void kvm_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 1d64113de488..a46aa9b25150 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -80,24 +80,6 @@ static inline void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
pmc->counter &= pmc_bitmask(pmc);
}

-static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
-{
- if (pmc->perf_event) {
- perf_event_release_kernel(pmc->perf_event);
- pmc->perf_event = NULL;
- pmc->current_config = 0;
- pmc_to_pmu(pmc)->event_count--;
- }
-}
-
-static inline void pmc_stop_counter(struct kvm_pmc *pmc)
-{
- if (pmc->perf_event) {
- pmc->counter = pmc_read_counter(pmc);
- pmc_release_perf_event(pmc);
- }
-}
-
static inline bool pmc_is_gp(struct kvm_pmc *pmc)
{
return pmc->type == KVM_PMC_GP;
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 373ff6a6687b..3fd47de14b38 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -233,21 +233,6 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
}
}

-static void amd_pmu_reset(struct kvm_vcpu *vcpu)
-{
- struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int i;
-
- for (i = 0; i < KVM_AMD_PMC_MAX_GENERIC; i++) {
- struct kvm_pmc *pmc = &pmu->gp_counters[i];
-
- pmc_stop_counter(pmc);
- pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
- }
-
- pmu->global_ctrl = pmu->global_status = 0;
-}
-
struct kvm_pmu_ops amd_pmu_ops __initdata = {
.hw_event_available = amd_hw_event_available,
.pmc_idx_to_pmc = amd_pmc_idx_to_pmc,
@@ -259,7 +244,6 @@ struct kvm_pmu_ops amd_pmu_ops __initdata = {
.set_msr = amd_pmu_set_msr,
.refresh = amd_pmu_refresh,
.init = amd_pmu_init,
- .reset = amd_pmu_reset,
.EVENTSEL_EVENT = AMD64_EVENTSEL_EVENT,
.MAX_NR_GP_COUNTERS = KVM_AMD_PMC_MAX_GENERIC,
.MIN_NR_GP_COUNTERS = AMD64_NUM_COUNTERS,
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 820d3e1f6b4f..90c1f7f07e53 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -632,26 +632,6 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)

static void intel_pmu_reset(struct kvm_vcpu *vcpu)
{
- struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- struct kvm_pmc *pmc = NULL;
- int i;
-
- for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) {
- pmc = &pmu->gp_counters[i];
-
- pmc_stop_counter(pmc);
- pmc->counter = pmc->prev_counter = pmc->eventsel = 0;
- }
-
- for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
- pmc = &pmu->fixed_counters[i];
-
- pmc_stop_counter(pmc);
- pmc->counter = pmc->prev_counter = 0;
- }
-
- pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status = 0;
-
intel_pmu_release_guest_lbr_event(vcpu);
}

--
2.42.0.869.gea05f2083d-goog

2023-11-03 23:07:02

by Sean Christopherson

Subject: [PATCH v2 6/6] KVM: x86/pmu: Track emulated counter events instead of previous counter

Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment value
in order to detect overflow from emulation, and the snapshot approach is
inherently flawed.

Snapshotting the previous counter at every increment assumes that there is
at most one emulated counter event per emulated instruction (or rather,
between checks for KVM_REQ_PMU). That mostly holds true today because
KVM only emulates (branch) instructions retired, but the approach will
fall apart if KVM ever supports event types that don't have a 1:1
relationship with instructions.

And KVM already has a relevant bug, as handle_invalid_guest_state()
emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
miss an overflow event due to clobbering pmc->prev_counter. Not checking
KVM_REQ_PMU is problematic in both cases, but at least with the emulated
counter approach, the resulting behavior is delayed overflow detection,
as opposed to completely lost detection.

Tracking the emulated count fixes another bug where the snapshot approach
can signal spurious overflow due to incorporating both the emulated count
and perf's count in the check, i.e. if overflow is detected by perf, then
KVM's emulation will also incorrectly signal overflow. Add a comment in
the related code to call out the need to process emulated events *after*
pausing the perf event (big kudos to Mingwei for figuring out that
particular wrinkle).
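
For contrast, a standalone model of the reworked ordering (again assuming a
hypothetical 8-bit counter; the helper here is only a model of the real logic
in pmc_pause_counter() below):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define MASK 0xffull    /* stand-in for pmc_bitmask(pmc) */

  /* Returns true only if the *emulated* events wrapped the counter. */
  static bool pause_counter(uint64_t *counter, uint64_t *emulated,
                            uint64_t perf_delta)
  {
          uint64_t prev;

          *counter += perf_delta;          /* accumulate perf's count first */
          prev = *counter & MASK;          /* reference taken *after* perf */

          *counter = (*counter + *emulated) & MASK;
          *emulated = 0;

          return *counter < prev;
  }

  int main(void)
  {
          uint64_t counter = 0xfd, emulated = 1;

          /*
           * perf wrapped the counter on its own (its PMI covers that), and
           * the lone emulated event doesn't wrap the new reference point,
           * so no duplicate overflow is signaled.
           */
          printf("emulate overflow: %d\n", pause_counter(&counter, &emulated, 5));
          return 0;
  }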

Cc: Mingwei Zhang <[email protected]>
Cc: Roman Kagan <[email protected]>
Cc: Jim Mattson <[email protected]>
Cc: Dapeng Mi <[email protected]>
Cc: Like Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 17 +++++++++++-
arch/x86/kvm/pmu.c | 48 ++++++++++++++++++++++++---------
arch/x86/kvm/pmu.h | 3 ++-
3 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7036982332e..d8bc9ba88cfc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -500,8 +500,23 @@ struct kvm_pmc {
u8 idx;
bool is_paused;
bool intr;
+ /*
+ * Base value of the PMC counter, relative to the *consumed* count in
+ * the associated perf_event. This value includes counter updates from
+ * the perf_event and emulated_count since the last time the counter
+ * was reprogrammed, but it is *not* the current value as seen by the
+ * guest or userspace.
+ *
+ * The count is relative to the associated perf_event so that KVM
+ * doesn't need to reprogram the perf_event every time the guest writes
+ * to the counter.
+ */
u64 counter;
- u64 prev_counter;
+ /*
+ * PMC events triggered by KVM emulation that haven't been fully
+ * processed, i.e. haven't undergone overflow detection.
+ */
+ u64 emulated_counter;
u64 eventsel;
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 3725d001239d..87cc6c8809ad 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -127,9 +127,9 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
struct kvm_pmc *pmc = perf_event->overflow_handler_context;

/*
- * Ignore overflow events for counters that are scheduled to be
- * reprogrammed, e.g. if a PMI for the previous event races with KVM's
- * handling of a related guest WRMSR.
+ * Ignore asynchronous overflow events for counters that are scheduled
+ * to be reprogrammed, e.g. if a PMI for the previous event races with
+ * KVM's handling of a related guest WRMSR.
*/
if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
return;
@@ -224,17 +224,30 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
return 0;
}

-static void pmc_pause_counter(struct kvm_pmc *pmc)
+static bool pmc_pause_counter(struct kvm_pmc *pmc)
{
u64 counter = pmc->counter;
-
- if (!pmc->perf_event || pmc->is_paused)
- return;
+ u64 prev_counter;

/* update counter, reset event value to avoid redundant accumulation */
- counter += perf_event_pause(pmc->perf_event, true);
+ if (pmc->perf_event && !pmc->is_paused)
+ counter += perf_event_pause(pmc->perf_event, true);
+
+ /*
+ * Snapshot the previous counter *after* accumulating state from perf.
+ * If overflow already happened, hardware (via perf) is responsible for
+ * generating a PMI. KVM just needs to detect overflow on emulated
+ * counter events that haven't yet been processed.
+ */
+ prev_counter = counter & pmc_bitmask(pmc);
+
+ counter += pmc->emulated_counter;
pmc->counter = counter & pmc_bitmask(pmc);
+
+ pmc->emulated_counter = 0;
pmc->is_paused = true;
+
+ return pmc->counter < prev_counter;
}

static bool pmc_resume_counter(struct kvm_pmc *pmc)
@@ -289,6 +302,15 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)

void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
{
+ /*
+ * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
+ * read-modify-write. Adjust the counter value so that its value is
+ * relative to the current count, as reading the current count from
+ * perf is faster than pausing and reprogramming the event in order to
+ * reset it to '0'. Note, this very sneakily offsets the accumulated
+ * emulated count too, by using pmc_read_counter()!
+ */
+ pmc->emulated_counter = 0;
pmc->counter += val - pmc_read_counter(pmc);
pmc->counter &= pmc_bitmask(pmc);
pmc_update_sample_period(pmc);
@@ -428,14 +450,15 @@ static void reprogram_counter(struct kvm_pmc *pmc)
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
u64 eventsel = pmc->eventsel;
u64 new_config = eventsel;
+ bool emulate_overflow;
u8 fixed_ctr_ctrl;

- pmc_pause_counter(pmc);
+ emulate_overflow = pmc_pause_counter(pmc);

if (!pmc_event_is_allowed(pmc))
goto reprogram_complete;

- if (pmc->counter < pmc->prev_counter)
+ if (emulate_overflow)
__kvm_perf_overflow(pmc, false);

if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
@@ -475,7 +498,6 @@ static void reprogram_counter(struct kvm_pmc *pmc)

reprogram_complete:
clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
- pmc->prev_counter = 0;
}

void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
@@ -701,6 +723,7 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)

pmc_stop_counter(pmc);
pmc->counter = 0;
+ pmc->emulated_counter = 0;

if (pmc_is_gp(pmc))
pmc->eventsel = 0;
@@ -772,8 +795,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)

static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
{
- pmc->prev_counter = pmc->counter;
- pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
+ pmc->emulated_counter++;
kvm_pmu_request_counter_reprogram(pmc);
}

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index cae85e550f60..7caeb3d8d4fd 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -66,7 +66,8 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
{
u64 counter, enabled, running;

- counter = pmc->counter;
+ counter = pmc->counter + pmc->emulated_counter;
+
if (pmc->perf_event && !pmc->is_paused)
counter += perf_event_read_value(pmc->perf_event,
&enabled, &running);
--
2.42.0.869.gea05f2083d-goog

2023-11-03 23:22:02

by Mingwei Zhang

Subject: Re: [PATCH v2 6/6] KVM: x86/pmu: Track emulated counter events instead of previous counter

On Fri, Nov 03, 2023, Sean Christopherson wrote:
> Explicitly track emulated counter events instead of using the common
> counter value that's shared with the hardware counter owned by perf.
> Bumping the common counter requires snapshotting the pre-increment value
> in order to detect overflow from emulation, and the snapshot approach is
> inherently flawed.
>
> Snapshotting the previous counter at every increment assumes that there is
> at most one emulated counter event per emulated instruction (or rather,
> between checks for KVM_REQ_PMU). That mostly holds true today because
> KVM only emulates (branch) instructions retired, but the approach will
> fall apart if KVM ever supports event types that don't have a 1:1
> relationship with instructions.
>
> And KVM already has a relevant bug, as handle_invalid_guest_state()
> emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
> miss an overflow event due to clobbering pmc->prev_counter. Not checking
> KVM_REQ_PMU is problematic in both cases, but at least with the emulated
> counter approach, the resulting behavior is delayed overflow detection,
> as opposed to completely lost detection.
>
> Tracking the emulated count fixes another bug where the snapshot approach
> can signal spurious overflow due to incorporating both the emulated count
> and perf's count in the check, i.e. if overflow is detected by perf, then
> KVM's emulation will also incorrectly signal overflow. Add a comment in
> the related code to call out the need to process emulated events *after*
> pausing the perf event (big kudos to Mingwei for figuring out that
> particular wrinkle).
>
> Cc: Mingwei Zhang <[email protected]>
> Cc: Roman Kagan <[email protected]>
> Cc: Jim Mattson <[email protected]>
> Cc: Dapeng Mi <[email protected]>
> Cc: Like Xu <[email protected]>
> Signed-off-by: Sean Christopherson <[email protected]>
> ---
Reviewed-by: Mingwei Zhang <[email protected]>

> arch/x86/include/asm/kvm_host.h | 17 +++++++++++-
> arch/x86/kvm/pmu.c | 48 ++++++++++++++++++++++++---------
> arch/x86/kvm/pmu.h | 3 ++-
> 3 files changed, 53 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d7036982332e..d8bc9ba88cfc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -500,8 +500,23 @@ struct kvm_pmc {
> u8 idx;
> bool is_paused;
> bool intr;
> + /*
> + * Base value of the PMC counter, relative to the *consumed* count in
> + * the associated perf_event. This value includes counter updates from
> + * the perf_event and emulated_count since the last time the counter
> + * was reprogrammed, but it is *not* the current value as seen by the
> + * guest or userspace.
> + *
> + * The count is relative to the associated perf_event so that KVM
> + * doesn't need to reprogram the perf_event every time the guest writes
> + * to the counter.
> + */
> u64 counter;
> - u64 prev_counter;
> + /*
> + * PMC events triggered by KVM emulation that haven't been fully
> + * processed, i.e. haven't undergone overflow detection.
> + */
> + u64 emulated_counter;
> u64 eventsel;
> struct perf_event *perf_event;
> struct kvm_vcpu *vcpu;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 3725d001239d..87cc6c8809ad 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -127,9 +127,9 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
> struct kvm_pmc *pmc = perf_event->overflow_handler_context;
>
> /*
> - * Ignore overflow events for counters that are scheduled to be
> - * reprogrammed, e.g. if a PMI for the previous event races with KVM's
> - * handling of a related guest WRMSR.
> + * Ignore asynchronous overflow events for counters that are scheduled
> + * to be reprogrammed, e.g. if a PMI for the previous event races with
> + * KVM's handling of a related guest WRMSR.
> */
> if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
> return;
> @@ -224,17 +224,30 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
> return 0;
> }
>
> -static void pmc_pause_counter(struct kvm_pmc *pmc)
> +static bool pmc_pause_counter(struct kvm_pmc *pmc)
> {
> u64 counter = pmc->counter;
> -
> - if (!pmc->perf_event || pmc->is_paused)
> - return;
> + u64 prev_counter;
>
> /* update counter, reset event value to avoid redundant accumulation */
> - counter += perf_event_pause(pmc->perf_event, true);
> + if (pmc->perf_event && !pmc->is_paused)
> + counter += perf_event_pause(pmc->perf_event, true);
> +
> + /*
> + * Snapshot the previous counter *after* accumulating state from perf.
> + * If overflow already happened, hardware (via perf) is responsible for
> + * generating a PMI. KVM just needs to detect overflow on emulated
> + * counter events that haven't yet been processed.
> + */
> + prev_counter = counter & pmc_bitmask(pmc);
> +
> + counter += pmc->emulated_counter;
> pmc->counter = counter & pmc_bitmask(pmc);
> +
> + pmc->emulated_counter = 0;
> pmc->is_paused = true;
> +
> + return pmc->counter < prev_counter;
> }
>
> static bool pmc_resume_counter(struct kvm_pmc *pmc)
> @@ -289,6 +302,15 @@ static void pmc_update_sample_period(struct kvm_pmc *pmc)
>
> void pmc_write_counter(struct kvm_pmc *pmc, u64 val)
> {
> + /*
> + * Drop any unconsumed accumulated counts, the WRMSR is a write, not a
> + * read-modify-write. Adjust the counter value so that its value is
> + * relative to the current count, as reading the current count from
> + * perf is faster than pausing and reprogramming the event in order to
> + * reset it to '0'. Note, this very sneakily offsets the accumulated
> + * emulated count too, by using pmc_read_counter()!
> + */
> + pmc->emulated_counter = 0;
> pmc->counter += val - pmc_read_counter(pmc);
> pmc->counter &= pmc_bitmask(pmc);
> pmc_update_sample_period(pmc);
> @@ -428,14 +450,15 @@ static void reprogram_counter(struct kvm_pmc *pmc)
> struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> u64 eventsel = pmc->eventsel;
> u64 new_config = eventsel;
> + bool emulate_overflow;
> u8 fixed_ctr_ctrl;
>
> - pmc_pause_counter(pmc);
> + emulate_overflow = pmc_pause_counter(pmc);
>
> if (!pmc_event_is_allowed(pmc))
> goto reprogram_complete;
>
> - if (pmc->counter < pmc->prev_counter)
> + if (emulate_overflow)
> __kvm_perf_overflow(pmc, false);
>
> if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
> @@ -475,7 +498,6 @@ static void reprogram_counter(struct kvm_pmc *pmc)
>
> reprogram_complete:
> clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
> - pmc->prev_counter = 0;
> }
>
> void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> @@ -701,6 +723,7 @@ static void kvm_pmu_reset(struct kvm_vcpu *vcpu)
>
> pmc_stop_counter(pmc);
> pmc->counter = 0;
> + pmc->emulated_counter = 0;
>
> if (pmc_is_gp(pmc))
> pmc->eventsel = 0;
> @@ -772,8 +795,7 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>
> static void kvm_pmu_incr_counter(struct kvm_pmc *pmc)
> {
> - pmc->prev_counter = pmc->counter;
> - pmc->counter = (pmc->counter + 1) & pmc_bitmask(pmc);
> + pmc->emulated_counter++;
> kvm_pmu_request_counter_reprogram(pmc);
> }
>
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index cae85e550f60..7caeb3d8d4fd 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -66,7 +66,8 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> {
> u64 counter, enabled, running;
>
> - counter = pmc->counter;
> + counter = pmc->counter + pmc->emulated_counter;
> +
> if (pmc->perf_event && !pmc->is_paused)
> counter += perf_event_read_value(pmc->perf_event,
> &enabled, &running);
> --
> 2.42.0.869.gea05f2083d-goog
>

2023-12-01 01:56:27

by Sean Christopherson

Subject: Re: [PATCH v2 0/6] KVM: x86/pmu: Clean up emulated PMC event handling

On Fri, 03 Nov 2023 16:05:35 -0700, Sean Christopherson wrote:
> The ultimate goal of this series is to track emulated counter events using
> a dedicated variable instead of trying to track the previous counter value.
> Tracking the previous counter value is flawed as it takes a snapshot at
> every emulated event, but only checks for overflow prior to VM-Enter, i.e.
> KVM could miss an overflow if KVM ever supports emulating event types that
> can occur multiple times in a single VM-Exit.
>
> [...]

Applied to kvm-x86 pmu, thanks!

[1/6] KVM: x86/pmu: Move PMU reset logic to common x86 code
https://github.com/kvm-x86/linux/commit/cbb359d81a26
[2/6] KVM: x86/pmu: Reset the PMU, i.e. stop counters, before refreshing
https://github.com/kvm-x86/linux/commit/1647b52757d5
[3/6] KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)
https://github.com/kvm-x86/linux/commit/f2f63f7ec6fd
[4/6] KVM: x86/pmu: Remove manual clearing of fields in kvm_pmu_init()
https://github.com/kvm-x86/linux/commit/ec61b2306dfd
[5/6] KVM: x86/pmu: Update sample period in pmc_write_counter()
https://github.com/kvm-x86/linux/commit/89acf1237b81
[6/6] KVM: x86/pmu: Track emulated counter events instead of previous counter
https://github.com/kvm-x86/linux/commit/fd89499a5151

--
https://github.com/kvm-x86/linux/tree/next