LinuxLists.cc - [PATCH 0/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

[permalink] [raw]

Subject: [PATCH 1/5] KVM: x86/pmu: Emulate CTR overflow directly in kvm_pmu_handle_event()

From: Like Xu <[email protected]>

More and more vPMU emulations are deferred to kvm_pmu_handle_event()
as it reduces the overhead of repeated execution. The reprogram_counter()
is only responsible for creating the required perf_event, and naturally
doesn't include the counter overflow part from emulated instruction events.

Prior to this change, the assignment of pmc->prev_counter always occurred
after pmc was enabled, and to keep the same semantics, pmc->prev_counter
always needed to be reset after it has taken effect.

No functional change intended.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a1a79b5f49d7..d1c89a6625a0 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -418,9 +418,6 @@ static void reprogram_counter(struct kvm_pmc *pmc)
if (!event_is_allowed(pmc))
goto reprogram_complete;

- if (pmc->counter < pmc->prev_counter)
- __kvm_perf_overflow(pmc, false);
-
if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
printk_once("kvm pmu: pin control bit is ignored\n");

@@ -458,6 +455,13 @@ static void reprogram_counter(struct kvm_pmc *pmc)

reprogram_complete:
clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
+}
+
+static inline void kvm_pmu_handle_pmc_overflow(struct kvm_pmc *pmc)
+{
+ if (pmc->counter < pmc->prev_counter)
+ __kvm_perf_overflow(pmc, false);
+
pmc->prev_counter = 0;
}

@@ -475,6 +479,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
}

reprogram_counter(pmc);
+ kvm_pmu_handle_pmc_overflow(pmc);
}

/*
--
2.39.2

2023-03-10 10:54:39

[permalink] [raw]

Subject: [PATCH 2/5] KVM: x86/pmu: Add a helper to check if pmc has PEBS mode enabled

From: Like Xu <[email protected]>

Add a helper to check if pmc has PEBS mode enabled so that more new
code may reuse this part and opportunistically drop a pmu reference.

No functional change intended.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 3 +--
arch/x86/kvm/pmu.h | 7 +++++++
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d1c89a6625a0..01a6b7ffa9b1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -191,7 +191,6 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
bool exclude_user, bool exclude_kernel,
bool intr)
{
- struct kvm_pmu *pmu = pmc_to_pmu(pmc);
struct perf_event *event;
struct perf_event_attr attr = {
.type = type,
@@ -203,7 +202,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
.exclude_kernel = exclude_kernel,
.config = config,
};
- bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+ bool pebs = pebs_is_enabled(pmc);

attr.sample_period = get_sample_period(pmc, pmc->counter);

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index cff0651b030b..db4262fe8814 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -189,6 +189,13 @@ static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}

+static inline bool pebs_is_enabled(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ return test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+}
+
void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
--
2.39.2

2023-03-10 10:54:59

[permalink] [raw]

Subject: [PATCH 3/5] KVM: x86/pmu: Move the overflow of a normal counter out of PMI context

From: Like Xu <[email protected]>

From the guest's point of view, vPMU's global_status bit update following
a counter overflow is completely independent of whether it is emulated
in the host PMI context. The guest counter overflow emulation only depends
on whether pmc->counter has an overflow or not. Plus the counter overflow
generated by the emulation instruction has been delayed and not been
handled in the PMI context. This part of the logic can be unified by
reusing pmc->prev_counter for a normal counter. However for a PEBS
counter, its buffer overflow irq still requires hardware to trigger PMI.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 01a6b7ffa9b1..81c7cc4ceadf 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -160,7 +160,10 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
return;

- __kvm_perf_overflow(pmc, true);
+ if (pebs_is_enabled(pmc))
+ __kvm_perf_overflow(pmc, true);
+ else
+ pmc->prev_counter = pmc->counter;

kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
--
2.39.2

2023-03-10 10:54:59

[permalink] [raw]

Subject: [PATCH 4/5] KVM: x86/pmu: Reorder functions to reduce unnecessary declarations

From: Like Xu <[email protected]>

Considering that more emulations are deferred to kvm_pmu_handle_event(),
moving it to the end of pmu.c makes it easier to call previous functions,
instead of just piling up the function declarations to make compiler green.
The same motivation is applied to kvm_pmu_request_counter_reprogram(),
as it is the trigger for any emulations delay.

No functional change intended.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 52 +++++++++++++++++++++++-----------------------
arch/x86/kvm/pmu.h | 12 +++++------
2 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 81c7cc4ceadf..2a0504732966 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -467,32 +467,6 @@ static inline void kvm_pmu_handle_pmc_overflow(struct kvm_pmc *pmc)
pmc->prev_counter = 0;
}

-void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
-{
- struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int bit;
-
- for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
- struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);
-
- if (unlikely(!pmc)) {
- clear_bit(bit, pmu->reprogram_pmi);
- continue;
- }
-
- reprogram_counter(pmc);
- kvm_pmu_handle_pmc_overflow(pmc);
- }
-
- /*
- * Unused perf_events are only released if the corresponding MSRs
- * weren't accessed during the last vCPU time slice. kvm_arch_sched_in
- * triggers KVM_REQ_PMU if cleanup is needed.
- */
- if (unlikely(pmu->need_cleanup))
- kvm_pmu_cleanup(vcpu);
-}
-
/* check if idx is a valid index to access PMU */
bool kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx)
{
@@ -847,3 +821,29 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
kfree(filter);
return r;
}
+
+void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ int bit;
+
+ for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
+ struct kvm_pmc *pmc = static_call(kvm_x86_pmu_pmc_idx_to_pmc)(pmu, bit);
+
+ if (unlikely(!pmc)) {
+ clear_bit(bit, pmu->reprogram_pmi);
+ continue;
+ }
+
+ reprogram_counter(pmc);
+ kvm_pmu_handle_pmc_overflow(pmc);
+ }
+
+ /*
+ * Unused perf_events are only released if the corresponding MSRs
+ * weren't accessed during the last vCPU time slice. kvm_arch_sched_in
+ * triggers KVM_REQ_PMU if cleanup is needed.
+ */
+ if (unlikely(pmu->need_cleanup))
+ kvm_pmu_cleanup(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index db4262fe8814..a47b579667c6 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -48,6 +48,12 @@ static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
return pmu->counter_bitmask[pmc->type];
}

+static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
+{
+ set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+}
+
static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
{
u64 counter, enabled, running;
@@ -183,12 +189,6 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
KVM_PMC_MAX_FIXED);
}

-static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
-{
- set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
- kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
-}
-
static inline bool pebs_is_enabled(struct kvm_pmc *pmc)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
--
2.39.2

2023-03-10 10:55:06

[permalink] [raw]

Subject: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

From: Like Xu <[email protected]>

When AMD guest is counting (branch) instructions event, its vPMU should
first subtract one for any relevant (branch)-instructions enabled counter
(when it precedes VMRUN and cannot be preempted) to offset the inevitable
plus-one effect of the VMRUN instruction immediately follows.

Based on a number of micro observations (also the reason why x86_64/
pmu_event_filter_test fails on AMD Zen platforms), each VMRUN will
increment all hw-(branch)-instructions counters by 1, even if they are
only enabled for guest code. This issue seriously affects the performance
understanding of guest developers based on (branch) instruction events.

If the current physical register value on the hardware is ~0x0, it triggers
an overflow in the guest world right after running VMRUN. Although this
cannot be avoided on mainstream released hardware, the resulting PMI
(if configured) will not be incorrectly injected into the guest by vPMU,
since the delayed injection mechanism for a normal counter overflow
depends only on the change of pmc->counter values.

The pmu_hide_vmrun() is called before each VMRUN and its overhead
depends on the number of counters enabled by the guest and is negligible
when none of the counters are used.

Cc: Ravi Bangoria <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 ++++
arch/x86/kvm/pmu.c | 18 ++++++++++++++++++
arch/x86/kvm/pmu.h | 24 +++++++++++++++++++++++-
arch/x86/kvm/svm/pmu.c | 1 +
arch/x86/kvm/svm/svm.c | 28 ++++++++++++++++++++++++++++
5 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index adb92fc4d7c9..d6fcbf233cb3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -561,6 +561,10 @@ struct kvm_pmu {
*/
u64 host_cross_mapped_mask;

+ /* Flags to track any HW quirks that need to be fixed by vPMU. */
+ u64 quirk_flags;
+ DECLARE_BITMAP(hide_vmrun_pmc_idx, X86_PMC_IDX_MAX);
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 2a0504732966..315dca021d57 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -254,6 +254,7 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
counter += perf_event_pause(pmc->perf_event, true);
pmc->counter = counter & pmc_bitmask(pmc);
pmc->is_paused = true;
+ kvm_mark_pmc_is_quirky(pmc);
}

static bool pmc_resume_counter(struct kvm_pmc *pmc)
@@ -822,6 +823,19 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
return r;
}

+static inline bool event_is_branch_instruction(struct kvm_pmc *pmc)
+{
+ return eventsel_match_perf_hw_id(pmc, PERF_COUNT_HW_INSTRUCTIONS) ||
+ eventsel_match_perf_hw_id(pmc,
+ PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
+}
+
+static inline bool quirky_pmc_will_count_vmrun(struct kvm_pmc *pmc)
+{
+ return event_is_branch_instruction(pmc) && event_is_allowed(pmc) &&
+ !static_call(kvm_x86_get_cpl)(pmc->vcpu);
+}
+
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -837,6 +851,10 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)

reprogram_counter(pmc);
kvm_pmu_handle_pmc_overflow(pmc);
+
+ if (vcpu_has_pmu_quirks(vcpu) &&
+ quirky_pmc_will_count_vmrun(pmc))
+ set_bit(pmc->idx, pmu->hide_vmrun_pmc_idx);
}

/*
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index a47b579667c6..30f6f58f4c38 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -18,6 +18,9 @@
#define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002

+#define X86_PMU_COUNT_VMRUN BIT_ULL(0)
+#define X86_PMU_QUIRKS_MASK X86_PMU_COUNT_VMRUN
+
struct kvm_pmu_ops {
bool (*hw_event_available)(struct kvm_pmc *pmc);
bool (*pmc_is_enabled)(struct kvm_pmc *pmc);
@@ -54,14 +57,33 @@ static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}

+static inline bool vcpu_has_pmu_quirks(struct kvm_vcpu *vcpu)
+{
+ return vcpu_to_pmu(vcpu)->quirk_flags & X86_PMU_QUIRKS_MASK;
+}
+
+/*
+ * The time to mark pmc is when the accumulation value returned
+ * by perf API based on a HW counter has just taken effect.
+ */
+static inline void kvm_mark_pmc_is_quirky(struct kvm_pmc *pmc)
+{
+ if (!vcpu_has_pmu_quirks(pmc->vcpu))
+ return;
+
+ kvm_pmu_request_counter_reprogram(pmc);
+}
+
static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
{
u64 counter, enabled, running;

counter = pmc->counter;
- if (pmc->perf_event && !pmc->is_paused)
+ if (pmc->perf_event && !pmc->is_paused) {
counter += perf_event_read_value(pmc->perf_event,
&enabled, &running);
+ kvm_mark_pmc_is_quirky(pmc);
+ }
/* FIXME: Scaling needed? */
return counter & pmc_bitmask(pmc);
}
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index 5fa939e411d8..130991a97f22 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -187,6 +187,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->nr_arch_fixed_counters = 0;
pmu->global_status = 0;
bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
+ pmu->quirk_flags |= X86_PMU_COUNT_VMRUN;
}

static void amd_pmu_init(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index f41d96e638ef..f6b33d172481 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3919,6 +3919,31 @@ static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
return EXIT_FASTPATH_NONE;
}

+static void pmu_hide_vmrun(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc;
+ unsigned int i;
+
+ for_each_set_bit(i, pmu->hide_vmrun_pmc_idx, X86_PMC_IDX_MAX) {
+ clear_bit(i, pmu->hide_vmrun_pmc_idx);
+
+ /* AMD doesn't have fixed counters at now. */
+ if (i >= pmu->nr_arch_gp_counters)
+ continue;
+
+ /*
+ * The prerequisite for fixing HW quirks is that there is indeed
+ * HW working and perf has no chance to retrieve the counter.
+ */
+ pmc = &pmu->gp_counters[i];
+ if (!pmc->perf_event || pmc->perf_event->hw.idx < 0)
+ continue;
+
+ pmc->counter--;
+ }
+}
+
static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_intercepted)
{
struct vcpu_svm *svm = to_svm(vcpu);
@@ -3986,6 +4011,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)

kvm_wait_lapic_expire(vcpu);

+ if (vcpu->kvm->arch.enable_pmu && vcpu_has_pmu_quirks(vcpu))
+ pmu_hide_vmrun(vcpu);
+
/*
* If this vCPU has touched SPEC_CTRL, restore the guest's value if
* it's non-zero. Since vmentry is serialising on affected CPUs, there
--
2.39.2

2023-03-13 10:57:28

by Sandipan Das

[permalink] [raw]

Subject: Re: [PATCH 0/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

+CC: Santosh, Tom, Ananth

Hi Like,

On 3/10/2023 4:23 PM, Like Xu wrote:
> Considering that developers are more likely to have access to AMD VMs
> and use vPMU inside guest, there's a dark cloud that needs to rain.
> The x86_64/pmu_event_filter_test always fails on Zen3 boxes:
>
> test_amd_deny_list: Branch instructions retired = 43 (expected 42)
> test_without_filter: Branch instructions retired = 43 (expected 42)
> test_member_allow_list: Branch instructions retired = 43 (expected 42)
> test_not_member_deny_list: Branch instructions retired = 43 (expected 42)
>
> ,which is not caused by the event_filter feature (otherwise it's zero).
>
> After some dubious guessing and microtesting on Zen3+ pmu hardware,
> we found that VMRUN or one of the instructions in __svm_vcpu_run()
> causes a guest-only enabled counter for counting guest instruction (in the
> pmu_event_filter case, the branch instruction) to always increase by one
> right after each vm_entry.
>
> This creates an inconsistency with the AMD64_EVENTSEL_GUESTONLY,
> where the vPMU user in the VM does not expect to see any counter
> changes due to the SVM transaction at all. This patch set provides a low
> overhead software fix until HW change arrives or simply no fix planned.
>

Yes, VMRUNs do get counted as retired branches in the guest context. My
understanding is that this behaviour applies to all generations of Zen
and even some older ones too, not just Zen 3 and later. I also do not
expect this to change in the near future.

- Sandipan

2023-03-23 08:23:50

[permalink] [raw]

Subject: Re: [PATCH 0/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On 13/3/2023 6:57 pm, Sandipan Das wrote:
> +CC: Santosh, Tom, Ananth
>
> Hi Like,
>
> On 3/10/2023 4:23 PM, Like Xu wrote:
>> Considering that developers are more likely to have access to AMD VMs
>> and use vPMU inside guest, there's a dark cloud that needs to rain.
>> The x86_64/pmu_event_filter_test always fails on Zen3 boxes:
>>
>> test_amd_deny_list: Branch instructions retired = 43 (expected 42)
>> test_without_filter: Branch instructions retired = 43 (expected 42)
>> test_member_allow_list: Branch instructions retired = 43 (expected 42)
>> test_not_member_deny_list: Branch instructions retired = 43 (expected 42)
>>
>> ,which is not caused by the event_filter feature (otherwise it's zero).
>>
>> After some dubious guessing and microtesting on Zen3+ pmu hardware,
>> we found that VMRUN or one of the instructions in __svm_vcpu_run()
>> causes a guest-only enabled counter for counting guest instruction (in the
>> pmu_event_filter case, the branch instruction) to always increase by one
>> right after each vm_entry.
>>
>> This creates an inconsistency with the AMD64_EVENTSEL_GUESTONLY,
>> where the vPMU user in the VM does not expect to see any counter
>> changes due to the SVM transaction at all. This patch set provides a low
>> overhead software fix until HW change arrives or simply no fix planned.
>>
>
> Yes, VMRUNs do get counted as retired branches in the guest context. My
> understanding is that this behaviour applies to all generations of Zen
> and even some older ones too, not just Zen 3 and later. I also do not
> expect this to change in the near future.
>
> - Sandipan
>

Interesting, thanks for confirming this issue (I presume this is an official reply).

If this hardware behavior is not expected to change, this software-based correction
would be essential and urgent for those who only use vPMU inside AMD guests.

Let's see what will happen ...

2023-04-07 02:20:06

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Fri, Mar 10, 2023, Like Xu wrote:
> From: Like Xu <[email protected]>
>
> When AMD guest is counting (branch) instructions event, its vPMU should
> first subtract one for any relevant (branch)-instructions enabled counter
> (when it precedes VMRUN and cannot be preempted) to offset the inevitable
> plus-one effect of the VMRUN instruction immediately follows.
>
> Based on a number of micro observations (also the reason why x86_64/
> pmu_event_filter_test fails on AMD Zen platforms), each VMRUN will
> increment all hw-(branch)-instructions counters by 1, even if they are
> only enabled for guest code. This issue seriously affects the performance
> understanding of guest developers based on (branch) instruction events.
>
> If the current physical register value on the hardware is ~0x0, it triggers
> an overflow in the guest world right after running VMRUN. Although this
> cannot be avoided on mainstream released hardware, the resulting PMI
> (if configured) will not be incorrectly injected into the guest by vPMU,
> since the delayed injection mechanism for a normal counter overflow
> depends only on the change of pmc->counter values.

IIUC, this is saying that KVM may get a spurious PMI, but otherwise nothing bad
will happen?

> +static inline bool event_is_branch_instruction(struct kvm_pmc *pmc)
> +{
> + return eventsel_match_perf_hw_id(pmc, PERF_COUNT_HW_INSTRUCTIONS) ||
> + eventsel_match_perf_hw_id(pmc,
> + PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
> +}
> +
> +static inline bool quirky_pmc_will_count_vmrun(struct kvm_pmc *pmc)
> +{
> + return event_is_branch_instruction(pmc) && event_is_allowed(pmc) &&
> + !static_call(kvm_x86_get_cpl)(pmc->vcpu);

Wait, really? VMRUN is counted if and only if it enters to a CPL0 guest? Can
someone from AMD confirm this? I was going to say we should just treat this as
"normal" behavior, but counting CPL0 but not CPL>0 is definitely quirky.

2023-04-07 08:30:27

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On 7/4/2023 10:18 am, Sean Christopherson wrote:
> On Fri, Mar 10, 2023, Like Xu wrote:
>> From: Like Xu <[email protected]>
>>
>> When AMD guest is counting (branch) instructions event, its vPMU should
>> first subtract one for any relevant (branch)-instructions enabled counter
>> (when it precedes VMRUN and cannot be preempted) to offset the inevitable
>> plus-one effect of the VMRUN instruction immediately follows.
>>
>> Based on a number of micro observations (also the reason why x86_64/
>> pmu_event_filter_test fails on AMD Zen platforms), each VMRUN will
>> increment all hw-(branch)-instructions counters by 1, even if they are
>> only enabled for guest code. This issue seriously affects the performance
>> understanding of guest developers based on (branch) instruction events.
>>
>> If the current physical register value on the hardware is ~0x0, it triggers
>> an overflow in the guest world right after running VMRUN. Although this
>> cannot be avoided on mainstream released hardware, the resulting PMI
>> (if configured) will not be incorrectly injected into the guest by vPMU,
>> since the delayed injection mechanism for a normal counter overflow
>> depends only on the change of pmc->counter values.
>
> IIUC, this is saying that KVM may get a spurious PMI, but otherwise nothing bad
> will happen?

Guests will have nothing to lose, except gaining vPMI accuracy under this proposal.

When a host gets an overflow interrupt caused by a VMRUN, it forwards it to KVM.
KVM does not inject it into the VM, but discards it. For those using PMU to
profiling
the hypervisor itself, they lose an interrupt or a sample on VMRUN context.

>
>> +static inline bool event_is_branch_instruction(struct kvm_pmc *pmc)
>> +{
>> + return eventsel_match_perf_hw_id(pmc, PERF_COUNT_HW_INSTRUCTIONS) ||
>> + eventsel_match_perf_hw_id(pmc,
>> + PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
>> +}
>> +
>> +static inline bool quirky_pmc_will_count_vmrun(struct kvm_pmc *pmc)
>> +{
>> + return event_is_branch_instruction(pmc) && event_is_allowed(pmc) &&
>> + !static_call(kvm_x86_get_cpl)(pmc->vcpu);
>
> Wait, really? VMRUN is counted if and only if it enters to a CPL0 guest? Can
> someone from AMD confirm this? I was going to say we should just treat this as
> "normal" behavior, but counting CPL0 but not CPL>0 is definitely quirky.

VMRUN is only counted on a CPL0-target (branch) instruction counter. The VMRUN
is not expected to be counted by the guest counters, regardless of the guest CPL.

This issue makes a guest CPL0-target instruction counter inexplicably increase,
as if it
would have been under-counted before the virtualization instructions were counted.

Treating the host hypervisor instructions like VMRUN as guest workload instructions
is already an error in itself not "normal" behavior that affects guest accuracy.

2023-04-07 15:00:02

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Fri, Apr 07, 2023, Like Xu wrote:
> On 7/4/2023 10:18 am, Sean Christopherson wrote:
> > Wait, really? VMRUN is counted if and only if it enters to a CPL0 guest? Can
> > someone from AMD confirm this? I was going to say we should just treat this as
> > "normal" behavior, but counting CPL0 but not CPL>0 is definitely quirky.
>
> VMRUN is only counted on a CPL0-target (branch) instruction counter.

Yes or no question: if KVM does VMRUN and a PMC is programmed to count _all_ taken
branches, will the PMC count VMRUN as a branch if guest CPL>0 according to the VMCB?

> This issue makes a guest CPL0-target instruction counter inexplicably
> increase, as if it would have been under-counted before the virtualization
> instructions were counted.

Heh, it's very much explicable, it's just not desirable, and you and I would argue
that it's also incorrect.

AMD folks, are there plans to document this as an erratum? I agree with Like that
counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
is known/expected.

2023-04-19 13:45:55

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On 7/4/2023 10:56 pm, Sean Christopherson wrote:
> On Fri, Apr 07, 2023, Like Xu wrote:
>> On 7/4/2023 10:18 am, Sean Christopherson wrote:
>>> Wait, really? VMRUN is counted if and only if it enters to a CPL0 guest? Can
>>> someone from AMD confirm this? I was going to say we should just treat this as
>>> "normal" behavior, but counting CPL0 but not CPL>0 is definitely quirky.
>>
>> VMRUN is only counted on a CPL0-target (branch) instruction counter.
>
> Yes or no question: if KVM does VMRUN and a PMC is programmed to count _all_ taken
> branches, will the PMC count VMRUN as a branch if guest CPL>0 according to the VMCB?

YES, my quick tests (based on run_in_user() from KUT on Zen4) show:

EVENTSEL_GUESTONLY + EVENTSEL_ALL + VMRUN_to_USR -> AMD_ZEN_BR_RETIRED + 1
EVENTSEL_GUESTONLY + EVENTSEL_ALL + VMRUN_to_OS -> AMD_ZEN_BR_RETIRED + 1

EVENTSEL_GUESTONLY + EVENTSEL_USR + VMRUN_to_USR -> AMD_ZEN_BR_RETIRED + 1
EVENTSEL_GUESTONLY + EVENTSEL_OS + VMRUN_to_OS -> AMD_ZEN_BR_RETIRED + 1

VENTSEL_GUESTONLY + EVENTSEL_OS + VMRUN_to_USR -> No change
VENTSEL_GUESTONLY + EVENTSEL_USR + VMRUN_to_OS -> No change

I'm actually not surprised and related test would be posted later.

>
>> This issue makes a guest CPL0-target instruction counter inexplicably
>> increase, as if it would have been under-counted before the virtualization
>> instructions were counted.
>
> Heh, it's very much explicable, it's just not desirable, and you and I would argue
> that it's also incorrect.

This is completely inaccurate from the end guest pmu user's perspective.

I have a toy that looks like virtio-pmu, through which guest users can get
hypervisor performance data.
But the side effect of letting the guest see the VMRUN instruction by default is
unacceptable, isn't it ?

>
> AMD folks, are there plans to document this as an erratum? I agree with Like that
> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
> is known/expected.

+CC: Santosh, Tom, Ananth

2023-04-26 05:33:52

by Sandipan Das

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

Hi Sean, Like,

On 4/19/2023 7:11 PM, Like Xu wrote:
> On 7/4/2023 10:56 pm, Sean Christopherson wrote:
>> On Fri, Apr 07, 2023, Like Xu wrote:
>>> On 7/4/2023 10:18 am, Sean Christopherson wrote:
>>>> Wait, really? VMRUN is counted if and only if it enters to a CPL0 guest? Can
>>>> someone from AMD confirm this? I was going to say we should just treat this as
>>>> "normal" behavior, but counting CPL0 but not CPL>0 is definitely quirky.
>>>
>>> VMRUN is only counted on a CPL0-target (branch) instruction counter.
>>
>> Yes or no question: if KVM does VMRUN and a PMC is programmed to count _all_ taken
>> branches, will the PMC count VMRUN as a branch if guest CPL>0 according to the VMCB?
>
> YES, my quick tests (based on run_in_user() from KUT on Zen4) show:
>
> EVENTSEL_GUESTONLY + EVENTSEL_ALL + VMRUN_to_USR -> AMD_ZEN_BR_RETIRED + 1
> EVENTSEL_GUESTONLY + EVENTSEL_ALL + VMRUN_to_OS -> AMD_ZEN_BR_RETIRED + 1
>
> EVENTSEL_GUESTONLY + EVENTSEL_USR + VMRUN_to_USR -> AMD_ZEN_BR_RETIRED + 1
> EVENTSEL_GUESTONLY + EVENTSEL_OS + VMRUN_to_OS -> AMD_ZEN_BR_RETIRED + 1
>
> VENTSEL_GUESTONLY + EVENTSEL_OS + VMRUN_to_USR -> No change
> VENTSEL_GUESTONLY + EVENTSEL_USR + VMRUN_to_OS -> No change
>
> I'm actually not surprised and related test would be posted later.
>
>>
>>> This issue makes a guest CPL0-target instruction counter inexplicably
>>> increase, as if it would have been under-counted before the virtualization
>>> instructions were counted.
>>
>> Heh, it's very much explicable, it's just not desirable, and you and I would argue
>> that it's also incorrect.
>
> This is completely inaccurate from the end guest pmu user's perspective.
>
> I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
> But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
>
>>
>> AMD folks, are there plans to document this as an erratum? I agree with Like that
>> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
>> is known/expected.
>

This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
release of the APM will include additional details like the following:

1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
transfers and VMEXITs as exceptions.

2) When the performance monitoring counters are set up to count events only in certain modes
through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
since the target is in the host, will not cause an increment of the counter and so the total
count will end up correct.

2023-04-26 06:27:29

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On 26/4/2023 1:25 pm, Sandipan Das wrote:
> Hi Sean, Like,
>
> On 4/19/2023 7:11 PM, Like Xu wrote:
>> On 7/4/2023 10:56 pm, Sean Christopherson wrote:
>>> On Fri, Apr 07, 2023, Like Xu wrote:
>>>> On 7/4/2023 10:18 am, Sean Christopherson wrote:
>>>>> Wait, really? VMRUN is counted if and only if it enters to a CPL0 guest? Can
>>>>> someone from AMD confirm this? I was going to say we should just treat this as
>>>>> "normal" behavior, but counting CPL0 but not CPL>0 is definitely quirky.
>>>>
>>>> VMRUN is only counted on a CPL0-target (branch) instruction counter.
>>>
>>> Yes or no question: if KVM does VMRUN and a PMC is programmed to count _all_ taken
>>> branches, will the PMC count VMRUN as a branch if guest CPL>0 according to the VMCB?
>>
>> YES, my quick tests (based on run_in_user() from KUT on Zen4) show:
>>
>> EVENTSEL_GUESTONLY + EVENTSEL_ALL + VMRUN_to_USR -> AMD_ZEN_BR_RETIRED + 1
>> EVENTSEL_GUESTONLY + EVENTSEL_ALL + VMRUN_to_OS -> AMD_ZEN_BR_RETIRED + 1
>>
>> EVENTSEL_GUESTONLY + EVENTSEL_USR + VMRUN_to_USR -> AMD_ZEN_BR_RETIRED + 1
>> EVENTSEL_GUESTONLY + EVENTSEL_OS + VMRUN_to_OS -> AMD_ZEN_BR_RETIRED + 1
>>
>> VENTSEL_GUESTONLY + EVENTSEL_OS + VMRUN_to_USR -> No change
>> VENTSEL_GUESTONLY + EVENTSEL_USR + VMRUN_to_OS -> No change
>>
>> I'm actually not surprised and related test would be posted later.
>>
>>>
>>>> This issue makes a guest CPL0-target instruction counter inexplicably
>>>> increase, as if it would have been under-counted before the virtualization
>>>> instructions were counted.
>>>
>>> Heh, it's very much explicable, it's just not desirable, and you and I would argue
>>> that it's also incorrect.
>>
>> This is completely inaccurate from the end guest pmu user's perspective.
>>
>> I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
>> But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
>>
>>>
>>> AMD folks, are there plans to document this as an erratum? I agree with Like that
>>> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
>>> is known/expected.
>>
>
> This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
> release of the APM will include additional details like the following:
>
> 1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
> transfers and VMEXITs as exceptions.
>
> 2) When the performance monitoring counters are set up to count events only in certain modes
> through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
> mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
> counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
> the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
> counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
> far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
> instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
> since the target is in the host, will not cause an increment of the counter and so the total
> count will end up correct.
>

Thanks for the clarification, that fits my understanding.

"Calculated in target mode" and "correct total count" are architectural choices,
which is not a problem if the consumers of PMU data are on the same side.

But for a VM user, seeing SYSRET in the user mode is completely and functionally
different from seeing VMRUN in the guest context. Since the host user and
the guest user are two separate pmu data consumers, and they do not aggregate
or share the so-called "total" PMU data.

This situation is even worse for nested SVM guests and SEV-SNP guests.

I'm not urging that AMD hardware should change, but it is entirely necessary
for our software layer to take this step, as it is part of the hypervisor's
responsibility
to hide itself by default.

2023-05-24 20:52:21

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Wed, Apr 26, 2023, Sandipan Das wrote:
> Hi Sean, Like,
>
> On 4/19/2023 7:11 PM, Like Xu wrote:
> >> Heh, it's very much explicable, it's just not desirable, and you and I would argue
> >> that it's also incorrect.
> >
> > This is completely inaccurate from the end guest pmu user's perspective.
> >
> > I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
> > But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
> >
> >>
> >> AMD folks, are there plans to document this as an erratum?� I agree with Like that
> >> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
> >> is known/expected.
> >
>
> This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
> release of the APM will include additional details like the following:
>
> 1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
> transfers and VMEXITs as exceptions.
>
> 2) When the performance monitoring counters are set up to count events only in certain modes
> through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
> mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
> counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
> the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
> counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
> far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
> instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
> since the target is in the host, will not cause an increment of the counter and so the total
> count will end up correct.

The count from the guest's perspective does not "end up correct". Unlike SYSCALL,
where _userspace_ deliberately and synchronously executes a branch instruction,
VMEXIT and VMRUN are supposed to be transparent to the guest and can be completely
asynchronous with respect to guest code execution, e.g. if the host is spamming
IRQs, the guest will see a potentially large number of bogus (from it's perspective)
branches retired.

2023-05-24 21:02:45

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Wed, May 24, 2023 at 1:41 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Apr 26, 2023, Sandipan Das wrote:
> > Hi Sean, Like,
> >
> > On 4/19/2023 7:11 PM, Like Xu wrote:
> > >> Heh, it's very much explicable, it's just not desirable, and you and I would argue
> > >> that it's also incorrect.
> > >
> > > This is completely inaccurate from the end guest pmu user's perspective.
> > >
> > > I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
> > > But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
> > >
> > >>
> > >> AMD folks, are there plans to document this as an erratum?ï¿½ I agree with Like that
> > >> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
> > >> is known/expected.
> > >
> >
> > This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
> > release of the APM will include additional details like the following:
> >
> > 1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
> > transfers and VMEXITs as exceptions.
> >
> > 2) When the performance monitoring counters are set up to count events only in certain modes
> > through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
> > mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
> > counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
> > the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
> > counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
> > far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
> > instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
> > since the target is in the host, will not cause an increment of the counter and so the total
> > count will end up correct.
>
> The count from the guest's perspective does not "end up correct". Unlike SYSCALL,
> where _userspace_ deliberately and synchronously executes a branch instruction,
> VMEXIT and VMRUN are supposed to be transparent to the guest and can be completely
> asynchronous with respect to guest code execution, e.g. if the host is spamming
> IRQs, the guest will see a potentially large number of bogus (from it's perspective)
> branches retired.

The reverse problem occurs when a PMC is configured to count "CPUID
instructions retired." Since KVM intercepts CPUID and emulates it, the
PMC will always read 0, even if the guest executes a tight loop of
CPUID instructions.

The PMU is not virtualizable on AMD CPUs without significant
hypervisor corrections. I have to wonder if it's really worth the
effort.

2023-05-24 21:04:56

[permalink] [raw]

Subject: Re: [PATCH 2/5] KVM: x86/pmu: Add a helper to check if pmc has PEBS mode enabled

On Fri, Mar 10, 2023, Like Xu wrote:
> From: Like Xu <[email protected]>
>
> Add a helper to check if pmc has PEBS mode enabled so that more new
> code may reuse this part and opportunistically drop a pmu reference.
>
> No functional change intended.
>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/kvm/pmu.c | 3 +--
> arch/x86/kvm/pmu.h | 7 +++++++
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index d1c89a6625a0..01a6b7ffa9b1 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -191,7 +191,6 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
> bool exclude_user, bool exclude_kernel,
> bool intr)
> {
> - struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> struct perf_event *event;
> struct perf_event_attr attr = {
> .type = type,
> @@ -203,7 +202,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
> .exclude_kernel = exclude_kernel,
> .config = config,
> };
> - bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
> + bool pebs = pebs_is_enabled(pmc);
>
> attr.sample_period = get_sample_period(pmc, pmc->counter);
>
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index cff0651b030b..db4262fe8814 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -189,6 +189,13 @@ static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> }
>
> +static inline bool pebs_is_enabled(struct kvm_pmc *pmc)

pebs_is_enabled() is a bit too generic, e.g. at a glance I would expect it to return
true if PEBS as a whole is enabled. What about something like pmc_is_pebs() or
pmc_is_pebs_enabled()?

2023-05-24 21:07:37

[permalink] [raw]

Subject: Re: [PATCH 3/5] KVM: x86/pmu: Move the overflow of a normal counter out of PMI context

On Fri, Mar 10, 2023, Like Xu wrote:
> From: Like Xu <[email protected]>
>
> >From the guest's point of view, vPMU's global_status bit update following
> a counter overflow is completely independent of whether it is emulated
> in the host PMI context. The guest counter overflow emulation only depends
> on whether pmc->counter has an overflow or not. Plus the counter overflow
> generated by the emulation instruction has been delayed and not been
> handled in the PMI context. This part of the logic can be unified by
> reusing pmc->prev_counter for a normal counter.

I've asked many times. Please write changelogs that state what the patch actually
does, not what "can" be done. The other patches in this series have similar problems,
e.g. desribe the state _after_ the patch is applied, not what the patch does.

IIUC, this is effectively deferring injection to kvm_pmu_handle_event() and
kvm_pmu_handle_pmc_overflow(). The changelog says nothing of the sort though.

> However for a PEBS counter, its buffer overflow irq still requires hardware to
> trigger PMI.

I didn't follow this. Hardware isn't triggering the PMI the guest sees, that's
still coming from KVM. Are you saying that KVM doesn't have a way to detect
overflow for PEBS counters, i.e. would drop the PMI as the hardware PMI is the
only notification KVM gets?

2023-05-24 21:31:36

[permalink] [raw]

Subject: Re: [PATCH 4/5] KVM: x86/pmu: Reorder functions to reduce unnecessary declarations

On Fri, Mar 10, 2023, Like Xu wrote:
> From: Like Xu <[email protected]>
>
> Considering that more emulations are deferred to kvm_pmu_handle_event(),
> moving it to the end of pmu.c makes it easier to call previous functions,
> instead of just piling up the function declarations to make compiler green.

kvm_pmu_handle_event() is globally visible, moving it around changes nothing.

As for using it in kvm_mark_pmc_is_quirky(), explicitly state the direct
motivation for moving kvm_pmu_request_counter_reprogram(), i.e. that it's being
hoisted above pmc_read_counter() for use in a future patch. Using abstract
language might sound pretty and dramatic, but it's really not helpful for reviewers.

> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index db4262fe8814..a47b579667c6 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -48,6 +48,12 @@ static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
> return pmu->counter_bitmask[pmc->type];
> }
>
> +static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> +{
> + set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> + kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> +}
> +
> static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> {
> u64 counter, enabled, running;
> @@ -183,12 +189,6 @@ static inline void kvm_init_pmu_capability(const struct kvm_pmu_ops *pmu_ops)
> KVM_PMC_MAX_FIXED);
> }
>
> -static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> -{
> - set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> - kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> -}
> -
> static inline bool pebs_is_enabled(struct kvm_pmc *pmc)
> {
> struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> --
> 2.39.2
>

2023-05-24 21:31:49

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Fri, Mar 10, 2023, Like Xu wrote:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index adb92fc4d7c9..d6fcbf233cb3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -561,6 +561,10 @@ struct kvm_pmu {
> */
> u64 host_cross_mapped_mask;
>
> + /* Flags to track any HW quirks that need to be fixed by vPMU. */
> + u64 quirk_flags;
> + DECLARE_BITMAP(hide_vmrun_pmc_idx, X86_PMC_IDX_MAX);

Since it sounds like AMD isn't changing the behavior, let's forego the quirk and
just hardcode the fixup.

> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 2a0504732966..315dca021d57 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -254,6 +254,7 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
> counter += perf_event_pause(pmc->perf_event, true);
> pmc->counter = counter & pmc_bitmask(pmc);
> pmc->is_paused = true;
> + kvm_mark_pmc_is_quirky(pmc);
> }
>
> static bool pmc_resume_counter(struct kvm_pmc *pmc)
> @@ -822,6 +823,19 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
> return r;
> }
>
> +static inline bool event_is_branch_instruction(struct kvm_pmc *pmc)

How about pmc_is_counting_branches()? The "event" itself isn't a branch
instruction.

> +{
> + return eventsel_match_perf_hw_id(pmc, PERF_COUNT_HW_INSTRUCTIONS) ||
> + eventsel_match_perf_hw_id(pmc,
> + PERF_COUNT_HW_BRANCH_INSTRUCTIONS);

Let this poke out.

> +}
> +
> +static inline bool quirky_pmc_will_count_vmrun(struct kvm_pmc *pmc)
> +{
> + return event_is_branch_instruction(pmc) && event_is_allowed(pmc) &&
> + !static_call(kvm_x86_get_cpl)(pmc->vcpu);
> +}
> +
> void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> {
> struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> @@ -837,6 +851,10 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
>
> reprogram_counter(pmc);
> kvm_pmu_handle_pmc_overflow(pmc);
> +
> + if (vcpu_has_pmu_quirks(vcpu) &&
> + quirky_pmc_will_count_vmrun(pmc))
> + set_bit(pmc->idx, pmu->hide_vmrun_pmc_idx);

Doesn't this need to adjust the count _before_ handling overflow? I.e. isn't it
possible for the bogus counts to cause bogus overflow?

> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index a47b579667c6..30f6f58f4c38 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -18,6 +18,9 @@
> #define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
> #define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
>
> +#define X86_PMU_COUNT_VMRUN BIT_ULL(0)
> +#define X86_PMU_QUIRKS_MASK X86_PMU_COUNT_VMRUN
> +
> struct kvm_pmu_ops {
> bool (*hw_event_available)(struct kvm_pmc *pmc);
> bool (*pmc_is_enabled)(struct kvm_pmc *pmc);
> @@ -54,14 +57,33 @@ static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> }
>
> +static inline bool vcpu_has_pmu_quirks(struct kvm_vcpu *vcpu)
> +{
> + return vcpu_to_pmu(vcpu)->quirk_flags & X86_PMU_QUIRKS_MASK;
> +}
> +
> +/*
> + * The time to mark pmc is when the accumulation value returned
> + * by perf API based on a HW counter has just taken effect.
> + */
> +static inline void kvm_mark_pmc_is_quirky(struct kvm_pmc *pmc)
> +{
> + if (!vcpu_has_pmu_quirks(pmc->vcpu))
> + return;
> +
> + kvm_pmu_request_counter_reprogram(pmc);
> +}
> +
> static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> {
> u64 counter, enabled, running;
>
> counter = pmc->counter;
> - if (pmc->perf_event && !pmc->is_paused)
> + if (pmc->perf_event && !pmc->is_paused) {
> counter += perf_event_read_value(pmc->perf_event,
> &enabled, &running);
> + kvm_mark_pmc_is_quirky(pmc);
> + }
> /* FIXME: Scaling needed? */
> return counter & pmc_bitmask(pmc);
> }
> diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> index 5fa939e411d8..130991a97f22 100644
> --- a/arch/x86/kvm/svm/pmu.c
> +++ b/arch/x86/kvm/svm/pmu.c
> @@ -187,6 +187,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> pmu->nr_arch_fixed_counters = 0;
> pmu->global_status = 0;
> bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
> + pmu->quirk_flags |= X86_PMU_COUNT_VMRUN;
> }
>
> static void amd_pmu_init(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index f41d96e638ef..f6b33d172481 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3919,6 +3919,31 @@ static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
> return EXIT_FASTPATH_NONE;
> }
>
> +static void pmu_hide_vmrun(struct kvm_vcpu *vcpu)

This needs to be noinstr.

> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc;
> + unsigned int i;
> +
> + for_each_set_bit(i, pmu->hide_vmrun_pmc_idx, X86_PMC_IDX_MAX) {
> + clear_bit(i, pmu->hide_vmrun_pmc_idx);

Clearing the bit will hide only the first VMRUN after the guest attempts to read
the counter, no? The fixup needs to apply to every VMRUN that is executed after
the PMC is programmed. Or am I misreading the patch?

> +
> + /* AMD doesn't have fixed counters at now. */
> + if (i >= pmu->nr_arch_gp_counters)
> + continue;
> +
> + /*
> + * The prerequisite for fixing HW quirks is that there is indeed
> + * HW working and perf has no chance to retrieve the counter.

I don't follow the "perf has no chance to retrieve the counter" part.

> + */
> + pmc = &pmu->gp_counters[i];
> + if (!pmc->perf_event || pmc->perf_event->hw.idx < 0)
> + continue;
> +
> + pmc->counter--;
> + }
> +}
> +
> static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_intercepted)
> {
> struct vcpu_svm *svm = to_svm(vcpu);
> @@ -3986,6 +4011,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>
> kvm_wait_lapic_expire(vcpu);
>
> + if (vcpu->kvm->arch.enable_pmu && vcpu_has_pmu_quirks(vcpu))
> + pmu_hide_vmrun(vcpu);
> +
> /*
> * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> * it's non-zero. Since vmentry is serialising on affected CPUs, there
> --
> 2.39.2
>

2023-05-24 21:32:37

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Wed, May 24, 2023, Jim Mattson wrote:
> On Wed, May 24, 2023 at 1:41 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Apr 26, 2023, Sandipan Das wrote:
> > > Hi Sean, Like,
> > >
> > > On 4/19/2023 7:11 PM, Like Xu wrote:
> > > >> Heh, it's very much explicable, it's just not desirable, and you and I would argue
> > > >> that it's also incorrect.
> > > >
> > > > This is completely inaccurate from the end guest pmu user's perspective.
> > > >
> > > > I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
> > > > But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
> > > >
> > > >>
> > > >> AMD folks, are there plans to document this as an erratum?ï¿½ I agree with Like that
> > > >> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
> > > >> is known/expected.
> > > >
> > >
> > > This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
> > > release of the APM will include additional details like the following:
> > >
> > > 1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
> > > transfers and VMEXITs as exceptions.
> > >
> > > 2) When the performance monitoring counters are set up to count events only in certain modes
> > > through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
> > > mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
> > > counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
> > > the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
> > > counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
> > > far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
> > > instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
> > > since the target is in the host, will not cause an increment of the counter and so the total
> > > count will end up correct.
> >
> > The count from the guest's perspective does not "end up correct". Unlike SYSCALL,
> > where _userspace_ deliberately and synchronously executes a branch instruction,
> > VMEXIT and VMRUN are supposed to be transparent to the guest and can be completely
> > asynchronous with respect to guest code execution, e.g. if the host is spamming
> > IRQs, the guest will see a potentially large number of bogus (from it's perspective)
> > branches retired.
>
> The reverse problem occurs when a PMC is configured to count "CPUID
> instructions retired." Since KVM intercepts CPUID and emulates it, the
> PMC will always read 0, even if the guest executes a tight loop of
> CPUID instructions.
>
> The PMU is not virtualizable on AMD CPUs without significant
> hypervisor corrections. I have to wonder if it's really worth the
> effort.

Per our offlist chat, my understanding is that there are caveats with vPMUs that
it's simply not feasible for a hypervisor to handle. I.e. virtualizing any x86
PMU with 100% accuracy isn't happening anytime soon.

The way forward is likely to evaluate each caveat on a case-by-case basis to
determine whether or not the cost of the fixup in KVM is worth the benefit to
the guest. E.g. emulating "CPUID instructions retired" seems like it would be
fairly straightforward. AFAICT, fixing up the VMRUN stuff is quite difficult though.

2023-05-24 21:36:59

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Wed, May 24, 2023 at 2:23 PM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Mar 10, 2023, Like Xu wrote:
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index adb92fc4d7c9..d6fcbf233cb3 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -561,6 +561,10 @@ struct kvm_pmu {
> > */
> > u64 host_cross_mapped_mask;
> >
> > + /* Flags to track any HW quirks that need to be fixed by vPMU. */
> > + u64 quirk_flags;
> > + DECLARE_BITMAP(hide_vmrun_pmc_idx, X86_PMC_IDX_MAX);
>
> Since it sounds like AMD isn't changing the behavior, let's forego the quirk and
> just hardcode the fixup.
>
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index 2a0504732966..315dca021d57 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> > @@ -254,6 +254,7 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
> > counter += perf_event_pause(pmc->perf_event, true);
> > pmc->counter = counter & pmc_bitmask(pmc);
> > pmc->is_paused = true;
> > + kvm_mark_pmc_is_quirky(pmc);
> > }
> >
> > static bool pmc_resume_counter(struct kvm_pmc *pmc)
> > @@ -822,6 +823,19 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
> > return r;
> > }
> >
> > +static inline bool event_is_branch_instruction(struct kvm_pmc *pmc)
>
> How about pmc_is_counting_branches()? The "event" itself isn't a branch
> instruction.

Note that there's a bug in the original code for this that has
probably never been fixed: it ignores CMASK and INV in the PerfEvtSel.

> > +{
> > + return eventsel_match_perf_hw_id(pmc, PERF_COUNT_HW_INSTRUCTIONS) ||
> > + eventsel_match_perf_hw_id(pmc,
> > + PERF_COUNT_HW_BRANCH_INSTRUCTIONS);
>
> Let this poke out.
>
> > +}
> > +
> > +static inline bool quirky_pmc_will_count_vmrun(struct kvm_pmc *pmc)
> > +{
> > + return event_is_branch_instruction(pmc) && event_is_allowed(pmc) &&
> > + !static_call(kvm_x86_get_cpl)(pmc->vcpu);
> > +}
> > +
> > void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> > {
> > struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > @@ -837,6 +851,10 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> >
> > reprogram_counter(pmc);
> > kvm_pmu_handle_pmc_overflow(pmc);
> > +
> > + if (vcpu_has_pmu_quirks(vcpu) &&
> > + quirky_pmc_will_count_vmrun(pmc))
> > + set_bit(pmc->idx, pmu->hide_vmrun_pmc_idx);
>
> Doesn't this need to adjust the count _before_ handling overflow? I.e. isn't it
> possible for the bogus counts to cause bogus overflow?
>
> > diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> > index a47b579667c6..30f6f58f4c38 100644
> > --- a/arch/x86/kvm/pmu.h
> > +++ b/arch/x86/kvm/pmu.h
> > @@ -18,6 +18,9 @@
> > #define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
> > #define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
> >
> > +#define X86_PMU_COUNT_VMRUN BIT_ULL(0)
> > +#define X86_PMU_QUIRKS_MASK X86_PMU_COUNT_VMRUN
> > +
> > struct kvm_pmu_ops {
> > bool (*hw_event_available)(struct kvm_pmc *pmc);
> > bool (*pmc_is_enabled)(struct kvm_pmc *pmc);
> > @@ -54,14 +57,33 @@ static inline void kvm_pmu_request_counter_reprogram(struct kvm_pmc *pmc)
> > kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> > }
> >
> > +static inline bool vcpu_has_pmu_quirks(struct kvm_vcpu *vcpu)
> > +{
> > + return vcpu_to_pmu(vcpu)->quirk_flags & X86_PMU_QUIRKS_MASK;
> > +}
> > +
> > +/*
> > + * The time to mark pmc is when the accumulation value returned
> > + * by perf API based on a HW counter has just taken effect.
> > + */
> > +static inline void kvm_mark_pmc_is_quirky(struct kvm_pmc *pmc)
> > +{
> > + if (!vcpu_has_pmu_quirks(pmc->vcpu))
> > + return;
> > +
> > + kvm_pmu_request_counter_reprogram(pmc);
> > +}
> > +
> > static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
> > {
> > u64 counter, enabled, running;
> >
> > counter = pmc->counter;
> > - if (pmc->perf_event && !pmc->is_paused)
> > + if (pmc->perf_event && !pmc->is_paused) {
> > counter += perf_event_read_value(pmc->perf_event,
> > &enabled, &running);
> > + kvm_mark_pmc_is_quirky(pmc);
> > + }
> > /* FIXME: Scaling needed? */
> > return counter & pmc_bitmask(pmc);
> > }
> > diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
> > index 5fa939e411d8..130991a97f22 100644
> > --- a/arch/x86/kvm/svm/pmu.c
> > +++ b/arch/x86/kvm/svm/pmu.c
> > @@ -187,6 +187,7 @@ static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
> > pmu->nr_arch_fixed_counters = 0;
> > pmu->global_status = 0;
> > bitmap_set(pmu->all_valid_pmc_idx, 0, pmu->nr_arch_gp_counters);
> > + pmu->quirk_flags |= X86_PMU_COUNT_VMRUN;
> > }
> >
> > static void amd_pmu_init(struct kvm_vcpu *vcpu)
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index f41d96e638ef..f6b33d172481 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -3919,6 +3919,31 @@ static fastpath_t svm_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
> > return EXIT_FASTPATH_NONE;
> > }
> >
> > +static void pmu_hide_vmrun(struct kvm_vcpu *vcpu)
>
> This needs to be noinstr.
>
> > +{
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct kvm_pmc *pmc;
> > + unsigned int i;
> > +
> > + for_each_set_bit(i, pmu->hide_vmrun_pmc_idx, X86_PMC_IDX_MAX) {
> > + clear_bit(i, pmu->hide_vmrun_pmc_idx);
>
> Clearing the bit will hide only the first VMRUN after the guest attempts to read
> the counter, no? The fixup needs to apply to every VMRUN that is executed after
> the PMC is programmed. Or am I misreading the patch?
>
> > +
> > + /* AMD doesn't have fixed counters at now. */
> > + if (i >= pmu->nr_arch_gp_counters)
> > + continue;
> > +
> > + /*
> > + * The prerequisite for fixing HW quirks is that there is indeed
> > + * HW working and perf has no chance to retrieve the counter.
>
> I don't follow the "perf has no chance to retrieve the counter" part.
>
> > + */
> > + pmc = &pmu->gp_counters[i];
> > + if (!pmc->perf_event || pmc->perf_event->hw.idx < 0)
> > + continue;
> > +
> > + pmc->counter--;
> > + }
> > +}
> > +
> > static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu, bool spec_ctrl_intercepted)
> > {
> > struct vcpu_svm *svm = to_svm(vcpu);
> > @@ -3986,6 +4011,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
> >
> > kvm_wait_lapic_expire(vcpu);
> >
> > + if (vcpu->kvm->arch.enable_pmu && vcpu_has_pmu_quirks(vcpu))
> > + pmu_hide_vmrun(vcpu);
> > +
> > /*
> > * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> > * it's non-zero. Since vmentry is serialising on affected CPUs, there
> > --
> > 2.39.2
> >

2023-05-24 21:59:44

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On Wed, May 24, 2023 at 2:29 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, May 24, 2023, Jim Mattson wrote:
> > On Wed, May 24, 2023 at 1:41 PM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Wed, Apr 26, 2023, Sandipan Das wrote:
> > > > Hi Sean, Like,
> > > >
> > > > On 4/19/2023 7:11 PM, Like Xu wrote:
> > > > >> Heh, it's very much explicable, it's just not desirable, and you and I would argue
> > > > >> that it's also incorrect.
> > > > >
> > > > > This is completely inaccurate from the end guest pmu user's perspective.
> > > > >
> > > > > I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
> > > > > But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
> > > > >
> > > > >>
> > > > >> AMD folks, are there plans to document this as an erratum?ï¿½ I agree with Like that
> > > > >> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
> > > > >> is known/expected.
> > > > >
> > > >
> > > > This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
> > > > release of the APM will include additional details like the following:
> > > >
> > > > 1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
> > > > transfers and VMEXITs as exceptions.
> > > >
> > > > 2) When the performance monitoring counters are set up to count events only in certain modes
> > > > through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
> > > > mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
> > > > counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
> > > > the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
> > > > counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
> > > > far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
> > > > instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
> > > > since the target is in the host, will not cause an increment of the counter and so the total
> > > > count will end up correct.
> > >
> > > The count from the guest's perspective does not "end up correct". Unlike SYSCALL,
> > > where _userspace_ deliberately and synchronously executes a branch instruction,
> > > VMEXIT and VMRUN are supposed to be transparent to the guest and can be completely
> > > asynchronous with respect to guest code execution, e.g. if the host is spamming
> > > IRQs, the guest will see a potentially large number of bogus (from it's perspective)
> > > branches retired.
> >
> > The reverse problem occurs when a PMC is configured to count "CPUID
> > instructions retired." Since KVM intercepts CPUID and emulates it, the
> > PMC will always read 0, even if the guest executes a tight loop of
> > CPUID instructions.
> >
> > The PMU is not virtualizable on AMD CPUs without significant
> > hypervisor corrections. I have to wonder if it's really worth the
> > effort.
>
> Per our offlist chat, my understanding is that there are caveats with vPMUs that
> it's simply not feasible for a hypervisor to handle. I.e. virtualizing any x86
> PMU with 100% accuracy isn't happening anytime soon.
>
> The way forward is likely to evaluate each caveat on a case-by-case basis to
> determine whether or not the cost of the fixup in KVM is worth the benefit to
> the guest. E.g. emulating "CPUID instructions retired" seems like it would be
> fairly straightforward. AFAICT, fixing up the VMRUN stuff is quite difficult though.

Yeah. The problem with fixing up "CPUID instructions retired" is
tracking what the event encoding is for every F/M/S out there. It's
not worth it.

2023-05-29 14:48:30

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On 25/5/2023 5:30 am, Jim Mattson wrote:
> Note that there's a bug in the original code for this that has
> probably never been fixed: it ignores CMASK and INV in the PerfEvtSel.

Regarding the emulation of CMASK, INV and PIN_CONTROL bits on vPMU,
please forgive me for never having the right expectations (so there will be no
correct emulation), an share more *architecture* descriptions of those hw
behavior, and it would be great to have corresponding selftests as a bug report.

2023-05-29 14:56:30

[permalink] [raw]

Subject: Re: [PATCH 5/5] KVM: x86/pmu: Hide guest counter updates from the VMRUN instruction

On 25/5/2023 5:32 am, Jim Mattson wrote:
> On Wed, May 24, 2023 at 2:29 PM Sean Christopherson <[email protected]> wrote:
>>
>> On Wed, May 24, 2023, Jim Mattson wrote:
>>> On Wed, May 24, 2023 at 1:41 PM Sean Christopherson <[email protected]> wrote:
>>>>
>>>> On Wed, Apr 26, 2023, Sandipan Das wrote:
>>>>> Hi Sean, Like,
>>>>>
>>>>> On 4/19/2023 7:11 PM, Like Xu wrote:
>>>>>>> Heh, it's very much explicable, it's just not desirable, and you and I would argue
>>>>>>> that it's also incorrect.
>>>>>>
>>>>>> This is completely inaccurate from the end guest pmu user's perspective.
>>>>>>
>>>>>> I have a toy that looks like virtio-pmu, through which guest users can get hypervisor performance data.
>>>>>> But the side effect of letting the guest see the VMRUN instruction by default is unacceptable, isn't it ?
>>>>>>
>>>>>>>
>>>>>>> AMD folks, are there plans to document this as an erratum?ï¿½ I agree with Like that
>>>>>>> counting VMRUN as a taken branch in guest context is a CPU bug, even if the behavior
>>>>>>> is known/expected.
>>>>>>
>>>>>
>>>>> This behaviour is architectural and an erratum will not be issued. However, for clarity, a future
>>>>> release of the APM will include additional details like the following:
>>>>>
>>>>> 1) From the perspective of performance monitoring counters, VMRUNs are considered as far control
>>>>> transfers and VMEXITs as exceptions.
>>>>>
>>>>> 2) When the performance monitoring counters are set up to count events only in certain modes
>>>>> through the "OsUserMode" and "HostGuestOnly" bits, instructions and events that change the
>>>>> mode are counted in the target mode. For example, a SYSCALL from CPL 3 to CPL 0 with a
>>>>> counter set to count retired instructions with USR=1 and OS=0 will not cause an increment of
>>>>> the counter. However, the SYSRET back from CPL 0 to CPL 3 will cause an increment of the
>>>>> counter and the total count will end up correct. Similarly, when counting PMCx0C6 (retired
>>>>> far control transfers, including exceptions and interrupts) with Guest=1 and Host=0, a VMRUN
>>>>> instruction will cause an increment of the counter. However, the subsequent VMEXIT that occurs,
>>>>> since the target is in the host, will not cause an increment of the counter and so the total
>>>>> count will end up correct.
>>>>
>>>> The count from the guest's perspective does not "end up correct". Unlike SYSCALL,
>>>> where _userspace_ deliberately and synchronously executes a branch instruction,
>>>> VMEXIT and VMRUN are supposed to be transparent to the guest and can be completely
>>>> asynchronous with respect to guest code execution, e.g. if the host is spamming
>>>> IRQs, the guest will see a potentially large number of bogus (from it's perspective)
>>>> branches retired.
>>>
>>> The reverse problem occurs when a PMC is configured to count "CPUID
>>> instructions retired." Since KVM intercepts CPUID and emulates it, the
>>> PMC will always read 0, even if the guest executes a tight loop of
>>> CPUID instructions.

Unlikely. KVM will count any emulated instructions based on kvm_pmu_incr_counter().
Did I miss some conditions ?

>>>
>>> The PMU is not virtualizable on AMD CPUs without significant
>>> hypervisor corrections. I have to wonder if it's really worth the
>>> effort.

I used to think so, until I saw the AMD64_EVENTSEL_GUESTONLY bit.
Hardware architects are expected to put more effort into this area.

>>
>> Per our offlist chat, my understanding is that there are caveats with vPMUs that
>> it's simply not feasible for a hypervisor to handle. I.e. virtualizing any x86
>> PMU with 100% accuracy isn't happening anytime soon.

Indeed, and any more detailed complaints ?

>>
>> The way forward is likely to evaluate each caveat on a case-by-case basis to
>> determine whether or not the cost of the fixup in KVM is worth the benefit to
>> the guest. E.g. emulating "CPUID instructions retired" seems like it would be
>> fairly straightforward. AFAICT, fixing up the VMRUN stuff is quite difficult though.
>
> Yeah. The problem with fixing up "CPUID instructions retired" is
> tracking what the event encoding is for every F/M/S out there. It's
> not worth it.

I don't think it's feasible to emulate 100% accuracy on Intel. For guest pmu
users, it is motivated by wanting to know how effective they are running on
the current pCPU, and any vPMU eimulation behavior that helps this
understanding would be valuable.

2023-05-30 20:14:14