2023-09-27 10:58:51

by Dapeng Mi

Subject: [Patch v4 00/13] Enable fixed counter 3 and topdown perf metrics for vPMU

This is an RFC patchset to enable Intel PMU fixed counter 3 and the topdown
perf metrics feature for the KVM vPMU.

The TopDown Microarchitecture Analysis (TMA) Method is a structured
analysis methodology to identify critical performance bottlenecks in
out-of-order processors. The details about topdown metrics support on
Intel processors can be found in section "Performance Metrics" of Intel's
SDM Volume 3[1]. Kernel enabling code has also been merged, see
patchset[2] to learn more about the feature.

The TMA method is powerful and efficient at helping developers identify
the performance bottlenecks in their programs. It has been integrated into
multiple performance analysis tools, such as perf and VTune, so developers
can easily analyze and improve their program's performance with these
tools. The TMA method is becoming the most widely used performance
analysis method on x86 platforms. It is currently well supported on the
host, but it is still not supported in a guest environment. Because of
the environment differences between host and guest, even the same program
may show different performance bottlenecks in a guest than on the host.
Obviously, the most straightforward and accurate way to profile a guest's
performance bottleneck is to run the TMA method in the guest directly.
Supporting topdown perf metrics in the guest is therefore a real and
important requirement, and we hope this patchset can close this gap.

Like Xu previously posted a patch series to support guest topdown
metrics[3]. That patchset creates a group of topdown metric events in KVM,
bound to fixed counter 3, to obtain the hardware values, and the guest
value of the PERF_METRICS MSR is assembled from the counts of the grouped
metric events.

This patchset improves on Like's proposal: it leverages the mature vPMU
PMC emulation framework and the current perf topdown metric handling logic
to support the guest topdown metrics feature.

In the current perf logic, an event group is required to handle topdown
metrics profiling: the group couples a slots event, which acts as the
group leader, with multiple metric events. To coordinate with the perf
topdown metrics handling logic and reduce the code changes in KVM, we
choose to follow the current mature vPMU PMC emulation framework. The
only difference is that we need to create an event group for fixed
counter 3 and manipulate the FIXED_CTR3 and PERF_METRICS MSRs together,
instead of creating a single event and manipulating only the FIXED_CTR3
MSR.

When the guest writes the PERF_METRICS MSR for the first time, KVM creates
an event group which couples a slots event and a virtual metrics event. In
this event group, the slots event claims the fixed counter 3 hardware
resource and acts as the group leader, as the perf system requires. The
virtual metrics event claims the PERF_METRICS MSR. This event group looks
just like the perf metrics event group on the host and is scheduled by the
host perf system.

In this proposal, the count of the slots event is calculated and emulated
on the host and returned to the guest just like other normal counters, but
the metrics event is processed differently. KVM doesn't calculate the real
count of the topdown metrics; it just stores the raw data of the
PERF_METRICS MSR and directly returns the stored raw data to the guest.
Thus, the guest gets the real hardware PERF_METRICS data, which guarantees
the calculation accuracy of the topdown metrics.
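
For illustration, below is a condensed sketch of the group-creation loop
(simplified from patch 10/13; attribute setup, the delayed enabling and
most of the error handling are omitted, so it is not the exact KVM code):

	/*
	 * Sketch: create the slots + virtual metrics group on fixed
	 * counter 3. The first event created becomes the group leader
	 * (slots); the second one is the virtual metrics event that
	 * claims PERF_METRICS.
	 */
	for (i = 0; i < pmc->max_nr_events; i++) {
		struct perf_event *leader = i ? pmc->perf_event : NULL;
		struct perf_event *event;

		event = perf_event_create_kernel_counter(&attr, -1, current,
							 leader,
							 kvm_perf_overflow,
							 pmc);
		if (IS_ERR(event))
			return PTR_ERR(event);	/* the real code also releases
						 * already-created events */

		pmc->perf_events[i] = event;
		pmc_to_pmu(pmc)->event_count++;
	}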

Compared with Like's patchset, this proposal brings two benefits.

1. Reduces the number of created perf events
Like's patchset needs to create 4 (Ice Lake) or 8 (Sapphire Rapids)
metric events, whereas this patchset only needs to create 1 metric
event.

2. Increases the accuracy of the metric calculation
Like's patchset needs to do the metric count conversion twice. The first
conversion happens in the perf system: the perf topdown metrics handling
logic reads the metric percentage from the PERF_METRICS MSR, multiplies
it by the elapsed slots count, and obtains the metric count. The second
conversion happens in KVM: KVM converts the metric count back to a metric
percentage by dividing it by the elapsed slots again, and then assembles
the 4 or 8 metric percentage values into the virtual PERF_METRICS MSR
which is finally returned to the guest. Considering that each metric
percentage in the PERF_METRICS MSR is represented with only 8 bits, these
two conversions (one multiplication and one division) inevitably cause
some accuracy loss. Since this patchset directly returns the raw data of
the PERF_METRICS MSR to the guest, it avoids any such accuracy loss.
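
To make the accuracy argument concrete, here is a small illustrative
calculation. The numbers are arbitrary and truncating integer arithmetic
is assumed for simplicity; it only shows why an 8-bit round trip can lose
precision, which returning the raw PERF_METRICS value avoids:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical values: one 8-bit metric field, elapsed slots. */
		uint64_t slots  = 1000000;
		uint8_t  metric = 151;			/* ~59.2% of 0xff */

		/* 1st conversion (perf): metric fraction -> metric count. */
		uint64_t count = metric * slots / 0xff;	/* 592156 */

		/* 2nd conversion (KVM): metric count -> metric fraction. */
		uint8_t back = count * 0xff / slots;	/* 150, one LSB lost */

		printf("metric in=%u out=%u\n", (unsigned)metric, (unsigned)back);
		return 0;
	}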

The patchset is rebased on the latest kvm-x86/next branch and has been
tested on both host and guest (SPR platform) with the perf commands below.
The 'foo' program is a backend-bound benchmark. The outputs of the perf
commands are quite close between host and guest.

1. perf stat ./foo

Host outputs:

Performance counter stats for '/home/sdp/work/foo/foo':

33,485.69 msec task-clock # 1.000 CPUs utilized
44 context-switches # 1.314 /sec
0 cpu-migrations # 0.000 /sec
50 page-faults # 1.493 /sec
125,321,811,275 cycles # 3.743 GHz
238,142,619,081 instructions # 1.90 insn per cycle
44,898,337,778 branches # 1.341 G/sec
69,302,880 branch-misses # 0.15% of all branches
TopdownL1 # 59.2 % tma_backend_bound
# 0.8 % tma_bad_speculation
# 1.9 % tma_frontend_bound
# 38.0 % tma_retiring
TopdownL2 # 0.8 % tma_branch_mispredicts
# 51.8 % tma_core_bound
# 0.8 % tma_fetch_bandwidth
# 1.2 % tma_fetch_latency
# 8.6 % tma_heavy_operations
# 29.4 % tma_light_operations
# 0.0 % tma_machine_clears
# 7.5 % tma_memory_bound

33.490329445 seconds time elapsed

33.483624000 seconds user
0.003999000 seconds sys

Guest outputs:

Performance counter stats for '/home/pnp/foo/foo':

33,753.35 msec task-clock # 1.000 CPUs utilized
12 context-switches # 0.356 /sec
0 cpu-migrations # 0.000 /sec
51 page-faults # 1.511 /sec
125,598,628,777 cycles # 3.721 GHz
238,420,589,003 instructions # 1.90 insn per cycle
44,952,453,723 branches # 1.332 G/sec
69,450,137 branch-misses # 0.15% of all branches
TopdownL1 # 58.0 % tma_backend_bound
# 1.2 % tma_bad_speculation
# 3.1 % tma_frontend_bound
# 37.6 % tma_retiring
TopdownL2 # 1.2 % tma_branch_mispredicts
# 49.4 % tma_core_bound
# 1.2 % tma_fetch_bandwidth
# 1.9 % tma_fetch_latency
# 8.2 % tma_heavy_operations
# 29.4 % tma_light_operations
# 0.0 % tma_machine_clears
# 8.6 % tma_memory_bound

33.763389325 seconds time elapsed

33.748767000 seconds user
0.008005000 seconds sys

2. perf stat -e slots ./foo

Host outputs:

Performance counter stats for '/home/sdp/work/foo/foo':

713,234,232,102 slots

31.786272154 seconds time elapsed

31.782986000 seconds user
0.003999000 seconds sys

Guest outputs:

Performance counter stats for '/home/pnp/foo/foo':

714,054,317,454 slots

32.279685600 seconds time elapsed

32.275457000 seconds user
0.004002000 seconds sys

3.
echo 0 > /proc/sys/kernel/nmi_watchdog
echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 100000 > /proc/sys/kernel/perf_event_max_sample_rate
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
perf record -e slots ./foo && perf report

Host outputs:

# Total Lost Samples: 0
#
# Samples: 129K of event 'slots'
# Event count (approx.): 716048770762
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ ...................................
#
74.50% foo libc.so.6 [.] random
7.25% foo libc.so.6 [.] random_r
7.22% foo foo [.] qux
5.45% foo foo [.] main
1.86% foo foo [.] random@plt
1.80% foo foo [.] foo
1.78% foo foo [.] bar
0.02% foo [kernel.vmlinux] [k] perf_adjust_freq_unthr_context
0.02% foo [kernel.vmlinux] [k] timekeeping_advance
0.02% foo [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.01% foo [kernel.vmlinux] [k] __wake_up_common
0.01% foo [kernel.vmlinux] [k] call_function_single_prep_ipi

Guest outputs:

# Total Lost Samples: 0
#
# Samples: 6K of event 'slots'
# Event count (approx.): 19515232625
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ ..................................
#
75.02% foo libc.so.6 [.] __random
7.07% foo libc.so.6 [.] __random_r
7.03% foo foo [.] qux
5.21% foo foo [.] main
2.01% foo foo [.] foo
1.85% foo foo [.] bar
1.79% foo foo [.] random@plt
0.02% foo [kernel.vmlinux] [k] task_mm_cid_work
0.00% foo [kernel.vmlinux] [k] lock_acquire
0.00% perf-ex [kernel.vmlinux] [k] perf_adjust_freq_unthr_context
0.00% foo [kernel.vmlinux] [k] native_write_msr
0.00% perf-ex [kernel.vmlinux] [k] native_write_msr

To support the guest topdown metrics feature, we have to make several
fundamental changes to the perf system and the vPMU code. We tried to
avoid these changes as much as possible, but they appear to be
unavoidable. Suggestions are welcome.

The fundamental changes:
1. Add *group_leader for perf_event_create_group_kernel_counters()
Add a *group_leader argument to perf_event_create_group_kernel_counters()
so that group events can be created from kernel space.
2. perf/core: Add new function perf_event_topdown_metrics()
Add a new API to update topdown metrics values (see the usage sketch
after this list).
3. perf/x86/intel: Handle KVM virtual metrics event in perf system
Add virtual metrics event processing logic to the topdown metrics
processing code.
4. KVM: x86/pmu: Extend pmc_reprogram_counter() to create group events
Extend pmc_reprogram_counter() to be capable of creating group events
instead of just a single event.
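
As a usage sketch for item 2, KVM feeds the guest's saved slots count and
raw PERF_METRICS value into perf through the new interface roughly as
below (condensed from patch 11/13; the exact prototype lives in the perf
patch and is not repeated here):

	struct td_metrics td_metrics = {
		.slots	= pmc->counter,		/* guest slots count */
		.metric	= pmc->extra_config,	/* raw guest PERF_METRICS data */
	};

	for (i = 0; i < pmc->max_nr_events; i++)
		perf_event_topdown_metrics(pmc->perf_events[i], &td_metrics);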

References:
[1]: Intel 64 and IA-32 Architectures Software Developer Manual
Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4
https://cdrdv2.intel.com/v1/dl/getContent/671200
[2]: perf/x86/intel: Support TopDown metrics on Ice Lake
https://lwn.net/ml/linux-kernel/[email protected]/
[3]: KVM: x86/pmu: Enable Fixed Counter3 and Topdown Perf Metrics
https://lwn.net/ml/linux-kernel/[email protected]/

Changelog:
v3 -> v4:
* Fix build error on riscv.
* Rebase this patchset to latest kvm-x86 code (v6.6-rc2+)
v2 -> v3:
* Add an argument in perf_event_create_group_kernel_counters() to
create group events instead of introducing a new function.
* Warning fix.

Dapeng Mi (13):
KVM: x86/pmu: Add Intel CPUID-hinted TopDown slots event
KVM: x86/pmu: Support PMU fixed counter 3
perf/core: Add function perf_event_group_leader_check()
perf/core: Add function perf_event_move_group()
perf/core: Add *group_leader for
perf_event_create_group_kernel_counters()
perf/x86: Fix typos and inconsistent indents in perf_event header
perf/x86: Add constraint for guest perf metrics event
perf/core: Add new function perf_event_topdown_metrics()
perf/x86/intel: Handle KVM virtual metrics event in perf system
KVM: x86/pmu: Extend pmc_reprogram_counter() to create group events
KVM: x86/pmu: Support topdown perf metrics feature
KVM: x86/pmu: Handle PERF_METRICS overflow
KVM: x86/pmu: Expose Topdown in MSR_IA32_PERF_CAPABILITIES

arch/arm64/kvm/pmu-emul.c | 2 +-
arch/riscv/kvm/vcpu_pmu.c | 2 +-
arch/x86/events/intel/core.c | 74 ++++--
arch/x86/events/perf_event.h | 10 +-
arch/x86/include/asm/kvm_host.h | 19 +-
arch/x86/include/asm/perf_event.h | 21 +-
arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 4 +-
arch/x86/kvm/pmu.c | 141 ++++++++--
arch/x86/kvm/pmu.h | 50 +++-
arch/x86/kvm/svm/pmu.c | 2 +
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 71 ++++-
arch/x86/kvm/vmx/vmx.c | 2 +
arch/x86/kvm/vmx/vmx.h | 5 +
arch/x86/kvm/x86.c | 5 +-
include/linux/perf_event.h | 14 +
kernel/events/core.c | 304 ++++++++++++++--------
kernel/events/hw_breakpoint.c | 4 +-
kernel/events/hw_breakpoint_test.c | 2 +-
kernel/watchdog_perf.c | 2 +-
20 files changed, 574 insertions(+), 161 deletions(-)


base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
--
2.34.1


2023-09-27 10:59:31

by Dapeng Mi

Subject: [Patch v4 10/13] KVM: x86/pmu: Extend pmc_reprogram_counter() to create group events

The current perf code creates an event group, containing a slots event
that acts as the group leader and multiple metric events, to support the
topdown perf metrics feature. To support the topdown metrics feature in
KVM while reducing the changes to the perf system, we follow this mature
mechanism and create an event group in KVM. The event group contains a
slots event, which claims fixed counter 3 and acts as the group leader as
the perf system requires, and a virtual metrics event, which claims the
PERF_METRICS MSR. This event group is scheduled as a whole by the perf
system.

Unfortunately the function pmc_reprogram_counter() can only create a
single event for each counter, so this change extends the function to be
capable of creating an event group.

Co-developed-by: Like Xu <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 +++++-
arch/x86/kvm/pmu.c | 64 ++++++++++++++++++++++++++-------
arch/x86/kvm/pmu.h | 22 ++++++++----
arch/x86/kvm/svm/pmu.c | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 4 +++
5 files changed, 83 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 90ecd3f7a9c3..bf1626b2b553 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -490,12 +490,12 @@ enum pmc_type {
struct kvm_pmc {
enum pmc_type type;
u8 idx;
+ u8 max_nr_events;
bool is_paused;
bool intr;
u64 counter;
u64 prev_counter;
u64 eventsel;
- struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
/*
* only for creating or reusing perf_event,
@@ -503,6 +503,15 @@ struct kvm_pmc {
* ctrl value for fixed counters.
*/
u64 current_config;
+ /*
+ * Non-leader events may need some extra information,
+ * this field can be used to store this information.
+ */
+ u64 extra_config;
+ union {
+ struct perf_event *perf_event;
+ DECLARE_FLEX_ARRAY(struct perf_event *, perf_events);
+ };
};

/* More counters may conflict with other existing Architectural MSRs */
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 760d293f4a4a..b02a56c77647 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -187,7 +187,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
bool intr)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
- struct perf_event *event;
+ struct perf_event *event, *group_leader;
struct perf_event_attr attr = {
.type = type,
.size = sizeof(attr),
@@ -199,6 +199,7 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
.config = config,
};
bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+ unsigned int i, j;

attr.sample_period = get_sample_period(pmc, pmc->counter);

@@ -221,36 +222,73 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
attr.precise_ip = pmc_get_pebs_precise_level(pmc);
}

- event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
- kvm_perf_overflow, pmc);
- if (IS_ERR(event)) {
- pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
- PTR_ERR(event), pmc->idx);
- return PTR_ERR(event);
+ /*
+ * To create grouped events, the first created perf_event doesn't
+ * know it will be the group_leader and may move to an unexpected
+ * enabling path, thus delay all enablement until after creation,
+ * not affecting non-grouped events to save one perf interface call.
+ */
+ if (pmc->max_nr_events > 1)
+ attr.disabled = 1;
+
+ for (i = 0; i < pmc->max_nr_events; i++) {
+ group_leader = i ? pmc->perf_event : NULL;
+ event = perf_event_create_kernel_counter(&attr, -1,
+ current, group_leader,
+ kvm_perf_overflow, pmc);
+ if (IS_ERR(event)) {
+ pr_err_ratelimited("kvm_pmu: event %u of pmc %u creation failed %ld\n",
+ i, pmc->idx, PTR_ERR(event));
+
+ for (j = 0; j < i; j++) {
+ perf_event_release_kernel(pmc->perf_events[j]);
+ pmc->perf_events[j] = NULL;
+ pmc_to_pmu(pmc)->event_count--;
+ }
+
+ return PTR_ERR(event);
+ }
+
+ pmc->perf_events[i] = event;
+ pmc_to_pmu(pmc)->event_count++;
}

- pmc->perf_event = event;
- pmc_to_pmu(pmc)->event_count++;
pmc->is_paused = false;
pmc->intr = intr || pebs;
+
+ if (!attr.disabled)
+ return 0;
+
+ for (i = 0; pmc->perf_events[i] && i < pmc->max_nr_events; i++)
+ perf_event_enable(pmc->perf_events[i]);
+
return 0;
}

static void pmc_pause_counter(struct kvm_pmc *pmc)
{
u64 counter = pmc->counter;
+ unsigned int i;

if (!pmc->perf_event || pmc->is_paused)
return;

- /* update counter, reset event value to avoid redundant accumulation */
+ /*
+ * Update counter, reset event value to avoid redundant
+ * accumulation. Disable group leader event firstly and
+ * then disable non-group leader events.
+ */
counter += perf_event_pause(pmc->perf_event, true);
+ for (i = 1; pmc->perf_events[i] && i < pmc->max_nr_events; i++)
+ perf_event_pause(pmc->perf_events[i], true);
pmc->counter = counter & pmc_bitmask(pmc);
pmc->is_paused = true;
}

static bool pmc_resume_counter(struct kvm_pmc *pmc)
{
+ unsigned int i;
+
if (!pmc->perf_event)
return false;

@@ -264,8 +302,8 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
(!!pmc->perf_event->attr.precise_ip))
return false;

- /* reuse perf_event to serve as pmc_reprogram_counter() does*/
- perf_event_enable(pmc->perf_event);
+ for (i = 0; pmc->perf_events[i] && i < pmc->max_nr_events; i++)
+ perf_event_enable(pmc->perf_events[i]);
pmc->is_paused = false;

return true;
@@ -432,7 +470,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
if (pmc->current_config == new_config && pmc_resume_counter(pmc))
goto reprogram_complete;

- pmc_release_perf_event(pmc);
+ pmc_release_perf_event(pmc, false);

pmc->current_config = new_config;

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..3dc0deb83096 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -74,21 +74,31 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
return counter & pmc_bitmask(pmc);
}

-static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
+static inline void pmc_release_perf_event(struct kvm_pmc *pmc, bool reset)
{
- if (pmc->perf_event) {
- perf_event_release_kernel(pmc->perf_event);
- pmc->perf_event = NULL;
- pmc->current_config = 0;
+ unsigned int i;
+
+ if (!pmc->perf_event)
+ return;
+
+ for (i = 0; pmc->perf_events[i] && i < pmc->max_nr_events; i++) {
+ perf_event_release_kernel(pmc->perf_events[i]);
+ pmc->perf_events[i] = NULL;
pmc_to_pmu(pmc)->event_count--;
}
+
+ if (reset) {
+ pmc->current_config = 0;
+ pmc->extra_config = 0;
+ pmc->max_nr_events = 1;
+ }
}

static inline void pmc_stop_counter(struct kvm_pmc *pmc)
{
if (pmc->perf_event) {
pmc->counter = pmc_read_counter(pmc);
- pmc_release_perf_event(pmc);
+ pmc_release_perf_event(pmc, true);
}
}

diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index cef5a3d0abd0..861ff79ac614 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -230,6 +230,8 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
pmu->gp_counters[i].vcpu = vcpu;
pmu->gp_counters[i].idx = i;
pmu->gp_counters[i].current_config = 0;
+ pmu->gp_counters[i].extra_config = 0;
+ pmu->gp_counters[i].max_nr_events = 1;
}
}

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9bf80fee34fb..b45396e0a46c 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -628,6 +628,8 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
pmu->gp_counters[i].vcpu = vcpu;
pmu->gp_counters[i].idx = i;
pmu->gp_counters[i].current_config = 0;
+ pmu->gp_counters[i].extra_config = 0;
+ pmu->gp_counters[i].max_nr_events = 1;
}

for (i = 0; i < KVM_PMC_MAX_FIXED; i++) {
@@ -635,6 +637,8 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
pmu->fixed_counters[i].vcpu = vcpu;
pmu->fixed_counters[i].idx = i + INTEL_PMC_IDX_FIXED;
pmu->fixed_counters[i].current_config = 0;
+ pmu->fixed_counters[i].extra_config = 0;
+ pmu->fixed_counters[i].max_nr_events = 1;
}

lbr_desc->records.nr = 0;
--
2.34.1

2023-09-27 11:53:20

by Dapeng Mi

Subject: [Patch v4 04/13] perf/core: Add function perf_event_move_group()

Extract the group-moving code from sys_perf_event_open() into a new
function, perf_event_move_group().

A subsequent change will add a new function,
perf_event_create_group_kernel_counters(), which is used to create group
events from kernel space. That function also needs to do the same group
moving for the group leader event as sys_perf_event_open() does, so
extract the moving code into a separate function to avoid code
duplication.

Signed-off-by: Dapeng Mi <[email protected]>
---
kernel/events/core.c | 82 ++++++++++++++++++++++++--------------------
1 file changed, 45 insertions(+), 37 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index d485dac2b55f..953e3d3a1664 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12388,6 +12388,48 @@ static int perf_event_group_leader_check(struct perf_event *group_leader,
return 0;
}

+static void perf_event_move_group(struct perf_event *group_leader,
+ struct perf_event_pmu_context *pmu_ctx,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *sibling;
+
+ perf_remove_from_context(group_leader, 0);
+ put_pmu_ctx(group_leader->pmu_ctx);
+
+ for_each_sibling_event(sibling, group_leader) {
+ perf_remove_from_context(sibling, 0);
+ put_pmu_ctx(sibling->pmu_ctx);
+ }
+
+ /*
+ * Install the group siblings before the group leader.
+ *
+ * Because a group leader will try and install the entire group
+ * (through the sibling list, which is still in-tact), we can
+ * end up with siblings installed in the wrong context.
+ *
+ * By installing siblings first we NO-OP because they're not
+ * reachable through the group lists.
+ */
+ for_each_sibling_event(sibling, group_leader) {
+ sibling->pmu_ctx = pmu_ctx;
+ get_pmu_ctx(pmu_ctx);
+ perf_event__state_init(sibling);
+ perf_install_in_context(ctx, sibling, sibling->cpu);
+ }
+
+ /*
+ * Removing from the context ends up with disabled
+ * event. What we want here is event in the initial
+ * startup state, ready to be add into new context.
+ */
+ group_leader->pmu_ctx = pmu_ctx;
+ get_pmu_ctx(pmu_ctx);
+ perf_event__state_init(group_leader);
+ perf_install_in_context(ctx, group_leader, group_leader->cpu);
+}
+
/**
* sys_perf_event_open - open a performance event, associate it to a task/cpu
*
@@ -12403,7 +12445,7 @@ SYSCALL_DEFINE5(perf_event_open,
{
struct perf_event *group_leader = NULL, *output_event = NULL;
struct perf_event_pmu_context *pmu_ctx;
- struct perf_event *event, *sibling;
+ struct perf_event *event;
struct perf_event_attr attr;
struct perf_event_context *ctx;
struct file *event_file = NULL;
@@ -12635,42 +12677,8 @@ SYSCALL_DEFINE5(perf_event_open,
* where we start modifying current state.
*/

- if (move_group) {
- perf_remove_from_context(group_leader, 0);
- put_pmu_ctx(group_leader->pmu_ctx);
-
- for_each_sibling_event(sibling, group_leader) {
- perf_remove_from_context(sibling, 0);
- put_pmu_ctx(sibling->pmu_ctx);
- }
-
- /*
- * Install the group siblings before the group leader.
- *
- * Because a group leader will try and install the entire group
- * (through the sibling list, which is still in-tact), we can
- * end up with siblings installed in the wrong context.
- *
- * By installing siblings first we NO-OP because they're not
- * reachable through the group lists.
- */
- for_each_sibling_event(sibling, group_leader) {
- sibling->pmu_ctx = pmu_ctx;
- get_pmu_ctx(pmu_ctx);
- perf_event__state_init(sibling);
- perf_install_in_context(ctx, sibling, sibling->cpu);
- }
-
- /*
- * Removing from the context ends up with disabled
- * event. What we want here is event in the initial
- * startup state, ready to be add into new context.
- */
- group_leader->pmu_ctx = pmu_ctx;
- get_pmu_ctx(pmu_ctx);
- perf_event__state_init(group_leader);
- perf_install_in_context(ctx, group_leader, group_leader->cpu);
- }
+ if (move_group)
+ perf_event_move_group(group_leader, pmu_ctx, ctx);

/*
* Precalculate sample_data sizes; do while holding ctx::mutex such
--
2.34.1

2023-09-27 15:39:41

by Dapeng Mi

Subject: [Patch v4 01/13] KVM: x86/pmu: Add Intel CPUID-hinted TopDown slots event

This patch adds support for the architectural topdown slots event which
is hinted by CPUID.0AH.EBX.

The topdown slots event counts the total number of available slots for
an unhalted logical processor. Software can use this event as the
denominator for the top-level metrics of the TopDown Microarchitecture
Analysis method.

Although the PERF_METRICS MSR required for topdown metrics is not
currently available in the guest, the data provided by the slots event
alone is sufficient for PMU users to perceive differences in CPU pipeline
machine width across microarchitectures.

The standalone slots event, like the instructions event, can be counted
with a GP counter or fixed counter 3 (if present). Its availability is
also controlled by CPUID.0AH.EBX. On Linux, a perf user may encode
"-e cpu/event=0xa4,umask=0x01/" or "-e cpu/slots/" to count slots events.

This patch only enables the slots event on GP counters. Enabling it on
fixed counter 3 will be supported in subsequent patches.

Co-developed-by: Like Xu <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f2efa0bf7ae8..7322f0c18565 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -34,6 +34,7 @@ enum intel_pmu_architectural_events {
INTEL_ARCH_LLC_MISSES,
INTEL_ARCH_BRANCHES_RETIRED,
INTEL_ARCH_BRANCHES_MISPREDICTED,
+ INTEL_ARCH_TOPDOWN_SLOTS,

NR_REAL_INTEL_ARCH_EVENTS,

@@ -58,6 +59,7 @@ static struct {
[INTEL_ARCH_LLC_MISSES] = { 0x2e, 0x41 },
[INTEL_ARCH_BRANCHES_RETIRED] = { 0xc4, 0x00 },
[INTEL_ARCH_BRANCHES_MISPREDICTED] = { 0xc5, 0x00 },
+ [INTEL_ARCH_TOPDOWN_SLOTS] = { 0xa4, 0x01 },
[PSEUDO_ARCH_REFERENCE_CYCLES] = { 0x00, 0x03 },
};

--
2.34.1

2023-09-27 19:25:36

by Dapeng Mi

Subject: [Patch v4 03/13] perf/core: Add function perf_event_group_leader_check()

Extract the group leader checking code from sys_perf_event_open() into a
new function, perf_event_group_leader_check().

A subsequent change will add a new function,
perf_event_create_group_kernel_counters(), which is used to create group
events from kernel space. That function also needs to do the same check
for the group leader event as sys_perf_event_open() does, so extract the
checking code into a separate function to avoid code duplication.

Signed-off-by: Dapeng Mi <[email protected]>
---
kernel/events/core.c | 143 +++++++++++++++++++++++--------------------
1 file changed, 78 insertions(+), 65 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c72a41f11af..d485dac2b55f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12313,6 +12313,81 @@ perf_check_permission(struct perf_event_attr *attr, struct task_struct *task)
return is_capable || ptrace_may_access(task, ptrace_mode);
}

+static int perf_event_group_leader_check(struct perf_event *group_leader,
+ struct perf_event *event,
+ struct perf_event_attr *attr,
+ struct perf_event_context *ctx,
+ struct pmu **pmu,
+ int *move_group)
+{
+ if (!group_leader)
+ return 0;
+
+ /*
+ * Do not allow a recursive hierarchy (this new sibling
+ * becoming part of another group-sibling):
+ */
+ if (group_leader->group_leader != group_leader)
+ return -EINVAL;
+
+ /* All events in a group should have the same clock */
+ if (group_leader->clock != event->clock)
+ return -EINVAL;
+
+ /*
+ * Make sure we're both events for the same CPU;
+ * grouping events for different CPUs is broken; since
+ * you can never concurrently schedule them anyhow.
+ */
+ if (group_leader->cpu != event->cpu)
+ return -EINVAL;
+
+ /*
+ * Make sure we're both on the same context; either task or cpu.
+ */
+ if (group_leader->ctx != ctx)
+ return -EINVAL;
+
+ /*
+ * Only a group leader can be exclusive or pinned
+ */
+ if (attr->exclusive || attr->pinned)
+ return -EINVAL;
+
+ if (is_software_event(event) &&
+ !in_software_context(group_leader)) {
+ /*
+ * If the event is a sw event, but the group_leader
+ * is on hw context.
+ *
+ * Allow the addition of software events to hw
+ * groups, this is safe because software events
+ * never fail to schedule.
+ *
+ * Note the comment that goes with struct
+ * perf_event_pmu_context.
+ */
+ *pmu = group_leader->pmu_ctx->pmu;
+ } else if (!is_software_event(event)) {
+ if (is_software_event(group_leader) &&
+ (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
+ /*
+ * In case the group is a pure software group, and we
+ * try to add a hardware event, move the whole group to
+ * the hardware context.
+ */
+ *move_group = 1;
+ }
+
+ /* Don't allow group of multiple hw events from different pmus */
+ if (!in_software_context(group_leader) &&
+ group_leader->pmu_ctx->pmu != *pmu)
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
/**
* sys_perf_event_open - open a performance event, associate it to a task/cpu
*
@@ -12507,71 +12582,9 @@ SYSCALL_DEFINE5(perf_event_open,
}
}

- if (group_leader) {
- err = -EINVAL;
-
- /*
- * Do not allow a recursive hierarchy (this new sibling
- * becoming part of another group-sibling):
- */
- if (group_leader->group_leader != group_leader)
- goto err_locked;
-
- /* All events in a group should have the same clock */
- if (group_leader->clock != event->clock)
- goto err_locked;
-
- /*
- * Make sure we're both events for the same CPU;
- * grouping events for different CPUs is broken; since
- * you can never concurrently schedule them anyhow.
- */
- if (group_leader->cpu != event->cpu)
- goto err_locked;
-
- /*
- * Make sure we're both on the same context; either task or cpu.
- */
- if (group_leader->ctx != ctx)
- goto err_locked;
-
- /*
- * Only a group leader can be exclusive or pinned
- */
- if (attr.exclusive || attr.pinned)
- goto err_locked;
-
- if (is_software_event(event) &&
- !in_software_context(group_leader)) {
- /*
- * If the event is a sw event, but the group_leader
- * is on hw context.
- *
- * Allow the addition of software events to hw
- * groups, this is safe because software events
- * never fail to schedule.
- *
- * Note the comment that goes with struct
- * perf_event_pmu_context.
- */
- pmu = group_leader->pmu_ctx->pmu;
- } else if (!is_software_event(event)) {
- if (is_software_event(group_leader) &&
- (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
- /*
- * In case the group is a pure software group, and we
- * try to add a hardware event, move the whole group to
- * the hardware context.
- */
- move_group = 1;
- }
-
- /* Don't allow group of multiple hw events from different pmus */
- if (!in_software_context(group_leader) &&
- group_leader->pmu_ctx->pmu != pmu)
- goto err_locked;
- }
- }
+ err = perf_event_group_leader_check(group_leader, event, &attr, ctx, &pmu, &move_group);
+ if (err)
+ goto err_locked;

/*
* Now that we're certain of the pmu; find the pmu_ctx.
--
2.34.1

2023-09-27 21:04:18

by Dapeng Mi

Subject: [Patch v4 11/13] KVM: x86/pmu: Support topdown perf metrics feature

This patch adds topdown perf metrics support to KVM. Topdown perf
metrics is a feature on Intel CPUs which supports the TopDown
Microarchitecture Analysis (TMA) Method. TMA is a structured analysis
methodology to identify critical performance bottlenecks in out-of-order
processors. The details about topdown metrics support on Intel
processors can be found in the "Performance Metrics" section of Intel's SDM.

Intel chips use fixed counter 3 and the PERF_METRICS MSR together to
support the topdown metrics feature. Fixed counter 3 counts the elapsed
CPU slots, and PERF_METRICS reports the topdown metrics percentages.
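
For reference, a sketch of the PERF_METRICS layout as described in the
SDM's "Performance Metrics" section (the Ice Lake level-1 fields are
shown; Sapphire Rapids appends four more 8-bit level-2 fields; the union
below is only illustrative, not a structure used by the patches, and the
field order should be double-checked against the SDM):

	/*
	 * Each 8-bit field is an integer fraction of 0xff, i.e.
	 * percentage = field * 100 / 0xff.
	 */
	union perf_metrics {
		u64 full;
		struct {
			u8 retiring;		/* bits  7:0  */
			u8 bad_speculation;	/* bits 15:8  */
			u8 frontend_bound;	/* bits 23:16 */
			u8 backend_bound;	/* bits 31:24 */
			/* SPR adds: heavy_operations, branch_mispredicts,
			 * fetch_latency, memory_bound */
		};
	};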

Generally speaking, if KVM only observes fixed counter 3, it has no way
to know whether the guest is running a standalone slots event
counting/sampling session or profiling the topdown perf metrics.
Fortunately, topdown metrics profiling always manipulates fixed counter 3
and the PERF_METRICS MSR together in a fixed sequence: the FIXED_CTR3 MSR
is written first and then PERF_METRICS follows. So KVM can assume topdown
metrics profiling is running in the guest when it observes a PERF_METRICS
write.

In the current perf logic, an event group is required to handle topdown
metrics profiling: the group couples a slots event, which acts as the
group leader, with multiple metric events. To coordinate with the perf
topdown metrics handling logic and reduce the code changes in KVM, we
choose to follow the current mature vPMU PMC emulation framework. The
only difference is that we need to create an event group for fixed
counter 3 and manipulate the FIXED_CTR3 and PERF_METRICS MSRs together,
instead of creating a single event and manipulating only the FIXED_CTR3
MSR.

When the guest writes the PERF_METRICS MSR for the first time, KVM
creates an event group which couples a slots event and a virtual metrics
event. In this event group, the slots event claims the fixed counter 3
hardware resource and acts as the group leader, as the perf system
requires. The virtual metrics event claims the PERF_METRICS MSR. This
event group looks just like the perf metrics event group on the host and
is scheduled by the host perf system.

In this proposal, the count of the slots event is calculated and emulated
on the host and returned to the guest just like other normal counters, but
the metrics event is processed differently. KVM doesn't calculate the real
count of the topdown metrics; it just stores the raw data of the
PERF_METRICS MSR and directly returns the stored raw data to the guest.
Thus, the guest gets the real hardware PERF_METRICS data, which guarantees
the calculation accuracy of the topdown metrics.

The whole procedure can be summarized as below.
1. KVM intercepts the PERF_METRICS MSR write and marks fixed counter 3 as
entering topdown profiling mode (sets the max_nr_events of fixed
counter 3 to 2) if it hasn't already.
2. If the topdown metrics event group doesn't exist yet, create the event
group first, then update the saved slots count and metrics data of the
group events with the guest values. Finally, enable the events so that
the guest values are loaded into the hardware FIXED_CTR3 and
PERF_METRICS MSRs.
3. Modify kvm_pmu_rdpmc() to return the PERF_METRICS MSR raw data to the
guest directly.

Signed-off-by: Dapeng Mi <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 6 ++++
arch/x86/kvm/pmu.c | 62 +++++++++++++++++++++++++++++++--
arch/x86/kvm/pmu.h | 28 +++++++++++++++
arch/x86/kvm/vmx/capabilities.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 48 +++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.h | 5 +++
arch/x86/kvm/x86.c | 1 +
7 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bf1626b2b553..19f8fb65e21e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -487,6 +487,12 @@ enum pmc_type {
KVM_PMC_FIXED,
};

+enum topdown_events {
+ KVM_TD_SLOTS = 0,
+ KVM_TD_METRICS = 1,
+ KVM_TD_EVENTS_MAX = 2,
+};
+
struct kvm_pmc {
enum pmc_type type;
u8 idx;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index b02a56c77647..fad7b2c10bb8 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -182,6 +182,30 @@ static u64 pmc_get_pebs_precise_level(struct kvm_pmc *pmc)
return 1;
}

+static void pmc_setup_td_metrics_events_attr(struct kvm_pmc *pmc,
+ struct perf_event_attr *attr,
+ unsigned int event_idx)
+{
+ if (!pmc_is_topdown_metrics_used(pmc))
+ return;
+
+ /*
+ * setup slots event attribute, when slots event is
+ * created for guest topdown metrics profiling, the
+ * sample period must be 0.
+ */
+ if (event_idx == KVM_TD_SLOTS)
+ attr->sample_period = 0;
+
+ /* setup vmetrics event attribute */
+ if (event_idx == KVM_TD_METRICS) {
+ attr->config = INTEL_FIXED_VMETRICS_EVENT;
+ attr->sample_period = 0;
+ /* Only group leader event can be pinned. */
+ attr->pinned = false;
+ }
+}
+
static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
bool exclude_user, bool exclude_kernel,
bool intr)
@@ -233,6 +257,8 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,

for (i = 0; i < pmc->max_nr_events; i++) {
group_leader = i ? pmc->perf_event : NULL;
+ pmc_setup_td_metrics_events_attr(pmc, &attr, i);
+
event = perf_event_create_kernel_counter(&attr, -1,
current, group_leader,
kvm_perf_overflow, pmc);
@@ -256,6 +282,12 @@ static int pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type, u64 config,
pmc->is_paused = false;
pmc->intr = intr || pebs;

+ if (pmc_is_topdown_metrics_active(pmc)) {
+ pmc_update_topdown_metrics(pmc);
+ /* KVM need to inject PMI for PERF_METRICS overflow. */
+ pmc->intr = true;
+ }
+
if (!attr.disabled)
return 0;

@@ -269,6 +301,7 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
{
u64 counter = pmc->counter;
unsigned int i;
+ u64 data;

if (!pmc->perf_event || pmc->is_paused)
return;
@@ -279,8 +312,15 @@ static void pmc_pause_counter(struct kvm_pmc *pmc)
* then disable non-group leader events.
*/
counter += perf_event_pause(pmc->perf_event, true);
- for (i = 1; pmc->perf_events[i] && i < pmc->max_nr_events; i++)
- perf_event_pause(pmc->perf_events[i], true);
+ for (i = 1; pmc->perf_events[i] && i < pmc->max_nr_events; i++) {
+ data = perf_event_pause(pmc->perf_events[i], true);
+ /*
+ * The count of vmetrics event actually stores raw data of
+ * PERF_METRICS, save it to extra_config.
+ */
+ if (pmc->idx == INTEL_PMC_IDX_FIXED_SLOTS && i == KVM_TD_METRICS)
+ pmc->extra_config = data;
+ }
pmc->counter = counter & pmc_bitmask(pmc);
pmc->is_paused = true;
}
@@ -557,6 +597,21 @@ static int kvm_pmu_rdpmc_vmware(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
return 0;
}

+static inline int kvm_pmu_read_perf_metrics(struct kvm_vcpu *vcpu,
+ unsigned int idx, u64 *data)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR3);
+
+ if (!pmc) {
+ *data = 0;
+ return 1;
+ }
+
+ *data = pmc->extra_config;
+ return 0;
+}
+
int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
{
bool fast_mode = idx & (1u << 31);
@@ -570,6 +625,9 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
if (is_vmware_backdoor_pmc(idx))
return kvm_pmu_rdpmc_vmware(vcpu, idx, data);

+ if (idx & INTEL_PMC_FIXED_RDPMC_METRICS)
+ return kvm_pmu_read_perf_metrics(vcpu, idx, data);
+
pmc = static_call(kvm_x86_pmu_rdpmc_ecx_to_pmc)(vcpu, idx, &mask);
if (!pmc)
return 1;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 3dc0deb83096..43abe793c11c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -257,6 +257,34 @@ static inline bool pmc_is_globally_enabled(struct kvm_pmc *pmc)
return test_bit(pmc->idx, (unsigned long *)&pmu->global_ctrl);
}

+static inline int pmc_is_topdown_metrics_used(struct kvm_pmc *pmc)
+{
+ return (pmc->idx == INTEL_PMC_IDX_FIXED_SLOTS) &&
+ (pmc->max_nr_events == KVM_TD_EVENTS_MAX);
+}
+
+static inline int pmc_is_topdown_metrics_active(struct kvm_pmc *pmc)
+{
+ return pmc_is_topdown_metrics_used(pmc) &&
+ pmc->perf_events[KVM_TD_METRICS];
+}
+
+static inline void pmc_update_topdown_metrics(struct kvm_pmc *pmc)
+{
+ struct perf_event *event;
+ int i;
+
+ struct td_metrics td_metrics = {
+ .slots = pmc->counter,
+ .metric = pmc->extra_config,
+ };
+
+ for (i = 0; i < pmc->max_nr_events; i++) {
+ event = pmc->perf_events[i];
+ perf_event_topdown_metrics(event, &td_metrics);
+ }
+}
+
void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu);
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..d8317552b634 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -22,6 +22,7 @@ extern int __read_mostly pt_mode;
#define PT_MODE_HOST_GUEST 1

#define PMU_CAP_FW_WRITES (1ULL << 13)
+#define PMU_CAP_PERF_METRICS BIT_ULL(15)
#define PMU_CAP_LBR_FMT 0x3f

struct nested_vmx_msrs {
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index b45396e0a46c..04ccb8c6f7e4 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -229,6 +229,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
ret = (perf_capabilities & PERF_CAP_PEBS_BASELINE) &&
((perf_capabilities & PERF_CAP_PEBS_FORMAT) > 3);
break;
+ case MSR_PERF_METRICS:
+ ret = intel_pmu_metrics_is_enabled(vcpu) && (pmu->version > 1);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -357,6 +360,43 @@ static bool intel_pmu_handle_lbr_msrs_access(struct kvm_vcpu *vcpu,
return true;
}

+static int intel_pmu_handle_perf_metrics_access(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info, bool read)
+{
+ u32 index = msr_info->index;
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = get_fixed_pmc(pmu, MSR_CORE_PERF_FIXED_CTR3);
+
+ if (!pmc || index != MSR_PERF_METRICS)
+ return 1;
+
+ if (read) {
+ msr_info->data = pmc->extra_config;
+ } else {
+ /*
+ * Save guest PERF_METRICS data in to extra_config,
+ * the extra_config would be read to write to PERF_METRICS
+ * MSR in later events group creating process.
+ */
+ pmc->extra_config = msr_info->data;
+ if (pmc_is_topdown_metrics_active(pmc)) {
+ pmc_update_topdown_metrics(pmc);
+ } else {
+ /*
+ * If the slots/vmetrics events group is not
+ * created yet, set max_nr_events to 2
+ * (slots event + vmetrics event), so KVM knows
+ * topdown metrics profiling is running in guest
+ * and slots/vmetrics events group would be created
+ * later.
+ */
+ pmc->max_nr_events = KVM_TD_EVENTS_MAX;
+ }
+ }
+
+ return 0;
+}
+
static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -376,6 +416,10 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_PEBS_DATA_CFG:
msr_info->data = pmu->pebs_data_cfg;
break;
+ case MSR_PERF_METRICS:
+ if (intel_pmu_handle_perf_metrics_access(vcpu, msr_info, true))
+ return 1;
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -438,6 +482,10 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)

pmu->pebs_data_cfg = data;
break;
+ case MSR_PERF_METRICS:
+ if (intel_pmu_handle_perf_metrics_access(vcpu, msr_info, false))
+ return 1;
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..63b6dcc360c2 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -670,6 +670,11 @@ static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
return !!vcpu_to_lbr_records(vcpu)->nr;
}

+static inline bool intel_pmu_metrics_is_enabled(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.perf_capabilities & PMU_CAP_PERF_METRICS;
+}
+
void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 906af36850fb..1f4c4ebe2efc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1472,6 +1472,7 @@ static const u32 msrs_to_save_pmu[] = {
MSR_CORE_PERF_FIXED_CTR_CTRL, MSR_CORE_PERF_GLOBAL_STATUS,
MSR_CORE_PERF_GLOBAL_CTRL, MSR_CORE_PERF_GLOBAL_OVF_CTRL,
MSR_IA32_PEBS_ENABLE, MSR_IA32_DS_AREA, MSR_PEBS_DATA_CFG,
+ MSR_PERF_METRICS,

/* This part of MSRs should match KVM_INTEL_PMC_MAX_GENERIC. */
MSR_ARCH_PERFMON_PERFCTR0, MSR_ARCH_PERFMON_PERFCTR1,
--
2.34.1

2023-09-27 21:26:19

by Dapeng Mi

Subject: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

When the guest wants to use the PERF_METRICS MSR, a virtual metrics event
needs to be created in the perf subsystem so that the guest can have
exclusive ownership of the PERF_METRICS MSR.

We introduce a new vmetrics constraint, so that we can couple this
virtual metrics event with the slots event as an event group that takes
part in host perf system scheduling. Since guest metric events are always
recognized as the vCPU process's events on the host, they are time-sharing
multiplexed with other host metric events, so we choose bit 48
(INTEL_PMC_IDX_METRIC_BASE) as the index of this virtual metrics event.

Co-developed-by: Yang Weijiang <[email protected]>
Signed-off-by: Yang Weijiang <[email protected]>
Signed-off-by: Dapeng Mi <[email protected]>
---
arch/x86/events/intel/core.c | 28 +++++++++++++++++++++-------
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/perf_event.h | 15 +++++++++++++++
3 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fa355d3658a6..1c349290677c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3158,17 +3158,26 @@ intel_bts_constraints(struct perf_event *event)
return NULL;
}

+static struct event_constraint *intel_virt_event_constraints[] __read_mostly = {
+ &vlbr_constraint,
+ &vmetrics_constraint,
+};
+
/*
- * Note: matches a fake event, like Fixed2.
+ * Note: matches a virtual event, like vmetrics.
*/
static struct event_constraint *
-intel_vlbr_constraints(struct perf_event *event)
+intel_virt_constraints(struct perf_event *event)
{
- struct event_constraint *c = &vlbr_constraint;
+ int i;
+ struct event_constraint *c;

- if (unlikely(constraint_match(c, event->hw.config))) {
- event->hw.flags |= c->flags;
- return c;
+ for (i = 0; i < ARRAY_SIZE(intel_virt_event_constraints); i++) {
+ c = intel_virt_event_constraints[i];
+ if (unlikely(constraint_match(c, event->hw.config))) {
+ event->hw.flags |= c->flags;
+ return c;
+ }
}

return NULL;
@@ -3368,7 +3377,7 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
{
struct event_constraint *c;

- c = intel_vlbr_constraints(event);
+ c = intel_virt_constraints(event);
if (c)
return c;

@@ -5369,6 +5378,11 @@ static struct attribute *spr_tsx_events_attrs[] = {
NULL,
};

+struct event_constraint vmetrics_constraint =
+ __EVENT_CONSTRAINT(INTEL_FIXED_VMETRICS_EVENT,
+ (1ULL << INTEL_PMC_IDX_FIXED_VMETRICS),
+ FIXED_EVENT_FLAGS, 1, 0, 0);
+
static ssize_t freeze_on_smi_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c8ba2be7585d..a0d12989a483 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1482,6 +1482,7 @@ void reserve_lbr_buffers(void);

extern struct event_constraint bts_constraint;
extern struct event_constraint vlbr_constraint;
+extern struct event_constraint vmetrics_constraint;

void intel_pmu_enable_bts(u64 config);

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 63e1ce1f4b27..d767807aae91 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -390,6 +390,21 @@ static inline bool is_topdown_idx(int idx)
*/
#define INTEL_PMC_IDX_FIXED_VLBR (GLOBAL_STATUS_LBRS_FROZEN_BIT)

+/*
+ * We model guest TopDown metrics event tracing similarly.
+ *
+ * Guest metric events are recognized as vCPU process's events on host, they
+ * would be time-sharing multiplexed with other host metric events, so that
+ * we choose bit 48 (INTEL_PMC_IDX_METRIC_BASE) as the index of virtual
+ * metrics event.
+ */
+#define INTEL_PMC_IDX_FIXED_VMETRICS (INTEL_PMC_IDX_METRIC_BASE)
+
+/*
+ * Pseudo-encoding the guest metrics event as event=0x00,umask=0x11,
+ * since it would claim bit 48 which is effectively Fixed16.
+ */
+#define INTEL_FIXED_VMETRICS_EVENT 0x1100
/*
* Pseudo-encoding the guest LBR event as event=0x00,umask=0x1b,
* since it would claim bit 58 which is effectively Fixed26.
--
2.34.1

2023-09-27 22:21:45

by Dapeng Mi

Subject: [Patch v4 12/13] KVM: x86/pmu: Handle PERF_METRICS overflow

When fixed counter 3 overflows, the PMU subsequently also triggers a
PERF_METRICS overflow. This patch handles the PERF_METRICS overflow case:
after detecting a PERF_METRICS overflow on the host, it injects a PMI
into the guest and sets the PERF_METRICS overflow bit in the guest's
PERF_GLOBAL_STATUS MSR.

Signed-off-by: Dapeng Mi <[email protected]>
---
arch/x86/events/intel/core.c | 7 ++++++-
arch/x86/kvm/pmu.c | 19 +++++++++++++++----
2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index df56e091eb25..5c3271772d75 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3053,8 +3053,13 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
* Intel Perf metrics
*/
if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
+ struct perf_event *event = cpuc->events[GLOBAL_STATUS_PERF_METRICS_OVF_BIT];
+
handled++;
- static_call(intel_pmu_update_topdown_event)(NULL);
+ if (event && is_vmetrics_event(event))
+ READ_ONCE(event->overflow_handler)(event, &data, regs);
+ else
+ static_call(intel_pmu_update_topdown_event)(NULL);
}

/*
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index fad7b2c10bb8..06c815859f77 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -101,7 +101,7 @@ static void kvm_pmi_trigger_fn(struct irq_work *irq_work)
kvm_pmu_deliver_pmi(vcpu);
}

-static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
+static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi, bool metrics_of)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
bool skip_pmi = false;
@@ -121,7 +121,11 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
(unsigned long *)&pmu->global_status);
}
} else {
- __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ if (metrics_of)
+ __set_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT,
+ (unsigned long *)&pmu->global_status);
+ else
+ __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
}

if (!pmc->intr || skip_pmi)
@@ -141,11 +145,18 @@ static inline void __kvm_perf_overflow(struct kvm_pmc *pmc, bool in_pmi)
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
}

+static inline bool is_vmetrics_event(struct perf_event *event)
+{
+ return (event->attr.config & INTEL_ARCH_EVENT_MASK) ==
+ INTEL_FIXED_VMETRICS_EVENT;
+}
+
static void kvm_perf_overflow(struct perf_event *perf_event,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct kvm_pmc *pmc = perf_event->overflow_handler_context;
+ bool metrics_of = is_vmetrics_event(perf_event);

/*
* Ignore overflow events for counters that are scheduled to be
@@ -155,7 +166,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
if (test_and_set_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi))
return;

- __kvm_perf_overflow(pmc, true);
+ __kvm_perf_overflow(pmc, true, metrics_of);

kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
@@ -490,7 +501,7 @@ static void reprogram_counter(struct kvm_pmc *pmc)
goto reprogram_complete;

if (pmc->counter < pmc->prev_counter)
- __kvm_perf_overflow(pmc, false);
+ __kvm_perf_overflow(pmc, false, false);

if (eventsel & ARCH_PERFMON_EVENTSEL_PIN_CONTROL)
printk_once("kvm pmu: pin control bit is ignored\n");
--
2.34.1

2023-09-30 09:26:02

by Jim Mattson

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Fri, Sep 29, 2023 at 8:46 AM Sean Christopherson <[email protected]> wrote:
>
> On Fri, Sep 29, 2023, Peter Zijlstra wrote:
> > On Wed, Sep 27, 2023 at 10:27:07AM -0700, Sean Christopherson wrote:
> > > Jumping the gun a bit (we're in the *super* early stages of scraping together a
> > > rough PoC), but I think we should effectively put KVM's current vPMU support into
> > > maintenance-only mode, i.e. stop adding new features unless they are *very* simple
> > > to enable, and instead pursue an implementation that (a) lets userspace (and/or
> > > the kernel builder) completely disable host perf (or possibly just host perf usage
> > > of the hardware PMU) and (b) let KVM passthrough the entire hardware PMU when it
> > > has been turned off in the host.
> >
> > I don't think you need to go that far, host can use PMU just fine as
> > long as it doesn't overlap with a vCPU. Basically, if you force
> > perf_attr::exclude_guest on everything your vCPU can haz the full thing.
>
> Complexity aside, my understanding is that the overhead of trapping and emulating
> all of the guest counter and MSR accesses results in unacceptably degraded functionality
> for the guest. And we haven't even gotten to things like arch LBRs where context
> switching MSRs between the guest and host is going to be quite costly.

Trapping and emulating all of the PMU MSR accesses is ludicrously
slow, especially when the guest is multiplexing events.

Also, the current scheme of implicitly tying together usage mode and
priority means that KVM's "task pinned" perf_events always lose to
someone else's "CPU pinned" perf_events. Even if those "CPU pinned"
perf events are tagged "exclude_guest," the counters they occupy are
not available for KVM's "exclude_host" events, because host perf won't
multiplex a counter between an "exclude_host" event and an
"exclude_guest" event, even though the two events don't overlap.
Frankly, we wouldn't want it to, because that would introduce
egregious overheads at VM-entry and VM-exit. What we would need would
be a mechanism for allocating KVM's "task pinned" perf_events at the
highest priority, so they always win.

For things to work well in the "vPMU as a client of host perf" world,
we need to have the following at a minimum:
1) Guaranteed identity mapping of guest PMCs to host PMCs, so that we
don't have to intercept accesses to IA32_PERF_GLOBAL_CTRL.
2) Exclusive ownership of the PMU MSRs while in the KVM_RUN loop, so
that we don't have to switch any PMU MSRs on VM-entry/VM-exit (with
the exception of IA32_PERF_GLOBAL_CTRL, which has guest and host
fields in the VMCS).

There are other issues with the current implementation, like the
ridiculous overhead of bumping a counter in software to account for an
emulated instruction. That should just be a RDMSR, an increment, a
WRMSR, and the conditional synthesis of a guest PMI on overflow.
Instead, we have to pause a perf_event and reprogram it before
continuing. Putting a high-level abstraction between the guest PMU and
the host PMU does not yield the most efficient implementation.

> > > Note, a similar idea was floated and rejected in the past[*], but that failed
> > > proposal tried to retain host perf+PMU functionality by making the behavior dynamic,
> > > which I agree would create an awful ABI for the host. If we make the "knob" a
> > > Kconfig
> >
> > Must not be Kconfig, distros would have no sane choice.
>
> Or not only a Kconfig? E.g. similar to how the kernel has
> CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS and nopku.
>
> > > or kernel param, i.e. require the platform owner to opt-out of using perf
> > > no later than at boot time, then I think we can provide a sane ABI, keep the
> > > implementation simple, all without breaking existing users that utilize perf in
> > > the host to profile guests.
> >
> > It's a shit choice to have to make. At the same time I'm not sure I have
> > a better proposal.
> >
> > It does mean a host cannot profile one guest and have pass-through on the
> > other. Eg. have a development and production guest on the same box. This
> > is pretty crap.
> >
> > Making it a guest-boot-option would allow that, but then the host gets
> > complicated again. I think I can make it trivially work for per-task
> > events, simply error the creation of events without exclude_guest for
> > affected vCPU tasks. But the CPU events are tricky.
> >
> >
> > I will firmly reject anything that takes the PMU away from the host
> > entirely through.
>
> Why? What is so wrong with supporting use cases where the platform owner *wants*
> to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
> were complex, highly invasive, and/or difficult to maintain, then I can understand
> the pushback.
>
> But if we simply allow hiding hardware PMU support, then isn't the cost to perf
> just a few lines in init_hw_perf_events()? And if we put a stake in the ground
> and say that exposing "advanced" PMU features to KVM guests requires a passthrough
> PMU, i.e. the PMU to be hidden from the host, that will significantly reduce our
> maintenance and complexity.
>
> The kernel allows disabling almost literally every other feature that is even
> remotely optional, I don't understand why the hardware PMU is special.

2023-10-01 00:32:23

by Namhyung Kim

Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

Hello,

On Fri, Sep 29, 2023 at 8:29 PM Jim Mattson <[email protected]> wrote:
> For things to work well in the "vPMU as a client of host perf" world,
> we need to have the following at a minimum:
> 1) Guaranteed identity mapping of guest PMCs to host PMCs, so that we
> don't have to intercept accesses to IA32_PERF_GLOBAL_CTRL.
> 2) Exclusive ownership of the PMU MSRs while in the KVM_RUN loop, so
> that we don't have to switch any PMU MSRs on VM-entry/VM-exit (with
> the exception of IA32_PERF_GLOBAL_CTRL, which has guest and host
> fields in the VMCS).

IIUC that means multiplexing should be handled in the guest.
Not sure it can support mixing host and guest events at the
same time.

Thanks,
Namhyung

2023-10-02 18:30:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 02, 2023, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
> >
> > > > I will firmly reject anything that takes the PMU away from the host
> > > > entirely though.
> > >
> > > Why? What is so wrong with supporting use cases where the platform owner *wants*
> > > to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
> > > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > > the pushback.
> >
> > Because it sucks.
>
> > You're forcing people to choose between no host PMU or a slow guest PMU.

Nowhere did I say that we wouldn't take patches to improve the existing vPMU
support. But that's largely a moot point because I don't think it's possible to
improve the current approach to the point where it would provide a performant,
functional guest PMU.

> > And that's simply not a sane choice for most people --

It's better than the status quo, which is that no one gets to choose, everyone
gets a slow guest PMU.

> > worse it's not a choice based in technical reality.

The technical reality is that context switching the PMU between host and guest
requires reading and writing far too many MSRs for KVM to be able to context
switch at every VM-Enter and every VM-Exit. And PMIs skidding past VM-Exit adds
another layer of complexity to deal with.

> > It's a choice out of lazyness, disabling host PMU is not a requirement
> > for pass-through.

The requirement isn't passthrough access, the requirements are that the guest's
PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
guest doesn't have a meaningful impact on guest performance.

> Not just a choice of laziness, but it will clearly be forced upon users
> by external entities:
>
> "Pass ownership of the PMU to the guest and have no host PMU, or you
> won't have sane guest PMU support at all. If you disagree, please open
> a support ticket, which we'll ignore."

We don't have sane guest PMU support today. In the 12+ years since commit
f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests"), KVM has
never provided anything remotely close to a sane vPMU. It *mostly* works if host
perf is quiesced, but that "good enough" approach doesn't suffice for any form of
PMU usage that requires a high level of accuracy and precision.

> The host OS shouldn't offer facilities that severely limit its own capabilities,
> when there's a better solution. We don't give the FPU to apps exclusively either,
> it would be insanely stupid for a platform to do that.

The FPU can be efficiently context switched, guest state remains resident in
hardware so long as the vCPU task is scheduled in (ignoring infrequent FPU usage
from IRQ context), and guest usage of the FPU doesn't require trap-and-emulate
behavior in KVM.

As David said, ceding the hardware PMU for all of kvm_arch_vcpu_ioctl_run()
(modulo the vCPU task being scheduled out) is likely a viable alternative.

: But it does mean that when entering the KVM run loop, the host perf system
: needs to context switch away the host PMU state and allow KVM to load the guest
: PMU state. And much like the FPU situation, the portion of the host kernel
: that runs between the context switch to the KVM thread and VMENTER to the guest
: cannot use the PMU.

If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
loop, then I'm all for exploring that option. But that idea got shot down over
a year ago[*]. Or at least, that was my reading of things. Maybe it was just a
misunderstanding because we didn't do a good job of defining the behavior.

I am completely ok with either approach, but I am not ok with being nak'd on both.
Because unless there's a magical third option lurking, those two options are the
only ways for KVM to provide a vPMU that meets the requirements for slice-of-hardware
use cases.

[*] https://lore.kernel.org/all/[email protected]

2023-10-02 18:39:05

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event


* Peter Zijlstra <[email protected]> wrote:

> On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
>
> > > I will firmly reject anything that takes the PMU away from the host
> > > entirely though.
> >
> > Why? What is so wrong with supporting use cases where the platform owner *wants*
> > to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
> > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > the pushback.
>
> Because it sucks.
>
> You're forcing people to choose between no host PMU or a slow guest PMU.
> And that's simply not a sane choice for most people -- worse it's not a
> choice based in technical reality.
>
> It's a choice out of lazyness, disabling host PMU is not a requirement
> for pass-through.

Not just a choice of laziness, but it will clearly be forced upon users
by external entities:

"Pass ownership of the PMU to the guest and have no host PMU, or you
won't have sane guest PMU support at all. If you disagree, please open
a support ticket, which we'll ignore."

The host OS shouldn't offer facilities that severely limit its own capabilities,
when there's a better solution. We don't give the FPU to apps exclusively either,
it would be insanely stupid for a platform to do that.

Thanks,

Ingo

2023-10-02 20:04:06

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 2, 2023 at 8:23 AM David Dunn <[email protected]> wrote:
>
> On Mon, Oct 2, 2023 at 6:30 AM Ingo Molnar <[email protected]> wrote:
> >
> >
> > The host OS shouldn't offer facilities that severely limit its own capabilities,
> > when there's a better solution. We don't give the FPU to apps exclusively either,
> > it would be insanely stupid for a platform to do that.
> >
>
> If you think of the guest VM as a usermode application (which it
> effectively is), the analogous situation is that there is no way to
> tell the usermode application which portions of the FPU state might be
> used by the kernel without context switching. Although the kernel can
> and does use FPU state, it doesn't zero out a portion of that state
> whenever the kernel needs to use the FPU.
>
> Today there is no way for a guest to dynamically adjust which PMU
> state is valid or invalid. And this changes based on usage by other
> commands run on the host. As observed by perf subsystem running in
> the guest kernel, this looks like counters that simply zero out and
> stop counting at random.
>
> I think the request here is that there be a way for KVM to be able to
> tell the guest kernel (running the perf subsystem) that it has a
> functional HW PMU. And for that to be true. This doesn't mean taking
> away the use of the PMU any more than exposing the FPU to usermode
> applications means taking away the FPU from the kernel. But it does
> mean that when entering the KVM run loop, the host perf system needs
> to context switch away the host PMU state and allow KVM to load the
> guest PMU state. And much like the FPU situation, the portion of the
> host kernel that runs between the context switch to the KVM thread and
> VMENTER to the guest cannot use the PMU.
>
> This obviously should be a policy set by the host owner. They are
> deliberately giving up the ability to profile that small portion of
> the host (KVM VCPU thread cannot be profiled) in return for providing a
> full set of perf functionality to the guest kernel.
>

+1

I was pretty confused until I read this one. Pass-through vPMU for the
guest VM does not conflict with the host PMU software. All we need is
to accept that host PMU software (the perf subsystem in Linux) can
co-exist with a pass-through vPMU in KVM. They could both work directly
on the hardware PMU, operating the registers, etc.

To achieve this, I think what we are really asking of the perf subsystem
in Linux is two things:
- A full context switch for the hardware PMU. Currently the perf subsystem
is the exclusive owner of this piece of hardware, so this code is missing
(a rough sketch follows below).
- NMI sharing or NMI control transfer. Either KVM registers its own NMI
handler and has control transferred to it, or Linux promotes the existing
NMI handler to serve two entities in the kernel.

Once the above is achieved, I believe KVM and the perf subsystem in Linux
could harmoniously share the hardware PMU, instead of forcing the former
to be a client of the latter.
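
For concreteness, a rough sketch of what the "full context switch" item
above could look like for an Intel architectural PMU. The struct and the
counter counts are assumptions, and fixed-counter/PEBS/LBR/PERF_METRICS
details are mostly ignored; only the MSR constants are the kernel's. This
is illustrative, not existing perf code.

struct pmu_ctx {
        u64 global_ctrl, fixed_ctrl;
        u64 eventsel[8], gp_ctr[8], fixed_ctr[4];
};

static void pmu_ctx_save(struct pmu_ctx *c, int nr_gp, int nr_fixed)
{
        int i;

        rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, c->global_ctrl);
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);           /* quiesce first */
        rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, c->fixed_ctrl);
        for (i = 0; i < nr_gp; i++) {
                rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, c->eventsel[i]);
                rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + i, c->gp_ctr[i]);
        }
        for (i = 0; i < nr_fixed; i++)
                rdmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, c->fixed_ctr[i]);
}

static void pmu_ctx_restore(const struct pmu_ctx *c, int nr_gp, int nr_fixed)
{
        int i;

        for (i = 0; i < nr_gp; i++) {
                wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, c->eventsel[i]);
                wrmsrl(MSR_ARCH_PERFMON_PERFCTR0 + i, c->gp_ctr[i]);
        }
        for (i = 0; i < nr_fixed; i++)
                wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, c->fixed_ctr[i]);
        wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, c->fixed_ctrl);
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, c->global_ctrl); /* re-enable last */
}

Even this minimal version is a couple dozen MSR accesses per direction,
which is the per-switch cost quantified later in the thread.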

To step back a little bit, this is not a question of feasibility, since
KVM and the perf subsystem sharing the hardware PMU is already a reality
because of TDX/SEV-SNP. So all of this is just a draft proposal to make
the sharing clean and efficient.

Thanks.
-Mingwei

> Dave Dunn

2023-10-02 21:32:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, Ingo Molnar wrote:
> >
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > On Fri, Sep 29, 2023 at 03:46:55PM +0000, Sean Christopherson wrote:
> > >
> > > > > I will firmly reject anything that takes the PMU away from the host
> > > > > entirely though.
> > > >
> > > > Why? What is so wrong with supporting use cases where the platform owner *wants*
> > > > to give up host PMU and NMI watchdog functionality? If disabling host PMU usage
> > > > were complex, highly invasive, and/or difficult to maintain, then I can understand
> > > > the pushback.
> > >
> > > Because it sucks.
> >
> > > You're forcing people to choose between no host PMU or a slow guest PMU.
>
> Nowhere did I say that we wouldn't take patches to improve the existing vPMU
> support.

Nowhere did I talk about vPMU -- I explicitly mentioned pass-through.

> > > worse it's not a choice based in technical reality.
>
> The technical reality is that context switching the PMU between host and guest
> requires reading and writing far too many MSRs for KVM to be able to context
> switch at every VM-Enter and every VM-Exit. And PMIs skidding past VM-Exit adds
> another layer of complexity to deal with.

I'm not sure what you're suggesting here. It will have to save/restore
all those MSRs anyway. Suppose it switches between vCPUs.

> > > It's a choice out of lazyness, disabling host PMU is not a requirement
> > > for pass-through.
>
> The requirement isn't passthrough access, the requirements are that the guest's
> PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> guest doesn't have a meaningful impact on guest performance.

Given you don't think that trapping MSR accesses is viable, what else
besides pass-through did you have in mind?

> > Not just a choice of laziness, but it will clearly be forced upon users
> > by external entities:
> >
> > "Pass ownership of the PMU to the guest and have no host PMU, or you
> > won't have sane guest PMU support at all. If you disagree, please open
> > a support ticket, which we'll ignore."
>
> We don't have sane guest PMU support today.

Because KVM is too damn hard to use, rebooting a machine is *sooo* much
easier -- and I'm really not kidding here.

Anyway, you want pass-through, but that doesn't mean host cannot use
PMU when vCPU thread is not running.

> If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
> loop, then I'm all for exploring that option. But that idea got shot down over
> a year ago[*].

I never saw that idea in that thread. You virt people keep talking like
I know how KVM works -- I'm not joking when I say I have no clue about
virt.

Sometimes I get a little clue after y'all keep bashing me over the head,
but it quickly erases itself.

> Or at least, that was my reading of things. Maybe it was just a
> misunderstanding because we didn't do a good job of defining the behavior.

This might be the case. I don't particularly care where the guest
boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
PMU is usable again etc..

Re-reading parts of that linked thread, I see mention of
PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
remember all that nonsense. Worst part is that I can't find the relevant
perf code when I grep for that string :/


Anyway, what I don't like is KVM silently changing all events to
::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
events to hard error when they run into a vCPU with pass-through on
(PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
::exclude_guest=0 events when a vCPU with pass-through exists -- with
minimal scope (this probably means all CPU events, but only relevant
vCPU events).
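
A sketch of the event-creation half of this, assuming a hypothetical
cpu_has_passthrough_vcpu() query (the exclude_guest attribute bit itself
is real):

/* Sketch only: refuse to create a !exclude_guest event while a
 * pass-through vCPU owns the PMU on the target CPU.
 */
static int check_passthrough_conflict(struct perf_event *event)
{
        if (event->attr.exclude_guest)
                return 0;                       /* host-only event, fine */

        if (cpu_has_passthrough_vcpu(event->cpu))       /* hypothetical */
                return -EBUSY;                  /* error out at creation */

        return 0;
}

The pre-existing-event half would similarly move conflicting
::exclude_guest=0 events to PERF_EVENT_STATE_ERROR when they run into a
pass-through vCPU.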

It also means ::exclude_guest should actually work -- it often does not
today -- the IBS thing for example totally ignores it.

Anyway, none of this means host cannot use PMU because virt muck wants
it.

2023-10-02 22:50:03

by David Dunn

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 2, 2023 at 6:30 AM Ingo Molnar <[email protected]> wrote:
>
>
> The host OS shouldn't offer facilities that severely limit its own capabilities,
> when there's a better solution. We don't give the FPU to apps exclusively either,
> it would be insanely stupid for a platform to do that.
>

If you think of the guest VM as a usermode application (which it
effectively is), the analogous situation is that there is no way to
tell the usermode application which portions of the FPU state might be
used by the kernel without context switching. Although the kernel can
and does use FPU state, it doesn't zero out a portion of that state
whenever the kernel needs to use the FPU.

Today there is no way for a guest to dynamically adjust which PMU
state is valid or invalid. And this changes based on usage by other
commands run on the host. As observed by perf subsystem running in
the guest kernel, this looks like counters that simply zero out and
stop counting at random.

I think the request here is that there be a way for KVM to be able to
tell the guest kernel (running the perf subsystem) that it has a
functional HW PMU. And for that to be true. This doesn't mean taking
away the use of the PMU any more than exposing the FPU to usermode
applications means taking away the FPU from the kernel. But it does
mean that when entering the KVM run loop, the host perf system needs
to context switch away the host PMU state and allow KVM to load the
guest PMU state. And much like the FPU situation, the portion of the
host kernel that runs between the context switch to the KVM thread and
VMENTER to the guest cannot use the PMU.

This obviously should be a policy set by the host owner. They are
deliberately giving up the ability to profile that small portion of
the host (KVM VCPU thread cannot be profiled) in return for providing a
full set of perf functionality to the guest kernel.

Dave Dunn

2023-10-02 22:50:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 02, 2023 at 03:50:24PM -0400, Liang, Kan wrote:

> Now, the NMI watchdog is using a "CPU-pinned" event. But I think it can
> be replaced by the buddy system, commit 1f423c905a6b
> ("watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs")

For some simple cases. I've had plenty of experience with situations where
that thing would be completely useless.

That is, at some point the 'all CPUs hard locked up' scenario was
something I ran into a lot (although I can't for the life of me remember
wtf I was doing at the time). All that needs is a single
spin_lock_irqsave() on a common lock (or group of locks, like the
rq->lock). Before you know it, the whole machine is a brick.

That said; if you augment this thing with a bunch of CPUs that have
HPET-NMI and IPI-NMI for backtraces, it might actually be useful.

2023-10-02 22:55:46

by Liang, Kan

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event



On 2023-10-02 11:56 a.m., Sean Christopherson wrote:
> I am completely ok with either approach, but I am not ok with being nak'd on both.
> Because unless there's a magical third option lurking, those two options are the
> only ways for KVM to provide a vPMU that meets the requirements for slice-of-hardware
> use cases.

It may be doable to introduce a knob to enable/disable the "CPU-pinned"
capability of perf_event. If the capability is disabled, KVM doesn't
need to worry about the counters being gone without notification, since
"task pinned" has the highest priority at the moment.

It would also keep the flexibility for the cases where the host wants to
profile the guest.

Now, the NMI watchdog is using a "CPU-pinned" event. But I think it can
be replaced by the buddy system, commit 1f423c905a6b
("watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs")

Thanks,
Kan

2023-10-03 00:56:35

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> > > > worse it's not a choice based in technical reality.
> >
> > The technical reality is that context switching the PMU between host and guest
> > requires reading and writing far too many MSRs for KVM to be able to context
> > switch at every VM-Enter and every VM-Exit. And PMIs skidding past VM-Exit adds
> > another layer of complexity to deal with.
>
> I'm not sure what you're suggesting here. It will have to save/restore
> all those MSRs anyway. Suppose it switches between vCPUs.

The "when" is what's important. If KVM took a literal interpretation of
"exclude guest" for pass-through MSRs, then KVM would context switch all those
MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
<1500 cycles, and "fastpath" exits are something like half that. Switching all
the MSRs is likely 1000+ cycles, if not double that.

FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
vCPU is pinned 1:1 with a host pCPU. I suspect it's a similar story for the other
CSPs that are trying to provide accurate PMUs to guests. If a vCPU is scheduled
out, then yes, a bunch of context switching will need to happen. But for the
types of VMs that are the target audience, their vCPUs will rarely be scheduled
out.

> > > > It's a choice out of lazyness, disabling host PMU is not a requirement
> > > > for pass-through.
> >
> > The requirement isn't passthrough access, the requirements are that the guest's
> > PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> > guest doesn't have a meaningful impact on guest performance.
>
> Given you don't think that trapping MSR accesses is viable, what else
> besides pass-through did you have in mind?

Sorry, I didn't mean to imply that we don't want pass-through of MSRs. What I was
trying to say is that *just* passthrough MSRs doesn't solve the problem, because
again I thought the whole "context switch PMU state less often" approach had been
firmly nak'd.

> > > Not just a choice of laziness, but it will clearly be forced upon users
> > > by external entities:
> > >
> > > "Pass ownership of the PMU to the guest and have no host PMU, or you
> > > won't have sane guest PMU support at all. If you disagree, please open
> > > a support ticket, which we'll ignore."
> >
> > We don't have sane guest PMU support today.
>
> Because KVM is too damn hard to use, rebooting a machine is *sooo* much
> easier -- and I'm really not kidding here.
>
> Anyway, you want pass-through, but that doesn't mean host cannot use
> PMU when vCPU thread is not running.
>
> > If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
> > loop, then I'm all for exploring that option. But that idea got shot down over
> > a year ago[*].
>
> I never saw that idea in that thread. You virt people keep talking like
> I know how KVM works -- I'm not joking when I say I have no clue about
> virt.
>
> Sometimes I get a little clue after y'all keep bashing me over the head,
> but it quickly erases itself.
>
> > Or at least, that was my reading of things. Maybe it was just a
> > misunderstanding because we didn't do a good job of defining the behavior.
>
> This might be the case. I don't particularly care where the guest
> boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> PMU is usable again etc..

Well drat, that there would have saved a wee bit of frustration. Better late
than never though, that's for sure.

Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
out or KVM exits to userspace, would mean that host perf events won't be active
for potentially large swaths of non-KVM code. Any function calls or event/exception
handlers that occur within the context of ioctl(KVM_RUN) would run with host
perf events disabled.

Are you ok with that approach? Assuming we don't completely botch things, the
interfaces are sane, we can come up with a clean solution for handling NMIs, etc.

> Re-reading parts of that linked thread, I see mention of
> PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
> remember all that nonsense. Worst part is that I can't find the relevant
> perf code when I grep for that string :/

The PT stuff is actually an example of what we don't want, at least not exactly.
The concept of a hard switch between guest and host is ok, but as-is, KVM's PT
code does a big pile of MSR reads and writes on every VM-Enter and VM-Exit.

> Anyway, what I don't like is KVM silently changing all events to
> ::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
> events to hard error when they run into a vCPU with pass-through on
> (PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
> ::exclude_guest=0 events when a vCPU with pass-through exists -- with
> minimal scope (this probably means all CPU events, but only relevant
> vCPU events).

Agreed, I am definitely against KVM silently doing anything. And the more that
is surfaced to the user, the better.

> It also means ::exclude_guest should actually work -- it often does not
> today -- the IBS thing for example totally ignores it.

Is that already in-tree, or are you talking about Manali's proposed series to
support virtualizing IBS?

> Anyway, none of this means host cannot use PMU because virt muck wants
> it.

2023-10-03 08:31:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> On Mon, Oct 02, 2023, Peter Zijlstra wrote:

> > I'm not sure what you're suggesting here. It will have to save/restore
> > all those MSRs anyway. Suppose it switches between vCPUs.
>
> The "when" is what's important. If KVM took a literal interpretation of
> "exclude guest" for pass-through MSRs, then KVM would context switch all those
> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> <1500 cycles, and "fastpath" exits are something like half that. Switching all
> the MSRs is likely 1000+ cycles, if not double that.

See, you're the virt-nerd and I'm sure you know what you're talking
about, but I have no clue :-) I didn't know there were different levels
of vm-exit.

> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> vCPU is pinned 1:1 with a host pCPU.

I've been given to understand that vm-exit is a bad word in this
scenario, any exit is a fail. They get MWAIT and all the other crap and
more or less pretend to be real hardware.

So why do you care about those MSRs so much? That should 'never' happen
in this scenario.

> > > Or at least, that was my reading of things. Maybe it was just a
> > > misunderstanding because we didn't do a good job of defining the behavior.
> >
> > This might be the case. I don't particularly care where the guest
> > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > PMU is usable again etc..
>
> Well drat, that there would have saved a wee bit of frustration. Better late
> than never though, that's for sure.
>
> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> out or KVM exits to userspace, would mean that host perf events won't be active
> for potentially large swaths of non-KVM code. Any function calls or event/exception
> handlers that occur within the context of ioctl(KVM_RUN) would run with host
> perf events disabled.

Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
sounds like a ton more.

/me frobs around the kvm code some...

Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
seems to run with IRQs disabled, so at most you can trigger a #PF or
something, which will then trip an exception fixup because you can't run
#PF with IRQs disabled etc..

That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
coupled with the existing kvm_x86_handle_exit_irqoff() seems like
reasonable solution from where I'm sitting. That also more or less
matches the FPU state save/restore AFAICT.

Or are you talking about the whole of vcpu_run() ? That seems like a
massive amount of code, and doesn't look like anything I'd call a
fast-path. Also, much of that loop has preemption enabled...

> Are you ok with that approach? Assuming we don't completely botch things, the
> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.

Since you steal the whole PMU, can't you re-route the PMI to something
that's virt friendly too?

> > It also means ::exclude_guest should actually work -- it often does not
> > today -- the IBS thing for example totally ignores it.
>
> Is that already in-tree, or are you talking about Manali's proposed series to
> support virtualizing IBS?

The IBS code as is, it totally ignores ::exclude_guest. Manali was going
to add some of it. But I'm not at all sure about the state of the other
PMU drivers we have.

Just for giggles, P4 has VMX support... /me runs like crazy

2023-10-03 15:23:38

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> > On Mon, Oct 02, 2023, Peter Zijlstra wrote:
>
> > > I'm not sure what you're suggesting here. It will have to save/restore
> > > all those MSRs anyway. Suppose it switches between vCPUs.
> >
> > The "when" is what's important. If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that. Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
>
> See, you're the virt-nerd and I'm sure you know what you're talking
> about, but I have no clue :-) I didn't know there were different levels
> of vm-exit.

An exit is essentially a fancy exception/event. The hardware transition from
guest=>host is the exception itself (VM-Exit), and the transition back to guest
is analogous to the IRET (VM-Enter).

In between, software will do some amount of work, and the amount of work that is
done can vary quite significantly depending on what caused the exit.

> > FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> > vCPU is pinned 1:1 with a host pCPU.
>
> I've been given to understand that vm-exit is a bad word in this
> scenario, any exit is a fail. They get MWAIT and all the other crap and
> more or less pretend to be real hardware.
>
> So why do you care about those MSRs so much? That should 'never' happen
> in this scenario.

It's not feasible to completely avoid exits, as current/upcoming hardware doesn't
(yet) virtualize a few important things. Off the top of my head, the two most
relevant flows are:

- APIC_LVTPC entry and PMU counters. If a PMU counter overflows, the NMI that
is generated will trigger a hardware level NMI and cause an exit. And sadly,
the guest's NMI handler (assuming the guest is also using NMIs for PMIs) will
trigger another exit when it clears the mask bit in its LVTPC entry.

- Timer related IRQs, both in the guest and host. These are the biggest source
of exits on modern hardware. Neither AMD nor Intel provide a virtual APIC
timer, and so KVM must trap and emulate writes to TSC_DEADLINE (or to APIC_TMICT),
and the subsequent IRQ will also cause an exit.

The cumulative cost of all exits is important, but the latency of each individual
exit is even more critical, especially for PMU related stuff. E.g. if the guest
is trying to use perf/PMU to profile a workload, adding a few thousand cycles to
each exit will introduce too much noise into the results.

> > > > Or at least, that was my reading of things. Maybe it was just a
> > > > misunderstanding because we didn't do a good job of defining the behavior.
> > >
> > > This might be the case. I don't particularly care where the guest
> > > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > > PMU is usable again etc..
> >
> > Well drat, that there would have saved a wee bit of frustration. Better late
> > than never though, that's for sure.
> >
> > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > out or KVM exits to userspace, would mean that host perf events won't be active
> > for potentially large swaths of non-KVM code. Any function calls or event/exception
> > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > perf events disabled.
>
> Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> sounds like a ton more.
>
> /me frobs around the kvm code some...
>
> Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> seems to run with IRQs disabled, so at most you can trigger a #PF or
> something, which will then trip an exception fixup because you can't run
> #PF with IRQs disabled etc..
>
> That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> reasonable solution from where I'm sitting. That also more or less
> matches the FPU state save/restore AFAICT.
>
> Or are you talking about the whole of vcpu_run() ? That seems like a
> massive amount of code, and doesn't look like anything I'd call a
> fast-path. Also, much of that loop has preemption enabled...

The whole of vcpu_run(). And yes, much of it runs with preemption enabled. KVM
uses preempt notifiers to context switch state if the vCPU task is scheduled
out/in, we'd use those hooks to swap PMU state.
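
Roughly, those hooks end up in kvm_arch_vcpu_load()/kvm_arch_vcpu_put();
a sketch of where the swap would sit, with kvm_pmu_load_guest() and
kvm_pmu_put_guest() as assumed helpers rather than existing functions:

void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
        /* ... existing vendor vcpu_load, TSC handling, etc. ... */
        kvm_pmu_load_guest(vcpu);   /* hypothetical: host -> guest PMU MSRs */
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
        kvm_pmu_put_guest(vcpu);    /* hypothetical: guest -> host PMU MSRs */
        /* ... existing vendor vcpu_put, steal time accounting, etc. ... */
}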

Jumping back to the exception analogy, not all exits are equal. For "simple" exits
that KVM can handle internally, the roundtrip is <1500. The exit_fastpath loop is
roughly half that.

But for exits that are more complex, e.g. if the guest hits the equivalent of a
page fault, the cost of handling the page fault can vary significantly. It might
be <1500, but it might also be 10x that if handling the page fault requires faulting
in a new page in the host.

We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
because doing too much work with IRQs disabled can cause latency problems for the
host. This isn't much of a concern for slice-of-hardware setups, but would be
quite problematic for other use cases.

And except for obviously slow paths (from the guest's perspective), extra latency
on any exit can be problematic. E.g. even if we got to the point where KVM handles
99% of exits in the fastpath (may or may not be feasible), a not-fastpath exit at an
inopportune time could throw off the guest's profiling results, introduce unacceptable
jitter, etc.

> > Are you ok with that approach? Assuming we don't completely botch things, the
> > interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
>
> Since you steal the whole PMU, can't you re-route the PMI to something
> that's virt friendly too?

Hmm, actually, we probably could. It would require modifying the host's APIC_LVTPC
entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
vector. As gross as that sounds, it might actually be cleaner overall than
deciphering whether an NMI belongs to the host or guest, and it would almost
certainly yield lower latency for guest PMIs.
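
A minimal sketch of that LVTPC swizzle; KVM_GUEST_PMI_VECTOR is a made-up
vector, while apic_write(), APIC_LVTPC and APIC_DM_NMI are existing kernel
symbols:

/* Sketch only: deliver PMIs as a fixed-vector IRQ while the guest owns
 * the PMU, and as an NMI otherwise.
 */
static void pmu_lvtpc_to_guest(void)
{
        apic_write(APIC_LVTPC, KVM_GUEST_PMI_VECTOR);   /* fixed delivery */
}

static void pmu_lvtpc_to_host(void)
{
        apic_write(APIC_LVTPC, APIC_DM_NMI);            /* back to NMI delivery */
}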

2023-10-03 17:32:48

by Manali Shukla

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On 10/3/2023 1:46 PM, Peter Zijlstra wrote:
> On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
>> On Mon, Oct 02, 2023, Peter Zijlstra wrote:
>
>>> I'm not sure what you're suggesting here. It will have to save/restore
>>> all those MSRs anyway. Suppose it switches between vCPUs.
>>
>> The "when" is what's important. If KVM took a literal interpretation of
>> "exclude guest" for pass-through MSRs, then KVM would context switch all those
>> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
>> reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
>> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
>> VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
>> <1500 cycles, and "fastpath" exits are something like half that. Switching all
>> the MSRs is likely 1000+ cycles, if not double that.
>
> See, you're the virt-nerd and I'm sure you know what you're talking
> about, but I have no clue :-) I didn't know there were different levels
> of vm-exit.
>
>> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
>> vCPU is pinned 1:1 with a host pCPU.
>
> I've been given to understand that vm-exit is a bad word in this
> scenario, any exit is a fail. They get MWAIT and all the other crap and
> more or less pretend to be real hardware.
>
> So why do you care about those MSRs so much? That should 'never' happen
> in this scenario.
>
>>>> Or at least, that was my reading of things. Maybe it was just a
>>>> misunderstanding because we didn't do a good job of defining the behavior.
>>>
>>> This might be the case. I don't particularly care where the guest
>>> boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
>>> PMU is usable again etc..
>>
>> Well drat, that there would have saved a wee bit of frustration. Better late
>> than never though, that's for sure.
>>
>> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
>> out or KVM exits to userspace, would mean that host perf events won't be active
>> for potentially large swaths of non-KVM code. Any function calls or event/exception
>> handlers that occur within the context of ioctl(KVM_RUN) would run with host
>> perf events disabled.
>
> Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> sounds like a ton more.
>
> /me frobs around the kvm code some...
>
> Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> seems to run with IRQs disabled, so at most you can trigger a #PF or
> something, which will then trip an exception fixup because you can't run
> #PF with IRQs disabled etc..
>
> That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> reasonable solution from where I'm sitting. That also more or less
> matches the FPU state save/restore AFAICT.
>
> Or are you talking about the whole of vcpu_run() ? That seems like a
> massive amount of code, and doesn't look like anything I'd call a
> fast-path. Also, much of that loop has preemption enabled...
>
>> Are you ok with that approach? Assuming we don't completely botch things, the
>> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
>
> Since you steal the whole PMU, can't you re-route the PMI to something
> that's virt friendly too?
>
>>> It also means ::exclude_guest should actually work -- it often does not
>>> today -- the IBS thing for example totally ignores it.
>>
>> Is that already in-tree, or are you talking about Manali's proposed series to
>> support virtualizing IBS?
>
> The IBS code as is, it totally ignores ::exclude_guest. Manali was going
> to add some of it. But I'm not at all sure about the state of the other
> PMU drivers we have.
>
> Just for giggles, P4 has VMX support... /me runs like crazy

I am working on Solution 1.1 from the approach proposed in [*].
I will send V2 (for IBS virtualization series) based on it shortly.

* https://lore.kernel.org/all/[email protected]/T/#m7389910e577966c93a0b50fbaf9442be80dc730b

- Manali

2023-10-03 18:22:28

by Jim Mattson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 3, 2023 at 8:23 AM Sean Christopherson <[email protected]> wrote:
>
> On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
> > > On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> >
> > > > I'm not sure what you're suggesting here. It will have to save/restore
> > > > all those MSRs anyway. Suppose it switches between vCPUs.
> > >
> > > The "when" is what's important. If KVM took a literal interpretation of
> > > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > > reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> > > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > > VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> > > <1500 cycles, and "fastpath" exits are something like half that. Switching all
> > > the MSRs is likely 1000+ cycles, if not double that.
> >
> > See, you're the virt-nerd and I'm sure you know what you're talking
> > about, but I have no clue :-) I didn't know there were different levels
> > of vm-exit.
>
> An exit is essentially a fancy exception/event. The hardware transition from
> guest=>host is the exception itself (VM-Exit), and the transition back to guest
> is analogous to the IRET (VM-Enter).
>
> In between, software will do some amount of work, and the amount of work that is
> done can vary quite significantly depending on what caused the exit.
>
> > > FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> > > vCPU is pinned 1:1 with a host pCPU.
> >
> > I've been given to understand that vm-exit is a bad word in this
> > scenario, any exit is a fail. They get MWAIT and all the other crap and
> > more or less pretend to be real hardware.
> >
> > So why do you care about those MSRs so much? That should 'never' happen
> > in this scenario.
>
> It's not feasible to completely avoid exits, as current/upcoming hardware doesn't
> (yet) virtualize a few important things. Off the top of my head, the two most
> relevant flows are:
>
> - APIC_LVTPC entry and PMU counters. If a PMU counter overflows, the NMI that
> is generated will trigger a hardware level NMI and cause an exit. And sadly,
> the guest's NMI handler (assuming the guest is also using NMIs for PMIs) will
> trigger another exit when it clears the mask bit in its LVTPC entry.

In addition, when the guest PMI handler writes to
IA32_PERF_GLOBAL_CTRL to disable all counters (and again later to
re-enable the counters), KVM has to intercept that as well, with
today's implementation. Similarly, on each guest timer tick, when
guest perf is multiplexing PMCs, KVM has to intercept writes to
IA32_PERF_GLOBAL_CTRL.

Furthermore, in some cases, Linux perf seems to double-disable
counters, using both the individual enable bits in each PerfEvtSel, as
well as the bits in PERF_GLOBAL_CTRL. KVM has to intercept writes to
the PerfEvtSels as well. Off-topic, but I'd like to request that Linux
perf *only* use the enable bits in IA32_PERF_GLOBAL_CTRL on
architectures where that is supported. Just leave the enable bits set
in the PerfEvtSels, to avoid unnecessary VM-exits. :)
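
For illustration, what the requested stop path would look like on hardware
with a global control MSR; this is a sketch of the idea, not current perf
code, with 'idx' as the GP counter index:

static void stop_counter_global_only(int idx)
{
        u64 global;

        rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, global);
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, global & ~BIT_ULL(idx));
        /* deliberately leave MSR_ARCH_PERFMON_EVENTSEL0 + idx alone, so a
         * passthrough guest doesn't take an extra intercepted WRMSR
         */
}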

> - Timer related IRQs, both in the guest and host. These are the biggest source
> of exits on modern hardware. Neither AMD nor Intel provide a virtual APIC
> timer, and so KVM must trap and emulate writes to TSC_DEADLINE (or to APIC_TMICT),
> and the subsequent IRQ will also cause an exit.
>
> The cumulative cost of all exits is important, but the latency of each individual
> exit is even more critical, especially for PMU related stuff. E.g. if the guest
> is trying to use perf/PMU to profile a workload, adding a few thousand cycles to
> each exit will introduce too much noise into the results.
>
> > > > > Or at least, that was my reading of things. Maybe it was just a
> > > > > misunderstanding because we didn't do a good job of defining the behavior.
> > > >
> > > > This might be the case. I don't particularly care where the guest
> > > > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > > > PMU is usable again etc..
> > >
> > > Well drat, that there would have saved a wee bit of frustration. Better late
> > > than never though, that's for sure.
> > >
> > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > for potentially large swaths of non-KVM code. Any function calls or event/exception
> > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > perf events disabled.
> >
> > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > sounds like a ton more.
> >
> > /me frobs around the kvm code some...
> >
> > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > something, which will then trip an exception fixup because you can't run
> > #PF with IRQs disabled etc..
> >
> > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > reasonable solution from where I'm sitting. That also more or less
> > matches the FPU state save/restore AFAICT.
> >
> > Or are you talking about the whole of vcpu_run() ? That seems like a
> > massive amount of code, and doesn't look like anything I'd call a
> > fast-path. Also, much of that loop has preemption enabled...
>
> The whole of vcpu_run(). And yes, much of it runs with preemption enabled. KVM
> uses preempt notifiers to context switch state if the vCPU task is scheduled
> out/in, we'd use those hooks to swap PMU state.
>
> Jumping back to the exception analogy, not all exits are equal. For "simple" exits
> that KVM can handle internally, the roundtrip is <1500. The exit_fastpath loop is
> roughly half that.
>
> But for exits that are more complex, e.g. if the guest hits the equivalent of a
> page fault, the cost of handling the page fault can vary significantly. It might
> be <1500, but it might also be 10x that if handling the page fault requires faulting
> in a new page in the host.
>
> We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> because doing too much work with IRQs disabled can cause latency problems for the
> host. This isn't much of a concern for slice-of-hardware setups, but would be
> quite problematic for other use cases.
>
> And except for obviously slow paths (from the guest's perspective), extra latency
> on any exit can be problematic. E.g. even if we got to the point where KVM handles
> 99% of exits in the fastpath (may or may not be feasible), a not-fastpath exit at an
> inopportune time could throw off the guest's profiling results, introduce unacceptable
> jitter, etc.
>
> > > Are you ok with that approach? Assuming we don't completely botch things, the
> > > interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
> >
> > Since you steal the whole PMU, can't you re-route the PMI to something
> > that's virt friendly too?
>
> Hmm, actually, we probably could. It would require modifying the host's APIC_LVTPC
> entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
> vector. As gross as that sounds, it might actually be cleaner overall than
> deciphering whether an NMI belongs to the host or guest, and it would almost
> certainly yield lower latency for guest PMIs.

Ugh. Can't KVM just install its own NMI handler? Either way, it's
possible for late PMIs to arrive in the wrong context.

2023-10-03 22:03:40

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Oct 02, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 08:56:50AM -0700, Sean Christopherson wrote:
> > > > > worse it's not a choice based in technical reality.
> > >
> > > The technical reality is that context switching the PMU between host and guest
> > > requires reading and writing far too many MSRs for KVM to be able to context
> > > switch at every VM-Enter and every VM-Exit. And PMIs skidding past VM-Exit adds
> > > another layer of complexity to deal with.
> >
> > I'm not sure what you're suggesting here. It will have to save/restore
> > all those MSRs anyway. Suppose it switches between vCPUs.
>
> The "when" is what's important. If KVM took a literal interpretation of
> "exclude guest" for pass-through MSRs, then KVM would context switch all those
> MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> <1500 cycles, and "fastpath" exits are something like half that. Switching all
> the MSRs is likely 1000+ cycles, if not double that.

Hi Sean,

Sorry, I have no intention to interrupt the conversation, but this is
slightly confusing to me.

I remember when doing AMX, we added a gigantic 8KB buffer to the FPU
context switch. That works well in Linux today. Why can't we do the
same for the PMU, i.e. context switch all the counters, selectors and
global state there?

On the VM boundary, all we need is the global ctrl, right? We stop all
counters when we exit from the guest and restore the guest value of
global control when entering it. But the actual PMU context switch
should be deferred to roughly the same time we switch the FPU (xsave
state), i.e. when switching task_struct and/or returning to userspace.

Please kindly correct me if this is flawed.
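
For concreteness, a minimal sketch of the boundary handling described
above; vcpu->arch.pmu_global_ctrl is an assumed field, and with the VMCS
guest/host GLOBAL_CTRL fields much of this could be done by hardware
instead:

static void pmu_on_vmexit(struct kvm_vcpu *vcpu)
{
        /* save the guest's enable bits and stop all counters */
        rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, vcpu->arch.pmu_global_ctrl);
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
}

static void pmu_on_vmenter(struct kvm_vcpu *vcpu)
{
        /* the full counter/selector switch is deferred, like FPU state */
        wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, vcpu->arch.pmu_global_ctrl);
}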

Ah, I think I understand what you are saying... So, "If KVM took a
literal interpretation of "exclude guest" for pass-through MSRs..."

perf_event.attr.exclude_guest might need a different meaning if we have
a pass-through PMU for KVM. exclude_guest=1 should not mean the counters
are context switched at the VM-Exit boundary, which would be a disaster
if we did that.

Thanks.
-Mingwei




>
> FWIW, the primary use case we care about is for slice-of-hardware VMs, where each
> vCPU is pinned 1:1 with a host pCPU. I suspect it's a similar story for the other
> CSPs that are trying to provide accurate PMUs to guests. If a vCPU is scheduled
> out, then yes, a bunch of context switching will need to happen. But for the
> types of VMs that are the target audience, their vCPUs will rarely be scheduled
> out.
>
> > > > > It's a choice out of lazyness, disabling host PMU is not a requirement
> > > > > for pass-through.
> > >
> > > The requirement isn't passthrough access, the requirements are that the guest's
> > > PMU has accuracy that is on par with bare metal, and that exposing a PMU to the
> > > guest doesn't have a meaningful impact on guest performance.
> >
> > Given you don't think that trapping MSR accesses is viable, what else
> > besides pass-through did you have in mind?
>
> Sorry, I didn't mean to imply that we don't want pass-through of MSRs. What I was
> trying to say is that *just* passthrough MSRs doesn't solve the problem, because
> again I thought the whole "context switch PMU state less often" approach had been
> firmly nak'd.
>
> > > > Not just a choice of laziness, but it will clearly be forced upon users
> > > > by external entities:
> > > >
> > > > "Pass ownership of the PMU to the guest and have no host PMU, or you
> > > > won't have sane guest PMU support at all. If you disagree, please open
> > > > a support ticket, which we'll ignore."
> > >
> > > We don't have sane guest PMU support today.
> >
> > Because KVM is too damn hard to use, rebooting a machine is *sooo* much
> > easier -- and I'm really not kidding here.
> >
> > Anyway, you want pass-through, but that doesn't mean host cannot use
> > PMU when vCPU thread is not running.
> >
> > > If y'all are willing to let KVM redefine exclude_guest to be KVM's outer run
> > > loop, then I'm all for exploring that option. But that idea got shot down over
> > > a year ago[*].
> >
> > I never saw that idea in that thread. You virt people keep talking like
> > I know how KVM works -- I'm not joking when I say I have no clue about
> > virt.
> >
> > Sometimes I get a little clue after y'all keep bashing me over the head,
> > but it quickly erases itself.
> >
> > > Or at least, that was my reading of things. Maybe it was just a
> > > misunderstanding because we didn't do a good job of defining the behavior.
> >
> > This might be the case. I don't particularly care where the guest
> > boundary lies -- somewhere in the vCPU thread. Once the thread is gone,
> > PMU is usable again etc..
>
> Well drat, that there would have saved a wee bit of frustration. Better late
> than never though, that's for sure.
>
> Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> out or KVM exits to userspace, would mean that host perf events won't be active
> for potentially large swaths of non-KVM code. Any function calls or event/exception
> handlers that occur within the context of ioctl(KVM_RUN) would run with host
> perf events disabled.
>
> Are you ok with that approach? Assuming we don't completely botch things, the
> interfaces are sane, we can come up with a clean solution for handling NMIs, etc.
>
> > Re-reading parts of that linked thread, I see mention of
> > PT_MODE_HOST_GUEST -- see I knew we had something there, but I can never
> > remember all that nonsense. Worst part is that I can't find the relevant
> > perf code when I grep for that string :/
>
> The PT stuff is actually an example of what we don't want, at least not exactly.
> The concept of a hard switch between guest and host is ok, but as-is, KVM's PT
> code does a big pile of MSR reads and writes on every VM-Enter and VM-Exit.
>
> > Anyway, what I don't like is KVM silently changing all events to
> > ::exclude_guest=1. I would like all (pre-existing) ::exclude_guest=0
> > events to hard error when they run into a vCPU with pass-through on
> > (PERF_EVENT_STATE_ERROR). I would like event-creation to error out on
> > ::exclude_guest=0 events when a vCPU with pass-through exists -- with
> > minimal scope (this probably means all CPU events, but only relevant
> > vCPU events).
>
> Agreed, I am definitely against KVM silently doing anything. And the more that
> is surfaced to the user, the better.
>
> > It also means ::exclude_guest should actually work -- it often does not
> > today -- the IBS thing for example totally ignores it.
>
> Is that already in-tree, or are you talking about Manali's proposed series to
> support virtualizing IBS?
>
> > Anyway, none of this means host cannot use PMU because virt muck wants
> > it.

2023-10-04 11:22:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 03, 2023 at 08:23:26AM -0700, Sean Christopherson wrote:
> On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:

> > > Well drat, that there would have saved a wee bit of frustration. Better late
> > > than never though, that's for sure.
> > >
> > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > for potentially large swaths of non-KVM code. Any function calls or event/exception
> > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > perf events disabled.
> >
> > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > sounds like a ton more.
> >
> > /me frobs around the kvm code some...
> >
> > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > something, which will then trip an exception fixup because you can't run
> > #PF with IRQs disabled etc..
> >
> > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > reasonable solution from where I'm sitting. That also more or less
> > matches the FPU state save/restore AFAICT.
> >
> > Or are you talking about the whole of vcpu_run() ? That seems like a
> > massive amount of code, and doesn't look like anything I'd call a
> > fast-path. Also, much of that loop has preemption enabled...
>
> The whole of vcpu_run(). And yes, much of it runs with preemption enabled. KVM
> uses preempt notifiers to context switch state if the vCPU task is scheduled
> out/in, we'd use those hooks to swap PMU state.
>
> Jumping back to the exception analogy, not all exits are equal. For "simple" exits
> that KVM can handle internally, the roundtrip is <1500. The exit_fastpath loop is
> roughly half that.
>
> But for exits that are more complex, e.g. if the guest hits the equivalent of a
> page fault, the cost of handling the page fault can vary significantly. It might
> be <1500, but it might also be 10x that if handling the page fault requires faulting
> in a new page in the host.
>
> We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> because doing too much work with IRQs disabled can cause latency problems for the
> host. This isn't much of a concern for slice-of-hardware setups, but would be
> quite problematic for other use cases.
>
> And except for obviously slow paths (from the guest's perspective), extra latency
> on any exit can be problematic. E.g. even if we got to the point where KVM handles
> 99% of exits in the fastpath (may or may not be feasible), a not-fastpath exit at an
> inopportune time could throw off the guest's profiling results, introduce unacceptable
> jitter, etc.

I'm confused... the PMU must not be running after vm-exit. It must not
be able to profile the host. So what jitter are you talking about?

Even if we persist the MSR contents, the PMU itself must be disabled on
vm-exit and enabled on vm-enter. If not by hardware then by software
poking at the global ctrl msr.
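
To make "poking at the global ctrl msr" concrete, a minimal sketch of the
software variant on Intel is below; MSR_CORE_PERF_GLOBAL_CTRL and the
kvm_pmu::global_ctrl field are real, but the helper names are made up for
illustration:

/*
 * Illustrative sketch only: a single wrmsr per transition stops every
 * counter right after VM-Exit and re-arms the guest's selection right
 * before VM-Enter.  pmu->global_ctrl mirrors the guest-visible
 * IA32_PERF_GLOBAL_CTRL value.
 */
static void vpmu_quiesce_counters_on_vmexit(void)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);	/* nothing may count the host */
}

static void vpmu_rearm_counters_on_vmenter(struct kvm_pmu *pmu)
{
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, pmu->global_ctrl);
}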

I also don't buy the latency argument, we already do full and complete
PMU rewrites with IRQs disabled in the context switch path. And as
mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
save/restore.

I would much prefer we keep the PMU swizzle inside the IRQ disabled
region of vcpu_enter_guest(). That's already a ton better than you have
today.

2023-10-04 11:33:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 03, 2023 at 11:21:46AM -0700, Jim Mattson wrote:
> On Tue, Oct 3, 2023 at 8:23 AM Sean Christopherson <[email protected]> wrote:

> > > Since you steal the whole PMU, can't you re-route the PMI to something
> > > that's virt friendly too?
> >
> > Hmm, actually, we probably could. It would require modifying the host's APIC_LVTPC
> > entry when context switching the PMU, e.g. to replace the NMI with a dedicated IRQ
> > vector. As gross as that sounds, it might actually be cleaner overall than
> > deciphering whether an NMI belongs to the host or guest, and it would almost
> > certainly yield lower latency for guest PMIs.
>
> Ugh. Can't KVM just install its own NMI handler? Either way, it's
> possible for late PMIs to arrive in the wrong context.

I don't think you realize what a horrible trainwreck the NMI handler is.
Every handler has to be able to determine if the NMI is theirs to
handle.

If we go do this whole swizzle thing we must find a sequence of PMU
'instructions' that syncs against the PMI, because otherwise we're going
to lose PMIs and that's going to be a *TON* of pain.

I'll put it on the agenda for the next time I talk with the hardware
folks. But IIRC the AMD thing is *much* worse in this regard than the
Intel one.

2023-10-04 19:52:35

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Oct 4, 2023 at 4:22 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Oct 03, 2023 at 08:23:26AM -0700, Sean Christopherson wrote:
> > On Tue, Oct 03, 2023, Peter Zijlstra wrote:
> > > On Mon, Oct 02, 2023 at 05:56:28PM -0700, Sean Christopherson wrote:
>
> > > > Well drat, that there would have saved a wee bit of frustration. Better late
> > > > than never though, that's for sure.
> > > >
> > > > Just to double confirm: keeping guest PMU state loaded until the vCPU is scheduled
> > > > out or KVM exits to userspace, would mean that host perf events won't be active
> > > > for potentially large swaths of non-KVM code. Any function calls or event/exception
> > > > handlers that occur within the context of ioctl(KVM_RUN) would run with host
> > > > perf events disabled.
> > >
> > > Hurmph, that sounds sub-optimal, earlier you said <1500 cycles, this all
> > > sounds like a ton more.
> > >
> > > /me frobs around the kvm code some...
> > >
> > > Are we talking about exit_fastpath loop in vcpu_enter_guest() ? That
> > > seems to run with IRQs disabled, so at most you can trigger a #PF or
> > > something, which will then trip an exception fixup because you can't run
> > > #PF with IRQs disabled etc..
> > >
> > > That seems fine. That is, a theoretical kvm_x86_handle_enter_irqoff()
> > > coupled with the existing kvm_x86_handle_exit_irqoff() seems like
> > > reasonable solution from where I'm sitting. That also more or less
> > > matches the FPU state save/restore AFAICT.
> > >
> > > Or are you talking about the whole of vcpu_run() ? That seems like a
> > > massive amount of code, and doesn't look like anything I'd call a
> > > fast-path. Also, much of that loop has preemption enabled...
> >
> > The whole of vcpu_run(). And yes, much of it runs with preemption enabled. KVM
> > uses preempt notifiers to context switch state if the vCPU task is scheduled
> > out/in, we'd use those hooks to swap PMU state.
> >
> > Jumping back to the exception analogy, not all exits are equal. For "simple" exits
> > that KVM can handle internally, the roundtrip is <1500. The exit_fastpath loop is
> > roughly half that.
> >
> > But for exits that are more complex, e.g. if the guest hits the equivalent of a
> > page fault, the cost of handling the page fault can vary significantly. It might
> > be <1500, but it might also be 10x that if handling the page fault requires faulting
> > in a new page in the host.
> >
> > We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> > because doing too much work with IRQs disabled can cause latency problems for the
> > host. This isn't much of a concern for slice-of-hardware setups, but would be
> > quite problematic for other use cases.
> >
> > And except for obviously slow paths (from the guest's perspective), extra latency
> > on any exit can be problematic. E.g. even if we got to the point where KVM handles
> > 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> > inopportune time could throw off the guest's profiling results, introduce unacceptable
> > jitter, etc.
>
> I'm confused... the PMU must not be running after vm-exit. It must not
> be able to profile the host. So what jitter are you talking about?
>
> Even if we persist the MSR contents, the PMU itself must be disabled on
> vm-exit and enabled on vm-enter. If not by hardware then by software
> poking at the global ctrl msr.
>
> I also don't buy the latency argument, we already do full and complete
> PMU rewrites with IRQs disabled in the context switch path. And as
> mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
> save/restore.
>
> I would much prefer we keep the PMU swizzle inside the IRQ disabled
> region of vcpu_enter_guest(). That's already a ton better than you have
> today.

Peter, I think the jitter Sean was talking about is a potential issue
in the pass-through implementation. If KVM follows the perf subsystem
requirement, then after VMEXIT, any perf_event with exclude_guest=1
(and higher priority?) should start counting. Because the guest VM
exclusively owns the PMU with all counters at that point, a gigantic
MSR save/restore is needed, which requires a whole bunch of wrmsrs.
That would be a performance disaster, since VMEXITs can happen at a
very high frequency.

In comparison, if we are talking about the existing non-pass-through
implementation, then the PMU context switch immediately becomes
simple: only a global ctrl tweak is needed at the VM boundary (to stop
exclude_host events and start exclude_guest events in one shot), since
the guest VM and the host perf subsystem share the hardware PMU counters.
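
For contrast, a rough sketch of what a full pass-through boundary switch
would involve if it were done on every exit is below; the struct and helper
names are invented for illustration, and only the MSR indices are the
architectural Intel ones:

/*
 * Illustrative sketch only: save/restore the whole core PMU at the VM
 * boundary.  Even with just 8 GP counters, 4 fixed counters and the
 * global/fixed control MSRs, this is dozens of rdmsr/wrmsr pairs per
 * VM-Exit, versus one wrmsr of GLOBAL_CTRL in the shared case above.
 */
#define NR_GP_CTRS	8
#define NR_FIXED_CTRS	4

struct pmu_boundary_state {
	u64 eventsel[NR_GP_CTRS];
	u64 pmc[NR_GP_CTRS];
	u64 fixed_ctr[NR_FIXED_CTRS];
	u64 fixed_ctrl;
	u64 global_ctrl;
};

static void pmu_boundary_save(struct pmu_boundary_state *s)
{
	int i;

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);		/* quiesce first */

	for (i = 0; i < NR_GP_CTRS; i++) {
		rdmsrl(MSR_P6_EVNTSEL0 + i, s->eventsel[i]);
		rdmsrl(MSR_IA32_PMC0 + i, s->pmc[i]);
	}
	for (i = 0; i < NR_FIXED_CTRS; i++)
		rdmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, s->fixed_ctr[i]);
	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, s->fixed_ctrl);
}

static void pmu_boundary_restore(const struct pmu_boundary_state *s)
{
	int i;

	for (i = 0; i < NR_GP_CTRS; i++) {
		wrmsrl(MSR_P6_EVNTSEL0 + i, s->eventsel[i]);
		wrmsrl(MSR_IA32_PMC0 + i, s->pmc[i]);
	}
	for (i = 0; i < NR_FIXED_CTRS; i++)
		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, s->fixed_ctr[i]);
	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, s->fixed_ctrl);

	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);	/* re-enable last */
}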

Peter, that latency argument in pass-through implementation is
something that we hope you could buy. This should be relatively easy
to prove. I can provide some data if you need.

To cope with that, KVM might need to defer that PMU MSR save/restore
to a later point in the pass-through implementation. But that would
conflict with supporting perf_events with exclude_guest=1. So, I guess
that's why Sean mentioned this: "If y'all are willing to
let KVM redefined exclude_guest to be KVM's outer run loop, then I'm
all for exploring that option."

Note that the situation is similar to AMX, i.e., when the guest VMEXITs
to the host, the FPU should be switched to the host FPU as well, but
because the AMX state is too big and thus too slow to switch, KVM defers
that to a very late point.

Hope this explains a little bit and sorry if this might be an
injection of noise.

Thanks.
-Mingwei


>

2023-10-04 20:43:58

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 03, 2023, Mingwei Zhang wrote:
> On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson <[email protected]> wrote:
> > The "when" is what's important. If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all those
> > MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the VM-Exit isn't a
> > reschedule IRQ to schedule in a different task (or vCPU). The overhead to save
> > all the host/guest MSRs and load all of the guest/host MSRs *twice* for every
> > VM-Exit would be a non-starter. E.g. simple VM-Exits are completely handled in
> > <1500 cycles, and "fastpath" exits are something like half that. Switching all
> > the MSRs is likely 1000+ cycles, if not double that.
>
> Hi Sean,
>
> Sorry, I have no intention to interrupt the conversation, but this is
> slightly confusing to me.
>
> I remember when doing AMX, we added gigantic 8KB memory in the FPU
> context switch. That works well in Linux today. Why can't we do the
> same for PMU? Assuming we context switch all counters, selectors and
> global stuff there?

That's what we (Google folks) are proposing. However, there are significant
side effects if KVM context switches PMU outside of vcpu_run(), whereas the FPU
doesn't suffer the same problems.

Keeping the guest FPU resident for the duration of vcpu_run() is, in terms of
functionality, completely transparent to the rest of the kernel. From the kernel's
perspective, the guest FPU is just a variation of a userspace FPU, and the kernel
is already designed to save/restore userspace/guest FPU state when the kernel wants
to use the FPU for whatever reason. And crucially, kernel FPU usage is explicit
and contained, e.g. see kernel_fpu_{begin,end}(), and comes with mechanisms for
KVM to detect when the guest FPU needs to be reloaded (see TIF_NEED_FPU_LOAD).
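
As a compressed reference for that FPU contract (simplified; the wrapper
function names below are illustrative, while the APIs and the
TIF_NEED_FPU_LOAD check are the real ones):

#include <asm/fpu/api.h>	/* kernel_fpu_begin()/end(); switch_fpu_return() is in kernel-internal fpu headers */

/* Host kernel FPU use is explicit and bracketed. */
static void host_simd_work(void)
{
	kernel_fpu_begin();	/* saves the live (user/guest) regs, sets TIF_NEED_FPU_LOAD */
	/* ... SSE/AVX work ... */
	kernel_fpu_end();	/* no restore here; the owner reloads lazily */
}

/* KVM, right before VM-Enter, only reloads if someone clobbered the registers. */
static void reload_guest_fpu_if_needed(void)
{
	if (test_thread_flag(TIF_NEED_FPU_LOAD))
		switch_fpu_return();
}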

The PMU is a completely different story. PMU usage, a.k.a. perf, by design is
"always running". KVM can't transparently stop host usage of the PMU, as disabling
host PMU usage stops perf events from counting/profiling whatever it is they're
supposed to profile.

Today, KVM minimizes the "downtime" of host PMU usage by context switching PMU
state at VM-Enter and VM-Exit, or at least as close as possible, e.g. for LBRs
and Intel PT.

What we are proposing would *significantly* increase the downtime, to the point
where it would almost be unbounded in some paths, e.g. if KVM faults in a page,
gup() could go swap in memory from disk, install PTEs, and so on and so forth.
If the host is trying to profile something related to swap or memory management,
they're out of luck.

2023-10-04 21:55:34

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Oct 04, 2023, Mingwei Zhang wrote:
> On Wed, Oct 4, 2023 at 4:22 AM Peter Zijlstra <[email protected]> wrote:
> > > > Or are you talking about the whole of vcpu_run() ? That seems like a
> > > > massive amount of code, and doesn't look like anything I'd call a
> > > > fast-path. Also, much of that loop has preemption enabled...
> > >
> > > The whole of vcpu_run(). And yes, much of it runs with preemption enabled. KVM
> > > uses preempt notifiers to context switch state if the vCPU task is scheduled
> > > out/in, we'd use those hooks to swap PMU state.
> > >
> > > Jumping back to the exception analogy, not all exits are equal. For "simple" exits
> > > that KVM can handle internally, the roundtrip is <1500. The exit_fastpath loop is
> > > roughly half that.
> > >
> > > But for exits that are more complex, e.g. if the guest hits the equivalent of a
> > > page fault, the cost of handling the page fault can vary significantly. It might
> > > be <1500, but it might also be 10x that if handling the page fault requires faulting
> > > in a new page in the host.
> > >
> > > We don't want to get too aggressive with moving stuff into the exit_fastpath loop,
> > > because doing too much work with IRQs disabled can cause latency problems for the
> > > host. This isn't much of a concern for slice-of-hardware setups, but would be
> > > quite problematic for other use cases.
> > >
> > > And except for obviously slow paths (from the guest's perspective), extra latency
> > > on any exit can be problematic. E.g. even if we got to the point where KVM handles
> > > 99% of exits the fastpath (may or may not be feasible), a not-fastpath exit at an
> > > inopportune time could throw off the guest's profiling results, introduce unacceptable
> > > jitter, etc.
> >
> > I'm confused... the PMU must not be running after vm-exit. It must not
> > be able to profile the host. So what jitter are you talking about?
> >
> > Even if we persist the MSR contents, the PMU itself must be disabled on
> > vm-exit and enabled on vm-enter. If not by hardware then by software
> > poking at the global ctrl msr.
> >
> > I also don't buy the latency argument, we already do full and complete
> > PMU rewrites with IRQs disabled in the context switch path. And as
> > mentioned elsewhere, the whole AMX thing has an 8k copy stuck in the FPU
> > save/restore.
> >
> > I would much prefer we keep the PMU swizzle inside the IRQ disabled
> > region of vcpu_enter_guest(). That's already a ton better than you have
> > today.

...

> Peter, that latency argument in pass-through implementation is
> something that we hope you could buy. This should be relatively easy
> to prove. I can provide some data if you need.

You and Peter are talking about two different latencies. Or rather, about how
the latency impacts two different things.

Peter is talking about the latency impact on the host, specifically how much
work is done with IRQs disabled.

You are talking about the latency impact on the guest, i.e. how much guest
performance is affected if KVM swaps MSRs on every exit.

Peter is contending that swapping PMU MSRs with IRQs disabled isn't a big deal,
because the kernel already does as much during a context switch. I agree, *if*
we're talking about only adding the PMU MSRs.

You (and I) are contending that the latency impact on the guest will be too high
if KVM swaps in the inner VM-Exit loop. This is not analogous to host context
switches, as VM-Exits can occur at a much higher frequency than context switches,
and can be triggered by events that have nothing to do with the guest.

There's some confusion here though because of what I said earlier:

We don't want to get too aggressive with moving stuff into the exit_fastpath
loop, because doing too much work with IRQs disabled can cause latency problems
for the host.

By "stuff" I wasn't talking about PMU MSRs, I was referring to all exit handling
that KVM *could* move into the IRQs disabled section in order to mitigate the
concerns that we have about the latency impacts on the guest. E.g. if most exits
are handled in the IRQs disabled section, then KVM could handle most exits without
swapping PMU state and thus limit the impact on guest performance, and not cause
too much of the host perf "downtime" that I mentioned in the other thread[*].

However, my concern is that handling most exits with IRQs disabled would result
in KVM doing too much work with IRQs disabled, i.e. would impact the host latency
that Peter is talking about. And I'm more than a bit terrified of calling into
code that doesn't expect to be called with IRQs disabled.

Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
register swapping?

A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
   are enabled, *if and only if* the current CPU has one or more perf events that
   want to use the hardware PMU
C. Put guest PMU state at vcpu_put()
D. Add a perf callback that is invoked from IRQ context when perf wants to
   configure a new PMU-based event, *before* actually programming the MSRs,
and have KVM's callback put the guest PMU state

If there are host perf events that want to use the PMU, then KVM will swap fairly
aggressively and the "downtime" of the host perf events will be limited to the
small window around VM-Enter/VM-Exit.

If there are no such host events, KVM will swap on the first entry to the guest,
and keep the guest PMU loaded until the vCPU is put.

The perf callback in (D) would allow perf to program system-wide events on all
CPUs without clobbering guest PMU state.

I think that would make everyone happy. As long as our hosts don't create perf
events, then we get the "swap as little as possible" behavior without significantly
impacting the host's ability to utilize perf. If our host screws up and creates
perf events on CPUs that are running vCPUs, then the degraded vCPU performance is
on us.

Rough sketch below, minus the perf callback or any of the actual swapping logic.

[*] https://lore.kernel.org/all/[email protected]

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7d9ba301c090..86699d310224 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -41,6 +41,30 @@ struct kvm_pmu_ops {

void kvm_pmu_ops_update(const struct kvm_pmu_ops *pmu_ops);

+static inline void kvm_load_guest_pmu(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ lockdep_assert_irqs_disabled();
+
+ if (pmu->guest_state_loaded)
+ return;
+
+ <swap state>
+ pmu->guest_state_loaded = true;
+}
+
+static inline void kvm_put_guest_pmu(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (!vcpu_to_pmu(vcpu)->guest_state_loaded)
+ return;
+
+ <swap state>
+ vcpu_to_pmu(vcpu)->guest_state_loaded = false;
+}
+
static inline bool kvm_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
{
/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1e645f5b1e2c..93a8f268c37b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4903,8 +4903,20 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ unsigned long flags;
int idx;

+ /*
+ * This can get false positives, but not false negatives, i.e. KVM will
+ * never fail to put the PMU, but may unnecessarily disable IRQs to
+ * safely check if the PMU is still loaded.
+ */
+ if (kvm_is_guest_pmu_loaded(vcpu)) {
+ local_irq_save(flags);
+ kvm_put_guest_pmu(vcpu);
+ local_irq_restore(flags);
+ }
+
if (vcpu->preempted) {
if (!vcpu->arch.guest_state_protected)
vcpu->arch.preempted_in_kernel = !static_call(kvm_x86_get_cpl)(vcpu);
@@ -10759,6 +10771,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
set_debugreg(0, 7);
}

+ kvm_load_guest_pmu(vcpu);
+
guest_timing_enter_irqoff();

for (;;) {
@@ -10810,6 +10824,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
if (hw_breakpoint_active())
hw_breakpoint_restore();

+ if (perf_has_hw_events())
+ kvm_put_guest_pmu(vcpu);
+
vcpu->arch.last_vmentry_cpu = vcpu->cpu;
vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());


2023-10-04 22:06:13

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Oct 04, 2023, Sean Christopherson wrote:
> Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
> register swapping?
>
> A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
> B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
> are enabled, *if and only if* the current CPU has one or perf events that
> wants to use the hardware PMU
> C. Put guest PMU state at vcpu_put()
> D. Add a perf callback that is invoked from IRQ context when perf wants to
> configure a new PMU-based events, *before* actually programming the MSRs,
> and have KVM's callback put the guest PMU state
>
> If there are host perf events that want to use the PMU, then KVM will swap fairly
> aggressively and the "downtime" of the host perf events will be limited to the
> small window around VM-Enter/VM-Exit.
>
> If there are no such host events, KVM will swap on the first entry to the guest,
> and keep the guest PMU loaded until the vCPU is put.
>
> The perf callback in (D) would allow perf to program system-wide events on all
> CPUs without clobbering guest PMU state.
>
> I think that would make everyone happy. As long as our hosts don't create perf
> events, then we get the "swap as little as possible" behavior without significantly
> impacting the host's ability to utilize perf. If our host screws up and creates
> perf events on CPUs that are running vCPUs, then the degraded vCPU performance is
> on us.
>
> Rough sketch below, minus the perf callback or any of actual swapping logic.

Another reason to go for an approach that doesn't completely kill off host PMU
usage: just because we don't plan on enabling perf events in *production*, there
will undoubtedly be times where we want to enable perf events to debug issues
(outside of prod) in the host kernel/KVM that affect VMs with a passthrough PMU.

So I'll add a self-NAK to the idea of completely disabling the host PMU, I think
that would burn us quite badly at some point.

2023-10-08 10:05:10

by Like Xu

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

Hi all,

On 5/10/2023 6:05 am, Sean Christopherson wrote:
> So I'll add a self-NAK to the idea of completely disabling the host PMU, I think
> that would burn us quite badly at some point.

I seem to have missed a party, so allow me to add a few more comments
to better facilitate future discussions in this direction:

(1) PMU counters on TEE

SGX/SEV is already part of the upstream kernel, but we don't have a clear
idea, nor have we established the right expectations, about what kind of
performance data sampling enclaves or SEV guests with hardware pmu counters
on the host would produce (will the perf report show holes of missing data,
or pure encrypted data?). But on AMD, profiling a SEV-SNP guest is supported:

"Fingerprinting attack protection is also not supported in the current
generation of these technologies. Fingerprinting attacks attempt to
determine what code the VM is running by monitoring its access patterns,
performance counter information, etc." (AMD SEV-SNP White Paper, 2020)

(2) PMU Guest/Host Co-existence Development

The introduction of pt_mode in KVM was misleading, leading subsequent
developers to believe that static slicing of pmu facility usage was allowed.

In user scenarios, the host/perf should treat pmu resource requests from
vCPUs with regularity (which can be unequal under the host's authority IMO)
while allowing the host to profile any software entity (including the
hypervisor and guest code, including TEE code in debug mode). Functionality
takes precedence over performance.

The semantics of exclude_guest/host should be tied to the hw-event isolation
settings on the hardware interfaces, not to the human-defined sw-context.
The perf subsystem is the arbiter of pmu resource allocation on the host,
and any attempt to change the status quo (or maintenance scope) will not
succeed. Therefore, vPMU developers are required to be familiar with the
implementation details of both perf and kvm, and try not to add perf APIs
dedicated to serving KVM blindly.

Getting host and guests to share limited PMU resources harmoniously is not
particularly difficult compared to real rocket science in the kernel, so
please don't be intimidated.

(3) Performance Concern in Co-existence

I wonder if it would be possible to add a knob to turn off the perf counter
multiplexing mechanism on the host, so that in coexistence scenarios, the
number of VM exits on the vCPU would not be increased by counter rotations
due to timer expiration.

For normal counters shared between guest and host, the number of counter
msr switches requiring a vm-entry level will be relatively small.
(The number of counters is growing; for LBR, it is possible to share LBR
select values to avoid frequent switching, but of course this requires the
implementation of a software filtering mechanism when the host/guest read
the LBR records, and some additional PMI; for DS-based PEBS, host and guest
PEBS buffers are automatically segregated based on linear address).

There is a lot of room for optimisation here; in real scenarios, triggering
a large number of register switches in the host/guest PMU is to be expected
and easily observed (accompanied by a large number of pmi deliveries).

If we are really worried about the virtualisation overhead of vPMU, then
virtio-pmu might be an option. In this technology direction, the back-end
pmu can add more performance events of interest to the VM (including host
un-core and off-core events, host-side software events, etc.) In terms of
implementation, the semantics of the MSRLIST instruction can be re-used,
along with compatibility with the different PMU hardware interfaces on ARM
and Risc-v, which is also very friendly to production environments based on
its virtio nature.

(4) New vPMU Feature Development

We should not put KVM's current vPMU support into maintenance-only mode.
Users want more PMU features in the guest, like AMD vIBS, higher Intel pmu
versions, Intel topdown and Arch lbr, with more on the way. The maturity of
different features' patch sets isn't the same, but we can't ignore these
real needs because of available time for key maintainers, apathy towards
contributors, mindset avoidance and laziness, and preference for certain
technology stacks. These technical challenges will attract an influx of
open source heroes to push the technology forward, which is good in the
long run.

(5) More to think about

Similar to the guest PMU feature, the debugging feature may face the same
state. For example, what happens when you debug code inside the host and
guest at the same time (host debugs hypervisor/guest code and guest debugs
guest code only)?

Forgive my ignorance and offence, but we don't want to see a KVM subsystem
controlled and driven by Google's demands.

Please feel free to share comments to move forward.

Thanks,
Like Xu

2023-10-09 17:04:16

by Manali Shukla

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On 10/8/2023 3:34 PM, Like Xu wrote:
> Hi all,
>
> On 5/10/2023 6:05 am, Sean Christopherson wrote:
>> So I'll add a self-NAK to the idea of completely disabling the host PMU, I think
>> that would burn us quite badly at some point.
>
> I seem to have missed a party, so allow me to add a few more comments
> to better facilitate future discussions in this direction:
>
> (1) PMU counters on TEE
>
> The SGX/SEV is already part of the upstream, but what kind of performance
> data will be obtained by sampling enclaves or sev-guest with hardware pmu
> counters on host (will the perf-report show these data missing holes or
> pure encrypted data), we don't have a clear idea nor have we established
> the right expectations. But on AMD profiling a SEV-SNP guest is supported:
>
> "Fingerprinting attack protection is also not supported in the current
> generation of these technologies. Fingerprinting attacks attempt to
> determine what code the VM is running by monitoring its access patterns,
> performance counter information, etc." (AMD SEV-SNP White Paper, 2020)
>
> (2) PMU Guest/Host Co-existence Development
>
> The introduction of pt_mode in the KVM was misleading, leading subsequent
> developers to believe that static slicing of pmu facility usage was allowed.
>
> On user scenarios, the host/perf should treat pmu resource requests from
> vCPUs with regularity (which can be unequal under the host's authority IMO)
> while allowing the host to be able to profile any software entity (including
> hypervisor and guest-code, including TEE code in debug mode). Functionality
> takes precedence over performance.
>
> The semantics of exclude_guest/host should be tied to the hw-event isolation
> settings on the hardware interfaces, not to the human-defined sw-context.
> The perf subsystem is the arbiter of pmu resource allocation on the host,
> and any attempt to change the status quo (or maintenance scope) will not
> succeed. Therefore, vPMU developers are required to be familiar with the
> implementation details of both perf and kvm, and try not to add perf APIs
> dedicated to serving KVM blindly.
>
> Getting host and guests to share limited PMU resources harmoniously is not
> particularly difficult compared to real rocket science in the kernel, so
> please don't be intimidated.
>
> (3) Performance Concern in Co-existence
>
> I wonder if it would be possible to add a knob to turn off the perf counter
> multiplexing mechanism on the host, so that in coexistence scenarios, the
> number of VM exits on the vCPU would not be increased by counter rotations
> due to timer expiration.
>
> For normal counters shared between guest and host, the number of counter
> msr switches requiring a vm-entry level will be relatively small.
> (The number of counters is growing; for LBR, it is possible to share LBR
> select values to avoid frequent switching, but of course this requires the
> implementation of a software filtering mechanism when the host/guest read
> the LBR records, and some additional PMI; for DS-based PEBS, host and guest
> PEBS buffers are automatically segregated based on linear address).
>
> There is a lot of room for optimisation here, and in real scenarios where
> triggering a large number of register switches in the host/guest PMU is
> to be expected and observed easily (accompanied by a large number of pmi
> appearances).
>
> If we are really worried about the virtualisation overhead of vPMU, then
> virtio-pmu might be an option. In this technology direction, the back-end
> pmu can add more performance events of interest to the VM (including host
> un-core and off-core events, host-side software events, etc.) In terms of
> implementation, the semantics of the MSRLIST instruction can be re-used,
> along with compatibility with the different PMU hardware interfaces on ARM
> and Risc-v, which is also very friendly to production environments based on
> its virtio nature.
>
> (4) New vPMU Feature Development
>
> We should not put KVM's current vPMU support into maintenance-only mode.
> Users want more PMU features in the guest, like AMD vIBS, Intel pmu higher
> versions, Intel topdown and Arch lbr, more on the way. The maturity of
> different features' patch sets aren't the same, but we can't ignore these
> real needs because of available time for key maintainers, apathy towards
> contributors, mindset avoidance and laziness, and preference for certain
> technology stacks. These technical challenges will attract an influx of
> open source heroes to push the technology forward, which is good in the
> long run.
>
> (5) More to think about
>
> Similar to the guest PMU feature, the debugging feature may face the same
> state. For example, what happens when you debug code inside the host and
> guest at the same time (host debugs hypevisor/guest code and guest debugs
> guest code only) ?
>
> Forgive my ignorance and offence, but we don't want to see a KVM subsystem
> controlled and driven by Google's demands.
>
> Please feel free to share comments to move forward.
>
> Thanks,
> Like Xu

Hi all,

I would like to add following things to the discussion just for the awareness of
everyone.

Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
working on prototyping it.

As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
type C: guest PMC states are loaded at VMRUN automatically but host PMC states
are not saved by hardware. If hypervisor is using the performance counters, it
is hypervisor's responsibility to save PERF_CTL registers to host save area
prior to VMRUN and restore them after VMEXIT. In order to tackle PMC overflow
interrupts in guest itself, NMI virtualization or AVIC can be used, so that
interrupt on PMC overflow in guest will not leak to host.

- Manali
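
As a rough illustration of the swap type C contract Manali describes
(hardware loads the guest PERF_CTL/PERF_CTR values at VMRUN but never saves
the host values), the hypervisor-side bookkeeping might look like the sketch
below; the save-area struct and helper names are invented, and only the MSR
indices are the existing AMD core PMU ones:

/*
 * Illustrative sketch only: if the hypervisor itself uses the counters,
 * it must stash its PERF_CTL/PERF_CTR values before VMRUN (hardware will
 * load the guest values) and restore them after VMEXIT.
 */
#define NR_AMD_CORE_CTRS	6	/* AMD64_NUM_COUNTERS_CORE */

struct host_pmc_state {
	u64 ctl[NR_AMD_CORE_CTRS];
	u64 ctr[NR_AMD_CORE_CTRS];
};

static void host_pmc_save_before_vmrun(struct host_pmc_state *s)
{
	int i;

	for (i = 0; i < NR_AMD_CORE_CTRS; i++) {
		rdmsrl(MSR_F15H_PERF_CTL0 + 2 * i, s->ctl[i]);
		rdmsrl(MSR_F15H_PERF_CTR0 + 2 * i, s->ctr[i]);
		wrmsrl(MSR_F15H_PERF_CTL0 + 2 * i, 0);	/* stop host counting */
	}
}

static void host_pmc_restore_after_vmexit(const struct host_pmc_state *s)
{
	int i;

	for (i = 0; i < NR_AMD_CORE_CTRS; i++) {
		wrmsrl(MSR_F15H_PERF_CTR0 + 2 * i, s->ctr[i]);
		wrmsrl(MSR_F15H_PERF_CTL0 + 2 * i, s->ctl[i]);
	}
}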

2023-10-11 14:16:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> Hi all,
>
> I would like to add following things to the discussion just for the awareness of
> everyone.
>
> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> working on prototyping it.
>
> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> are not saved by hardware.

Per the previous discussion, doing this while host has active counters
that do not have ::exclude_guest=1 is invalid and must result in an
error.

Also, I'm assuming it is all optional, a host can still profile a guest
if all is configured just so?

> If hypervisor is using the performance counters, it
> is hypervisor's responsibility to save PERF_CTL registers to host save area
> prior to VMRUN and restore them after VMEXIT.

Does VMEXIT clear global_ctrl at least?

> In order to tackle PMC overflow
> interrupts in guest itself, NMI virtualization or AVIC can be used, so that
> interrupt on PMC overflow in guest will not leak to host.

Can you please clarify -- AMD has this history with very dodgy PMI
boundaries. See the whole amd_pmu_adjust_nmi_window() crud. Even the
PMUv2 update didn't fix that nonsense.

How is any virt stuff supposed to fix this? If the hardware is late
delivering PMI, what guarantees a guest PMI does not land in host
context and vice-versa?

How does NMI virtualization (what even is that) or AVIC (I'm assuming
that's a virtual interrupt controller) help?

Please make very sure, with your hardware team, that PMI must not be
delivered after clearing global_ctrl (preferably) or at the very least,
there exists a sequence of operations that provides a hard barrier
to order PMI.

2023-10-11 14:21:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Oct 04, 2023 at 02:50:46PM -0700, Sean Christopherson wrote:

> Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
> register swapping?
>
> A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
> B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
> are enabled, *if and only if* the current CPU has one or perf events that
> wants to use the hardware PMU
> C. Put guest PMU state at vcpu_put()
> D. Add a perf callback that is invoked from IRQ context when perf wants to
> configure a new PMU-based events, *before* actually programming the MSRs,
> and have KVM's callback put the guest PMU state
>

No real objection, but I would suggest arriving at that solution by
first building the simple-stupid thing and then making it more
complicated in additional patches. Hmm?

2023-10-13 17:03:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Wed, Oct 11, 2023, Peter Zijlstra wrote:
> On Wed, Oct 04, 2023 at 02:50:46PM -0700, Sean Christopherson wrote:
>
> > Thinking about this more, what if we do a blend of KVM's FPU swapping and debug
> > register swapping?
> >
> > A. Load guest PMU state in vcpu_enter_guest() after IRQs are disabled
> > B. Put guest PMU state (and load host state) in vcpu_enter_guest() before IRQs
> > are enabled, *if and only if* the current CPU has one or perf events that
> > wants to use the hardware PMU
> > C. Put guest PMU state at vcpu_put()
> > D. Add a perf callback that is invoked from IRQ context when perf wants to
> > configure a new PMU-based events, *before* actually programming the MSRs,
> > and have KVM's callback put the guest PMU state
> >
>
> No real objection, but I would suggest arriving at that solution by
> first building the simple-stupid thing and then making it more
> complicated in additinoal patches. Hmm?

Yeah, for sure.

2023-10-17 10:24:47

by Manali Shukla

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
>> Hi all,
>>
>> I would like to add following things to the discussion just for the awareness of
>> everyone.
>>
>> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
>> working on prototyping it.
>>
>> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
>> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
>> are not saved by hardware.
>
> Per the previous discussion, doing this while host has active counters
> that do not have ::exclude_guest=1 is invalid and must result in an
> error.
>

Yeah, exclude_guest should be enforced on host, while host has active PMC
counters and VPMC is enabled.

> Also, I'm assuming it is all optional, a host can still profile a guest
> if all is configured just so?
>

Correct, host should be able to profile guest, if VPMC is not enabled.

>> If hypervisor is using the performance counters, it
>> is hypervisor's responsibility to save PERF_CTL registers to host save area
>> prior to VMRUN and restore them after VMEXIT.
>
> Does VMEXIT clear global_ctrl at least?
>

global_ctrl will be initialized to its reset value (0x3F) during VMEXIT. Similarly,
all the perf_ctl and perf_ctr MSRs are initialized to their reset values (0) at VMEXIT.

>> In order to tackle PMC overflow
>> interrupts in guest itself, NMI virtualization or AVIC can be used, so that
>> interrupt on PMC overflow in guest will not leak to host.
>
> Can you please clarify -- AMD has this history with very dodgy PMI
> boundaries. See the whole amd_pmu_adjust_nmi_window() crud. Even the
> PMUv2 update didn't fix that nonsense.
>
> How is any virt stuff supposed to fix this? If the hardware is late
> delivering PMI, what guarantees a guest PMI does not land in host
> context and vice-versa?
>
> How does NMI virtualization (what even is that) or AVIC (I'm assuming
> that's a virtual interrupt controller) help?
>

When NMI virtualization is enabled and the source of the VNMI is in the guest,
microcode will make sure that the VNMI is delivered directly to the guest (even
if the NMI was delivered late), so there is no issue of leaking a guest NMI to
the host.

> Please make very sure, with your hardware team, that PMI must not be
> delivered after clearing global_ctrl (preferably) or at the very least,
> there exists a sequence of operations that provides a hard barrier
> to order PMI.
>
We are verifying all the corner cases while prototyping PMC virtualization.
As of now, we don't see the guest-NMIs-leaking-to-host issue, but the latency
issue still stays.

2023-10-17 16:59:15

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 17, 2023 at 3:24 AM Manali Shukla <[email protected]> wrote:
>
> On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> > On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> >> Hi all,
> >>
> >> I would like to add following things to the discussion just for the awareness of
> >> everyone.
> >>
> >> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> >> working on prototyping it.
> >>
> >> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> >> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> >> are not saved by hardware.
> >
> > Per the previous discussion, doing this while host has active counters
> > that do not have ::exclude_guest=1 is invalid and must result in an
> > error.
> >
>
> Yeah, exclude_guest should be enforced on host, while host has active PMC
> counters and VPMC is enabled.
>
> > Also, I'm assuming it is all optional, a host can still profile a guest
> > if all is configured just so?
> >
>
> Correct, host should be able to profile guest, if VPMC is not enabled.
>
> >> If hypervisor is using the performance counters, it
> >> is hypervisor's responsibility to save PERF_CTL registers to host save area
> >> prior to VMRUN and restore them after VMEXIT.
> >
> > Does VMEXIT clear global_ctrl at least?
> >
>
> global_ctrl will be initialized to reset value(0x3F) during VMEXIT. Similarly,
> all the perf_ctl and perf_ctr are initialized to reset values(0) at VMEXIT.

Are these guest values (automatically) saved in the guest area of VMCB
on VMEXIT?

>
> >> In order to tackle PMC overflow
> >> interrupts in guest itself, NMI virtualization or AVIC can be used, so that
> >> interrupt on PMC overflow in guest will not leak to host.
> >
> > Can you please clarify -- AMD has this history with very dodgy PMI
> > boundaries. See the whole amd_pmu_adjust_nmi_window() crud. Even the
> > PMUv2 update didn't fix that nonsense.
> >
> > How is any virt stuff supposed to fix this? If the hardware is late
> > delivering PMI, what guarantees a guest PMI does not land in host
> > context and vice-versa?
> >
> > How does NMI virtualization (what even is that) or AVIC (I'm assuming
> > that's a virtual interrupt controller) help?
> >
>
> When NMI virtualization is enabled and source of VNMI is in guest, micro code will
> make sure that VNMI will directly be delivered to the guest (even if NMI was late
> delivered), so there is no issue of leaking guest NMI to the host.
>
> > Please make very sure, with your hardware team, that PMI must not be
> > delivered after clearing global_ctrl (preferably) or at the very least,
> > there exists a sequence of operations that provides a hard barrier
> > to order PMI.
> >
> We are verifying all the corner cases, while prototyping PMC virtualization.
> As of now, we don't see guest NMIs leaking to host issue. But latency issues
> still stays.

How long is the latency? From the existing code (amd_pmu_disable_all()
=> amd_pmu_check_overflow()), it seems to take up to 50 microseconds
in amd_pmu_check_overflow(). But I wonder how much in reality.

Thanks.
-Mingwei
>

2023-10-18 00:02:26

by Mingwei Zhang

[permalink] [raw]
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event

On Tue, Oct 17, 2023 at 9:58 AM Mingwei Zhang <[email protected]> wrote:
>
> On Tue, Oct 17, 2023 at 3:24 AM Manali Shukla <[email protected]> wrote:
> >
> > On 10/11/2023 7:45 PM, Peter Zijlstra wrote:
> > > On Mon, Oct 09, 2023 at 10:33:41PM +0530, Manali Shukla wrote:
> > >> Hi all,
> > >>
> > >> I would like to add following things to the discussion just for the awareness of
> > >> everyone.
> > >>
> > >> Fully virtualized PMC support is coming to an upcoming AMD SoC and we are
> > >> working on prototyping it.
> > >>
> > >> As part of virtualized PMC design, the PERF_CTL registers are defined as Swap
> > >> type C: guest PMC states are loaded at VMRUN automatically but host PMC states
> > >> are not saved by hardware.
> > >
> > > Per the previous discussion, doing this while host has active counters
> > > that do not have ::exclude_guest=1 is invalid and must result in an
> > > error.
> > >
> >
> > Yeah, exclude_guest should be enforced on host, while host has active PMC
> > counters and VPMC is enabled.
> >
> > > Also, I'm assuming it is all optional, a host can still profile a guest
> > > if all is configured just so?
> > >
> >
> > Correct, host should be able to profile guest, if VPMC is not enabled.
> >
> > >> If hypervisor is using the performance counters, it
> > >> is hypervisor's responsibility to save PERF_CTL registers to host save area
> > >> prior to VMRUN and restore them after VMEXIT.
> > >
> > > Does VMEXIT clear global_ctrl at least?
> > >
> >
> > global_ctrl will be initialized to reset value(0x3F) during VMEXIT. Similarly,
> > all the perf_ctl and perf_ctr are initialized to reset values(0) at VMEXIT.
>
> Are these guest values (automatically) saved in the guest area of VMCB
> on VMEXIT?
>

Never mind on this one. So, if both values are swap Type C, then they
should be saved to the guest area of the VMCB on VMEXIT (according to APM
vol. 2, Table B-3). This means KVM does not need to intercept these
MSRs in a pass-through implementation.

Thanks.
-Mingwei
