2020-03-13 02:19:25

by Like Xu

Subject: [PATCH v9 00/10] Guest Last Branch Recording Enabling

Hi all,

Please help review the parts of interest to you in this stable version,
e.g. the first four patches involve the perf event subsystem
and the fifth patch concerns the KVM userspace interface.

---

Last branch recording (LBR) is a performance monitoring unit (PMU)
feature on Intel processors that records a running trace of the most
recent branches taken by the processor in the LBR stack. This patch
series enables this feature for KVM guests.

Userspace can configure whether LBR is enabled for each guest via the
vm_ioctl KVM_CAP_X86_GUEST_LBR. As a first step, a guest can only
enable the LBR feature if its CPU model is the same as the host's,
since LBR is still a model-specific feature. The record format defined
in bits [5:0] of IA32_PERF_CAPABILITIES and the CPUID PDCM bit are
also exposed to the guest when LBR is enabled.

When LBR is enabled for the guest, the guest LBR driver accesses the
LBR MSRs (including IA32_DEBUGCTLMSR and the stack MSRs) just as the
host does. The guest's first access to an LBR-related MSR is always
intercepted. The KVM trap handler creates a special LBR event (called
the guest LBR event) which enables callstack mode and binds no
hardware counter. Host perf enables and schedules this event as usual.

The guest's first access to an LBR-related MSR is trapped to KVM, which
creates a guest LBR perf event. It is a regular LBR perf event that gets
the LBR facility assigned by the perf subsystem. Once that succeeds,
the LBR stack MSRs are passed through to the guest for efficient access.
However, if another host LBR event comes in and takes over the LBR
facility, the LBR MSRs are made interceptible again, and subsequent
guest accesses to them are trapped and return meaningless data.

Because saving/restoring tens of LBR MSRs (e.g. 32 LBR stack entries)
on every VMX transition would add excessive overhead to the already
frequent VMX transitions, the guest LBR event instead saves/restores
the LBR stack MSRs (including the LBR_SELECT MSR) during context
switches, with the help of the native LBR event callstack mechanism.

If the guest no longer accesses the LBR-related MSRs within a scheduling
time slice and the LBR enable bit is unset, the vPMU releases its guest
LBR event like a normal event of an unused vPMC, and the pass-through
state of the LBR stack MSRs is canceled.

More details can be found in each commit message.

---

LBR testcase:
echo 1 > /proc/sys/kernel/watchdog
echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 5000 > /proc/sys/kernel/perf_event_max_sample_rate
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
./perf record -b ./br_instr a

- Perf report on the host:
Samples: 72K of event 'cycles', Event count (approx.): 72512
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
12.12% br_instr br_instr [.] cmp_end [.] lfsr_cond 1
11.05% br_instr br_instr [.] lfsr_cond [.] cmp_end 5
8.81% br_instr br_instr [.] lfsr_cond [.] cmp_end 4
5.04% br_instr br_instr [.] cmp_end [.] lfsr_cond 20
4.92% br_instr br_instr [.] lfsr_cond [.] cmp_end 6
4.88% br_instr br_instr [.] cmp_end [.] lfsr_cond 6
4.58% br_instr br_instr [.] cmp_end [.] lfsr_cond 5

- Perf report on the guest:
Samples: 92K of event 'cycles', Event count (approx.): 92544
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
12.03% br_instr br_instr [.] cmp_end [.] lfsr_cond 1
11.09% br_instr br_instr [.] lfsr_cond [.] cmp_end 5
8.57% br_instr br_instr [.] lfsr_cond [.] cmp_end 4
5.08% br_instr br_instr [.] lfsr_cond [.] cmp_end 6
5.06% br_instr br_instr [.] cmp_end [.] lfsr_cond 20
4.87% br_instr br_instr [.] cmp_end [.] lfsr_cond 6
4.70% br_instr br_instr [.] cmp_end [.] lfsr_cond 5

Conclusion: the profiling results on the guest are similar to those on the host.

---

v8->v9 Changelog:
- using guest_lbr_constraint to create guest LBR event without hw counter;
(please check perf changes in patch 0003)
- rename 'cpuc->vcpu_lbr' to 'cpuc->guest_lbr_enabled';
(please check host LBR changes in patch 0004)
- replace 'pmu->lbr_used' mechanism with lazy release kvm_pmu_lbr_cleanup();
- refactor IA32_PERF_CAPABILITIES trap via get_perf_capabilities();
- refactor kvm_pmu_lbr_enable() with kvm_pmu_lbr_setup();
- simplify model-specific LBR functionality check;
- rename x86_perf_get_lbr_stack to x86_perf_get_lbr;
- rename intel_pmu_lbr_confirm() to kvm_pmu_availability_check();

Previous:
https://lore.kernel.org/lkml/[email protected]/

Like Xu (7):
perf/x86/lbr: Add interface to get basic information about LBR stack
perf/x86: Add constraint to create guest LBR event without hw counter
perf/x86: Keep LBR stack unchanged on the host for guest LBR event
KVM: x86: Add KVM_CAP_X86_GUEST_LBR interface to dis/enable LBR
feature
KVM: x86/pmu: Add LBR feature emulation via guest LBR event
KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism
KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES to guest for LBR record
format

Wei Wang (3):
perf/x86: Fix msr variable type for the LBR msrs
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: Remove the common trap handler of the MSR_IA32_DEBUGCTLMSR

Documentation/virt/kvm/api.rst | 28 +++
arch/x86/events/core.c | 9 +-
arch/x86/events/intel/core.c | 29 +++
arch/x86/events/intel/lbr.c | 55 +++++-
arch/x86/events/perf_event.h | 21 ++-
arch/x86/include/asm/kvm_host.h | 7 +
arch/x86/include/asm/perf_event.h | 24 ++-
arch/x86/kvm/cpuid.c | 3 +-
arch/x86/kvm/pmu.c | 28 ++-
arch/x86/kvm/pmu.h | 26 ++-
arch/x86/kvm/pmu_amd.c | 7 +-
arch/x86/kvm/vmx/pmu_intel.c | 291 ++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/vmx/vmx.h | 2 +
arch/x86/kvm/x86.c | 42 +++--
include/linux/perf_event.h | 7 +
include/uapi/linux/kvm.h | 1 +
kernel/events/core.c | 7 -
tools/include/uapi/linux/kvm.h | 1 +
19 files changed, 540 insertions(+), 52 deletions(-)

--
2.21.1


2020-03-13 02:19:39

by Like Xu

Subject: [PATCH v9 02/10] perf/x86/lbr: Add interface to get basic information about LBR stack

The LBR stack MSRs are model specific. The perf subsystem has already
obtained the LBR stack base addresses based on the CPU model.

Therefore, add an interface that enables callers outside the perf
subsystem to obtain the LBR stack base addresses. This is useful for
hypervisors to emulate the LBR feature for guests with less code.

Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/lbr.c | 19 +++++++++++++++++++
arch/x86/include/asm/perf_event.h | 12 ++++++++++++
2 files changed, 31 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 534c76606049..5ed88e578eaa 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1334,3 +1334,22 @@ void intel_pmu_lbr_init_knl(void)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
}
+
+/**
+ * x86_perf_get_lbr - get the LBR stack information
+ *
+ * @stack: the caller's memory to store the LBR stack information
+ *
+ * Returns: 0 indicates the LBR stack info has been successfully obtained
+ */
+int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
+{
+ stack->nr = x86_pmu.lbr_nr;
+ stack->from = x86_pmu.lbr_from;
+ stack->to = x86_pmu.lbr_to;
+ stack->info = (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO) ?
+ MSR_LBR_INFO_0 : 0;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 29964b0e1075..e018a1cf604c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -322,6 +322,13 @@ struct perf_guest_switch_msr {
u64 host, guest;
};

+struct x86_pmu_lbr {
+ unsigned int nr;
+ unsigned int from;
+ unsigned int to;
+ unsigned int info;
+};
+
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -337,12 +344,17 @@ static inline void perf_check_microcode(void) { }

#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr(struct x86_pmu_lbr *stack);
#else
static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
{
*nr = 0;
return NULL;
}
+static inline int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
+{
+ return -1;
+}
#endif

#ifdef CONFIG_CPU_SUP_INTEL
--
2.21.1

2020-03-13 02:19:47

by Like Xu

Subject: [PATCH v9 03/10] perf/x86: Add constraint to create guest LBR event without hw counter

The hypervisor may request the perf subsystem to schedule a time window
during which it directly accesses the LBR stack MSRs for its own use.
Normally, it creates a guest LBR event with callstack mode enabled,
which is scheduled along with other LBR events on the host but in an
exclusive way.

To avoid wasting a counter on the guest LBR event, perf tracks it via
is_guest_lbr_event() and assigns it a fake INTEL_PMC_IDX_FIXED_VLBR
counter with the help of the new guest_lbr_constraint. As with the BTS
event that inspired it, no hardware counter is actually allocated for
guest LBR events.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/core.c | 9 ++++++---
arch/x86/events/intel/core.c | 19 +++++++++++++++++++
arch/x86/events/intel/lbr.c | 3 +++
arch/x86/events/perf_event.h | 15 +++++++++++++++
arch/x86/include/asm/perf_event.h | 12 +++++++++++-
include/linux/perf_event.h | 7 +++++++
kernel/events/core.c | 7 -------
7 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3bb738f5a472..e919187a0751 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -74,7 +74,8 @@ u64 x86_perf_event_update(struct perf_event *event)
int idx = hwc->idx;
u64 delta;

- if (idx == INTEL_PMC_IDX_FIXED_BTS)
+ if ((idx == INTEL_PMC_IDX_FIXED_BTS) ||
+ (idx == INTEL_PMC_IDX_FIXED_VLBR))
return 0;

/*
@@ -1102,7 +1103,8 @@ static inline void x86_assign_hw_event(struct perf_event *event,
hwc->last_cpu = smp_processor_id();
hwc->last_tag = ++cpuc->tags[i];

- if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
+ if ((hwc->idx == INTEL_PMC_IDX_FIXED_BTS) ||
+ (hwc->idx == INTEL_PMC_IDX_FIXED_VLBR)) {
hwc->config_base = 0;
hwc->event_base = 0;
} else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
@@ -1233,7 +1235,8 @@ int x86_perf_event_set_period(struct perf_event *event)
s64 period = hwc->sample_period;
int ret = 0, idx = hwc->idx;

- if (idx == INTEL_PMC_IDX_FIXED_BTS)
+ if ((idx == INTEL_PMC_IDX_FIXED_BTS) ||
+ (idx == INTEL_PMC_IDX_FIXED_VLBR))
return 0;

/*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 3be51aa06e67..901c82032f4a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2157,6 +2157,9 @@ static void intel_pmu_disable_event(struct perf_event *event)
return;
}

+ if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_VLBR))
+ return;
+
cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
cpuc->intel_cp_status &= ~(1ull << hwc->idx);
@@ -2241,6 +2244,9 @@ static void intel_pmu_enable_event(struct perf_event *event)
return;
}

+ if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_VLBR))
+ return;
+
if (event->attr.exclude_host)
cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
if (event->attr.exclude_guest)
@@ -2595,6 +2601,15 @@ intel_bts_constraints(struct perf_event *event)
return NULL;
}

+static struct event_constraint *
+intel_guest_event_constraints(struct perf_event *event)
+{
+ if (unlikely(is_guest_lbr_event(event)))
+ return &guest_lbr_constraint;
+
+ return NULL;
+}
+
static int intel_alt_er(int idx, u64 config)
{
int alt_idx = idx;
@@ -2785,6 +2800,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
{
struct event_constraint *c;

+ c = intel_guest_event_constraints(event);
+ if (c)
+ return c;
+
c = intel_bts_constraints(event);
if (c)
return c;
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 5ed88e578eaa..ff1f35b4f420 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1353,3 +1353,6 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
return 0;
}
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
+
+struct event_constraint guest_lbr_constraint =
+ EVENT_CONSTRAINT(0, 1ULL << GLOBAL_STATUS_LBRS_FROZEN_BIT, 0);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1025bc6eb04f..9a62264a3068 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -969,6 +969,20 @@ static inline bool intel_pmu_has_bts(struct perf_event *event)
return intel_pmu_has_bts_period(event, hwc->sample_period);
}

+static inline bool is_guest_event(struct perf_event *event)
+{
+ if (event->attr.exclude_host && is_kernel_event(event))
+ return true;
+ return false;
+}
+
+static inline bool is_guest_lbr_event(struct perf_event *event)
+{
+ if (is_guest_event(event) && needs_branch_stack(event))
+ return true;
+ return false;
+}
+
int intel_pmu_save_and_restart(struct perf_event *event);

struct event_constraint *
@@ -989,6 +1003,7 @@ void release_ds_buffers(void);
void reserve_ds_buffers(void);

extern struct event_constraint bts_constraint;
+extern struct event_constraint guest_lbr_constraint;

void intel_pmu_enable_bts(u64 config);

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e018a1cf604c..674130aca75a 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -181,9 +181,19 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)

+/*
+ * We model guest LBR event tracing as another fixed-mode PMC like BTS.
+ *
+ * We choose bit 58 (LBRS_FROZEN_BIT) which is used to indicate that the LBR
+ * stack is frozen on a hardware PMI request in the PERF_GLOBAL_STATUS msr,
+ * and the 59th PMC counter (if any) is not supposed to use it as well.
+ */
+#define INTEL_PMC_IDX_FIXED_VLBR GLOBAL_STATUS_LBRS_FROZEN_BIT
+
/*
* Adaptive PEBS v4
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 547773f5894e..b94b695f2d7e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1082,6 +1082,13 @@ static inline bool is_sampling_event(struct perf_event *event)
return event->attr.sample_period != 0;
}

+#define TASK_TOMBSTONE ((void *)-1L)
+
+static inline bool is_kernel_event(struct perf_event *event)
+{
+ return READ_ONCE(event->owner) == TASK_TOMBSTONE;
+}
+
/*
* Return 1 for a software event, 0 for a hardware event
*/
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e453589da97c..75d47fcb9a0d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -164,13 +164,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
raw_spin_unlock(&cpuctx->ctx.lock);
}

-#define TASK_TOMBSTONE ((void *)-1L)
-
-static bool is_kernel_event(struct perf_event *event)
-{
- return READ_ONCE(event->owner) == TASK_TOMBSTONE;
-}
-
/*
* On task ctx scheduling...
*
--
2.21.1

2020-03-13 02:19:56

by Like Xu

Subject: [PATCH v9 04/10] perf/x86: Keep LBR stack unchanged on the host for guest LBR event

When a guest wants to use the LBR stack, its hypervisor creates a guest
LBR event and lets host perf schedule it. A new 'int guest_lbr_enabled'
field in struct cpu_hw_events is set to true when perf adds a guest LBR
event and to false on deletion.

The LBR stack MSRs are accessible to the guest while its guest LBR event
is scheduled in by the perf subsystem. Before scheduling out the event,
we should avoid host changes to IA32_DEBUGCTLMSR or LBR_SELECT.
Otherwise, unexpected branch operations may interfere with guest
behavior, pollute the LBR records, and even cause host branch data
leakage. In addition, intel_pmu_lbr_read() on the host can also be
skipped for guest usage.

On v4 PMUs and later, the LBR stack is frozen on an overflow condition
if Freeze_LBR_On_PMI is true, and recording is resumed by acking the
LBRS_FROZEN bit in the global status MSR instead of re-enabling
IA32_DEBUGCTL.LBR. So when a guest LBR event is running, the host PMI
handler has to keep the LBRS_FROZEN bit set (thus keeping the LBR
frozen) until the guest enables it. Otherwise, when the guest enters
non-root mode, the LBR would start recording and the guest PMI handler
code would pollute the LBR stack.

To ensure that guest LBR records are not lost across context switches,
the BRANCH_CALL_STACK flag should be set in 'branch_sample_type' for a
guest LBR event, because a callstack event naturally saves/restores
unread guest records with the help of intel_pmu_lbr_sched_task().

However, a regular host LBR perf event doesn't save/restore LBR_SELECT,
because that MSR is configured in __intel_pmu_lbr_enable() based on
branch_sample_type. So when a guest LBR event is running, the guest's
LBR_SELECT may change for its own use, and we have to add LBR_SELECT
save/restore to ensure that the guest's LBR_SELECT value is not lost
across context switches.

Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 10 ++++++++++
arch/x86/events/intel/lbr.c | 33 ++++++++++++++++++++++++++++++---
arch/x86/events/perf_event.h | 2 ++
3 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 901c82032f4a..66a0fedb9948 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2126,6 +2126,16 @@ static inline u64 intel_pmu_get_status(void)

static inline void intel_pmu_ack_status(u64 ack)
{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * it should remain disabled until the guest enables it. Otherwise,
+ * the guest PMI handler may contaminate the LBR records.
+ */
+ if (cpuc->guest_lbr_enabled)
+ ack &= ~GLOBAL_STATUS_LBRS_FROZEN;
+
wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
}

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index ff1f35b4f420..48538702f3f2 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,15 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)

wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * the guest LBR_sel is likely different from event->hw.branch_reg.
+ * Therefore, it's necessary to save/restore MSR_LBR_SELECT written
+ * by the guest so that it's not lost during the context switch.
+ */
+ if (cpuc->guest_lbr_enabled)
+ wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}

static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -415,6 +424,9 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)

cpuc->last_task_ctx = task_ctx;
cpuc->last_log_id = ++task_ctx->log_id;
+
+ if (cpuc->guest_lbr_enabled)
+ rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}

void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
@@ -485,6 +497,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;

+ if (is_guest_lbr_event(event))
+ cpuc->guest_lbr_enabled = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;

if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -532,6 +547,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}

+ if (is_guest_lbr_event(event))
+ cpuc->guest_lbr_enabled = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -544,7 +562,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

- if (cpuc->lbr_users)
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * the guest will enable/disable LBR itself at the appropriate time,
+ * including configuring MSR_LBR_SELECT.
+ */
+ if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
__intel_pmu_lbr_enable(pmi);
}

@@ -552,7 +575,7 @@ void intel_pmu_lbr_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
__intel_pmu_lbr_disable();
}

@@ -691,8 +714,12 @@ void intel_pmu_lbr_read(void)
*
* This could be smarter and actually check the event,
* but this simple approach seems to work for now.
+ *
+ * And there is no need to read lbr here if a guest LBR event
+ * is using it, because the guest will read them on its own.
*/
- if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+ if (!cpuc->lbr_users || cpuc->guest_lbr_enabled ||
+ cpuc->lbr_users == cpuc->lbr_pebs_users)
return;

if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9a62264a3068..cb7288361f34 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -237,6 +237,7 @@ struct cpu_hw_events {
u64 br_sel;
struct x86_perf_task_context *last_task_ctx;
int last_log_id;
+ int guest_lbr_enabled;

/*
* Intel host/guest exclude bits
@@ -721,6 +722,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+ u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
--
2.21.1

2020-03-13 02:20:07

by Like Xu

Subject: [PATCH v9 01/10] perf/x86: Fix msr variable type for the LBR msrs

From: Wei Wang <[email protected]>

The MSR variable type can be 'unsigned int', which uses less memory than
the longer 'unsigned long'. Fix struct x86_pmu accordingly. The lbr_nr
won't be a negative number, so make it 'unsigned int' as well.

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/events/perf_event.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f1cd1ca1a77b..1025bc6eb04f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -672,8 +672,8 @@ struct x86_pmu {
/*
* Intel LBR
*/
- unsigned long lbr_tos, lbr_from, lbr_to; /* MSR base regs */
- int lbr_nr; /* hardware stack size */
+ unsigned int lbr_tos, lbr_from, lbr_to,
+ lbr_nr; /* LBR base regs and size */
u64 lbr_sel_mask; /* LBR_SELECT valid bits */
const int *lbr_sel_map; /* lbr_select mappings */
bool lbr_double_abort; /* duplicated lbr aborts */
--
2.21.1

2020-03-13 02:20:09

by Like Xu

Subject: [PATCH v9 06/10] KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in

From: Wei Wang <[email protected]>

Change kvm_pmu_get_msr() to take a 'struct msr_data', since the
host_initiated field of the struct may be used by get_msr. This also
makes the API consistent with kvm_pmu_set_msr(). No functional changes.

Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/kvm/pmu.c | 4 ++--
arch/x86/kvm/pmu.h | 4 ++--
arch/x86/kvm/pmu_amd.c | 7 ++++---
arch/x86/kvm/vmx/pmu_intel.c | 19 +++++++++++--------
arch/x86/kvm/x86.c | 4 ++--
5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 970fb5574402..306a79af0d0e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -397,9 +397,9 @@ static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
__set_bit(pmc->idx, pmu->pmc_in_use);
}

-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
- return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
+ return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr_info);
}

int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 4842a417afe9..c0fb092f985e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,7 +32,7 @@ struct kvm_pmu_ops {
struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, u32 msr);
int (*is_valid_rdpmc_ecx)(struct kvm_vcpu *vcpu, unsigned int idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
- int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+ int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
@@ -148,7 +148,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
int kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx);
bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index ce0b10fe5e2b..62976242bb12 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -215,21 +215,22 @@ static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

-static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;

/* MSR_PERFCTRn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
if (pmc) {
- *data = pmc_read_counter(pmc);
+ msr_info->data = pmc_read_counter(pmc);
return 0;
}
/* MSR_EVNTSELn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
if (pmc) {
- *data = pmc->eventsel;
+ msr_info->data = pmc->eventsel;
return 0;
}

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4f529e1de5d2..eef11e716570 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -184,35 +184,38 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

-static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;

switch (msr) {
case MSR_CORE_PERF_FIXED_CTR_CTRL:
- *data = pmu->fixed_ctr_ctrl;
+ msr_info->data = pmu->fixed_ctr_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_STATUS:
- *data = pmu->global_status;
+ msr_info->data = pmu->global_status;
return 0;
case MSR_CORE_PERF_GLOBAL_CTRL:
- *data = pmu->global_ctrl;
+ msr_info->data = pmu->global_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
- *data = pmu->global_ovf_ctrl;
+ msr_info->data = pmu->global_ovf_ctrl;
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
- *data = val & pmu->counter_bitmask[KVM_PMC_GP];
+ msr_info->data =
+ val & pmu->counter_bitmask[KVM_PMC_GP];
return 0;
} else if ((pmc = get_fixed_pmc(pmu, msr))) {
u64 val = pmc_read_counter(pmc);
- *data = val & pmu->counter_bitmask[KVM_PMC_FIXED];
+ msr_info->data =
+ val & pmu->counter_bitmask[KVM_PMC_FIXED];
return 0;
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
- *data = pmc->eventsel;
+ msr_info->data = pmc->eventsel;
return 0;
}
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51c62d14809a..ce6b0326a1ad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3049,7 +3049,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
- return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+ return kvm_pmu_get_msr(vcpu, msr_info);
msr_info->data = 0;
break;
case MSR_IA32_UCODE_REV:
@@ -3211,7 +3211,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
- return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+ return kvm_pmu_get_msr(vcpu, msr_info);
if (!ignore_msrs) {
vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
msr_info->index);
--
2.21.1

2020-03-13 02:20:15

by Like Xu

Subject: [PATCH v9 07/10] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

VMX transitions are much more frequent than vcpu switches, and saving/
restoring tens of LBR MSRs (e.g. 32 LBR stack entries) would bring too
much overhead to the frequent VMX transitions themselves, which is
unnecessary. So the guest LBR stack MSRs are only saved/restored on
vcpu context switches, with the help of the native LBR event callstack
mechanism. In general, the LBR-related MSRs and their functionality
are emulated as follows:

The guest's first access to an LBR-related MSR (including DEBUGCTLMSR
and the stack MSRs) is always intercepted. The KVM handler creates
a guest LBR event which enables callstack mode and has no hardware
counter assigned. Host perf enables and schedules this event as usual.
When the guest LBR event exists and the LBR stack is available
(defined as 'event->oncpu != -1'), accesses to the LBR stack MSRs are
not intercepted but passed through to the vcpu before VM-entry. This
availability check is always performed before VM-entry, but as late
as possible to avoid reclaiming resources from any higher-priority LBR
event. A negative check result brings the interception back and
prevents real register accesses and potential data leakage.

At this point, the vPMU only supports architectural PMU v2, and the
guest PMI handler enables LBR via DEBUGCTLMSR rather than
GLOBAL_OVF_CTRL. So when the guest sets the enable bit, the DEBUGCTLMSR
trap ensures the LBRS_FROZEN bit is cleared on any host with a v4 or
later PMU, and the LBR facility can record as the guest expects. The
guest LBR event is released when the vPMU is reset; in a next step, the
lazy release mechanism will be applied to this event like a regular
vPMC.

Suggested-by: Andi Kleen <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 4 +
arch/x86/kvm/pmu.c | 7 ++
arch/x86/kvm/pmu.h | 7 ++
arch/x86/kvm/vmx/pmu_intel.c | 199 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 4 +-
arch/x86/kvm/vmx/vmx.h | 2 +
arch/x86/kvm/x86.c | 12 ++
7 files changed, 230 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b87d2ab28b0e..b4c1761ca783 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -502,6 +502,10 @@ struct kvm_pmu {
* redundant check before cleanup if guest don't use vPMU at all.
*/
u8 event_count;
+
+ /* Last Branch Recording Emulation */
+ struct perf_event *lbr_event;
+ bool lbr_is_availabile;
};

struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 306a79af0d0e..84b5ec50ca6d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -434,6 +434,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
pmu->event_count = 0;
pmu->need_cleanup = false;
+ pmu->lbr_is_availabile = false;
kvm_pmu_refresh(vcpu);
}

@@ -526,3 +527,9 @@ bool kvm_pmu_lbr_setup(struct kvm_vcpu *vcpu)

return false;
}
+
+void kvm_pmu_availability_check(struct kvm_vcpu *vcpu)
+{
+ if (kvm_x86_ops->pmu_ops->availability_check)
+ kvm_x86_ops->pmu_ops->availability_check(vcpu);
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index c0fb092f985e..3ddff3972b8d 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -38,8 +38,14 @@ struct kvm_pmu_ops {
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
bool (*lbr_setup)(struct kvm_vcpu *vcpu);
+ void (*availability_check)(struct kvm_vcpu *vcpu);
};

+static inline bool event_is_oncpu(struct perf_event *event)
+{
+ return event && event->oncpu != -1;
+}
+
static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
@@ -157,6 +163,7 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
bool kvm_pmu_lbr_setup(struct kvm_vcpu *vcpu);
+void kvm_pmu_availability_check(struct kvm_vcpu *vcpu);

bool is_vmware_backdoor_pmc(u32 pmc_idx);

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index eef11e716570..92627f31cda3 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -17,6 +17,7 @@
#include "lapic.h"
#include "nested.h"
#include "pmu.h"
+#include "vmx.h"

static struct kvm_event_hw_type_mapping intel_arch_events[] = {
/* Index must match CPUID 0x0A.EBX bit vector */
@@ -150,6 +151,24 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
return &counters[array_index_nospec(idx, num_counters)];
}

+static inline bool intel_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+ struct x86_pmu_lbr *stack = &vcpu->kvm->arch.lbr_stack;
+ bool ret = false;
+
+ if (!vcpu->kvm->arch.lbr_in_guest)
+ return ret;
+
+ ret = (index == MSR_LBR_SELECT || index == MSR_LBR_TOS ||
+ (index >= stack->from && index < stack->from + stack->nr) ||
+ (index >= stack->to && index < stack->to + stack->nr));
+
+ if (!ret && stack->info)
+ ret = (index >= stack->info && index < stack->info + stack->nr);
+
+ return ret;
+}
+
static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -160,12 +179,14 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_STATUS:
case MSR_CORE_PERF_GLOBAL_CTRL:
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ case MSR_IA32_DEBUGCTLMSR:
ret = pmu->version > 1;
break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
- get_fixed_pmc(pmu, msr);
+ get_fixed_pmc(pmu, msr) ||
+ intel_is_valid_lbr_msr(vcpu, msr);
break;
}

@@ -184,6 +205,120 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

+/*
+ * "set = true" makes the LBR stack MSRs interceptible;
+ * otherwise the LBR stack MSRs are passed through to the guest.
+ */
+static void intel_pmu_set_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu,
+ bool set)
+{
+ unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+ struct x86_pmu_lbr *stack = &vcpu->kvm->arch.lbr_stack;
+ int i;
+
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_SELECT, MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_TOS, MSR_TYPE_RW, set);
+ for (i = 0; i < stack->nr; i++) {
+ vmx_set_intercept_for_msr(msr_bitmap, stack->from + i,
+ MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap, stack->to + i,
+ MSR_TYPE_RW, set);
+ if (stack->info)
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->info + i, MSR_TYPE_RW, set);
+ }
+}
+
+int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event;
+
+ /*
+ * The perf_event_attr is constructed in the minimal efficient way:
+ * - set 'pinned = true' to make it task-pinned, so that if another
+ * cpu-pinned event reclaims LBR, event->oncpu is set to -1;
+ * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and 'exclude_host =
+ * true' to mark it as a guest LBR event, which tells host perf to
+ * schedule it with a fake hw counter instead of a real one;
+ * check is_guest_lbr_event() and intel_guest_event_constraints();
+ * - set 'branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ * PERF_SAMPLE_BRANCH_USER' to configure it as a callstack LBR event,
+ * which allocates ctx->task_ctx_data and requests the host perf
+ * subsystem to save/restore the guest LBR stack across host context
+ * switches; check branch_user_callstack() and intel_pmu_lbr_sched_task();
+ */
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(attr),
+ .pinned = true,
+ .exclude_host = true,
+ .sample_type = PERF_SAMPLE_BRANCH_STACK,
+ .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ PERF_SAMPLE_BRANCH_USER,
+ };
+
+ if (pmu->lbr_event)
+ return 0;
+
+ event = perf_event_create_kernel_counter(&attr, -1,
+ current, NULL, NULL);
+ if (IS_ERR(event)) {
+ pr_debug_ratelimited("%s: failed %ld\n",
+ __func__, PTR_ERR(event));
+ return -ENOENT;
+ }
+ pmu->lbr_event = event;
+ pmu->event_count++;
+ return 0;
+}
+
+void intel_pmu_free_lbr_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event = pmu->lbr_event;
+
+ if (!event)
+ return;
+
+ perf_event_release_kernel(event);
+ intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
+ pmu->lbr_is_availabile = false;
+ pmu->event_count--;
+ pmu->lbr_event = NULL;
+}
+
+static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info, bool read)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u32 index = msr_info->index;
+
+ if (!intel_is_valid_lbr_msr(vcpu, index))
+ return false;
+
+ if (!pmu->lbr_event)
+ intel_pmu_create_lbr_event(vcpu);
+
+ /*
+ * Disable irqs to ensure the LBR feature is not reclaimed by the
+ * host while the value is read from the msr; this prevents the
+ * host LBR value from leaking to the guest. If the LBR has been
+ * reclaimed, return 0 on guest reads.
+ */
+ local_irq_disable();
+ if (event_is_oncpu(pmu->lbr_event)) {
+ if (read)
+ rdmsrl(index, msr_info->data);
+ else
+ wrmsrl(index, msr_info->data);
+ } else if (read)
+ msr_info->data = 0;
+ local_irq_enable();
+
+ return true;
+}
+
static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -203,6 +338,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_DEBUGCTLMSR:
+ msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -217,7 +355,8 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
msr_info->data = pmc->eventsel;
return 0;
- }
+ } else if (intel_pmu_access_lbr_msr(vcpu, msr_info, true))
+ return 0;
}

return 1;
@@ -261,6 +400,22 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_DEBUGCTLMSR:
+ /* Bits other than LBR and FREEZE_LBRS_ON_PMI are reserved and #GP */
+ if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI))
+ return 1;
+ vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+ if (data && !intel_pmu_create_lbr_event(vcpu) &&
+ event_is_oncpu(pmu->lbr_event)) {
+ /*
+ * On the host with v4 PMU, the LBR starts to
+ * record when the enable bit is set in debugctl
+ * and LBRS_FROZEN is cleared in the global status.
+ */
+ wrmsrl_safe(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
+ GLOBAL_STATUS_LBRS_FROZEN);
+ }
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (!msr_info->host_initiated)
@@ -283,7 +438,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
reprogram_gp_counter(pmc, data);
return 0;
}
- }
+ } else if (intel_pmu_access_lbr_msr(vcpu, msr_info, false))
+ return 0;
}

return 1;
@@ -396,6 +552,8 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmc->counter = 0;
}

+ intel_pmu_free_lbr_event(vcpu);
+
pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
pmu->global_ovf_ctrl = 0;
}
@@ -428,6 +586,40 @@ static bool intel_pmu_setup_lbr(struct kvm_vcpu *vcpu)
return true;
}

+void intel_pmu_lbr_availability_check(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ /*
+ * The LBR stack is reclaimed via IPI calls, so checking
+ * lbr_event->oncpu must be done in an atomic context.
+ * Assert that irqs have already been disabled.
+ */
+ lockdep_assert_irqs_disabled();
+
+ if (pmu->lbr_is_availabile && event_is_oncpu(pmu->lbr_event))
+ return;
+
+ if (!pmu->lbr_is_availabile && !event_is_oncpu(pmu->lbr_event))
+ return;
+
+ if (event_is_oncpu(pmu->lbr_event)) {
+ intel_pmu_set_intercept_for_lbr_msrs(vcpu, false);
+ pmu->lbr_is_availabile = true;
+ } else {
+ intel_pmu_set_intercept_for_lbr_msrs(vcpu, true);
+ pmu->lbr_is_availabile = false;
+ }
+}
+
+void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (vcpu->kvm->arch.lbr_in_guest && unlikely(pmu->lbr_event))
+ intel_pmu_lbr_availability_check(vcpu);
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -443,4 +635,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.init = intel_pmu_init,
.reset = intel_pmu_reset,
.lbr_setup = intel_pmu_setup_lbr,
+ .availability_check = intel_pmu_availability_check,
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 57742ddfd854..c13c2b00bb16 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3682,8 +3682,8 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm
}
}

-static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
- u32 msr, int type, bool value)
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value)
{
if (value)
vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index e64da06c7009..a0644eef5631 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -347,6 +347,8 @@ void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value);
struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr);
void pt_update_intercept_for_msr(struct vcpu_vmx *vmx);
void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce6b0326a1ad..3771d5fb2524 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8338,6 +8338,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
smp_mb__after_srcu_read_unlock();

+ /*
+ * Higher-priority host perf events (e.g. cpu-pinned) could reclaim the
+ * pmu resources (e.g. LBR) that were assigned to the guest. This is
+ * usually done via ipi calls (see perf_install_in_context for details).
+ *
+ * Before entering non-root mode (with irqs disabled here), double-check
+ * that the pmu features enabled for the guest have not been reclaimed
+ * by higher-priority host events. Otherwise, disallow the vcpu's access
+ * to the reclaimed features.
+ */
+ kvm_pmu_availability_check(vcpu);
+
/*
* This handles the case where a posted interrupt was
* notified with kvm_vcpu_kick.
--
2.21.1

2020-03-13 02:20:25

by Like Xu

Subject: [PATCH v9 09/10] KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES to guest for LBR record format

The MSR_IA32_PERF_CAPABILITIES is a read-only MSR that enumerates the
existence of performance monitoring features. Bits [0,5] describe the
LBR format of the branch record addresses stored in the LBR stack.
Expose those bits to the guest when the guest LBR is enabled.

Cc: Luwei Kang <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 3 ++-
arch/x86/kvm/pmu.h | 9 +++++++++
arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++++++
4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b4c1761ca783..d915eedcbe43 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -483,6 +483,7 @@ struct kvm_pmu {
u64 global_ctrl_mask;
u64 global_ovf_ctrl_mask;
u64 reserved_bits;
+ u64 perf_capabilities;
u8 version;
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b1c469446b07..55f57d6c3de0 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -446,6 +446,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
unsigned f_xsaves = kvm_x86_ops->xsaves_supported() ? F(XSAVES) : 0;
unsigned f_intel_pt = kvm_x86_ops->pt_supported() ? F(INTEL_PT) : 0;
+ unsigned int f_pdcm = kvm_pmu_get_perf_capabilities() ? F(PDCM) : 0;

/* cpuid 1.edx */
const u32 kvm_cpuid_1_edx_x86_features =
@@ -474,7 +475,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_entry2 *entry, u32 function,
F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
- F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
- F(FMA) | F(CX16) | 0 /* xTPR Update */ | f_pdcm |
F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index d4ef7ec3331d..1cf73f39335d 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -43,6 +43,7 @@ struct kvm_pmu_ops {
bool (*lbr_setup)(struct kvm_vcpu *vcpu);
void (*lbr_cleanup)(struct kvm_vcpu *vcpu);
void (*availability_check)(struct kvm_vcpu *vcpu);
+ u64 (*get_perf_capabilities)(void);
};

static inline bool event_is_oncpu(struct perf_event *event)
@@ -149,6 +150,14 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
return sample_period;
}

+static inline u64 kvm_pmu_get_perf_capabilities(void)
+{
+ if (kvm_x86_ops->pmu_ops->get_perf_capabilities)
+ return kvm_x86_ops->pmu_ops->get_perf_capabilities();
+
+ return 0;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index bbb5f4c63f52..167061ce0434 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -19,6 +19,8 @@
#include "pmu.h"
#include "vmx.h"

+#define X86_PERF_CAP_MASK_LBR_FMT 0x3f
+
static struct kvm_event_hw_type_mapping intel_arch_events[] = {
/* Index must match CPUID 0x0A.EBX bit vector */
[0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
@@ -182,6 +184,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_DEBUGCTLMSR:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PERF_CAPABILITIES:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_PDCM);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -342,6 +347,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
return 0;
+ case MSR_IA32_PERF_CAPABILITIES:
+ msr_info->data = pmu->perf_capabilities;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -419,6 +427,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (pmu->lbr_event)
__set_bit(KVM_PMU_LBR_IN_USE_IDX, pmu->pmc_in_use);
return 0;
+ case MSR_IA32_PERF_CAPABILITIES:
+ return 1; /* RO MSR */
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (!msr_info->host_initiated)
@@ -448,6 +458,19 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
}

+static u64 intel_pmu_get_perf_capabilities(void)
+{
+ u64 perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
+
+ /* Currently, KVM only supports LBR. */
+ perf_cap &= X86_PERF_CAP_MASK_LBR_FMT;
+
+ return perf_cap;
+}
+
static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -474,6 +497,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
return;

perf_get_x86_pmu_capability(&x86_pmu);
+ pmu->perf_capabilities = intel_pmu_get_perf_capabilities();

pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
x86_pmu.num_counters_gp);
@@ -501,6 +525,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl_mask &=
~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI;

+ if (!vcpu->kvm->arch.lbr_in_guest)
+ pmu->perf_capabilities &= ~X86_PERF_CAP_MASK_LBR_FMT;
+
entry = kvm_find_cpuid_entry(vcpu, 7, 0);
if (entry &&
(boot_cpu_has(X86_FEATURE_HLE) || boot_cpu_has(X86_FEATURE_RTM)) &&
@@ -652,4 +679,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.lbr_setup = intel_pmu_setup_lbr,
.availability_check = intel_pmu_availability_check,
.lbr_cleanup = intel_pmu_cleanup_lbr,
+ .get_perf_capabilities = intel_pmu_get_perf_capabilities,
};
--
2.21.1

2020-03-13 02:20:32

by Like Xu

Subject: [PATCH v9 10/10] KVM: x86: Remove the common trap handler of the MSR_IA32_DEBUGCTLMSR

From: Wei Wang <[email protected]>

The debugctl MSR is not completely identical on AMD and Intel CPUs; for
example, FREEZE_LBRS_ON_PMI is supported only by Intel CPUs. This MSR is
now handled separately in svm.c and intel_pmu.c, so remove the common
debugctl MSR handling code from kvm_get/set_msr_common.

Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/kvm/x86.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3771d5fb2524..5ad36e71b3eb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2756,18 +2756,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
}
break;
- case MSR_IA32_DEBUGCTLMSR:
- if (!data) {
- /* We support the non-activated case already */
- break;
- } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
- /* Values other than LBR and BTF are vendor-specific,
- thus reserved and should throw a #GP */
- return 1;
- }
- vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
- __func__, data);
- break;
case 0x200 ... 0x2ff:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
@@ -3025,7 +3013,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
switch (msr_info->index) {
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
- case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
case MSR_IA32_LASTBRANCHTOIP:
case MSR_IA32_LASTINTFROMIP:
--
2.21.1

2020-03-13 02:20:41

by Like Xu

Subject: [PATCH v9 05/10] KVM: x86: Add KVM_CAP_X86_GUEST_LBR interface to dis/enable LBR feature

The LBR feature is model specific. Introduce KVM_CAP_X86_GUEST_LBR to
allow per-VM enabling of the guest LBR feature.

For the enable_cap ioctl, the first input parameter indicates whether the
LBR feature should be enabled or not, and the second parameter is a
pointer to userspace memory into which the LBR stack information is
written. If the second parameter is invalid or the guest/host cpu model
doesn't match, -EINVAL is returned, meaning the LBR feature cannot be
enabled. For the check_extension ioctl, the return value helps userspace
calculate the total size of the complete guest LBR entries and perform a
functional compatibility check.

Cc: Gonglei <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Eric Hankland <[email protected]>
Cc: Eduardo Habkost <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 8 ++++++++
arch/x86/kvm/pmu.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 29 +++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 13 +++++++++++++
include/uapi/linux/kvm.h | 1 +
tools/include/uapi/linux/kvm.h | 1 +
8 files changed, 84 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 0adef66585b1..6d3f4d620f9e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5743,6 +5743,34 @@ it hard or impossible to use it correctly. The availability of
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.

+7.19 KVM_CAP_X86_GUEST_LBR
+Architectures: x86
+Parameters: args[0] whether LBR feature should be enabled or not
+ args[1] pointer to the userspace memory to load the LBR stack info
+
+The LBR stack information is described by
+struct x86_pmu_lbr {
+ unsigned int nr;
+ unsigned int from;
+ unsigned int to;
+ unsigned int info;
+};
+
+@nr: number of LBR stack entries
+@from: index of the msr that stores a branch source address
+@to: index of the msr that stores a branch destination address
+@info: index of the msr that stores LBR related flags, such as misprediction
+
+Enabling this capability allows guest accesses to the LBR feature. Otherwise,
+a #GP is injected into the guest when it accesses the LBR related msrs.
+
+After the feature is enabled, before exiting to userspace, the kvm handlers
+fill the LBR stack info into the userspace memory pointed to by args[1].
+
+The return value of kvm_vm_ioctl_check_extension for KVM_CAP_X86_GUEST_LBR
+is the size of 'struct x86_pmu_lbr', from which userspace can calculate the
+total size of the complete guest LBR entries for a compatibility check.
+
8. Other capabilities.
======================

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ac67b6d6692..b87d2ab28b0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -933,6 +933,7 @@ struct kvm_arch {
bool hlt_in_guest;
bool pause_in_guest;
bool cstate_in_guest;
+ bool lbr_in_guest;

unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
@@ -982,6 +983,7 @@ struct kvm_arch {
bool exception_payload_enabled;

struct kvm_pmu_event_filter *pmu_event_filter;
+ struct x86_pmu_lbr lbr_stack;
struct task_struct *nx_lpage_recovery_thread;
};

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d1f8ca57d354..970fb5574402 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -518,3 +518,11 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
kfree(filter);
return r;
}
+
+bool kvm_pmu_lbr_setup(struct kvm_vcpu *vcpu)
+{
+ if (kvm_x86_ops->pmu_ops->lbr_setup)
+ return kvm_x86_ops->pmu_ops->lbr_setup(vcpu);
+
+ return false;
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index d7da2b9e0755..4842a417afe9 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -37,6 +37,7 @@ struct kvm_pmu_ops {
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
+ bool (*lbr_setup)(struct kvm_vcpu *vcpu);
};

static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
@@ -155,6 +156,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+bool kvm_pmu_lbr_setup(struct kvm_vcpu *vcpu);

bool is_vmware_backdoor_pmc(u32 pmc_idx);

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index e933541751fb..4f529e1de5d2 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -397,6 +397,34 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}

+static bool intel_pmu_get_lbr_stack(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+
+ if (likely(kvm->arch.lbr_stack.nr))
+ return true;
+
+ return !x86_perf_get_lbr(&kvm->arch.lbr_stack);
+}
+
+static bool intel_pmu_setup_lbr(struct kvm_vcpu *vcpu)
+{
+ if (!intel_pmu_get_lbr_stack(vcpu))
+ return false;
+
+ if (vcpu_to_pmu(vcpu)->version < 2)
+ return false;
+
+ /*
+ * As a first step, a guest can only enable the LBR feature if its cpu
+ * model is the same as the host's, since LBR is model-specific for now.
+ */
+ if (boot_cpu_data.x86_model != guest_cpuid_model(vcpu))
+ return false;
+
+ return true;
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -411,4 +439,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.refresh = intel_pmu_refresh,
.init = intel_pmu_init,
.reset = intel_pmu_reset,
+ .lbr_setup = intel_pmu_setup_lbr,
};
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a69f7bf020d9..51c62d14809a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3357,6 +3357,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_EXCEPTION_PAYLOAD:
r = 1;
break;
+ case KVM_CAP_X86_GUEST_LBR:
+ r = sizeof(struct x86_pmu_lbr);
+ break;
case KVM_CAP_SYNC_REGS:
r = KVM_SYNC_X86_VALID_FIELDS;
break;
@@ -4866,6 +4869,16 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+ case KVM_CAP_X86_GUEST_LBR:
+ r = -EINVAL;
+ if (cap->args[0] && !kvm_pmu_lbr_setup(kvm->vcpus[0]))
+ break;
+ if (copy_to_user((void __user *)cap->args[1],
+ &kvm->arch.lbr_stack, sizeof(struct x86_pmu_lbr)))
+ break;
+ kvm->arch.lbr_in_guest = cap->args[0];
+ r = 0;
+ break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d8499d935954..721510b2789a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1010,6 +1010,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_ARM_NISV_TO_USER 177
#define KVM_CAP_ARM_INJECT_EXT_DABT 178
#define KVM_CAP_S390_VCPU_RESETS 179
+#define KVM_CAP_X86_GUEST_LBR 180

#ifdef KVM_CAP_IRQ_ROUTING

diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index f0a16b4adbbd..3c094ca5bef4 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1009,6 +1009,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PPC_GUEST_DEBUG_SSTEP 176
#define KVM_CAP_ARM_NISV_TO_USER 177
#define KVM_CAP_ARM_INJECT_EXT_DABT 178
+#define KVM_CAP_X86_GUEST_LBR 180

#ifdef KVM_CAP_IRQ_ROUTING

--
2.21.1

2020-03-13 02:21:05

by Like Xu

Subject: [PATCH v9 08/10] KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism

The guest LBR event uses bit 58 of pmu->pmc_in_use (which is also the
LBRS_FROZEN bit in GLOBAL_STATUS) to indicate whether the event is still
needed by the vcpu. If the guest no longer accesses the LBR related MSRs
within a scheduling time slice and the LBR enable bit is unset, the vPMU
treats the guest LBR event like an ordinary vPMC event, releases it as
usual, and cancels the pass-through state of the stack MSRs.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 9 +++++++++
arch/x86/kvm/pmu.h | 4 ++++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
3 files changed, 29 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 84b5ec50ca6d..57859ff8b118 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -449,6 +449,12 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
}

+void kvm_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
+{
+ if (kvm_x86_ops->pmu_ops->lbr_cleanup)
+ kvm_x86_ops->pmu_ops->lbr_cleanup(vcpu);
+}
+
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
@@ -467,6 +473,9 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)

if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
pmc_stop_counter(pmc);
+
+ if (i == KVM_PMU_LBR_IN_USE_IDX)
+ kvm_pmu_lbr_cleanup(vcpu);
}

bitmap_zero(pmu->pmc_in_use, X86_PMC_IDX_MAX);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 3ddff3972b8d..d4ef7ec3331d 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -15,6 +15,9 @@
#define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002

+/* Indicate if the LBR MSRs were accessed during a time slice */
+#define KVM_PMU_LBR_IN_USE_IDX GLOBAL_STATUS_LBRS_FROZEN_BIT
+
struct kvm_event_hw_type_mapping {
u8 eventsel;
u8 unit_mask;
@@ -38,6 +41,7 @@ struct kvm_pmu_ops {
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
bool (*lbr_setup)(struct kvm_vcpu *vcpu);
+ void (*lbr_cleanup)(struct kvm_vcpu *vcpu);
void (*availability_check)(struct kvm_vcpu *vcpu);
};

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 92627f31cda3..bbb5f4c63f52 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -316,6 +316,7 @@ static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
msr_info->data = 0;
local_irq_enable();

+ __set_bit(KVM_PMU_LBR_IN_USE_IDX, pmu->pmc_in_use);
return true;
}

@@ -415,6 +416,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
wrmsrl_safe(MSR_CORE_PERF_GLOBAL_OVF_CTRL,
GLOBAL_STATUS_LBRS_FROZEN);
}
+ if (pmu->lbr_event)
+ __set_bit(KVM_PMU_LBR_IN_USE_IDX, pmu->pmc_in_use);
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
@@ -508,6 +511,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
0, pmu->nr_arch_gp_counters);
bitmap_set(pmu->all_valid_pmc_idx,
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
+ bitmap_set(pmu->all_valid_pmc_idx, KVM_PMU_LBR_IN_USE_IDX, 1);

nested_vmx_pmu_entry_exit_ctls_update(vcpu);
}
@@ -620,6 +624,17 @@ void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
intel_pmu_lbr_availability_check(vcpu);
}

+static void intel_pmu_cleanup_lbr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (!pmu->lbr_event)
+ return;
+
+ if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
+ intel_pmu_free_lbr_event(vcpu);
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -636,4 +651,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.reset = intel_pmu_reset,
.lbr_setup = intel_pmu_setup_lbr,
.availability_check = intel_pmu_availability_check,
+ .lbr_cleanup = intel_pmu_cleanup_lbr,
};
--
2.21.1

2020-03-20 08:46:27

by Xu, Like

Subject: Re: [PATCH v9 00/10] Guest Last Branch Recording Enabling

Hi Peter,
any comments on the host perf changes?

Hi Paolo,
any comments on the kvm changes? Isn't this feature interesting to you?

Just kindly ping.

Thanks,
Like Xu

On 2020/3/13 10:16, Like Xu wrote:
> Hi all,
>
> Please help review your interesting parts in this stable version,
> e.g. the first four patches involve the perf event subsystem
> and the fifth patch concerns the KVM userspace interface.

> v8->v9 Changelog:
> - using guest_lbr_constraint to create guest LBR event without hw counter;
> (please check perf changes in patch 0003)
> - rename 'cpuc->vcpu_lbr' to 'cpuc->guest_lbr_enabled';
> (please check host LBR changes in patch 0004)
> - replace 'pmu->lbr_used' mechanism with lazy release kvm_pmu_lbr_cleanup();
> - refactor IA32_PERF_CAPABILITIES trap via get_perf_capabilities();
> - refactor kvm_pmu_lbr_enable() with kvm_pmu_lbr_setup();
> - simplify model-specific LBR functionality check;
> - rename x86_perf_get_lbr_stack to x86_perf_get_lbr;
> - rename intel_pmu_lbr_confirm() to kvm_pmu_availability_check();
>
> Previous:
> https://lore.kernel.org/lkml/[email protected]/
>
> Like Xu (7):
> perf/x86/lbr: Add interface to get basic information about LBR stack
> perf/x86: Add constraint to create guest LBR event without hw counter
> perf/x86: Keep LBR stack unchanged on the host for guest LBR event
> KVM: x86: Add KVM_CAP_X86_GUEST_LBR interface to dis/enable LBR
> feature
> KVM: x86/pmu: Add LBR feature emulation via guest LBR event
> KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism
> KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES to guest for LBR record
> format
>
> Wei Wang (3):
> perf/x86: Fix msr variable type for the LBR msrs
> KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
> KVM: x86: Remove the common trap handler of the MSR_IA32_DEBUGCTLMSR
>
> Documentation/virt/kvm/api.rst | 28 +++
> arch/x86/events/core.c | 9 +-
> arch/x86/events/intel/core.c | 29 +++
> arch/x86/events/intel/lbr.c | 55 +++++-
> arch/x86/events/perf_event.h | 21 ++-
> arch/x86/include/asm/kvm_host.h | 7 +
> arch/x86/include/asm/perf_event.h | 24 ++-
> arch/x86/kvm/cpuid.c | 3 +-
> arch/x86/kvm/pmu.c | 28 ++-
> arch/x86/kvm/pmu.h | 26 ++-
> arch/x86/kvm/pmu_amd.c | 7 +-
> arch/x86/kvm/vmx/pmu_intel.c | 291 ++++++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmx.c | 4 +-
> arch/x86/kvm/vmx/vmx.h | 2 +
> arch/x86/kvm/x86.c | 42 +++--
> include/linux/perf_event.h | 7 +
> include/uapi/linux/kvm.h | 1 +
> kernel/events/core.c | 7 -
> tools/include/uapi/linux/kvm.h | 1 +
> 19 files changed, 540 insertions(+), 52 deletions(-)
>

2020-04-02 13:07:45

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v9 00/10] Guest Last Branch Recording Enabling

Hi Peter,

I'm not sure if you recently had time to review the proposed changes
to the host perf subsystem, which were introduced to support guest LBR
enablement.

The number of potential LBR users on the guest side is growing, and I
have been looking forward to your comments on patches 0001-0004 in this
version, even if they are completely negative.

Thanks,
Like Xu

On 2020/3/20 16:45, Xu, Like wrote:
> Hi Peter,
> any comments on the host perf changes?
>
> Hi Paolo,
> any comments on the kvm changes? Isn't this feature interesting to you?
>
> Just kindly ping.
>
> Thanks,
> Like Xu
>
> On 2020/3/13 10:16, Like Xu wrote:
>> Hi all,
>>
>> Please help review your interesting parts in this stable version,
>> e.g. the first four patches involve the perf event subsystem
>> and the fifth patch concerns the KVM userspace interface.
>
>> [...]

2020-04-09 17:13:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v9 03/10] perf/x86: Add constraint to create guest LBR event without hw counter

> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 3bb738f5a472..e919187a0751 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -74,7 +74,8 @@ u64 x86_perf_event_update(struct perf_event *event)
> int idx = hwc->idx;
> u64 delta;
>
> - if (idx == INTEL_PMC_IDX_FIXED_BTS)
> + if ((idx == INTEL_PMC_IDX_FIXED_BTS) ||
> + (idx == INTEL_PMC_IDX_FIXED_VLBR))
> return 0;
>
> /*
> @@ -1102,7 +1103,8 @@ static inline void x86_assign_hw_event(struct perf_event *event,
> hwc->last_cpu = smp_processor_id();
> hwc->last_tag = ++cpuc->tags[i];
>
> - if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
> + if ((hwc->idx == INTEL_PMC_IDX_FIXED_BTS) ||
> + (hwc->idx == INTEL_PMC_IDX_FIXED_VLBR)) {
> hwc->config_base = 0;
> hwc->event_base = 0;
> } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
> @@ -1233,7 +1235,8 @@ int x86_perf_event_set_period(struct perf_event *event)
> s64 period = hwc->sample_period;
> int ret = 0, idx = hwc->idx;
>
> - if (idx == INTEL_PMC_IDX_FIXED_BTS)
> + if ((idx == INTEL_PMC_IDX_FIXED_BTS) ||
> + (idx == INTEL_PMC_IDX_FIXED_VLBR))
> return 0;
>
> /*

That seems unfortunate; can that be >= INTEL_PMC_IDX_FIXED_BTS ? If so,
that probably wants a comment with the definitions.

Or otherwise check for !hwc->event_base. That should be 0 for both these
things.

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 3be51aa06e67..901c82032f4a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2157,6 +2157,9 @@ static void intel_pmu_disable_event(struct perf_event *event)
> return;
> }
>
> + if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_VLBR))
> + return;
> +

Please check code-gen to see if you can cut down on branches here;
there's 4 cases:

- vlbr
- bts
- fixed
- gp

perhaps you can write it like so:

(also see https://lkml.kernel.org/r/[email protected] )

static void intel_pmu_enable_event(struct perf_event *event)
{
...
int idx = hwx->idx;

if (idx < INTEL_PMC_IDX_FIXED) {
intel_set_masks(event, idx);
__x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
} else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
intel_set_masks(event, idx);
intel_pmu_enable_fixed(event);
} else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
intel_pmu_enable_bts(hwc->config);
}

/* nothing for INTEL_PMC_IDX_FIXED_VLBR */
}

That should sort the branches in order of: gp,fixed,bts,vlbr

> cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
> cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
> cpuc->intel_cp_status &= ~(1ull << hwc->idx);
> @@ -2241,6 +2244,9 @@ static void intel_pmu_enable_event(struct perf_event *event)
> return;
> }
>
> + if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_VLBR))
> + return;
> +
> if (event->attr.exclude_host)
> cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
> if (event->attr.exclude_guest)

idem.

> @@ -2595,6 +2601,15 @@ intel_bts_constraints(struct perf_event *event)
> return NULL;
> }
>
> +static struct event_constraint *
> +intel_guest_event_constraints(struct perf_event *event)
> +{
> + if (unlikely(is_guest_lbr_event(event)))
> + return &guest_lbr_constraint;
> +
> + return NULL;
> +}

This is a mis-nomer, it isn't just any guest_event

> +
> static int intel_alt_er(int idx, u64 config)
> {
> int alt_idx = idx;
> @@ -2785,6 +2800,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
> {
> struct event_constraint *c;
>
> + c = intel_guest_event_constraints(event);
> + if (c)
> + return c;
> +
> c = intel_bts_constraints(event);
> if (c)
> return c;

> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 1025bc6eb04f..9a62264a3068 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -969,6 +969,20 @@ static inline bool intel_pmu_has_bts(struct perf_event *event)
> return intel_pmu_has_bts_period(event, hwc->sample_period);
> }
>
> +static inline bool is_guest_event(struct perf_event *event)
> +{
> + if (event->attr.exclude_host && is_kernel_event(event))
> + return true;
> + return false;
> +}

I don't like this one, what if another in-kernel users generates an
event with exclude_host set ?

> @@ -989,6 +1003,7 @@ void release_ds_buffers(void);
> void reserve_ds_buffers(void);
>
> extern struct event_constraint bts_constraint;
> +extern struct event_constraint guest_lbr_constraint;
>
> void intel_pmu_enable_bts(u64 config);
>
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index e018a1cf604c..674130aca75a 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -181,9 +181,19 @@ struct x86_pmu_capability {
> #define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
> #define GLOBAL_STATUS_ASIF BIT_ULL(60)
> #define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
> -#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
> +#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
> +#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
> #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
>
> +/*
> + * We model guest LBR event tracing as another fixed-mode PMC like BTS.
> + *
> + * We choose bit 58 (LBRS_FROZEN_BIT) which is used to indicate that the LBR
> + * stack is frozen on a hardware PMI request in the PERF_GLOBAL_STATUS msr,
> + * and the 59th PMC counter (if any) is not supposed to use it as well.

Is this saying that STATUS.58 should never be set? I don't really
understand the language.

> + */
> +#define INTEL_PMC_IDX_FIXED_VLBR GLOBAL_STATUS_LBRS_FROZEN_BIT
> +
> /*
> * Adaptive PEBS v4
> */

2020-04-09 17:15:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v9 04/10] perf/x86: Keep LBR stack unchanged on the host for guest LBR event

On Fri, Mar 13, 2020 at 10:16:10AM +0800, Like Xu wrote:
> When a guest wants to use the LBR stack, its hypervisor creates a guest
> LBR event and let host perf schedules it. A new 'int guest_lbr_enabled'
> field in the "struct cpu_hw_events", is marked as true when perf adds
> a guest LBR event and false on deletion.
>
> The LBR stack msrs are accessible to the guest when its guest LBR event
> is scheduled in by the perf subsystem. Before scheduling out the event,
> we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT. Otherwise,
> some unexpected branch operations may interfere with guest behavior,
> pollute LBR records, and even cause host branch data leakage. In addition,
> the intel_pmu_lbr_read() on the host is also avoidable for guest usage.
>
> On v4 PMU or later, the LBR stack are frozen on the overflowed condition
> if Freeze_LBR_On_PMI is true and resume recording via acking LBRS_FROZEN
> to global status msr instead of re-enabling IA32_DEBUGCTL.LBR. So when a
> guest LBR event is running, the host PMI handler has to keep LBRS_FROZEN
> bit set (thus LBR being frozen) until the guest enables it. Otherwise,
> when the guest enters non-root mode, the LBR will start recording and
> the guest PMI handler code will also pollute the LBR stack.
>
> To ensure that guest LBR records are not lost during the context switch,
> the BRANCH_CALL_STACK flag should be configured in the 'branch_sample_type'
> for a guest LBR event because a callstack event could save/restore guest
> unread records with the help of intel_pmu_lbr_sched_task() naturally.
>
> However, the regular host LBR perf event doesn't save/restore LBR_SELECT,
> because it's configured in the LBR_enable() based on branch_sample_type.
> So when a guest LBR is running, the guest LBR_SELECT may changes for its
> own use and we have to add the LBR_SELECT save/restore to ensure what the
> guest LBR_SELECT value doesn't get lost during the context switching.

I had to read the patch before that made sense; I think it's mostly
there, but it can use a little help.


> @@ -691,8 +714,12 @@ void intel_pmu_lbr_read(void)
> *
> * This could be smarter and actually check the event,
> * but this simple approach seems to work for now.
> + *
> + * And there is no need to read lbr here if a guest LBR event

There's 'lbr' and 'LBR' in the same sentence

> + * is using it, because the guest will read them on its own.
> */
> - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
> + if (!cpuc->lbr_users || cpuc->guest_lbr_enabled ||
> + cpuc->lbr_users == cpuc->lbr_pebs_users)

indent fail

> return;
>
> if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)

2020-04-10 03:05:14

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v9 03/10] perf/x86: Add constraint to create guest LBR event without hw counter

Hi Peter,

First of all, thanks for your comments!

On 2020/4/10 0:37, Peter Zijlstra wrote:
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 3bb738f5a472..e919187a0751 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -74,7 +74,8 @@ u64 x86_perf_event_update(struct perf_event *event)
>> int idx = hwc->idx;
>> u64 delta;
>>
>> - if (idx == INTEL_PMC_IDX_FIXED_BTS)
>> + if ((idx == INTEL_PMC_IDX_FIXED_BTS) ||
>> + (idx == INTEL_PMC_IDX_FIXED_VLBR))
>> return 0;
>>
>> /*
>> @@ -1102,7 +1103,8 @@ static inline void x86_assign_hw_event(struct perf_event *event,
>> hwc->last_cpu = smp_processor_id();
>> hwc->last_tag = ++cpuc->tags[i];
>>
>> - if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
>> + if ((hwc->idx == INTEL_PMC_IDX_FIXED_BTS) ||
>> + (hwc->idx == INTEL_PMC_IDX_FIXED_VLBR)) {
>> hwc->config_base = 0;
>> hwc->event_base = 0;
>> } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
>> @@ -1233,7 +1235,8 @@ int x86_perf_event_set_period(struct perf_event *event)
>> s64 period = hwc->sample_period;
>> int ret = 0, idx = hwc->idx;
>>
>> - if (idx == INTEL_PMC_IDX_FIXED_BTS)
>> + if ((idx == INTEL_PMC_IDX_FIXED_BTS) ||
>> + (idx == INTEL_PMC_IDX_FIXED_VLBR))
>> return 0;
>>
>> /*
> That seems unfortunate; can that be >= INTEL_PMC_IDX_FIXED_BTS ? If so,
> that probably wants a comment with the definitions.
>
> Or otherwise check for !hwc->event_base. That should be 0 for both these
> things.
Yes, the !hwc->event_base looks good to me.
>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 3be51aa06e67..901c82032f4a 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2157,6 +2157,9 @@ static void intel_pmu_disable_event(struct perf_event *event)
>> return;
>> }
>>
>> + if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_VLBR))
>> + return;
>> +
> Please check code-gen to see if you can cut down on branches here;
> there's 4 cases:
>
> - vlbr
> - bts
> - fixed
> - gp
>
> perhaps you can write it like so:
>
> (also see https://lkml.kernel.org/r/[email protected] )
>
> static void intel_pmu_enable_event(struct perf_event *event)
> {
> ...
> int idx = hwx->idx;
>
> if (idx < INTEL_PMC_IDX_FIXED) {
> intel_set_masks(event, idx);
> __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
> } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
> intel_set_masks(event, idx);
> intel_pmu_enable_fixed(event);
> } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
> intel_pmu_enable_bts(hwc->config);
> }
>
> /* nothing for INTEL_PMC_IDX_FIXED_VLBR */
> }
>
> That should sort the branches in order of: gp,fixed,bts,vlbr

Note the current order is: bts, pebs, fixed, gp.

Sure, let me try to refactor it in this way.
>
>> cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
>> cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
>> cpuc->intel_cp_status &= ~(1ull << hwc->idx);
>> @@ -2241,6 +2244,9 @@ static void intel_pmu_enable_event(struct perf_event *event)
>> return;
>> }
>>
>> + if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_VLBR))
>> + return;
>> +
>> if (event->attr.exclude_host)
>> cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
>> if (event->attr.exclude_guest)
> idem.
idem.
>
>> @@ -2595,6 +2601,15 @@ intel_bts_constraints(struct perf_event *event)
>> return NULL;
>> }
>>
>> +static struct event_constraint *
>> +intel_guest_event_constraints(struct perf_event *event)
>> +{
>> + if (unlikely(is_guest_lbr_event(event)))
>> + return &guest_lbr_constraint;
>> +
>> + return NULL;
>> +}
> This is a mis-nomer, it isn't just any guest_event

Sure, I'll rename it to intel_guest_lbr_event_constraints() instead of
using it as a unified interface to get all of the guest event
constraints.

>
>> +
>> static int intel_alt_er(int idx, u64 config)
>> {
>> int alt_idx = idx;
>> @@ -2785,6 +2800,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>> {
>> struct event_constraint *c;
>>
>> + c = intel_guest_event_constraints(event);
>> + if (c)
>> + return c;
>> +
>> c = intel_bts_constraints(event);
>> if (c)
>> return c;
>> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
>> index 1025bc6eb04f..9a62264a3068 100644
>> --- a/arch/x86/events/perf_event.h
>> +++ b/arch/x86/events/perf_event.h
>> @@ -969,6 +969,20 @@ static inline bool intel_pmu_has_bts(struct perf_event *event)
>> return intel_pmu_has_bts_period(event, hwc->sample_period);
>> }
>>
>> +static inline bool is_guest_event(struct perf_event *event)
>> +{
>> + if (event->attr.exclude_host && is_kernel_event(event))
>> + return true;
>> + return false;
>> +}
> I don't like this one, what if another in-kernel users generates an
> event with exclude_host set ?
Thanks for the clear attitude.

How about:
- remove the is_guest_event() to avoid potential misuse;
- move all checks into is_guest_lbr_event() and make it dedicated:

static inline bool is_guest_lbr_event(struct perf_event *event)
{
    if (is_kernel_event(event) &&
        event->attr.exclude_host && needs_branch_stack(event))
        return true;
    return false;
}

In this case, it's safe for other in-kernel users to generate an event
with exclude_host set and also use LBR (to count the guest, or nothing),
because the intel_guest_lbr_event_constraints() makes the LBR exclusive.

For this generic usage, I may rename:
- is_guest_lbr_event() to is_lbr_no_counter_event();
- intel_guest_lbr_event_constraints() to
intel_lbr_no_counter_event_constraints();

Is this acceptable to you?
If there is anything that needs to be improved, please let me know.

>> @@ -989,6 +1003,7 @@ void release_ds_buffers(void);
>> void reserve_ds_buffers(void);
>>
>> extern struct event_constraint bts_constraint;
>> +extern struct event_constraint guest_lbr_constraint;
>>
>> void intel_pmu_enable_bts(u64 config);
>>
>> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
>> index e018a1cf604c..674130aca75a 100644
>> --- a/arch/x86/include/asm/perf_event.h
>> +++ b/arch/x86/include/asm/perf_event.h
>> @@ -181,9 +181,19 @@ struct x86_pmu_capability {
>> #define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
>> #define GLOBAL_STATUS_ASIF BIT_ULL(60)
>> #define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
>> -#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
>> +#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
>> +#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
>> #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
>>
>> +/*
>> + * We model guest LBR event tracing as another fixed-mode PMC like BTS.
>> + *
>> + * We choose bit 58 (LBRS_FROZEN_BIT) which is used to indicate that the LBR
>> + * stack is frozen on a hardware PMI request in the PERF_GLOBAL_STATUS msr,
>> + * and the 59th PMC counter (if any) is not supposed to use it as well.
> Is this saying that STATUS.58 should never be set? I don't really
> understand the language.
My fault; let me state it more clearly:

We choose bit 58 because, unlike the other overflow conditions in the
PERF_GLOBAL_STATUS msr, it indicates the LBR stack frozen state, and it
will never be used for any actual fixed event.

>
>> + */
>> +#define INTEL_PMC_IDX_FIXED_VLBR GLOBAL_STATUS_LBRS_FROZEN_BIT
>> +
>> /*
>> * Adaptive PEBS v4
>> */

2020-04-10 03:11:40

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v9 04/10] perf/x86: Keep LBR stack unchanged on the host for guest LBR event

On 2020/4/10 0:45, Peter Zijlstra wrote:
> On Fri, Mar 13, 2020 at 10:16:10AM +0800, Like Xu wrote:
>> When a guest wants to use the LBR stack, its hypervisor creates a guest
>> LBR event and let host perf schedules it. A new 'int guest_lbr_enabled'
>> field in the "struct cpu_hw_events", is marked as true when perf adds
>> a guest LBR event and false on deletion.
>>
>> The LBR stack msrs are accessible to the guest when its guest LBR event
>> is scheduled in by the perf subsystem. Before scheduling out the event,
>> we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT. Otherwise,
>> some unexpected branch operations may interfere with guest behavior,
>> pollute LBR records, and even cause host branch data leakage. In addition,
>> the intel_pmu_lbr_read() on the host is also avoidable for guest usage.
>>
>> On v4 PMU or later, the LBR stack are frozen on the overflowed condition
>> if Freeze_LBR_On_PMI is true and resume recording via acking LBRS_FROZEN
>> to global status msr instead of re-enabling IA32_DEBUGCTL.LBR. So when a
>> guest LBR event is running, the host PMI handler has to keep LBRS_FROZEN
>> bit set (thus LBR being frozen) until the guest enables it. Otherwise,
>> when the guest enters non-root mode, the LBR will start recording and
>> the guest PMI handler code will also pollute the LBR stack.
>>
>> To ensure that guest LBR records are not lost during the context switch,
>> the BRANCH_CALL_STACK flag should be configured in the 'branch_sample_type'
>> for a guest LBR event because a callstack event could save/restore guest
>> unread records with the help of intel_pmu_lbr_sched_task() naturally.
>>
>> However, the regular host LBR perf event doesn't save/restore LBR_SELECT,
>> because it's configured in the LBR_enable() based on branch_sample_type.
>> So when a guest LBR is running, the guest LBR_SELECT may changes for its
>> own use and we have to add the LBR_SELECT save/restore to ensure what the
>> guest LBR_SELECT value doesn't get lost during the context switching.
> I had to read the patch before that made sense; I think it's mostly
> there, but it can use a little help.
Ah, thanks for your patience. It's good news for me that
you did read the main part of the proposed changes in this version.

>
>
>> @@ -691,8 +714,12 @@ void intel_pmu_lbr_read(void)
>> *
>> * This could be smarter and actually check the event,
>> * but this simple approach seems to work for now.
>> + *
>> + * And there is no need to read lbr here if a guest LBR event
> There's 'lbr' and 'LBR' in the same sentence
Yes, I'll fix it.
>
>> + * is using it, because the guest will read them on its own.
>> */
>> - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
>> + if (!cpuc->lbr_users || cpuc->guest_lbr_enabled ||
>> + cpuc->lbr_users == cpuc->lbr_pebs_users)
> indent fail
Yes, I'll fix it.
>
>> return;
>>
>> if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)

2020-04-17 08:42:27

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v9 03/10] perf/x86: Add constraint to create guest LBR event without hw counter

Hi Peter,

On 2020/4/10 11:03, Xu, Like wrote:
> Hi Peter,
>
> First of all, thanks for your comments!
>
> On 2020/4/10 0:37, Peter Zijlstra wrote:
> [...]
>>>   +static inline bool is_guest_event(struct perf_event *event)
>>> +{
>>> +    if (event->attr.exclude_host && is_kernel_event(event))
>>> +        return true;
>>> +    return false;
>>> +}
>> I don't like this one, what if another in-kernel users generates an
>> event with exclude_host set ?
> Thanks for the clear attitude.
>
> How about:
> - remove the is_guest_event() to avoid potential misuse;
> - move all checks into is_guest_lbr_event() and make it dedicated:
>
> static inline bool is_guest_lbr_event(struct perf_event *event)
> {
>     if (is_kernel_event(event) &&
>         event->attr.exclude_host && needs_branch_stack(event))
>         return true;
>     return false;
> }
>
> In this case, it's safe to generate an event with exclude_host set
> and also use LBR to count guest or nothing for other in-kernel users
> because the intel_guest_lbr_event_constraints() makes LBR exclusive.
>
> For this generic usage, I may rename:
> - is_guest_lbr_event() to is_lbr_no_counter_event();
> - intel_guest_lbr_event_constraints() to
> intel_lbr_no_counter_event_constraints();
>
> Is this acceptable to you?
> If there is anything needs to be improved, please let me know.
Do you have any preference for this?

If you have more comments on the general idea or the code details,
please let me know. For example, you may take a look at the interface
named intel_pmu_create_lbr_event() in "[PATCH v9 07/10] KVM: x86/pmu:
Add LBR feature emulation via guest LBR event".

If not, I'll spin the next version based on your current feedback.

Thanks,
Like Xu
>
>>> @@ -989,6 +1003,7 @@ void release_ds_buffers(void);
>>>   void reserve_ds_buffers(void);
>>>     extern struct event_constraint bts_constraint;
>>> +extern struct event_constraint guest_lbr_constraint;
>>>     void intel_pmu_enable_bts(u64 config);
>>>   diff --git a/arch/x86/include/asm/perf_event.h
>>> b/arch/x86/include/asm/perf_event.h
>>> index e018a1cf604c..674130aca75a 100644
>>> --- a/arch/x86/include/asm/perf_event.h
>>> +++ b/arch/x86/include/asm/perf_event.h
>>> @@ -181,9 +181,19 @@ struct x86_pmu_capability {
>>>   #define GLOBAL_STATUS_UNC_OVF                BIT_ULL(61)
>>>   #define GLOBAL_STATUS_ASIF                BIT_ULL(60)
>>>   #define GLOBAL_STATUS_COUNTERS_FROZEN            BIT_ULL(59)
>>> -#define GLOBAL_STATUS_LBRS_FROZEN            BIT_ULL(58)
>>> +#define GLOBAL_STATUS_LBRS_FROZEN_BIT            58
>>> +#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
>>>   #define GLOBAL_STATUS_TRACE_TOPAPMI            BIT_ULL(55)
>>>   +/*
>>> + * We model guest LBR event tracing as another fixed-mode PMC like BTS.
>>> + *
>>> + * We choose bit 58 (LBRS_FROZEN_BIT) which is used to indicate that
>>> the LBR
>>> + * stack is frozen on a hardware PMI request in the PERF_GLOBAL_STATUS
>>> msr,
>>> + * and the 59th PMC counter (if any) is not supposed to use it as well.
>> Is this saying that STATUS.58 should never be set? I don't really
>> understand the language.
> My fault, and let me state it more clearly:
>
> We choose bit 58 because it indicates the LBR-stack-frozen state, unlike
> the other overflow conditions in the PERF_GLOBAL_STATUS msr,
> and it will never be used by any actual fixed counter event.
>
>>
>>> + */
>>> +#define INTEL_PMC_IDX_FIXED_VLBR GLOBAL_STATUS_LBRS_FROZEN_BIT
>>> +
>>>   /*
>>>    * Adaptive PEBS v4
>>>    */
>

2020-04-17 10:34:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v9 03/10] perf/x86: Add constraint to create guest LBR event without hw counter

On Fri, Apr 10, 2020 at 11:03:33AM +0800, Xu, Like wrote:
> On 2020/4/10 0:37, Peter Zijlstra wrote:

> > That should sort the branches in order of: gp,fixed,bts,vlbr
>
> Note the current order is: bts, pebs, fixed, gp.

Yeah, and that means that gp (which is, I think, the most common case) is
the most expensive.

> Sure, let me try to refactor it in this way.

Thanks!

> > > +static inline bool is_guest_event(struct perf_event *event)
> > > +{
> > > +	if (event->attr.exclude_host && is_kernel_event(event))
> > > +		return true;
> > > +	return false;
> > > +}
> > I don't like this one; what if another in-kernel user generates an
> > event with exclude_host set?
> Thanks for the clear attitude.
>
> How about:
> - remove the is_guest_event() to avoid potential misuse;
> - move all checks into is_guest_lbr_event() and make it dedicated:
>
> static inline bool is_guest_lbr_event(struct perf_event *event)
> {
>     if (is_kernel_event(event) &&
>         event->attr.exclude_host && needs_branch_stack(event))
>         return true;
>     return false;
> }
>
> In this case, it's safe for other in-kernel users to generate an event with
> exclude_host set that uses LBR to count the guest (or nothing at all),
> because intel_guest_lbr_event_constraints() makes the LBR exclusive.
>
> For this generic usage, I may rename:
> - is_guest_lbr_event() to is_lbr_no_counter_event();
> - intel_guest_lbr_event_constraints() to
> intel_lbr_no_counter_event_constraints();
>
> Is this acceptable to you?

I suppose so, please put a comment on top of that function though, so we
don't forget.

> > > @@ -181,9 +181,19 @@ struct x86_pmu_capability {
> > > #define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
> > > #define GLOBAL_STATUS_ASIF BIT_ULL(60)
> > > #define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
> > > -#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
> > > +#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
> > > +#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
> > > #define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
> > > +/*
> > > + * We model guest LBR event tracing as another fixed-mode PMC like BTS.
> > > + *
> > > + * We choose bit 58 (LBRS_FROZEN_BIT) which is used to indicate that the LBR
> > > + * stack is frozen on a hardware PMI request in the PERF_GLOBAL_STATUS msr,
> > > + * and the 59th PMC counter (if any) is not supposed to use it as well.
> > Is this saying that STATUS.58 should never be set? I don't really
> > understand the language.
> My fault, and let me state it more clearly:
>
> We choose bit 58 because it indicates the LBR-stack-frozen state, unlike
> the other overflow conditions in the PERF_GLOBAL_STATUS msr,
> and it will never be used by any actual fixed counter event.

That's only with v4, also we unconditionally mask that bit in
handle_pmi_common(), so it'll never be set in the overflow handling.

That's all fine, I suppose; all you want is a means of programming the LBR
registers, since we don't actually do anything with them in the host context.
Please write a more elaborate comment here.