Last Branch Recording (LBR) is a performance monitor unit (PMU) feature
on Intel CPUs that captures branch-related information. This patch series
enables the feature for KVM guests.
Here is a summary of the fundamental methods we use:
1) the LBR feature is enabled per guest via QEMU setting of
KVM_CAP_X86_GUEST_LBR;
2) the LBR stack is passed through to the guest for direct accesses after
the guest's first access to any of the lbr related MSRs;
3) when the guest uses the LBR feature with the user callstack mode, the
host helps save/restore the LBR stack when the vCPU is scheduled
out/in.
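The lazy save policy of 2) and 3) can be modeled in plain C. The sketch below is a user-space illustration of the state transitions only; the struct and function names (vcpu_model, on_lbr_msr_access, on_sched_in) are invented for this example and are not part of the patches:

```c
#include <stdbool.h>

/* Hypothetical model of the per-vCPU lazy-save state (illustrative only). */
struct vcpu_model {
	bool lbr_used;         /* an lbr MSR was touched in this time slice */
	bool msrs_intercepted; /* lbr MSRs currently trapped by KVM */
	bool save_enabled;     /* host perf event for save/restore exists */
};

/* First guest access to an lbr MSR: pass the stack through and, for the
 * user callstack mode, arm the host-side save/restore. */
static void on_lbr_msr_access(struct vcpu_model *v, bool user_callstack)
{
	v->lbr_used = true;
	v->msrs_intercepted = false;
	v->save_enabled = user_callstack;
}

/* vCPU scheduled in: re-intercept the MSRs if the lbr was used in the
 * last slice; otherwise retire the save/restore support once the guest
 * has cleared DEBUGCTLMSR_LBR. */
static void on_sched_in(struct vcpu_model *v, bool guest_debugctl_lbr)
{
	if (v->lbr_used) {
		v->lbr_used = false;
		v->msrs_intercepted = true;
	} else if (v->save_enabled && !guest_debugctl_lbr) {
		v->save_enabled = false;
	}
}
```

The point of the model: interception is re-armed on every schedule-in, so a time slice in which the guest never touches an lbr MSR is the signal to tear the host perf event down.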
ChangeLog:
v2->v3:
- replace the pv approach with a lazy save approach for saving the lbr
stack on vCPU switching;
- destroy the host side perf event when the guest is torn down;
- remove the unnecessary event->pmu->stop() before calling
perf_event_release_kernel().
v1->v2:
- add the per guest LBR capability, KVM_CAP_X86_GUEST_LBR;
- save/restore the LBR stack conditionally on the vCPU thread context
switching, instead of on VMX transitions;
- expose MSR_IA32_PERF_CAPABILITIES to the guest.
Like Xu (1):
KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr
stack
Wei Wang (4):
perf/x86: add a function to get the lbr stack
KVM/x86: KVM_CAP_X86_GUEST_LBR
KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
KVM/x86/lbr: lazy save the guest lbr stack
arch/x86/events/intel/lbr.c | 49 ++++++++++++-
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/kvm_host.h | 7 ++
arch/x86/include/asm/perf_event.h | 22 ++++++
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/pmu.h | 8 +++
arch/x86/kvm/pmu_intel.c | 41 +++++++++++
arch/x86/kvm/vmx.c | 144 ++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 10 +++
include/uapi/linux/kvm.h | 1 +
10 files changed, 281 insertions(+), 4 deletions(-)
--
2.7.4
When the vCPU is scheduled in:
- if the lbr feature was used in the last vCPU time slice, set the lbr
  stack MSRs to be intercepted, so that the host can capture whether the
  lbr feature will be used in this time slice;
- if the lbr feature wasn't used in the last vCPU time slice, disable
  the host-side support for guest lbr stack switching.
Upon the first access to one of the lbr related MSRs (since the vCPU was
scheduled in):
- record that the guest has used the lbr;
- create a host perf event to help save/restore the guest lbr stack if
the guest uses the user callstack mode lbr stack;
- pass the stack through to the guest.
Suggested-by: Andi Kleen <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/events/intel/lbr.c | 16 +++++
arch/x86/include/asm/kvm_host.h | 4 ++
arch/x86/include/asm/perf_event.h | 6 ++
arch/x86/kvm/pmu.h | 5 ++
arch/x86/kvm/vmx.c | 137 ++++++++++++++++++++++++++++++++++++++
5 files changed, 168 insertions(+)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 915fcc3..a260015 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -64,6 +64,7 @@ static const enum {
#define LBR_NO_INFO (1ULL << LBR_NO_INFO_BIT)
#define LBR_PLM (LBR_KERNEL | LBR_USER)
+#define LBR_USER_CALLSTACK (LBR_CALL_STACK | LBR_USER)
#define LBR_SEL_MASK 0x3ff /* valid bits in LBR_SELECT */
#define LBR_NOT_SUPP -1 /* LBR filter not supported */
@@ -1283,6 +1284,21 @@ void intel_pmu_lbr_init_knl(void)
}
/**
+ * lbr_select_user_callstack - check if the user callstack mode is set
+ *
+ * @lbr_select: the lbr select msr
+ *
+ * Returns: true if the msr is configured for the user callstack mode.
+ * Otherwise, false.
+ *
+ */
+bool lbr_select_user_callstack(u64 lbr_select)
+{
+ return !!(lbr_select & LBR_USER_CALLSTACK);
+}
+EXPORT_SYMBOL_GPL(lbr_select_user_callstack);
+
+/**
* perf_get_lbr_stack - get the lbr stack related MSRs
*
* @stack: the caller's memory to get the lbr stack
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdcac01..41b4d29 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -730,6 +730,10 @@ struct kvm_vcpu_arch {
/* Flush the L1 Data cache for L1TF mitigation on VMENTER */
bool l1tf_flush_l1d;
+ /* Indicate if the guest is using lbr with the user callstack mode */
+ bool lbr_user_callstack;
+ /* Indicate if the lbr msrs were accessed in this vCPU time slice */
+ bool lbr_used;
};
struct kvm_lpage_info {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e893a69..2d7ae55 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -277,6 +277,7 @@ struct perf_lbr_stack {
unsigned long info;
};
+extern bool lbr_select_user_callstack(u64 msr_lbr_select);
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
extern int perf_get_lbr_stack(struct perf_lbr_stack *stack);
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
@@ -288,6 +289,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
return NULL;
}
+static inline bool lbr_select_user_callstack(u64 msr_lbr_select)
+{
+ return false;
+}
+
static inline int perf_get_lbr_stack(struct perf_lbr_stack *stack)
{
return -1;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index e872aed..94f0624 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -102,6 +102,11 @@ static inline struct kvm_pmc *get_fixed_pmc(struct kvm_pmu *pmu, u32 msr)
return NULL;
}
+static inline bool intel_pmu_save_guest_lbr_enabled(struct kvm_vcpu *vcpu)
+{
+ return !!vcpu_to_pmu(vcpu)->guest_lbr_event;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 92705b5..ae20563 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1282,6 +1282,9 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
u32 msr, int type);
+static void
+__always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr,
+ int type, bool value);
static DEFINE_PER_CPU(struct vmcs *, vmxarea);
static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -4056,6 +4059,120 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
return 0;
}
+static void vmx_set_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
+{
+ unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+ struct perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+ int nr = stack->nr;
+ int i;
+
+ vmx_set_intercept_for_msr(msr_bitmap, stack->tos, MSR_TYPE_RW, set);
+ for (i = 0; i < nr; i++) {
+ vmx_set_intercept_for_msr(msr_bitmap, stack->from + i,
+ MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap, stack->to + i,
+ MSR_TYPE_RW, set);
+ if (stack->info)
+ vmx_set_intercept_for_msr(msr_bitmap, stack->info + i,
+ MSR_TYPE_RW, set);
+ }
+}
+
+static inline bool msr_is_lbr_stack(struct kvm_vcpu *vcpu, u32 index)
+{
+ struct perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
+ int nr = stack->nr;
+
+ return !!(index == stack->tos ||
+ (index >= stack->from && index < stack->from + nr) ||
+ (index >= stack->to && index < stack->to + nr) ||
+ (index >= stack->info && index < stack->info + nr));
+}
+
+static bool guest_get_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ u32 index = msr_info->index;
+ bool ret = false;
+
+ switch (index) {
+ case MSR_IA32_DEBUGCTLMSR:
+ msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ ret = true;
+ break;
+ case MSR_LBR_SELECT:
+ ret = true;
+ rdmsrl(index, msr_info->data);
+ break;
+ default:
+ if (msr_is_lbr_stack(vcpu, index)) {
+ ret = true;
+ rdmsrl(index, msr_info->data);
+ }
+ }
+
+ return ret;
+}
+
+static bool guest_set_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+ u32 index = msr_info->index;
+ u64 data = msr_info->data;
+ bool ret = false;
+
+ switch (index) {
+ case MSR_IA32_DEBUGCTLMSR:
+ ret = true;
+ /*
+ * Currently, only FREEZE_LBRS_ON_PMI and DEBUGCTLMSR_LBR are
+ * supported.
+ */
+ data &= (DEBUGCTLMSR_FREEZE_LBRS_ON_PMI | DEBUGCTLMSR_LBR);
+ vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+ break;
+ case MSR_LBR_SELECT:
+ ret = true;
+ vcpu->arch.lbr_user_callstack = lbr_select_user_callstack(data);
+ wrmsrl(index, msr_info->data);
+ break;
+ default:
+ if (msr_is_lbr_stack(vcpu, index)) {
+ ret = true;
+ wrmsrl(index, msr_info->data);
+ }
+ }
+
+ return ret;
+}
+
+static bool guest_access_lbr_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info,
+ bool set)
+{
+ bool ret = false;
+
+ if (!vcpu->kvm->arch.guest_lbr_enabled)
+ return false;
+
+ if (set)
+ ret = guest_set_lbr_msr(vcpu, msr_info);
+ else
+ ret = guest_get_lbr_msr(vcpu, msr_info);
+
+ if (ret) {
+ vcpu->arch.lbr_used = true;
+ vmx_set_intercept_for_lbr_msrs(vcpu, false);
+ if (vcpu->arch.lbr_user_callstack)
+ intel_pmu_enable_save_guest_lbr(vcpu);
+ else
+ intel_pmu_disable_save_guest_lbr(vcpu);
+ }
+
+ return ret;
+}
+
/*
* Reads an msr value (of 'msr_index') into 'pdata'.
* Returns 0 on success, non-0 otherwise.
@@ -4147,6 +4264,8 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
msr_info->data = msr->data;
break;
}
+ if (guest_access_lbr_msr(vcpu, msr_info, false))
+ break;
return kvm_get_msr_common(vcpu, msr_info);
}
@@ -4341,6 +4460,8 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
break;
}
+ if (guest_access_lbr_msr(vcpu, msr_info, true))
+ break;
ret = kvm_set_msr_common(vcpu, msr_info);
}
@@ -10946,6 +11067,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
if (enable_pml)
vmx_destroy_pml_buffer(vmx);
+ intel_pmu_disable_save_guest_lbr(vcpu);
free_vpid(vmx->vpid);
leave_guest_mode(vcpu);
vmx_free_vcpu_nested(vcpu);
@@ -13488,6 +13610,21 @@ static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
if (!kvm_pause_in_guest(vcpu->kvm))
shrink_ple_window(vcpu);
+
+ if (vcpu->arch.lbr_used) {
+ vcpu->arch.lbr_used = false;
+ vmx_set_intercept_for_lbr_msrs(vcpu, true);
+ } else if (intel_pmu_save_guest_lbr_enabled(vcpu)) {
+ u64 guest_debugctl;
+
+ /*
+ * The lbr feature wasn't used during the last vCPU time
+ * slice, so it's time to disable the vCPU side save/restore.
+ */
+ guest_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ if (!(guest_debugctl & DEBUGCTLMSR_LBR))
+ intel_pmu_disable_save_guest_lbr(vcpu);
+ }
}
static void vmx_slot_enable_log_dirty(struct kvm *kvm,
--
2.7.4
From: Like Xu <[email protected]>
This patch adds APIs to enable and disable host-side saving and
restoring of the guest lbr stack on vCPU switching. To enable that, the
host creates a perf event for the vCPU, with the event attributes set to
user callstack mode lbr, so that all the conditions are met in the host
perf subsystem to save the lbr stack on a thread switch.
The host-side lbr perf event is created only for the purpose of saving
and restoring the lbr stack. There is no need to enable the lbr
functionality for this perf event, because the feature is actually used
by the guest.
So, a guest_lbr boolean control is added to cpuc to indicate whether the
lbr perf event was created for the guest. When the perf subsystem handles
this event (e.g. enabling the lbr, or reading the lbr stack on a PMI) and
finds that it is for the guest, it simply returns, because all we need
from the perf event is context switch support for the lbr stack.
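The effect of the guest_lbr flag on the lbr callbacks can be summarized by a small stand-alone model. The struct and helper names below are invented for illustration; in the real code the flag lives in cpu_hw_events and the checks sit in intel_pmu_lbr_enable_all(), intel_pmu_lbr_disable_all() and intel_pmu_lbr_read():

```c
#include <stdbool.h>

/* Minimal stand-in for the relevant cpu_hw_events fields. */
struct cpuc_model {
	int lbr_users;
	bool guest_lbr; /* lbr event created only for guest save/restore */
};

/* Mirrors the check added to intel_pmu_lbr_enable_all()/disable_all():
 * the hardware lbr is only touched for genuine host users. */
static bool lbr_hw_active(const struct cpuc_model *c)
{
	return c->lbr_users && !c->guest_lbr;
}

/* Mirrors the early return added to intel_pmu_lbr_read(): nothing to
 * read into host buffers when the event only exists for the guest. */
static bool lbr_should_read(const struct cpuc_model *c)
{
	return c->lbr_users && !c->guest_lbr;
}
```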
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/events/intel/lbr.c | 10 +++++++---
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.h | 3 +++
arch/x86/kvm/pmu_intel.c | 41 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 53 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index c81f160..915fcc3 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -462,6 +462,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;
+ if (event->attr.exclude_host)
+ cpuc->guest_lbr = true;
+
cpuc->br_sel = event->hw.branch_reg.reg;
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -507,6 +510,7 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}
+ cpuc->guest_lbr = false;
cpuc->lbr_users--;
WARN_ON_ONCE(cpuc->lbr_users < 0);
perf_sched_cb_dec(event->ctx->pmu);
@@ -516,7 +520,7 @@ void intel_pmu_lbr_enable_all(bool pmi)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !cpuc->guest_lbr)
__intel_pmu_lbr_enable(pmi);
}
@@ -524,7 +528,7 @@ void intel_pmu_lbr_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !cpuc->guest_lbr)
__intel_pmu_lbr_disable();
}
@@ -658,7 +662,7 @@ void intel_pmu_lbr_read(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (!cpuc->lbr_users)
+ if (!cpuc->lbr_users || cpuc->guest_lbr)
return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1562863..a91fdef 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -223,6 +223,7 @@ struct cpu_hw_events {
*/
u64 intel_ctrl_guest_mask;
u64 intel_ctrl_host_mask;
+ bool guest_lbr;
struct perf_guest_switch_msr guest_switch_msrs[X86_PMC_IDX_MAX];
/*
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a46e31..fdcac01 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -432,6 +432,7 @@ struct kvm_pmu {
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
struct irq_work irq_work;
u64 reprogram_pmi;
+ struct perf_event *guest_lbr_event;
};
struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ba8898e..e872aed 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -120,6 +120,9 @@ void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
+extern int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu);
+extern void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu);
+
extern struct kvm_pmu_ops intel_pmu_ops;
extern struct kvm_pmu_ops amd_pmu_ops;
#endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
index 5ab4a36..97a29d7 100644
--- a/arch/x86/kvm/pmu_intel.c
+++ b/arch/x86/kvm/pmu_intel.c
@@ -342,6 +342,47 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}
+int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event;
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(attr),
+ .pinned = true,
+ .exclude_host = true,
+ .sample_type = PERF_SAMPLE_BRANCH_STACK,
+ .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ PERF_SAMPLE_BRANCH_USER |
+ PERF_SAMPLE_BRANCH_KERNEL,
+ };
+
+ if (pmu->guest_lbr_event)
+ return 0;
+
+ event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
+ NULL);
+ if (IS_ERR(event)) {
+ pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
+ return -ENOENT;
+ }
+ pmu->guest_lbr_event = event;
+
+ return 0;
+}
+
+void intel_pmu_disable_save_guest_lbr(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event = pmu->guest_lbr_event;
+
+ if (!event)
+ return;
+
+ perf_event_release_kernel(event);
+ pmu->guest_lbr_event = NULL;
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
--
2.7.4
The LBR stack MSRs are architecture specific. The perf subsystem has
already assigned the abstracted MSR values based on the CPU architecture.
This patch enables a caller outside the perf subsystem to get the LBR
stack info. This is useful for hypervisors to prepare the lbr feature
for the guest.
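A user-space sketch of what the new helper hands back to its caller. The x86_pmu fields are mocked here; the MSR numbers in the test are the architectural Skylake-style values and serve only as example inputs:

```c
#include <stdint.h>

#define LBR_FORMAT_INFO	0x05	/* mock value for this sketch only */
#define MSR_LBR_INFO_0	0x0dc0

struct lbr_stack_model {
	int nr;
	unsigned long tos, from, to, info;
};

/* Stand-in for the x86_pmu fields perf_get_lbr_stack() reads. */
struct pmu_model {
	int lbr_nr;
	unsigned long lbr_tos, lbr_from, lbr_to;
	unsigned int lbr_format;
};

/* Mirrors the logic of perf_get_lbr_stack(): copy the MSR layout and
 * report the LBR_INFO bank only when the LBR format provides one. */
static void get_lbr_stack(const struct pmu_model *pmu,
			  struct lbr_stack_model *stack)
{
	stack->nr = pmu->lbr_nr;
	stack->tos = pmu->lbr_tos;
	stack->from = pmu->lbr_from;
	stack->to = pmu->lbr_to;
	stack->info = (pmu->lbr_format == LBR_FORMAT_INFO) ?
		      MSR_LBR_INFO_0 : 0;
}
```

A hypervisor caller then knows, from one call, every MSR index it must intercept or pass through, and a zero info field tells it the LBR_INFO bank does not exist on this CPU.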
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/events/intel/lbr.c | 23 +++++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 14 ++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index c88ed39..c81f160 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1277,3 +1277,26 @@ void intel_pmu_lbr_init_knl(void)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
}
+
+/**
+ * perf_get_lbr_stack - get the lbr stack related MSRs
+ *
+ * @stack: the caller's memory to get the lbr stack
+ *
+ * Returns: 0 indicates that the lbr stack has been successfully obtained.
+ */
+int perf_get_lbr_stack(struct perf_lbr_stack *stack)
+{
+ stack->nr = x86_pmu.lbr_nr;
+ stack->tos = x86_pmu.lbr_tos;
+ stack->from = x86_pmu.lbr_from;
+ stack->to = x86_pmu.lbr_to;
+
+ if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
+ stack->info = MSR_LBR_INFO_0;
+ else
+ stack->info = 0;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(perf_get_lbr_stack);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 12f5408..84cc8cb 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -267,7 +267,16 @@ struct perf_guest_switch_msr {
u64 host, guest;
};
+struct perf_lbr_stack {
+ int nr;
+ unsigned long tos;
+ unsigned long from;
+ unsigned long to;
+ unsigned long info;
+};
+
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int perf_get_lbr_stack(struct perf_lbr_stack *stack);
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
#else
@@ -277,6 +286,11 @@ static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
return NULL;
}
+static inline int perf_get_lbr_stack(struct perf_lbr_stack *stack)
+{
+ return -1;
+}
+
static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
{
memset(cap, 0, sizeof(*cap));
--
2.7.4
Introduce KVM_CAP_X86_GUEST_LBR to allow per-VM enabling of the guest
lbr feature.
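From userspace (e.g. QEMU), the capability would be enabled with the KVM_ENABLE_CAP vm ioctl, roughly as sketched below. The struct is a local mirror of struct kvm_enable_cap from <linux/kvm.h>; the actual ioctl call is shown only in a comment, since it needs a live VM fd from /dev/kvm:

```c
#include <stdint.h>
#include <string.h>

#define KVM_CAP_X86_GUEST_LBR 159	/* value added by this patch */

/* Local mirror of struct kvm_enable_cap from the kvm uapi header. */
struct kvm_enable_cap_model {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t  pad[64];
};

static void prepare_guest_lbr_cap(struct kvm_enable_cap_model *c)
{
	memset(c, 0, sizeof(*c));
	c->cap = KVM_CAP_X86_GUEST_LBR;
	/* With a real VM fd: ioctl(vm_fd, KVM_ENABLE_CAP, c);
	 * returns 0 on success, or fails with -EINVAL when
	 * perf_get_lbr_stack() cannot be satisfied on the host. */
}
```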
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/x86.c | 10 ++++++++++
include/uapi/linux/kvm.h | 1 +
3 files changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8e90488..4a46e31 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -825,6 +825,8 @@ struct kvm_arch {
bool mwait_in_guest;
bool hlt_in_guest;
bool pause_in_guest;
+ bool guest_lbr_enabled;
+ struct perf_lbr_stack lbr_stack;
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 542f631..8fdcfa9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2927,6 +2927,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_SPLIT_IRQCHIP:
case KVM_CAP_IMMEDIATE_EXIT:
case KVM_CAP_GET_MSR_FEATURES:
+ case KVM_CAP_X86_GUEST_LBR:
r = 1;
break;
case KVM_CAP_SYNC_REGS:
@@ -4350,6 +4351,15 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.pause_in_guest = true;
r = 0;
break;
+ case KVM_CAP_X86_GUEST_LBR:
+ r = -EINVAL;
+ if (perf_get_lbr_stack(&kvm->arch.lbr_stack) < 0) {
+ pr_err("Failed to enable the guest lbr feature\n");
+ break;
+ }
+ kvm->arch.guest_lbr_enabled = true;
+ r = 0;
+ break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 07548de..7cf9bc0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -952,6 +952,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_HPAGE_1M 156
#define KVM_CAP_NESTED_STATE 157
#define KVM_CAP_ARM_INJECT_SERROR_ESR 158
+#define KVM_CAP_X86_GUEST_LBR 159
#ifdef KVM_CAP_IRQ_ROUTING
--
2.7.4
Bits [5:0] of MSR_IA32_PERF_CAPABILITIES indicate the format of the
addresses stored in the LBR stack. Expose those bits to the guest
when the guest lbr feature is enabled.
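The masking is a plain bit-field extraction; a quick sketch, where PERF_CAP_MASK_LBR_FMT matches the value the patch defines and the sample MSR value in the test is made up:

```c
#include <stdint.h>

#define PERF_CAP_MASK_LBR_FMT 0x3f	/* bits [5:0]: LBR format */

/* What the guest would observe in MSR_IA32_PERF_CAPABILITIES when the
 * guest lbr feature is enabled: only the LBR format field survives. */
static uint64_t guest_perf_capabilities(uint64_t host_caps)
{
	return host_caps & PERF_CAP_MASK_LBR_FMT;
}
```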
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Andi Kleen <[email protected]>
---
arch/x86/include/asm/perf_event.h | 2 ++
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx.c | 7 +++++++
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 84cc8cb..e893a69 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -79,6 +79,8 @@
#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6
#define ARCH_PERFMON_EVENTS_COUNT 7
+#define PERF_CAP_MASK_LBR_FMT 0x3f
+
/*
* Intel "Architectural Performance Monitoring" CPUID
* detection/enumeration details:
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7bcfa61..3b8a57b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -365,7 +365,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
- F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+ F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) |
F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 533a327..92705b5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4134,6 +4134,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
!guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
return 1;
/* Otherwise falls through */
+ case MSR_IA32_PERF_CAPABILITIES:
+ if (!boot_cpu_has(X86_FEATURE_PDCM))
+ return 1;
+ msr_info->data = native_read_msr(MSR_IA32_PERF_CAPABILITIES);
+ if (vcpu->kvm->arch.guest_lbr_enabled)
+ msr_info->data &= PERF_CAP_MASK_LBR_FMT;
+ break;
default:
msr = find_msr_entry(vmx, msr_info->index);
if (msr) {
--
2.7.4
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Wei Wang
> Sent: Thursday, September 20, 2018 6:06 PM
> To: [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Gonglei (Arei) <[email protected]>
> Subject: [PATCH v3 0/5] Guest LBR Enabling
>
> Last Branch Recording (LBR) is a performance monitor unit (PMU) feature
> on Intel CPUs that captures branch related info. This patch series enables
> this feature to KVM guests.
>
> Here is a conclusion of the fundamental methods that we use:
> 1) the LBR feature is enabled per guest via QEMU setting of
> KVM_CAP_X86_GUEST_LBR;
Did you send the QEMU parts for guest LBR support?
Thanks,
-Gonglei
> 2) the LBR stack is passed through to the guest for direct accesses after
> the guest's first access to any of the lbr related MSRs;
> 3) When the guest uses the LBR feature with the user callstack mode, the
> host will help save/resotre the LBR stack when the vCPU is scheduled
> out/in.
>
> ChangeLog:
> v2->v3:
> - replaces the pv approach with a lazy save approach to saving the lbr
> stack on vCPU switching;
> - destroy the host side perf event if the guest is torn down;
> - remove the unnecessary event->pmu->stop() before calling
> perf_event_release_kernel().
> v1->v2:
> - add the per guest LBR capability, KVM_CAP_X86_GUEST_LBR;
> - save/restore the LBR stack conditionally on the vCPU thread context
> switching, instead of on VMX transitions;
> - expose MSR_IA32_PERF_CAPABILITIES to the guest.
>
> Like Xu (1):
> KVM/x86/vPMU: Add APIs to support host save/restore the guest lbr
> stack
>
> Wei Wang (4):
> perf/x86: add a function to get the lbr stack
> KVM/x86: KVM_CAP_X86_GUEST_LBR
> KVM/x86: expose MSR_IA32_PERF_CAPABILITIES to the guest
> KVM/x86/lbr: lazy save the guest lbr stack
>
> arch/x86/events/intel/lbr.c | 49 ++++++++++++-
> arch/x86/events/perf_event.h | 1 +
> arch/x86/include/asm/kvm_host.h | 7 ++
> arch/x86/include/asm/perf_event.h | 22 ++++++
> arch/x86/kvm/cpuid.c | 2 +-
> arch/x86/kvm/pmu.h | 8 +++
> arch/x86/kvm/pmu_intel.c | 41 +++++++++++
> arch/x86/kvm/vmx.c | 144
> ++++++++++++++++++++++++++++++++++++++
> arch/x86/kvm/x86.c | 10 +++
> include/uapi/linux/kvm.h | 1 +
> 10 files changed, 281 insertions(+), 4 deletions(-)
>
> --
> 2.7.4
On Thu, Sep 20, 2018 at 06:05:55PM +0800, Wei Wang wrote:
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index c88ed39..c81f160 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -1277,3 +1277,26 @@ void intel_pmu_lbr_init_knl(void)
> if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
> x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
> }
> +
> +/**
> + * perf_get_lbr_stack - get the lbr stack related MSRs
> + *
> + * @stack: the caller's memory to get the lbr stack
> + *
> + * Returns: 0 indicates that the lbr stack has been successfully obtained.
> + */
> +int perf_get_lbr_stack(struct perf_lbr_stack *stack)
> +{
> + stack->nr = x86_pmu.lbr_nr;
> + stack->tos = x86_pmu.lbr_tos;
> + stack->from = x86_pmu.lbr_from;
> + stack->to = x86_pmu.lbr_to;
> +
> + if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> + stack->info = MSR_LBR_INFO_0;
> + else
> + stack->info = 0;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(perf_get_lbr_stack);
Blergh, I know KVM is a module but it really sucks having to export
everything :/
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 12f5408..84cc8cb 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -267,7 +267,16 @@ struct perf_guest_switch_msr {
> u64 host, guest;
> };
>
> +struct perf_lbr_stack {
> + int nr;
Do we need a negative number of LBR entries?
> + unsigned long tos;
> + unsigned long from;
> + unsigned long to;
> + unsigned long info;
These are all MSR values, that can be 'unsigned int', right? If so,
please also fix struct x86_pmu for that.
> +};
> +
> extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
> +extern int perf_get_lbr_stack(struct perf_lbr_stack *stack);
> extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
> extern void perf_check_microcode(void);
> #else
I would like to use the x86_perf namespace or something along those
lines, this is very much not a generic perf interface -- it is very much
x86 (or rather even Intel) specific.
On Thu, Sep 20, 2018 at 06:05:55PM +0800, Wei Wang wrote:
> The LBR stack MSRs are architecturally specific. The perf subsystem has
> already assigned the abstracted MSR values based on the CPU architecture.
>
> This patch enables a caller outside the perf subsystem to get the LBR
> stack info. This is useful for hyperviosrs to prepare the lbr feature
> for the guest.
>
> Signed-off-by: Like Xu <[email protected]>
> Signed-off-by: Wei Wang <[email protected]>
Also, that SoB chain is broken. Since you're the author (per From), your
SoB should be first, also since you're the one sending it, your SoB
should also be last.
Nowhere did Like Xu handle the patch so he shouldn't be having a SoB at
all.
> -----Original Message-----
> From: Wei Wang [mailto:[email protected]]
> Sent: Thursday, September 20, 2018 6:06 PM
> To: [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Gonglei (Arei) <[email protected]>
> Subject: [PATCH v3 5/5] KVM/x86/lbr: lazy save the guest lbr stack
>
> When the vCPU is scheduled in:
> - if the lbr feature was used in the last vCPU time slice, set the lbr
> stack to be interceptible, so that the host can capture whether the
> lbr feature will be used in this time slice;
> - if the lbr feature wasn't used in the last vCPU time slice, disable
> the vCPU support of the guest lbr switching.
>
> Upon the first access to one of the lbr related MSRs (since the vCPU was
> scheduled in):
> - record that the guest has used the lbr;
> - create a host perf event to help save/restore the guest lbr stack if
> the guest uses the user callstack mode lbr stack;
> - pass the stack through to the guest.
>
> Suggested-by: Andi Kleen <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> Signed-off-by: Wei Wang <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Andi Kleen <[email protected]>
> ---
> arch/x86/events/intel/lbr.c | 16 +++++
> arch/x86/include/asm/kvm_host.h | 4 ++
> arch/x86/include/asm/perf_event.h | 6 ++
> arch/x86/kvm/pmu.h | 5 ++
> arch/x86/kvm/vmx.c | 137
> ++++++++++++++++++++++++++++++++++++++
> 5 files changed, 168 insertions(+)
>
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index 915fcc3..a260015 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -64,6 +64,7 @@ static const enum {
> #define LBR_NO_INFO (1ULL << LBR_NO_INFO_BIT)
>
> #define LBR_PLM (LBR_KERNEL | LBR_USER)
> +#define LBR_USER_CALLSTACK (LBR_CALL_STACK | LBR_USER)
>
> #define LBR_SEL_MASK 0x3ff /* valid bits in LBR_SELECT */
> #define LBR_NOT_SUPP -1 /* LBR filter not supported */
> @@ -1283,6 +1284,21 @@ void intel_pmu_lbr_init_knl(void)
> }
>
> /**
> + * lbr_select_user_callstack - check if the user callstack mode is set
> + *
> + * @lbr_select: the lbr select msr
> + *
> + * Returns: true if the msr is configured to the user callstack mode.
> + * Otherwise, false.
> + *
> + */
> +bool lbr_select_user_callstack(u64 lbr_select)
> +{
> + return !!(lbr_select & LBR_USER_CALLSTACK);
> +}
> +EXPORT_SYMBOL_GPL(lbr_select_user_callstack);
> +
> +/**
> * perf_get_lbr_stack - get the lbr stack related MSRs
> *
> * @stack: the caller's memory to get the lbr stack
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> index fdcac01..41b4d29 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -730,6 +730,10 @@ struct kvm_vcpu_arch {
>
> /* Flush the L1 Data cache for L1TF mitigation on VMENTER */
> bool l1tf_flush_l1d;
> + /* Indicate if the guest is using lbr with the user callstack mode */
> + bool lbr_user_callstack;
> + /* Indicate if the lbr msrs were accessed in this vCPU time slice */
> + bool lbr_used;
> };
>
> struct kvm_lpage_info {
> diff --git a/arch/x86/include/asm/perf_event.h
> b/arch/x86/include/asm/perf_event.h
> index e893a69..2d7ae55 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -277,6 +277,7 @@ struct perf_lbr_stack {
> unsigned long info;
> };
>
> +extern bool lbr_select_user_callstack(u64 msr_lbr_select);
> extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
> extern int perf_get_lbr_stack(struct perf_lbr_stack *stack);
> extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
> @@ -288,6 +289,11 @@ static inline struct perf_guest_switch_msr
> *perf_guest_get_msrs(int *nr)
> return NULL;
> }
>
> +static bool lbr_select_user_callstack(u64 msr_lbr_select)
> +{
> + return false;
> +}
> +
> static inline int perf_get_lbr_stack(struct perf_lbr_stack *stack)
> {
> return -1;
> diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
> index e872aed..94f0624 100644
> --- a/arch/x86/kvm/pmu.h
> +++ b/arch/x86/kvm/pmu.h
> @@ -102,6 +102,11 @@ static inline struct kvm_pmc *get_fixed_pmc(struct
> kvm_pmu *pmu, u32 msr)
> return NULL;
> }
>
> +static inline bool intel_pmu_save_guest_lbr_enabled(struct kvm_vcpu *vcpu)
> +{
> + return !!vcpu_to_pmu(vcpu)->guest_lbr_event;
> +}
> +
> void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
> void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
> void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 92705b5..ae20563 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1282,6 +1282,9 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
> static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
> static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
>						    u32 msr, int type);
> +static void
> +__always_inline vmx_set_intercept_for_msr(unsigned long *msr_bitmap, u32 msr,
> +					  int type, bool value);
>
> static DEFINE_PER_CPU(struct vmcs *, vmxarea);
> static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -4056,6 +4059,120 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
> return 0;
> }
>
> +static void vmx_set_intercept_for_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
> +{
> + unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
> + struct perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
> + int nr = stack->nr;
> + int i;
> +
> + vmx_set_intercept_for_msr(msr_bitmap, stack->tos, MSR_TYPE_RW, set);
> + for (i = 0; i < nr; i++) {
> + vmx_set_intercept_for_msr(msr_bitmap, stack->from + i,
> + MSR_TYPE_RW, set);
> + vmx_set_intercept_for_msr(msr_bitmap, stack->to + i,
> + MSR_TYPE_RW, set);
> + if (stack->info)
> + vmx_set_intercept_for_msr(msr_bitmap, stack->info + i,
> + MSR_TYPE_RW, set);
> + }
> +}
> +
> +static inline bool msr_is_lbr_stack(struct kvm_vcpu *vcpu, u32 index)
> +{
> + struct perf_lbr_stack *stack = &vcpu->kvm->arch.lbr_stack;
> + int nr = stack->nr;
> +
> + return !!(index == stack->tos ||
> + (index >= stack->from && index < stack->from + nr) ||
> + (index >= stack->to && index < stack->to + nr) ||
> + (index >= stack->info && index < stack->info + nr));
> +}
> +
> +static bool guest_get_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +{
> + u32 index = msr_info->index;
> + bool ret = false;
> +
> + switch (index) {
> + case MSR_IA32_DEBUGCTLMSR:
> + msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
> + ret = true;
> + break;
> + case MSR_LBR_SELECT:
> + ret = true;
> + rdmsrl(index, msr_info->data);
> + break;
> + default:
> + if (msr_is_lbr_stack(vcpu, index)) {
> + ret = true;
> + rdmsrl(index, msr_info->data);
> + }
> + }
> +
> + return ret;
> +}
> +
> +static bool guest_set_lbr_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +{
> + u32 index = msr_info->index;
> + u64 data = msr_info->data;
> + bool ret = false;
> +
> + switch (index) {
> + case MSR_IA32_DEBUGCTLMSR:
> + ret = true;
> + /*
> + * Currently, only FREEZE_LBRS_ON_PMI and DEBUGCTLMSR_LBR are
> + * supported.
> + */
> + data &= (DEBUGCTLMSR_FREEZE_LBRS_ON_PMI | DEBUGCTLMSR_LBR);
> + vmcs_write64(GUEST_IA32_DEBUGCTL, data);
> + break;
> + case MSR_LBR_SELECT:
> + ret = true;
> + vcpu->arch.lbr_user_callstack =
> + lbr_select_user_callstack(data);
> + wrmsrl(index, msr_info->data);
> + break;
> + default:
> + if (msr_is_lbr_stack(vcpu, index)) {
> + ret = true;
> + wrmsrl(index, msr_info->data);
> + }
> + }
> +
> + return ret;
> +}
> +
> +static bool guest_access_lbr_msr(struct kvm_vcpu *vcpu,
> + struct msr_data *msr_info,
> + bool set)
> +{
> + bool ret = false;
> +
> + if (!vcpu->kvm->arch.guest_lbr_enabled)
> + return false;
> +
> + if (set)
> + ret = guest_set_lbr_msr(vcpu, msr_info);
> + else
> + ret = guest_get_lbr_msr(vcpu, msr_info);
> +
> + if (ret) {
> + vcpu->arch.lbr_used = true;
> + vmx_set_intercept_for_lbr_msrs(vcpu, false);
You can use if (!vcpu->arch.lbr_used) as the condition for assigning these
values. They are needed only once.
Thanks,
-Gonglei
On Thu, Sep 20, 2018 at 06:05:58PM +0800, Wei Wang wrote:
> diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
> index 1562863..a91fdef 100644
> --- a/arch/x86/events/perf_event.h
> +++ b/arch/x86/events/perf_event.h
> @@ -223,6 +223,7 @@ struct cpu_hw_events {
> */
> u64 intel_ctrl_guest_mask;
> u64 intel_ctrl_host_mask;
> + bool guest_lbr;
> struct perf_guest_switch_msr guest_switch_msrs[X86_PMC_IDX_MAX];
>
> /*
Please, no bools in aggregate types.
Either use a bitfield or u8; there are more LBR-related bools, so convert
and group the lot to avoid holes.
On 09/20/2018 07:34 PM, Gonglei (Arei) wrote:
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On
>> Behalf Of Wei Wang
>> Sent: Thursday, September 20, 2018 6:06 PM
>> To: [email protected]; [email protected]; [email protected];
>> [email protected]
>> Cc: [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; [email protected];
>> [email protected]; Gonglei (Arei) <[email protected]>
>> Subject: [PATCH v3 0/5] Guest LBR Enabling
>>
>> Last Branch Recording (LBR) is a performance monitor unit (PMU) feature
>> on Intel CPUs that captures branch related info. This patch series enables
>> this feature to KVM guests.
>>
>> Here is a conclusion of the fundamental methods that we use:
>> 1) the LBR feature is enabled per guest via QEMU setting of
>> KVM_CAP_X86_GUEST_LBR;
> Did you send the QEMU parts for guest LBR support?
>
Not yet. Probably next week for the QEMU part.
Best,
Wei
On Thu, Sep 20, 2018 at 06:05:58PM +0800, Wei Wang wrote:
> diff --git a/arch/x86/kvm/pmu_intel.c b/arch/x86/kvm/pmu_intel.c
> index 5ab4a36..97a29d7 100644
> --- a/arch/x86/kvm/pmu_intel.c
> +++ b/arch/x86/kvm/pmu_intel.c
> @@ -342,6 +342,47 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
> pmu->global_ovf_ctrl = 0;
> }
>
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct perf_event *event;
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_RAW,
> + .size = sizeof(attr),
Bit yuck to imply .config = 0, like that.
> + .pinned = true,
Note that this can still result in failing to schedule the event, you
create a pinned task event, but a pinned cpu event takes precedence and
can claim the LBR.
How do you deal with that situation, where the LBR is in use by a host
event?
> + .exclude_host = true,
Didn't you mean to use .exclude_guest ? You don't want this thing to run
when the guest is running, right?
But I still don't like this hack much; what happens if userspace sets
that bit? You seem to have forgotten to combine it with PF_VCPU or
whatever the appropriate bit to check is.
Similarly, how does the existing exclude_{guest,host} crud work?
> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
What's the point of setting .sample_type if !.sample_period ?
> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> + PERF_SAMPLE_BRANCH_USER |
> + PERF_SAMPLE_BRANCH_KERNEL,
> + };
I think this function wants a comment to explain wth it's doing and why,
because if someone stumbles over this code in a few months nobody will
remember those things.
> +
> + if (pmu->guest_lbr_event)
> + return 0;
> +
> + event = perf_event_create_kernel_counter(&attr, -1, current, NULL,
> + NULL);
> + if (IS_ERR(event)) {
> + pr_err("%s: failed %ld\n", __func__, PTR_ERR(event));
> + return -ENOENT;
> + }
> + pmu->guest_lbr_event = event;
> +
> + return 0;
> +}
On 09/20/2018 06:05 PM, Wei Wang wrote:
> Bits [0, 5] of MSR_IA32_PERF_CAPABILITIES tell about the format of
> the addresses stored in the LBR stack. Expose those bits to the guest
> when the guest lbr feature is enabled.
>
> Signed-off-by: Like Xu <[email protected]>
> Signed-off-by: Wei Wang <[email protected]>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Andi Kleen <[email protected]>
> ---
> arch/x86/include/asm/perf_event.h | 2 ++
> arch/x86/kvm/cpuid.c | 2 +-
> arch/x86/kvm/vmx.c | 7 +++++++
> 3 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 84cc8cb..e893a69 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -79,6 +79,8 @@
> #define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6
> #define ARCH_PERFMON_EVENTS_COUNT 7
>
> +#define PERF_CAP_MASK_LBR_FMT 0x3f
> +
> /*
> * Intel "Architectural Performance Monitoring" CPUID
> * detection/enumeration details:
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 7bcfa61..3b8a57b 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -365,7 +365,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
> 0 /* DS-CPL, VMX, SMX, EST */ |
> 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> - F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> + F(FMA) | F(CX16) | 0 /* xTPR Update*/ | F(PDCM) |
> F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
> F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
> 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 533a327..92705b5 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4134,6 +4134,13 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> !guest_cpuid_has(vcpu, X86_FEATURE_RDTSCP))
> return 1;
> /* Otherwise falls through */
> + case MSR_IA32_PERF_CAPABILITIES:
> + if (!boot_cpu_has(X86_FEATURE_PDCM))
> + return 1;
> + msr_info->data = native_read_msr(MSR_IA32_PERF_CAPABILITIES);
> + if (vcpu->kvm->arch.guest_lbr_enabled)
> + msr_info->data &= PERF_CAP_MASK_LBR_FMT;
> + break;
Sorry about a mistake here. Will move "case MSR_IA32_PERF_CAPABILITIES"
one step above - above "case TSC_AUX".
Best,
Wei
On Thu, Sep 20, 2018 at 06:05:59PM +0800, Wei Wang wrote:
> /**
> + * lbr_select_user_callstack - check if the user callstack mode is set
> + *
> + * @lbr_select: the lbr select msr
> + *
> + * Returns: true if the msr is configured to the user callstack mode.
> + * Otherwise, false.
> + *
> + */
> +bool lbr_select_user_callstack(u64 lbr_select)
> +{
> + return !!(lbr_select & LBR_USER_CALLSTACK);
> +}
> +EXPORT_SYMBOL_GPL(lbr_select_user_callstack);
That function is pure and tiny, wth is that an exported symbol and not
an inline?
On 09/20/2018 08:37 PM, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 06:05:59PM +0800, Wei Wang wrote:
>> /**
>> + * lbr_select_user_callstack - check if the user callstack mode is set
>> + *
>> + * @lbr_select: the lbr select msr
>> + *
>> + * Returns: true if the msr is configured to the user callstack mode.
>> + * Otherwise, false.
>> + *
>> + */
>> +bool lbr_select_user_callstack(u64 lbr_select)
>> +{
>> + return !!(lbr_select & LBR_USER_CALLSTACK);
>> +}
>> +EXPORT_SYMBOL_GPL(lbr_select_user_callstack);
> That function is pure and tiny, wth is that an exported symbol and not
> an inline?
Thanks Peter for the comments.
Because this function uses the LBR_ macros which are defined in this
lbr.c file.
Do you think it would be OK to move all the above LBR_ macros to
asm/perf_event.h?
Best,
Wei
On Thu, Sep 20, 2018 at 08:58:16PM +0800, Wei Wang wrote:
> On 09/20/2018 08:37 PM, Peter Zijlstra wrote:
> > On Thu, Sep 20, 2018 at 06:05:59PM +0800, Wei Wang wrote:
> > > /**
> > > + * lbr_select_user_callstack - check if the user callstack mode is set
> > > + *
> > > + * @lbr_select: the lbr select msr
> > > + *
> > > + * Returns: true if the msr is configured to the user callstack mode.
> > > + * Otherwise, false.
> > > + *
> > > + */
> > > +bool lbr_select_user_callstack(u64 lbr_select)
> > > +{
> > > + return !!(lbr_select & LBR_USER_CALLSTACK);
> > > +}
> > > +EXPORT_SYMBOL_GPL(lbr_select_user_callstack);
> > That function is pure and tiny, wth is that an exported symbol and not
> > an inline?
> Thanks Peter for the comments.
>
> Because this function uses the LBR_ macros which are defined in this lbr.c
> file.
> Do you think it would be OK to move all the above LBR_ macros to
> asm/perf_event.h?
Sure, that would be fine.
> +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct perf_event *event;
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_RAW,
> + .size = sizeof(attr),
> + .pinned = true,
> + .exclude_host = true,
> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> + PERF_SAMPLE_BRANCH_USER |
> + PERF_SAMPLE_BRANCH_KERNEL,
I think that will allocate an extra perfmon counter, right?
So if the guest wants to use all perfmon counters we would start to
multiplex, which would be bad
Would need to fix perf core to not allocate a perfmon counter in
this case, if no period and no event count is requested.
-Andi
On Thu, Sep 20, 2018 at 08:30:35AM -0700, Andi Kleen wrote:
> > +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu)
> > +{
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct perf_event *event;
> > + struct perf_event_attr attr = {
> > + .type = PERF_TYPE_RAW,
> > + .size = sizeof(attr),
> > + .pinned = true,
> > + .exclude_host = true,
> > + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> > + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> > + PERF_SAMPLE_BRANCH_USER |
> > + PERF_SAMPLE_BRANCH_KERNEL,
>
> I think that will allocate an extra perfmon counter, right?
I thought the same too, but I think the exclude_host/guest (whichever
is the right one) makes that work for regular counters too.
That code is a wee bit magical and I didn't take the time to reverse
engineer that. It most certainly needs a comment.
On Thursday, September 20, 2018 11:31 PM, Andi Kleen wrote:
> > +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu) {
> > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > + struct perf_event *event;
> > + struct perf_event_attr attr = {
> > + .type = PERF_TYPE_RAW,
> > + .size = sizeof(attr),
> > + .pinned = true,
> > + .exclude_host = true,
> > + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> > + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK
> |
> > + PERF_SAMPLE_BRANCH_USER |
> > + PERF_SAMPLE_BRANCH_KERNEL,
Sorry for my late response (I was digging into some of the details).
> I think that will allocate an extra perfmon counter, right?
>
> So if the guest wants to use all perfmon counters we would start to multiplex,
> which would be bad
Right. The host side counter allocation is not necessary.
> Would need to fix perf core to not allocate a perfmon counter in this case, if
> no period and no event count is requested.
>
If no period (i.e. .sample_period = 0), the current code still uses one
counter, with the period set to the max period.
If we change that to do no counter allocation, would that break any of the
existing usages? (I didn't find the reason why the existing code doesn't
avoid that allocation for a non-sampling event.)
Best,
Wei
On Friday, September 21, 2018 12:24 AM, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 08:30:35AM -0700, Andi Kleen wrote:
> > > +int intel_pmu_enable_save_guest_lbr(struct kvm_vcpu *vcpu) {
> > > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > + struct perf_event *event;
> > > + struct perf_event_attr attr = {
> > > + .type = PERF_TYPE_RAW,
> > > + .size = sizeof(attr),
> > > + .pinned = true,
> > > + .exclude_host = true,
> > > + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> > > + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK
> |
> > > + PERF_SAMPLE_BRANCH_USER |
> > > + PERF_SAMPLE_BRANCH_KERNEL,
> >
> > I think that will allocate an extra perfmon counter, right?
>
> I thought the same too, but I think the exclude_host/guest, whichever is the
> right one, makes that work for regular counters too.
Sorry for being late.
I'm not sure if exclude_host/guest would be suitable. For example, if the
guest wants to use a perf counter, the host creates a perf event with
"exclude_host=true" so that the counter does not count in the host, while
"exclude_guest=true" is a flag to the perf core that the counter should not
count when the guest runs.
What would you think if we add a new flag (e.g. .force_no_counters) to the perf core to indicate not allocating a perf counter?
> That code is a wee bit magical and I didn't take the time to reverse engineer
> that. It most certainly needs a comment.
No problem. I will add more comments in the next version.
Best,
Wei
On Thursday, September 20, 2018 8:32 PM, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 06:05:58PM +0800, Wei Wang wrote:
> How do you deal with that situation, where the LBR is in use by a host event?
I guess you were referring to this use case: "perf record -b
qemu-system-x86_64 ..", where the qemu process itself also needs the LBR.
I also discussed this with Andi in v1, where we switched the LBR state on
VMX transitions. That could achieve the above, but it adds too much
overhead to VMX transitions, which are frequent. Since the above usage is
not common (it is more common to just have a task in the guest use the LBR
feature), we decided not to take that usage into account and to switch the
LBR state only on vCPU switching.
Best,
Wei
On Thursday, September 20, 2018 8:08 PM, Gonglei (Arei) wrote:
> > +static bool guest_access_lbr_msr(struct kvm_vcpu *vcpu,
> > + struct msr_data *msr_info,
> > + bool set)
> > +{
> > + bool ret = false;
> > +
> > + if (!vcpu->kvm->arch.guest_lbr_enabled)
> > + return false;
> > +
> > + if (set)
> > + ret = guest_set_lbr_msr(vcpu, msr_info);
> > + else
> > + ret = guest_get_lbr_msr(vcpu, msr_info);
> > +
> > + if (ret) {
> > + vcpu->arch.lbr_used = true;
> > + vmx_set_intercept_for_lbr_msrs(vcpu, false);
>
> You can use if (!vcpu->arch.lbr_used) as the condition of assign values.
> They are need only once.
Thanks. I think it would be better to use
if (ret && !vcpu->arch.lbr_used)
(we need to make sure that the guest is accessing one of the LBR-related
MSRs, which is what ret == true indicates).
Best,
Wei