Hi Peter,
Would you mind acking the host perf patches if it looks good to you ?
Hi Paolo,
Please help review the KVM proposal changes in this stable version.
Now, we can use upstream QEMU w/ '-cpu host' to test this feature, and
disable it by clearing the LBR format bits in the IA32_PERF_CAPABILITIES.
v10->v11 Changelog:
- add '.config = INTEL_FIXED_VLBR_EVENT' to the guest LBR event config;
- rewrite is_guest_lbr_event() with 'config == INTEL_FIXED_VLBR_EVENT';
- emit pr_warn() on the host when guest LBR is temporarily unavailable;
- drop the KVM_CAP_X86_GUEST_LBR patch;
- rewrite MSR_IA32_PERF_CAPABILITIES patch LBR record format;
- split 'kvm_pmu->lbr_already_available' into a separate patch;
- split 'pmu_ops->availability_check' into a separate patch;
- comments and naming refinement, misc;
You may check more details in each commit.
Previous:
https://lore.kernel.org/kvm/[email protected]/
---
The last branch recording (LBR) is a performance monitor unit (PMU)
feature on Intel processors that records a running trace of the most
recent branches taken by the processor in the LBR stack. This patch
series is going to enable this feature for plenty of KVM guests.
The userspace could configure whether it's enabled or not for each
guest via MSR_IA32_PERF_CAPABILITIES msr. As a first step, a guest
could only enable LBR feature if its cpu model is the same as the
host since the LBR feature is still one of model specific features.
If it's enabled on the guest, the guest LBR driver would accesses the
LBR MSR (including IA32_DEBUGCTLMSR and records MSRs) as host does.
The first guest access on the LBR related MSRs is always interceptible.
The KVM trap would create a special LBR event (called guest LBR event)
which enables the callstack mode and none of hardware counter is assigned.
The host perf would enable and schedule this event as usual.
Guest's first access to a LBR registers gets trapped to KVM, which
creates a guest LBR perf event. It's a regular LBR perf event which gets
the LBR facility assigned from the perf subsystem. Once that succeeds,
the LBR stack msrs are passed through to the guest for efficient accesses.
However, if another host LBR event comes in and takes over the LBR
facility, the LBR msrs will be made interceptible, and guest following
accesses to the LBR msrs will be trapped and meaningless.
Because saving/restoring tens of LBR MSRs (e.g. 32 LBR stack entries) in
VMX transition brings too excessive overhead to frequent vmx transition
itself, the guest LBR event would help save/restore the LBR stack msrs
during the context switching with the help of native LBR event callstack
mechanism, including LBR_SELECT msr.
If the guest no longer accesses the LBR-related MSRs within a scheduling
time slice and the LBR enable bit is unset, vPMU would release its guest
LBR event as a normal event of a unused vPMC and the pass-through
state of the LBR stack msrs would be canceled.
---
LBR testcase:
echo 1 > /proc/sys/kernel/watchdog
echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 5000 > /proc/sys/kernel/perf_event_max_sample_rate
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
./perf record -b ./br_instr a
- Perf report on the host:
Samples: 72K of event 'cycles', Event count (approx.): 72512
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
12.12% br_instr br_instr [.] cmp_end [.] lfsr_cond 1
11.05% br_instr br_instr [.] lfsr_cond [.] cmp_end 5
8.81% br_instr br_instr [.] lfsr_cond [.] cmp_end 4
5.04% br_instr br_instr [.] cmp_end [.] lfsr_cond 20
4.92% br_instr br_instr [.] lfsr_cond [.] cmp_end 6
4.88% br_instr br_instr [.] cmp_end [.] lfsr_cond 6
4.58% br_instr br_instr [.] cmp_end [.] lfsr_cond 5
- Perf report on the guest:
Samples: 92K of event 'cycles', Event count (approx.): 92544
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
12.03% br_instr br_instr [.] cmp_end [.] lfsr_cond 1
11.09% br_instr br_instr [.] lfsr_cond [.] cmp_end 5
8.57% br_instr br_instr [.] lfsr_cond [.] cmp_end 4
5.08% br_instr br_instr [.] lfsr_cond [.] cmp_end 6
5.06% br_instr br_instr [.] cmp_end [.] lfsr_cond 20
4.87% br_instr br_instr [.] cmp_end [.] lfsr_cond 6
4.70% br_instr br_instr [.] cmp_end [.] lfsr_cond 5
Conclusion: the profiling results on the guest are similar to that on the host.
Like Xu (9):
perf/x86/core: Refactor hw->idx checks and cleanup
perf/x86/lbr: Add interface to get basic information about LBR stack
perf/x86: Add constraint to create guest LBR event without hw counter
perf/x86: Keep LBR stack unchanged in host context for guest LBR event
KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES for LBR record format
KVM: x86/pmu: Emulate LBR feature via guest LBR event
KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism
KVM: x86/pmu: Check guest LBR availability in case host reclaims them
KVM: x86/pmu: Reduce the overhead of LBR passthrough or cancellation
Wei Wang (2):
perf/x86: Fix variable types for LBR registers
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
arch/x86/events/core.c | 26 ++-
arch/x86/events/intel/core.c | 105 ++++++----
arch/x86/events/intel/lbr.c | 56 +++++-
arch/x86/events/perf_event.h | 12 +-
arch/x86/include/asm/kvm_host.h | 13 ++
arch/x86/include/asm/perf_event.h | 34 +++-
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/pmu.c | 19 +-
arch/x86/kvm/pmu.h | 15 +-
arch/x86/kvm/svm/pmu.c | 7 +-
arch/x86/kvm/vmx/capabilities.h | 15 ++
arch/x86/kvm/vmx/pmu_intel.c | 320 +++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 12 +-
arch/x86/kvm/vmx/vmx.h | 2 +
arch/x86/kvm/x86.c | 18 +-
15 files changed, 564 insertions(+), 92 deletions(-)
--
2.21.3
When a guest wants to use the LBR stack, its hypervisor creates a guest
LBR event and let host perf schedules it. A new 'int guest_lbr_enabled'
field in the 'struct cpu_hw_events', is marked as true when perf adds
a guest LBR event and marked as false on deletion.
The LBR stack msrs are accessible to the guest when its guest LBR event
is scheduled in by the perf subsystem. Before scheduling this event out,
we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT.
Otherwise, some unexpected branch operations may interfere with guest
behavior, pollute LBR records, and even cause host branch data leakage.
In addition, the intel_pmu_lbr_read() on the host is also avoidable.
To ensure that guest LBR records are not lost during the context switch,
the BRANCH_CALL_STACK flag should be configured in the 'branch_sample_type'
for any guest LBR event because the callstack mode could save/restore guest
unread LBR records with the help of intel_pmu_lbr_sched_task() naturally.
However, the regular host LBR perf event doesn't save/restore LBR_SELECT,
because it's configured in the LBR_enable() based on branch_sample_type.
So when a guest LBR is running, the guest LBR_SELECT may changes for its
own use and we have to support LBR_SELECT save/restore to ensure what the
guest LBR_SELECT value doesn't get lost during the context switching.
Cc: Peter Zijlstra (Intel) <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/lbr.c | 33 ++++++++++++++++++++++++++++++---
arch/x86/events/perf_event.h | 2 ++
2 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 0a91cbe3a7c7..ff8d0433837d 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,15 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * the guest lbr_sel is likely different from event->hw.branch_reg.
+ * Therefore, it’s necessary to save/restore MSR_LBR_SELECT written
+ * by the guest so that it's not lost during the context switch.
+ */
+ if (cpuc->guest_lbr_enabled)
+ wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}
static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -415,6 +424,9 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
cpuc->last_task_ctx = task_ctx;
cpuc->last_log_id = ++task_ctx->log_id;
+
+ if (cpuc->guest_lbr_enabled)
+ rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}
void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
@@ -485,6 +497,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;
+ if (is_guest_lbr_event(event))
+ cpuc->guest_lbr_enabled = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -532,6 +547,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}
+ if (is_guest_lbr_event(event))
+ cpuc->guest_lbr_enabled = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -544,7 +562,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * the guest will dis/enables LBR itself at the appropriate time,
+ * including configuring MSR_LBR_SELECT.
+ */
+ if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
__intel_pmu_lbr_enable(pmi);
}
@@ -552,7 +575,7 @@ void intel_pmu_lbr_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
__intel_pmu_lbr_disable();
}
@@ -693,8 +716,12 @@ void intel_pmu_lbr_read(void)
*
* This could be smarter and actually check the event,
* but this simple approach seems to work for now.
+ *
+ * And there is no need to read LBR record here if a guest LBR
+ * event is using it, because the guest will read them on its own.
*/
- if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+ if (!cpuc->lbr_users || cpuc->guest_lbr_enabled ||
+ cpuc->lbr_users == cpuc->lbr_pebs_users)
return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 34d76a2f5ebd..8d322b0204c8 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -237,6 +237,7 @@ struct cpu_hw_events {
u64 br_sel;
struct x86_perf_task_context *last_task_ctx;
int last_log_id;
+ int guest_lbr_enabled;
/*
* Intel host/guest exclude bits
@@ -721,6 +722,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+ u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
--
2.21.3
From: Wei Wang <[email protected]>
Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated
field from the struct could be used by get_msr. This also makes this API
consistent with kvm_pmu_set_msr. No functional changes.
Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/kvm/pmu.c | 4 ++--
arch/x86/kvm/pmu.h | 4 ++--
arch/x86/kvm/svm/pmu.c | 7 ++++---
arch/x86/kvm/vmx/pmu_intel.c | 19 +++++++++++--------
arch/x86/kvm/x86.c | 4 ++--
5 files changed, 21 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a5078841bdac..b86346903f2e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -397,9 +397,9 @@ static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
__set_bit(pmc->idx, pmu->pmc_in_use);
}
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
- return kvm_x86_ops.pmu_ops->get_msr(vcpu, msr, data);
+ return kvm_x86_ops.pmu_ops->get_msr(vcpu, msr_info);
}
int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index a6c78a797cb1..ab85eed8a6cc 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,7 +32,7 @@ struct kvm_pmu_ops {
struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, u32 msr);
int (*is_valid_rdpmc_ecx)(struct kvm_vcpu *vcpu, unsigned int idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
- int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+ int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
@@ -147,7 +147,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
int kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx);
bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index ce0b10fe5e2b..035da07500e8 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -215,21 +215,22 @@ static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}
-static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;
/* MSR_PERFCTRn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
if (pmc) {
- *data = pmc_read_counter(pmc);
+ msr_info->data = pmc_read_counter(pmc);
return 0;
}
/* MSR_EVNTSELn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
if (pmc) {
- *data = pmc->eventsel;
+ msr_info->data = pmc->eventsel;
return 0;
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 7c857737b438..e1a303fefc16 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -184,35 +184,38 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}
-static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;
switch (msr) {
case MSR_CORE_PERF_FIXED_CTR_CTRL:
- *data = pmu->fixed_ctr_ctrl;
+ msr_info->data = pmu->fixed_ctr_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_STATUS:
- *data = pmu->global_status;
+ msr_info->data = pmu->global_status;
return 0;
case MSR_CORE_PERF_GLOBAL_CTRL:
- *data = pmu->global_ctrl;
+ msr_info->data = pmu->global_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
- *data = pmu->global_ovf_ctrl;
+ msr_info->data = pmu->global_ovf_ctrl;
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
- *data = val & pmu->counter_bitmask[KVM_PMC_GP];
+ msr_info->data =
+ val & pmu->counter_bitmask[KVM_PMC_GP];
return 0;
} else if ((pmc = get_fixed_pmc(pmu, msr))) {
u64 val = pmc_read_counter(pmc);
- *data = val & pmu->counter_bitmask[KVM_PMC_FIXED];
+ msr_info->data =
+ val & pmu->counter_bitmask[KVM_PMC_FIXED];
return 0;
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
- *data = pmc->eventsel;
+ msr_info->data = pmc->eventsel;
return 0;
}
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7e46027f405a..23fe511c6ba0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3106,7 +3106,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
- return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+ return kvm_pmu_get_msr(vcpu, msr_info);
msr_info->data = 0;
break;
case MSR_IA32_UCODE_REV:
@@ -3268,7 +3268,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
- return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+ return kvm_pmu_get_msr(vcpu, msr_info);
if (!ignore_msrs) {
vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
msr_info->index);
--
2.21.3
The MSR_IA32_PERF_CAPABILITIES is a read only MSR that enumerates the
existence of performance monitoring features and KVM would always set
F(PDCM) since worst case it will just be zero.
The bits [0, 5] of MSR_IA32_PERF_CAPABILITIES tells about the LBR format
of the branch record addresses stored in the LBR stack. User space could
enable guest LBR by setting the exactly supported LBR format bits for this
msr_based_feature.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/cpuid.c | 2 +-
arch/x86/kvm/vmx/capabilities.h | 15 +++++++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 23 +++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 3 +++
arch/x86/kvm/x86.c | 1 +
6 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35a915787559..8c3ae83f63d9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -599,6 +599,7 @@ struct kvm_vcpu_arch {
u64 ia32_xss;
u64 microcode_version;
u64 arch_capabilities;
+ u64 perf_capabilities;
/*
* Paging state of the vcpu
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 35845704cf57..411ce1b58341 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -294,7 +294,7 @@ void kvm_set_cpu_caps(void)
F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ |
0 /* DS-CPL, VMX, SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
- F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
+ F(FMA) | F(CX16) | 0 /* xTPR Update */ | F(PDCM) |
F(PCID) | 0 /* Reserved, DCA */ | F(XMM4_1) |
F(XMM4_2) | F(X2APIC) | F(MOVBE) | F(POPCNT) |
0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 8903475f751e..27a66795665b 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -18,6 +18,8 @@ extern int __read_mostly pt_mode;
#define PT_MODE_SYSTEM 0
#define PT_MODE_HOST_GUEST 1
+#define PERF_CAP_LBR_FMT 0x3f
+
struct nested_vmx_msrs {
/*
* We only store the "true" versions of the VMX capability MSRs. We
@@ -367,4 +369,17 @@ static inline bool vmx_pt_mode_is_host_guest(void)
return pt_mode == PT_MODE_HOST_GUEST;
}
+static inline u64 vmx_get_perf_capabilities(void)
+{
+ u64 perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
+
+ /* Currently, KVM only supports LBR. */
+ perf_cap &= PERF_CAP_LBR_FMT;
+
+ return perf_cap;
+}
+
#endif /* __KVM_X86_VMX_CAPS_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index e1a303fefc16..79e5bf36ffc8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -162,6 +162,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PERF_CAPABILITIES:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_PDCM);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -203,6 +206,12 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_PERF_CAPABILITIES:
+ if (!msr_info->host_initiated &&
+ !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
+ return 1;
+ msr_info->data = vcpu->arch.perf_capabilities;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -261,6 +270,16 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_PERF_CAPABILITIES:
+ if (!msr_info->host_initiated ||
+ !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
+ return 1;
+ if (!(data & ~vmx_get_perf_capabilities()))
+ return 1;
+ if ((data ^ vmx_get_perf_capabilities()) & PERF_CAP_LBR_FMT)
+ return 1;
+ vcpu->arch.perf_capabilities = data;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (!msr_info->host_initiated)
@@ -315,6 +334,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
return;
perf_get_x86_pmu_capability(&x86_pmu);
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
+ vcpu->arch.perf_capabilities = 0;
pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
x86_pmu.num_counters_gp);
@@ -374,6 +395,8 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
pmu->fixed_counters[i].idx = i + INTEL_PMC_IDX_FIXED;
pmu->fixed_counters[i].current_config = 0;
}
+
+ vcpu->arch.perf_capabilities = vmx_get_perf_capabilities();
}
static void intel_pmu_reset(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bc5e5cf1d4cc..ee94d94e855a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1789,6 +1789,9 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
if (!nested)
return 1;
return vmx_get_vmx_msr(&vmcs_config.nested, msr->index, &msr->data);
+ case MSR_IA32_PERF_CAPABILITIES:
+ msr->data = vmx_get_perf_capabilities();
+ return 0;
default:
return 1;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23fe511c6ba0..b577fadffb1d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1323,6 +1323,7 @@ static const u32 msr_based_features_all[] = {
MSR_F10H_DECFG,
MSR_IA32_UCODE_REV,
MSR_IA32_ARCH_CAPABILITIES,
+ MSR_IA32_PERF_CAPABILITIES,
};
static u32 msr_based_features[ARRAY_SIZE(msr_based_features_all)];
--
2.21.3
VMX transition is much more frequent than vcpu switching, and saving/
restoring tens of LBR msrs (e.g. 32 LBR records entries) brings too much
overhead to the frequent vmx transition itself, which is not necessary.
So the guest LBR records only gets saved/restored on the vcpu context
switching via the help of native LBR event callstack mechanism. Generally,
the LBR-related msrs and its functionality are emulated in this way:
The guest first access on the LBR related msrs (including DEBUGCTLMSR
and records msrs) is always interceptible. The KVM handler would create
a guest LBR event which enables the callstack mode and none of hardware
counter is assigned. The host perf would enable and schedule this event
as usual but in an exclusive way.
At this point, vPMU only supports Architecture v2 and the guest needs
to re-enable LBR via DEBUGCTLMSR in the guest PMI handler. The guest
LBR event will be released when the vPMU is reset but soon, the lazy
release mechanism would be applied to this event like a regular vPMC.
The debugctl msr is handled separately in vmx/svm and they're not
completely identical, hence remove the common msr handling code.
Cc: Peter Zijlstra (Intel) <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 9 ++
arch/x86/kvm/pmu.c | 5 +-
arch/x86/kvm/pmu.h | 6 +
arch/x86/kvm/vmx/pmu_intel.c | 220 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 5 +-
arch/x86/kvm/vmx/vmx.h | 2 +
arch/x86/kvm/x86.c | 13 --
7 files changed, 241 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8c3ae83f63d9..57b281c4b196 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -503,6 +503,14 @@ struct kvm_pmu {
* redundant check before cleanup if guest don't use vPMU at all.
*/
u8 event_count;
+
+ /*
+ * Emulate LBR feature via pass-through LBR registers when the
+ * per-vcpu guest LBR event is scheduled on the current pcpu.
+ *
+ * The records may be inaccurate if the host reclaims the LBR.
+ */
+ struct perf_event *lbr_event;
};
struct kvm_pmu_ops;
@@ -984,6 +992,7 @@ struct kvm_arch {
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;
+ struct x86_pmu_lbr lbr;
struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
};
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index b86346903f2e..5053f4238218 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -378,8 +378,11 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
{
- if (lapic_in_kernel(vcpu))
+ if (lapic_in_kernel(vcpu)) {
+ if (kvm_x86_ops.pmu_ops->deliver_pmi)
+ kvm_x86_ops.pmu_ops->deliver_pmi(vcpu);
kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+ }
}
bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ab85eed8a6cc..ed96cbb40757 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -37,8 +37,14 @@ struct kvm_pmu_ops {
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
+ void (*deliver_pmi)(struct kvm_vcpu *vcpu);
};
+static inline bool event_is_oncpu(struct perf_event *event)
+{
+ return event && event->oncpu != -1;
+}
+
static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 79e5bf36ffc8..8f9c837c7dca 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -17,6 +17,7 @@
#include "lapic.h"
#include "nested.h"
#include "pmu.h"
+#include "vmx.h"
static struct kvm_event_hw_type_mapping intel_arch_events[] = {
/* Index must match CPUID 0x0A.EBX bit vector */
@@ -150,6 +151,48 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
return &counters[array_index_nospec(idx, num_counters)];
}
+static bool lbr_is_enabled(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (likely(vcpu->kvm->arch.lbr.nr))
+ return true;
+
+ if (pmu->version < 2)
+ return false;
+
+ if (!(vcpu->arch.perf_capabilities & PERF_CAP_LBR_FMT))
+ return false;
+
+ /*
+ * As a first step, a guest could only enable LBR feature if its cpu
+ * model is the same as the host because the LBR registers would
+ * be passthrough to the guest and they're model specific.
+ */
+ if (boot_cpu_data.x86_model != guest_cpuid_model(vcpu))
+ return false;
+
+ return !x86_perf_get_lbr(&vcpu->kvm->arch.lbr);
+}
+
+static bool intel_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+ struct x86_pmu_lbr *stack = &vcpu->kvm->arch.lbr;
+ bool ret = false;
+
+ if (!lbr_is_enabled(vcpu))
+ return ret;
+
+ ret = (index == MSR_LBR_SELECT || index == MSR_LBR_TOS ||
+ (index >= stack->from && index < stack->from + stack->nr) ||
+ (index >= stack->to && index < stack->to + stack->nr));
+
+ if (!ret && stack->info)
+ ret = (index >= stack->info && index < stack->info + stack->nr);
+
+ return ret;
+}
+
static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -160,6 +203,7 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_STATUS:
case MSR_CORE_PERF_GLOBAL_CTRL:
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ case MSR_IA32_DEBUGCTLMSR:
ret = pmu->version > 1;
break;
case MSR_IA32_PERF_CAPABILITIES:
@@ -168,7 +212,8 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
- get_fixed_pmc(pmu, msr);
+ get_fixed_pmc(pmu, msr) ||
+ intel_is_valid_lbr_msr(vcpu, msr);
break;
}
@@ -187,6 +232,130 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}
+static int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event;
+
+ /*
+ * The perf_event_attr is constructed in the minimum efficient way:
+ * - set 'pinned = true' to make it task pinned so that if another
+ * cpu pinned event reclaims LBR, the event->oncpu will be set to -1;
+ * - set '.exclude_host = true' to record guest branches behavior;
+ *
+ * - set '.config = INTEL_FIXED_VLBR_EVENT' to indicates host perf
+ * schedule the event without a real HW counter but a fake one;
+ * check is_guest_lbr_event() and __intel_get_event_constraints();
+ *
+ * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
+ * 'branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ * PERF_SAMPLE_BRANCH_USER' to configure it as a LBR callstack
+ * event, which helps KVM to save/restore guest LBR records
+ * during host context switches and reduces quite a lot overhead,
+ * check branch_user_callstack() and intel_pmu_lbr_sched_task();
+ */
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(attr),
+ .pinned = true,
+ .exclude_host = true,
+ .config = INTEL_FIXED_VLBR_EVENT,
+ .sample_type = PERF_SAMPLE_BRANCH_STACK,
+ .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ PERF_SAMPLE_BRANCH_USER,
+ };
+
+ if (unlikely(pmu->lbr_event))
+ return 0;
+
+ event = perf_event_create_kernel_counter(&attr, -1,
+ current, NULL, NULL);
+ if (IS_ERR(event)) {
+ pr_debug_ratelimited("%s: failed %ld\n",
+ __func__, PTR_ERR(event));
+ return -ENOENT;
+ }
+ pmu->lbr_event = event;
+ pmu->event_count++;
+ return 0;
+}
+
+/*
+ * "set = true" to make the LBR records registers interceptible,
+ * otherwise passthrough the LBR records registers to the vcpu.
+ */
+static void intel_pmu_intercept_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
+{
+ unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+ struct x86_pmu_lbr *stack = &vcpu->kvm->arch.lbr;
+ int i;
+
+ if (!stack->nr)
+ return;
+
+ for (i = 0; i < stack->nr; i++) {
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->from + i, MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->to + i, MSR_TYPE_RW, set);
+ if (stack->info)
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->info + i, MSR_TYPE_RW, set);
+ }
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_SELECT, MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_TOS, MSR_TYPE_RW, set);
+}
+
+static void intel_pmu_free_lbr_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event = pmu->lbr_event;
+
+ if (!event)
+ return;
+
+ perf_event_release_kernel(event);
+ intel_pmu_intercept_lbr_msrs(vcpu, true);
+ pmu->event_count--;
+ pmu->lbr_event = NULL;
+}
+
+/*
+ * It's safe to access LBR msrs from guest when they have not
+ * been passthrough since the host would help restore or reset
+ * the LBR msrs records when the guest LBR event is scheduled in.
+ */
+static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info, bool read)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u32 index = msr_info->index;
+
+ if (!intel_is_valid_lbr_msr(vcpu, index))
+ return false;
+
+ if (!msr_info->host_initiated && !pmu->lbr_event)
+ intel_pmu_create_lbr_event(vcpu);
+
+ /*
+ * Disable irq to ensure the LBR feature doesn't get reclaimed by the
+ * host at the time the value is read from the msr, and this avoids the
+ * host LBR value to be leaked to the guest. If LBR has been reclaimed,
+ * return 0 on guest reads.
+ */
+ local_irq_disable();
+ if (event_is_oncpu(pmu->lbr_event)) {
+ if (read)
+ rdmsrl(index, msr_info->data);
+ else
+ wrmsrl(index, msr_info->data);
+ } else if (read)
+ msr_info->data = 0;
+ local_irq_enable();
+
+ return true;
+}
+
static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -212,6 +381,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
msr_info->data = vcpu->arch.perf_capabilities;
return 0;
+ case MSR_IA32_DEBUGCTLMSR:
+ msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -226,7 +398,8 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
msr_info->data = pmc->eventsel;
return 0;
- }
+ } else if (intel_pmu_access_lbr_msr(vcpu, msr_info, true))
+ return 0;
}
return 1;
@@ -280,6 +453,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
vcpu->arch.perf_capabilities = data;
return 0;
+ case MSR_IA32_DEBUGCTLMSR:
+ /* Values other than LBR are reserved and should throw a #GP */
+ if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI))
+ return 1;
+ vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+ if (!msr_info->host_initiated && !pmu->lbr_event)
+ intel_pmu_create_lbr_event(vcpu);
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (!msr_info->host_initiated)
@@ -302,7 +483,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
reprogram_gp_counter(pmc, data);
return 0;
}
- }
+ } else if (intel_pmu_access_lbr_msr(vcpu, msr_info, false))
+ return 0;
}
return 1;
@@ -421,6 +603,37 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
pmu->global_ovf_ctrl = 0;
+ intel_pmu_free_lbr_event(vcpu);
+}
+
+/*
+ * Emulate LBR_On_PMI behavior for 1 < pmu.version < 4.
+ *
+ * If Freeze_LBR_On_PMI = 1, the LBR is frozen on PMI and
+ * the KVM emulates to clear the LBR bit (bit 0) in IA32_DEBUGCTL.
+ *
+ * Guest needs to re-enable LBR to resume branches recording.
+ */
+static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
+{
+ u64 data;
+
+ if (!lbr_is_enabled(vcpu))
+ return;
+
+ data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
+ data &= ~DEBUGCTLMSR_LBR;
+ vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+ }
+}
+
+static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
+{
+ u8 version = vcpu_to_pmu(vcpu)->version;
+
+ if (version > 1 && version < 4)
+ intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
}
struct kvm_pmu_ops intel_pmu_ops = {
@@ -437,4 +650,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.refresh = intel_pmu_refresh,
.init = intel_pmu_init,
.reset = intel_pmu_reset,
+ .deliver_pmi = intel_pmu_deliver_pmi,
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ee94d94e855a..9969d663826a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3748,8 +3748,8 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm
}
}
-static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
- u32 msr, int type, bool value)
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value)
{
if (value)
vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
@@ -6698,6 +6698,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu_to_pmu(vcpu)->version)
atomic_switch_perf_msrs(vmx);
+
atomic_switch_umwait_control_msr(vmx);
if (enable_preemption_timer)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 298ddef79d00..1b6bb964badf 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -349,6 +349,8 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu);
bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value);
struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr);
void pt_update_intercept_for_msr(struct vcpu_vmx *vmx);
void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b577fadffb1d..8b4a5c6b206d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2815,18 +2815,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
}
break;
- case MSR_IA32_DEBUGCTLMSR:
- if (!data) {
- /* We support the non-activated case already */
- break;
- } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
- /* Values other than LBR and BTF are vendor-specific,
- thus reserved and should throw a #GP */
- return 1;
- }
- vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
- __func__, data);
- break;
case 0x200 ... 0x2ff:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
@@ -3083,7 +3071,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
switch (msr_info->index) {
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
- case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
case MSR_IA32_LASTBRANCHTOIP:
case MSR_IA32_LASTINTFROMIP:
--
2.21.3
After passthrough the LBR msrs to the guest, there is no need to call
intel_pmu_intercept_lbr_msrs() again and again, and vice versa.
If host reclaims the LBR between two availability checks, the interception
state and LBR records can be safely preserved due to native save/restore
support from guest LBR event.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/pmu.c | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 9 +++++++++
3 files changed, 13 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 57b281c4b196..dd51250c5688 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,9 @@ struct kvm_pmu {
* The records may be inaccurate if the host reclaims the LBR.
*/
struct perf_event *lbr_event;
+
+ /* A flag to reduce the overhead of LBR pass-through or cancellation. */
+ bool lbr_already_available;
};
struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d0dece055605..583ecb5f5f7c 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -437,6 +437,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
pmu->event_count = 0;
pmu->need_cleanup = false;
+ pmu->lbr_already_available = false;
kvm_pmu_refresh(vcpu);
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index db185dca903d..408d80c9418b 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -316,6 +316,7 @@ static void intel_pmu_free_lbr_event(struct kvm_vcpu *vcpu)
perf_event_release_kernel(event);
intel_pmu_intercept_lbr_msrs(vcpu, true);
+ pmu->lbr_already_available = false;
pmu->event_count--;
pmu->lbr_event = NULL;
}
@@ -653,10 +654,18 @@ static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
if (!pmu->lbr_event)
return false;
+ if (pmu->lbr_already_available && event_is_oncpu(pmu->lbr_event))
+ return true;
+
+ if (!pmu->lbr_already_available && !event_is_oncpu(pmu->lbr_event))
+ return false;
+
if (event_is_oncpu(pmu->lbr_event)) {
intel_pmu_intercept_lbr_msrs(vcpu, false);
+ pmu->lbr_already_available = true;
} else {
intel_pmu_intercept_lbr_msrs(vcpu, true);
+ pmu->lbr_already_available = false;
return false;
}
--
2.21.3
The hypervisor may request the perf subsystem to schedule a time window
to directly access the LBR stack msrs for its own use. Normally, it would
create a guest LBR event with callstack mode enabled, which is scheduled
along with other ordinary LBR events on the host but in an exclusive way.
To avoid wasting a counter for the guest LBR event, the perf tracks it via
needs_guest_lbr_without_counter() and assigns it with a fake VLBR counter
with the help of new lbr_without_counter_constraint. As with the BTS event,
there is actually no hardware counter assigned for the guest LBR event.
Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
arch/x86/events/intel/lbr.c | 3 +++
arch/x86/events/perf_event.h | 6 ++++++
arch/x86/include/asm/perf_event.h | 22 +++++++++++++++++++++-
5 files changed, 51 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f7a259dcbb06..2405926e2dba 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1104,6 +1104,7 @@ static inline void x86_assign_hw_event(struct perf_event *event,
switch (hwc->idx) {
case INTEL_PMC_IDX_FIXED_BTS:
+ case INTEL_PMC_IDX_FIXED_VLBR:
hwc->config_base = 0;
hwc->event_base = 0;
break;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f1439acbf7e6..112bc8367d5b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2621,6 +2621,22 @@ intel_bts_constraints(struct perf_event *event)
return NULL;
}
+/*
+ * Note, the event that satisfies this constraint will not be assigned
+ * with a normal hardware counter but a fake one like BTS event.
+ *
+ * The guest LBR event uses it in the __intel_get_event_constraints()
+ * to make sure LBR registers to be used exclusively for guest.
+ */
+static struct event_constraint *
+intel_guest_lbr_constraints(struct perf_event *event)
+{
+ if (unlikely(is_guest_lbr_event(event)))
+ return &guest_lbr_constraint;
+
+ return NULL;
+}
+
static int intel_alt_er(int idx, u64 config)
{
int alt_idx = idx;
@@ -2811,6 +2827,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
{
struct event_constraint *c;
+ c = intel_guest_lbr_constraints(event);
+ if (c)
+ return c;
+
c = intel_bts_constraints(event);
if (c)
return c;
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6c60dcaaaf69..0a91cbe3a7c7 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1363,3 +1363,6 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
return 0;
}
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
+
+struct event_constraint guest_lbr_constraint =
+ EVENT_CONSTRAINT(0, 1ULL << INTEL_PMC_IDX_FIXED_VLBR, 0);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1025bc6eb04f..34d76a2f5ebd 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -969,6 +969,11 @@ static inline bool intel_pmu_has_bts(struct perf_event *event)
return intel_pmu_has_bts_period(event, hwc->sample_period);
}
+static inline bool is_guest_lbr_event(struct perf_event *event)
+{
+ return event->attr.config == INTEL_FIXED_VLBR_EVENT;
+}
+
int intel_pmu_save_and_restart(struct perf_event *event);
struct event_constraint *
@@ -989,6 +994,7 @@ void release_ds_buffers(void);
void reserve_ds_buffers(void);
extern struct event_constraint bts_constraint;
+extern struct event_constraint guest_lbr_constraint;
void intel_pmu_enable_bts(u64 config);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5071515f6b0f..d1938484854d 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,9 +192,29 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
+/*
+ * We model guest LBR event tracing as another fixed-mode PMC like BTS.
+ *
+ * We choose bit 58 because it's used to indicate LBR stack frozen state
+ * for architectural perfmon v4, also we unconditionally mask that bit in
+ * the handle_pmi_common(), so it'll never be set in the overflow handling.
+ *
+ * With this fake counter assigned, the guest LBR event user (such as KVM),
+ * can program the LBR registers on its own, and we don't actually do anything
+ * with then in the host context.
+ */
+#define INTEL_PMC_IDX_FIXED_VLBR GLOBAL_STATUS_LBRS_FROZEN_BIT
+
+/*
+ * Pseudo-encoding the guest LBR event as event=0x00,umask=0x1a,
+ * since it would claim bit 58 which is effectively Fixed25.
+ */
+#define INTEL_FIXED_VLBR_EVENT 0x1a00
+
/*
* Adaptive PEBS v4
*/
--
2.21.3
The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to
indicate whether a guest LBR event is still needed by the vcpu. If the vcpu
no longer accesses LBR related registers within a scheduling time slice,
and the enable bit of LBR has been unset, vPMU will treat the guest LBR
event as a bland event of a vPMC counter and release it as usual. Also the
passthrough state of LBR records msrs is cancelled.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 9 +++++++++
arch/x86/kvm/pmu.h | 4 ++++
arch/x86/kvm/vmx/pmu_intel.c | 11 +++++++++++
3 files changed, 24 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5053f4238218..d0dece055605 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -451,6 +451,12 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
}
+static inline void kvm_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
+{
+ if (kvm_x86_ops.pmu_ops->lbr_cleanup)
+ kvm_x86_ops.pmu_ops->lbr_cleanup(vcpu);
+}
+
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
@@ -469,6 +475,9 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
pmc_stop_counter(pmc);
+
+ if (i == GUEST_LBR_IN_USE_IDX && pmu->lbr_event)
+ kvm_pmu_lbr_cleanup(vcpu);
}
bitmap_zero(pmu->pmc_in_use, X86_PMC_IDX_MAX);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ed96cbb40757..78f0cfe1622f 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -15,6 +15,9 @@
#define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
+/* Indicates whether LBR msrs were accessed during the last time slice. */
+#define GUEST_LBR_IN_USE_IDX INTEL_PMC_IDX_FIXED_VLBR
+
struct kvm_event_hw_type_mapping {
u8 eventsel;
u8 unit_mask;
@@ -38,6 +41,7 @@ struct kvm_pmu_ops {
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
void (*deliver_pmi)(struct kvm_vcpu *vcpu);
+ void (*lbr_cleanup)(struct kvm_vcpu *vcpu);
};
static inline bool event_is_oncpu(struct perf_event *event)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 8f9c837c7dca..ea4faae56473 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -353,6 +353,7 @@ static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
msr_info->data = 0;
local_irq_enable();
+ __set_bit(GUEST_LBR_IN_USE_IDX, pmu->pmc_in_use);
return true;
}
@@ -460,6 +461,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmcs_write64(GUEST_IA32_DEBUGCTL, data);
if (!msr_info->host_initiated && !pmu->lbr_event)
intel_pmu_create_lbr_event(vcpu);
+ __set_bit(GUEST_LBR_IN_USE_IDX, pmu->pmc_in_use);
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
@@ -555,6 +557,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
0, pmu->nr_arch_gp_counters);
bitmap_set(pmu->all_valid_pmc_idx,
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
+ if (lbr_is_enabled(vcpu))
+ bitmap_set(pmu->all_valid_pmc_idx, GUEST_LBR_IN_USE_IDX, 1);
nested_vmx_pmu_entry_exit_ctls_update(vcpu);
}
@@ -636,6 +640,12 @@ static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
}
+static void intel_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
+{
+ if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
+ intel_pmu_free_lbr_event(vcpu);
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -651,4 +661,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.init = intel_pmu_init,
.reset = intel_pmu_reset,
.deliver_pmi = intel_pmu_deliver_pmi,
+ .lbr_cleanup = intel_pmu_lbr_cleanup,
};
--
2.21.3
The LBR stack msrs are model specific. The perf subsystem has already
obtained the LBR stack base addresses based on the cpu model.
Therefore, an interface is added to allow callers outside the perf
subsystem to obtain the LBR stack base addresses. It's useful for
hypervisors to emulate the LBR feature for guests with less code.
Cc: Peter Zijlstra (Intel) <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/lbr.c | 20 ++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 12 ++++++++++++
2 files changed, 32 insertions(+)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 65113b16804a..6c60dcaaaf69 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1343,3 +1343,23 @@ void intel_pmu_lbr_init_knl(void)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
}
+
+/**
+ * x86_perf_get_lbr - get the LBR stack information
+ *
+ * @stack: the caller's memory to store the LBR stack information
+ *
+ * Returns: 0 indicates the LBR stack info has been successfully obtained
+ */
+int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
+{
+ int lbr_fmt = x86_pmu.intel_cap.lbr_format;
+
+ stack->nr = x86_pmu.lbr_nr;
+ stack->from = x86_pmu.lbr_from;
+ stack->to = x86_pmu.lbr_to;
+ stack->info = (lbr_fmt == LBR_FORMAT_INFO) ? MSR_LBR_INFO_0 : 0;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e855e9cf2c37..5071515f6b0f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -333,6 +333,13 @@ struct perf_guest_switch_msr {
u64 host, guest;
};
+struct x86_pmu_lbr {
+ unsigned int nr;
+ unsigned int from;
+ unsigned int to;
+ unsigned int info;
+};
+
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -348,12 +355,17 @@ static inline void perf_check_microcode(void) { }
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr(struct x86_pmu_lbr *stack);
#else
static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
{
*nr = 0;
return NULL;
}
+static inline int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
+{
+ return -1;
+}
#endif
#ifdef CONFIG_CPU_SUP_INTEL
--
2.21.3
When the guest LBR event exists and the LBR stack is available (defined
as 'event->oncpu != -1'), the LBR records msrs access would not be
intercepted but passthrough to the vcpu before vm-entry. This kind
of availability check is always performed before vm-entry, but as late
as possible to avoid reclaiming resources from any higher priority event.
A negative check result would bring the registers interception back, and
it also prevents real registers accesses and potential data leakage.
The KVM emits a pr_warn() on the host when the guest LBR is unavailable.
The administer is supposed to reminder users that the guest result may be
inaccurate if someone is using LBR to record hypervisor on the host side.
Suggested-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 38 ++++++++++++++++++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 4 +++-
3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 78f0cfe1622f..bb3f3ef3386e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -42,6 +42,7 @@ struct kvm_pmu_ops {
void (*reset)(struct kvm_vcpu *vcpu);
void (*deliver_pmi)(struct kvm_vcpu *vcpu);
void (*lbr_cleanup)(struct kvm_vcpu *vcpu);
+ void (*availability_check)(struct kvm_vcpu *vcpu);
};
static inline bool event_is_oncpu(struct perf_event *event)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ea4faae56473..db185dca903d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -646,6 +646,43 @@ static void intel_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
intel_pmu_free_lbr_event(vcpu);
}
+static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (!pmu->lbr_event)
+ return false;
+
+ if (event_is_oncpu(pmu->lbr_event)) {
+ intel_pmu_intercept_lbr_msrs(vcpu, false);
+ } else {
+ intel_pmu_intercept_lbr_msrs(vcpu, true);
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * Higher priority host perf events (e.g. cpu pinned) could reclaim the
+ * pmu resources (e.g. LBR) that were assigned to the guest. This is
+ * usually done via ipi calls (more details in perf_install_in_context).
+ *
+ * Before entering the non-root mode (with irq disabled here), double
+ * confirm that the pmu features enabled to the guest are not reclaimed
+ * by higher priority host events. Otherwise, disallow vcpu's access to
+ * the reclaimed features.
+ */
+static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (lbr_is_enabled(vcpu) && !intel_pmu_lbr_is_availabile(vcpu) &&
+ (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
+ pr_warn_ratelimited("kvm: vcpu-%d: LBR is temporarily unavailable.\n",
+ vcpu->vcpu_id);
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -662,4 +699,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.reset = intel_pmu_reset,
.deliver_pmi = intel_pmu_deliver_pmi,
.lbr_cleanup = intel_pmu_lbr_cleanup,
+ .availability_check = intel_pmu_availability_check,
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9969d663826a..80d036c5f64a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6696,8 +6696,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
pt_guest_enter(vmx);
- if (vcpu_to_pmu(vcpu)->version)
+ if (vcpu_to_pmu(vcpu)->version) {
atomic_switch_perf_msrs(vmx);
+ kvm_x86_ops.pmu_ops->availability_check(vcpu);
+ }
atomic_switch_umwait_control_msr(vmx);
--
2.21.3
Subject: perf/x86: Add constraint to create guest LBR event without hw counter
From: Like Xu <[email protected]>
Date: Thu, 14 May 2020 16:30:47 +0800
From: Like Xu <[email protected]>
The hypervisor may request the perf subsystem to schedule a time window
to directly access the LBR stack msrs for its own use. Normally, it would
create a guest LBR event with callstack mode enabled, which is scheduled
along with other ordinary LBR events on the host but in an exclusive way.
To avoid wasting a counter for the guest LBR event, the perf tracks it via
needs_guest_lbr_without_counter() and assigns it with a fake VLBR counter
with the help of new lbr_without_counter_constraint. As with the BTS event,
there is actually no hardware counter assigned for the guest LBR event.
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 18 ++++++++++++++++++
arch/x86/events/intel/lbr.c | 4 ++++
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/perf_event.h | 22 +++++++++++++++++++++-
5 files changed, 45 insertions(+), 1 deletion(-)
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1104,6 +1104,7 @@ static inline void x86_assign_hw_event(s
switch (hwc->idx) {
case INTEL_PMC_IDX_FIXED_BTS:
+ case INTEL_PMC_IDX_FIXED_VLBR:
hwc->config_base = 0;
hwc->event_base = 0;
break;
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2621,6 +2621,20 @@ intel_bts_constraints(struct perf_event
return NULL;
}
+/*
+ * Note: matches a fake event, like Fixed2.
+ */
+static struct event_constraint *
+intel_vlbr_constraints(struct perf_event *event)
+{
+ struct event_constraint *c = &vlbr_constraint;
+
+ if (unlikely(constraint_match(c, event->hw.config)))
+ return c;
+
+ return NULL;
+}
+
static int intel_alt_er(int idx, u64 config)
{
int alt_idx = idx;
@@ -2811,6 +2825,10 @@ __intel_get_event_constraints(struct cpu
{
struct event_constraint *c;
+ c = intel_vlbr_constraints(event);
+ if (c)
+ return c;
+
c = intel_bts_constraints(event);
if (c)
return c;
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1363,3 +1363,7 @@ int x86_perf_get_lbr(struct x86_pmu_lbr
return 0;
}
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
+
+struct event_constraint vlbr_constraint =
+ FIXED_EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT,
+ (INTEL_PMC_IDX_FIXED_VLBR - INTEL_PMC_IDX_FIXED));
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -990,6 +990,7 @@ void release_ds_buffers(void);
void reserve_ds_buffers(void);
extern struct event_constraint bts_constraint;
+extern struct event_constraint vlbr_constraint;
void intel_pmu_enable_bts(u64 config);
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,10 +192,30 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
/*
+ * We model guest LBR event tracing as another fixed-mode PMC like BTS.
+ *
+ * We choose bit 58 because it's used to indicate LBR stack frozen state
+ * for architectural perfmon v4, also we unconditionally mask that bit in
+ * the handle_pmi_common(), so it'll never be set in the overflow handling.
+ *
+ * With this fake counter assigned, the guest LBR event user (such as KVM),
+ * can program the LBR registers on its own, and we don't actually do anything
+ * with then in the host context.
+ */
+#define INTEL_PMC_IDX_FIXED_VLBR (GLOBAL_STATUS_LBRS_FROZEN_BIT)
+
+/*
+ * Pseudo-encoding the guest LBR event as event=0x00,umask=0x1b,
+ * since it would claim bit 58 which is effectively Fixed26.
+ */
+#define INTEL_FIXED_VLBR_EVENT 0x1b00
+
+/*
* Adaptive PEBS v4
*/
Subject: perf/x86: Keep LBR stack unchanged in host context for guest LBR event
From: Like Xu <[email protected]>
Date: Thu, 14 May 2020 16:30:48 +0800
From: Like Xu <[email protected]>
When a guest wants to use the LBR stack, its hypervisor creates a guest
LBR event and let host perf schedules it. A new 'int guest_lbr_enabled'
field in the 'struct cpu_hw_events', is marked as true when perf adds
a guest LBR event and marked as false on deletion.
The LBR stack msrs are accessible to the guest when its guest LBR event
is scheduled in by the perf subsystem. Before scheduling this event out,
we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT.
Otherwise, some unexpected branch operations may interfere with guest
behavior, pollute LBR records, and even cause host branch data leakage.
In addition, the intel_pmu_lbr_read() on the host is also avoidable.
To ensure that guest LBR records are not lost during the context switch,
the BRANCH_CALL_STACK flag should be configured in the 'branch_sample_type'
for any guest LBR event because the callstack mode could save/restore guest
unread LBR records with the help of intel_pmu_lbr_sched_task() naturally.
However, the regular host LBR perf event doesn't save/restore LBR_SELECT,
because it's configured in the LBR_enable() based on branch_sample_type.
So when a guest LBR is running, the guest LBR_SELECT may changes for its
own use and we have to support LBR_SELECT save/restore to ensure what the
guest LBR_SELECT value doesn't get lost during the context switching.
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/intel/lbr.c | 16 ++++++++++++++--
arch/x86/events/perf_event.h | 3 +++
2 files changed, 17 insertions(+), 2 deletions(-)
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,9 @@ static void __intel_pmu_lbr_restore(stru
wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+ if (cpuc->lbr_select)
+ wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}
static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -415,6 +418,9 @@ static void __intel_pmu_lbr_save(struct
cpuc->last_task_ctx = task_ctx;
cpuc->last_log_id = ++task_ctx->log_id;
+
+ if (cpuc->lbr_select)
+ rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}
void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
@@ -485,6 +491,9 @@ void intel_pmu_lbr_add(struct perf_event
if (!x86_pmu.lbr_nr)
return;
+ if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
+ cpuc->lbr_select = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -532,6 +541,9 @@ void intel_pmu_lbr_del(struct perf_event
task_ctx->lbr_callstack_users--;
}
+ if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
+ cpuc->lbr_select = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -1365,5 +1377,5 @@ int x86_perf_get_lbr(struct x86_pmu_lbr
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
struct event_constraint vlbr_constraint =
- FIXED_EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT,
- (INTEL_PMC_IDX_FIXED_VLBR - INTEL_PMC_IDX_FIXED));
+ __EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, (1ULL << INTEL_PMC_IDX_FIXED_VLBR),
+ FIXED_EVENT_FLAGS, 1, 0, PERF_X86_EVENT_LBR_SELECT);
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -78,6 +78,7 @@ static inline bool constraint_match(stru
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
+#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -237,6 +238,7 @@ struct cpu_hw_events {
u64 br_sel;
struct x86_perf_task_context *last_task_ctx;
int last_log_id;
+ int lbr_select;
/*
* Intel host/guest exclude bits
@@ -722,6 +724,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+ u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
On Thu, May 14, 2020 at 04:30:48PM +0800, Like Xu wrote:
> @@ -544,7 +562,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>
> - if (cpuc->lbr_users)
> + /*
> + * When the LBR hardware is scheduled for a guest LBR event,
> + * the guest will dis/enables LBR itself at the appropriate time,
> + * including configuring MSR_LBR_SELECT.
> + */
> + if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
> __intel_pmu_lbr_enable(pmi);
> }
No!, that should be done through perf_event_attr::exclude_host, as I
believe all the other KVM event do it.
Hi Peter,
Thanks for the clear attitude and code refinement.
On 2020/5/18 20:02, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:48PM +0800, Like Xu wrote:
>> @@ -544,7 +562,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>
>> - if (cpuc->lbr_users)
>> + /*
>> + * When the LBR hardware is scheduled for a guest LBR event,
>> + * the guest will dis/enables LBR itself at the appropriate time,
>> + * including configuring MSR_LBR_SELECT.
>> + */
>> + if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
>> __intel_pmu_lbr_enable(pmi);
>> }
>
> No!, that should be done through perf_event_attr::exclude_host, as I
> believe all the other KVM event do it.
>
Sure, I could reuse cpuc->intel_ctrl_guest_mask to rewrite this part:
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d788edb7c1f9..f1243e8211ca 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2189,7 +2189,8 @@ static void intel_pmu_disable_event(struct perf_event
*event)
} else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
intel_pmu_disable_bts();
intel_pmu_drain_bts_buffer();
- }
+ } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ intel_clear_masks(event, idx);
/*
* Needs to be called after x86_pmu_disable_event,
@@ -2271,7 +2272,8 @@ static void intel_pmu_enable_event(struct perf_event
*event)
if (!__this_cpu_read(cpu_hw_events.enabled))
return;
intel_pmu_enable_bts(hwc->config);
- }
+ } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ intel_set_masks(event, idx);
}
static void intel_pmu_add_event(struct perf_event *event)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index b8dabf1698d6..1b30c76815dd 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -552,11 +552,19 @@ void intel_pmu_lbr_del(struct perf_event *event)
perf_sched_cb_dec(event->ctx->pmu);
}
+static inline bool vlbr_is_enabled(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ return test_bit(INTEL_PMC_IDX_FIXED_VLBR,
+ (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+}
+
void intel_pmu_lbr_enable_all(bool pmi)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !vlbr_is_enabled())
__intel_pmu_lbr_enable(pmi);
}
@@ -564,7 +572,7 @@ void intel_pmu_lbr_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !vlbr_is_enabled())
__intel_pmu_lbr_disable();
}
@@ -706,7 +714,8 @@ void intel_pmu_lbr_read(void)
* This could be smarter and actually check the event,
* but this simple approach seems to work for now.
*/
- if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+ if (!cpuc->lbr_users || vlbr_is_enabled() ||
+ cpuc->lbr_users == cpuc->lbr_pebs_users)
return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
Is this acceptable to you ?
If you have more comments on the patchset, please let me know.
Thanks,
Like Xu
On Tue, May 19, 2020 at 11:08:41AM +0800, Like Xu wrote:
> Sure, I could reuse cpuc->intel_ctrl_guest_mask to rewrite this part:
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index d788edb7c1f9..f1243e8211ca 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2189,7 +2189,8 @@ static void intel_pmu_disable_event(struct perf_event
> *event)
> } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
> intel_pmu_disable_bts();
> intel_pmu_drain_bts_buffer();
> - }
> + } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
> + intel_clear_masks(event, idx);
>
> /*
> * Needs to be called after x86_pmu_disable_event,
> @@ -2271,7 +2272,8 @@ static void intel_pmu_enable_event(struct perf_event
> *event)
> if (!__this_cpu_read(cpu_hw_events.enabled))
> return;
> intel_pmu_enable_bts(hwc->config);
> - }
> + } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
> + intel_set_masks(event, idx);
> }
This makes me wonder if we can pull intel_{set,clear}_masks() out of
that if()-forest, but that's something for later...
> static void intel_pmu_add_event(struct perf_event *event)
> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
> index b8dabf1698d6..1b30c76815dd 100644
> --- a/arch/x86/events/intel/lbr.c
> +++ b/arch/x86/events/intel/lbr.c
> @@ -552,11 +552,19 @@ void intel_pmu_lbr_del(struct perf_event *event)
> perf_sched_cb_dec(event->ctx->pmu);
> }
>
> +static inline bool vlbr_is_enabled(void)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +
> + return test_bit(INTEL_PMC_IDX_FIXED_VLBR,
> + (unsigned long *)&cpuc->intel_ctrl_guest_mask);
> +}
Maybe call this: vlbr_exclude_host() ?
> +
> void intel_pmu_lbr_enable_all(bool pmi)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>
> - if (cpuc->lbr_users)
> + if (cpuc->lbr_users && !vlbr_is_enabled())
> __intel_pmu_lbr_enable(pmi);
> }
>
> @@ -564,7 +572,7 @@ void intel_pmu_lbr_disable_all(void)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>
> - if (cpuc->lbr_users)
> + if (cpuc->lbr_users && !vlbr_is_enabled())
> __intel_pmu_lbr_disable();
> }
>
> @@ -706,7 +714,8 @@ void intel_pmu_lbr_read(void)
> * This could be smarter and actually check the event,
> * but this simple approach seems to work for now.
> */
> - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
> + if (!cpuc->lbr_users || vlbr_is_enabled() ||
> + cpuc->lbr_users == cpuc->lbr_pebs_users)
> return;
>
> if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
>
> Is this acceptable to you ?
Yeah, looks about right. Let me stare at the rest.
On Thu, May 14, 2020 at 04:30:50PM +0800, Like Xu wrote:
> @@ -203,6 +206,12 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> msr_info->data = pmu->global_ovf_ctrl;
> return 0;
> + case MSR_IA32_PERF_CAPABILITIES:
> + if (!msr_info->host_initiated &&
> + !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> + return 1;
I know this is KVM code, so maybe they feel differently, but I find the
above indentation massively confusing. Consider using: "set cino=:0(0"
if you're a vim user.
> + msr_info->data = vcpu->arch.perf_capabilities;
> + return 0;
> default:
> if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
> u64 val = pmc_read_counter(pmc);
> @@ -261,6 +270,16 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> return 0;
> }
> break;
> + case MSR_IA32_PERF_CAPABILITIES:
> + if (!msr_info->host_initiated ||
> + !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> + return 1;
Idem.
> + if (!(data & ~vmx_get_perf_capabilities()))
> + return 1;
> + if ((data ^ vmx_get_perf_capabilities()) & PERF_CAP_LBR_FMT)
> + return 1;
> + vcpu->arch.perf_capabilities = data;
> + return 0;
> default:
> if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
> if (!msr_info->host_initiated)
On Thu, May 14, 2020 at 04:30:51PM +0800, Like Xu wrote:
> +static inline bool event_is_oncpu(struct perf_event *event)
> +{
> + return event && event->oncpu != -1;
> +}
> +/*
> + * It's safe to access LBR msrs from guest when they have not
> + * been passthrough since the host would help restore or reset
> + * the LBR msrs records when the guest LBR event is scheduled in.
> + */
> +static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
> + struct msr_data *msr_info, bool read)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + u32 index = msr_info->index;
> +
> + if (!intel_is_valid_lbr_msr(vcpu, index))
> + return false;
> +
> + if (!msr_info->host_initiated && !pmu->lbr_event)
> + intel_pmu_create_lbr_event(vcpu);
> +
> + /*
> + * Disable irq to ensure the LBR feature doesn't get reclaimed by the
> + * host at the time the value is read from the msr, and this avoids the
> + * host LBR value to be leaked to the guest. If LBR has been reclaimed,
> + * return 0 on guest reads.
> + */
> + local_irq_disable();
> + if (event_is_oncpu(pmu->lbr_event)) {
> + if (read)
> + rdmsrl(index, msr_info->data);
> + else
> + wrmsrl(index, msr_info->data);
> + } else if (read)
> + msr_info->data = 0;
> + local_irq_enable();
> +
> + return true;
> +}
So this runs in the vCPU thread in host context to emulate the MSR
access, right?
On Thu, May 14, 2020 at 04:30:51PM +0800, Like Xu wrote:
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_RAW,
> + .size = sizeof(attr),
> + .pinned = true,
> + .exclude_host = true,
> + .config = INTEL_FIXED_VLBR_EVENT,
> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> + PERF_SAMPLE_BRANCH_USER,
Maybe order the fields according to how they're declared in the
structure?
> + };
On Thu, May 14, 2020 at 04:30:51PM +0800, Like Xu wrote:
> @@ -6698,6 +6698,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>
> if (vcpu_to_pmu(vcpu)->version)
> atomic_switch_perf_msrs(vmx);
> +
> atomic_switch_umwait_control_msr(vmx);
>
> if (enable_preemption_timer)
Is this where the test to see if any of the KVM events went into ERROR
state should go?
if (event->state == PERF_EVENT_STATE_ERROR) {
pr_warn("unhappy, someone stole our counter\n");
}
like..
On Thu, May 14, 2020 at 04:30:53PM +0800, Like Xu wrote:
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index ea4faae56473..db185dca903d 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -646,6 +646,43 @@ static void intel_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
> intel_pmu_free_lbr_event(vcpu);
> }
>
> +static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> + if (!pmu->lbr_event)
> + return false;
> +
> + if (event_is_oncpu(pmu->lbr_event)) {
> + intel_pmu_intercept_lbr_msrs(vcpu, false);
> + } else {
> + intel_pmu_intercept_lbr_msrs(vcpu, true);
> + return false;
> + }
> +
> + return true;
> +}
This is unreadable gunk, what?
> +/*
> + * Higher priority host perf events (e.g. cpu pinned) could reclaim the
> + * pmu resources (e.g. LBR) that were assigned to the guest. This is
> + * usually done via ipi calls (more details in perf_install_in_context).
> + *
> + * Before entering the non-root mode (with irq disabled here), double
> + * confirm that the pmu features enabled to the guest are not reclaimed
> + * by higher priority host events. Otherwise, disallow vcpu's access to
> + * the reclaimed features.
> + */
> +static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
> +{
> + lockdep_assert_irqs_disabled();
> +
> + if (lbr_is_enabled(vcpu) && !intel_pmu_lbr_is_availabile(vcpu) &&
> + (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
> + pr_warn_ratelimited("kvm: vcpu-%d: LBR is temporarily unavailable.\n",
> + vcpu->vcpu_id);
More unreadable nonsense; when the events go into ERROR state, it's a
permanent fail, they'll not come back.
> +}
> +
> struct kvm_pmu_ops intel_pmu_ops = {
> .find_arch_event = intel_find_arch_event,
> .find_fixed_event = intel_find_fixed_event,
> @@ -662,4 +699,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
> .reset = intel_pmu_reset,
> .deliver_pmi = intel_pmu_deliver_pmi,
> .lbr_cleanup = intel_pmu_lbr_cleanup,
> + .availability_check = intel_pmu_availability_check,
> };
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9969d663826a..80d036c5f64a 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6696,8 +6696,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>
> pt_guest_enter(vmx);
>
> - if (vcpu_to_pmu(vcpu)->version)
> + if (vcpu_to_pmu(vcpu)->version) {
> atomic_switch_perf_msrs(vmx);
> + kvm_x86_ops.pmu_ops->availability_check(vcpu);
> + }
AFAICT you just did a call out to the kvm_pmu crud in
atomic_switch_perf_msrs(), why do another call?
On 2020/5/19 18:53, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:50PM +0800, Like Xu wrote:
>> @@ -203,6 +206,12 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
>> msr_info->data = pmu->global_ovf_ctrl;
>> return 0;
>> + case MSR_IA32_PERF_CAPABILITIES:
>> + if (!msr_info->host_initiated &&
>> + !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
>> + return 1;
> I know this is KVM code, so maybe they feel differently, but I find the
> above indentation massively confusing. Consider using: "set cino=:0(0"
> if you're a vim user.
Nice tip and I'll apply it. Thanks.
>
>> + msr_info->data = vcpu->arch.perf_capabilities;
>> + return 0;
>> default:
>> if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
>> u64 val = pmc_read_counter(pmc);
>> @@ -261,6 +270,16 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> return 0;
>> }
>> break;
>> + case MSR_IA32_PERF_CAPABILITIES:
>> + if (!msr_info->host_initiated ||
>> + !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
>> + return 1;
> Idem.
>
>> + if (!(data & ~vmx_get_perf_capabilities()))
>> + return 1;
>> + if ((data ^ vmx_get_perf_capabilities()) & PERF_CAP_LBR_FMT)
>> + return 1;
>> + vcpu->arch.perf_capabilities = data;
>> + return 0;
>> default:
>> if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
>> if (!msr_info->host_initiated)
On 2020/5/19 19:00, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:51PM +0800, Like Xu wrote:
>> +static inline bool event_is_oncpu(struct perf_event *event)
>> +{
>> + return event && event->oncpu != -1;
>> +}
>
>> +/*
>> + * It's safe to access LBR msrs from guest when they have not
>> + * been passthrough since the host would help restore or reset
>> + * the LBR msrs records when the guest LBR event is scheduled in.
>> + */
>> +static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
>> + struct msr_data *msr_info, bool read)
>> +{
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> + u32 index = msr_info->index;
>> +
>> + if (!intel_is_valid_lbr_msr(vcpu, index))
>> + return false;
>> +
>> + if (!msr_info->host_initiated && !pmu->lbr_event)
>> + intel_pmu_create_lbr_event(vcpu);
>> +
>> + /*
>> + * Disable irq to ensure the LBR feature doesn't get reclaimed by the
>> + * host at the time the value is read from the msr, and this avoids the
>> + * host LBR value to be leaked to the guest. If LBR has been reclaimed,
>> + * return 0 on guest reads.
>> + */
>> + local_irq_disable();
>> + if (event_is_oncpu(pmu->lbr_event)) {
>> + if (read)
>> + rdmsrl(index, msr_info->data);
>> + else
>> + wrmsrl(index, msr_info->data);
>> + } else if (read)
>> + msr_info->data = 0;
>> + local_irq_enable();
>> +
>> + return true;
>> +}
> So this runs in the vCPU thread in host context to emulate the MSR
> access, right?
Yes, it's called to emulate MSR accesses when the guest LBR event is
scheduled on while the LBR stack MSRs have not been passthrough to the vCPU.
Thanks,
Like Xu
>
On 2020/5/19 19:01, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:51PM +0800, Like Xu wrote:
>
>> + struct perf_event_attr attr = {
>> + .type = PERF_TYPE_RAW,
>> + .size = sizeof(attr),
>> + .pinned = true,
>> + .exclude_host = true,
>> + .config = INTEL_FIXED_VLBR_EVENT,
>> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
>> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
>> + PERF_SAMPLE_BRANCH_USER,
> Maybe order the fields according to how they're declared in the
> structure?
Sure, I'll sort the fields in the order of declaration. Thanks.
>> + };
On 2020/5/19 19:03, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:51PM +0800, Like Xu wrote:
>> @@ -6698,6 +6698,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>
>> if (vcpu_to_pmu(vcpu)->version)
>> atomic_switch_perf_msrs(vmx);
>> +
>> atomic_switch_umwait_control_msr(vmx);
>>
>> if (enable_preemption_timer)
> Is this where the test to see if any of the KVM events went into ERROR
> state should go?
Yes, I chose the same location to do LBR availability check in the next
patch 0010.
Actually for normal vPMU counters and their events,
I'm not sure whether pr_warn() should also be used widely.
The current approach is to keep vPMC silent when it may be inaccurate.
I may need @Paolo's attitude on this issue.
Thanks,
Like Xu
>
> if (event->state == PERF_EVENT_STATE_ERROR) {
> pr_warn("unhappy, someone stole our counter\n");
> }
>
> like..
On 2020/5/19 19:15, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:53PM +0800, Like Xu wrote:
>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index ea4faae56473..db185dca903d 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -646,6 +646,43 @@ static void intel_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
>> intel_pmu_free_lbr_event(vcpu);
>> }
>>
>> +static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> + if (!pmu->lbr_event)
>> + return false;
>> +
>> + if (event_is_oncpu(pmu->lbr_event)) {
>> + intel_pmu_intercept_lbr_msrs(vcpu, false);
>> + } else {
>> + intel_pmu_intercept_lbr_msrs(vcpu, true);
>> + return false;
>> + }
>> +
>> + return true;
>> +}
> This is unreadable gunk, what?
Abstractly, it is saying "KVM would passthrough the LBR satck MSRs if
event_is_oncpu() is true, otherwise cancel the passthrough state if any."
I'm using 'event->oncpu != -1' to represent the guest LBR event
is scheduled on rather than 'event->state == PERF_EVENT_STATE_ERROR'.
For intel_pmu_intercept_lbr_msrs(), false means to passthrough the LBR stack
MSRs to the vCPU, and true means to cancel the passthrough state and make
LBR MSR accesses trapped by the KVM.
>
>> +/*
>> + * Higher priority host perf events (e.g. cpu pinned) could reclaim the
>> + * pmu resources (e.g. LBR) that were assigned to the guest. This is
>> + * usually done via ipi calls (more details in perf_install_in_context).
>> + *
>> + * Before entering the non-root mode (with irq disabled here), double
>> + * confirm that the pmu features enabled to the guest are not reclaimed
>> + * by higher priority host events. Otherwise, disallow vcpu's access to
>> + * the reclaimed features.
>> + */
>> +static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
>> +{
>> + lockdep_assert_irqs_disabled();
>> +
>> + if (lbr_is_enabled(vcpu) && !intel_pmu_lbr_is_availabile(vcpu) &&
>> + (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
>> + pr_warn_ratelimited("kvm: vcpu-%d: LBR is temporarily unavailable.\n",
>> + vcpu->vcpu_id);
> More unreadable nonsense; when the events go into ERROR state, it's a
> permanent fail, they'll not come back.
It's not true. The guest LBR event with 'ERROR state' or 'oncpu != -1'
would be
lazy released and re-created in the next time the
intel_pmu_create_lbr_event() is
called and it's supposed to be re-scheduled and re-do availability_check()
as well.
From the perspective of the guest user, the guest LBR is only temporarily
unavailable
until the host no longer reclaims the LBR.
>
>> +}
>> +
>> struct kvm_pmu_ops intel_pmu_ops = {
>> .find_arch_event = intel_find_arch_event,
>> .find_fixed_event = intel_find_fixed_event,
>> @@ -662,4 +699,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
>> .reset = intel_pmu_reset,
>> .deliver_pmi = intel_pmu_deliver_pmi,
>> .lbr_cleanup = intel_pmu_lbr_cleanup,
>> + .availability_check = intel_pmu_availability_check,
>> };
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 9969d663826a..80d036c5f64a 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -6696,8 +6696,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>
>> pt_guest_enter(vmx);
>>
>> - if (vcpu_to_pmu(vcpu)->version)
>> + if (vcpu_to_pmu(vcpu)->version) {
>> atomic_switch_perf_msrs(vmx);
>> + kvm_x86_ops.pmu_ops->availability_check(vcpu);
>> + }
> AFAICT you just did a call out to the kvm_pmu crud in
> atomic_switch_perf_msrs(), why do another call?
In fact, availability_check() is only called here for just one time.
The callchain looks like:
- vmx_vcpu_run()
- kvm_x86_ops.pmu_ops->availability_check();
- intel_pmu_availability_check()
- intel_pmu_lbr_is_availabile()
- event_is_oncpu() ...
Thanks,
Like Xu
>
>
Hi Peter,
On 2020/5/19 18:45, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:08:41AM +0800, Like Xu wrote:
>
>> Sure, I could reuse cpuc->intel_ctrl_guest_mask to rewrite this part:
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index d788edb7c1f9..f1243e8211ca 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2189,7 +2189,8 @@ static void intel_pmu_disable_event(struct perf_event
>> *event)
>> } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
>> intel_pmu_disable_bts();
>> intel_pmu_drain_bts_buffer();
>> - }
>> + } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
>> + intel_clear_masks(event, idx);
>>
>> /*
>> * Needs to be called after x86_pmu_disable_event,
>> @@ -2271,7 +2272,8 @@ static void intel_pmu_enable_event(struct perf_event
>> *event)
>> if (!__this_cpu_read(cpu_hw_events.enabled))
>> return;
>> intel_pmu_enable_bts(hwc->config);
>> - }
>> + } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
>> + intel_set_masks(event, idx);
>> }
> This makes me wonder if we can pull intel_{set,clear}_masks() out of
> that if()-forest, but that's something for later...
>
>> static void intel_pmu_add_event(struct perf_event *event)
>> diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
>> index b8dabf1698d6..1b30c76815dd 100644
>> --- a/arch/x86/events/intel/lbr.c
>> +++ b/arch/x86/events/intel/lbr.c
>> @@ -552,11 +552,19 @@ void intel_pmu_lbr_del(struct perf_event *event)
>> perf_sched_cb_dec(event->ctx->pmu);
>> }
>>
>> +static inline bool vlbr_is_enabled(void)
>> +{
>> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> +
>> + return test_bit(INTEL_PMC_IDX_FIXED_VLBR,
>> + (unsigned long *)&cpuc->intel_ctrl_guest_mask);
>> +}
> Maybe call this: vlbr_exclude_host() ?
Sure, I'll apply it.
>
>> +
>> void intel_pmu_lbr_enable_all(bool pmi)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>
>> - if (cpuc->lbr_users)
>> + if (cpuc->lbr_users && !vlbr_is_enabled())
>> __intel_pmu_lbr_enable(pmi);
>> }
>>
>> @@ -564,7 +572,7 @@ void intel_pmu_lbr_disable_all(void)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>>
>> - if (cpuc->lbr_users)
>> + if (cpuc->lbr_users && !vlbr_is_enabled())
>> __intel_pmu_lbr_disable();
>> }
>>
>> @@ -706,7 +714,8 @@ void intel_pmu_lbr_read(void)
>> * This could be smarter and actually check the event,
>> * but this simple approach seems to work for now.
>> */
>> - if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
>> + if (!cpuc->lbr_users || vlbr_is_enabled() ||
>> + cpuc->lbr_users == cpuc->lbr_pebs_users)
>> return;
>>
>> if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
>>
>> Is this acceptable to you ?
> Yeah, looks about right. Let me stare at the rest.
Uh, thanks for your warmly comments on the rest KVM part.
Let me assume we do not have any blocking issues
on the host perf changes to enable LBR feature for guests.
Thanks,
Like Xu
On Tue, May 19, 2020 at 09:10:58PM +0800, Xu, Like wrote:
> On 2020/5/19 19:15, Peter Zijlstra wrote:
> > On Thu, May 14, 2020 at 04:30:53PM +0800, Like Xu wrote:
> >
> > > diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> > > index ea4faae56473..db185dca903d 100644
> > > --- a/arch/x86/kvm/vmx/pmu_intel.c
> > > +++ b/arch/x86/kvm/vmx/pmu_intel.c
> > > @@ -646,6 +646,43 @@ static void intel_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
> > > intel_pmu_free_lbr_event(vcpu);
> > > }
> > > +static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
> > > +{
> > > + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> > > +
> > > + if (!pmu->lbr_event)
> > > + return false;
> > > +
> > > + if (event_is_oncpu(pmu->lbr_event)) {
> > > + intel_pmu_intercept_lbr_msrs(vcpu, false);
> > > + } else {
> > > + intel_pmu_intercept_lbr_msrs(vcpu, true);
> > > + return false;
> > > + }
> > > +
> > > + return true;
> > > +}
> > This is unreadable gunk, what?
>
> Abstractly, it is saying "KVM would passthrough the LBR satck MSRs if
> event_is_oncpu() is true, otherwise cancel the passthrough state if any."
>
> I'm using 'event->oncpu != -1' to represent the guest LBR event
> is scheduled on rather than 'event->state == PERF_EVENT_STATE_ERROR'.
>
> For intel_pmu_intercept_lbr_msrs(), false means to passthrough the LBR stack
> MSRs to the vCPU, and true means to cancel the passthrough state and make
> LBR MSR accesses trapped by the KVM.
To me it seems very weird to change state in a function that is supposed
to just query state.
'is_available' seems to suggest a simple: return 'lbr_event->state ==
PERF_EVENT_STATE_ACTIVE' or something.
> > > +static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
> > > +{
> > > + lockdep_assert_irqs_disabled();
> > > +
> > > + if (lbr_is_enabled(vcpu) && !intel_pmu_lbr_is_availabile(vcpu) &&
> > > + (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
> > > + pr_warn_ratelimited("kvm: vcpu-%d: LBR is temporarily unavailable.\n",
> > > + vcpu->vcpu_id);
> > More unreadable nonsense; when the events go into ERROR state, it's a
> > permanent fail, they'll not come back.
> It's not true.? The guest LBR event with 'ERROR state' or 'oncpu != -1'
> would be
> lazy released and re-created in the next time the
> intel_pmu_create_lbr_event() is
> called and it's supposed to be re-scheduled and re-do availability_check()
> as well.
Where? Also, wth would you need to destroy and re-create an event for
that?
> > > @@ -6696,8 +6696,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> > > pt_guest_enter(vmx);
> > > - if (vcpu_to_pmu(vcpu)->version)
> > > + if (vcpu_to_pmu(vcpu)->version) {
> > > atomic_switch_perf_msrs(vmx);
> > > + kvm_x86_ops.pmu_ops->availability_check(vcpu);
> > > + }
> > AFAICT you just did a call out to the kvm_pmu crud in
> > atomic_switch_perf_msrs(), why do another call?
> In fact, availability_check() is only called here for just one time.
>
> The callchain looks like:
> - vmx_vcpu_run()
> ??? - kvm_x86_ops.pmu_ops->availability_check();
> ??? ??? - intel_pmu_availability_check()
> ??? ??? ??? - intel_pmu_lbr_is_availabile()
> ??? ??? ??? ??? - event_is_oncpu() ...
>
What I'm saying is that you just did a pmu_ops indirect call in
atomic_switch_perf_msrs(), why add another?
On Tue, May 19, 2020 at 12:53:35PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2020 at 04:30:50PM +0800, Like Xu wrote:
> > @@ -203,6 +206,12 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
> > msr_info->data = pmu->global_ovf_ctrl;
> > return 0;
> > + case MSR_IA32_PERF_CAPABILITIES:
> > + if (!msr_info->host_initiated &&
> > + !guest_cpuid_has(vcpu, X86_FEATURE_PDCM))
> > + return 1;
>
> I know this is KVM code, so maybe they feel differently, but I find the
> above indentation massively confusing. Consider using: "set cino=:0(0"
> if you're a vim user.
I most definitely don't feel differently. I would be strongly in favor of
making that pattern a checkpatch error.
On 2020/5/19 22:57, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 09:10:58PM +0800, Xu, Like wrote:
>> On 2020/5/19 19:15, Peter Zijlstra wrote:
>>> On Thu, May 14, 2020 at 04:30:53PM +0800, Like Xu wrote:
>>>
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index ea4faae56473..db185dca903d 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -646,6 +646,43 @@ static void intel_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
>>>> intel_pmu_free_lbr_event(vcpu);
>>>> }
>>>> +static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>> +
>>>> + if (!pmu->lbr_event)
>>>> + return false;
>>>> +
>>>> + if (event_is_oncpu(pmu->lbr_event)) {
>>>> + intel_pmu_intercept_lbr_msrs(vcpu, false);
>>>> + } else {
>>>> + intel_pmu_intercept_lbr_msrs(vcpu, true);
>>>> + return false;
>>>> + }
>>>> +
>>>> + return true;
>>>> +}
>>> This is unreadable gunk, what?
>> Abstractly, it is saying "KVM would passthrough the LBR satck MSRs if
>> event_is_oncpu() is true, otherwise cancel the passthrough state if any."
>>
>> I'm using 'event->oncpu != -1' to represent the guest LBR event
>> is scheduled on rather than 'event->state == PERF_EVENT_STATE_ERROR'.
>>
>> For intel_pmu_intercept_lbr_msrs(), false means to passthrough the LBR stack
>> MSRs to the vCPU, and true means to cancel the passthrough state and make
>> LBR MSR accesses trapped by the KVM.
> To me it seems very weird to change state in a function that is supposed
> to just query state.
>
> 'is_available' seems to suggest a simple: return 'lbr_event->state ==
> PERF_EVENT_STATE_ACTIVE' or something.
This clarification led me to reconsider the use of a more readable name here.
Do you accept the check usage of "event->oncpu != -1" instead of
'event->state == PERF_EVENT_STATE_ERROR' before KVM do passthrough ?
>
>>>> +static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + lockdep_assert_irqs_disabled();
>>>> +
>>>> + if (lbr_is_enabled(vcpu) && !intel_pmu_lbr_is_availabile(vcpu) &&
>>>> + (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
>>>> + pr_warn_ratelimited("kvm: vcpu-%d: LBR is temporarily unavailable.\n",
>>>> + vcpu->vcpu_id);
>>> More unreadable nonsense; when the events go into ERROR state, it's a
>>> permanent fail, they'll not come back.
>> It's not true. The guest LBR event with 'ERROR state' or 'oncpu != -1'
>> would be
>> lazy released and re-created in the next time the
>> intel_pmu_create_lbr_event() is
>> called and it's supposed to be re-scheduled and re-do availability_check()
>> as well.
> Where? Also, wth would you need to destroy and re-create an event for
> that?
If the guest does not set the EN_LBR bit and did not touch any LBR-related
registers
in the last time slice, KVM will destroy the guest LBR event in
kvm_pmu_cleanup()
which is called once every time the vCPU thread is scheduled in.
The re-creation is not directly called after the destruction
but is triggered by the next guest access to the LBR-related registers if any.
From the time when the guest LBR event enters the "oncpu! = -1" state
to the next re-creation, the guest LBR is not available. After the re-creation,
the guest LBR is hopefully available and if it's true, the LBR will be
passthrough
and used by the guest normally.
That's the reason for "LBR is temporarily unavailable"
and please let me know if it doesn't make sense to you.
>>>> @@ -6696,8 +6696,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>>> pt_guest_enter(vmx);
>>>> - if (vcpu_to_pmu(vcpu)->version)
>>>> + if (vcpu_to_pmu(vcpu)->version) {
>>>> atomic_switch_perf_msrs(vmx);
>>>> + kvm_x86_ops.pmu_ops->availability_check(vcpu);
>>>> + }
>>> AFAICT you just did a call out to the kvm_pmu crud in
>>> atomic_switch_perf_msrs(), why do another call?
>> In fact, availability_check() is only called here for just one time.
>>
>> The callchain looks like:
>> - vmx_vcpu_run()
>> - kvm_x86_ops.pmu_ops->availability_check();
>> - intel_pmu_availability_check()
>> - intel_pmu_lbr_is_availabile()
>> - event_is_oncpu() ...
>>
> What I'm saying is that you just did a pmu_ops indirect call in
> atomic_switch_perf_msrs(), why add another?
Do you mean the indirect call:
- atomic_switch_perf_msrs()
- perf_guest_get_msrs()
- x86_pmu.guest_get_msrs()
?
The two pmu_ops are quite different:
- the first one in atomic_switch_perf_msrs() is defined in the host side;
- the second one for availability_check() is defined in the KVM side;
The availability_check() for guest LBR event and MSRs pass-through
operations are definitely KVM context specific.
Thanks,
Like Xu
Hi Peter,
On 2020/5/20 10:01, Xu, Like wrote:
> On 2020/5/19 22:57, Peter Zijlstra wrote:
>> On Tue, May 19, 2020 at 09:10:58PM +0800, Xu, Like wrote:
>>> On 2020/5/19 19:15, Peter Zijlstra wrote:
>>>> On Thu, May 14, 2020 at 04:30:53PM +0800, Like Xu wrote:
>>>>
>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>>> index ea4faae56473..db185dca903d 100644
>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>>> @@ -646,6 +646,43 @@ static void intel_pmu_lbr_cleanup(struct kvm_vcpu
>>>>> *vcpu)
>>>>> intel_pmu_free_lbr_event(vcpu);
>>>>> }
>>>>> +static bool intel_pmu_lbr_is_availabile(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>>>> +
>>>>> + if (!pmu->lbr_event)
>>>>> + return false;
>>>>> +
>>>>> + if (event_is_oncpu(pmu->lbr_event)) {
>>>>> + intel_pmu_intercept_lbr_msrs(vcpu, false);
>>>>> + } else {
>>>>> + intel_pmu_intercept_lbr_msrs(vcpu, true);
>>>>> + return false;
>>>>> + }
>>>>> +
>>>>> + return true;
>>>>> +}
>>>> This is unreadable gunk, what?
>>> Abstractly, it is saying "KVM would passthrough the LBR satck MSRs if
>>> event_is_oncpu() is true, otherwise cancel the passthrough state if any."
>>>
>>> I'm using 'event->oncpu != -1' to represent the guest LBR event
>>> is scheduled on rather than 'event->state == PERF_EVENT_STATE_ERROR'.
>>>
>>> For intel_pmu_intercept_lbr_msrs(), false means to passthrough the LBR
>>> stack
>>> MSRs to the vCPU, and true means to cancel the passthrough state and make
>>> LBR MSR accesses trapped by the KVM.
>> To me it seems very weird to change state in a function that is supposed
>> to just query state.
>>
>> 'is_available' seems to suggest a simple: return 'lbr_event->state ==
>> PERF_EVENT_STATE_ACTIVE' or something.
> This clarification led me to reconsider the use of a more readable name here.
>
> Do you accept the check usage of "event->oncpu != -1" instead of
> 'event->state == PERF_EVENT_STATE_ERROR' before KVM do passthrough ?
>>
>>>>> +static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> + lockdep_assert_irqs_disabled();
>>>>> +
>>>>> + if (lbr_is_enabled(vcpu) && !intel_pmu_lbr_is_availabile(vcpu) &&
>>>>> + (vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
>>>>> + pr_warn_ratelimited("kvm: vcpu-%d: LBR is temporarily
>>>>> unavailable.\n",
>>>>> + vcpu->vcpu_id);
>>>> More unreadable nonsense; when the events go into ERROR state, it's a
>>>> permanent fail, they'll not come back.
>>> It's not true. The guest LBR event with 'ERROR state' or 'oncpu != -1'
>>> would be
>>> lazy released and re-created in the next time the
>>> intel_pmu_create_lbr_event() is
>>> called and it's supposed to be re-scheduled and re-do availability_check()
>>> as well.
>> Where? Also, wth would you need to destroy and re-create an event for
>> that?
> If the guest does not set the EN_LBR bit and did not touch any LBR-related
> registers
> in the last time slice, KVM will destroy the guest LBR event in
> kvm_pmu_cleanup()
> which is called once every time the vCPU thread is scheduled in.
>
> The re-creation is not directly called after the destruction
> but is triggered by the next guest access to the LBR-related registers if any.
>
> From the time when the guest LBR event enters the "oncpu! = -1" state
> to the next re-creation, the guest LBR is not available. After the
> re-creation,
> the guest LBR is hopefully available and if it's true, the LBR will be
> passthrough
> and used by the guest normally.
>
> That's the reason for "LBR is temporarily unavailable"
Do you still have any concerns on this issue?
> and please let me know if it doesn't make sense to you.
>
>>>>> @@ -6696,8 +6696,10 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu
>>>>> *vcpu)
>>>>> pt_guest_enter(vmx);
>>>>> - if (vcpu_to_pmu(vcpu)->version)
>>>>> + if (vcpu_to_pmu(vcpu)->version) {
>>>>> atomic_switch_perf_msrs(vmx);
>>>>> + kvm_x86_ops.pmu_ops->availability_check(vcpu);
>>>>> + }
>>>> AFAICT you just did a call out to the kvm_pmu crud in
>>>> atomic_switch_perf_msrs(), why do another call?
>>> In fact, availability_check() is only called here for just one time.
>>>
>>> The callchain looks like:
>>> - vmx_vcpu_run()
>>> - kvm_x86_ops.pmu_ops->availability_check();
>>> - intel_pmu_availability_check()
>>> - intel_pmu_lbr_is_availabile()
>>> - event_is_oncpu() ...
>>>
>> What I'm saying is that you just did a pmu_ops indirect call in
>> atomic_switch_perf_msrs(), why add another?
> Do you mean the indirect call:
> - atomic_switch_perf_msrs()
> - perf_guest_get_msrs()
> - x86_pmu.guest_get_msrs()
> ?
>
> The two pmu_ops are quite different:
> - the first one in atomic_switch_perf_msrs() is defined in the host side;
> - the second one for availability_check() is defined in the KVM side;
>
> The availability_check() for guest LBR event and MSRs pass-through
> operations are definitely KVM context specific.
Do you still have any concerns on this issue?
If you have more comments on the patchset, please let me know.
>
> Thanks,
> Like Xu
>
Hi Paolo,
On 2020/5/14 16:30, Like Xu wrote:
> Hi Peter,
> Would you mind acking the host perf patches if it looks good to you ?
>
> Hi Paolo,
> Please help review the KVM proposal changes in this stable version.
>
> Now, we can use upstream QEMU w/ '-cpu host' to test this feature, and
> disable it by clearing the LBR format bits in the IA32_PERF_CAPABILITIES.
>
> v10->v11 Changelog:
> - add '.config = INTEL_FIXED_VLBR_EVENT' to the guest LBR event config;
> - rewrite is_guest_lbr_event() with 'config == INTEL_FIXED_VLBR_EVENT';
> - emit pr_warn() on the host when guest LBR is temporarily unavailable;
> - drop the KVM_CAP_X86_GUEST_LBR patch;
> - rewrite MSR_IA32_PERF_CAPABILITIES patch LBR record format;
> - split 'kvm_pmu->lbr_already_available' into a separate patch;
> - split 'pmu_ops->availability_check' into a separate patch;
> - comments and naming refinement, misc;
>
> You may check more details in each commit.
>
> Previous:
> https://lore.kernel.org/kvm/[email protected]/
...
>
> Like Xu (9):
> perf/x86/core: Refactor hw->idx checks and cleanup
> perf/x86/lbr: Add interface to get basic information about LBR stack
> perf/x86: Add constraint to create guest LBR event without hw counter
> perf/x86: Keep LBR stack unchanged in host context for guest LBR event
Do you have a moment to review the KVM changes in this version
to enable guest LBR on most Intel platforms ?
It would be great if I can address most of your comments in the next version.
Thanks,
Like Xu
> KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES for LBR record format
> KVM: x86/pmu: Emulate LBR feature via guest LBR event
> KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism
> KVM: x86/pmu: Check guest LBR availability in case host reclaims them
> KVM: x86/pmu: Reduce the overhead of LBR passthrough or cancellation
>
> Wei Wang (2):
> perf/x86: Fix variable types for LBR registers
> KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
>
> arch/x86/events/core.c | 26 ++-
> arch/x86/events/intel/core.c | 105 ++++++----
> arch/x86/events/intel/lbr.c | 56 +++++-
> arch/x86/events/perf_event.h | 12 +-
> arch/x86/include/asm/kvm_host.h | 13 ++
> arch/x86/include/asm/perf_event.h | 34 +++-
> arch/x86/kvm/cpuid.c | 2 +-
> arch/x86/kvm/pmu.c | 19 +-
> arch/x86/kvm/pmu.h | 15 +-
> arch/x86/kvm/svm/pmu.c | 7 +-
> arch/x86/kvm/vmx/capabilities.h | 15 ++
> arch/x86/kvm/vmx/pmu_intel.c | 320 +++++++++++++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.c | 12 +-
> arch/x86/kvm/vmx/vmx.h | 2 +
> arch/x86/kvm/x86.c | 18 +-
> 15 files changed, 564 insertions(+), 92 deletions(-)
>
The following commit has been merged into the perf/core branch of tip:
Commit-ID: e1ad1ac2deb8f90af9f12ff316989dd5675dec11
Gitweb: https://git.kernel.org/tip/e1ad1ac2deb8f90af9f12ff316989dd5675dec11
Author: Like Xu <[email protected]>
AuthorDate: Sat, 13 Jun 2020 16:09:50 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 02 Jul 2020 15:51:46 +02:00
perf/x86: Keep LBR records unchanged in host context for guest usage
When a guest wants to use the LBR registers, its hypervisor creates a guest
LBR event and let host perf schedules it. The LBR records msrs are
accessible to the guest when its guest LBR event is scheduled on
by the perf subsystem.
Before scheduling this event out, we should avoid host changes on
IA32_DEBUGCTLMSR or LBR_SELECT. Otherwise, some unexpected branch
operations may interfere with guest behavior, pollute LBR records, and even
cause host branches leakage. In addition, the read operation
on host is also avoidable.
To ensure that guest LBR records are not lost during the context switch,
the guest LBR event would enable the callstack mode which could
save/restore guest unread LBR records with the help of
intel_pmu_lbr_sched_task() naturally.
However, the guest LBR_SELECT may changes for its own use and the host
LBR event doesn't save/restore it. To ensure that we doesn't lost the guest
LBR_SELECT value when the guest LBR event is running, the vlbr_constraint
is bound up with a new constraint flag PERF_X86_EVENT_LBR_SELECT.
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/intel/core.c | 6 ++++--
arch/x86/events/intel/lbr.c | 31 ++++++++++++++++++++++++++-----
arch/x86/events/perf_event.h | 3 +++
3 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 51e1fba..582ddff 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2189,7 +2189,8 @@ static void intel_pmu_disable_event(struct perf_event *event)
} else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
intel_pmu_disable_bts();
intel_pmu_drain_bts_buffer();
- }
+ } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ intel_clear_masks(event, idx);
/*
* Needs to be called after x86_pmu_disable_event,
@@ -2271,7 +2272,8 @@ static void intel_pmu_enable_event(struct perf_event *event)
if (!__this_cpu_read(cpu_hw_events.enabled))
return;
intel_pmu_enable_bts(hwc->config);
- }
+ } else if (idx == INTEL_PMC_IDX_FIXED_VLBR)
+ intel_set_masks(event, idx);
}
static void intel_pmu_add_event(struct perf_event *event)
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index d285d26..d03de75 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,9 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+ if (cpuc->lbr_select)
+ wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}
static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -415,6 +418,9 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
cpuc->last_task_ctx = task_ctx;
cpuc->last_log_id = ++task_ctx->log_id;
+
+ if (cpuc->lbr_select)
+ rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}
void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
@@ -485,6 +491,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;
+ if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
+ cpuc->lbr_select = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;
if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -532,6 +541,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}
+ if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
+ cpuc->lbr_select = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -540,11 +552,19 @@ void intel_pmu_lbr_del(struct perf_event *event)
perf_sched_cb_dec(event->ctx->pmu);
}
+static inline bool vlbr_exclude_host(void)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ return test_bit(INTEL_PMC_IDX_FIXED_VLBR,
+ (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+}
+
void intel_pmu_lbr_enable_all(bool pmi)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !vlbr_exclude_host())
__intel_pmu_lbr_enable(pmi);
}
@@ -552,7 +572,7 @@ void intel_pmu_lbr_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !vlbr_exclude_host())
__intel_pmu_lbr_disable();
}
@@ -694,7 +714,8 @@ void intel_pmu_lbr_read(void)
* This could be smarter and actually check the event,
* but this simple approach seems to work for now.
*/
- if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+ if (!cpuc->lbr_users || vlbr_exclude_host() ||
+ cpuc->lbr_users == cpuc->lbr_pebs_users)
return;
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -1365,5 +1386,5 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
struct event_constraint vlbr_constraint =
- FIXED_EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT,
- (INTEL_PMC_IDX_FIXED_VLBR - INTEL_PMC_IDX_FIXED));
+ __EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT, (1ULL << INTEL_PMC_IDX_FIXED_VLBR),
+ FIXED_EVENT_FLAGS, 1, 0, PERF_X86_EVENT_LBR_SELECT);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 77a6dd6..8147596 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -78,6 +78,7 @@ static inline bool constraint_match(struct event_constraint *c, u64 ecode)
#define PERF_X86_EVENT_LARGE_PEBS 0x0400 /* use large PEBS */
#define PERF_X86_EVENT_PEBS_VIA_PT 0x0800 /* use PT buffer for PEBS */
#define PERF_X86_EVENT_PAIR 0x1000 /* Large Increment per Cycle */
+#define PERF_X86_EVENT_LBR_SELECT 0x2000 /* Save/Restore MSR_LBR_SELECT */
struct amd_nb {
int nb_id; /* NorthBridge id */
@@ -237,6 +238,7 @@ struct cpu_hw_events {
u64 br_sel;
struct x86_perf_task_context *last_task_ctx;
int last_log_id;
+ int lbr_select;
/*
* Intel host/guest exclude bits
@@ -722,6 +724,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+ u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
The following commit has been merged into the perf/core branch of tip:
Commit-ID: 097e4311cda952dfb047f2a49d35aa5de500d474
Gitweb: https://git.kernel.org/tip/097e4311cda952dfb047f2a49d35aa5de500d474
Author: Like Xu <[email protected]>
AuthorDate: Sat, 13 Jun 2020 16:09:49 +08:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Thu, 02 Jul 2020 15:51:46 +02:00
perf/x86: Add constraint to create guest LBR event without hw counter
The hypervisor may request the perf subsystem to schedule a time window
to directly access the LBR records msrs for its own use. Normally, it would
create a guest LBR event with callstack mode enabled, which is scheduled
along with other ordinary LBR events on the host but in an exclusive way.
To avoid wasting a counter for the guest LBR event, the perf tracks its
hw->idx via INTEL_PMC_IDX_FIXED_VLBR and assigns it with a fake VLBR
counter with the help of new vlbr_constraint. As with the BTS event,
there is actually no hardware counter assigned for the guest LBR event.
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 18 ++++++++++++++++++
arch/x86/events/intel/lbr.c | 4 ++++
arch/x86/events/perf_event.h | 1 +
arch/x86/include/asm/perf_event.h | 22 +++++++++++++++++++++-
5 files changed, 45 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 15cb7af..d740c86 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1104,6 +1104,7 @@ static inline void x86_assign_hw_event(struct perf_event *event,
switch (hwc->idx) {
case INTEL_PMC_IDX_FIXED_BTS:
+ case INTEL_PMC_IDX_FIXED_VLBR:
hwc->config_base = 0;
hwc->event_base = 0;
break;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8dac4c6..51e1fba 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2621,6 +2621,20 @@ intel_bts_constraints(struct perf_event *event)
return NULL;
}
+/*
+ * Note: matches a fake event, like Fixed2.
+ */
+static struct event_constraint *
+intel_vlbr_constraints(struct perf_event *event)
+{
+ struct event_constraint *c = &vlbr_constraint;
+
+ if (unlikely(constraint_match(c, event->hw.config)))
+ return c;
+
+ return NULL;
+}
+
static int intel_alt_er(int idx, u64 config)
{
int alt_idx = idx;
@@ -2811,6 +2825,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
{
struct event_constraint *c;
+ c = intel_vlbr_constraints(event);
+ if (c)
+ return c;
+
c = intel_bts_constraints(event);
if (c)
return c;
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 2ed3f2a..d285d26 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1363,3 +1363,7 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
return 0;
}
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
+
+struct event_constraint vlbr_constraint =
+ FIXED_EVENT_CONSTRAINT(INTEL_FIXED_VLBR_EVENT,
+ (INTEL_PMC_IDX_FIXED_VLBR - INTEL_PMC_IDX_FIXED));
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index eb37f6c..77a6dd6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -990,6 +990,7 @@ void release_ds_buffers(void);
void reserve_ds_buffers(void);
extern struct event_constraint bts_constraint;
+extern struct event_constraint vlbr_constraint;
void intel_pmu_enable_bts(u64 config);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5d2c30f..2df7073 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,10 +192,30 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)
/*
+ * We model guest LBR event tracing as another fixed-mode PMC like BTS.
+ *
+ * We choose bit 58 because it's used to indicate LBR stack frozen state
+ * for architectural perfmon v4, also we unconditionally mask that bit in
+ * the handle_pmi_common(), so it'll never be set in the overflow handling.
+ *
+ * With this fake counter assigned, the guest LBR event user (such as KVM),
+ * can program the LBR registers on its own, and we don't actually do anything
+ * with then in the host context.
+ */
+#define INTEL_PMC_IDX_FIXED_VLBR (GLOBAL_STATUS_LBRS_FROZEN_BIT)
+
+/*
+ * Pseudo-encoding the guest LBR event as event=0x00,umask=0x1b,
+ * since it would claim bit 58 which is effectively Fixed26.
+ */
+#define INTEL_FIXED_VLBR_EVENT 0x1b00
+
+/*
* Adaptive PEBS v4
*/