2020-04-23 08:19:25

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 00/11] Guest Last Branch Recording Enabling

Hi all,

Please help review your interesting parts in this stable version,
e.g. the first five patches involve the perf event subsystem
and the sixth patch concerns the KVM userspace interface.

v9->v10 Changelog:
- new patch (0002) to refactor hw->idx checks and cleanup for host perf;
- refine comments in guest LBR constraint patch and rename functions;
- still ack LBRS_FROZEN for guest LBR event in the intel_pmu_ack_status();
- add more checks before enabling LBR in the kvm_vm_ioctl_enable_cap();
- add pmu_ops->deliver_pmi to clear LBR enable bit for guest debugctl_msr;
- use vmx_supported_perf_capabilities() to expose PDCM via kvm_cpu_cap*();

You may check more details in each commit.

Previous:
https://lore.kernel.org/kvm/[email protected]/

---

The last branch recording (LBR) is a performance monitor unit (PMU)
feature on Intel processors that records a running trace of the most
recent branches taken by the processor in the LBR stack. This patch
series is going to enable this feature for plenty of KVM guests.

The userspace could configure whether it's enabled or not for each
guest via vm_ioctl KVM_CAP_X86_GUEST_LBR. As a first step, a guest
could only enable LBR feature if its cpu model is the same as the
host since the LBR feature is still one of model specific features.

If it's enabled on the guest, the guest LBR driver would accesses the
LBR MSR (including IA32_DEBUGCTLMSR and records MSRs) as host does.
The first guest access on the LBR related MSRs is always interceptible.
The KVM trap would create a special LBR event (called guest LBR event)
which enables the callstack mode and none of hardware counter is assigned.
The host perf would enable and schedule this event as usual.

Guest's first access to a LBR registers gets trapped to KVM, which
creates a guest LBR perf event. It's a regular LBR perf event which gets
the LBR facility assigned from the perf subsystem. Once that succeeds,
the LBR stack msrs are passed through to the guest for efficient accesses.
However, if another host LBR event comes in and takes over the LBR
facility, the LBR msrs will be made interceptible, and guest following
accesses to the LBR msrs will be trapped and meaningless.

Because saving/restoring tens of LBR MSRs (e.g. 32 LBR stack entries) in
VMX transition brings too excessive overhead to frequent vmx transition
itself, the guest LBR event would help save/restore the LBR stack msrs
during the context switching with the help of native LBR event callstack
mechanism, including LBR_SELECT msr.

If the guest no longer accesses the LBR-related MSRs within a scheduling
time slice and the LBR enable bit is unset, vPMU would release its guest
LBR event as a normal event of a unused vPMC and the pass-through
state of the LBR stack msrs would be canceled.

---

LBR testcase:
echo 1 > /proc/sys/kernel/watchdog
echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
echo 5000 > /proc/sys/kernel/perf_event_max_sample_rate
echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
./perf record -b ./br_instr a

- Perf report on the host:
Samples: 72K of event 'cycles', Event count (approx.): 72512
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
12.12% br_instr br_instr [.] cmp_end [.] lfsr_cond 1
11.05% br_instr br_instr [.] lfsr_cond [.] cmp_end 5
8.81% br_instr br_instr [.] lfsr_cond [.] cmp_end 4
5.04% br_instr br_instr [.] cmp_end [.] lfsr_cond 20
4.92% br_instr br_instr [.] lfsr_cond [.] cmp_end 6
4.88% br_instr br_instr [.] cmp_end [.] lfsr_cond 6
4.58% br_instr br_instr [.] cmp_end [.] lfsr_cond 5

- Perf report on the guest:
Samples: 92K of event 'cycles', Event count (approx.): 92544
Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
12.03% br_instr br_instr [.] cmp_end [.] lfsr_cond 1
11.09% br_instr br_instr [.] lfsr_cond [.] cmp_end 5
8.57% br_instr br_instr [.] lfsr_cond [.] cmp_end 4
5.08% br_instr br_instr [.] lfsr_cond [.] cmp_end 6
5.06% br_instr br_instr [.] cmp_end [.] lfsr_cond 20
4.87% br_instr br_instr [.] cmp_end [.] lfsr_cond 6
4.70% br_instr br_instr [.] cmp_end [.] lfsr_cond 5

Conclusion: the profiling results on the guest are similar to that on the host.

Like Xu (8):
perf/x86/core: Refactor hw->idx checks and cleanup
perf/x86/lbr: Add interface to get basic information about LBR stack
perf/x86: Add constraint to create guest LBR event without hw counter
perf/x86: Keep LBR stack unchanged in host context for guest LBR event
KVM: x86: Add KVM_CAP_X86_GUEST_LBR to dis/enable LBR from user-space
KVM: x86/pmu: Add LBR feature emulation via guest LBR event
KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism
KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES for LBR record format

Wei Wang (3):
perf/x86: Fix variable type for LBR registers
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: Remove the common trap handler of the MSR_IA32_DEBUGCTLMSR

Documentation/virt/kvm/api.rst | 28 +++
arch/x86/events/core.c | 26 ++-
arch/x86/events/intel/core.c | 102 ++++++----
arch/x86/events/intel/lbr.c | 56 +++++-
arch/x86/events/perf_event.h | 18 +-
arch/x86/include/asm/kvm_host.h | 14 ++
arch/x86/include/asm/perf_event.h | 28 ++-
arch/x86/kvm/pmu.c | 27 ++-
arch/x86/kvm/pmu.h | 17 +-
arch/x86/kvm/svm/pmu.c | 7 +-
arch/x86/kvm/vmx/capabilities.h | 15 ++
arch/x86/kvm/vmx/pmu_intel.c | 301 ++++++++++++++++++++++++++++--
arch/x86/kvm/vmx/vmx.c | 11 +-
arch/x86/kvm/vmx/vmx.h | 2 +
arch/x86/kvm/x86.c | 34 ++--
include/linux/perf_event.h | 7 +
include/uapi/linux/kvm.h | 1 +
kernel/events/core.c | 7 -
18 files changed, 603 insertions(+), 98 deletions(-)

--
2.21.1


2020-04-23 08:19:32

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 01/11] perf/x86: Fix variable type for LBR registers

From: Wei Wang <[email protected]>

The msr variable type can be 'unsigned int', which uses less memory than
the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't
be a negative number, so make it 'unsigned int' as well.

Cc: Peter Zijlstra (Intel) <[email protected]>
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/events/perf_event.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f1cd1ca1a77b..1025bc6eb04f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -672,8 +672,8 @@ struct x86_pmu {
/*
* Intel LBR
*/
- unsigned long lbr_tos, lbr_from, lbr_to; /* MSR base regs */
- int lbr_nr; /* hardware stack size */
+ unsigned int lbr_tos, lbr_from, lbr_to,
+ lbr_nr; /* LBR base regs and size */
u64 lbr_sel_mask; /* LBR_SELECT valid bits */
const int *lbr_sel_map; /* lbr_select mappings */
bool lbr_double_abort; /* duplicated lbr aborts */
--
2.21.1

2020-04-23 08:19:53

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 04/11] perf/x86: Add constraint to create guest LBR event without hw counter

The hypervisor may request the perf subsystem to schedule a time window
to directly access the LBR stack msrs for its own use. Normally, it would
create a guest LBR event with callstack mode enabled, which is scheduled
along with other ordinary LBR events on the host but in an exclusive way.

To avoid wasting a counter for the guest LBR event, the perf tracks it via
needs_guest_lbr_without_counter() and assigns it with a fake VLBR counter
with the help of new lbr_without_counter_constraint. As with the BTS event,
there is actually no hardware counter assigned for the guest LBR event.

Cc: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 17 +++++++++++++++++
arch/x86/events/intel/lbr.c | 3 +++
arch/x86/events/perf_event.h | 12 ++++++++++++
arch/x86/include/asm/perf_event.h | 16 +++++++++++++++-
include/linux/perf_event.h | 7 +++++++
kernel/events/core.c | 7 -------
7 files changed, 55 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f7a259dcbb06..2405926e2dba 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1104,6 +1104,7 @@ static inline void x86_assign_hw_event(struct perf_event *event,

switch (hwc->idx) {
case INTEL_PMC_IDX_FIXED_BTS:
+ case INTEL_PMC_IDX_FIXED_VLBR:
hwc->config_base = 0;
hwc->event_base = 0;
break;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f1439acbf7e6..fe5595275368 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2621,6 +2621,19 @@ intel_bts_constraints(struct perf_event *event)
return NULL;
}

+/*
+ * Note, the event that satisfies this constraint will not be assigned
+ * with a normal hardware counter but a fake one like BTS event.
+ */
+static struct event_constraint *
+intel_guest_lbr_constraints(struct perf_event *event)
+{
+ if (unlikely(needs_guest_lbr_without_counter(event)))
+ return &lbr_without_counter_constraint;
+
+ return NULL;
+}
+
static int intel_alt_er(int idx, u64 config)
{
int alt_idx = idx;
@@ -2811,6 +2824,10 @@ __intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
{
struct event_constraint *c;

+ c = intel_guest_lbr_constraints(event);
+ if (c)
+ return c;
+
c = intel_bts_constraints(event);
if (c)
return c;
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 6c60dcaaaf69..2fca4aff7621 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1363,3 +1363,6 @@ int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
return 0;
}
EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
+
+struct event_constraint lbr_without_counter_constraint =
+ EVENT_CONSTRAINT(0, 1ULL << INTEL_PMC_IDX_FIXED_VLBR, 0);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1025bc6eb04f..e6e8c626ed00 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -969,6 +969,17 @@ static inline bool intel_pmu_has_bts(struct perf_event *event)
return intel_pmu_has_bts_period(event, hwc->sample_period);
}

+/*
+ * It's safe to generate an event with attr.exclude_host set and also
+ * using LBR to profile guest for other in-kernel users because the
+ * intel_guest_lbr_constraints() makes LBR registers to be used exclusively.
+ */
+static inline bool needs_guest_lbr_without_counter(struct perf_event *event)
+{
+ return needs_branch_stack(event) && is_kernel_event(event) &&
+ event->attr.exclude_host;
+}
+
int intel_pmu_save_and_restart(struct perf_event *event);

struct event_constraint *
@@ -989,6 +1000,7 @@ void release_ds_buffers(void);
void reserve_ds_buffers(void);

extern struct event_constraint bts_constraint;
+extern struct event_constraint lbr_without_counter_constraint;

void intel_pmu_enable_bts(u64 config);

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 5071515f6b0f..7be581027ebb 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,9 +192,23 @@ struct x86_pmu_capability {
#define GLOBAL_STATUS_UNC_OVF BIT_ULL(61)
#define GLOBAL_STATUS_ASIF BIT_ULL(60)
#define GLOBAL_STATUS_COUNTERS_FROZEN BIT_ULL(59)
-#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(58)
+#define GLOBAL_STATUS_LBRS_FROZEN_BIT 58
+#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(55)

+/*
+ * We model guest LBR event tracing as another fixed-mode PMC like BTS.
+ *
+ * We choose bit 58 because it's used to indicate LBR stack frozen state
+ * for architectural perfmon v4, also we unconditionally mask that bit in
+ * the handle_pmi_common(), so it'll never be set in the overflow handling.
+ *
+ * With this fake counter assigned, the guest LBR event user (such as KVM),
+ * can program the LBR registers on its own, and we don't actually do anything
+ * with then in the host context.
+ */
+#define INTEL_PMC_IDX_FIXED_VLBR GLOBAL_STATUS_LBRS_FROZEN_BIT
+
/*
* Adaptive PEBS v4
*/
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8768a39b5258..e25930fa526c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1101,6 +1101,13 @@ static inline bool is_sampling_event(struct perf_event *event)
return event->attr.sample_period != 0;
}

+#define TASK_TOMBSTONE ((void *)-1L)
+
+static inline bool is_kernel_event(struct perf_event *event)
+{
+ return READ_ONCE(event->owner) == TASK_TOMBSTONE;
+}
+
/*
* Return 1 for a software event, 0 for a hardware event
*/
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e1459df73043..5ea22596ede2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -165,13 +165,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
raw_spin_unlock(&cpuctx->ctx.lock);
}

-#define TASK_TOMBSTONE ((void *)-1L)
-
-static bool is_kernel_event(struct perf_event *event)
-{
- return READ_ONCE(event->owner) == TASK_TOMBSTONE;
-}
-
/*
* On task ctx scheduling...
*
--
2.21.1

2020-04-23 08:20:10

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 09/11] KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism

The vPMU uses INTEL_PMC_IDX_FIXED_VLBR (bit 58) in 'pmu->pmc_in_use' to
indicate whether a guest LBR event is still needed by the vcpu. If the vcpu
no longer accesses LBR related registers within a scheduling time slice,
and the enable bit of LBR has been unset, vPMU will treat the guest LBR
event as a bland event of a vPMC counter and release it as usual. Also the
pass-through state of LBR records msrs is cancelled.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 9 +++++++++
arch/x86/kvm/pmu.h | 4 ++++
arch/x86/kvm/vmx/pmu_intel.c | 12 +++++++++++-
3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5776d305e254..7dad899850bb 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -452,6 +452,12 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
}

+void kvm_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
+{
+ if (kvm_x86_ops.pmu_ops->lbr_cleanup)
+ kvm_x86_ops.pmu_ops->lbr_cleanup(vcpu);
+}
+
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
@@ -470,6 +476,9 @@ void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)

if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
pmc_stop_counter(pmc);
+
+ if (i == KVM_PMU_LBR_IN_USE_IDX && pmu->lbr_event)
+ kvm_pmu_lbr_cleanup(vcpu);
}

bitmap_zero(pmu->pmc_in_use, X86_PMC_IDX_MAX);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 594642ab2575..008105051114 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -15,6 +15,9 @@
#define VMWARE_BACKDOOR_PMC_REAL_TIME 0x10001
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002

+/* Indicates whether LBR msrs were accessed during the last time slice. */
+#define KVM_PMU_LBR_IN_USE_IDX INTEL_PMC_IDX_FIXED_VLBR
+
struct kvm_event_hw_type_mapping {
u8 eventsel;
u8 unit_mask;
@@ -40,6 +43,7 @@ struct kvm_pmu_ops {
bool (*lbr_setup)(struct kvm_vcpu *vcpu);
void (*deliver_pmi)(struct kvm_vcpu *vcpu);
void (*availability_check)(struct kvm_vcpu *vcpu);
+ void (*lbr_cleanup)(struct kvm_vcpu *vcpu);
};

static inline bool event_is_oncpu(struct perf_event *event)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index bb8e4dccbb18..37088bbcde7f 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -282,7 +282,7 @@ static void intel_pmu_free_lbr_event(struct kvm_vcpu *vcpu)
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct perf_event *event = pmu->lbr_event;

- if (!event)
+ if (unlikely(!event))
return;

perf_event_release_kernel(event);
@@ -320,6 +320,7 @@ static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
msr_info->data = 0;
local_irq_enable();

+ __set_bit(KVM_PMU_LBR_IN_USE_IDX, pmu->pmc_in_use);
return true;
}

@@ -411,6 +412,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
vmcs_write64(GUEST_IA32_DEBUGCTL, data);
if (!pmu->lbr_event)
intel_pmu_create_lbr_event(vcpu);
+ __set_bit(KVM_PMU_LBR_IN_USE_IDX, pmu->pmc_in_use);
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
@@ -505,6 +507,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
0, pmu->nr_arch_gp_counters);
bitmap_set(pmu->all_valid_pmc_idx,
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
+ bitmap_set(pmu->all_valid_pmc_idx, KVM_PMU_LBR_IN_USE_IDX, 1);

nested_vmx_pmu_entry_exit_ctls_update(vcpu);
}
@@ -650,6 +653,12 @@ static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
intel_pmu_lbr_availability_check(vcpu);
}

+static void intel_pmu_cleanup_lbr(struct kvm_vcpu *vcpu)
+{
+ if (!(vmcs_read64(GUEST_IA32_DEBUGCTL) & DEBUGCTLMSR_LBR))
+ intel_pmu_free_lbr_event(vcpu);
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -667,4 +676,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.lbr_setup = intel_pmu_lbr_setup,
.deliver_pmi = intel_pmu_deliver_pmi,
.availability_check = intel_pmu_availability_check,
+ .lbr_cleanup = intel_pmu_cleanup_lbr,
};
--
2.21.1

2020-04-23 08:20:13

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 02/11] perf/x86/core: Refactor hw->idx checks and cleanup

For intel_pmu_en/disable_event(), reorder the branches checks for
hw->idx and make them sorted by probability: gp,fixed,bts,others.

Clean up the x86_assign_hw_event() by converting multiple if-else
statements to a switch statement.

To skip x86_perf_event_update() and x86_perf_event_set_period(),
it's generic to replace "idx == INTEL_PMC_IDX_FIXED_BTS" check with
'!hwc->event_base' because that should be 0 for all non-gp/fixed cases.

Wrap related bit operations into intel_set/clear_masks() and
make the main path more cleaner and readable.

No functional changes.

Cc: Kan Liang <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Original-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/core.c | 25 +++++++----
arch/x86/events/intel/core.c | 85 +++++++++++++++++++-----------------
2 files changed, 62 insertions(+), 48 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index a619763e96e1..f7a259dcbb06 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -71,10 +71,9 @@ u64 x86_perf_event_update(struct perf_event *event)
struct hw_perf_event *hwc = &event->hw;
int shift = 64 - x86_pmu.cntval_bits;
u64 prev_raw_count, new_raw_count;
- int idx = hwc->idx;
u64 delta;

- if (idx == INTEL_PMC_IDX_FIXED_BTS)
+ if (unlikely(!hwc->event_base))
return 0;

/*
@@ -1097,22 +1096,30 @@ static inline void x86_assign_hw_event(struct perf_event *event,
struct cpu_hw_events *cpuc, int i)
{
struct hw_perf_event *hwc = &event->hw;
+ int idx;

- hwc->idx = cpuc->assign[i];
+ idx = hwc->idx = cpuc->assign[i];
hwc->last_cpu = smp_processor_id();
hwc->last_tag = ++cpuc->tags[i];

- if (hwc->idx == INTEL_PMC_IDX_FIXED_BTS) {
+ switch (hwc->idx) {
+ case INTEL_PMC_IDX_FIXED_BTS:
hwc->config_base = 0;
hwc->event_base = 0;
- } else if (hwc->idx >= INTEL_PMC_IDX_FIXED) {
+ break;
+
+ case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS-1:
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
- hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - INTEL_PMC_IDX_FIXED);
- hwc->event_base_rdpmc = (hwc->idx - INTEL_PMC_IDX_FIXED) | 1<<30;
- } else {
+ hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 +
+ (idx - INTEL_PMC_IDX_FIXED);
+ hwc->event_base_rdpmc = (idx - INTEL_PMC_IDX_FIXED) | 1<<30;
+ break;
+
+ default:
hwc->config_base = x86_pmu_config_addr(hwc->idx);
hwc->event_base = x86_pmu_event_addr(hwc->idx);
hwc->event_base_rdpmc = x86_pmu_rdpmc_index(hwc->idx);
+ break;
}
}

@@ -1233,7 +1240,7 @@ int x86_perf_event_set_period(struct perf_event *event)
s64 period = hwc->sample_period;
int ret = 0, idx = hwc->idx;

- if (idx == INTEL_PMC_IDX_FIXED_BTS)
+ if (unlikely(!hwc->event_base))
return 0;

/*
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 332954cccece..f1439acbf7e6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2136,8 +2136,35 @@ static inline void intel_pmu_ack_status(u64 ack)
wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
}

-static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
+static inline bool event_is_checkpointed(struct perf_event *event)
+{
+ return unlikely(event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
+}
+
+static inline void intel_set_masks(struct perf_event *event, int idx)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ if (event->attr.exclude_host)
+ __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+ if (event->attr.exclude_guest)
+ __set_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
+ if (event_is_checkpointed(event))
+ __set_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
+}
+
+static inline void intel_clear_masks(struct perf_event *event, int idx)
{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+ __clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_guest_mask);
+ __clear_bit(idx, (unsigned long *)&cpuc->intel_ctrl_host_mask);
+ __clear_bit(idx, (unsigned long *)&cpuc->intel_cp_status);
+}
+
+static void intel_pmu_disable_fixed(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx - INTEL_PMC_IDX_FIXED;
u64 ctrl_val, mask;

@@ -2148,31 +2175,22 @@ static void intel_pmu_disable_fixed(struct hw_perf_event *hwc)
wrmsrl(hwc->config_base, ctrl_val);
}

-static inline bool event_is_checkpointed(struct perf_event *event)
-{
- return (event->hw.config & HSW_IN_TX_CHECKPOINTED) != 0;
-}
-
static void intel_pmu_disable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ int idx = hwc->idx;

- if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
+ if (idx < INTEL_PMC_IDX_FIXED) {
+ intel_clear_masks(event, idx);
+ x86_pmu_disable_event(event);
+ } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
+ intel_clear_masks(event, idx);
+ intel_pmu_disable_fixed(event);
+ } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
intel_pmu_disable_bts();
intel_pmu_drain_bts_buffer();
- return;
}

- cpuc->intel_ctrl_guest_mask &= ~(1ull << hwc->idx);
- cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
- cpuc->intel_cp_status &= ~(1ull << hwc->idx);
-
- if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
- intel_pmu_disable_fixed(hwc);
- else
- x86_pmu_disable_event(event);
-
/*
* Needs to be called after x86_pmu_disable_event,
* so we don't trigger the event without PEBS bit set.
@@ -2238,33 +2256,22 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
static void intel_pmu_enable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
-
- if (unlikely(hwc->idx == INTEL_PMC_IDX_FIXED_BTS)) {
- if (!__this_cpu_read(cpu_hw_events.enabled))
- return;
-
- intel_pmu_enable_bts(hwc->config);
- return;
- }
-
- if (event->attr.exclude_host)
- cpuc->intel_ctrl_guest_mask |= (1ull << hwc->idx);
- if (event->attr.exclude_guest)
- cpuc->intel_ctrl_host_mask |= (1ull << hwc->idx);
-
- if (unlikely(event_is_checkpointed(event)))
- cpuc->intel_cp_status |= (1ull << hwc->idx);
+ int idx = hwc->idx;

if (unlikely(event->attr.precise_ip))
intel_pmu_pebs_enable(event);

- if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
+ if (idx < INTEL_PMC_IDX_FIXED) {
+ intel_set_masks(event, idx);
+ __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
+ } else if (idx < INTEL_PMC_IDX_FIXED_BTS) {
+ intel_set_masks(event, idx);
intel_pmu_enable_fixed(event);
- return;
+ } else if (idx == INTEL_PMC_IDX_FIXED_BTS) {
+ if (!__this_cpu_read(cpu_hw_events.enabled))
+ return;
+ intel_pmu_enable_bts(hwc->config);
}
-
- __x86_pmu_enable_event(hwc, ARCH_PERFMON_EVENTSEL_ENABLE);
}

static void intel_pmu_add_event(struct perf_event *event)
--
2.21.1

2020-04-23 08:20:22

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 10/11] KVM: x86: Expose MSR_IA32_PERF_CAPABILITIES for LBR record format

The MSR_IA32_PERF_CAPABILITIES is a read only MSR that enumerates the
existence of performance monitoring features. Bits [0, 5] of it tells
about the LBR format of the branch record addresses stored in the LBR
stack. Expose those bits to the guest when the guest LBR is enabled.

Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/capabilities.h | 15 +++++++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 13 +++++++++++++
arch/x86/kvm/vmx/vmx.c | 2 ++
4 files changed, 31 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f73c9b789bff..137097981180 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -487,6 +487,7 @@ struct kvm_pmu {
u64 global_ctrl_mask;
u64 global_ovf_ctrl_mask;
u64 reserved_bits;
+ u64 perf_capabilities;
u8 version;
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 8903475f751e..be61cd5bce0c 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -367,4 +367,19 @@ static inline bool vmx_pt_mode_is_host_guest(void)
return pt_mode == PT_MODE_HOST_GUEST;
}

+#define PERF_CAP_LBR_FMT 0x3f
+
+static inline u64 vmx_supported_perf_capabilities(void)
+{
+ u64 perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
+
+ /* Currently, KVM only supports LBR. */
+ perf_cap &= PERF_CAP_LBR_FMT;
+
+ return perf_cap;
+}
+
#endif /* __KVM_X86_VMX_CAPS_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 37088bbcde7f..c64c53bdc77d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -182,6 +182,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_DEBUGCTLMSR:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PERF_CAPABILITIES:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_PDCM);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -346,6 +349,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DEBUGCTLMSR:
msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
return 0;
+ case MSR_IA32_PERF_CAPABILITIES:
+ msr_info->data = pmu->perf_capabilities;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -414,6 +420,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
intel_pmu_create_lbr_event(vcpu);
__set_bit(KVM_PMU_LBR_IN_USE_IDX, pmu->pmc_in_use);
return 0;
+ case MSR_IA32_PERF_CAPABILITIES:
+ return 1; /* RO MSR */
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (!msr_info->host_initiated)
@@ -458,6 +466,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
vcpu->kvm->arch.lbr_in_guest = false;
+ pmu->perf_capabilities = 0;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -470,6 +479,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
return;

perf_get_x86_pmu_capability(&x86_pmu);
+ pmu->perf_capabilities = vmx_supported_perf_capabilities();

pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
x86_pmu.num_counters_gp);
@@ -497,6 +507,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl_mask &=
~MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI;

+ if (!vcpu->kvm->arch.lbr_in_guest)
+ pmu->perf_capabilities &= ~PERF_CAP_LBR_FMT;
+
entry = kvm_find_cpuid_entry(vcpu, 7, 0);
if (entry &&
(boot_cpu_has(X86_FEATURE_HLE) || boot_cpu_has(X86_FEATURE_RTM)) &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 31c294b2d941..ae2cb7967018 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7201,6 +7201,8 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);
if (vmx_pt_mode_is_host_guest())
kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
+ if (vmx_supported_perf_capabilities())
+ kvm_cpu_cap_check_and_set(X86_FEATURE_PDCM);

/* PKU is not yet implemented for shadow paging. */
if (enable_ept && boot_cpu_has(X86_FEATURE_OSPKE))
--
2.21.1

2020-04-23 08:21:19

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 07/11] KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in

From: Wei Wang <[email protected]>

Change kvm_pmu_get_msr() to get the msr_data struct, as the host_initiated
field from the struct could be used by get_msr. This also makes this API
consistent with kvm_pmu_set_msr. No functional changes.

Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/kvm/pmu.c | 4 ++--
arch/x86/kvm/pmu.h | 4 ++--
arch/x86/kvm/svm/pmu.c | 7 ++++---
arch/x86/kvm/vmx/pmu_intel.c | 19 +++++++++++--------
arch/x86/kvm/x86.c | 4 ++--
5 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c1f95b2f9559..b24b19ede76a 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -397,9 +397,9 @@ static void kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
__set_bit(pmc->idx, pmu->pmc_in_use);
}

-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
- return kvm_x86_ops.pmu_ops->get_msr(vcpu, msr, data);
+ return kvm_x86_ops.pmu_ops->get_msr(vcpu, msr_info);
}

int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 971da6431d74..19d8e057c0b5 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -32,7 +32,7 @@ struct kvm_pmu_ops {
struct kvm_pmc *(*msr_idx_to_pmc)(struct kvm_vcpu *vcpu, u32 msr);
int (*is_valid_rdpmc_ecx)(struct kvm_vcpu *vcpu, unsigned int idx);
bool (*is_valid_msr)(struct kvm_vcpu *vcpu, u32 msr);
- int (*get_msr)(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+ int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
@@ -148,7 +148,7 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu);
int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
int kvm_pmu_is_valid_rdpmc_ecx(struct kvm_vcpu *vcpu, unsigned int idx);
bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr);
-int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
void kvm_pmu_reset(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/svm/pmu.c b/arch/x86/kvm/svm/pmu.c
index ce0b10fe5e2b..035da07500e8 100644
--- a/arch/x86/kvm/svm/pmu.c
+++ b/arch/x86/kvm/svm/pmu.c
@@ -215,21 +215,22 @@ static struct kvm_pmc *amd_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

-static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;

/* MSR_PERFCTRn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
if (pmc) {
- *data = pmc_read_counter(pmc);
+ msr_info->data = pmc_read_counter(pmc);
return 0;
}
/* MSR_EVNTSELn */
pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
if (pmc) {
- *data = pmc->eventsel;
+ msr_info->data = pmc->eventsel;
return 0;
}

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4056bd114844..5d7d002e5a3e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -184,35 +184,38 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

-static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
+static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc;
+ u32 msr = msr_info->index;

switch (msr) {
case MSR_CORE_PERF_FIXED_CTR_CTRL:
- *data = pmu->fixed_ctr_ctrl;
+ msr_info->data = pmu->fixed_ctr_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_STATUS:
- *data = pmu->global_status;
+ msr_info->data = pmu->global_status;
return 0;
case MSR_CORE_PERF_GLOBAL_CTRL:
- *data = pmu->global_ctrl;
+ msr_info->data = pmu->global_ctrl;
return 0;
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
- *data = pmu->global_ovf_ctrl;
+ msr_info->data = pmu->global_ovf_ctrl;
return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
- *data = val & pmu->counter_bitmask[KVM_PMC_GP];
+ msr_info->data =
+ val & pmu->counter_bitmask[KVM_PMC_GP];
return 0;
} else if ((pmc = get_fixed_pmc(pmu, msr))) {
u64 val = pmc_read_counter(pmc);
- *data = val & pmu->counter_bitmask[KVM_PMC_FIXED];
+ msr_info->data =
+ val & pmu->counter_bitmask[KVM_PMC_FIXED];
return 0;
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
- *data = pmc->eventsel;
+ msr_info->data = pmc->eventsel;
return 0;
}
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b5ce89016eeb..99f819dfcc90 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3083,7 +3083,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
- return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+ return kvm_pmu_get_msr(vcpu, msr_info);
msr_info->data = 0;
break;
case MSR_IA32_UCODE_REV:
@@ -3245,7 +3245,7 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
default:
if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
- return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
+ return kvm_pmu_get_msr(vcpu, msr_info);
if (!ignore_msrs) {
vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
msr_info->index);
--
2.21.1

2020-04-23 08:21:22

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 05/11] perf/x86: Keep LBR stack unchanged in host context for guest LBR event

When a guest wants to use the LBR stack, its hypervisor creates a guest
LBR event and let host perf schedules it. A new 'int guest_lbr_enabled'
field in the 'struct cpu_hw_events', is marked as true when perf adds
a guest LBR event and marked as false on deletion.

The LBR stack msrs are accessible to the guest when its guest LBR event
is scheduled in by the perf subsystem. Before scheduling this event out,
we should avoid host changes on IA32_DEBUGCTLMSR or LBR_SELECT.
Otherwise, some unexpected branch operations may interfere with guest
behavior, pollute LBR records, and even cause host branch data leakage.
In addition, the intel_pmu_lbr_read() on the host is also avoidable.

To ensure that guest LBR records are not lost during the context switch,
the BRANCH_CALL_STACK flag should be configured in the 'branch_sample_type'
for any guest LBR event because the callstack mode could save/restore guest
unread LBR records with the help of intel_pmu_lbr_sched_task() naturally.

However, the regular host LBR perf event doesn't save/restore LBR_SELECT,
because it's configured in the LBR_enable() based on branch_sample_type.
So when a guest LBR is running, the guest LBR_SELECT may changes for its
own use and we have to support LBR_SELECT save/restore to ensure what the
guest LBR_SELECT value doesn't get lost during the context switching.

Cc: Peter Zijlstra (Intel) <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/lbr.c | 33 ++++++++++++++++++++++++++++++---
arch/x86/events/perf_event.h | 2 ++
2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 2fca4aff7621..44bb7db6ce02 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -383,6 +383,15 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)

wrmsrl(x86_pmu.lbr_tos, tos);
task_ctx->lbr_stack_state = LBR_NONE;
+
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * the guest lbr_sel is likely different from event->hw.branch_reg.
+ * Therefore, it’s necessary to save/restore MSR_LBR_SELECT written
+ * by the guest so that it's not lost during the context switch.
+ */
+ if (cpuc->guest_lbr_enabled)
+ wrmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}

static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)
@@ -415,6 +424,9 @@ static void __intel_pmu_lbr_save(struct x86_perf_task_context *task_ctx)

cpuc->last_task_ctx = task_ctx;
cpuc->last_log_id = ++task_ctx->log_id;
+
+ if (cpuc->guest_lbr_enabled)
+ rdmsrl(MSR_LBR_SELECT, task_ctx->lbr_sel);
}

void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
@@ -485,6 +497,9 @@ void intel_pmu_lbr_add(struct perf_event *event)
if (!x86_pmu.lbr_nr)
return;

+ if (needs_guest_lbr_without_counter(event))
+ cpuc->guest_lbr_enabled = 1;
+
cpuc->br_sel = event->hw.branch_reg.reg;

if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data) {
@@ -532,6 +547,9 @@ void intel_pmu_lbr_del(struct perf_event *event)
task_ctx->lbr_callstack_users--;
}

+ if (needs_guest_lbr_without_counter(event))
+ cpuc->guest_lbr_enabled = 0;
+
if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
cpuc->lbr_pebs_users--;
cpuc->lbr_users--;
@@ -544,7 +562,12 @@ void intel_pmu_lbr_enable_all(bool pmi)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

- if (cpuc->lbr_users)
+ /*
+ * When the LBR hardware is scheduled for a guest LBR event,
+ * the guest will dis/enables LBR itself at the appropriate time,
+ * including configuring MSR_LBR_SELECT.
+ */
+ if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
__intel_pmu_lbr_enable(pmi);
}

@@ -552,7 +575,7 @@ void intel_pmu_lbr_disable_all(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);

- if (cpuc->lbr_users)
+ if (cpuc->lbr_users && !cpuc->guest_lbr_enabled)
__intel_pmu_lbr_disable();
}

@@ -693,8 +716,12 @@ void intel_pmu_lbr_read(void)
*
* This could be smarter and actually check the event,
* but this simple approach seems to work for now.
+ *
+ * And there is no need to read LBR record here if a guest LBR
+ * event is using it, because the guest will read them on its own.
*/
- if (!cpuc->lbr_users || cpuc->lbr_users == cpuc->lbr_pebs_users)
+ if (!cpuc->lbr_users || cpuc->guest_lbr_enabled ||
+ cpuc->lbr_users == cpuc->lbr_pebs_users)
return;

if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e6e8c626ed00..cfd0ba25cac6 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -237,6 +237,7 @@ struct cpu_hw_events {
u64 br_sel;
struct x86_perf_task_context *last_task_ctx;
int last_log_id;
+ int guest_lbr_enabled;

/*
* Intel host/guest exclude bits
@@ -721,6 +722,7 @@ struct x86_perf_task_context {
u64 lbr_from[MAX_LBR_ENTRIES];
u64 lbr_to[MAX_LBR_ENTRIES];
u64 lbr_info[MAX_LBR_ENTRIES];
+ u64 lbr_sel;
int tos;
int valid_lbrs;
int lbr_callstack_users;
--
2.21.1

2020-04-23 08:21:25

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 03/11] perf/x86/lbr: Add interface to get basic information about LBR stack

The LBR stack msrs are model specific. The perf subsystem has already
obtained the LBR stack base addresses based on the cpu model.

Therefore, an interface is added to allow callers outside the perf
subsystem to obtain the LBR stack base addresses. It's useful for
hypervisors to emulate the LBR feature for guests with less code.

Cc: Peter Zijlstra (Intel) <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/lbr.c | 20 ++++++++++++++++++++
arch/x86/include/asm/perf_event.h | 12 ++++++++++++
2 files changed, 32 insertions(+)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 65113b16804a..6c60dcaaaf69 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1343,3 +1343,23 @@ void intel_pmu_lbr_init_knl(void)
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_LIP)
x86_pmu.intel_cap.lbr_format = LBR_FORMAT_EIP_FLAGS;
}
+
+/**
+ * x86_perf_get_lbr - get the LBR stack information
+ *
+ * @stack: the caller's memory to store the LBR stack information
+ *
+ * Returns: 0 indicates the LBR stack info has been successfully obtained
+ */
+int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
+{
+ int lbr_fmt = x86_pmu.intel_cap.lbr_format;
+
+ stack->nr = x86_pmu.lbr_nr;
+ stack->from = x86_pmu.lbr_from;
+ stack->to = x86_pmu.lbr_to;
+ stack->info = (lbr_fmt == LBR_FORMAT_INFO) ? MSR_LBR_INFO_0 : 0;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(x86_perf_get_lbr);
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index e855e9cf2c37..5071515f6b0f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -333,6 +333,13 @@ struct perf_guest_switch_msr {
u64 host, guest;
};

+struct x86_pmu_lbr {
+ unsigned int nr;
+ unsigned int from;
+ unsigned int to;
+ unsigned int info;
+};
+
extern void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap);
extern void perf_check_microcode(void);
extern int x86_perf_rdpmc_index(struct perf_event *event);
@@ -348,12 +355,17 @@ static inline void perf_check_microcode(void) { }

#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern int x86_perf_get_lbr(struct x86_pmu_lbr *stack);
#else
static inline struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
{
*nr = 0;
return NULL;
}
+static inline int x86_perf_get_lbr(struct x86_pmu_lbr *stack)
+{
+ return -1;
+}
#endif

#ifdef CONFIG_CPU_SUP_INTEL
--
2.21.1

2020-04-23 08:22:02

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 06/11] KVM: x86: Add KVM_CAP_X86_GUEST_LBR to dis/enable LBR from user-space

The LBR feature is model specific. Introduce KVM_CAP_X86_GUEST_LBR to
control per-VM enablement of the guest LBR feature (disabled by default).

For enable_cap ioctl, the first input parameter is whether LBR feature
should be enabled or not, and the second parameter is the pointer to
the userspace memory to save the LBR records information. If the
second parameter is invalid or the guest/host cpu model doesn't match,
it returns -EINVAL which means the LBR feature cannot be enabled.

For check_extension ioctl, the return value could help userspace calculate
the total size of the complete guest LBR entries for compatibility check.

Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
Documentation/virt/kvm/api.rst | 28 ++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 8 ++++++++
arch/x86/kvm/pmu.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 31 +++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 17 +++++++++++++++++
include/uapi/linux/kvm.h | 1 +
7 files changed, 89 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index efbbe570aa9b..14f8d98c2651 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5802,6 +5802,34 @@ If present, this capability can be enabled for a VM, meaning that KVM
will allow the transition to secure guest mode. Otherwise KVM will
veto the transition.

+7.20 KVM_CAP_X86_GUEST_LBR
+Architectures: x86
+Parameters: args[0] whether LBR feature should be enabled or not,
+ args[1] pointer to the userspace memory to save the LBR records information.
+
+the LBR records information is described by
+struct x86_pmu_lbr {
+ unsigned int nr;
+ unsigned int from;
+ unsigned int to;
+ unsigned int info;
+};
+
+@nr: number of LBR records entries;
+@from: index of the msr that stores a branch source address;
+@to: index of the msr that stores a branch destination address;
+@info: index of the msr that stores LBR related flags, such as misprediction.
+
+Enabling this capability allows guest accesses to the LBR feature. Otherwise,
+#GP will be injected to the guest when it accesses to the LBR registers.
+
+After the feature is enabled, before exiting to userspace, kvm handlers
+would fill the LBR records info into the userspace memory pointed by args[1].
+
+The return value of kvm_vm_ioctl_check_extension for KVM_CAP_X86_GUEST_LBR
+is the size of 'struct x86_pmu_lbr' and userspace could calculate the total
+size of the complete guest LBR entries for functional compatibility check.
+
8. Other capabilities.
======================

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f26df2cb0591..3a4433607773 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -985,6 +985,8 @@ struct kvm_arch {
bool guest_can_read_msr_platform_info;
bool exception_payload_enabled;

+ bool lbr_in_guest;
+ struct x86_pmu_lbr lbr;
struct kvm_pmu_event_filter *pmu_event_filter;
struct task_struct *nx_lpage_recovery_thread;
};
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a5078841bdac..c1f95b2f9559 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -518,3 +518,11 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
kfree(filter);
return r;
}
+
+bool kvm_pmu_lbr_setup(struct kvm_vcpu *vcpu)
+{
+ if (kvm_x86_ops.pmu_ops->lbr_setup)
+ return kvm_x86_ops.pmu_ops->lbr_setup(vcpu);
+
+ return false;
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index a6c78a797cb1..971da6431d74 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -37,6 +37,7 @@ struct kvm_pmu_ops {
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
+ bool (*lbr_setup)(struct kvm_vcpu *vcpu);
};

static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
@@ -155,6 +156,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+bool kvm_pmu_lbr_setup(struct kvm_vcpu *vcpu);

bool is_vmware_backdoor_pmc(u32 pmc_idx);

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 7c857737b438..4056bd114844 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -300,6 +300,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
+ vcpu->kvm->arch.lbr_in_guest = false;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -397,6 +398,35 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}

+static bool intel_pmu_get_lbr(struct kvm_vcpu *vcpu)
+{
+ struct kvm *kvm = vcpu->kvm;
+
+ if (likely(kvm->arch.lbr.nr))
+ return true;
+
+ return !x86_perf_get_lbr(&kvm->arch.lbr);
+}
+
+static bool intel_pmu_lbr_setup(struct kvm_vcpu *vcpu)
+{
+ if (vcpu_to_pmu(vcpu)->version < 2)
+ return false;
+
+ if (!intel_pmu_get_lbr(vcpu))
+ return false;
+
+ /*
+ * As a first step, a guest could only enable LBR feature if its cpu
+ * model is the same as the host because the LBR registers would
+ * be passthrough to the guest and they're model specific.
+ */
+ if (boot_cpu_data.x86_model != guest_cpuid_model(vcpu))
+ return false;
+
+ return true;
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -411,4 +441,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.refresh = intel_pmu_refresh,
.init = intel_pmu_init,
.reset = intel_pmu_reset,
+ .lbr_setup = intel_pmu_lbr_setup,
};
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59958ce2b681..b5ce89016eeb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3391,6 +3391,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_EXCEPTION_PAYLOAD:
r = 1;
break;
+ case KVM_CAP_X86_GUEST_LBR:
+ r = sizeof(struct x86_pmu_lbr);
+ break;
case KVM_CAP_SYNC_REGS:
r = KVM_SYNC_X86_VALID_FIELDS;
break;
@@ -4899,6 +4902,20 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
kvm->arch.exception_payload_enabled = cap->args[0];
r = 0;
break;
+ case KVM_CAP_X86_GUEST_LBR:
+ r = -EINVAL;
+ if (!cap->args[0] || !kvm->vcpus[0])
+ break;
+ if (!kvm_pmu_lbr_setup(kvm->vcpus[0]))
+ break;
+ if (vcpu_to_pmu(kvm->vcpus[0])->version < 2)
+ break;
+ if (copy_to_user((void __user *)cap->args[1],
+ &kvm->arch.lbr, sizeof(struct x86_pmu_lbr)))
+ break;
+ kvm->arch.lbr_in_guest = !!cap->args[0];
+ r = 0;
+ break;
default:
r = -EINVAL;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 428c7dde6b4b..083a3d206f16 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1017,6 +1017,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_S390_VCPU_RESETS 179
#define KVM_CAP_S390_PROTECTED 180
#define KVM_CAP_PPC_SECURE_GUEST 181
+#define KVM_CAP_X86_GUEST_LBR 182

#ifdef KVM_CAP_IRQ_ROUTING

--
2.21.1

2020-04-23 08:22:35

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 11/11] KVM: x86: Remove the common trap handler of the MSR_IA32_DEBUGCTLMSR

From: Wei Wang <[email protected]>

The debugctl msr is not completely identical on AMD and Intel CPUs, for
example, FREEZE_LBRS_ON_PMI is supported by Intel CPUs only. Now, this
msr is handled separately in svm.c and intel_pmu.c. So remove the
common debugctl msr handling code in kvm_get/set_msr_common.

Signed-off-by: Wei Wang <[email protected]>
---
arch/x86/kvm/x86.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 99f819dfcc90..e6dd8525cc60 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2791,18 +2791,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
}
break;
- case MSR_IA32_DEBUGCTLMSR:
- if (!data) {
- /* We support the non-activated case already */
- break;
- } else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
- /* Values other than LBR and BTF are vendor-specific,
- thus reserved and should throw a #GP */
- return 1;
- }
- vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
- __func__, data);
- break;
case 0x200 ... 0x2ff:
return kvm_mtrr_set_msr(vcpu, msr, data);
case MSR_IA32_APICBASE:
@@ -3059,7 +3047,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
switch (msr_info->index) {
case MSR_IA32_PLATFORM_ID:
case MSR_IA32_EBL_CR_POWERON:
- case MSR_IA32_DEBUGCTLMSR:
case MSR_IA32_LASTBRANCHFROMIP:
case MSR_IA32_LASTBRANCHTOIP:
case MSR_IA32_LASTINTFROMIP:
--
2.21.1

2020-04-23 08:22:52

by Like Xu

[permalink] [raw]
Subject: [PATCH v10 08/11] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

VMX transition is much more frequent than vcpu switching, and saving/
restoring tens of LBR msrs (e.g. 32 LBR records entries) brings too much
overhead to the frequent vmx transition itself, which is not necessary.
So the guest LBR records only gets saved/restored on the vcpu context
switching via the help of native LBR event callstack mechanism. Generally,
the LBR-related msrs and its functionality are emulated in this way:

The guest first access on the LBR related msrs (including DEBUGCTLMSR
and records msrs) is always interceptible. The KVM handler would create
a guest LBR event which enables the callstack mode and none of hardware
counter is assigned. The host perf would enable and schedule this event
as usual but in an exclusive way.

When the guest LBR event exists and the LBR stack is available (defined
as 'event->oncpu != -1'), the LBR records msrs access would not be
intercepted but pass-through to the vcpu before vm-entry. This kind
of availability check is always performed before vm-entry, but as late
as possible to avoid reclaiming resources from any higher priority event.
A negative check result would bring the registers interception back, and
it also prevents real registers accesses and potential data leakage.

At this point, vPMU only supports Architecture v2 and the guest needs
to re-enable LBR via DEBUGCTLMSR in the guest PMI handler. The guest
LBR event will be released when the vPMU is reset but soon, the lazy
release mechanism would be applied to this event like a regular vPMC.

Cc: Peter Zijlstra (Intel) <[email protected]>
Suggested-by: Andi Kleen <[email protected]>
Co-developed-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 11 ++
arch/x86/kvm/pmu.c | 6 +-
arch/x86/kvm/pmu.h | 7 +
arch/x86/kvm/vmx/pmu_intel.c | 228 +++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 9 +-
arch/x86/kvm/vmx/vmx.h | 2 +
6 files changed, 256 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a4433607773..f73c9b789bff 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -506,6 +506,17 @@ struct kvm_pmu {
* redundant check before cleanup if guest don't use vPMU at all.
*/
u8 event_count;
+
+ /*
+ * Emulate LBR feature via pass-through LBR registers when the
+ * per-vcpu guest LBR event is scheduled on the current pcpu.
+ *
+ * Note the records could't be trusted if the host reclaims the LBR.
+ */
+ struct perf_event *lbr_event;
+
+ /* A flag to reduce the overhead of LBR pass-through or cancellation. */
+ bool lbr_already_available;
};

struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index b24b19ede76a..5776d305e254 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -378,8 +378,11 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)

void kvm_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
{
- if (lapic_in_kernel(vcpu))
+ if (lapic_in_kernel(vcpu)) {
kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+ if (kvm_x86_ops.pmu_ops->deliver_pmi)
+ kvm_x86_ops.pmu_ops->deliver_pmi(vcpu);
+ }
}

bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
@@ -434,6 +437,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
pmu->event_count = 0;
pmu->need_cleanup = false;
+ pmu->lbr_already_available = false;
kvm_pmu_refresh(vcpu);
}

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 19d8e057c0b5..594642ab2575 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -38,8 +38,15 @@ struct kvm_pmu_ops {
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
bool (*lbr_setup)(struct kvm_vcpu *vcpu);
+ void (*deliver_pmi)(struct kvm_vcpu *vcpu);
+ void (*availability_check)(struct kvm_vcpu *vcpu);
};

+static inline bool event_is_oncpu(struct perf_event *event)
+{
+ return event && event->oncpu != -1;
+}
+
static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
{
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5d7d002e5a3e..bb8e4dccbb18 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -17,6 +17,7 @@
#include "lapic.h"
#include "nested.h"
#include "pmu.h"
+#include "vmx.h"

static struct kvm_event_hw_type_mapping intel_arch_events[] = {
/* Index must match CPUID 0x0A.EBX bit vector */
@@ -150,6 +151,24 @@ static struct kvm_pmc *intel_rdpmc_ecx_to_pmc(struct kvm_vcpu *vcpu,
return &counters[array_index_nospec(idx, num_counters)];
}

+static bool intel_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
+{
+ struct x86_pmu_lbr *stack = &vcpu->kvm->arch.lbr;
+ bool ret = false;
+
+ if (!vcpu->kvm->arch.lbr_in_guest)
+ return ret;
+
+ ret = (index == MSR_LBR_SELECT || index == MSR_LBR_TOS ||
+ (index >= stack->from && index < stack->from + stack->nr) ||
+ (index >= stack->to && index < stack->to + stack->nr));
+
+ if (!ret && stack->info)
+ ret = (index >= stack->info && index < stack->info + stack->nr);
+
+ return ret;
+}
+
static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -160,12 +179,14 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_STATUS:
case MSR_CORE_PERF_GLOBAL_CTRL:
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
+ case MSR_IA32_DEBUGCTLMSR:
ret = pmu->version > 1;
break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
- get_fixed_pmc(pmu, msr);
+ get_fixed_pmc(pmu, msr) ||
+ intel_is_valid_lbr_msr(vcpu, msr);
break;
}

@@ -184,6 +205,124 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

+static int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event;
+
+ /*
+ * The perf_event_attr is constructed in the minimum efficient way:
+ * - set 'pinned = true' to make it task pinned so that if another
+ * cpu pinned event reclaims LBR, the event->oncpu will be set to -1;
+ *
+ * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
+ * 'exclude_host = true' to mark it as a guest LBR event which
+ * indicates host perf to schedule it without but a fake counter,
+ * check is_guest_lbr_event() and intel_guest_event_constraints();
+ *
+ * - set 'branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ * PERF_SAMPLE_BRANCH_USER' to configure it to use callstack mode,
+ * which allocs 'ctx->task_ctx_data' and request host perf subsystem
+ * to save/restore guest LBR records during host context switches,
+ * check branch_user_callstack() and intel_pmu_lbr_sched_task();
+ */
+ struct perf_event_attr attr = {
+ .type = PERF_TYPE_RAW,
+ .size = sizeof(attr),
+ .pinned = true,
+ .exclude_host = true,
+ .sample_type = PERF_SAMPLE_BRANCH_STACK,
+ .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
+ PERF_SAMPLE_BRANCH_USER,
+ };
+
+ if (unlikely(pmu->lbr_event))
+ return 0;
+
+ event = perf_event_create_kernel_counter(&attr, -1,
+ current, NULL, NULL);
+ if (IS_ERR(event)) {
+ pr_debug_ratelimited("%s: failed %ld\n",
+ __func__, PTR_ERR(event));
+ return -ENOENT;
+ }
+ pmu->lbr_event = event;
+ pmu->event_count++;
+ return 0;
+}
+
+/*
+ * "set = true" to make the LBR records registers interceptible,
+ * otherwise passthrough the LBR records registers to the vcpu.
+ */
+static void intel_pmu_intercept_lbr_msrs(struct kvm_vcpu *vcpu, bool set)
+{
+ unsigned long *msr_bitmap = to_vmx(vcpu)->vmcs01.msr_bitmap;
+ struct x86_pmu_lbr *stack = &vcpu->kvm->arch.lbr;
+ int i;
+
+ if (!stack->nr)
+ return;
+
+ for (i = 0; i < stack->nr; i++) {
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->from + i, MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->to + i, MSR_TYPE_RW, set);
+ if (stack->info)
+ vmx_set_intercept_for_msr(msr_bitmap,
+ stack->info + i, MSR_TYPE_RW, set);
+ }
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_SELECT, MSR_TYPE_RW, set);
+ vmx_set_intercept_for_msr(msr_bitmap, MSR_LBR_TOS, MSR_TYPE_RW, set);
+}
+
+static void intel_pmu_free_lbr_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct perf_event *event = pmu->lbr_event;
+
+ if (!event)
+ return;
+
+ perf_event_release_kernel(event);
+ intel_pmu_intercept_lbr_msrs(vcpu, true);
+ pmu->lbr_already_available = false;
+ pmu->event_count--;
+ pmu->lbr_event = NULL;
+}
+
+static bool intel_pmu_access_lbr_msr(struct kvm_vcpu *vcpu,
+ struct msr_data *msr_info, bool read)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ u32 index = msr_info->index;
+
+ if (!intel_is_valid_lbr_msr(vcpu, index))
+ return false;
+
+ if (!pmu->lbr_event)
+ intel_pmu_create_lbr_event(vcpu);
+
+ /*
+ * Disable irq to ensure the LBR feature doesn't get reclaimed by the
+ * host at the time the value is read from the msr, and this avoids the
+ * host LBR value to be leaked to the guest. If LBR has been reclaimed,
+ * return 0 on guest reads.
+ */
+ local_irq_disable();
+ if (event_is_oncpu(pmu->lbr_event)) {
+ if (read)
+ rdmsrl(index, msr_info->data);
+ else
+ wrmsrl(index, msr_info->data);
+ } else if (read)
+ msr_info->data = 0;
+ local_irq_enable();
+
+ return true;
+}
+
static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -203,6 +342,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_DEBUGCTLMSR:
+ msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
u64 val = pmc_read_counter(pmc);
@@ -217,7 +359,8 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
} else if ((pmc = get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0))) {
msr_info->data = pmc->eventsel;
return 0;
- }
+ } else if (intel_pmu_access_lbr_msr(vcpu, msr_info, true))
+ return 0;
}

return 1;
@@ -261,6 +404,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_DEBUGCTLMSR:
+ /* Values other than LBR are reserved and should throw a #GP */
+ if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_FREEZE_LBRS_ON_PMI))
+ return 1;
+ vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+ if (!pmu->lbr_event)
+ intel_pmu_create_lbr_event(vcpu);
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
if (!msr_info->host_initiated)
@@ -283,7 +434,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
reprogram_gp_counter(pmc, data);
return 0;
}
- }
+ } else if (intel_pmu_access_lbr_msr(vcpu, msr_info, false))
+ return 0;
}

return 1;
@@ -397,6 +549,8 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmc->counter = 0;
}

+ intel_pmu_free_lbr_event(vcpu);
+
pmu->fixed_ctr_ctrl = pmu->global_ctrl = pmu->global_status =
pmu->global_ovf_ctrl = 0;
}
@@ -430,6 +584,72 @@ static bool intel_pmu_lbr_setup(struct kvm_vcpu *vcpu)
return true;
}

+static void intel_pmu_legacy_freezing_lbrs_on_pmi(struct kvm_vcpu *vcpu)
+{
+ u64 data;
+
+ if (!vcpu->kvm->arch.lbr_in_guest)
+ return;
+
+ /*
+ * If Freeze_LBR_On_PMI = 1, the LBR is frozen on PMI and
+ * the KVM emulates to clear the LBR bit (bit 0) in IA32_DEBUGCTL.
+ *
+ * Guest would then re-enable LBR to resume recording branches.
+ */
+ data = vmcs_read64(GUEST_IA32_DEBUGCTL);
+ if (data & DEBUGCTLMSR_FREEZE_LBRS_ON_PMI) {
+ data &= ~DEBUGCTLMSR_LBR;
+ vmcs_write64(GUEST_IA32_DEBUGCTL, data);
+ }
+}
+
+static void intel_pmu_deliver_pmi(struct kvm_vcpu *vcpu)
+{
+ u8 version = vcpu_to_pmu(vcpu)->version;
+
+ /* Freeze_LBR_on_PMI is supported for PMU version <= 3 and >1 */
+ if (version > 1 && version <= 3)
+ intel_pmu_legacy_freezing_lbrs_on_pmi(vcpu);
+}
+
+static void intel_pmu_lbr_availability_check(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (pmu->lbr_already_available && event_is_oncpu(pmu->lbr_event))
+ return;
+
+ if (!pmu->lbr_already_available && !event_is_oncpu(pmu->lbr_event))
+ return;
+
+ if (event_is_oncpu(pmu->lbr_event)) {
+ intel_pmu_intercept_lbr_msrs(vcpu, false);
+ pmu->lbr_already_available = true;
+ } else {
+ intel_pmu_intercept_lbr_msrs(vcpu, true);
+ pmu->lbr_already_available = false;
+ }
+}
+
+/*
+ * Higher priority host perf events (e.g. cpu pinned) could reclaim the
+ * pmu resources (e.g. LBR) that were assigned to the guest. This is
+ * usually done via ipi calls (more details in perf_install_in_context).
+ *
+ * Before entering the non-root mode (with irq disabled here), double
+ * confirm that the pmu features enabled to the guest are not reclaimed
+ * by higher priority host events. Otherwise, disallow vcpu's access to
+ * the reclaimed features.
+ */
+static void intel_pmu_availability_check(struct kvm_vcpu *vcpu)
+{
+ lockdep_assert_irqs_disabled();
+
+ if (vcpu->kvm->arch.lbr_in_guest && vcpu_to_pmu(vcpu)->lbr_event)
+ intel_pmu_lbr_availability_check(vcpu);
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -445,4 +665,6 @@ struct kvm_pmu_ops intel_pmu_ops = {
.init = intel_pmu_init,
.reset = intel_pmu_reset,
.lbr_setup = intel_pmu_lbr_setup,
+ .deliver_pmi = intel_pmu_deliver_pmi,
+ .availability_check = intel_pmu_availability_check,
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 766303b31949..31c294b2d941 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3717,8 +3717,8 @@ static __always_inline void vmx_enable_intercept_for_msr(unsigned long *msr_bitm
}
}

-static __always_inline void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
- u32 msr, int type, bool value)
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value)
{
if (value)
vmx_enable_intercept_for_msr(msr_bitmap, msr, type);
@@ -6622,8 +6622,11 @@ static enum exit_fastpath_completion vmx_vcpu_run(struct kvm_vcpu *vcpu)

pt_guest_enter(vmx);

- if (vcpu_to_pmu(vcpu)->version)
+ if (vcpu_to_pmu(vcpu)->version) {
atomic_switch_perf_msrs(vmx);
+ kvm_x86_ops.pmu_ops->availability_check(vcpu);
+ }
+
atomic_switch_umwait_control_msr(vmx);

if (enable_preemption_timer)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index edfb739e5907..e0845e1dab06 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -347,6 +347,8 @@ void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_set_intercept_for_msr(unsigned long *msr_bitmap,
+ u32 msr, int type, bool value);
struct shared_msr_entry *find_msr_entry(struct vcpu_vmx *vmx, u32 msr);
void pt_update_intercept_for_msr(struct vcpu_vmx *vmx);
void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);
--
2.21.1

2020-04-24 12:19:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v10 08/11] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

On Thu, Apr 23, 2020 at 04:14:09PM +0800, Like Xu wrote:
> +static int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct perf_event *event;
> +
> + /*
> + * The perf_event_attr is constructed in the minimum efficient way:
> + * - set 'pinned = true' to make it task pinned so that if another
> + * cpu pinned event reclaims LBR, the event->oncpu will be set to -1;
> + *
> + * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
> + * 'exclude_host = true' to mark it as a guest LBR event which
> + * indicates host perf to schedule it without but a fake counter,
> + * check is_guest_lbr_event() and intel_guest_event_constraints();
> + *
> + * - set 'branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> + * PERF_SAMPLE_BRANCH_USER' to configure it to use callstack mode,
> + * which allocs 'ctx->task_ctx_data' and request host perf subsystem
> + * to save/restore guest LBR records during host context switches,
> + * check branch_user_callstack() and intel_pmu_lbr_sched_task();
> + */
> + struct perf_event_attr attr = {
> + .type = PERF_TYPE_RAW,

This is not right; this needs a .config

And I suppose that is why you need that horrible:
needs_guest_lbr_without_counter() thing to begin with.

Please allocate yourself an event from the pseudo event range:
event==0x00. Currently we only have umask==3 for Fixed2 and umask==4
for Fixed3, given you claim 58, which is effectively Fixed25,
umask==0x1a might be appropriate.

Also, I suppose we need to claim 0x0000 as an error, so that other
people won't try this again.

> + .size = sizeof(attr),
> + .pinned = true,
> + .exclude_host = true,
> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
> + PERF_SAMPLE_BRANCH_USER,
> + };
> +
> + if (unlikely(pmu->lbr_event))
> + return 0;
> +
> + event = perf_event_create_kernel_counter(&attr, -1,
> + current, NULL, NULL);
> + if (IS_ERR(event)) {
> + pr_debug_ratelimited("%s: failed %ld\n",
> + __func__, PTR_ERR(event));
> + return -ENOENT;
> + }
> + pmu->lbr_event = event;
> + pmu->event_count++;
> + return 0;
> +}

Also, what happens if you fail programming due to a conflicting cpu
event? That pinned doesn't guarantee you'll get the event, it just means
you'll error instead of getting RR.

I didn't find any code checking the event state.

2020-04-27 03:21:05

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v10 08/11] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

Hi Peter,

On 2020/4/24 20:16, Peter Zijlstra wrote:
> On Thu, Apr 23, 2020 at 04:14:09PM +0800, Like Xu wrote:
>> +static int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> + struct perf_event *event;
>> +
>> + /*
>> + * The perf_event_attr is constructed in the minimum efficient way:
>> + * - set 'pinned = true' to make it task pinned so that if another
>> + * cpu pinned event reclaims LBR, the event->oncpu will be set to -1;
>> + *
>> + * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
>> + * 'exclude_host = true' to mark it as a guest LBR event which
>> + * indicates host perf to schedule it without but a fake counter,
>> + * check is_guest_lbr_event() and intel_guest_event_constraints();
>> + *
>> + * - set 'branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
>> + * PERF_SAMPLE_BRANCH_USER' to configure it to use callstack mode,
>> + * which allocs 'ctx->task_ctx_data' and request host perf subsystem
>> + * to save/restore guest LBR records during host context switches,
>> + * check branch_user_callstack() and intel_pmu_lbr_sched_task();
>> + */
>> + struct perf_event_attr attr = {
>> + .type = PERF_TYPE_RAW,
>
> This is not right; this needs a .config

Now we know the default value .config = 0 for attr is not acceptable.

>
> And I suppose that is why you need that horrible:
> needs_guest_lbr_without_counter() thing to begin with.

Do you suggest to use event->attr.config check to replace
"needs_branch_stack(event) && is_kernel_event(event) &&
event->attr.exclude_host" check for guest LBR event ?

>
> Please allocate yourself an event from the pseudo event range:
> event==0x00. Currently we only have umask==3 for Fixed2 and umask==4
> for Fixed3, given you claim 58, which is effectively Fixed25,
> umask==0x1a might be appropriate.

OK, I assume that adding one more field ".config = 0x1a00" is
efficient enough for perf_event_attr to allocate guest LBR events.

>
> Also, I suppose we need to claim 0x0000 as an error, so that other
> people won't try this again.

Does the following fix address your concern on this ?

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2405926e2dba..32d2a3f8c51f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -498,6 +498,9 @@ int x86_pmu_max_precise(void)

int x86_pmu_hw_config(struct perf_event *event)
{
+ if (!unlikely(event->attr.config & X86_ARCH_EVENT_MASK))
+ return -EINVAL;
+
if (event->attr.precise_ip) {
int precise = x86_pmu_max_precise();

diff --git a/arch/x86/include/asm/perf_event.h
b/arch/x86/include/asm/perf_event.h
index 2e6c59308344..bdba87a6f0af 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -47,6 +47,8 @@
(ARCH_PERFMON_EVENTSEL_EVENT | (0x0FULL << 32))
#define INTEL_ARCH_EVENT_MASK \
(ARCH_PERFMON_EVENTSEL_UMASK | ARCH_PERFMON_EVENTSEL_EVENT)
+#define X86_ARCH_EVENT_MASK \
+ (ARCH_PERFMON_EVENTSEL_UMASK | ARCH_PERFMON_EVENTSEL_EVENT)

#define AMD64_L3_SLICE_SHIFT 48
#define AMD64_L3_SLICE_MASK

>
>> + .size = sizeof(attr),
>> + .pinned = true,
>> + .exclude_host = true,
>> + .sample_type = PERF_SAMPLE_BRANCH_STACK,
>> + .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
>> + PERF_SAMPLE_BRANCH_USER,
>> + };
>> +
>> + if (unlikely(pmu->lbr_event))
>> + return 0;
>> +
>> + event = perf_event_create_kernel_counter(&attr, -1,
>> + current, NULL, NULL);
>> + if (IS_ERR(event)) {
>> + pr_debug_ratelimited("%s: failed %ld\n",
>> + __func__, PTR_ERR(event));
>> + return -ENOENT;
>> + }
>> + pmu->lbr_event = event;
>> + pmu->event_count++;
>> + return 0;
>> +}
>
> Also, what happens if you fail programming due to a conflicting cpu
> event? That pinned doesn't guarantee you'll get the event, it just means
> you'll error instead of getting RR.
>
> I didn't find any code checking the event state.
>

Error instead of RR is expected.

If the KVM fails programming due to a conflicting cpu event
the LBR registers will not be passthrough to the guest,
and KVM would return zero for any guest LBR records accesses
until the next attempt to program the guest LBR event.

Every time before cpu enters the non-root mode where irq is
disabled, the "event-> oncpu! =-1" check will be applied.
(more details in the comment around intel_pmu_availability_check())

The guests administer is supposed to know the result of guest
LBR records is inaccurate if someone is using LBR to record
guest or hypervisor on the host side.

Is this acceptable to you?

If there is anything needs to be improved, please let me know.

Thanks,
Like Xu

2020-04-27 16:38:15

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v10 06/11] KVM: x86: Add KVM_CAP_X86_GUEST_LBR to dis/enable LBR from user-space

Hi Like,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on vhost/linux-next v5.7-rc3 next-20200424]
[cannot apply to kvm/linux-next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Like-Xu/Guest-Last-Branch-Recording-Enabling/20200426-123735
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 87cfeb1920f84f465a738d4c6589033eefa20b45
config: i386-allyesconfig (attached as .config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <[email protected]>

All warnings (new ones prefixed by >>):

arch/x86/kvm/x86.c: In function 'kvm_vm_ioctl_enable_cap':
>> arch/x86/kvm/x86.c:4909:20: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
if (copy_to_user((void __user *)cap->args[1],
^

vim +4909 arch/x86/kvm/x86.c

4829
4830 int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
4831 struct kvm_enable_cap *cap)
4832 {
4833 int r;
4834
4835 if (cap->flags)
4836 return -EINVAL;
4837
4838 switch (cap->cap) {
4839 case KVM_CAP_DISABLE_QUIRKS:
4840 kvm->arch.disabled_quirks = cap->args[0];
4841 r = 0;
4842 break;
4843 case KVM_CAP_SPLIT_IRQCHIP: {
4844 mutex_lock(&kvm->lock);
4845 r = -EINVAL;
4846 if (cap->args[0] > MAX_NR_RESERVED_IOAPIC_PINS)
4847 goto split_irqchip_unlock;
4848 r = -EEXIST;
4849 if (irqchip_in_kernel(kvm))
4850 goto split_irqchip_unlock;
4851 if (kvm->created_vcpus)
4852 goto split_irqchip_unlock;
4853 r = kvm_setup_empty_irq_routing(kvm);
4854 if (r)
4855 goto split_irqchip_unlock;
4856 /* Pairs with irqchip_in_kernel. */
4857 smp_wmb();
4858 kvm->arch.irqchip_mode = KVM_IRQCHIP_SPLIT;
4859 kvm->arch.nr_reserved_ioapic_pins = cap->args[0];
4860 r = 0;
4861 split_irqchip_unlock:
4862 mutex_unlock(&kvm->lock);
4863 break;
4864 }
4865 case KVM_CAP_X2APIC_API:
4866 r = -EINVAL;
4867 if (cap->args[0] & ~KVM_X2APIC_API_VALID_FLAGS)
4868 break;
4869
4870 if (cap->args[0] & KVM_X2APIC_API_USE_32BIT_IDS)
4871 kvm->arch.x2apic_format = true;
4872 if (cap->args[0] & KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
4873 kvm->arch.x2apic_broadcast_quirk_disabled = true;
4874
4875 r = 0;
4876 break;
4877 case KVM_CAP_X86_DISABLE_EXITS:
4878 r = -EINVAL;
4879 if (cap->args[0] & ~KVM_X86_DISABLE_VALID_EXITS)
4880 break;
4881
4882 if ((cap->args[0] & KVM_X86_DISABLE_EXITS_MWAIT) &&
4883 kvm_can_mwait_in_guest())
4884 kvm->arch.mwait_in_guest = true;
4885 if (cap->args[0] & KVM_X86_DISABLE_EXITS_HLT)
4886 kvm->arch.hlt_in_guest = true;
4887 if (cap->args[0] & KVM_X86_DISABLE_EXITS_PAUSE)
4888 kvm->arch.pause_in_guest = true;
4889 if (cap->args[0] & KVM_X86_DISABLE_EXITS_CSTATE)
4890 kvm->arch.cstate_in_guest = true;
4891 r = 0;
4892 break;
4893 case KVM_CAP_MSR_PLATFORM_INFO:
4894 kvm->arch.guest_can_read_msr_platform_info = cap->args[0];
4895 r = 0;
4896 break;
4897 case KVM_CAP_EXCEPTION_PAYLOAD:
4898 kvm->arch.exception_payload_enabled = cap->args[0];
4899 r = 0;
4900 break;
4901 case KVM_CAP_X86_GUEST_LBR:
4902 r = -EINVAL;
4903 if (!cap->args[0] || !kvm->vcpus[0])
4904 break;
4905 if (!kvm_pmu_lbr_setup(kvm->vcpus[0]))
4906 break;
4907 if (vcpu_to_pmu(kvm->vcpus[0])->version < 2)
4908 break;
> 4909 if (copy_to_user((void __user *)cap->args[1],
4910 &kvm->arch.lbr, sizeof(struct x86_pmu_lbr)))
4911 break;
4912 kvm->arch.lbr_in_guest = !!cap->args[0];
4913 r = 0;
4914 break;
4915 default:
4916 r = -EINVAL;
4917 break;
4918 }
4919 return r;
4920 }
4921

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (4.36 kB)
.config.gz (70.54 kB)
Download all attachments

2020-04-28 05:12:03

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v10 09/11] KVM: x86/pmu: Release guest LBR event via vPMU lazy release mechanism

Hi Like,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tip/perf/core]
[also build test WARNING on vhost/linux-next v5.7-rc3 next-20200424]
[cannot apply to kvm/linux-next]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url: https://github.com/0day-ci/linux/commits/Like-Xu/Guest-Last-Branch-Recording-Enabling/20200426-123735
base: https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 87cfeb1920f84f465a738d4c6589033eefa20b45
reproduce:
# apt-get install sparse
# sparse version: v0.6.1-191-gc51a0382-dirty
make ARCH=x86_64 allmodconfig
make C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__'

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)

arch/x86/kvm/pmu.c:190:18: sparse: sparse: incompatible types in comparison expression (different address spaces):
arch/x86/kvm/pmu.c:190:18: sparse: struct kvm_pmu_event_filter [noderef] <asn:4> *
arch/x86/kvm/pmu.c:190:18: sparse: struct kvm_pmu_event_filter *
arch/x86/kvm/pmu.c:251:18: sparse: sparse: incompatible types in comparison expression (different address spaces):
arch/x86/kvm/pmu.c:251:18: sparse: struct kvm_pmu_event_filter [noderef] <asn:4> *
arch/x86/kvm/pmu.c:251:18: sparse: struct kvm_pmu_event_filter *
>> arch/x86/kvm/pmu.c:455:6: sparse: sparse: symbol 'kvm_pmu_lbr_cleanup' was not declared. Should it be static?
arch/x86/kvm/pmu.c:524:18: sparse: sparse: incompatible types in comparison expression (different address spaces):
arch/x86/kvm/pmu.c:524:18: sparse: struct kvm_pmu_event_filter [noderef] <asn:4> *
arch/x86/kvm/pmu.c:524:18: sparse: struct kvm_pmu_event_filter *
arch/x86/kvm/pmu.c:524:18: sparse: sparse: incompatible types in comparison expression (different address spaces):
arch/x86/kvm/pmu.c:524:18: sparse: struct kvm_pmu_event_filter [noderef] <asn:4> *
arch/x86/kvm/pmu.c:524:18: sparse: struct kvm_pmu_event_filter *

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

2020-04-28 05:12:07

by kernel test robot

[permalink] [raw]
Subject: [RFC PATCH] KVM: x86/pmu: kvm_pmu_lbr_cleanup() can be static


Signed-off-by: kbuild test robot <[email protected]>
---
pmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7dad899850bb3..b9b6d16651c8b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -452,7 +452,7 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
}

-void kvm_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
+static void kvm_pmu_lbr_cleanup(struct kvm_vcpu *vcpu)
{
if (kvm_x86_ops.pmu_ops->lbr_cleanup)
kvm_x86_ops.pmu_ops->lbr_cleanup(vcpu);

2020-05-08 08:52:29

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v10 08/11] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

Hi Peter,

On 2020/4/27 11:16, Like Xu wrote:
> Hi Peter,
>
> On 2020/4/24 20:16, Peter Zijlstra wrote:
>> On Thu, Apr 23, 2020 at 04:14:09PM +0800, Like Xu wrote:
>>> +static int intel_pmu_create_lbr_event(struct kvm_vcpu *vcpu)
>>> +{
>>> +    struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>>> +    struct perf_event *event;
>>> +
>>> +    /*
>>> +     * The perf_event_attr is constructed in the minimum efficient way:
>>> +     * - set 'pinned = true' to make it task pinned so that if another
>>> +     *   cpu pinned event reclaims LBR, the event->oncpu will be set to
>>> -1;
>>> +     *
>>> +     * - set 'sample_type = PERF_SAMPLE_BRANCH_STACK' and
>>> +     *   'exclude_host = true' to mark it as a guest LBR event which
>>> +     *   indicates host perf to schedule it without but a fake counter,
>>> +     *   check is_guest_lbr_event() and intel_guest_event_constraints();
>>> +     *
>>> +     * - set 'branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
>>> +     *   PERF_SAMPLE_BRANCH_USER' to configure it to use callstack mode,
>>> +     *   which allocs 'ctx->task_ctx_data' and request host perf subsystem
>>> +     *   to save/restore guest LBR records during host context switches,
>>> +     *   check branch_user_callstack() and intel_pmu_lbr_sched_task();
>>> +     */
>>> +    struct perf_event_attr attr = {
>>> +        .type = PERF_TYPE_RAW,
>>
>> This is not right; this needs a .config
>
> Now we know the default value .config = 0 for attr is not acceptable.
>
>>
>> And I suppose that is why you need that horrible:
>> needs_guest_lbr_without_counter() thing to begin with.
>
> Do you suggest to use event->attr.config check to replace
> "needs_branch_stack(event) && is_kernel_event(event) &&
> event->attr.exclude_host" check for guest LBR event ?
>
>>
>> Please allocate yourself an event from the pseudo event range:
>> event==0x00. Currently we only have umask==3 for Fixed2 and umask==4
>> for Fixed3, given you claim 58, which is effectively Fixed25,
>> umask==0x1a might be appropriate.
>
> OK, I assume that adding one more field ".config = 0x1a00" is
> efficient enough for perf_event_attr to allocate guest LBR events.

Do you have any comment for this ?

>
>>
>> Also, I suppose we need to claim 0x0000 as an error, so that other
>> people won't try this again.
>
> Does the following fix address your concern on this ?

Does the following fix address your concern on this ?

>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 2405926e2dba..32d2a3f8c51f 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -498,6 +498,9 @@ int x86_pmu_max_precise(void)
>
>  int x86_pmu_hw_config(struct perf_event *event)
>  {
> +       if (!unlikely(event->attr.config & X86_ARCH_EVENT_MASK))
> +               return -EINVAL;
> +
>         if (event->attr.precise_ip) {
>                 int precise = x86_pmu_max_precise();
>
> diff --git a/arch/x86/include/asm/perf_event.h
> b/arch/x86/include/asm/perf_event.h
> index 2e6c59308344..bdba87a6f0af 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -47,6 +47,8 @@
>         (ARCH_PERFMON_EVENTSEL_EVENT | (0x0FULL << 32))
>  #define INTEL_ARCH_EVENT_MASK  \
>         (ARCH_PERFMON_EVENTSEL_UMASK | ARCH_PERFMON_EVENTSEL_EVENT)
> +#define X86_ARCH_EVENT_MASK    \
> +       (ARCH_PERFMON_EVENTSEL_UMASK | ARCH_PERFMON_EVENTSEL_EVENT)
>
>  #define AMD64_L3_SLICE_SHIFT                           48
>  #define AMD64_L3_SLICE_MASK
>

>>
>>> +        .size = sizeof(attr),
>>> +        .pinned = true,
>>> +        .exclude_host = true,
>>> +        .sample_type = PERF_SAMPLE_BRANCH_STACK,
>>> +        .branch_sample_type = PERF_SAMPLE_BRANCH_CALL_STACK |
>>> +                    PERF_SAMPLE_BRANCH_USER,
>>> +    };
>>> +
>>> +    if (unlikely(pmu->lbr_event))
>>> +        return 0;
>>> +
>>> +    event = perf_event_create_kernel_counter(&attr, -1,
>>> +                        current, NULL, NULL);
>>> +    if (IS_ERR(event)) {
>>> +        pr_debug_ratelimited("%s: failed %ld\n",
>>> +                    __func__, PTR_ERR(event));
>>> +        return -ENOENT;
>>> +    }
>>> +    pmu->lbr_event = event;
>>> +    pmu->event_count++;
>>> +    return 0;
>>> +}
>>
>> Also, what happens if you fail programming due to a conflicting cpu
>> event? That pinned doesn't guarantee you'll get the event, it just means
>> you'll error instead of getting RR.
>>
>> I didn't find any code checking the event state.
>>
>
> Error instead of RR is expected.
>
> If the KVM fails programming due to a conflicting cpu event
> the LBR registers will not be passthrough to the guest,
> and KVM would return zero for any guest LBR records accesses
> until the next attempt to program the guest LBR event.
>
> Every time before cpu enters the non-root mode where irq is
> disabled, the "event-> oncpu! =-1" check will be applied.
> (more details in the comment around intel_pmu_availability_check())
>
> The guests administer is supposed to know the result of guest
> LBR records is inaccurate if someone is using LBR to record
> guest or hypervisor on the host side.
>
> Is this acceptable to you?
>
> If there is anything needs to be improved, please let me know.

Is this acceptable to you?

If there is anything needs to be improved, please let me know.

>
> Thanks,
> Like Xu
>

2020-05-08 13:12:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v10 08/11] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

On Mon, Apr 27, 2020 at 11:16:40AM +0800, Like Xu wrote:
> On 2020/4/24 20:16, Peter Zijlstra wrote:

> > And I suppose that is why you need that horrible:
> > needs_guest_lbr_without_counter() thing to begin with.
>
> Do you suggest to use event->attr.config check to replace
> "needs_branch_stack(event) && is_kernel_event(event) &&
> event->attr.exclude_host" check for guest LBR event ?

That's what the BTS thing does.

> > Please allocate yourself an event from the pseudo event range:
> > event==0x00. Currently we only have umask==3 for Fixed2 and umask==4
> > for Fixed3, given you claim 58, which is effectively Fixed25,
> > umask==0x1a might be appropriate.
>
> OK, I assume that adding one more field ".config = 0x1a00" is
> efficient enough for perf_event_attr to allocate guest LBR events.

Uh what? The code is already setting .config. You just have to change it
do another value.

> > Also, I suppose we need to claim 0x0000 as an error, so that other
> > people won't try this again.
>
> Does the following fix address your concern on this ?
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 2405926e2dba..32d2a3f8c51f 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -498,6 +498,9 @@ int x86_pmu_max_precise(void)
>
> int x86_pmu_hw_config(struct perf_event *event)
> {
> + if (!unlikely(event->attr.config & X86_ARCH_EVENT_MASK))
> + return -EINVAL;
> +
> if (event->attr.precise_ip) {
> int precise = x86_pmu_max_precise();

That wouldn't work right for AMD. But yes, something like that.

> > Also, what happens if you fail programming due to a conflicting cpu
> > event? That pinned doesn't guarantee you'll get the event, it just means
> > you'll error instead of getting RR.
> >
> > I didn't find any code checking the event state.
> >
>
> Error instead of RR is expected.
>
> If the KVM fails programming due to a conflicting cpu event
> the LBR registers will not be passthrough to the guest,
> and KVM would return zero for any guest LBR records accesses
> until the next attempt to program the guest LBR event.
>
> Every time before cpu enters the non-root mode where irq is
> disabled, the "event-> oncpu! =-1" check will be applied.
> (more details in the comment around intel_pmu_availability_check())
>
> The guests administer is supposed to know the result of guest
> LBR records is inaccurate if someone is using LBR to record
> guest or hypervisor on the host side.
>
> Is this acceptable to you?
>
> If there is anything needs to be improved, please let me know.

It might be nice to emit a pr_warn() or something on the host when this
happens. Then at least the host admin can know he wrecked things for
which guest.

2020-05-12 05:02:46

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v10 08/11] KVM: x86/pmu: Add LBR feature emulation via guest LBR event

On 2020/5/8 21:09, Peter Zijlstra wrote:
> On Mon, Apr 27, 2020 at 11:16:40AM +0800, Like Xu wrote:
>> On 2020/4/24 20:16, Peter Zijlstra wrote:
>>> And I suppose that is why you need that horrible:
>>> needs_guest_lbr_without_counter() thing to begin with.
>> Do you suggest to use event->attr.config check to replace
>> "needs_branch_stack(event) && is_kernel_event(event) &&
>> event->attr.exclude_host" check for guest LBR event ?
> That's what the BTS thing does.
Thanks, I'll apply it.
>
>>> Please allocate yourself an event from the pseudo event range:
>>> event==0x00. Currently we only have umask==3 for Fixed2 and umask==4
>>> for Fixed3, given you claim 58, which is effectively Fixed25,
>>> umask==0x1a might be appropriate.
>> OK, I assume that adding one more field ".config = 0x1a00" is
>> efficient enough for perf_event_attr to allocate guest LBR events.
> Uh what? The code is already setting .config. You just have to change it
> do another value.
Thanks, I'll apply it.
>
>>> Also, I suppose we need to claim 0x0000 as an error, so that other
>>> people won't try this again.
>> Does the following fix address your concern on this ?
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 2405926e2dba..32d2a3f8c51f 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -498,6 +498,9 @@ int x86_pmu_max_precise(void)
>>
>> int x86_pmu_hw_config(struct perf_event *event)
>> {
>> + if (!unlikely(event->attr.config & X86_ARCH_EVENT_MASK))
>> + return -EINVAL;
>> +
>> if (event->attr.precise_ip) {
>> int precise = x86_pmu_max_precise();
> That wouldn't work right for AMD. But yes, something like that.
Sure, I may figure it out and leave it in another thread.
>
>>> Also, what happens if you fail programming due to a conflicting cpu
>>> event? That pinned doesn't guarantee you'll get the event, it just means
>>> you'll error instead of getting RR.
>>>
>>> I didn't find any code checking the event state.
>>>
>> Error instead of RR is expected.
>>
>> If the KVM fails programming due to a conflicting cpu event
>> the LBR registers will not be passthrough to the guest,
>> and KVM would return zero for any guest LBR records accesses
>> until the next attempt to program the guest LBR event.
>>
>> Every time before cpu enters the non-root mode where irq is
>> disabled, the "event-> oncpu! =-1" check will be applied.
>> (more details in the comment around intel_pmu_availability_check())
>>
>> The guests administer is supposed to know the result of guest
>> LBR records is inaccurate if someone is using LBR to record
>> guest or hypervisor on the host side.
>>
>> Is this acceptable to you?
>>
>> If there is anything needs to be improved, please let me know.
> It might be nice to emit a pr_warn() or something on the host when this
> happens. Then at least the host admin can know he wrecked things for
> which guest.
Sure, I will use pr_warn () and indicate which guest it is.

If you have more comments on the patchset, please let me know.
If not, I'll spin the next version based on your current feedback.

Thanks,
Like Xu