2021-08-06 18:34:24

by Zhu, Lingshan

Subject: [PATCH V10 00/18] KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS

The guest Precise Event Based Sampling (PEBS) feature can provide the
architectural state of the instruction executed right after the guest
instruction that exactly caused the event. It requires a new hardware facility
that is currently only available on Intel Ice Lake Server platforms. This
patch set enables the basic PEBS feature for KVM guests on ICX.

We can use the PEBS feature in a Linux guest just like on bare metal:

# echo 0 > /proc/sys/kernel/watchdog (on the host)
# perf record -e instructions:ppp ./br_instr a
# perf record -c 100000 -e instructions:pp ./br_instr a

To emulate the guest PEBS facility for the above perf usages,
we need to implement two code paths:

1) Fast path

This is when the host-assigned physical PMC has the same index as the
virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0).
This path covers the most common use cases.

2) Slow path

This is when the host-assigned physical PMC has a different index from the
virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0). In this case,
KVM needs to rewrite the PEBS records to change the applicable counter indexes
to the virtual PMC indexes (they would otherwise contain the physical counter
indexes written by the PEBS facility), and to switch the counter reset values
to the offsets corresponding to the physical counter indexes in the DS data
structure.
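
For reference, the slow-path record fixup (not part of this series) would
conceptually look like the sketch below; the function and field names are
illustrative only, not taken from the actual slow-path patches:

  /*
   * Illustrative sketch only -- NOT part of this series. For each PEBS
   * record in the guest DS buffer, re-point the "applicable counters"
   * bit from the host (physical) PMC index to the virtual PMC index.
   */
  static void rewrite_pebs_record(u64 *applicable_counters, int host_idx, int guest_idx)
  {
          if (*applicable_counters & BIT_ULL(host_idx)) {
                  *applicable_counters &= ~BIT_ULL(host_idx);
                  *applicable_counters |= BIT_ULL(guest_idx);
          }
  }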

The previous version [0] enabled both the fast path and the slow path, which
seems a bit too complex for a first step. This patchset starts with the fast
path to get basic guest PEBS enabled while keeping the slow path disabled. A
more focused discussion of the slow path [1] is planned for a follow-up
patchset.

Compared to what later versions will add, this patch set lacks support for
having host and guest PEBS enabled at the same time, and for emulating guest
PEBS when the counter is cross-mapped
(neither of these is a typical scenario).

With this basic support, the guest can retrieve correct PEBS information from
its own PEBS records on Ice Lake servers. It is expected to keep working after
migration to another Ice Lake host, and no regression of host perf is expected.

Here are the results of a PEBS test on the guest and the host for the same workload:

perf report on guest:
# Samples: 2K of event 'instructions:ppp'
# Event count (approx.): 1473377250
#
# Overhead  Command   Shared Object       Symbol
    57.74%  br_instr  br_instr            [.] lfsr_cond
    41.40%  br_instr  br_instr            [.] cmp_end
     0.21%  br_instr  [kernel.kallsyms]   [k] __lock_acquire

perf report on host:
# Samples: 2K of event 'instructions:ppp'
# Event count (approx.): 1462721386
#
# Overhead  Command   Shared Object       Symbol
    57.90%  br_instr  br_instr            [.] lfsr_cond
    41.95%  br_instr  br_instr            [.] cmp_end
     0.05%  br_instr  [kernel.vmlinux]    [k] lock_acquire

Conclusion: the profiling results on the guest are similar to those on the host.

The minimum guest kernel version may be v5.4, or any version with the
Ice Lake server PEBS support backported.

Please check each commit for more details and feel free to comment.

Previous:
https://lore.kernel.org/kvm/[email protected]/

[0]
https://lore.kernel.org/kvm/[email protected]/
[1]
https://lore.kernel.org/kvm/[email protected]/

V9 -> V10 Changelog:
- Improve readability in core.c (PeterZ)
- Reuse guest_pebs_idxs (Liu XiangDong)

V8 -> V9 Changelog:
- Fix a bracket error in xen_guest_state()

V7 -> V8 Changelog:
- Fix coding style, add {} for a single statement spanning multiple lines (PeterZ)
- Fix coding style in xen_guest_state() (Boris Ostrovsky)
- s/pmu/kvm_pmu/ in intel_guest_get_msrs() (PeterZ)
- Put the lower-cost branch first in x86_pmu_handle_guest_pebs() (PeterZ)

V6 -> V7 Changelog:
- Fix conditions order and call x86_pmu_handle_guest_pebs() unconditionally; (PeterZ)
- Add a new patch to make all that perf_guest_cbs stuff suck less; (PeterZ)
- Document that the IA32_MISC_ENABLE[7] behavior matches bare metal; (Sean & Venkatesh)
- Update commit message for fixed counter mask refactoring; (PeterZ)
- Clarify comments about {.host and .guest} for intel_guest_get_msrs(); (PeterZ)
- Add pebs_capable to store valid PEBS_COUNTER_MASK value; (PeterZ)
- Add more comments for perf's precise_ip field; (Andi & PeterZ)
- Refactor perf_overflow_handler_t and make it more legible; (PeterZ)
- Use "(unsigned long)cpuc->ds" instead of __this_cpu_read(cpu_hw_events.ds); (PeterZ)
- Keep using "(struct kvm_pmu *)data" to follow K&R; (Andi)

Like Xu (17):
perf/core: Use static_call to optimize perf_guest_info_callbacks
perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server
perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS
KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
KVM: x86/cpuid: Refactor host/guest CPU model consistency check
KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64

Peter Zijlstra (Intel) (1):
x86/perf/core: Add pebs_capable to store valid PEBS_COUNTER_MASK value

arch/arm/kernel/perf_callchain.c | 16 +--
arch/arm64/kernel/perf_callchain.c | 29 +++--
arch/arm64/kvm/perf.c | 22 ++--
arch/csky/kernel/perf_callchain.c | 4 +-
arch/nds32/kernel/perf_event_cpu.c | 16 +--
arch/riscv/kernel/perf_callchain.c | 4 +-
arch/x86/events/core.c | 44 ++++++--
arch/x86/events/intel/core.c | 167 +++++++++++++++++++++++------
arch/x86/events/perf_event.h | 6 +-
arch/x86/include/asm/kvm_host.h | 18 +++-
arch/x86/include/asm/msr-index.h | 6 ++
arch/x86/include/asm/perf_event.h | 5 +-
arch/x86/kvm/cpuid.c | 26 ++---
arch/x86/kvm/cpuid.h | 5 +
arch/x86/kvm/pmu.c | 60 ++++++++---
arch/x86/kvm/pmu.h | 38 +++++++
arch/x86/kvm/vmx/capabilities.h | 26 +++--
arch/x86/kvm/vmx/pmu_intel.c | 116 ++++++++++++++++----
arch/x86/kvm/vmx/vmx.c | 24 ++++-
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 51 +++++----
arch/x86/xen/pmu.c | 33 +++---
include/linux/perf_event.h | 12 ++-
kernel/events/core.c | 9 ++
24 files changed, 548 insertions(+), 191 deletions(-)

--
2.27.0


2021-08-06 18:35:04

by Zhu, Lingshan

Subject: [PATCH V10 02/18] perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server

From: Like Xu <[email protected]>

The new hardware facility supporting guest PEBS is only available
on Intel Ice Lake Server platforms for now. KVM will check this
capability through perf_get_x86_pmu_capability() instead of hard-coding
the CPU models in the KVM code. If it is supported, the guest PEBS
capability will be exposed to the guest.
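
For context, a consumer such as KVM can query the new capability bit via the
existing interface, roughly as below (a sketch; the actual KVM-side usage is
added later in this series):

  struct x86_pmu_capability cap;

  perf_get_x86_pmu_capability(&cap);
  if (cap.pebs_vmx) {
          /* The host PMU can virtualize PEBS; guest PEBS may be exposed. */
  }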

Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 1 +
arch/x86/events/perf_event.h | 3 ++-
arch/x86/include/asm/perf_event.h | 1 +
4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 9a908631f6cc..2240480cef6f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -3011,5 +3011,6 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
cap->bit_width_fixed = x86_pmu.cntval_bits;
cap->events_mask = (unsigned int)x86_pmu.events_maskl;
cap->events_mask_len = x86_pmu.events_mask_len;
+ cap->pebs_vmx = x86_pmu.pebs_vmx;
}
EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fb1bd7a0e1a6..da835f5a37e2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6037,6 +6037,7 @@ __init int intel_pmu_init(void)

case INTEL_FAM6_ICELAKE_X:
case INTEL_FAM6_ICELAKE_D:
+ x86_pmu.pebs_vmx = 1;
pmem = true;
fallthrough;
case INTEL_FAM6_ICELAKE_L:
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2bf1c7ea2758..68601de166a3 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -797,7 +797,8 @@ struct x86_pmu {
pebs_prec_dist :1,
pebs_no_tlb :1,
pebs_no_isolation :1,
- pebs_block :1;
+ pebs_block :1,
+ pebs_vmx :1;
int pebs_record_size;
int pebs_buffer_size;
int max_pebs_events;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8fc1b5003713..42d7bcf1a896 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,6 +192,7 @@ struct x86_pmu_capability {
int bit_width_fixed;
unsigned int events_mask;
int events_mask_len;
+ unsigned int pebs_vmx :1;
};

/*
--
2.27.0

2021-08-06 18:35:27

by Zhu, Lingshan

Subject: [PATCH V10 04/18] perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values

From: Like Xu <[email protected]>

Splitting the logic for determining the guest values is unnecessarily
confusing, and potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest.

If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
can generate the full set of guest values by grabbing guest ds_area and
pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
vendor agnostic, so we don't see any reason to not just pass the pointer.
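
For illustration, a PEBS-aware implementation of the callback can then pull
the guest values straight out of the vPMU state; the fragment below is a
sketch of what a later patch in this series does inside intel_guest_get_msrs():

  struct kvm_pmu *kvm_pmu = (struct kvm_pmu *)data;

  if (kvm_pmu) {
          /* Load the guest's DS area on VM-entry, restore the host's on exit. */
          arr[(*nr)++] = (struct perf_guest_switch_msr){
                  .msr   = MSR_IA32_DS_AREA,
                  .host  = (unsigned long)cpuc->ds,
                  .guest = kvm_pmu->ds_area,
          };
  }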

Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/core.c | 4 ++--
arch/x86/events/intel/core.c | 4 ++--
arch/x86/events/perf_event.h | 2 +-
arch/x86/include/asm/perf_event.h | 4 ++--
arch/x86/kvm/vmx/vmx.c | 3 ++-
5 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 2240480cef6f..ae2ffe37bfbb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -713,9 +713,9 @@ void x86_pmu_disable_all(void)
}
}

-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
{
- return static_call(x86_pmu_guest_get_msrs)(nr);
+ return static_call(x86_pmu_guest_get_msrs)(nr, data);
}
EXPORT_SYMBOL_GPL(perf_guest_get_msrs);

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2770eba232c7..e09cc8901524 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3899,7 +3899,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
return 0;
}

-static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
+static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
@@ -3932,7 +3932,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
return arr;
}

-static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr)
+static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 68601de166a3..1518f2754842 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -877,7 +877,7 @@ struct x86_pmu {
/*
* Intel host/guest support (KVM)
*/
- struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
+ struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr, void *data);

/*
* Check period value for PERF_EVENT_IOC_PERIOD ioctl.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 42d7bcf1a896..bf61beaa7906 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -492,10 +492,10 @@ static inline void perf_check_microcode(void) { }
#endif

#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
-extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
extern int x86_perf_get_lbr(struct x86_pmu_lbr *lbr);
#else
-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
{
return -1;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 927a552393b9..063e869b4e19 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6512,9 +6512,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
int i, nr_msrs;
struct perf_guest_switch_msr *msrs;
+ struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);

/* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. */
- msrs = perf_guest_get_msrs(&nr_msrs);
+ msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu);
if (!msrs)
return;

--
2.27.0

2021-08-06 18:36:06

by Zhu, Lingshan

Subject: [PATCH V10 06/18] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter

From: Like Xu <[email protected]>

The mask value of the fixed counter control register should be dynamically
adjusted with the number of fixed counters. This patch introduces a
variable that holds the reserved bits of the fixed counter control
register. This is a generic code refactoring.
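
For example, with three fixed counters, bits 0, 1 and 3 (0xb) of each 4-bit
control field are cleared from an all-ones mask, so fixed_ctr_ctrl_mask ends
up as 0xfffffffffffff444ull, i.e. exactly the constant that was previously
hard-coded in intel_pmu_set_msr().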

Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 128e2dd9c944..172fabbcc11a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -489,6 +489,7 @@ struct kvm_pmu {
unsigned nr_arch_fixed_counters;
unsigned available_event_types;
u64 fixed_ctr_ctrl;
+ u64 fixed_ctr_ctrl_mask;
u64 global_ctrl;
u64 global_status;
u64 global_ovf_ctrl;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d9dbebe03cae..ac7fe714e6c1 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -400,7 +400,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_FIXED_CTR_CTRL:
if (pmu->fixed_ctr_ctrl == data)
return 0;
- if (!(data & 0xfffffffffffff444ull)) {
+ if (!(data & pmu->fixed_ctr_ctrl_mask)) {
reprogram_fixed_counters(pmu, data);
return 0;
}
@@ -470,6 +470,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
struct kvm_cpuid_entry2 *entry;
union cpuid10_eax eax;
union cpuid10_edx edx;
+ int i;

pmu->nr_arch_gp_counters = 0;
pmu->nr_arch_fixed_counters = 0;
@@ -477,6 +478,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
+ pmu->fixed_ctr_ctrl_mask = ~0ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -511,6 +513,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
((u64)1 << edx.split.bit_width_fixed) - 1;
}

+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+ pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));
pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) |
(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
pmu->global_ctrl_mask = ~pmu->global_ctrl;
--
2.27.0

2021-08-06 18:38:06

by Zhu, Lingshan

Subject: [PATCH V10 07/18] x86/perf/core: Add pebs_capable to store valid PEBS_COUNTER_MASK value

From: "Peter Zijlstra (Intel)" <[email protected]>

The PEBS counter mask value will be accessed frequently for repeated
use in intel_guest_get_msrs(), so cache it in a new pebs_capable field
instead of endlessly mucking about with branches.
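
Where PMU_FL_PEBS_ALL is set (e.g. on Ice Lake), pebs_capable becomes ~0ULL;
otherwise it stays at PEBS_COUNTER_MASK, so a single expression like

  status &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);

replaces the former if/else on the flag.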

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
---
arch/x86/events/intel/core.c | 14 ++++++--------
arch/x86/events/perf_event.h | 1 +
2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e09cc8901524..552a623dc886 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2867,10 +2867,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
* counters from the GLOBAL_STATUS mask and we always process PEBS
* events via drain_pebs().
*/
- if (x86_pmu.flags & PMU_FL_PEBS_ALL)
- status &= ~cpuc->pebs_enabled;
- else
- status &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+ status &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);

/*
* PEBS overflow sets bit 62 in the global status register
@@ -3908,10 +3905,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
- if (x86_pmu.flags & PMU_FL_PEBS_ALL)
- arr[0].guest &= ~cpuc->pebs_enabled;
- else
- arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+ arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
*nr = 1;

if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
@@ -5594,6 +5588,7 @@ __init int intel_pmu_init(void)
x86_pmu.events_mask_len = eax.split.mask_length;

x86_pmu.max_pebs_events = min_t(unsigned, MAX_PEBS_EVENTS, x86_pmu.num_counters);
+ x86_pmu.pebs_capable = PEBS_COUNTER_MASK;

/*
* Quirk: v2 perfmon does not report fixed-purpose events, so
@@ -5778,6 +5773,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_aliases = NULL;
x86_pmu.pebs_prec_dist = true;
x86_pmu.lbr_pt_coexist = true;
+ x86_pmu.pebs_capable = ~0ULL;
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_PEBS_ALL;
x86_pmu.get_event_constraints = glp_get_event_constraints;
@@ -6135,6 +6131,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_aliases = NULL;
x86_pmu.pebs_prec_dist = true;
x86_pmu.pebs_block = true;
+ x86_pmu.pebs_capable = ~0ULL;
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
x86_pmu.flags |= PMU_FL_PEBS_ALL;
@@ -6178,6 +6175,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_aliases = NULL;
x86_pmu.pebs_prec_dist = true;
x86_pmu.pebs_block = true;
+ x86_pmu.pebs_capable = ~0ULL;
x86_pmu.flags |= PMU_FL_HAS_RSP_1;
x86_pmu.flags |= PMU_FL_NO_HT_SHARING;
x86_pmu.flags |= PMU_FL_PEBS_ALL;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1518f2754842..35d0a7ec5f20 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -807,6 +807,7 @@ struct x86_pmu {
void (*pebs_aliases)(struct perf_event *event);
unsigned long large_pebs_flags;
u64 rtm_abort_event;
+ u64 pebs_capable;

/*
* Intel LBR
--
2.27.0

2021-08-06 18:39:02

by Zhu, Lingshan

Subject: [PATCH V10 09/18] KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter

From: Like Xu <[email protected]>

When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, the corresponding guest perf event is reprogrammed
with a non-zero precision level in its perf_event_attr.

The guest PEBS overflow PMI bit will be set in the guest
GLOBAL_STATUS MSR when the PEBS facility generates a PEBS
overflow PMI based on the guest IA32_DS_AREA MSR.

Even with the same counter index and the same event code and
mask, guest PEBS events will not be reused for non-PEBS events.
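
Conceptually, the key change when creating the perf event for such a counter
is the non-zero precise_ip; a sketch of the relevant attr fields is below
(the complete logic is in the diff that follows):

  struct perf_event_attr attr = {
          .type          = type,
          .config        = config,        /* guest event code/umask */
          .precise_ip    = 1,             /* turn it into a host PEBS event */
          .sample_period = get_sample_period(pmc, pmc->counter),
  };

  event = perf_event_create_kernel_counter(&attr, -1, current,
                                           kvm_perf_overflow_intr, pmc);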

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/pmu.c | 42 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 2dcbd1b30004..d76b0a5d80d7 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -74,11 +74,21 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
{
struct kvm_pmc *pmc = perf_event->overflow_handler_context;
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+ bool skip_pmi = false;

if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
- __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ if (perf_event->attr.precise_ip) {
+ /* Indicate PEBS overflow PMI to guest. */
+ skip_pmi = __test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+ (unsigned long *)&pmu->global_status);
+ } else {
+ __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ }
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);

+ if (skip_pmi)
+ return;
+
/*
* Inject PMI. If vcpu was in a guest mode during NMI PMI
* can be ejected on a guest mode re-entry. Otherwise we can't
@@ -99,6 +109,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
bool exclude_kernel, bool intr,
bool in_tx, bool in_tx_cp)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
struct perf_event *event;
struct perf_event_attr attr = {
.type = type,
@@ -110,6 +121,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_kernel = exclude_kernel,
.config = config,
};
+ bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
+ perf_overflow_handler_t ovf = kvm_perf_overflow;

attr.sample_period = get_sample_period(pmc, pmc->counter);

@@ -124,10 +137,27 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
attr.sample_period = 0;
attr.config |= HSW_IN_TX_CHECKPOINTED;
}
+ if (pebs) {
+ /*
+ * The non-zero precision level of guest event makes the ordinary
+ * guest event becomes a guest PEBS event and triggers the host
+ * PEBS PMI handler to determine whether the PEBS overflow PMI
+ * comes from the host counters or the guest.
+ *
+ * For most PEBS hardware events, the difference in the software
+ * precision levels of guest and host PEBS events will not affect
+ * the accuracy of the PEBS profiling result, because the "event IP"
+ * in the PEBS record is calibrated on the guest side.
+ *
+ * On Icelake everything is fine. Other hardware (GLC+, TNT+) that
+ * could possibly care here is unsupported and needs changes.
+ */
+ attr.precise_ip = 1;
+ }
+ if (pebs || intr)
+ ovf = kvm_perf_overflow_intr;

- event = perf_event_create_kernel_counter(&attr, -1, current,
- intr ? kvm_perf_overflow_intr :
- kvm_perf_overflow, pmc);
+ event = perf_event_create_kernel_counter(&attr, -1, current, ovf, pmc);
if (IS_ERR(event)) {
pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
PTR_ERR(event), pmc->idx);
@@ -161,6 +191,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
get_sample_period(pmc, pmc->counter)))
return false;

+ if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+ pmc->perf_event->attr.precise_ip)
+ return false;
+
/* reuse perf_event to serve as pmc_reprogram_counter() does*/
perf_event_enable(pmc->perf_event);

--
2.27.0

2021-08-06 18:40:05

by Zhu, Lingshan

Subject: [PATCH V10 11/18] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS

From: Like Xu <[email protected]>

When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points
to the linear address of the first byte of the DS buffer management area,
which is used to manage the PEBS records.

When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the
perf_guest_switch_msr() list and switched during VMX transitions just like
the CORE_PERF_GLOBAL_CTRL MSR. A WRMSR to the IA32_DS_AREA MSR raises #GP(0)
if the source register contains a non-canonical address.
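
For example, a guest WRMSR to IA32_DS_AREA with a non-canonical address is
now rejected with #GP(0), while a canonical value is stored in pmu->ds_area
and loaded into the hardware MSR at VM-entry via the perf_guest_switch_msr
list.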

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 10 +++++++++-
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 11 +++++++++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b25c9d6be314..6501f8adb2b6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -21,6 +21,7 @@
#include <asm/intel_pt.h>
#include <asm/apic.h>
#include <asm/cpu_device_id.h>
+#include <asm/kvm_host.h>

#include "../perf_event.h"

@@ -3917,6 +3918,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+ struct kvm_pmu *kvm_pmu = (struct kvm_pmu *)data;
u64 intel_ctrl = hybrid(cpuc->pmu, intel_ctrl);
u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
int global_ctrl, pebs_enable;
@@ -3949,9 +3951,15 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
return arr;
}

- if (!x86_pmu.pebs_vmx)
+ if (!kvm_pmu || !x86_pmu.pebs_vmx)
return arr;

+ arr[(*nr)++] = (struct perf_guest_switch_msr){
+ .msr = MSR_IA32_DS_AREA,
+ .host = (unsigned long)cpuc->ds,
+ .guest = kvm_pmu->ds_area,
+ };
+
pebs_enable = (*nr)++;
arr[pebs_enable] = (struct perf_guest_switch_msr){
.msr = MSR_IA32_PEBS_ENABLE,
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 425e872ddf4f..35f106f9f124 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -505,6 +505,7 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);

+ u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9938b485c31c..5584b8dfadb3 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -223,6 +223,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_PEBS_ENABLE:
ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
break;
+ case MSR_IA32_DS_AREA:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -373,6 +376,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_PEBS_ENABLE:
msr_info->data = pmu->pebs_enable;
return 0;
+ case MSR_IA32_DS_AREA:
+ msr_info->data = pmu->ds_area;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -441,6 +447,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_DS_AREA:
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ pmu->ds_area = data;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
--
2.27.0

2021-08-06 18:43:59

by Zhu, Lingshan

Subject: [PATCH V10 13/18] KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled

From: Like Xu <[email protected]>

Bit 12 of IA32_MISC_ENABLE represents "Processor Event Based Sampling
Unavailable (RO)":
1 = PEBS is not supported.
0 = PEBS is supported.

A guest write to this read-only PEBS_UNAVAIL bit raises #GP(0) when guest
PEBS is enabled. Some PEBS drivers in the guest may care about this bit.
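
For example, once the vPMU advertises PEBS (PERF_CAP_PEBS_FORMAT is set in
the guest's perf capabilities), a guest WRMSR to IA32_MISC_ENABLE that tries
to set bit 12 gets #GP(0), while host-initiated writes are still allowed.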

Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 2 ++
arch/x86/kvm/x86.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 58f32a55cc2e..296246bf253d 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -588,6 +588,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1);

if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ vcpu->arch.ia32_misc_enable_msr &= ~MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
pmu->pebs_enable_mask = ~pmu->global_ctrl;
pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
@@ -601,6 +602,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
~((1ull << pmu->nr_arch_gp_counters) - 1);
}
} else {
+ vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
}
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f6b6984e26ef..78014b9d5340 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3322,6 +3322,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_MISC_ENABLE:
data &= ~MSR_IA32_MISC_ENABLE_EMON;
+ if (!msr_info->host_initiated &&
+ (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) &&
+ (data & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
+ return 1;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
--
2.27.0

2021-08-06 18:45:01

by Zhu, Lingshan

Subject: [PATCH V10 15/18] KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations

From: Like Xu <[email protected]>

Guest PEBS will be disabled when someone profiles KVM and its user space
on the host through the same PEBS facility, OR when host perf does not
schedule the guest PEBS counter in a one-to-one mapping manner
(neither of these is a typical scenario).

The PEBS records in the guest DS buffer remain accurate, and the above two
restrictions are checked before each VM-entry only if guest PEBS is deemed
to be enabled.
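
For example, if guest PMC0 happens to be backed by host PMC1, bit 1 (the host
index) is recorded in host_cross_mapped_mask and cleared from the guest
PEBS_ENABLE value loaded at VM-entry, so the cross-mapped counter temporarily
stops producing PEBS records until the mapping is one-to-one again.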

Suggested-by: Wei Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 11 +++++++++--
arch/x86/include/asm/kvm_host.h | 9 +++++++++
arch/x86/kvm/vmx/pmu_intel.c | 20 ++++++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 4 ++++
arch/x86/kvm/vmx/vmx.h | 1 +
5 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a698aae54fcb..8fa0b57989f6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3975,8 +3975,15 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
.guest = pebs_mask & ~cpuc->intel_ctrl_host_mask,
};

- /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
- arr[0].guest |= arr[*nr].guest;
+ if (arr[pebs_enable].host) {
+ /* Disable guest PEBS if host PEBS is enabled. */
+ arr[pebs_enable].guest = 0;
+ } else {
+ /* Disable guest PEBS for cross-mapped PEBS counters. */
+ arr[pebs_enable].guest &= ~kvm_pmu->host_cross_mapped_mask;
+ /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
+ arr[global_ctrl].guest |= arr[pebs_enable].guest;
+ }

return arr;
}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0fc1fef1af70..637685485ddd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,15 @@ struct kvm_pmu {
u64 pebs_data_cfg;
u64 pebs_data_cfg_mask;

+ /*
+ * If a guest counter is cross-mapped to host counter with different
+ * index, its PEBS capability will be temporarily disabled.
+ *
+ * The user should make sure that this mask is updated
+ * after disabling interrupts and before perf_guest_get_msrs();
+ */
+ u64 host_cross_mapped_mask;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 296246bf253d..afdc9796fe4e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -770,6 +770,26 @@ static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
intel_pmu_release_guest_lbr_event(vcpu);
}

+void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
+{
+ struct kvm_pmc *pmc = NULL;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl,
+ X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc_speculative_in_use(pmc) ||
+ !pmc_is_enabled(pmc))
+ continue;
+
+ if (pmc->perf_event && (pmc->idx != pmc->perf_event->hw.idx)) {
+ pmu->host_cross_mapped_mask |=
+ BIT_ULL(pmc->perf_event->hw.idx);
+ }
+ }
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 063e869b4e19..d8552dbece6f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6514,6 +6514,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
struct perf_guest_switch_msr *msrs;
struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);

+ pmu->host_cross_mapped_mask = 0;
+ if (pmu->pebs_enable & pmu->global_ctrl)
+ intel_pmu_cross_mapped_check(pmu);
+
/* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. */
msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu);
if (!msrs)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index db88ed4f2121..86fa1a08feca 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -94,6 +94,7 @@ union vmx_exit_reason {
#define vcpu_to_lbr_desc(vcpu) (&to_vmx(vcpu)->lbr_desc)
#define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records)

+void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);

--
2.27.0

2021-08-06 18:45:36

by Zhu, Lingshan

Subject: [PATCH V10 16/18] KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability

From: Like Xu <[email protected]>

The information obtained from the perf_get_x86_pmu_capability() interface
doesn't change, so an exported "struct x86_pmu_capability" (kvm_pmu_cap) is
introduced for all guests in KVM, and it's initialized before hardware_setup().
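
A minimal sketch of the resulting flow (matching the hunks below):
kvm_arch_hardware_setup() calls kvm_init_pmu_capability(), which caches
perf_get_x86_pmu_capability() into the global kvm_pmu_cap (clamping the
version to 2 and the fixed counter count to MAX_FIXED_COUNTERS); later
users such as __do_cpuid_func() and intel_pmu_refresh() then read
kvm_pmu_cap instead of querying perf again.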

Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/cpuid.c | 26 ++++++++------------------
arch/x86/kvm/pmu.c | 3 +++
arch/x86/kvm/pmu.h | 20 ++++++++++++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 17 ++++++++---------
arch/x86/kvm/x86.c | 9 ++++-----
5 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 739be5da3bca..647bff490c0d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -745,33 +745,23 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
case 9:
break;
case 0xa: { /* Architectural Performance Monitoring */
- struct x86_pmu_capability cap;
union cpuid10_eax eax;
union cpuid10_edx edx;

- perf_get_x86_pmu_capability(&cap);
+ eax.split.version_id = kvm_pmu_cap.version;
+ eax.split.num_counters = kvm_pmu_cap.num_counters_gp;
+ eax.split.bit_width = kvm_pmu_cap.bit_width_gp;
+ eax.split.mask_length = kvm_pmu_cap.events_mask_len;
+ edx.split.num_counters_fixed = kvm_pmu_cap.num_counters_fixed;
+ edx.split.bit_width_fixed = kvm_pmu_cap.bit_width_fixed;

- /*
- * Only support guest architectural pmu on a host
- * with architectural pmu.
- */
- if (!cap.version)
- memset(&cap, 0, sizeof(cap));
-
- eax.split.version_id = min(cap.version, 2);
- eax.split.num_counters = cap.num_counters_gp;
- eax.split.bit_width = cap.bit_width_gp;
- eax.split.mask_length = cap.events_mask_len;
-
- edx.split.num_counters_fixed = min(cap.num_counters_fixed, MAX_FIXED_COUNTERS);
- edx.split.bit_width_fixed = cap.bit_width_fixed;
- if (cap.version)
+ if (kvm_pmu_cap.version)
edx.split.anythread_deprecated = 1;
edx.split.reserved1 = 0;
edx.split.reserved2 = 0;

entry->eax = eax.full;
- entry->ebx = cap.events_mask;
+ entry->ebx = kvm_pmu_cap.events_mask;
entry->ecx = 0;
entry->edx = edx.full;
break;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d957c1e83ec9..ec10a635b057 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -19,6 +19,9 @@
#include "lapic.h"
#include "pmu.h"

+struct x86_pmu_capability __read_mostly kvm_pmu_cap;
+EXPORT_SYMBOL_GPL(kvm_pmu_cap);
+
/* This is enough to filter the vast majority of currently defined events. */
#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 5795bb113e76..1903c0fe01ca 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -160,6 +160,24 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
}

+extern struct x86_pmu_capability kvm_pmu_cap;
+
+static inline void kvm_init_pmu_capability(void)
+{
+ perf_get_x86_pmu_capability(&kvm_pmu_cap);
+
+ /*
+ * Only support guest architectural pmu on
+ * a host with architectural pmu.
+ */
+ if (!kvm_pmu_cap.version)
+ memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
+
+ kvm_pmu_cap.version = min(kvm_pmu_cap.version, 2);
+ kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed,
+ MAX_FIXED_COUNTERS);
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
@@ -177,9 +195,11 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+void kvm_init_pmu_capability(void);

bool is_vmware_backdoor_pmc(u32 pmc_idx);

extern struct kvm_pmu_ops intel_pmu_ops;
extern struct kvm_pmu_ops amd_pmu_ops;
+
#endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index afdc9796fe4e..05bc218c08df 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -504,8 +504,6 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
-
- struct x86_pmu_capability x86_pmu;
struct kvm_cpuid_entry2 *entry;
union cpuid10_eax eax;
union cpuid10_edx edx;
@@ -532,13 +530,14 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
return;

vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
- perf_get_x86_pmu_capability(&x86_pmu);

pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
- x86_pmu.num_counters_gp);
- eax.split.bit_width = min_t(int, eax.split.bit_width, x86_pmu.bit_width_gp);
+ kvm_pmu_cap.num_counters_gp);
+ eax.split.bit_width = min_t(int, eax.split.bit_width,
+ kvm_pmu_cap.bit_width_gp);
pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << eax.split.bit_width) - 1;
- eax.split.mask_length = min_t(int, eax.split.mask_length, x86_pmu.events_mask_len);
+ eax.split.mask_length = min_t(int, eax.split.mask_length,
+ kvm_pmu_cap.events_mask_len);
pmu->available_event_types = ~entry->ebx &
((1ull << eax.split.mask_length) - 1);

@@ -547,9 +546,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
} else {
pmu->nr_arch_fixed_counters =
min_t(int, edx.split.num_counters_fixed,
- x86_pmu.num_counters_fixed);
- edx.split.bit_width_fixed = min_t(int,
- edx.split.bit_width_fixed, x86_pmu.bit_width_fixed);
+ kvm_pmu_cap.num_counters_fixed);
+ edx.split.bit_width_fixed = min_t(int, edx.split.bit_width_fixed,
+ kvm_pmu_cap.bit_width_fixed);
pmu->counter_bitmask[KVM_PMC_FIXED] =
((u64)1 << edx.split.bit_width_fixed) - 1;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 78014b9d5340..6a63e7526a1f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6175,15 +6175,12 @@ long kvm_arch_vm_ioctl(struct file *filp,

static void kvm_init_msr_list(void)
{
- struct x86_pmu_capability x86_pmu;
u32 dummy[2];
unsigned i;

BUILD_BUG_ON_MSG(INTEL_PMC_MAX_FIXED != 4,
"Please update the fixed PMCs in msrs_to_saved_all[]");

- perf_get_x86_pmu_capability(&x86_pmu);
-
num_msrs_to_save = 0;
num_emulated_msrs = 0;
num_msr_based_features = 0;
@@ -6235,12 +6232,12 @@ static void kvm_init_msr_list(void)
break;
case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
- min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
+ min(INTEL_PMC_MAX_GENERIC, kvm_pmu_cap.num_counters_gp))
continue;
break;
case MSR_ARCH_PERFMON_EVENTSEL0 ... MSR_ARCH_PERFMON_EVENTSEL0 + 17:
if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_EVENTSEL0 >=
- min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
+ min(INTEL_PMC_MAX_GENERIC, kvm_pmu_cap.num_counters_gp))
continue;
break;
default:
@@ -11008,6 +11005,8 @@ int kvm_arch_hardware_setup(void *opaque)
if (boot_cpu_has(X86_FEATURE_XSAVES))
rdmsrl(MSR_IA32_XSS, host_xss);

+ kvm_init_pmu_capability();
+
r = ops->hardware_setup();
if (r != 0)
return r;
--
2.27.0

2021-08-06 18:48:08

by Zhu, Lingshan

Subject: [PATCH V10 18/18] KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64

From: Like Xu <[email protected]>

The CPUID features PDCM, DS and DTES64 are required for the PEBS feature.
KVM exposes the PDCM, DS and DTES64 CPUID feature bits to the guest when
PEBS is supported by KVM on Ice Lake server platforms.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/vmx/capabilities.h | 26 ++++++++++++++++++--------
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
2 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 4705ad55abb5..41b0933abdb1 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -5,6 +5,7 @@
#include <asm/vmx.h>

#include "lapic.h"
+#include "pmu.h"

extern bool __read_mostly enable_vpid;
extern bool __read_mostly flexpriority_enabled;
@@ -376,20 +377,29 @@ static inline bool vmx_pt_mode_is_host_guest(void)
return pt_mode == PT_MODE_HOST_GUEST;
}

-static inline u64 vmx_get_perf_capabilities(void)
+static inline bool vmx_pebs_supported(void)
{
- u64 perf_cap = 0;
-
- if (boot_cpu_has(X86_FEATURE_PDCM))
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
-
- perf_cap &= PMU_CAP_LBR_FMT;
+ return boot_cpu_has(X86_FEATURE_PEBS) && kvm_pmu_cap.pebs_vmx;
+}

+static inline u64 vmx_get_perf_capabilities(void)
+{
/*
* Since counters are virtualized, KVM would support full
* width counting unconditionally, even if the host lacks it.
*/
- return PMU_CAP_FW_WRITES | perf_cap;
+ u64 perf_cap = PMU_CAP_FW_WRITES;
+ u64 host_perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
+
+ perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
+
+ if (vmx_pebs_supported())
+ perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
+
+ return perf_cap;
}

static inline u64 vmx_supported_debugctl(void)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d0af51c1389d..32dd90707b0d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2224,6 +2224,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (!cpuid_model_is_consistent(vcpu))
return 1;
}
+ if (data & PERF_CAP_PEBS_FORMAT) {
+ if ((data & PERF_CAP_PEBS_MASK) !=
+ (vmx_get_perf_capabilities() & PERF_CAP_PEBS_MASK))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DS))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DTES64))
+ return 1;
+ if (!cpuid_model_is_consistent(vcpu))
+ return 1;
+ }
ret = kvm_set_msr_common(vcpu, msr_info);
break;

@@ -7225,6 +7236,10 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_clear(X86_FEATURE_INVPCID);
if (vmx_pt_mode_is_host_guest())
kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
+ if (vmx_pebs_supported()) {
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
+ }

if (!enable_sgx) {
kvm_cpu_cap_clear(X86_FEATURE_SGX);
--
2.27.0

2021-08-06 19:59:17

by Zhu, Lingshan

Subject: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

From: Like Xu <[email protected]>

For "struct perf_guest_info_callbacks", the two fields "is_in_guest"
and "is_user_mode" are replaced with a new multiplexed member named
"state", and the "get_guest_ip" field will be renamed to "get_ip".

For arm64, xen and kvm/x86, the application of DEFINE_STATIC_CALL_RET0
could make all that perf_guest_cbs stuff suck less. For arm, csky, nds32,
and riscv, just applied some renamed refactoring.
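
With the new convention, a single callback reports both pieces of information;
its shape is roughly the following (a sketch, the real kvm/arm64 and x86
versions are in the hunks below):

  static unsigned int kvm_guest_state(void)
  {
          struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
          unsigned int state = 0;

          if (!vcpu)
                  return 0;

          state |= PERF_GUEST_ACTIVE;
          if (!vcpu_mode_priv(vcpu))      /* the guest was in user mode */
                  state |= PERF_GUEST_USER;

          return state;
  }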

Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Guo Ren <[email protected]>
Cc: Nick Hu <[email protected]>
Cc: Paul Walmsley <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Original-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/arm/kernel/perf_callchain.c | 16 +++++++-----
arch/arm64/kernel/perf_callchain.c | 29 +++++++++++++++++-----
arch/arm64/kvm/perf.c | 22 ++++++++---------
arch/csky/kernel/perf_callchain.c | 4 +--
arch/nds32/kernel/perf_event_cpu.c | 16 +++++++-----
arch/riscv/kernel/perf_callchain.c | 4 +--
arch/x86/events/core.c | 39 ++++++++++++++++++++++++------
arch/x86/events/intel/core.c | 7 +++---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/pmu.c | 2 +-
arch/x86/kvm/x86.c | 37 +++++++++++++++-------------
arch/x86/xen/pmu.c | 33 ++++++++++---------------
include/linux/perf_event.h | 12 ++++++---
kernel/events/core.c | 9 +++++++
14 files changed, 144 insertions(+), 88 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 3b69a76d341e..1ce30f86d6c7 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -64,7 +64,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
{
struct frame_tail __user *tail;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (perf_guest_cbs && perf_guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -100,7 +100,7 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
{
struct stackframe fr;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (perf_guest_cbs && perf_guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -111,8 +111,8 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re

unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
- return perf_guest_cbs->get_guest_ip();
+ if (perf_guest_cbs && perf_guest_cbs->state())
+ return perf_guest_cbs->get_ip();

return instruction_pointer(regs);
}
@@ -120,9 +120,13 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
unsigned long perf_misc_flags(struct pt_regs *regs)
{
int misc = 0;
+ unsigned int state = 0;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- if (perf_guest_cbs->is_user_mode())
+ if (perf_guest_cbs)
+ state = perf_guest_cbs->state();
+
+ if (perf_guest_cbs && state) {
+ if (state & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..1b344e23fd2f 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -5,6 +5,7 @@
* Copyright (C) 2015 ARM Limited
*/
#include <linux/perf_event.h>
+#include <linux/static_call.h>
#include <linux/uaccess.h>

#include <asm/pointer_auth.h>
@@ -99,10 +100,25 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
}
#endif /* CONFIG_COMPAT */

+DEFINE_STATIC_CALL_RET0(arm64_guest_state, *(perf_guest_cbs->state));
+DEFINE_STATIC_CALL_RET0(arm64_guest_get_ip, *(perf_guest_cbs->get_ip));
+
+void arch_perf_update_guest_cbs(void)
+{
+ static_call_update(arm64_guest_state, (void *)&__static_call_return0);
+ static_call_update(arm64_guest_get_ip, (void *)&__static_call_return0);
+
+ if (perf_guest_cbs && perf_guest_cbs->state)
+ static_call_update(arm64_guest_state, perf_guest_cbs->state);
+
+ if (perf_guest_cbs && perf_guest_cbs->get_ip)
+ static_call_update(arm64_guest_get_ip, perf_guest_cbs->get_ip);
+}
+
void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
struct pt_regs *regs)
{
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (static_call(arm64_guest_state)()) {
/* We don't support guest os callchain now */
return;
}
@@ -149,7 +165,7 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
{
struct stackframe frame;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (static_call(arm64_guest_state)()) {
/* We don't support guest os callchain now */
return;
}
@@ -160,8 +176,8 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,

unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
- return perf_guest_cbs->get_guest_ip();
+ if (static_call(arm64_guest_state)())
+ return static_call(arm64_guest_get_ip)();

return instruction_pointer(regs);
}
@@ -169,9 +185,10 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
unsigned long perf_misc_flags(struct pt_regs *regs)
{
int misc = 0;
+ unsigned int guest = static_call(arm64_guest_state)();

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- if (perf_guest_cbs->is_user_mode())
+ if (guest) {
+ if (guest & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 151c31fb9860..8a3387e58f42 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,21 +13,20 @@

DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);

-static int kvm_is_in_guest(void)
-{
- return kvm_get_running_vcpu() != NULL;
-}
-
-static int kvm_is_user_mode(void)
+static unsigned int kvm_guest_state(void)
{
struct kvm_vcpu *vcpu;
+ unsigned int state = 0;
+
+ if (kvm_get_running_vcpu())
+ state |= PERF_GUEST_ACTIVE;

vcpu = kvm_get_running_vcpu();

- if (vcpu)
- return !vcpu_mode_priv(vcpu);
+ if (vcpu && !vcpu_mode_priv(vcpu))
+ state |= PERF_GUEST_USER;

- return 0;
+ return state;
}

static unsigned long kvm_get_guest_ip(void)
@@ -43,9 +42,8 @@ static unsigned long kvm_get_guest_ip(void)
}

static struct perf_guest_info_callbacks kvm_guest_cbs = {
- .is_in_guest = kvm_is_in_guest,
- .is_user_mode = kvm_is_user_mode,
- .get_guest_ip = kvm_get_guest_ip,
+ .state = kvm_guest_state,
+ .get_ip = kvm_get_guest_ip,
};

int kvm_perf_init(void)
diff --git a/arch/csky/kernel/perf_callchain.c b/arch/csky/kernel/perf_callchain.c
index ab55e98ee8f6..3e42239dd1b2 100644
--- a/arch/csky/kernel/perf_callchain.c
+++ b/arch/csky/kernel/perf_callchain.c
@@ -89,7 +89,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
unsigned long fp = 0;

/* C-SKY does not support virtualization. */
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
+ if (perf_guest_cbs && perf_guest_cbs->state())
return;

fp = regs->regs[4];
@@ -113,7 +113,7 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
struct stackframe fr;

/* C-SKY does not support virtualization. */
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (perf_guest_cbs && perf_guest_cbs->state()) {
pr_warn("C-SKY does not support perf in guest mode!");
return;
}
diff --git a/arch/nds32/kernel/perf_event_cpu.c b/arch/nds32/kernel/perf_event_cpu.c
index 0ce6f9f307e6..1dc32ba842ce 100644
--- a/arch/nds32/kernel/perf_event_cpu.c
+++ b/arch/nds32/kernel/perf_event_cpu.c
@@ -1371,7 +1371,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry,

leaf_fp = 0;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (perf_guest_cbs && perf_guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -1481,7 +1481,7 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
{
struct stackframe fr;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (perf_guest_cbs && perf_guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -1494,8 +1494,8 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
/* However, NDS32 does not support virtualization */
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
- return perf_guest_cbs->get_guest_ip();
+ if (perf_guest_cbs && perf_guest_cbs->state())
+ return perf_guest_cbs->get_ip();

return instruction_pointer(regs);
}
@@ -1503,10 +1503,14 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
unsigned long perf_misc_flags(struct pt_regs *regs)
{
int misc = 0;
+ unsigned int state = 0;
+
+ if (perf_guest_cbs)
+ state = perf_guest_cbs->state();

/* However, NDS32 does not support virtualization */
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- if (perf_guest_cbs->is_user_mode())
+ if (perf_guest_cbs && state) {
+ if (state & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/riscv/kernel/perf_callchain.c b/arch/riscv/kernel/perf_callchain.c
index 0bb1854dce83..ea63f70cae5d 100644
--- a/arch/riscv/kernel/perf_callchain.c
+++ b/arch/riscv/kernel/perf_callchain.c
@@ -59,7 +59,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
unsigned long fp = 0;

/* RISC-V does not support perf in guest mode. */
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
+ if (perf_guest_cbs && perf_guest_cbs->state())
return;

fp = regs->s0;
@@ -79,7 +79,7 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
struct pt_regs *regs)
{
/* RISC-V does not support perf in guest mode. */
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (perf_guest_cbs && perf_guest_cbs->state()) {
pr_warn("RISC-V does not support perf in guest mode!");
return;
}
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..9a908631f6cc 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -90,6 +90,28 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_aliases, *x86_pmu.pebs_aliases);
*/
DEFINE_STATIC_CALL_RET0(x86_pmu_guest_get_msrs, *x86_pmu.guest_get_msrs);

+DEFINE_STATIC_CALL_RET0(x86_guest_state, *(perf_guest_cbs->state));
+DEFINE_STATIC_CALL_RET0(x86_guest_get_ip, *(perf_guest_cbs->get_ip));
+DEFINE_STATIC_CALL_RET0(x86_guest_handle_intel_pt_intr, *(perf_guest_cbs->handle_intel_pt_intr));
+
+void arch_perf_update_guest_cbs(void)
+{
+ static_call_update(x86_guest_state, (void *)&__static_call_return0);
+ static_call_update(x86_guest_get_ip, (void *)&__static_call_return0);
+ static_call_update(x86_guest_handle_intel_pt_intr, (void *)&__static_call_return0);
+
+ if (perf_guest_cbs && perf_guest_cbs->state)
+ static_call_update(x86_guest_state, perf_guest_cbs->state);
+
+ if (perf_guest_cbs && perf_guest_cbs->get_ip)
+ static_call_update(x86_guest_get_ip, perf_guest_cbs->get_ip);
+
+ if (perf_guest_cbs && perf_guest_cbs->handle_intel_pt_intr) {
+ static_call_update(x86_guest_handle_intel_pt_intr,
+ perf_guest_cbs->handle_intel_pt_intr);
+ }
+}
+
u64 __read_mostly hw_cache_event_ids
[PERF_COUNT_HW_CACHE_MAX]
[PERF_COUNT_HW_CACHE_OP_MAX]
@@ -2764,7 +2786,7 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
struct unwind_state state;
unsigned long addr;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (static_call(x86_guest_state)()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2867,7 +2889,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
struct stack_frame frame;
const struct stack_frame __user *fp;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (static_call(x86_guest_state)()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2944,18 +2966,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)

unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
- if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
- return perf_guest_cbs->get_guest_ip();
+ unsigned long ip = static_call(x86_guest_get_ip)();
+
+ if (likely(!ip))
+ ip = regs->ip + code_segment_base(regs);

- return regs->ip + code_segment_base(regs);
+ return ip;
}

unsigned long perf_misc_flags(struct pt_regs *regs)
{
+ unsigned int guest = static_call(x86_guest_state)();
int misc = 0;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
- if (perf_guest_cbs->is_user_mode())
+ if (guest) {
+ if (guest & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index fca7a6e2242f..fb1bd7a0e1a6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2782,6 +2782,8 @@ static void intel_pmu_reset(void)
local_irq_restore(flags);
}

+DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr, *(perf_guest_cbs->handle_intel_pt_intr));
+
static int handle_pmi_common(struct pt_regs *regs, u64 status)
{
struct perf_sample_data data;
@@ -2852,10 +2854,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
*/
if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned long *)&status)) {
handled++;
- if (unlikely(perf_guest_cbs && perf_guest_cbs->is_in_guest() &&
- perf_guest_cbs->handle_intel_pt_intr))
- perf_guest_cbs->handle_intel_pt_intr();
- else
+ if (!static_call(x86_guest_handle_intel_pt_intr)())
intel_pt_interrupt();
}

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 974cbfb1eefe..128e2dd9c944 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1884,7 +1884,7 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);

-int kvm_is_in_guest(void);
+unsigned int kvm_guest_state(void);

void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
u32 size);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 827886c12c16..2dcbd1b30004 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
* woken up. So we should wake it, but this is impossible from
* NMI context. Do it from irq work instead.
*/
- if (!kvm_is_in_guest())
+ if (!kvm_guest_state())
irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
else
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5d5c5ed7dd4..efd11702465c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8268,44 +8268,47 @@ static void kvm_timer_init(void)
DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);

-int kvm_is_in_guest(void)
+unsigned int kvm_guest_state(void)
{
- return __this_cpu_read(current_vcpu) != NULL;
-}
-
-static int kvm_is_user_mode(void)
-{
- int user_mode = 3;
+ struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+ unsigned int state = 0;

- if (__this_cpu_read(current_vcpu))
- user_mode = static_call(kvm_x86_get_cpl)(__this_cpu_read(current_vcpu));
+ if (vcpu) {
+ state |= PERF_GUEST_ACTIVE;
+ if (static_call(kvm_x86_get_cpl)(vcpu))
+ state |= PERF_GUEST_USER;
+ }

- return user_mode != 0;
+ return state;
}

-static unsigned long kvm_get_guest_ip(void)
+static unsigned long kvm_guest_get_ip(void)
{
+ struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
unsigned long ip = 0;

- if (__this_cpu_read(current_vcpu))
- ip = kvm_rip_read(__this_cpu_read(current_vcpu));
+ if (vcpu)
+ ip = kvm_rip_read(vcpu);

return ip;
}

-static void kvm_handle_intel_pt_intr(void)
+static unsigned int kvm_handle_intel_pt_intr(void)
{
struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);

+ if (!vcpu)
+ return 0;
+
kvm_make_request(KVM_REQ_PMI, vcpu);
__set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
(unsigned long *)&vcpu->arch.pmu.global_status);
+ return 1;
}

static struct perf_guest_info_callbacks kvm_guest_cbs = {
- .is_in_guest = kvm_is_in_guest,
- .is_user_mode = kvm_is_user_mode,
- .get_guest_ip = kvm_get_guest_ip,
+ .state = kvm_guest_state,
+ .get_ip = kvm_guest_get_ip,
.handle_intel_pt_intr = kvm_handle_intel_pt_intr,
};

diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index e13b0b49fcdf..85c6e6f6f422 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -413,34 +413,28 @@ int pmu_apic_update(uint32_t val)
}

/* perf callbacks */
-static int xen_is_in_guest(void)
+static unsigned int xen_guest_state(void)
{
const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();
+ unsigned int state = 0;

if (!xenpmu_data) {
pr_warn_once("%s: pmudata not initialized\n", __func__);
- return 0;
+ return state;
}

if (!xen_initial_domain() || (xenpmu_data->domain_id >= DOMID_SELF))
- return 0;
+ return state;

- return 1;
-}
+ state |= PERF_GUEST_ACTIVE;

-static int xen_is_user_mode(void)
-{
- const struct xen_pmu_data *xenpmu_data = get_xenpmu_data();
+ if (xenpmu_data->pmu.pmu_flags & PMU_SAMPLE_PV) {
+ if (xenpmu_data->pmu.pmu_flags & PMU_SAMPLE_USER)
+ state |= PERF_GUEST_USER;
+ } else if (xenpmu_data->pmu.r.regs.cpl & 3)
+ state |= PERF_GUEST_USER;

- if (!xenpmu_data) {
- pr_warn_once("%s: pmudata not initialized\n", __func__);
- return 0;
- }
-
- if (xenpmu_data->pmu.pmu_flags & PMU_SAMPLE_PV)
- return (xenpmu_data->pmu.pmu_flags & PMU_SAMPLE_USER);
- else
- return !!(xenpmu_data->pmu.r.regs.cpl & 3);
+ return state;
}

static unsigned long xen_get_guest_ip(void)
@@ -456,9 +450,8 @@ static unsigned long xen_get_guest_ip(void)
}

static struct perf_guest_info_callbacks xen_guest_cbs = {
- .is_in_guest = xen_is_in_guest,
- .is_user_mode = xen_is_user_mode,
- .get_guest_ip = xen_get_guest_ip,
+ .state = xen_guest_state,
+ .get_ip = xen_get_guest_ip,
};

/* Convert registers from Xen's format to Linux' */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d510ad750ed..e823677a214c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -26,11 +26,13 @@
# include <asm/local64.h>
#endif

+#define PERF_GUEST_ACTIVE 0x01
+#define PERF_GUEST_USER 0x02
+
struct perf_guest_info_callbacks {
- int (*is_in_guest)(void);
- int (*is_user_mode)(void);
- unsigned long (*get_guest_ip)(void);
- void (*handle_intel_pt_intr)(void);
+ unsigned int (*state)(void);
+ unsigned long (*get_ip)(void);
+ unsigned int (*handle_intel_pt_intr)(void);
};

#ifdef CONFIG_HAVE_HW_BREAKPOINT
@@ -1237,6 +1239,8 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
u16 flags);

extern struct perf_guest_info_callbacks *perf_guest_cbs;
+extern void __weak arch_perf_update_guest_cbs(void);
+
extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);
extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *callbacks);

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 464917096e73..e466fc8176e1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
*/
struct perf_guest_info_callbacks *perf_guest_cbs;

+/* explicitly use __weak to fix duplicate symbol error */
+void __weak arch_perf_update_guest_cbs(void)
+{
+}
+
int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
+ if (WARN_ON_ONCE(perf_guest_cbs))
+ return -EBUSY;
+
perf_guest_cbs = cbs;
+ arch_perf_update_guest_cbs();
return 0;
}
EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
--
2.27.0

2021-08-06 19:59:20

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled

From: Like Xu <[email protected]>

On Intel platforms, software can use the IA32_MISC_ENABLE[7] bit to
detect whether the processor supports the performance monitoring facility.

Whether this bit reads as set depends on the PMU being enabled for the
guest, and a software write operation to this bit will be ignored. Ignoring
the toggle in KVM is the way to go, and that behavior matches bare metal.

Cc: Yao Yuan <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Reviewed-by: Venkatesh Srinivas <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 1 +
arch/x86/kvm/x86.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9efc1a6b8693..d9dbebe03cae 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
if (!pmu->version)
return;

+ vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
perf_get_x86_pmu_capability(&x86_pmu);

pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index efd11702465c..f6b6984e26ef 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
break;
case MSR_IA32_MISC_ENABLE:
+ data &= ~MSR_IA32_MISC_ENABLE_EMON;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
--
2.27.0

2021-08-06 20:00:56

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 10/18] KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter

From: Like Xu <[email protected]>

The PEBS-PDIR facility on Ice Lake server is supported on IA32_FIXED0 only.
If the guest configures counter 32 and PEBS is enabled, the PEBS-PDIR
facility is supposed to be used, in which case KVM adjusts attr.precise_ip
to 3 and requests host perf to assign exactly the requested counter, or fail.

The CPU model check is also required since some platforms may place the
PEBS-PDIR facility in another counter index.
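
For readers wondering where "counter 32" comes from: KVM indexes the fixed
counters after the 32 possible general-purpose counter slots, so the check in
the diff below could equivalently be written as the sketch here (illustration
only, not part of the patch):

        /* Sketch only: INTEL_PMC_IDX_FIXED is 32, so pmc->idx == 32 is the
         * first fixed counter, IA32_FIXED_CTR0. */
        if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == INTEL_PMC_IDX_FIXED)
                attr.precise_ip = 3;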

Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 2 +-
arch/x86/kvm/pmu.c | 2 ++
arch/x86/kvm/pmu.h | 7 +++++++
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d196b32617c5..b25c9d6be314 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3951,8 +3951,8 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)

if (!x86_pmu.pebs_vmx)
return arr;
- pebs_enable = (*nr)++;

+ pebs_enable = (*nr)++;
arr[pebs_enable] = (struct perf_guest_switch_msr){
.msr = MSR_IA32_PEBS_ENABLE,
.host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask,
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index d76b0a5d80d7..b907aba35ff3 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -153,6 +153,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
* could possibly care here is unsupported and needs changes.
*/
attr.precise_ip = 1;
+ if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
+ attr.precise_ip = 3;
}
if (pebs || intr)
ovf = kvm_perf_overflow_intr;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 67e753edfa22..1af86ae1d3f2 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -4,6 +4,8 @@

#include <linux/nospec.h>

+#include <asm/cpu_device_id.h>
+
#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
@@ -16,6 +18,11 @@
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002

#define MAX_FIXED_COUNTERS 3
+static const struct x86_cpu_id vmx_icl_pebs_cpu[] = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
+ {}
+};

struct kvm_event_hw_type_mapping {
u8 eventsel;
--
2.27.0

2021-08-06 20:03:43

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 03/18] perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest

From: Like Xu <[email protected]>

With PEBS virtualization, the guest PEBS records get delivered to the
guest DS, and the host PMI handler uses perf_guest_cbs->is_in_guest()
to distinguish whether the PMI comes from guest code, just like for Intel PT.

No matter how many guest PEBS counters have overflowed, triggering only
one fake event is enough. The fake event causes the KVM PMI callback to
be called, thereby injecting the PEBS overflow PMI into the guest.

KVM may inject the PMI with BUFFER_OVF set, even if the guest DS is
empty. That should really be harmless. Thus the guest PEBS handler will
retrieve the correct information from its own PEBS records buffer.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 45 ++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index da835f5a37e2..2770eba232c7 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2783,6 +2783,50 @@ static void intel_pmu_reset(void)
}

DECLARE_STATIC_CALL(x86_guest_handle_intel_pt_intr, *(perf_guest_cbs->handle_intel_pt_intr));
+DECLARE_STATIC_CALL(x86_guest_state, *(perf_guest_cbs->state));
+
+/*
+ * We may be running with guest PEBS events created by KVM, and the
+ * PEBS records are logged into the guest's DS and invisible to host.
+ *
+ * In the case of guest PEBS overflow, we only trigger a fake event
+ * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
+ * The guest will then vm-entry and check the guest DS area to read
+ * the guest PEBS records.
+ *
+ * The contents and other behavior of the guest event do not matter.
+ */
+static void x86_pmu_handle_guest_pebs(struct pt_regs *regs,
+ struct perf_sample_data *data)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ struct perf_event *event = NULL;
+ unsigned int guest = 0;
+ int bit;
+
+ guest = static_call(x86_guest_state)();
+ if (!(guest & PERF_GUEST_ACTIVE))
+ return;
+
+ if (!x86_pmu.pebs_vmx || !x86_pmu.pebs_active ||
+ !guest_pebs_idxs)
+ return;
+
+ for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
+ INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
+ event = cpuc->events[bit];
+ if (!event->attr.precise_ip)
+ continue;
+
+ perf_sample_data_init(data, 0, event->hw.last_period);
+ if (perf_event_overflow(event, data, regs))
+ x86_pmu_stop(event, 0);
+
+ /* Injecting one fake event is enough. */
+ break;
+ }
+}

static int handle_pmi_common(struct pt_regs *regs, u64 status)
{
@@ -2835,6 +2879,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
u64 pebs_enabled = cpuc->pebs_enabled;

handled++;
+ x86_pmu_handle_guest_pebs(regs, &data);
x86_pmu.drain_pebs(regs, &data);
status &= intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;

--
2.27.0

2021-08-06 20:06:58

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 08/18] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS

From: Like Xu <[email protected]>

If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].
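
As an illustration of that layout (not part of the patch; n_gp_counters and
n_fixed_counters are placeholders for the enumerated counter counts):

        /* Sketch only: valid IA32_PEBS_ENABLE bits for the enumerated counters */
        u64 gp_bits    = (1ull << n_gp_counters) - 1;            /* bits [n-1:0]     */
        u64 fixed_bits = ((1ull << n_fixed_counters) - 1) << 32; /* bits [32+m-1:32] */
        u64 pebs_enable_valid = gp_bits | fixed_bits;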

When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and atomically switched during
the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
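
The consumer of these entries is the existing atomic_switch_perf_msrs() in
vmx.c, which programs each entry into the VM-entry/VM-exit MSR-load lists.
A simplified sketch, shown with the pre-series perf_guest_get_msrs()
signature, looks like:

        static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
        {
                int i, nr_msrs;
                struct perf_guest_switch_msr *msrs;

                msrs = perf_guest_get_msrs(&nr_msrs);
                if (!msrs)
                        return;

                /* Skip MSRs whose host and guest values already match. */
                for (i = 0; i < nr_msrs; i++)
                        if (msrs[i].host == msrs[i].guest)
                                clear_atomic_switch_msr(vmx, msrs[i].msr);
                        else
                                add_atomic_switch_msr(vmx, msrs[i].msr, msrs[i].guest,
                                                      msrs[i].host, false);
        }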

Based on whether the platform supports x86_pmu.pebs_vmx, the way more MSRs
are added to arr[] in intel_guest_get_msrs() has also been refactored for
extensibility.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 75 ++++++++++++++++++++++++--------
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/include/asm/msr-index.h | 6 +++
arch/x86/kvm/vmx/pmu_intel.c | 31 +++++++++++++
4 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 552a623dc886..d196b32617c5 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3896,33 +3896,72 @@ static int intel_pmu_hw_config(struct perf_event *event)
return 0;
}

+/*
+ * Currently, the only caller of this function is the atomic_switch_perf_msrs().
+ * The host perf context helps to prepare the values of the real hardware for
+ * a set of msrs that need to be switched atomically in a vmx transaction.
+ *
+ * For example, the pseudocode needed to add a new msr should look like:
+ *
+ * arr[(*nr)++] = (struct perf_guest_switch_msr){
+ * .msr = the hardware msr address,
+ * .host = the value the hardware has when it doesn't run a guest,
+ * .guest = the value the hardware has when it runs a guest,
+ * };
+ *
+ * These values have nothing to do with the emulated values the guest sees
+ * when it uses {RD,WR}MSR, which should be handled by the KVM context,
+ * specifically in the intel_pmu_{get,set}_msr().
+ */
static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
u64 intel_ctrl = hybrid(cpuc->pmu, intel_ctrl);
+ u64 pebs_mask = cpuc->pebs_enabled & x86_pmu.pebs_capable;
+ int global_ctrl, pebs_enable;
+
+ *nr = 0;
+ global_ctrl = (*nr)++;
+ arr[global_ctrl] = (struct perf_guest_switch_msr){
+ .msr = MSR_CORE_PERF_GLOBAL_CTRL,
+ .host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
+ .guest = intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
+ };

- arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
- arr[0].host = intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
- arr[0].guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask;
- arr[0].guest &= ~(cpuc->pebs_enabled & x86_pmu.pebs_capable);
- *nr = 1;
+ if (!x86_pmu.pebs)
+ return arr;

- if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
- /*
- * If PMU counter has PEBS enabled it is not enough to
- * disable counter on a guest entry since PEBS memory
- * write can overshoot guest entry and corrupt guest
- * memory. Disabling PEBS solves the problem.
- *
- * Don't do this if the CPU already enforces it.
- */
- arr[1].msr = MSR_IA32_PEBS_ENABLE;
- arr[1].host = cpuc->pebs_enabled;
- arr[1].guest = 0;
- *nr = 2;
+ /*
+ * If PMU counter has PEBS enabled it is not enough to
+ * disable counter on a guest entry since PEBS memory
+ * write can overshoot guest entry and corrupt guest
+ * memory. Disabling PEBS solves the problem.
+ *
+ * Don't do this if the CPU already enforces it.
+ */
+ if (x86_pmu.pebs_no_isolation) {
+ arr[(*nr)++] = (struct perf_guest_switch_msr){
+ .msr = MSR_IA32_PEBS_ENABLE,
+ .host = cpuc->pebs_enabled,
+ .guest = 0,
+ };
+ return arr;
}

+ if (!x86_pmu.pebs_vmx)
+ return arr;
+ pebs_enable = (*nr)++;
+
+ arr[pebs_enable] = (struct perf_guest_switch_msr){
+ .msr = MSR_IA32_PEBS_ENABLE,
+ .host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask,
+ .guest = pebs_mask & ~cpuc->intel_ctrl_host_mask,
+ };
+
+ /* Set hw GLOBAL_CTRL bits for PEBS counter when it runs for guest */
+ arr[0].guest |= arr[pebs_enable].guest;
+
return arr;
}

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 172fabbcc11a..425e872ddf4f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -505,6 +505,9 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);

+ u64 pebs_enable;
+ u64 pebs_enable_mask;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..986b285b97f7 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -189,6 +189,12 @@
#define PERF_CAP_PT_IDX 16

#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6
+#define PERF_CAP_PEBS_TRAP BIT_ULL(6)
+#define PERF_CAP_ARCH_REG BIT_ULL(7)
+#define PERF_CAP_PEBS_FORMAT 0xf00
+#define PERF_CAP_PEBS_BASELINE BIT_ULL(14)
+#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
+ PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)

#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ac7fe714e6c1..9938b485c31c 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -220,6 +220,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PEBS_ENABLE:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -367,6 +370,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_PEBS_ENABLE:
+ msr_info->data = pmu->pebs_enable;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -427,6 +433,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_PEBS_ENABLE:
+ if (pmu->pebs_enable == data)
+ return 0;
+ if (!(data & pmu->pebs_enable_mask)) {
+ pmu->pebs_enable = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -479,6 +493,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
+ pmu->pebs_enable_mask = ~0ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -545,6 +560,22 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)

if (lbr_desc->records.nr)
bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1);
+
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
+ pmu->pebs_enable_mask = ~pmu->global_ctrl;
+ pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ pmu->fixed_ctr_ctrl_mask &=
+ ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ }
+ } else {
+ pmu->pebs_enable_mask =
+ ~((1ull << pmu->nr_arch_gp_counters) - 1);
+ }
+ } else {
+ vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
+ }
}

static void intel_pmu_init(struct kvm_vcpu *vcpu)
--
2.27.0

2021-08-06 20:10:28

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 12/18] KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS

From: Like Xu <[email protected]>

If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.

Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG. By default, the record only contains the Basic group.
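
For reference, the groups selected by MSR_PEBS_DATA_CFG map to the bits below
(per the host perf definitions in arch/x86/include/asm/perf_event.h; shown here
only to illustrate the 0xff00000f mask used later in this patch):

        #define PEBS_DATACFG_MEMINFO    BIT_ULL(0)  /* Memory Info group      */
        #define PEBS_DATACFG_GP         BIT_ULL(1)  /* GPRs group             */
        #define PEBS_DATACFG_XMMS       BIT_ULL(2)  /* XMM registers group    */
        #define PEBS_DATACFG_LBRS       BIT_ULL(3)  /* LBR entries group      */
        #define PEBS_DATACFG_LBR_SHIFT  24          /* number of LBR entries  */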

When guest adaptive PEBS is enabled, the PEBS_DATA_CFG MSR will also
be added to the perf_guest_switch_msr() list and switched during the VMX
transitions, just like the CORE_PERF_GLOBAL_CTRL MSR.

Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 8 ++++++++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
3 files changed, 26 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6501f8adb2b6..a698aae54fcb 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3960,6 +3960,14 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
.guest = kvm_pmu->ds_area,
};

+ if (x86_pmu.intel_cap.pebs_baseline) {
+ arr[(*nr)++] = (struct perf_guest_switch_msr){
+ .msr = MSR_PEBS_DATA_CFG,
+ .host = cpuc->pebs_data_cfg,
+ .guest = kvm_pmu->pebs_data_cfg,
+ };
+ }
+
pebs_enable = (*nr)++;
arr[pebs_enable] = (struct perf_guest_switch_msr){
.msr = MSR_IA32_PEBS_ENABLE,
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35f106f9f124..0fc1fef1af70 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -508,6 +508,8 @@ struct kvm_pmu {
u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;
+ u64 pebs_data_cfg;
+ u64 pebs_data_cfg_mask;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 5584b8dfadb3..58f32a55cc2e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -226,6 +226,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_DS_AREA:
ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
break;
+ case MSR_PEBS_DATA_CFG:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -379,6 +382,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DS_AREA:
msr_info->data = pmu->ds_area;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ msr_info->data = pmu->pebs_data_cfg;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -452,6 +458,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
pmu->ds_area = data;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ if (pmu->pebs_data_cfg == data)
+ return 0;
+ if (!(data & pmu->pebs_data_cfg_mask)) {
+ pmu->pebs_data_cfg = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -505,6 +519,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
pmu->pebs_enable_mask = ~0ull;
+ pmu->pebs_data_cfg_mask = ~0ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -580,6 +595,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->fixed_ctr_ctrl_mask &=
~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
}
+ pmu->pebs_data_cfg_mask = ~0xff00000full;
} else {
pmu->pebs_enable_mask =
~((1ull << pmu->nr_arch_gp_counters) - 1);
--
2.27.0

2021-08-06 20:25:00

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 14/18] KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h

From: Like Xu <[email protected]>

It allows this inline function to be reused by more callers in
more files, such as pmu_intel.c.

Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/pmu.c | 11 -----------
arch/x86/kvm/pmu.h | 11 +++++++++++
2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index b907aba35ff3..d957c1e83ec9 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -481,17 +481,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
kvm_pmu_refresh(vcpu);
}

-static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
-{
- struct kvm_pmu *pmu = pmc_to_pmu(pmc);
-
- if (pmc_is_fixed(pmc))
- return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
- pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
-
- return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
-}
-
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 1af86ae1d3f2..5795bb113e76 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -149,6 +149,17 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
return sample_period;
}

+static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (pmc_is_fixed(pmc))
+ return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
+
+ return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
--
2.27.0

2021-08-06 20:30:56

by Zhu, Lingshan

[permalink] [raw]
Subject: [PATCH V10 17/18] KVM: x86/cpuid: Refactor host/guest CPU model consistency check

From: Like Xu <[email protected]>

For the same purpose, the legacy intel_pmu_lbr_is_compatible() can be
renamed for reuse by more callers, and the comment about the LBR use
case can be dropped along the way.

Signed-off-by: Like Xu <[email protected]>
Signed-off-by: Zhu Lingshan <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/kvm/cpuid.h | 5 +++++
arch/x86/kvm/vmx/pmu_intel.c | 12 +-----------
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 1 -
4 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index c99edfff7f82..439ce776b9a0 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -143,6 +143,11 @@ static inline int guest_cpuid_model(struct kvm_vcpu *vcpu)
return x86_model(best->eax);
}

+static inline bool cpuid_model_is_consistent(struct kvm_vcpu *vcpu)
+{
+ return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
+}
+
static inline int guest_cpuid_stepping(struct kvm_vcpu *vcpu)
{
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 05bc218c08df..a77d5a5f2ba5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -173,16 +173,6 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
}

-bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
-{
- /*
- * As a first step, a guest could only enable LBR feature if its
- * cpu model is the same as the host because the LBR registers
- * would be pass-through to the guest and they're model specific.
- */
- return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
-}
-
bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
{
struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
@@ -578,7 +568,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)

nested_vmx_pmu_entry_exit_ctls_update(vcpu);

- if (intel_pmu_lbr_is_compatible(vcpu))
+ if (cpuid_model_is_consistent(vcpu))
x86_perf_get_lbr(&lbr_desc->records);
else
lbr_desc->records.nr = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d8552dbece6f..d0af51c1389d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2221,7 +2221,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if ((data & PMU_CAP_LBR_FMT) !=
(vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT))
return 1;
- if (!intel_pmu_lbr_is_compatible(vcpu))
+ if (!cpuid_model_is_consistent(vcpu))
return 1;
}
ret = kvm_set_msr_common(vcpu, msr_info);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 86fa1a08feca..bd2922c8046f 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -95,7 +95,6 @@ union vmx_exit_reason {
#define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records)

void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
-bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);

int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
--
2.27.0

2021-08-18 05:29:54

by Zhu, Lingshan

[permalink] [raw]
Subject: Re: [PATCH V10 00/18] KVM: x86/pmu: Add *basic* support to enable guest PEBS via DS

Hi Paolo,

Do you have any comments on this series (already reviewed by Peter)?

Or any chance to queue it in the next merge window?

Thanks,
Zhu Lingshan

On 8/6/2021 9:37 PM, Zhu Lingshan wrote:
> The guest Precise Event Based Sampling (PEBS) feature can provide an
> architectural state of the instruction executed after the guest instruction
> that exactly caused the event. It needs new hardware facility only available
> on Intel Ice Lake Server platforms. This patch set enables the basic PEBS
> feature for KVM guests on ICX.
>
> We can use PEBS feature on the Linux guest like native:
>
> # echo 0 > /proc/sys/kernel/watchdog (on the host)
> # perf record -e instructions:ppp ./br_instr a
> # perf record -c 100000 -e instructions:pp ./br_instr a
>
> To emulate guest PEBS facility for the above perf usages,
> we need to implement 2 code paths:
>
> 1) Fast path
>
> This is when the host assigned physical PMC has an identical index as the
> virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0).
> This path is used in most common use cases.
>
> 2) Slow path
>
> This is when the host assigned physical PMC has a different index from the
> virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0) In this case,
> KVM needs to rewrite the PEBS records to change the applicable counter indexes
> to the virtual PMC indexes, which would otherwise contain the physical counter
> index written by PEBS facility, and switch the counter reset values to the
> offset corresponding to the physical counter indexes in the DS data structure.
>
> The previous version [0] enables both fast path and slow path, which seems
> a bit more complex as the first step. In this patchset, we want to start with
> the fast path to get the basic guest PEBS enabled while keeping the slow path
> disabled. More focused discussion on the slow path [1] is planned to be put to
> another patchset in the next step.
>
> Compared to later versions in subsequent steps, the functionality to support
> host-guest PEBS both enabled and the functionality to emulate guest PEBS when
> the counter is cross-mapped are missing in this patch set
> (neither of these are typical scenarios).
>
> With the basic support, the guest can retrieve the correct PEBS information from
> its own PEBS records on the Ice Lake servers. And we expect it should work when
> migrating to another Ice Lake and no regression about host perf is expected.
>
> Here are the results of pebs test from guest/host for same workload:
>
> perf report on guest:
> # Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250 # Overhead Command Shared Object Symbol
> 57.74% br_instr br_instr [.] lfsr_cond
> 41.40% br_instr br_instr [.] cmp_end
> 0.21% br_instr [kernel.kallsyms] [k] __lock_acquire
>
> perf report on host:
> # Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386 # Overhead Command Shared Object Symbol
> 57.90% br_instr br_instr [.] lfsr_cond
> 41.95% br_instr br_instr [.] cmp_end
> 0.05% br_instr [kernel.vmlinux] [k] lock_acquire
> Conclusion: the profiling results on the guest are similar tothat on the host.
>
> A minimum guest kernel version may be v5.4 or a backport version support
> Icelake server PEBS.
>
> Please check more details in each commit and feel free to comment.
>
> Previous:
> https://lore.kernel.org/kvm/[email protected]/
>
> [0]
> https://lore.kernel.org/kvm/[email protected]/
> [1]
> https://lore.kernel.org/kvm/[email protected]/
>
> V9->V10:
> - improve readability in core.c(Peter Z)
> - reuse guest_pebs_idxs(Liu XiangDong)
> V8 -> V9 Changelog:
> -fix a brackets error in xen_guest_state()
>
> V7 -> V8 Changelog:
> - fix coding style, add {} for single statement of multiple lines(Peter Z)
> - fix coding style in xen_guest_state() (Boris Ostrovsky)
> - s/pmu/kvm_pmu/ in intel_guest_get_msrs() (Peter Z)
> - put lower cost branch in the first place for x86_pmu_handle_guest_pebs() (Peter Z)
>
> V6 -> V7 Changelog:
> - Fix conditions order and call x86_pmu_handle_guest_pebs() unconditionally; (PeterZ)
> - Add a new patch to make all that perf_guest_cbs stuff suck less; (PeterZ)
> - Document IA32_MISC_ENABLE[7] that that behavior matches bare metal; (Sean & Venkatesh)
> - Update commit message for fixed counter mask refactoring;(PeterZ)
> - Clarifying comments about {.host and .guest} for intel_guest_get_msrs(); (PeterZ)
> - Add pebs_capable to store valid PEBS_COUNTER_MASK value; (PeterZ)
> - Add more comments for perf's precise_ip field; (Andi & PeterZ)
> - Refactor perf_overflow_handler_t and make it more legible; (PeterZ)
> - Use "(unsigned long)cpuc->ds" instead of __this_cpu_read(cpu_hw_events.ds); (PeterZ)
> - Keep using "(struct kvm_pmu *)data" to follow K&R; (Andi)
>
> Like Xu (17):
> perf/core: Use static_call to optimize perf_guest_info_callbacks
> perf/x86/intel: Add EPT-Friendly PEBS for Ice Lake Server
> perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
> perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
> KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
> KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
> KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
> KVM: x86/pmu: Reprogram PEBS event to emulate guest PEBS counter
> KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
> KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS
> KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
> KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
> KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
> KVM: x86/pmu: Disable guest PEBS temporarily in two rare situations
> KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
> KVM: x86/cpuid: Refactor host/guest CPU model consistency check
> KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
>
> Peter Zijlstra (Intel) (1):
> x86/perf/core: Add pebs_capable to store valid PEBS_COUNTER_MASK value
>
> arch/arm/kernel/perf_callchain.c | 16 +--
> arch/arm64/kernel/perf_callchain.c | 29 +++--
> arch/arm64/kvm/perf.c | 22 ++--
> arch/csky/kernel/perf_callchain.c | 4 +-
> arch/nds32/kernel/perf_event_cpu.c | 16 +--
> arch/riscv/kernel/perf_callchain.c | 4 +-
> arch/x86/events/core.c | 44 ++++++--
> arch/x86/events/intel/core.c | 167 +++++++++++++++++++++++------
> arch/x86/events/perf_event.h | 6 +-
> arch/x86/include/asm/kvm_host.h | 18 +++-
> arch/x86/include/asm/msr-index.h | 6 ++
> arch/x86/include/asm/perf_event.h | 5 +-
> arch/x86/kvm/cpuid.c | 26 ++---
> arch/x86/kvm/cpuid.h | 5 +
> arch/x86/kvm/pmu.c | 60 ++++++++---
> arch/x86/kvm/pmu.h | 38 +++++++
> arch/x86/kvm/vmx/capabilities.h | 26 +++--
> arch/x86/kvm/vmx/pmu_intel.c | 116 ++++++++++++++++----
> arch/x86/kvm/vmx/vmx.c | 24 ++++-
> arch/x86/kvm/vmx/vmx.h | 2 +-
> arch/x86/kvm/x86.c | 51 +++++----
> arch/x86/xen/pmu.c | 33 +++---
> include/linux/perf_event.h | 12 ++-
> kernel/events/core.c | 9 ++
> 24 files changed, 548 insertions(+), 191 deletions(-)
>

2021-08-26 20:00:19

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
foundation that I would like to fix.

On Fri, Aug 06, 2021, Zhu Lingshan wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 464917096e73..e466fc8176e1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
> */
> struct perf_guest_info_callbacks *perf_guest_cbs;
>
> +/* explicitly use __weak to fix duplicate symbol error */
> +void __weak arch_perf_update_guest_cbs(void)
> +{
> +}
> +
> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
> {
> + if (WARN_ON_ONCE(perf_guest_cbs))
> + return -EBUSY;
> +
> perf_guest_cbs = cbs;
> + arch_perf_update_guest_cbs();

This is horribly broken, it fails to cleanup the static calls when KVM unregisters
the callbacks, which happens when the vendor module, e.g. kvm_intel, is unloaded.
The explosion doesn't happen until 'kvm' is unloaded because the functions are
implemented in 'kvm', i.e. the use-after-free is deferred a bit.

BUG: unable to handle page fault for address: ffffffffa011bb90
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
Oops: 0010 [#1] PREEMPT SMP
CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:0xffffffffa011bb90
Code: Unable to access opcode bytes at RIP 0xffffffffa011bb66.
Call Trace:
<NMI>
? perf_misc_flags+0xe/0x50
? perf_prepare_sample+0x53/0x6b0
? perf_event_output_forward+0x67/0x160
? kvm_clock_read+0x14/0x30
? kvm_sched_clock_read+0x5/0x10
? sched_clock_cpu+0xd/0xd0
? __perf_event_overflow+0x52/0xf0
? handle_pmi_common+0x1f2/0x2d0
? __flush_tlb_all+0x30/0x30
? intel_pmu_handle_irq+0xcf/0x410
? nmi_handle+0x5/0x260
? perf_event_nmi_handler+0x28/0x50
? nmi_handle+0xc7/0x260
? lock_release+0x2b0/0x2b0
? default_do_nmi+0x6b/0x170
? exc_nmi+0x103/0x130
? end_repeat_nmi+0x16/0x1f
? lock_release+0x2b0/0x2b0
? lock_release+0x2b0/0x2b0
? lock_release+0x2b0/0x2b0
</NMI>
Modules linked in: irqbypass [last unloaded: kvm]
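
For illustration only, the missing cleanup on the unregister side would need
to look something like the sketch below, reusing the v10 arch hook (which
already falls back to the return-0 stubs when no callbacks are registered):

        int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
        {
                if (WARN_ON_ONCE(perf_guest_cbs != cbs))
                        return -EINVAL;

                perf_guest_cbs = NULL;
                /* re-points the x86 static calls at __static_call_return0 */
                arch_perf_update_guest_cbs();
                return 0;
        }

Even that is not sufficient on its own: a PMI handler on another CPU can still
be running inside the old callbacks, which is the race described next.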

Even more fun, the existing perf_guest_cbs framework is also broken, though it's
much harder to get it to fail, and probably impossible to get it to fail without
some help. The issue is that perf_guest_cbs is global, which means that it can
be nullified by KVM (during module unload) while the callbacks are being accessed
by a PMI handler on a different CPU.

The bug has escaped notice because all dereferences of perf_guest_cbs follow the
same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the
compiler never reloads perf_guest_cbs in this sequence. The compiler does reload
perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but
guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs,
KVM has to completely exit the guest and tear down all VMs before it can be
unloaded.

But with a little help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can trigger
a NULL pointer dereference, e.g. with this tweak

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..202e5ad97f82 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
{
int misc = 0;

- if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+ if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
if (perf_guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else


while spamming module load/unload leads to:

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:perf_misc_flags+0x1c/0x70
Call Trace:
perf_prepare_sample+0x53/0x6b0
perf_event_output_forward+0x67/0x160
__perf_event_overflow+0x52/0xf0
handle_pmi_common+0x207/0x300
intel_pmu_handle_irq+0xcf/0x410
perf_event_nmi_handler+0x28/0x50
nmi_handle+0xc7/0x260
default_do_nmi+0x6b/0x170
exc_nmi+0x103/0x130
asm_exc_nmi+0x76/0xbf


The good news is that I have a series that should fix both the existing NULL pointer
bug and mostly obviate the need for static calls. The bad news is that my approach,
making perf_guest_cbs per-CPU, likely complicates turning these into static calls,
though I'm guessing it's still a solvable problem.
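
As a sketch of the direction only (not the per-CPU series mentioned above), the
handlers would at least need to snapshot the pointer once and reuse the local
copy, paired with something like synchronize_rcu() on the unregister side so
the provider cannot vanish mid-handler:

        unsigned long perf_misc_flags(struct pt_regs *regs)
        {
                /* one load of the pointer, reused for every dereference below */
                struct perf_guest_info_callbacks *guest_cbs = READ_ONCE(perf_guest_cbs);
                int misc = 0;

                if (guest_cbs && guest_cbs->is_in_guest()) {
                        if (guest_cbs->is_user_mode())
                                misc |= PERF_RECORD_MISC_GUEST_USER;
                        else
                                misc |= PERF_RECORD_MISC_GUEST_KERNEL;
                } else {
                        if (user_mode(regs))
                                misc |= PERF_RECORD_MISC_USER;
                        else
                                misc |= PERF_RECORD_MISC_KERNEL;
                }

                if (regs->flags & PERF_EFLAGS_EXACT)
                        misc |= PERF_RECORD_MISC_EXACT_IP;

                return misc;
        }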

Tangentially related, IMO we should make architectures opt-in to getting
perf_guest_cbs and nuke all of the code in the below files. Except for arm,
which recently lost KVM support, it's all a bunch of useless copy-paste code that
serves no purpose and just complicates cleanups like this.

> arch/arm/kernel/perf_callchain.c | 16 +++++++-----
> arch/csky/kernel/perf_callchain.c | 4 +--
> arch/nds32/kernel/perf_event_cpu.c | 16 +++++++-----
> arch/riscv/kernel/perf_callchain.c | 4 +--

2021-08-27 06:33:20

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

On 27/8/2021 3:59 am, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
> foundation that I would like to fix.

Obviously, this patch is not closely related to the guest PEBS feature enabling,
and we can certainly put this issue in another discussion thread [1].

[1] https://lore.kernel.org/kvm/[email protected]/

>
> On Fri, Aug 06, 2021, Zhu Lingshan wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 464917096e73..e466fc8176e1 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
>> */
>> struct perf_guest_info_callbacks *perf_guest_cbs;
>>
>> +/* explicitly use __weak to fix duplicate symbol error */
>> +void __weak arch_perf_update_guest_cbs(void)
>> +{
>> +}
>> +
>> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>> {
>> + if (WARN_ON_ONCE(perf_guest_cbs))
>> + return -EBUSY;
>> +
>> perf_guest_cbs = cbs;
>> + arch_perf_update_guest_cbs();
>
> This is horribly broken, it fails to cleanup the static calls when KVM unregisters
> the callbacks, which happens when the vendor module, e.g. kvm_intel, is unloaded.
> The explosion doesn't happen until 'kvm' is unloaded because the functions are
> implemented in 'kvm', i.e. the use-after-free is deferred a bit.
>
> BUG: unable to handle page fault for address: ffffffffa011bb90
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
> Oops: 0010 [#1] PREEMPT SMP
> CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:0xffffffffa011bb90
> Code: Unable to access opcode bytes at RIP 0xffffffffa011bb66.
> Call Trace:
> <NMI>
> ? perf_misc_flags+0xe/0x50
> ? perf_prepare_sample+0x53/0x6b0
> ? perf_event_output_forward+0x67/0x160
> ? kvm_clock_read+0x14/0x30
> ? kvm_sched_clock_read+0x5/0x10
> ? sched_clock_cpu+0xd/0xd0
> ? __perf_event_overflow+0x52/0xf0
> ? handle_pmi_common+0x1f2/0x2d0
> ? __flush_tlb_all+0x30/0x30
> ? intel_pmu_handle_irq+0xcf/0x410
> ? nmi_handle+0x5/0x260
> ? perf_event_nmi_handler+0x28/0x50
> ? nmi_handle+0xc7/0x260
> ? lock_release+0x2b0/0x2b0
> ? default_do_nmi+0x6b/0x170
> ? exc_nmi+0x103/0x130
> ? end_repeat_nmi+0x16/0x1f
> ? lock_release+0x2b0/0x2b0
> ? lock_release+0x2b0/0x2b0
> ? lock_release+0x2b0/0x2b0
> </NMI>
> Modules linked in: irqbypass [last unloaded: kvm]
>
> Even more fun, the existing perf_guest_cbs framework is also broken, though it's
> much harder to get it to fail, and probably impossible to get it to fail without
> some help. The issue is that perf_guest_cbs is global, which means that it can
> be nullified by KVM (during module unload) while the callbacks are being accessed
> by a PMI handler on a different CPU.
>
> The bug has escaped notice because all dererfences of perf_guest_cbs follow the
> same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the
> compiler never reload perf_guest_cbs in this sequence. The compiler does reload
> perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but
> guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs,
> KVM has to completely exit the guest and teardown down all VMs before it can be
> unloaded.
>
> But with a help, e.g. RAED_ONCE(perf_guest_cbs), unloading kvm_intel can trigger
> a NULL pointer derference, e.g. this tweak
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
> {
> int misc = 0;
>
> - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> + if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
> if (perf_guest_cbs->is_user_mode())
> misc |= PERF_RECORD_MISC_GUEST_USER;
> else
>
>
> while spamming module load/unload leads to:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP
> CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:perf_misc_flags+0x1c/0x70
> Call Trace:
> perf_prepare_sample+0x53/0x6b0
> perf_event_output_forward+0x67/0x160
> __perf_event_overflow+0x52/0xf0
> handle_pmi_common+0x207/0x300
> intel_pmu_handle_irq+0xcf/0x410
> perf_event_nmi_handler+0x28/0x50
> nmi_handle+0xc7/0x260
> default_do_nmi+0x6b/0x170
> exc_nmi+0x103/0x130
> asm_exc_nmi+0x76/0xbf
>
>
> The good news is that I have a series that should fix both the existing NULL pointer
> bug and mostly obviate the need for static calls. The bad news is that my approach,
> making perf_guest_cbs per-CPU, likely complicates turning these into static calls,
> though I'm guessing it's still a solvable problem.
>
> Tangentially related, IMO we should make architectures opt-in to getting
> perf_guest_cbs and nuke all of the code in the below files. Except for arm,
> which recently lost KVM support, it's all a bunch of useless copy-paste code that
> serves no purpose and just complicates cleanups like this.
>
>> arch/arm/kernel/perf_callchain.c | 16 +++++++-----
>> arch/csky/kernel/perf_callchain.c | 4 +--
>> arch/nds32/kernel/perf_event_cpu.c | 16 +++++++-----
>> arch/riscv/kernel/perf_callchain.c | 4 +--

2021-08-27 17:24:44

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

On Fri, Aug 06, 2021, Zhu Lingshan wrote:
> @@ -2944,18 +2966,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)
>
> unsigned long perf_instruction_pointer(struct pt_regs *regs)
> {
> - if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
> - return perf_guest_cbs->get_guest_ip();
> + unsigned long ip = static_call(x86_guest_get_ip)();
> +
> + if (likely(!ip))

Pivoting on ip==0 isn't correct, it's perfectly legal for a guest to execute
from %rip=0. Unless there's some static_call() magic that supports this with a
default function:

if (unlikely(!static_call(x86_guest_get_ip)(&ip)))
regs->ip + code_segment_base(regs)

return ip;

The easiest thing is keep the existing:

if (unlikely(static_call(x86_guest_state)()))
return static_call(x86_guest_get_ip)();

return regs->ip + code_segment_base(regs);

It's an extra call for PMIs in guest, but I don't think any of the KVM folks care
_that_ much about the performance in this case.

> + ip = regs->ip + code_segment_base(regs);
>
> - return regs->ip + code_segment_base(regs);
> + return ip;
> }

2021-09-15 01:23:35

by Zhu, Lingshan

[permalink] [raw]
Subject: Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks



On 8/27/2021 3:59 AM, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
> foundation that I would like to fix.
Hi Sean, Peter, Paolo,

I will send out a V11 which drops this patch since it's buggy, and Sean
is working on fixing this.
Does this sound good?

Thanks,
Zhu Lingshan
>
> On Fri, Aug 06, 2021, Zhu Lingshan wrote:
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 464917096e73..e466fc8176e1 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
>> */
>> struct perf_guest_info_callbacks *perf_guest_cbs;
>>
>> +/* explicitly use __weak to fix duplicate symbol error */
>> +void __weak arch_perf_update_guest_cbs(void)
>> +{
>> +}
>> +
>> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>> {
>> + if (WARN_ON_ONCE(perf_guest_cbs))
>> + return -EBUSY;
>> +
>> perf_guest_cbs = cbs;
>> + arch_perf_update_guest_cbs();
> This is horribly broken, it fails to cleanup the static calls when KVM unregisters
> the callbacks, which happens when the vendor module, e.g. kvm_intel, is unloaded.
> The explosion doesn't happen until 'kvm' is unloaded because the functions are
> implemented in 'kvm', i.e. the use-after-free is deferred a bit.
>
> BUG: unable to handle page fault for address: ffffffffa011bb90
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
> Oops: 0010 [#1] PREEMPT SMP
> CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:0xffffffffa011bb90
> Code: Unable to access opcode bytes at RIP 0xffffffffa011bb66.
> Call Trace:
> <NMI>
> ? perf_misc_flags+0xe/0x50
> ? perf_prepare_sample+0x53/0x6b0
> ? perf_event_output_forward+0x67/0x160
> ? kvm_clock_read+0x14/0x30
> ? kvm_sched_clock_read+0x5/0x10
> ? sched_clock_cpu+0xd/0xd0
> ? __perf_event_overflow+0x52/0xf0
> ? handle_pmi_common+0x1f2/0x2d0
> ? __flush_tlb_all+0x30/0x30
> ? intel_pmu_handle_irq+0xcf/0x410
> ? nmi_handle+0x5/0x260
> ? perf_event_nmi_handler+0x28/0x50
> ? nmi_handle+0xc7/0x260
> ? lock_release+0x2b0/0x2b0
> ? default_do_nmi+0x6b/0x170
> ? exc_nmi+0x103/0x130
> ? end_repeat_nmi+0x16/0x1f
> ? lock_release+0x2b0/0x2b0
> ? lock_release+0x2b0/0x2b0
> ? lock_release+0x2b0/0x2b0
> </NMI>
> Modules linked in: irqbypass [last unloaded: kvm]
>
> Even more fun, the existing perf_guest_cbs framework is also broken, though it's
> much harder to get it to fail, and probably impossible to get it to fail without
> some help. The issue is that perf_guest_cbs is global, which means that it can
> be nullified by KVM (during module unload) while the callbacks are being accessed
> by a PMI handler on a different CPU.
>
> The bug has escaped notice because all dererfences of perf_guest_cbs follow the
> same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the
> compiler never reload perf_guest_cbs in this sequence. The compiler does reload
> perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard all but
> guarantees the PMI handler will win the race, e.g. to nullify perf_guest_cbs,
> KVM has to completely exit the guest and teardown down all VMs before it can be
> unloaded.
>
> But with a help, e.g. RAED_ONCE(perf_guest_cbs), unloading kvm_intel can trigger
> a NULL pointer derference, e.g. this tweak
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 1eb45139fcc6..202e5ad97f82 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
> {
> int misc = 0;
>
> - if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
> + if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
> if (perf_guest_cbs->is_user_mode())
> misc |= PERF_RECORD_MISC_GUEST_USER;
> else
>
>
> while spamming module load/unload leads to:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP
> CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:perf_misc_flags+0x1c/0x70
> Call Trace:
> perf_prepare_sample+0x53/0x6b0
> perf_event_output_forward+0x67/0x160
> __perf_event_overflow+0x52/0xf0
> handle_pmi_common+0x207/0x300
> intel_pmu_handle_irq+0xcf/0x410
> perf_event_nmi_handler+0x28/0x50
> nmi_handle+0xc7/0x260
> default_do_nmi+0x6b/0x170
> exc_nmi+0x103/0x130
> asm_exc_nmi+0x76/0xbf
>
>
> The good news is that I have a series that should fix both the existing NULL pointer
> bug and mostly obviate the need for static calls. The bad news is that my approach,
> making perf_guest_cbs per-CPU, likely complicates turning these into static calls,
> though I'm guessing it's still a solvable problem.
>
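
As a minimal sketch of the consumer-side hardening the race above calls for
(an illustration, not the per-CPU series mentioned here; a real fix also has
to keep the callback provider alive, e.g. with an RCU grace period on
unregister), the PMI-side code would take one snapshot of the pointer and only
ever dereference that snapshot:

unsigned long perf_misc_flags(struct pt_regs *regs)
{
	/* Load the global pointer exactly once; never re-read it. */
	struct perf_guest_info_callbacks *guest_cbs = READ_ONCE(perf_guest_cbs);
	int misc = 0;

	if (guest_cbs && guest_cbs->is_in_guest()) {
		if (guest_cbs->is_user_mode())
			misc |= PERF_RECORD_MISC_GUEST_USER;
		else
			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
	} else {
		misc |= user_mode(regs) ? PERF_RECORD_MISC_USER
					: PERF_RECORD_MISC_KERNEL;
	}

	return misc;
}
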
> Tangentially related, IMO we should make architectures opt-in to getting
> perf_guest_cbs and nuke all of the code in the below files. Except for arm,
> which recently lost KVM support, it's all a bunch of useless copy-paste code that
> serves no purpose and just complicates cleanups like this.
>
>> arch/arm/kernel/perf_callchain.c | 16 +++++++-----
>> arch/csky/kernel/perf_callchain.c | 4 +--
>> arch/nds32/kernel/perf_event_cpu.c | 16 +++++++-----
>> arch/riscv/kernel/perf_callchain.c | 4 +--

2021-09-22 00:47:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

On Wed, Sep 15, 2021, Zhu, Lingshan wrote:
>
>
> On 8/27/2021 3:59 AM, Sean Christopherson wrote:
> > TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
> > foundation that I would like to fix.
> Hi Sean, Peter, Paolo
>
> I will send out a V11 which drops this patch since it's buggy, and Sean is
> working on fixing this.
> Does this sound good?

Works for me, thanks!

2021-11-07 18:24:42

by Liuxiangdong

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled

Hi, Like and Lingshan.

As said, the IA32_MISC_ENABLE[7] bit depends on whether the PMU is enabled
for the guest, so a software write operation to this bit will be ignored.

But in this patch, any operation that writes msr_ia32_misc_enable in the
guest could make this bit become 0.

Suppose:
When we start the VM with "enable_pmu", vcpu->arch.ia32_misc_enable_msr may
be 0x80 at first.
Next, the guest writes msr_ia32_misc_enable value 0x1.
What we want is 0x81, but unfortunately it will be 0x1 because of
"data &= ~MSR_IA32_MISC_ENABLE_EMON;".
And even if the guest writes msr_ia32_misc_enable value 0x81, the result
will still be 0x1.


What we want is that a write operation does not change this bit. So, how
about this?

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
struct msr_data *msr_info)
}
break;
case MSR_IA32_MISC_ENABLE:
+ data &= ~MSR_IA32_MISC_ENABLE_EMON;
+ data |= (vcpu->arch.ia32_misc_enable_msr &
MSR_IA32_MISC_ENABLE_EMON);
if (!kvm_check_has_quirk(vcpu->kvm,
KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) &
MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
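
To make the arithmetic above concrete, here is a tiny standalone sketch
(plain userspace C, only for illustration) of the current masking versus the
read-modify-write in the proposed diff:

#include <stdint.h>
#include <stdio.h>

#define MSR_IA32_MISC_ENABLE_EMON	(1ULL << 7)	/* bit 7, 0x80 */

int main(void)
{
	uint64_t old_val = 0x80;	/* EMON set by intel_pmu_refresh() */
	uint64_t data    = 0x1;		/* value the guest writes */

	/* Current patch: bit 7 is unconditionally cleared -> 0x1 */
	uint64_t broken = data & ~MSR_IA32_MISC_ENABLE_EMON;

	/* Proposed diff: drop the guest-supplied bit 7 and re-insert
	 * the old value -> 0x81 */
	uint64_t fixed = (data & ~MSR_IA32_MISC_ENABLE_EMON) |
			 (old_val & MSR_IA32_MISC_ENABLE_EMON);

	printf("broken = 0x%llx, fixed = 0x%llx\n",
	       (unsigned long long)broken, (unsigned long long)fixed);
	return 0;
}

Compiled and run, this prints "broken = 0x1, fixed = 0x81", matching the
example above.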


Or is there anything in your design intention I don't understand?

Thanks!

Xiangdong Liu


On 2021/8/6 21:37, Zhu Lingshan wrote:
> From: Like Xu <[email protected]>
>
> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
> detect whether the processor supports performance monitoring facility.
>
> It depends on the PMU is enabled for the guest, and a software write
> operation to this available bit will be ignored. The proposal to ignore
> the toggle in KVM is the way to go and that behavior matches bare metal.
>
> Cc: Yao Yuan <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> Reviewed-by: Venkatesh Srinivas <[email protected]>
> Signed-off-by: Zhu Lingshan <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/kvm/vmx/pmu_intel.c | 1 +
> arch/x86/kvm/x86.c | 1 +
> 2 files changed, 2 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> index 9efc1a6b8693..d9dbebe03cae 100644
> --- a/arch/x86/kvm/vmx/pmu_intel.c
> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> if (!pmu->version)
> return;
>
> + vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
> perf_get_x86_pmu_capability(&x86_pmu);
>
> pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index efd11702465c..f6b6984e26ef 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> }
> break;
> case MSR_IA32_MISC_ENABLE:
> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
> if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
> ((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))

2021-11-08 07:32:25

by Liuxiangdong

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled



On 2021/11/8 11:06, Like Xu wrote:
> On 7/11/2021 6:14 pm, Liuxiangdong wrote:
>> Hi, like and lingshan.
>>
>> As said, IA32_MISC_ENABLE[7] bit depends on the PMU is enabled for
>> the guest, so a software
>> write openration to this bit will be ignored.
>>
>> But, in this patch, all the openration that writes
>> msr_ia32_misc_enable in guest could make this bit become 0.
>>
>> Suppose:
>> When we start vm with "enable_pmu", vcpu->arch.ia32_misc_enable_msr
>> may be 0x80 first.
>> And next, guest writes msr_ia32_misc_enable value 0x1.
>> What we want could be 0x81, but unfortunately, it will be 0x1 because of
>> "data &= ~MSR_IA32_MISC_ENABLE_EMON;"
>> And even if guest writes msr_ia32_misc_enable value 0x81, it will be
>> 0x1 also.
>>
>
> Yes and thank you. The fix has been committed on my private tree for a
> long time.
>
>>
>> What we want is write operation will not change this bit. So, how
>> about this?
>>
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
>> struct msr_data *msr_info)
>> }
>> break;
>> case MSR_IA32_MISC_ENABLE:
>> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
>> + data |= (vcpu->arch.ia32_misc_enable_msr &
>> MSR_IA32_MISC_ENABLE_EMON);
>> if (!kvm_check_has_quirk(vcpu->kvm,
>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>> ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>
>>
>
> How about this for the final state considering PEBS enabling:
>
> case MSR_IA32_MISC_ENABLE: {
> u64 old_val = vcpu->arch.ia32_misc_enable_msr;
> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
> MSR_IA32_MISC_ENABLE_EMON;
>
u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
MSR_IA32_MISC_ENABLE_EMON;

Repetitive "MSR_IA32_MISC_ENABLE_EMON" ?

> /* RO bits */
> if (!msr_info->host_initiated &&
> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
> return 1;
>
> /*
> * For a dummy user space, the order of setting vPMU
> capabilities and
> * initialising MSR_IA32_MISC_ENABLE is not strictly
> guaranteed, so to
> * avoid inconsistent functionality we keep the vPMU bits
> unchanged here.
> */
Yes. It's a little clearer with comments.
> data &= ~pmu_mask;
> data |= old_val & pmu_mask;
> if (!kvm_check_has_quirk(vcpu->kvm,
> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
> return 1;
> vcpu->arch.ia32_misc_enable_msr = data;
> kvm_update_cpuid_runtime(vcpu);
> } else {
> vcpu->arch.ia32_misc_enable_msr = data;
> }
> break;
> }
>
>> Or is there anything in your design intention I don't understand?
>>
>> Thanks!
>>
>> Xiangdong Liu
>>
>>
>> On 2021/8/6 21:37, Zhu Lingshan wrote:
>>> From: Like Xu <[email protected]>
>>>
>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>>> detect whether the processor supports performance monitoring facility.
>>>
>>> It depends on the PMU is enabled for the guest, and a software write
>>> operation to this available bit will be ignored. The proposal to ignore
>>> the toggle in KVM is the way to go and that behavior matches bare
>>> metal.
>>>
>>> Cc: Yao Yuan <[email protected]>
>>> Signed-off-by: Like Xu <[email protected]>
>>> Reviewed-by: Venkatesh Srinivas <[email protected]>
>>> Signed-off-by: Zhu Lingshan <[email protected]>
>>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>> ---
>>> arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>> arch/x86/kvm/x86.c | 1 +
>>> 2 files changed, 2 insertions(+)
>>>
>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c
>>> b/arch/x86/kvm/vmx/pmu_intel.c
>>> index 9efc1a6b8693..d9dbebe03cae 100644
>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu
>>> *vcpu)
>>> if (!pmu->version)
>>> return;
>>> + vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>> perf_get_x86_pmu_capability(&x86_pmu);
>>> pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index efd11702465c..f6b6984e26ef 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
>>> struct msr_data *msr_info)
>>> }
>>> break;
>>> case MSR_IA32_MISC_ENABLE:
>>> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>> ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>

2021-11-08 08:34:50

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled

On 7/11/2021 6:14 pm, Liuxiangdong wrote:
> Hi, like and lingshan.
>
> As said,  IA32_MISC_ENABLE[7] bit depends on the PMU is enabled for the guest,
> so a software
> write openration to this bit will be ignored.
>
> But, in this patch, all the openration that writes msr_ia32_misc_enable in guest
> could make this bit become 0.
>
> Suppose:
> When we start vm with "enable_pmu", vcpu->arch.ia32_misc_enable_msr may be 0x80
> first.
> And next, guest writes msr_ia32_misc_enable value 0x1.
> What we want could be 0x81, but unfortunately, it will be 0x1 because of
> "data &= ~MSR_IA32_MISC_ENABLE_EMON;"
> And even if guest writes msr_ia32_misc_enable value 0x81, it will be 0x1 also.
>

Yes and thank you. The fix has been committed on my private tree for a long time.

>
> What we want is write operation will not change this bit. So, how about this?
>
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
> msr_data *msr_info)
>          }
>          break;
>      case MSR_IA32_MISC_ENABLE:
> +        data &= ~MSR_IA32_MISC_ENABLE_EMON;
> +        data |= (vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_EMON);
>          if (!kvm_check_has_quirk(vcpu->kvm,
> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>              ((vcpu->arch.ia32_misc_enable_msr ^ data) &
> MSR_IA32_MISC_ENABLE_MWAIT)) {
>              if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>
>

How about this for the final state considering PEBS enabling:

case MSR_IA32_MISC_ENABLE: {
u64 old_val = vcpu->arch.ia32_misc_enable_msr;
u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
MSR_IA32_MISC_ENABLE_EMON;

/* RO bits */
if (!msr_info->host_initiated &&
((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
return 1;

/*
* For a dummy user space, the order of setting vPMU capabilities and
* initialising MSR_IA32_MISC_ENABLE is not strictly guaranteed, so to
* avoid inconsistent functionality we keep the vPMU bits unchanged here.
*/
data &= ~pmu_mask;
data |= old_val & pmu_mask;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((old_val ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
return 1;
vcpu->arch.ia32_misc_enable_msr = data;
kvm_update_cpuid_runtime(vcpu);
} else {
vcpu->arch.ia32_misc_enable_msr = data;
}
break;
}

> Or is there anything in your design intention I don't understand?
>
> Thanks!
>
> Xiangdong Liu
>
>
> On 2021/8/6 21:37, Zhu Lingshan wrote:
>> From: Like Xu <[email protected]>
>>
>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>> detect whether the processor supports performance monitoring facility.
>>
>> It depends on the PMU is enabled for the guest, and a software write
>> operation to this available bit will be ignored. The proposal to ignore
>> the toggle in KVM is the way to go and that behavior matches bare metal.
>>
>> Cc: Yao Yuan <[email protected]>
>> Signed-off-by: Like Xu <[email protected]>
>> Reviewed-by: Venkatesh Srinivas <[email protected]>
>> Signed-off-by: Zhu Lingshan <[email protected]>
>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>> ---
>>   arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>   arch/x86/kvm/x86.c           | 1 +
>>   2 files changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>> index 9efc1a6b8693..d9dbebe03cae 100644
>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>       if (!pmu->version)
>>           return;
>> +    vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>       perf_get_x86_pmu_capability(&x86_pmu);
>>       pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index efd11702465c..f6b6984e26ef 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
>> msr_data *msr_info)
>>           }
>>           break;
>>       case MSR_IA32_MISC_ENABLE:
>> +        data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>           if (!kvm_check_has_quirk(vcpu->kvm,
>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>               ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>               if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>

2021-11-08 09:04:05

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled

On 8/11/2021 12:07 pm, Liuxiangdong wrote:
>
>
> On 2021/11/8 11:06, Like Xu wrote:
>> On 7/11/2021 6:14 pm, Liuxiangdong wrote:
>>> Hi, like and lingshan.
>>>
>>> As said,  IA32_MISC_ENABLE[7] bit depends on the PMU is enabled for the
>>> guest, so a software
>>> write openration to this bit will be ignored.
>>>
>>> But, in this patch, all the openration that writes msr_ia32_misc_enable in
>>> guest could make this bit become 0.
>>>
>>> Suppose:
>>> When we start vm with "enable_pmu", vcpu->arch.ia32_misc_enable_msr may be
>>> 0x80 first.
>>> And next, guest writes msr_ia32_misc_enable value 0x1.
>>> What we want could be 0x81, but unfortunately, it will be 0x1 because of
>>> "data &= ~MSR_IA32_MISC_ENABLE_EMON;"
>>> And even if guest writes msr_ia32_misc_enable value 0x81, it will be 0x1 also.
>>>
>>
>> Yes and thank you. The fix has been committed on my private tree for a long time.
>>
>>>
>>> What we want is write operation will not change this bit. So, how about this?
>>>
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
>>> msr_data *msr_info)
>>>           }
>>>           break;
>>>       case MSR_IA32_MISC_ENABLE:
>>> +        data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>> +        data |= (vcpu->arch.ia32_misc_enable_msr & MSR_IA32_MISC_ENABLE_EMON);
>>>           if (!kvm_check_has_quirk(vcpu->kvm,
>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>               ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>               if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>
>>>
>>
>> How about this for the final state considering PEBS enabling:
>>
>>     case MSR_IA32_MISC_ENABLE: {
>>         u64 old_val = vcpu->arch.ia32_misc_enable_msr;
>>         u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>             MSR_IA32_MISC_ENABLE_EMON;
>>
>         u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>             MSR_IA32_MISC_ENABLE_EMON;
>
> Repetitive "MSR_IA32_MISC_ENABLE_EMON" ?

Oops,

u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;

I'll send the fix after sync with Lingshan.

>
>>         /* RO bits */
>>         if (!msr_info->host_initiated &&
>>             ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
>>             return 1;
>>
>>         /*
>>          * For a dummy user space, the order of setting vPMU capabilities and
>>          * initialising MSR_IA32_MISC_ENABLE is not strictly guaranteed, so to
>>          * avoid inconsistent functionality we keep the vPMU bits unchanged here.
>>          */
> Yes. It's a little clearer with comments.

Thanks for your feedback! Enjoy the feature.

>>         data &= ~pmu_mask;
>>         data |= old_val & pmu_mask;
>>         if (!kvm_check_has_quirk(vcpu->kvm,
>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>             ((old_val ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>>             if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>                 return 1;
>>             vcpu->arch.ia32_misc_enable_msr = data;
>>             kvm_update_cpuid_runtime(vcpu);
>>         } else {
>>             vcpu->arch.ia32_misc_enable_msr = data;
>>         }
>>         break;
>>     }
>>
>>> Or is there anything in your design intention I don't understand?
>>>
>>> Thanks!
>>>
>>> Xiangdong Liu
>>>
>>>
>>> On 2021/8/6 21:37, Zhu Lingshan wrote:
>>>> From: Like Xu <[email protected]>
>>>>
>>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>>>> detect whether the processor supports performance monitoring facility.
>>>>
>>>> It depends on the PMU is enabled for the guest, and a software write
>>>> operation to this available bit will be ignored. The proposal to ignore
>>>> the toggle in KVM is the way to go and that behavior matches bare metal.
>>>>
>>>> Cc: Yao Yuan <[email protected]>
>>>> Signed-off-by: Like Xu <[email protected]>
>>>> Reviewed-by: Venkatesh Srinivas <[email protected]>
>>>> Signed-off-by: Zhu Lingshan <[email protected]>
>>>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>>> ---
>>>>   arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>>>   arch/x86/kvm/x86.c           | 1 +
>>>>   2 files changed, 2 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>> index 9efc1a6b8693..d9dbebe03cae 100644
>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>       if (!pmu->version)
>>>>           return;
>>>> +    vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>>>       perf_get_x86_pmu_capability(&x86_pmu);
>>>>       pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index efd11702465c..f6b6984e26ef 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
>>>> msr_data *msr_info)
>>>>           }
>>>>           break;
>>>>       case MSR_IA32_MISC_ENABLE:
>>>> +        data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>>           if (!kvm_check_has_quirk(vcpu->kvm,
>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>               ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>               if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>
>

2021-11-08 11:43:41

by Liuxiangdong

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled



On 2021/11/8 12:11, Like Xu wrote:
> On 8/11/2021 12:07 pm, Liuxiangdong wrote:
>>
>>
>> On 2021/11/8 11:06, Like Xu wrote:
>>> On 7/11/2021 6:14 pm, Liuxiangdong wrote:
>>>> Hi, like and lingshan.
>>>>
>>>> As said, IA32_MISC_ENABLE[7] bit depends on the PMU is enabled for
>>>> the guest, so a software
>>>> write openration to this bit will be ignored.
>>>>
>>>> But, in this patch, all the openration that writes
>>>> msr_ia32_misc_enable in guest could make this bit become 0.
>>>>
>>>> Suppose:
>>>> When we start vm with "enable_pmu", vcpu->arch.ia32_misc_enable_msr
>>>> may be 0x80 first.
>>>> And next, guest writes msr_ia32_misc_enable value 0x1.
>>>> What we want could be 0x81, but unfortunately, it will be 0x1
>>>> because of
>>>> "data &= ~MSR_IA32_MISC_ENABLE_EMON;"
>>>> And even if guest writes msr_ia32_misc_enable value 0x81, it will
>>>> be 0x1 also.
>>>>
>>>
>>> Yes and thank you. The fix has been committed on my private tree for
>>> a long time.
>>>
>>>>
>>>> What we want is write operation will not change this bit. So, how
>>>> about this?
>>>>
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
>>>> struct msr_data *msr_info)
>>>> }
>>>> break;
>>>> case MSR_IA32_MISC_ENABLE:
>>>> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>> + data |= (vcpu->arch.ia32_misc_enable_msr &
>>>> MSR_IA32_MISC_ENABLE_EMON);
>>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>> ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>
>>>>
>>>
>>> How about this for the final state considering PEBS enabling:
>>>
>>> case MSR_IA32_MISC_ENABLE: {
>>> u64 old_val = vcpu->arch.ia32_misc_enable_msr;
>>> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>> MSR_IA32_MISC_ENABLE_EMON;
>>>
>> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>> MSR_IA32_MISC_ENABLE_EMON;
>>
>> Repetitive "MSR_IA32_MISC_ENABLE_EMON" ?
>
> Oops,
>
> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
> MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
>

Yes, bit[12] is also read-only, so we can keep this bit unchanged as well.

And because a write operation will not change this bit thanks to
"pmu_mask", do we still need this if statement?

/* RO bits */
if (!msr_info->host_initiated &&
    ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
	return 1;

"(old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL" means some operation
tries to change this bit, so we cannot allow it.
But even without this judgement, "pmu_mask" will still keep bit[12]
unchanged.

The only difference is that with this statement we cannot change any other
bit (except bit 12 and bit 7) once "old_val[12] != data[12]", while without
the judgement the other bits of such a write can still be changed.

Since both MSR_IA32_MISC_ENABLE_EMON and MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL
are read-only, maybe we should keep their behaviour consistent: either both
are judged, or neither.

Do you think so?
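
To make the comparison concrete, here is a simplified model of the two
options (illustrative C only, not the real kvm_set_msr_common(); bit
positions per the SDM, EMON is bit 7 and PEBS_UNAVAIL is bit 12):

#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_MISC_ENABLE_EMON		(1ULL << 7)
#define MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL	(1ULL << 12)

/* Returns 1 to emulate a #GP, 0 when the write is accepted. */
int misc_enable_write(uint64_t old_val, uint64_t data, bool check_ro,
		      uint64_t *out)
{
	const uint64_t pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
				  MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;

	/* With the judgement: the whole write fails if bit 12 would change. */
	if (check_ro && ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
		return 1;

	/*
	 * With or without the judgement, bits 7 and 12 are re-taken from
	 * old_val, so they never change; without the judgement the other
	 * bits of a write that also flips bit 12 still go through.
	 */
	data &= ~pmu_mask;
	data |= old_val & pmu_mask;
	*out = data;
	return 0;
}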


> I'll send the fix after sync with Lingshan.
>
>>
>>> /* RO bits */
>>> if (!msr_info->host_initiated &&
>>> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
>>> return 1;
>>>
>>> /*
>>> * For a dummy user space, the order of setting vPMU
>>> capabilities and
>>> * initialising MSR_IA32_MISC_ENABLE is not strictly
>>> guaranteed, so to
>>> * avoid inconsistent functionality we keep the vPMU bits
>>> unchanged here.
>>> */
>> Yes. It's a little clearer with comments.
>
> Thanks for your feedback! Enjoy the feature.
>
>>> data &= ~pmu_mask;
>>> data |= old_val & pmu_mask;
>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>> return 1;
>>> vcpu->arch.ia32_misc_enable_msr = data;
>>> kvm_update_cpuid_runtime(vcpu);
>>> } else {
>>> vcpu->arch.ia32_misc_enable_msr = data;
>>> }
>>> break;
>>> }
>>>
>>>> Or is there anything in your design intention I don't understand?
>>>>
>>>> Thanks!
>>>>
>>>> Xiangdong Liu
>>>>
>>>>
>>>> On 2021/8/6 21:37, Zhu Lingshan wrote:
>>>>> From: Like Xu <[email protected]>
>>>>>
>>>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7]
>>>>> bit to
>>>>> detect whether the processor supports performance monitoring
>>>>> facility.
>>>>>
>>>>> It depends on the PMU is enabled for the guest, and a software write
>>>>> operation to this available bit will be ignored. The proposal to
>>>>> ignore
>>>>> the toggle in KVM is the way to go and that behavior matches bare
>>>>> metal.
>>>>>
>>>>> Cc: Yao Yuan <[email protected]>
>>>>> Signed-off-by: Like Xu <[email protected]>
>>>>> Reviewed-by: Venkatesh Srinivas <[email protected]>
>>>>> Signed-off-by: Zhu Lingshan <[email protected]>
>>>>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>>>> ---
>>>>> arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>>>> arch/x86/kvm/x86.c | 1 +
>>>>> 2 files changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c
>>>>> b/arch/x86/kvm/vmx/pmu_intel.c
>>>>> index 9efc1a6b8693..d9dbebe03cae 100644
>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu
>>>>> *vcpu)
>>>>> if (!pmu->version)
>>>>> return;
>>>>> + vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>>>> perf_get_x86_pmu_capability(&x86_pmu);
>>>>> pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index efd11702465c..f6b6984e26ef 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu
>>>>> *vcpu, struct msr_data *msr_info)
>>>>> }
>>>>> break;
>>>>> case MSR_IA32_MISC_ENABLE:
>>>>> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>> ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>
>>

2021-11-08 12:18:12

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled

On 8/11/2021 4:27 pm, Liuxiangdong wrote:
>
>
> On 2021/11/8 12:11, Like Xu wrote:
>> On 8/11/2021 12:07 pm, Liuxiangdong wrote:
>>>
>>>
>>> On 2021/11/8 11:06, Like Xu wrote:
>>>> On 7/11/2021 6:14 pm, Liuxiangdong wrote:
>>>>> Hi, like and lingshan.
>>>>>
>>>>> As said,  IA32_MISC_ENABLE[7] bit depends on the PMU is enabled for the
>>>>> guest, so a software
>>>>> write openration to this bit will be ignored.
>>>>>
>>>>> But, in this patch, all the openration that writes msr_ia32_misc_enable in
>>>>> guest could make this bit become 0.
>>>>>
>>>>> Suppose:
>>>>> When we start vm with "enable_pmu", vcpu->arch.ia32_misc_enable_msr may be
>>>>> 0x80 first.
>>>>> And next, guest writes msr_ia32_misc_enable value 0x1.
>>>>> What we want could be 0x81, but unfortunately, it will be 0x1 because of
>>>>> "data &= ~MSR_IA32_MISC_ENABLE_EMON;"
>>>>> And even if guest writes msr_ia32_misc_enable value 0x81, it will be 0x1 also.
>>>>>
>>>>
>>>> Yes and thank you. The fix has been committed on my private tree for a long
>>>> time.
>>>>
>>>>>
>>>>> What we want is write operation will not change this bit. So, how about this?
>>>>>
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
>>>>> msr_data *msr_info)
>>>>>           }
>>>>>           break;
>>>>>       case MSR_IA32_MISC_ENABLE:
>>>>> +        data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>>> +        data |= (vcpu->arch.ia32_misc_enable_msr &
>>>>> MSR_IA32_MISC_ENABLE_EMON);
>>>>>           if (!kvm_check_has_quirk(vcpu->kvm,
>>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>>               ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>>               if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>>
>>>>>
>>>>
>>>> How about this for the final state considering PEBS enabling:
>>>>
>>>>     case MSR_IA32_MISC_ENABLE: {
>>>>         u64 old_val = vcpu->arch.ia32_misc_enable_msr;
>>>>         u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>>>             MSR_IA32_MISC_ENABLE_EMON;
>>>>
>>>          u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>>              MSR_IA32_MISC_ENABLE_EMON;
>>>
>>> Repetitive "MSR_IA32_MISC_ENABLE_EMON" ?
>>
>> Oops,
>>
>>     u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>             MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
>>
>
> Yes. bit[12] is also read-only, so we can keep this bit unchanged also.
>
> And, because write operation will not change this bit by "pmu_mask", do we still
> need this if statement?
>
>         /* RO bits */
>         if (!msr_info->host_initiated &&
>             ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
>             return 1;
>
> "(old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL" means some operation
> tries to change this bit,
> so we cannot allow it.
> But, if there is no this judgement, "pmu_mask" will still make this bit[12] no
> change.
>
> The only difference is that we can not change other bit (except bit 12 and bit
> 7) once "old_val[12] != data[12]" if there exists this statement
> and we can change other bit if there is no judgement.
>
> For both MSR_IA32_MISC_ENABLE_EMON and MSR_IA32_MISC_ENABLE_EMON are read-only,
> maybe we can keep
> their behavioral consistency. Either both judge, or neither.

One more difference per Intel SDM, I assume:

For Bit 7, Performance Monitoring Available (R):
(R) means that attempts to change this bit are silently ignored;
For Bit 12, Processor Event Based Sampling (PEBS) Unavailable (RO):
(RO) means that attempts to change this bit will raise a #GP;
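
Translated into the emulation being discussed, a hedged sketch of just that
distinction (not the final KVM code) would be:

	/* Bit 12 (RO): a non-host attempt to flip it is rejected -> #GP. */
	if (!msr_info->host_initiated &&
	    ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
		return 1;

	/* Bit 7 (R): the guest-written value is silently dropped and the
	 * existing value is kept; no error is signalled. */
	data &= ~MSR_IA32_MISC_ENABLE_EMON;
	data |= old_val & MSR_IA32_MISC_ENABLE_EMON;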

>
> Do you think so?
>
>
>> I'll send the fix after sync with Lingshan.
>>
>>>
>>>>         /* RO bits */
>>>>         if (!msr_info->host_initiated &&
>>>>             ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
>>>>             return 1;
>>>>
>>>>         /*
>>>>          * For a dummy user space, the order of setting vPMU capabilities and
>>>>          * initialising MSR_IA32_MISC_ENABLE is not strictly guaranteed, so to
>>>>          * avoid inconsistent functionality we keep the vPMU bits unchanged
>>>> here.
>>>>          */
>>> Yes. It's a little clearer with comments.
>>
>> Thanks for your feedback! Enjoy the feature.
>>
>>>>         data &= ~pmu_mask;
>>>>         data |= old_val & pmu_mask;
>>>>         if (!kvm_check_has_quirk(vcpu->kvm,
>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>             ((old_val ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>             if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>                 return 1;
>>>>             vcpu->arch.ia32_misc_enable_msr = data;
>>>>             kvm_update_cpuid_runtime(vcpu);
>>>>         } else {
>>>>             vcpu->arch.ia32_misc_enable_msr = data;
>>>>         }
>>>>         break;
>>>>     }
>>>>
>>>>> Or is there anything in your design intention I don't understand?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Xiangdong Liu
>>>>>
>>>>>
>>>>> On 2021/8/6 21:37, Zhu Lingshan wrote:
>>>>>> From: Like Xu <[email protected]>
>>>>>>
>>>>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
>>>>>> detect whether the processor supports performance monitoring facility.
>>>>>>
>>>>>> It depends on the PMU is enabled for the guest, and a software write
>>>>>> operation to this available bit will be ignored. The proposal to ignore
>>>>>> the toggle in KVM is the way to go and that behavior matches bare metal.
>>>>>>
>>>>>> Cc: Yao Yuan <[email protected]>
>>>>>> Signed-off-by: Like Xu <[email protected]>
>>>>>> Reviewed-by: Venkatesh Srinivas <[email protected]>
>>>>>> Signed-off-by: Zhu Lingshan <[email protected]>
>>>>>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>>>>> ---
>>>>>>   arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>>>>>   arch/x86/kvm/x86.c           | 1 +
>>>>>>   2 files changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> index 9efc1a6b8693..d9dbebe03cae 100644
>>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>>>>>>       if (!pmu->version)
>>>>>>           return;
>>>>>> +    vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>>>>>       perf_get_x86_pmu_capability(&x86_pmu);
>>>>>>       pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index efd11702465c..f6b6984e26ef 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct
>>>>>> msr_data *msr_info)
>>>>>>           }
>>>>>>           break;
>>>>>>       case MSR_IA32_MISC_ENABLE:
>>>>>> +        data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>>>>           if (!kvm_check_has_quirk(vcpu->kvm,
>>>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>>>               ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>>>               if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>>
>>>
>

2021-11-08 14:48:34

by Liuxiangdong

[permalink] [raw]
Subject: Re: [PATCH V10 05/18] KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled



On 2021/11/8 16:44, Like Xu wrote:
> On 8/11/2021 4:27 pm, Liuxiangdong wrote:
>>
>>
>> On 2021/11/8 12:11, Like Xu wrote:
>>> On 8/11/2021 12:07 pm, Liuxiangdong wrote:
>>>>
>>>>
>>>> On 2021/11/8 11:06, Like Xu wrote:
>>>>> On 7/11/2021 6:14 pm, Liuxiangdong wrote:
>>>>>> Hi, like and lingshan.
>>>>>>
>>>>>> As said, IA32_MISC_ENABLE[7] bit depends on the PMU is enabled
>>>>>> for the guest, so a software
>>>>>> write openration to this bit will be ignored.
>>>>>>
>>>>>> But, in this patch, all the openration that writes
>>>>>> msr_ia32_misc_enable in guest could make this bit become 0.
>>>>>>
>>>>>> Suppose:
>>>>>> When we start vm with "enable_pmu",
>>>>>> vcpu->arch.ia32_misc_enable_msr may be 0x80 first.
>>>>>> And next, guest writes msr_ia32_misc_enable value 0x1.
>>>>>> What we want could be 0x81, but unfortunately, it will be 0x1
>>>>>> because of
>>>>>> "data &= ~MSR_IA32_MISC_ENABLE_EMON;"
>>>>>> And even if guest writes msr_ia32_misc_enable value 0x81, it will
>>>>>> be 0x1 also.
>>>>>>
>>>>>
>>>>> Yes and thank you. The fix has been committed on my private tree
>>>>> for a long time.
>>>>>
>>>>>>
>>>>>> What we want is write operation will not change this bit. So, how
>>>>>> about this?
>>>>>>
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu
>>>>>> *vcpu, struct msr_data *msr_info)
>>>>>> }
>>>>>> break;
>>>>>> case MSR_IA32_MISC_ENABLE:
>>>>>> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>>>> + data |= (vcpu->arch.ia32_misc_enable_msr &
>>>>>> MSR_IA32_MISC_ENABLE_EMON);
>>>>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>>> ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>>>
>>>>>>
>>>>>
>>>>> How about this for the final state considering PEBS enabling:
>>>>>
>>>>> case MSR_IA32_MISC_ENABLE: {
>>>>> u64 old_val = vcpu->arch.ia32_misc_enable_msr;
>>>>> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>>>> MSR_IA32_MISC_ENABLE_EMON;
>>>>>
>>>> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>>> MSR_IA32_MISC_ENABLE_EMON;
>>>>
>>>> Repetitive "MSR_IA32_MISC_ENABLE_EMON" ?
>>>
>>> Oops,
>>>
>>> u64 pmu_mask = MSR_IA32_MISC_ENABLE_EMON |
>>> MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
>>>
>>
>> Yes. bit[12] is also read-only, so we can keep this bit unchanged also.
>>
>> And, because write operation will not change this bit by "pmu_mask",
>> do we still need this if statement?
>>
>> /* RO bits */
>> if (!msr_info->host_initiated &&
>> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
>> return 1;
>>
>> "(old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL" means some
>> operation tries to change this bit,
>> so we cannot allow it.
>> But, if there is no this judgement, "pmu_mask" will still make this
>> bit[12] no change.
>>
>> The only difference is that we can not change other bit (except bit
>> 12 and bit 7) once "old_val[12] != data[12]" if there exists this
>> statement
>> and we can change other bit if there is no judgement.
>>
>> For both MSR_IA32_MISC_ENABLE_EMON and MSR_IA32_MISC_ENABLE_EMON are
>> read-only, maybe we can keep
>> their behavioral consistency. Either both judge, or neither.
>
> One more difference per Intel SDM, I assume:
>
> For Bit 7, Performance Monitoring Available (R)
> (R) means that attempts to change this bit will be silent;
> For Bit 12, Processor Event Based Sampling (PEBS) Unavailable (RO),
> (RO) means that attempts to change this bit will be #GP;
>

Yes, I found it in SDM. You're right. Thanks for your explanation!

>>
>> Do you think so?
>>
>>
>>> I'll send the fix after sync with Lingshan.
>>>
>>>>
>>>>> /* RO bits */
>>>>> if (!msr_info->host_initiated &&
>>>>> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
>>>>> return 1;
>>>>>
>>>>> /*
>>>>> * For a dummy user space, the order of setting vPMU
>>>>> capabilities and
>>>>> * initialising MSR_IA32_MISC_ENABLE is not strictly
>>>>> guaranteed, so to
>>>>> * avoid inconsistent functionality we keep the vPMU bits
>>>>> unchanged here.
>>>>> */
>>>> Yes. It's a little clearer with comments.
>>>
>>> Thanks for your feedback! Enjoy the feature.
>>>
>>>>> data &= ~pmu_mask;
>>>>> data |= old_val & pmu_mask;
>>>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>> ((old_val ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>> return 1;
>>>>> vcpu->arch.ia32_misc_enable_msr = data;
>>>>> kvm_update_cpuid_runtime(vcpu);
>>>>> } else {
>>>>> vcpu->arch.ia32_misc_enable_msr = data;
>>>>> }
>>>>> break;
>>>>> }
>>>>>
>>>>>> Or is there anything in your design intention I don't understand?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Xiangdong Liu
>>>>>>
>>>>>>
>>>>>> On 2021/8/6 21:37, Zhu Lingshan wrote:
>>>>>>> From: Like Xu <[email protected]>
>>>>>>>
>>>>>>> On Intel platforms, the software can use the IA32_MISC_ENABLE[7]
>>>>>>> bit to
>>>>>>> detect whether the processor supports performance monitoring
>>>>>>> facility.
>>>>>>>
>>>>>>> It depends on the PMU is enabled for the guest, and a software
>>>>>>> write
>>>>>>> operation to this available bit will be ignored. The proposal to
>>>>>>> ignore
>>>>>>> the toggle in KVM is the way to go and that behavior matches
>>>>>>> bare metal.
>>>>>>>
>>>>>>> Cc: Yao Yuan <[email protected]>
>>>>>>> Signed-off-by: Like Xu <[email protected]>
>>>>>>> Reviewed-by: Venkatesh Srinivas <[email protected]>
>>>>>>> Signed-off-by: Zhu Lingshan <[email protected]>
>>>>>>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>>>>>>> ---
>>>>>>> arch/x86/kvm/vmx/pmu_intel.c | 1 +
>>>>>>> arch/x86/kvm/x86.c | 1 +
>>>>>>> 2 files changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c
>>>>>>> b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>>> index 9efc1a6b8693..d9dbebe03cae 100644
>>>>>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
>>>>>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
>>>>>>> @@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct
>>>>>>> kvm_vcpu *vcpu)
>>>>>>> if (!pmu->version)
>>>>>>> return;
>>>>>>> + vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
>>>>>>> perf_get_x86_pmu_capability(&x86_pmu);
>>>>>>> pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>>> index efd11702465c..f6b6984e26ef 100644
>>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>>> @@ -3321,6 +3321,7 @@ int kvm_set_msr_common(struct kvm_vcpu
>>>>>>> *vcpu, struct msr_data *msr_info)
>>>>>>> }
>>>>>>> break;
>>>>>>> case MSR_IA32_MISC_ENABLE:
>>>>>>> + data &= ~MSR_IA32_MISC_ENABLE_EMON;
>>>>>>> if (!kvm_check_has_quirk(vcpu->kvm,
>>>>>>> KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
>>>>>>> ((vcpu->arch.ia32_misc_enable_msr ^ data) &
>>>>>>> MSR_IA32_MISC_ENABLE_MWAIT)) {
>>>>>>> if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
>>>>>>
>>>>
>>