The guest Precise Event Based Sampling (PEBS) feature can provide
the architectural state of the instruction executed right after the
guest instruction that caused the event. It requires a new hardware
facility that is currently only available on Intel Ice Lake Server
platforms. This patch set enables the basic PEBS via DS feature for
KVM guests on ICX.
We can then use the PEBS feature in a Linux guest just like on the native host:
# perf record -e instructions:ppp ./br_instr a
# perf record -c 100000 -e instructions:pp ./br_instr a
To emulate guest PEBS facility for the above perf usages,
we need to implement 2 code paths:
1) Fast path
This is when the host assigned physical PMC has an identical index as
the virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0).
This path is used in most common use cases.
2) Slow path
This is when the host assigned physical PMC has a different index
from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0).
In this case, KVM needs to rewrite the PEBS records to change the
applicable counter indexes to the virtual PMC indexes (they would
otherwise contain the physical counter indexes written by the PEBS
facility), and to switch the counter reset values to the offsets
corresponding to the physical counter indexes in the DS data structure.
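As a rough illustration of the slow-path rewrite (not the actual KVM implementation; the record layout is simplified and the helper name is made up for this sketch), the "applicable counters" field of a basic PEBS record would be patched from the physical to the virtual counter index like this:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified basic PEBS record header (Ice Lake format), for illustration. */
struct pebs_basic {
	uint64_t format_size;
	uint64_t ip;
	uint64_t applicable_counters;	/* bitmap of counters that triggered */
	uint64_t tsc;
};

/*
 * Slow-path fixup: the hardware wrote the *physical* counter index,
 * but the guest expects to see its *virtual* counter index.
 */
static void rewrite_applicable_counter(struct pebs_basic *rec,
				       unsigned int phys_idx,
				       unsigned int virt_idx)
{
	if (rec->applicable_counters & (1ULL << phys_idx)) {
		rec->applicable_counters &= ~(1ULL << phys_idx);
		rec->applicable_counters |= 1ULL << virt_idx;
	}
}
```

The real fixup would also have to walk every record in the DS buffer and adjust the counter reset offsets, which this sketch omits.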
The previous version [0] enabled both the fast path and the slow path,
which seems a bit too complex as a first step. In this patch set, we
want to start with the fast path to get basic guest PEBS enabled while
keeping the slow path disabled. A more focused discussion on the slow
path [1] is planned for another patch set in the next step.
Compared to the later versions in subsequent steps, this patch set is
missing the functionality to support host and guest PEBS enabled at
the same time, and the functionality to emulate guest PEBS when the
counter is cross-mapped (neither of these is a typical scenario).
With this basic support, the guest can retrieve the correct PEBS
information from its own PEBS records on Ice Lake servers. We expect
it to keep working when migrating to another Ice Lake server, and no
regression in host perf is expected.
Here are the results of the PEBS test from the guest and the host for the same workload:
perf report on guest:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
# Overhead Command Shared Object Symbol
57.74% br_instr br_instr [.] lfsr_cond
41.40% br_instr br_instr [.] cmp_end
0.21% br_instr [kernel.kallsyms] [k] __lock_acquire
perf report on host:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
# Overhead Command Shared Object Symbol
57.90% br_instr br_instr [.] lfsr_cond
41.95% br_instr br_instr [.] cmp_end
0.05% br_instr [kernel.vmlinux] [k] lock_acquire
Conclusion: the profiling results on the guest are similar to those on the host.
Please check more details in each commit and feel free to comment.
Previous:
[0] https://lore.kernel.org/kvm/[email protected]/
[1] https://lore.kernel.org/kvm/[email protected]/
v3->v4 Changelog:
- Update this cover letter and propose a new upstream plan;
[PERF]
- Drop check host DS and move handler to handle_pmi_common();
- Pass "struct kvm_pmu *" to intel_guest_get_msrs();
- Propose new assignment logic for perf_guest_switch_msr();
- Introduce x86_pmu.pebs_vmx for future capability maintenance;
[KVM]
- Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability;
- Raise the PEBS PMI only when OVF bit 62 is not set;
- Make vmx_icl_pebs_cpu specific for PEBS-PDIR emulation;
- Fix a bug for fixed_ctr_ctrl_mask;
- Add two minor refactoring patches for reuse;
Like Xu (16):
perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
KVM: x86/pmu: Disable guest PEBS before vm-entry in two cases
KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
KVM: x86/cpuid: Refactor host/guest CPU model consistency check
KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
arch/x86/events/core.c | 5 +-
arch/x86/events/intel/core.c | 93 +++++++++++++++++++++++---
arch/x86/events/perf_event.h | 5 +-
arch/x86/include/asm/kvm_host.h | 16 +++++
arch/x86/include/asm/msr-index.h | 6 ++
arch/x86/include/asm/perf_event.h | 5 +-
arch/x86/kvm/cpuid.c | 24 ++-----
arch/x86/kvm/cpuid.h | 5 ++
arch/x86/kvm/pmu.c | 49 ++++++++++----
arch/x86/kvm/pmu.h | 37 +++++++++++
arch/x86/kvm/vmx/capabilities.h | 26 ++++++--
arch/x86/kvm/vmx/pmu_intel.c | 105 ++++++++++++++++++++++++------
arch/x86/kvm/vmx/vmx.c | 25 ++++++-
arch/x86/kvm/vmx/vmx.h | 2 +-
arch/x86/kvm/x86.c | 14 ++--
15 files changed, 339 insertions(+), 78 deletions(-)
--
2.29.2
Splitting the logic for determining the guest values is unnecessarily
confusing, and potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest.
If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it
can generate the full set of guest values by grabbing guest ds_area and
pebs_data_cfg. Alternatively, .guest_get_msrs() could take the desired
guest MSR values directly (ds_area and pebs_data_cfg), but kvm_pmu is
vendor agnostic, so we don't see any reason to not just pass the pointer.
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/core.c | 4 ++--
arch/x86/events/intel/core.c | 4 ++--
arch/x86/events/perf_event.h | 2 +-
arch/x86/include/asm/perf_event.h | 4 ++--
arch/x86/kvm/vmx/vmx.c | 3 ++-
5 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 06bef6ba8a9b..7e2264a8c3f7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -673,9 +673,9 @@ void x86_pmu_disable_all(void)
}
}
-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr)
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data)
{
- return static_call(x86_pmu_guest_get_msrs)(nr);
+ return static_call(x86_pmu_guest_get_msrs)(nr, data);
}
EXPORT_SYMBOL_GPL(perf_guest_get_msrs);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index af9ac48fe840..e8fee7cf767f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3837,7 +3837,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
return 0;
}
-static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
+static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
@@ -3869,7 +3869,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
return arr;
}
-static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr)
+static struct perf_guest_switch_msr *core_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 85dc4e1d4514..e52b35333e1f 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -809,7 +809,7 @@ struct x86_pmu {
/*
* Intel host/guest support (KVM)
*/
- struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr);
+ struct perf_guest_switch_msr *(*guest_get_msrs)(int *nr, void *data);
/*
* Check period value for PERF_EVENT_IOC_PERIOD ioctl.
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 6a6e707905be..d5957b68906b 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -491,10 +491,10 @@ static inline void perf_check_microcode(void) { }
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
-extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+extern struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
extern int x86_perf_get_lbr(struct x86_pmu_lbr *lbr);
#else
-struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr);
+struct perf_guest_switch_msr *perf_guest_get_msrs(int *nr, void *data);
static inline int x86_perf_get_lbr(struct x86_pmu_lbr *lbr)
{
return -1;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c8a4a548e96b..8063cb7e8387 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6513,9 +6513,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
int i, nr_msrs;
struct perf_guest_switch_msr *msrs;
+ struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
/* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. */
- msrs = perf_guest_get_msrs(&nr_msrs);
+ msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu);
if (!msrs)
return;
--
2.29.2
On Intel platforms, software may use the IA32_MISC_ENABLE[7] bit
to detect whether the performance monitoring facility is supported
by the processor.
This bit depends on the PMU being enabled for the guest, and
a guest write to this PMU-available bit is ignored.
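As a simplified model of the behavior described above (the helper names are hypothetical, and "ignored" is modeled as preserving the current value of the bit), the EMON bit handling could be sketched as:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_MISC_ENABLE_EMON (1ULL << 7)	/* bit 7: PMU available */

/* On PMU refresh, advertise the PMU if a vPMU version is exposed. */
static uint64_t pmu_refresh_misc_enable(uint64_t msr, int pmu_version)
{
	if (pmu_version)
		msr |= MSR_IA32_MISC_ENABLE_EMON;
	return msr;
}

/* A guest write cannot toggle the EMON bit: the written value is
 * discarded and the current (host-controlled) value is kept. */
static uint64_t guest_write_misc_enable(uint64_t msr, uint64_t data)
{
	data &= ~MSR_IA32_MISC_ENABLE_EMON;
	return data | (msr & MSR_IA32_MISC_ENABLE_EMON);
}
```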
Cc: Yao Yuan <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 1 +
arch/x86/kvm/x86.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 9efc1a6b8693..d9dbebe03cae 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -488,6 +488,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
if (!pmu->version)
return;
+ vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
perf_get_x86_pmu_capability(&x86_pmu);
pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d95f90a048..536b64360b75 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3125,6 +3125,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
}
break;
case MSR_IA32_MISC_ENABLE:
+ data &= ~MSR_IA32_MISC_ENABLE_EMON;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
--
2.29.2
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].
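The valid-bit layout described above can be sketched as a small helper that builds the set of writable IA32_PEBS_ENABLE bits for a given counter configuration (the function name is made up for this sketch; the kernel stores the complement of this as pebs_enable_mask):

```c
#include <assert.h>
#include <stdint.h>

/*
 * General-purpose counter enable bits start at IA32_PEBS_ENABLE[0];
 * fixed counter enable bits start at IA32_PEBS_ENABLE[32].
 */
static uint64_t pebs_enable_valid_bits(unsigned int nr_gp,
				       unsigned int nr_fixed)
{
	uint64_t gp = (nr_gp < 64) ? (1ULL << nr_gp) - 1 : ~0ULL;
	uint64_t fixed = ((1ULL << nr_fixed) - 1) << 32;

	return gp | fixed;
}
```

Any guest write that sets a bit outside this mask would be rejected, which matches the `data & pmu->pebs_enable_mask` check in intel_pmu_set_msr() below.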
When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() list and switched during
VMX transitions, just like the CORE_PERF_GLOBAL_CTRL MSR.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 17 +++++++++++++----
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/msr-index.h | 6 ++++++
arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++++++
4 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e8fee7cf767f..2ca8ed61f444 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3851,7 +3851,11 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
*nr = 1;
- if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
+ if (x86_pmu.pebs) {
+ arr[1].msr = MSR_IA32_PEBS_ENABLE;
+ arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
+ arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+
/*
* If PMU counter has PEBS enabled it is not enough to
* disable counter on a guest entry since PEBS memory
@@ -3860,9 +3864,14 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
*
* Don't do this if the CPU already enforces it.
*/
- arr[1].msr = MSR_IA32_PEBS_ENABLE;
- arr[1].host = cpuc->pebs_enabled;
- arr[1].guest = 0;
+ if (x86_pmu.pebs_no_isolation)
+ arr[1].guest = 0;
+
+ if (arr[1].guest)
+ arr[0].guest |= arr[1].guest;
+ else
+ arr[1].guest = arr[1].host;
+
*nr = 2;
}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9b814bdc9137..f620485d7836 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -461,6 +461,7 @@ struct kvm_pmu {
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
u64 pebs_enable;
+ u64 pebs_enable_mask;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 546d6ecf0a35..9afcad882f4f 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -186,6 +186,12 @@
#define MSR_IA32_DS_AREA 0x00000600
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6
+#define PERF_CAP_PEBS_TRAP BIT_ULL(6)
+#define PERF_CAP_ARCH_REG BIT_ULL(7)
+#define PERF_CAP_PEBS_FORMAT 0xf00
+#define PERF_CAP_PEBS_BASELINE BIT_ULL(14)
+#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
+ PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ac7fe714e6c1..0700d6d739f7 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -220,6 +220,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PEBS_ENABLE:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -367,6 +370,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_PEBS_ENABLE:
+ msr_info->data = pmu->pebs_enable;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -427,6 +433,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_PEBS_ENABLE:
+ if (pmu->pebs_enable == data)
+ return 0;
+ if (!(data & pmu->pebs_enable_mask)) {
+ pmu->pebs_enable = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -479,6 +493,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
+ pmu->pebs_enable_mask = ~0ull;
entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -545,6 +560,19 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
if (lbr_desc->records.nr)
bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1);
+
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
+ pmu->pebs_enable_mask = ~pmu->global_ctrl;
+ pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+ pmu->fixed_ctr_ctrl_mask &=
+ ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ } else
+ pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
+ } else {
+ vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
+ }
}
static void intel_pmu_init(struct kvm_vcpu *vcpu)
--
2.29.2
When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
configuring a non-zero precision level in the perf_event_attr.
The guest PEBS overflow PMI bit will be set in the guest
GLOBAL_STATUS MSR when the PEBS facility generates a PEBS
overflow PMI based on the guest IA32_DS_AREA MSR.
The attr.precise_ip will be adjusted to a special precision
level when the new PEBS-PDIR feature is supported later, which
will affect host counter scheduling.
A guest PEBS event will not be reused for a non-PEBS guest
event, even with the same guest counter index.
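The overflow accounting described above can be sketched as a small pure function (names simplified; the real logic lives in kvm_perf_overflow_intr() in the diff below): a PEBS overflow sets the single BUFFER_OVF bit, and a second PEBS overflow while that bit is still set does not inject another PMI.

```c
#include <assert.h>
#include <stdint.h>

#define GLOBAL_STATUS_BUFFER_OVF_BIT 62

/* Returns 1 if the PMI injection should be skipped. */
static int mark_overflow(uint64_t *global_status, unsigned int pmc_idx,
			 int precise_ip)
{
	if (precise_ip) {
		/* PEBS overflow: set bit 62; skip PMI if already pending. */
		uint64_t bit = 1ULL << GLOBAL_STATUS_BUFFER_OVF_BIT;
		int was_set = !!(*global_status & bit);

		*global_status |= bit;
		return was_set;
	}
	/* Ordinary overflow: set the per-counter status bit. */
	*global_status |= 1ULL << pmc_idx;
	return 0;
}
```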
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 33 +++++++++++++++++++++++++++++++--
2 files changed, 33 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c560960544a3..9b814bdc9137 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,6 +460,8 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
+ u64 pebs_enable;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 827886c12c16..3509b18478b9 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -74,11 +74,20 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
{
struct kvm_pmc *pmc = perf_event->overflow_handler_context;
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+ bool skip_pmi = false;
if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
- __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ if (perf_event->attr.precise_ip) {
+ /* Indicate PEBS overflow PMI to guest. */
+ skip_pmi = test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+ (unsigned long *)&pmu->global_status);
+ } else
+ __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+ if (skip_pmi)
+ return;
+
/*
* Inject PMI. If vcpu was in a guest mode during NMI PMI
* can be ejected on a guest mode re-entry. Otherwise we can't
@@ -99,6 +108,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
bool exclude_kernel, bool intr,
bool in_tx, bool in_tx_cp)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
struct perf_event *event;
struct perf_event_attr attr = {
.type = type,
@@ -110,6 +120,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_kernel = exclude_kernel,
.config = config,
};
+ bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
attr.sample_period = get_sample_period(pmc, pmc->counter);
@@ -124,9 +135,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
attr.sample_period = 0;
attr.config |= HSW_IN_TX_CHECKPOINTED;
}
+ if (pebs) {
+ /*
+ * The non-zero precision level of guest event makes the ordinary
+ * guest event becomes a guest PEBS event and triggers the host
+ * PEBS PMI handler to determine whether the PEBS overflow PMI
+ * comes from the host counters or the guest.
+ *
+ * For most PEBS hardware events, the difference in the software
+ * precision levels of guest and host PEBS events will not affect
+ * the accuracy of the PEBS profiling result, because the "event IP"
+ * in the PEBS record is calibrated on the guest side.
+ */
+ attr.precise_ip = 1;
+ }
event = perf_event_create_kernel_counter(&attr, -1, current,
- intr ? kvm_perf_overflow_intr :
+ (intr || pebs) ? kvm_perf_overflow_intr :
kvm_perf_overflow, pmc);
if (IS_ERR(event)) {
pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
@@ -161,6 +186,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
get_sample_period(pmc, pmc->counter)))
return false;
+ if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+ pmc->perf_event->attr.precise_ip)
+ return false;
+
/* reuse perf_event to serve as pmc_reprogram_counter() does*/
perf_event_enable(pmc->perf_event);
--
2.29.2
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.
Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG (Memory Info [bit 0], GPRs [bit 1], XMMs [bit 2],
and LBRs [bit 3], LBR Entries [bit 31:24]). By default, the PEBS record
will only contain the Basic group.
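The MSR_PEBS_DATA_CFG layout described above can be sketched as a helper that builds a configuration value from the desired record groups (the PEBS_DATACFG_* bit names follow the kernel's perf_event.h; the helper itself is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define PEBS_DATACFG_MEMINFO	(1ULL << 0)
#define PEBS_DATACFG_GP		(1ULL << 1)
#define PEBS_DATACFG_XMMS	(1ULL << 2)
#define PEBS_DATACFG_LBRS	(1ULL << 3)
#define PEBS_DATACFG_LBR_SHIFT	24	/* LBR entry count, bits 31:24 */

static uint64_t make_pebs_data_cfg(int meminfo, int gprs, int xmms,
				   unsigned int nr_lbr_entries)
{
	uint64_t cfg = 0;

	if (meminfo)
		cfg |= PEBS_DATACFG_MEMINFO;
	if (gprs)
		cfg |= PEBS_DATACFG_GP;
	if (xmms)
		cfg |= PEBS_DATACFG_XMMS;
	if (nr_lbr_entries)
		cfg |= PEBS_DATACFG_LBRS |
		       ((uint64_t)nr_lbr_entries << PEBS_DATACFG_LBR_SHIFT);
	return cfg;
}
```

With all groups cleared the record contains only the Basic group, matching the default described above.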
When guest adaptive PEBS is enabled, the PEBS_DATA_CFG MSR will
be added to the perf_guest_switch_msr() list and switched during
VMX transitions, just like the CORE_PERF_GLOBAL_CTRL MSR.
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 11 ++++++++++-
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
3 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7f3821a59b84..3bbdfc4f6931 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3844,6 +3844,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
struct kvm_pmu *pmu = (struct kvm_pmu *)data;
+ bool baseline = x86_pmu.intel_cap.pebs_baseline;
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
@@ -3863,6 +3864,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
arr[2].host = (unsigned long)ds;
arr[2].guest = pmu->ds_area;
+ if (baseline) {
+ arr[3].msr = MSR_PEBS_DATA_CFG;
+ arr[3].host = cpuc->pebs_data_cfg;
+ arr[3].guest = pmu->pebs_data_cfg;
+ }
+
/*
* If PMU counter has PEBS enabled it is not enough to
* disable counter on a guest entry since PEBS memory
@@ -3879,9 +3886,11 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
else {
arr[1].guest = arr[1].host;
arr[2].guest = arr[2].host;
+ if (baseline)
+ arr[3].guest = arr[3].host;
}
- *nr = 3;
+ *nr = baseline ? 4 : 3;
}
return arr;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2275cc144f58..94366da2dfee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -463,6 +463,8 @@ struct kvm_pmu {
u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;
+ u64 pebs_data_cfg;
+ u64 pebs_data_cfg_mask;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 77d30106abca..7f18c760dbae 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -226,6 +226,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_DS_AREA:
ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
break;
+ case MSR_PEBS_DATA_CFG:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -379,6 +382,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DS_AREA:
msr_info->data = pmu->ds_area;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ msr_info->data = pmu->pebs_data_cfg;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -452,6 +458,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
pmu->ds_area = data;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ if (pmu->pebs_data_cfg == data)
+ return 0;
+ if (!(data & pmu->pebs_data_cfg_mask)) {
+ pmu->pebs_data_cfg = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -505,6 +519,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
pmu->pebs_enable_mask = ~0ull;
+ pmu->pebs_data_cfg_mask = ~0ull;
entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -579,6 +594,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
pmu->fixed_ctr_ctrl_mask &=
~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ pmu->pebs_data_cfg_mask = ~0xff00000full;
} else
pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
} else {
--
2.29.2
The new hardware facility supporting guest PEBS is only available
on Intel Ice Lake Server platforms for now. KVM will check this field
through perf_get_x86_pmu_capability() instead of hard-coding the CPU
models in the KVM code. If it is supported, the guest PEBS capability
will be exposed to the guest.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 1 +
arch/x86/events/perf_event.h | 3 ++-
arch/x86/include/asm/perf_event.h | 1 +
4 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 18df17129695..06bef6ba8a9b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2776,5 +2776,6 @@ void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap)
cap->bit_width_fixed = x86_pmu.cntval_bits;
cap->events_mask = (unsigned int)x86_pmu.events_maskl;
cap->events_mask_len = x86_pmu.events_mask_len;
+ cap->pebs_vmx = x86_pmu.pebs_vmx;
}
EXPORT_SYMBOL_GPL(perf_get_x86_pmu_capability);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7bbb5bb98d8c..591d60cc8436 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5574,6 +5574,7 @@ __init int intel_pmu_init(void)
case INTEL_FAM6_ICELAKE_X:
case INTEL_FAM6_ICELAKE_D:
+ x86_pmu.pebs_vmx = 1;
pmem = true;
fallthrough;
case INTEL_FAM6_ICELAKE_L:
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 53b2b5fc23bc..85dc4e1d4514 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -729,7 +729,8 @@ struct x86_pmu {
pebs_prec_dist :1,
pebs_no_tlb :1,
pebs_no_isolation :1,
- pebs_block :1;
+ pebs_block :1,
+ pebs_vmx :1;
int pebs_record_size;
int pebs_buffer_size;
int max_pebs_events;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 544f41a179fb..6a6e707905be 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -192,6 +192,7 @@ struct x86_pmu_capability {
int bit_width_fixed;
unsigned int events_mask;
int events_mask_len;
+ unsigned int pebs_vmx :1;
};
/*
--
2.29.2
With PEBS virtualization, guest PEBS records get delivered to the
guest DS, and the host PMI handler uses perf_guest_cbs->is_in_guest()
to distinguish whether the PMI comes from guest code, as is done for Intel PT.
No matter how many guest PEBS counters have overflowed, triggering one
fake event is enough. The fake event causes the KVM PMI callback to
be called, thereby injecting the PEBS overflow PMI into the guest.
KVM will inject the PMI with BUFFER_OVF set even if the guest DS is
empty, which should be harmless. The guest PEBS handler will then
retrieve the correct information from its own PEBS records buffer.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 45 +++++++++++++++++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 591d60cc8436..af9ac48fe840 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2747,6 +2747,46 @@ static void intel_pmu_reset(void)
local_irq_restore(flags);
}
+/*
+ * We may be running with guest PEBS events created by KVM, and the
+ * PEBS records are logged into the guest's DS and invisible to host.
+ *
+ * In the case of guest PEBS overflow, we only trigger a fake event
+ * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
+ * The guest will then vm-entry and check the guest DS area to read
+ * the guest PEBS records.
+ *
+ * The contents and other behavior of the guest event do not matter.
+ */
+static int x86_pmu_handle_guest_pebs(struct pt_regs *regs,
+ struct perf_sample_data *data)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ struct perf_event *event = NULL;
+ int bit;
+
+ if (!x86_pmu.pebs_active || !guest_pebs_idxs)
+ return 0;
+
+ for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
+ INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
+
+ event = cpuc->events[bit];
+ if (!event->attr.precise_ip)
+ continue;
+
+ perf_sample_data_init(data, 0, event->hw.last_period);
+ if (perf_event_overflow(event, data, regs))
+ x86_pmu_stop(event, 0);
+
+ /* Inject one fake event is enough. */
+ return 1;
+ }
+
+ return 0;
+}
+
static int handle_pmi_common(struct pt_regs *regs, u64 status)
{
struct perf_sample_data data;
@@ -2797,7 +2837,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
u64 pebs_enabled = cpuc->pebs_enabled;
handled++;
- x86_pmu.drain_pebs(regs, &data);
+ if (x86_pmu.pebs_vmx && perf_guest_cbs && perf_guest_cbs->is_in_guest())
+ x86_pmu_handle_guest_pebs(regs, &data);
+ else
+ x86_pmu.drain_pebs(regs, &data);
status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
/*
--
2.29.2
This allows the inline function to be reused by more callers in
more files, such as pmu_intel.c.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 11 -----------
arch/x86/kvm/pmu.h | 11 +++++++++++
2 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 8d2873cfec69..0081cb742743 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -476,17 +476,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
kvm_pmu_refresh(vcpu);
}
-static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
-{
- struct kvm_pmu *pmu = pmc_to_pmu(pmc);
-
- if (pmc_is_fixed(pmc))
- return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
- pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
-
- return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
-}
-
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index d9157128e6eb..6c902b2d2d5a 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -149,6 +149,17 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
return sample_period;
}
+static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (pmc_is_fixed(pmc))
+ return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
+
+ return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
--
2.29.2
Guest PEBS will be disabled when users try to profile KVM and its
user space through the same PEBS facility, OR when the host perf
doesn't schedule the guest PEBS counter in a one-to-one mapping manner
(neither of these is a typical scenario).
The PEBS records in the guest DS buffer are still accurate, and the
above two restrictions are checked before each vm-entry only if
guest PEBS is deemed to be enabled.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 8 +++++++-
arch/x86/include/asm/kvm_host.h | 9 +++++++++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 4 ++++
arch/x86/kvm/vmx/vmx.h | 1 +
5 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 3bbdfc4f6931..20ee1b3fd06b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3858,7 +3858,13 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
if (pmu && x86_pmu.pebs) {
arr[1].msr = MSR_IA32_PEBS_ENABLE;
arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
- arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ if (!arr[1].host) {
+ arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ /* Disable guest PEBS for cross-mapped PEBS counters. */
+ arr[1].guest &= ~pmu->host_cross_mapped_mask;
+ } else
+ /* Disable guest PEBS if host PEBS is enabled. */
+ arr[1].guest = 0;
arr[2].msr = MSR_IA32_DS_AREA;
arr[2].host = (unsigned long)ds;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 94366da2dfee..cfb5467be7e6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -466,6 +466,15 @@ struct kvm_pmu {
u64 pebs_data_cfg;
u64 pebs_data_cfg_mask;
+ /*
+ * If a guest counter is cross-mapped to host counter with different
+ * index, its PEBS capability will be temporarily disabled.
+ *
+ * The user should make sure that this mask is updated
+ * after disabling interrupts and before perf_guest_get_msrs();
+ */
+ u64 host_cross_mapped_mask;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4dcf66e6c398..55caa941e336 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -767,6 +767,22 @@ static void intel_pmu_cleanup(struct kvm_vcpu *vcpu)
intel_pmu_release_guest_lbr_event(vcpu);
}
+void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu)
+{
+ struct kvm_pmc *pmc = NULL;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->global_ctrl, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc))
+ continue;
+
+ if (pmc->perf_event && (pmc->idx != pmc->perf_event->hw.idx))
+ pmu->host_cross_mapped_mask |= BIT_ULL(pmc->perf_event->hw.idx);
+ }
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 594c058f6f0f..966fa7962808 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6516,6 +6516,10 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
struct perf_guest_switch_msr *msrs;
struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
+ pmu->host_cross_mapped_mask = 0;
+ if (pmu->pebs_enable & pmu->global_ctrl)
+ intel_pmu_cross_mapped_check(pmu);
+
/* Note, nr_msrs may be garbage if perf_guest_get_msrs() returns NULL. */
msrs = perf_guest_get_msrs(&nr_msrs, (void *)pmu);
if (!msrs)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 0fb3236b0283..0029aaad8eda 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -96,6 +96,7 @@ union vmx_exit_reason {
#define vcpu_to_lbr_desc(vcpu) (&to_vmx(vcpu)->lbr_desc)
#define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records)
+void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
--
2.29.2
The information obtained from the interface perf_get_x86_pmu_capability()
doesn't change at runtime, so an exported "struct x86_pmu_capability"
(kvm_pmu_cap) is introduced and shared by all guests in KVM; it is
initialized before hardware_setup().
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/cpuid.c | 24 +++++++-----------------
arch/x86/kvm/pmu.c | 3 +++
arch/x86/kvm/pmu.h | 19 +++++++++++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 13 +++++--------
arch/x86/kvm/x86.c | 9 ++++-----
5 files changed, 38 insertions(+), 30 deletions(-)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 6bd2f8b830e4..b3c751d425b7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -680,32 +680,22 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
case 9:
break;
case 0xa: { /* Architectural Performance Monitoring */
- struct x86_pmu_capability cap;
union cpuid10_eax eax;
union cpuid10_edx edx;
- perf_get_x86_pmu_capability(&cap);
+ eax.split.version_id = kvm_pmu_cap.version;
+ eax.split.num_counters = kvm_pmu_cap.num_counters_gp;
+ eax.split.bit_width = kvm_pmu_cap.bit_width_gp;
+ eax.split.mask_length = kvm_pmu_cap.events_mask_len;
- /*
- * Only support guest architectural pmu on a host
- * with architectural pmu.
- */
- if (!cap.version)
- memset(&cap, 0, sizeof(cap));
-
- eax.split.version_id = min(cap.version, 2);
- eax.split.num_counters = cap.num_counters_gp;
- eax.split.bit_width = cap.bit_width_gp;
- eax.split.mask_length = cap.events_mask_len;
-
- edx.split.num_counters_fixed = min(cap.num_counters_fixed, MAX_FIXED_COUNTERS);
- edx.split.bit_width_fixed = cap.bit_width_fixed;
+ edx.split.num_counters_fixed = kvm_pmu_cap.num_counters_fixed;
+ edx.split.bit_width_fixed = kvm_pmu_cap.bit_width_fixed;
edx.split.anythread_deprecated = 1;
edx.split.reserved1 = 0;
edx.split.reserved2 = 0;
entry->eax = eax.full;
- entry->ebx = cap.events_mask;
+ entry->ebx = kvm_pmu_cap.events_mask;
entry->ecx = 0;
entry->edx = edx.full;
break;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 0081cb742743..28deb51242e1 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -19,6 +19,9 @@
#include "lapic.h"
#include "pmu.h"
+struct x86_pmu_capability __read_mostly kvm_pmu_cap;
+EXPORT_SYMBOL_GPL(kvm_pmu_cap);
+
/* This is enough to filter the vast majority of currently defined events. */
#define KVM_PMU_EVENT_FILTER_MAX_EVENTS 300
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 6c902b2d2d5a..3f84640d8f8c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -160,6 +160,23 @@ static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
}
+extern struct x86_pmu_capability kvm_pmu_cap;
+
+static inline void kvm_init_pmu_capability(void)
+{
+ perf_get_x86_pmu_capability(&kvm_pmu_cap);
+
+ /*
+ * Only support guest architectural pmu on
+ * a host with architectural pmu.
+ */
+ if (!kvm_pmu_cap.version)
+ memset(&kvm_pmu_cap, 0, sizeof(kvm_pmu_cap));
+
+ kvm_pmu_cap.version = min(kvm_pmu_cap.version, 2);
+ kvm_pmu_cap.num_counters_fixed = min(kvm_pmu_cap.num_counters_fixed, MAX_FIXED_COUNTERS);
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
@@ -177,9 +194,11 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+void kvm_init_pmu_capability(void);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
extern struct kvm_pmu_ops intel_pmu_ops;
extern struct kvm_pmu_ops amd_pmu_ops;
+
#endif /* __KVM_X86_PMU_H */
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 55caa941e336..3c1ee59571d9 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -504,8 +504,6 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct lbr_desc *lbr_desc = vcpu_to_lbr_desc(vcpu);
-
- struct x86_pmu_capability x86_pmu;
struct kvm_cpuid_entry2 *entry;
union cpuid10_eax eax;
union cpuid10_edx edx;
@@ -532,13 +530,12 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
return;
vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_EMON;
- perf_get_x86_pmu_capability(&x86_pmu);
pmu->nr_arch_gp_counters = min_t(int, eax.split.num_counters,
- x86_pmu.num_counters_gp);
- eax.split.bit_width = min_t(int, eax.split.bit_width, x86_pmu.bit_width_gp);
+ kvm_pmu_cap.num_counters_gp);
+ eax.split.bit_width = min_t(int, eax.split.bit_width, kvm_pmu_cap.bit_width_gp);
pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << eax.split.bit_width) - 1;
- eax.split.mask_length = min_t(int, eax.split.mask_length, x86_pmu.events_mask_len);
+ eax.split.mask_length = min_t(int, eax.split.mask_length, kvm_pmu_cap.events_mask_len);
pmu->available_event_types = ~entry->ebx &
((1ull << eax.split.mask_length) - 1);
@@ -547,9 +544,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
} else {
pmu->nr_arch_fixed_counters =
min_t(int, edx.split.num_counters_fixed,
- x86_pmu.num_counters_fixed);
+ kvm_pmu_cap.num_counters_fixed);
edx.split.bit_width_fixed = min_t(int,
- edx.split.bit_width_fixed, x86_pmu.bit_width_fixed);
+ edx.split.bit_width_fixed, kvm_pmu_cap.bit_width_fixed);
pmu->counter_bitmask[KVM_PMC_FIXED] =
((u64)1 << edx.split.bit_width_fixed) - 1;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 888f2c3cc288..ae8c679923dd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5834,15 +5834,12 @@ long kvm_arch_vm_ioctl(struct file *filp,
static void kvm_init_msr_list(void)
{
- struct x86_pmu_capability x86_pmu;
u32 dummy[2];
unsigned i;
BUILD_BUG_ON_MSG(INTEL_PMC_MAX_FIXED != 4,
"Please update the fixed PMCs in msrs_to_saved_all[]");
- perf_get_x86_pmu_capability(&x86_pmu);
-
num_msrs_to_save = 0;
num_emulated_msrs = 0;
num_msr_based_features = 0;
@@ -5893,12 +5890,12 @@ static void kvm_init_msr_list(void)
break;
case MSR_ARCH_PERFMON_PERFCTR0 ... MSR_ARCH_PERFMON_PERFCTR0 + 17:
if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_PERFCTR0 >=
- min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
+ min(INTEL_PMC_MAX_GENERIC, kvm_pmu_cap.num_counters_gp))
continue;
break;
case MSR_ARCH_PERFMON_EVENTSEL0 ... MSR_ARCH_PERFMON_EVENTSEL0 + 17:
if (msrs_to_save_all[i] - MSR_ARCH_PERFMON_EVENTSEL0 >=
- min(INTEL_PMC_MAX_GENERIC, x86_pmu.num_counters_gp))
+ min(INTEL_PMC_MAX_GENERIC, kvm_pmu_cap.num_counters_gp))
continue;
break;
default:
@@ -10451,6 +10448,8 @@ int kvm_arch_hardware_setup(void *opaque)
if (boot_cpu_has(X86_FEATURE_XSAVES))
rdmsrl(MSR_IA32_XSS, host_xss);
+ kvm_init_pmu_capability();
+
r = ops->hardware_setup();
if (r != 0)
return r;
--
2.29.2
For the same purpose, the legacy intel_pmu_lbr_is_compatible() is
renamed so it can be reused by more callers, and the comment about
the LBR use case is removed incidentally.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/cpuid.h | 5 +++++
arch/x86/kvm/vmx/pmu_intel.c | 12 +-----------
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/vmx/vmx.h | 1 -
4 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index 2a0c5064497f..fb478fb45b9e 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -270,6 +270,11 @@ static inline int guest_cpuid_model(struct kvm_vcpu *vcpu)
return x86_model(best->eax);
}
+static inline bool cpuid_model_is_consistent(struct kvm_vcpu *vcpu)
+{
+ return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
+}
+
static inline int guest_cpuid_stepping(struct kvm_vcpu *vcpu)
{
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 3c1ee59571d9..4fe13cf80bb5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -173,16 +173,6 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
}
-bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
-{
- /*
- * As a first step, a guest could only enable LBR feature if its
- * cpu model is the same as the host because the LBR registers
- * would be pass-through to the guest and they're model specific.
- */
- return boot_cpu_data.x86_model == guest_cpuid_model(vcpu);
-}
-
bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
{
struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
@@ -576,7 +566,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
nested_vmx_pmu_entry_exit_ctls_update(vcpu);
- if (intel_pmu_lbr_is_compatible(vcpu))
+ if (cpuid_model_is_consistent(vcpu))
x86_perf_get_lbr(&lbr_desc->records);
else
lbr_desc->records.nr = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 966fa7962808..b0f2cb790359 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2259,7 +2259,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if ((data & PMU_CAP_LBR_FMT) !=
(vmx_get_perf_capabilities() & PMU_CAP_LBR_FMT))
return 1;
- if (!intel_pmu_lbr_is_compatible(vcpu))
+ if (!cpuid_model_is_consistent(vcpu))
return 1;
}
ret = kvm_set_msr_common(vcpu, msr_info);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 0029aaad8eda..d214b6c43886 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -97,7 +97,6 @@ union vmx_exit_reason {
#define vcpu_to_lbr_records(vcpu) (&to_vmx(vcpu)->lbr_desc.records)
void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
-bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
--
2.29.2
The mask value of the fixed counter control register should be
dynamically adjusted to the number of fixed counters. This patch
introduces a variable that holds the reserved bits of the fixed
counter control register. This is needed for later Ice Lake fixed
counter changes.
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a52f973bdff6..c560960544a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -444,6 +444,7 @@ struct kvm_pmu {
unsigned nr_arch_fixed_counters;
unsigned available_event_types;
u64 fixed_ctr_ctrl;
+ u64 fixed_ctr_ctrl_mask;
u64 global_ctrl;
u64 global_status;
u64 global_ovf_ctrl;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d9dbebe03cae..ac7fe714e6c1 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -400,7 +400,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_FIXED_CTR_CTRL:
if (pmu->fixed_ctr_ctrl == data)
return 0;
- if (!(data & 0xfffffffffffff444ull)) {
+ if (!(data & pmu->fixed_ctr_ctrl_mask)) {
reprogram_fixed_counters(pmu, data);
return 0;
}
@@ -470,6 +470,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
struct kvm_cpuid_entry2 *entry;
union cpuid10_eax eax;
union cpuid10_edx edx;
+ int i;
pmu->nr_arch_gp_counters = 0;
pmu->nr_arch_fixed_counters = 0;
@@ -477,6 +478,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
+ pmu->fixed_ctr_ctrl_mask = ~0ull;
entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -511,6 +513,8 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
((u64)1 << edx.split.bit_width_fixed) - 1;
}
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+ pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));
pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) |
(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
pmu->global_ctrl_mask = ~pmu->global_ctrl;
--
2.29.2
The CPUID features PDCM, DS and DTES64 are required for the PEBS
feature. KVM exposes the CPUID features PDCM, DS and DTES64 to the
guest when guest PEBS is supported by KVM on Ice Lake server
platforms.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/vmx/capabilities.h | 26 ++++++++++++++++++++------
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
2 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index d1d77985e889..df06da09f84c 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -5,6 +5,7 @@
#include <asm/vmx.h>
#include "lapic.h"
+#include "pmu.h"
extern bool __read_mostly enable_vpid;
extern bool __read_mostly flexpriority_enabled;
@@ -378,20 +379,33 @@ static inline bool vmx_pt_mode_is_host_guest(void)
return pt_mode == PT_MODE_HOST_GUEST;
}
-static inline u64 vmx_get_perf_capabilities(void)
+static inline bool vmx_pebs_supported(void)
{
- u64 perf_cap = 0;
+ struct x86_pmu_capability x86_pmu;
- if (boot_cpu_has(X86_FEATURE_PDCM))
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
+ perf_get_x86_pmu_capability(&x86_pmu);
- perf_cap &= PMU_CAP_LBR_FMT;
+ return boot_cpu_has(X86_FEATURE_PEBS) && x86_pmu.pebs_vmx;
+}
+static inline u64 vmx_get_perf_capabilities(void)
+{
/*
* Since counters are virtualized, KVM would support full
* width counting unconditionally, even if the host lacks it.
*/
- return PMU_CAP_FW_WRITES | perf_cap;
+ u64 perf_cap = PMU_CAP_FW_WRITES;
+ u64 host_perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, host_perf_cap);
+
+ perf_cap |= host_perf_cap & PMU_CAP_LBR_FMT;
+
+ if (vmx_pebs_supported())
+ perf_cap |= host_perf_cap & PERF_CAP_PEBS_MASK;
+
+ return perf_cap;
}
static inline u64 vmx_supported_debugctl(void)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b0f2cb790359..7cd9370357f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2262,6 +2262,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (!cpuid_model_is_consistent(vcpu))
return 1;
}
+ if (data & PERF_CAP_PEBS_FORMAT) {
+ if ((data & PERF_CAP_PEBS_MASK) !=
+ (vmx_get_perf_capabilities() & PERF_CAP_PEBS_MASK))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DS))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DTES64))
+ return 1;
+ if (!cpuid_model_is_consistent(vcpu))
+ return 1;
+ }
ret = kvm_set_msr_common(vcpu, msr_info);
break;
@@ -7264,6 +7275,10 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_clear(X86_FEATURE_INVPCID);
if (vmx_pt_mode_is_host_guest())
kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
+ if (vmx_pebs_supported()) {
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
+ }
if (vmx_umip_emulated())
kvm_cpu_cap_set(X86_FEATURE_UMIP);
--
2.29.2
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and
points to the linear address of the first byte of the DS buffer
management area, which is used to manage the PEBS records.
When guest PEBS is enabled and the value differs from the host's,
KVM adds the IA32_DS_AREA MSR to the MSR-switch list. The guest's
DS value can thus be loaded into the real HW before VM-entry, and
the MSR is removed from the list when guest PEBS is disabled.
A WRMSR to the IA32_DS_AREA MSR raises a #GP(0) if the source
register contains a non-canonical address. Switching the
IA32_DS_AREA MSR also sets up a quiescent period so that pending
host PEBS records (if any) are written to the host DS area rather
than the guest DS area.
When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be
added to the perf_guest_switch_msr() and switched during the
VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 15 ++++++++++++---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 11 +++++++++++
arch/x86/kvm/vmx/vmx.c | 1 +
4 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2ca8ed61f444..7f3821a59b84 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -21,6 +21,7 @@
#include <asm/intel_pt.h>
#include <asm/apic.h>
#include <asm/cpu_device_id.h>
+#include <asm/kvm_host.h>
#include "../perf_event.h"
@@ -3841,6 +3842,8 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+ struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
+ struct kvm_pmu *pmu = (struct kvm_pmu *)data;
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
@@ -3851,11 +3854,15 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
*nr = 1;
- if (x86_pmu.pebs) {
+ if (pmu && x86_pmu.pebs) {
arr[1].msr = MSR_IA32_PEBS_ENABLE;
arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ arr[2].msr = MSR_IA32_DS_AREA;
+ arr[2].host = (unsigned long)ds;
+ arr[2].guest = pmu->ds_area;
+
/*
* If PMU counter has PEBS enabled it is not enough to
* disable counter on a guest entry since PEBS memory
@@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
if (arr[1].guest)
arr[0].guest |= arr[1].guest;
- else
+ else {
arr[1].guest = arr[1].host;
+ arr[2].guest = arr[2].host;
+ }
- *nr = 2;
+ *nr = 3;
}
return arr;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f620485d7836..2275cc144f58 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,6 +460,7 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
+ u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 0700d6d739f7..77d30106abca 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -223,6 +223,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_PEBS_ENABLE:
ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
break;
+ case MSR_IA32_DS_AREA:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -373,6 +376,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_PEBS_ENABLE:
msr_info->data = pmu->pebs_enable;
return 0;
+ case MSR_IA32_DS_AREA:
+ msr_info->data = pmu->ds_area;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -441,6 +447,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_DS_AREA:
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ pmu->ds_area = data;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8063cb7e8387..594c058f6f0f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1001,6 +1001,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
return;
}
break;
+ case MSR_IA32_DS_AREA:
case MSR_IA32_PEBS_ENABLE:
/* PEBS needs a quiescent period after being disabled (to write
* a record). Disabling PEBS through VMX MSR swapping doesn't
--
2.29.2
Bit 12 of IA32_MISC_ENABLE represents "Processor Event Based Sampling Unavailable (RO)":
1 = PEBS is not supported.
0 = PEBS is supported.
A guest write to this read-only PEBS_UNAVAIL bit raises #GP(0) when
guest PEBS is enabled. Some PEBS drivers in the guest may care about
this bit.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 2 ++
arch/x86/kvm/x86.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 7f18c760dbae..4dcf66e6c398 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -588,6 +588,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED_VLBR, 1);
if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ vcpu->arch.ia32_misc_enable_msr &= ~MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
pmu->pebs_enable_mask = ~pmu->global_ctrl;
pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
@@ -598,6 +599,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
} else
pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
} else {
+ vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
}
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 536b64360b75..888f2c3cc288 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3126,6 +3126,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_MISC_ENABLE:
data &= ~MSR_IA32_MISC_ENABLE_EMON;
+ if (!msr_info->host_initiated &&
+ (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) &&
+ (data & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
+ return 1;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
--
2.29.2
The PEBS-PDIR facility on Ice Lake servers is supported on
IA32_FIXED0 only. If the guest configures counter 32 (fixed counter 0)
and PEBS is enabled, the PEBS-PDIR facility is supposed to be used,
in which case KVM adjusts attr.precise_ip to 3 and requests host perf
to assign exactly the requested counter, or fail.
The CPU model check is also required since some platforms may place
the PEBS-PDIR facility on another counter index.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 2 ++
arch/x86/kvm/pmu.h | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 3509b18478b9..8d2873cfec69 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -148,6 +148,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
* in the PEBS record is calibrated on the guest side.
*/
attr.precise_ip = 1;
+ if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
+ attr.precise_ip = 3;
}
event = perf_event_create_kernel_counter(&attr, -1, current,
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 7b30bc967af3..d9157128e6eb 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -4,6 +4,8 @@
#include <linux/nospec.h>
+#include <asm/cpu_device_id.h>
+
#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
#define pmc_to_pmu(pmc) (&(pmc)->vcpu->arch.pmu)
@@ -16,6 +18,11 @@
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
#define MAX_FIXED_COUNTERS 3
+static const struct x86_cpu_id vmx_icl_pebs_cpu[] = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
+ {}
+};
struct kvm_event_hw_type_mapping {
u8 eventsel;
--
2.29.2
Hi all, do we have any comments on this patch set?
On 2021/3/29 13:41, Like Xu wrote:
> The guest Precise Event Based Sampling (PEBS) feature can provide
> an architectural state of the instruction executed after the guest
> instruction that exactly caused the event. It needs new hardware
> facility only available on Intel Ice Lake Server platforms. This
> patch set enables the basic PEBS via DS feature for KVM guests on ICX.
>
> We can use PEBS feature on the Linux guest like native:
>
> # perf record -e instructions:ppp ./br_instr a
> # perf record -c 100000 -e instructions:pp ./br_instr a
>
> To emulate guest PEBS facility for the above perf usages,
> we need to implement 2 code paths:
>
> 1) Fast path
>
> This is when the host assigned physical PMC has an identical index as
> the virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0).
> This path is used in most common use cases.
>
> 2) Slow path
>
> This is when the host assigned physical PMC has a different index
> from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
> In this case, KVM needs to rewrite the PEBS records to change the
> applicable counter indexes to the virtual PMC indexes, which would
> otherwise contain the physical counter index written by PEBS facility,
> and switch the counter reset values to the offset corresponding to
> the physical counter indexes in the DS data structure.
>
> The previous version [0] enables both fast path and slow path, which
> seems a bit more complex as the first step. In this patchset, we want
> to start with the fast path to get the basic guest PEBS enabled while
> keeping the slow path disabled. More focused discussion on the slow
> path [1] is planned to be put to another patchset in the next step.
>
> Compared to later versions in subsequent steps, the functionality
> to support host-guest PEBS both enabled and the functionality to
> emulate guest PEBS when the counter is cross-mapped are missing
> in this patch set (neither of these are typical scenarios).
>
> With the basic support, the guest can retrieve the correct PEBS
> information from its own PEBS records on the Ice Lake servers.
> And we expect it should work when migrating to another Ice Lake
> and no regression about host perf is expected.
>
> Here are the results of pebs test from guest/host for same workload:
>
> perf report on guest:
> # Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
> # Overhead Command Shared Object Symbol
> 57.74% br_instr br_instr [.] lfsr_cond
> 41.40% br_instr br_instr [.] cmp_end
> 0.21% br_instr [kernel.kallsyms] [k] __lock_acquire
>
> perf report on host:
> # Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
> # Overhead Command Shared Object Symbol
> 57.90% br_instr br_instr [.] lfsr_cond
> 41.95% br_instr br_instr [.] cmp_end
> 0.05% br_instr [kernel.vmlinux] [k] lock_acquire
> Conclusion: the profiling results on the guest are similar to those on the host.
>
> Please check more details in each commit and feel free to comment.
>
> Previous:
> [0] https://lore.kernel.org/kvm/[email protected]/
> [1] https://lore.kernel.org/kvm/[email protected]/
>
> v3->v4 Changelog:
> - Update this cover letter and propose a new upstream plan;
> [PERF]
> - Drop check host DS and move handler to handle_pmi_common();
> - Pass "struct kvm_pmu *" to intel_guest_get_msrs();
> - Propose new assignment logic for perf_guest_switch_msr();
> - Introduce x86_pmu.pebs_vmx for future capability maintenance;
> [KVM]
> - Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability;
> - Raising PEBS PMI only when OVF_BIT 62 is not set;
> - Make vmx_icl_pebs_cpu specific for PEBS-PDIR emulation;
> - Fix a bug for fixed_ctr_ctrl_mask;
> - Add two minor refactoring patches for reuse;
>
> Like Xu (16):
> perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
> perf/x86/intel: Handle guest PEBS overflow PMI for KVM guest
> perf/x86/core: Pass "struct kvm_pmu *" to determine the guest values
> KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
> KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
> KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
> KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
> KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
> KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
> KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
> KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
> KVM: x86/pmu: Move pmc_speculative_in_use() to arch/x86/kvm/pmu.h
> KVM: x86/pmu: Disable guest PEBS before vm-entry in two cases
> KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability
> KVM: x86/cpuid: Refactor host/guest CPU model consistency check
> KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
>
> arch/x86/events/core.c | 5 +-
> arch/x86/events/intel/core.c | 93 +++++++++++++++++++++++---
> arch/x86/events/perf_event.h | 5 +-
> arch/x86/include/asm/kvm_host.h | 16 +++++
> arch/x86/include/asm/msr-index.h | 6 ++
> arch/x86/include/asm/perf_event.h | 5 +-
> arch/x86/kvm/cpuid.c | 24 ++-----
> arch/x86/kvm/cpuid.h | 5 ++
> arch/x86/kvm/pmu.c | 49 ++++++++++----
> arch/x86/kvm/pmu.h | 37 +++++++++++
> arch/x86/kvm/vmx/capabilities.h | 26 ++++++--
> arch/x86/kvm/vmx/pmu_intel.c | 105 ++++++++++++++++++++++++------
> arch/x86/kvm/vmx/vmx.c | 25 ++++++-
> arch/x86/kvm/vmx/vmx.h | 2 +-
> arch/x86/kvm/x86.c | 14 ++--
> 15 files changed, 339 insertions(+), 78 deletions(-)
>
Hi, Like,
Some questions about this new PEBS patch set:
https://lore.kernel.org/kvm/[email protected]/
The new hardware facility supporting guest PEBS is only available
on Intel Ice Lake Server platforms for now.
AFAIK, Ice Lake supports adaptive PEBS and extended PEBS, which Skylake
doesn't.
But we can still use the IA32_PEBS_ENABLE MSR to enable PEBS on the
general-purpose counters on Skylake.
Is there anything else in this patch set that only Ice Lake supports?
Besides, we have tried this patch set on Ice Lake. We can use PEBS
(e.g. "perf record -e cycles:pp") when the guest runs kernel 5.11,
but can't when it runs kernel 4.18. Is there a minimum guest kernel
version requirement?
Thanks,
Xiangdong Liu
Hi Xiangdong,
On 2021/4/6 11:24, Liuxiangdong (Aven, Cloud Infrastructure Service Product
Dept.) wrote:
> Hi,like.
> Some questions about this new pebs patches set:
> https://lore.kernel.org/kvm/[email protected]/
>
> The new hardware facility supporting guest PEBS is only available
> on Intel Ice Lake Server platforms for now.
Yes, we have documented this "EPT-friendly PEBS" capability in the SDM,
section 18.3.10.1 "Processor Event Based Sampling (PEBS) Facility".
And again, this patch set doesn't officially support guest PEBS on Skylake.
>
>
> AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> doesn't.
> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> counter in Skylake.
For Skylake, only PMC0-PMC3 are valid for PEBS, and you may
mask off the other unsupported bits in pmu->pebs_enable_mask.
> Is there anything else that only Icelake supports in this patches set?
The PDIR counter on Ice Lake is fixed counter 0,
while the PDIR counter on Skylake is general-purpose counter 1.
You may also expose x86_pmu.pebs_vmx for Skylake in the first patch.
>
>
> Besides, we have tried this patches set in Icelake. We can use pebs(eg:
> "perf record -e cycles:pp")
> when guest is kernel-5.11, but can't when kernel-4.18. Is there a
> minimum guest kernel version requirement?
The Ice Lake CPU model has been supported since v5.4.
You may double-check whether the stable tree(s) have
INTEL_FAM6_ICELAKE in arch/x86/include/asm/intel-family.h.
>
>
> Thanks,
> Xiangdong Liu
> AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> doesn't.
> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> counter in Skylake.
> Is there anything else that only Icelake supports in this patches set?
Only Ice Lake server has support for recovering from an EPT violation
on the PEBS data structures. To use it on a Skylake server you would
need to pin the whole guest, but that is currently not done.
> Besides, we have tried this patches set in Icelake. We can use pebs(eg:
> "perf record -e cycles:pp")
> when guest is kernel-5.11, but can't when kernel-4.18. Is there a minimum
> guest kernel version requirement?
You would need a guest kernel that supports Ice Lake server PEBS. 4.18
would need backports for that.
-Andi
On Mon, Mar 29, 2021 at 01:41:23PM +0800, Like Xu wrote:
> With PEBS virtualization, the guest PEBS records get delivered to the
> guest DS, and the host pmi handler uses perf_guest_cbs->is_in_guest()
> to distinguish whether the PMI comes from the guest code like Intel PT.
>
> No matter how many guest PEBS counters are overflowed, only triggering
> one fake event is enough. The fake event causes the KVM PMI callback to
> be called, thereby injecting the PEBS overflow PMI into the guest.
>
> KVM will inject the PMI with BUFFER_OVF set, even if the guest DS is
> empty. That should really be harmless. Thus the guest PEBS handler would
> retrieve the correct information from its own PEBS records buffer.
>
> Originally-by: Andi Kleen <[email protected]>
> Co-developed-by: Kan Liang <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/events/intel/core.c | 45 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 591d60cc8436..af9ac48fe840 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2747,6 +2747,46 @@ static void intel_pmu_reset(void)
> local_irq_restore(flags);
> }
>
> +/*
> + * We may be running with guest PEBS events created by KVM, and the
> + * PEBS records are logged into the guest's DS and invisible to host.
> + *
> + * In the case of guest PEBS overflow, we only trigger a fake event
> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
> + * The guest will then vm-entry and check the guest DS area to read
> + * the guest PEBS records.
> + *
> + * The contents and other behavior of the guest event do not matter.
> + */
> +static int x86_pmu_handle_guest_pebs(struct pt_regs *regs,
> + struct perf_sample_data *data)
> +{
> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> + struct perf_event *event = NULL;
> + int bit;
> +
> + if (!x86_pmu.pebs_active || !guest_pebs_idxs)
> + return 0;
> +
> + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
> + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
> +
> + event = cpuc->events[bit];
> + if (!event->attr.precise_ip)
> + continue;
> +
> + perf_sample_data_init(data, 0, event->hw.last_period);
> + if (perf_event_overflow(event, data, regs))
> + x86_pmu_stop(event, 0);
> +
> + /* Inject one fake event is enough. */
> + return 1;
> + }
> +
> + return 0;
> +}
Why the return value, it is ignored.
> +
> static int handle_pmi_common(struct pt_regs *regs, u64 status)
> {
> struct perf_sample_data data;
> @@ -2797,7 +2837,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
> u64 pebs_enabled = cpuc->pebs_enabled;
>
> handled++;
> - x86_pmu.drain_pebs(regs, &data);
> + if (x86_pmu.pebs_vmx && perf_guest_cbs && perf_guest_cbs->is_in_guest())
> + x86_pmu_handle_guest_pebs(regs, &data);
> + else
> + x86_pmu.drain_pebs(regs, &data);
Why is that else? Since we can't tell if the PMI was for the guest or
for our own DS, we should check both, no?
On 2021/4/7 0:22, Peter Zijlstra wrote:
> On Mon, Mar 29, 2021 at 01:41:23PM +0800, Like Xu wrote:
>> With PEBS virtualization, the guest PEBS records get delivered to the
>> guest DS, and the host pmi handler uses perf_guest_cbs->is_in_guest()
>> to distinguish whether the PMI comes from the guest code like Intel PT.
>>
>> No matter how many guest PEBS counters are overflowed, only triggering
>> one fake event is enough. The fake event causes the KVM PMI callback to
>> be called, thereby injecting the PEBS overflow PMI into the guest.
>>
>> KVM will inject the PMI with BUFFER_OVF set, even if the guest DS is
>> empty. That should really be harmless. Thus the guest PEBS handler would
>> retrieve the correct information from its own PEBS records buffer.
>>
>> Originally-by: Andi Kleen <[email protected]>
>> Co-developed-by: Kan Liang <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> Signed-off-by: Like Xu <[email protected]>
>> ---
>> arch/x86/events/intel/core.c | 45 +++++++++++++++++++++++++++++++++++-
>> 1 file changed, 44 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 591d60cc8436..af9ac48fe840 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2747,6 +2747,46 @@ static void intel_pmu_reset(void)
>> local_irq_restore(flags);
>> }
>>
>> +/*
>> + * We may be running with guest PEBS events created by KVM, and the
>> + * PEBS records are logged into the guest's DS and invisible to host.
>> + *
>> + * In the case of guest PEBS overflow, we only trigger a fake event
>> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
>> + * The guest will then vm-entry and check the guest DS area to read
>> + * the guest PEBS records.
>> + *
>> + * The contents and other behavior of the guest event do not matter.
>> + */
>> +static int x86_pmu_handle_guest_pebs(struct pt_regs *regs,
>> + struct perf_sample_data *data)
>> +{
>> + struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> + struct perf_event *event = NULL;
>> + int bit;
>> +
>> + if (!x86_pmu.pebs_active || !guest_pebs_idxs)
>> + return 0;
>> +
>> + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
>> + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
>> +
>> + event = cpuc->events[bit];
>> + if (!event->attr.precise_ip)
>> + continue;
>> +
>> + perf_sample_data_init(data, 0, event->hw.last_period);
>> + if (perf_event_overflow(event, data, regs))
>> + x86_pmu_stop(event, 0);
>> +
>> + /* Inject one fake event is enough. */
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
> Why the return value, it is ignored.
Thanks, I'll apply it.
>
>> +
>> static int handle_pmi_common(struct pt_regs *regs, u64 status)
>> {
>> struct perf_sample_data data;
>> @@ -2797,7 +2837,10 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>> u64 pebs_enabled = cpuc->pebs_enabled;
>>
>> handled++;
>> - x86_pmu.drain_pebs(regs, &data);
>> + if (x86_pmu.pebs_vmx && perf_guest_cbs && perf_guest_cbs->is_in_guest())
>> + x86_pmu_handle_guest_pebs(regs, &data);
>> + else
>> + x86_pmu.drain_pebs(regs, &data);
> Why is that else? Since we can't tell if the PMI was for the guest or
> for our own DS, we should check both, no?
Yes, it's helpful for the later usage and I'll apply it.
By the way, do you have any comments on patches 01, 03
and the changes related to intel_guest_get_msrs() (such as patch 09) ?
On 2021/4/6 20:47, Andi Kleen wrote:
>> AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
>> doesn't.
>> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
>> counter in Skylake.
>> Is there anything else that only Icelake supports in this patches set?
> Only Icelake server has the support for recovering from a EPT violation
> on the PEBS data structures. To use it on Skylake server you would
> need to pin the whole guest, but that is currently not done.
Sorry, some questions about "pin the whole guest". Do you mean VmPin
equals VmSize in "/proc/$(pidof qemu-kvm)/status"? Or just VmLck equals
VmSize? Or something else?
>> Besides, we have tried this patches set in Icelake. We can use pebs(eg:
>> "perf record -e cycles:pp")
>> when guest is kernel-5.11, but can't when kernel-4.18. Is there a minimum
>> guest kernel version requirement?
> You would need a guest kernel that supports Icelake server PEBS. 4.18
> would need backports for tht.
>
>
> -Andi
On Mon, Mar 29, 2021 at 01:41:27PM +0800, Like Xu wrote:
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 827886c12c16..3509b18478b9 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -74,11 +74,20 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> {
> struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> + bool skip_pmi = false;
>
> if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
> - __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
> + if (perf_event->attr.precise_ip) {
> + /* Indicate PEBS overflow PMI to guest. */
> + skip_pmi = test_and_set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
> + (unsigned long *)&pmu->global_status);
Is there actual concurrency here, or did you forget to type __?
And in case you're using vim, use something like: set cino=(0:0
> + } else
> + __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
> kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
>
> + if (skip_pmi)
> + return;
> +
> /*
> * Inject PMI. If vcpu was in a guest mode during NMI PMI
> * can be ejected on a guest mode re-entry. Otherwise we can't
On Wed, Apr 07, 2021 at 11:05:20AM +0800, Liuxiangdong (Aven, Cloud Infrastructure Service Product Dept.) wrote:
>
>
> On 2021/4/6 20:47, Andi Kleen wrote:
> > > AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> > > doesn't.
> > > But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> > > counter in Skylake.
> > > Is there anything else that only Icelake supports in this patches set?
> > Only Icelake server has the support for recovering from a EPT violation
> > on the PEBS data structures. To use it on Skylake server you would
> > need to pin the whole guest, but that is currently not done.
> Sorry. Some questions about "Pin the whole guest". Do you mean VmPin equals
> VmSize
> in "/proc/$(pidof qemu-kvm)/status"? Or just VmLck equals VmSize? Or
> something else?
Either would be sufficient. All that matters is that the EPT pages don't get
unmapped ever while PEBS is active.
-Andi
On Mon, Mar 29, 2021 at 01:41:28PM +0800, Like Xu wrote:
> + if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
> + if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
> + pmu->pebs_enable_mask = ~pmu->global_ctrl;
> + pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> + pmu->fixed_ctr_ctrl_mask &=
> + ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
{ }
> + } else
> + pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
{ }
> + } else {
> + vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
as you already do here..
> + }
> }
On Mon, Mar 29, 2021 at 01:41:30PM +0800, Like Xu wrote:
> @@ -3863,6 +3864,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> arr[2].host = (unsigned long)ds;
> arr[2].guest = pmu->ds_area;
*nr = 3;
>
> + if (baseline) {
> + arr[3].msr = MSR_PEBS_DATA_CFG;
> + arr[3].host = cpuc->pebs_data_cfg;
> + arr[3].guest = pmu->pebs_data_cfg;
*nr = 4;
> + }
> +
> /*
> * If PMU counter has PEBS enabled it is not enough to
> * disable counter on a guest entry since PEBS memory
> @@ -3879,9 +3886,11 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> else {
> arr[1].guest = arr[1].host;
> arr[2].guest = arr[2].host;
> + if (baseline)
> + arr[3].guest = arr[3].host;
> }
>
> - *nr = 3;
> + *nr = baseline ? 4 : 3;
And you don't need yet another branch to determine a value you already
know.
> }
On Mon, Mar 29, 2021 at 01:41:28PM +0800, Like Xu wrote:
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 546d6ecf0a35..9afcad882f4f 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -186,6 +186,12 @@
> #define MSR_IA32_DS_AREA 0x00000600
> #define MSR_IA32_PERF_CAPABILITIES 0x00000345
> #define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6
> +#define PERF_CAP_PEBS_TRAP BIT_ULL(6)
> +#define PERF_CAP_ARCH_REG BIT_ULL(7)
> +#define PERF_CAP_PEBS_FORMAT 0xf00
> +#define PERF_CAP_PEBS_BASELINE BIT_ULL(14)
> +#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
> + PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
broken indentation
On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote:
> @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
>
> if (arr[1].guest)
> arr[0].guest |= arr[1].guest;
> - else
> + else {
> arr[1].guest = arr[1].host;
> + arr[2].guest = arr[2].host;
> + }
What's all this gibberish?
The way I read that it says:
if guest has PEBS_ENABLED
guest GLOBAL_CTRL |= PEBS_ENABLED
otherwise
guest PEBS_ENABLED = host PEBS_ENABLED
guest DS_AREA = host DS_AREA
which is just completely random garbage afaict. Why would you leak host
msrs into the guest? Why would you change guest GLOBAL_CTRL implicitly;
guest had better wrmsr that himself to control when stuff is enabled.
This just cannot be right.
On 2021/4/6 13:14, Xu, Like wrote:
> Hi Xiangdong,
>
> On 2021/4/6 11:24, Liuxiangdong (Aven, Cloud Infrastructure Service
> Product Dept.) wrote:
>> Hi,like.
>> Some questions about this new pebs patches set:
>> https://lore.kernel.org/kvm/[email protected]/
>>
>>
>> The new hardware facility supporting guest PEBS is only available
>> on Intel Ice Lake Server platforms for now.
>
> Yes, we have documented this "EPT-friendly PEBS" capability in the SDM
> 18.3.10.1 Processor Event Based Sampling (PEBS) Facility
>
> And again, this patch set doesn't officially support guest PEBS on the
> Skylake.
>
>>
>>
>> AFAIK, Icelake supports adaptive PEBS and extended PEBS which
>> Skylake doesn't.
>> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
>> counter in Skylake.
>
> For Skylake, only the PMC0-PMC3 are valid for PEBS and you may
> mask the other unsupported bits in the pmu->pebs_enable_mask.
>
>> Is there anything else that only Icelake supports in this patches set?
>
> The PDIR counter on the Ice Lake is the fixed counter 0
> while the PDIR counter on the Sky Lake is the gp counter 1.
>
> You may also expose x86_pmu.pebs_vmx for Skylake in the 1st patch.
>
Yes. In fact, I have tried using this patch set on Skylake after these
modifications:
1. Expose x86_pmu.pebs_vmx for Skylake.
2. Use PMC0-PMC3 for PEBS:
2.1 Replace "INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed" with
"x86_pmu.max_pebs_events" in x86_pmu_handle_guest_pebs().
2.2 Mask the other unsupported bits in pmu->pebs_enable_mask.
IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14]
is always 0 on Skylake, so pmu->pebs_enable_mask equals
((1ull << pmu->nr_arch_gp_counters) - 1).
2.3 Replace "pmc->idx == 32" with "pmc->idx == 1" because the
PDIR counter on Skylake is general-purpose counter 1.
3. Drop patch 09 because Skylake does not support adaptive PEBS.
4. Bypass all the CPU model checks in this patch set, just for testing.
But, unfortunately, the guest records samples for only a few seconds
and then the host reliably soft-lockups.
Is there anything wrong?
>>
>>
>> Besides, we have tried this patches set in Icelake. We can use
>> pebs(eg: "perf record -e cycles:pp")
>> when guest is kernel-5.11, but can't when kernel-4.18. Is there a
>> minimum guest kernel version requirement?
>
> The Ice Lake CPU model has been added since v5.4.
>
> You may double check whether the stable tree(s) code has
> INTEL_FAM6_ICELAKE in the arch/x86/include/asm/intel-family.h.
>
>>
>>
>> Thanks,
>> Xiangdong Liu
>
Hi Peter,
Thanks for your detailed comments.
If you have more comments for other patches, please let me know.
On 2021/4/7 23:39, Peter Zijlstra wrote:
> On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote:
>> @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
>>
>> if (arr[1].guest)
>> arr[0].guest |= arr[1].guest;
>> - else
>> + else {
>> arr[1].guest = arr[1].host;
>> + arr[2].guest = arr[2].host;
>> + }
> What's all this gibberish?
>
> The way I read that it says:
>
> if guest has PEBS_ENABLED
> guest GLOBAL_CTRL |= PEBS_ENABLED
> otherwise
> guest PEBS_ENABLED = host PEBS_ENABLED
> guest DS_AREA = host DS_AREA
>
> which is just completely random garbage afaict. Why would you leak host
> msrs into the guest?
In fact, this is not a leak at all.
When we do the "arr[i].guest = arr[i].host;" assignment in
intel_guest_get_msrs(), KVM will check "if (msrs[i].host ==
msrs[i].guest)" and, if so, disable the atomic switch for this MSR
across the VMX transition in the caller, atomic_switch_perf_msrs().
In that case, the MSR value doesn't change and any guest write will be
trapped. If the next check yields "msrs[i].host != msrs[i].guest", the
atomic switch will be triggered again.
Compared to before, this part of the logic has not changed, which helps
to reduce overhead.
> Why would you change guest GLOBAL_CTRL implicitly;
This is because in the early part of this function, we have operations:
if (x86_pmu.flags & PMU_FL_PEBS_ALL)
arr[0].guest &= ~cpuc->pebs_enabled;
else
arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
and if guest has PEBS_ENABLED, we need these bits back for PEBS counters:
arr[0].guest |= arr[1].guest;
> guest had better wrmsr that himself to control when stuff is enabled.
At vm-entry, the MSR value of GLOBAL_CTRL on the hardware may be
different from the trapped value "pmu->global_ctrl" written by the guest.
If the perf scheduler cross-maps guest counter X to host counter Y,
we have to enable bit Y in GLOBAL_CTRL before vm-entry rather than bit X.
>
> This just cannot be right.
On Thu, Apr 08, 2021 at 01:39:49PM +0800, Xu, Like wrote:
> Hi Peter,
>
> Thanks for your detailed comments.
>
> If you have more comments for other patches, please let me know.
>
> On 2021/4/7 23:39, Peter Zijlstra wrote:
> > On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote:
> > > @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> > > if (arr[1].guest)
> > > arr[0].guest |= arr[1].guest;
> > > - else
> > > + else {
> > > arr[1].guest = arr[1].host;
> > > + arr[2].guest = arr[2].host;
> > > + }
> > What's all this gibberish?
> >
> > The way I read that it says:
> >
> > if guest has PEBS_ENABLED
> > guest GLOBAL_CTRL |= PEBS_ENABLED
> > otherwise
> > guest PEBS_ENABLED = host PEBS_ENABLED
> > guest DS_AREA = host DS_AREA
> >
> > which is just completely random garbage afaict. Why would you leak host
> > msrs into the guest?
>
> In fact, this is not a leak at all.
>
> When we do "arr[i].guest = arr[i].host;" assignment in the
> intel_guest_get_msrs(), the KVM will check "if (msrs[i].host ==
> msrs[i].guest)" and if so, it disables the atomic switch for this msr
> during vmx transaction in the caller atomic_switch_perf_msrs().
Another marvel of bad coding style that function is :-( Lots of missing
{} and indentation fail.
This is terrible though, why would we clear the guest MSRs when it
changes PEBS_ENABLED. The guest had better clear them itself. Removing
guest DS_AREA just because we don't have any bits set in PEBS_ENABLED is
wrong and could very well break all sorts of drivers.
> In that case, the msr value doesn't change and any guest write will be
> trapped. If the next check is "msrs[i].host != msrs[i].guest", the
> atomic switch will be triggered again.
>
> Compared to before, this part of the logic has not changed, which helps to
> reduce overhead.
It's unreadable garbage at best. If you don't want it changed, then
don't add it to the arr[] thing in the first place.
> > Why would you change guest GLOBAL_CTRL implicitly;
>
> This is because in the early part of this function, we have operations:
>
> 	if (x86_pmu.flags & PMU_FL_PEBS_ALL)
> 		arr[0].guest &= ~cpuc->pebs_enabled;
> 	else
> 		arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
>
> and if guest has PEBS_ENABLED, we need these bits back for PEBS counters:
>
> 	arr[0].guest |= arr[1].guest;
I don't think that's right, who's to say they were set in the first
place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT
time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL,
that's wrong.
> > guest had better wrmsr that himself to control when stuff is enabled.
>
> When vm_entry, the msr value of GLOBAL_CTRL on the hardware may be
> different from trapped value "pmu->global_ctrl" written by the guest.
>
> If the perf scheduler cross maps guest counter X to the host counter Y,
> we have to enable the bit Y in GLOBAL_CTRL before vm_entry rather than X.
Sure, but I don't see that happening here.
On 2021/4/8 15:52, Peter Zijlstra wrote:
> On Thu, Apr 08, 2021 at 01:39:49PM +0800, Xu, Like wrote:
>> Hi Peter,
>>
>> Thanks for your detailed comments.
>>
>> If you have more comments for other patches, please let me know.
>>
>> On 2021/4/7 23:39, Peter Zijlstra wrote:
>>> On Mon, Mar 29, 2021 at 01:41:29PM +0800, Like Xu wrote:
>>>> @@ -3869,10 +3876,12 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
>>>> if (arr[1].guest)
>>>> arr[0].guest |= arr[1].guest;
>>>> - else
>>>> + else {
>>>> arr[1].guest = arr[1].host;
>>>> + arr[2].guest = arr[2].host;
>>>> + }
>>> What's all this gibberish?
>>>
>>> The way I read that it says:
>>>
>>> if guest has PEBS_ENABLED
>>> guest GLOBAL_CTRL |= PEBS_ENABLED
>>> otherwise
>>> guest PEBS_ENABLED = host PEBS_ENABLED
>>> guest DS_AREA = host DS_AREA
>>>
>>> which is just completely random garbage afaict. Why would you leak host
>>> msrs into the guest?
>> In fact, this is not a leak at all.
>>
>> When we do "arr[i].guest = arr[i].host;" assignment in the
>> intel_guest_get_msrs(), the KVM will check "if (msrs[i].host ==
>> msrs[i].guest)" and if so, it disables the atomic switch for this msr
>> during vmx transaction in the caller atomic_switch_perf_msrs().
> Another marvel of bad coding style that function is :-( Lots of missing
> {} and indentation fail.
Sorry for that and I'll fix them.
>
> This is terrible though, why would we clear the guest MSRs when it
> changes PEBS_ENABLED.
The values of arr[1].host and arr[1].guest depend on the arrangement of
host perf:
arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
rather than the guest value of PEBS_ENABLE.
When the value of this MSR differs across the VMX transition,
we will load arr[1].host after vm-exit and load arr[1].guest before vm-entry.
If the value of this MSR is the same before and after the VMX transition,
we do nothing and keep the original value in the register.
> The guest had better clear them itself.
I don't understand what you are referring to here.
Can you explain what you think the correct behavior is here?
> Removing
> guest DS_AREA just because we don't have any bits set in PEBS_ENABLED is
> wrong and could very well break all sorts of drivers.
Except for PEBS, no other features that rely on DS_AREA are available in
the guest.
Can you explain more of your concerns about the DS_AREA switch?
>
>> In that case, the msr value doesn't change and any guest write will be
>> trapped. If the next check is "msrs[i].host != msrs[i].guest", the
>> atomic switch will be triggered again.
>>
>> Compared to before, this part of the logic has not changed, which helps to
>> reduce overhead.
> It's unreadable garbage at best. If you don't want it changed, then
> don't add it to the arr[] thing in the first place.
Thanks, adding GLOBAL_CTRL to arr[] in the last step is a better choice.
>
>>> Why would you change guest GLOBAL_CTRL implicitly;
>> This is because in the early part of this function, we have operations:
>>
>> if (x86_pmu.flags & PMU_FL_PEBS_ALL)
>> arr[0].guest &= ~cpuc->pebs_enabled;
>> else
>> arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
>>
>> and if guest has PEBS_ENABLED, we need these bits back for PEBS counters:
>>
>> arr[0].guest |= arr[1].guest;
> I don't think that's right, who's to say they were set in the first
> place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT
> time.
Please note the guest GLOBAL_CTRL value is stored in the pmu->global_ctrl,
while the actual loaded value for GLOBAL_CTRL msr after vm-entry is
"x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask".
> You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL,
> that's wrong.
The determination of the MSR values before and after the VMX transition
is always done in the context of host perf, which means the PEBS
perf_events created by KVM are all legally scheduled and in use, and it
does not depend on the guest values at all.
>
>>> guest had better wrmsr that himself to control when stuff is enabled.
>> When vm_entry, the msr value of GLOBAL_CTRL on the hardware may be
>> different from trapped value "pmu->global_ctrl" written by the guest.
>>
>> If the perf scheduler cross maps guest counter X to the host counter Y,
>> we have to enable the bit Y in GLOBAL_CTRL before vm_entry rather than X.
> Sure, but I don't see that happening here.
Just fire away with questions if we're not on the same page or if you're missing KVM context.
Hi Peter,
On 2021/4/8 15:52, Peter Zijlstra wrote:
>> This is because in the early part of this function, we have operations:
>>
>> if (x86_pmu.flags & PMU_FL_PEBS_ALL)
>> arr[0].guest &= ~cpuc->pebs_enabled;
>> else
>> arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
>>
>> and if guest has PEBS_ENABLED, we need these bits back for PEBS counters:
>>
>> arr[0].guest |= arr[1].guest;
I can't quite follow this comment; would you explain more?
> I don't think that's right, who's to say they were set in the first
> place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT
> time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL,
> that's wrong.
>
To address your previous comments, does the code below look good to you?
static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
struct kvm_pmu *pmu = (struct kvm_pmu *)data;
u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ?
cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK);
int i = 0;
arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
arr[i].guest &= ~pebs_mask;
if (!x86_pmu.pebs)
goto out;
/*
* If PMU counter has PEBS enabled it is not enough to
* disable counter on a guest entry since PEBS memory
* write can overshoot guest entry and corrupt guest
* memory. Disabling PEBS solves the problem.
*
* Don't do this if the CPU already enforces it.
*/
if (x86_pmu.pebs_no_isolation) {
i++;
arr[i].msr = MSR_IA32_PEBS_ENABLE;
arr[i].host = cpuc->pebs_enabled;
arr[i].guest = 0;
goto out;
}
if (!pmu || !x86_pmu.pebs_vmx)
goto out;
i++;
arr[i].msr = MSR_IA32_DS_AREA;
arr[i].host = (unsigned long)ds;
arr[i].guest = pmu->ds_area;
if (x86_pmu.intel_cap.pebs_baseline) {
i++;
arr[i].msr = MSR_PEBS_DATA_CFG;
arr[i].host = cpuc->pebs_data_cfg;
arr[i].guest = pmu->pebs_data_cfg;
}
i++;
arr[i].msr = MSR_IA32_PEBS_ENABLE;
arr[i].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
arr[i].guest = pebs_mask & ~cpuc->intel_ctrl_host_mask;
if (arr[i].host) {
/* Disable guest PEBS if host PEBS is enabled. */
arr[i].guest = 0;
} else {
/* Disable guest PEBS for cross-mapped PEBS counters. */
arr[i].guest &= ~pmu->host_cross_mapped_mask;
arr[0].guest |= arr[i].guest;
}
out:
*nr = ++i;
return arr;
}
On Fri, Apr 09, 2021 at 03:07:38PM +0800, Xu, Like wrote:
> Hi Peter,
>
> On 2021/4/8 15:52, Peter Zijlstra wrote:
> > > This is because in the early part of this function, we have operations:
> > >
> > > 	if (x86_pmu.flags & PMU_FL_PEBS_ALL)
> > > 		arr[0].guest &= ~cpuc->pebs_enabled;
> > > 	else
> > > 		arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
> > >
> > > and if guest has PEBS_ENABLED, we need these bits back for PEBS counters:
> > >
> > > ??? arr[0].guest |= arr[1].guest;
>
> > I don't think that's right, who's to say they were set in the first
> > place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT
> > time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL,
> > that's wrong.
> I can't keep up with you on this comment and would you explain more ?
Well, it could be I'm terminally confused on how virt works (I usually
am, it just doesn't make any sense ever).
On top of that this code doesn't have any comments to help.
So perf_guest_switch_msr has two msr values: guest and host.
In my naive understanding guest is the msr value the guest sees and host
is the value the host has. If it is not that, then the naming is just
misleading at best.
But thinking more about it, if these are fully emulated MSRs (which I
think they are), then there might actually be 3 different values, not 2.
We have the value the guest sees when it uses {RD,WR}MSR.
We have the value the hardware has when it runs a guest.
We have the value the hardware has when it doesn't run a guest.
And somehow this code does something, but I can't for the life of me
figure out what and how.
> To address your previous comments, does the code below look good to you?
>
> static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
> {
> 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> 	struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
> 	struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
> 	struct kvm_pmu *pmu = (struct kvm_pmu *)data;
> 	u64 pebs_mask = (x86_pmu.flags & PMU_FL_PEBS_ALL) ?
> 			cpuc->pebs_enabled : (cpuc->pebs_enabled & PEBS_COUNTER_MASK);
> 	int i = 0;
>
> 	arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> 	arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> 	arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> 	arr[i].guest &= ~pebs_mask;
>
> 	if (!x86_pmu.pebs)
> 		goto out;
>
> 	/*
> 	 * If PMU counter has PEBS enabled it is not enough to
> 	 * disable counter on a guest entry since PEBS memory
> 	 * write can overshoot guest entry and corrupt guest
> 	 * memory. Disabling PEBS solves the problem.
> 	 *
> 	 * Don't do this if the CPU already enforces it.
> 	 */
> 	if (x86_pmu.pebs_no_isolation) {
> 		i++;
> 		arr[i].msr = MSR_IA32_PEBS_ENABLE;
> 		arr[i].host = cpuc->pebs_enabled;
> 		arr[i].guest = 0;
> 		goto out;
> 	}
>
> 	if (!pmu || !x86_pmu.pebs_vmx)
> 		goto out;
>
> 	i++;
> 	arr[i].msr = MSR_IA32_DS_AREA;
> 	arr[i].host = (unsigned long)ds;
> 	arr[i].guest = pmu->ds_area;
>
> 	if (x86_pmu.intel_cap.pebs_baseline) {
> 		i++;
> 		arr[i].msr = MSR_PEBS_DATA_CFG;
> 		arr[i].host = cpuc->pebs_data_cfg;
> 		arr[i].guest = pmu->pebs_data_cfg;
> 	}
>
> 	i++;
> 	arr[i].msr = MSR_IA32_PEBS_ENABLE;
> 	arr[i].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
> 	arr[i].guest = pebs_mask & ~cpuc->intel_ctrl_host_mask;
>
> 	if (arr[i].host) {
> 		/* Disable guest PEBS if host PEBS is enabled. */
> 		arr[i].guest = 0;
> 	} else {
> 		/* Disable guest PEBS for cross-mapped PEBS counters. */
> 		arr[i].guest &= ~pmu->host_cross_mapped_mask;
> 		arr[0].guest |= arr[i].guest;
> 	}
>
> out:
> 	*nr = ++i;
> 	return arr;
> }
The ++ is in a weird location, if you place it after filling out an
entry it makes more sense I think. Something like:
	arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL;
	arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
	arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
	arr[i].guest &= ~pebs_mask;
	i++;
or, perhaps even like:
	arr[i++] = (struct perf_guest_switch_msr){
		.msr = MSR_CORE_PERF_GLOBAL_CTRL,
		.host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
		.guest = x86_pmu.intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
	};
But it doesn't address the fundamental confusion I seem to be having,
what actual msr value is what.
On 2021/4/9 15:59, Peter Zijlstra wrote:
> On Fri, Apr 09, 2021 at 03:07:38PM +0800, Xu, Like wrote:
>> Hi Peter,
>>
>> On 2021/4/8 15:52, Peter Zijlstra wrote:
>>>> This is because in the early part of this function, we have operations:
>>>>
>>>> 	if (x86_pmu.flags & PMU_FL_PEBS_ALL)
>>>> 		arr[0].guest &= ~cpuc->pebs_enabled;
>>>> 	else
>>>> 		arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
>>>>
>>>> and if guest has PEBS_ENABLED, we need these bits back for PEBS counters:
>>>>
>>>> 	arr[0].guest |= arr[1].guest;
>>> I don't think that's right, who's to say they were set in the first
>>> place? The guest's GLOBAL_CTRL could have had the bits cleared at VMEXIT
>>> time. You can't unconditionally add PEBS_ENABLED into GLOBAL_CTRL,
>>> that's wrong.
>> I can't keep up with you on this comment and would you explain more ?
> Well, it could be I'm terminally confused on how virt works (I usually
> am, it just doesn't make any sense ever).
I may help you a little on this.
>
> On top of that this code doesn't have any comments to help.
More comments will be added.
>
> So perf_guest_switch_msr has two msr values: guest and host.
>
> In my naive understanding guest is the msr value the guest sees and host
> is the value the host has. If it is not that, then the naming is just
> misleading at best.
>
> But thinking more about it, if these are fully emulated MSRs (which I
> think they are), then there might actually be 3 different values, not 2.
You are right about 3 different values.
>
> We have the value the guest sees when it uses {RD,WR}MSR.
> We have the value the hardware has when it runs a guest.
> We have the value the hardware has when it doesn't run a guest.
>
> And somehow this code does something, but I can't for the life of me
> figure out what and how.
Just focus on the last two values: the enable bits (in GLOBAL_CTRL and
PEBS_ENABLE) of "the value the hardware has when it runs a guest" are
exclusive with those of "the value the hardware has when it doesn't run
a guest."
>> To address your previous comments, does the code below look good to you?
>>
>> [...]
> The ++ is in a weird location, if you place it after filling out an
> entry it makes more sense I think. Something like:
>
> 	arr[i].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> 	arr[i].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> 	arr[i].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> 	arr[i].guest &= ~pebs_mask;
> 	i++;
>
> or, perhaps even like:
>
> 	arr[i++] = (struct perf_guest_switch_msr){
> 		.msr = MSR_CORE_PERF_GLOBAL_CTRL,
> 		.host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask,
> 		.guest = x86_pmu.intel_ctrl & (~cpuc->intel_ctrl_host_mask | ~pebs_mask),
> 	};
The later one looks good to me and I'll apply it.
> But it doesn't address the fundamental confusion I seem to be having,
> what actual msr value is what.
VMX hardware has the capability to switch MSR values atomically:
- at VM-entry, it loads the value of arr[i].guest into arr[i].msr;
- at VM-exit, it loads the value of arr[i].host into arr[i].msr.
intel_guest_get_msrs() populates the arr[i].guest and arr[i].host values
before each VM-entry, and its caller optimizes by skipping the switch
when arr[i].guest == arr[i].host.
Just let me know if you have more questions;
otherwise, I assume we have reached an agreement on this part of the code.
Do you have any comments or ideas about it ?
https://lore.kernel.org/kvm/[email protected]/
On 2021/4/6 13:14, Xu, Like wrote:
> Hi Xiangdong,
>
> On 2021/4/6 11:24, Liuxiangdong (Aven, Cloud Infrastructure Service
> Product Dept.) wrote:
>> Hi,like.
>> Some questions about this new pebs patches set:
>> https://lore.kernel.org/kvm/[email protected]/
>>
>>
>> The new hardware facility supporting guest PEBS is only available
>> on Intel Ice Lake Server platforms for now.
>
> Yes, we have documented this "EPT-friendly PEBS" capability in the SDM
> 18.3.10.1 Processor Event Based Sampling (PEBS) Facility
>
> And again, this patch set doesn't officially support guest PEBS on the
> Skylake.
>
>>
>>
>> AFAIK, Icelake supports adaptive PEBS and extended PEBS which
>> Skylake doesn't.
>> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
>> counter in Skylake.
>
> For Skylake, only the PMC0-PMC3 are valid for PEBS and you may
> mask the other unsupported bits in the pmu->pebs_enable_mask.
>
>> Is there anything else that only Icelake supports in this patches set?
>
> The PDIR counter on the Ice Lake is the fixed counter 0
> while the PDIR counter on the Sky Lake is the gp counter 1.
>
> You may also expose x86_pmu.pebs_vmx for Skylake in the 1st patch.
>
>>
>>
>> Besides, we have tried this patches set in Icelake. We can use
>> pebs(eg: "perf record -e cycles:pp")
>> when guest is kernel-5.11, but can't when kernel-4.18. Is there a
>> minimum guest kernel version requirement?
>
> The Ice Lake CPU model has been added since v5.4.
>
> You may double check whether the stable tree(s) code has
> INTEL_FAM6_ICELAKE in the arch/x86/include/asm/intel-family.h.
>
>>
>>
>> Thanks,
>> Xiangdong Liu
>
Hi Liuxiangdong,
On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service Product
Dept.) wrote:
> Do you have any comments or ideas about it ?
>
> https://lore.kernel.org/kvm/[email protected]/
My expectation is that there may be many fewer PEBS samples
on Skylake without any soft lockup.
You may need to confirm that the statement
"All that matters is that the EPT pages don't get
unmapped ever while PEBS is active"
holds at the kernel level.
Try "-overcommit mem-lock=on" for your qemu.
>
> On 2021/4/6 13:14, Xu, Like wrote:
> [...]
On 2021/4/9 16:46, Like Xu wrote:
> Hi Liuxiangdong,
>
> On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service
> Product Dept.) wrote:
>> Do you have any comments or ideas about it ?
>>
>> https://lore.kernel.org/kvm/[email protected]/
>
> My expectation is that there may be many fewer PEBS samples
> on Skylake without any soft lockup.
>
> You may need to confirm the statement
>
> "All that matters is that the EPT pages don't get
> unmapped ever while PEBS is active"
>
> is true in the kernel level.
>
> Try "-overcommit mem-lock=on" for your qemu.
>
We have used "-overcommit mem-lock=on" for qemu, but the soft lockup still
happens. It seems that an EPT violation happens when we use PEBS.
[ 5199.056246] Call Trace:
[ 5199.056248] _raw_spin_lock+0x1b/0x20
[ 5199.056251] follow_page_pte+0xf5/0x580
[ 5199.056258] __get_user_pages+0x1d6/0x750
[ 5199.056262] get_user_pages_unlocked+0xdc/0x310
[ 5199.056265] __gfn_to_pfn_memslot+0x12d/0x4d0 [kvm]
[ 5199.056304] try_async_pf+0xcc/0x250 [kvm]
[ 5199.056337] direct_page_fault+0x413/0xa90 [kvm]
[ 5199.056367] kvm_mmu_page_fault+0x77/0x5e0 [kvm]
[ 5199.056395] ? vprintk_emit+0xa2/0x240
[ 5199.056399] ? vmx_vmexit+0x1d/0x40 [kvm_intel]
[ 5199.056407] ? vmx_vmexit+0x11/0x40 [kvm_intel]
[ 5199.056412] vmx_handle_exit+0xfe/0x640 [kvm_intel]
[ 5199.056418] vcpu_enter_guest+0x904/0x1450 [kvm]
[ 5199.056445] ? kvm_apic_has_interrupt+0x44/0x80 [kvm]
[ 5199.056472] ? apic_has_interrupt_for_ppr+0x62/0x90 [kvm]
[ 5199.056498] ? kvm_arch_vcpu_ioctl_run+0xeb/0x550 [kvm]
[ 5199.056523] kvm_arch_vcpu_ioctl_run+0xeb/0x550 [kvm]
[ 5199.056547] kvm_vcpu_ioctl+0x23e/0x5b0 [kvm]
[ 5199.056568] __x64_sys_ioctl+0x8e/0xd0
[ 5199.056571] do_syscall_64+0x33/0x40
[ 5199.056574] entry_SYSCALL_64_after_hwframe+0x44/0xae
SDM 17.4.9.2 "Setting Up the DS Save Area" says:

The recording of branch records in the BTS buffer (or PEBS records in the
PEBS buffer) may not operate properly if accesses to the linear addresses
in any of the three DS save area sections cause page faults, VM exits, or
the setting of accessed or dirty flags in the paging structures (ordinary
or EPT). For that reason, system software should establish paging
structures (both ordinary and EPT) to prevent such occurrences.
Implications of this may be that an operating system should allocate this
memory from a non-paged pool and that system software cannot do "lazy"
page-table entry propagation for these pages. Some newer processor
generations support "lazy" EPT page-table entry propagation for PEBS; see
Section 18.3.10.1 and Section 18.9.5 for more information. A
virtual-machine monitor may choose to allow use of PEBS by guest software
only if EPT maps all guest-physical memory as present and read/write.
The reason why the soft lockup happens may be the unmapped EPT pages. So, do
we have a way to map all GPAs before we use PEBS on Skylake?
>>
>> On 2021/4/6 13:14, Xu, Like wrote:
>> [...]
>
> The reason why soft lockup happens may be the unmapped EPT pages. So, do we
> have a way to map all gpa
> before we use pebs on Skylake?
Can you configure a VT-d device? That will implicitly pin all pages for the
IOMMU. I *think* that should be enough for testing.
-Andi
Hi Like,
On 2021/4/9 16:46, Like Xu wrote:
> Hi Liuxiangdong,
>
> On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service
> Product Dept.) wrote:
>> Do you have any comments or ideas about it ?
>>
>> https://lore.kernel.org/kvm/[email protected]/
>
> My expectation is that there may be many fewer PEBS samples
> on Skylake without any soft lockup.
>
> You may need to confirm the statement
>
> "All that matters is that the EPT pages don't get
> unmapped ever while PEBS is active"
>
> is true in the kernel level.
>
> Try "-overcommit mem-lock=on" for your qemu.
>
Sorry, in fact, I don't quite understand
"My expectation is that there may be many fewer PEBS samples on Skylake
without any soft lockup. "
And, I have used "-overcommit mem-lock=on" when soft lockup happens.
Now, I have tried to configure 1G hugepages for a 2G-mem VM. Each of the
guest NUMA nodes has 1G of memory.
When I use PEBS (perf record -e cycles:pp) in the guest, there are successful
PEBS samples just for a while, and then I cannot get PEBS samples. The host
doesn't soft lockup in this process.
Is there something wrong on Skylake such that we can only get a few samples?
IRQ? Or is using hugepages not effective?
Thanks!
>>
>> On 2021/4/6 13:14, Xu, Like wrote:
>> [...]
>
On 2021/4/12 23:25, Andi Kleen wrote:
>> The reason why soft lockup happens may be the unmapped EPT pages. So, do we
>> have a way to map all gpa
>> before we use pebs on Skylake?
> Can you configure a VT-d device, that will implicitly pin all pages for the
> IOMMU. I *think* that should be enough for testing.
>
> -Andi
Thanks!
But it doesn't seem to work, because the host still hits a soft lockup when
I configure an SR-IOV direct network card for the VM.
Besides, I have tried to configure 1G hugepages for a 2G-mem VM. Each of the
guest NUMA nodes has 1G of memory.
When I use PEBS (perf record -e cycles:pp) in the guest, there are
successful PEBS samples on Skylake just for a while, and then I cannot get
PEBS samples. The host doesn't soft lockup in this process.
Is this method effective? Is there something wrong on Skylake such that we
can only get a few samples? Maybe IRQ?
On 2021/4/14 22:49, Liuxiangdong wrote:
> Hi Like,
>
> On 2021/4/9 16:46, Like Xu wrote:
>> Hi Liuxiangdong,
>>
>> On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service
>> Product Dept.) wrote:
>>> Do you have any comments or ideas about it ?
>>>
>>> https://lore.kernel.org/kvm/[email protected]/
>>
>> My expectation is that there may be many fewer PEBS samples
>> on Skylake without any soft lockup.
>>
>> You may need to confirm the statement
>>
>> "All that matters is that the EPT pages don't get
>> unmapped ever while PEBS is active"
>>
>> is true in the kernel level.
>>
>> Try "-overcommit mem-lock=on" for your qemu.
>>
>
> Sorry, in fact, I don't quite understand
> "My expectation is that there may be many fewer PEBS samples on Skylake
> without any soft lockup. "
For testcase: perf record -e instructions:pp ./workload
We can get 2242 samples on the ICX guest, but
only 17 samples or less on the Skylake guest.
In my testcase on Skylake, neither the host nor the guest triggered the
soft lockup.
>
> And, I have used "-overcommit mem-lock=on" when soft lockup happens.
I misunderstood the use of "mem-lock=on". It is not the same as pinning the
guest memory, and I believe more kernel patches are needed.
>
>
> Now, I have tried to configure 1G-hugepages for 2G-mem vm. Each of guest
> numa nodes has 1G mem.
> When I use pebs(perf record -e cycles:pp) in guest, there are successful
> pebs samples just for a while and
> then I cannot get pebs samples. Host doesn't soft lockup in this process.
In the worst case, no samples are expected.
>
> Are there something wrong on skylake for we can only get a few samples?
> IRQ? Or using hugepage is not effecitve?
The small number of samples comes from a hardware limitation:
Skylake doesn't have this "EPT-friendly PEBS" capability, and
some PEBS records will be lost when used by guests.
>
> Thanks!
>
>>>
>>> On 2021/4/6 13:14, Xu, Like wrote:
>>> [...]
>>
>
On 2021/4/15 9:38, Xu, Like wrote:
> On 2021/4/14 22:49, Liuxiangdong wrote:
>> Hi Like,
>>
>> On 2021/4/9 16:46, Like Xu wrote:
>>> Hi Liuxiangdong,
>>>
>>> On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service
>>> Product Dept.) wrote:
>>>> Do you have any comments or ideas about it ?
>>>>
>>>> https://lore.kernel.org/kvm/[email protected]/
>>>
>>> My expectation is that there may be many fewer PEBS samples
>>> on Skylake without any soft lockup.
>>>
>>> You may need to confirm the statement
>>>
>>> "All that matters is that the EPT pages don't get
>>> unmapped ever while PEBS is active"
>>>
>>> is true in the kernel level.
>>>
>>> Try "-overcommit mem-lock=on" for your qemu.
>>>
>>
>> Sorry, in fact, I don't quite understand
>> "My expectation is that there may be many fewer PEBS samples on
>> Skylake without any soft lockup. "
>
> For testcase: perf record -e instructions:pp ./workload
>
> We can get 2242 samples on the ICX guest, but
> only 17 samples or less on the Skylake guest.
>
> In my testcase on Skylake, neither the host nor the guest triggered
> the soft lock.
>
Thanks for your explanation!
Could you please show your complete qemu command and qemu version used
on Skylake?
I hope I can test it again according to your qemu cmd and version.
>>
>> And, I have used "-overcommit mem-lock=on" when soft lockup happens.
>
> I misunderstood the use of "mem-lock=on". It is not the same as the
> guest mem pin and I believe more kernel patches are needed.
>
>>
>>
>> Now, I have tried to configure 1G-hugepages for 2G-mem vm. Each of
>> guest numa nodes has 1G mem.
>> When I use pebs(perf record -e cycles:pp) in guest, there are
>> successful pebs samples just for a while and
>> then I cannot get pebs samples. Host doesn't soft lockup in this
>> process.
>
> In the worst case, no samples are expected.
>
>>
>> Are there something wrong on skylake for we can only get a few
>> samples? IRQ? Or using hugepage is not effecitve?
>
> The few samples comes from hardware limitation.
> The Skylake doesn't have this "EPT-Friendly PEBS" capability and
> some PEBS records will be lost when used by guests.
>
>>
>> Thanks!
>>
>>>>
>>>> On 2021/4/6 13:14, Xu, Like wrote:
>>>> [...]
>>>
>>
>
On 2021/4/15 10:49, Liuxiangdong wrote:
>
>
> On 2021/4/15 9:38, Xu, Like wrote:
>> On 2021/4/14 22:49, Liuxiangdong wrote:
>>> Hi Like,
>>>
>>> On 2021/4/9 16:46, Like Xu wrote:
>>>> Hi Liuxiangdong,
>>>>
>>>> On 2021/4/9 16:33, Liuxiangdong (Aven, Cloud Infrastructure Service
>>>> Product Dept.) wrote:
>>>>> Do you have any comments or ideas about it ?
>>>>>
>>>>> https://lore.kernel.org/kvm/[email protected]/
>>>>
>>>> My expectation is that there may be many fewer PEBS samples
>>>> on Skylake without any soft lockup.
>>>>
>>>> You may need to confirm the statement
>>>>
>>>> "All that matters is that the EPT pages don't get
>>>> unmapped ever while PEBS is active"
>>>>
>>>> is true in the kernel level.
>>>>
>>>> Try "-overcommit mem-lock=on" for your qemu.
>>>>
>>>
>>> Sorry, in fact, I don't quite understand
>>> "My expectation is that there may be many fewer PEBS samples on Skylake
>>> without any soft lockup. "
>>
>> For testcase: perf record -e instructions:pp ./workload
>>
>> We can get 2242 samples on the ICX guest, but
>> only 17 samples or less on the Skylake guest.
>>
>> In my testcase on Skylake, neither the host nor the guest triggered the
>> soft lock.
>>
>
> Thanks for your explanation!
> Could you please show your complete qemu command and qemu version used on
> Skylake?
> I hope I can test it again according to your qemu cmd and version.
A new version is released and you may have a try.
qemu command: "-enable-kvm -cpu host,migratable=no"
qemu base commit: db55d2c9239d445cb7f1fa8ede8e42bd339058f4
kvm base commit: f96be2deac9bca3ef5a2b0b66b71fcef8bad586d
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 63c55f45ca92..727f55400eaf 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5618,6 +5618,7 @@ __init int intel_pmu_init(void)
 	case INTEL_FAM6_KABYLAKE:
 	case INTEL_FAM6_COMETLAKE_L:
 	case INTEL_FAM6_COMETLAKE:
+		x86_pmu.pebs_vmx = 1;
 		x86_add_quirk(intel_pebs_isolation_quirk);
 		x86_pmu.late_ack = true;
 		memcpy(hw_cache_event_ids, skl_hw_cache_event_ids,
 		       sizeof(hw_cache_event_ids));
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 100a749251b8..9e37e3dbe3ae 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -150,9 +150,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
* the accuracy of the PEBS profiling result, because the "event IP"
* in the PEBS record is calibrated on the guest side.
*/
- attr.precise_ip = 1;
- if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
- attr.precise_ip = 3;
+ attr.precise_ip = x86_match_cpu(vmx_icl_pebs_cpu) ?
+ ((pmc->idx == 32) ? 3 : 1) : ((pmc->idx == 1) ? 3 : 1);
}
event = perf_event_create_kernel_counter(&attr, -1, current,
>
>
>>>
>>> And, I have used "-overcommit mem-lock=on" when soft lockup happens.
>>
>> I misunderstood the use of "mem-lock=on". It is not the same as
>> pinning guest memory, and I believe more kernel patches are needed.
>>
>>>
>>>
>>> Now, I have tried configuring 1G hugepages for a 2G-memory VM; each
>>> guest NUMA node has 1G of memory.
>>> When I use PEBS (perf record -e cycles:pp) in the guest, I get valid
>>> PEBS samples only for a while, and then no more samples come.
>>> The host doesn't hit a soft lockup during this process.
>>
>> In the worst case, no samples are expected.
>>
>>>
>>> Is there something wrong on Skylake such that we can only get a few
>>> samples? An IRQ issue? Or is using hugepages not effective?
>>
>> The few samples come from a hardware limitation.
>> Skylake doesn't have this "EPT-Friendly PEBS" capability, and
>> some PEBS records will be lost when PEBS is used by guests.
>>
>>>
>>> Thanks!
>>>
>>>>>
>>>>>
>>>>> On 2021/4/6 13:14, Xu, Like wrote:
>>>>>> Hi Xiangdong,
>>>>>>
>>>>>> On 2021/4/6 11:24, Liuxiangdong (Aven, Cloud Infrastructure Service
>>>>>> Product Dept.) wrote:
>>>>>>> Hi, Like.
>>>>>>> Some questions about this new PEBS patch set:
>>>>>>> https://lore.kernel.org/kvm/[email protected]/
>>>>>>>
>>>>>>>
>>>>>>> The new hardware facility supporting guest PEBS is only available
>>>>>>> on Intel Ice Lake Server platforms for now.
>>>>>>
>>>>>> Yes, we have documented this "EPT-friendly PEBS" capability in the SDM
>>>>>> 18.3.10.1 Processor Event Based Sampling (PEBS) Facility
>>>>>>
>>>>>> And again, this patch set doesn't officially support guest PEBS on
>>>>>> Skylake.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> AFAIK, Ice Lake supports adaptive PEBS and extended PEBS, which
>>>>>>> Skylake doesn't.
>>>>>>> But we can still use the IA32_PEBS_ENABLE MSR to enable PEBS on
>>>>>>> general-purpose counters on Skylake.
>>>>>>
>>>>>> For Skylake, only PMC0-PMC3 are valid for PEBS, and you may
>>>>>> mask the other unsupported bits in the pmu->pebs_enable_mask.
>>>>>>
>>>>>>> Is there anything else that only Ice Lake supports in this patch set?
>>>>>>
>>>>>> The PDIR counter on Ice Lake is fixed counter 0,
>>>>>> while the PDIR counter on Skylake is general-purpose counter 1.
>>>>>>
>>>>>> You may also expose x86_pmu.pebs_vmx for Skylake in the 1st patch.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Besides, we have tried this patch set on Ice Lake. We can use
>>>>>>> PEBS (e.g. "perf record -e cycles:pp")
>>>>>>> when the guest runs kernel 5.11, but not with kernel 4.18. Is there
>>>>>>> a minimum guest kernel version requirement?
>>>>>>
>>>>>> The Ice Lake CPU model has been supported since v5.4.
>>>>>>
>>>>>> You may double-check whether the stable tree(s) code has
>>>>>> INTEL_FAM6_ICELAKE in arch/x86/include/asm/intel-family.h.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Xiangdong Liu
>>>>>>
>>>>>
>>>>
>>>
>>
>