Precise Event Based Sampling (PEBS), as supported on Intel Ice Lake server
platforms, can provide the architectural state of the instruction executed
after the instruction that caused the event. This patch set enables the
PEBS via DS feature for Linux (and non-Linux) KVM guests on Ice Lake.
A Linux guest can use the PEBS feature just as it does natively:
# perf record -e instructions:ppp ./br_instr a
# perf record -c 100000 -e instructions:pp ./br_instr a
If counter_freezing is not enabled on the host, guest PEBS is disabled
on purpose while the host is using the PEBS facility. By default,
KVM does not allow guest PEBS and host PEBS to coexist.
The patch set is divided into three parts. The first two parts enable
the basic PEBS via DS feature, could be considered for merging on their
own, and are not expected to cause any host perf regression.
- part 1: patch 0001-0003: preparation and minor fixes for vPMU overall;
- part 2: patch 0004-0012: enable guest PEBS when the guest PEBS counters
are not cross-mapped, which is the most common case in practice;
- part 3: patch 0013-0017: enable guest PEBS when cross-mapping happens;
Compared to the first version, an important change here is the removal
of the forced 1:1 mapping of virtual to physical PMCs; the cross-mapping
case is now handled carefully in part 3, which should address the
artificial-competition concern raised by PeterZ.
In general, there are two code paths to emulate the guest PEBS facility.
1) Fast path (part 2)
This is when the host-assigned physical PMC has the same index as the
virtual PMC (e.g. using physical PMC0 to emulate virtual PMC0).
It works like the 1:1 mapping that we did in the first version.
2) Slow path (part 3)
This is when the host-assigned physical PMC has a different index from
the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0).
In this case, KVM needs to rewrite the PEBS records to change the
applicable counter indexes to the virtual PMC indexes (they would
otherwise contain the physical counter indexes written by the PEBS
facility), and to switch the counter reset values to the offsets
corresponding to the physical counter indexes in the DS data structure.
In the slow path, KVM also disables large PEBS by rewriting the
pebs_interrupt_threshold field in the DS area so that only one record
fits before a PMI.
This is because a guest may implicitly drain its PEBS buffer, e.g. on a
context switch, without KVM getting a chance to update the buffer first,
and the physical PMC indexes would then confuse the guest. The
difficulty grows when multiple events get rescheduled inside the guest.
Hence disabling large PEBS in this case is an easy and safe way to keep
things correct as an initial step.
We don't expect this to break any guest code, which can generally
tolerate earlier PMIs. In the fast path with the 1:1 mapping this is
not needed.
The rewriting work is performed before delivering a vPMI to the guest
to notify it to read the records (i.e. before entering the guest, where
interrupts have been disabled so no counter rescheduling can happen on
the host at that point).
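To make the slow-path rewrite concrete, here is a minimal user-space
sketch (not code from this series; the mapping table and function names
are made up for illustration) of translating the "Applicable Counters"
bitmask from physical to virtual counter indexes:

#include <stdint.h>
#include <stdio.h>

static uint64_t remap_applicable_counters(uint64_t host_status,
					  const int host_to_guest[64])
{
	uint64_t guest_status = 0;
	int bit;

	for (bit = 0; bit < 64; bit++) {
		if (!(host_status & (1ULL << bit)))
			continue;
		if (host_to_guest[bit] >= 0)
			guest_status |= 1ULL << host_to_guest[bit];
	}
	return guest_status;
}

int main(void)
{
	int host_to_guest[64];
	int i;

	for (i = 0; i < 64; i++)
		host_to_guest[i] = -1;
	host_to_guest[1] = 0;	/* physical PMC1 backs virtual PMC0 */

	/* PEBS wrote physical index 1; the guest expects index 0. */
	printf("remapped mask: 0x%llx\n",
	       (unsigned long long)remap_applicable_counters(1ULL << 1,
							     host_to_guest));
	return 0;
}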
In summary, this patch set enables guest PEBS to retrieve the correct
information from its own PEBS records on Ice Lake server platforms as
long as the host is not using the PEBS facility at the same time, and
it is expected to keep working after migration to another Ice Lake
server.
Here are the results of the PEBS test from the guest and the host for the same workload:
perf report on guest:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
# Overhead Command Shared Object Symbol
57.74% br_instr br_instr [.] lfsr_cond
41.40% br_instr br_instr [.] cmp_end
0.21% br_instr [kernel.kallsyms] [k] __lock_acquire
perf report on host:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
# Overhead Command Shared Object Symbol
57.90% br_instr br_instr [.] lfsr_cond
41.95% br_instr br_instr [.] cmp_end
0.05% br_instr [kernel.vmlinux] [k] lock_acquire
Conclusion: the profiling results in the guest are similar to those on the host.
Please check more details in each commit and feel free to comment.
v1->v2 Changelog:
- drop the 1:1 counter mapping proposal, as requested by PeterZ;
- *add the guest PEBS records rewrite proposal* (new patch 0013 - 0017);
- drop the co-existence of guest PEBS and host PEBS w/o counter_freezing;
- drop the auto-reload configuration for guest PEBS event;
- use attr.precise_ip = 3 for guest PEBS PDIR counter;
- move msr-switch code to perf_guest_get_msrs();
- new user interface to dis/enable guest PEBS via IA32_PERF_CAPABILITIES;
- general vPMU related fixup patches (patch 0001 - 0003);
- rebased to the latest kvm-queue;
Previous:
https://lore.kernel.org/kvm/[email protected]
Like Xu (17):
KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
KVM: vmx/pmu: Use IA32_PERF_CAPABILITIES to adjust features visibility
KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest
KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
KVM: x86/pmu: Disable guest PEBS when counters are cross-mapped
KVM: x86/pmu: Add hook to emulate pebs for cross-mapped counters
KVM: vmx/pmu: Limit pebs_interrupt_threshold in the guest DS area
KVM: vmx/pmu: Rewrite applicable_counters field in the guest PEBS record
KVM: x86/pmu: Save guest pebs reset value when a pebs counter is configured
KVM: x86/pmu: Adjust guest DS pebs reset counter values for mapped counter
arch/x86/events/intel/core.c | 46 +++++
arch/x86/events/intel/ds.c | 64 +++++++
arch/x86/include/asm/kvm_host.h | 15 ++
arch/x86/include/asm/msr-index.h | 6 +
arch/x86/kvm/pmu.c | 91 +++++++--
arch/x86/kvm/pmu.h | 20 ++
arch/x86/kvm/vmx/capabilities.h | 17 +-
arch/x86/kvm/vmx/pmu_intel.c | 313 ++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 29 +++
arch/x86/kvm/x86.c | 12 +-
10 files changed, 594 insertions(+), 19 deletions(-)
--
2.21.3
When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and
points to the linear address of the first byte of the DS buffer
management area, which is used to manage the PEBS records.
When guest PEBS is enabled and the guest value differs from the host
value, KVM will add the IA32_DS_AREA MSR to the MSR-switch list so that
the guest's DS value can be loaded into the real hardware before
VM-entry; the entry is removed when guest PEBS is disabled.
A WRMSR to the IA32_DS_AREA MSR raises #GP(0) if the source register
contains a non-canonical address. The switch of the IA32_DS_AREA MSR
also requires setting up a quiescent period so that pending host PEBS
records (if any) are written to the host DS area rather than the guest
DS area.
When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR is added to
perf_guest_switch_msr() and switched during VMX transitions, just like
the CORE_PERF_GLOBAL_CTRL MSR.
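For reference, a simplified user-space sketch of the DS management area
that IA32_DS_AREA points to (the field set and the pebs_event_reset[]
size are assumptions for illustration only; the authoritative layout is
in the SDM and arch/x86/events/intel/ds.c):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified DS management area; only the fields this series touches. */
struct ds_area_sketch {
	uint64_t bts_buffer_base;
	uint64_t bts_index;
	uint64_t bts_absolute_maximum;
	uint64_t bts_interrupt_threshold;
	uint64_t pebs_buffer_base;
	uint64_t pebs_index;
	uint64_t pebs_absolute_maximum;
	uint64_t pebs_interrupt_threshold;
	uint64_t pebs_event_reset[12];	/* 8 GP + 4 fixed slots, assumed */
};

int main(void)
{
	printf("pebs_buffer_base at +0x%zx, pebs_interrupt_threshold at +0x%zx\n",
	       offsetof(struct ds_area_sketch, pebs_buffer_base),
	       offsetof(struct ds_area_sketch, pebs_interrupt_threshold));
	return 0;
}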
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 13 +++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 11 +++++++++++
arch/x86/kvm/vmx/vmx.c | 6 ++++++
4 files changed, 31 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d824c7156d34..71b45376ee1f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3394,6 +3394,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+ struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
@@ -3440,6 +3441,18 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
*nr = 2;
}
+ if (arr[1].guest) {
+ arr[2].msr = MSR_IA32_DS_AREA;
+ arr[2].host = (unsigned long)ds;
+ /* KVM will update MSR_IA32_DS_AREA with the trapped guest value. */
+ arr[2].guest = 0ull;
+ *nr = 3;
+ } else if (*nr == 2) {
+ arr[2].msr = MSR_IA32_DS_AREA;
+ arr[2].host = arr[2].guest = 0;
+ *nr = 3;
+ }
+
return arr;
}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d7e895ae535..eb24c4796f8b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -449,6 +449,7 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
+ u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 2f10587bda19..ff5fc405703f 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -183,6 +183,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_PEBS_ENABLE:
ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
break;
+ case MSR_IA32_DS_AREA:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -227,6 +230,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_PEBS_ENABLE:
msr_info->data = pmu->pebs_enable;
return 0;
+ case MSR_IA32_DS_AREA:
+ msr_info->data = pmu->ds_area;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -294,6 +300,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_DS_AREA:
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ pmu->ds_area = data;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 83a16ae04b4e..94dbd79a8582 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -975,6 +975,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
return;
}
break;
+ case MSR_IA32_DS_AREA:
case MSR_IA32_PEBS_ENABLE:
/* PEBS needs a quiescent period after being disabled (to write
* a record). Disabling PEBS through VMX MSR swapping doesn't
@@ -6536,12 +6537,17 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
int i, nr_msrs;
struct perf_guest_switch_msr *msrs;
+ struct kvm_vcpu *vcpu = &vmx->vcpu;
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
msrs = perf_guest_get_msrs(&nr_msrs);
if (!msrs)
return;
+ if (nr_msrs > 2 && msrs[1].guest)
+ msrs[2].guest = pmu->ds_area;
+
for (i = 0; i < nr_msrs; i++)
if (msrs[i].host == msrs[i].guest)
clear_atomic_switch_msr(vmx, msrs[i].msr);
--
2.21.3
The PEBS-PDIR facility on Ice Lake server is supported on IA32_FIXED0
only. If the guest configures counter index 32 (i.e. fixed counter 0)
with PEBS enabled, the PEBS-PDIR facility is supposed to be used, in
which case KVM adjusts attr.precise_ip to 3 and requests host perf to
assign exactly the requested counter or fail.
The CPU model check is also required since some platforms may place the
PEBS-PDIR facility on a different counter index.
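As a reminder of the counter index convention assumed here, a standalone
sketch (helper names are made up; 32 mirrors INTEL_PMC_IDX_FIXED):

#include <stdio.h>

#define SKETCH_IDX_FIXED	32	/* mirrors INTEL_PMC_IDX_FIXED */

/* Fixed counter 0 is vPMC index 32; only that counter is PDIR-capable
 * on Ice Lake, so only it gets the precise_ip = 3 treatment. */
static int pdir_precise_ip(int pmc_idx, int is_icelake)
{
	return (is_icelake && pmc_idx == SKETCH_IDX_FIXED) ? 3 : 1;
}

int main(void)
{
	printf("GP0 -> precise_ip %d, FIXED0 -> precise_ip %d\n",
	       pdir_precise_ip(0, 1), pdir_precise_ip(32, 1));
	return 0;
}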
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c6208234e007..f8aa4724d67b 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -143,6 +143,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
* in the PEBS record is calibrated on the guest side.
*/
attr.precise_ip = 1;
+ if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
+ attr.precise_ip = 3;
}
event = perf_event_create_kernel_counter(&attr, -1, current,
--
2.21.3
KVM checks whether a guest PEBS counter X is cross-mapped to a host
counter Y. In that case, the applicable_counters field in the guest
PEBS records is filled with the real host counter index(es), which is
incorrect. For now, KVM disables guest PEBS before VM-entry in this
situation; later patches add more emulation in KVM to keep the PEBS
functionality working as it does on the host, such as rewriting the
applicable_counters field in the guest PEBS records buffer.
The cross-mapped check must be done right before VM-entry but after
local_irq_disable(), since the perf scheduler could otherwise rotate
pmc->perf_event to another host counter or put the event into an error
state via the hrtimer irq.
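A minimal sketch of the cross-mapping test described above (structure
and function names are illustrative, not the KVM ones):

#include <stdbool.h>
#include <stdio.h>

struct vpmc_sketch {
	int guest_idx;	/* index the guest programmed */
	int host_idx;	/* index host perf actually assigned */
	bool enabled;
};

static bool any_cross_mapped(const struct vpmc_sketch *pmcs, int n)
{
	for (int i = 0; i < n; i++)
		if (pmcs[i].enabled && pmcs[i].guest_idx != pmcs[i].host_idx)
			return true;
	return false;
}

int main(void)
{
	struct vpmc_sketch pmcs[] = {
		{ .guest_idx = 0, .host_idx = 1, .enabled = true },
	};

	printf("cross-mapped: %d\n", any_cross_mapped(pmcs, 1));
	return 0;
}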
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 25 +++++++++++++++++++++++++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/vmx.c | 3 +++
arch/x86/kvm/x86.c | 4 ++++
5 files changed, 35 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37df29061a4d..bffb384485da 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -455,6 +455,8 @@ struct kvm_pmu {
u64 pebs_data_cfg;
u64 pebs_data_cfg_mask;
+ bool counter_cross_mapped;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index f8aa4724d67b..a6c5951a5728 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -549,3 +549,28 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
kfree(filter);
return r;
}
+
+/*
+ * The caller needs to ensure that there is no time window for
+ * perf hrtimer irq or any chance to reschedule pmc->perf_event.
+ */
+void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ int bit;
+
+ pmu->counter_cross_mapped = false;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc))
+ continue;
+
+ if (pmc->perf_event && (pmc->idx != pmc->perf_event->hw.idx)) {
+ pmu->counter_cross_mapped = true;
+ break;
+ }
+ }
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ee8f15cc4b5e..f5ec94e9a1dc 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -163,6 +163,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu);
bool is_vmware_backdoor_pmc(u32 pmc_idx);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3b62907c8959..302808ec9699 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6556,6 +6556,9 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
if (!msrs)
return;
+ if (pmu->counter_cross_mapped)
+ msrs[1].guest = 0;
+
if (nr_msrs > 2 && msrs[1].guest) {
msrs[2].guest = pmu->ds_area;
if (nr_msrs > 3)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b5963a36bf6b..88a544e6379f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8859,6 +8859,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* result in virtual interrupt delivery.
*/
local_irq_disable();
+
+ if (vcpu_to_pmu(vcpu)->global_ctrl & vcpu_to_pmu(vcpu)->pebs_enable)
+ kvm_pmu_counter_cross_mapped_check(vcpu);
+
vcpu->mode = IN_GUEST_MODE;
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
--
2.21.3
If a guest PEBS counter X is cross-mapped to a host counter Y, the
pebs_interrupt_threshold field in the guest DS area is changed before
VM-entry so that only ONE record fits, which helps KVM handle the
cross-mapping emulation more easily and accurately when the PEBS
overflow PMI is generated.
In most cases the guest counters are not scheduled in a cross-mapped
way, so there is no need to change the guest DS
pebs_interrupt_threshold, the applicable_counters fields in the guest
PEBS records are naturally correct, and the PEBS facility writes
multiple PEBS records into the guest DS without interception, keeping
performance good.
AFAIK, changing the pebs_interrupt_threshold value from the KVM side is
not expected to break any guest PEBS driver.
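A small sketch of the one-record threshold computation (the struct
mirrors the basic PEBS record group; the buffer address and names are
illustrative):

#include <stdint.h>
#include <stdio.h>

/* Basic PEBS record group: format_size, ip, applicable_counters, tsc. */
struct pebs_basic_sketch {
	uint64_t format_size;
	uint64_t ip;
	uint64_t applicable_counters;
	uint64_t tsc;
};

int main(void)
{
	uint64_t pebs_buffer_base = 0x7f0000100000ull;	/* example GVA */
	uint64_t threshold = pebs_buffer_base +
			     sizeof(struct pebs_basic_sketch);

	/* The first record written already crosses the threshold,
	 * so a PEBS overflow PMI is raised per record. */
	printf("one-record threshold: 0x%llx\n",
	       (unsigned long long)threshold);
	return 0;
}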
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.c | 17 +++-----
arch/x86/kvm/pmu.h | 11 +++++
arch/x86/kvm/vmx/pmu_intel.c | 71 +++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
5 files changed, 90 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index bffb384485da..77b529b8c16a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -456,6 +456,7 @@ struct kvm_pmu {
u64 pebs_data_cfg_mask;
bool counter_cross_mapped;
+ bool need_rewrite_ds_pebs_interrupt_threshold;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index f87be3c2140e..7c8e3ca5b7ad 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -471,17 +471,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
kvm_pmu_refresh(vcpu);
}
-static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
-{
- struct kvm_pmu *pmu = pmc_to_pmu(pmc);
-
- if (pmc_is_fixed(pmc))
- return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
- pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
-
- return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
-}
-
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
@@ -576,4 +565,10 @@ void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
break;
}
}
+
+ if (!pmu->counter_cross_mapped)
+ return;
+
+ if (pmu->need_rewrite_ds_pebs_interrupt_threshold)
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index b1e52e33f08c..6cdc9fd03195 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -147,6 +147,17 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
return sample_period;
}
+static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (pmc_is_fixed(pmc))
+ return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
+
+ return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 2917105e584e..346b1104e674 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -211,6 +211,23 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}
+static void intel_pmu_pebs_setup(struct kvm_pmu *pmu)
+{
+ struct kvm_pmc *pmc = NULL;
+ int bit;
+
+ pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (pmc && pmc_speculative_in_use(pmc)) {
+ pmu->need_rewrite_ds_pebs_interrupt_threshold = true;
+ break;
+ }
+ }
+}
+
static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -287,6 +304,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
if (kvm_valid_perf_global_ctrl(pmu, data)) {
global_ctrl_changed(pmu, data);
+ if (pmu->global_ctrl & pmu->pebs_enable)
+ intel_pmu_pebs_setup(pmu);
return 0;
}
break;
@@ -491,12 +510,64 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}
+static int rewrite_ds_pebs_interrupt_threshold(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct debug_store *ds = NULL;
+ u64 new_threshold, offset;
+ gpa_t gpa;
+ int srcu_idx, ret = -ENOMEM;
+
+ ds = kmalloc(sizeof(struct debug_store), GFP_KERNEL);
+ if (!ds)
+ goto out;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->ds_area, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto unlock_out;
+
+ if (kvm_read_guest(vcpu->kvm, gpa, ds, sizeof(struct debug_store)))
+ goto unlock_out;
+
+ /* Adding sizeof(struct pebs_basic) offset is enough to generate PMI. */
+ new_threshold = ds->pebs_buffer_base + sizeof(struct pebs_basic);
+ offset = offsetof(struct debug_store, pebs_interrupt_threshold);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->ds_area + offset, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto unlock_out;
+
+ if (kvm_write_guest(vcpu->kvm, gpa, &new_threshold, sizeof(u64)))
+ goto unlock_out;
+
+ ret = 0;
+
+unlock_out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+
+out:
+ kfree(ds);
+ return ret;
+}
+
void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ int ret;
if (!(pmu->global_ctrl & pmu->pebs_enable))
return;
+
+ if (pmu->counter_cross_mapped && pmu->need_rewrite_ds_pebs_interrupt_threshold) {
+ ret = rewrite_ds_pebs_interrupt_threshold(vcpu);
+ pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+ }
+
+ if (ret == -ENOMEM)
+ pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
+ else if (ret == -EFAULT)
+ pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}
struct kvm_pmu_ops intel_pmu_ops = {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 88a544e6379f..8db0811c1dd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5856,6 +5856,7 @@ gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
{
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, 0, exception);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_system);
static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
struct kvm_vcpu *vcpu, u32 access,
--
2.21.3
A guest PEBS counter X may be cross-mapped to a host counter Y, and the
PEBS facility reloads the reset value every time a PEBS record is
written to the guest DS, potentially continuing to generate PEBS
records before the guest has read the previous ones.
KVM will adjust the guest DS PEBS reset counter values for the exactly
mapped host counters, but before that it needs to save the original
guest reset counter values right after the counter is fully enabled via
a trap. We assume that every time the guest PEBS driver enables a
counter for large PEBS, it configures the DS reset counter values the
way Linux does.
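For illustration, a sketch of the pebs_event_reset[] slot calculation
assumed in this patch (the constants mirror MAX_PEBS_EVENTS and
INTEL_PMC_IDX_FIXED; the helper name is made up):

#include <stdio.h>

#define SKETCH_MAX_PEBS_EVENTS	8	/* mirrors MAX_PEBS_EVENTS */
#define SKETCH_IDX_FIXED	32	/* mirrors INTEL_PMC_IDX_FIXED */

/* GP counters use slots 0..7, fixed counters follow after them. */
static int reset_slot(int pmc_idx)
{
	return pmc_idx < SKETCH_IDX_FIXED ?
	       pmc_idx :
	       SKETCH_MAX_PEBS_EVENTS + pmc_idx - SKETCH_IDX_FIXED;
}

int main(void)
{
	printf("GP0 -> slot %d, FIXED0 -> slot %d\n",
	       reset_slot(0), reset_slot(32));
	return 0;
}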
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 51 +++++++++++++++++++++++++++++++--
2 files changed, 50 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cdc3c6efdd8e..32a677ff1e55 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,7 @@ struct kvm_pmc {
enum pmc_type type;
u8 idx;
u64 counter;
+ u64 reset_counter;
u64 eventsel;
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
@@ -458,6 +459,7 @@ struct kvm_pmu {
bool counter_cross_mapped;
bool need_rewrite_ds_pebs_interrupt_threshold;
bool need_rewrite_pebs_records;
+ bool need_save_reset_counter;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index d58d04ee13a5..f5a69addd7a8 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -217,12 +217,14 @@ static void intel_pmu_pebs_setup(struct kvm_pmu *pmu)
int bit;
pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+ pmu->need_save_reset_counter = false;
for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
if (pmc && pmc_speculative_in_use(pmc)) {
pmu->need_rewrite_ds_pebs_interrupt_threshold = true;
+ pmu->need_save_reset_counter = true;
break;
}
}
@@ -619,10 +621,48 @@ static int rewrite_ds_pebs_records(struct kvm_vcpu *vcpu)
return ret;
}
+static int save_ds_pebs_reset_values(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ struct debug_store *ds = NULL;
+ gpa_t gpa;
+ int srcu_idx, bit, idx, ret;
+
+ ds = kmalloc(sizeof(struct debug_store), GFP_KERNEL);
+ if (!ds)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->ds_area, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto out;
+
+ if (kvm_read_guest(vcpu->kvm, gpa, ds, sizeof(struct debug_store)))
+ goto out;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (pmc) {
+ idx = (pmc->idx < INTEL_PMC_IDX_FIXED) ?
+ pmc->idx : (MAX_PEBS_EVENTS + pmc->idx - INTEL_PMC_IDX_FIXED);
+ pmc->reset_counter = ds->pebs_event_reset[idx];
+ }
+ }
+ ret = 0;
+
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+ kfree(ds);
+ return ret;
+}
+
void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int ret1, ret2;
+ int ret1, ret2, ret3;
if (pmu->need_rewrite_pebs_records) {
pmu->need_rewrite_pebs_records = false;
@@ -637,11 +677,16 @@ void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
ret2 = rewrite_ds_pebs_interrupt_threshold(vcpu);
}
+ if (pmu->need_save_reset_counter) {
+ pmu->need_save_reset_counter = false;
+ ret3 = save_ds_pebs_reset_values(vcpu);
+ }
+
out:
- if (ret1 == -ENOMEM || ret2 == -ENOMEM)
+ if (ret1 == -ENOMEM || ret2 == -ENOMEM || ret3 == -ENOMEM)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
- else if (ret1 == -EFAULT || ret2 == -EFAULT)
+ else if (ret1 == -EFAULT || ret2 == -EFAULT || ret3 == -EFAULT)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}
--
2.21.3
The original PEBS reset counter value has been saved in
pmc->reset_counter. When the guest PEBS counter X is enabled, its reset
value RST-x is written to the guest DS reset field RST-y so that it is
auto-reloaded into the real host counter Y that is mapped to the guest
PEBS counter X during this VM-entry period.
KVM records the last host counter index for each guest PEBS counter and
triggers the reset value rewrite once any entry in the host-guest
counter mapping table changes before VM-entry.
Frequent changes in the mapping relationship should only happen when
perf multiplexes the counters with the default 1ms timer. The time cost
of adjusting the guest reset values does not exceed 1ms (13347ns
measured on ICX), so there is no race with the multiplexing timer that
could create a livelock.
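A user-space sketch of the rewrite target computation (the simplified
DS layout and all names are illustrative assumptions, not the kernel
definitions):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct ds_area_sketch {
	uint64_t bts_buffer_base;
	uint64_t bts_index;
	uint64_t bts_absolute_maximum;
	uint64_t bts_interrupt_threshold;
	uint64_t pebs_buffer_base;
	uint64_t pebs_index;
	uint64_t pebs_absolute_maximum;
	uint64_t pebs_interrupt_threshold;
	uint64_t pebs_event_reset[12];
};

#define SKETCH_MAX_PEBS_EVENTS	8
#define SKETCH_IDX_FIXED	32

/* The saved reset value of a vPMC is written into the guest DS slot of
 * the *host* counter that currently backs it. */
static size_t reset_offset(int host_idx)
{
	int slot = host_idx < SKETCH_IDX_FIXED ?
		   host_idx :
		   SKETCH_MAX_PEBS_EVENTS + host_idx - SKETCH_IDX_FIXED;

	return offsetof(struct ds_area_sketch, pebs_event_reset) +
	       sizeof(uint64_t) * slot;
}

int main(void)
{
	/* e.g. virtual PMC0 currently backed by host fixed counter 0 */
	printf("write reset value at ds_area + 0x%zx\n", reset_offset(32));
	return 0;
}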
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 15 +++++++++++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 47 ++++++++++++++++++++++++++++++---
4 files changed, 62 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 32a677ff1e55..93026e9361d9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,7 @@ struct kvm_pmc {
enum pmc_type type;
u8 idx;
u64 counter;
+ u8 host_idx;
u64 reset_counter;
u64 eventsel;
struct perf_event *perf_event;
@@ -460,6 +461,7 @@ struct kvm_pmu {
bool need_rewrite_ds_pebs_interrupt_threshold;
bool need_rewrite_pebs_records;
bool need_save_reset_counter;
+ bool need_rewrite_reset_counter;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 64dce19644e3..d12dbe07117e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -154,6 +154,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
if (IS_ERR(event)) {
pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
PTR_ERR(event), pmc->idx);
+ pmc->host_idx = -1;
return;
}
@@ -554,6 +555,7 @@ void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
int bit;
pmu->counter_cross_mapped = false;
+ pmu->need_rewrite_reset_counter = false;
for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
@@ -567,6 +569,19 @@ void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
}
}
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc))
+ continue;
+
+ if ((pmc->perf_event && (pmc->host_idx != pmc->perf_event->hw.idx))) {
+ pmu->need_rewrite_reset_counter = true;
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+ break;
+ }
+ }
+
if (!pmu->counter_cross_mapped)
return;
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 6cdc9fd03195..2776a048fd27 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -74,6 +74,7 @@ static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
pmc->perf_event = NULL;
pmc->current_config = 0;
pmc_to_pmu(pmc)->event_count--;
+ pmc->host_idx = -1;
}
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index f5a69addd7a8..0aab3a4f9e41 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -659,10 +659,46 @@ static int save_ds_pebs_reset_values(struct kvm_vcpu *vcpu)
return ret;
}
+static int rewrite_ds_pebs_reset_counters(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ gpa_t gpa;
+ int srcu_idx, bit, ret;
+ u64 offset, host_idx, idx;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc->perf_event)
+ continue;
+
+ host_idx = pmc->perf_event->hw.idx;
+ idx = (host_idx < INTEL_PMC_IDX_FIXED) ?
+ host_idx : (MAX_PEBS_EVENTS + host_idx - INTEL_PMC_IDX_FIXED);
+ offset = offsetof(struct debug_store, pebs_event_reset) + sizeof(u64) * idx;
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->ds_area + offset, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto out;
+
+ if (kvm_write_guest(vcpu->kvm, gpa, &pmc->reset_counter, sizeof(u64)))
+ goto out;
+
+ pmc->host_idx = pmc->perf_event->hw.idx;
+ }
+ ret = 0;
+
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+ return ret;
+}
+
void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int ret1, ret2, ret3;
+ int ret1, ret2, ret3, ret4;
if (pmu->need_rewrite_pebs_records) {
pmu->need_rewrite_pebs_records = false;
@@ -682,11 +718,16 @@ void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
ret3 = save_ds_pebs_reset_values(vcpu);
}
+ if (pmu->need_rewrite_reset_counter) {
+ pmu->need_rewrite_reset_counter = false;
+ ret4 = rewrite_ds_pebs_reset_counters(vcpu);
+ }
+
out:
- if (ret1 == -ENOMEM || ret2 == -ENOMEM || ret3 == -ENOMEM)
+ if (ret1 == -ENOMEM || ret2 == -ENOMEM || ret3 == -ENOMEM || ret4 == -ENOMEM)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
- else if (ret1 == -EFAULT || ret2 == -EFAULT || ret3 == -EFAULT)
+ else if (ret1 == -EFAULT || ret2 == -EFAULT || ret3 == -EFAULT || ret4 == -EFAULT)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}
--
2.21.3
The CPUID feature bits PDCM, DS and DTES64 are required for the PEBS
feature. KVM exposes the PDCM, DS and DTES64 CPUID bits to the guest
when PEBS is supported by KVM on the Ice Lake server platforms.
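A minimal guest-side sketch that checks the three CPUID.01H bits
involved (DS is EDX[21], DTES64 is ECX[2], PDCM is ECX[15]):

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;

	printf("DS=%u DTES64=%u PDCM=%u\n",
	       (edx >> 21) & 1, (ecx >> 2) & 1, (ecx >> 15) & 1);
	return 0;
}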
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.h | 6 ++++++
arch/x86/kvm/vmx/capabilities.h | 17 ++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
3 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 067fef51760c..ee8f15cc4b5e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -3,6 +3,7 @@
#define __KVM_X86_PMU_H
#include <linux/nospec.h>
+#include <asm/cpu_device_id.h>
#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
@@ -16,6 +17,11 @@
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002
#define MAX_FIXED_COUNTERS 3
+static const struct x86_cpu_id vmx_icl_pebs_cpu[] = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
+ {}
+};
struct kvm_event_hw_type_mapping {
u8 eventsel;
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3a1861403d73..2f22ce34b165 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -5,6 +5,7 @@
#include <asm/vmx.h>
#include "lapic.h"
+#include "pmu.h"
extern bool __read_mostly enable_vpid;
extern bool __read_mostly flexpriority_enabled;
@@ -369,13 +370,27 @@ static inline bool vmx_pt_mode_is_host_guest(void)
return pt_mode == PT_MODE_HOST_GUEST;
}
+static inline bool vmx_pebs_supported(void)
+{
+ return boot_cpu_has(X86_FEATURE_PEBS) && x86_match_cpu(vmx_icl_pebs_cpu);
+}
+
static inline u64 vmx_get_perf_capabilities(void)
{
/*
* Since counters are virtualized, KVM would support full
* width counting unconditionally, even if the host lacks it.
*/
- return PMU_CAP_FW_WRITES;
+ u64 value = PMU_CAP_FW_WRITES;
+ u64 perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
+
+ if (vmx_pebs_supported())
+ value |= perf_cap & PERF_CAP_PEBS_MASK;
+
+ return value;
}
#endif /* __KVM_X86_VMX_CAPS_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 954c15fb9b00..3b62907c8959 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2201,6 +2201,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
+ if (data & PERF_CAP_PEBS_FORMAT) {
+ if ((data & PERF_CAP_PEBS_MASK) !=
+ (vmx_get_perf_capabilities() & PERF_CAP_PEBS_MASK))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DS))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DTES64))
+ return 1;
+ if (boot_cpu_data.x86_model != guest_cpuid_model(vcpu))
+ return 1;
+ }
ret = kvm_set_msr_common(vcpu, msr_info);
break;
@@ -7293,6 +7304,10 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);
if (vmx_pt_mode_is_host_guest())
kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
+ if (vmx_pebs_supported()) {
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
+ }
if (vmx_umip_emulated())
kvm_cpu_cap_set(X86_FEATURE_UMIP);
--
2.21.3
When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, the guest PEBS event is reprogrammed with a non-zero
precision level in the perf_event_attr.
The guest PEBS overflow PMI bit is set in the guest GLOBAL_STATUS MSR
when the PEBS facility generates a PEBS overflow PMI based on the guest
IA32_DS_AREA MSR.
attr.precise_ip will be adjusted to a special precision level when the
PEBS-PDIR feature is supported later, which affects host counter
scheduling.
A guest PEBS event is not reused for a non-PEBS guest event even if it
uses the same guest counter index.
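For illustration, a user-space analogue of the reprogramming step; KVM
itself uses the in-kernel perf_event_create_kernel_counter() path, not
the syscall, so this only shows the precise_ip attribute:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_HARDWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.sample_period = 100000;
	attr.precise_ip = 1;	/* non-zero precision => PEBS-backed event */
	attr.exclude_kernel = 1;

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	close(fd);
	return 0;
}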
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 28 ++++++++++++++++++++++++++--
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 505f9b39c423..eb0d73a095a3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -449,6 +449,8 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
+ u64 pebs_enable;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 67741d2a0308..c6208234e007 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -76,7 +76,11 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
- __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ if (perf_event->attr.precise_ip) {
+ /* Indicate PEBS overflow PMI to guest. */
+ __set_bit(62, (unsigned long *)&pmu->global_status);
+ } else
+ __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
/*
@@ -99,6 +103,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
bool exclude_kernel, bool intr,
bool in_tx, bool in_tx_cp)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
struct perf_event *event;
struct perf_event_attr attr = {
.type = type,
@@ -110,6 +115,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_kernel = exclude_kernel,
.config = config,
};
+ bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);
attr.sample_period = get_sample_period(pmc, pmc->counter);
@@ -124,9 +130,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
attr.sample_period = 0;
attr.config |= HSW_IN_TX_CHECKPOINTED;
}
+ if (pebs) {
+ /*
+ * The non-zero precision level of guest event makes the ordinary
+ * guest event becomes a guest PEBS event and triggers the host
+ * PEBS PMI handler to determine whether the PEBS overflow PMI
+ * comes from the host counters or the guest.
+ *
+ * For most PEBS hardware events, the difference in the software
+ * precision levels of guest and host PEBS events will not affect
+ * the accuracy of the PEBS profiling result, because the "event IP"
+ * in the PEBS record is calibrated on the guest side.
+ */
+ attr.precise_ip = 1;
+ }
event = perf_event_create_kernel_counter(&attr, -1, current,
- intr ? kvm_perf_overflow_intr :
+ (intr || pebs) ? kvm_perf_overflow_intr :
kvm_perf_overflow, pmc);
if (IS_ERR(event)) {
pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
@@ -161,6 +181,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
get_sample_period(pmc, pmc->counter)))
return false;
+ if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+ pmc->perf_event->attr.precise_ip)
+ return false;
+
/* reuse perf_event to serve as pmc_reprogram_counter() does*/
perf_event_enable(pmc->perf_event);
--
2.21.3
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.
Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG (Memory Info [bit 0], GPRs [bit 1], XMMs [bit 2],
and LBRs [bit 3], LBR Entries [bit 31:24]). By default, the PEBS record
will only contain the Basic group.
When guest adaptive PEBS is enabled, the MSR_PEBS_DATA_CFG MSR will be
added to the perf_guest_switch_msr() list and switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.
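A small sketch of composing a PEBS_DATA_CFG value from the groups listed
above (the macro values mirror the layout described here; this is an
illustration, not kernel code):

#include <stdint.h>
#include <stdio.h>

#define PEBS_DATACFG_MEMINFO	(1ULL << 0)
#define PEBS_DATACFG_GP		(1ULL << 1)
#define PEBS_DATACFG_XMMS	(1ULL << 2)
#define PEBS_DATACFG_LBRS	(1ULL << 3)
#define PEBS_DATACFG_LBR_SHIFT	24

int main(void)
{
	/* basic group + memory info + GPRs + 8 LBR entries */
	uint64_t cfg = PEBS_DATACFG_MEMINFO | PEBS_DATACFG_GP |
		       PEBS_DATACFG_LBRS | (8ULL << PEBS_DATACFG_LBR_SHIFT);

	printf("PEBS_DATA_CFG = 0x%llx\n", (unsigned long long)cfg);
	return 0;
}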
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 12 ++++++++++++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 5 ++++-
4 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 71b45376ee1f..1a0c026955bf 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3453,6 +3453,18 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
*nr = 3;
}
+ if (arr[1].guest && x86_pmu.intel_cap.pebs_baseline) {
+ arr[3].msr = MSR_PEBS_DATA_CFG;
+ arr[3].host = cpuc->pebs_data_cfg;
+ /* KVM will update MSR_PEBS_DATA_CFG with the trapped guest value. */
+ arr[3].guest = 0ull;
+ *nr = 4;
+ } else if (*nr == 3) {
+ arr[3].msr = MSR_PEBS_DATA_CFG;
+ arr[3].host = arr[3].guest = 0;
+ *nr = 4;
+ }
+
return arr;
}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eb24c4796f8b..37df29061a4d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -452,6 +452,8 @@ struct kvm_pmu {
u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;
+ u64 pebs_data_cfg;
+ u64 pebs_data_cfg_mask;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ff5fc405703f..c04e12812797 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -186,6 +186,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_DS_AREA:
ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
break;
+ case MSR_PEBS_DATA_CFG:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -233,6 +236,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DS_AREA:
msr_info->data = pmu->ds_area;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ msr_info->data = pmu->pebs_data_cfg;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -305,6 +311,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
pmu->ds_area = data;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ if (pmu->pebs_data_cfg == data)
+ return 0;
+ if (!(data & pmu->pebs_data_cfg_mask)) {
+ pmu->pebs_data_cfg = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -355,6 +369,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
pmu->pebs_enable_mask = ~0ull;
+ pmu->pebs_data_cfg_mask = ~0ull;
entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -417,6 +432,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
pmu->fixed_ctr_ctrl_mask &=
~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ pmu->pebs_data_cfg_mask = ~0xff00000full;
} else
pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
} else {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 94dbd79a8582..954c15fb9b00 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6545,8 +6545,11 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
if (!msrs)
return;
- if (nr_msrs > 2 && msrs[1].guest)
+ if (nr_msrs > 2 && msrs[1].guest) {
msrs[2].guest = pmu->ds_area;
+ if (nr_msrs > 3)
+ msrs[3].guest = pmu->pebs_data_cfg;
+ }
for (i = 0; i < nr_msrs; i++)
if (msrs[i].host == msrs[i].guest)
--
2.21.3
If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].
When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and switched during the
VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
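For illustration, a sketch of the valid-bit mask implied by this layout
(a standalone helper with made-up names; fixed-counter bits are only
allowed when PEBS baseline is exposed, matching the refresh logic in
the patch below):

#include <stdint.h>
#include <stdio.h>

/* GP enable bits start at bit 0, fixed counter enable bits at bit 32;
 * fixed counters may only be set when PEBS baseline is exposed. */
static uint64_t pebs_enable_valid_bits(int nr_gp, int nr_fixed,
				       int has_baseline)
{
	uint64_t mask = (1ULL << nr_gp) - 1;

	if (has_baseline)
		mask |= ((1ULL << nr_fixed) - 1) << 32;
	return mask;
}

int main(void)
{
	/* e.g. 8 GP + 4 fixed counters with PEBS baseline */
	printf("valid IA32_PEBS_ENABLE bits: 0x%llx\n",
	       (unsigned long long)pebs_enable_valid_bits(8, 4, 1));
	return 0;
}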
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 21 +++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/msr-index.h | 6 ++++++
arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++++++
4 files changed, 56 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 31e6887d24f1..d824c7156d34 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3419,6 +3419,27 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
*nr = 2;
}
+ if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
+ arr[1].msr = MSR_IA32_PEBS_ENABLE;
+ arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
+ arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ /*
+ * Without x86_pmu.counter_freezing support on the host,
+ * the guest PEBS is disabled once the host PEBS is enabled
+ * since enabling both may bring an unknown PMI that confuses the
+ * host, and the guest PEBS overflow PMI would be missed.
+ */
+ if (!x86_pmu.counter_freezing && arr[1].host)
+ arr[1].guest = 0;
+ arr[0].guest |= arr[1].guest;
+ *nr = 2;
+ } else if (*nr == 1) {
+ /* Remove MSR_IA32_PEBS_ENABLE from MSR switch list in KVM */
+ arr[1].msr = MSR_IA32_PEBS_ENABLE;
+ arr[1].host = arr[1].guest = 0;
+ *nr = 2;
+ }
+
return arr;
}
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eb0d73a095a3..6d7e895ae535 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -450,6 +450,7 @@ struct kvm_pmu {
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);
u64 pebs_enable;
+ u64 pebs_enable_mask;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..8bc6269f577d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -184,6 +184,12 @@
#define MSR_PEBS_DATA_CFG 0x000003f2
#define MSR_IA32_DS_AREA 0x00000600
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
+#define PERF_CAP_PEBS_TRAP BIT_ULL(6)
+#define PERF_CAP_ARCH_REG BIT_ULL(7)
+#define PERF_CAP_PEBS_FORMAT 0xf00
+#define PERF_CAP_PEBS_BASELINE BIT_ULL(14)
+#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
+ PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6
#define MSR_IA32_RTIT_CTL 0x00000570
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 50047114c298..2f10587bda19 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -180,6 +180,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PEBS_ENABLE:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -221,6 +224,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_PEBS_ENABLE:
+ msr_info->data = pmu->pebs_enable;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -280,6 +286,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_PEBS_ENABLE:
+ if (pmu->pebs_enable == data)
+ return 0;
+ if (!(data & pmu->pebs_enable_mask)) {
+ pmu->pebs_enable = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -329,6 +343,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
+ pmu->pebs_enable_mask = ~0ull;
entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -384,6 +399,19 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
bitmap_set(pmu->all_valid_pmc_idx,
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
+ pmu->pebs_enable_mask = ~pmu->global_ctrl;
+ pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+ pmu->fixed_ctr_ctrl_mask &=
+ ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ } else
+ pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
+ } else {
+ vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
+ }
+
nested_vmx_pmu_entry_exit_ctls_update(vcpu);
}
--
2.21.3
To emulate the PEBS facility, KVM may need to set up context such as
the guest DS PEBS fields correctly before VM-entry, and this part is
implemented in the vmx handle_event() hook.
When cross-mapping happens on any enabled PEBS counter, KVM makes a PMU
request, exits to kvm_pmu_handle_event() to do the rewrite work, goes
back to the cross-map check again and finally to VM-entry.
In this hook, KVM only rewrites state for the guest and does not move
events, hence races with the NMI PMI are not a problem.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 3 +++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 9 +++++++++
arch/x86/kvm/vmx/vmx.c | 3 ---
4 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index a6c5951a5728..f87be3c2140e 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -339,6 +339,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
*/
if (unlikely(pmu->need_cleanup))
kvm_pmu_cleanup(vcpu);
+
+ if (kvm_x86_ops.pmu_ops->handle_event)
+ kvm_x86_ops.pmu_ops->handle_event(vcpu);
}
/* check if idx is a valid index to access PMU */
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f5ec94e9a1dc..b1e52e33f08c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -45,6 +45,7 @@ struct kvm_pmu_ops {
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
+ void (*handle_event)(struct kvm_vcpu *vcpu);
};
static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 99d9453e0176..2917105e584e 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -491,6 +491,14 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}
+void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (!(pmu->global_ctrl & pmu->pebs_enable))
+ return;
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -505,4 +513,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.refresh = intel_pmu_refresh,
.init = intel_pmu_init,
.reset = intel_pmu_reset,
+ .handle_event = intel_pmu_handle_event,
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 302808ec9699..3b62907c8959 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6556,9 +6556,6 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
if (!msrs)
return;
- if (pmu->counter_cross_mapped)
- msrs[1].guest = 0;
-
if (nr_msrs > 2 && msrs[1].guest) {
msrs[2].guest = pmu->ds_area;
if (nr_msrs > 3)
--
2.21.3
The PEBS event counters scheduled by the host may differ from the
counters required by the guest. The host counter indexes then leak into
the guest PEBS records, and the guest driver will be confused by the
counter indexes in the "Applicable Counters" field of the PEBS records
and ignore them.
Before the guest PEBS overflow PMI is injected into the guest through
the global status, KVM needs to rewrite the "Applicable Counters" field
with the right enabled guest PEBS counter index(es) in the guest PEBS
records.
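To make the buffer walk concrete, a user-space sketch of stepping
through PEBS records using the record size carried in format_size
bits [63:48] (the struct mirrors the basic record group; everything
else is an illustrative assumption):

#include <stdint.h>
#include <stdio.h>

struct pebs_basic_sketch {
	uint64_t format_size;		/* record size in bits [63:48] */
	uint64_t ip;
	uint64_t applicable_counters;
	uint64_t tsc;
};

int main(void)
{
	/* pretend guest buffer with two back-to-back basic-only records */
	struct pebs_basic_sketch buf[2] = {
		{ .format_size = (uint64_t)sizeof(buf[0]) << 48,
		  .applicable_counters = 1ULL << 1 },
		{ .format_size = (uint64_t)sizeof(buf[0]) << 48,
		  .applicable_counters = 1ULL << 1 },
	};
	uintptr_t base = (uintptr_t)buf;	/* ds->pebs_buffer_base */
	uintptr_t index = base + sizeof(buf);	/* ds->pebs_index */
	int n = 0;

	while (base < index) {
		struct pebs_basic_sketch *rec = (void *)base;

		/* this is where applicable_counters would be remapped */
		n++;
		base += rec->format_size >> 48;
	}
	printf("walked %d records\n", n);
	return 0;
}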
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/pmu.c | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 85 +++++++++++++++++++++++++++++++--
3 files changed, 82 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77b529b8c16a..cdc3c6efdd8e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -457,6 +457,7 @@ struct kvm_pmu {
bool counter_cross_mapped;
bool need_rewrite_ds_pebs_interrupt_threshold;
+ bool need_rewrite_pebs_records;
/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 7c8e3ca5b7ad..64dce19644e3 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -77,6 +77,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
if (perf_event->attr.precise_ip) {
+ pmu->need_rewrite_pebs_records = pmu->counter_cross_mapped;
/* Indicate PEBS overflow PMI to guest. */
__set_bit(62, (unsigned long *)&pmu->global_status);
} else
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 346b1104e674..d58d04ee13a5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -551,22 +551,97 @@ static int rewrite_ds_pebs_interrupt_threshold(struct kvm_vcpu *vcpu)
return ret;
}
+static int rewrite_ds_pebs_records(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ struct debug_store *ds = NULL;
+ gpa_t gpa;
+ u64 pebs_buffer_base, offset, status, new_status, format_size;
+ int srcu_idx, bit, ret = 0;
+
+ if (!pmu->counter_cross_mapped)
+ return ret;
+
+ ds = kmalloc(sizeof(struct debug_store), GFP_KERNEL);
+ if (!ds)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->ds_area, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto out;
+
+ if (kvm_read_guest(vcpu->kvm, gpa, ds, sizeof(struct debug_store)))
+ goto out;
+
+ if (ds->pebs_index <= ds->pebs_buffer_base)
+ goto out;
+
+ pebs_buffer_base = ds->pebs_buffer_base;
+ offset = offsetof(struct pebs_basic, applicable_counters);
+
+ do {
+ ret = -EFAULT;
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pebs_buffer_base + offset, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto out;
+ if (kvm_read_guest(vcpu->kvm, gpa, &status, sizeof(u64)))
+ goto out;
+
+ new_status = 0ull;
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc->perf_event)
+ continue;
+
+ if (test_bit(pmc->perf_event->hw.idx, (unsigned long *)&status))
+ new_status |= BIT_ULL(pmc->idx);
+ }
+ if (kvm_write_guest(vcpu->kvm, gpa, &new_status, sizeof(u64)))
+ goto out;
+
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pebs_buffer_base, NULL);
+ if (gpa == UNMAPPED_GVA)
+ goto out;
+ if (kvm_read_guest(vcpu->kvm, gpa, &format_size, sizeof(u64)))
+ goto out;
+ ret = 0;
+
+ pebs_buffer_base = pebs_buffer_base + (format_size >> 48);
+ } while (pebs_buffer_base < ds->pebs_index);
+
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+ kfree(ds);
+ return ret;
+}
+
void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int ret;
+ int ret1, ret2;
+
+ if (pmu->need_rewrite_pebs_records) {
+ pmu->need_rewrite_pebs_records = false;
+ ret1 = rewrite_ds_pebs_records(vcpu);
+ }
if (!(pmu->global_ctrl & pmu->pebs_enable))
- return;
+ goto out;
if (pmu->counter_cross_mapped && pmu->need_rewrite_ds_pebs_interrupt_threshold) {
- ret = rewrite_ds_pebs_interrupt_threshold(vcpu);
pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+ ret2 = rewrite_ds_pebs_interrupt_threshold(vcpu);
}
- if (ret == -ENOMEM)
+out:
+
+ if (ret1 == -ENOMEM || ret2 == -ENOMEM)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
- else if (ret == -EFAULT)
+ else if (ret1 == -EFAULT || ret2 == -EFAULT)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}
--
2.21.3
Bit 12 of IA32_MISC_ENABLE is "Processor Event Based Sampling
Unavailable (RO)":
1 = PEBS is not supported.
0 = PEBS is supported.
A guest write that sets this PEBS_UNAVAIL bit raises #GP(0) when guest
PEBS is enabled. Some guest PEBS drivers may care about this bit.
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 2 ++
arch/x86/kvm/x86.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index c04e12812797..99d9453e0176 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -426,6 +426,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ vcpu->arch.ia32_misc_enable_msr &= ~MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
pmu->pebs_enable_mask = ~pmu->global_ctrl;
pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
@@ -436,6 +437,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
} else
pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
} else {
+ vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e1280bb18152..b5963a36bf6b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3087,6 +3087,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_MISC_ENABLE:
data &= ~MSR_IA32_MISC_ENABLE_EMON;
+ if (!msr_info->host_initiated &&
+ (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) &&
+ (data & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
+ return 1;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
--
2.21.3
With PEBS virtualization, the PEBS records get delivered to the guest,
and the host still sees the PEBS overflow PMI from guest PEBS counters.
This would normally result in a spurious host PMI, so we need to inject
that PEBS overflow PMI into the guest, so that the guest PMI handler
can handle the PEBS records.
Check for this case in the host perf PEBS handler. If a PEBS overflow
PMI occurs and it was not generated from the host side (determined by
checking the host DS area), a fake event is triggered. The fake event
causes the KVM PMI callback to be called, thereby injecting the PEBS
overflow PMI into the guest.
No matter how many guest PEBS counters have overflowed, triggering one
fake event is enough. The guest PEBS handler retrieves the correct
information from its own PEBS record buffer.
If counter_freezing is disabled on the host, a guest PEBS overflow PMI
would be missed when a PEBS counter is enabled on the host side and,
coincidentally, a host PEBS overflow PMI based on the host DS_AREA is
also triggered right after vm-exit due to the guest PEBS overflow PMI
based on the guest DS_AREA. In that case, KVM disables guest PEBS before
vm-entry once a host PEBS counter is enabled on the same CPU.
Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/ds.c | 64 ++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 86848c57b55e..1e759c74bffd 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1721,6 +1721,67 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
return 0;
}
+/*
+ * We may be running with guest PEBS events created by KVM, and the
+ * PEBS records are logged into the guest's DS and invisible to host.
+ *
+ * In the case of guest PEBS overflow, we only trigger a fake event
+ * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
+ * On the next vm-entry, the guest will check its DS area and read
+ * the guest PEBS records.
+ *
+ * Without counter_freezing support on the host, the guest PEBS overflow
+ * PMI may be dropped when both the guest and the host use PEBS.
+ * Therefore, without counter_freezing, KVM will not enable guest PEBS
+ * once host PEBS is enabled, since it may cause a confusing unknown NMI.
+ *
+ * The contents and other behavior of the guest event do not matter.
+ */
+static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
+ struct pt_regs *iregs,
+ struct debug_store *ds)
+{
+ struct perf_sample_data data;
+ struct perf_event *event = NULL;
+ u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ int bit;
+
+ /*
+ * Ideally, we should check guest DS to understand if it's
+ * a guest PEBS overflow PMI from guest PEBS counters.
+ * However, it brings high overhead to retrieve guest DS in host.
+ * So we check host DS instead for performance.
+ *
+ * If the PEBS interrupt threshold on the host is not exceeded in an NMI,
+ * there must be a PEBS overflow PMI generated from the guest PEBS counters.
+ * There is no ambiguity since the reported event in the PMI is guest
+ * only. It gets handled correctly on a case-by-case basis for each event.
+ *
+ * Note: This is based on the assumption that counter_freezing is enabled,
+ * or KVM disables the co-existence of guest PEBS and host PEBS.
+ */
+ if (!guest_pebs_idxs || !in_nmi() ||
+ ds->pebs_index >= ds->pebs_interrupt_threshold)
+ return 0;
+
+ for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
+ INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
+
+ event = cpuc->events[bit];
+ if (!event->attr.precise_ip)
+ continue;
+
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+ if (perf_event_overflow(event, &data, iregs))
+ x86_pmu_stop(event, 0);
+
+ /* Injecting one fake event is enough. */
+ return 1;
+ }
+
+ return 0;
+}
+
static void __intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs,
void *base, void *top,
@@ -1954,6 +2015,9 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs)
if (!x86_pmu.pebs_active)
return;
+ if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
+ return;
+
base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
--
2.21.3
On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
> platforms can provide an architectural state of the instruction executed
> after the instruction that caused the event. This patch set enables the
> the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> The Linux guest can use PEBS feature like native:
>
> # perf record -e instructions:ppp ./br_instr a
> # perf record -c 100000 -e instructions:pp ./br_instr a
>
> If the counter_freezing is not enabled on the host, the guest PEBS will
> be disabled on purpose when host is using PEBS facility. By default,
> KVM disables the co-existence of guest PEBS and host PEBS.
Uuhh, what?!? counter_freezing should never be enabled, it's broken. Let
me go delete all that code.
On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
> > platforms can provide an architectural state of the instruction executed
> > after the instruction that caused the event. This patch set enables the
> > the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> > The Linux guest can use PEBS feature like native:
> >
> > # perf record -e instructions:ppp ./br_instr a
> > # perf record -c 100000 -e instructions:pp ./br_instr a
> >
> > If the counter_freezing is not enabled on the host, the guest PEBS will
> > be disabled on purpose when host is using PEBS facility. By default,
> > KVM disables the co-existence of guest PEBS and host PEBS.
>
> Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
> me go delete all that code.
---
Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
Perfmon-v4 counter freezing is fundamentally broken; remove this default
disabled code to make sure nobody uses it.
The feature is called Freeze-on-PMI in the SDM, and if it would do that,
there wouldn't actually be a problem, *however* it does something subtly
different. It globally disables the whole PMU when it raises the PMI,
not when the PMI hits.
This means there's a window between the PMI getting raised and the PMI
actually getting served where we lose events and this violates the
perf counter independence. That is, a counting event should not result
in a different event count when there is a sampling event co-scheduled.
This is known to break existing software.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/intel/core.c | 152 -------------------------------------------
arch/x86/events/perf_event.h | 3 +-
2 files changed, 1 insertion(+), 154 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c79748f6921d..9909dfa6fb12 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2121,18 +2121,6 @@ static void intel_tfa_pmu_enable_all(int added)
intel_pmu_enable_all(added);
}
-static void enable_counter_freeze(void)
-{
- update_debugctlmsr(get_debugctlmsr() |
- DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
-}
-
-static void disable_counter_freeze(void)
-{
- update_debugctlmsr(get_debugctlmsr() &
- ~DEBUGCTLMSR_FREEZE_PERFMON_ON_PMI);
-}
-
static inline u64 intel_pmu_get_status(void)
{
u64 status;
@@ -2696,95 +2684,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
return handled;
}
-static bool disable_counter_freezing = true;
-static int __init intel_perf_counter_freezing_setup(char *s)
-{
- bool res;
-
- if (kstrtobool(s, &res))
- return -EINVAL;
-
- disable_counter_freezing = !res;
- return 1;
-}
-__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
-
-/*
- * Simplified handler for Arch Perfmon v4:
- * - We rely on counter freezing/unfreezing to enable/disable the PMU.
- * This is done automatically on PMU ack.
- * - Ack the PMU only after the APIC.
- */
-
-static int intel_pmu_handle_irq_v4(struct pt_regs *regs)
-{
- struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
- int handled = 0;
- bool bts = false;
- u64 status;
- int pmu_enabled = cpuc->enabled;
- int loops = 0;
-
- /* PMU has been disabled because of counter freezing */
- cpuc->enabled = 0;
- if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
- bts = true;
- intel_bts_disable_local();
- handled = intel_pmu_drain_bts_buffer();
- handled += intel_bts_interrupt();
- }
- status = intel_pmu_get_status();
- if (!status)
- goto done;
-again:
- intel_pmu_lbr_read();
- if (++loops > 100) {
- static bool warned;
-
- if (!warned) {
- WARN(1, "perfevents: irq loop stuck!\n");
- perf_event_print_debug();
- warned = true;
- }
- intel_pmu_reset();
- goto done;
- }
-
-
- handled += handle_pmi_common(regs, status);
-done:
- /* Ack the PMI in the APIC */
- apic_write(APIC_LVTPC, APIC_DM_NMI);
-
- /*
- * The counters start counting immediately while ack the status.
- * Make it as close as possible to IRET. This avoids bogus
- * freezing on Skylake CPUs.
- */
- if (status) {
- intel_pmu_ack_status(status);
- } else {
- /*
- * CPU may issues two PMIs very close to each other.
- * When the PMI handler services the first one, the
- * GLOBAL_STATUS is already updated to reflect both.
- * When it IRETs, the second PMI is immediately
- * handled and it sees clear status. At the meantime,
- * there may be a third PMI, because the freezing bit
- * isn't set since the ack in first PMI handlers.
- * Double check if there is more work to be done.
- */
- status = intel_pmu_get_status();
- if (status)
- goto again;
- }
-
- if (bts)
- intel_bts_enable_local();
- cpuc->enabled = pmu_enabled;
- return handled;
-}
-
/*
* This handler is triggered by the local APIC, so the APIC IRQ handling
* rules apply:
@@ -4081,9 +3980,6 @@ static void intel_pmu_cpu_starting(int cpu)
if (x86_pmu.version > 1)
flip_smm_bit(&x86_pmu.attr_freeze_on_smi);
- if (x86_pmu.counter_freezing)
- enable_counter_freeze();
-
/* Disable perf metrics if any added CPU doesn't support it. */
if (x86_pmu.intel_cap.perf_metrics) {
union perf_capabilities perf_cap;
@@ -4154,9 +4050,6 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
static void intel_pmu_cpu_dying(int cpu)
{
fini_debug_store_on_cpu(cpu);
-
- if (x86_pmu.counter_freezing)
- disable_counter_freeze();
}
void intel_cpuc_finish(struct cpu_hw_events *cpuc)
@@ -4548,39 +4441,6 @@ static __init void intel_nehalem_quirk(void)
}
}
-static const struct x86_cpu_desc counter_freezing_ucodes[] = {
- INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 2, 0x0000000e),
- INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 9, 0x0000002e),
- INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT, 10, 0x00000008),
- INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_D, 1, 0x00000028),
- INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS, 1, 0x00000028),
- INTEL_CPU_DESC(INTEL_FAM6_ATOM_GOLDMONT_PLUS, 8, 0x00000006),
- {}
-};
-
-static bool intel_counter_freezing_broken(void)
-{
- return !x86_cpu_has_min_microcode_rev(counter_freezing_ucodes);
-}
-
-static __init void intel_counter_freezing_quirk(void)
-{
- /* Check if it's already disabled */
- if (disable_counter_freezing)
- return;
-
- /*
- * If the system starts with the wrong ucode, leave the
- * counter-freezing feature permanently disabled.
- */
- if (intel_counter_freezing_broken()) {
- pr_info("PMU counter freezing disabled due to CPU errata,"
- "please upgrade microcode\n");
- x86_pmu.counter_freezing = false;
- x86_pmu.handle_irq = intel_pmu_handle_irq;
- }
-}
-
/*
* enable software workaround for errata:
* SNB: BJ122
@@ -4966,9 +4826,6 @@ __init int intel_pmu_init(void)
max((int)edx.split.num_counters_fixed, assume);
}
- if (version >= 4)
- x86_pmu.counter_freezing = !disable_counter_freezing;
-
if (boot_cpu_has(X86_FEATURE_PDCM)) {
u64 capabilities;
@@ -5090,7 +4947,6 @@ __init int intel_pmu_init(void)
case INTEL_FAM6_ATOM_GOLDMONT:
case INTEL_FAM6_ATOM_GOLDMONT_D:
- x86_add_quirk(intel_counter_freezing_quirk);
memcpy(hw_cache_event_ids, glm_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, glm_hw_cache_extra_regs,
@@ -5117,7 +4973,6 @@ __init int intel_pmu_init(void)
break;
case INTEL_FAM6_ATOM_GOLDMONT_PLUS:
- x86_add_quirk(intel_counter_freezing_quirk);
memcpy(hw_cache_event_ids, glp_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
memcpy(hw_cache_extra_regs, glp_hw_cache_extra_regs,
@@ -5577,13 +5432,6 @@ __init int intel_pmu_init(void)
pr_cont("full-width counters, ");
}
- /*
- * For arch perfmon 4 use counter freezing to avoid
- * several MSR accesses in the PMI.
- */
- if (x86_pmu.counter_freezing)
- x86_pmu.handle_irq = intel_pmu_handle_irq_v4;
-
if (x86_pmu.intel_cap.perf_metrics)
x86_pmu.intel_ctrl |= 1ULL << GLOBAL_CTRL_EN_PERF_METRICS;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 10032f023fcc..4084fa31cc21 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -681,8 +681,7 @@ struct x86_pmu {
/* PMI handler bits */
unsigned int late_ack :1,
- enabled_ack :1,
- counter_freezing :1;
+ enabled_ack :1;
/*
* sysfs attrs
*/
On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
> > > The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
> > > platforms can provide an architectural state of the instruction executed
> > > after the instruction that caused the event. This patch set enables the
> > > the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
> > > The Linux guest can use PEBS feature like native:
> > >
> > > # perf record -e instructions:ppp ./br_instr a
> > > # perf record -c 100000 -e instructions:pp ./br_instr a
> > >
> > > If the counter_freezing is not enabled on the host, the guest PEBS will
> > > be disabled on purpose when host is using PEBS facility. By default,
> > > KVM disables the co-existence of guest PEBS and host PEBS.
> >
> > Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
> > me go delete all that code.
>
> ---
> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>
> Perfmon-v4 counter freezing is fundamentally broken; remove this default
> disabled code to make sure nobody uses it.
>
> The feature is called Freeze-on-PMI in the SDM, and if it would do that,
> there wouldn't actually be a problem, *however* it does something subtly
> different. It globally disables the whole PMU when it raises the PMI,
> not when the PMI hits.
>
> This means there's a window between the PMI getting raised and the PMI
> actually getting served where we loose events and this violates the
> perf counter independence. That is, a counting event should not result
> in a different event count when there is a sampling event co-scheduled.
>
What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
That, in itself, is a problem. I agree with you on that point.
However, there are use cases for both modes.
I can sample on event A and count on B, C, and when A overflows I want
to snapshot B, C. For that I want B, C at the moment of the overflow,
not at the moment the PMI is delivered. Thus, you would want the
Freeze-on-Overflow behavior. You can collect in this mode with the perf
tool, IIRC: perf record -e '{cycles,instructions,branches:S}' ....
The other usage model is that of the replay-debugger (rr), which you are
alluding to, which needs a precise count of an event, including during
the skid window. For that, you need Freeze-on-PMI (on delivery). Note
that this tool likely only cares about user-level occurrences of events.
As for counter independence, I am not sure it holds in all cases. If the
events are set up for user+kernel then, as soon as you co-schedule a
sampling event, you will likely get more counts on the counting event
due to the additional kernel entries/exits caused by interrupt-based
profiling. Even if you were to restrict to user level only, I would
expect to see a few more counts.
Hi Peter,
On 2020/11/11 4:52, Stephane Eranian wrote:
> On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra<[email protected]> wrote:
>> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
>>> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
>>>> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake server
>>>> platforms can provide an architectural state of the instruction executed
>>>> after the instruction that caused the event. This patch set enables the
>>>> the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
>>>> The Linux guest can use PEBS feature like native:
>>>>
>>>> # perf record -e instructions:ppp ./br_instr a
>>>> # perf record -c 100000 -e instructions:pp ./br_instr a
>>>>
>>>> If the counter_freezing is not enabled on the host, the guest PEBS will
>>>> be disabled on purpose when host is using PEBS facility. By default,
>>>> KVM disables the co-existence of guest PEBS and host PEBS.
Thanks Stephane for clarifying the use cases for Freeze-on-[PMI|Overflow].
Please let me express it more clearly.
The goal of the whole patch set is to enable guest PEBS, regardless of
whether counter_freezing is enabled or not. By default, it does not
support the guest and the host using PEBS at the same time.
Please continue reviewing the patch set, especially for the slow path
we proposed this time and related host perf changes:
- add intel_pmu_handle_guest_pebs() to __intel_pmu_pebs_event();
- add switch MSRs (PEBS_ENABLE, DS_AREA, DATA_CFG) to intel_guest_get_msrs();
- the construction of incoming parameters for
perf_event_create_kernel_counter();
I believe that once you understand the general idea, your comments will
be very valuable.
Thanks,
Like Xu
>>> Uuhh, what?!? counter_freezing should never be enabled, its broken. Let
>>> me go delete all that code.
>> ---
>> Subject: perf/intel: Remove Perfmon-v4 counter_freezing support
>>
>> Perfmon-v4 counter freezing is fundamentally broken; remove this default
>> disabled code to make sure nobody uses it.
>>
>> The feature is called Freeze-on-PMI in the SDM, and if it would do that,
>> there wouldn't actually be a problem,*however* it does something subtly
>> different. It globally disables the whole PMU when it raises the PMI,
>> not when the PMI hits.
>>
>> This means there's a window between the PMI getting raised and the PMI
>> actually getting served where we loose events and this violates the
>> perf counter independence. That is, a counting event should not result
>> in a different event count when there is a sampling event co-scheduled.
>>
> What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
> That, in itself, is a problem. I agree with you on that point.
>
> However, there are use cases for both modes.
>
> I can sample on event A and count on B, C and when A overflows, I want
> to snapshot B, C.
> For that I want B, C at the moment of the overflow, not at the moment
> the PMI is delivered. Thus, youd
> would want the Freeze-on-overflow behavior. You can collect in this
> mode with the perf tool,
> IIRC: perf record -e '{cycles,instructions,branches:S}' ....
>
> The other usage model is that of the replay-debugger (rr) which you are alluding
> to, which needs precise count of an event including during the skid
> window. For that, you need
> Freeze-on-PMI (delivered). Note that this tool likely only cares about
> user level occurrences of events.
>
> As for counter independence, I am not sure it holds in all cases. If
> the events are setup for user+kernel
> then, as soon as you co-schedule a sampling event, you will likely get
> more counts on the counting
> event due to the additional kernel entries/exits caused by
> interrupt-based profiling. Even if you were to
> restrict to user level only, I would expect to see a few more counts.
>
>
>> This is known to break existing software.
>>
>> Signed-off-by: Peter Zijlstra (Intel)<[email protected]>
On Tue, Nov 10, 2020 at 12:52:04PM -0800, Stephane Eranian wrote:
> What is implemented is Freeze-on-Overflow, yet it is described as Freeze-on-PMI.
> That, in itself, is a problem. I agree with you on that point.
Exactly.
> However, there are use cases for both modes.
>
> I can sample on event A and count on B, C and when A overflows, I want
> to snapshot B, C.
> For that I want B, C at the moment of the overflow, not at the moment
> the PMI is delivered. Thus, youd
> would want the Freeze-on-overflow behavior. You can collect in this
> mode with the perf tool,
> IIRC: perf record -e '{cycles,instructions,branches:S}' ....
Right, but we never supported that. Also, in that case the group must
then be fully exclusive so as not to mess with other groups. A better
solution might be an extension to Adaptive PEBS.
> The other usage model is that of the replay-debugger (rr) which you are alluding
> to, which needs precise count of an event including during the skid
> window. For that, you need
> Freeze-on-PMI (delivered). Note that this tool likely only cares about
> user level occurrences of events.
Correct, RR only cares about user-only counting.
> As for counter independence, I am not sure it holds in all cases. If
> the events are setup for user+kernel
This is true; however if it were an actual Freeze-on-PMI we could
actually do u+k independence correctly too.
Anyway, as it stands I think the whole counter_freezing thing is a
trainwreck and it needs to go.
Hi Peter,
On 2020/11/10 23:37, Peter Zijlstra wrote:
> -static int __init intel_perf_counter_freezing_setup(char *s)
> -{
> - bool res;
> -
> - if (kstrtobool(s, &res))
> - return -EINVAL;
> -
> - disable_counter_freezing = !res;
> - return 1;
> -}
> -__setup("perf_v4_pmi=", intel_perf_counter_freezing_setup);
...
> Anyway, as it stands I think the whole counter_freezing thing is a
> trainwreck and it needs to go.
If you really want to drop the counter_freezing stuff, we also need
to clean it up in Documentation/admin-guide/kernel-parameters.txt:
perf_v4_pmi= [X86,INTEL]
Format: <bool>
Disable Intel PMU counter freezing feature.
The feature only exists starting from
Arch Perfmon v4 (Skylake and newer).
However, someone may still need it based on a correct understanding of
"Freeze-on-Overflow", as Stephane said. How about renaming and
documenting it instead of discarding it completely?
Our guest PEBS enabling patches do not completely depend on it, and we
do not require the administrator to enable perf_v4_pmi for guest PEBS.
Would you kindly take a look at the perf part of this series?
Thanks,
Like Xu
On Mon, Nov 09, 2020 at 10:12:41AM +0800, Like Xu wrote:
> With PEBS virtualization, the PEBS records get delivered to the guest,
> and host still sees the PEBS overflow PMI from guest PEBS counters.
> This would normally result in a spurious host PMI and we needs to inject
> that PEBS overflow PMI into the guest, so that the guest PMI handler
> can handle the PEBS records.
>
> Check for this case in the host perf PEBS handler. If a PEBS overflow
> PMI occurs and it's not generated from host side (via check host DS),
> a fake event will be triggered. The fake event causes the KVM PMI callback
> to be called, thereby injecting the PEBS overflow PMI into the guest.
>
> No matter how many guest PEBS counters are overflowed, only triggering
> one fake event is enough. The guest PEBS handler would retrieve the
> correct information from its own PEBS records buffer.
>
> If the counter_freezing is disabled on the host, a guest PEBS overflow
> PMI would be missed when a PEBS counter is enabled on the host side
> and coincidentally a host PEBS overflow PMI based on host DS_AREA is
> also triggered right after vm-exit due to the guest PEBS overflow PMI
> based on guest DS_AREA. In that case, KVM will disable guest PEBS before
> vm-entry once there's a host PEBS counter enabled on the same CPU.
How does this guest DS crud work? DS_AREA is a host virtual address;
ISTR there was lots of fail trying to virtualize it earlier. What's
changed? There's 0 clues here.
Why are the host and guest DS area separate, why can't we map them to
the exact same physical pages?
On Mon, Nov 09, 2020 at 10:12:42AM +0800, Like Xu wrote:
> + /* Indicate PEBS overflow PMI to guest. */
> + __set_bit(62, (unsigned long *)&pmu->global_status);
GLOBAL_STATUS_BUFFER_OVF_BIT
Hi Peter,
On 2020/11/17 22:35, Peter Zijlstra wrote:
> On Mon, Nov 09, 2020 at 10:12:41AM +0800, Like Xu wrote:
>> With PEBS virtualization, the PEBS records get delivered to the guest,
>> and host still sees the PEBS overflow PMI from guest PEBS counters.
>> This would normally result in a spurious host PMI and we needs to inject
>> that PEBS overflow PMI into the guest, so that the guest PMI handler
>> can handle the PEBS records.
>>
>> Check for this case in the host perf PEBS handler. If a PEBS overflow
>> PMI occurs and it's not generated from host side (via check host DS),
>> a fake event will be triggered. The fake event causes the KVM PMI callback
>> to be called, thereby injecting the PEBS overflow PMI into the guest.
>>
>> No matter how many guest PEBS counters are overflowed, only triggering
>> one fake event is enough. The guest PEBS handler would retrieve the
>> correct information from its own PEBS records buffer.
>>
>> If the counter_freezing is disabled on the host, a guest PEBS overflow
>> PMI would be missed when a PEBS counter is enabled on the host side
>> and coincidentally a host PEBS overflow PMI based on host DS_AREA is
>> also triggered right after vm-exit due to the guest PEBS overflow PMI
>> based on guest DS_AREA. In that case, KVM will disable guest PEBS before
>> vm-entry once there's a host PEBS counter enabled on the same CPU.
>
> How does this guest DS crud work? DS_AREA is a host virtual address;
A host counter will be scheduled (possibly cross-mapped) for a guest PEBS
counter (via a guest PEBS event), and its enable bits (PEBS_ENABLE + EN
+ GLOBAL_CTRL) will be set according to the guest's values right before
vcpu entry (via atomic_switch_perf_msrs).
The guest PEBS record(s) will be written to the guest DS buffer referenced
by the guest DS_AREA MSR, which is switched during the VMX transition
and holds a guest virtual address.
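As a rough sketch of that MSR switch (not the posted patch; the
cpuc->guest_ds_area field below is an assumption standing in for the
guest value KVM would provide), intel_guest_get_msrs() could hand the
PEBS MSRs to the atomic switch list like this:

        static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
        {
                struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
                struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;

                arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
                arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
                arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;

                /* Assumed field: the guest DS_AREA value cached by KVM. */
                arr[1].msr = MSR_IA32_DS_AREA;
                arr[1].host = (unsigned long)cpuc->ds;
                arr[1].guest = cpuc->guest_ds_area;

                *nr = 2;
                return arr;
        }

KVM's atomic_switch_perf_msrs() then feeds each {msr, host, guest} entry
into the VM-entry/VM-exit MSR load lists.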
> ISTR there was lots of fail trying to virtualize it earlier. What's
> changed? There's 0 clues here.
Ah, we now have EPT-friendly PEBS facilities supported since Ice Lake,
which makes the guest PEBS feature possible without pinning guest memory.
>
> Why are the host and guest DS area separate, why can't we map them to
> the exact same physical pages?
If we map both the guest and host DS_AREA to the exact same physical pages:
- the guest can access the host PEBS records, which means the host IP may
be leaked, because we cannot predict when the guest drains its records
and it would be over-engineered to clean them up before each vm-entry;
- different tasks/vcpus on the same pcpu cannot share the same PEBS DS
settings from the same physical page. For example, some require large
PEBS and reset values, while others do not.
As with many guest MSRs, we use a separate guest DS_AREA for the guest's
own use, which avoids mutual interference as much as possible.
Thanks,
Like Xu
On 2020/11/17 22:41, Peter Zijlstra wrote:
> On Mon, Nov 09, 2020 at 10:12:42AM +0800, Like Xu wrote:
>> + /* Indicate PEBS overflow PMI to guest. */
>> + __set_bit(62, (unsigned long *)&pmu->global_status);
>
> GLOBAL_STATUS_BUFFER_OVF_BIT
>
Thanks, I'll apply it.
If you have more comments on the entire patch set, please let me know.
Thanks,
Like Xu
On Thu, Nov 19, 2020 at 12:15:09AM +0800, Like Xu wrote:
> > ISTR there was lots of fail trying to virtualize it earlier. What's
> > changed? There's 0 clues here.
>
> Ah, now we have EPT-friendly PEBS facilities supported since Ice Lake
> which makes guest PEBS feature possible w/o guest memory pinned.
OK.
> > Why are the host and guest DS area separate, why can't we map them to
> > the exact same physical pages?
>
> If we map both guest and host DS_AREA to the exact same physical pages,
> - the guest can access the host PEBS records, which means that the host
> IP maybe leaked, because we cannot predict the time guest drains records and
> it would be over-designed to clean it up before each vm-entry;
> - different tasks/vcpus on the same pcpu cannot share the same PEBS DS
> settings from the same physical page. For example, some require large
> PEBS and reset values, while others do not.
>
> Like many guest msrs, we use the separate guest DS_AREA for the guest's
> own use and it avoids mutual interference as little as possible.
OK, but the code here wanted to inspect the guest DS from the host. It
states this is somehow complicated/expensive. But surely we can at the
very least map the first guest DS page somewhere so we can at least
access the control bits without too much magic.
On 2020/11/19 2:07, Peter Zijlstra wrote:
> On Thu, Nov 19, 2020 at 12:15:09AM +0800, Like Xu wrote:
>
>>> ISTR there was lots of fail trying to virtualize it earlier. What's
>>> changed? There's 0 clues here.
>> Ah, now we have EPT-friendly PEBS facilities supported since Ice Lake
>> which makes guest PEBS feature possible w/o guest memory pinned.
> OK.
>
>>> Why are the host and guest DS area separate, why can't we map them to
>>> the exact same physical pages?
>> If we map both guest and host DS_AREA to the exact same physical pages,
>> - the guest can access the host PEBS records, which means that the host
>> IP maybe leaked, because we cannot predict the time guest drains records and
>> it would be over-designed to clean it up before each vm-entry;
>> - different tasks/vcpus on the same pcpu cannot share the same PEBS DS
>> settings from the same physical page. For example, some require large
>> PEBS and reset values, while others do not.
>>
>> Like many guest msrs, we use the separate guest DS_AREA for the guest's
>> own use and it avoids mutual interference as little as possible.
> OK, but the code here wanted to inspect the guest DS from the host. It
> states this is somehow complicated/expensive.
Yes, it's expensive to inspect the guest DS in the NMI handler, and it is
also unsafe to call the sleepable EPT page-fault handler in NMI context.
So in this version, we propose to move this complicated part into KVM,
and I think the overhead is acceptable for the cross-mapped cases.
The downside is that we will not be able to distinguish whether a PEBS
PMI is from the guest or the host once a host PEBS counter is enabled on
the same CPU; we may fix that later with your ideas.
> But surely we can at the
> very least map the first guest DS page somewhere so we can at least
> access the control bits without too much magic.
I am not sure whether mapping the first guest DS page can keep guest
PEBS working from beginning to end, since the host may later unmap the
guest DS page from its physical page and KVM currently cannot control
that at a fine granularity.
This version also avoids attacks from malicious guests by keeping the
control bits in guest space.
Thanks,
Like Xu
Hi Peter,
On 2020/11/19 2:07, Peter Zijlstra wrote:
> On Thu, Nov 19, 2020 at 12:15:09AM +0800, Like Xu wrote:
>
>>> ISTR there was lots of fail trying to virtualize it earlier. What's
>>> changed? There's 0 clues here.
>> Ah, now we have EPT-friendly PEBS facilities supported since Ice Lake
>> which makes guest PEBS feature possible w/o guest memory pinned.
> OK.
>
>>> Why are the host and guest DS area separate, why can't we map them to
>>> the exact same physical pages?
>> If we map both guest and host DS_AREA to the exact same physical pages,
>> - the guest can access the host PEBS records, which means that the host
>> IP maybe leaked, because we cannot predict the time guest drains records and
>> it would be over-designed to clean it up before each vm-entry;
>> - different tasks/vcpus on the same pcpu cannot share the same PEBS DS
>> settings from the same physical page. For example, some require large
>> PEBS and reset values, while others do not.
>>
>> Like many guest msrs, we use the separate guest DS_AREA for the guest's
>> own use and it avoids mutual interference as little as possible.
> OK, but the code here wanted to inspect the guest DS from the host. It
> states this is somehow complicated/expensive. But surely we can at the
> very least map the first guest DS page somewhere so we can at least
> access the control bits without too much magic.
We note that the SDM assumes a contiguous, present memory mapping for
the DS save area and the PEBS buffer area.
Therefore, we have revisited your suggestion here and moved it a bit forward:
When the PEBS is enabled, KVM will cache the following values:
- gva ds_area (kvm msr trap)
- hva1 for "gva ds_area" (walk guest page table)
- hva2 for "gva pebs_buffer_base" via hva1 (walk guest page table)
if the "gva ds_area" cache hits,
- access PEBS "interrupt threshold" and "Counter Reset[]" via hva1
- get "gva2 pebs_buffer_base" via __copy_from_user(hva1)
if the "gva2 pebs_buffer_base" cache hits,
- we get "gva2 pebs_index" via __copy_from_user(hva2),
- rewrite the guest PEBS records via hva2 and pebs_index
If any cache misses, set up the cached values by walking the tables again
(a rough sketch of such a cache is shown below).
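A minimal sketch of the gva->hva cache we have in mind (the structure and
helper names here are made up for illustration; this is not the actual
patch, and it ignores buffers that span multiple pages):

        struct pebs_gva_cache {
                gva_t         gva;   /* last-seen guest virtual address */
                unsigned long hva;   /* host virtual address resolved from it */
                bool          valid;
        };

        static unsigned long pebs_gva_to_hva(struct kvm_vcpu *vcpu,
                                             struct pebs_gva_cache *c, gva_t gva)
        {
                unsigned long hva;
                gpa_t gpa;

                if (c->valid && c->gva == gva)
                        return c->hva;  /* cache hit: skip the table walks */

                gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL);
                if (gpa == UNMAPPED_GVA)
                        return 0;

                hva = kvm_vcpu_gfn_to_hva(vcpu, gpa_to_gfn(gpa));
                if (kvm_is_error_hva(hva))
                        return 0;

                c->gva = gva;
                c->hva = hva + offset_in_page(gpa);
                c->valid = true;
                return c->hva;
        }

KVM would keep one such cache for ds_area (hva1) and one for
pebs_buffer_base (hva2), invalidating them whenever the corresponding
guest value changes.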
I wonder if you would agree with this optimization idea;
we look forward to your confirmation before taking the next step.
Thanks,
Like Xu
On Fri, Nov 27, 2020 at 10:14:49AM +0800, Xu, Like wrote:
> > OK, but the code here wanted to inspect the guest DS from the host. It
> > states this is somehow complicated/expensive. But surely we can at the
> > very least map the first guest DS page somewhere so we can at least
> > access the control bits without too much magic.
> We note that the SDM has a contiguous present memory mapping
> assumption about the DS save area and the PEBS buffer area.
>
> Therefore, we revisit your suggestion here and move it a bit forward:
>
> When the PEBS is enabled, KVM will cache the following values:
> - gva ds_area (kvm msr trap)
> - hva1 for "gva ds_area" (walk guest page table)
> - hva2 for "gva pebs_buffer_base" via hva1 (walk guest page table)
What is this [gh]va? Guest/Host Virtual Address? I think you're assuming
I know about all this virt crap... I don't.
> if the "gva ds_area" cache hits,
what?
> - access PEBS "interrupt threshold" and "Counter Reset[]" via hva1
> - get "gva2 pebs_buffer_base" via __copy_from_user(hva1)
But you already had hva2, so what's the point?
> if the "gva2 pebs_buffer_base" cache hits,
What?
> - we get "gva2 pebs_index" via __copy_from_user(hva2),
pebs_index is in ds_area, which would be hva1
> - rewrite the guest PEBS records via hva2 and pebs_index
>
> If any cache misses, setup the cache values via walking tables again.
>
> I wonder if you would agree with this optimization idea,
> we look forward to your confirmation for the next step.
I'm utterly confused. I really can't follow.
Hi Peter,
On 2020/11/30 18:49, Peter Zijlstra wrote:
> On Fri, Nov 27, 2020 at 10:14:49AM +0800, Xu, Like wrote:
>
>>> OK, but the code here wanted to inspect the guest DS from the host. It
>>> states this is somehow complicated/expensive. But surely we can at the
>>> very least map the first guest DS page somewhere so we can at least
>>> access the control bits without too much magic.
>> We note that the SDM has a contiguous present memory mapping
>> assumption about the DS save area and the PEBS buffer area.
>>
>> Therefore, we revisit your suggestion here and move it a bit forward:
>>
>> When the PEBS is enabled, KVM will cache the following values:
>> - gva ds_area (kvm msr trap)
>> - hva1 for "gva ds_area" (walk guest page table)
>> - hva2 for "gva pebs_buffer_base" via hva1 (walk guest page table)
> What this [gh]va? Guest/Host Virtual Address? I think you're assuming I
> know about all this virt crap,.. I don't.
Oh, my bad and let me add it:
gva: guest virtual address
gpa: guest physical address
gfn: guest frame number
hva: host virtual address
hpa: host physical address
In KVM, we get the hva from a gva in the following way:
gpa = kvm_mmu_gva_to_gpa_system(vcpu, gva, NULL); /* guest page-table walk */
gfn = gpa >> PAGE_SHIFT;                          /* guest frame number */
slot = gfn_to_memslot(kvm, gfn);                  /* memslot backing the gfn */
hva = gfn_to_hva_memslot_prot(slot, gfn, NULL);   /* host virtual address */
>
>> if the "gva ds_area" cache hits,
> what?
Sorry, that was a misuse of terminology.
I mean that KVM will save the last-used "gva ds_area" value and its hva
in extra fields; if the "gva ds_area" has not changed this time, we will
not walk the guest page table to get its hva again.
I think that is the main point of your suggestion, and I tried to
elaborate on it.
>> - access PEBS "interrupt threshold" and "Counter Reset[]" via hva1
>> - get "gva2 pebs_buffer_base" via __copy_from_user(hva1)
> But you already had hva2, so what's the point?
hva1 is for "gva ds_area"
hva2 is for "gva pebs_buffer_base"
The point is that before using the last saved hva2, we need to make sure
that "gva pebs_buffer_base" has not changed, to guard against malicious
drivers that may change it without changing ds_area.
>
>> if the "gva2 pebs_buffer_base" cache hits,
> What?
>
>> - we get "gva2 pebs_index" via __copy_from_user(hva2),
> pebs_index is in ds_are, which would be hva1
Yes, we get "gva2 pebs_index" via __copy_from_user(hva1).
>
>> - rewrite the guest PEBS records via hva2 and pebs_index
>>
>> If any cache misses, setup the cache values via walking tables again.
>>
>> I wonder if you would agree with this optimization idea,
>> we look forward to your confirmation for the next step.
> I'm utterly confused. I really can't follow.
Generally, KVM will save hva1 (for gva1 ds_area) and hva2 (for gva2
pebs_buffer_base) in the first round of guest page-table walking and
reuse them if they have not changed on subsequent use.
I think this approach is feasible; please complain if you are still
confused or disagree.
Thanks,
Like Xu
On 11/11/20 03:42, Xu, Like wrote:
> Hi Peter,
>
> On 2020/11/11 4:52, Stephane Eranian wrote:
>> On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra<[email protected]>
>> wrote:
>>> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
>>>> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
>>>>> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake
>>>>> server
>>>>> platforms can provide an architectural state of the instruction
>>>>> executed
>>>>> after the instruction that caused the event. This patch set enables
>>>>> the
>>>>> the PEBS via DS feature for KVM (also non) Linux guest on the Ice
>>>>> Lake.
>>>>> The Linux guest can use PEBS feature like native:
>>>>>
>>>>> # perf record -e instructions:ppp ./br_instr a
>>>>> # perf record -c 100000 -e instructions:pp ./br_instr a
>>>>>
>>>>> If the counter_freezing is not enabled on the host, the guest PEBS
>>>>> will
>>>>> be disabled on purpose when host is using PEBS facility. By default,
>>>>> KVM disables the co-existence of guest PEBS and host PEBS.
> Thanks Stephane for clarifying the use cases for Freeze-on-[PMI|Overflow].
>
> Please let me express it more clearly.
>
> The goal of the whole patch set is to enable guest PEBS, regardless of
> whether the counter_freezing is frozen or not. By default, it will not
> support both the guest and the host to use PEBS at the same time.
>
> Please continue reviewing the patch set, especially for the slow path
> we proposed this time and related host perf changes:
>
> - add intel_pmu_handle_guest_pebs() to __intel_pmu_pebs_event();
> - add switch MSRs (PEBS_ENABLE, DS_AREA, DATA_CFG) to
> intel_guest_get_msrs();
> - the construction of incoming parameters for
> perf_event_create_kernel_counter();
>
> I believe if you understand the general idea, the comments will be very
> valuable.
What is the state of this work? I was expecting a new version that
doesn't use counter_freezing. However, I see that counter_freezing is
still in there, so this patch from Peter has never been applied.
Paolo
On Tue, Jan 26, 2021 at 10:51:39AM +0100, Paolo Bonzini wrote:
> What is the state of this work? I was expecting a new version that doesn't
> use counter_freezing. However, I see that counter_freezing is still in
> there, so this patch from Peter has never been applied.
Bah, I forgot about it. Lemme go find it.
On 2021/1/26 17:51, Paolo Bonzini wrote:
> On 11/11/20 03:42, Xu, Like wrote:
>> Hi Peter,
>>
>> On 2020/11/11 4:52, Stephane Eranian wrote:
>>> On Tue, Nov 10, 2020 at 7:37 AM Peter Zijlstra<[email protected]>
>>> wrote:
>>>> On Tue, Nov 10, 2020 at 04:12:57PM +0100, Peter Zijlstra wrote:
>>>>> On Mon, Nov 09, 2020 at 10:12:37AM +0800, Like Xu wrote:
>>>>>> The Precise Event Based Sampling(PEBS) supported on Intel Ice Lake
>>>>>> server
>>>>>> platforms can provide an architectural state of the instruction
>>>>>> executed
>>>>>> after the instruction that caused the event. This patch set enables the
>>>>>> the PEBS via DS feature for KVM (also non) Linux guest on the Ice Lake.
>>>>>> The Linux guest can use PEBS feature like native:
>>>>>>
>>>>>> # perf record -e instructions:ppp ./br_instr a
>>>>>> # perf record -c 100000 -e instructions:pp ./br_instr a
>>>>>>
>>>>>> If the counter_freezing is not enabled on the host, the guest PEBS will
>>>>>> be disabled on purpose when host is using PEBS facility. By default,
>>>>>> KVM disables the co-existence of guest PEBS and host PEBS.
>> Thanks Stephane for clarifying the use cases for Freeze-on-[PMI|Overflow].
>>
>> Please let me express it more clearly.
>>
>> The goal of the whole patch set is to enable guest PEBS, regardless of
>> whether the counter_freezing is frozen or not. By default, it will not
>> support both the guest and the host to use PEBS at the same time.
>>
>> Please continue reviewing the patch set, especially for the slow path
>> we proposed this time and related host perf changes:
>>
>> - add intel_pmu_handle_guest_pebs() to __intel_pmu_pebs_event();
>> - add switch MSRs (PEBS_ENABLE, DS_AREA, DATA_CFG) to
>> intel_guest_get_msrs();
>> - the construction of incoming parameters for
>> perf_event_create_kernel_counter();
>>
>> I believe if you understand the general idea, the comments will be very
>> valuable.
>
> What is the state of this work? I was expecting a new version that
> doesn't use counter_freezing. However, I see that counter_freezing is
> still in there, so this patch from Peter has never been applied.
>
> Paolo
Ah, we now have the v3 version of the guest PEBS feature.
It does not rely on counter_freezing, but it disables the co-existence of
guest PEBS and host PEBS.
I am not clear about your attitude towards this co-existence.
There are also more interesting topics for you to review and comment on.
Please check
https://lore.kernel.org/kvm/[email protected]/
On 26/01/21 12:35, Xu, Like wrote:
>>
>
> Ah, now we have the v3 version on guest PEBS feature.
> It does not rely on counter_freezing, but disables the co-existence of
> guest PEBS and host PEBS.
> I am not clear about your attitude towards this co-existence.
>
> There are also more interesting topics for you to review and comment.
> Please check
> https://lore.kernel.org/kvm/[email protected]/
>
Thanks Like, I'll review that one.
Paolo