2021-01-04 13:26:41

by Like Xu

Subject: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

The Precise Event Based Sampling (PEBS) facility on Intel Ice Lake Server
platforms can provide the architectural state of the instruction executed
right after the instruction that caused the event. This patch set enables
the PEBS-via-DS feature for KVM guests on Ice Lake Server.

We can use the PEBS feature in a Linux guest just like on bare metal:

# perf record -e instructions:ppp ./br_instr a
# perf record -c 100000 -e instructions:pp ./br_instr a

Guest PEBS is deliberately disabled when the host is using PEBS.
By default, KVM disallows the co-existence of guest PEBS and host PEBS.

The whole patch set can be divided into three parts. The first two
parts enable the basic PEBS-via-DS feature, could be considered for
merging, and are expected to cause no regression in host perf.

Compared to the first version, an important change here is the removal
of the forced 1:1 mapping of virtual to physical PMCs; the cross-mapping
case is handled carefully in part 3, which may address the artificial
competition concern raised by PeterZ.

In general, there are two code paths to emulate the guest PEBS facility.

1) Fast path (part 2, patch 0004-0012)

This is when the host-assigned physical PMC has the same index as the
virtual PMC (e.g. physical PMC0 emulates virtual PMC0). It works like
the 1:1 mapping that we used in the first version.

2) Slow path (part 3, patch 0012-0017)

This is when the host-assigned physical PMC has a different index from
the virtual PMC (e.g. physical PMC1 emulates virtual PMC0). In this
case, KVM needs to rewrite the PEBS records to change the applicable
counter indexes to the virtual PMC indexes, which would otherwise
contain the physical counter indexes written by the PEBS facility, and
to switch the counter reset values to the offsets corresponding to the
physical counter indexes in the DS data structure.
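The slow-path rewrite described above can be sketched as a single bit
remap in a record's counter bitmap. The struct below is only an
illustrative fragment of an adaptive PEBS record's basic group, and the
helper name and explicit index arguments are hypothetical, not KVM's
actual interface:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative fragment of the basic group of an adaptive PEBS record;
 * the field of interest here is the applicable-counters bitmap. */
struct pebs_basic_sketch {
	uint64_t format_size;
	uint64_t ip;
	uint64_t applicable_counters;	/* bitmap of counters that fired */
	uint64_t tsc;
};

/* Remap one physical counter index to its virtual counterpart in the
 * applicable_counters bitmap (e.g. host PMC1 -> guest PMC0). */
static void remap_applicable_counter(struct pebs_basic_sketch *rec,
				     int phys_idx, int virt_idx)
{
	if (rec->applicable_counters & (1ULL << phys_idx)) {
		rec->applicable_counters &= ~(1ULL << phys_idx);
		rec->applicable_counters |= 1ULL << virt_idx;
	}
}
```

KVM would apply such a remap to every record in the guest DS buffer
before injecting the vPMI.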

In the slow path, large PEBS needs to be disabled by KVM rewriting the
pebs_interrupt_threshold field in the DS area so that the buffer holds
only one record. This is because a guest may implicitly drain its PEBS
buffer, e.g. on a context switch, without KVM getting a chance to
update the buffer; the physical PMC indexes would then confuse the
guest. The difficulty grows when multiple events get rescheduled inside
the guest. Hence disabling large PEBS in this case is an easy and safe
way to keep things correct as an initial step.

We don't expect this change to break any guest code, which can
generally tolerate earlier PMIs. In the fast path with a 1:1 mapping,
this is not needed.
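Limiting the buffer to one record amounts to placing the interrupt
threshold one record past the buffer base. A minimal sketch, with a
hypothetical helper name and only the DS fields needed here:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative subset of the debug store (DS) management area fields
 * that govern the PEBS buffer. */
struct ds_area_sketch {
	uint64_t pebs_buffer_base;
	uint64_t pebs_index;
	uint64_t pebs_absolute_maximum;
	uint64_t pebs_interrupt_threshold;
};

/* Force a PEBS overflow PMI after every record by setting the
 * interrupt threshold exactly one record past the buffer base. */
static void limit_to_single_record(struct ds_area_sketch *ds,
				   uint64_t record_size)
{
	ds->pebs_interrupt_threshold = ds->pebs_buffer_base + record_size;
}
```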

The rewriting work is performed before delivering a vPMI to the guest
that notifies it to read the record (i.e. before entering the guest,
where interrupts are disabled, so no counter reschedule can happen on
the host at that point).

For DS area virtualization, the PEBS hardware is programmed with the
guest virtual address (gva) of the guest DS memory.
In the past, the difficulty was that the host needed to pin the guest
DS memory, because a page fault caused by the PEBS hardware could not
be fixed up. Starting with Ice Lake (ICX), this is no longer needed
thanks to hardware support.

To rewrite the guest DS area, KVM needs to walk the guest page tables
to translate the gva to a host virtual address (hva).

To reduce the translation overhead, we cache the translation the first
time the DS memory is rewritten. The cached translation remains valid
until the guest disables PEBS (a VM-exit to KVM); after that, the guest
may reallocate the PEBS buffer, and KVM needs to re-walk the guest page
tables to update the cached translation.
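The caching scheme above can be sketched as follows. Both the cache
struct and kvm_gva_to_hva() are stand-ins for illustration only (the
real series uses kvm_read/write_guest_[offset]_cached()), and the
translation function here is a placeholder for the actual page-table
walk:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Minimal sketch of a cached gva->hva translation. */
struct ds_gva_cache {
	uint64_t gva;
	uint64_t hva;
	bool valid;
};

/* Placeholder for the expensive guest page-table walk. */
static uint64_t kvm_gva_to_hva(uint64_t gva)
{
	return gva + 0x8000;	/* fake fixed offset, illustration only */
}

/* Walk only on first use or when the guest re-registered a new gva
 * (e.g. after reallocating the PEBS buffer). */
static uint64_t ds_translate(struct ds_gva_cache *c, uint64_t gva)
{
	if (!c->valid || c->gva != gva) {
		c->gva = gva;
		c->hva = kvm_gva_to_hva(gva);
		c->valid = true;
	}
	return c->hva;
}
```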

In summary, this patch set enables guest PEBS to retrieve the correct
information from its own PEBS records on Ice Lake server platforms, as
long as the host is not using the PEBS facility at the same time. We
also expect it to work when migrating to another Ice Lake server.

Here are the results of a PEBS test from the guest and the host for the same workload:

perf report on guest:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1473377250
# Overhead Command Shared Object Symbol
57.74% br_instr br_instr [.] lfsr_cond
41.40% br_instr br_instr [.] cmp_end
0.21% br_instr [kernel.kallsyms] [k] __lock_acquire

perf report on host:
# Samples: 2K of event 'instructions:ppp', # Event count (approx.): 1462721386
# Overhead Command Shared Object Symbol
57.90% br_instr br_instr [.] lfsr_cond
41.95% br_instr br_instr [.] cmp_end
0.05% br_instr [kernel.vmlinux] [k] lock_acquire

Conclusion: the profiling results on the guest are similar to those on the host.

Please see each commit for more details, and feel free to comment.

v2->v3 Changelog:
- drop the counter_freezing check and disable guest PEBS when host uses PEBS;
- use kvm_read/write_guest_[offset]_cached() to reduce memory rewrite overhead;
- use GLOBAL_STATUS_BUFFER_OVF_BIT instead of 62;
- make intel_pmu_handle_event() static;
- rebased to kvm-queue d45f89f7437d;

Previous:
https://lore.kernel.org/kvm/[email protected]/

Like Xu (17):
KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled
KVM: vmx/pmu: Use IA32_PERF_CAPABILITIES to adjust features visibility
KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter
perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest
KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter
KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS
KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer
KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS
KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled
KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64
KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter
KVM: x86/pmu: Disable guest PEBS when counters are cross-mapped
KVM: x86/pmu: Add hook to emulate pebs for cross-mapped counters
KVM: vmx/pmu: Limit pebs_interrupt_threshold in the guest DS area
KVM: vmx/pmu: Rewrite applicable_counters field in guest PEBS records
KVM: x86/pmu: Save guest pebs reset values when pebs is configured
KVM: x86/pmu: Adjust guest pebs reset values for cross-mapped counters

arch/x86/events/intel/core.c | 45 +++++
arch/x86/events/intel/ds.c | 62 +++++++
arch/x86/include/asm/kvm_host.h | 18 ++
arch/x86/include/asm/msr-index.h | 6 +
arch/x86/kvm/pmu.c | 92 +++++++--
arch/x86/kvm/pmu.h | 20 ++
arch/x86/kvm/vmx/capabilities.h | 17 +-
arch/x86/kvm/vmx/pmu_intel.c | 310 ++++++++++++++++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 29 +++
arch/x86/kvm/x86.c | 12 +-
10 files changed, 592 insertions(+), 19 deletions(-)

--
2.29.2


2021-01-04 13:27:31

by Like Xu

Subject: [PATCH v3 05/17] KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter

When a guest counter is configured as a PEBS counter through
IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
configuring a non-zero precision level in the perf_event_attr.

The guest PEBS overflow PMI bit will be set in the guest
GLOBAL_STATUS MSR when the PEBS facility generates a PEBS
overflow PMI based on the guest IA32_DS_AREA MSR.

The attr.precise_ip will be adjusted to a special precision
level when the new PEBS-PDIR feature is supported later, which
affects host counter scheduling.

A guest PEBS event will not be reused for a non-PEBS
guest event, even with the same guest counter index.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 29 +++++++++++++++++++++++++++--
2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 94c8bfee4a82..09dacda33fb8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -449,6 +449,8 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);

+ u64 pebs_enable;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 67741d2a0308..2e81c50323e2 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -76,7 +76,12 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
struct kvm_pmu *pmu = pmc_to_pmu(pmc);

if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
- __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
+ if (perf_event->attr.precise_ip) {
+ /* Indicate PEBS overflow PMI to guest. */
+ __set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
+ (unsigned long *)&pmu->global_status);
+ } else
+ __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);

/*
@@ -99,6 +104,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
bool exclude_kernel, bool intr,
bool in_tx, bool in_tx_cp)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(pmc->vcpu);
struct perf_event *event;
struct perf_event_attr attr = {
.type = type,
@@ -110,6 +116,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
.exclude_kernel = exclude_kernel,
.config = config,
};
+ bool pebs = test_bit(pmc->idx, (unsigned long *)&pmu->pebs_enable);

attr.sample_period = get_sample_period(pmc, pmc->counter);

@@ -124,9 +131,23 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
attr.sample_period = 0;
attr.config |= HSW_IN_TX_CHECKPOINTED;
}
+ if (pebs) {
+ /*
+ * The non-zero precision level of the guest event makes an ordinary
+ * guest event become a guest PEBS event and triggers the host
+ * PEBS PMI handler to determine whether the PEBS overflow PMI
+ * comes from the host counters or the guest.
+ *
+ * For most PEBS hardware events, the difference in the software
+ * precision levels of guest and host PEBS events will not affect
+ * the accuracy of the PEBS profiling result, because the "event IP"
+ * in the PEBS record is calibrated on the guest side.
+ */
+ attr.precise_ip = 1;
+ }

event = perf_event_create_kernel_counter(&attr, -1, current,
- intr ? kvm_perf_overflow_intr :
+ (intr || pebs) ? kvm_perf_overflow_intr :
kvm_perf_overflow, pmc);
if (IS_ERR(event)) {
pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
@@ -161,6 +182,10 @@ static bool pmc_resume_counter(struct kvm_pmc *pmc)
get_sample_period(pmc, pmc->counter)))
return false;

+ if (!test_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->pebs_enable) &&
+ pmc->perf_event->attr.precise_ip)
+ return false;
+
/* reuse perf_event to serve as pmc_reprogram_counter() does*/
perf_event_enable(pmc->perf_event);

--
2.29.2

2021-01-04 13:27:54

by Like Xu

Subject: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

With PEBS virtualization, the PEBS records get delivered to the guest,
but the host still sees the PEBS overflow PMI from guest PEBS counters.
This would normally result in a spurious host PMI, so we need to inject
that PEBS overflow PMI into the guest, where the guest PMI handler
can handle the PEBS records.

Check for this case in the host perf PEBS handler. If a PEBS overflow
PMI occurs and it was not generated from the host side (determined by
checking the host DS), a fake event is triggered. The fake event causes
the KVM PMI callback to be called, thereby injecting the PEBS overflow
PMI into the guest.

No matter how many guest PEBS counters have overflowed, triggering
one fake event is enough. The guest PEBS handler will retrieve the
correct information from its own PEBS records buffer.

A guest PEBS overflow PMI would be missed if a PEBS counter is enabled
on the host side and, coincidentally, a host PEBS overflow PMI based on
the host DS_AREA is triggered right after the vm-exit caused by the
guest PEBS overflow PMI based on the guest DS_AREA. To avoid this, KVM
disables guest PEBS before vm-entry once there is a host PEBS counter
enabled on the same CPU.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/ds.c | 62 ++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index b47cc4226934..c499bdb58373 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
return 0;
}

+/*
+ * We may be running with guest PEBS events created by KVM, and the
+ * PEBS records are logged into the guest's DS and invisible to host.
+ *
+ * In the case of guest PEBS overflow, we only trigger a fake event
+ * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
+ * The guest will then vm-entry and check the guest DS area to read
+ * the guest PEBS records.
+ *
+ * The guest PEBS overflow PMI may be dropped when both the guest and
+ * the host use PEBS. Therefore, KVM will not enable guest PEBS once
+ * the host PEBS is enabled, since it may bring a confusing unknown NMI.
+ *
+ * The contents and other behavior of the guest event do not matter.
+ */
+static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
+ struct pt_regs *iregs,
+ struct debug_store *ds)
+{
+ struct perf_sample_data data;
+ struct perf_event *event = NULL;
+ u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ int bit;
+
+ /*
+ * Ideally, we should check guest DS to understand if it's
+ * a guest PEBS overflow PMI from guest PEBS counters.
+ * However, it brings high overhead to retrieve guest DS in host.
+ * So we check host DS instead for performance.
+ *
+ * If the PEBS interrupt threshold on the host is not exceeded in an NMI,
+ * must be a PEBS overflow PMI generated from the guest PEBS counters.
+ * There is no ambiguity since the reported event in the PMI is guest
+ * only. It gets handled correctly on a case-by-case basis for each event.
+ *
+ * Note: KVM disables the co-existence of guest PEBS and host PEBS.
+ */
+ if (!guest_pebs_idxs || !in_nmi() ||
+ ds->pebs_index >= ds->pebs_interrupt_threshold)
+ return 0;
+
+ for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
+ INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
+
+ event = cpuc->events[bit];
+ if (!event->attr.precise_ip)
+ continue;
+
+ perf_sample_data_init(&data, 0, event->hw.last_period);
+ if (perf_event_overflow(event, &data, iregs))
+ x86_pmu_stop(event, 0);
+
+ /* Injecting one fake event is enough. */
+ return 1;
+ }
+
+ return 0;
+}
+
static __always_inline void
__intel_pmu_pebs_event(struct perf_event *event,
struct pt_regs *iregs,
@@ -1965,6 +2024,9 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
if (!x86_pmu.pebs_active)
return;

+ if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
+ return;
+
base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
top = (struct pebs_basic *)(unsigned long)ds->pebs_index;

--
2.29.2

2021-01-04 13:28:26

by Like Xu

Subject: [PATCH v3 06/17] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS

If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].
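The bit layout above can be captured by a small helper that builds the
set of valid IA32_PEBS_ENABLE bits from the counter counts. The
function name is hypothetical; only the bit positions (GP counters at
bit 0, fixed counters at bit 32) come from the text:

```c
#include <stdint.h>
#include <assert.h>

/* Build the valid-bit set for IA32_PEBS_ENABLE: general-purpose
 * counter enables start at bit 0, fixed-counter enables at bit 32. */
static uint64_t pebs_enable_valid_bits(unsigned int nr_gp,
				       unsigned int nr_fixed)
{
	uint64_t gp_bits = (nr_gp >= 64) ? ~0ULL : (1ULL << nr_gp) - 1;
	uint64_t fixed_bits = ((1ULL << nr_fixed) - 1) << 32;

	return gp_bits | fixed_bits;
}
```

The complement of this value corresponds to the pebs_enable_mask of
reserved bits that the patch below checks on WRMSR.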

When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and switched during the
VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/msr-index.h | 6 ++++++
arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++++++
4 files changed, 55 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index af457f8cb29d..6453b8a6834a 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3715,6 +3715,26 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
*nr = 2;
}

+ if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
+ arr[1].msr = MSR_IA32_PEBS_ENABLE;
+ arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
+ arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ /*
+ * The guest PEBS will be disabled once the host PEBS is enabled
+ * since enabling both may bring an unknown PMI to confuse the
+ * host, and the guest PEBS overflow PMI would be missed.
+ */
+ if (arr[1].host)
+ arr[1].guest = 0;
+ arr[0].guest |= arr[1].guest;
+ *nr = 2;
+ } else if (*nr == 1) {
+ /* Remove MSR_IA32_PEBS_ENABLE from MSR switch list in KVM */
+ arr[1].msr = MSR_IA32_PEBS_ENABLE;
+ arr[1].host = arr[1].guest = 0;
+ *nr = 2;
+ }
+
return arr;
}

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09dacda33fb8..88a403fa46d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -450,6 +450,7 @@ struct kvm_pmu {
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);

u64 pebs_enable;
+ u64 pebs_enable_mask;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index abfc9b0fbd8d..11cc0b80fe7a 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -184,6 +184,12 @@
#define MSR_PEBS_DATA_CFG 0x000003f2
#define MSR_IA32_DS_AREA 0x00000600
#define MSR_IA32_PERF_CAPABILITIES 0x00000345
+#define PERF_CAP_PEBS_TRAP BIT_ULL(6)
+#define PERF_CAP_ARCH_REG BIT_ULL(7)
+#define PERF_CAP_PEBS_FORMAT 0xf00
+#define PERF_CAP_PEBS_BASELINE BIT_ULL(14)
+#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
+ PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
#define MSR_PEBS_LD_LAT_THRESHOLD 0x000003f6

#define MSR_IA32_RTIT_CTL 0x00000570
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 50047114c298..2f10587bda19 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -180,6 +180,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
ret = pmu->version > 1;
break;
+ case MSR_IA32_PEBS_ENABLE:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -221,6 +224,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
msr_info->data = pmu->global_ovf_ctrl;
return 0;
+ case MSR_IA32_PEBS_ENABLE:
+ msr_info->data = pmu->pebs_enable;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -280,6 +286,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_PEBS_ENABLE:
+ if (pmu->pebs_enable == data)
+ return 0;
+ if (!(data & pmu->pebs_enable_mask)) {
+ pmu->pebs_enable = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -329,6 +343,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
+ pmu->pebs_enable_mask = ~0ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -384,6 +399,19 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
bitmap_set(pmu->all_valid_pmc_idx,
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);

+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
+ pmu->pebs_enable_mask = ~pmu->global_ctrl;
+ pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+ pmu->fixed_ctr_ctrl_mask &=
+ ~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ } else
+ pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
+ } else {
+ vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
+ }
+
nested_vmx_pmu_entry_exit_ctls_update(vcpu);
}

--
2.29.2

2021-01-04 13:28:28

by Like Xu

Subject: [PATCH v3 08/17] KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS

If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.

Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG (Memory Info [bit 0], GPRs [bit 1], XMMs [bit 2],
and LBRs [bit 3], LBR Entries [bit 31:24]). By default, the PEBS record
will only contain the Basic group.
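The field encoding listed above can be illustrated with a hypothetical
encoder for MSR_PEBS_DATA_CFG; the function and parameter names are
made up, while the bit positions follow the description in the text:

```c
#include <stdint.h>
#include <assert.h>

/* Encode a PEBS_DATA_CFG value: Memory Info [bit 0], GPRs [bit 1],
 * XMMs [bit 2], LBRs [bit 3], number of LBR entries [bits 31:24]. */
static uint64_t pebs_data_cfg_encode(int mem_info, int gprs, int xmms,
				     int lbrs, uint8_t nr_lbr_entries)
{
	uint64_t cfg = 0;

	cfg |= (uint64_t)!!mem_info << 0;
	cfg |= (uint64_t)!!gprs << 1;
	cfg |= (uint64_t)!!xmms << 2;
	cfg |= (uint64_t)!!lbrs << 3;
	cfg |= (uint64_t)nr_lbr_entries << 24;

	return cfg;
}
```

A value of 0 leaves only the Basic group in each record, matching the
default mentioned above.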

When guest adaptive PEBS is enabled, the MSR_PEBS_DATA_CFG MSR will
be added to the perf_guest_switch_msr() list and switched during the
VMX transitions just like the CORE_PERF_GLOBAL_CTRL MSR.

Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 12 ++++++++++++
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 16 ++++++++++++++++
arch/x86/kvm/vmx/vmx.c | 5 ++++-
4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ccddda455bec..736487e6c5e3 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3748,6 +3748,18 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
*nr = 3;
}

+ if (arr[1].guest && x86_pmu.intel_cap.pebs_baseline) {
+ arr[3].msr = MSR_PEBS_DATA_CFG;
+ arr[3].host = cpuc->pebs_data_cfg;
+ /* KVM will update MSR_PEBS_DATA_CFG with the trapped guest value. */
+ arr[3].guest = 0ull;
+ *nr = 4;
+ } else if (*nr == 3) {
+ arr[3].msr = MSR_PEBS_DATA_CFG;
+ arr[3].host = arr[3].guest = 0;
+ *nr = 4;
+ }
+
return arr;
}

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 520a21af711b..4ff6aa00a325 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -452,6 +452,8 @@ struct kvm_pmu {
u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;
+ u64 pebs_data_cfg;
+ u64 pebs_data_cfg_mask;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index ff5fc405703f..c04e12812797 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -186,6 +186,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_DS_AREA:
ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
break;
+ case MSR_PEBS_DATA_CFG:
+ ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE;
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -233,6 +236,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_DS_AREA:
msr_info->data = pmu->ds_area;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ msr_info->data = pmu->pebs_data_cfg;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -305,6 +311,14 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 1;
pmu->ds_area = data;
return 0;
+ case MSR_PEBS_DATA_CFG:
+ if (pmu->pebs_data_cfg == data)
+ return 0;
+ if (!(data & pmu->pebs_data_cfg_mask)) {
+ pmu->pebs_data_cfg = data;
+ return 0;
+ }
+ break;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -355,6 +369,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->reserved_bits = 0xffffffff00200000ull;
pmu->fixed_ctr_ctrl_mask = ~0ull;
pmu->pebs_enable_mask = ~0ull;
+ pmu->pebs_data_cfg_mask = ~0ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -417,6 +432,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
pmu->fixed_ctr_ctrl_mask &=
~(1ULL << (INTEL_PMC_IDX_FIXED + i * 4));
+ pmu->pebs_data_cfg_mask = ~0xff00000full;
} else
pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
} else {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 42c65acc6c01..dbb0e49aae64 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6531,8 +6531,11 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
if (!msrs)
return;

- if (nr_msrs > 2 && msrs[1].guest)
+ if (nr_msrs > 2 && msrs[1].guest) {
msrs[2].guest = pmu->ds_area;
+ if (nr_msrs > 3)
+ msrs[3].guest = pmu->pebs_data_cfg;
+ }

for (i = 0; i < nr_msrs; i++)
if (msrs[i].host == msrs[i].guest)
--
2.29.2

2021-01-04 13:28:39

by Like Xu

Subject: [PATCH v3 03/17] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter

The mask value of the fixed counter control register should be
dynamically adjusted based on the number of fixed counters. This patch
introduces a variable that holds the reserved bits of the fixed counter
control register. This is needed for later Ice Lake fixed counter changes.
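The reserved-bit computation can be sketched as below. Each fixed
counter owns a 4-bit control field, of which bits 0, 1, and 3 (0xb) are
writable here and everything else stays reserved, mirroring the hunk
in this patch; the helper name is illustrative:

```c
#include <stdint.h>
#include <assert.h>

/* Build the reserved-bit mask for MSR_CORE_PERF_FIXED_CTR_CTRL from
 * the number of fixed counters: mark bits 0,1,3 of each 4-bit control
 * field as valid, then invert to get the reserved bits. */
static uint64_t fixed_ctr_ctrl_reserved_mask(unsigned int nr_fixed)
{
	uint64_t valid = 0;
	unsigned int i;

	for (i = 0; i < nr_fixed; i++)
		valid |= 0xbULL << (i * 4);

	return ~valid;
}
```

With three fixed counters this reproduces the hard-coded
0xfffffffffffff444ull constant that the patch replaces.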

Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 39707e72b062..94c8bfee4a82 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -433,6 +433,7 @@ struct kvm_pmu {
unsigned nr_arch_fixed_counters;
unsigned available_event_types;
u64 fixed_ctr_ctrl;
+ u64 fixed_ctr_ctrl_mask;
u64 global_ctrl;
u64 global_status;
u64 global_ovf_ctrl;
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 7c18c85328da..50047114c298 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -253,7 +253,7 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_CORE_PERF_FIXED_CTR_CTRL:
if (pmu->fixed_ctr_ctrl == data)
return 0;
- if (!(data & 0xfffffffffffff444ull)) {
+ if (!(data & pmu->fixed_ctr_ctrl_mask)) {
reprogram_fixed_counters(pmu, data);
return 0;
}
@@ -320,6 +320,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
struct kvm_cpuid_entry2 *entry;
union cpuid10_eax eax;
union cpuid10_edx edx;
+ int i;

pmu->nr_arch_gp_counters = 0;
pmu->nr_arch_fixed_counters = 0;
@@ -327,6 +328,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
pmu->version = 0;
pmu->reserved_bits = 0xffffffff00200000ull;
+ pmu->fixed_ctr_ctrl_mask = ~0ull;

entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
if (!entry)
@@ -358,6 +360,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
((u64)1 << edx.split.bit_width_fixed) - 1;
}

+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
+ pmu->fixed_ctr_ctrl_mask |= (0xbull << (i * 4));
+ pmu->fixed_ctr_ctrl_mask = ~pmu->fixed_ctr_ctrl_mask;
pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) |
(((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
pmu->global_ctrl_mask = ~pmu->global_ctrl;
--
2.29.2

2021-01-04 13:28:40

by Like Xu

Subject: [PATCH v3 07/17] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer

When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and
points to the linear address of the first byte of the DS buffer
management area, which is used to manage the PEBS records.

When guest PEBS is enabled and the guest value differs from the
host's, KVM will add the IA32_DS_AREA MSR to the MSR-switch list.
The guest's DS value is then loaded into the real hardware before
VM-entry, and the MSR is removed when guest PEBS is disabled.

A WRMSR to the IA32_DS_AREA MSR raises a #GP(0) if the source register
contains a non-canonical address. Switching the IA32_DS_AREA MSR also
sets up a quiescent period so that any pending host PEBS records are
written to the host DS area rather than the guest DS area.

When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be
added to the perf_guest_switch_msr() and switched during the
VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/events/intel/core.c | 13 +++++++++++++
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 11 +++++++++++
arch/x86/kvm/vmx/vmx.c | 6 ++++++
4 files changed, 31 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 6453b8a6834a..ccddda455bec 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3690,6 +3690,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
+ struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);

arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
@@ -3735,6 +3736,18 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
*nr = 2;
}

+ if (arr[1].guest) {
+ arr[2].msr = MSR_IA32_DS_AREA;
+ arr[2].host = (unsigned long)ds;
+ /* KVM will update MSR_IA32_DS_AREA with the trapped guest value. */
+ arr[2].guest = 0ull;
+ *nr = 3;
+ } else if (*nr == 2) {
+ arr[2].msr = MSR_IA32_DS_AREA;
+ arr[2].host = arr[2].guest = 0;
+ *nr = 3;
+ }
+
return arr;
}

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 88a403fa46d4..520a21af711b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -449,6 +449,7 @@ struct kvm_pmu {
DECLARE_BITMAP(all_valid_pmc_idx, X86_PMC_IDX_MAX);
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);

+ u64 ds_area;
u64 pebs_enable;
u64 pebs_enable_mask;

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 2f10587bda19..ff5fc405703f 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -183,6 +183,9 @@ static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_IA32_PEBS_ENABLE:
ret = vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT;
break;
+ case MSR_IA32_DS_AREA:
+ ret = guest_cpuid_has(vcpu, X86_FEATURE_DS);
+ break;
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0) ||
get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0) ||
@@ -227,6 +230,9 @@ static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_PEBS_ENABLE:
msr_info->data = pmu->pebs_enable;
return 0;
+ case MSR_IA32_DS_AREA:
+ msr_info->data = pmu->ds_area;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
@@ -294,6 +300,11 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+ case MSR_IA32_DS_AREA:
+ if (is_noncanonical_address(data, vcpu))
+ return 1;
+ pmu->ds_area = data;
+ return 0;
default:
if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)) ||
(pmc = get_gp_pmc(pmu, msr, MSR_IA32_PMC0))) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 09bc41c53cd8..42c65acc6c01 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -974,6 +974,7 @@ static void add_atomic_switch_msr(struct vcpu_vmx *vmx, unsigned msr,
return;
}
break;
+ case MSR_IA32_DS_AREA:
case MSR_IA32_PEBS_ENABLE:
/* PEBS needs a quiescent period after being disabled (to write
* a record). Disabling PEBS through VMX MSR swapping doesn't
@@ -6522,12 +6523,17 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
{
int i, nr_msrs;
struct perf_guest_switch_msr *msrs;
+ struct kvm_vcpu *vcpu = &vmx->vcpu;
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

msrs = perf_guest_get_msrs(&nr_msrs);

if (!msrs)
return;

+ if (nr_msrs > 2 && msrs[1].guest)
+ msrs[2].guest = pmu->ds_area;
+
for (i = 0; i < nr_msrs; i++)
if (msrs[i].host == msrs[i].guest)
clear_atomic_switch_msr(vmx, msrs[i].msr);
--
2.29.2

2021-01-04 13:28:55

by Like Xu

Subject: [PATCH v3 09/17] KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled

Bit 12 of MSR_IA32_MISC_ENABLE is "Processor Event Based Sampling Unavailable (RO)":
1 = PEBS is not supported.
0 = PEBS is supported.

A guest write that sets this PEBS_UNAVAIL bit raises #GP(0) when guest PEBS
is enabled. Some guest PEBS drivers may check this bit before using PEBS.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/vmx/pmu_intel.c | 2 ++
arch/x86/kvm/x86.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index c04e12812797..99d9453e0176 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -426,6 +426,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);

if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) {
+ vcpu->arch.ia32_misc_enable_msr &= ~MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
if (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_BASELINE) {
pmu->pebs_enable_mask = ~pmu->global_ctrl;
pmu->reserved_bits &= ~ICL_EVENTSEL_ADAPTIVE;
@@ -436,6 +437,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
} else
pmu->pebs_enable_mask = ~((1ull << pmu->nr_arch_gp_counters) - 1);
} else {
+ vcpu->arch.ia32_misc_enable_msr |= MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL;
vcpu->arch.perf_capabilities &= ~PERF_CAP_PEBS_MASK;
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a38ca932eec5..213368e47500 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3095,6 +3095,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
break;
case MSR_IA32_MISC_ENABLE:
data &= ~MSR_IA32_MISC_ENABLE_EMON;
+ if (!msr_info->host_initiated &&
+ (vcpu->arch.perf_capabilities & PERF_CAP_PEBS_FORMAT) &&
+ (data & MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL))
+ return 1;
if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT) &&
((vcpu->arch.ia32_misc_enable_msr ^ data) & MSR_IA32_MISC_ENABLE_MWAIT)) {
if (!guest_cpuid_has(vcpu, X86_FEATURE_XMM3))
--
2.29.2

2021-01-04 13:29:21

by Like Xu

Subject: [PATCH v3 10/17] KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS, DTES64

The CPUID features PDCM, DS and DTES64 are required for the PEBS feature.
KVM exposes the CPUID bits PDCM, DS and DTES64 to the guest when PEBS
is supported by KVM on Ice Lake Server platforms.

Originally-by: Andi Kleen <[email protected]>
Co-developed-by: Kan Liang <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.h | 6 ++++++
arch/x86/kvm/vmx/capabilities.h | 17 ++++++++++++++++-
arch/x86/kvm/vmx/vmx.c | 15 +++++++++++++++
3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 067fef51760c..ee8f15cc4b5e 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -3,6 +3,7 @@
#define __KVM_X86_PMU_H

#include <linux/nospec.h>
+#include <asm/cpu_device_id.h>

#define vcpu_to_pmu(vcpu) (&(vcpu)->arch.pmu)
#define pmu_to_vcpu(pmu) (container_of((pmu), struct kvm_vcpu, arch.pmu))
@@ -16,6 +17,11 @@
#define VMWARE_BACKDOOR_PMC_APPARENT_TIME 0x10002

#define MAX_FIXED_COUNTERS 3
+static const struct x86_cpu_id vmx_icl_pebs_cpu[] = {
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, NULL),
+ X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
+ {}
+};

struct kvm_event_hw_type_mapping {
u8 eventsel;
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3a1861403d73..2f22ce34b165 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -5,6 +5,7 @@
#include <asm/vmx.h>

#include "lapic.h"
+#include "pmu.h"

extern bool __read_mostly enable_vpid;
extern bool __read_mostly flexpriority_enabled;
@@ -369,13 +370,27 @@ static inline bool vmx_pt_mode_is_host_guest(void)
return pt_mode == PT_MODE_HOST_GUEST;
}

+static inline bool vmx_pebs_supported(void)
+{
+ return boot_cpu_has(X86_FEATURE_PEBS) && x86_match_cpu(vmx_icl_pebs_cpu);
+}
+
static inline u64 vmx_get_perf_capabilities(void)
{
/*
* Since counters are virtualized, KVM would support full
* width counting unconditionally, even if the host lacks it.
*/
- return PMU_CAP_FW_WRITES;
+ u64 value = PMU_CAP_FW_WRITES;
+ u64 perf_cap = 0;
+
+ if (boot_cpu_has(X86_FEATURE_PDCM))
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, perf_cap);
+
+ if (vmx_pebs_supported())
+ value |= perf_cap & PERF_CAP_PEBS_MASK;
+
+ return value;
}

#endif /* __KVM_X86_VMX_CAPS_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dbb0e49aae64..341794b67f9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2200,6 +2200,17 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
case MSR_IA32_PERF_CAPABILITIES:
if (data && !vcpu_to_pmu(vcpu)->version)
return 1;
+ if (data & PERF_CAP_PEBS_FORMAT) {
+ if ((data & PERF_CAP_PEBS_MASK) !=
+ (vmx_get_perf_capabilities() & PERF_CAP_PEBS_MASK))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DS))
+ return 1;
+ if (!guest_cpuid_has(vcpu, X86_FEATURE_DTES64))
+ return 1;
+ if (boot_cpu_data.x86_model != guest_cpuid_model(vcpu))
+ return 1;
+ }
ret = kvm_set_msr_common(vcpu, msr_info);
break;

@@ -7277,6 +7288,10 @@ static __init void vmx_set_cpu_caps(void)
kvm_cpu_cap_check_and_set(X86_FEATURE_INVPCID);
if (vmx_pt_mode_is_host_guest())
kvm_cpu_cap_check_and_set(X86_FEATURE_INTEL_PT);
+ if (vmx_pebs_supported()) {
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DS);
+ kvm_cpu_cap_check_and_set(X86_FEATURE_DTES64);
+ }

if (vmx_umip_emulated())
kvm_cpu_cap_set(X86_FEATURE_UMIP);
--
2.29.2

2021-01-04 13:29:41

by Like Xu

Subject: [PATCH v3 11/17] KVM: x86/pmu: Adjust precise_ip to emulate Ice Lake guest PDIR counter

The PEBS-PDIR facility on Ice Lake Server is supported on IA32_FIXED0 only.
If the guest configures counter index 32 (i.e. fixed counter 0) and PEBS is
enabled, the PEBS-PDIR facility is supposed to be used, in which case KVM
adjusts attr.precise_ip to 3 and requests host perf to assign exactly the
requested counter or fail.

The CPU model check is also required since some platforms may place the
PEBS-PDIR facility on a different counter index.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 2e81c50323e2..67c20ab81991 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -144,6 +144,8 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
* in the PEBS record is calibrated on the guest side.
*/
attr.precise_ip = 1;
+ if (x86_match_cpu(vmx_icl_pebs_cpu) && pmc->idx == 32)
+ attr.precise_ip = 3;
}

event = perf_event_create_kernel_counter(&attr, -1, current,
--
2.29.2

2021-01-04 13:31:11

by Like Xu

Subject: [PATCH v3 14/17] KVM: vmx/pmu: Limit pebs_interrupt_threshold in the guest DS area

If the host counter X is scheduled to the guest PEBS counter Y, the
pebs_interrupt_threshold field in the guest DS area is rewritten to allow
only ONE record before vm-entry, which helps KVM handle the cross-mapping
emulation more easily and accurately when the PEBS overflow PMI is
generated.

In most cases, the guest counters are not scheduled in a cross-mapped way,
which means there is no need to change the guest DS
pebs_interrupt_threshold, and the applicable_counters fields in the guest
PEBS records are naturally correct. The PEBS facility writes multiple PEBS
records into the guest DS without interception, and the performance is good.

AFAIK, we don't expect that changing the pebs_interrupt_threshold value
from the KVM side will break any guest PEBS drivers.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 3 ++
arch/x86/kvm/pmu.c | 17 +++-----
arch/x86/kvm/pmu.h | 11 +++++
arch/x86/kvm/vmx/pmu_intel.c | 77 +++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 1 +
5 files changed, 98 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5de4c14cf526..ea204c628f45 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -450,12 +450,15 @@ struct kvm_pmu {
DECLARE_BITMAP(pmc_in_use, X86_PMC_IDX_MAX);

u64 ds_area;
+ u64 cached_ds_area;
+ struct gfn_to_hva_cache ds_area_cache;
u64 pebs_enable;
u64 pebs_enable_mask;
u64 pebs_data_cfg;
u64 pebs_data_cfg_mask;

bool counter_cross_mapped;
+ bool need_rewrite_ds_pebs_interrupt_threshold;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e898da4699c9..c0f18b304933 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -472,17 +472,6 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
kvm_pmu_refresh(vcpu);
}

-static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
-{
- struct kvm_pmu *pmu = pmc_to_pmu(pmc);
-
- if (pmc_is_fixed(pmc))
- return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
- pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
-
- return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
-}
-
/* Release perf_events for vPMCs that have been unused for a full time slice. */
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
{
@@ -577,4 +566,10 @@ void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
break;
}
}
+
+ if (!pmu->counter_cross_mapped)
+ return;
+
+ if (pmu->need_rewrite_ds_pebs_interrupt_threshold)
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index b1e52e33f08c..6cdc9fd03195 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -147,6 +147,17 @@ static inline u64 get_sample_period(struct kvm_pmc *pmc, u64 counter_value)
return sample_period;
}

+static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (pmc_is_fixed(pmc))
+ return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
+
+ return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel);
void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int fixed_idx);
void reprogram_counter(struct kvm_pmu *pmu, int pmc_idx);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 2a06f923fbc7..b69e7c47fb05 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -211,6 +211,36 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu, u32 msr)
return pmc;
}

+static void intel_pmu_pebs_setup(struct kvm_pmu *pmu)
+{
+ struct kvm_vcpu *vcpu = pmu_to_vcpu(pmu);
+ struct kvm_pmc *pmc = NULL;
+ int bit, idx;
+ gpa_t gpa;
+
+ pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (pmc && pmc_speculative_in_use(pmc)) {
+ pmu->need_rewrite_ds_pebs_interrupt_threshold = true;
+ break;
+ }
+ }
+
+ if (pmu->pebs_enable && pmu->cached_ds_area != pmu->ds_area) {
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pmu->ds_area, NULL);
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &pmu->ds_area_cache,
+ gpa, sizeof(struct debug_store)))
+ goto out;
+ pmu->cached_ds_area = pmu->ds_area;
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
+ }
+}
+
static int intel_pmu_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -287,6 +317,8 @@ static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
if (kvm_valid_perf_global_ctrl(pmu, data)) {
global_ctrl_changed(pmu, data);
+ if (pmu->global_ctrl & pmu->pebs_enable)
+ intel_pmu_pebs_setup(pmu);
return 0;
}
break;
@@ -491,12 +523,57 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}

+static int rewrite_ds_pebs_interrupt_threshold(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct debug_store *ds = NULL;
+ u64 new_threshold, offset;
+ int srcu_idx, ret = -ENOMEM;
+
+ ds = kmalloc(sizeof(struct debug_store), GFP_KERNEL);
+ if (!ds)
+ goto out;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ if (kvm_read_guest_cached(vcpu->kvm, &pmu->ds_area_cache,
+ ds, sizeof(struct debug_store)))
+ goto unlock_out;
+
+ /* Adding sizeof(struct pebs_basic) offset is enough to generate PMI. */
+ new_threshold = ds->pebs_buffer_base + sizeof(struct pebs_basic);
+ offset = offsetof(struct debug_store, pebs_interrupt_threshold);
+ if (kvm_write_guest_offset_cached(vcpu->kvm, &pmu->ds_area_cache,
+ &new_threshold, offset, sizeof(u64)))
+ goto unlock_out;
+
+ ret = 0;
+
+unlock_out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+
+out:
+ kfree(ds);
+ return ret;
+}
+
static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
int ret = 0;

if (!(pmu->global_ctrl & pmu->pebs_enable))
return;
+
+ if (pmu->counter_cross_mapped && pmu->need_rewrite_ds_pebs_interrupt_threshold) {
+ ret = rewrite_ds_pebs_interrupt_threshold(vcpu);
+ pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+ }
+
+ if (ret == -ENOMEM)
+ pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
+ else if (ret == -EFAULT)
+ pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}

struct kvm_pmu_ops intel_pmu_ops = {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ab1ce26244d..118e6752b563 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5917,6 +5917,7 @@ gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
{
return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, 0, exception);
}
+EXPORT_SYMBOL_GPL(kvm_mmu_gva_to_gpa_system);

static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
struct kvm_vcpu *vcpu, u32 access,
--
2.29.2

2021-01-04 13:31:57

by Like Xu

Subject: [PATCH v3 13/17] KVM: x86/pmu: Add hook to emulate pebs for cross-mapped counters

To emulate the PEBS facility, KVM may need to set up context such as the
guest DS PEBS fields correctly before vm-entry; this part is implemented
in the vmx handle_event() hook.

When a cross-mapping happens to any enabled PEBS counter, KVM makes a
PMU request and exits to kvm_pmu_handle_event() for the rewrite work,
then returns to the cross-mapping check again and finally to vm-entry.

In this hook, KVM only rewrites state for the guest and does not move
events, hence races with the NMI-based PMI are not a problem.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/kvm/pmu.c | 3 +++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 9 +++++++++
arch/x86/kvm/vmx/vmx.c | 3 ---
4 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 3bfed803ed17..e898da4699c9 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -340,6 +340,9 @@ void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
*/
if (unlikely(pmu->need_cleanup))
kvm_pmu_cleanup(vcpu);
+
+ if (kvm_x86_ops.pmu_ops->handle_event)
+ kvm_x86_ops.pmu_ops->handle_event(vcpu);
}

/* check if idx is a valid index to access PMU */
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index f5ec94e9a1dc..b1e52e33f08c 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -45,6 +45,7 @@ struct kvm_pmu_ops {
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
+ void (*handle_event)(struct kvm_vcpu *vcpu);
};

static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 99d9453e0176..2a06f923fbc7 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -491,6 +491,14 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
pmu->global_ovf_ctrl = 0;
}

+static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
+ if (!(pmu->global_ctrl & pmu->pebs_enable))
+ return;
+}
+
struct kvm_pmu_ops intel_pmu_ops = {
.find_arch_event = intel_find_arch_event,
.find_fixed_event = intel_find_fixed_event,
@@ -505,4 +513,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.refresh = intel_pmu_refresh,
.init = intel_pmu_init,
.reset = intel_pmu_reset,
+ .handle_event = intel_pmu_handle_event,
};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index bc30c83e0a62..341794b67f9a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6542,9 +6542,6 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
if (!msrs)
return;

- if (pmu->counter_cross_mapped)
- msrs[1].guest = 0;
-
if (nr_msrs > 2 && msrs[1].guest) {
msrs[2].guest = pmu->ds_area;
if (nr_msrs > 3)
--
2.29.2

2021-01-04 13:32:01

by Like Xu

Subject: [PATCH v3 12/17] KVM: x86/pmu: Disable guest PEBS when counters are cross-mapped

KVM checks whether a guest PEBS counter X is cross-mapped to a host
counter Y. In this case, the applicable_counters field in the guest PEBS
records is filled with the real host counter index(es), which is incorrect.

For now, KVM disables guest PEBS before vm-entry in this case; later
patches add more emulation in KVM to keep the PEBS functionality working
as on the host, such as rewriting the applicable_counters field in the
guest PEBS records buffer.

The cross-mapped check should be done right before vm-entry but after
local_irq_disable(), since the perf scheduler may rotate pmc->perf_event
to another host counter or put the event into an error state via the
hrtimer irq.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 25 +++++++++++++++++++++++++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/vmx.c | 3 +++
arch/x86/kvm/x86.c | 4 ++++
5 files changed, 35 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4ff6aa00a325..5de4c14cf526 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -455,6 +455,8 @@ struct kvm_pmu {
u64 pebs_data_cfg;
u64 pebs_data_cfg_mask;

+ bool counter_cross_mapped;
+
/*
* The gate to release perf_events not marked in
* pmc_in_use only once in a vcpu time slice.
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 67c20ab81991..3bfed803ed17 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -550,3 +550,28 @@ int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
kfree(filter);
return r;
}
+
+/*
+ * The caller needs to ensure that there is no time window for
+ * perf hrtimer irq or any chance to reschedule pmc->perf_event.
+ */
+void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ int bit;
+
+ pmu->counter_cross_mapped = false;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc))
+ continue;
+
+ if (pmc->perf_event && (pmc->idx != pmc->perf_event->hw.idx)) {
+ pmu->counter_cross_mapped = true;
+ break;
+ }
+ }
+}
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index ee8f15cc4b5e..f5ec94e9a1dc 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -163,6 +163,7 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu);
void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
+void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu);

bool is_vmware_backdoor_pmc(u32 pmc_idx);

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 341794b67f9a..bc30c83e0a62 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6542,6 +6542,9 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
if (!msrs)
return;

+ if (pmu->counter_cross_mapped)
+ msrs[1].guest = 0;
+
if (nr_msrs > 2 && msrs[1].guest) {
msrs[2].guest = pmu->ds_area;
if (nr_msrs > 3)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 213368e47500..4ab1ce26244d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8936,6 +8936,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
* result in virtual interrupt delivery.
*/
local_irq_disable();
+
+ if (vcpu_to_pmu(vcpu)->global_ctrl & vcpu_to_pmu(vcpu)->pebs_enable)
+ kvm_pmu_counter_cross_mapped_check(vcpu);
+
vcpu->mode = IN_GUEST_MODE;

srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
--
2.29.2

2021-01-04 13:32:53

by Like Xu

Subject: [PATCH v3 16/17] KVM: x86/pmu: Save guest pebs reset values when pebs is configured

A guest PEBS counter X may be cross-mapped to a host counter Y. The PEBS
facility reloads the reset value once a PEBS record is written to the
guest DS, and may continue to generate PEBS records before the guest
reads the previous records.

KVM will adjust the guest DS PEBS reset counter values for exactly-mapped
host counters, but before that it needs to save the original expected
guest reset counter values right after the counter is fully enabled via
a trap.

We assume that every time the guest PEBS driver enables a counter for
large PEBS, it configures the DS reset counter values as Linux does.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/vmx/pmu_intel.c | 47 ++++++++++++++++++++++++++++++---
2 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e6394ac54f81..1d17e51c5c8a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,7 @@ struct kvm_pmc {
enum pmc_type type;
u8 idx;
u64 counter;
+ u64 reset_counter;
u64 eventsel;
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
@@ -461,6 +462,7 @@ struct kvm_pmu {
bool counter_cross_mapped;
bool need_rewrite_ds_pebs_interrupt_threshold;
bool need_rewrite_pebs_records;
+ bool need_save_reset_counter;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4c095c31db38..4e6ed0e8bddf 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -219,12 +219,14 @@ static void intel_pmu_pebs_setup(struct kvm_pmu *pmu)
gpa_t gpa;

pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+ pmu->need_save_reset_counter = false;

for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);

if (pmc && pmc_speculative_in_use(pmc)) {
pmu->need_rewrite_ds_pebs_interrupt_threshold = true;
+ pmu->need_save_reset_counter = true;
break;
}
}
@@ -624,10 +626,44 @@ static int rewrite_ds_pebs_records(struct kvm_vcpu *vcpu)
return ret;
}

+static int save_ds_pebs_reset_values(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ struct debug_store *ds = NULL;
+ int srcu_idx, bit, idx, ret;
+
+ ds = kmalloc(sizeof(struct debug_store), GFP_KERNEL);
+ if (!ds)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ if (kvm_read_guest_cached(vcpu->kvm, &pmu->ds_area_cache,
+ ds, sizeof(struct debug_store)))
+ goto out;
+
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (pmc) {
+ idx = (pmc->idx < INTEL_PMC_IDX_FIXED) ?
+ pmc->idx : (MAX_PEBS_EVENTS + pmc->idx - INTEL_PMC_IDX_FIXED);
+ pmc->reset_counter = ds->pebs_event_reset[idx];
+ }
+ }
+ ret = 0;
+
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+ kfree(ds);
+ return ret;
+}
+
static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int ret1, ret2;
+ int ret1, ret2, ret3;

if (pmu->need_rewrite_pebs_records) {
pmu->need_rewrite_pebs_records = false;
@@ -642,11 +678,16 @@ static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
ret2 = rewrite_ds_pebs_interrupt_threshold(vcpu);
}

+ if (pmu->need_save_reset_counter) {
+ pmu->need_save_reset_counter = false;
+ ret3 = save_ds_pebs_reset_values(vcpu);
+ }
+
out:

- if (ret1 == -ENOMEM || ret2 == -ENOMEM)
+ if (ret1 == -ENOMEM || ret2 == -ENOMEM || ret3 == -ENOMEM)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
- else if (ret1 == -EFAULT || ret2 == -EFAULT)
+ else if (ret1 == -EFAULT || ret2 == -EFAULT || ret3 == -EFAULT)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}

--
2.29.2

2021-01-04 13:33:04

by Like Xu

Subject: [PATCH v3 17/17] KVM: x86/pmu: Adjust guest pebs reset values for cross-mapped counters

The original PEBS reset counter value has been saved in pmc->reset_counter.

When the guest PEBS counter X is enabled, its reset value RST-x is written
to the guest DS reset field RST-y so that it is auto-reloaded into the
real host counter Y which is mapped to the guest PEBS counter X during
this vm-entry period.

KVM records the last host counter index for each guest PEBS counter and
triggers a rewrite of the reset values once any entry in the host-guest
counter mapping table changes before vm-entry.

Frequent changes in the mapping relationship should only happen when perf
multiplexes the counters with the default 1ms timer. The time cost of
adjusting the guest reset values does not exceed 1ms (13347ns on ICX),
and there is no race with the multiplexing timer that could create a
livelock.

Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 15 ++++++++++++
arch/x86/kvm/pmu.h | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 43 ++++++++++++++++++++++++++++++---
4 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1d17e51c5c8a..b97e73d16e65 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -418,6 +418,7 @@ struct kvm_pmc {
enum pmc_type type;
u8 idx;
u64 counter;
+ u8 host_idx;
u64 reset_counter;
u64 eventsel;
struct perf_event *perf_event;
@@ -463,6 +464,7 @@ struct kvm_pmu {
bool need_rewrite_ds_pebs_interrupt_threshold;
bool need_rewrite_pebs_records;
bool need_save_reset_counter;
+ bool need_rewrite_reset_counter;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 581653589108..2dbca3f02e33 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -155,6 +155,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
if (IS_ERR(event)) {
pr_debug_ratelimited("kvm_pmu: event creation failed %ld for pmc->idx = %d\n",
PTR_ERR(event), pmc->idx);
+ pmc->host_idx = -1;
return;
}

@@ -555,6 +556,7 @@ void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
int bit;

pmu->counter_cross_mapped = false;
+ pmu->need_rewrite_reset_counter = false;

for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
@@ -568,6 +570,19 @@ void kvm_pmu_counter_cross_mapped_check(struct kvm_vcpu *vcpu)
}
}

+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc_speculative_in_use(pmc) || !pmc_is_enabled(pmc))
+ continue;
+
+ if ((pmc->perf_event && (pmc->host_idx != pmc->perf_event->hw.idx))) {
+ pmu->need_rewrite_reset_counter = true;
+ kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
+ break;
+ }
+ }
+
if (!pmu->counter_cross_mapped)
return;

diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 6cdc9fd03195..2776a048fd27 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -74,6 +74,7 @@ static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
pmc->perf_event = NULL;
pmc->current_config = 0;
pmc_to_pmu(pmc)->event_count--;
+ pmc->host_idx = -1;
}
}

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 4e6ed0e8bddf..d0bfde29d2b0 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -660,10 +660,42 @@ static int save_ds_pebs_reset_values(struct kvm_vcpu *vcpu)
return ret;
}

+static int rewrite_ds_pebs_reset_counters(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ int srcu_idx, bit, ret;
+ u64 offset, host_idx, idx;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc->perf_event)
+ continue;
+
+ host_idx = pmc->perf_event->hw.idx;
+ idx = (host_idx < INTEL_PMC_IDX_FIXED) ?
+ host_idx : (MAX_PEBS_EVENTS + host_idx - INTEL_PMC_IDX_FIXED);
+ offset = offsetof(struct debug_store, pebs_event_reset) + sizeof(u64) * idx;
+ if (kvm_write_guest_offset_cached(vcpu->kvm, &pmu->ds_area_cache,
+ &pmc->reset_counter, offset, sizeof(u64)))
+ goto out;
+
+ pmc->host_idx = pmc->perf_event->hw.idx;
+ }
+ ret = 0;
+
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+ return ret;
+}
+
static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int ret1, ret2, ret3;
+ int ret1, ret2, ret3, ret4;

if (pmu->need_rewrite_pebs_records) {
pmu->need_rewrite_pebs_records = false;
@@ -683,11 +715,16 @@ static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
ret3 = save_ds_pebs_reset_values(vcpu);
}

+ if (pmu->need_rewrite_reset_counter) {
+ pmu->need_rewrite_reset_counter = false;
+ ret4 = rewrite_ds_pebs_reset_counters(vcpu);
+ }
+
out:

- if (ret1 == -ENOMEM || ret2 == -ENOMEM || ret3 == -ENOMEM)
+ if (ret1 == -ENOMEM || ret2 == -ENOMEM || ret3 == -ENOMEM || ret4 == -ENOMEM)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
- else if (ret1 == -EFAULT || ret2 == -EFAULT || ret3 == -EFAULT)
+ else if (ret1 == -EFAULT || ret2 == -EFAULT || ret3 == -EFAULT || ret4 == -EFAULT)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}

--
2.29.2

2021-01-04 13:33:21

by Like Xu

Subject: [PATCH v3 15/17] KVM: vmx/pmu: Rewrite applicable_counters field in guest PEBS records

The PEBS event counters scheduled by the host may differ from the counters
required by the guest. The host counter index leaks into the guest PEBS
records, and the guest driver would be confused by the counter indexes in
the "Applicable Counters" field of the PEBS records and ignore them.

Before the guest PEBS overflow PMI is injected into the guest through the
global status, KVM needs to rewrite the "Applicable Counters" field with
the right enabled guest PEBS counter index(es) in the guest PEBS records.

Co-developed-by: Luwei Kang <[email protected]>
Signed-off-by: Luwei Kang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/pmu.c | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 84 +++++++++++++++++++++++++++++++--
3 files changed, 82 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ea204c628f45..e6394ac54f81 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -452,6 +452,7 @@ struct kvm_pmu {
u64 ds_area;
u64 cached_ds_area;
struct gfn_to_hva_cache ds_area_cache;
+ struct gfn_to_hva_cache pebs_buffer_base_cache;
u64 pebs_enable;
u64 pebs_enable_mask;
u64 pebs_data_cfg;
@@ -459,6 +460,7 @@ struct kvm_pmu {

bool counter_cross_mapped;
bool need_rewrite_ds_pebs_interrupt_threshold;
+ bool need_rewrite_pebs_records;

/*
* The gate to release perf_events not marked in
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index c0f18b304933..581653589108 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -77,6 +77,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,

if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
if (perf_event->attr.precise_ip) {
+ pmu->need_rewrite_pebs_records = pmu->counter_cross_mapped;
/* Indicate PEBS overflow PMI to guest. */
__set_bit(GLOBAL_STATUS_BUFFER_OVF_BIT,
(unsigned long *)&pmu->global_status);
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index b69e7c47fb05..4c095c31db38 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -557,22 +557,96 @@ static int rewrite_ds_pebs_interrupt_threshold(struct kvm_vcpu *vcpu)
return ret;
}

+static int rewrite_ds_pebs_records(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ struct debug_store *ds = NULL;
+ gpa_t gpa;
+ u64 pebs_buffer_base, offset, buffer_base, status, new_status, format_size;
+ int srcu_idx, bit, ret = 0;
+
+ if (!pmu->counter_cross_mapped)
+ return ret;
+
+ ds = kmalloc(sizeof(struct debug_store), GFP_KERNEL);
+ if (!ds)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+ if (kvm_read_guest_cached(vcpu->kvm, &pmu->ds_area_cache,
+ ds, sizeof(struct debug_store)))
+ goto out;
+
+ if (ds->pebs_index <= ds->pebs_buffer_base)
+ goto out;
+
+ pebs_buffer_base = ds->pebs_buffer_base;
+ offset = offsetof(struct pebs_basic, applicable_counters);
+ buffer_base = 0;
+
+ gpa = kvm_mmu_gva_to_gpa_system(vcpu, pebs_buffer_base, NULL);
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &pmu->pebs_buffer_base_cache,
+ gpa, sizeof(struct pebs_basic)))
+ goto out;
+
+ do {
+ ret = -EFAULT;
+ if (kvm_read_guest_offset_cached(vcpu->kvm, &pmu->pebs_buffer_base_cache,
+ &status, buffer_base + offset, sizeof(u64)))
+ goto out;
+ if (kvm_read_guest_offset_cached(vcpu->kvm, &pmu->pebs_buffer_base_cache,
+ &format_size, buffer_base, sizeof(u64)))
+ goto out;
+
+ new_status = 0ull;
+ for_each_set_bit(bit, (unsigned long *)&pmu->pebs_enable, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops.pmu_ops->pmc_idx_to_pmc(pmu, bit);
+
+ if (!pmc || !pmc->perf_event)
+ continue;
+
+ if (test_bit(pmc->perf_event->hw.idx, (unsigned long *)&status))
+ new_status |= BIT_ULL(pmc->idx);
+ }
+ if (kvm_write_guest_offset_cached(vcpu->kvm, &pmu->pebs_buffer_base_cache,
+ &new_status, buffer_base + offset, sizeof(u64)))
+ goto out;
+
+ ret = 0;
+ buffer_base += format_size >> 48;
+ } while (pebs_buffer_base + buffer_base < ds->pebs_index);
+
+out:
+ srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
+ kfree(ds);
+ return ret;
+}
+
static void intel_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- int ret;
+ int ret1 = 0, ret2 = 0;
+
+ if (pmu->need_rewrite_pebs_records) {
+ pmu->need_rewrite_pebs_records = false;
+ ret1 = rewrite_ds_pebs_records(vcpu);
+ }

if (!(pmu->global_ctrl & pmu->pebs_enable))
- return;
+ goto out;

if (pmu->counter_cross_mapped && pmu->need_rewrite_ds_pebs_interrupt_threshold) {
- ret = rewrite_ds_pebs_interrupt_threshold(vcpu);
pmu->need_rewrite_ds_pebs_interrupt_threshold = false;
+ ret2 = rewrite_ds_pebs_interrupt_threshold(vcpu);
}

- if (ret == -ENOMEM)
+out:
+
+ if (ret1 == -ENOMEM || ret2 == -ENOMEM)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to OOM.", __func__);
- else if (ret == -EFAULT)
+ else if (ret1 == -EFAULT || ret2 == -EFAULT)
pr_debug_ratelimited("%s: Fail to emulate guest PEBS due to GPA fault.", __func__);
}

--
2.29.2

2021-01-05 23:04:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 06/17] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS

On Mon, Jan 04, 2021, Like Xu wrote:
> If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
> IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
> and general purpose counters have corresponding bits in IA32_PEBS_ENABLE
> that enable generation of PEBS records. The general-purpose counter bits
> start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
> bit IA32_PEBS_ENABLE[32].
>
> When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
> added to the perf_guest_switch_msr() and switched during the
> VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
>
> Originally-by: Andi Kleen <[email protected]>
> Co-developed-by: Kan Liang <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> Co-developed-by: Luwei Kang <[email protected]>
> Signed-off-by: Luwei Kang <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/include/asm/msr-index.h | 6 ++++++
> arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++++++
> 4 files changed, 55 insertions(+)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index af457f8cb29d..6453b8a6834a 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3715,6 +3715,26 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
> *nr = 2;
> }
>
> + if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
> + arr[1].msr = MSR_IA32_PEBS_ENABLE;
> + arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
> + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> + /*
> + * The guest PEBS will be disabled once the host PEBS is enabled
> + * since the both enabled case may brings a unknown PMI to
> + * confuse host and the guest PEBS overflow PMI would be missed.
> + */
> + if (arr[1].host)
> + arr[1].guest = 0;
> + arr[0].guest |= arr[1].guest;

Can't you modify the code that strips the PEBS counters from the guest's
value instead of poking into the array entry after the fact?

Also, why is this scenario even allowed? Can't we force exclude_guest for
events that use PEBS?

> + *nr = 2;
> + } else if (*nr == 1) {
> + /* Remove MSR_IA32_PEBS_ENABLE from MSR switch list in KVM */
> + arr[1].msr = MSR_IA32_PEBS_ENABLE;
> + arr[1].host = arr[1].guest = 0;
> + *nr = 2;

Similar to above, rather then check "*nr == 1", this should properly integrate
with the "x86_pmu.pebs && x86_pmu.pebs_no_isolation" logic instead of poking
into the array after the fact.

By incorporating both suggestions, the logic can be streamlined significantly,
and IMO makes the overall flow much more understandable. Untested...

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d4569bfa83e3..c5cc7e558c8e 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3708,24 +3708,39 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
- if (x86_pmu.flags & PMU_FL_PEBS_ALL)
- arr[0].guest &= ~cpuc->pebs_enabled;
- else
- arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+
+ /*
+ * Disable PEBS in the guest if PEBS is used by the host; enabling PEBS
+ * in both will lead to unexpected PMIs in the host and/or missed PMIs
+ * in the guest.
+ */
+ if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask) {
+ if (x86_pmu.flags & PMU_FL_PEBS_ALL)
+ arr[0].guest &= ~cpuc->pebs_enabled;
+ else
+ arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
+ }
*nr = 1;

- if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
- /*
- * If PMU counter has PEBS enabled it is not enough to
- * disable counter on a guest entry since PEBS memory
- * write can overshoot guest entry and corrupt guest
- * memory. Disabling PEBS solves the problem.
- *
- * Don't do this if the CPU already enforces it.
- */
+ if (x86_pmu.pebs) {
arr[1].msr = MSR_IA32_PEBS_ENABLE;
- arr[1].host = cpuc->pebs_enabled;
- arr[1].guest = 0;
+ arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
+
+ /*
+ * Host and guest PEBS are mutually exclusive. Load the guest
+ * value iff PEBS is disabled in the host. If PEBS is enabled
+ * in the host and the CPU supports PEBS isolation, disabling
+ * the counters is sufficient (see above); skip the MSR loads
+ * by stuffing guest=host (KVM will remove the entry). Without
+ * isolation, PEBS must be explicitly disabled prior to
+ * VM-Enter to prevent PEBS writes from overshooting VM-Enter.
+ */
+ if (!arr[1].host)
+ arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
+ else if (x86_pmu.pebs_no_isolation)
+ arr[1].guest = 0;
+ else
+ arr[1].guest = arr[1].host;
*nr = 2;
}

2021-01-05 23:04:37

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 07/17] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer

On Mon, Jan 04, 2021, Like Xu wrote:
> When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and
> points to the linear address of the first byte of the DS buffer
> management area, which is used to manage the PEBS records.
>
> When guest PEBS is enabled and the value is different from the
> host, KVM will add the IA32_DS_AREA MSR to the msr-switch list.
> The guest's DS value can be loaded to the real HW before VM-entry,
> and will be removed when guest PEBS is disabled.
>
> The WRMSR to IA32_DS_AREA MSR brings a #GP(0) if the source register
> contains a non-canonical address. The switch of IA32_DS_AREA MSR would
> also, setup a quiescent period to write the host PEBS records (if any)
> to host DS area rather than guest DS area.
>
> When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be
> added to the perf_guest_switch_msr() and switched during the
> VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
>
> Originally-by: Andi Kleen <[email protected]>
> Co-developed-by: Kan Liang <[email protected]>
> Signed-off-by: Kan Liang <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/events/intel/core.c | 13 +++++++++++++
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/vmx/pmu_intel.c | 11 +++++++++++
> arch/x86/kvm/vmx/vmx.c | 6 ++++++
> 4 files changed, 31 insertions(+)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 6453b8a6834a..ccddda455bec 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3690,6 +3690,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
> + struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
>
> arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> @@ -3735,6 +3736,18 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
> *nr = 2;
> }
>
> + if (arr[1].guest) {
> + arr[2].msr = MSR_IA32_DS_AREA;
> + arr[2].host = (unsigned long)ds;
> + /* KVM will update MSR_IA32_DS_AREA with the trapped guest value. */
> + arr[2].guest = 0ull;
> + *nr = 3;
> + } else if (*nr == 2) {
> + arr[2].msr = MSR_IA32_DS_AREA;
> + arr[2].host = arr[2].guest = 0;
> + *nr = 3;
> + }

Similar comments as the previous patch, please figure out a way to properly
integrate this into the PEBS logic instead of querying arr/nr.

> +
> return arr;
> }

2021-01-07 12:40:40

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 06/17] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS

Hi Sean,

On 2021/1/6 5:11, Sean Christopherson wrote:
> On Mon, Jan 04, 2021, Like Xu wrote:
>> If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
>> IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
>> and general purpose counters have corresponding bits in IA32_PEBS_ENABLE
>> that enable generation of PEBS records. The general-purpose counter bits
>> start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
>> bit IA32_PEBS_ENABLE[32].
>>
>> When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
>> added to the perf_guest_switch_msr() and switched during the
>> VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
>>
>> Originally-by: Andi Kleen <[email protected]>
>> Co-developed-by: Kan Liang <[email protected]>
>> Signed-off-by: Kan Liang <[email protected]>
>> Co-developed-by: Luwei Kang <[email protected]>
>> Signed-off-by: Luwei Kang <[email protected]>
>> Signed-off-by: Like Xu <[email protected]>
>> ---
>> arch/x86/events/intel/core.c | 20 ++++++++++++++++++++
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/include/asm/msr-index.h | 6 ++++++
>> arch/x86/kvm/vmx/pmu_intel.c | 28 ++++++++++++++++++++++++++++
>> 4 files changed, 55 insertions(+)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index af457f8cb29d..6453b8a6834a 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3715,6 +3715,26 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
>> *nr = 2;
>> }
>>
>> + if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
>> + arr[1].msr = MSR_IA32_PEBS_ENABLE;
>> + arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
>> + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> + /*
>> + * The guest PEBS will be disabled once the host PEBS is enabled
>> + * since the both enabled case may brings a unknown PMI to
>> + * confuse host and the guest PEBS overflow PMI would be missed.
>> + */
>> + if (arr[1].host)
>> + arr[1].guest = 0;
>> + arr[0].guest |= arr[1].guest;
> Can't you modify the code that strips the PEBS counters from the guest's
> value instead of poking into the array entry after the fact?
Ah, nice move.
>
> Also, why is this scenario even allowed? Can't we force exclude_guest for
> events that use PEBS?

The attr.exclude_guest flag can be configured per event; it is not
shared with other perf_events on the same CPU, and changing it does
not take effect while the event is running.

Host perf would still allow creating or running a PEBS perf_event
for a vCPU while the host is using a PEBS counter. One reason is that
the perf scheduler needs to know which event owns which PEBS counter.

>
>> + *nr = 2;
>> + } else if (*nr == 1) {
>> + /* Remove MSR_IA32_PEBS_ENABLE from MSR switch list in KVM */
>> + arr[1].msr = MSR_IA32_PEBS_ENABLE;
>> + arr[1].host = arr[1].guest = 0;
>> + *nr = 2;
> Similar to above, rather then check "*nr == 1", this should properly integrate
> with the "x86_pmu.pebs && x86_pmu.pebs_no_isolation" logic instead of poking
> into the array after the fact.
Thanks, it makes sense to me and I'll figure it out with your cleaner code.
>
> By incorporating both suggestions, the logic can be streamlined significantly,
> and IMO makes the overall flow much more understandable. Untested...
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index d4569bfa83e3..c5cc7e558c8e 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3708,24 +3708,39 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
> arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
> arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
> arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
> - if (x86_pmu.flags & PMU_FL_PEBS_ALL)
> - arr[0].guest &= ~cpuc->pebs_enabled;
> - else
> - arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
> +
> + /*
> + * Disable PEBS in the guest if PEBS is used by the host; enabling PEBS
> + * in both will lead to unexpected PMIs in the host and/or missed PMIs
> + * in the guest.
> + */
> + if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask) {
> + if (x86_pmu.flags & PMU_FL_PEBS_ALL)
> + arr[0].guest &= ~cpuc->pebs_enabled;
> + else
> + arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
> + }
> *nr = 1;
>
> - if (x86_pmu.pebs && x86_pmu.pebs_no_isolation) {
> - /*
> - * If PMU counter has PEBS enabled it is not enough to
> - * disable counter on a guest entry since PEBS memory
> - * write can overshoot guest entry and corrupt guest
> - * memory. Disabling PEBS solves the problem.
> - *
> - * Don't do this if the CPU already enforces it.
> - */
> + if (x86_pmu.pebs) {
> arr[1].msr = MSR_IA32_PEBS_ENABLE;
> - arr[1].host = cpuc->pebs_enabled;
> - arr[1].guest = 0;
> + arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
> +
> + /*
> + * Host and guest PEBS are mutually exclusive. Load the guest
> + * value iff PEBS is disabled in the host. If PEBS is enabled
> + * in the host and the CPU supports PEBS isolation, disabling
> + * the counters is sufficient (see above); skip the MSR loads
s/above/9b545c04abd4/
> + * by stuffing guest=host (KVM will remove the entry). Without
> + * isolation, PEBS must be explicitly disabled prior to
> + * VM-Enter to prevent PEBS writes from overshooting VM-Enter.
> + */
> + if (!arr[1].host)
> + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
We're missing "arr[0].guest |= arr[1].guest;" here to enable it in the
global_ctrl MSR.

Sean, if you have more comments on other patches, just let me know.

---
thx,likexu
> + else if (x86_pmu.pebs_no_isolation)
> + arr[1].guest = 0;
> + else
> + arr[1].guest = arr[1].host;
> *nr = 2;
> }

2021-01-08 03:08:24

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 07/17] KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to manage guest DS buffer

Hi Sean,

On 2021/1/6 5:16, Sean Christopherson wrote:
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 6453b8a6834a..ccddda455bec 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3690,6 +3690,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
>> + struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
>>
>> arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
>> arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
>> @@ -3735,6 +3736,18 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
>> *nr = 2;
>> }
>>
>> + if (arr[1].guest) {
>> + arr[2].msr = MSR_IA32_DS_AREA;
>> + arr[2].host = (unsigned long)ds;
>> + /* KVM will update MSR_IA32_DS_AREA with the trapped guest value. */
>> + arr[2].guest = 0ull;
>> + *nr = 3;
>> + } else if (*nr == 2) {
>> + arr[2].msr = MSR_IA32_DS_AREA;
>> + arr[2].host = arr[2].guest = 0;
>> + *nr = 3;
>> + }
> Similar comments as the previous patch, please figure out a way to properly
> integrate this into the PEBS logic instead of querying arr/nr.

To address your comment, please help confirm whether you are
fine or happy with the streamlined logic of intel_guest_get_msrs():

static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr)
{
    struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
    struct perf_guest_switch_msr *arr = cpuc->guest_switch_msrs;
    struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);

    arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
    arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
    arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;

    /*
     * Disable PEBS in the guest if PEBS is used by the host; enabling PEBS
     * in both will lead to unexpected PMIs in the host and/or missed PMIs
     * in the guest.
     */
    if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
        if (x86_pmu.flags & PMU_FL_PEBS_ALL)
            arr[0].guest &= ~cpuc->pebs_enabled;
        else
            arr[0].guest &= ~(cpuc->pebs_enabled & PEBS_COUNTER_MASK);
    }
    *nr = 1;

    if (x86_pmu.pebs) {
        arr[1].msr = MSR_IA32_PEBS_ENABLE;
        arr[2].msr = MSR_IA32_DS_AREA;
        if (x86_pmu.intel_cap.pebs_baseline)
            arr[3].msr = MSR_PEBS_DATA_CFG;

        /*
         * Skip the MSR loads by stuffing guest=host (KVM will remove
         * the entry).
         */
        arr[1].guest = arr[1].host =
            cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
        arr[2].guest = arr[2].host = (unsigned long)ds;
        if (x86_pmu.intel_cap.pebs_baseline)
            arr[3].guest = arr[3].host = cpuc->pebs_data_cfg;

        /*
         * Host and guest PEBS are mutually exclusive. Load the guest
         * value iff PEBS is disabled in the host.
         *
         * If PEBS is enabled in the host and the CPU supports PEBS isolation,
         * disabling the counters is sufficient (see commit 9b545c04abd4);
         * Without isolation, PEBS must be explicitly disabled prior to
         * VM-Enter to prevent PEBS writes from overshooting VM-Enter.
         *
         * KVM will update arr[2|3].guest with the trapped guest values
         * iff guest PEBS is allowed to be enabled.
         */
        if (!arr[1].host) {
            arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
            arr[0].guest |= arr[1].guest;
        } else if (x86_pmu.pebs_no_isolation)
            arr[1].guest = 0;

        *nr = x86_pmu.intel_cap.pebs_baseline ? 4 : 3;
    }

    return arr;
}
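The host/guest mutual-exclusion cases handled above can be modeled in
isolation like this (names are invented for illustration; the masks
follow the convention that ctrl_guest_mask is cleared from the host
value and ctrl_host_mask from the guest value, mirroring perf's
intel_ctrl_*_mask usage):

```python
def pebs_switch_msr(pebs_enabled, ctrl_guest_mask, ctrl_host_mask,
                    pebs_no_isolation):
    """Compute the (host, guest) values loaded for
    MSR_IA32_PEBS_ENABLE across VMX transitions."""
    host = pebs_enabled & ~ctrl_guest_mask
    if not host:
        # Host PEBS idle: load the guest value at VM-Enter.
        guest = pebs_enabled & ~ctrl_host_mask
    elif pebs_no_isolation:
        # No HW isolation: PEBS must be off before VM-Enter.
        guest = 0
    else:
        # Isolation suffices; guest == host lets KVM drop the entry.
        guest = host
    return host, guest

# Guest owns counter 0, host uses no PEBS counter:
assert pebs_switch_msr(0b1, 0b1, 0b0, False) == (0, 0b1)
# Host uses counter 0 on a CPU without PEBS isolation:
assert pebs_switch_msr(0b1, 0b0, 0b1, True) == (0b1, 0)
# Host uses counter 0 with PEBS isolation: stuff guest = host:
assert pebs_switch_msr(0b1, 0b0, 0b1, False) == (0b1, 0b1)
```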

---
thx,likexu

>
>> +
>> return arr;
>> }

2021-01-13 18:12:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 03/17] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter

On Mon, Jan 04, 2021 at 09:15:28PM +0800, Like Xu wrote:
> @@ -327,6 +328,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
> pmu->version = 0;
> pmu->reserved_bits = 0xffffffff00200000ull;
> + pmu->fixed_ctr_ctrl_mask = ~0ull;

All 1s

>
> entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
> if (!entry)
> @@ -358,6 +360,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
> ((u64)1 << edx.split.bit_width_fixed) - 1;
> }
>
> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> + pmu->fixed_ctr_ctrl_mask |= (0xbull << (i * 4));

With some extra 1s on top

> + pmu->fixed_ctr_ctrl_mask = ~pmu->fixed_ctr_ctrl_mask;

Inverted is all 0s, always.

> pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) |
> (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
> pmu->global_ctrl_mask = ~pmu->global_ctrl;
> --
> 2.29.2
>

2021-01-13 18:24:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Mon, Jan 04, 2021 at 09:15:29PM +0800, Like Xu wrote:

> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index b47cc4226934..c499bdb58373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
> return 0;
> }
>
> +/*
> + * We may be running with guest PEBS events created by KVM, and the
> + * PEBS records are logged into the guest's DS and invisible to host.
> + *
> + * In the case of guest PEBS overflow, we only trigger a fake event
> + * to emulate the PEBS overflow PMI for guest PBES counters in KVM.
> + * The guest will then vm-entry and check the guest DS area to read
> + * the guest PEBS records.
> + *
> + * The guest PEBS overflow PMI may be dropped when both the guest and
> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
> + * the host PEBS is enabled since it may bring a confused unknown NMI.
> + *
> + * The contents and other behavior of the guest event do not matter.
> + */
> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
> + struct pt_regs *iregs,
> + struct debug_store *ds)
> +{
> + struct perf_sample_data data;
> + struct perf_event *event = NULL;
> + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> + int bit;
> +
> + /*
> + * Ideally, we should check guest DS to understand if it's
> + * a guest PEBS overflow PMI from guest PEBS counters.
> + * However, it brings high overhead to retrieve guest DS in host.
> + * So we check host DS instead for performance.

Again; for the virt illiterate people here (me); why is it expensive to
check guest DS?

Why do we need to? Can't we simply always forward the PMI if the guest
has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
MSRs at a reasonable rate..

Sure, it'll send too many PMIs, but is that really a problem?

> + *
> + * If PEBS interrupt threshold on host is not exceeded in a NMI, there
> + * must be a PEBS overflow PMI generated from the guest PEBS counters.
> + * There is no ambiguity since the reported event in the PMI is guest
> + * only. It gets handled correctly on a case by case base for each event.
> + *
> + * Note: KVM disables the co-existence of guest PEBS and host PEBS.

Where; I need a code reference here.

> + */
> + if (!guest_pebs_idxs || !in_nmi() ||

All the other code uses !iregs instead of !in_nmi(), also your
indentation is broken.

> + ds->pebs_index >= ds->pebs_interrupt_threshold)
> + return 0;
> +
> + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
> + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
> +
> + event = cpuc->events[bit];
> + if (!event->attr.precise_ip)
> + continue;
> +
> + perf_sample_data_init(&data, 0, event->hw.last_period);
> + if (perf_event_overflow(event, &data, iregs))
> + x86_pmu_stop(event, 0);
> +
> + /* Inject one fake event is enough. */
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> static __always_inline void
> __intel_pmu_pebs_event(struct perf_event *event,
> struct pt_regs *iregs,
> @@ -1965,6 +2024,9 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
> if (!x86_pmu.pebs_active)
> return;
>
> + if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
> + return;
> +
> base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
> top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
>
> --
> 2.29.2
>

2021-01-13 18:31:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Wed, Jan 13, 2021 at 07:22:09PM +0100, Peter Zijlstra wrote:
> Again; for the virt illiterate people here (me); why is it expensive to
> check guest DS?

Remember, you're trying to get someone that thinks virt is the devil's
work (me) to review virt patches. You get to spell things out in detail.

2021-01-14 02:01:45

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 03/17] KVM: x86/pmu: Introduce the ctrl_mask value for fixed counter

On 2021/1/14 2:06, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 09:15:28PM +0800, Like Xu wrote:
>> @@ -327,6 +328,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>> pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
>> pmu->version = 0;
>> pmu->reserved_bits = 0xffffffff00200000ull;
>> + pmu->fixed_ctr_ctrl_mask = ~0ull;
> All 1s
>
>>
>> entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
>> if (!entry)
>> @@ -358,6 +360,9 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
>> ((u64)1 << edx.split.bit_width_fixed) - 1;
>> }
>>
>> + for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
>> + pmu->fixed_ctr_ctrl_mask |= (0xbull << (i * 4));
> With some extra 1s on top

You're right, I think it should be:

    pmu->fixed_ctr_ctrl_mask &= ~(0xbull << (i * 4));

w/o inversion, and I will fix it in the next version.

>
>> + pmu->fixed_ctr_ctrl_mask = ~pmu->fixed_ctr_ctrl_mask;
> Inverted is all 0s, always.
>
>> pmu->global_ctrl = ((1ull << pmu->nr_arch_gp_counters) - 1) |
>> (((1ull << pmu->nr_arch_fixed_counters) - 1) << INTEL_PMC_IDX_FIXED);
>> pmu->global_ctrl_mask = ~pmu->global_ctrl;
>> --
>> 2.29.2
>>

2021-01-14 03:42:01

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On 2021/1/14 2:22, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 09:15:29PM +0800, Like Xu wrote:
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index b47cc4226934..c499bdb58373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
>> return 0;
>> }
>>
>> +/*
>> + * We may be running with guest PEBS events created by KVM, and the
>> + * PEBS records are logged into the guest's DS and invisible to host.
>> + *
>> + * In the case of guest PEBS overflow, we only trigger a fake event
>> + * to emulate the PEBS overflow PMI for guest PBES counters in KVM.
>> + * The guest will then vm-entry and check the guest DS area to read
>> + * the guest PEBS records.
>> + *
>> + * The guest PEBS overflow PMI may be dropped when both the guest and
>> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
>> + * the host PEBS is enabled since it may bring a confused unknown NMI.
>> + *
>> + * The contents and other behavior of the guest event do not matter.
>> + */
>> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
>> + struct pt_regs *iregs,
>> + struct debug_store *ds)
>> +{
>> + struct perf_sample_data data;
>> + struct perf_event *event = NULL;
>> + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> + int bit;
>> +
>> + /*
>> + * Ideally, we should check guest DS to understand if it's
>> + * a guest PEBS overflow PMI from guest PEBS counters.
>> + * However, it brings high overhead to retrieve guest DS in host.
>> + * So we check host DS instead for performance.
> Again; for the virt illiterate people here (me); why is it expensive to
> check guest DS?

We are not checking the guest DS here for two reasons:
- it brings additional KVM memory locking operations and guest page
  table traversal, which is very expensive for guests with large
  memory (even if we cached the mapped values, we would still need to
  check whether the cached ones are still valid);
- the current kvm_read_guest_*() interface might sleep and is not IRQ
  safe.

If you still need me to try this guest DS check approach, please let
me know, and I will provide more performance data.

>
> Why do we need to? Can't we simply always forward the PMI if the guest
> has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
> MSRs at a reasonable rate..
>
> Sure, it'll send too many PMIs, but is that really a problem?

More vPMIs mean more guest irq handler calls and more PMI
virtualization overhead. In addition, the correctness of some
workloads (RR?) depends on getting the right number of PMIs at the
right trigger times, and virt may not want to break this assumption.

>
>> + *
>> + * If PEBS interrupt threshold on host is not exceeded in a NMI, there
>> + * must be a PEBS overflow PMI generated from the guest PEBS counters.
>> + * There is no ambiguity since the reported event in the PMI is guest
>> + * only. It gets handled correctly on a case by case base for each event.
>> + *
>> + * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> Where; I need a code reference here.

How about:

Note: KVM will disable the co-existence of guest PEBS and host PEBS.
In the intel_guest_get_msrs(), when we have host PEBS ctrl bit(s) enabled,
KVM will clear the guest PEBS ctrl enable bit(s) before vm-entry.
The guest PEBS users should be notified of this runtime restriction.

>
>> + */
>> + if (!guest_pebs_idxs || !in_nmi() ||
> All the other code uses !iregs instead of !in_nmi(), also your
> indentation is broken.

Sure, I'll use !iregs and fix the indentation in the next version.

---
thx,likexu
>> + ds->pebs_index >= ds->pebs_interrupt_threshold)
>> + return 0;
>> +
>> + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
>> + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
>> +
>> + event = cpuc->events[bit];
>> + if (!event->attr.precise_ip)
>> + continue;
>> +
>> + perf_sample_data_init(&data, 0, event->hw.last_period);
>> + if (perf_event_overflow(event, &data, iregs))
>> + x86_pmu_stop(event, 0);
>> +
>> + /* Inject one fake event is enough. */
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> static __always_inline void
>> __intel_pmu_pebs_event(struct perf_event *event,
>> struct pt_regs *iregs,
>> @@ -1965,6 +2024,9 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
>> if (!x86_pmu.pebs_active)
>> return;
>>
>> + if (intel_pmu_handle_guest_pebs(cpuc, iregs, ds))
>> + return;
>> +
>> base = (struct pebs_basic *)(unsigned long)ds->pebs_buffer_base;
>> top = (struct pebs_basic *)(unsigned long)ds->pebs_index;
>>
>> --
>> 2.29.2
>>

2021-01-14 18:58:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Mon, Jan 04, 2021, Like Xu wrote:
> ---
> arch/x86/events/intel/ds.c | 62 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 62 insertions(+)
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index b47cc4226934..c499bdb58373 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
> return 0;
> }
>
> +/*
> + * We may be running with guest PEBS events created by KVM, and the
> + * PEBS records are logged into the guest's DS and invisible to host.
> + *
> + * In the case of guest PEBS overflow, we only trigger a fake event
> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
> + * The guest will then vm-entry and check the guest DS area to read
> + * the guest PEBS records.
> + *
> + * The guest PEBS overflow PMI may be dropped when both the guest and
> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
> + * the host PEBS is enabled since it may bring a confusing unknown NMI.
> + *
> + * The contents and other behavior of the guest event do not matter.
> + */
> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
> + struct pt_regs *iregs,
> + struct debug_store *ds)
> +{
> + struct perf_sample_data data;
> + struct perf_event *event = NULL;
> + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> + int bit;
> +
> + /*
> + * Ideally, we should check guest DS to understand if it's
> + * a guest PEBS overflow PMI from guest PEBS counters.
> + * However, it brings high overhead to retrieve guest DS in host.
> + * So we check host DS instead for performance.
> + *
> + * If PEBS interrupt threshold on host is not exceeded in a NMI, there
> + * must be a PEBS overflow PMI generated from the guest PEBS counters.
> + * There is no ambiguity since the reported event in the PMI is guest
> + * only. It gets handled correctly on a case by case base for each event.
> + *
> + * Note: KVM disables the co-existence of guest PEBS and host PEBS.

By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
Because the PMU should really be the entity that controls guest vs. host. KVM
should just be a dumb pipe that handles the mechanics of how values are context
switch.

For example, commit 7099e2e1f4d9 ("KVM: VMX: disable PEBS before a guest entry"),
where KVM does an explicit WRMSR(PEBS_ENABLE) to (attempt to) force PEBS
quiescence, is flawed in that the PMU can re-enable PEBS after the WRMSR if a
PMI arrives between the WRMSR and VM-Enter (because VMX can't block NMIs). The
PMU really needs to be involved in the WRMSR workaround.

> + */
> + if (!guest_pebs_idxs || !in_nmi() ||

Are PEBS updates guaranteed to be isolated in both directions on relevant
> hardware? By that I mean, will host updates be fully processed before VM-Enter
> completes, and guest updates before VM-Exit completes? If that's the case,
> then this path could be optimized to change the KVM invocation of the NMI
> handler so that the "is this a guest PEBS PMI" check is done if and only if the
> PMI originated from within the guest.

> + ds->pebs_index >= ds->pebs_interrupt_threshold)
> + return 0;
> +
> + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
> + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
> +
> + event = cpuc->events[bit];
> + if (!event->attr.precise_ip)
> + continue;
> +
> + perf_sample_data_init(&data, 0, event->hw.last_period);
> + if (perf_event_overflow(event, &data, iregs))
> + x86_pmu_stop(event, 0);
> +
> + /* Inject one fake event is enough. */
> + return 1;
> + }
> +
> + return 0;
> +}

2021-01-14 19:14:49

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Mon, Jan 04, 2021, Like Xu wrote:
> 2) Slow path (part 3, patch 0012-0017)
>
> This is when the host assigned physical PMC has a different index
> from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
> In this case, KVM needs to rewrite the PEBS records to change the
> applicable counter indexes to the virtual PMC indexes, which would
> otherwise contain the physical counter index written by PEBS facility,
> and switch the counter reset values to the offset corresponding to
> the physical counter indexes in the DS data structure.
>
> Large PEBS needs to be disabled by KVM rewriting the
> pebs_interrupt_threshold field in DS to only one record in
> the slow path. This is because a guest may implicitly drain PEBS buffer,
> e.g., context switch. KVM doesn't get a chance to update the PEBS buffer.

Are the PEBS record write, PEBS index update, and subsequent PMI atomic with
respect to instruction execution? If not, doesn't this approach still leave a
window where the guest could see the wrong counter?

The virtualization hole is also visible if the guest is reading the PEBS records
from a different vCPU, though I assume no sane kernel does that?

> The physical PMC index will confuse the guest. The difficulty comes
> when multiple events get rescheduled inside the guest. Hence disabling
> large PEBS in this case might be an easy and safe way to keep it correct
> as an initial step here.

2021-01-15 05:23:35

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

Hi Sean,

Thanks for your comments !

On 2021/1/15 3:10, Sean Christopherson wrote:
> On Mon, Jan 04, 2021, Like Xu wrote:
>> 2) Slow path (part 3, patch 0012-0017)
>>
>> This is when the host assigned physical PMC has a different index
>> from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
>> In this case, KVM needs to rewrite the PEBS records to change the
>> applicable counter indexes to the virtual PMC indexes, which would
>> otherwise contain the physical counter index written by PEBS facility,
>> and switch the counter reset values to the offset corresponding to
>> the physical counter indexes in the DS data structure.
>>
>> Large PEBS needs to be disabled by KVM rewriting the
>> pebs_interrupt_threshold field in DS to only one record in
>> the slow path. This is because a guest may implicitly drain PEBS buffer,
>> e.g., context switch. KVM doesn't get a chance to update the PEBS buffer.
> Are the PEBS record write, PEBS index update, and subsequent PMI atomic with
> respect to instruction execution? If not, doesn't this approach still leave a
> window where the guest could see the wrong counter?

First, KVM would limit/rewrite the guest DS pebs_interrupt_threshold to
one record before vm-entry (see patch [PATCH v3 14/17] "KVM: vmx/pmu:
Limit pebs_interrupt_threshold in the guest DS area"), which means that
once a PEBS record is written into the guest PEBS buffer, a PEBS PMI is
generated immediately and thus a vm-exit.

Second, KVM would complete the PEBS record rewriting, the PEBS index
update, and the vPMI injection before the next vm-entry (we deal with
these separately in patches 15-17 for easy review).

After the updated PEBS record(s) are (atomically?) prepared, the guest
is notified via PMI, and there is no window for a vCPU to observe a
half-updated PEBS record across the vm-exit.

> The virtualization hole is also visible if the guest is reading the PEBS records
> from a different vCPU, though I assume no sane kernel does that?

I have checked the guest PEBS driver behavior for Linux and Windows, and
they're sane.

Theoretically, it's true for busy-polling PEBS buffer readers on other
vCPUs; to fix it, forcing all vCPUs to vm-exit is onerous for a
large-size guest, and I don't think you would accept that. Or do we have
a better idea?

In fact, we don't think it's a hole or vulnerability, because the
motivation for correcting the counter index(es) is to help the guest
PEBS reader understand its PEBS records correctly and provide the same
sampling accuracy as the non-cross-mapped case, not to provide a new
attack surface from guest to host.

PeterZ commented on the v1 version and insisted that host perf allows a
guest counter to be assigned a cross-mapped back-end counter. In this
case, the slow path patches (13-17) are introduced to ensure that, from
the guest counter perspective, the PEBS records are also correct. We do
not want these records to be invalid and ignored, which would undermine
the accuracy of PEBS.

In practice, the slow path rarely occurs. As a first step, we propose to
upstream the fast path patches (01-12) with your help. Guest PEBS would
be disabled temporarily when guest PEBS counters are cross-mapped, until
we figure out a satisfactory cross-mapping solution.

---
thx,likexu

>
>> The physical PMC index will confuse the guest. The difficulty comes
>> when multiple events get rescheduled inside the guest. Hence disabling
>> large PEBS in this case might be an easy and safe way to keep it correct
>> as an initial step here.

2021-01-15 05:44:28

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On 2021/1/15 2:55, Sean Christopherson wrote:
> On Mon, Jan 04, 2021, Like Xu wrote:
>> ---
>> arch/x86/events/intel/ds.c | 62 ++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 62 insertions(+)
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index b47cc4226934..c499bdb58373 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1721,6 +1721,65 @@ intel_pmu_save_and_restart_reload(struct perf_event *event, int count)
>> return 0;
>> }
>>
>> +/*
>> + * We may be running with guest PEBS events created by KVM, and the
>> + * PEBS records are logged into the guest's DS and invisible to host.
>> + *
>> + * In the case of guest PEBS overflow, we only trigger a fake event
>> + * to emulate the PEBS overflow PMI for guest PEBS counters in KVM.
>> + * The guest will then vm-entry and check the guest DS area to read
>> + * the guest PEBS records.
>> + *
>> + * The guest PEBS overflow PMI may be dropped when both the guest and
>> + * the host use PEBS. Therefore, KVM will not enable guest PEBS once
>> + * the host PEBS is enabled since it may bring a confusing unknown NMI.
>> + *
>> + * The contents and other behavior of the guest event do not matter.
>> + */
>> +static int intel_pmu_handle_guest_pebs(struct cpu_hw_events *cpuc,
>> + struct pt_regs *iregs,
>> + struct debug_store *ds)
>> +{
>> + struct perf_sample_data data;
>> + struct perf_event *event = NULL;
>> + u64 guest_pebs_idxs = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> + int bit;
>> +
>> + /*
>> + * Ideally, we should check guest DS to understand if it's
>> + * a guest PEBS overflow PMI from guest PEBS counters.
>> + * However, it brings high overhead to retrieve guest DS in host.
>> + * So we check host DS instead for performance.
>> + *
>> + * If PEBS interrupt threshold on host is not exceeded in a NMI, there
>> + * must be a PEBS overflow PMI generated from the guest PEBS counters.
>> + * There is no ambiguity since the reported event in the PMI is guest
>> + * only. It gets handled correctly on a case by case base for each event.
>> + *
>> + * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
> Because the PMU should really be the entity that controls guest vs. host. KVM
> should just be a dumb pipe that handles the mechanics of how values are context
> switch.

The intel_guest_get_msrs() and atomic_switch_perf_msrs()
will work together to disable the co-existence of guest PEBS and host PEBS:

https://lore.kernel.org/kvm/[email protected]/

+

static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
...
    if (nr_msrs > 2 && (msrs[1].guest & msrs[0].guest)) {
        msrs[2].guest = pmu->ds_area;
        if (nr_msrs > 3)
            msrs[3].guest = pmu->pebs_data_cfg;
    }

   for (i = 0; i < nr_msrs; i++)
...

>
> For example, commit 7099e2e1f4d9 ("KVM: VMX: disable PEBS before a guest entry"),
> where KVM does an explicit WRMSR(PEBS_ENABLE) to (attempt to) force PEBS
> quiescence, is flawed in that the PMU can re-enable PEBS after the WRMSR if a
> PMI arrives between the WRMSR and VM-Enter (because VMX can't block NMIs). The
> PMU really needs to be involved in the WRMSR workaround.

Thanks, I will carefully confirm the PEBS quiescence behavior on ICX.
But we're fine to keep "wrmsrl(MSR_IA32_PEBS_ENABLE, 0);" here, since we
will load a new guest value (if any) for this MSR later.

>
>> + */
>> + if (!guest_pebs_idxs || !in_nmi() ||
> Are PEBS updates guaranteed to be isolated in both directions on relevant
> hardware?

I think it's true on the ICX.

> By that I mean, will host updates be fully processed before VM-Enter
> completes, and guest updates before VM-Exit completes?

The situation is more complicated.

> If that's the case,
> then this path could be optimized to change the KVM invocation of the NMI
> handler so that the "is this a guest PEBS PMI" check is done if and only if the
> PMI originated from with the guest.

When we have a PEBS PMI due to guest workload and a vm-exit, the code
path from vm-exit to the host PEBS PMI handler may itself generate a
PEBS PMI and mark the status bit. The current PMI handler can't
distinguish the two and would treat the latter as a suspicious PMI and
output a warning.

This is the main reason why we chose to disable the co-existence of
guest PEBS and host PEBS; future hardware enhancements may remove this
limitation.

---
thx, likexu
>
>> + ds->pebs_index >= ds->pebs_interrupt_threshold)
>> + return 0;
>> +
>> + for_each_set_bit(bit, (unsigned long *)&guest_pebs_idxs,
>> + INTEL_PMC_IDX_FIXED + x86_pmu.num_counters_fixed) {
>> +
>> + event = cpuc->events[bit];
>> + if (!event->attr.precise_ip)
>> + continue;
>> +
>> + perf_sample_data_init(&data, 0, event->hw.last_period);
>> + if (perf_event_overflow(event, &data, iregs))
>> + x86_pmu_stop(event, 0);
>> +
>> + /* Inject one fake event is enough. */
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}

2021-01-15 11:39:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 05/17] KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter

On Mon, Jan 04, 2021 at 09:15:30PM +0800, Like Xu wrote:
> When a guest counter is configured as a PEBS counter through
> IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
> configuring a non-zero precision level in the perf_event_attr.
>
> The guest PEBS overflow PMI bit would be set in the guest
> GLOBAL_STATUS MSR when PEBS facility generates a PEBS
> overflow PMI based on guest IA32_DS_AREA MSR.
>
> The attr.precise_ip would be adjusted to a special precision
> level when the new PEBS-PDIR feature is supported later which
> would affect the host counters scheduling.

This seems like a random collection of changes, all required, but
loosely related.

> The guest PEBS event would not be reused for non-PEBS
> guest event even with the same guest counter index.

/me rolls eyes at the whole destroy+create nonsense...

2021-01-15 12:06:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Thu, Jan 14, 2021 at 11:39:00AM +0800, Xu, Like wrote:

> > Why do we need to? Can't we simply always forward the PMI if the guest
> > has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
> > MSRs at a reasonable rate..
> >
> > Sure, it'll send too many PMIs, but is that really a problem?
>
> More vPMI means more guest irq handler calls and
> more PMI virtualization overhead.

Only if you have both guest and host PEBS. And in that case I really
can't be arsed about some overhead to the guest.

> In addition,
> the correctness of some workloads (RR?) depends on
> the correct number of PMIs and the PMI trigger times
> and virt may not want to break this assumption.

Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
too much code to deal with them.

> > > + * If PEBS interrupt threshold on host is not exceeded in a NMI, there
> > > + * must be a PEBS overflow PMI generated from the guest PEBS counters.
> > > + * There is no ambiguity since the reported event in the PMI is guest
> > > + * only. It gets handled correctly on a case by case base for each event.
> > > + *
> > > + * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> > Where; I need a code reference here.
>
> How about:
>
> Note: KVM will disable the co-existence of guest PEBS and host PEBS.
> In the intel_guest_get_msrs(), when we have host PEBS ctrl bit(s) enabled,
> KVM will clear the guest PEBS ctrl enable bit(s) before vm-entry.
> The guest PEBS users should be notified of this runtime restriction.

Since you had me look at that function, can clean up that
CONFIG_RETPOLINE crud and replace it with static_call() ?

2021-01-15 13:55:46

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 05/17] KVM: x86/pmu: Reprogram guest PEBS event to emulate guest PEBS counter

On 2021/1/15 19:33, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 09:15:30PM +0800, Like Xu wrote:
>> When a guest counter is configured as a PEBS counter through
>> IA32_PEBS_ENABLE, a guest PEBS event will be reprogrammed by
>> configuring a non-zero precision level in the perf_event_attr.
>>
>> The guest PEBS overflow PMI bit would be set in the guest
>> GLOBAL_STATUS MSR when PEBS facility generates a PEBS
>> overflow PMI based on guest IA32_DS_AREA MSR.
>>
>> The attr.precise_ip would be adjusted to a special precision
>> level when the new PEBS-PDIR feature is supported later which
>> would affect the host counters scheduling.
> This seems like a random collection of changes, all required, but
> loosely related.

Yes, these changes are made in the KVM context, and they are all
necessary to emulate basic PEBS hardware behavior.

>
>> The guest PEBS event would not be reused for non-PEBS
>> guest event even with the same guest counter index.

Let me add more KVM context here:

We create a perf_event for a normal non-PEBS counter and reuse the same
perf_event from time to time as much as possible, instead of doing
"create + destroy" of a new perf_event.

So when a normal counter is configured for PEBS, the original perf_event
is not reused and a new PEBS perf_event is created, and vice versa.

> /me rolls eyes at the whole destroy+create nonsense...

I absolutely agree that cross-domain development may make maintainers'
eyes uncomfortable.

My on-demand explanation is always available online if you fire more
questions at this patch set.


2021-01-15 14:33:20

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On 2021/1/15 20:01, Peter Zijlstra wrote:
> On Thu, Jan 14, 2021 at 11:39:00AM +0800, Xu, Like wrote:
>
>>> Why do we need to? Can't we simply always forward the PMI if the guest
>>> has bits set in MSR_IA32_PEBS_ENABLE ? Surely we can access the guest
>>> MSRs at a reasonable rate..
>>>
>>> Sure, it'll send too many PMIs, but is that really a problem?
>> More vPMI means more guest irq handler calls and
>> more PMI virtualization overhead.
> Only if you have both guest and host PEBS. And in that case I really
> can't be arsed about some overhead to the guest.

Less overhead makes everyone happier.

Ah, can I assume that you're fine with disabling the co-existence of
guest PEBS and host PEBS as the first upstream step?

>
>> In addition,
>> the correctness of some workloads (RR?) depends on
>> the correct number of PMIs and the PMI trigger times
>> and virt may not want to break this assumption.
> Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
> too much code to deal with them.

https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/

In the rr workload, the commit making "the PMI interrupts in skid region
should be dropped" is reverted since some users complained that:

> It seems to me that it might be reasonable to ignore the interrupt if
> the purpose of the interrupt is to trigger sampling of the CPUs
> register state. But if the interrupt will trigger some other
> operation, such as a signal on an fd, then there's no reason to drop
> it.

I assume that if dropping a PMI is unacceptable, injecting a spurious
PMI will be too.

I'm pretty open if you insist that we really need to do this for guest PEBS
enabling.

>
>>>> + * If PEBS interrupt threshold on host is not exceeded in a NMI, there
>>>> + * must be a PEBS overflow PMI generated from the guest PEBS counters.
>>>> + * There is no ambiguity since the reported event in the PMI is guest
>>>> + * only. It gets handled correctly on a case by case base for each event.
>>>> + *
>>>> + * Note: KVM disables the co-existence of guest PEBS and host PEBS.
>>> Where; I need a code reference here.
>> How about:
>>
>> Note: KVM will disable the co-existence of guest PEBS and host PEBS.
>> In the intel_guest_get_msrs(), when we have host PEBS ctrl bit(s) enabled,
>> KVM will clear the guest PEBS ctrl enable bit(s) before vm-entry.
>> The guest PEBS users should be notified of this runtime restriction.
> Since you had me look at that function, can clean up that
> CONFIG_RETPOLINE crud and replace it with static_call() ?

Sure. Let me try it.

---
thx, likexu

2021-01-15 14:48:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Fri, Jan 15, 2021 at 10:30:13PM +0800, Xu, Like wrote:

> > Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
> > too much code to deal with them.
>
> https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/
>
> In the rr workload, the commit change "the PMI interrupts in skid region
> should be dropped"
> is reverted since some users complain that:
>
> > It seems to me that it might be reasonable to ignore the interrupt if
> > the purpose of the interrupt is to trigger sampling of the CPUs
> > register state. But if the interrupt will trigger some other
> > operation, such as a signal on an fd, then there's no reason to drop
> > it.
>
> I assume that if dropping a PMI is unacceptable, injecting a spurious
> PMI will be too.
>
> I'm pretty open if you insist that we really need to do this for guest PEBS
> enabling.

That was an entirely different issue. We were dropping events on the
floor because they'd passed priv boundaries. So there was an actual
event, and we made it go away.

What we're talking about here is raising an PMI with BUFFER_OVF set,
even if the DS is empty. That should really be harmless. We'll take the
PMI, find there's nothing there, and do nothing.

2021-01-15 14:50:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 06/17] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS

On Mon, Jan 04, 2021 at 09:15:31PM +0800, Like Xu wrote:

> + if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
> + arr[1].msr = MSR_IA32_PEBS_ENABLE;
> + arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
> + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
> + /*
> + * The guest PEBS will be disabled once the host PEBS is enabled
> + * since the both-enabled case may bring an unknown PMI to
> + * confuse the host, and the guest PEBS overflow PMI would be missed.
> + */
> + if (arr[1].host)
> + arr[1].guest = 0;
> + arr[0].guest |= arr[1].guest;
> + *nr = 2;

Elsewhere you write:

> When we have a PEBS PMI due to guest workload and vm-exits,
> the code path from vm-exit to the host PEBS PMI handler may also
> bring PEBS PMI and mark the status bit. The current PMI handler
> can't distinguish them and would treat the later one as a suspicious
> PMI and output a warning.

So the reason isn't that spurious PMIs are tedious, but that the
hardware is actually doing something weird.

Or am I not reading things straight?

2021-01-15 15:14:55

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On 2021/1/15 22:44, Peter Zijlstra wrote:
> On Fri, Jan 15, 2021 at 10:30:13PM +0800, Xu, Like wrote:
>
>>> Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
>>> too much code to deal with them.
>> https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/
>>
>> In the rr workload, the commit change "the PMI interrupts in skid region
>> should be dropped"
>> is reverted since some users complain that:
>>
>>> It seems to me that it might be reasonable to ignore the interrupt if
>>> the purpose of the interrupt is to trigger sampling of the CPUs
>>> register state. But if the interrupt will trigger some other
>>> operation, such as a signal on an fd, then there's no reason to drop
>>> it.
>> I assume that if dropping a PMI is unacceptable, injecting a spurious
>> PMI will be too.
>>
>> I'm pretty open if you insist that we really need to do this for guest PEBS
>> enabling.
> That was an entirely different issue. We were dropping events on the
> floor because they'd passed priv boundaries. So there was an actual
> event, and we made it go away.

Thanks for your clarification and support.

> What we're talking about here is raising an PMI with BUFFER_OVF set,
> even if the DS is empty. That should really be harmless. We'll take the
> PMI, find there's nothing there, and do nothing.

The only harm is confusing the guest PEBS user about the behavior of
pebs_interrupt_threshold.

Now that KVM has to break it due to the cross-mapping issue, let me
implement this idea in the next version with relevant performance data.


2021-01-15 15:32:27

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 06/17] KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS

On 2021/1/15 22:46, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 09:15:31PM +0800, Like Xu wrote:
>
>> + if (cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask) {
>> + arr[1].msr = MSR_IA32_PEBS_ENABLE;
>> + arr[1].host = cpuc->pebs_enabled & ~cpuc->intel_ctrl_guest_mask;
>> + arr[1].guest = cpuc->pebs_enabled & ~cpuc->intel_ctrl_host_mask;
>> + /*
>> + * The guest PEBS will be disabled once the host PEBS is enabled
>> + * since the both-enabled case may bring an unknown PMI to
>> + * confuse the host, and the guest PEBS overflow PMI would be missed.
>> + */
>> + if (arr[1].host)
>> + arr[1].guest = 0;
>> + arr[0].guest |= arr[1].guest;
>> + *nr = 2;
> Elsewhere you write:
>
>> When we have a PEBS PMI due to guest workload and vm-exits,
>> the code path from vm-exit to the host PEBS PMI handler may also
>> bring PEBS PMI and mark the status bit. The current PMI handler
>> can't distinguish them and would treat the later one as a suspicious
>> PMI and output a warning.
> So the reason isn't that spurious PMIs are tedious, but that the
> hardware is actually doing something weird.
>
> Or am I not reading things straight?

I think the PEBS facility works as expected, because in the both-enabled
case the first PEBS PMI is generated from host counter 1 based on the
guest interrupt_threshold, and the later PEBS PMI could be generated
from host counter 2 based on the host interrupt_threshold.

Therefore, if we adjust the overflow value to a small value, the number
of instructions executed from vm-exit to global ctrl disabling can be
big enough to trigger a PEBS PMI.

Do you think this is weird, or do you see other possibilities?

2021-01-15 17:45:10

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Fri, Jan 15, 2021, Xu, Like wrote:
> On 2021/1/15 2:55, Sean Christopherson wrote:
> > On Mon, Jan 04, 2021, Like Xu wrote:
> > > + * Note: KVM disables the co-existence of guest PEBS and host PEBS.
> > By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
> > Because the PMU should really be the entity that controls guest vs. host. KVM
> > should just be a dumb pipe that handles the mechanics of how values are context
> > switch.
>
> The intel_guest_get_msrs() and atomic_switch_perf_msrs()
> will work together to disable the co-existence of guest PEBS and host PEBS:
>
> https://lore.kernel.org/kvm/[email protected]/
>
> +
>
> static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> ...
>     if (nr_msrs > 2 && (msrs[1].guest & msrs[0].guest)) {
>         msrs[2].guest = pmu->ds_area;
>         if (nr_msrs > 3)
>             msrs[3].guest = pmu->pebs_data_cfg;
>     }
>
>     for (i = 0; i < nr_msrs; i++)
> ...

Yeah, that's exactly what I'm complaining about. Splitting the logic for
determining the guest values is unnecessarily confusing, and as evidenced by the
PEBS_ENABLE bug, potentially fragile. Perf should have full knowledge and
control of what values are loaded for the guest. And, the above indexing magic
is nigh impossible to follow and _super_ fragile.

If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it can
generate the full set of guest values by grabbing ds_area and pebs_data_cfg.
Alternatively, .guest_get_msrs() could take the desired guest MSR values
directly (ds_area and pebs_data_cfg), but kvm_pmu is vendor agnostic, so I don't
see any reason to not just pass the pointer.

2021-01-15 18:01:29

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Fri, Jan 15, 2021, Xu, Like wrote:
> Hi Sean,
>
> Thanks for your comments !
>
> On 2021/1/15 3:10, Sean Christopherson wrote:
> > On Mon, Jan 04, 2021, Like Xu wrote:
> > > 2) Slow path (part 3, patch 0012-0017)
> > >
> > > This is when the host assigned physical PMC has a different index
> > > from the virtual PMC (e.g. using physical PMC1 to emulate virtual PMC0)
> > > In this case, KVM needs to rewrite the PEBS records to change the
> > > applicable counter indexes to the virtual PMC indexes, which would
> > > otherwise contain the physical counter index written by PEBS facility,
> > > and switch the counter reset values to the offset corresponding to
> > > the physical counter indexes in the DS data structure.
> > >
> > > Large PEBS needs to be disabled by KVM rewriting the
> > > pebs_interrupt_threshold field in DS to only one record in
> > > the slow path. This is because a guest may implicitly drain PEBS buffer,
> > > e.g., context switch. KVM doesn't get a chance to update the PEBS buffer.
> > Are the PEBS record write, PEBS index update, and subsequent PMI atomic with
> > respect to instruction execution? If not, doesn't this approach still leave a
> > window where the guest could see the wrong counter?
>
> First, KVM would limit/rewrite guest DS pebs_interrupt_threshold to one
> record before vm-entry,
> (see patch [PATCH v3 14/17] KVM: vmx/pmu: Limit pebs_interrupt_threshold in
> the guest DS area)
> which means once a PEBS record is written into the guest pebs buffer,
> a PEBS PMI will be generated immediately and thus vm-exit.

I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
guaranteed to be atomic?

In practice, under what scenarios will guest counters get cross-mapped? And,
how does this support affect guest accuracy? I.e. how bad do things get for the
guest if we simply disable guest counters if they can't have a 1:1 association
with their physical counter?

2021-01-15 18:52:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

> I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
> guaranteed to be atomic?

Of course not.
>
> In practice, under what scenarios will guest counters get cross-mapped? And,
> how does this support affect guest accuracy? I.e. how bad do things get for the
> guest if we simply disable guest counters if they can't have a 1:1 association
> with their physical counter?

This would completely break perfmon for the guest, likely with no way to
recover.

-Andi

2021-01-15 18:54:41

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Fri, Jan 15, 2021, Andi Kleen wrote:
> > I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
> > guaranteed to be atomic?
>
> Of course not.

So there's still a window where the guest could observe the bad counter index,
correct?

2021-01-15 19:13:01

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
> > > guaranteed to be atomic?
> >
> > Of course not.
>
> So there's still a window where the guest could observe the bad counter index,
> correct?

Yes.

But with single record PEBS it doesn't really matter with normal perfmon
drivers.

-Andi

2021-01-22 05:51:37

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On 2021/1/16 1:42, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Xu, Like wrote:
>> On 2021/1/15 2:55, Sean Christopherson wrote:
>>> On Mon, Jan 04, 2021, Like Xu wrote:
>>>> + * Note: KVM disables the co-existence of guest PEBS and host PEBS.
>>> By "KVM", do you mean KVM's loading of the MSRs provided by intel_guest_get_msrs()?
>>> Because the PMU should really be the entity that controls guest vs. host. KVM
>>> should just be a dumb pipe that handles the mechanics of how values are context
>>> switch.
>>
>> The intel_guest_get_msrs() and atomic_switch_perf_msrs()
>> will work together to disable the co-existence of guest PEBS and host PEBS:
>>
>> https://lore.kernel.org/kvm/[email protected]/
>>
>> +
>>
>> static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
>> ...
>>     if (nr_msrs > 2 && (msrs[1].guest & msrs[0].guest)) {
>>         msrs[2].guest = pmu->ds_area;
>>         if (nr_msrs > 3)
>>             msrs[3].guest = pmu->pebs_data_cfg;
>>     }
>>
>>    for (i = 0; i < nr_msrs; i++)
>> ...
>
> Yeah, that's exactly what I'm complaining about. Splitting the logic for
> determining the guest values is unnecessarily confusing, and as evidenced by the
> PEBS_ENABLE bug, potentially fragile. Perf should have full knowledge and
> control of what values are loaded for the guest. And, the above indexing magic
> is nigh impossible to follow and _super_ fragile.

Thanks for pointing this out.

>
> If we change .guest_get_msrs() to take a struct kvm_pmu pointer, then it can
> generate the full set of guest values by grabbing ds_area and pebs_data_cfg.
> Alternatively, .guest_get_msrs() could take the desired guest MSR values
> directly (ds_area and pebs_data_cfg), but kvm_pmu is vendor agnostic, so I don't
> see any reason to not just pass the pointer.

Hi Peter,

What do you think of us passing a "struct kvm_pmu" pointer (defined in
arch/x86/include/asm/kvm_host.h) to guest_get_msrs(int *nr) ?

---
thx,likexu

2021-01-22 10:01:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
> > > guaranteed to be atomic?
> >
> > Of course not.
>
> So there's still a window where the guest could observe the bad counter index,
> correct?

Guest could do a hypercall to fix up the DS area before it tries to read
it I suppose. Or the HV could expose the index mapping and have the
guest fix it up.

Adding a little virt crud on top shouldn't be too hard.

2021-01-25 02:46:35

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

+ [email protected]

Hi Liuxiangdong,

On 2021/1/22 18:02, Liuxiangdong (Aven, Cloud Infrastructure Service
Product Dept.) wrote:
> Hi Like,
>
> Some questions about
> https://lore.kernel.org/kvm/[email protected]/ <https://lore.kernel.org/kvm/[email protected]/>

Thanks for trying the PEBS feature in the guest,
and I assume you have correctly applied the QEMU patches for guest PEBS.

>
> 1)Test in IceLake

In the [PATCH v3 10/17] KVM: x86/pmu: Expose CPUIDs feature bits PDCM, DS,
DTES64, we only support Ice Lake with the following x86_model(s):

#define INTEL_FAM6_ICELAKE_X 0x6A
#define INTEL_FAM6_ICELAKE_D 0x6C

you can check the eax output of "cpuid -l 1 -1 -r";
for example, "0x000606a4" meets this requirement.

>
> HOST:
>
> CPU family:                      6
>
> Model:                           106
>
> Model name:                      Intel(R) Xeon(R) Platinum 8378A CPU
>
> microcode: sig=0x606a6, pf=0x1, revision=0xd000122

As long as you get the latest BIOS from the provider,
you may check 'cat /proc/cpuinfo | grep code | uniq' with the latest one.

>
> Guest:  linux kernel 5.11.0-rc2

I assume it's the "upstream tag v5.11-rc2" which is fine.

>
> We can find pebs/intel_pt flag in guest cpuinfo, but there still exists
> error when we use perf

Just a note: intel_pt and pebs are two separate features, and we can write
pebs records to the intel_pt buffer with extra hardware support
(by default, pebs records are written to the pebs buffer).

You may check the output of "dmesg | grep PEBS" in the guest
to see if the guest PEBS cpuinfo is exposed, and use "perf record
-e cycles:pp" to see if the PEBS feature actually works in the guest.

>
> # perf record –e cycles:pp
>
> Error:
>
> cycles:pp: PMU Hardware doesn’t support sampling/overflow-interrupts. Try
> ‘perf stat’
>
> Could you give some advice?

If you have more specific comments or any concerns, just let me know.

>
> 2)Test in Skylake
>
> HOST:
>
> CPU family:                      6
>
> Model:                           85
>
> Model name:                      Intel(R) Xeon(R) Gold 6146 CPU @
>
>                                   3.20GHz
>
> microcode        : 0x2000064
>
> Guest: linux 4.18
>
> we cannot find intel_pt flag in guest cpuinfo because
> cpu_has_vmx_intel_pt() return false.

You may check vmx_pebs_supported().

>
> SECONDARY_EXEC_PT_USE_GPA/VM_EXIT_CLEAR_IA32_RTIT_CTL/VM_ENTRY_LOAD_IA32_RTIT_CTL
> are all disabled.
>
> Is it because microcode is not supported?
>
> And, is there a new microcode which can support these bits? How can we get this?

Currently, this patch set doesn't support guest PEBS on the Skylake
platforms, and if we choose to support it, we will let you know.

---
thx,likexu

>
> Thanks,
>
> Liuxiangdong
>

2021-01-25 10:47:09

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

Hi Peter,

On 2021/1/22 17:56, Peter Zijlstra wrote:
> On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
>> On Fri, Jan 15, 2021, Andi Kleen wrote:
>>>> I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
>>>> guaranteed to be atomic?
>>>
>>> Of course not.
>>
>> So there's still a window where the guest could observe the bad counter index,
>> correct?
>
> Guest could do a hypercall to fix up the DS area before it tries to read
> it I suppose. Or the HV could expose the index mapping and have the
> guest fix it up.

A weird (malicious) guest could read unmodified PEBS records in the
guest PEBS buffer from other vCPUs without needing a hypercall or
index mapping from the HV.

Do you see any security issues on this host index leak window?

>
> Adding a little virt crud on top shouldn't be too hard.
>

Patches 13-17 in this version modify the guest PEBS buffer
to correct the index mapping information in the guest PEBS records.

---
thx,likexu

2021-01-25 13:07:08

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On 2021/1/25 20:18, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 08:07:06PM +0800, Xu, Like wrote:
>
>> So under the premise that counter cross-mapping is allowed,
>> how can hypercall help fix it ?
> Hypercall or otherwise exposing the mapping, will let the guest fix it
> up when it already touches the data. Which avoids the host from having
> to access the guest memory and is faster, no?
- as you may know, the mapping table can change rapidly between the time
the records are rewritten and the time they are read;

- the patches modify the records before the guest is notified via PMI,
which makes the rewrite transparent to normal guests (including Windows);

- a malicious guest could simply ignore the exposed mapping and the
hypercall, so I don't think that approach solves the leakage issue at all;

- making the guest aware of the hypercall or mapping requires more code changes
on the guest side; for now we can handle it on the KVM side, and we also know
the cross-mapping case rarely happens; the overhead is acceptable based on
our tests.

Please let me know if you or Sean are not going to
buy into the PEBS records rewrite proposal in patches 13-17.

---
thx,likexu

2021-01-25 14:58:05

by Liuxiangdong

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

Thanks for replying,

On 2021/1/25 10:41, Like Xu wrote:
> + [email protected]
>
> Hi Liuxiangdong,
>
> On 2021/1/22 18:02, Liuxiangdong (Aven, Cloud Infrastructure Service
> Product Dept.) wrote:
>> Hi Like,
>>
>> Some questions about
>> https://lore.kernel.org/kvm/[email protected]/
>> <https://lore.kernel.org/kvm/[email protected]/>
>>
>
> Thanks for trying the PEBS feature in the guest,
> and I assume you have correctly applied the QEMU patches for guest PEBS.
>
Is there any other patch that needs to be applied? I use qemu 5.2.0
(downloaded from github on January 14th).

>> 1)Test in IceLake
>
> In the [PATCH v3 10/17] KVM: x86/pmu: Expose CPUIDs feature bits PDCM,
> DS, DTES64, we only support Ice Lake with the following x86_model(s):
>
> #define INTEL_FAM6_ICELAKE_X 0x6A
> #define INTEL_FAM6_ICELAKE_D 0x6C
>
> you can check the eax output of "cpuid -l 1 -1 -r",
> for example "0x000606a4" meets this requirement.
It's INTEL_FAM6_ICELAKE_X
cpuid -l 1 -1 -r

CPU:
0x00000001 0x00: eax=0x000606a6 ebx=0xb4800800 ecx=0x7ffefbf7
edx=0xbfebfbff

>>
>> HOST:
>>
>> CPU family: 6
>>
>> Model: 106
>>
>> Model name: Intel(R) Xeon(R) Platinum 8378A CPU
>>
>> microcode: sig=0x606a6, pf=0x1, revision=0xd000122
>
> As long as you get the latest BIOS from the provider,
> you may check 'cat /proc/cpuinfo | grep code | uniq' with the latest one.
OK. I'll do it later.
>
>>
>> Guest: linux kernel 5.11.0-rc2
>
> I assume it's the "upstream tag v5.11-rc2" which is fine.
Yes.
>
>>
>> We can find pebs/intel_pt flag in guest cpuinfo, but there still
>> exists error when we use perf
>
> Just a note, intel_pt and pebs are two features and we can write
> pebs records to intel_pt buffer with extra hardware support.
> (by default, pebs records are written to the pebs buffer)
>
> You may check the output of "dmesg | grep PEBS" in the guest
> to see if the guest PEBS cpuinfo is exposed and use "perf record
> –e cycles:pp" to see if PEBS feature actually works in the guest.

I applied only the pebs patch set to linux kernel 5.11.0-rc2, tested perf in
the guest, and dumped the stack when it returned -EOPNOTSUPP

(1)
# perf record -e instructions:pp
Error:
instructions:pp: PMU Hardware doesn't support
sampling/overflow-interrupts. Try 'perf stat'

[ 117.793266] Call Trace:
[ 117.793270] dump_stack+0x57/0x6a
[ 117.793275] intel_pmu_setup_lbr_filter+0x137/0x190
[ 117.793280] intel_pmu_hw_config+0x18b/0x320
[ 117.793288] hsw_hw_config+0xe/0xa0
[ 117.793290] x86_pmu_event_init+0x8e/0x210
[ 117.793293] perf_try_init_event+0x40/0x130
[ 117.793297] perf_event_alloc.part.22+0x611/0xde0
[ 117.793299] ? alloc_fd+0xba/0x180
[ 117.793302] __do_sys_perf_event_open+0x1bd/0xd90
[ 117.793305] do_syscall_64+0x33/0x40
[ 117.793308] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Do we need lbr when we use pebs?

I tried to apply the lbr patch
set (https://lore.kernel.org/kvm/[email protected]/)
to the kernel and qemu, but there is still another problem.
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event
...

(2)
# perf record -e instructions:ppp
Error:
instructions:ppp: PMU Hardware doesn't support
sampling/overflow-interrupts. Try 'perf stat'

[ 115.188498] Call Trace:
[ 115.188503] dump_stack+0x57/0x6a
[ 115.188509] x86_pmu_hw_config+0x1eb/0x220
[ 115.188515] intel_pmu_hw_config+0x13/0x320
[ 115.188519] hsw_hw_config+0xe/0xa0
[ 115.188521] x86_pmu_event_init+0x8e/0x210
[ 115.188524] perf_try_init_event+0x40/0x130
[ 115.188528] perf_event_alloc.part.22+0x611/0xde0
[ 115.188530] ? alloc_fd+0xba/0x180
[ 115.188534] __do_sys_perf_event_open+0x1bd/0xd90
[ 115.188538] do_syscall_64+0x33/0x40
[ 115.188541] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is because x86_pmu.intel_cap.pebs_format is always 0 in
x86_pmu_max_precise().

We rdmsr MSR_IA32_PERF_CAPABILITIES (0x00000345) from the HOST; it's 0xf4c5.
From the guest, it's 0x2000.

>>
>> # perf record –e cycles:pp
>>
>> Error:
>>
>> cycles:pp: PMU Hardware doesn’t support sampling/overflow-interrupts.
>> Try ‘perf stat’
>>
>> Could you give some advice?
>
> If you have more specific comments or any concerns, just let me know.
>
>>
>> 2)Test in Skylake
>>
>> HOST:
>>
>> CPU family: 6
>>
>> Model: 85
>>
>> Model name: Intel(R) Xeon(R) Gold 6146 CPU @
>>
>> 3.20GHz
>>
>> microcode : 0x2000064
>>
>> Guest: linux 4.18
>>
>> we cannot find intel_pt flag in guest cpuinfo because
>> cpu_has_vmx_intel_pt() return false.
>
> You may check vmx_pebs_supported().
It's true.
>
>>
>> SECONDARY_EXEC_PT_USE_GPA/VM_EXIT_CLEAR_IA32_RTIT_CTL/VM_ENTRY_LOAD_IA32_RTIT_CTL
>> are all disabled.
>>
>> Is it because microcode is not supported?
>>
>> And, is there a new microcode which can support these bits? How can we
>> get this?
>
> Currently, this patch set doesn't support guest PEBS on the Skylake
> platforms, and if we choose to support it, we will let you know.
>
And now, we want to use pebs on Skylake. If we develop based on the pebs
patch set, do you have any suggestions?
I think microcode requirements need to be satisfied. Can we use
https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files ?

> ---
> thx,likexu
>
>>
>> Thanks,
>>
>> Liuxiangdong
>>
>
Thanks. Liuxiangdong

2021-01-26 01:30:51

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On 2021/1/25 19:13, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 04:08:22PM +0800, Like Xu wrote:
>> Hi Peter,
>>
>> On 2021/1/22 17:56, Peter Zijlstra wrote:
>>> On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
>>>> On Fri, Jan 15, 2021, Andi Kleen wrote:
>>>>>> I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
>>>>>> guaranteed to be atomic?
>>>>> Of course not.
>>>> So there's still a window where the guest could observe the bad counter index,
>>>> correct?
>>> Guest could do a hypercall to fix up the DS area before it tries to read
>>> it I suppose. Or the HV could expose the index mapping and have the
>>> guest fix it up.
>> A weird (malicious) guest would read unmodified PEBS records in the
>> guest PEBS buffer from other vCPUs without the need for hypercall or
>> index mapping from HV.
>>
>> Do you see any security issues on this host index leak window?
>>
>>> Adding a little virt crud on top shouldn't be too hard.
>>>
>> The patches 13-17 in this version has modified the guest PEBS buffer
>> to correct the index mapping information in the guest PEBS records.
> Right, but given there is no atomicity between writing the DS area and
> triggering the PMI (as already established earlier in this thread), a
> malicious guest can already access this information, no?
>

So under the premise that counter cross-mapping is allowed,
how can a hypercall help fix it?

Personally, I think it is acceptable at the moment, and
no security issues based on this have been identified so far.

2021-01-26 05:09:46

by Like Xu

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

Hi Peter,

On 2021/1/15 22:44, Peter Zijlstra wrote:
> On Fri, Jan 15, 2021 at 10:30:13PM +0800, Xu, Like wrote:
>
>>> Are you sure? Spurious NMI/PMIs are known to happen anyway. We have far
>>> too much code to deal with them.
>>
>> https://lore.kernel.org/lkml/20170628130748.GI5981@leverpostej/T/
>>
>> In the rr workload, the commit "the PMI interrupts in skid region
>> should be dropped"
>> was reverted since some users complained that:
>>
>>> It seems to me that it might be reasonable to ignore the interrupt if
>>> the purpose of the interrupt is to trigger sampling of the CPUs
>>> register state. But if the interrupt will trigger some other
>>> operation, such as a signal on an fd, then there's no reason to drop
>>> it.
>>
>> I assume that if the PMI drop is unacceptable, so is spurious PMI
>> injection.
>>
>> I'm pretty open if you insist that we really need to do this for guest PEBS
>> enabling.
>
> That was an entirely different issue. We were dropping events on the
> floor because they'd passed priv boundaries. So there was an actual
> event, and we made it go away.
>
> What we're talking about here is raising an PMI with BUFFER_OVF set,
> even if the DS is empty. That should really be harmless. We'll take the
> PMI, find there's nothing there, and do nothing.
>

In the case where host and guest PEBS are both enabled,
we'll get crazy dmesg *bombing* with spurious PMI warnings
if we pass the host PEBS PMI "harmlessly" to the guest:

[11261.502536] Uhhuh. NMI received for unknown reason 2c on CPU 36.
[11261.502539] Do you have a strange power saving mode enabled?
[11261.502541] Dazed and confused, but trying to continue

Legacy guest users may be very confused and dissatisfied with that.

I'm double-checking with you whether it's acceptable to take the proposal
"disable the co-existence of guest PEBS and host PEBS" as the first
step upstream, and enable both host and guest PEBS in the near future.

---
thx,likexu

2021-01-26 06:08:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Mon, Jan 25, 2021 at 04:08:22PM +0800, Like Xu wrote:
> Hi Peter,
>
> On 2021/1/22 17:56, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> > > On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > > > I'm asking about ucode/hardware. Is the "guest pebs buffer write -> PEBS PMI"
> > > > > guaranteed to be atomic?
> > > >
> > > > Of course not.
> > >
> > > So there's still a window where the guest could observe the bad counter index,
> > > correct?
> >
> > Guest could do a hypercall to fix up the DS area before it tries to read
> > it I suppose. Or the HV could expose the index mapping and have the
> > guest fix it up.
>
> A weird (malicious) guest would read unmodified PEBS records in the
> guest PEBS buffer from other vCPUs without the need for hypercall or
> index mapping from HV.
>
> Do you see any security issues on this host index leak window?
>
> >
> > Adding a little virt crud on top shouldn't be too hard.
> >
>
> The patches 13-17 in this version has modified the guest PEBS buffer
> to correct the index mapping information in the guest PEBS records.

Right, but given there is no atomicity between writing the DS area and
triggering the PMI (as already established earlier in this thread), a
malicious guest can already access this information, no?

2021-01-26 19:50:05

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On 2021/1/25 22:47, Liuxiangdong (Aven, Cloud Infrastructure Service
Product Dept.) wrote:
> Thanks for replying,
>
> On 2021/1/25 10:41, Like Xu wrote:
>> + [email protected]
>>
>> Hi Liuxiangdong,
>>
>> On 2021/1/22 18:02, Liuxiangdong (Aven, Cloud Infrastructure Service
>> Product Dept.) wrote:
>>> Hi Like,
>>>
>>> Some questions about
>>> https://lore.kernel.org/kvm/[email protected]/
>>> <https://lore.kernel.org/kvm/[email protected]/>
>>>
>>
>> Thanks for trying the PEBS feature in the guest,
>> and I assume you have correctly applied the QEMU patches for guest PEBS.
>>
> Is there any other patch that needs to be applied? I use qemu 5.2.0.
> (download from github on January 14th)

Two qemu patches are attached against the qemu tree
(commit 31ee895047bdcf7387e3570cbd2a473c6f744b08);
apply them and then run the guest with "-cpu,pebs=true".

Note, these two patches are just for testing and not finalized for qemu upstream.

>
>>> 1)Test in IceLake
>>
>> In the [PATCH v3 10/17] KVM: x86/pmu: Expose CPUIDs feature bits PDCM,
>> DS, DTES64, we only support Ice Lake with the following x86_model(s):
>>
>> #define INTEL_FAM6_ICELAKE_X        0x6A
>> #define INTEL_FAM6_ICELAKE_D        0x6C
>>
>> you can check the eax output of "cpuid -l 1 -1 -r",
>> for example "0x000606a4" meets this requirement.
> It's INTEL_FAM6_ICELAKE_X

Yes, it's the target hardware.

> cpuid -l 1 -1 -r
>
> CPU:
>    0x00000001 0x00: eax=0x000606a6 ebx=0xb4800800 ecx=0x7ffefbf7
> edx=0xbfebfbff
>
>>>
>>> HOST:
>>>
>>> CPU family:                      6
>>>
>>> Model:                           106
>>>
>>> Model name:                      Intel(R) Xeon(R) Platinum 8378A CPU
>>>
>>> microcode: sig=0x606a6, pf=0x1, revision=0xd000122
>>
>> As long as you get the latest BIOS from the provider,
>> you may check 'cat /proc/cpuinfo | grep code | uniq' with the latest one.
> OK. I'll do it later.
>>
>>>
>>> Guest:  linux kernel 5.11.0-rc2
>>
>> I assume it's the "upstream tag v5.11-rc2" which is fine.
> Yes.
>>
>>>
>>> We can find pebs/intel_pt flag in guest cpuinfo, but there still exists
>>> error when we use perf
>>
>> Just a note, intel_pt and pebs are two features and we can write
>> pebs records to intel_pt buffer with extra hardware support.
>> (by default, pebs records are written to the pebs buffer)
>>
>> You may check the output of "dmesg | grep PEBS" in the guest
>> to see if the guest PEBS cpuinfo is exposed and use "perf record
>> –e cycles:pp" to see if PEBS feature actually  works in the guest.
>
> I apply only pebs patch set to linux kernel 5.11.0-rc2, test perf in
> guest and dump stack when return -EOPNOTSUPP

Yes, you may apply the qemu patches and try it again.

>
> (1)
> # perf record -e instructions:pp
> Error:
> instructions:pp: PMU Hardware doesn't support
> sampling/overflow-interrupts. Try 'perf stat'
>
> [  117.793266] Call Trace:
> [  117.793270]  dump_stack+0x57/0x6a
> [  117.793275]  intel_pmu_setup_lbr_filter+0x137/0x190
> [  117.793280]  intel_pmu_hw_config+0x18b/0x320
> [  117.793288]  hsw_hw_config+0xe/0xa0
> [  117.793290]  x86_pmu_event_init+0x8e/0x210
> [  117.793293]  perf_try_init_event+0x40/0x130
> [  117.793297]  perf_event_alloc.part.22+0x611/0xde0
> [  117.793299]  ? alloc_fd+0xba/0x180
> [  117.793302]  __do_sys_perf_event_open+0x1bd/0xd90
> [  117.793305]  do_syscall_64+0x33/0x40
> [  117.793308]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Do we need lbr when we use pebs?

No, lbr and pebs are two separate features and we enable them separately.

>
> I tried to apply lbr patch
> set(https://lore.kernel.org/kvm/[email protected]/)
> to kernel and qemu, but there is still other problem.
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
> event
> ...

We don't need that patch for PEBS feature.

>
> (2)
> # perf record -e instructions:ppp
> Error:
> instructions:ppp: PMU Hardware doesn't support
> sampling/overflow-interrupts. Try 'perf stat'
>
> [  115.188498] Call Trace:
> [  115.188503]  dump_stack+0x57/0x6a
> [  115.188509]  x86_pmu_hw_config+0x1eb/0x220
> [  115.188515]  intel_pmu_hw_config+0x13/0x320
> [  115.188519]  hsw_hw_config+0xe/0xa0
> [  115.188521]  x86_pmu_event_init+0x8e/0x210
> [  115.188524]  perf_try_init_event+0x40/0x130
> [  115.188528]  perf_event_alloc.part.22+0x611/0xde0
> [  115.188530]  ? alloc_fd+0xba/0x180
> [  115.188534]  __do_sys_perf_event_open+0x1bd/0xd90
> [  115.188538]  do_syscall_64+0x33/0x40
> [  115.188541]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This is because x86_pmu.intel_cap.pebs_format is always 0 in
> x86_pmu_max_precise().
>
> We rdmsr MSR_IA32_PERF_CAPABILITIES(0x00000345)  from HOST, it's f4c5.
> From guest, it's 2000
>
>>>
>>> # perf record –e cycles:pp
>>>
>>> Error:
>>>
>>> cycles:pp: PMU Hardware doesn’t support sampling/overflow-interrupts.
>>> Try ‘perf stat’
>>>
>>> Could you give some advice?
>>
>> If you have more specific comments or any concerns, just let me know.
>>
>>>
>>> 2)Test in Skylake
>>>
>>> HOST:
>>>
>>> CPU family:                      6
>>>
>>> Model:                           85
>>>
>>> Model name:                      Intel(R) Xeon(R) Gold 6146 CPU @
>>>
>>>                                    3.20GHz
>>>
>>> microcode        : 0x2000064
>>>
>>> Guest: linux 4.18
>>>
>>> we cannot find intel_pt flag in guest cpuinfo because
>>> cpu_has_vmx_intel_pt() return false.
>>
>> You may check vmx_pebs_supported().
> It's true.
>>
>>>
>>> SECONDARY_EXEC_PT_USE_GPA/VM_EXIT_CLEAR_IA32_RTIT_CTL/VM_ENTRY_LOAD_IA32_RTIT_CTL
>>> are all disabled.
>>>
>>> Is it because microcode is not supported?
>>>
>>> And, is there a new microcode which can support these bits? How can we
>>> get this?
>>
>> Currently, this patch set doesn't support guest PEBS on the Skylake
>> platforms, and if we choose to support it, we will let you know.
>>
> And now, we want to use pebs in skylake. If we develop based on pebs
> patch set, do you have any suggestions?

- At least you need to pin guest memory, e.g. with "-overcommit mem-lock=true"
for qemu.
- You may need to rewrite patches 13-17 specifically for Skylake because the
record format is different from Ice Lake's.
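Putting the two qemu hints from this thread together, a launch line might look roughly like the following (a sketch only: the "pebs=true" cpu property comes from the attached test patches, not upstream qemu, and the exact -cpu syntax, image path, and memory size are placeholders):

```shell
# Sketch: pin guest memory and expose the PEBS cpu property from the
# test patches; adjust the disk image and memory size to your setup.
qemu-system-x86_64 \
    -enable-kvm \
    -overcommit mem-lock=true \
    -cpu host,pebs=true \
    -m 4G \
    -drive file=guest.img,format=raw
```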

> I think microcode requirements need to be satisfied.  Can we use
> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files ?

You may try it at your own risk; again, this patch set currently
doesn't support guest PEBS on the Skylake platforms.

>
>> ---
>> thx,likexu
>>
>>>
>>> Thanks,
>>>
>>> Liuxiangdong
>>>
>>
> Thanks. Liuxiangdong
>


Attachments:
0001-target-i386-Expose-PEBS-capabilities-in-the-FEAT_PER.patch (1.70 kB)
0002-target-i386-add-cpu-pebs-true-support-to-enable-gues.patch (5.36 kB)

2021-01-26 21:33:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On Mon, Jan 25, 2021 at 04:26:22PM +0800, Like Xu wrote:

> In the host and guest PEBS both enabled case,
> we'll get a crazy dmesg *bombing* about spurious PMI warning
> if we pass the host PEBS PMI "harmlessly" to the guest:
>
> [11261.502536] Uhhuh. NMI received for unknown reason 2c on CPU 36.
> [11261.502539] Do you have a strange power saving mode enabled?
> [11261.502541] Dazed and confused, but trying to continue

How? AFAICT handle_pmi_common() will increment handled when
GLOBAL_STATUS_BUFFER_OVF_BIT is set, irrespective of DS containing
data.


2021-01-27 03:35:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On Mon, Jan 25, 2021 at 08:07:06PM +0800, Xu, Like wrote:

> So under the premise that counter cross-mapping is allowed,
> how can hypercall help fix it ?

Hypercall or otherwise exposing the mapping, will let the guest fix it
up when it already touches the data. Which avoids the host from having
to access the guest memory and is faster, no?

2021-01-29 02:55:27

by Liuxiangdong

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS



On 2021/1/26 15:08, Xu, Like wrote:
> On 2021/1/25 22:47, Liuxiangdong (Aven, Cloud Infrastructure Service
> Product Dept.) wrote:
>> Thanks for replying,
>>
>> On 2021/1/25 10:41, Like Xu wrote:
>>> + [email protected]
>>>
>>> Hi Liuxiangdong,
>>>
>>> On 2021/1/22 18:02, Liuxiangdong (Aven, Cloud Infrastructure Service
>>> Product Dept.) wrote:
>>>> Hi Like,
>>>>
>>>> Some questions about
>>>> https://lore.kernel.org/kvm/[email protected]/
>>>> <https://lore.kernel.org/kvm/[email protected]/>
>>>>
>>> Thanks for trying the PEBS feature in the guest,
>>> and I assume you have correctly applied the QEMU patches for guest PEBS.
>>>
>> Is there any other patch that needs to be applied? I use qemu 5.2.0.
>> (download from github on January 14th)
> Two qemu patches are attached against qemu tree
> (commit 31ee895047bdcf7387e3570cbd2a473c6f744b08)
> and then run the guest with "-cpu,pebs=true".
>
> Note, this two patch are just for test and not finalized for qemu upstream.
Yes, we can use pebs on Ice Lake once the qemu patches are applied.
Thanks very much!
>>>> 1)Test in IceLake
>>> In the [PATCH v3 10/17] KVM: x86/pmu: Expose CPUIDs feature bits PDCM,
>>> DS, DTES64, we only support Ice Lake with the following x86_model(s):
>>>
>>> #define INTEL_FAM6_ICELAKE_X 0x6A
>>> #define INTEL_FAM6_ICELAKE_D 0x6C
>>>
>>> you can check the eax output of "cpuid -l 1 -1 -r",
>>> for example "0x000606a4" meets this requirement.
>> It's INTEL_FAM6_ICELAKE_X
> Yes, it's the target hardware.
>
>> cpuid -l 1 -1 -r
>>
>> CPU:
>> 0x00000001 0x00: eax=0x000606a6 ebx=0xb4800800 ecx=0x7ffefbf7
>> edx=0xbfebfbff
>>
>>>> HOST:
>>>>
>>>> CPU family: 6
>>>>
>>>> Model: 106
>>>>
>>>> Model name: Intel(R) Xeon(R) Platinum 8378A CPU
>>>>
>>>> microcode: sig=0x606a6, pf=0x1, revision=0xd000122
>>> As long as you get the latest BIOS from the provider,
>>> you may check 'cat /proc/cpuinfo | grep code | uniq' with the latest one.
>> OK. I'll do it later.
>>>> Guest: linux kernel 5.11.0-rc2
>>> I assume it's the "upstream tag v5.11-rc2" which is fine.
>> Yes.
>>>> We can find pebs/intel_pt flag in guest cpuinfo, but there still exists
>>>> error when we use perf
>>> Just a note, intel_pt and pebs are two features and we can write
>>> pebs records to intel_pt buffer with extra hardware support.
>>> (by default, pebs records are written to the pebs buffer)
>>>
>>> You may check the output of "dmesg | grep PEBS" in the guest
>>> to see if the guest PEBS cpuinfo is exposed and use "perf record
>>> –e cycles:pp" to see if PEBS feature actually works in the guest.
>> I apply only pebs patch set to linux kernel 5.11.0-rc2, test perf in
>> guest and dump stack when return -EOPNOTSUPP
> Yes, you may apply the qemu patches and try it again.
>
>> (1)
>> # perf record -e instructions:pp
>> Error:
>> instructions:pp: PMU Hardware doesn't support
>> sampling/overflow-interrupts. Try 'perf stat'
>>
>> [ 117.793266] Call Trace:
>> [ 117.793270] dump_stack+0x57/0x6a
>> [ 117.793275] intel_pmu_setup_lbr_filter+0x137/0x190
>> [ 117.793280] intel_pmu_hw_config+0x18b/0x320
>> [ 117.793288] hsw_hw_config+0xe/0xa0
>> [ 117.793290] x86_pmu_event_init+0x8e/0x210
>> [ 117.793293] perf_try_init_event+0x40/0x130
>> [ 117.793297] perf_event_alloc.part.22+0x611/0xde0
>> [ 117.793299] ? alloc_fd+0xba/0x180
>> [ 117.793302] __do_sys_perf_event_open+0x1bd/0xd90
>> [ 117.793305] do_syscall_64+0x33/0x40
>> [ 117.793308] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> Do we need lbr when we use pebs?
> No, lbr and pebs are two features and we enable them separately.
>
>> I tried to apply lbr patch
>> set(https://lore.kernel.org/kvm/[email protected]/)
>> to kernel and qemu, but there is still other problem.
>> Error:
>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
>> event
>> ...
> We don't need that patch for PEBS feature.
>
>> (2)
>> # perf record -e instructions:ppp
>> Error:
>> instructions:ppp: PMU Hardware doesn't support
>> sampling/overflow-interrupts. Try 'perf stat'
>>
>> [ 115.188498] Call Trace:
>> [ 115.188503] dump_stack+0x57/0x6a
>> [ 115.188509] x86_pmu_hw_config+0x1eb/0x220
>> [ 115.188515] intel_pmu_hw_config+0x13/0x320
>> [ 115.188519] hsw_hw_config+0xe/0xa0
>> [ 115.188521] x86_pmu_event_init+0x8e/0x210
>> [ 115.188524] perf_try_init_event+0x40/0x130
>> [ 115.188528] perf_event_alloc.part.22+0x611/0xde0
>> [ 115.188530] ? alloc_fd+0xba/0x180
>> [ 115.188534] __do_sys_perf_event_open+0x1bd/0xd90
>> [ 115.188538] do_syscall_64+0x33/0x40
>> [ 115.188541] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> This is because x86_pmu.intel_cap.pebs_format is always 0 in
>> x86_pmu_max_precise().
>>
>> We rdmsr MSR_IA32_PERF_CAPABILITIES(0x00000345) from HOST, it's f4c5.
>> From guest, it's 2000
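The two MSR values quoted above explain the failure: the pebs_format field lives in bits 11:8 of MSR_IA32_PERF_CAPABILITIES. A minimal decoder sketch (bit layout per the Intel SDM; the guest value 0x2000 carries only the FW_WRITE bit, so pebs_format reads back as 0):

```python
# Decode the PEBS-related fields of IA32_PERF_CAPABILITIES (MSR 0x345).
# Bit layout per the Intel SDM: LBR format [5:0], PEBS trap [6],
# PEBS save-arch-regs [7], PEBS record format [11:8], FW_WRITE [13].
def decode_perf_capabilities(val):
    return {
        "lbr_format": val & 0x3F,
        "pebs_trap": (val >> 6) & 1,
        "pebs_arch_reg": (val >> 7) & 1,
        "pebs_format": (val >> 8) & 0xF,
        "fw_write": (val >> 13) & 1,
    }

host = decode_perf_capabilities(0xF4C5)   # pebs_format == 4 (adaptive PEBS)
guest = decode_perf_capabilities(0x2000)  # pebs_format == 0, so perf rejects :ppp
print(host["pebs_format"], guest["pebs_format"])
# -> 4 0
```

This matches the stack trace: x86_pmu_max_precise() sees pebs_format 0 in the guest and refuses precise sampling.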
>>
>>>> # perf record -e cycles:pp
>>>>
>>>> Error:
>>>>
>>>> cycles:pp: PMU Hardware doesn’t support sampling/overflow-interrupts.
>>>> Try ‘perf stat’
>>>>
>>>> Could you give some advice?
>>> If you have more specific comments or any concerns, just let me know.
>>>
>>>> 2)Test in Skylake
>>>>
>>>> HOST:
>>>>
>>>> CPU family: 6
>>>>
>>>> Model: 85
>>>>
>>>> Model name: Intel(R) Xeon(R) Gold 6146 CPU @
>>>>
>>>> 3.20GHz
>>>>
>>>> microcode : 0x2000064
>>>>
>>>> Guest: linux 4.18
>>>>
>>>> we cannot find intel_pt flag in guest cpuinfo because
>>>> cpu_has_vmx_intel_pt() returns false.
>>> You may check vmx_pebs_supported().
>> It's true.
>>>> SECONDARY_EXEC_PT_USE_GPA/VM_EXIT_CLEAR_IA32_RTIT_CTL/VM_ENTRY_LOAD_IA32_RTIT_CTL
>>>> are both disabled.
>>>>
>>>> Is it because microcode is not supported?
>>>>
>>>> And, is there a new microcode which can support these bits? How can we
>>>> get this?
>>> Currently, this patch set doesn't support guest PEBS on the Skylake
>>> platforms, and if we choose to support it, we will let you know.
>>>
>> And now, we want to use pebs in skylake. If we develop based on pebs
>> patch set, do you have any suggestions?
> - At least you need to pin guest memory such as "-overcommit mem-lock=true"
> for qemu
> - You may rewrite patches 13 - 17 specifically for Skylake because the
> record format is different from Ice Lake's.
OK. So, is there anything else we need to pay attention to apart from the
record format when targeting Skylake?
>> I think microcode requirements need to be satisfied. Can we use
>> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files ?
> You may try it at your risk and again,
> this patch set doesn't support guest PEBS on the Skylake platforms currently.
>
>>> ---
>>> thx,likexu
>>>
>>>> Thanks,
>>>>
>>>> Liuxiangdong
>>>>
>> Thanks. Liuxiangdong
>>

2021-02-01 08:45:53

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

On 2021/1/29 10:52, Liuxiangdong (Aven, Cloud Infrastructure Service
Product Dept.) wrote:
>
>
> On 2021/1/26 15:08, Xu, Like wrote:
>> On 2021/1/25 22:47, Liuxiangdong (Aven, Cloud Infrastructure Service
>> Product Dept.) wrote:
>>> Thanks for replying,
>>>
>>> On 2021/1/25 10:41, Like Xu wrote:
>>>> + [email protected]
>>>>
>>>> Hi Liuxiangdong,
>>>>
>>>> On 2021/1/22 18:02, Liuxiangdong (Aven, Cloud Infrastructure Service
>>>> Product Dept.) wrote:
>>>>> Hi Like,
>>>>>
>>>>> Some questions about
>>>>> https://lore.kernel.org/kvm/[email protected]/
>>>>>
>>>>> <https://lore.kernel.org/kvm/[email protected]/>
>>>>>
>>>>>
>>>> Thanks for trying the PEBS feature in the guest,
>>>> and I assume you have correctly applied the QEMU patches for guest PEBS.
>>>>
>>> Is there any other patch that needs to be apply? I use qemu 5.2.0.
>>> (download from github on January 14th)
>> Two qemu patches are attached against qemu tree
>> (commit 31ee895047bdcf7387e3570cbd2a473c6f744b08)
>> and then run the guest with "-cpu,pebs=true".
>>
>> Note, these two patches are just for testing and are not finalized for qemu upstream.
> Yes, we can use pebs in IceLake when the qemu patches are applied.
> Thanks very much!

Thanks for your verification on this earlier version.

>>>>> 1)Test in IceLake
>>>> In the [PATCH v3 10/17] KVM: x86/pmu: Expose CPUIDs feature bits PDCM,
>>>> DS, DTES64, we only support Ice Lake with the following x86_model(s):
>>>>
>>>> #define INTEL_FAM6_ICELAKE_X        0x6A
>>>> #define INTEL_FAM6_ICELAKE_D        0x6C
>>>>
>>>> you can check the eax output of "cpuid -l 1 -1 -r",
>>>> for example "0x000606a4" meets this requirement.
>>> It's INTEL_FAM6_ICELAKE_X
>> Yes, it's the target hardware.
>>
>>> cpuid -l 1 -1 -r
>>>
>>> CPU:
>>>     0x00000001 0x00: eax=0x000606a6 ebx=0xb4800800 ecx=0x7ffefbf7
>>> edx=0xbfebfbff
>>>
>>>>> HOST:
>>>>>
>>>>> CPU family:                      6
>>>>>
>>>>> Model:                           106
>>>>>
>>>>> Model name:                      Intel(R) Xeon(R) Platinum 8378A CPU
>>>>> $@ $@
>>>>>
>>>>> microcode: sig=0x606a6, pf=0x1, revision=0xd000122
>>>> As long as you get the latest BIOS from the provider,
>>>> you may check 'cat /proc/cpuinfo | grep code | uniq' with the latest one.
>>> OK. I'll do it later.
>>>>> Guest:  linux kernel 5.11.0-rc2
>>>> I assume it's the "upstream tag v5.11-rc2" which is fine.
>>> Yes.
>>>>> We can find the pebs/intel_pt flags in the guest cpuinfo, but errors still
>>>>> occur when we use perf
>>>> Just a note, intel_pt and pebs are two features and we can write
>>>> pebs records to intel_pt buffer with extra hardware support.
>>>> (by default, pebs records are written to the pebs buffer)
>>>>
>>>> You may check the output of "dmesg | grep PEBS" in the guest
>>>> to see if the guest PEBS cpuinfo is exposed and use "perf record
>>>> -e cycles:pp" to see if PEBS feature actually works in the guest.
>>> I applied only the pebs patch set to linux kernel 5.11.0-rc2, tested perf in
>>> the guest, and dumped the stack when it returned -EOPNOTSUPP
>> Yes, you may apply the qemu patches and try it again.
>>
>>> (1)
>>> # perf record -e instructions:pp
>>> Error:
>>> instructions:pp: PMU Hardware doesn't support
>>> sampling/overflow-interrupts. Try 'perf stat'
>>>
>>> [  117.793266] Call Trace:
>>> [  117.793270]  dump_stack+0x57/0x6a
>>> [  117.793275]  intel_pmu_setup_lbr_filter+0x137/0x190
>>> [  117.793280]  intel_pmu_hw_config+0x18b/0x320
>>> [  117.793288]  hsw_hw_config+0xe/0xa0
>>> [  117.793290]  x86_pmu_event_init+0x8e/0x210
>>> [  117.793293]  perf_try_init_event+0x40/0x130
>>> [  117.793297]  perf_event_alloc.part.22+0x611/0xde0
>>> [  117.793299]  ? alloc_fd+0xba/0x180
>>> [  117.793302]  __do_sys_perf_event_open+0x1bd/0xd90
>>> [  117.793305]  do_syscall_64+0x33/0x40
>>> [  117.793308]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Do we need lbr when we use pebs?
>> No, lbr and pebs are two features and we enable them separately.
>>
>>> I tried to apply lbr patch
>>> set(https://lore.kernel.org/kvm/[email protected]/)
>>>
>>> to kernel and qemu, but there is still other problem.
>>> Error:
>>> The sys_perf_event_open() syscall returned with 22 (Invalid argument) for
>>> event
>>> ...
>> We don't need that patch for PEBS feature.
>>
>>> (2)
>>> # perf record -e instructions:ppp
>>> Error:
>>> instructions:ppp: PMU Hardware doesn't support
>>> sampling/overflow-interrupts. Try 'perf stat'
>>>
>>> [  115.188498] Call Trace:
>>> [  115.188503]  dump_stack+0x57/0x6a
>>> [  115.188509]  x86_pmu_hw_config+0x1eb/0x220
>>> [  115.188515]  intel_pmu_hw_config+0x13/0x320
>>> [  115.188519]  hsw_hw_config+0xe/0xa0
>>> [  115.188521]  x86_pmu_event_init+0x8e/0x210
>>> [  115.188524]  perf_try_init_event+0x40/0x130
>>> [  115.188528]  perf_event_alloc.part.22+0x611/0xde0
>>> [  115.188530]  ? alloc_fd+0xba/0x180
>>> [  115.188534]  __do_sys_perf_event_open+0x1bd/0xd90
>>> [  115.188538]  do_syscall_64+0x33/0x40
>>> [  115.188541]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> This is because x86_pmu.intel_cap.pebs_format is always 0 in
>>> x86_pmu_max_precise().
>>>
>>> We rdmsr MSR_IA32_PERF_CAPABILITIES(0x00000345)  from HOST, it's f4c5.
>>>  From guest, it's 2000
>>>
>>>>> # perf record -e cycles:pp
>>>>>
>>>>> Error:
>>>>>
>>>>> cycles:pp: PMU Hardware doesn’t support sampling/overflow-interrupts.
>>>>> Try ‘perf stat’
>>>>>
>>>>> Could you give some advice?
>>>> If you have more specific comments or any concerns, just let me know.
>>>>
>>>>> 2)Test in Skylake
>>>>>
>>>>> HOST:
>>>>>
>>>>> CPU family:                      6
>>>>>
>>>>> Model:                           85
>>>>>
>>>>> Model name:                      Intel(R) Xeon(R) Gold 6146 CPU @
>>>>>
>>>>>                                     3.20GHz
>>>>>
>>>>> microcode        : 0x2000064
>>>>>
>>>>> Guest: linux 4.18
>>>>>
>>>>> we cannot find intel_pt flag in guest cpuinfo because
>>>>> cpu_has_vmx_intel_pt() returns false.
>>>> You may check vmx_pebs_supported().
>>> It's true.
>>>>> SECONDARY_EXEC_PT_USE_GPA/VM_EXIT_CLEAR_IA32_RTIT_CTL/VM_ENTRY_LOAD_IA32_RTIT_CTL
>>>>>
>>>>> are both disabled.
>>>>>
>>>>> Is it because microcode is not supported?
>>>>>
>>>>> And, is there a new microcode which can support these bits? How can we
>>>>> get this?
>>>> Currently, this patch set doesn't support guest PEBS on the Skylake
>>>> platforms, and if we choose to support it, we will let you know.
>>>>
>>> And now, we want to use pebs in skylake. If we develop based on pebs
>>> patch set, do you have any suggestions?
>> - At least you need to pin guest memory such as "-overcommit mem-lock=true"
>> for qemu
>> - You may rewrite patches 13 - 17 specifically for Skylake because the
>> record format is different from Ice Lake's.
> OK. So, is there anything else we need to pay attention to apart from the
> record format when targeting Skylake?

You may need:
- remove x86_match_cpu check in the vmx_pebs_supported()
- add intel_pmu_handle_guest_pebs() to the intel_pmu_drain_pebs_nhm()

I suggest you pick up the one-to-one mapping patch from v1
so that you can avoid patches 13 - 17.
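Patches 13 - 17 implement the slow-path rewrite described in the cover letter: translating the physical counter indexes recorded by the PEBS hardware into the guest's virtual counter indexes. An abstract sketch of that remapping step (the record layout is deliberately simplified and illustrative, not the real adaptive-PEBS format):

```python
def remap_applicable_counters(mask, phys_to_virt):
    # Translate a bitmask of physical counter indexes into the guest's
    # virtual counter indexes, as the slow path must do for each PEBS
    # record before the guest reads it from the DS buffer.
    out = 0
    for phys in range(64):
        if (mask >> phys) & 1:
            out |= 1 << phys_to_virt[phys]
    return out

# e.g. physical PMC1 is emulating virtual PMC0
print(bin(remap_applicable_counters(0b10, {1: 0})))
# -> 0b1
```

With the one-to-one mapping from v1, phys_to_virt is the identity and this whole step disappears, which is why the suggestion above avoids patches 13 - 17.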

>>> I think microcode requirements need to be satisfied.  Can we use
>>> https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files ?
>> You may try it at your risk and again,
>> this patch set doesn't support guest PEBS on the Skylake platforms
>> currently.
>>
>>>> ---
>>>> thx,likexu
>>>>
>>>>> Thanks,
>>>>>
>>>>> Liuxiangdong
>>>>>
>>> Thanks. Liuxiangdong
>>>
>

2021-02-02 06:34:17

by Xu, Like

[permalink] [raw]
Subject: Re: [PATCH v3 04/17] perf: x86/ds: Handle guest PEBS overflow PMI and inject it to guest

On 2021/1/25 19:47, Peter Zijlstra wrote:
> On Mon, Jan 25, 2021 at 04:26:22PM +0800, Like Xu wrote:
>
>> In the host and guest PEBS both enabled case,
>> we'll get a crazy dmesg *bombing* about spurious PMI warning
>> if we pass the host PEBS PMI "harmlessly" to the guest:
>>
>> [11261.502536] Uhhuh. NMI received for unknown reason 2c on CPU 36.
>> [11261.502539] Do you have a strange power saving mode enabled?
>> [11261.502541] Dazed and confused, but trying to continue
> How? AFAICT handle_pmi_common() will increment handled when
> GLOBAL_STATUS_BUFFER_OVF_BIT is set, irrespective of DS containing
> data.

Thanks for this comment, and it's enlightening.

For the case that both host and guest PEBS are enabled,
the host PEBS PMI will be injected into the guest only when
GLOBAL_STATUS_BUFFER_OVF_BIT is not set in the guest global_status.
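The injection condition described above can be sketched as follows (GLOBAL_STATUS_BUFFER_OVF_BIT is bit 62 of IA32_PERF_GLOBAL_STATUS on Intel PMUs; the helper name is illustrative, not the patch's actual function):

```python
GLOBAL_STATUS_BUFFER_OVF_BIT = 62  # Ovf_Buffer in IA32_PERF_GLOBAL_STATUS

def should_inject_host_pebs_pmi(guest_global_status):
    # Only inject the host PEBS PMI while the guest does not already have
    # a buffer-overflow condition of its own pending; a PMI the guest
    # cannot attribute to its own DS area is what triggers the
    # "Dazed and confused" spurious-NMI warning quoted earlier.
    return not (guest_global_status >> GLOBAL_STATUS_BUFFER_OVF_BIT) & 1

print(should_inject_host_pebs_pmi(0))        # -> True
print(should_inject_host_pebs_pmi(1 << 62))  # -> False
```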

>