Performance Monitoring Unit is designed to monitor micro architectural
events which helps in analyzing how applications or operating systems
are performing on the processors. In KVM/X86, version 2 Architectural
PMU on Intel and AMD hosts have been enabled.
This patch series is going to improve vPMU Efficiency for guest perf users
which is mainly measured by guest NMI handler latency for basic perf usages
[1][2][3][4] with hardware PMU. It's not a passthrough solution but based
on the legacy vPMU implementation (since 2011) with backport-friendliness.
The general idea (defined in patch 2/3) is to reuse last created perf_event
for the same vPMC when the new requested config is the exactly same as the
last programed config (used by pmc_reprogram_counter()) AND the new event
period is appropriate and accepted (via perf_event_period() in patch 1/3).
Before reusing the perf_event, it will be disabled until it's suitable for
reuse and a hardware counter will be reassigned again to serve vPMC.
If the disabled perf_event is no longer reused, we do a lazy release
mechanism (defined in patch 3/3) which in a short is to release the
disabled perf_events on the first call of vcpu_enter_guest since the
vcpu gets next scheduled in if its MSRs is not accessed in the last
sched time slice. The bitmap pmu->lazy_release_ctrl is added to track.
The kvm_pmu_cleanup() is called at the first time to run vcpu_enter_guest
after the vcpu shced_in and the overhead for check is very limited.
With this optimization, the average latency of the guest NMI handler is
reduced from 99450 ns to 56195 ns (1.76x speed up on CLX-AP with v5.3).
If host disables the watchdog (echo 0 > /proc/sys/kernel/watchdog), the
minimum latency of guest NMI handler could be speed up at 2994x and in
the average at 685x. The run time of workload with perf attached inside
the guest could be reduced significantly with this optimization.
Please check each commit for more details and share your comments with us.
Thanks,
Like Xu
---
[1] multiplexing sampling usage: time perf record -e \
`perf list | grep Hardware | grep event |\
awk '{print $1}' | head -n 10 |tr '\n' ',' | sed 's/,$//' ` ./ftest
[2] one gp counter sampling usage: perf record -e branch-misses ./ftest
[3] one fixed counter sampling usage: perf record -e instructions ./ftest
[4] event count usage: perf stat -e branch-misses ./ftest
---
Changes in v2:
- use perf_event_pause() to disable, read, reset by only one lock;
- use __perf_event_read_value() after _perf_event_disable();
- replace bitfields with 'u8 event_count; bool need_cleanup;';
- refine comments and commit messages;
- fix two issues reported by kbuild test robot for ARCH=[nds32|sh]
v1:
https://lore.kernel.org/kvm/[email protected]/
Like Xu (4):
perf/core: Provide a kernel-internal interface to recalibrate event
period
perf/core: Provide a kernel-internal interface to pause perf_event
KVM: x86/vPMU: Reuse perf_event to avoid unnecessary
pmc_reprogram_counter
KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC
arch/x86/include/asm/kvm_host.h | 17 +++++++
arch/x86/kvm/pmu.c | 88 ++++++++++++++++++++++++++++++++-
arch/x86/kvm/pmu.h | 15 +++++-
arch/x86/kvm/pmu_amd.c | 14 ++++++
arch/x86/kvm/vmx/pmu_intel.c | 27 ++++++++++
arch/x86/kvm/x86.c | 12 +++++
include/linux/perf_event.h | 10 ++++
kernel/events/core.c | 44 ++++++++++++++---
8 files changed, 216 insertions(+), 11 deletions(-)
--
2.21.0
Exporting perf_event_pause() as an external accessor for kernel users (such
as KVM) who may do both disable perf_event and read count with just one
time to hold perf_event_ctx_lock. Also the value could be reset optionally.
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 16 ++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d601df36e671..e9768bfc76f6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1337,6 +1337,7 @@ extern void perf_event_disable_inatomic(struct perf_event *event);
extern void perf_event_task_tick(void);
extern int perf_event_account_interrupt(struct perf_event *event);
extern int perf_event_period(struct perf_event *event, u64 value);
+extern u64 perf_event_pause(struct perf_event *event, bool reset);
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1420,6 +1421,10 @@ static inline int perf_event_period(struct perf_event *event, u64 value)
{
return -EINVAL;
}
+static inline u64 perf_event_pause(struct perf_event *event, bool reset)
+{
+ return 0;
+}
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e1b83d2731da..e29038984cf4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5029,6 +5029,22 @@ static void _perf_event_reset(struct perf_event *event)
perf_event_update_userpage(event);
}
+u64 perf_event_pause(struct perf_event *event, bool reset)
+{
+ struct perf_event_context *ctx;
+ u64 count, enabled, running;
+
+ ctx = perf_event_ctx_lock(event);
+ _perf_event_disable(event);
+ count = __perf_event_read_value(event, &enabled, &running);
+ if (reset)
+ local64_set(&event->count, 0);
+ perf_event_ctx_unlock(event, ctx);
+
+ return count;
+}
+EXPORT_SYMBOL_GPL(perf_event_pause);
+
/*
* Holding the top-level event's child_mutex means that any
* descendant process that has inherited this event will block
--
2.21.0
The perf_event_create_kernel_counter() in the pmc_reprogram_counter() is
a heavyweight and high-frequency operation, especially when host disables
the watchdog (maximum 21000000 ns) which leads to an unacceptable latency
of the guest NMI handler. It limits the use of vPMUs in the guest.
When a vPMC is fully enabled, the legacy reprogram_*_counter() would stop
and release its existing perf_event (if any) every time EVEN in most cases
almost the same requested perf_event will be created and configured again.
For each vPMC, if the reuqested config ('u64 eventsel' for gp and 'u8 ctrl'
for fixed) is the same as its last programed config AND a new sample period
based on pmc->counter is accepted by host perf interface, the current event
could be reused safely as a new created one does. Otherwise, do release the
undesirable perf_event and reprogram a new one as usual.
It's light-weight to call pmc_pause_counter (disable, read and reset event)
and pmc_resume_counter (recalibrate period and re-enable event) as guest
expects instead of release-and-create again on any condition. Compared to
use the filterable event->attr or hw.config, a new 'u64 programed_config'
field is added to save the last original programed config for each vPMC.
Based on this implementation, the number of calls to pmc_reprogram_counter
is reduced by ~94% for a gp sampling event and ~99.9% for a fixed event.
In the usage of multiplexing perf sampling mode, the average latency of the
guest NMI handler is reduced from 99450 ns to 56195 ns (1.76x speed up).
If host disables watchdog, the minimum latecy of guest NMI handler could be
speed up at 2994x (from 18134692 to 6057 ns) and at 685x in the average.
Suggested-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/pmu.c | 45 +++++++++++++++++++++++++++++++--
arch/x86/kvm/pmu.h | 12 +++++++--
arch/x86/kvm/pmu_amd.c | 1 +
arch/x86/kvm/vmx/pmu_intel.c | 2 ++
5 files changed, 58 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 50eb430b0ad8..1abbbbae4953 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -451,6 +451,8 @@ struct kvm_pmc {
u64 eventsel;
struct perf_event *perf_event;
struct kvm_vcpu *vcpu;
+ /* the exact requested config to create perf_event */
+ u64 programed_config;
};
struct kvm_pmu {
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 46875bbd0419..09d1a03c057c 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -140,6 +140,35 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
}
+static void pmc_pause_counter(struct kvm_pmc *pmc)
+{
+ u64 counter = pmc->counter;
+
+ if (!pmc->perf_event)
+ return;
+
+ /* update counter, reset event value to avoid redundant accumulation */
+ counter += perf_event_pause(pmc->perf_event, true);
+ pmc->counter = counter & pmc_bitmask(pmc);
+}
+
+static bool pmc_resume_counter(struct kvm_pmc *pmc)
+{
+ if (!pmc->perf_event)
+ return false;
+
+ /* recalibrate sample period and check if it's accepted by perf core */
+ if (perf_event_period(pmc->perf_event,
+ (-pmc->counter) & pmc_bitmask(pmc)))
+ return false;
+
+ /* reuse perf_event to serve as pmc_reprogram_counter() does*/
+ perf_event_enable(pmc->perf_event);
+
+ clear_bit(pmc->idx, (unsigned long *)&pmc_to_pmu(pmc)->reprogram_pmi);
+ return true;
+}
+
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
{
unsigned config, type = PERF_TYPE_RAW;
@@ -154,7 +183,7 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
pmc->eventsel = eventsel;
- pmc_stop_counter(pmc);
+ pmc_pause_counter(pmc);
if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE) || !pmc_is_enabled(pmc))
return;
@@ -193,6 +222,12 @@ void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
if (type == PERF_TYPE_RAW)
config = eventsel & X86_RAW_EVENT_MASK;
+ if (pmc->programed_config == eventsel && pmc_resume_counter(pmc))
+ return;
+
+ pmc_release_perf_event(pmc);
+
+ pmc->programed_config = eventsel;
pmc_reprogram_counter(pmc, type, config,
!(eventsel & ARCH_PERFMON_EVENTSEL_USR),
!(eventsel & ARCH_PERFMON_EVENTSEL_OS),
@@ -209,7 +244,7 @@ void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int idx)
struct kvm_pmu_event_filter *filter;
struct kvm *kvm = pmc->vcpu->kvm;
- pmc_stop_counter(pmc);
+ pmc_pause_counter(pmc);
if (!en_field || !pmc_is_enabled(pmc))
return;
@@ -224,6 +259,12 @@ void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 ctrl, int idx)
return;
}
+ if (pmc->programed_config == (u64)ctrl && pmc_resume_counter(pmc))
+ return;
+
+ pmc_release_perf_event(pmc);
+
+ pmc->programed_config = (u64)ctrl;
pmc_reprogram_counter(pmc, PERF_TYPE_HARDWARE,
kvm_x86_ops->pmu_ops->find_fixed_event(idx),
!(en_field & 0x2), /* exclude user */
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 58265f761c3b..3a95952702d2 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -55,12 +55,20 @@ static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
return counter & pmc_bitmask(pmc);
}
-static inline void pmc_stop_counter(struct kvm_pmc *pmc)
+static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
{
if (pmc->perf_event) {
- pmc->counter = pmc_read_counter(pmc);
perf_event_release_kernel(pmc->perf_event);
pmc->perf_event = NULL;
+ pmc->programed_config = 0;
+ }
+}
+
+static inline void pmc_stop_counter(struct kvm_pmc *pmc)
+{
+ if (pmc->perf_event) {
+ pmc->counter = pmc_read_counter(pmc);
+ pmc_release_perf_event(pmc);
}
}
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index c8388389a3b0..3d656b2d439f 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -285,6 +285,7 @@ static void amd_pmu_init(struct kvm_vcpu *vcpu)
pmu->gp_counters[i].type = KVM_PMC_GP;
pmu->gp_counters[i].vcpu = vcpu;
pmu->gp_counters[i].idx = i;
+ pmu->gp_counters[i].programed_config = 0;
}
}
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index 3e9c059099e9..fa14882dc3ad 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -328,12 +328,14 @@ static void intel_pmu_init(struct kvm_vcpu *vcpu)
pmu->gp_counters[i].type = KVM_PMC_GP;
pmu->gp_counters[i].vcpu = vcpu;
pmu->gp_counters[i].idx = i;
+ pmu->gp_counters[i].programed_config = 0;
}
for (i = 0; i < INTEL_PMC_MAX_FIXED; i++) {
pmu->fixed_counters[i].type = KVM_PMC_FIXED;
pmu->fixed_counters[i].vcpu = vcpu;
pmu->fixed_counters[i].idx = i + INTEL_PMC_IDX_FIXED;
+ pmu->fixed_counters[i].programed_config = 0;
}
}
--
2.21.0
Currently, a host perf_event is created for a vPMC functionality emulation.
It’s unpredictable to determine if a disabled perf_event will be reused.
If they are disabled and are not reused for a considerable period of time,
those obsolete perf_events would increase host context switch overhead that
could have been avoided.
If the guest doesn't access (set_msr/get_msr/rdpmc) any of the vPMC's MSRs
during an entire vcpu sched time slice, and its independent enable bit of
the vPMC isn't set, we can predict that the guest has finished the use of
this vPMC, and then it's time to release non-reused perf_events in the
first call of vcpu_enter_guest() after the vcpu gets next scheduled in.
This lazy mechanism delays the event release time to the beginning of the
next scheduled time slice if vPMC's MSRs aren't accessed during this time
slice. If guest comes back to use this vPMC in next time slice, a new perf
event would be re-created via perf_event_create_kernel_counter() as usual.
Suggested-by: Wei W Wang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 15 ++++++++++++
arch/x86/kvm/pmu.c | 43 +++++++++++++++++++++++++++++++++
arch/x86/kvm/pmu.h | 3 +++
arch/x86/kvm/pmu_amd.c | 13 ++++++++++
arch/x86/kvm/vmx/pmu_intel.c | 25 +++++++++++++++++++
arch/x86/kvm/x86.c | 12 +++++++++
6 files changed, 111 insertions(+)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1abbbbae4953..45f9cdae150b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -472,6 +472,21 @@ struct kvm_pmu {
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
struct irq_work irq_work;
u64 reprogram_pmi;
+
+ /* for vPMC being set, do not released its perf_event (if any) */
+ u64 lazy_release_ctrl;
+
+ /*
+ * The gate to release perf_events not marked in
+ * lazy_release_ctrl only once in a vcpu time slice.
+ */
+ bool need_cleanup;
+
+ /*
+ * The total number of programmed perf_events and it helps to avoid
+ * redundant check before cleanup if guest don't use vPMU at all.
+ */
+ u8 event_count;
};
struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 09d1a03c057c..7ab262f009f6 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -137,6 +137,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
}
pmc->perf_event = event;
+ pmc_to_pmu(pmc)->event_count++;
clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
}
@@ -368,6 +369,7 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
if (!pmc)
return 1;
+ __set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
*data = pmc_read_counter(pmc) & mask;
return 0;
}
@@ -385,11 +387,13 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
{
+ kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr);
return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
}
int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
+ kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr_info->index);
return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
}
@@ -417,9 +421,48 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
memset(pmu, 0, sizeof(*pmu));
kvm_x86_ops->pmu_ops->init(vcpu);
init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
+ pmu->lazy_release_ctrl = 0;
+ pmu->event_count = 0;
+ pmu->need_cleanup = false;
kvm_pmu_refresh(vcpu);
}
+static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
+{
+ struct kvm_pmu *pmu = pmc_to_pmu(pmc);
+
+ if (pmc_is_fixed(pmc))
+ return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
+ pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
+
+ return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
+}
+
+void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ u64 bitmask = ~pmu->lazy_release_ctrl;
+ int i;
+
+ if (!unlikely(pmu->need_cleanup))
+ return;
+
+ /* do cleanup before the first time of running vcpu after sched_in */
+ pmu->need_cleanup = false;
+
+ /* release events for unmarked vPMCs in the last sched time slice */
+ for_each_set_bit(i, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
+ pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, i);
+
+ if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
+ pmc_stop_counter(pmc);
+ }
+
+ /* reset vPMC lazy-release bitmap for this sched time slice */
+ pmu->lazy_release_ctrl = 0;
+}
+
void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
{
kvm_pmu_reset(vcpu);
diff --git a/arch/x86/kvm/pmu.h b/arch/x86/kvm/pmu.h
index 3a95952702d2..1bf8adb1d211 100644
--- a/arch/x86/kvm/pmu.h
+++ b/arch/x86/kvm/pmu.h
@@ -34,6 +34,7 @@ struct kvm_pmu_ops {
void (*refresh)(struct kvm_vcpu *vcpu);
void (*init)(struct kvm_vcpu *vcpu);
void (*reset)(struct kvm_vcpu *vcpu);
+ void (*update_lazy_release_ctrl)(struct kvm_vcpu *vcpu, u32 msr);
};
static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
@@ -61,6 +62,7 @@ static inline void pmc_release_perf_event(struct kvm_pmc *pmc)
perf_event_release_kernel(pmc->perf_event);
pmc->perf_event = NULL;
pmc->programed_config = 0;
+ pmc_to_pmu(pmc)->event_count--;
}
}
@@ -125,6 +127,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
void kvm_pmu_refresh(struct kvm_vcpu *vcpu);
void kvm_pmu_reset(struct kvm_vcpu *vcpu);
void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_cleanup(struct kvm_vcpu *vcpu);
void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp);
diff --git a/arch/x86/kvm/pmu_amd.c b/arch/x86/kvm/pmu_amd.c
index 3d656b2d439f..c74087dad5e8 100644
--- a/arch/x86/kvm/pmu_amd.c
+++ b/arch/x86/kvm/pmu_amd.c
@@ -208,6 +208,18 @@ static bool amd_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
return ret;
}
+static void amd_update_lazy_release_ctrl(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+
+ pmc = get_gp_pmc_amd(pmu, msr, PMU_TYPE_COUNTER);
+ pmc = pmc ? pmc : get_gp_pmc_amd(pmu, msr, PMU_TYPE_EVNTSEL);
+
+ if (pmc)
+ __set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
+}
+
static int amd_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -315,4 +327,5 @@ struct kvm_pmu_ops amd_pmu_ops = {
.refresh = amd_pmu_refresh,
.init = amd_pmu_init,
.reset = amd_pmu_reset,
+ .update_lazy_release_ctrl = amd_update_lazy_release_ctrl,
};
diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index fa14882dc3ad..8be7551ffcb3 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -140,6 +140,30 @@ static struct kvm_pmc *intel_msr_idx_to_pmc(struct kvm_vcpu *vcpu,
return &counters[idx];
}
+static void intel_update_lazy_release_ctrl(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+ struct kvm_pmc *pmc = NULL;
+ int i;
+
+ if (msr == MSR_CORE_PERF_FIXED_CTR_CTRL) {
+ for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
+ if (!fixed_ctrl_field(pmu->fixed_ctr_ctrl, i))
+ continue;
+ __set_bit(INTEL_PMC_IDX_FIXED + i,
+ (unsigned long *)&pmu->lazy_release_ctrl);
+ }
+ return;
+ }
+
+ pmc = get_fixed_pmc(pmu, msr);
+ pmc = pmc ? pmc : get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0);
+ pmc = pmc ? pmc : get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0);
+
+ if (pmc)
+ __set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
+}
+
static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
@@ -376,4 +400,5 @@ struct kvm_pmu_ops intel_pmu_ops = {
.refresh = intel_pmu_refresh,
.init = intel_pmu_init,
.reset = intel_pmu_reset,
+ .update_lazy_release_ctrl = intel_update_lazy_release_ctrl,
};
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 661e2bf38526..023ea5efb3bb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8080,6 +8080,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto cancel_injection;
}
+ /*
+ * vPMU uses a lazy method to release the perf_events created for
+ * features emulation when the related MSRs weren't accessed during
+ * last vcpu time slice. Technically, this cleanup check happens on
+ * the first call of vcpu_enter_guest after the vcpu gets scheduled in.
+ */
+ kvm_pmu_cleanup(vcpu);
+
preempt_disable();
kvm_x86_ops->prepare_guest_switch(vcpu);
@@ -9415,7 +9423,11 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
+ struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
+
vcpu->arch.l1tf_flush_l1d = true;
+ if (pmu->version && unlikely(pmu->event_count))
+ pmu->need_cleanup = true;
kvm_x86_ops->sched_in(vcpu, cpu);
}
--
2.21.0
Currently, perf_event_period() is used by user tools via ioctl. Based on
naming convention, exporting perf_event_period() for kernel users (such
as KVM) who may recalibrate the event period for their assigned counter
according to their requirements.
The perf_event_period() is an external accessor, just like the
perf_event_{en,dis}able() and should thus use perf_event_ctx_lock().
Suggested-by: Kan Liang <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 28 +++++++++++++++++++++-------
2 files changed, 26 insertions(+), 7 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61448c19a132..d601df36e671 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1336,6 +1336,7 @@ extern void perf_event_disable_local(struct perf_event *event);
extern void perf_event_disable_inatomic(struct perf_event *event);
extern void perf_event_task_tick(void);
extern int perf_event_account_interrupt(struct perf_event *event);
+extern int perf_event_period(struct perf_event *event, u64 value);
#else /* !CONFIG_PERF_EVENTS: */
static inline void *
perf_aux_output_begin(struct perf_output_handle *handle,
@@ -1415,6 +1416,10 @@ static inline void perf_event_disable(struct perf_event *event) { }
static inline int __perf_event_disable(void *info) { return -1; }
static inline void perf_event_task_tick(void) { }
static inline int perf_event_release_kernel(struct perf_event *event) { return 0; }
+static inline int perf_event_period(struct perf_event *event, u64 value)
+{
+ return -EINVAL;
+}
#endif
#if defined(CONFIG_PERF_EVENTS) && defined(CONFIG_CPU_SUP_INTEL)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9ec0b0bfddbd..e1b83d2731da 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5106,16 +5106,11 @@ static int perf_event_check_period(struct perf_event *event, u64 value)
return event->pmu->check_period(event, value);
}
-static int perf_event_period(struct perf_event *event, u64 __user *arg)
+static int _perf_event_period(struct perf_event *event, u64 value)
{
- u64 value;
-
if (!is_sampling_event(event))
return -EINVAL;
- if (copy_from_user(&value, arg, sizeof(value)))
- return -EFAULT;
-
if (!value)
return -EINVAL;
@@ -5133,6 +5128,19 @@ static int perf_event_period(struct perf_event *event, u64 __user *arg)
return 0;
}
+int perf_event_period(struct perf_event *event, u64 value)
+{
+ struct perf_event_context *ctx;
+ int ret;
+
+ ctx = perf_event_ctx_lock(event);
+ ret = _perf_event_period(event, value);
+ perf_event_ctx_unlock(event, ctx);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(perf_event_period);
+
static const struct file_operations perf_fops;
static inline int perf_fget_light(int fd, struct fd *p)
@@ -5176,8 +5184,14 @@ static long _perf_ioctl(struct perf_event *event, unsigned int cmd, unsigned lon
return _perf_event_refresh(event, arg);
case PERF_EVENT_IOC_PERIOD:
- return perf_event_period(event, (u64 __user *)arg);
+ {
+ u64 value;
+
+ if (copy_from_user(&value, (u64 __user *)arg, sizeof(value)))
+ return -EFAULT;
+ return _perf_event_period(event, value);
+ }
case PERF_EVENT_IOC_ID:
{
u64 id = primary_event_id(event);
--
2.21.0
On Sun, Oct 13, 2019 at 05:15:31PM +0800, Like Xu wrote:
> Exporting perf_event_pause() as an external accessor for kernel users (such
> as KVM) who may do both disable perf_event and read count with just one
> time to hold perf_event_ctx_lock. Also the value could be reset optionally.
> +u64 perf_event_pause(struct perf_event *event, bool reset)
> +{
> + struct perf_event_context *ctx;
> + u64 count, enabled, running;
> +
> + ctx = perf_event_ctx_lock(event);
> + _perf_event_disable(event);
> + count = __perf_event_read_value(event, &enabled, &running);
> + if (reset)
> + local64_set(&event->count, 0);
This local64_set() already assumes there are no child events, so maybe
write the thing like:
WARN_ON_ONCE(event->attr.inherit);
_perf_event_disable(event);
count = local64_read(&event->count);
local64_set(&event->count, 0);
> + perf_event_ctx_unlock(event, ctx);
> +
> + return count;
> +}
> +EXPORT_SYMBOL_GPL(perf_event_pause);
On 2019/10/14 19:51, Peter Zijlstra wrote:
> On Sun, Oct 13, 2019 at 05:15:31PM +0800, Like Xu wrote:
>> Exporting perf_event_pause() as an external accessor for kernel users (such
>> as KVM) who may do both disable perf_event and read count with just one
>> time to hold perf_event_ctx_lock. Also the value could be reset optionally.
>
>> +u64 perf_event_pause(struct perf_event *event, bool reset)
>> +{
>> + struct perf_event_context *ctx;
>> + u64 count, enabled, running;
>> +
>> + ctx = perf_event_ctx_lock(event);
>
>> + _perf_event_disable(event);
>> + count = __perf_event_read_value(event, &enabled, &running);
>> + if (reset)
>> + local64_set(&event->count, 0);
>
> This local64_set() already assumes there are no child events, so maybe
> write the thing like:
>
> WARN_ON_ONCE(event->attr.inherit);
> _perf_event_disable(event);
> count = local64_read(&event->count);
> local64_set(&event->count, 0);
>
Thanks. It looks good to me and I will apply this.
>
>> + perf_event_ctx_unlock(event, ctx);
>> +
>> + return count;
>> +}
>> +EXPORT_SYMBOL_GPL(perf_event_pause);
>
On 2019/10/13 17:15, Like Xu wrote:
...
Hi Paolo,
Do you have any concerns or comments for the KVM codes (the last two
patches) about this vPMU efficiency optimization?
Thanks,
Like Xu
>
> ---
> Changes in v2:
> - use perf_event_pause() to disable, read, reset by only one lock;
> - use __perf_event_read_value() after _perf_event_disable();
> - replace bitfields with 'u8 event_count; bool need_cleanup;';
> - refine comments and commit messages;
> - fix two issues reported by kbuild test robot for ARCH=[nds32|sh]
>
> v1:
> https://lore.kernel.org/kvm/[email protected]/
>
> Like Xu (4):
> perf/core: Provide a kernel-internal interface to recalibrate event
> period
> perf/core: Provide a kernel-internal interface to pause perf_event
> KVM: x86/vPMU: Reuse perf_event to avoid unnecessary
> pmc_reprogram_counter
> KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC
>
> arch/x86/include/asm/kvm_host.h | 17 +++++++
> arch/x86/kvm/pmu.c | 88 ++++++++++++++++++++++++++++++++-
> arch/x86/kvm/pmu.h | 15 +++++-
> arch/x86/kvm/pmu_amd.c | 14 ++++++
> arch/x86/kvm/vmx/pmu_intel.c | 27 ++++++++++
> arch/x86/kvm/x86.c | 12 +++++
> include/linux/perf_event.h | 10 ++++
> kernel/events/core.c | 44 ++++++++++++++---
> 8 files changed, 216 insertions(+), 11 deletions(-)
>
Just a naming tweak:
On 13/10/19 11:15, Like Xu wrote:
> + /* the exact requested config to create perf_event */
> + u64 programed_config;
/*
* eventsel value for general purpose counters, ctrl value for
* fixed counters.
*/
u64 current_config;
Hi Paolo,
On 2019/10/21 16:12, Paolo Bonzini wrote:
> Just a naming tweak:
>
> On 13/10/19 11:15, Like Xu wrote:
>> + /* the exact requested config to create perf_event */
>> + u64 programed_config;
>
> /*
> * eventsel value for general purpose counters, ctrl value for
> * fixed counters.
> */
> u64 current_config;
>
>
It looks good to me and I'll apply this.
Is there more need for improvement?
On 13/10/19 11:15, Like Xu wrote:
> Currently, a host perf_event is created for a vPMC functionality emulation.
> It’s unpredictable to determine if a disabled perf_event will be reused.
> If they are disabled and are not reused for a considerable period of time,
> those obsolete perf_events would increase host context switch overhead that
> could have been avoided.
>
> If the guest doesn't access (set_msr/get_msr/rdpmc) any of the vPMC's MSRs
> during an entire vcpu sched time slice, and its independent enable bit of
> the vPMC isn't set, we can predict that the guest has finished the use of
> this vPMC, and then it's time to release non-reused perf_events in the
> first call of vcpu_enter_guest() after the vcpu gets next scheduled in.
>
> This lazy mechanism delays the event release time to the beginning of the
> next scheduled time slice if vPMC's MSRs aren't accessed during this time
> slice. If guest comes back to use this vPMC in next time slice, a new perf
> event would be re-created via perf_event_create_kernel_counter() as usual.
>
> Suggested-by: Wei W Wang <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 15 ++++++++++++
> arch/x86/kvm/pmu.c | 43 +++++++++++++++++++++++++++++++++
> arch/x86/kvm/pmu.h | 3 +++
> arch/x86/kvm/pmu_amd.c | 13 ++++++++++
> arch/x86/kvm/vmx/pmu_intel.c | 25 +++++++++++++++++++
> arch/x86/kvm/x86.c | 12 +++++++++
> 6 files changed, 111 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1abbbbae4953..45f9cdae150b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -472,6 +472,21 @@ struct kvm_pmu {
> struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
> struct irq_work irq_work;
> u64 reprogram_pmi;
> +
> + /* for vPMC being set, do not released its perf_event (if any) */
> + u64 lazy_release_ctrl;
Please use DECLARE_BITMAP(lazy_release_ctrl, X86_PMC_IDX_MAX). I would
also rename the bitmap to pmc_in_use.
I know you're just copying what reprogram_pmi does, but that has to be
fixed too. :) I'll send a patch now.
> + /*
> + * The gate to release perf_events not marked in
> + * lazy_release_ctrl only once in a vcpu time slice.
> + */
> + bool need_cleanup;
> +
> + /*
> + * The total number of programmed perf_events and it helps to avoid
> + * redundant check before cleanup if guest don't use vPMU at all.
> + */
> + u8 event_count;
> };
>
> struct kvm_pmu_ops;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 09d1a03c057c..7ab262f009f6 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -137,6 +137,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
> }
>
> pmc->perf_event = event;
> + pmc_to_pmu(pmc)->event_count++;
> clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
> }
>
> @@ -368,6 +369,7 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
> if (!pmc)
> return 1;
>
> + __set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
> *data = pmc_read_counter(pmc) & mask;
> return 0;
> }
> @@ -385,11 +387,13 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
>
> int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
> {
> + kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr);
Instead of this new pmu_ops entry, please introduce two separate patches
to do the following:
1) rename the existing msr_idx_to_pmc to rdpmc_idx_to_pmc, and
is_valid_msr_idx to is_valid_rdpmc_idx (likewise for
kvm_pmu_is_valid_msr_idx).
2) introduce a new callback msr_idx_to_pmc that returns a struct
kvm_pmc*, and change kvm_pmu_is_valid_msr to do
bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
{
return (kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr) ||
kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr));
}
and AMD can just return false from .is_valid_msr.
Once this change is done, this patch can use simply:
static int kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr);
__set_bit(pmc->idx, pmu->pmc_in_use);
}
> return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
> }
>
> int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> {
> + kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr_info->index);
> return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
> }
>
> @@ -417,9 +421,48 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
> memset(pmu, 0, sizeof(*pmu));
> kvm_x86_ops->pmu_ops->init(vcpu);
> init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
> + pmu->lazy_release_ctrl = 0;
> + pmu->event_count = 0;
> + pmu->need_cleanup = false;
> kvm_pmu_refresh(vcpu);
> }
>
> +static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
> +{
> + struct kvm_pmu *pmu = pmc_to_pmu(pmc);
> +
> + if (pmc_is_fixed(pmc))
> + return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
> + pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
> +
> + return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
> +}
> +
> +void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> + struct kvm_pmc *pmc = NULL;
> + u64 bitmask = ~pmu->lazy_release_ctrl;
DECLARE_BITMAP(bitmask, X86_PMC_IDX_MAX);
> + int i;
> +
> + if (!unlikely(pmu->need_cleanup))
> + return;
> +
> + /* do cleanup before the first time of running vcpu after sched_in */
> + pmu->need_cleanup = false;
> +
> + /* release events for unmarked vPMCs in the last sched time slice */
> + for_each_set_bit(i, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
First, you could use for_each_clear_bit instead of inverting.
However, this is iterating unnecessarily through almost all of the 64
bits, most of which will not pass pmc_idx_to_pmc. You can set up a
bitmap like
/* in kvm_pmu_refresh */
bitmap_set(pmu->all_valid_pmc_idx, 0,
pmu->nr_arch_fixed_counters);
bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED,
pmu->nr_arch_gp_counters);
/* on cleanup */
bitmap_andnot(bitmask, pmu->all_valid_pmc_idx,
pmu->pmc_in_use, X86_PMC_IDX_MAX);
for_each_set_bit(i, bitmask, X86_PMC_IDX_MAX) {
...
}
Note that on 64-bit machines the bitmap_andnot will be compiled to a
single "x = y & ~z" expression.
> + pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, i);
> +
> + if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
> + pmc_stop_counter(pmc);
> + }
> +
> + /* reset vPMC lazy-release bitmap for this sched time slice */
> + pmu->lazy_release_ctrl = 0;
> +}
> +
> void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
> {
> kvm_pmu_reset(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 661e2bf38526..023ea5efb3bb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8080,6 +8080,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> goto cancel_injection;
> }
>
> + /*
> + * vPMU uses a lazy method to release the perf_events created for
> + * features emulation when the related MSRs weren't accessed during
> + * last vcpu time slice. Technically, this cleanup check happens on
> + * the first call of vcpu_enter_guest after the vcpu gets scheduled in.
> + */
> + kvm_pmu_cleanup(vcpu);
Instead of calling this unconditionally from vcpu_enter_guest, please
request KVM_REQ_PMU in kvm_arch_sched_in, and call kvm_pmu_cleanup from
kvm_pmu_handle_event.
Paolo
> preempt_disable();
>
> kvm_x86_ops->prepare_guest_switch(vcpu);
> @@ -9415,7 +9423,11 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>
> void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
> {
> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> +
> vcpu->arch.l1tf_flush_l1d = true;
> + if (pmu->version && unlikely(pmu->event_count))
> + pmu->need_cleanup = true;
> kvm_x86_ops->sched_in(vcpu, cpu);
> }
>
>
Hi Paolo,
On 2019/10/21 16:45, Paolo Bonzini wrote:
> On 13/10/19 11:15, Like Xu wrote:
>> Currently, a host perf_event is created for a vPMC functionality emulation.
>> It’s unpredictable to determine if a disabled perf_event will be reused.
>> If they are disabled and are not reused for a considerable period of time,
>> those obsolete perf_events would increase host context switch overhead that
>> could have been avoided.
>>
>> If the guest doesn't access (set_msr/get_msr/rdpmc) any of the vPMC's MSRs
>> during an entire vcpu sched time slice, and its independent enable bit of
>> the vPMC isn't set, we can predict that the guest has finished the use of
>> this vPMC, and then it's time to release non-reused perf_events in the
>> first call of vcpu_enter_guest() after the vcpu gets next scheduled in.
>>
>> This lazy mechanism delays the event release time to the beginning of the
>> next scheduled time slice if vPMC's MSRs aren't accessed during this time
>> slice. If guest comes back to use this vPMC in next time slice, a new perf
>> event would be re-created via perf_event_create_kernel_counter() as usual.
>>
>> Suggested-by: Wei W Wang <[email protected]>
>> Signed-off-by: Like Xu <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 15 ++++++++++++
>> arch/x86/kvm/pmu.c | 43 +++++++++++++++++++++++++++++++++
>> arch/x86/kvm/pmu.h | 3 +++
>> arch/x86/kvm/pmu_amd.c | 13 ++++++++++
>> arch/x86/kvm/vmx/pmu_intel.c | 25 +++++++++++++++++++
>> arch/x86/kvm/x86.c | 12 +++++++++
>> 6 files changed, 111 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 1abbbbae4953..45f9cdae150b 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -472,6 +472,21 @@ struct kvm_pmu {
>> struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
>> struct irq_work irq_work;
>> u64 reprogram_pmi;
>> +
>> + /* for vPMC being set, do not released its perf_event (if any) */
>> + u64 lazy_release_ctrl;
>
> Please use DECLARE_BITMAP(lazy_release_ctrl, X86_PMC_IDX_MAX). I would
> also rename the bitmap to pmc_in_use.
Thanks for introducing BITMAP and I would definitely use it if I knew.
>
> I know you're just copying what reprogram_pmi does, but that has to be
> fixed too. :) I'll send a patch now.
>
>> + /*
>> + * The gate to release perf_events not marked in
>> + * lazy_release_ctrl only once in a vcpu time slice.
>> + */
>> + bool need_cleanup;
>> +
>> + /*
>> + * The total number of programmed perf_events and it helps to avoid
>> + * redundant check before cleanup if guest don't use vPMU at all.
>> + */
>> + u8 event_count;
>> };
>>
>> struct kvm_pmu_ops;
>> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
>> index 09d1a03c057c..7ab262f009f6 100644
>> --- a/arch/x86/kvm/pmu.c
>> +++ b/arch/x86/kvm/pmu.c
>> @@ -137,6 +137,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
>> }
>>
>> pmc->perf_event = event;
>> + pmc_to_pmu(pmc)->event_count++;
>> clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
>> }
>>
>> @@ -368,6 +369,7 @@ int kvm_pmu_rdpmc(struct kvm_vcpu *vcpu, unsigned idx, u64 *data)
>> if (!pmc)
>> return 1;
>>
>> + __set_bit(pmc->idx, (unsigned long *)&pmu->lazy_release_ctrl);
>> *data = pmc_read_counter(pmc) & mask;
>> return 0;
>> }
>> @@ -385,11 +387,13 @@ bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
>>
>> int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data)
>> {
>> + kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr);
>
> Instead of this new pmu_ops entry, please introduce two separate patches
> to do the following:
>
> 1) rename the existing msr_idx_to_pmc to rdpmc_idx_to_pmc, and
> is_valid_msr_idx to is_valid_rdpmc_idx (likewise for
> kvm_pmu_is_valid_msr_idx).
It looks good to me and I'll apply it.
>
> 2) introduce a new callback msr_idx_to_pmc that returns a struct
> kvm_pmc*, and change kvm_pmu_is_valid_msr to do
For callback msr_idx_to_pmc,
how do we deal with the input 'MSR_CORE_PERF_FIXED_CTR_CTRL'
which may return several fixed kvm_pmcs not just one?
>
> bool kvm_pmu_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr)
> {
> return (kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr) ||
> kvm_x86_ops->pmu_ops->is_valid_msr(vcpu, msr));
> }
>
> and AMD can just return false from .is_valid_msr.
>
> Once this change is done, this patch can use simply:
>
> static int kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
s/static int/static void/.
> {
> struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr);
We need 'if(pmc)' here.
> __set_bit(pmc->idx, pmu->pmc_in_use);
> }
>
>> return kvm_x86_ops->pmu_ops->get_msr(vcpu, msr, data);
>> }
>>
>> int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> {
>> + kvm_x86_ops->pmu_ops->update_lazy_release_ctrl(vcpu, msr_info->index);
>> return kvm_x86_ops->pmu_ops->set_msr(vcpu, msr_info);
>> }
>>
>> @@ -417,9 +421,48 @@ void kvm_pmu_init(struct kvm_vcpu *vcpu)
>> memset(pmu, 0, sizeof(*pmu));
>> kvm_x86_ops->pmu_ops->init(vcpu);
>> init_irq_work(&pmu->irq_work, kvm_pmi_trigger_fn);
>> + pmu->lazy_release_ctrl = 0;
>> + pmu->event_count = 0;
>> + pmu->need_cleanup = false;
>> kvm_pmu_refresh(vcpu);
>> }
>>
>> +static inline bool pmc_speculative_in_use(struct kvm_pmc *pmc)
>> +{
>> + struct kvm_pmu *pmu = pmc_to_pmu(pmc);
>> +
>> + if (pmc_is_fixed(pmc))
>> + return fixed_ctrl_field(pmu->fixed_ctr_ctrl,
>> + pmc->idx - INTEL_PMC_IDX_FIXED) & 0x3;
>> +
>> + return pmc->eventsel & ARCH_PERFMON_EVENTSEL_ENABLE;
>> +}
>> +
>> +void kvm_pmu_cleanup(struct kvm_vcpu *vcpu)
>> +{
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> + struct kvm_pmc *pmc = NULL;
>> + u64 bitmask = ~pmu->lazy_release_ctrl;
>
> DECLARE_BITMAP(bitmask, X86_PMC_IDX_MAX);
>
>> + int i;
>> +
>> + if (!unlikely(pmu->need_cleanup))
>> + return;
>> +
>> + /* do cleanup before the first time of running vcpu after sched_in */
>> + pmu->need_cleanup = false;
>> +
>> + /* release events for unmarked vPMCs in the last sched time slice */
>> + for_each_set_bit(i, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
>
> First, you could use for_each_clear_bit instead of inverting.
>
> However, this is iterating unnecessarily through almost all of the 64
> bits, most of which will not pass pmc_idx_to_pmc. You can set up a
> bitmap like
Nice catch and let's trade space for time.
>
> /* in kvm_pmu_refresh */
> bitmap_set(pmu->all_valid_pmc_idx, 0,
> pmu->nr_arch_fixed_counters);
> bitmap_set(pmu->all_valid_pmc_idx, INTEL_PMC_IDX_FIXED,
> pmu->nr_arch_gp_counters);
>
> /* on cleanup */
> bitmap_andnot(bitmask, pmu->all_valid_pmc_idx,
> pmu->pmc_in_use, X86_PMC_IDX_MAX);
>
> for_each_set_bit(i, bitmask, X86_PMC_IDX_MAX) {
> ...
> }
>
> Note that on 64-bit machines the bitmap_andnot will be compiled to a
> single "x = y & ~z" expression.
Thanks for your tutorial on bitmap APIs, again.
>
>> + pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, i);
>> +
>> + if (pmc && pmc->perf_event && !pmc_speculative_in_use(pmc))
>> + pmc_stop_counter(pmc);
>> + }
>> +
>> + /* reset vPMC lazy-release bitmap for this sched time slice */
>> + pmu->lazy_release_ctrl = 0;
>> +}
>> +
>> void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
>> {
>> kvm_pmu_reset(vcpu);
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 661e2bf38526..023ea5efb3bb 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -8080,6 +8080,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>> goto cancel_injection;
>> }
>>
>> + /*
>> + * vPMU uses a lazy method to release the perf_events created for
>> + * features emulation when the related MSRs weren't accessed during
>> + * last vcpu time slice. Technically, this cleanup check happens on
>> + * the first call of vcpu_enter_guest after the vcpu gets scheduled in.
>> + */
>> + kvm_pmu_cleanup(vcpu);
>
> Instead of calling this unconditionally from vcpu_enter_guest, please
> request KVM_REQ_PMU in kvm_arch_sched_in, and call kvm_pmu_cleanup from
> kvm_pmu_handle_event.
This proposal does make sense to the name 'kvm_pmu_handle_event' and
I'll apply it.
>
> Paolo
>
>> preempt_disable();
>>
>> kvm_x86_ops->prepare_guest_switch(vcpu);
>> @@ -9415,7 +9423,11 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
>>
>> void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
>> {
>> + struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> +
>> vcpu->arch.l1tf_flush_l1d = true;
>> + if (pmu->version && unlikely(pmu->event_count))
>> + pmu->need_cleanup = true;
>> kvm_x86_ops->sched_in(vcpu, cpu);
>> }
>>
>>
>
>
On 21/10/19 16:04, Like Xu wrote:
>
>>
>> 2) introduce a new callback msr_idx_to_pmc that returns a struct
>> kvm_pmc*, and change kvm_pmu_is_valid_msr to do
>
> For callback msr_idx_to_pmc,
> how do we deal with the input 'MSR_CORE_PERF_FIXED_CTR_CTRL'
> which may return several fixed kvm_pmcs not just one?
For RDMSR, do not do anything. For WRMSR, you can handle it in the
set_msr callback.
>>
>> static int kvm_pmu_mark_pmc_in_use(struct kvm_vcpu *vcpu, u32 msr)
>
> s/static int/static void/.
>
>> {
>> struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
>> struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->msr_idx_to_pmc(vcpu, msr);
>
> We need 'if(pmc)' here.
Agreed, sorry for the sloppiness. Never interpret my suggestions as
more than pseudocode. :)
Paolo
Replace the explicit declaration of "u64 reprogram_pmi" with the generic
macro DECLARE_BITMAP for all possible appropriate number of bits.
Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Like Xu <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/kvm/pmu.c | 15 +++++----------
2 files changed, 6 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 50eb430b0ad8..236a876a5a2e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -469,7 +469,7 @@ struct kvm_pmu {
struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
struct irq_work irq_work;
- u64 reprogram_pmi;
+ DECLARE_BITMAP(reprogram_pmi, X86_PMC_IDX_MAX);
};
struct kvm_pmu_ops;
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 46875bbd0419..75e8f9fae031 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -62,8 +62,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
struct kvm_pmc *pmc = perf_event->overflow_handler_context;
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
- if (!test_and_set_bit(pmc->idx,
- (unsigned long *)&pmu->reprogram_pmi)) {
+ if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
}
@@ -76,8 +75,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
struct kvm_pmc *pmc = perf_event->overflow_handler_context;
struct kvm_pmu *pmu = pmc_to_pmu(pmc);
- if (!test_and_set_bit(pmc->idx,
- (unsigned long *)&pmu->reprogram_pmi)) {
+ if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
__set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
@@ -137,7 +135,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
}
pmc->perf_event = event;
- clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
+ clear_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
}
void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
@@ -253,16 +251,13 @@ EXPORT_SYMBOL_GPL(reprogram_counter);
void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
{
struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
- u64 bitmask;
int bit;
- bitmask = pmu->reprogram_pmi;
-
- for_each_set_bit(bit, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
+ for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, bit);
if (unlikely(!pmc || !pmc->perf_event)) {
- clear_bit(bit, (unsigned long *)&pmu->reprogram_pmi);
+ clear_bit(bit, pmu->reprogram_pmi);
continue;
}
--
2.21.0
On 21/10/19 12:55, Like Xu wrote:
> Replace the explicit declaration of "u64 reprogram_pmi" with the generic
> macro DECLARE_BITMAP for all possible appropriate number of bits.
>
> Suggested-by: Paolo Bonzini <[email protected]>
> Signed-off-by: Like Xu <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 2 +-
> arch/x86/kvm/pmu.c | 15 +++++----------
> 2 files changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 50eb430b0ad8..236a876a5a2e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -469,7 +469,7 @@ struct kvm_pmu {
> struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
> struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
> struct irq_work irq_work;
> - u64 reprogram_pmi;
> + DECLARE_BITMAP(reprogram_pmi, X86_PMC_IDX_MAX);
> };
>
> struct kvm_pmu_ops;
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 46875bbd0419..75e8f9fae031 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -62,8 +62,7 @@ static void kvm_perf_overflow(struct perf_event *perf_event,
> struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> struct kvm_pmu *pmu = pmc_to_pmu(pmc);
>
> - if (!test_and_set_bit(pmc->idx,
> - (unsigned long *)&pmu->reprogram_pmi)) {
> + if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
> __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
> kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
> }
> @@ -76,8 +75,7 @@ static void kvm_perf_overflow_intr(struct perf_event *perf_event,
> struct kvm_pmc *pmc = perf_event->overflow_handler_context;
> struct kvm_pmu *pmu = pmc_to_pmu(pmc);
>
> - if (!test_and_set_bit(pmc->idx,
> - (unsigned long *)&pmu->reprogram_pmi)) {
> + if (!test_and_set_bit(pmc->idx, pmu->reprogram_pmi)) {
> __set_bit(pmc->idx, (unsigned long *)&pmu->global_status);
> kvm_make_request(KVM_REQ_PMU, pmc->vcpu);
>
> @@ -137,7 +135,7 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
> }
>
> pmc->perf_event = event;
> - clear_bit(pmc->idx, (unsigned long*)&pmc_to_pmu(pmc)->reprogram_pmi);
> + clear_bit(pmc->idx, pmc_to_pmu(pmc)->reprogram_pmi);
> }
>
> void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> @@ -253,16 +251,13 @@ EXPORT_SYMBOL_GPL(reprogram_counter);
> void kvm_pmu_handle_event(struct kvm_vcpu *vcpu)
> {
> struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> - u64 bitmask;
> int bit;
>
> - bitmask = pmu->reprogram_pmi;
> -
> - for_each_set_bit(bit, (unsigned long *)&bitmask, X86_PMC_IDX_MAX) {
> + for_each_set_bit(bit, pmu->reprogram_pmi, X86_PMC_IDX_MAX) {
> struct kvm_pmc *pmc = kvm_x86_ops->pmu_ops->pmc_idx_to_pmc(pmu, bit);
>
> if (unlikely(!pmc || !pmc->perf_event)) {
> - clear_bit(bit, (unsigned long *)&pmu->reprogram_pmi);
> + clear_bit(bit, pmu->reprogram_pmi);
> continue;
> }
>
>
Queued, thanks.
Paolo