2011-05-11 16:04:59

by Avi Kivity

Subject: [PATCH v1 0/5] KVM in-guest performance monitoring

This not-for-merging patchset exposes an emulated version 1 architectural
performance monitoring unit to KVM guests. The PMU is emulated using
perf_events, so the host kernel can multiplex host-wide, host-user, and the
guest on available resources.

Caveats:
- counters that have PMI (interrupt) enabled stop counting after the
interrupt is signalled. This is because we need one-shot samples
that keep counting, which perf doesn't support yet
- some combinations of INV and CMASK are not supported
- counters keep on counting in the host as well as the guest
- the RDPMC instruction and CR4.PCE bit are not yet emulated
- there is likely a bug in the implementation; running 'perf top' in
a guest that spends 80% of its time in userspace shows perf itself
as consuming almost all cpu

perf maintainers: please consider the first three patches for merging (the
first two make sense even without the rest). If you're familiar with the Intel
PMU, please review patch 5 as well - it effectively undoes all your work
of abstracting the PMU into perf_events by unabstracting perf_events into what
is hoped is a very similar PMU.

Avi Kivity (5):
perf: add context parameter to perf_event overflow handler
x86, perf: add constraints for architectural PMU v1
perf: export perf_event_refresh() to modules
KVM: Expose kvm_lapic_local_deliver()
KVM: Expose a version 1 architectural PMU to guests

arch/arm/kernel/ptrace.c | 10 +-
arch/powerpc/kernel/ptrace.c | 4 +-
arch/sh/kernel/ptrace_32.c | 5 +-
arch/x86/include/asm/kvm_host.h | 29 ++++
arch/x86/kernel/cpu/perf_event_intel.c | 23 +++-
arch/x86/kernel/kgdb.c | 2 +-
arch/x86/kernel/ptrace.c | 5 +-
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/lapic.c | 2 +-
arch/x86/kvm/lapic.h | 1 +
arch/x86/kvm/pmu.c | 248 +++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 16 +-
drivers/oprofile/oprofile_perf.c | 5 +-
include/linux/hw_breakpoint.h | 10 +-
include/linux/perf_event.h | 13 ++-
kernel/events/core.c | 27 +++-
kernel/events/hw_breakpoint.c | 10 +-
kernel/watchdog.c | 7 +-
samples/hw_breakpoint/data_breakpoint.c | 5 +-
19 files changed, 377 insertions(+), 47 deletions(-)
create mode 100644 arch/x86/kvm/pmu.c

--
1.7.4.3


2011-05-11 15:55:59

by Avi Kivity

Subject: [PATCH v1 1/5] perf: add context parameter to perf_event overflow handler

The perf_event overflow handler does not receive any caller-derived
argument, so many callers need to resort to looking up the perf_event
in their local data structure. This is ugly and doesn't scale if a
single callback services many perf_events.

Fix by adding a context parameter to perf_event_create_kernel_counter()
(and the derived hardware breakpoint APIs) and passing it to the
callback. All callers are updated.
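For illustration, the resulting convention looks roughly like this (a
minimal sketch; my_overflow, my_data and the handling code are made-up
names, not part of the patch):

static void my_overflow(void *context, struct perf_event *event, int nmi,
			struct perf_sample_data *data, struct pt_regs *regs)
{
	/* the pointer passed at creation time comes back as 'context' */
	struct my_data *d = context;

	d->overflows++;
}

	...
	event = perf_event_create_kernel_counter(&attr, cpu, NULL,
						 my_overflow, d);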

Signed-off-by: Avi Kivity <[email protected]>
---
arch/arm/kernel/ptrace.c | 10 ++++++----
arch/powerpc/kernel/ptrace.c | 4 ++--
arch/sh/kernel/ptrace_32.c | 5 +++--
arch/x86/kernel/kgdb.c | 2 +-
arch/x86/kernel/ptrace.c | 5 +++--
drivers/oprofile/oprofile_perf.c | 5 +++--
include/linux/hw_breakpoint.h | 10 ++++++++--
include/linux/perf_event.h | 8 ++++++--
kernel/events/core.c | 24 +++++++++++++++++-------
kernel/events/hw_breakpoint.c | 10 +++++++---
kernel/watchdog.c | 7 +++++--
samples/hw_breakpoint/data_breakpoint.c | 5 +++--
12 files changed, 64 insertions(+), 31 deletions(-)

diff --git a/arch/arm/kernel/ptrace.c b/arch/arm/kernel/ptrace.c
index 8182f45..1d6c4ae 100644
--- a/arch/arm/kernel/ptrace.c
+++ b/arch/arm/kernel/ptrace.c
@@ -497,9 +497,10 @@ static long ptrace_hbp_idx_to_num(int idx)
/*
* Handle hitting a HW-breakpoint.
*/
-static void ptrace_hbptriggered(struct perf_event *bp, int unused,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+static void ptrace_hbptriggered(void *context,
+ struct perf_event *bp, int unused,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
struct arch_hw_breakpoint *bkpt = counter_arch_bp(bp);
long num;
@@ -580,7 +581,8 @@ static struct perf_event *ptrace_hbp_create(struct task_struct *tsk, int type)
attr.bp_type = type;
attr.disabled = 1;

- return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, tsk);
+ return register_user_hw_breakpoint(&attr, ptrace_hbptriggered, NULL,
+ tsk);
}

static int ptrace_gethbpregs(struct task_struct *tsk, long num,
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index a6ae1cf..4d42f06 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -878,7 +878,7 @@ void user_disable_single_step(struct task_struct *task)
}

#ifdef CONFIG_HAVE_HW_BREAKPOINT
-void ptrace_triggered(struct perf_event *bp, int nmi,
+void ptrace_triggered(void *context, struct perf_event *bp, int nmi,
struct perf_sample_data *data, struct pt_regs *regs)
{
struct perf_event_attr attr;
@@ -969,7 +969,7 @@ int ptrace_set_debugreg(struct task_struct *task, unsigned long addr,
&attr.bp_type);

thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
- ptrace_triggered, task);
+ ptrace_triggered, NULL, task);
if (IS_ERR(bp)) {
thread->ptrace_bps[0] = NULL;
ptrace_put_breakpoints(task);
diff --git a/arch/sh/kernel/ptrace_32.c b/arch/sh/kernel/ptrace_32.c
index 3d7b209..b08c8b0 100644
--- a/arch/sh/kernel/ptrace_32.c
+++ b/arch/sh/kernel/ptrace_32.c
@@ -63,7 +63,7 @@ static inline int put_stack_long(struct task_struct *task, int offset,
return 0;
}

-void ptrace_triggered(struct perf_event *bp, int nmi,
+void ptrace_triggered(void *context, struct perf_event *bp, int nmi,
struct perf_sample_data *data, struct pt_regs *regs)
{
struct perf_event_attr attr;
@@ -91,7 +91,8 @@ static int set_single_step(struct task_struct *tsk, unsigned long addr)
attr.bp_len = HW_BREAKPOINT_LEN_2;
attr.bp_type = HW_BREAKPOINT_R;

- bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+ bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+ NULL, tsk);
if (IS_ERR(bp))
return PTR_ERR(bp);

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 5f9ecff..473ab53 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -638,7 +638,7 @@ void kgdb_arch_late(void)
for (i = 0; i < HBP_NUM; i++) {
if (breakinfo[i].pev)
continue;
- breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL);
+ breakinfo[i].pev = register_wide_hw_breakpoint(&attr, NULL, NULL);
if (IS_ERR((void * __force)breakinfo[i].pev)) {
printk(KERN_ERR "kgdb: Could not allocate hw"
"breakpoints\nDisabling the kernel debugger\n");
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index f65e5b5..418f5c6 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -528,7 +528,7 @@ static int genregs_set(struct task_struct *target,
return ret;
}

-static void ptrace_triggered(struct perf_event *bp, int nmi,
+static void ptrace_triggered(void *context, struct perf_event *bp, int nmi,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -715,7 +715,8 @@ static int ptrace_set_breakpoint_addr(struct task_struct *tsk, int nr,
attr.bp_type = HW_BREAKPOINT_W;
attr.disabled = 1;

- bp = register_user_hw_breakpoint(&attr, ptrace_triggered, tsk);
+ bp = register_user_hw_breakpoint(&attr, ptrace_triggered,
+ NULL, tsk);

/*
* CHECKME: the previous code returned -EIO if the addr wasn't
diff --git a/drivers/oprofile/oprofile_perf.c b/drivers/oprofile/oprofile_perf.c
index 9046f7b..68e7f37 100644
--- a/drivers/oprofile/oprofile_perf.c
+++ b/drivers/oprofile/oprofile_perf.c
@@ -31,7 +31,8 @@ static int num_counters;
/*
* Overflow callback for oprofile.
*/
-static void op_overflow_handler(struct perf_event *event, int unused,
+static void op_overflow_handler(void *context,
+ struct perf_event *event, int unused,
struct perf_sample_data *data, struct pt_regs *regs)
{
int id;
@@ -79,7 +80,7 @@ static int op_create_counter(int cpu, int event)

pevent = perf_event_create_kernel_counter(&counter_config[event].attr,
cpu, NULL,
- op_overflow_handler);
+ op_overflow_handler, NULL);

if (IS_ERR(pevent))
return PTR_ERR(pevent);
diff --git a/include/linux/hw_breakpoint.h b/include/linux/hw_breakpoint.h
index d1e55fe..6ae9c63 100644
--- a/include/linux/hw_breakpoint.h
+++ b/include/linux/hw_breakpoint.h
@@ -73,6 +73,7 @@ static inline unsigned long hw_breakpoint_len(struct perf_event *bp)
extern struct perf_event *
register_user_hw_breakpoint(struct perf_event_attr *attr,
perf_overflow_handler_t triggered,
+ void *context,
struct task_struct *tsk);

/* FIXME: only change from the attr, and don't unregister */
@@ -85,11 +86,13 @@ modify_user_hw_breakpoint(struct perf_event *bp, struct perf_event_attr *attr);
extern struct perf_event *
register_wide_hw_breakpoint_cpu(struct perf_event_attr *attr,
perf_overflow_handler_t triggered,
+ void *context,
int cpu);

extern struct perf_event * __percpu *
register_wide_hw_breakpoint(struct perf_event_attr *attr,
- perf_overflow_handler_t triggered);
+ perf_overflow_handler_t triggered,
+ void *context);

extern int register_perf_hw_breakpoint(struct perf_event *bp);
extern int __register_perf_hw_breakpoint(struct perf_event *bp);
@@ -115,6 +118,7 @@ static inline int __init init_hw_breakpoint(void) { return 0; }
static inline struct perf_event *
register_user_hw_breakpoint(struct perf_event_attr *attr,
perf_overflow_handler_t triggered,
+ void *context,
struct task_struct *tsk) { return NULL; }
static inline int
modify_user_hw_breakpoint(struct perf_event *bp,
@@ -122,10 +126,12 @@ modify_user_hw_breakpoint(struct perf_event *bp,
static inline struct perf_event *
register_wide_hw_breakpoint_cpu(struct perf_event_attr *attr,
perf_overflow_handler_t triggered,
+ void *context,
int cpu) { return NULL; }
static inline struct perf_event * __percpu *
register_wide_hw_breakpoint(struct perf_event_attr *attr,
- perf_overflow_handler_t triggered) { return NULL; }
+ perf_overflow_handler_t triggered,
+ void *context) { return NULL; }
static inline int
register_perf_hw_breakpoint(struct perf_event *bp) { return -ENOSYS; }
static inline int
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3412684..de346a0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -709,7 +709,9 @@ struct perf_buffer {

struct perf_sample_data;

-typedef void (*perf_overflow_handler_t)(struct perf_event *, int,
+typedef void (*perf_overflow_handler_t)(void *context,
+ struct perf_event *event,
+ int,
struct perf_sample_data *,
struct pt_regs *regs);

@@ -855,6 +857,7 @@ struct perf_event {
u64 id;

perf_overflow_handler_t overflow_handler;
+ void *overflow_handler_context;

#ifdef CONFIG_EVENT_TRACING
struct ftrace_event_call *tp_event;
@@ -978,7 +981,8 @@ extern struct perf_event *
perf_event_create_kernel_counter(struct perf_event_attr *attr,
int cpu,
struct task_struct *task,
- perf_overflow_handler_t callback);
+ perf_overflow_handler_t callback,
+ void *context);
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0fc34a3..f60b81c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5032,7 +5032,8 @@ static int __perf_event_overflow(struct perf_event *event, int nmi,
}

if (event->overflow_handler)
- event->overflow_handler(event, nmi, data, regs);
+ event->overflow_handler(event->overflow_handler_context, event,
+ nmi, data, regs);
else
perf_event_output(event, nmi, data, regs);

@@ -6158,7 +6159,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
struct task_struct *task,
struct perf_event *group_leader,
struct perf_event *parent_event,
- perf_overflow_handler_t overflow_handler)
+ perf_overflow_handler_t overflow_handler,
+ void *context)
{
struct pmu *pmu;
struct perf_event *event;
@@ -6216,10 +6218,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
#endif
}

- if (!overflow_handler && parent_event)
+ if (!overflow_handler && parent_event) {
overflow_handler = parent_event->overflow_handler;
+ context = parent_event->overflow_handler_context;
+ }

event->overflow_handler = overflow_handler;
+ event->overflow_handler_context = context;

if (attr->disabled)
event->state = PERF_EVENT_STATE_OFF;
@@ -6486,7 +6491,8 @@ SYSCALL_DEFINE5(perf_event_open,
}
}

- event = perf_event_alloc(&attr, cpu, task, group_leader, NULL, NULL);
+ event = perf_event_alloc(&attr, cpu, task, group_leader, NULL,
+ NULL, NULL);
if (IS_ERR(event)) {
err = PTR_ERR(event);
goto err_task;
@@ -6671,7 +6677,8 @@ err_fd:
struct perf_event *
perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
struct task_struct *task,
- perf_overflow_handler_t overflow_handler)
+ perf_overflow_handler_t overflow_handler,
+ void *context)
{
struct perf_event_context *ctx;
struct perf_event *event;
@@ -6681,7 +6688,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
* Get the target context (task or percpu):
*/

- event = perf_event_alloc(attr, cpu, task, NULL, NULL, overflow_handler);
+ event = perf_event_alloc(attr, cpu, task, NULL, NULL,
+ overflow_handler, context);
if (IS_ERR(event)) {
err = PTR_ERR(event);
goto err;
@@ -6965,7 +6973,7 @@ inherit_event(struct perf_event *parent_event,
parent_event->cpu,
child,
group_leader, parent_event,
- NULL);
+ NULL, NULL);
if (IS_ERR(child_event))
return child_event;
get_ctx(child_ctx);
@@ -6992,6 +7000,8 @@ inherit_event(struct perf_event *parent_event,

child_event->ctx = child_ctx;
child_event->overflow_handler = parent_event->overflow_handler;
+ child_event->overflow_handler_context
+ = parent_event->overflow_handler_context;

/*
* Precalculate sample_data sizes
diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c
index 086adf2..b7971d6 100644
--- a/kernel/events/hw_breakpoint.c
+++ b/kernel/events/hw_breakpoint.c
@@ -431,9 +431,11 @@ int register_perf_hw_breakpoint(struct perf_event *bp)
struct perf_event *
register_user_hw_breakpoint(struct perf_event_attr *attr,
perf_overflow_handler_t triggered,
+ void *context,
struct task_struct *tsk)
{
- return perf_event_create_kernel_counter(attr, -1, tsk, triggered);
+ return perf_event_create_kernel_counter(attr, -1, tsk, triggered,
+ context);
}
EXPORT_SYMBOL_GPL(register_user_hw_breakpoint);

@@ -502,7 +504,8 @@ EXPORT_SYMBOL_GPL(unregister_hw_breakpoint);
*/
struct perf_event * __percpu *
register_wide_hw_breakpoint(struct perf_event_attr *attr,
- perf_overflow_handler_t triggered)
+ perf_overflow_handler_t triggered,
+ void *context)
{
struct perf_event * __percpu *cpu_events, **pevent, *bp;
long err;
@@ -515,7 +518,8 @@ register_wide_hw_breakpoint(struct perf_event_attr *attr,
get_online_cpus();
for_each_online_cpu(cpu) {
pevent = per_cpu_ptr(cpu_events, cpu);
- bp = perf_event_create_kernel_counter(attr, cpu, NULL, triggered);
+ bp = perf_event_create_kernel_counter(attr, cpu, NULL,
+ triggered, context);

*pevent = bp;

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 14733d4..e4502a9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -198,7 +198,8 @@ static struct perf_event_attr wd_hw_attr = {
};

/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event, int nmi,
+static void watchdog_overflow_callback(void *context,
+ struct perf_event *event, int nmi,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -360,7 +361,9 @@ static int watchdog_nmi_enable(int cpu)
/* Try to register using hardware perf events */
wd_attr = &wd_hw_attr;
wd_attr->sample_period = hw_nmi_get_sample_period();
- event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback);
+ event = perf_event_create_kernel_counter(wd_attr, cpu, NULL,
+ watchdog_overflow_callback,
+ NULL);
if (!IS_ERR(event)) {
printk(KERN_INFO "NMI watchdog enabled, takes one hw-pmu counter.\n");
goto out_save;
diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c
index 0636539..0635bf3 100644
--- a/samples/hw_breakpoint/data_breakpoint.c
+++ b/samples/hw_breakpoint/data_breakpoint.c
@@ -41,7 +41,8 @@ module_param_string(ksym, ksym_name, KSYM_NAME_LEN, S_IRUGO);
MODULE_PARM_DESC(ksym, "Kernel symbol to monitor; this module will report any"
" write operations on the kernel symbol");

-static void sample_hbp_handler(struct perf_event *bp, int nmi,
+static void sample_hbp_handler(void *context,
+ struct perf_event *bp, int nmi,
struct perf_sample_data *data,
struct pt_regs *regs)
{
@@ -60,7 +61,7 @@ static int __init hw_break_module_init(void)
attr.bp_len = HW_BREAKPOINT_LEN_4;
attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R;

- sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler);
+ sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL);
if (IS_ERR((void __force *)sample_hbp)) {
ret = PTR_ERR((void __force *)sample_hbp);
goto fail;
--
1.7.4.3

2011-05-11 15:55:55

by Avi Kivity

Subject: [PATCH v1 2/5] x86, perf: add constraints for architectural PMU v1

The v1 PMU does not have any fixed counters. Using the v2 constraints,
which do have fixed counters, causes an additional choice to be present
in the weight calculation but not when actually scheduling the event,
leading to an event not being scheduled at all.

Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel.c | 23 ++++++++++++++++++-----
1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 41178c8..b46b70e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -137,6 +137,11 @@ static struct event_constraint intel_westmere_percore_constraints[] __read_mostl
EVENT_CONSTRAINT_END
};

+static struct event_constraint intel_v1_event_constraints[] __read_mostly =
+{
+ EVENT_CONSTRAINT_END
+};
+
static struct event_constraint intel_gen_event_constraints[] __read_mostly =
{
FIXED_EVENT_CONSTRAINT(0x00c0, 0), /* INST_RETIRED.ANY */
@@ -1512,11 +1517,19 @@ static __init int intel_pmu_init(void)
break;

default:
- /*
- * default constraints for v2 and up
- */
- x86_pmu.event_constraints = intel_gen_event_constraints;
- pr_cont("generic architected perfmon, ");
+ switch (x86_pmu.version) {
+ case 1:
+ x86_pmu.event_constraints = intel_v1_event_constraints;
+ pr_cont("generic architected perfmon v1, ");
+ break;
+ default:
+ /*
+ * default constraints for v2 and up
+ */
+ x86_pmu.event_constraints = intel_gen_event_constraints;
+ pr_cont("generic architected perfmon, ");
+ break;
+ }
}
return 0;
}
--
1.7.4.3

2011-05-11 16:08:00

by Avi Kivity

Subject: [PATCH v1 3/5] perf: export perf_event_refresh() to modules

KVM needs one-shot samples, since a PMC programmed to -X will fire after X
events and then again after 2^40 events (i.e. variable period).
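To make the period arithmetic concrete, a stand-alone sketch (assuming
the 40-bit counter width implied above; the values are illustrative):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t mask = (1ULL << 40) - 1;	/* 40-bit PMC */
	uint64_t x = 100000;			/* desired period */
	uint64_t pmc = (0 - x) & mask;		/* guest programs -X */

	/* the first overflow comes after X events ... */
	printf("first period: %llu\n", (unsigned long long)(mask + 1 - pmc));
	/* ... every later one only after a full 2^40 events */
	printf("later periods: %llu\n", (unsigned long long)(mask + 1));
	return 0;
}

Exporting perf_event_refresh() lets KVM arm the event for exactly one
overflow callback instead of relying on this wrap-around behaviour.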

Signed-off-by: Avi Kivity <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 3 ++-
2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index de346a0..b317a8b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -975,6 +975,7 @@ extern void perf_pmu_disable(struct pmu *pmu);
extern void perf_pmu_enable(struct pmu *pmu);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
+extern int perf_event_refresh(struct perf_event *event, int refresh);
extern void perf_event_update_userpage(struct perf_event *event);
extern int perf_event_release_kernel(struct perf_event *event);
extern struct perf_event *
@@ -1170,6 +1171,10 @@ static inline void perf_event_delayed_put(struct task_struct *task) { }
static inline void perf_event_print_debug(void) { }
static inline int perf_event_task_disable(void) { return -EINVAL; }
static inline int perf_event_task_enable(void) { return -EINVAL; }
+static inline int perf_event_refresh(struct perf_event *event, int refresh)
+{
+ return -EINVAL;
+}

static inline void
perf_sw_event(u32 event_id, u64 nr, int nmi,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f60b81c..6f58291 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1747,7 +1747,7 @@ out:
raw_spin_unlock_irq(&ctx->lock);
}

-static int perf_event_refresh(struct perf_event *event, int refresh)
+int perf_event_refresh(struct perf_event *event, int refresh)
{
/*
* not supported on inherited events
@@ -1760,6 +1760,7 @@ static int perf_event_refresh(struct perf_event *event, int refresh)

return 0;
}
+EXPORT_SYMBOL_GPL(perf_event_refresh);

static void ctx_sched_out(struct perf_event_context *ctx,
struct perf_cpu_context *cpuctx,
--
1.7.4.3

2011-05-11 15:55:54

by Avi Kivity

Subject: [PATCH v1 4/5] KVM: Expose kvm_lapic_local_deliver()

Needed to deliver performance monitoring interrupts.

Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/kvm/lapic.c | 2 +-
arch/x86/kvm/lapic.h | 1 +
2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2b2255b..79d4850 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1017,7 +1017,7 @@ int apic_has_pending_timer(struct kvm_vcpu *vcpu)
return 0;
}

-static int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type)
{
u32 reg = apic_get_reg(apic, lvt_type);
int vector, mode, trig_mode;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..09dbe2f 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -33,6 +33,7 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu);
int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest);
int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda);
int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq);
+int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);

u64 kvm_get_apic_base(struct kvm_vcpu *vcpu);
void kvm_set_apic_base(struct kvm_vcpu *vcpu, u64 data);
--
1.7.4.3

2011-05-11 16:08:34

by Avi Kivity

Subject: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

Use perf_events to emulate an architectural PMU, version 1.

Caveats:
- counters that have PMI (interrupt) enabled stop counting after the
interrupt is signalled. This is because we need one-shot samples
that keep counting, which perf doesn't support yet
- some combinations of INV and CMASK are not supported
- counters keep on counting in the host as well as the guest

Signed-off-by: Avi Kivity <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 29 +++++
arch/x86/kvm/Makefile | 2 +-
arch/x86/kvm/pmu.c | 248 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/x86.c | 16 ++--
4 files changed, 286 insertions(+), 9 deletions(-)
create mode 100644 arch/x86/kvm/pmu.c

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d2ac8e2..5563fb4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -16,6 +16,7 @@
#include <linux/mmu_notifier.h>
#include <linux/tracepoint.h>
#include <linux/cpumask.h>
+#include <linux/irq_work.h>

#include <linux/kvm.h>
#include <linux/kvm_para.h>
@@ -291,6 +292,24 @@ struct kvm_mmu {
u64 pdptrs[4]; /* pae */
};

+#define KVM_PMU_MAX_GENERAL_PURPOSE_COUNTERS 4
+
+struct kvm_pmc {
+ u64 counter;
+ u64 eventsel;
+ struct perf_event *perf_event;
+ struct kvm_vcpu *vcpu;
+};
+
+struct kvm_pmu {
+ unsigned nr_arch_gp_counters;
+ unsigned available_event_types;
+ u64 counter_bitmask;
+ u8 version;
+ struct kvm_pmc gp_counters[KVM_PMU_MAX_GENERAL_PURPOSE_COUNTERS];
+ struct irq_work irq_work;
+};
+
struct kvm_vcpu_arch {
/*
* rip and regs accesses must go through
@@ -419,6 +438,8 @@ struct kvm_vcpu_arch {
u64 mcg_ctl;
u64 *mce_banks;

+ struct kvm_pmu pmu;
+
/* used for guest single stepping over the given code position */
unsigned long singlestep_rip;

@@ -870,4 +891,12 @@ extern bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn);

void kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);

+void kvm_pmu_init(struct kvm_vcpu *vcpu);
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu);
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu);
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr);
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index f15501f..cfca03f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -12,7 +12,7 @@ kvm-$(CONFIG_IOMMU_API) += $(addprefix ../../../virt/kvm/, iommu.o)
kvm-$(CONFIG_KVM_ASYNC_PF) += $(addprefix ../../../virt/kvm/, async_pf.o)

kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \
- i8254.o timer.o
+ i8254.o timer.o pmu.o
kvm-intel-y += vmx.o
kvm-amd-y += svm.o

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
new file mode 100644
index 0000000..fb36d35
--- /dev/null
+++ b/arch/x86/kvm/pmu.c
@@ -0,0 +1,248 @@
+/*
+ * Kernel-based Virtual Machine -- Performance Monitoring Unit support
+ *
+ * Copyright 2011 Red Hat, Inc. and/or its affiliates.
+ *
+ * Authors:
+ * Avi Kivity <[email protected]>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/kvm_host.h>
+#include <linux/perf_event.h>
+#include "x86.h"
+#include "pmu.h"
+#include "lapic.h"
+
+static struct kvm_arch_event_perf_mapping {
+ u8 eventsel;
+ u8 unit_mask;
+ unsigned event_type;
+ bool inexact;
+} arch_events[] = {
+ /* Index must match CPUID 0x0A.EBX bit vector */
+ [0] = { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
+ [1] = { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
+ [2] = { 0x3c, 0x01, PERF_COUNT_HW_BUS_CYCLES },
+ [3] = { 0x2e, 0x4f, PERF_COUNT_HW_CACHE_REFERENCES },
+ [4] = { 0x2e, 0x41, PERF_COUNT_HW_CACHE_MISSES },
+ [5] = { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
+ [6] = { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
+};
+
+static inline struct kvm_pmc *get_gp_pmc(struct kvm_pmu *pmu, u32 msr,
+ u32 base)
+{
+ if (msr >= base && msr < base + pmu->nr_arch_gp_counters)
+ return &pmu->gp_counters[msr - base];
+ return NULL;
+}
+
+static void __kvm_perf_overflow(struct irq_work *irq_work)
+{
+ struct kvm_pmu *pmu = container_of(irq_work, struct kvm_pmu, irq_work);
+ struct kvm_vcpu *vcpu = container_of(pmu, struct kvm_vcpu, arch.pmu);
+
+ if (vcpu->arch.apic)
+ kvm_apic_local_deliver(vcpu->arch.apic, APIC_LVTPC);
+}
+
+static void kvm_perf_overflow(void *_pmc, struct perf_event *perf_event,
+ int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct kvm_pmc *pmc = _pmc;
+
+ irq_work_queue(&pmc->vcpu->arch.pmu.irq_work);
+}
+
+static u64 read_gp_pmc(struct kvm_pmu *pmu, struct kvm_pmc *pmc)
+{
+ u64 counter, enabled, running;
+
+ counter = pmc->counter;
+
+ if (pmc->perf_event)
+ counter += perf_event_read_value(pmc->perf_event,
+ &enabled, &running);
+
+ /* FIXME: Scaling needed? */
+
+ return counter & pmu->counter_bitmask;
+}
+
+static int reprogram_gp_counter(struct kvm_pmu *pmu, struct kvm_pmc *pmc,
+ u64 eventsel)
+{
+ struct perf_event_attr attr = { };
+ struct perf_event *event;
+ int i;
+ u8 event_select, unit_mask, cmask;
+ perf_overflow_handler_t callback = NULL;
+ bool inv;
+
+ if (pmc->perf_event) {
+ pmc->counter = read_gp_pmc(pmu, pmc);
+ perf_event_release_kernel(pmc->perf_event);
+ pmc->perf_event = NULL;
+ irq_work_sync(&pmu->irq_work);
+ pmc->eventsel = eventsel;
+ }
+
+ if (!(eventsel & ARCH_PERFMON_EVENTSEL_ENABLE))
+ return 0;
+
+ attr.type = PERF_TYPE_HARDWARE;
+ attr.size = sizeof(attr);
+ attr.exclude_idle = true;
+
+ event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
+ unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
+
+ for (i = 0; i < ARRAY_SIZE(arch_events); ++i) {
+ if (arch_events[i].eventsel == event_select
+ && arch_events[i].unit_mask == unit_mask
+ && (pmu->available_event_types & (1 << i))) {
+ attr.config = arch_events[i].event_type;
+ break;
+ }
+ }
+ if (i == ARRAY_SIZE(arch_events))
+ return 1;
+
+ attr.exclude_user = !(eventsel & ARCH_PERFMON_EVENTSEL_USR);
+ attr.exclude_kernel = !(eventsel & ARCH_PERFMON_EVENTSEL_OS);
+
+ if (eventsel & ARCH_PERFMON_EVENTSEL_EDGE)
+ printk_once("kvm: pmu ignoring edge bit\n");
+
+ if (eventsel & ARCH_PERFMON_EVENTSEL_INT) {
+ callback = kvm_perf_overflow;
+ attr.disabled = true;
+ }
+
+ inv = eventsel & ARCH_PERFMON_EVENTSEL_INV;
+ cmask = (eventsel & ARCH_PERFMON_EVENTSEL_CMASK) >> 24;
+
+ pmc->eventsel = eventsel;
+
+ if (inv || cmask > 1) {
+ printk_once("kvm: pmu ignoring difficult inv/cmask combo\n");
+ return 0;
+ }
+
+ attr.sample_period = (-pmc->counter) & pmu->counter_bitmask;
+
+ event = perf_event_create_kernel_counter(&attr, -1, current,
+ callback, pmc);
+ if (IS_ERR(event))
+ return PTR_ERR(event);
+
+ if (callback)
+ perf_event_refresh(event, 1);
+
+ pmc->perf_event = event;
+ return 0;
+}
+
+bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
+{
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+ return get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)
+ || get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0);
+}
+
+int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+ struct kvm_pmc *pmc;
+
+ if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0))) {
+ *data = read_gp_pmc(pmu, pmc);
+ return 0;
+ } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
+ *data = pmc->eventsel;
+ return 0;
+ }
+ return 1;
+}
+
+int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+ struct kvm_pmc *pmc;
+
+ if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0))) {
+ data = (s64)(s32)data;
+ pmc->counter += data - read_gp_pmc(pmu, pmc);
+ return 0;
+ } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
+ if (data == pmc->eventsel)
+ return 0;
+ if (data & 0xffffffff00200000ULL)
+ return 1;
+ return reprogram_gp_counter(pmu, pmc, data);
+ }
+ return 1;
+}
+
+int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data)
+{
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+ if (pmc >= pmu->nr_arch_gp_counters)
+ return 1;
+ *data = read_gp_pmc(pmu, &pmu->gp_counters[pmc]);
+ return 0;
+}
+
+void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+ struct kvm_cpuid_entry2 *entry;
+ unsigned bitmap_len;
+
+ pmu->nr_arch_gp_counters = 0;
+ pmu->version = 0;
+ entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
+ if (!entry)
+ return;
+ pmu->version = entry->eax & 0xff;
+ pmu->nr_arch_gp_counters = min((int)(entry->eax >> 8) & 0xff,
+ KVM_PMU_MAX_GENERAL_PURPOSE_COUNTERS);
+ pmu->counter_bitmask = ((u64)1 << ((entry->eax >> 16) & 0xff)) - 1;
+ bitmap_len = (entry->eax >> 24) & 0xff;
+ pmu->available_event_types = ~entry->ebx & ((1ULL << bitmap_len) - 1);
+}
+
+void kvm_pmu_init(struct kvm_vcpu *vcpu)
+{
+ int i;
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+
+ memset(pmu, 0, sizeof(*pmu));
+ for (i = 0; i < KVM_PMU_MAX_GENERAL_PURPOSE_COUNTERS; ++i)
+ pmu->gp_counters[i].vcpu = vcpu;
+ init_irq_work(&pmu->irq_work, __kvm_perf_overflow);
+ kvm_pmu_cpuid_update(vcpu);
+}
+
+void kvm_pmu_destroy(struct kvm_vcpu *vcpu)
+{
+ struct kvm_pmu *pmu = &vcpu->arch.pmu;
+ struct kvm_pmc *pmc;
+ int i;
+
+ irq_work_sync(&pmu->irq_work);
+ for (i = 0; i < KVM_PMU_MAX_GENERAL_PURPOSE_COUNTERS; ++i) {
+ pmc = &pmu->gp_counters[i];
+ if (pmc->perf_event)
+ perf_event_release_kernel(pmc->perf_event);
+ }
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 77c9d867..b3c609e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -593,6 +593,8 @@ static void update_cpuid(struct kvm_vcpu *vcpu)
if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE))
best->ecx |= bit(X86_FEATURE_OSXSAVE);
}
+
+ kvm_pmu_cpuid_update(vcpu);
}

int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
@@ -1561,8 +1563,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
* which we perfectly emulate ;-). Any other value should be at least
* reported, some guests depend on them.
*/
- case MSR_P6_EVNTSEL0:
- case MSR_P6_EVNTSEL1:
case MSR_K7_EVNTSEL0:
case MSR_K7_EVNTSEL1:
case MSR_K7_EVNTSEL2:
@@ -1574,8 +1574,6 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
/* at least RHEL 4 unconditionally writes to the perfctr registers,
* so we ignore writes to make it happy.
*/
- case MSR_P6_PERFCTR0:
- case MSR_P6_PERFCTR1:
case MSR_K7_PERFCTR0:
case MSR_K7_PERFCTR1:
case MSR_K7_PERFCTR2:
@@ -1612,6 +1610,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
default:
if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
return xen_hvm_config(vcpu, data);
+ if (kvm_pmu_msr(vcpu, msr))
+ return kvm_pmu_set_msr(vcpu, msr, data);
if (!ignore_msrs) {
pr_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
msr, data);
@@ -1772,10 +1772,6 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_K8_SYSCFG:
case MSR_K7_HWCR:
case MSR_VM_HSAVE_PA:
- case MSR_P6_PERFCTR0:
- case MSR_P6_PERFCTR1:
- case MSR_P6_EVNTSEL0:
- case MSR_P6_EVNTSEL1:
case MSR_K7_EVNTSEL0:
case MSR_K7_PERFCTR0:
case MSR_K8_INT_PENDING_MSG:
@@ -1877,6 +1873,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = 0xbe702111;
break;
default:
+ if (kvm_pmu_msr(vcpu, msr))
+ return kvm_pmu_get_msr(vcpu, msr, pdata);
if (!ignore_msrs) {
pr_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", msr);
return 1;
@@ -6221,6 +6219,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
goto fail_free_mce_banks;

kvm_async_pf_hash_reset(vcpu);
+ kvm_pmu_init(vcpu);

return 0;
fail_free_mce_banks:
@@ -6239,6 +6238,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
{
int idx;

+ kvm_pmu_destroy(vcpu);
kfree(vcpu->arch.mce_banks);
kvm_free_lapic(vcpu);
idx = srcu_read_lock(&vcpu->kvm->srcu);
--
1.7.4.3

2011-05-12 09:33:12

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Wed, May 11, 2011 at 11:55:28AM -0400, Avi Kivity wrote:
> This not-for-merging patchset exposes an emulated version 1 architectural
> performance monitoring unit to KVM guests. The PMU is emulated using
> perf_events, so the host kernel can multiplex host-wide, host-user, and the
> guest on available resources.
>
> Caveats:
> - counters that have PMI (interrupt) enabled stop counting after the
> interrupt is signalled. This is because we need one-shot samples
> that keep counting, which perf doesn't support yet
> - some combinations of INV and CMASK are not supported
> - counters keep on counting in the host as well as the guest
> - the RDPMC instruction and CR4.PCE bit are not yet emulated
> - there is likely a bug in the implementation; running 'perf top' in
> a guest that spends 80% of its time in userspace shows perf itself
> as consuming almost all cpu
>
> perf maintainers: please consider the first three patches for merging (the
> first two make sense even without the rest). If you're familiar with the Intel
> PMU, please review patch 5 as well - it effectively undoes all your work
> of abstracting the PMU into perf_events by unabstracting perf_events into what
> is hoped is a very similar PMU.

Gaah, I was just about to submit a talk about PMU virtualization for KVM
Forum :)

Anyway, I thought about a paravirt-approach instead of implementing a
real PMU... But there are certainly good reasons for both.

Joerg

2011-05-12 09:48:16

by Jan Kiszka

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 2011-05-12 11:33, Joerg Roedel wrote:
> On Wed, May 11, 2011 at 11:55:28AM -0400, Avi Kivity wrote:
>> This not-for-merging patchset exposes an emulated version 1 architectural
>> performance monitoring unit to KVM guests. The PMU is emulated using
>> perf_events, so the host kernel can multiplex host-wide, host-user, and the
>> guest on available resources.
>>
>> Caveats:
>> - counters that have PMI (interrupt) enabled stop counting after the
>> interrupt is signalled. This is because we need one-shot samples
>> that keep counting, which perf doesn't support yet
>> - some combinations of INV and CMASK are not supported
>> - counters keep on counting in the host as well as the guest
>> - the RDPMC instruction and CR4.PCE bit are not yet emulated
>> - there is likely a bug in the implementation; running 'perf top' in
>> a guest that spends 80% of its time in userspace shows perf itself
>> as consuming almost all cpu
>>
>> perf maintainers: please consider the first three patches for merging (the
>> first two make sense even without the rest). If you're familiar with the Intel
>> PMU, please review patch 5 as well - it effectively undoes all your work
>> of abstracting the PMU into perf_events by unabstracting perf_events into what
>> is hoped is a very similar PMU.
>
> Gaah, I was just about to submit a talk about PMU virtualization for KVM
> Forum :)
>
> Anyway, I thought about a paravirt-approach instead of implementing a
> real PMU... But there are certainly good reasons for both.

Paravirt is taking away the pressure from CPU vendors to do their virt
extensions properly - and doesn't help with unmodifiable OSes.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

2011-05-12 09:51:34

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/12/2011 12:33 PM, Joerg Roedel wrote:
> On Wed, May 11, 2011 at 11:55:28AM -0400, Avi Kivity wrote:
> > This not-for-merging patchset exposes an emulated version 1 architectural
> > performance monitoring unit to KVM guests. The PMU is emulated using
> > perf_events, so the host kernel can multiplex host-wide, host-user, and the
> > guest on available resources.
> >
> > Caveats:
> > - counters that have PMI (interrupt) enabled stop counting after the
> > interrupt is signalled. This is because we need one-shot samples
> > that keep counting, which perf doesn't support yet
> > - some combinations of INV and CMASK are not supported
> > - counters keep on counting in the host as well as the guest
> > - the RDPMC instruction and CR4.PCE bit are not yet emulated
> > - there is likely a bug in the implementation; running 'perf top' in
> > a guest that spends 80% of its time in userspace shows perf itself
> > as consuming almost all cpu
> >
> > perf maintainers: please consider the first three patches for merging (the
> > first two make sense even without the rest). If you're familiar with the Intel
> > PMU, please review patch 5 as well - it effectively undoes all your work
> > of abstracting the PMU into perf_events by unabstracting perf_events into what
> > is hoped is a very similar PMU.
>
> Gaah, I was just about to submit a talk about PMU virtualization for KVM
> Forum :)

Speed matters.

> Anyway, I thought about a paravirt-approach instead of implementing a
> real PMU... But there are certainly good reasons for both.

Note, at this time the architectural PMU is only recognized on an Intel
host.

Is the statement "if an AMD processor returns non-zero information in
cpuid leaf 0xa, then that processor will be compatible with other
vendors' processors reporting the same information" correct?

If so, we can move the detection of the architectural pmu outside the
cpu vendor checks, and this code will work on both AMD and Intel
processors (even if the host cpu doesn't have an architectural PMU).

--
error compiling committee.c: too many arguments to function

2011-05-12 09:54:15

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/12/2011 12:47 PM, Jan Kiszka wrote:
> >
> > Anyway, I thought about a paravirt-approach instead of implementing a
> > real PMU... But there are certainly good reasons for both.
>
> Paravirt is taking away the pressure from CPU vendors to do their virt
> extensions properly - and doesn't help with unmodifiable OSes.

Yes. In the case of the PMU things are less clear, since they are very
model specific. In the case of Linux, you have to lie to the guest and
present a model number that it doesn't recognize, otherwise it prefers
the model-specific PMU to the architectural PMU.

I thought of adding a kvm cpuid bit that says "prefer the architectural
pmu" to work around this issue.

--
error compiling committee.c: too many arguments to function

2011-05-12 13:06:46

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Thu, May 12, 2011 at 12:51:11PM +0300, Avi Kivity wrote:
> On 05/12/2011 12:33 PM, Joerg Roedel wrote:

>> Gaah, I was just about to submit a talk about PMU virtualization for KVM
>> Forum :)
>
> Speed matters.

I'll take that as an argument for a paravirt pmu, because that one is
certainly faster than anything we can emulate on top of perf_events ;-)


> Note, at this time the architectural PMU is only recognized on an Intel
> host.
>
> Is the statement "if an AMD processor returns non-zero information in
> cpuid leaf 0xa, then that processor will be compatible with other
> vendors' processors reporting the same information" correct?

AMD processors don't implement that cpuid leaf.

> If so, we can move the detection of the architectural pmu outside the
> cpu vendor checks, and this code will work on both AMD and Intel
> processors (even if the host cpu doesn't have an architectural PMU).

That's already some kind of paravirtualization. Don't get me wrong, I see
the point of emulating a real pmu in the guest. But on the other side I
think an interface that works across cpu models fits better into the KVM
design, because KVM (as opposed to other hypervisors) tries to hide
details of the host cpu as much as necessary to get migration working
between different cpus.
And since PMUs are, as you said, very model-specific, some abstraction is
needed. In the end probably both ways can be implemented in parallel:

* re-implementing the host-pmu using perf_events for -cpu host
guests
* a paravirt pmu for everybody that wants migration and more
accurate results

Regards,

Joerg

2011-05-12 13:11:32

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Thu, May 12, 2011 at 11:47:51AM +0200, Jan Kiszka wrote:
> On 2011-05-12 11:33, Joerg Roedel wrote:

> > Anyway, I thought about a paravirt-approach instead of implementing a
> > real PMU... But there are certainly good reasons for both.
>
> Paravirt is taking away the pressure from CPU vendors to do their virt
> extensions properly - and doesn't help with unmodifiable OSes.

Seriously, I think such decisions should be technical only and not
political like that. The losers of such political decisions are always
the users because they don't get useful features that are technical
possible.

Regards,

Joerg

2011-05-12 13:24:07

by Jan Kiszka

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 2011-05-12 15:11, Joerg Roedel wrote:
> On Thu, May 12, 2011 at 11:47:51AM +0200, Jan Kiszka wrote:
>> On 2011-05-12 11:33, Joerg Roedel wrote:
>
>>> Anyway, I thought about a paravirt-approach instead of implementing a
>>> real PMU... But there are certainly good reasons for both.
>>
>> Paravirt is taking away the pressure from CPU vendors to do their virt
>> extensions properly - and doesn't help with unmodifiable OSes.
>
> Seriously, I think such decisions should be technical only and not
> political like that. The losers of such political decisions are always
> the users because they don't get useful features that are technical
> possible.

Paravirt remains a workaround, useful until hardware provides a solution
for all guests - often in an even more efficient way (like for
MMU virtualization).

We do not need to block a PV-PMU for Linux guests (or other OSes that
want to adopt it), but it will not solve the problem - that's my point.
A PV-PMU may even be useful to demonstrate the usefulness of a virtual
PMU to the CPU vendors (if they aren't aware of this yet).

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

2011-05-12 13:31:59

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/12/2011 04:11 PM, Joerg Roedel wrote:
> On Thu, May 12, 2011 at 11:47:51AM +0200, Jan Kiszka wrote:
> > On 2011-05-12 11:33, Joerg Roedel wrote:
>
> > > Anyway, I thought about a paravirt-approach instead of implementing a
> > > real PMU... But there are certainly good reasons for both.
> >
> > Paravirt is taking away the pressure from CPU vendors to do their virt
> > extensions properly - and doesn't help with unmodifiable OSes.
>
> Seriously, I think such decisions should be technical only and not
> political like that. The losers of such political decisions are always
> the users because they don't get useful features that are technical
> possible.

I agree. But there are technical advantages to using architectural
features instead of paravirt:

- when the cpu gains support for virtualizing the architectural feature,
we transparently speed the guest up, including support for live
migrating from a deployment that emulates the feature to a deployment
that properly virtualizes the feature, and back. Usually the
virtualized support will beat the pants off any paravirtualization we can do
- following an existing spec is a lot easier to get right than doing
something from scratch
- no need to meticulously document the feature
- easier testing
- existing guest support - only need to write the host side (sometimes
the only one available to us)

We saw all that with the move from shadow paging to nested paging. We
had paravirt support which turned out to be poorly documented, had a
security issue, and would have been slower on npt. In the end we ripped
it out.

Paravirtualizing does have its advantages. For the PMU, for example, we
can have a single hypercall read and reprogram all counters, saving
*many* exits. But I think we need to start from the architectural PMU
and see exactly what the problems are, before we optimize it to death.

--
error compiling committee.c: too many arguments to function

2011-05-12 13:38:45

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/12/2011 04:06 PM, Joerg Roedel wrote:
> On Thu, May 12, 2011 at 12:51:11PM +0300, Avi Kivity wrote:
> > On 05/12/2011 12:33 PM, Joerg Roedel wrote:
>
> >> Gaah, I was just about to submit a talk about PMU virtualization for KVM
> >> Forum :)
> >
> > Speed matters.
>
> I'll take that as an argument for paravirt pmu, because that one is
> certainly faster than anything we can emulate on-top of perf_events ;-)

Correct, though some massaging of perf_events can make it faster. Still
we'll pay with more exits with the architectural PMU.

Note a v2 PMU can reduce the exit count since it has MSRs for
programming many PMCs at once.

> > Note, at this time the architectural PMU is only recognized on an Intel
> > host.
> >
> > Is the statement "if an AMD processor returns non-zero information in
> > cpuid leaf 0xa, then that processor will be compatible with other
> > vendors' processors reporting the same information" correct?
>
> AMD processors don't implement that cpuid leaf.

Right. But if an AMD processor were to implement that leaf, it would do
so in a compatible manner, yes?

That would allow us to do:


- if (vendor == intel && leaf_0xa_indicates_arch_pmu)
+ if (leaf_0xa_indicates_arch_pmu)

> > If so, we can move the detection of the architectural pmu outside the
> > cpu vendor checks, and this code will work on both AMD and Intel
> > processors (even if the host cpu doesn't have an architectural PMU).
>
> That's already some kind of paravirtualization. Don't get me wrong, I see
> the point of emulating a real pmu in the guest. But on the other side I
> think an interface that works across cpu models fits better into the KVM
> design, because KVM (as opposed to other hypervisors) tries to hide
> details of the host cpu as much as necessary to get migration working
> between different cpus.
> And since PMUs are, as you said, very model-specific, some abstraction is
> needed.

The architectural PMU is not model specific.

> In the end probably both ways can be implemented in parallel:
>
> * re-implementing the host-pmu using perf_events for -cpu host
> guests
> * a paravirt pmu for everybody that wants migration and more
> accurate results


A paravirt PMU also has to be implemented on top of perf_events.
Otherwise we can't share this resource. So the only question is what
the interface looks like. The arch pmu is non-optimized, but well
specified and somewhat supported in guests. A paravirt pmu is not so
well specified at this point but can be faster (fewer exits).

--
error compiling committee.c: too many arguments to function

2011-05-12 13:43:42

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Thu, May 12, 2011 at 03:23:39PM +0200, Jan Kiszka wrote:
> On 2011-05-12 15:11, Joerg Roedel wrote:

> > Seriously, I think such decisions should be technical only and not
> > political like that. The losers of such political decisions are always
> > the users because they don't get useful features that are technical
> > possible.
>
> Paravirt remains a workaround, useful until hardware provides a solution
> for all guests - often in an even more efficient way (like for
> MMU virtualization).

Fully agreed. And today's x86 CPUs lack proper support for virtualizing
the PMU. That will hopefully change, but users want the feature today.

> We do not need to block a PV-PMU for Linux guests (or other OSes that
> want to adopt it), but it will not solve the problem - that's my point.
> A PV-PMU may even be useful to demonstrate the usefulness of a virtual
> PMU to the CPU vendors (if they aren't aware of this yet).

Right, if users actually use the virtual PMU this probably increases the
priority for proper hardware support.

Joerg

2011-05-12 14:24:52

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Thu, May 12, 2011 at 04:31:38PM +0300, Avi Kivity wrote:
> - when the cpu gains support for virtualizing the architectural feature,
> we transparently speed the guest up, including support for live
> migrating from a deployment that emulates the feature to a deployment
> that properly virtualizes the feature, and back. Usually the
> virtualized support will beat the pants off any paravirtualization we can
> do
> - following an existing spec is a lot easier to get right than doing
> something from scratch
> - no need to meticulously document the feature

Needs to be done, but not problematic I think.

> - easier testing

Testing shouldn't be different between the two variants, I think.

> - existing guest support - only need to write the host side (sometimes
> the only one available to us)

Otherwise I agree.

> Paravirtualizing does have its advantages. For the PMU, for example, we
> can have a single hypercall read and reprogram all counters, saving
> *many* exits. But I think we need to start from the architectural PMU
> and see exactly what the problems are, before we optimize it to death.

The problem certainly is that with arch-pmu we add a lot of msr-exits to
the guest-context-switch path if it uses per-task profiling. Depending
on the workload this can very much distort the results.

Joerg

2011-05-12 14:29:42

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Thu, May 12, 2011 at 04:38:17PM +0300, Avi Kivity wrote:
> On 05/12/2011 04:06 PM, Joerg Roedel wrote:

>> AMD processors don't implement that cpuid leaf.
>
> Right. But if an AMD processor were to implement that leaf, it would do
> so in a compatible manner, yes?

No official statement, but I guess this is the case. I have to check
back, though.

> A paravirt PMU also has to be implemented on top of perf_events.
> Otherwise we can't share this resource. So the only question is what
> the interface looks like. The arch pmu is non-optimized, but well
> specified and somewhat supported in guests. A paravirt pmu is not so
> well specified at this point but can be faster (fewer exits).

I agree that getting the interface right is certainly the most difficult
and important task here.

Joerg

2011-05-12 14:38:00

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/12/2011 05:24 PM, Joerg Roedel wrote:
> > Paravirtualizing does have its advantages. For the PMU, for example, we
> > can have a single hypercall read and reprogram all counters, saving
> > *many* exits. But I think we need to start from the architectural PMU
> > and see exactly what the problems are, before we optimize it to death.
>
> The problem certainly is that with arch-pmu we add a lot of msr-exits to
> the guest-context-switch path if it uses per-task profiling. Depending
> on the workload this can very much distort the results.

Right. The combination of per-task profiling with high context switch
rates is problematic.

One thing we could do is paravirtualize at a lower level, introduce a
hypercall for batch MSR reads and writes. So we can use the existing
PMU semantics and code, just optimize the switch. This is similar to
what Xen did with lazy cpu updates, and what kvm did for paravirt
pagetable writes.

I've considered something similar for mmio - use hypercalls for ordinary
mmio to avoid calling into the emulator - but virtio uses pio which
isn't emulated and we don't have massive consumers of mmio (except
perhaps hpet).

(and we can have a cpuid bit that advertises whether we recommend to use
this feature for PMU MSRs; if/when we get hardware support, we turn it off)

--
error compiling committee.c: too many arguments to function

2011-05-12 14:46:19

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/12/2011 05:37 PM, Avi Kivity wrote:
>
> I've considered something similar for mmio - use hypercalls for
> ordinary mmio to avoid calling into the emulator - but virtio uses pio
> which isn't emulated and we don't have massive consumers of mmio
> (except perhaps hpet).

Say,

enum {
	KVM_OP_MMIO_READ = 1,
	KVM_OP_MMIO_WRITE = 2,
	KVM_OP_PIO_READ = 3,
	KVM_OP_PIO_WRITE = 4,
	KVM_OP_MSR_READ = 5,
	KVM_OP_MSR_WRITE = 6,
	KVM_OP_HLT = 7, /* so we can program the apic timer and sleep */
};

struct kvm_batch_op {
	u8 op;		/* KVM_OP_* */
	u8 result;	/* completion status, filled in by the host */
	u8 size;
	u8 reserved[5];
	u64 address;	/* mmio/pio address or MSR index */
	u64 data;
};

/* process nr ops at guest-physical address kvm_batch_op_phys */
int kvm_batch(int nr, u64 kvm_batch_op_phys);
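Guest-side usage might then look like this (hypothetical sketch; the
calling convention and field usage are made up):

	struct kvm_batch_op ops[2] = {
		{ .op = KVM_OP_MSR_WRITE, .size = 8,
		  .address = MSR_P6_EVNTSEL0, .data = eventsel },
		{ .op = KVM_OP_MSR_WRITE, .size = 8,
		  .address = MSR_P6_PERFCTR0, .data = -period },
	};

	kvm_batch(2, __pa(ops));	/* one exit instead of two */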

This is really repeating Xen PV though (different ops), not sure it's
worth the intrusiveness.

--
error compiling committee.c: too many arguments to function

2011-05-13 12:34:39

by Joerg Roedel

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Thu, May 12, 2011 at 05:37:43PM +0300, Avi Kivity wrote:
> On 05/12/2011 05:24 PM, Joerg Roedel wrote:

> One thing we could do is paravirtualize at a lower level, introduce a
> hypercall for batch MSR reads and writes. So we can use the existing
> PMU semantics and code, just optimize the switch. This is similar to
> what Xen did with lazy cpu updates, and what kvm did for paravirt
> pagetable writes.

In general a good idea, but even for MSRs this isn't trivial. MSR
reads and writes can cause a #GP, which needs to be represented in
this interface. Can we allow a batched MSR write to change the
operation mode of the vcpu, and so on?
Or we limit this interface to the PMU MSRs, but then we are pretty
much at a paravirt-pmu.


> I've considered something similar for mmio - use hypercalls for ordinary
> mmio to avoid calling into the emulator - but virtio uses pio which
> isn't emulated and we don't have massive consumers of mmio (except
> perhaps hpet).
>
> (and we can have a cpuid bit that advertises whether we recommend to use
> this feature for PMU MSRs; if/when we get hardware support, we turn it
> off)

A cpuid bit that indicates which pmu is preferred is certainly a good
idea.

Joerg

2011-05-17 09:52:19

by Avi Kivity

Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/11/2011 06:55 PM, Avi Kivity wrote:
> perf maintainers: please consider the first three patches for merging (the
> first two make sense even without the rest). If you're familiar with the Intel
> PMU, please review patch 5 as well - it effectively undoes all your work
> of abstracting the PMU into perf_events by unabstracting perf_events into what
> is hoped is a very similar PMU.

Perf maintainers, ping?

--
error compiling committee.c: too many arguments to function

2011-05-17 19:41:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests


* Avi Kivity <[email protected]> wrote:

> Caveats:
> - counters that have PMI (interrupt) enabled stop counting after the
> interrupt is signalled. This is because we need one-shot samples
> that keep counting, which perf doesn't support yet

Hm, do you need more than perf_event::event_limit, or something special?

> - some combinations of INV and CMASK are not supported

Could you please describe this better: where does this limit come from?
If it's perf, then this needs fixing.

> - counters keep on counting in the host as well as the guest

I suspect fixing this either needs a hw filter feature, or the ability to
disable/enable these events across VM exits/entries.

I would imagine the disable/enable to be rather expensive so hw help would be
preferred ...

I didn't see anything objectionable in your patches, but I'd like to have
Peter's Acked-by as well before we go forward. I think that in the long run
having a virtio-perf gateway would allow us a lot more tricks than just
arch-perfmon emulation:

- we could do things like propagate guest side traces over to the host

- we could control from the host which events we measure on the guest side

- etc.

How would you like to handle the flow of patches - should we merge #1,#2,#3 in
perf/core and you'd then merge #4,#5 via the KVM tree once the first bits hit
upstream?

We could also set up a separate branch for these three commits, which you could
pull - this would allow all this to still hit .40.

Thanks,

Ingo

2011-05-18 09:03:37

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On 05/17/2011 10:41 PM, Ingo Molnar wrote:
* Avi Kivity <[email protected]> wrote:
>
> > Caveats:
> > - counters that have PMI (interrupt) enabled stop counting after the
> > interrupt is signalled. This is because we need one-shot samples
> > that keep counting, which perf doesn't support yet
>
> Hm, do you need more than perf_event::event_limit, or something special?

I do use event_limit, but it's inexact since it auto-disables the event
after firing the callback. I want to limit the callback to just one,
but let the event keep counting.
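
Roughly, the pattern on the kvm side is (a sketch using the APIs from
patches 1 and 3; kvm_perf_overflow stands in for kvm's callback):

        struct perf_event_attr attr = {
                .type          = PERF_TYPE_HARDWARE,
                .size          = sizeof(attr),
                .config        = PERF_COUNT_HW_INSTRUCTIONS,
                .sample_period = period, /* distance to the guest's overflow */
        };
        struct perf_event *event;

        event = perf_event_create_kernel_counter(&attr, -1, current,
                                                 kvm_perf_overflow, pmc);
        perf_event_refresh(event, 1); /* allow exactly one overflow callback */
        /*
         * once that callback fires, perf auto-disables the event;
         * a real counter would keep counting past the PMI
         */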

> > - some combinations of INV and CMASK are not supported
>
> Could you please describe this better: where does this limit come from?
> If it's perf, then this needs fixing.

perf_event_attr does not support generic INV and CMASK at all. I
imagine you can get them through the model-specific hardware
configuration, but that means we have to encode model specific
information into kvm host code, which is (a) hard (b) counter to the
spirit of perf.

(INV and CMASK allow you to increment the counter only when > N or < N
events occur simultaneously, for example count when 2 or more
instructions are retired in a single clock).
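
(For reference, the relevant PERFEVTSEL bits, roughly as the kernel
headers spell them:

        #define ARCH_PERFMON_EVENTSEL_USR    (1ULL << 16) /* count in ring 3 */
        #define ARCH_PERFMON_EVENTSEL_OS     (1ULL << 17) /* count in ring 0 */
        #define ARCH_PERFMON_EVENTSEL_INT    (1ULL << 20) /* PMI on overflow */
        #define ARCH_PERFMON_EVENTSEL_ENABLE (1ULL << 22)
        #define ARCH_PERFMON_EVENTSEL_INV    (1ULL << 23) /* invert CMASK test */
        #define ARCH_PERFMON_EVENTSEL_CMASK  0xFF000000ULL /* threshold */

so a generic perf_event_attr has nowhere to put the top byte.)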

> > - counters keep on counting in the host as well as the guest
>
> I suspect fixing this either needs a hw filter feature, or the ability to
> disable/enable these events across VM exits/entries.
>
> I would imagine the disable/enable to be rather expensive so hw help would be
> preferred ...

Yes, Joerg already posted support for this feature for AMD. Intel
supports this in a more complicated way (you can tell vmx to load
IA32_PERF_GLOBAL_CTRL atomically during entry or exit). This can be done
independently of this patchset.
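
On Intel that would look something like this (names approximate, since
kvm doesn't use these controls yet):

        /* have the cpu switch IA32_PERF_GLOBAL_CTRL atomically at entry/exit */
        vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, guest_global_ctrl);
        vmcs_write64(HOST_IA32_PERF_GLOBAL_CTRL, host_global_ctrl);
        vm_entry_controls |= VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
        vm_exit_controls  |= VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;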

> I didnt see anything objectionable in your patches, but i'd like to have
> Peter's Acked-by as well before we go forward. I think that in the long run
> having a virtio-perf gateway would allow us a lot more tricks than just
> arch-perfmon emulation:
>
> - we could do things like propagate guest side traces over to the host

We support that already via 'perf kvm', no? This is more about the
guest profiling itself without access to the host (which is the more
common scenario, IMO).

We're still missing tunnelling guest ftrace to the host, but a patch was
recently posted to do that.

> - we could control from the host which events we measure on the guest side
>
> - etc.
>
> How would you like to handle the flow of patches - should we merge #1,#2,#3 in
> perf/core and you'd then merge #4,#5 via the KVM tree once the first bits hit
> upstream?
>
> We could also set up a separate branch for these three commits, which you could
> pull - this would allow all this to still hit .40.

You can put them in either perf/core or a different branch, and I can
pull them as long as it's fast-forward only.

But I don't think I'll target 2.6.40, this needs to be tested a lot
more, particularly wrt guest-initiated DoS against the host.

--
error compiling committee.c: too many arguments to function

2011-05-18 11:08:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests


* Avi Kivity <[email protected]> wrote:

> >> - some combinations of INV and CMASK are not supported
> >
> > Could you please describe this better: where does this limit come from?
> > If it's perf, then this needs fixing.
>
> perf_event_attr does not support generic INV and CMASK at all. [...]

It does through raw events - which are indeed model specific.

> [...] I imagine you can get them through the model-specific hardware
> configuration, but that means we have to encode model specific information
> into kvm host code, which is (a) hard (b) counter to the spirit of perf.
>
> (INV and CMASK allow you to increment the counter only when > N or <
> N events occur simultaneously, for example count when 2 or more
> instructions are retired in a single clock).

Peter, what do you think about adding a inv and cmask attribute that is applied
to generic events automatically?

I *think* it should work just out of the box: we recognize the fixed-purpose
counter events based on their raw value, so if one comes in with inv and cmask
set it will just be scheduled on a generic hw counter.
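
I.e. something like (purely hypothetical fields, not in perf_event_attr
today):

        struct perf_event_attr attr = {
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_INSTRUCTIONS,
                .cmask  = 2,    /* count cycles retiring >= 2 instructions */
                .inv    = 0,    /* or 1 to invert the comparison */
        };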

> > - we could do things like propagate guest side traces over to the host
>
> We support that already via 'perf kvm', no? [...]

Not without usability issues, if you remember that flamewa^W thread! :-)

> [...] This is more about the guest profiling itself without access to the
> host (which is the more common scenario, IMO).

Correct - so there would be interactions with virtio-perf. I think the two will
just have to be made to mix well, that's all.

> We're still missing tunnelling guest ftrace to the host, but a patch was
> recently posted to do that.

If you tunnel guest perf events then ftrace events are included automatically.
That's the primary/unified event transport model we are working towards.

Thanks,

Ingo

2011-05-18 11:30:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On Wed, 2011-05-18 at 13:07 +0200, Ingo Molnar wrote:
>
> Peter, what do you think about adding a inv and cmask attribute that is applied
> to generic events automatically?

Have you read the PPC, ARM and SPARC64 PMU definition to check if that
makes sense?

2011-05-18 11:32:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On Wed, 2011-05-18 at 13:07 +0200, Ingo Molnar wrote:
>
> It does through raw events - which are indeed model specific.

Which is exactly what is needed anyway, he gets a raw msr value.

The only thing that is not exposed is the ANY bit, but since KVM doesn't
expose HT anyway that doesn't matter.

2011-05-18 11:38:11

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On 05/18/2011 02:32 PM, Peter Zijlstra wrote:
> On Wed, 2011-05-18 at 13:07 +0200, Ingo Molnar wrote:
> >
> > It does through raw events - which are indeed model specific.
>
> Which is exactly what is needed anyway, he gets a raw msr value.
>
> The only thing that is not exposed is the ANY bit, but since KVM doesn't
> expose HT anyway that doesn't matter.

If I were to use raw events, I'd need to program AMD and Intel hosts
separately. As it is, I just use the generic counters and the perf
backend does its thing.

(note, INV and CMASK are hardly important, I think we can get away
without implementing them)

--
error compiling committee.c: too many arguments to function

2011-05-18 11:57:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2011-05-18 at 13:07 +0200, Ingo Molnar wrote:
> >
> > Peter, what do you think about adding a inv and cmask attribute that is applied
> > to generic events automatically?
>
> Have you read the PPC, ARM and SPARC64 PMU definition to check if that makes
> sense?

No - some feedback from them would be nice.

Thanks,

Ingo

2011-05-18 12:35:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On Wed, 2011-05-18 at 14:37 +0300, Avi Kivity wrote:
> On 05/18/2011 02:32 PM, Peter Zijlstra wrote:
> > On Wed, 2011-05-18 at 13:07 +0200, Ingo Molnar wrote:
> > >
> > > It does through raw events - which are indeed model specific.
> >
> > Which is exactly what is needed anyway, he gets a raw msr value.
> >
> > The only thing that is not exposed is the ANY bit, but since KVM doesn't
> > expose HT anyway that doesn't matter.
>
> If I were to use raw events, I'd need to program AMD and Intel hosts
> separately. As it is, I just use the generic counters and the perf
> backend does its thing.

But why exactly? From what I understood you emulate an actual hardware
PMU; that means the guest will be writing proper content to the relevant
MSRs, and you can feed that directly into the raw config field (with the
exception of the USR/OS/INT bits).
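
I.e. the feed-through would be roughly (a sketch, not the patch's actual
code; attr and eventsel come from the surrounding emulation):

        /* guest PERFEVTSEL -> raw perf config, minus the bits perf owns */
        attr.type   = PERF_TYPE_RAW;
        attr.config = eventsel & ~(ARCH_PERFMON_EVENTSEL_USR |
                                   ARCH_PERFMON_EVENTSEL_OS |
                                   ARCH_PERFMON_EVENTSEL_INT |
                                   ARCH_PERFMON_EVENTSEL_ENABLE);
        attr.exclude_user   = !(eventsel & ARCH_PERFMON_EVENTSEL_USR);
        attr.exclude_kernel = !(eventsel & ARCH_PERFMON_EVENTSEL_OS);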

I'm fairly sure that emulating the Intel arch bits on a cpu that
otherwise identifies itself as AMD is going to confuse things.

2011-05-18 12:49:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On 05/18/2011 03:35 PM, Peter Zijlstra wrote:
> On Wed, 2011-05-18 at 14:37 +0300, Avi Kivity wrote:
> > On 05/18/2011 02:32 PM, Peter Zijlstra wrote:
> > > On Wed, 2011-05-18 at 13:07 +0200, Ingo Molnar wrote:
> > > >
> > > > It does through raw events - which are indeed model specific.
> > >
> > > Which is exactly what is needed anyway, he gets a raw msr value.
> > >
> > > The only thing that is not exposed is the ANY bit, but since KVM doesn't
> > > expose HT anyway that doesn't matter.
> >
> > If I were to use raw events, I'd need to program AMD and Intel hosts
> > separately. As it is, I just use the generic counters and the perf
> > backend does its thing.
>
> But why exactly? From what I understood you emulate an actual hardware
> PMU; that means the guest will be writing proper content to the relevant
> MSRs, and you can feed that directly into the raw config field (with the
> exception of the USR/OS/INT bits).

I am emulating an architectural PMU on a machine that may not have one
(Intel P4 or AMD), so I need translation between the different config
formats and event codes.
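
The translation is basically a small table keyed on event select and
unit mask; a sketch of what it looks like:

        static const struct {
                u8 eventsel;
                u8 unit_mask;
                unsigned event_type;    /* generic perf event */
        } arch_events[] = {
                { 0x3c, 0x00, PERF_COUNT_HW_CPU_CYCLES },
                { 0xc0, 0x00, PERF_COUNT_HW_INSTRUCTIONS },
                { 0x2e, 0x4f, PERF_COUNT_HW_CACHE_REFERENCES },
                { 0x2e, 0x41, PERF_COUNT_HW_CACHE_MISSES },
                { 0xc4, 0x00, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
                { 0xc5, 0x00, PERF_COUNT_HW_BRANCH_MISSES },
        };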

> I'm fairly sure that emulating the Intel arch bits on a cpu that
> otherwise identifies itself as AMD is going to confuse things.

First, we can cheat and identify as Intel. Second, I'd like to change
architectural PMU probing to ignore the vendor (pending confirmation
from AMD that they won't implement cpuid leaf 0xa in a way that is
incompatible with Intel). That's what architectural means, no?

--
error compiling committee.c: too many arguments to function

2011-06-01 09:45:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 05/17/2011 12:52 PM, Avi Kivity wrote:
> On 05/11/2011 06:55 PM, Avi Kivity wrote:
>> perf maintainers: please consider the first three patches for merging
>> (the
>> first two make sense even without the rest). If you're familiar with
>> the Intel
>> PMU, please review patch 5 as well - it effectively undoes all your work
>> of abstracting the PMU into perf_events by unabstracting perf_events
>> into what
>> is hoped is a very similar PMU.
>
> Perf maintainers, ping?
>

Re-ping?

--
error compiling committee.c: too many arguments to function

2011-06-01 12:15:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On Wed, 2011-06-01 at 12:45 +0300, Avi Kivity wrote:
> On 05/17/2011 12:52 PM, Avi Kivity wrote:
> > On 05/11/2011 06:55 PM, Avi Kivity wrote:
> >> perf maintainers: please consider the first three patches for merging
> >> (the
> >> first two make sense even without the rest). If you're familiar with
> >> the Intel
> >> PMU, please review patch 5 as well - it effectively undoes all your work
> >> of abstracting the PMU into perf_events by unabstracting perf_events
> >> into what
> >> is hoped is a very similar PMU.
> >
> > Perf maintainers, ping?
> >
>
> Re-ping?

Still struggling with merge window fallout; this comes soon after that.

2011-06-01 11:27:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 0/5] KVM in-guest performance monitoring

On 06/01/2011 01:26 PM, Peter Zijlstra wrote:
> >
> > Re-ping?
>
> Still struggling with merge window fallout; this comes soon after that.

Thanks.

--
error compiling committee.c: too many arguments to function

2011-06-03 14:42:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On Wed, 2011-05-11 at 11:55 -0400, Avi Kivity wrote:
> - counters that have PMI (interrupt) enabled stop counting after the
> interrupt is signalled. This is because we need one-shot samples
> that keep counting, which perf doesn't support yet

You'll have to reprogram the thing anyway, since not all hardware has
the same counter width:

[ 0.046996] Performance Events: AMD PMU driver.
[ 0.048998] ... bit width: 48

vs

[ 0.026998] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
[ 0.026998] ... bit width: 40

simply letting the thing run will not behave in a consistent fashion.

Or are you going to assume all software will properly read the cpuid
leaf and not assume bit width?

Also, I can't seem to locate where you fill that cpuid-leaf,
kvm_pmu_cpuid_update() seems to read the entry, not write it.

2011-06-03 14:42:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On Wed, 2011-05-11 at 11:55 -0400, Avi Kivity wrote:
> +static u64 read_gp_pmc(struct kvm_pmu *pmu, struct kvm_pmc *pmc)
> +{
> +        u64 counter, enabled, running;
> +
> +        counter = pmc->counter;
> +
> +        if (pmc->perf_event)
> +                counter += perf_event_read_value(pmc->perf_event,
> +                                                 &enabled, &running);
> +
> +        /* FIXME: Scaling needed? */

Yeah, either that or use attr.pinned = 1, doing the latter will result
in an error when the PMU is overloaded though.

> +        return counter & pmu->counter_bitmask;
> +}

2011-06-03 14:42:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] perf: add context parameter to perf_event overflow handler

On Wed, 2011-05-11 at 11:55 -0400, Avi Kivity wrote:
> +++ b/include/linux/perf_event.h
> @@ -709,7 +709,9 @@ struct perf_buffer {
>
> struct perf_sample_data;
>
> -typedef void (*perf_overflow_handler_t)(struct perf_event *, int,
> +typedef void (*perf_overflow_handler_t)(void *context,
> +                                        struct perf_event *event,
> +                                        int,
>                                          struct perf_sample_data *,
>                                          struct pt_regs *regs);
>
> @@ -855,6 +857,7 @@ struct perf_event {
>          u64 id;
>
>          perf_overflow_handler_t overflow_handler;
> +        void *overflow_handler_context;
>
>  #ifdef CONFIG_EVENT_TRACING
>          struct ftrace_event_call *tp_event;

If you're adding a pointer here, then why pass it along in the callback
thing if it already has the perf_event *?

It seems to me the callback can easily do
event->overflow_handler_context if it needs the thing.

2011-06-05 08:08:11

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 1/5] perf: add context parameter to perf_event overflow handler

On 06/03/2011 05:41 PM, Peter Zijlstra wrote:
> On Wed, 2011-05-11 at 11:55 -0400, Avi Kivity wrote:
> > +++ b/include/linux/perf_event.h
> > @@ -709,7 +709,9 @@ struct perf_buffer {
> >
> > struct perf_sample_data;
> >
> > -typedef void (*perf_overflow_handler_t)(struct perf_event *, int,
> > +typedef void (*perf_overflow_handler_t)(void *context,
> > +                                        struct perf_event *event,
> > +                                        int,
> >                                          struct perf_sample_data *,
> >                                          struct pt_regs *regs);
> >
> > @@ -855,6 +857,7 @@ struct perf_event {
> >          u64 id;
> >
> >          perf_overflow_handler_t overflow_handler;
> > +        void *overflow_handler_context;
> >
> >  #ifdef CONFIG_EVENT_TRACING
> >          struct ftrace_event_call *tp_event;
>
> If you're adding a pointer here, then why pass it along in the callback
> thing if it already has the perf_event *?
>
> It seems to me the callback can easily do
> event->overflow_handler_context if it needs the thing.

Makes sense. Will update.

(I would have preferred a caller-allocated perf_event with
container_of(), but that's a huge change).

--
error compiling committee.c: too many arguments to function

2011-06-05 08:12:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On 06/03/2011 05:41 PM, Peter Zijlstra wrote:
> On Wed, 2011-05-11 at 11:55 -0400, Avi Kivity wrote:
> > - counters that have PMI (interrupt) enabled stop counting after the
> > interrupt is signalled. This is because we need one-shot samples
> > that keep counting, which perf doesn't support yet
>
> You'll have to reprogram the thing anyway, since not all hardware has
> the same counter width:
>
> [ 0.046996] Performance Events: AMD PMU driver.
> [ 0.048998] ... bit width: 48
>
> vs
>
> [ 0.026998] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
> [ 0.026998] ... bit width: 40
>
> simply letting the thing run will not behave in a consistent fashion.

I don't really follow. How is the bit width related?

We adjust for bit width during PMC read.

Though I agree that for accurate emulation we do need to reprogram, so we
can set the next overflow event at 2^bit_width. I doubt that anyone relies
on this second overflow (even with 40 bits counting cycles, it takes far
too long to overflow), but we still have to emulate it.
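
On reprogramming, the period would be something like (a sketch, using the
counter_bitmask and counter from read_gp_pmc above):

        /* distance from the current guest value to the emulated overflow */
        delta = (pmu->counter_bitmask + 1) - (counter & pmu->counter_bitmask);
        attr.sample_period = delta;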

> Or are you going to assume all software will properly read the cpuid
> leaf and not assume bit width?
>
> Also, I can't seem to locate where you fill that cpuid-leaf,
> kvm_pmu_cpuid_update() seems to read the entry, not write it.

We rely on host userspace to set up cpuid (and KVM_GET_SUPPORTED_CPUID
to report to userspace what we support).

--
error compiling committee.c: too many arguments to function

2011-06-05 08:13:14

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH v1 5/5] KVM: Expose a version 1 architectural PMU to guests

On 06/03/2011 05:41 PM, Peter Zijlstra wrote:
> On Wed, 2011-05-11 at 11:55 -0400, Avi Kivity wrote:
> > +static u64 read_gp_pmc(struct kvm_pmu *pmu, struct kvm_pmc *pmc)
> > +{
> > +        u64 counter, enabled, running;
> > +
> > +        counter = pmc->counter;
> > +
> > +        if (pmc->perf_event)
> > +                counter += perf_event_read_value(pmc->perf_event,
> > +                                                 &enabled, &running);
> > +
> > +        /* FIXME: Scaling needed? */
>
> Yeah, either that or use attr.pinned = 1, doing the latter will result
> in an error when the PMU is overloaded though.

So what's the scale factor? enabled / running?

--
error compiling committee.c: too many arguments to function