2012-02-20 22:38:59

by Vince Weaver

Subject: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel

Hello

The perfctr interface uses rdpmc rather than rdmsr when possible in the
kernel, as rdpmc tends to have lower latency. (One can look at
etc/costs in the perfctr-2.6 package to see a historical list of the
overheads.)

I have done some tests on a 3.2 kernel:

              rdmsr          rdpmc
Core2 T9900:  203.9 cycles   30.9 cycles
AMD fam0fh:    56.2 cycles    9.8 cycles
Atom 6/28/2:  129.7 cycles   50.6 cycles

As you can see, the speedup from using rdpmc is large (roughly 6.6x on
the Core2, 5.7x on the AMD part, and 2.6x on the Atom), although granted
it really is a drop in the bucket compared to the other overheads
involved.

I've attached a kernel module that can be used to test this on your own
personal x86 machine.

Below is a patch that changes perf_event to use rdpmc rather than rdmsr
when possible. It's probably possible (and desirable) to do this without
requiring a new field in the hw_perf_event structure, but the fixed events
make this tricky.

Signed-off-by: Vince Weaver <[email protected]>

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 5adce10..5550047 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -85,7 +85,7 @@ u64 x86_perf_event_update(struct perf_event *event)
*/
again:
prev_raw_count = local64_read(&hwc->prev_count);
- rdmsrl(hwc->event_base, new_raw_count);
+ new_raw_count=native_read_pmc(hwc->event_base_rdpmc);

if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
new_raw_count) != prev_raw_count)
@@ -768,9 +768,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,
} else if (hwc->idx >= X86_PMC_IDX_FIXED) {
hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - X86_PMC_IDX_FIXED);
+ hwc->event_base_rdpmc = (hwc->idx - X86_PMC_IDX_FIXED) | 1<<30;
} else {
hwc->config_base = x86_pmu_config_addr(hwc->idx);
hwc->event_base = x86_pmu_event_addr(hwc->idx);
+ hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
}
}

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index abb2776..432ac69 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -562,6 +562,7 @@ struct hw_perf_event {
u64 last_tag;
unsigned long config_base;
unsigned long event_base;
+ unsigned long event_base_rdpmc;
int idx;
int last_cpu;
struct hw_perf_event_extra extra_reg;


Attachments:
Makefile (197.00 B)
rdpmc-module.c (7.18 kB)

2012-02-27 11:39:38

by Peter Zijlstra

Subject: Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel

On Mon, 2012-02-20 at 17:38 -0500, Vince Weaver wrote:

> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 5adce10..5550047 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -85,7 +85,7 @@ u64 x86_perf_event_update(struct perf_event *event)
> */
> again:
> prev_raw_count = local64_read(&hwc->prev_count);
> - rdmsrl(hwc->event_base, new_raw_count);
> + new_raw_count=native_read_pmc(hwc->event_base_rdpmc);

That really wants to be rdpmc(), bypassing paravirt like that isn't
nice.

>
> if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
> new_raw_count) != prev_raw_count)
> @@ -768,9 +768,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,
> } else if (hwc->idx >= X86_PMC_IDX_FIXED) {
> hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
> hwc->event_base = MSR_ARCH_PERFMON_FIXED_CTR0 + (hwc->idx - X86_PMC_IDX_FIXED);
> + hwc->event_base_rdpmc = (hwc->idx - X86_PMC_IDX_FIXED) | 1<<30;
> } else {
> hwc->config_base = x86_pmu_config_addr(hwc->idx);
> hwc->event_base = x86_pmu_event_addr(hwc->idx);
> + hwc->event_base_rdpmc = x86_pmu_addr_offset(hwc->idx);
> }
> }
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index abb2776..432ac69 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -562,6 +562,7 @@ struct hw_perf_event {
> u64 last_tag;
> unsigned long config_base;
> unsigned long event_base;
> + unsigned long event_base_rdpmc;

We could make that unsigned int; the SDM explicitly states that
rdmsr/wrmsr/rdpmc take ECX (and ignore the upper 32 bits of RCX).

> int idx;
> int last_cpu;
> struct hw_perf_event_extra extra_reg;

But yeah, avoiding the extra variable will add a conditional and extra
instructions to the rdpmc path.

2012-02-27 16:06:47

by Vince Weaver

Subject: Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel

On Mon, 27 Feb 2012, Peter Zijlstra wrote:

> On Mon, 2012-02-20 at 17:38 -0500, Vince Weaver wrote:
>
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 5adce10..5550047 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -85,7 +85,7 @@ u64 x86_perf_event_update(struct perf_event *event)
> > */
> > again:
> > prev_raw_count = local64_read(&hwc->prev_count);
> > - rdmsrl(hwc->event_base, new_raw_count);
> > + new_raw_count=native_read_pmc(hwc->event_base_rdpmc);
>
> That really wants to be rdpmc(), bypassing paravirt like that isn't
> nice.

I couldn't find another usable rdpmc() call in the kernel. Should I add
one? I admit I hadn't thought that this might break VMs not expecting
rdpmc calls from the kernel.

> > index abb2776..432ac69 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -562,6 +562,7 @@ struct hw_perf_event {
> > u64 last_tag;
> > unsigned long config_base;
> > unsigned long event_base;
> > + unsigned long event_base_rdpmc;
>
> > We could make that unsigned int; the SDM explicitly states that
> > rdmsr/wrmsr/rdpmc take ECX (and ignore the upper 32 bits of RCX).

OK.

> > int idx;
> > int last_cpu;
> > struct hw_perf_event_extra extra_reg;
>
> But yeah, avoiding the extra variable will add a conditional and extra
> instructions to the rdpmc path.

Yes, I couldn't think of a good way to get rid of the extra field
without adding extra conditional branches in key code paths.

Mostly that's the fault of the Intel fixed counters; otherwise you could
possibly fix things by cleverly adding or subtracting off the msr_base
to get the counter number.

Vince

2012-02-27 16:18:57

by Peter Zijlstra

Subject: Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel

On Mon, 2012-02-27 at 11:06 -0500, Vince Weaver wrote:
> > > + new_raw_count=native_read_pmc(hwc->event_base_rdpmc);
> >
> > That really wants to be rdpmc(), bypassing paravirt like that isn't
> > nice.
>
> I couldn't find another usable rdpmc() call in the kernel. Should I add
> one? I admit I hadn't thought that this might break VMs not expecting
> rdpmc calls from the kernel.

arch/x86/include/asm/msr.h

#define rdpmc(counter, low, high) \
do { \
u64 _l = native_read_pmc((counter)); \
(low) = (u32)_l; \
(high) = (u32)(_l >> 32); \
} while (0)

2012-02-27 17:04:34

by Vince Weaver

Subject: Re: [PATCH] perf_event use rdpmc rather than rdmsr when possible in kernel

On Mon, 27 Feb 2012, Peter Zijlstra wrote:

> > I couldn't find another usable rdpmc() call in the kernel. Should I add
> > one? I admit I hadn't thought that this might break VMs not expecting
> > rdpmc calls from the kernel.
>
> arch/x86/include/asm/msr.h
>
> #define rdpmc(counter, low, high) \
> do { \
> u64 _l = native_read_pmc((counter)); \
> (low) = (u32)_l; \
> (high) = (u32)(_l >> 32); \
> } while (0)

Ugh. I did multiple recursive greps, and definitely searched msr.h by
hand several times looking for that, and somehow missed it each time.

I'll update the patch.

Vince