2015-04-28 21:25:42

by Andy Lutomirski

Subject: [RFC] x86, perf: Add an aperfmperf driver

Signed-off-by: Andy Lutomirski <[email protected]>
---

This driver seems a little bit silly, but I can imagine it being useful. For
example, I think that turbostat could do some of its work without being
root if we had a driver like this.

Thoughts? Would it make sense at all? Did I wire it up right? This is
the only PMU driver I've ever written, and it could have any number of
issues.

arch/x86/kernel/cpu/Makefile | 2 +
arch/x86/kernel/cpu/perf_event_aperfmperf.c | 119 ++++++++++++++++++++++++++++
2 files changed, 121 insertions(+)
create mode 100644 arch/x86/kernel/cpu/perf_event_aperfmperf.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 80091ae54c2b..fadc822efc90 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -45,6 +45,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
perf_event_intel_uncore_snb.o \
perf_event_intel_uncore_snbep.o \
perf_event_intel_uncore_nhmex.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_aperf_mperf.o
+obj-$(CONFIG_CPU_SUP_AMD) += perf_event_aperf_mperf.o
endif


diff --git a/arch/x86/kernel/cpu/perf_event_aperfmperf.c b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
new file mode 100644
index 000000000000..6e6d113bd9ce
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
@@ -0,0 +1,119 @@
+#include <linux/perf_event.h>
+
+#define APERFMPERF_EVENT_APERF 0
+#define APERFMPERF_EVENT_MPERF 1
+
+PMU_EVENT_ATTR_STRING(aperf, evattr_aperf, "event=0x00");
+PMU_EVENT_ATTR_STRING(mperf, evattr_mperf, "event=0x01");
+static struct attribute *events_attrs[] = {
+ &evattr_aperf.attr.attr,
+ &evattr_mperf.attr.attr,
+ NULL,
+};
+static struct attribute_group events_attr_group = {
+ .name = "events",
+ .attrs = events_attrs,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-63");
+static struct attribute *format_attrs[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+static struct attribute_group format_attr_group = {
+ .name = "format",
+ .attrs = format_attrs,
+};
+
+static const struct attribute_group *attr_groups[] = {
+ &events_attr_group,
+ &format_attr_group,
+ NULL,
+};
+
+static int aperfmperf_event_init(struct perf_event *event)
+{
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ if (event->attr.config != APERFMPERF_EVENT_APERF &&
+ event->attr.config != APERFMPERF_EVENT_MPERF)
+ return -ENOENT;
+
+ if (event->attr.config1 != 0)
+ return -ENOENT;
+
+ /* no sampling */
+ if (event->hw.sample_period)
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ event->attr.freq ||
+ event->attr.sample_period) /* no sampling */
+ return -EINVAL;
+
+ event->hw.idx = -1;
+ event->hw.event_base = (event->attr.config == APERFMPERF_EVENT_APERF ?
+ MSR_IA32_APERF : MSR_IA32_MPERF);
+
+ return 0;
+}
+
+static void aperfmperf_event_update(struct perf_event *event)
+{
+ u64 prev;
+ u64 now;
+
+ rdmsrl(event->hw.event_base, now);
+ prev = local64_xchg(&event->hw.prev_count, now);
+ local64_add(now - prev, &event->count);
+}
+
+static void aperfmperf_event_start(struct perf_event *event, int flags)
+{
+ u64 now;
+
+ rdmsrl(event->hw.event_base, now);
+ local64_set(&event->hw.prev_count, now);
+}
+
+static void aperfmperf_event_stop_or_del(struct perf_event *event, int flags)
+{
+ aperfmperf_event_update(event);
+}
+
+static int aperfmperf_event_add(struct perf_event *event, int flags)
+{
+ if (flags & PERF_EF_START)
+ aperfmperf_event_start(event, flags);
+
+ return 0;
+}
+
+static struct pmu pmu_aperfmperf = {
+ .task_ctx_nr = perf_invalid_context,
+ .attr_groups = attr_groups,
+ .event_init = aperfmperf_event_init,
+ .add = aperfmperf_event_add,
+ .del = aperfmperf_event_stop_or_del,
+ .start = aperfmperf_event_start,
+ .stop = aperfmperf_event_stop_or_del,
+ .read = aperfmperf_event_update,
+};
+
+static int __init aperfmperf_init(void)
+{
+ if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+ return -ENODEV;
+
+ perf_pmu_register(&pmu_aperfmperf, "aperfmperf", -1);
+
+ return 0;
+}
+device_initcall(aperfmperf_init);
--
2.3.0


2015-04-28 22:29:59

by Brown, Len

Subject: RE: [RFC] x86, perf: Add an aperfmperf driver

> I think that turbostat could do some of its work without being
> root if we had a driver like this.

Note that turbostat can be run as non-root this way:

# setcap cap_sys_rawio=ep ./turbostat
# chmod +r /dev/cpu/*/msr

For the debug case, there are a number of MSRs that turbostat must access,
so would still need permission for that case (which is the only case I use:-)

> Thoughts? Would it make sense at all? Did I wire it up right? This is
> the only PMU driver I've ever written, and it could have any number of
> issues.

APERF/MPERF, as with all per-thread MSRs, must be accessed
from the local processor. I didn't see where this driver
distinguishes the CPU. Also, I assume the intent is to return
a snapshot, rather than sampling, yes?

Note that turbostat binds itself to a remote CPU so that MSR reads
are all local, then it binds to the next CPU etc. In the old days,
we read everything without this binding, and the kernel overhead
of the remote reads was too high, making it difficult to measure
profoundly idle systems.

cheers,
-Len
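
The bind-then-read pattern described above can be sketched in userspace C. This is only an illustration of the idea, not turbostat's actual code: the helper names msr_path() and read_msr_local() are hypothetical, and the read only succeeds where the msr driver is loaded and the caller has the permissions shown earlier in this message.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_MPERF 0xE7
#define MSR_IA32_APERF 0xE8

/* Hypothetical helper: format the msr device path for one CPU.
 * Returns the number of characters written, as snprintf() does. */
static int msr_path(char *buf, size_t len, int cpu)
{
	return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/* Hypothetical helper: pin the calling thread to `cpu`, then read one MSR
 * through that CPU's msr device so the read is local.
 * Returns 0 on success and -1 on any failure. */
static int read_msr_local(int cpu, uint32_t msr, uint64_t *val)
{
	cpu_set_t set;
	char path[64];
	ssize_t n;
	int fd;

	assert(val != NULL);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0)
		return -1;

	msr_path(path, sizeof(path), cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* The msr device takes the MSR index as the file offset. */
	n = pread(fd, val, sizeof(*val), msr);
	close(fd);

	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

A caller would loop over the online CPUs and call read_msr_local(cpu, MSR_IA32_APERF, &v) and read_msr_local(cpu, MSR_IA32_MPERF, &v) for each one in turn.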

2015-04-28 22:44:03

by Andy Lutomirski

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Tue, Apr 28, 2015 at 3:29 PM, Brown, Len <[email protected]> wrote:
>> I think that turbostat could do some of its work without being
>> root if we had a driver like this.
>
> Note that turbostat can be run as non-root this way:
>
> # setcap cap_sys_rawio=ep ./turbostat
> # chmod +r /dev/cpu/*/msr
>
> For the debug case, there are a number of MSRs that turbostat must access,
> so would still need permission for that case (which is the only case I use:-)
>

True. This would only get the average turbo ratio. Of course, I
think that can be done using cpu-cycles as well.
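
As a rough illustration of what "average turbo ratio" means here: the ratio of the APERF delta to the MPERF delta over an interval, scaled by the nominal frequency, approximates the average frequency while the CPU was in C0. The function name effective_khz() and the base_khz parameter are assumptions of this sketch; the caller has to obtain the nominal frequency some other way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average effective frequency in kHz over an interval,
 * computed from two snapshots each of APERF and MPERF.
 * base_khz is the nominal (non-turbo) frequency supplied by the caller. */
static double effective_khz(uint64_t base_khz,
			    uint64_t aperf_start, uint64_t aperf_end,
			    uint64_t mperf_start, uint64_t mperf_end)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	uint64_t d_aperf = aperf_end - aperf_start;
	uint64_t d_mperf = mperf_end - mperf_start;

	assert(base_khz > 0);

	/* No reference cycles elapsed: nothing meaningful to report. */
	if (d_mperf == 0)
		return 0.0;

	return (double)base_khz * ((double)d_aperf / (double)d_mperf);
}
```

A result above base_khz suggests the interval was spent partly in turbo, and a result below it suggests the CPU ran below nominal speed.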

>> Thoughts? Would it make sense at all? Did I wire it up right? This is
>> the only PMU driver I've ever written, and it could have any number of
>> issues.
>
> APERF/MPERF, as with all per-thread MSRs, must be accessed
> from the local processor. I didn't see where this driver
> distinguishes the CPU. Also, I assume the intent is to return
> a snapshot, rather than sampling, yes?

I think that the perf core takes care of that for us, but I'm not entirely sure.

--Andy

2015-04-29 09:09:57

by Peter Zijlstra

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Tue, Apr 28, 2015 at 02:25:37PM -0700, Andy Lutomirski wrote:
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index 80091ae54c2b..fadc822efc90 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -45,6 +45,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
> perf_event_intel_uncore_snb.o \
> perf_event_intel_uncore_snbep.o \
> perf_event_intel_uncore_nhmex.o
> +obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_aperf_mperf.o
> +obj-$(CONFIG_CPU_SUP_AMD) += perf_event_aperf_mperf.o

Does this actually work? I would expect it to go complain about having
to build it twice if you have both set.

> diff --git a/arch/x86/kernel/cpu/perf_event_aperfmperf.c b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
> new file mode 100644
> index 000000000000..6e6d113bd9ce
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
> @@ -0,0 +1,119 @@
> +#include <linux/perf_event.h>
> +
> +#define APERFMPERF_EVENT_APERF 0
> +#define APERFMPERF_EVENT_MPERF 1
> +

> +static int aperfmperf_event_init(struct perf_event *event)
> +{
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + if (event->attr.config != APERFMPERF_EVENT_APERF &&
> + event->attr.config != APERFMPERF_EVENT_MPERF)
> + return -ENOENT;

Once we pass the type test we know it's 'our' event, and we can go return
fatal errors. No other PMU will pick this up.

This could therefore turn into an -EINVAL.

> +
> + if (event->attr.config1 != 0)
> + return -ENOENT;

Idem.

> + /* no sampling */
> + if (event->hw.sample_period)
> + return -EINVAL;

You could have set pmu::capabilities =
PERF_PMU_CAP_NO_INTERRUPT which would also have killed that dead.

> + /* unsupported modes and filters */
> + if (event->attr.exclude_user ||
> + event->attr.exclude_kernel ||
> + event->attr.exclude_hv ||
> + event->attr.exclude_idle ||
> + event->attr.exclude_host ||
> + event->attr.exclude_guest ||
> + event->attr.freq ||
> + event->attr.sample_period) /* no sampling */
> + return -EINVAL;
> +
> + event->hw.idx = -1;
> + event->hw.event_base = (event->attr.config == APERFMPERF_EVENT_APERF ?
> + MSR_IA32_APERF : MSR_IA32_MPERF);
> +
> + return 0;
> +}

The rest looks about right. Very simple thing indeed ;-)
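
The error-code convention from the comments above can be modelled in plain userspace C. This is a sketch only: struct mock_attr and mock_event_init() are stand-ins invented for illustration, not kernel types, and the real callback takes a struct perf_event.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define APERFMPERF_EVENT_APERF 0
#define APERFMPERF_EVENT_MPERF 1

/* Hypothetical stand-in for the few perf_event_attr fields checked here. */
struct mock_attr {
	uint32_t type;
	uint64_t config;
	uint64_t config1;
};

/* Model of the suggested checks: -ENOENT only while the event might still
 * belong to another PMU, -EINVAL once the type has matched. */
static int mock_event_init(const struct mock_attr *attr, uint32_t pmu_type)
{
	assert(attr != NULL);

	/* Not our type: report "not mine" so another PMU can be tried. */
	if (attr->type != pmu_type)
		return -ENOENT;

	/* From here on the event is ours, so bad input is a hard error. */
	if (attr->config != APERFMPERF_EVENT_APERF &&
	    attr->config != APERFMPERF_EVENT_MPERF)
		return -EINVAL;

	if (attr->config1 != 0)
		return -EINVAL;

	return 0;
}
```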

2015-04-29 09:13:24

by Peter Zijlstra

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Tue, Apr 28, 2015 at 03:43:38PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 28, 2015 at 3:29 PM, Brown, Len <[email protected]> wrote:

> >> Thoughts? Would it make sense at all? Did I wire it up right? This is
> >> the only PMU driver I've ever written, and it could have any number of
> >> issues.
> >
> > APERF/MPERF, as with all per-thread MSRs, must be accessed
> > from the local processor. I didn't see where this driver
> > distinguishes the CPU. Also, I assume the intent is to return
> > a snapshot, rather than sampling, yes?
>
> I think that the perf core takes care of that for us, but I'm not entirely sure.

It does indeed. The events are always created/used in either task or cpu
context, and in the case of task context they're context switched along,
which again results in strict per cpu usage.

Since this driver has no state to track nothing else is required.

2015-04-29 09:15:20

by Peter Zijlstra

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Tue, Apr 28, 2015 at 02:25:37PM -0700, Andy Lutomirski wrote:

> +static struct pmu pmu_aperfmperf = {
> + .task_ctx_nr = perf_invalid_context,

You could actually have made that perf_sw_context, because it's
impossible to fail to add() this event. That will make it possible to
attach it to tasks and you can measure per task a/m-perf.

> + .attr_groups = attr_groups,
> + .event_init = aperfmperf_event_init,
> + .add = aperfmperf_event_add,
> + .del = aperfmperf_event_stop_or_del,
> + .start = aperfmperf_event_start,
> + .stop = aperfmperf_event_stop_or_del,
> + .read = aperfmperf_event_update,
> +};

2015-04-29 18:50:52

by Andy Lutomirski

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Apr 29, 2015 2:09 AM, "Peter Zijlstra" <[email protected]> wrote:
>
> On Tue, Apr 28, 2015 at 02:25:37PM -0700, Andy Lutomirski wrote:
> > diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> > index 80091ae54c2b..fadc822efc90 100644
> > --- a/arch/x86/kernel/cpu/Makefile
> > +++ b/arch/x86/kernel/cpu/Makefile
> > @@ -45,6 +45,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
> > perf_event_intel_uncore_snb.o \
> > perf_event_intel_uncore_snbep.o \
> > perf_event_intel_uncore_nhmex.o
> > +obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_aperf_mperf.o
> > +obj-$(CONFIG_CPU_SUP_AMD) += perf_event_aperf_mperf.o
>
> Does this actually work? I would expect it to go complain about having
> to build it twice if you have both set.

No, but only because I spelled the filename wrong while regenerating
the patch. Oops!

>
> > diff --git a/arch/x86/kernel/cpu/perf_event_aperfmperf.c b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
> > new file mode 100644
> > index 000000000000..6e6d113bd9ce
> > --- /dev/null
> > +++ b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
> > @@ -0,0 +1,119 @@
> > +#include <linux/perf_event.h>
> > +
> > +#define APERFMPERF_EVENT_APERF 0
> > +#define APERFMPERF_EVENT_MPERF 1
> > +
>
> > +static int aperfmperf_event_init(struct perf_event *event)
> > +{
> > + if (event->attr.type != event->pmu->type)
> > + return -ENOENT;
> > +
> > + if (event->attr.config != APERFMPERF_EVENT_APERF &&
> > + event->attr.config != APERFMPERF_EVENT_MPERF)
> > + return -ENOENT;
>
> Once we pass the type test we know it's 'our' event, and we can go return
> fatal errors. No other PMU will pick this up.
>
> This could therefore turn into an -EINVAL.
>
> > +
> > + if (event->attr.config1 != 0)
> > + return -ENOENT;
>
> Idem.
>
> > + /* no sampling */
> > + if (event->hw.sample_period)
> > + return -EINVAL;
>
> You could have set pmu::capabilities =
> PERF_PMU_CAP_NO_INTERRUPT which would also have killed that dead.


That checks attr.sample_period. I'm a bit confused about the
relationship between event->hw and event->attr. Do I not need to
check hw.sample_period?

>
> > + /* unsupported modes and filters */
> > + if (event->attr.exclude_user ||
> > + event->attr.exclude_kernel ||
> > + event->attr.exclude_hv ||
> > + event->attr.exclude_idle ||
> > + event->attr.exclude_host ||
> > + event->attr.exclude_guest ||
> > + event->attr.freq ||
> > + event->attr.sample_period) /* no sampling */
> > + return -EINVAL;
> > +
> > + event->hw.idx = -1;
> > + event->hw.event_base = (event->attr.config == APERFMPERF_EVENT_APERF ?
> > + MSR_IA32_APERF : MSR_IA32_MPERF);
> > +
> > + return 0;
> > +}
>
> The rest looks about right. Very simple thing indeed ;-)

Before I submit v2, do you think this is actually worth doing? I can
see it being useful for answering questions like "did this workload
end up running at full speed".

--Andy

2015-04-30 01:17:28

by Andy Lutomirski

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Wed, Apr 29, 2015 at 11:50 AM, Andy Lutomirski <[email protected]> wrote:
> On Apr 29, 2015 2:09 AM, "Peter Zijlstra" <[email protected]> wrote:
>>
>> On Tue, Apr 28, 2015 at 02:25:37PM -0700, Andy Lutomirski wrote:
>> > diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
>> > index 80091ae54c2b..fadc822efc90 100644
>> > --- a/arch/x86/kernel/cpu/Makefile
>> > +++ b/arch/x86/kernel/cpu/Makefile
>> > @@ -45,6 +45,8 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
>> > perf_event_intel_uncore_snb.o \
>> > perf_event_intel_uncore_snbep.o \
>> > perf_event_intel_uncore_nhmex.o
>> > +obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_aperf_mperf.o
>> > +obj-$(CONFIG_CPU_SUP_AMD) += perf_event_aperf_mperf.o
>>
>> Does this actually work? I would expect it to go complain about having
>> to build it twice if you have both set.
>
> No, but only because I spelled the filename wrong while regenerating
> the patch. Oops!
>
>>
>> > diff --git a/arch/x86/kernel/cpu/perf_event_aperfmperf.c b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
>> > new file mode 100644
>> > index 000000000000..6e6d113bd9ce
>> > --- /dev/null
>> > +++ b/arch/x86/kernel/cpu/perf_event_aperfmperf.c
>> > @@ -0,0 +1,119 @@
>> > +#include <linux/perf_event.h>
>> > +
>> > +#define APERFMPERF_EVENT_APERF 0
>> > +#define APERFMPERF_EVENT_MPERF 1
>> > +
>>
>> > +static int aperfmperf_event_init(struct perf_event *event)
>> > +{
>> > + if (event->attr.type != event->pmu->type)
>> > + return -ENOENT;
>> > +
>> > + if (event->attr.config != APERFMPERF_EVENT_APERF &&
>> > + event->attr.config != APERFMPERF_EVENT_MPERF)
>> > + return -ENOENT;
>>
>> Once we pass the type test we know it's 'our' event, and we can go return
>> fatal errors. No other PMU will pick this up.
>>
>> This could therefore turn into an -EINVAL.
>>
>> > +
>> > + if (event->attr.config1 != 0)
>> > + return -ENOENT;
>>
>> Idem.
>>
>> > + /* no sampling */
>> > + if (event->hw.sample_period)
>> > + return -EINVAL;
>>
>> You could have set pmu::capabilities =
>> PERF_PMU_CAP_NO_INTERRUPT which would also have killed that dead.
>
>
> That checks attr.sample_period. I'm a bit confused about the
> relationship between event->hw and event->attr. Do I not need to
> check hw.sample_period?
>
>>
>> > + /* unsupported modes and filters */
>> > + if (event->attr.exclude_user ||
>> > + event->attr.exclude_kernel ||
>> > + event->attr.exclude_hv ||
>> > + event->attr.exclude_idle ||
>> > + event->attr.exclude_host ||
>> > + event->attr.exclude_guest ||
>> > + event->attr.freq ||
>> > + event->attr.sample_period) /* no sampling */
>> > + return -EINVAL;
>> > +
>> > + event->hw.idx = -1;
>> > + event->hw.event_base = (event->attr.config == APERFMPERF_EVENT_APERF ?
>> > + MSR_IA32_APERF : MSR_IA32_MPERF);
>> > +
>> > + return 0;
>> > +}
>>
>> The rest looks about right. Very simple thing indeed ;-)
>
> Before I submit v2, do you think this is actually worth doing? I can
> see it being useful for answering questions like "did this workload
> end up running at full speed".
>

To clarify, this is partially redundant with "cpu-cycles" and
"ref-cycles". That being said, these are simpler, actually documented
as being appropriate for measuring cpu performance states, and don't
have any scheduling constraints.

Also, is perf stat able to count while idle? perf stat -a -e
cpu-cycles sleep 1 reports very small numbers.

> --Andy



--
Andy Lutomirski
AMA Capital Management, LLC

2015-04-30 08:51:42

by Peter Zijlstra

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Wed, Apr 29, 2015 at 06:17:05PM -0700, Andy Lutomirski wrote:
> >> > + /* no sampling */
> >> > + if (event->hw.sample_period)
> >> > + return -EINVAL;
> >>
> >> You could have set pmu::capabilities =
> >> PERF_PMU_CAP_NO_INTERRUPT which would also have killed that dead.
> >
> >
> > That checks attr.sample_period. I'm a bit confused about the
> > relationship between event->hw and event->attr. Do I not need to
> > check hw.sample_period?

event->attr is the perf_event_attr used to instantiate the event.
event->hw is the hardware/working state of the event.

You'll notice that attr::sample_period is part of a union and when
!attr::freq will be used as the actual hw::sample_period. However when
attr::freq we'll compute hw::sample_period based on actual event rates
such that we'll approx attr::sample_freq.

Setting pmu::capabilities = PERF_PMU_CAP_NO_INTERRUPT would be the best
solution here.
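
The attr/hw relationship described above can be pictured with a small userspace model. Everything here is a simplified stand-in: the mock_* names are invented, the starting period of 1 in frequency mode is an assumption of this sketch, and mock_adjust_period() only shows the direction of the adjustment, not the kernel's actual algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of perf_event_attr:
 * sample_period and sample_freq share storage, and `freq` says which
 * interpretation applies. */
struct mock_attr {
	union {
		uint64_t sample_period;
		uint64_t sample_freq;
	};
	unsigned int freq : 1;
};

/* Hypothetical stand-in for the working state kept in event->hw. */
struct mock_hw {
	uint64_t sample_period;
};

/* Fixed-period mode copies the requested period verbatim. Frequency mode
 * starts from a placeholder period that is adjusted later. */
static void mock_init_period(const struct mock_attr *attr, struct mock_hw *hw)
{
	assert(attr != NULL && hw != NULL);

	if (attr->freq)
		hw->sample_period = 1;
	else
		hw->sample_period = attr->sample_period;
}

/* Simplified adjustment: choose a period so that, at the observed event
 * rate, roughly sample_freq samples are taken per second. */
static uint64_t mock_adjust_period(uint64_t events_per_sec, uint64_t sample_freq)
{
	uint64_t period;

	assert(sample_freq > 0);

	period = events_per_sec / sample_freq;
	return period ? period : 1;
}
```

In this model a counting-only PMU never needs either value, which is why rejecting sampling up front is the simpler option.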

> > Before I submit v2, do you think this is actually worth doing? I can
> > see it being useful for answering questions like "did this workload
> > end up running at full speed".
> >
>
> To clarify, this is partially redundant with "cpu-cycles" and
> "ref-cycles". That being said, these are simpler, actually documented
> as being appropriate for measuring cpu performance states, and don't
> have any scheduling constraints.

On the whole useful question; I dunno. It seems like something worth
providing for the reasons you state. But I don't really get around to
doing much userspace these days so I might not be the best to answer
this.

Also, you could extend this with IA32_PPERF (Skylake and later, see
SDM-201501 book 3 section 14.4.5.1).

> Also, is perf stat able to count while idle? perf stat -a -e
> cpu-cycles sleep 1 reports very small numbers.

Yes, perf stat -a (iow cpu events) should count while idle, note however
that not all events count during halt, so it's very much event dependent.

2015-04-30 22:09:59

by Andy Lutomirski

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver

On Thu, Apr 30, 2015 at 1:51 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Apr 29, 2015 at 06:17:05PM -0700, Andy Lutomirski wrote:
>> >> > + /* no sampling */
>> >> > + if (event->hw.sample_period)
>> >> > + return -EINVAL;
>> >>
>> >> You could have set pmu::capabilities =
>> >> PERF_PMU_CAP_NO_INTERRUPT which would also have killed that dead.
>> >
>> >
>> > That checks attr.sample_period. I'm a bit confused about the
>> > relationship between event->hw and event->attr. Do I not need to
>> > check hw.sample_period?
>
> event->attr is the perf_event_attr used to instantiate the event.
> event->hw is the hardware/working state of the event.
>
> You'll notice that attr::sample_period is part of a union and when
> !attr::freq will be used as the actual hw::sample_period. However when
> attr::freq we'll compute hw::sample_period based on actual event rates
> such that we'll approx attr::sample_freq.
>
> Setting pmu::capabilities = PERF_PMU_CAP_NO_INTERRUPT would be the best
> solution here.
>
>> > Before I submit v2, do you think this is actually worth doing? I can
>> > see it being useful for answering questions like "did this workload
>> > end up running at full speed".
>> >
>>
>> To clarify, this is partially redundant with "cpu-cycles" and
>> "ref-cycles". That being said, these are simpler, actually documented
>> as being appropriate for measuring cpu performance states, and don't
>> have any scheduling constraints.
>
> On the whole useful question; I dunno. It seems like something worth
> providing for the reasons you state. But I don't really get around to
> doing much userspace these days so I might not be the best to answer
> this.
>
> Also, you could extend this with IA32_PPERF (Skylake and later, see
> SDM-201501 book 3 section 14.4.5.1).

Interesting. I can't test it for obvious reasons, and the enumeration
is not really straightforward, since it's non-architectural. If I
send the patch, can you test? Should the PMU still be called
aperfmperf?

>
>> Also, is perf stat able to count while idle? perf stat -a -e
>> cpu-cycles sleep 1 reports very small numbers.
>
> Yes, perf stat -a (iow cpu events) should count while idle, note however
> that not all events count during halt, so it's very much event dependent.

I see. MPERF, etc only count in C0.

--Andy

2015-05-11 09:48:30

by Ingo Molnar

Subject: Re: [RFC] x86, perf: Add an aperfmperf driver


* Andy Lutomirski <[email protected]> wrote:

> + event->hw.idx = -1;
> + event->hw.event_base = (event->attr.config == APERFMPERF_EVENT_APERF ?
> + MSR_IA32_APERF : MSR_IA32_MPERF);

So instead of having a separate driver per MSR, I think it might be
more useful to have a generic 'MSR as counters' PMU driver, for such
really simple cases where MSR contents represent an interesting
hardware metric, and have a table that enumerates the MSRs we allow to
be measured, and a sysfs list of them, to allow easy discovery.

APERF/MPERF would be one such MSR, MSR_SMI_COUNT another one - but
there are also other interesting ones.

Some of these are per CPU, some are system wide. Such an approach
would be far more robust than tooling poking around in /dev/msr (!).

Thanks,

Ingo
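
A table-driven design along the lines suggested above might look roughly like the following userspace model. The struct layout, the event numbering, and the "smi" event name are assumptions made for illustration; only the MSR addresses (IA32_MPERF 0xE7, IA32_APERF 0xE8, MSR_SMI_COUNT 0x34) are architectural facts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: one whitelisted MSR exposed as a counter. */
struct msr_counter {
	const char *name;	/* name a sysfs events list could expose */
	uint64_t config;	/* event number requested by userspace */
	uint32_t msr;		/* MSR address to read */
};

/* Only MSRs listed here can be measured; everything else is rejected. */
static const struct msr_counter msr_counters[] = {
	{ "aperf", 0, 0xE8 },	/* IA32_APERF */
	{ "mperf", 1, 0xE7 },	/* IA32_MPERF */
	{ "smi",   2, 0x34 },	/* MSR_SMI_COUNT */
};

#define MSR_COUNTER_NUM (sizeof(msr_counters) / sizeof(msr_counters[0]))

/* Map an event number to its table entry, or NULL if it is not allowed. */
static const struct msr_counter *msr_counter_lookup(uint64_t config)
{
	size_t i;

	for (i = 0; i < MSR_COUNTER_NUM; i++)
		if (msr_counters[i].config == config)
			return &msr_counters[i];
	return NULL;
}

/* Map an event name to its table entry, or NULL if it is unknown. */
static const struct msr_counter *msr_counter_by_name(const char *name)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; i < MSR_COUNTER_NUM; i++)
		if (strcmp(msr_counters[i].name, name) == 0)
			return &msr_counters[i];
	return NULL;
}
```

With such a table, the event_init check reduces to a lookup, and supporting another MSR means adding one row rather than writing another driver.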