LinuxLists.cc - [RFC PATCH 0/3] perf: show package power consumption in perf

2010-08-18 07:56:12

Subject: [RFC PATCH 0/3] perf: show package power consumption in perf

Hi, all,

RAPL (Running Average Power Limit) is a new feature that provides
mechanisms to enforce a power consumption limit on some new processors.

Generally speaking, by using RAPL, the OS can set a power budget for a
certain time window and let the hardware throttle the processor
P/T-state to meet this energy limit.

RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
the total amount of energy consumed by the package.
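
For readers who want to turn a raw energy-status reading into joules, here is a minimal userspace sketch. It assumes the energy unit is 1/2^ESU joules, where ESU is the "energy status units" field of the RAPL power-unit MSR; that field, the 32-bit width of the raw value, and the helper name rapl_raw_to_joules are assumptions for illustration and are not taken from the patches.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a raw energy-status reading to joules.
 *
 * Assumption: one count equals 1 / 2^esu joules, where esu is the
 * "energy status units" field read from the RAPL power-unit MSR.
 */
static double rapl_raw_to_joules(uint32_t raw, unsigned int esu)
{
	assert(esu < 32);	/* keep the shift well-defined */
	return (double)raw / (double)(1u << esu);
}
```

With esu = 16, a raw reading of 65536 counts corresponds to one joule.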

I'm not sure whether we should support RAPL or not, but either way it
sounds like a good idea to export the energy status in perf.

So this patch set introduces a new perf pmu and event that show the
energy consumed by the package.

Here is what I get after applying the three patches,

#./perf stat -e energy test
Performance counter stats for 'test':

202 Joules cost by package
7.926001238 seconds time elapsed

Note that this patch set is based on Peter's perf-pmu branch,
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
which provides better interfaces to register/unregister a new pmu.

Any comments are welcome. :)

thanks,
rui

2010-08-18 12:26:04

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> Hi, all,
>
> RAPL(running average power limit) is a new feature which provides
> mechanisms to enforce power consumption limit, on some new processors.
>
> Generally speaking, by using RAPL, OS can set a power budget in a
> certain time window, and let Hardware to throttle the processor
> P/T-state to meet this energy limitation.
>
> RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> the total amount of energy consumed by the package.
>
> I'm not sure if to support RAPL or not, but anyway, it sounds like a
> good idea to export the energy status in perf.
>
> So a new perf pmu and event to show the package energy consumed is
> introduced in this patch.
>
> Here is what I get after applying the three patches,
>
> #./perf stat -e energy test
> Performance counter stats for 'test':
>
> 202 Joules cost by package
> 7.926001238 seconds time elapsed
>
>
> Note that this patch set is made based on Peter's perf-pmu branch,
> git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> which provides better interfaces to register/unregister a new pmu.
>
> any comment are welcome. :)

Nice,.. however:

- if it is a pure read-only counter without sampling support,
expose it as such, don't fudge in the hrtimer stuff. Simply
fail to create a sampling event.

SH has the same problem for its 'normal' PMU, the solution is
to use event groups, Matt was looking at adding support to
perf-record for that, if creating a sampling event fails, fall
back to {hrtimer, $event} groups.

- since it's a free-running, non-configurable counter, you can indeed
act like it's a 'software' event in that you can schedule consumers
without constraints, however I don't think the PERF_COUNT_SW_* space
is the right way to expose this counter.

Better would be to use the sysfs stuff Lin has been working on (for
which I still need to catch up on the latest discussions), it would
then be tied to the pmu instance and appear/disappear when you load/
unload the module.

However for testing purposes I see why you'd want to have _a_
interface :-)

- it would be nice if you'd write the cpu detection a bit more readably;
also, it looks like you forgot to check x86_vendor == X86_VENDOR_INTEL.

> +static int __init intel_rapl_init(void)
> +{
> + /*
> + * RAPL features are only supported on processors have a CPUID
> + * signature with DisplayFamily_DisplayModel of 06_2AH, 06_2DH
> + */
> + if (boot_cpu_data.x86 != 0x06 ||
> + (boot_cpu_data.x86_model != 0x2A &&
> + boot_cpu_data.x86_model != 0x2D))
> + return -ENODEV;
> +
> + if (rapl_check_unit())
> + return -ENODEV;
> +
> + perf_pmu_register(&rapl_pmu);
> + return 0;
> +}

Maybe something like (see intel_pmu_init() for example):

	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
		return -ENODEV;

	if (boot_cpu_data.x86 != 0x06)
		return -ENODEV;

	switch (boot_cpu_data.x86_model) {
	case 0x2A: /* sandybridge ?! 32nm */
	case 0x2D: /* othermodel 32nm */
		break;

	default:
		return -ENODEV;
	}

Which again reminds me to ask Intel for a comprehensive x86_model list,
please?

Alternatively, you can create a X86_FEATURE_RAPL and simply use
boot_cpu_has(X86_FEATURE_RAPL) (much like intel_ds_init() has).
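
The vendor/family/model check suggested above can also be isolated into a pure function, which makes it easy to exercise outside the kernel. This is only a sketch: struct cpu_id, enum cpu_vendor and rapl_cpu_supported are hypothetical stand-ins for boot_cpu_data and the kernel's vendor constants, and the model list is just the two values quoted in the patch comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's vendor constants. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_OTHER };

/* Hypothetical subset of the fields the check needs. */
struct cpu_id {
	enum cpu_vendor vendor;
	unsigned int family;
	unsigned int model;
};

/* Returns true only for Intel family 6, models 0x2A and 0x2D. */
static bool rapl_cpu_supported(const struct cpu_id *c)
{
	assert(c != NULL);

	if (c->vendor != VENDOR_INTEL)
		return false;
	if (c->family != 0x06)
		return false;

	switch (c->model) {
	case 0x2A:
	case 0x2D:
		return true;
	default:
		return false;
	}
}
```

Keeping each condition on its own line means a new model only adds one case label.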

2010-08-18 12:41:19

by Matt Fleming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Wed, Aug 18, 2010 at 02:25:29PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > Hi, all,
> >
> > RAPL(running average power limit) is a new feature which provides
> > mechanisms to enforce power consumption limit, on some new processors.
> >
> > Generally speaking, by using RAPL, OS can set a power budget in a
> > certain time window, and let Hardware to throttle the processor
> > P/T-state to meet this energy limitation.
> >
> > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > the total amount of energy consumed by the package.
> >
> > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > good idea to export the energy status in perf.
> >
> > So a new perf pmu and event to show the package energy consumed is
> > introduced in this patch.
> >
> > Here is what I get after applying the three patches,
> >
> > #./perf stat -e energy test
> > Performance counter stats for 'test':
> >
> > 202 Joules cost by package
> > 7.926001238 seconds time elapsed
> >
> >
> > Note that this patch set is made based on Peter's perf-pmu branch,
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> > which provides better interfaces to register/unregister a new pmu.
> >
> > any comment are welcome. :)
>
>
> Nice,.. however:
>
> - if it is a pure read-only counter without sampling support,
> expose it as such, don't fudge in the hrtimer stuff. Simply
> fail to create a sampling event.
>
> SH has the same problem for its 'normal' PMU, the solution is
> to use event groups, Matt was looking at adding support to
> perf-record for that, if creating a sampling event fails, fall
> back to {hrtimer, $event} groups.

I had a quick look over the patches and Peter is right - the group
events stuff would probably fit quite well here. Unfortunately, due to
holidays and things, I haven't been able to get them finished
yet. I'll get on that ASAP.

2010-08-19 02:43:12

by Lin Ming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Wed, 2010-08-18 at 20:25 +0800, Peter Zijlstra wrote:
> On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > Hi, all,
> >
> > RAPL(running average power limit) is a new feature which provides
> > mechanisms to enforce power consumption limit, on some new processors.
> >
> > Generally speaking, by using RAPL, OS can set a power budget in a
> > certain time window, and let Hardware to throttle the processor
> > P/T-state to meet this energy limitation.
> >
> > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > the total amount of energy consumed by the package.
> >
> > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > good idea to export the energy status in perf.
> >
> > So a new perf pmu and event to show the package energy consumed is
> > introduced in this patch.
> >
> > Here is what I get after applying the three patches,
> >
> > #./perf stat -e energy test
> > Performance counter stats for 'test':
> >
> > 202 Joules cost by package
> > 7.926001238 seconds time elapsed
> >
> >
> > Note that this patch set is made based on Peter's perf-pmu branch,
> > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> > which provides better interfaces to register/unregister a new pmu.
> >
> > any comment are welcome. :)
>
>
> Nice,.. however:
>
> - if it is a pure read-only counter without sampling support,
> expose it as such, don't fudge in the hrtimer stuff. Simply
> fail to create a sampling event.
>
> SH has the same problem for its 'normal' PMU, the solution is
> to use event groups, Matt was looking at adding support to
> perf-record for that, if creating a sampling event fails, fall
> back to {hrtimer, $event} groups.
>
> - since its a free-running, non-configurable counter, you can indeed
> act like its a 'software' event in that you can schedule consumers
> without constraints, however I don't think the PERF_COUNT_SW_* space
> is the right way to expose this counter.
>
> Better would be to use the sysfs stuff Lin has been working on (for

Sorry, I don't yet have a good idea of how to export the various
tracepoint events automatically, so this work will take time.

Lin Ming

> which I still need to catch up on the latest discussions), it would
> then be tied to the pmu instance and appear/disappear when you load/
> unload the module.
>
> However for testing purposes I see why you'd want to have _a_
> interface :-)
>
> - it would be nice if you'd write the cpu detection a bit more readable,
> also, it looks like you forgot to check x86_vendor == X86_VENDOR_INTEL.
>
> > +static int __init intel_rapl_init(void)
> > +{
> > + /*
> > + * RAPL features are only supported on processors have a CPUID
> > + * signature with DisplayFamily_DisplayModel of 06_2AH, 06_2DH
> > + */
> > + if (boot_cpu_data.x86 != 0x06 ||
> > + (boot_cpu_data.x86_model != 0x2A &&
> > + boot_cpu_data.x86_model != 0x2D))
> > + return -ENODEV;
> > +
> > + if (rapl_check_unit())
> > + return -ENODEV;
> > +
> > + perf_pmu_register(&rapl_pmu);
> > + return 0;
> > +}
>
> Maybe something like (see intel_pmu_init() for example):
>
> if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> return -ENODEV;
>
> if (boot_cpu_data.x86 != 0x06)
> return -ENODEV;
>
> switch (boot_cpu_data.x86_model) {
> case 0x2A: /* sandybridge ?! 32nm */
> case 0x2D: /* othermodel 32nm */
> break;
>
> default:
> return -ENODEV;
> }
>
> Which again reminds me to ask of Intel, a comprehensive x86_model list,
> please?
>
> Alternatively, you can create a X86_FEATURE_RAPL and simply use
> boot_cpu_has(X86_FEATURE_RAPL) (much like intel_ds_init() has).

2010-08-19 03:27:58

by Lin Ming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> On Wed, Aug 18, 2010 at 02:25:29PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > > Hi, all,
> > >
> > > RAPL(running average power limit) is a new feature which provides
> > > mechanisms to enforce power consumption limit, on some new processors.
> > >
> > > Generally speaking, by using RAPL, OS can set a power budget in a
> > > certain time window, and let Hardware to throttle the processor
> > > P/T-state to meet this energy limitation.
> > >
> > > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > > the total amount of energy consumed by the package.
> > >
> > > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > > good idea to export the energy status in perf.
> > >
> > > So a new perf pmu and event to show the package energy consumed is
> > > introduced in this patch.
> > >
> > > Here is what I get after applying the three patches,
> > >
> > > #./perf stat -e energy test
> > > Performance counter stats for 'test':
> > >
> > > 202 Joules cost by package
> > > 7.926001238 seconds time elapsed
> > >
> > >
> > > Note that this patch set is made based on Peter's perf-pmu branch,
> > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> > > which provides better interfaces to register/unregister a new pmu.
> > >
> > > any comment are welcome. :)
> >
> >
> > Nice,.. however:
> >
> > - if it is a pure read-only counter without sampling support,
> > expose it as such, don't fudge in the hrtimer stuff. Simply
> > fail to create a sampling event.
> >
> > SH has the same problem for its 'normal' PMU, the solution is
> > to use event groups, Matt was looking at adding support to
> > perf-record for that, if creating a sampling event fails, fall
> > back to {hrtimer, $event} groups.
>
> I had a quick look over the patches and Peter is right - the group
> events stuff would probably fit quite well here. Unfortunately, due to
> holidays and things, I haven't been able to get them finished
> yet. I'll get on that ASAP.

Hi, Matt

What's the "group events stuff"?
Is there some discussion on LKML or elsewhere I can have a look at?

Thanks,
Lin Ming

2010-08-19 07:54:14

by Matt Fleming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> >
> > I had a quick look over the patches and Peter is right - the group
> > events stuff would probably fit quite well here. Unfortunately, due to
> > holidays and things, I haven't been able to get them finished
> > yet. I'll get on that ASAP.
>
> Hi, Matt
>
> What's the "group events stuff"?
> Is there some discussion on LKML or elsewhere I can have a look at?
>
> Thanks,
> Lin Ming

The relevant information can be found here in this thread,
http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
this but they're not finished yet. I can probably get something to
show by next week.

The discussion started because the performance counters on SH do not
generate an interrupt on overflow, so we need to periodically sample
them. Am I correct in thinking that the energy counters also do not
generate an interrupt on overflow and that's why you wrote the event
as a software event?

2010-08-19 08:15:19

by Lin Ming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 15:54 +0800, Matt Fleming wrote:
> On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> > On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > >
> > > I had a quick look over the patches and Peter is right - the group
> > > events stuff would probably fit quite well here. Unfortunately, due to
> > > holidays and things, I haven't been able to get them finished
> > > yet. I'll get on that ASAP.
> >
> > Hi, Matt
> >
> > What's the "group events stuff"?
> > Is there some discussion on LKML or elsewhere I can have a look at?
> >
> > Thanks,
> > Lin Ming
>
> The relevant information can be found here in this thread,
> http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
> this but they're not finished yet. I can probably get something to
> show by next week.

Thanks.

>
> The discussion started because the performance counters on SH do not
> generate an interrupt on overflow, so we need to periodically sample
> them. Am I correct in thinking that the energy counters also do not
> generate an interrupt on overflow and that's why you wrote the event
> as a software event?

I think so.

Rui, could you confirm this?

2010-08-19 08:28:54

by Zhang, Rui

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 15:54 +0800, Matt Fleming wrote:
> On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> > On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > >
> > > I had a quick look over the patches and Peter is right - the group
> > > events stuff would probably fit quite well here. Unfortunately, due to
> > > holidays and things, I haven't been able to get them finished
> > > yet. I'll get on that ASAP.
> >
> > Hi, Matt
> >
> > What's the "group events stuff"?
> > Is there some discussion on LKML or elsewhere I can have a look at?
> >
> > Thanks,
> > Lin Ming
>
> The relevant information can be found here in this thread,
> http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
> this but they're not finished yet. I can probably get something to
> show by next week.
>
> The discussion started because the performance counters on SH do not
> generate an interrupt on overflow, so we need to periodically sample
> them. Am I correct in thinking that the energy counters also do not
> generate an interrupt on overflow and that's why you wrote the event
> as a software event?

right.

BTW, I'm not very familiar with the perf tool, and now I'm wondering
whether periodic sampling is needed at all, because IMO .start is
invoked every time the process is scheduled in, and .stop is invoked
when it's scheduled out. It seems that we just need to read the energy
consumed in .start and .stop, and update the counter in .stop, right?

thanks,
rui
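
The .start/.stop scheme described above could look roughly like the following userspace model. It is a sketch under assumptions: the raw counter is taken to be 32 bits wide, and struct energy_event, energy_start and energy_stop are hypothetical names, not the callbacks from the patch. It is only correct if the counter wraps at most once between a start and the matching stop.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-event state; not the structure used in the patch. */
struct energy_event {
	uint32_t prev;	/* raw counter value seen at start */
	uint64_t total;	/* accumulated raw counts */
	int running;	/* non-zero between start and stop */
};

/* Called when the task is scheduled in: remember the current reading. */
static void energy_start(struct energy_event *ev, uint32_t raw_now)
{
	ev->prev = raw_now;
	ev->running = 1;
}

/* Called when the task is scheduled out: fold the delta into the total. */
static void energy_stop(struct energy_event *ev, uint32_t raw_now)
{
	assert(ev->running);
	/* Unsigned subtraction is modulo 2^32, so one wrap is tolerated. */
	ev->total += (uint32_t)(raw_now - ev->prev);
	ev->running = 0;
}
```

If the counter can wrap more than once inside a single start/stop window, the missing multiples of 2^32 are silently lost.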

2010-08-19 08:32:20

by Matt Fleming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, Aug 19, 2010 at 04:31:54PM +0800, Zhang Rui wrote:
> On Thu, 2010-08-19 at 15:54 +0800, Matt Fleming wrote:
> > On Thu, Aug 19, 2010 at 11:28:17AM +0800, Lin Ming wrote:
> > > On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > > >
> > > > I had a quick look over the patches and Peter is right - the group
> > > > events stuff would probably fit quite well here. Unfortunately, due to
> > > > holidays and things, I haven't been able to get them finished
> > > > yet. I'll get on that ASAP.
> > >
> > > Hi, Matt
> > >
> > > What's the "group events stuff"?
> > > Is there some discussion on LKML or elsewhere I can have a look at?
> > >
> > > Thanks,
> > > Lin Ming
> >
> > The relevant information can be found here in this thread,
> > http://lkml.org/lkml/2010/8/4/174. I'm working on some patches for
> > this but they're not finished yet. I can probably get something to
> > show by next week.
> >
> > The discussion started because the performance counters on SH do not
> > generate an interrupt on overflow, so we need to periodically sample
> > them. Am I correct in thinking that the energy counters also do not
> > generate an interrupt on overflow and that's why you wrote the event
> > as a software event?
>
> right.
>
> BTW, I'm not quite familiar with perf tool, and now I'm wondering if the
> periodically sample is needed.
> because IMO, .start is invoked every time the process is scheduled in,
> and .stop is invoked when it's scheduled out. It seems that we just need
> to read the energy consumed in .start and .stop, and update the counter
> in .stop, right?

How big is the hardware counter? The problem comes when the process is
scheduled in and runs for a long time, e.g. so long that the energy
hardware counter wraps. This is why it's necessary to periodically
sample the counter.
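
To put a number on "runs for a long time", the time until the counter wraps can be estimated from its width, the size of one count and the power draw. The sketch below is plain arithmetic; the example figures (a 32-bit counter, 1/65536 J per count, a constant 100 W) are illustrative assumptions, not values confirmed in this thread, and seconds_until_wrap is a hypothetical helper.

```c
#include <assert.h>

/*
 * Seconds until a counter of `bits` width wraps, given the size of one
 * count in joules and a constant power draw in watts.
 */
static double seconds_until_wrap(unsigned int bits, double joules_per_count,
				 double watts)
{
	assert(bits > 0 && bits <= 63);
	assert(watts > 0.0);
	return (double)(1ULL << bits) * joules_per_count / watts;
}
```

Under those assumed figures the counter holds 65536 J and wraps after roughly 655 seconds, so a task that stays scheduled in for longer than that would need an intermediate read.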

2010-08-19 08:54:41

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 10:43 +0800, Lin Ming wrote:
> Sorry that I have no good idea how to export the various tracepoints
> events automatically, so this work will take time.
>
Well, we could start with just the hardware bits and leave the tracepoint
bits for later, right?

2010-08-19 09:02:15

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 11:28 +0800, Lin Ming wrote:
> On Wed, 2010-08-18 at 20:41 +0800, Matt Fleming wrote:
> > On Wed, Aug 18, 2010 at 02:25:29PM +0200, Peter Zijlstra wrote:
> > > On Wed, 2010-08-18 at 15:59 +0800, Zhang Rui wrote:
> > > > Hi, all,
> > > >
> > > > RAPL(running average power limit) is a new feature which provides
> > > > mechanisms to enforce power consumption limit, on some new processors.
> > > >
> > > > Generally speaking, by using RAPL, OS can set a power budget in a
> > > > certain time window, and let Hardware to throttle the processor
> > > > P/T-state to meet this energy limitation.
> > > >
> > > > RAPL also provides a new MSR, i.e. MSR_PKG_ENERGY_STATUS, which reports
> > > > the total amount of energy consumed by the package.
> > > >
> > > > I'm not sure if to support RAPL or not, but anyway, it sounds like a
> > > > good idea to export the energy status in perf.
> > > >
> > > > So a new perf pmu and event to show the package energy consumed is
> > > > introduced in this patch.
> > > >
> > > > Here is what I get after applying the three patches,
> > > >
> > > > #./perf stat -e energy test
> > > > Performance counter stats for 'test':
> > > >
> > > > 202 Joules cost by package
> > > > 7.926001238 seconds time elapsed
> > > >
> > > >
> > > > Note that this patch set is made based on Peter's perf-pmu branch,
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-perf.git
> > > > which provides better interfaces to register/unregister a new pmu.
> > > >
> > > > any comment are welcome. :)
> > >
> > >
> > > Nice,.. however:
> > >
> > > - if it is a pure read-only counter without sampling support,
> > > expose it as such, don't fudge in the hrtimer stuff. Simply
> > > fail to create a sampling event.
> > >
> > > SH has the same problem for its 'normal' PMU, the solution is
> > > to use event groups, Matt was looking at adding support to
> > > perf-record for that, if creating a sampling event fails, fall
> > > back to {hrtimer, $event} groups.
> >
> > I had a quick look over the patches and Peter is right - the group
> > events stuff would probably fit quite well here. Unfortunately, due to
> > holidays and things, I haven't been able to get them finished
> > yet. I'll get on that ASAP.
>
> Hi, Matt
>
> What's the "group events stuff"?
> Is there some discussion on LKML or elsewhere I can have a look at?

It's some obscure perf feature:

leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);

will create an event group (which means that both events are required to
be co-scheduled). If you then provide:

hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
hrtimer_attr.sample_type |= PERF_SAMPLE_READ;

the samples from the hrtimer will contain a field like:

 * { u64	nr;
 *   { u64	time_enabled; } && PERF_FORMAT_ENABLED
 *   { u64	time_running; } && PERF_FORMAT_RUNNING
 *   { u64	value;
 *     { u64	id;           } && PERF_FORMAT_ID
 *   }		cntr[nr];
 * } && PERF_FORMAT_GROUP

Which contains both the hrtimer count (ns) and the RAPL count (watts).
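
As an illustration of how a consumer might walk that layout, here is a small parser over a buffer of u64 words. It is a sketch: the FMT_* flags, struct group_reading and both functions are hypothetical local names that mirror the field order shown above, rather than the definitions from the perf headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local flags mirroring the layout quoted above; names are illustrative. */
#define FMT_TIME_ENABLED (1u << 0)
#define FMT_TIME_RUNNING (1u << 1)
#define FMT_ID           (1u << 2)

struct group_reading {
	uint64_t nr;		/* number of counters in the group */
	uint64_t time_enabled;	/* 0 if not present in the format */
	uint64_t time_running;	/* 0 if not present in the format */
	const uint64_t *cntr;	/* first { value[, id] } entry */
	size_t stride;		/* u64 words per entry: 1 or 2 */
};

/* Returns 0 on success, -1 if the buffer is too short for the format. */
static int parse_group_reading(const uint64_t *buf, size_t words,
			       unsigned int fmt, struct group_reading *out)
{
	size_t pos = 0;

	assert(buf != NULL && out != NULL);
	if (words < 1)
		return -1;
	out->nr = buf[pos++];
	out->time_enabled = 0;
	out->time_running = 0;
	if (fmt & FMT_TIME_ENABLED) {
		if (pos >= words)
			return -1;
		out->time_enabled = buf[pos++];
	}
	if (fmt & FMT_TIME_RUNNING) {
		if (pos >= words)
			return -1;
		out->time_running = buf[pos++];
	}
	out->stride = (fmt & FMT_ID) ? 2 : 1;
	if (out->nr > (words - pos) / out->stride)
		return -1;
	out->cntr = buf + pos;
	return 0;
}

/* Value of the i-th counter; entry 0 is the group leader. */
static uint64_t group_value(const struct group_reading *r, uint64_t i)
{
	assert(i < r->nr);
	return r->cntr[i * r->stride];
}
```

In the {hrtimer, rapl} group discussed here, entry 0 would carry the timer count and entry 1 the energy count.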

Using that you can compute the RAPL delta between consecutive samples
and use that to weight the sample.

For perf-stat none of this is needed, since it doesn't use sampling
counters anyway ;-).
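
The delta-as-weight idea can be sketched in a few lines. Given the cumulative counter value carried by each consecutive timer sample, the difference between neighbours is the weight of the later sample, and those weights can then be summed per bucket (for example per symbol). All names below are hypothetical, and the bucket assignment is assumed to come from wherever the sample's instruction pointer was resolved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * counts[0..n-1] are cumulative readings from consecutive samples.
 * Writes n-1 weights: weights[i] = counts[i + 1] - counts[i].
 */
static void weights_from_counts(const uint64_t *counts, size_t n,
				uint64_t *weights)
{
	size_t i;

	assert(counts != NULL && weights != NULL);
	for (i = 0; i + 1 < n; i++)
		weights[i] = counts[i + 1] - counts[i];
}

/* Add each sample's weight to the bucket that the sample landed in. */
static void attribute_weights(const unsigned int *bucket,
			      const uint64_t *weights, size_t n,
			      uint64_t *totals, size_t nbuckets)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(bucket[i] < nbuckets);
		totals[bucket[i]] += weights[i];
	}
}
```

A bucket that collects few samples can still end up with a large total if the counter advanced a lot during those samples.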

2010-08-19 09:45:20

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 09:32 +0100, Matt Fleming wrote:
>
>
> How big is the hardware counter? The problem comes when the process is
> scheduled in and runs for a long time, e.g. so long that the energy
> hardware counter wraps. This is why it's necessary to periodically
> sample the counter.
>
Long running processes aren't the only case, you could associate an
event with a CPU.

Right, short counters (like SH when not chained) need something to
accumulate deltas into the larger u64. You can indeed use timers for
that, hr or otherwise, but you don't need the swcounter hrtimer
infrastructure for that.
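
Accumulating a short counter into a u64 comes down to a masked delta on every poll, whatever drives the poll (a timer, a context switch, a read). The sketch below models that for a counter of arbitrary width; struct wide_counter, counter_delta and counter_poll are hypothetical names, and the scheme assumes the poll happens at least once per wrap period.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software extension of a narrow hardware counter. */
struct wide_counter {
	uint64_t prev;		/* last raw value read from the hardware */
	uint64_t total;		/* accumulated 64-bit count */
	unsigned int bits;	/* width of the hardware counter */
};

/* Difference between two raw readings, modulo 2^bits. */
static uint64_t counter_delta(uint64_t prev, uint64_t now, unsigned int bits)
{
	uint64_t mask;

	assert(bits >= 1 && bits <= 64);
	mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
	return (now - prev) & mask;
}

/* Fold the counts since the previous poll into the 64-bit total. */
static void counter_poll(struct wide_counter *wc, uint64_t raw_now)
{
	wc->total += counter_delta(wc->prev, raw_now, wc->bits);
	wc->prev = raw_now;
}
```

The poll interval only has to be shorter than the wrap period; it does not need to match any sampling period that userspace asked for.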

2010-08-20 00:20:49

by Lin Ming

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 16:54 +0800, Peter Zijlstra wrote:
> On Thu, 2010-08-19 at 10:43 +0800, Lin Ming wrote:
> > Sorry that I have no good idea how to export the various tracepoints
> > events automatically, so this work will take time.
> >
> Well, we could start with just he hardware bits and leave the tracepoint
> bits for later, right?

Right. I'll update the patches.

2010-08-20 01:41:30

by Zhang, Rui

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, 2010-08-19 at 17:02 +0800, Peter Zijlstra wrote:
> > > >
> > > > - if it is a pure read-only counter without sampling support,
> > > > expose it as such, don't fudge in the hrtimer stuff. Simply
> > > > fail to create a sampling event.
> > > >
> > > > SH has the same problem for its 'normal' PMU, the solution is
> > > > to use event groups, Matt was looking at adding support to
> > > > perf-record for that, if creating a sampling event fails, fall
> > > > back to {hrtimer, $event} groups.
> > >
> > > I had a quick look over the patches and Peter is right - the group
> > > events stuff would probably fit quite well here. Unfortunately, due to
> > > holidays and things, I haven't been able to get them finished
> > > yet. I'll get on that ASAP.
> >
> > Hi, Matt
> >
> > What's the "group events stuff"?
> > Is there some discussion on LKML or elsewhere I can have a look at?
>
> its some obscure perf feature:
>
> leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
> sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);
>
> will create an even group (which means that both events require to be
> co-scheduled). If you then provided:
>
> hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
> hrtimer_attr.sample_type |= PERF_SAMPLE_READ;
>
hrtimer_attr is only shared within an event group, and rapl needs its own
event group, right?

> the samples from the hrtimer will contain a field like:
>
> * { u64 nr;
> * { u64 time_enabled; } && PERF_FORMAT_ENABLED
> * { u64 time_running; } && PERF_FORMAT_RUNNING
> * { u64 value;
> * { u64 id; } && PERF_FORMAT_ID
> * } cntr[nr];
> * } && PERF_FORMAT_GROUP
>
> Which contains both the hrtimer count (ns) and the RAPL count (watts).
>
> Using that you can compute the RAPL delta between consecutive samples
> and use that to weight the sample.
>
>
> For perf-stat non of this is needed, since it doesn't use sampling
> counters anyway ;-).

so what do you think the rapl counter should look like in userspace?
showing it in perf-stat looks nice, right? :)

thanks,
rui

2010-08-20 09:34:32

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Fri, 2010-08-20 at 09:44 +0800, Zhang Rui wrote:
> On Thu, 2010-08-19 at 17:02 +0800, Peter Zijlstra wrote:

> > its some obscure perf feature:
> >
> > leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
> > sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);
> >
> > will create an even group (which means that both events require to be
> > co-scheduled). If you then provided:
> >
> > hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
> > hrtimer_attr.sample_type |= PERF_SAMPLE_READ;
> >
> hrtimer_attr is only shared in an event group, and rapl needs its owen
> event group, right?

Uhm, no. The idea is to group the hrtimer and rapl event in order to
obtain rapl 'samples'.

That is, you get hrtimer samples which include the rapl count. For this
we use the grouping construct where group siblings are always
co-scheduled and can report on each others count.
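
From userspace, such a group could be set up roughly as follows with the perf_event_open system call. This is a sketch under assumptions: it uses the software cpu-clock event as the timer leader, leaves the sibling's attr to the caller (how a RAPL event would be identified was still undecided in this thread), and opens the leader with group_fd = -1, which is what the syscall interface as documented today expects for a new group. The helper names are hypothetical.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
int perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
	      int group_fd, unsigned long flags)
{
	return (int)syscall(SYS_perf_event_open, attr, pid, cpu,
			    group_fd, flags);
}

/* Leader: a software clock that produces a sample every period_ns. */
void init_timer_leader(struct perf_event_attr *attr, uint64_t period_ns)
{
	assert(attr != NULL && period_ns > 0);
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_CPU_CLOCK;
	attr->sample_period = period_ns;
	/* Each sample carries the counts of the whole group. */
	attr->sample_type = PERF_SAMPLE_READ;
	attr->read_format = PERF_FORMAT_GROUP;
}

/*
 * Open the leader and attach one sibling to it. The sibling attr is
 * supplied by the caller. Returns the leader fd, or -1 on failure.
 */
int open_sampling_group(struct perf_event_attr *leader_attr,
			struct perf_event_attr *sibling_attr,
			pid_t pid, int cpu, int *sibling_fd)
{
	int leader = perf_open(leader_attr, pid, cpu, -1, 0);

	if (leader < 0)
		return -1;
	*sibling_fd = perf_open(sibling_attr, pid, cpu, leader, 0);
	if (*sibling_fd < 0) {
		close(leader);
		return -1;
	}
	return leader;
}
```

Whether the call succeeds depends on the running kernel and its perf permission settings, so callers should be prepared for -1.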

> so what do you think the rapl counter should look like in userspace?
> showing it in perf-stat looks nice, right? :)

Right, so the userspace interface would be using Lin's sysfs bits, which
I still need to read up on. But the general idea is that each PMU gets a
sysfs representation somewhere in the system topology reflecting its
actual site (RAPL would be CPU local), this sysfs representation would
then also allow you to discover all events it provides.

perf list will then use sysfs to discover all available events, and you
can still use perf stat -e $foo to select it, where foo is some to be
determined string that identifies the thing, maybe something like:
rapl:watts or somesuch (with rapl identifying the pmu and watts the
actual event for that pmu).
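
If the selector does end up looking like "rapl:watts", splitting it into a pmu name and an event name is straightforward. The sketch below assumes exactly that "pmu:event" shape, which is only floated as a possibility above; split_event_spec is a hypothetical helper and not part of the perf tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Split "pmu:event" at the first colon into two NUL-terminated strings.
 * Returns 0 on success, -1 if the colon is missing, either part is
 * empty, or a part does not fit into its buffer.
 */
static int split_event_spec(const char *spec, char *pmu, size_t pmu_len,
			    char *event, size_t event_len)
{
	const char *colon;
	size_t plen, elen;

	assert(spec != NULL && pmu != NULL && event != NULL);
	colon = strchr(spec, ':');
	if (colon == NULL)
		return -1;
	plen = (size_t)(colon - spec);
	elen = strlen(colon + 1);
	if (plen == 0 || elen == 0 || plen >= pmu_len || elen >= event_len)
		return -1;
	memcpy(pmu, spec, plen);
	pmu[plen] = '\0';
	memcpy(event, colon + 1, elen + 1);	/* includes the terminator */
	return 0;
}
```

The pmu part would then select the sysfs directory to look in, and the event part the entry inside it.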

2010-08-20 12:32:43

by Ingo Molnar

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

* Peter Zijlstra wrote:

> On Fri, 2010-08-20 at 09:44 +0800, Zhang Rui wrote:
> > On Thu, 2010-08-19 at 17:02 +0800, Peter Zijlstra wrote:
>
> > > its some obscure perf feature:
> > >
> > > leader = sys_perf_event_open(&hrtimer_attr, pid, cpu, 0, 0);
> > > sibling = sys_perf_event_open(&rapl_attr, pid, cpu, leader, 0);
> > >
> > > will create an even group (which means that both events require to be
> > > co-scheduled). If you then provided:
> > >
> > > hrtimer_attr.read_format |= PERF_FORMAT_GROUP;
> > > hrtimer_attr.sample_type |= PERF_SAMPLE_READ;
> > >
> > hrtimer_attr is only shared in an event group, and rapl needs its owen
> > event group, right?
>
> Uhm, no. The idea is to group the hrtimer and rapl event in order to
> obtain rapl 'samples'.
>
> That is, you get hrtimer samples which include the rapl count. For this
> we use the grouping construct where group siblings are always
> co-scheduled and can report on each others count.
>
> > so what do you think the rapl counter should look like in userspace?
> > showing it in perf-stat looks nice, right? :)
>
> Right, so the userspace interface would be using Lin's sysfs bits, which I
> still need to read up on. But the general idea is that each PMU gets a sysfs
> representation somewhere in the system topology reflecting its actual site
> (RAPL would be CPU local), this sysfs representation would then also allow
> you to discover all events it provides.
>
> perf list will then use sysfs to discover all available events, and you can
> still use perf stat -e $foo to select it, where foo is some to be determined
> string that identifies the thing, maybe something like: rapl:watts or
> somesuch (with rapl identifying the pmu and watts the actual event for that
> pmu).

Btw., some 'perf list' thoughts. We could do a:

perf list --help rapl:watts

Which gives the user some idea of what an event does. Also, a short
descriptive line in perf list output would be nice:

$ perf list

List of pre-defined events (to be used in -e):

cpu-cycles OR cycles [Hardware event] # CPU cycles
instructions [Hardware event] # instructions executed

...

rapl:watts [Tracepoint] # watts usage

or something like that. Perhaps even a TUI for perf list, to browse between
event types? (in that case it would probably be useful to make them collapse
along natural grouping)

We want users/developers to discover new events, see and understand their
purpose and combine them in not-seen-before ways.

Thanks,

Ingo

2010-08-20 21:35:41

by Arnaldo Carvalho de Melo

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

Em Fri, Aug 20, 2010 at 02:31:59PM +0200, Ingo Molnar escreveu:
> Btw., some 'perf list' thoughts. We could do a:
>
> perf list --help rapl:watts
>
> Which gives the user some idea what an event does. Also, short descriptive
> line in perf list output would be nice:
>
> $ perf list
>
> List of pre-defined events (to be used in -e):
>
> cpu-cycles OR cycles [Hardware event] # CPU cycles
> instructions [Hardware event] # instructions executed
>
> ...
>
> rapl:watts [Tracepoint] # watts usage
>
> or something like that. Perhaps even a TUI for perf list, to browse between
> event types? (in that case it would probably be useful to make them collapse
> along natural grouping)
>
> We want users/developers to discover new events, see and understand their
> purpose and combine them in not-seen-before ways.

Right, record, list, probe and top are on the UI (not just T-UI, see the
latest efforts on decoupling from newt/slang) hit-list :)

The goal is to move seamlessly from one to the other, as is possible
today for report and annotate.

Now that the UI browser code is more robust and generic that should
happen faster, I think.

- Arnaldo

2010-08-21 01:18:12

by Frederic Weisbecker

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Thu, Aug 19, 2010 at 11:44:45AM +0200, Peter Zijlstra wrote:
> On Thu, 2010-08-19 at 09:32 +0100, Matt Fleming wrote:
> >
> >
> > How big is the hardware counter? The problem comes when the process is
> > scheduled in and runs for a long time, e.g. so long that the energy
> > hardware counter wraps. This is why it's necessary to periodically
> > sample the counter.
> >
> Long running processes aren't the only case, you could associate an
> event with a CPU.

I don't understand what you mean.

> Right, short counters (like SH when not chained) need something to
> accumulate deltas into the larger u64. You can indeed use timers for
> that, hr or otherwise, but you don't need the swcounter hrtimer
> infrastructure for that.

So what is the point in simulating a PMI using an hrtimer? It won't be
based on periods on the interesting counter but on time periods. This
is not how we want the samples. If we want timer based samples, we can
just launch a separate software timer based event.

In the case of SH where we need to flush to avoid wraps, I understand, but
otherwise?
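
The "accumulate deltas into the larger u64" step quoted above can be sketched roughly as follows. This is only an illustration: the 32-bit counter width and the names counter_acc, counter_acc_init and counter_acc_update are assumptions, not code from the patch set. It relies on unsigned subtraction being modulo 2^32, so the delta is correct as long as the counter is polled at least once per wrap period.

```c
#include <stdint.h>

/* Hypothetical accumulator for a narrow, free-running hardware counter. */
struct counter_acc {
    uint32_t last;   /* raw hardware value seen at the previous poll */
    uint64_t total;  /* accumulated 64-bit count */
};

/* Start accumulating from the current raw counter value. */
void counter_acc_init(struct counter_acc *acc, uint32_t raw_now)
{
    acc->last = raw_now;
    acc->total = 0;
}

/*
 * Fold the newest raw value into the 64-bit total.  The subtraction is
 * done in 32-bit unsigned arithmetic, so a single wrap between two
 * polls still yields the right delta.  Two or more wraps between polls
 * are silently lost, which is why a periodic timer is needed.
 */
uint64_t counter_acc_update(struct counter_acc *acc, uint32_t raw_now)
{
    uint32_t delta = raw_now - acc->last;

    acc->last = raw_now;
    acc->total += delta;
    return acc->total;
}
```

A timer callback (hr or otherwise) would call counter_acc_update() with a fresh read of the counter; no sampling infrastructure is involved in this part.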

2010-08-21 09:30:57

by Ingo Molnar

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

* Frederic Weisbecker wrote:

> > Right, short counters (like SH when not chained) need something to
> > accumulate deltas into the larger u64. You can indeed use timers for
> > that, hr or otherwise, but you don't need the swcounter hrtimer
> > infrastructure for that.
>
> So what is the point in simulating a PMI using an hrtimer? It won't be
> based on periods on the interesting counter but on time periods. This
> is not how we want the samples. If we want timer based samples, we can
> just launch a separate software timer based event.

If we then measure the delta of the count during that constant-time
period, we'll get a 'weight' to consider.

So for example if we sample with a period of every 1000 cache-misses,
regular same-counter-PMU-IRQ sampling goes like this:

1000
1000
1000
1000
1000
....

While if we use an hrtimer, we get variations:

1050
711
1539
2210
400

But using that variable period as a weight will, statistically,
compensate for the variation.

It's similar to how the auto-freq code works - that too has variable
periods (due to the self-adjustment) - which we compensate with weight.
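
A minimal sketch of that weighting, assuming each timer sample records where the CPU was plus the counter delta since the previous sample. The struct layout and the names timer_sample, symbol_weight and symbol_share are hypothetical, chosen only for illustration; this is not how the perf tool itself stores samples.

```c
#include <stddef.h>
#include <stdint.h>

/* One timer-driven sample (hypothetical layout). */
struct timer_sample {
    int      symbol_id;   /* index of the function that was running */
    uint64_t count_delta; /* counter increase over this timer period */
};

/* Sum the counter deltas of all samples that landed in symbol_id. */
uint64_t symbol_weight(const struct timer_sample *s, size_t n, int symbol_id)
{
    uint64_t weight = 0;

    for (size_t i = 0; i < n; i++)
        if (s[i].symbol_id == symbol_id)
            weight += s[i].count_delta;
    return weight;
}

/*
 * Fraction of all counted events attributed to symbol_id.
 * Returns 0.0 when no events were counted at all.
 */
double symbol_share(const struct timer_sample *s, size_t n, int symbol_id)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += s[i].count_delta;
    if (total == 0)
        return 0.0;
    return (double)symbol_weight(s, n, symbol_id) / (double)total;
}
```

With the five deltas above (1050, 711, 1539, 2210, 400), a sample that saw 2210 misses counts roughly five times as much as one that saw 400, instead of each sample counting once.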

Thanks,

Ingo

2010-08-23 09:31:57

by Peter Zijlstra

Subject: Re: [RFC PATCH 0/3] perf: show package power consumption in perf

On Sat, 2010-08-21 at 03:18 +0200, Frederic Weisbecker wrote:
> On Thu, Aug 19, 2010 at 11:44:45AM +0200, Peter Zijlstra wrote:
> > On Thu, 2010-08-19 at 09:32 +0100, Matt Fleming wrote:
> > >
> > >
> > > How big is the hardware counter? The problem comes when the process is
> > > scheduled in and runs for a long time, e.g. so long that the energy
> > > hardware counter wraps. This is why it's necessary to periodically
> > > sample the counter.
> > >
> > Long running processes aren't the only case, you could associate an
> > event with a CPU.

> I don't understand what you mean.

perf_event_open(.pid = -1, .cpu = n);
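
Spelled out against the userspace syscall interface, that shorthand might look like the sketch below. The helper names init_cycles_attr and open_cpu_wide are made up for this example, and the choice of the cycles event is arbitrary. glibc provides no perf_event_open() wrapper, so the call goes through syscall().

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill in a plain counting event for CPU cycles (example choice). */
void init_cycles_attr(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
}

/*
 * Open the event on one CPU for all tasks: pid = -1, cpu = n.
 * Returns the event fd, or -1 with errno set.  CPU-wide events
 * normally need elevated privileges or a permissive
 * perf_event_paranoid setting.
 */
int open_cpu_wide(struct perf_event_attr *attr, int cpu)
{
    return (int)syscall(SYS_perf_event_open, attr, (pid_t)-1, cpu,
                        -1 /* group_fd: no group */, 0UL /* flags */);
}
```

Such an event is not tied to any task's lifetime, so it keeps counting on that CPU for as long as the fd stays open, however short-lived the processes running there are.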

> > Right, short counters (like SH when not chained) need something to
> > accumulate deltas into the larger u64. You can indeed use timers for
> > that, hr or otherwise, but you don't need the swcounter hrtimer
> > infrastructure for that.
>
>
> So what is the point in simulating a PMI using an hrtimer? It won't be
> based on periods on the interesting counter but on time periods. This
> is not how we want the samples. If we want timer based samples, we can
> just launch a separate software timer based event.

*sigh* that's exactly what we're doing, we're creating a separate
software hrtimer to create samples, the only thing that's different is
that we put this hrtimer and the hw-counter in a group and let the
hrtimer sample include the hw-counter's value.

If you then weight the samples by the hw-counter delta, you get
something that's more or less related to the thing the hw-counter is
counting.

For counters that do not provide overflow interrupts this is the only
possible way to get anything.
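
From userspace, such a group could be set up roughly as below. This is a sketch under assumptions, not code from the patches: the helper names are hypothetical, cache-misses merely stands in for a counter without overflow interrupts, and whether a software leader may carry a hardware sibling depends on the kernel version.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Group leader: cpu-clock (hrtimer based) sampling every period_ns. */
void init_timer_leader(struct perf_event_attr *attr,
                       unsigned long long period_ns)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;
    attr->config = PERF_COUNT_SW_CPU_CLOCK;
    attr->sample_period = period_ns;
    /* PERF_SAMPLE_READ + PERF_FORMAT_GROUP: every sample carries
     * the current values of all counters in the group. */
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_READ;
    attr->read_format = PERF_FORMAT_GROUP;
}

/* Sibling: counts only, no sample_period, so it needs no overflow IRQ. */
void init_counting_sibling(struct perf_event_attr *attr)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CACHE_MISSES; /* stand-in counter */
}

/*
 * Open both events on one CPU as a single group.  Returns the leader
 * fd and stores the sibling fd in *sibling_fd, or returns -1 on error.
 */
int open_timer_group(int cpu, unsigned long long period_ns, int *sibling_fd)
{
    struct perf_event_attr leader, sibling;
    int leader_fd, fd;

    init_timer_leader(&leader, period_ns);
    init_counting_sibling(&sibling);

    leader_fd = (int)syscall(SYS_perf_event_open, &leader, -1, cpu, -1, 0UL);
    if (leader_fd < 0)
        return -1;

    /* Passing the leader's fd as group_fd joins the sibling to its group. */
    fd = (int)syscall(SYS_perf_event_open, &sibling, -1, cpu, leader_fd, 0UL);
    if (fd < 0) {
        close(leader_fd);
        return -1;
    }
    *sibling_fd = fd;
    return leader_fd;
}
```

The consumer would then subtract the sibling's value in consecutive samples to obtain the delta that serves as the weight.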

> In the case of SH where we need to flush to avoid wraps, I understand, but
> otherwise?

The wrap issue is totally unrelated.