Date: 2009-01-21 18:51:58
From: Ingo Molnar
Subject: [announce] Performance Counters for Linux, v6


We are pleased to announce version 6 of our performance counters subsystem
implementation. The shortlog, diffstat and the combo patch can be found
below. The combo patch against latest -git (2.6.29-rc2) can also be found
at:

http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch

It's also available in tip/master at:

http://people.redhat.com/mingo/tip.git/README

There are many changes in the v6 release:

- PowerPC performance counter support from Paul Mackerras, for POWER6
and for the PPC970 family.

- ioctl API to disable/enable individual counters and counter groups
without closing their fd. This can be useful for libraries, ad-hoc
instrumentation and PAPI support. (A short usage sketch follows the
list of changes below.)

- 'pinned' and 'exclusive' counter attributes - for those
applications that want to influence counter scheduling explicitly.

- The 'perfstat' utility (formerly 'timec') has been updated:

http://people.redhat.com/mingo/perfcounters/perfstat.c

- 'kerneltop' (an easy-to-use text-mode NMI profiler) has been updated:

http://people.redhat.com/mingo/perfcounters/kerneltop.c

- Merged to latest mainline

- Various fixes and other updates
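
A minimal usage sketch (not part of the patch): open a per-task
instruction counter, read it, and toggle it via the new ioctls. It
assumes the definitions from the patch's include/linux/perf_counter.h
and the __NR_perf_counter_open number are visible to userspace (copy
them locally otherwise), and that the enable/disable ioctl commands
are exported as PERF_COUNTER_IOC_ENABLE / PERF_COUNTER_IOC_DISABLE:

/*
 * Sketch only: counts PERF_COUNT_INSTRUCTIONS for the current task
 * (pid == 0) on whatever CPU it runs on (cpu == -1), reads the 8-byte
 * count, then disables and re-enables the counter without closing it.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_counter.h>

static int perf_counter_open(struct perf_counter_hw_event *hw_event,
                             pid_t pid, int cpu, int group_fd)
{
        return syscall(__NR_perf_counter_open, hw_event, pid, cpu, group_fd);
}

int main(void)
{
        struct perf_counter_hw_event hw_event;
        unsigned long long count;
        int fd;

        memset(&hw_event, 0, sizeof(hw_event));
        hw_event.type        = PERF_COUNT_INSTRUCTIONS; /* generic hw event */
        hw_event.record_type = PERF_RECORD_SIMPLE;      /* plain u64 readout */

        /* group_fd == -1: stand-alone counter, i.e. a new group leader */
        fd = perf_counter_open(&hw_event, 0, -1, -1);
        if (fd < 0) {
                perror("perf_counter_open");
                return 1;
        }

        /* ... run the workload to be measured ... */

        if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("instructions: %llu\n", count);

        /* new in v6: stop and restart the counter without closing its fd */
        ioctl(fd, PERF_COUNTER_IOC_DISABLE);
        ioctl(fd, PERF_COUNTER_IOC_ENABLE);

        close(fd);
        return 0;
}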

Ingo

------------------>
Andi Kleen (1):
x86, generic: mark complex bitops.h inlines as __always_inline

Brian Gerst (12):
x86-64: Move irq stats from PDA to per-cpu and consolidate with 32-bit.
x86-64: Move TLB state from PDA to per-cpu and consolidate with 32-bit.
x86-64: Convert irqstacks to per-cpu
x86-64: Convert exception stacks to per-cpu
x86-64: Move cpu number from PDA to per-cpu and consolidate with 32-bit.
x86-64: Move current task from PDA to per-cpu and consolidate with 32-bit.
x86-64: Move kernelstack from PDA to per-cpu.
x86-64: Move oldrsp from PDA to per-cpu.
x86-64: Move irqcount from PDA to per-cpu.
x86-64: Move nodenumber from PDA to per-cpu.
x86-64: Move isidle from PDA to per-cpu.
x86-64: Use absolute displacements for per-cpu accesses.

Christophe Saout (1):
xen: fix too early kmalloc call

Ingo Molnar (50):
performance counters: documentation
performance counters: x86 support
x86, perfcounters: read out MSR_CORE_PERF_GLOBAL_STATUS with counters disabled
perfcounters: select ANON_INODES
perfcounters, x86: simplify disable/enable of counters
perfcounters, x86: clean up debug code
perfcounters: consolidate global-disable codepaths
perf counters: restructure the API
perf counters: add support for group counters
perf counters: group counter, fixes
perf counters: hw driver API
perf counters: implement PERF_COUNT_CPU_CLOCK
perf counters: consolidate hw_perf save/restore APIs
perf counters: implement PERF_COUNT_TASK_CLOCK
perf counters: add prctl interface to disable/enable counters
perf counters: clean up state transitions
perf counters: update docs
x86: implement atomic64_t on 32-bit
perfcounters: restructure x86 counter math
perfcounters: implement "counter inheritance"
perfcounters: fix task clock counter
perfcounters: add context switch counter
perfcounters: add task migrations counter
perfcounters: add nr-of-faults counter
perfcounters: fix non-intel-perfmon CPUs
perfcounters, x86: fix sw counters on non-PMC CPUs
perfcounters: fix lapic initialization
perfcounters: release CPU context when exiting task counters
perfcounters: flush on setuid exec
perfcounters: use hw_event.disable flag
perfcounters: remove warnings
perfcounters: tweak group scheduling
x86, perfcounters: rename intel_arch_perfmon.h => perf_counter.h
x86, perfcounters: prepare for fixed-mode PMCs
perfcounters: add fixed-mode PMC enumeration
x86, perfcounters: refactor code for fixed-function PMCs
perfcounters: hw ops rename
perfcounters: fix task clock counter
perfcounters: pull inherited counters
perfcounters: fix init context lock
perfcounters: enable lowlevel pmc code to schedule counters
x86, perfcounters: print out the ->used bitmask
perfcounters: remove ->nr_inherited
perfcounters: generalize the counter scheduler
perfcounters: add PERF_COUNT_BUS_CYCLES
x86, perfcounters: add support for fixed-function pmcs
perfcounters: include asm/perf_counter.h only if CONFIG_PERF_COUNTERS=y
x86, cpufreq: remove leftover copymask_copy()
x86: fix broken flush_tlb_others_ipi(), fix
percpu: add optimized generic percpu accessors

Jan Beulich (2):
x86: fully honor "nolapic"
x86: avoid early crash in disable_local_APIC()

Jaswinder Singh (1):
x86: perf_counter.c intel_perfmon_event_map and max_intel_perfmon_events should be static

Jaswinder Singh Rajput (20):
x86: perf_counter remove unwanted hw_perf_enable_all
x86: smp.h remove obsolete function declaration
x86: smp.h move zap_low_mappings declartion to tlbflush.h
x86: smp.h move prefill_possible_map declartion to cpu.h
x86: smp.h move stack_processor_id declartion to cpu.h
x86: smp.h move safe_smp_processor_id declartion to cpu.h
x86: smp.h move cpu_physical_id declartion to cpu.h
x86: smp.h move boot_cpu_id declartion to cpu.h
x86: rename intel_mp_floating to mpf_intel
x86: rename all fields of mpf_intel mpf_X to X
x86: smp.h move cpu_callin_mask and cpu_callin_map declartion to cpumask.h
x86: smp.h move cpu_callout_mask and cpu_callout_map declartion to cpumask.h
x86: smp.h move cpu_initialized_mask and cpu_initialized declartion to cpumask.h
x86: smp.h move cpu_sibling_setup_mask and cpu_sibling_setup_map declartion to cpumask.h
x86: microcode_intel.c fix style problems
x86: msr.c fix style problems
x86: module_32.c fix style problems
x86: module_64.c fix style problems
x86: replacing mp_config_ioapic with mpc_ioapic
x86: replacing mp_config_intsrc with mpc_intsrc

Mike Travis (17):
cpumask: update irq_desc to use cpumask_var_t
cpumask: fix bug in use cpumask_var_t in irq_desc
SGI UV cpumask: use static temp cpumask in flush_tlb
x86: cleanup remaining cpumask_t code in mce_amd_64.c
x86: reduce stack usage in init_intel_cacheinfo
cpumask: use cpumask_var_t in dcdbas.c
cpumask: reduce stack usage in find_lowest_rq
Xen: reduce memory required for cpu_evtchn_mask
irq: change references from NR_IRQS to nr_irqs
irq: use WARN() instead of WARN_ON().
irq: allocate irq_desc_ptrs array based on nr_irqs
irq: initialize nr_irqs based on nr_cpu_ids
kstat: modify kstat_irqs_legacy to be variable sized
cpumask, irq: non-x86 build failures
irq: update all arches for new irq_desc
irq: update all arches for new irq_desc, fix
x86: cleanup early setup_percpu references

Paul Mackerras (13):
perf_counter: Fix return value from dummy hw_perf_counter_init
perf_counter: Fix the cpu_clock software counter
perf_counter: Add optional hw_perf_group_sched_in arch function
perf_counter: Add dummy perf_counter_print_debug function
powerpc/perf_counter: Add perf_counter system call on powerpc
powerpc: Provide a way to defer perf counter work until interrupts are enabled
powerpc/perf_counter: Add generic support for POWER-family PMU hardware
powerpc/perf_counter: Add support for PPC970 family
powerpc/perf_counter: Add support for POWER6
perf_counter: Always schedule all software counters in
powerpc/perf_counter: Make sure PMU gets enabled properly
perf_counter: Add support for pinned and exclusive counter groups
perf_counter: Add counter enable/disable ioctls

Rusty Russell (5):
cpumask: Use topology_core_cpumask()/topology_thread_cpumask()
cpumask: convert misc driver functions
cpumask: convert drivers/net/sfc
cpumask: convert other misc kernel functions
x86: change flush_tlb_others to take a const struct cpumask

Suresh Siddha (2):
x86: fix broken flush_tlb_others_ipi()
x86, pat: fix reserve_memtype() for legacy 1MB range

Tejun Heo (15):
x86: fix pda_to_op()
x86: make early_per_cpu() a lvalue and use it
x86: make vmlinux_32.lds.S use PERCPU() macro
x86: make percpu symbols zerobased on SMP
x86: load pointer to pda into %gs while brining up a CPU
x86: use static _cpu_pda array
x86: fold pda into percpu area on SMP
x86: merge 64 and 32 SMP percpu handling
x86: make pda a percpu variable
x86: convert pda ops to wrappers around x86 percpu accessors
x86: misc clean up after the percpu update
x86: fix build bug introduced during merge
x86_64: initialize this_cpu_off to __per_cpu_load
linker script: add missing VMLINUX_SYMBOL
linker script: add missing .data.percpu.page_aligned

Thomas Gleixner (4):
performance counters: core code
perf counters: protect them against CSTATE transitions
perf counters: clean up 'raw' type API
perf counters: expand use of counter->event

Yinghai Lu (3):
perf_counter: more barrier in blank weak function
x86: arch_probe_nr_irqs
x86: make 32bit MAX_HARDIRQS_PER_CPU to be NR_VECTORS


Documentation/cputopology.txt | 6 +-
Documentation/perf-counters.txt | 147 ++
arch/alpha/kernel/irq.c | 2 +-
arch/arm/kernel/irq.c | 18 +-
arch/arm/kernel/vmlinux.lds.S | 1 +
arch/arm/oprofile/op_model_mpcore.c | 2 +-
arch/blackfin/kernel/irqchip.c | 5 +
arch/ia64/kernel/iosapic.c | 2 +-
arch/ia64/kernel/irq.c | 4 +-
arch/ia64/kernel/irq_ia64.c | 12 +-
arch/ia64/kernel/msi_ia64.c | 4 +-
arch/ia64/kernel/vmlinux.lds.S | 1 +
arch/ia64/sn/kernel/msi_sn.c | 2 +-
arch/mips/include/asm/irq.h | 2 +-
arch/mips/kernel/irq-gic.c | 2 +-
arch/mips/kernel/smtc.c | 6 +-
arch/mips/mti-malta/malta-smtc.c | 5 +-
arch/mips/sgi-ip22/ip22-int.c | 2 +-
arch/mips/sgi-ip22/ip22-time.c | 2 +-
arch/mips/sibyte/bcm1480/smp.c | 3 +-
arch/mips/sibyte/sb1250/smp.c | 3 +-
arch/mn10300/kernel/mn10300-watchdog.c | 3 +-
arch/parisc/kernel/irq.c | 8 +-
arch/powerpc/include/asm/hw_irq.h | 31 +
arch/powerpc/include/asm/paca.h | 1 +
arch/powerpc/include/asm/perf_counter.h | 72 +
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 3 +-
arch/powerpc/kernel/Makefile | 1 +
arch/powerpc/kernel/asm-offsets.c | 1 +
arch/powerpc/kernel/entry_64.S | 9 +
arch/powerpc/kernel/irq.c | 12 +-
arch/powerpc/kernel/perf_counter.c | 785 +++++++++
arch/powerpc/kernel/power6-pmu.c | 283 +++
arch/powerpc/kernel/ppc970-pmu.c | 375 ++++
arch/powerpc/kernel/vmlinux.lds.S | 1 +
arch/powerpc/platforms/Kconfig.cputype | 1 +
arch/powerpc/platforms/pseries/xics.c | 5 +-
arch/powerpc/sysdev/mpic.c | 3 +-
arch/sparc/kernel/irq_64.c | 5 +-
arch/sparc/kernel/time_64.c | 2 +-
arch/x86/Kconfig | 1 +
arch/x86/ia32/ia32entry.S | 11 +-
arch/x86/include/asm/apicnum.h | 12 +
arch/x86/include/asm/atomic_32.h | 218 +++
arch/x86/include/asm/bitops.h | 14 +-
arch/x86/include/asm/cpu.h | 21 +
arch/x86/include/asm/cpumask.h | 28 +
arch/x86/include/asm/current.h | 24 +-
arch/x86/include/asm/hardirq_32.h | 4 +
arch/x86/include/asm/hardirq_64.h | 25 +-
arch/x86/include/asm/hw_irq.h | 2 +
arch/x86/include/asm/intel_arch_perfmon.h | 31 -
arch/x86/include/asm/io_apic.h | 26 +-
arch/x86/include/asm/irq_regs_32.h | 4 +-
arch/x86/include/asm/irq_vectors.h | 18 +-
arch/x86/include/asm/mach-default/entry_arch.h | 5 +
arch/x86/include/asm/mmu_context_32.h | 12 +-
arch/x86/include/asm/mmu_context_64.h | 16 +-
arch/x86/include/asm/mpspec_def.h | 23 +-
arch/x86/include/asm/page_64.h | 4 +-
arch/x86/include/asm/paravirt.h | 8 +-
arch/x86/include/asm/pda.h | 124 +--
arch/x86/include/asm/percpu.h | 159 +-
arch/x86/include/asm/perf_counter.h | 95 +
arch/x86/include/asm/processor.h | 3 +
arch/x86/include/asm/setup.h | 1 -
arch/x86/include/asm/smp.h | 49 +-
arch/x86/include/asm/system.h | 4 +-
arch/x86/include/asm/thread_info.h | 24 +-
arch/x86/include/asm/tlbflush.h | 17 +-
arch/x86/include/asm/topology.h | 8 +-
arch/x86/include/asm/trampoline.h | 1 +
arch/x86/include/asm/unistd_32.h | 1 +
arch/x86/include/asm/unistd_64.h | 3 +-
arch/x86/include/asm/uv/uv_bau.h | 3 +-
arch/x86/kernel/acpi/boot.c | 96 +-
arch/x86/kernel/acpi/sleep.c | 1 +
arch/x86/kernel/apic.c | 26 +-
arch/x86/kernel/asm-offsets_64.c | 8 +-
arch/x86/kernel/cpu/Makefile | 12 +-
arch/x86/kernel/cpu/common.c | 82 +-
arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c | 2 -
arch/x86/kernel/cpu/intel_cacheinfo.c | 63 +-
arch/x86/kernel/cpu/mcheck/mce_amd_64.c | 21 +-
arch/x86/kernel/cpu/perf_counter.c | 695 ++++++++
arch/x86/kernel/cpu/perfctr-watchdog.c | 2 +-
arch/x86/kernel/crash.c | 2 +-
arch/x86/kernel/dumpstack_64.c | 35 +-
arch/x86/kernel/entry_64.S | 46 +-
arch/x86/kernel/head64.c | 23 +-
arch/x86/kernel/head_64.S | 47 +-
arch/x86/kernel/io_apic.c | 163 +-
arch/x86/kernel/irq.c | 11 +-
arch/x86/kernel/irq_32.c | 2 +-
arch/x86/kernel/irq_64.c | 5 +-
arch/x86/kernel/irqinit_32.c | 3 +
arch/x86/kernel/irqinit_64.c | 5 +
arch/x86/kernel/microcode_intel.c | 10 +-
arch/x86/kernel/module_32.c | 6 +-
arch/x86/kernel/module_64.c | 32 +-
arch/x86/kernel/mpparse.c | 142 +-
arch/x86/kernel/msr.c | 2 +-
arch/x86/kernel/nmi.c | 10 +-
arch/x86/kernel/process_32.c | 5 +-
arch/x86/kernel/process_64.c | 22 +-
arch/x86/kernel/reboot.c | 1 +
arch/x86/kernel/setup.c | 2 +-
arch/x86/kernel/setup_percpu.c | 208 ++-
arch/x86/kernel/signal.c | 7 +-
arch/x86/kernel/smpboot.c | 69 +-
arch/x86/kernel/smpcommon.c | 10 +-
arch/x86/kernel/syscall_table_32.S | 1 +
arch/x86/kernel/tlb_32.c | 85 +-
arch/x86/kernel/tlb_64.c | 76 +-
arch/x86/kernel/tlb_uv.c | 16 +-
arch/x86/kernel/vmlinux_32.lds.S | 9 +-
arch/x86/kernel/vmlinux_64.lds.S | 22 +-
arch/x86/kernel/x8664_ksyms_64.c | 2 -
arch/x86/mach-voyager/setup.c | 1 +
arch/x86/mach-voyager/voyager_smp.c | 6 +-
arch/x86/mm/init_32.c | 1 -
arch/x86/mm/pat.c | 37 +-
arch/x86/oprofile/op_model_ppro.c | 2 +-
arch/x86/xen/enlighten.c | 47 +-
arch/x86/xen/irq.c | 8 +-
arch/x86/xen/mmu.c | 8 +-
arch/x86/xen/multicalls.h | 2 +-
arch/x86/xen/smp.c | 33 +-
arch/x86/xen/xen-asm_64.S | 31 +-
drivers/acpi/processor_idle.c | 8 +
drivers/base/cpu.c | 2 +-
drivers/base/topology.c | 33 +-
drivers/char/sysrq.c | 2 +
drivers/firmware/dcdbas.c | 12 +-
drivers/misc/sgi-xp/xpc_main.c | 2 +-
drivers/net/sfc/efx.c | 17 +-
drivers/oprofile/buffer_sync.c | 22 +-
drivers/oprofile/buffer_sync.h | 4 +
drivers/oprofile/oprof.c | 9 +-
drivers/pci/intr_remapping.c | 1 +
drivers/xen/events.c | 26 +-
drivers/xen/manage.c | 2 +-
fs/exec.c | 8 +
include/asm-generic/bitops/__ffs.h | 2 +-
include/asm-generic/bitops/__fls.h | 2 +-
include/asm-generic/bitops/fls.h | 2 +-
include/asm-generic/bitops/fls64.h | 4 +-
include/asm-generic/percpu.h | 52 +
include/asm-generic/sections.h | 2 +-
include/asm-generic/vmlinux.lds.h | 75 +-
include/linux/init_task.h | 11 +
include/linux/interrupt.h | 1 +
include/linux/irq.h | 86 +-
include/linux/irqnr.h | 1 +
include/linux/kernel_stat.h | 8 +
include/linux/perf_counter.h | 290 ++++
include/linux/prctl.h | 3 +
include/linux/sched.h | 12 +-
include/linux/syscalls.h | 8 +
include/linux/topology.h | 6 +
init/Kconfig | 30 +
kernel/Makefile | 1 +
kernel/exit.c | 13 +-
kernel/fork.c | 1 +
kernel/irq/chip.c | 5 +-
kernel/irq/handle.c | 57 +-
kernel/irq/internals.h | 7 +
kernel/irq/manage.c | 12 +-
kernel/irq/migration.c | 12 +-
kernel/irq/numa_migrate.c | 19 +-
kernel/irq/proc.c | 4 +-
kernel/perf_counter.c | 2169 ++++++++++++++++++++++++
kernel/sched.c | 76 +-
kernel/sched_rt.c | 36 +-
kernel/softirq.c | 5 +
kernel/sys.c | 7 +
kernel/sys_ni.c | 3 +
lib/smp_processor_id.c | 2 +-
179 files changed, 6925 insertions(+), 1342 deletions(-)
create mode 100644 Documentation/perf-counters.txt
create mode 100644 arch/powerpc/include/asm/perf_counter.h
create mode 100644 arch/powerpc/kernel/perf_counter.c
create mode 100644 arch/powerpc/kernel/power6-pmu.c
create mode 100644 arch/powerpc/kernel/ppc970-pmu.c
create mode 100644 arch/x86/include/asm/apicnum.h
create mode 100644 arch/x86/include/asm/cpumask.h
delete mode 100644 arch/x86/include/asm/intel_arch_perfmon.h
create mode 100644 arch/x86/include/asm/perf_counter.h
create mode 100644 arch/x86/kernel/cpu/perf_counter.c
create mode 100644 include/linux/perf_counter.h
create mode 100644 kernel/perf_counter.c

diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index 45932ec..b41f3e5 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -18,11 +18,11 @@ For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h:
#define topology_physical_package_id(cpu)
#define topology_core_id(cpu)
-#define topology_thread_siblings(cpu)
-#define topology_core_siblings(cpu)
+#define topology_thread_cpumask(cpu)
+#define topology_core_cpumask(cpu)

The type of **_id is int.
-The type of siblings is cpumask_t.
+The type of siblings is (const) struct cpumask *.

To be consistent on all architectures, include/linux/topology.h
provides default definitions for any of the above macros that are
diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
new file mode 100644
index 0000000..fddd321
--- /dev/null
+++ b/Documentation/perf-counters.txt
@@ -0,0 +1,147 @@
+
+Performance Counters for Linux
+------------------------------
+
+Performance counters are special hardware registers available on most modern
+CPUs. These registers count the number of certain types of hw events, such
+as instructions executed, cache misses suffered, or branches mis-predicted -
+without slowing down the kernel or applications. These registers can also
+trigger interrupts when a threshold number of events have passed - and can
+thus be used to profile the code that runs on that CPU.
+
+The Linux Performance Counter subsystem provides an abstraction of these
+hardware capabilities. It provides per task and per CPU counters, counter
+groups, and it provides event capabilities on top of those.
+
+Performance counters are accessed via special file descriptors.
+There's one file descriptor per virtual counter used.
+
+The special file descriptor is opened via the perf_counter_open()
+system call:
+
+ int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
+ pid_t pid, int cpu, int group_fd);
+
+The syscall returns the new fd. The fd can be used via the normal
+VFS system calls: read() can be used to read the counter, fcntl()
+can be used to set the blocking mode, etc.
+
+Multiple counters can be kept open at a time, and the counters
+can be poll()ed.
+
+When creating a new counter fd, 'perf_counter_hw_event' is:
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ __reserved_1 : 29;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ /*
+ * Future software events:
+ */
+ /* PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4, */
+};
+
+These are standardized types of events that work uniformly on all CPUs
+that implement Performance Counters support under Linux. If a CPU is
+not able to count branch-misses, then the system call will return
+-EINVAL.
+
+More hw_event_types are supported as well, but they are CPU
+specific and are enumerated via /sys on a per CPU basis. Raw hw event
+types can be passed in under hw_event.type if hw_event.raw is 1.
+For example, to count "External bus cycles while bus lock signal asserted"
+events on Intel Core CPUs, pass in a 0x4064 event type value and set
+hw_event.raw to 1.
+
+'record_type' is the type of data that a read() will provide for the
+counter, and it can be one of:
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};
+
+A "simple" counter is one that counts hardware events and allows
+them to be read out into a u64 count value. (read() returns 8 on
+a successful read of a simple counter.)
+
+An "irq" counter is one that will also provide IRQ context information:
+the IP of the interrupted context. In this case read() will return
+the 8-byte counter value, plus the Instruction Pointer address of the
+interrupted context.
+
+The 'irq_period' parameter of hw_event is the number of events before waking
+up a read() that is blocked on a counter fd. A zero value means a non-blocking
+counter.
+
+The 'pid' parameter allows the counter to be specific to a task:
+
+ pid == 0: if the pid parameter is zero, the counter is attached to the
+ current task.
+
+ pid > 0: the counter is attached to a specific task (if the current task
+ has sufficient privilege to do so)
+
+ pid < 0: all tasks are counted (per cpu counters)
+
+The 'cpu' parameter allows a counter to be made specific to a full
+CPU:
+
+ cpu >= 0: the counter is restricted to a specific CPU
+ cpu == -1: the counter counts on all CPUs
+
+(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)
+
+A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
+events of that task and 'follows' that task to whatever CPU the task
+gets scheduled to. Per task counters can be created by any user, for
+their own tasks.
+
+A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
+all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.
+
+Group counters are created by passing in a group_fd of another counter.
+Groups are scheduled at once and can be used with PERF_RECORD_GROUP
+to record multi-dimensional timestamps.
+
diff --git a/arch/alpha/kernel/irq.c b/arch/alpha/kernel/irq.c
index 703731a..7bc7489 100644
--- a/arch/alpha/kernel/irq.c
+++ b/arch/alpha/kernel/irq.c
@@ -55,7 +55,7 @@ int irq_select_affinity(unsigned int irq)
cpu = (cpu < (NR_CPUS-1) ? cpu + 1 : 0);
last_cpu = cpu;

- irq_desc[irq].affinity = cpumask_of_cpu(cpu);
+ cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu));
irq_desc[irq].chip->set_affinity(irq, cpumask_of(cpu));
return 0;
}
diff --git a/arch/arm/kernel/irq.c b/arch/arm/kernel/irq.c
index 7141cee..4bb723e 100644
--- a/arch/arm/kernel/irq.c
+++ b/arch/arm/kernel/irq.c
@@ -104,6 +104,11 @@ static struct irq_desc bad_irq_desc = {
.lock = SPIN_LOCK_UNLOCKED
};

+#ifdef CONFIG_CPUMASK_OFFSTACK
+/* We are not allocating bad_irq_desc.affinity or .pending_mask */
+#error "ARM architecture does not support CONFIG_CPUMASK_OFFSTACK."
+#endif
+
/*
* do_IRQ handles all hardware IRQ's. Decoded IRQs should not
* come via this function. Instead, they should provide their
@@ -161,7 +166,7 @@ void __init init_IRQ(void)
irq_desc[irq].status |= IRQ_NOREQUEST | IRQ_NOPROBE;

#ifdef CONFIG_SMP
- bad_irq_desc.affinity = CPU_MASK_ALL;
+ cpumask_setall(bad_irq_desc.affinity);
bad_irq_desc.cpu = smp_processor_id();
#endif
init_arch_irq();
@@ -191,15 +196,16 @@ void migrate_irqs(void)
struct irq_desc *desc = irq_desc + i;

if (desc->cpu == cpu) {
- unsigned int newcpu = any_online_cpu(desc->affinity);
-
- if (newcpu == NR_CPUS) {
+ unsigned int newcpu = cpumask_any_and(desc->affinity,
+ cpu_online_mask);
+ if (newcpu >= nr_cpu_ids) {
if (printk_ratelimit())
printk(KERN_INFO "IRQ%u no longer affine to CPU%u\n",
i, cpu);

- cpus_setall(desc->affinity);
- newcpu = any_online_cpu(desc->affinity);
+ cpumask_setall(desc->affinity);
+ newcpu = cpumask_any_and(desc->affinity,
+ cpu_online_mask);
}

route_irq(desc, i, newcpu);
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index 0021607..85598f7 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -65,6 +65,7 @@ SECTIONS
#endif
. = ALIGN(4096);
__per_cpu_start = .;
+ *(.data.percpu.page_aligned)
*(.data.percpu)
*(.data.percpu.shared_aligned)
__per_cpu_end = .;
diff --git a/arch/arm/oprofile/op_model_mpcore.c b/arch/arm/oprofile/op_model_mpcore.c
index 6d6bd58..853d42b 100644
--- a/arch/arm/oprofile/op_model_mpcore.c
+++ b/arch/arm/oprofile/op_model_mpcore.c
@@ -263,7 +263,7 @@ static void em_route_irq(int irq, unsigned int cpu)
const struct cpumask *mask = cpumask_of(cpu);

spin_lock_irq(&desc->lock);
- desc->affinity = *mask;
+ cpumask_copy(desc->affinity, mask);
desc->chip->set_affinity(irq, mask);
spin_unlock_irq(&desc->lock);
}
diff --git a/arch/blackfin/kernel/irqchip.c b/arch/blackfin/kernel/irqchip.c
index ab8209c..5780d6d 100644
--- a/arch/blackfin/kernel/irqchip.c
+++ b/arch/blackfin/kernel/irqchip.c
@@ -69,6 +69,11 @@ static struct irq_desc bad_irq_desc = {
#endif
};

+#ifdef CONFIG_CPUMASK_OFFSTACK
+/* We are not allocating a variable-sized bad_irq_desc.affinity */
+#error "Blackfin architecture does not support CONFIG_CPUMASK_OFFSTACK."
+#endif
+
int show_interrupts(struct seq_file *p, void *v)
{
int i = *(loff_t *) v, j;
diff --git a/arch/ia64/kernel/iosapic.c b/arch/ia64/kernel/iosapic.c
index 5cfd3d9..006ad36 100644
--- a/arch/ia64/kernel/iosapic.c
+++ b/arch/ia64/kernel/iosapic.c
@@ -880,7 +880,7 @@ iosapic_unregister_intr (unsigned int gsi)
if (iosapic_intr_info[irq].count == 0) {
#ifdef CONFIG_SMP
/* Clear affinity */
- cpus_setall(idesc->affinity);
+ cpumask_setall(idesc->affinity);
#endif
/* Clear the interrupt information */
iosapic_intr_info[irq].dest = 0;
diff --git a/arch/ia64/kernel/irq.c b/arch/ia64/kernel/irq.c
index a58f64c..226233a 100644
--- a/arch/ia64/kernel/irq.c
+++ b/arch/ia64/kernel/irq.c
@@ -103,7 +103,7 @@ static char irq_redir [NR_IRQS]; // = { [0 ... NR_IRQS-1] = 1 };
void set_irq_affinity_info (unsigned int irq, int hwid, int redir)
{
if (irq < NR_IRQS) {
- cpumask_copy(&irq_desc[irq].affinity,
+ cpumask_copy(irq_desc[irq].affinity,
cpumask_of(cpu_logical_id(hwid)));
irq_redir[irq] = (char) (redir & 0xff);
}
@@ -148,7 +148,7 @@ static void migrate_irqs(void)
if (desc->status == IRQ_PER_CPU)
continue;

- if (cpumask_any_and(&irq_desc[irq].affinity, cpu_online_mask)
+ if (cpumask_any_and(irq_desc[irq].affinity, cpu_online_mask)
>= nr_cpu_ids) {
/*
* Save it for phase 2 processing
diff --git a/arch/ia64/kernel/irq_ia64.c b/arch/ia64/kernel/irq_ia64.c
index 28d3d48..927ad02 100644
--- a/arch/ia64/kernel/irq_ia64.c
+++ b/arch/ia64/kernel/irq_ia64.c
@@ -493,11 +493,13 @@ ia64_handle_irq (ia64_vector vector, struct pt_regs *regs)
saved_tpr = ia64_getreg(_IA64_REG_CR_TPR);
ia64_srlz_d();
while (vector != IA64_SPURIOUS_INT_VECTOR) {
+ struct irq_desc *desc = irq_to_desc(vector);
+
if (unlikely(IS_LOCAL_TLB_FLUSH(vector))) {
smp_local_flush_tlb();
- kstat_this_cpu.irqs[vector]++;
+ kstat_incr_irqs_this_cpu(vector, desc);
} else if (unlikely(IS_RESCHEDULE(vector)))
- kstat_this_cpu.irqs[vector]++;
+ kstat_incr_irqs_this_cpu(vector, desc);
else {
int irq = local_vector_to_irq(vector);

@@ -551,11 +553,13 @@ void ia64_process_pending_intr(void)
* Perform normal interrupt style processing
*/
while (vector != IA64_SPURIOUS_INT_VECTOR) {
+ struct irq_desc *desc = irq_to_desc(vector);
+
if (unlikely(IS_LOCAL_TLB_FLUSH(vector))) {
smp_local_flush_tlb();
- kstat_this_cpu.irqs[vector]++;
+ kstat_incr_irqs_this_cpu(vector, desc);
} else if (unlikely(IS_RESCHEDULE(vector)))
- kstat_this_cpu.irqs[vector]++;
+ kstat_incr_irqs_this_cpu(vector, desc);
else {
struct pt_regs *old_regs = set_irq_regs(NULL);
int irq = local_vector_to_irq(vector);
diff --git a/arch/ia64/kernel/msi_ia64.c b/arch/ia64/kernel/msi_ia64.c
index 8903393..dcb6b7c 100644
--- a/arch/ia64/kernel/msi_ia64.c
+++ b/arch/ia64/kernel/msi_ia64.c
@@ -75,7 +75,7 @@ static void ia64_set_msi_irq_affinity(unsigned int irq,
msg.data = data;

write_msi_msg(irq, &msg);
- irq_desc[irq].affinity = cpumask_of_cpu(cpu);
+ cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu));
}
#endif /* CONFIG_SMP */

@@ -187,7 +187,7 @@ static void dmar_msi_set_affinity(unsigned int irq, const struct cpumask *mask)
msg.address_lo |= MSI_ADDR_DESTID_CPU(cpu_physical_id(cpu));

dmar_msi_write(irq, &msg);
- irq_desc[irq].affinity = *mask;
+ cpumask_copy(irq_desc[irq].affinity, mask);
}
#endif /* CONFIG_SMP */

diff --git a/arch/ia64/kernel/vmlinux.lds.S b/arch/ia64/kernel/vmlinux.lds.S
index 10a7d47..f45e4e5 100644
--- a/arch/ia64/kernel/vmlinux.lds.S
+++ b/arch/ia64/kernel/vmlinux.lds.S
@@ -219,6 +219,7 @@ SECTIONS
.data.percpu PERCPU_ADDR : AT(__phys_per_cpu_start - LOAD_OFFSET)
{
__per_cpu_start = .;
+ *(.data.percpu.page_aligned)
*(.data.percpu)
*(.data.percpu.shared_aligned)
__per_cpu_end = .;
diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index ca553b0..81e4289 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -205,7 +205,7 @@ static void sn_set_msi_irq_affinity(unsigned int irq,
msg.address_lo = (u32)(bus_addr & 0x00000000ffffffff);

write_msi_msg(irq, &msg);
- irq_desc[irq].affinity = *cpu_mask;
+ cpumask_copy(irq_desc[irq].affinity, cpu_mask);
}
#endif /* CONFIG_SMP */

diff --git a/arch/mips/include/asm/irq.h b/arch/mips/include/asm/irq.h
index abc62aa..3214ade 100644
--- a/arch/mips/include/asm/irq.h
+++ b/arch/mips/include/asm/irq.h
@@ -66,7 +66,7 @@ extern void smtc_forward_irq(unsigned int irq);
*/
#define IRQ_AFFINITY_HOOK(irq) \
do { \
- if (!cpu_isset(smp_processor_id(), irq_desc[irq].affinity)) { \
+ if (!cpumask_test_cpu(smp_processor_id(), irq_desc[irq].affinity)) {\
smtc_forward_irq(irq); \
irq_exit(); \
return; \
diff --git a/arch/mips/kernel/irq-gic.c b/arch/mips/kernel/irq-gic.c
index 494a49a..87deb8f 100644
--- a/arch/mips/kernel/irq-gic.c
+++ b/arch/mips/kernel/irq-gic.c
@@ -187,7 +187,7 @@ static void gic_set_affinity(unsigned int irq, const struct cpumask *cpumask)
set_bit(irq, pcpu_masks[first_cpu(tmp)].pcpu_mask);

}
- irq_desc[irq].affinity = *cpumask;
+ cpumask_copy(irq_desc[irq].affinity, cpumask);
spin_unlock_irqrestore(&gic_lock, flags);

}
diff --git a/arch/mips/kernel/smtc.c b/arch/mips/kernel/smtc.c
index b6cca01..5f5af7d 100644
--- a/arch/mips/kernel/smtc.c
+++ b/arch/mips/kernel/smtc.c
@@ -686,7 +686,7 @@ void smtc_forward_irq(unsigned int irq)
* and efficiency, we just pick the easiest one to find.
*/

- target = first_cpu(irq_desc[irq].affinity);
+ target = cpumask_first(irq_desc[irq].affinity);

/*
* We depend on the platform code to have correctly processed
@@ -921,11 +921,13 @@ void ipi_decode(struct smtc_ipi *pipi)
struct clock_event_device *cd;
void *arg_copy = pipi->arg;
int type_copy = pipi->type;
+ int irq = MIPS_CPU_IRQ_BASE + 1;
+
smtc_ipi_nq(&freeIPIq, pipi);
switch (type_copy) {
case SMTC_CLOCK_TICK:
irq_enter();
- kstat_this_cpu.irqs[MIPS_CPU_IRQ_BASE + 1]++;
+ kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
cd = &per_cpu(mips_clockevent_device, cpu);
cd->event_handler(cd);
irq_exit();
diff --git a/arch/mips/mti-malta/malta-smtc.c b/arch/mips/mti-malta/malta-smtc.c
index aabd727..5ba3188 100644
--- a/arch/mips/mti-malta/malta-smtc.c
+++ b/arch/mips/mti-malta/malta-smtc.c
@@ -116,7 +116,7 @@ struct plat_smp_ops msmtc_smp_ops = {

void plat_set_irq_affinity(unsigned int irq, const struct cpumask *affinity)
{
- cpumask_t tmask = *affinity;
+ cpumask_t tmask;
int cpu = 0;
void smtc_set_irq_affinity(unsigned int irq, cpumask_t aff);

@@ -139,11 +139,12 @@ void plat_set_irq_affinity(unsigned int irq, const struct cpumask *affinity)
* be made to forward to an offline "CPU".
*/

+ cpumask_copy(&tmask, affinity);
for_each_cpu(cpu, affinity) {
if ((cpu_data[cpu].vpe_id != 0) || !cpu_online(cpu))
cpu_clear(cpu, tmask);
}
- irq_desc[irq].affinity = tmask;
+ cpumask_copy(irq_desc[irq].affinity, &tmask);

if (cpus_empty(tmask))
/*
diff --git a/arch/mips/sgi-ip22/ip22-int.c b/arch/mips/sgi-ip22/ip22-int.c
index f8b18af..0ecd5fe 100644
--- a/arch/mips/sgi-ip22/ip22-int.c
+++ b/arch/mips/sgi-ip22/ip22-int.c
@@ -155,7 +155,7 @@ static void indy_buserror_irq(void)
int irq = SGI_BUSERR_IRQ;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
ip22_be_interrupt(irq);
irq_exit();
}
diff --git a/arch/mips/sgi-ip22/ip22-time.c b/arch/mips/sgi-ip22/ip22-time.c
index 3dcb27e..c8f7d23 100644
--- a/arch/mips/sgi-ip22/ip22-time.c
+++ b/arch/mips/sgi-ip22/ip22-time.c
@@ -122,7 +122,7 @@ void indy_8254timer_irq(void)
char c;

irq_enter();
- kstat_this_cpu.irqs[irq]++;
+ kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
printk(KERN_ALERT "Oops, got 8254 interrupt.\n");
ArcRead(0, &c, 1, &cnt);
ArcEnterInteractiveMode();
diff --git a/arch/mips/sibyte/bcm1480/smp.c b/arch/mips/sibyte/bcm1480/smp.c
index dddfda8..3146916 100644
--- a/arch/mips/sibyte/bcm1480/smp.c
+++ b/arch/mips/sibyte/bcm1480/smp.c
@@ -178,9 +178,10 @@ struct plat_smp_ops bcm1480_smp_ops = {
void bcm1480_mailbox_interrupt(void)
{
int cpu = smp_processor_id();
+ int irq = K_BCM1480_INT_MBOX_0_0;
unsigned int action;

- kstat_this_cpu.irqs[K_BCM1480_INT_MBOX_0_0]++;
+ kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
/* Load the mailbox register to figure out what we're supposed to do */
action = (__raw_readq(mailbox_0_regs[cpu]) >> 48) & 0xffff;

diff --git a/arch/mips/sibyte/sb1250/smp.c b/arch/mips/sibyte/sb1250/smp.c
index 5950a28..cad1400 100644
--- a/arch/mips/sibyte/sb1250/smp.c
+++ b/arch/mips/sibyte/sb1250/smp.c
@@ -166,9 +166,10 @@ struct plat_smp_ops sb_smp_ops = {
void sb1250_mailbox_interrupt(void)
{
int cpu = smp_processor_id();
+ int irq = K_INT_MBOX_0;
unsigned int action;

- kstat_this_cpu.irqs[K_INT_MBOX_0]++;
+ kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
/* Load the mailbox register to figure out what we're supposed to do */
action = (____raw_readq(mailbox_regs[cpu]) >> 48) & 0xffff;

diff --git a/arch/mn10300/kernel/mn10300-watchdog.c b/arch/mn10300/kernel/mn10300-watchdog.c
index 10811e9..2e370d8 100644
--- a/arch/mn10300/kernel/mn10300-watchdog.c
+++ b/arch/mn10300/kernel/mn10300-watchdog.c
@@ -130,6 +130,7 @@ void watchdog_interrupt(struct pt_regs *regs, enum exception_code excep)
* the stack NMI-atomically, it's safe to use smp_processor_id().
*/
int sum, cpu = smp_processor_id();
+ int irq = NMIIRQ;
u8 wdt, tmp;

wdt = WDCTR & ~WDCTR_WDCNE;
@@ -138,7 +139,7 @@ void watchdog_interrupt(struct pt_regs *regs, enum exception_code excep)
NMICR = NMICR_WDIF;

nmi_count(cpu)++;
- kstat_this_cpu.irqs[NMIIRQ]++;
+ kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
sum = irq_stat[cpu].__irq_count;

if (last_irq_sums[cpu] == sum) {
diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c
index ac2c822..4948280 100644
--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -120,7 +120,7 @@ int cpu_check_affinity(unsigned int irq, cpumask_t *dest)
if (CHECK_IRQ_PER_CPU(irq)) {
/* Bad linux design decision. The mask has already
* been set; we must reset it */
- irq_desc[irq].affinity = CPU_MASK_ALL;
+ cpumask_setall(irq_desc[irq].affinity);
return -EINVAL;
}

@@ -136,7 +136,7 @@ static void cpu_set_affinity_irq(unsigned int irq, const struct cpumask *dest)
if (cpu_check_affinity(irq, dest))
return;

- irq_desc[irq].affinity = *dest;
+ cpumask_copy(irq_desc[irq].affinity, dest);
}
#endif

@@ -295,7 +295,7 @@ int txn_alloc_irq(unsigned int bits_wide)
unsigned long txn_affinity_addr(unsigned int irq, int cpu)
{
#ifdef CONFIG_SMP
- irq_desc[irq].affinity = cpumask_of_cpu(cpu);
+ cpumask_copy(irq_desc[irq].affinity, cpumask_of(cpu));
#endif

return per_cpu(cpu_data, cpu).txn_addr;
@@ -352,7 +352,7 @@ void do_cpu_irq_mask(struct pt_regs *regs)
irq = eirr_to_irq(eirr_val);

#ifdef CONFIG_SMP
- dest = irq_desc[irq].affinity;
+ cpumask_copy(&dest, irq_desc[irq].affinity);
if (CHECK_IRQ_PER_CPU(irq_desc[irq].status) &&
!cpu_isset(smp_processor_id(), dest)) {
int cpu = first_cpu(dest);
diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index f75a5fc..e10f151 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -131,5 +131,36 @@ static inline int irqs_disabled_flags(unsigned long flags)
*/
struct hw_interrupt_type;

+#ifdef CONFIG_PERF_COUNTERS
+static inline unsigned long get_perf_counter_pending(void)
+{
+ unsigned long x;
+
+ asm volatile("lbz %0,%1(13)"
+ : "=r" (x)
+ : "i" (offsetof(struct paca_struct, perf_counter_pending)));
+ return x;
+}
+
+static inline void set_perf_counter_pending(int x)
+{
+ asm volatile("stb %0,%1(13)" : :
+ "r" (x),
+ "i" (offsetof(struct paca_struct, perf_counter_pending)));
+}
+
+extern void perf_counter_do_pending(void);
+
+#else
+
+static inline unsigned long get_perf_counter_pending(void)
+{
+ return 0;
+}
+
+static inline void set_perf_counter_pending(int x) {}
+static inline void perf_counter_do_pending(void) {}
+#endif /* CONFIG_PERF_COUNTERS */
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_HW_IRQ_H */
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 082b3ae..6ef0557 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -99,6 +99,7 @@ struct paca_struct {
u8 soft_enabled; /* irq soft-enable flag */
u8 hard_enabled; /* set if irqs are enabled in MSR */
u8 io_sync; /* writel() needs spin_unlock sync */
+ u8 perf_counter_pending; /* PM interrupt while soft-disabled */

/* Stuff for accurate time accounting */
u64 user_time; /* accumulated usermode TB ticks */
diff --git a/arch/powerpc/include/asm/perf_counter.h b/arch/powerpc/include/asm/perf_counter.h
new file mode 100644
index 0000000..9d7ff6d
--- /dev/null
+++ b/arch/powerpc/include/asm/perf_counter.h
@@ -0,0 +1,72 @@
+/*
+ * Performance counter support - PowerPC-specific definitions.
+ *
+ * Copyright 2008-2009 Paul Mackerras, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <linux/types.h>
+
+#define MAX_HWCOUNTERS 8
+#define MAX_EVENT_ALTERNATIVES 8
+
+/*
+ * This struct provides the constants and functions needed to
+ * describe the PMU on a particular POWER-family CPU.
+ */
+struct power_pmu {
+ int n_counter;
+ int max_alternatives;
+ u64 add_fields;
+ u64 test_adder;
+ int (*compute_mmcr)(unsigned int events[], int n_ev,
+ unsigned int hwc[], u64 mmcr[]);
+ int (*get_constraint)(unsigned int event, u64 *mskp, u64 *valp);
+ int (*get_alternatives)(unsigned int event, unsigned int alt[]);
+ void (*disable_pmc)(unsigned int pmc, u64 mmcr[]);
+ int n_generic;
+ int *generic_events;
+};
+
+extern struct power_pmu *ppmu;
+
+/*
+ * The power_pmu.get_constraint function returns a 64-bit value and
+ * a 64-bit mask that express the constraints between this event and
+ * other events.
+ *
+ * The value and mask are divided up into (non-overlapping) bitfields
+ * of three different types:
+ *
+ * Select field: this expresses the constraint that some set of bits
+ * in MMCR* needs to be set to a specific value for this event. For a
+ * select field, the mask contains 1s in every bit of the field, and
+ * the value contains a unique value for each possible setting of the
+ * MMCR* bits. The constraint checking code will ensure that two events
+ * that set the same field in their masks have the same value in their
+ * value dwords.
+ *
+ * Add field: this expresses the constraint that there can be at most
+ * N events in a particular class. A field of k bits can be used for
+ * N <= 2^(k-1) - 1. The mask has the most significant bit of the field
+ * set (and the other bits 0), and the value has only the least significant
+ * bit of the field set. In addition, the 'add_fields' and 'test_adder'
+ * in the struct power_pmu for this processor come into play. The
+ * add_fields value contains 1 in the LSB of the field, and the
+ * test_adder contains 2^(k-1) - 1 - N in the field.
+ *
+ * NAND field: this expresses the constraint that you may not have events
+ * in all of a set of classes. (For example, on PPC970, you can't select
+ * events from the FPU, ISU and IDU simultaneously, although any two are
+ * possible.) For N classes, the field is N+1 bits wide, and each class
+ * is assigned one bit from the least-significant N bits. The mask has
+ * only the most-significant bit set, and the value has only the bit
+ * for the event's class set. The test_adder has the least significant
+ * bit set in the field.
+ *
+ * If an event is not subject to the constraint expressed by a particular
+ * field, then it will have 0 in both the mask and value for that field.
+ */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 72353f6..4c8095f 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -322,3 +322,4 @@ SYSCALL_SPU(epoll_create1)
SYSCALL_SPU(dup3)
SYSCALL_SPU(pipe2)
SYSCALL(inotify_init1)
+SYSCALL(perf_counter_open)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index e07d0c7..7cef5af 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -341,10 +341,11 @@
#define __NR_dup3 316
#define __NR_pipe2 317
#define __NR_inotify_init1 318
+#define __NR_perf_counter_open 319

#ifdef __KERNEL__

-#define __NR_syscalls 319
+#define __NR_syscalls 320

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 8d1a419..7c941ec 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -94,6 +94,7 @@ obj-$(CONFIG_AUDIT) += audit.o
obj64-$(CONFIG_AUDIT) += compat_audit.o

obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o ppc970-pmu.o power6-pmu.o

obj-$(CONFIG_8XX_MINIMAL_FPEMU) += softemu8xx.o

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 19ee491..3734973 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -131,6 +131,7 @@ int main(void)
DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
DEFINE(PACAHARDIRQEN, offsetof(struct paca_struct, hard_enabled));
+ DEFINE(PACAPERFPEND, offsetof(struct paca_struct, perf_counter_pending));
DEFINE(PACASLBCACHE, offsetof(struct paca_struct, slb_cache));
DEFINE(PACASLBCACHEPTR, offsetof(struct paca_struct, slb_cache_ptr));
DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 383ed6e..f30b4e5 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -526,6 +526,15 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
2:
TRACE_AND_RESTORE_IRQ(r5);

+#ifdef CONFIG_PERF_COUNTERS
+ /* check paca->perf_counter_pending if we're enabling ints */
+ lbz r3,PACAPERFPEND(r13)
+ and. r3,r3,r5
+ beq 27f
+ bl .perf_counter_do_pending
+27:
+#endif /* CONFIG_PERF_COUNTERS */
+
/* extract EE bit and use it to restore paca->hard_enabled */
ld r3,_MSR(r1)
rldicl r4,r3,49,63 /* r0 = (r3 >> 15) & 1 */
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 23b8b5e..7f8e6a9 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -104,6 +104,13 @@ static inline notrace void set_soft_enabled(unsigned long enable)
: : "r" (enable), "i" (offsetof(struct paca_struct, soft_enabled)));
}

+#ifdef CONFIG_PERF_COUNTERS
+notrace void __weak perf_counter_do_pending(void)
+{
+ set_perf_counter_pending(0);
+}
+#endif
+
notrace void raw_local_irq_restore(unsigned long en)
{
/*
@@ -135,6 +142,9 @@ notrace void raw_local_irq_restore(unsigned long en)
iseries_handle_interrupts();
}

+ if (get_perf_counter_pending())
+ perf_counter_do_pending();
+
/*
* if (get_paca()->hard_enabled) return;
* But again we need to take care that gcc gets hard_enabled directly
@@ -231,7 +241,7 @@ void fixup_irqs(cpumask_t map)
if (irq_desc[irq].status & IRQ_PER_CPU)
continue;

- cpus_and(mask, irq_desc[irq].affinity, map);
+ cpumask_and(&mask, irq_desc[irq].affinity, &map);
if (any_online_cpu(mask) == NR_CPUS) {
printk("Breaking affinity for irq %i\n", irq);
mask = map;
diff --git a/arch/powerpc/kernel/perf_counter.c b/arch/powerpc/kernel/perf_counter.c
new file mode 100644
index 0000000..5b02113
--- /dev/null
+++ b/arch/powerpc/kernel/perf_counter.c
@@ -0,0 +1,785 @@
+/*
+ * Performance counter support - powerpc architecture code
+ *
+ * Copyright 2008-2009 Paul Mackerras, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/perf_counter.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <asm/reg.h>
+#include <asm/pmc.h>
+#include <asm/machdep.h>
+
+struct cpu_hw_counters {
+ int n_counters;
+ int n_percpu;
+ int disabled;
+ int n_added;
+ struct perf_counter *counter[MAX_HWCOUNTERS];
+ unsigned int events[MAX_HWCOUNTERS];
+ u64 mmcr[3];
+ u8 pmcs_enabled;
+};
+DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+struct power_pmu *ppmu;
+
+void perf_counter_print_debug(void)
+{
+}
+
+/*
+ * Read one performance monitor counter (PMC).
+ */
+static unsigned long read_pmc(int idx)
+{
+ unsigned long val;
+
+ switch (idx) {
+ case 1:
+ val = mfspr(SPRN_PMC1);
+ break;
+ case 2:
+ val = mfspr(SPRN_PMC2);
+ break;
+ case 3:
+ val = mfspr(SPRN_PMC3);
+ break;
+ case 4:
+ val = mfspr(SPRN_PMC4);
+ break;
+ case 5:
+ val = mfspr(SPRN_PMC5);
+ break;
+ case 6:
+ val = mfspr(SPRN_PMC6);
+ break;
+ case 7:
+ val = mfspr(SPRN_PMC7);
+ break;
+ case 8:
+ val = mfspr(SPRN_PMC8);
+ break;
+ default:
+ printk(KERN_ERR "oops trying to read PMC%d\n", idx);
+ val = 0;
+ }
+ return val;
+}
+
+/*
+ * Write one PMC.
+ */
+static void write_pmc(int idx, unsigned long val)
+{
+ switch (idx) {
+ case 1:
+ mtspr(SPRN_PMC1, val);
+ break;
+ case 2:
+ mtspr(SPRN_PMC2, val);
+ break;
+ case 3:
+ mtspr(SPRN_PMC3, val);
+ break;
+ case 4:
+ mtspr(SPRN_PMC4, val);
+ break;
+ case 5:
+ mtspr(SPRN_PMC5, val);
+ break;
+ case 6:
+ mtspr(SPRN_PMC6, val);
+ break;
+ case 7:
+ mtspr(SPRN_PMC7, val);
+ break;
+ case 8:
+ mtspr(SPRN_PMC8, val);
+ break;
+ default:
+ printk(KERN_ERR "oops trying to write PMC%d\n", idx);
+ }
+}
+
+/*
+ * Check if a set of events can all go on the PMU at once.
+ * If they can't, this will look at alternative codes for the events
+ * and see if any combination of alternative codes is feasible.
+ * The feasible set is returned in event[].
+ */
+static int power_check_constraints(unsigned int event[], int n_ev)
+{
+ u64 mask, value, nv;
+ unsigned int alternatives[MAX_HWCOUNTERS][MAX_EVENT_ALTERNATIVES];
+ u64 amasks[MAX_HWCOUNTERS][MAX_EVENT_ALTERNATIVES];
+ u64 avalues[MAX_HWCOUNTERS][MAX_EVENT_ALTERNATIVES];
+ u64 smasks[MAX_HWCOUNTERS], svalues[MAX_HWCOUNTERS];
+ int n_alt[MAX_HWCOUNTERS], choice[MAX_HWCOUNTERS];
+ int i, j;
+ u64 addf = ppmu->add_fields;
+ u64 tadd = ppmu->test_adder;
+
+ if (n_ev > ppmu->n_counter)
+ return -1;
+
+ /* First see if the events will go on as-is */
+ for (i = 0; i < n_ev; ++i) {
+ alternatives[i][0] = event[i];
+ if (ppmu->get_constraint(event[i], &amasks[i][0],
+ &avalues[i][0]))
+ return -1;
+ choice[i] = 0;
+ }
+ value = mask = 0;
+ for (i = 0; i < n_ev; ++i) {
+ nv = (value | avalues[i][0]) + (value & avalues[i][0] & addf);
+ if ((((nv + tadd) ^ value) & mask) != 0 ||
+ (((nv + tadd) ^ avalues[i][0]) & amasks[i][0]) != 0)
+ break;
+ value = nv;
+ mask |= amasks[i][0];
+ }
+ if (i == n_ev)
+ return 0; /* all OK */
+
+ /* doesn't work, gather alternatives... */
+ if (!ppmu->get_alternatives)
+ return -1;
+ for (i = 0; i < n_ev; ++i) {
+ n_alt[i] = ppmu->get_alternatives(event[i], alternatives[i]);
+ for (j = 1; j < n_alt[i]; ++j)
+ ppmu->get_constraint(alternatives[i][j],
+ &amasks[i][j], &avalues[i][j]);
+ }
+
+ /* enumerate all possibilities and see if any will work */
+ i = 0;
+ j = -1;
+ value = mask = nv = 0;
+ while (i < n_ev) {
+ if (j >= 0) {
+ /* we're backtracking, restore context */
+ value = svalues[i];
+ mask = smasks[i];
+ j = choice[i];
+ }
+ /*
+ * See if any alternative k for event i,
+ * where k > j, will satisfy the constraints.
+ */
+ while (++j < n_alt[i]) {
+ nv = (value | avalues[i][j]) +
+ (value & avalues[i][j] & addf);
+ if ((((nv + tadd) ^ value) & mask) == 0 &&
+ (((nv + tadd) ^ avalues[i][j])
+ & amasks[i][j]) == 0)
+ break;
+ }
+ if (j >= n_alt[i]) {
+ /*
+ * No feasible alternative, backtrack
+ * to event i-1 and continue enumerating its
+ * alternatives from where we got up to.
+ */
+ if (--i < 0)
+ return -1;
+ } else {
+ /*
+ * Found a feasible alternative for event i,
+ * remember where we got up to with this event,
+ * go on to the next event, and start with
+ * the first alternative for it.
+ */
+ choice[i] = j;
+ svalues[i] = value;
+ smasks[i] = mask;
+ value = nv;
+ mask |= amasks[i][j];
+ ++i;
+ j = -1;
+ }
+ }
+
+ /* OK, we have a feasible combination, tell the caller the solution */
+ for (i = 0; i < n_ev; ++i)
+ event[i] = alternatives[i][choice[i]];
+ return 0;
+}
+
+static void power_perf_read(struct perf_counter *counter)
+{
+ long val, delta, prev;
+
+ if (!counter->hw.idx)
+ return;
+ /*
+ * Performance monitor interrupts come even when interrupts
+ * are soft-disabled, as long as interrupts are hard-enabled.
+ * Therefore we treat them like NMIs.
+ */
+ do {
+ prev = atomic64_read(&counter->hw.prev_count);
+ barrier();
+ val = read_pmc(counter->hw.idx);
+ } while (atomic64_cmpxchg(&counter->hw.prev_count, prev, val) != prev);
+
+ /* The counters are only 32 bits wide */
+ delta = (val - prev) & 0xfffffffful;
+ atomic64_add(delta, &counter->count);
+ atomic64_sub(delta, &counter->hw.period_left);
+}
+
+/*
+ * Disable all counters to prevent PMU interrupts and to allow
+ * counters to be added or removed.
+ */
+u64 hw_perf_save_disable(void)
+{
+ struct cpu_hw_counters *cpuhw;
+ unsigned long ret;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ cpuhw = &__get_cpu_var(cpu_hw_counters);
+
+ ret = cpuhw->disabled;
+ if (!ret) {
+ cpuhw->disabled = 1;
+ cpuhw->n_added = 0;
+
+ /*
+ * Check if we ever enabled the PMU on this cpu.
+ */
+ if (!cpuhw->pmcs_enabled) {
+ if (ppc_md.enable_pmcs)
+ ppc_md.enable_pmcs();
+ cpuhw->pmcs_enabled = 1;
+ }
+
+ /*
+ * Set the 'freeze counters' bit.
+ * The barrier is to make sure the mtspr has been
+ * executed and the PMU has frozen the counters
+ * before we return.
+ */
+ mtspr(SPRN_MMCR0, mfspr(SPRN_MMCR0) | MMCR0_FC);
+ mb();
+ }
+ local_irq_restore(flags);
+ return ret;
+}
+
+/*
+ * Re-enable all counters if disable == 0.
+ * If we were previously disabled and counters were added, then
+ * put the new config on the PMU.
+ */
+void hw_perf_restore(u64 disable)
+{
+ struct perf_counter *counter;
+ struct cpu_hw_counters *cpuhw;
+ unsigned long flags;
+ long i;
+ unsigned long val;
+ s64 left;
+ unsigned int hwc_index[MAX_HWCOUNTERS];
+
+ if (disable)
+ return;
+ local_irq_save(flags);
+ cpuhw = &__get_cpu_var(cpu_hw_counters);
+ cpuhw->disabled = 0;
+
+ /*
+ * If we didn't change anything, or only removed counters,
+ * no need to recalculate MMCR* settings and reset the PMCs.
+ * Just reenable the PMU with the current MMCR* settings
+ * (possibly updated for removal of counters).
+ */
+ if (!cpuhw->n_added) {
+ mtspr(SPRN_MMCRA, cpuhw->mmcr[2]);
+ mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
+ mtspr(SPRN_MMCR0, cpuhw->mmcr[0]);
+ if (cpuhw->n_counters == 0)
+ get_lppaca()->pmcregs_in_use = 0;
+ goto out;
+ }
+
+ /*
+ * Compute MMCR* values for the new set of counters
+ */
+ if (ppmu->compute_mmcr(cpuhw->events, cpuhw->n_counters, hwc_index,
+ cpuhw->mmcr)) {
+ /* shouldn't ever get here */
+ printk(KERN_ERR "oops compute_mmcr failed\n");
+ goto out;
+ }
+
+ /*
+ * Write the new configuration to MMCR* with the freeze
+ * bit set and set the hardware counters to their initial values.
+ * Then unfreeze the counters.
+ */
+ get_lppaca()->pmcregs_in_use = 1;
+ mtspr(SPRN_MMCRA, cpuhw->mmcr[2]);
+ mtspr(SPRN_MMCR1, cpuhw->mmcr[1]);
+ mtspr(SPRN_MMCR0, (cpuhw->mmcr[0] & ~(MMCR0_PMC1CE | MMCR0_PMCjCE))
+ | MMCR0_FC);
+
+ /*
+ * Read off any pre-existing counters that need to move
+ * to another PMC.
+ */
+ for (i = 0; i < cpuhw->n_counters; ++i) {
+ counter = cpuhw->counter[i];
+ if (counter->hw.idx && counter->hw.idx != hwc_index[i] + 1) {
+ power_perf_read(counter);
+ write_pmc(counter->hw.idx, 0);
+ counter->hw.idx = 0;
+ }
+ }
+
+ /*
+ * Initialize the PMCs for all the new and moved counters.
+ */
+ for (i = 0; i < cpuhw->n_counters; ++i) {
+ counter = cpuhw->counter[i];
+ if (counter->hw.idx)
+ continue;
+ val = 0;
+ if (counter->hw_event.irq_period) {
+ left = atomic64_read(&counter->hw.period_left);
+ if (left < 0x80000000L)
+ val = 0x80000000L - left;
+ }
+ atomic64_set(&counter->hw.prev_count, val);
+ counter->hw.idx = hwc_index[i] + 1;
+ write_pmc(counter->hw.idx, val);
+ }
+ mb();
+ cpuhw->mmcr[0] |= MMCR0_PMXE | MMCR0_FCECE;
+ mtspr(SPRN_MMCR0, cpuhw->mmcr[0]);
+
+ out:
+ local_irq_restore(flags);
+}
+
+static int collect_events(struct perf_counter *group, int max_count,
+ struct perf_counter *ctrs[], unsigned int *events)
+{
+ int n = 0;
+ struct perf_counter *counter;
+
+ if (!is_software_counter(group)) {
+ if (n >= max_count)
+ return -1;
+ ctrs[n] = group;
+ events[n++] = group->hw.config;
+ }
+ list_for_each_entry(counter, &group->sibling_list, list_entry) {
+ if (!is_software_counter(counter) &&
+ counter->state != PERF_COUNTER_STATE_OFF) {
+ if (n >= max_count)
+ return -1;
+ ctrs[n] = counter;
+ events[n++] = counter->hw.config;
+ }
+ }
+ return n;
+}
+
+static void counter_sched_in(struct perf_counter *counter, int cpu)
+{
+ counter->state = PERF_COUNTER_STATE_ACTIVE;
+ counter->oncpu = cpu;
+ if (is_software_counter(counter))
+ counter->hw_ops->enable(counter);
+}
+
+/*
+ * Called to enable a whole group of counters.
+ * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
+ * Assumes the caller has disabled interrupts and has
+ * frozen the PMU with hw_perf_save_disable.
+ */
+int hw_perf_group_sched_in(struct perf_counter *group_leader,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx, int cpu)
+{
+ struct cpu_hw_counters *cpuhw;
+ long i, n, n0;
+ struct perf_counter *sub;
+
+ cpuhw = &__get_cpu_var(cpu_hw_counters);
+ n0 = cpuhw->n_counters;
+ n = collect_events(group_leader, ppmu->n_counter - n0,
+ &cpuhw->counter[n0], &cpuhw->events[n0]);
+ if (n < 0)
+ return -EAGAIN;
+ if (power_check_constraints(cpuhw->events, n + n0))
+ return -EAGAIN;
+ cpuhw->n_counters = n0 + n;
+ cpuhw->n_added += n;
+
+ /*
+ * OK, this group can go on; update counter states etc.,
+ * and enable any software counters
+ */
+ for (i = n0; i < n0 + n; ++i)
+ cpuhw->counter[i]->hw.config = cpuhw->events[i];
+ cpuctx->active_oncpu += n;
+ n = 1;
+ counter_sched_in(group_leader, cpu);
+ list_for_each_entry(sub, &group_leader->sibling_list, list_entry) {
+ if (sub->state != PERF_COUNTER_STATE_OFF) {
+ counter_sched_in(sub, cpu);
+ ++n;
+ }
+ }
+ ctx->nr_active += n;
+
+ return 1;
+}
+
+/*
+ * Add a counter to the PMU.
+ * If all counters are not already frozen, then we disable and
+ * re-enable the PMU in order to get hw_perf_restore to do the
+ * actual work of reconfiguring the PMU.
+ */
+static int power_perf_enable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuhw;
+ unsigned long flags;
+ u64 pmudis;
+ int n0;
+ int ret = -EAGAIN;
+
+ local_irq_save(flags);
+ pmudis = hw_perf_save_disable();
+
+ /*
+ * Add the counter to the list (if there is room)
+ * and check whether the total set is still feasible.
+ */
+ cpuhw = &__get_cpu_var(cpu_hw_counters);
+ n0 = cpuhw->n_counters;
+ if (n0 >= ppmu->n_counter)
+ goto out;
+ cpuhw->counter[n0] = counter;
+ cpuhw->events[n0] = counter->hw.config;
+ if (power_check_constraints(cpuhw->events, n0 + 1))
+ goto out;
+
+ counter->hw.config = cpuhw->events[n0];
+ ++cpuhw->n_counters;
+ ++cpuhw->n_added;
+
+ ret = 0;
+ out:
+ hw_perf_restore(pmudis);
+ local_irq_restore(flags);
+ return ret;
+}
+
+/*
+ * Remove a counter from the PMU.
+ */
+static void power_perf_disable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuhw;
+ long i;
+ u64 pmudis;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ pmudis = hw_perf_save_disable();
+
+ power_perf_read(counter);
+
+ cpuhw = &__get_cpu_var(cpu_hw_counters);
+ for (i = 0; i < cpuhw->n_counters; ++i) {
+ if (counter == cpuhw->counter[i]) {
+ while (++i < cpuhw->n_counters)
+ cpuhw->counter[i-1] = cpuhw->counter[i];
+ --cpuhw->n_counters;
+ ppmu->disable_pmc(counter->hw.idx - 1, cpuhw->mmcr);
+ write_pmc(counter->hw.idx, 0);
+ counter->hw.idx = 0;
+ break;
+ }
+ }
+ if (cpuhw->n_counters == 0) {
+ /* disable exceptions if no counters are running */
+ cpuhw->mmcr[0] &= ~(MMCR0_PMXE | MMCR0_FCECE);
+ }
+
+ hw_perf_restore(pmudis);
+ local_irq_restore(flags);
+}
+
+struct hw_perf_counter_ops power_perf_ops = {
+ .enable = power_perf_enable,
+ .disable = power_perf_disable,
+ .read = power_perf_read
+};
+
+const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ unsigned long ev;
+ struct perf_counter *ctrs[MAX_HWCOUNTERS];
+ unsigned int events[MAX_HWCOUNTERS];
+ int n;
+
+ if (!ppmu)
+ return NULL;
+ if ((s64)counter->hw_event.irq_period < 0)
+ return NULL;
+ ev = counter->hw_event.type;
+ if (!counter->hw_event.raw) {
+ if (ev >= ppmu->n_generic ||
+ ppmu->generic_events[ev] == 0)
+ return NULL;
+ ev = ppmu->generic_events[ev];
+ }
+ counter->hw.config_base = ev;
+ counter->hw.idx = 0;
+
+ /*
+ * If this is in a group, check if it can go on with all the
+ * other hardware counters in the group. We assume the counter
+ * hasn't been linked into its leader's sibling list at this point.
+ */
+ n = 0;
+ if (counter->group_leader != counter) {
+ n = collect_events(counter->group_leader, ppmu->n_counter - 1,
+ ctrs, events);
+ if (n < 0)
+ return NULL;
+ }
+ events[n++] = ev;
+ if (power_check_constraints(events, n))
+ return NULL;
+
+ counter->hw.config = events[n - 1];
+ atomic64_set(&counter->hw.period_left, counter->hw_event.irq_period);
+ return &power_perf_ops;
+}
+
+/*
+ * Handle wakeups.
+ */
+void perf_counter_do_pending(void)
+{
+ int i;
+ struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
+ struct perf_counter *counter;
+
+ set_perf_counter_pending(0);
+ for (i = 0; i < cpuhw->n_counters; ++i) {
+ counter = cpuhw->counter[i];
+ if (counter && counter->wakeup_pending) {
+ counter->wakeup_pending = 0;
+ wake_up(&counter->waitq);
+ }
+ }
+}
+
+/*
+ * Record data for an irq counter.
+ * This function was lifted from the x86 code; maybe it should
+ * go in the core?
+ */
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+ struct perf_data *irqdata = counter->irqdata;
+
+ if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+ irqdata->overrun++;
+ } else {
+ u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+ *p = data;
+ irqdata->len += sizeof(u64);
+ }
+}
+
+/*
+ * Record all the values of the counters in a group
+ */
+static void perf_handle_group(struct perf_counter *counter)
+{
+ struct perf_counter *leader, *sub;
+
+ leader = counter->group_leader;
+ list_for_each_entry(sub, &leader->sibling_list, list_entry) {
+ if (sub != counter)
+ sub->hw_ops->read(sub);
+ perf_store_irq_data(counter, sub->hw_event.type);
+ perf_store_irq_data(counter, atomic64_read(&sub->count));
+ }
+}
+
+/*
+ * A counter has overflowed; update its count and record
+ * things if requested. Note that interrupts are hard-disabled
+ * here so there is no possibility of being interrupted.
+ */
+static void record_and_restart(struct perf_counter *counter, long val,
+ struct pt_regs *regs)
+{
+ s64 prev, delta, left;
+ int record = 0;
+
+ /* we don't have to worry about interrupts here */
+ prev = atomic64_read(&counter->hw.prev_count);
+ delta = (val - prev) & 0xfffffffful;
+ atomic64_add(delta, &counter->count);
+
+ /*
+ * See if the total period for this counter has expired,
+ * and update for the next period.
+ */
+ val = 0;
+ left = atomic64_read(&counter->hw.period_left) - delta;
+ if (counter->hw_event.irq_period) {
+ if (left <= 0) {
+ left += counter->hw_event.irq_period;
+ if (left <= 0)
+ left = counter->hw_event.irq_period;
+ record = 1;
+ }
+ if (left < 0x80000000L)
+ val = 0x80000000L - left;
+ }
+ write_pmc(counter->hw.idx, val);
+ atomic64_set(&counter->hw.prev_count, val);
+ atomic64_set(&counter->hw.period_left, left);
+
+ /*
+ * Finally record data if requested.
+ */
+ if (record) {
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ break;
+ case PERF_RECORD_IRQ:
+ perf_store_irq_data(counter, instruction_pointer(regs));
+ counter->wakeup_pending = 1;
+ break;
+ case PERF_RECORD_GROUP:
+ perf_handle_group(counter);
+ counter->wakeup_pending = 1;
+ break;
+ }
+ }
+}
+
+/*
+ * Performance monitor interrupt stuff
+ */
+static void perf_counter_interrupt(struct pt_regs *regs)
+{
+ int i;
+ struct cpu_hw_counters *cpuhw = &__get_cpu_var(cpu_hw_counters);
+ struct perf_counter *counter;
+ long val;
+ int need_wakeup = 0, found = 0;
+
+ for (i = 0; i < cpuhw->n_counters; ++i) {
+ counter = cpuhw->counter[i];
+ val = read_pmc(counter->hw.idx);
+ if ((int)val < 0) {
+ /* counter has overflowed */
+ found = 1;
+ record_and_restart(counter, val, regs);
+ if (counter->wakeup_pending)
+ need_wakeup = 1;
+ }
+ }
+
+ /*
+ * In case we didn't find and reset the counter that caused
+ * the interrupt, scan all counters and reset any that are
+ * negative, to avoid getting continual interrupts.
+ * Any that we processed in the previous loop will not be negative.
+ */
+ if (!found) {
+ for (i = 0; i < ppmu->n_counter; ++i) {
+ val = read_pmc(i + 1);
+ if ((int)val < 0)
+ write_pmc(i + 1, 0);
+ }
+ }
+
+ /*
+ * Reset MMCR0 to its normal value. This will set PMXE and
+ * clear FC (freeze counters) and PMAO (perf mon alert occurred)
+ * and thus allow interrupts to occur again.
+ * XXX might want to use MSR.PM to keep the counters frozen until
+ * we get back out of this interrupt.
+ */
+ mtspr(SPRN_MMCR0, cpuhw->mmcr[0]);
+
+ /*
+ * If we need a wakeup, check whether interrupts were soft-enabled
+ * when we took the interrupt. If they were, we can wake stuff up
+ * immediately; otherwise we'll have to set a flag and do the
+ * wakeup when interrupts get soft-enabled.
+ */
+ if (need_wakeup) {
+ if (regs->softe) {
+ irq_enter();
+ perf_counter_do_pending();
+ irq_exit();
+ } else {
+ set_perf_counter_pending(1);
+ }
+ }
+}
+
+void hw_perf_counter_setup(int cpu)
+{
+ struct cpu_hw_counters *cpuhw = &per_cpu(cpu_hw_counters, cpu);
+
+ memset(cpuhw, 0, sizeof(*cpuhw));
+ cpuhw->mmcr[0] = MMCR0_FC;
+}
+
+extern struct power_pmu ppc970_pmu;
+extern struct power_pmu power6_pmu;
+
+static int init_perf_counters(void)
+{
+ unsigned long pvr;
+
+ if (reserve_pmc_hardware(perf_counter_interrupt)) {
+ printk(KERN_ERR "Couldn't init performance monitor subsystem\n");
+ return -EBUSY;
+ }
+
+ /* XXX should get this from cputable */
+ pvr = mfspr(SPRN_PVR);
+ switch (PVR_VER(pvr)) {
+ case PV_970:
+ case PV_970FX:
+ case PV_970MP:
+ ppmu = &ppc970_pmu;
+ break;
+ case 0x3e:
+ ppmu = &power6_pmu;
+ break;
+ }
+ return 0;
+}
+
+arch_initcall(init_perf_counters);
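
As a quick standalone illustration of the reload arithmetic used by
record_and_restart() above (a user-space sketch, not part of the patch;
reload_value() is a made-up helper name): a PowerPC PMC raises its
exception when the 32-bit count goes negative, so preloading it with
0x80000000 - left makes it fire after 'left' more events.

#include <stdio.h>

/* Sketch only: mirrors the "val = 0x80000000L - left" step above. */
static unsigned int reload_value(long long left)
{
	/* reload only if the remaining period fits below bit 31 */
	if (left > 0 && left < 0x80000000LL)
		return (unsigned int)(0x80000000LL - left);
	return 0;	/* otherwise count up from 0 and take another overflow */
}

int main(void)
{
	/* with 1000 events left, the PMC would be set to 0x7ffffc18 */
	printf("reload = %#x\n", reload_value(1000));
	return 0;
}
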
diff --git a/arch/powerpc/kernel/power6-pmu.c b/arch/powerpc/kernel/power6-pmu.c
new file mode 100644
index 0000000..b1f61f3
--- /dev/null
+++ b/arch/powerpc/kernel/power6-pmu.c
@@ -0,0 +1,283 @@
+/*
+ * Performance counter support for POWER6 processors.
+ *
+ * Copyright 2008-2009 Paul Mackerras, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <linux/kernel.h>
+#include <linux/perf_counter.h>
+#include <asm/reg.h>
+
+/*
+ * Bits in event code for POWER6
+ */
+#define PM_PMC_SH 20 /* PMC number (1-based) for direct events */
+#define PM_PMC_MSK 0x7
+#define PM_PMC_MSKS (PM_PMC_MSK << PM_PMC_SH)
+#define PM_UNIT_SH 16 /* Unit event comes from (TTMxSEL encoding) */
+#define PM_UNIT_MSK 0xf
+#define PM_UNIT_MSKS (PM_UNIT_MSK << PM_UNIT_SH)
+#define PM_LLAV 0x8000 /* Load lookahead match value */
+#define PM_LLA 0x4000 /* Load lookahead match enable */
+#define PM_BYTE_SH 12 /* Byte of event bus to use */
+#define PM_BYTE_MSK 3
+#define PM_SUBUNIT_SH 8 /* Subunit event comes from (NEST_SEL enc.) */
+#define PM_SUBUNIT_MSK 7
+#define PM_SUBUNIT_MSKS (PM_SUBUNIT_MSK << PM_SUBUNIT_SH)
+#define PM_PMCSEL_MSK 0xff /* PMCxSEL value */
+#define PM_BUSEVENT_MSK 0xf3700
+
+/*
+ * Bits in MMCR1 for POWER6
+ */
+#define MMCR1_TTM0SEL_SH 60
+#define MMCR1_TTMSEL_SH(n) (MMCR1_TTM0SEL_SH - (n) * 4)
+#define MMCR1_TTMSEL_MSK 0xf
+#define MMCR1_TTMSEL(m, n) (((m) >> MMCR1_TTMSEL_SH(n)) & MMCR1_TTMSEL_MSK)
+#define MMCR1_NESTSEL_SH 45
+#define MMCR1_NESTSEL_MSK 0x7
+#define MMCR1_NESTSEL(m) (((m) >> MMCR1_NESTSEL_SH) & MMCR1_NESTSEL_MSK)
+#define MMCR1_PMC1_LLA ((u64)1 << 44)
+#define MMCR1_PMC1_LLA_VALUE ((u64)1 << 39)
+#define MMCR1_PMC1_ADDR_SEL ((u64)1 << 35)
+#define MMCR1_PMC1SEL_SH 24
+#define MMCR1_PMCSEL_SH(n) (MMCR1_PMC1SEL_SH - (n) * 8)
+#define MMCR1_PMCSEL_MSK 0xff
+
+/*
+ * Assign PMC numbers and compute MMCR1 value for a set of events
+ */
+static int p6_compute_mmcr(unsigned int event[], int n_ev,
+ unsigned int hwc[], u64 mmcr[])
+{
+ u64 mmcr1 = 0;
+ int i;
+ unsigned int pmc, ev, b, u, s, psel;
+ unsigned int ttmset = 0;
+ unsigned int pmc_inuse = 0;
+
+ if (n_ev > 4)
+ return -1;
+ for (i = 0; i < n_ev; ++i) {
+ pmc = (event[i] >> PM_PMC_SH) & PM_PMC_MSK;
+ if (pmc) {
+ if (pmc_inuse & (1 << (pmc - 1)))
+ return -1; /* collision! */
+ pmc_inuse |= 1 << (pmc - 1);
+ }
+ }
+ for (i = 0; i < n_ev; ++i) {
+ ev = event[i];
+ pmc = (ev >> PM_PMC_SH) & PM_PMC_MSK;
+ if (pmc) {
+ --pmc;
+ } else {
+ /* can go on any PMC; find a free one */
+ for (pmc = 0; pmc < 4; ++pmc)
+ if (!(pmc_inuse & (1 << pmc)))
+ break;
+ pmc_inuse |= 1 << pmc;
+ }
+ hwc[i] = pmc;
+ psel = ev & PM_PMCSEL_MSK;
+ if (ev & PM_BUSEVENT_MSK) {
+ /* this event uses the event bus */
+ b = (ev >> PM_BYTE_SH) & PM_BYTE_MSK;
+ u = (ev >> PM_UNIT_SH) & PM_UNIT_MSK;
+ /* check for conflict on this byte of event bus */
+ if ((ttmset & (1 << b)) && MMCR1_TTMSEL(mmcr1, b) != u)
+ return -1;
+ mmcr1 |= (u64)u << MMCR1_TTMSEL_SH(b);
+ ttmset |= 1 << b;
+ if (u == 5) {
+ /* Nest events have a further mux */
+ s = (ev >> PM_SUBUNIT_SH) & PM_SUBUNIT_MSK;
+ if ((ttmset & 0x10) &&
+ MMCR1_NESTSEL(mmcr1) != s)
+ return -1;
+ ttmset |= 0x10;
+ mmcr1 |= (u64)s << MMCR1_NESTSEL_SH;
+ }
+ if (0x30 <= psel && psel <= 0x3d) {
+ /* these need the PMCx_ADDR_SEL bits */
+ if (b >= 2)
+ mmcr1 |= MMCR1_PMC1_ADDR_SEL >> pmc;
+ }
+ /* bus select values are different for PMC3/4 */
+ if (pmc >= 2 && (psel & 0x90) == 0x80)
+ psel ^= 0x20;
+ }
+ if (ev & PM_LLA) {
+ mmcr1 |= MMCR1_PMC1_LLA >> pmc;
+ if (ev & PM_LLAV)
+ mmcr1 |= MMCR1_PMC1_LLA_VALUE >> pmc;
+ }
+ mmcr1 |= (u64)psel << MMCR1_PMCSEL_SH(pmc);
+ }
+ mmcr[0] = 0;
+ if (pmc_inuse & 1)
+ mmcr[0] = MMCR0_PMC1CE;
+ if (pmc_inuse & 0xe)
+ mmcr[0] |= MMCR0_PMCjCE;
+ mmcr[1] = mmcr1;
+ mmcr[2] = 0;
+ return 0;
+}
+
+/*
+ * Layout of constraint bits:
+ *
+ * 0-1 add field: number of uses of PMC1 (max 1)
+ * 2-3, 4-5, 6-7: ditto for PMC2, 3, 4
+ * 8-10 select field: nest (subunit) event selector
+ * 16-19 select field: unit on byte 0 of event bus
+ * 20-23, 24-27, 28-31 ditto for bytes 1, 2, 3
+ */
+static int p6_get_constraint(unsigned int event, u64 *maskp, u64 *valp)
+{
+ int pmc, byte, sh;
+ unsigned int mask = 0, value = 0;
+
+ pmc = (event >> PM_PMC_SH) & PM_PMC_MSK;
+ if (pmc) {
+ if (pmc > 4)
+ return -1;
+ sh = (pmc - 1) * 2;
+ mask |= 2 << sh;
+ value |= 1 << sh;
+ }
+ if (event & PM_BUSEVENT_MSK) {
+ byte = (event >> PM_BYTE_SH) & PM_BYTE_MSK;
+ sh = byte * 4;
+ mask |= PM_UNIT_MSKS << sh;
+ value |= (event & PM_UNIT_MSKS) << sh;
+ if ((event & PM_UNIT_MSKS) == (5 << PM_UNIT_SH)) {
+ mask |= PM_SUBUNIT_MSKS;
+ value |= event & PM_SUBUNIT_MSKS;
+ }
+ }
+ *maskp = mask;
+ *valp = value;
+ return 0;
+}
+
+#define MAX_ALT 4 /* at most 4 alternatives for any event */
+
+static const unsigned int event_alternatives[][MAX_ALT] = {
+ { 0x0130e8, 0x2000f6, 0x3000fc }, /* PM_PTEG_RELOAD_VALID */
+ { 0x080080, 0x10000d, 0x30000c, 0x4000f0 }, /* PM_LD_MISS_L1 */
+ { 0x080088, 0x200054, 0x3000f0 }, /* PM_ST_MISS_L1 */
+ { 0x10000a, 0x2000f4 }, /* PM_RUN_CYC */
+ { 0x10000b, 0x2000f5 }, /* PM_RUN_COUNT */
+ { 0x10000e, 0x400010 }, /* PM_PURR */
+ { 0x100010, 0x4000f8 }, /* PM_FLUSH */
+ { 0x10001a, 0x200010 }, /* PM_MRK_INST_DISP */
+ { 0x100026, 0x3000f8 }, /* PM_TB_BIT_TRANS */
+ { 0x100054, 0x2000f0 }, /* PM_ST_FIN */
+ { 0x100056, 0x2000fc }, /* PM_L1_ICACHE_MISS */
+ { 0x1000f0, 0x40000a }, /* PM_INST_IMC_MATCH_CMPL */
+ { 0x1000f8, 0x200008 }, /* PM_GCT_EMPTY_CYC */
+ { 0x1000fc, 0x400006 }, /* PM_LSU_DERAT_MISS_CYC */
+ { 0x20000e, 0x400007 }, /* PM_LSU_DERAT_MISS */
+ { 0x200012, 0x300012 }, /* PM_INST_DISP */
+ { 0x2000f2, 0x3000f2 }, /* PM_INST_DISP */
+ { 0x2000f8, 0x300010 }, /* PM_EXT_INT */
+ { 0x2000fe, 0x300056 }, /* PM_DATA_FROM_L2MISS */
+ { 0x2d0030, 0x30001a }, /* PM_MRK_FPU_FIN */
+ { 0x30000a, 0x400018 }, /* PM_MRK_INST_FIN */
+ { 0x3000f6, 0x40000e }, /* PM_L1_DCACHE_RELOAD_VALID */
+ { 0x3000fe, 0x400056 }, /* PM_DATA_FROM_L3MISS */
+};
+
+/*
+ * This could be made more efficient with a binary search on
+ * a presorted list, if necessary
+ */
+static int find_alternatives_list(unsigned int event)
+{
+ int i, j;
+ unsigned int alt;
+
+ for (i = 0; i < ARRAY_SIZE(event_alternatives); ++i) {
+ if (event < event_alternatives[i][0])
+ return -1;
+ for (j = 0; j < MAX_ALT; ++j) {
+ alt = event_alternatives[i][j];
+ if (!alt || event < alt)
+ break;
+ if (event == alt)
+ return i;
+ }
+ }
+ return -1;
+}
+
+static int p6_get_alternatives(unsigned int event, unsigned int alt[])
+{
+ int i, j;
+ unsigned int aevent, psel, pmc;
+ unsigned int nalt = 1;
+
+ alt[0] = event;
+
+ /* check the alternatives table */
+ i = find_alternatives_list(event);
+ if (i >= 0) {
+ /* copy out alternatives from list */
+ for (j = 0; j < MAX_ALT; ++j) {
+ aevent = event_alternatives[i][j];
+ if (!aevent)
+ break;
+ if (aevent != event)
+ alt[nalt++] = aevent;
+ }
+
+ } else {
+ /* Check for alternative ways of computing sum events */
+ /* PMCSEL 0x32 counter N == PMCSEL 0x34 counter 5-N */
+ psel = event & (PM_PMCSEL_MSK & ~1); /* ignore edge bit */
+ pmc = (event >> PM_PMC_SH) & PM_PMC_MSK;
+ if (pmc && (psel == 0x32 || psel == 0x34))
+ alt[nalt++] = ((event ^ 0x6) & ~PM_PMC_MSKS) |
+ ((5 - pmc) << PM_PMC_SH);
+
+ /* PMCSEL 0x38 counter N == PMCSEL 0x3a counter N+/-2 */
+ if (pmc && (psel == 0x38 || psel == 0x3a))
+ alt[nalt++] = ((event ^ 0x2) & ~PM_PMC_MSKS) |
+ ((pmc > 2? pmc - 2: pmc + 2) << PM_PMC_SH);
+ }
+
+ return nalt;
+}
+
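
To make the PMCSEL remapping in p6_get_alternatives() above concrete,
here is a small standalone sketch (not part of the patch) applying the
"PMCSEL 0x32 counter N == PMCSEL 0x34 counter 5-N" rule to a
hypothetical PMC1 event code:

#include <stdio.h>

#define PM_PMC_SH	20
#define PM_PMC_MSK	0x7
#define PM_PMC_MSKS	(PM_PMC_MSK << PM_PMC_SH)

int main(void)
{
	unsigned int event = (1 << PM_PMC_SH) | 0x32;	/* PMC1, PMCSEL 0x32 */
	unsigned int pmc = (event >> PM_PMC_SH) & PM_PMC_MSK;
	/* flip selector bits 0x6 and mirror the PMC number around 5 */
	unsigned int alt = ((event ^ 0x6) & ~PM_PMC_MSKS) |
			   ((5 - pmc) << PM_PMC_SH);

	printf("%#x -> %#x\n", event, alt);	/* 0x100032 -> 0x400034 */
	return 0;
}
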
+static void p6_disable_pmc(unsigned int pmc, u64 mmcr[])
+{
+ /* Set PMCxSEL to 0 to disable PMCx */
+ mmcr[1] &= ~(0xffUL << MMCR1_PMCSEL_SH(pmc));
+}
+
+static int power6_generic_events[] = {
+ [PERF_COUNT_CPU_CYCLES] = 0x1e,
+ [PERF_COUNT_INSTRUCTIONS] = 2,
+ [PERF_COUNT_CACHE_REFERENCES] = 0x280030, /* LD_REF_L1 */
+ [PERF_COUNT_CACHE_MISSES] = 0x30000c, /* LD_MISS_L1 */
+ [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x410a0, /* BR_PRED */
+ [PERF_COUNT_BRANCH_MISSES] = 0x400052, /* BR_MPRED */
+};
+
+struct power_pmu power6_pmu = {
+ .n_counter = 4,
+ .max_alternatives = MAX_ALT,
+ .add_fields = 0x55,
+ .test_adder = 0,
+ .compute_mmcr = p6_compute_mmcr,
+ .get_constraint = p6_get_constraint,
+ .get_alternatives = p6_get_alternatives,
+ .disable_pmc = p6_disable_pmc,
+ .n_generic = ARRAY_SIZE(power6_generic_events),
+ .generic_events = power6_generic_events,
+};
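
For reference, here is a user-space sketch (not part of the patch) of
the per-event constraint encoding that p6_get_constraint() above
produces; the step that adds up and checks several events' constraints
lives in power_check_constraints() elsewhere in the patch, so this
sketch stops at the per-event mask/value pair:

#include <stdio.h>
#include <stdint.h>

#define PM_PMC_SH	20
#define PM_PMC_MSK	0x7

int main(void)
{
	unsigned int event = 0x10000a;	/* PM_RUN_CYC: PMC1, PMCSEL 0x0a */
	unsigned int pmc = (event >> PM_PMC_SH) & PM_PMC_MSK;
	uint64_t mask = 0, value = 0;

	if (pmc) {
		int sh = (pmc - 1) * 2;
		mask  |= 2ULL << sh;	/* claim the 2-bit "uses of PMCn" field */
		value |= 1ULL << sh;	/* record one use of that PMC */
	}
	printf("event %#x -> mask %#llx, value %#llx\n",
	       event, (unsigned long long)mask, (unsigned long long)value);
	return 0;
}
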
diff --git a/arch/powerpc/kernel/ppc970-pmu.c b/arch/powerpc/kernel/ppc970-pmu.c
new file mode 100644
index 0000000..c325658
--- /dev/null
+++ b/arch/powerpc/kernel/ppc970-pmu.c
@@ -0,0 +1,375 @@
+/*
+ * Performance counter support for PPC970-family processors.
+ *
+ * Copyright 2008-2009 Paul Mackerras, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <linux/string.h>
+#include <linux/perf_counter.h>
+#include <asm/reg.h>
+
+/*
+ * Bits in event code for PPC970
+ */
+#define PM_PMC_SH 12 /* PMC number (1-based) for direct events */
+#define PM_PMC_MSK 0xf
+#define PM_UNIT_SH 8 /* TTMMUX number and setting - unit select */
+#define PM_UNIT_MSK 0xf
+#define PM_BYTE_SH 4 /* Byte number of event bus to use */
+#define PM_BYTE_MSK 3
+#define PM_PMCSEL_MSK 0xf
+
+/* Values in PM_UNIT field */
+#define PM_NONE 0
+#define PM_FPU 1
+#define PM_VPU 2
+#define PM_ISU 3
+#define PM_IFU 4
+#define PM_IDU 5
+#define PM_STS 6
+#define PM_LSU0 7
+#define PM_LSU1U 8
+#define PM_LSU1L 9
+#define PM_LASTUNIT 9
+
+/*
+ * Bits in MMCR0 for PPC970
+ */
+#define MMCR0_PMC1SEL_SH 8
+#define MMCR0_PMC2SEL_SH 1
+#define MMCR_PMCSEL_MSK 0x1f
+
+/*
+ * Bits in MMCR1 for PPC970
+ */
+#define MMCR1_TTM0SEL_SH 62
+#define MMCR1_TTM1SEL_SH 59
+#define MMCR1_TTM3SEL_SH 53
+#define MMCR1_TTMSEL_MSK 3
+#define MMCR1_TD_CP_DBG0SEL_SH 50
+#define MMCR1_TD_CP_DBG1SEL_SH 48
+#define MMCR1_TD_CP_DBG2SEL_SH 46
+#define MMCR1_TD_CP_DBG3SEL_SH 44
+#define MMCR1_PMC1_ADDER_SEL_SH 39
+#define MMCR1_PMC2_ADDER_SEL_SH 38
+#define MMCR1_PMC6_ADDER_SEL_SH 37
+#define MMCR1_PMC5_ADDER_SEL_SH 36
+#define MMCR1_PMC8_ADDER_SEL_SH 35
+#define MMCR1_PMC7_ADDER_SEL_SH 34
+#define MMCR1_PMC3_ADDER_SEL_SH 33
+#define MMCR1_PMC4_ADDER_SEL_SH 32
+#define MMCR1_PMC3SEL_SH 27
+#define MMCR1_PMC4SEL_SH 22
+#define MMCR1_PMC5SEL_SH 17
+#define MMCR1_PMC6SEL_SH 12
+#define MMCR1_PMC7SEL_SH 7
+#define MMCR1_PMC8SEL_SH 2
+
+static short mmcr1_adder_bits[8] = {
+ MMCR1_PMC1_ADDER_SEL_SH,
+ MMCR1_PMC2_ADDER_SEL_SH,
+ MMCR1_PMC3_ADDER_SEL_SH,
+ MMCR1_PMC4_ADDER_SEL_SH,
+ MMCR1_PMC5_ADDER_SEL_SH,
+ MMCR1_PMC6_ADDER_SEL_SH,
+ MMCR1_PMC7_ADDER_SEL_SH,
+ MMCR1_PMC8_ADDER_SEL_SH
+};
+
+/*
+ * Bits in MMCRA
+ */
+
+/*
+ * Layout of constraint bits:
+ * 6666555555555544444444443333333333222222222211111111110000000000
+ * 3210987654321098765432109876543210987654321098765432109876543210
+ * <><>[ >[ >[ >< >< >< >< ><><><><><><><><>
+ * T0T1 UC PS1 PS2 B0 B1 B2 B3 P1P2P3P4P5P6P7P8
+ *
+ * T0 - TTM0 constraint
+ * 46-47: TTM0SEL value (0=FPU, 2=IFU, 3=VPU) 0xC000_0000_0000
+ *
+ * T1 - TTM1 constraint
+ * 44-45: TTM1SEL value (0=IDU, 3=STS) 0x3000_0000_0000
+ *
+ * UC - unit constraint: can't have all three of FPU|IFU|VPU, ISU, IDU|STS
+ * 43: UC3 error 0x0800_0000_0000
+ * 42: FPU|IFU|VPU events needed 0x0400_0000_0000
+ * 41: ISU events needed 0x0200_0000_0000
+ * 40: IDU|STS events needed 0x0100_0000_0000
+ *
+ * PS1
+ * 39: PS1 error 0x0080_0000_0000
+ * 36-38: count of events needing PMC1/2/5/6 0x0070_0000_0000
+ *
+ * PS2
+ * 35: PS2 error 0x0008_0000_0000
+ * 32-34: count of events needing PMC3/4/7/8 0x0007_0000_0000
+ *
+ * B0
+ * 28-31: Byte 0 event source 0xf000_0000
+ * Encoding as for the event code
+ *
+ * B1, B2, B3
+ * 24-27, 20-23, 16-19: Byte 1, 2, 3 event sources
+ *
+ * P1
+ * 15: P1 error 0x8000
+ * 14-15: Count of events needing PMC1
+ *
+ * P2..P8
+ * 0-13: Count of events needing PMC2..PMC8
+ */
+
+/* Masks and values for using events from the various units */
+static u64 unit_cons[PM_LASTUNIT+1][2] = {
+ [PM_FPU] = { 0xc80000000000ull, 0x040000000000ull },
+ [PM_VPU] = { 0xc80000000000ull, 0xc40000000000ull },
+ [PM_ISU] = { 0x080000000000ull, 0x020000000000ull },
+ [PM_IFU] = { 0xc80000000000ull, 0x840000000000ull },
+ [PM_IDU] = { 0x380000000000ull, 0x010000000000ull },
+ [PM_STS] = { 0x380000000000ull, 0x310000000000ull },
+};
+
+static int p970_get_constraint(unsigned int event, u64 *maskp, u64 *valp)
+{
+ int pmc, byte, unit, sh;
+ u64 mask = 0, value = 0;
+ int grp = -1;
+
+ pmc = (event >> PM_PMC_SH) & PM_PMC_MSK;
+ if (pmc) {
+ if (pmc > 8)
+ return -1;
+ sh = (pmc - 1) * 2;
+ mask |= 2 << sh;
+ value |= 1 << sh;
+ grp = ((pmc - 1) >> 1) & 1;
+ }
+ unit = (event >> PM_UNIT_SH) & PM_UNIT_MSK;
+ if (unit) {
+ if (unit > PM_LASTUNIT)
+ return -1;
+ mask |= unit_cons[unit][0];
+ value |= unit_cons[unit][1];
+ byte = (event >> PM_BYTE_SH) & PM_BYTE_MSK;
+ /*
+ * Bus events on bytes 0 and 2 can be counted
+ * on PMC1/2/5/6; bytes 1 and 3 on PMC3/4/7/8.
+ */
+ if (!pmc)
+ grp = byte & 1;
+ /* Set byte lane select field */
+ mask |= 0xfULL << (28 - 4 * byte);
+ value |= (u64)unit << (28 - 4 * byte);
+ }
+ if (grp == 0) {
+ /* increment PMC1/2/5/6 field */
+ mask |= 0x8000000000ull;
+ value |= 0x1000000000ull;
+ } else if (grp == 1) {
+ /* increment PMC3/4/7/8 field */
+ mask |= 0x800000000ull;
+ value |= 0x100000000ull;
+ }
+ *maskp = mask;
+ *valp = value;
+ return 0;
+}
+
+static int p970_get_alternatives(unsigned int event, unsigned int alt[])
+{
+ alt[0] = event;
+
+ /* 2 alternatives for LSU empty */
+ if (event == 0x2002 || event == 0x3002) {
+ alt[1] = event ^ 0x1000;
+ return 2;
+ }
+
+ return 1;
+}
+
+static int p970_compute_mmcr(unsigned int event[], int n_ev,
+ unsigned int hwc[], u64 mmcr[])
+{
+ u64 mmcr0 = 0, mmcr1 = 0, mmcra = 0;
+ unsigned int pmc, unit, byte, psel;
+ unsigned int ttm, grp;
+ unsigned int pmc_inuse = 0;
+ unsigned int pmc_grp_use[2];
+ unsigned char busbyte[4];
+ unsigned char unituse[16];
+ unsigned char unitmap[] = { 0, 0<<3, 3<<3, 1<<3, 2<<3, 0|4, 3|4 };
+ unsigned char ttmuse[2];
+ unsigned char pmcsel[8];
+ int i;
+
+ if (n_ev > 8)
+ return -1;
+
+ /* First pass to count resource use */
+ pmc_grp_use[0] = pmc_grp_use[1] = 0;
+ memset(busbyte, 0, sizeof(busbyte));
+ memset(unituse, 0, sizeof(unituse));
+ for (i = 0; i < n_ev; ++i) {
+ pmc = (event[i] >> PM_PMC_SH) & PM_PMC_MSK;
+ if (pmc) {
+ if (pmc_inuse & (1 << (pmc - 1)))
+ return -1;
+ pmc_inuse |= 1 << (pmc - 1);
+ /* count 1/2/5/6 vs 3/4/7/8 use */
+ ++pmc_grp_use[((pmc - 1) >> 1) & 1];
+ }
+ unit = (event[i] >> PM_UNIT_SH) & PM_UNIT_MSK;
+ byte = (event[i] >> PM_BYTE_SH) & PM_BYTE_MSK;
+ if (unit) {
+ if (unit > PM_LASTUNIT)
+ return -1;
+ if (!pmc)
+ ++pmc_grp_use[byte & 1];
+ if (busbyte[byte] && busbyte[byte] != unit)
+ return -1;
+ busbyte[byte] = unit;
+ unituse[unit] = 1;
+ }
+ }
+ if (pmc_grp_use[0] > 4 || pmc_grp_use[1] > 4)
+ return -1;
+
+ /*
+ * Assign resources and set multiplexer selects.
+ *
+ * PM_ISU can go either on TTM0 or TTM1, but that's the only
+ * choice we have to deal with.
+ */
+ if (unituse[PM_ISU] &
+ (unituse[PM_FPU] | unituse[PM_IFU] | unituse[PM_VPU]))
+ unitmap[PM_ISU] = 2 | 4; /* move ISU to TTM1 */
+ /* Set TTM[01]SEL fields. */
+ ttmuse[0] = ttmuse[1] = 0;
+ for (i = PM_FPU; i <= PM_STS; ++i) {
+ if (!unituse[i])
+ continue;
+ ttm = unitmap[i];
+ ++ttmuse[(ttm >> 2) & 1];
+ mmcr1 |= (u64)(ttm & ~4) << MMCR1_TTM1SEL_SH;
+ }
+ /* Check only one unit per TTMx */
+ if (ttmuse[0] > 1 || ttmuse[1] > 1)
+ return -1;
+
+ /* Set byte lane select fields and TTM3SEL. */
+ for (byte = 0; byte < 4; ++byte) {
+ unit = busbyte[byte];
+ if (!unit)
+ continue;
+ if (unit <= PM_STS)
+ ttm = (unitmap[unit] >> 2) & 1;
+ else if (unit == PM_LSU0)
+ ttm = 2;
+ else {
+ ttm = 3;
+ if (unit == PM_LSU1L && byte >= 2)
+ mmcr1 |= 1ull << (MMCR1_TTM3SEL_SH + 3 - byte);
+ }
+ mmcr1 |= (u64)ttm << (MMCR1_TD_CP_DBG0SEL_SH - 2 * byte);
+ }
+
+ /* Second pass: assign PMCs, set PMCxSEL and PMCx_ADDER_SEL fields */
+ memset(pmcsel, 0x8, sizeof(pmcsel)); /* 8 means don't count */
+ for (i = 0; i < n_ev; ++i) {
+ pmc = (event[i] >> PM_PMC_SH) & PM_PMC_MSK;
+ unit = (event[i] >> PM_UNIT_SH) & PM_UNIT_MSK;
+ byte = (event[i] >> PM_BYTE_SH) & PM_BYTE_MSK;
+ psel = event[i] & PM_PMCSEL_MSK;
+ if (!pmc) {
+ /* Bus event or any-PMC direct event */
+ if (unit)
+ psel |= 0x10 | ((byte & 2) << 2);
+ else
+ psel |= 8;
+ for (pmc = 0; pmc < 8; ++pmc) {
+ if (pmc_inuse & (1 << pmc))
+ continue;
+ grp = (pmc >> 1) & 1;
+ if (unit) {
+ if (grp == (byte & 1))
+ break;
+ } else if (pmc_grp_use[grp] < 4) {
+ ++pmc_grp_use[grp];
+ break;
+ }
+ }
+ pmc_inuse |= 1 << pmc;
+ } else {
+ /* Direct event */
+ --pmc;
+ if (psel == 0 && (byte & 2))
+ /* add events on higher-numbered bus */
+ mmcr1 |= 1ull << mmcr1_adder_bits[pmc];
+ }
+ pmcsel[pmc] = psel;
+ hwc[i] = pmc;
+ }
+ for (pmc = 0; pmc < 2; ++pmc)
+ mmcr0 |= pmcsel[pmc] << (MMCR0_PMC1SEL_SH - 7 * pmc);
+ for (; pmc < 8; ++pmc)
+ mmcr1 |= (u64)pmcsel[pmc] << (MMCR1_PMC3SEL_SH - 5 * (pmc - 2));
+ if (pmc_inuse & 1)
+ mmcr0 |= MMCR0_PMC1CE;
+ if (pmc_inuse & 0xfe)
+ mmcr0 |= MMCR0_PMCjCE;
+
+ mmcra |= 0x2000; /* mark only one IOP per PPC instruction */
+
+ /* Return MMCRx values */
+ mmcr[0] = mmcr0;
+ mmcr[1] = mmcr1;
+ mmcr[2] = mmcra;
+ return 0;
+}
+
+static void p970_disable_pmc(unsigned int pmc, u64 mmcr[])
+{
+ int shift, i;
+
+ if (pmc <= 1) {
+ shift = MMCR0_PMC1SEL_SH - 7 * pmc;
+ i = 0;
+ } else {
+ shift = MMCR1_PMC3SEL_SH - 5 * (pmc - 2);
+ i = 1;
+ }
+ /*
+ * Setting the PMCxSEL field to 0x08 disables PMC x.
+ */
+ mmcr[i] = (mmcr[i] & ~(0x1fUL << shift)) | (0x08UL << shift);
+}
+
+static int ppc970_generic_events[] = {
+ [PERF_COUNT_CPU_CYCLES] = 7,
+ [PERF_COUNT_INSTRUCTIONS] = 1,
+ [PERF_COUNT_CACHE_REFERENCES] = 0x8810, /* PM_LD_REF_L1 */
+ [PERF_COUNT_CACHE_MISSES] = 0x3810, /* PM_LD_MISS_L1 */
+ [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x431, /* PM_BR_ISSUED */
+ [PERF_COUNT_BRANCH_MISSES] = 0x327, /* PM_GRP_BR_MPRED */
+};
+
+struct power_pmu ppc970_pmu = {
+ .n_counter = 8,
+ .max_alternatives = 2,
+ .add_fields = 0x001100005555ull,
+ .test_adder = 0x013300000000ull,
+ .compute_mmcr = p970_compute_mmcr,
+ .get_constraint = p970_get_constraint,
+ .get_alternatives = p970_get_alternatives,
+ .disable_pmc = p970_disable_pmc,
+ .n_generic = ARRAY_SIZE(ppc970_generic_events),
+ .generic_events = ppc970_generic_events,
+};
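
A standalone sketch (not part of the patch) of the PMCxSEL placement
that p970_disable_pmc() above relies on: the selects for PMC1/2 live in
MMCR0 (shifts 7 apart) and those for PMC3..8 in MMCR1 (shifts 5 apart):

#include <stdio.h>

#define MMCR0_PMC1SEL_SH	8
#define MMCR1_PMC3SEL_SH	27

int main(void)
{
	int pmc;

	for (pmc = 0; pmc < 8; ++pmc) {
		int reg   = (pmc <= 1) ? 0 : 1;
		int shift = (pmc <= 1) ? MMCR0_PMC1SEL_SH - 7 * pmc
				       : MMCR1_PMC3SEL_SH - 5 * (pmc - 2);
		printf("PMC%d: MMCR%d, shift %d\n", pmc + 1, reg, shift);
	}
	return 0;
}
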
diff --git a/arch/powerpc/kernel/vmlinux.lds.S b/arch/powerpc/kernel/vmlinux.lds.S
index 161b9b9..295ccc5 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -184,6 +184,7 @@ SECTIONS
. = ALIGN(PAGE_SIZE);
.data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) {
__per_cpu_start = .;
+ *(.data.percpu.page_aligned)
*(.data.percpu)
*(.data.percpu.shared_aligned)
__per_cpu_end = .;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index e868b5c..dc0f3c9 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -1,6 +1,7 @@
config PPC64
bool "64-bit kernel"
default n
+ select HAVE_PERF_COUNTERS
help
This option selects whether a 32-bit or a 64-bit kernel
will be built.
diff --git a/arch/powerpc/platforms/pseries/xics.c b/arch/powerpc/platforms/pseries/xics.c
index 84e058f..80b5134 100644
--- a/arch/powerpc/platforms/pseries/xics.c
+++ b/arch/powerpc/platforms/pseries/xics.c
@@ -153,9 +153,10 @@ static int get_irq_server(unsigned int virq, unsigned int strict_check)
{
int server;
/* For the moment only implement delivery to all cpus or one cpu */
- cpumask_t cpumask = irq_desc[virq].affinity;
+ cpumask_t cpumask;
cpumask_t tmp = CPU_MASK_NONE;

+ cpumask_copy(&cpumask, irq_desc[virq].affinity);
if (!distribute_irqs)
return default_server;

@@ -869,7 +870,7 @@ void xics_migrate_irqs_away(void)
virq, cpu);

/* Reset affinity to all cpus */
- irq_desc[virq].affinity = CPU_MASK_ALL;
+ cpumask_setall(irq_desc[virq].affinity);
desc->chip->set_affinity(virq, cpu_all_mask);
unlock:
spin_unlock_irqrestore(&desc->lock, flags);
diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index a35297d..532e205 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -566,9 +566,10 @@ static void __init mpic_scan_ht_pics(struct mpic *mpic)
#ifdef CONFIG_SMP
static int irq_choose_cpu(unsigned int virt_irq)
{
- cpumask_t mask = irq_desc[virt_irq].affinity;
+ cpumask_t mask;
int cpuid;

+ cpumask_copy(&mask, irq_desc[virt_irq].affinity);
if (cpus_equal(mask, CPU_MASK_ALL)) {
static int irq_rover;
static DEFINE_SPINLOCK(irq_rover_lock);
diff --git a/arch/sparc/kernel/irq_64.c b/arch/sparc/kernel/irq_64.c
index cab8e02..4ac5c65 100644
--- a/arch/sparc/kernel/irq_64.c
+++ b/arch/sparc/kernel/irq_64.c
@@ -247,9 +247,10 @@ struct irq_handler_data {
#ifdef CONFIG_SMP
static int irq_choose_cpu(unsigned int virt_irq)
{
- cpumask_t mask = irq_desc[virt_irq].affinity;
+ cpumask_t mask;
int cpuid;

+ cpumask_copy(&mask, irq_desc[virt_irq].affinity);
if (cpus_equal(mask, CPU_MASK_ALL)) {
static int irq_rover;
static DEFINE_SPINLOCK(irq_rover_lock);
@@ -854,7 +855,7 @@ void fixup_irqs(void)
!(irq_desc[irq].status & IRQ_PER_CPU)) {
if (irq_desc[irq].chip->set_affinity)
irq_desc[irq].chip->set_affinity(irq,
- &irq_desc[irq].affinity);
+ irq_desc[irq].affinity);
}
spin_unlock_irqrestore(&irq_desc[irq].lock, flags);
}
diff --git a/arch/sparc/kernel/time_64.c b/arch/sparc/kernel/time_64.c
index 2db3c22..db310aa 100644
--- a/arch/sparc/kernel/time_64.c
+++ b/arch/sparc/kernel/time_64.c
@@ -729,7 +729,7 @@ void timer_interrupt(int irq, struct pt_regs *regs)

irq_enter();

- kstat_this_cpu.irqs[0]++;
+ kstat_incr_irqs_this_cpu(0, irq_to_desc(0));

if (unlikely(!evt->event_handler)) {
printk(KERN_WARNING
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 73f7fe8..1f48445 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -685,6 +685,7 @@ config X86_UP_IOAPIC
config X86_LOCAL_APIC
def_bool y
depends on X86_64 || (X86_32 && (X86_UP_APIC || (SMP && !X86_VOYAGER) || X86_GENERICARCH))
+ select HAVE_PERF_COUNTERS if (!M386 && !M486)

config X86_IO_APIC
def_bool y
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 256b00b..01e7c4c 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -112,8 +112,8 @@ ENTRY(ia32_sysenter_target)
CFI_DEF_CFA rsp,0
CFI_REGISTER rsp,rbp
SWAPGS_UNSAFE_STACK
- movq %gs:pda_kernelstack, %rsp
- addq $(PDA_STACKOFFSET),%rsp
+ movq PER_CPU_VAR(kernel_stack), %rsp
+ addq $(KERNEL_STACK_OFFSET),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs, here we enable it straight after entry:
@@ -273,13 +273,13 @@ ENDPROC(ia32_sysenter_target)
ENTRY(ia32_cstar_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,PDA_STACKOFFSET
+ CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
movl %esp,%r8d
CFI_REGISTER rsp,r8
- movq %gs:pda_kernelstack,%rsp
+ movq PER_CPU_VAR(kernel_stack),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
@@ -823,7 +823,8 @@ ia32_sys_call_table:
.quad compat_sys_signalfd4
.quad sys_eventfd2
.quad sys_epoll_create1
- .quad sys_dup3 /* 330 */
+ .quad sys_dup3 /* 330 */
.quad sys_pipe2
.quad sys_inotify_init1
+ .quad sys_perf_counter_open
ia32_syscall_end:
diff --git a/arch/x86/include/asm/apicnum.h b/arch/x86/include/asm/apicnum.h
new file mode 100644
index 0000000..82f613c
--- /dev/null
+++ b/arch/x86/include/asm/apicnum.h
@@ -0,0 +1,12 @@
+#ifndef _ASM_X86_APICNUM_H
+#define _ASM_X86_APICNUM_H
+
+/* define MAX_IO_APICS */
+#ifdef CONFIG_X86_32
+# define MAX_IO_APICS 64
+#else
+# define MAX_IO_APICS 128
+# define MAX_LOCAL_APIC 32768
+#endif
+
+#endif /* _ASM_X86_APICNUM_H */
diff --git a/arch/x86/include/asm/atomic_32.h b/arch/x86/include/asm/atomic_32.h
index 85b46fb..977250e 100644
--- a/arch/x86/include/asm/atomic_32.h
+++ b/arch/x86/include/asm/atomic_32.h
@@ -247,5 +247,223 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u)
#define smp_mb__before_atomic_inc() barrier()
#define smp_mb__after_atomic_inc() barrier()

+/* A 64-bit atomic type */
+
+typedef struct {
+ unsigned long long counter;
+} atomic64_t;
+
+#define ATOMIC64_INIT(val) { (val) }
+
+/**
+ * __atomic64_read - read atomic64 variable without atomicity guarantees
+ * @ptr: pointer to type atomic64_t
+ *
+ * Reads the value of @ptr as a plain (possibly torn) load;
+ * atomic64_read() below uses it inside its cmpxchg8b retry loop.
+ */
+#define __atomic64_read(ptr) ((ptr)->counter)
+
+static inline unsigned long long
+cmpxchg8b(unsigned long long *ptr, unsigned long long old, unsigned long long new)
+{
+ asm volatile(
+
+ LOCK_PREFIX "cmpxchg8b (%[ptr])\n"
+
+ : "=A" (old)
+
+ : [ptr] "D" (ptr),
+ "A" (old),
+ "b" (ll_low(new)),
+ "c" (ll_high(new))
+
+ : "memory");
+
+ return old;
+}
+
+static inline unsigned long long
+atomic64_cmpxchg(atomic64_t *ptr, unsigned long long old_val,
+ unsigned long long new_val)
+{
+ return cmpxchg8b(&ptr->counter, old_val, new_val);
+}
+
+/**
+ * atomic64_set - set atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ * @new_val: value to assign
+ *
+ * Atomically sets the value of @ptr to @new_val.
+ */
+static inline void atomic64_set(atomic64_t *ptr, unsigned long long new_val)
+{
+ unsigned long long old_val;
+
+ do {
+ old_val = atomic_read(ptr);
+ } while (atomic64_cmpxchg(ptr, old_val, new_val) != old_val);
+}
+
+/**
+ * atomic64_read - read atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically reads the value of @ptr and returns it.
+ */
+static inline unsigned long long atomic64_read(atomic64_t *ptr)
+{
+ unsigned long long curr_val;
+
+ do {
+ curr_val = __atomic64_read(ptr);
+ } while (atomic64_cmpxchg(ptr, curr_val, curr_val) != curr_val);
+
+ return curr_val;
+}
+
+/**
+ * atomic64_add_return - add and return
+ * @delta: integer value to add
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically adds @delta to @ptr and returns @delta + *@ptr
+ */
+static inline unsigned long long
+atomic64_add_return(unsigned long long delta, atomic64_t *ptr)
+{
+ unsigned long long old_val, new_val;
+
+ do {
+ old_val = atomic_read(ptr);
+ new_val = old_val + delta;
+
+ } while (atomic64_cmpxchg(ptr, old_val, new_val) != old_val);
+
+ return new_val;
+}
+
+static inline long atomic64_sub_return(unsigned long long delta, atomic64_t *ptr)
+{
+ return atomic64_add_return(-delta, ptr);
+}
+
+static inline long atomic64_inc_return(atomic64_t *ptr)
+{
+ return atomic64_add_return(1, ptr);
+}
+
+static inline long atomic64_dec_return(atomic64_t *ptr)
+{
+ return atomic64_sub_return(1, ptr);
+}
+
+/**
+ * atomic64_add - add integer to atomic64 variable
+ * @delta: integer value to add
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically adds @delta to @ptr.
+ */
+static inline void atomic64_add(unsigned long long delta, atomic64_t *ptr)
+{
+ atomic64_add_return(delta, ptr);
+}
+
+/**
+ * atomic64_sub - subtract integer from atomic64 variable
+ * @delta: integer value to subtract
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically subtracts @delta from @ptr.
+ */
+static inline void atomic64_sub(unsigned long long delta, atomic64_t *ptr)
+{
+ atomic64_add(-delta, ptr);
+}
+
+/**
+ * atomic64_sub_and_test - subtract value from variable and test result
+ * @delta: integer value to subtract
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically subtracts @delta from @ptr and returns
+ * true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int
+atomic64_sub_and_test(unsigned long long delta, atomic64_t *ptr)
+{
+ unsigned long long old_val = atomic64_sub_return(delta, ptr);
+
+ return old_val == 0;
+}
+
+/**
+ * atomic64_inc - increment atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically increments @ptr by 1.
+ */
+static inline void atomic64_inc(atomic64_t *ptr)
+{
+ atomic64_add(1, ptr);
+}
+
+/**
+ * atomic64_dec - decrement atomic64 variable
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically decrements @ptr by 1.
+ */
+static inline void atomic64_dec(atomic64_t *ptr)
+{
+ atomic64_sub(1, ptr);
+}
+
+/**
+ * atomic64_dec_and_test - decrement and test
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically decrements @ptr by 1 and
+ * returns true if the result is 0, or false for all other
+ * cases.
+ */
+static inline int atomic64_dec_and_test(atomic64_t *ptr)
+{
+ return atomic64_sub_and_test(1, ptr);
+}
+
+/**
+ * atomic64_inc_and_test - increment and test
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically increments @ptr by 1
+ * and returns true if the result is zero, or false for all
+ * other cases.
+ */
+static inline int atomic64_inc_and_test(atomic64_t *ptr)
+{
+ return atomic64_sub_and_test(-1, ptr);
+}
+
+/**
+ * atomic64_add_negative - add and test if negative
+ * @delta: integer value to add
+ * @ptr: pointer to type atomic64_t
+ *
+ * Atomically adds @delta to @ptr and returns true
+ * if the result is negative, or false when
+ * result is greater than or equal to zero.
+ */
+static inline int
+atomic64_add_negative(unsigned long long delta, atomic64_t *ptr)
+{
+ long long old_val = atomic64_add_return(delta, ptr);
+
+ return old_val < 0;
+}
+
#include <asm-generic/atomic.h>
#endif /* _ASM_X86_ATOMIC_32_H */
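
The atomic64_t helpers added above are all built on the same
compare-and-swap retry loop. Here is a user-space sketch of that
pattern (not the kernel code itself), assuming a GCC-style compiler
where __sync_val_compare_and_swap stands in for the cmpxchg8b wrapper
and the my_* names are purely illustrative:

#include <stdio.h>

typedef struct { unsigned long long counter; } my_atomic64_t;

static unsigned long long
my_atomic64_add_return(unsigned long long delta, my_atomic64_t *ptr)
{
	unsigned long long old_val, new_val;

	do {
		old_val = ptr->counter;	/* may race; the CAS below catches that */
		new_val = old_val + delta;
	} while (__sync_val_compare_and_swap(&ptr->counter,
					     old_val, new_val) != old_val);

	return new_val;
}

int main(void)
{
	my_atomic64_t v = { 40 };

	printf("%llu\n", my_atomic64_add_return(2, &v));	/* prints 42 */
	return 0;
}
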
diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index e02a359..02b47a6 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -3,6 +3,9 @@

/*
* Copyright 1992, Linus Torvalds.
+ *
+ * Note: inlines with more than a single statement should be marked
+ * __always_inline to avoid problems with older gcc's inlining heuristics.
*/

#ifndef _LINUX_BITOPS_H
@@ -53,7 +56,8 @@
* Note that @nr may be almost arbitrarily large; this function is not
* restricted to acting on a single-word quantity.
*/
-static inline void set_bit(unsigned int nr, volatile unsigned long *addr)
+static __always_inline void
+set_bit(unsigned int nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "orb %1,%0"
@@ -90,7 +94,8 @@ static inline void __set_bit(int nr, volatile unsigned long *addr)
* you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit()
* in order to ensure changes are visible on other processors.
*/
-static inline void clear_bit(int nr, volatile unsigned long *addr)
+static __always_inline void
+clear_bit(int nr, volatile unsigned long *addr)
{
if (IS_IMMEDIATE(nr)) {
asm volatile(LOCK_PREFIX "andb %1,%0"
@@ -204,7 +209,8 @@ static inline int test_and_set_bit(int nr, volatile unsigned long *addr)
*
* This is the same as test_and_set_bit on x86.
*/
-static inline int test_and_set_bit_lock(int nr, volatile unsigned long *addr)
+static __always_inline int
+test_and_set_bit_lock(int nr, volatile unsigned long *addr)
{
return test_and_set_bit(nr, addr);
}
@@ -300,7 +306,7 @@ static inline int test_and_change_bit(int nr, volatile unsigned long *addr)
return oldbit;
}

-static inline int constant_test_bit(unsigned int nr, const volatile unsigned long *addr)
+static __always_inline int constant_test_bit(unsigned int nr, const volatile unsigned long *addr)
{
return ((1UL << (nr % BITS_PER_LONG)) &
(((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index bae482d..f03b23e 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -7,6 +7,20 @@
#include <linux/nodemask.h>
#include <linux/percpu.h>

+#ifdef CONFIG_SMP
+
+extern void prefill_possible_map(void);
+
+#else /* CONFIG_SMP */
+
+static inline void prefill_possible_map(void) {}
+
+#define cpu_physical_id(cpu) boot_cpu_physical_apicid
+#define safe_smp_processor_id() 0
+#define stack_smp_processor_id() 0
+
+#endif /* CONFIG_SMP */
+
struct x86_cpu {
struct cpu cpu;
};
@@ -17,4 +31,11 @@ extern void arch_unregister_cpu(int);
#endif

DECLARE_PER_CPU(int, cpu_state);
+
+#ifdef CONFIG_X86_HAS_BOOT_CPU_ID
+extern unsigned char boot_cpu_id;
+#else
+#define boot_cpu_id 0
+#endif
+
#endif /* _ASM_X86_CPU_H */
diff --git a/arch/x86/include/asm/cpumask.h b/arch/x86/include/asm/cpumask.h
new file mode 100644
index 0000000..26c6dad
--- /dev/null
+++ b/arch/x86/include/asm/cpumask.h
@@ -0,0 +1,28 @@
+#ifndef _ASM_X86_CPUMASK_H
+#define _ASM_X86_CPUMASK_H
+#ifndef __ASSEMBLY__
+#include <linux/cpumask.h>
+
+#ifdef CONFIG_X86_64
+
+extern cpumask_var_t cpu_callin_mask;
+extern cpumask_var_t cpu_callout_mask;
+extern cpumask_var_t cpu_initialized_mask;
+extern cpumask_var_t cpu_sibling_setup_mask;
+
+#else /* CONFIG_X86_32 */
+
+extern cpumask_t cpu_callin_map;
+extern cpumask_t cpu_callout_map;
+extern cpumask_t cpu_initialized;
+extern cpumask_t cpu_sibling_setup_map;
+
+#define cpu_callin_mask ((struct cpumask *)&cpu_callin_map)
+#define cpu_callout_mask ((struct cpumask *)&cpu_callout_map)
+#define cpu_initialized_mask ((struct cpumask *)&cpu_initialized)
+#define cpu_sibling_setup_mask ((struct cpumask *)&cpu_sibling_setup_map)
+
+#endif /* CONFIG_X86_32 */
+
+#endif /* __ASSEMBLY__ */
+#endif /* _ASM_X86_CPUMASK_H */
diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 0930b4f..c68c361 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -1,39 +1,21 @@
#ifndef _ASM_X86_CURRENT_H
#define _ASM_X86_CURRENT_H

-#ifdef CONFIG_X86_32
#include <linux/compiler.h>
#include <asm/percpu.h>

+#ifndef __ASSEMBLY__
struct task_struct;

DECLARE_PER_CPU(struct task_struct *, current_task);
-static __always_inline struct task_struct *get_current(void)
-{
- return x86_read_percpu(current_task);
-}
-
-#else /* X86_32 */
-
-#ifndef __ASSEMBLY__
-#include <asm/pda.h>
-
-struct task_struct;

static __always_inline struct task_struct *get_current(void)
{
- return read_pda(pcurrent);
+ return percpu_read(current_task);
}

-#else /* __ASSEMBLY__ */
-
-#include <asm/asm-offsets.h>
-#define GET_CURRENT(reg) movq %gs:(pda_pcurrent),reg
+#define current get_current()

#endif /* __ASSEMBLY__ */

-#endif /* X86_32 */
-
-#define current get_current()
-
#endif /* _ASM_X86_CURRENT_H */
diff --git a/arch/x86/include/asm/hardirq_32.h b/arch/x86/include/asm/hardirq_32.h
index cf7954d..7838276 100644
--- a/arch/x86/include/asm/hardirq_32.h
+++ b/arch/x86/include/asm/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
unsigned long idle_timestamp;
unsigned int __nmi_count; /* arch dependent */
unsigned int apic_timer_irqs; /* arch dependent */
+ unsigned int apic_perf_irqs; /* arch dependent */
unsigned int irq0_irqs;
unsigned int irq_resched_count;
unsigned int irq_call_count;
@@ -19,6 +20,9 @@ typedef struct {

DECLARE_PER_CPU(irq_cpustat_t, irq_stat);

+/* We can have at most NR_VECTORS irqs routed to a cpu at a time */
+#define MAX_HARDIRQS_PER_CPU NR_VECTORS
+
#define __ARCH_IRQ_STAT
#define __IRQ_STAT(cpu, member) (per_cpu(irq_stat, cpu).member)

diff --git a/arch/x86/include/asm/hardirq_64.h b/arch/x86/include/asm/hardirq_64.h
index b5a6b5d..42930b2 100644
--- a/arch/x86/include/asm/hardirq_64.h
+++ b/arch/x86/include/asm/hardirq_64.h
@@ -3,22 +3,37 @@

#include <linux/threads.h>
#include <linux/irq.h>
-#include <asm/pda.h>
#include <asm/apic.h>

+typedef struct {
+ unsigned int __softirq_pending;
+ unsigned int __nmi_count; /* arch dependent */
+ unsigned int apic_timer_irqs; /* arch dependent */
+ unsigned int apic_perf_irqs; /* arch dependent */
+ unsigned int irq0_irqs;
+ unsigned int irq_resched_count;
+ unsigned int irq_call_count;
+ unsigned int irq_tlb_count;
+ unsigned int irq_thermal_count;
+ unsigned int irq_spurious_count;
+ unsigned int irq_threshold_count;
+} ____cacheline_aligned irq_cpustat_t;
+
+DECLARE_PER_CPU(irq_cpustat_t, irq_stat);
+
/* We can have at most NR_VECTORS irqs routed to a cpu at a time */
#define MAX_HARDIRQS_PER_CPU NR_VECTORS

#define __ARCH_IRQ_STAT 1

-#define inc_irq_stat(member) add_pda(member, 1)
+#define inc_irq_stat(member) percpu_add(irq_stat.member, 1)

-#define local_softirq_pending() read_pda(__softirq_pending)
+#define local_softirq_pending() percpu_read(irq_stat.__softirq_pending)

#define __ARCH_SET_SOFTIRQ_PENDING 1

-#define set_softirq_pending(x) write_pda(__softirq_pending, (x))
-#define or_softirq_pending(x) or_pda(__softirq_pending, (x))
+#define set_softirq_pending(x) percpu_write(irq_stat.__softirq_pending, (x))
+#define or_softirq_pending(x) percpu_or(irq_stat.__softirq_pending, (x))

extern void ack_bad_irq(unsigned int irq);

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 8de644b..aa93e53 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -30,6 +30,8 @@
/* Interrupt handlers registered during init_IRQ */
extern void apic_timer_interrupt(void);
extern void error_interrupt(void);
+extern void perf_counter_interrupt(void);
+
extern void spurious_interrupt(void);
extern void thermal_interrupt(void);
extern void reschedule_interrupt(void);
diff --git a/arch/x86/include/asm/intel_arch_perfmon.h b/arch/x86/include/asm/intel_arch_perfmon.h
deleted file mode 100644
index fa0fd06..0000000
--- a/arch/x86/include/asm/intel_arch_perfmon.h
+++ /dev/null
@@ -1,31 +0,0 @@
-#ifndef _ASM_X86_INTEL_ARCH_PERFMON_H
-#define _ASM_X86_INTEL_ARCH_PERFMON_H
-
-#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
-#define MSR_ARCH_PERFMON_PERFCTR1 0xc2
-
-#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
-#define MSR_ARCH_PERFMON_EVENTSEL1 0x187
-
-#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
-#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
-#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
-#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)
-
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL (0x3c)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX (0)
-#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
- (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
-
-union cpuid10_eax {
- struct {
- unsigned int version_id:8;
- unsigned int num_counters:8;
- unsigned int bit_width:8;
- unsigned int mask_length:8;
- } split;
- unsigned int full;
-};
-
-#endif /* _ASM_X86_INTEL_ARCH_PERFMON_H */
diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index 7a1f44a..08ec793 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -114,38 +114,16 @@ struct IR_IO_APIC_route_entry {
extern int nr_ioapics;
extern int nr_ioapic_registers[MAX_IO_APICS];

-/*
- * MP-BIOS irq configuration table structures:
- */
-
#define MP_MAX_IOAPIC_PIN 127

-struct mp_config_ioapic {
- unsigned long mp_apicaddr;
- unsigned int mp_apicid;
- unsigned char mp_type;
- unsigned char mp_apicver;
- unsigned char mp_flags;
-};
-
-struct mp_config_intsrc {
- unsigned int mp_dstapic;
- unsigned char mp_type;
- unsigned char mp_irqtype;
- unsigned short mp_irqflag;
- unsigned char mp_srcbus;
- unsigned char mp_srcbusirq;
- unsigned char mp_dstirq;
-};
-
/* I/O APIC entries */
-extern struct mp_config_ioapic mp_ioapics[MAX_IO_APICS];
+extern struct mpc_ioapic mp_ioapics[MAX_IO_APICS];

/* # of MP IRQ source entries */
extern int mp_irq_entries;

/* MP IRQ source entries */
-extern struct mp_config_intsrc mp_irqs[MAX_IRQ_SOURCES];
+extern struct mpc_intsrc mp_irqs[MAX_IRQ_SOURCES];

/* non-0 if default (table-less) MP configuration */
extern int mpc_default_type;
diff --git a/arch/x86/include/asm/irq_regs_32.h b/arch/x86/include/asm/irq_regs_32.h
index 86afd74..d7ed33e 100644
--- a/arch/x86/include/asm/irq_regs_32.h
+++ b/arch/x86/include/asm/irq_regs_32.h
@@ -15,7 +15,7 @@ DECLARE_PER_CPU(struct pt_regs *, irq_regs);

static inline struct pt_regs *get_irq_regs(void)
{
- return x86_read_percpu(irq_regs);
+ return percpu_read(irq_regs);
}

static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
@@ -23,7 +23,7 @@ static inline struct pt_regs *set_irq_regs(struct pt_regs *new_regs)
struct pt_regs *old_regs;

old_regs = get_irq_regs();
- x86_write_percpu(irq_regs, new_regs);
+ percpu_write(irq_regs, new_regs);

return old_regs;
}
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index f7ff650..1554d02 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -87,6 +87,11 @@
#define LOCAL_TIMER_VECTOR 0xef

/*
+ * Performance monitoring interrupt vector:
+ */
+#define LOCAL_PERF_VECTOR 0xee
+
+/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
@@ -105,6 +110,8 @@

#if defined(CONFIG_X86_IO_APIC) && !defined(CONFIG_X86_VOYAGER)

+#include <asm/apicnum.h> /* need MAX_IO_APICS */
+
#ifndef CONFIG_SPARSE_IRQ
# if NR_CPUS < MAX_IO_APICS
# define NR_IRQS (NR_VECTORS + (32 * NR_CPUS))
@@ -112,11 +119,12 @@
# define NR_IRQS (NR_VECTORS + (32 * MAX_IO_APICS))
# endif
#else
-# if (8 * NR_CPUS) > (32 * MAX_IO_APICS)
-# define NR_IRQS (NR_VECTORS + (8 * NR_CPUS))
-# else
-# define NR_IRQS (NR_VECTORS + (32 * MAX_IO_APICS))
-# endif
+
+# define NR_IRQS \
+ ((8 * NR_CPUS) > (32 * MAX_IO_APICS) ? \
+ (NR_VECTORS + (8 * NR_CPUS)) : \
+ (NR_VECTORS + (32 * MAX_IO_APICS))) \
+
#endif

#elif defined(CONFIG_X86_VOYAGER)
diff --git a/arch/x86/include/asm/mach-default/entry_arch.h b/arch/x86/include/asm/mach-default/entry_arch.h
index 6b1add8..ad31e5d 100644
--- a/arch/x86/include/asm/mach-default/entry_arch.h
+++ b/arch/x86/include/asm/mach-default/entry_arch.h
@@ -25,10 +25,15 @@ BUILD_INTERRUPT(irq_move_cleanup_interrupt,IRQ_MOVE_CLEANUP_VECTOR)
* a much simpler SMP time architecture:
*/
#ifdef CONFIG_X86_LOCAL_APIC
+
BUILD_INTERRUPT(apic_timer_interrupt,LOCAL_TIMER_VECTOR)
BUILD_INTERRUPT(error_interrupt,ERROR_APIC_VECTOR)
BUILD_INTERRUPT(spurious_interrupt,SPURIOUS_APIC_VECTOR)

+#ifdef CONFIG_PERF_COUNTERS
+BUILD_INTERRUPT(perf_counter_interrupt, LOCAL_PERF_VECTOR)
+#endif
+
#ifdef CONFIG_X86_MCE_P4THERMAL
BUILD_INTERRUPT(thermal_interrupt,THERMAL_APIC_VECTOR)
#endif
diff --git a/arch/x86/include/asm/mmu_context_32.h b/arch/x86/include/asm/mmu_context_32.h
index 7e98ce1..08b5345 100644
--- a/arch/x86/include/asm/mmu_context_32.h
+++ b/arch/x86/include/asm/mmu_context_32.h
@@ -4,8 +4,8 @@
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
#ifdef CONFIG_SMP
- if (x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK)
- x86_write_percpu(cpu_tlbstate.state, TLBSTATE_LAZY);
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
+ percpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
#endif
}

@@ -19,8 +19,8 @@ static inline void switch_mm(struct mm_struct *prev,
/* stop flush ipis for the previous mm */
cpu_clear(cpu, prev->cpu_vm_mask);
#ifdef CONFIG_SMP
- x86_write_percpu(cpu_tlbstate.state, TLBSTATE_OK);
- x86_write_percpu(cpu_tlbstate.active_mm, next);
+ percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ percpu_write(cpu_tlbstate.active_mm, next);
#endif
cpu_set(cpu, next->cpu_vm_mask);

@@ -35,8 +35,8 @@ static inline void switch_mm(struct mm_struct *prev,
}
#ifdef CONFIG_SMP
else {
- x86_write_percpu(cpu_tlbstate.state, TLBSTATE_OK);
- BUG_ON(x86_read_percpu(cpu_tlbstate.active_mm) != next);
+ percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);

if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
/* We were in lazy tlb mode and leave_mm disabled
diff --git a/arch/x86/include/asm/mmu_context_64.h b/arch/x86/include/asm/mmu_context_64.h
index 677d36e..c457250 100644
--- a/arch/x86/include/asm/mmu_context_64.h
+++ b/arch/x86/include/asm/mmu_context_64.h
@@ -1,13 +1,11 @@
#ifndef _ASM_X86_MMU_CONTEXT_64_H
#define _ASM_X86_MMU_CONTEXT_64_H

-#include <asm/pda.h>
-
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
#ifdef CONFIG_SMP
- if (read_pda(mmu_state) == TLBSTATE_OK)
- write_pda(mmu_state, TLBSTATE_LAZY);
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
+ percpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
#endif
}

@@ -19,8 +17,8 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
/* stop flush ipis for the previous mm */
cpu_clear(cpu, prev->cpu_vm_mask);
#ifdef CONFIG_SMP
- write_pda(mmu_state, TLBSTATE_OK);
- write_pda(active_mm, next);
+ percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ percpu_write(cpu_tlbstate.active_mm, next);
#endif
cpu_set(cpu, next->cpu_vm_mask);
load_cr3(next->pgd);
@@ -30,9 +28,9 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
}
#ifdef CONFIG_SMP
else {
- write_pda(mmu_state, TLBSTATE_OK);
- if (read_pda(active_mm) != next)
- BUG();
+ percpu_write(cpu_tlbstate.state, TLBSTATE_OK);
+ BUG_ON(percpu_read(cpu_tlbstate.active_mm) != next);
+
if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
/* We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3
diff --git a/arch/x86/include/asm/mpspec_def.h b/arch/x86/include/asm/mpspec_def.h
index 59568bc..4a7f96d 100644
--- a/arch/x86/include/asm/mpspec_def.h
+++ b/arch/x86/include/asm/mpspec_def.h
@@ -24,17 +24,18 @@
# endif
#endif

-struct intel_mp_floating {
- char mpf_signature[4]; /* "_MP_" */
- unsigned int mpf_physptr; /* Configuration table address */
- unsigned char mpf_length; /* Our length (paragraphs) */
- unsigned char mpf_specification;/* Specification version */
- unsigned char mpf_checksum; /* Checksum (makes sum 0) */
- unsigned char mpf_feature1; /* Standard or configuration ? */
- unsigned char mpf_feature2; /* Bit7 set for IMCR|PIC */
- unsigned char mpf_feature3; /* Unused (0) */
- unsigned char mpf_feature4; /* Unused (0) */
- unsigned char mpf_feature5; /* Unused (0) */
+/* Intel MP Floating Pointer Structure */
+struct mpf_intel {
+ char signature[4]; /* "_MP_" */
+ unsigned int physptr; /* Configuration table address */
+ unsigned char length; /* Our length (paragraphs) */
+ unsigned char specification; /* Specification version */
+ unsigned char checksum; /* Checksum (makes sum 0) */
+ unsigned char feature1; /* Standard or configuration ? */
+ unsigned char feature2; /* Bit7 set for IMCR|PIC */
+ unsigned char feature3; /* Unused (0) */
+ unsigned char feature4; /* Unused (0) */
+ unsigned char feature5; /* Unused (0) */
};

#define MPC_SIGNATURE "PCMP"
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 5ebca29..e27fdbe 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -13,8 +13,8 @@
#define DEBUG_STACK_ORDER (EXCEPTION_STACK_ORDER + 1)
#define DEBUG_STKSZ (PAGE_SIZE << DEBUG_STACK_ORDER)

-#define IRQSTACK_ORDER 2
-#define IRQSTACKSIZE (PAGE_SIZE << IRQSTACK_ORDER)
+#define IRQ_STACK_ORDER 2
+#define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)

#define STACKFAULT_STACK 1
#define DOUBLEFAULT_STACK 2
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ba3e2ff..c26c6bf 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -244,7 +244,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(unsigned long addr);
- void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
+ void (*flush_tlb_others)(const struct cpumask *cpus,
+ struct mm_struct *mm,
unsigned long va);

/* Hooks for allocating and freeing a pagetable top-level */
@@ -984,10 +985,11 @@ static inline void __flush_tlb_single(unsigned long addr)
PVOP_VCALL1(pv_mmu_ops.flush_tlb_single, addr);
}

-static inline void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
+static inline void flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm,
unsigned long va)
{
- PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, &cpumask, mm, va);
+ PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, cpumask, mm, va);
}

static inline int paravirt_pgd_alloc(struct mm_struct *mm)
diff --git a/arch/x86/include/asm/pda.h b/arch/x86/include/asm/pda.h
index 2fbfff8..c31ca04 100644
--- a/arch/x86/include/asm/pda.h
+++ b/arch/x86/include/asm/pda.h
@@ -5,133 +5,41 @@
#include <linux/stddef.h>
#include <linux/types.h>
#include <linux/cache.h>
+#include <linux/threads.h>
#include <asm/page.h>
+#include <asm/percpu.h>

/* Per processor datastructure. %gs points to it while the kernel runs */
struct x8664_pda {
- struct task_struct *pcurrent; /* 0 Current process */
- unsigned long data_offset; /* 8 Per cpu data offset from linker
- address */
- unsigned long kernelstack; /* 16 top of kernel stack for current */
- unsigned long oldrsp; /* 24 user rsp for system call */
- int irqcount; /* 32 Irq nesting counter. Starts -1 */
- unsigned int cpunumber; /* 36 Logical CPU number */
+ unsigned long unused1;
+ unsigned long unused2;
+ unsigned long unused3;
+ unsigned long unused4;
+ int unused5;
+ unsigned int unused6; /* 36 was cpunumber */
#ifdef CONFIG_CC_STACKPROTECTOR
unsigned long stack_canary; /* 40 stack canary value */
/* gcc-ABI: this canary MUST be at
offset 40!!! */
#endif
- char *irqstackptr;
- short nodenumber; /* number of current node (32k max) */
short in_bootmem; /* pda lives in bootmem */
- unsigned int __softirq_pending;
- unsigned int __nmi_count; /* number of NMI on this CPUs */
- short mmu_state;
- short isidle;
- struct mm_struct *active_mm;
- unsigned apic_timer_irqs;
- unsigned irq0_irqs;
- unsigned irq_resched_count;
- unsigned irq_call_count;
- unsigned irq_tlb_count;
- unsigned irq_thermal_count;
- unsigned irq_threshold_count;
- unsigned irq_spurious_count;
} ____cacheline_aligned_in_smp;

-extern struct x8664_pda **_cpu_pda;
+DECLARE_PER_CPU(struct x8664_pda, __pda);
extern void pda_init(int);

-#define cpu_pda(i) (_cpu_pda[i])
+#define cpu_pda(cpu) (&per_cpu(__pda, cpu))

-/*
- * There is no fast way to get the base address of the PDA, all the accesses
- * have to mention %fs/%gs. So it needs to be done this Torvaldian way.
- */
-extern void __bad_pda_field(void) __attribute__((noreturn));
-
-/*
- * proxy_pda doesn't actually exist, but tell gcc it is accessed for
- * all PDA accesses so it gets read/write dependencies right.
- */
-extern struct x8664_pda _proxy_pda;
-
-#define pda_offset(field) offsetof(struct x8664_pda, field)
-
-#define pda_to_op(op, field, val) \
-do { \
- typedef typeof(_proxy_pda.field) T__; \
- if (0) { T__ tmp__; tmp__ = (val); } /* type checking */ \
- switch (sizeof(_proxy_pda.field)) { \
- case 2: \
- asm(op "w %1,%%gs:%c2" : \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i"(pda_offset(field))); \
- break; \
- case 4: \
- asm(op "l %1,%%gs:%c2" : \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i" (pda_offset(field))); \
- break; \
- case 8: \
- asm(op "q %1,%%gs:%c2": \
- "+m" (_proxy_pda.field) : \
- "ri" ((T__)val), \
- "i"(pda_offset(field))); \
- break; \
- default: \
- __bad_pda_field(); \
- } \
-} while (0)
-
-#define pda_from_op(op, field) \
-({ \
- typeof(_proxy_pda.field) ret__; \
- switch (sizeof(_proxy_pda.field)) { \
- case 2: \
- asm(op "w %%gs:%c1,%0" : \
- "=r" (ret__) : \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- case 4: \
- asm(op "l %%gs:%c1,%0": \
- "=r" (ret__): \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- case 8: \
- asm(op "q %%gs:%c1,%0": \
- "=r" (ret__) : \
- "i" (pda_offset(field)), \
- "m" (_proxy_pda.field)); \
- break; \
- default: \
- __bad_pda_field(); \
- } \
- ret__; \
-})
-
-#define read_pda(field) pda_from_op("mov", field)
-#define write_pda(field, val) pda_to_op("mov", field, val)
-#define add_pda(field, val) pda_to_op("add", field, val)
-#define sub_pda(field, val) pda_to_op("sub", field, val)
-#define or_pda(field, val) pda_to_op("or", field, val)
+#define read_pda(field) percpu_read(__pda.field)
+#define write_pda(field, val) percpu_write(__pda.field, val)
+#define add_pda(field, val) percpu_add(__pda.field, val)
+#define sub_pda(field, val) percpu_sub(__pda.field, val)
+#define or_pda(field, val) percpu_or(__pda.field, val)

/* This is not atomic against other CPUs -- CPU preemption needs to be off */
#define test_and_clear_bit_pda(bit, field) \
-({ \
- int old__; \
- asm volatile("btr %2,%%gs:%c3\n\tsbbl %0,%0" \
- : "=r" (old__), "+m" (_proxy_pda.field) \
- : "dIr" (bit), "i" (pda_offset(field)) : "memory");\
- old__; \
-})
+ x86_test_and_clear_bit_percpu(bit, __pda.field)

#endif

-#define PDA_STACKOFFSET (5*8)
-
#endif /* _ASM_X86_PDA_H */
diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index ece7205..165d527 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -2,53 +2,12 @@
#define _ASM_X86_PERCPU_H

#ifdef CONFIG_X86_64
-#include <linux/compiler.h>
-
-/* Same as asm-generic/percpu.h, except that we store the per cpu offset
- in the PDA. Longer term the PDA and every per cpu variable
- should be just put into a single section and referenced directly
- from %gs */
-
-#ifdef CONFIG_SMP
-#include <asm/pda.h>
-
-#define __per_cpu_offset(cpu) (cpu_pda(cpu)->data_offset)
-#define __my_cpu_offset read_pda(data_offset)
-
-#define per_cpu_offset(x) (__per_cpu_offset(x))
-
+#define __percpu_seg gs
+#define __percpu_mov_op movq
+#else
+#define __percpu_seg fs
+#define __percpu_mov_op movl
#endif
-#include <asm-generic/percpu.h>
-
-DECLARE_PER_CPU(struct x8664_pda, pda);
-
-/*
- * These are supposed to be implemented as a single instruction which
- * operates on the per-cpu data base segment. x86-64 doesn't have
- * that yet, so this is a fairly inefficient workaround for the
- * meantime. The single instruction is atomic with respect to
- * preemption and interrupts, so we need to explicitly disable
- * interrupts here to achieve the same effect. However, because it
- * can be used from within interrupt-disable/enable, we can't actually
- * disable interrupts; disabling preemption is enough.
- */
-#define x86_read_percpu(var) \
- ({ \
- typeof(per_cpu_var(var)) __tmp; \
- preempt_disable(); \
- __tmp = __get_cpu_var(var); \
- preempt_enable(); \
- __tmp; \
- })
-
-#define x86_write_percpu(var, val) \
- do { \
- preempt_disable(); \
- __get_cpu_var(var) = (val); \
- preempt_enable(); \
- } while(0)
-
-#else /* CONFIG_X86_64 */

#ifdef __ASSEMBLY__

@@ -65,47 +24,26 @@ DECLARE_PER_CPU(struct x8664_pda, pda);
* PER_CPU(cpu_gdt_descr, %ebx)
*/
#ifdef CONFIG_SMP
-#define PER_CPU(var, reg) \
- movl %fs:per_cpu__##this_cpu_off, reg; \
+#define PER_CPU(var, reg) \
+ __percpu_mov_op %__percpu_seg:per_cpu__this_cpu_off, reg; \
lea per_cpu__##var(reg), reg
-#define PER_CPU_VAR(var) %fs:per_cpu__##var
+#define PER_CPU_VAR(var) %__percpu_seg:per_cpu__##var
#else /* ! SMP */
-#define PER_CPU(var, reg) \
- movl $per_cpu__##var, reg
+#define PER_CPU(var, reg) \
+ __percpu_mov_op $per_cpu__##var, reg
#define PER_CPU_VAR(var) per_cpu__##var
#endif /* SMP */

#else /* ...!ASSEMBLY */

-/*
- * PER_CPU finds an address of a per-cpu variable.
- *
- * Args:
- * var - variable name
- * cpu - 32bit register containing the current CPU number
- *
- * The resulting address is stored in the "cpu" argument.
- *
- * Example:
- * PER_CPU(cpu_gdt_descr, %ebx)
- */
-#ifdef CONFIG_SMP
-
-#define __my_cpu_offset x86_read_percpu(this_cpu_off)
-
-/* fs segment starts at (positive) offset == __per_cpu_offset[cpu] */
-#define __percpu_seg "%%fs:"
-
-#else /* !SMP */
+#include <linux/stringify.h>

-#define __percpu_seg ""
-
-#endif /* SMP */
-
-#include <asm-generic/percpu.h>
-
-/* We can use this directly for local CPU (faster). */
-DECLARE_PER_CPU(unsigned long, this_cpu_off);
+#ifdef CONFIG_SMP
+#define __percpu_arg(x) "%%"__stringify(__percpu_seg)":%P" #x
+#define __my_cpu_offset percpu_read(this_cpu_off)
+#else
+#define __percpu_arg(x) "%" #x
+#endif

/* For arch-specific code, we can use direct single-insn ops (they
* don't give an lvalue though). */
@@ -120,20 +58,25 @@ do { \
} \
switch (sizeof(var)) { \
case 1: \
- asm(op "b %1,"__percpu_seg"%0" \
+ asm(op "b %1,"__percpu_arg(0) \
: "+m" (var) \
: "ri" ((T__)val)); \
break; \
case 2: \
- asm(op "w %1,"__percpu_seg"%0" \
+ asm(op "w %1,"__percpu_arg(0) \
: "+m" (var) \
: "ri" ((T__)val)); \
break; \
case 4: \
- asm(op "l %1,"__percpu_seg"%0" \
+ asm(op "l %1,"__percpu_arg(0) \
: "+m" (var) \
: "ri" ((T__)val)); \
break; \
+ case 8: \
+ asm(op "q %1,"__percpu_arg(0) \
+ : "+m" (var) \
+ : "r" ((T__)val)); \
+ break; \
default: __bad_percpu_size(); \
} \
} while (0)
@@ -143,17 +86,22 @@ do { \
typeof(var) ret__; \
switch (sizeof(var)) { \
case 1: \
- asm(op "b "__percpu_seg"%1,%0" \
+ asm(op "b "__percpu_arg(1)",%0" \
: "=r" (ret__) \
: "m" (var)); \
break; \
case 2: \
- asm(op "w "__percpu_seg"%1,%0" \
+ asm(op "w "__percpu_arg(1)",%0" \
: "=r" (ret__) \
: "m" (var)); \
break; \
case 4: \
- asm(op "l "__percpu_seg"%1,%0" \
+ asm(op "l "__percpu_arg(1)",%0" \
+ : "=r" (ret__) \
+ : "m" (var)); \
+ break; \
+ case 8: \
+ asm(op "q "__percpu_arg(1)",%0" \
: "=r" (ret__) \
: "m" (var)); \
break; \
@@ -162,13 +110,36 @@ do { \
ret__; \
})

-#define x86_read_percpu(var) percpu_from_op("mov", per_cpu__##var)
-#define x86_write_percpu(var, val) percpu_to_op("mov", per_cpu__##var, val)
-#define x86_add_percpu(var, val) percpu_to_op("add", per_cpu__##var, val)
-#define x86_sub_percpu(var, val) percpu_to_op("sub", per_cpu__##var, val)
-#define x86_or_percpu(var, val) percpu_to_op("or", per_cpu__##var, val)
+#define percpu_read(var) percpu_from_op("mov", per_cpu__##var)
+#define percpu_write(var, val) percpu_to_op("mov", per_cpu__##var, val)
+#define percpu_add(var, val) percpu_to_op("add", per_cpu__##var, val)
+#define percpu_sub(var, val) percpu_to_op("sub", per_cpu__##var, val)
+#define percpu_and(var, val) percpu_to_op("and", per_cpu__##var, val)
+#define percpu_or(var, val) percpu_to_op("or", per_cpu__##var, val)
+#define percpu_xor(var, val) percpu_to_op("xor", per_cpu__##var, val)
+
+/* This is not atomic against other CPUs -- CPU preemption needs to be off */
+#define x86_test_and_clear_bit_percpu(bit, var) \
+({ \
+ int old__; \
+ asm volatile("btr %2,"__percpu_arg(1)"\n\tsbbl %0,%0" \
+ : "=r" (old__), "+m" (per_cpu__##var) \
+ : "dIr" (bit)); \
+ old__; \
+})
+
+#include <asm-generic/percpu.h>
+
+/* We can use this directly for local CPU (faster). */
+DECLARE_PER_CPU(unsigned long, this_cpu_off);
+
+#ifdef CONFIG_X86_64
+extern void load_pda_offset(int cpu);
+#else
+static inline void load_pda_offset(int cpu) { }
+#endif
+
#endif /* !__ASSEMBLY__ */
-#endif /* !CONFIG_X86_64 */

#ifdef CONFIG_SMP

@@ -195,9 +166,9 @@ do { \
#define early_per_cpu_ptr(_name) (_name##_early_ptr)
#define early_per_cpu_map(_name, _idx) (_name##_early_map[_idx])
#define early_per_cpu(_name, _cpu) \
- (early_per_cpu_ptr(_name) ? \
- early_per_cpu_ptr(_name)[_cpu] : \
- per_cpu(_name, _cpu))
+ *(early_per_cpu_ptr(_name) ? \
+ &early_per_cpu_ptr(_name)[_cpu] : \
+ &per_cpu(_name, _cpu))

#else /* !CONFIG_SMP */
#define DEFINE_EARLY_PER_CPU(_type, _name, _initvalue) \
diff --git a/arch/x86/include/asm/perf_counter.h b/arch/x86/include/asm/perf_counter.h
new file mode 100644
index 0000000..2e08ed7
--- /dev/null
+++ b/arch/x86/include/asm/perf_counter.h
@@ -0,0 +1,95 @@
+#ifndef _ASM_X86_PERF_COUNTER_H
+#define _ASM_X86_PERF_COUNTER_H
+
+/*
+ * Performance counter hw details:
+ */
+
+#define X86_PMC_MAX_GENERIC 8
+#define X86_PMC_MAX_FIXED 3
+
+#define X86_PMC_IDX_GENERIC 0
+#define X86_PMC_IDX_FIXED 32
+#define X86_PMC_IDX_MAX 64
+
+#define MSR_ARCH_PERFMON_PERFCTR0 0xc1
+#define MSR_ARCH_PERFMON_PERFCTR1 0xc2
+
+#define MSR_ARCH_PERFMON_EVENTSEL0 0x186
+#define MSR_ARCH_PERFMON_EVENTSEL1 0x187
+
+#define ARCH_PERFMON_EVENTSEL0_ENABLE (1 << 22)
+#define ARCH_PERFMON_EVENTSEL_INT (1 << 20)
+#define ARCH_PERFMON_EVENTSEL_OS (1 << 17)
+#define ARCH_PERFMON_EVENTSEL_USR (1 << 16)
+
+/*
+ * Includes eventsel and unit mask as well:
+ */
+#define ARCH_PERFMON_EVENT_MASK 0xffff
+
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_SEL 0x3c
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_UMASK (0x00 << 8)
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX 0
+#define ARCH_PERFMON_UNHALTED_CORE_CYCLES_PRESENT \
+ (1 << (ARCH_PERFMON_UNHALTED_CORE_CYCLES_INDEX))
+
+#define ARCH_PERFMON_BRANCH_MISSES_RETIRED 6
+
+/*
+ * Intel "Architectural Performance Monitoring" CPUID
+ * detection/enumeration details:
+ */
+union cpuid10_eax {
+ struct {
+ unsigned int version_id:8;
+ unsigned int num_counters:8;
+ unsigned int bit_width:8;
+ unsigned int mask_length:8;
+ } split;
+ unsigned int full;
+};
+
+union cpuid10_edx {
+ struct {
+ unsigned int num_counters_fixed:4;
+ unsigned int reserved:28;
+ } split;
+ unsigned int full;
+};
+
+
+/*
+ * Fixed-purpose performance counters:
+ */
+
+/*
+ * All 3 fixed-mode PMCs are configured via this single MSR:
+ */
+#define MSR_ARCH_PERFMON_FIXED_CTR_CTRL 0x38d
+
+/*
+ * The counts are available in three separate MSRs:
+ */
+
+/* Instr_Retired.Any: */
+#define MSR_ARCH_PERFMON_FIXED_CTR0 0x309
+#define X86_PMC_IDX_FIXED_INSTRUCTIONS (X86_PMC_IDX_FIXED + 0)
+
+/* CPU_CLK_Unhalted.Core: */
+#define MSR_ARCH_PERFMON_FIXED_CTR1 0x30a
+#define X86_PMC_IDX_FIXED_CPU_CYCLES (X86_PMC_IDX_FIXED + 1)
+
+/* CPU_CLK_Unhalted.Ref: */
+#define MSR_ARCH_PERFMON_FIXED_CTR2 0x30b
+#define X86_PMC_IDX_FIXED_BUS_CYCLES (X86_PMC_IDX_FIXED + 2)
+
+#ifdef CONFIG_PERF_COUNTERS
+extern void init_hw_perf_counters(void);
+extern void perf_counters_lapic_init(int nmi);
+#else
+static inline void init_hw_perf_counters(void) { }
+static inline void perf_counters_lapic_init(int nmi) { }
+#endif
+
+#endif /* _ASM_X86_PERF_COUNTER_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 091cd88..f511246 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -378,6 +378,9 @@ union thread_xstate {

#ifdef CONFIG_X86_64
DECLARE_PER_CPU(struct orig_ist, orig_ist);
+
+DECLARE_PER_CPU(char[IRQ_STACK_SIZE], irq_stack);
+DECLARE_PER_CPU(char *, irq_stack_ptr);
#endif

extern void print_cpu_info(struct cpuinfo_x86 *);
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index ebe858c..5369497 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -100,7 +100,6 @@ extern unsigned long init_pg_tables_start;
extern unsigned long init_pg_tables_end;

#else
-void __init x86_64_init_pda(void);
void __init x86_64_start_kernel(char *real_mode);
void __init x86_64_start_reservations(char *real_mode_data);

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 19953df..68636e7 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -17,32 +17,7 @@
#endif
#include <asm/pda.h>
#include <asm/thread_info.h>
-
-#ifdef CONFIG_X86_64
-
-extern cpumask_var_t cpu_callin_mask;
-extern cpumask_var_t cpu_callout_mask;
-extern cpumask_var_t cpu_initialized_mask;
-extern cpumask_var_t cpu_sibling_setup_mask;
-
-#else /* CONFIG_X86_32 */
-
-extern cpumask_t cpu_callin_map;
-extern cpumask_t cpu_callout_map;
-extern cpumask_t cpu_initialized;
-extern cpumask_t cpu_sibling_setup_map;
-
-#define cpu_callin_mask ((struct cpumask *)&cpu_callin_map)
-#define cpu_callout_mask ((struct cpumask *)&cpu_callout_map)
-#define cpu_initialized_mask ((struct cpumask *)&cpu_initialized)
-#define cpu_sibling_setup_mask ((struct cpumask *)&cpu_sibling_setup_map)
-
-#endif /* CONFIG_X86_32 */
-
-extern void (*mtrr_hook)(void);
-extern void zap_low_mappings(void);
-
-extern int __cpuinit get_local_pda(int cpu);
+#include <asm/cpumask.h>

extern int smp_num_siblings;
extern unsigned int num_processors;
@@ -50,9 +25,7 @@ extern unsigned int num_processors;
DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
DECLARE_PER_CPU(cpumask_t, cpu_core_map);
DECLARE_PER_CPU(u16, cpu_llc_id);
-#ifdef CONFIG_X86_32
DECLARE_PER_CPU(int, cpu_number);
-#endif

static inline struct cpumask *cpu_sibling_mask(int cpu)
{
@@ -167,8 +140,6 @@ void play_dead_common(void);
void native_send_call_func_ipi(const struct cpumask *mask);
void native_send_call_func_single_ipi(int cpu);

-extern void prefill_possible_map(void);
-
void smp_store_cpu_info(int id);
#define cpu_physical_id(cpu) per_cpu(x86_cpu_to_apicid, cpu)

@@ -177,10 +148,6 @@ static inline int num_booting_cpus(void)
{
return cpumask_weight(cpu_callout_mask);
}
-#else
-static inline void prefill_possible_map(void)
-{
-}
#endif /* CONFIG_SMP */

extern unsigned disabled_cpus __cpuinitdata;
@@ -191,11 +158,11 @@ extern unsigned disabled_cpus __cpuinitdata;
* from the initial startup. We map APIC_BASE very early in page_setup(),
* so this is correct in the x86 case.
*/
-#define raw_smp_processor_id() (x86_read_percpu(cpu_number))
+#define raw_smp_processor_id() (percpu_read(cpu_number))
extern int safe_smp_processor_id(void);

#elif defined(CONFIG_X86_64_SMP)
-#define raw_smp_processor_id() read_pda(cpunumber)
+#define raw_smp_processor_id() (percpu_read(cpu_number))

#define stack_smp_processor_id() \
({ \
@@ -205,10 +172,6 @@ extern int safe_smp_processor_id(void);
})
#define safe_smp_processor_id() smp_processor_id()

-#else /* !CONFIG_X86_32_SMP && !CONFIG_X86_64_SMP */
-#define cpu_physical_id(cpu) boot_cpu_physical_apicid
-#define safe_smp_processor_id() 0
-#define stack_smp_processor_id() 0
#endif

#ifdef CONFIG_X86_LOCAL_APIC
@@ -251,11 +214,5 @@ static inline int hard_smp_processor_id(void)

#endif /* CONFIG_X86_LOCAL_APIC */

-#ifdef CONFIG_X86_HAS_BOOT_CPU_ID
-extern unsigned char boot_cpu_id;
-#else
-#define boot_cpu_id 0
-#endif
-
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_SMP_H */
diff --git a/arch/x86/include/asm/system.h b/arch/x86/include/asm/system.h
index 8e626ea..d1dc27d 100644
--- a/arch/x86/include/asm/system.h
+++ b/arch/x86/include/asm/system.h
@@ -94,7 +94,7 @@ do { \
"call __switch_to\n\t" \
".globl thread_return\n" \
"thread_return:\n\t" \
- "movq %%gs:%P[pda_pcurrent],%%rsi\n\t" \
+ "movq "__percpu_arg([current_task])",%%rsi\n\t" \
"movq %P[thread_info](%%rsi),%%r8\n\t" \
LOCK_PREFIX "btr %[tif_fork],%P[ti_flags](%%r8)\n\t" \
"movq %%rax,%%rdi\n\t" \
@@ -106,7 +106,7 @@ do { \
[ti_flags] "i" (offsetof(struct thread_info, flags)), \
[tif_fork] "i" (TIF_FORK), \
[thread_info] "i" (offsetof(struct task_struct, stack)), \
- [pda_pcurrent] "i" (offsetof(struct x8664_pda, pcurrent)) \
+ [current_task] "m" (per_cpu_var(current_task)) \
: "memory", "cc" __EXTRA_CLOBBER)
#endif

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 9878964..f384889 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -82,6 +82,7 @@ struct thread_info {
#define TIF_SYSCALL_AUDIT 7 /* syscall auditing active */
#define TIF_SECCOMP 8 /* secure computing */
#define TIF_MCE_NOTIFY 10 /* notify userspace of an MCE */
+#define TIF_PERF_COUNTERS 11 /* notify perf counter work */
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* 32bit process */
#define TIF_FORK 18 /* ret_from_fork */
@@ -104,6 +105,7 @@ struct thread_info {
#define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
#define _TIF_SECCOMP (1 << TIF_SECCOMP)
#define _TIF_MCE_NOTIFY (1 << TIF_MCE_NOTIFY)
+#define _TIF_PERF_COUNTERS (1 << TIF_PERF_COUNTERS)
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_FORK (1 << TIF_FORK)
@@ -135,7 +137,7 @@ struct thread_info {

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
- (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_NOTIFY_RESUME)
+ (_TIF_SIGPENDING|_TIF_MCE_NOTIFY|_TIF_PERF_COUNTERS|_TIF_NOTIFY_RESUME)

/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW \
@@ -194,25 +196,21 @@ static inline struct thread_info *current_thread_info(void)

#else /* X86_32 */

-#include <asm/pda.h>
+#include <asm/percpu.h>
+#define KERNEL_STACK_OFFSET (5*8)

/*
* macros/functions for gaining access to the thread information structure
* preempt_count needs to be 1 initially, until the scheduler is functional.
*/
#ifndef __ASSEMBLY__
-static inline struct thread_info *current_thread_info(void)
-{
- struct thread_info *ti;
- ti = (void *)(read_pda(kernelstack) + PDA_STACKOFFSET - THREAD_SIZE);
- return ti;
-}
+DECLARE_PER_CPU(unsigned long, kernel_stack);

-/* do not use in interrupt context */
-static inline struct thread_info *stack_thread_info(void)
+static inline struct thread_info *current_thread_info(void)
{
struct thread_info *ti;
- asm("andq %%rsp,%0; " : "=r" (ti) : "0" (~(THREAD_SIZE - 1)));
+ ti = (void *)(percpu_read(kernel_stack) +
+ KERNEL_STACK_OFFSET - THREAD_SIZE);
return ti;
}

@@ -220,8 +218,8 @@ static inline struct thread_info *stack_thread_info(void)

/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg) \
- movq %gs:pda_kernelstack,reg ; \
- subq $(THREAD_SIZE-PDA_STACKOFFSET),reg
+ movq PER_CPU_VAR(kernel_stack),reg ; \
+ subq $(THREAD_SIZE-KERNEL_STACK_OFFSET),reg

#endif

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0e7bbb5..d3539f9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -113,7 +113,7 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
__flush_tlb();
}

-static inline void native_flush_tlb_others(const cpumask_t *cpumask,
+static inline void native_flush_tlb_others(const struct cpumask *cpumask,
struct mm_struct *mm,
unsigned long va)
{
@@ -142,31 +142,28 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
flush_tlb_mm(vma->vm_mm);
}

-void native_flush_tlb_others(const cpumask_t *cpumask, struct mm_struct *mm,
- unsigned long va);
+void native_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va);

#define TLBSTATE_OK 1
#define TLBSTATE_LAZY 2

-#ifdef CONFIG_X86_32
struct tlb_state {
struct mm_struct *active_mm;
int state;
- char __cacheline_padding[L1_CACHE_BYTES-8];
};
DECLARE_PER_CPU(struct tlb_state, cpu_tlbstate);

-void reset_lazy_tlbstate(void);
-#else
static inline void reset_lazy_tlbstate(void)
{
+ percpu_write(cpu_tlbstate.state, 0);
+ percpu_write(cpu_tlbstate.active_mm, &init_mm);
}
-#endif

#endif /* SMP */

#ifndef CONFIG_PARAVIRT
-#define flush_tlb_others(mask, mm, va) native_flush_tlb_others(&mask, mm, va)
+#define flush_tlb_others(mask, mm, va) native_flush_tlb_others(mask, mm, va)
#endif

static inline void flush_tlb_kernel_range(unsigned long start,
@@ -175,4 +172,6 @@ static inline void flush_tlb_kernel_range(unsigned long start,
flush_tlb_all();
}

+extern void zap_low_mappings(void);
+
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4e2f2e0..ffea1fe 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -83,7 +83,8 @@ extern cpumask_t *node_to_cpumask_map;
DECLARE_EARLY_PER_CPU(int, x86_cpu_to_node_map);

/* Returns the number of the current Node. */
-#define numa_node_id() read_pda(nodenumber)
+DECLARE_PER_CPU(int, node_number);
+#define numa_node_id() percpu_read(node_number)

#ifdef CONFIG_DEBUG_PER_CPU_MAPS
extern int cpu_to_node(int cpu);
@@ -102,10 +103,7 @@ static inline int cpu_to_node(int cpu)
/* Same function but used if called before per_cpu areas are setup */
static inline int early_cpu_to_node(int cpu)
{
- if (early_per_cpu_ptr(x86_cpu_to_node_map))
- return early_per_cpu_ptr(x86_cpu_to_node_map)[cpu];
-
- return per_cpu(x86_cpu_to_node_map, cpu);
+ return early_per_cpu(x86_cpu_to_node_map, cpu);
}

/* Returns a pointer to the cpumask of CPUs on Node 'node'. */
diff --git a/arch/x86/include/asm/trampoline.h b/arch/x86/include/asm/trampoline.h
index 780ba0a..90f06c2 100644
--- a/arch/x86/include/asm/trampoline.h
+++ b/arch/x86/include/asm/trampoline.h
@@ -13,6 +13,7 @@ extern unsigned char *trampoline_base;

extern unsigned long init_rsp;
extern unsigned long initial_code;
+extern unsigned long initial_gs;

#define TRAMPOLINE_SIZE roundup(trampoline_end - trampoline_data, PAGE_SIZE)
#define TRAMPOLINE_BASE 0x6000
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index f2bba78..7e47658 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -338,6 +338,7 @@
#define __NR_dup3 330
#define __NR_pipe2 331
#define __NR_inotify_init1 332
+#define __NR_perf_counter_open 333

#ifdef __KERNEL__

diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index d2e415e..53025fe 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -653,7 +653,8 @@ __SYSCALL(__NR_dup3, sys_dup3)
__SYSCALL(__NR_pipe2, sys_pipe2)
#define __NR_inotify_init1 294
__SYSCALL(__NR_inotify_init1, sys_inotify_init1)
-
+#define __NR_perf_counter_open 295
+__SYSCALL(__NR_perf_counter_open, sys_perf_counter_open)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/uv/uv_bau.h b/arch/x86/include/asm/uv/uv_bau.h
index 50423c7..74e6393 100644
--- a/arch/x86/include/asm/uv/uv_bau.h
+++ b/arch/x86/include/asm/uv/uv_bau.h
@@ -325,7 +325,8 @@ static inline void bau_cpubits_clear(struct bau_local_cpumask *dstp, int nbits)
#define cpubit_isset(cpu, bau_local_cpumask) \
test_bit((cpu), (bau_local_cpumask).bits)

-extern int uv_flush_tlb_others(cpumask_t *, struct mm_struct *, unsigned long);
+extern int uv_flush_tlb_others(struct cpumask *,
+ struct mm_struct *, unsigned long);
extern void uv_bau_message_intr1(void);
extern void uv_bau_timeout_intr1(void);

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index d37593c..4cb5964 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -912,8 +912,8 @@ static u8 __init uniq_ioapic_id(u8 id)
DECLARE_BITMAP(used, 256);
bitmap_zero(used, 256);
for (i = 0; i < nr_ioapics; i++) {
- struct mp_config_ioapic *ia = &mp_ioapics[i];
- __set_bit(ia->mp_apicid, used);
+ struct mpc_ioapic *ia = &mp_ioapics[i];
+ __set_bit(ia->apicid, used);
}
if (!test_bit(id, used))
return id;
@@ -945,47 +945,47 @@ void __init mp_register_ioapic(int id, u32 address, u32 gsi_base)

idx = nr_ioapics;

- mp_ioapics[idx].mp_type = MP_IOAPIC;
- mp_ioapics[idx].mp_flags = MPC_APIC_USABLE;
- mp_ioapics[idx].mp_apicaddr = address;
+ mp_ioapics[idx].type = MP_IOAPIC;
+ mp_ioapics[idx].flags = MPC_APIC_USABLE;
+ mp_ioapics[idx].apicaddr = address;

set_fixmap_nocache(FIX_IO_APIC_BASE_0 + idx, address);
- mp_ioapics[idx].mp_apicid = uniq_ioapic_id(id);
+ mp_ioapics[idx].apicid = uniq_ioapic_id(id);
#ifdef CONFIG_X86_32
- mp_ioapics[idx].mp_apicver = io_apic_get_version(idx);
+ mp_ioapics[idx].apicver = io_apic_get_version(idx);
#else
- mp_ioapics[idx].mp_apicver = 0;
+ mp_ioapics[idx].apicver = 0;
#endif
/*
* Build basic GSI lookup table to facilitate gsi->io_apic lookups
* and to prevent reprogramming of IOAPIC pins (PCI GSIs).
*/
- mp_ioapic_routing[idx].apic_id = mp_ioapics[idx].mp_apicid;
+ mp_ioapic_routing[idx].apic_id = mp_ioapics[idx].apicid;
mp_ioapic_routing[idx].gsi_base = gsi_base;
mp_ioapic_routing[idx].gsi_end = gsi_base +
io_apic_get_redir_entries(idx);

- printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%lx, "
- "GSI %d-%d\n", idx, mp_ioapics[idx].mp_apicid,
- mp_ioapics[idx].mp_apicver, mp_ioapics[idx].mp_apicaddr,
+ printk(KERN_INFO "IOAPIC[%d]: apic_id %d, version %d, address 0x%x, "
+ "GSI %d-%d\n", idx, mp_ioapics[idx].apicid,
+ mp_ioapics[idx].apicver, mp_ioapics[idx].apicaddr,
mp_ioapic_routing[idx].gsi_base, mp_ioapic_routing[idx].gsi_end);

nr_ioapics++;
}

-static void assign_to_mp_irq(struct mp_config_intsrc *m,
- struct mp_config_intsrc *mp_irq)
+static void assign_to_mp_irq(struct mpc_intsrc *m,
+ struct mpc_intsrc *mp_irq)
{
- memcpy(mp_irq, m, sizeof(struct mp_config_intsrc));
+ memcpy(mp_irq, m, sizeof(struct mpc_intsrc));
}

-static int mp_irq_cmp(struct mp_config_intsrc *mp_irq,
- struct mp_config_intsrc *m)
+static int mp_irq_cmp(struct mpc_intsrc *mp_irq,
+ struct mpc_intsrc *m)
{
- return memcmp(mp_irq, m, sizeof(struct mp_config_intsrc));
+ return memcmp(mp_irq, m, sizeof(struct mpc_intsrc));
}

-static void save_mp_irq(struct mp_config_intsrc *m)
+static void save_mp_irq(struct mpc_intsrc *m)
{
int i;

@@ -1003,7 +1003,7 @@ void __init mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger, u32 gsi)
{
int ioapic;
int pin;
- struct mp_config_intsrc mp_irq;
+ struct mpc_intsrc mp_irq;

/*
* Convert 'gsi' to 'ioapic.pin'.
@@ -1021,13 +1021,13 @@ void __init mp_override_legacy_irq(u8 bus_irq, u8 polarity, u8 trigger, u32 gsi)
if ((bus_irq == 0) && (trigger == 3))
trigger = 1;

- mp_irq.mp_type = MP_INTSRC;
- mp_irq.mp_irqtype = mp_INT;
- mp_irq.mp_irqflag = (trigger << 2) | polarity;
- mp_irq.mp_srcbus = MP_ISA_BUS;
- mp_irq.mp_srcbusirq = bus_irq; /* IRQ */
- mp_irq.mp_dstapic = mp_ioapics[ioapic].mp_apicid; /* APIC ID */
- mp_irq.mp_dstirq = pin; /* INTIN# */
+ mp_irq.type = MP_INTSRC;
+ mp_irq.irqtype = mp_INT;
+ mp_irq.irqflag = (trigger << 2) | polarity;
+ mp_irq.srcbus = MP_ISA_BUS;
+ mp_irq.srcbusirq = bus_irq; /* IRQ */
+ mp_irq.dstapic = mp_ioapics[ioapic].apicid; /* APIC ID */
+ mp_irq.dstirq = pin; /* INTIN# */

save_mp_irq(&mp_irq);
}
@@ -1037,7 +1037,7 @@ void __init mp_config_acpi_legacy_irqs(void)
int i;
int ioapic;
unsigned int dstapic;
- struct mp_config_intsrc mp_irq;
+ struct mpc_intsrc mp_irq;

#if defined (CONFIG_MCA) || defined (CONFIG_EISA)
/*
@@ -1062,7 +1062,7 @@ void __init mp_config_acpi_legacy_irqs(void)
ioapic = mp_find_ioapic(0);
if (ioapic < 0)
return;
- dstapic = mp_ioapics[ioapic].mp_apicid;
+ dstapic = mp_ioapics[ioapic].apicid;

/*
* Use the default configuration for the IRQs 0-15. Unless
@@ -1072,16 +1072,14 @@ void __init mp_config_acpi_legacy_irqs(void)
int idx;

for (idx = 0; idx < mp_irq_entries; idx++) {
- struct mp_config_intsrc *irq = mp_irqs + idx;
+ struct mpc_intsrc *irq = mp_irqs + idx;

/* Do we already have a mapping for this ISA IRQ? */
- if (irq->mp_srcbus == MP_ISA_BUS
- && irq->mp_srcbusirq == i)
+ if (irq->srcbus == MP_ISA_BUS && irq->srcbusirq == i)
break;

/* Do we already have a mapping for this IOAPIC pin */
- if (irq->mp_dstapic == dstapic &&
- irq->mp_dstirq == i)
+ if (irq->dstapic == dstapic && irq->dstirq == i)
break;
}

@@ -1090,13 +1088,13 @@ void __init mp_config_acpi_legacy_irqs(void)
continue; /* IRQ already used */
}

- mp_irq.mp_type = MP_INTSRC;
- mp_irq.mp_irqflag = 0; /* Conforming */
- mp_irq.mp_srcbus = MP_ISA_BUS;
- mp_irq.mp_dstapic = dstapic;
- mp_irq.mp_irqtype = mp_INT;
- mp_irq.mp_srcbusirq = i; /* Identity mapped */
- mp_irq.mp_dstirq = i;
+ mp_irq.type = MP_INTSRC;
+ mp_irq.irqflag = 0; /* Conforming */
+ mp_irq.srcbus = MP_ISA_BUS;
+ mp_irq.dstapic = dstapic;
+ mp_irq.irqtype = mp_INT;
+ mp_irq.srcbusirq = i; /* Identity mapped */
+ mp_irq.dstirq = i;

save_mp_irq(&mp_irq);
}
@@ -1207,22 +1205,22 @@ int mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin,
u32 gsi, int triggering, int polarity)
{
#ifdef CONFIG_X86_MPPARSE
- struct mp_config_intsrc mp_irq;
+ struct mpc_intsrc mp_irq;
int ioapic;

if (!acpi_ioapic)
return 0;

/* print the entry should happen on mptable identically */
- mp_irq.mp_type = MP_INTSRC;
- mp_irq.mp_irqtype = mp_INT;
- mp_irq.mp_irqflag = (triggering == ACPI_EDGE_SENSITIVE ? 4 : 0x0c) |
+ mp_irq.type = MP_INTSRC;
+ mp_irq.irqtype = mp_INT;
+ mp_irq.irqflag = (triggering == ACPI_EDGE_SENSITIVE ? 4 : 0x0c) |
(polarity == ACPI_ACTIVE_HIGH ? 1 : 3);
- mp_irq.mp_srcbus = number;
- mp_irq.mp_srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3);
+ mp_irq.srcbus = number;
+ mp_irq.srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3);
ioapic = mp_find_ioapic(gsi);
- mp_irq.mp_dstapic = mp_ioapic_routing[ioapic].apic_id;
- mp_irq.mp_dstirq = gsi - mp_ioapic_routing[ioapic].gsi_base;
+ mp_irq.dstapic = mp_ioapic_routing[ioapic].apic_id;
+ mp_irq.dstirq = gsi - mp_ioapic_routing[ioapic].gsi_base;

save_mp_irq(&mp_irq);
#endif
diff --git a/arch/x86/kernel/acpi/sleep.c b/arch/x86/kernel/acpi/sleep.c
index 707c1f6..4abff45 100644
--- a/arch/x86/kernel/acpi/sleep.c
+++ b/arch/x86/kernel/acpi/sleep.c
@@ -101,6 +101,7 @@ int acpi_save_state_mem(void)
stack_start.sp = temp_stack + sizeof(temp_stack);
early_gdt_descr.address =
(unsigned long)get_cpu_gdt_table(smp_processor_id());
+ initial_gs = per_cpu_offset(smp_processor_id());
#endif
initial_code = (unsigned long)wakeup_long64;
saved_magic = 0x123456789abcdef0;
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 0f830e4..e9af14f 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -35,6 +35,7 @@
#include <linux/nmi.h>
#include <linux/timex.h>

+#include <asm/perf_counter.h>
#include <asm/atomic.h>
#include <asm/mtrr.h>
#include <asm/mpspec.h>
@@ -895,6 +896,10 @@ void disable_local_APIC(void)
{
unsigned int value;

+ /* APIC hasn't been mapped yet */
+ if (!apic_phys)
+ return;
+
clear_local_APIC();

/*
@@ -1126,6 +1131,11 @@ void __cpuinit setup_local_APIC(void)
unsigned int value;
int i, j;

+ if (disable_apic) {
+ disable_ioapic_setup();
+ return;
+ }
+
#ifdef CONFIG_X86_32
/* Pound the ESR really hard over the head with a big hammer - mbligh */
if (lapic_is_integrated() && esr_disable) {
@@ -1135,6 +1145,7 @@ void __cpuinit setup_local_APIC(void)
apic_write(APIC_ESR, 0);
}
#endif
+ perf_counters_lapic_init(0);

preempt_disable();

@@ -1566,11 +1577,11 @@ int apic_version[MAX_APICS];

int __init APIC_init_uniprocessor(void)
{
-#ifdef CONFIG_X86_64
if (disable_apic) {
pr_info("Apic disabled\n");
return -1;
}
+#ifdef CONFIG_X86_64
if (!cpu_has_apic) {
disable_apic = 1;
pr_info("Apic disabled by BIOS\n");
@@ -1868,17 +1879,8 @@ void __cpuinit generic_processor_info(int apicid, int version)
#endif

#if defined(CONFIG_X86_SMP) || defined(CONFIG_X86_64)
- /* are we being called early in kernel startup? */
- if (early_per_cpu_ptr(x86_cpu_to_apicid)) {
- u16 *cpu_to_apicid = early_per_cpu_ptr(x86_cpu_to_apicid);
- u16 *bios_cpu_apicid = early_per_cpu_ptr(x86_bios_cpu_apicid);
-
- cpu_to_apicid[cpu] = apicid;
- bios_cpu_apicid[cpu] = apicid;
- } else {
- per_cpu(x86_cpu_to_apicid, cpu) = apicid;
- per_cpu(x86_bios_cpu_apicid, cpu) = apicid;
- }
+ early_per_cpu(x86_cpu_to_apicid, cpu) = apicid;
+ early_per_cpu(x86_bios_cpu_apicid, cpu) = apicid;
#endif

set_cpu_possible(cpu, true);
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 1d41d3f..64c834a 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -49,13 +49,7 @@ int main(void)
BLANK();
#undef ENTRY
#define ENTRY(entry) DEFINE(pda_ ## entry, offsetof(struct x8664_pda, entry))
- ENTRY(kernelstack);
- ENTRY(oldrsp);
- ENTRY(pcurrent);
- ENTRY(irqcount);
- ENTRY(cpunumber);
- ENTRY(irqstackptr);
- ENTRY(data_offset);
+ DEFINE(pda_size, sizeof(struct x8664_pda));
BLANK();
#undef ENTRY
#ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 82db7f4..c381330 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -1,5 +1,5 @@
#
-# Makefile for x86-compatible CPU details and quirks
+# Makefile for x86-compatible CPU details, features and quirks
#

# Don't trace early stages of a secondary CPU boot
@@ -22,11 +22,13 @@ obj-$(CONFIG_CPU_SUP_CENTAUR_64) += centaur_64.o
obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
obj-$(CONFIG_CPU_SUP_UMC_32) += umc.o

-obj-$(CONFIG_X86_MCE) += mcheck/
-obj-$(CONFIG_MTRR) += mtrr/
-obj-$(CONFIG_CPU_FREQ) += cpufreq/
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

-obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o
+obj-$(CONFIG_X86_MCE) += mcheck/
+obj-$(CONFIG_MTRR) += mtrr/
+obj-$(CONFIG_CPU_FREQ) += cpufreq/
+
+obj-$(CONFIG_X86_LOCAL_APIC) += perfctr-watchdog.o

quiet_cmd_mkcapflags = MKCAP $@
cmd_mkcapflags = $(PERL) $(srctree)/$(src)/mkcapflags.pl $< $@
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 83492b1..95eb30e 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -17,10 +17,13 @@
#include <asm/mmu_context.h>
#include <asm/mtrr.h>
#include <asm/mce.h>
+#include <asm/perf_counter.h>
#include <asm/pat.h>
#include <asm/asm.h>
#include <asm/numa.h>
#include <asm/smp.h>
+#include <asm/cpu.h>
+#include <asm/cpumask.h>
#ifdef CONFIG_X86_LOCAL_APIC
#include <asm/mpspec.h>
#include <asm/apic.h>
@@ -772,6 +775,7 @@ void __init identify_boot_cpu(void)
#else
vgetcpu_set_mode();
#endif
+ init_hw_perf_counters();
}

void __cpuinit identify_secondary_cpu(struct cpuinfo_x86 *c)
@@ -877,54 +881,34 @@ static __init int setup_disablecpuid(char *arg)
__setup("clearcpuid=", setup_disablecpuid);

#ifdef CONFIG_X86_64
-struct x8664_pda **_cpu_pda __read_mostly;
-EXPORT_SYMBOL(_cpu_pda);
-
struct desc_ptr idt_descr = { 256 * 16 - 1, (unsigned long) idt_table };

-static char boot_cpu_stack[IRQSTACKSIZE] __page_aligned_bss;
+DEFINE_PER_CPU_PAGE_ALIGNED(char[IRQ_STACK_SIZE], irq_stack);
+#ifdef CONFIG_SMP
+DEFINE_PER_CPU(char *, irq_stack_ptr); /* will be set during per cpu init */
+#else
+DEFINE_PER_CPU(char *, irq_stack_ptr) =
+ per_cpu_var(irq_stack) + IRQ_STACK_SIZE - 64;
+#endif
+
+DEFINE_PER_CPU(unsigned long, kernel_stack) =
+ (unsigned long)&init_thread_union - KERNEL_STACK_OFFSET + THREAD_SIZE;
+EXPORT_PER_CPU_SYMBOL(kernel_stack);
+
+DEFINE_PER_CPU(unsigned int, irq_count) = -1;

void __cpuinit pda_init(int cpu)
{
- struct x8664_pda *pda = cpu_pda(cpu);
-
/* Setup up data that may be needed in __get_free_pages early */
loadsegment(fs, 0);
loadsegment(gs, 0);
- /* Memory clobbers used to order PDA accessed */
- mb();
- wrmsrl(MSR_GS_BASE, pda);
- mb();
-
- pda->cpunumber = cpu;
- pda->irqcount = -1;
- pda->kernelstack = (unsigned long)stack_thread_info() -
- PDA_STACKOFFSET + THREAD_SIZE;
- pda->active_mm = &init_mm;
- pda->mmu_state = 0;
-
- if (cpu == 0) {
- /* others are initialized in smpboot.c */
- pda->pcurrent = &init_task;
- pda->irqstackptr = boot_cpu_stack;
- pda->irqstackptr += IRQSTACKSIZE - 64;
- } else {
- if (!pda->irqstackptr) {
- pda->irqstackptr = (char *)
- __get_free_pages(GFP_ATOMIC, IRQSTACK_ORDER);
- if (!pda->irqstackptr)
- panic("cannot allocate irqstack for cpu %d",
- cpu);
- pda->irqstackptr += IRQSTACKSIZE - 64;
- }

- if (pda->nodenumber == 0 && cpu_to_node(cpu) != NUMA_NO_NODE)
- pda->nodenumber = cpu_to_node(cpu);
- }
+ load_pda_offset(cpu);
}

-static char boot_exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ +
- DEBUG_STKSZ] __page_aligned_bss;
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+ [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ])
+ __aligned(PAGE_SIZE);

extern asmlinkage void ignore_sysret(void);

@@ -982,15 +966,18 @@ void __cpuinit cpu_init(void)
struct tss_struct *t = &per_cpu(init_tss, cpu);
struct orig_ist *orig_ist = &per_cpu(orig_ist, cpu);
unsigned long v;
- char *estacks = NULL;
struct task_struct *me;
int i;

/* CPU 0 is initialised in head64.c */
if (cpu != 0)
pda_init(cpu);
- else
- estacks = boot_exception_stacks;
+
+#ifdef CONFIG_NUMA
+ if (cpu != 0 && percpu_read(node_number) == 0 &&
+ cpu_to_node(cpu) != NUMA_NO_NODE)
+ percpu_write(node_number, cpu_to_node(cpu));
+#endif

me = current;

@@ -1024,18 +1011,13 @@ void __cpuinit cpu_init(void)
* set up and load the per-CPU TSS
*/
if (!orig_ist->ist[0]) {
- static const unsigned int order[N_EXCEPTION_STACKS] = {
- [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STACK_ORDER,
- [DEBUG_STACK - 1] = DEBUG_STACK_ORDER
+ static const unsigned int sizes[N_EXCEPTION_STACKS] = {
+ [0 ... N_EXCEPTION_STACKS - 1] = EXCEPTION_STKSZ,
+ [DEBUG_STACK - 1] = DEBUG_STKSZ
};
+ char *estacks = per_cpu(exception_stacks, cpu);
for (v = 0; v < N_EXCEPTION_STACKS; v++) {
- if (cpu) {
- estacks = (char *)__get_free_pages(GFP_ATOMIC, order[v]);
- if (!estacks)
- panic("Cannot allocate exception "
- "stack %ld %d\n", v, cpu);
- }
- estacks += PAGE_SIZE << order[v];
+ estacks += sizes[v];
orig_ist->ist[v] = t->x86_tss.ist[v] =
(unsigned long)estacks;
}
diff --git a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
index 6f11e02..8f3c95c 100644
--- a/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ b/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -235,8 +235,6 @@ static u32 get_cur_val(const struct cpumask *mask)
return 0;
}

- cpumask_copy(cmd.mask, mask);
-
drv_read(&cmd);

dprintk("get_cur_val = %u\n", cmd.val);
diff --git a/arch/x86/kernel/cpu/intel_cacheinfo.c b/arch/x86/kernel/cpu/intel_cacheinfo.c
index 48533d7..58527a9 100644
--- a/arch/x86/kernel/cpu/intel_cacheinfo.c
+++ b/arch/x86/kernel/cpu/intel_cacheinfo.c
@@ -132,7 +132,16 @@ struct _cpuid4_info {
union _cpuid4_leaf_ecx ecx;
unsigned long size;
unsigned long can_disable;
- cpumask_t shared_cpu_map; /* future?: only cpus/node is needed */
+ DECLARE_BITMAP(shared_cpu_map, NR_CPUS);
+};
+
+/* subset of above _cpuid4_info w/o shared_cpu_map */
+struct _cpuid4_info_regs {
+ union _cpuid4_leaf_eax eax;
+ union _cpuid4_leaf_ebx ebx;
+ union _cpuid4_leaf_ecx ecx;
+ unsigned long size;
+ unsigned long can_disable;
};

#ifdef CONFIG_PCI
@@ -263,7 +272,7 @@ amd_cpuid4(int leaf, union _cpuid4_leaf_eax *eax,
}

static void __cpuinit
-amd_check_l3_disable(int index, struct _cpuid4_info *this_leaf)
+amd_check_l3_disable(int index, struct _cpuid4_info_regs *this_leaf)
{
if (index < 3)
return;
@@ -271,7 +280,8 @@ amd_check_l3_disable(int index, struct _cpuid4_info *this_leaf)
}

static int
-__cpuinit cpuid4_cache_lookup(int index, struct _cpuid4_info *this_leaf)
+__cpuinit cpuid4_cache_lookup_regs(int index,
+ struct _cpuid4_info_regs *this_leaf)
{
union _cpuid4_leaf_eax eax;
union _cpuid4_leaf_ebx ebx;
@@ -299,6 +309,15 @@ __cpuinit cpuid4_cache_lookup(int index, struct _cpuid4_info *this_leaf)
return 0;
}

+static int
+__cpuinit cpuid4_cache_lookup(int index, struct _cpuid4_info *this_leaf)
+{
+ struct _cpuid4_info_regs *leaf_regs =
+ (struct _cpuid4_info_regs *)this_leaf;
+
+ return cpuid4_cache_lookup_regs(index, leaf_regs);
+}
+
static int __cpuinit find_num_cache_leaves(void)
{
unsigned int eax, ebx, ecx, edx;
@@ -338,11 +357,10 @@ unsigned int __cpuinit init_intel_cacheinfo(struct cpuinfo_x86 *c)
* parameters cpuid leaf to find the cache details
*/
for (i = 0; i < num_cache_leaves; i++) {
- struct _cpuid4_info this_leaf;
-
+ struct _cpuid4_info_regs this_leaf;
int retval;

- retval = cpuid4_cache_lookup(i, &this_leaf);
+ retval = cpuid4_cache_lookup_regs(i, &this_leaf);
if (retval >= 0) {
switch(this_leaf.eax.split.level) {
case 1:
@@ -491,17 +509,20 @@ static void __cpuinit cache_shared_cpu_map_setup(unsigned int cpu, int index)
num_threads_sharing = 1 + this_leaf->eax.split.num_threads_sharing;

if (num_threads_sharing == 1)
- cpu_set(cpu, this_leaf->shared_cpu_map);
+ cpumask_set_cpu(cpu, to_cpumask(this_leaf->shared_cpu_map));
else {
index_msb = get_count_order(num_threads_sharing);

for_each_online_cpu(i) {
if (cpu_data(i).apicid >> index_msb ==
c->apicid >> index_msb) {
- cpu_set(i, this_leaf->shared_cpu_map);
+ cpumask_set_cpu(i,
+ to_cpumask(this_leaf->shared_cpu_map));
if (i != cpu && per_cpu(cpuid4_info, i)) {
- sibling_leaf = CPUID4_INFO_IDX(i, index);
- cpu_set(cpu, sibling_leaf->shared_cpu_map);
+ sibling_leaf =
+ CPUID4_INFO_IDX(i, index);
+ cpumask_set_cpu(cpu, to_cpumask(
+ sibling_leaf->shared_cpu_map));
}
}
}
@@ -513,9 +534,10 @@ static void __cpuinit cache_remove_shared_cpu_map(unsigned int cpu, int index)
int sibling;

this_leaf = CPUID4_INFO_IDX(cpu, index);
- for_each_cpu_mask_nr(sibling, this_leaf->shared_cpu_map) {
+ for_each_cpu(sibling, to_cpumask(this_leaf->shared_cpu_map)) {
sibling_leaf = CPUID4_INFO_IDX(sibling, index);
- cpu_clear(cpu, sibling_leaf->shared_cpu_map);
+ cpumask_clear_cpu(cpu,
+ to_cpumask(sibling_leaf->shared_cpu_map));
}
}
#else
@@ -620,8 +642,9 @@ static ssize_t show_shared_cpu_map_func(struct _cpuid4_info *this_leaf,
int n = 0;

if (len > 1) {
- cpumask_t *mask = &this_leaf->shared_cpu_map;
+ const struct cpumask *mask;

+ mask = to_cpumask(this_leaf->shared_cpu_map);
n = type?
cpulist_scnprintf(buf, len-2, mask) :
cpumask_scnprintf(buf, len-2, mask);
@@ -684,7 +707,8 @@ static struct pci_dev *get_k8_northbridge(int node)

static ssize_t show_cache_disable(struct _cpuid4_info *this_leaf, char *buf)
{
- int node = cpu_to_node(first_cpu(this_leaf->shared_cpu_map));
+ const struct cpumask *mask = to_cpumask(this_leaf->shared_cpu_map);
+ int node = cpu_to_node(cpumask_first(mask));
struct pci_dev *dev = NULL;
ssize_t ret = 0;
int i;
@@ -718,7 +742,8 @@ static ssize_t
store_cache_disable(struct _cpuid4_info *this_leaf, const char *buf,
size_t count)
{
- int node = cpu_to_node(first_cpu(this_leaf->shared_cpu_map));
+ const struct cpumask *mask = to_cpumask(this_leaf->shared_cpu_map);
+ int node = cpu_to_node(cpumask_first(mask));
struct pci_dev *dev = NULL;
unsigned int ret, index, val;

@@ -863,7 +888,7 @@ err_out:
return -ENOMEM;
}

-static cpumask_t cache_dev_map = CPU_MASK_NONE;
+static DECLARE_BITMAP(cache_dev_map, NR_CPUS);

/* Add/Remove cache interface for CPU device */
static int __cpuinit cache_add_dev(struct sys_device * sys_dev)
@@ -903,7 +928,7 @@ static int __cpuinit cache_add_dev(struct sys_device * sys_dev)
}
kobject_uevent(&(this_object->kobj), KOBJ_ADD);
}
- cpu_set(cpu, cache_dev_map);
+ cpumask_set_cpu(cpu, to_cpumask(cache_dev_map));

kobject_uevent(per_cpu(cache_kobject, cpu), KOBJ_ADD);
return 0;
@@ -916,9 +941,9 @@ static void __cpuinit cache_remove_dev(struct sys_device * sys_dev)

if (per_cpu(cpuid4_info, cpu) == NULL)
return;
- if (!cpu_isset(cpu, cache_dev_map))
+ if (!cpumask_test_cpu(cpu, to_cpumask(cache_dev_map)))
return;
- cpu_clear(cpu, cache_dev_map);
+ cpumask_clear_cpu(cpu, to_cpumask(cache_dev_map));

for (i = 0; i < num_cache_leaves; i++)
kobject_put(&(INDEX_KOBJECT_PTR(cpu,i)->kobj));
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
index 8ae8c4f..4772e91 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd_64.c
@@ -67,7 +67,7 @@ static struct threshold_block threshold_defaults = {
struct threshold_bank {
struct kobject *kobj;
struct threshold_block *blocks;
- cpumask_t cpus;
+ cpumask_var_t cpus;
};
static DEFINE_PER_CPU(struct threshold_bank *, threshold_banks[NR_BANKS]);

@@ -481,7 +481,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)

#ifdef CONFIG_SMP
if (cpu_data(cpu).cpu_core_id && shared_bank[bank]) { /* symlink */
- i = first_cpu(per_cpu(cpu_core_map, cpu));
+ i = cpumask_first(&per_cpu(cpu_core_map, cpu));

/* first core not up yet */
if (cpu_data(i).cpu_core_id)
@@ -501,7 +501,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
if (err)
goto out;

- b->cpus = per_cpu(cpu_core_map, cpu);
+ cpumask_copy(b->cpus, &per_cpu(cpu_core_map, cpu));
per_cpu(threshold_banks, cpu)[bank] = b;
goto out;
}
@@ -512,15 +512,20 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
err = -ENOMEM;
goto out;
}
+ if (!alloc_cpumask_var(&b->cpus, GFP_KERNEL)) {
+ kfree(b);
+ err = -ENOMEM;
+ goto out;
+ }

b->kobj = kobject_create_and_add(name, &per_cpu(device_mce, cpu).kobj);
if (!b->kobj)
goto out_free;

#ifndef CONFIG_SMP
- b->cpus = CPU_MASK_ALL;
+ cpumask_setall(b->cpus);
#else
- b->cpus = per_cpu(cpu_core_map, cpu);
+ cpumask_copy(b->cpus, &per_cpu(cpu_core_map, cpu));
#endif

per_cpu(threshold_banks, cpu)[bank] = b;
@@ -529,7 +534,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)
if (err)
goto out_free;

- for_each_cpu_mask_nr(i, b->cpus) {
+ for_each_cpu(i, b->cpus) {
if (i == cpu)
continue;

@@ -545,6 +550,7 @@ static __cpuinit int threshold_create_bank(unsigned int cpu, unsigned int bank)

out_free:
per_cpu(threshold_banks, cpu)[bank] = NULL;
+ free_cpumask_var(b->cpus);
kfree(b);
out:
return err;
@@ -619,7 +625,7 @@ static void threshold_remove_bank(unsigned int cpu, int bank)
#endif

/* remove all sibling symlinks before unregistering */
- for_each_cpu_mask_nr(i, b->cpus) {
+ for_each_cpu(i, b->cpus) {
if (i == cpu)
continue;

@@ -632,6 +638,7 @@ static void threshold_remove_bank(unsigned int cpu, int bank)
free_out:
kobject_del(b->kobj);
kobject_put(b->kobj);
+ free_cpumask_var(b->cpus);
kfree(b);
per_cpu(threshold_banks, cpu)[bank] = NULL;
}
diff --git a/arch/x86/kernel/cpu/perf_counter.c b/arch/x86/kernel/cpu/perf_counter.c
new file mode 100644
index 0000000..9376771
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_counter.c
@@ -0,0 +1,695 @@
+/*
+ * Performance counter x86 architecture code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/perf_counter.h>
+#include <linux/capability.h>
+#include <linux/notifier.h>
+#include <linux/hardirq.h>
+#include <linux/kprobes.h>
+#include <linux/module.h>
+#include <linux/kdebug.h>
+#include <linux/sched.h>
+
+#include <asm/perf_counter.h>
+#include <asm/apic.h>
+
+static bool perf_counters_initialized __read_mostly;
+
+/*
+ * Number of (generic) HW counters:
+ */
+static int nr_counters_generic __read_mostly;
+static u64 perf_counter_mask __read_mostly;
+static u64 counter_value_mask __read_mostly;
+
+static int nr_counters_fixed __read_mostly;
+
+struct cpu_hw_counters {
+ struct perf_counter *counters[X86_PMC_IDX_MAX];
+ unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+};
+
+/*
+ * Intel PerfMon v3. Used on Core2 and later.
+ */
+static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
+
+static const int intel_perfmon_event_map[] =
+{
+ [PERF_COUNT_CPU_CYCLES] = 0x003c,
+ [PERF_COUNT_INSTRUCTIONS] = 0x00c0,
+ [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e,
+ [PERF_COUNT_CACHE_MISSES] = 0x412e,
+ [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4,
+ [PERF_COUNT_BRANCH_MISSES] = 0x00c5,
+ [PERF_COUNT_BUS_CYCLES] = 0x013c,
+};
+
+static const int max_intel_perfmon_events = ARRAY_SIZE(intel_perfmon_event_map);
+
+/*
+ * Propagate counter elapsed time into the generic counter.
+ * Can only be executed on the CPU where the counter is active.
+ * Returns the delta events processed.
+ */
+static void
+x86_perf_counter_update(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ u64 prev_raw_count, new_raw_count, delta;
+
+ /*
+ * Careful: an NMI might modify the previous counter value.
+ *
+ * Our tactic to handle this is to first atomically read and
+ * exchange a new raw count - then add that new-prev delta
+ * count to the generic counter atomically:
+ */
+again:
+ prev_raw_count = atomic64_read(&hwc->prev_count);
+ rdmsrl(hwc->counter_base + idx, new_raw_count);
+
+ if (atomic64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count)
+ goto again;
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (counter-)time and add that to the generic counter.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count, so we do that by clipping the delta to 32 bits:
+ */
+ delta = (u64)(u32)((s32)new_raw_count - (s32)prev_raw_count);
+
+ atomic64_add(delta, &counter->count);
+ atomic64_sub(delta, &hwc->period_left);
+}
+
+/*
+ * Setup the hardware configuration for a given hw_event_type
+ */
+static int __hw_perf_counter_init(struct perf_counter *counter)
+{
+ struct perf_counter_hw_event *hw_event = &counter->hw_event;
+ struct hw_perf_counter *hwc = &counter->hw;
+
+ if (unlikely(!perf_counters_initialized))
+ return -EINVAL;
+
+ /*
+ * Count user events, and generate PMC IRQs:
+ * (keep 'enabled' bit clear for now)
+ */
+ hwc->config = ARCH_PERFMON_EVENTSEL_USR | ARCH_PERFMON_EVENTSEL_INT;
+
+ /*
+ * If privileged enough, count OS events too, and allow
+ * NMI events as well:
+ */
+ hwc->nmi = 0;
+ if (capable(CAP_SYS_ADMIN)) {
+ hwc->config |= ARCH_PERFMON_EVENTSEL_OS;
+ if (hw_event->nmi)
+ hwc->nmi = 1;
+ }
+
+ hwc->irq_period = hw_event->irq_period;
+ /*
+ * Intel PMCs cannot be accessed sanely above 32 bit width,
+ * so we install an artificial 1<<31 period regardless of
+ * the generic counter period:
+ */
+ if ((s64)hwc->irq_period <= 0 || hwc->irq_period > 0x7FFFFFFF)
+ hwc->irq_period = 0x7FFFFFFF;
+
+ atomic64_set(&hwc->period_left, hwc->irq_period);
+
+ /*
+ * A raw event type provides the config in the event structure
+ */
+ if (hw_event->raw) {
+ hwc->config |= hw_event->type;
+ } else {
+ if (hw_event->type >= max_intel_perfmon_events)
+ return -EINVAL;
+ /*
+ * The generic map:
+ */
+ hwc->config |= intel_perfmon_event_map[hw_event->type];
+ }
+ counter->wakeup_pending = 0;
+
+ return 0;
+}
+
+u64 hw_perf_save_disable(void)
+{
+ u64 ctrl;
+
+ if (unlikely(!perf_counters_initialized))
+ return 0;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+
+ return ctrl;
+}
+EXPORT_SYMBOL_GPL(hw_perf_save_disable);
+
+void hw_perf_restore(u64 ctrl)
+{
+ if (unlikely(!perf_counters_initialized))
+ return;
+
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+}
+EXPORT_SYMBOL_GPL(hw_perf_restore);
+
+static inline void
+__pmc_fixed_disable(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, unsigned int __idx)
+{
+ int idx = __idx - X86_PMC_IDX_FIXED;
+ u64 ctrl_val, mask;
+ int err;
+
+ mask = 0xfULL << (idx * 4);
+
+ rdmsrl(hwc->config_base, ctrl_val);
+ ctrl_val &= ~mask;
+ err = checking_wrmsrl(hwc->config_base, ctrl_val);
+}
+
+static inline void
+__pmc_generic_disable(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, unsigned int idx)
+{
+ if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
+ __pmc_fixed_disable(counter, hwc, idx);
+ else
+ wrmsr_safe(hwc->config_base + idx, hwc->config, 0);
+}
+
+static DEFINE_PER_CPU(u64, prev_left[X86_PMC_IDX_MAX]);
+
+/*
+ * Set the next IRQ period, based on the hwc->period_left value.
+ * To be called with the counter disabled in hw:
+ */
+static void
+__hw_perf_counter_set_period(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ s64 left = atomic64_read(&hwc->period_left);
+ s32 period = hwc->irq_period;
+ int err;
+
+ /*
+ * If we are way outside a reasonable range then just skip forward:
+ */
+ if (unlikely(left <= -period)) {
+ left = period;
+ atomic64_set(&hwc->period_left, left);
+ }
+
+ if (unlikely(left <= 0)) {
+ left += period;
+ atomic64_set(&hwc->period_left, left);
+ }
+
+ per_cpu(prev_left[idx], smp_processor_id()) = left;
+
+ /*
+ * The hw counter starts counting from this counter offset,
+ * mark it to be able to extract future deltas:
+ */
+ atomic64_set(&hwc->prev_count, (u64)-left);
+
+ err = checking_wrmsrl(hwc->counter_base + idx,
+ (u64)(-left) & counter_value_mask);
+}
+
+static inline void
+__pmc_fixed_enable(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, unsigned int __idx)
+{
+ int idx = __idx - X86_PMC_IDX_FIXED;
+ u64 ctrl_val, bits, mask;
+ int err;
+
+ /*
+ * Enable IRQ generation (0x8) and ring-3 counting (0x2),
+ * and enable ring-0 counting if allowed:
+ */
+ bits = 0x8ULL | 0x2ULL;
+ if (hwc->config & ARCH_PERFMON_EVENTSEL_OS)
+ bits |= 0x1;
+ bits <<= (idx * 4);
+ mask = 0xfULL << (idx * 4);
+
+ rdmsrl(hwc->config_base, ctrl_val);
+ ctrl_val &= ~mask;
+ ctrl_val |= bits;
+ err = checking_wrmsrl(hwc->config_base, ctrl_val);
+}
+
+static void
+__pmc_generic_enable(struct perf_counter *counter,
+ struct hw_perf_counter *hwc, int idx)
+{
+ if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
+ __pmc_fixed_enable(counter, hwc, idx);
+ else
+ wrmsr(hwc->config_base + idx,
+ hwc->config | ARCH_PERFMON_EVENTSEL0_ENABLE, 0);
+}
+
+static int
+fixed_mode_idx(struct perf_counter *counter, struct hw_perf_counter *hwc)
+{
+ unsigned int event;
+
+ if (unlikely(hwc->nmi))
+ return -1;
+
+ event = hwc->config & ARCH_PERFMON_EVENT_MASK;
+
+ if (unlikely(event == intel_perfmon_event_map[PERF_COUNT_INSTRUCTIONS]))
+ return X86_PMC_IDX_FIXED_INSTRUCTIONS;
+ if (unlikely(event == intel_perfmon_event_map[PERF_COUNT_CPU_CYCLES]))
+ return X86_PMC_IDX_FIXED_CPU_CYCLES;
+ if (unlikely(event == intel_perfmon_event_map[PERF_COUNT_BUS_CYCLES]))
+ return X86_PMC_IDX_FIXED_BUS_CYCLES;
+
+ return -1;
+}
+
+/*
+ * Find a PMC slot for the freshly enabled / scheduled in counter:
+ */
+static int pmc_generic_enable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx;
+
+ idx = fixed_mode_idx(counter, hwc);
+ if (idx >= 0) {
+ /*
+ * Try to get the fixed counter, if that is already taken
+ * then try to get a generic counter:
+ */
+ if (test_and_set_bit(idx, cpuc->used))
+ goto try_generic;
+
+ hwc->config_base = MSR_ARCH_PERFMON_FIXED_CTR_CTRL;
+ /*
+ * We set it so that counter_base + idx in wrmsr/rdmsr maps to
+ * MSR_ARCH_PERFMON_FIXED_CTR0 ... CTR2:
+ */
+ hwc->counter_base =
+ MSR_ARCH_PERFMON_FIXED_CTR0 - X86_PMC_IDX_FIXED;
+ hwc->idx = idx;
+ } else {
+ idx = hwc->idx;
+ /* Try to get the previous generic counter again */
+ if (test_and_set_bit(idx, cpuc->used)) {
+try_generic:
+ idx = find_first_zero_bit(cpuc->used, nr_counters_generic);
+ if (idx == nr_counters_generic)
+ return -EAGAIN;
+
+ set_bit(idx, cpuc->used);
+ hwc->idx = idx;
+ }
+ hwc->config_base = MSR_ARCH_PERFMON_EVENTSEL0;
+ hwc->counter_base = MSR_ARCH_PERFMON_PERFCTR0;
+ }
+
+ perf_counters_lapic_init(hwc->nmi);
+
+ __pmc_generic_disable(counter, hwc, idx);
+
+ cpuc->counters[idx] = counter;
+ /*
+ * Make it visible before enabling the hw:
+ */
+ smp_wmb();
+
+ __hw_perf_counter_set_period(counter, hwc, idx);
+ __pmc_generic_enable(counter, hwc, idx);
+
+ return 0;
+}
+
+void perf_counter_print_debug(void)
+{
+ u64 ctrl, status, overflow, pmc_ctrl, pmc_count, prev_left, fixed;
+ struct cpu_hw_counters *cpuc;
+ int cpu, idx;
+
+ if (!nr_counters_generic)
+ return;
+
+ local_irq_disable();
+
+ cpu = smp_processor_id();
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, ctrl);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ rdmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, overflow);
+ rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, fixed);
+
+ printk(KERN_INFO "\n");
+ printk(KERN_INFO "CPU#%d: ctrl: %016llx\n", cpu, ctrl);
+ printk(KERN_INFO "CPU#%d: status: %016llx\n", cpu, status);
+ printk(KERN_INFO "CPU#%d: overflow: %016llx\n", cpu, overflow);
+ printk(KERN_INFO "CPU#%d: fixed: %016llx\n", cpu, fixed);
+ printk(KERN_INFO "CPU#%d: used: %016llx\n", cpu, *(u64 *)cpuc->used);
+
+ for (idx = 0; idx < nr_counters_generic; idx++) {
+ rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + idx, pmc_ctrl);
+ rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + idx, pmc_count);
+
+ prev_left = per_cpu(prev_left[idx], cpu);
+
+ printk(KERN_INFO "CPU#%d: gen-PMC%d ctrl: %016llx\n",
+ cpu, idx, pmc_ctrl);
+ printk(KERN_INFO "CPU#%d: gen-PMC%d count: %016llx\n",
+ cpu, idx, pmc_count);
+ printk(KERN_INFO "CPU#%d: gen-PMC%d left: %016llx\n",
+ cpu, idx, prev_left);
+ }
+ for (idx = 0; idx < nr_counters_fixed; idx++) {
+ rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR0 + idx, pmc_count);
+
+ printk(KERN_INFO "CPU#%d: fixed-PMC%d count: %016llx\n",
+ cpu, idx, pmc_count);
+ }
+ local_irq_enable();
+}
+
+static void pmc_generic_disable(struct perf_counter *counter)
+{
+ struct cpu_hw_counters *cpuc = &__get_cpu_var(cpu_hw_counters);
+ struct hw_perf_counter *hwc = &counter->hw;
+ unsigned int idx = hwc->idx;
+
+ __pmc_generic_disable(counter, hwc, idx);
+
+ clear_bit(idx, cpuc->used);
+ cpuc->counters[idx] = NULL;
+ /*
+ * Make sure the cleared pointer becomes visible before we
+ * (potentially) free the counter:
+ */
+ smp_wmb();
+
+ /*
+ * Drain the remaining delta count out of a counter
+ * that we are disabling:
+ */
+ x86_perf_counter_update(counter, hwc, idx);
+}
+
+static void perf_store_irq_data(struct perf_counter *counter, u64 data)
+{
+ struct perf_data *irqdata = counter->irqdata;
+
+ if (irqdata->len > PERF_DATA_BUFLEN - sizeof(u64)) {
+ irqdata->overrun++;
+ } else {
+ u64 *p = (u64 *) &irqdata->data[irqdata->len];
+
+ *p = data;
+ irqdata->len += sizeof(u64);
+ }
+}
+
+/*
+ * Save and restart an expired counter. Called by NMI contexts,
+ * so it has to be careful about preempting normal counter ops:
+ */
+static void perf_save_and_restart(struct perf_counter *counter)
+{
+ struct hw_perf_counter *hwc = &counter->hw;
+ int idx = hwc->idx;
+
+ x86_perf_counter_update(counter, hwc, idx);
+ __hw_perf_counter_set_period(counter, hwc, idx);
+
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE)
+ __pmc_generic_enable(counter, hwc, idx);
+}
+
+static void
+perf_handle_group(struct perf_counter *sibling, u64 *status, u64 *overflown)
+{
+ struct perf_counter *counter, *group_leader = sibling->group_leader;
+
+ /*
+ * Store sibling timestamps (if any):
+ */
+ list_for_each_entry(counter, &group_leader->sibling_list, list_entry) {
+
+ x86_perf_counter_update(counter, &counter->hw, counter->hw.idx);
+ perf_store_irq_data(sibling, counter->hw_event.type);
+ perf_store_irq_data(sibling, atomic64_read(&counter->count));
+ }
+}
+
+/*
+ * This handler is triggered by the local APIC, so the APIC IRQ handling
+ * rules apply:
+ */
+static void __smp_perf_counter_interrupt(struct pt_regs *regs, int nmi)
+{
+ int bit, cpu = smp_processor_id();
+ u64 ack, status, saved_global;
+ struct cpu_hw_counters *cpuc;
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, saved_global);
+
+ /* Disable counters globally */
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
+ ack_APIC_irq();
+
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (!status)
+ goto out;
+
+again:
+ ack = status;
+ for_each_bit(bit, (unsigned long *)&status, X86_PMC_IDX_MAX) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ clear_bit(bit, (unsigned long *) &status);
+ if (!counter)
+ continue;
+
+ perf_save_and_restart(counter);
+
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ continue;
+ case PERF_RECORD_IRQ:
+ perf_store_irq_data(counter, instruction_pointer(regs));
+ break;
+ case PERF_RECORD_GROUP:
+ perf_handle_group(counter, &status, &ack);
+ break;
+ }
+ /*
+ * From NMI context we cannot call into the scheduler to
+ * do a task wakeup - but we mark the counter as
+ * wakeup_pending and initiate a wakeup callback:
+ */
+ if (nmi) {
+ counter->wakeup_pending = 1;
+ set_tsk_thread_flag(current, TIF_PERF_COUNTERS);
+ } else {
+ wake_up(&counter->waitq);
+ }
+ }
+
+ wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, ack);
+
+ /*
+ * Repeat if there is more work to be done:
+ */
+ rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
+ if (status)
+ goto again;
+out:
+ /*
+ * Restore - do not reenable when global enable is off:
+ */
+ wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, saved_global);
+}
+
+void smp_perf_counter_interrupt(struct pt_regs *regs)
+{
+ irq_enter();
+ inc_irq_stat(apic_perf_irqs);
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ __smp_perf_counter_interrupt(regs, 0);
+
+ irq_exit();
+}
+
+/*
+ * This handler is triggered by NMI contexts:
+ */
+void perf_counter_notify(struct pt_regs *regs)
+{
+ struct cpu_hw_counters *cpuc;
+ unsigned long flags;
+ int bit, cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+ cpuc = &per_cpu(cpu_hw_counters, cpu);
+
+ for_each_bit(bit, cpuc->used, X86_PMC_IDX_MAX) {
+ struct perf_counter *counter = cpuc->counters[bit];
+
+ if (!counter)
+ continue;
+
+ if (counter->wakeup_pending) {
+ counter->wakeup_pending = 0;
+ wake_up(&counter->waitq);
+ }
+ }
+
+ local_irq_restore(flags);
+}
+
+void __cpuinit perf_counters_lapic_init(int nmi)
+{
+ u32 apic_val;
+
+ if (!perf_counters_initialized)
+ return;
+ /*
+ * Enable the performance counter vector in the APIC LVT:
+ */
+ apic_val = apic_read(APIC_LVTERR);
+
+ apic_write(APIC_LVTERR, apic_val | APIC_LVT_MASKED);
+ if (nmi)
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ else
+ apic_write(APIC_LVTPC, LOCAL_PERF_VECTOR);
+ apic_write(APIC_LVTERR, apic_val);
+}
+
+static int __kprobes
+perf_counter_nmi_handler(struct notifier_block *self,
+ unsigned long cmd, void *__args)
+{
+ struct die_args *args = __args;
+ struct pt_regs *regs;
+
+ if (likely(cmd != DIE_NMI_IPI))
+ return NOTIFY_DONE;
+
+ regs = args->regs;
+
+ apic_write(APIC_LVTPC, APIC_DM_NMI);
+ __smp_perf_counter_interrupt(regs, 1);
+
+ return NOTIFY_STOP;
+}
+
+static __read_mostly struct notifier_block perf_counter_nmi_notifier = {
+ .notifier_call = perf_counter_nmi_handler
+};
+
+void __init init_hw_perf_counters(void)
+{
+ union cpuid10_eax eax;
+ unsigned int ebx;
+ unsigned int unused;
+ union cpuid10_edx edx;
+
+ if (!cpu_has(&boot_cpu_data, X86_FEATURE_ARCH_PERFMON))
+ return;
+
+ /*
+ * Check whether the Architectural PerfMon supports
+ * Branch Misses Retired Event or not.
+ */
+ cpuid(10, &eax.full, &ebx, &unused, &edx.full);
+ if (eax.split.mask_length <= ARCH_PERFMON_BRANCH_MISSES_RETIRED)
+ return;
+
+ printk(KERN_INFO "Intel Performance Monitoring support detected.\n");
+
+ printk(KERN_INFO "... version: %d\n", eax.split.version_id);
+ printk(KERN_INFO "... num counters: %d\n", eax.split.num_counters);
+ nr_counters_generic = eax.split.num_counters;
+ if (nr_counters_generic > X86_PMC_MAX_GENERIC) {
+ nr_counters_generic = X86_PMC_MAX_GENERIC;
+ WARN(1, KERN_ERR "hw perf counters %d > max(%d), clipping!",
+ nr_counters_generic, X86_PMC_MAX_GENERIC);
+ }
+ perf_counter_mask = (1 << nr_counters_generic) - 1;
+ perf_max_counters = nr_counters_generic;
+
+ printk(KERN_INFO "... bit width: %d\n", eax.split.bit_width);
+ counter_value_mask = (1ULL << eax.split.bit_width) - 1;
+ printk(KERN_INFO "... value mask: %016Lx\n", counter_value_mask);
+
+ printk(KERN_INFO "... mask length: %d\n", eax.split.mask_length);
+
+ nr_counters_fixed = edx.split.num_counters_fixed;
+ if (nr_counters_fixed > X86_PMC_MAX_FIXED) {
+ nr_counters_fixed = X86_PMC_MAX_FIXED;
+ WARN(1, KERN_ERR "hw perf counters fixed %d > max(%d), clipping!",
+ nr_counters_fixed, X86_PMC_MAX_FIXED);
+ }
+ printk(KERN_INFO "... fixed counters: %d\n", nr_counters_fixed);
+
+ perf_counter_mask |= ((1LL << nr_counters_fixed)-1) << X86_PMC_IDX_FIXED;
+
+ printk(KERN_INFO "... counter mask: %016Lx\n", perf_counter_mask);
+ perf_counters_initialized = true;
+
+ perf_counters_lapic_init(0);
+ register_die_notifier(&perf_counter_nmi_notifier);
+}
+
+static void pmc_generic_read(struct perf_counter *counter)
+{
+ x86_perf_counter_update(counter, &counter->hw, counter->hw.idx);
+}
+
+static const struct hw_perf_counter_ops x86_perf_counter_ops = {
+ .enable = pmc_generic_enable,
+ .disable = pmc_generic_disable,
+ .read = pmc_generic_read,
+};
+
+const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ int err;
+
+ err = __hw_perf_counter_init(counter);
+ if (err)
+ return NULL;
+
+ return &x86_perf_counter_ops;
+}
diff --git a/arch/x86/kernel/cpu/perfctr-watchdog.c b/arch/x86/kernel/cpu/perfctr-watchdog.c
index 9abd48b..d6f5b9f 100644
--- a/arch/x86/kernel/cpu/perfctr-watchdog.c
+++ b/arch/x86/kernel/cpu/perfctr-watchdog.c
@@ -20,7 +20,7 @@
#include <linux/kprobes.h>

#include <asm/apic.h>
-#include <asm/intel_arch_perfmon.h>
+#include <asm/perf_counter.h>

struct nmi_watchdog_ctlblk {
unsigned int cccr_msr;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c689d19..11b93ca 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -24,7 +24,7 @@
#include <asm/apic.h>
#include <asm/hpet.h>
#include <linux/kdebug.h>
-#include <asm/smp.h>
+#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>

diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index c302d07..d35db59 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -106,7 +106,8 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
const struct stacktrace_ops *ops, void *data)
{
const unsigned cpu = get_cpu();
- unsigned long *irqstack_end = (unsigned long *)cpu_pda(cpu)->irqstackptr;
+ unsigned long *irq_stack_end =
+ (unsigned long *)per_cpu(irq_stack_ptr, cpu);
unsigned used = 0;
struct thread_info *tinfo;
int graph = 0;
@@ -160,23 +161,23 @@ void dump_trace(struct task_struct *task, struct pt_regs *regs,
stack = (unsigned long *) estack_end[-2];
continue;
}
- if (irqstack_end) {
- unsigned long *irqstack;
- irqstack = irqstack_end -
- (IRQSTACKSIZE - 64) / sizeof(*irqstack);
+ if (irq_stack_end) {
+ unsigned long *irq_stack;
+ irq_stack = irq_stack_end -
+ (IRQ_STACK_SIZE - 64) / sizeof(*irq_stack);

- if (stack >= irqstack && stack < irqstack_end) {
+ if (stack >= irq_stack && stack < irq_stack_end) {
if (ops->stack(data, "IRQ") < 0)
break;
bp = print_context_stack(tinfo, stack, bp,
- ops, data, irqstack_end, &graph);
+ ops, data, irq_stack_end, &graph);
/*
* We link to the next stack (which would be
* the process stack normally) the last
* pointer (index -1 to end) in the IRQ stack:
*/
- stack = (unsigned long *) (irqstack_end[-1]);
- irqstack_end = NULL;
+ stack = (unsigned long *) (irq_stack_end[-1]);
+ irq_stack_end = NULL;
ops->stack(data, "EOI");
continue;
}
@@ -199,10 +200,10 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
unsigned long *stack;
int i;
const int cpu = smp_processor_id();
- unsigned long *irqstack_end =
- (unsigned long *) (cpu_pda(cpu)->irqstackptr);
- unsigned long *irqstack =
- (unsigned long *) (cpu_pda(cpu)->irqstackptr - IRQSTACKSIZE);
+ unsigned long *irq_stack_end =
+ (unsigned long *)(per_cpu(irq_stack_ptr, cpu));
+ unsigned long *irq_stack =
+ (unsigned long *)(per_cpu(irq_stack_ptr, cpu) - IRQ_STACK_SIZE);

/*
* debugging aid: "show_stack(NULL, NULL);" prints the
@@ -218,9 +219,9 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,

stack = sp;
for (i = 0; i < kstack_depth_to_print; i++) {
- if (stack >= irqstack && stack <= irqstack_end) {
- if (stack == irqstack_end) {
- stack = (unsigned long *) (irqstack_end[-1]);
+ if (stack >= irq_stack && stack <= irq_stack_end) {
+ if (stack == irq_stack_end) {
+ stack = (unsigned long *) (irq_stack_end[-1]);
printk(" <EOI> ");
}
} else {
@@ -241,7 +242,7 @@ void show_registers(struct pt_regs *regs)
int i;
unsigned long sp;
const int cpu = smp_processor_id();
- struct task_struct *cur = cpu_pda(cpu)->pcurrent;
+ struct task_struct *cur = current;

sp = regs->sp;
printk("CPU %d ", cpu);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index e28c7a9..c092e7d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -52,6 +52,7 @@
#include <asm/irqflags.h>
#include <asm/paravirt.h>
#include <asm/ftrace.h>
+#include <asm/percpu.h>

/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
#include <linux/elf-em.h>
@@ -209,7 +210,7 @@ ENTRY(native_usergs_sysret64)

/* %rsp:at FRAMEEND */
.macro FIXUP_TOP_OF_STACK tmp offset=0
- movq %gs:pda_oldrsp,\tmp
+ movq PER_CPU_VAR(old_rsp),\tmp
movq \tmp,RSP+\offset(%rsp)
movq $__USER_DS,SS+\offset(%rsp)
movq $__USER_CS,CS+\offset(%rsp)
@@ -220,7 +221,7 @@ ENTRY(native_usergs_sysret64)

.macro RESTORE_TOP_OF_STACK tmp offset=0
movq RSP+\offset(%rsp),\tmp
- movq \tmp,%gs:pda_oldrsp
+ movq \tmp,PER_CPU_VAR(old_rsp)
movq EFLAGS+\offset(%rsp),\tmp
movq \tmp,R11+\offset(%rsp)
.endm
@@ -336,15 +337,15 @@ ENTRY(save_args)
je 1f
SWAPGS
/*
- * irqcount is used to check if a CPU is already on an interrupt stack
+ * irq_count is used to check if a CPU is already on an interrupt stack
* or not. While this is essentially redundant with preempt_count it is
* a little cheaper to use a separate counter in the PDA (short of
* moving irq_enter into assembly, which would be too much work)
*/
-1: incl %gs:pda_irqcount
+1: incl PER_CPU_VAR(irq_count)
jne 2f
popq_cfi %rax /* move return address... */
- mov %gs:pda_irqstackptr,%rsp
+ mov PER_CPU_VAR(irq_stack_ptr),%rsp
EMPTY_FRAME 0
pushq_cfi %rax /* ... to the new stack */
/*
@@ -467,7 +468,7 @@ END(ret_from_fork)
ENTRY(system_call)
CFI_STARTPROC simple
CFI_SIGNAL_FRAME
- CFI_DEF_CFA rsp,PDA_STACKOFFSET
+ CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
@@ -478,8 +479,8 @@ ENTRY(system_call)
*/
ENTRY(system_call_after_swapgs)

- movq %rsp,%gs:pda_oldrsp
- movq %gs:pda_kernelstack,%rsp
+ movq %rsp,PER_CPU_VAR(old_rsp)
+ movq PER_CPU_VAR(kernel_stack),%rsp
/*
* No need to follow this irqs off/on section - it's straight
* and short:
@@ -522,7 +523,7 @@ sysret_check:
CFI_REGISTER rip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER rflags,r11*/
- movq %gs:pda_oldrsp, %rsp
+ movq PER_CPU_VAR(old_rsp), %rsp
USERGS_SYSRET64

CFI_RESTORE_STATE
@@ -832,11 +833,11 @@ common_interrupt:
XCPT_FRAME
addq $-0x80,(%rsp) /* Adjust vector to [-256,-1] range */
interrupt do_IRQ
- /* 0(%rsp): oldrsp-ARGOFFSET */
+ /* 0(%rsp): old_rsp-ARGOFFSET */
ret_from_intr:
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
- decl %gs:pda_irqcount
+ decl PER_CPU_VAR(irq_count)
leaveq
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET -8
@@ -1024,6 +1025,11 @@ apicinterrupt ERROR_APIC_VECTOR \
apicinterrupt SPURIOUS_APIC_VECTOR \
spurious_interrupt smp_spurious_interrupt

+#ifdef CONFIG_PERF_COUNTERS
+apicinterrupt LOCAL_PERF_VECTOR \
+ perf_counter_interrupt smp_perf_counter_interrupt
+#endif
+
/*
* Exception entry points.
*/
@@ -1072,10 +1078,10 @@ ENTRY(\sym)
TRACE_IRQS_OFF
movq %rsp,%rdi /* pt_regs pointer */
xorl %esi,%esi /* no error code */
- movq %gs:pda_data_offset, %rbp
- subq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
+ PER_CPU(init_tss, %rbp)
+ subq $EXCEPTION_STKSZ, TSS_ist + (\ist - 1) * 8(%rbp)
call \do_sym
- addq $EXCEPTION_STKSZ, per_cpu__init_tss + TSS_ist + (\ist - 1) * 8(%rbp)
+ addq $EXCEPTION_STKSZ, TSS_ist + (\ist - 1) * 8(%rbp)
jmp paranoid_exit /* %ebx: no swapgs flag */
CFI_ENDPROC
END(\sym)
@@ -1259,14 +1265,14 @@ ENTRY(call_softirq)
CFI_REL_OFFSET rbp,0
mov %rsp,%rbp
CFI_DEF_CFA_REGISTER rbp
- incl %gs:pda_irqcount
- cmove %gs:pda_irqstackptr,%rsp
+ incl PER_CPU_VAR(irq_count)
+ cmove PER_CPU_VAR(irq_stack_ptr),%rsp
push %rbp # backlink for old unwinder
call __do_softirq
leaveq
CFI_DEF_CFA_REGISTER rsp
CFI_ADJUST_CFA_OFFSET -8
- decl %gs:pda_irqcount
+ decl PER_CPU_VAR(irq_count)
ret
CFI_ENDPROC
END(call_softirq)
@@ -1296,15 +1302,15 @@ ENTRY(xen_do_hypervisor_callback) # do_hypervisor_callback(struct *pt_regs)
movq %rdi, %rsp # we don't return, adjust the stack frame
CFI_ENDPROC
DEFAULT_FRAME
-11: incl %gs:pda_irqcount
+11: incl PER_CPU_VAR(irq_count)
movq %rsp,%rbp
CFI_DEF_CFA_REGISTER rbp
- cmovzq %gs:pda_irqstackptr,%rsp
+ cmovzq PER_CPU_VAR(irq_stack_ptr),%rsp
pushq %rbp # backlink for old unwinder
call xen_evtchn_do_upcall
popq %rsp
CFI_DEF_CFA_REGISTER rsp
- decl %gs:pda_irqcount
+ decl PER_CPU_VAR(irq_count)
jmp error_exit
CFI_ENDPROC
END(do_hypervisor_callback)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index b9a4d8c..af67d32 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -26,27 +26,6 @@
#include <asm/bios_ebda.h>
#include <asm/trampoline.h>

-/* boot cpu pda */
-static struct x8664_pda _boot_cpu_pda;
-
-#ifdef CONFIG_SMP
-/*
- * We install an empty cpu_pda pointer table to indicate to early users
- * (numa_set_node) that the cpu_pda pointer table for cpus other than
- * the boot cpu is not yet setup.
- */
-static struct x8664_pda *__cpu_pda[NR_CPUS] __initdata;
-#else
-static struct x8664_pda *__cpu_pda[NR_CPUS] __read_mostly;
-#endif
-
-void __init x86_64_init_pda(void)
-{
- _cpu_pda = __cpu_pda;
- cpu_pda(0) = &_boot_cpu_pda;
- pda_init(0);
-}
-
static void __init zap_identity_mappings(void)
{
pgd_t *pgd = pgd_offset_k(0UL);
@@ -112,7 +91,7 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");

- x86_64_init_pda();
+ pda_init(0);

x86_64_start_reservations(real_mode_data);
}
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 0e275d4..c8ace88 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -19,6 +19,7 @@
#include <asm/msr.h>
#include <asm/cache.h>
#include <asm/processor-flags.h>
+#include <asm/percpu.h>

#ifdef CONFIG_PARAVIRT
#include <asm/asm-offsets.h>
@@ -204,6 +205,23 @@ ENTRY(secondary_startup_64)
pushq $0
popfq

+#ifdef CONFIG_SMP
+ /*
+ * early_gdt_descr_base should point to the gdt_page in the static
+ * percpu init data area. Computing this requires two symbols -
+ * __per_cpu_load and per_cpu__gdt_page. As the linker can't do such
+ * a relocation, do it by hand. As early_gdt_descr is manipulated by
+ * C code for secondary CPUs, this should be done only once, for the
+ * boot CPU, while early_gdt_descr_base still contains zero.
+ */
+ movq early_gdt_descr_base(%rip), %rax
+ testq %rax, %rax
+ jnz 1f
+ movq $__per_cpu_load, %rax
+ addq $per_cpu__gdt_page, %rax
+ movq %rax, early_gdt_descr_base(%rip)
+1:
+#endif
/*
* We must switch to a new descriptor in kernel space for the GDT
* because soon the kernel won't have access anymore to the userspace
@@ -226,12 +244,18 @@ ENTRY(secondary_startup_64)
movl %eax,%fs
movl %eax,%gs

- /*
- * Setup up a dummy PDA. this is just for some early bootup code
- * that does in_interrupt()
- */
+ /* Set up %gs.
+ *
+ * On SMP, %gs should point to the per-cpu area. For initial
+ * boot, make %gs point to the init data section. For a
+ * secondary CPU, initial_gs should be set to its pda address
+ * before the CPU runs this code.
+ *
+ * On UP, initial_gs points to PER_CPU_VAR(__pda) and doesn't
+ * change.
+ */
movl $MSR_GS_BASE,%ecx
- movq $empty_zero_page,%rax
+ movq initial_gs(%rip),%rax
movq %rax,%rdx
shrq $32,%rdx
wrmsr
@@ -257,6 +281,12 @@ ENTRY(secondary_startup_64)
.align 8
ENTRY(initial_code)
.quad x86_64_start_kernel
+ ENTRY(initial_gs)
+#ifdef CONFIG_SMP
+ .quad __per_cpu_load
+#else
+ .quad PER_CPU_VAR(__pda)
+#endif
__FINITDATA

ENTRY(stack_start)
@@ -401,7 +431,12 @@ NEXT_PAGE(level2_spare_pgt)
.globl early_gdt_descr
early_gdt_descr:
.word GDT_ENTRIES*8-1
- .quad per_cpu__gdt_page
+#ifdef CONFIG_SMP
+early_gdt_descr_base:
+ .quad 0x0000000000000000
+#else
+ .quad per_cpu__gdt_page
+#endif

ENTRY(phys_base)
/* This must match the first entry in level2_kernel_pgt */
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
index 1c4a130..f796603 100644
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -46,6 +46,7 @@
#include <asm/idle.h>
#include <asm/io.h>
#include <asm/smp.h>
+#include <asm/cpu.h>
#include <asm/desc.h>
#include <asm/proto.h>
#include <asm/acpi.h>
@@ -82,11 +83,11 @@ static DEFINE_SPINLOCK(vector_lock);
int nr_ioapic_registers[MAX_IO_APICS];

/* I/O APIC entries */
-struct mp_config_ioapic mp_ioapics[MAX_IO_APICS];
+struct mpc_ioapic mp_ioapics[MAX_IO_APICS];
int nr_ioapics;

/* MP IRQ source entries */
-struct mp_config_intsrc mp_irqs[MAX_IRQ_SOURCES];
+struct mpc_intsrc mp_irqs[MAX_IRQ_SOURCES];

/* # of MP IRQ source entries */
int mp_irq_entries;
@@ -356,7 +357,7 @@ set_extra_move_desc(struct irq_desc *desc, const struct cpumask *mask)

if (!cfg->move_in_progress) {
/* it means that domain is not changed */
- if (!cpumask_intersects(&desc->affinity, mask))
+ if (!cpumask_intersects(desc->affinity, mask))
cfg->move_desc_pending = 1;
}
}
@@ -386,7 +387,7 @@ struct io_apic {
static __attribute_const__ struct io_apic __iomem *io_apic_base(int idx)
{
return (void __iomem *) __fix_to_virt(FIX_IO_APIC_BASE_0 + idx)
- + (mp_ioapics[idx].mp_apicaddr & ~PAGE_MASK);
+ + (mp_ioapics[idx].apicaddr & ~PAGE_MASK);
}

static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
@@ -579,9 +580,9 @@ set_desc_affinity(struct irq_desc *desc, const struct cpumask *mask)
if (assign_irq_vector(irq, cfg, mask))
return BAD_APICID;

- cpumask_and(&desc->affinity, cfg->domain, mask);
+ cpumask_and(desc->affinity, cfg->domain, mask);
set_extra_move_desc(desc, mask);
- return cpu_mask_to_apicid_and(&desc->affinity, cpu_online_mask);
+ return cpu_mask_to_apicid_and(desc->affinity, cpu_online_mask);
}

static void
@@ -944,10 +945,10 @@ static int find_irq_entry(int apic, int pin, int type)
int i;

for (i = 0; i < mp_irq_entries; i++)
- if (mp_irqs[i].mp_irqtype == type &&
- (mp_irqs[i].mp_dstapic == mp_ioapics[apic].mp_apicid ||
- mp_irqs[i].mp_dstapic == MP_APIC_ALL) &&
- mp_irqs[i].mp_dstirq == pin)
+ if (mp_irqs[i].irqtype == type &&
+ (mp_irqs[i].dstapic == mp_ioapics[apic].apicid ||
+ mp_irqs[i].dstapic == MP_APIC_ALL) &&
+ mp_irqs[i].dstirq == pin)
return i;

return -1;
@@ -961,13 +962,13 @@ static int __init find_isa_irq_pin(int irq, int type)
int i;

for (i = 0; i < mp_irq_entries; i++) {
- int lbus = mp_irqs[i].mp_srcbus;
+ int lbus = mp_irqs[i].srcbus;

if (test_bit(lbus, mp_bus_not_pci) &&
- (mp_irqs[i].mp_irqtype == type) &&
- (mp_irqs[i].mp_srcbusirq == irq))
+ (mp_irqs[i].irqtype == type) &&
+ (mp_irqs[i].srcbusirq == irq))

- return mp_irqs[i].mp_dstirq;
+ return mp_irqs[i].dstirq;
}
return -1;
}
@@ -977,17 +978,17 @@ static int __init find_isa_irq_apic(int irq, int type)
int i;

for (i = 0; i < mp_irq_entries; i++) {
- int lbus = mp_irqs[i].mp_srcbus;
+ int lbus = mp_irqs[i].srcbus;

if (test_bit(lbus, mp_bus_not_pci) &&
- (mp_irqs[i].mp_irqtype == type) &&
- (mp_irqs[i].mp_srcbusirq == irq))
+ (mp_irqs[i].irqtype == type) &&
+ (mp_irqs[i].srcbusirq == irq))
break;
}
if (i < mp_irq_entries) {
int apic;
for(apic = 0; apic < nr_ioapics; apic++) {
- if (mp_ioapics[apic].mp_apicid == mp_irqs[i].mp_dstapic)
+ if (mp_ioapics[apic].apicid == mp_irqs[i].dstapic)
return apic;
}
}
@@ -1012,23 +1013,23 @@ int IO_APIC_get_PCI_irq_vector(int bus, int slot, int pin)
return -1;
}
for (i = 0; i < mp_irq_entries; i++) {
- int lbus = mp_irqs[i].mp_srcbus;
+ int lbus = mp_irqs[i].srcbus;

for (apic = 0; apic < nr_ioapics; apic++)
- if (mp_ioapics[apic].mp_apicid == mp_irqs[i].mp_dstapic ||
- mp_irqs[i].mp_dstapic == MP_APIC_ALL)
+ if (mp_ioapics[apic].apicid == mp_irqs[i].dstapic ||
+ mp_irqs[i].dstapic == MP_APIC_ALL)
break;

if (!test_bit(lbus, mp_bus_not_pci) &&
- !mp_irqs[i].mp_irqtype &&
+ !mp_irqs[i].irqtype &&
(bus == lbus) &&
- (slot == ((mp_irqs[i].mp_srcbusirq >> 2) & 0x1f))) {
- int irq = pin_2_irq(i,apic,mp_irqs[i].mp_dstirq);
+ (slot == ((mp_irqs[i].srcbusirq >> 2) & 0x1f))) {
+ int irq = pin_2_irq(i, apic, mp_irqs[i].dstirq);

if (!(apic || IO_APIC_IRQ(irq)))
continue;

- if (pin == (mp_irqs[i].mp_srcbusirq & 3))
+ if (pin == (mp_irqs[i].srcbusirq & 3))
return irq;
/*
* Use the first all-but-pin matching entry as a
@@ -1071,7 +1072,7 @@ static int EISA_ELCR(unsigned int irq)
* EISA conforming in the MP table, that means its trigger type must
* be read in from the ELCR */

-#define default_EISA_trigger(idx) (EISA_ELCR(mp_irqs[idx].mp_srcbusirq))
+#define default_EISA_trigger(idx) (EISA_ELCR(mp_irqs[idx].srcbusirq))
#define default_EISA_polarity(idx) default_ISA_polarity(idx)

/* PCI interrupts are always polarity one level triggered,
@@ -1088,13 +1089,13 @@ static int EISA_ELCR(unsigned int irq)

static int MPBIOS_polarity(int idx)
{
- int bus = mp_irqs[idx].mp_srcbus;
+ int bus = mp_irqs[idx].srcbus;
int polarity;

/*
* Determine IRQ line polarity (high active or low active):
*/
- switch (mp_irqs[idx].mp_irqflag & 3)
+ switch (mp_irqs[idx].irqflag & 3)
{
case 0: /* conforms, ie. bus-type dependent polarity */
if (test_bit(bus, mp_bus_not_pci))
@@ -1130,13 +1131,13 @@ static int MPBIOS_polarity(int idx)

static int MPBIOS_trigger(int idx)
{
- int bus = mp_irqs[idx].mp_srcbus;
+ int bus = mp_irqs[idx].srcbus;
int trigger;

/*
* Determine IRQ trigger mode (edge or level sensitive):
*/
- switch ((mp_irqs[idx].mp_irqflag>>2) & 3)
+ switch ((mp_irqs[idx].irqflag>>2) & 3)
{
case 0: /* conforms, ie. bus-type dependent */
if (test_bit(bus, mp_bus_not_pci))
@@ -1214,16 +1215,16 @@ int (*ioapic_renumber_irq)(int ioapic, int irq);
static int pin_2_irq(int idx, int apic, int pin)
{
int irq, i;
- int bus = mp_irqs[idx].mp_srcbus;
+ int bus = mp_irqs[idx].srcbus;

/*
* Debugging check, we are in big trouble if this message pops up!
*/
- if (mp_irqs[idx].mp_dstirq != pin)
+ if (mp_irqs[idx].dstirq != pin)
printk(KERN_ERR "broken BIOS or MPTABLE parser, ayiee!!\n");

if (test_bit(bus, mp_bus_not_pci)) {
- irq = mp_irqs[idx].mp_srcbusirq;
+ irq = mp_irqs[idx].srcbusirq;
} else {
/*
* PCI IRQs are mapped in order
@@ -1566,14 +1567,14 @@ static void setup_IO_APIC_irq(int apic, int pin, unsigned int irq, struct irq_de
apic_printk(APIC_VERBOSE,KERN_DEBUG
"IOAPIC[%d]: Set routing entry (%d-%d -> 0x%x -> "
"IRQ %d Mode:%i Active:%i)\n",
- apic, mp_ioapics[apic].mp_apicid, pin, cfg->vector,
+ apic, mp_ioapics[apic].apicid, pin, cfg->vector,
irq, trigger, polarity);


- if (setup_ioapic_entry(mp_ioapics[apic].mp_apicid, irq, &entry,
+ if (setup_ioapic_entry(mp_ioapics[apic].apicid, irq, &entry,
dest, trigger, polarity, cfg->vector)) {
printk("Failed to setup ioapic entry for ioapic %d, pin %d\n",
- mp_ioapics[apic].mp_apicid, pin);
+ mp_ioapics[apic].apicid, pin);
__clear_irq_vector(irq, cfg);
return;
}
@@ -1604,12 +1605,10 @@ static void __init setup_IO_APIC_irqs(void)
notcon = 1;
apic_printk(APIC_VERBOSE,
KERN_DEBUG " %d-%d",
- mp_ioapics[apic].mp_apicid,
- pin);
+ mp_ioapics[apic].apicid, pin);
} else
apic_printk(APIC_VERBOSE, " %d-%d",
- mp_ioapics[apic].mp_apicid,
- pin);
+ mp_ioapics[apic].apicid, pin);
continue;
}
if (notcon) {
@@ -1699,7 +1698,7 @@ __apicdebuginit(void) print_IO_APIC(void)
printk(KERN_DEBUG "number of MP IRQ sources: %d.\n", mp_irq_entries);
for (i = 0; i < nr_ioapics; i++)
printk(KERN_DEBUG "number of IO-APIC #%d registers: %d.\n",
- mp_ioapics[i].mp_apicid, nr_ioapic_registers[i]);
+ mp_ioapics[i].apicid, nr_ioapic_registers[i]);

/*
* We are a bit conservative about what we expect. We have to
@@ -1719,7 +1718,7 @@ __apicdebuginit(void) print_IO_APIC(void)
spin_unlock_irqrestore(&ioapic_lock, flags);

printk("\n");
- printk(KERN_DEBUG "IO APIC #%d......\n", mp_ioapics[apic].mp_apicid);
+ printk(KERN_DEBUG "IO APIC #%d......\n", mp_ioapics[apic].apicid);
printk(KERN_DEBUG ".... register #00: %08X\n", reg_00.raw);
printk(KERN_DEBUG "....... : physical APIC id: %02X\n", reg_00.bits.ID);
printk(KERN_DEBUG "....... : Delivery Type: %X\n", reg_00.bits.delivery_type);
@@ -2121,14 +2120,14 @@ static void __init setup_ioapic_ids_from_mpc(void)
reg_00.raw = io_apic_read(apic, 0);
spin_unlock_irqrestore(&ioapic_lock, flags);

- old_id = mp_ioapics[apic].mp_apicid;
+ old_id = mp_ioapics[apic].apicid;

- if (mp_ioapics[apic].mp_apicid >= get_physical_broadcast()) {
+ if (mp_ioapics[apic].apicid >= get_physical_broadcast()) {
printk(KERN_ERR "BIOS bug, IO-APIC#%d ID is %d in the MPC table!...\n",
- apic, mp_ioapics[apic].mp_apicid);
+ apic, mp_ioapics[apic].apicid);
printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n",
reg_00.bits.ID);
- mp_ioapics[apic].mp_apicid = reg_00.bits.ID;
+ mp_ioapics[apic].apicid = reg_00.bits.ID;
}

/*
@@ -2137,9 +2136,9 @@ static void __init setup_ioapic_ids_from_mpc(void)
* 'stuck on smp_invalidate_needed IPI wait' messages.
*/
if (check_apicid_used(phys_id_present_map,
- mp_ioapics[apic].mp_apicid)) {
+ mp_ioapics[apic].apicid)) {
printk(KERN_ERR "BIOS bug, IO-APIC#%d ID %d is already used!...\n",
- apic, mp_ioapics[apic].mp_apicid);
+ apic, mp_ioapics[apic].apicid);
for (i = 0; i < get_physical_broadcast(); i++)
if (!physid_isset(i, phys_id_present_map))
break;
@@ -2148,13 +2147,13 @@ static void __init setup_ioapic_ids_from_mpc(void)
printk(KERN_ERR "... fixing up to %d. (tell your hw vendor)\n",
i);
physid_set(i, phys_id_present_map);
- mp_ioapics[apic].mp_apicid = i;
+ mp_ioapics[apic].apicid = i;
} else {
physid_mask_t tmp;
- tmp = apicid_to_cpu_present(mp_ioapics[apic].mp_apicid);
+ tmp = apicid_to_cpu_present(mp_ioapics[apic].apicid);
apic_printk(APIC_VERBOSE, "Setting %d in the "
"phys_id_present_map\n",
- mp_ioapics[apic].mp_apicid);
+ mp_ioapics[apic].apicid);
physids_or(phys_id_present_map, phys_id_present_map, tmp);
}

@@ -2163,11 +2162,11 @@ static void __init setup_ioapic_ids_from_mpc(void)
* We need to adjust the IRQ routing table
* if the ID changed.
*/
- if (old_id != mp_ioapics[apic].mp_apicid)
+ if (old_id != mp_ioapics[apic].apicid)
for (i = 0; i < mp_irq_entries; i++)
- if (mp_irqs[i].mp_dstapic == old_id)
- mp_irqs[i].mp_dstapic
- = mp_ioapics[apic].mp_apicid;
+ if (mp_irqs[i].dstapic == old_id)
+ mp_irqs[i].dstapic
+ = mp_ioapics[apic].apicid;

/*
* Read the right value from the MPC table and
@@ -2175,9 +2174,9 @@ static void __init setup_ioapic_ids_from_mpc(void)
*/
apic_printk(APIC_VERBOSE, KERN_INFO
"...changing IO-APIC physical APIC ID to %d ...",
- mp_ioapics[apic].mp_apicid);
+ mp_ioapics[apic].apicid);

- reg_00.bits.ID = mp_ioapics[apic].mp_apicid;
+ reg_00.bits.ID = mp_ioapics[apic].apicid;
spin_lock_irqsave(&ioapic_lock, flags);
io_apic_write(apic, 0, reg_00.raw);
spin_unlock_irqrestore(&ioapic_lock, flags);
@@ -2188,7 +2187,7 @@ static void __init setup_ioapic_ids_from_mpc(void)
spin_lock_irqsave(&ioapic_lock, flags);
reg_00.raw = io_apic_read(apic, 0);
spin_unlock_irqrestore(&ioapic_lock, flags);
- if (reg_00.bits.ID != mp_ioapics[apic].mp_apicid)
+ if (reg_00.bits.ID != mp_ioapics[apic].apicid)
printk("could not set ID!\n");
else
apic_printk(APIC_VERBOSE, " ok.\n");
@@ -2383,7 +2382,7 @@ migrate_ioapic_irq_desc(struct irq_desc *desc, const struct cpumask *mask)
if (cfg->move_in_progress)
send_cleanup_vector(cfg);

- cpumask_copy(&desc->affinity, mask);
+ cpumask_copy(desc->affinity, mask);
}

static int migrate_irq_remapped_level_desc(struct irq_desc *desc)
@@ -2405,11 +2404,11 @@ static int migrate_irq_remapped_level_desc(struct irq_desc *desc)
}

/* everthing is clear. we have right of way */
- migrate_ioapic_irq_desc(desc, &desc->pending_mask);
+ migrate_ioapic_irq_desc(desc, desc->pending_mask);

ret = 0;
desc->status &= ~IRQ_MOVE_PENDING;
- cpumask_clear(&desc->pending_mask);
+ cpumask_clear(desc->pending_mask);

unmask:
unmask_IO_APIC_irq_desc(desc);
@@ -2434,7 +2433,7 @@ static void ir_irq_migration(struct work_struct *work)
continue;
}

- desc->chip->set_affinity(irq, &desc->pending_mask);
+ desc->chip->set_affinity(irq, desc->pending_mask);
spin_unlock_irqrestore(&desc->lock, flags);
}
}
@@ -2448,7 +2447,7 @@ static void set_ir_ioapic_affinity_irq_desc(struct irq_desc *desc,
{
if (desc->status & IRQ_LEVEL) {
desc->status |= IRQ_MOVE_PENDING;
- cpumask_copy(&desc->pending_mask, mask);
+ cpumask_copy(desc->pending_mask, mask);
migrate_irq_remapped_level_desc(desc);
return;
}
@@ -2516,7 +2515,7 @@ static void irq_complete_move(struct irq_desc **descp)

/* domain has not changed, but affinity did */
me = smp_processor_id();
- if (cpu_isset(me, desc->affinity)) {
+ if (cpumask_test_cpu(me, desc->affinity)) {
*descp = desc = move_irq_desc(desc, me);
/* get the new one */
cfg = desc->chip_data;
@@ -3117,8 +3116,8 @@ static int ioapic_resume(struct sys_device *dev)

spin_lock_irqsave(&ioapic_lock, flags);
reg_00.raw = io_apic_read(dev->id, 0);
- if (reg_00.bits.ID != mp_ioapics[dev->id].mp_apicid) {
- reg_00.bits.ID = mp_ioapics[dev->id].mp_apicid;
+ if (reg_00.bits.ID != mp_ioapics[dev->id].apicid) {
+ reg_00.bits.ID = mp_ioapics[dev->id].apicid;
io_apic_write(dev->id, 0, reg_00.raw);
}
spin_unlock_irqrestore(&ioapic_lock, flags);
@@ -3183,7 +3182,7 @@ unsigned int create_irq_nr(unsigned int irq_want)

irq = 0;
spin_lock_irqsave(&vector_lock, flags);
- for (new = irq_want; new < NR_IRQS; new++) {
+ for (new = irq_want; new < nr_irqs; new++) {
if (platform_legacy_irq(new))
continue;

@@ -3258,6 +3257,9 @@ static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_ms
int err;
unsigned dest;

+ if (disable_apic)
+ return -ENXIO;
+
cfg = irq_cfg(irq);
err = assign_irq_vector(irq, cfg, TARGET_CPUS);
if (err)
@@ -3726,6 +3728,9 @@ int arch_setup_ht_irq(unsigned int irq, struct pci_dev *dev)
struct irq_cfg *cfg;
int err;

+ if (disable_apic)
+ return -ENXIO;
+
cfg = irq_cfg(irq);
err = assign_irq_vector(irq, cfg, TARGET_CPUS);
if (!err) {
@@ -3850,6 +3855,22 @@ void __init probe_nr_irqs_gsi(void)
nr_irqs_gsi = nr;
}

+#ifdef CONFIG_SPARSE_IRQ
+int __init arch_probe_nr_irqs(void)
+{
+ int nr;
+
+ nr = ((8 * nr_cpu_ids) > (32 * nr_ioapics) ?
+ (NR_VECTORS + (8 * nr_cpu_ids)) :
+ (NR_VECTORS + (32 * nr_ioapics)));
+
+ if (nr < nr_irqs && nr > nr_irqs_gsi)
+ nr_irqs = nr;
+
+ return 0;
+}
+#endif
+
/* --------------------------------------------------------------------------
ACPI-based IOAPIC Configuration
-------------------------------------------------------------------------- */
@@ -3984,8 +4005,8 @@ int acpi_get_override_irq(int bus_irq, int *trigger, int *polarity)
return -1;

for (i = 0; i < mp_irq_entries; i++)
- if (mp_irqs[i].mp_irqtype == mp_INT &&
- mp_irqs[i].mp_srcbusirq == bus_irq)
+ if (mp_irqs[i].irqtype == mp_INT &&
+ mp_irqs[i].srcbusirq == bus_irq)
break;
if (i >= mp_irq_entries)
return -1;
@@ -4039,7 +4060,7 @@ void __init setup_ioapic_dest(void)
*/
if (desc->status &
(IRQ_NO_BALANCING | IRQ_AFFINITY_SET))
- mask = &desc->affinity;
+ mask = desc->affinity;
else
mask = TARGET_CPUS;

@@ -4100,7 +4121,7 @@ void __init ioapic_init_mappings(void)
ioapic_res = ioapic_setup_resources();
for (i = 0; i < nr_ioapics; i++) {
if (smp_found_config) {
- ioapic_phys = mp_ioapics[i].mp_apicaddr;
+ ioapic_phys = mp_ioapics[i].apicaddr;
#ifdef CONFIG_X86_32
if (!ioapic_phys) {
printk(KERN_ERR
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 3973e2d..a6bca1d 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -36,11 +36,7 @@ void ack_bad_irq(unsigned int irq)
#endif
}

-#ifdef CONFIG_X86_32
-# define irq_stats(x) (&per_cpu(irq_stat, x))
-#else
-# define irq_stats(x) cpu_pda(x)
-#endif
+#define irq_stats(x) (&per_cpu(irq_stat, x))
/*
* /proc/interrupts printing:
*/
@@ -57,6 +53,10 @@ static int show_other_interrupts(struct seq_file *p)
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->apic_timer_irqs);
seq_printf(p, " Local timer interrupts\n");
+ seq_printf(p, "CNT: ");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", irq_stats(j)->apic_perf_irqs);
+ seq_printf(p, " Performance counter interrupts\n");
#endif
#ifdef CONFIG_SMP
seq_printf(p, "RES: ");
@@ -164,6 +164,7 @@ u64 arch_irq_stat_cpu(unsigned int cpu)

#ifdef CONFIG_X86_LOCAL_APIC
sum += irq_stats(cpu)->apic_timer_irqs;
+ sum += irq_stats(cpu)->apic_perf_irqs;
#endif
#ifdef CONFIG_SMP
sum += irq_stats(cpu)->irq_resched_count;
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
index 74b9ff7..e0f29be 100644
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -248,7 +248,7 @@ void fixup_irqs(void)
if (irq == 2)
continue;

- affinity = &desc->affinity;
+ affinity = desc->affinity;
if (cpumask_any_and(affinity, cpu_online_mask) >= nr_cpu_ids) {
printk("Breaking affinity for irq %i\n", irq);
affinity = cpu_all_mask;
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 63c88e6..1db0524 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -19,6 +19,9 @@
#include <asm/io_apic.h>
#include <asm/idle.h>

+DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
+EXPORT_PER_CPU_SYMBOL(irq_stat);
+
/*
* Probabilistic stack overflow check:
*
@@ -100,7 +103,7 @@ void fixup_irqs(void)
/* interrupt's are disabled at this point */
spin_lock(&desc->lock);

- affinity = &desc->affinity;
+ affinity = desc->affinity;
if (!irq_has_action(irq) ||
cpumask_equal(affinity, cpu_online_mask)) {
spin_unlock(&desc->lock);
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
index 1507ad4..0bef628 100644
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -171,6 +171,9 @@ void __init native_init_IRQ(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+# ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+# endif
#endif

#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_MCE_P4THERMAL)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
index da481a1..6a71bfc 100644
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -150,6 +150,11 @@ static void __init apic_intr_init(void)
/* IPI vectors for APIC spurious and error interrupts */
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);
+
+ /* Performance monitoring interrupt: */
+#ifdef CONFIG_PERF_COUNTERS
+ alloc_intr_gate(LOCAL_PERF_VECTOR, perf_counter_interrupt);
+#endif
}

void __init native_init_IRQ(void)
diff --git a/arch/x86/kernel/microcode_intel.c b/arch/x86/kernel/microcode_intel.c
index b7f4c92..5e9f4fc 100644
--- a/arch/x86/kernel/microcode_intel.c
+++ b/arch/x86/kernel/microcode_intel.c
@@ -87,9 +87,9 @@
#include <linux/cpu.h>
#include <linux/firmware.h>
#include <linux/platform_device.h>
+#include <linux/uaccess.h>

#include <asm/msr.h>
-#include <asm/uaccess.h>
#include <asm/processor.h>
#include <asm/microcode.h>

@@ -196,7 +196,7 @@ static inline int update_match_cpu(struct cpu_signature *csig, int sig, int pf)
return (!sigmatch(sig, csig->sig, pf, csig->pf)) ? 0 : 1;
}

-static inline int
+static inline int
update_match_revision(struct microcode_header_intel *mc_header, int rev)
{
return (mc_header->rev <= rev) ? 0 : 1;
@@ -442,8 +442,8 @@ static int request_microcode_fw(int cpu, struct device *device)
return ret;
}

- ret = generic_load_microcode(cpu, (void*)firmware->data, firmware->size,
- &get_ucode_fw);
+ ret = generic_load_microcode(cpu, (void *)firmware->data,
+ firmware->size, &get_ucode_fw);

release_firmware(firmware);

@@ -460,7 +460,7 @@ static int request_microcode_user(int cpu, const void __user *buf, size_t size)
/* We should bind the task to the CPU */
BUG_ON(cpu != raw_smp_processor_id());

- return generic_load_microcode(cpu, (void*)buf, size, &get_ucode_user);
+ return generic_load_microcode(cpu, (void *)buf, size, &get_ucode_user);
}

static void microcode_fini_cpu(int cpu)
diff --git a/arch/x86/kernel/module_32.c b/arch/x86/kernel/module_32.c
index 3db0a54..0edd819 100644
--- a/arch/x86/kernel/module_32.c
+++ b/arch/x86/kernel/module_32.c
@@ -42,7 +42,7 @@ void module_free(struct module *mod, void *module_region)
{
vfree(module_region);
/* FIXME: If module_region == mod->init_region, trim exception
- table entries. */
+ table entries. */
}

/* We don't need anything special. */
@@ -113,13 +113,13 @@ int module_finalize(const Elf_Ehdr *hdr,
*para = NULL;
char *secstrings = (void *)hdr + sechdrs[hdr->e_shstrndx].sh_offset;

- for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
+ for (s = sechdrs; s < sechdrs + hdr->e_shnum; s++) {
if (!strcmp(".text", secstrings + s->sh_name))
text = s;
if (!strcmp(".altinstructions", secstrings + s->sh_name))
alt = s;
if (!strcmp(".smp_locks", secstrings + s->sh_name))
- locks= s;
+ locks = s;
if (!strcmp(".parainstructions", secstrings + s->sh_name))
para = s;
}
diff --git a/arch/x86/kernel/module_64.c b/arch/x86/kernel/module_64.c
index 6ba8783..c23880b 100644
--- a/arch/x86/kernel/module_64.c
+++ b/arch/x86/kernel/module_64.c
@@ -30,14 +30,14 @@
#include <asm/page.h>
#include <asm/pgtable.h>

-#define DEBUGP(fmt...)
+#define DEBUGP(fmt...)

#ifndef CONFIG_UML
void module_free(struct module *mod, void *module_region)
{
vfree(module_region);
/* FIXME: If module_region == mod->init_region, trim exception
- table entries. */
+ table entries. */
}

void *module_alloc(unsigned long size)
@@ -77,7 +77,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
Elf64_Rela *rel = (void *)sechdrs[relsec].sh_addr;
Elf64_Sym *sym;
void *loc;
- u64 val;
+ u64 val;

DEBUGP("Applying relocate section %u to %u\n", relsec,
sechdrs[relsec].sh_info);
@@ -91,11 +91,11 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
sym = (Elf64_Sym *)sechdrs[symindex].sh_addr
+ ELF64_R_SYM(rel[i].r_info);

- DEBUGP("type %d st_value %Lx r_addend %Lx loc %Lx\n",
- (int)ELF64_R_TYPE(rel[i].r_info),
- sym->st_value, rel[i].r_addend, (u64)loc);
+ DEBUGP("type %d st_value %Lx r_addend %Lx loc %Lx\n",
+ (int)ELF64_R_TYPE(rel[i].r_info),
+ sym->st_value, rel[i].r_addend, (u64)loc);

- val = sym->st_value + rel[i].r_addend;
+ val = sym->st_value + rel[i].r_addend;

switch (ELF64_R_TYPE(rel[i].r_info)) {
case R_X86_64_NONE:
@@ -113,16 +113,16 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
if ((s64)val != *(s32 *)loc)
goto overflow;
break;
- case R_X86_64_PC32:
+ case R_X86_64_PC32:
val -= (u64)loc;
*(u32 *)loc = val;
#if 0
if ((s64)val != *(s32 *)loc)
- goto overflow;
+ goto overflow;
#endif
break;
default:
- printk(KERN_ERR "module %s: Unknown rela relocation: %Lu\n",
+ printk(KERN_ERR "module %s: Unknown rela relocation: %llu\n",
me->name, ELF64_R_TYPE(rel[i].r_info));
return -ENOEXEC;
}
@@ -130,7 +130,7 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
return 0;

overflow:
- printk(KERN_ERR "overflow in relocation type %d val %Lx\n",
+ printk(KERN_ERR "overflow in relocation type %d val %Lx\n",
(int)ELF64_R_TYPE(rel[i].r_info), val);
printk(KERN_ERR "`%s' likely not compiled with -mcmodel=kernel\n",
me->name);
@@ -143,13 +143,13 @@ int apply_relocate(Elf_Shdr *sechdrs,
unsigned int relsec,
struct module *me)
{
- printk("non add relocation not supported\n");
+ printk(KERN_ERR "non add relocation not supported\n");
return -ENOSYS;
-}
+}

int module_finalize(const Elf_Ehdr *hdr,
- const Elf_Shdr *sechdrs,
- struct module *me)
+ const Elf_Shdr *sechdrs,
+ struct module *me)
{
const Elf_Shdr *s, *text = NULL, *alt = NULL, *locks = NULL,
*para = NULL;
@@ -161,7 +161,7 @@ int module_finalize(const Elf_Ehdr *hdr,
if (!strcmp(".altinstructions", secstrings + s->sh_name))
alt = s;
if (!strcmp(".smp_locks", secstrings + s->sh_name))
- locks= s;
+ locks = s;
if (!strcmp(".parainstructions", secstrings + s->sh_name))
para = s;
}
diff --git a/arch/x86/kernel/mpparse.c b/arch/x86/kernel/mpparse.c
index a649a4c..fa6bb26 100644
--- a/arch/x86/kernel/mpparse.c
+++ b/arch/x86/kernel/mpparse.c
@@ -144,11 +144,11 @@ static void __init MP_ioapic_info(struct mpc_ioapic *m)
if (bad_ioapic(m->apicaddr))
return;

- mp_ioapics[nr_ioapics].mp_apicaddr = m->apicaddr;
- mp_ioapics[nr_ioapics].mp_apicid = m->apicid;
- mp_ioapics[nr_ioapics].mp_type = m->type;
- mp_ioapics[nr_ioapics].mp_apicver = m->apicver;
- mp_ioapics[nr_ioapics].mp_flags = m->flags;
+ mp_ioapics[nr_ioapics].apicaddr = m->apicaddr;
+ mp_ioapics[nr_ioapics].apicid = m->apicid;
+ mp_ioapics[nr_ioapics].type = m->type;
+ mp_ioapics[nr_ioapics].apicver = m->apicver;
+ mp_ioapics[nr_ioapics].flags = m->flags;
nr_ioapics++;
}

@@ -160,55 +160,55 @@ static void print_MP_intsrc_info(struct mpc_intsrc *m)
m->srcbusirq, m->dstapic, m->dstirq);
}

-static void __init print_mp_irq_info(struct mp_config_intsrc *mp_irq)
+static void __init print_mp_irq_info(struct mpc_intsrc *mp_irq)
{
apic_printk(APIC_VERBOSE, "Int: type %d, pol %d, trig %d, bus %02x,"
" IRQ %02x, APIC ID %x, APIC INT %02x\n",
- mp_irq->mp_irqtype, mp_irq->mp_irqflag & 3,
- (mp_irq->mp_irqflag >> 2) & 3, mp_irq->mp_srcbus,
- mp_irq->mp_srcbusirq, mp_irq->mp_dstapic, mp_irq->mp_dstirq);
+ mp_irq->irqtype, mp_irq->irqflag & 3,
+ (mp_irq->irqflag >> 2) & 3, mp_irq->srcbus,
+ mp_irq->srcbusirq, mp_irq->dstapic, mp_irq->dstirq);
}

static void __init assign_to_mp_irq(struct mpc_intsrc *m,
- struct mp_config_intsrc *mp_irq)
+ struct mpc_intsrc *mp_irq)
{
- mp_irq->mp_dstapic = m->dstapic;
- mp_irq->mp_type = m->type;
- mp_irq->mp_irqtype = m->irqtype;
- mp_irq->mp_irqflag = m->irqflag;
- mp_irq->mp_srcbus = m->srcbus;
- mp_irq->mp_srcbusirq = m->srcbusirq;
- mp_irq->mp_dstirq = m->dstirq;
+ mp_irq->dstapic = m->dstapic;
+ mp_irq->type = m->type;
+ mp_irq->irqtype = m->irqtype;
+ mp_irq->irqflag = m->irqflag;
+ mp_irq->srcbus = m->srcbus;
+ mp_irq->srcbusirq = m->srcbusirq;
+ mp_irq->dstirq = m->dstirq;
}

-static void __init assign_to_mpc_intsrc(struct mp_config_intsrc *mp_irq,
+static void __init assign_to_mpc_intsrc(struct mpc_intsrc *mp_irq,
struct mpc_intsrc *m)
{
- m->dstapic = mp_irq->mp_dstapic;
- m->type = mp_irq->mp_type;
- m->irqtype = mp_irq->mp_irqtype;
- m->irqflag = mp_irq->mp_irqflag;
- m->srcbus = mp_irq->mp_srcbus;
- m->srcbusirq = mp_irq->mp_srcbusirq;
- m->dstirq = mp_irq->mp_dstirq;
+ m->dstapic = mp_irq->dstapic;
+ m->type = mp_irq->type;
+ m->irqtype = mp_irq->irqtype;
+ m->irqflag = mp_irq->irqflag;
+ m->srcbus = mp_irq->srcbus;
+ m->srcbusirq = mp_irq->srcbusirq;
+ m->dstirq = mp_irq->dstirq;
}

-static int __init mp_irq_mpc_intsrc_cmp(struct mp_config_intsrc *mp_irq,
+static int __init mp_irq_mpc_intsrc_cmp(struct mpc_intsrc *mp_irq,
struct mpc_intsrc *m)
{
- if (mp_irq->mp_dstapic != m->dstapic)
+ if (mp_irq->dstapic != m->dstapic)
return 1;
- if (mp_irq->mp_type != m->type)
+ if (mp_irq->type != m->type)
return 2;
- if (mp_irq->mp_irqtype != m->irqtype)
+ if (mp_irq->irqtype != m->irqtype)
return 3;
- if (mp_irq->mp_irqflag != m->irqflag)
+ if (mp_irq->irqflag != m->irqflag)
return 4;
- if (mp_irq->mp_srcbus != m->srcbus)
+ if (mp_irq->srcbus != m->srcbus)
return 5;
- if (mp_irq->mp_srcbusirq != m->srcbusirq)
+ if (mp_irq->srcbusirq != m->srcbusirq)
return 6;
- if (mp_irq->mp_dstirq != m->dstirq)
+ if (mp_irq->dstirq != m->dstirq)
return 7;

return 0;
@@ -417,7 +417,7 @@ static void __init construct_default_ioirq_mptable(int mpc_default_type)
intsrc.type = MP_INTSRC;
intsrc.irqflag = 0; /* conforming */
intsrc.srcbus = 0;
- intsrc.dstapic = mp_ioapics[0].mp_apicid;
+ intsrc.dstapic = mp_ioapics[0].apicid;

intsrc.irqtype = mp_INT;

@@ -570,14 +570,14 @@ static inline void __init construct_default_ISA_mptable(int mpc_default_type)
}
}

-static struct intel_mp_floating *mpf_found;
+static struct mpf_intel *mpf_found;

/*
* Scan the memory blocks for an SMP configuration block.
*/
static void __init __get_smp_config(unsigned int early)
{
- struct intel_mp_floating *mpf = mpf_found;
+ struct mpf_intel *mpf = mpf_found;

if (!mpf)
return;
@@ -598,9 +598,9 @@ static void __init __get_smp_config(unsigned int early)
}

printk(KERN_INFO "Intel MultiProcessor Specification v1.%d\n",
- mpf->mpf_specification);
+ mpf->specification);
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86_32)
- if (mpf->mpf_feature2 & (1 << 7)) {
+ if (mpf->feature2 & (1 << 7)) {
printk(KERN_INFO " IMCR and PIC compatibility mode.\n");
pic_mode = 1;
} else {
@@ -611,7 +611,7 @@ static void __init __get_smp_config(unsigned int early)
/*
* Now see if we need to read further.
*/
- if (mpf->mpf_feature1 != 0) {
+ if (mpf->feature1 != 0) {
if (early) {
/*
* local APIC has default address
@@ -621,16 +621,16 @@ static void __init __get_smp_config(unsigned int early)
}

printk(KERN_INFO "Default MP configuration #%d\n",
- mpf->mpf_feature1);
- construct_default_ISA_mptable(mpf->mpf_feature1);
+ mpf->feature1);
+ construct_default_ISA_mptable(mpf->feature1);

- } else if (mpf->mpf_physptr) {
+ } else if (mpf->physptr) {

/*
* Read the physical hardware table. Anything here will
* override the defaults.
*/
- if (!smp_read_mpc(phys_to_virt(mpf->mpf_physptr), early)) {
+ if (!smp_read_mpc(phys_to_virt(mpf->physptr), early)) {
#ifdef CONFIG_X86_LOCAL_APIC
smp_found_config = 0;
#endif
@@ -688,19 +688,19 @@ static int __init smp_scan_config(unsigned long base, unsigned long length,
unsigned reserve)
{
unsigned int *bp = phys_to_virt(base);
- struct intel_mp_floating *mpf;
+ struct mpf_intel *mpf;

apic_printk(APIC_VERBOSE, "Scan SMP from %p for %ld bytes.\n",
bp, length);
BUILD_BUG_ON(sizeof(*mpf) != 16);

while (length > 0) {
- mpf = (struct intel_mp_floating *)bp;
+ mpf = (struct mpf_intel *)bp;
if ((*bp == SMP_MAGIC_IDENT) &&
- (mpf->mpf_length == 1) &&
+ (mpf->length == 1) &&
!mpf_checksum((unsigned char *)bp, 16) &&
- ((mpf->mpf_specification == 1)
- || (mpf->mpf_specification == 4))) {
+ ((mpf->specification == 1)
+ || (mpf->specification == 4))) {
#ifdef CONFIG_X86_LOCAL_APIC
smp_found_config = 1;
#endif
@@ -713,7 +713,7 @@ static int __init smp_scan_config(unsigned long base, unsigned long length,
return 1;
reserve_bootmem_generic(virt_to_phys(mpf), PAGE_SIZE,
BOOTMEM_DEFAULT);
- if (mpf->mpf_physptr) {
+ if (mpf->physptr) {
unsigned long size = PAGE_SIZE;
#ifdef CONFIG_X86_32
/*
@@ -722,14 +722,14 @@ static int __init smp_scan_config(unsigned long base, unsigned long length,
* the bottom is mapped now.
* PC-9800's MPC table places on the very last
* of physical memory; so that simply reserving
- * PAGE_SIZE from mpg->mpf_physptr yields BUG()
+ * PAGE_SIZE from mpf->physptr yields BUG()
* in reserve_bootmem.
*/
unsigned long end = max_low_pfn * PAGE_SIZE;
- if (mpf->mpf_physptr + size > end)
- size = end - mpf->mpf_physptr;
+ if (mpf->physptr + size > end)
+ size = end - mpf->physptr;
#endif
- reserve_bootmem_generic(mpf->mpf_physptr, size,
+ reserve_bootmem_generic(mpf->physptr, size,
BOOTMEM_DEFAULT);
}

@@ -809,15 +809,15 @@ static int __init get_MP_intsrc_index(struct mpc_intsrc *m)
/* not legacy */

for (i = 0; i < mp_irq_entries; i++) {
- if (mp_irqs[i].mp_irqtype != mp_INT)
+ if (mp_irqs[i].irqtype != mp_INT)
continue;

- if (mp_irqs[i].mp_irqflag != 0x0f)
+ if (mp_irqs[i].irqflag != 0x0f)
continue;

- if (mp_irqs[i].mp_srcbus != m->srcbus)
+ if (mp_irqs[i].srcbus != m->srcbus)
continue;
- if (mp_irqs[i].mp_srcbusirq != m->srcbusirq)
+ if (mp_irqs[i].srcbusirq != m->srcbusirq)
continue;
if (irq_used[i]) {
/* already claimed */
@@ -922,10 +922,10 @@ static int __init replace_intsrc_all(struct mpc_table *mpc,
if (irq_used[i])
continue;

- if (mp_irqs[i].mp_irqtype != mp_INT)
+ if (mp_irqs[i].irqtype != mp_INT)
continue;

- if (mp_irqs[i].mp_irqflag != 0x0f)
+ if (mp_irqs[i].irqflag != 0x0f)
continue;

if (nr_m_spare > 0) {
@@ -1001,7 +1001,7 @@ static int __init update_mp_table(void)
{
char str[16];
char oem[10];
- struct intel_mp_floating *mpf;
+ struct mpf_intel *mpf;
struct mpc_table *mpc, *mpc_new;

if (!enable_update_mptable)
@@ -1014,19 +1014,19 @@ static int __init update_mp_table(void)
/*
* Now see if we need to go further.
*/
- if (mpf->mpf_feature1 != 0)
+ if (mpf->feature1 != 0)
return 0;

- if (!mpf->mpf_physptr)
+ if (!mpf->physptr)
return 0;

- mpc = phys_to_virt(mpf->mpf_physptr);
+ mpc = phys_to_virt(mpf->physptr);

if (!smp_check_mpc(mpc, oem, str))
return 0;

printk(KERN_INFO "mpf: %lx\n", virt_to_phys(mpf));
- printk(KERN_INFO "mpf_physptr: %x\n", mpf->mpf_physptr);
+ printk(KERN_INFO "physptr: %x\n", mpf->physptr);

if (mpc_new_phys && mpc->length > mpc_new_length) {
mpc_new_phys = 0;
@@ -1047,23 +1047,23 @@ static int __init update_mp_table(void)
}
printk(KERN_INFO "use in-positon replacing\n");
} else {
- mpf->mpf_physptr = mpc_new_phys;
+ mpf->physptr = mpc_new_phys;
mpc_new = phys_to_virt(mpc_new_phys);
memcpy(mpc_new, mpc, mpc->length);
mpc = mpc_new;
/* check if we can modify that */
- if (mpc_new_phys - mpf->mpf_physptr) {
- struct intel_mp_floating *mpf_new;
+ if (mpc_new_phys - mpf->physptr) {
+ struct mpf_intel *mpf_new;
/* steal 16 bytes from [0, 1k) */
printk(KERN_INFO "mpf new: %x\n", 0x400 - 16);
mpf_new = phys_to_virt(0x400 - 16);
memcpy(mpf_new, mpf, 16);
mpf = mpf_new;
- mpf->mpf_physptr = mpc_new_phys;
+ mpf->physptr = mpc_new_phys;
}
- mpf->mpf_checksum = 0;
- mpf->mpf_checksum -= mpf_checksum((unsigned char *)mpf, 16);
- printk(KERN_INFO "mpf_physptr new: %x\n", mpf->mpf_physptr);
+ mpf->checksum = 0;
+ mpf->checksum -= mpf_checksum((unsigned char *)mpf, 16);
+ printk(KERN_INFO "physptr new: %x\n", mpf->physptr);
}

/*
diff --git a/arch/x86/kernel/msr.c b/arch/x86/kernel/msr.c
index 7262666..3cf3413 100644
--- a/arch/x86/kernel/msr.c
+++ b/arch/x86/kernel/msr.c
@@ -35,10 +35,10 @@
#include <linux/device.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
+#include <linux/uaccess.h>

#include <asm/processor.h>
#include <asm/msr.h>
-#include <asm/uaccess.h>
#include <asm/system.h>

static struct class *msr_class;
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 7228979..23b6d9e 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -61,11 +61,7 @@ static int endflag __initdata;

static inline unsigned int get_nmi_count(int cpu)
{
-#ifdef CONFIG_X86_64
- return cpu_pda(cpu)->__nmi_count;
-#else
- return nmi_count(cpu);
-#endif
+ return per_cpu(irq_stat, cpu).__nmi_count;
}

static inline int mce_in_progress(void)
@@ -82,12 +78,8 @@ static inline int mce_in_progress(void)
*/
static inline unsigned int get_timer_irqs(int cpu)
{
-#ifdef CONFIG_X86_64
- return read_pda(apic_timer_irqs) + read_pda(irq0_irqs);
-#else
return per_cpu(irq_stat, cpu).apic_timer_irqs +
per_cpu(irq_stat, cpu).irq0_irqs;
-#endif
}

#ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index a546f55..2c00a57 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -66,9 +66,6 @@ asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
EXPORT_PER_CPU_SYMBOL(current_task);

-DEFINE_PER_CPU(int, cpu_number);
-EXPORT_PER_CPU_SYMBOL(cpu_number);
-
/*
* Return saved PC of a blocked thread.
*/
@@ -591,7 +588,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (prev->gs | next->gs)
loadsegment(gs, next->gs);

- x86_write_percpu(current_task, next_p);
+ percpu_write(current_task, next_p);

return prev_p;
}
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 416fb92..4523ff8 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -57,6 +57,12 @@

asmlinkage extern void ret_from_fork(void);

+DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+
+DEFINE_PER_CPU(unsigned long, old_rsp);
+static DEFINE_PER_CPU(unsigned char, is_idle);
+
unsigned long kernel_thread_flags = CLONE_VM | CLONE_UNTRACED;

static ATOMIC_NOTIFIER_HEAD(idle_notifier);
@@ -75,13 +81,13 @@ EXPORT_SYMBOL_GPL(idle_notifier_unregister);

void enter_idle(void)
{
- write_pda(isidle, 1);
+ percpu_write(is_idle, 1);
atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
}

static void __exit_idle(void)
{
- if (test_and_clear_bit_pda(0, isidle) == 0)
+ if (x86_test_and_clear_bit_percpu(0, is_idle) == 0)
return;
atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
}
@@ -392,7 +398,7 @@ start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
load_gs_index(0);
regs->ip = new_ip;
regs->sp = new_sp;
- write_pda(oldrsp, new_sp);
+ percpu_write(old_rsp, new_sp);
regs->cs = __USER_CS;
regs->ss = __USER_DS;
regs->flags = 0x200;
@@ -613,13 +619,13 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
/*
* Switch the PDA and FPU contexts.
*/
- prev->usersp = read_pda(oldrsp);
- write_pda(oldrsp, next->usersp);
- write_pda(pcurrent, next_p);
+ prev->usersp = percpu_read(old_rsp);
+ percpu_write(old_rsp, next->usersp);
+ percpu_write(current_task, next_p);

- write_pda(kernelstack,
+ percpu_write(kernel_stack,
(unsigned long)task_stack_page(next_p) +
- THREAD_SIZE - PDA_STACKOFFSET);
+ THREAD_SIZE - KERNEL_STACK_OFFSET);
#ifdef CONFIG_CC_STACKPROTECTOR
write_pda(stack_canary, next_p->stack_canary);
/*
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 2b46eb4..f8536fe 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -14,6 +14,7 @@
#include <asm/reboot.h>
#include <asm/pci_x86.h>
#include <asm/virtext.h>
+#include <asm/cpu.h>

#ifdef CONFIG_X86_32
# include <linux/dmi.h>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ae0d804..f41c448 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -89,7 +89,7 @@

#include <asm/system.h>
#include <asm/vsyscall.h>
-#include <asm/smp.h>
+#include <asm/cpu.h>
#include <asm/desc.h>
#include <asm/dma.h>
#include <asm/iommu.h>
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 55c4607..efbafbb 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -13,6 +13,23 @@
#include <asm/mpspec.h>
#include <asm/apicdef.h>
#include <asm/highmem.h>
+#include <asm/proto.h>
+#include <asm/cpumask.h>
+
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+# define DBG(x...) printk(KERN_DEBUG x)
+#else
+# define DBG(x...)
+#endif
+
+/*
+ * This could live inside CONFIG_HAVE_SETUP_PER_CPU_AREA with the other
+ * per-cpu setup code, but Voyager wants cpu_number too.
+ */
+#ifdef CONFIG_SMP
+DEFINE_PER_CPU(int, cpu_number);
+EXPORT_PER_CPU_SYMBOL(cpu_number);
+#endif

#ifdef CONFIG_X86_LOCAL_APIC
unsigned int num_processors;
@@ -26,31 +43,84 @@ unsigned int max_physical_apicid;
physid_mask_t phys_cpu_present_map;
#endif

-/* map cpu index to physical APIC ID */
+/*
+ * Map cpu index to physical APIC ID
+ */
DEFINE_EARLY_PER_CPU(u16, x86_cpu_to_apicid, BAD_APICID);
DEFINE_EARLY_PER_CPU(u16, x86_bios_cpu_apicid, BAD_APICID);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);

#if defined(CONFIG_NUMA) && defined(CONFIG_X86_64)
-#define X86_64_NUMA 1
+#define X86_64_NUMA 1 /* (used later) */
+DEFINE_PER_CPU(int, node_number) = 0;
+EXPORT_PER_CPU_SYMBOL(node_number);

-/* map cpu index to node index */
+/*
+ * Map cpu index to node index
+ */
DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);

-/* which logical CPUs are on which nodes */
+/*
+ * Which logical CPUs are on which nodes
+ */
cpumask_t *node_to_cpumask_map;
EXPORT_SYMBOL(node_to_cpumask_map);

-/* setup node_to_cpumask_map */
+/*
+ * Setup node_to_cpumask_map
+ */
static void __init setup_node_to_cpumask_map(void);

#else
static inline void setup_node_to_cpumask_map(void) { }
#endif

-#if defined(CONFIG_HAVE_SETUP_PER_CPU_AREA) && defined(CONFIG_X86_SMP)
+/*
+ * Define load_pda_offset() and per-cpu __pda for x86_64.
+ * load_pda_offset() is responsible for loading the offset of pda into
+ * %gs.
+ *
+ * On SMP, pda offset also doubles as the percpu base address and thus it
+ * should be at the start of per-cpu area. To achieve this, it's
+ * preallocated in vmlinux_64.lds.S directly instead of using
+ * DEFINE_PER_CPU().
+ */
+#ifdef CONFIG_X86_64
+void __cpuinit load_pda_offset(int cpu)
+{
+ /* Memory clobbers used to order pda/percpu accesses */
+ mb();
+ wrmsrl(MSR_GS_BASE, cpu_pda(cpu));
+ mb();
+}
+#ifndef CONFIG_SMP
+DEFINE_PER_CPU(struct x8664_pda, __pda);
+#endif
+EXPORT_PER_CPU_SYMBOL(__pda);
+#endif /* CONFIG_X86_64 */
+
+#ifdef CONFIG_X86_64
+
+/* correctly size the local cpu masks */
+static void setup_cpu_local_masks(void)
+{
+ alloc_bootmem_cpumask_var(&cpu_initialized_mask);
+ alloc_bootmem_cpumask_var(&cpu_callin_mask);
+ alloc_bootmem_cpumask_var(&cpu_callout_mask);
+ alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
+}
+
+#else /* CONFIG_X86_32 */
+
+static inline void setup_cpu_local_masks(void)
+{
+}
+
+#endif /* CONFIG_X86_32 */
+
+#ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
/*
* Copy data used in early init routines from the initial arrays to the
* per cpu data areas. These arrays then become expendable and the
@@ -79,78 +149,14 @@ static void __init setup_per_cpu_maps(void)
#endif
}

-#ifdef CONFIG_X86_32
-/*
- * Great future not-so-futuristic plan: make i386 and x86_64 do it
- * the same way
- */
+#ifdef CONFIG_X86_64
+unsigned long __per_cpu_offset[NR_CPUS] __read_mostly = {
+ [0] = (unsigned long)__per_cpu_load,
+};
+#else
unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
+#endif
EXPORT_SYMBOL(__per_cpu_offset);
-static inline void setup_cpu_pda_map(void) { }
-
-#elif !defined(CONFIG_SMP)
-static inline void setup_cpu_pda_map(void) { }
-
-#else /* CONFIG_SMP && CONFIG_X86_64 */
-
-/*
- * Allocate cpu_pda pointer table and array via alloc_bootmem.
- */
-static void __init setup_cpu_pda_map(void)
-{
- char *pda;
- struct x8664_pda **new_cpu_pda;
- unsigned long size;
- int cpu;
-
- size = roundup(sizeof(struct x8664_pda), cache_line_size());
-
- /* allocate cpu_pda array and pointer table */
- {
- unsigned long tsize = nr_cpu_ids * sizeof(void *);
- unsigned long asize = size * (nr_cpu_ids - 1);
-
- tsize = roundup(tsize, cache_line_size());
- new_cpu_pda = alloc_bootmem(tsize + asize);
- pda = (char *)new_cpu_pda + tsize;
- }
-
- /* initialize pointer table to static pda's */
- for_each_possible_cpu(cpu) {
- if (cpu == 0) {
- /* leave boot cpu pda in place */
- new_cpu_pda[0] = cpu_pda(0);
- continue;
- }
- new_cpu_pda[cpu] = (struct x8664_pda *)pda;
- new_cpu_pda[cpu]->in_bootmem = 1;
- pda += size;
- }
-
- /* point to new pointer table */
- _cpu_pda = new_cpu_pda;
-}
-
-#endif /* CONFIG_SMP && CONFIG_X86_64 */
-
-#ifdef CONFIG_X86_64
-
-/* correctly size the local cpu masks */
-static void setup_cpu_local_masks(void)
-{
- alloc_bootmem_cpumask_var(&cpu_initialized_mask);
- alloc_bootmem_cpumask_var(&cpu_callin_mask);
- alloc_bootmem_cpumask_var(&cpu_callout_mask);
- alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
-}
-
-#else /* CONFIG_X86_32 */
-
-static inline void setup_cpu_local_masks(void)
-{
-}
-
-#endif /* CONFIG_X86_32 */

/*
* Great future plan:
@@ -164,9 +170,6 @@ void __init setup_per_cpu_areas(void)
int cpu;
unsigned long align = 1;

- /* Setup cpu_pda map */
- setup_cpu_pda_map();
-
/* Copy section for each CPU (we discard the original) */
old_size = PERCPU_ENOUGH_ROOM;
align = max_t(unsigned long, PAGE_SIZE, align);
@@ -197,8 +200,25 @@ void __init setup_per_cpu_areas(void)
cpu, node, __pa(ptr));
}
#endif
+
+ memcpy(ptr, __per_cpu_load, __per_cpu_end - __per_cpu_start);
per_cpu_offset(cpu) = ptr - __per_cpu_start;
- memcpy(ptr, __per_cpu_start, __per_cpu_end - __per_cpu_start);
+ per_cpu(this_cpu_off, cpu) = per_cpu_offset(cpu);
+ per_cpu(cpu_number, cpu) = cpu;
+#ifdef CONFIG_X86_64
+ per_cpu(irq_stack_ptr, cpu) =
+ (char *)per_cpu(irq_stack, cpu) + IRQ_STACK_SIZE - 64;
+ /*
+ * CPU0 modified pda in the init data area, reload pda
+ * offset for CPU0 and clear the area for others.
+ */
+ if (cpu == 0)
+ load_pda_offset(0);
+ else
+ memset(cpu_pda(cpu), 0, sizeof(*cpu_pda(cpu)));
+#endif
+
+ DBG("PERCPU: cpu %4d %p\n", cpu, ptr);
}

/* Setup percpu data maps */
@@ -220,6 +240,7 @@ void __init setup_per_cpu_areas(void)
* Requires node_possible_map to be valid.
*
* Note: node_to_cpumask() is not valid until after this is done.
+ * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
*/
static void __init setup_node_to_cpumask_map(void)
{
@@ -235,6 +256,7 @@ static void __init setup_node_to_cpumask_map(void)

/* allocate the map */
map = alloc_bootmem_low(nr_node_ids * sizeof(cpumask_t));
+ DBG("node_to_cpumask_map at %p for %d nodes\n", map, nr_node_ids);

pr_debug("Node to cpumask map at %p for %d nodes\n",
map, nr_node_ids);
@@ -247,17 +269,23 @@ void __cpuinit numa_set_node(int cpu, int node)
{
int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);

- if (cpu_pda(cpu) && node != NUMA_NO_NODE)
- cpu_pda(cpu)->nodenumber = node;
-
- if (cpu_to_node_map)
+ /* early setting, no percpu area yet */
+ if (cpu_to_node_map) {
cpu_to_node_map[cpu] = node;
+ return;
+ }

- else if (per_cpu_offset(cpu))
- per_cpu(x86_cpu_to_node_map, cpu) = node;
+#ifdef CONFIG_DEBUG_PER_CPU_MAPS
+ if (cpu >= nr_cpu_ids || !per_cpu_offset(cpu)) {
+ printk(KERN_ERR "numa_set_node: invalid cpu# (%d)\n", cpu);
+ dump_stack();
+ return;
+ }
+#endif
+ per_cpu(x86_cpu_to_node_map, cpu) = node;

- else
- pr_debug("Setting node for non-present cpu %d\n", cpu);
+ if (node != NUMA_NO_NODE)
+ per_cpu(node_number, cpu) = node;
}

void __cpuinit numa_clear_node(int cpu)
@@ -274,7 +302,7 @@ void __cpuinit numa_add_cpu(int cpu)

void __cpuinit numa_remove_cpu(int cpu)
{
- cpu_clear(cpu, node_to_cpumask_map[cpu_to_node(cpu)]);
+ cpu_clear(cpu, node_to_cpumask_map[early_cpu_to_node(cpu)]);
}

#else /* CONFIG_DEBUG_PER_CPU_MAPS */
@@ -284,7 +312,7 @@ void __cpuinit numa_remove_cpu(int cpu)
*/
static void __cpuinit numa_set_cpumask(int cpu, int enable)
{
- int node = cpu_to_node(cpu);
+ int node = early_cpu_to_node(cpu);
cpumask_t *mask;
char buf[64];

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 89bb766..4fa5243 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -6,7 +6,7 @@
* 2000-06-20 Pentium III FXSR, SSE support by Gareth Hughes
* 2000-2002 x86-64 support by Andi Kleen
*/
-
+#include <linux/perf_counter.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/smp.h>
@@ -886,6 +886,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
tracehook_notify_resume(regs);
}

+ if (thread_info_flags & _TIF_PERF_COUNTERS) {
+ clear_thread_flag(TIF_PERF_COUNTERS);
+ perf_counter_notify(regs);
+ }
+
#ifdef CONFIG_X86_32
clear_thread_flag(TIF_IRET);
#endif /* CONFIG_X86_32 */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index bb1a3b1..869b988 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -53,7 +53,6 @@
#include <asm/nmi.h>
#include <asm/irq.h>
#include <asm/idle.h>
-#include <asm/smp.h>
#include <asm/trampoline.h>
#include <asm/cpu.h>
#include <asm/numa.h>
@@ -745,52 +744,6 @@ static void __cpuinit do_fork_idle(struct work_struct *work)
complete(&c_idle->done);
}

-#ifdef CONFIG_X86_64
-
-/* __ref because it's safe to call free_bootmem when after_bootmem == 0. */
-static void __ref free_bootmem_pda(struct x8664_pda *oldpda)
-{
- if (!after_bootmem)
- free_bootmem((unsigned long)oldpda, sizeof(*oldpda));
-}
-
-/*
- * Allocate node local memory for the AP pda.
- *
- * Must be called after the _cpu_pda pointer table is initialized.
- */
-int __cpuinit get_local_pda(int cpu)
-{
- struct x8664_pda *oldpda, *newpda;
- unsigned long size = sizeof(struct x8664_pda);
- int node = cpu_to_node(cpu);
-
- if (cpu_pda(cpu) && !cpu_pda(cpu)->in_bootmem)
- return 0;
-
- oldpda = cpu_pda(cpu);
- newpda = kmalloc_node(size, GFP_ATOMIC, node);
- if (!newpda) {
- printk(KERN_ERR "Could not allocate node local PDA "
- "for CPU %d on node %d\n", cpu, node);
-
- if (oldpda)
- return 0; /* have a usable pda */
- else
- return -1;
- }
-
- if (oldpda) {
- memcpy(newpda, oldpda, size);
- free_bootmem_pda(oldpda);
- }
-
- newpda->in_bootmem = 0;
- cpu_pda(cpu) = newpda;
- return 0;
-}
-#endif /* CONFIG_X86_64 */
-
static int __cpuinit do_boot_cpu(int apicid, int cpu)
/*
* NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad
@@ -808,16 +761,6 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
};
INIT_WORK(&c_idle.work, do_fork_idle);

-#ifdef CONFIG_X86_64
- /* Allocate node local memory for AP pdas */
- if (cpu > 0) {
- boot_error = get_local_pda(cpu);
- if (boot_error)
- goto restore_state;
- /* if can't get pda memory, can't start cpu */
- }
-#endif
-
alternatives_smp_switch(1);

c_idle.idle = get_idle_for_cpu(cpu);
@@ -847,14 +790,17 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)

set_idle_for_cpu(cpu, c_idle.idle);
do_rest:
-#ifdef CONFIG_X86_32
per_cpu(current_task, cpu) = c_idle.idle;
+#ifdef CONFIG_X86_32
init_gdt(cpu);
/* Stack for startup_32 can be just as for start_secondary onwards */
irq_ctx_init(cpu);
#else
- cpu_pda(cpu)->pcurrent = c_idle.idle;
clear_tsk_thread_flag(c_idle.idle, TIF_FORK);
+ initial_gs = per_cpu_offset(cpu);
+ per_cpu(kernel_stack, cpu) =
+ (unsigned long)task_stack_page(c_idle.idle) -
+ KERNEL_STACK_OFFSET + THREAD_SIZE;
#endif
early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
initial_code = (unsigned long)start_secondary;
@@ -931,9 +877,7 @@ do_rest:
inquire_remote_apic(apicid);
}
}
-#ifdef CONFIG_X86_64
-restore_state:
-#endif
+
if (boot_error) {
/* Try to put things back the way they were before ... */
numa_remove_cpu(cpu); /* was set by numa_add_cpu */
@@ -1125,6 +1069,7 @@ static int __init smp_sanity_check(unsigned max_cpus)
printk(KERN_ERR "... forcing use of dummy APIC emulation."
"(tell your hw vendor)\n");
smpboot_clear_io_apic();
+ disable_ioapic_setup();
return -1;
}

diff --git a/arch/x86/kernel/smpcommon.c b/arch/x86/kernel/smpcommon.c
index 397e309..add36b4 100644
--- a/arch/x86/kernel/smpcommon.c
+++ b/arch/x86/kernel/smpcommon.c
@@ -3,11 +3,16 @@
*/
#include <linux/module.h>
#include <asm/smp.h>
+#include <asm/sections.h>

-#ifdef CONFIG_X86_32
+#ifdef CONFIG_X86_64
+DEFINE_PER_CPU(unsigned long, this_cpu_off) = (unsigned long)__per_cpu_load;
+#else
DEFINE_PER_CPU(unsigned long, this_cpu_off);
+#endif
EXPORT_PER_CPU_SYMBOL(this_cpu_off);

+#ifdef CONFIG_X86_32
/*
* Initialize the CPU's GDT. This is either the boot CPU doing itself
* (still using the master per-cpu area), or a CPU doing it for a
@@ -23,8 +28,5 @@ __cpuinit void init_gdt(int cpu)

write_gdt_entry(get_cpu_gdt_table(cpu),
GDT_ENTRY_PERCPU, &gdt, DESCTYPE_S);
-
- per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
- per_cpu(cpu_number, cpu) = cpu;
}
#endif
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index e2e86a0..0c4d601 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,4 @@ ENTRY(sys_call_table)
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+ .long sys_perf_counter_open
diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
index ce50546..abf0808 100644
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -4,8 +4,8 @@

#include <asm/tlbflush.h>

-DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate)
- ____cacheline_aligned = { &init_mm, 0, };
+DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
+ = { &init_mm, 0, };

/* must come after the send_IPI functions above for inlining */
#include <mach_ipi.h>
@@ -20,7 +20,7 @@ DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate)
* Optimizations Manfred Spraul <manfred@colorfullife.com>
*/

-static cpumask_t flush_cpumask;
+static cpumask_var_t flush_cpumask;
static struct mm_struct *flush_mm;
static unsigned long flush_va;
static DEFINE_SPINLOCK(tlbstate_lock);
@@ -34,8 +34,8 @@ static DEFINE_SPINLOCK(tlbstate_lock);
*/
void leave_mm(int cpu)
{
- BUG_ON(x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK);
- cpu_clear(cpu, x86_read_percpu(cpu_tlbstate.active_mm)->cpu_vm_mask);
+ BUG_ON(percpu_read(cpu_tlbstate.state) == TLBSTATE_OK);
+ cpu_clear(cpu, percpu_read(cpu_tlbstate.active_mm)->cpu_vm_mask);
load_cr3(swapper_pg_dir);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -92,7 +92,7 @@ void smp_invalidate_interrupt(struct pt_regs *regs)

cpu = get_cpu();

- if (!cpu_isset(cpu, flush_cpumask))
+ if (!cpumask_test_cpu(cpu, flush_cpumask))
goto out;
/*
* This was a BUG() but until someone can quote me the
@@ -103,8 +103,8 @@ void smp_invalidate_interrupt(struct pt_regs *regs)
* BUG();
*/

- if (flush_mm == x86_read_percpu(cpu_tlbstate.active_mm)) {
- if (x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_OK) {
+ if (flush_mm == percpu_read(cpu_tlbstate.active_mm)) {
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
if (flush_va == TLB_FLUSH_ALL)
local_flush_tlb();
else
@@ -114,35 +114,22 @@ void smp_invalidate_interrupt(struct pt_regs *regs)
}
ack_APIC_irq();
smp_mb__before_clear_bit();
- cpu_clear(cpu, flush_cpumask);
+ cpumask_clear_cpu(cpu, flush_cpumask);
smp_mb__after_clear_bit();
out:
put_cpu_no_resched();
inc_irq_stat(irq_tlb_count);
}

-void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
- unsigned long va)
+void native_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va)
{
- cpumask_t cpumask = *cpumaskp;
-
/*
- * A couple of (to be removed) sanity checks:
- *
- * - current CPU must not be in mask
* - mask must exist :)
*/
- BUG_ON(cpus_empty(cpumask));
- BUG_ON(cpu_isset(smp_processor_id(), cpumask));
+ BUG_ON(cpumask_empty(cpumask));
BUG_ON(!mm);

-#ifdef CONFIG_HOTPLUG_CPU
- /* If a CPU which we ran on has gone down, OK. */
- cpus_and(cpumask, cpumask, cpu_online_map);
- if (unlikely(cpus_empty(cpumask)))
- return;
-#endif
-
/*
* i'm not happy about this global shared spinlock in the
* MM hot path, but we'll see how contended it is.
@@ -150,9 +137,17 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
*/
spin_lock(&tlbstate_lock);

+ cpumask_andnot(flush_cpumask, cpumask, cpumask_of(smp_processor_id()));
+#ifdef CONFIG_HOTPLUG_CPU
+ /* If a CPU which we ran on has gone down, OK. */
+ cpumask_and(flush_cpumask, flush_cpumask, cpu_online_mask);
+ if (unlikely(cpumask_empty(flush_cpumask))) {
+ spin_unlock(&tlbstate_lock);
+ return;
+ }
+#endif
flush_mm = mm;
flush_va = va;
- cpus_or(flush_cpumask, cpumask, flush_cpumask);

/*
* Make the above memory operations globally visible before
@@ -163,9 +158,9 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(&cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(flush_cpumask, INVALIDATE_TLB_VECTOR);

- while (!cpus_empty(flush_cpumask))
+ while (!cpumask_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */
cpu_relax();

@@ -177,25 +172,19 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
void flush_tlb_current_task(void)
{
struct mm_struct *mm = current->mm;
- cpumask_t cpu_mask;

preempt_disable();
- cpu_mask = mm->cpu_vm_mask;
- cpu_clear(smp_processor_id(), cpu_mask);

local_flush_tlb();
- if (!cpus_empty(cpu_mask))
- flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
+ if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids)
+ flush_tlb_others(&mm->cpu_vm_mask, mm, TLB_FLUSH_ALL);
preempt_enable();
}

void flush_tlb_mm(struct mm_struct *mm)
{
- cpumask_t cpu_mask;

preempt_disable();
- cpu_mask = mm->cpu_vm_mask;
- cpu_clear(smp_processor_id(), cpu_mask);

if (current->active_mm == mm) {
if (current->mm)
@@ -203,8 +192,8 @@ void flush_tlb_mm(struct mm_struct *mm)
else
leave_mm(smp_processor_id());
}
- if (!cpus_empty(cpu_mask))
- flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
+ if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids)
+ flush_tlb_others(&mm->cpu_vm_mask, mm, TLB_FLUSH_ALL);

preempt_enable();
}
@@ -212,11 +201,8 @@ void flush_tlb_mm(struct mm_struct *mm)
void flush_tlb_page(struct vm_area_struct *vma, unsigned long va)
{
struct mm_struct *mm = vma->vm_mm;
- cpumask_t cpu_mask;

preempt_disable();
- cpu_mask = mm->cpu_vm_mask;
- cpu_clear(smp_processor_id(), cpu_mask);

if (current->active_mm == mm) {
if (current->mm)
@@ -225,9 +211,8 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long va)
leave_mm(smp_processor_id());
}

- if (!cpus_empty(cpu_mask))
- flush_tlb_others(cpu_mask, mm, va);
-
+ if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids)
+ flush_tlb_others(&mm->cpu_vm_mask, mm, va);
preempt_enable();
}
EXPORT_SYMBOL(flush_tlb_page);
@@ -237,7 +222,7 @@ static void do_flush_tlb_all(void *info)
unsigned long cpu = smp_processor_id();

__flush_tlb_all();
- if (x86_read_percpu(cpu_tlbstate.state) == TLBSTATE_LAZY)
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
leave_mm(cpu);
}

@@ -246,11 +231,9 @@ void flush_tlb_all(void)
on_each_cpu(do_flush_tlb_all, NULL, 1);
}

-void reset_lazy_tlbstate(void)
+static int init_flush_cpumask(void)
{
- int cpu = raw_smp_processor_id();
-
- per_cpu(cpu_tlbstate, cpu).state = 0;
- per_cpu(cpu_tlbstate, cpu).active_mm = &init_mm;
+ alloc_cpumask_var(&flush_cpumask, GFP_KERNEL);
+ return 0;
}
-
+early_initcall(init_flush_cpumask);
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index f8be6f1..e64a32c 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -18,6 +18,9 @@
#include <asm/uv/uv_hub.h>
#include <asm/uv/uv_bau.h>

+DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
+ = { &init_mm, 0, };
+
#include <mach_ipi.h>
/*
* Smarter SMP flushing macros.
@@ -43,10 +46,10 @@

union smp_flush_state {
struct {
- cpumask_t flush_cpumask;
struct mm_struct *flush_mm;
unsigned long flush_va;
spinlock_t tlbstate_lock;
+ DECLARE_BITMAP(flush_cpumask, NR_CPUS);
};
char pad[SMP_CACHE_BYTES];
} ____cacheline_aligned;
@@ -62,9 +65,9 @@ static DEFINE_PER_CPU(union smp_flush_state, flush_state);
*/
void leave_mm(int cpu)
{
- if (read_pda(mmu_state) == TLBSTATE_OK)
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();
- cpu_clear(cpu, read_pda(active_mm)->cpu_vm_mask);
+ cpu_clear(cpu, percpu_read(cpu_tlbstate.active_mm)->cpu_vm_mask);
load_cr3(swapper_pg_dir);
}
EXPORT_SYMBOL_GPL(leave_mm);
@@ -131,7 +134,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
f = &per_cpu(flush_state, sender);

- if (!cpu_isset(cpu, f->flush_cpumask))
+ if (!cpumask_test_cpu(cpu, to_cpumask(f->flush_cpumask)))
goto out;
/*
* This was a BUG() but until someone can quote me the
@@ -142,8 +145,8 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
* BUG();
*/

- if (f->flush_mm == read_pda(active_mm)) {
- if (read_pda(mmu_state) == TLBSTATE_OK) {
+ if (f->flush_mm == percpu_read(cpu_tlbstate.active_mm)) {
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
if (f->flush_va == TLB_FLUSH_ALL)
local_flush_tlb();
else
@@ -153,19 +156,15 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
}
out:
ack_APIC_irq();
- cpu_clear(cpu, f->flush_cpumask);
+ cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
inc_irq_stat(irq_tlb_count);
}

-void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
- unsigned long va)
+static void flush_tlb_others_ipi(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va)
{
int sender;
union smp_flush_state *f;
- cpumask_t cpumask = *cpumaskp;
-
- if (is_uv_system() && uv_flush_tlb_others(&cpumask, mm, va))
- return;

/* Caller has disabled preemption */
sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
@@ -180,7 +179,8 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,

f->flush_mm = mm;
f->flush_va = va;
- cpus_or(f->flush_cpumask, cpumask, f->flush_cpumask);
+ cpumask_andnot(to_cpumask(f->flush_cpumask),
+ cpumask, cpumask_of(smp_processor_id()));

/*
* Make the above memory operations globally visible before
@@ -191,9 +191,10 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(&cpumask, INVALIDATE_TLB_VECTOR_START + sender);
+ send_IPI_mask(to_cpumask(f->flush_cpumask),
+ INVALIDATE_TLB_VECTOR_START + sender);

- while (!cpus_empty(f->flush_cpumask))
+ while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
cpu_relax();

f->flush_mm = NULL;
@@ -201,6 +202,25 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
spin_unlock(&f->tlbstate_lock);
}

+void native_flush_tlb_others(const struct cpumask *cpumask,
+ struct mm_struct *mm, unsigned long va)
+{
+ if (is_uv_system()) {
+ /* FIXME: could be a percpu_alloc'd thing */
+ static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+ struct cpumask *after_uv_flush = &get_cpu_var(flush_tlb_mask);
+
+ cpumask_andnot(after_uv_flush, cpumask,
+ cpumask_of(smp_processor_id()));
+ if (!uv_flush_tlb_others(after_uv_flush, mm, va))
+ flush_tlb_others_ipi(after_uv_flush, mm, va);
+
+ put_cpu_var(flush_tlb_mask);
+ return;
+ }
+ flush_tlb_others_ipi(cpumask, mm, va);
+}
+
static int __cpuinit init_smp_flush(void)
{
int i;
@@ -215,25 +235,18 @@ core_initcall(init_smp_flush);
void flush_tlb_current_task(void)
{
struct mm_struct *mm = current->mm;
- cpumask_t cpu_mask;

preempt_disable();
- cpu_mask = mm->cpu_vm_mask;
- cpu_clear(smp_processor_id(), cpu_mask);

local_flush_tlb();
- if (!cpus_empty(cpu_mask))
- flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
+ if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids)
+ flush_tlb_others(&mm->cpu_vm_mask, mm, TLB_FLUSH_ALL);
preempt_enable();
}

void flush_tlb_mm(struct mm_struct *mm)
{
- cpumask_t cpu_mask;
-
preempt_disable();
- cpu_mask = mm->cpu_vm_mask;
- cpu_clear(smp_processor_id(), cpu_mask);

if (current->active_mm == mm) {
if (current->mm)
@@ -241,8 +254,8 @@ void flush_tlb_mm(struct mm_struct *mm)
else
leave_mm(smp_processor_id());
}
- if (!cpus_empty(cpu_mask))
- flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
+ if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids)
+ flush_tlb_others(&mm->cpu_vm_mask, mm, TLB_FLUSH_ALL);

preempt_enable();
}
@@ -250,11 +263,8 @@ void flush_tlb_mm(struct mm_struct *mm)
void flush_tlb_page(struct vm_area_struct *vma, unsigned long va)
{
struct mm_struct *mm = vma->vm_mm;
- cpumask_t cpu_mask;

preempt_disable();
- cpu_mask = mm->cpu_vm_mask;
- cpu_clear(smp_processor_id(), cpu_mask);

if (current->active_mm == mm) {
if (current->mm)
@@ -263,8 +273,8 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long va)
leave_mm(smp_processor_id());
}

- if (!cpus_empty(cpu_mask))
- flush_tlb_others(cpu_mask, mm, va);
+ if (cpumask_any_but(&mm->cpu_vm_mask, smp_processor_id()) < nr_cpu_ids)
+ flush_tlb_others(&mm->cpu_vm_mask, mm, va);

preempt_enable();
}
@@ -274,7 +284,7 @@ static void do_flush_tlb_all(void *info)
unsigned long cpu = smp_processor_id();

__flush_tlb_all();
- if (read_pda(mmu_state) == TLBSTATE_LAZY)
+ if (percpu_read(cpu_tlbstate.state) == TLBSTATE_LAZY)
leave_mm(cpu);
}

diff --git a/arch/x86/kernel/tlb_uv.c b/arch/x86/kernel/tlb_uv.c
index f885023..690dcf1 100644
--- a/arch/x86/kernel/tlb_uv.c
+++ b/arch/x86/kernel/tlb_uv.c
@@ -212,11 +212,11 @@ static int uv_wait_completion(struct bau_desc *bau_desc,
* The cpumaskp mask contains the cpus the broadcast was sent to.
*
* Returns 1 if all remote flushing was done. The mask is zeroed.
- * Returns 0 if some remote flushing remains to be done. The mask is left
- * unchanged.
+ * Returns 0 if some remote flushing remains to be done. The mask will have
+ * some bits still set.
*/
int uv_flush_send_and_wait(int cpu, int this_blade, struct bau_desc *bau_desc,
- cpumask_t *cpumaskp)
+ struct cpumask *cpumaskp)
{
int completion_status = 0;
int right_shift;
@@ -263,13 +263,13 @@ int uv_flush_send_and_wait(int cpu, int this_blade, struct bau_desc *bau_desc,
* Success, so clear the remote cpu's from the mask so we don't
* use the IPI method of shootdown on them.
*/
- for_each_cpu_mask(bit, *cpumaskp) {
+ for_each_cpu(bit, cpumaskp) {
blade = uv_cpu_to_blade_id(bit);
if (blade == this_blade)
continue;
- cpu_clear(bit, *cpumaskp);
+ cpumask_clear_cpu(bit, cpumaskp);
}
- if (!cpus_empty(*cpumaskp))
+ if (!cpumask_empty(cpumaskp))
return 0;
return 1;
}
@@ -296,7 +296,7 @@ int uv_flush_send_and_wait(int cpu, int this_blade, struct bau_desc *bau_desc,
* Returns 1 if all remote flushing was done.
* Returns 0 if some remote flushing remains to be done.
*/
-int uv_flush_tlb_others(cpumask_t *cpumaskp, struct mm_struct *mm,
+int uv_flush_tlb_others(struct cpumask *cpumaskp, struct mm_struct *mm,
unsigned long va)
{
int i;
@@ -315,7 +315,7 @@ int uv_flush_tlb_others(cpumask_t *cpumaskp, struct mm_struct *mm,
bau_nodes_clear(&bau_desc->distribution, UV_DISTRIBUTION_SIZE);

i = 0;
- for_each_cpu_mask(bit, *cpumaskp) {
+ for_each_cpu(bit, cpumaskp) {
blade = uv_cpu_to_blade_id(bit);
BUG_ON(blade > (UV_DISTRIBUTION_SIZE - 1));
if (blade == this_blade) {
diff --git a/arch/x86/kernel/vmlinux_32.lds.S b/arch/x86/kernel/vmlinux_32.lds.S
index 82c6755..3eba7f7 100644
--- a/arch/x86/kernel/vmlinux_32.lds.S
+++ b/arch/x86/kernel/vmlinux_32.lds.S
@@ -178,14 +178,7 @@ SECTIONS
__initramfs_end = .;
}
#endif
- . = ALIGN(PAGE_SIZE);
- .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) {
- __per_cpu_start = .;
- *(.data.percpu.page_aligned)
- *(.data.percpu)
- *(.data.percpu.shared_aligned)
- __per_cpu_end = .;
- }
+ PERCPU(PAGE_SIZE)
. = ALIGN(PAGE_SIZE);
/* freed after init ends here */

diff --git a/arch/x86/kernel/vmlinux_64.lds.S b/arch/x86/kernel/vmlinux_64.lds.S
index 1a614c0..a09abb8 100644
--- a/arch/x86/kernel/vmlinux_64.lds.S
+++ b/arch/x86/kernel/vmlinux_64.lds.S
@@ -5,6 +5,7 @@
#define LOAD_OFFSET __START_KERNEL_map

#include <asm-generic/vmlinux.lds.h>
+#include <asm/asm-offsets.h>
#include <asm/page.h>

#undef i386 /* in case the preprocessor is a 32bit one */
@@ -13,12 +14,14 @@ OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(phys_startup_64)
jiffies_64 = jiffies;
-_proxy_pda = 1;
PHDRS {
text PT_LOAD FLAGS(5); /* R_E */
data PT_LOAD FLAGS(7); /* RWE */
user PT_LOAD FLAGS(7); /* RWE */
data.init PT_LOAD FLAGS(7); /* RWE */
+#ifdef CONFIG_SMP
+ percpu PT_LOAD FLAGS(7); /* RWE */
+#endif
note PT_NOTE FLAGS(0); /* ___ */
}
SECTIONS
@@ -208,14 +211,29 @@ SECTIONS
__initramfs_end = .;
#endif

+#ifdef CONFIG_SMP
+ /*
+ * percpu offsets are zero-based on SMP. PERCPU_VADDR() changes the
+ * output PHDR, so the next output section - __data_nosave - should
+ * switch it back to data.init. Also, pda should be at the head of
+ * percpu area. Preallocate it and define the percpu offset symbol
+ * so that it can be accessed as a percpu variable.
+ */
+ . = ALIGN(PAGE_SIZE);
+ PERCPU_VADDR_PREALLOC(0, :percpu, pda_size)
+ per_cpu____pda = __per_cpu_start;
+#else
PERCPU(PAGE_SIZE)
+#endif

. = ALIGN(PAGE_SIZE);
__init_end = .;

. = ALIGN(PAGE_SIZE);
__nosave_begin = .;
- .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) { *(.data.nosave) }
+ .data_nosave : AT(ADDR(.data_nosave) - LOAD_OFFSET) {
+ *(.data.nosave)
+ } :data.init /* switch back to data.init, see PERCPU_VADDR() above */
. = ALIGN(PAGE_SIZE);
__nosave_end = .;

diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index 695e426..3909e3b 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -58,5 +58,3 @@ EXPORT_SYMBOL(__memcpy);
EXPORT_SYMBOL(empty_zero_page);
EXPORT_SYMBOL(init_level4_pgt);
EXPORT_SYMBOL(load_gs_index);
-
-EXPORT_SYMBOL(_proxy_pda);
diff --git a/arch/x86/mach-voyager/setup.c b/arch/x86/mach-voyager/setup.c
index a580b95..0ade625 100644
--- a/arch/x86/mach-voyager/setup.c
+++ b/arch/x86/mach-voyager/setup.c
@@ -9,6 +9,7 @@
#include <asm/e820.h>
#include <asm/io.h>
#include <asm/setup.h>
+#include <asm/cpu.h>

void __init pre_intr_init_hook(void)
{
diff --git a/arch/x86/mach-voyager/voyager_smp.c b/arch/x86/mach-voyager/voyager_smp.c
index 9840b7e..96f15b0 100644
--- a/arch/x86/mach-voyager/voyager_smp.c
+++ b/arch/x86/mach-voyager/voyager_smp.c
@@ -402,7 +402,7 @@ void __init find_smp_config(void)
VOYAGER_SUS_IN_CONTROL_PORT);

current_thread_info()->cpu = boot_cpu_id;
- x86_write_percpu(cpu_number, boot_cpu_id);
+ percpu_write(cpu_number, boot_cpu_id);
}

/*
@@ -531,6 +531,7 @@ static void __init do_boot_cpu(__u8 cpu)
stack_start.sp = (void *)idle->thread.sp;

init_gdt(cpu);
+ per_cpu(this_cpu_off, cpu) = __per_cpu_offset[cpu];
per_cpu(current_task, cpu) = idle;
early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
irq_ctx_init(cpu);
@@ -1748,6 +1749,7 @@ static void __init voyager_smp_prepare_cpus(unsigned int max_cpus)
static void __cpuinit voyager_smp_prepare_boot_cpu(void)
{
init_gdt(smp_processor_id());
+ per_cpu(this_cpu_off, smp_processor_id()) = __per_cpu_offset[smp_processor_id()];
switch_to_new_gdt();

cpu_set(smp_processor_id(), cpu_online_map);
@@ -1780,7 +1782,7 @@ static void __init voyager_smp_cpus_done(unsigned int max_cpus)
void __init smp_setup_processor_id(void)
{
current_thread_info()->cpu = hard_smp_processor_id();
- x86_write_percpu(cpu_number, hard_smp_processor_id());
+ percpu_write(cpu_number, hard_smp_processor_id());
}

static void voyager_send_call_func(cpumask_t callmask)
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 88f1b10..4a6989e 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -49,7 +49,6 @@
#include <asm/paravirt.h>
#include <asm/setup.h>
#include <asm/cacheflush.h>
-#include <asm/smp.h>

unsigned int __VMALLOC_RESERVE = 128 << 20;

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 8b08fb9..c948851 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -333,11 +333,20 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,
req_type & _PAGE_CACHE_MASK);
}

- is_range_ram = pagerange_is_ram(start, end);
- if (is_range_ram == 1)
- return reserve_ram_pages_type(start, end, req_type, new_type);
- else if (is_range_ram < 0)
- return -EINVAL;
+ /*
+ * For legacy reasons, some parts of the physical address range in the
+ * legacy 1MB region are treated as non-RAM (even when listed as RAM in
+ * the e820 tables). So we will track the memory attributes of this
+ * legacy 1MB region using the linear memtype_list always.
+ */
+ if (end >= ISA_END_ADDRESS) {
+ is_range_ram = pagerange_is_ram(start, end);
+ if (is_range_ram == 1)
+ return reserve_ram_pages_type(start, end, req_type,
+ new_type);
+ else if (is_range_ram < 0)
+ return -EINVAL;
+ }

new = kmalloc(sizeof(struct memtype), GFP_KERNEL);
if (!new)
@@ -437,11 +446,19 @@ int free_memtype(u64 start, u64 end)
if (is_ISA_range(start, end - 1))
return 0;

- is_range_ram = pagerange_is_ram(start, end);
- if (is_range_ram == 1)
- return free_ram_pages_type(start, end);
- else if (is_range_ram < 0)
- return -EINVAL;
+ /*
+ * For legacy reasons, some parts of the physical address range in the
+ * legacy 1MB region are treated as non-RAM (even when listed as RAM in
+ * the e820 tables). So we will track the memory attributes of this
+ * legacy 1MB region using the linear memtype_list always.
+ */
+ if (end >= ISA_END_ADDRESS) {
+ is_range_ram = pagerange_is_ram(start, end);
+ if (is_range_ram == 1)
+ return free_ram_pages_type(start, end);
+ else if (is_range_ram < 0)
+ return -EINVAL;
+ }

spin_lock(&memtype_lock);
list_for_each_entry(entry, &memtype_list, nd) {
diff --git a/arch/x86/oprofile/op_model_ppro.c b/arch/x86/oprofile/op_model_ppro.c
index e9f80c7..07c9145 100644
--- a/arch/x86/oprofile/op_model_ppro.c
+++ b/arch/x86/oprofile/op_model_ppro.c
@@ -18,7 +18,7 @@
#include <asm/msr.h>
#include <asm/apic.h>
#include <asm/nmi.h>
-#include <asm/intel_arch_perfmon.h>
+#include <asm/perf_counter.h>

#include "op_x86_model.h"
#include "op_counter.h"
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index bea2152..75b9413 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -634,35 +634,27 @@ static void xen_flush_tlb_single(unsigned long addr)
preempt_enable();
}

-static void xen_flush_tlb_others(const cpumask_t *cpus, struct mm_struct *mm,
- unsigned long va)
+static void xen_flush_tlb_others(const struct cpumask *cpus,
+ struct mm_struct *mm, unsigned long va)
{
struct {
struct mmuext_op op;
- cpumask_t mask;
+ DECLARE_BITMAP(mask, NR_CPUS);
} *args;
- cpumask_t cpumask = *cpus;
struct multicall_space mcs;

- /*
- * A couple of (to be removed) sanity checks:
- *
- * - current CPU must not be in mask
- * - mask must exist :)
- */
- BUG_ON(cpus_empty(cpumask));
- BUG_ON(cpu_isset(smp_processor_id(), cpumask));
+ BUG_ON(cpumask_empty(cpus));
BUG_ON(!mm);

- /* If a CPU which we ran on has gone down, OK. */
- cpus_and(cpumask, cpumask, cpu_online_map);
- if (cpus_empty(cpumask))
- return;
-
mcs = xen_mc_entry(sizeof(*args));
args = mcs.args;
- args->mask = cpumask;
- args->op.arg2.vcpumask = &args->mask;
+ args->op.arg2.vcpumask = to_cpumask(args->mask);
+
+ /* Remove us, and any offline CPUs. */
+ cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);
+ cpumask_clear_cpu(smp_processor_id(), to_cpumask(args->mask));
+ if (unlikely(cpumask_empty(to_cpumask(args->mask))))
+ goto issue;

if (va == TLB_FLUSH_ALL) {
args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
@@ -673,6 +665,7 @@ static void xen_flush_tlb_others(const cpumask_t *cpus, struct mm_struct *mm,

MULTI_mmuext_op(mcs.mc, &args->op, 1, NULL, DOMID_SELF);

+issue:
xen_mc_issue(PARAVIRT_LAZY_MMU);
}

@@ -702,17 +695,17 @@ static void xen_write_cr0(unsigned long cr0)

static void xen_write_cr2(unsigned long cr2)
{
- x86_read_percpu(xen_vcpu)->arch.cr2 = cr2;
+ percpu_read(xen_vcpu)->arch.cr2 = cr2;
}

static unsigned long xen_read_cr2(void)
{
- return x86_read_percpu(xen_vcpu)->arch.cr2;
+ return percpu_read(xen_vcpu)->arch.cr2;
}

static unsigned long xen_read_cr2_direct(void)
{
- return x86_read_percpu(xen_vcpu_info.arch.cr2);
+ return percpu_read(xen_vcpu_info.arch.cr2);
}

static void xen_write_cr4(unsigned long cr4)
@@ -725,12 +718,12 @@ static void xen_write_cr4(unsigned long cr4)

static unsigned long xen_read_cr3(void)
{
- return x86_read_percpu(xen_cr3);
+ return percpu_read(xen_cr3);
}

static void set_current_cr3(void *v)
{
- x86_write_percpu(xen_current_cr3, (unsigned long)v);
+ percpu_write(xen_current_cr3, (unsigned long)v);
}

static void __xen_write_cr3(bool kernel, unsigned long cr3)
@@ -755,7 +748,7 @@ static void __xen_write_cr3(bool kernel, unsigned long cr3)
MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);

if (kernel) {
- x86_write_percpu(xen_cr3, cr3);
+ percpu_write(xen_cr3, cr3);

/* Update xen_current_cr3 once the batch has actually
been submitted. */
@@ -771,7 +764,7 @@ static void xen_write_cr3(unsigned long cr3)

/* Update while interrupts are disabled, so its atomic with
respect to ipis */
- x86_write_percpu(xen_cr3, cr3);
+ percpu_write(xen_cr3, cr3);

__xen_write_cr3(true, cr3);

@@ -1652,7 +1645,7 @@ asmlinkage void __init xen_start_kernel(void)
#ifdef CONFIG_X86_64
/* Disable until direct per-cpu data access. */
have_vcpu_info_placement = 0;
- x86_64_init_pda();
+ pda_init(0);
#endif

xen_smp_init();
diff --git a/arch/x86/xen/irq.c b/arch/x86/xen/irq.c
index bb04260..2e82714 100644
--- a/arch/x86/xen/irq.c
+++ b/arch/x86/xen/irq.c
@@ -39,7 +39,7 @@ static unsigned long xen_save_fl(void)
struct vcpu_info *vcpu;
unsigned long flags;

- vcpu = x86_read_percpu(xen_vcpu);
+ vcpu = percpu_read(xen_vcpu);

/* flag has opposite sense of mask */
flags = !vcpu->evtchn_upcall_mask;
@@ -62,7 +62,7 @@ static void xen_restore_fl(unsigned long flags)
make sure we're don't switch CPUs between getting the vcpu
pointer and updating the mask. */
preempt_disable();
- vcpu = x86_read_percpu(xen_vcpu);
+ vcpu = percpu_read(xen_vcpu);
vcpu->evtchn_upcall_mask = flags;
preempt_enable_no_resched();

@@ -83,7 +83,7 @@ static void xen_irq_disable(void)
make sure we're don't switch CPUs between getting the vcpu
pointer and updating the mask. */
preempt_disable();
- x86_read_percpu(xen_vcpu)->evtchn_upcall_mask = 1;
+ percpu_read(xen_vcpu)->evtchn_upcall_mask = 1;
preempt_enable_no_resched();
}

@@ -96,7 +96,7 @@ static void xen_irq_enable(void)
the caller is confused and is trying to re-enable interrupts
on an indeterminate processor. */

- vcpu = x86_read_percpu(xen_vcpu);
+ vcpu = percpu_read(xen_vcpu);
vcpu->evtchn_upcall_mask = 0;

/* Doesn't matter if we get preempted here, because any
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 503c240..98cb986 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1063,18 +1063,14 @@ static void drop_other_mm_ref(void *info)
struct mm_struct *mm = info;
struct mm_struct *active_mm;

-#ifdef CONFIG_X86_64
- active_mm = read_pda(active_mm);
-#else
- active_mm = __get_cpu_var(cpu_tlbstate).active_mm;
-#endif
+ active_mm = percpu_read(cpu_tlbstate.active_mm);

if (active_mm == mm)
leave_mm(smp_processor_id());

/* If this cpu still has a stale cr3 reference, then make sure
it has been flushed. */
- if (x86_read_percpu(xen_current_cr3) == __pa(mm->pgd)) {
+ if (percpu_read(xen_current_cr3) == __pa(mm->pgd)) {
load_cr3(swapper_pg_dir);
arch_flush_lazy_cpu_mode();
}
diff --git a/arch/x86/xen/multicalls.h b/arch/x86/xen/multicalls.h
index 8589382..e786fa7 100644
--- a/arch/x86/xen/multicalls.h
+++ b/arch/x86/xen/multicalls.h
@@ -39,7 +39,7 @@ static inline void xen_mc_issue(unsigned mode)
xen_mc_flush();

/* restore flags saved in xen_mc_batch */
- local_irq_restore(x86_read_percpu(xen_mc_irq_flags));
+ local_irq_restore(percpu_read(xen_mc_irq_flags));
}

/* Set up a callback to be called when the current batch is flushed */
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index c44e206..72c2eb9 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -50,11 +50,7 @@ static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id);
*/
static irqreturn_t xen_reschedule_interrupt(int irq, void *dev_id)
{
-#ifdef CONFIG_X86_32
- __get_cpu_var(irq_stat).irq_resched_count++;
-#else
- add_pda(irq_resched_count, 1);
-#endif
+ inc_irq_stat(irq_resched_count);

return IRQ_HANDLED;
}
@@ -78,7 +74,7 @@ static __cpuinit void cpu_bringup(void)
xen_setup_cpu_clockevents();

cpu_set(cpu, cpu_online_map);
- x86_write_percpu(cpu_state, CPU_ONLINE);
+ percpu_write(cpu_state, CPU_ONLINE);
wmb();

/* We can take interrupts now: we're officially "up". */
@@ -283,22 +279,11 @@ static int __cpuinit xen_cpu_up(unsigned int cpu)
struct task_struct *idle = idle_task(cpu);
int rc;

-#ifdef CONFIG_X86_64
- /* Allocate node local memory for AP pdas */
- WARN_ON(cpu == 0);
- if (cpu > 0) {
- rc = get_local_pda(cpu);
- if (rc)
- return rc;
- }
-#endif
-
+ per_cpu(current_task, cpu) = idle;
#ifdef CONFIG_X86_32
init_gdt(cpu);
- per_cpu(current_task, cpu) = idle;
irq_ctx_init(cpu);
#else
- cpu_pda(cpu)->pcurrent = idle;
clear_tsk_thread_flag(idle, TIF_FORK);
#endif
xen_setup_timer(cpu);
@@ -445,11 +430,7 @@ static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id)
{
irq_enter();
generic_smp_call_function_interrupt();
-#ifdef CONFIG_X86_32
- __get_cpu_var(irq_stat).irq_call_count++;
-#else
- add_pda(irq_call_count, 1);
-#endif
+ inc_irq_stat(irq_call_count);
irq_exit();

return IRQ_HANDLED;
@@ -459,11 +440,7 @@ static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id)
{
irq_enter();
generic_smp_call_function_single_interrupt();
-#ifdef CONFIG_X86_32
- __get_cpu_var(irq_stat).irq_call_count++;
-#else
- add_pda(irq_call_count, 1);
-#endif
+ inc_irq_stat(irq_call_count);
irq_exit();

return IRQ_HANDLED;
diff --git a/arch/x86/xen/xen-asm_64.S b/arch/x86/xen/xen-asm_64.S
index 05794c5..d6fc51f 100644
--- a/arch/x86/xen/xen-asm_64.S
+++ b/arch/x86/xen/xen-asm_64.S
@@ -17,6 +17,7 @@
#include <asm/processor-flags.h>
#include <asm/errno.h>
#include <asm/segment.h>
+#include <asm/percpu.h>

#include <xen/interface/xen.h>

@@ -28,12 +29,10 @@

#if 1
/*
- x86-64 does not yet support direct access to percpu variables
- via a segment override, so we just need to make sure this code
- never gets used
+ FIXME: x86_64 can now support direct access to percpu variables
+ via a segment override. Update xen accordingly.
*/
#define BUG ud2a
-#define PER_CPU_VAR(var, off) 0xdeadbeef
#endif

/*
@@ -45,14 +44,14 @@ ENTRY(xen_irq_enable_direct)
BUG

/* Unmask events */
- movb $0, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask)
+ movb $0, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask

/* Preempt here doesn't matter because that will deal with
any pending interrupts. The pending check may end up being
run on the wrong CPU, but that doesn't hurt. */

/* Test for pending */
- testb $0xff, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_pending)
+ testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
jz 1f

2: call check_events
@@ -69,7 +68,7 @@ ENDPATCH(xen_irq_enable_direct)
ENTRY(xen_irq_disable_direct)
BUG

- movb $1, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask)
+ movb $1, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
ENDPATCH(xen_irq_disable_direct)
ret
ENDPROC(xen_irq_disable_direct)
@@ -87,7 +86,7 @@ ENDPATCH(xen_irq_disable_direct)
ENTRY(xen_save_fl_direct)
BUG

- testb $0xff, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask)
+ testb $0xff, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
setz %ah
addb %ah,%ah
ENDPATCH(xen_save_fl_direct)
@@ -107,13 +106,13 @@ ENTRY(xen_restore_fl_direct)
BUG

testb $X86_EFLAGS_IF>>8, %ah
- setz PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_mask)
+ setz PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_mask
/* Preempt here doesn't matter because that will deal with
any pending interrupts. The pending check may end up being
run on the wrong CPU, but that doesn't hurt. */

/* check for unmasked and pending */
- cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info, XEN_vcpu_info_pending)
+ cmpw $0x0001, PER_CPU_VAR(xen_vcpu_info) + XEN_vcpu_info_pending
jz 1f
2: call check_events
1:
@@ -195,11 +194,11 @@ RELOC(xen_sysexit, 1b+1)
ENTRY(xen_sysret64)
/* We're already on the usermode stack at this point, but still
with the kernel gs, so we can easily switch back */
- movq %rsp, %gs:pda_oldrsp
- movq %gs:pda_kernelstack,%rsp
+ movq %rsp, PER_CPU_VAR(old_rsp)
+ movq PER_CPU_VAR(kernel_stack),%rsp

pushq $__USER_DS
- pushq %gs:pda_oldrsp
+ pushq PER_CPU_VAR(old_rsp)
pushq %r11
pushq $__USER_CS
pushq %rcx
@@ -212,11 +211,11 @@ RELOC(xen_sysret64, 1b+1)
ENTRY(xen_sysret32)
/* We're already on the usermode stack at this point, but still
with the kernel gs, so we can easily switch back */
- movq %rsp, %gs:pda_oldrsp
- movq %gs:pda_kernelstack, %rsp
+ movq %rsp, PER_CPU_VAR(old_rsp)
+ movq PER_CPU_VAR(kernel_stack), %rsp

pushq $__USER32_DS
- pushq %gs:pda_oldrsp
+ pushq PER_CPU_VAR(old_rsp)
pushq %r11
pushq $__USER32_CS
pushq %rcx
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 66a9d81..7acb23f 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -271,8 +271,11 @@ static atomic_t c3_cpu_count;
/* Common C-state entry for C2, C3, .. */
static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
{
+ u64 perf_flags;
+
/* Don't trace irqs off for idle */
stop_critical_timings();
+ perf_flags = hw_perf_save_disable();
if (cstate->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cstate);
@@ -285,6 +288,7 @@ static void acpi_cstate_enter(struct acpi_processor_cx *cstate)
gets asserted in time to freeze execution properly. */
unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
}
+ hw_perf_restore(perf_flags);
start_critical_timings();
}
#endif /* !CONFIG_CPU_IDLE */
@@ -1426,8 +1430,11 @@ static inline void acpi_idle_update_bm_rld(struct acpi_processor *pr,
*/
static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
{
+ u64 pctrl;
+
/* Don't trace irqs off for idle */
stop_critical_timings();
+ pctrl = hw_perf_save_disable();
if (cx->entry_method == ACPI_CSTATE_FFH) {
/* Call into architectural FFH based C-state */
acpi_processor_ffh_cstate_enter(cx);
@@ -1442,6 +1449,7 @@ static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
gets asserted in time to freeze execution properly. */
unused = inl(acpi_gbl_FADT.xpm_timer_block.address);
}
+ hw_perf_restore(pctrl);
start_critical_timings();
}

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 719ee5c..5b257a5 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -107,7 +107,7 @@ static SYSDEV_ATTR(crash_notes, 0400, show_crash_notes, NULL);
/*
* Print cpu online, possible, present, and system maps
*/
-static ssize_t print_cpus_map(char *buf, cpumask_t *map)
+static ssize_t print_cpus_map(char *buf, const struct cpumask *map)
{
int n = cpulist_scnprintf(buf, PAGE_SIZE-2, map);

diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index a778fb5..bf6b132 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -31,7 +31,10 @@
#include <linux/hardirq.h>
#include <linux/topology.h>

-#define define_one_ro(_name) \
+#define define_one_ro_named(_name, _func) \
+static SYSDEV_ATTR(_name, 0444, _func, NULL)
+
+#define define_one_ro(_name) \
static SYSDEV_ATTR(_name, 0444, show_##_name, NULL)

#define define_id_show_func(name) \
@@ -42,8 +45,8 @@ static ssize_t show_##name(struct sys_device *dev, \
return sprintf(buf, "%d\n", topology_##name(cpu)); \
}

-#if defined(topology_thread_siblings) || defined(topology_core_siblings)
-static ssize_t show_cpumap(int type, cpumask_t *mask, char *buf)
+#if defined(topology_thread_cpumask) || defined(topology_core_cpumask)
+static ssize_t show_cpumap(int type, const struct cpumask *mask, char *buf)
{
ptrdiff_t len = PTR_ALIGN(buf + PAGE_SIZE - 1, PAGE_SIZE) - buf;
int n = 0;
@@ -65,7 +68,7 @@ static ssize_t show_##name(struct sys_device *dev, \
struct sysdev_attribute *attr, char *buf) \
{ \
unsigned int cpu = dev->id; \
- return show_cpumap(0, &(topology_##name(cpu)), buf); \
+ return show_cpumap(0, topology_##name(cpu), buf); \
}

#define define_siblings_show_list(name) \
@@ -74,7 +77,7 @@ static ssize_t show_##name##_list(struct sys_device *dev, \
char *buf) \
{ \
unsigned int cpu = dev->id; \
- return show_cpumap(1, &(topology_##name(cpu)), buf); \
+ return show_cpumap(1, topology_##name(cpu), buf); \
}

#else
@@ -82,9 +85,7 @@ static ssize_t show_##name##_list(struct sys_device *dev, \
static ssize_t show_##name(struct sys_device *dev, \
struct sysdev_attribute *attr, char *buf) \
{ \
- unsigned int cpu = dev->id; \
- cpumask_t mask = topology_##name(cpu); \
- return show_cpumap(0, &mask, buf); \
+ return show_cpumap(0, topology_##name(dev->id), buf); \
}

#define define_siblings_show_list(name) \
@@ -92,9 +93,7 @@ static ssize_t show_##name##_list(struct sys_device *dev, \
struct sysdev_attribute *attr, \
char *buf) \
{ \
- unsigned int cpu = dev->id; \
- cpumask_t mask = topology_##name(cpu); \
- return show_cpumap(1, &mask, buf); \
+ return show_cpumap(1, topology_##name(dev->id), buf); \
}
#endif

@@ -107,13 +106,13 @@ define_one_ro(physical_package_id);
define_id_show_func(core_id);
define_one_ro(core_id);

-define_siblings_show_func(thread_siblings);
-define_one_ro(thread_siblings);
-define_one_ro(thread_siblings_list);
+define_siblings_show_func(thread_cpumask);
+define_one_ro_named(thread_siblings, show_thread_cpumask);
+define_one_ro_named(thread_siblings_list, show_thread_cpumask_list);

-define_siblings_show_func(core_siblings);
-define_one_ro(core_siblings);
-define_one_ro(core_siblings_list);
+define_siblings_show_func(core_cpumask);
+define_one_ro_named(core_siblings, show_core_cpumask);
+define_one_ro_named(core_siblings_list, show_core_cpumask_list);

static struct attribute *default_attrs[] = {
&attr_physical_package_id.attr,
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index 33a9351..fa71b84 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -25,6 +25,7 @@
#include <linux/kbd_kern.h>
#include <linux/proc_fs.h>
#include <linux/quotaops.h>
+#include <linux/perf_counter.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/suspend.h>
@@ -244,6 +245,7 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
struct pt_regs *regs = get_irq_regs();
if (regs)
show_regs(regs);
+ perf_counter_print_debug();
}
static struct sysrq_key_op sysrq_showregs_op = {
.handler = sysrq_handle_showregs,
diff --git a/drivers/firmware/dcdbas.c b/drivers/firmware/dcdbas.c
index 777fba4..3009e01 100644
--- a/drivers/firmware/dcdbas.c
+++ b/drivers/firmware/dcdbas.c
@@ -244,7 +244,7 @@ static ssize_t host_control_on_shutdown_store(struct device *dev,
*/
int dcdbas_smi_request(struct smi_cmd *smi_cmd)
{
- cpumask_t old_mask;
+ cpumask_var_t old_mask;
int ret = 0;

if (smi_cmd->magic != SMI_CMD_MAGIC) {
@@ -254,8 +254,11 @@ int dcdbas_smi_request(struct smi_cmd *smi_cmd)
}

/* SMI requires CPU 0 */
- old_mask = current->cpus_allowed;
- set_cpus_allowed_ptr(current, &cpumask_of_cpu(0));
+ if (!alloc_cpumask_var(&old_mask, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_copy(old_mask, &current->cpus_allowed);
+ set_cpus_allowed_ptr(current, cpumask_of(0));
if (smp_processor_id() != 0) {
dev_dbg(&dcdbas_pdev->dev, "%s: failed to get CPU 0\n",
__func__);
@@ -275,7 +278,8 @@ int dcdbas_smi_request(struct smi_cmd *smi_cmd)
);

out:
- set_cpus_allowed_ptr(current, &old_mask);
+ set_cpus_allowed_ptr(current, old_mask);
+ free_cpumask_var(old_mask);
return ret;
}

diff --git a/drivers/misc/sgi-xp/xpc_main.c b/drivers/misc/sgi-xp/xpc_main.c
index 89218f7..6576170 100644
--- a/drivers/misc/sgi-xp/xpc_main.c
+++ b/drivers/misc/sgi-xp/xpc_main.c
@@ -318,7 +318,7 @@ xpc_hb_checker(void *ignore)

/* this thread was marked active by xpc_hb_init() */

- set_cpus_allowed_ptr(current, &cpumask_of_cpu(XPC_HB_CHECK_CPU));
+ set_cpus_allowed_ptr(current, cpumask_of(XPC_HB_CHECK_CPU));

/* set our heartbeating to other partitions into motion */
xpc_hb_check_timeout = jiffies + (xpc_hb_check_interval * HZ);
diff --git a/drivers/net/sfc/efx.c b/drivers/net/sfc/efx.c
index 7673fd9..101c00a 100644
--- a/drivers/net/sfc/efx.c
+++ b/drivers/net/sfc/efx.c
@@ -854,20 +854,27 @@ static void efx_fini_io(struct efx_nic *efx)
* interrupts across them. */
static int efx_wanted_rx_queues(void)
{
- cpumask_t core_mask;
+ cpumask_var_t core_mask;
int count;
int cpu;

- cpus_clear(core_mask);
+ if (!alloc_cpumask_var(&core_mask, GFP_KERNEL)) {
+ printk(KERN_WARNING
+ "efx.c: allocation failure, irq balancing hobbled\n");
+ return 1;
+ }
+
+ cpumask_clear(core_mask);
count = 0;
for_each_online_cpu(cpu) {
- if (!cpu_isset(cpu, core_mask)) {
+ if (!cpumask_test_cpu(cpu, core_mask)) {
++count;
- cpus_or(core_mask, core_mask,
- topology_core_siblings(cpu));
+ cpumask_or(core_mask, core_mask,
+ topology_core_cpumask(cpu));
}
}

+ free_cpumask_var(core_mask);
return count;
}

diff --git a/drivers/oprofile/buffer_sync.c b/drivers/oprofile/buffer_sync.c
index 9da5a4b..c3ea5fa 100644
--- a/drivers/oprofile/buffer_sync.c
+++ b/drivers/oprofile/buffer_sync.c
@@ -38,7 +38,7 @@

static LIST_HEAD(dying_tasks);
static LIST_HEAD(dead_tasks);
-static cpumask_t marked_cpus = CPU_MASK_NONE;
+static cpumask_var_t marked_cpus;
static DEFINE_SPINLOCK(task_mortuary);
static void process_task_mortuary(void);

@@ -456,10 +456,10 @@ static void mark_done(int cpu)
{
int i;

- cpu_set(cpu, marked_cpus);
+ cpumask_set_cpu(cpu, marked_cpus);

for_each_online_cpu(i) {
- if (!cpu_isset(i, marked_cpus))
+ if (!cpumask_test_cpu(i, marked_cpus))
return;
}

@@ -468,7 +468,7 @@ static void mark_done(int cpu)
*/
process_task_mortuary();

- cpus_clear(marked_cpus);
+ cpumask_clear(marked_cpus);
}


@@ -565,6 +565,20 @@ void sync_buffer(int cpu)
mutex_unlock(&buffer_mutex);
}

+int __init buffer_sync_init(void)
+{
+ if (!alloc_cpumask_var(&marked_cpus, GFP_KERNEL))
+ return -ENOMEM;
+
+ cpumask_clear(marked_cpus);
+ return 0;
+}
+
+void __exit buffer_sync_cleanup(void)
+{
+ free_cpumask_var(marked_cpus);
+}
+
/* The function can be used to add a buffer worth of data directly to
* the kernel buffer. The buffer is assumed to be a circular buffer.
* Take the entries from index start and end at index end, wrapping
diff --git a/drivers/oprofile/buffer_sync.h b/drivers/oprofile/buffer_sync.h
index 3110732..0ebf5db 100644
--- a/drivers/oprofile/buffer_sync.h
+++ b/drivers/oprofile/buffer_sync.h
@@ -19,4 +19,8 @@ void sync_stop(void);
/* sync the given CPU's buffer */
void sync_buffer(int cpu);

+/* initialize/destroy the buffer system. */
+int buffer_sync_init(void);
+void buffer_sync_cleanup(void);
+
#endif /* OPROFILE_BUFFER_SYNC_H */
diff --git a/drivers/oprofile/oprof.c b/drivers/oprofile/oprof.c
index 3cffce9..ced39f6 100644
--- a/drivers/oprofile/oprof.c
+++ b/drivers/oprofile/oprof.c
@@ -183,6 +183,10 @@ static int __init oprofile_init(void)
{
int err;

+ err = buffer_sync_init();
+ if (err)
+ return err;
+
err = oprofile_arch_init(&oprofile_ops);

if (err < 0 || timer) {
@@ -191,8 +195,10 @@ static int __init oprofile_init(void)
}

err = oprofilefs_register();
- if (err)
+ if (err) {
oprofile_arch_exit();
+ buffer_sync_cleanup();
+ }

return err;
}
@@ -202,6 +208,7 @@ static void __exit oprofile_exit(void)
{
oprofilefs_unregister();
oprofile_arch_exit();
+ buffer_sync_cleanup();
}


diff --git a/drivers/pci/intr_remapping.c b/drivers/pci/intr_remapping.c
index f78371b..5a57753 100644
--- a/drivers/pci/intr_remapping.c
+++ b/drivers/pci/intr_remapping.c
@@ -6,6 +6,7 @@
#include <linux/irq.h>
#include <asm/io_apic.h>
#include <asm/smp.h>
+#include <asm/cpu.h>
#include <linux/intel-iommu.h>
#include "intr_remapping.h"

diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index eb0dfde..3141e14 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -26,6 +26,7 @@
#include <linux/irq.h>
#include <linux/module.h>
#include <linux/string.h>
+#include <linux/bootmem.h>

#include <asm/ptrace.h>
#include <asm/irq.h>
@@ -75,7 +76,14 @@ enum {
static int evtchn_to_irq[NR_EVENT_CHANNELS] = {
[0 ... NR_EVENT_CHANNELS-1] = -1
};
-static unsigned long cpu_evtchn_mask[NR_CPUS][NR_EVENT_CHANNELS/BITS_PER_LONG];
+struct cpu_evtchn_s {
+ unsigned long bits[NR_EVENT_CHANNELS/BITS_PER_LONG];
+};
+static struct cpu_evtchn_s *cpu_evtchn_mask_p;
+static inline unsigned long *cpu_evtchn_mask(int cpu)
+{
+ return cpu_evtchn_mask_p[cpu].bits;
+}
static u8 cpu_evtchn[NR_EVENT_CHANNELS];

/* Reference counts for bindings to IRQs. */
@@ -115,7 +123,7 @@ static inline unsigned long active_evtchns(unsigned int cpu,
unsigned int idx)
{
return (sh->evtchn_pending[idx] &
- cpu_evtchn_mask[cpu][idx] &
+ cpu_evtchn_mask(cpu)[idx] &
~sh->evtchn_mask[idx]);
}

@@ -125,11 +133,11 @@ static void bind_evtchn_to_cpu(unsigned int chn, unsigned int cpu)

BUG_ON(irq == -1);
#ifdef CONFIG_SMP
- irq_to_desc(irq)->affinity = cpumask_of_cpu(cpu);
+ cpumask_copy(irq_to_desc(irq)->affinity, cpumask_of(cpu));
#endif

- __clear_bit(chn, cpu_evtchn_mask[cpu_evtchn[chn]]);
- __set_bit(chn, cpu_evtchn_mask[cpu]);
+ __clear_bit(chn, cpu_evtchn_mask(cpu_evtchn[chn]));
+ __set_bit(chn, cpu_evtchn_mask(cpu));

cpu_evtchn[chn] = cpu;
}
@@ -142,12 +150,12 @@ static void init_evtchn_cpu_bindings(void)

/* By default all event channels notify CPU#0. */
for_each_irq_desc(i, desc) {
- desc->affinity = cpumask_of_cpu(0);
+ cpumask_copy(desc->affinity, cpumask_of(0));
}
#endif

memset(cpu_evtchn, 0, sizeof(cpu_evtchn));
- memset(cpu_evtchn_mask[0], ~0, sizeof(cpu_evtchn_mask[0]));
+ memset(cpu_evtchn_mask(0), ~0, sizeof(cpu_evtchn_mask(0)));
}

static inline unsigned int cpu_from_evtchn(unsigned int evtchn)
@@ -822,6 +830,10 @@ static struct irq_chip xen_dynamic_chip __read_mostly = {
void __init xen_init_IRQ(void)
{
int i;
+ size_t size = nr_cpu_ids * sizeof(struct cpu_evtchn_s);
+
+ cpu_evtchn_mask_p = alloc_bootmem(size);
+ BUG_ON(cpu_evtchn_mask_p == NULL);

init_evtchn_cpu_bindings();

diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
index 9b91617..e7e83b6 100644
--- a/drivers/xen/manage.c
+++ b/drivers/xen/manage.c
@@ -100,7 +100,7 @@ static void do_suspend(void)
/* XXX use normal device tree? */
xenbus_suspend();

- err = stop_machine(xen_suspend, &cancelled, &cpumask_of_cpu(0));
+ err = stop_machine(xen_suspend, &cancelled, cpumask_of(0));
if (err) {
printk(KERN_ERR "failed to start xen_suspend: %d\n", err);
goto out;
diff --git a/fs/exec.c b/fs/exec.c
index 0dd60a0..97f9c22 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -33,6 +33,7 @@
#include <linux/string.h>
#include <linux/init.h>
#include <linux/pagemap.h>
+#include <linux/perf_counter.h>
#include <linux/highmem.h>
#include <linux/spinlock.h>
#include <linux/key.h>
@@ -1010,6 +1011,13 @@ int flush_old_exec(struct linux_binprm * bprm)

current->personality &= ~bprm->per_clear;

+ /*
+ * Flush performance counters when crossing a
+ * security domain:
+ */
+ if (!get_dumpable(current->mm))
+ perf_counter_exit_task(current);
+
/* An exec changes our domain. We are no longer part of the thread
group */

diff --git a/include/asm-generic/bitops/__ffs.h b/include/asm-generic/bitops/__ffs.h
index 9a3274a..937d7c4 100644
--- a/include/asm-generic/bitops/__ffs.h
+++ b/include/asm-generic/bitops/__ffs.h
@@ -9,7 +9,7 @@
*
* Undefined if no bit exists, so code should check against 0 first.
*/
-static inline unsigned long __ffs(unsigned long word)
+static __always_inline unsigned long __ffs(unsigned long word)
{
int num = 0;

diff --git a/include/asm-generic/bitops/__fls.h b/include/asm-generic/bitops/__fls.h
index be24465..a60a7cc 100644
--- a/include/asm-generic/bitops/__fls.h
+++ b/include/asm-generic/bitops/__fls.h
@@ -9,7 +9,7 @@
*
* Undefined if no set bit exists, so code should check against 0 first.
*/
-static inline unsigned long __fls(unsigned long word)
+static __always_inline unsigned long __fls(unsigned long word)
{
int num = BITS_PER_LONG - 1;

diff --git a/include/asm-generic/bitops/fls.h b/include/asm-generic/bitops/fls.h
index 850859b..0576d1f 100644
--- a/include/asm-generic/bitops/fls.h
+++ b/include/asm-generic/bitops/fls.h
@@ -9,7 +9,7 @@
* Note fls(0) = 0, fls(1) = 1, fls(0x80000000) = 32.
*/

-static inline int fls(int x)
+static __always_inline int fls(int x)
{
int r = 32;

diff --git a/include/asm-generic/bitops/fls64.h b/include/asm-generic/bitops/fls64.h
index 86d403f..b097cf8 100644
--- a/include/asm-generic/bitops/fls64.h
+++ b/include/asm-generic/bitops/fls64.h
@@ -15,7 +15,7 @@
* at position 64.
*/
#if BITS_PER_LONG == 32
-static inline int fls64(__u64 x)
+static __always_inline int fls64(__u64 x)
{
__u32 h = x >> 32;
if (h)
@@ -23,7 +23,7 @@ static inline int fls64(__u64 x)
return fls(x);
}
#elif BITS_PER_LONG == 64
-static inline int fls64(__u64 x)
+static __always_inline int fls64(__u64 x)
{
if (x == 0)
return 0;
diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index b0e63c6..00f45ff 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -80,4 +80,56 @@ extern void setup_per_cpu_areas(void);
#define DECLARE_PER_CPU(type, name) extern PER_CPU_ATTRIBUTES \
__typeof__(type) per_cpu_var(name)

+/*
+ * Optional methods for optimized non-lvalue per-cpu variable access.
+ *
+ * @var can be a percpu variable or a field of it, and its size should
+ * equal that of char, int or long. percpu_read() evaluates to an
+ * lvalue and all others to void.
+ *
+ * These operations are guaranteed to be atomic w.r.t. preemption.
+ * The generic versions use plain get/put_cpu_var(). Archs are
+ * encouraged to implement single-instruction alternatives which don't
+ * require preemption protection.
+ */
+#ifndef percpu_read
+# define percpu_read(var) \
+ ({ \
+ typeof(per_cpu_var(var)) __tmp_var__; \
+ __tmp_var__ = get_cpu_var(var); \
+ put_cpu_var(var); \
+ __tmp_var__; \
+ })
+#endif
+
+#define __percpu_generic_to_op(var, val, op) \
+do { \
+ get_cpu_var(var) op val; \
+ put_cpu_var(var); \
+} while (0)
+
+#ifndef percpu_write
+# define percpu_write(var, val) __percpu_generic_to_op(var, (val), =)
+#endif
+
+#ifndef percpu_add
+# define percpu_add(var, val) __percpu_generic_to_op(var, (val), +=)
+#endif
+
+#ifndef percpu_sub
+# define percpu_sub(var, val) __percpu_generic_to_op(var, (val), -=)
+#endif
+
+#ifndef percpu_and
+# define percpu_and(var, val) __percpu_generic_to_op(var, (val), &=)
+#endif
+
+#ifndef percpu_or
+# define percpu_or(var, val) __percpu_generic_to_op(var, (val), |=)
+#endif
+
+#ifndef percpu_xor
+# define percpu_xor(var, val) __percpu_generic_to_op(var, (val), ^=)
+#endif
+
#endif /* _ASM_GENERIC_PERCPU_H_ */
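
For illustration only - a minimal sketch (not part of the patch) of how
kernel code could use these generic accessors; the per-cpu variable and
the two functions are hypothetical:

    #include <linux/percpu.h>

    /* hypothetical per-cpu event counter */
    DEFINE_PER_CPU(unsigned long, example_hits);

    static void example_hit(void)
    {
            /* preemption-safe read-modify-write via the generic fallback */
            percpu_add(example_hits, 1);
    }

    static unsigned long example_snapshot(void)
    {
            /* percpu_read() yields a value; the other helpers are void */
            return percpu_read(example_hits);
    }
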
diff --git a/include/asm-generic/sections.h b/include/asm-generic/sections.h
index 79a7ff9..4ce48e8 100644
--- a/include/asm-generic/sections.h
+++ b/include/asm-generic/sections.h
@@ -9,7 +9,7 @@ extern char __bss_start[], __bss_stop[];
extern char __init_begin[], __init_end[];
extern char _sinittext[], _einittext[];
extern char _end[];
-extern char __per_cpu_start[], __per_cpu_end[];
+extern char __per_cpu_load[], __per_cpu_start[], __per_cpu_end[];
extern char __kprobes_text_start[], __kprobes_text_end[];
extern char __initdata_begin[], __initdata_end[];
extern char __start_rodata[], __end_rodata[];
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index c61fab1..aa6b9b1 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -430,12 +430,75 @@
*(.initcall7.init) \
*(.initcall7s.init)

-#define PERCPU(align) \
- . = ALIGN(align); \
- VMLINUX_SYMBOL(__per_cpu_start) = .; \
- .data.percpu : AT(ADDR(.data.percpu) - LOAD_OFFSET) { \
+#define PERCPU_PROLOG(vaddr) \
+ VMLINUX_SYMBOL(__per_cpu_load) = .; \
+ .data.percpu vaddr : AT(VMLINUX_SYMBOL(__per_cpu_load) \
+ - LOAD_OFFSET) { \
+ VMLINUX_SYMBOL(__per_cpu_start) = .;
+
+#define PERCPU_EPILOG(phdr) \
+ VMLINUX_SYMBOL(__per_cpu_end) = .; \
+ } phdr \
+ . = VMLINUX_SYMBOL(__per_cpu_load) + SIZEOF(.data.percpu);
+
+/**
+ * PERCPU_VADDR_PREALLOC - define output section for percpu area with prealloc
+ * @vaddr: explicit base address (optional)
+ * @phdr: destination PHDR (optional)
+ * @prealloc: the size of prealloc area
+ *
+ * Macro which expands to output section for percpu area. If @vaddr
+ * is not blank, it specifies explicit base address and all percpu
+ * symbols will be offset from the given address. If blank, @vaddr
+ * always equals @laddr + LOAD_OFFSET.
+ *
+ * @phdr defines the output PHDR to use if not blank. Be warned that
+ * output PHDR is sticky. If @phdr is specified, the next output
+ * section in the linker script will go there too. @phdr should have
+ * a leading colon.
+ *
+ * If @prealloc is non-zero, the specified number of bytes will be
+ * reserved at the start of percpu area. As the prealloc area is
+ * likely to break alignment, this macro puts areas in increasing
+ * alignment order.
+ *
+ * This macro defines three symbols, __per_cpu_load, __per_cpu_start
+ * and __per_cpu_end. The first one is the vaddr of loaded percpu
+ * init data. __per_cpu_start equals @vaddr and __per_cpu_end is the
+ * end offset.
+ */
+#define PERCPU_VADDR_PREALLOC(vaddr, segment, prealloc) \
+ PERCPU_PROLOG(vaddr) \
+ . += prealloc; \
+ *(.data.percpu) \
+ *(.data.percpu.shared_aligned) \
+ *(.data.percpu.page_aligned) \
+ PERCPU_EPILOG(segment)
+
+/**
+ * PERCPU_VADDR - define output section for percpu area
+ * @vaddr: explicit base address (optional)
+ * @phdr: destination PHDR (optional)
+ *
+ * Macro which expands to output section for percpu area. Mostly
+ * identical to PERCPU_VADDR_PREALLOC(@vaddr, @phdr, 0) other than
+ * using a slightly different layout.
+ */
+#define PERCPU_VADDR(vaddr, phdr) \
+ PERCPU_PROLOG(vaddr) \
*(.data.percpu.page_aligned) \
*(.data.percpu) \
*(.data.percpu.shared_aligned) \
- } \
- VMLINUX_SYMBOL(__per_cpu_end) = .;
+ PERCPU_EPILOG(phdr)
+
+/**
+ * PERCPU - define output section for percpu area, simple version
+ * @align: required alignment
+ *
+ * Aligns to @align and outputs the output section for the percpu
+ * area. This macro doesn't manipulate @vaddr or @phdr, so
+ * __per_cpu_load and __per_cpu_start will be identical.
+ */
+#define PERCPU(align) \
+ . = ALIGN(align); \
+ PERCPU_VADDR( , )
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 2f3c2d4..49a40fb 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -115,6 +115,16 @@ extern struct group_info init_groups;

extern struct cred init_cred;

+#ifdef CONFIG_PERF_COUNTERS
+# define INIT_PERF_COUNTERS(tsk) \
+ .perf_counter_ctx.counter_list = \
+ LIST_HEAD_INIT(tsk.perf_counter_ctx.counter_list), \
+ .perf_counter_ctx.lock = \
+ __SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx.lock),
+#else
+# define INIT_PERF_COUNTERS(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -179,6 +189,7 @@ extern struct cred init_cred;
INIT_IDS \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
+ INIT_PERF_COUNTERS(tsk) \
}


diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..472f117 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -467,6 +467,7 @@ int show_interrupts(struct seq_file *p, void *v);
struct irq_desc;

extern int early_irq_init(void);
+extern int arch_probe_nr_irqs(void);
extern int arch_early_irq_init(void);
extern int arch_init_chip_data(struct irq_desc *desc, int cpu);

diff --git a/include/linux/irq.h b/include/linux/irq.h
index f899b50..27a6753 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -182,11 +182,11 @@ struct irq_desc {
unsigned int irqs_unhandled;
spinlock_t lock;
#ifdef CONFIG_SMP
- cpumask_t affinity;
+ cpumask_var_t affinity;
unsigned int cpu;
-#endif
#ifdef CONFIG_GENERIC_PENDING_IRQ
- cpumask_t pending_mask;
+ cpumask_var_t pending_mask;
+#endif
#endif
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *dir;
@@ -422,4 +422,84 @@ extern int set_irq_msi(unsigned int irq, struct msi_desc *entry);

#endif /* !CONFIG_S390 */

+#ifdef CONFIG_SMP
+/**
+ * init_alloc_desc_masks - allocate cpumasks for irq_desc
+ * @desc: pointer to irq_desc struct
+ * @cpu: cpu which will be handling the cpumasks
+ * @boot: true if the allocation must come from bootmem
+ *
+ * Allocates the affinity and pending_mask cpumasks if required.
+ * Returns true if successful (or if no allocation was required).
+ * Side effect: affinity has all bits set, pending_mask has all bits clear.
+ */
+static inline bool init_alloc_desc_masks(struct irq_desc *desc, int cpu,
+ bool boot)
+{
+ int node;
+
+ if (boot) {
+ alloc_bootmem_cpumask_var(&desc->affinity);
+ cpumask_setall(desc->affinity);
+
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+ alloc_bootmem_cpumask_var(&desc->pending_mask);
+ cpumask_clear(desc->pending_mask);
+#endif
+ return true;
+ }
+
+ node = cpu_to_node(cpu);
+
+ if (!alloc_cpumask_var_node(&desc->affinity, GFP_ATOMIC, node))
+ return false;
+ cpumask_setall(desc->affinity);
+
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+ if (!alloc_cpumask_var_node(&desc->pending_mask, GFP_ATOMIC, node)) {
+ free_cpumask_var(desc->affinity);
+ return false;
+ }
+ cpumask_clear(desc->pending_mask);
+#endif
+ return true;
+}
+
+/**
+ * init_copy_desc_masks - copy cpumasks for irq_desc
+ * @old_desc: pointer to old irq_desc struct
+ * @new_desc: pointer to new irq_desc struct
+ *
+ * Ensures affinity and pending_masks are copied to the new irq_desc.
+ * If !CONFIG_CPUMASKS_OFFSTACK the cpumasks are embedded in the
+ * irq_desc struct so the copy is redundant.
+ */
+
+static inline void init_copy_desc_masks(struct irq_desc *old_desc,
+ struct irq_desc *new_desc)
+{
+#ifdef CONFIG_CPUMASKS_OFFSTACK
+ cpumask_copy(new_desc->affinity, old_desc->affinity);
+
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+ cpumask_copy(new_desc->pending_mask, old_desc->pending_mask);
+#endif
+#endif
+}
+
+#else /* !CONFIG_SMP */
+
+static inline bool init_alloc_desc_masks(struct irq_desc *desc, int cpu,
+ bool boot)
+{
+ return true;
+}
+
+static inline void init_copy_desc_masks(struct irq_desc *old_desc,
+ struct irq_desc *new_desc)
+{
+}
+
+#endif /* CONFIG_SMP */
+
#endif /* _LINUX_IRQ_H */
diff --git a/include/linux/irqnr.h b/include/linux/irqnr.h
index 86af92e..887477b 100644
--- a/include/linux/irqnr.h
+++ b/include/linux/irqnr.h
@@ -20,6 +20,7 @@

# define for_each_irq_desc_reverse(irq, desc) \
for (irq = nr_irqs - 1; irq >= 0; irq--)
+
#else /* CONFIG_GENERIC_HARDIRQS */

extern int nr_irqs;
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 570d204..ecfa668 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -78,7 +78,15 @@ static inline unsigned int kstat_irqs(unsigned int irq)
return sum;
}

+
+/*
+ * Lock/unlock the current runqueue - to extract task statistics:
+ */
+extern void curr_rq_lock_irq_save(unsigned long *flags);
+extern void curr_rq_unlock_irq_restore(unsigned long *flags);
+extern unsigned long long __task_delta_exec(struct task_struct *tsk, int update);
extern unsigned long long task_delta_exec(struct task_struct *);
+
extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
extern void account_steal_time(cputime_t);
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
new file mode 100644
index 0000000..33ba9fe
--- /dev/null
+++ b/include/linux/perf_counter.h
@@ -0,0 +1,290 @@
+/*
+ * Performance counters:
+ *
+ * Copyright(C) 2008, Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008, Red Hat, Inc., Ingo Molnar
+ *
+ * Data type definitions, declarations, prototypes.
+ *
+ * Started by: Thomas Gleixner and Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+#ifndef _LINUX_PERF_COUNTER_H
+#define _LINUX_PERF_COUNTER_H
+
+#include <asm/atomic.h>
+#include <asm/ioctl.h>
+
+#ifdef CONFIG_PERF_COUNTERS
+# include <asm/perf_counter.h>
+#endif
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/rculist.h>
+#include <linux/rcupdate.h>
+#include <linux/spinlock.h>
+
+struct task_struct;
+
+/*
+ * User-space ABI bits:
+ */
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CPU_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+ PERF_COUNT_BUS_CYCLES = 6,
+
+ PERF_HW_EVENTS_MAX = 7,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4,
+ PERF_COUNT_CPU_MIGRATIONS = -5,
+
+ PERF_SW_EVENTS_MIN = -6,
+};
+
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ inherit : 1, /* children inherit it */
+ pinned : 1, /* must always be on PMU */
+ exclusive : 1, /* only counter on PMU */
+
+ __reserved_1 : 26;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Ioctls that can be done on a perf counter fd:
+ */
+#define PERF_COUNTER_IOC_ENABLE _IO('$', 0)
+#define PERF_COUNTER_IOC_DISABLE _IO('$', 1)
+
+/*
+ * Kernel-internal data types:
+ */
+
+/**
+ * struct hw_perf_counter - performance counter hardware details:
+ */
+struct hw_perf_counter {
+#ifdef CONFIG_PERF_COUNTERS
+ u64 config;
+ unsigned long config_base;
+ unsigned long counter_base;
+ int nmi;
+ unsigned int idx;
+ atomic64_t prev_count;
+ u64 irq_period;
+ atomic64_t period_left;
+#endif
+};
+
+/*
+ * Hardcoded buffer length limit for now, for IRQ-fed events:
+ */
+#define PERF_DATA_BUFLEN 2048
+
+/**
+ * struct perf_data - performance counter IRQ data sampling ...
+ */
+struct perf_data {
+ int len;
+ int rd_idx;
+ int overrun;
+ u8 data[PERF_DATA_BUFLEN];
+};
+
+struct perf_counter;
+
+/**
+ * struct hw_perf_counter_ops - performance counter hw ops
+ */
+struct hw_perf_counter_ops {
+ int (*enable) (struct perf_counter *counter);
+ void (*disable) (struct perf_counter *counter);
+ void (*read) (struct perf_counter *counter);
+};
+
+/**
+ * enum perf_counter_active_state - the states of a counter
+ */
+enum perf_counter_active_state {
+ PERF_COUNTER_STATE_ERROR = -2,
+ PERF_COUNTER_STATE_OFF = -1,
+ PERF_COUNTER_STATE_INACTIVE = 0,
+ PERF_COUNTER_STATE_ACTIVE = 1,
+};
+
+struct file;
+
+/**
+ * struct perf_counter - performance counter kernel representation:
+ */
+struct perf_counter {
+#ifdef CONFIG_PERF_COUNTERS
+ struct list_head list_entry;
+ struct list_head sibling_list;
+ struct perf_counter *group_leader;
+ const struct hw_perf_counter_ops *hw_ops;
+
+ enum perf_counter_active_state state;
+ atomic64_t count;
+
+ struct perf_counter_hw_event hw_event;
+ struct hw_perf_counter hw;
+
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+ struct file *filp;
+
+ struct perf_counter *parent;
+ struct list_head child_list;
+
+ /*
+ * Protect attach/detach and child_list:
+ */
+ struct mutex mutex;
+
+ int oncpu;
+ int cpu;
+
+ /* read() / irq related data */
+ wait_queue_head_t waitq;
+ /* optional: for NMIs */
+ int wakeup_pending;
+ struct perf_data *irqdata;
+ struct perf_data *usrdata;
+ struct perf_data data[2];
+#endif
+};
+
+/**
+ * struct perf_counter_context - counter context structure
+ *
+ * Used as a container for task counters and CPU counters as well:
+ */
+struct perf_counter_context {
+#ifdef CONFIG_PERF_COUNTERS
+ /*
+ * Protect the states of the counters in the list,
+ * nr_active, and the list:
+ */
+ spinlock_t lock;
+ /*
+ * Protect the list of counters. Locking either mutex or lock
+ * is sufficient to ensure the list doesn't change; to change
+ * the list you need to lock both the mutex and the spinlock.
+ */
+ struct mutex mutex;
+
+ struct list_head counter_list;
+ int nr_counters;
+ int nr_active;
+ int is_active;
+ struct task_struct *task;
+#endif
+};
+
+/**
+ * struct perf_cpu_context - per cpu counter context structure
+ */
+struct perf_cpu_context {
+ struct perf_counter_context ctx;
+ struct perf_counter_context *task_ctx;
+ int active_oncpu;
+ int max_pertask;
+ int exclusive;
+};
+
+/*
+ * Set by architecture code:
+ */
+extern int perf_max_counters;
+
+#ifdef CONFIG_PERF_COUNTERS
+extern const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter);
+
+extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_tick(struct task_struct *task, int cpu);
+extern void perf_counter_init_task(struct task_struct *child);
+extern void perf_counter_exit_task(struct task_struct *child);
+extern void perf_counter_notify(struct pt_regs *regs);
+extern void perf_counter_print_debug(void);
+extern u64 hw_perf_save_disable(void);
+extern void hw_perf_restore(u64 ctrl);
+extern int perf_counter_task_disable(void);
+extern int perf_counter_task_enable(void);
+extern int hw_perf_group_sched_in(struct perf_counter *group_leader,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx, int cpu);
+
+/*
+ * Return 1 for a software counter, 0 for a hardware counter
+ */
+static inline int is_software_counter(struct perf_counter *counter)
+{
+ return !counter->hw_event.raw && counter->hw_event.type < 0;
+}
+
+#else
+static inline void
+perf_counter_task_sched_in(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_sched_out(struct task_struct *task, int cpu) { }
+static inline void
+perf_counter_task_tick(struct task_struct *task, int cpu) { }
+static inline void perf_counter_init_task(struct task_struct *child) { }
+static inline void perf_counter_exit_task(struct task_struct *child) { }
+static inline void perf_counter_notify(struct pt_regs *regs) { }
+static inline void perf_counter_print_debug(void) { }
+static inline void hw_perf_restore(u64 ctrl) { }
+static inline u64 hw_perf_save_disable(void) { return 0; }
+static inline int perf_counter_task_disable(void) { return -EINVAL; }
+static inline int perf_counter_task_enable(void) { return -EINVAL; }
+#endif
+
+#endif /* _LINUX_PERF_COUNTER_H */
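
To make the new ABI concrete, here is a hedged user-space sketch (not
part of the patch). The struct, the PERF_COUNT_* event types and the
PERF_COUNTER_IOC_* ioctls are the ones defined above; the syscall
number (__NR_perf_counter_open) is arch-specific and assumed to be
available, as is a user-space copy of this header:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_counter.h>

    int main(void)
    {
            struct perf_counter_hw_event hw_event;
            uint64_t count;
            int fd;

            memset(&hw_event, 0, sizeof(hw_event));
            hw_event.type     = PERF_COUNT_INSTRUCTIONS;
            hw_event.disabled = 1;                  /* start disabled */

            /* assumed semantics: pid 0 == current task, cpu -1 == any CPU,
             * group_fd -1 == start a new group */
            fd = syscall(__NR_perf_counter_open, &hw_event, 0, -1, -1);
            if (fd < 0)
                    return 1;

            ioctl(fd, PERF_COUNTER_IOC_ENABLE);     /* enable without reopening */
            /* ... workload to be measured runs here ... */
            ioctl(fd, PERF_COUNTER_IOC_DISABLE);

            if (read(fd, &count, sizeof(count)) != sizeof(count))
                    return 1;
            printf("instructions: %llu\n", (unsigned long long)count);
            close(fd);
            return 0;
    }

Setting hw_event.disabled keeps the counter off until the ENABLE ioctl
is issued, matching the "off by default" flag comment above.
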
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index 48d887e..b00df4c 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -85,4 +85,7 @@
#define PR_SET_TIMERSLACK 29
#define PR_GET_TIMERSLACK 30

+#define PR_TASK_PERF_COUNTERS_DISABLE 31
+#define PR_TASK_PERF_COUNTERS_ENABLE 32
+
#endif /* _LINUX_PRCTL_H */
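
A related sketch (again not part of the patch): the two prctl options
above are presumably routed to perf_counter_task_disable() and
perf_counter_task_enable(), which appear further down in this patch, so
a task can bracket a region where its own counters should not count:

    #include <sys/prctl.h>

    /* values from the hunk above, in case the libc headers lack them */
    #ifndef PR_TASK_PERF_COUNTERS_DISABLE
    #define PR_TASK_PERF_COUNTERS_DISABLE   31
    #define PR_TASK_PERF_COUNTERS_ENABLE    32
    #endif

    static void run_uncounted(void (*fn)(void))
    {
            prctl(PR_TASK_PERF_COUNTERS_DISABLE, 0, 0, 0, 0);
            fn();           /* not charged to this task's counters */
            prctl(PR_TASK_PERF_COUNTERS_ENABLE, 0, 0, 0, 0);
    }
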
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4cae9b8..f134a0f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct sched_param {
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/perf_counter.h>
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
@@ -1031,6 +1032,8 @@ struct sched_entity {
u64 last_wakeup;
u64 avg_overlap;

+ u64 nr_migrations;
+
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
u64 wait_max;
@@ -1046,7 +1049,6 @@ struct sched_entity {
u64 exec_max;
u64 slice_max;

- u64 nr_migrations;
u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
@@ -1349,6 +1351,7 @@ struct task_struct {
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
#endif
+ struct perf_counter_context perf_counter_ctx;
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
@@ -2322,6 +2325,13 @@ static inline void inc_syscw(struct task_struct *tsk)
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif

+/*
+ * Call the function if the target task is executing on a CPU right now:
+ */
+extern void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info);
+
+
#ifdef CONFIG_MM_OWNER
extern void mm_update_next_owner(struct mm_struct *mm);
extern void mm_init_owner(struct mm_struct *mm, struct task_struct *p);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 16875f8..fc81937 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -55,6 +55,7 @@ struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
struct old_linux_dirent;
+struct perf_counter_hw_event;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -689,4 +690,11 @@ asmlinkage long sys_pipe(int __user *);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

+
+asmlinkage int sys_perf_counter_open(
+
+ struct perf_counter_hw_event *hw_event_uptr __user,
+ pid_t pid,
+ int cpu,
+ int group_fd);
#endif
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e632d29..a16b9e0 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -193,5 +193,11 @@ int arch_update_cpu_topology(void);
#ifndef topology_core_siblings
#define topology_core_siblings(cpu) cpumask_of_cpu(cpu)
#endif
+#ifndef topology_thread_cpumask
+#define topology_thread_cpumask(cpu) cpumask_of(cpu)
+#endif
+#ifndef topology_core_cpumask
+#define topology_core_cpumask(cpu) cpumask_of(cpu)
+#endif

#endif /* _LINUX_TOPOLOGY_H */
diff --git a/init/Kconfig b/init/Kconfig
index 2af8382..6af96b7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -776,6 +776,36 @@ config AIO
by some high performance threaded applications. Disabling
this option saves about 7k.

+config HAVE_PERF_COUNTERS
+ bool
+
+menu "Performance Counters"
+
+config PERF_COUNTERS
+ bool "Kernel Performance Counters"
+ depends on HAVE_PERF_COUNTERS
+ default y
+ select ANON_INODES
+ help
+ Enable kernel support for performance counter hardware.
+
+ Performance counters are special hardware registers available
+ on most modern CPUs. These registers count the number of certain
+ types of hw events, such as instructions executed, cache misses
+ suffered, or branches mis-predicted - without slowing down the
+ kernel or applications. These registers can also trigger interrupts
+ when a threshold number of events has passed - and can thus be
+ used to profile the code that runs on that CPU.
+
+ The Linux Performance Counter subsystem provides an abstraction of
+ these hardware capabilities, available via a system call. It
+ provides per task and per CPU counters, and it provides event
+ capabilities on top of those.
+
+ Say Y if unsure.
+
+endmenu
+
config VM_EVENT_COUNTERS
default y
bool "Enable VM event counters for /proc/vmstat" if EMBEDDED
diff --git a/kernel/Makefile b/kernel/Makefile
index 170a921..5537554 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/
obj-$(CONFIG_SMP) += sched_cpupri.o
+obj-$(CONFIG_PERF_COUNTERS) += perf_counter.o

ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff --git a/kernel/exit.c b/kernel/exit.c
index f80dec3..29f4b79 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -159,6 +159,9 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
{
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

+#ifdef CONFIG_PERF_COUNTERS
+ WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));
+#endif
trace_sched_process_free(tsk);
put_task_struct(tsk);
}
@@ -1093,10 +1096,6 @@ NORET_TYPE void do_exit(long code)
tsk->mempolicy = NULL;
#endif
#ifdef CONFIG_FUTEX
- /*
- * This must happen late, after the PID is not
- * hashed anymore:
- */
if (unlikely(!list_empty(&tsk->pi_state_list)))
exit_pi_state_list(tsk);
if (unlikely(current->pi_state_cache))
@@ -1363,6 +1362,12 @@ static int wait_task_zombie(struct task_struct *p, int options,
*/
read_unlock(&tasklist_lock);

+ /*
+ * Flush inherited counters to the parent - before the parent
+ * gets woken up by child-exit notifications.
+ */
+ perf_counter_exit_task(p);
+
retval = ru ? getrusage(p, RUSAGE_BOTH, ru) : 0;
status = (p->signal->flags & SIGNAL_GROUP_EXIT)
? p->signal->group_exit_code : p->exit_code;
diff --git a/kernel/fork.c b/kernel/fork.c
index bf0cef8..70ca185 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -985,6 +985,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

rt_mutex_init_task(p);
+ perf_counter_init_task(p);

#ifdef CONFIG_PROVE_LOCKING
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index f63c706..c248eba 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -46,7 +46,10 @@ void dynamic_irq_init(unsigned int irq)
desc->irq_count = 0;
desc->irqs_unhandled = 0;
#ifdef CONFIG_SMP
- cpumask_setall(&desc->affinity);
+ cpumask_setall(desc->affinity);
+#ifdef CONFIG_GENERIC_PENDING_IRQ
+ cpumask_clear(desc->pending_mask);
+#endif
#endif
spin_unlock_irqrestore(&desc->lock, flags);
}
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index c20db0b..375d68c 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -17,6 +17,7 @@
#include <linux/kernel_stat.h>
#include <linux/rculist.h>
#include <linux/hash.h>
+#include <linux/bootmem.h>

#include "internals.h"

@@ -57,6 +58,7 @@ int nr_irqs = NR_IRQS;
EXPORT_SYMBOL_GPL(nr_irqs);

#ifdef CONFIG_SPARSE_IRQ
+
static struct irq_desc irq_desc_init = {
.irq = -1,
.status = IRQ_DISABLED,
@@ -64,9 +66,6 @@ static struct irq_desc irq_desc_init = {
.handle_irq = handle_bad_irq,
.depth = 1,
.lock = __SPIN_LOCK_UNLOCKED(irq_desc_init.lock),
-#ifdef CONFIG_SMP
- .affinity = CPU_MASK_ALL
-#endif
};

void init_kstat_irqs(struct irq_desc *desc, int cpu, int nr)
@@ -101,6 +100,10 @@ static void init_one_irq_desc(int irq, struct irq_desc *desc, int cpu)
printk(KERN_ERR "can not alloc kstat_irqs\n");
BUG_ON(1);
}
+ if (!init_alloc_desc_masks(desc, cpu, false)) {
+ printk(KERN_ERR "can not alloc irq_desc cpumasks\n");
+ BUG_ON(1);
+ }
arch_init_chip_data(desc, cpu);
}

@@ -109,7 +112,7 @@ static void init_one_irq_desc(int irq, struct irq_desc *desc, int cpu)
*/
DEFINE_SPINLOCK(sparse_irq_lock);

-struct irq_desc *irq_desc_ptrs[NR_IRQS] __read_mostly;
+struct irq_desc **irq_desc_ptrs __read_mostly;

static struct irq_desc irq_desc_legacy[NR_IRQS_LEGACY] __cacheline_aligned_in_smp = {
[0 ... NR_IRQS_LEGACY-1] = {
@@ -119,14 +122,10 @@ static struct irq_desc irq_desc_legacy[NR_IRQS_LEGACY] __cacheline_aligned_in_sm
.handle_irq = handle_bad_irq,
.depth = 1,
.lock = __SPIN_LOCK_UNLOCKED(irq_desc_init.lock),
-#ifdef CONFIG_SMP
- .affinity = CPU_MASK_ALL
-#endif
}
};

-/* FIXME: use bootmem alloc ...*/
-static unsigned int kstat_irqs_legacy[NR_IRQS_LEGACY][NR_CPUS];
+static unsigned int *kstat_irqs_legacy;

int __init early_irq_init(void)
{
@@ -134,18 +133,30 @@ int __init early_irq_init(void)
int legacy_count;
int i;

+ /* initialize nr_irqs based on nr_cpu_ids */
+ arch_probe_nr_irqs();
+ printk(KERN_INFO "NR_IRQS:%d nr_irqs:%d\n", NR_IRQS, nr_irqs);
+
desc = irq_desc_legacy;
legacy_count = ARRAY_SIZE(irq_desc_legacy);

+ /* allocate irq_desc_ptrs array based on nr_irqs */
+ irq_desc_ptrs = alloc_bootmem(nr_irqs * sizeof(void *));
+
+ /* allocate based on nr_cpu_ids */
+ /* FIXME: invert kstat_irqs, and it'd be a per_cpu_alloc'd thing */
+ kstat_irqs_legacy = alloc_bootmem(NR_IRQS_LEGACY * nr_cpu_ids *
+ sizeof(int));
+
for (i = 0; i < legacy_count; i++) {
desc[i].irq = i;
- desc[i].kstat_irqs = kstat_irqs_legacy[i];
+ desc[i].kstat_irqs = kstat_irqs_legacy + i * nr_cpu_ids;
lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
-
+ init_alloc_desc_masks(&desc[i], 0, true);
irq_desc_ptrs[i] = desc + i;
}

- for (i = legacy_count; i < NR_IRQS; i++)
+ for (i = legacy_count; i < nr_irqs; i++)
irq_desc_ptrs[i] = NULL;

return arch_early_irq_init();
@@ -153,7 +164,10 @@ int __init early_irq_init(void)

struct irq_desc *irq_to_desc(unsigned int irq)
{
- return (irq < NR_IRQS) ? irq_desc_ptrs[irq] : NULL;
+ if (irq_desc_ptrs && irq < nr_irqs)
+ return irq_desc_ptrs[irq];
+
+ return NULL;
}

struct irq_desc *irq_to_desc_alloc_cpu(unsigned int irq, int cpu)
@@ -162,10 +176,9 @@ struct irq_desc *irq_to_desc_alloc_cpu(unsigned int irq, int cpu)
unsigned long flags;
int node;

- if (irq >= NR_IRQS) {
- printk(KERN_WARNING "irq >= NR_IRQS in irq_to_desc_alloc: %d %d\n",
- irq, NR_IRQS);
- WARN_ON(1);
+ if (irq >= nr_irqs) {
+ WARN(1, "irq (%d) >= nr_irqs (%d) in irq_to_desc_alloc\n",
+ irq, nr_irqs);
return NULL;
}

@@ -207,9 +220,6 @@ struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
.handle_irq = handle_bad_irq,
.depth = 1,
.lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock),
-#ifdef CONFIG_SMP
- .affinity = CPU_MASK_ALL
-#endif
}
};

@@ -219,12 +229,15 @@ int __init early_irq_init(void)
int count;
int i;

+ printk(KERN_INFO "NR_IRQS:%d\n", NR_IRQS);
+
desc = irq_desc;
count = ARRAY_SIZE(irq_desc);

- for (i = 0; i < count; i++)
+ for (i = 0; i < count; i++) {
desc[i].irq = i;
-
+ init_alloc_desc_masks(&desc[i], 0, true);
+ }
return arch_early_irq_init();
}

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index e6d0a43..40416a8 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -16,7 +16,14 @@ extern int __irq_set_trigger(struct irq_desc *desc, unsigned int irq,
extern struct lock_class_key irq_desc_lock_class;
extern void init_kstat_irqs(struct irq_desc *desc, int cpu, int nr);
extern spinlock_t sparse_irq_lock;
+
+#ifdef CONFIG_SPARSE_IRQ
+/* irq_desc_ptrs allocated at boot time */
+extern struct irq_desc **irq_desc_ptrs;
+#else
+/* irq_desc_ptrs is a fixed size array */
extern struct irq_desc *irq_desc_ptrs[NR_IRQS];
+#endif

#ifdef CONFIG_PROC_FS
extern void register_irq_proc(unsigned int irq, struct irq_desc *desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index cd0cd8d..b98739a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -98,14 +98,14 @@ int irq_set_affinity(unsigned int irq, const struct cpumask *cpumask)

#ifdef CONFIG_GENERIC_PENDING_IRQ
if (desc->status & IRQ_MOVE_PCNTXT || desc->status & IRQ_DISABLED) {
- cpumask_copy(&desc->affinity, cpumask);
+ cpumask_copy(desc->affinity, cpumask);
desc->chip->set_affinity(irq, cpumask);
} else {
desc->status |= IRQ_MOVE_PENDING;
- cpumask_copy(&desc->pending_mask, cpumask);
+ cpumask_copy(desc->pending_mask, cpumask);
}
#else
- cpumask_copy(&desc->affinity, cpumask);
+ cpumask_copy(desc->affinity, cpumask);
desc->chip->set_affinity(irq, cpumask);
#endif
desc->status |= IRQ_AFFINITY_SET;
@@ -127,16 +127,16 @@ int do_irq_select_affinity(unsigned int irq, struct irq_desc *desc)
* one of the targets is online.
*/
if (desc->status & (IRQ_AFFINITY_SET | IRQ_NO_BALANCING)) {
- if (cpumask_any_and(&desc->affinity, cpu_online_mask)
+ if (cpumask_any_and(desc->affinity, cpu_online_mask)
< nr_cpu_ids)
goto set_affinity;
else
desc->status &= ~IRQ_AFFINITY_SET;
}

- cpumask_and(&desc->affinity, cpu_online_mask, irq_default_affinity);
+ cpumask_and(desc->affinity, cpu_online_mask, irq_default_affinity);
set_affinity:
- desc->chip->set_affinity(irq, &desc->affinity);
+ desc->chip->set_affinity(irq, desc->affinity);

return 0;
}
diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c
index bd72329..e05ad9b 100644
--- a/kernel/irq/migration.c
+++ b/kernel/irq/migration.c
@@ -18,7 +18,7 @@ void move_masked_irq(int irq)

desc->status &= ~IRQ_MOVE_PENDING;

- if (unlikely(cpumask_empty(&desc->pending_mask)))
+ if (unlikely(cpumask_empty(desc->pending_mask)))
return;

if (!desc->chip->set_affinity)
@@ -38,13 +38,13 @@ void move_masked_irq(int irq)
* For correct operation this depends on the caller
* masking the irqs.
*/
- if (likely(cpumask_any_and(&desc->pending_mask, cpu_online_mask)
+ if (likely(cpumask_any_and(desc->pending_mask, cpu_online_mask)
< nr_cpu_ids)) {
- cpumask_and(&desc->affinity,
- &desc->pending_mask, cpu_online_mask);
- desc->chip->set_affinity(irq, &desc->affinity);
+ cpumask_and(desc->affinity,
+ desc->pending_mask, cpu_online_mask);
+ desc->chip->set_affinity(irq, desc->affinity);
}
- cpumask_clear(&desc->pending_mask);
+ cpumask_clear(desc->pending_mask);
}

void move_native_irq(int irq)
diff --git a/kernel/irq/numa_migrate.c b/kernel/irq/numa_migrate.c
index ecf765c..666260e 100644
--- a/kernel/irq/numa_migrate.c
+++ b/kernel/irq/numa_migrate.c
@@ -38,15 +38,22 @@ static void free_kstat_irqs(struct irq_desc *old_desc, struct irq_desc *desc)
old_desc->kstat_irqs = NULL;
}

-static void init_copy_one_irq_desc(int irq, struct irq_desc *old_desc,
+static bool init_copy_one_irq_desc(int irq, struct irq_desc *old_desc,
struct irq_desc *desc, int cpu)
{
memcpy(desc, old_desc, sizeof(struct irq_desc));
+ if (!init_alloc_desc_masks(desc, cpu, false)) {
+ printk(KERN_ERR "irq %d: can not get new irq_desc cpumask "
+ "for migration.\n", irq);
+ return false;
+ }
spin_lock_init(&desc->lock);
desc->cpu = cpu;
lockdep_set_class(&desc->lock, &irq_desc_lock_class);
init_copy_kstat_irqs(old_desc, desc, cpu, nr_cpu_ids);
+ init_copy_desc_masks(old_desc, desc);
arch_init_copy_chip_data(old_desc, desc, cpu);
+ return true;
}

static void free_one_irq_desc(struct irq_desc *old_desc, struct irq_desc *desc)
@@ -76,12 +83,18 @@ static struct irq_desc *__real_move_irq_desc(struct irq_desc *old_desc,
node = cpu_to_node(cpu);
desc = kzalloc_node(sizeof(*desc), GFP_ATOMIC, node);
if (!desc) {
- printk(KERN_ERR "irq %d: can not get new irq_desc for migration.\n", irq);
+ printk(KERN_ERR "irq %d: can not get new irq_desc "
+ "for migration.\n", irq);
+ /* still use old one */
+ desc = old_desc;
+ goto out_unlock;
+ }
+ if (!init_copy_one_irq_desc(irq, old_desc, desc, cpu)) {
/* still use old one */
+ kfree(desc);
desc = old_desc;
goto out_unlock;
}
- init_copy_one_irq_desc(irq, old_desc, desc, cpu);

irq_desc_ptrs[irq] = desc;

diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index aae3f74..692363d 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -20,11 +20,11 @@ static struct proc_dir_entry *root_irq_dir;
static int irq_affinity_proc_show(struct seq_file *m, void *v)
{
struct irq_desc *desc = irq_to_desc((long)m->private);
- const struct cpumask *mask = &desc->affinity;
+ const struct cpumask *mask = desc->affinity;

#ifdef CONFIG_GENERIC_PENDING_IRQ
if (desc->status & IRQ_MOVE_PENDING)
- mask = &desc->pending_mask;
+ mask = desc->pending_mask;
#endif
seq_cpumask(m, mask);
seq_putc(m, '\n');
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
new file mode 100644
index 0000000..1ac18da
--- /dev/null
+++ b/kernel/perf_counter.c
@@ -0,0 +1,2169 @@
+/*
+ * Performance counter core code
+ *
+ * Copyright(C) 2008 Thomas Gleixner <[email protected]>
+ * Copyright(C) 2008 Red Hat, Inc., Ingo Molnar
+ *
+ * For licencing details see kernel-base/COPYING
+ */
+
+#include <linux/fs.h>
+#include <linux/cpu.h>
+#include <linux/smp.h>
+#include <linux/file.h>
+#include <linux/poll.h>
+#include <linux/sysfs.h>
+#include <linux/ptrace.h>
+#include <linux/percpu.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/anon_inodes.h>
+#include <linux/kernel_stat.h>
+#include <linux/perf_counter.h>
+
+/*
+ * Each CPU has a list of per CPU counters:
+ */
+DEFINE_PER_CPU(struct perf_cpu_context, perf_cpu_context);
+
+int perf_max_counters __read_mostly = 1;
+static int perf_reserved_percpu __read_mostly;
+static int perf_overcommit __read_mostly = 1;
+
+/*
+ * Mutex for (sysadmin-configurable) counter reservations:
+ */
+static DEFINE_MUTEX(perf_resource_mutex);
+
+/*
+ * Architecture provided APIs - weak aliases:
+ */
+extern __weak const struct hw_perf_counter_ops *
+hw_perf_counter_init(struct perf_counter *counter)
+{
+ return NULL;
+}
+
+u64 __weak hw_perf_save_disable(void) { return 0; }
+void __weak hw_perf_restore(u64 ctrl) { barrier(); }
+void __weak hw_perf_counter_setup(int cpu) { barrier(); }
+int __weak hw_perf_group_sched_in(struct perf_counter *group_leader,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx, int cpu)
+{
+ return 0;
+}
+
+void __weak perf_counter_print_debug(void) { }
+
+static void
+list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+ struct perf_counter *group_leader = counter->group_leader;
+
+ /*
+ * Depending on whether it is a standalone or sibling counter,
+ * add it straight to the context's counter list, or to the group
+ * leader's sibling list:
+ */
+ if (counter->group_leader == counter)
+ list_add_tail(&counter->list_entry, &ctx->counter_list);
+ else
+ list_add_tail(&counter->list_entry, &group_leader->sibling_list);
+}
+
+static void
+list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
+{
+ struct perf_counter *sibling, *tmp;
+
+ list_del_init(&counter->list_entry);
+
+ /*
+ * If this was a group counter with sibling counters then
+ * upgrade the siblings to singleton counters by adding them
+ * to the context list directly:
+ */
+ list_for_each_entry_safe(sibling, tmp,
+ &counter->sibling_list, list_entry) {
+
+ list_del_init(&sibling->list_entry);
+ list_add_tail(&sibling->list_entry, &ctx->counter_list);
+ sibling->group_leader = sibling;
+ }
+}
+
+static void
+counter_sched_out(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx)
+{
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE)
+ return;
+
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ counter->hw_ops->disable(counter);
+ counter->oncpu = -1;
+
+ if (!is_software_counter(counter))
+ cpuctx->active_oncpu--;
+ ctx->nr_active--;
+ if (counter->hw_event.exclusive || !cpuctx->active_oncpu)
+ cpuctx->exclusive = 0;
+}
+
+static void
+group_sched_out(struct perf_counter *group_counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx)
+{
+ struct perf_counter *counter;
+
+ if (group_counter->state != PERF_COUNTER_STATE_ACTIVE)
+ return;
+
+ counter_sched_out(group_counter, cpuctx, ctx);
+
+ /*
+ * Schedule out siblings (if any):
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry)
+ counter_sched_out(counter, cpuctx, ctx);
+
+ if (group_counter->hw_event.exclusive)
+ cpuctx->exclusive = 0;
+}
+
+/*
+ * Cross CPU call to remove a performance counter
+ *
+ * We disable the counter on the hardware level first. After that we
+ * remove it from the context list.
+ */
+static void __perf_counter_remove_from_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ unsigned long flags;
+ u64 perf_flags;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ curr_rq_lock_irq_save(&flags);
+ spin_lock(&ctx->lock);
+
+ counter_sched_out(counter, cpuctx, ctx);
+
+ counter->task = NULL;
+ ctx->nr_counters--;
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ perf_flags = hw_perf_save_disable();
+ list_del_counter(counter, ctx);
+ hw_perf_restore(perf_flags);
+
+ if (!ctx->task) {
+ /*
+ * Allow more per task counters with respect to the
+ * reservation:
+ */
+ cpuctx->max_pertask =
+ min(perf_max_counters - ctx->nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ }
+
+ spin_unlock(&ctx->lock);
+ curr_rq_unlock_irq_restore(&flags);
+}
+
+
+/*
+ * Remove the counter from a task's (or a CPU's) list of counters.
+ *
+ * Must be called with counter->mutex and ctx->mutex held.
+ *
+ * CPU counters are removed with a smp call. For task counters we only
+ * call when the task is on a CPU.
+ */
+static void perf_counter_remove_from_context(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ /*
+ * Per cpu counters are removed via an smp call and
+ * the removal is always successful.
+ */
+ smp_call_function_single(counter->cpu,
+ __perf_counter_remove_from_context,
+ counter, 1);
+ return;
+ }
+
+retry:
+ task_oncpu_function_call(task, __perf_counter_remove_from_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active we need to retry the smp call.
+ */
+ if (ctx->nr_active && !list_empty(&counter->list_entry)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * The lock prevents this context from being scheduled in, so we
+ * can remove the counter safely if the call above did not
+ * succeed.
+ */
+ if (!list_empty(&counter->list_entry)) {
+ ctx->nr_counters--;
+ list_del_counter(counter, ctx);
+ counter->task = NULL;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to disable a performance counter
+ */
+static void __perf_counter_disable(void *info)
+{
+ struct perf_counter *counter = info;
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter_context *ctx = counter->ctx;
+ unsigned long flags;
+
+ /*
+ * If this is a per-task counter, we need to check whether this
+ * counter's task is the current task on this cpu.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ curr_rq_lock_irq_save(&flags);
+ spin_lock(&ctx->lock);
+
+ /*
+ * If the counter is on, turn it off.
+ * If it is in error state, leave it in error state.
+ */
+ if (counter->state >= PERF_COUNTER_STATE_INACTIVE) {
+ if (counter == counter->group_leader)
+ group_sched_out(counter, cpuctx, ctx);
+ else
+ counter_sched_out(counter, cpuctx, ctx);
+ counter->state = PERF_COUNTER_STATE_OFF;
+ }
+
+ spin_unlock(&ctx->lock);
+ curr_rq_unlock_irq_restore(&flags);
+}
+
+/*
+ * Disable a counter.
+ */
+static void perf_counter_disable(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ /*
+ * Disable the counter on the cpu that it's on
+ */
+ smp_call_function_single(counter->cpu, __perf_counter_disable,
+ counter, 1);
+ return;
+ }
+
+ retry:
+ task_oncpu_function_call(task, __perf_counter_disable, counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the counter is still active, we need to retry the cross-call.
+ */
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * Since we have the lock this context can't be scheduled
+ * in, so we can change the state safely.
+ */
+ if (counter->state == PERF_COUNTER_STATE_INACTIVE)
+ counter->state = PERF_COUNTER_STATE_OFF;
+
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Disable a counter and all its children.
+ */
+static void perf_counter_disable_family(struct perf_counter *counter)
+{
+ struct perf_counter *child;
+
+ perf_counter_disable(counter);
+
+ /*
+ * Lock the mutex to protect the list of children
+ */
+ mutex_lock(&counter->mutex);
+ list_for_each_entry(child, &counter->child_list, child_list)
+ perf_counter_disable(child);
+ mutex_unlock(&counter->mutex);
+}
+
+static int
+counter_sched_in(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx,
+ int cpu)
+{
+ if (counter->state <= PERF_COUNTER_STATE_OFF)
+ return 0;
+
+ counter->state = PERF_COUNTER_STATE_ACTIVE;
+ counter->oncpu = cpu; /* TODO: put 'cpu' into cpuctx->cpu */
+ /*
+ * The new state must be visible before we turn it on in the hardware:
+ */
+ smp_wmb();
+
+ if (counter->hw_ops->enable(counter)) {
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ counter->oncpu = -1;
+ return -EAGAIN;
+ }
+
+ if (!is_software_counter(counter))
+ cpuctx->active_oncpu++;
+ ctx->nr_active++;
+
+ if (counter->hw_event.exclusive)
+ cpuctx->exclusive = 1;
+
+ return 0;
+}
+
+/*
+ * Return 1 for a group consisting entirely of software counters,
+ * 0 if the group contains any hardware counters.
+ */
+static int is_software_only_group(struct perf_counter *leader)
+{
+ struct perf_counter *counter;
+
+ if (!is_software_counter(leader))
+ return 0;
+ list_for_each_entry(counter, &leader->sibling_list, list_entry)
+ if (!is_software_counter(counter))
+ return 0;
+ return 1;
+}
+
+/*
+ * Work out whether we can put this counter group on the CPU now.
+ */
+static int group_can_go_on(struct perf_counter *counter,
+ struct perf_cpu_context *cpuctx,
+ int can_add_hw)
+{
+ /*
+ * Groups consisting entirely of software counters can always go on.
+ */
+ if (is_software_only_group(counter))
+ return 1;
+ /*
+ * If an exclusive group is already on, no other hardware
+ * counters can go on.
+ */
+ if (cpuctx->exclusive)
+ return 0;
+ /*
+ * If this group is exclusive and there are already
+ * counters on the CPU, it can't go on.
+ */
+ if (counter->hw_event.exclusive && cpuctx->active_oncpu)
+ return 0;
+ /*
+ * Otherwise, try to add it if all previous groups were able
+ * to go on.
+ */
+ return can_add_hw;
+}
+
+/*
+ * Cross CPU call to install and enable a performance counter
+ */
+static void __perf_install_in_context(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_counter *leader = counter->group_leader;
+ int cpu = smp_processor_id();
+ unsigned long flags;
+ u64 perf_flags;
+ int err;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ curr_rq_lock_irq_save(&flags);
+ spin_lock(&ctx->lock);
+
+ /*
+ * Protect the list operation against NMI by disabling the
+ * counters on a global level. NOP for non NMI based counters.
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_add_counter(counter, ctx);
+ ctx->nr_counters++;
+
+ /*
+ * Don't put the counter on if it is disabled or if
+ * it is in a group and the group isn't on.
+ */
+ if (counter->state != PERF_COUNTER_STATE_INACTIVE ||
+ (leader != counter && leader->state != PERF_COUNTER_STATE_ACTIVE))
+ goto unlock;
+
+ /*
+ * An exclusive counter can't go on if there are already active
+ * hardware counters, and no hardware counter can go on if there
+ * is already an exclusive counter on.
+ */
+ if (!group_can_go_on(counter, cpuctx, 1))
+ err = -EEXIST;
+ else
+ err = counter_sched_in(counter, cpuctx, ctx, cpu);
+
+ if (err) {
+ /*
+ * This counter couldn't go on. If it is in a group
+ * then we have to pull the whole group off.
+ * If the counter group is pinned then put it in error state.
+ */
+ if (leader != counter)
+ group_sched_out(leader, cpuctx, ctx);
+ if (leader->hw_event.pinned)
+ leader->state = PERF_COUNTER_STATE_ERROR;
+ }
+
+ if (!err && !ctx->task && cpuctx->max_pertask)
+ cpuctx->max_pertask--;
+
+ unlock:
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+ curr_rq_unlock_irq_restore(&flags);
+}
+
+/*
+ * Attach a performance counter to a context
+ *
+ * First we add the counter to the list with the hardware enable bit
+ * in counter->hw_config cleared.
+ *
+ * If the counter is attached to a task which is on a CPU we use a smp
+ * call to enable it in the task context. The task might have been
+ * scheduled away, but we check this in the smp call again.
+ *
+ * Must be called with ctx->mutex held.
+ */
+static void
+perf_install_in_context(struct perf_counter_context *ctx,
+ struct perf_counter *counter,
+ int cpu)
+{
+ struct task_struct *task = ctx->task;
+
+ counter->ctx = ctx;
+ if (!task) {
+ /*
+ * Per cpu counters are installed via an smp call and
+ * the install is always successful.
+ */
+ smp_call_function_single(cpu, __perf_install_in_context,
+ counter, 1);
+ return;
+ }
+
+ counter->task = task;
+retry:
+ task_oncpu_function_call(task, __perf_install_in_context,
+ counter);
+
+ spin_lock_irq(&ctx->lock);
+ /*
+ * If the context is active we need to retry the smp call.
+ */
+ if (ctx->is_active && list_empty(&counter->list_entry)) {
+ spin_unlock_irq(&ctx->lock);
+ goto retry;
+ }
+
+ /*
+ * The lock prevents this context from being scheduled in, so we
+ * can add the counter safely if the call above did not
+ * succeed.
+ */
+ if (list_empty(&counter->list_entry)) {
+ list_add_counter(counter, ctx);
+ ctx->nr_counters++;
+ }
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Cross CPU call to enable a performance counter
+ */
+static void __perf_counter_enable(void *info)
+{
+ struct perf_counter *counter = info;
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_counter *leader = counter->group_leader;
+ unsigned long flags;
+ int err;
+
+ /*
+ * If this is a per-task counter, we need to check whether this
+ * counter's task is the current task on this cpu.
+ */
+ if (ctx->task && cpuctx->task_ctx != ctx)
+ return;
+
+ curr_rq_lock_irq_save(&flags);
+ spin_lock(&ctx->lock);
+
+ if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+ goto unlock;
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+
+ /*
+ * If the counter is in a group and isn't the group leader,
+ * then don't put it on unless the group is on.
+ */
+ if (leader != counter && leader->state != PERF_COUNTER_STATE_ACTIVE)
+ goto unlock;
+
+ if (!group_can_go_on(counter, cpuctx, 1))
+ err = -EEXIST;
+ else
+ err = counter_sched_in(counter, cpuctx, ctx,
+ smp_processor_id());
+
+ if (err) {
+ /*
+ * If this counter can't go on and it's part of a
+ * group, then the whole group has to come off.
+ */
+ if (leader != counter)
+ group_sched_out(leader, cpuctx, ctx);
+ if (leader->hw_event.pinned)
+ leader->state = PERF_COUNTER_STATE_ERROR;
+ }
+
+ unlock:
+ spin_unlock(&ctx->lock);
+ curr_rq_unlock_irq_restore(&flags);
+}
+
+/*
+ * Enable a counter.
+ */
+static void perf_counter_enable(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ /*
+ * Enable the counter on the cpu that it's on
+ */
+ smp_call_function_single(counter->cpu, __perf_counter_enable,
+ counter, 1);
+ return;
+ }
+
+ spin_lock_irq(&ctx->lock);
+ if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+ goto out;
+
+ /*
+ * If the counter is in error state, clear that first.
+ * That way, if we see the counter in error state below, we
+ * know that it has gone back into error state, as distinct
+ * from the task having been scheduled away before the
+ * cross-call arrived.
+ */
+ if (counter->state == PERF_COUNTER_STATE_ERROR)
+ counter->state = PERF_COUNTER_STATE_OFF;
+
+ retry:
+ spin_unlock_irq(&ctx->lock);
+ task_oncpu_function_call(task, __perf_counter_enable, counter);
+
+ spin_lock_irq(&ctx->lock);
+
+ /*
+ * If the context is active and the counter is still off,
+ * we need to retry the cross-call.
+ */
+ if (ctx->is_active && counter->state == PERF_COUNTER_STATE_OFF)
+ goto retry;
+
+ /*
+ * Since we have the lock this context can't be scheduled
+ * in, so we can change the state safely.
+ */
+ if (counter->state == PERF_COUNTER_STATE_OFF)
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ out:
+ spin_unlock_irq(&ctx->lock);
+}
+
+/*
+ * Enable a counter and all its children.
+ */
+static void perf_counter_enable_family(struct perf_counter *counter)
+{
+ struct perf_counter *child;
+
+ perf_counter_enable(counter);
+
+ /*
+ * Lock the mutex to protect the list of children
+ */
+ mutex_lock(&counter->mutex);
+ list_for_each_entry(child, &counter->child_list, child_list)
+ perf_counter_enable(child);
+ mutex_unlock(&counter->mutex);
+}
+
+void __perf_counter_sched_out(struct perf_counter_context *ctx,
+ struct perf_cpu_context *cpuctx)
+{
+ struct perf_counter *counter;
+ u64 flags;
+
+ spin_lock(&ctx->lock);
+ ctx->is_active = 0;
+ if (likely(!ctx->nr_counters))
+ goto out;
+
+ flags = hw_perf_save_disable();
+ if (ctx->nr_active) {
+ list_for_each_entry(counter, &ctx->counter_list, list_entry)
+ group_sched_out(counter, cpuctx, ctx);
+ }
+ hw_perf_restore(flags);
+ out:
+ spin_unlock(&ctx->lock);
+}
+
+/*
+ * Called from scheduler to remove the counters of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each counter and update the counter value in counter->count.
+ *
+ * This does not protect us against NMI, but disable()
+ * sets the disabled bit in the control field of counter _before_
+ * accessing the counter control register. If an NMI hits, then it will
+ * not restart the counter.
+ */
+void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+
+ if (likely(!cpuctx->task_ctx))
+ return;
+
+ __perf_counter_sched_out(ctx, cpuctx);
+
+ cpuctx->task_ctx = NULL;
+}
+
+static void perf_counter_cpu_sched_out(struct perf_cpu_context *cpuctx)
+{
+ __perf_counter_sched_out(&cpuctx->ctx, cpuctx);
+}
+
+static int
+group_sched_in(struct perf_counter *group_counter,
+ struct perf_cpu_context *cpuctx,
+ struct perf_counter_context *ctx,
+ int cpu)
+{
+ struct perf_counter *counter, *partial_group;
+ int ret;
+
+ if (group_counter->state == PERF_COUNTER_STATE_OFF)
+ return 0;
+
+ ret = hw_perf_group_sched_in(group_counter, cpuctx, ctx, cpu);
+ if (ret)
+ return ret < 0 ? ret : 0;
+
+ if (counter_sched_in(group_counter, cpuctx, ctx, cpu))
+ return -EAGAIN;
+
+ /*
+ * Schedule in siblings as one group (if any):
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry) {
+ if (counter_sched_in(counter, cpuctx, ctx, cpu)) {
+ partial_group = counter;
+ goto group_error;
+ }
+ }
+
+ return 0;
+
+group_error:
+ /*
+ * Groups can be scheduled in as one unit only, so undo any
+ * partial group before returning:
+ */
+ list_for_each_entry(counter, &group_counter->sibling_list, list_entry) {
+ if (counter == partial_group)
+ break;
+ counter_sched_out(counter, cpuctx, ctx);
+ }
+ counter_sched_out(group_counter, cpuctx, ctx);
+
+ return -EAGAIN;
+}
+
+static void
+__perf_counter_sched_in(struct perf_counter_context *ctx,
+ struct perf_cpu_context *cpuctx, int cpu)
+{
+ struct perf_counter *counter;
+ u64 flags;
+ int can_add_hw = 1;
+
+ spin_lock(&ctx->lock);
+ ctx->is_active = 1;
+ if (likely(!ctx->nr_counters))
+ goto out;
+
+ flags = hw_perf_save_disable();
+
+ /*
+ * First go through the list and put on any pinned groups
+ * in order to give them the best chance of going on.
+ */
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (counter->state <= PERF_COUNTER_STATE_OFF ||
+ !counter->hw_event.pinned)
+ continue;
+ if (counter->cpu != -1 && counter->cpu != cpu)
+ continue;
+
+ if (group_can_go_on(counter, cpuctx, 1))
+ group_sched_in(counter, cpuctx, ctx, cpu);
+
+ /*
+ * If this pinned group hasn't been scheduled,
+ * put it in error state.
+ */
+ if (counter->state == PERF_COUNTER_STATE_INACTIVE)
+ counter->state = PERF_COUNTER_STATE_ERROR;
+ }
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ /*
+ * Ignore counters in OFF or ERROR state, and
+ * ignore pinned counters since we did them already.
+ */
+ if (counter->state <= PERF_COUNTER_STATE_OFF ||
+ counter->hw_event.pinned)
+ continue;
+
+ /*
+ * Listen to the 'cpu' scheduling filter constraint
+ * of counters:
+ */
+ if (counter->cpu != -1 && counter->cpu != cpu)
+ continue;
+
+ if (group_can_go_on(counter, cpuctx, can_add_hw)) {
+ if (group_sched_in(counter, cpuctx, ctx, cpu))
+ can_add_hw = 0;
+ }
+ }
+ hw_perf_restore(flags);
+ out:
+ spin_unlock(&ctx->lock);
+}
+
+/*
+ * Called from scheduler to add the counters of the current task
+ * with interrupts disabled.
+ *
+ * We restore the counter value and then enable it.
+ *
+ * This does not protect us against NMI, but enable()
+ * sets the enabled bit in the control field of counter _before_
+ * accessing the counter control register. If a NMI hits, then it will
+ * keep the counter running.
+ */
+void perf_counter_task_sched_in(struct task_struct *task, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &task->perf_counter_ctx;
+
+ __perf_counter_sched_in(ctx, cpuctx, cpu);
+ cpuctx->task_ctx = ctx;
+}
+
+static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
+{
+ struct perf_counter_context *ctx = &cpuctx->ctx;
+
+ __perf_counter_sched_in(ctx, cpuctx, cpu);
+}
+
+int perf_counter_task_disable(void)
+{
+ struct task_struct *curr = current;
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ unsigned long flags;
+ u64 perf_flags;
+ int cpu;
+
+ if (likely(!ctx->nr_counters))
+ return 0;
+
+ curr_rq_lock_irq_save(&flags);
+ cpu = smp_processor_id();
+
+ /* force the update of the task clock: */
+ __task_delta_exec(curr, 1);
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Disable all the counters:
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (counter->state != PERF_COUNTER_STATE_ERROR)
+ counter->state = PERF_COUNTER_STATE_OFF;
+ }
+
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ curr_rq_unlock_irq_restore(&flags);
+
+ return 0;
+}
+
+int perf_counter_task_enable(void)
+{
+ struct task_struct *curr = current;
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter *counter;
+ unsigned long flags;
+ u64 perf_flags;
+ int cpu;
+
+ if (likely(!ctx->nr_counters))
+ return 0;
+
+ curr_rq_lock_irq_save(&flags);
+ cpu = smp_processor_id();
+
+ /* force the update of the task clock: */
+ __task_delta_exec(curr, 1);
+
+ perf_counter_task_sched_out(curr, cpu);
+
+ spin_lock(&ctx->lock);
+
+ /*
+ * Disable all the counters:
+ */
+ perf_flags = hw_perf_save_disable();
+
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (counter->state > PERF_COUNTER_STATE_OFF)
+ continue;
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ counter->hw_event.disabled = 0;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+
+ perf_counter_task_sched_in(curr, cpu);
+
+ curr_rq_unlock_irq_restore(&flags);
+
+ return 0;
+}
+
+/*
+ * Round-robin a context's counters:
+ */
+static void rotate_ctx(struct perf_counter_context *ctx)
+{
+ struct perf_counter *counter;
+ u64 perf_flags;
+
+ if (!ctx->nr_counters)
+ return;
+
+ spin_lock(&ctx->lock);
+ /*
+ * Rotate the first entry last (works just fine for group counters too):
+ */
+ perf_flags = hw_perf_save_disable();
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ list_del(&counter->list_entry);
+ list_add_tail(&counter->list_entry, &ctx->counter_list);
+ break;
+ }
+ hw_perf_restore(perf_flags);
+
+ spin_unlock(&ctx->lock);
+}
+
+void perf_counter_task_tick(struct task_struct *curr, int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ const int rotate_percpu = 0;
+
+ if (rotate_percpu)
+ perf_counter_cpu_sched_out(cpuctx);
+ perf_counter_task_sched_out(curr, cpu);
+
+ if (rotate_percpu)
+ rotate_ctx(&cpuctx->ctx);
+ rotate_ctx(ctx);
+
+ if (rotate_percpu)
+ perf_counter_cpu_sched_in(cpuctx, cpu);
+ perf_counter_task_sched_in(curr, cpu);
+}
+
+/*
+ * Cross CPU call to read the hardware counter
+ */
+static void __read(void *info)
+{
+ struct perf_counter *counter = info;
+ unsigned long flags;
+
+ curr_rq_lock_irq_save(&flags);
+ counter->hw_ops->read(counter);
+ curr_rq_unlock_irq_restore(&flags);
+}
+
+static u64 perf_counter_read(struct perf_counter *counter)
+{
+ /*
+ * If counter is enabled and currently active on a CPU, update the
+ * value in the counter structure:
+ */
+ if (counter->state == PERF_COUNTER_STATE_ACTIVE) {
+ smp_call_function_single(counter->oncpu,
+ __read, counter, 1);
+ }
+
+ return atomic64_read(&counter->count);
+}
+
+/*
+ * Cross CPU call to switch performance data pointers
+ */
+static void __perf_switch_irq_data(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter *counter = info;
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+
+ /*
+ * If this is a task context, we need to check whether it is
+ * the current task context of this cpu. If not it has been
+ * scheduled out before the smp call arrived.
+ */
+ if (ctx->task) {
+ if (cpuctx->task_ctx != ctx)
+ return;
+ spin_lock(&ctx->lock);
+ }
+
+ /* Change the pointer NMI safe */
+ atomic_long_set((atomic_long_t *)&counter->irqdata,
+ (unsigned long) counter->usrdata);
+ counter->usrdata = oldirqdata;
+
+ if (ctx->task)
+ spin_unlock(&ctx->lock);
+}
+
+static struct perf_data *perf_switch_irq_data(struct perf_counter *counter)
+{
+ struct perf_counter_context *ctx = counter->ctx;
+ struct perf_data *oldirqdata = counter->irqdata;
+ struct task_struct *task = ctx->task;
+
+ if (!task) {
+ smp_call_function_single(counter->cpu,
+ __perf_switch_irq_data,
+ counter, 1);
+ return counter->usrdata;
+ }
+
+retry:
+ spin_lock_irq(&ctx->lock);
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE) {
+ counter->irqdata = counter->usrdata;
+ counter->usrdata = oldirqdata;
+ spin_unlock_irq(&ctx->lock);
+ return oldirqdata;
+ }
+ spin_unlock_irq(&ctx->lock);
+ task_oncpu_function_call(task, __perf_switch_irq_data, counter);
+ /* Might have failed, because task was scheduled out */
+ if (counter->irqdata == oldirqdata)
+ goto retry;
+
+ return counter->usrdata;
+}
+
+static void put_context(struct perf_counter_context *ctx)
+{
+ if (ctx->task)
+ put_task_struct(ctx->task);
+}
+
+static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+ struct perf_counter_context *ctx;
+ struct task_struct *task;
+
+ /*
+ * If cpu is not a wildcard then this is a percpu counter:
+ */
+ if (cpu != -1) {
+ /* Must be root to operate on a CPU counter: */
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EACCES);
+
+ if (cpu < 0 || cpu > num_possible_cpus())
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * We could be clever and allow to attach a counter to an
+ * offline CPU and activate it when the CPU comes up, but
+ * that's for later.
+ */
+ if (!cpu_isset(cpu, cpu_online_map))
+ return ERR_PTR(-ENODEV);
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ ctx = &cpuctx->ctx;
+
+ return ctx;
+ }
+
+ rcu_read_lock();
+ if (!pid)
+ task = current;
+ else
+ task = find_task_by_vpid(pid);
+ if (task)
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ if (!task)
+ return ERR_PTR(-ESRCH);
+
+ ctx = &task->perf_counter_ctx;
+ ctx->task = task;
+
+ /* Reuse ptrace permission checks for now. */
+ if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+ put_context(ctx);
+ return ERR_PTR(-EACCES);
+ }
+
+ return ctx;
+}
+
+/*
+ * Called when the last reference to the file is gone.
+ */
+static int perf_release(struct inode *inode, struct file *file)
+{
+ struct perf_counter *counter = file->private_data;
+ struct perf_counter_context *ctx = counter->ctx;
+
+ file->private_data = NULL;
+
+ mutex_lock(&ctx->mutex);
+ mutex_lock(&counter->mutex);
+
+ perf_counter_remove_from_context(counter);
+ put_context(ctx);
+
+ mutex_unlock(&counter->mutex);
+ mutex_unlock(&ctx->mutex);
+
+ kfree(counter);
+
+ return 0;
+}
+
+/*
+ * Read the performance counter - simple non blocking version for now
+ */
+static ssize_t
+perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count)
+{
+ u64 cntval;
+
+ if (count != sizeof(cntval))
+ return -EINVAL;
+
+ /*
+ * Return end-of-file for a read on a counter that is in
+ * error state (i.e. because it was pinned but it couldn't be
+ * scheduled on to the CPU at some point).
+ */
+ if (counter->state == PERF_COUNTER_STATE_ERROR)
+ return 0;
+
+ mutex_lock(&counter->mutex);
+ cntval = perf_counter_read(counter);
+ mutex_unlock(&counter->mutex);
+
+ return put_user(cntval, (u64 __user *) buf) ? -EFAULT : sizeof(cntval);
+}
+
+static ssize_t
+perf_copy_usrdata(struct perf_data *usrdata, char __user *buf, size_t count)
+{
+ if (!usrdata->len)
+ return 0;
+
+ count = min(count, (size_t)usrdata->len);
+ if (copy_to_user(buf, usrdata->data + usrdata->rd_idx, count))
+ return -EFAULT;
+
+ /* Adjust the counters */
+ usrdata->len -= count;
+ if (!usrdata->len)
+ usrdata->rd_idx = 0;
+ else
+ usrdata->rd_idx += count;
+
+ return count;
+}
+
+static ssize_t
+perf_read_irq_data(struct perf_counter *counter,
+ char __user *buf,
+ size_t count,
+ int nonblocking)
+{
+ struct perf_data *irqdata, *usrdata;
+ DECLARE_WAITQUEUE(wait, current);
+ ssize_t res, res2;
+
+ irqdata = counter->irqdata;
+ usrdata = counter->usrdata;
+
+ if (usrdata->len + irqdata->len >= count)
+ goto read_pending;
+
+ if (nonblocking)
+ return -EAGAIN;
+
+ spin_lock_irq(&counter->waitq.lock);
+ __add_wait_queue(&counter->waitq, &wait);
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (usrdata->len + irqdata->len >= count)
+ break;
+
+ if (signal_pending(current))
+ break;
+
+ if (counter->state == PERF_COUNTER_STATE_ERROR)
+ break;
+
+ spin_unlock_irq(&counter->waitq.lock);
+ schedule();
+ spin_lock_irq(&counter->waitq.lock);
+ }
+ __remove_wait_queue(&counter->waitq, &wait);
+ __set_current_state(TASK_RUNNING);
+ spin_unlock_irq(&counter->waitq.lock);
+
+ if (usrdata->len + irqdata->len < count &&
+ counter->state != PERF_COUNTER_STATE_ERROR)
+ return -ERESTARTSYS;
+read_pending:
+ mutex_lock(&counter->mutex);
+
+ /* Drain pending data first: */
+ res = perf_copy_usrdata(usrdata, buf, count);
+ if (res < 0 || res == count)
+ goto out;
+
+ /* Switch irq buffer: */
+ usrdata = perf_switch_irq_data(counter);
+ res2 = perf_copy_usrdata(usrdata, buf + res, count - res);
+ if (res2 < 0) {
+ if (!res)
+ res = -EFAULT;
+ } else {
+ res += res2;
+ }
+out:
+ mutex_unlock(&counter->mutex);
+
+ return res;
+}
+
+static ssize_t
+perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
+{
+ struct perf_counter *counter = file->private_data;
+
+ switch (counter->hw_event.record_type) {
+ case PERF_RECORD_SIMPLE:
+ return perf_read_hw(counter, buf, count);
+
+ case PERF_RECORD_IRQ:
+ case PERF_RECORD_GROUP:
+ return perf_read_irq_data(counter, buf, count,
+ file->f_flags & O_NONBLOCK);
+ }
+ return -EINVAL;
+}
+
+static unsigned int perf_poll(struct file *file, poll_table *wait)
+{
+ struct perf_counter *counter = file->private_data;
+ unsigned int events = 0;
+ unsigned long flags;
+
+ poll_wait(file, &counter->waitq, wait);
+
+ spin_lock_irqsave(&counter->waitq.lock, flags);
+ if (counter->usrdata->len || counter->irqdata->len)
+ events |= POLLIN;
+ spin_unlock_irqrestore(&counter->waitq.lock, flags);
+
+ return events;
+}
+
+static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ struct perf_counter *counter = file->private_data;
+ int err = 0;
+
+ switch (cmd) {
+ case PERF_COUNTER_IOC_ENABLE:
+ perf_counter_enable_family(counter);
+ break;
+ case PERF_COUNTER_IOC_DISABLE:
+ perf_counter_disable_family(counter);
+ break;
+ default:
+ err = -ENOTTY;
+ }
+ return err;
+}
+
+static const struct file_operations perf_fops = {
+ .release = perf_release,
+ .read = perf_read,
+ .poll = perf_poll,
+ .unlocked_ioctl = perf_ioctl,
+ .compat_ioctl = perf_ioctl,
+};
+
+static int cpu_clock_perf_counter_enable(struct perf_counter *counter)
+{
+ int cpu = raw_smp_processor_id();
+
+ atomic64_set(&counter->hw.prev_count, cpu_clock(cpu));
+ return 0;
+}
+
+static void cpu_clock_perf_counter_update(struct perf_counter *counter)
+{
+ int cpu = raw_smp_processor_id();
+ s64 prev;
+ u64 now;
+
+ now = cpu_clock(cpu);
+ prev = atomic64_read(&counter->hw.prev_count);
+ atomic64_set(&counter->hw.prev_count, now);
+ atomic64_add(now - prev, &counter->count);
+}
+
+static void cpu_clock_perf_counter_disable(struct perf_counter *counter)
+{
+ cpu_clock_perf_counter_update(counter);
+}
+
+static void cpu_clock_perf_counter_read(struct perf_counter *counter)
+{
+ cpu_clock_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_cpu_clock = {
+ .enable = cpu_clock_perf_counter_enable,
+ .disable = cpu_clock_perf_counter_disable,
+ .read = cpu_clock_perf_counter_read,
+};
+
+/*
+ * Called from within the scheduler:
+ */
+static u64 task_clock_perf_counter_val(struct perf_counter *counter, int update)
+{
+ struct task_struct *curr = counter->task;
+ u64 delta;
+
+ delta = __task_delta_exec(curr, update);
+
+ return curr->se.sum_exec_runtime + delta;
+}
+
+static void task_clock_perf_counter_update(struct perf_counter *counter, u64 now)
+{
+ u64 prev;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void task_clock_perf_counter_read(struct perf_counter *counter)
+{
+ u64 now = task_clock_perf_counter_val(counter, 1);
+
+ task_clock_perf_counter_update(counter, now);
+}
+
+static int task_clock_perf_counter_enable(struct perf_counter *counter)
+{
+ u64 now = task_clock_perf_counter_val(counter, 0);
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ return 0;
+}
+
+static void task_clock_perf_counter_disable(struct perf_counter *counter)
+{
+ u64 now = task_clock_perf_counter_val(counter, 0);
+
+ task_clock_perf_counter_update(counter, now);
+}
+
+static const struct hw_perf_counter_ops perf_ops_task_clock = {
+ .enable = task_clock_perf_counter_enable,
+ .disable = task_clock_perf_counter_disable,
+ .read = task_clock_perf_counter_read,
+};
+
+static u64 get_page_faults(void)
+{
+ struct task_struct *curr = current;
+
+ return curr->maj_flt + curr->min_flt;
+}
+
+static void page_faults_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = get_page_faults();
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void page_faults_perf_counter_read(struct perf_counter *counter)
+{
+ page_faults_perf_counter_update(counter);
+}
+
+static int page_faults_perf_counter_enable(struct perf_counter *counter)
+{
+ /*
+ * page-faults is a per-task value already,
+ * so we dont have to clear it on switch-in.
+ */
+
+ return 0;
+}
+
+static void page_faults_perf_counter_disable(struct perf_counter *counter)
+{
+ page_faults_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_page_faults = {
+ .enable = page_faults_perf_counter_enable,
+ .disable = page_faults_perf_counter_disable,
+ .read = page_faults_perf_counter_read,
+};
+
+static u64 get_context_switches(void)
+{
+ struct task_struct *curr = current;
+
+ return curr->nvcsw + curr->nivcsw;
+}
+
+static void context_switches_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = get_context_switches();
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void context_switches_perf_counter_read(struct perf_counter *counter)
+{
+ context_switches_perf_counter_update(counter);
+}
+
+static int context_switches_perf_counter_enable(struct perf_counter *counter)
+{
+ /*
+ * ->nvcsw + curr->nivcsw is a per-task value already,
+ * so we dont have to clear it on switch-in.
+ */
+
+ return 0;
+}
+
+static void context_switches_perf_counter_disable(struct perf_counter *counter)
+{
+ context_switches_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_context_switches = {
+ .enable = context_switches_perf_counter_enable,
+ .disable = context_switches_perf_counter_disable,
+ .read = context_switches_perf_counter_read,
+};
+
+static inline u64 get_cpu_migrations(void)
+{
+ return current->se.nr_migrations;
+}
+
+static void cpu_migrations_perf_counter_update(struct perf_counter *counter)
+{
+ u64 prev, now;
+ s64 delta;
+
+ prev = atomic64_read(&counter->hw.prev_count);
+ now = get_cpu_migrations();
+
+ atomic64_set(&counter->hw.prev_count, now);
+
+ delta = now - prev;
+
+ atomic64_add(delta, &counter->count);
+}
+
+static void cpu_migrations_perf_counter_read(struct perf_counter *counter)
+{
+ cpu_migrations_perf_counter_update(counter);
+}
+
+static int cpu_migrations_perf_counter_enable(struct perf_counter *counter)
+{
+ /*
+ * se.nr_migrations is a per-task value already,
+ * so we dont have to clear it on switch-in.
+ */
+
+ return 0;
+}
+
+static void cpu_migrations_perf_counter_disable(struct perf_counter *counter)
+{
+ cpu_migrations_perf_counter_update(counter);
+}
+
+static const struct hw_perf_counter_ops perf_ops_cpu_migrations = {
+ .enable = cpu_migrations_perf_counter_enable,
+ .disable = cpu_migrations_perf_counter_disable,
+ .read = cpu_migrations_perf_counter_read,
+};
+
+static const struct hw_perf_counter_ops *
+sw_perf_counter_init(struct perf_counter *counter)
+{
+ const struct hw_perf_counter_ops *hw_ops = NULL;
+
+ switch (counter->hw_event.type) {
+ case PERF_COUNT_CPU_CLOCK:
+ hw_ops = &perf_ops_cpu_clock;
+ break;
+ case PERF_COUNT_TASK_CLOCK:
+ hw_ops = &perf_ops_task_clock;
+ break;
+ case PERF_COUNT_PAGE_FAULTS:
+ hw_ops = &perf_ops_page_faults;
+ break;
+ case PERF_COUNT_CONTEXT_SWITCHES:
+ hw_ops = &perf_ops_context_switches;
+ break;
+ case PERF_COUNT_CPU_MIGRATIONS:
+ hw_ops = &perf_ops_cpu_migrations;
+ break;
+ default:
+ break;
+ }
+ return hw_ops;
+}
+
+/*
+ * Allocate and initialize a counter structure
+ */
+static struct perf_counter *
+perf_counter_alloc(struct perf_counter_hw_event *hw_event,
+ int cpu,
+ struct perf_counter *group_leader,
+ gfp_t gfpflags)
+{
+ const struct hw_perf_counter_ops *hw_ops;
+ struct perf_counter *counter;
+
+ counter = kzalloc(sizeof(*counter), gfpflags);
+ if (!counter)
+ return NULL;
+
+ /*
+ * Single counters are their own group leaders, with an
+ * empty sibling list:
+ */
+ if (!group_leader)
+ group_leader = counter;
+
+ mutex_init(&counter->mutex);
+ INIT_LIST_HEAD(&counter->list_entry);
+ INIT_LIST_HEAD(&counter->sibling_list);
+ init_waitqueue_head(&counter->waitq);
+
+ INIT_LIST_HEAD(&counter->child_list);
+
+ counter->irqdata = &counter->data[0];
+ counter->usrdata = &counter->data[1];
+ counter->cpu = cpu;
+ counter->hw_event = *hw_event;
+ counter->wakeup_pending = 0;
+ counter->group_leader = group_leader;
+ counter->hw_ops = NULL;
+
+ counter->state = PERF_COUNTER_STATE_INACTIVE;
+ if (hw_event->disabled)
+ counter->state = PERF_COUNTER_STATE_OFF;
+
+ hw_ops = NULL;
+ if (!hw_event->raw && hw_event->type < 0)
+ hw_ops = sw_perf_counter_init(counter);
+ if (!hw_ops)
+ hw_ops = hw_perf_counter_init(counter);
+
+ if (!hw_ops) {
+ kfree(counter);
+ return NULL;
+ }
+ counter->hw_ops = hw_ops;
+
+ return counter;
+}
+
+/**
+ * sys_perf_counter_open - open a performance counter, associate it to a task/cpu
+ *
+ * @hw_event_uptr: event type attributes for monitoring/sampling
+ * @pid: target pid
+ * @cpu: target cpu
+ * @group_fd: group leader counter fd
+ */
+asmlinkage int
+sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr __user,
+ pid_t pid, int cpu, int group_fd)
+{
+ struct perf_counter *counter, *group_leader;
+ struct perf_counter_hw_event hw_event;
+ struct perf_counter_context *ctx;
+ struct file *counter_file = NULL;
+ struct file *group_file = NULL;
+ int fput_needed = 0;
+ int fput_needed2 = 0;
+ int ret;
+
+ if (copy_from_user(&hw_event, hw_event_uptr, sizeof(hw_event)) != 0)
+ return -EFAULT;
+
+ /*
+ * Get the target context (task or percpu):
+ */
+ ctx = find_get_context(pid, cpu);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ /*
+ * Look up the group leader (we will attach this counter to it):
+ */
+ group_leader = NULL;
+ if (group_fd != -1) {
+ ret = -EINVAL;
+ group_file = fget_light(group_fd, &fput_needed);
+ if (!group_file)
+ goto err_put_context;
+ if (group_file->f_op != &perf_fops)
+ goto err_put_context;
+
+ group_leader = group_file->private_data;
+ /*
+ * Do not allow a recursive hierarchy (this new sibling
+ * becoming part of another group-sibling):
+ */
+ if (group_leader->group_leader != group_leader)
+ goto err_put_context;
+ /*
+ * Do not allow to attach to a group in a different
+ * task or CPU context:
+ */
+ if (group_leader->ctx != ctx)
+ goto err_put_context;
+ /*
+ * Only a group leader can be exclusive or pinned
+ */
+ if (hw_event.exclusive || hw_event.pinned)
+ goto err_put_context;
+ }
+
+ ret = -EINVAL;
+ counter = perf_counter_alloc(&hw_event, cpu, group_leader, GFP_KERNEL);
+ if (!counter)
+ goto err_put_context;
+
+ ret = anon_inode_getfd("[perf_counter]", &perf_fops, counter, 0);
+ if (ret < 0)
+ goto err_free_put_context;
+
+ counter_file = fget_light(ret, &fput_needed2);
+ if (!counter_file)
+ goto err_free_put_context;
+
+ counter->filp = counter_file;
+ mutex_lock(&ctx->mutex);
+ perf_install_in_context(ctx, counter, cpu);
+ mutex_unlock(&ctx->mutex);
+
+ fput_light(counter_file, fput_needed2);
+
+out_fput:
+ fput_light(group_file, fput_needed);
+
+ return ret;
+
+err_free_put_context:
+ kfree(counter);
+
+err_put_context:
+ put_context(ctx);
+
+ goto out_fput;
+}
+
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+ struct task_struct *task)
+{
+ memset(ctx, 0, sizeof(*ctx));
+ spin_lock_init(&ctx->lock);
+ mutex_init(&ctx->mutex);
+ INIT_LIST_HEAD(&ctx->counter_list);
+ ctx->task = task;
+}
+
+/*
+ * inherit a counter from parent task to child task:
+ */
+static struct perf_counter *
+inherit_counter(struct perf_counter *parent_counter,
+ struct task_struct *parent,
+ struct perf_counter_context *parent_ctx,
+ struct task_struct *child,
+ struct perf_counter *group_leader,
+ struct perf_counter_context *child_ctx)
+{
+ struct perf_counter *child_counter;
+
+ /*
+ * Instead of creating recursive hierarchies of counters,
+ * we link inherited counters back to the original parent,
+ * which has a filp for sure, which we use as the reference
+ * count:
+ */
+ if (parent_counter->parent)
+ parent_counter = parent_counter->parent;
+
+ child_counter = perf_counter_alloc(&parent_counter->hw_event,
+ parent_counter->cpu, group_leader,
+ GFP_KERNEL);
+ if (!child_counter)
+ return NULL;
+
+ /*
+ * Link it up in the child's context:
+ */
+ child_counter->ctx = child_ctx;
+ child_counter->task = child;
+ list_add_counter(child_counter, child_ctx);
+ child_ctx->nr_counters++;
+
+ child_counter->parent = parent_counter;
+ /*
+ * inherit into child's child as well:
+ */
+ child_counter->hw_event.inherit = 1;
+
+ /*
+ * Get a reference to the parent filp - we will fput it
+ * when the child counter exits. This is safe to do because
+ * we are in the parent and we know that the filp still
+ * exists and has a nonzero count:
+ */
+ atomic_long_inc(&parent_counter->filp->f_count);
+
+ /*
+ * Link this into the parent counter's child list
+ */
+ mutex_lock(&parent_counter->mutex);
+ list_add_tail(&child_counter->child_list, &parent_counter->child_list);
+
+ /*
+ * Make the child state follow the state of the parent counter,
+ * not its hw_event.disabled bit. We hold the parent's mutex,
+ * so we won't race with perf_counter_{en,dis}able_family.
+ */
+ if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
+ child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+ else
+ child_counter->state = PERF_COUNTER_STATE_OFF;
+
+ mutex_unlock(&parent_counter->mutex);
+
+ return child_counter;
+}
+
+static int inherit_group(struct perf_counter *parent_counter,
+ struct task_struct *parent,
+ struct perf_counter_context *parent_ctx,
+ struct task_struct *child,
+ struct perf_counter_context *child_ctx)
+{
+ struct perf_counter *leader;
+ struct perf_counter *sub;
+
+ leader = inherit_counter(parent_counter, parent, parent_ctx,
+ child, NULL, child_ctx);
+ if (!leader)
+ return -ENOMEM;
+ list_for_each_entry(sub, &parent_counter->sibling_list, list_entry) {
+ if (!inherit_counter(sub, parent, parent_ctx,
+ child, leader, child_ctx))
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void sync_child_counter(struct perf_counter *child_counter,
+ struct perf_counter *parent_counter)
+{
+ u64 parent_val, child_val;
+
+ parent_val = atomic64_read(&parent_counter->count);
+ child_val = atomic64_read(&child_counter->count);
+
+ /*
+ * Add back the child's count to the parent's count:
+ */
+ atomic64_add(child_val, &parent_counter->count);
+
+ /*
+ * Remove this counter from the parent's list
+ */
+ mutex_lock(&parent_counter->mutex);
+ list_del_init(&child_counter->child_list);
+ mutex_unlock(&parent_counter->mutex);
+
+ /*
+ * Release the parent counter, if this was the last
+ * reference to it.
+ */
+ fput(parent_counter->filp);
+}
+
+static void
+__perf_counter_exit_task(struct task_struct *child,
+ struct perf_counter *child_counter,
+ struct perf_counter_context *child_ctx)
+{
+ struct perf_counter *parent_counter;
+ struct perf_counter *sub, *tmp;
+
+ /*
+ * If we do not self-reap then we have to wait for the
+ * child task to unschedule (it will happen for sure),
+ * so that its counter is at its final count. (This
+ * condition triggers rarely - child tasks usually get
+ * off their CPU before the parent has a chance to
+ * get this far into the reaping action)
+ */
+ if (child != current) {
+ wait_task_inactive(child, 0);
+ list_del_init(&child_counter->list_entry);
+ } else {
+ struct perf_cpu_context *cpuctx;
+ unsigned long flags;
+ u64 perf_flags;
+
+ /*
+ * Disable and unlink this counter.
+ *
+ * Be careful about zapping the list - IRQ/NMI context
+ * could still be processing it:
+ */
+ curr_rq_lock_irq_save(&flags);
+ perf_flags = hw_perf_save_disable();
+
+ cpuctx = &__get_cpu_var(perf_cpu_context);
+
+ group_sched_out(child_counter, cpuctx, child_ctx);
+
+ list_del_init(&child_counter->list_entry);
+
+ child_ctx->nr_counters--;
+
+ hw_perf_restore(perf_flags);
+ curr_rq_unlock_irq_restore(&flags);
+ }
+
+ parent_counter = child_counter->parent;
+ /*
+ * It can happen that parent exits first, and has counters
+ * that are still around due to the child reference. These
+ * counters need to be zapped - but otherwise linger.
+ */
+ if (parent_counter) {
+ sync_child_counter(child_counter, parent_counter);
+ list_for_each_entry_safe(sub, tmp, &child_counter->sibling_list,
+ list_entry) {
+ if (sub->parent)
+ sync_child_counter(sub, sub->parent);
+ kfree(sub);
+ }
+ }
+
+ kfree(child_counter);
+}
+
+/*
+ * When a child task exits, feed back counter values to parent counters.
+ *
+ * Note: we may be running in child context, but the PID is not hashed
+ * anymore so new counters will not be added.
+ */
+void perf_counter_exit_task(struct task_struct *child)
+{
+ struct perf_counter *child_counter, *tmp;
+ struct perf_counter_context *child_ctx;
+
+ child_ctx = &child->perf_counter_ctx;
+
+ if (likely(!child_ctx->nr_counters))
+ return;
+
+ list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list,
+ list_entry)
+ __perf_counter_exit_task(child, child_counter, child_ctx);
+}
+
+/*
+ * Initialize the perf_counter context in task_struct
+ */
+void perf_counter_init_task(struct task_struct *child)
+{
+ struct perf_counter_context *child_ctx, *parent_ctx;
+ struct perf_counter *counter;
+ struct task_struct *parent = current;
+
+ child_ctx = &child->perf_counter_ctx;
+ parent_ctx = &parent->perf_counter_ctx;
+
+ __perf_counter_init_context(child_ctx, child);
+
+ /*
+ * This is executed from the parent task context, so inherit
+ * counters that have been marked for cloning:
+ */
+
+ if (likely(!parent_ctx->nr_counters))
+ return;
+
+ /*
+ * Lock the parent list. No need to lock the child - not PID
+ * hashed yet and not running, so nobody can access it.
+ */
+ mutex_lock(&parent_ctx->mutex);
+
+ /*
+ * We dont have to disable NMIs - we are only looking at
+ * the list, not manipulating it:
+ */
+ list_for_each_entry(counter, &parent_ctx->counter_list, list_entry) {
+ if (!counter->hw_event.inherit)
+ continue;
+
+ if (inherit_group(counter, parent,
+ parent_ctx, child, child_ctx))
+ break;
+ }
+
+ mutex_unlock(&parent_ctx->mutex);
+}
+
+static void __cpuinit perf_counter_init_cpu(int cpu)
+{
+ struct perf_cpu_context *cpuctx;
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ __perf_counter_init_context(&cpuctx->ctx, NULL);
+
+ mutex_lock(&perf_resource_mutex);
+ cpuctx->max_pertask = perf_max_counters - perf_reserved_percpu;
+ mutex_unlock(&perf_resource_mutex);
+
+ hw_perf_counter_setup(cpu);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void __perf_counter_exit_cpu(void *info)
+{
+ struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ struct perf_counter_context *ctx = &cpuctx->ctx;
+ struct perf_counter *counter, *tmp;
+
+ list_for_each_entry_safe(counter, tmp, &ctx->counter_list, list_entry)
+ __perf_counter_remove_from_context(counter);
+}
+static void perf_counter_exit_cpu(int cpu)
+{
+ struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
+ struct perf_counter_context *ctx = &cpuctx->ctx;
+
+ mutex_lock(&ctx->mutex);
+ smp_call_function_single(cpu, __perf_counter_exit_cpu, NULL, 1);
+ mutex_unlock(&ctx->mutex);
+}
+#else
+static inline void perf_counter_exit_cpu(int cpu) { }
+#endif
+
+static int __cpuinit
+perf_cpu_notify(struct notifier_block *self, unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ switch (action) {
+
+ case CPU_UP_PREPARE:
+ case CPU_UP_PREPARE_FROZEN:
+ perf_counter_init_cpu(cpu);
+ break;
+
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ perf_counter_exit_cpu(cpu);
+ break;
+
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block __cpuinitdata perf_cpu_nb = {
+ .notifier_call = perf_cpu_notify,
+};
+
+static int __init perf_counter_init(void)
+{
+ perf_cpu_notify(&perf_cpu_nb, (unsigned long)CPU_UP_PREPARE,
+ (void *)(long)smp_processor_id());
+ register_cpu_notifier(&perf_cpu_nb);
+
+ return 0;
+}
+early_initcall(perf_counter_init);
+
+static ssize_t perf_show_reserve_percpu(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_reserved_percpu);
+}
+
+static ssize_t
+perf_set_reserve_percpu(struct sysdev_class *class,
+ const char *buf,
+ size_t count)
+{
+ struct perf_cpu_context *cpuctx;
+ unsigned long val;
+ int err, cpu, mpt;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > perf_max_counters)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_reserved_percpu = val;
+ for_each_online_cpu(cpu) {
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ spin_lock_irq(&cpuctx->ctx.lock);
+ mpt = min(perf_max_counters - cpuctx->ctx.nr_counters,
+ perf_max_counters - perf_reserved_percpu);
+ cpuctx->max_pertask = mpt;
+ spin_unlock_irq(&cpuctx->ctx.lock);
+ }
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static ssize_t perf_show_overcommit(struct sysdev_class *class, char *buf)
+{
+ return sprintf(buf, "%d\n", perf_overcommit);
+}
+
+static ssize_t
+perf_set_overcommit(struct sysdev_class *class, const char *buf, size_t count)
+{
+ unsigned long val;
+ int err;
+
+ err = strict_strtoul(buf, 10, &val);
+ if (err)
+ return err;
+ if (val > 1)
+ return -EINVAL;
+
+ mutex_lock(&perf_resource_mutex);
+ perf_overcommit = val;
+ mutex_unlock(&perf_resource_mutex);
+
+ return count;
+}
+
+static SYSDEV_CLASS_ATTR(
+ reserve_percpu,
+ 0644,
+ perf_show_reserve_percpu,
+ perf_set_reserve_percpu
+ );
+
+static SYSDEV_CLASS_ATTR(
+ overcommit,
+ 0644,
+ perf_show_overcommit,
+ perf_set_overcommit
+ );
+
+static struct attribute *perfclass_attrs[] = {
+ &attr_reserve_percpu.attr,
+ &attr_overcommit.attr,
+ NULL
+};
+
+static struct attribute_group perfclass_attr_group = {
+ .attrs = perfclass_attrs,
+ .name = "perf_counters",
+};
+
+static int __init perf_counter_sysfs_init(void)
+{
+ return sysfs_create_group(&cpu_sysdev_class.kset.kobj,
+ &perfclass_attr_group);
+}
+device_initcall(perf_counter_sysfs_init);
diff --git a/kernel/sched.c b/kernel/sched.c
index 52bbf1c..40d70d9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -668,7 +668,7 @@ static inline int cpu_of(struct rq *rq)
#define task_rq(p) cpu_rq(task_cpu(p))
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)

-static inline void update_rq_clock(struct rq *rq)
+inline void update_rq_clock(struct rq *rq)
{
rq->clock = sched_clock_cpu(cpu_of(rq));
}
@@ -979,6 +979,26 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
}
}

+void curr_rq_lock_irq_save(unsigned long *flags)
+ __acquires(rq->lock)
+{
+ struct rq *rq;
+
+ local_irq_save(*flags);
+ rq = cpu_rq(smp_processor_id());
+ spin_lock(&rq->lock);
+}
+
+void curr_rq_unlock_irq_restore(unsigned long *flags)
+ __releases(rq->lock)
+{
+ struct rq *rq;
+
+ rq = cpu_rq(smp_processor_id());
+ spin_unlock(&rq->lock);
+ local_irq_restore(*flags);
+}
+
void task_rq_unlock_wait(struct task_struct *p)
{
struct rq *rq = task_rq(p);
@@ -1885,12 +1905,14 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
p->se.sleep_start -= clock_offset;
if (p->se.block_start)
p->se.block_start -= clock_offset;
+#endif
if (old_cpu != new_cpu) {
- schedstat_inc(p, se.nr_migrations);
+ p->se.nr_migrations++;
+#ifdef CONFIG_SCHEDSTATS
if (task_hot(p, old_rq->clock, NULL))
schedstat_inc(p, se.nr_forced2_migrations);
- }
#endif
+ }
p->se.vruntime -= old_cfsrq->min_vruntime -
new_cfsrq->min_vruntime;

@@ -2242,6 +2264,27 @@ static int sched_balance_self(int cpu, int flag)

#endif /* CONFIG_SMP */

+/**
+ * task_oncpu_function_call - call a function on the cpu on which a task runs
+ * @p: the task to evaluate
+ * @func: the function to be called
+ * @info: the function call argument
+ *
+ * Calls the function @func when the task is currently running. This might
+ * be on the current CPU, which just calls the function directly
+ */
+void task_oncpu_function_call(struct task_struct *p,
+ void (*func) (void *info), void *info)
+{
+ int cpu;
+
+ preempt_disable();
+ cpu = task_cpu(p);
+ if (task_curr(p))
+ smp_call_function_single(cpu, func, info, 1);
+ preempt_enable();
+}
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -2384,6 +2427,7 @@ static void __sched_fork(struct task_struct *p)
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
+ p->se.nr_migrations = 0;
p->se.last_wakeup = 0;
p->se.avg_overlap = 0;

@@ -2604,6 +2648,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
*/
prev_state = prev->state;
finish_arch_switch(prev);
+ perf_counter_task_sched_in(current, cpu_of(rq));
finish_lock_switch(rq, prev);
#ifdef CONFIG_SMP
if (current->sched_class->post_schedule)
@@ -4132,6 +4177,29 @@ EXPORT_PER_CPU_SYMBOL(kstat);
* Return any ns on the sched_clock that have not yet been banked in
* @p in case that task is currently running.
*/
+unsigned long long __task_delta_exec(struct task_struct *p, int update)
+{
+ s64 delta_exec;
+ struct rq *rq;
+
+ rq = task_rq(p);
+ WARN_ON_ONCE(!runqueue_is_locked());
+ WARN_ON_ONCE(!task_current(rq, p));
+
+ if (update)
+ update_rq_clock(rq);
+
+ delta_exec = rq->clock - p->se.exec_start;
+
+ WARN_ON_ONCE(delta_exec < 0);
+
+ return delta_exec;
+}
+
+/*
+ * Return any ns on the sched_clock that have not yet been banked in
+ * @p in case that task is currently running.
+ */
unsigned long long task_delta_exec(struct task_struct *p)
{
unsigned long flags;
@@ -4391,6 +4459,7 @@ void scheduler_tick(void)
update_rq_clock(rq);
update_cpu_load(rq);
curr->sched_class->task_tick(rq, curr, 0);
+ perf_counter_task_tick(curr, cpu);
spin_unlock(&rq->lock);

#ifdef CONFIG_SMP
@@ -4586,6 +4655,7 @@ need_resched_nonpreemptible:

if (likely(prev != next)) {
sched_info_switch(prev, next);
+ perf_counter_task_sched_out(prev, cpu);

rq->nr_switches++;
rq->curr = next;
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..da932f4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -960,16 +960,17 @@ static struct task_struct *pick_next_highest_task_rt(struct rq *rq, int cpu)

static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask);

-static inline int pick_optimal_cpu(int this_cpu, cpumask_t *mask)
+static inline int pick_optimal_cpu(int this_cpu,
+ const struct cpumask *mask)
{
int first;

/* "this_cpu" is cheaper to preempt than a remote processor */
- if ((this_cpu != -1) && cpu_isset(this_cpu, *mask))
+ if ((this_cpu != -1) && cpumask_test_cpu(this_cpu, mask))
return this_cpu;

- first = first_cpu(*mask);
- if (first != NR_CPUS)
+ first = cpumask_first(mask);
+ if (first < nr_cpu_ids)
return first;

return -1;
@@ -981,6 +982,7 @@ static int find_lowest_rq(struct task_struct *task)
struct cpumask *lowest_mask = __get_cpu_var(local_cpu_mask);
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);
+ cpumask_var_t domain_mask;

if (task->rt.nr_cpus_allowed == 1)
return -1; /* No other targets possible */
@@ -1013,19 +1015,25 @@ static int find_lowest_rq(struct task_struct *task)
if (this_cpu == cpu)
this_cpu = -1; /* Skip this_cpu opt if the same */

- for_each_domain(cpu, sd) {
- if (sd->flags & SD_WAKE_AFFINE) {
- cpumask_t domain_mask;
- int best_cpu;
+ if (alloc_cpumask_var(&domain_mask, GFP_ATOMIC)) {
+ for_each_domain(cpu, sd) {
+ if (sd->flags & SD_WAKE_AFFINE) {
+ int best_cpu;

- cpumask_and(&domain_mask, sched_domain_span(sd),
- lowest_mask);
+ cpumask_and(domain_mask,
+ sched_domain_span(sd),
+ lowest_mask);

- best_cpu = pick_optimal_cpu(this_cpu,
- &domain_mask);
- if (best_cpu != -1)
- return best_cpu;
+ best_cpu = pick_optimal_cpu(this_cpu,
+ domain_mask);
+
+ if (best_cpu != -1) {
+ free_cpumask_var(domain_mask);
+ return best_cpu;
+ }
+ }
}
+ free_cpumask_var(domain_mask);
}

/*
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bdbe9de..0365b48 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -795,6 +795,11 @@ int __init __weak early_irq_init(void)
return 0;
}

+int __init __weak arch_probe_nr_irqs(void)
+{
+ return 0;
+}
+
int __init __weak arch_early_irq_init(void)
{
return 0;
diff --git a/kernel/sys.c b/kernel/sys.c
index e7dc0e1..87ca037 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -14,6 +14,7 @@
#include <linux/prctl.h>
#include <linux/highuid.h>
#include <linux/fs.h>
+#include <linux/perf_counter.h>
#include <linux/resource.h>
#include <linux/kernel.h>
#include <linux/kexec.h>
@@ -1799,6 +1800,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_TSC:
error = SET_TSC_CTL(arg2);
break;
+ case PR_TASK_PERF_COUNTERS_DISABLE:
+ error = perf_counter_task_disable();
+ break;
+ case PR_TASK_PERF_COUNTERS_ENABLE:
+ error = perf_counter_task_enable();
+ break;
case PR_GET_TIMERSLACK:
error = current->timer_slack_ns;
break;
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 27dad29..68320f6 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -175,3 +175,6 @@ cond_syscall(compat_sys_timerfd_settime);
cond_syscall(compat_sys_timerfd_gettime);
cond_syscall(sys_eventfd);
cond_syscall(sys_eventfd2);
+
+/* performance counters: */
+cond_syscall(sys_perf_counter_open);
diff --git a/lib/smp_processor_id.c b/lib/smp_processor_id.c
index 0f8fc22..4689cb0 100644
--- a/lib/smp_processor_id.c
+++ b/lib/smp_processor_id.c
@@ -22,7 +22,7 @@ notrace unsigned int debug_smp_processor_id(void)
* Kernel threads bound to a single CPU can safely use
* smp_processor_id():
*/
- if (cpus_equal(current->cpus_allowed, cpumask_of_cpu(this_cpu)))
+ if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
goto out;

/*


2009-01-21 19:36:35

by Randy Dunlap

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Ingo Molnar wrote:
> We are pleased to announce version 6 of our performance counters subsystem
> implementation. The shortlog, diffstat and the combo patch can be found
> below. The combo patch against latest -git (2.6.29-rc2) can be also found
> at:
>
> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>
> It's also available in tip/master at:
>
> http://people.redhat.com/mingo/tip.git/README
>
> There are many changes in the v6 release:
>
> - PowerPC performance counters support from Paul Mackerras, for POWER6
> and for the PPC970 family.
>
> - ioctl API to disable/enable individual counters and groups without
> closing their fd. This can be useful for libraries, ad-hoc
> instrumentation and PAPI support.
>
> - 'pinned' and 'exclusive' counter attributes - for those
> applications that want to influence counter scheduling explicitly.
>
> - The 'perfstat' utility (ex 'timec') has been updated:
>
> http://people.redhat.com/mingo/perfcounters/perfstat.c
>
> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>
> http://people.redhat.com/mingo/perfcounters/kerneltop.c

BTW, this kerneltop has nothing to do with that other one??

http://www.xenotime.net/linux/kerneltop/


Thanks,
--
~Randy

2009-01-21 19:57:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


* Randy Dunlap <[email protected]> wrote:

> Ingo Molnar wrote:
> > We are pleased to announce version 6 of our performance counters subsystem
> > implementation. The shortlog, diffstat and the combo patch can be found
> > below. The combo patch against latest -git (2.6.29-rc2) can be also found
> > at:
> >
> > http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
> >
> > It's also available in tip/master at:
> >
> > http://people.redhat.com/mingo/tip.git/README
> >
> > There are many changes in the v6 release:
> >
> > - PowerPC performance counters support from Paul Mackerras, for POWER6
> > and for the PPC970 family.
> >
> > - ioctl API to disable/enable individual counters and groups without
> > closing their fd. This can be useful for libraries, ad-hoc
> > instrumentation and PAPI support.
> >
> > - 'pinned' and 'exclusive' counter attributes - for those
> > applications that want to influence counter scheduling explicitly.
> >
> > - The 'perfstat' utility (ex 'timec') has been updated:
> >
> > http://people.redhat.com/mingo/perfcounters/perfstat.c
> >
> > - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
> >
> > http://people.redhat.com/mingo/perfcounters/kerneltop.c
>
> BTW, this kerneltop has nothing to do with that other one??
>
> http://www.xenotime.net/linux/kerneltop/

heh, didn't know about that one - there's no connection other than the name
:-) The project seems somewhat stale but indeed similar in purpose. Can
rename it to kerneltop2, I guess.

Ingo

2009-01-21 21:16:45

by Randy Dunlap

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Ingo Molnar wrote:
> * Randy Dunlap <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> We are pleased to announce version 6 of our performance counters subsystem
>>> implementation. The shortlog, diffstat and the combo patch can be found
>>> below. The combo patch against latest -git (2.6.29-rc2) can be also found
>>> at:
>>>
>>> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>>>
>>> It's also available in tip/master at:
>>>
>>> http://people.redhat.com/mingo/tip.git/README
>>>
>>> There are many changes in the v6 release:
>>>
>>> - PowerPC performance counters support from Paul Mackerras, for POWER6
>>> and for the PPC970 family.
>>>
>>> - ioctl API to disable/enable individual counters and groups without
>>> closing their fd. This can be useful for libraries, ad-hoc
>>> instrumentation and PAPI support.
>>>
>>> - 'pinned' and 'exclusive' counter attributes - for those
>>> applications that want to influence counter scheduling explicitly.
>>>
>>> - The 'perfstat' utility (ex 'timec') has been updated:
>>>
>>> http://people.redhat.com/mingo/perfcounters/perfstat.c
>>>
>>> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>>>
>>> http://people.redhat.com/mingo/perfcounters/kerneltop.c
>> BTW, this kerneltop has nothing to do with that other one??
>>
>> http://www.xenotime.net/linux/kerneltop/
>
> heh, didnt know about that one - there's no connection other than the name
> :-) The project seems somewhat stale but indeed similar in purpose. Can
> rename to kerneltop2 i guess.

Yes, it's stale. I plan to update it sometime this year. :)

I don't care if you rename it or not.

--
~Randy

2009-01-22 11:24:15

by Karel Zak

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


Hi,

On Wed, Jan 21, 2009 at 07:50:21PM +0100, Ingo Molnar wrote:
> - The 'perfstat' utility (ex 'timec') has been updated:
>
> http://people.redhat.com/mingo/perfcounters/perfstat.c

what are your planning to do with this utility? We can merge it into
util-linux-ng.

Karel

--
Karel Zak <[email protected]>

2009-01-22 12:06:19

by Karel Zak

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

On Thu, Jan 22, 2009 at 12:22:38PM +0100, Karel Zak wrote:
> On Wed, Jan 21, 2009 at 07:50:21PM +0100, Ingo Molnar wrote:
> > - The 'perfstat' utility (ex 'timec') has been updated:
> >
> > http://people.redhat.com/mingo/perfcounters/perfstat.c
>
> what are your planning to do with this utility? We can merge it into

grr.. s/your/you/

> util-linux-ng.
>
> Karel

--
Karel Zak <[email protected]>

2009-01-22 12:07:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


* Karel Zak <[email protected]> wrote:

> On Wed, Jan 21, 2009 at 07:50:21PM +0100, Ingo Molnar wrote:
> > - The 'perfstat' utility (ex 'timec') has been updated:
> >
> > http://people.redhat.com/mingo/perfcounters/perfstat.c
>
> what are your planning to do with this utility? We can merge it into
> util-linux-ng.

That would be nice to do, if/once/when the subsystem and the syscall
itself are merged upstream. The syscall ABI might still change until that
happens.

Ingo

2009-01-26 01:06:31

by Corey Ashford

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Ingo Molnar wrote:
> We are pleased to announce version 6 of our performance counters subsystem
> implementation. The shortlog, diffstat and the combo patch can be found
> below. The combo patch against latest -git (2.6.29-rc2) can be also found
> at:
>
> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>
> It's also available in tip/master at:
>
> http://people.redhat.com/mingo/tip.git/README
>
> There are many changes in the v6 release:
>
> - PowerPC performance counters support from Paul Mackerras, for POWER6
> and for the PPC970 family.
>
> - ioctl API to disable/enable individual counters and groups without
> closing their fd. This can be useful for libraries, ad-hoc
> instrumentation and PAPI support.
>
> - 'pinned' and 'exclusive' counter attributes - for those
> applications that want to influence counter scheduling explicitly.
>
> - The 'perfstat' utility (ex 'timec') has been updated:
>
> http://people.redhat.com/mingo/perfcounters/perfstat.c
>
> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>
> http://people.redhat.com/mingo/perfcounters/kerneltop.c
>
> - Merged to latest mainline
>
> - Various fixes and other updates
>
> Ingo

Hi Ingo,

Looking over the latest capabilities of this proposal, I am wondering
how it can accommodate performance monitor units which have extra
registers which require user-defined data to be loaded into them.

For example, on the Power architecture, there is an Instruction Matching
Register which allows the counting of particular instructions.
Currently, this is unsupported in perfmon2/3, but we have plans to add
it, and it's pretty straight-forward to imagine how this would be done
in perfmon.

But I don't see an obvious way to do it with your proposal. Do you have
any ideas how Performance Counters for Linux could accommodate this sort
of PMU functionality?

One thought would be to change the event code to an event descriptor
structure, which has room for lots of bits, including arch-defined bits
(in the case of Power, an IMR value, and others). This might also be a
way to accommodate unit masks (and enums) as well, which Andi Kleen
pointed out as an issue in an earlier LKML posting.
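
To make the idea a bit more concrete, a rough sketch - all field names
below are invented, purely for illustration, not a worked-out ABI
proposal:

struct perf_event_descriptor {
	__u64	event_code;	/* event select / unit mask, as today */
	__u64	arch_config[2];	/* arch-defined bits: e.g. an IMR value
				 * on Power, an opcode matcher elsewhere */
	__u32	arch_flags;	/* which arch_config[] words are valid */
	__u32	__reserved;
};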

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-26 09:13:58

by Stephane Eranian

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Hi,

Corey brings up an interesting problem which I wanted to comment on.

The current proposal hinges on the idea that by interpreting a single
value the kernel can understand what the user wants to measure. For
instance, if I pass type=0, then the kernel understands I want to
measure CPU_CYCLES. Given that the number of events and their unit mask
combinations can be large, the proposal also provides a "raw" mode,
where the content of the type field is interpreted as the raw value to
put into a register.
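
For instance, a minimal user-side sketch - this assumes a small
sys_perf_counter_open() syscall wrapper and uses only fields visible in
the v6 patch; the raw bits are a placeholder, not a real encoding:

/* open a counter (CPU cycles, or a raw PMU event) for the current task */
static int open_counter(int use_raw, long long raw_bits)
{
	struct perf_counter_hw_event hw_event;

	memset(&hw_event, 0, sizeof(hw_event));

	if (use_raw) {
		hw_event.raw  = 1;
		hw_event.type = raw_bits;	/* PMU-specific encoding */
	} else {
		hw_event.type = 0;		/* generalized: CPU cycles */
	}

	/* pid 0: current task, cpu -1: any CPU, group_fd -1: no group */
	return sys_perf_counter_open(&hw_event, 0, -1, -1);
}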

This is where there is an issue, because with several PMU models,
including on X86, the raw bit + a 64-bit value is not enough to figure
out what the user wants to measure. This happens when the PMU has more
than just counters. Thus, interpreting each raw value as the event code
may be wrong. To remain on familiar territory, the Nehalem uncore PMU
has an opcode matcher register that uses a 64-bit value. On AMD64
Family 10h, you have IBS. But I could give examples on Itanium with
opcode matchers and range restrictions. Corey provided other examples
for Power. The API has to provide a way to express what the raw value
is meant for: counter, matcher, filter...

There are PMUs where programming an event requires writing two config
registers. This is the case for all Netburst-based processors, where
you have to program both CCCR and ESCR. I wonder how raw mode is
supported for those processors. What if a PMU requires three registers
to be programmed?
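
Just to illustrate the kind of information that is missing today - a
sketch only, with made-up names, not a proposal:

/* hypothetical: which kind of PMU register a raw value targets */
enum pmu_raw_class {
	PMU_RAW_EVTSEL,		/* an ordinary counter config register */
	PMU_RAW_MATCHER,	/* an opcode/instruction matcher register */
	PMU_RAW_FILTER,		/* a range restriction / filter register */
};

struct pmu_raw_reg {
	__u32	class;		/* one of enum pmu_raw_class */
	__u32	idx;		/* which register of that class */
	__u64	val;		/* the raw bits to program into it */
};

A Netburst-style event would then be described by two such entries
(ESCR + CCCR), an opcode-matched uncore event by an event select plus
a matcher entry, and so on.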


On Mon, Jan 26, 2009 at 2:06 AM, Corey Ashford
<[email protected]> wrote:
> Ingo Molnar wrote:
>>
>> We are pleased to announce version 6 of our performance counters subsystem
>> implementation. The shortlog, diffstat and the combo patch can be found
>> below. The combo patch against latest -git (2.6.29-rc2) can be also found
>> at:
>>
>>
>> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>>
>> It's also available in tip/master at:
>>
>> http://people.redhat.com/mingo/tip.git/README
>>
>> There are many changes in the v6 release:
>>
>> - PowerPC performance counters support from Paul Mackerras, for POWER6
>> and for the PPC970 family.
>>
>> - ioctl API to disable/enable individual counters and groups without
>> closing their fd. This can be useful for libraries, ad-hoc
>> instrumentation and PAPI support.
>>
>> - 'pinned' and 'exclusive' counter attributes - for those
>> applications that want to influence counter scheduling explicitly.
>>
>> - The 'perfstat' utility (ex 'timec') has been updated:
>>
>> http://people.redhat.com/mingo/perfcounters/perfstat.c
>>
>> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>> http://people.redhat.com/mingo/perfcounters/kerneltop.c
>>
>> - Merged to latest mainline
>>
>> - Various fixes and other updates
>>
>> Ingo
>
> Hi Ingo,
>
> Looking over the latest capabilities of this proposal, I am wondering how it
> can accommodate performance monitor units which have extra registers which
> require user-defined data to be loaded into them.
>
> For example, on the Power architecture, there is an Instruction Matching
> Register which allows the counting of particular instructions. Currently,
> this is unsupported in perfmon2/3, but we have plans to add it, and it's
> pretty straight-forward to imagine how this would be done in perfmon.
>
> But I don't see an obvious way to do it with your proposal. Do you have any
> ideas how Performance Counters for Linux could accommodate this sort of PMU
> functionality?
>
> One thought would be to change the event code to an event descriptor
> structure, which has room for lots of bits, including arch-defined bits (in
> the case of Power, an IMR value, and others). This might also be a way to
> accommodate unit masks (and enums) as well, which Andi Kleen pointed out as
> an issue in an earlier LKML posting.
>
> Regards,
>
> - Corey
>
> Corey Ashford
> Software Engineer
> IBM Linux Technology Center, Linux Toolchain
> Beaverton, OR
> 503-578-3507
> [email protected]
>
>

2009-01-26 15:18:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


* stephane eranian <[email protected]> wrote:

> Hi,
>
> Corey brings up an interesting problem which I wanted to comment on.
>
> The current proposal hinges on the idea that by interpreting a single
> value the kernel can understand what the user wants to measure. For
> instance, if I pass type=0, then the kernel understands I want to
> measure CPU_CYCLES. Given that the number of events and their unit mask
> combinations can be large, the proposal also provides a "raw" mode,
> where the content of the type field is interpreted as the raw value to
> put into a register.
>
> This is where there is an issue because with several PMU models,
> including on X86, using the raw bit + 64 value is not enough to figure
> out what the user wants to measure. This happens when the PMU has more
> than counters. Thus, interpreting each raw value has the event code may
> be wrong. To remain on familiar territory, the Nehalem uncore PMU has an
> opcode matcher register, that uses a 64-bit value. On AMD64 Family 10h,
> you have IBS. But I could give examples on Itanium with opcode matchers,
> range restrictions. Corey provided other examples for Power. The API has
> to provide a way to express what the raw value is meant for: counter,
> matcher, filter...

this can be done in a number of ways (in order of increasing levels of
abstraction):

- the raw type is kept wide enough. Paul already requested the raw type
to be widened to 128 bits to express certain PowerPC features.

- or the PMU capability is expressed as a special counter type (if it's
useful enough) - and then either the write() method or ioctl is extended
to express attributes we want to set/change while a counter is running.

- or the highest level counter / hw event data type is extended with new
attribute field(s).

My feeling is that we generally want such hw features to start small -
i.e. at the raw type level initially. Then we can allow them to climb the
ladder, if they prove their utility in practice. We've got space reserved
in the ABI to allow for growth like this.
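
(Purely as a sketch of the first two variants - the names below are
invented, they are not part of the v6 patch:)

struct perf_raw_event {
	__u64	config[2];	/* 128-bit raw descriptor - wide enough for
				 * the PowerPC and opcode-matcher cases */
};

/* hypothetical ioctl to change such attributes on a live counter fd -
 * this does not exist in the patch, the command number is made up: */
#define PERF_COUNTER_IOC_SET_RAW	_IOW('$', 8, struct perf_raw_event)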

Ingo

2009-01-26 16:56:17

by Stephane Eranian

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

On Mon, Jan 26, 2009 at 4:17 PM, Ingo Molnar <[email protected]> wrote:
>
> * stephane eranian <[email protected]> wrote:
>
>> Hi,
>>
>> Corey brings up an interesting problem which I wanted to comment on.
>>
>> The current proposal hinges on the idea that by interpreting a single
>> value the kernel can understand what the user wants to measure. For
>> instance, if I pass type=0, then the kernel understands I want to
>> measure CPU_CYCLES. Given that the number of events and their unit mask
>> combinations can be large, the proposal also provides a "raw" mode,
>> where the content of the type field is interpreted as the raw value to
>> put into a register.
>>
>> This is where there is an issue because with several PMU models,
>> including on X86, using the raw bit + 64 value is not enough to figure
>> out what the user wants to measure. This happens when the PMU has more
>> than counters. Thus, interpreting each raw value has the event code may
>> be wrong. To remain on familiar territory, the Nehalem uncore PMU has an
>> opcode matcher register, that uses a 64-bit value. On AMD64 Family 10h,
>> you have IBS. But I could give examples on Itanium with opcode matchers,
>> range restrictions. Corey provided other examples for Power. The API has
>> to provide a way to express what the raw value is meant for: counter,
>> matcher, filter...
>
> this can be done in a number of ways (in order of increasing levels of
> abstraction):
>
> - the raw type is kept wide enough. Paul already requested the raw type
> to be widened to 128 bits to express certain PowerPC features.

Yes, one bit is not enough. With 128 bits there would be enough room to
encode all the resources I can think of on existing Itanium and X86.

But then, I think the fields would have to be renamed to make things
clearer. 'Raw' would denote a type of resource, and the current 'type'
field could be renamed 'code', 'val' or 'id', which reflects better
what the content would actually be.

As for Netburst, both CCCR and ESCR use only the bottom 32 bits, so
they could be stuffed into the 64-bit 'type' field. But that would not
work if a PMU were to require wider values. Those values are part of
the event encoding and should not be considered optional, nor
attributes. That is why using a separate call to program the second
value does not seem appropriate to me.

>
> - or the PMU capability is expressed as a special counter type (if it's
> useful enough) - and then either the write() method or ioctl is extended
> to express attributes we want to set/change while a counter is running.
>
> - or the highest level counter / hw event data type is extended with new
> attribute field(s).
>
> My feeling is that we generally want such hw features to start small -
> i.e. at the raw type level initially. Then we can allow them to climb the
> ladder, if they prove their utility in practice. We've got space reserved
> in the ABI to allow for growth like this.
>
> Ingo
>
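
As a concrete illustration of the split Stephane suggests above (a
resource type plus a wider 'code' field), here is a minimal C sketch;
all names and widths are illustrative assumptions, not from any posted
patch:

#include <stdint.h>

/* Hypothetical raw descriptor: a resource selector plus 128 bits of code. */
enum raw_resource {
        RAW_RES_COUNTER = 0,    /* event-select / unit-mask style encoding  */
        RAW_RES_MATCHER,        /* e.g. opcode or instruction matcher value */
        RAW_RES_FILTER,         /* e.g. range restriction or other filter   */
};

struct raw_event_desc {
        uint32_t resource;      /* which PMU resource the code programs */
        uint32_t reserved;
        uint64_t code[2];       /* 128-bit model-specific value ('code'/'val'/'id') */
};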

2009-01-26 19:14:15

by Corey Ashford

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Ingo Molnar wrote:
> * stephane eranian <[email protected]> wrote:
>
>> Hi,
>>
>> Corey brings up an interesting problem which I wanted to comment on.
>>
>> The current proposal hinges on the idea that by interpreting a single
>> value the kernel can understand what the user wants to measure. For
>> instance, if I pass type=0, then the kernel understands I want to
>> measure CPU_CYCLES. Given that the number of events and their unit mask
>> combinations can be large, the proposal also provides a "raw" mode,
>> where the content of the type field is interpreted as the raw value to
>> put into a register.
>>
>> This is where there is an issue because with several PMU models,
>> including on X86, using the raw bit + 64 value is not enough to figure
>> out what the user wants to measure. This happens when the PMU has more
>> than counters. Thus, interpreting each raw value has the event code may
>> be wrong. To remain on familiar territory, the Nehalem uncore PMU has an
>> opcode matcher register, that uses a 64-bit value. On AMD64 Family 10h,
>> you have IBS. But I could give examples on Itanium with opcode matchers,
>> range restrictions. Corey provided other examples for Power. The API has
>> to provide a way to express what the raw value is meant for: counter,
>> matcher, filter...
>
> this can be done in a number of ways (in order of increasing levels of
> abstraction):
>
> - the raw type is kept wide enough. Paul already requested the raw type
> to be widened to 128 bits to express certain PowerPC features.
>
> - or the PMU capability is expressed as a special counter type (if it's
> useful enough) - and then either the write() method or ioctl is extended
> to express attributes we want to set/change while a counter is running.
>
> - or the highest level counter / hw event data type is extended with new
> attribute field(s).
>
> My feeling is that we generally want such hw features to start small -
> i.e. at the raw type level initially. Then we can allow them to climb the
> ladder, if they prove their utility in practice. We've got space reserved
> in the ABI to allow for growth like this.
>
> Ingo


Hi Ingo and Stephane,

Thanks for the replies.

I think any one of those solutions would work for Power's Instruction
Matching Register. If more than one register needs to be programmed, or
the values don't fit into the 128-bit raw event types, we could use the
"special counter" approach.

I will have another look at the Power PMU description and see if there
are other constraints that might cause us to want to go one way or the
other, or perhaps a different way.

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-26 19:39:19

by Tony Luck

[permalink] [raw]
Subject: RE: [perfmon2] [announce] Performance Counters for Linux, v6

> - or the PMU capability is expressed as a special counter type (if it's
> useful enough) - and then either the write() method or ioctl is extended
> to express attributes we want to set/change while a counter is running.

The product of:
{exotic PMU modes} * {creative performance measurement ideas}
will produce a large number of candidates for these special counters
(at least on ia64 ... which has a large number of exotic PMU options).

I don't think that I'm qualified to judge which of them are "useful enough"
to warrant a special counter type.

-Tony

2009-01-26 22:11:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [perfmon2] [announce] Performance Counters for Linux, v6


* Luck, Tony <[email protected]> wrote:

> > - or the PMU capability is expressed as a special counter type (if it's
> > useful enough) - and then either the write() method or ioctl is extended
> > to express attributes we want to set/change while a counter is running.
>
> The product of:
> {exotic PMU modes} * {creative performance measurement ideas}
>
> will produce a large number of candidates for these special counters (at
> least on ia64 ... which has a large number of exotic PMU options).
>
> I don't think that I'm qualified to judge which of them are "useful
> enough" to warrant a special counter type.

it should certainly be done on a case by case basis. They need to be
consciously exposed, not just summarily exported to user-space, because
PMU hw features have security implications, so it all has to be done
explicitly.

Ingo

2009-01-26 22:16:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


* Corey Ashford <[email protected]> wrote:

> Ingo Molnar wrote:
>> * stephane eranian <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Corey brings up an interesting problem which I wanted to comment on.
>>>
>>> The current proposal hinges on the idea that by interpreting a single
>>> value the kernel can understand what the user wants to measure. For
>>> instance, if I pass type=0, then the kernel understands I want to
>>> measure CPU_CYCLES. Given that the number of events and their unit
>>> mask combinations can be large, the proposal also provides a "raw"
>>> mode, where the content of the type field is interpreted as the raw
>>> value to put into a register.
>>>
>>> This is where there is an issue because with several PMU models,
>>> including on X86, using the raw bit + 64 value is not enough to
>>> figure out what the user wants to measure. This happens when the PMU
>>> has more than counters. Thus, interpreting each raw value has the
>>> event code may be wrong. To remain on familiar territory, the Nehalem
>>> uncore PMU has an opcode matcher register, that uses a 64-bit value.
>>> On AMD64 Family 10h, you have IBS. But I could give examples on
>>> Itanium with opcode matchers, range restrictions. Corey provided
>>> other examples for Power. The API has to provide a way to express
>>> what the raw value is meant for: counter, matcher, filter...
>>
>> this can be done in a number of ways (in order of increasing levels of
>> abstraction):
>>
>> - the raw type is kept wide enough. Paul already requested the raw type
>> to be widened to 128 bits to express certain PowerPC features.
>>
>> - or the PMU capability is expressed as a special counter type (if it's
>> useful enough) - and then either the write() method or ioctl is extended
>> to express attributes we want to set/change while a counter is running.
>>
>> - or the highest level counter / hw event data type is extended with new
>> attribute field(s).
>>
>> My feeling is that we generally want such hw features to start small -
>> i.e. at the raw type level initially. Then we can allow them to climb
>> the ladder, if they prove their utility in practice. We've got space
>> reserved in the ABI to allow for growth like this.
>>
>> Ingo
>
>
> Hi Ingo and Stephane,
>
> Thanks for the replies.
>
> I think any one of those solutions would work for Power's Instruction
> Matching Register. If more than one register needs to be programmed, or
> the values don't fit into the 128-bit raw event types, we could use the
> "special counter" approach, I think.
>
> I will have another look at the Power PMU description and see if there
> are other constraints that might cause us to want to go one way or the
> other, or perhaps a different way.

thanks, that's really appreciated!

One useful approach would be to come up with a bit count that you think
would suffice, considering even (currently) fringe/odd features - and
we'd make sure there's enough space for that in the ABI, should there be
a need/desire to expose it in the future.

Ingo

2009-01-26 23:41:19

by Corey Ashford

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Ingo Molnar wrote:
> * Corey Ashford <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> * stephane eranian <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Corey brings up an interesting problem which I wanted to comment on.
>>>>
>>>> The current proposal hinges on the idea that by interpreting a single
>>>> value the kernel can understand what the user wants to measure. For
>>>> instance, if I pass type=0, then the kernel understands I want to
>>>> measure CPU_CYCLES. Given that the number of events and their unit
>>>> mask combinations can be large, the proposal also provides a "raw"
>>>> mode, where the content of the type field is interpreted as the raw
>>>> value to put into a register.
>>>>
>>>> This is where there is an issue because with several PMU models,
>>>> including on X86, using the raw bit + 64 value is not enough to
>>>> figure out what the user wants to measure. This happens when the PMU
>>>> has more than counters. Thus, interpreting each raw value has the
>>>> event code may be wrong. To remain on familiar territory, the Nehalem
>>>> uncore PMU has an opcode matcher register, that uses a 64-bit value.
>>>> On AMD64 Family 10h, you have IBS. But I could give examples on
>>>> Itanium with opcode matchers, range restrictions. Corey provided
>>>> other examples for Power. The API has to provide a way to express
>>>> what the raw value is meant for: counter, matcher, filter...
>>> this can be done in a number of ways (in order of increasing levels of
>>> abstraction):
>>>
>>> - the raw type is kept wide enough. Paul already requested the raw type
>>> to be widened to 128 bits to express certain PowerPC features.
>>>
>>> - or the PMU capability is expressed as a special counter type (if it's
>>> useful enough) - and then either the write() method or ioctl is extended
>>> to express attributes we want to set/change while a counter is running.
>>>
>>> - or the highest level counter / hw event data type is extended with new
>>> attribute field(s).
>>>
>>> My feeling is that we generally want such hw features to start small -
>>> i.e. at the raw type level initially. Then we can allow them to climb
>>> the ladder, if they prove their utility in practice. We've got space
>>> reserved in the ABI to allow for growth like this.
>>>
>>> Ingo
>>
>> Hi Ingo and Stephane,
>>
>> Thanks for the replies.
>>
>> I think any one of those solutions would work for Power's Instruction
>> Matching Register. If more than one register needs to be programmed, or
>> the values don't fit into the 128-bit raw event types, we could use the
>> "special counter" approach, I think.
>>
>> I will have another look at the Power PMU description and see if there
>> are other constraints that might cause us to want to go one way or the
>> other, or perhaps a different way.
>
> thanks, that's really appreciated!
>
> One useful approach would be to come up with a bitcount that you think
> would fit considering even (currently) fringe/odd features - and we'd make
> sure there's enough space for that in the ABI - should there be a
> need/desire to expose that in the future.
>
> Ingo

Looking at the Instruction Matching CAM on Power6, it consists of two
64-bit values, but there are quite a few reserved bits, and bits that
must be programmed in a fixed way. If we were to squeeze the reserved
and fixed bits out of the ABI, that would leave 74 real bits of data
that a user would want to be able to set.

In addition to that, there is an instruction marking mechanism that
requires 2 bits to set the sampling mode.

Lastly, there is a thresholding mechanism that has 6 bits of count, two
3-bit start/end event fields, and a 2-bit granularity field.

In total, that's 90 bits in addition to the event code (9 bits?). There
may be a few stragglers that I have missed, and some room should be left
for future processors; 128 bits could be a bit tight for future
generations.

While reading the Power6 PMU manual, I also had a look at the Power5+
PMU manual, and it has five more accessible instruction matching
registers (32 bits each). These five are somewhat more special-purpose
(they match fewer bits in the instruction), and they probably could be
left out, but it would be nice if the ABI had room for them.
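
To make the bit budget above concrete, the tally works out as follows
(a standalone sketch; the variable names are invented, only the widths
come from Corey's description):

#include <stdio.h>

int main(void)
{
        /* widths quoted from the mail above */
        int imr_bits       = 74;            /* Instruction Matching CAM, minus reserved/fixed bits */
        int marking_bits   = 2;             /* instruction marking / sampling mode */
        int threshold_bits = 6 + 3 + 3 + 2; /* count + start event + end event + granularity */
        int event_code     = 9;             /* event code, per Corey's estimate */

        int total = imr_bits + marking_bits + threshold_bits + event_code;

        /* prints: 99 bits used, 29 spare out of a 128-bit raw field */
        printf("%d bits used, %d spare out of 128\n", total, 128 - total);
        return 0;
}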

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-29 02:11:38

by Corey Ashford

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Ingo Molnar wrote:
> We are pleased to announce version 6 of our performance counters subsystem
> implementation. The shortlog, diffstat and the combo patch can be found
> below. The combo patch against latest -git (2.6.29-rc2) can be also found
> at:
>
> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>
> It's also available in tip/master at:
>
> http://people.redhat.com/mingo/tip.git/README
>
> There are many changes in the v6 release:
>
> - PowerPC performance counters support from Paul Mackerras, for POWER6
> and for the PPC970 family.
>
> - ioctl API to disable/enable individual counters and groups without
> closing their fd. This can be useful for libraries, ad-hoc
> instrumentation and PAPI support.
>
> - 'pinned' and 'exclusive' counter attributes - for those
> applications that want to influence counter scheduling explicitly.
>
> - The 'perfstat' utility (ex 'timec') has been updated:
>
> http://people.redhat.com/mingo/perfcounters/perfstat.c
>
> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>
> http://people.redhat.com/mingo/perfcounters/kerneltop.c
>
> - Merged to latest mainline
>
> - Various fixes and other updates
>
> Ingo

I'm not sure if this is the right place to propose such a thing, but I
think it would be very valuable to have a standardized user-side library
to accompany this addition to the kernel.

In particular, as a starting place for the discussion, I'd like to see
functions in it that are very similar to a subset of what is currently
in libpfm. Specifically, I'd like to see the following functions (with
the names changed to pcl_* perhaps):

extern pfm_err_t pfm_find_event(const char *str, unsigned int *idx);
extern pfm_err_t pfm_find_event_bycode(int code, unsigned int *idx);
extern pfm_err_t pfm_find_event_bycode_next(int code, unsigned int start,
unsigned int *next);
extern pfm_err_t pfm_find_event_mask(unsigned int event_idx, const char
*str,
unsigned int *mask_idx);
extern pfm_err_t pfm_find_full_event(const char *str, pfmlib_event_t *e);

extern pfm_err_t pfm_get_max_event_name_len(size_t *len);

extern pfm_err_t pfm_get_num_events(unsigned int *count);
extern pfm_err_t pfm_get_num_event_masks(unsigned int event_idx,
unsigned int *count);
extern pfm_err_t pfm_get_event_name(unsigned int idx, char *name,
size_t maxlen);
extern pfm_err_t pfm_get_full_event_name(pfmlib_event_t *e, char *name,
size_t maxlen);
extern pfm_err_t pfm_get_event_code(unsigned int idx, int *code);
extern pfm_err_t pfm_get_event_mask_code(unsigned int idx,
unsigned int mask_idx,
unsigned int *code);
extern pfm_err_t pfm_get_event_description(unsigned int idx, char **str);
extern pfm_err_t pfm_get_event_code_counter(unsigned int idx, unsigned
int cnt,
int *code);
extern pfm_err_t pfm_get_event_mask_name(unsigned int event_idx,
unsigned int mask_idx,
char *name, size_t maxlen);
extern pfm_err_t pfm_get_event_mask_description(unsigned int event_idx,
unsigned int mask_idx,
char **desc);


Now, since it's not clear right now how unit masks are going to be
handled in your proposal, I'm not sure that the *_event_mask_* functions
are applicable, but I think something that fills that role will be
needed.

Architectures that need additional functionality should be free
to add arch-specific functions.

Full descriptions of these functions can be found in the man pages of
the libpfm documentation.

Any thoughts on this? Do you already have a user library structure in mind?
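
As a rough usage sketch of the kind of enumeration the prototypes above
allow (pfm_initialize() and the PFMLIB_SUCCESS return code are assumed
from existing libpfm; error handling abbreviated):

#include <stdio.h>
#include <perfmon/pfmlib.h>

int main(void)
{
        unsigned int i, nevents;
        char name[256];

        if (pfm_initialize() != PFMLIB_SUCCESS)
                return 1;

        /* walk the event table and print every known event name */
        pfm_get_num_events(&nevents);
        for (i = 0; i < nevents; i++)
                if (pfm_get_event_name(i, name, sizeof(name)) == PFMLIB_SUCCESS)
                        printf("%u: %s\n", i, name);

        return 0;
}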

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-29 12:32:39

by Stephane Eranian

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Corey,

On Thu, Jan 29, 2009 at 3:10 AM, Corey Ashford
<[email protected]> wrote:
> Ingo Molnar wrote:
>>
>> We are pleased to announce version 6 of our performance counters subsystem
>> implementation. The shortlog, diffstat and the combo patch can be found
>> below. The combo patch against latest -git (2.6.29-rc2) can be also found
>> at:
>>
>>
>> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>>
>> It's also available in tip/master at:
>>
>> http://people.redhat.com/mingo/tip.git/README
>>
>> There are many changes in the v6 release:
>>
>> - PowerPC performance counters support from Paul Mackerras, for POWER6
>> and for the PPC970 family.
>>
>> - ioctl API to disable/enable individual counters and groups without
>> closing their fd. This can be useful for libraries, ad-hoc
>> instrumentation and PAPI support.
>>
>> - 'pinned' and 'exclusive' counter attributes - for those
>> applications that want to influence counter scheduling explicitly.
>>
>> - The 'perfstat' utility (ex 'timec') has been updated:
>>
>> http://people.redhat.com/mingo/perfcounters/perfstat.c
>>
>> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>> http://people.redhat.com/mingo/perfcounters/kerneltop.c
>>
>> - Merged to latest mainline
>>
>> - Various fixes and other updates
>>
>> Ingo
>
> I'm not sure if this is the right place to propose such a thing, but I think
> it would be very valuable to have a standardized user-side library to
> accompany this addition to the kernel.
>
> In particular, as a starting place for the discussion, I'd like to see
> functions in it that are very similar to a subset of what is currently in
> libpfm. Specifically, I'd like to see the following functions (with the
> names changed to pcl_* perhaps):
>
> extern pfm_err_t pfm_find_event(const char *str, unsigned int *idx);
> extern pfm_err_t pfm_find_event_bycode(int code, unsigned int *idx);
> extern pfm_err_t pfm_find_event_bycode_next(int code, unsigned int start,
> unsigned int *next);
> extern pfm_err_t pfm_find_event_mask(unsigned int event_idx, const char
> *str,
> unsigned int *mask_idx);
> extern pfm_err_t pfm_find_full_event(const char *str, pfmlib_event_t *e);
>
> extern pfm_err_t pfm_get_max_event_name_len(size_t *len);
>
> extern pfm_err_t pfm_get_num_events(unsigned int *count);
> extern pfm_err_t pfm_get_num_event_masks(unsigned int event_idx,
> unsigned int *count);
> extern pfm_err_t pfm_get_event_name(unsigned int idx, char *name,
> size_t maxlen);
> extern pfm_err_t pfm_get_full_event_name(pfmlib_event_t *e, char *name,
> size_t maxlen);
> extern pfm_err_t pfm_get_event_code(unsigned int idx, int *code);
> extern pfm_err_t pfm_get_event_mask_code(unsigned int idx,
> unsigned int mask_idx,
> unsigned int *code);
> extern pfm_err_t pfm_get_event_description(unsigned int idx, char **str);
> extern pfm_err_t pfm_get_event_code_counter(unsigned int idx, unsigned int
> cnt,
> int *code);
> extern pfm_err_t pfm_get_event_mask_name(unsigned int event_idx,
> unsigned int mask_idx,
> char *name, size_t maxlen);
> extern pfm_err_t pfm_get_event_mask_description(unsigned int event_idx,
> unsigned int mask_idx,
> char **desc);
>
>
> Now, since it's not clear right now how unit masks are going to be handled
> in your proposal, I'm not sure the that *_event_mask_* functions are
> applicable, but I think something that fills that function will be needed.
>
> Architectures that have need for additional functionality should be free to
> add arch-specific functions.
>
> Full descriptions of these functions can be found in the man pages of the
> libpfm documentation.
>
> Any thoughts on this? Do you already have a user library structure in mind?
>
Yes, I have given some thought to all of this. In fact, I have been
playing a bit with libpfm and the LPC proposal.

I think, given that LPC handles the event -> counter assignment in the
kernel, libpfm does not have to do it. All it needs to do is the
event:attributes -> value translation, and that value is then passed
to the kernel in raw mode.

Event attributes include, on x86 for instance, the edge, invert,
counter-mask and plm fields. I think we could do something more generic
than what is currently there. That would not require PMU-specific data
structures for attributes. Just pass everything in as a string.

To that end, I have been experimenting with something along these lines:

int pfm_get_event_encoding(char *event_str, uint64_t **values, int *count);

Events are encoded as follows:

event_name:[unit_mask1:unit_mask2:...:unit_maskn][::A1=V1:A2=V2:..:An=Vn]

Attribute names and values depend on each PMU model. Attribute names
are strings. Values can have any type.

For X86, most attributes would be identical; same thing on Itanium,
because they are architected.

Some PMU models may need more than one 64-bit value to configure one
event; that is why there is a vector and a count. Libpfm should not be
concerned with how those values are encoded and passed to the kernel.
It should only be concerned with the event -> value mapping as
described in the PMU documentation.

Given that LPC manages events independently of each other, libpfm does
not really need to process multiple events at a time to get a global
view of what is being measured.

Here is an example:

$ self inst_retired:any_p::i=1:c=1:u=1:k=1
[0x1d300c0 event_sel=0xc0 umask=0x0 os=1 usr=1 en=1 int=1 inv=1 edge=0
cnt_mask=1] INST_RETIRED
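
A small sketch of how a tool might call the experimental
pfm_get_event_encoding() above; only the prototype and the event-string
syntax come from this mail, while the return convention and the
ownership of the returned vector are assumptions:

#include <stdio.h>
#include <inttypes.h>

/* experimental prototype quoted from the mail above */
int pfm_get_event_encoding(char *event_str, uint64_t **values, int *count);

int main(void)
{
        char event[] = "inst_retired:any_p::i=1:c=1:u=1:k=1";
        uint64_t *values = NULL;
        int i, count = 0;

        if (pfm_get_event_encoding(event, &values, &count) != 0)
                return 1;

        /* one or more 64-bit values, to be handed to the kernel in raw mode */
        for (i = 0; i < count; i++)
                printf("value[%d] = 0x%" PRIx64 "\n", i, values[i]);

        return 0;
}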

2009-01-29 20:01:32

by Corey Ashford

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

stephane eranian wrote:
> Corey,
>
> On Thu, Jan 29, 2009 at 3:10 AM, Corey Ashford
> <[email protected]> wrote:
>> Ingo Molnar wrote:
>>> We are pleased to announce version 6 of our performance counters subsystem
>>> implementation. The shortlog, diffstat and the combo patch can be found
>>> below. The combo patch against latest -git (2.6.29-rc2) can be also found
>>> at:
>>>
>>>
>>> http://people.redhat.com/mingo/perfcounters/perfcounters-v6-v2.6.29-rc2.patch
>>>
>>> It's also available in tip/master at:
>>>
>>> http://people.redhat.com/mingo/tip.git/README
>>>
>>> There are many changes in the v6 release:
>>>
>>> - PowerPC performance counters support from Paul Mackerras, for POWER6
>>> and for the PPC970 family.
>>>
>>> - ioctl API to disable/enable individual counters and groups without
>>> closing their fd. This can be useful for libraries, ad-hoc
>>> instrumentation and PAPI support.
>>>
>>> - 'pinned' and 'exclusive' counter attributes - for those
>>> applications that want to influence counter scheduling explicitly.
>>>
>>> - The 'perfstat' utility (ex 'timec') has been updated:
>>>
>>> http://people.redhat.com/mingo/perfcounters/perfstat.c
>>>
>>> - 'kerneltop' (easy-to-use text mode NMI profiler) has been updated:
>>> http://people.redhat.com/mingo/perfcounters/kerneltop.c
>>>
>>> - Merged to latest mainline
>>>
>>> - Various fixes and other updates
>>>
>>> Ingo
>> I'm not sure if this is the right place to propose such a thing, but I think
>> it would be very valuable to have a standardized user-side library to
>> accompany this addition to the kernel.
>>
>> In particular, as a starting place for the discussion, I'd like to see
>> functions in it that are very similar to a subset of what is currently in
>> libpfm. Specifically, I'd like to see the following functions (with the
>> names changed to pcl_* perhaps):
>>
>> extern pfm_err_t pfm_find_event(const char *str, unsigned int *idx);
>> extern pfm_err_t pfm_find_event_bycode(int code, unsigned int *idx);
>> extern pfm_err_t pfm_find_event_bycode_next(int code, unsigned int start,
>> unsigned int *next);
>> extern pfm_err_t pfm_find_event_mask(unsigned int event_idx, const char
>> *str,
>> unsigned int *mask_idx);
>> extern pfm_err_t pfm_find_full_event(const char *str, pfmlib_event_t *e);
>>
>> extern pfm_err_t pfm_get_max_event_name_len(size_t *len);
>>
>> extern pfm_err_t pfm_get_num_events(unsigned int *count);
>> extern pfm_err_t pfm_get_num_event_masks(unsigned int event_idx,
>> unsigned int *count);
>> extern pfm_err_t pfm_get_event_name(unsigned int idx, char *name,
>> size_t maxlen);
>> extern pfm_err_t pfm_get_full_event_name(pfmlib_event_t *e, char *name,
>> size_t maxlen);
>> extern pfm_err_t pfm_get_event_code(unsigned int idx, int *code);
>> extern pfm_err_t pfm_get_event_mask_code(unsigned int idx,
>> unsigned int mask_idx,
>> unsigned int *code);
>> extern pfm_err_t pfm_get_event_description(unsigned int idx, char **str);
>> extern pfm_err_t pfm_get_event_code_counter(unsigned int idx, unsigned int
>> cnt,
>> int *code);
>> extern pfm_err_t pfm_get_event_mask_name(unsigned int event_idx,
>> unsigned int mask_idx,
>> char *name, size_t maxlen);
>> extern pfm_err_t pfm_get_event_mask_description(unsigned int event_idx,
>> unsigned int mask_idx,
>> char **desc);
>>
>>
>> Now, since it's not clear right now how unit masks are going to be handled
>> in your proposal, I'm not sure the that *_event_mask_* functions are
>> applicable, but I think something that fills that function will be needed.
>>
>> Architectures that have need for additional functionality should be free to
>> add arch-specific functions.
>>
>> Full descriptions of these functions can be found in the man pages of the
>> libpfm documentation.
>>
>> Any thoughts on this? Do you already have a user library structure in mind?
> Yes, I did give some thoughts to all of this. In fact, I have been
> playing a bit with
> libpfm and the LPC proposal.
>
> I think, given that LPC is dealing with event -> counter assignment in
> the kernel, libpfm
> does not have to do it. All it needs to do is event:attributes ->
> value, and that value is
> then passed to the kernel in raw mode.
>
> Event attributes includes on x86, for instance, the edge, invert,
> counter-mask, plm, field.
> I think we could do something more generic than what is currently
> there. That would not
> require PMU specific data structures for attributes. Just pass
> everything into a string.
>
> To that extent, I have been experimenting with something along those lines:
>
> int pfm_get_event_encoding(char *event_str, uint64_t **values, int *count);
>
> events are encoded as follows:
>
> event_name:[unit_mask1:unit_mask2:...:unit_maskn][::A1=V1:A2=V2:..:An=Vn]
>
> Attribute names and values depend on each PMU model. Attributes names
> are strings.
> Values can have any type.
>
> For X86, most attributes would be identical, same thing on Itanium
> because they are
> architected.
>
> Some PMU models may need more than one 64-bit value to configure one
> event, That is
> is why there is vector and a count. Libpfm should not be concerned
> with how those values
> are encoded and passed to the kernel. It should be concerned with the
> event -> value
> as described in the PMU documentation.
>
> Given that LPC manages events independently of each other, libpfm does
> not reallly need
> to process multiple events at a time to get a global view of what is
> being measured.
>
> Here is an example:
>
> $ self inst_retired:any_p::i=1:c=1:u=1:k=1
> [0x1d300c0 event_sel=0xc0 umask=0x0 os=1 usr=1 en=1 int=1 inv=1 edge=0
> cnt_mask=1] INST_RETIRED

This looks encouraging!

I assume the library would still retain the functions that allow us to
iterate through the available events, and obtain text descriptions of
events. Would it make sense to have similar functions to obtain the
available unit masks and attributes for a particular event?

For debugging purposes at least, it might make sense to have a function
that does the inverse of pfm_get_event_encoding as well.

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-01-29 21:44:30

by Stephane Eranian

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

On Thu, Jan 29, 2009 at 9:01 PM, Corey Ashford
<[email protected]> wrote:
>>>
>>> I'm not sure if this is the right place to propose such a thing, but I
>>> think
>>> it would be very valuable to have a standardized user-side library to
>>> accompany this addition to the kernel.
>>>
>>> In particular, as a starting place for the discussion, I'd like to see
>>> functions in it that are very similar to a subset of what is currently in
>>> libpfm. Specifically, I'd like to see the following functions (with the
>>> names changed to pcl_* perhaps):
>>>
>>> extern pfm_err_t pfm_find_event(const char *str, unsigned int *idx);
>>> extern pfm_err_t pfm_find_event_bycode(int code, unsigned int *idx);
>>> extern pfm_err_t pfm_find_event_bycode_next(int code, unsigned int start,
>>> unsigned int *next);
>>> extern pfm_err_t pfm_find_event_mask(unsigned int event_idx, const char
>>> *str,
>>> unsigned int *mask_idx);
>>> extern pfm_err_t pfm_find_full_event(const char *str, pfmlib_event_t *e);
>>>
>>> extern pfm_err_t pfm_get_max_event_name_len(size_t *len);
>>>
>>> extern pfm_err_t pfm_get_num_events(unsigned int *count);
>>> extern pfm_err_t pfm_get_num_event_masks(unsigned int event_idx,
>>> unsigned int *count);
>>> extern pfm_err_t pfm_get_event_name(unsigned int idx, char *name,
>>> size_t maxlen);
>>> extern pfm_err_t pfm_get_full_event_name(pfmlib_event_t *e, char *name,
>>> size_t maxlen);
>>> extern pfm_err_t pfm_get_event_code(unsigned int idx, int *code);
>>> extern pfm_err_t pfm_get_event_mask_code(unsigned int idx,
>>> unsigned int mask_idx,
>>> unsigned int *code);
>>> extern pfm_err_t pfm_get_event_description(unsigned int idx, char **str);
>>> extern pfm_err_t pfm_get_event_code_counter(unsigned int idx, unsigned
>>> int
>>> cnt,
>>> int *code);
>>> extern pfm_err_t pfm_get_event_mask_name(unsigned int event_idx,
>>> unsigned int mask_idx,
>>> char *name, size_t maxlen);
>>> extern pfm_err_t pfm_get_event_mask_description(unsigned int event_idx,
>>> unsigned int mask_idx,
>>> char **desc);
>>>
>>>
>>> Now, since it's not clear right now how unit masks are going to be
>>> handled
>>> in your proposal, I'm not sure the that *_event_mask_* functions are
>>> applicable, but I think something that fills that function will be
>>> needed.
>>>
>>> Architectures that have need for additional functionality should be free
>>> to
>>> add arch-specific functions.
>>>
>>> Full descriptions of these functions can be found in the man pages of the
>>> libpfm documentation.
>>>
>>> Any thoughts on this? Do you already have a user library structure in mind?
>>
>> Yes, I did give some thoughts to all of this. In fact, I have been
>> playing a bit with
>> libpfm and the LPC proposal.
>>
>> I think, given that LPC is dealing with event -> counter assignment in
>> the kernel, libpfm
>> does not have to do it. All it needs to do is event:attributes ->
>> value, and that value is
>> then passed to the kernel in raw mode.
>>
>> Event attributes includes on x86, for instance, the edge, invert,
>> counter-mask, plm, field.
>> I think we could do something more generic than what is currently
>> there. That would not
>> require PMU specific data structures for attributes. Just pass
>> everything into a string.
>>
>> To that extent, I have been experimenting with something along those
>> lines:
>>
>> int pfm_get_event_encoding(char *event_str, uint64_t **values, int
>> *count);
>>
>> events are encoded as follows:
>>
>>
>> event_name:[unit_mask1:unit_mask2:...:unit_maskn][::A1=V1:A2=V2:..:An=Vn]
>>
>> Attribute names and values depend on each PMU model. Attributes names
>> are strings.
>> Values can have any type.
>>
>> For X86, most attributes would be identical, same thing on Itanium
>> because they are
>> architected.
>>
>> Some PMU models may need more than one 64-bit value to configure one
>> event, That is
>> is why there is vector and a count. Libpfm should not be concerned
>> with how those values
>> are encoded and passed to the kernel. It should be concerned with the
>> event -> value
>> as described in the PMU documentation.
>>
>> Given that LPC manages events independently of each other, libpfm does
>> not reallly need
>> to process multiple events at a time to get a global view of what is
>> being measured.
>>
>> Here is an example:
>>
>> $ self inst_retired:any_p::i=1:c=1:u=1:k=1
>> [0x1d300c0 event_sel=0xc0 umask=0x0 os=1 usr=1 en=1 int=1 inv=1 edge=0
>> cnt_mask=1] INST_RETIRED
>
> This looks encouraging!
>
> I assume the library would still retain the functions that allow us to
> iterate through the available events, and obtain text description of events.
> Would it make sense to have similar functions to obtain the available unit
> masks and attributes for a particular event?
>

Yes, that would most likely stay, although I think we could simplify it
a bit.

> For debugging purposes at least, it might make sense to have a function that
> does the inverse of pfm_get_event_encoding as well.
>
Yes, we could provide the opposite function.

I also believe this same scheme could be used to describe non-event
features, such as IBS, LBR, and opcode matchers.
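
For the inverse operation Corey asks about, no such function exists in
libpfm; a purely hypothetical prototype might look like:

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, for illustration only: decode raw register values back
 * into the event-string syntax, e.g. for debugging the encoder.
 */
int pfm_get_event_string(const uint64_t *values, int count,
                         char *event_str, size_t maxlen);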

by Robert Richter

Subject: Re: [announce] Performance Counters for Linux, v6

Some points to mention here. This patch set actually introduces two
interfaces: a new user/kernel interface and an in-kernel API to access
performance counters. These are separate things and are sometimes mixed
up too much. There is a strong need for an in-kernel API. This is the
third implementation I have been involved in (oprofile and perfmon are
the others), and things always go the same way. All these subsystems
should be merged into one in-kernel implementation and share the same
code. The different user/kernel interfaces could then coexist and meet
users' different needs.

The implementation of the hardware counters is written from scratch
again. It is sometimes useful to drop old code, but there is the danger
of making the same errors twice. Implementing performance counters is
not trivial, especially buffer handling, SMP and cpu hotplug. For
oprofile and perfmon it took years to get stable code. We should
benefit from that. (The current x86 code in this patch series does not
seem to work properly with SMP.) So we should look for a way to better
reuse and share code.

See also my comments below.

On 21.01.09 19:50:21, Ingo Molnar wrote:

[...]

> +static bool perf_counters_initialized __read_mostly;
> +
> +/*
> + * Number of (generic) HW counters:
> + */
> +static int nr_counters_generic __read_mostly;
> +static u64 perf_counter_mask __read_mostly;
> +static u64 counter_value_mask __read_mostly;
> +
> +static int nr_counters_fixed __read_mostly;
> +
> +struct cpu_hw_counters {
> + struct perf_counter *counters[X86_PMC_IDX_MAX];
> + unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
> +};
> +
> +/*
> + * Intel PerfMon v3. Used on Core2 and later.
> + */
> +static DEFINE_PER_CPU(struct cpu_hw_counters, cpu_hw_counters);
> +
> +static const int intel_perfmon_event_map[] =
> +{
> + [PERF_COUNT_CPU_CYCLES] = 0x003c,
> + [PERF_COUNT_INSTRUCTIONS] = 0x00c0,
> + [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e,
> + [PERF_COUNT_CACHE_MISSES] = 0x412e,
> + [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4,
> + [PERF_COUNT_BRANCH_MISSES] = 0x00c5,
> + [PERF_COUNT_BUS_CYCLES] = 0x013c,
> +};

I would like to define _all_ of the behaviour of the architecture and
the models in functions instead of parameters and lists. It is hard to
explain why, because it is mostly a matter of aesthetics, but I believe
only nice things work best. Let me try.

1) The list above seems arbitrary: there are lots of events and it is
hard to define which events are really important. Surely these events
are important, but it is hard to draw a line here.

2) The list assumes/implies the events are available on all
architectures and cpus. This is probably not the case, and also, an
event need not be _important_ for a certain architecture, yet it has to
be there even if it is of no interest.

3) It is hard to extend. If an event is added here it could have an
impact on all other architectures, because data structures change.

4) In the kernel, the behaviour of a subsystem is often implemented by
functions (e.g. struct device_driver). There are lots of ops structs
in the kernel, and there are reasons for that.

5) Ops structs are more dynamic. The data can be generated dynamically
and does not have to live statically in tables and variables.

So, instead of making the list a public data structure, it would be
better to pass the type to an arch-specific function, e.g.:

int arch_xxx_setup_event(int event_type);

If the type is not supported, an error could be returned. There is no
wider impact. Even the binaries of the builds would be identical if
hw_event_types were extended for one particular architecture.

The same applies to counters and so on: better to implement functions.
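
A minimal sketch of the ops-struct style Robert argues for (all names
here are invented for illustration; this is not from the patch set):

#include <linux/types.h>

struct perf_arch_ops {
        /* return 0 if the generic event type is supported, a negative errno otherwise */
        int  (*setup_event)(int event_type);
        void (*enable)(int idx);
        void (*disable)(int idx);
        u64  (*read)(int idx);
};

/* each architecture would provide its own instance, e.g.: */
extern const struct perf_arch_ops *hw_perf_arch_ops(void);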

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2009-03-09 23:01:45

by Paul Mackerras

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Robert Richter writes:

> Some points to mention here. This patch set actually introduces two
> interfaces, a new user/kernel interface and an in-kernel api to access
> performance counters. These are separate things and sometimes mixed
> too much. There is a strong need for an in-kernel api. This is the

We have been concentrating more on the user/kernel API since that is
the one that cannot be changed in an incompatible way once this stuff
goes upstream. The in-kernel API can be changed at any time and is
still evolving.

> third implementation I am involved (oprofile, perfmon are the others)
> and the things are always the same way. All these subsystems should be
> merged to one in-kernel implemenation and share the same code. The
> different user/kernel i/fs could then coexist and meet the users
> different needs.

It would certainly be good to get oprofile to use the same low-level
machinery as perf_counters. I'm not sure what the fate of perfmon
will be, but it seems unlikely it will go upstream in anything like
its present form.

> > +static const int intel_perfmon_event_map[] =
> > +{
> > + [PERF_COUNT_CPU_CYCLES] = 0x003c,
> > + [PERF_COUNT_INSTRUCTIONS] = 0x00c0,
> > + [PERF_COUNT_CACHE_REFERENCES] = 0x4f2e,
> > + [PERF_COUNT_CACHE_MISSES] = 0x412e,
> > + [PERF_COUNT_BRANCH_INSTRUCTIONS] = 0x00c4,
> > + [PERF_COUNT_BRANCH_MISSES] = 0x00c5,
> > + [PERF_COUNT_BUS_CYCLES] = 0x013c,
> > +};
>
> I would like to define _all_ the behaviour of the architecture and the
> models in functions instead of parameters and lists. It is hard to
> explain why, because it is more esthetics, but I believe, only nice
> things work best. Let me try.
>
> 1) The list above seems to be random, there are lots of events and it
> is hard to define, which event is really important. Surely these
> events are important, but it is hard to draw a line here.

I see that list as a convenience for doing a few simple performance
measurements. For any serious in-depth analysis userspace will know
what processor it's running on and use raw event codes.

> 2) The list assumes/implies the events are available on all
> architectures and cpus. This is probably not the case, and also, the
> existence of an event must not be _important_ for a certain
> architecture. But it has to be there even if it is of no interest.
>
> 3) Hard to extend. If an event is added here this could have impact to
> all other architectures. Data structures are changing.
>
> 4) In the kernel the behaviour of a subsystem is offen implemented by
> functions (e.g. struct device_driver). There are lots of ops structs
> in the kernel and there are reasons for it.
>
> 5) ops structs are more dynamic. The data could be generated
> dynamically and does not have to be static in some tables and
> variables.
>
> So, instead of making the list a public data structure, better pass
> the type to an arch specific function, e.g.:
>
> int arch_xxx_setup_event(int event_type);

That's exactly what we have, except that it's called
hw_perf_counter_init and the event_type you have there is in the
struct perf_counter that gets passed in.

> If the type is not supported, an error could be returned. There is no
> more impact. Even the binaries of the builds would be identically if
> hw_event_types would be extended for a single different architecture.
>
> The same applies also for counters and so on, better implement
> functions.

All of that is already done; hw_perf_counter_init gets to interpret
the counter->hw_event.type and counter->hw_event.raw fields and decide
whether the event is supported, and return an error if not. On x86 it
looks like there is a further ops structure (struct pmc_x86_ops) which
allows each x86-compatible cpu type to supply its own functions for
doing the interpretation of counter->hw_event and other things.
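
A rough sketch of the decision Paul describes in hw_perf_counter_init()
follows; this is not the actual v6 code, the identifiers are only
approximations, and the ops pointer returned at the end is an assumed
name:

/* Sketch only: interpret hw_event.raw/hw_event.type, reject unsupported events. */
static const struct hw_perf_counter_ops *
sketch_hw_perf_counter_init(struct perf_counter *counter)
{
        u64 config;

        if (counter->hw_event.raw)
                config = counter->hw_event.type;        /* model-specific raw encoding */
        else if ((unsigned long)counter->hw_event.type <
                 ARRAY_SIZE(intel_perfmon_event_map))
                config = intel_perfmon_event_map[counter->hw_event.type];
        else
                return NULL;                            /* generic event not supported here */

        counter->hw.config = config;
        return &x86_hw_counter_ops;                     /* per-model ops (assumed name) */
}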

Paul.

by Robert Richter

Subject: Re: [announce] Performance Counters for Linux, v6

On 10.03.09 10:01:28, Paul Mackerras wrote:
> > Some points to mention here. This patch set actually introduces two
> > interfaces, a new user/kernel interface and an in-kernel api to access
> > performance counters. These are separate things and sometimes mixed
> > too much. There is a strong need for an in-kernel api. This is the
>
> We have been concentrating more on the user/kernel API since that is
> the one that cannot be changed in an incompatible way once this stuff
> goes upstream. The in-kernel API can be changed at any time and is
> still evolving.

I agree, it is much easier to change the in-kernel interface. I just
wanted to emphasize the importance of this interface. Oprofile, perfmon
and also LPC will continue to exist in the future and should share the
same code base. That's what I have missed in the discussion so far.

>
> > third implementation I am involved (oprofile, perfmon are the others)
> > and the things are always the same way. All these subsystems should be
> > merged to one in-kernel implemenation and share the same code. The
> > different user/kernel i/fs could then coexist and meet the users
> > different needs.
>
> It would certainly be good to get oprofile to use the same low-level
> machinery as perf_counters. I'm not sure what the fate of perfmon
> will be, but it seems unlikely it will go upstream in anything like
> its present form.

Right, as I said above, all of them should share the same low-level
code. And this code already exists. The question is more about how to
merge it and bring things together.

[...]

> > So, instead of making the list a public data structure, better pass
> > the type to an arch specific function, e.g.:
> >
> > int arch_xxx_setup_event(int event_type);
>
> That's exactly what we have, except that it's called
> hw_perf_counter_init and the event_type you have there is in the
> struct perf_counter that gets passed in.

Thanks for pointing this out; I was misinterpreting this as a general
hw initialization function, when in fact it allocates a counter.

>
> > If the type is not supported, an error could be returned. There is no
> > more impact. Even the binaries of the builds would be identically if
> > hw_event_types would be extended for a single different architecture.
> >
> > The same applies also for counters and so on, better implement
> > functions.
>
> All of that is already done; hw_perf_counter_init gets to interpret
> the counter->hw_event.type and counter->hw_event.raw fields and decide
> whether the event is supported, and return an error if not. On x86 it
> looks like there is a further ops structure (struct pmc_x86_ops) which
> allows each x86-compatible cpu type to supply its own functions for
> doing the interpretation of counter->hw_event and other things.

Ok, maybe I mixed up the architectural code with the x86 model-specific
implementation too much. My impression is that there is data in generic
structures that would better be kept private to the model or
architecture. However, I still have to figure out the details here.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2009-03-10 10:29:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

On Tue, 2009-03-10 at 10:44 +0100, Robert Richter wrote:
> I agree, it is much more easier to change the in-kernel i/f. I just
> wanted to emphasize the importance of this i/f. Oprofile, Perfmon and
> also LPC will exist in the future too and should share the same code
> base. That's what I missed in the discussion until now.

We could implement oprofile on top of lpc for those archs that have LPC
support. And afaik only ia64 needs to bother with perfmon as that's the
only arch that has support for it anyway.

Now, even on x86 LPC would need a little more arch support before we can
fully replace oprofile, but a half-way model would be an LPC oprofile
driver that uses LPC on those machines it's supported on, while working
to provide LPC support for the older machines.


2009-03-10 11:49:27

by Paul Mackerras

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6

Robert Richter writes:

> Ok, maybe I mixed too much the architectural with the x86 model
> specific implementation. My impression is that there is data in
> generic structures what should be better private for the model or
> architecture. However, I have to figure out the details here.

The details of the x86 support have changed quite a lot since the v6
patch was posted, I believe. Are you looking at the v6 patch, or at
Ingo's tip:perfcounters/core branch?

Ingo - maybe it's time to post a v7 patch?

Paul.

2009-03-10 11:54:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


* Paul Mackerras <[email protected]> wrote:

> Robert Richter writes:
>
> > Ok, maybe I mixed too much the architectural with the x86 model
> > specific implementation. My impression is that there is data in
> > generic structures what should be better private for the model or
> > architecture. However, I have to figure out the details here.
>
> The details of the x86 support have changed quite a lot since
> the v6 patch was posted, I believe. Are you looking at the v6
> patch, or at Ingo's tip:perfcounters/core branch?
>
> Ingo - maybe it's time to post a v7 patch?

Yeah - will try to do that later today.

Ingo

by Robert Richter

Subject: Re: [announce] Performance Counters for Linux, v6

On 10.03.09 22:49:08, Paul Mackerras wrote:
> Robert Richter writes:
>
> > Ok, maybe I mixed too much the architectural with the x86 model
> > specific implementation. My impression is that there is data in
> > generic structures what should be better private for the model or
> > architecture. However, I have to figure out the details here.
>
> The details of the x86 support have changed quite a lot since the v6
> patch was posted, I believe. Are you looking at the v6 patch, or at
> Ingo's tip:perfcounters/core branch?

I am using the tip branch, but took the v6 thread to discuss this.

>
> Ingo - maybe it's time to post a v7 patch?

For me it's fine to work with the branch if patches are posted to the
mailing list for review.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2009-03-10 17:28:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [announce] Performance Counters for Linux, v6


* Robert Richter <[email protected]> wrote:

> On 10.03.09 22:49:08, Paul Mackerras wrote:
> > Robert Richter writes:
> >
> > > Ok, maybe I mixed too much the architectural with the x86 model
> > > specific implementation. My impression is that there is data in
> > > generic structures what should be better private for the model or
> > > architecture. However, I have to figure out the details here.
> >
> > The details of the x86 support have changed quite a lot since the v6
> > patch was posted, I believe. Are you looking at the v6 patch, or at
> > Ingo's tip:perfcounters/core branch?
>
> I am using the tip branch, but took the v6 thread to discuss this.
>
> >
> > Ingo - maybe it's time to post a v7 patch?
>
> For me it's fine to work with the branch if patches are posted
> to the mailing list for review.

all perfcounter patches are on lkml and you can see all -tip
commits in general by subscribing to the
[email protected] mailing list.

Ingo