2009-06-08 17:26:09

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] perf_counter: extensible perf_counter_attr

Allow extending the perf_counter_attr structure by linking extended
structures to it.

Also, should we grow the directly reserved space in the structure a
little more?

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/perf_counter.h | 7 +++++++
kernel/perf_counter.c | 1 +
2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 3586df8..781d8ce 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -175,6 +175,13 @@ struct perf_counter_attr {
__u32 __reserved_3;

__u64 __reserved_4;
+
+ struct perf_counter_attr_ext *ext_attrs;
+};
+
+struct perf_counter_attr_ext {
+ struct perf_counter_attr_ext *next;
+ __u64 perf_attr_ext_type;
};

/*
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 5eacaaf..606f662 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -3467,6 +3467,7 @@ perf_counter_alloc(struct perf_counter_attr *attr,

counter->cpu = cpu;
counter->attr = *attr;
+ counter->attr.ext_attrs = NULL;
counter->group_leader = group_leader;
counter->pmu = NULL;
counter->ctx = ctx;


2009-06-08 19:02:49

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Hi Peter,

Peter Zijlstra wrote:
> Allow extending the perf_counter_attr structure by linking extended
> structures to it.
>
> Also, should we grow the directly reserved space in the structure a
> little more?
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/perf_counter.h | 7 +++++++
> kernel/perf_counter.c | 1 +
> 2 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index 3586df8..781d8ce 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -175,6 +175,13 @@ struct perf_counter_attr {
> __u32 __reserved_3;
>
> __u64 __reserved_4;
> +
> + struct perf_counter_attr_ext *ext_attrs;
> +};
> +
> +struct perf_counter_attr_ext {
> + struct perf_counter_attr_ext *next;
> + __u64 perf_attr_ext_type;
> };

Let's say I want to extend the attributes by four 64-bit quantities... from the
above definition, I'd need four additional records chained together, right? How
about something like this instead:

struct perf_counter_attr_ext {
__u32 num_perf_attrs;
__u64 *perf_attr_ext;
};

Another alternative would be to place these two fields into perf_counter_attr
and so eliminate the extra level of indirection.


Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
[email protected]

2009-06-08 19:51:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

On Mon, 2009-06-08 at 12:02 -0700, Corey Ashford wrote:
> Hi Peter,
>
> Peter Zijlstra wrote:
> > Allow extending the perf_counter_attr structure by linking extended
> > structures to it.
> >
> > Also, should we grow the directly reserved space in the structure a
> > little more?
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > include/linux/perf_counter.h | 7 +++++++
> > kernel/perf_counter.c | 1 +
> > 2 files changed, 8 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> > index 3586df8..781d8ce 100644
> > --- a/include/linux/perf_counter.h
> > +++ b/include/linux/perf_counter.h
> > @@ -175,6 +175,13 @@ struct perf_counter_attr {
> > __u32 __reserved_3;
> >
> > __u64 __reserved_4;
> > +
> > + struct perf_counter_attr_ext *ext_attrs;
> > +};
> > +
> > +struct perf_counter_attr_ext {
> > + struct perf_counter_attr_ext *next;
> > + __u64 perf_attr_ext_type;
> > };
>
> Let's say I want to extend the attributes by four 64-bit quantities... from the
> above definition, I'd need four additional records chained together, right? How
> about something like this instead:

Ah, the idea was to do something like:

struct perf_counter_attr_feature {
struct perf_counter_attr_ext head;
... more stuff ...
};

and set head.perf_attr_ext_type = PERF_ATTR_EXT_FEATURE


2009-06-08 21:18:27

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Peter Zijlstra wrote:
> On Mon, 2009-06-08 at 12:02 -0700, Corey Ashford wrote:
>> Hi Peter,
>>
>> Peter Zijlstra wrote:
>>> Allow extending the perf_counter_attr structure by linking extended
>>> structures to it.
>>>
>>> Also, should we grow the directly reserved space in the structure a
>>> little more?
>>>
>>> Signed-off-by: Peter Zijlstra <[email protected]>
>>> ---
>>> include/linux/perf_counter.h | 7 +++++++
>>> kernel/perf_counter.c | 1 +
>>> 2 files changed, 8 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
>>> index 3586df8..781d8ce 100644
>>> --- a/include/linux/perf_counter.h
>>> +++ b/include/linux/perf_counter.h
>>> @@ -175,6 +175,13 @@ struct perf_counter_attr {
>>> __u32 __reserved_3;
>>>
>>> __u64 __reserved_4;
>>> +
>>> + struct perf_counter_attr_ext *ext_attrs;
>>> +};
>>> +
>>> +struct perf_counter_attr_ext {
>>> + struct perf_counter_attr_ext *next;
>>> + __u64 perf_attr_ext_type;
>>> };
>> Let's say I want to extend the attributes by four 64-bit quantities... from the
>> above definition, I'd need four additional records chained together, right? How
>> about something like this instead:
>
> Ah, the idea was to do something like:
>
> struct perf_counter_attr_feature {
> struct perf_counter_attr_ext head;
> ... more stuff ...
> };
>
> and set head.perf_attr_ext_type = PERF_ATTR_EXT_FEATURE


So you'd do something like this then when assigning to the perf_counter_attr struct:

struct perf_counter_attr pca;
...
struct perf_counter_attr_feature feature;

...
pca.ext_attrs = (struct perf_counter_attr_ext *)&feature;
-or-
pca.ext_attrs = &feature.head; /* secretly know that there's data that lies past
the attr struct header */

Am I understanding this correctly?

- Corey

2009-06-08 21:24:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
> pca.ext_attrs = &feature.head; /* secretly know that there's data that lies past
> the attr struct header */

Right, except its not so very secret since we have the type field
telling us.

2009-06-08 21:29:56

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr



Peter Zijlstra wrote:
> On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
>> pca.ext_attrs = &feature.head; /* secretly know that there's data that lies past
>> the attr struct header */
>
> Right, except its not so very secret since we have the type field
> telling us.

I see. Thanks for the explanation. Looks like a good plan to me!

- Corey

2009-06-08 21:50:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr


* Corey Ashford <[email protected]> wrote:

> Peter Zijlstra wrote:
>> On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
>>> pca.ext_attrs = &feature.head; /* secretly know that there's data
>>> that lies past the attr struct header */
>>
>> Right, except its not so very secret since we have the type field
>> telling us.
>
> I see. Thanks for the explanation. Looks like a good plan to me!

i think we'd have a much simpler implementation by changing
__reserved_1 to attr_size. When the kernel adds new attributes, the
size will increase - old user-space will use the old size which the
kernel detects and adopts to (by zeroing out that attribute space).

That way we'll always have a nice flat attributes structure, with no
quirky chaining that has field-dependent data types ...

Ingo

2009-06-09 00:50:23

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Ingo Molnar wrote:
> * Corey Ashford <[email protected]> wrote:
>
>> Peter Zijlstra wrote:
>>> On Mon, 2009-06-08 at 14:18 -0700, Corey Ashford wrote:
>>>> pca.ext_attrs = &feature.head; /* secretly know that there's data
>>>> that lies past the attr struct header */
>>> Right, except its not so very secret since we have the type field
>>> telling us.
>> I see. Thanks for the explanation. Looks like a good plan to me!
>
> i think we'd have a much simpler implementation by changing
> __reserved_1 to attr_size. When the kernel adds new attributes, the
> size will increase - old user-space will use the old size which the
> kernel detects and adopts to (by zeroing out that attribute space).
>
> That way we'll always have a nice flat attributes structure, with no
> quirky chaining that has field-dependent data types ...
>
> Ingo

If I understand you correctly, you would simply make perf_counter_attr larger
every time you want to add a new attribute. Users using the new attributes
would call sys_perf_counter_open with a larger attr_size value.

What about arch-dependent attributes? Would you want to place them all in the
perf_counter_attr struct? I suppose this could be done by #include'ing an
arch-specific .h file.

- Corey

2009-06-09 04:34:35

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Peter Zijlstra writes:

> Allow extending the perf_counter_attr structure by linking extended
> structures to it.

...

> @@ -175,6 +175,13 @@ struct perf_counter_attr {
> __u32 __reserved_3;
>
> __u64 __reserved_4;
> +
> + struct perf_counter_attr_ext *ext_attrs;
> +};

Since this is a user-visible structure, you've just introduced an ABI
difference between 32-bit and 64-bit processes, which we've managed to
avoid so far. Assuming that is that you intend eventually to use the
value that userspace puts there, which you don't at the moment. Was
that your intention?

Paul.

2009-06-09 06:51:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr


* Corey Ashford <[email protected]> wrote:

> If I understand you correctly, you would simply make
> perf_counter_attr larger every time you want to add a new
> attribute. Users using the new attributes would call
> sys_perf_counter_open with a larger attr_size value.

Yes, exactly. Basically ABIs in Linux only get extended (never
shrunk and never changed) so it's not like we ever want to (or can)
shrink the size of the structure or change its semantics.

Each future extension gives the structure a new, unique size - which
also acts as an 'ABI version' identifier, in a pretty robust way. We
check this 'ABI version' (the structure size) in the kernel code so
it's not just a passive 'version field' thing.

Here are the various compatibility variations:

- same-version kernel and user-space: they both use the same
attr_size value and support the full set of features.

- old user-space running on new kernel: works fine, as the kernel
will do a short copy and zero out the remaining attributes.

- new user-space running on old kernel: the kernel returns -ENOTSUP
and user-space has a choice to refuse to run cleanly - or, if an
old ABI version is widespread, might chose to utilize the old,
smaller attribute structure size field (at the cost of not using
new attribute features, obviously).

( Additional detail: in the size mismatch failure case the kernel
should write back the supported size into attr_size, so that
user-space knows which precise ABI variant it deals with on the
kernel side. )

This kind of ABI maintenance method has a number of substantial
advantages:

- It is very compatible (see above)

- It is extensible easily and in an unlimited way - we just extend
the structure size.

- It is very clean on the kernel side and the user side as well,
because we just have a single attribute structure.

- It makes the ABI 'version' field an _active_ component of
functionality - so there is no way for subtle breakages to slip
in.

- New attributes are prime-time members of the attribute structure,
not second-class citizens that first have to be read in via an
elaborate chaining mechanism at extra cost.

[ If only all our syscall ABIs used this technique :-) It would be
so much easier to extend syscalls cleanly - without having to go
through the expensive and time-consuming process to add new
syscalls. ]

> What about arch-dependent attributes? Would you want to place
> them all in the perf_counter_attr struct? I suppose this could be
> done by #include'ing an arch-specific .h file.

What arch-dependent attributes are you thinking about? In the
perfcounters subsystem we want to support PMU and other performance
analysis features in a way that makes it possible for all
architectures to make use of them.

So 'arch dependent attributes' per se are bad and against the
perfcounters design. "Generic perfcounter feature only supported by
a single architecture initially" is better.

Ingo

2009-06-09 06:54:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr


* Paul Mackerras <[email protected]> wrote:

> Peter Zijlstra writes:
>
> > Allow extending the perf_counter_attr structure by linking extended
> > structures to it.
>
> ...
>
> > @@ -175,6 +175,13 @@ struct perf_counter_attr {
> > __u32 __reserved_3;
> >
> > __u64 __reserved_4;
> > +
> > + struct perf_counter_attr_ext *ext_attrs;
> > +};
>
> Since this is a user-visible structure, you've just introduced an
> ABI difference between 32-bit and 64-bit processes, which we've
> managed to avoid so far. [...]

This is a good point too - handling pointers in ABIs is possible but
should be avoided as much as possible: it creates the need to
introduce a compat_sys_perf_counter_open() and doubles the syscall
table complexity.

Lets do the s/__reserved_1/attr_size ABI i suggested (i outlined
various properties of it in the previous mail). Agreed?

Ingo

2009-06-09 08:02:45

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Ingo Molnar wrote:
> * Corey Ashford <[email protected]> wrote:
>
>
>> If I understand you correctly, you would simply make
>> perf_counter_attr larger every time you want to add a new
>> attribute. Users using the new attributes would call
>> sys_perf_counter_open with a larger attr_size value.
>>
>
> Yes, exactly. Basically ABIs in Linux only get extended (never
> shrunk and never changed) so it's not like we ever want to (or can)
> shrink the size of the structure or change its semantics.
>
> Each future extension gives the structure a new, unique size - which
> also acts as an 'ABI version' identifier, in a pretty robust way. We
> check this 'ABI version' (the structure size) in the kernel code so
> it's not just a passive 'version field' thing.
>
> Here are the various compatibility variations:
>
> - same-version kernel and user-space: they both use the same
> attr_size value and support the full set of features.
>
> - old user-space running on new kernel: works fine, as the kernel
> will do a short copy and zero out the remaining attributes.
>
> - new user-space running on old kernel: the kernel returns -ENOTSUP
> and user-space has a choice to refuse to run cleanly - or, if an
> old ABI version is widespread, might chose to utilize the old,
> smaller attribute structure size field (at the cost of not using
> new attribute features, obviously).
>
> ( Additional detail: in the size mismatch failure case the kernel
> should write back the supported size into attr_size, so that
> user-space knows which precise ABI variant it deals with on the
> kernel side. )
>
> This kind of ABI maintenance method has a number of substantial
> advantages:
>
> - It is very compatible (see above)
>
> - It is extensible easily and in an unlimited way - we just extend
> the structure size.
>
> - It is very clean on the kernel side and the user side as well,
> because we just have a single attribute structure.
>
> - It makes the ABI 'version' field an _active_ component of
> functionality - so there is no way for subtle breakages to slip
> in.
>
> - New attributes are prime-time members of the attribute structure,
> not second-class citizens that first have to be read in via an
> elaborate chaining mechanism at extra cost.
>
> [ If only all our syscall ABIs used this technique :-) It would be
> so much easier to extend syscalls cleanly - without having to go
> through the expensive and time-consuming process to add new
> syscalls. ]
>
Thanks for the detailed explanation :)

>> What about arch-dependent attributes? Would you want to place
>> them all in the perf_counter_attr struct? I suppose this could be
>> done by #include'ing an arch-specific .h file.
>>
>
> What arch-dependent attributes are you thinking about? In the
> perfcounters subsystem we want to support PMU and other performance
> analysis features in a way that makes it possible for all
> architectures to make use of them.
>
> So 'arch dependent attributes' per se are bad and against the
> perfcounters design. "Generic perfcounter feature only supported by
> a single architecture initially" is better.
>
Well, I think Intel has PEBS and AMD has some similar mechanism. I
would guess that at some point you would want to provide access to those
PMU features via these attributes. Since these mechanisms are very
chip-specific, I don't think you would want to try to create an
arch-independent interface to them. There may be future mechanisms that
only make sense on one particular chip design, and would therefore not
be a candidate for wider use, but would still make sense to provide some
support for that mechanism via the attributes.

Did you have some different plan for PEBS (etc.) ?

- Corey

2009-06-09 09:58:21

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Ingo Molnar writes:

> * Paul Mackerras <[email protected]> wrote:

> > Since this is a user-visible structure, you've just introduced an
> > ABI difference between 32-bit and 64-bit processes, which we've
> > managed to avoid so far. [...]
>
> This is a good point too - handling pointers in ABIs is possible but
> should be avoided as much as possible: it creates the need to
> introduce a compat_sys_perf_counter_open() and doubles the syscall
> table complexity.
>
> Lets do the s/__reserved_1/attr_size ABI i suggested (i outlined
> various properties of it in the previous mail). Agreed?

Yep, sounds good.

We also need a way to allow access to random machine-specific features
that might not be supported in a generic way, such as the instruction
matching CAM on POWER4/PPC970, or (apparently) PEBS. That's what I
had intended exclusive groups + the extra_config_len field to be used
for, but Peter removed extra_config_len...

Paul.

2009-06-09 11:54:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr


* Corey Ashford <[email protected]> wrote:

>> So 'arch dependent attributes' per se are bad and against the
>> perfcounters design. "Generic perfcounter feature only supported
>> by a single architecture initially" is better.
>
> Well, I think Intel has PEBS and AMD has some similar mechanism.
> I would guess that at some point you would want to provide access
> to those PMU features via these attributes. Since these
> mechanisms are very chip-specific, I don't think you would want to
> try to create an arch-independent interface to them. There may be
> future mechanisms that only make sense on one particular chip
> design, and would therefore not be a candidate for wider use, but
> would still make sense to provide some support for that mechanism
> via the attributes.
>
> Did you have some different plan for PEBS (etc.) ?

I think PEBS is best supported by a generic abstraction. Something
like this: it's basically a special sampling format, that generates
a record of:

struct pt_regs regs;
__u64 insn_latency; /* optional */
__u64 data_address; /* optional */

this is pretty generic.

The raw CPU records have a CPU specific format, and they have to be
demultiplexed anyway (on Nehalem, which can have up to four separate
PEBS counters - but each output into the same DS area), so the
lowlevel arch code converts the CPU record into the above generic
sample record when it copies it into the mmap pages. It's a quick
copy so no big deal performance-wise.

( Details:

- there might be some additional complications from sampling
32-bit contexts, but that too is a mostly low level detail that
gets hidden.

- we might use a tiny bit more compact registers structure than
struct pt_regs. OTOH it's a well-known structure so it makes
sense to standardize on it, even if the CPU doesnt sample all
registers.
)

Can you see desirable PEBS-alike PMU features that cannot be
expressed via such means?

Ingo

2009-06-09 16:44:18

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Thanks for your reply, Ingo.

Ingo Molnar wrote:
> I think PEBS is best supported by a generic abstraction. Something
> like this: it's basically a special sampling format, that generates
> a record of:
>
> struct pt_regs regs;
> __u64 insn_latency; /* optional */
> __u64 data_address; /* optional */
>
> this is pretty generic.
>
> The raw CPU records have a CPU specific format, and they have to be
> demultiplexed anyway (on Nehalem, which can have up to four separate
> PEBS counters - but each output into the same DS area), so the
> lowlevel arch code converts the CPU record into the above generic
> sample record when it copies it into the mmap pages. It's a quick
> copy so no big deal performance-wise.
>
> ( Details:
>
> - there might be some additional complications from sampling
> 32-bit contexts, but that too is a mostly low level detail that
> gets hidden.
>
> - we might use a tiny bit more compact registers structure than
> struct pt_regs. OTOH it's a well-known structure so it makes
> sense to standardize on it, even if the CPU doesnt sample all
> registers.
> )
>
>
I see, so that's how you'd return the data. How would a user specify
that they want to use PEBS?

Another observation is that you'd need some sort of bit vector, or at
the least a document, that describes which registers are valid in the
pt_regs struct.

> Can you see desirable PEBS-alike PMU features that cannot be
> expressed via such means?
>
> Ingo
Power PMU's provide some fairly complex features, such as a thresholding
mechanism which is used for marking instructions, and also there's an
Instruction Matching CAM which can be used to mark only on certain
instruction types. Since these features are present only on Power, I'm
not sure it makes sense to go to the trouble of abstracting them for use
on other arch/chip designs.

- Corey

2009-06-09 22:00:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr


* Corey Ashford <[email protected]> wrote:

> Thanks for your reply, Ingo.
>
> Ingo Molnar wrote:
>> I think PEBS is best supported by a generic abstraction. Something
>> like this: it's basically a special sampling format, that generates a
>> record of:
>>
>> struct pt_regs regs;
>> __u64 insn_latency; /* optional */
>> __u64 data_address; /* optional */
>>
>> this is pretty generic.
>>
>> The raw CPU records have a CPU specific format, and they have to be
>> demultiplexed anyway (on Nehalem, which can have up to four separate
>> PEBS counters - but each output into the same DS area), so the
>> lowlevel arch code converts the CPU record into the above generic
>> sample record when it copies it into the mmap pages. It's a quick copy
>> so no big deal performance-wise.
>>
>> ( Details:
>>
>> - there might be some additional complications from sampling
>> 32-bit contexts, but that too is a mostly low level detail that
>> gets hidden.
>>
>> - we might use a tiny bit more compact registers structure than
>> struct pt_regs. OTOH it's a well-known structure so it makes
>> sense to standardize on it, even if the CPU doesnt sample all
>> registers.
>> )
>>
>>
> I see, so that's how you'd return the data. How would a user
> specify that they want to use PEBS?

They wouldnt need to. PEBS would result in more precise samples when
certain counters are used - and transparently so.

The non-PEBS NMI samples are pretty accurate to begin with, so it's
not a _major_ difference in quality.

PEBS would also auto-activate when a user wants to sample say the
instruction latency as well - and on non-PEBS hardware the platform
code would refuse to open the counter.

PEBS could also be used to reduce the number of NMIs needed - so
it's a transparent speedup as well.

> Another observation is that you'd need some sort of bit vector, or
> at the least a document, that describes which registers are valid
> in the pt_regs struct.

Initially i think we should use PEBS to transparently enhance
regular IP samples.

What would we use the other registers for? Many registers will have
stack context dependencies so the PEBS data cannot really be
analyzed in any meaningful way after the fact.

(except perhaps for the narrow purpose of debugging)

>> Can you see desirable PEBS-alike PMU features that cannot be expressed
>> via such means?
>
> Power PMU's provide some fairly complex features, such as a
> thresholding mechanism which is used for marking instructions, and
> also there's an Instruction Matching CAM which can be used to mark
> only on certain instruction types. Since these features are
> present only on Power, I'm not sure it makes sense to go to the
> trouble of abstracting them for use on other arch/chip designs.

Could you please describe the low level semantics more accurately -
at least of the ones you find to be the most useful in practice?

Ingo

2009-06-09 23:16:56

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr



Ingo Molnar wrote:
> * Corey Ashford <[email protected]> wrote:
>
>> Thanks for your reply, Ingo.
>>
>> Ingo Molnar wrote:
>>> I think PEBS is best supported by a generic abstraction. Something
>>> like this: it's basically a special sampling format, that generates a
>>> record of:
>>>
>>> struct pt_regs regs;
>>> __u64 insn_latency; /* optional */
>>> __u64 data_address; /* optional */
>>>
>>> this is pretty generic.
>>>
>>> The raw CPU records have a CPU specific format, and they have to be
>>> demultiplexed anyway (on Nehalem, which can have up to four separate
>>> PEBS counters - but each output into the same DS area), so the
>>> lowlevel arch code converts the CPU record into the above generic
>>> sample record when it copies it into the mmap pages. It's a quick copy
>>> so no big deal performance-wise.
>>>
>>> ( Details:
>>>
>>> - there might be some additional complications from sampling
>>> 32-bit contexts, but that too is a mostly low level detail that
>>> gets hidden.
>>>
>>> - we might use a tiny bit more compact registers structure than
>>> struct pt_regs. OTOH it's a well-known structure so it makes
>>> sense to standardize on it, even if the CPU doesnt sample all
>>> registers.
>>> )
>>>
>>>
>> I see, so that's how you'd return the data. How would a user
>> specify that they want to use PEBS?
>
> They wouldnt need to. PEBS would result in more precise samples when
> certain counters are used - and transparently so.
>
> The non-PEBS NMI samples are pretty accurate to begin with, so it's
> not a _major_ difference in quality.
>
> PEBS would also auto-activate when a user wants to sample say the
> instruction latency as well - and on non-PEBS hardware the platform
> code would refuse to open the counter.
>
> PEBS could also be used to reduce the number of NMIs needed - so
> it's a transparent speedup as well.
>
>> Another observation is that you'd need some sort of bit vector, or
>> at the least a document, that describes which registers are valid
>> in the pt_regs struct.
>
> Initially i think we should use PEBS to transparently enhance
> regular IP samples.
>
> What would we use the other registers for? Many registers will have
> stack context dependencies so the PEBS data cannot really be
> analyzed in any meaningful way after the fact.

Well, you were the one that offered up the idea of using pt_regs as an
abstraction. For the case of PEBS, I think 98% of it would be emptry since the
sampling mechanism can get a PC and maybe a data address (?).

>
> (except perhaps for the narrow purpose of debugging)
>
>>> Can you see desirable PEBS-alike PMU features that cannot be expressed
>>> via such means?
>> Power PMU's provide some fairly complex features, such as a
>> thresholding mechanism which is used for marking instructions, and
>> also there's an Instruction Matching CAM which can be used to mark
>> only on certain instruction types. Since these features are
>> present only on Power, I'm not sure it makes sense to go to the
>> trouble of abstracting them for use on other arch/chip designs.
>
> Could you please describe the low level semantics more accurately -
> at least of the ones you find to be the most useful in practice?

Ok, some disclosure here: we have not yet supported either of these features in
the Power PMU's in any open source code (and perhaps the proprietary code too,
but I don't know about that). Since these features are described in an IBM
proprietary document, I can't describe how they work here, except to say that
they are present in the chip.

I realize this is not very useful to come here and complain there's no support
for something I can't describe! (maybe later or maybe never... not sure)

But in general I do think there should be some sort of provision in the
interface for some chip-specific features for which fully abstracting their
function doesn't make a lot of sense. At some point, if we find that there's a
good reason to support these features in open source code (and thereby divulging
how they work), it would be nice if there was an avenue for adding such support
to PCL.

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
[email protected]

2009-06-10 00:14:39

by Paul Mackerras

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Corey Ashford writes:

> Ok, some disclosure here: we have not yet supported either of these features in
> the Power PMU's in any open source code (and perhaps the proprietary code too,
> but I don't know about that). Since these features are described in an IBM
> proprietary document, I can't describe how they work here, except to say that
> they are present in the chip.

Actually, the public PPC970FX user manual has a chapter on the
performance monitor unit which describes both thresholding and the
instruction matching CAM, among other things.

I think we can support thresholding using higher-order bits of the
event code when the low bits == PM_THRESH_TIMEO. We can only have one
PM_THRESH_TIMEO event on the PMU at any given time, so there can't be
any conflict over the thresholder settings.

The IMC is more problematic, but we can't do much with it on POWER5
and later processors anyway, due to various things being accessible
only in hypervisor mode, so I have been ignoring it. :)

Paul.

2009-06-10 22:07:15

by Corey Ashford

[permalink] [raw]
Subject: Re: [PATCH] perf_counter: extensible perf_counter_attr

Paul Mackerras wrote:
> Corey Ashford writes:
>
>> Ok, some disclosure here: we have not yet supported either of these features in
>> the Power PMU's in any open source code (and perhaps the proprietary code too,
>> but I don't know about that). Since these features are described in an IBM
>> proprietary document, I can't describe how they work here, except to say that
>> they are present in the chip.
>
> Actually, the public PPC970FX user manual has a chapter on the
> performance monitor unit which describes both thresholding and the
> instruction matching CAM, among other things.

Ah, good point!

Anyone interested can find the 970FX (G5) manual here: http://www.cs.lth.se/EDA116/G5.pdf
It's supposed to be on the IBM web site too, but for some reason, the link is
broken (reported!).

>
> I think we can support thresholding using higher-order bits of the
> event code when the low bits == PM_THRESH_TIMEO. We can only have one
> PM_THRESH_TIMEO event on the PMU at any given time, so there can't be
> any conflict over the thresholder settings.
>
> The IMC is more problematic, but we can't do much with it on POWER5
> and later processors anyway, due to various things being accessible
> only in hypervisor mode, so I have been ignoring it. :)
>
> Paul.

Ug. Well, there's always bare-metal Linux :)

- Corey