LinuxLists.cc - [patch] Performance Counters for Linux, v3

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

* Tony Luck <[email protected]> wrote:

> > /*
> > * Special "software" counters provided by the kernel, even if
> > * the hardware does not support performance counters. These
> > * counters measure various physical and sw events of the
> > * kernel (and allow the profiling of them as well):
> > */
> > PERF_COUNT_CPU_CLOCK = -1,
> > PERF_COUNT_TASK_CLOCK = -2,
> > /*
> > * Future software events:
> > */
> > /* PERF_COUNT_PAGE_FAULTS = -3,
> > PERF_COUNT_CONTEXT_SWITCHES = -4, */
>
> ...
> > +[ Note: more hw_event_types are supported as well, but they are CPU
> > + specific and are enumerated via /sys on a per CPU basis. Raw hw event
> > + types can be passed in as negative numbers. For example, to count
> > + "External bus cycles while bus lock signal asserted" events on Intel
> > + Core CPUs, pass in a -0x4064 event type value. ]
>
> It looks like you have an overlap here. You are using some negative
> numbers to denote your special software events, but also as "raw"
> hardware events. What if these conflict?

that's an old comment, not a bug in the code - thx for pointing it out, i
just fixed the comments - see the commit below.

Raw events are now done without using up negative numbers, they are done
via:

struct perf_counter_hw_event {
s64 type;

u64 irq_period;
u32 record_type;

u32 disabled : 1, /* off by default */
nmi : 1, /* NMI sampling */
raw : 1, /* raw event type */
__reserved_1 : 29;

u64 __reserved_2;
};

if the hw_event.raw bit is set to 1, then the hw_event.type is fully
'raw'. The default is for raw to be 0. So negative numbers can be used
for sw events, positive numbers for hw events. Both can be extended
gradually, without arbitrarily limits introduced.

Ingo

------------------------->
>From 447557ac7ce120306b4a31d6003faef39cb1bf14 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Thu, 11 Dec 2008 20:40:18 +0100
Subject: [PATCH] perf counters: update docs

Impact: update docs

Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/perf-counters.txt | 107 +++++++++++++++++++++++++++------------
1 files changed, 75 insertions(+), 32 deletions(-)

diff --git a/Documentation/perf-counters.txt b/Documentation/perf-counters.txt
index 19033a0..fddd321 100644
--- a/Documentation/perf-counters.txt
+++ b/Documentation/perf-counters.txt
@@ -10,8 +10,8 @@ trigger interrupts when a threshold number of events have passed - and can
thus be used to profile the code that runs on that CPU.

The Linux Performance Counter subsystem provides an abstraction of these
-hardware capabilities. It provides per task and per CPU counters, and
-it provides event capabilities on top of those.
+hardware capabilities. It provides per task and per CPU counters, counter
+groups, and it provides event capabilities on top of those.

Performance counters are accessed via special file descriptors.
There's one file descriptor per virtual counter used.
@@ -19,12 +19,8 @@ There's one file descriptor per virtual counter used.
The special file descriptor is opened via the perf_counter_open()
system call:

- int
- perf_counter_open(u32 hw_event_type,
- u32 hw_event_period,
- u32 record_type,
- pid_t pid,
- int cpu);
+ int sys_perf_counter_open(struct perf_counter_hw_event *hw_event_uptr,
+ pid_t pid, int cpu, int group_fd);

The syscall returns the new fd. The fd can be used via the normal
VFS system calls: read() can be used to read the counter, fcntl()
@@ -33,39 +29,78 @@ can be used to set the blocking mode, etc.
Multiple counters can be kept open at a time, and the counters
can be poll()ed.

-When creating a new counter fd, 'hw_event_type' is one of:
-
- enum hw_event_types {
- PERF_COUNT_CYCLES,
- PERF_COUNT_INSTRUCTIONS,
- PERF_COUNT_CACHE_REFERENCES,
- PERF_COUNT_CACHE_MISSES,
- PERF_COUNT_BRANCH_INSTRUCTIONS,
- PERF_COUNT_BRANCH_MISSES,
- };
+When creating a new counter fd, 'perf_counter_hw_event' is:
+
+/*
+ * Hardware event to monitor via a performance monitoring counter:
+ */
+struct perf_counter_hw_event {
+ s64 type;
+
+ u64 irq_period;
+ u32 record_type;
+
+ u32 disabled : 1, /* off by default */
+ nmi : 1, /* NMI sampling */
+ raw : 1, /* raw event type */
+ __reserved_1 : 29;
+
+ u64 __reserved_2;
+};
+
+/*
+ * Generalized performance counter event types, used by the hw_event.type
+ * parameter of the sys_perf_counter_open() syscall:
+ */
+enum hw_event_types {
+ /*
+ * Common hardware events, generalized by the kernel:
+ */
+ PERF_COUNT_CYCLES = 0,
+ PERF_COUNT_INSTRUCTIONS = 1,
+ PERF_COUNT_CACHE_REFERENCES = 2,
+ PERF_COUNT_CACHE_MISSES = 3,
+ PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
+ PERF_COUNT_BRANCH_MISSES = 5,
+
+ /*
+ * Special "software" counters provided by the kernel, even if
+ * the hardware does not support performance counters. These
+ * counters measure various physical and sw events of the
+ * kernel (and allow the profiling of them as well):
+ */
+ PERF_COUNT_CPU_CLOCK = -1,
+ PERF_COUNT_TASK_CLOCK = -2,
+ /*
+ * Future software events:
+ */
+ /* PERF_COUNT_PAGE_FAULTS = -3,
+ PERF_COUNT_CONTEXT_SWITCHES = -4, */
+};

These are standardized types of events that work uniformly on all CPUs
that implements Performance Counters support under Linux. If a CPU is
not able to count branch-misses, then the system call will return
-EINVAL.

-[ Note: more hw_event_types are supported as well, but they are CPU
- specific and are enumerated via /sys on a per CPU basis. Raw hw event
- types can be passed in as negative numbers. For example, to count
- "External bus cycles while bus lock signal asserted" events on Intel
- Core CPUs, pass in a -0x4064 event type value. ]
-
-The parameter 'hw_event_period' is the number of events before waking up
-a read() that is blocked on a counter fd. Zero value means a non-blocking
-counter.
+More hw_event_types are supported as well, but they are CPU
+specific and are enumerated via /sys on a per CPU basis. Raw hw event
+types can be passed in under hw_event.type if hw_event.raw is 1.
+For example, to count "External bus cycles while bus lock signal asserted"
+events on Intel Core CPUs, pass in a 0x4064 event type value and set
+hw_event.raw to 1.

'record_type' is the type of data that a read() will provide for the
counter, and it can be one of:

- enum perf_record_type {
- PERF_RECORD_SIMPLE,
- PERF_RECORD_IRQ,
- };
+/*
+ * IRQ-notification data record type:
+ */
+enum perf_counter_record_type {
+ PERF_RECORD_SIMPLE = 0,
+ PERF_RECORD_IRQ = 1,
+ PERF_RECORD_GROUP = 2,
+};

a "simple" counter is one that counts hardware events and allows
them to be read out into a u64 count value. (read() returns 8 on
@@ -76,6 +111,10 @@ the IP of the interrupted context. In this case read() will return
the 8-byte counter value, plus the Instruction Pointer address of the
interrupted context.

+The parameter 'hw_event_period' is the number of events before waking up
+a read() that is blocked on a counter fd. Zero value means a non-blocking
+counter.
+
The 'pid' parameter allows the counter to be specific to a task:

pid == 0: if the pid parameter is zero, the counter is attached to the
@@ -92,7 +131,7 @@ CPU:
cpu >= 0: the counter is restricted to a specific CPU
cpu == -1: the counter counts on all CPUs

-Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.
+(Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.)

A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts
events of that task and 'follows' that task to whatever CPU the task
@@ -102,3 +141,7 @@ their own tasks.
A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts
all events on CPU-x. Per CPU counters need CAP_SYS_ADMIN privilege.

+Group counters are created by passing in a group_fd of another counter.
+Groups are scheduled at once and can be used with PERF_RECORD_GROUP
+to record multi-dimensional timestamps.
+

2008-12-11 22:05:22

by William Cohen

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

I was taking a look at the proposed performance monitoring and kerneltop.c. I
noticed that http://redhat.com/~mingo/perfcounters/kerneltop.c doesn't work
with the v3 version. I didn't see a more recent version available, so I made
some modifications to make allow it to work with the v3 kernel (with the
attached). However, I assume some where there is an updated version of kerneltop.c

The Documentation/perf-counters.txt doesn't describe how the group_fd is used.
Found that -1 used to indicate not connected to any other fd.

-Will

Attachments:

v3.diff (2.36 kB)

2008-12-12 06:22:39

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

* Andrew Morton <[email protected]> wrote:

> On Thu, 11 Dec 2008 16:52:30 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > To: [email protected]
> > Cc: Thomas Gleixner <[email protected]>, Andrew Morton <[email protected]>, Stephane Eranian <[email protected]>, Eric Dumazet <[email protected]>, Robert Richter <[email protected]>, Arjan van de Veen <[email protected]>, Peter Anvin <[email protected]>, Peter Zijlstra <[email protected]>, Paul Mackerras <[email protected]>, "David S. Miller" <[email protected]>
>
> Please copy [email protected] on all this. That is
> where the real-world people who use these facilities on a regular basis
> hang out.

Sure, we'll do that for v4.

The reason we kept posting this to lkml initially was because there is a
visible detachment of this community from kernel developers. And that is
at least in part because this stuff has never been made interesting
enough to kernel developers. I dont remember a _single_ perfmon-generated
profile (be that user-space or kernel-space) in my mailbox before - and
optimizing the kernel is supposed to be one of the most important aspects
of performance tuning.

That's why we concentrate on making this useful and interesting to kernel
developers too via KernelTop, that's why we made the BTS/[PEBS] hardware
tracer available via an ftrace plugin, etc.

Furthermore, kernel developers tend to be quite good at co-designing,
influencing [and flaming ;-) ] such APIs at the early prototype stages,
so the main early technical feedback we were looking for on the kernel
side structure was lkml. But the wider community is not ignored either,
of course - with v4 it might be useful already for wider circulation.

Ingo

2008-12-12 08:26:10

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:

> I have at least 60 machines that I do regular performance counter work on.
> They involve Pentium Pro, Pentium II, 32-bit Athlon, 64-bit Athlon,
> Pentium 4, Pentium D, Core, Core2, Atom, MIPS R12k, Niagara T1,
> and PPC/Playstation 3.

Good.

> Perfmon3 works for all of those 60 machines. This new proposal works on a
> 2 out of the 60.

s/works/is implemented/

> Who is going to add support for all of those machines? I've spent a lot
> of developer time getting prefmon going for all of those configurations.
> But why should I help out with this new inferior proposal? It could all
> be another waste of time.

So much for constructive critisism.. have you tried taking the design to
its limits, if so, where do you see problems?

I read the above as: I invested a lot of time in something of dubious
statue (out of tree patch), and now expect it to be merged because I
have invested in it.

> Also, my primary method of using counters is total aggregate count for a
> single user-space process.

Process, as in single thread, or multi-threaded? I'll assume
single-thread.

> Can this new infrastructure to this?

Yes, afaict it can.

You can group counters in v3, a read out of such a group will be an
atomic read out and provide vectored output that contains all the data
in one stream.

> I find the documentation/tools support to be very incomplete.

Gosh, what does one expect from something that is hardly a week old..

> One comment on the patch.
>
> > + /*
> > + * Common hardware events, generalized by the kernel:
> > + */
> > + PERF_COUNT_CYCLES = 0,
> > + PERF_COUNT_INSTRUCTIONS = 1,
> > + PERF_COUNT_CACHE_REFERENCES = 2,
> > + PERF_COUNT_CACHE_MISSES = 3,
> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> > + PERF_COUNT_BRANCH_MISSES = 5,
>
> Many machines do not support these counts. For example, Niagara T1 does
> not have a CYCLES count. And good luck if you think you can easily come
> up with something meaningful for the various kind of CACHE_MISSES on the
> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> count with slightly different semantics. This kind of abstraction should
> be done in userspace.

I'll argue to disagree, sure such events might not be supported by any
particular hardware implementation - but the fact that PAPI gives a list
of 'common' events means that they are, well, common. So unifying them
between those archs that do implement them seems like a sane choice, no?

For those archs that do not support it, it will just fail to open. No
harm done.

The proposal allows for you to specify raw hardware events, so you can
just totally ignore this part of the abstraction.

2008-12-12 08:29:57

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Thu, 2008-12-11 at 20:34 +0100, Ingo Molnar wrote:

> struct perf_counter_hw_event {
> s64 type;
>
> u64 irq_period;
> u32 record_type;
>
> u32 disabled : 1, /* off by default */
> nmi : 1, /* NMI sampling */
> raw : 1, /* raw event type */
> __reserved_1 : 29;
>
> u64 __reserved_2;
> };
>
> if the hw_event.raw bit is set to 1, then the hw_event.type is fully
> 'raw'. The default is for raw to be 0. So negative numbers can be used
> for sw events, positive numbers for hw events. Both can be extended
> gradually, without arbitrarily limits introduced.

On that, I still don't think its a good idea to use bitfields in an ABI.
The C std is just not strict enough on them, and I guess that is the
reason this would be the first such usage.

2008-12-12 08:35:55

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
>> > + /*
>> > + * Common hardware events, generalized by the kernel:
>> > + */
>> > + PERF_COUNT_CYCLES = 0,
>> > + PERF_COUNT_INSTRUCTIONS = 1,
>> > + PERF_COUNT_CACHE_REFERENCES = 2,
>> > + PERF_COUNT_CACHE_MISSES = 3,
>> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
>> > + PERF_COUNT_BRANCH_MISSES = 5,
>>
>> Many machines do not support these counts. For example, Niagara T1 does
>> not have a CYCLES count. And good luck if you think you can easily come
>> up with something meaningful for the various kind of CACHE_MISSES on the
>> Pentium 4. Also, the Pentium D has various flavors of retired instruction
>> count with slightly different semantics. This kind of abstraction should
>> be done in userspace.
>
> I'll argue to disagree, sure such events might not be supported by any
> particular hardware implementation - but the fact that PAPI gives a list
> of 'common' events means that they are, well, common. So unifying them
> between those archs that do implement them seems like a sane choice, no?
>
> For those archs that do not support it, it will just fail to open. No
> harm done.
>
> The proposal allows for you to specify raw hardware events, so you can
> just totally ignore this part of the abstraction.
>
I believe the cache related events do not belong in here. There is no definition
for them. You don't know what cache miss level, what kind of access. You cannot
do this even on Intel Core processors.

2008-12-12 08:51:41

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> Peter,
>
> On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> >> > + /*
> >> > + * Common hardware events, generalized by the kernel:
> >> > + */
> >> > + PERF_COUNT_CYCLES = 0,
> >> > + PERF_COUNT_INSTRUCTIONS = 1,
> >> > + PERF_COUNT_CACHE_REFERENCES = 2,
> >> > + PERF_COUNT_CACHE_MISSES = 3,
> >> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> >> > + PERF_COUNT_BRANCH_MISSES = 5,
> >>
> >> Many machines do not support these counts. For example, Niagara T1 does
> >> not have a CYCLES count. And good luck if you think you can easily come
> >> up with something meaningful for the various kind of CACHE_MISSES on the
> >> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> >> count with slightly different semantics. This kind of abstraction should
> >> be done in userspace.
> >
> > I'll argue to disagree, sure such events might not be supported by any
> > particular hardware implementation - but the fact that PAPI gives a list
> > of 'common' events means that they are, well, common. So unifying them
> > between those archs that do implement them seems like a sane choice, no?
> >
> > For those archs that do not support it, it will just fail to open. No
> > harm done.
> >
> > The proposal allows for you to specify raw hardware events, so you can
> > just totally ignore this part of the abstraction.
> >
> I believe the cache related events do not belong in here. There is no definition
> for them. You don't know what cache miss level, what kind of access. You cannot
> do this even on Intel Core processors.

I might agree with that, perhaps we should model this to the common list
PAPI specifies?

2008-12-12 08:55:33

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2008-12-11 at 20:34 +0100, Ingo Molnar wrote:
>
> > struct perf_counter_hw_event {
> > s64 type;
> >
> > u64 irq_period;
> > u32 record_type;
> >
> > u32 disabled : 1, /* off by default */
> > nmi : 1, /* NMI sampling */
> > raw : 1, /* raw event type */
> > __reserved_1 : 29;
> >
> > u64 __reserved_2;
> > };
> >
> > if the hw_event.raw bit is set to 1, then the hw_event.type is fully
> > 'raw'. The default is for raw to be 0. So negative numbers can be used
> > for sw events, positive numbers for hw events. Both can be extended
> > gradually, without arbitrarily limits introduced.
>
> On that, I still don't think its a good idea to use bitfields in an
> ABI. The C std is just not strict enough on them, and I guess that is
> the reason this would be the first such usage.

I dont feel strongly about this, we could certainly change it.

But these are system calls which have per platform bit order anyway - is
it really an issue? I'd agree that it would be bad for any sort of
persistent or otherwise cross-platform data such as filesystems, network
protocol bits, etc.

We use bitfields in a couple of system calls ABIs already, for example in
PPP:

if_ppp.h-/* For PPPIOCGL2TPSTATS */
if_ppp.h-struct pppol2tp_ioc_stats {
if_ppp.h- __u16 tunnel_id; /* redundant */
if_ppp.h- __u16 session_id; /* if zero, get tunnel stats */
if_ppp.h: __u32 using_ipsec:1; /* valid only for session_id ==

Ingo

2008-12-12 09:00:13

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
>

>> Perfmon3 works for all of those 60 machines. This new proposal works on a
>> 2 out of the 60.
>
> s/works/is implemented/
>
>> Who is going to add support for all of those machines? I've spent a lot
>> of developer time getting prefmon going for all of those configurations.
>> But why should I help out with this new inferior proposal? It could all
>> be another waste of time.
>
> So much for constructive critisism.. have you tried taking the design to
> its limits, if so, where do you see problems?
>
People have pointed out problems, but you keep forgetting to answer them.

For instance, people have pointed out that your design necessarily implies
pulling into the kernel the event table for all PMU models out there. This
is not just data, this is also complex algorithms to assign events to counters.
The constraints between events can be very tricky to solve. If you get this
wrong, this leads to silent errors, and that is really bad.

Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
the complexity of this. Paul pointed out earlier the complexity on Power.
I can relate to the complexity on Itanium (I implemented all the code in
the user level libpfm for them). Read the Itanium PMU description and I
hope you'll understand.

Events constraints are not going away anytime soon, quite the contrary.

Furthermore, event tables are not always correct. In fact, they are
always bogus.
Event semantics varies between steppings. New events shows up, others
get removed.
Constraints are discovered later on.

If you have all of that in the kernel, it means you'll have to
generate a kernel patch each
time. Even if that can be encapsulated into a kernel module, you will
still have problems.

Furthermore, Linux commercial distribution release cycles do not
align well with new processor
releases. I can boot my RHEL5 kernel on a Nehalem system and it would
be nice not to have to
wait for a new kernel update to get the full Nehalem PMU event table,
so I can program more than
the basic 6 architected events of Intel X86.

I know the argument about the fact that you'll have a patch with 24h
on kernel.org. The problem
is that no end-user runs a kernel.org kernel, nobody. Changing the
kernel is not an option for
many end-users, it may even require re-certifications for many customers.

I believe many people would like to see how you plan on addressing those issues.

2008-12-12 09:01:20

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 09:51 +0100, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> > Peter,
> >
> > On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> > >> > + /*
> > >> > + * Common hardware events, generalized by the kernel:
> > >> > + */
> > >> > + PERF_COUNT_CYCLES = 0,
> > >> > + PERF_COUNT_INSTRUCTIONS = 1,
> > >> > + PERF_COUNT_CACHE_REFERENCES = 2,
> > >> > + PERF_COUNT_CACHE_MISSES = 3,
> > >> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> > >> > + PERF_COUNT_BRANCH_MISSES = 5,
> > >>
> > >> Many machines do not support these counts. For example, Niagara T1 does
> > >> not have a CYCLES count. And good luck if you think you can easily come
> > >> up with something meaningful for the various kind of CACHE_MISSES on the
> > >> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> > >> count with slightly different semantics. This kind of abstraction should
> > >> be done in userspace.
> > >
> > > I'll argue to disagree, sure such events might not be supported by any
> > > particular hardware implementation - but the fact that PAPI gives a list
> > > of 'common' events means that they are, well, common. So unifying them
> > > between those archs that do implement them seems like a sane choice, no?
> > >
> > > For those archs that do not support it, it will just fail to open. No
> > > harm done.
> > >
> > > The proposal allows for you to specify raw hardware events, so you can
> > > just totally ignore this part of the abstraction.
> > >
> > I believe the cache related events do not belong in here. There is no definition
> > for them. You don't know what cache miss level, what kind of access. You cannot
> > do this even on Intel Core processors.
>
> I might agree with that, perhaps we should model this to the common list
> PAPI specifies?

http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_presets.html

Has a lot of cache events.

And I can see the use of a set without the L[123] in there, which would
signify either all or the lack of more specific knowledge. Like with
PAPI its perfectly fine to not support these common events on a
particular hardware platform.

2008-12-12 09:08:24

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2008-12-12 at 09:51 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-12-12 at 09:35 +0100, stephane eranian wrote:
> > > Peter,
> > >
> > > On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> > > >> > + /*
> > > >> > + * Common hardware events, generalized by the kernel:
> > > >> > + */
> > > >> > + PERF_COUNT_CYCLES = 0,
> > > >> > + PERF_COUNT_INSTRUCTIONS = 1,
> > > >> > + PERF_COUNT_CACHE_REFERENCES = 2,
> > > >> > + PERF_COUNT_CACHE_MISSES = 3,
> > > >> > + PERF_COUNT_BRANCH_INSTRUCTIONS = 4,
> > > >> > + PERF_COUNT_BRANCH_MISSES = 5,
> > > >>
> > > >> Many machines do not support these counts. For example, Niagara T1 does
> > > >> not have a CYCLES count. And good luck if you think you can easily come
> > > >> up with something meaningful for the various kind of CACHE_MISSES on the
> > > >> Pentium 4. Also, the Pentium D has various flavors of retired instruction
> > > >> count with slightly different semantics. This kind of abstraction should
> > > >> be done in userspace.
> > > >
> > > > I'll argue to disagree, sure such events might not be supported by any
> > > > particular hardware implementation - but the fact that PAPI gives a list
> > > > of 'common' events means that they are, well, common. So unifying them
> > > > between those archs that do implement them seems like a sane choice, no?
> > > >
> > > > For those archs that do not support it, it will just fail to open. No
> > > > harm done.
> > > >
> > > > The proposal allows for you to specify raw hardware events, so you can
> > > > just totally ignore this part of the abstraction.
> > > >
> > > I believe the cache related events do not belong in here. There is no definition
> > > for them. You don't know what cache miss level, what kind of access. You cannot
> > > do this even on Intel Core processors.
> >
> > I might agree with that, perhaps we should model this to the common list
> > PAPI specifies?
>
> http://icl.cs.utk.edu/projects/papi/files/html_man3/papi_presets.html
>
> Has a lot of cache events.
>
> And I can see the use of a set without the L[123] in there, which would
> signify either all or the lack of more specific knowledge. Like with
> PAPI its perfectly fine to not support these common events on a
> particular hardware platform.

yes, exactly.

A PAPI wrapper on top of this code might even opt to never use any of the
generic types, because it can be well aware of all the CPU types and
their exact event mappings to raw types, and can use those directly.

Different apps like KernelTop might opt to utilize the generic types.

A kernel is all about providing intelligent, generalized access to hw
resources.

Ingo

2008-12-12 09:25:04

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:
> Peter,
>
> On Fri, Dec 12, 2008 at 9:25 AM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> >
>
> >> Perfmon3 works for all of those 60 machines. This new proposal works on a
> >> 2 out of the 60.
> >
> > s/works/is implemented/
> >
> >> Who is going to add support for all of those machines? I've spent a lot
> >> of developer time getting prefmon going for all of those configurations.
> >> But why should I help out with this new inferior proposal? It could all
> >> be another waste of time.
> >
> > So much for constructive critisism.. have you tried taking the design to
> > its limits, if so, where do you see problems?
> >
> People have pointed out problems, but you keep forgetting to answer them.

I thought some of that (and surely more to follow) has been
incorporated.

> For instance, people have pointed out that your design necessarily implies
> pulling into the kernel the event table for all PMU models out there. This
> is not just data, this is also complex algorithms to assign events to counters.
> The constraints between events can be very tricky to solve. If you get this
> wrong, this leads to silent errors, and that is really bad.

(well, its not my design - I'm just trying to see how far we can push it
out of sheer curiosity)

This has to be done anyway, and getting it wrong in userspace is just as
bad no?

The _ONLY_ technical argument I've seen to do this in userspace is that
these tables and text segments are unswappable in-kernel - which doesn't
count too heavily in my book.

> Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
> the complexity of this. Paul pointed out earlier the complexity on Power.
> I can relate to the complexity on Itanium (I implemented all the code in
> the user level libpfm for them). Read the Itanium PMU description and I
> hope you'll understand.

Again, I appreciate the fact that multi-dimensional constraint solving
isn't easy. But any which way we turn this thing, it still needs to be
done.

> Events constraints are not going away anytime soon, quite the contrary.
>
> Furthermore, event tables are not always correct. In fact, they are
> always bogus.
> Event semantics varies between steppings. New events shows up, others
> get removed.
> Constraints are discovered later on.
>
> If you have all of that in the kernel, it means you'll have to
> generate a kernel patch each
> time. Even if that can be encapsulated into a kernel module, you will
> still have problems.

How is updating a kernel module (esp one that only contains constraint
tables) more difficult than upgrading a user-space library? That just
doesn't make sense.

> Furthermore, Linux commercial distribution release cycles do not
> align well with new processor
> releases. I can boot my RHEL5 kernel on a Nehalem system and it would
> be nice not to have to
> wait for a new kernel update to get the full Nehalem PMU event table,
> so I can program more than
> the basic 6 architected events of Intel X86.

Talking with my community hat on, that is an artificial problem created
by distributions, tell them to fix it.

All it requires is a new kernel module that describes the new chip,
surely that can be shipped as easily as a new library.

> I know the argument about the fact that you'll have a patch with 24h
> on kernel.org. The problem
> is that no end-user runs a kernel.org kernel, nobody. Changing the
> kernel is not an option for
> many end-users, it may even require re-certifications for many customers.
>
> I believe many people would like to see how you plan on addressing those issues.

You're talking to LKML here - we don't care about stuff older than -git
(well, only a little, but not much more beyond n-1).

What we do care about is technical arguments, and last time I checked,
hardware resource scheduling was an OS level job.

But if the PMU control is critical to the enterprise deployment of
$customer, then he would have to re-certify on the library update too.

If its only development phase stuff, then the deployment machines won't
even load the module so there'd be no problem anyway.

2008-12-12 10:22:22

by tip-bot for Robert Richter

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On 12.12.08 10:23:54, Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:
> > For instance, people have pointed out that your design necessarily implies
> > pulling into the kernel the event table for all PMU models out there. This
> > is not just data, this is also complex algorithms to assign events to counters.
> > The constraints between events can be very tricky to solve. If you get this
> > wrong, this leads to silent errors, and that is really bad.
>
> (well, its not my design - I'm just trying to see how far we can push it
> out of sheer curiosity)
>
> This has to be done anyway, and getting it wrong in userspace is just as
> bad no?
>
> The _ONLY_ technical argument I've seen to do this in userspace is that
> these tables and text segments are unswappable in-kernel - which doesn't
> count too heavily in my book.

But, there are also no arguments to implement it not in userspace.

> > Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
> > the complexity of this. Paul pointed out earlier the complexity on Power.
> > I can relate to the complexity on Itanium (I implemented all the code in
> > the user level libpfm for them). Read the Itanium PMU description and I
> > hope you'll understand.
>
> Again, I appreciate the fact that multi-dimensional constraint solving
> isn't easy. But any which way we turn this thing, it still needs to be
> done.

I agree with Stephane. There are already many different PMU
descriptions depending on family, model and steppping and with *every*
new cpu revision you will get one more update. Implementing this in
the kernel would require kernel updates where otherwise no changes
would be necessary.

If you look at current pmu implementations, there are tons of
descriptions files and code you don't want to have in the kernel.

Also, a profiling tool that needs a certain pmu feature would depend
then on its kernel implementation. (Actually, it is impossible to have
a 100% implementation coverage.) If the pmu could be programmed from
userspace, the tool could provide the feature itself.

> > Events constraints are not going away anytime soon, quite the contrary.
> >
> > Furthermore, event tables are not always correct. In fact, they are
> > always bogus.
> > Event semantics varies between steppings. New events shows up, others
> > get removed.
> > Constraints are discovered later on.
> >
> > If you have all of that in the kernel, it means you'll have to
> > generate a kernel patch each
> > time. Even if that can be encapsulated into a kernel module, you will
> > still have problems.
>
> How is updating a kernel module (esp one that only contains constraint
> tables) more difficult than upgrading a user-space library? That just
> doesn't make sense.

At least this would require a kernel with modules enabled.

> > Furthermore, Linux commercial distribution release cycles do not
> > align well with new processor
> > releases. I can boot my RHEL5 kernel on a Nehalem system and it would
> > be nice not to have to
> > wait for a new kernel update to get the full Nehalem PMU event table,
> > so I can program more than
> > the basic 6 architected events of Intel X86.
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.

It does not make sense to close the eyes to reality. There are systems
where it is not possible to update the kernel frequently. Probably you
have one running yourself.

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2008-12-12 10:59:58

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, Dec 12, 2008 at 11:21:11AM +0100, Robert Richter wrote:
> I agree with Stephane. There are already many different PMU
> descriptions depending on family, model and steppping and with *every*
> new cpu revision you will get one more update. Implementing this in
> the kernel would require kernel updates where otherwise no changes
> would be necessary.

Please stop the Bullshit. You have to update _something_. It makes a
lot of sense to update the thing you need to udpate anyway for new
hardware support, and not some piece of junk library like libperfmon.

> > Talking with my community hat on, that is an artificial problem created
> > by distributions, tell them to fix it.
>
> It does not make sense to close the eyes to reality. There are systems
> where it is not possible to update the kernel frequently. Probably you
> have one running yourself.

Of course it is. And on many of my systems it's much easier to update a
kernel than a library. A kernel I can build myself, for libraries I'm
more or less reliant on the distro or hacking fugly rpm or debian
packagging bits.

Having HW support in the kernel is a lot easier than in weird libraries.

2008-12-12 11:36:22

by tip-bot for Robert Richter

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On 12.12.08 05:59:38, Christoph Hellwig wrote:
> On Fri, Dec 12, 2008 at 11:21:11AM +0100, Robert Richter wrote:
> > I agree with Stephane. There are already many different PMU
> > descriptions depending on family, model and steppping and with *every*
> > new cpu revision you will get one more update. Implementing this in
> > the kernel would require kernel updates where otherwise no changes
> > would be necessary.
>
> Please stop the Bullshit. You have to update _something_. It makes a
> lot of sense to update the thing you need to udpate anyway for new
> hardware support, and not some piece of junk library like libperfmon.

New hardware does not always mean to implement new hardware
support. Sometimes it is sufficient to simply program the same
registers in another way. Why changing the kernel for this?

-Robert

--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]

2008-12-12 13:41:58

by Andi Kleen

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra <[email protected]> writes:
> On that, I still don't think its a good idea to use bitfields in an ABI.
> The C std is just not strict enough on them,

If you constrain yourself to a single architecture in practice C
bitfield standards are quite good e.g. on Linux/x86 it is "everyone
implements what gcc does" (and on linux/ppc "what ppc gcc does").
And the syscall ABI is certainly restricted to one architecture.

-Andi

--
[email protected]

2008-12-12 16:46:43

by Chris Friesen

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra wrote:
> On Fri, 2008-12-12 at 09:59 +0100, stephane eranian wrote:

>>Furthermore, Linux commercial distribution release cycles do not
>>align well with new processor
>>releases. I can boot my RHEL5 kernel on a Nehalem system and it would
>>be nice not to have to
>>wait for a new kernel update to get the full Nehalem PMU event table,
>>so I can program more than
>>the basic 6 architected events of Intel X86.
>
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.
>
> All it requires is a new kernel module that describes the new chip,
> surely that can be shipped as easily as a new library.

I have to confess that I haven't had a chance to look at the code. Is
the current proposal set up in such a way as to support loading a module
and having the new description picked up automatically?

>>Changing the
>>kernel is not an option for
>>many end-users, it may even require re-certifications for many customers.

> What we do care about is technical arguments, and last time I checked,
> hardware resource scheduling was an OS level job.

Here I agree.

> But if the PMU control is critical to the enterprise deployment of
> $customer, then he would have to re-certify on the library update too.

It may not have any basis in fact, but in practice it seems like kernel
changes are considered more risky than userspace changes.

As you say though, it's not likely that most production systems would be
running performance monitoring code, so this may only be an issue for
development machines.

Chris

2008-12-12 17:12:14

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 18:03 +0100, Samuel Thibault wrote:
> Peter Zijlstra, le Fri 12 Dec 2008 09:25:45 +0100, a écrit :
> > On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> > > Also, my primary method of using counters is total aggregate count for a
> > > single user-space process.
> >
> > Process, as in single thread, or multi-threaded? I'll assume
> > single-thread.
>
> BTW, just to make sure it is taken into account (I haven't followed the
> thread up to here, just saw a "pid_t" somwhere that alarmed me): for our
> uses, we _do_ need per-kernelthread counters.

Yes, counters are per task - not sure on the exact interface thingy
though - I guess it should be tid_t but glibc does a bit weird there or
something.

2008-12-12 17:13:38

by Samuel Thibault

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra, le Fri 12 Dec 2008 09:25:45 +0100, a ?crit :
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
> > Also, my primary method of using counters is total aggregate count for a
> > single user-space process.
>
> Process, as in single thread, or multi-threaded? I'll assume
> single-thread.

BTW, just to make sure it is taken into account (I haven't followed the
thread up to here, just saw a "pid_t" somwhere that alarmed me): for our
uses, we _do_ need per-kernelthread counters.

Samuel

2008-12-12 17:46:51

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

On Fri, Dec 12, 2008 at 10:23 AM, Peter Zijlstra <[email protected]> wrote:
>> For instance, people have pointed out that your design necessarily implies
>> pulling into the kernel the event table for all PMU models out there. This
>> is not just data, this is also complex algorithms to assign events to counters.
>> The constraints between events can be very tricky to solve. If you get this
>> wrong, this leads to silent errors, and that is really bad.
>
> (well, its not my design - I'm just trying to see how far we can push it
> out of sheer curiosity)
>
> This has to be done anyway, and getting it wrong in userspace is just as
> bad no?
>
No as bad. If a library is bad, then just don't the library. In fact,
I know tools
which do not even need a library. What is important is that there is a way
to avoid the problem. If the kernel controls this, then there is no way out.

To remain in your world, look at the Pentium 4 (Netburst) PMU
description. And you'll see that things are very complicated already there.

> The _ONLY_ technical argument I've seen to do this in userspace is that
> these tables and text segments are unswappable in-kernel - which doesn't
> count too heavily in my book.
>
>> Looking at Intel Core, Nehalem, or AMD64 does not reflect the reality of
>> the complexity of this. Paul pointed out earlier the complexity on Power.
>> I can relate to the complexity on Itanium (I implemented all the code in
>> the user level libpfm for them). Read the Itanium PMU description and I
>> hope you'll understand.
>
> Again, I appreciate the fact that multi-dimensional constraint solving
> isn't easy. But any which way we turn this thing, it still needs to be
> done.
>

Yes, but you have lots of ways of doing this at the user level. For all I know,
you could even hardcode the values (register, value) pairs in your tool if you
know what you are doing. And don't discount the fact that advanced tools
know what they are doing very precisely.

>> Events constraints are not going away anytime soon, quite the contrary.
>>
>> Furthermore, event tables are not always correct. In fact, they are
>> always bogus.
>> Event semantics varies between steppings. New events shows up, others
>> get removed.
>> Constraints are discovered later on.
>>
>> If you have all of that in the kernel, it means you'll have to
>> generate a kernel patch each
>> time. Even if that can be encapsulated into a kernel module, you will
>> still have problems.
>
> How is updating a kernel module (esp one that only contains constraint
> tables) more difficult than upgrading a user-space library? That just
> doesn't make sense.
>
Go ask end-users what they think of that?

You don't even need a library. All of this could be integrated into the tool.
New processor, just go download the updated version of the tool.
No kernel changes.

>> Furthermore, Linux commercial distribution release cycles do not
>> align well with new processor
>> releases. I can boot my RHEL5 kernel on a Nehalem system and it would
>> be nice not to have to
>> wait for a new kernel update to get the full Nehalem PMU event table,
>> so I can program more than
>> the basic 6 architected events of Intel X86.
>
> Talking with my community hat on, that is an artificial problem created
> by distributions, tell them to fix it.
>
> All it requires is a new kernel module that describes the new chip,
> surely that can be shipped as easily as a new library.
>

No, because you need tons of versions of that module based on kernel
versions. People do not recompile kernel modules.

>> I know the argument about the fact that you'll have a patch with 24h
>> on kernel.org. The problem
>> is that no end-user runs a kernel.org kernel, nobody. Changing the
>> kernel is not an option for
>> many end-users, it may even require re-certifications for many customers.
>>
>> I believe many people would like to see how you plan on addressing those issues.
>
> You're talking to LKML here - we don't care about stuff older than -git
> (well, only a little, but not much more beyond n-1).
>
That is why you don't always understand the issues of users, unfortunately.

> What we do care about is technical arguments, and last time I checked,
> hardware resource scheduling was an OS level job.
>
Yes, if you get it wrong, applications are screwed.

> But if the PMU control is critical to the enterprise deployment of
> $customer, then he would have to re-certify on the library update too.
>
No, they just download a new version of the tool.

> If its only development phase stuff, then the deployment machines won't
> even load the module so there'd be no problem anyway.
>
This not just development stuff anymore.

2008-12-12 18:05:26

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

Given the level of abstractions you are using for the API, and given
your argument
that the kernel can do the HW resource scheduling better than anybody else.

What happens in the following test case:

- 2-way system (cpu0, cpu1)

- on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
Event E1 can only be measured on counter C1.

- on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1

- the scheduler decides to migrate P1 onto CPU1. You now have a
conflict on C1.

How is this managed?

2008-12-12 18:14:19

by Vince Weaver

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 12 Dec 2008, Peter Zijlstra wrote:
> On Thu, 2008-12-11 at 13:02 -0500, Vince Weaver wrote:
>
>> Perfmon3 works for all of those 60 machines. This new proposal works on a
>> 2 out of the 60.
>
> s/works/is implemented/

Once you "implement" the new solution for all the machines I listed, it's
going to be just as bad, if not worse, than current perfmon3.

> So much for constructive critisism.. have you tried taking the design to
> its limits, if so, where do you see problems?

I have a currently working solution in perfmon3.
I need a pretty strong reason to abandon that.

> I read the above as: I invested a lot of time in something of dubious
> statue (out of tree patch), and now expect it to be merged because I
> have invested in it.

perfmon has been around for years. It's even been in the kernel (in
Itanium form) for years. The perfmon patchset has been posted numerous
time for reviews to the linux-kernel list. It's not like perfmon was some
sort of secret project sprung on the world last-minute.

I know the way the Linux kernel development works. If some other
performance monitoring implementation does get merged, I will cope and
move on. I'm just trying to help avoid a costly mistake.

>> Also, my primary method of using counters is total aggregate count for a
>> single user-space process.
>
> Process, as in single thread, or multi-threaded? I'll assume
> single-thread.

No. Multi-thread too.

> I'll argue to disagree, sure such events might not be supported by any
> particular hardware implementation - but the fact that PAPI gives a list
> of 'common' events means that they are, well, common. So unifying them
> between those archs that do implement them seems like a sane choice, no?

No.

I do not use PAPI. PAPI only supports a small subset of counters.

What is needed is a tool for accessing _all_ performance counters on
various machines.

What is _not_ needed is pushing PAPI into kernel space.

> The proposal allows for you to specify raw hardware events, so you can
> just totally ignore this part of the abstraction.

If you can do raw events, then that's enough. There's no need to put some
sort of abstraction level into the kernel. That way lies madness if
you've ever looked at any code that tries to do it.

As others have suggested, check out the P4 PMU documentation.

Vince

2008-12-12 19:46:26

by Chris Friesen

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

stephane eranian wrote:

> What happens in the following test case:
>
> - 2-way system (cpu0, cpu1)
>
> - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> Event E1 can only be measured on counter C1.
>
> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>
> - the scheduler decides to migrate P1 onto CPU1. You now have a
> conflict on C1.
>
> How is this managed?

Prevent the load balancer from moving P1 onto cpu1?

Chris

2008-12-13 11:18:07

by Henrique de Moraes Holschuh

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> In fact, I know tools which do not even need a library.

By your own saying, the problem solved by libperfmon is a hard problem
(and I fully understand that).

Now you say there is software out there that doesn't use libperfmon,
that means they'll have to duplicate that functionality.

And only commercial software has a clear gain by wastefully duplicating
that effort. This means there is an active commercial interest to not
make perfmon the best technical solution there is, which is contrary to
the very thing Linux is about.

What is worse, you defend that:

> Go ask end-users what they think of that?
>
> You don't even need a library. All of this could be integrated into the tool.
> New processor, just go download the updated version of the tool.

No! what people want is their problem fixed - no matter how. That is one
of the powers of FOSS, you can fix your problems in any way suitable.

Would it not be much better if those folks duped into using a binary
only product only had to upgrade their FOSS kernel, instead of possibly
forking over more $$$ for an upgrade?

You have just irrevocably proven to me this needs to go into the kernel,
as the design of perfmon is little more than a GPL circumvention device
- independent of whether you are aware of that or not.

For that I hereby fully NAK perfmon

Nacked-by: Peter Zijlstra <[email protected]>

2008-12-13 13:48:44

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Sat, 13 Dec 2008, Peter Zijlstra wrote:
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.

As long as it uses some sort of "module plugin" approach, perhaps coupled to
the firmware loader system to avoid wasting a ton of space with tables for
processors other than the one you need... you could just move all of the
hardware-related parts of perfmon lib into the kernel.

That would close the doors to non-gpl badness.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2008-12-13 17:45:01

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter,

I don't think you understand what libpfm actually does and therefore
you rush to the wrong conclusion.

At its core, libpfm does NOT know anything about the perfmon kernel API.

I think you missed that, unfortunately.

It is a helper library which helps tool writer solves the event-> code
-> counter assignment problems.
That's it. It does not make any perfmon syscall at ALL to do that.
Proof is people have been using it on
Windows, I can also use it on MacOS.

Looking at your proposal, you think you won't need such a library and
that the kernel is
going to do all this for you. Let's go back to your kerneltop program:

KernelTop Options (up to 4 event types can be specified):

-e EID --event_id=EID # event type ID [default: 0]
0: CPU cycles
1: instructions
2: cache accesses
3: cache misses
4: branch instructions
5: branch prediction misses
< 0: raw CPU events

Looks like I can do:

$ kerneltop --event_id=-0x510088

You think users are going to come up with 0x510088 out of the blue?

I want to say:

$ kerneltop --event_id=BR_INST_EXEC --plm=user

Where do you think they are going to get that from?

The kernel or a helper user library?

Do not denigrate other people's software without understanding what it does.

On Sat, Dec 13, 2008 at 12:17 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
>> In fact, I know tools which do not even need a library.
>
> By your own saying, the problem solved by libperfmon is a hard problem
> (and I fully understand that).
>
> Now you say there is software out there that doesn't use libperfmon,
> that means they'll have to duplicate that functionality.
>
> And only commercial software has a clear gain by wastefully duplicating
> that effort. This means there is an active commercial interest to not
> make perfmon the best technical solution there is, which is contrary to
> the very thing Linux is about.
>
> What is worse, you defend that:
>
>> Go ask end-users what they think of that?
>>
>> You don't even need a library. All of this could be integrated into the tool.
>> New processor, just go download the updated version of the tool.
>
> No! what people want is their problem fixed - no matter how. That is one
> of the powers of FOSS, you can fix your problems in any way suitable.
>
> Would it not be much better if those folks duped into using a binary
> only product only had to upgrade their FOSS kernel, instead of possibly
> forking over more $$$ for an upgrade?
>
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.
>
> For that I hereby fully NAK perfmon
>
> Nacked-by: Peter Zijlstra <[email protected]>
>
>
>
>

2008-12-14 01:02:59

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Peter Zijlstra writes:

> On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > In fact, I know tools which do not even need a library.
>
> By your own saying, the problem solved by libperfmon is a hard problem
> (and I fully understand that).
>
> Now you say there is software out there that doesn't use libperfmon,
> that means they'll have to duplicate that functionality.
>
> And only commercial software has a clear gain by wastefully duplicating
> that effort. This means there is an active commercial interest to not
> make perfmon the best technical solution there is, which is contrary to
> the very thing Linux is about.
>
> What is worse, you defend that:
>
> > Go ask end-users what they think of that?
> >
> > You don't even need a library. All of this could be integrated into the tool.
> > New processor, just go download the updated version of the tool.
>
> No! what people want is their problem fixed - no matter how. That is one
> of the powers of FOSS, you can fix your problems in any way suitable.
>
> Would it not be much better if those folks duped into using a binary
> only product only had to upgrade their FOSS kernel, instead of possibly
> forking over more $$$ for an upgrade?
>
> You have just irrevocably proven to me this needs to go into the kernel,
> as the design of perfmon is little more than a GPL circumvention device
> - independent of whether you are aware of that or not.

I'm sorry, but that is a pretty silly argument.

By that logic, the kernel module loader should include an in-kernel
copy of gcc and binutils, and the fact that it doesn't proves that the
module loader is little more than a GPL circumvention device -
independent of whether you are aware of that or not. 8-)

Paul.

2008-12-14 14:51:03

by Andi Kleen

[permalink] [raw]

Subject: Re: Performance counter API review was [patch] Performance Counters for Linux, v3

Ingo Molnar <[email protected]> writes:

Here are some comments from my (mostly x86) perspective on the interface.
I'm focusing on the interface only, not the code.

- There was a lot of discussion about counter assignment. But an event
actually needs much more meta data than just the counter assignments.
For example here's an event-set out of the upcoming Core i7 oprofile
events file:

event:0xC3 counters:0,1,2,3 um:machine_clears minimum:6000 name:machine_clears : Counts the cycles machine clear is asserted.

and the associated sub unit masks:

name:machine_clears type:bitmask default:0x01
0x01 cycles Counts the cycles machine clear is asserted
0x02 mem_order Counts the number of machine clears due to memory order conflicts
0x04 smc Counts the number of times that a program writes to a code section
0x10 fusion_assist Counts the number of macro-fusion assists

As you can see there is a lot of meta data in there and to my knowledge
none of it is really optional. For example without the name and the description
it's pretty much impossible to use the event (in fact even with description
it is often hard enough to figure out what it means). I think every
non trivial perfctr user front end will need a way to query name and
description. Where should they be stored?

Then the minimum overflow period is needed (see below)

Counter assignment is needed as discussed earlier: there are some events
that can only go to specific counters, and then there are complication
like fixed event counters and uncore events in separate registers.

Then there is the concept of unit_masks, which define the sub-events.
Right now the single event number does not specify how unit masks
are specified. Unit masks also are complicated because they are
sometimes masks (you can or them up) or enumerations (you can't)
To make good use of them the software needs to know the difference.

So these all need to be somewhere. I assume the right place is
not the kernel. I don't think it would be a good idea to duplicate
all of this in every application. So some user space library is needed anyways.

- All the event meta data should be ideally stored in a single place,
otherwise there is risk of it getting out of sync. Events are relatively
often updated (even during a CPU life-cycle when a event is found
to be buggy), so a smooth upgrade procedure is crucial.

- There doesn't seem to be a way to enforce minimum overflow periods.
It's also pretty easy to hang a system by programming a too short
overflow period to a commonly encountered event. For example
if you program a counter to trigger an NMI every hundred cycles
then the system will not do much useful work anymore.

This might even be a security hazard because the interface is available
to non-root. Solving that one would actually argue to put at least
some knowledge into the kernel or always enforce a minimum safe period?

The minimum safe period has the problem that it might break some
useful tracing setups on low frequency event where it might
be quite useful to useful on each event. But on a common event
that's a really bad idea. So probably it needs per event information.

Hard problem. oprofile avoids it by only allowing root to configure events.

[btw i'm not sure perfmon3 has solved that one either]

- Split of event into event and unit mask
On x86 events consist of a event number and a unit mask (which
can be sometimes an enumeration, not a mask). It's unclear
right now how the unit mask is specified in the perfctr structure.
While it could be both encoded in type that would be clumsy,
requiring special macros. So likely it needs a separate field.

- PEBS/Debug Store

Intel/x86 has support for letting the CPU directly log events into a memory
ring buffer with some additional information like register contents. From
the first look this could be supported with additional record types. One
issue there is that the record layout is not architectural and varies
with different CPUs. Getting a nice general API out of that might be tricky.
Would each new CPU need a new record type?

Processing PEBS records is also moderately performance critical
(and they can be quite big) so it would be a good idea to have some way
to process them copy less.

Another issue is that you need to specify the buffer size/overflow threshold
somewhere. Right now there is no way in the API to do that (and the
existing syscall has already quite a lot of arguments). So PEBS would
likely need a new syscall?

- Additional bits. x86 has some more flag bits in the perfctr
registers like edge triggering or counter inversion. Right now there
doesn't seem to be any way to specify those in the syscall. There are
some events (especially when multiple events are counted together)
which can be only counted by setting those bits. Likely needs to be
controlled by the application.

I suppose adding new fields to perf_counter_hw_event would be possible.

- It's unclear to me why the API has a special NMI mode. For me it looks
like that if NMIs are implemented they should be the default way.
Or rather if you have NMI events, why ever not use them?
The only exception I can think of would be if the system is known
to have NMI problems in the BIOS like some ThinkPads. In that case
it shouldn't be per syscall/user controlled though, but some global
root only knob (ideally set automatically)

- Global tracing. Right now there seem to be two modi: per task and
per CPU. But a common variant is global tracing of all CPUs. While this
could be in theory done right now by attaching to each CPU
this has the problem that it doesn't interact very well with CPU
hot plug. The application would need to poll for additional/lost
CPUs somehow and then re-attach to them (or detach). This would
likely be quite clumsy and slow. It would be better if the kernel supported
that better.

Or alternative here is to do nothing and keep oprofile for that job
(which it doesn't do that badly)

- Ring 3 vs ring 0.
x86 supports counting only user space or only kernel space. Right
now there is no way to specify that in the syscall interface.
I suppose adding a new field to perf_counter_hw_event would be possible.

- SMT support
Sometimes you want to count events occurred by both SMT siblings.
For example this is useful when measuring a multi threaded
application that uses both threads and you want to see the
shared cache events of both.
In arch perfmon v3 there is a new perfctr "AnyThread" bit
that controls this. It needs to be exposed.

- In general the SMT and shared resource semantics seem to be a
bit unclear recently. Some clarification of that would be good.
What happens when the resource is not available? How are
the reservation semantics?

- Uncore monitoring
Nehalem has some additional performance counters in the Uncore
which count specific uncore events. They have slightly different
semantics and additional register (like an opcode filter).
It's unclear how they would be programmed in this API.

Also the shared resource problem applies. An uncore is shared
by multiple cores/threads on a socket. Neither a CPU number nor
a pid are particularly useful to address them.

- RDPMC self monitoring
x86 supports reading performance counters from user space
using the RDPMC application. I find that rather useful
as a replacement for RDTSC because it allows to count
real cycles using one of the fixed performance counter.

One problem is that it needs to be explicitely enabled and also
controlled because it always exposes information from
all performance counters (which could be an information
leak). So ideally it needs to cooperate with the kernel
and allow to set up suitable counters for own use and also
to make sure that counters do not leak information on context
switch. There should be some way in the API to specify that.

-Andi

--
[email protected]

2008-12-14 22:38:37

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

* Paul Mackerras <[email protected]> wrote:

> Peter Zijlstra writes:
>
> > On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > > In fact, I know tools which do not even need a library.
> >
> > By your own saying, the problem solved by libperfmon is a hard problem
> > (and I fully understand that).
> >
> > Now you say there is software out there that doesn't use libperfmon,
> > that means they'll have to duplicate that functionality.
> >
> > And only commercial software has a clear gain by wastefully duplicating
> > that effort. This means there is an active commercial interest to not
> > make perfmon the best technical solution there is, which is contrary to
> > the very thing Linux is about.
> >
> > What is worse, you defend that:
> >
> > > Go ask end-users what they think of that?
> > >
> > > You don't even need a library. All of this could be integrated into the tool.
> > > New processor, just go download the updated version of the tool.
> >
> > No! what people want is their problem fixed - no matter how. That is one
> > of the powers of FOSS, you can fix your problems in any way suitable.
> >
> > Would it not be much better if those folks duped into using a binary
> > only product only had to upgrade their FOSS kernel, instead of possibly
> > forking over more $$$ for an upgrade?
> >
> > You have just irrevocably proven to me this needs to go into the kernel,
> > as the design of perfmon is little more than a GPL circumvention device
> > - independent of whether you are aware of that or not.
>
> I'm sorry, but that is a pretty silly argument.
>
> By that logic, the kernel module loader should include an in-kernel copy
> of gcc and binutils, and the fact that it doesn't proves that the module
> loader is little more than a GPL circumvention device - independent of
> whether you are aware of that or not. 8-)

i'm not sure how your example applies: the kernel module loader is not an
application that needs to be updated to new versions of syscalls. Nor is
it a needless duplication of infrastructure - it runs in a completely
different protection domain - just to name one of the key differences.

Applications going to complex raw syscalls and avoiding a neutral hw
infrastructure library that implements a non-trivial job is quite typical
for FOSS-library-shy bin-only apps. The "you cannot infringe what you do
not link to at all" kind of defensive thinking.

Ingo

2008-12-14 23:14:17

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

* stephane eranian <[email protected]> wrote:

> Hi,
>
> Given the level of abstractions you are using for the API, and given
> your argument that the kernel can do the HW resource scheduling better
> than anybody else.
>
> What happens in the following test case:
>
> - 2-way system (cpu0, cpu1)
>
> - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> Event E1 can only be measured on counter C1.
>
> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>
> - the scheduler decides to migrate P1 onto CPU1. You now have a
> conflict on C1.
>
> How is this managed?

If there's a single unit of sharable resource [such as an event counter,
or a physical CPU], then there's just three main possibilities: either
user 1 gets it all, or user 2 gets it all, or they share it.

We've implemented the essence of these variants, with sharing the resource
being the sane default, and with the sysadmin also having a configuration
vector to reserve the resource to himself permanently. (There could be
more variations of this.)

What is your point?

Ingo

2008-12-15 00:37:42

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Ingo Molnar writes:

> * stephane eranian <[email protected]> wrote:
>
> > Hi,
> >
> > Given the level of abstractions you are using for the API, and given
> > your argument that the kernel can do the HW resource scheduling better
> > than anybody else.
> >
> > What happens in the following test case:
> >
> > - 2-way system (cpu0, cpu1)
> >
> > - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
> > Event E1 can only be measured on counter C1.
> >
> > - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
> >
> > - the scheduler decides to migrate P1 onto CPU1. You now have a
> > conflict on C1.
> >
> > How is this managed?
>
> If there's a single unit of sharable resource [such as an event counter,
> or a physical CPU], then there's just three main possibilities: either
> user 1 gets it all, or user 2 gets it all, or they share it.
>
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)
>
> What is your point?

Note that Stephane said *counting* event E1.

One of the important things about counting (as opposed to sampling) is
that it matters whether or not the event is being counted the whole
time or only part of the time. Thus it puts constraints on counter
scheduling and reporting that don't apply for sampling.

In other words, if I'm counting an event, I want it to be counted all
the time (i.e. whenever the task is executing, for a per-task counter,
or continuously for a per-cpu counter). If that causes conflicts and
the kernel decides not to count the event for part of the time, that
is very much second-best, and I absolutely need to know that that
happened, and also when the kernel started and stopped counting the
event (so I can scale the result to get some idea what the result
would have been if it had been counted the whole time).

Now, I haven't digested V4 yet, so you might have already implemented
something like that. Have you? :)

Paul.

2008-12-15 00:50:53

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Ingo Molnar writes:

> * Paul Mackerras <[email protected]> wrote:
>
> > Peter Zijlstra writes:
> >
> > > On Fri, 2008-12-12 at 18:42 +0100, stephane eranian wrote:
> > > > In fact, I know tools which do not even need a library.
> > >
> > > By your own saying, the problem solved by libperfmon is a hard problem
> > > (and I fully understand that).
> > >
> > > Now you say there is software out there that doesn't use libperfmon,
> > > that means they'll have to duplicate that functionality.
> > >
> > > And only commercial software has a clear gain by wastefully duplicating
> > > that effort. This means there is an active commercial interest to not
> > > make perfmon the best technical solution there is, which is contrary to
> > > the very thing Linux is about.
> > >
> > > What is worse, you defend that:
> > >
> > > > Go ask end-users what they think of that?
> > > >
> > > > You don't even need a library. All of this could be integrated into the tool.
> > > > New processor, just go download the updated version of the tool.
> > >
> > > No! what people want is their problem fixed - no matter how. That is one
> > > of the powers of FOSS, you can fix your problems in any way suitable.
> > >
> > > Would it not be much better if those folks duped into using a binary
> > > only product only had to upgrade their FOSS kernel, instead of possibly
> > > forking over more $$$ for an upgrade?
> > >
> > > You have just irrevocably proven to me this needs to go into the kernel,
> > > as the design of perfmon is little more than a GPL circumvention device
> > > - independent of whether you are aware of that or not.
> >
> > I'm sorry, but that is a pretty silly argument.
> >
> > By that logic, the kernel module loader should include an in-kernel copy
> > of gcc and binutils, and the fact that it doesn't proves that the module
> > loader is little more than a GPL circumvention device - independent of
> > whether you are aware of that or not. 8-)
>
> i'm not sure how your example applies: the kernel module loader is not an
> application that needs to be updated to new versions of syscalls. Nor is
> it a needless duplication of infrastructure - it runs in a completely
> different protection domain - just to name one of the key differences.

Peter's argument was in essence that since using perfmon3 involves some
userspace computation that can be done by proprietary software instead
of a GPL'd library (libpfm), that makes perfmon3 a GPL-circumvention
device.

I was trying to point out that that argument is silly by applying it
to the kernel module loader. There the userspace component is gcc and
binutils, and the computation they do can be done alternatively by
proprietary software such as icc or xlc. That of itself doesn't make
the module loader a GPL-circumvention device (though it may be for
other reasons).

And if the argument is silly in that case (which it is), it is even
more silly in the case of perfmon3, where what is being computed and
passed to the kernel is just a few register values, not instructions.

> Applications going to complex raw syscalls and avoiding a neutral hw
> infrastructure library that implements a non-trivial job is quite typical
> for FOSS-library-shy bin-only apps. The "you cannot infringe what you do
> not link to at all" kind of defensive thinking.

FOSS is about freedom - we don't force anyone to use our code. If
someone wants to use their own code instead of glibc or libpfm on the
user-space side of the syscall interface, that's fine.

Paul.

2008-12-15 12:58:29

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

2008-12-15 13:03:16

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 1:50 AM, Paul Mackerras <[email protected]> wrote:

> FOSS is about freedom - we don't force anyone to use our code. If
> someone wants to use their own code instead of glibc or libpfm on the
> user-space side of the syscall interface, that's fine.
>
Exactly right!

That was exactly my point when I said, you are free to not use libpfm
in your tool.

2008-12-15 14:42:42

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 1:37 AM, Paul Mackerras <[email protected]> wrote:
> Ingo Molnar writes:
>
>> * stephane eranian <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > Given the level of abstractions you are using for the API, and given
>> > your argument that the kernel can do the HW resource scheduling better
>> > than anybody else.
>> >
>> > What happens in the following test case:
>> >
>> > - 2-way system (cpu0, cpu1)
>> >
>> > - on cpu0, two processes P1, P2, each self-monitoring and counting event E1.
>> > Event E1 can only be measured on counter C1.
>> >
>> > - on cpu1, there is a cpu-wide session, monitoring event E1, thus using C1
>> >
>> > - the scheduler decides to migrate P1 onto CPU1. You now have a
>> > conflict on C1.
>> >
>> > How is this managed?
>>
>> If there's a single unit of sharable resource [such as an event counter,
>> or a physical CPU], then there's just three main possibilities: either
>> user 1 gets it all, or user 2 gets it all, or they share it.
>>
>> We've implemented the essence of these variants, with sharing the resource
>> being the sane default, and with the sysadmin also having a configuration
>> vector to reserve the resource to himself permanently. (There could be
>> more variations of this.)
>>
>> What is your point?
>
> Note that Stephane said *counting* event E1.
>
> One of the important things about counting (as opposed to sampling) is
> that it matters whether or not the event is being counted the whole
> time or only part of the time. Thus it puts constraints on counter
> scheduling and reporting that don't apply for sampling.
>
Paul is right.

> In other words, if I'm counting an event, I want it to be counted all
> the time (i.e. whenever the task is executing, for a per-task counter,
> or continuously for a per-cpu counter). If that causes conflicts and
> the kernel decides not to count the event for part of the time, that
> is very much second-best, and I absolutely need to know that that
> happened, and also when the kernel started and stopped counting the
> event (so I can scale the result to get some idea what the result
> would have been if it had been counted the whole time).
>
That is very true.

You cannot multiplex events onto counters without applications knowing.
They need to know how long each 'set' has been active. This is needed
to scale the results. This is especially true for cpu-wide measurements.

2008-12-15 14:50:40

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <[email protected]> wrote:
> stephane eranian wrote:
>
>> What happens in the following test case:
>>
>> - 2-way system (cpu0, cpu1)
>>
>> - on cpu0, two processes P1, P2, each self-monitoring and counting event
>> E1.
>> Event E1 can only be measured on counter C1.
>>
>> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>> C1
>>
>> - the scheduler decides to migrate P1 onto CPU1. You now have a
>> conflict on C1.
>>
>> How is this managed?
>
> Prevent the load balancer from moving P1 onto cpu1?
>
You don't want to do that.

There was a reason why the scheduler decided to move the task.
Now, because of monitoring you would change the behavior of the task
and scheduler.
Monitoring should be unintrusive. You want the task/scheduler to
behave as if no monitoring
was present otherwise what is it you are actually measuring?

Changing or forcing the affinity because of monitoring is also a bad
idea, for the same reason.

2008-12-15 20:58:23

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Hi,

On Mon, Dec 15, 2008 at 12:13 AM, Ingo Molnar <[email protected]> wrote:
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)
>
Reading the v4 code, it does not appear the sysadmin can specify which
resource to reserve. The current code reserves a number of counters.
This is problematic with hardware where not all counters can measure
everything, or when not all PMU registers are counters.

2008-12-15 22:33:08

by Chris Friesen

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

stephane eranian wrote:
> On Fri, Dec 12, 2008 at 8:45 PM, Chris Friesen <[email protected]> wrote:
>
>>stephane eranian wrote:
>>
>>
>>>What happens in the following test case:
>>>
>>> - 2-way system (cpu0, cpu1)
>>>
>>> - on cpu0, two processes P1, P2, each self-monitoring and counting event
>>>E1.
>>> Event E1 can only be measured on counter C1.
>>>
>>> - on cpu1, there is a cpu-wide session, monitoring event E1, thus using
>>>C1
>>>
>>> - the scheduler decides to migrate P1 onto CPU1. You now have a
>>>conflict on C1.
>>>
>>>How is this managed?
>>
>>Prevent the load balancer from moving P1 onto cpu1?
>>
>
> You don't want to do that.
>
> There was a reason why the scheduler decided to move the task.
> Now, because of monitoring you would change the behavior of the task
> and scheduler.
> Monitoring should be unintrusive. You want the task/scheduler to
> behave as if no monitoring
> was present otherwise what is it you are actually measuring?

In a scenario where the system physically cannot gather the desired data
without influencing the behaviour of the program, I see two options:

1) limit the behaviour of the system to ensure that we can gather the
performance monitoring data as specified

2) limit the performance monitoring to minimize any influence on the
program, and report the fact that performance monitoring was limited.

You've indicated that you don't want option 1, so I assume that you
prefer option 2. In the above scenario, how would _you_ handle it?

Chris

2008-12-15 22:54:25

[permalink] [raw]

Subject: Re: [patch] Performance Counters for Linux, v3

Ingo Molnar writes:

> If there's a single unit of sharable resource [such as an event counter,
> or a physical CPU], then there's just three main possibilities: either
> user 1 gets it all, or user 2 gets it all, or they share it.
>
> We've implemented the essence of these variants, with sharing the resource
> being the sane default, and with the sysadmin also having a configuration
> vector to reserve the resource to himself permanently. (There could be
> more variations of this.)

Thinking about this a bit more, it seems to me that there is an
unstated assumption that dealing with performance counters is mostly a
scheduling problem - that the hardware resource of a fixed number of
performance counters can be virtualized to provide a larger number of
software counters in much the same way that a fixed number of physical
cpus are virtualized to support a larger number of tasks.

Put another way, your assumption seems to be that software counters
can be transparently time-multiplexed onto the physical counters,
without affecting the end results. In other words, you assume that
time-multiplexing is a reasonable way to implement sharing of hardware
performance counters, and that users shouldn't have to know or care
that their counters are being time-multiplexed. Is that an accurate
statement of your belief?

If it is (and the code you've posted seems to indicate that it is)
then you are going to have unhappy users, because counting part of the
time is not at all the same thing as counting all the time. As just
one example, imagine that the period over which you are counting is
shorter than the counter timeslice period (for example because the
executable you are measuring doesn't run for very long). If you have
N software counters but only M < N hardware counters, then only the
first M software counters will report anything useful, and the
remaining M - N will report zero!

Sampling, as opposed to counting, may be more tolerant of
time-multiplexing of counters, particularly for long-running programs,
but even there time-multiplexing will affect the results and users
need to know about it.

It seems to me that this assumption is pretty deeply rooted in the
design of your performance counter subsystem, and I'm not sure at this
point what is the best way to fix it.

Paul.

2008-12-17 07:45:23