2008-12-04 23:46:42

by Thomas Gleixner

Subject: [patch 0/3] [Announcement] Performance Counters for Linux

Performance counters are special hardware registers available on most modern
CPUs. These registers count the number of certain types of hw events - such
as instructions executed, cache misses suffered, or branches mis-predicted -
without slowing down the kernel or applications. These registers can also
trigger an interrupt when a threshold number of events has passed, and can
thus be used to profile the code that runs on that CPU.

We'd like to announce a brand new implementation of performance counter
support for Linux. It is a very simple and extensible design that has the
potential to implement the full range of features we would expect from such
a subsystem.

The Linux Performance Counter subsystem (implemented via the patches
posted in this announcement) provides an abstraction of performance counter
hardware capabilities. It provides per task and per CPU counters, and it
provides event capabilities on top of those.

The code is far from complete - but the basic approach is already there
and stable.

The biggest missing detail is lowlevel support for non-Intel CPUs and
older Intel CPUs - right now the code is implemented for Intel Core2 and
later CPUs that have the PERFMON CPU feature. (See below for a wider
list of missing/upcoming features.)

We are aware of the perfmon3 patchset that has been submitted to lkml
recently. Our patchset tries to achieve a similar end result, with
a fundamentally different (and we believe, superior :-) design:

- The API is based on a single counter abstraction

- Only a single new system call is needed: sys_perf_counter_open().
All performance-counter operations are implemented via standard
VFS APIs such as read() / fcntl() and poll().

- User-space is not exposed to lowlevel details like contexts or
arrays of counters. Opening and reading a basic counter is as simple
as 2 lines of C code:

int main(void)
{
        unsigned long long count;
        int fd, ret;

        fd = perf_counter_open(3 /* PERF_COUNT_CACHE_MISSES */, 0, 0, 0, -1);
        ret = read(fd, &count, sizeof(count));
        if (ret == sizeof(count))
                printf("Current count: %llu cachemisses!\n", count);
        return 0;
}

- Event notification and blocking/sleeping reads are natural, built-in properties of counters.

- No interaction with ptrace: any task (with sufficient permissions) can
monitor other tasks, without having to stop that task.

- Mapping of counters to hw counters is not static - counters are
scheduled dynamically on each CPU where a task runs.

- There's a /sys based reservation facility that allows the allocation
of a certain number of hw counters for guaranteed sysadmin access.

- Generalized enumeration for common hw event types. Raw event codes
can be passed to the API too - but the most common (and most useful)
event codes are generalized into a hardware-independent registry
of events:

enum hw_event_types {
        PERF_COUNT_CYCLES,
        PERF_COUNT_INSTRUCTIONS,
        PERF_COUNT_CACHE_REFERENCES,
        PERF_COUNT_CACHE_MISSES,
        PERF_COUNT_BRANCH_INSTRUCTIONS,
        PERF_COUNT_BRANCH_MISSES,
};

- Simplified lowlevel/arch support. The x86 code for Intel CPUs (with
the PERFMON CPU feature) is 340 lines of code that implements
7 straightforward lowlevel API calls:

int hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type);
void hw_perf_counter_enable(struct perf_counter *counter);
void hw_perf_counter_disable(struct perf_counter *counter);
void hw_perf_counter_read(struct perf_counter *counter);
void hw_perf_counter_enable_config(struct perf_counter *counter);
void hw_perf_counter_disable_config(struct perf_counter *counter);
void hw_perf_counter_setup(void);

There's one kernel/perf_counter.c core file, and a single
arch/x86/kernel/cpu/perf_counter.c architecture support file.
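
For illustration only: a port to a new architecture boils down to filling in
these hooks. The following is a hypothetical no-op skeleton (not the actual
x86 code in this series):

#include <linux/types.h>
#include <linux/errno.h>

struct perf_counter;    /* opaque here - defined by the core code */

int hw_perf_counter_init(struct perf_counter *counter, u32 hw_event_type)
{
        /* Map hw_event_type to a CPU-specific event encoding here. */
        return -ENODEV; /* this (hypothetical) CPU has no PMU support */
}

void hw_perf_counter_enable(struct perf_counter *counter)         { }
void hw_perf_counter_disable(struct perf_counter *counter)        { }
void hw_perf_counter_read(struct perf_counter *counter)           { }
void hw_perf_counter_enable_config(struct perf_counter *counter)  { }
void hw_perf_counter_disable_config(struct perf_counter *counter) { }
void hw_perf_counter_setup(void)                                  { }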

The impact on the kernel tree is relatively moderate:

27 files changed, 1641 insertions(+), 7 deletions(-)

TODO:

- Non-Intel CPU support. Help is welcome :-)

- Round-robin scheduling of counters, when there are more task counters
than hw counters available.

- Support for extended record types such as PEBS.

- Support for NMI events in the x86 code (the core design is ready)

- Make sure it works well with OProfile and the x86 NMI watchdog

Short documentation is available in Documentation/perf-counters.txt

Find below the source of a simple monitoring demo.

We'd like to seek the feedback of perfmon developers and architecture
maintainers - what do you think about this approach?

Comments, reports, suggestions, flames and other types of feedback
are more than welcome,

Thomas, Ingo
---

/*
* Performance counters monitoring test case
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

#define __user

#include "sys.h"

static int count = 10000;
static int eventid;
static int tid;
static char *debuginfo;

static void display_help(void)
{
        printf("monitor\n");
        printf("Usage:\n"
               "monitor options threadid\n\n"
               "-e EID  --eventid=EID  eventid\n"
               "-c CNT  --count=CNT    event count on which IP is sampled\n"
               "-d FILE --debug=FILE   path to binary file with debug info\n");
        exit(0);
}

static void process_options(int argc, char *argv[])
{
        int error = 0;

        for (;;) {
                int option_index = 0;
                /* Options for getopt */
                static struct option long_options[] = {
                        {"count",   required_argument, NULL, 'c'},
                        {"debug",   required_argument, NULL, 'd'},
                        {"eventid", required_argument, NULL, 'e'},
                        {"help",    no_argument,       NULL, 'h'},
                        {NULL,      0,                 NULL,  0 }
                };
                int c = getopt_long(argc, argv, "c:d:e:",
                                    long_options, &option_index);
                if (c == -1)
                        break;
                switch (c) {
                case 'c': count = atoi(optarg); break;
                case 'd': debuginfo = strdup(optarg); break;
                case 'e': eventid = atoi(optarg); break;
                default: error = 1; break;
                }
        }
        if (error || optind == argc)
                display_help();

        tid = atoi(argv[optind]);
}

int main(int argc, char *argv[])
{
        char str[256];
        uint64_t ip;
        ssize_t res;
        int fd;

        process_options(argc, argv);

        /* args: eventid, sample period, record type (IP samples), pid, cpu */
        fd = perf_counter_open(eventid, count, 1, tid, -1);
        if (fd < 0) {
                perror("Create counter");
                exit(-1);
        }

        while (1) {
                /* each read returns the instruction pointer sampled on overflow */
                res = read(fd, (char *) &ip, sizeof(ip));
                if (res != sizeof(ip)) {
                        perror("Read counter");
                        break;
                }

                if (!debuginfo) {
                        printf("IP: 0x%016llx\n", (unsigned long long)ip);
                } else {
                        sprintf(str, "addr2line -e %s 0x%llx\n", debuginfo,
                                (unsigned long long)ip);
                        system(str);
                }
        }

        close(fd);
        exit(0);
}



2008-12-05 00:28:18

by H. Peter Anvin

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Thomas Gleixner wrote:
>
> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.
>

First of all, let me say I really like what I've seen so far. The file
descriptor paradigm seems really elegant to me.

> - Only one single new system call is needed: sys_perf_counter_open().
> All performance-counter operations are implemented via standard
> VFS APIs such as read() / fcntl() and poll().

As previously discussed, I think this should be a filesystem rather than
a system call. There are a couple of advantages to doing it that way, IMO:

- Strings, rather than numbers, which means fewer constraints across
architectures.
- The events available can be exported in the filesystem itself (via
readdir) rather than via sysfs.
- Compatibility with existing tools, esp. non-C tools.

I'm thinking of something like:

/dev/perfctr/3/cache_misses/all/simple/300

i.e. /dev/perfctr/<cpu>/<event>/<pid>/<type>/<period>. I am putting
<cpu> ahead of <event> in the hierarchy, so a readdir() on the <cpu>
directory can show the events available by name on that CPU. Raw
hardware events can be accessed by something like
/dev/perfctr/<cpu>/0x4064/...
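
A consumer would then need nothing beyond plain open()/read(). A purely
hypothetical sketch - neither this filesystem nor these paths exist yet:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        uint64_t count;
        /* hypothetical path: cpu 3, cache_misses event, all pids, simple type, period 300 */
        int fd = open("/dev/perfctr/3/cache_misses/all/simple/300", O_RDONLY);

        if (fd < 0)
                return 1;
        if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("cache misses: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
}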

-hpa

2008-12-05 00:33:59

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Thomas Gleixner writes:

> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.

Looks like the sort of thing I was thinking about a year or so ago
when I was trying to come up with a simpler API than perfmon2.
However, it turned out that my design, and I believe yours too, can't
do some things that users really want to do with performance
counters.

One thing that this sort of thing can't do is to get values from
multiple counters that correlate with each other. For instance, we
would often want to count, say, L2 cache misses and instructions
completed at the same time, and be able to read both counters at very
close to the same time, so that we can measure average L2 cache misses
per instruction completed, which is useful.

Another problem is that this abstraction provides no way to deal with
interrelationships between counters. For example, on PowerPC it is
common to have a facility where one counter overflowing can cause all
the other counters to freeze. I don't see this abstraction providing
any way to handle that.

It looks to me that your new API will be unworkable for real
performance measurement and tuning, just like mine ended up being. :)

Paul.

2008-12-05 00:43:29

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

H. Peter Anvin writes:

> First of all, let me say I really like what I've seen so far. The file
> descriptor paradigm seems really elegant to me.

I have to say, without intending any disrespect, that it looks to me
like it was designed by someone who hasn't actually ever done much
serious performance analysis or tuning using these hardware
facilities. If I'm wrong about that, I'm willing to be corrected.

Paul.

2008-12-05 01:12:42

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Thomas Gleixner <[email protected]>
Date: Thu, 04 Dec 2008 23:44:39 -0000

> - No interaction with ptrace: any task (with sufficient permissions) can
> monitor other tasks, without having to stop that task.

This isn't going to work.

If you look at the things the perfmon libraries do, you do need to
stop the task.

Consider counter virtualization as the most direct example. Perfmon
allows you to count 6 events even if you can only monitor 2 at a time
with your hardware. It does this by periodically changing the counter
configuration during the run of the program(s) being inspected. These
control register changes and counter captures have to be atomic or
else you'll get garbage or less accurate results.
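
(To sketch the idea - this is not perfmon's code, just an illustration of
how a virtualized count is typically scaled up for the time its event was
not actually loaded on a hardware counter:)

struct virt_counter {
        unsigned long long count;       /* events counted while loaded on hw */
        unsigned long long time_loaded; /* time spent loaded on a hw counter */
        unsigned long long time_total;  /* total time the task was monitored */
};

/* Estimated full-run count, scaled by the fraction of time loaded. */
static unsigned long long virt_count(const struct virt_counter *c)
{
        if (!c->time_loaded)
                return 0;
        return c->count * c->time_total / c->time_loaded;
}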

There are entire families of cases where you want to perform a
sequence of operations on the control registers and counters if one of
them overflows. And these operations must be atomic. The only way
to ensure this is to stop the task, then let the library in the
monitoring task make the changes, and finally let the library
release that task.

The crux of the matter is, when a counter overflows, what you want to
do in response to that event is non-trivial and it must be performed
without letting the monitored task continue executing. So you have to
stop the task, and unless you want tons of cpu specific knowledge and
counter virtualization support code in the kernel, we want userspace
telling the kernel how to program the registers. And since we have
to stop the task, there is no benefit doing this work in the kernel
anyways.

If you don't like the NMI and IPI business on x86 in the perfmon
patches, suggest alternatives.

2008-12-05 03:31:20

by Andrew Morton

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Thu, 04 Dec 2008 23:44:39 -0000 Thomas Gleixner <[email protected]> wrote:

> Performance counters are special hardware registers available on most modern
> CPUs. These register count the number of certain types of hw events: such
> as instructions executed, cachemisses suffered, or branches mis-predicted,
> without slowing down the kernel or applications. These registers can also
> trigger interrupts when a threshold number of events have passed - and can
> thus be used to profile the code that runs on that CPU.
>
> We'd like to announce a brand new implementation of performance counter
> support for Linux. It is a very simple and extensible design that has the
> potential to implement the full range of features we would expect from such
> a subsystem.
>
> The Linux Performance Counter subsystem (implemented via the patches
> posted in this announcement) provides an abstraction of performance counter
> hardware capabilities. It provides per task and per CPU counters, and it
> provides event capabilities on top of those.
>
> The code is far from complete - but the basic approach is already there
> and stable.
>
> The biggest missing detail is lowlevel support for non-Intel CPUs and
> older Intel CPUs - right now the code is implemented for Intel Core2
> (and later) Intel CPUs that have the PERFMON CPU feature. (see below
> a wider list of missing/upcoming features)
>
> We are aware of the perfmon3 patchset that has been submitted to lkml
> recently. Our patchset tries to achieve a similar end result, with
> a fundamentally different (and we believe, superior :-) design:

There's also the perfctr patchset, which has been available for a long
time.

I believe that established users of this sort of capability often
access it via the supposed-to-be-cross-platform PAPI interface/library.

Please cc [email protected] on emails related to this
work.

2008-12-05 06:10:52

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> From: Thomas Gleixner <[email protected]>
> Date: Thu, 04 Dec 2008 23:44:39 -0000
>
> > - No interaction with ptrace: any task (with sufficient permissions) can
> > monitor other tasks, without having to stop that task.
>
> This isn't going to work.
>
> If you look at the things the perfmon libraries do, you do need to stop
> the task.
>
> Consider counter virtualization as the most direct example. [...]

Note that counter virtualization is not offered in the perfmon3 patchset
that has been posted to lkml. (It is part of the much larger 'full'
perfmon patchset which has not been submitted for integration)

Nevertheless we will offer counter virtualization in -v2 of our patchset
and we mentioned it in the TODO list:

> > - Round-robin scheduling of counters, when there's more task
> > counters than hw counters available.

The 'target' task does not have to be stopped to offer counter
virtualization (counter overcommit or counter scheduling) - or to offer
any of the other performance counter features. Please let us know why it
needs the task to be stopped - we asked about that on lkml in the perfmon
thread and no technical answer was given; nor could we find any such
technical reason while implementing it ourselves.

Relying on ptrace machinery can be considered one of the bigger design
mistakes of the perfmon3 patchset.

We pointed that out in review, and now we demonstrate via this
patchset that it can be done much more cleanly and simply. (Please stay
tuned for -v2 if you want to see the proof of the pudding.)

Ingo

2008-12-05 06:31:59

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Paul Mackerras <[email protected]> wrote:

> Thomas Gleixner writes:
>
> > We'd like to announce a brand new implementation of performance counter
> > support for Linux. It is a very simple and extensible design that has the
> > potential to implement the full range of features we would expect from such
> > a subsystem.
>
> Looks like the sort of thing I was thinking about a year or so ago when
> I was trying to come up with a simpler API than perfmon2. However, it
> turned out that my design, and I believe yours too, can't do some
> things that users really want to do with performance counters.
>
> One thing that this sort of thing can't do is to get values from
> multiple counters that correlate with each other. For instance, we
> would often want to count, say, L2 cache misses and instructions
> completed at the same time, and be able to read both counters at very
> close to the same time, so that we can measure average L2 cache misses
> per instruction completed, which is useful.

This can be done in a very natural way with our abstraction, and the
"hello.c" example happens to do exactly that:

aldebaran:~/perf-counter-test> ./hello
doing perf_counter_open() call:
counter[0]... fd: 3.
counter[1]... fd: 4.
counter[0] delta: 10866 cycles
counter[1] delta: 414 cycles
counter[0] delta: 23640 cycles
counter[1] delta: 3673 cycles
counter[0] delta: 28225 cycles
counter[1] delta: 3695 cycles

This counts cycles executed and instructions executed, and reads the two
counters out at the same time.

I just modified it to measure the exact example you mentioned above - L2
cache misses and instructions completed, sampled once every second:

titan:~/perf-counter-test> ./hello
doing perf_counter_open() call:

counter[0] delta: 1 cachemisses
counter[1] delta: 497 instructions

counter[0] delta: 14 cachemisses
counter[1] delta: 4303 instructions

counter[0] delta: 6 cachemisses
counter[1] delta: 3666 instructions

counter[0] delta: 2 cachemisses
counter[1] delta: 3641 instructions

counter[0] delta: 1 cachemisses
counter[1] delta: 3641 instructions

It's a matter of:

fd1 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
fd2 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);

So it's very much possible. (If I've missed something about your example
then please let me know.)

> Another problem is that this abstraction provides no way to deal with
> interrelationships between counters. For example, on PowerPC it is
> common to have a facility where one counter overflowing can cause all
> the other counters to freeze. I don't see this abstraction providing
> any way to handle that.

We could add that facility if it makes sense - there's no reason why
there couldn't be event interaction between counters - we just went for
the most common event variants in v1.

Btw., I'm curious: why would we want to do that? It skews the results if
the task continues executing and counters stop. To get the highest
quality profiling output the counters should follow the true state of the
task that is profiled - and events should be passed to the monitoring
task asynchronously. The _events_ can contain precise coupled information
- but the counters should continue.

What I'd do is what hello.c does: if you want to read out multiple
counters at once, you can read them out at once.

(Again, please explain in more detail if I have missed something about
your observation.)

Ingo

2008-12-05 07:01:16

by Arjan van de Ven

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Fri, 5 Dec 2008 07:31:31 +0100
Ingo Molnar <[email protected]> wrote:

> Btw., i'm curious, why would we want to do that? It skews the results
> if the task continues executing and counters stop. To get the highest
> quality profiling output the counters should follow the true state of
> the task that is profiled - and events should be passed to the
> monitoring task asynchronously. The _events_ can contain precise
> coupled information
> - but the counters should continue.

Btw, stopping the task on counter overflow is an issue for things that
want to self-profile, like JITs.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-12-05 07:04:16

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Ingo Molnar <[email protected]> wrote:

> This can be done in a very natural way with our abstraction, and the
> "hello.c" example happens to do exactly that:

Multiple people pointed out that we have not posted hello.c :-/

Here's a simple standalone example (full working code attached below):

int main(void)
{
        unsigned long long count1, count2;
        int fd1, fd2, ret;

        fd1 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);
        assert(fd1 >= 0);
        fd2 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
        assert(fd2 >= 0);

        for (;;) {
                ret = read(fd1, &count1, sizeof(count1));
                assert(ret == 8);
                ret = read(fd2, &count2, sizeof(count2));
                assert(ret == 8);

                printf("counter1 value: %Ld instructions\n", count1);
                printf("counter2 value: %Ld cachemisses\n", count2);
                sleep(1);
        }
        return 0;
}


which gives this output (one readout per second):

titan:~/perf-counter-test> ./simple
counter1 value: 0 instructions
counter2 value: 0 cachemisses
counter1 value: 23 instructions
counter2 value: 0 cachemisses
counter1 value: 2853 instructions
counter2 value: 6 cachemisses
counter1 value: 5736 instructions
counter2 value: 7 cachemisses
counter1 value: 8619 instructions
counter2 value: 8 cachemisses
counter1 value: 11502 instructions
counter2 value: 8 cachemisses
^C

You need our patchset but then the code below will work just fine. No
libraries, no context setup, nothing - just what is more interesting: the
counter and profiling data.

Ingo

----------------->
/*
* Very simple performance counter testcase.
*/
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/uio.h>

#include <linux/unistd.h>

#include <assert.h>
#include <unistd.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <fcntl.h>

#ifdef __x86_64__
# define __NR_perf_counter_open 295
#endif

#ifdef __i386__
# define __NR_perf_counter_open 333
#endif

int
perf_counter_open(int hw_event_type,
                  unsigned int hw_event_period,
                  unsigned int record_type,
                  pid_t pid,
                  int cpu)
{
        return syscall(__NR_perf_counter_open, hw_event_type, hw_event_period,
                       record_type, pid, cpu);
}

enum hw_event_types {
        PERF_COUNT_CYCLES,
        PERF_COUNT_INSTRUCTIONS,
        PERF_COUNT_CACHE_REFERENCES,
        PERF_COUNT_CACHE_MISSES,
        PERF_COUNT_BRANCH_INSTRUCTIONS,
        PERF_COUNT_BRANCH_MISSES,
};

int main(void)
{
        unsigned long long count1, count2;
        int fd1, fd2, ret;

        fd1 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);
        assert(fd1 >= 0);
        fd2 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
        assert(fd2 >= 0);

        for (;;) {
                ret = read(fd1, &count1, sizeof(count1));
                assert(ret == 8);
                ret = read(fd2, &count2, sizeof(count2));
                assert(ret == 8);

                printf("counter1 value: %Ld instructions\n", count1);
                printf("counter2 value: %Ld cachemisses\n", count2);
                sleep(1);
        }
        return 0;
}

2008-12-05 07:16:43

by Peter Zijlstra

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Fri, 2008-12-05 at 08:03 +0100, Ingo Molnar wrote:

> int main(void)
> {
> unsigned long long count1, count2;
> int fd1, fd2, ret;
>
> fd1 = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, 0, 0, -1);
> assert(fd1 >= 0);
> fd2 = perf_counter_open(PERF_COUNT_CACHE_MISSES, 0, 0, 0, -1);
> assert(fd1 >= 0);
>
> for (;;) {
> ret = read(fd1, &count1, sizeof(count1));
> assert(ret == 8);
> ret = read(fd2, &count2, sizeof(count2));
> assert(ret == 8);
>
> printf("counter1 value: %Ld instructions\n", count1);
> printf("counter2 value: %Ld cachemisses\n", count2);
> sleep(1);
> }
> return 0;
> }

So, while most people would not consider two consecutive read() ops to
be close or near the same time, due to preemption and such, that is
taken away by the fact that the counters are task local time based - so
preemption doesn't affect things. Right?

2008-12-05 07:50:38

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 07:10:12 +0100

> * David Miller <[email protected]> wrote:
>
> > From: Thomas Gleixner <[email protected]>
> > Date: Thu, 04 Dec 2008 23:44:39 -0000
> >
> > > - No interaction with ptrace: any task (with sufficient permissions) can
> > > monitor other tasks, without having to stop that task.
> >
> > This isn't going to work.
> >
> > If you look at the things the perfmon libraries do, you do need to stop
> > the task.
> >
> > Consider counter virtualization as the most direct example. [...]
>
> Note that counter virtualization is not offered in the perfmon3 patchset
> that has been posted to lkml. (It is part of the much larger 'full'
> perfmon patchset which has not been submitted for integration)

I know, it was yanked out to make a merge more likely.

> Relying on ptrace machinery can be considered one of the bigger design
> mistakes of the permon3 patchset.

I totally disagree.

> We pointed that out in review, and now we demonstrate it via this
> patchset that it can be done much cleaner and much simpler. (Please stay
> tuned for -v2 if you want to see the proof of the pudding.)

I hope it will provide enough for full PAPI library support, otherwise
it's useless for most of the world.

2008-12-05 07:52:48

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Arjan van de Ven <[email protected]>
Date: Thu, 4 Dec 2008 23:02:06 -0800

> On Fri, 5 Dec 2008 07:31:31 +0100
> Ingo Molnar <[email protected]> wrote:
>
> > Btw., i'm curious, why would we want to do that? It skews the results
> > if the task continues executing and counters stop. To get the highest
> > quality profiling output the counters should follow the true state of
> > the task that is profiled - and events should be passed to the
> > monitoring task asynchronously. The _events_ can contain precise
> > coupled information
> > - but the counters should continue.
>
> btw stopping the task on counter overflow is an issue for things that
> want to self profile, like JITs

They can fork off a thread to do this.

Not blocking on counter overflow leads to inaccurate results.
This is a pretty fundamental aspect of perf counter usage.

2008-12-05 07:54:23

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Ingo Molnar writes:
>
> * Paul Mackerras <[email protected]> wrote:
[snip]
> > One thing that this sort of thing can't do is to get values from
> > multiple counters that correlate with each other. For instance, we
> > would often want to count, say, L2 cache misses and instructions
> > completed at the same time, and be able to read both counters at very
> > close to the same time, so that we can measure average L2 cache misses
> > per instruction completed, which is useful.
>
> This can be done in a very natural way with our abstraction, and the
> "hello.c" example happens to do exactly that:

Has hello.c been posted? I can't find it in any of the posts from you
or Thomas. Am I just being blind? :)

> aldebaran:~/perf-counter-test> ./hello
> doing perf_counter_open() call:
> counter[0]... fd: 3.
> counter[1]... fd: 4.
> counter[0] delta: 10866 cycles
> counter[1] delta: 414 cycles
> counter[0] delta: 23640 cycles
> counter[1] delta: 3673 cycles
> counter[0] delta: 28225 cycles
> counter[1] delta: 3695 cycles
>
> This counts cycles executed and instructions executed, and reads the two
> counters out at the same time.

Isn't it two separate read() calls to read the two counters? If so,
the only way the two values are actually going to correspond to the
same point in time is if the task being monitored is stopped. In
which case the monitoring task needs to use ptrace or something
similar in order to make sure that the monitored task is actually
stopped.

If the monitored task is not stopped, then the interval between the
two reads will be sufficient to render the results useless -
particularly since the monitoring task could get preempted for an
arbitrary length of time between the two reads. But even if it
doesn't, the hundreds of cycles between the two reads will introduce
considerable imprecision in the results.

There really is value in being able to read all the counters you're
using in one system call.

Paul.

2008-12-05 07:57:48

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Peter Zijlstra writes:

> So, while most people would not consider two consecutive read() ops to
> be close or near the same time, due to preemption and such, that is
> taken away by the fact that the counters are task local time based - so
> preemption doesn't affect thing. Right?

I'm sorry, I don't follow the argument here. What do you mean by
"task local time based"?

Paul.

2008-12-05 07:58:04

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 08:03:29 +0100

>
> * Ingo Molnar <[email protected]> wrote:
>
> > This can be done in a very natural way with our abstraction, and the
> > "hello.c" example happens to do exactly that:
>
> multiple people pointed out that we have not posted hello.c :-/

Because it completely fails to provide the facility. This is not how
people want to use the performance counters at all.

And it doesn't even do what Paulus said is necessary; he said:

--------------------
> One thing that this sort of thing can't do is to get values from
> multiple counters that correlate with each other. For instance, we
> would often want to count, say, L2 cache misses and instructions
> completed at the same time, and be able to read both counters at very
> close to the same time, so that we can measure average L2 cache misses
> per instruction completed, which is useful.
--------------------

And if you read one counter then read the other as separate operations,
you get extra events in there as a side effect of going back into
userspace between the two reads.

Nobody wants that: it's inaccurate, and if you're looking for whether an
event happens at all, it's not only inaccurate but useless if the reads
themselves trigger that counter event. Also, correlation has other
meanings.

What people want is blocking on overflow events, and a monitoring task
or thread doing all of the tricky control register management and task
inspection.

I mean, look at some of the test cases and sample programs in the PAPI
and perfmon2 libraries - that stuff is extremely cool, and this proposal
cannot do half of it correctly.

2008-12-05 08:03:56

by Peter Zijlstra

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > So, while most people would not consider two consecutive read() ops to
> > be close or near the same time, due to preemption and such, that is
> > taken away by the fact that the counters are task local time based - so
> > preemption doesn't affect thing. Right?
>
> I'm sorry, I don't follow the argument here. What do you mean by
> "task local time based"?

time only flows when the task is running.

2008-12-05 08:07:27

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Peter Zijlstra <[email protected]>
Date: Fri, 05 Dec 2008 09:03:36 +0100

> On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> >
> > > So, while most people would not consider two consecutive read() ops to
> > > be close or near the same time, due to preemption and such, that is
> > > taken away by the fact that the counters are task local time based - so
> > > preemption doesn't affect thing. Right?
> >
> > I'm sorry, I don't follow the argument here. What do you mean by
> > "task local time based"?
>
> time only flows when the task is running.

These things aren't measuring time, or even just cycles; they
are measuring things like L2 cache misses, cpu cycles, and
other similar kinds of events.

So these counters are going to measure all of the damn crap
associated with doing the read() call as well as the real work
the task does.

That's not useful to people.

2008-12-05 08:08:52

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
> >
> > * Paul Mackerras <[email protected]> wrote:
> [snip]
> > > One thing that this sort of thing can't do is to get values from
> > > multiple counters that correlate with each other. For instance, we
> > > would often want to count, say, L2 cache misses and instructions
> > > completed at the same time, and be able to read both counters at very
> > > close to the same time, so that we can measure average L2 cache misses
> > > per instruction completed, which is useful.
> >
> > This can be done in a very natural way with our abstraction, and the
> > "hello.c" example happens to do exactly that:
>
> Has hello.c been posted? I can't find it in any of the posts from you
> or Thomas. Am I just being blind? :)

Sorry, it was late at night when we did the release - monitor.c was posted -
and I just posted hello.c half an hour ago :)

> > aldebaran:~/perf-counter-test> ./hello
> > doing perf_counter_open() call:
> > counter[0]... fd: 3.
> > counter[1]... fd: 4.
> > counter[0] delta: 10866 cycles
> > counter[1] delta: 414 cycles
> > counter[0] delta: 23640 cycles
> > counter[1] delta: 3673 cycles
> > counter[0] delta: 28225 cycles
> > counter[1] delta: 3695 cycles
> >
> > This counts cycles executed and instructions executed, and reads the two
> > counters out at the same time.
>
> Isn't it two separate read() calls to read the two counters? If so,
> the only way the two values are actually going to correspond to the
> same point in time is if the task being monitored is stopped. In which
> case the monitoring task needs to use ptrace or something similar in
> order to make sure that the monitored task is actually stopped.

It doesn't matter in practice.

Also, look at our code: we buffer notification events and do not have to
stop the thread for recording the context information.

Also, if you _do_ care about getting immediate readouts, the _monitoring_
task can be set to a higher priority. (Not that I'd advocate it in general:
any task stopping or scheduling can destroy a workload's true behavior)

> If the monitored task is not stopped, then the interval between the two
> reads will be sufficient to render the results useless - particularly
> since the monitoring task could get preempted for an arbitrary length
> of time between the two reads. But even if it doesn't, the hundreds of
> cycles between the two reads will introduce considerable imprecision in
> the results.

Even if the two read()s are done apart, stopping a task is _far_ more
intrusive to the event flow of a single application. Most workloads are
multithreaded - so stopping a task causes another task to be scheduled
in, which would not have occurred were the profiling more transparent and
less intrusive.

Furthermore, even for the special case of single task monitoring, a
context-switch is more expensive than a system call.

Furthermore, in most practical cases there are very few events
happening between two read()s. The interval of profiling versus the
interval between two read()s differs by a couple of orders of magnitude.

This 'task has to be stopped' aspect is a red herring that has no
technical basis.

> There really is value in being able to read all the counters you're
> using in one system call.

It's possible with our code too: what you are asking for is in essence a
sys_read_fds() system call extension - a bit like readv(), but from a
vector of separate fds.
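
(In userspace it would simply batch the obvious loop - a sketch of such a
helper, the name is made up, using nothing beyond plain read():)

/* Read one u64 value from each counter fd; returns 0, or -1 on error. */
static int read_counters(const int *fds, unsigned long long *counts, int n)
{
        int i;

        for (i = 0; i < n; i++) {
                if (read(fds[i], &counts[i], sizeof(counts[i])) !=
                    sizeof(counts[i]))
                        return -1;
        }
        return 0;
}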

Such a 'group system call' facility has been suggested several
times in the past - but it never got anywhere, because system calls are
cheap enough that it really does not count in practice.

It could be implemented, and note that because our code uses a proper
Linux file descriptor abstraction, such a sys_read_fds() facility would
help _other_ applications as well, not just performance counters.

But it brings complications: demultiplexing of error conditions on
individual counters is a real pain with any compound abstraction. We very
consciously went with the 'one fd, one object, one counter' design.

Ingo

2008-12-05 08:12:18

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> From: Peter Zijlstra <[email protected]>
> Date: Fri, 05 Dec 2008 09:03:36 +0100
>
> > On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > > Peter Zijlstra writes:
> > >
> > > > So, while most people would not consider two consecutive read() ops to
> > > > be close or near the same time, due to preemption and such, that is
> > > > taken away by the fact that the counters are task local time based - so
> > > > preemption doesn't affect thing. Right?
> > >
> > > I'm sorry, I don't follow the argument here. What do you mean by
> > > "task local time based"?
> >
> > time only flows when the task is running.
>
> These things aren't measuring time, or even just cycles, they are
> measuring things like L2 cache misses, cpu cycles, and other similar
> kinds of events.
>
> So these counters are going to measure all of the damn crap assosciated
> with doing the read() call as well as the real work the task does.

That's wrong - look at the example we posted; see it pasted below.

When monitoring another task it does _not_ count the read() done in the
monitoring task, it does _not_ include it in the event count. It is a
fundamental property of our code to be as unintrusive as possible. It
only measures the work done by that task.

( You _can_ measure your own overhead of course too, if you want to. It's
a natural special-case of our performance counter abstraction. )

Ingo

---

/*
* Performance counters monitoring test case
*/
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

#define __user

#include "sys.h"

static int count = 10000;
static int eventid;
static int tid;
static char *debuginfo;

static void display_help(void)
{
        printf("monitor\n");
        printf("Usage:\n"
               "monitor options threadid\n\n"
               "-e EID  --eventid=EID  eventid\n"
               "-c CNT  --count=CNT    event count on which IP is sampled\n"
               "-d FILE --debug=FILE   path to binary file with debug info\n");
        exit(0);
}

static void process_options(int argc, char *argv[])
{
        int error = 0;

        for (;;) {
                int option_index = 0;
                /* Options for getopt */
                static struct option long_options[] = {
                        {"count",   required_argument, NULL, 'c'},
                        {"debug",   required_argument, NULL, 'd'},
                        {"eventid", required_argument, NULL, 'e'},
                        {"help",    no_argument,       NULL, 'h'},
                        {NULL,      0,                 NULL,  0 }
                };
                int c = getopt_long(argc, argv, "c:d:e:",
                                    long_options, &option_index);
                if (c == -1)
                        break;
                switch (c) {
                case 'c': count = atoi(optarg); break;
                case 'd': debuginfo = strdup(optarg); break;
                case 'e': eventid = atoi(optarg); break;
                default: error = 1; break;
                }
        }
        if (error || optind == argc)
                display_help();

        tid = atoi(argv[optind]);
}

int main(int argc, char *argv[])
{
        char str[256];
        uint64_t ip;
        ssize_t res;
        int fd;

        process_options(argc, argv);

        /* args: eventid, sample period, record type (IP samples), pid, cpu */
        fd = perf_counter_open(eventid, count, 1, tid, -1);
        if (fd < 0) {
                perror("Create counter");
                exit(-1);
        }

        while (1) {
                /* each read returns the instruction pointer sampled on overflow */
                res = read(fd, (char *) &ip, sizeof(ip));
                if (res != sizeof(ip)) {
                        perror("Read counter");
                        break;
                }

                if (!debuginfo) {
                        printf("IP: 0x%016llx\n", (unsigned long long)ip);
                } else {
                        sprintf(str, "addr2line -e %s 0x%llx\n", debuginfo,
                                (unsigned long long)ip);
                        system(str);
                }
        }

        close(fd);
        exit(0);
}

2008-12-05 08:15:31

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 09:08:13 +0100

>
> * Paul Mackerras <[email protected]> wrote:
>
> > Ingo Molnar writes:
> > >
> > > * Paul Mackerras <[email protected]> wrote:
> > [snip]
> > > > One thing that this sort of thing can't do is to get values from
> > > > multiple counters that correlate with each other. For instance, we
> > > > would often want to count, say, L2 cache misses and instructions
> > > > completed at the same time, and be able to read both counters at very
> > > > close to the same time, so that we can measure average L2 cache misses
> > > > per instruction completed, which is useful.
> > >
> > > This can be done in a very natural way with our abstraction, and the
> > > "hello.c" example happens to do exactly that:
> >
> > Has hello.c been posted? I can't find it in any of the posts from you
> > or Thomas. Am I just being blind? :)
>
> Sorry, was late at night when we did the release - monitor.c was posted -
> and i just posted hello.c it half an hour ago :)
>
> > > aldebaran:~/perf-counter-test> ./hello
> > > doing perf_counter_open() call:
> > > counter[0]... fd: 3.
> > > counter[1]... fd: 4.
> > > counter[0] delta: 10866 cycles
> > > counter[1] delta: 414 cycles
> > > counter[0] delta: 23640 cycles
> > > counter[1] delta: 3673 cycles
> > > counter[0] delta: 28225 cycles
> > > counter[1] delta: 3695 cycles
> > >
> > > This counts cycles executed and instructions executed, and reads the two
> > > counters out at the same time.
> >
> > Isn't it two separate read() calls to read the two counters? If so,
> > the only way the two values are actually going to correspond to the
> > same point in time is if the task being monitored is stopped. In which
> > case the monitoring task needs to use ptrace or something similar in
> > order to make sure that the monitored task is actually stopped.
>
> It doesnt matter in practice.

Yes it DOES!

If I want to know if a code block triggers event X or Y, and your read
call triggers one of those events, I can't figure out the answer to my
profiling problem.

That is completely fundamental to all of this. And this is why this
proposal is a non-workable solution.


> Also, look at our code: we buffer notification events and do not have to
> stop the thread for recording the context information.

But that's what monitoring libraries want: they want to stop the task
and inspect it.

Look at the PAPI library. If you can't implement what that thing
provides, all the real users of profiling information can't use
this stuff.

> Even if the two read()s are done apart, stopping a task is _far_ more
> intrusive to the event flow of a single application.

I really don't think you get the use case for these kinds of
facilities.

Once again I encourage you to look at the test programs, test cases,
and wonderful documentation provided with the PAPI and perfmon2
library bits. That's how people want to use this stuff. Ignore
at your own peril :-)

2008-12-05 08:17:26

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 09:11:37 +0100

>
> * David Miller <[email protected]> wrote:
>
> > From: Peter Zijlstra <[email protected]>
> > Date: Fri, 05 Dec 2008 09:03:36 +0100
> >
> > > On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > > > Peter Zijlstra writes:
> > > >
> > > > > So, while most people would not consider two consecutive read() ops to
> > > > > be close or near the same time, due to preemption and such, that is
> > > > > taken away by the fact that the counters are task local time based - so
> > > > > preemption doesn't affect thing. Right?
> > > >
> > > > I'm sorry, I don't follow the argument here. What do you mean by
> > > > "task local time based"?
> > >
> > > time only flows when the task is running.
> >
> > These things aren't measuring time, or even just cycles, they are
> > measuring things like L2 cache misses, cpu cycles, and other similar
> > kinds of events.
> >
> > So these counters are going to measure all of the damn crap assosciated
> > with doing the read() call as well as the real work the task does.
>
> that's wrong, look at the example we posted - see it pasted below.

It's still too simple to be useful.

There are so many aspects other than the immediate PC that monitoring
tasks want to inspect when a counter overflows.

2008-12-05 08:19:14

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Fri, 5 Dec 2008 08:03:29 +0100
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > This can be done in a very natural way with our abstraction, and the
> > > "hello.c" example happens to do exactly that:
> >
> > multiple people pointed out that we have not posted hello.c :-/
>
> Because it's completely not providing the facility. This is not how
> people want to use the performance counters at all.
>
> And it doesn't even do what Paulus said is necessary, he said:
>
> --------------------
> > One thing that this sort of thing can't do is to get values from
> > multiple counters that correlate with each other. For instance, we
> > would often want to count, say, L2 cache misses and instructions
> > completed at the same time, and be able to read both counters at very
> > close to the same time, so that we can measure average L2 cache misses
> > per instruction completed, which is useful.
> --------------------
>
> And if you read one counter then read the other as seperate operations,
> you get extra events in there as a side effect of going back into
> userspace between the two reads.

That's wrong. If you _want_ to measure in a different context, with as
little measurement impact as possible, you can do it with our code. The
announcement provides the example for that.

For example, i just started this bash infinite loop:

$ while :; do :; done &
[1] 1877

$ ./monitor -e 1 -c 1000000000 1877
IP: 0x00000031a2e70d4b
IP: 0x0000000000455f64
IP: 0x00000031a2f028a0
IP: 0x0000000000440692
IP: 0x0000000000441b8e
IP: 0x00000031a2e6f630
IP: 0x0000000000446129
IP: 0x00000031a2e6edbc
IP: 0x0000000000443736
IP: 0x0000000000441c80
IP: 0x000000000043913a
^C

We get IP readouts every 1 billion instructions executed in that shell.
That shell is never stopped or otherwise intruded upon - its execution
environment is kept as pristine as possible.

Furthermore, the event readouts strictly only include event counts of the
shell PID, _not_ of the monitor context's read() or other activities.

> Nobody wants that, [...]

Nobody wants that, and we don't do it.

Really, you should take a more serious look at our code.

Ingo

2008-12-05 08:20:47

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 09:18:38 +0100

> Really, you should take a more serious look at our code.

People don't want code; they want a usable port of the PAPI libraries
for profiling.

2008-12-05 08:24:58

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> > > These things aren't measuring time, or even just cycles, they are
> > > measuring things like L2 cache misses, cpu cycles, and other
> > > similar kinds of events.
> > >
> > > So these counters are going to measure all of the damn crap
> > > assosciated with doing the read() call as well as the real work the
> > > task does.
> >
> > that's wrong, look at the example we posted - see it pasted below.
>
> It's still too simple to be useful.
>
> There are so many aspects other than the immediate PC that monitoring
> tasks want to inspect when a counter overflows.

fully agreed.

While most of the flat profilers like oprofile will be happy with the PC
alone, I do think we want a couple of extended notification types.

Right now we've begun with the most trivial ones:

enum perf_record_type {
        PERF_RECORD_SIMPLE,
        PERF_RECORD_IRQ,
};

... but it would be natural to do a PERF_RECORD_GP_REGISTERS as well.
Perhaps even a PERF_RECORD_STACKTRACE using the sysprof facilities, to do
a hierarchic multi-dimension profile that sysprof does so nicely.

Note that the record type is an independent attribute of a counter. It
can be set regardless of the event type - and it can be set independently
for each counter. So you can have, say, 3 'simple' counters with no irqs
plus one 'all registers' counter which generates an IRQ: and then you can
read out the simple counters at the same time.
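
In terms of the prototype's perf_counter_open(event, period, record_type,
pid, cpu) call and the enums from the hello.c example, that mixed setup is
just a handful of lines - a sketch, with illustrative fd_* variable names:

int fd_cycles, fd_insns, fd_brmiss, fd_misses;

/* Three 'simple' counters, read out on demand, no IRQs: */
fd_cycles = perf_counter_open(PERF_COUNT_CYCLES, 0, PERF_RECORD_SIMPLE, 0, -1);
fd_insns  = perf_counter_open(PERF_COUNT_INSTRUCTIONS, 0, PERF_RECORD_SIMPLE, 0, -1);
fd_brmiss = perf_counter_open(PERF_COUNT_BRANCH_MISSES, 0, PERF_RECORD_SIMPLE, 0, -1);

/* One counter that generates an IRQ/record every 10000 cache misses: */
fd_misses = perf_counter_open(PERF_COUNT_CACHE_MISSES, 10000, PERF_RECORD_IRQ, 0, -1);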

We could also perhaps do a PERF_RECORD_ALL: it represents a snapshot of
all active counter values in the task. This is _far_ better than forcibly
scheduling the monitored task.

Ingo

2008-12-05 08:27:19

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 09:24:31 +0100

> Right now we begun with the most trivial ones:
>
> enum perf_record_type {
> PERF_RECORD_SIMPLE,
> PERF_RECORD_IRQ,
> };
>
> ... but it would be natural to do a PERF_RECORD_GP_REGISTERS as well.
> Perhaps even a PERF_RECORD_STACKTRACE using the sysprof facilities, to do
> a hierarchic multi-dimension profile that sysprof does so nicely.

Maybe even add something like PERF_RECORD_THE_MOON...

See how ridiculous this is?

It's not your business in the kernel to decide what things are
useful. The monitor can stop the task and inspect whatever
it wants with _existing_ facilities. We need none of this stuff.

2008-12-05 08:43:11

by Ingo Molnar

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Fri, 5 Dec 2008 09:24:31 +0100
>
> > Right now we begun with the most trivial ones:
> >
> > enum perf_record_type {
> > PERF_RECORD_SIMPLE,
> > PERF_RECORD_IRQ,
> > };
> >
> > ... but it would be natural to do a PERF_RECORD_GP_REGISTERS as well.
> > Perhaps even a PERF_RECORD_STACKTRACE using the sysprof facilities, to do
> > a hierarchic multi-dimension profile that sysprof does so nicely.
>
> Maybe even add something like PERF_RECORD_THE_MOON...
>
> see how rediculious this is?

Note that more notification record types are actually where the latest
hardware is going: for example, in Nehalem there's a PEBS notification
record type that has cachemiss latency included in the record. I.e. we
can get profiles with _cachemiss latency_ included (as measured from
issuing the instruction to completion).

You cannot get that information out of any 'stop the task' interface ...

Stopping a task is way too intrusive - I don't know why you keep harping
on it. Listen to the scheduler guys: it's a non-starter.

> It's not your business in the kernel to decide what things are useful.
> The monitor can stop the task and inspect whatever it wants with
> _existing_ facilities. We need none of this stuff.

You try to ridicule our efforts, while you have not answered our
technical arguments in substance.

Please let me repeat: it's a _fundamental_ thesis of performance
instrumentation to not disturb the monitored context. Your insistence on
_stopping_ the monitored task breaks that fundamental axiom!

Stopping a task destroys the characteristics of many, many workloads. To
get a reasonable histogram out of a system, a highlevel event count of
thousands per second is desired (and hundreds per second are the minimum
for any reasonable coverage).

But injecting even hundreds of artificial task-stoppages will destroy
the true behavior of many reference workloads we care about in Linux!

Stopping the task is a fundamental and obvious design failure of perfmon.

Ingo

2008-12-05 08:49:32

by David Miller

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Fri, 5 Dec 2008 09:42:33 +0100

> Please let me repeat: it's a _fundamental_ thesis of performance
> instrumentation to not disturb the monitored context. Your insistence on
> _stopping_ the monitored task breaks that fundamental axiom!

This is only a problem if you make your measurement quantums too
small.

Furthermore, there are multiple registers and states to update
atomically when a perf counter overflows. Your read/write thing
just doesn't cut it, especially for certain kinds of hardware.

It's really a utopian view of the world. :)

2008-12-05 09:10:44

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Ingo Molnar writes:

> * Paul Mackerras <[email protected]> wrote:
[...]
> > Isn't it two separate read() calls to read the two counters? If so,
> > the only way the two values are actually going to correspond to the
> > same point in time is if the task being monitored is stopped. In which
> > case the monitoring task needs to use ptrace or something similar in
> > order to make sure that the monitored task is actually stopped.
>
> It doesnt matter in practice.

Can I ask - and this is a real question, I'm not being sarcastic - is
that statement made with substantial serious experience in performance
analysis behind it, or is it just an intuition?

I will happily admit that I am not a great expert on performance
analysis with years of experience. But I have taken a bit of a look
at what people with that sort of experience do, and I don't think they
would agree with your "doesn't matter" statement.

> Such kind of 'group system call facility' has been suggested several
> times in the past - but ... never got anywhere because system calls are
> cheap enough, it really does not count in practice.
>
> It could be implemented, and note that because our code uses a proper
> Linux file descriptor abstraction, such a sys_read_fds() facility would
> help _other_ applications as well, not just performance counters.
>
> But it brings complications: demultiplexing of error conditions on
> individual counters is a real pain with any compound abstraction. We very
> consciously went with the 'one fd, one object, one counter' design.

And I think that is the fundamental flaw. On the machines I am
familiar with, the performance counters are not separate things that
can individually and independently be assigned to count one thing or
another.

Rather, what the hardware provides is ONE performance monitor unit,
which the OS can context-switch between tasks. The performance
monitor unit has several counters that can be assigned (within limits)
to count various aspects of the performance of the code being
executed. That is why, for instance, if you ask for the counters to
be frozen when one of them overflows, they all get frozen at that
point.

And that's how the hardware is designed because that's how the people
that do performance analysis want to do their measurements. This idea
of splitting things up into separate counters that look independent
but aren't is just going to cause unnecessary complications and
difficulties.

Paul.

2008-12-05 09:34:33

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Ingo Molnar writes:

> The 'target' task does not have to be stopped to offer counter
> virtualization (counter overcommit or counter scheduling) - or to offer
> any of the other performance counter features. Please let us know why it
> needs the task to be stopped - we asked about that on lkml in the perfmon
> thread and no technical answer was given, and couldnt find any such
> technical reason while implementing it ourselves.

I like this feature of your patchset, in fact, and the code looks
pretty clean (as I would expect :). What I don't like (as I have
already said) is having to use an API that splits up the PMU into
pieces, plus the requirement that flows from that to have the kernel
know about the event selection logic on every CPU model we support.

One thing I haven't figured out yet is what happens if you have a
counter on a task and the task dies. Can I still use the counter fd
after the task has died, and read out the total count?

Paul.

2008-12-05 09:34:47

by Paul Mackerras

Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Peter Zijlstra writes:

> On Fri, 2008-12-05 at 18:57 +1100, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> >
> > > So, while most people would not consider two consecutive read() ops to
> > > be close or near the same time, due to preemption and such, that is
> > > taken away by the fact that the counters are task local time based - so
> > > preemption doesn't affect thing. Right?
> >
> > I'm sorry, I don't follow the argument here. What do you mean by
> > "task local time based"?
>
> time only flows when the task is running.

Right - but the monitored task is running while the monitoring task is
running. So time is flowing for the monitored task between the two
reads done by the monitoring task, meaning that you can't actually
relate the two values you read with any precision.

Paul.

2008-12-05 10:05:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Ingo Molnar <[email protected]> wrote:

> > > - No interaction with ptrace: any task (with sufficient permissions) can
> > > monitor other tasks, without having to stop that task.
> >
> > This isn't going to work.
> >
> > If you look at the things the perfmon libraries do, you do need to stop
> > the task.
> >
> > Consider counter virtualization as the most direct example. [...]
>
> Note that counter virtualization is not offered in the perfmon3 patchset that has
> been posted to lkml. (It is part of the much larger 'full' perfmon patchset which
> has not been submitted for integration)
>
> Nevertheless we will offer counter virtualization in -v2 of our patchset [...]

i've just implemented it. Running an (infinite-loop) hello.c with 6 counters on a
CPU that has only two counters now gives the expected:

counter[0 cycles ]: 3368245084 , delta: 842019470 events
counter[1 instructions ]: 1384678210 , delta: 346108294 events
counter[2 cache-refs ]: 659 , delta: 150 events
counter[3 cache-misses ]: 0
counter[4 branch-instructions ]: 266919398 , delta: 66731508 events
counter[5 branch-misses ]: 1201 , delta: 315 events

This will be in -v2.
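
[ For illustration only - not code from the patchset: a minimal sketch of
  what a user-space loop printing such per-second deltas for the six
  generalized counters might look like. It assumes a perf_counter_open()
  wrapper around the new syscall with the same argument order as the
  hello.c example from the announcement (event, event period, record
  type, pid, cpu); that wrapper and the exact argument meanings are
  assumptions. ]

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* assumed wrapper around the new syscall - hypothetical prototype */
extern int perf_counter_open(int event, int period, int record_type,
                             int pid, int cpu);

static const char *names[] = {
        "cycles", "instructions", "cache-refs",
        "cache-misses", "branch-instructions", "branch-misses",
};

void monitor(int pid)
{
        int fd[6], i;
        uint64_t prev[6] = { 0 }, now;

        /* one fd per generalized hw event type, attached to 'pid' */
        for (i = 0; i < 6; i++)
                fd[i] = perf_counter_open(i, 0, 0, pid, -1);

        for (;;) {
                sleep(1);
                for (i = 0; i < 6; i++) {
                        if (read(fd[i], &now, sizeof(now)) != sizeof(now))
                                continue;
                        printf("counter[%d %-20s]: %llu , delta: %llu events\n",
                               i, names[i], (unsigned long long)now,
                               (unsigned long long)(now - prev[i]));
                        prev[i] = now;
                }
        }
}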

Ingo

2008-12-05 10:41:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > The 'target' task does not have to be stopped to offer counter
> > virtualization (counter overcommit or counter scheduling) - or to offer
> > any of the other performance counter features. Please let us know why it
> > needs the task to be stopped - we asked about that on lkml in the perfmon
> > thread and no technical answer was given, and couldnt find any such
> > technical reason while implementing it ourselves.
>
> I like this feature of your patchset, in fact, and the code looks
> pretty clean (as I would expect :). What I don't like (as I have
> already said) is having to use an API that splits up the PMU into
> pieces, plus the requirement that flows from that to have the kernel
> know about the event selection logic on every CPU model we support.
>
> One thing I haven't figured out yet is what happens if you have a
> counter on a task and the task dies. Can I still use the counter fd
> after the task has died, and read out the total count?

yes, it will work just the way you'd expect it to work: the counter is
attached to the fd of the monitoring task, so it does not go away. The
counter simply stops counting but otherwise can be read even after the
monitored task has exited.

We are also planning a natural 'the task has died' notification: a -EPIPE
returned by read(), after the final count has been allowed to be read
out. With blocking counters this will behave quite smoothly: instead of
blocking indefinitely, we'd get back -EPIPE. Hm?
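
[ A sketch of how that planned behaviour might look from user space -
  illustration only, the -EPIPE notification is not in the posted -v1
  code. From user space the -EPIPE would presumably show up as read()
  returning -1 with errno set to EPIPE, once the final count has been
  read out. ]

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* read a (blocking) counter fd until the monitored task goes away */
void drain_counter(int fd)
{
        uint64_t count;
        ssize_t ret;

        for (;;) {
                ret = read(fd, &count, sizeof(count));  /* may block */
                if (ret == sizeof(count)) {
                        printf("count: %llu\n", (unsigned long long)count);
                        continue;
                }
                if (ret < 0 && errno == EPIPE) {
                        printf("monitored task exited\n");
                        break;
                }
                break;          /* some other error */
        }
}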

Ingo

2008-12-05 12:08:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > * Paul Mackerras <[email protected]> wrote:
> [...]
> > > Isn't it two separate read() calls to read the two counters? If so,
> > > the only way the two values are actually going to correspond to the
> > > same point in time is if the task being monitored is stopped. In which
> > > case the monitoring task needs to use ptrace or something similar in
> > > order to make sure that the monitored task is actually stopped.
> >
> > It doesnt matter in practice.
>
> Can I ask - and this is a real question, I'm not being sarcastic - is
> that statement made with substantial serious experience in performance
> analysis behind it, or is it just an intuition?
>
> I will happily admit that I am not a great expert on performance
> analysis with years of experience. But I have taken a bit of a look at
> what people with that sort of experience do, and I don't think they
> would agree with your "doesn't matter" statement.

A stream of read()s possibly being slightly off is an order of magnitude
smaller effect on precision. Look at the numbers: on the testbox i
have a read() syscall takes 0.2 microseconds, while a context-switch
takes 2 microseconds on the local CPU and about 5-10 microseconds
cross-CPU (or more, if the cache pattern is unlucky/unaffine). That's
10-25-50 times more expensive. You can do 9-24-49 reads and still be
cheaper. Compound syscalls are almost never worth the complexity.

So as a scheduler person i cannot really take the perfmon "ptrace
approach" seriously, and i explained that in great detail already. It
clearly came from HPC workload quarters where tasks are persistent
entities running alone on a single CPU that just use up CPU time there
and dont interact with each other too much. That's a good and important
profiling target for sure - but by no means the only workload target to
design a core kernel facility for. It's an absolutely horrible approach
for a number of more common workloads for sure.

> > Such kind of 'group system call facility' has been suggested several
> > times in the past - but ... never got anywhere because system calls
> > are cheap enough, it really does not count in practice.
> >
> > It could be implemented, and note that because our code uses a proper
> > Linux file descriptor abstraction, such a sys_read_fds() facility
> > would help _other_ applications as well, not just performance
> > counters.
> >
> > But it brings complications: demultiplexing of error conditions on
> > individual counters is a real pain with any compound abstraction. We
> > very consciously went with the 'one fd, one object, one counter'
> > design.
>
> And I think that is the fundamental flaw. On the machines I am
> familiar with, the performance counters are not separate things that can
> individually and independently be assigned to count one thing or
> another.

Today we've implemented virtual counter scheduling in our to-be-v2 code:

3 files changed, 36 insertions(+), 1 deletion(-)

hello.c gives:

counter[0 cycles ]: 10121258163 , delta: 844256826 events
counter[1 instructions ]: 4160893621 , delta: 347054666 events
counter[2 cache-refs ]: 2297 , delta: 179 events
counter[3 cache-misses ]: 3 , delta: 0 events
counter[4 branch-instructions ]: 799422166 , delta: 66551572 events
counter[5 branch-misses ]: 7286 , delta: 775 events

All we need to get that array of information from 6 sw counters is a
_single_ hardware counter. I'm not sure where you read "you must map sw
counters to hw counters directly" or "hw counters must be independent of
each other" into our design - it's not part of it, emphatically.

And i dont see your (fully correct!) statement above about counter
constraints to be in any sort of conflict with what we are doing.

Intel hardware is just as constrained as powerpc hardware: there are
counter inter-dependencies and many CPUs have just two performance
counters. We very much took this into account while designing this code.

[ Obviously, you _can_ do higher quality profiling if you have more
hardware resources that help it. Nothing will change that fact. ]

> Rather, what the hardware provides is ONE performance monitor unit,
> which the OS can context-switch between tasks. The performance monitor
> unit has several counters that can be assigned (within limits) to count
> various aspects of the performance of the code being executed. That is
> why, for instance, if you ask for the counters to be frozen when one of
> them overflows, they all get frozen at that point.

i dont see this as an issue at all - it's a feature of powerpc over x86
that the core perfcounter code can support just fine. The overflow IRQ
handler is arch specific. The overflow IRQ handler, if it triggers,
updates the sw counters, creates any event records if needed, wakes up
the monitor task if needed, and continues the task and performance
measurement without having scheduled out. Demultiplexing of hw counters
is arch-specific.

Ingo

2008-12-05 12:13:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Fri, 5 Dec 2008 09:42:33 +0100
>
> > Please let me repeat: it's a _fundamental_ thesis of performance
> > instrumentation to not disturb the monitored context. Your insistence
> > on _stopping_ the monitored task breaks that fundamental axiom!
>
> This is only a problem if you make your measurement quantums too small.

But if you make the measurement long enough - say we make it 100,000
usecs, then 0.2 usecs of delay between two read()s is insignificant
statistically, right? It's a 1:500,000 ratio.

Scheduling out a task and back is far more drastic of an effect than any
new events in 0.2 usecs.

Ingo

2008-12-05 12:39:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Ingo Molnar <[email protected]> writes:

> Note that more notification record types is actually where latest
> hardware is going: for example in Nehalem there's a PEBS notification
> record type that has cachemiss latency included in the record. I.e. we
> can get profiles with _cachemiss latency_ included (as measured from
> issuing the instruction to completion).

One problem is that to find the correct RIP for that PEBS cache miss you
have to disassemble from the last basic block, because the IP in PEBS
points to the next instruction.

If such a thing is ever implemented it should be in user space
I think.

Also, in general, some of the more useful PEBS information requires
disassembling, unfortunately. For example, if you want an address
histogram you get the register contents, but you have to interpret the
code to compute the EA. While the kernel now has an x86 interpreter for
this, I suspect doing it in kernel space would be quite complicated,
and at least I would consider doing it in user space cleaner too.

Granted, these are more obscure features, but not being able to fit
them easily into your model from the start isn't a very promising sign
for the long-term extensibility of the design.

-Andi

--
[email protected]

2008-12-05 13:25:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* David Miller <[email protected]> wrote:

> > > Isn't it two separate read() calls to read the two counters? If
> > > so, the only way the two values are actually going to correspond to
> > > the same point in time is if the task being monitored is stopped.
> > > In which case the monitoring task needs to use ptrace or something
> > > similar in order to make sure that the monitored task is actually
> > > stopped.
> >
> > It doesnt matter in practice.
>
> Yes it DOES!
>
> If I want to know if a code block triggers event X or Y, and your read
> call triggers one of those events, I can't figure out the answer to my
> profiling problem.

( this misunderstanding of yours has been cleared up in a later mail:
reading a counter causes events in the monitoring context, not in the
monitored context. )

> That is completely fundamental to all of this. And this is why this
> proposal is a non-workable solution.
>
>
> > Also, look at our code: we buffer notification events and do not have
> > to stop the thread for recording the context information.
>
> But that's what monitoring libraries want, they want to stop the task
> and inspect it.
>
> Look at the PAPI library. If you can't implement what that thing
> provides, all the real users of profiling information can't use this
> stuff.

We have looked, and the PAPI library can be implemented on top of our
system call as well - just like it was implemented on top of the perfctr
driver and like it was implemented on top of "perfmon-full".

PAPI is a relatively simple wrapper around OS level performance counter
facilities. Both the high level counter abstraction
(PAPI_start_counters() & friends) and the low level PAPI abstraction
(PAPI event sets, PAPI_attach/detach) can be readily implemented via the
use of our performance counter subsystem facilities. (In addition to all
the facilities around PAPI event enumeration.)
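
[ For illustration, a rough sketch - not PAPI source - of how a
  PAPI_start_counters()-style high-level call could be layered over
  one fd per counter. The map_papi_event() translation from PAPI
  preset events to the generalized hw event types, and the
  perf_counter_open() wrapper, are hypothetical. ]

#include <unistd.h>

#define MAX_CTRS 32

extern int perf_counter_open(int event, int period, int record_type,
                             int pid, int cpu);         /* assumed wrapper */
extern int map_papi_event(int papi_event);              /* hypothetical */

static int ctr_fd[MAX_CTRS];
static int nr_ctrs;

int start_counters(const int *papi_events, int n, int pid)
{
        int i;

        for (i = 0; i < n && i < MAX_CTRS; i++) {
                int hw_event = map_papi_event(papi_events[i]);

                ctr_fd[i] = perf_counter_open(hw_event, 0, 0, pid, -1);
                if (ctr_fd[i] < 0)
                        return -1;
        }
        nr_ctrs = i;
        return 0;
}

int read_counters(unsigned long long *values, int n)
{
        int i;

        for (i = 0; i < n && i < nr_ctrs; i++)
                if (read(ctr_fd[i], &values[i], sizeof(values[i]))
                    != sizeof(values[i]))
                        return -1;
        return 0;
}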

PAPI has about 100 functions - if you think our design does not fit it
for some fundamental reason then please point out exactly which
functionality (which PAPI function call) cannot be done.

Perfmon needlessly complicated their design by exposing user-space to a
'performance counter context' and other lowlevel details that should not
and must not be handled at the ABI level. The PAPI interfaces do not
force that design choice in any way. It's a plain unnecessary
complication that permeates the whole perfmon code.

Ingo

2008-12-05 14:59:47

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Fri, 05 Dec 2008 00:07:16 -0800 (PST)
David Miller <[email protected]> wrote:

> These things aren't measuring time, or even just cycles, they
> are measuring things like L2 cache misses, cpu cycles, and
> other similar kinds of events.
>
> So these counters are going to measure all of the damn crap
> associated with doing the read() call as well as the real work
> the task does.

as you said before, not if you do the read() from a thread that's
exempt from the profiling.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-12-05 20:08:25

by David Miller

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Andi Kleen <[email protected]>
Date: Fri, 05 Dec 2008 13:39:43 +0100

> Granted, these are more obscure features, but not being able to fit
> them easily into your model from the start isn't a very promising sign
> for the long-term extensibility of the design.

Another thing I'm interested in is if this new stuff will work with
performance counters that lack an overflow interrupt.

We have several chips that are like this, and perfmon supported that
on the kernel side, and also provided overflow emulation for such
hardware in userspace (where such complexity belongs).

2008-12-05 21:24:56

by Corey J Ashford

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

> * Ingo Molnar <[email protected]> wrote:
>
>> > > - No interaction with ptrace: any task (with sufficient permissions) can
>> > > monitor other tasks, without having to stop that task.
>> >
>> > This isn't going to work.
>> >
>> > If you look at the things the perfmon libraries do, you do need to stop
>> > the task.
>> >
>> > Consider counter virtualization as the most direct example. [...]
>>
>> Note that counter virtualization is not offered in the perfmon3 patchset that has
>> been posted to lkml. (It is part of the much larger 'full' perfmon patchset which
>> has not been submitted for integration)
>>
>> Nevertheless we will offer counter virtualization in -v2 of our patchset [...]
>
> i've just implemented it. Running an (infinite-loop) hello.c with 6 counters on a
> CPU that has only two counters now gives the expected:
>
> counter[0 cycles ]: 3368245084 , delta: 842019470 events
> counter[1 instructions ]: 1384678210 , delta: 346108294 events
> counter[2 cache-refs ]: 659 , delta: 150 events
> counter[3 cache-misses ]: 0
> counter[4 branch-instructions ]: 266919398 , delta: 66731508 events
> counter[5 branch-misses ]: 1201 , delta: 315 events
>
> This will be in -v2.
>
> Ingo
>

When you use the term "virtualization" here, I think you mean "event set
multiplexing" in perfmon terms. When perfmon talks about
virtualization, it's the virtualizing of a small counter (e.g. 32-bits)
to a 64-bit counter via its overflow interrupt. And 64-bit counter
support is included in the perfmon3 posted to LKML.
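
[ For illustration, counter "virtualization" in this sense is the
  classic trick of extending a narrow, possibly wrapping hardware
  counter to a full 64-bit software value - a minimal sketch, with
  made-up names: ]

#include <stdint.h>

struct virt_counter {
        uint64_t total;         /* 64-bit software count */
        uint32_t last_hw;       /* last raw 32-bit hardware reading */
};

/* called on overflow interrupt or on each read of the hw register;
   unsigned arithmetic handles a single 32-bit wrap correctly */
static void virt_counter_update(struct virt_counter *vc, uint32_t hw_now)
{
        vc->total += (uint32_t)(hw_now - vc->last_hw);
        vc->last_hw = hw_now;
}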

One thing that PAPI needs is some control over which events are in each
event "set", to use a perfmon term. In particular, it needs to have a
cycles counter in each set so that it can properly scale the event
counts at the time it reads them up.
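
[ For illustration, the scaling referred to here - extrapolating a
  count taken while its event set was scheduled, using the cycles
  counter carried in every set - might look roughly like this sketch
  (not PAPI or perfmon source): ]

#include <stdint.h>

/* set_count was accumulated only while the set was on the PMU;
   set_cycles are the cycles seen by that set, total_cycles the
   cycles of the whole measurement interval */
static uint64_t scale_count(uint64_t set_count, uint64_t set_cycles,
                            uint64_t total_cycles)
{
        if (set_cycles == 0)
                return 0;
        /* estimate of what the count would have been over the full run */
        return (uint64_t)(set_count * ((double)total_cycles / set_cycles));
}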

With your proposal:

* Would there be a way to force a particular event to be in every event
set that is scheduled onto the processor?

* When the monitoring program reads up the counts, how would it find the
individual cycles count for each set?

* How would it know which other events were in the same set?

* Would it force the round robin scheduling to only a single event
(paired with the cycles event) in each set?

* On what basis is the round robin scheduling performed? Time? Upon
the overflow of an event counter? If there is more than one option, how
is it specified and tweaked? If time is one of the options, how does the
caller specify the round-robin switching rate?

These are all things that are supported in a very flexible way in
perfmon3 (full).

Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]


2008-12-06 00:05:46

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Ingo Molnar writes:

> A stream of read()s possibly slightly being off is an order of magnitude
> smaller of an effect to precision. Look at the numbers: on the testbox i
> have a read() syscall takes 0.2 microseconds, while a context-switch
> takes 2 microseconds on the local CPU and about 5-10 microseconds
> cross-CPU (or more, if the cache pattern is unlucky/unaffine). That's
> 10-25-50 times more expensive. You can do 9-24-49 reads and still be
> cheaper. Compound syscalls are almost never worth the complexity.

If we're on an SMP system and the monitored task is currently running,
then it's not just the read syscall, there's also the IPI, which will
add considerably to the cost. So it's likely to be a thousand or more
cycles between the two counter reads, which is IMHO unacceptable.

Anyway, that's not my major problem with the API, just one of its
little annoyances...

> So as a scheduler person i cannot really take the perfmon "ptrace
> approach" seriously, and i explained that in great detail already. It
> clearly came from HPC workload quarters where tasks are persistent
> entities running alone on a single CPU that just use up CPU time there
> and dont interact with each other too much. That's a good and important
> profiling target for sure - but by no means the only workload target to
> design a core kernel facility for. It's an absolutely horrible approach
> for a number of more common workloads for sure.

I never defended the use of ptrace, and it isn't IMO an essential part
of the perfmon API, just an aspect of its current implementation.

> > And I think that is the fundamental flaw. On the machines I am
> > familiar with, the performance counters are not separate things that can
> > individually and independently be assigned to count one thing or
> > another.
>
> Today we've implemented virtual counter scheduling in our to-be-v2 code:
>
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> hello.c gives:
>
> counter[0 cycles ]: 10121258163 , delta: 844256826 events
> counter[1 instructions ]: 4160893621 , delta: 347054666 events
> counter[2 cache-refs ]: 2297 , delta: 179 events
> counter[3 cache-misses ]: 3 , delta: 0 events
> counter[4 branch-instructions ]: 799422166 , delta: 66551572 events
> counter[5 branch-misses ]: 7286 , delta: 775 events

And this tells me what? I can't relate any of these measurements to
any others, because I don't know how many cycles or instructions or
milliseconds each of these counts relates to, and I don't know which
counts were taken at the same time as which other counts.

Your abstraction hides all the details of what is being counted with
which counter over what period of time, and that is absolutely crucial
information for any serious analysis of the numbers.

> All we need to get that array of information from 6 sw counters is a
> _single_ hardware counter. I'm not sure where you read "you must map sw
> counters to hw counters directly" or "hw counters must be independent of
> each other" into our design - it's not part of it, emphatically.

I'm not sure those quoted statements are exactly what I said, but
whatever.

Your API has as its central abstraction the "counter". I am saying
that that is the wrong abstraction. The abstraction really needs to
be a set of counters that are all active over precisely the same
interval, so that their values can be meaningfully compared and
related to each other.

> And i dont see your (fully correct!) statement above about counter
> constraints to be in any sort of conflict with what we are doing.
>
> Intel hardware is just as constrained as powerpc hardware: there are
> counter inter-dependencies and many CPUs have just two performance
> counters. We very much took this into account while designing this code.

Well, here's my reasoning.

* Your perf_counter_open call takes the event type but doesn't have
any way to select a particular hardware counter (deliberately, since
your API is trying to present some common-denominator abstraction of
the individual counters).

* On powerpc, the event selector value to count a particular event is
different for each counter, and may even depend on what's being
counted on other counters.

* That means that we can't meaningfully pass raw (negative) event
selector values, since what any particular value means depends on
which hardware counter we get to use, and we don't know that (and in
fact it may change from time to time).

* In other words, the kernel will have to know the mapping from
abstract event types to event selector values for each counter for
each supported CPU type.

Now, the tables in perfmon's user-land libpfm that describe the
mapping from abstract events to event-selector values and the
constraints on what events can be counted together come to nearly
29,000 lines of code just for the IBM 64-bit powerpc processors.

Your API condemns us to adding all that bloat to the kernel, plus the
code to use those tables.

Furthermore, since your generic code doesn't know anything about the
constraints and thinks it can just add any counter to any task at any
time (subject only to a maximum number of counters in use), we'll
potentially have to work out event selector values at latency-critical
times such as context switches and interrupts.
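
[ For illustration only - made-up table contents, not libpfm data: the
  kind of per-counter selector table and assignment pass that would
  have to live in the kernel, and potentially run at context-switch
  time, if the kernel owns the event -> counter mapping. A real solver
  also has to honour cross-counter constraints, which this greedy
  sketch ignores. ]

#define NR_HW_COUNTERS  6
#define NR_EVENTS       4

/* selector value to program event e on hw counter c; -1 = not possible */
static const int selector[NR_EVENTS][NR_HW_COUNTERS] = {
        /* cycles        */ { 0x1e, 0x1e, 0x1e, 0x1e, 0x1e, 0x1e },
        /* instructions  */ {   -1, 0x02,   -1, 0x02,   -1,   -1 },
        /* cache-refs    */ { 0x30,   -1,   -1,   -1, 0x32,   -1 },
        /* cache-misses  */ {   -1,   -1, 0x3e,   -1,   -1, 0x3f },
};

static int assign_events(const int *events, int n, int *chosen_counter)
{
        int used[NR_HW_COUNTERS] = { 0 };
        int i, c;

        for (i = 0; i < n; i++) {
                for (c = 0; c < NR_HW_COUNTERS; c++) {
                        if (!used[c] && selector[events[i]][c] >= 0) {
                                used[c] = 1;
                                chosen_counter[i] = c;
                                break;
                        }
                }
                if (c == NR_HW_COUNTERS)
                        return -1;      /* no feasible mapping found */
        }
        return 0;
}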

> [ Obviously, you _can_ do higher quality profiling if you have more
> hardware resources that help it. Nothing will change that fact. ]
>
> > Rather, what the hardware provides is ONE performance monitor unit,
> > which the OS can context-switch between tasks. The performance monitor
> > unit has several counters that can be assigned (within limits) to count
> > various aspects of the performance of the code being executed. That is
> > why, for instance, if you ask for the counters to be frozen when one of
> > them overflows, they all get frozen at that point.
>
> i dont see this as an issue at all - it's a feature of powerpc over x86
> that the core perfcounter code can support just fine. The overflow IRQ
> handler is arch specific. The overflow IRQ handler, if it triggers,
> updates the sw counters, creates any event records if needed, wakes up
> the monitor task if needed, and continues the task and performance
> measurement without having scheduled out. Demultiplexing of hw counters
> is arch-specific.

The ability to create event records in a ring buffer is certainly
nice. I have no problem with that part of your proposal, particularly
if we can optionally record things like a timestamp, task registers,
stacktrace, etc. at the same time, as you have suggested.

My point is that the monitoring task wants to be able to control which
things get measured simultaneously. The kernel shouldn't be deciding
how the set of software counters gets multiplexed onto the hardware
counters - the monitoring task needs to be able to control that in
order to get meaningful results.

There are three other problems that I see with your API (these are
probably fixable):

1. I don't see any way to control whether I'm counting events in user
mode, kernel mode, hypervisor mode, or some combination. That is
needed for some types of performance analysis.

2. If I'm counting events for all tasks, I want to be able to exclude
the idle task, optionally. I don't see a way to do that.

3. If I have a counter in PERF_RECORD_IRQ mode, I have no way to read
its actual value, which I would want to do (for instance, when some
other counter overflows, or when the task exits).

Paul.

2008-12-06 01:25:27

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Paul Mackerras writes:
> Furthermore, since your generic code doesn't know anything about the
> constraints and thinks it can just add any counter to any task at any
> time

This observation alone makes this proposal a non-starter.
Counters are not independent. Even on x86. Never have been.

If you want to fix something, here's one:
- Make the decision whether to schedule task t on processor p a
function of what other set of tasks T are currently on processor p.

The issue is that some performance counter events aren't thread
local, e.g. Nehalem uncore stuff and similar HW crap in AMD
northbridge events and everything P4. So while one task t1
is running it's also reserving off-thread resources R, making those
resources unavailable for other tasks T.

(If you want a simpler metaphor, imagine a multi-threaded or multi-core
processor package having only a single floating-point unit. How would
you handle that in the scheduler? There are performance counter events
from both Intel and AMD that pose the same challenge.)

I "solved" that in perfctr for P4 by enforcing affinity constraints,
but surely the scheduler could be smarter?

2008-12-06 02:36:51

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Hello,

I have been reading all the threads after this unexpected announcement
of a competing proposal for an interface to access the performance counters.
I would like to respond to some of the things I have seen.

* ptrace: as Paul just pointed out, ptrace() is a limitation of the
current perfmon implementation. This is not a limitation of the
interface as has been insinuated earlier. In my mind, this does
not justify starting from scratch. There is nothing that precludes
removing ptrace and using the IPI to chase down the PMU state,
like you are doing. And in fact I believe we can do it more efficiently
because we would potentially collect multiple values in one IPI,
something your API cannot allow because it is single event oriented.

* There is more to perfmon than what you have looked at on LKML. There
is advanced sampling support with a kernel level buffer which is remapped
to user space. So there is no such thing as a couple of ptrace() calls per
sample. In fact, there is zero copy export to user space. In the case of
PEBS, there is even zero-copy from HW to user space.

* The proposed API exposes events as individual entities. To measure N
events, you need N file descriptors. There is no coordination of actions
between the various events. If you want to start/stop all events, it seems
you have to close the file descriptors and start over. That is not how
people use this, especially people doing self monitoring. They want to
start/stop around critical loops or functions and they want this to be
fast.

* To read N events you need N syscalls and potentially N IPIs. There
is no guarantee of atomicity between the reads. The argument of raising
the priority to prevent preemption is bogus and unrealistic. We want regular
users to be able to measure their own applications without having to have
special privileges. This is especially impractical when you want to read from
another thread. It is important to get a view of the counters that is as
consistent as possible, and for that you want to read the registers as
closely together as possible.

* As mentioned by Paul and Corey, the API inevitably forces the kernel to
know about ALL the events and how they map onto counters. People who have
been doing this in userland, and I am one of them, can tell you that this
is a very hard problem. Looking at it just on the Intel and AMD x86 is
misleading. It is not the number of events that matters, even if it
contributes to the kernel bloat, it is managing the constraints between
events (event A and B cannot be measured together; if event A uses counter
X then B cannot be measured on counter Y). Sometimes, the value of a
config register depends on which register you load it on. With the
proposed API, all this complexity would have to go in the kernel. I don't
think it belongs there and it will lead to maintenance problems, and
longer delays to enable support of new hardware. The argument for doing
this was that it would facilitate writing tools. But all that complexity
does not belong in the tools but in a user library. This is what libpfm
is designed for and it has worked nicely so far. The role of the kernel
is to control access to the PMU resource and to make sure incorrect
programming of the registers cannot crash the kernel. If you do this, then
providing support for new hardware is for the most part simply exposing
the registers. Something which can even be discovered automatically on
newer processors, e.g., ones supporting Intel architectural perfmon.

* Tools usually manage monitoring as a session. There was criticism
about the perfmon context abstraction and vectors. A context is merely
a synonym for session. I believe having a file descriptor per session is
a natural thing to have. Vectors are used to access multiple registers in
one syscall. Vectors have variable sizes; it depends on what you want to
access. The size is not mandated by the number of registers of the
underlying hardware.

* As mentioned by Paul, with certain PMUs, it is not possible to solve
the event -> counter problem without having a global view
of all the events. Your API being single-event oriented, it is not
clear to me how this can be solved.

* Running a per-thread session should not mean you are limited to
measuring at priv level 3.

* Modern PMUs, including AMD Barcelona and Itanium2, expose more than
counters. Any API that assumes PMUs export only counters is going to be
limited, e.g. Oprofile. Perfmon does not make that mistake; the interface
does not know anything about counters nor sampling periods. It sees
registers with values you can read or write. That has allowed us to
support advanced features such as the Itanium2 Opcode filter, Itanium2
Code/Data range restrictions (hosted in debug regs), AMD Barcelona IBS
which has no event associated with it, the Itanium2 BranchTraceBuffer,
Intel Core 2 LBR, and the Intel Core i7 uncore PMU. Some of those
features have no ties with counters; they do not even overflow (e.g.,
LBR). They must be used in combination with counters, e.g., the LBRs. I
don't think you will be able to do this with your API.

* With regards to sampling, advanced users have long been collecting
more than just the IP. They want to collect the values of other
PMU registers or even values of other non-PMU resources. With your
API, it seems for every new need, you'd have to create a new
perf_record_type, which translates into a kernel patch. This is not
what people want. With perfmon, you have a choice of doing user
level sampling (users gets notification for each sample) but you can
also use a kernel sampling buffer. In that case, you can express
what you want recorded in the buffer using simple bitmasks of PMU
registers. There is no predefined set, no kernel patch.
To make this even more flexible the buffer format is not part of the
interface, you can define your own and record whatever you want
in whatever format you want. All is provided by kernel modules. You
want double-buffer, cyclic buffer, just add your kernel module. It
seems this feature has been overlooked by LKML reviewers but it is
really powerful.

* It is not clear to me how you would add a sampling buffer and
remapping using your API given the number of file descriptors you will
end up using and the fact that you do not have the notion of a session.

* When sampling, you want to freeze the counters on overflow to get as
consistent a view as possible. There is no such guarantee in your API or
implementation. On some hardware platforms, e.g., Itanium, you have no
choice: this is the behavior.

* Multiple counters can overflow at the same time and generate a
single interrupt. With your approach, if two counters overflow
simultaneously, then you need to enqueue two messages, yet only
one SIGIO will be generated, it seems. I wonder how that works when
self-monitoring.


In summary, although the idea of simplifying tools by moving the
complexity elsewhere is legitimate, pushing it down to the kernel is the
wrong approach in my opinion; perfmon has avoided that as much as
possible for good reasons. We have shown, with libpfm, that a large part
of the complexity can easily be encapsulated into a user library. I also
don't think the approach of managing events independently of each other
works for all processors. As pointed out by others, there are other
factors at stake and they may not even be on the same core.

S. Eranian

2008-12-06 12:35:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Sat, 2008-12-06 at 11:05 +1100, Paul Mackerras wrote:
> Now, the tables in perfmon's user-land libpfm that describe the
> mapping from abstract events to event-selector values and the
> constraints on what events can be counted together come to nearly
> 29,000 lines of code just for the IBM 64-bit powerpc processors.
>
> Your API condemns us to adding all that bloat to the kernel, plus the
> code to use those tables.

Since you need those tables and that code anyway, and in a solid
reliable way, what is the objection of carrying it in the kernel?

Furthermore, is there a good technical reason these cpus are so
complicated to use?

2008-12-07 05:15:26

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Peter Zijlstra writes:

> On Sat, 2008-12-06 at 11:05 +1100, Paul Mackerras wrote:
> > Now, the tables in perfmon's user-land libpfm that describe the
> > mapping from abstract events to event-selector values and the
> > constraints on what events can be counted together come to nearly
> > 29,000 lines of code just for the IBM 64-bit powerpc processors.
> >
> > Your API condemns us to adding all that bloat to the kernel, plus the
> > code to use those tables.
>
> Since you need those tables and that code anyway, and in a solid
> reliable way, what is the objection of carrying it in the kernel?

Because it's about 320kB of unpageable kernel memory, and it doesn't
need to be in the kernel.

The fundamental problem with Ingo and Thomas's proposal is that the
abstraction is at the wrong level. It makes individual counters the
central idea, when the central idea should be a set of counters that
all start and stop counting at the same times. People doing
performance analysis want to be able to compare counts of different
events and get ratios, and you can't do that meaningfully if the
counts correspond to different stretches of code.

Once you make the abstraction a set of counters, then you also make it
possible to have a counter-set that is the whole PMU. Then you don't
have to have the kernel knowing all the possible settings for the PMU;
it only needs to know the simple ones, and if you want to do something
more sophisticated, you can have userspace specifying the bits to
select the more sophisticated setting.
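
[ For illustration, a hypothetical sketch of what a counter-set
  oriented call along these lines could look like - none of these
  names exist in either patchset: ]

/* all events in the set start, stop and are read over exactly the
   same interval; in raw mode user space supplies the full PMU setup */
struct perf_counter_set {
        int                     nr_events;
        int                     raw;            /* events[] are raw PMU selector values */
        unsigned long long      events[8];      /* abstract event ids or raw values */
};

/* would return a single fd; one read() returns nr_events values */
extern int perf_counter_set_open(struct perf_counter_set *set,
                                 int pid, int cpu);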

> Furthermore, is there a good technical reason these cpus are so
> complicated to use?

That question is a bit ambiguous. If you mean, why did the hardware
designers make it so complex? then I don't really know, but it doesn't
matter because the CPU hardware is what it is. At best I might be
able to influence future designs to be a bit simpler.

If you mean, could the software description of the hardware be
simpler? then maybe - I'm just reading up on the details of the
hardware, and it is pretty complex, with multiple layers of
multiplexers and event buses.

Paul.

2008-12-08 02:12:46

by Dan Terpstra

[permalink] [raw]
Subject: RE: [perfmon2] [patch 0/3] [Announcement] Performance Counters forLinux

I'm reminded of the quote attributed to Einstein: "Make things as simple as
possible, but no simpler".
In that regard, it appears that Stephane's perfmon is closer to the mark
than this proposal.
If Stephane's observations below are even close to correct, it would make
PAPI's first-person event-set caliper model essentially useless. We must be
able to start and stop multiple counter values simultaneously and quickly to
infer any validity even for derived measurements as simple as
instructions-per-cycle.
dan terpstra
for the PAPI team


> -----Original Message-----
> From: stephane eranian [mailto:[email protected]]
> Sent: Friday, December 05, 2008 9:37 PM
> To: Thomas Gleixner
> Cc: [email protected]; Peter Zijlstra; David Miller; LKML; Steven
> Rostedt; Eric Dumazet; Paul Mackerras; Peter Anvin; Andrew Morton; Ingo
> Molnar; perfmon2-devel; Arjan van de Veen
> Subject: Re: [perfmon2] [patch 0/3] [Announcement] Performance Counters
> forLinux
>
> Hello,
>
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance
> counters.
> I would like to respond to some of the things I have seen.
>
> * ptrace: as Paul just pointed out, ptrace() is a limitation of the
> current perfmon implementation. This is not a limitation of the
> interface as has been insinuated earlier. In my mind, this does
> not justify starting from scratch. There is nothing that precludes
> removing ptrace and using the IPI to chase down the PMU state,
> like you are doing. And in fact I believe we can do it more efficiently
> because we would potentially collect multiple values in one IPI,
> something your API cannot allow because it is single event oriented.
>
> * There is more to perfmon than what you have looked at on LKML. There
> is advanced sampling support with a kernel level buffer which is
> remapped
> to user space. So there is no such thing as a couple of ptrace() calls
> per
> sample. In fact, there is zero copy export to user space. In the
> case of PEBS,
> there is even zero-copy from HW to user space.
>
> * The proposed API exposes events as individual entities. To measure N
> events, you need N file descriptors. There is no coordination of
> actions
> between the various events. If you want to start/stop all events, it
> seems
> you have to close the file descriptors and start over. That is not
> how people
> use this, especially people doing self monitoring. They want to
> start/stop
> around critical loops or functions and they want this to be fast.
>
> * To read N events you need N syscalls and potentially N IPIs. There
> is no guarantee of atomicity between the reads. The argument of raising
> the priority to prevent preemption is bogus and unrealistic. We want
> regular
> users to be able to measure their own applications without having to
> have
> special privileges. This is especially unpractical when you want to
> read from
> another thread. It is important to get a view of the counters that
> is as consistent
> as possible and for that you want to read the registers are closely
> as possible
> from each other.
>
> * As mentioned by Paul, Corey, the API inevitably forces the kernel to
> know about
> ALL the events and how they map onto counters. People who have been
> doing this
> in userland, and I am one of them, can tell you that this is a very
> hard problem.
> Looking at it just on the Intel and AMD x86 is misleading. It is not
> the number of
> events that matters, even it contributes to the kernel bloat, it is
> managing the constraints
> between events (event A and B cannot be measured together, if event
> A uses counter X
> then B cannot be measured on counter Y). Sometimes, the value of a
> config register depends
> on which register you load it on. With the proposed API, all this
> complexity would have to go in
> the kernel. I don't think it belongs here and it will leads to
> maintenance problems, and longer
> delays to enable support of new hardware. The argument for doing
> this was that it would
> facilitate writing tools. But all that complexity does not belong in
> the tools but in a user library.
> This is what libpfm is designed for and it has worked nicely so far.
> The role of the kernel
> is to control access to the PMU resource and to make sure incorrect
> programming of the registers
> cannot crash the kernel. If you do this, then providing support for
> new hardware is for the most part
> simply exposing the registers. Something which can even be
> discovered automatically on newer
> processors, e.g., ones supporting Intel architectural perfmon.
>
> * Tools usually manage monitoring as a session. There was criticism
> about the perfmon context abstraction and vectors. A context is merely
> a synonym for session. I believe having a file descriptor per session
> is
> a natural thing to have. Vectors are used to access multiple registers
> in
> one syscall. Vector have variable sizes, it depends on what you want to
> access. The size is not mandated by the number of registers of the
> underlying hardware.
>
> * As mentioned by Paul, with certain PMUs, it is not possible to solve
> the event -> counter problem without having a global view
> of all the events. Your API being single-event oriented, it is not
> clear to me how this can be solved.
>
> * It is not because you run a per thread session, that you should be
> limited to measuring at priv level 3.
>
> * Modern PMU, including AMD Barcelona. Itanium2, expose more than
> counters. Any API than assumes PMU export only
> counters is going to be limited, e.g. Oprofile. Perfmon does not
> make that mistake, the interface does not know anything
> about counters nor sampling periods. It sees registers with values
> you can read or write. That has allowed us to support
> advanced features such as Itanium2 Opcode filter, Itanium2
> Code/Data range restrictions (hosted in debug regs), AMD
> Barcelona IBS which has no event associated with it, Itanium2
> BranchTraceBuffer, Intel Core 2 LBR, Intel Core i7 uncore PMU.
> Some of those features have no ties with counters, they do not even
> overflow (e.g., LBR). They must be used in combination with
> counters, e.g., LBRs. I don't think you will be able to do this
> with your API.
>
> * With regards to sampling, advanced users have long been collecting
> more than just the IP. They want to collect the values of other
> PMU registers or even values of other non-PMU resources. With your
> API, it seems for every new need, you'd have to create a new
> perf_record_type, which translates into a kernel patch. This is not
> what people want. With perfmon, you have a choice of doing user
> level sampling (users gets notification for each sample) but you can
> also use a kernel sampling buffer. In that case, you can express
> what you want recorded in the buffer using simple bitmasks of PMU
> registers. There is no predefined set, no kernel patch.
> To make this even more flexible the buffer format is not part of the
> interface, you can define your own and record whatever you want
> in whatever format you want. All is provided by kernel modules. You
> want double-buffer, cyclic buffer, just add your kernel module. It
> seems this feature has been overlooked by LKML reviewers but it is
> really powerful.
>
> * It is not clear to me how you would add a sampling buffer and
> remapping using your API given the number of file descriptors you will
> end up using and the fact that you do not have the notion of a session.
>
> * When sampling, you want to freeze the counters on overflow to get an
> as consistent as possible view. There is no such guarantee in
> your API nor implementation. On some hardware platforms, e.g.,
> Itanium, you have no choice this is the behavior.
>
> * Multiple counters can overflow at the same time and generate a
> single interrupt. With your approach, if two counters overflow
> simultaneously, then you need to enqueue two messages, yet only
> one SIGIO wil be generated, it seems. Wonder how that works when
> self-monitoring.
>
>
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion, perfmon has avoided that as much
> as possible for good reasons. We have shown , with libpfm,
> that a large part of complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each others works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
>
> S. Eranian
>

2008-12-08 07:19:08

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Hi,

On Sun, Dec 7, 2008 at 6:15 AM, Paul Mackerras <[email protected]> wrote:
> Peter Zijlstra writes:
>
>> On Sat, 2008-12-06 at 11:05 +1100, Paul Mackerras wrote:
>> > Now, the tables in perfmon's user-land libpfm that describe the
>> > mapping from abstract events to event-selector values and the
>> > constraints on what events can be counted together come to nearly
>> > 29,000 lines of code just for the IBM 64-bit powerpc processors.
>> >
>> > Your API condemns us to adding all that bloat to the kernel, plus the
>> > code to use those tables.
>>
>> Since you need those tables and that code anyway, and in a solid
>> reliable way, what is the objection of carrying it in the kernel?
>
> Because it's about 320kB of unpageable kernel memory, and it doesn't
> need to be in the kernel.
>

That inevitably pulls in large amounts of data: the event table for each
PMU model and the description of the constraints between events. New
processors have hundreds of events. Moreover, there is the complexity of
the assignment algorithm to map the events to counters such that they
actually measure what you've asked for. I described some of those
constraints in my previous message. They are not trivial and are
oftentimes multi-dimensional. Getting the algorithms right is difficult.
Event tables are also oftentimes incomplete or bogus when first published
by HW vendors.

It does not make sense to have this kind of data + code in the kernel. It would
make developing them much more difficult. Maintenance would also be more
difficult. And clearly you don't want to have to re-run the assignment routine
each time you context switch.

The kernel is not the only place for rock-solid code. You can have solid/stable
code in libraries as well.

> The fundamental problem with Ingo and Thomas's proposal is that the
> abstraction is at the wrong level. It makes individual counters the
> central idea, when the central idea should be a set of counters that
> all start and stop counting at the same times. People doing
> performance analysis want to be able to compare counts of different
> events and get ratios, and you can't do that meaningfully if the
> counts correspond to different stretches of code.
>
> Once you make the abstraction a set of counters, then you also make it
> possible to have a counter-set that is the whole PMU. Then you don't
> have to have the kernel knowing all the possible settings for the PMU;
> it only needs to know the simple ones, and if you want to do something
> more sophisticated, you can have userspace specifying the bits to
> select the more sophisticated setting.
>
>> Furthermore, is there a good technical reason these cpus are so
>> complicated to use?
>
> That question is a bit ambiguous. If you mean, why did the hardware
> designers make it so complex? then I don't really know, but it doesn't
> matter because the CPU hardware is what it is. At best I might be
> able to influence future designs to be a bit simpler.
>

Let me explain the HW complexity a bit. It's all a matter of tradeoffs.
I have regular discussions with the PMU design architects about this.
If you talk to them, then you understand the environment they have to
live in and you understand why those constraints are there. The key point
to understand is that the PMU is never critical to the chip. The chip can work
well without. The real-estate on the chip is always very tight. PMU is a 2nd
class citizen, thus low in the priority list. For certain PMU features the
tradeoff is: do you want the feature with constraints on programming, or no
feature at all? The common HW limitation is wires. For instance, I was once
told: would you rather have a PMU cache event that can be programmed on any
counter but with an increased cache latency for all accesses, or a faster
cache and a constraint on the event? The response is obvious.

I think you now understand why there are constraints and also why they will
never go away, quite the contrary. I'd rather have a PMU with constraints
than no PMU. Hardware designers make a lot of effort to give us what we have
today already and we should be thankful.

> If you mean, could the software description of the hardware be
> simpler? then maybe - I'm just reading up on the details of the
> hardware, and it is pretty complex, with multiple layers of
> multiplexers and event buses.
>

2008-12-08 11:12:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux


* stephane eranian <[email protected]> wrote:

> Let me explain the HW complexity a bit. It's all a matter of tradeoffs.
> I have regular discussions with the PMU design architects about this.
> If you talk to them, then you understand the environment they have to
> live in and you understand why those constraints are there. The key
> point to understand is that the PMU is never critical to the chip. The
> chip can work well without. The real-estate on the chip is always very
> tight. PMU is a 2nd class citizen, thus low in the priority list. [...]

The chip designers i talk to with my scheduler maintainer hat on do point
out that performance monitoring is (of course) in the critical path of
any chip, and hence its overhead and impact on the gate count of various
critical components of the CPU core and its impact on the power envelope
must be kept very low.

Nevertheless, the same chip designers rely on performance counters on a
daily basis to plan their next-gen chip. They very much want them to work
fine, and they work hard on making them relevant and easy to use. Often
the performance counters are the _only_ real cheap hands-on insight into
the dynamic situation of a modern CPU core, even for hw designers.

And all the current hw trends show that it's not just talk but action as
well: the Core2 PMCs are already much saner (less constrained) than the
P4 ones, and now they even expanded on them: Nehalem / Core i7 doubled
the number of generic PMCs from two to four.

So, contrary to your suggestion, chip designers very much care about
performance counters and they are working very hard to make this stuff
useful to us. [ Yes, there are constraints even with generic counters
(for example you only want a single line towards a PMC register from
divider units), but the number of cross-counter constraints and their
relevance is decreasing, not increasing. ]

Anyway ... i think your reply highlights why the fundamental premise of
your patchset is so wrong: i believe you have designed your code and APIs
at the wrong level by (paradoxically) assuming in essence that
performance counters do not matter in the general scheme of things. (!)

So you introduced limited, special-purpose but still quite complex APIs
that tailored the ABIs to intricate low level details of PMUs. I see an
explosion in complexity due to that incorrect design choice: too many
syscalls, too broad interaction between core code and architecture code,
and too little practical utility in the end.

We did what we believe to be the right thing: we gave performance
counters the proper high-level abstraction they _deserve_, and we made
performance counters a prime-time Linux citizen as well.

Ingo

2008-12-08 11:59:46

by David Miller

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

From: Ingo Molnar <[email protected]>
Date: Mon, 8 Dec 2008 12:11:53 +0100

> We did what we believe to be the right thing: we gave performance
> counters the proper high-level abstraction they _deserve_, and we made
> performance counters a prime-time Linux citizen as well.

Separate counters that are read independently are fundamentally wrong,
no matter how many times you try to say they aren't. In fact it has
been shown (repeatedly) that this abstraction is at the wrong level.

People want to correlate, and it's not possible to do that if the
counters are sampled separately.

We also don't want half-megabyte PMU tables in the kernel, nor the
complex logic about how PMU counter X can be configured when counter Y is
configured for event A. All of that belongs in userspace.

We also want to support PMUs that do not generate an overflow
interrupt.

Really, all of the backlash these new patches have received is not
about how clean the abstraction is, but rather whether it can even
do the job properly.

And also, another part of the backlash is that the poor perfmon3
person was completely blindsided by this new stuff. Which to be
honest was pretty unfair. He might have had great ideas about
the requirements (even if you don't give a crap about his approach
to achieving those requirements) and thus could have helped avoid
the past few days of churn.

2008-12-09 00:22:02

by Stephane Eranian

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Mon, Dec 8, 2008 at 12:11 PM, Ingo Molnar <[email protected]> wrote:
>
> * stephane eranian <[email protected]> wrote:
>
>> Let me explain the HW complexity a bit. It's all a matter of tradeoffs.
>> I have regular discussions with the PMU design architects about this.
>> If you talk to them, then you understand the environment they have to
>> live in and you understand why those constraints are there. The key
>> point to understand is that the PMU is never critical to the chip. The
>> chip can work well without. The real-estate on the chip is always very
>> tight. PMU is a 2nd class citizen, thus low in the priority list. [...]
>
> The chip designers i talk to with my scheduler maintainer hat on do point
> out that performance monitoring is (of course) in the critical path of
> any chip, and hence its overhead and impact on the gate count of various
> critical components of the CPU core and its impact on the power envelope
> must be kept very low.
>

You have a talent for turning people's arguments into something else.

You dropped my example about the wire limitation. It was describing my point
about constraints and the PMU as a 2nd class citizen. I'd rather have a new
constrained PMU feature than no new feature at all. You also seem to limit
your world to x86; you have to look beyond it, to Itanium and Power, for
instance.

I know quite well that the PMU is used for debugging internally and early on,
so don't lecture me on this! I have participated in the architectural design of
some.

> Nevertheless, the same chip designers rely on performance counters on a
> daily basis to plan their next-gen chip. They very much want them to work
> fine, and they work hard on making them relevant and easy to use. Often
> the performance counters are the _only_ real cheap hands-on insight into
> the dynamic situation of a modern CPU core, even for hw designers.
>
As if I did not know that.

> And all the current hw trends show that it's not just talk but action as
> well: the Core2 PMCs are already much saner (less constrained) than the
> P4 ones, and now they even expanded on them: Nehalem / Core i7 doubled
> the number of generic PMCs from two to four.
>

You think I am not aware of that? I know it quite well because I talk to the
PMU architects on a regular basis, trying to get them to add new features and
make the PMU easier to manage. And I make sure I broaden my horizons
beyond x86.

And yes, the PMU is becoming more and more critical and a true value-add.
That's good for end-users as long as the new features can be exposed.

> So, contrary to your suggestion, chip designers very much care about

You did not get my point, but I am not surprised...

> performance counters and they are working very hard to make this stuff
> useful to us. [ Yes, there are constraints even with generic counters
> (for example you only want a single line towards a PMC register from
> divider units), but the number of cross-counter constraints and their
> relevance is decreasing, not increasing. ]
>
> Anyway ... i think your reply highlights why the fundamental premise of
> your patchset is so wrong: i believe you have designed your code and APIs
> at the wrong level by (paradoxically) assuming in essence that
> performance counters do not matter in the general scheme of things. (!)
>
> So you introduced limited, special-purpose but still quite complex APIs

That's not a valid argument! Perfmon, unlike any other existing API, has
exposed all the advanced features of all existing PMU models, across
multiple architectures.

> that tailored the ABIs to intricate low level details of PMUs. I see an
> explosion in complexity due to that incorrect design choice: too many

Your current API does not offer access to any of the advanced features of
x86, like PEBS, IBS, LBR and others, let alone those of the other
architectures. So again your arguments are unfounded.

> syscalls, too broad interaction between core code and architecture code,
> and too little practical utility in the end.
>

I think the number of syscalls is irrelevant; that's not how I measure
the usefulness of an API. What matters is the functionality. Any
performance monitoring API should be able to (see the rough sketch below):
- create a session
- program the registers
- start and stop on demand, as many times as you want
- attach to a thread or CPU
- read the register values
- provide advanced support for event-based sampling
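
A rough sketch of that lifecycle in C-like code (the session_* helpers and
event names below are purely illustrative placeholders, not actual perfmon
or proposed syscalls):

int fd;
u64 counts[2];

fd = session_create();                      /* create a session            */
session_program(fd, 0, EVENT_CYCLES);       /* program the registers       */
session_program(fd, 1, EVENT_INSTRUCTIONS);
session_attach(fd, getpid());               /* attach to a thread (or CPU) */

session_start(fd);                          /* start on demand ...         */
run_workload();
session_stop(fd);                           /* ... and stop, repeatedly    */

session_read(fd, counts, 2);                /* read the register values    */

/* event-based sampling would additionally set a sample period and a
 * buffer to collect samples into */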

> We did what we believe to be the right thing: we gave performance
> counters the proper high-level abstraction they _deserve_, and we made
> performance counters a prime-time Linux citizen as well.
>
You have no validation to prove you chose the right level.

As if the perfmon project did not put the PMU at the forefront.
Who is going to buy that?

2008-12-10 03:50:54

by Paul Mundt

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Fri, Dec 05, 2008 at 12:08:14PM -0800, David Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Fri, 05 Dec 2008 13:39:43 +0100
>
> > Given these are more obscure features, but not being able to fit
> > them easily into your model from the start isn't a very promising sign
> > for the long term extensibility of the design.
>
> Another thing I'm interested in is if this new stuff will work with
> performance counters that lack an overflow interrupt.
>
> We have several chips that are like this, and perfmon supported that
> on the kernel side, and also provided overflow emulation for such
> hardware in userspace (where such complexity belongs).

There doesn't seem to have been any reply to this point unfortunately,
and it is something I am also wondering about.

The sh perf counters were not designed with overflowing in mind; they are
split into a pair of 48-bit or 64-bit counters that simply keep running.
Any write simply clears the value and the counter starts over. They are
counters only, and generate no events whatsoever.

Oprofile has been a pretty bad fit for them, and while I'm slightly more
optimistic about perfmon, I'm rather less enthusiastic about yet another
performance counter implementation that I am unable to make any use of.

2008-12-10 04:45:02

by Paul Mackerras

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Paul Mundt writes:

> On Fri, Dec 05, 2008 at 12:08:14PM -0800, David Miller wrote:
> > From: Andi Kleen <[email protected]>
> > Date: Fri, 05 Dec 2008 13:39:43 +0100
> >
> > > Given these are more obscure features, but not being able to fit
> > > them easily into your model from the start isn't a very promising sign
> > > for the long term extensibility of the design.
> >
> > Another thing I'm interested in is if this new stuff will work with
> > performance counters that lack an overflow interrupt.
> >
> > We have several chips that are like this, and perfmon supported that
> > on the kernel side, and also provided overflow emulation for such
> > hardware in userspace (where such complexity belongs).
>
> There doesn't seem to have been any reply to this point unfortunately,
> and it is something I am also wondering about.
>
> The sh perf counters were not designed with overflowing in mind; they are
> split into a pair of 48-bit or 64-bit counters that simply keep running.
> Any write simply clears the value and the counter starts over. They are
> counters only, and generate no events whatsoever.
>
> Oprofile has been a pretty bad fit for them, and while I'm slightly more
> optimistic about perfmon, I'm rather less enthusiastic about yet another
> performance counter implementation that I am unable to make any use of.

This is the sampling vs. counting distinction again, and it sounds
like these counters were designed for counting but not sampling. If
Ingo and Thomas extend their infrastructure to provide good support
for counting as well as sampling, then you should hopefully be able to
use them for counting, at least.

On POWER6 we have a somewhat similar situation with two out of the six
available counters. These two counters are fixed function (they
always count cycles and instructions completed) and don't generate
interrupts. Furthermore, they are only 32 bits wide. So I definitely
agree we need support for counters that don't interrupt.

Paul.

2008-12-10 08:44:57

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Paul Mundt writes:
> The sh perf counters were not designed with overflowing in mind; they are
> split into a pair of 48-bit or 64-bit counters that simply keep running.
> Any write simply clears the value and the counter starts over. They are
> counters only, and generate no events whatsoever.
>
> Oprofile has been a pretty bad fit for them, and while I'm slightly more
> optimistic about perfmon, I'm rather less enthusiastic about yet another
> performance counter implementation that I am unable to make any use of.

My 'perfctr' kernel extension has supported this type of hardware
since its beginning in 1999, simply because that's how a lot of hardware
worked at the time. Typical CPUs in that category include Intel P5s,
Intel P6s where the local APIC isn't available (some don't have one
in HW, many have it disabled by BIOS), 1st gen AMD K7, VIA C3, early
UltraSPARCs (not supported by perfctr but could be), and many G3/G4
type 32-bit PowerPCs where HW errata make the PMU overflow interrupt
facility useless or dangerous.

Plain event counting over a group of counters is a convenient way of
computing metrics for isolated blocks of code, such as CPI or branch
misses per instruction or per cycle, so I often use that even on CPUs
that do support overflow interrupts.
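
For instance, computing CPI for an isolated block of code needs nothing more
than a read of each counter before and after the block (sketch only;
read_cycles()/read_insns() are hypothetical stand-ins for whatever read path
the interface offers -- a read() on a counter fd, an mmap'ed page, RDPMC, ...):

#include <stdint.h>
#include <stdio.h>

extern uint64_t read_cycles(void);     /* hypothetical counter read helpers */
extern uint64_t read_insns(void);

void measure_block(void (*block)(void))
{
	uint64_t c0 = read_cycles();
	uint64_t i0 = read_insns();

	block();                       /* the isolated code block */

	uint64_t cycles = read_cycles() - c0;
	uint64_t insns  = read_insns() - i0;

	if (insns)
		printf("CPI = %.2f\n", (double)cycles / insns);
}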

2008-12-10 10:16:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

> Oprofile has been a pretty bad fit for them, and while I'm slightly more

You could always use an extension of timer mode that reads them
periodically?

-Andi
--
[email protected]

2008-12-10 10:25:52

by Paul Mundt

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Wed, Dec 10, 2008 at 11:28:19AM +0100, Andi Kleen wrote:
> > Oprofile has been a pretty bad fit for them, and while I'm slightly more
>
> You could always use an extension of timer mode that reads them
> periodically?
>
This is what I do today, but it is not an ideal solution. It would be
nice if these sorts of use cases could be supported by newer frameworks
without every platform with similar requirements having to implement
workarounds hanging off of the timer IRQ.

2008-12-10 10:51:49

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

On Wed, Dec 10, 2008 at 07:23:36PM +0900, Paul Mundt wrote:
> On Wed, Dec 10, 2008 at 11:28:19AM +0100, Andi Kleen wrote:
> > > Oprofile has been a pretty bad fit for them, and while I'm slightly more
> >
> > You could always use an extension of timer mode that reads them
> > periodically?
> >
> This is what I do today, but it is not an ideal solution. It would be
> nice if these sorts of use cases could be supported by newer frameworks
> without every platform with similar requirements having to implement
> workarounds hanging off of the timer IRQ.

But you shouldn't hang off the timer irq anyway; better to use a regular
timer or hrtimer. That would give more regular sampling even with dyntick.
And doing such a timer is only a few lines of code, so I'm not sure
generalizing it would buy you all that much.
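
Something like this untested sketch, for instance (sh_read_pmc() and
record_sample() are made-up stand-ins for the platform counter read and for
whatever consumes the deltas; the mask handles wraparound of a narrow
counter):

#include <linux/types.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>

#define SAMPLE_PERIOD_NS	NSEC_PER_MSEC	/* 1ms, pick to taste          */
#define PMC_WIDTH		48		/* width of the counter in use */

extern u64 sh_read_pmc(int idx);		/* hypothetical platform read  */
extern void record_sample(u64 delta);		/* hypothetical sample sink    */

static struct hrtimer sample_timer;
static u64 last_count;

static enum hrtimer_restart sample_fn(struct hrtimer *t)
{
	u64 now = sh_read_pmc(0);

	record_sample((now - last_count) & ((1ULL << PMC_WIDTH) - 1));
	last_count = now;

	hrtimer_forward_now(t, ktime_set(0, SAMPLE_PERIOD_NS));
	return HRTIMER_RESTART;
}

static void start_sampling(void)
{
	last_count = sh_read_pmc(0);
	hrtimer_init(&sample_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	sample_timer.function = sample_fn;
	hrtimer_start(&sample_timer, ktime_set(0, SAMPLE_PERIOD_NS),
		      HRTIMER_MODE_REL);
}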

-Andi

--
[email protected]

2008-12-10 16:41:46

by Rob Fowler

[permalink] [raw]
Subject: Re: [perfmon2] [patch 0/3] [Announcement] Performance Counters for Linux

My reaction is more from a downstream tool developer and end user perspective.

What I don't see in the new proposal is support for real end users of hardware
performance counter information. There is a long-existing community that is using the
counters, including the hardware designers, driver writers, tool developers, and
performance tuning specialists working for both vendors and end customers. Not
everyone is in the same camp, as the hardware capabilities change from revision to
revision of the chips as features are added, architectures evolve, and implementations are
cleaned up. System vendors have their own tools and developers (SpeedShop, Vtune, Tprof, Sun Studio
Code Analyst, etc). There are academic and open source efforts with long histories (PAPI,
oprofile, HPCToolkit (Rice, not IBM), etc). We've lived with proprietary drivers/APIs and with
a succession of open-source drivers (pci, perfctr, oprofile, perfmon). (My apologies to
readers/developers whose favorite tool(s) I haven't mentioned.) Out-and-out religious wars
have not erupted, but there are a lot of healthy disagreements. A significant part of this
community has been converging around Perfmon2/3, not because it is a thing of beauty, but
because it is a tool that exposes the full HPM capabilities (which are often ugly) in a useful
way for a community of tool developers and end users.

Before considering this new proposal seriously, I'd need to see it proven. This means
that it needs to be developed, by the proposers, enough to be used seriously. I've
got collaborators that measure compute resources in units of tens of TeraFLOP-years, so
my definition of "seriously" is that the HPM tool chain has to work with low overhead
on huge clusters of multi-core, multi-socket machines and it has to be able to provide
performance insights that will let us get even more performance out of applications
that already do pretty well. Google and other large users have similar notions of "serious".

Here's my set of strawman requirements:

-- Can it support a *completely* functional PAPI? There are a lot of tools (HPCToolkit,
TAU, etc.) built on this layer.

-- Does it provide means to support IBS/EBS profiling and to efficiently record execution
contexts? Can it support event-based call stack profiling?

-- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc) that
depend on it?

-- Kernel and daemon profiling capabilities?

-- Does it have sufficiently low overhead? Six years ago DCPI/ProfileMe was capable of
collecting around 5000 samples/second on a quad socket 1GHz Alpha EV67 system with
about a 1.5% overhead. That's the gold standard. Oprofile and pfmon are not far off
that mark.

-- Does it even scale within one box? My workhorse systems today are quad-socket Barcelonas.
I'm reliably using multiple cooperating instances of pfmon (some measure on-core events,
others off-core) to collect profiles using all 64 (4 per core x 16 cores) counters
productively with low overhead. Real soon now I will have similar expectations
regarding multi-socket Nehalems, where the resources will be 7 (heterogeneous) counters per
core plus 8 "uncore" counters (I prefer "nest", Alex Mericas' terminology) per socket.


Regards,
Rob


stephane eranian wrote:
> Hello,
>
> I have been reading all the threads after this unexpected announcement
> of a competing proposal for an interface to access the performance counters.
> I would like to respond to some of the things I have seen.
>

<<<<<< Details of Stephane's comment's elided >>>>>>

>
> In summary, although the idea of simplifying tools by moving the
> complexity elsewhere is legitimate, pushing it down to the kernel
> is the wrong approach in my opinion; perfmon has avoided that as much
> as possible for good reasons. We have shown, with libpfm,
> that a large part of the complexity can easily be encapsulated into a user
> library. I also don't think the approach of managing events
> independently of each other works for all processors. As pointed out
> by others, there are other factors at stake and they may not
> even be on the same core.
>
> S. Eranian

--
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
[email protected]

2008-12-10 17:10:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 0/3] [Announcement] Performance Counters for Linux

Rob Fowler <[email protected]> writes:
>
> -- Can it supplant or support oprofile by supporting the tools (Code Analyst, etc) that
> depend on it?

There's no need to supplant/support oprofile really, because at least
in the short term oprofile will not go away.

-Andi

--
[email protected]