Hello
As an aside, is it time to set up a dedicated Performance Counters
for Linux mailing list? (Hereafter referred to as p10c7l to avoid
confusion with the other implementations that have already taken
all the good abbreviated forms of the concept). If/when the
infrastructure appears in a released kernel, there's going to be a lot of
chatter by people who use performance counters and suddenly find they are
stuck with a huge step backwards in functionality. And asking Fortran
programmers to provide kernel patches probably won't be a productive
response. But I digress.
I was trying to get an exact retired instruction count from p10c7l.
I am using the test million.s, available here
( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
It should count exactly one million instructions.
Tests with valgrind and qemu show that it does.
Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
all give the proper result:
tobler:~% pfmon -e retired_instructions ./million
1000002 RETIRED_INSTRUCTIONS
( it is 1,000,002 +/- 2 because on most x86 architectures the retired
instruction count includes any hardware interrupts that might
happen at the time. It would be a great feature if p10c7l
could add some way of gathering the per-process hardware
interrupt count statistic to help quantify that).
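(For readers who can't fetch the file: below is a rough C-with-inline-asm
sketch of what such a counted loop looks like. It is only an approximation -
the real million.s is hand-written assembly with an exact, calibrated total,
whereas here the compiler's own prologue/epilogue adds a few extra
instructions on top of the ~1,000,000 retired by the loop itself.)

/* Rough sketch of a fixed-instruction-count microbenchmark: each
 * iteration retires exactly two instructions (dec + jnz), so the loop
 * body alone accounts for about one million retired instructions. */
int main(void)
{
        unsigned long iters = 500000;

        __asm__ volatile(
                "1:     dec %0\n\t"
                "       jnz 1b\n\t"
                : "+r" (iters));

        return 0;
}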
Yet perf on the same Athlon32 machine (using
kernel 2.6.30-03984-g45e3e19) gives:
tobler:~% perf stat ./million
Performance counter stats for './million':
1.519366 task-clock-ticks # 0.835 CPU utilization factor
3 context-switches # 0.002 M/sec
0 CPU-migrations # 0.000 M/sec
53 page-faults # 0.035 M/sec
2483822 cycles # 1634.775 M/sec
1240849 instructions # 816.689 M/sec # 0.500 per cycle
612685 cache-references # 403.250 M/sec
3564 cache-misses # 2.346 M/sec
Wall-clock time elapsed: 1.819226 msecs
Running multiple times gives:
1240849
1257312
1242313
That's an error of at least 20%, and it isn't even
consistent. Is this because of sampling? The documentation doesn't
really warn about this as far as I can tell.
Thanks for any help resolving this problem
Vince
* Vince Weaver <[email protected]> wrote:
> Hello
>
> As an aside, is it time to set up a dedicated Performance Counters
> for Linux mailing list? (Hereafter referred to as p10c7l to avoid
> confusion with the other implementations that have already taken
> all the good abbreviated forms of the concept).
('perfcounters' is the name of the subsystem/feature and it's
unique.)
> [...] If/when the infrastructure appears in a released kernel,
> there's going to be a lot of chatter by people who use performance
> counters and suddenly find they are stuck with a huge step
> backwards in functionality. And asking Fortran programmers to
> provide kernel patches probably won't be a productive response.
> But I digress.
>
> I was trying to get an exact retired instruction count from
> p10c7l. I am using the test million.s, available here
>
> ( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
>
> It should count exactly one million instructions.
>
> Tests with valgrind and qemu show that it does.
>
> Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
> all give the proper result:
>
> tobler:~% pfmon -e retired_instructions ./million
> 1000002 RETIRED_INSTRUCTIONS
>
> ( it is 1,000,002 +/- 2 because on most x86 architectures the retired
> instruction count includes any hardware interrupts that might
> happen at the time. It would be a great feature if p10c7l
> could add some way of gathering the per-process hardware
> interrupt count statistic to help quantify that).
>
> Yet perf on the same Athlon32 machine (using
> kernel 2.6.30-03984-g45e3e19) gives:
>
> tobler:~% perf stat ./million
>
> Performance counter stats for './million':
>
> 1.519366 task-clock-ticks # 0.835 CPU utilization factor
> 3 context-switches # 0.002 M/sec
> 0 CPU-migrations # 0.000 M/sec
> 53 page-faults # 0.035 M/sec
> 2483822 cycles # 1634.775 M/sec
> 1240849 instructions # 816.689 M/sec # 0.500 per cycle
> 612685 cache-references # 403.250 M/sec
> 3564 cache-misses # 2.346 M/sec
>
> Wall-clock time elapsed: 1.819226 msecs
>
> Running multiple times gives:
> 1240849
> 1257312
> 1242313
>
> That's an error of at least 20%, and it isn't even
> consistent. Is this because of sampling? The documentation
> doesn't really warn about this as far as I can tell.
>
> Thanks for any help resolving this problem
Thanks for the question! There are still gaps in the documentation, so
let me explain the basics here:
'perf stat' counts the true cost of executing the command in
question, including the costs of:
fork()ing the task
exec()-ing it
the ELF loader resolving dynamic symbols
the app hitting various pagefaults that instantiate its pagetables
etc.
Those operations are pretty 'noisy' on a typical CPU, with lots of
cache effects, so the noise you see is real.
You can eliminate much of the noise by only counting user-space
instructions, as much of the command startup cost is in
kernel-space.
Running your test app that way can be done like this:
$ perf stat --repeat 10 -e 0:1:u ./million
Performance counter stats for './million' (10 runs):
1002106 instructions ( +- 0.015% )
0.000599029 seconds time elapsed.
( note the --repeat feature of perf stat - it does a loop of command
executions and observes the noise and displays it. )
Those ~2100 instructions are executed by your app: as the ELF
dynamic loader starts up your test-app.
If you have some tool that reports less than that then that tool is
not being truthful about the true overhead of your application.
Also note that applications that only execute 1 million instructions
are very, very rare - a modern CPU can execute billions of
instructions, per second, per core.
So I usually test a more realistic reference app, one that
executes 1 billion instructions:
$ perf stat --repeat 10 -e 0:1:u ./loop_1b_instructions
Performance counter stats for './loop_1b_instructions' (10 runs):
1000079797 instructions ( +- 0.000% )
0.239947420 seconds time elapsed.
The noise there is very low (despite ~240 milliseconds still being a
very short runtime).
Hope this helps - thanks,
Ingo
On Wed, 24 Jun 2009, Ingo Molnar wrote:
> * Vince Weaver <[email protected]> wrote:
>
> Those ~2100 instructions are executed by your app: as the ELF
> dynamic loader starts up your test-app.
>
> If you have some tool that reports less than that then that tool is
> not being truthful about the true overhead of your application.
I wanted the instruction count of the application, not the loader.
If I wanted the overhead of the loader too, then I would have specified
it. I don't think it has anything to do with tools being "less than
truthful". I notice perf doesn't seem to include its own overheads into
the count.
> Also note that applications that only execute 1 million instructions
> are very, very rare - a modern CPU can execute billions of
> instructions, per second, per core.
Yes, I know that.
As I hope you know, the chip designers offer no guarantees with any of the
performance counters. So before you can use them, you have to validate
them a bit to make sure they are returning expected results. Hence the
need for microbenchmarks, one of which I used as an example.
You have to be careful with performance counters. For example, on Pentium
4, the retired instruction counter will have as much as 2% error on some
of the spec2k benchmarks because the "fldcw" instruction counts as two
instructions instead of one.
This kind of difference is important when doing validation work, and can't
just be swept under the rug with "if you use bigger programs it doesn't
matter".
It's also nice to be able to skip the loader overhead, as the loader can
change from system to system and makes it hard to compare counters across
various machines. Though it sounds like the perf utility isn't going to
be supporting this anytime soon.
Vince
On Wed, 2009-06-24 at 22:12 -0400, Vince Weaver wrote:
>
> It's also nice to be able to skip the loader overhead, as the loader can
> change from system to system and makes it hard to compare counters across
> various machines. Though it sounds like the perf utility isn't going to
> be supporting this anytime soon.
Feel free to contribute such if you think its important.
* Peter Zijlstra <[email protected]> wrote:
> On Wed, 2009-06-24 at 22:12 -0400, Vince Weaver wrote:
> >
> > It's also nice to be able to skip the loader overhead, as the
> > loader can change from system to system and makes it hard to
> > compare counters across various machines. Though it sounds like
> > the perf utility isn't going to be supporting this anytime soon.
>
> Feel free to contribute such if you think its important.
I'd be glad to review and test any resulting patches from Vince -
and/or help out with pointers on where to start, and help out if there
are any roadblocks along the way.
The kernel side bits can be found in v2.6.31-rc1, in
kernel/perf_counter.c, include/linux/perf_counter.h and
arch/x86/kernel/cpu/perf_counter.c. We tried to keep the code as
hackable as possible.
The tooling bits can be found in tools/perf/ in the kernel repo.
builtin-stat.c contains the 'perf stat' bits.
Thanks,
Ingo
On Wed, 24 Jun 2009, Ingo Molnar wrote:
> * Vince Weaver <[email protected]> wrote:
>
> Those ~2100 instructions are executed by your app: as the ELF
> dynamic loader starts up your test-app.
>
> If you have some tool that reports less than that then that tool is
> not being truthful about the true overhead of your application.
Wait a second... my application is a statically linked binary. There is
no ELF dynamic loader involved at all.
On further investigation, all of the overhead comes _entirely_ from the
perf utility. This is overhead and instructions that would not occur when
not using the perf utility.
From the best I can tell digging through the perf sources, the performance
counters are set up and started in userspace, but instead of doing an
immediate clone/exec, thousands of instructions worth of other stuff is
done by perf in between.
Ther "perfmon" util, plus linux-user simulators like qemu and valgrind do
things properly. perf can't it seems, and it seems to be a limitation of
the new performance counter infrastructure.
Vince
PS. Why is the perf code littered with so many __MINGW32__ defines?
Should this be in the kernel tree? It makes the code really hard
to follow. Are there plans to port perf to Windows?
On Fri, 2009-06-26 at 14:22 -0400, Vince Weaver wrote:
> On Wed, 24 Jun 2009, Ingo Molnar wrote:
> > * Vince Weaver <[email protected]> wrote:
> >
> > Those ~2100 instructions are executed by your app: as the ELF
> > dynamic loader starts up your test-app.
> >
> > If you have some tool that reports less than that then that tool is
> > not being truthful about the true overhead of your application.
>
> Wait a second... my application is a statically linked binary. There is
> no ELF dynamic loader involved at all.
>
> On further investigation, all of the overhead comes _entirely_ from the
> perf utility. This is overhead and instructions that would not occur when
> not using the perf utility.
>
> From the best I can tell digging through the perf sources, the performance
> counters are set up and started in userspace, but instead of doing an
> immediate clone/exec, thousands of instructions worth of other stuff is
> done by perf in between.
>
> Ther "perfmon" util, plus linux-user simulators like qemu and valgrind do
> things properly. perf can't it seems, and it seems to be a limitation of
> the new performance counter infrastructure.
perf can do it just fine; all you need is a will to touch ptrace().
Nothing in the perf counter design prevents this from working.
I just can't really be bothered by this tiny and mostly constant offset,
especially if the cost is risking braindamage from touching ptrace(), but if
you think otherwise (and make the ptrace bit optional) I'm more than
willing to merge the patch.
> PS. Why is the perf code littered with so many __MINGW32__ defines?
> Should this be in the kernel tree? It makes the code really hard
> to follow. Are there plans to port perf to Windows?
Comes straight from the git sources.. and littered might be a bit much,
I count only 11.
# git grep MING tools/perf | wc -l
11
But yeah, that might want cleaning up.
On Fri, 26 Jun 2009, Vince Weaver wrote:
> From the best I can tell digging through the perf sources, the performance
> counters are set up and started in userspace, but instead of doing an
> immediate clone/exec, thousands of instructions worth of other stuff is done
> by perf in between.
and for the curious, wondering how a simple
prctl(COUNTERS_ENABLE);
fork()
execvp()
can cause 6000+ instructions of non-deterministic execution, it turns out
that perf is dynamically linked. So it has to spend 5000+ cycles in
ld-linux.so resolving the execvp() symbol before it can actually execvp.
So when trying to get accurate profiles of simple statically linked
programs, you still have to put up with the dynamic loader overhead
because of the way perf is designed. nice.
Vince
* Peter Zijlstra <[email protected]> wrote:
> > PS. Why is the perf code littered with so many __MINGW32__ defines?
> > Should this be in the kernel tree? It makes the code really hard
> > to follow. Are there plans to port perf to Windows?
>
> Comes straight from the git sources.. and littered might be a bit
> much, I count only 11.
>
> # git grep MING tools/perf | wc -l
> 11
>
> But yeah, that might want cleaning up.
Indeed. I removed those bits - thanks Vince for reporting it!
Ingo
* Vince Weaver <[email protected]> wrote:
> On Fri, 26 Jun 2009, Vince Weaver wrote:
>
>> From the best I can tell digging through the perf sources, the
>> performance counters are set up and started in userspace, but instead
>> of doing an immediate clone/exec, thousands of instructions worth of
>> other stuff is done by perf in between.
>
> and for the curious, wondering how a simple
>
> prctl(COUNTERS_ENABLE);
> fork()
> execvp()
>
> can cause 6000+ instructions of non-deterministic execution, it
> turns out that perf is dynamically linked. So it has to spend
> 5000+ cycles in ld-linux.so resolving the execvp() symbol before
> it can actually execvp.
I measured 2000, but generally a few thousand cycles per invocation
sounds about right.
That is in the 0.0001% measurement overhead range (per 'perf stat'
invocation) for any realistic app that does something worth
measuring - and even with a worst-case 'cheapest app' case it is in
the 0.2-0.4% range.
Besides, you compare perfcounters to perfmon (which you seem to be a
contributor of), while in reality perfmon has much, much worse (and
unfixable, because designed-in) measurement overhead.
So why are you criticising perfcounters for a 5000 cycles
measurement overhead while perfmon has huge, _hundreds of millions_
of cycles measurement overhead (per second) for various realistic
workloads? [ In fact in one of the scheduler-tests perfmon has a
whopping measurement overhead of _nine billion_ cycles, it increased
total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]
Why are you using a double standard here?
Here are some numbers to put the 5000 cycles startup cost into
perspective. For example the default startup costs of even the
simplest Linux binaries (/bin/true):
titan:~> perf stat /bin/true
Performance counter stats for '/bin/true':
0.811328 task-clock-msecs # 1.002 CPUs
1 context-switches # 0.001 M/sec
1 CPU-migrations # 0.001 M/sec
180 page-faults # 0.222 M/sec
1267713 cycles # 1562.516 M/sec
733772 instructions # 0.579 IPC
26261 cache-references # 32.368 M/sec
531 cache-misses # 0.654 M/sec
0.000809407 seconds time elapsed
5000/1267713 cycles is in the 0.4% range. Run any app that actually
does something beyond starting up, an app which has a chance to get
a decent cache footprint and gets into steady state so that it gets
stable properties that can be measured reliably - and you'll get
into the billions of cycles range or more - at which point a few
thousand cycles is in the 0.0001% measurement overhead range.
Compare to this the intrinsic noise of cycles metrics for some
benchmark like hackbench:
titan:~> perf stat -r 2 -e 0:0 -- ~/hackbench 10
Time: 0.448
Time: 0.447
Performance counter stats for '/home/mingo/hackbench 10' (2 runs):
2661715310 cycles ( +- 0.588% )
0.480153304 seconds time elapsed ( +- 0.549% )
The noise in this (very short) hackbench run above was 15 _million_
cycles. See how small a few thousand cycles are?
If the 5 thousand cycles measurement overhead _still_ matters to you
under such circumstances then by all means please submit the patches
to improve it. Despite your claims this is totally fixable with the
current perfcounters design, Peter outlined the steps of how to
solve it, you can utilize ptrace if you want to.
Ingo
* Ingo Molnar <[email protected]> wrote:
> Besides, you compare perfcounters to perfmon (which you seem to be
> a contributor of), while in reality perfmon has much, much worse
> (and unfixable, because designed-in) measurement overhead.
>
> So why are you criticising perfcounters for a 5000 cycles
> measurement overhead while perfmon has huge, _hundreds of
> millions_ of cycles measurement overhead (per second) for various
> realistic workloads? [ In fact in one of the scheduler-tests
> perfmon has a whopping measurement overhead of _nine billion_
> cycles, it increased total runtime of the workload from 3.3
> seconds to 6.6 seconds. (!) ]
Here are the more detailed perfmon/pfmon measurement overhead
numbers.
The test system is an "Intel Core2 E6800 @ 2.93GHz", 1 GB of RAM, default
Fedora install.
I've measured two workloads:
hackbench.c # messaging server benchmark
test-1m-pipes.c # does 1 million pipe ops, similar to lat_pipe
v2.6.28+perfmon patches (v3, full):
./hackbench 10
0.496400985 seconds time elapsed ( +- 1.699% )
pfmon --follow-fork --aggregate-results ./hackbench 10
0.580812999 seconds time elapsed ( +- 2.233% )
I.e. this workload runs 17% slower under pfmon, the measurement
overhead is about 1.45 billion cycles.
Furthermore, when running a 'pipe latency benchmark', an app that
does one million pipe reads and writes between two tasks (source
code attached below), i measured the following perfmon/pfmon
overhead:
./pipe-test-1m
3.344280347 seconds time elapsed ( +- 0.361% )
pfmon --follow-fork --aggregate-results ./pipe-test-1m
6.508737983 seconds time elapsed ( +- 0.243% )
That's about a 94% measurement overhead, or about 9.2 _billion_
cycles of overhead on this test-system.
These perfmon/pfmon overhead figures are consistently reproducible,
and they happen on other test-systems as well, and with other
workloads as well. Basically for any app that involves task creation
or context-switching, perfmon adds considerable runtime overhead -
well beyond the overhead of perfcounters.
Ingo
-----------------{ pipe-test-1m.c }-------------------->
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>

#define LOOPS 1000000

int main(void)
{
        int pipe_1[2], pipe_2[2];
        int m = 0, i;

        pipe(pipe_1);
        pipe(pipe_2);

        if (!fork()) {
                /* child: echo each token straight back to the parent */
                for (i = 0; i < LOOPS; i++) {
                        read(pipe_1[0], &m, sizeof(int));
                        write(pipe_2[1], &m, sizeof(int));
                }
        } else {
                /* parent: send a token and wait for the echo, LOOPS times */
                for (i = 0; i < LOOPS; i++) {
                        write(pipe_1[1], &m, sizeof(int));
                        read(pipe_2[0], &m, sizeof(int));
                }
        }

        return 0;
}
Ingo Molnar writes:
> I measured 2000, but generally a few thousand cycles per invocation
> sounds about right.
We could actually do a bit better than we do, fairly easily. We could
attach the counters to the child after the fork instead of the parent
before the fork, using a couple of pipes for synchronization. And
there's probably a way to get the dynamic linker to resolve the execvp
call early in the child so we avoid that overhead. I think we should
be able to get the overhead down to tens of userspace instructions
without doing anything unnatural.
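A minimal sketch of that scheme, with a single pipe for the handshake and
with attach_counters_to() standing in for whatever actually opens the
counters on the child's pid (a placeholder, not a real interface):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void attach_counters_to(pid_t pid)       /* placeholder, not a real API */
{
        fprintf(stderr, "attaching counters to pid %d (stub)\n", (int)pid);
}

int main(int argc, char *argv[])
{
        int go[2];      /* parent -> child: "counters are attached, exec now" */
        char c;

        if (argc < 2 || pipe(go) < 0) {
                fprintf(stderr, "usage: %s <prog> [args...]\n", argv[0]);
                return 1;
        }

        pid_t child = fork();
        if (child == 0) {
                close(go[1]);
                read(go[0], &c, 1);             /* block until the parent says go */
                close(go[0]);
                execvp(argv[1], &argv[1]);
                _exit(127);
        }

        close(go[0]);
        attach_counters_to(child);              /* open the counters on the child */
        write(go[1], "x", 1);                   /* release the child into exec */
        close(go[1]);

        waitpid(child, NULL, 0);                /* ...read and report counters here... */
        return 0;
}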
Paul.
* Paul Mackerras <[email protected]> wrote:
> Ingo Molnar writes:
>
> > I measured 2000, but generally a few thousand cycles per
> > invocation sounds about right.
>
> We could actually do a bit better than we do, fairly easily. We
> could attach the counters to the child after the fork instead of
> the parent before the fork, using a couple of pipes for
> synchronization. And there's probably a way to get the dynamic
> linker to resolve the execvp call early in the child so we avoid
> that overhead. I think we should be able to get the overhead down
> to tens of userspace instructions without doing anything
> unnatural.
Definitely so.
Ingo
I can think of three ways to eliminate the PLT resolver overhead on
execvp:
(1) Do execvp on a non-executable file first to get execvp resolved:
        char tmpname[16];
        int fd;
        char *args[1];

        strcpy(tmpname, "/tmp/perfXXXXXX");
        fd = mkstemp(tmpname);
        if (fd >= 0) {
                args[0] = NULL;
                execvp(tmpname, args);
                close(fd);
                unlink(tmpname);
        }
        enable_counters();
        execvp(prog, argv);
(2) Look up execvp in glibc and call it directly:

        int (*execptr)(const char *, char *const []);

        execptr = dlsym(RTLD_NEXT, "execvp");
        enable_counters();
        (*execptr)(prog, argv);

(3) Resolve the executable path ourselves and then invoke the execve
system call directly:

        char *execpath;

        execpath = search_path(getenv("PATH"), prog);
        enable_counters();
        syscall(__NR_execve, execpath, argv, envp);

(4) Same as (1), but rely on "" being an invalid program name for
execvp:

        execvp("", argv);
        enable_counters();
        execvp(prog, argv);
What do you guys think? Does any of these appeal more than the
others? I'm leaning towards (4) myself.
Paul.
Paul Mackerras writes:
> I can think of three ways to eliminate the PLT resolver overhead on
> execvp:
s/three/four/, obviously - I thought of the 4th while I was writing
the mail.
Paul.
* Paul Mackerras <[email protected]> wrote:
> I can think of three ways to eliminate the PLT resolver overhead on
> execvp:
>
> (1) Do execvp on a non-executable file first to get execvp resolved:
>
>         char tmpname[16];
>         int fd;
>         char *args[1];
>
>         strcpy(tmpname, "/tmp/perfXXXXXX");
>         fd = mkstemp(tmpname);
>         if (fd >= 0) {
>                 args[0] = NULL;
>                 execvp(tmpname, args);
>                 close(fd);
>                 unlink(tmpname);
>         }
>         enable_counters();
>         execvp(prog, argv);
>
> (2) Look up execvp in glibc and call it directly:
>
>         int (*execptr)(const char *, char *const []);
>
>         execptr = dlsym(RTLD_NEXT, "execvp");
>         enable_counters();
>         (*execptr)(prog, argv);
>
> (3) Resolve the executable path ourselves and then invoke the execve
> system call directly:
>
>         char *execpath;
>
>         execpath = search_path(getenv("PATH"), prog);
>         enable_counters();
>         syscall(__NR_execve, execpath, argv, envp);
>
> (4) Same as (1), but rely on "" being an invalid program name for
> execvp:
>
>         execvp("", argv);
>         enable_counters();
>         execvp(prog, argv);
>
> What do you guys think? Does any of these appeal more than the
> others? I'm leaning towards (4) myself.
(4) looks convincingly elegant.
We could also do (5): a one-shot counters-disabled ptrace run of the
target, then enable-counters-in-target + ptrace-detach after the
first stop.
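A rough sketch of that variant; enable_counters_in() is a placeholder for
whatever actually flips the counters on in the target, not a real
interface:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

static void enable_counters_in(pid_t pid)       /* placeholder, not a real API */
{
        fprintf(stderr, "enabling counters for pid %d (stub)\n", (int)pid);
}

int main(int argc, char *argv[])
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <prog> [args...]\n", argv[0]);
                return 1;
        }

        pid_t child = fork();
        if (child == 0) {
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                execvp(argv[1], &argv[1]);      /* child stops right after exec */
                _exit(127);
        }

        int status;
        waitpid(child, &status, 0);             /* first stop: new image in place */
        enable_counters_in(child);              /* counters go live only now */
        ptrace(PTRACE_DETACH, child, NULL, NULL);
        waitpid(child, &status, 0);             /* wait for the workload to finish */
        return 0;
}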
Ingo
Hello
> Ingo Molnar <[email protected]> wrote:
>> Vince Weaver <[email protected]> wrote:
> That is in the 0.0001% measurement overhead range (per 'perf stat'
> invocation) for any realistic app that does something worth
> measuring
I'm just curious about this "app worth measuring" idea.
Do you intend for performance counters to simply be "oprofile done right"
or do you intend it to be a generic way of exposing performance counters
to userspace?
For the research my co-workers and I are currently working on, the former
is uninteresting. If we wanted oprofile, we'd use it.
What matters for us is getting very exact counts of counters on programs
that are being run as deterministically as possible. This includes
very small programs, and counts like retired_instructions, load/store
ratios, uop_counts, etc.
This may be uninteresting to you, but it is important to us. Hence my
interest in the capabilities of the infrastructure finally getting merged
into the kernel.
> Besides, you compare perfcounters to perfmon
what else should I be comparing it to?
> (which you seem to be a contributor of)
is that not allowed?
> workloads? [ In fact in one of the scheduler-tests perfmon has a
> whopping measurement overhead of _nine billion_ cycles, it increased
> total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]
I'm sure the perfmon2 people would welcome any patches you have to fix
this problem.
As I said, I am looking for aggregate counts for deterministic programs.
Compared to the overheads of 50x for DBI-based tools like Valgrind, or
1000x for "cycle-accurate" simulations, even an overhead of 2x really
isn't that bad.
Counting cycles or time is always a dangerous thing when performance
counters are involved. Things as trivial as the compiler, object link order,
length of the executable name, number of environment variables, number of
ELF auxiliary vectors, etc., can all vastly change what results you get.
I'd recommend the following paper for more details:
"Producing wrong data without doing anything obviously wrong"
by Mytkowicz et al.
http://www-plan.cs.colorado.edu/klipto/mytkowicz-asplos09.pdf
> If the 5 thousand cycles measurement overhead _still_ matters to you
> under such circumstances then by all means please submit the patches
> to improve it. Despite your claims this is totally fixable with the
> current perfcounters design, Peter outlined the steps of how to
> solve it, you can utilize ptrace if you want to.
Is it really "totally" fixible? I don't just mean getting the overhead
from ~3000 down to ~100, I mean down to zero.
> Here are the more detailed perfmon/pfmon measurement overhead
> numbers.
>
> ...
>
> I.e. this workload runs 17% slower under pfmon, the measurement
> overhead is about 1.45 billion cycles.
>
> ..
>
> That's about a 94% measurement overhead, or about 9.2 _billion_
> cycles of overhead on this test-system.
I'm more interested in very CPU-intensive benchmarks. I ran some
experiments with gcc and equake from the spec2k benchmark suite.
This is on a 32-bit AMD Athlon(tm) XP 2000+ machine
gcc.200 (spec2k)
+ 2.6.30-03984-g45e3e19, configured with perf counters disabled
108.44s +/- 0.7
+ 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --
109.17s +/- 0.7
*** For a slowdown of about 0.6%
+ 2.6.29.5 (unpatched)
115.31s +/- 0.5
+ 2.6.29.5 with perfmon2 patches applied, pfmon -e retired_instructions,cpu_clk_unhalted
115.62 +/- 0.5
** For a slowdown of about 0.2%
So in this case, perfmon2 had less overhead, though the overhead is so
small as to be lost in the noise. Why the 2.6.30-git kernel
seems to be much faster on this hardware, I don't know.
equake (spec2k)
+ 2.6.30-03984-g45e3e19, configured with perf counters disabled
392.77s +/- 1.5
+ 2.6.30-03984-g45e3e19, perf stat -e 0:1:u --
393.45s +/- 0.7
*** For a slowdown of about 0.17%
+ 2.6.29.5 (unpatched)
429.25s +/- 1.7
+ 2.6.29.5 with perfmon2 patches applied, pfmon -e retired_instructions,cpu_clk_unhalted
428.91 +/- 0.8
** For a _speedup_ of about 0.08%
So again the difference in overheads is in the noise. Again I am not sure
why 2.6.30-git is so much faster on this hardware.
As for counter results, in this case retired instructions:
gcc.200
perf: 72,618,643,132 +/- 8 million
pfmon: 72,618,519,792 +/- 5 million
equake
perf: 144,952,319,472 +/- 8000
pfmon: 144,952,327,906 +/- 500
So in the equake case you can easily see that the few thousand instruction
overhead from perf can show up even on long-running programs.
In any case, the point I am trying to make is that perf counters are used
by a wide variety of people in a wide variety of ways, with lots of
different performance/accuracy tradeoffs. Don't limit the API just
because you can't envision a use for certain features.
Vince
* Vince Weaver <[email protected]> wrote:
>> If the 5 thousand cycles measurement overhead _still_ matters to
>> you under such circumstances then by all means please submit the
>> patches to improve it. Despite your claims this is totally
>> fixable with the current perfcounters design, Peter outlined the
>> steps of how to solve it, you can utilize ptrace if you want to.
>
> Is it really "totally" fixable? I don't just mean getting the
> overhead from ~3000 down to ~100, I mean down to zero.
The thing is, not even pfmon gets it down to zero:
pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
1000001 INSTRUCTIONS_RETIRED
So ... do you take the hardliner purist view and consider it crap
due to that imprecision, or do you take the pragmatist view of also
considering the relative relevance of any imperfection? ;-)
Ingo
* Vince Weaver <[email protected]> wrote:
>> If the 5 thousand cycles measurement overhead _still_ matters to
>> you under such circumstances then by all means please submit the
>> patches to improve it. Despite your claims this is totally
>> fixable with the current perfcounters design, Peter outlined the
>> steps of how to solve it, you can utilize ptrace if you want to.
>
> Is it really "totally" fixable? I don't just mean getting the
> overhead from ~3000 down to ~100, I mean down to zero.
Yes, it's truly very easy to get exactly the same output as pfmon,
for the 'million.s' test app you posted:
titan:~> perf stat -e 0:1:u ./million
Performance counter stats for './million':
1000001 instructions
0.000489736 seconds time elapsed
See the small patch below.
( Note that this approach does not use ptrace, hence it can be used
to measure debuggers too. ptrace attach has the limitation of
being exclusive - no task can be attached to twice. perfmon used
ptrace attach, which limited its capabilities unreasonably. )
The question was really not whether we can do it - but whether we
want to do it. I have no strong feelings either way - because as i
told you in my first mail, all the other noise sources in the system
dominate the metrics far more than this very small constant startup
offset.
And the thing is, as a perfmon contributor i assume you have
experience in these matters. Had you taken a serious, unbiased look
at perfcounters, and had this problem truly bothered you personally,
you could have come up with a similar patch yourself as well, while
only spending a fraction of the energies you are putting into these
emails. Instead you ignored our technical arguments, you refused to
touch the code and you went on rambling against how perfcounters
supposedly cannot solve this problem. Not very productive IMO.
Ingo
---------------->
Subject: perf_counter: Add enable-on-exec attribute
From: Ingo Molnar <[email protected]>
Date: Mon Jun 29 22:05:11 CEST 2009
Add another attribute variant: attr.enable_on_exec.
The purpose is to allow the auto-enabling of such counters
on exec(), to measure exec()-ed workloads precisely, from
the first to the last instruction.
Cc: Peter Zijlstra <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <[email protected]>
---
fs/exec.c | 3 +--
include/linux/perf_counter.h | 5 ++++-
kernel/perf_counter.c | 39 ++++++++++++++++++++++++++++++++++++---
tools/perf/builtin-stat.c | 5 +++--
4 files changed, 44 insertions(+), 8 deletions(-)
Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c
+++ linux/fs/exec.c
@@ -996,8 +996,7 @@ int flush_old_exec(struct linux_binprm *
* Flush performance counters when crossing a
* security domain:
*/
- if (!get_dumpable(current->mm))
- perf_counter_exit_task(current);
+ perf_counter_exec(current);
/* An exec changes our domain. We are no longer part of the thread
group */
Index: linux/include/linux/perf_counter.h
===================================================================
--- linux.orig/include/linux/perf_counter.h
+++ linux/include/linux/perf_counter.h
@@ -179,8 +179,9 @@ struct perf_counter_attr {
comm : 1, /* include comm data */
freq : 1, /* use freq, not period */
inherit_stat : 1, /* per task counts */
+ enable_on_exec : 1, /* enable on exec */
- __reserved_1 : 52;
+ __reserved_1 : 51;
__u32 wakeup_events; /* wakeup every n events */
__u32 __reserved_2;
@@ -712,6 +713,7 @@ static inline void perf_counter_mmap(str
extern void perf_counter_comm(struct task_struct *tsk);
extern void perf_counter_fork(struct task_struct *tsk);
+extern void perf_counter_exec(struct task_struct *tsk);
extern struct perf_callchain_entry *perf_callchain(struct pt_regs *regs);
@@ -752,6 +754,7 @@ perf_swcounter_event(u32 event, u64 nr,
static inline void perf_counter_mmap(struct vm_area_struct *vma) { }
static inline void perf_counter_comm(struct task_struct *tsk) { }
static inline void perf_counter_fork(struct task_struct *tsk) { }
+static inline void perf_counter_exec(struct task_struct *tsk) { }
static inline void perf_counter_init(void) { }
#endif
Index: linux/kernel/perf_counter.c
===================================================================
--- linux.orig/kernel/perf_counter.c
+++ linux/kernel/perf_counter.c
@@ -903,6 +903,9 @@ static void perf_counter_enable(struct p
struct perf_counter_context *ctx = counter->ctx;
struct task_struct *task = ctx->task;
+ if (counter->attr.enable_on_exec)
+ return;
+
if (!task) {
/*
* Enable the counter on the cpu that it's on
@@ -2856,6 +2859,32 @@ void perf_counter_fork(struct task_struc
perf_counter_fork_event(&fork_event);
}
+void perf_counter_exec(struct task_struct *task)
+{
+ struct perf_counter_context *ctx;
+ struct perf_counter *counter;
+
+ if (!get_dumpable(task->mm)) {
+ perf_counter_exit_task(task);
+ return;
+ }
+
+ if (!task->perf_counter_ctxp)
+ return;
+
+ rcu_read_lock();
+ ctx = task->perf_counter_ctxp;
+ if (ctx) {
+ list_for_each_entry(counter, &ctx->counter_list, list_entry) {
+ if (counter->attr.enable_on_exec) {
+ counter->attr.enable_on_exec = 0;
+ __perf_counter_enable(counter);
+ }
+ }
+ }
+ rcu_read_unlock();
+}
+
/*
* comm tracking
*/
@@ -4064,10 +4093,14 @@ inherit_counter(struct perf_counter *par
* not its attr.disabled bit. We hold the parent's mutex,
* so we won't race with perf_counter_{en, dis}able_family.
*/
- if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
- child_counter->state = PERF_COUNTER_STATE_INACTIVE;
- else
+ if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE) {
+ if (child_counter->attr.enable_on_exec)
+ child_counter->state = PERF_COUNTER_STATE_OFF;
+ else
+ child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+ } else {
child_counter->state = PERF_COUNTER_STATE_OFF;
+ }
if (parent_counter->attr.freq)
child_counter->hw.sample_period = parent_counter->hw.sample_period;
Index: linux/tools/perf/builtin-stat.c
===================================================================
--- linux.orig/tools/perf/builtin-stat.c
+++ linux/tools/perf/builtin-stat.c
@@ -116,8 +116,9 @@ static void create_perf_stat_counter(int
fd[cpu][counter], strerror(errno));
}
} else {
- attr->inherit = inherit;
- attr->disabled = 1;
+ attr->inherit = inherit;
+ attr->disabled = 1;
+ attr->enable_on_exec = 1;
fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
if (fd[0][counter] < 0 && verbose)
* Vince Weaver <[email protected]> wrote:
>> Besides, you compare perfcounters to perfmon
>
> what else shoud I be comparing it to?
>
>> (which you seem to be a contributor of)
>
> is that not allowed?
Here's the full, uncropped sentence i wrote:
" Besides, you compare perfcounters to perfmon (which you seem to
be a contributor of), while in reality perfmon has much, much
worse (and unfixable, because designed-in) measurement overhead. "
Where I question the blatant hypocrisy of bringing up perfmon as a
good example while in reality perfmon has far worse measurement
overhead than perfcounters, for a wide range of workloads.
As far as I can see you didn't answer my questions: why are you
dismissing perfcounters for a minor, once-per-startup measurement
offset (which is entirely fixable - see the patch I sent), while you
generously allow perfmon to have serious, 90% measurement overhead
amounting to billions of instructions of overhead per second, for
certain workloads?
Ingo
* Vince Weaver <[email protected]> wrote:
>> workloads? [ In fact in one of the scheduler-tests perfmon has a
>> whopping measurement overhead of _nine billion_ cycles, it
>> increased total runtime of the workload from 3.3 seconds to 6.6
>> seconds. (!) ]
>
> I'm sure the perfmon2 people would welcome any patches you have to
> fix this problem.
I think this flaw of perfmon is unfixable, because perfmon (by
design) uses a _way_ too low level and way too opaque and
structure-less abstraction for the PMU, which disallows the kind of
high-level optimizations that perfcounters can do.
We weren't silent about this - to the contrary. Last November Thomas
and I _did_ take a good look at the perfmon patches (we maintain
the code areas affected by perfmon), we saw that it has unfixable
problems, raised objections, and later came up with
patches that fix these problems: the perfcounters subsystem.
>> That's about a 94% measurement overhead, or about 9.2 _billion_
>> cycles of overhead on this test-system.
>
> I'm more interested in very CPU-intensive benchmarks. I ran some
> experiments with gcc and equake from the spec2k benchmark suite.
The workloads i cited are _all_ 100% CPU-intensive benchmarks:
- hackbench
- loop-pipe-1-million
But i could add 'lat_tcp localhost', 'bw_tcp localhost' or sysbench
to the list - all show very significant overhead under perfmon.
These are all important workloads and important benchmarks. A kernel
based performance analysis facility that is any good must handle
them transparently.
Ingo
sorry for the delay in responding, was away
On Mon, 29 Jun 2009, Ingo Molnar wrote:
>
> * Vince Weaver <[email protected]> wrote:
>
>>> If the 5 thousand cycles measurement overhead _still_ matters to
>>> you under such circumstances then by all means please submit the
>>> patches to improve it. Despite your claims this is totally
>>> fixable with the current perfcounters design, Peter outlined the
>>> steps of how to solve it, you can utilize ptrace if you want to.
>>
>> Is it really "totally" fixable? I don't just mean getting the
>> overhead from ~3000 down to ~100, I mean down to zero.
>
> The thing is, not even pfmon gets it down to zero:
>
> pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
> 1000001 INSTRUCTIONS_RETIRED
>
> So ... do you take the hardliner purist view and consider it crap
> due to that imprecision, or do you take the pragmatist view of also
> considering the relative relevance of any imperfection? ;-)
as I said in a previous post, on most x86 chips the instructions_retired
counter also includes any hardware interrupts that occur during the
process runtime. So any clock interrupts, etc, show up as an extra
instruction. So on the "million" benchmark, it's usually +/- 2 extra
instructions.
It looks like support might be added to perfcounters to track these
hardware interrupt stats per-process, which would be great, as it's been
really hard to quantify that currently.
In any case, it looks like the changes to make perf have lower overhead
have been merged, which makes me happy. Thank you.
Vince
* Vince Weaver <[email protected]> wrote:
> On Mon, 29 Jun 2009, Ingo Molnar wrote:
>>
>> * Vince Weaver <[email protected]> wrote:
>>
>>>> If the 5 thousand cycles measurement overhead _still_ matters to
>>>> you under such circumstances then by all means please submit the
>>>> patches to improve it. Despite your claims this is totally
>>>> fixable with the current perfcounters design, Peter outlined the
>>>> steps of how to solve it, you can utilize ptrace if you want to.
>>>
>>> Is it really "totally" fixable? I don't just mean getting the
>>> overhead from ~3000 down to ~100, I mean down to zero.
>>
>> The thing is, not even pfmon gets it down to zero:
>>
>> pfmon -e INSTRUCTIONS_RETIRED --follow-fork --aggregate-results ~/million
>> 1000001 INSTRUCTIONS_RETIRED
>>
>> So ... do you take the hardliner purist view and consider it crap
>> due to that imprecision, or do you take the pragmatist view of also
>> considering the relative relevance of any imperfection? ;-)
>
> as I said in a previous post, on most x86 chips the
> instructions_retired counter also includes any hardware interrupts
> that occur during the process runtime. So any clock interrupts,
> etc, show up as an extra instruction. So on the "million"
> benchmark, it's usually +/- 2 extra instructions.
yeah. But it has nothing to do with the function you are measuring,
right?
My general point is really that what matters is the statistical
validity of the end result. I don't think you ever disagreed with
that point - you just seem to have a lower noise acceptance
threshold ;-)
> It looks like support might be added to perfcounters to track
> these hardware interrupt stats per-process, which would be great,
> as it's been really hard to quantify that currently.
Yeah. There's a patch-set in the works that attempts to do something
in this area - see these mails on lkml:
perf_counter: Add Generalized Hardware interrupt support
Right now they are just convenience wrappers around CPU model
specific hw events - but we could extend the whole thing with
software counters as well and isolate per IRQ vector events and
counts, by adding a callback to do_IRQ().
That would give a mixture of hardware and software counter based IRQ
instrumentation features that looks quite compelling. Any comments
on what features/capabilities you'd like to see in this area?
> In any case, it looks like the changes to make perf have lower
> overhead have been merged, which makes me happy. Thank you.
You are welcome :)
Btw., perfcounters still has no support for older Intel CPUs such as
P3's and P2's - and they have pretty sane PMUs - so if you have such
a machine (which your perfmon contribution suggests you might
have/had) and are interested it would be nice to get support for
them. P4 support is interesting too but more challenging.
Ingo
Vince Weaver <[email protected]> writes:
>
> as I said in a previous post, on most x86 chips the instructions_retired
> counter also includes any hardware interrupts that occur during the
> process runtime.
On the other hand afaik near all chips have interrupt performance counter
events.
So if you're willing to waste one of the variable counter registers
you can always count those and then correct based on the other count.
But the question of course is whether it's worth it; the error should
be really small. Also you could always lose a few cycles occasionally
to other "random" events, which can happen too.
> So any clock interrupts, etc, show up as an extra
> instruction. So on the "million" benchmark, it's usually +/- 2 extra
> instructions.
An error of 1-2 in a million doesn't sound like a catastrophic problem.
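Still, if one did want to correct for it, here is a sketch of what that
could look like from userspace - the counter setup is stubbed out, since
the interrupt event is CPU-model specific, and only the read-and-subtract
step is the point:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static int open_counter_fd(const char *event)   /* placeholder, not a real API */
{
        (void)event;
        return -1;      /* would return an fd for the named counter */
}

static uint64_t read_counter(int fd)
{
        uint64_t value = 0;

        /* Reading an open counter fd returns its current 64-bit count. */
        if (fd >= 0 && read(fd, &value, sizeof(value)) != (ssize_t)sizeof(value))
                perror("read counter");
        return value;
}

int main(void)
{
        int insn_fd = open_counter_fd("retired-instructions");
        int irq_fd  = open_counter_fd("hardware-interrupts");

        /* ... run the workload being measured ... */

        uint64_t insns = read_counter(insn_fd);
        uint64_t irqs  = read_counter(irq_fd);

        printf("raw: %" PRIu64 ", corrected: %" PRIu64 " (%" PRIu64 " interrupts)\n",
               insns, insns - irqs, irqs);
        return 0;
}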
-Andi
--
[email protected] -- Speaking for myself only.
> Vince Weaver <[email protected]> writes:
>>
>> as I said in a previous post, on most x86 chips the instructions_retired
>> counter also includes any hardware interrupts that occur during the
>> process runtime.
>
> On the other hand afaik near all chips have interrupt performance counter
> events.
I guess by "near all" you mean "only AMD"? The AMD event also has some
oddities, as it seems to report things like page faults and other things
that don't really match up with the excess instruction count. I must
admit it's been a while since I've looked at that particular counter.
> But the question is of course if it's worth it, the error should
> be really small. Also you could always lose a few cycles occasionally
> in other "random" events, which can happen too.
> 1-2 error in a million doesn't sound like a catastrophic problem.
Well, it's basically at least HZ extra instructions for every second your
benchmark runs, and unfortunately it's non-deterministic
because it also depends on keyboard/network/usb/etc. interrupts that may
happen to arrive while your program is running.
For me, it's the determinism that matters. Not overhead, not runtime,
not "oh it doesn't matter, it's small". For a deterministic benchmark I
want to get as close to the same value on every run as possible. I admit
it might not be possible to always get the same result, but the
closer the better. This might not match up with the way
kernel hackers use perf counters, but it is important for the work I am
doing.
Vince
On Fri, 3 Jul 2009, Ingo Molnar wrote:
> That would give a mixture of hardware and software counter based IRQ
> instrumentation features that looks quite compelling. Any comments
> on what features/capabilities you'd like to see in this area?
I'm mainly interested in just an aggregate total of "this many interrupts
occurred". It wouldn't even need to be separated out by type or number.
I don't know if the metric would be useful to anyone else. I tried to
hack this up a long time ago, to have the result reported with rusage()
but never got anywhere with it.
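In the meantime a crude, system-wide stand-in can be had from /proc/stat -
a sketch only, since it samples the machine-wide interrupt total rather
than the per-process number I'm really after:

#include <stdio.h>
#include <string.h>

/* Return the aggregate interrupt count from the "intr" line of /proc/stat. */
static unsigned long long total_interrupts(void)
{
        FILE *f = fopen("/proc/stat", "r");
        char line[4096];
        unsigned long long total = 0;

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f)) {
                if (strncmp(line, "intr ", 5) == 0) {
                        sscanf(line + 5, "%llu", &total);
                        break;
                }
        }
        fclose(f);
        return total;
}

int main(void)
{
        unsigned long long before = total_interrupts();

        /* ... run the code being measured ... */

        unsigned long long after = total_interrupts();

        printf("interrupts during run (system-wide): %llu\n", after - before);
        return 0;
}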
> Btw., perfcounters still has no support for older Intel CPUs such as
> P3's and P2's - and they have pretty sane PMUs - so if you have such
> a machine (which your perfmon contribution suggests you might
> have/had) and are interested it would be nice to get support for
> them. P4 support is interesting too but more challenging.
I was indeed the one who got perfmon2 running on Pentium Pro, Pentium II,
and MIPS R12k. For all those though there was an existing PMU driver and
I just added the appropriate "case" statements to enable support, and then
provided an updated list of available counters to the userspace utility.
The only real kernel hacking involved was the week spent tracking down a
hard-to-debug interrupt issue on the MIPS machine.
Unfortunately I think writing PMU drivers is a bit beyond me, for the
amount of time I have. Especially as the relevant machines I have are
located in relatively inaccessible locations (and PMU mistakes can lock up
the machines) plus it can take the better part of a day to compile 2.6
kernels on some of those machines.
Vince
On Fri, Jul 03, 2009 at 05:25:32PM -0400, Vince Weaver wrote:
> >Vince Weaver <[email protected]> writes:
> >>
> >>as I said in a previous post, on most x86 chips the instructions_retired
> >>counter also includes any hardware interrupts that occur during the
> >>process runtime.
> >
> >On the other hand afaik near all chips have interrupt performance counter
> >events.
>
> I guess by "near all" you mean "only AMD"? The AMD event also has some
Intel CPUs typically have HW_INT.RX event. AMD has a similar event.
> well, it's basically at least HZ extra instructions per however many
> seconds your benchmark runs, and unfortunately it's non-deterministic
> because it depends on keyboard/network/usb/etc interrupts too that may by
> chance happen while your program is running.
>
> For me, it's the determinism that matters. Not overhead, not runtime not
To be honest I don't think you'll ever be fully deterministic. Modern
computers and operating systems are just too complex, with too
many (often unpredictable) things going on in the background. In my own
experience even simulators (which are much more stable than
real hardware) are not fully deterministic. You'll always run
into problems.
If you need 100% determinism, use a simple microcontroller.
-Andi
--
[email protected] -- Speaking for myself only.