2009-06-29 11:14:10

by Paul Mackerras

Subject: [PATCH 1/2] perf_counter: tools: Make :u and :k exclude hypervisor

At present, appending ":u" to an event sets the exclude_kernel bit,
and ":k" sets the exclude_user bit. There is no way to set the
exclude_hv bit, which means that on systems with a hypervisor (e.g.
IBM pSeries systems), we get counts from hypervisor mode for an event
such as 0:1:u.

This fixes the problem by setting all three exclude bits when we see
the second ':' and then clearing the exclude bits corresponding to the
modes we want to count. This also adds a ":h" modifier to allow the
user to ask for counts in hypervisor mode.

Signed-off-by: Paul Mackerras <[email protected]>
---
tools/perf/util/parse-events.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 4d042f1..f2ffe2c 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -277,10 +277,15 @@ static int parse_event_symbols(const char *str, struct perf_counter_attr *attr)
sep = strchr(pstr, ':');
if (sep) {
pstr = sep + 1;
+ attr->exclude_user = 1;
+ attr->exclude_kernel = 1;
+ attr->exclude_hv = 1;
if (strchr(pstr, 'k'))
- attr->exclude_user = 1;
+ attr->exclude_kernel = 0;
if (strchr(pstr, 'u'))
- attr->exclude_kernel = 1;
+ attr->exclude_user = 0;
+ if (strchr(pstr, 'h'))
+ attr->exclude_hv = 0;
}
attr->type = type;
attr->config = id;
--
1.6.0.4
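
For readers who want to try the modifier logic outside the perf tree, the
following is a minimal standalone sketch of the approach the patch describes:
exclude every mode first, then clear the bit for each mode named in the
modifier string. The struct exclude_bits type and the apply_modifiers() name
are assumptions made for this illustration; the real code operates directly on
struct perf_counter_attr inside parse_event_symbols().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the three exclude bits of struct perf_counter_attr. */
struct exclude_bits {
	unsigned int exclude_user   : 1;
	unsigned int exclude_kernel : 1;
	unsigned int exclude_hv     : 1;
};

/*
 * Apply a modifier string such as "u", "k", "uk" or "h":
 * exclude every mode first, then re-enable each mode that is named.
 */
static void apply_modifiers(const char *mods, struct exclude_bits *x)
{
	assert(mods != NULL && x != NULL);

	x->exclude_user = 1;
	x->exclude_kernel = 1;
	x->exclude_hv = 1;

	if (strchr(mods, 'u'))
		x->exclude_user = 0;
	if (strchr(mods, 'k'))
		x->exclude_kernel = 0;
	if (strchr(mods, 'h'))
		x->exclude_hv = 0;
}
```

With this shape, a modifier of "u" leaves only exclude_user clear, so an event
such as 0:1:u no longer picks up hypervisor-mode counts.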


2009-06-29 11:14:25

by Paul Mackerras

Subject: [PATCH 2/2] perf_counter: tools: Reduce perf stat overhead

At present, perf stat creates its counters on the perf process. Thus
the counters count the fork and various other activity in both the
parent and child, such as the resolver overhead for resolving PLT
entries for any libc functions that haven't been called before, such
as execvp.

This reduces the overhead by creating the counters on the child process
after the fork, using a couple of pipes to synchronize so that the
child process waits until the parent has created the counters before
doing the exec. To eliminate the PLT resolution overhead on calling
execvp, this does a dummy execvp first which will always fail.

With this, the overhead of executing a program goes down from over
4800 instructions to about 90 instructions on powerpc (32-bit).
This was measured with a statically-linked program written in
assembler which only does the 3 instructions needed to call _exit(0).

Before:

$ perf stat -e 0:1:u ./three

Performance counter stats for './three':

4858 instructions

0.001274523 seconds time elapsed

After:

$ perf stat -e 0:1:u ./three

Performance counter stats for './three':

92 instructions

0.000468153 seconds time elapsed

Signed-off-by: Paul Mackerras <[email protected]>
---
tools/perf/builtin-stat.c | 64 +++++++++++++++++++++++++++++++++++----------
1 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 3e5ea4e..f0260ac 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -99,7 +99,7 @@ static u64 runtime_cycles_noise;
#define ERR_PERF_OPEN \
"Error: counter %d, sys_perf_counter_open() syscall returned with %d (%s)\n"

-static void create_perf_stat_counter(int counter)
+static void create_perf_stat_counter(int counter, int pid)
{
struct perf_counter_attr *attr = attrs + counter;

@@ -119,7 +119,7 @@ static void create_perf_stat_counter(int counter)
attr->inherit = inherit;
attr->disabled = 1;

- fd[0][counter] = sys_perf_counter_open(attr, 0, -1, -1, 0);
+ fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
if (fd[0][counter] < 0 && verbose)
fprintf(stderr, ERR_PERF_OPEN, counter,
fd[0][counter], strerror(errno));
@@ -205,12 +205,58 @@ static int run_perf_stat(int argc, const char **argv)
int status = 0;
int counter;
int pid;
+ int child_ready_pipe[2], go_pipe[2];
+ char buf;

if (!system_wide)
nr_cpus = 1;

+ if (pipe(child_ready_pipe) < 0 || pipe(go_pipe) < 0) {
+ perror("failed to create pipes");
+ exit(1);
+ }
+
+ if ((pid = fork()) < 0)
+ perror("failed to fork");
+
+ if (!pid) {
+ close(child_ready_pipe[0]);
+ close(go_pipe[1]);
+ fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);
+
+ /*
+ * Do a dummy execvp to get the PLT entry resolved,
+ * so we avoid the resolver overhead on the real
+ * execvp call.
+ */
+ execvp("", (char **)argv);
+
+ /*
+ * Tell the parent we're ready to go
+ */
+ close(child_ready_pipe[1]);
+
+ /*
+ * Wait until the parent tells us to go.
+ */
+ read(go_pipe[0], &buf, 1);
+
+ execvp(argv[0], (char **)argv);
+
+ perror(argv[0]);
+ exit(-1);
+ }
+
+ /*
+ * Wait for the child to be ready to exec.
+ */
+ close(child_ready_pipe[1]);
+ close(go_pipe[0]);
+ read(child_ready_pipe[0], &buf, 1);
+ close(child_ready_pipe[0]);
+
for (counter = 0; counter < nr_counters; counter++)
- create_perf_stat_counter(counter);
+ create_perf_stat_counter(counter, pid);

/*
* Enable counters and exec the command:
@@ -218,19 +264,9 @@ static int run_perf_stat(int argc, const char **argv)
t0 = rdclock();
prctl(PR_TASK_PERF_COUNTERS_ENABLE);

- if ((pid = fork()) < 0)
- perror("failed to fork");
-
- if (!pid) {
- if (execvp(argv[0], (char **)argv)) {
- perror(argv[0]);
- exit(-1);
- }
- }
-
+ close(go_pipe[1]);
wait(&status);

- prctl(PR_TASK_PERF_COUNTERS_DISABLE);
t1 = rdclock();

walltime_nsecs[run_idx] = t1 - t0;
--
1.6.0.4
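
The two-pipe handshake described in this patch can be tried in isolation. The
sketch below is not the perf code: run_when_ready() and its setup callback are
hypothetical names, with the callback standing in for the point where perf
stat would create its counters on the child's pid. It also uses _exit() and
waitpid() where the patch uses exit() and wait().

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that waits for the parent's go-ahead before exec'ing argv.
 * setup() is a hypothetical callback standing in for counter creation; it
 * runs in the parent once the child is parked just before its real exec.
 * Returns the child's wait status, or -1 on error.
 */
static int run_when_ready(char *const argv[], void (*setup)(pid_t child))
{
	int child_ready_pipe[2], go_pipe[2];
	int status = 0, ready;
	char buf;
	pid_t pid;

	assert(argv != NULL && argv[0] != NULL);

	if (pipe(child_ready_pipe) < 0 || pipe(go_pipe) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		close(child_ready_pipe[0]);
		close(go_pipe[1]);
		/* The go pipe must not leak into the exec'ed program. */
		fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);

		/* Dummy call that always fails; it only resolves the PLT entry. */
		execvp("", argv);

		/* Closing the write end gives the parent EOF: "child ready". */
		close(child_ready_pipe[1]);

		/* Block until the parent closes its write end: "go". */
		if (read(go_pipe[0], &buf, 1) < 0)
			_exit(127);

		execvp(argv[0], argv);
		_exit(127);	/* only reached if the real exec failed */
	}

	close(child_ready_pipe[1]);
	close(go_pipe[0]);

	/* read() returns 0 (EOF) once the child has closed its write end. */
	ready = read(child_ready_pipe[0], &buf, 1) == 0;
	close(child_ready_pipe[0]);

	if (ready && setup)
		setup(pid);

	close(go_pipe[1]);	/* release the child */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return status;
}
```

Neither side writes a byte: each direction is signalled purely by closing the
write end of a pipe, which the other side observes as end-of-file.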

2009-06-29 20:34:55

by Ingo Molnar

Subject: Re: [PATCH 1/2] perf_counter: tools: Make :u and :k exclude hypervisor


* Paul Mackerras <[email protected]> wrote:

> At present, appending ":u" to an event sets the exclude_kernel
> bit, and ":k" sets the exclude_user bit. There is no way to set
> the exclude_hv bit, which means that on systems with a hypervisor
> (e.g. IBM pSeries systems), we get counts from hypervisor mode for
> an event such as 0:1:u.
>
> This fixes the problem by setting all three exclude bits when we
> see the second ':' and then clearing the exclude bits corresponding
> to the modes we want to count. This also adds a ":h" modifier to
> allow the user to ask for counts in hypervisor mode.
>
> Signed-off-by: Paul Mackerras <[email protected]>
> ---
> tools/perf/util/parse-events.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
> index 4d042f1..f2ffe2c 100644
> --- a/tools/perf/util/parse-events.c
> +++ b/tools/perf/util/parse-events.c
> @@ -277,10 +277,15 @@ static int parse_event_symbols(const char *str, struct perf_counter_attr *attr)
> sep = strchr(pstr, ':');
> if (sep) {
> pstr = sep + 1;
> + attr->exclude_user = 1;
> + attr->exclude_kernel = 1;
> + attr->exclude_hv = 1;
> if (strchr(pstr, 'k'))
> - attr->exclude_user = 1;
> + attr->exclude_kernel = 0;
> if (strchr(pstr, 'u'))
> - attr->exclude_kernel = 1;
> + attr->exclude_user = 0;
> + if (strchr(pstr, 'h'))
> + attr->exclude_hv = 0;
> }

Hm, mind fixing the full range of problems with these flags please?

One problem is that things like:

--event cycles:u

don't work as expected: the u/k/h flags only work with numeric events,
which is a pity. Also, it would be nice to have a 'general' option
to specify the context mask for all events, in some straightforward
format like this:

--event-mask +u+k-h

Things like that. This bit is really not well developed right now.

Ingo
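
To make the suggested "+u+k-h" format concrete, here is a minimal sketch of
how such a string could be parsed. No --event-mask option exists in the code
under discussion, so everything here is an assumption: the struct context_mask
type, the parse_event_mask() name, and the reading of '+' as "count this
mode" and '-' as "exclude this mode".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the three exclude bits. */
struct context_mask {
	unsigned int exclude_user   : 1;
	unsigned int exclude_kernel : 1;
	unsigned int exclude_hv     : 1;
};

/*
 * Parse a mask such as "+u+k-h": each term is a sign followed by a mode
 * letter. '+' clears the matching exclude bit, '-' sets it. Bits for modes
 * that are not mentioned are left unchanged.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_event_mask(const char *s, struct context_mask *m)
{
	assert(s != NULL && m != NULL);

	while (*s) {
		unsigned int exclude;

		if (*s == '+')
			exclude = 0;
		else if (*s == '-')
			exclude = 1;
		else
			return -1;
		s++;

		switch (*s) {
		case 'u':
			m->exclude_user = exclude;
			break;
		case 'k':
			m->exclude_kernel = exclude;
			break;
		case 'h':
			m->exclude_hv = exclude;
			break;
		default:	/* unknown letter, or sign at end of string */
			return -1;
		}
		s++;
	}
	return 0;
}
```

Under these assumptions, "+u+k-h" would count user and kernel mode and exclude
hypervisor mode for every event the mask is applied to.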

2009-06-29 20:52:42

by Paul Mackerras

Subject: [tip:perfcounters/urgent] perf_counter tools: Reduce perf stat measurement overhead/skew

Commit-ID: 051ae7f7344f453616b6b10332d4d8e1d40ed823
Gitweb: http://git.kernel.org/tip/051ae7f7344f453616b6b10332d4d8e1d40ed823
Author: Paul Mackerras <[email protected]>
AuthorDate: Mon, 29 Jun 2009 21:13:21 +1000
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 29 Jun 2009 22:38:09 +0200

perf_counter tools: Reduce perf stat measurement overhead/skew

Vince Weaver reported a 'perf stat' measurement overhead in the
count of retired instructions, which can inflate the reported
count by around 6000 instructions.

At present, perf stat creates its counters on the perf process. Thus
the counters count the fork and various other activity in both the
parent and child, such as the resolver overhead for resolving PLT
entries for any libc functions that haven't been called before, such
as execvp.

This reduces the overhead by creating the counters on the child process
after the fork, using a couple of pipes to synchronize so that the
child process waits until the parent has created the counters before
doing the exec. To eliminate the PLT resolution overhead on calling
execvp, this does a dummy execvp first which will always fail.

With this, the overhead of executing a program goes down from over
4800 instructions to about 90 instructions on powerpc (32-bit).
This was measured with a statically-linked program written in
assembler which only does the 3 instructions needed to call _exit(0).

Before:

$ perf stat -e 0:1:u ./three

Performance counter stats for './three':

4858 instructions

0.001274523 seconds time elapsed

After:

$ perf stat -e 0:1:u ./three

Performance counter stats for './three':

92 instructions

0.000468153 seconds time elapsed

Reported-by: Vince Weaver <[email protected]>
Signed-off-by: Paul Mackerras <[email protected]>
Cc: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
tools/perf/builtin-stat.c | 64 +++++++++++++++++++++++++++++++++++----------
1 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c5a2907..201ef23 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -99,7 +99,7 @@ static u64 runtime_cycles_noise;
#define ERR_PERF_OPEN \
"Error: counter %d, sys_perf_counter_open() syscall returned with %d (%s)\n"

-static void create_perf_stat_counter(int counter)
+static void create_perf_stat_counter(int counter, int pid)
{
struct perf_counter_attr *attr = attrs + counter;

@@ -119,7 +119,7 @@ static void create_perf_stat_counter(int counter)
attr->inherit = inherit;
attr->disabled = 1;

- fd[0][counter] = sys_perf_counter_open(attr, 0, -1, -1, 0);
+ fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
if (fd[0][counter] < 0 && verbose)
fprintf(stderr, ERR_PERF_OPEN, counter,
fd[0][counter], strerror(errno));
@@ -205,12 +205,58 @@ static int run_perf_stat(int argc, const char **argv)
int status = 0;
int counter;
int pid;
+ int child_ready_pipe[2], go_pipe[2];
+ char buf;

if (!system_wide)
nr_cpus = 1;

+ if (pipe(child_ready_pipe) < 0 || pipe(go_pipe) < 0) {
+ perror("failed to create pipes");
+ exit(1);
+ }
+
+ if ((pid = fork()) < 0)
+ perror("failed to fork");
+
+ if (!pid) {
+ close(child_ready_pipe[0]);
+ close(go_pipe[1]);
+ fcntl(go_pipe[0], F_SETFD, FD_CLOEXEC);
+
+ /*
+ * Do a dummy execvp to get the PLT entry resolved,
+ * so we avoid the resolver overhead on the real
+ * execvp call.
+ */
+ execvp("", (char **)argv);
+
+ /*
+ * Tell the parent we're ready to go
+ */
+ close(child_ready_pipe[1]);
+
+ /*
+ * Wait until the parent tells us to go.
+ */
+ read(go_pipe[0], &buf, 1);
+
+ execvp(argv[0], (char **)argv);
+
+ perror(argv[0]);
+ exit(-1);
+ }
+
+ /*
+ * Wait for the child to be ready to exec.
+ */
+ close(child_ready_pipe[1]);
+ close(go_pipe[0]);
+ read(child_ready_pipe[0], &buf, 1);
+ close(child_ready_pipe[0]);
+
for (counter = 0; counter < nr_counters; counter++)
- create_perf_stat_counter(counter);
+ create_perf_stat_counter(counter, pid);

/*
* Enable counters and exec the command:
@@ -218,19 +264,9 @@ static int run_perf_stat(int argc, const char **argv)
t0 = rdclock();
prctl(PR_TASK_PERF_COUNTERS_ENABLE);

- if ((pid = fork()) < 0)
- perror("failed to fork");
-
- if (!pid) {
- if (execvp(argv[0], (char **)argv)) {
- perror(argv[0]);
- exit(-1);
- }
- }
-
+ close(go_pipe[1]);
wait(&status);

- prctl(PR_TASK_PERF_COUNTERS_DISABLE);
t1 = rdclock();

walltime_nsecs[run_idx] = t1 - t0;

2009-06-30 11:51:53

by Paul Mackerras

Subject: Re: [PATCH 1/2] perf_counter: tools: Make :u and :k exclude hypervisor

Ingo Molnar writes:

> Hm, mind fixing the full range of problems with these flags please?

Sure, looking at it now.

One thing I'd like to do is add complete lists of hardware events for
each processor so that perf can tell you the full set of things you
can measure, and can let you ask for them without having to know raw
event codes. I know how to work out which processor we're running on
for powerpc; for x86 I assume cpuid or something similar is usable
from userspace, is it?

Paul.

2009-06-30 11:57:43

by Ingo Molnar

Subject: Re: [PATCH 1/2] perf_counter: tools: Make :u and :k exclude hypervisor


* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > Hm, mind fixing the full range of problems with these flags
> > please?
>
> Sure, looking at it now.
>
> One thing I'd like to do is add complete lists of hardware events
> for each processor so that perf can tell you the full set of
> things you can measure, and can let you ask for them without
> having to know raw event codes. [...]

Excellent.

> [...] I know how to work out which processor we're running on for
> powerpc; for x86 I assume cpuid or something similar is usable
> from userspace, is it?

Yeah. /proc/cpuinfo can be read as well.

Ingo
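
As an illustration of the /proc/cpuinfo route, the sketch below pulls one
field out of a cpuinfo-style stream. The cpuinfo_field() helper is a
hypothetical name for this example, and the field names to look up are an
assumption that varies by architecture (x86 kernels typically expose
"vendor_id" and "model name").

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the value of the first "key : value" line found in a
 * /proc/cpuinfo-style stream into out (truncated to fit outlen).
 * Returns 0 on success, -1 if the key is not present.
 */
static int cpuinfo_field(FILE *f, const char *key, char *out, size_t outlen)
{
	char line[512];
	size_t keylen;

	assert(f != NULL && key != NULL && out != NULL && outlen > 0);
	keylen = strlen(key);

	while (fgets(line, sizeof(line), f)) {
		char *p;

		if (strncmp(line, key, keylen) != 0)
			continue;

		p = line + keylen;
		p += strspn(p, " \t");
		if (*p != ':')
			continue;	/* key was only a prefix of a longer name */
		p++;
		p += strspn(p, " \t");
		p[strcspn(p, "\n")] = '\0';

		snprintf(out, outlen, "%s", p);
		return 0;
	}
	return -1;
}
```

On Linux the stream would come from fopen("/proc/cpuinfo", "r"); the value
returned for a key such as "model name" could then be used to select a
per-processor event list.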