2010-12-12 13:47:53

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

Em Sun, Dec 12, 2010 at 02:15:25AM +0900, Hitoshi Mitake escreveu:
> BTW, I found that measuring the performance of a prefaulted memcpy()
> with perf stat is difficult, because current perf stat monitors
> either the whole execution of a program or the span of perf stat's own lifetime.

> It would be better if perf stat and the monitored program
> could interact and work synchronously.

> For example, if perf stat waited on a unix domain socket
> before create_perf_stat_counter() and the monitored program woke perf
> stat up through the socket, finer-grained monitoring would be possible.

> I imagine the execution will be like this:
> perf stat --wait-on /tmp/perf_wait perf bench mem memcpy --wake-up
> /tmp/perf_wait

> --wait-on is an imaginary option of perf stat, and the way of waking up
> perf stat is left to the monitored program (in this case, --wake-up is
> used for specifying the name of the socket).

> I'd like to implement such an option for perf stat; what do you think?

Looks interesting. It would also be interesting to be able to place
probes that wake it up, so that unmodified binaries get
something similar.

Other kinds of triggers could hook syscalls: when some expression
matches, like connecting to host 1.2.3.4, start monitoring, and
stop when the socket is closed, i.e. monitor a connection's lifetime, etc.

I think it is worth pursuing and encourage you to work on it :-)

- Arnaldo


2010-12-13 11:15:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

On Sun, 2010-12-12 at 11:46 -0200, Arnaldo Carvalho de Melo wrote:
> Em Sun, Dec 12, 2010 at 02:15:25AM +0900, Hitoshi Mitake escreveu:
> > BTW, I found that measuring performance of prefaulted memcpy()
> > with perf stat is difficult. Because current perf stat monitors
> > whole execution of program or range of perf stat lifetime.
>
> > If perf stat and monitored program can interact and work
> > synchronously, it will be better.
>
> > For example, if perf stat waits on the unix domain socket
> > before create_perf_stat_counter() and monitored program wakes perf stat
> > up through the socket, more fine grain monitoring will be possible.
>
> > I imagine the execution will be like this:
> > perf stat --wait-on /tmp/perf_wait perf bench mem memcpy --wake-up
> > /tmp/perf_wait
>
> > --wait-on is imaginaly option of perf stat, and the way of waking up
> > perf stat is left to monitored program (in this case, --wake-up is
> > used for specifying the name of the socket).
>
> > I'd like to implement such a option to perf stat, how do you think?
>
> Looks interesting, and also interesting would be to be able to place
> probes that would wake up it too, for unmodified binaries to have
> something similar.
>
> Other kinds of triggers may be to hook on syscalls and when some
> expression matches, like connecting to host 1.2.3.4, start monitoring,
> stop when the socket is closed, i.e. monitor a connection lifetime, etc.
>
> I think it is worth pursuing and encourage you to work on it :-)

Sounds to me like you want something like a library with self-monitoring
stuff.

2010-12-13 12:39:12

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
> On Sun, 2010-12-12 at 11:46 -0200, Arnaldo Carvalho de Melo wrote:
> > Looks interesting, and also interesting would be to be able to place
> > probes that would wake up it too, for unmodified binaries to have
> > something similar.

> > Other kinds of triggers may be to hook on syscalls and when some
> > expression matches, like connecting to host 1.2.3.4, start monitoring,
> > stop when the socket is closed, i.e. monitor a connection lifetime, etc.

> Sounds to me like you want something like a library with self-monitoring
> stuff.

Yeah, that could be a way: an LD_PRELOAD thingy that would intercept
library calls, set up counters, start a monitoring thread, etc.

Along the lines of:

http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob;f=libautocork.c

This one just intercepts calls, but the __init function could do the
rest.
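A minimal sketch of such an interposer (hypothetical and not libautocork itself; the file and function names are made up) could intercept connect(), which is also where the connection-lifetime trigger discussed earlier in the thread would hook in:

```c
/* interpose.c: a hypothetical LD_PRELOAD interposer sketch.
 * Build: gcc -shared -fPIC interpose.c -o interpose.so -ldl
 * Use:   LD_PRELOAD=./interpose.so some-binary
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int (*real_connect)(int, const struct sockaddr *, socklen_t);

/* Runs before the monitored program's main(): the place where counters
 * could be set up and a monitoring thread started. */
__attribute__((constructor))
static void interposer_init(void)
{
	real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
			dlsym(RTLD_NEXT, "connect");
}

/* Exposed so callers (or tests) can check that resolution worked. */
int interposer_ready(void)
{
	return real_connect != NULL;
}

/* The interposed call: a natural trigger point, e.g. enable counters
 * when addr matches some expression, disable them on close(). */
int connect(int fd, const struct sockaddr *addr, socklen_t len)
{
	if (!real_connect)
		real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
				dlsym(RTLD_NEXT, "connect");
	fprintf(stderr, "connect() on fd %d intercepted\n", fd);
	return real_connect(fd, addr, len);
}
```

The same dlsym(RTLD_NEXT, ...) pattern works for any library call the monitor wants to treat as a trigger.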

To make it easier we could move the counter setup we have in record/top
to a library, etc.

- Arnaldo

2010-12-13 12:41:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

On Mon, 2010-12-13 at 10:38 -0200, Arnaldo Carvalho de Melo wrote:
> Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
> > On Sun, 2010-12-12 at 11:46 -0200, Arnaldo Carvalho de Melo wrote:
> > > Looks interesting, and also interesting would be to be able to place
> > > probes that would wake up it too, for unmodified binaries to have
> > > something similar.
>
> > > Other kinds of triggers may be to hook on syscalls and when some
> > > expression matches, like connecting to host 1.2.3.4, start monitoring,
> > > stop when the socket is closed, i.e. monitor a connection lifetime, etc.
>
> > Sounds to me like you want something like a library with self-monitoring
> > stuff.
>
> Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
> library calls, setup counters, start a monitoring thread, etc.
>
> Along the lines of:
>
> http://git.kernel.org/?p=linux/kernel/git/acme/libautocork.git;a=blob;f=libautocork.c
>
> This one just intercepts calls, but the __init function could do the
> rest.
>
> To make it easier we could move the counter setup we have in record/top
> to a library, etc.

Nah, I was more thinking of something along the lines of libPAPI and
libpfmon. A library that contains the needed building blocks for apps to
profile themselves.

2010-12-13 13:12:45

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

Em Mon, Dec 13, 2010 at 01:40:59PM +0100, Peter Zijlstra escreveu:
> On Mon, 2010-12-13 at 10:38 -0200, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
> > > Sounds to me like you want something like a library with self-monitoring
> > > stuff.

> > Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
> > library calls, setup counters, start a monitoring thread, etc.

> > To make it easier we could move the counter setup we have in record/top
> > to a library, etc.
>
> Nah, I was more thinking of something along the lines of libPAPI and
> libpfmon. A library that contains the needed building blocks for apps to
> profile themselves.

Ok, you mean the case where you can modify the app; I was thinking
about when you can't.

In both cases it's good to move the counter creation etc. routines from
record/top to a library, which could then be used in the way you mention
and in the way I mention too. Two different use cases :-)

- Arnaldo

2010-12-13 17:37:27

by Hitoshi Mitake

[permalink] [raw]
Subject: Re: perf monitoring triggers Was: Re: [tip:perf/core] perf bench: Print both of prefaulted and no prefaulted results by default

On 2010-12-13 22:12, Arnaldo Carvalho de Melo wrote:
> Em Mon, Dec 13, 2010 at 01:40:59PM +0100, Peter Zijlstra escreveu:
>> On Mon, 2010-12-13 at 10:38 -0200, Arnaldo Carvalho de Melo wrote:
>>> Em Mon, Dec 13, 2010 at 12:14:33PM +0100, Peter Zijlstra escreveu:
>>>> Sounds to me like you want something like a library with self-monitoring
>>>> stuff.
>
>>> Yeah, that could be a way, an LD_PRELOAD thingy that would intercept
>>> library calls, setup counters, start a monitoring thread, etc.
>
>>> To make it easier we could move the counter setup we have in record/top
>>> to a library, etc.
>>
>> Nah, I was more thinking of something along the lines of libPAPI and
>> libpfmon. A library that contains the needed building blocks for apps to
>> profile themselves.
>
> Ok, you mean for the case where you can modify the app, I was thinking
> about when you can't.
>
> In both cases its good to move the counter creation, etc routines from
> record/top to a lib, that then could be used in the way you mention, and
> in the way I mention too. Two different usecases :-)

Thanks for your comments, Arnaldo, Peter.

I implemented the basic feature of my proposal,
and found that making perf stat and the benchmark program communicate
via a socket is really dirty. As you said, a unified form
(interception for unmodified binaries and a library for modifiable ones)
would be ideal for fine-grained monitoring.

But I believe that measuring the performance of some kinds of code,
like in-kernel routines, requires even finer-grained perf stat runs,
so I'll pursue the unified way.

Anyway, I'll send my proof of concept patch later.

Thanks,
Hitoshi

2010-12-14 05:47:08

by Hitoshi Mitake

[permalink] [raw]
Subject: [RFC PATCH 2/2] perf bench: more fine grain monitoring for prefault memcpy()

This patch makes perf bench mem memcpy use the new feature of perf stat.

The new option --wake-up takes the path of a unix domain socket.
If --only-prefault or --no-prefault is specified, the benchmark writes its
own pid to this socket before the memcpy() to be monitored, and reads the
pid of perf stat from it. The pid of perf stat is used for signaling perf
stat to terminate monitoring.

With this feature, detailed performance monitoring of a prefaulted
(or non-prefaulted only) memcpy() becomes possible.

Example of use, non-prefaulted version:
| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
|

Once started, perf stat waits for the pid...

| Performance counter stats for process id '27109':
|
| 440.534943 task-clock-msecs # 0.997 CPUs
| 44 context-switches # 0.000 M/sec
| 5 CPU-migrations # 0.000 M/sec
| 256,002 page-faults # 0.581 M/sec
| 934,443,072 cycles # 2121.155 M/sec
| 780,408,435 instructions # 0.835 IPC
| 111,756,558 branches # 253.684 M/sec
| 392,170 branch-misses # 0.351 %
| 8,611,308 cache-references # 19.547 M/sec
| 8,533,588 cache-misses # 19.371 M/sec
|
| 0.441803031 seconds time elapsed

in another shell,

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf bench mem memcpy -l 500MB --no-prefault -w /tmp/perf-stat-wait
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 1.105722 GB/Sec

Example of use, prefaulted version:

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf stat -w /tmp/perf-stat-wait
| Performance counter stats for process id '27112':
|
| 105.001542 task-clock-msecs # 0.997 CPUs
| 11 context-switches # 0.000 M/sec
| 0 CPU-migrations # 0.000 M/sec
| 2 page-faults # 0.000 M/sec
| 223,273,425 cycles # 2126.382 M/sec
| 197,992,585 instructions # 0.887 IPC
| 16,657,288 branches # 158.639 M/sec
| 1,942 branch-misses # 0.012 %
| 3,105,619 cache-references # 29.577 M/sec
| 3,082,390 cache-misses # 29.356 M/sec
|
| 0.105316101 seconds time elapsed

in another shell,

| mitake@x201i:~/linux/.../tools/perf% sudo ./perf bench mem memcpy -l 500MB --only-prefault -w /tmp/perf-stat-wait
| # Running mem/memcpy benchmark...
| # Copying 500MB Bytes ...
|
| 4.640927 GB/Sec (with prefault)

The results show the difference between the non-prefaulted memcpy() and the
prefaulted one. This will be useful for detailed performance analysis of
various memcpy() implementations, like Miao Xie's one and the rep prefix version.

But this is too ad hoc and dirty... :(

Cc: Miao Xie <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Hitoshi Mitake <[email protected]>
---
tools/perf/bench/mem-memcpy.c | 56 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 56 insertions(+), 0 deletions(-)

diff --git a/tools/perf/bench/mem-memcpy.c b/tools/perf/bench/mem-memcpy.c
index ac88f52..7d0bcea 100644
--- a/tools/perf/bench/mem-memcpy.c
+++ b/tools/perf/bench/mem-memcpy.c
@@ -21,6 +21,10 @@
#include <errno.h>
#include <unistd.h>

+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+
#define K 1024

static const char *length_str = "1MB";
@@ -31,6 +35,7 @@ static bool only_prefault;
static bool no_prefault;
static int src_align;
static int dst_align;
+static const char *wake_path;

static const struct option options[] = {
OPT_STRING('l', "length", &length_str, "1MB",
@@ -48,6 +53,9 @@ static const struct option options[] = {
"Alignment of source memory region (in byte)"),
OPT_INTEGER('d', "dst-alignment", &dst_align,
"Alignment of destination memory region (in byte)"),
+ OPT_STRING('w', "wake-up", &wake_path, "default",
+ "Path of unix domain socket for waking up perf stat"
+ " (use with only_prefault option)"),
OPT_END()
};

@@ -116,6 +124,33 @@ static double timeval2double(struct timeval *ts)
(double)ts->tv_usec / (double)1000000;
}

+static pid_t perf_stat_pid;
+
+static void wake_up_perf_stat(void)
+{
+ int wake_fd;
+ struct sockaddr_un wake_addr;
+ pid_t myself = getpid();
+
+ wake_fd = socket(PF_UNIX, SOCK_STREAM, 0);
+ if (wake_fd < 0)
+ die("unable to create socket for sync\n");
+
+ memset(&wake_addr, 0, sizeof(wake_addr));
+ wake_addr.sun_family = PF_UNIX;
+ strncpy(wake_addr.sun_path, wake_path, sizeof(wake_addr.sun_path));
+
+ if (connect(wake_fd, (struct sockaddr *)&wake_addr, sizeof(wake_addr)))
+ die("connect() failed\n");
+
+ if (write(wake_fd, &myself, sizeof(myself)) != sizeof(myself))
+ die("write() my pid to socket failed\n");
+
+ if (read(wake_fd, &perf_stat_pid, sizeof(perf_stat_pid))
+ != sizeof(perf_stat_pid))
+ die("read() pid of perf stat from socket\n");
+}
+
static void alloc_mem(void **dst, void **src, size_t length)
{
int ret;
@@ -139,10 +174,16 @@ static u64 do_memcpy_clock(memcpy_t fn, size_t len, bool prefault)
if (prefault)
fn(dst + dst_align, src + src_align, len);

+ if (wake_path)
+ wake_up_perf_stat();
+
clock_start = get_clock();
fn(dst + dst_align, src + src_align, len);
clock_end = get_clock();

+ if (wake_path) /* kill perf stat */
+ kill(perf_stat_pid, SIGINT);
+
free(src);
free(dst);
return clock_end - clock_start;
@@ -158,12 +199,18 @@ static double do_memcpy_gettimeofday(memcpy_t fn, size_t len, bool prefault)
if (prefault)
fn(dst + dst_align, src + src_align, len);

+ if (wake_path)
+ wake_up_perf_stat();
+
BUG_ON(gettimeofday(&tv_start, NULL));
fn(dst + dst_align, src + src_align, len);
BUG_ON(gettimeofday(&tv_end, NULL));

timersub(&tv_end, &tv_start, &tv_diff);

+ if (wake_path) /* kill perf stat */
+ kill(perf_stat_pid, SIGINT);
+
free(src);
free(dst);
return (double)((double)len / timeval2double(&tv_diff));
@@ -235,6 +282,15 @@ int bench_mem_memcpy(int argc, const char **argv,

if (!only_prefault && !no_prefault) {
/* show both of results */
+ if (wake_path) {
+ fprintf(stderr, "Meaningless combination of option, "
+ "you should not use wake_path alone.\n"
+ "Use it with --only-prefault"
+ " or --no-prefault\n");
+ return 1;
+ }
+
+
if (use_clock) {
result_clock[0] =
do_memcpy_clock(routines[i].fn, len, false);
--
1.7.3.3

2010-12-14 05:47:06

by Hitoshi Mitake

[permalink] [raw]
Subject: [RFC PATCH 1/2] perf stat: wait on unix domain socket before calling sys_perf_event_open()

This patch adds a new option, "--wait-on", to perf stat.

Current perf stat can monitor
1) the lifetime of a program specified as a command line argument, or
2) perf stat's own lifetime, where the target process is specified by pid
and the end of monitoring is triggered with a signal.
1) is too coarse-grained, and with 2) it is difficult to delimit the range to monitor.

This patch makes it possible to wait before sys_perf_event_open().
The monitored process can wake perf stat up via a unix domain socket,
and terminate monitoring via a signal.

The new option --wait-on takes a string, the path of the unix domain socket.
perf stat reads the target_pid from the socket; the monitored program
should write its own pid to it.
perf stat replies with its own pid, and the monitored program should send
SIGINT to that pid. Monitoring is then terminated.

I feel the current implementation is really dirty. As Arnaldo and Peter
suggested, a more unified way, like interception or a self-monitoring
library, would be ideal.
This is a proof-of-concept version; I'd like to hear your comments.
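The handshake above can be sketched end to end; the following is a self-contained demo (socket path and function names are made up, and a stand-in plays perf stat's role), not the patch itself:

```c
/* Self-contained sketch of the pid handshake: a stand-in for perf stat
 * binds and accepts, the "benchmark" connects, each side learns the
 * other's pid. Path and names are hypothetical. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define WAIT_PATH "/tmp/perf-wait-demo"

/* perf stat's side: read the benchmark's pid (the future target_pid),
 * reply with our own pid so the benchmark can SIGINT us later. */
static pid_t stat_side_once(void)
{
	struct sockaddr_un addr;
	pid_t peer = -1, self = getpid();
	int fd, cfd;

	fd = socket(PF_UNIX, SOCK_STREAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = PF_UNIX;
	strncpy(addr.sun_path, WAIT_PATH, sizeof(addr.sun_path) - 1);
	unlink(WAIT_PATH);
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) || listen(fd, 1))
		return -1;
	cfd = accept(fd, NULL, NULL);
	if (read(cfd, &peer, sizeof(peer)) != sizeof(peer))
		peer = -1;
	if (write(cfd, &self, sizeof(self)) != sizeof(self))
		peer = -1;
	close(cfd);
	close(fd);
	unlink(WAIT_PATH);
	return peer;
}

/* The benchmark's side: announce our pid, learn perf stat's pid. */
static pid_t bench_side_once(void)
{
	struct sockaddr_un addr;
	pid_t self = getpid(), stat_pid = -1;
	int fd = socket(PF_UNIX, SOCK_STREAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = PF_UNIX;
	strncpy(addr.sun_path, WAIT_PATH, sizeof(addr.sun_path) - 1);
	while (connect(fd, (struct sockaddr *)&addr, sizeof(addr)))
		usleep(1000);	/* retry until the stand-in has bound */
	if (write(fd, &self, sizeof(self)) < 0)
		stat_pid = -1;
	if (read(fd, &stat_pid, sizeof(stat_pid)) != sizeof(stat_pid))
		stat_pid = -1;
	close(fd);
	return stat_pid;
}
```

In the real patches, stat_side_once() corresponds to the accept/read/write sequence added to run_perf_stat(), and bench_side_once() to wake_up_perf_stat() in the benchmark.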

Cc: Miao Xie <[email protected]>
Cc: Ma Ling <[email protected]>
Cc: Zhao Yakui <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Hitoshi Mitake <[email protected]>
---
tools/perf/builtin-stat.c | 63 ++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7ff746d..4cc10a1 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -51,6 +51,8 @@
#include <sys/prctl.h>
#include <math.h>
#include <locale.h>
+#include <sys/socket.h>
+#include <sys/un.h>

#define DEFAULT_SEPARATOR " "

@@ -90,11 +92,15 @@ static const char *cpu_list;
static const char *csv_sep = NULL;
static bool csv_output = false;

+static const char *wait_path;

static int *fd[MAX_NR_CPUS][MAX_COUNTERS];

static int event_scaled[MAX_COUNTERS];

+static int wait_fd = -1;
+static struct sockaddr_un wait_addr;
+
static struct {
u64 val;
u64 ena;
@@ -342,7 +348,7 @@ static int run_perf_stat(int argc __used, const char **argv)
unsigned long long t0, t1;
int status = 0;
int counter, ncreated = 0;
- int child_ready_pipe[2], go_pipe[2];
+ int child_ready_pipe[2], go_pipe[2], accepted_fd;
bool perm_err = false;
const bool forks = (argc > 0);
char buf;
@@ -401,6 +407,43 @@ static int run_perf_stat(int argc __used, const char **argv)
close(child_ready_pipe[0]);
}

+ if (wait_path) {
+ int sock_err;
+ struct sockaddr accepted_addr;
+ socklen_t accepted_len = sizeof(accepted_addr);
+
+ wait_fd = socket(PF_UNIX, SOCK_STREAM, 0);
+ if (wait_fd < 0)
+ die("unable to create socket for sync\n");
+
+ memset(&wait_addr, 0, sizeof(wait_addr));
+ wait_addr.sun_family = PF_UNIX;
+ strncpy(wait_addr.sun_path, wait_path,
+ sizeof(wait_addr.sun_path));
+
+ sock_err = bind(wait_fd, (struct sockaddr *)&wait_addr,
+ sizeof(wait_addr));
+ if (sock_err < 0)
+ die("bind() failed\n");
+
+ sock_err = listen(wait_fd, 1);
+ if (sock_err < 0)
+ die("listen() failed\n");
+
+ accepted_fd = accept(wait_fd, &accepted_addr, &accepted_len);
+ if (accepted_fd < 0)
+ die("accept() failed\n");
+
+ if (read(accepted_fd, &target_pid, sizeof(target_pid))
+ != sizeof(target_pid))
+ die("read() pid from socket failed\n");
+
+ target_tid = target_pid;
+ thread_num = find_all_tid(target_pid, &all_tids);
+ if (thread_num <= 0)
+ die("couldn't find threads of %d\n", target_pid);
+ }
+
for (counter = 0; counter < nr_counters; counter++)
ncreated += create_perf_stat_counter(counter, &perm_err);

@@ -425,6 +468,14 @@ static int run_perf_stat(int argc __used, const char **argv)
close(go_pipe[1]);
wait(&status);
} else {
+ if (wait_path) {
+ pid_t myself = getpid();
+ if (write(accepted_fd, &myself, sizeof(myself))
+ != sizeof(myself))
+ die("write() my pid failed\n");
+ close(accepted_fd);
+ }
+
while(!done) sleep(1);
}

@@ -670,6 +721,9 @@ static void sig_atexit(void)
if (signr == -1)
return;

+ if (wait_path)
+ unlink(wait_path);
+
signal(signr, SIG_DFL);
kill(getpid(), signr);
}
@@ -715,6 +769,8 @@ static const struct option options[] = {
"disable CPU count aggregation"),
OPT_STRING('x', "field-separator", &csv_sep, "separator",
"print counts with custom separator"),
+ OPT_STRING('w', "wait-on", &wait_path, "path",
+ "path of unix domain socket to wait on"),
OPT_END()
};

@@ -746,7 +802,7 @@ int cmd_stat(int argc, const char **argv, const char *prefix __used)
} else if (big_num_opt == 0) /* User passed --no-big-num */
big_num = false;

- if (!argc && target_pid == -1 && target_tid == -1)
+ if (!argc && target_pid == -1 && target_tid == -1 && !wait_path)
usage_with_options(stat_usage, options);
if (run_count <= 0)
usage_with_options(stat_usage, options);
@@ -769,7 +825,8 @@ int cmd_stat(int argc, const char **argv, const char *prefix __used)
if (nr_cpus < 1)
usage_with_options(stat_usage, options);

- if (target_pid != -1) {
+ /* if wait_path is specified, we read pid to monitor from it later */
+ if (target_pid != -1 && !wait_path) {
target_tid = target_pid;
thread_num = find_all_tid(target_pid, &all_tids);
if (thread_num <= 0) {
--
1.7.3.3