This follows an old patchset of mine called context exclusion, which
implemented a suggestion from Ingo to exclude interrupt contexts from
counting/sampling on chosen events. He also suggested using the
function graph tracer to count/sample events inside a function's
boundaries.
All in all, this is about events that activate/deactivate other
events. So it sounds quite natural to use what we already use to
abstract events: perf events, in order to start/stop other perf
events. Maybe that was also suggested by Ingo or someone else, I
can't remember.
This is implemented using a pair of events: a starter and a stopper.
One can attach one starter and one stopper to a target event.
When the starter generates an event, it starts the target, making it
count and sample. When the stopper generates an event, it stops the
target. This combines with an attribute called "enable_on_starter"
that puts the event into a paused state, even when it is scheduled,
until the starter starts it for the first time.
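Schematically, the pairing boils down to this (pmu->start()/pmu->stop()
being the lightweight kernel callbacks this series hooks into the
starter's and stopper's overflow path):

    starter overflows -> pmu->start(target): target counts/samples
    stopper overflows -> pmu->stop(target): target stays paused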
We have three new options:
* --starter and --stopper. These follow the target event
and must be followed by the index of the starter/stopper event
in the command line.
* --enable-on-starter creates the event in paused mode; it won't
be started until the starter triggers.
Some examples:
- Trace lock events inside irqs
$ perf record -e irq:irq_handler_entry -e irq:irq_handler_exit \
-e lock:lock_acquire --starter 0 --stopper 1 --enable-on-starter \
perf bench sched messaging
$ perf script
perf-4300 [000] 2000.209721: irq_handler_entry: irq=40 name=ahci
perf-4300 [000] 2000.209725: lock_acquire: 0xffff880037851c40 &(&host->lock)->rlock
perf-4300 [000] 2000.209731: irq_handler_exit: irq=40 ret=handled
perf-4300 [000] 2000.209821: irq_handler_entry: irq=40 name=ahci
perf-4300 [000] 2000.209823: lock_acquire: 0xffff880037851c40 &(&host->lock)->rlock
perf-4300 [000] 2000.209828: irq_handler_exit: irq=40 ret=handled
- Count instructions inside softirqs, outside softirqs and everywhere:
$ perf stat -e irq:softirq_entry -e irq:softirq_exit \
-e instructions --starter 0 --stopper 1 --enable-on-starter \
-e instructions --starter 1 --stopper 0 \
-e instructions ./perf bench sched messaging
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 1.056 [sec]
Performance counter stats for './perf bench sched messaging':
114 irq:softirq_entry
114 irq:softirq_exit
821 503 instructions # 0,000 IPC <-- inside softirq
243 985 460 instructions # 0,000 IPC <-- outside softirq
244 594 383 instructions # 0,000 IPC <-- all
1,157462964 seconds time elapsed
The total instruction count should equal the sum of the inside and
outside counts. In practice there is always a small delta due to noise:
here 821 503 + 243 985 460 = 244 806 963, i.e. 212 580 more than the
244 594 383 counted by the third event.
There is another example with lock events in the last patch.
It's supposed to support arbitrary combinations, with starters having
starters themselves, plus filters, etc...
Some limitations:
* An event can only have one starter and one stopper. This can be
changed if necessary. Allowing more would let us take the union of
custom contexts.
* Starters and stoppers must be in the same context (task + cpu) as
the target. That can probably be changed too, more or less easily.
* There is a lot of code duplication. But I wanted to know whether
this direction is worth pursuing before cleaning up.
Adventurers can fetch from:
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git
perf/custom-ctx-v1
This is on top of three previous patches I posted lately.
Thanks.
Frederic Weisbecker (4):
perf: Starter and stopper events
perf: New enable_on_starter attribute
perf: Support for starter and stopper in tools
perf: New --enable-on-starter option
include/linux/perf_event.h | 14 ++-
kernel/perf_event.c | 298 +++++++++++++++++++++++++++++++++++++---
tools/perf/builtin-record.c | 20 +++-
tools/perf/builtin-stat.c | 22 +++-
tools/perf/util/evlist.c | 12 +-
tools/perf/util/evlist.h | 4 +-
tools/perf/util/evsel.c | 53 +++++++
tools/perf/util/evsel.h | 13 ++
tools/perf/util/parse-events.c | 78 +++++++++++
tools/perf/util/parse-events.h | 3 +
10 files changed, 485 insertions(+), 32 deletions(-)
--
1.7.3.2
Add --starter and --stopper options to perf record and perf stat.
These options must follow the event that is to be the target of the
starter/stopper, and must be followed by the index of the desired
starter/stopper event in the command line, starting from 0.
For example in:
perf record -e irq:softirq_entry -e irq:softirq_exit \
-e lock:lock_acquire -a
If we want to count/sample the lock acquire event only outside
softirqs, we take softirq_exit as the starter and softirq_entry as
the stopper. Thus the stopper is event 0 and the starter is event 1,
which leads to the following command line:
perf record -e irq:softirq_entry -e irq:softirq_exit \
-e lock:lock_acquire --starter 1 --stopper 0 -a
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
---
tools/perf/builtin-record.c | 18 +++++++++++-
tools/perf/builtin-stat.c | 20 ++++++++++++-
tools/perf/util/evlist.c | 12 +++-----
tools/perf/util/evlist.h | 4 +-
tools/perf/util/evsel.c | 53 ++++++++++++++++++++++++++++++++++++
tools/perf/util/evsel.h | 13 +++++++++
tools/perf/util/parse-events.c | 58 ++++++++++++++++++++++++++++++++++++++++
tools/perf/util/parse-events.h | 2 +
8 files changed, 167 insertions(+), 13 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 6febcc1..7cc690e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -323,12 +323,24 @@ try_again:
}
}
- if (perf_evlist__set_filters(evlist)) {
+ if (perf_evlist__for_each_evsel(evlist, perf_evsel__set_filter)) {
error("failed to set filter with %d (%s)\n", errno,
strerror(errno));
exit(-1);
}
+ if (perf_evlist__for_each_evsel(evlist, perf_evsel__set_starter)) {
+ error("failed to set starter with %d (%s)\n", errno,
+ strerror(errno));
+ exit(-1);
+ }
+
+ if (perf_evlist__for_each_evsel(evlist, perf_evsel__set_stopper)) {
+ error("failed to set stopper with %d (%s)\n", errno,
+ strerror(errno));
+ exit(-1);
+ }
+
if (perf_evlist__mmap(evlist, mmap_pages, false) < 0)
die("failed to mmap with %d (%s)\n", errno, strerror(errno));
@@ -729,6 +741,10 @@ const struct option record_options[] = {
parse_events),
OPT_CALLBACK(0, "filter", &evsel_list, "filter",
"event filter", parse_filter),
+ OPT_CALLBACK(0, "starter", &evsel_list, "starter",
+ "event starter", parse_starter),
+ OPT_CALLBACK(0, "stopper", &evsel_list, "stopper",
+ "event stopper", parse_stopper),
OPT_INTEGER('p', "pid", &target_pid,
"record events on existing process id"),
OPT_INTEGER('t', "tid", &target_tid,
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index e2109f9..ba89a4d 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -333,8 +333,20 @@ static int run_perf_stat(int argc __used, const char **argv)
}
}
- if (perf_evlist__set_filters(evsel_list)) {
- error("failed to set filter with %d (%s)\n", errno,
+ if (perf_evlist__for_each_evsel(evsel_list, perf_evsel__set_filter)) {
+ pr_err("failed to set filter with %d (%s)\n", errno,
+ strerror(errno));
+ return -1;
+ }
+
+ if (perf_evlist__for_each_evsel(evsel_list, perf_evsel__set_starter)) {
+ pr_err("failed to set starter with %d (%s)\n", errno,
+ strerror(errno));
+ return -1;
+ }
+
+ if (perf_evlist__for_each_evsel(evsel_list, perf_evsel__set_stopper)) {
+ pr_err("failed to set stopper with %d (%s)\n", errno,
strerror(errno));
return -1;
}
@@ -642,6 +654,10 @@ static const struct option options[] = {
parse_events),
OPT_CALLBACK(0, "filter", &evsel_list, "filter",
"event filter", parse_filter),
+ OPT_CALLBACK(0, "starter", &evsel_list, "starter",
+ "event starter", parse_starter),
+ OPT_CALLBACK(0, "stopper", &evsel_list, "stopper",
+ "event stopper", parse_stopper),
OPT_BOOLEAN('i', "no-inherit", &no_inherit,
"child tasks do not inherit counters"),
OPT_INTEGER('p', "pid", &target_pid,
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index d852cef..b70eb92 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -365,27 +365,23 @@ void perf_evlist__delete_maps(struct perf_evlist *evlist)
evlist->threads = NULL;
}
-int perf_evlist__set_filters(struct perf_evlist *evlist)
+int perf_evlist__for_each_evsel(struct perf_evlist *evlist,
+ int (*call)(struct perf_evsel *, int, int))
{
const struct thread_map *threads = evlist->threads;
const struct cpu_map *cpus = evlist->cpus;
struct perf_evsel *evsel;
- char *filter;
int thread;
int cpu;
int err;
- int fd;
list_for_each_entry(evsel, &evlist->entries, node) {
- filter = evsel->filter;
- if (!filter)
- continue;
for (cpu = 0; cpu < cpus->nr; cpu++) {
for (thread = 0; thread < threads->nr; thread++) {
- fd = FD(evsel, cpu, thread);
- err = ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter);
+ err = call(evsel, cpu, thread);
if (err)
return err;
+
}
}
}
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 8b1cb7a..1262e15 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -63,6 +63,6 @@ static inline void perf_evlist__set_maps(struct perf_evlist *evlist,
int perf_evlist__create_maps(struct perf_evlist *evlist, pid_t target_pid,
pid_t target_tid, const char *cpu_list);
void perf_evlist__delete_maps(struct perf_evlist *evlist);
-int perf_evlist__set_filters(struct perf_evlist *evlist);
-
+int perf_evlist__for_each_evsel(struct perf_evlist *evlist,
+ int (*call)(struct perf_evsel *, int, int));
#endif /* __PERF_EVLIST_H */
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 662596a..7b2ba9f 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -21,6 +21,8 @@ void perf_evsel__init(struct perf_evsel *evsel,
evsel->idx = idx;
evsel->attr = *attr;
INIT_LIST_HEAD(&evsel->node);
+ INIT_LIST_HEAD(&evsel->starter_list);
+ INIT_LIST_HEAD(&evsel->stopper_list);
}
struct perf_evsel *perf_evsel__new(struct perf_event_attr *attr, int idx)
@@ -395,3 +397,54 @@ int perf_event__parse_sample(const union perf_event *event, u64 type,
return 0;
}
+
+int perf_evsel__set_filter(struct perf_evsel *evsel, int cpu,
+ int thread)
+{
+ char *filter;
+ int fd;
+
+ filter = evsel->filter;
+ if (!filter)
+ return 0;
+
+ fd = FD(evsel, cpu, thread);
+
+ return ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter);
+}
+
+int perf_evsel__set_starter(struct perf_evsel *evsel, int cpu,
+ int thread)
+{
+ struct perf_evsel *target;
+ int fd, fd_target;
+ int ret = 0;
+
+ list_for_each_entry(target, &evsel->starter_list, starter_entry) {
+ fd = FD(evsel, cpu, thread);
+ fd_target = FD(target, cpu, thread);
+ ret = ioctl(fd, PERF_EVENT_IOC_SET_STARTER, fd_target);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
+
+int perf_evsel__set_stopper(struct perf_evsel *evsel, int cpu,
+ int thread)
+{
+ struct perf_evsel *target;
+ int fd, fd_target;
+ int ret = 0;
+
+ list_for_each_entry(target, &evsel->stopper_list, stopper_entry) {
+ fd = FD(evsel, cpu, thread);
+ fd_target = FD(target, cpu, thread);
+ ret = ioctl(fd, PERF_EVENT_IOC_SET_STOPPER, fd_target);
+ if (ret)
+ break;
+ }
+
+ return ret;
+}
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 6710ab5..e35d01f 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -61,6 +61,10 @@ struct perf_evsel {
off_t id_offset;
};
struct cgroup_sel *cgrp;
+ struct list_head starter_list;
+ struct list_head starter_entry;
+ struct list_head stopper_list;
+ struct list_head stopper_entry;
};
struct cpu_map;
@@ -123,6 +127,15 @@ static inline int perf_evsel__read_on_cpu_scaled(struct perf_evsel *evsel,
int __perf_evsel__read(struct perf_evsel *evsel, int ncpus, int nthreads,
bool scale);
+int perf_evsel__set_filter(struct perf_evsel *evsel, int cpu,
+ int thread);
+
+int perf_evsel__set_starter(struct perf_evsel *evsel, int cpu,
+ int thread);
+
+int perf_evsel__set_stopper(struct perf_evsel *evsel, int cpu,
+ int thread);
+
/**
* perf_evsel__read - Read the aggregate results on all CPUs
*
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 54a7e26..7dd494f 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -845,6 +845,64 @@ int parse_filter(const struct option *opt, const char *str,
return 0;
}
+static int parse_starter_stopper(const struct option *opt,
+ const char *str, int starter)
+{
+ struct perf_evlist *evlist = *(struct perf_evlist **)opt->value;
+ struct perf_evsel *last = NULL, *trigger = NULL;
+ char *end;
+ unsigned long i = 0;
+ int found = 0;
+ unsigned long idx;
+
+ if (evlist->nr_entries > 0)
+ last = list_entry(evlist->entries.prev, struct perf_evsel, node);
+
+ if (last == NULL) {
+ fprintf(stderr, "--starter/--stopper options should follow a -e tracepoint option\n");
+ return -1;
+ }
+
+ idx = strtoul(str, &end, 10);
+ if (str == end) {
+ //FIXME: Clarify that message, and fix above error handling
+ fprintf(stderr, "--starter/--stopper options should be followed by an event index\n");
+ return -1;
+ }
+
+ list_for_each_entry(trigger, &evlist->entries, node) {
+ if (i++ == idx) {
+ found = 1;
+ break;
+ }
+ }
+
+ if (!found) {
+ fprintf(stderr, "--starter/--stopper should be followed by a number "
+ "matching the nth event from the command line\n");
+ return -1;
+ }
+
+ if (starter)
+ list_add_tail(&last->starter_entry, &trigger->starter_list);
+ else
+ list_add_tail(&last->stopper_entry, &trigger->stopper_list);
+
+ return 0;
+}
+
+int parse_starter(const struct option *opt, const char *str,
+ int unset __used)
+{
+ return parse_starter_stopper(opt, str, 1);
+}
+
+int parse_stopper(const struct option *opt, const char *str,
+ int unset __used)
+{
+ return parse_starter_stopper(opt, str, 0);
+}
+
static const char * const event_type_descriptors[] = {
"Hardware event",
"Software event",
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 212f88e..72d73b5 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -25,6 +25,8 @@ extern const char *__event_name(int type, u64 config);
extern int parse_events(const struct option *opt, const char *str, int unset);
extern int parse_filter(const struct option *opt, const char *str, int unset);
+extern int parse_starter(const struct option *opt, const char *str, int unset);
+extern int parse_stopper(const struct option *opt, const char *str, int unset);
#define EVENTS_HELP_MAX (128*1024)
--
1.7.3.2
The new --enable-on-starter option can be passed to perf record and
perf stat so that the event that precedes it only gets started once
its starter fires. It won't count or sample before that.
Starter/stopper events without --enable-on-starter are typically used
to count everywhere except inside a pair of event boundaries, whereas
--enable-on-starter is used to count only inside the pair of event
boundaries.
Counting only outside softirqs:
perf stat -e irq:softirq_exit -e irq:softirq_entry \
-e instructions --starter 0 --stopper 1
Counting only inside softirqs:
perf stat -e irq:softirq_entry -e irq:softirq_exit \
-e instructions --starter 0 --stopper 1 --enable-on-starter
The following is a more complicated example, with three instruction
counters:
* One that counts instructions only when we hold the runqueue lock
* One that counts instructions only when we don't hold the
runqueue lock
* One that always counts instructions
./perf stat -e lock:lock_acquire --filter "name==\"&rq->lock\"" \
-e lock:lock_release --filter "name==\"&rq->lock\"" \
-e instructions --starter 0 --stopper 1 --enable-on-starter \
-e instructions --starter 1 --stopper 0 -e instructions ./perf bench sched messaging
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 1.636 [sec]
Performance counter stats for './perf bench sched messaging':
2 274 lock:lock_acquire
2 234 lock:lock_release
9 601 767 instructions # 0,000 IPC <-- inside lock
482 273 822 instructions # 0,000 IPC <-- outside lock
490 646 837 instructions # 0,000 IPC <-- all
1,787294881 seconds time elapsed
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
---
tools/perf/builtin-record.c | 2 ++
tools/perf/builtin-stat.c | 2 ++
tools/perf/util/parse-events.c | 20 ++++++++++++++++++++
tools/perf/util/parse-events.h | 1 +
4 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 7cc690e..272f0e4 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -745,6 +745,8 @@ const struct option record_options[] = {
"event starter", parse_starter),
OPT_CALLBACK(0, "stopper", &evsel_list, "stopper",
"event stopper", parse_stopper),
+ OPT_CALLBACK_NOOPT(0, "enable-on-starter", &evsel_list, "enable-on-starter",
+ "enable-on-starter", parse_enable_on_starter),
OPT_INTEGER('p', "pid", &target_pid,
"record events on existing process id"),
OPT_INTEGER('t', "tid", &target_tid,
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index ba89a4d..84f9e7f 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -658,6 +658,8 @@ static const struct option options[] = {
"event starter", parse_starter),
OPT_CALLBACK(0, "stopper", &evsel_list, "stopper",
"event stopper", parse_stopper),
+ OPT_CALLBACK_NOOPT(0, "enable-on-starter", &evsel_list, "enable-on-starter",
+ "enable-on-starter", parse_enable_on_starter),
OPT_BOOLEAN('i', "no-inherit", &no_inherit,
"child tasks do not inherit counters"),
OPT_INTEGER('p', "pid", &target_pid,
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 7dd494f..ce5cf10 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -845,6 +845,26 @@ int parse_filter(const struct option *opt, const char *str,
return 0;
}
+int parse_enable_on_starter(const struct option *opt, const char __used *str,
+ int unset __used)
+{
+ struct perf_evlist *evlist = *(struct perf_evlist **)opt->value;
+ struct perf_evsel *last = NULL;
+
+ if (evlist->nr_entries > 0)
+ last = list_entry(evlist->entries.prev, struct perf_evsel, node);
+
+ if (last == NULL) {
+ fprintf(stderr,
+ "--enable-on-starter option should follow a -e tracepoint option\n");
+ return -1;
+ }
+
+ last->attr.enable_on_starter = 1;
+
+ return 0;
+}
+
static int parse_starter_stopper(const struct option *opt,
const char *str, int starter)
{
diff --git a/tools/perf/util/parse-events.h b/tools/perf/util/parse-events.h
index 72d73b5..262ebc8 100644
--- a/tools/perf/util/parse-events.h
+++ b/tools/perf/util/parse-events.h
@@ -27,6 +27,7 @@ extern int parse_events(const struct option *opt, const char *str, int unset);
extern int parse_filter(const struct option *opt, const char *str, int unset);
extern int parse_starter(const struct option *opt, const char *str, int unset);
extern int parse_stopper(const struct option *opt, const char *str, int unset);
+extern int parse_enable_on_starter(const struct option *opt, const char *str, int unset);
#define EVENTS_HELP_MAX (128*1024)
--
1.7.3.2
Current event contexts are limited to the task and/or cpu scope.
However perf has a bunch of meaningful events on top of which one
could create custom contexts, refining the task and/or cpu ones.
Starter and stopper events provide this custom context granularity.
One can create an event and attach a starter and a stopper to it. The
starter, when it overflows, starts the target perf event (using the
lightweight pmu->start() callback) and the stopper does the reverse.
The target will then only count and sample inside the boundaries
delimited by the starter and stopper overflows.
This creates two new ioctls:
- PERF_EVENT_IOC_SET_STARTER
- PERF_EVENT_IOC_SET_STOPPER
An event can have only one starter and one stopper, which can't be
changed once attached.
But an event can be a starter or a stopper of as many events as it
wants, as long as they belong to the same task and cpu context.
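To illustrate, here is a minimal userspace sketch of the intended usage
(not part of this patch). The tracepoint ids are placeholders that would
normally be read from /sys/kernel/debug/tracing/events/.../id, and error
handling is omitted:

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/ioctl.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	/* The two new ioctls introduced by this patch */
	#define PERF_EVENT_IOC_SET_STARTER	_IO('$', 7)
	#define PERF_EVENT_IOC_SET_STOPPER	_IO('$', 8)

	static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
				   int cpu, int group_fd, unsigned long flags)
	{
		return syscall(__NR_perf_event_open, attr, pid, cpu,
			       group_fd, flags);
	}

	static int open_tracepoint(unsigned long long id)
	{
		struct perf_event_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_TRACEPOINT;
		attr.config = id;
		attr.sample_period = 1;	/* overflow on each occurrence */
		/* Same task (pid 0, cpu -1) as the target, as required */
		return perf_event_open(&attr, 0, -1, -1, 0);
	}

	int main(void)
	{
		struct perf_event_attr attr;
		unsigned long long count;
		int starter, stopper, target;

		starter = open_tracepoint(123);	/* e.g. irq:softirq_entry */
		stopper = open_tracepoint(124);	/* e.g. irq:softirq_exit */

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = PERF_TYPE_HARDWARE;
		attr.config = PERF_COUNT_HW_INSTRUCTIONS;
		target = perf_event_open(&attr, 0, -1, -1, 0);

		/* The ioctl is issued on the starter/stopper fd,
		 * passing the target fd as argument. */
		ioctl(starter, PERF_EVENT_IOC_SET_STARTER, target);
		ioctl(stopper, PERF_EVENT_IOC_SET_STOPPER, target);

		/* ... run the workload ... */

		read(target, &count, sizeof(count));
		printf("instructions inside the range: %llu\n", count);
		return 0;
	}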
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hitoshi Mitake <[email protected]>
---
include/linux/perf_event.h | 11 ++-
kernel/perf_event.c | 295 +++++++++++++++++++++++++++++++++++++++++---
2 files changed, 288 insertions(+), 18 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 614615b..3d33bb8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -245,6 +245,8 @@ struct perf_event_attr {
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
+#define PERF_EVENT_IOC_SET_STARTER _IO('$', 7)
+#define PERF_EVENT_IOC_SET_STOPPER _IO('$', 8)
enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
@@ -863,7 +865,14 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
-
+ struct mutex starter_stopper_mutex;
+ struct list_head starter_entry;
+ struct list_head stopper_entry;
+ struct list_head starter_list;
+ struct list_head stopper_list;
+ struct perf_event *starter;
+ struct perf_event *stopper;
+ int paused;
#endif /* CONFIG_PERF_EVENTS */
};
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 533f715..c58ee74 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1299,6 +1299,7 @@ event_sched_in(struct perf_event *event,
struct perf_event_context *ctx)
{
u64 tstamp = perf_event_time(event);
+ int add_flags = PERF_EF_START;
if (event->state <= PERF_EVENT_STATE_OFF)
return 0;
@@ -1321,7 +1322,10 @@ event_sched_in(struct perf_event *event,
*/
smp_wmb();
- if (event->pmu->add(event, PERF_EF_START)) {
+ if (event->paused)
+ add_flags = 0;
+
+ if (event->pmu->add(event, add_flags)) {
event->state = PERF_EVENT_STATE_INACTIVE;
event->oncpu = -1;
return -EAGAIN;
@@ -2918,6 +2922,7 @@ static void free_event(struct perf_event *event)
int perf_event_release_kernel(struct perf_event *event)
{
struct perf_event_context *ctx = event->ctx;
+ struct perf_event *state_event;
/*
* Remove from the PMU, can't get re-enabled since we got
@@ -2945,6 +2950,20 @@ int perf_event_release_kernel(struct perf_event *event)
raw_spin_unlock_irq(&ctx->lock);
mutex_unlock(&ctx->mutex);
+ if (event->starter) {
+ state_event = event->starter;
+ mutex_lock(&state_event->starter_stopper_mutex);
+ list_del_rcu(&event->starter_entry);
+ mutex_unlock(&state_event->starter_stopper_mutex);
+ }
+
+ if (event->stopper) {
+ state_event = event->stopper;
+ mutex_lock(&state_event->starter_stopper_mutex);
+ list_del_rcu(&event->stopper_entry);
+ mutex_unlock(&state_event->starter_stopper_mutex);
+ }
+
free_event(event);
return 0;
@@ -3246,6 +3265,73 @@ static struct perf_event *perf_fget_light(int fd, int *fput_needed)
static int perf_event_set_output(struct perf_event *event,
struct perf_event *output_event);
static int perf_event_set_filter(struct perf_event *event, void __user *arg);
+static void perf_event_pause_resume(struct perf_event *event, int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs);
+
+static int perf_event_set_starter(struct perf_event *event,
+ struct perf_event *target)
+{
+ struct perf_event *iter;
+ int err = 0;
+
+ if (event->ctx->task != target->ctx->task ||
+ event->cpu != target->cpu)
+ return -EINVAL;
+
+ mutex_lock(&event->starter_stopper_mutex);
+
+ list_for_each_entry_rcu(iter, &event->starter_list, starter_entry) {
+ if (iter == target) {
+ err = -EEXIST;
+ goto end;
+ }
+ }
+
+ if (cmpxchg(&target->starter, NULL, event) == NULL) {
+ list_add_rcu(&target->starter_entry, &event->starter_list);
+ event->overflow_handler = perf_event_pause_resume;
+ } else {
+ err = -EBUSY;
+ }
+
+ end:
+ mutex_unlock(&event->starter_stopper_mutex);
+
+ return err;
+}
+
+static int perf_event_set_stopper(struct perf_event *event,
+ struct perf_event *target)
+{
+ struct perf_event *iter;
+ int err = 0;
+
+ if (event->ctx->task != target->ctx->task ||
+ event->cpu != target->cpu)
+ return -EINVAL;
+
+ mutex_lock(&event->starter_stopper_mutex);
+
+ list_for_each_entry_rcu(iter, &event->stopper_list, stopper_entry) {
+ if (iter == target) {
+ err = -EEXIST;
+ goto end;
+ }
+ }
+
+ if (cmpxchg(&target->stopper, NULL, event) == NULL) {
+ list_add_rcu(&target->stopper_entry, &event->stopper_list);
+ event->overflow_handler = perf_event_pause_resume;
+ } else {
+ err = -EBUSY;
+ }
+
+ end:
+ mutex_unlock(&event->starter_stopper_mutex);
+
+ return err;
+}
static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
@@ -3292,6 +3378,44 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case PERF_EVENT_IOC_SET_FILTER:
return perf_event_set_filter(event, (void __user *)arg);
+ case PERF_EVENT_IOC_SET_STARTER:
+ {
+ struct perf_event *target = NULL;
+ int fput_needed = 0;
+ int ret;
+
+ if (arg != -1) {
+ target = perf_fget_light(arg, &fput_needed);
+ if (IS_ERR(target))
+ return PTR_ERR(target);
+ }
+
+ ret = perf_event_set_starter(event, target);
+ if (target)
+ fput_light(target->filp, fput_needed);
+
+ return ret;
+ }
+
+ case PERF_EVENT_IOC_SET_STOPPER:
+ {
+ struct perf_event *target = NULL;
+ int fput_needed = 0;
+ int ret;
+
+ if (arg != -1) {
+ target = perf_fget_light(arg, &fput_needed);
+ if (IS_ERR(target))
+ return PTR_ERR(target);
+ }
+
+ ret = perf_event_set_stopper(event, target);
+ if (target)
+ fput_light(target->filp, fput_needed);
+
+ return ret;
+ }
+
default:
return -ENOTTY;
}
@@ -5017,12 +5141,62 @@ static int __perf_event_overflow(struct perf_event *event, int nmi,
}
int perf_event_overflow(struct perf_event *event, int nmi,
- struct perf_sample_data *data,
- struct pt_regs *regs)
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
{
return __perf_event_overflow(event, nmi, 1, data, regs);
}
+static void perf_event_pause_resume(struct perf_event *event, int nmi,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct perf_event *iter;
+ unsigned long flags;
+
+ /*
+ * Ensure the targets can't be sched in/out concurrently.
+ * Disabling irqs is sufficient for that because starters/stoppers
+ * are on the same cpu/task.
+ */
+ local_irq_save(flags);
+
+
+ /* Prevent the targets from being removed under us. */
+ rcu_read_lock();
+
+ list_for_each_entry_rcu(iter, &event->starter_list, starter_entry) {
+ /*
+ * There is a small race window here, between the state
+ * gets set to active and the event is actually ->add().
+ * We need to find a way to ensure the starter/stopper
+ * can't trigger in between.
+ */
+ if (iter->state == PERF_EVENT_STATE_ACTIVE) {
+ if (iter->paused) {
+ iter->pmu->start(iter, PERF_EF_RELOAD);
+ iter->paused = 0;
+ }
+ }
+ }
+
+ list_for_each_entry_rcu(iter, &event->stopper_list, stopper_entry) {
+ /* Similar race with ->del() */
+ if (iter->state == PERF_EVENT_STATE_ACTIVE) {
+ if (!iter->paused) {
+ iter->pmu->stop(iter, PERF_EF_UPDATE);
+ iter->paused = 1;
+ }
+ }
+ }
+
+ rcu_read_unlock();
+
+ local_irq_restore(flags);
+
+ perf_event_output(event, nmi, data, regs);
+}
+
/*
* Generic software event infrastructure
*/
@@ -6164,6 +6338,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->group_entry);
INIT_LIST_HEAD(&event->event_entry);
INIT_LIST_HEAD(&event->sibling_list);
+
+ mutex_init(&event->starter_stopper_mutex);
+ INIT_LIST_HEAD(&event->starter_list);
+ INIT_LIST_HEAD(&event->stopper_list);
+
init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);
@@ -6916,7 +7095,8 @@ inherit_event(struct perf_event *parent_event,
struct perf_event_context *parent_ctx,
struct task_struct *child,
struct perf_event *group_leader,
- struct perf_event_context *child_ctx)
+ struct perf_event_context *child_ctx,
+ int *triggers)
{
struct perf_event *child_event;
unsigned long flags;
@@ -6930,6 +7110,9 @@ inherit_event(struct perf_event *parent_event,
if (parent_event->parent)
parent_event = parent_event->parent;
+ if (parent_event->starter || parent_event->stopper)
+ *triggers = 1;
+
child_event = perf_event_alloc(&parent_event->attr,
parent_event->cpu,
child,
@@ -6995,22 +7178,23 @@ inherit_event(struct perf_event *parent_event,
}
static int inherit_group(struct perf_event *parent_event,
- struct task_struct *parent,
- struct perf_event_context *parent_ctx,
- struct task_struct *child,
- struct perf_event_context *child_ctx)
+ struct task_struct *parent,
+ struct perf_event_context *parent_ctx,
+ struct task_struct *child,
+ struct perf_event_context *child_ctx,
+ int *triggers)
{
struct perf_event *leader;
struct perf_event *sub;
struct perf_event *child_ctr;
leader = inherit_event(parent_event, parent, parent_ctx,
- child, NULL, child_ctx);
+ child, NULL, child_ctx, triggers);
if (IS_ERR(leader))
return PTR_ERR(leader);
list_for_each_entry(sub, &parent_event->sibling_list, group_entry) {
child_ctr = inherit_event(sub, parent, parent_ctx,
- child, leader, child_ctx);
+ child, leader, child_ctx, triggers);
if (IS_ERR(child_ctr))
return PTR_ERR(child_ctr);
}
@@ -7021,7 +7205,7 @@ static int
inherit_task_group(struct perf_event *event, struct task_struct *parent,
struct perf_event_context *parent_ctx,
struct task_struct *child, int ctxn,
- int *inherited_all)
+ int *inherited_all, int *triggers)
{
int ret;
struct perf_event_context *child_ctx;
@@ -7048,7 +7232,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
}
ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
+ child, child_ctx, triggers);
if (ret)
*inherited_all = 0;
@@ -7059,7 +7243,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
/*
* Initialize the perf_event context in task_struct
*/
-int perf_event_init_context(struct task_struct *child, int ctxn)
+int perf_event_init_context(struct task_struct *child, int ctxn, int *triggers)
{
struct perf_event_context *child_ctx, *parent_ctx;
struct perf_event_context *cloned_ctx;
@@ -7097,7 +7281,8 @@ int perf_event_init_context(struct task_struct *child, int ctxn)
*/
list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, &inherited_all);
+ child, ctxn, &inherited_all,
+ triggers);
if (ret)
break;
}
@@ -7113,7 +7298,8 @@ int perf_event_init_context(struct task_struct *child, int ctxn)
list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
ret = inherit_task_group(event, parent, parent_ctx,
- child, ctxn, &inherited_all);
+ child, ctxn, &inherited_all,
+ triggers);
if (ret)
break;
}
@@ -7151,23 +7337,98 @@ int perf_event_init_context(struct task_struct *child, int ctxn)
return ret;
}
+static void
+perf_event_inherit_starter(struct task_struct *child,
+ struct perf_event *event,
+ struct perf_event *parent_starter)
+{
+ int ctxn;
+ struct perf_event *iter;
+ struct perf_event_context *ctx;
+
+ ctxn = parent_starter->pmu->task_ctx_nr;
+ ctx = child->perf_event_ctxp[ctxn];
+
+ if (WARN_ON_ONCE(!ctx))
+ return;
+
+ list_for_each_entry(iter, &ctx->event_list, event_entry) {
+ if (iter->parent == parent_starter) {
+ list_add_tail(&event->starter_entry, &iter->starter_list);
+ return;
+ }
+ }
+
+ WARN_ONCE(1, "inherited starter not found\n");
+}
+
+static void
+perf_event_inherit_stopper(struct task_struct *child,
+ struct perf_event *event,
+ struct perf_event *parent_stopper)
+{
+ int ctxn;
+ struct perf_event *iter;
+ struct perf_event_context *ctx;
+
+ ctxn = parent_stopper->pmu->task_ctx_nr;
+ ctx = child->perf_event_ctxp[ctxn];
+
+ if (WARN_ON_ONCE(!ctx))
+ return;
+
+ list_for_each_entry(iter, &ctx->event_list, event_entry) {
+ if (iter->parent == parent_stopper) {
+ list_add_tail(&event->stopper_entry, &iter->stopper_list);
+ return;
+ }
+ }
+
+ WARN_ONCE(1, "inherited stopper not found\n");
+}
+
+
+static void
+perf_event_inherit_triggers(struct task_struct *child, int ctxn)
+{
+ struct perf_event_context *ctx;
+ struct perf_event *event;
+
+ ctx = child->perf_event_ctxp[ctxn];
+ if (!ctx)
+ return;
+
+ list_for_each_entry(event, &ctx->event_list, event_entry) {
+ if (event->parent->starter)
+ perf_event_inherit_starter(child, event, event->parent->starter);
+ if (event->parent->stopper)
+ perf_event_inherit_stopper(child, event, event->parent->stopper);
+ }
+}
+
+
/*
* Initialize the perf_event context in task_struct
*/
int perf_event_init_task(struct task_struct *child)
{
- int ctxn, ret;
+ int ctxn, ret, triggers = 0;
memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
mutex_init(&child->perf_event_mutex);
INIT_LIST_HEAD(&child->perf_event_list);
for_each_task_context_nr(ctxn) {
- ret = perf_event_init_context(child, ctxn);
+ ret = perf_event_init_context(child, ctxn, &triggers);
if (ret)
return ret;
}
+ if (triggers) {
+ for_each_task_context_nr(ctxn)
+ perf_event_inherit_triggers(child, ctxn);
+ }
+
return 0;
}
--
1.7.3.2
On Mon, Mar 14, 2011 at 08:17:59PM +0100, Frederic Weisbecker wrote:
> Some examples:
>
> - Trace lock events inside irqs
>
> $ perf record -e irq:irq_handler_entry -e irq:irq_handler_exit \
> -e lock:lock_acquire --starter 0 --stopper 1 --enable-on-starter \
> perf bench sched messaging
A more compact alternative that uses .. to delimit ranges (start and/or
stop), like git does for changeset ranges and @ to determine where to
enable the event:
perf record -e instructions@irq:irq_handler_entry..irq:irq_handler_exit \
perf bench sched messaging
event-A at-range event-B to event-C
open ended ranges being possible as well, i.e. start counting when
event-B triggers and stop only when the workload finishes, or count from
workload start till event-C triggers.
> $ perf script
>
> perf-4300 [000] 2000.209721: irq_handler_entry: irq=40 name=ahci
> perf-4300 [000] 2000.209725: lock_acquire: 0xffff880037851c40 &(&host->lock)->rlock
> perf-4300 [000] 2000.209731: irq_handler_exit: irq=40 ret=handled
> perf-4300 [000] 2000.209821: irq_handler_entry: irq=40 name=ahci
> perf-4300 [000] 2000.209823: lock_acquire: 0xffff880037851c40 &(&host->lock)->rlock
> perf-4300 [000] 2000.209828: irq_handler_exit: irq=40 ret=handled
> - Count instructions inside softirqs, outside softirqs and everywhere:
>
> $ perf stat -e irq:softirq_entry -e irq:softirq_exit \
> -e instructions --starter 0 --stopper 1 --enable-on-starter \
> -e instructions --starter 1 --stopper 0 \
> -e instructions ./perf bench sched messaging
$ perf record -e instructions@irq:softirq_entry..irq:softirq_exit \
-e instructions@irq:softirq_exit..irq:softirq_entry \
-e instructions \
perf bench sched messaging
- Arnaldo
On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Mar 14, 2011 at 08:17:59PM +0100, Frederic Weisbecker wrote:
> > Some examples:
> >
> > - Trace lock events inside irqs
> >
> > $ perf record -e irq:irq_handler_entry -e irq:irq_handler_exit \
> > -e lock:lock_acquire --starter 0 --stopper 1 --enable-on-starter \
> > perf bench sched messaging
>
> A more compact alternative that uses .. to delimit ranges (start and/or
> stop), like git does for changeset ranges and @ to determine where to
> enable the event:
>
> perf record -e instructions@irq:irq_handler_entry..irq:irq_handler_exit \
> perf bench sched messaging
>
> event-A at-range event-B to event-C
>
> open ended ranges being possible as well, i.e. start counting when
> event-B triggers and stop only when the workload finishes, or count from
> workload start till event-C triggers.
>
> > $ perf script
> >
> > perf-4300 [000] 2000.209721: irq_handler_entry: irq=40 name=ahci
> > perf-4300 [000] 2000.209725: lock_acquire: 0xffff880037851c40 &(&host->lock)->rlock
> > perf-4300 [000] 2000.209731: irq_handler_exit: irq=40 ret=handled
> > perf-4300 [000] 2000.209821: irq_handler_entry: irq=40 name=ahci
> > perf-4300 [000] 2000.209823: lock_acquire: 0xffff880037851c40 &(&host->lock)->rlock
> > perf-4300 [000] 2000.209828: irq_handler_exit: irq=40 ret=handled
> > - Count instructions inside softirqs, outside softirqs and everywhere:
> >
> > $ perf stat -e irq:softirq_entry -e irq:softirq_exit \
> > -e instructions --starter 0 --stopper 1 --enable-on-starter \
> > -e instructions --starter 1 --stopper 0 \
> > -e instructions ./perf bench sched messaging
>
> $ perf record -e instructions@irq:softirq_entry..irq:softirq_exit \
> -e instructions@irq:softirq_exit..irq:softirq_entry \
> -e instructions \
> perf bench sched messaging
>
> - Arnaldo
Agreed, that's nice, but maybe more as a shortcut than a full replacement?
Otherwise it becomes hard or unreadable to define a filter on a starter,
or a starter on a starter.
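For example, keeping the raw options, a filter on a starter stays
readable (a sketch reusing the options from this series; the filter
value is arbitrary):

$ perf record -e irq:irq_handler_entry --filter "irq==40" \
	-e irq:irq_handler_exit \
	-e lock:lock_acquire --starter 0 --stopper 1 --enable-on-starter \
	perf bench sched messaging

whereas folding that filter into an inline range expression quickly
becomes unreadable.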
On Mon, Mar 14, 2011 at 09:51:02PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
> > > - Count instructions inside softirqs, outside softirqs and everywhere:
> > > $ perf stat -e irq:softirq_entry -e irq:softirq_exit \
> > > -e instructions --starter 0 --stopper 1 --enable-on-starter \
> > > -e instructions --starter 1 --stopper 0 \
> > > -e instructions ./perf bench sched messaging
> > $ perf record -e instructions@irq:softirq_entry..irq:softirq_exit \
> > -e instructions@irq:softirq_exit..irq:softirq_entry \
> > -e instructions \
> > perf bench sched messaging
> Agreed that's nice but may be more as a shortcut than a full replacement?
> Otherwise it becomes hard or unreadable to define a filter on a starter. Or
> a starter on a starter.
Sure, multiple ways to express ranges are probably ok, with more complex
ones for more complex cases, like the way you did it, which first defines
an evlist that is the set of starters/stoppers and then uses evsel->evidx
to add starters/stoppers to later events.
But starter on a starter? Couldn't grok, could you provide an example?
But I could think of this as a way to express filters:
$ perf record -e instructions@irq:irq:irq_handler_entry(irq=eth0)..irq:irq_handler_exit(\1) \
-e instructions \
netperf
looks quite natural for someone used to git and sed, i.e. developers :)
- Arnaldo
On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Mar 14, 2011 at 09:51:02PM +0100, Frederic Weisbecker wrote:
> > On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > - Count instructions inside softirqs, outside softirqs and everywhere:
>
> > > > $ perf stat -e irq:softirq_entry -e irq:softirq_exit \
> > > > -e instructions --starter 0 --stopper 1 --enable-on-starter \
> > > > -e instructions --starter 1 --stopper 0 \
> > > > -e instructions ./perf bench sched messaging
>
> > > $ perf record -e instructions@irq:softirq_entry..irq:softirq_exit \
> > > -e instructions@irq:softirq_exit..irq:softirq_entry \
> > > -e instructions \
> > > perf bench sched messaging
>
> > Agreed that's nice but may be more as a shortcut than a full replacement?
> > Otherwise it becomes hard or unreadable to define a filter on a starter. Or
> > a starter on a starter.
>
> Sure, multiple ways to express ranges probably is ok, with more complex
> ones for more complex cases, like the way you did, that first defines an
> evlist that is the set of starters/stoppers and then evsel->evidx to add
> starters/stoppers to later events.
Yep.
> But starter on a starter? Couldn't grok, could you provide an example?
I have no strong example in mind.
But one may want to count instructions when we are in an interrupt and
lock A is held.
Or count instructions when A and B are held. Or count instructions in
page faults happening in the read() syscall.
Event ranges define a state, and anytime you need to profile/trace a
desired stacked state, starters on starters can be a good solution,
and thus maybe even a common practice.
> But I could think of this as a way to express filters:
>
> $ perf record -e instructions@irq:irq:irq_handler_entry(irq=eth0)..irq:irq_handler_exit(\1) \
> -e instructions \
> netperf
>
> looks quite natural for someone used to git and sed, i.e. developers :)
Yeah indeed, I like filters defined in parentheses after the event!
On Mon, Mar 14, 2011 at 10:20:53PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> > On Mon, Mar 14, 2011 at 09:51:02PM +0100, Frederic Weisbecker wrote:
> > > On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
> > But starter on a starter? Couldn't grok, could you provide an example?
>
> I have no strong example in mind.
>
> But one may want to count instructions when we are in an interrupt and
> lock A is held.
Those would be and/or starter/stopper expressions, something like:
$ perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..irq:irq_handler_exit(\1) \
-e instructions \
netperf
when all starters before the stopper are valid, we entered a range.
> Or count instruction when A and B are held.
Using wildcards that match just the things we want, to make it a bit
more compact:
$ perf record -e inst*@(irq:*entry(irq=eth0) && lock:*acquired(A) && \
lock:*acquired(B))..(lock:*release(A) || lock:*release(B)) \
./my_workload
Parentheses don't have to be used just for filters :) Just like in C,
they can be used to express the list of parameters for a function or for
expressions, etc.
> Or count instruction in page faults happening in read() syscall.
We would need to use 'perf probe' first to insert the entry and exit
probes on the page fault handling path:
[root@felicio ~]# perf list *fault* *:*fault*
List of pre-defined events (to be used in -e):
page-faults OR faults [Software event]
minor-faults [Software event]
major-faults [Software event]
alignment-faults [Software event]
emulation-faults [Software event]
kvm:kvm_page_fault [Tracepoint event]
[root@felicio ~]#
But then an expression could be used like I showed above for the
previous use case you mentioned.
> Event range define a state, and anytime you need to profile/trace a
> desired stacked state, starters on starters can be a good solution,
> thus even a common practice.
See above, is that what you're thinking about?
> > But I could think of this as a way to express filters:
> >
> > $ perf record -e instructions@irq:irq:irq_handler_entry(irq=eth0)..irq:irq_handler_exit(\1) \
> > -e instructions \
> > netperf
> >
> > looks quite natural for someone used to git and sed, i.e. developers :)
>
> Yeah indeed, I like filters defined in parenthesis after the event!
:-)
- Arnaldo
On Mon, Mar 14, 2011 at 06:56:03PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Mar 14, 2011 at 10:20:53PM +0100, Frederic Weisbecker wrote:
> > On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> > > But I could think of this as a way to express filters:
> > > $ perf record -e instructions@irq:irq:irq_handler_entry(irq=eth0)..irq:irq_handler_exit(\1) \
> > > -e instructions \
> > > netperf
> > > looks quite natural for someone used to git and sed, i.e. developers :)
> > Yeah indeed, I like filters defined in parenthesis after the event!
>
> :-)
Also it would be good to look at 68baa43 and think about ways to share
code/syntax between static and dynamic probes...
- Arnaldo
On Mon, Mar 14, 2011 at 06:56:03PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Mar 14, 2011 at 10:20:53PM +0100, Frederic Weisbecker wrote:
> > On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Mon, Mar 14, 2011 at 09:51:02PM +0100, Frederic Weisbecker wrote:
> > > > On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
>
> > > But starter on a starter? Couldn't grok, could you provide an example?
> >
> > I have no strong example in mind.
> >
> > But one may want to count instructions when we are in an interrupt and
> > lock A is held.
>
> Those would be and/or starter/stopper expressions, something like:
>
> $ perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..irq:irq_handler_exit(\1) \
> -e instructions \
> netperf
>
> when all starters before the stopper are valid, we entered a range.
So, if we want to stop when lock is released, we do:
perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..lock:lock_release(foo_lock) && irq:irq_handler_exit(\1) \
-e instructions \
netperf
Or || for stoppers like you do below? Hmm, I'm confused...
>
> > Or count instruction when A and B are held.
>
> Using wildcards that matches just the things we want to make it a bit
> more compact:
>
> $ perf record -e inst*@(irq:*entry(irq=eth0) && lock:*acquired(A) && \
> lock:*acquired(B))..(lock:*release(A) || lock:*release(B)) \
> ./my_workload
>
> Parenthesis don't have to be used just for filters :) Just like in C,
> they can be used to express the list of parameters for a function or for
> expressions, etc.
The && makes sense. But the ||?
What about:
-e inst*@(lock:*acquire(A)..lock:*release(A))@(lock:*acquire(B)..lock:*release(B))@(irq:*entry(irq=eth0)..irq:*exit(irq=eth0))
That looks less confusing to me.
>
> > Or count instruction in page faults happening in read() syscall.
>
> We would need to use 'perf probe' first to insert the entry and exit
> probes on the page fault handling path:
>
> [root@felicio ~]# perf list *fault* *:*fault*
>
> List of pre-defined events (to be used in -e):
> page-faults OR faults [Software event]
> minor-faults [Software event]
> major-faults [Software event]
> alignment-faults [Software event]
> emulation-faults [Software event]
>
> kvm:kvm_page_fault [Tracepoint event]
> [root@felicio ~]#
>
> But then an expression could be used like I showed above for the
> previous use case you mentioned.
Right.
>
> > Event range define a state, and anytime you need to profile/trace a
> > desired stacked state, starters on starters can be a good solution,
> > thus even a common practice.
>
> See above, is that what you're thinking about?
I'm not sure. I can see the meaning of && in your expressions, but not
the meaning of ||. I lack some sleep though :)
Still, I'm all for trying to find a better and smarter way to express
these events, following your suggestions, but I'm not sure I have the
motivation to write a full parser capable of evaluating near-C expressions.
Em Mon, Mar 14, 2011 at 11:43:46PM +0100, Frederic Weisbecker escreveu:
> On Mon, Mar 14, 2011 at 06:56:03PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Mar 14, 2011 at 10:20:53PM +0100, Frederic Weisbecker escreveu:
> > > On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Em Mon, Mar 14, 2011 at 09:51:02PM +0100, Frederic Weisbecker escreveu:
> > > > > On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > > > But starter on a starter? Couldn't grok, could you provide an example?
> > >
> > > I have no strong example in mind.
> > >
> > > But one may want to count instructions when we are in an interrupt and
> > > lock A is held.
> >
> > Those would be and/or starter/stopper expressions, something like:
> >
> > $ perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..irq:irq_handler_exit(\1) \
> > -e instructions \
> > netperf
> >
> > when all starters before the stopper are valid, we entered a range.
>
> So, if we want to stop when lock is released, we do:
>
> perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..lock:lock_release(foo_lock) && irq:irq_handler_exit(\1) \
> -e instructions \
> netperf
>
> Or || for stoppers like you do below? Hmm, I'm confused...
>
> >
> > > Or count instruction when A and B are held.
> >
> > Using wildcards that matches just the things we want to make it a bit
> > more compact:
> >
> > $ perf record -e inst*@(irq:*entry(irq=eth0) && lock:*acquired(A) && \
> > lock:*acquired(B))..(lock:*release(A) || lock:*release(B)) \
> > ./my_workload
> >
> > Parenthesis don't have to be used just for filters :) Just like in C,
> > they can be used to express the list of parameters for a function or for
> > expressions, etc.
>
> The && make sense. But the || ?
>
> What about:
>
> -e inst*@(lock:*acquire(A)..lock:*release(A))@(lock:*acquire(B)..lock:*release(B))@(irq:*entry(irq=eth0)..irq:*exit(irq=eth0))
>
> That looks to me less confusing.
Now it seems it's me who needs to get some sleep :-) I find the above
confusing, but I'm in a hurry right now; I'll try to comment more
tomorrow.
>
> >
> > > Or count instruction in page faults happening in read() syscall.
> >
> > We would need to use 'perf probe' first to insert the entry and exit
> > probes on the page fault handling path:
> >
> > [root@felicio ~]# perf list *fault* *:*fault*
> >
> > List of pre-defined events (to be used in -e):
> > page-faults OR faults [Software event]
> > minor-faults [Software event]
> > major-faults [Software event]
> > alignment-faults [Software event]
> > emulation-faults [Software event]
> >
> > kvm:kvm_page_fault [Tracepoint event]
> > [root@felicio ~]#
> >
> > But then an expression could be used like I showed above for the
> > previous use case you mentioned.
>
> Right.
>
> >
> > > Event range define a state, and anytime you need to profile/trace a
> > > desired stacked state, starters on starters can be a good solution,
> > > thus even a common practice.
> >
> > See above, is that what you're thinking about?
>
> I'm not sure. I can find the meaning of && in your expressions. But not
> the meaning of ||. I lack some sleep though :)
>
> But still, I'm all for trying to make a better and smarter way to
> express these events, following your suggestions, but I'm not sure I have
> the motivation to write a full parser capable of evaluating near C expressions.
See the other message, the start of it is there, thanks to Masami.
On Tue, Mar 15, 2011 at 3:18 AM, Frederic Weisbecker <[email protected]> wrote:
>
> +static void perf_event_pause_resume(struct perf_event *event, int nmi,
> + struct perf_sample_data *data,
> + struct pt_regs *regs)
> +{
> + struct perf_event *iter;
> + unsigned long flags;
> +
> + /*
> + * Ensure the targets can't be sched in/out concurrently.
> + * Disabling irqs is sufficient for that because starters/stoppers
> + * are on the same cpu/task.
> + */
> + local_irq_save(flags);
Could you explain this in more detail?
> +
> +
> + /* Prevent the targets from being removed under us. */
> + rcu_read_lock();
> +
> + list_for_each_entry_rcu(iter, &event->starter_list, starter_entry) {
> + /*
> + * There is a small race window here, between the state
> + * gets set to active and the event is actually ->add().
> + * We need to find a way to ensure the starter/stopper
> + * can't trigger in between.
Can we set the state to active after the event is actually ->add()?
> + */
> + if (iter->state == PERF_EVENT_STATE_ACTIVE) {
> + if (iter->paused) {
> + iter->pmu->start(iter, PERF_EF_RELOAD);
> + iter->paused = 0;
> + }
> + }
> + }
> +
Thanks,
Lin Ming
On Tue, Mar 15, 2011 at 10:36:18PM +0800, Lin Ming wrote:
> On Tue, Mar 15, 2011 at 3:18 AM, Frederic Weisbecker <[email protected]> wrote:
>
> >
> > +static void perf_event_pause_resume(struct perf_event *event, int nmi,
> > + struct perf_sample_data *data,
> > + struct pt_regs *regs)
> > +{
> > + struct perf_event *iter;
> > + unsigned long flags;
> > +
> > + /*
> > + * Ensure the targets can't be sched in/out concurrently.
> > + * Disabling irqs is sufficient for that because starters/stoppers
> > + * are on the same cpu/task.
> > + */
> > + local_irq_save(flags);
>
> Could you explain this more detail?
Yeah, I should have detailed that more.
So, I put a constraint on starters and stoppers: they must be attached
to the same task and cpu as the target. That allows us to do this
pause/resume locklessly if we can ensure that:
- target sched in/out can't interrupt perf_event_pause_resume()
- perf_event_pause_resume() can't interrupt the target in the middle of
event_sched_in()
so that both are strictly serialized.
We need to ensure that the target event cannot be concurrently scheduled
in (->add()) or scheduled out (->del()), so that when we check
PERF_EVENT_STATE_ACTIVE, we know the event is currently running and is
not going to move while we do our checks and call start() and stop().
So the rationale is that the target cannot be in the middle of
event_sched_in() or event_sched_out() when the starter/stopper
triggers. We have no guarantee of that currently, especially because
of events that trigger in NMIs, and maybe in other corner cases too,
so I'll need to think about it later. Perhaps by using pmu_disable_all()
on the starter/stopper when the target is about to be scheduled in/out,
until we know the event->state really reflects the hardware and logical
states.
Now event_sched_in() and event_sched_out() can still be called from an
IPI to enable/disable an event. Hence the disabled interrupts, to
prevent that.
> > +
> > +
> > + /* Prevent the targets from being removed under us. */
> > + rcu_read_lock();
> > +
> > + list_for_each_entry_rcu(iter, &event->starter_list, starter_entry) {
> > + /*
> > + * There is a small race window here, between the state
> > + * gets set to active and the event is actually ->add().
> > + * We need to find a way to ensure the starter/stopper
> > + * can't trigger in between.
>
> Can we set the state to active after the event is actually ->add()?
If we do so, the event may trigger before its state gets updated, and
that state can be checked from event fast paths, as is the case from
the hrtimer for example, or from perf_event_update_userpage().
And I'm not sure how we would deal with that if del() is called after
we update the state.
On Mon, Mar 14, 2011 at 08:02:11PM -0300, Arnaldo Carvalho de Melo wrote:
> On Mon, Mar 14, 2011 at 11:43:46PM +0100, Frederic Weisbecker wrote:
> > On Mon, Mar 14, 2011 at 06:56:03PM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Mon, Mar 14, 2011 at 10:20:53PM +0100, Frederic Weisbecker wrote:
> > > > On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > > On Mon, Mar 14, 2011 at 09:51:02PM +0100, Frederic Weisbecker wrote:
> > > > > > On Mon, Mar 14, 2011 at 05:43:41PM -0300, Arnaldo Carvalho de Melo wrote:
> > >
> > > > > But starter on a starter? Couldn't grok, could you provide an example?
> > > >
> > > > I have no strong example in mind.
> > > >
> > > > But one may want to count instructions when we are in an interrupt and
> > > > lock A is held.
> > >
> > > Those would be and/or starter/stopper expressions, something like:
> > >
> > > $ perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..irq:irq_handler_exit(\1) \
> > > -e instructions \
> > > netperf
> > >
> > > when all starters before the stopper are valid, we entered a range.
> >
> > So, if we want to stop when lock is released, we do:
> >
> > perf record -e instructions@(irq:irq_handler_entry(irq=eth0) && lock:lock_acquired(foo_lock))..lock:lock_release(foo_lock) && irq:irq_handler_exit(\1) \
> > -e instructions \
> > netperf
> >
> > Or || for stoppers like you do below? Hmm, I'm confused...
> >
> > >
> > > > Or count instruction when A and B are held.
> > >
> > > Using wildcards that matches just the things we want to make it a bit
> > > more compact:
> > >
> > > $ perf record -e inst*@(irq:*entry(irq=eth0) && lock:*acquired(A) && \
> > > lock:*acquired(B))..(lock:*release(A) || lock:*release(B)) \
> > > ./my_workload
> > >
> > > Parentheses don't have to be used just for filters :) Just like in C,
> > > they can be used to express the list of parameters for a function or for
> > > expressions, etc.
> >
> > The && make sense. But the || ?
> >
> > What about:
> >
> > -e inst*@(lock:*acquire(A)..lock:*release(A))@(lock:*acquire(B)..lock:*release(B))@(irq:*entry(irq=eth0)..irq:*exit(irq=eth0))
> >
> > That looks to me less confusing.
>
> Now it seems it's me that needs to have some sleep :-) I find the above
> confusing, but I'm in a hurry right now, will try to comment more
> tomorrow.
Hehe :)
-e inst*@(lock:*acquire(A)..lock:*release(A))@(lock:*acquire(B)..lock:*release(B))
means we want to count instructions when we hold A and B, with A held
inside a section where we hold B.
Right?
But that's limited. We should express it that way:
-e inst*@((lock:*acquire(B)..lock:*release(B)) && (lock:*acquire(A)..lock:*release(A)))
Which means we first define a state where B is held. Inside that state we define
another one where A is held.
If we want to have an event always running, from the beginning, until we
acquire B:
-e inst*@(..lock:*acquire(B))
or:
-e inst*@..lock:*acquire(B)
If we want to only count once we hold B:
-e inst*@lock:*acquire(B)..
If we want to count everywhere but when we hold B:
-e inst*@(..lock:*acquire(B) && lock:*release(B)..)
This covers about everything. Now if in the future we want to support having
multiple starters or stoppers for a single target, in order to union custom
contexts, we can use the ||.
Like only count when we hold B or when we hold A:
-e inst*@(lock:*acquire(A)..lock:*release(A) || lock:*acquire(B)..lock:*release(B))
Right?
> > > > Event ranges define a state, and anytime you need to profile/trace a
> > > > desired stacked state, starters on starters can be a good solution,
> > > > thus even a common practice.
> > >
> > > See above, is that what you're thinking about?
> >
> > I'm not sure. I can find the meaning of && in your expressions. But not
> > the meaning of ||. I lack some sleep though :)
> >
> > But still, I'm all for trying to make a better and smarter way to
> > express these events, following your suggestions, but I'm not sure I have
> > the motivation to write a full parser capable of evaluating near C expressions.
>
> See the other message, the start of it is there, thanks to Masami.
Indeed. I just had a look and it provides basic parsing. Now it's tied to string
glob comparison and it needs to be generalized to support some custom set of
operations. Not to mention the final interpretation, which is not trivial.
So that's a lot of work. I'd rather suggest doing this as separate work, and
maybe starting with the raw --starter/--stopper options or the like. We can
still remove that, in favor of the --filter approach, once we have the event
expression scheme well settled.
Em Tue, Mar 15, 2011 at 07:58:16PM +0100, Frederic Weisbecker escreveu:
> On Mon, Mar 14, 2011 at 08:02:11PM -0300, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Mar 14, 2011 at 11:43:46PM +0100, Frederic Weisbecker escreveu:
> > > On Mon, Mar 14, 2011 at 06:56:03PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Em Mon, Mar 14, 2011 at 10:20:53PM +0100, Frederic Weisbecker escreveu:
> > > > > On Mon, Mar 14, 2011 at 06:03:15PM -0300, Arnaldo Carvalho de Melo wrote:
> > > > Using wildcards that match just the things we want, to make it a bit
> > > > more compact:
> > > >
> > > > $ perf record -e inst*@(irq:*entry(irq=eth0) && lock:*acquired(A) && \
> > > > lock:*acquired(B))..(lock:*release(A) || lock:*release(B)) \
> > > > ./my_workload
> > > >
> > > > Parentheses don't have to be used just for filters :) Just like in C,
> > > > they can be used to express the list of parameters for a function or for
> > > > expressions, etc.
> > >
> > > The && make sense. But the || ?
> > >
> > > What about:
> > >
> > > -e inst*@(lock:*acquire(A)..lock:*release(A))@(lock:*acquire(B)..lock:*release(B))@(irq:*entry(irq=eth0)..irq:*exit(irq=eth0))
> > >
> > > That looks to me less confusing.
> >
> > Now it seems it's me that needs to have some sleep :-) I find the above
> > confusing, but I'm in a hurry right now, will try to comment more
> > tomorrow.
>
> Hehe :)
>
> -e inst*@(lock:*acquire(A)..lock:*release(A))@(lock:*acquire(B)..lock:*release(B))
>
> means we want to count instructions when we hold A and B, with A held
> inside a section where we hold B.
>
> Right?
Understood, see below about the &&, I think we're improving the syntax
for expressing these narrowing ranges where events should be active.
> But that's limited. We should express it that way:
>
> -e inst*@((lock:*acquire(B)..lock:*release(B)) && (lock:*acquire(A)..lock:*release(A)))
>
> Which means we first define a state where B is held. Inside that state we define
> another one where A is held.
>
> If we want to have an event always running, from the beginning, until we
> acquire B:
>
> -e inst*@(..lock:*acquire(B))
> or:
> -e inst*@..lock:*acquire(B)
Yeah, extra parentheses, like in a C expression, are just elided away.
And agreed on the above, record instructions until that event takes
place.
> If we want to only count once we hold B:
>
> -e inst*@lock:*acquire(B)..
agreed
> If we want to count everywhere but when we hold B:
>
> -e inst*@(..lock:*acquire(B) && lock:*release(B)..)
Makes sense, but looks confusing at first sight: how do we translate that to
event starters/stoppers?
It's an event: instructions. It starts in what state? Enabled, i.e. the
first part of the expression has precedence: ..lock:*acquire(B). Then,
after it is disabled because we acquired B, the second part gets armed,
i.e. when the release(B) event takes place, the first part gets armed
again.
That is why I felt '&&' to be confusing: it's not that both are in
effect, it's that one is only armed when the previous one was disarmed.
Perhaps we could find some other operator that is more natural, or just
agree that && in this context works this way, almost like what we do in
shell scripting with && and || (the cycle to rearm the first range after
the last one is disarmed is the part that doesn't match the shell
use).
Also:
-e inst*@(lock:*release(B)..lock:*acquire(B))
Wouldn't that be simpler? Not equivalent, though, if one is interested in
everything that happens until the first time the lock is acquired in some
specific thread started from the tool, etc.
> This covers about everything. Now if in the future we want to support having
> multiple starters or stoppers for a single target, in order to union custom
> contexts, we can use the ||.
>
> Like only count when we hold B or when we hold A:
>
> -e inst*@(lock:*acquire(A)..lock:*release(A) || lock:*acquire(B)..lock:*release(B))
> Right?
Yeah, this one is natural at first sight. As is:
-e (inst*,cycles,dTLB*misses)@(lock:*acquire(A)..lock:*release(A) || lock:*acquire(B)..lock:*release(B))
I.e. we're expressing an evlist (list of event selectors) to be counted
when the 'at' expression allows.
> > > > > Event ranges define a state, and anytime you need to profile/trace a
> > > > > desired stacked state, starters on starters can be a good solution,
> > > > > thus even a common practice.
> > > >
> > > > See above, is that what you're thinking about?
> > >
> > > I'm not sure. I can find the meaning of && in your expressions. But not
> > > the meaning of ||. I lack some sleep though :)
> > >
> > > But still, I'm all for trying to make a better and smarter way to
> > > express these events, following your suggestions, but I'm not sure I have
> > > the motivation to write a full parser capable of evaluating near C expressions.
> >
> > See the other message, the start of it is there, thanks to Masami.
>
> Indeed. I just had a look and it provides basic parsing. Now it's tied to string
> glob comparison and it needs to be generalized to support some custom set of
> operations. Not to mention the final interpretation, which is not trivial.
>
> So that's a lot of work. I'd rather suggest doing this as separate work, and
> maybe starting with the raw --starter/--stopper options or the like. We can
> still remove that, in favor of the --filter approach, once we have the event
> expression scheme well settled.
Right, so I don't much like the interface, two new ioctl()s and a flag
of dubious use.
How important is this recursive nature of the thing to you:
> It's supposed to support infinite combinations with starter having starters
> themselves, plus filters, etc...
We've so far avoided recursion like that, we only have single level
groups etc.
On Tue, Mar 15, 2011 at 04:24:22PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Mar 15, 2011 at 07:58:16PM +0100, Frederic Weisbecker escreveu:
> > If we want to count everywhere but when we hold B:
> >
> > -e inst*@(..lock:*acquire(B) && lock:*release(B)..)
>
> Makes sense, but looks confusing at first sight: how do we translate that to
> event starters/stoppers?
>
> It's an event: instructions. It starts in what state? Enabled, i.e. the
> first part of the expression has precedence: ..lock:*acquire(B). Then,
> after it is disabled because we acquired B, the second part gets armed,
> i.e. when the release(B) event takes place, the first part gets armed
> again.
>
> That is why I felt '&&' to be confusing: it's not that both are in
> effect, it's that one is only armed when the previous one was disarmed.
>
> Perhaps we could find some other operator that is more natural, or just
> agree that && in this context works this way, almost like what we do in
> shell scripting with && and || (the cycle to rearm the first range after
> the last one is disarmed is the part that doesn't match the shell
> use).
Doh, you're right, && would have two meanings.
No, we should probably keep a && b as meaning we are
in the range a AND in the range b, both at the same time, with
a evaluated first and then b. We also need to ensure that
a && b doesn't mean the same as b && a. You're right, perhaps
we need another operator to express inclusion, or we need to
assume that specific meaning of &&.
For what I wanted to express in the example above, || seems to be the
right choice: -e inst*@(..lock:*acquire(B) || lock:*release(B)..)
So || would mean union and && would mean inclusion.
>
> Also:
>
> -e inst*@(lock:*release(B)..lock:*acquire(B))
>
> Wouldn't that be simpler? Not equivalent, though, if one is interested in
> everything that happens until the first time the lock is acquired in some
> specific thread started from the tool, etc.
Yep, they are not the same.
>
> > This covers about everything. Now if in the future we want to support having
> > multiple starters or stoppers for a single target, in order to union custom
> > contexts, we can use the ||.
> >
> > Like only count when we hold B or when we hold A:
> >
> > -e inst*@(lock:*acquire(A)..lock:*release(A) || lock:*acquire(B)..lock:*release(B))
>
> > Right?
>
> Yeah, this one is natural at first sight. As is:
>
> -e (inst*,cycles,dTLB*misses)@(lock:*acquire(A)..lock:*release(A) || lock:*acquire(B)..lock:*release(B))
>
> I.e. we're expressing an evlist (list of event selectors) to be counted
> when the 'at' expression allows.
Indeed, looks nice.
On Tue, Mar 15, 2011 at 11:32:54PM +0100, Peter Zijlstra wrote:
> Right, so I don't much like the interface, two new ioctl()s and a flag
> of dubious use.
Do you think we need a new syscall for this new feature?
> How important is this recursive nature of the thing to you:
>
> > It's supposed to support infinite combinations with starter having starters
> > themselves, plus filters, etc...
>
> We've so far avoided recursion like that, we only have single level
> groups etc.
There is actually no recursion of any sort that the kernel has to handle.
The starter/stopper links are never handled recursively, i.e. there is no
loop walking through the entire chain of starter to starter to starter, etc.
It's only ever handled between directly related events: starter and target,
never further.
Only the final effect has recursion properties in the resulting count
or trace.
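To illustrate the flat handling, a minimal sketch along the lines of the
quoted patch (locking and accounting elided, the helper name is made up):

	static void starter_trigger(struct perf_event *starter)
	{
		struct perf_event *target;

		list_for_each_entry(target, &starter->starter_list, starter_entry) {
			if (target->state == PERF_EVENT_STATE_ACTIVE)
				target->pmu->start(target, 0);
			/*
			 * No recursion: the target's own starter_list is not
			 * walked here. A target that is itself a starter only
			 * acts later, when it generates an event of its own.
			 */
		}
	}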
On Wed, 2011-03-16 at 14:53 +0100, Frederic Weisbecker wrote:
> On Tue, Mar 15, 2011 at 11:32:54PM +0100, Peter Zijlstra wrote:
> > Right, so I don't much like the interface, two new ioctl()s and a flag
> > of dubious use.
>
> Do you think we need a new syscall for this new feature?
No.
> > How important is this recursive nature of the thing to you:
> >
> > > It's supposed to support infinite combinations with starter having starters
> > > themselves, plus filters, etc...
> >
> > We've so far avoided recursion like that, we only have single level
> > groups etc.
>
> There is actually no recursion of any sort that the kernel has to handle.
> The starter/stopper links are never handled recursively, i.e. there is no
> loop walking through the entire chain of starter to starter to starter, etc.
>
> It's only ever handled between directly related events: starter and target,
> never further.
>
> Only the final effect has recursion properties in the resulting count
> or trace.
That's not an answer to my question.
On Wed, Mar 16, 2011 at 02:56:15PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-03-16 at 14:53 +0100, Frederic Weisbecker wrote:
> > On Tue, Mar 15, 2011 at 11:32:54PM +0100, Peter Zijlstra wrote:
> > > Right, so I don't much like the interface, two new ioctl()s and a flag
> > > of dubious use.
> >
> > Do you think we need a new syscall for this new feature?
>
> No.
>
> > > How important is this recursive nature of the thing to you:
> > >
> > > > It's supposed to support infinite combinations with starter having starters
> > > > themselves, plus filters, etc...
> > >
> > > We've so far avoided recursion like that, we only have single level
> > > groups etc.
> >
> > There is actually no recursion of any sort that the kernel has to handle.
> > The starter/stopper links are never handled recursively, i.e. there is no
> > loop walking through the entire chain of starter to starter to starter, etc.
> >
> > It's only ever handled between directly related events: starter and target,
> > never further.
> >
> > Only the final effect has recursion properties in the resulting count
> > or trace.
>
> That's not an answer to my question.
Well, since there is no recursion involved, there should be no more worries
about whether or not to allow starters on starters. And I think it's an
important feature; in fact it's one of the key properties that lets one
define whatever kind of precise stacked context.
The possible use cases are so wide that I have a hard time finding a good
example. Counting instructions in exceptions during some specific syscalls,
counting instructions when some lock is taken in some irq handler, or
whatever...
On Tue, Mar 15, 2011 at 06:54:19PM +0100, Frederic Weisbecker wrote:
> On Tue, Mar 15, 2011 at 10:36:18PM +0800, Lin Ming wrote:
> > On Tue, Mar 15, 2011 at 3:18 AM, Frederic Weisbecker <[email protected]> wrote:
> >
> > >
> > > +static void perf_event_pause_resume(struct perf_event *event, int nmi,
> > > +				    struct perf_sample_data *data,
> > > +				    struct pt_regs *regs)
> > > +{
> > > +	struct perf_event *iter;
> > > +	unsigned long flags;
> > > +
> > > +	/*
> > > +	 * Ensure the targets can't be sched in/out concurrently.
> > > +	 * Disabling irqs is sufficient for that because starters/stoppers
> > > +	 * are on the same cpu/task.
> > > +	 */
> > > +	local_irq_save(flags);
> >
> > Could you explain this in more detail?
>
> Yeah, I should have detailed that more.
>
> So, I put a constraint on starters and stoppers: those must be attached
> to the same task and cpu as the target. That allows us to do this
> pause/resume lockless, if we can ensure that:
>
> - target sched in/out can't interrupt perf_event_pause_resume()
> - perf_event_pause_resume() can't interrupt the target in the middle of
> event_sched_in()
>
> So that both are strictly serialized.
>
> We need to ensure that the target event cannot be concurrently scheduled
> in (->add()) or scheduled out (->del()), so that when we check
> PERF_EVENT_STATE_ACTIVE, we know that the event is currently running
> and is not going to move while we do our checks and call start() and
> stop().
>
> So the rationale is that the target cannot be in the middle of
> event_sched_in() or event_sched_out() when the starter/stopper
> triggers. We have no guarantee of that currently, especially because
> of events that trigger in NMIs, but perhaps also in other corner cases,
> so I'll need to think about it later. Maybe by using pmu_disable_all()
> on the starter/stopper when the target is about to schedule in/out, until
> we know the event->state really reflects the hardware and logical states.
>
> Now event_sched_in() and event_sched_out() can still be called from an
> IPI to enable/disable an event. Hence the disabled interrupts, to prevent
> that.
>
> > > +
> > > +
> > > +	/* Prevent the targets from being removed under us. */
> > > +	rcu_read_lock();
And BTW this rcu_read_lock() is not necessary. The target cannot be removed
from under us.
Also, there is another race to take care of: if the starter and the stopper
trigger at the same time, we are going to call ->start() and ->stop()
concurrently. Not sure yet how to solve that.
On Wed, 2011-03-16 at 15:02 +0100, Frederic Weisbecker wrote:
> The possible use cases are so wide that I have a hard time finding a good
> example. Counting instructions in exceptions during some specific syscalls,
> counting instructions when some lock is taken in some irq handler, or
> whatever...
All of which can be done without the recursion, but ok.
But what you're saying is that you want to be able to build a full
expression tree of events.
Now the thing I dislike about the whole thing is that it's essentially a
filter, and we already have a filter thing; this adds a second, totally
disjoint filter capability.
Furthermore, you've split the start/stop things as if they're somehow
different, when in fact they're pretty much the same thing, and you can
even think of more classes, like a toggle event or whatever.
You've also split the start/stop events from the regular event lists,
making the whole event management and inheritance stuff even more
complicated.
Furthermore, there is a definite possibility for weird behaviour in
there, in that if you're trying to measure a similar event to the one
that is used to enable/disable it, it very much depends on the order of
the demux lists as to which is processed first.
The simple scheme I came up with is having these events be part of the
event group and adding only one field:
pause_ops : 2
with:
enum perf_event_pause_ops {
PERF_PAUSE_OP_NOP = 0,
PERF_PAUSE_OP_INC,
PERF_PAUSE_OP_DEC,
PERF_PAUSE_OP_TOGGLE,
};
and have INC increment the parent pause field and clip at INT_MAX, DEC
decrement the pause field and clip at 0, and TOGGLE do ^1.
That however doesn't allow for these full expression trees, so we need
to come up with something else. It does however do away with the
ioctl()s, that redundant flag and the weird event separation.
It is still susceptible to the demux order.
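For illustration, a minimal sketch of those semantics (the 'pause' field
and the helper are hypothetical, assuming "parent" means the group leader;
this just transcribes the proposal, it is not from any patchset):

	static void perf_event_apply_pause_op(struct perf_event *event)
	{
		struct perf_event *leader = event->group_leader;

		switch (event->attr.pause_ops) {
		case PERF_PAUSE_OP_INC:
			if (leader->pause < INT_MAX)	/* clip at INT_MAX */
				leader->pause++;
			break;
		case PERF_PAUSE_OP_DEC:
			if (leader->pause > 0)		/* clip at 0 */
				leader->pause--;
			break;
		case PERF_PAUSE_OP_TOGGLE:
			leader->pause ^= 1;		/* do ^1 */
			break;
		case PERF_PAUSE_OP_NOP:
		default:
			break;
		}

		/* Gated siblings would only count while leader->pause == 0. */
	}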
(2011/03/16 10:03), Frederic Weisbecker wrote:
> On Tue, Mar 15, 2011 at 04:24:22PM -0300, Arnaldo Carvalho de Melo wrote:
>> Em Tue, Mar 15, 2011 at 07:58:16PM +0100, Frederic Weisbecker escreveu:
>>> If we want to count everywhere but when we hold B:
>>>
>>> -e inst*@(..lock:*acquire(B) && lock:*release(B)..)
>>
>> Makes sense, but looks confusing at first sight: how do we translate that to
>> event starters/stoppers?
>>
>> It's an event: instructions. It starts in what state? Enabled, i.e. the
>> first part of the expression has precedence: ..lock:*acquire(B). Then,
>> after it is disabled because we acquired B, the second part gets armed,
>> i.e. when the release(B) event takes place, the first part gets armed
>> again.
>>
>> That is why I felt '&&' to be confusing: it's not that both are in
>> effect, it's that one is only armed when the previous one was disarmed.
>>
>> Perhaps we could find some other operator that is more natural, or just
>> agree that && in this context works this way, almost like what we do in
>> shell scripting with && and || (the cycle to rearm the first range after
>> the last one is disarmed is the part that doesn't match the shell
>> use).
>
> Doh, you're right, && would have two meanings.
> No, we should probably keep a && b as meaning we are
> in the range a AND in the range b, both at the same time, with
> a evaluated first and then b. We also need to ensure that
> a && b doesn't mean the same as b && a. You're right, perhaps
> we need another operator to express inclusion, or we need to
> assume that specific meaning of &&.
>
> For what I wanted to express in the example above, || seems to be the
> right choice: -e inst*@(..lock:*acquire(B) || lock:*release(B)..)
>
> So || would mean union and && would mean inclusion.
Hmm, would we really need such complex rules?
It seems that we only need the union case. If so, I'd suggest
using ',' to express that, instead of ||.
-e inst*@(..lock:*acquire(B),lock:*release(B)..)
Thank you,
--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]
Em Thu, Mar 17, 2011 at 12:47:01AM +0900, Masami Hiramatsu escreveu:
> (2011/03/16 10:03), Frederic Weisbecker wrote:
> > Doh, you're right, && would have two meanings.
> > No, we should probably keep a && b as meaning we are
> > in the range a AND in the range b, both at the same time, with
> > a evaluated first and then b. We also need to ensure that
> > a && b doesn't mean the same as b && a. You're right, perhaps
> > we need another operator to express inclusion, or we need to
> > assume that specific meaning of &&.
> >
> > For what I wanted to express in the example above, || seems to be the
> > right choice: -e inst*@(..lock:*acquire(B) || lock:*release(B)..)
> >
> > So || would mean union and && would mean inclusion.
>
> Hmm, would we really need such complex rules?
> It seems that we only need the union case. If so, I'd suggest
> using ',' to express that, instead of ||.
>
> -e inst*@(..lock:*acquire(B),lock:*release(B)..)
Yeah, I was somehow avoiding the comma operator because it could be used
to represent multiple events, but then it's a different context; using it
to represent a circular list of ranges in the @ (at, location) expression
seems ok.
1. '..lock:*acquire(B)' is armed, 'lock:*release(B)..' isn't
2. '..lock:*acquire(B)' triggers, which causes 'lock:*release(B)..' to be
armed
3. 'lock:*release(B)..' triggers, which causes '..lock:*acquire(B)' to
be armed, rinse, repeat
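A toy sketch of that circular arming, purely illustrative (it is not tied
to any real perf structure):

	struct range_cycle {
		int nr;		/* number of range boundaries in the comma list */
		int armed;	/* index of the only armed boundary */
		int counting;	/* whether the target counts right now */
	};

	/* Called when boundary event 'idx' fires. */
	static void boundary_trigger(struct range_cycle *rc, int idx)
	{
		if (idx != rc->armed)
			return;				/* disarmed: ignored */
		rc->counting = !rc->counting;		/* enter or leave a range */
		rc->armed = (rc->armed + 1) % rc->nr;	/* arm the next boundary */
	}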
- Arnaldo
On Wed, 2011-03-16 at 14:53 -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Mar 17, 2011 at 12:47:01AM +0900, Masami Hiramatsu escreveu:
> > (2011/03/16 10:03), Frederic Weisbecker wrote:
> > > Doh, you're right, && would have two meanings.
> > > No, we should probably keep a && b as meaning we are
> > > in the range a AND in the range b, both at the same time, with
> > > a evaluated first and then b. We also need to ensure that
> > > a && b doesn't mean the same as b && a. You're right, perhaps
> > > we need another operator to express inclusion, or we need to
> > > assume that specific meaning of &&.
> > >
> > > For what I wanted to express in the example above, || seems to be the
> > > right choice: -e inst*@(..lock:*acquire(B) || lock:*release(B)..)
> > >
> > > So || would mean union and && would mean inclusion.
> >
> > Hmm, would we really need such complex rules?
> > It seems that we only need the union case. If so, I'd suggest
> > using ',' to express that, instead of ||.
> >
> > -e inst*@(..lock:*acquire(B),lock:*release(B)..)
>
> Yeah, I was somehow avoiding the comma operator because it could be used
> to represent multiple events, but then it's a different context; using it
> to represent a circular list of ranges in the @ (at, location) expression
> seems ok.
>
> 1. '..lock:*acquire(B)' is armed, 'lock:*release(B)..' isn't
> 2. '..lock:*acquire(B)' triggers, which causes 'lock:*release(B)..' to be
> armed
> 3. 'lock:*release(B)..' triggers, which causes '..lock:*acquire(B)' to
> be armed, rinse, repeat
How about we start writing proper EBNF syntax rules for this stuff? It's
getting seriously out of hand.
2011/3/16 Peter Zijlstra <[email protected]>:
> On Wed, 2011-03-16 at 15:02 +0100, Frederic Weisbecker wrote:
>
>> The possible use cases are so wide that I have a hard time finding a good
>> example. Counting instructions in exceptions during some specific syscalls,
>> counting instructions when some lock is taken in some irq handler, or
>> whatever...
>
> All of which can be done without the recursion, but ok.
>
> But what you're saying is that you want to be able to build a full
> expression tree of events.
Yep.
>
> Now the thing I dislike about the whole thing is that it's essentially a
> filter, and we already have a filter thing; this adds a second, totally
> disjoint filter capability.
>
> Furthermore, you've split the start/stop things as if they're somehow
> different, when in fact they're pretty much the same thing, and you can
> even think of more classes, like a toggle event or whatever.
Indeed, there is a lot of code duplication I wanted to clean up into
some toggle thing. I just wanted to get an early ping about the design
first. And I was right :)
> You've also split the start/stop events from the regular event lists,
> making the whole event management and inheritance stuff even more
> complicated.
Agreed. I wish we could solve that too.
> Furthermore, there is a definite possibility for weird behaviour in
> there, in that if you're trying to measure a similar event to the one
> that is used to enable/disable it, it very much depends on the order of
> the demux lists as to which is processed first.
Do you have an example of that? I have some trouble figuring out the
issue.
>
> The simple scheme I came up with is having these events be part of the
> event group and adding only one field:
>
> pause_ops : 2
>
> with:
>
> enum perf_event_pause_ops {
> 	PERF_PAUSE_OP_NOP = 0,
> 	PERF_PAUSE_OP_INC,
> 	PERF_PAUSE_OP_DEC,
> 	PERF_PAUSE_OP_TOGGLE,
> };
>
> and have INC increment the parent pause field and clip at INT_MAX, DEC
> decrement the pause field and clip at 0, and TOGGLE do ^1.
>
> That however doesn't allow for these full expression trees, so we need
> to come up with something else. It does however do away with the
> ioctl()s, that redundant flag and the weird event separation.
That's a great idea! It indeed utterly simplifies the implementation
and the interface, and reuses what we already have.
This makes me reconsider the recursion issue now.
But I'm still worried. I don't know how useful the event tree could be, but
I have the feeling we are losing a nice and very flexible feature if
we get rid of it.
This may be something we can solve later if we accept recursive groups,
although things may be even more complicated if we take that path.
Thanks.
On Fri, 2011-03-25 at 15:47 +0100, Frederic Weisbecker wrote:
> > Furthermore, there is a definite possibility for weird behaviour in
> > there, in that if you're trying to measure a similar event to the one
> > that is used to enable/disable it, it very much depends on the order of
> > the demux lists as to which is processed first.
>
> Do you have an example of that? I have some trouble figuring out the
> issue.
Suppose, for some reason or other, you gate irq_enter counting with
irq_enter/irq_exit (pretty daft example, but hey); then irq_enter will
have two events associated: one trigger and one counter gated by the
trigger.
Now when irq_enter happens, the software event code in
do_perf_sw_event() will iterate the events interested in the event (yay
for naming). If the trigger comes first, it will have enabled the gate
and the event will count; if the event comes first, then the gate will
still be disabled and we won't count.
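Roughly, with made-up helper names for the gate (the real demux being the
list walk in do_perf_sw_event()):

	static void sw_event_demux_sketch(struct list_head *head)
	{
		struct perf_event *event;

		/* Both the trigger and the gated counter hang off this list. */
		list_for_each_entry(event, head, sw_list) {
			if (is_trigger(event))
				open_gate(event);	/* arms the gated counter */
			else if (gate_is_open(event))
				count_occurrence(event);
		}

		/*
		 * If the gated counter sits before its trigger in the list,
		 * this occurrence is missed; if it sits after, it is counted.
		 */
	}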
> > The simple scheme I came up with is having these events be part of the
> > event group and adding only one field:
> >
> > pause_ops : 2
> >
> > with:
> >
> > enum perf_event_pause_ops {
> > PERF_PAUSE_OP_NOP = 0,
> > PERF_PAUSE_OP_INC,
> > PERF_PAUSE_OP_DEC,
> > PERF_PAUSE_OP_TOGGLE,
> > };
> >
> > and have INC increment the parent pause field and clip at INT_MAX, DEC
> > decrement the pause field and clip at 0, and TOGGLE do ^1.
> >
> > That however doesn't allow for these full expression trees, so we need
> > to come up with something else. It does however do away with the
> > ioctl()s, that redundant flag and the weird event separation.
>
> That's a great idea! It indeed utterly simplifies the implementation
> and the interface, and reuses what we already have.
> This makes me reconsider the recursion issue now.
>
> But I'm still worried. I don't know how useful the event tree could be, but
> I have the feeling we are losing a nice and very flexible feature if
> we get rid of it.
> This may be something we can solve later if we accept recursive groups,
> although things may be even more complicated if we take that path.
Yeah, like I said, I'm not sure my proposal is any good because of the
limited functionality it provides, but it's a different way of looking at
the problem, so I provided it.
We'll have to try and explore the issue a bit more to see if there's
anything else we can come up with.
On Fri, Mar 25, 2011 at 04:03:53PM +0100, Peter Zijlstra wrote:
> On Fri, 2011-03-25 at 15:47 +0100, Frederic Weisbecker wrote:
>
> > > Furthermore, there is a definite possibility for weird behaviour in
> > > there, in that if you're trying to measure a similar event to the one
> > > that is used to enable/disable it, it very much depends on the order of
> > > the demux lists as to which is processed first.
> >
> > Do you have an example of that? I have some trouble figuring out the
> > issue.
>
> Suppose, for some reason or other, you gate irq_enter counting with
> irq_enter/irq_exit (pretty daft example, but hey); then irq_enter will
> have two events associated: one trigger and one counter gated by the
> trigger.
>
> Now when irq_enter happens, the software event code in
> do_perf_sw_event() will iterate the events interested in the event (yay
> for naming). If the trigger comes first, it will have enabled the gate
> and the event will count; if the event comes first, then the gate will
> still be disabled and we won't count.
Ok, I see. So the demux order problem only happens if the trigger is the
same kind of event as the target?
But that's an inherent problem of such an event tree; I'm not sure we can
do much about it. Perhaps forcing some order somehow, no idea; we can
probably deal with that later.
>
> > > The simple scheme I came up with is having these events be part of the
> > > event group and adding only one field:
> > >
> > > pause_ops : 2
> > >
> > > with:
> > >
> > > enum perf_event_pause_ops {
> > > PERF_PAUSE_OP_NOP = 0,
> > > PERF_PAUSE_OP_INC,
> > > PERF_PAUSE_OP_DEC,
> > > PERF_PAUSE_OP_TOGGLE,
> > > };
> > >
> > > and have INC increment the parent pause field and clip at INT_MAX, DEC
> > > decrement the pause field and clip at 0, and TOGGLE do ^1.
> > >
> > > That however doesn't allow for these full expression trees, so we need
> > > to come up with something else. It does however do away with the
> > > ioctl()s, that redundant flag and the weird event separation.
> >
> > That's a great idea! It indeed utterly simplifies the implementation
> > and the interface, and reuses what we already have.
> > This makes me reconsider the recursion issue now.
> >
> > But I'm still worried. I don't know how useful the event tree could be, but
> > I have the feeling we are losing a nice and very flexible feature if
> > we get rid of it.
> > This may be something we can solve later if we accept recursive groups,
> > although things may be even more complicated if we take that path.
>
> Yeah, like I said, I'm not sure my proposal is any good because of the
> limited functionality it provides, but it's a different way of looking at
> the problem, so I provided it.
>
> We'll have to try and explore the issue a bit more to see if there's
> anything else we can come up with.
So, at least we can factorize the notion of starter/stopper into a
single trigger attr, which has some inc/dec properties, like you
proposed. That still requires a silly ioctl though.
About the inherit part, I thought about making the trigger only apply to
events inside the same group. That may simplify the inherit implementation
a bit by pushing the mess a bit deeper. OTOH, that doesn't solve the fact
that we have to deal with a kind of duplicate event list.
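For concreteness, a rough userspace sketch of the group-scoped idea. The
pause_ops attr field is purely hypothetical (it only exists in this
discussion), so it is kept in a comment to leave the sketch compilable:

	#include <linux/perf_event.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <stdint.h>

	static int sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				       int cpu, int group_fd, unsigned long flags)
	{
		return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
	}

	int main(void)
	{
		/* id read from debugfs: tracing/events/irq/irq_handler_entry/id */
		uint64_t irq_handler_entry_id = 0;

		struct perf_event_attr counter = {
			.size	= sizeof(counter),
			.type	= PERF_TYPE_HARDWARE,
			.config	= PERF_COUNT_HW_INSTRUCTIONS,
		};
		struct perf_event_attr trigger = {
			.size	= sizeof(trigger),
			.type	= PERF_TYPE_TRACEPOINT,
			.config	= irq_handler_entry_id,
			/* .pause_ops = PERF_PAUSE_OP_TOGGLE,  <-- proposed field */
		};

		int leader = sys_perf_event_open(&counter, 0, -1, -1, 0);
		/* Opened in the leader's group: the trigger would gate its siblings. */
		int trig = sys_perf_event_open(&trigger, 0, -1, leader, 0);

		return (leader < 0 || trig < 0);
	}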
Thoughts?