2012-06-04 08:13:34

by Andrei Vagin

[permalink] [raw]
Subject: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

Do you want to know where your code waits on locks for a long time?
Then this feature is for you: it helps you find such bottlenecks.

This is not an artificial task. Once one of my colleagues was investigating a
scalability problem. He pressed sysrq-t a few times and tried to merge the
call chains by hand. perf can do that better.

Here is the fourth attempt to add this functionality to the kernel.

Previous attempts:
* [PATCH 0/4] trace: add ability to collect call chains of non-current task.

* [PATCH rh6 0/7] Teach perf tool to collect D-state statistics
  This version doesn't work for non-root users.

* "sched/tracing: Add a new tracepoint for sleeptime" by Arun Sharma
  It was committed, but then reverted for performance reasons.

This series resolves a problem with event filtering and non-root users.
Previously I tried to use the sched_stat* events, but they fire in the
context of another task, so the perf_event_context* logic doesn't work
for them.

This series adds one more event, sched:sched_switch_finish, which is sent
when a task starts executing on a CPU. When a task is switched out of a
CPU, sched:sched_switch is sent.

The time difference between sched_switch and sched_switch_finish is the
time during which the task was not executing on a CPU.

For profiling sleep time, events of both types should be collected and
filtered by the desired state (interruptible or uninterruptible sleep).

Then the events should be post-processed with the help of "perf inject".
It saves the time of the sched_switch event for each process, and when it
meets sched_switch_finish, it puts the time difference into sample->period
and writes this new sample into the new perf.data.

./perf record -e sched:sched_switch -g --filter 'prev_state == 1' \
-e sched:sched_switch_finish --filter 'state == 1' -P ~/foo

./perf inject -i ~/perf.data -o perf.data.new -s
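
The merged perf.data.new can then be browsed with the usual report command
(presumably how the output below was produced; default options assumed):

./perf report -i perf.data.new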

A test program:
#include <unistd.h>
#include <time.h>
#include <sys/select.h>

int main()
{
	struct timespec ts1;
	struct timeval tv1;
	int i;
	long s;

	for (i = 0; i < 2; i++) {
		/* sleep 10 ms in nanosleep() */
		ts1.tv_sec = 0;
		ts1.tv_nsec = 10000000;
		nanosleep(&ts1, NULL);

		/* sleep 40 ms in select() */
		tv1.tv_sec = 0;
		tv1.tv_usec = 40000;
		select(0, NULL, NULL, NULL, &tv1);

		/* occasionally burn cpu for a few seconds */
		if (i % 100 == 0) {
			s = time(NULL);
			while (time(NULL) - s < 4);
		}
	}
	return 1;
}

Profiling results:
# Overhead Command Shared Object Symbol
# ........ ....... ................. ......................
#
100.00% foo [kernel.kallsyms] [k] finish_task_switch
|
--- finish_task_switch
__schedule
schedule
|
|--79.85%-- schedule_hrtimeout_range_clock
| schedule_hrtimeout_range
| poll_schedule_timeout
| do_select
| core_sys_select
| sys_select
| system_call
| __select
| __libc_start_main
|
--20.15%-- do_nanosleep
hrtimer_nanosleep
sys_nanosleep
system_call
__GI___libc_nanosleep
__libc_start_main

Andrew Vagin (5):
event: add an event sched_switch_finish
sched: send the event sched_switch before perf_event_task_sched_out
sched: save a previous state on task_struct
perf: teach "perf inject" to work with files
perf: teach perf inject to merge sched_switch* events

include/linux/sched.h | 3 +
include/trace/events/sched.h | 29 ++++++++++
kernel/sched/core.c | 3 +-
tools/perf/builtin-inject.c | 119 +++++++++++++++++++++++++++++++++++++++---
4 files changed, 146 insertions(+), 8 deletions(-)


2012-06-04 08:10:32

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 2/5] sched: send the event sched_switch before perf_event_task_sched_out

Otherwise they can't be filtered for a given task: perf_event_task_sched_out()
switches out the task's event context, so if the tracepoint fires after it,
per-task sched:sched_switch events are never delivered.

perf record -e sched:sched_switch ./foo

This command doesn't report any events without this patch.

Signed-off-by: Andrew Vagin <[email protected]>
---
kernel/sched/core.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39eb601..e17a3ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1911,12 +1911,12 @@ static inline void
prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
+ trace_sched_switch(prev, next);
sched_info_switch(prev, next);
perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
- trace_sched_switch(prev, next);
}

/**
@@ -1976,6 +1976,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
kprobe_flush_task(prev);
put_task_struct(prev);
}
+ trace_sched_switch_finish(current);
}

#ifdef CONFIG_SMP
--
1.7.1

2012-06-04 08:10:30

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 3/5] sched: save a previous state on task_struct

prev_state is the state of a task at the moment it stops executing on a CPU.

It is needed for filtering sched:sched_switch_finish.
For example, if we want to track when a task is sleeping:

./perf record -e sched:sched_switch -g --filter 'prev_state == 1' \
-e sched:sched_switch_finish --filter 'state == 1' -P foo

The previous state is set in sched:sched_switch, so this works when
both events, sched:sched_switch and sched:sched_switch_finish, are
enabled.
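
For reference, the numeric values used in the state filters correspond to the
usual task states from include/linux/sched.h, so 'state == 1' selects
interruptible sleep and 'state == 2' uninterruptible sleep:

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2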

Signed-off-by: Andrew Vagin <[email protected]>
---
include/linux/sched.h | 3 +++
include/trace/events/sched.h | 12 +++++++++++-
2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f45c0b2..172bbe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1576,6 +1576,9 @@ struct task_struct {
struct uprobe_task *utask;
int uprobe_srcu_id;
#endif
+#ifdef CONFIG_EVENT_TRACING
+ volatile long prev_state;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index b2219d8..c1b70b1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -133,6 +133,7 @@ TRACE_EVENT(sched_switch,
__entry->prev_pid = prev->pid;
__entry->prev_prio = prev->prio;
__entry->prev_state = __trace_sched_switch_state(prev);
+ prev->prev_state = __trace_sched_switch_state(prev);
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
@@ -158,14 +159,23 @@ TRACE_EVENT(sched_switch_finish,
TP_STRUCT__entry(
__array( char, comm, TASK_COMM_LEN )
__field( pid_t, pid )
+ __field( long, state )
),

TP_fast_assign(
__entry->pid = p->pid;
+ __entry->state = p->prev_state;
memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
),

- TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
+ TP_printk("comm=%s pid=%d state=%s%s",
+ __entry->comm, __entry->pid,
+ __entry->state & (TASK_STATE_MAX-1) ?
+ __print_flags(__entry->state & (TASK_STATE_MAX-1), "|",
+ { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+ { 16, "Z" }, { 32, "X" }, { 64, "x" },
+ { 128, "W" }) : "R",
+ __entry->state & TASK_STATE_MAX ? "+" : "")
);

/*
--
1.7.1

2012-06-04 08:10:27

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 1/5] sched: event: add trace events when a task starts executing on a cpu

sched_switch_finish is sent when a task starts executing on a CPU.
It is sent in the context of that task, which is why sched_switch can't
be used for this. sched_switch is sent when a task is switched out of a CPU.

These two events will be needed for profiling sleep time.
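
For illustration, in the ftrace output the new event would look roughly like
this (values made up):

    foo-1234  [001]  1234.567890: sched_switch_finish: comm=foo pid=1234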

Signed-off-by: Andrew Vagin <[email protected]>
---
include/trace/events/sched.h | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index ea7a203..b2219d8 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -149,6 +149,25 @@ TRACE_EVENT(sched_switch,
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);

+TRACE_EVENT(sched_switch_finish,
+
+ TP_PROTO(struct task_struct *p),
+
+ TP_ARGS(p),
+
+ TP_STRUCT__entry(
+ __array( char, comm, TASK_COMM_LEN )
+ __field( pid_t, pid )
+ ),
+
+ TP_fast_assign(
+ __entry->pid = p->pid;
+ memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+ ),
+
+ TP_printk("comm=%s pid=%d", __entry->comm, __entry->pid)
+);
+
/*
* Tracepoint for a task being migrated:
*/
--
1.7.1

2012-06-04 08:13:13

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 4/5] perf: teach "perf inject" to work with files

Before this patch, "perf inject" could only handle data from a pipe.

I want to use "perf inject" for reworking events; see the following patch.
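
For example, with the new options the existing build-id pass can be run on
plain files as well (file names here are just illustrative):

./perf inject -b -i perf.data -o perf.data.new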

Signed-off-by: Andrew Vagin <[email protected]>
---
tools/perf/builtin-inject.c | 33 +++++++++++++++++++++++++++++++--
1 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index 3beab48..bc30a40 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -14,7 +14,12 @@

#include "util/parse-options.h"

-static char const *input_name = "-";
+static char const *input_name = "-";
+static const char *output_name = "-";
+static int pipe_output = 0;
+static int output;
+static u64 bytes_written = 0;
+
static bool inject_build_ids;

static int perf_event__repipe_synth(struct perf_tool *tool __used,
@@ -27,12 +32,14 @@ static int perf_event__repipe_synth(struct perf_tool *tool __used,
size = event->header.size;

while (size) {
- int ret = write(STDOUT_FILENO, buf, size);
+ int ret = write(output, buf, size);
if (ret < 0)
return -errno;

size -= ret;
buf += ret;
+
+ bytes_written += ret;
}

return 0;
@@ -244,8 +251,14 @@ static int __cmd_inject(void)
if (session == NULL)
return -ENOMEM;

+ if (!pipe_output)
+ lseek(output, session->header.data_offset, SEEK_SET);
ret = perf_session__process_events(session, &perf_inject);

+ if (!pipe_output) {
+ session->header.data_size = bytes_written;
+ perf_session__write_header(session, session->evlist, output, true);
+ }
perf_session__delete(session);

return ret;
@@ -259,6 +272,10 @@ static const char * const report_usage[] = {
static const struct option options[] = {
OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
"Inject build-ids into the output stream"),
+ OPT_STRING('i', "input", &input_name, "file",
+ "input file name"),
+ OPT_STRING('o', "output", &output_name, "file",
+ "output file name"),
OPT_INCR('v', "verbose", &verbose,
"be more verbose (show build ids, etc)"),
OPT_END()
@@ -274,6 +291,18 @@ int cmd_inject(int argc, const char **argv, const char *prefix __used)
if (argc)
usage_with_options(report_usage, options);

+ if (!strcmp(output_name, "-")) {
+ pipe_output = 1;
+ output = STDOUT_FILENO;
+ } else {
+ output = open(output_name, O_CREAT| O_WRONLY | O_TRUNC,
+ S_IRUSR | S_IWUSR);
+ if (output < 0) {
+ perror("failed to create output file");
+ exit(-1);
+ }
+ }
+
if (symbol__init() < 0)
return -1;

--
1.7.1

2012-06-04 08:13:32

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 5/5] perf: teach perf inject to merge sched_switch* events for profiling sleep times

The time difference between sched_switch and sched_switch_finish is the
time during which the task was not executing on a CPU.

My code saves the time of the sched_switch event for each process, and
when it meets sched_switch_finish, it puts the time difference into
sample->period and writes this new sample into the new perf.data.

As a result we can see where and for how long a task sleeps.

Signed-off-by: Andrew Vagin <[email protected]>
---
tools/perf/builtin-inject.c | 88 ++++++++++++++++++++++++++++++++++++++++--
1 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-inject.c b/tools/perf/builtin-inject.c
index bc30a40..631051e 100644
--- a/tools/perf/builtin-inject.c
+++ b/tools/perf/builtin-inject.c
@@ -13,6 +13,8 @@
#include "util/debug.h"

#include "util/parse-options.h"
+#include "util/trace-event.h"
+

static char const *input_name = "-";
static const char *output_name = "-";
@@ -21,6 +23,9 @@ static int output;
static u64 bytes_written = 0;

static bool inject_build_ids;
+static bool inject_sched_stat;
+
+struct perf_session *session;

static int perf_event__repipe_synth(struct perf_tool *tool __used,
union perf_event *event,
@@ -47,7 +52,7 @@ static int perf_event__repipe_synth(struct perf_tool *tool __used,

static int perf_event__repipe_op2_synth(struct perf_tool *tool,
union perf_event *event,
- struct perf_session *session __used)
+ struct perf_session *s __used)
{
return perf_event__repipe_synth(tool, event, NULL);
}
@@ -59,7 +64,7 @@ static int perf_event__repipe_event_type_synth(struct perf_tool *tool,
}

static int perf_event__repipe_tracing_data_synth(union perf_event *event,
- struct perf_session *session __used)
+ struct perf_session *s __used)
{
return perf_event__repipe_synth(NULL, event, NULL);
}
@@ -119,12 +124,12 @@ static int perf_event__repipe_task(struct perf_tool *tool,
}

static int perf_event__repipe_tracing_data(union perf_event *event,
- struct perf_session *session)
+ struct perf_session *s)
{
int err;

perf_event__repipe_synth(NULL, event, NULL);
- err = perf_event__process_tracing_data(event, session);
+ err = perf_event__process_tracing_data(event, s);

return err;
}
@@ -210,6 +215,75 @@ repipe:
return 0;
}

+struct event_entry
+{
+ struct list_head list;
+ u32 pid;
+ u64 time;
+};
+
+static LIST_HEAD(samples);
+
+static int perf_event__sched_stat(struct perf_tool *tool,
+ union perf_event *event,
+ struct perf_sample *sample,
+ struct perf_evsel *evsel __used,
+ struct machine *machine)
+{
+ int type;
+ struct event_format *e;
+ const char *evname = NULL;
+ struct event_entry *ent;
+
+ type = trace_parse_common_type(sample->raw_data);
+ e = trace_find_event(type);
+ if (e)
+ evname = e->name;
+
+ if (!strcmp(evname, "sched_switch")) {
+ list_for_each_entry(ent, &samples, list)
+ if (sample->pid == ent->pid)
+ break;
+
+ if (&ent->list != &samples) {
+ list_del(&ent->list);
+ free(ent);
+ }
+
+ ent = malloc(sizeof(struct event_entry));
+ ent->pid = sample->pid;
+ ent->time = sample->time;
+ list_add(&ent->list, &samples);
+ return 0;
+
+ } else if (!strcmp(evname, "sched_switch_finish")) {
+ union perf_event *event_sw;
+
+ list_for_each_entry(ent, &samples, list)
+ if (sample->pid == ent->pid)
+ break;
+
+ if (&ent->list == &samples) {
+ pr_debug("Could not find sched_switch for pid %u\n", sample->pid);
+ return 0;
+ }
+
+ event_sw = malloc(event->header.size);
+ if (event_sw == NULL)
+ return -1;
+
+ memcpy(event_sw, event, event->header.size);
+ sample->period = sample->time - ent->time;
+ perf_session__synthesize_sample(session, event_sw, sample);
+ perf_event__repipe(tool, event_sw, sample, machine);
+ free(event_sw);
+ return 0;
+ }
+
+ perf_event__repipe(tool, event, sample, machine);
+
+ return 0;
+}
struct perf_tool perf_inject = {
.sample = perf_event__repipe_sample,
.mmap = perf_event__repipe,
@@ -235,7 +309,6 @@ static void sig_handler(int sig __attribute__((__unused__)))

static int __cmd_inject(void)
{
- struct perf_session *session;
int ret = -EINVAL;

signal(SIGINT, sig_handler);
@@ -245,6 +318,9 @@ static int __cmd_inject(void)
perf_inject.mmap = perf_event__repipe_mmap;
perf_inject.fork = perf_event__repipe_task;
perf_inject.tracing_data = perf_event__repipe_tracing_data;
+ } else if (inject_sched_stat) {
+ perf_inject.sample = perf_event__sched_stat;
+ perf_inject.ordered_samples = true;
}

session = perf_session__new(input_name, O_RDONLY, false, true, &perf_inject);
@@ -272,6 +348,8 @@ static const char * const report_usage[] = {
static const struct option options[] = {
OPT_BOOLEAN('b', "build-ids", &inject_build_ids,
"Inject build-ids into the output stream"),
+ OPT_BOOLEAN('s', "sched-switch", &inject_sched_stat,
+ "merge sched-switch* for profiling sleep time"),
OPT_STRING('i', "input", &input_name, "file",
"input file name"),
OPT_STRING('o', "output", &output_name, "file",
--
1.7.1

2012-06-04 12:41:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On Mon, 2012-06-04 at 12:09 +0400, Andrew Vagin wrote:
> This series resolves a problem with event filtering and non-root users.
> Previously I tried to use the sched_stat* events, but they fire in the
> context of another task, so the perf_event_context* logic doesn't work
> for them.
>
> This series adds one more event, sched:sched_switch_finish, which is sent
> when a task starts executing on a CPU. When a task is switched out of a
> CPU, sched:sched_switch is sent.
>
> The time difference between sched_switch and sched_switch_finish is the
> time during which the task was not executing on a CPU.


So I really don't like this stuff either.. :-)

Mostly because it adds very similar stuff to what's already there and
then doesn't try and consolidate things.

And partly because you're measuring the 'wrong' thing, the delay between
sched-out and sched-in includes a number of things, only one part of
that is the delay due to blocking.

And as a result you need that hideous prev_state crap.


Would something like the below work?

The one thing I'm not entirely sure of is if this is a sekjoerity issue
or not.. anybody? I would think a task was entitled to know who woke it
and wherefrom etc..

---
include/linux/ftrace_event.h | 5 +++--
include/linux/perf_event.h | 3 ++-
include/trace/events/sched.h | 4 ++++
include/trace/ftrace.h | 6 +++++-
kernel/events/core.c | 19 ++++++++++++++++++-
kernel/sched/core.c | 2 +-
6 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 176a939..55b96e3 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -303,9 +303,10 @@ extern void *perf_trace_buf_prepare(int size, unsigned short type,

static inline void
perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
- u64 count, struct pt_regs *regs, void *head)
+ u64 count, struct pt_regs *regs, void *head,
+ struct task_struct *task)
{
- perf_tp_event(addr, count, raw_data, size, regs, head, rctx);
+ perf_tp_event(addr, count, raw_data, size, regs, head, rctx, task);
}
#endif

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f325786..a3d7b90 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1269,7 +1269,8 @@ static inline bool perf_paranoid_kernel(void)
extern void perf_event_init(void);
extern void perf_tp_event(u64 addr, u64 count, void *record,
int entry_size, struct pt_regs *regs,
- struct hlist_head *head, int rctx);
+ struct hlist_head *head, int rctx,
+ struct task_struct *task);
extern void perf_bp_event(struct perf_event *event, void *data);

#ifndef perf_misc_flags
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index ea7a203..31da561 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -75,6 +75,10 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
__entry->target_cpu = task_cpu(p);
),

+ TP_perf_assign(
+ __perf_task(p);
+ ),
+
TP_printk("comm=%s pid=%d prio=%d success=%d target_cpu=%03d",
__entry->comm, __entry->pid, __entry->prio,
__entry->success, __entry->target_cpu)
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 7697249..db14daf 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -711,6 +711,9 @@ __attribute__((section("_ftrace_events"))) *__event_##call = &event_##call
#undef __perf_count
#define __perf_count(c) __count = (c)

+#undef __perf_task
+#define __perf_task(t) __task = (t)
+
#undef TP_perf_assign
#define TP_perf_assign(args...) args

@@ -724,6 +727,7 @@ perf_trace_##call(void *__data, proto) \
struct ftrace_raw_##call *entry; \
struct pt_regs __regs; \
u64 __addr = 0, __count = 1; \
+ struct task_struct *__task = NULL; \
struct hlist_head *head; \
int __entry_size; \
int __data_size; \
@@ -751,7 +755,7 @@ perf_trace_##call(void *__data, proto) \
\
head = this_cpu_ptr(event_call->perf_events); \
perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \
- __count, &__regs, head); \
+ __count, &__regs, head, __task); \
}

/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b06cbb..cb84cbd6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5204,7 +5204,8 @@ static int perf_tp_event_match(struct perf_event *event,
}

void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
- struct pt_regs *regs, struct hlist_head *head, int rctx)
+ struct pt_regs *regs, struct hlist_head *head, int rctx,
+ struct task_struct *task)
{
struct perf_sample_data data;
struct perf_event *event;
@@ -5223,6 +5224,22 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
perf_swevent_event(event, count, &data, regs);
}

+ /*
+ * If we got specified a target task, also iterate its context and
+ * deliver this event there too.
+ */
+ if (task) {
+ struct perf_event_context *ctx;
+
+ rcu_read_lock();
+ ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
+ list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+ if (perf_tp_event_match(event, &data, regs))
+ perf_swevent_event(event, count, &data, regs);
+ }
+ rcu_read_unlock();
+ }
+
perf_swevent_put_recursion_context(rctx);
}
EXPORT_SYMBOL_GPL(perf_tp_event);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 420e118..8ac193c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1912,11 +1912,11 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next)
{
sched_info_switch(prev, next);
+ trace_sched_switch(prev, next); /* before we flip the trace context */
perf_event_task_sched_out(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
prepare_lock_switch(rq, next);
prepare_arch_switch(next);
- trace_sched_switch(prev, next);
}

/**

2012-06-04 12:50:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On Mon, 2012-06-04 at 14:40 +0200, Peter Zijlstra wrote:
> + if (task) {
> + struct perf_event_context *ctx;
> +
> + rcu_read_lock();
> + ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
> + list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {

Ah, this needs a little something like the below:

if (!(event->attr.type == PERF_TYPE_TRACEPOINT &&
event->attr.config == $something))
continue;

Not exactly sure where we hid the tracepoint_id at this point..


> + if (perf_tp_event_match(event, &data, regs))
> + perf_swevent_event(event, count, &data, regs);
> + }
> + rcu_read_unlock();
> + }

2012-06-04 13:46:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On Mon, 2012-06-04 at 14:40 +0200, Peter Zijlstra wrote:

> The one thing I'm not entirely sure of is if this is a sekjoerity issue
> or not.. anybody? I would think a task was entitled to know who woke it
> and wherefrom etc..

"sekjoerity"? Sure, play games with us native English speakers, who
would pronounce that as "seek-joe-rity" and be totally confused :-p
Who's joe, and why are we seeking him?

Anyway, the answer is yes it is. Well, that's because *everything* in
the kernel is a security issue. Now the real question is, can someone
use it to do harm. Well, yes. But can they use it to do more harm than
they can with other methods that exist today? Probably not.

An attacker with an unprivileged account could probably analyze a system
with just 'ps', to figure out what they can and cannot do. Perhaps they
could use perf to analyze what other things are happening, and even set
up their tools to use perf to time attacks. A wakeup can tell a user if
they were blocked on a mutex, and who just let go of that mutex to wake
the user up.

Can this information be used to continue some other kind of attack?
Maybe. But is it a big enough risk that it outweighs the usefulness of
the tool? Probably not.

As I said earlier, all kernel issues deal with finding joe. But the
question is a simple matter of risk vs usability. If you want your
system to be really secure, then lock it in a vault and do not allow
anything to connect to it. That is minimizing risk, but at the cost of
usability.

If a sysadmin doesn't want this open, then just have them keep the
paranoid level up for activating perf, and do not let users run it.

-- Steve

2012-06-04 17:02:55

by Arun Sharma

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On 6/4/12 5:40 AM, Peter Zijlstra wrote:

> And as a result you need that hideous prev_state crap.
>
>
> Would something like the below work?
>
> The one thing I'm not entirely sure of is if this is a sekjoerity issue
> or not.. anybody? I would think a task was entitled to know who woke it
> and wherefrom etc..

Frederic had some comments on this topic in the last round:

http://thread.gmane.org/gmane.linux.kernel/1195368/focus=1195959

> We should probably avoid the remote callchains, sounds like asking
> for complications everywhere.

Do they still apply? Frederic?

-Arun

2012-06-04 17:03:10

by Arun Sharma

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On 6/4/12 1:09 AM, Andrew Vagin wrote:
>
> * "sched/tracing: Add a new tracepoint for sleeptime" by Arun Sharma
> It has been committed, but then reverted due to performance reason.

Correctness actually. The first sample could potentially be bogus and
needs to be discarded in userspace. Hasn't been a problem in practice
for us.

-Arun

2012-06-04 20:52:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On Mon, 2012-06-04 at 10:01 -0700, Arun Sharma wrote:
> On 6/4/12 5:40 AM, Peter Zijlstra wrote:
>
> > And as a result you need that hideous prev_state crap.
> >
> >
> > Would something like the below work?
> >
> > The one thing I'm not entirely sure of is if this is a sekjoerity issue
> > or not.. anybody? I would think a task was entitled to know who woke it
> > and wherefrom etc..
>
> Frederic had some comments on this topic in the last round:
>
> http://thread.gmane.org/gmane.linux.kernel/1195368/focus=1195959
>
> > We should probably avoid the remote callchains, sounds like asking
> > for complications everywhere.
>
> Do they still apply? Frederic?

Probably, but note that the approaches are slightly different. The
previous thing would wreck the callchain for everybody, the proposed
thing will make more events.

Also, you don't need the callchain on the wakeup side of things. You
only want the callchain of the sched_switch site.

We could even dis-allow the callchain unwind when event->ctx->task &&
event->ctx->task != current -- pretend the unwind failed by writing 0
entries.

That also avoids the 'difficulty' of exposing and trying to interpret
another task's callchain.
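
A minimal sketch of that idea, assuming the callchain code has the event and a
preallocated perf_callchain_entry at hand (placement and names illustrative,
not against any particular tree):

	/*
	 * Sketch only: refuse to unwind for another task's context and
	 * pretend the unwind produced no entries.
	 */
	if (event->ctx->task && event->ctx->task != current) {
		entry->nr = 0;		/* report an empty callchain */
		return entry;
	}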

2012-06-06 08:32:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)


* Peter Zijlstra <[email protected]> wrote:

> The one thing I'm not entirely sure of is if this is a
> sekjoerity issue or not.. anybody? I would think a task was
> entitled to know who woke it and wherefrom etc..

I wouldn't worry about that really in the first approximation.
/proc already gives out much of that information.

Thanks,

Ingo

2012-06-12 13:15:34

by Andrei Vagin

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

Hello,

> So I really don't like this stuff either.. :-)
>
> Mostly because it adds very similar stuff to what's already there and
> then doesn't try and consolidate things.
>
> And partly because you're measuring the 'wrong' thing, the delay between
> sched-out and sched-in includes a number of things, only one part of
> that is the delay due to blocking.
>
> And as a result you need that hideous prev_state crap.
>

Yes, you are right.

>
>
> Would something like the below work?

Yes, I like it. Are you going to commit this? Thanks for this work.

2012-06-12 17:46:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] Teach perf tool to profile sleep times (V4)

On Tue, 2012-06-12 at 17:15 +0400, Andrew Wagin wrote:
> Yes, I like it. Are you going to commit this?

No, it's unfinished and untested.. not even sure it's seen a compiler.

Also, in another email I wrote:

> Also, you don't need the callchain on the wakeup side of things. You
> only want the callchain of the sched_switch site.
>
> We could even dis-allow the callchain unwind when event->ctx->task &&
> event->ctx->task != current -- pretend the unwind failed by writing 0
> entries.
>
> That also avoids the 'difficulty' of exposing and trying to interpret
> another task's callchain.

That probably wants to be done as well.

If you have the time, please look into this. Hoping of course Frederic
and others don't object to this.. :-)