2010-07-28 02:31:42

by Ian Munsie

Subject: [PATCH 1/2] ftrace: record command lines at more appropriate moment

From: Ian Munsie <[email protected]>

Previously, when tracing was activated through debugfs, regardless of
which tracing plugin (if any) was activated, the probe_sched_switch and
probe_sched_wakeup probes from the sched_switch plugin would be
activated. This appears to have been a hack to use them to record the
command lines of active processes as they were scheduled.

That approach would suffer if many processes were being scheduled that
were not generating events as they would consume entries in the
saved_cmdlines buffer that could otherwise have been used by other
processes that were actually generating events.
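
(For reference, saved_cmdlines is a small, fixed-size pid-to-comm cache;
roughly, paraphrasing kernel/trace/trace.c of this era:)

#define SAVED_CMDLINES 128
static unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
static unsigned map_cmdline_to_pid[SAVED_CMDLINES];
static char saved_cmdlines[SAVED_CMDLINES][TASK_COMM_LEN];
static int cmdline_idx;	/* slots are recycled round-robin */

Each pid not already present evicts the oldest slot, which is why
unrelated scheduling traffic can push out entries belonging to processes
that are generating events.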

It also had the problem that events could be mis-attributed - in the
common situation of a process forking then execing a new process, the
change of the process command would not be noticed for some time after
the exec until the process was next scheduled.

If the trace was read after the fact this would generally go unnoticed
because at some point the process would be scheduled and the entry in
the saved_cmdlines buffer would be updated so that the new command would
be reported when the trace was eventually read. However, if the events
were being read live (e.g. through trace_pipe), the events just after
the exec and before the process was next scheduled would show the
incorrect command (though the PID would be correct).

This patch removes the sched_switch hack altogether and instead records
the commands at a more appropriate moment - at the same time the PID of
the process is recorded (i.e. when an entry on the ring buffer is
reserved). This means that the recorded command line is much more likely
to be correct when the trace is read, either live or after the fact, so
long as the command line still resides in the saved_cmdlines buffer.

It is still not guaranteed to be correct in all situations. For
instance, if the trace is read after the fact rather than live, events
generated by a process before an exec would be attributed to the new
command (in the example below, to sleep rather than stealpid), since
the entry in saved_cmdlines would have changed before the events were
read. This is no different from the current situation, though, and the
alternative would be to store the command line with each and every
event.

terminal 1: grep '\-12345' /sys/kernel/debug/tracing/trace_pipe
terminal 2: ./stealpid 12345 `which sleep` 0.1
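
(stealpid is a small userspace helper used for this test, not included
with the patch; a minimal sketch of what it might look like - fork until
the kernel hands out the requested PID, then exec the target from that
child:)

/* stealpid.c - hypothetical sketch, not the actual tool */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
	pid_t want, pid;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <pid> <cmd> [args...]\n", argv[0]);
		return 1;
	}
	want = atoi(argv[1]);

	for (;;) {
		pid = fork();
		if (pid < 0) {
			perror("fork");
			return 1;
		}
		if (pid == 0) {
			if (getpid() != want)
				_exit(0);	/* not the PID we want */
			/* shows up as sys_clone then execve in the trace */
			execv(argv[2], &argv[2]);
			perror("execv");
			_exit(1);
		}
		if (pid == want)
			return 0;	/* the winning child is exec'ing */
		waitpid(pid, NULL, 0);	/* reap the loser, try again */
	}
}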

Before:
stealpid-12345 [003] 86.001826: sys_clone -> 0x0
stealpid-12345 [003] 86.002013: compat_sys_execve(ufilename: ffaaabef, argv: ffaaa7ec, envp: ffaaa7f8)
stealpid-12345 [002] 86.002292: sys_restart_syscall -> 0x0
stealpid-12345 [002] 86.002336: sys_brk(brk: 0)
stealpid-12345 [002] 86.002338: sys_brk -> 0x1007a000
...
stealpid-12345 [002] 86.002582: sys_mmap(addr: 0, len: 1000, prot: 3, flags: 22, fd: ffffffff, offset: 0)
stealpid-12345 [002] 86.002586: sys_mmap -> 0xf7c21000
sleep-12345 [002] 86.002771: sys_mprotect(start: ffe8000, len: 4000, prot: 1)
sleep-12345 [002] 86.002780: sys_mprotect -> 0x0
...

After:
stealpid-12345 [003] 1368.823626: sys_clone -> 0x0
stealpid-12345 [003] 1368.823820: compat_sys_execve(ufilename: fffa6bef, argv: fffa5afc, envp: fffa5b08)
sleep-12345 [002] 1368.824125: sys_restart_syscall -> 0x0
sleep-12345 [002] 1368.824173: sys_brk(brk: 0)
sleep-12345 [002] 1368.824175: sys_brk -> 0x104ae000
...

Signed-off-by: Ian Munsie <[email protected]>
---
kernel/trace/trace.c | 3 +--
kernel/trace/trace_events.c | 11 -----------
kernel/trace/trace_functions.c | 2 --
kernel/trace/trace_functions_graph.c | 2 --
kernel/trace/trace_sched_switch.c | 10 ----------
5 files changed, 1 insertions(+), 27 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4b1122d..f8458c3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1023,8 +1023,6 @@ void tracing_stop(void)
spin_unlock_irqrestore(&tracing_start_lock, flags);
}

-void trace_stop_cmdline_recording(void);
-
static void trace_save_cmdline(struct task_struct *tsk)
{
unsigned pid, idx;
@@ -1112,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
{
struct task_struct *tsk = current;

+ tracing_record_cmdline(tsk);
entry->preempt_count = pc & 0xff;
entry->pid = (tsk) ? tsk->pid : 0;
entry->lock_depth = (tsk) ? tsk->lock_depth : 0;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 09b4fa6..2c0cf38 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -180,10 +180,8 @@ void trace_event_enable_cmd_record(bool enable)
continue;

if (enable) {
- tracing_start_cmdline_record();
call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
} else {
- tracing_stop_cmdline_record();
call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
}
}
@@ -199,22 +197,13 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
case 0:
if (call->flags & TRACE_EVENT_FL_ENABLED) {
call->flags &= ~TRACE_EVENT_FL_ENABLED;
- if (call->flags & TRACE_EVENT_FL_RECORDED_CMD) {
- tracing_stop_cmdline_record();
- call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
- }
call->class->reg(call, TRACE_REG_UNREGISTER);
}
break;
case 1:
if (!(call->flags & TRACE_EVENT_FL_ENABLED)) {
- if (trace_flags & TRACE_ITER_RECORD_CMD) {
- tracing_start_cmdline_record();
- call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
- }
ret = call->class->reg(call, TRACE_REG_REGISTER);
if (ret) {
- tracing_stop_cmdline_record();
pr_info("event trace: Could not enable event "
"%s\n", call->name);
break;
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 16aee4d..0800e91 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -31,7 +31,6 @@ static int function_trace_init(struct trace_array *tr)
tr->cpu = get_cpu();
put_cpu();

- tracing_start_cmdline_record();
tracing_start_function_trace();
return 0;
}
@@ -39,7 +38,6 @@ static int function_trace_init(struct trace_array *tr)
static void function_trace_reset(struct trace_array *tr)
{
tracing_stop_function_trace();
- tracing_stop_cmdline_record();
}

static void function_trace_start(struct trace_array *tr)
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 6bff236..4bb8811 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -321,14 +321,12 @@ static int graph_trace_init(struct trace_array *tr)
&trace_graph_entry);
if (ret)
return ret;
- tracing_start_cmdline_record();

return 0;
}

static void graph_trace_reset(struct trace_array *tr)
{
- tracing_stop_cmdline_record();
unregister_ftrace_graph();
}

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 8f758d0..9a5cb2c 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -190,16 +190,6 @@ static void tracing_stop_sched_switch(void)
mutex_unlock(&sched_register_mutex);
}

-void tracing_start_cmdline_record(void)
-{
- tracing_start_sched_switch();
-}
-
-void tracing_stop_cmdline_record(void)
-{
- tracing_stop_sched_switch();
-}
-
/**
* tracing_start_sched_switch_record - start tracing context switches
*
--
1.7.1


2010-07-28 02:31:52

by Ian Munsie

Subject: [PATCH 2/2] Revert "tracing: Allow to disable cmdline recording"

From: Ian Munsie <[email protected]>

This reverts commit e870e9a1240bcef1157ffaaf71dac63362e71904.

That patch was submitted to provide an option to disable recording
command lines since we were using a hack to record them using the
sched_switch probes which added significant overhead to context
switching.

Since we no longer use that hack and command lines are now recorded when
a ring buffer entry is reserved (at the same time the PID is recorded),
the original purpose of this patch is no longer present and the option
to disable command line recording is not really required any more.

Signed-off-by: Ian Munsie <[email protected]>
---
include/linux/ftrace_event.h | 7 ++-----
kernel/trace/trace.c | 6 +-----
kernel/trace/trace.h | 3 ---
kernel/trace/trace_events.c | 18 ------------------
4 files changed, 3 insertions(+), 31 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 02b8b24..b7915d6 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -153,13 +153,11 @@ extern int ftrace_event_reg(struct ftrace_event_call *event,
enum {
TRACE_EVENT_FL_ENABLED_BIT,
TRACE_EVENT_FL_FILTERED_BIT,
- TRACE_EVENT_FL_RECORDED_CMD_BIT,
};

enum {
- TRACE_EVENT_FL_ENABLED = (1 << TRACE_EVENT_FL_ENABLED_BIT),
- TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
- TRACE_EVENT_FL_RECORDED_CMD = (1 << TRACE_EVENT_FL_RECORDED_CMD_BIT),
+ TRACE_EVENT_FL_ENABLED = (1 << TRACE_EVENT_FL_ENABLED_BIT),
+ TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
};

struct ftrace_event_call {
@@ -177,7 +175,6 @@ struct ftrace_event_call {
* 32 bit flags:
* bit 1: enabled
* bit 2: filter_active
- * bit 3: enabled cmd record
*
* Changes to flags must hold the event_mutex.
*
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index f8458c3..7fe152a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -344,7 +344,7 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
/* trace_flags holds trace_options default values */
unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO | TRACE_ITER_SLEEP_TIME |
- TRACE_ITER_GRAPH_TIME | TRACE_ITER_RECORD_CMD;
+ TRACE_ITER_GRAPH_TIME;

static int trace_stop_count;
static DEFINE_SPINLOCK(tracing_start_lock);
@@ -428,7 +428,6 @@ static const char *trace_options[] = {
"latency-format",
"sleep-time",
"graph-time",
- "record-cmd",
NULL
};

@@ -2517,9 +2516,6 @@ static void set_tracer_flags(unsigned int mask, int enabled)
trace_flags |= mask;
else
trace_flags &= ~mask;
-
- if (mask == TRACE_ITER_RECORD_CMD)
- trace_event_enable_cmd_record(enabled);
}

static ssize_t
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index d05c873..a6f61bd 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -582,7 +582,6 @@ enum trace_iterator_flags {
TRACE_ITER_LATENCY_FMT = 0x20000,
TRACE_ITER_SLEEP_TIME = 0x40000,
TRACE_ITER_GRAPH_TIME = 0x80000,
- TRACE_ITER_RECORD_CMD = 0x100000,
};

/*
@@ -715,8 +714,6 @@ filter_check_discard(struct ftrace_event_call *call, void *rec,
return 0;
}

-extern void trace_event_enable_cmd_record(bool enable);
-
extern struct mutex event_mutex;
extern struct list_head ftrace_events;

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 2c0cf38..8f8dd73 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -170,24 +170,6 @@ int ftrace_event_reg(struct ftrace_event_call *call, enum trace_reg type)
}
EXPORT_SYMBOL_GPL(ftrace_event_reg);

-void trace_event_enable_cmd_record(bool enable)
-{
- struct ftrace_event_call *call;
-
- mutex_lock(&event_mutex);
- list_for_each_entry(call, &ftrace_events, list) {
- if (!(call->flags & TRACE_EVENT_FL_ENABLED))
- continue;
-
- if (enable) {
- call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
- } else {
- call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
- }
- }
- mutex_unlock(&event_mutex);
-}
-
static int ftrace_event_enable_disable(struct ftrace_event_call *call,
int enable)
{
--
1.7.1

2010-07-28 02:51:09

by Li Zefan

Subject: Re: [PATCH 1/2] ftrace: record command lines at more appropriate moment

Ian Munsie wrote:
> From: Ian Munsie <[email protected]>
>
> Previously, when tracing was activated through debugfs, regardless of
> which tracing plugin (if any) was activated, the probe_sched_switch and
> probe_sched_wakeup probes from the sched_switch plugin would be
> activated. This appears to have been a hack to use them to record the
> command lines of active processes as they were scheduled.
>
> That approach would suffer if many processes were being scheduled that
> were not generating events as they would consume entries in the
> saved_cmdlines buffer that could otherwise have been used by other
> processes that were actually generating events.
>
> It also had the problem that events could be mis-attributed - in the
> common situation of a process forking then execing a new process, the
> change of the process command would not be noticed for some time after
> the exec until the process was next scheduled.
>
> If the trace was read after the fact this would generally go unnoticed
> because at some point the process would be scheduled and the entry in
> the saved_cmdlines buffer would be updated so that the new command would
> be reported when the trace was eventually read. However, if the events
> were being read live (e.g. through trace_pipe), the events just after
> the exec and before the process was next scheduled would show the
> incorrect command (though the PID would be correct).
>
> This patch removes the sched_switch hack altogether and instead records
> the commands at a more appropriate moment - at the same time the PID of
> the process is recorded (i.e. when an entry on the ring buffer is
> reserved). This means that the recorded command line is much more likely
> to be correct when the trace is read, either live or after the fact, so
> long as the command line still resides in the saved_cmdlines buffer.
>
> It is still not guaranteed to be correct in all situations. For
> instance, if the trace is read after the fact rather than live, events
> generated by a process before an exec would be attributed to the new
> command (in the example below, to sleep rather than stealpid), since
> the entry in saved_cmdlines would have changed before the events were
> read. This is no different from the current situation, though, and the
> alternative would be to store the command line with each and every
> event.
>
...
>
> Signed-off-by: Ian Munsie <[email protected]>
> ---
> kernel/trace/trace.c | 3 +--
> kernel/trace/trace_events.c | 11 -----------
> kernel/trace/trace_functions.c | 2 --
> kernel/trace/trace_functions_graph.c | 2 --
> kernel/trace/trace_sched_switch.c | 10 ----------
> 5 files changed, 1 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 4b1122d..f8458c3 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -1023,8 +1023,6 @@ void tracing_stop(void)
> spin_unlock_irqrestore(&tracing_start_lock, flags);
> }
>
> -void trace_stop_cmdline_recording(void);
> -
> static void trace_save_cmdline(struct task_struct *tsk)
> {
> unsigned pid, idx;
> @@ -1112,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
> {
> struct task_struct *tsk = current;
>
> + tracing_record_cmdline(tsk);

Now this function is called every time a tracepoint is triggered, so
did you run some benchmarks to see whether the performance is improved
or made worse?

Another problem in this patch is that tracing_generic_entry_update() is
also called by perf, but cmdline recording is not needed in perf.

> entry->preempt_count = pc & 0xff;
> entry->pid = (tsk) ? tsk->pid : 0;
> entry->lock_depth = (tsk) ? tsk->lock_depth : 0;
...

2010-07-28 11:44:28

by Ian Munsie

Subject: Re: [PATCH 1/2] ftrace: record command lines at more appropriate moment

Excerpts from Li Zefan's message of Wed Jul 28 12:55:54 +1000 2010:
> > @@ -1112,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
> > {
> > struct task_struct *tsk = current;
> >
> > + tracing_record_cmdline(tsk);
>
> Now this function is called every time a tracepoint is triggered, so
> did you run some benchmarks to see whether the performance is improved
> or made worse?

Admittedly when I posted the patch I had not done that. For the below
benchmark I isolated the trace_sched_switch tracepoint from the
context_switch function (since it is called often) into its own function
(tp_benchmark), which I could then run the ftrace function profiler on
while the tracepoint was active through debugfs.

On my test system there is a performance hit for an active event of
~0.233 us per event (which I have now reduced to ~0.127 us by inlining
tracing_record_cmdline and trace_save_cmdline). At least that is only on
active events as opposed to every single context switch as before.

Before:
Function Hit Time Avg s^2
-------- --- ---- --- ---
.tp_benchmark 1494 2699.670 us 1.807 us 0.536 us
.tp_benchmark 212 357.546 us 1.686 us 0.363 us
.tp_benchmark 215 389.984 us 1.813 us 0.404 us
.tp_benchmark 649 1116.156 us 1.719 us 0.626 us
.tp_benchmark 273 483.530 us 1.771 us 0.350 us
.tp_benchmark 333 599.600 us 1.800 us 0.378 us
.tp_benchmark 203 355.038 us 1.748 us 0.351 us
.tp_benchmark 270 473.222 us 1.752 us 0.360 us

After existing patch:
Function Hit Time Avg s^2
-------- --- ---- --- ---
.tp_benchmark 1427 2815.906 us 1.973 us 0.623 us
.tp_benchmark 358 645.550 us 1.803 us 0.240 us
.tp_benchmark 437 867.762 us 1.985 us 0.684 us
.tp_benchmark 701 1445.618 us 2.062 us 0.906 us
.tp_benchmark 121 257.166 us 2.125 us 0.949 us
.tp_benchmark 162 329.536 us 2.034 us 0.671 us
.tp_benchmark 216 448.420 us 2.076 us 0.754 us
.tp_benchmark 238 452.244 us 1.900 us 0.384 us

With inlining:
Function Hit Time Avg s^2
-------- --- ---- --- ---
.tp_benchmark 1478 2834.292 us 1.917 us 0.451 us
.tp_benchmark 316 583.166 us 1.845 us 0.227 us
.tp_benchmark 160 312.752 us 1.954 us 0.302 us
.tp_benchmark 687 1251.652 us 1.821 us 0.445 us
.tp_benchmark 177 352.310 us 1.990 us 0.451 us
.tp_benchmark 324 603.848 us 1.863 us 0.239 us
.tp_benchmark 150 284.444 us 1.896 us 0.343 us
.tp_benchmark 339 617.716 us 1.822 us 0.215 us


> Another problem in this patch is that tracing_generic_entry_update() is
> also called by perf, but cmdline recording is not needed in perf.

That's a good point - I could move the call into
trace_buffer_lock_reserve so that perf does not get the unneeded
overhead. Actually there's probably no reason I couldn't put it in
__trace_buffer_unlock_commit to avoid the overhead if the event has been
filtered out.

Anyway, what do you think? Is the extra overhead per event acceptable?
I'll go ahead and respin the patch to remove the overhead in the perf
case for the moment.

Cheers,
-Ian

2010-07-28 12:17:29

by Ian Munsie

Subject: [PATCH v2] ftrace: record command lines at more appropriate moment

From: Ian Munsie <[email protected]>

Previously, when tracing was activated through debugfs, regardless of
which tracing plugin (if any) was activated, the probe_sched_switch and
probe_sched_wakeup probes from the sched_switch plugin would be
activated. This appears to have been a hack to use them to record the
command lines of active processes as they were scheduled.

That approach would suffer if many processes were being scheduled that
were not generating events as they would consume entries in the
saved_cmdlines buffer that could otherwise have been used by other
processes that were actually generating events.

It also had the problem that events could be mis-attributed - in the
common situation of a process forking then execing a new process, the
change of the process command would not be noticed for some time after
the exec until the process was next scheduled.

If the trace was read after the fact this would generally go unnoticed
because at some point the process would be scheduled and the entry in
the saved_cmdlines buffer would be updated so that the new command would
be reported when the trace was eventually read. However, if the events
were being read live (e.g. through trace_pipe), the events just after
the exec and before the process was next scheduled would show the
incorrect command (though the PID would be correct).

This patch removes the sched_switch hack altogether and instead records
the commands at a more appropriate moment - when a new trace event is
committed onto the ftrace ring buffer. This means that the recorded
command line is much more likely to be correct when the trace is read,
either live or after the fact, so long as the command line still resides
in the saved_cmdlines buffer.

It is still not guaranteed to be correct in all situations. For
instance, if the trace is read after the fact rather than live, events
generated by a process before an exec would be attributed to the new
command (in the example below, to sleep rather than stealpid), since
the entry in saved_cmdlines would have changed before the events were
read. This is no different from the current situation, though, and the
alternative would be to store the command line with each and every
event.

terminal 1: grep '\-12345' /sys/kernel/debug/tracing/trace_pipe
terminal 2: ./stealpid 12345 `which sleep` 0.1

Before:
stealpid-12345 [003] 86.001826: sys_clone -> 0x0
stealpid-12345 [003] 86.002013: compat_sys_execve(ufilename: ffaaabef, argv: ffaaa7ec, envp: ffaaa7f8)
stealpid-12345 [002] 86.002292: sys_restart_syscall -> 0x0
stealpid-12345 [002] 86.002336: sys_brk(brk: 0)
stealpid-12345 [002] 86.002338: sys_brk -> 0x1007a000
...
stealpid-12345 [002] 86.002582: sys_mmap(addr: 0, len: 1000, prot: 3, flags: 22, fd: ffffffff, offset: 0)
stealpid-12345 [002] 86.002586: sys_mmap -> 0xf7c21000
sleep-12345 [002] 86.002771: sys_mprotect(start: ffe8000, len: 4000, prot: 1)
sleep-12345 [002] 86.002780: sys_mprotect -> 0x0
...

After:
stealpid-12345 [003] 1368.823626: sys_clone -> 0x0
stealpid-12345 [003] 1368.823820: compat_sys_execve(ufilename: fffa6bef, argv: fffa5afc, envp: fffa5b08)
sleep-12345 [002] 1368.824125: sys_restart_syscall -> 0x0
sleep-12345 [002] 1368.824173: sys_brk(brk: 0)
sleep-12345 [002] 1368.824175: sys_brk -> 0x104ae000
...

Signed-off-by: Ian Munsie <[email protected]>
---

Changes since v1 addressing feedback from Li Zefan:
* Inline trace_save_cmdline and tracing_record_cmdline for a marginal speed
gain when recording command lines.
* Move call to tracing_record_cmdline from tracing_generic_entry_update to
__trace_buffer_unlock_commit to avoid the overhead when using perf or if the
event was filtered out.

kernel/trace/trace.c | 7 +++----
kernel/trace/trace_events.c | 11 -----------
kernel/trace/trace_functions.c | 2 --
kernel/trace/trace_functions_graph.c | 2 --
kernel/trace/trace_sched_switch.c | 10 ----------
5 files changed, 3 insertions(+), 29 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4b1122d..dfbc62c 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1023,9 +1023,7 @@ void tracing_stop(void)
spin_unlock_irqrestore(&tracing_start_lock, flags);
}

-void trace_stop_cmdline_recording(void);
-
-static void trace_save_cmdline(struct task_struct *tsk)
+static inline void trace_save_cmdline(struct task_struct *tsk)
{
unsigned pid, idx;

@@ -1097,7 +1095,7 @@ void trace_find_cmdline(int pid, char comm[])
preempt_enable();
}

-void tracing_record_cmdline(struct task_struct *tsk)
+inline void tracing_record_cmdline(struct task_struct *tsk)
{
if (atomic_read(&trace_record_cmdline_disabled) || !tracer_enabled ||
!tracing_is_on())
@@ -1152,6 +1150,7 @@ __trace_buffer_unlock_commit(struct ring_buffer *buffer,
unsigned long flags, int pc,
int wake)
{
+ tracing_record_cmdline(current);
ring_buffer_unlock_commit(buffer, event);

ftrace_trace_stack(buffer, flags, 6, pc);
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 09b4fa6..2c0cf38 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -180,10 +180,8 @@ void trace_event_enable_cmd_record(bool enable)
continue;

if (enable) {
- tracing_start_cmdline_record();
call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
} else {
- tracing_stop_cmdline_record();
call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
}
}
@@ -199,22 +197,13 @@ static int ftrace_event_enable_disable(struct ftrace_event_call *call,
case 0:
if (call->flags & TRACE_EVENT_FL_ENABLED) {
call->flags &= ~TRACE_EVENT_FL_ENABLED;
- if (call->flags & TRACE_EVENT_FL_RECORDED_CMD) {
- tracing_stop_cmdline_record();
- call->flags &= ~TRACE_EVENT_FL_RECORDED_CMD;
- }
call->class->reg(call, TRACE_REG_UNREGISTER);
}
break;
case 1:
if (!(call->flags & TRACE_EVENT_FL_ENABLED)) {
- if (trace_flags & TRACE_ITER_RECORD_CMD) {
- tracing_start_cmdline_record();
- call->flags |= TRACE_EVENT_FL_RECORDED_CMD;
- }
ret = call->class->reg(call, TRACE_REG_REGISTER);
if (ret) {
- tracing_stop_cmdline_record();
pr_info("event trace: Could not enable event "
"%s\n", call->name);
break;
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 16aee4d..0800e91 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -31,7 +31,6 @@ static int function_trace_init(struct trace_array *tr)
tr->cpu = get_cpu();
put_cpu();

- tracing_start_cmdline_record();
tracing_start_function_trace();
return 0;
}
@@ -39,7 +38,6 @@ static int function_trace_init(struct trace_array *tr)
static void function_trace_reset(struct trace_array *tr)
{
tracing_stop_function_trace();
- tracing_stop_cmdline_record();
}

static void function_trace_start(struct trace_array *tr)
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 6bff236..4bb8811 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -321,14 +321,12 @@ static int graph_trace_init(struct trace_array *tr)
&trace_graph_entry);
if (ret)
return ret;
- tracing_start_cmdline_record();

return 0;
}

static void graph_trace_reset(struct trace_array *tr)
{
- tracing_stop_cmdline_record();
unregister_ftrace_graph();
}

diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 8f758d0..9a5cb2c 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -190,16 +190,6 @@ static void tracing_stop_sched_switch(void)
mutex_unlock(&sched_register_mutex);
}

-void tracing_start_cmdline_record(void)
-{
- tracing_start_sched_switch();
-}
-
-void tracing_stop_cmdline_record(void)
-{
- tracing_stop_sched_switch();
-}
-
/**
* tracing_start_sched_switch_record - start tracing context switches
*
--
1.7.1

2010-07-29 01:58:30

by Frederic Weisbecker

Subject: Re: [PATCH 1/2] ftrace: record command lines at more appropriate moment

On Wed, Jul 28, 2010 at 09:44:01PM +1000, Ian Munsie wrote:
> Excerpts from Li Zefan's message of Wed Jul 28 12:55:54 +1000 2010:
> > > @@ -1112,6 +1110,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,
> > > {
> > > struct task_struct *tsk = current;
> > >
> > > + tracing_record_cmdline(tsk);
> >
> > Now this function is called every time a tracepoint is triggered, so
> > did you run some benchmarks to see whether the performance is improved
> > or made worse?
>
> Admittedly when I posted the patch I had not done that. For the below
> benchmark I isolated the trace_sched_switch tracepoint from the
> context_switch function (since it is called often) into its own function
> (tp_benchmark), which I could then run the ftrace function profiler on
> while the tracepoint was active through debugfs.
>
> On my test system there is a performance hit for an active event of
> ~0.233 us per event (which I have now reduced to ~0.127 us by inlining
> tracing_record_cmdline and trace_save_cmdline). At least that is only on
> active events as opposed to every single context switch as before.
>
> Before:
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> .tp_benchmark 1494 2699.670 us 1.807 us 0.536 us
> .tp_benchmark 212 357.546 us 1.686 us 0.363 us
> .tp_benchmark 215 389.984 us 1.813 us 0.404 us
> .tp_benchmark 649 1116.156 us 1.719 us 0.626 us
> .tp_benchmark 273 483.530 us 1.771 us 0.350 us
> .tp_benchmark 333 599.600 us 1.800 us 0.378 us
> .tp_benchmark 203 355.038 us 1.748 us 0.351 us
> .tp_benchmark 270 473.222 us 1.752 us 0.360 us
>
> After existing patch:
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> .tp_benchmark 1427 2815.906 us 1.973 us 0.623 us
> .tp_benchmark 358 645.550 us 1.803 us 0.240 us
> .tp_benchmark 437 867.762 us 1.985 us 0.684 us
> .tp_benchmark 701 1445.618 us 2.062 us 0.906 us
> .tp_benchmark 121 257.166 us 2.125 us 0.949 us
> .tp_benchmark 162 329.536 us 2.034 us 0.671 us
> .tp_benchmark 216 448.420 us 2.076 us 0.754 us
> .tp_benchmark 238 452.244 us 1.900 us 0.384 us
>
> With inlining:
> Function Hit Time Avg s^2
> -------- --- ---- --- ---
> .tp_benchmark 1478 2834.292 us 1.917 us 0.451 us
> .tp_benchmark 316 583.166 us 1.845 us 0.227 us
> .tp_benchmark 160 312.752 us 1.954 us 0.302 us
> .tp_benchmark 687 1251.652 us 1.821 us 0.445 us
> .tp_benchmark 177 352.310 us 1.990 us 0.451 us
> .tp_benchmark 324 603.848 us 1.863 us 0.239 us
> .tp_benchmark 150 284.444 us 1.896 us 0.343 us
> .tp_benchmark 339 617.716 us 1.822 us 0.215 us
>
>
> > Another problem in this patch is that tracing_generic_entry_update() is
> > also called by perf, but cmdline recording is not needed in perf.
>
> That's a good point - I could move the call into
> trace_buffer_lock_reserve so that perf does not get the unneeded
> overhead. Actually there's probably no reason I couldn't put it in
> __trace_buffer_unlock_commit to avoid the overhead if the event has been
> filtered out.
>
> Anyway, what do you think? Is the extra overhead per event acceptable?



In fact I don't really understand what this tp_benchmark function is;
when and where is it called?

But anyway, I'd rather comment on the idea on the patch itself.

2010-07-29 02:50:37

by Frederic Weisbecker

Subject: Re: [PATCH v2] ftrace: record command lines at more appropriate moment

On Wed, Jul 28, 2010 at 10:16:49PM +1000, Ian Munsie wrote:
> From: Ian Munsie <[email protected]>
>
> Previously, when tracing was activated through debugfs, regardless of
> which tracing plugin (if any) was activated, the probe_sched_switch and
> probe_sched_wakeup probes from the sched_switch plugin would be
> activated. This appears to have been a hack to use them to record the
> command lines of active processes as they were scheduled.
>
> That approach would suffer if many processes were being scheduled that
> were not generating events as they would consume entries in the
> saved_cmdlines buffer that could otherwise have been used by other
> processes that were actually generating events.
>
> It also had the problem that events could be mis-attributed - in the
> common situation of a process forking then execing a new process, the
> change of the process command would not be noticed for some time after
> the exec until the process was next scheduled.
>
> If the trace was read after the fact this would generally go unnoticed
> because at some point the process would be scheduled and the entry in
> the saved_cmdlines buffer would be updated so that the new command would
> be reported when the trace was eventually read. However, if the events
> were being read live (e.g. through trace_pipe), the events just after
> the exec and before the process was next scheduled would show the
> incorrect command (though the PID would be correct).
>
> This patch removes the sched_switch hack altogether and instead records
> the commands at a more appropriate moment - when a new trace event is
> committed onto the ftrace ring buffer. This means that the recorded
> command line is much more likely to be correct when the trace is read,
> either live or after the fact, so long as the command line still resides
> in the saved_cmdlines buffer.
>
> It is still not guaranteed to be correct in all situations. For
> instance, if the trace is read after the fact rather than live, events
> generated by a process before an exec would be attributed to the new
> command (in the example below, to sleep rather than stealpid), since
> the entry in saved_cmdlines would have changed before the events were
> read. This is no different from the current situation, though, and the
> alternative would be to store the command line with each and every
> event.
>
> terminal 1: grep '\-12345' /sys/kernel/debug/tracing/trace_pipe
> terminal 2: ./stealpid 12345 `which sleep` 0.1
>
> Before:
> stealpid-12345 [003] 86.001826: sys_clone -> 0x0
> stealpid-12345 [003] 86.002013: compat_sys_execve(ufilename: ffaaabef, argv: ffaaa7ec, envp: ffaaa7f8)
> stealpid-12345 [002] 86.002292: sys_restart_syscall -> 0x0
> stealpid-12345 [002] 86.002336: sys_brk(brk: 0)
> stealpid-12345 [002] 86.002338: sys_brk -> 0x1007a000
> ...
> stealpid-12345 [002] 86.002582: sys_mmap(addr: 0, len: 1000, prot: 3, flags: 22, fd: ffffffff, offset: 0)
> stealpid-12345 [002] 86.002586: sys_mmap -> 0xf7c21000
> sleep-12345 [002] 86.002771: sys_mprotect(start: ffe8000, len: 4000, prot: 1)
> sleep-12345 [002] 86.002780: sys_mprotect -> 0x0
> ...
>
> After:
> stealpid-12345 [003] 1368.823626: sys_clone -> 0x0
> stealpid-12345 [003] 1368.823820: compat_sys_execve(ufilename: fffa6bef, argv: fffa5afc, envp: fffa5b08)
> sleep-12345 [002] 1368.824125: sys_restart_syscall -> 0x0
> sleep-12345 [002] 1368.824173: sys_brk(brk: 0)
> sleep-12345 [002] 1368.824175: sys_brk -> 0x104ae000
> ...
>
> Signed-off-by: Ian Munsie <[email protected]>




So, in fact we can't do this. There is a strong reason we maintain the
cmdline resolution at sched switch time rather than at tracing time:
performance and scalability.

Look at what tracing_record_cmdline() does:

- it's one more call
- it checks a lot of conditions
- takes a spinlock (gives up if already in use, but that's still bad for the cache
in a tracing path)
- deref a shared hashlist
...
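
(For reference, the lock is only ever trylocked - a paraphrased sketch
of trace_save_cmdline() from kernel/trace/trace.c:)

static void trace_save_cmdline(struct task_struct *tsk)
{
	/* Give up rather than spin: this can run deep inside the
	 * tracing path, so waiting on another CPU would be worse
	 * than occasionally missing a cmdline update. */
	if (!arch_spin_trylock(&trace_cmdline_lock))
		return;

	/* ... find or recycle this pid's slot in saved_cmdlines ... */

	arch_spin_unlock(&trace_cmdline_lock);
}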


Currently that is done at sched switch time, which is already quite
often. Now imagine you turn on the function tracer: this is going to
happen for _every_ function called in the kernel. There would be a lot
of cache ping-pong between CPUs due to the spinlock, for example; for
every function this is clearly unacceptable (it would happen twice per
function with the function graph tracer).

And on top of that there are still the hashlist deref, the checks,
etc...

It's not only the function tracers. The lock events will also show you very
bad results. Same if you enable all the others together.


It would be better to have a new call to tracing_record_cmdline() made
from the fork and exec tracepoints to solve this problem. But still,
that only solves the lazy update, not all the problems you've listed in
this changelog.
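
(A rough sketch of that direction, using the existing sched_process_fork
tracepoint; note that an exec tracepoint does not exist yet and would
have to be added:)

#include <trace/events/sched.h>

static void probe_fork(void *ignore,
		       struct task_struct *parent,
		       struct task_struct *child)
{
	tracing_record_cmdline(parent);
	tracing_record_cmdline(child);
}

static int record_cmdline_probes_init(void)
{
	/* similarly for an exec tracepoint, once one exists */
	return register_trace_sched_process_fork(probe_fork, NULL);
}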

In fact cmdline tracking would grow in complexity in the kernel if we
had to do it correctly. Ideally:

* dump every task when we start tracing, and map their cmdlines
* do the pid mapping per time interval. Keys in the hlist must be
pid/start_time:end_time, not only pid anymore.
* map new cmdlines from the fork and exec events. On exec, we must note
the end_time of the previous entry for our pid and open a new entry
with a fresh start_time.
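
(A hypothetical shape for such an entry, purely for illustration:)

struct cmdline_span {
	struct hlist_node node;
	pid_t pid;
	u64 start_time;	/* first timestamp this comm is valid for */
	u64 end_time;	/* timestamp of the exec that retired it */
	char comm[TASK_COMM_LEN];
};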


And this would be much more efficient than the sched_switch based
thing we have, in terms of both performance and per-timeslice
granularity.

That's the kind of thing we'd better do from userspace, in tools like
the perf tools or trace-cmd. And the perf tools already do this,
partially (no time granularity yet).

But still, in-kernel cmdline resolution is nice to have in some
circumstances, especially for ascii tracing and dumps. But for that I
think we should not go further, and should keep this basic, non-perfect
cmdline tracking, sufficient for most uses.

In fact we could fix it by dumping the tasks' comms from the tasklist
and hooking the fork and exec events rather than sched switch. It would
be better for performance, and would be appreciated.

But for those who want advanced comm resolution, by which I mean
something that takes the timeslices into account, I think switching to
userspace tools is the better direction. In fact those tools need
enhancing in this regard.

Thanks.

2010-07-29 02:57:08

by Li Zefan

Subject: Re: [PATCH v2] ftrace: record command lines at more appropriate moment

Ian Munsie wrote:
> From: Ian Munsie <[email protected]>
>
> Previously, when tracing was activated through debugfs, regardless of
> which tracing plugin (if any) was activated, the probe_sched_switch and
> probe_sched_wakeup probes from the sched_switch plugin would be
> activated. This appears to have been a hack to use them to record the
> command lines of active processes as they were scheduled.
>
> That approach would suffer if many processes were being scheduled that
> were not generating events as they would consume entries in the
> saved_cmdlines buffer that could otherwise have been used by other
> processes that were actually generating events.
>
> It also had the problem that events could be mis-attributed - in the
> common situation of a process forking then execing a new process, the
> change of the process command would not be noticed for some time after
> the exec until the process was next scheduled.
>
> If the trace was read after the fact this would generally go unnoticed
> because at some point the process would be scheduled and the entry in
> the saved_cmdlines buffer would be updated so that the new command would
> be reported when the trace was eventually read. However, if the events
> were being read live (e.g. through trace_pipe), the events just after
> the exec and before the process was next scheduled would show the
> incorrect command (though the PID would be correct).
>
> This patch removes the sched_switch hack altogether and instead records
> the commands at a more appropriate moment - when a new trace event is
> committed onto the ftrace ring buffer. This means that the recorded
> command line is much more likely to be correct when the trace is read,
> either live or after the fact, so long as the command line still resides
> in the saved_cmdlines buffer.
>
> It is still not guaranteed to be correct in all situations. For
> instance, if the trace is read after the fact rather than live, events
> generated by a process before an exec would be attributed to the new
> command (in the example below, to sleep rather than stealpid), since
> the entry in saved_cmdlines would have changed before the events were
> read. This is no different from the current situation, though, and the
> alternative would be to store the command line with each and every
> event.
>
> terminal 1: grep '\-12345' /sys/kernel/debug/tracing/trace_pipe
> terminal 2: ./stealpid 12345 `which sleep` 0.1
>
> Before:
> stealpid-12345 [003] 86.001826: sys_clone -> 0x0
> stealpid-12345 [003] 86.002013: compat_sys_execve(ufilename: ffaaabef, argv: ffaaa7ec, envp: ffaaa7f8)
> stealpid-12345 [002] 86.002292: sys_restart_syscall -> 0x0
> stealpid-12345 [002] 86.002336: sys_brk(brk: 0)
> stealpid-12345 [002] 86.002338: sys_brk -> 0x1007a000
> ...
> stealpid-12345 [002] 86.002582: sys_mmap(addr: 0, len: 1000, prot: 3, flags: 22, fd: ffffffff, offset: 0)
> stealpid-12345 [002] 86.002586: sys_mmap -> 0xf7c21000
> sleep-12345 [002] 86.002771: sys_mprotect(start: ffe8000, len: 4000, prot: 1)
> sleep-12345 [002] 86.002780: sys_mprotect -> 0x0
> ...
>
> After:
> stealpid-12345 [003] 1368.823626: sys_clone -> 0x0
> stealpid-12345 [003] 1368.823820: compat_sys_execve(ufilename: fffa6bef, argv: fffa5afc, envp: fffa5b08)
> sleep-12345 [002] 1368.824125: sys_restart_syscall -> 0x0
> sleep-12345 [002] 1368.824173: sys_brk(brk: 0)
> sleep-12345 [002] 1368.824175: sys_brk -> 0x104ae000
> ...
>
> Signed-off-by: Ian Munsie <[email protected]>

I've tested your patch using lmbench(ctx):

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
(trace-off)
Linux 2.6.35- 2.1300 2.2100 2.0800 2.5900 2.1400 2.59000 2.19000
Linux 2.6.35- 2.1400 2.2000 2.0800 2.6000 2.0900 2.56000 2.15000

(all events on)
Linux 2.6.35- 2.8000 2.9600 2.7200 3.2500 2.8200 3.24000 2.98000
Linux 2.6.35- 2.7100 2.6900 2.7300 3.2200 2.8500 3.25000 2.79000

(all events on without cmdline-recording)
Linux 2.6.35- 2.6100 2.6900 2.5800 3.0300 2.5800 3.04000 2.67000
Linux 2.6.35- 2.5800 2.5900 2.5600 3.0300 2.6600 3.04000 2.61000

(your patch applied)
Linux 2.6.35- 2.7100 2.8000 2.7200 3.2100 2.8400 3.24000 2.82000
Linux 2.6.35- 2.6600 2.8400 2.6900 3.1900 2.7600 3.27000 2.78000

So with your patch applied, the performance is still worse than just disabling
cmdline-recording.

The performance might be even worse with some other benchmarks.

I'd suggest another approach: add a tracepoint in set_task_comm() to
record the command name, which is how perf does it.
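
(Roughly, set_task_comm() in fs/exec.c already calls the perf hook; the
tracing_record_cmdline() call below is the suggested addition, not
existing code:)

void set_task_comm(struct task_struct *tsk, char *buf)
{
	task_lock(tsk);
	strlcpy(tsk->comm, buf, sizeof(tsk->comm));
	task_unlock(tsk);
	perf_event_comm(tsk);		/* existing perf hook */
	tracing_record_cmdline(tsk);	/* suggested ftrace hook */
}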

> ---
>
> Changes since v1 addressing feedback from Li Zefan:
> * Inline trace_save_cmdline and tracing_record_cmdline for a marginal speed
> gain when recording command lines.
> * Move call to tracing_record_cmdline from tracing_generic_entry_update to
> __trace_buffer_unlock_commit to avoid the overhead when using perf or if the
> event was filtered out.
>
> kernel/trace/trace.c | 7 +++----
> kernel/trace/trace_events.c | 11 -----------
> kernel/trace/trace_functions.c | 2 --
> kernel/trace/trace_functions_graph.c | 2 --
> kernel/trace/trace_sched_switch.c | 10 ----------
> 5 files changed, 3 insertions(+), 29 deletions(-)

2010-07-29 03:04:56

by Frederic Weisbecker

Subject: Re: [PATCH v2] ftrace: record command lines at more appropriate moment

On Thu, Jul 29, 2010 at 11:01:53AM +0800, Li Zefan wrote:
> Ian Munsie wrote:
> > From: Ian Munsie <[email protected]>
> >
> > Previously, when tracing was activated through debugfs, regardless of
> > which tracing plugin (if any) was activated, the probe_sched_switch and
> > probe_sched_wakeup probes from the sched_switch plugin would be
> > activated. This appears to have been a hack to use them to record the
> > command lines of active processes as they were scheduled.
> >
> > That approach would suffer if many processes were being scheduled that
> > were not generating events as they would consume entries in the
> > saved_cmdlines buffer that could otherwise have been used by other
> > processes that were actually generating events.
> >
> > It also had the problem that events could be mis-attributed - in the
> > common situation of a process forking then execing a new process, the
> > change of the process command would not be noticed for some time after
> > the exec until the process was next scheduled.
> >
> > If the trace was read after the fact this would generally go unnoticed
> > because at some point the process would be scheduled and the entry in
> > the saved_cmdlines buffer would be updated so that the new command would
> > be reported when the trace was eventually read. However, if the events
> > were being read live (e.g. through trace_pipe), the events just after
> > the exec and before the process was next scheduled would show the
> > incorrect command (though the PID would be correct).
> >
> > This patch removes the sched_switch hack altogether and instead records
> > the commands at a more appropriate moment - when a new trace event is
> > committed onto the ftrace ring buffer. This means that the recorded
> > command line is much more likely to be correct when the trace is read,
> > either live or after the fact, so long as the command line still resides
> > in the saved_cmdlines buffer.
> >
> > It is still not guaranteed to be correct in all situations. For
> > instance, if the trace is read after the fact rather than live, events
> > generated by a process before an exec would be attributed to the new
> > command (in the example below, to sleep rather than stealpid), since
> > the entry in saved_cmdlines would have changed before the events were
> > read. This is no different from the current situation, though, and the
> > alternative would be to store the command line with each and every
> > event.
> >
> > terminal 1: grep '\-12345' /sys/kernel/debug/tracing/trace_pipe
> > terminal 2: ./stealpid 12345 `which sleep` 0.1
> >
> > Before:
> > stealpid-12345 [003] 86.001826: sys_clone -> 0x0
> > stealpid-12345 [003] 86.002013: compat_sys_execve(ufilename: ffaaabef, argv: ffaaa7ec, envp: ffaaa7f8)
> > stealpid-12345 [002] 86.002292: sys_restart_syscall -> 0x0
> > stealpid-12345 [002] 86.002336: sys_brk(brk: 0)
> > stealpid-12345 [002] 86.002338: sys_brk -> 0x1007a000
> > ...
> > stealpid-12345 [002] 86.002582: sys_mmap(addr: 0, len: 1000, prot: 3, flags: 22, fd: ffffffff, offset: 0)
> > stealpid-12345 [002] 86.002586: sys_mmap -> 0xf7c21000
> > sleep-12345 [002] 86.002771: sys_mprotect(start: ffe8000, len: 4000, prot: 1)
> > sleep-12345 [002] 86.002780: sys_mprotect -> 0x0
> > ...
> >
> > After:
> > stealpid-12345 [003] 1368.823626: sys_clone -> 0x0
> > stealpid-12345 [003] 1368.823820: compat_sys_execve(ufilename: fffa6bef, argv: fffa5afc, envp: fffa5b08)
> > sleep-12345 [002] 1368.824125: sys_restart_syscall -> 0x0
> > sleep-12345 [002] 1368.824173: sys_brk(brk: 0)
> > sleep-12345 [002] 1368.824175: sys_brk -> 0x104ae000
> > ...
> >
> > Signed-off-by: Ian Munsie <[email protected]>
>
> I've tested your patch using lmbench(ctx):
>
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------------------
> Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
> --------- ------------- ------ ------ ------ ------ ------ ------- -------
> (trace-off)
> Linux 2.6.35- 2.1300 2.2100 2.0800 2.5900 2.1400 2.59000 2.19000
> Linux 2.6.35- 2.1400 2.2000 2.0800 2.6000 2.0900 2.56000 2.15000
>
> (all events on)
> Linux 2.6.35- 2.8000 2.9600 2.7200 3.2500 2.8200 3.24000 2.98000
> Linux 2.6.35- 2.7100 2.6900 2.7300 3.2200 2.8500 3.25000 2.79000
>
> (all events on without cmdline-recording)
> Linux 2.6.35- 2.6100 2.6900 2.5800 3.0300 2.5800 3.04000 2.67000
> Linux 2.6.35- 2.5800 2.5900 2.5600 3.0300 2.6600 3.04000 2.61000
>
> (your patch applied)
> Linux 2.6.35- 2.7100 2.8000 2.7200 3.2100 2.8400 3.24000 2.82000
> Linux 2.6.35- 2.6600 2.8400 2.6900 3.1900 2.7600 3.27000 2.78000
>
> So with your patch applied, the performance is still worse than just disabling
> cmdline-recording.
>
> The performance might be even worse with some other benchmarks.
>
> I'd suggest another approach: add a tracepoint in set_task_comm() to
> record the command name, which is how perf does it.


That works for tasks that exec() after tracing gets enabled, but not
for the others. We need to first walk the tasklist to catch up with the
already-running tasks, and then hook the exec and fork tracepoints.

I think this would be welcome.

2010-07-29 10:26:39

by Ian Munsie

Subject: Re: [PATCH v2] ftrace: record command lines at more appropriate moment

Excerpts from Frederic Weisbecker's message of Thu Jul 29 12:50:41 +1000 2010:
> So, in fact we can't do this. There is a strong reason we maintain the
> cmdline resolution at sched switch time rather than at tracing time:
> performance and scalability.
>
> Look at what tracing_record_cmdline() does:
>
> - it's one more call
> - it checks a lot of conditions
> - takes a spinlock (gives up if already in use, but that's still bad for the cache
> in a tracing path)
> - deref a shared hashlist
> ...
>
>
> Currently that is done at sched switch time, which is already quite
> often. Now imagine you turn on the function tracer: this is going to
> happen for _every_ function called in the kernel. There would be a lot
> of cache ping-pong between CPUs due to the spinlock, for example; for
> every function this is clearly unacceptable (it would happen twice per
> function with the function graph tracer).
>
> And on top of that there are still the hashlist deref, the checks,
> etc...
>
> It's not only the function tracers. The lock events will also show you very
> bad results. Same if you enable all the others together.

My first thought when reading this was to make the saved_cmdlines and
related data per CPU to reduce a lot of the cache ping pong, but I'm
happy to take the alternate approach you suggest.
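
(Hypothetically, something like the following - not pursued further:)

static DEFINE_PER_CPU(char [SAVED_CMDLINES][TASK_COMM_LEN],
		      pcpu_saved_cmdlines);
/* plus per-cpu pid maps, merged when the trace is read */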

> It would be better to have a new call to tracing_record_cmdline() made
> from the fork and exec tracepoints to solve this problem. But still,
> that only solves the lazy update, not all the problems you've listed in
> this changelog.

Still, it would scratch my itch so I'm happy to take that approach.

> In fact cmdline tracking would grow in complexity in the kernel if we
> had to do it correctly. Ideally:
>
> * dump every task when we start tracing, and map their cmdlines
> * do the pid mapping per time interval. Keys in the hlist must be
> pid/start_time:end_time, not only pid anymore.
> * map new cmdlines from the fork and exec events. On exec, we must note
> the end_time of the previous entry for our pid and open a new entry
> with a fresh start_time.
>
>
> And this would be much more efficient than the sched_switch based
> thing we have, in terms of both performance and per-timeslice
> granularity.
>
> That's the kind of thing we'd better do from userspace, in tools like
> the perf tools or trace-cmd. And the perf tools already do this,
> partially (no time granularity yet).

I'd tend to agree; I find the in-kernel stuff most useful for watching
events on a live system. My itch was that I couldn't simply grep
trace_pipe for a command that I was about to run and reliably see all
its events.

> But still, in-kernel cmdline resolution is nice to have in some
> circumstances, especially for ascii tracing and dumps. But for that I
> think we should not go further, and should keep this basic, non-perfect
> cmdline tracking, sufficient for most uses.
>
> In fact we could fix it by dumping the tasks' comms from the tasklist
> and hooking the fork and exec events rather than sched switch. It would
> be better for performance, and would be appreciated.

I guess the compromise here would be that the saved_cmdlines buffer
would need to grow to hold all the command lines for every process that
has been running since the trace started - a limit of 128 commands as it
is now wouldn't cut it on most systems. Then again, there's no reason
not to bring back Li's patch to provide the option to disable recording
the command lines for people who don't want it.

Hmmm, I suppose we could hook into process termination, check whether
any events were associated with the task, and free up the entry if
not...
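
(Something like the following, with both helpers hypothetical
bookkeeping that does not exist today:)

static void probe_exit(void *ignore, struct task_struct *p)
{
	if (!pid_generated_events(p->pid))	/* hypothetical */
		free_saved_cmdline(p->pid);	/* hypothetical */
}

static int cmdline_gc_init(void)
{
	return register_trace_sched_process_exit(probe_exit, NULL);
}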

Cheers,
-Ian

2010-07-29 10:30:56

by Ian Munsie

Subject: Re: [PATCH 1/2] ftrace: record command lines at more appropriate moment

Excerpts from Frederic Weisbecker's message of Thu Jul 29 11:58:34 +1000 2010:
> In fact I don't really understand what this tp_benchmark function is;
> when and where is it called?

The idea was just to replace a single tracepoint with a call to a
separate function (whose sole action was to call the tracepoint) so that
the ftrace function profiler could profile that function and provide
average timing data for the function call + tracepoint, i.e. like:

noinline void tp_benchmark(struct task_struct *prev, struct task_struct *next)
{
	/* the isolated tracepoint; prototype assumed from this era's
	 * sched_switch event */
	trace_sched_switch(prev, next);
}

Mostly just laziness on my part really.

> But anyway, I'll rather comment the idea on the patch.

Cheers,
-Ian