LinuxLists.cc - Add support for userspace stacktraces in tracing/iter

[permalink] [raw]

Subject: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

Example usage:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sym-userobj >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
cat trace_pipe >/tmp/trace&
.... run application ...
echo 0 >tracing_enabled
cat /tmp/trace

You'll see stack entries like:
/lib/libpthread-2.7.so[+0xd370]
You can convert them to function/line using:
addr2line -fie /lib/libpthread-2.7.so 0xd370
Or
addr2line -fie /usr/lib/debug/libpthread-2.7.so 0xd370

For non-PIC/PIE executables this won't work:
a.out[+0x73b]
You need to run the following: addr2line -fie a.out 0x40073b
(where 0x400000 is the default load address of a.out)

Signed-off-by: Török Edwin <[email protected]>
---
Documentation/ftrace.txt | 13 ++++++++++-
kernel/trace/trace.c | 51 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 3 +-
3 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index b81618e..c5670a9 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -330,7 +330,7 @@ output. To see what is available, simply cat the file:

cat /debug/tracing/iter_ctrl
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree nouserstacktrace
+ noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj

To disable one of the options, echo in the option prepended with "no".

@@ -387,6 +387,17 @@ Here are the available options:
userstacktrace - This option changes the trace.
It records a stacktrace of the current userspace thread.

+ sym-userobj - when user stacktrace are enabled, look up which object the
+ address belongs to, and print a relative address
+ This is especially useful when ASLR is on, otherwise you don't
+ get a chance to resolve the address to object/file/line after the app is no
+ longer running
+
+ The lookup is performed when you read trace,trace_pipe,latency_trace. Example:
+
+ a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
+x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
+
sched-tree - TBD (any users??)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index bb965af..b5f4068 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -236,6 +236,7 @@ static const char *trace_options[] = {
"sched-tree",
"ftrace_printk",
"userstacktrace",
+ "sym-userobj",
NULL
};

@@ -382,6 +383,20 @@ trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len)
return trace_seq_putmem(s, hex, j);
}

+static int
+trace_seq_path(struct trace_seq *s, struct path *path)
+{
+ int ret;
+ struct seq_file m;
+ m.count = s->len;
+ m.size = PAGE_SIZE;
+ m.buf = s->buffer;
+ ret = seq_path(&m, path, "\n");
+ if (ret)
+ s->len = m.count;
+ return ret;
+}
+
static void
trace_seq_reset(struct trace_seq *s)
{
@@ -678,6 +693,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,

entry->preempt_count = pc & 0xff;
entry->pid = (tsk) ? tsk->pid : 0;
+ entry->tgid = (tsk) ? tsk->tgid : 0;
entry->flags =
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
@@ -1221,11 +1237,26 @@ static int
seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
unsigned long sym_flags)
{
+ struct mm_struct *mm = NULL;
int ret = 1;
unsigned i;

+ if (trace_flags & TRACE_ITER_SYM_USEROBJ) {
+ struct task_struct *task;
+ /* we do the lookup on the thread group leader,
+ * since individual threads might have already quit! */
+ rcu_read_lock();
+ task = find_task_by_vpid(entry->ent.tgid);
+ rcu_read_unlock();
+
+ if (task)
+ mm = get_task_mm(task);
+ }
+
for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
unsigned long ip = entry->caller[i];
+ unsigned long vmstart = 0;
+ struct file *file = NULL;

if (ip == ULONG_MAX || !ret)
break;
@@ -1235,10 +1266,25 @@ seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
ret = trace_seq_puts(s, "??");
continue;
}
- if (ret /*&& (sym_flags & TRACE_ITER_SYM_ADDR)*/)
+ if (mm) {
+ const struct vm_area_struct *vma = find_vma(mm, ip);
+ if (vma) {
+ file = vma->vm_file;
+ vmstart = vma->vm_start;
+ }
+ }
+ if (file) {
+ ret = trace_seq_path(s, &file->f_path);
+ if (ret)
+ ret = trace_seq_printf(s, "[+0x%lx]",
+ ip - vmstart);
+ }
+ if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
}

+ if (mm)
+ mmput(mm);
return ret;
}

@@ -3239,6 +3285,9 @@ void ftrace_dump(void)
atomic_inc(&global_trace.data[cpu]->disabled);
}

+ /* don't look at user memory in panic mode */
+ trace_flags &= ~TRACE_ITER_SYM_USEROBJ;
+
printk(KERN_TRACE "Dumping ftrace buffer:\n");

iter.tr = &global_trace;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 4c81642..5f94aed 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -423,7 +423,8 @@ enum trace_iterator_flags {
TRACE_ITER_STACKTRACE = 0x100,
TRACE_ITER_SCHED_TREE = 0x200,
TRACE_ITER_PRINTK = 0x400,
- TRACE_ITER_USERSTACKTRACE = 0x800
+ TRACE_ITER_USERSTACKTRACE = 0x800,
+ TRACE_ITER_SYM_USEROBJ = 0x1000
};

extern struct tracer nop_trace;
--
1.5.6.5

2008-11-02 21:18:59

[permalink] [raw]

Subject: [PATCH] Add support for userspace stacktraces in tracing/iter_ctrl

Usage example:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
.... run application ...
echo 0 >tracing_enabled

Then read one of 'trace','latency_trace','trace_pipe'
It is recomended to compile your userspace programs with frame pointers
(at least glibc + the app you are tracing).

Signed-off-by: Török Edwin <[email protected]>
---
Documentation/ftrace.txt | 5 ++-
arch/x86/kernel/stacktrace.c | 57 +++++++++++++++++++++++++
include/linux/stacktrace.h | 8 ++++
kernel/trace/trace.c | 93 ++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 9 ++++
5 files changed, 171 insertions(+), 1 deletions(-)

diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index ea5a827..b81618e 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -330,7 +330,7 @@ output. To see what is available, simply cat the file:

cat /debug/tracing/iter_ctrl
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree
+ noblock nostacktrace nosched-tree nouserstacktrace

To disable one of the options, echo in the option prepended with "no".

@@ -384,6 +384,9 @@ Here are the available options:
When a trace is recorded, so is the stack of functions.
This allows for back traces of trace sites.

+ userstacktrace - This option changes the trace.
+ It records a stacktrace of the current userspace thread.
+
sched-tree - TBD (any users??)

diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index d1d850a..000a965 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -6,6 +6,7 @@
#include <linux/sched.h>
#include <linux/stacktrace.h>
#include <linux/module.h>
+#include <linux/uaccess.h>
#include <asm/stacktrace.h>

static void save_stack_warning(void *data, char *msg)
@@ -90,3 +91,59 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
trace->entries[trace->nr_entries++] = ULONG_MAX;
}
EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
+
+/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
+
+struct stack_frame {
+ const void __user *next_fp;
+ unsigned long return_address;
+};
+
+static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+{
+ int ret;
+
+ if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+ return 0;
+
+ ret = 1;
+ pagefault_disable();
+ if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+ ret = 0;
+ pagefault_enable();
+
+ return ret;
+}
+
+void save_stack_trace_user(struct stack_trace *trace)
+{
+ /*
+ * Trace user stack if we are not a kernel thread
+ */
+ if (current->mm) {
+ const struct pt_regs *regs = task_pt_regs(current);
+ const void __user *fp = (const void __user *)regs->bp;
+
+ if (trace->nr_entries < trace->max_entries)
+ trace->entries[trace->nr_entries++] = regs->ip;
+
+ while (trace->nr_entries < trace->max_entries) {
+ struct stack_frame frame;
+ frame.next_fp = NULL;
+ frame.return_address = 0;
+ if (!copy_stack_frame(fp, &frame))
+ break;
+ if ((unsigned long)fp < regs->sp)
+ break;
+ if (frame.return_address)
+ trace->entries[trace->nr_entries++] =
+ frame.return_address;
+ if (fp == frame.next_fp)
+ break;
+ fp = frame.next_fp;
+ }
+ }
+ if (trace->nr_entries < trace->max_entries)
+ trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
+
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 6b8e54a..46704a5 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -18,9 +18,17 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
struct stack_trace *trace);

extern void print_stack_trace(struct stack_trace *trace, int spaces);
+
+#ifdef CONFIG_X86
+extern void save_stack_trace_user(struct stack_trace *trace);
+#else
+# define save_stack_trace_user(trace) do { } while (0)
+#endif
+
#else
# define save_stack_trace(trace) do { } while (0)
# define save_stack_trace_tsk(tsk, trace) do { } while (0)
+# define save_stack_trace_user(trace) do { } while (0)
# define print_stack_trace(trace, spaces) do { } while (0)
#endif

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 5b8c90d..bb965af 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -235,6 +235,7 @@ static const char *trace_options[] = {
"stacktrace",
"sched-tree",
"ftrace_printk",
+ "userstacktrace",
NULL
};

@@ -762,6 +763,44 @@ void __trace_stack(struct trace_array *tr,
ftrace_trace_stack(tr, data, flags, skip, preempt_count());
}

+static void ftrace_trace_userstack(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ unsigned long flags, int pc)
+{
+ struct userstack_entry *entry;
+ struct stack_trace trace;
+ struct ring_buffer_event *event;
+ unsigned long irq_flags;
+
+ if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
+ return;
+
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+ &irq_flags);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, flags, pc);
+ entry->ent.type = TRACE_USER_STACK;
+
+ memset(&entry->caller, 0, sizeof(entry->caller));
+
+ trace.nr_entries = 0;
+ trace.max_entries = FTRACE_STACK_ENTRIES;
+ trace.skip = 0;
+ trace.entries = entry->caller;
+
+ save_stack_trace_user(&trace);
+ ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+}
+
+void __trace_userstack(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ unsigned long flags)
+{
+ ftrace_trace_userstack(tr, data, flags, preempt_count());
+}
+
static void
ftrace_trace_special(void *__tr, void *__data,
unsigned long arg1, unsigned long arg2, unsigned long arg3,
@@ -785,6 +824,7 @@ ftrace_trace_special(void *__tr, void *__data,
entry->arg3 = arg3;
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, irq_flags, 4, pc);
+ ftrace_trace_userstack(tr, data, irq_flags, pc);

trace_wake_up();
}
@@ -823,6 +863,7 @@ tracing_sched_switch_trace(struct trace_array *tr,
entry->next_cpu = task_cpu(next);
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, flags, 5, pc);
+ ftrace_trace_userstack(tr, data, flags, pc);
}

void
@@ -852,6 +893,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
entry->next_cpu = task_cpu(wakee);
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, flags, 6, pc);
+ ftrace_trace_userstack(tr, data, flags, pc);

trace_wake_up();
}
@@ -1175,6 +1217,31 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
return ret;
}

+static int
+seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
+ unsigned long sym_flags)
+{
+ int ret = 1;
+ unsigned i;
+
+ for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
+ unsigned long ip = entry->caller[i];
+
+ if (ip == ULONG_MAX || !ret)
+ break;
+ if (i)
+ ret = trace_seq_puts(s, " <- ");
+ if (!ip) {
+ ret = trace_seq_puts(s, "??");
+ continue;
+ }
+ if (ret /*&& (sym_flags & TRACE_ITER_SYM_ADDR)*/)
+ ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
+ }
+
+ return ret;
+}
+
static void print_lat_help_header(struct seq_file *m)
{
seq_puts(m, "# _------=> CPU# \n");
@@ -1462,6 +1529,16 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
trace_seq_print_cont(s, iter);
break;
}
+ case TRACE_USER_STACK: {
+ struct userstack_entry *field;
+
+ trace_assign_type(field, entry);
+
+ seq_print_userip_objs(field, s, sym_flags);
+ if (entry->flags & TRACE_FLAG_CONT)
+ trace_seq_print_cont(s, iter);
+ break;
+ }
default:
trace_seq_printf(s, "Unknown type %d\n", entry->type);
}
@@ -1598,6 +1675,19 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
trace_seq_print_cont(s, iter);
break;
}
+ case TRACE_USER_STACK: {
+ struct userstack_entry *field;
+
+ trace_assign_type(field, entry);
+
+ ret = seq_print_userip_objs(field, s, sym_flags);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ ret = trace_seq_putc(s, '\n');
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ break;
+ }
}
return TRACE_TYPE_HANDLED;
}
@@ -1657,6 +1747,7 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -1745,6 +1836,7 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -1799,6 +1891,7 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 8465ad0..4c81642 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -22,6 +22,7 @@ enum trace_type {
TRACE_MMIO_RW,
TRACE_MMIO_MAP,
TRACE_BOOT,
+ TRACE_USER_STACK,

__TRACE_LAST_TYPE
};
@@ -38,6 +39,7 @@ struct trace_entry {
unsigned char flags;
unsigned char preempt_count;
int pid;
+ int tgid;
};

/*
@@ -85,6 +87,11 @@ struct stack_entry {
unsigned long caller[FTRACE_STACK_ENTRIES];
};

+struct userstack_entry {
+ struct trace_entry ent;
+ unsigned long caller[FTRACE_STACK_ENTRIES];
+};
+
/*
* ftrace_printk entry:
*/
@@ -212,6 +219,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct ctx_switch_entry, 0); \
IF_ASSIGN(var, ent, struct trace_field_cont, TRACE_CONT); \
IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK); \
+ IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT); \
IF_ASSIGN(var, ent, struct special_entry, 0); \
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
@@ -415,6 +423,7 @@ enum trace_iterator_flags {
TRACE_ITER_STACKTRACE = 0x100,
TRACE_ITER_SCHED_TREE = 0x200,
TRACE_ITER_PRINTK = 0x400,
+ TRACE_ITER_USERSTACKTRACE = 0x800
};

extern struct tracer nop_trace;
--
1.5.6.5

2008-11-02 21:25:33

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On Sun, Nov 02, 2008 at 11:18:14PM +0200, T??r??k Edwin wrote:
> +static int
> +trace_seq_path(struct trace_seq *s, struct path *path)
> +{
> + int ret;
> + struct seq_file m;
> + m.count = s->len;
> + m.size = PAGE_SIZE;
> + m.buf = s->buffer;
> + ret = seq_path(&m, path, "\n");
> + if (ret)
> + s->len = m.count;
> + return ret;
> +}

NAK. seq_path() is a blatantly wrong thing to use here.

2008-11-02 21:28:51

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On 2008-11-02 23:25, Al Viro wrote:
> On Sun, Nov 02, 2008 at 11:18:14PM +0200, T??r??k Edwin wrote:
>
>> +static int
>> +trace_seq_path(struct trace_seq *s, struct path *path)
>> +{
>> + int ret;
>> + struct seq_file m;
>> + m.count = s->len;
>> + m.size = PAGE_SIZE;
>> + m.buf = s->buffer;
>> + ret = seq_path(&m, path, "\n");
>> + if (ret)
>> + s->len = m.count;
>> + return ret;
>> +}
>>
>
> NAK. seq_path() is a blatantly wrong thing to use here.
>

Are there any alternatives I could use?

This function is called when I do 'cat /sys/kernel/debug/trace', not
during tracing itself.

Best regards,
--Edwin

2008-11-02 21:41:32

by Al Viro

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On Sun, Nov 02, 2008 at 11:28:20PM +0200, T?r?k Edwin wrote:
> On 2008-11-02 23:25, Al Viro wrote:
> > On Sun, Nov 02, 2008 at 11:18:14PM +0200, T??r??k Edwin wrote:
> >
> >> +static int
> >> +trace_seq_path(struct trace_seq *s, struct path *path)
> >> +{
> >> + int ret;
> >> + struct seq_file m;
> >> + m.count = s->len;
> >> + m.size = PAGE_SIZE;
> >> + m.buf = s->buffer;
> >> + ret = seq_path(&m, path, "\n");
> >> + if (ret)
> >> + s->len = m.count;
> >> + return ret;
> >> +}
> >>
> >
> > NAK. seq_path() is a blatantly wrong thing to use here.
> >
>
> Are there any alternatives I could use?
>
> This function is called when I do 'cat /sys/kernel/debug/trace', not
> during tracing itself.

_IF_ you want to get the mangled path, you want to take the guts of
seq_path() into a separate helper and use it - without any references
to struct seq_file since it's completely irrelevant there and you are
using it only as a vehicle for passing the stuff you care about through
the seq_path() interface.

2008-11-03 07:33:19

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* Al Viro <[email protected]> wrote:

> On Sun, Nov 02, 2008 at 11:28:20PM +0200, T?r?k Edwin wrote:
> > On 2008-11-02 23:25, Al Viro wrote:
> > > On Sun, Nov 02, 2008 at 11:18:14PM +0200, T??r??k Edwin wrote:
> > >
> > >> +static int
> > >> +trace_seq_path(struct trace_seq *s, struct path *path)
> > >> +{
> > >> + int ret;
> > >> + struct seq_file m;
> > >> + m.count = s->len;
> > >> + m.size = PAGE_SIZE;
> > >> + m.buf = s->buffer;
> > >> + ret = seq_path(&m, path, "\n");
> > >> + if (ret)
> > >> + s->len = m.count;
> > >> + return ret;
> > >> +}
> > >>
> > >
> > > NAK. seq_path() is a blatantly wrong thing to use here.
> > >
> >
> > Are there any alternatives I could use?
> >
> > This function is called when I do 'cat /sys/kernel/debug/trace', not
> > during tracing itself.
>
> _IF_ you want to get the mangled path, you want to take the guts of
> seq_path() into a separate helper and use it - without any
> references to struct seq_file since it's completely irrelevant there
> and you are using it only as a vehicle for passing the stuff you
> care about through the seq_path() interface.

could you please help out with such a helper? This is really about
visualization, not to rely on it.

Ingo

2008-11-03 07:48:19

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* T?r?k Edwin <[email protected]> wrote:

> + struct task_struct *task;
> + /* we do the lookup on the thread group leader,
> + * since individual threads might have already quit! */

please use the customary comment style:

/*
* Comment .....
* ...... goes here:
*/

> - if (ret /*&& (sym_flags & TRACE_ITER_SYM_ADDR)*/)
> + if (mm) {
> + const struct vm_area_struct *vma = find_vma(mm, ip);
> + if (vma) {
> + file = vma->vm_file;
> + vmstart = vma->vm_start;
> + }
> + }
> + if (file) {
> + ret = trace_seq_path(s, &file->f_path);
> + if (ret)
> + ret = trace_seq_printf(s, "[+0x%lx]",
> + ip - vmstart);
> + }
> + if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
> ret = trace_seq_printf(s, " <" IP_FMT ">", ip);

the function is getting a bit large - would it make sense to split out
this section into a helper inline function?

another small nit: i cleaned up the subject line and the commit log
message for you - see the two patches below - could you please keep it
for future (v3) submissions of these patches? It's just small
formatting changes.

Your patches are a nice feature we want to have nevertheless - to be
able to see where a user-space app is running has been one of the
historically weak points of kernel instrumentation.

Ingo

-------------->
Subject: tracing: add support for userspace stacktraces in tracing/iter_ctrl
From: T?r?k Edwin <[email protected]>
Date: Sun, 2 Nov 2008 23:18:13 +0200

Impact: add new (default-off) tracing visualization feature

Usage example:

mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
.... run application ...
echo 0 >tracing_enabled

Then read one of 'trace','latency_trace','trace_pipe'.

To get the best output you can compile your userspace programs with
frame pointers (at least glibc + the app you are tracing).

Signed-off-by: T?r?k Edwin <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/ftrace.txt | 5 +-
arch/x86/kernel/stacktrace.c | 57 ++++++++++++++++++++++++++
include/linux/stacktrace.h | 8 +++
kernel/trace/trace.c | 93 +++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 9 ++++
5 files changed, 171 insertions(+), 1 deletion(-)

Index: tip/Documentation/ftrace.txt
===================================================================
--- tip.orig/Documentation/ftrace.txt
+++ tip/Documentation/ftrace.txt
@@ -330,7 +330,7 @@ output. To see what is available, simply

cat /debug/tracing/iter_ctrl
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree
+ noblock nostacktrace nosched-tree nouserstacktrace

To disable one of the options, echo in the option prepended with "no".

@@ -384,6 +384,9 @@ Here are the available options:
When a trace is recorded, so is the stack of functions.
This allows for back traces of trace sites.

+ userstacktrace - This option changes the trace.
+ It records a stacktrace of the current userspace thread.
+
sched-tree - TBD (any users??)

Index: tip/arch/x86/kernel/stacktrace.c
===================================================================
--- tip.orig/arch/x86/kernel/stacktrace.c
+++ tip/arch/x86/kernel/stacktrace.c
@@ -6,6 +6,7 @@
#include <linux/sched.h>
#include <linux/stacktrace.h>
#include <linux/module.h>
+#include <linux/uaccess.h>
#include <asm/stacktrace.h>

static void save_stack_warning(void *data, char *msg)
@@ -90,3 +91,59 @@ void save_stack_trace_tsk(struct task_st
trace->entries[trace->nr_entries++] = ULONG_MAX;
}
EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
+
+/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
+
+struct stack_frame {
+ const void __user *next_fp;
+ unsigned long return_address;
+};
+
+static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+{
+ int ret;
+
+ if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+ return 0;
+
+ ret = 1;
+ pagefault_disable();
+ if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+ ret = 0;
+ pagefault_enable();
+
+ return ret;
+}
+
+void save_stack_trace_user(struct stack_trace *trace)
+{
+ /*
+ * Trace user stack if we are not a kernel thread
+ */
+ if (current->mm) {
+ const struct pt_regs *regs = task_pt_regs(current);
+ const void __user *fp = (const void __user *)regs->bp;
+
+ if (trace->nr_entries < trace->max_entries)
+ trace->entries[trace->nr_entries++] = regs->ip;
+
+ while (trace->nr_entries < trace->max_entries) {
+ struct stack_frame frame;
+ frame.next_fp = NULL;
+ frame.return_address = 0;
+ if (!copy_stack_frame(fp, &frame))
+ break;
+ if ((unsigned long)fp < regs->sp)
+ break;
+ if (frame.return_address)
+ trace->entries[trace->nr_entries++] =
+ frame.return_address;
+ if (fp == frame.next_fp)
+ break;
+ fp = frame.next_fp;
+ }
+ }
+ if (trace->nr_entries < trace->max_entries)
+ trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
+
Index: tip/include/linux/stacktrace.h
===================================================================
--- tip.orig/include/linux/stacktrace.h
+++ tip/include/linux/stacktrace.h
@@ -18,9 +18,17 @@ extern void save_stack_trace_tsk(struct
struct stack_trace *trace);

extern void print_stack_trace(struct stack_trace *trace, int spaces);
+
+#ifdef CONFIG_X86
+extern void save_stack_trace_user(struct stack_trace *trace);
+#else
+# define save_stack_trace_user(trace) do { } while (0)
+#endif
+
#else
# define save_stack_trace(trace) do { } while (0)
# define save_stack_trace_tsk(tsk, trace) do { } while (0)
+# define save_stack_trace_user(trace) do { } while (0)
# define print_stack_trace(trace, spaces) do { } while (0)
#endif

Index: tip/kernel/trace/trace.c
===================================================================
--- tip.orig/kernel/trace/trace.c
+++ tip/kernel/trace/trace.c
@@ -235,6 +235,7 @@ static const char *trace_options[] = {
"stacktrace",
"sched-tree",
"ftrace_printk",
+ "userstacktrace",
NULL
};

@@ -762,6 +763,44 @@ void __trace_stack(struct trace_array *t
ftrace_trace_stack(tr, data, flags, skip, preempt_count());
}

+static void ftrace_trace_userstack(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ unsigned long flags, int pc)
+{
+ struct userstack_entry *entry;
+ struct stack_trace trace;
+ struct ring_buffer_event *event;
+ unsigned long irq_flags;
+
+ if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
+ return;
+
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+ &irq_flags);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, flags, pc);
+ entry->ent.type = TRACE_USER_STACK;
+
+ memset(&entry->caller, 0, sizeof(entry->caller));
+
+ trace.nr_entries = 0;
+ trace.max_entries = FTRACE_STACK_ENTRIES;
+ trace.skip = 0;
+ trace.entries = entry->caller;
+
+ save_stack_trace_user(&trace);
+ ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+}
+
+void __trace_userstack(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ unsigned long flags)
+{
+ ftrace_trace_userstack(tr, data, flags, preempt_count());
+}
+
static void
ftrace_trace_special(void *__tr, void *__data,
unsigned long arg1, unsigned long arg2, unsigned long arg3,
@@ -785,6 +824,7 @@ ftrace_trace_special(void *__tr, void *_
entry->arg3 = arg3;
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, irq_flags, 4, pc);
+ ftrace_trace_userstack(tr, data, irq_flags, pc);

trace_wake_up();
}
@@ -823,6 +863,7 @@ tracing_sched_switch_trace(struct trace_
entry->next_cpu = task_cpu(next);
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, flags, 5, pc);
+ ftrace_trace_userstack(tr, data, flags, pc);
}

void
@@ -852,6 +893,7 @@ tracing_sched_wakeup_trace(struct trace_
entry->next_cpu = task_cpu(wakee);
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, flags, 6, pc);
+ ftrace_trace_userstack(tr, data, flags, pc);

trace_wake_up();
}
@@ -1175,6 +1217,31 @@ seq_print_ip_sym(struct trace_seq *s, un
return ret;
}

+static int
+seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
+ unsigned long sym_flags)
+{
+ int ret = 1;
+ unsigned i;
+
+ for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
+ unsigned long ip = entry->caller[i];
+
+ if (ip == ULONG_MAX || !ret)
+ break;
+ if (i)
+ ret = trace_seq_puts(s, " <- ");
+ if (!ip) {
+ ret = trace_seq_puts(s, "??");
+ continue;
+ }
+ if (ret /*&& (sym_flags & TRACE_ITER_SYM_ADDR)*/)
+ ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
+ }
+
+ return ret;
+}
+
static void print_lat_help_header(struct seq_file *m)
{
seq_puts(m, "# _------=> CPU# \n");
@@ -1462,6 +1529,16 @@ print_lat_fmt(struct trace_iterator *ite
trace_seq_print_cont(s, iter);
break;
}
+ case TRACE_USER_STACK: {
+ struct userstack_entry *field;
+
+ trace_assign_type(field, entry);
+
+ seq_print_userip_objs(field, s, sym_flags);
+ if (entry->flags & TRACE_FLAG_CONT)
+ trace_seq_print_cont(s, iter);
+ break;
+ }
default:
trace_seq_printf(s, "Unknown type %d\n", entry->type);
}
@@ -1598,6 +1675,19 @@ static enum print_line_t print_trace_fmt
trace_seq_print_cont(s, iter);
break;
}
+ case TRACE_USER_STACK: {
+ struct userstack_entry *field;
+
+ trace_assign_type(field, entry);
+
+ ret = seq_print_userip_objs(field, s, sym_flags);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ ret = trace_seq_putc(s, '\n');
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ break;
+ }
}
return TRACE_TYPE_HANDLED;
}
@@ -1657,6 +1747,7 @@ static enum print_line_t print_raw_fmt(s
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -1745,6 +1836,7 @@ static enum print_line_t print_hex_fmt(s
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -1799,6 +1891,7 @@ static enum print_line_t print_bin_fmt(s
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

Index: tip/kernel/trace/trace.h
===================================================================
--- tip.orig/kernel/trace/trace.h
+++ tip/kernel/trace/trace.h
@@ -22,6 +22,7 @@ enum trace_type {
TRACE_MMIO_RW,
TRACE_MMIO_MAP,
TRACE_BOOT,
+ TRACE_USER_STACK,

__TRACE_LAST_TYPE
};
@@ -38,6 +39,7 @@ struct trace_entry {
unsigned char flags;
unsigned char preempt_count;
int pid;
+ int tgid;
};

/*
@@ -85,6 +87,11 @@ struct stack_entry {
unsigned long caller[FTRACE_STACK_ENTRIES];
};

+struct userstack_entry {
+ struct trace_entry ent;
+ unsigned long caller[FTRACE_STACK_ENTRIES];
+};
+
/*
* ftrace_printk entry:
*/
@@ -212,6 +219,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct ctx_switch_entry, 0); \
IF_ASSIGN(var, ent, struct trace_field_cont, TRACE_CONT); \
IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK); \
+ IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT); \
IF_ASSIGN(var, ent, struct special_entry, 0); \
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
@@ -415,6 +423,7 @@ enum trace_iterator_flags {
TRACE_ITER_STACKTRACE = 0x100,
TRACE_ITER_SCHED_TREE = 0x200,
TRACE_ITER_PRINTK = 0x400,
+ TRACE_ITER_USERSTACKTRACE = 0x800
};

extern struct tracer nop_trace;

------------------->
Subject: tracing: identify which executable object the userspace address belongs to
From: T?r?k Edwin <[email protected]>
Date: Sun, 2 Nov 2008 23:18:14 +0200

Impact: modify+improve the userstacktrace tracing visualization feature

Store thread group leader id, and use it to lookup the address in the
process's map. We could have looked up the address on thread's map,
but the thread might not exist by the time we are called. The process
might not exist either, but if you are reading trace_pipe, that is
unlikely.

Example usage:

mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sym-userobj >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
cat trace_pipe >/tmp/trace&
.... run application ...
echo 0 >tracing_enabled
cat /tmp/trace

You'll see stack entries like:

/lib/libpthread-2.7.so[+0xd370]

You can convert them to function/line using:

addr2line -fie /lib/libpthread-2.7.so 0xd370

Or:

addr2line -fie /usr/lib/debug/libpthread-2.7.so 0xd370

For non-PIC/PIE executables this won't work:

a.out[+0x73b]

You need to run the following: addr2line -fie a.out 0x40073b
(where 0x400000 is the default load address of a.out)

Signed-off-by: T?r?k Edwin <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/ftrace.txt | 13 +++++++++++
kernel/trace/trace.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 3 +-
3 files changed, 64 insertions(+), 3 deletions(-)

Index: tip/Documentation/ftrace.txt
===================================================================
--- tip.orig/Documentation/ftrace.txt
+++ tip/Documentation/ftrace.txt
@@ -330,7 +330,7 @@ output. To see what is available, simply

cat /debug/tracing/iter_ctrl
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree nouserstacktrace
+ noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj

To disable one of the options, echo in the option prepended with "no".

@@ -387,6 +387,17 @@ Here are the available options:
userstacktrace - This option changes the trace.
It records a stacktrace of the current userspace thread.

+ sym-userobj - when user stacktrace are enabled, look up which object the
+ address belongs to, and print a relative address
+ This is especially useful when ASLR is on, otherwise you don't
+ get a chance to resolve the address to object/file/line after the app is no
+ longer running
+
+ The lookup is performed when you read trace,trace_pipe,latency_trace. Example:
+
+ a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
+x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
+
sched-tree - TBD (any users??)

Index: tip/kernel/trace/trace.c
===================================================================
--- tip.orig/kernel/trace/trace.c
+++ tip/kernel/trace/trace.c
@@ -236,6 +236,7 @@ static const char *trace_options[] = {
"sched-tree",
"ftrace_printk",
"userstacktrace",
+ "sym-userobj",
NULL
};

@@ -382,6 +383,20 @@ trace_seq_putmem_hex(struct trace_seq *s
return trace_seq_putmem(s, hex, j);
}

+static int
+trace_seq_path(struct trace_seq *s, struct path *path)
+{
+ int ret;
+ struct seq_file m;
+ m.count = s->len;
+ m.size = PAGE_SIZE;
+ m.buf = s->buffer;
+ ret = seq_path(&m, path, "\n");
+ if (ret)
+ s->len = m.count;
+ return ret;
+}
+
static void
trace_seq_reset(struct trace_seq *s)
{
@@ -678,6 +693,7 @@ tracing_generic_entry_update(struct trac

entry->preempt_count = pc & 0xff;
entry->pid = (tsk) ? tsk->pid : 0;
+ entry->tgid = (tsk) ? tsk->tgid : 0;
entry->flags =
#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT
(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
@@ -1221,11 +1237,26 @@ static int
seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
unsigned long sym_flags)
{
+ struct mm_struct *mm = NULL;
int ret = 1;
unsigned i;

+ if (trace_flags & TRACE_ITER_SYM_USEROBJ) {
+ struct task_struct *task;
+ /* we do the lookup on the thread group leader,
+ * since individual threads might have already quit! */
+ rcu_read_lock();
+ task = find_task_by_vpid(entry->ent.tgid);
+ rcu_read_unlock();
+
+ if (task)
+ mm = get_task_mm(task);
+ }
+
for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
unsigned long ip = entry->caller[i];
+ unsigned long vmstart = 0;
+ struct file *file = NULL;

if (ip == ULONG_MAX || !ret)
break;
@@ -1235,10 +1266,25 @@ seq_print_userip_objs(const struct users
ret = trace_seq_puts(s, "??");
continue;
}
- if (ret /*&& (sym_flags & TRACE_ITER_SYM_ADDR)*/)
+ if (mm) {
+ const struct vm_area_struct *vma = find_vma(mm, ip);
+ if (vma) {
+ file = vma->vm_file;
+ vmstart = vma->vm_start;
+ }
+ }
+ if (file) {
+ ret = trace_seq_path(s, &file->f_path);
+ if (ret)
+ ret = trace_seq_printf(s, "[+0x%lx]",
+ ip - vmstart);
+ }
+ if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
}

+ if (mm)
+ mmput(mm);
return ret;
}

@@ -3239,6 +3285,9 @@ void ftrace_dump(void)
atomic_inc(&global_trace.data[cpu]->disabled);
}

+ /* don't look at user memory in panic mode */
+ trace_flags &= ~TRACE_ITER_SYM_USEROBJ;
+
printk(KERN_TRACE "Dumping ftrace buffer:\n");

iter.tr = &global_trace;
Index: tip/kernel/trace/trace.h
===================================================================
--- tip.orig/kernel/trace/trace.h
+++ tip/kernel/trace/trace.h
@@ -423,7 +423,8 @@ enum trace_iterator_flags {
TRACE_ITER_STACKTRACE = 0x100,
TRACE_ITER_SCHED_TREE = 0x200,
TRACE_ITER_PRINTK = 0x400,
- TRACE_ITER_USERSTACKTRACE = 0x800
+ TRACE_ITER_USERSTACKTRACE = 0x800,
+ TRACE_ITER_SYM_USEROBJ = 0x1000
};

extern struct tracer nop_trace;

2008-11-03 08:16:47

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On 2008-11-03 09:47, Ingo Molnar wrote:
> * T?r?k Edwin <[email protected]> wrote:
>
>
>> + struct task_struct *task;
>> + /* we do the lookup on the thread group leader,
>> + * since individual threads might have already quit! */
>>
>
> please use the customary comment style:
>
> /*
> * Comment .....
> * ...... goes here:
> */
>
>

. Can checkpatch.pl warn me of situations like this?

>> - if (ret /*&& (sym_flags & TRACE_ITER_SYM_ADDR)*/)
>> + if (mm) {
>> + const struct vm_area_struct *vma = find_vma(mm, ip);
>> + if (vma) {
>> + file = vma->vm_file;
>> + vmstart = vma->vm_start;
>> + }
>> + }
>> + if (file) {
>> + ret = trace_seq_path(s, &file->f_path);
>> + if (ret)
>> + ret = trace_seq_printf(s, "[+0x%lx]",
>> + ip - vmstart);
>> + }
>> + if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
>> ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
>>
>
> the function is getting a bit large - would it make sense to split out
> this section into a helper inline function?
>

Yes, I'll do that.

> another small nit: i cleaned up the subject line and the commit log
> message for you - see the two patches below - could you please keep it
> for future (v3) submissions of these patches? It's just small
> formatting changes.
>

Ok, luckily git-rebase -i allows me to rewrite the log message too ;)

> Your patches are a nice feature we want to have nevertheless - to be
> able to see where a user-space app is running has been one of the
> historically weak points of kernel instrumentation.

Thanks.
It currently works for x86 only, but architecture porters can add
support for theirs quite easily, it just needs to modeled after how
oprofile does it for example.
BTW would it make sense to change oprofile and the sysprof tracer to use
save_stack_trace_user? It would eliminate some code duplication.

Would it make sense to add a script that post-processes the output to
scripts/tracing?
It would parse a trace log (from trace or latency_trace) and use
addr2line to resolve the address to source:line, and if successful
replace the relative address with that; and also
group identical stack traces together.

Best regards,
--Edwin

2008-11-03 08:22:19

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* T?r?k Edwin <[email protected]> wrote:

> On 2008-11-03 09:47, Ingo Molnar wrote:
> > * T?r?k Edwin <[email protected]> wrote:
> >
> >
> >> + struct task_struct *task;
> >> + /* we do the lookup on the thread group leader,
> >> + * since individual threads might have already quit! */
> >>
> >
> > please use the customary comment style:
> >
> > /*
> > * Comment .....
> > * ...... goes here:
> > */
> >
> >
>
> . Can checkpatch.pl warn me of situations like this?

Cc:-ed Andy - it would be a useful feature indeed. (since there's no
hard CodingStyle rule for it, it could be a default-off helper)

Ingo

2008-11-03 08:30:00

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* T?r?k Edwin <[email protected]> wrote:

> > Your patches are a nice feature we want to have nevertheless - to
> > be able to see where a user-space app is running has been one of
> > the historically weak points of kernel instrumentation.
>
> Thanks.
> It currently works for x86 only, but architecture porters can add
> support for theirs quite easily, it just needs to modeled after how
> oprofile does it for example.
> BTW would it make sense to change oprofile and the sysprof tracer to use
> save_stack_trace_user? It would eliminate some code duplication.

that definitely sounds like the right direction. I've Cc:-ed Robert
Richter, the Oprofile maintainer - please Cc: him to code that touches
oprofile.

note that NMI interaction of user-space stackframe walkers can be a
bit tricky: the basic problem is that if you fetch a user-space
stackframe that can create a fault, and the IRET at the end of the
fault handler will re-enable NMIs (violating the NMI code's
assumptions).

there are patches on lkml written by Mathieu Desnoyers that solve this
by changing all the fault path to use RET instead of IRET. It might
make sense to dust them off - we carried them for a long time in -tip
and they were robust. (they just never had any really strong
justification and were rather complex - that changes now)

Mathieu, what do you think?

> Would it make sense to add a script that post-processes the output
> to scripts/tracing?
>
> It would parse a trace log (from trace or latency_trace) and use
> addr2line to resolve the address to source:line, and if successful
> replace the relative address with that; and also group identical
> stack traces together.

sure, please add it to scripts/tracing/.

The best approach would be if the kernel could output the best info by
default - but that seems rather hard for addr2line functionality which
involves debuginfo processing, etc.

Ingo

2008-11-03 13:58:04

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On 2008-11-03 10:29, Ingo Molnar wrote:
> * T?r?k Edwin <[email protected]> wrote:
>
>
>>> Your patches are a nice feature we want to have nevertheless - to
>>> be able to see where a user-space app is running has been one of
>>> the historically weak points of kernel instrumentation.
>>>
>> Thanks.
>> It currently works for x86 only, but architecture porters can add
>> support for theirs quite easily, it just needs to modeled after how
>> oprofile does it for example.
>> BTW would it make sense to change oprofile and the sysprof tracer to use
>> save_stack_trace_user? It would eliminate some code duplication.
>>
>
> that definitely sounds like the right direction. I've Cc:-ed Robert
> Richter, the Oprofile maintainer - please Cc: him to code that touches
> oprofile.
>
> note that NMI interaction of user-space stackframe walkers can be a
> bit tricky: the basic problem is that if you fetch a user-space
> stackframe that can create a fault

The code in trace_sysprof.c (which I used as a base for the
save_stack_trace_user) disables pagefaults
before reading the stackframe from userspace. Does it avoid this problem
then?

Note that due to its use from ftrace, the userstack walker can be called
from the pagefault handler itself, and if it is
allowed to fault it could lead to some form of deadlock. Are the ftrace
functions protected from recursively reentering themselves?

> , and the IRET at the end of the
> fault handler will re-enable NMIs (violating the NMI code's
> assumptions).
>

Is this already a problem with oprofile's user-stack walker?

> there are patches on lkml written by Mathieu Desnoyers that solve this
> by changing all the fault path to use RET instead of IRET. It might
> make sense to dust them off - we carried them for a long time in -tip
> and they were robust. (they just never had any really strong
> justification and were rather complex - that changes now)
>
> Mathieu, what do you think?
>
>
>> Would it make sense to add a script that post-processes the output
>> to scripts/tracing?
>>
>> It would parse a trace log (from trace or latency_trace) and use
>> addr2line to resolve the address to source:line, and if successful
>> replace the relative address with that; and also group identical
>> stack traces together.
>>
>
> sure, please add it to scripts/tracing/.
>

Ok, will do so in v3.

> The best approach would be if the kernel could output the best info by
> default

The kernel could do some grouping and counting (as latencytop does), but
I don't see where it would fit in frace's infrastructure.

I think ftrace's one entry per event is useful in many situations
(debugging, latency measurements), but if the events occur too frequently
it could produce too much data, and it would be more efficient to do
some counting/grouping of similar info in-kernel before outputting to
userspace.
Perhaps as a layer on top of ftrace? What do you think?

> - but that seems rather hard for addr2line functionality which
> involves debuginfo processing, etc.
>

yes it would be an overkill to try to do that from the kernel, when it
is so easy to do from userspace ;)

Best regards,
--Edwin

2008-11-03 14:04:50

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* T?r?k Edwin <[email protected]> wrote:

> > note that NMI interaction of user-space stackframe walkers can be
> > a bit tricky: the basic problem is that if you fetch a user-space
> > stackframe that can create a fault
>
> The code in trace_sysprof.c (which I used as a base for the
> save_stack_trace_user) disables pagefaults before reading the
> stackframe from userspace. Does it avoid this problem then?

no, it does not solve it - because pagefault_disable is a "soft"
mechanism which does not disable the faults themselves. (it only
disables some of their effects)

> > , and the IRET at the end of the fault handler will re-enable NMIs
> > (violating the NMI code's assumptions).
>
> Is this already a problem with oprofile's user-stack walker?

yes, oprofile's code is buggy here too. And not enabled by default,
and probably not used by many people. The bug would show up as
mistakenly enabled NMIs from within NMIs - the kernel will _probably_
live but could lock up or overflow its stack, etc.

> > The best approach would be if the kernel could output the best
> > info by default
>
> The kernel could do some grouping and counting (as latencytop does),
> but I don't see where it would fit in frace's infrastructure.
>
> I think ftrace's one entry per event is useful in many situations
> (debugging, latency measurements), but if the events occur too
> frequently it could produce too much data, and it would be more
> efficient to do some counting/grouping of similar info in-kernel
> before outputting to userspace. Perhaps as a layer on top of ftrace?
> What do you think?

yes, histogram generation would be a natural 'view' of a tracer: have
a look at Steve's likely()/unlikely() tracer that introduces that
notion into ftrace.

Ingo

2008-11-03 19:26:44

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* Ingo Molnar ([email protected]) wrote:
>
> * T?r?k Edwin <[email protected]> wrote:
>
> > > Your patches are a nice feature we want to have nevertheless - to
> > > be able to see where a user-space app is running has been one of
> > > the historically weak points of kernel instrumentation.
> >
> > Thanks.
> > It currently works for x86 only, but architecture porters can add
> > support for theirs quite easily, it just needs to modeled after how
> > oprofile does it for example.
> > BTW would it make sense to change oprofile and the sysprof tracer to use
> > save_stack_trace_user? It would eliminate some code duplication.
>
> that definitely sounds like the right direction. I've Cc:-ed Robert
> Richter, the Oprofile maintainer - please Cc: him to code that touches
> oprofile.
>
> note that NMI interaction of user-space stackframe walkers can be a
> bit tricky: the basic problem is that if you fetch a user-space
> stackframe that can create a fault, and the IRET at the end of the
> fault handler will re-enable NMIs (violating the NMI code's
> assumptions).
>
> there are patches on lkml written by Mathieu Desnoyers that solve this
> by changing all the fault path to use RET instead of IRET. It might
> make sense to dust them off - we carried them for a long time in -tip
> and they were robust. (they just never had any really strong
> justification and were rather complex - that changes now)
>
> Mathieu, what do you think?
>

Yep, using the NMI-safe traps seems like a good idea for this. I look
forward to add those userspace stack dumps in my LTTng traces. I've had
this feature in the past in LTTng and it was _really_ useful, e.g. to
know the whole userspace stack that caused a system call.

The patchset version I have in my -lttng tree is pretty much the same
you currently have in -tip. I have not ported my tree to 2.6.28-rcX yet
though.

For trap instrumentation, I think the sane way to deal with the
recursive trap problem is to keep a nesting count associated with
instrumentation within the trap handler, which would dynamically disable
this specific instrumentation (per-cpu given preemption is disabled
within the instrumentation) once it reaches a given nesting level. That
would permit to overcome the recursive trap problem without losing
nested events happening when, for example, a NMI nests over a standard
interrupt, which is, in this case, nested but not caused by recursion.

Also, we have to think carefully about how we want to access userspace.
a copy_from_user_inatomic(), which may fail if the data we try to access
is not in cache, seems like a sane approach to deal with such
instrumentation called in atomic context. But if we detect that the
instrumentation is called from preemptable context, then a
copy_from_user() could be ok.

Mathieu

> > Would it make sense to add a script that post-processes the output
> > to scripts/tracing?
> >
> > It would parse a trace log (from trace or latency_trace) and use
> > addr2line to resolve the address to source:line, and if successful
> > replace the relative address with that; and also group identical
> > stack traces together.
>
> sure, please add it to scripts/tracing/.
>
> The best approach would be if the kernel could output the best info by
> default - but that seems rather hard for addr2line functionality which
> involves debuginfo processing, etc.
>
> Ingo

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2008-11-03 19:56:18

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

fwiw, this is one amazing subject line.. ;-)

2008-11-03 19:59:17

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

Em Mon, Nov 03, 2008 at 08:56:29PM +0100, Peter Zijlstra escreveu:
> fwiw, this is one amazing subject line.. ;-)

Too much for "subject says it all." 8)

- Arnaldo

2008-11-03 20:01:57

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On 2008-11-03 21:58, Arnaldo Carvalho de Melo wrote:
> Em Mon, Nov 03, 2008 at 08:56:29PM +0100, Peter Zijlstra escreveu:
>
>> fwiw, this is one amazing subject line.. ;-)
>>
>
> Too much for "subject says it all." 8)

Yep, I've just learned that the first line in my git commit message will
be the subject line for the emails it sends out.

I'll use the shorter subject line suggested by Ingo next time ;)

Best regards,
--Edwin

2008-11-04 18:26:55

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On Mon, Nov 03, 2008 at 08:32:57AM +0100, Ingo Molnar wrote:
> could you please help out with such a helper? This is really about
> visualization, not to rely on it.

Is this kindergarten? Take a look at seq_path - what you want
is the combination of d_path + mangle_path and the combination of two
would be rather trivial. In fact I'd say you don't even need the helper
but just make mangle_path global (and document it while you're at it)
and then use it.

2008-11-04 18:37:39

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On 2008-11-04 20:26, Christoph Hellwig wrote:
> On Mon, Nov 03, 2008 at 08:32:57AM +0100, Ingo Molnar wrote:
>
>> could you please help out with such a helper? This is really about
>> visualization, not to rely on it.
>>
>
> Is this kindergarten? Take a look at seq_path - what you want
> is the combination of d_path + mangle_path and the combination of two
> would be rather trivial. In fact I'd say you don't even need the helper
> but just make mangle_path global (and document it while you're at it)
> and then use it.

Hmm, this is just for visualizing things in /sys/kernel/debug/, can't I
use something simpler than mangle_path, such as a simple while loop like
the one below?

+ static int
+trace_seq_path(struct trace_seq *s, struct path *path)
+{
+ char *p;
+
+ if (s->len >= (PAGE_SIZE - 1))
+ return 0;
+ p = d_path(path, s->buffer + s->len, PAGE_SIZE - s->len);
+ if (!IS_ERR(p)) {
+ size_t i = s->len;
+ char c;
+ while ((c = *p++)) {
+ if (c != '\n')
+ s->buffer[i++] = c;
+ }
+ s->len = i;
+ return 1;
+ } else {
+ s->buffer[s->len++] = '?';
+ return 1;
+ }
+
+ return 0;
+}

2008-11-04 19:09:35

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On Tue, Nov 04, 2008 at 08:37:11PM +0200, T?r?k Edwin wrote:
> Hmm, this is just for visualizing things in /sys/kernel/debug/, can't I
> use something simpler than mangle_path, such as a simple while loop like
> the one below?

Sounds fine to me, although sharing one mangling-scheme for all pathname
outputs certainly does have benefits of less confusion for the user.

2008-11-04 19:10:57

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On 2008-11-04 21:09, Christoph Hellwig wrote:
> On Tue, Nov 04, 2008 at 08:37:11PM +0200, T?r?k Edwin wrote:
>
>> Hmm, this is just for visualizing things in /sys/kernel/debug/, can't I
>> use something simpler than mangle_path, such as a simple while loop like
>> the one below?
>>
>
> Sounds fine to me, although sharing one mangling-scheme for all pathname
> outputs certainly does have benefits of less confusion for the user.

Indeed. I'll do as you suggested: make mangle_path global and document
what it does.

Best regards,
--Edwin

2008-11-06 09:56:15

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* Christoph Hellwig <[email protected]> wrote:

> On Mon, Nov 03, 2008 at 08:32:57AM +0100, Ingo Molnar wrote:
> > could you please help out with such a helper? This is really about
> > visualization, not to rely on it.
>
> Is this kindergarten? [...]

No, this is lkml where, if you decide to stand in the way of a patch,
you are supposed to back up your NAK's with a path out of that NAK.
Instead of forcing people into multiple email ping-pongs trying to
figure out what exactly the objections mean.

Look at the time flow of the mails:

| Nov 2 22:18:21 2008 # submission
| Nov 2 22:25:16 2008 # +7 minutes, NAK with no comment
|
| [ ~5 unnecessary mails ]
|
| Nov 4 20:09:13 20008 # +2 days, ACK

What was done was inefficent and unnecessary, it caused a hickup and
delay of 2 days in the workflow - at the point of the NAK it was
already known what the right solution is, it just was obscured behind
a moody reply. Such obstruction adds up and it simply does not scale
as a form of communication.

Nobody learned anything new compared to the case had the +7 minutes
reply included this information already.

Ingo

2008-11-06 12:27:21

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

On Thu, Nov 06, 2008 at 10:55:47AM +0100, Ingo Molnar wrote:
> > On Mon, Nov 03, 2008 at 08:32:57AM +0100, Ingo Molnar wrote:
> > > could you please help out with such a helper? This is really about
> > > visualization, not to rely on it.
> >
> > Is this kindergarten? [...]
>
> No, this is lkml where, if you decide to stand in the way of a patch,
> you are supposed to back up your NAK's with a path out of that NAK.
> Instead of forcing people into multiple email ping-pongs trying to
> figure out what exactly the objections mean.

Even if that was true, which would be avery sad world I should be
easy enough for you or anyone to look at what seq_path does and how
to have a version that doesn't use the seq_file insterface. Takes
about two minutes with most of the time spent on finding where seq_file
is implemented.

2008-11-06 15:32:57

[permalink] [raw]

Subject: Re: [PATCH] Identify which executable object the userspace address belongs to. Store thread group leader id, and use it to lookup the address in the process's map. We could have looked up the address on thread's map, but the thread might not exist by the time we are called. The process might not exist either, but if you are reading trace_pipe, that is unlikely.

* Christoph Hellwig <[email protected]> wrote:

> On Thu, Nov 06, 2008 at 10:55:47AM +0100, Ingo Molnar wrote:
> > > On Mon, Nov 03, 2008 at 08:32:57AM +0100, Ingo Molnar wrote:
> > > > could you please help out with such a helper? This is really about
> > > > visualization, not to rely on it.
> > >
> > > Is this kindergarten? [...]
> >
> > No, this is lkml where, if you decide to stand in the way of a patch,
> > you are supposed to back up your NAK's with a path out of that NAK.
> > Instead of forcing people into multiple email ping-pongs trying to
> > figure out what exactly the objections mean.
>
> Even if that was true, which would be avery sad world I should be
> easy enough for you or anyone to look at what seq_path does and how
> to have a version that doesn't use the seq_file insterface. Takes
> about two minutes with most of the time spent on finding where
> seq_file is implemented.

the timeline i quoted shows that it was a 2 days process, not 2
minutes.

It's the basic principle of communication: be forgiving in what you
receive and conservative in what you transmit. Plus if one extra
sentence of seemingly redundant information increases the chances to
save an email-roundtrip down the line then please do it to increase
communication efficiency and reduce latency.

Just to show you an example, let me give you an analogous situation in
scheduler talk, as a reply to someone from another field who tries to
use an existing scheduler API the wrong way:

"
Hey, you loser, NAK. Sync wakeups are the wrong thing to do, why
dont you just disable buddy wakeups for a minute in sched.c??

[ Only two minutes to write a patch. The solution is really
obvious to me, and we are not in the kindergarten to explain
everything to you. And besides, the longer i make it for you to
figure it all out, the cooler i can feel about being an
ueber-capable scheduler hacker ;-) ]
"

do you want such a world? i dont ;-)

Ingo

2008-11-06 15:37:05