2008-10-12 13:13:21

by Török Edwin

Subject: [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer

Hi,

This patch series is a draft implementation of a semaphore-latency tracer.
I would appreciate feedback on these patches so that they can be merged into
mainline.

The patches apply on top of tip/master (commit aaf9631d60b82182c7e9890b5ffec70f38131a50).

The first two patches add userspace stacktrace support.
Unfortunately, if the userspace application (actually libc) is compiled without frame pointers
on x86-64, the userspace trace will contain only one entry: the return address.
You can either request raw IP addresses in the trace, or ask for them to be
resolved to object file + relative address; a tool such as addr2line can then be
used to get file:line information.
This is currently x86-only, but other architectures could implement it.

TODO: it would be worth investigating whether we could avoid code duplication with
oprofile, using a common function for userspace traces and converting oprofile to it.

The last two patches add a new tracer (independent of the first two patches).
This tracer attempts to help in tracking rwsem latency issues.
Each latency instance gets an ID, and a userspace program can be used to group
the stacktraces by latency ID, thus obtaining a list of all tasks involved
in a latency: the task(s) holding the semaphore, and the tasks that had
to wait because of it.
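
A minimal sketch of such a grouping tool (hypothetical, not part of this
series; it assumes the "(latency_id)"-tagged record format described in
patch 4, with stacktrace lines following each tagged record):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* print the records tagged with one latency id, plus their stacktraces */
int main(int argc, char **argv)
{
	char line[4096];
	unsigned int id, want;
	int keep = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <latency_id> < latency_trace\n",
			argv[0]);
		return 1;
	}
	want = strtoul(argv[1], NULL, 0);

	while (fgets(line, sizeof(line), stdin)) {
		char *p = strchr(line, '(');

		/* a new record starts whenever a "(id)" tag appears */
		if (p && sscanf(p, "(%u)", &id) == 1)
			keep = (id == want);
		if (keep)
			fputs(line, stdout);
	}
	return 0;
}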

See the patches themselves for more detail on how to use them.

Shortlog below:

Török Edwin (4):
Add support for userspace stacktraces in tracing/iter_ctrl
Identify which executable object the userspace address belongs to.
add tracepoints in rwsem
Implement semaphore latency tracer

Documentation/ftrace.txt | 5 +-
arch/x86/kernel/stacktrace.c | 57 +++++++++++++
include/linux/stacktrace.h | 8 ++
include/trace/rwsem.h | 35 ++++++++
kernel/trace/Kconfig | 12 +++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 188 ++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 35 ++++++++
kernel/trace/trace_semlat.c | 186 +++++++++++++++++++++++++++++++++++++++++
lib/rwsem-spinlock.c | 16 ++--
lib/rwsem.c | 13 +--
11 files changed, 539 insertions(+), 17 deletions(-)
create mode 100644 include/trace/rwsem.h
create mode 100644 kernel/trace/trace_semlat.c


2008-10-12 13:12:24

by Török Edwin

Subject: [PATCH 2/4] Identify which executable object the userspace address belongs to.

Example usage:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sym-userobj >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
cat trace_pipe >/tmp/trace&
.... run application ...
echo 0 >tracing_enabled
cat /tmp/trace

You'll see stack entries like:
/lib/libpthread-2.7.so[+0xd370]
You can convert them to function/line using:
addr2line -fie /lib/libpthread-2.7.so 0xd370
Or
addr2line -fie /usr/lib/debug/libpthread-2.7.so 0xd370

For non-PIC/PIE executables this won't work:
a.out[+0x73b]
You need to run the following: addr2line -fie a.out 0x40073b
(where 0x400000 is the default load address of a.out)
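
A toy helper for that arithmetic (a sketch only; it assumes the conventional
0x400000 load address mentioned above):

#include <stdio.h>

int main(void)
{
	unsigned long load_base = 0x400000; /* default non-PIE load address */
	unsigned long offset = 0x73b;       /* from the "a.out[+0x73b]" entry */

	/* prints the absolute address to pass to: addr2line -fie a.out */
	printf("0x%lx\n", load_base + offset);
	return 0;
}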

Signed-off-by: Török Edwin <[email protected]>
---
kernel/trace/trace.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/trace/trace.h | 3 ++-
2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4c17453..722ab74 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -213,6 +213,7 @@ static const char *trace_options[] = {
"sched-tree",
"ftrace_printk",
"userstacktrace",
+ "sym-userobj",
NULL
};

@@ -359,6 +360,20 @@ trace_seq_putmem_hex(struct trace_seq *s, void *mem, size_t len)
return trace_seq_putmem(s, hex, j);
}

+static int
+trace_seq_path(struct trace_seq *s, struct path *path)
+{
+ int ret;
+ struct seq_file m;
+ m.count = s->len;
+ m.size = PAGE_SIZE;
+ m.buf = s->buffer;
+ ret = seq_path(&m, path, "\n");
+ if (ret)
+ s->len = m.count;
+ return ret;
+}
+
static void
trace_seq_reset(struct trace_seq *s)
{
@@ -655,6 +670,7 @@ tracing_generic_entry_update(struct trace_entry *entry, unsigned long flags,

entry->preempt_count = pc & 0xff;
entry->pid = (tsk) ? tsk->pid : 0;
+ entry->tgid = (tsk) ? tsk->tgid : 0;
entry->flags =
(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |
((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |
@@ -1179,11 +1195,26 @@ static int
seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
unsigned long sym_flags)
{
+ struct mm_struct *mm = NULL;
int ret = 1;
unsigned i;

+ if (trace_flags & TRACE_ITER_SYM_USEROBJ) {
+ struct task_struct *task;
+ /* we do the lookup on the thread group leader,
+ * since individual threads might have already quit! */
+ rcu_read_lock();
+ task = find_task_by_vpid(entry->field.tgid);
+ rcu_read_unlock();
+
+ if (task)
+ mm = get_task_mm(task);
+ }
+
for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
unsigned long ip = entry->field.stack.caller[i];
+ unsigned long vmstart = 0;
+ struct file *file = NULL;

if (ip == ULONG_MAX || !ret)
break;
@@ -1193,10 +1224,25 @@ seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
ret = trace_seq_puts(s, "??");
continue;
}
- if (sym_flags & TRACE_ITER_SYM_ADDR)
+ if (mm) {
+ const struct vm_area_struct *vma = find_vma(mm, ip);
+ if (vma) {
+ file = vma->vm_file;
+ vmstart = vma->vm_start;
+ }
+ }
+ if (file) {
+ ret = trace_seq_path(s, &file->f_path);
+ if (ret)
+ ret = trace_seq_printf(s, "[+0x%lx]",
+ ip - vmstart);
+ }
+ if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
}

+ if (mm)
+ mmput(mm);
return ret;
}

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 3ad6343..59d3b17 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -414,7 +414,8 @@ enum trace_iterator_flags {
TRACE_ITER_STACKTRACE = 0x100,
TRACE_ITER_SCHED_TREE = 0x200,
TRACE_ITER_PRINTK = 0x400,
- TRACE_ITER_USERSTACKTRACE = 0x800
+ TRACE_ITER_USERSTACKTRACE = 0x800,
+ TRACE_ITER_SYM_USEROBJ = 0x1000
};

extern struct tracer nop_trace;
--
1.5.6.5

2008-10-12 13:12:41

by Török Edwin

Subject: [PATCH 3/4] add tracepoints in rwsem

This allows a tracing framework, such as ftrace, to measure the latency
during lock contention and, unlike latencytop, to show who is to blame for
holding the lock for too long.

The tracepoints consist of three events:
- a down_read or down_write failed - a tracer can use this as a point of reference
- the semaphore is granted after a failed down_read or down_write - usable to measure latency
- one writer, or all readers, are woken - a tracer can see who was/is holding the lock,
and blame it for the latencies of all reader/writer tasks on the waiter list
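
As a sketch of how a tracer would attach to one of these events (mirroring
what the semlat tracer in patch 4 does; module boilerplate omitted, and the
probe body left empty):

#include <linux/init.h>
#include <linux/rwsem.h>
#include <trace/rwsem.h>

/* the probe signature must match the TPPROTO of the tracepoint */
static void probe_wake(const struct rw_semaphore *sem,
		       const struct rwsem_waiter *waiter)
{
	/* runs each time __rwsem_do_wake wakes waiters on sem */
}

static int __init example_init(void)
{
	/* returns nonzero on failure, hence the WARN_ON() use in patch 4 */
	return register_trace_rwsem_wake(probe_wake);
}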

Signed-off-by: Török Edwin <[email protected]>
---
include/trace/rwsem.h | 35 +++++++++++++++++++++++++++++++++++
lib/rwsem-spinlock.c | 16 ++++++++--------
lib/rwsem.c | 13 +++++--------
3 files changed, 48 insertions(+), 16 deletions(-)
create mode 100644 include/trace/rwsem.h

diff --git a/include/trace/rwsem.h b/include/trace/rwsem.h
new file mode 100644
index 0000000..8afb773
--- /dev/null
+++ b/include/trace/rwsem.h
@@ -0,0 +1,35 @@
+#ifndef _TRACE_RWSEM_H
+#define _TRACE_RWSEM_H
+
+#include <linux/tracepoint.h>
+
+struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ unsigned int flags;
+#define RWSEM_WAITING_FOR_READ 0x00000001
+#define RWSEM_WAITING_FOR_WRITE 0x00000002
+};
+
+/* some waiters from the waiter list are woken,
+ * some remain sleeping */
+DEFINE_TRACE(rwsem_wake,
+ TPPROTO(const struct rw_semaphore *sem,
+ const struct rwsem_waiter *waiter),
+ TPARGS(sem, waiter));
+
+/* a down_read() failed because a writer already has the lock,
+ * or a down_write() failed because another reader or writer
+ * already has the lock.
+ * tsk indicates this task, and waiter->list contains
+ * all other tasks that wait for this lock */
+DEFINE_TRACE(rwsem_lock_failed,
+ TPPROTO(const struct rw_semaphore *sem),
+ TPARGS(sem));
+
+/* the tsk was granted the lock, after previously failing to
+ * acquire the lock due to contention */
+DEFINE_TRACE(rwsem_lock_ok,
+ TPPROTO(const struct task_struct *tsk),
+ TPARGS(tsk));
+#endif
diff --git a/lib/rwsem-spinlock.c b/lib/rwsem-spinlock.c
index 9df3ca5..9e32ce4 100644
--- a/lib/rwsem-spinlock.c
+++ b/lib/rwsem-spinlock.c
@@ -8,14 +8,7 @@
#include <linux/rwsem.h>
#include <linux/sched.h>
#include <linux/module.h>
-
-struct rwsem_waiter {
- struct list_head list;
- struct task_struct *task;
- unsigned int flags;
-#define RWSEM_WAITING_FOR_READ 0x00000001
-#define RWSEM_WAITING_FOR_WRITE 0x00000002
-};
+#include <trace/rwsem.h>

/*
* initialise the semaphore
@@ -69,6 +62,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
list_del(&waiter->list);
tsk = waiter->task;
/* Don't touch waiter after ->task has been NULLed */
+ trace_rwsem_wake(sem, waiter);
smp_mb();
waiter->task = NULL;
wake_up_process(tsk);
@@ -78,6 +72,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)

/* grant an infinite number of read locks to the front of the queue */
dont_wake_writers:
+ trace_rwsem_wake(sem, waiter);
woken = 0;
while (waiter->flags & RWSEM_WAITING_FOR_READ) {
struct list_head *next = waiter->list.next;
@@ -115,6 +110,7 @@ __rwsem_wake_one_writer(struct rw_semaphore *sem)
list_del(&waiter->list);

tsk = waiter->task;
+ trace_rwsem_wake(sem, waiter);
smp_mb();
waiter->task = NULL;
wake_up_process(tsk);
@@ -152,6 +148,7 @@ void __sched __down_read(struct rw_semaphore *sem)
/* we don't need to touch the semaphore struct anymore */
spin_unlock_irq(&sem->wait_lock);

+ trace_rwsem_lock_failed(sem);
/* wait to be given the lock */
for (;;) {
if (!waiter.task)
@@ -160,6 +157,7 @@ void __sched __down_read(struct rw_semaphore *sem)
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}

+ trace_rwsem_lock_ok(tsk);
tsk->state = TASK_RUNNING;
out:
;
@@ -218,6 +216,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
/* we don't need to touch the semaphore struct anymore */
spin_unlock_irq(&sem->wait_lock);

+ trace_rwsem_lock_failed(sem);
/* wait to be given the lock */
for (;;) {
if (!waiter.task)
@@ -226,6 +225,7 @@ void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}

+ trace_rwsem_lock_ok(tsk);
tsk->state = TASK_RUNNING;
out:
;
diff --git a/lib/rwsem.c b/lib/rwsem.c
index 3e3365e..a9c5571 100644
--- a/lib/rwsem.c
+++ b/lib/rwsem.c
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/init.h>
#include <linux/module.h>
+#include <trace/rwsem.h>

/*
* Initialize an rwsem:
@@ -28,14 +29,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,

EXPORT_SYMBOL(__init_rwsem);

-struct rwsem_waiter {
- struct list_head list;
- struct task_struct *task;
- unsigned int flags;
-#define RWSEM_WAITING_FOR_READ 0x00000001
-#define RWSEM_WAITING_FOR_WRITE 0x00000002
-};
-
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then:
@@ -81,6 +74,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
*/
list_del(&waiter->list);
tsk = waiter->task;
+ trace_rwsem_wake(sem, waiter);
smp_mb();
waiter->task = NULL;
wake_up_process(tsk);
@@ -99,6 +93,7 @@ __rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
* readers before waking any processes up
*/
readers_only:
+ trace_rwsem_wake(sem, waiter);
woken = 0;
do {
woken++;
@@ -171,6 +166,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,

spin_unlock_irq(&sem->wait_lock);

+ trace_rwsem_lock_failed(sem);
/* wait to be given the lock */
for (;;) {
if (!waiter->task)
@@ -179,6 +175,7 @@ rwsem_down_failed_common(struct rw_semaphore *sem,
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}

+ trace_rwsem_lock_ok(tsk);
tsk->state = TASK_RUNNING;

return sem;
--
1.5.6.5

2008-10-12 13:13:40

by Török Edwin

Subject: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

Usage example:
mount -t debugfs nodev /sys/kernel/debug
cd /sys/kernel/debug/tracing
echo userstacktrace >iter_ctrl
echo sched_switch >current_tracer
echo 1 >tracing_enabled
.... run application ...
echo 0 >tracing_enabled

Then read one of 'trace','latency_trace','trace_pipe'

Signed-off-by: Török Edwin <[email protected]>
---
Documentation/ftrace.txt | 5 ++-
arch/x86/kernel/stacktrace.c | 57 +++++++++++++++++++++++++++++++++++
include/linux/stacktrace.h | 8 +++++
kernel/trace/trace.c | 67 ++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.h | 2 +
5 files changed, 138 insertions(+), 1 deletions(-)

diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index d330fe3..f39cb68 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -327,7 +327,7 @@ output. To see what is available, simply cat the file:

cat /debug/tracing/iter_ctrl
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree
+ noblock nostacktrace nosched-tree nouserstacktrace

To disable one of the options, echo in the option prepended with "no".

@@ -381,6 +381,9 @@ Here are the available options:
When a trace is recorded, so is the stack of functions.
This allows for back traces of trace sites.

+ userstacktrace - This option changes the trace.
+ It records a stacktrace of the current userspace thread.
+
sched-tree - TBD (any users??)


diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index d1d850a..000a965 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -6,6 +6,7 @@
#include <linux/sched.h>
#include <linux/stacktrace.h>
#include <linux/module.h>
+#include <linux/uaccess.h>
#include <asm/stacktrace.h>

static void save_stack_warning(void *data, char *msg)
@@ -90,3 +91,59 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
trace->entries[trace->nr_entries++] = ULONG_MAX;
}
EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
+
+/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
+
+struct stack_frame {
+ const void __user *next_fp;
+ unsigned long return_address;
+};
+
+static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
+{
+ int ret;
+
+ if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
+ return 0;
+
+ ret = 1;
+ pagefault_disable();
+ if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
+ ret = 0;
+ pagefault_enable();
+
+ return ret;
+}
+
+void save_stack_trace_user(struct stack_trace *trace)
+{
+ /*
+ * Trace user stack if we are not a kernel thread
+ */
+ if (current->mm) {
+ const struct pt_regs *regs = task_pt_regs(current);
+ const void __user *fp = (const void __user *)regs->bp;
+
+ if (trace->nr_entries < trace->max_entries)
+ trace->entries[trace->nr_entries++] = regs->ip;
+
+ while (trace->nr_entries < trace->max_entries) {
+ struct stack_frame frame;
+ frame.next_fp = NULL;
+ frame.return_address = 0;
+ if (!copy_stack_frame(fp, &frame))
+ break;
+ if ((unsigned long)fp < regs->sp)
+ break;
+ if (frame.return_address)
+ trace->entries[trace->nr_entries++] =
+ frame.return_address;
+ if (fp == frame.next_fp)
+ break;
+ fp = frame.next_fp;
+ }
+ }
+ if (trace->nr_entries < trace->max_entries)
+ trace->entries[trace->nr_entries++] = ULONG_MAX;
+}
+
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 6b8e54a..46704a5 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -18,9 +18,17 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
struct stack_trace *trace);

extern void print_stack_trace(struct stack_trace *trace, int spaces);
+
+#ifdef CONFIG_X86
+extern void save_stack_trace_user(struct stack_trace *trace);
+#else
+# define save_stack_trace_user(trace) do { } while (0)
+#endif
+
#else
# define save_stack_trace(trace) do { } while (0)
# define save_stack_trace_tsk(tsk, trace) do { } while (0)
+# define save_stack_trace_user(trace) do { } while (0)
# define print_stack_trace(trace, spaces) do { } while (0)
#endif

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d345d64..4c17453 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -212,6 +212,7 @@ static const char *trace_options[] = {
"stacktrace",
"sched-tree",
"ftrace_printk",
+ "userstacktrace",
NULL
};

@@ -735,6 +736,28 @@ void __trace_stack(struct trace_array *tr,
ftrace_trace_stack(tr, data, flags, skip, preempt_count());
}

+static void __trace_userstack(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ unsigned long flags)
+{
+ struct trace_entry *entry;
+ struct stack_trace trace;
+
+ if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
+ return;
+
+ entry = tracing_get_trace_entry(tr, data);
+ tracing_generic_entry_update(entry, flags);
+ entry->type = TRACE_USER_STACK;
+
+ memset(&entry->field.stack, 0, sizeof(entry->field.stack));
+ trace.nr_entries = 0;
+ trace.max_entries = FTRACE_STACK_ENTRIES;
+ trace.skip = 0;
+ trace.entries = entry->field.stack.caller;
+ save_stack_trace_user(&trace);
+}
+
static void
ftrace_trace_special(void *__tr, void *__data,
unsigned long arg1, unsigned long arg2, unsigned long arg3,
@@ -758,6 +781,7 @@ ftrace_trace_special(void *__tr, void *__data,
entry->arg3 = arg3;
ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
ftrace_trace_stack(tr, data, irq_flags, 4, pc);
+ __trace_userstack(tr, data, irq_flags);

trace_wake_up();
}
@@ -1151,6 +1175,31 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
return ret;
}

+static int
+seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
+ unsigned long sym_flags)
+{
+ int ret = 1;
+ unsigned i;
+
+ for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
+ unsigned long ip = entry->field.stack.caller[i];
+
+ if (ip == ULONG_MAX || !ret)
+ break;
+ if (i)
+ ret = trace_seq_puts(s, " <- ");
+ if (!ip) {
+ ret = trace_seq_puts(s, "??");
+ continue;
+ }
+ if (sym_flags & TRACE_ITER_SYM_ADDR)
+ ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
+ }
+
+ return ret;
+}
+
static void print_lat_help_header(struct seq_file *m)
{
seq_puts(m, "# _------=> CPU# \n");
@@ -1437,6 +1486,10 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
trace_seq_print_cont(s, iter);
break;
}
+ case TRACE_USER_STACK:
+ seq_print_userip_objs(entry, s, sym_flags);
+ trace_seq_putc(s, '\n');
+ break;
default:
trace_seq_printf(s, "Unknown type %d\n", entry->type);
}
@@ -1572,6 +1625,14 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
if (entry->flags & TRACE_FLAG_CONT)
trace_seq_print_cont(s, iter);
break;
+ case TRACE_USER_STACK:
+ ret = seq_print_userip_objs(entry, s, sym_flags);
+ if (!ret)
+ return 0;
+ ret = trace_seq_putc(s, '\n');
+ if (!ret)
+ return 0;
+ break;
}
}
return TRACE_TYPE_HANDLED;
@@ -1632,6 +1693,7 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -1720,6 +1782,7 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -1774,6 +1837,7 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
break;
}
case TRACE_SPECIAL:
+ case TRACE_USER_STACK:
case TRACE_STACK: {
struct special_entry *field;

@@ -3103,6 +3167,9 @@ void ftrace_dump(void)
atomic_inc(&global_trace.data[cpu]->disabled);
}

+ /* don't look at user memory in panic mode */
+ trace_flags &= ~TRACE_ITER_SYM_USEROBJ;
+
printk(KERN_TRACE "Dumping ftrace buffer:\n");

iter.tr = &global_trace;
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f1f9957..3ad6343 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -22,6 +22,7 @@ enum trace_type {
TRACE_MMIO_RW,
TRACE_MMIO_MAP,
TRACE_BOOT,
+ TRACE_USER_STACK,

__TRACE_LAST_TYPE
};
@@ -413,6 +414,7 @@ enum trace_iterator_flags {
TRACE_ITER_STACKTRACE = 0x100,
TRACE_ITER_SCHED_TREE = 0x200,
TRACE_ITER_PRINTK = 0x400,
+ TRACE_ITER_USERSTACKTRACE = 0x800
};

extern struct tracer nop_trace;
--
1.5.6.5

2008-10-12 13:13:56

by Török Edwin

Subject: [PATCH 4/4] Implement semaphore latency tracer

Each time a down_read or down_write fails, a unique latency id is generated.
Later, when someone releases the semaphore, it is blamed for the latency of all
tasks on the wait_list of the semaphore.
If you group the output from latency_trace by latency_id, you get all the
tasks that were contending on the lock and the tasks that were holding it.
An entry in latency_trace has the format:
(latency_id) [semaphore_id] read|write
stacktrace <= stacktrace
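
For illustration only (all values invented), two records sharing latency id
42 might look like this; the first line is printed for the task that
suffered the latency, the id-only line for another task blamed for it:

 (42) [ffffffff80612345] read firefox
 <user stack> <= <kernel stack>
 (42)
 <user stack> <= <kernel stack>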

Signed-off-by: Török Edwin <[email protected]>
---
kernel/trace/Kconfig | 12 +++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 105 +++++++++++++++++++++----
kernel/trace/trace.h | 32 ++++++++
kernel/trace/trace_semlat.c | 186 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 321 insertions(+), 15 deletions(-)
create mode 100644 kernel/trace/trace_semlat.c

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 396aea1..8faac14 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -144,6 +144,18 @@ config STACK_TRACER
This tracer records the max stack of the kernel, and displays
it in debugfs/tracing/stack_trace

+config SEMLAT_TRACER
+ bool "Semaphore latency tracer"
+ depends on HAVE_FTRACE
+ depends on DEBUG_KERNEL
+ select TRACING
+ select TRACEPOINTS
+ select TRACER_MAX_TRACE
+ help
+ This tracer records the reason for semaphore contention, showing
+ who is holding a lock for too long and causing other tasks to contend
+ on the semaphore. Currently it instruments rwsem and rwsem-spinlock with tracepoints.
+
config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FTRACE
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index a85dfba..d037df3 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -24,5 +24,6 @@ obj-$(CONFIG_NOP_TRACER) += trace_nop.o
obj-$(CONFIG_STACK_TRACER) += trace_stack.o
obj-$(CONFIG_MMIOTRACE) += trace_mmiotrace.o
obj-$(CONFIG_BOOT_TRACER) += trace_boot.o
+obj-$(CONFIG_SEMLAT_TRACER) += trace_semlat.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 722ab74..bd9145a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -752,26 +752,42 @@ void __trace_stack(struct trace_array *tr,
ftrace_trace_stack(tr, data, flags, skip, preempt_count());
}

-static void __trace_userstack(struct trace_array *tr,
+static void ftrace_trace_userstack(struct trace_array *tr,
struct trace_array_cpu *data,
- unsigned long flags)
+ unsigned long flags, int pc)
{
- struct trace_entry *entry;
+ struct userstack_entry *entry;
struct stack_trace trace;
+ struct ring_buffer_event *event;
+ unsigned long irq_flags;

if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
return;

- entry = tracing_get_trace_entry(tr, data);
- tracing_generic_entry_update(entry, flags);
- entry->type = TRACE_USER_STACK;
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+ &irq_flags);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, flags, pc);
+ entry->ent.type = TRACE_USER_STACK;
+
+ memset(&entry->caller, 0, sizeof(entry->caller));

- memset(&entry->field.stack, 0, sizeof(entry->field.stack));
trace.nr_entries = 0;
trace.max_entries = FTRACE_STACK_ENTRIES;
trace.skip = 0;
- trace.entries = entry->field.stack.caller;
+ trace.entries = entry->caller;
+
save_stack_trace_user(&trace);
+ ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+}
+
+void __trace_userstack(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ unsigned long flags)
+{
+ ftrace_trace_userstack(tr, data, flags, preempt_count());
}

static void
@@ -810,6 +826,36 @@ __trace_special(void *__tr, void *__data,
}

void
+trace_semlat(struct trace_array *tr, struct trace_array_cpu *data,
+ const struct semlat_data *sdata, int samestack)
+{
+ struct ring_buffer_event *event;
+ struct semlat_entry *entry;
+ unsigned long irq_flags;
+ int pc = preempt_count();
+
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+ &irq_flags);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, 0, pc);
+ entry->ent.type = TRACE_SEM;
+ entry->data = *sdata;
+ entry->data.pid = current->pid;
+ ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+ if (!samestack) {
+ /* show a stack trace only if different from previous */
+ __trace_stack(tr, data, irq_flags, 3);
+ __trace_userstack(tr, data, irq_flags);
+ }
+ ftrace_trace_stack(tr, data, irq_flags, 4, pc);
+ __trace_userstack(tr, data, irq_flags);
+
+ trace_wake_up();
+}
+
+void
tracing_sched_switch_trace(struct trace_array *tr,
struct trace_array_cpu *data,
struct task_struct *prev,
@@ -1192,7 +1238,7 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
}

static int
-seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
+seq_print_userip_objs(const struct userstack_entry *entry, struct trace_seq *s,
unsigned long sym_flags)
{
struct mm_struct *mm = NULL;
@@ -1204,7 +1250,7 @@ seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
/* we do the lookup on the thread group leader,
* since individual threads might have already quit! */
rcu_read_lock();
- task = find_task_by_vpid(entry->field.tgid);
+ task = find_task_by_vpid(entry->ent.tgid);
rcu_read_unlock();

if (task)
@@ -1212,7 +1258,7 @@ seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
}

for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
- unsigned long ip = entry->field.stack.caller[i];
+ unsigned long ip = entry->caller[i];
unsigned long vmstart = 0;
struct file *file = NULL;

@@ -1532,10 +1578,34 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
trace_seq_print_cont(s, iter);
break;
}
- case TRACE_USER_STACK:
- seq_print_userip_objs(entry, s, sym_flags);
+ case TRACE_USER_STACK: {
+ struct userstack_entry *field;
+
+ trace_assign_type(field, entry);
+
+ seq_print_userip_objs(field, s, sym_flags);
trace_seq_putc(s, '\n');
break;
+ }
+ case TRACE_SEM: {
+ struct semlat_entry *field;
+
+ trace_assign_type(field, entry);
+
+ if (field->data.pid == field->ent.pid) {
+ /* task that suffered the latency */
+ comm = trace_find_cmdline(field->data.pid);
+ trace_seq_printf(s, " (%u) %s["IP_FMT"] %s %s\n",
+ field->data.latency_id,
+ field->data.name ? field->data.name : "",
+ field->data.semaphore,
+ field->data.kind_read ? "read" : "write",
+ comm);
+ } else
+ trace_seq_printf(s, " (%u)\n",
+ field->data.latency_id);
+ break;
+ }
default:
trace_seq_printf(s, "Unknown type %d\n", entry->type);
}
@@ -1671,8 +1741,13 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
if (entry->flags & TRACE_FLAG_CONT)
trace_seq_print_cont(s, iter);
break;
- case TRACE_USER_STACK:
- ret = seq_print_userip_objs(entry, s, sym_flags);
+ }
+ case TRACE_USER_STACK: {
+ struct userstack_entry *field;
+
+ trace_assign_type(field, entry);
+
+ ret = seq_print_userip_objs(field, s, sym_flags);
if (!ret)
return 0;
ret = trace_seq_putc(s, '\n');
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 59d3b17..d734561 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -23,6 +23,7 @@ enum trace_type {
TRACE_MMIO_MAP,
TRACE_BOOT,
TRACE_USER_STACK,
+ TRACE_SEM,

__TRACE_LAST_TYPE
};
@@ -39,6 +40,7 @@ struct trace_entry {
unsigned char flags;
unsigned char preempt_count;
int pid;
+ int tgid;
};

/*
@@ -86,6 +88,11 @@ struct stack_entry {
unsigned long caller[FTRACE_STACK_ENTRIES];
};

+struct userstack_entry {
+ struct trace_entry ent;
+ unsigned long caller[FTRACE_STACK_ENTRIES];
+};
+
/*
* ftrace_printk entry:
*/
@@ -119,6 +126,25 @@ struct trace_boot {
};

/*
+ * semaphore-latency entry
+ */
+
+struct semlat_data {
+ const void *semaphore;
+ const char *name;
+ unsigned int latency_id; /* a unique id for this latency instance */
+ unsigned int kind_read;
+ unsigned int pid;/* pid of task that suffered the latency */
+ cycle_t t;/* start of latency - lock contention */
+ int cpu;
+};
+
+struct semlat_entry {
+ struct trace_entry ent;
+ struct semlat_data data;
+};
+
+/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
* IRQS_OFF - interrupts were disabled
@@ -211,6 +237,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct ctx_switch_entry, 0); \
IF_ASSIGN(var, ent, struct trace_field_cont, TRACE_CONT); \
IF_ASSIGN(var, ent, struct stack_entry, TRACE_STACK); \
+ IF_ASSIGN(var, ent, struct userstack_entry, TRACE_USER_STACK);\
IF_ASSIGN(var, ent, struct print_entry, TRACE_PRINT); \
IF_ASSIGN(var, ent, struct special_entry, 0); \
IF_ASSIGN(var, ent, struct trace_mmiotrace_rw, \
@@ -218,6 +245,7 @@ extern void __ftrace_bad_type(void);
IF_ASSIGN(var, ent, struct trace_mmiotrace_map, \
TRACE_MMIO_MAP); \
IF_ASSIGN(var, ent, struct trace_boot, TRACE_BOOT); \
+ IF_ASSIGN(var, ent, struct semlat_entry, TRACE_SEM); \
__ftrace_bad_type(); \
} while (0)

@@ -314,6 +342,10 @@ void trace_special(struct trace_array *tr,
unsigned long arg1,
unsigned long arg2,
unsigned long arg3, int pc);
+
+void trace_semlat(struct trace_array *tr, struct trace_array_cpu *data,
+ const struct semlat_data *sdata, int samestack);
+
void trace_function(struct trace_array *tr,
struct trace_array_cpu *data,
unsigned long ip,
diff --git a/kernel/trace/trace_semlat.c b/kernel/trace/trace_semlat.c
new file mode 100644
index 0000000..d72e3e3
--- /dev/null
+++ b/kernel/trace/trace_semlat.c
@@ -0,0 +1,186 @@
+/*
+ * Semaphore latency tracer
+ *
+ * Copyright (C) 2008 Török Edwin <[email protected]>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
+#include <linux/ftrace.h>
+#include <trace/rwsem.h>
+#include "trace.h"
+
+static struct trace_array *ctx_trace;
+static int __read_mostly tracer_enabled;
+
+static struct semlat_data map_pid_to_latency[PID_MAX_DEFAULT+1];
+
+static inline int report_latency(unsigned long delta)
+{
+ if (tracing_thresh) {
+ if (delta < tracing_thresh)
+ return 0;
+ } else {
+ if (delta <= tracing_max_latency)
+ return 0;
+ }
+ return 1;
+}
+
+static inline void probe_rwsem_wake(const struct rw_semaphore *sem,
+ const struct rwsem_waiter *waiter)
+{
+ /* first task in waitlist is oldest, so calculate delta against it */
+ const struct semlat_data *data;
+ struct trace_array_cpu *cpudata;
+ unsigned long T0, T1, delta;
+ int cpu, first;
+
+ if (!tracer_enabled)
+ return;
+ data = &map_pid_to_latency[waiter->task->pid];
+
+ if (data->cpu == -1)
+ return;
+ T0 = data->t;
+ T1 = ftrace_now(data->cpu);
+ delta = T1-T0;
+
+ if (!report_latency(delta))
+ return;
+ cpu = raw_smp_processor_id();
+ cpudata = ctx_trace->data[cpu];
+ /* blame the current task for latency of all tasks on wait list
+ * note that some of those are being woken now, but that doesn't matter
+ * we want to blame this task for their latency too */
+ if (list_empty(&waiter->list)) {
+ /* waiter is not part of wait_list, so process it here */
+ trace_semlat(ctx_trace, cpudata, data, 1);
+ first = 0;
+ } else {
+ first = 1;
+ }
+ list_for_each_entry(waiter, &sem->wait_list, list) {
+ data = &map_pid_to_latency[waiter->task->pid];
+ if (data->cpu != -1)
+ trace_semlat(ctx_trace, cpudata, data, first);
+ first = 0;
+ }
+}
+
+static atomic_t unique_latency_id = ATOMIC_INIT(0);
+
+static void probe_rwsem_lock_failed(const struct rw_semaphore *sem)
+{
+ const struct rwsem_waiter *waiter;
+ struct semlat_data *data;
+ int cpu = raw_smp_processor_id();
+
+ if (!tracer_enabled)
+ return;
+ waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
+ data = &map_pid_to_latency[waiter->task->pid];
+ data->semaphore = sem;
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ data->name = sem->dep_map.name;
+#else
+ data->name = NULL;
+#endif
+
+ data->latency_id = atomic_inc_return(&unique_latency_id);
+ data->kind_read = waiter->flags == RWSEM_WAITING_FOR_READ;
+ data->t = ftrace_now(cpu);
+ data->cpu = cpu;
+}
+
+static void probe_rwsem_lock_ok(const struct task_struct *tsk)
+{
+ struct semlat_data *data = &map_pid_to_latency[tsk->pid];
+ struct trace_array_cpu *cpudata;
+ int cpu;
+ cpu = raw_smp_processor_id();
+ cpudata = ctx_trace->data[cpu];
+
+ if (!tracer_enabled)
+ return;
+ trace_semlat(ctx_trace, cpudata, data, 1);
+ /* invalidate entry */
+ data->cpu = -1;
+}
+
+static void start_semlat_trace(struct trace_array *tr)
+{
+ int ret;
+ ret = register_trace_rwsem_wake(probe_rwsem_wake);
+ WARN_ON(ret);
+
+ ret = register_trace_rwsem_lock_failed(probe_rwsem_lock_failed);
+ WARN_ON(ret);
+
+ ret = register_trace_rwsem_lock_ok(probe_rwsem_lock_ok);
+ WARN_ON(ret);
+ printk(KERN_INFO "registered semlat\n");
+ tracer_enabled = 1;
+}
+
+static void stop_semlat_trace(struct trace_array *tr)
+{
+ tracer_enabled = 0;
+ unregister_trace_rwsem_wake(probe_rwsem_wake);
+ unregister_trace_rwsem_lock_failed(probe_rwsem_lock_failed);
+ printk(KERN_INFO "unregistered semlat\n");
+}
+
+static void semlat_trace_init(struct trace_array *tr)
+{
+ unsigned int i;
+ int cpu;
+
+ ctx_trace = tr;
+
+ atomic_set(&unique_latency_id, 0);
+ for (i = 0; i <= PID_MAX_DEFAULT; i++)
+ map_pid_to_latency[i].cpu = -1;
+
+ for_each_online_cpu(cpu)
+ tracing_reset(tr, cpu);
+
+ if (tr->ctrl)
+ start_semlat_trace(tr);
+}
+
+static void semlat_trace_reset(struct trace_array *tr)
+{
+
+ if (tr->ctrl)
+ stop_semlat_trace(tr);
+}
+
+static void semlat_trace_ctrl_update(struct trace_array *tr)
+{
+ /* When starting a new trace, reset the buffers */
+ if (tr->ctrl)
+ start_semlat_trace(tr);
+ else
+ stop_semlat_trace(tr);
+}
+
+static struct tracer semlat_trace __read_mostly =
+{
+ .name = "semlat",
+ .init = semlat_trace_init,
+ .reset = semlat_trace_reset,
+ .ctrl_update = semlat_trace_ctrl_update,
+#ifdef CONFIG_FTRACE_SELFTEST
+/* .selftest = trace_selftest_startup_semlat,*/
+#endif
+};
+
+__init static int init_semlat_trace(void)
+{
+ return register_tracer(&semlat_trace);
+}
+device_initcall(init_semlat_trace);
--
1.5.6.5

2008-10-12 13:32:19

by Frederic Weisbecker

Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

2008/10/12 Török Edwin <[email protected]>:
> +static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
> +{
> + int ret;
> +
> + if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
> + return 0;
> +
> + ret = 1;
> + pagefault_disable();
> + if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
> + ret = 0;
> + pagefault_enable();
> +
> + return ret;
> +}
> +

Hello Török,

I don't understand why you are using this pagefault disabling and
atomic copy here.
Wouldn't a simple copy_from_user actually take care of this? You don't
seem to be in atomic context.

2008-10-12 13:53:28

by Török Edwin

Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

On 2008-10-12 16:31, Frédéric Weisbecker wrote:
> 2008/10/12 Török Edwin <[email protected]>:
>
>> +static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
>> +{
>> + int ret;
>> +
>> + if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
>> + return 0;
>> +
>> + ret = 1;
>> + pagefault_disable();
>> + if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
>> + ret = 0;
>> + pagefault_enable();
>> +
>> + return ret;
>> +}
>> +
>>
>
> Hello Török,
>
> I don't understand why you are using this pagefault disabling and
> atomic copy here.
> Wouldn't a simple copy_from_user actually take care of this? You don't
> seem to be in atomic context.
>

Hello,

This function is called from ftrace, so I could be holding a spinlock.
It can also be called from __rwsem_do_wake, which takes a spinlock.
This code is from sysprof's copy_stack_frame. Maybe pagefault_disable()
is not needed.
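
For reference, here is the same helper annotated with the reasoning (a
sketch, assuming the probe really can run in atomic context as described
above):

#include <linux/uaccess.h>

struct stack_frame {
	const void __user *next_fp;
	unsigned long return_address;
};

static int copy_stack_frame_sketch(const void __user *fp,
				   struct stack_frame *frame)
{
	int ret = 1;

	if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
		return 0;

	/*
	 * A plain copy_from_user() may fault in the page and sleep, which
	 * is illegal while holding a spinlock (e.g. sem->wait_lock taken
	 * by __rwsem_do_wake).  With pagefaults disabled, the inatomic
	 * copy fails fast instead of sleeping.
	 */
	pagefault_disable();
	if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
		ret = 0;
	pagefault_enable();

	return ret;
}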

Best regards,
--Edwin

2008-10-12 18:25:35

by Steven Rostedt

Subject: Re: [PATCH 0/4] ftrace: add userspace stacktrace support and semaphore-latency tracer

Török Edwin wrote:
> Hi,
>
> This patch series is a draft implementation of a semaphore-latency tracer.
> I would appreciate feedback on these patches so that they can be merged into
> mainline.
>
Hi!

You caught me at a bad time. I'm just about to leave for NYC for the
Linux End Users conference. I'll be back on Wednesday. Please send any
new emails to [email protected]. I will not be checking my Red Hat
account when I'm away.

I'll try to look at the patches this week, but I can't promise anything.
I'll forward your patches to my goodmis account, and if I have time,
I'll look at them.

-- Steve

2008-10-12 19:13:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/4] Implement semaphore latency tracer

On Sun, 2008-10-12 at 16:12 +0300, Török Edwin wrote:
> Each time a down_read or down_write fails, a unique latency id is generated.
> Later, when someone releases the semaphore, it is blamed for the latency of all
> tasks on the wait_list of the semaphore.
> If you group the output from latency_trace by latency_id, you get all the
> tasks that were contending on the lock and the tasks that were holding it.
> An entry in latency_trace has the format:
> (latency_id) [semaphore_id] read|write
> stacktrace <= stacktrace

What can this tracer do that latencytop cannot already do?

2008-10-12 20:10:52

by Török Edwin

Subject: Re: [PATCH 4/4] Implement semaphore latency tracer

On 2008-10-12 22:13, Peter Zijlstra wrote:
> On Sun, 2008-10-12 at 16:12 +0300, Török Edwin wrote:
>
>> Each time a down_read or down_write fails, a unique latency id is generated.
>> Later, when someone releases the semaphore, it is blamed for the latency of all
>> tasks on the wait_list of the semaphore.
>> If you group the output from latency_trace by latency_id, you get all the
>> tasks that were contending on the lock and the tasks that were holding it.
>> An entry in latency_trace has the format:
>> (latency_id) [semaphore_id] read|write
>> stacktrace <= stacktrace
>>
>
> What can this tracer do that latencytop cannot already do?

Latencytop can show latencies in down_read or down_write (and is very
useful for that), but it cannot show who else was holding the semaphore,
i.e. the true cause of the latency.

Consider: process A holds a semaphore for reading, and process B tries to
acquire it for writing and fails. Latencytop shows the latency in process
B, but shows nothing about process A.

The semlat tracer does something closer to lockstat, but it doesn't need
lockdep, and it adds tracepoints on the slowpath only (lock failed,
wakeup).

Best regards,
--Edwin

2008-10-13 08:02:59

by Frederic Weisbecker

Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

2008/10/12 Török Edwin <[email protected]>:
> Hello,
>
> This function is called from ftrace, as such I could be holding a
> spinlock. It can also be called from
> __rwsem_do_wake, which takes a spinlock.
> This code is from sysprof's copy_stack_frame. Maybe pagefault_disable()
> is not needed.
>
> Best regards,
> --Edwin

Ok, I didn't see the context that way.

2008-10-22 15:29:04

by Ingo Molnar

Subject: Re: [PATCH 4/4] Implement semaphore latency tracer


* Török Edwin <[email protected]> wrote:

> On 2008-10-12 22:13, Peter Zijlstra wrote:
> > On Sun, 2008-10-12 at 16:12 +0300, Török Edwin wrote:
> >
> >> Each time a down_read or down_write fails, a unique latency id is generated.
> >> Later, when someone releases the semaphore, it is blamed for the latency of all
> >> tasks on the wait_list of the semaphore.
> >> If you group the output from latency_trace by latency_id, you get all the
> >> tasks that were contending on the lock and the tasks that were holding it.
> >> An entry in latency_trace has the format:
> >> (latency_id) [semaphore_id] read|write
> >> stacktrace <= stacktrace
> >>
> >
> > What can this tracer do that latencytop cannot already do?
>
> Latencytop can show latencies in down_read or down_write (and is very
> useful for that), but it cannot show who else was holding the semaphore,
> i.e. the true cause of the latency.
>
> Consider: process A holds a semaphore for reading, and process B tries to
> acquire it for writing and fails. Latencytop shows the latency in process
> B, but shows nothing about process A.
>
> The semlat tracer does something closer to lockstat, but it doesn't need
> lockdep, and it adds tracepoints on the slowpath only (lock failed,
> wakeup).

hm, but the most common synchronization primitives are mutexes - and
those are not covered by your patchset.

Ingo

2008-10-22 15:42:12

by Török Edwin

Subject: Re: [PATCH 4/4] Implement semaphore latency tracer

On 2008-10-22 18:28, Ingo Molnar wrote:
> hm, but the most common synchronization primitives are mutexes - and
> those are not covered by your patchset.
>

Indeed. I've seen a patch from Jason Baron to introduce tracepoints for
mutexes, but the conclusion was that the tracepoints
should be in lockstat instead.

And if lockstat is enabled, Peter Zijlstra's 'contend with points' patch
seems to do exactly what I want.

However, I think it would be useful to have a (tracepoint-based?)
latency tracker, which can be enabled/disabled at runtime,
and which doesn't add any data to the mutex/semaphore structures.
My patchset was a first attempt towards that, but it seems that such use
of tracepoints is not welcome at this time?

Please tell me if I should continue working on this, or if my patches
are designed in entirely the wrong way.

Best regards,
--Edwin

2008-10-22 15:49:17

by Ingo Molnar

Subject: Re: [PATCH 4/4] Implement semaphore latency tracer


* Török Edwin <[email protected]> wrote:

> On 2008-10-22 18:28, Ingo Molnar wrote:
> > hm, but the most common synchronization primitives are mutexes - and
> > those are not covered by your patchset.
> >
>
> Indeed. I've seen a patch from Jason Baron to introduce tracepoints
> for mutexes, but the conclusion was that the tracepoints should be in
> lockstat instead.
>
> And if lockstat is enabled, Peter Zijlstra's 'contend with points'
> patch seems to do exactly what I want.
>
> However, I think it would be useful to have a (tracepoint-based?)
> latency tracker, which can be enabled/disabled at runtime, and which
> doesn't add any data to the mutex/semaphore structures. My patchset
> was a first attempt towards that, but it seems that such use of
> tracepoints is not welcome at this time?
>
> Please tell me if I should continue working on this, or if my
> patches are designed in entirely the wrong way.

i think if you hook into Peter's lockstat APIs that should give us a
pretty good tracer, with no ugliness introduced. That would be rather
interesting. Peter, do you concur?

Ingo

2008-10-22 17:22:38

by Peter Zijlstra

Subject: Re: [PATCH 4/4] Implement semaphore latency tracer

On Wed, 2008-10-22 at 17:48 +0200, Ingo Molnar wrote:
> * Török Edwin <[email protected]> wrote:
>
> > On 2008-10-22 18:28, Ingo Molnar wrote:
> > > hm, but the most common synchronization primitives are mutexes - and
> > > those are not covered by your patchset.
> > >
> >
> > Indeed. I've seen a patch from Jason Baron to introduce tracepoints
> > for mutexes, but the conclusion was that the tracepoints should be in
> > lockstat instead.
> >
> > And if lockstat is enabled, Peter Zijlstra's 'contend with points'
> > patch seems to do exactly what I want.
> >
> > However, I think it would be useful to have a (tracepoint-based?)
> > latency tracker, which can be enabled/disabled at runtime, and which
> > doesn't add any data to the mutex/semaphore structures. My patchset
> > was a first attempt towards that, but it seems that such use of
> > tracepoints is not welcome at this time?
> >
> > Please tell me if I should continue working on this, or if my
> > patches are designed in entirely the wrong way.
>
> i think if you hook into Peter's lockstat APIs that should give us a
> pretty good tracer, with no ugliness introduced. That would be rather
> interesting. Peter, do you concur?

Yes, I've already suggested this. Use the exact same hooks that
lockdep/lockstat use.

2008-10-22 17:25:47

by Török Edwin

Subject: Re: [PATCH 4/4] Implement semaphore latency tracer

On 2008-10-22 20:22, Peter Zijlstra wrote:
> On Wed, 2008-10-22 at 17:48 +0200, Ingo Molnar wrote:
>
>> * Török Edwin <[email protected]> wrote:
>>
>>
>>> On 2008-10-22 18:28, Ingo Molnar wrote:
>>>
>>>> hm, but the most common synchronization primitives are mutexes - and
>>>> those are not covered by your patchset.
>>>>
>>>>
>>> Indeed. I've seen a patch from Jason Baron to introduce tracepoints
>>> for mutexes, but the conclusion was that the tracepoints should be in
>>> lockstat instead.
>>>
>>> And if lockstat is enabled, Peter Zijlstra's 'contend with points'
>>> patch seems to do exactly what I want.
>>>
>>> However, I think it would be useful to have a (tracepoint-based?)
>>> latency tracker, which can be enabled/disabled at runtime, and which
>>> doesn't add any data to the mutex/semaphore structures. My patchset
>>> was a first attempt towards that, but it seems that such use of
>>> tracepoints is not welcome at this time?
>>>
>>> Please tell me if I should continue working on this, or if my
>>> patches are designed in entirely the wrong way.
>>>
>> i think if you hook into Peter's lockstat APIs that should give us a
>> pretty good tracer, with no ugliness introduced. That would be rather
>> interesting. Peter, do you concur?
>>
>
> Yes, I've already suggested this. Use the exact same hooks that
> lockdep/lockstat use.

Ok, I'll work on this when I get some time :)
[hopefully I'll test the 'contend with points' patch this weekend, and
write the new tracepoints next weekend]

Best regards,
--Edwin

2008-10-26 04:05:59

by Frederic Weisbecker

Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

2008/10/12 Török Edwin <[email protected]>:
> Usage example:
> mount -t debugfs nodev /sys/kernel/debug
> cd /sys/kernel/debug/tracing
> echo userstacktrace >iter_ctrl
> echo sched_switch >current_tracer
> echo 1 >tracing_enabled
> .... run application ...
> echo 0 >tracing_enabled
>
> Then read one of 'trace','latency_trace','trace_pipe'
>
> Signed-off-by: T?r?k Edwin <[email protected]>
> ---
> Documentation/ftrace.txt | 5 ++-
> arch/x86/kernel/stacktrace.c | 57 +++++++++++++++++++++++++++++++++++
> include/linux/stacktrace.h | 8 +++++
> kernel/trace/trace.c | 67 ++++++++++++++++++++++++++++++++++++++++++
> kernel/trace/trace.h | 2 +
> 5 files changed, 138 insertions(+), 1 deletions(-)
>
> diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
> index d330fe3..f39cb68 100644
> --- a/Documentation/ftrace.txt
> +++ b/Documentation/ftrace.txt
> @@ -327,7 +327,7 @@ output. To see what is available, simply cat the file:
>
> cat /debug/tracing/iter_ctrl
> print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
> - noblock nostacktrace nosched-tree
> + noblock nostacktrace nosched-tree nouserstacktrace
>
> To disable one of the options, echo in the option prepended with "no".
>
> @@ -381,6 +381,9 @@ Here are the available options:
> When a trace is recorded, so is the stack of functions.
> This allows for back traces of trace sites.
>
> + userstacktrace - This option changes the trace.
> + It records a stacktrace of the current userspace thread.
> +
> sched-tree - TBD (any users??)
>
>
> diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
> index d1d850a..000a965 100644
> --- a/arch/x86/kernel/stacktrace.c
> +++ b/arch/x86/kernel/stacktrace.c
> @@ -6,6 +6,7 @@
> #include <linux/sched.h>
> #include <linux/stacktrace.h>
> #include <linux/module.h>
> +#include <linux/uaccess.h>
> #include <asm/stacktrace.h>
>
> static void save_stack_warning(void *data, char *msg)
> @@ -90,3 +91,59 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
> trace->entries[trace->nr_entries++] = ULONG_MAX;
> }
> EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
> +
> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
> +
> +struct stack_frame {
> + const void __user *next_fp;
> + unsigned long return_address;
> +};
> +
> +static int copy_stack_frame(const void __user *fp, struct stack_frame *frame)
> +{
> + int ret;
> +
> + if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
> + return 0;
> +
> + ret = 1;
> + pagefault_disable();
> + if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
> + ret = 0;
> + pagefault_enable();
> +
> + return ret;
> +}
> +
> +void save_stack_trace_user(struct stack_trace *trace)
> +{
> + /*
> + * Trace user stack if we are not a kernel thread
> + */
> + if (current->mm) {
> + const struct pt_regs *regs = task_pt_regs(current);
> + const void __user *fp = (const void __user *)regs->bp;
> +
> + if (trace->nr_entries < trace->max_entries)
> + trace->entries[trace->nr_entries++] = regs->ip;
> +
> + while (trace->nr_entries < trace->max_entries) {
> + struct stack_frame frame;
> + frame.next_fp = NULL;
> + frame.return_address = 0;
> + if (!copy_stack_frame(fp, &frame))
> + break;
> + if ((unsigned long)fp < regs->sp)
> + break;
> + if (frame.return_address)
> + trace->entries[trace->nr_entries++] =
> + frame.return_address;
> + if (fp == frame.next_fp)
> + break;
> + fp = frame.next_fp;
> + }
> + }
> + if (trace->nr_entries < trace->max_entries)
> + trace->entries[trace->nr_entries++] = ULONG_MAX;
> +}
> +
> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
> index 6b8e54a..46704a5 100644
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -18,9 +18,17 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
> struct stack_trace *trace);
>
> extern void print_stack_trace(struct stack_trace *trace, int spaces);
> +
> +#ifdef CONFIG_X86
> +extern void save_stack_trace_user(struct stack_trace *trace);
> +#else
> +# define save_stack_trace_user(trace) do { } while (0)
> +#endif
> +
> #else
> # define save_stack_trace(trace) do { } while (0)
> # define save_stack_trace_tsk(tsk, trace) do { } while (0)
> +# define save_stack_trace_user(trace) do { } while (0)
> # define print_stack_trace(trace, spaces) do { } while (0)
> #endif
>
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index d345d64..4c17453 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -212,6 +212,7 @@ static const char *trace_options[] = {
> "stacktrace",
> "sched-tree",
> "ftrace_printk",
> + "userstacktrace",
> NULL
> };
>
> @@ -735,6 +736,28 @@ void __trace_stack(struct trace_array *tr,
> ftrace_trace_stack(tr, data, flags, skip, preempt_count());
> }
>
> +static void __trace_userstack(struct trace_array *tr,
> + struct trace_array_cpu *data,
> + unsigned long flags)
> +{
> + struct trace_entry *entry;
> + struct stack_trace trace;
> +
> + if (!(trace_flags & TRACE_ITER_USERSTACKTRACE))
> + return;
> +
> + entry = tracing_get_trace_entry(tr, data);
> + tracing_generic_entry_update(entry, flags);
> + entry->type = TRACE_USER_STACK;
> +
> + memset(&entry->field.stack, 0, sizeof(entry->field.stack));
> + trace.nr_entries = 0;
> + trace.max_entries = FTRACE_STACK_ENTRIES;
> + trace.skip = 0;
> + trace.entries = entry->field.stack.caller;
> + save_stack_trace_user(&trace);
> +}
> +
> static void
> ftrace_trace_special(void *__tr, void *__data,
> unsigned long arg1, unsigned long arg2, unsigned long arg3,
> @@ -758,6 +781,7 @@ ftrace_trace_special(void *__tr, void *__data,
> entry->arg3 = arg3;
> ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
> ftrace_trace_stack(tr, data, irq_flags, 4, pc);
> + __trace_userstack(tr, data, irq_flags);
>
> trace_wake_up();
> }
> @@ -1151,6 +1175,31 @@ seq_print_ip_sym(struct trace_seq *s, unsigned long ip, unsigned long sym_flags)
> return ret;
> }
>
> +static int
> +seq_print_userip_objs(const struct trace_entry *entry, struct trace_seq *s,
> + unsigned long sym_flags)
> +{
> + int ret = 1;
> + unsigned i;
> +
> + for (i = 0; i < FTRACE_STACK_ENTRIES; i++) {
> + unsigned long ip = entry->field.stack.caller[i];
> +
> + if (ip == ULONG_MAX || !ret)
> + break;
> + if (i)
> + ret = trace_seq_puts(s, " <- ");
> + if (!ip) {
> + ret = trace_seq_puts(s, "??");
> + continue;
> + }
> + if (sym_flags & TRACE_ITER_SYM_ADDR)
> + ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
> + }
> +
> + return ret;
> +}
> +
> static void print_lat_help_header(struct seq_file *m)
> {
> seq_puts(m, "# _------=> CPU# \n");
> @@ -1437,6 +1486,10 @@ print_lat_fmt(struct trace_iterator *iter, unsigned int trace_idx, int cpu)
> trace_seq_print_cont(s, iter);
> break;
> }
> + case TRACE_USER_STACK:
> + seq_print_userip_objs(entry, s, sym_flags);
> + trace_seq_putc(s, '\n');
> + break;
> default:
> trace_seq_printf(s, "Unknown type %d\n", entry->type);
> }
> @@ -1572,6 +1625,14 @@ static enum print_line_t print_trace_fmt(struct trace_iterator *iter)
> if (entry->flags & TRACE_FLAG_CONT)
> trace_seq_print_cont(s, iter);
> break;
> + case TRACE_USER_STACK:
> + ret = seq_print_userip_objs(entry, s, sym_flags);
> + if (!ret)
> + return 0;
> + ret = trace_seq_putc(s, '\n');
> + if (!ret)
> + return 0;
> + break;
> }
> }
> return TRACE_TYPE_HANDLED;
> @@ -1632,6 +1693,7 @@ static enum print_line_t print_raw_fmt(struct trace_iterator *iter)
> break;
> }
> case TRACE_SPECIAL:
> + case TRACE_USER_STACK:
> case TRACE_STACK: {
> struct special_entry *field;
>
> @@ -1720,6 +1782,7 @@ static enum print_line_t print_hex_fmt(struct trace_iterator *iter)
> break;
> }
> case TRACE_SPECIAL:
> + case TRACE_USER_STACK:
> case TRACE_STACK: {
> struct special_entry *field;
>
> @@ -1774,6 +1837,7 @@ static enum print_line_t print_bin_fmt(struct trace_iterator *iter)
> break;
> }
> case TRACE_SPECIAL:
> + case TRACE_USER_STACK:
> case TRACE_STACK: {
> struct special_entry *field;
>
> @@ -3103,6 +3167,9 @@ void ftrace_dump(void)
> atomic_inc(&global_trace.data[cpu]->disabled);
> }
>
> + /* don't look at user memory in panic mode */
> + trace_flags &= ~TRACE_ITER_SYM_USEROBJ;
> +
> printk(KERN_TRACE "Dumping ftrace buffer:\n");
>
> iter.tr = &global_trace;
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index f1f9957..3ad6343 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -22,6 +22,7 @@ enum trace_type {
> TRACE_MMIO_RW,
> TRACE_MMIO_MAP,
> TRACE_BOOT,
> + TRACE_USER_STACK,
>
> __TRACE_LAST_TYPE
> };
> @@ -413,6 +414,7 @@ enum trace_iterator_flags {
> TRACE_ITER_STACKTRACE = 0x100,
> TRACE_ITER_SCHED_TREE = 0x200,
> TRACE_ITER_PRINTK = 0x400,
> + TRACE_ITER_USERSTACKTRACE = 0x800
> };
>
> extern struct tracer nop_trace;
> --
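The hunks above print only raw addresses ("??" or <addr>); the
object+offset form that the changelog demonstrates is produced by
looking up the vma that contains each user IP. A minimal sketch of that
lookup, assuming a trace_seq_path() helper that prints a dentry path
into the trace_seq buffer (not shown in the hunks above), might look
like this:

static int
seq_print_user_ip(struct trace_seq *s, struct mm_struct *mm,
		  unsigned long ip, unsigned long sym_flags)
{
	struct file *file = NULL;
	unsigned long vmstart = 0;
	int ret = 1;

	if (mm) {
		const struct vm_area_struct *vma;

		down_read(&mm->mmap_sem);
		vma = find_vma(mm, ip);	/* mapping that contains ip */
		if (vma) {
			file = vma->vm_file;
			vmstart = vma->vm_start;
		}
		if (file) {
			/* print "/path/to/object[+relative offset]" */
			ret = trace_seq_path(s, &file->f_path);
			if (ret)
				ret = trace_seq_printf(s, "[+0x%lx]",
						       ip - vmstart);
		}
		up_read(&mm->mmap_sem);
	}
	if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
		ret = trace_seq_printf(s, " <" IP_FMT ">", ip);
	return ret;
}

The caller would pass the mm of the task that emitted the trace entry
and fall back to the raw address when no file-backed mapping covers the
IP (anonymous or JIT mappings).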


2008-10-26 06:05

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

Hi Török,

I just read a bit more of those two patches about user stack tracing.
Note that despite the recent changes to the tracing API
(ring buffer and other stuff), it still applies well, with a few small
hunks. But it needs some updates to support the latest changes.

I adapted the patch and it builds well, but I haven't tested it yet.
However, some parts are "architecture dependent"; I guess these
special pieces would find a better place in the arch directory. That
would allow a proper, separated implementation.

And most architectures provide a stacktrace implementation; even though
yours involves special operations such as copy_from_user, I wonder if
the already established code couldn't be reused for your needs.

2008-10-26 07:04:01

by Török Edwin

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

On 2008-10-26 06:05, Frédéric Weisbecker wrote:
> Hi Török,
>
> I just read a bit more of those two patches about user stack tracing.
> Note that despite the recent changes to the tracing API
> (ring buffer and other stuff), it still applies well, with a few small
> hunks. But it needs some updates to support the latest changes.
>
> I adapted the patch and it builds well, but I haven't tested it yet.
> However, some parts are "architecture dependent"; I guess these
> special pieces would find a better place in the arch directory. That
> would allow a proper, separated implementation.
>

Hi,

The stacktrace code itself is in the arch directory; did I miss anything
else that should belong there?

arch/x86/kernel/stacktrace.c | 57 +++++++++++++++++++++++++++++++++++
include/linux/stacktrace.h | 8 +++++


> And most architectures provide a stacktrace implementation; even though
> yours involves special operations such as copy_from_user, I wonder if
> the already established code couldn't be reused for your needs.
>

I think it would make sense to reuse oprofile's user stacktracing code:
there is a backtrace operation in 'struct oprofile_operations'. Perhaps
that could be extended so that non-oprofile code can call it?
[currently it uses oprofile_add_trace to store the stacktrace]

Best regards,
--Edwin
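
For reference, the hook Edwin refers to looks roughly like this in
include/linux/oprofile.h of that era (abridged; only the backtrace
member matters here):

struct oprofile_operations {
	/* create_files/setup/start/stop callbacks elided */

	/* Initiate a stack backtrace. Optional. */
	void (*backtrace)(struct pt_regs * const regs, unsigned int depth);

	/* CPU identification string. */
	char *cpu_type;
};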

2008-10-26 13:16:21

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

"=?ISO-8859-1?Q?Fr=E9d=E9ric_Weisbecker?=" <[email protected]> writes:

> [...]
>> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
>> +
>> +struct stack_frame {
>> + const void __user *next_fp;
>> + unsigned long return_address;
>> +};
> [...]

To what extent does this actually work, and on which architectures?
While the kernel appears willing to sacrifice some performance for
functionality by building with frame pointers at the drop of a tracing
hat, userspace is not like that.

- FChE
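
For context, the frame-pointer walk in question amounts to the loop
below: a sketch modeled on kernel/trace/trace_sysprof.c, reusing the
struct stack_frame quoted above (the copy_stack_frame() helper name is
illustrative). It can only follow the chain if userspace was compiled
with frame pointers, which is exactly the concern raised here:

/* Safely copy one stack_frame from user memory; returns 0 on failure. */
static int copy_stack_frame(const void __user *fp,
			    struct stack_frame *frame)
{
	int ret;

	if (!access_ok(VERIFY_READ, fp, sizeof(*frame)))
		return 0;

	ret = 1;
	pagefault_disable();
	if (__copy_from_user_inatomic(frame, fp, sizeof(*frame)))
		ret = 0;
	pagefault_enable();

	return ret;
}

static void __save_stack_trace_user(struct stack_trace *trace)
{
	const struct pt_regs *regs = task_pt_regs(current);
	const void __user *fp = (const void __user *)regs->bp;

	/* the interrupted user IP is always the first entry */
	if (trace->nr_entries < trace->max_entries)
		trace->entries[trace->nr_entries++] = regs->ip;

	while (trace->nr_entries < trace->max_entries) {
		struct stack_frame frame;

		frame.next_fp = NULL;
		frame.return_address = 0;
		if (!copy_stack_frame(fp, &frame))
			break;
		/* a sane frame chain grows upward from the stack pointer */
		if ((unsigned long)fp < regs->sp)
			break;
		if (frame.return_address)
			trace->entries[trace->nr_entries++] =
				frame.return_address;
		/* guard against loops in a corrupt chain */
		if (fp == frame.next_fp)
			break;
		fp = frame.next_fp;
	}
}

Without frame pointers, %rbp is just another scratch register on
x86-64, so the first copy_stack_frame() already reads junk and the walk
stops after the initial return address.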

2008-10-26 13:29:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

On Sun, 2008-10-26 at 09:15 -0400, Frank Ch. Eigler wrote:
> "=?ISO-8859-1?Q?Fr=E9d=E9ric_Weisbecker?=" <[email protected]> writes:
>
> > [...]
> >> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
> >> +
> >> +struct stack_frame {
> >> + const void __user *next_fp;
> >> + unsigned long return_address;
> >> +};
> > [...]
>
> To what extent does this actually work, and on which architectures?
> While the kernel appears willing to sacrifice some performance for
> functionality by building with frame pointers at the drop of a tracing
> hat, userspace is not like that.

Aww, come on, Gentoo is brilliant for that :-)

CFLAGS+=-fno-omit-frame-pointer

emerge -uDNe world

Then again, you'd better not have all that desktop bloat installed,
otherwise that will take ages.. KDE/OOo compile times anyone?

/me runs

2008-10-26 13:39:00

by Török Edwin

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

On 2008-10-26 15:29, Peter Zijlstra wrote:
> On Sun, 2008-10-26 at 09:15 -0400, Frank Ch. Eigler wrote:
>
>> "=?ISO-8859-1?Q?Fr=E9d=E9ric_Weisbecker?=" <[email protected]> writes:
>>
>>
>>> [...]
>>>
>>>> +/* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
>>>> +
>>>> +struct stack_frame {
>>>> + const void __user *next_fp;
>>>> + unsigned long return_address;
>>>> +};
>>>>
>>> [...]
>>>
>> To what extent does this actually work, and on which architectures?
>> While the kernel appears willing to sacrifice some performance for
>> functionality by building with frame pointers at the drop of a tracing
>> hat, userspace is not like that.
>>
>
> Aww, come on, Gentoo is brilliant for that :-)
>
> CFLAGS+=-fno-omit-frame-pointer
>
> emerge -uDNe world
>
> Then again, you'd better not have all that desktop bloat installed,
> otherwise that will take ages.. KDE/OOo compile times anyone?
>
> /me runs

It should be enough to rebuild the application you are tracing[1] plus
its libraries, or at least libc.
In userspace I can get a stacktrace using DWARF unwind info, but doing
that in the kernel would be too expensive, right?

[1] assuming you are tracing latencies in a single application

Best regards,
--Edwin
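
Concretely, keeping frame pointers in just the traced binary means
compiling it (and ideally its libraries) with -fno-omit-frame-pointer,
for example (file names hypothetical):

gcc -g -fno-omit-frame-pointer -o myapp myapp.c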

2008-10-26 13:49:35

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

Hi -

On Sun, Oct 26, 2008 at 03:38:45PM +0200, Török Edwin wrote:

> [...] In userspace I can get a stacktrace using DWARF unwind info,
> but doing that in the kernel would be too expensive, right? [...]

We're experimenting with just that in systemtap. The expenses
(memory: megabytes of unwind data, time: running code to process it)
would be paid only for the duration of a systemtap script that needs
that particular sort of information.

- FChE

2008-10-26 15:06:46

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

2008/10/26 Török Edwin <[email protected]>:
> Hi,
>
> The stacktrace code itself is in the arch directory, did I miss anything
> else that should belong there?

No, that's my bad, I misread the patch. Sorry :)

2008-10-27 16:04:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl


* Török Edwin <[email protected]> wrote:

> Usage example:
> mount -t debugfs nodev /sys/kernel/debug
> cd /sys/kernel/debug/tracing
> echo userstacktrace >iter_ctrl
> echo sched_switch >current_tracer
> echo 1 >tracing_enabled
> .... run application ...
> echo 0 >tracing_enabled
>
> Then read one of 'trace','latency_trace','trace_pipe'

okay, this makes quite a bit of sense - and sysprof already kind of
walks down into the user-space stack. (and so does oprofile, if asked)

Could you send this independently of the lock contention tracing
patches perhaps?

Ingo

2008-10-27 16:16:52

by Török Edwin

[permalink] [raw]
Subject: Re: [PATCH 1/4] Add support for userspace stacktraces in tracing/iter_ctrl

On 2008-10-27 18:03, Ingo Molnar wrote:
> * Török Edwin <[email protected]> wrote:
>
>
>> Usage example:
>> mount -t debugfs nodev /sys/kernel/debug
>> cd /sys/kernel/debug/tracing
>> echo userstacktrace >iter_ctrl
>> echo sched_switch >current_tracer
>> echo 1 >tracing_enabled
>> .... run application ...
>> echo 0 >tracing_enabled
>>
>> Then read one of 'trace','latency_trace','trace_pipe'
>>
>
> okay, this makes quite a bit of sense - and sysprof already kind of
> walks down into the user-space stack. (and so does oprofile, if asked)
>
> Could you send this independently of the lock contention tracing
> patches perhaps?

Sure, and "[PATCH 2/4] Identify which executable object the userspace
address belongs to" is independent of the lock tracing part too.

Perhaps I should send these 2 patches as 3 separate patches:
- introduce save_stack_trace_user in arch/
- the ftrace parts for user stack tracing (userstacktrace >iter_ctrl)
- the sym-userobj part (which is useful if you have ASLR; otherwise you
don't have a chance to resolve the symbols later, after the app is gone)

I promised the lock contention tracepoints for this weekend; I'll send
out these patches at the same time too.

Best regards,
--Edwin