2009-03-21 01:41:21

by Roland McGrath

[permalink] [raw]
Subject: [PATCH 0/3] utrace

utrace is a new kernel-side API for kernel modules, intended to make it
tractable to work on novel ways to trace and debug user-mode tasks.

These patches apply to the current Linus tree (v2.6.29-rc8-241-g65c2449).
The first two should apply fine on the -tip tree as well, and we will be
glad to rebase the set to whichever tree. Frank has another version of the
ftrace patch (3/3) that works for -tip. The utrace patches don't touch
anything unless you set a new kconfig option (still marked EXPERIMENTAL),
and so are quite safe in that regard.

utrace cannot be enabled without CONFIG_HAVE_ARCH_TRACEHOOK and the arch
details it indicates. If your arch does not have it yet, its maintainers
will have to work on that. The details are in the comments in arch/Kconfig.

The first patch makes a small update to one of the tracehook.h interfaces
that we needed for utrace. It moves code a little but does not change any
of the logic in the existing code.

The second patch adds the utrace kernel API (if CONFIG_UTRACE=y is set).
There is no change at all without the config option, and with it there is
no effect on anything at all until a kernel module using the utrace API is
loaded. There is detailed documentation on the API in DocBook form.

The third patch is an ftrace widget based on utrace, by Frank Eigler.
Frank will follow up on any issues about that patch.


Thanks,
Roland


2009-03-21 01:42:24

by Roland McGrath

[permalink] [raw]
Subject: [PATCH 1/3] signals: tracehook_notify_jctl change

This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition. These clean-ups make
it a better fit for what new tracing hooks need to check.

Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.

This also folds the finish_stop() function into its caller do_signal_stop().
The function is short, called only once and only unconditionally. It aids
readability to fold it in.

Signed-off-by: Roland McGrath <[email protected]>
---
include/linux/tracehook.h | 25 ++++++++++------
kernel/signal.c | 69 +++++++++++++++++++++++----------------------
2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index 6186a78..b622498 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -1,7 +1,7 @@
/*
* Tracing hooks
*
- * Copyright (C) 2008 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2008-2009 Red Hat, Inc. All rights reserved.
*
* This copyrighted material is made available to anyone wishing to use,
* modify, copy, or redistribute it subject to the terms and conditions
@@ -469,22 +469,29 @@ static inline int tracehook_get_signal(s

/**
* tracehook_notify_jctl - report about job control stop/continue
- * @notify: nonzero if this is the last thread in the group to stop
+ * @notify: zero, %CLD_STOPPED or %CLD_CONTINUED
* @why: %CLD_STOPPED or %CLD_CONTINUED
*
* This is called when we might call do_notify_parent_cldstop().
- * It's called when about to stop for job control; we are already in
- * %TASK_STOPPED state, about to call schedule(). It's also called when
- * a delayed %CLD_STOPPED or %CLD_CONTINUED report is ready to be made.
*
- * Return nonzero to generate a %SIGCHLD with @why, which is
- * normal if @notify is nonzero.
+ * @notify is zero if we would not ordinarily send a %SIGCHLD,
+ * or is the %CLD_STOPPED or %CLD_CONTINUED .si_code for %SIGCHLD.
*
- * Called with no locks held.
+ * @why is %CLD_STOPPED when about to stop for job control;
+ * we are already in %TASK_STOPPED state, about to call schedule().
+ * It might also be that we have just exited (check %PF_EXITING),
+ * but need to report that a group-wide stop is complete.
+ *
+ * @why is %CLD_CONTINUED when waking up after job control stop and
+ * ready to make a delayed @notify report.
+ *
+ * Return the %CLD_* value for %SIGCHLD, or zero to generate no signal.
+ *
+ * Called with the siglock held.
*/
static inline int tracehook_notify_jctl(int notify, int why)
{
- return notify || (current->ptrace & PT_PTRACED);
+ return notify ?: (current->ptrace & PT_PTRACED) ? why : 0;
}

#define DEATH_REAP -1
diff --git a/kernel/signal.c b/kernel/signal.c
index 2a74fe8..9a0d98f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -691,7 +691,7 @@ static int prepare_signal(int sig, struc

if (why) {
/*
- * The first thread which returns from finish_stop()
+ * The first thread which returns from do_signal_stop()
* will take ->siglock, notice SIGNAL_CLD_MASK, and
* notify its parent. See get_signal_to_deliver().
*/
@@ -1629,29 +1629,6 @@ void ptrace_notify(int exit_code)
spin_unlock_irq(&current->sighand->siglock);
}

-static void
-finish_stop(int stop_count)
-{
- /*
- * If there are no other threads in the group, or if there is
- * a group stop in progress and we are the last to stop,
- * report to the parent. When ptraced, every thread reports itself.
- */
- if (tracehook_notify_jctl(stop_count == 0, CLD_STOPPED)) {
- read_lock(&tasklist_lock);
- do_notify_parent_cldstop(current, CLD_STOPPED);
- read_unlock(&tasklist_lock);
- }
-
- do {
- schedule();
- } while (try_to_freeze());
- /*
- * Now we don't run again until continued.
- */
- current->exit_code = 0;
-}
-
/*
* This performs the stopping for SIGSTOP and other stop signals.
* We have to stop all threads in the thread group.
@@ -1662,6 +1639,7 @@ static int do_signal_stop(int signr)
{
struct signal_struct *sig = current->signal;
int stop_count;
+ int notify;

if (sig->group_stop_count > 0) {
/*
@@ -1701,8 +1679,30 @@ static int do_signal_stop(int signr)
current->exit_code = sig->group_exit_code;
__set_current_state(TASK_STOPPED);

+ /*
+ * If there are no other threads in the group, or if there is
+ * a group stop in progress and we are the last to stop,
+ * report to the parent. When ptraced, every thread reports itself.
+ */
+ notify = tracehook_notify_jctl(stop_count == 0 ? CLD_STOPPED : 0,
+ CLD_STOPPED);
+
spin_unlock_irq(&current->sighand->siglock);
- finish_stop(stop_count);
+
+ if (notify) {
+ read_lock(&tasklist_lock);
+ do_notify_parent_cldstop(current, notify);
+ read_unlock(&tasklist_lock);
+ }
+
+ do {
+ schedule();
+ } while (try_to_freeze());
+ /*
+ * Now we don't run again until continued.
+ */
+ current->exit_code = 0;
+
return 1;
}

@@ -1771,14 +1771,15 @@ relock:
int why = (signal->flags & SIGNAL_STOP_CONTINUED)
? CLD_CONTINUED : CLD_STOPPED;
signal->flags &= ~SIGNAL_CLD_MASK;
- spin_unlock_irq(&sighand->siglock);

- if (unlikely(!tracehook_notify_jctl(1, why)))
- goto relock;
+ why = tracehook_notify_jctl(why, CLD_CONTINUED);
+ spin_unlock_irq(&sighand->siglock);

- read_lock(&tasklist_lock);
- do_notify_parent_cldstop(current->group_leader, why);
- read_unlock(&tasklist_lock);
+ if (why) {
+ read_lock(&tasklist_lock);
+ do_notify_parent_cldstop(current->group_leader, why);
+ read_unlock(&tasklist_lock);
+ }
goto relock;
}

@@ -1936,14 +1937,14 @@ void exit_signals(struct task_struct *ts
if (unlikely(tsk->signal->group_stop_count) &&
!--tsk->signal->group_stop_count) {
tsk->signal->flags = SIGNAL_STOP_STOPPED;
- group_stop = 1;
+ group_stop = tracehook_notify_jctl(CLD_STOPPED, CLD_STOPPED);
}
out:
spin_unlock_irq(&tsk->sighand->siglock);

- if (unlikely(group_stop) && tracehook_notify_jctl(1, CLD_STOPPED)) {
+ if (unlikely(group_stop)) {
read_lock(&tasklist_lock);
- do_notify_parent_cldstop(tsk, CLD_STOPPED);
+ do_notify_parent_cldstop(tsk, group_stop);
read_unlock(&tasklist_lock);
}
}

2009-03-21 01:43:47

by Roland McGrath

[permalink] [raw]
Subject: [PATCH 3/3] utrace-based ftrace "process" engine, v2

From: Frank Ch. Eigler <[email protected]>

This is v2 of the prototype utrace-ftrace interface. This code is
based on Roland McGrath's utrace API, which provides programmatic
hooks to the in-tree tracehook layer. This new patch interfaces many
of those events to ftrace, as configured by a small number of debugfs
controls. Here's the /debugfs/tracing/process_trace_README:

process event tracer mini-HOWTO

1. Select process hierarchy to monitor. Other processes will be
completely unaffected. Leave at 0 for system-wide tracing.
% echo NNN > process_follow_pid

2. Determine which process event traces are potentially desired.
syscall and signal tracing slow down monitored processes.
% echo 0 > process_trace_{syscalls,signals,lifecycle}

3. Add any final uid- or taskcomm-based filtering. Non-matching
processes will skip trace messages, but will still be slowed.
% echo NNN > process_trace_uid_filter # -1: unrestricted
% echo ls > process_trace_taskcomm_filter # empty: unrestricted

4. Start tracing.
% echo process > current_tracer

5. Examine trace.
% cat trace

6. Stop tracing.
% echo nop > current_tracer

Signed-off-by: Frank Ch. Eigler <[email protected]>
---
include/linux/processtrace.h | 41 +++
kernel/trace/Kconfig | 9 +
kernel/trace/Makefile | 1 +
kernel/trace/trace.h | 8 +
kernel/trace/trace_process.c | 601 ++++++++++++++++++++++++++++++++++++++++++
5 files changed, 660 insertions(+), 0 deletions(-)

diff --git a/include/linux/processtrace.h b/include/linux/processtrace.h
new file mode 100644
index ...f2b7d94 100644
--- /dev/null
+++ b/include/linux/processtrace.h
@@ -0,0 +1,41 @@
+#ifndef PROCESSTRACE_H
+#define PROCESSTRACE_H
+
+#include <linux/types.h>
+#include <linux/list.h>
+
+struct process_trace_entry {
+ unsigned char opcode; /* one of _UTRACE_EVENT_* */
+ char comm[TASK_COMM_LEN]; /* XXX: should be in/via trace_entry */
+ union {
+ struct {
+ pid_t child;
+ unsigned long flags;
+ } trace_clone;
+ struct {
+ long code;
+ } trace_exit;
+ struct {
+ } trace_exec;
+ struct {
+ int si_signo;
+ int si_errno;
+ int si_code;
+ } trace_signal;
+ struct {
+ long callno;
+ unsigned long args[6];
+ } trace_syscall_entry;
+ struct {
+ long rc;
+ long error;
+ } trace_syscall_exit;
+ };
+};
+
+/* in kernel/trace/trace_process.c */
+
+extern void enable_process_trace(void);
+extern void disable_process_trace(void);
+
+#endif /* PROCESSTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 34e707e..8a92d6f 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -150,6 +150,15 @@ config CONTEXT_SWITCH_TRACER
This tracer gets called from the context switch and records
all switching of tasks.

+config PROCESS_TRACER
+ bool "Trace process events via utrace"
+ depends on DEBUG_KERNEL
+ select TRACING
+ select UTRACE
+ help
+ This tracer provides trace records from process events
+ accessible to utrace: lifecycle, system calls, and signals.
+
config BOOT_TRACER
bool "Trace boot initcalls"
depends on DEBUG_KERNEL
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 349d5a9..a774db2 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -33,5 +33,6 @@ obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += t
obj-$(CONFIG_TRACE_BRANCH_PROFILING) += trace_branch.o
obj-$(CONFIG_HW_BRANCH_TRACER) += trace_hw_branches.o
obj-$(CONFIG_POWER_TRACER) += trace_power.o
+obj-$(CONFIG_PROCESS_TRACER) += trace_process.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 4d3d381..c4d2e7f 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -7,6 +7,7 @@
#include <linux/clocksource.h>
#include <linux/ring_buffer.h>
#include <linux/mmiotrace.h>
+#include <linux/processtrace.h>
#include <linux/ftrace.h>
#include <trace/boot.h>

@@ -30,6 +31,7 @@ enum trace_type {
TRACE_USER_STACK,
TRACE_HW_BRANCHES,
TRACE_POWER,
+ TRACE_PROCESS,

__TRACE_LAST_TYPE
};
@@ -170,6 +172,11 @@ struct trace_power {
struct power_trace state_data;
};

+struct trace_process {
+ struct trace_entry ent;
+ struct process_trace_entry event;
+};
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
@@ -280,6 +287,7 @@ extern void __ftrace_bad_type(void);
TRACE_GRAPH_RET); \
IF_ASSIGN(var, ent, struct hw_branch_entry, TRACE_HW_BRANCHES);\
IF_ASSIGN(var, ent, struct trace_power, TRACE_POWER); \
+ IF_ASSIGN(var, ent, struct trace_process, TRACE_PROCESS); \
__ftrace_bad_type(); \
} while (0)

diff --git a/kernel/trace/trace_process.c b/kernel/trace/trace_process.c
new file mode 100644
index ...0820e56 100644
--- /dev/null
+++ b/kernel/trace/trace_process.c
@@ -0,0 +1,601 @@
+/*
+ * utrace-based process event tracing
+ * Copyright (C) 2009 Red Hat Inc.
+ * By Frank Ch. Eigler <[email protected]>
+ *
+ * Based on mmio ftrace engine by Pekka Paalanen
+ * and utrace-syscall-tracing prototype by Ananth Mavinakayanahalli
+ */
+
+/* #define DEBUG 1 */
+
+#include <linux/kernel.h>
+#include <linux/utrace.h>
+#include <linux/uaccess.h>
+#include <linux/debugfs.h>
+#include <asm/syscall.h>
+
+#include "trace.h"
+
+/* A process must match these filters in order to be traced. */
+static char trace_taskcomm_filter[TASK_COMM_LEN]; /* \0: unrestricted */
+static u32 trace_taskuid_filter = -1; /* -1: unrestricted */
+static u32 trace_lifecycle_p = 1;
+static u32 trace_syscalls_p = 1;
+static u32 trace_signals_p = 1;
+
+/* A process must be a direct child of given pid in order to be
+ followed. */
+static u32 process_follow_pid; /* 0: unrestricted/systemwide */
+
+/* XXX: lock the above? */
+
+
+/* trace data collection */
+
+static struct trace_array *process_trace_array;
+
+static void process_reset_data(struct trace_array *tr)
+{
+ pr_debug("in %s\n", __func__);
+ tracing_reset_online_cpus(tr);
+}
+
+static int process_trace_init(struct trace_array *tr)
+{
+ pr_debug("in %s\n", __func__);
+ process_trace_array = tr;
+ process_reset_data(tr);
+ enable_process_trace();
+ return 0;
+}
+
+static void process_trace_reset(struct trace_array *tr)
+{
+ pr_debug("in %s\n", __func__);
+ disable_process_trace();
+ process_reset_data(tr);
+ process_trace_array = NULL;
+}
+
+static void process_trace_start(struct trace_array *tr)
+{
+ pr_debug("in %s\n", __func__);
+ process_reset_data(tr);
+}
+
+static void __trace_processtrace(struct trace_array *tr,
+ struct trace_array_cpu *data,
+ struct process_trace_entry *ent)
+{
+ struct ring_buffer_event *event;
+ struct trace_process *entry;
+ unsigned long irq_flags;
+
+ event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry),
+ &irq_flags);
+ if (!event)
+ return;
+ entry = ring_buffer_event_data(event);
+ tracing_generic_entry_update(&entry->ent, 0, preempt_count());
+ entry->ent.cpu = raw_smp_processor_id();
+ entry->ent.type = TRACE_PROCESS;
+ strlcpy(ent->comm, current->comm, TASK_COMM_LEN);
+ entry->event = *ent;
+ ring_buffer_unlock_commit(tr->buffer, event, irq_flags);
+
+ trace_wake_up();
+}
+
+void process_trace(struct process_trace_entry *ent)
+{
+ struct trace_array *tr = process_trace_array;
+ struct trace_array_cpu *data;
+
+ preempt_disable();
+ data = tr->data[smp_processor_id()];
+ __trace_processtrace(tr, data, ent);
+ preempt_enable();
+}
+
+
+/* trace data rendering */
+
+static void process_pipe_open(struct trace_iterator *iter)
+{
+ struct trace_seq *s = &iter->seq;
+ pr_debug("in %s\n", __func__);
+ trace_seq_printf(s, "VERSION 200901\n");
+}
+
+static void process_close(struct trace_iterator *iter)
+{
+ iter->private = NULL;
+}
+
+static ssize_t process_read(struct trace_iterator *iter, struct file *filp,
+ char __user *ubuf, size_t cnt, loff_t *ppos)
+{
+ ssize_t ret;
+ struct trace_seq *s = &iter->seq;
+ ret = trace_seq_to_user(s, ubuf, cnt);
+ return (ret == -EBUSY) ? 0 : ret;
+}
+
+static enum print_line_t process_print(struct trace_iterator *iter)
+{
+ struct trace_entry *entry = iter->ent;
+ struct trace_process *field;
+ struct trace_seq *s = &iter->seq;
+ unsigned long long t = ns2usecs(iter->ts);
+ unsigned long usec_rem = do_div(t, 1000000ULL);
+ unsigned secs = (unsigned long)t;
+ int ret = 1;
+
+ trace_assign_type(field, entry);
+
+ /* XXX: If print_lat_fmt() were not static, we wouldn't have
+ to duplicate this. */
+ trace_seq_printf(s, "%16s %5d %3d %9lu.%06ld ",
+ field->event.comm,
+ entry->pid, entry->cpu,
+ secs,
+ usec_rem);
+
+ switch (field->event.opcode) {
+ case _UTRACE_EVENT_CLONE:
+ ret = trace_seq_printf(s, "fork %d flags 0x%lx\n",
+ field->event.trace_clone.child,
+ field->event.trace_clone.flags);
+ break;
+ case _UTRACE_EVENT_EXEC:
+ ret = trace_seq_printf(s, "exec\n");
+ break;
+ case _UTRACE_EVENT_EXIT:
+ ret = trace_seq_printf(s, "exit %ld\n",
+ field->event.trace_exit.code);
+ break;
+ case _UTRACE_EVENT_SIGNAL:
+ ret = trace_seq_printf(s, "signal %d errno %d code 0x%x\n",
+ field->event.trace_signal.si_signo,
+ field->event.trace_signal.si_errno,
+ field->event.trace_signal.si_code);
+ break;
+ case _UTRACE_EVENT_SYSCALL_ENTRY:
+ ret = trace_seq_printf(s, "syscall %ld [0x%lx 0x%lx 0x%lx"
+ " 0x%lx 0x%lx 0x%lx]\n",
+ field->event.trace_syscall_entry.callno,
+ field->event.trace_syscall_entry.args[0],
+ field->event.trace_syscall_entry.args[1],
+ field->event.trace_syscall_entry.args[2],
+ field->event.trace_syscall_entry.args[3],
+ field->event.trace_syscall_entry.args[4],
+ field->event.trace_syscall_entry.args[5]);
+ break;
+ case _UTRACE_EVENT_SYSCALL_EXIT:
+ ret = trace_seq_printf(s, "syscall rc %ld error %ld\n",
+ field->event.trace_syscall_exit.rc,
+ field->event.trace_syscall_exit.error);
+ break;
+ default:
+ ret = trace_seq_printf(s, "process code %d?\n",
+ field->event.opcode);
+ break;
+ }
+ if (ret)
+ return TRACE_TYPE_HANDLED;
+ return TRACE_TYPE_HANDLED;
+}
+
+
+static enum print_line_t process_print_line(struct trace_iterator *iter)
+{
+ switch (iter->ent->type) {
+ case TRACE_PROCESS:
+ return process_print(iter);
+ default:
+ return TRACE_TYPE_HANDLED; /* ignore unknown entries */
+ }
+}
+
+static struct tracer process_tracer = {
+ .name = "process",
+ .init = process_trace_init,
+ .reset = process_trace_reset,
+ .start = process_trace_start,
+ .pipe_open = process_pipe_open,
+ .close = process_close,
+ .read = process_read,
+ .print_line = process_print_line,
+};
+
+
+
+/* utrace backend */
+
+/* Should tracing apply to given task? Compare against filter
+ values. */
+static int trace_test(struct task_struct *tsk)
+{
+ if (trace_taskcomm_filter[0]
+ && strncmp(trace_taskcomm_filter, tsk->comm, TASK_COMM_LEN))
+ return 0;
+
+ if (trace_taskuid_filter != (u32)-1
+ && trace_taskuid_filter != task_uid(tsk))
+ return 0;
+
+ return 1;
+}
+
+
+static const struct utrace_engine_ops process_trace_ops;
+
+static void process_trace_tryattach(struct task_struct *tsk)
+{
+ struct utrace_engine *engine;
+
+ pr_debug("in %s\n", __func__);
+ engine = utrace_attach_task(tsk,
+ UTRACE_ATTACH_CREATE |
+ UTRACE_ATTACH_EXCLUSIVE,
+ &process_trace_ops, NULL);
+ if (IS_ERR(engine) || (engine == NULL)) {
+ pr_warning("utrace_attach_task %d (rc %p)\n",
+ tsk->pid, engine);
+ } else {
+ int rc;
+
+ /* We always hook cost-free events. */
+ unsigned long events =
+ UTRACE_EVENT(CLONE) |
+ UTRACE_EVENT(EXEC) |
+ UTRACE_EVENT(EXIT);
+
+ /* Penalizing events are individually controlled, so that
+ utrace doesn't even take the monitored threads off their
+ fast paths, nor bother call our callbacks. */
+ if (trace_syscalls_p)
+ events |= UTRACE_EVENT_SYSCALL;
+ if (trace_signals_p)
+ events |= UTRACE_EVENT_SIGNAL_ALL;
+
+ rc = utrace_set_events(tsk, engine, events);
+ if (rc == -EINPROGRESS)
+ rc = utrace_barrier(tsk, engine);
+ if (rc)
+ pr_warning("utrace_set_events/barrier rc %d\n", rc);
+
+ utrace_engine_put(engine);
+ pr_debug("attached in %s to %s(%d)\n", __func__,
+ tsk->comm, tsk->pid);
+ }
+}
+
+
+u32 process_trace_report_clone(enum utrace_resume_action action,
+ struct utrace_engine *engine,
+ struct task_struct *parent,
+ unsigned long clone_flags,
+ struct task_struct *child)
+{
+ if (trace_lifecycle_p && trace_test(parent)) {
+ struct process_trace_entry ent;
+ ent.opcode = _UTRACE_EVENT_CLONE;
+ ent.trace_clone.child = child->pid;
+ ent.trace_clone.flags = clone_flags;
+ process_trace(&ent);
+ }
+
+ process_trace_tryattach(child);
+
+ return UTRACE_RESUME;
+}
+
+
+u32 process_trace_report_syscall_entry(u32 action,
+ struct utrace_engine *engine,
+ struct task_struct *task,
+ struct pt_regs *regs)
+{
+ if (trace_syscalls_p && trace_test(task)) {
+ struct process_trace_entry ent;
+ ent.opcode = _UTRACE_EVENT_SYSCALL_ENTRY;
+ ent.trace_syscall_entry.callno = syscall_get_nr(task, regs);
+ syscall_get_arguments(task, regs, 0, 6,
+ ent.trace_syscall_entry.args);
+ process_trace(&ent);
+ }
+
+ return UTRACE_RESUME;
+}
+
+
+u32 process_trace_report_syscall_exit(enum utrace_resume_action action,
+ struct utrace_engine *engine,
+ struct task_struct *task,
+ struct pt_regs *regs)
+{
+ if (trace_syscalls_p && trace_test(task)) {
+ struct process_trace_entry ent;
+ ent.opcode = _UTRACE_EVENT_SYSCALL_EXIT;
+ ent.trace_syscall_exit.rc =
+ syscall_get_return_value(task, regs);
+ ent.trace_syscall_exit.error = syscall_get_error(task, regs);
+ process_trace(&ent);
+ }
+
+ return UTRACE_RESUME;
+}
+
+
+u32 process_trace_report_exec(enum utrace_resume_action action,
+ struct utrace_engine *engine,
+ struct task_struct *task,
+ const struct linux_binfmt *fmt,
+ const struct linux_binprm *bprm,
+ struct pt_regs *regs)
+{
+ if (trace_lifecycle_p && trace_test(task)) {
+ struct process_trace_entry ent;
+ ent.opcode = _UTRACE_EVENT_EXEC;
+ process_trace(&ent);
+ }
+
+ /* We're already attached; no need for a new tryattach. */
+
+ return UTRACE_RESUME;
+}
+
+
+u32 process_trace_report_signal(u32 action,
+ struct utrace_engine *engine,
+ struct task_struct *task,
+ struct pt_regs *regs,
+ siginfo_t *info,
+ const struct k_sigaction *orig_ka,
+ struct k_sigaction *return_ka)
+{
+ if (trace_signals_p && trace_test(task)) {
+ struct process_trace_entry ent;
+ ent.opcode = _UTRACE_EVENT_SIGNAL;
+ ent.trace_signal.si_signo = info->si_signo;
+ ent.trace_signal.si_errno = info->si_errno;
+ ent.trace_signal.si_code = info->si_code;
+ process_trace(&ent);
+ }
+
+ /* We're already attached, so no need for a new tryattach. */
+
+ return UTRACE_RESUME | utrace_signal_action(action);
+}
+
+
+u32 process_trace_report_exit(enum utrace_resume_action action,
+ struct utrace_engine *engine,
+ struct task_struct *task,
+ long orig_code, long *code)
+{
+ if (trace_lifecycle_p && trace_test(task)) {
+ struct process_trace_entry ent;
+ ent.opcode = _UTRACE_EVENT_EXIT;
+ ent.trace_exit.code = orig_code;
+ process_trace(&ent);
+ }
+
+ /* There is no need to explicitly attach or detach here. */
+
+ return UTRACE_RESUME;
+}
+
+
+void enable_process_trace()
+{
+ struct task_struct *grp, *tsk;
+
+ pr_debug("in %s\n", __func__);
+ rcu_read_lock();
+ do_each_thread(grp, tsk) {
+ /* Skip over kernel threads. */
+ if (tsk->flags & PF_KTHREAD)
+ continue;
+
+ if (process_follow_pid) {
+ if (tsk->tgid == process_follow_pid ||
+ tsk->parent->tgid == process_follow_pid)
+ process_trace_tryattach(tsk);
+ } else {
+ process_trace_tryattach(tsk);
+ }
+ } while_each_thread(grp, tsk);
+ rcu_read_unlock();
+}
+
+void disable_process_trace()
+{
+ struct utrace_engine *engine;
+ struct task_struct *grp, *tsk;
+ int rc;
+
+ pr_debug("in %s\n", __func__);
+ rcu_read_lock();
+ do_each_thread(grp, tsk) {
+ /* Find matching engine, if any. Returns -ENOENT for
+ unattached threads. */
+ engine = utrace_attach_task(tsk, UTRACE_ATTACH_MATCH_OPS,
+ &process_trace_ops, 0);
+ if (IS_ERR(engine)) {
+ if (PTR_ERR(engine) != -ENOENT)
+ pr_warning("utrace_attach_task %d (rc %ld)\n",
+ tsk->pid, -PTR_ERR(engine));
+ } else if (engine == NULL) {
+ pr_warning("utrace_attach_task %d (null engine)\n",
+ tsk->pid);
+ } else {
+ /* Found one of our own engines. Detach. */
+ rc = utrace_control(tsk, engine, UTRACE_DETACH);
+ switch (rc) {
+ case 0: /* success */
+ break;
+ case -ESRCH: /* REAP callback already begun */
+ case -EALREADY: /* DEATH callback already begun */
+ break;
+ default:
+ rc = -rc;
+ pr_warning("utrace_detach %d (rc %d)\n",
+ tsk->pid, rc);
+ break;
+ }
+ utrace_engine_put(engine);
+ pr_debug("detached in %s from %s(%d)\n", __func__,
+ tsk->comm, tsk->pid);
+ }
+ } while_each_thread(grp, tsk);
+ rcu_read_unlock();
+}
+
+
+static const struct utrace_engine_ops process_trace_ops = {
+ .report_clone = process_trace_report_clone,
+ .report_exec = process_trace_report_exec,
+ .report_exit = process_trace_report_exit,
+ .report_signal = process_trace_report_signal,
+ .report_syscall_entry = process_trace_report_syscall_entry,
+ .report_syscall_exit = process_trace_report_syscall_exit,
+};
+
+
+
+/* control interfaces */
+
+
+static ssize_t
+trace_taskcomm_filter_read(struct file *filp, char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+ return simple_read_from_buffer(ubuf, cnt, ppos,
+ trace_taskcomm_filter, TASK_COMM_LEN);
+}
+
+
+static ssize_t
+trace_taskcomm_filter_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *fpos)
+{
+ char *end;
+
+ if (cnt > TASK_COMM_LEN)
+ cnt = TASK_COMM_LEN;
+
+ if (copy_from_user(trace_taskcomm_filter, ubuf, cnt))
+ return -EFAULT;
+
+ /* Cut from the first nil or newline. */
+ trace_taskcomm_filter[cnt] = '\0';
+ end = strchr(trace_taskcomm_filter, '\n');
+ if (end)
+ *end = '\0';
+
+ *fpos += cnt;
+ return cnt;
+}
+
+
+static const struct file_operations trace_taskcomm_filter_fops = {
+ .open = tracing_open_generic,
+ .read = trace_taskcomm_filter_read,
+ .write = trace_taskcomm_filter_write,
+};
+
+
+
+static char README_text[] =
+ "process event tracer mini-HOWTO\n"
+ "\n"
+ "1. Select process hierarchy to monitor. Other processes will be\n"
+ " completely unaffected. Leave at 0 for system-wide tracing.\n"
+ "# echo NNN > process_follow_pid\n"
+ "\n"
+ "2. Determine which process event traces are potentially desired.\n"
+ " syscall and signal tracing slow down monitored processes.\n"
+ "# echo 0 > process_trace_{syscalls,signals,lifecycle}\n"
+ "\n"
+ "3. Add any final uid- or taskcomm-based filtering. Non-matching\n"
+ " processes will skip trace messages, but will still be slowed.\n"
+ "# echo NNN > process_trace_uid_filter # -1: unrestricted \n"
+ "# echo ls > process_trace_taskcomm_filter # empty: unrestricted\n"
+ "\n"
+ "4. Start tracing.\n"
+ "# echo process > current_tracer\n"
+ "\n"
+ "5. Examine trace.\n"
+ "# cat trace\n"
+ "\n"
+ "6. Stop tracing.\n"
+ "# echo nop > current_tracer\n"
+ ;
+
+static struct debugfs_blob_wrapper README_blob = {
+ .data = README_text,
+ .size = sizeof(README_text),
+};
+
+
+static __init int init_process_trace(void)
+{
+ struct dentry *d_tracer;
+ struct dentry *entry;
+
+ d_tracer = tracing_init_dentry();
+
+ entry = debugfs_create_blob("process_trace_README", 0444, d_tracer,
+ &README_blob);
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_trace_README' entry\n");
+
+ /* Control for scoping process following. */
+ entry = debugfs_create_u32("process_follow_pid", 0644, d_tracer,
+ &process_follow_pid);
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_follow_pid' entry\n");
+
+ /* Process-level filters */
+ entry = debugfs_create_file("process_trace_taskcomm_filter", 0644,
+ d_tracer, NULL,
+ &trace_taskcomm_filter_fops);
+ /* XXX: it'd be nice to have a read/write debugfs_create_blob. */
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_trace_taskcomm_filter' entry\n");
+
+ entry = debugfs_create_u32("process_trace_uid_filter", 0644, d_tracer,
+ &trace_taskuid_filter);
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_trace_uid_filter' entry\n");
+
+ /* Event-level filters. */
+ entry = debugfs_create_u32("process_trace_lifecycle", 0644, d_tracer,
+ &trace_lifecycle_p);
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_trace_lifecycle' entry\n");
+
+ entry = debugfs_create_u32("process_trace_syscalls", 0644, d_tracer,
+ &trace_syscalls_p);
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_trace_syscalls' entry\n");
+
+ entry = debugfs_create_u32("process_trace_signals", 0644, d_tracer,
+ &trace_signals_p);
+ if (!entry)
+ pr_warning("Could not create debugfs "
+ "'process_trace_signals' entry\n");
+
+ return register_tracer(&process_tracer);
+}
+
+device_initcall(init_process_trace);

2009-03-21 07:43:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Roland McGrath <[email protected]> wrote:

> From: Frank Ch. Eigler <[email protected]>
>
> This is v2 of the prototype utrace-ftrace interface. This code is
> based on Roland McGrath's utrace API, which provides programmatic
> hooks to the in-tree tracehook layer. This new patch interfaces
> many of those events to ftrace, as configured by a small number of
> debugfs controls. Here's the
> /debugfs/tracing/process_trace_README:

Please submit changes/enhancements to kernel/trace/* to the tracing
tree maintainers (Steve and me) for review, testing and integration.

Please also post patches against the latest tracing tree:

http://people.redhat.com/mingo/tip.git/README

As this patch does not apply:

Applying patch patches/utrace-based-ftrace-process-engine-v2.patch
patching file include/linux/processtrace.h
patching file kernel/trace/Kconfig
Hunk #1 succeeded at 186 with fuzz 2 (offset 36 lines).
patching file kernel/trace/Makefile
Hunk #1 FAILED at 33.
1 out of 1 hunk FAILED -- rejects in file kernel/trace/Makefile
patching file kernel/trace/trace.h
Hunk #1 succeeded at 7 with fuzz 1.
Hunk #2 FAILED at 31.
Hunk #3 succeeded at 215 with fuzz 2 (offset 43 lines).
Hunk #4 FAILED at 330.
2 out of 4 hunks FAILED -- rejects in file kernel/trace/trace.h
patching file kernel/trace/trace_process.c
Patch patches/utrace-based-ftrace-process-engine-v2.patch does not apply (enforce with -f)

Thanks,

Ingo

2009-03-21 08:47:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sat, 21 Mar 2009 08:43:01 +0100 Ingo Molnar <[email protected]> wrote:

>
> * Roland McGrath <[email protected]> wrote:
>
> > From: Frank Ch. Eigler <[email protected]>
> >
> > This is v2 of the prototype utrace-ftrace interface. This code is
> > based on Roland McGrath's utrace API, which provides programmatic
> > hooks to the in-tree tracehook layer. This new patch interfaces
> > many of those events to ftrace, as configured by a small number of
> > debugfs controls. Here's the
> > /debugfs/tracing/process_trace_README:
>
> Please submit changes/enhancements to kernel/trace/* to the tracing
> tree maintainers (Steve and me) for review, testing and integration.
>
> Please also post patches against the latest tracing tree:
>
> http://people.redhat.com/mingo/tip.git/README

uhm, this patch depends on the (large) utrace patch, which is not kernel/trace
material.

> As this patch does not apply:
>
> Applying patch patches/utrace-based-ftrace-process-engine-v2.patch
> patching file include/linux/processtrace.h
> patching file kernel/trace/Kconfig
> Hunk #1 succeeded at 186 with fuzz 2 (offset 36 lines).
> patching file kernel/trace/Makefile
> Hunk #1 FAILED at 33.
> 1 out of 1 hunk FAILED -- rejects in file kernel/trace/Makefile
> patching file kernel/trace/trace.h
> Hunk #1 succeeded at 7 with fuzz 1.
> Hunk #2 FAILED at 31.
> Hunk #3 succeeded at 215 with fuzz 2 (offset 43 lines).
> Hunk #4 FAILED at 330.
> 2 out of 4 hunks FAILED -- rejects in file kernel/trace/trace.h
> patching file kernel/trace/trace_process.c
> Patch patches/utrace-based-ftrace-process-engine-v2.patch does not apply (enforce with -f)

The rejects are trivial.

2009-03-21 09:13:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Andrew Morton <[email protected]> wrote:

> On Sat, 21 Mar 2009 08:43:01 +0100 Ingo Molnar <[email protected]> wrote:
>
> >
> > * Roland McGrath <[email protected]> wrote:
> >
> > > From: Frank Ch. Eigler <[email protected]>
> > >
> > > This is v2 of the prototype utrace-ftrace interface. This code is
> > > based on Roland McGrath's utrace API, which provides programmatic
> > > hooks to the in-tree tracehook layer. This new patch interfaces
> > > many of those events to ftrace, as configured by a small number of
> > > debugfs controls. Here's the
> > > /debugfs/tracing/process_trace_README:
> >
> > Please submit changes/enhancements to kernel/trace/* to the tracing
> > tree maintainers (Steve and me) for review, testing and integration.
> >
> > Please also post patches against the latest tracing tree:
> >
> > http://people.redhat.com/mingo/tip.git/README
>
> uhm, this patch depends on the (large) utrace patch, which is not
> kernel/trace material.

The thing is, utrace crashes in Fedora have dominated kerneloops.org
for many months, so i'm not sure what to make of the idea of posting
a 4000+ lines of core kernel code patchset on the last day of the
development cycle, a posting that has carefully avoided the Cc:-ing
of affected maintainers ;-)

Utrace is very much tracing material - without the ftrace plugin the
whole utrace machinery is just something that provides a _ton_ of
hooks to something entirely external: SystemTap mainly.

kernel/utrace.c should probably be introduced as
kernel/trace/utrace.c not kernel/utrace.c. It also overlaps pending
work in the tracing tree and cooperation would be nice and desired.

The ftrace/utrace plugin is the only real connection utrace has to
the mainline kernel, so proper review by the tracing folks and
cooperation with the tracing folks is very much needed for the whole
thing.

Ingo

2009-03-21 11:30:58

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sat, 21 Mar 2009 10:12:35 +0100 Ingo Molnar <[email protected]> wrote:

>
> * Andrew Morton <[email protected]> wrote:
>
> > On Sat, 21 Mar 2009 08:43:01 +0100 Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Roland McGrath <[email protected]> wrote:
> > >
> > > > From: Frank Ch. Eigler <[email protected]>
> > > >
> > > > This is v2 of the prototype utrace-ftrace interface. This code is
> > > > based on Roland McGrath's utrace API, which provides programmatic
> > > > hooks to the in-tree tracehook layer. This new patch interfaces
> > > > many of those events to ftrace, as configured by a small number of
> > > > debugfs controls. Here's the
> > > > /debugfs/tracing/process_trace_README:
> > >
> > > Please submit changes/enhancements to kernel/trace/* to the tracing
> > > tree maintainers (Steve and me) for review, testing and integration.
> > >
> > > Please also post patches against the latest tracing tree:
> > >
> > > http://people.redhat.com/mingo/tip.git/README
> >
> > uhm, this patch depends on the (large) utrace patch, which is not
> > kernel/trace material.
>
> The thing is, utrace crashes in Fedora have dominated kerneloops.org
> for many months, so i'm not sure what to make of the idea of posting
> a 4000+ lines of core kernel code patchset on the last day of the
> development cycle, a posting that has carefully avoided the Cc:-ing
> of affected maintainers ;-)
>
> Utrace is very much tracing material - without the ftrace plugin the
> whole utrace machinery is just something that provides a _ton_ of
> hooks to something entirely external: SystemTap mainly.

Roland's changelogs don't mention systemtap at all afacit.

That was, umm, major information lossage.

> kernel/utrace.c should probably be introduced as
> kernel/trace/utrace.c not kernel/utrace.c. It also overlaps pending
> work in the tracing tree and cooperation would be nice and desired.
>
> The ftrace/utrace plugin is the only real connection utrace has to
> the mainline kernel, so proper review by the tracing folks and
> cooperation with the tracing folks is very much needed for the whole
> thing.

Actually it seems that the whole utrace-ftrace thing is a big distraction and
could/should just be omitted. This is a systemtap feature and should be viewed as
such.

This is all a bit weird.

2009-03-21 11:53:41

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Sat, Mar 21, 2009 at 04:19:54AM -0700, Andrew Morton wrote:
> [...]
> > Utrace is very much tracing material - without the ftrace plugin the
> > whole utrace machinery is just something that provides a _ton_ of
> > hooks to something entirely external: SystemTap mainly.
>
> Roland's changelogs don't mention systemtap at all afacit.
> That was, umm, major information lossage.

There have been many mixed messages from LKML on the topic - sometimes
mentioning systemtap is forbidden, other times necessary. Sorry about
that.

There are several non-systemtap clients in existence or under
development. You've may have heard of the ptrace cleanup, a
multi-client ptrace replacement, an on-the-fly core dumper, the ftrace
widget, user-space probes. All of these should have somewhat
compelling non-systemtap uses, if that's an important criterion.


> Actually it seems that the whole utrace-ftrace thing is a big
> distraction and could/should just be omitted. This is a systemtap
> feature and should be viewed as such. [...]

utrace is a better way to perform user thread management than what is
there now, and the utrace-ftrace widget shows how to *hook* thread
events such as syscalls in a lighter weight / more managed way than
the first one proposed. (That's one reason we've been participating
in the ftrace discussions.) Of course it can be made to use the fine
syscall pretty-printing code recently added.


- FChE

2009-03-21 12:14:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sat, 21 Mar 2009 07:51:41 -0400 "Frank Ch. Eigler" <[email protected]> wrote:

> Hi -
>
> On Sat, Mar 21, 2009 at 04:19:54AM -0700, Andrew Morton wrote:
> > [...]
> > > Utrace is very much tracing material - without the ftrace plugin the
> > > whole utrace machinery is just something that provides a _ton_ of
> > > hooks to something entirely external: SystemTap mainly.
> >
> > Roland's changelogs don't mention systemtap at all afacit.
> > That was, umm, major information lossage.
>
> There have been many mixed messages from LKML on the topic - sometimes
> mentioning systemtap is forbidden, other times necessary. Sorry about
> that.

heh. We all love systemtap and want it to get better.

> There are several non-systemtap clients in existence or under
> development. You've may have heard of the ptrace cleanup, a
> multi-client ptrace replacement, an on-the-fly core dumper, the ftrace
> widget, user-space probes. All of these should have somewhat
> compelling non-systemtap uses, if that's an important criterion.

Well I dunno. You guys are closer to this than I am, but I'd have thought
that systemtap is the main game here, and most/all of the above is just
fluff.

IOW, "this helps systemtap" is sufficient reason for merging a kernel
change. For sufficiently large values of "help", and sufficiently small
values of "eww", of course.



I have strong memories of being traumatised by reading the uprobes code.
What's the story on all of that nowadays?


>
> > Actually it seems that the whole utrace-ftrace thing is a big
> > distraction and could/should just be omitted. This is a systemtap
> > feature and should be viewed as such. [...]
>
> utrace is a better way to perform user thread management than what is
> there now, and the utrace-ftrace widget shows how to *hook* thread
> events such as syscalls in a lighter weight / more managed way than
> the first one proposed. (That's one reason we've been participating
> in the ftrace discussions.) Of course it can be made to use the fine
> syscall pretty-printing code recently added.

eh. Boring. Let's fix systemtap?

2009-03-21 12:58:51

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Sat, Mar 21, 2009 at 05:04:22AM -0700, Andrew Morton wrote:
> [...]
> > There have been many mixed messages from LKML on the topic - sometimes
> > mentioning systemtap is forbidden, other times necessary. Sorry about
> > that.
>
> heh. We all love systemtap and want it to get better.

Great!


> [...]
> I have strong memories of being traumatised by reading the uprobes
> code. What's the story on all of that nowadays?

uprobes, being a layer upon utrace that provides a kprobes-like
breakpointing API for user threads, is being refactored into several
parts. I don't know about the aesthetics of it all, but I believe the
general future plan is this:

One piece would perform machine code analysis (to classify
instructions for ideal/safe placement of breakpoints or for code
patching), and another thin layer that uses this and utrace to manage
user-space breakpoints. (Systemtap would interface at this point.)
Then a user-space syscallish interface could come along to expose this
to a super-ptrace client (to speed up gdb; perhaps to allow multiple
debuggers). Plus one might as well add an ftrace-engine for it
(directly analogous to the recent kprobe-based one that ftrace people
found "cool".)


> > > Actually it seems that the whole utrace-ftrace thing is a big
> > > distraction and could/should just be omitted. This is a systemtap
> > > feature and should be viewed as such. [...]
> >
> > utrace is a better way to perform user thread management than what is
> > there now, and the utrace-ftrace widget shows how to *hook* thread
> > events such as syscalls in a lighter weight / more managed way than
> > the first one proposed. (That's one reason we've been participating
> > in the ftrace discussions.) Of course it can be made to use the fine
> > syscall pretty-printing code recently added.
>
> eh. Boring. Let's fix systemtap?

There are several constituencies here, some of which find the above
exciting. That's OK and we'd like to help them too.


- FChE

2009-03-21 15:45:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Andrew Morton <[email protected]> wrote:

> [...] Let's fix systemtap?

Yes, it needs to be fixed.

The main issue i see is that no kernel developer i work with on a
daily basis uses SystemTap - and i work with a lot of people. Yes, i
could perhaps name two or three people from lkml using it, but its
average penetration amongst kernel folks is essentially zero.

Was any critical analysis done why that penetration is so absymally
low for a tool with such a promise and with years of availability,
and what are the measures planned to address those problems?

To me personally there are two big direct usability issues with
SystemTap:

1) It relies on DEBUG_INFO for any reasonable level of utility.
Yes, it will limp along otherwise as well, but most of the
actual novel capabilities depend on debuginfo. Which is an
acceptable constraint for enterprise usage where kernels are
switched every few months and having a debuginfo package is not
a big issue. Not acceptable for upstream kernel development. It
also puts way too trust into the compiler generating 1GB+ of
debuginfo correctly. I want to be able to rely on tools all the
time and thus i want tools to have some really simple and
predictable foundations.

2) It's not upstream and folks using it seem to insist on not
having it upstream ;-) This 'distance' to upstream seems to have
grown during the past few years - instead of shrinking. As a
result it simply does not matter and there's no know-how and no
visibility of it upstream.

If these fundamental problems are addressed then i'd even argue for
the totality of SystemTap to be aimed upstreamed (including the
scripting language, etc.), because for something this fundamental
there's just no good reason not to have a turn-key solution there.

Plus then there should be a (steadily growing) library of utility
scripts in the kernel proper as well.

Anything less does not make much sense IMO. Having a separate tool
will reduce efficiency, increases the latency of fixes and
enhancements and creates ABI-like expectations - which are all
counter-productive to good instrumentation.

These are the aspects of SystemTap that i have to say were never
done right, and these are the aspects of SystemTap that need to
change most. Putting utrace upstream now will just make it more
convenient to have SystemTap as a separate entity - without any of
the benefits. Do we want to do that? Maybe, but we could do better i
think.

Ingo

2009-03-21 20:32:57

by Diego Calleja

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On S?bado 21 Marzo 2009 16:45:01 Ingo Molnar escribi?:

> The main issue i see is that no kernel developer i work with on a
> daily basis uses SystemTap - and i work with a lot of people. Yes, i
> could perhaps name two or three people from lkml using it, but its
> average penetration amongst kernel folks is essentially zero.

What about userspace developers? People always talks of systemtap as
a kernel thing, but my (humble) impression is that kernel hackers don't
seem to need it that much (maybe for the same reasons they didn't a
kernel debugger ;), but userspace developers do. There're many
userspace projects that offer optional compile options to enable dtrace
probes (some people like apple even ship executables of python, perl
and ruby with probes by default). There're several firefox hackers that
switched to dtrace-capable systems just because the dtrace-javascript
probes enabled them to debug javashit code in ways they weren't able
in linux or windows.

In my humble opinion a better development environment for linux
userspace programmers is way more important than whether kernel
hackers like systemtap or not. So maybe the discussion should be less
about "does it help kernel hackers?" and more about "does it help
userspace hackers?". My 2?...

2009-03-21 21:43:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sat, 21 Mar 2009 16:45:01 +0100 Ingo Molnar <[email protected]> wrote:

>
> [...]
>

useful, thanks.

> Putting utrace upstream now will just make it more
> convenient to have SystemTap as a separate entity - without any of
> the benefits. Do we want to do that? Maybe, but we could do better i
> think.

It would not be good to merge a large kernel feature which kernel
developers and testers cannot test, and regression test.

If testing utrace against its main application requires installation of a
complete enterprise distro from a distro which the particular developer
might not prefer to use then that's quite a problem.

So it is desirable for this reason (and, I suspect, for other reasons) that
systemtap (or a part thereof) be dragged out in some standalone form which
is usable by random mortals.

IOW: I agree.

2009-03-21 21:51:33

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Sat, Mar 21, 2009 at 04:45:01PM +0100, Ingo Molnar wrote:
> [...]
> To me personally there are two big direct usability issues with
> SystemTap:
>
> 1) It relies on DEBUG_INFO for any reasonable level of utility.
> Yes, it will limp along otherwise as well, but most of the
> actual novel capabilities depend on debuginfo. Which is an
> acceptable constraint for enterprise usage where kernels are
> switched every few months and having a debuginfo package is not
> a big issue. Not acceptable for upstream kernel development.

In my own limited kernel-building experience, I find the debuginfo
data conveniently and instantly available after every "make". Can you
elaborate how is it harder for you to incidentally make it than for
someone to download it?


> It also puts way too trust into the compiler generating 1GB+ of
> debuginfo correctly. I want to be able to rely on tools all the
> time and thus i want tools to have some really simple and
> predictable foundations.

Well, the data has to come from *somewhere*. We know several
shortcomings (and have staff working on gcc debuginfo improvements),
but there is little alternative. If not from the compiler, where are
you going to get detailed type/structure layouts? Stack slot to
variable mappings? Statement-level PC addresses? Unwind data?


> 2) It's not upstream and folks using it seem to insist on not
> having it upstream ;-) This 'distance' to upstream seems to have
> grown during the past few years - instead of shrinking. [...]

Considering our upstream-bound assistance with foundation technologies
like markers, tracepoints, kprobes, utrace, and several other bits,
this does not seem entirely fair.


> If these fundamental problems are addressed then i'd even argue for
> the totality of SystemTap to be aimed upstreamed (including the
> scripting language, etc.), [...]

If consensus on this were plausible, we could seriously discuss it.

But I don't buy the package-deal that utrace must not attempt merging
on its own merits, just because it makes systemtap (as it is today)
useful to more people.


- FChE

2009-03-21 21:54:20

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -


On Sat, Mar 21, 2009 at 02:34:13PM -0700, Andrew Morton wrote:
> [...]
> It would not be good to merge a large kernel feature which kernel
> developers and testers cannot test, and regression test.

It does not. Other kernel self-sufficient utrace clients are on their
way, and of course one was just (re)posted.

> If testing utrace against its main application requires installation
> of a complete enterprise distro from a distro [...]

This has *never* been a requirement.


- FChE

2009-03-21 22:08:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2



On Sat, 21 Mar 2009, Frank Ch. Eigler wrote:
>
> > If testing utrace against its main application requires installation
> > of a complete enterprise distro from a distro [...]
>
> This has *never* been a requirement.

You guys are getting off a tangent.

Let's go back to the post that started this all.

> The thing is, utrace crashes in Fedora have dominated kerneloops.org
> for many months, so i'm not sure what to make of the idea of posting
> a 4000+ lines of core kernel code patchset on the last day of the
> development cycle, a posting that has carefully avoided the Cc:-ing
> of affected maintainers ;-)

.. and dammit, I agree 100%. If utrace really shows up in _any_ way on
kerneloops.org, then I think THE ENTIRE DISCUSSION in this thread is moot.

I'm not going to take known-bad crap. It's that simple. Don't bother
posting it, don't bother discussing it, don't bother making excuses for
it.

Linus

2009-03-21 22:22:53

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Sat, Mar 21, 2009 at 03:02:59PM -0700, Linus Torvalds wrote:
> [...]
> > The thing is, utrace crashes in Fedora have dominated kerneloops.org
> > for many months [...]
>
> .. and dammit, I agree 100%. If utrace really shows up in _any_ way on
> kerneloops.org, then I think THE ENTIRE DISCUSSION in this thread is moot.

There was a short span of time during last fall, when Roland was on
vacation. That bug (in 2.6.26.3) was fixed during the kernel summit.
So this is a six-month obsolete grievance.

- FChE

2009-03-21 22:30:53

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sat, Mar 21, 2009 at 06:20:30PM -0400, Frank Ch. Eigler wrote:
> On Sat, Mar 21, 2009 at 03:02:59PM -0700, Linus Torvalds wrote:
> > [...]
> > > The thing is, utrace crashes in Fedora have dominated kerneloops.org
> > > for many months [...]
> >
> > .. and dammit, I agree 100%. If utrace really shows up in _any_ way on
> > kerneloops.org, then I think THE ENTIRE DISCUSSION in this thread is moot.
>
> There was a short span of time during last fall, when Roland was on
> vacation. That bug (in 2.6.26.3) was fixed during the kernel summit.
> So this is a six-month obsolete grievance.

struct task_struct::utrace became embedded struct. This is good and
should remove quite a few of utrace bugs. Better late than never.

However, "rewrite-ptrace-via-utrace" patch was omitted, so almost noone
can easily see by how much situation improved.

I see this patch was dropped in Fedora.

Will ptrace(2) will be rewritten through utrace?

2009-03-21 23:40:48

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Sun, Mar 22, 2009 at 01:37:59AM +0300, Alexey Dobriyan wrote:
> [...]
> struct task_struct::utrace became embedded struct. This is good and
> should remove quite a few of utrace bugs. Better late than never.

Yeah.

> However, "rewrite-ptrace-via-utrace" patch was omitted, so almost
> noone can easily see by how much situation improved. [...] Will
> ptrace(2) will be rewritten through utrace?

Yes, I believe that is Roland's intent. I believe it was separated
from the current suite of patches for staging purposes, to merge the
most solid code up first. The code is available from the utrace git
tree in the utrace-ptrace branch.

- FChE

2009-03-22 10:26:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Frank Ch. Eigler <[email protected]> wrote:

> Hi -
>
> On Sun, Mar 22, 2009 at 01:37:59AM +0300, Alexey Dobriyan wrote:
> > [...]
> > struct task_struct::utrace became embedded struct. This is good and
> > should remove quite a few of utrace bugs. Better late than never.
>
> Yeah.
>
> > However, "rewrite-ptrace-via-utrace" patch was omitted, so
> > almost noone can easily see by how much situation improved.
> > [...] Will ptrace(2) will be rewritten through utrace?
>
> Yes, I believe that is Roland's intent. I believe it was
> separated from the current suite of patches for staging purposes,
> to merge the most solid code up first. The code is available from
> the utrace git tree in the utrace-ptrace branch.

i think they should be submitted together.

Here's the histogram of utrace bugs on kerneloops.org:

2.6.27.5 1 x
2.6.27.15 1 x
2.6.27.12 2 x
2.6.27-rc4 2 x
2.6.26.6 1 x
2.6.26.5 43 x
2.6.26.3 1102 x
2.6.26.2 2 x
2.6.26.1 3 x
2.6.26 1 x
2.6.25 3 x

That peak in 2.6.26.3 is what i referred to. The latest F10 kernel
rpm is kernel-2.6.27.12-170.2.5.fc10, and it does include the
utrace-ptrace engine as well:

# grep UTRACE /boot/config-2.6.27.19-170.2.35.fc10.i686
CONFIG_UTRACE=y
CONFIG_UTRACE_PTRACE=y

So the bug i referred to was fixed and the bug count has gone down -
but still we have the utrace core submission here without any
(tested) mainline kernel usage of the core code.

My suggestion would be to:

- submit the ptrace-on-utrace engine as well (with Oleg's signoff?)

- perhaps also submit with a well-tested ftrace plugin that tries
to utilize _all_ aspects of utrace and ftrace (and hence gives
good and continuous burn-in testing via the ftrace bootup
self-tests, etc.)

ideally we want both, because:

- tracing corner-case bugs tend to be found much faster than ptrace
corner case bugs - partly because tracing is much more invasive
when activated system-wide.

- ptrace-over-utrace on the other hand utilizes utrace more deeply
than passive tracing ever can. (for example UML does full,
active virtualization via ptrace - this depth of functional
utrace usage is not possible via a tracing plugin.)

And i think the ptrace-via-utrace engine is actually fully ready,
just perhaps it was not submitted out of caution to keep the
logistics simple.

So i do think we've still got a shot at merging it, in this merge
window.

Ingo

2009-03-22 12:09:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Frank Ch. Eigler <[email protected]> wrote:

> Hi -
>
> On Sat, Mar 21, 2009 at 04:45:01PM +0100, Ingo Molnar wrote:
> > [...]
> > To me personally there are two big direct usability issues with
> > SystemTap:
> >
> > 1) It relies on DEBUG_INFO for any reasonable level of utility.
> > Yes, it will limp along otherwise as well, but most of the
> > actual novel capabilities depend on debuginfo. Which is an
> > acceptable constraint for enterprise usage where kernels are
> > switched every few months and having a debuginfo package is not
> > a big issue. Not acceptable for upstream kernel development.
>
> In my own limited kernel-building experience, I find the debuginfo
> data conveniently and instantly available after every "make". Can
> you elaborate how is it harder for you to incidentally make it
> than for someone to download it?

Four reasons:

1)

I have CONFIG_DEBUG_INFO turned off in 99.9% of my kernel builds,
because it slows down the kernel build times significantly:

without: 4343.31 user 416.39 system 6:09.97 elapsed 1286%CPU
with: 4871.07 user 501.90 system 7:43.22 elapsed 1159 %CPU

( x86 allyesconfig. On an obscenely overpowered Nehalem box
with 12 GB of RAM. )

2)

When the kernel build becomes IO-bound, for example when i build
over a distcc cluster (which is how i generally build my kernels) -
or when others with less RAM build a debuginfo kernel, the ratio
becomes even worse:

without: 870.36 user 292.79 system 3:32.10 elapsed 548% CPU
with: 929.65 user 384.55 system 8:28.70 elapsed 258% CPU

3)

Another metric. Here's an x86 defconfig (i.e. fairly regular config
- not allyesconfig) build's size:

with: 1645 MB
without: 211 MB

Try to build 1.6 GB of dirty data on ext3 and run into an fsync() in
your editor ... you'll sit there twiddling thumbs for a minute or
more.

4)

Or yet another metric - Linux distro package overhead. Try
installing a debuginfo package:

# yum install kernel-debuginfo

==========================================
Package Arch Version
==========================================
Installing:
kernel-debuginfo x86_64 2.6.29-0.258.rc8.git2.fc11
rawhide-debuginfo 294 M
Installing for dependencies:
kernel-debuginfo-common x86_64 2.6.29-0.258.rc8.git2.fc11
rawhide-debuginfo 35 M

Total download size: 329 M

That size of a _compressed_ debuginfo kernel package is obscene. We
can fit 4 years of full Linux kernel Git history into that size -
60,000+ commits, full metadata and full 20 million lines of code
flux included!

Uncompressed it blows up to gigabytes of on-disk data.

And this download has to be repeated for _every_ minor kernel
upgrade.

So when i come into a situation where i could use some debugging
help ... i'd have to rebuild the kernel with DEBUG_INFO=y and i'll
always notice when i have a debuginfo kernel because it's
inconvenient.

The solution?)

Dunno - but i definitely think we should think bigger:

The fundamental disconnect i believe seems to come from the fact
that most user-space projects are relatively small, so debuginfo
bloat is a secondary issue there.

But for a project with the size of the kernel, even for moderate
builds (not allyesconfig), it's a _much_ bigger deal. This has been
known for a long time and the situation has become worse over the
last two years, not better. (last time i checked the debuginfo
package overhead it was below 150 MB)

A few random ideas:

Instead of trying to cache 2+GB of debuginfo for a 50 MB kernel
source repo (+50 MB of genuine .o output) - just to be able to debug
one or two source files [which is the typical scope of a debugging
session], why not build debuginfo on the fly, when a debugging
session requires it? Rarely do we need debuginfo for more than a
fraction of the whole kernel.

( Yes, it needs a few smarts like knowing the SHA1 of the source
code module that a particular kernel portion got built with, to
make sure the debuginfo is fresh and relevant - but nothing major. )

I mean, lets _use_ the fact that we have source code available, more
intelligently. It takes zero time to build detailed debuginfo for a
portion of a tree.

If 'download debuginfo' can be replaced with: 'have a recent Git
repository of the distro kernel source', we'll have a _much_ more
efficient use of resources all around.

Ingo

2009-03-22 12:18:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Diego Calleja <[email protected]> wrote:

> On S?bado 21 Marzo 2009 16:45:01 Ingo Molnar escribi?:
>
> > The main issue i see is that no kernel developer i work with on a
> > daily basis uses SystemTap - and i work with a lot of people. Yes, i
> > could perhaps name two or three people from lkml using it, but its
> > average penetration amongst kernel folks is essentially zero.
>
> What about userspace developers? People always talks of systemtap
> as a kernel thing, but my (humble) impression is that kernel
> hackers don't seem to need it that much (maybe for the same
> reasons they didn't a kernel debugger ;), but userspace developers
> do. There're many userspace projects that offer optional compile
> options to enable dtrace probes (some people like apple even ship
> executables of python, perl and ruby with probes by default).
> There're several firefox hackers that switched to dtrace-capable
> systems just because the dtrace-javascript probes enabled them to
> debug javashit code in ways they weren't able in linux or windows.
>
> In my humble opinion a better development environment for linux
> userspace programmers is way more important than whether kernel
> hackers like systemtap or not. So maybe the discussion should be
> less about "does it help kernel hackers?" and more about "does it
> help userspace hackers?". My 2?...

Well, i consider kernel development to be just another form of
software development, so i dont subscribe to the view that it is
intrinsically different. (Yes, the kernel has many unique aspects -
but most software projects have unique aspects.)

In terms of development methodology and tools, in fact i claim that
the kernel workflow and style of development can be applied to most
user-space software projects with great success.

So ... if a new development tool is apparently not (yet?) suited to
a very large and sanely developed software project like the Linux
kernel, i dont take that as an encouraging sign.

Also, there's practical aspects: the kernel is what we know best so
if we can make it work well for the kernel, hopes are that other
large projects can use it too. If we _only_ make the tool good for
non-kernel purposes, who else will fix it for the kernel? The
icentive to fix it for the kernel will be significantly lower.

Ingo

2009-03-22 12:38:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Linus Torvalds <[email protected]> wrote:

> On Sat, 21 Mar 2009, Frank Ch. Eigler wrote:
> >
> > > If testing utrace against its main application requires installation
> > > of a complete enterprise distro from a distro [...]
> >
> > This has *never* been a requirement.
>
> You guys are getting off a tangent.
>
> Let's go back to the post that started this all.
>
> > The thing is, utrace crashes in Fedora have dominated kerneloops.org
> > for many months, so i'm not sure what to make of the idea of posting
> > a 4000+ lines of core kernel code patchset on the last day of the
> > development cycle, a posting that has carefully avoided the Cc:-ing
> > of affected maintainers ;-)
>
> .. and dammit, I agree 100%. If utrace really shows up in _any_
> way on kerneloops.org, then I think THE ENTIRE DISCUSSION in this
> thread is moot.
>
> I'm not going to take known-bad crap. It's that simple. Don't
> bother posting it, don't bother discussing it, don't bother making
> excuses for it.

The kerneloops stats on utrace crashes are way down currently,
after that peak last fall. So i didnt want to suggest that it's
known-broken now - i only wanted to point out that it's a
known-risky area and that the submission of it should involve
the affected maintainers/developers.

Regarding current stability, Roland, Frank, is the utrace patch in
latest (today's) Fedora rawhide:

-rw-r--r-- 1 root root 176555 2009-01-08 05:42 linux-2.6-utrace.patch

a bug fixed equivalent of the utrace bits that crashed in the
2.6.26.3 kernel? In that case it is certainly known-good.

Or is it a slimmed-down version?

The ptrace bits and signoffs from Oleg and Alexey would certainly
help (me) in trusting it. (I've Cc:-ed Oleg and Alexey)

The ftrace bits could certainly be staged to go in via the tracing
tree (in .31 or so) after the utrace-core+ptrace bits went upstream.

Ingo

2009-03-22 12:54:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Ingo Molnar <[email protected]> wrote:

> Total download size: 329 M
>
> That size of a _compressed_ debuginfo kernel package is obscene.
> We can fit 4 years of full Linux kernel Git history into that size
> - 60,000+ commits, full metadata and full 20 million lines of code
> flux included!
>
> Uncompressed it blows up to gigabytes of on-disk data.
>
> And this download has to be repeated for _every_ minor kernel
> upgrade.

Have to correct my memories about how many commits the kernel repo
has: 132,019 commits. That massive history fits into 298 MB. (!)

Ingo

2009-03-23 05:11:19

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

> Well I dunno. You guys are closer to this than I am, but I'd have thought
> that systemtap is the main game here, and most/all of the above is just
> fluff.

That is certainly not true for me. It is true that I hear plenty from
systemtap developers, users, and boosters wanting utrace to be merged.
But all that "fluff" you dismiss out of hand is what I would really like
to see become reality. Pretty much the only people who ever tell me
they would hack on those things are the ones who say, "I'm looking
forward to utrace getting merged in so I can try to write something."

> eh. Boring. [...]

Since it's boring to you, it must be so boring to everyone that they
have no interest in a platform they can use to do exciting things with.
Great. Silly me trying to enable collaboration to produce things less
boring than I'm capable of myself. Clearly there is no need for any
such thing. Sorry I'm so out of touch, but I just thought it was cool.


Thanks,
Roland

2009-03-23 05:23:09

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

> Yes, I believe that is Roland's intent. I believe it was separated
> from the current suite of patches for staging purposes, to merge the
> most solid code up first. The code is available from the utrace git
> tree in the utrace-ptrace branch.

More than just "staging". The utrace-ptrace code there today is really not
very nice to look at, and it's not ready for prime time. As has been
mentioned, it is a "pure clean-up exercise". As such, it's not the top
priority. It also didn't seem to me like much of an argument for merging
utrace: "Look, more code and now it still does the same thing!"

Instead, I figured we should merge utrace in a way that lets everybody beat
on it for new features and hash out details, without immediate risk of
regressions in ptrace (which are severely annoying to everyone when they
happen). The proper clean-ups for ptrace can proceed in parallel with work
using utrace for things that are actually new and interesting, and at least
the first half of that clean-up work is orthogonal to utrace.

This seems like the normal way that new optional CONFIG_FOOBAR features
(marked EXPERIMENTAL, even) are handled. We don't jump over ourselves to
make existing crucial code paths subject to a new subsystem that is getting
its first rounds of shake-out.


Thanks,
Roland

2009-03-23 05:23:40

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

> kernel/utrace.c should probably be introduced as
> kernel/trace/utrace.c not kernel/utrace.c. It also overlaps pending
> work in the tracing tree and cooperation would be nice and desired.

Of course I would like to cooperate with everyone. And of course it does
not really matter much where a source file lives. But IMHO utrace really
does not fit in with the kernel/trace/ code much at all. Sure, its hooks
can be used by tracer implementations that use CONFIG_TRACING stuff. But
it is a general API about user thread state. It belongs in kernel/trace/
"naturally" far less than, say, kprobes. utrace will in future be used to
implement userland features (ptrace et al) that are just aspects of the
basics of what an operating system does: mediate userland for userland.
Those uses will have nothing to do with "kernel tracing".


Thanks,
Roland

2009-03-23 05:35:47

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

> And i think the ptrace-via-utrace engine is actually fully ready,
> just perhaps it was not submitted out of caution to keep the
> logistics simple.

That's not so. There is a clumsy prototype version. Much of the work to
do it properly is really just plain ptrace clean-up and not specifically
about using utrace. Oleg and I are ready to work on it as soon as our time
is not monopolized by trying to get the core utrace code to be accepted.

This ptrace work really buys nothing with immediate pay-off at all. It's a
real shame if its lack keeps people from actually looking at utrace itself.
(This has been a long conversation so far with zero discussion of the code.)
A collaboration with focus on what new things can be built, rather than on
reasons not to let the foundations be poured, would be a lovely thing.


Thanks,
Roland

2009-03-23 06:35:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


* Roland McGrath <[email protected]> wrote:

> > kernel/utrace.c should probably be introduced as
> > kernel/trace/utrace.c not kernel/utrace.c. It also overlaps pending
> > work in the tracing tree and cooperation would be nice and desired.
>
> Of course I would like to cooperate with everyone. And of course
> it does not really matter much where a source file lives. But
> IMHO utrace really does not fit in with the kernel/trace/ code
> much at all. Sure, its hooks can be used by tracer
> implementations that use CONFIG_TRACING stuff. But it is a
> general API about user thread state. It belongs in kernel/trace/
> "naturally" far less than, say, kprobes. utrace will in future be
> used to implement userland features (ptrace et al) that are just
> aspects of the basics of what an operating system does: mediate
> userland for userland. Those uses will have nothing to do with
> "kernel tracing".

But it is fitting if you think of kernel/trace/ as
kernel/instrumentation/.

The virtualization-alike uses for utrace are in essence using system
call instrumentation callbacks to inject extra functionality into
the system. That's possible not because it's primarily geared at
doing that, but because the instrumentation callbacks are generic
and complete enough. It's still correct to think of it as an
instrumentation tool and to maintain it as such. That also makes it
clear that none of these APIs are to be regarded permanent ABIs.

Anyway ... placement is no big deal, and kernel/utrace.c is
certainly a good way of avoiding the tracing tree ;-)

Ingo

2009-03-23 13:41:16

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sun, Mar 22, 2009 at 01:37:49PM +0100, Ingo Molnar wrote:
> The ptrace bits and signoffs from Oleg and Alexey would certainly
> help (me) in trusting it. (I've Cc:-ed Oleg and Alexey)

The further utrace stays away from mainline, the better.
That's from my experience with this code.

But let's see how ptrace(2) rewrite will look like because this is
the biggest thing that matters. All those cool virtual machines,
fancy tracers and what not aren't even comparable.

Right now with ptrace(2) rewrite the following is easily triggerable:

WARNING: at kernel/ptrace.c:515 ptrace_report_signal+0x2c1/0x2d0()
Pid: 4814, comm: exe Not tainted 2.6.29-rc8-utrace #1
Call Trace:
[<c0126df1>] warn_slowpath+0x81/0xa0
[<c014c359>] ? validate_chain+0xe9/0x1150
[<c014d606>] ? __lock_acquire+0x246/0xa50
[<c0232959>] ? __delay+0x9/0x10
[<c014b8eb>] ? mark_held_locks+0x6b/0x80
[<c03d3dd2>] ? _spin_unlock_irq+0x22/0x50
[<c012fdd1>] ptrace_report_signal+0x2c1/0x2d0
[<c012fb10>] ? ptrace_report_signal+0x0/0x2d0
[<c0154a79>] utrace_get_signal+0x1c9/0x660
[<c0135478>] get_signal_to_deliver+0x288/0x330
[<c01029e9>] do_notify_resume+0xb9/0x890
[<c017edd2>] ? cache_free_debugcheck+0x232/0x2f0
[<c014957b>] ? trace_hardirqs_off+0xb/0x10
[<c03d3d79>] ? _spin_unlock_irqrestore+0x39/0x70
[<c01015a0>] ? sys_execve+0x40/0x60
[<c017f139>] ? kmem_cache_free+0x89/0xc0
[<c014baad>] ? trace_hardirqs_on_caller+0xfd/0x190
[<c014bb4b>] ? trace_hardirqs_on+0xb/0x10
[<c010340a>] work_notifysig+0x13/0x19

It looks like WARN_ON is just bogus, but who knows.

2009-03-23 15:19:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On 03/23, Alexey Dobriyan wrote:
>
> Right now with ptrace(2) rewrite the following is easily triggerable:
>
> WARNING: at kernel/ptrace.c:515 ptrace_report_signal+0x2c1/0x2d0()

Yes, ptrace-over-utrace needs more work. But your message looks as if
utrace core is buggy, imho this is a bit unfair ;)

As Roland said, ptrace-over-utrace is not ready yet. If you mean that
utrace core should not be merged alone - this is another story.

But personally I understand why Roland sends utrace core before changing
ptrace.

Oleg.

2009-03-23 20:28:14

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Sun, Mar 22, 2009 at 01:08:11PM +0100, Ingo Molnar wrote:
> [...]
> > In my own limited kernel-building experience, I find the debuginfo
> > data conveniently and instantly available after every "make". Can
> > you elaborate how is it harder for you to incidentally make it
> > than for someone to download it?
>
> Four reasons:
>
> 1)
>
> I have CONFIG_DEBUG_INFO turned off in 99.9% of my kernel builds,
> because it slows down the kernel build times significantly: [...]

OK, 15% longer time.

> 2)
>
> When the kernel build becomes IO-bound [...]
> without: 870.36 user 292.79 system 3:32.10 elapsed 548% CPU
> with: 929.65 user 384.55 system 8:28.70 elapsed 258% CPU

OK, lots of network traffic.

> 3) [...]
> Try to build 1.6 GB of dirty data on ext3 and run into an fsync() in
> your editor ... you'll sit there twiddling thumbs for a minute or
> more.

Now don't go blaming us for ext3 & fsync & not having a low enough
/proc/sys/vm/dirty_background_ratio.


> 4)
> Or yet another metric - Linux distro package overhead. Try
> installing a debuginfo package: [...]

Same as 3).


> And this download has to be repeated for _every_ minor kernel
> upgrade.

Actually, no. If you just want to run the newly built kernel
somewhere else on your network, you can run a systemtap compile server
on your build machine, and let the systemtap network client do ~RPCs
to get at the data.


> The solution?)
>
> Dunno - but i definitely think we should think bigger:
>
> The fundamental disconnect i believe seems to come from the fact
> that most user-space projects are relatively small, so debuginfo
> bloat is a secondary issue there.

(Well, the fedora debuginfo archive shows a couple of packages of
similar or greater heft than the kernel - openoffice.org, qt3, ...)


> A few random ideas:
>
> [...] why not build debuginfo on the fly, when a debugging
> session requires it? Rarely do we need debuginfo for more than a
> fraction of the whole kernel. [...]
> I mean, lets _use_ the fact that we have source code available, more
> intelligently. It takes zero time to build detailed debuginfo for a
> portion of a tree. [...]

Something like that might be made to work.

Note that this backs away from earlier claims that we can make do
without debuginfo, or that the compiler can't be trusted to build the
stuff. We all agree it'd be nice if made it better and made a little
less.


- FChE

2009-03-23 20:45:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2



On Mon, 23 Mar 2009, Frank Ch. Eigler wrote:
> > I have CONFIG_DEBUG_INFO turned off in 99.9% of my kernel builds,
> > because it slows down the kernel build times significantly: [...]
>
> OK, 15% longer time.

It's way more than that if you don't have tons of memory and excessive
amounts of CPU power.

> > 2)
> >
> > When the kernel build becomes IO-bound [...]
> > without: 870.36 user 292.79 system 3:32.10 elapsed 548% CPU
> > with: 929.65 user 384.55 system 8:28.70 elapsed 258% CPU
>
> OK, lots of network traffic.

This is the _normal_ situation for a debug info build. If it's not
network traffic (distcc), it's just disk IO. Have you tried it on a
laptop? Ingo is not the only one that turns off DEBUG_INFO in disgust.
It's totally useless for any sane kernel developer - the costs are
excessive.

Adn that's totally ignoring the disk usage of multiple debug info kernels.

> Note that this backs away from earlier claims that we can make do
> without debuginfo, or that the compiler can't be trusted to build the
> stuff. We all agree it'd be nice if made it better and made a little
> less.

Gaah. I'd wish you all agreed that DEBUG_INFO is just TOTALLY UNREALISTIC.

Linus

2009-03-23 21:45:21

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Mon, Mar 23, 2009 at 04:14:00PM +0100, Oleg Nesterov wrote:
>
> Yes, ptrace-over-utrace needs more work. But your message looks as if
> utrace core is buggy, imho this is a bit unfair ;)
>
> As Roland said, ptrace-over-utrace is not ready yet. If you mean that
> utrace core should not be merged alone - this is another story.
>
> But personally I understand why Roland sends utrace core before changing
> ptrace.

Yes, but if it's going to be merged this during 2.6.x cycle, we need
to have a user for the kernel interface along with the new kernel
interface. This is true for any body trying to add some new
infrastructure to the kernel; you have to have an in-tree user of said
interface.

I mean, if some device manufacturer were to go to Red Hat's kernel
team, and say, "we need this interface for our uber expensive RDMA
interface card", and there was no in-kernel user for the interface, we
know what Red Hat would tell that device manufacturer, right? So why
is the SystemTap team trying to get an exception for utrace? It just
seems a little hypocritical.

So what about the ftrace user of utrace? Is that ready to be merged?

- Ted

Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Sat, Mar 21, 2009 at 05:04:22AM -0700, Andrew Morton wrote:
> On Sat, 21 Mar 2009 07:51:41 -0400 "Frank Ch. Eigler" <[email protected]> wrote:
> > On Sat, Mar 21, 2009 at 04:19:54AM -0700, Andrew Morton wrote:
>
> I have strong memories of being traumatised by reading the uprobes code.

That was a long time ago wasn't it? :-)

That approach was a carry over from an implementation from dprobes that
used readdir hooks. Yes, that was not the most elegant approach, as such
has long been shelved.

> What's the story on all of that nowadays?

Utrace makes implementing uprobes more cleaner. We have a prototype that
implements uprobes over utrace. Its per process, doesn't use any
in-kernel hooks, etc. It currently has a kprobes like interface (needs a
kernel module), but it shouldn't be difficult to adapt it to use
utrace's user interfaces (syscalls?) when those come around. The current
generation of uprobes that has all the bells and whistles can be found at
http://sources.redhat.com/git/gitweb.cgi?p=systemtap.git;a=tree;f=runtime/uprobes2

However, there are aspects of the current uprobes that can be useful to
any other userspace tracer: instruction analysis, breakpoint insertion
and removal, single-stepping support. With these layered on top of
utrace, building userspace debug/trace tools that depend on utrace
should be easier, outside of ptrace.

Work is currently on to factor these layers out. The intention is to
upstream all the bits required for userspace tracing once utrace gets
in, along with an easy to use interface for userspace developers
(a /proc or /debugfs interface?) -- one should be able to use it on
its own or with SystemTap, whatever they prefer. Details are still hazy
at the moment.

But, utrace is the foundation to do all of that.

Ananth

2009-03-24 06:06:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Tue, 24 Mar 2009 10:59:26 +0530 Ananth N Mavinakayanahalli <[email protected]> wrote:

> On Sat, Mar 21, 2009 at 05:04:22AM -0700, Andrew Morton wrote:
> > On Sat, 21 Mar 2009 07:51:41 -0400 "Frank Ch. Eigler" <[email protected]> wrote:
> > > On Sat, Mar 21, 2009 at 04:19:54AM -0700, Andrew Morton wrote:
> >
> > I have strong memories of being traumatised by reading the uprobes code.
>
> That was a long time ago wasn't it? :-)
>
> That approach was a carry over from an implementation from dprobes that
> used readdir hooks. Yes, that was not the most elegant approach, as such
> has long been shelved.
>
> > What's the story on all of that nowadays?
>
> Utrace makes implementing uprobes more cleaner. We have a prototype that
> implements uprobes over utrace. Its per process, doesn't use any
> in-kernel hooks, etc. It currently has a kprobes like interface (needs a
> kernel module), but it shouldn't be difficult to adapt it to use
> utrace's user interfaces (syscalls?) when those come around. The current
> generation of uprobes that has all the bells and whistles can be found at
> http://sources.redhat.com/git/gitweb.cgi?p=systemtap.git;a=tree;f=runtime/uprobes2
>
> However, there are aspects of the current uprobes that can be useful to
> any other userspace tracer: instruction analysis, breakpoint insertion
> and removal, single-stepping support. With these layered on top of
> utrace, building userspace debug/trace tools that depend on utrace
> should be easier, outside of ptrace.
>
> Work is currently on to factor these layers out. The intention is to
> upstream all the bits required for userspace tracing once utrace gets
> in, along with an easy to use interface for userspace developers
> (a /proc or /debugfs interface?) -- one should be able to use it on
> its own or with SystemTap, whatever they prefer. Details are still hazy
> at the moment.
>
> But, utrace is the foundation to do all of that.
>

The sticking point was uprobes's modification of live pagecache. We said
"ick, COW the pages" and you said "too expensive". And there things
remained.

Is that all going to happen again?

Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Mon, Mar 23, 2009 at 10:54:09PM -0700, Andrew Morton wrote:
> On Tue, 24 Mar 2009 10:59:26 +0530 Ananth N Mavinakayanahalli <[email protected]> wrote:
>
> > On Sat, Mar 21, 2009 at 05:04:22AM -0700, Andrew Morton wrote:
> > > On Sat, 21 Mar 2009 07:51:41 -0400 "Frank Ch. Eigler" <[email protected]> wrote:
> > > > On Sat, Mar 21, 2009 at 04:19:54AM -0700, Andrew Morton wrote:
> > >
> > > I have strong memories of being traumatised by reading the uprobes code.
> >
> > That was a long time ago wasn't it? :-)
> >
> > That approach was a carry over from an implementation from dprobes that
> > used readdir hooks. Yes, that was not the most elegant approach, as such
> > has long been shelved.
> >
> > > What's the story on all of that nowadays?
> >
> > Utrace makes implementing uprobes more cleaner. We have a prototype that
> > implements uprobes over utrace. Its per process, doesn't use any
> > in-kernel hooks, etc. It currently has a kprobes like interface (needs a
> > kernel module), but it shouldn't be difficult to adapt it to use
> > utrace's user interfaces (syscalls?) when those come around. The current
> > generation of uprobes that has all the bells and whistles can be found at
> > http://sources.redhat.com/git/gitweb.cgi?p=systemtap.git;a=tree;f=runtime/uprobes2
> >
> > However, there are aspects of the current uprobes that can be useful to
> > any other userspace tracer: instruction analysis, breakpoint insertion
> > and removal, single-stepping support. With these layered on top of
> > utrace, building userspace debug/trace tools that depend on utrace
> > should be easier, outside of ptrace.
> >
> > Work is currently on to factor these layers out. The intention is to
> > upstream all the bits required for userspace tracing once utrace gets
> > in, along with an easy to use interface for userspace developers
> > (a /proc or /debugfs interface?) -- one should be able to use it on
> > its own or with SystemTap, whatever they prefer. Details are still hazy
> > at the moment.
> >
> > But, utrace is the foundation to do all of that.
> >
>
> The sticking point was uprobes's modification of live pagecache. We said
> "ick, COW the pages" and you said "too expensive". And there things
> remained.
>
> Is that all going to happen again?

No. All modifications are via access_process_vm().

Ananth

2009-03-30 22:27:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2


So we need to work out what to do about utrace and I feel a need to hit
the reset button on all this. Largely because I've forgotten
everything and it was all confusing anyway.

Could those who object to utrace please pipe up and summarise their
reasons?


Just to kick the can down the road a bit I merged the first two
patches. The ftrace patch merged about as (un)successfully as one would
expect.

2009-03-30 22:54:35

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Mon, Mar 30, 2009 at 03:18:44PM -0700, Andrew Morton wrote:
> So we need to work out what to do about utrace and I feel a need to hit
> the reset button on all this. [...]

Thanks.

> [...] The ftrace patch merged about as (un)successfully as one would

A new version against -tip is coming by in a few days.

- FChE

2009-03-31 09:17:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Mon, 2009-03-30 at 15:18 -0700, Andrew Morton wrote:
> So we need to work out what to do about utrace and I feel a need to hit
> the reset button on all this. Largely because I've forgotten
> everything and it was all confusing anyway.

Right, from my POV something like utrace is desirable, since its
basically a huge multiplexer for the debugger state, eventually allowing
us to have multiple debuggers attached to the same process.

So in that respect its a very nice feature.

> Could those who object to utrace please pipe up and summarise their
> reasons?

Christoph used to have an opinion on this matter, so I've added him to
the CC.

Last time when I looked at the code, it needed a bit more care and
comments wrt lifetimes and such. I know Roland has done a lot on that
front -- so I'll need to re-inspect.

As to in-kernel users, currently we only have ptrace, and no full
conversion to utrace is in a mergeable shape afaik.

UML (Jeff CC'ed) might want to use this.

I know the Systemtap people need this (fche). But that isn't really
moving towards mainline any time soon afaict.

Then there is this little thing called frysk which uses it, no idea what
kind of kernel space that needs, nor where it lives -- or for that
matter, wth it really does ;-)


Anyway, long story short, once people have had a little time to go over
the code, and a few in-kernel users are lined-up, I think we should
consider merging it.

2009-03-31 11:28:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Tue, 2009-03-31 at 11:17 +0200, Peter Zijlstra wrote:
> On Mon, 2009-03-30 at 15:18 -0700, Andrew Morton wrote:
> > So we need to work out what to do about utrace and I feel a need to hit
> > the reset button on all this. Largely because I've forgotten
> > everything and it was all confusing anyway.
>
> Right, from my POV something like utrace is desirable, since its
> basically a huge multiplexer for the debugger state, eventually allowing
> us to have multiple debuggers attached to the same process.
>
> So in that respect its a very nice feature.
>
> > Could those who object to utrace please pipe up and summarise their
> > reasons?
>
> Christoph used to have an opinion on this matter, so I've added him to
> the CC.
>
> Last time when I looked at the code, it needed a bit more care and
> comments wrt lifetimes and such. I know Roland has done a lot on that
> front -- so I'll need to re-inspect.
>
> As to in-kernel users, currently we only have ptrace, and no full
> conversion to utrace is in a mergeable shape afaik.
>
> UML (Jeff CC'ed) might want to use this.
>
> I know the Systemtap people need this (fche). But that isn't really
> moving towards mainline any time soon afaict.
>
> Then there is this little thing called frysk which uses it, no idea what
> kind of kernel space that needs, nor where it lives -- or for that
> matter, wth it really does ;-)

And Frank reminded me we have an ftrace tracer that utilizes utrace.

> Anyway, long story short, once people have had a little time to go over
> the code, and a few in-kernel users are lined-up, I think we should
> consider merging it.

2009-03-31 11:40:55

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

Hi -

On Tue, Mar 31, 2009 at 11:17:42AM +0200, Peter Zijlstra wrote:

> [...] Right, from my POV something like utrace is desirable, since
> its basically a huge multiplexer for the debugger state, eventually
> allowing us to have multiple debuggers attached to the same process.
> [...]

Right.

> Then there is this little thing called frysk which uses it, no idea
> what kind of kernel space that needs, nor where it lives -- or for
> that matter, wth it really does ;-)

Frysk was to be a first user of such an improved ptrace(2) API in
order to do the sort of background / multiply-connected debugging, but
that project has been on indefinite hold for about a year. Instead,
there are experiments under way to extend gdb's backend for that
capability.

- FChE

2009-03-31 16:25:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

On Tue, Mar 31, 2009 at 11:17:42AM +0200, Peter Zijlstra wrote:
> > Could those who object to utrace please pipe up and summarise their
> > reasons?
>
> Christoph used to have an opinion on this matter, so I've added him to
> the CC.

I've never objected utrace per see, quite contrary I think it's a useful
abstraction. I did have objection over various implementation details
which should be sorted out now (have to take a look again to make sure).

I do have a really large objection of merging the current messy double
ptrace implementation. If current utrace based ptrace isn't 100% ready
there's absolutely no point in merging it. Other user would be even
better, e.g. the seccomp rewrite.

2009-03-31 20:57:25

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH 3/3] utrace-based ftrace "process" engine, v2

> I do have a really large objection of merging the current messy double
> ptrace implementation. If current utrace based ptrace isn't 100% ready
> there's absolutely no point in merging it.

There is no "current" utrace-ptrace implementation. I haven't proposed
one for merging. When one is ready and working, we can discuss its actual
technical details then.

> Other user would be even better, e.g. the seccomp rewrite.

The seccomp rewrite is a very simple user for which I have a prototype
patch. (It needs testing, but that should be easy enough.) The only
real complexity there is in deciding how to merge those changes.
Its components are:

* clean up Kconfig
* remove old arch/asm hooks
** mips
** powerpc
** sh
** sparc
** x86
* replace kernel/seccomp.c with utrace-based one

Except for the first one, doing it in small incremental changes would
leave some intermediate states with no seccomp feature usable in the
tree. (And, of course, CONFIG_SECCOMP will require CONFIG_UTRACE
thereafter.) Please advise on how many pieces to slice it into and
how to stage the merging.


Thanks,
Roland