2009-03-07 04:53:20

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 0/2] Syscalls tracing

Here is a first attempt, quick one-shot, to provide a syscall tracing
infrastructure on ftrace.

The RFC prefix is here to reflect its ugliness on various parts.
The compromise between tracing reliabilty and speed is hard to balance.
For example I guess the basic and horrid string mask should be dropped in favour
of something else, which takes care of the volatile strings from the userspace.

But I hope a lot of ideas to make it better will come along this discussion.

Frederic Weisbecker (2):
tracing/ftrace: syscall tracing infrastructure
tracing/x86: basic implementation of syscall tracing for x86-64

arch/x86/Kconfig | 1 +
arch/x86/include/asm/ftrace.h | 7 +
arch/x86/include/asm/thread_info.h | 9 +-
arch/x86/kernel/ftrace.c | 15 ++
arch/x86/kernel/ptrace.c | 7 +
include/linux/ftrace.h | 44 ++++++
kernel/trace/Kconfig | 10 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 6 +
kernel/trace/trace.h | 32 ++++
kernel/trace/trace_syscalls.c | 284 ++++++++++++++++++++++++++++++++++++
11 files changed, 413 insertions(+), 3 deletions(-)
create mode 100644 kernel/trace/trace_syscalls.c


2009-03-07 04:53:37

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

The most important drawback that implies the syscall tracing is the variable
number of parameters that takes a syscall. Assuming there is a large number of
syscall, the goal is to provide as much as possible a generic layer on it.

What we want on tracing time (the hot path):

- Save the parameters and the number of the syscall. That's all. The rest of the
job can be done on the "output path".

On output time:

- Get the raw-binary saved parameters and the syscall number as well as some
usual infos such as the pid, the time...
- Depending on the tracing requirements, the user should have the choice to
dump the raw datas or a formatted version.

The core infrastructure has a static array which contains the necessary
informations for each syscall (ok, I've only implemented 4 for now).

When a syscall triggers, the tracer look at this array (O(1) access) and get the
number of parameters for this syscall. Then these parameters are saved in a
binary form on the ring buffer with the syscall number (a generic version, arch
independent).

On output time the parameters, the name of the syscall and very basic
informations to format the parameters are picket from the ring-buffer and the
generic syscall array.

Depending of the level of implementation we want for the syscall inside this
array, we can provide just a single mask where the n bit matches the n
parameter. If the bit is set, then the parameter will be considered as a string
pointer. Otherwise its raw hexadecimal value will be printed.

If needed, we can otherwise provide a specific callback to print the syscall
event.

The probem with these approaches comes from the fact that the string might have
disappeared from the user memory on output time. Even if we protect against
faults, it can give unreliable traces.
So perhaps we could think about copying the string on tracing time.

Concerning the return value, it doesn't make any problem, a single raw
value is printed for each one.

Note that the current implementation can be easily scaled further to give only
binary informations to the user and the matching metadata to decode it.

Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/ftrace.h | 44 +++++++
kernel/trace/Kconfig | 10 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 6 +
kernel/trace/trace.h | 32 +++++
kernel/trace/trace_syscalls.c | 284 +++++++++++++++++++++++++++++++++++++++++
6 files changed, 377 insertions(+), 0 deletions(-)
create mode 100644 kernel/trace/trace_syscalls.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..1798692 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}

#endif /* CONFIG_HW_BRANCH_TRACER */

+/* Index for the ftrace syscalls array */
+enum syscall_trace_nr {
+ SYSCALL_TRACE_OPEN = 1,
+ SYSCALL_TRACE_CLOSE,
+ SYSCALL_TRACE_READ,
+ SYSCALL_TRACE_WRITE,
+
+ __SYSCALL_TRACE_END
+};
+
+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @name: name of the syscall
+ * @nb_args: number of parameters it takes
+ * @simple_format: if true, we use string_mask, otherwise the print callback
+ * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
+ * will be displayed as a string, otherwise it will be considered
+ * as a raw number (hex displayed).
+ * @print: implement it if you want a custom output for a given syscall
+ */
+struct syscall_trace_entry {
+ const char *name;
+ int nb_args;
+ bool simple_format;
+ union {
+ unsigned long string_mask;
+ int (*print)(unsigned long *);
+ } output;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern short arch_syscall_trace_nr[];
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
#endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
config HAVE_HW_BRANCH_TRACER
bool

+config HAVE_FTRACE_SYSCALLS
+ bool
+
config TRACER_MAX_TRACE
bool

@@ -175,6 +178,13 @@ config EVENT_TRACER
allowing the user to pick and choose which trace point they
want to trace.

+config FTRACE_SYSCALLS
+ bool "Trace syscalls"
+ depends on HAVE_FTRACE_SYSCALLS
+ select TRACING
+ help
+ Basic tracer to catch the syscall entry and exit events.
+
config BOOT_TRACER
bool "Trace boot initcalls"
select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_EVENT_TRACER) += trace_events.o
obj-$(CONFIG_EVENT_TRACER) += events.o
obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cc94f86..5152be3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;

+
+struct trace_array *__get_global_trace(void)
+{
+ return &global_trace;
+}
+
/**
* trace_wake_up - wake up tasks waiting for trace input
*
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2bfb7d1..9583830 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
TRACE_GRAPH_ENT,
TRACE_USER_STACK,
TRACE_HW_BRANCHES,
+ TRACE_SYSCALL_ENTER,
+ TRACE_SYSCALL_EXIT,
TRACE_KMEM_ALLOC,
TRACE_KMEM_FREE,
TRACE_POWER,
@@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
const void *ptr;
};

+struct syscall_trace_enter {
+ struct trace_entry ent;
+ enum syscall_trace_nr nr;
+ unsigned long args[];
+};
+
+struct syscall_trace_exit {
+ struct trace_entry ent;
+ enum syscall_trace_nr nr;
+ unsigned long ret;
+};
+
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
@@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
TRACE_KMEM_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
TRACE_KMEM_FREE); \
+ IF_ASSIGN(var, ent, struct syscall_trace_enter, \
+ TRACE_SYSCALL_ENTER); \
+ IF_ASSIGN(var, ent, struct syscall_trace_exit, \
+ TRACE_SYSCALL_EXIT); \
__ftrace_bad_type(); \
} while (0)

@@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
#endif /* CONFIG_FTRACE_STARTUP_TEST */

extern void *head_page(struct trace_array_cpu *data);
+extern struct trace_array *__get_global_trace(void);
extern long ns2usecs(cycle_t nsec);
extern int
trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
@@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
return test_tsk_trace_trace(task);
}

+#ifdef CONFIG_FTRACE_SYSCALLS
+extern enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags);
+extern enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags);
+#else
+static inline enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags) { }
+static inline enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags) { }
+#endif
+
/*
* trace_iterator_flags is an enumeration that defines bit
* positions into trace_flags that controls the output.
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..470bb9d
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,284 @@
+#include <linux/kernel.h>
+#include <linux/ftrace.h>
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+/* Create basic entry on syscall array (with the rudimentary string mask) */
+#define SYS_TRACE_ENTRY(sname, args, mask) \
+ {.name = #sname, .nb_args = args, .simple_format = true, \
+ .output.string_mask = mask}
+
+/*
+ * Create a custom defined entry on the syscall array, you will have to
+ * implement a callback to output the syscall.
+ */
+#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
+ {.name = #sname, .nb_args = args, .simple_format = false, \
+ .print = printer}
+
+static const struct syscall_trace_entry syscall_trace_entries[] = {
+ /* For open, the first argument is a string, hence the given mask */
+ [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
+ [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
+ [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
+ [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
+};
+
+
+static atomic_t refcount;
+
+enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags)
+{
+ struct trace_seq *s = &iter->seq;
+ struct trace_entry *ent = iter->ent;
+ struct syscall_trace_enter *trace;
+ enum syscall_trace_nr syscall;
+ const struct syscall_trace_entry *entry;
+ char *str = NULL;
+ int i, ret;
+
+ trace_assign_type(trace, ent);
+
+ syscall = trace->nr;
+
+ if (!syscall || syscall >= __SYSCALL_TRACE_END)
+ return TRACE_TYPE_HANDLED;
+
+ entry = &syscall_trace_entries[syscall];
+ if (!entry->simple_format) {
+ ret = entry->output.print(trace->args);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ }
+
+ ret = trace_seq_printf(s, "syscall %s :", entry->name);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ for (i = 0; i < entry->nb_args; i++) {
+ /*
+ * FIXME: the string comes from the user, but on tracing
+ * output time, it may have disappeared from the memory.
+ * May be better to strdup it on tracing time.
+ */
+ if (entry->output.string_mask & (1 << i)) {
+ /* 128 is enough for most path */
+ str = strndup_user((char *)trace->args[i], 128);
+
+ /* If it faulted badly, the error shoud have been
+ * handled while servicing the syscall, just ignore.
+ */
+ if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
+ || str == ERR_PTR(-ENOMEM))
+ return TRACE_TYPE_HANDLED;
+
+ ret = trace_seq_printf(s, " %s", str);
+ } else {
+ ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
+ }
+ if (!ret)
+ goto end_partial;
+ }
+ if (!trace_seq_printf(s, "\n"))
+ goto end_partial;
+
+ kfree(str);
+ return TRACE_TYPE_HANDLED;
+
+end_partial:
+ kfree(str);
+ return TRACE_TYPE_PARTIAL_LINE;
+}
+
+enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags)
+{
+ struct trace_seq *s = &iter->seq;
+ struct trace_entry *ent = iter->ent;
+ struct syscall_trace_exit *trace;
+ enum syscall_trace_nr syscall;
+ const struct syscall_trace_entry *entry;
+ int ret;
+
+ trace_assign_type(trace, ent);
+
+ syscall = trace->nr;
+
+ if (!syscall || syscall >= __SYSCALL_TRACE_END)
+ return TRACE_TYPE_HANDLED;
+
+ entry = &syscall_trace_entries[syscall];
+
+ ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
+ trace->ret);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+void start_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_inc_return(&refcount) != 1)
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+ atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_dec_return(&refcount))
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+ atomic_inc(&refcount);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+ struct trace_array *tr = __get_global_trace();
+ struct syscall_trace_enter *entry;
+ const struct syscall_trace_entry *sys_data;
+ struct ring_buffer_event *event;
+ struct trace_array_cpu *data;
+ int size;
+ int syscall_nr;
+ int nr_offset;
+ int cpu;
+
+ syscall_nr = syscall_get_nr(current, regs);
+ nr_offset = arch_syscall_trace_nr[syscall_nr];
+ if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+ return;
+
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+
+ if (unlikely(atomic_read(&data->disabled)))
+ return;
+
+ sys_data = &syscall_trace_entries[nr_offset];
+ size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+
+ event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->nr = nr_offset;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+ trace_wake_up();
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+ struct trace_array *tr = __get_global_trace();
+ struct syscall_trace_exit *entry;
+ const struct syscall_trace_entry *sys_data;
+ struct ring_buffer_event *event;
+ struct trace_array_cpu *data;
+ int syscall_nr;
+ int nr_offset;
+ int cpu;
+
+ syscall_nr = syscall_get_nr(current, regs);
+ nr_offset = arch_syscall_trace_nr[syscall_nr];
+ if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+ return;
+
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+
+ if (unlikely(atomic_read(&data->disabled)))
+ return;
+
+ sys_data = &syscall_trace_entries[nr_offset];
+
+ event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
+ sizeof(*entry), 0, 0);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->nr = nr_offset;
+ entry->ret = syscall_get_return_value(current, regs);
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+ trace_wake_up();
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+ start_ftrace_syscalls();
+
+ return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+ stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+ .type = TRACE_SYSCALL_ENTER,
+ .trace = print_syscall_enter,
+};
+
+static struct trace_event syscall_exit_event = {
+ .type = TRACE_SYSCALL_EXIT,
+ .trace = print_syscall_exit,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+ .name = "syscall",
+ .init = init_syscall_tracer,
+ .reset = reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+ int ret;
+
+ ret = register_ftrace_event(&syscall_enter_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_enter_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ ret = register_ftrace_event(&syscall_exit_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_exit_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);
--
1.6.1

2009-03-07 04:53:51

by Frederic Weisbecker

[permalink] [raw]
Subject: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64

Provide the ptrace hooks and arch specific syscall numbers to ftrace arch indepedant
syscall numbers.
For now it only supports 4 syscalls to provide an example.

Signed-off-by: Frederic Weisbecker <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/ftrace.h | 7 +++++++
arch/x86/include/asm/thread_info.h | 9 ++++++---
arch/x86/kernel/ftrace.c | 15 +++++++++++++++
arch/x86/kernel/ptrace.c | 7 +++++++
5 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2a0ddfe..b5aec1b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
+ select HAVE_FTRACE_SYSCALLS if X86_64
select HAVE_KVM
select HAVE_ARCH_KGDB
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index db24c22..b2e6855 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -28,6 +28,13 @@

#endif

+/* FIXME: I don't want to stay hardcoded */
+#ifdef CONFIG_X86_64
+#define FTRACE_SYSCALL_MAX 296
+#else
+#define FTRACE_SYSCALL_MAX 333
+#endif
+
#ifdef CONFIG_FUNCTION_TRACER
#define MCOUNT_ADDR ((long)(mcount))
#define MCOUNT_INSN_SIZE 5 /* sizeof mcount call */
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index f6ba6f8..83d2b73 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_DEBUGCTLMSR 25 /* uses thread_struct.debugctlmsr */
#define TIF_DS_AREA_MSR 26 /* uses thread_struct.ds_area_msr */
+#define TIF_SYSCALL_FTRACE 27 /* for ftrace syscall instrumentation */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -117,15 +118,17 @@ struct thread_info {
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_DEBUGCTLMSR (1 << TIF_DEBUGCTLMSR)
#define _TIF_DS_AREA_MSR (1 << TIF_DS_AREA_MSR)
+#define _TIF_SYSCALL_FTRACE (1 << TIF_SYSCALL_FTRACE)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
- (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | \
+ (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_FTRACE | \
_TIF_SYSCALL_AUDIT | _TIF_SECCOMP | _TIF_SINGLESTEP)

/* work to do in syscall_trace_leave() */
#define _TIF_WORK_SYSCALL_EXIT \
- (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
+ (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP | \
+ _TIF_SYSCALL_FTRACE)

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
@@ -134,7 +137,7 @@ struct thread_info {
_TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))

/* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
+#define _TIF_ALLWORK_MASK ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_FTRACE)

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index a85da17..5bba98b 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -453,3 +453,18 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr)
}
}
#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+#ifdef CONFIG_X86_64
+/*
+ * Array that make the arch specific syscall numbers
+ * matching the ftrace syscall numbers.
+ */
+short arch_syscall_trace_nr[FTRACE_SYSCALL_MAX] = {
+ [__NR_open] = SYSCALL_TRACE_OPEN,
+ [__NR_close] = SYSCALL_TRACE_CLOSE,
+ [__NR_read] = SYSCALL_TRACE_READ,
+ [__NR_write] = SYSCALL_TRACE_WRITE
+};
+#endif
+#endif
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 3d9672e..99749d6 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
#include <linux/audit.h>
#include <linux/seccomp.h>
#include <linux/signal.h>
+#include <linux/ftrace.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -1416,6 +1417,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
tracehook_report_syscall_entry(regs))
ret = -1L;

+ if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+ ftrace_syscall_enter(regs);
+
if (unlikely(current->audit_context)) {
if (IS_IA32)
audit_syscall_entry(AUDIT_ARCH_I386,
@@ -1439,6 +1443,9 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
if (unlikely(current->audit_context))
audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);

+ if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+ ftrace_syscall_exit(regs);
+
if (test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall_exit(regs, 0);

--
1.6.1

2009-03-07 07:49:35

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Sat, Mar 07, 2009 at 05:52:59AM +0100, Frederic Weisbecker wrote:
> The most important drawback that implies the syscall tracing is the variable
> number of parameters that takes a syscall. Assuming there is a large number of
> syscall, the goal is to provide as much as possible a generic layer on it.
>
> What we want on tracing time (the hot path):
>
> - Save the parameters and the number of the syscall. That's all. The rest of the
> job can be done on the "output path".
>
> On output time:
>
> - Get the raw-binary saved parameters and the syscall number as well as some
> usual infos such as the pid, the time...
> - Depending on the tracing requirements, the user should have the choice to
> dump the raw datas or a formatted version.
>
> The core infrastructure has a static array which contains the necessary
> informations for each syscall (ok, I've only implemented 4 for now).
>
> When a syscall triggers, the tracer look at this array (O(1) access) and get the
> number of parameters for this syscall. Then these parameters are saved in a
> binary form on the ring buffer with the syscall number (a generic version, arch
> independent).
>
> On output time the parameters, the name of the syscall and very basic
> informations to format the parameters are picket from the ring-buffer and the
> generic syscall array.
>
> Depending of the level of implementation we want for the syscall inside this
> array, we can provide just a single mask where the n bit matches the n
> parameter. If the bit is set, then the parameter will be considered as a string
> pointer. Otherwise its raw hexadecimal value will be printed.
>
> If needed, we can otherwise provide a specific callback to print the syscall
> event.
>
> The probem with these approaches comes from the fact that the string might have
> disappeared from the user memory on output time. Even if we protect against
> faults, it can give unreliable traces.
> So perhaps we could think about copying the string on tracing time.
>
> Concerning the return value, it doesn't make any problem, a single raw
> value is printed for each one.
>
> Note that the current implementation can be easily scaled further to give only
> binary informations to the user and the matching metadata to decode it.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---


I made some silly mistakes on this one.
Here is a v2:

---

>From f1fd4e7885f1f96f27fd58b20a1aa5cb054cb15f Mon Sep 17 00:00:00 2001
From: Frederic Weisbecker <[email protected]>
Date: Sat, 7 Mar 2009 08:46:46 +0100
Subject: [PATCH] tracing/ftrace: syscall tracing infrastructure

The most important drawback that implies the syscall tracing is the variable
number of parameters that takes a syscall. Assuming there is a large number of
syscall, the goal is to provide as much as possible a generic layer on it.

What we want on tracing time (the hot path):

- Save the parameters and the number of the syscall. That's all. The rest of the
job can be done on the "output path".

On output time:

- Get the raw-binary saved parameters and the syscall number as well as some
usual infos such as the pid, the time...
- Depending on the tracing requirements, the user should have the choice to
dump the raw datas or a formatted version.

The core infrastructure has a static array which contains the necessary
informations for each syscall (ok, I've only implemented 4 for now).

When a syscall triggers, the tracer look at this array (O(1) access) and get the
number of parameters for this syscall. Then these parameters are saved in a
binary form on the ring buffer with the syscall number (a generic version, arch
independent).

On output time the parameters, the name of the syscall and very basic
informations to format the parameters are picket from the ring-buffer and the
generic syscall array.

Depending of the level of implementation we want for the syscall inside this
array, we can provide just a single mask where the n bit matches the n
parameter. If the bit is set, then the parameter will be considered as a string
pointer. Otherwise its raw hexadecimal value will be printed.

If needed, we can otherwise provide a specific callback to print the syscall
event.

The probem with these approaches comes from the fact that the string might have
disappeared from the user memory on output time. Even if we protect against
faults, it can give unreliable traces.
So perhaps we could think about copying the string on tracing time.

Concerning the return value, it doesn't make any problem, a single raw
value is printed for each one.

Note that the current implementation can be easily scaled further to give only
binary informations to the user and the matching metadata to decode it.

Changes V2:

- Fixes mistakes on refcounting
- Fixes SYS_TRACE_ENTRY_SPECIAL macro which forgot the union
- Always return after the print callback is called

Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/ftrace.h | 44 +++++++
kernel/trace/Kconfig | 10 ++
kernel/trace/Makefile | 1 +
kernel/trace/trace.c | 6 +
kernel/trace/trace.h | 32 +++++
kernel/trace/trace_syscalls.c | 285 +++++++++++++++++++++++++++++++++++++++++
6 files changed, 378 insertions(+), 0 deletions(-)
create mode 100644 kernel/trace/trace_syscalls.c

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..1798692 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}

#endif /* CONFIG_HW_BRANCH_TRACER */

+/* Index for the ftrace syscalls array */
+enum syscall_trace_nr {
+ SYSCALL_TRACE_OPEN = 1,
+ SYSCALL_TRACE_CLOSE,
+ SYSCALL_TRACE_READ,
+ SYSCALL_TRACE_WRITE,
+
+ __SYSCALL_TRACE_END
+};
+
+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @name: name of the syscall
+ * @nb_args: number of parameters it takes
+ * @simple_format: if true, we use string_mask, otherwise the print callback
+ * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
+ * will be displayed as a string, otherwise it will be considered
+ * as a raw number (hex displayed).
+ * @print: implement it if you want a custom output for a given syscall
+ */
+struct syscall_trace_entry {
+ const char *name;
+ int nb_args;
+ bool simple_format;
+ union {
+ unsigned long string_mask;
+ int (*print)(unsigned long *);
+ } output;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern short arch_syscall_trace_nr[];
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
#endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
config HAVE_HW_BRANCH_TRACER
bool

+config HAVE_FTRACE_SYSCALLS
+ bool
+
config TRACER_MAX_TRACE
bool

@@ -175,6 +178,13 @@ config EVENT_TRACER
allowing the user to pick and choose which trace point they
want to trace.

+config FTRACE_SYSCALLS
+ bool "Trace syscalls"
+ depends on HAVE_FTRACE_SYSCALLS
+ select TRACING
+ help
+ Basic tracer to catch the syscall entry and exit events.
+
config BOOT_TRACER
bool "Trace boot initcalls"
select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_EVENT_TRACER) += trace_events.o
obj-$(CONFIG_EVENT_TRACER) += events.o
obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index cc94f86..5152be3 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;

+
+struct trace_array *__get_global_trace(void)
+{
+ return &global_trace;
+}
+
/**
* trace_wake_up - wake up tasks waiting for trace input
*
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2bfb7d1..9583830 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
TRACE_GRAPH_ENT,
TRACE_USER_STACK,
TRACE_HW_BRANCHES,
+ TRACE_SYSCALL_ENTER,
+ TRACE_SYSCALL_EXIT,
TRACE_KMEM_ALLOC,
TRACE_KMEM_FREE,
TRACE_POWER,
@@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
const void *ptr;
};

+struct syscall_trace_enter {
+ struct trace_entry ent;
+ enum syscall_trace_nr nr;
+ unsigned long args[];
+};
+
+struct syscall_trace_exit {
+ struct trace_entry ent;
+ enum syscall_trace_nr nr;
+ unsigned long ret;
+};
+
+
/*
* trace_flag_type is an enumeration that holds different
* states when a trace occurs. These are:
@@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
TRACE_KMEM_ALLOC); \
IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
TRACE_KMEM_FREE); \
+ IF_ASSIGN(var, ent, struct syscall_trace_enter, \
+ TRACE_SYSCALL_ENTER); \
+ IF_ASSIGN(var, ent, struct syscall_trace_exit, \
+ TRACE_SYSCALL_EXIT); \
__ftrace_bad_type(); \
} while (0)

@@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
#endif /* CONFIG_FTRACE_STARTUP_TEST */

extern void *head_page(struct trace_array_cpu *data);
+extern struct trace_array *__get_global_trace(void);
extern long ns2usecs(cycle_t nsec);
extern int
trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
@@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
return test_tsk_trace_trace(task);
}

+#ifdef CONFIG_FTRACE_SYSCALLS
+extern enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags);
+extern enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags);
+#else
+static inline enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags) { }
+static inline enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags) { }
+#endif
+
/*
* trace_iterator_flags is an enumeration that defines bit
* positions into trace_flags that controls the output.
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..3fc5e54
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,285 @@
+#include <linux/kernel.h>
+#include <linux/ftrace.h>
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+/* Create basic entry on syscall array (with the rudimentary string mask) */
+#define SYS_TRACE_ENTRY(sname, args, mask) \
+ {.name = #sname, .nb_args = args, .simple_format = true, \
+ .output.string_mask = mask}
+
+/*
+ * Create a custom defined entry on the syscall array, you will have to
+ * implement a callback to output the syscall.
+ */
+#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
+ {.name = #sname, .nb_args = args, .simple_format = false, \
+ .output.print = printer}
+
+static const struct syscall_trace_entry syscall_trace_entries[] = {
+ /* For open, the first argument is a string, hence the given mask */
+ [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
+ [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
+ [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
+ [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
+};
+
+
+static atomic_t refcount;
+
+enum print_line_t
+print_syscall_enter(struct trace_iterator *iter, int flags)
+{
+ struct trace_seq *s = &iter->seq;
+ struct trace_entry *ent = iter->ent;
+ struct syscall_trace_enter *trace;
+ enum syscall_trace_nr syscall;
+ const struct syscall_trace_entry *entry;
+ char *str = NULL;
+ int i, ret;
+
+ trace_assign_type(trace, ent);
+
+ syscall = trace->nr;
+
+ if (!syscall || syscall >= __SYSCALL_TRACE_END)
+ return TRACE_TYPE_HANDLED;
+
+ entry = &syscall_trace_entries[syscall];
+ if (!entry->simple_format) {
+ ret = entry->output.print(trace->args);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+ return TRACE_TYPE_HANDLED;
+ }
+
+ ret = trace_seq_printf(s, "syscall %s :", entry->name);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ for (i = 0; i < entry->nb_args; i++) {
+ /*
+ * FIXME: the string comes from the user, but on tracing
+ * output time, it may have disappeared from the memory.
+ * May be better to strdup it on tracing time.
+ */
+ if (entry->output.string_mask & (1 << i)) {
+ /* 128 is enough for most path */
+ str = strndup_user((char *)trace->args[i], 128);
+
+ /* If it faulted badly, the error shoud have been
+ * handled while servicing the syscall, just ignore.
+ */
+ if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
+ || str == ERR_PTR(-ENOMEM))
+ return TRACE_TYPE_HANDLED;
+
+ ret = trace_seq_printf(s, " %s", str);
+ } else {
+ ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
+ }
+ if (!ret)
+ goto end_partial;
+ }
+ if (!trace_seq_printf(s, "\n"))
+ goto end_partial;
+
+ kfree(str);
+ return TRACE_TYPE_HANDLED;
+
+end_partial:
+ kfree(str);
+ return TRACE_TYPE_PARTIAL_LINE;
+}
+
+enum print_line_t
+print_syscall_exit(struct trace_iterator *iter, int flags)
+{
+ struct trace_seq *s = &iter->seq;
+ struct trace_entry *ent = iter->ent;
+ struct syscall_trace_exit *trace;
+ enum syscall_trace_nr syscall;
+ const struct syscall_trace_entry *entry;
+ int ret;
+
+ trace_assign_type(trace, ent);
+
+ syscall = trace->nr;
+
+ if (!syscall || syscall >= __SYSCALL_TRACE_END)
+ return TRACE_TYPE_HANDLED;
+
+ entry = &syscall_trace_entries[syscall];
+
+ ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
+ trace->ret);
+ if (!ret)
+ return TRACE_TYPE_PARTIAL_LINE;
+
+ return TRACE_TYPE_HANDLED;
+}
+
+void start_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_inc_return(&refcount) != 1)
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+
+ return;
+out:
+ atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_dec_return(&refcount))
+ return;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+ struct trace_array *tr = __get_global_trace();
+ struct syscall_trace_enter *entry;
+ const struct syscall_trace_entry *sys_data;
+ struct ring_buffer_event *event;
+ struct trace_array_cpu *data;
+ int size;
+ int syscall_nr;
+ int nr_offset;
+ int cpu;
+
+ syscall_nr = syscall_get_nr(current, regs);
+ nr_offset = arch_syscall_trace_nr[syscall_nr];
+ if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+ return;
+
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+
+ if (unlikely(atomic_read(&data->disabled)))
+ return;
+
+ sys_data = &syscall_trace_entries[nr_offset];
+ size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
+
+ event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->nr = nr_offset;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+ trace_wake_up();
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+ struct trace_array *tr = __get_global_trace();
+ struct syscall_trace_exit *entry;
+ const struct syscall_trace_entry *sys_data;
+ struct ring_buffer_event *event;
+ struct trace_array_cpu *data;
+ int syscall_nr;
+ int nr_offset;
+ int cpu;
+
+ syscall_nr = syscall_get_nr(current, regs);
+ nr_offset = arch_syscall_trace_nr[syscall_nr];
+ if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
+ return;
+
+ cpu = raw_smp_processor_id();
+ data = tr->data[cpu];
+
+ if (unlikely(atomic_read(&data->disabled)))
+ return;
+
+ sys_data = &syscall_trace_entries[nr_offset];
+
+ event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
+ sizeof(*entry), 0, 0);
+ if (!event)
+ return;
+
+ entry = ring_buffer_event_data(event);
+ entry->nr = nr_offset;
+ entry->ret = syscall_get_return_value(current, regs);
+
+ ring_buffer_unlock_commit(tr->buffer, event);
+ trace_wake_up();
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+ start_ftrace_syscalls();
+
+ return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+ stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+ .type = TRACE_SYSCALL_ENTER,
+ .trace = print_syscall_enter,
+};
+
+static struct trace_event syscall_exit_event = {
+ .type = TRACE_SYSCALL_EXIT,
+ .trace = print_syscall_exit,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+ .name = "syscall",
+ .init = init_syscall_tracer,
+ .reset = reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+ int ret;
+
+ ret = register_ftrace_event(&syscall_enter_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_enter_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ ret = register_ftrace_event(&syscall_exit_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_exit_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);
--
1.6.1

2009-03-07 12:15:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Syscalls tracing

On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> Here is a first attempt, quick one-shot, to provide a syscall tracing
> infrastructure on ftrace.
>
> The RFC prefix is here to reflect its ugliness on various parts.
> The compromise between tracing reliabilty and speed is hard to balance.
> For example I guess the basic and horrid string mask should be dropped in favour
> of something else, which takes care of the volatile strings from the userspace.
>
> But I hope a lot of ideas to make it better will come along this discussion.

Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
like it will replicate the syscall table.


2009-03-07 16:03:03

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Syscalls tracing

On Sat, Mar 07, 2009 at 01:15:18PM +0100, Peter Zijlstra wrote:
> On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > infrastructure on ftrace.
> >
> > The RFC prefix is here to reflect its ugliness on various parts.
> > The compromise between tracing reliabilty and speed is hard to balance.
> > For example I guess the basic and horrid string mask should be dropped in favour
> > of something else, which takes care of the volatile strings from the userspace.
> >
> > But I hope a lot of ideas to make it better will come along this discussion.
>
> Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
> like it will replicate the syscall table.
>


Ah, I did not even think about it.
I will be able to get the number of parameters. Sounds good. But I will
still need a way to store their format somewhere.

2009-03-07 20:27:34

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Syscalls tracing

Frederic Weisbecker <[email protected]> writes:

> Here is a first attempt, quick one-shot, to provide a syscall tracing
> infrastructure on ftrace.

Please see also the utrace-based thread syscall/signal/lifecycle.
tracer I posted a few times, and is just about to be reposted as a
part of the larger utrace submission.

System call metadata (name, argument count, and getting fancier from
there) would be nice to have for other clients too, such as the audit
subsystem.

The main drawback of this general approach however is the notion that
ftrace is the solitary user of system call tracing, thus dedicating
that new task flag to this purpose. Therefore, your code has nothing
like reference counting or sharing; nothing yet to avoid overhead on
threads that need no tracing, nor to allow more than one tracing
widget to consume events. These are the sorts of services that utrace
provides.

- FChE

2009-03-07 21:53:20

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Syscalls tracing

On Sat, Mar 07, 2009 at 03:26:49PM -0500, Frank Ch. Eigler wrote:
> Frederic Weisbecker <[email protected]> writes:
>
> > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > infrastructure on ftrace.
>
> Please see also the utrace-based thread syscall/signal/lifecycle.
> tracer I posted a few times, and is just about to be reposted as a
> part of the larger utrace submission.


Yes I've looked at it and as I will explain below, I don't have
any bad feelings against utrace.


> System call metadata (name, argument count, and getting fancier from
> there) would be nice to have for other clients too, such as the audit
> subsystem.
>
> The main drawback of this general approach however is the notion that
> ftrace is the solitary user of system call tracing, thus dedicating
> that new task flag to this purpose. Therefore, your code has nothing
> like reference counting or sharing; nothing yet to avoid overhead on
> threads that need no tracing, nor to allow more than one tracing
> widget to consume events. These are the sorts of services that utrace
> provides.
>
> - FChE


Yes. But it was a first attempt scribbled in one day. I already knew
that a fair amount of bits coming along these patches started on the wrong
way. But that doesn't matter, I just wanted to test the feelings of anyone
interested about the main axis inside:

- having a O(1) access table which stores the number of parameters for each
syscalls, so that the traced path can be very fast and efficient:
we know what to save quickly and we only save the relevant bits from
the path, that said, 1 syscall number (which behaves here as a unique
identifier key) and its parameters values.

- provide a good way to extract the results. It needs a better and larger work
on the syscalls metadata. Ie: the string mask is a bit ridiculous, as well as
the individual callback to display a syscall (because it implies too much work
and code that can be a lot factorized) but I expect that a discussion about a
good to produce efficient metadata could born after it.

For example we could have an array of char where the types are encoding using
an enum which describes the types of the parameters.
Depending of these parameters, we can have one callback for each type to display
them. We can also export these metadatas to userspace so that the users are able
to only extract the binary traces and format and give some sense by formatting it
later using the metadas.


Anyway, since this is a first scribbling, there is neither reference count nor
per task tracing granularity :-)

But, that's said, I already expressed my feelings about utrace by the past.
It's a good library, seems easy to use and powerful and I would like to see
it merged.

I don't know what other people think about it. But IMHO, I think it would be good
if you could, for now, make an attempt to sort the only relevant bits to provide
syscalls tracing and submit it.

It would exclude a large part that concerns the signals/fork/exec tracing
which will probably be needed later.

2009-03-08 11:24:58

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Syscalls tracing

On Sat, Mar 07, 2009 at 05:02:47PM +0100, Frederic Weisbecker wrote:
> On Sat, Mar 07, 2009 at 01:15:18PM +0100, Peter Zijlstra wrote:
> > On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> > > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > > infrastructure on ftrace.
> > >
> > > The RFC prefix is here to reflect its ugliness on various parts.
> > > The compromise between tracing reliabilty and speed is hard to balance.
> > > For example I guess the basic and horrid string mask should be dropped in favour
> > > of something else, which takes care of the volatile strings from the userspace.
> > >
> > > But I hope a lot of ideas to make it better will come along this discussion.
> >
> > Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
> > like it will replicate the syscall table.
> >
>
>
> Ah, I did not even think about it.
> I will be able to get the number of parameters. Sounds good. But I will
> still need a way to store their format somewhere.
>

Ok, we can iterate through sections datas for each one and then generate the format string
depending of the types of the parameters. We can even to it once at boot time.
The last thing is the need to match the exact syscall entry from this section when we enter
a syscall. Don't know yet how I will do that but I will think about it.

2009-03-08 11:28:59

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/2] Syscalls tracing

On Sun, Mar 08, 2009 at 12:24:41PM +0100, Frederic Weisbecker wrote:
> On Sat, Mar 07, 2009 at 05:02:47PM +0100, Frederic Weisbecker wrote:
> > On Sat, Mar 07, 2009 at 01:15:18PM +0100, Peter Zijlstra wrote:
> > > On Sat, 2009-03-07 at 05:52 +0100, Frederic Weisbecker wrote:
> > > > Here is a first attempt, quick one-shot, to provide a syscall tracing
> > > > infrastructure on ftrace.
> > > >
> > > > The RFC prefix is here to reflect its ugliness on various parts.
> > > > The compromise between tracing reliabilty and speed is hard to balance.
> > > > For example I guess the basic and horrid string mask should be dropped in favour
> > > > of something else, which takes care of the volatile strings from the userspace.
> > > >
> > > > But I hope a lot of ideas to make it better will come along this discussion.
> > >
> > > Can't you abuse the SYSCALL_DEFINE macros? This current approach looks
> > > like it will replicate the syscall table.
> > >
> >
> >
> > Ah, I did not even think about it.
> > I will be able to get the number of parameters. Sounds good. But I will
> > still need a way to store their format somewhere.
> >
>
> Ok, we can iterate through sections datas for each one and then generate the format string
> depending of the types of the parameters. We can even to it once at boot time.
> The last thing is the need to match the exact syscall entry from this section when we enter
> a syscall. Don't know yet how I will do that but I will think about it.
>

Ah and the string mask is actually useful to determine if we one parameter is a string.
If so we can decide to save it on tracing time or not.

2009-03-08 16:25:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure


* Frederic Weisbecker <[email protected]> wrote:

> +static const struct syscall_trace_entry syscall_trace_entries[] = {
> + /* For open, the first argument is a string, hence the given mask */
> + [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
> + [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
> + [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
> + [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
> +};

s/read/write in the last entry i guess.

But i think the whole concept of duplicating the syscall table
is the wrong way around.

Note that we dont have to build this information at all - in
2.6.29-rc1 all syscalls got wrapper macros:

SYSCALL_DEFINE1(nice, int, increment)
SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
unsigned long __user *, user_mask_ptr)

We also have the syscall table itself.

So what we can do: by changing the SYSCALL_DEFINEX() macros we
can emit the following information into a table:

(syscall_fn_address, syscall_name_string, #of arguments, array
of argument names and type sizeof()s)

then during bootup we can match up the sys_call_table[] to the
secondary table we built, so that we can order the secondary
table based on syscall NR. 99% of the syscalls will match up
just fine.

Ingo

2009-03-09 13:44:51

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure


On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
>
> libftrace-y := ftrace.o
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index cc94f86..5152be3 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
>
> +
> +struct trace_array *__get_global_trace(void)
> +{
> + return &global_trace;
> +}
> +


Don't do that. It's opening up a can of worms ;-)

>
> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> #endif /* CONFIG_FTRACE_STARTUP_TEST */
>
> extern void *head_page(struct trace_array_cpu *data);
> +extern struct trace_array *__get_global_trace(void);

Remove that too.

> +void ftrace_syscall_enter(struct pt_regs *regs)
> +{
> + struct trace_array *tr = __get_global_trace();

Don't do that.

> + struct syscall_trace_enter *entry;
> + const struct syscall_trace_entry *sys_data;
> + struct ring_buffer_event *event;
> + struct trace_array_cpu *data;
> + int size;
> + int syscall_nr;
> + int nr_offset;
> + int cpu;
> +
> + syscall_nr = syscall_get_nr(current, regs);
> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> + return;
> +
> + cpu = raw_smp_processor_id();
> + data = tr->data[cpu];
> +
> + if (unlikely(atomic_read(&data->disabled)))
> + return;
> +
> + sys_data = &syscall_trace_entries[nr_offset];
> + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> +
> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);

Here you want:

trace_current_buffer_lock_reserve(TRACE_SYSCALL_ENTER, size, 0, 0);


-- Steve


> + if (!event)
> + return;
> +
> + entry = ring_buffer_event_data(event);
> + entry->nr = nr_offset;
> + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> +
> + ring_buffer_unlock_commit(tr->buffer, event);
> + trace_wake_up();
> +}
> +
> +void ftrace_syscall_exit(struct pt_regs *regs)
> +{
> + struct trace_array *tr = __get_global_trace();
> + struct syscall_trace_exit *entry;
> + const struct syscall_trace_entry *sys_data;
> + struct ring_buffer_event *event;
> + struct trace_array_cpu *data;
> + int syscall_nr;
> + int nr_offset;
> + int cpu;
> +
> + syscall_nr = syscall_get_nr(current, regs);
> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> + return;
> +
> + cpu = raw_smp_processor_id();
> + data = tr->data[cpu];
> +
> + if (unlikely(atomic_read(&data->disabled)))
> + return;
> +
> + sys_data = &syscall_trace_entries[nr_offset];
> +
> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> + sizeof(*entry), 0, 0);

Same here.

> + if (!event)
> + return;
> +
> + entry = ring_buffer_event_data(event);
> + entry->nr = nr_offset;
> + entry->ret = syscall_get_return_value(current, regs);
> +
> + ring_buffer_unlock_commit(tr->buffer, event);
> + trace_wake_up();
> +}
> +

-- Steve

2009-03-09 13:52:33

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure


On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> +
> +
> +static atomic_t refcount;
> +
> +enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags)
> +{
> + struct trace_seq *s = &iter->seq;
> + struct trace_entry *ent = iter->ent;
> + struct syscall_trace_enter *trace;
> + enum syscall_trace_nr syscall;
> + const struct syscall_trace_entry *entry;
> + char *str = NULL;
> + int i, ret;
> +
> + trace_assign_type(trace, ent);
> +
> + syscall = trace->nr;
> +
> + if (!syscall || syscall >= __SYSCALL_TRACE_END)

You still need to print a "\n" otherwise the output of the next line
will start on the current line.

> + return TRACE_TYPE_HANDLED;
> +
> + entry = &syscall_trace_entries[syscall];
> + if (!entry->simple_format) {
> + ret = entry->output.print(trace->args);
> + if (!ret)
> + return TRACE_TYPE_PARTIAL_LINE;
> + return TRACE_TYPE_HANDLED;
> + }
> +
> + ret = trace_seq_printf(s, "syscall %s :", entry->name);
> + if (!ret)
> + return TRACE_TYPE_PARTIAL_LINE;
> +
> + for (i = 0; i < entry->nb_args; i++) {
> + /*
> + * FIXME: the string comes from the user, but on tracing
> + * output time, it may have disappeared from the memory.
> + * May be better to strdup it on tracing time.
> + */
> + if (entry->output.string_mask & (1 << i)) {
> + /* 128 is enough for most path */
> + str = strndup_user((char *)trace->args[i], 128);
> +
> + /* If it faulted badly, the error shoud have been
> + * handled while servicing the syscall, just ignore.
> + */
> + if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> + || str == ERR_PTR(-ENOMEM))
> + return TRACE_TYPE_HANDLED;
> +
> + ret = trace_seq_printf(s, " %s", str);
> + } else {
> + ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> + }
> + if (!ret)
> + goto end_partial;
> + }
> + if (!trace_seq_printf(s, "\n"))
> + goto end_partial;
> +
> + kfree(str);
> + return TRACE_TYPE_HANDLED;
> +
> +end_partial:
> + kfree(str);
> + return TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags)
> +{
> + struct trace_seq *s = &iter->seq;
> + struct trace_entry *ent = iter->ent;
> + struct syscall_trace_exit *trace;
> + enum syscall_trace_nr syscall;
> + const struct syscall_trace_entry *entry;
> + int ret;
> +
> + trace_assign_type(trace, ent);
> +
> + syscall = trace->nr;
> +
> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> + return TRACE_TYPE_HANDLED;
> +
> + entry = &syscall_trace_entries[syscall];
> +
> + ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> + trace->ret);
> + if (!ret)
> + return TRACE_TYPE_PARTIAL_LINE;
> +
> + return TRACE_TYPE_HANDLED;
> +}
> +
> +void start_ftrace_syscalls(void)
> +{
> + unsigned long flags;
> + struct task_struct *g, *t;
> +
> + if (atomic_inc_return(&refcount) != 1)
> + goto out;
> +
> + read_lock_irqsave(&tasklist_lock, flags);
> +
> + do_each_thread(g, t) {
> + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> + } while_each_thread(g, t);
> +
> + read_unlock_irqrestore(&tasklist_lock, flags);
> +
> + return;
> +out:
> + atomic_dec(&refcount);
> +}
> +
> +void stop_ftrace_syscalls(void)
> +{
> + unsigned long flags;
> + struct task_struct *g, *t;
> +
> + if (atomic_dec_return(&refcount))
> + return;
> +
> + read_lock_irqsave(&tasklist_lock, flags);
> +
> + do_each_thread(g, t) {
> + clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> + } while_each_thread(g, t);
> +
> + read_unlock_irqrestore(&tasklist_lock, flags);
> +}
> +
> +void ftrace_syscall_enter(struct pt_regs *regs)
> +{
> + struct trace_array *tr = __get_global_trace();
> + struct syscall_trace_enter *entry;
> + const struct syscall_trace_entry *sys_data;
> + struct ring_buffer_event *event;
> + struct trace_array_cpu *data;
> + int size;
> + int syscall_nr;
> + int nr_offset;
> + int cpu;
> +
> + syscall_nr = syscall_get_nr(current, regs);
> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> + return;
> +
> + cpu = raw_smp_processor_id();
> + data = tr->data[cpu];
> +
> + if (unlikely(atomic_read(&data->disabled)))
> + return;
> +
> + sys_data = &syscall_trace_entries[nr_offset];
> + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> +
> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> + if (!event)
> + return;
> +
> + entry = ring_buffer_event_data(event);
> + entry->nr = nr_offset;
> + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> +
> + ring_buffer_unlock_commit(tr->buffer, event);

BTW, you should always use the matching lock. Above you used
trace_buffer_lock_reserve, and here you use ring_buffer_unlock_commit.
This may not work in the future. There is a trace_buffer_unlock_commit
too.

But you are going to change these to trace_current_buffer_lock_reserve and
trace_current_buffer_unlock_commit anyway ;-)


> + trace_wake_up();
> +}
> +
> +void ftrace_syscall_exit(struct pt_regs *regs)
> +{
> + struct trace_array *tr = __get_global_trace();
> + struct syscall_trace_exit *entry;
> + const struct syscall_trace_entry *sys_data;
> + struct ring_buffer_event *event;
> + struct trace_array_cpu *data;
> + int syscall_nr;
> + int nr_offset;
> + int cpu;
> +
> + syscall_nr = syscall_get_nr(current, regs);
> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> + return;
> +
> + cpu = raw_smp_processor_id();
> + data = tr->data[cpu];
> +
> + if (unlikely(atomic_read(&data->disabled)))
> + return;
> +
> + sys_data = &syscall_trace_entries[nr_offset];
> +
> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> + sizeof(*entry), 0, 0);
> + if (!event)
> + return;
> +
> + entry = ring_buffer_event_data(event);
> + entry->nr = nr_offset;
> + entry->ret = syscall_get_return_value(current, regs);
> +
> + ring_buffer_unlock_commit(tr->buffer, event);

Same here.

-- Steve

> + trace_wake_up();
> +}
> +
> +static int init_syscall_tracer(struct trace_array *tr)
> +{
> + start_ftrace_syscalls();
> +
> + return 0;
> +}
> +
> +static void reset_syscall_tracer(struct trace_array *tr)
> +{
> + stop_ftrace_syscalls();
> +}
> +
> +static struct trace_event syscall_enter_event = {
> + .type = TRACE_SYSCALL_ENTER,
> + .trace = print_syscall_enter,
> +};
> +
> +static struct trace_event syscall_exit_event = {
> + .type = TRACE_SYSCALL_EXIT,
> + .trace = print_syscall_exit,
> +};
> +
> +static struct tracer syscall_tracer __read_mostly = {
> + .name = "syscall",
> + .init = init_syscall_tracer,
> + .reset = reset_syscall_tracer
> +};
> +
> +__init int register_ftrace_syscalls(void)
> +{
> + int ret;
> +
> + ret = register_ftrace_event(&syscall_enter_event);
> + if (!ret) {
> + printk(KERN_WARNING "event %d failed to register\n",
> + syscall_enter_event.type);
> + WARN_ON_ONCE(1);
> + }
> +
> + ret = register_ftrace_event(&syscall_exit_event);
> + if (!ret) {
> + printk(KERN_WARNING "event %d failed to register\n",
> + syscall_exit_event.type);
> + WARN_ON_ONCE(1);
> + }
> +
> + return register_tracer(&syscall_tracer);
> +}
> +device_initcall(register_ftrace_syscalls);
> --
> 1.6.1
>
>
>

2009-03-09 20:40:31

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Sun, Mar 08, 2009 at 05:24:44PM +0100, Ingo Molnar wrote:
>
> * Frederic Weisbecker <[email protected]> wrote:
>
> > +static const struct syscall_trace_entry syscall_trace_entries[] = {
> > + /* For open, the first argument is a string, hence the given mask */
> > + [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
> > + [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
> > + [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
> > + [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
> > +};
>
> s/read/write in the last entry i guess.
>
> But i think the whole concept of duplicating the syscall table
> is the wrong way around.
>
> Note that we dont have to build this information at all - in
> 2.6.29-rc1 all syscalls got wrapper macros:
>
> SYSCALL_DEFINE1(nice, int, increment)
> SYSCALL_DEFINE2(sched_setparam, pid_t, pid, struct sched_param __user *, param)
> SYSCALL_DEFINE3(sched_setaffinity, pid_t, pid, unsigned int, len,
> unsigned long __user *, user_mask_ptr)
>
> We also have the syscall table itself.
>
> So what we can do: by changing the SYSCALL_DEFINEX() macros we
> can emit the following information into a table:
>
> (syscall_fn_address, syscall_name_string, #of arguments, array
> of argument names and type sizeof()s)
>
> then during bootup we can match up the sys_call_table[] to the
> secondary table we built, so that we can order the secondary
> table based on syscall NR. 99% of the syscalls will match up
> just fine.
>
> Ingo

Yes, I didn't know the SYSCALL_DEFINEx() macro until Peter told me.
That's indeed a much better way to proceed, I will look at all that.


Thanks!

2009-03-09 20:43:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Mon, Mar 09, 2009 at 09:44:35AM -0400, Steven Rostedt wrote:
>
> On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> >
> > libftrace-y := ftrace.o
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index cc94f86..5152be3 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> > unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> > TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
> >
> > +
> > +struct trace_array *__get_global_trace(void)
> > +{
> > + return &global_trace;
> > +}
> > +
>
>
> Don't do that. It's opening up a can of worms ;-)



That's right. I just wanted to have the insertion done in the dedicated file.


> >
> > @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> > #endif /* CONFIG_FTRACE_STARTUP_TEST */
> >
> > extern void *head_page(struct trace_array_cpu *data);
> > +extern struct trace_array *__get_global_trace(void);
>
> Remove that too.
>
> > +void ftrace_syscall_enter(struct pt_regs *regs)
> > +{
> > + struct trace_array *tr = __get_global_trace();
>
> Don't do that.
>
> > + struct syscall_trace_enter *entry;
> > + const struct syscall_trace_entry *sys_data;
> > + struct ring_buffer_event *event;
> > + struct trace_array_cpu *data;
> > + int size;
> > + int syscall_nr;
> > + int nr_offset;
> > + int cpu;
> > +
> > + syscall_nr = syscall_get_nr(current, regs);
> > + nr_offset = arch_syscall_trace_nr[syscall_nr];
> > + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > + return;
> > +
> > + cpu = raw_smp_processor_id();
> > + data = tr->data[cpu];
> > +
> > + if (unlikely(atomic_read(&data->disabled)))
> > + return;
> > +
> > + sys_data = &syscall_trace_entries[nr_offset];
> > + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> > +
> > + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
>
> Here you want:
>
> trace_current_buffer_lock_reserve(TRACE_SYSCALL_ENTER, size, 0, 0);
>
>
> -- Steve


Heh! I didn't know this one! :-)
Thanks!


>
> > + if (!event)
> > + return;
> > +
> > + entry = ring_buffer_event_data(event);
> > + entry->nr = nr_offset;
> > + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> > +
> > + ring_buffer_unlock_commit(tr->buffer, event);
> > + trace_wake_up();
> > +}
> > +
> > +void ftrace_syscall_exit(struct pt_regs *regs)
> > +{
> > + struct trace_array *tr = __get_global_trace();
> > + struct syscall_trace_exit *entry;
> > + const struct syscall_trace_entry *sys_data;
> > + struct ring_buffer_event *event;
> > + struct trace_array_cpu *data;
> > + int syscall_nr;
> > + int nr_offset;
> > + int cpu;
> > +
> > + syscall_nr = syscall_get_nr(current, regs);
> > + nr_offset = arch_syscall_trace_nr[syscall_nr];
> > + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > + return;
> > +
> > + cpu = raw_smp_processor_id();
> > + data = tr->data[cpu];
> > +
> > + if (unlikely(atomic_read(&data->disabled)))
> > + return;
> > +
> > + sys_data = &syscall_trace_entries[nr_offset];
> > +
> > + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> > + sizeof(*entry), 0, 0);
>
> Same here.
>
> > + if (!event)
> > + return;
> > +
> > + entry = ring_buffer_event_data(event);
> > + entry->nr = nr_offset;
> > + entry->ret = syscall_get_return_value(current, regs);
> > +
> > + ring_buffer_unlock_commit(tr->buffer, event);
> > + trace_wake_up();
> > +}
> > +
>
> -- Steve
>

2009-03-09 20:48:26

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Mon, Mar 09, 2009 at 09:52:20AM -0400, Steven Rostedt wrote:
>
> On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> > +
> > +
> > +static atomic_t refcount;
> > +
> > +enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > +{
> > + struct trace_seq *s = &iter->seq;
> > + struct trace_entry *ent = iter->ent;
> > + struct syscall_trace_enter *trace;
> > + enum syscall_trace_nr syscall;
> > + const struct syscall_trace_entry *entry;
> > + char *str = NULL;
> > + int i, ret;
> > +
> > + trace_assign_type(trace, ent);
> > +
> > + syscall = trace->nr;
> > +
> > + if (!syscall || syscall >= __SYSCALL_TRACE_END)
>
> You still need to print a "\n" otherwise the output of the next line
> will start on the current line.

Are you sure?
This entry will be ignored. And the previous one did print a newline.
So the next one should not have any problem.


> > + return TRACE_TYPE_HANDLED;
> > +
> > + entry = &syscall_trace_entries[syscall];
> > + if (!entry->simple_format) {
> > + ret = entry->output.print(trace->args);
> > + if (!ret)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > + return TRACE_TYPE_HANDLED;
> > + }
> > +
> > + ret = trace_seq_printf(s, "syscall %s :", entry->name);
> > + if (!ret)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +
> > + for (i = 0; i < entry->nb_args; i++) {
> > + /*
> > + * FIXME: the string comes from the user, but on tracing
> > + * output time, it may have disappeared from the memory.
> > + * May be better to strdup it on tracing time.
> > + */
> > + if (entry->output.string_mask & (1 << i)) {
> > + /* 128 is enough for most path */
> > + str = strndup_user((char *)trace->args[i], 128);
> > +
> > + /* If it faulted badly, the error shoud have been
> > + * handled while servicing the syscall, just ignore.
> > + */
> > + if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> > + || str == ERR_PTR(-ENOMEM))
> > + return TRACE_TYPE_HANDLED;
> > +
> > + ret = trace_seq_printf(s, " %s", str);
> > + } else {
> > + ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> > + }
> > + if (!ret)
> > + goto end_partial;
> > + }
> > + if (!trace_seq_printf(s, "\n"))
> > + goto end_partial;
> > +
> > + kfree(str);
> > + return TRACE_TYPE_HANDLED;
> > +
> > +end_partial:
> > + kfree(str);
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +}
> > +
> > +enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags)
> > +{
> > + struct trace_seq *s = &iter->seq;
> > + struct trace_entry *ent = iter->ent;
> > + struct syscall_trace_exit *trace;
> > + enum syscall_trace_nr syscall;
> > + const struct syscall_trace_entry *entry;
> > + int ret;
> > +
> > + trace_assign_type(trace, ent);
> > +
> > + syscall = trace->nr;
> > +
> > + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > + return TRACE_TYPE_HANDLED;
> > +
> > + entry = &syscall_trace_entries[syscall];
> > +
> > + ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> > + trace->ret);
> > + if (!ret)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +
> > + return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +void start_ftrace_syscalls(void)
> > +{
> > + unsigned long flags;
> > + struct task_struct *g, *t;
> > +
> > + if (atomic_inc_return(&refcount) != 1)
> > + goto out;
> > +
> > + read_lock_irqsave(&tasklist_lock, flags);
> > +
> > + do_each_thread(g, t) {
> > + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > + } while_each_thread(g, t);
> > +
> > + read_unlock_irqrestore(&tasklist_lock, flags);
> > +
> > + return;
> > +out:
> > + atomic_dec(&refcount);
> > +}
> > +
> > +void stop_ftrace_syscalls(void)
> > +{
> > + unsigned long flags;
> > + struct task_struct *g, *t;
> > +
> > + if (atomic_dec_return(&refcount))
> > + return;
> > +
> > + read_lock_irqsave(&tasklist_lock, flags);
> > +
> > + do_each_thread(g, t) {
> > + clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > + } while_each_thread(g, t);
> > +
> > + read_unlock_irqrestore(&tasklist_lock, flags);
> > +}
> > +
> > +void ftrace_syscall_enter(struct pt_regs *regs)
> > +{
> > + struct trace_array *tr = __get_global_trace();
> > + struct syscall_trace_enter *entry;
> > + const struct syscall_trace_entry *sys_data;
> > + struct ring_buffer_event *event;
> > + struct trace_array_cpu *data;
> > + int size;
> > + int syscall_nr;
> > + int nr_offset;
> > + int cpu;
> > +
> > + syscall_nr = syscall_get_nr(current, regs);
> > + nr_offset = arch_syscall_trace_nr[syscall_nr];
> > + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > + return;
> > +
> > + cpu = raw_smp_processor_id();
> > + data = tr->data[cpu];
> > +
> > + if (unlikely(atomic_read(&data->disabled)))
> > + return;
> > +
> > + sys_data = &syscall_trace_entries[nr_offset];
> > + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> > +
> > + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> > + if (!event)
> > + return;
> > +
> > + entry = ring_buffer_event_data(event);
> > + entry->nr = nr_offset;
> > + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> > +
> > + ring_buffer_unlock_commit(tr->buffer, event);
>
> BTW, you should always use the matching lock. Above you used
> trace_buffer_lock_reserve, and here you use ring_buffer_unlock_commit.
> This may not work in the future. There is a trace_buffer_unlock_commit
> too.

Ok. I just wondered a bit about the tests against stacktraces flags
which are unecessary here.

>
> But you are going to change these to trace_current_buffer_lock_reserve and
> trace_current_buffer_unlock_commit anyway ;-)
>


Exactly :-)

Thanks!

Frederic.


> > + trace_wake_up();
> > +}
> > +
> > +void ftrace_syscall_exit(struct pt_regs *regs)
> > +{
> > + struct trace_array *tr = __get_global_trace();
> > + struct syscall_trace_exit *entry;
> > + const struct syscall_trace_entry *sys_data;
> > + struct ring_buffer_event *event;
> > + struct trace_array_cpu *data;
> > + int syscall_nr;
> > + int nr_offset;
> > + int cpu;
> > +
> > + syscall_nr = syscall_get_nr(current, regs);
> > + nr_offset = arch_syscall_trace_nr[syscall_nr];
> > + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > + return;
> > +
> > + cpu = raw_smp_processor_id();
> > + data = tr->data[cpu];
> > +
> > + if (unlikely(atomic_read(&data->disabled)))
> > + return;
> > +
> > + sys_data = &syscall_trace_entries[nr_offset];
> > +
> > + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> > + sizeof(*entry), 0, 0);
> > + if (!event)
> > + return;
> > +
> > + entry = ring_buffer_event_data(event);
> > + entry->nr = nr_offset;
> > + entry->ret = syscall_get_return_value(current, regs);
> > +
> > + ring_buffer_unlock_commit(tr->buffer, event);
>
> Same here.
>
> -- Steve
>
> > + trace_wake_up();
> > +}
> > +
> > +static int init_syscall_tracer(struct trace_array *tr)
> > +{
> > + start_ftrace_syscalls();
> > +
> > + return 0;
> > +}
> > +
> > +static void reset_syscall_tracer(struct trace_array *tr)
> > +{
> > + stop_ftrace_syscalls();
> > +}
> > +
> > +static struct trace_event syscall_enter_event = {
> > + .type = TRACE_SYSCALL_ENTER,
> > + .trace = print_syscall_enter,
> > +};
> > +
> > +static struct trace_event syscall_exit_event = {
> > + .type = TRACE_SYSCALL_EXIT,
> > + .trace = print_syscall_exit,
> > +};
> > +
> > +static struct tracer syscall_tracer __read_mostly = {
> > + .name = "syscall",
> > + .init = init_syscall_tracer,
> > + .reset = reset_syscall_tracer
> > +};
> > +
> > +__init int register_ftrace_syscalls(void)
> > +{
> > + int ret;
> > +
> > + ret = register_ftrace_event(&syscall_enter_event);
> > + if (!ret) {
> > + printk(KERN_WARNING "event %d failed to register\n",
> > + syscall_enter_event.type);
> > + WARN_ON_ONCE(1);
> > + }
> > +
> > + ret = register_ftrace_event(&syscall_exit_event);
> > + if (!ret) {
> > + printk(KERN_WARNING "event %d failed to register\n",
> > + syscall_exit_event.type);
> > + WARN_ON_ONCE(1);
> > + }
> > +
> > + return register_tracer(&syscall_tracer);
> > +}
> > +device_initcall(register_ftrace_syscalls);
> > --
> > 1.6.1
> >
> >
> >

2009-03-10 00:52:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure



On Mon, 9 Mar 2009, Frederic Weisbecker wrote:

> On Mon, Mar 09, 2009 at 09:52:20AM -0400, Steven Rostedt wrote:
> >
> > On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> > > +
> > > +
> > > +static atomic_t refcount;
> > > +
> > > +enum print_line_t
> > > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > > +{
> > > + struct trace_seq *s = &iter->seq;
> > > + struct trace_entry *ent = iter->ent;
> > > + struct syscall_trace_enter *trace;
> > > + enum syscall_trace_nr syscall;
> > > + const struct syscall_trace_entry *entry;
> > > + char *str = NULL;
> > > + int i, ret;
> > > +
> > > + trace_assign_type(trace, ent);
> > > +
> > > + syscall = trace->nr;
> > > +
> > > + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> >
> > You still need to print a "\n" otherwise the output of the next line
> > will start on the current line.
>
> Are you sure?
> This entry will be ignored. And the previous one did print a newline.
> So the next one should not have any problem.

Does this take over the printline? Otherwise we could be printing a
header, (comm, timestamp, cpu, etc).

-- Steve

2009-03-10 10:08:48

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Mon, Mar 09, 2009 at 08:52:00PM -0400, Steven Rostedt wrote:
>
>
> On Mon, 9 Mar 2009, Frederic Weisbecker wrote:
>
> > On Mon, Mar 09, 2009 at 09:52:20AM -0400, Steven Rostedt wrote:
> > >
> > > On Sat, 7 Mar 2009, Frederic Weisbecker wrote:
> > > > +
> > > > +
> > > > +static atomic_t refcount;
> > > > +
> > > > +enum print_line_t
> > > > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > > > +{
> > > > + struct trace_seq *s = &iter->seq;
> > > > + struct trace_entry *ent = iter->ent;
> > > > + struct syscall_trace_enter *trace;
> > > > + enum syscall_trace_nr syscall;
> > > > + const struct syscall_trace_entry *entry;
> > > > + char *str = NULL;
> > > > + int i, ret;
> > > > +
> > > > + trace_assign_type(trace, ent);
> > > > +
> > > > + syscall = trace->nr;
> > > > +
> > > > + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > >
> > > You still need to print a "\n" otherwise the output of the next line
> > > will start on the current line.
> >
> > Are you sure?
> > This entry will be ignored. And the previous one did print a newline.
> > So the next one should not have any problem.
>
> Does this take over the printline? Otherwise we could be printing a
> header, (comm, timestamp, cpu, etc).
>
> -- Steve

Ah ok, I did not think of the headers.

2009-03-13 04:29:35

by Frederic Weisbecker

[permalink] [raw]
Subject: [tip:tracing/syscalls] tracing/ftrace: syscall tracing infrastructure, basics

Commit-ID: ee08c6eccb7d1295516f7cf420fddf7b14e9146f
Gitweb: http://git.kernel.org/tip/ee08c6eccb7d1295516f7cf420fddf7b14e9146f
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Sat, 7 Mar 2009 05:52:59 +0100
Commit: Ingo Molnar <[email protected]>
CommitDate: Fri, 13 Mar 2009 06:25:43 +0100

tracing/ftrace: syscall tracing infrastructure, basics

Provide basic callbacks to do syscall tracing.

Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
LKML-Reference: <[email protected]>
[ simplified it to a trace_printk() for now. ]
Signed-off-by: Ingo Molnar <[email protected]>


---
include/linux/ftrace.h | 21 ++++++++
kernel/trace/Kconfig | 10 ++++
kernel/trace/Makefile | 1 +
kernel/trace/trace.h | 2 +
kernel/trace/trace_syscalls.c | 113 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 147 insertions(+), 0 deletions(-)

diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index e1583f2..c146c10 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -503,4 +503,25 @@ static inline void trace_hw_branch_oops(void) {}

#endif /* CONFIG_HW_BRANCH_TRACER */

+/*
+ * A syscall entry in the ftrace syscalls array.
+ *
+ * @syscall_nr: syscall number
+ */
+struct syscall_trace_entry {
+ int syscall_nr;
+};
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+extern void start_ftrace_syscalls(void);
+extern void stop_ftrace_syscalls(void);
+extern void ftrace_syscall_enter(struct pt_regs *regs);
+extern void ftrace_syscall_exit(struct pt_regs *regs);
+#else
+static inline void start_ftrace_syscalls(void) { }
+static inline void stop_ftrace_syscalls(void) { }
+static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
+static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
+#endif
+
#endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 8e4a2a6..95a0ad1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
config HAVE_HW_BRANCH_TRACER
bool

+config HAVE_FTRACE_SYSCALLS
+ bool
+
config TRACER_MAX_TRACE
bool

@@ -175,6 +178,13 @@ config EVENT_TRACER
allowing the user to pick and choose which trace point they
want to trace.

+config FTRACE_SYSCALLS
+ bool "Trace syscalls"
+ depends on HAVE_FTRACE_SYSCALLS
+ select TRACING
+ help
+ Basic tracer to catch the syscall entry and exit events.
+
config BOOT_TRACER
bool "Trace boot initcalls"
select TRACING
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c7a2943..c3feea0 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_EVENT_TRACER) += trace_events.o
obj-$(CONFIG_EVENT_TRACER) += events.o
obj-$(CONFIG_EVENT_TRACER) += trace_export.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o

libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index c5e1d88..3d49daa 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -30,6 +30,8 @@ enum trace_type {
TRACE_GRAPH_ENT,
TRACE_USER_STACK,
TRACE_HW_BRANCHES,
+ TRACE_SYSCALL_ENTER,
+ TRACE_SYSCALL_EXIT,
TRACE_KMEM_ALLOC,
TRACE_KMEM_FREE,
TRACE_POWER,
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
new file mode 100644
index 0000000..66cf974
--- /dev/null
+++ b/kernel/trace/trace_syscalls.c
@@ -0,0 +1,113 @@
+#include <linux/ftrace.h>
+#include <linux/kernel.h>
+
+#include <asm/syscall.h>
+
+#include "trace_output.h"
+#include "trace.h"
+
+static atomic_t refcount;
+
+void start_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_inc_return(&refcount) != 1)
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+ atomic_dec(&refcount);
+}
+
+void stop_ftrace_syscalls(void)
+{
+ unsigned long flags;
+ struct task_struct *g, *t;
+
+ if (atomic_dec_return(&refcount))
+ goto out;
+
+ read_lock_irqsave(&tasklist_lock, flags);
+
+ do_each_thread(g, t) {
+ clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
+ } while_each_thread(g, t);
+
+ read_unlock_irqrestore(&tasklist_lock, flags);
+out:
+ atomic_inc(&refcount);
+}
+
+void ftrace_syscall_enter(struct pt_regs *regs)
+{
+ int syscall_nr;
+
+ syscall_nr = syscall_get_nr(current, regs);
+
+ trace_printk("syscall %d enter\n", syscall_nr);
+}
+
+void ftrace_syscall_exit(struct pt_regs *regs)
+{
+ int syscall_nr;
+
+ syscall_nr = syscall_get_nr(current, regs);
+
+ trace_printk("syscall %d exit\n", syscall_nr);
+}
+
+static int init_syscall_tracer(struct trace_array *tr)
+{
+ start_ftrace_syscalls();
+
+ return 0;
+}
+
+static void reset_syscall_tracer(struct trace_array *tr)
+{
+ stop_ftrace_syscalls();
+}
+
+static struct trace_event syscall_enter_event = {
+ .type = TRACE_SYSCALL_ENTER,
+};
+
+static struct trace_event syscall_exit_event = {
+ .type = TRACE_SYSCALL_EXIT,
+};
+
+static struct tracer syscall_tracer __read_mostly = {
+ .name = "syscall",
+ .init = init_syscall_tracer,
+ .reset = reset_syscall_tracer
+};
+
+__init int register_ftrace_syscalls(void)
+{
+ int ret;
+
+ ret = register_ftrace_event(&syscall_enter_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_enter_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ ret = register_ftrace_event(&syscall_exit_event);
+ if (!ret) {
+ printk(KERN_WARNING "event %d failed to register\n",
+ syscall_exit_event.type);
+ WARN_ON_ONCE(1);
+ }
+
+ return register_tracer(&syscall_tracer);
+}
+device_initcall(register_ftrace_syscalls);

2009-03-13 04:30:20

by Frederic Weisbecker

[permalink] [raw]
Subject: [tip:tracing/syscalls] tracing/x86: basic implementation of syscall tracing for x86

Commit-ID: 1b3fa2ce64363c289b3b14723cca7290bf91cfce
Gitweb: http://git.kernel.org/tip/1b3fa2ce64363c289b3b14723cca7290bf91cfce
Author: Frederic Weisbecker <[email protected]>
AuthorDate: Sat, 7 Mar 2009 05:53:00 +0100
Commit: Ingo Molnar <[email protected]>
CommitDate: Fri, 13 Mar 2009 06:25:44 +0100

tracing/x86: basic implementation of syscall tracing for x86

Provide the x86 trace callbacks to trace syscalls.

Signed-off-by: Frederic Weisbecker <[email protected]>
Acked-by: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 9 ++++++---
arch/x86/kernel/ptrace.c | 7 +++++++
3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bdcee12..b0a638b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -35,6 +35,7 @@ config X86
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
+ select HAVE_FTRACE_SYSCALLS
select HAVE_KVM
select HAVE_ARCH_KGDB
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index df9d5f7..8820a73 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -94,6 +94,7 @@ struct thread_info {
#define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
#define TIF_DEBUGCTLMSR 25 /* uses thread_struct.debugctlmsr */
#define TIF_DS_AREA_MSR 26 /* uses thread_struct.ds_area_msr */
+#define TIF_SYSCALL_FTRACE 27 /* for ftrace syscall instrumentation */

#define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
@@ -115,15 +116,17 @@ struct thread_info {
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
#define _TIF_DEBUGCTLMSR (1 << TIF_DEBUGCTLMSR)
#define _TIF_DS_AREA_MSR (1 << TIF_DS_AREA_MSR)
+#define _TIF_SYSCALL_FTRACE (1 << TIF_SYSCALL_FTRACE)

/* work to do in syscall_trace_enter() */
#define _TIF_WORK_SYSCALL_ENTRY \
- (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | \
+ (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_FTRACE | \
_TIF_SYSCALL_AUDIT | _TIF_SECCOMP | _TIF_SINGLESTEP)

/* work to do in syscall_trace_leave() */
#define _TIF_WORK_SYSCALL_EXIT \
- (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
+ (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP | \
+ _TIF_SYSCALL_FTRACE)

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
@@ -132,7 +135,7 @@ struct thread_info {
_TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))

/* work to do on any return to user space */
-#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
+#define _TIF_ALLWORK_MASK ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_FTRACE)

/* Only used for 64 bit */
#define _TIF_DO_NOTIFY_MASK \
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 3d9672e..99749d6 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -21,6 +21,7 @@
#include <linux/audit.h>
#include <linux/seccomp.h>
#include <linux/signal.h>
+#include <linux/ftrace.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -1416,6 +1417,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
tracehook_report_syscall_entry(regs))
ret = -1L;

+ if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+ ftrace_syscall_enter(regs);
+
if (unlikely(current->audit_context)) {
if (IS_IA32)
audit_syscall_entry(AUDIT_ARCH_I386,
@@ -1439,6 +1443,9 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
if (unlikely(current->audit_context))
audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);

+ if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
+ ftrace_syscall_exit(regs);
+
if (test_thread_flag(TIF_SYSCALL_TRACE))
tracehook_report_syscall_exit(regs, 0);

2009-03-13 16:11:31

by Ingo Molnar

[permalink] [raw]
Subject: [tip:tracing/syscalls] tracing/syscalls: support for syscalls tracing on x86, fix

Commit-ID: ccd50dfd92ea2c4ba9e39531ac55db53393e783e
Gitweb: http://git.kernel.org/tip/ccd50dfd92ea2c4ba9e39531ac55db53393e783e
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 13 Mar 2009 17:02:17 +0100
Commit: Ingo Molnar <[email protected]>
CommitDate: Fri, 13 Mar 2009 17:02:17 +0100

tracing/syscalls: support for syscalls tracing on x86, fix

Impact: build fix

kernel/built-in.o: In function `ftrace_syscall_exit':
(.text+0x76667): undefined reference to `syscall_nr_to_meta'

ftrace.o is built:

obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o

But now a CONFIG_FTRACE_SYSCALLS dependency is needed too.

Cc: Frederic Weisbecker <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Lai Jiangshan <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>


---
arch/x86/kernel/Makefile | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 339ce35..84000eb 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -66,7 +66,8 @@ obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-y += apic/
obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
-obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
+obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
+obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
obj-$(CONFIG_KEXEC) += machine_kexec_$(BITS).o
obj-$(CONFIG_KEXEC) += relocate_kernel_$(BITS).o crash.o
obj-$(CONFIG_CRASH_DUMP) += crash_dump_$(BITS).o

2009-03-13 16:35:45

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64

* Frederic Weisbecker ([email protected]) wrote:
> Provide the ptrace hooks and arch specific syscall numbers to ftrace arch indepedant
> syscall numbers.
> For now it only supports 4 syscalls to provide an example.
>

Hi Frederic,

I already have the equivalent TIF_KERNEL_TRACE flag in my LTTng tree
added to every Linux architecture. You might want to re-use this work
rather than re-doing this. I don't mind changing the flag name.

Please see the LTTng tree and patchset at http://www.lttng.org. I'll
answer more precisely if you need more information.

Mathieu

> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/ftrace.h | 7 +++++++
> arch/x86/include/asm/thread_info.h | 9 ++++++---
> arch/x86/kernel/ftrace.c | 15 +++++++++++++++
> arch/x86/kernel/ptrace.c | 7 +++++++
> 5 files changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2a0ddfe..b5aec1b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -35,6 +35,7 @@ config X86
> select HAVE_FUNCTION_GRAPH_TRACER
> select HAVE_FUNCTION_TRACE_MCOUNT_TEST
> select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
> + select HAVE_FTRACE_SYSCALLS if X86_64
> select HAVE_KVM
> select HAVE_ARCH_KGDB
> select HAVE_ARCH_TRACEHOOK
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index db24c22..b2e6855 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -28,6 +28,13 @@
>
> #endif
>
> +/* FIXME: I don't want to stay hardcoded */
> +#ifdef CONFIG_X86_64
> +#define FTRACE_SYSCALL_MAX 296
> +#else
> +#define FTRACE_SYSCALL_MAX 333
> +#endif
> +
> #ifdef CONFIG_FUNCTION_TRACER
> #define MCOUNT_ADDR ((long)(mcount))
> #define MCOUNT_INSN_SIZE 5 /* sizeof mcount call */
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index f6ba6f8..83d2b73 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -95,6 +95,7 @@ struct thread_info {
> #define TIF_FORCED_TF 24 /* true if TF in eflags artificially */
> #define TIF_DEBUGCTLMSR 25 /* uses thread_struct.debugctlmsr */
> #define TIF_DS_AREA_MSR 26 /* uses thread_struct.ds_area_msr */
> +#define TIF_SYSCALL_FTRACE 27 /* for ftrace syscall instrumentation */
>
> #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE)
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> @@ -117,15 +118,17 @@ struct thread_info {
> #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
> #define _TIF_DEBUGCTLMSR (1 << TIF_DEBUGCTLMSR)
> #define _TIF_DS_AREA_MSR (1 << TIF_DS_AREA_MSR)
> +#define _TIF_SYSCALL_FTRACE (1 << TIF_SYSCALL_FTRACE)
>
> /* work to do in syscall_trace_enter() */
> #define _TIF_WORK_SYSCALL_ENTRY \
> - (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | \
> + (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_FTRACE | \
> _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | _TIF_SINGLESTEP)
>
> /* work to do in syscall_trace_leave() */
> #define _TIF_WORK_SYSCALL_EXIT \
> - (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP)
> + (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SINGLESTEP | \
> + _TIF_SYSCALL_FTRACE)
>
> /* work to do on interrupt/exception return */
> #define _TIF_WORK_MASK \
> @@ -134,7 +137,7 @@ struct thread_info {
> _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
>
> /* work to do on any return to user space */
> -#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
> +#define _TIF_ALLWORK_MASK ((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_FTRACE)
>
> /* Only used for 64 bit */
> #define _TIF_DO_NOTIFY_MASK \
> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index a85da17..5bba98b 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -453,3 +453,18 @@ void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr)
> }
> }
> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +#ifdef CONFIG_X86_64
> +/*
> + * Array that make the arch specific syscall numbers
> + * matching the ftrace syscall numbers.
> + */
> +short arch_syscall_trace_nr[FTRACE_SYSCALL_MAX] = {
> + [__NR_open] = SYSCALL_TRACE_OPEN,
> + [__NR_close] = SYSCALL_TRACE_CLOSE,
> + [__NR_read] = SYSCALL_TRACE_READ,
> + [__NR_write] = SYSCALL_TRACE_WRITE
> +};
> +#endif
> +#endif
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 3d9672e..99749d6 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -21,6 +21,7 @@
> #include <linux/audit.h>
> #include <linux/seccomp.h>
> #include <linux/signal.h>
> +#include <linux/ftrace.h>
>
> #include <asm/uaccess.h>
> #include <asm/pgtable.h>
> @@ -1416,6 +1417,9 @@ asmregparm long syscall_trace_enter(struct pt_regs *regs)
> tracehook_report_syscall_entry(regs))
> ret = -1L;
>
> + if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
> + ftrace_syscall_enter(regs);
> +
> if (unlikely(current->audit_context)) {
> if (IS_IA32)
> audit_syscall_entry(AUDIT_ARCH_I386,
> @@ -1439,6 +1443,9 @@ asmregparm void syscall_trace_leave(struct pt_regs *regs)
> if (unlikely(current->audit_context))
> audit_syscall_exit(AUDITSC_RESULT(regs->ax), regs->ax);
>
> + if (unlikely(test_thread_flag(TIF_SYSCALL_FTRACE)))
> + ftrace_syscall_exit(regs);
> +
> if (test_thread_flag(TIF_SYSCALL_TRACE))
> tracehook_report_syscall_exit(regs, 0);
>
> --
> 1.6.1
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-13 16:37:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64


* Mathieu Desnoyers <[email protected]> wrote:

> * Frederic Weisbecker ([email protected]) wrote:
>
> > Provide the ptrace hooks and arch specific syscall numbers
> > to ftrace arch indepedant syscall numbers. For now it only
> > supports 4 syscalls to provide an example.
> >
>
> Hi Frederic,
>
> I already have the equivalent TIF_KERNEL_TRACE flag in my
> LTTng tree added to every Linux architecture. You might want
> to re-use this work rather than re-doing this. I don't mind
> changing the flag name.

Yeah. Note that the TIF bits are just one part - there are other
bits needed for HAVE_FTRACE_SYSCALLS arch support.

Also, i'd eventually expect the TIF bits to be converted to a
tracehook callback, not spread it to other architectures.

Ingo

2009-03-13 16:50:40

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > * Frederic Weisbecker ([email protected]) wrote:
> >
> > > Provide the ptrace hooks and arch specific syscall numbers
> > > to ftrace arch indepedant syscall numbers. For now it only
> > > supports 4 syscalls to provide an example.
> > >
> >
> > Hi Frederic,
> >
> > I already have the equivalent TIF_KERNEL_TRACE flag in my
> > LTTng tree added to every Linux architecture. You might want
> > to re-use this work rather than re-doing this. I don't mind
> > changing the flag name.
>
> Yeah. Note that the TIF bits are just one part - there are other
> bits needed for HAVE_FTRACE_SYSCALLS arch support.
>
> Also, i'd eventually expect the TIF bits to be converted to a
> tracehook callback, not spread it to other architectures.
>

The nice part about the TIF bit is that it permits adding this syscall
tracing feature without increasing the size of the thread_info struct
nor adding any extra tests in entry.S. Does the tracehook callback have
these features ?

Mathieu

> Ingo

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-15 04:44:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64


* Mathieu Desnoyers <[email protected]> wrote:

> * Ingo Molnar ([email protected]) wrote:
> >
> > * Mathieu Desnoyers <[email protected]> wrote:
> >
> > > * Frederic Weisbecker ([email protected]) wrote:
> > >
> > > > Provide the ptrace hooks and arch specific syscall numbers
> > > > to ftrace arch indepedant syscall numbers. For now it only
> > > > supports 4 syscalls to provide an example.
> > > >
> > >
> > > Hi Frederic,
> > >
> > > I already have the equivalent TIF_KERNEL_TRACE flag in my
> > > LTTng tree added to every Linux architecture. You might want
> > > to re-use this work rather than re-doing this. I don't mind
> > > changing the flag name.
> >
> > Yeah. Note that the TIF bits are just one part - there are other
> > bits needed for HAVE_FTRACE_SYSCALLS arch support.
> >
> > Also, i'd eventually expect the TIF bits to be converted to a
> > tracehook callback, not spread it to other architectures.
> >
>
> The nice part about the TIF bit is that it permits adding this
> syscall tracing feature without increasing the size of the
> thread_info struct nor adding any extra tests in entry.S. Does
> the tracehook callback have these features ?

yes. Tracehook just factors out common interfacing points - with
one specific implementation for now: ptrace. For syscall tracing
that means it wraps TIF_SYSCALL_TRACE in essence.

Ingo

2009-03-15 15:17:20

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64

On Sun, Mar 15, 2009 at 05:44:05AM +0100, Ingo Molnar wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > * Ingo Molnar ([email protected]) wrote:
> > >
> > > * Mathieu Desnoyers <[email protected]> wrote:
> > >
> > > > * Frederic Weisbecker ([email protected]) wrote:
> > > >
> > > > > Provide the ptrace hooks and arch specific syscall numbers
> > > > > to ftrace arch indepedant syscall numbers. For now it only
> > > > > supports 4 syscalls to provide an example.
> > > > >
> > > >
> > > > Hi Frederic,
> > > >
> > > > I already have the equivalent TIF_KERNEL_TRACE flag in my
> > > > LTTng tree added to every Linux architecture. You might want
> > > > to re-use this work rather than re-doing this. I don't mind
> > > > changing the flag name.
> > >
> > > Yeah. Note that the TIF bits are just one part - there are other
> > > bits needed for HAVE_FTRACE_SYSCALLS arch support.
> > >
> > > Also, i'd eventually expect the TIF bits to be converted to a
> > > tracehook callback, not spread it to other architectures.
> > >
> >
> > The nice part about the TIF bit is that it permits adding this
> > syscall tracing feature without increasing the size of the
> > thread_info struct nor adding any extra tests in entry.S. Does
> > the tracehook callback have these features ?
>
> yes. Tracehook just factors out common interfacing points - with
> one specific implementation for now: ptrace. For syscall tracing
> that means it wraps TIF_SYSCALL_TRACE in essence.
>
> Ingo


Since I didn't want to experience conflicts against ptrace, I haven't
played with the tracehooks.

But anyway, I wrote this low level part of syscall tracing having in mind
the fact that utrace does this work very much better. That's why I wrote it
to be very basic and simple enough to do the job, but waiting for a better
code that is already written somewhere else.

So now there is some code which does that on both utrace and Lttng.

I would be pleased to see these patches which handle these flags properly
on LKML :-)

Frederic.


2009-03-15 19:22:21

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] tracing/x86: basic implementation of syscall tracing for x86-64

* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > * Ingo Molnar ([email protected]) wrote:
> > >
> > > * Mathieu Desnoyers <[email protected]> wrote:
> > >
> > > > * Frederic Weisbecker ([email protected]) wrote:
> > > >
> > > > > Provide the ptrace hooks and arch specific syscall numbers
> > > > > to ftrace arch indepedant syscall numbers. For now it only
> > > > > supports 4 syscalls to provide an example.
> > > > >
> > > >
> > > > Hi Frederic,
> > > >
> > > > I already have the equivalent TIF_KERNEL_TRACE flag in my
> > > > LTTng tree added to every Linux architecture. You might want
> > > > to re-use this work rather than re-doing this. I don't mind
> > > > changing the flag name.
> > >
> > > Yeah. Note that the TIF bits are just one part - there are other
> > > bits needed for HAVE_FTRACE_SYSCALLS arch support.
> > >
> > > Also, i'd eventually expect the TIF bits to be converted to a
> > > tracehook callback, not spread it to other architectures.
> > >
> >
> > The nice part about the TIF bit is that it permits adding this
> > syscall tracing feature without increasing the size of the
> > thread_info struct nor adding any extra tests in entry.S. Does
> > the tracehook callback have these features ?
>
> yes. Tracehook just factors out common interfacing points - with
> one specific implementation for now: ptrace. For syscall tracing
> that means it wraps TIF_SYSCALL_TRACE in essence.
>
> Ingo

What overhead and latency does tracehooks add when

- they are being enabled/disabled
- once they are enabled ?

Do they require to stop each and every thread upon activation ?

I'll post my lttng trace flags for RFC, as it could save Frederic some
duplicated work. Feel free to comment.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-16 19:36:46

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

Hi Frederic,

Out of curiously, could you tell me why did you introduce new TIF instead of
using utrace's tracehook API?
For example, systemtap already use current utrace/tracehook api for tracing
all syscalls. If there is any new advantage, it would be good to enhance
existing utrace/tracehook.

Thank you,

Frederic Weisbecker wrote:
> The most important drawback that implies the syscall tracing is the variable
> number of parameters that takes a syscall. Assuming there is a large number of
> syscall, the goal is to provide as much as possible a generic layer on it.
>
> What we want on tracing time (the hot path):
>
> - Save the parameters and the number of the syscall. That's all. The rest of the
> job can be done on the "output path".
>
> On output time:
>
> - Get the raw-binary saved parameters and the syscall number as well as some
> usual infos such as the pid, the time...
> - Depending on the tracing requirements, the user should have the choice to
> dump the raw datas or a formatted version.
>
> The core infrastructure has a static array which contains the necessary
> informations for each syscall (ok, I've only implemented 4 for now).
>
> When a syscall triggers, the tracer look at this array (O(1) access) and get the
> number of parameters for this syscall. Then these parameters are saved in a
> binary form on the ring buffer with the syscall number (a generic version, arch
> independent).
>
> On output time the parameters, the name of the syscall and very basic
> informations to format the parameters are picket from the ring-buffer and the
> generic syscall array.
>
> Depending of the level of implementation we want for the syscall inside this
> array, we can provide just a single mask where the n bit matches the n
> parameter. If the bit is set, then the parameter will be considered as a string
> pointer. Otherwise its raw hexadecimal value will be printed.
>
> If needed, we can otherwise provide a specific callback to print the syscall
> event.
>
> The probem with these approaches comes from the fact that the string might have
> disappeared from the user memory on output time. Even if we protect against
> faults, it can give unreliable traces.
> So perhaps we could think about copying the string on tracing time.
>
> Concerning the return value, it doesn't make any problem, a single raw
> value is printed for each one.
>
> Note that the current implementation can be easily scaled further to give only
> binary informations to the user and the matching metadata to decode it.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> ---
> include/linux/ftrace.h | 44 +++++++
> kernel/trace/Kconfig | 10 ++
> kernel/trace/Makefile | 1 +
> kernel/trace/trace.c | 6 +
> kernel/trace/trace.h | 32 +++++
> kernel/trace/trace_syscalls.c | 284 +++++++++++++++++++++++++++++++++++++++++
> 6 files changed, 377 insertions(+), 0 deletions(-)
> create mode 100644 kernel/trace/trace_syscalls.c
>
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index e1583f2..1798692 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
>
> #endif /* CONFIG_HW_BRANCH_TRACER */
>
> +/* Index for the ftrace syscalls array */
> +enum syscall_trace_nr {
> + SYSCALL_TRACE_OPEN = 1,
> + SYSCALL_TRACE_CLOSE,
> + SYSCALL_TRACE_READ,
> + SYSCALL_TRACE_WRITE,
> +
> + __SYSCALL_TRACE_END
> +};
> +
> +/*
> + * A syscall entry in the ftrace syscalls array.
> + *
> + * @name: name of the syscall
> + * @nb_args: number of parameters it takes
> + * @simple_format: if true, we use string_mask, otherwise the print callback
> + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
> + * will be displayed as a string, otherwise it will be considered
> + * as a raw number (hex displayed).
> + * @print: implement it if you want a custom output for a given syscall
> + */
> +struct syscall_trace_entry {
> + const char *name;
> + int nb_args;
> + bool simple_format;
> + union {
> + unsigned long string_mask;
> + int (*print)(unsigned long *);
> + } output;
> +};
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +extern short arch_syscall_trace_nr[];
> +extern void start_ftrace_syscalls(void);
> +extern void stop_ftrace_syscalls(void);
> +extern void ftrace_syscall_enter(struct pt_regs *regs);
> +extern void ftrace_syscall_exit(struct pt_regs *regs);
> +#else
> +static inline void start_ftrace_syscalls(void) { }
> +static inline void stop_ftrace_syscalls(void) { }
> +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
> +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
> +#endif
> +
> #endif /* _LINUX_FTRACE_H */
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 8e4a2a6..95a0ad1 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
> config HAVE_HW_BRANCH_TRACER
> bool
>
> +config HAVE_FTRACE_SYSCALLS
> + bool
> +
> config TRACER_MAX_TRACE
> bool
>
> @@ -175,6 +178,13 @@ config EVENT_TRACER
> allowing the user to pick and choose which trace point they
> want to trace.
>
> +config FTRACE_SYSCALLS
> + bool "Trace syscalls"
> + depends on HAVE_FTRACE_SYSCALLS
> + select TRACING
> + help
> + Basic tracer to catch the syscall entry and exit events.
> +
> config BOOT_TRACER
> bool "Trace boot initcalls"
> select TRACING
> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> index c7a2943..c3feea0 100644
> --- a/kernel/trace/Makefile
> +++ b/kernel/trace/Makefile
> @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> obj-$(CONFIG_EVENT_TRACER) += trace_events.o
> obj-$(CONFIG_EVENT_TRACER) += events.o
> obj-$(CONFIG_EVENT_TRACER) += trace_export.o
> +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
>
> libftrace-y := ftrace.o
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index cc94f86..5152be3 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
>
> +
> +struct trace_array *__get_global_trace(void)
> +{
> + return &global_trace;
> +}
> +
> /**
> * trace_wake_up - wake up tasks waiting for trace input
> *
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 2bfb7d1..9583830 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -30,6 +30,8 @@ enum trace_type {
> TRACE_GRAPH_ENT,
> TRACE_USER_STACK,
> TRACE_HW_BRANCHES,
> + TRACE_SYSCALL_ENTER,
> + TRACE_SYSCALL_EXIT,
> TRACE_KMEM_ALLOC,
> TRACE_KMEM_FREE,
> TRACE_POWER,
> @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
> const void *ptr;
> };
>
> +struct syscall_trace_enter {
> + struct trace_entry ent;
> + enum syscall_trace_nr nr;
> + unsigned long args[];
> +};
> +
> +struct syscall_trace_exit {
> + struct trace_entry ent;
> + enum syscall_trace_nr nr;
> + unsigned long ret;
> +};
> +
> +
> /*
> * trace_flag_type is an enumeration that holds different
> * states when a trace occurs. These are:
> @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
> TRACE_KMEM_ALLOC); \
> IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
> TRACE_KMEM_FREE); \
> + IF_ASSIGN(var, ent, struct syscall_trace_enter, \
> + TRACE_SYSCALL_ENTER); \
> + IF_ASSIGN(var, ent, struct syscall_trace_exit, \
> + TRACE_SYSCALL_EXIT); \
> __ftrace_bad_type(); \
> } while (0)
>
> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> #endif /* CONFIG_FTRACE_STARTUP_TEST */
>
> extern void *head_page(struct trace_array_cpu *data);
> +extern struct trace_array *__get_global_trace(void);
> extern long ns2usecs(cycle_t nsec);
> extern int
> trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
> @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
> return test_tsk_trace_trace(task);
> }
>
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +extern enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags);
> +extern enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags);
> +#else
> +static inline enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags) { }
> +static inline enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags) { }
> +#endif
> +
> /*
> * trace_iterator_flags is an enumeration that defines bit
> * positions into trace_flags that controls the output.
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> new file mode 100644
> index 0000000..470bb9d
> --- /dev/null
> +++ b/kernel/trace/trace_syscalls.c
> @@ -0,0 +1,284 @@
> +#include <linux/kernel.h>
> +#include <linux/ftrace.h>
> +#include <asm/syscall.h>
> +
> +#include "trace_output.h"
> +#include "trace.h"
> +
> +/* Create basic entry on syscall array (with the rudimentary string mask) */
> +#define SYS_TRACE_ENTRY(sname, args, mask) \
> + {.name = #sname, .nb_args = args, .simple_format = true, \
> + .output.string_mask = mask}
> +
> +/*
> + * Create a custom defined entry on the syscall array, you will have to
> + * implement a callback to output the syscall.
> + */
> +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
> + {.name = #sname, .nb_args = args, .simple_format = false, \
> + .print = printer}
> +
> +static const struct syscall_trace_entry syscall_trace_entries[] = {
> + /* For open, the first argument is a string, hence the given mask */
> + [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
> + [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
> + [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
> + [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
> +};
> +
> +
> +static atomic_t refcount;
> +
> +enum print_line_t
> +print_syscall_enter(struct trace_iterator *iter, int flags)
> +{
> + struct trace_seq *s = &iter->seq;
> + struct trace_entry *ent = iter->ent;
> + struct syscall_trace_enter *trace;
> + enum syscall_trace_nr syscall;
> + const struct syscall_trace_entry *entry;
> + char *str = NULL;
> + int i, ret;
> +
> + trace_assign_type(trace, ent);
> +
> + syscall = trace->nr;
> +
> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> + return TRACE_TYPE_HANDLED;
> +
> + entry = &syscall_trace_entries[syscall];
> + if (!entry->simple_format) {
> + ret = entry->output.print(trace->args);
> + if (!ret)
> + return TRACE_TYPE_PARTIAL_LINE;
> + }
> +
> + ret = trace_seq_printf(s, "syscall %s :", entry->name);
> + if (!ret)
> + return TRACE_TYPE_PARTIAL_LINE;
> +
> + for (i = 0; i < entry->nb_args; i++) {
> + /*
> + * FIXME: the string comes from the user, but on tracing
> + * output time, it may have disappeared from the memory.
> + * May be better to strdup it on tracing time.
> + */
> + if (entry->output.string_mask & (1 << i)) {
> + /* 128 is enough for most path */
> + str = strndup_user((char *)trace->args[i], 128);
> +
> + /* If it faulted badly, the error shoud have been
> + * handled while servicing the syscall, just ignore.
> + */
> + if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> + || str == ERR_PTR(-ENOMEM))
> + return TRACE_TYPE_HANDLED;
> +
> + ret = trace_seq_printf(s, " %s", str);
> + } else {
> + ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> + }
> + if (!ret)
> + goto end_partial;
> + }
> + if (!trace_seq_printf(s, "\n"))
> + goto end_partial;
> +
> + kfree(str);
> + return TRACE_TYPE_HANDLED;
> +
> +end_partial:
> + kfree(str);
> + return TRACE_TYPE_PARTIAL_LINE;
> +}
> +
> +enum print_line_t
> +print_syscall_exit(struct trace_iterator *iter, int flags)
> +{
> + struct trace_seq *s = &iter->seq;
> + struct trace_entry *ent = iter->ent;
> + struct syscall_trace_exit *trace;
> + enum syscall_trace_nr syscall;
> + const struct syscall_trace_entry *entry;
> + int ret;
> +
> + trace_assign_type(trace, ent);
> +
> + syscall = trace->nr;
> +
> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> + return TRACE_TYPE_HANDLED;
> +
> + entry = &syscall_trace_entries[syscall];
> +
> + ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> + trace->ret);
> + if (!ret)
> + return TRACE_TYPE_PARTIAL_LINE;
> +
> + return TRACE_TYPE_HANDLED;
> +}
> +
> +void start_ftrace_syscalls(void)
> +{
> + unsigned long flags;
> + struct task_struct *g, *t;
> +
> + if (atomic_inc_return(&refcount) != 1)
> + goto out;
> +
> + read_lock_irqsave(&tasklist_lock, flags);
> +
> + do_each_thread(g, t) {
> + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> + } while_each_thread(g, t);
> +
> + read_unlock_irqrestore(&tasklist_lock, flags);
> +out:
> + atomic_dec(&refcount);
> +}
> +
> +void stop_ftrace_syscalls(void)
> +{
> + unsigned long flags;
> + struct task_struct *g, *t;
> +
> + if (atomic_dec_return(&refcount))
> + goto out;
> +
> + read_lock_irqsave(&tasklist_lock, flags);
> +
> + do_each_thread(g, t) {
> + clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> + } while_each_thread(g, t);
> +
> + read_unlock_irqrestore(&tasklist_lock, flags);
> +out:
> + atomic_inc(&refcount);
> +}
> +
> +void ftrace_syscall_enter(struct pt_regs *regs)
> +{
> + struct trace_array *tr = __get_global_trace();
> + struct syscall_trace_enter *entry;
> + const struct syscall_trace_entry *sys_data;
> + struct ring_buffer_event *event;
> + struct trace_array_cpu *data;
> + int size;
> + int syscall_nr;
> + int nr_offset;
> + int cpu;
> +
> + syscall_nr = syscall_get_nr(current, regs);
> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> + return;
> +
> + cpu = raw_smp_processor_id();
> + data = tr->data[cpu];
> +
> + if (unlikely(atomic_read(&data->disabled)))
> + return;
> +
> + sys_data = &syscall_trace_entries[nr_offset];
> + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> +
> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> + if (!event)
> + return;
> +
> + entry = ring_buffer_event_data(event);
> + entry->nr = nr_offset;
> + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> +
> + ring_buffer_unlock_commit(tr->buffer, event);
> + trace_wake_up();
> +}
> +
> +void ftrace_syscall_exit(struct pt_regs *regs)
> +{
> + struct trace_array *tr = __get_global_trace();
> + struct syscall_trace_exit *entry;
> + const struct syscall_trace_entry *sys_data;
> + struct ring_buffer_event *event;
> + struct trace_array_cpu *data;
> + int syscall_nr;
> + int nr_offset;
> + int cpu;
> +
> + syscall_nr = syscall_get_nr(current, regs);
> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> + return;
> +
> + cpu = raw_smp_processor_id();
> + data = tr->data[cpu];
> +
> + if (unlikely(atomic_read(&data->disabled)))
> + return;
> +
> + sys_data = &syscall_trace_entries[nr_offset];
> +
> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> + sizeof(*entry), 0, 0);
> + if (!event)
> + return;
> +
> + entry = ring_buffer_event_data(event);
> + entry->nr = nr_offset;
> + entry->ret = syscall_get_return_value(current, regs);
> +
> + ring_buffer_unlock_commit(tr->buffer, event);
> + trace_wake_up();
> +}
> +
> +static int init_syscall_tracer(struct trace_array *tr)
> +{
> + start_ftrace_syscalls();
> +
> + return 0;
> +}
> +
> +static void reset_syscall_tracer(struct trace_array *tr)
> +{
> + stop_ftrace_syscalls();
> +}
> +
> +static struct trace_event syscall_enter_event = {
> + .type = TRACE_SYSCALL_ENTER,
> + .trace = print_syscall_enter,
> +};
> +
> +static struct trace_event syscall_exit_event = {
> + .type = TRACE_SYSCALL_EXIT,
> + .trace = print_syscall_exit,
> +};
> +
> +static struct tracer syscall_tracer __read_mostly = {
> + .name = "syscall",
> + .init = init_syscall_tracer,
> + .reset = reset_syscall_tracer
> +};
> +
> +__init int register_ftrace_syscalls(void)
> +{
> + int ret;
> +
> + ret = register_ftrace_event(&syscall_enter_event);
> + if (!ret) {
> + printk(KERN_WARNING "event %d failed to register\n",
> + syscall_enter_event.type);
> + WARN_ON_ONCE(1);
> + }
> +
> + ret = register_ftrace_event(&syscall_exit_event);
> + if (!ret) {
> + printk(KERN_WARNING "event %d failed to register\n",
> + syscall_exit_event.type);
> + WARN_ON_ONCE(1);
> + }
> +
> + return register_tracer(&syscall_tracer);
> +}
> +device_initcall(register_ftrace_syscalls);

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-16 20:15:53

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Mon, Mar 16, 2009 at 03:36:58PM -0400, Masami Hiramatsu wrote:
> Hi Frederic,
>
> Out of curiously, could you tell me why did you introduce new TIF instead of
> using utrace's tracehook API?
> For example, systemtap already use current utrace/tracehook api for tracing
> all syscalls. If there is any new advantage, it would be good to enhance
> existing utrace/tracehook.
>
> Thank you,


Hi,

Because the tracehooks rely on ptrace work and both ftrace and ptrace would overlap
in a same flag, creating racy issues. At least ss far as I understood it.
How does systemtap manage it?

Thanks.

>
> Frederic Weisbecker wrote:
> > The most important drawback that implies the syscall tracing is the variable
> > number of parameters that takes a syscall. Assuming there is a large number of
> > syscall, the goal is to provide as much as possible a generic layer on it.
> >
> > What we want on tracing time (the hot path):
> >
> > - Save the parameters and the number of the syscall. That's all. The rest of the
> > job can be done on the "output path".
> >
> > On output time:
> >
> > - Get the raw-binary saved parameters and the syscall number as well as some
> > usual infos such as the pid, the time...
> > - Depending on the tracing requirements, the user should have the choice to
> > dump the raw datas or a formatted version.
> >
> > The core infrastructure has a static array which contains the necessary
> > informations for each syscall (ok, I've only implemented 4 for now).
> >
> > When a syscall triggers, the tracer look at this array (O(1) access) and get the
> > number of parameters for this syscall. Then these parameters are saved in a
> > binary form on the ring buffer with the syscall number (a generic version, arch
> > independent).
> >
> > On output time the parameters, the name of the syscall and very basic
> > informations to format the parameters are picket from the ring-buffer and the
> > generic syscall array.
> >
> > Depending of the level of implementation we want for the syscall inside this
> > array, we can provide just a single mask where the n bit matches the n
> > parameter. If the bit is set, then the parameter will be considered as a string
> > pointer. Otherwise its raw hexadecimal value will be printed.
> >
> > If needed, we can otherwise provide a specific callback to print the syscall
> > event.
> >
> > The probem with these approaches comes from the fact that the string might have
> > disappeared from the user memory on output time. Even if we protect against
> > faults, it can give unreliable traces.
> > So perhaps we could think about copying the string on tracing time.
> >
> > Concerning the return value, it doesn't make any problem, a single raw
> > value is printed for each one.
> >
> > Note that the current implementation can be easily scaled further to give only
> > binary informations to the user and the matching metadata to decode it.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > ---
> > include/linux/ftrace.h | 44 +++++++
> > kernel/trace/Kconfig | 10 ++
> > kernel/trace/Makefile | 1 +
> > kernel/trace/trace.c | 6 +
> > kernel/trace/trace.h | 32 +++++
> > kernel/trace/trace_syscalls.c | 284 +++++++++++++++++++++++++++++++++++++++++
> > 6 files changed, 377 insertions(+), 0 deletions(-)
> > create mode 100644 kernel/trace/trace_syscalls.c
> >
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index e1583f2..1798692 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
> >
> > #endif /* CONFIG_HW_BRANCH_TRACER */
> >
> > +/* Index for the ftrace syscalls array */
> > +enum syscall_trace_nr {
> > + SYSCALL_TRACE_OPEN = 1,
> > + SYSCALL_TRACE_CLOSE,
> > + SYSCALL_TRACE_READ,
> > + SYSCALL_TRACE_WRITE,
> > +
> > + __SYSCALL_TRACE_END
> > +};
> > +
> > +/*
> > + * A syscall entry in the ftrace syscalls array.
> > + *
> > + * @name: name of the syscall
> > + * @nb_args: number of parameters it takes
> > + * @simple_format: if true, we use string_mask, otherwise the print callback
> > + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
> > + * will be displayed as a string, otherwise it will be considered
> > + * as a raw number (hex displayed).
> > + * @print: implement it if you want a custom output for a given syscall
> > + */
> > +struct syscall_trace_entry {
> > + const char *name;
> > + int nb_args;
> > + bool simple_format;
> > + union {
> > + unsigned long string_mask;
> > + int (*print)(unsigned long *);
> > + } output;
> > +};
> > +
> > +#ifdef CONFIG_FTRACE_SYSCALLS
> > +extern short arch_syscall_trace_nr[];
> > +extern void start_ftrace_syscalls(void);
> > +extern void stop_ftrace_syscalls(void);
> > +extern void ftrace_syscall_enter(struct pt_regs *regs);
> > +extern void ftrace_syscall_exit(struct pt_regs *regs);
> > +#else
> > +static inline void start_ftrace_syscalls(void) { }
> > +static inline void stop_ftrace_syscalls(void) { }
> > +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
> > +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
> > +#endif
> > +
> > #endif /* _LINUX_FTRACE_H */
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 8e4a2a6..95a0ad1 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
> > config HAVE_HW_BRANCH_TRACER
> > bool
> >
> > +config HAVE_FTRACE_SYSCALLS
> > + bool
> > +
> > config TRACER_MAX_TRACE
> > bool
> >
> > @@ -175,6 +178,13 @@ config EVENT_TRACER
> > allowing the user to pick and choose which trace point they
> > want to trace.
> >
> > +config FTRACE_SYSCALLS
> > + bool "Trace syscalls"
> > + depends on HAVE_FTRACE_SYSCALLS
> > + select TRACING
> > + help
> > + Basic tracer to catch the syscall entry and exit events.
> > +
> > config BOOT_TRACER
> > bool "Trace boot initcalls"
> > select TRACING
> > diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> > index c7a2943..c3feea0 100644
> > --- a/kernel/trace/Makefile
> > +++ b/kernel/trace/Makefile
> > @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> > obj-$(CONFIG_EVENT_TRACER) += trace_events.o
> > obj-$(CONFIG_EVENT_TRACER) += events.o
> > obj-$(CONFIG_EVENT_TRACER) += trace_export.o
> > +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> >
> > libftrace-y := ftrace.o
> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> > index cc94f86..5152be3 100644
> > --- a/kernel/trace/trace.c
> > +++ b/kernel/trace/trace.c
> > @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> > unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> > TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
> >
> > +
> > +struct trace_array *__get_global_trace(void)
> > +{
> > + return &global_trace;
> > +}
> > +
> > /**
> > * trace_wake_up - wake up tasks waiting for trace input
> > *
> > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> > index 2bfb7d1..9583830 100644
> > --- a/kernel/trace/trace.h
> > +++ b/kernel/trace/trace.h
> > @@ -30,6 +30,8 @@ enum trace_type {
> > TRACE_GRAPH_ENT,
> > TRACE_USER_STACK,
> > TRACE_HW_BRANCHES,
> > + TRACE_SYSCALL_ENTER,
> > + TRACE_SYSCALL_EXIT,
> > TRACE_KMEM_ALLOC,
> > TRACE_KMEM_FREE,
> > TRACE_POWER,
> > @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
> > const void *ptr;
> > };
> >
> > +struct syscall_trace_enter {
> > + struct trace_entry ent;
> > + enum syscall_trace_nr nr;
> > + unsigned long args[];
> > +};
> > +
> > +struct syscall_trace_exit {
> > + struct trace_entry ent;
> > + enum syscall_trace_nr nr;
> > + unsigned long ret;
> > +};
> > +
> > +
> > /*
> > * trace_flag_type is an enumeration that holds different
> > * states when a trace occurs. These are:
> > @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
> > TRACE_KMEM_ALLOC); \
> > IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
> > TRACE_KMEM_FREE); \
> > + IF_ASSIGN(var, ent, struct syscall_trace_enter, \
> > + TRACE_SYSCALL_ENTER); \
> > + IF_ASSIGN(var, ent, struct syscall_trace_exit, \
> > + TRACE_SYSCALL_EXIT); \
> > __ftrace_bad_type(); \
> > } while (0)
> >
> > @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> > #endif /* CONFIG_FTRACE_STARTUP_TEST */
> >
> > extern void *head_page(struct trace_array_cpu *data);
> > +extern struct trace_array *__get_global_trace(void);
> > extern long ns2usecs(cycle_t nsec);
> > extern int
> > trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
> > @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
> > return test_tsk_trace_trace(task);
> > }
> >
> > +#ifdef CONFIG_FTRACE_SYSCALLS
> > +extern enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags);
> > +extern enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags);
> > +#else
> > +static inline enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags) { }
> > +static inline enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags) { }
> > +#endif
> > +
> > /*
> > * trace_iterator_flags is an enumeration that defines bit
> > * positions into trace_flags that controls the output.
> > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> > new file mode 100644
> > index 0000000..470bb9d
> > --- /dev/null
> > +++ b/kernel/trace/trace_syscalls.c
> > @@ -0,0 +1,284 @@
> > +#include <linux/kernel.h>
> > +#include <linux/ftrace.h>
> > +#include <asm/syscall.h>
> > +
> > +#include "trace_output.h"
> > +#include "trace.h"
> > +
> > +/* Create basic entry on syscall array (with the rudimentary string mask) */
> > +#define SYS_TRACE_ENTRY(sname, args, mask) \
> > + {.name = #sname, .nb_args = args, .simple_format = true, \
> > + .output.string_mask = mask}
> > +
> > +/*
> > + * Create a custom defined entry on the syscall array, you will have to
> > + * implement a callback to output the syscall.
> > + */
> > +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
> > + {.name = #sname, .nb_args = args, .simple_format = false, \
> > + .print = printer}
> > +
> > +static const struct syscall_trace_entry syscall_trace_entries[] = {
> > + /* For open, the first argument is a string, hence the given mask */
> > + [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
> > + [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
> > + [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
> > + [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
> > +};
> > +
> > +
> > +static atomic_t refcount;
> > +
> > +enum print_line_t
> > +print_syscall_enter(struct trace_iterator *iter, int flags)
> > +{
> > + struct trace_seq *s = &iter->seq;
> > + struct trace_entry *ent = iter->ent;
> > + struct syscall_trace_enter *trace;
> > + enum syscall_trace_nr syscall;
> > + const struct syscall_trace_entry *entry;
> > + char *str = NULL;
> > + int i, ret;
> > +
> > + trace_assign_type(trace, ent);
> > +
> > + syscall = trace->nr;
> > +
> > + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > + return TRACE_TYPE_HANDLED;
> > +
> > + entry = &syscall_trace_entries[syscall];
> > + if (!entry->simple_format) {
> > + ret = entry->output.print(trace->args);
> > + if (!ret)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > + }
> > +
> > + ret = trace_seq_printf(s, "syscall %s :", entry->name);
> > + if (!ret)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +
> > + for (i = 0; i < entry->nb_args; i++) {
> > + /*
> > + * FIXME: the string comes from the user, but on tracing
> > + * output time, it may have disappeared from the memory.
> > + * May be better to strdup it on tracing time.
> > + */
> > + if (entry->output.string_mask & (1 << i)) {
> > + /* 128 is enough for most path */
> > + str = strndup_user((char *)trace->args[i], 128);
> > +
> > + /* If it faulted badly, the error shoud have been
> > + * handled while servicing the syscall, just ignore.
> > + */
> > + if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> > + || str == ERR_PTR(-ENOMEM))
> > + return TRACE_TYPE_HANDLED;
> > +
> > + ret = trace_seq_printf(s, " %s", str);
> > + } else {
> > + ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> > + }
> > + if (!ret)
> > + goto end_partial;
> > + }
> > + if (!trace_seq_printf(s, "\n"))
> > + goto end_partial;
> > +
> > + kfree(str);
> > + return TRACE_TYPE_HANDLED;
> > +
> > +end_partial:
> > + kfree(str);
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +}
> > +
> > +enum print_line_t
> > +print_syscall_exit(struct trace_iterator *iter, int flags)
> > +{
> > + struct trace_seq *s = &iter->seq;
> > + struct trace_entry *ent = iter->ent;
> > + struct syscall_trace_exit *trace;
> > + enum syscall_trace_nr syscall;
> > + const struct syscall_trace_entry *entry;
> > + int ret;
> > +
> > + trace_assign_type(trace, ent);
> > +
> > + syscall = trace->nr;
> > +
> > + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> > + return TRACE_TYPE_HANDLED;
> > +
> > + entry = &syscall_trace_entries[syscall];
> > +
> > + ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> > + trace->ret);
> > + if (!ret)
> > + return TRACE_TYPE_PARTIAL_LINE;
> > +
> > + return TRACE_TYPE_HANDLED;
> > +}
> > +
> > +void start_ftrace_syscalls(void)
> > +{
> > + unsigned long flags;
> > + struct task_struct *g, *t;
> > +
> > + if (atomic_inc_return(&refcount) != 1)
> > + goto out;
> > +
> > + read_lock_irqsave(&tasklist_lock, flags);
> > +
> > + do_each_thread(g, t) {
> > + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > + } while_each_thread(g, t);
> > +
> > + read_unlock_irqrestore(&tasklist_lock, flags);
> > +out:
> > + atomic_dec(&refcount);
> > +}
> > +
> > +void stop_ftrace_syscalls(void)
> > +{
> > + unsigned long flags;
> > + struct task_struct *g, *t;
> > +
> > + if (atomic_dec_return(&refcount))
> > + goto out;
> > +
> > + read_lock_irqsave(&tasklist_lock, flags);
> > +
> > + do_each_thread(g, t) {
> > + clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> > + } while_each_thread(g, t);
> > +
> > + read_unlock_irqrestore(&tasklist_lock, flags);
> > +out:
> > + atomic_inc(&refcount);
> > +}
> > +
> > +void ftrace_syscall_enter(struct pt_regs *regs)
> > +{
> > + struct trace_array *tr = __get_global_trace();
> > + struct syscall_trace_enter *entry;
> > + const struct syscall_trace_entry *sys_data;
> > + struct ring_buffer_event *event;
> > + struct trace_array_cpu *data;
> > + int size;
> > + int syscall_nr;
> > + int nr_offset;
> > + int cpu;
> > +
> > + syscall_nr = syscall_get_nr(current, regs);
> > + nr_offset = arch_syscall_trace_nr[syscall_nr];
> > + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > + return;
> > +
> > + cpu = raw_smp_processor_id();
> > + data = tr->data[cpu];
> > +
> > + if (unlikely(atomic_read(&data->disabled)))
> > + return;
> > +
> > + sys_data = &syscall_trace_entries[nr_offset];
> > + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> > +
> > + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> > + if (!event)
> > + return;
> > +
> > + entry = ring_buffer_event_data(event);
> > + entry->nr = nr_offset;
> > + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> > +
> > + ring_buffer_unlock_commit(tr->buffer, event);
> > + trace_wake_up();
> > +}
> > +
> > +void ftrace_syscall_exit(struct pt_regs *regs)
> > +{
> > + struct trace_array *tr = __get_global_trace();
> > + struct syscall_trace_exit *entry;
> > + const struct syscall_trace_entry *sys_data;
> > + struct ring_buffer_event *event;
> > + struct trace_array_cpu *data;
> > + int syscall_nr;
> > + int nr_offset;
> > + int cpu;
> > +
> > + syscall_nr = syscall_get_nr(current, regs);
> > + nr_offset = arch_syscall_trace_nr[syscall_nr];
> > + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> > + return;
> > +
> > + cpu = raw_smp_processor_id();
> > + data = tr->data[cpu];
> > +
> > + if (unlikely(atomic_read(&data->disabled)))
> > + return;
> > +
> > + sys_data = &syscall_trace_entries[nr_offset];
> > +
> > + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> > + sizeof(*entry), 0, 0);
> > + if (!event)
> > + return;
> > +
> > + entry = ring_buffer_event_data(event);
> > + entry->nr = nr_offset;
> > + entry->ret = syscall_get_return_value(current, regs);
> > +
> > + ring_buffer_unlock_commit(tr->buffer, event);
> > + trace_wake_up();
> > +}
> > +
> > +static int init_syscall_tracer(struct trace_array *tr)
> > +{
> > + start_ftrace_syscalls();
> > +
> > + return 0;
> > +}
> > +
> > +static void reset_syscall_tracer(struct trace_array *tr)
> > +{
> > + stop_ftrace_syscalls();
> > +}
> > +
> > +static struct trace_event syscall_enter_event = {
> > + .type = TRACE_SYSCALL_ENTER,
> > + .trace = print_syscall_enter,
> > +};
> > +
> > +static struct trace_event syscall_exit_event = {
> > + .type = TRACE_SYSCALL_EXIT,
> > + .trace = print_syscall_exit,
> > +};
> > +
> > +static struct tracer syscall_tracer __read_mostly = {
> > + .name = "syscall",
> > + .init = init_syscall_tracer,
> > + .reset = reset_syscall_tracer
> > +};
> > +
> > +__init int register_ftrace_syscalls(void)
> > +{
> > + int ret;
> > +
> > + ret = register_ftrace_event(&syscall_enter_event);
> > + if (!ret) {
> > + printk(KERN_WARNING "event %d failed to register\n",
> > + syscall_enter_event.type);
> > + WARN_ON_ONCE(1);
> > + }
> > +
> > + ret = register_ftrace_event(&syscall_exit_event);
> > + if (!ret) {
> > + printk(KERN_WARNING "event %d failed to register\n",
> > + syscall_exit_event.type);
> > + WARN_ON_ONCE(1);
> > + }
> > +
> > + return register_tracer(&syscall_tracer);
> > +}
> > +device_initcall(register_ftrace_syscalls);
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

2009-03-16 20:38:28

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

Frederic Weisbecker wrote:
> On Mon, Mar 16, 2009 at 03:36:58PM -0400, Masami Hiramatsu wrote:
>> Hi Frederic,
>>
>> Out of curiously, could you tell me why did you introduce new TIF instead of
>> using utrace's tracehook API?
>> For example, systemtap already use current utrace/tracehook api for tracing
>> all syscalls. If there is any new advantage, it would be good to enhance
>> existing utrace/tracehook.
>>
>> Thank you,
>
>
> Hi,
>
> Because the tracehooks rely on ptrace work and both ftrace and ptrace would overlap
> in a same flag, creating racy issues. At least ss far as I understood it.
> How does systemtap manage it?

As far as I know, utrace supports multiple trace-engines on a process.
Since ptrace is just an engine of utrace, you can add another engine on utrace.

utrace-+-ptrace_engine---owner_process
|
+-systemtap_module
|
+-ftrace_plugin

Here, Frank had posted an example of utrace->ftrace engine.
http://lkml.org/lkml/2009/1/27/294

And here is the latest his patch(which seems to support syscall tracing...)
http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124

Frank, Roland, could you explain that?
Perhaps, we'd better discuss utrace on LKML for avoiding any redundant
code problems.

Thank you,

>
> Thanks.
>
>> Frederic Weisbecker wrote:
>>> The most important drawback that implies the syscall tracing is the variable
>>> number of parameters that takes a syscall. Assuming there is a large number of
>>> syscall, the goal is to provide as much as possible a generic layer on it.
>>>
>>> What we want on tracing time (the hot path):
>>>
>>> - Save the parameters and the number of the syscall. That's all. The rest of the
>>> job can be done on the "output path".
>>>
>>> On output time:
>>>
>>> - Get the raw-binary saved parameters and the syscall number as well as some
>>> usual infos such as the pid, the time...
>>> - Depending on the tracing requirements, the user should have the choice to
>>> dump the raw datas or a formatted version.
>>>
>>> The core infrastructure has a static array which contains the necessary
>>> informations for each syscall (ok, I've only implemented 4 for now).
>>>
>>> When a syscall triggers, the tracer look at this array (O(1) access) and get the
>>> number of parameters for this syscall. Then these parameters are saved in a
>>> binary form on the ring buffer with the syscall number (a generic version, arch
>>> independent).
>>>
>>> On output time the parameters, the name of the syscall and very basic
>>> informations to format the parameters are picket from the ring-buffer and the
>>> generic syscall array.
>>>
>>> Depending of the level of implementation we want for the syscall inside this
>>> array, we can provide just a single mask where the n bit matches the n
>>> parameter. If the bit is set, then the parameter will be considered as a string
>>> pointer. Otherwise its raw hexadecimal value will be printed.
>>>
>>> If needed, we can otherwise provide a specific callback to print the syscall
>>> event.
>>>
>>> The probem with these approaches comes from the fact that the string might have
>>> disappeared from the user memory on output time. Even if we protect against
>>> faults, it can give unreliable traces.
>>> So perhaps we could think about copying the string on tracing time.
>>>
>>> Concerning the return value, it doesn't make any problem, a single raw
>>> value is printed for each one.
>>>
>>> Note that the current implementation can be easily scaled further to give only
>>> binary informations to the user and the matching metadata to decode it.
>>>
>>> Signed-off-by: Frederic Weisbecker <[email protected]>
>>> ---
>>> include/linux/ftrace.h | 44 +++++++
>>> kernel/trace/Kconfig | 10 ++
>>> kernel/trace/Makefile | 1 +
>>> kernel/trace/trace.c | 6 +
>>> kernel/trace/trace.h | 32 +++++
>>> kernel/trace/trace_syscalls.c | 284 +++++++++++++++++++++++++++++++++++++++++
>>> 6 files changed, 377 insertions(+), 0 deletions(-)
>>> create mode 100644 kernel/trace/trace_syscalls.c
>>>
>>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
>>> index e1583f2..1798692 100644
>>> --- a/include/linux/ftrace.h
>>> +++ b/include/linux/ftrace.h
>>> @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
>>>
>>> #endif /* CONFIG_HW_BRANCH_TRACER */
>>>
>>> +/* Index for the ftrace syscalls array */
>>> +enum syscall_trace_nr {
>>> + SYSCALL_TRACE_OPEN = 1,
>>> + SYSCALL_TRACE_CLOSE,
>>> + SYSCALL_TRACE_READ,
>>> + SYSCALL_TRACE_WRITE,
>>> +
>>> + __SYSCALL_TRACE_END
>>> +};
>>> +
>>> +/*
>>> + * A syscall entry in the ftrace syscalls array.
>>> + *
>>> + * @name: name of the syscall
>>> + * @nb_args: number of parameters it takes
>>> + * @simple_format: if true, we use string_mask, otherwise the print callback
>>> + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
>>> + * will be displayed as a string, otherwise it will be considered
>>> + * as a raw number (hex displayed).
>>> + * @print: implement it if you want a custom output for a given syscall
>>> + */
>>> +struct syscall_trace_entry {
>>> + const char *name;
>>> + int nb_args;
>>> + bool simple_format;
>>> + union {
>>> + unsigned long string_mask;
>>> + int (*print)(unsigned long *);
>>> + } output;
>>> +};
>>> +
>>> +#ifdef CONFIG_FTRACE_SYSCALLS
>>> +extern short arch_syscall_trace_nr[];
>>> +extern void start_ftrace_syscalls(void);
>>> +extern void stop_ftrace_syscalls(void);
>>> +extern void ftrace_syscall_enter(struct pt_regs *regs);
>>> +extern void ftrace_syscall_exit(struct pt_regs *regs);
>>> +#else
>>> +static inline void start_ftrace_syscalls(void) { }
>>> +static inline void stop_ftrace_syscalls(void) { }
>>> +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
>>> +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
>>> +#endif
>>> +
>>> #endif /* _LINUX_FTRACE_H */
>>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
>>> index 8e4a2a6..95a0ad1 100644
>>> --- a/kernel/trace/Kconfig
>>> +++ b/kernel/trace/Kconfig
>>> @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
>>> config HAVE_HW_BRANCH_TRACER
>>> bool
>>>
>>> +config HAVE_FTRACE_SYSCALLS
>>> + bool
>>> +
>>> config TRACER_MAX_TRACE
>>> bool
>>>
>>> @@ -175,6 +178,13 @@ config EVENT_TRACER
>>> allowing the user to pick and choose which trace point they
>>> want to trace.
>>>
>>> +config FTRACE_SYSCALLS
>>> + bool "Trace syscalls"
>>> + depends on HAVE_FTRACE_SYSCALLS
>>> + select TRACING
>>> + help
>>> + Basic tracer to catch the syscall entry and exit events.
>>> +
>>> config BOOT_TRACER
>>> bool "Trace boot initcalls"
>>> select TRACING
>>> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
>>> index c7a2943..c3feea0 100644
>>> --- a/kernel/trace/Makefile
>>> +++ b/kernel/trace/Makefile
>>> @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
>>> obj-$(CONFIG_EVENT_TRACER) += trace_events.o
>>> obj-$(CONFIG_EVENT_TRACER) += events.o
>>> obj-$(CONFIG_EVENT_TRACER) += trace_export.o
>>> +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
>>>
>>> libftrace-y := ftrace.o
>>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
>>> index cc94f86..5152be3 100644
>>> --- a/kernel/trace/trace.c
>>> +++ b/kernel/trace/trace.c
>>> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
>>> unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
>>> TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
>>>
>>> +
>>> +struct trace_array *__get_global_trace(void)
>>> +{
>>> + return &global_trace;
>>> +}
>>> +
>>> /**
>>> * trace_wake_up - wake up tasks waiting for trace input
>>> *
>>> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
>>> index 2bfb7d1..9583830 100644
>>> --- a/kernel/trace/trace.h
>>> +++ b/kernel/trace/trace.h
>>> @@ -30,6 +30,8 @@ enum trace_type {
>>> TRACE_GRAPH_ENT,
>>> TRACE_USER_STACK,
>>> TRACE_HW_BRANCHES,
>>> + TRACE_SYSCALL_ENTER,
>>> + TRACE_SYSCALL_EXIT,
>>> TRACE_KMEM_ALLOC,
>>> TRACE_KMEM_FREE,
>>> TRACE_POWER,
>>> @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
>>> const void *ptr;
>>> };
>>>
>>> +struct syscall_trace_enter {
>>> + struct trace_entry ent;
>>> + enum syscall_trace_nr nr;
>>> + unsigned long args[];
>>> +};
>>> +
>>> +struct syscall_trace_exit {
>>> + struct trace_entry ent;
>>> + enum syscall_trace_nr nr;
>>> + unsigned long ret;
>>> +};
>>> +
>>> +
>>> /*
>>> * trace_flag_type is an enumeration that holds different
>>> * states when a trace occurs. These are:
>>> @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
>>> TRACE_KMEM_ALLOC); \
>>> IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
>>> TRACE_KMEM_FREE); \
>>> + IF_ASSIGN(var, ent, struct syscall_trace_enter, \
>>> + TRACE_SYSCALL_ENTER); \
>>> + IF_ASSIGN(var, ent, struct syscall_trace_exit, \
>>> + TRACE_SYSCALL_EXIT); \
>>> __ftrace_bad_type(); \
>>> } while (0)
>>>
>>> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
>>> #endif /* CONFIG_FTRACE_STARTUP_TEST */
>>>
>>> extern void *head_page(struct trace_array_cpu *data);
>>> +extern struct trace_array *__get_global_trace(void);
>>> extern long ns2usecs(cycle_t nsec);
>>> extern int
>>> trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
>>> @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
>>> return test_tsk_trace_trace(task);
>>> }
>>>
>>> +#ifdef CONFIG_FTRACE_SYSCALLS
>>> +extern enum print_line_t
>>> +print_syscall_enter(struct trace_iterator *iter, int flags);
>>> +extern enum print_line_t
>>> +print_syscall_exit(struct trace_iterator *iter, int flags);
>>> +#else
>>> +static inline enum print_line_t
>>> +print_syscall_enter(struct trace_iterator *iter, int flags) { }
>>> +static inline enum print_line_t
>>> +print_syscall_exit(struct trace_iterator *iter, int flags) { }
>>> +#endif
>>> +
>>> /*
>>> * trace_iterator_flags is an enumeration that defines bit
>>> * positions into trace_flags that controls the output.
>>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
>>> new file mode 100644
>>> index 0000000..470bb9d
>>> --- /dev/null
>>> +++ b/kernel/trace/trace_syscalls.c
>>> @@ -0,0 +1,284 @@
>>> +#include <linux/kernel.h>
>>> +#include <linux/ftrace.h>
>>> +#include <asm/syscall.h>
>>> +
>>> +#include "trace_output.h"
>>> +#include "trace.h"
>>> +
>>> +/* Create basic entry on syscall array (with the rudimentary string mask) */
>>> +#define SYS_TRACE_ENTRY(sname, args, mask) \
>>> + {.name = #sname, .nb_args = args, .simple_format = true, \
>>> + .output.string_mask = mask}
>>> +
>>> +/*
>>> + * Create a custom defined entry on the syscall array, you will have to
>>> + * implement a callback to output the syscall.
>>> + */
>>> +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
>>> + {.name = #sname, .nb_args = args, .simple_format = false, \
>>> + .print = printer}
>>> +
>>> +static const struct syscall_trace_entry syscall_trace_entries[] = {
>>> + /* For open, the first argument is a string, hence the given mask */
>>> + [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
>>> + [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
>>> + [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
>>> + [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
>>> +};
>>> +
>>> +
>>> +static atomic_t refcount;
>>> +
>>> +enum print_line_t
>>> +print_syscall_enter(struct trace_iterator *iter, int flags)
>>> +{
>>> + struct trace_seq *s = &iter->seq;
>>> + struct trace_entry *ent = iter->ent;
>>> + struct syscall_trace_enter *trace;
>>> + enum syscall_trace_nr syscall;
>>> + const struct syscall_trace_entry *entry;
>>> + char *str = NULL;
>>> + int i, ret;
>>> +
>>> + trace_assign_type(trace, ent);
>>> +
>>> + syscall = trace->nr;
>>> +
>>> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
>>> + return TRACE_TYPE_HANDLED;
>>> +
>>> + entry = &syscall_trace_entries[syscall];
>>> + if (!entry->simple_format) {
>>> + ret = entry->output.print(trace->args);
>>> + if (!ret)
>>> + return TRACE_TYPE_PARTIAL_LINE;
>>> + }
>>> +
>>> + ret = trace_seq_printf(s, "syscall %s :", entry->name);
>>> + if (!ret)
>>> + return TRACE_TYPE_PARTIAL_LINE;
>>> +
>>> + for (i = 0; i < entry->nb_args; i++) {
>>> + /*
>>> + * FIXME: the string comes from the user, but on tracing
>>> + * output time, it may have disappeared from the memory.
>>> + * May be better to strdup it on tracing time.
>>> + */
>>> + if (entry->output.string_mask & (1 << i)) {
>>> + /* 128 is enough for most path */
>>> + str = strndup_user((char *)trace->args[i], 128);
>>> +
>>> + /* If it faulted badly, the error shoud have been
>>> + * handled while servicing the syscall, just ignore.
>>> + */
>>> + if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
>>> + || str == ERR_PTR(-ENOMEM))
>>> + return TRACE_TYPE_HANDLED;
>>> +
>>> + ret = trace_seq_printf(s, " %s", str);
>>> + } else {
>>> + ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
>>> + }
>>> + if (!ret)
>>> + goto end_partial;
>>> + }
>>> + if (!trace_seq_printf(s, "\n"))
>>> + goto end_partial;
>>> +
>>> + kfree(str);
>>> + return TRACE_TYPE_HANDLED;
>>> +
>>> +end_partial:
>>> + kfree(str);
>>> + return TRACE_TYPE_PARTIAL_LINE;
>>> +}
>>> +
>>> +enum print_line_t
>>> +print_syscall_exit(struct trace_iterator *iter, int flags)
>>> +{
>>> + struct trace_seq *s = &iter->seq;
>>> + struct trace_entry *ent = iter->ent;
>>> + struct syscall_trace_exit *trace;
>>> + enum syscall_trace_nr syscall;
>>> + const struct syscall_trace_entry *entry;
>>> + int ret;
>>> +
>>> + trace_assign_type(trace, ent);
>>> +
>>> + syscall = trace->nr;
>>> +
>>> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
>>> + return TRACE_TYPE_HANDLED;
>>> +
>>> + entry = &syscall_trace_entries[syscall];
>>> +
>>> + ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
>>> + trace->ret);
>>> + if (!ret)
>>> + return TRACE_TYPE_PARTIAL_LINE;
>>> +
>>> + return TRACE_TYPE_HANDLED;
>>> +}
>>> +
>>> +void start_ftrace_syscalls(void)
>>> +{
>>> + unsigned long flags;
>>> + struct task_struct *g, *t;
>>> +
>>> + if (atomic_inc_return(&refcount) != 1)
>>> + goto out;
>>> +
>>> + read_lock_irqsave(&tasklist_lock, flags);
>>> +
>>> + do_each_thread(g, t) {
>>> + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
>>> + } while_each_thread(g, t);
>>> +
>>> + read_unlock_irqrestore(&tasklist_lock, flags);
>>> +out:
>>> + atomic_dec(&refcount);
>>> +}
>>> +
>>> +void stop_ftrace_syscalls(void)
>>> +{
>>> + unsigned long flags;
>>> + struct task_struct *g, *t;
>>> +
>>> + if (atomic_dec_return(&refcount))
>>> + goto out;
>>> +
>>> + read_lock_irqsave(&tasklist_lock, flags);
>>> +
>>> + do_each_thread(g, t) {
>>> + clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
>>> + } while_each_thread(g, t);
>>> +
>>> + read_unlock_irqrestore(&tasklist_lock, flags);
>>> +out:
>>> + atomic_inc(&refcount);
>>> +}
>>> +
>>> +void ftrace_syscall_enter(struct pt_regs *regs)
>>> +{
>>> + struct trace_array *tr = __get_global_trace();
>>> + struct syscall_trace_enter *entry;
>>> + const struct syscall_trace_entry *sys_data;
>>> + struct ring_buffer_event *event;
>>> + struct trace_array_cpu *data;
>>> + int size;
>>> + int syscall_nr;
>>> + int nr_offset;
>>> + int cpu;
>>> +
>>> + syscall_nr = syscall_get_nr(current, regs);
>>> + nr_offset = arch_syscall_trace_nr[syscall_nr];
>>> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
>>> + return;
>>> +
>>> + cpu = raw_smp_processor_id();
>>> + data = tr->data[cpu];
>>> +
>>> + if (unlikely(atomic_read(&data->disabled)))
>>> + return;
>>> +
>>> + sys_data = &syscall_trace_entries[nr_offset];
>>> + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
>>> +
>>> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
>>> + if (!event)
>>> + return;
>>> +
>>> + entry = ring_buffer_event_data(event);
>>> + entry->nr = nr_offset;
>>> + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
>>> +
>>> + ring_buffer_unlock_commit(tr->buffer, event);
>>> + trace_wake_up();
>>> +}
>>> +
>>> +void ftrace_syscall_exit(struct pt_regs *regs)
>>> +{
>>> + struct trace_array *tr = __get_global_trace();
>>> + struct syscall_trace_exit *entry;
>>> + const struct syscall_trace_entry *sys_data;
>>> + struct ring_buffer_event *event;
>>> + struct trace_array_cpu *data;
>>> + int syscall_nr;
>>> + int nr_offset;
>>> + int cpu;
>>> +
>>> + syscall_nr = syscall_get_nr(current, regs);
>>> + nr_offset = arch_syscall_trace_nr[syscall_nr];
>>> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
>>> + return;
>>> +
>>> + cpu = raw_smp_processor_id();
>>> + data = tr->data[cpu];
>>> +
>>> + if (unlikely(atomic_read(&data->disabled)))
>>> + return;
>>> +
>>> + sys_data = &syscall_trace_entries[nr_offset];
>>> +
>>> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
>>> + sizeof(*entry), 0, 0);
>>> + if (!event)
>>> + return;
>>> +
>>> + entry = ring_buffer_event_data(event);
>>> + entry->nr = nr_offset;
>>> + entry->ret = syscall_get_return_value(current, regs);
>>> +
>>> + ring_buffer_unlock_commit(tr->buffer, event);
>>> + trace_wake_up();
>>> +}
>>> +
>>> +static int init_syscall_tracer(struct trace_array *tr)
>>> +{
>>> + start_ftrace_syscalls();
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static void reset_syscall_tracer(struct trace_array *tr)
>>> +{
>>> + stop_ftrace_syscalls();
>>> +}
>>> +
>>> +static struct trace_event syscall_enter_event = {
>>> + .type = TRACE_SYSCALL_ENTER,
>>> + .trace = print_syscall_enter,
>>> +};
>>> +
>>> +static struct trace_event syscall_exit_event = {
>>> + .type = TRACE_SYSCALL_EXIT,
>>> + .trace = print_syscall_exit,
>>> +};
>>> +
>>> +static struct tracer syscall_tracer __read_mostly = {
>>> + .name = "syscall",
>>> + .init = init_syscall_tracer,
>>> + .reset = reset_syscall_tracer
>>> +};
>>> +
>>> +__init int register_ftrace_syscalls(void)
>>> +{
>>> + int ret;
>>> +
>>> + ret = register_ftrace_event(&syscall_enter_event);
>>> + if (!ret) {
>>> + printk(KERN_WARNING "event %d failed to register\n",
>>> + syscall_enter_event.type);
>>> + WARN_ON_ONCE(1);
>>> + }
>>> +
>>> + ret = register_ftrace_event(&syscall_exit_event);
>>> + if (!ret) {
>>> + printk(KERN_WARNING "event %d failed to register\n",
>>> + syscall_exit_event.type);
>>> + WARN_ON_ONCE(1);
>>> + }
>>> +
>>> + return register_tracer(&syscall_tracer);
>>> +}
>>> +device_initcall(register_ftrace_syscalls);
>> --
>> Masami Hiramatsu
>>
>> Software Engineer
>> Hitachi Computer Products (America) Inc.
>> Software Solutions Division
>>
>> e-mail: [email protected]
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: [email protected]

2009-03-16 21:45:47

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

* Masami Hiramatsu ([email protected]) wrote:
> Frederic Weisbecker wrote:
> > On Mon, Mar 16, 2009 at 03:36:58PM -0400, Masami Hiramatsu wrote:
> >> Hi Frederic,
> >>
> >> Out of curiously, could you tell me why did you introduce new TIF instead of
> >> using utrace's tracehook API?
> >> For example, systemtap already use current utrace/tracehook api for tracing
> >> all syscalls. If there is any new advantage, it would be good to enhance
> >> existing utrace/tracehook.
> >>
> >> Thank you,
> >
> >
> > Hi,
> >
> > Because the tracehooks rely on ptrace work and both ftrace and ptrace would overlap
> > in a same flag, creating racy issues. At least ss far as I understood it.
> > How does systemtap manage it?
>
> As far as I know, utrace supports multiple trace-engines on a process.
> Since ptrace is just an engine of utrace, you can add another engine on utrace.
>
> utrace-+-ptrace_engine---owner_process
> |
> +-systemtap_module
> |
> +-ftrace_plugin
>
> Here, Frank had posted an example of utrace->ftrace engine.
> http://lkml.org/lkml/2009/1/27/294
>
> And here is the latest his patch(which seems to support syscall tracing...)
> http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
>

Reminder : we are looking at system-wide tracing here. Here are some
comments about the current utrace implementation.

Looking at include/linux/utrace.h from the tree

17 * A tracing engine starts by calling utrace_attach_task() or
18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks
19 * (&struct utrace_engine_ops), and some associated data. This produces a
20 * &struct utrace_engine, which is the handle used for all other
21 * operations. An attached engine has its ops vector, its data, and an
22 * event mask controlled by utrace_set_events().

So if the system has, say 3000 threads, then we have 3000 struct
utrace_engine created ? I wonder what effet this could have one
cachelines if this is used to trace hot paths like system call
entry/exit. Have you benchmarked this kind of scenario under tbench ?


24 * For each event bit that is set, that engine will get the
25 * appropriate ops->report_*() callback when the event occurs. The
26 * &struct utrace_engine_ops need not provide callbacks for an event
27 * unless the engine sets one of the associated event bits.

Looking at utrace_set_events(), we seem to be limited to 32 events on a
32-bits architectures because it uses a bitmask ? Isn't it a bit small?


682 /**
683 * utrace_set_events_pid - choose which event reports a tracing engine gets
684 * @pid: thread to affect
685 * @engine: attached engine to affect
686 * @eventmask: new event mask
687 *
688 * This is the same as utrace_set_events(), but takes a &struct pid
689 * pointer rather than a &struct task_struct pointer. The caller must
690 * hold a ref on @pid, but does not need to worry about the task
691 * staying valid. If it's been reaped so that @pid points nowhere,
692 * then this call returns -%ESRCH.


Comments like "but does not need to worry about the task staying valid"
does not make me feel safe and comfortable at all, could you explain
how you can assume that derefencing an "invalid" pointer will return
NULL ?

About the utrace_attach_task() :

244 if (unlikely(target->flags & PF_KTHREAD))
245 /*
246 * Silly kernel, utrace is for users!
247 */
248 return ERR_PTR(-EPERM);

So we cannot trace kernel threads ?


118 /*
119 * Called without locks, when we might be the first utrace engine to attach.
120 * If this is a newborn thread and we are not the creator, we have to wait
121 * for it. The creator gets the first chance to attach. The PF_STARTING
122 * flag is cleared after its report_clone hook has had a chance to run.
123 */
124 static inline int utrace_attach_delay(struct task_struct *target)
125 {
126 if ((target->flags & PF_STARTING) && target->real_parent != current)
127 do {
128 schedule_timeout_interruptible(1);
129 if (signal_pending(current))
130 return -ERESTARTNOINTR;
131 } while (target->flags & PF_STARTING);
132
133 return 0;
134 }

Why do we absolutely have to poll until the thread has started ?


utrace_add_engine()
set_notify_resume(target);

ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
that it is set asynchronously with the execution of the target thread
(as I do with my TIF_KERNEL_TRACE thread flag).

However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
entry_64.S

int_signal:
and
retint_signal:

code paths. However, if there is no syscall tracing to do upon syscall
entry, the thread flags are not re-read at syscall exit and you will
miss the syscall exit returning from your target thread if this thread
was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
some subtle way I did not figure out ? BTW re-reading the TIF flags from
the thread_info at syscall exit on the fast path is out of question
because it considerably degrades the kernel performances. entry_*.S is
a very, very critical path.

Mathieu


> Frank, Roland, could you explain that?
> Perhaps, we'd better discuss utrace on LKML for avoiding any redundant
> code problems.
>
> Thank you,
>
> >
> > Thanks.
> >
> >> Frederic Weisbecker wrote:
> >>> The most important drawback that implies the syscall tracing is the variable
> >>> number of parameters that takes a syscall. Assuming there is a large number of
> >>> syscall, the goal is to provide as much as possible a generic layer on it.
> >>>
> >>> What we want on tracing time (the hot path):
> >>>
> >>> - Save the parameters and the number of the syscall. That's all. The rest of the
> >>> job can be done on the "output path".
> >>>
> >>> On output time:
> >>>
> >>> - Get the raw-binary saved parameters and the syscall number as well as some
> >>> usual infos such as the pid, the time...
> >>> - Depending on the tracing requirements, the user should have the choice to
> >>> dump the raw datas or a formatted version.
> >>>
> >>> The core infrastructure has a static array which contains the necessary
> >>> informations for each syscall (ok, I've only implemented 4 for now).
> >>>
> >>> When a syscall triggers, the tracer look at this array (O(1) access) and get the
> >>> number of parameters for this syscall. Then these parameters are saved in a
> >>> binary form on the ring buffer with the syscall number (a generic version, arch
> >>> independent).
> >>>
> >>> On output time the parameters, the name of the syscall and very basic
> >>> informations to format the parameters are picket from the ring-buffer and the
> >>> generic syscall array.
> >>>
> >>> Depending of the level of implementation we want for the syscall inside this
> >>> array, we can provide just a single mask where the n bit matches the n
> >>> parameter. If the bit is set, then the parameter will be considered as a string
> >>> pointer. Otherwise its raw hexadecimal value will be printed.
> >>>
> >>> If needed, we can otherwise provide a specific callback to print the syscall
> >>> event.
> >>>
> >>> The probem with these approaches comes from the fact that the string might have
> >>> disappeared from the user memory on output time. Even if we protect against
> >>> faults, it can give unreliable traces.
> >>> So perhaps we could think about copying the string on tracing time.
> >>>
> >>> Concerning the return value, it doesn't make any problem, a single raw
> >>> value is printed for each one.
> >>>
> >>> Note that the current implementation can be easily scaled further to give only
> >>> binary informations to the user and the matching metadata to decode it.
> >>>
> >>> Signed-off-by: Frederic Weisbecker <[email protected]>
> >>> ---
> >>> include/linux/ftrace.h | 44 +++++++
> >>> kernel/trace/Kconfig | 10 ++
> >>> kernel/trace/Makefile | 1 +
> >>> kernel/trace/trace.c | 6 +
> >>> kernel/trace/trace.h | 32 +++++
> >>> kernel/trace/trace_syscalls.c | 284 +++++++++++++++++++++++++++++++++++++++++
> >>> 6 files changed, 377 insertions(+), 0 deletions(-)
> >>> create mode 100644 kernel/trace/trace_syscalls.c
> >>>
> >>> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> >>> index e1583f2..1798692 100644
> >>> --- a/include/linux/ftrace.h
> >>> +++ b/include/linux/ftrace.h
> >>> @@ -503,4 +503,48 @@ static inline void trace_hw_branch_oops(void) {}
> >>>
> >>> #endif /* CONFIG_HW_BRANCH_TRACER */
> >>>
> >>> +/* Index for the ftrace syscalls array */
> >>> +enum syscall_trace_nr {
> >>> + SYSCALL_TRACE_OPEN = 1,
> >>> + SYSCALL_TRACE_CLOSE,
> >>> + SYSCALL_TRACE_READ,
> >>> + SYSCALL_TRACE_WRITE,
> >>> +
> >>> + __SYSCALL_TRACE_END
> >>> +};
> >>> +
> >>> +/*
> >>> + * A syscall entry in the ftrace syscalls array.
> >>> + *
> >>> + * @name: name of the syscall
> >>> + * @nb_args: number of parameters it takes
> >>> + * @simple_format: if true, we use string_mask, otherwise the print callback
> >>> + * @string_mask: rudimentary format mask. If bit nr is set, the nr parameter
> >>> + * will be displayed as a string, otherwise it will be considered
> >>> + * as a raw number (hex displayed).
> >>> + * @print: implement it if you want a custom output for a given syscall
> >>> + */
> >>> +struct syscall_trace_entry {
> >>> + const char *name;
> >>> + int nb_args;
> >>> + bool simple_format;
> >>> + union {
> >>> + unsigned long string_mask;
> >>> + int (*print)(unsigned long *);
> >>> + } output;
> >>> +};
> >>> +
> >>> +#ifdef CONFIG_FTRACE_SYSCALLS
> >>> +extern short arch_syscall_trace_nr[];
> >>> +extern void start_ftrace_syscalls(void);
> >>> +extern void stop_ftrace_syscalls(void);
> >>> +extern void ftrace_syscall_enter(struct pt_regs *regs);
> >>> +extern void ftrace_syscall_exit(struct pt_regs *regs);
> >>> +#else
> >>> +static inline void start_ftrace_syscalls(void) { }
> >>> +static inline void stop_ftrace_syscalls(void) { }
> >>> +static inline void ftrace_syscall_enter(struct pt_regs *regs) { }
> >>> +static inline void ftrace_syscall_exit(struct pt_regs *regs) { }
> >>> +#endif
> >>> +
> >>> #endif /* _LINUX_FTRACE_H */
> >>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> >>> index 8e4a2a6..95a0ad1 100644
> >>> --- a/kernel/trace/Kconfig
> >>> +++ b/kernel/trace/Kconfig
> >>> @@ -34,6 +34,9 @@ config HAVE_FTRACE_MCOUNT_RECORD
> >>> config HAVE_HW_BRANCH_TRACER
> >>> bool
> >>>
> >>> +config HAVE_FTRACE_SYSCALLS
> >>> + bool
> >>> +
> >>> config TRACER_MAX_TRACE
> >>> bool
> >>>
> >>> @@ -175,6 +178,13 @@ config EVENT_TRACER
> >>> allowing the user to pick and choose which trace point they
> >>> want to trace.
> >>>
> >>> +config FTRACE_SYSCALLS
> >>> + bool "Trace syscalls"
> >>> + depends on HAVE_FTRACE_SYSCALLS
> >>> + select TRACING
> >>> + help
> >>> + Basic tracer to catch the syscall entry and exit events.
> >>> +
> >>> config BOOT_TRACER
> >>> bool "Trace boot initcalls"
> >>> select TRACING
> >>> diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
> >>> index c7a2943..c3feea0 100644
> >>> --- a/kernel/trace/Makefile
> >>> +++ b/kernel/trace/Makefile
> >>> @@ -43,5 +43,6 @@ obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
> >>> obj-$(CONFIG_EVENT_TRACER) += trace_events.o
> >>> obj-$(CONFIG_EVENT_TRACER) += events.o
> >>> obj-$(CONFIG_EVENT_TRACER) += trace_export.o
> >>> +obj-$(CONFIG_FTRACE_SYSCALLS) += trace_syscalls.o
> >>>
> >>> libftrace-y := ftrace.o
> >>> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> >>> index cc94f86..5152be3 100644
> >>> --- a/kernel/trace/trace.c
> >>> +++ b/kernel/trace/trace.c
> >>> @@ -242,6 +242,12 @@ static DECLARE_WAIT_QUEUE_HEAD(trace_wait);
> >>> unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
> >>> TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO;
> >>>
> >>> +
> >>> +struct trace_array *__get_global_trace(void)
> >>> +{
> >>> + return &global_trace;
> >>> +}
> >>> +
> >>> /**
> >>> * trace_wake_up - wake up tasks waiting for trace input
> >>> *
> >>> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> >>> index 2bfb7d1..9583830 100644
> >>> --- a/kernel/trace/trace.h
> >>> +++ b/kernel/trace/trace.h
> >>> @@ -30,6 +30,8 @@ enum trace_type {
> >>> TRACE_GRAPH_ENT,
> >>> TRACE_USER_STACK,
> >>> TRACE_HW_BRANCHES,
> >>> + TRACE_SYSCALL_ENTER,
> >>> + TRACE_SYSCALL_EXIT,
> >>> TRACE_KMEM_ALLOC,
> >>> TRACE_KMEM_FREE,
> >>> TRACE_POWER,
> >>> @@ -192,6 +194,19 @@ struct kmemtrace_free_entry {
> >>> const void *ptr;
> >>> };
> >>>
> >>> +struct syscall_trace_enter {
> >>> + struct trace_entry ent;
> >>> + enum syscall_trace_nr nr;
> >>> + unsigned long args[];
> >>> +};
> >>> +
> >>> +struct syscall_trace_exit {
> >>> + struct trace_entry ent;
> >>> + enum syscall_trace_nr nr;
> >>> + unsigned long ret;
> >>> +};
> >>> +
> >>> +
> >>> /*
> >>> * trace_flag_type is an enumeration that holds different
> >>> * states when a trace occurs. These are:
> >>> @@ -304,6 +319,10 @@ extern void __ftrace_bad_type(void);
> >>> TRACE_KMEM_ALLOC); \
> >>> IF_ASSIGN(var, ent, struct kmemtrace_free_entry, \
> >>> TRACE_KMEM_FREE); \
> >>> + IF_ASSIGN(var, ent, struct syscall_trace_enter, \
> >>> + TRACE_SYSCALL_ENTER); \
> >>> + IF_ASSIGN(var, ent, struct syscall_trace_exit, \
> >>> + TRACE_SYSCALL_EXIT); \
> >>> __ftrace_bad_type(); \
> >>> } while (0)
> >>>
> >>> @@ -568,6 +587,7 @@ extern int trace_selftest_startup_branch(struct tracer *trace,
> >>> #endif /* CONFIG_FTRACE_STARTUP_TEST */
> >>>
> >>> extern void *head_page(struct trace_array_cpu *data);
> >>> +extern struct trace_array *__get_global_trace(void);
> >>> extern long ns2usecs(cycle_t nsec);
> >>> extern int
> >>> trace_vprintk(unsigned long ip, int depth, const char *fmt, va_list args);
> >>> @@ -627,6 +647,18 @@ static inline int ftrace_trace_task(struct task_struct *task)
> >>> return test_tsk_trace_trace(task);
> >>> }
> >>>
> >>> +#ifdef CONFIG_FTRACE_SYSCALLS
> >>> +extern enum print_line_t
> >>> +print_syscall_enter(struct trace_iterator *iter, int flags);
> >>> +extern enum print_line_t
> >>> +print_syscall_exit(struct trace_iterator *iter, int flags);
> >>> +#else
> >>> +static inline enum print_line_t
> >>> +print_syscall_enter(struct trace_iterator *iter, int flags) { }
> >>> +static inline enum print_line_t
> >>> +print_syscall_exit(struct trace_iterator *iter, int flags) { }
> >>> +#endif
> >>> +
> >>> /*
> >>> * trace_iterator_flags is an enumeration that defines bit
> >>> * positions into trace_flags that controls the output.
> >>> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> >>> new file mode 100644
> >>> index 0000000..470bb9d
> >>> --- /dev/null
> >>> +++ b/kernel/trace/trace_syscalls.c
> >>> @@ -0,0 +1,284 @@
> >>> +#include <linux/kernel.h>
> >>> +#include <linux/ftrace.h>
> >>> +#include <asm/syscall.h>
> >>> +
> >>> +#include "trace_output.h"
> >>> +#include "trace.h"
> >>> +
> >>> +/* Create basic entry on syscall array (with the rudimentary string mask) */
> >>> +#define SYS_TRACE_ENTRY(sname, args, mask) \
> >>> + {.name = #sname, .nb_args = args, .simple_format = true, \
> >>> + .output.string_mask = mask}
> >>> +
> >>> +/*
> >>> + * Create a custom defined entry on the syscall array, you will have to
> >>> + * implement a callback to output the syscall.
> >>> + */
> >>> +#define SYS_TRACE_ENTRY_SPECIAL(sname, args, printer) \
> >>> + {.name = #sname, .nb_args = args, .simple_format = false, \
> >>> + .print = printer}
> >>> +
> >>> +static const struct syscall_trace_entry syscall_trace_entries[] = {
> >>> + /* For open, the first argument is a string, hence the given mask */
> >>> + [SYSCALL_TRACE_OPEN] = SYS_TRACE_ENTRY(open, 3, 0x1),
> >>> + [SYSCALL_TRACE_CLOSE] = SYS_TRACE_ENTRY(close, 1, 0),
> >>> + [SYSCALL_TRACE_READ] = SYS_TRACE_ENTRY(read, 3, 0),
> >>> + [SYSCALL_TRACE_WRITE] = SYS_TRACE_ENTRY(read, 3, 0),
> >>> +};
> >>> +
> >>> +
> >>> +static atomic_t refcount;
> >>> +
> >>> +enum print_line_t
> >>> +print_syscall_enter(struct trace_iterator *iter, int flags)
> >>> +{
> >>> + struct trace_seq *s = &iter->seq;
> >>> + struct trace_entry *ent = iter->ent;
> >>> + struct syscall_trace_enter *trace;
> >>> + enum syscall_trace_nr syscall;
> >>> + const struct syscall_trace_entry *entry;
> >>> + char *str = NULL;
> >>> + int i, ret;
> >>> +
> >>> + trace_assign_type(trace, ent);
> >>> +
> >>> + syscall = trace->nr;
> >>> +
> >>> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> >>> + return TRACE_TYPE_HANDLED;
> >>> +
> >>> + entry = &syscall_trace_entries[syscall];
> >>> + if (!entry->simple_format) {
> >>> + ret = entry->output.print(trace->args);
> >>> + if (!ret)
> >>> + return TRACE_TYPE_PARTIAL_LINE;
> >>> + }
> >>> +
> >>> + ret = trace_seq_printf(s, "syscall %s :", entry->name);
> >>> + if (!ret)
> >>> + return TRACE_TYPE_PARTIAL_LINE;
> >>> +
> >>> + for (i = 0; i < entry->nb_args; i++) {
> >>> + /*
> >>> + * FIXME: the string comes from the user, but on tracing
> >>> + * output time, it may have disappeared from the memory.
> >>> + * May be better to strdup it on tracing time.
> >>> + */
> >>> + if (entry->output.string_mask & (1 << i)) {
> >>> + /* 128 is enough for most path */
> >>> + str = strndup_user((char *)trace->args[i], 128);
> >>> +
> >>> + /* If it faulted badly, the error shoud have been
> >>> + * handled while servicing the syscall, just ignore.
> >>> + */
> >>> + if (str == ERR_PTR(-EFAULT) || str == ERR_PTR(-EINVAL)
> >>> + || str == ERR_PTR(-ENOMEM))
> >>> + return TRACE_TYPE_HANDLED;
> >>> +
> >>> + ret = trace_seq_printf(s, " %s", str);
> >>> + } else {
> >>> + ret = trace_seq_printf(s, " 0x%lx", trace->args[i]);
> >>> + }
> >>> + if (!ret)
> >>> + goto end_partial;
> >>> + }
> >>> + if (!trace_seq_printf(s, "\n"))
> >>> + goto end_partial;
> >>> +
> >>> + kfree(str);
> >>> + return TRACE_TYPE_HANDLED;
> >>> +
> >>> +end_partial:
> >>> + kfree(str);
> >>> + return TRACE_TYPE_PARTIAL_LINE;
> >>> +}
> >>> +
> >>> +enum print_line_t
> >>> +print_syscall_exit(struct trace_iterator *iter, int flags)
> >>> +{
> >>> + struct trace_seq *s = &iter->seq;
> >>> + struct trace_entry *ent = iter->ent;
> >>> + struct syscall_trace_exit *trace;
> >>> + enum syscall_trace_nr syscall;
> >>> + const struct syscall_trace_entry *entry;
> >>> + int ret;
> >>> +
> >>> + trace_assign_type(trace, ent);
> >>> +
> >>> + syscall = trace->nr;
> >>> +
> >>> + if (!syscall || syscall >= __SYSCALL_TRACE_END)
> >>> + return TRACE_TYPE_HANDLED;
> >>> +
> >>> + entry = &syscall_trace_entries[syscall];
> >>> +
> >>> + ret = trace_seq_printf(s, "syscall %s -> 0x%lx\n", entry->name,
> >>> + trace->ret);
> >>> + if (!ret)
> >>> + return TRACE_TYPE_PARTIAL_LINE;
> >>> +
> >>> + return TRACE_TYPE_HANDLED;
> >>> +}
> >>> +
> >>> +void start_ftrace_syscalls(void)
> >>> +{
> >>> + unsigned long flags;
> >>> + struct task_struct *g, *t;
> >>> +
> >>> + if (atomic_inc_return(&refcount) != 1)
> >>> + goto out;
> >>> +
> >>> + read_lock_irqsave(&tasklist_lock, flags);
> >>> +
> >>> + do_each_thread(g, t) {
> >>> + set_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >>> + } while_each_thread(g, t);
> >>> +
> >>> + read_unlock_irqrestore(&tasklist_lock, flags);
> >>> +out:
> >>> + atomic_dec(&refcount);
> >>> +}
> >>> +
> >>> +void stop_ftrace_syscalls(void)
> >>> +{
> >>> + unsigned long flags;
> >>> + struct task_struct *g, *t;
> >>> +
> >>> + if (atomic_dec_return(&refcount))
> >>> + goto out;
> >>> +
> >>> + read_lock_irqsave(&tasklist_lock, flags);
> >>> +
> >>> + do_each_thread(g, t) {
> >>> + clear_tsk_thread_flag(t, TIF_SYSCALL_FTRACE);
> >>> + } while_each_thread(g, t);
> >>> +
> >>> + read_unlock_irqrestore(&tasklist_lock, flags);
> >>> +out:
> >>> + atomic_inc(&refcount);
> >>> +}
> >>> +
> >>> +void ftrace_syscall_enter(struct pt_regs *regs)
> >>> +{
> >>> + struct trace_array *tr = __get_global_trace();
> >>> + struct syscall_trace_enter *entry;
> >>> + const struct syscall_trace_entry *sys_data;
> >>> + struct ring_buffer_event *event;
> >>> + struct trace_array_cpu *data;
> >>> + int size;
> >>> + int syscall_nr;
> >>> + int nr_offset;
> >>> + int cpu;
> >>> +
> >>> + syscall_nr = syscall_get_nr(current, regs);
> >>> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> >>> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> >>> + return;
> >>> +
> >>> + cpu = raw_smp_processor_id();
> >>> + data = tr->data[cpu];
> >>> +
> >>> + if (unlikely(atomic_read(&data->disabled)))
> >>> + return;
> >>> +
> >>> + sys_data = &syscall_trace_entries[nr_offset];
> >>> + size = sizeof(*entry) + sizeof(unsigned long) * sys_data->nb_args;
> >>> +
> >>> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_ENTER, size, 0, 0);
> >>> + if (!event)
> >>> + return;
> >>> +
> >>> + entry = ring_buffer_event_data(event);
> >>> + entry->nr = nr_offset;
> >>> + syscall_get_arguments(current, regs, 0, sys_data->nb_args, entry->args);
> >>> +
> >>> + ring_buffer_unlock_commit(tr->buffer, event);
> >>> + trace_wake_up();
> >>> +}
> >>> +
> >>> +void ftrace_syscall_exit(struct pt_regs *regs)
> >>> +{
> >>> + struct trace_array *tr = __get_global_trace();
> >>> + struct syscall_trace_exit *entry;
> >>> + const struct syscall_trace_entry *sys_data;
> >>> + struct ring_buffer_event *event;
> >>> + struct trace_array_cpu *data;
> >>> + int syscall_nr;
> >>> + int nr_offset;
> >>> + int cpu;
> >>> +
> >>> + syscall_nr = syscall_get_nr(current, regs);
> >>> + nr_offset = arch_syscall_trace_nr[syscall_nr];
> >>> + if (!nr_offset || nr_offset >= __SYSCALL_TRACE_END)
> >>> + return;
> >>> +
> >>> + cpu = raw_smp_processor_id();
> >>> + data = tr->data[cpu];
> >>> +
> >>> + if (unlikely(atomic_read(&data->disabled)))
> >>> + return;
> >>> +
> >>> + sys_data = &syscall_trace_entries[nr_offset];
> >>> +
> >>> + event = trace_buffer_lock_reserve(tr, TRACE_SYSCALL_EXIT,
> >>> + sizeof(*entry), 0, 0);
> >>> + if (!event)
> >>> + return;
> >>> +
> >>> + entry = ring_buffer_event_data(event);
> >>> + entry->nr = nr_offset;
> >>> + entry->ret = syscall_get_return_value(current, regs);
> >>> +
> >>> + ring_buffer_unlock_commit(tr->buffer, event);
> >>> + trace_wake_up();
> >>> +}
> >>> +
> >>> +static int init_syscall_tracer(struct trace_array *tr)
> >>> +{
> >>> + start_ftrace_syscalls();
> >>> +
> >>> + return 0;
> >>> +}
> >>> +
> >>> +static void reset_syscall_tracer(struct trace_array *tr)
> >>> +{
> >>> + stop_ftrace_syscalls();
> >>> +}
> >>> +
> >>> +static struct trace_event syscall_enter_event = {
> >>> + .type = TRACE_SYSCALL_ENTER,
> >>> + .trace = print_syscall_enter,
> >>> +};
> >>> +
> >>> +static struct trace_event syscall_exit_event = {
> >>> + .type = TRACE_SYSCALL_EXIT,
> >>> + .trace = print_syscall_exit,
> >>> +};
> >>> +
> >>> +static struct tracer syscall_tracer __read_mostly = {
> >>> + .name = "syscall",
> >>> + .init = init_syscall_tracer,
> >>> + .reset = reset_syscall_tracer
> >>> +};
> >>> +
> >>> +__init int register_ftrace_syscalls(void)
> >>> +{
> >>> + int ret;
> >>> +
> >>> + ret = register_ftrace_event(&syscall_enter_event);
> >>> + if (!ret) {
> >>> + printk(KERN_WARNING "event %d failed to register\n",
> >>> + syscall_enter_event.type);
> >>> + WARN_ON_ONCE(1);
> >>> + }
> >>> +
> >>> + ret = register_ftrace_event(&syscall_exit_event);
> >>> + if (!ret) {
> >>> + printk(KERN_WARNING "event %d failed to register\n",
> >>> + syscall_exit_event.type);
> >>> + WARN_ON_ONCE(1);
> >>> + }
> >>> +
> >>> + return register_tracer(&syscall_tracer);
> >>> +}
> >>> +device_initcall(register_ftrace_syscalls);
> >> --
> >> Masami Hiramatsu
> >>
> >> Software Engineer
> >> Hitachi Computer Products (America) Inc.
> >> Software Solutions Division
> >>
> >> e-mail: [email protected]
> >>
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
>
> --
> Masami Hiramatsu
>
> Software Engineer
> Hitachi Computer Products (America) Inc.
> Software Solutions Division
>
> e-mail: [email protected]
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-16 22:18:55

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

Hi -


On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:

> [...]
> > As far as I know, utrace supports multiple trace-engines on a process.
> > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> >
> > utrace-+-ptrace_engine---owner_process
> > |
> > +-systemtap_module
> > |
> > +-ftrace_plugin

Right. In this way, utrace is simply a multiplexing intermediary.


> > Here, Frank had posted an example of utrace->ftrace engine.
> > http://lkml.org/lkml/2009/1/27/294
> >
> > And here is the latest his patch(which seems to support syscall tracing...)
> > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> >
>
> Reminder : we are looking at system-wide tracing here. Here are some
> comments about the current utrace implementation.
>
> Looking at include/linux/utrace.h from the tree
>
> 17 * A tracing engine starts by calling utrace_attach_task() or
> 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> 19 * (&struct utrace_engine_ops), and some associated data. This produces a
> 20 * &struct utrace_engine, which is the handle used for all other
> 21 * operations. An attached engine has its ops vector, its data, and an
> 22 * event mask controlled by utrace_set_events().
>
> So if the system has, say 3000 threads, then we have 3000 struct
> utrace_engine created ? I wonder what effet this could have one
> cachelines if this is used to trace hot paths like system call
> entry/exit. Have you benchmarked this kind of scenario under tbench ?

It has not been a problem, since utrace_engines are designed to be
lightweight. Starting or stopping a systemtap script of the form

probe process.syscall {}

appears to have no noticable impact on a tbench suite.


> 24 * For each event bit that is set, that engine will get the
> 25 * appropriate ops->report_*() callback when the event occurs. The
> 26 * &struct utrace_engine_ops need not provide callbacks for an event
> 27 * unless the engine sets one of the associated event bits.
>
> Looking at utrace_set_events(), we seem to be limited to 32 events on a
> 32-bits architectures because it uses a bitmask ? Isn't it a bit small?

There are only a few types of thread events that involve different
classes of treatment, or different degrees of freedom in terms of
interference with the uninstrumented fast path of the threads.

For example, it does not make sense to have different flag bits for
different system calls, since choosing to trace *any* system call
involves taking the thread off of the fast path with the TIF_ flag.
Once it's off the fast path, it doesn't matter whether the utrace core
or some client performs syscall discrimination, so it is left to the
client.


> 682 /**
> 683 * utrace_set_events_pid - choose which event reports a tracing engine gets
> 684 * @pid: thread to affect
> 685 * @engine: attached engine to affect
> 686 * @eventmask: new event mask
> 687 *
> 688 * This is the same as utrace_set_events(), but takes a &struct pid
> 689 * pointer rather than a &struct task_struct pointer. The caller must
> 690 * hold a ref on @pid, but does not need to worry about the task
> 691 * staying valid. If it's been reaped so that @pid points nowhere,
> 692 * then this call returns -%ESRCH.
>
>
> Comments like "but does not need to worry about the task staying valid"
> does not make me feel safe and comfortable at all, could you explain
> how you can assume that derefencing an "invalid" pointer will return
> NULL ?

(We're doing a final round of "internal" (pre-LKML) reviews of the
utrace implementation right now on [email protected], where such
comments get fastest attention from the experts.)

For this particular issue, the utrace documentation file explains the
liveness rules for the various pointers that can be fed to or received
from utrace functions. This is not about "feeling" safe, it's about
what the mechanism is deliberately designed to permit.


> About the utrace_attach_task() :
>
> 244 if (unlikely(target->flags & PF_KTHREAD))
> 245 /*
> 246 * Silly kernel, utrace is for users!
> 247 */
> 248 return ERR_PTR(-EPERM);
>
> So we cannot trace kernel threads ?

I'm not quite sure about all the reasons for this, but I believe that
kernel threads don't tend to engage in job control / signal /
system-call activities the same way as normal user threads do.


> 118 /*
> 119 * Called without locks, when we might be the first utrace engine to attach.
> 120 * If this is a newborn thread and we are not the creator, we have to wait
> 121 * for it. The creator gets the first chance to attach. The PF_STARTING
> 122 * flag is cleared after its report_clone hook has had a chance to run.
> 123 */
> 124 static inline int utrace_attach_delay(struct task_struct *target)
> 125 {
> 126 if ((target->flags & PF_STARTING) && target->real_parent != current)
> 127 do {
> 128 schedule_timeout_interruptible(1);
> 129 if (signal_pending(current))
> 130 return -ERESTARTNOINTR;
> 131 } while (target->flags & PF_STARTING);
> 132
> 133 return 0;
> 134 }
>
> Why do we absolutely have to poll until the thread has started ?

(I don't know off the top of my head - Roland?)


> utrace_add_engine()
> set_notify_resume(target);
>
> ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> that it is set asynchronously with the execution of the target thread
> (as I do with my TIF_KERNEL_TRACE thread flag).
>
> However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> entry_64.S
>
> int_signal:
> and
> retint_signal:
>
> code paths. However, if there is no syscall tracing to do upon syscall
> entry, the thread flags are not re-read at syscall exit and you will
> miss the syscall exit returning from your target thread if this thread
> was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> some subtle way I did not figure out ? BTW re-reading the TIF flags from
> the thread_info at syscall exit on the fast path is out of question
> because it considerably degrades the kernel performances. entry_*.S is
> a very, very critical path.

(I don't know off the top of my head - Roland?)


- FChE

2009-03-16 23:47:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
> Hi -
>
>
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
>
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a process.
> > > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > >
> > > utrace-+-ptrace_engine---owner_process
> > > |
> > > +-systemtap_module
> > > |
> > > +-ftrace_plugin
>
> Right. In this way, utrace is simply a multiplexing intermediary.
>
>
> > > Here, Frank had posted an example of utrace->ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > >
> > > And here is the latest his patch(which seems to support syscall tracing...)
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > >
> >
> > Reminder : we are looking at system-wide tracing here. Here are some
> > comments about the current utrace implementation.
> >
> > Looking at include/linux/utrace.h from the tree
> >
> > 17 * A tracing engine starts by calling utrace_attach_task() or
> > 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> > 19 * (&struct utrace_engine_ops), and some associated data. This produces a
> > 20 * &struct utrace_engine, which is the handle used for all other
> > 21 * operations. An attached engine has its ops vector, its data, and an
> > 22 * event mask controlled by utrace_set_events().
> >
> > So if the system has, say 3000 threads, then we have 3000 struct
> > utrace_engine created ? I wonder what effet this could have one
> > cachelines if this is used to trace hot paths like system call
> > entry/exit. Have you benchmarked this kind of scenario under tbench ?
>
> It has not been a problem, since utrace_engines are designed to be
> lightweight. Starting or stopping a systemtap script of the form
>
> probe process.syscall {}
>
> appears to have no noticable impact on a tbench suite.
>
>
> > 24 * For each event bit that is set, that engine will get the
> > 25 * appropriate ops->report_*() callback when the event occurs. The
> > 26 * &struct utrace_engine_ops need not provide callbacks for an event
> > 27 * unless the engine sets one of the associated event bits.
> >
> > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
>
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.
>
> For example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
>
>
> > 682 /**
> > 683 * utrace_set_events_pid - choose which event reports a tracing engine gets
> > 684 * @pid: thread to affect
> > 685 * @engine: attached engine to affect
> > 686 * @eventmask: new event mask
> > 687 *
> > 688 * This is the same as utrace_set_events(), but takes a &struct pid
> > 689 * pointer rather than a &struct task_struct pointer. The caller must
> > 690 * hold a ref on @pid, but does not need to worry about the task
> > 691 * staying valid. If it's been reaped so that @pid points nowhere,
> > 692 * then this call returns -%ESRCH.
> >
> >
> > Comments like "but does not need to worry about the task staying valid"
> > does not make me feel safe and comfortable at all, could you explain
> > how you can assume that derefencing an "invalid" pointer will return
> > NULL ?
>
> (We're doing a final round of "internal" (pre-LKML) reviews of the
> utrace implementation right now on [email protected], where such
> comments get fastest attention from the experts.)
>
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions. This is not about "feeling" safe, it's about
> what the mechanism is deliberately designed to permit.
>
>
> > About the utrace_attach_task() :
> >
> > 244 if (unlikely(target->flags & PF_KTHREAD))
> > 245 /*
> > 246 * Silly kernel, utrace is for users!
> > 247 */
> > 248 return ERR_PTR(-EPERM);
> >
> > So we cannot trace kernel threads ?
>
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.
>


Some of them use some syscalls, but it doesn't involve a user/kernel switch.
So it's not tracable by hooking syscall_entry/exit or using tracehooks.
It would require specific hooks on sys_* functions for that.

So this check is right (writing on each thread info seems somewhat costly so
it's better if it is avoided like here).

Frederic.


> > 118 /*
> > 119 * Called without locks, when we might be the first utrace engine to attach.
> > 120 * If this is a newborn thread and we are not the creator, we have to wait
> > 121 * for it. The creator gets the first chance to attach. The PF_STARTING
> > 122 * flag is cleared after its report_clone hook has had a chance to run.
> > 123 */
> > 124 static inline int utrace_attach_delay(struct task_struct *target)
> > 125 {
> > 126 if ((target->flags & PF_STARTING) && target->real_parent != current)
> > 127 do {
> > 128 schedule_timeout_interruptible(1);
> > 129 if (signal_pending(current))
> > 130 return -ERESTARTNOINTR;
> > 131 } while (target->flags & PF_STARTING);
> > 132
> > 133 return 0;
> > 134 }
> >
> > Why do we absolutely have to poll until the thread has started ?
>
> (I don't know off the top of my head - Roland?)
>
>
> > utrace_add_engine()
> > set_notify_resume(target);
> >
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> >
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> > entry_64.S
> >
> > int_signal:
> > and
> > retint_signal:
> >
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> > some subtle way I did not figure out ? BTW re-reading the TIF flags from
> > the thread_info at syscall exit on the fast path is out of question
> > because it considerably degrades the kernel performances. entry_*.S is
> > a very, very critical path.
>
> (I don't know off the top of my head - Roland?)
>
>
> - FChE

2009-03-17 05:28:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

On 03/16, Mathieu Desnoyers wrote:
>
> utrace_add_engine()
> set_notify_resume(target);
>
> ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> that it is set asynchronously with the execution of the target thread
> (as I do with my TIF_KERNEL_TRACE thread flag).
>
> However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> entry_64.S
>
> int_signal:
> and
> retint_signal:
>
> code paths. However, if there is no syscall tracing to do upon syscall
> entry, the thread flags are not re-read at syscall exit and you will
> miss the syscall exit returning from your target thread if this thread
> was blocked while you set its TIF_NOTIFY_RESUME.

Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
If engine wants the syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
sets TIF_SYSCALL_TRACE. And syscall_trace_enter/syscall_trace_leave call
tracehook_report_syscall_xxx().

Oleg.

2009-03-17 16:05:48

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

* Oleg Nesterov ([email protected]) wrote:
> On 03/16, Mathieu Desnoyers wrote:
> >
> > utrace_add_engine()
> > set_notify_resume(target);
> >
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> >
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> > entry_64.S
> >
> > int_signal:
> > and
> > retint_signal:
> >
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME.
>
> Afaics, TIF_NOTIFY_RESUME is not needed to trace syscall entry/exit.
> If engine wants the syscall tracing, utrace_set_events(UTRACE_SYSCALL_xxx)
> sets TIF_SYSCALL_TRACE. And syscall_trace_enter/syscall_trace_leave call
> tracehook_report_syscall_xxx().
>
> Oleg.

I recall that TIF_SYSCALL_TRACE also suffers from the same problem as
TIF_NOTIFY_RESUME if set asynchronously with the target thread's
execution at least on x86_64 and arm. Do you take care to stop the
target thread in utrace_set_events ?

Mathieu

>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-19 10:37:54

by Roland McGrath

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

The utrace API itself is not a good fit for global tracing, since its
purpose is tracing and control of individual user threads. There is
no reason to allocate its per-task data structures when you are going
to treat all tasks the same anyway. The points that I think are being
missed are about the possibilities of overloading TIF_SYSCALL_TRACE.

It's true that ptrace uses TIF_SYSCALL_TRACE as a flag for whether you are
in the middle of a PTRACE_SYSCALL, so it can be confused by setting it for
other purposes on a task that is also ptrace'd (but not with PTRACE_SYSCALL).
Until we are able to do away with these parts of the old ptrace innards,
you can't overload TIF_SYSCALL_TRACE without perturbing ptrace behavior.

The utrace code does not have this problem. It keeps its own state bits,
so for it, TIF_SYSCALL_TRACE means exactly "the task will call
tracehook_report_syscall_*" and no more. To use TIF_SYSCALL_TRACE for
another purpose, just set it on all the tasks you like (and/or set it on
new tasks in fork.c) and add your code (tracepoints, whatever) to
tracehook_report_syscall_* alongside the calls there into utrace. There is
exactly one place in utrace code that clears TIF_SYSCALL_TRACE, and you
just add "&& !global_syscall_tracing_enabled" to the condition there. You
don't need to bother clearing TIF_SYSCALL_TRACE on any task when you're
done. If your "global_syscall_tracing_enabled" (or whatever it is) is
clear, each task will lazily fall into utrace at its next syscall
entry/exit and then utrace will reset TIF_SYSCALL_TRACE when it finds no
reason left to have it on.

I'm not really going to delve into utrace internals in this thread. Please
raise those questions in review of the utrace patches when current code is
actually posted, where they belong. Here I'll just mention the relevant
things that relate to the underlying issue you raised about synchronization.
As thoroughly documented, utrace_set_events() is a quick, asynchronous call
that itself makes no guarantees about how quickly a running task will start
to report the newly-requested events. For purposes relevant here, it just
sets TIF_SYSCALL_TRACE and nothing else. In utrace, if you want synchronous
assurance that a task misses no events you ask for, then you must first use
utrace_control (et al) to stop it and synchronize. That is not something
that makes much sense at all for a "flip on global tracing" operation, which
is not generally especially synchronous with anything else. If you want
best effort that a task will pick up newly-requested events Real Soon Now,
you can use utrace_control with just UTRACE_REPORT. For purposes relevant
here, this just does set_notify_resume(). That will send an IPI if the task
is running, and then it will start noticing before it returns to user mode.
So:
set_tsk_thread_flag(task, TIF_SYSCALL_TRACE);
set_notify_resume(task);
is what I would expect you to do on each task if you want to quickly get it
to start hitting tracehook_report_syscall_*. (I'm a bit dubious that there
is really any need to speed it up with set_notify_resume, but that's just me.)

Finally, some broader points about TIF_SYSCALL_TRACE that I think
have been overlooked. The key special feature of TIF_SYSCALL_TRACE
is that it gets you to a place where full user_regset access is
available. Debuggers need this to read (and write) the full user
register state arbitrarily, which they also need to do user
backtraces and the like. If you do not need user_regset to work,
then you don't need to be on this (slowest) code path.

If you are only interested in reading syscall arguments and results
(or even in changing syscall results in exit tracing) then you do
not need user_regset and you do not need to take the slowest syscall
path. (If you are doing backtraces but already rely on full kernel
stack unwinding to do it, you also do not need user_regset.) From
anywhere inside the kernel, you can use the asm/syscall.h calls to
read syscall args, whichever entry path the task took.

The other mechanism to hook into every syscall entry/exit is
TIF_SYSCALL_AUDIT. On some machines (like x86), this takes a third,
"warm" code path that is faster than the TIF_SYSCALL_TRACE path
(though obviously still off the fastest direct code path). It can
be faster precisely because it doesn't need to allow for user_regset
access, nor for modification of syscall arguments in entry tracing.
For normal read-only tracing of just the actual syscall details,
it has all you need.

Unfortunately the arch code all looks like:

if (unlikely(current->audit_context))
audit_syscall_{entry,exit}(...);

So we need to change that to:

if (unlikely(test_thread_flag(TIF_SYSCALL_AUDIT)))
audit_syscall_{entry,exit}(...);

But that is pretty easy to get right, even doing it blind on arch's
you can't test. Far better than adding new asm hackery for each arch
that's almost identical to TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT (and
finding out that some are fresh out of TIF bits in the range that
their asm code can handle).

TIF_SYSCALL_AUDIT is only set when allocating audit_context, and its
paths already have !context tests so overloading is harmless today.
(Whereas with TIF_SYSCALL_TRACE, you have to wait for later ptrace
cleanups or write off using ptrace simultaneously.)

Then you can do the lazy disable in audit_syscall_{entry,exit} with:

if (unlikely(!context)) {
if (unlikely(!global_syscall_tracing_enabled))
clear_thread_flag(TIF_SYSCALL_AUDIT);
return;
}

Plus add there your tracepoint or whatnot.

Unless you really plan to use user_regset in your tracepoints, then
I think this is a better plan for global syscall tracing than either
fiddling with TIF_SYSCALL_TRACE or adding new arch asm requirements.
(IMHO, the latter is the worst idea on the table.)


Thanks,
Roland

2009-03-23 16:42:27

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

* Frank Ch. Eigler ([email protected]) wrote:
> Hi -
>
>
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
>
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a process.
> > > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > >
> > > utrace-+-ptrace_engine---owner_process
> > > |
> > > +-systemtap_module
> > > |
> > > +-ftrace_plugin
>
> Right. In this way, utrace is simply a multiplexing intermediary.
>
>
> > > Here, Frank had posted an example of utrace->ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > >
> > > And here is the latest his patch(which seems to support syscall tracing...)
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > >
> >
> > Reminder : we are looking at system-wide tracing here. Here are some
> > comments about the current utrace implementation.
> >
> > Looking at include/linux/utrace.h from the tree
> >
> > 17 * A tracing engine starts by calling utrace_attach_task() or
> > 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> > 19 * (&struct utrace_engine_ops), and some associated data. This produces a
> > 20 * &struct utrace_engine, which is the handle used for all other
> > 21 * operations. An attached engine has its ops vector, its data, and an
> > 22 * event mask controlled by utrace_set_events().
> >
> > So if the system has, say 3000 threads, then we have 3000 struct
> > utrace_engine created ? I wonder what effet this could have one
> > cachelines if this is used to trace hot paths like system call
> > entry/exit. Have you benchmarked this kind of scenario under tbench ?
>
> It has not been a problem, since utrace_engines are designed to be
> lightweight. Starting or stopping a systemtap script of the form
>
> probe process.syscall {}
>
> appears to have no noticable impact on a tbench suite.
>

Do you mean starting this script for a single process or for _all_ the
processes and threads running on the system ?

>
> > 24 * For each event bit that is set, that engine will get the
> > 25 * appropriate ops->report_*() callback when the event occurs. The
> > 26 * &struct utrace_engine_ops need not provide callbacks for an event
> > 27 * unless the engine sets one of the associated event bits.
> >
> > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
>
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.
>
> For example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
>

If we limit ourself to thread-interaction events, I agree that they are
limited. But in the system-wide tracing scenario, the criterions for
filtering can apply to many more event categories.

Referring to Roland's reply, I think using utrace to enable system-wide
collection of data would just be a waste of resources. Going through a
more lightweight system-wide activation seems more appropriate to me.
Utrace is still a very promising tool to trace process-specific activity
though.

Mathieu

>
> > 682 /**
> > 683 * utrace_set_events_pid - choose which event reports a tracing engine gets
> > 684 * @pid: thread to affect
> > 685 * @engine: attached engine to affect
> > 686 * @eventmask: new event mask
> > 687 *
> > 688 * This is the same as utrace_set_events(), but takes a &struct pid
> > 689 * pointer rather than a &struct task_struct pointer. The caller must
> > 690 * hold a ref on @pid, but does not need to worry about the task
> > 691 * staying valid. If it's been reaped so that @pid points nowhere,
> > 692 * then this call returns -%ESRCH.
> >
> >
> > Comments like "but does not need to worry about the task staying valid"
> > does not make me feel safe and comfortable at all, could you explain
> > how you can assume that derefencing an "invalid" pointer will return
> > NULL ?
>
> (We're doing a final round of "internal" (pre-LKML) reviews of the
> utrace implementation right now on [email protected], where such
> comments get fastest attention from the experts.)
>
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions. This is not about "feeling" safe, it's about
> what the mechanism is deliberately designed to permit.
>
>
> > About the utrace_attach_task() :
> >
> > 244 if (unlikely(target->flags & PF_KTHREAD))
> > 245 /*
> > 246 * Silly kernel, utrace is for users!
> > 247 */
> > 248 return ERR_PTR(-EPERM);
> >
> > So we cannot trace kernel threads ?
>
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.
>
>
> > 118 /*
> > 119 * Called without locks, when we might be the first utrace engine to attach.
> > 120 * If this is a newborn thread and we are not the creator, we have to wait
> > 121 * for it. The creator gets the first chance to attach. The PF_STARTING
> > 122 * flag is cleared after its report_clone hook has had a chance to run.
> > 123 */
> > 124 static inline int utrace_attach_delay(struct task_struct *target)
> > 125 {
> > 126 if ((target->flags & PF_STARTING) && target->real_parent != current)
> > 127 do {
> > 128 schedule_timeout_interruptible(1);
> > 129 if (signal_pending(current))
> > 130 return -ERESTARTNOINTR;
> > 131 } while (target->flags & PF_STARTING);
> > 132
> > 133 return 0;
> > 134 }
> >
> > Why do we absolutely have to poll until the thread has started ?
>
> (I don't know off the top of my head - Roland?)
>
>
> > utrace_add_engine()
> > set_notify_resume(target);
> >
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with the execution of the target thread
> > (as I do with my TIF_KERNEL_TRACE thread flag).
> >
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> > entry_64.S
> >
> > int_signal:
> > and
> > retint_signal:
> >
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit and you will
> > miss the syscall exit returning from your target thread if this thread
> > was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> > some subtle way I did not figure out ? BTW re-reading the TIF flags from
> > the thread_info at syscall exit on the fast path is out of question
> > because it considerably degrades the kernel performances. entry_*.S is
> > a very, very critical path.
>
> (I don't know off the top of my head - Roland?)
>
>
> - FChE

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-23 16:53:43

by Frank Ch. Eigler

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

Hi -

On Mon, Mar 23, 2009 at 12:42:08PM -0400, Mathieu Desnoyers wrote:
> [...]

(Please trim emails you're responding to.)

> [...]
> > > So if the system has, say 3000 threads, then we have 3000 struct
> > > utrace_engine created ? I wonder what effet this could have one
> > > cachelines if this is used to trace hot paths like system call
> > > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> >
> > It has not been a problem, since utrace_engines are designed to be
> > lightweight. Starting or stopping a systemtap script of the form
> >
> > probe process.syscall {}
> >
> > appears to have no noticable impact on a tbench suite.
>
> Do you mean starting this script for a single process or for _all_ the
> processes and threads running on the system ?

The script above usually applies to all threads.


> > > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> >
> > There are only a few types of thread events that involve different
> > classes of treatment, or different degrees of freedom in terms of
> > interference with the uninstrumented fast path of the threads. [...]
>
> If we limit ourself to thread-interaction events, I agree that they are
> limited. But in the system-wide tracing scenario, the criterions for
> filtering can apply to many more event categories.

If those different criteria have equivalent impact on running threads,
there is no need to differentiate them at the low (utrace event flag)
level. Could you offer an example to clarify?


> Referring to Roland's reply, I think using utrace to enable
> system-wide collection of data would just be a waste of
> resources. Going through a more lightweight system-wide activation
> seems more appropriate to me. [...]

Perhaps. OTOH it also makes sense to me to use (and improve) one
general facility, if it can do the right thing almost as fast as a
wholly separate facility that's specialized for one small purpose.
The decision would probably rest with a more data-based comparison of
performance & code size.


- FChE

2009-03-23 17:04:18

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

* Frank Ch. Eigler ([email protected]) wrote:
> Hi -
>
> On Mon, Mar 23, 2009 at 12:42:08PM -0400, Mathieu Desnoyers wrote:
> > [...]
>
> (Please trim emails you're responding to.)
>
> > [...]
> > > > So if the system has, say 3000 threads, then we have 3000 struct
> > > > utrace_engine created ? I wonder what effet this could have one
> > > > cachelines if this is used to trace hot paths like system call
> > > > entry/exit. Have you benchmarked this kind of scenario under tbench ?
> > >
> > > It has not been a problem, since utrace_engines are designed to be
> > > lightweight. Starting or stopping a systemtap script of the form
> > >
> > > probe process.syscall {}
> > >
> > > appears to have no noticable impact on a tbench suite.
> >
> > Do you mean starting this script for a single process or for _all_ the
> > processes and threads running on the system ?
>
> The script above usually applies to all threads.
>

Hrm, I already spent more time installing and benchmarking systemtap
than I should, so I don't have time currently to run further systemtap
benchmarks, but I seriously doubt about this. Have you run the following
benchmark ?

Baseline :
vanilla kernel, without utrace

Comparison with :
utrace-enabled kernel, with the syscall probe activated

?

If you are comparing a utrace-enabled kernel with and without the
syscall probes activated, then you are probably missing some performance
impact. Also make sure AUDIT SYSCALL, secure computing and
frame pointers are disabled in your baseline kernel too.

If this is what you did, I would really like to see the numbers.

>
> > > > Looking at utrace_set_events(), we seem to be limited to 32 events on a
> > > > 32-bits architectures because it uses a bitmask ? Isn't it a bit small?
> > >
> > > There are only a few types of thread events that involve different
> > > classes of treatment, or different degrees of freedom in terms of
> > > interference with the uninstrumented fast path of the threads. [...]
> >
> > If we limit ourself to thread-interaction events, I agree that they are
> > limited. But in the system-wide tracing scenario, the criterions for
> > filtering can apply to many more event categories.
>
> If those different criteria have equivalent impact on running threads,
> there is no need to differentiate them at the low (utrace event flag)
> level. Could you offer an example to clarify?
>
>
> > Referring to Roland's reply, I think using utrace to enable
> > system-wide collection of data would just be a waste of
> > resources. Going through a more lightweight system-wide activation
> > seems more appropriate to me. [...]
>
> Perhaps. OTOH it also makes sense to me to use (and improve) one
> general facility, if it can do the right thing almost as fast as a
> wholly separate facility that's specialized for one small purpose.
> The decision would probably rest with a more data-based comparison of
> performance & code size.
>

Sure.

Mathieu

>
> - FChE

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-03-23 17:33:35

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

Hi Roland,

* Roland McGrath ([email protected]) wrote:
> The utrace API itself is not a good fit for global tracing, since its
> purpose is tracing and control of individual user threads. There is
> no reason to allocate its per-task data structures when you are going
> to treat all tasks the same anyway. The points that I think are being
> missed are about the possibilities of overloading TIF_SYSCALL_TRACE.
>
> It's true that ptrace uses TIF_SYSCALL_TRACE as a flag for whether you are
> in the middle of a PTRACE_SYSCALL, so it can be confused by setting it for
> other purposes on a task that is also ptrace'd (but not with PTRACE_SYSCALL).
> Until we are able to do away with these parts of the old ptrace innards,
> you can't overload TIF_SYSCALL_TRACE without perturbing ptrace behavior.
>

Yes, this is why I went with a different thread flag in my
TIF_KERNEL_TRACE implementation.

> The utrace code does not have this problem. It keeps its own state bits,
> so for it, TIF_SYSCALL_TRACE means exactly "the task will call
> tracehook_report_syscall_*" and no more. To use TIF_SYSCALL_TRACE for
> another purpose, just set it on all the tasks you like (and/or set it on
> new tasks in fork.c) and add your code (tracepoints, whatever) to
> tracehook_report_syscall_* alongside the calls there into utrace. There is
> exactly one place in utrace code that clears TIF_SYSCALL_TRACE, and you
> just add "&& !global_syscall_tracing_enabled" to the condition there. You
> don't need to bother clearing TIF_SYSCALL_TRACE on any task when you're
> done. If your "global_syscall_tracing_enabled" (or whatever it is) is
> clear, each task will lazily fall into utrace at its next syscall
> entry/exit and then utrace will reset TIF_SYSCALL_TRACE when it finds no
> reason left to have it on.

I wonder how racy enabling system-wide tracing and disabling utrace
tracing on a specific thread would be ? How do you ensure that the
global tracing flag and per-thread flags are updated consistently ?

I also wonder about added performance impact caused by the
tracehook_report_syscall_* call. Ideally, system-wide syscall tracing
should call directly into a tracing callback, write to the trace
buffers, and return. With utrace, we would have to call an intermediate
callback, which would then call our tracer, then test utrace flags to
check if utrace should be called, and then return. Function calls are
quite costly nowadays :(

>
> I'm not really going to delve into utrace internals in this thread. Please
> raise those questions in review of the utrace patches when current code is
> actually posted, where they belong. Here I'll just mention the relevant
> things that relate to the underlying issue you raised about synchronization.
> As thoroughly documented, utrace_set_events() is a quick, asynchronous call
> that itself makes no guarantees about how quickly a running task will start
> to report the newly-requested events. For purposes relevant here, it just
> sets TIF_SYSCALL_TRACE and nothing else. In utrace, if you want synchronous
> assurance that a task misses no events you ask for, then you must first use
> utrace_control (et al) to stop it and synchronize. That is not something
> that makes much sense at all for a "flip on global tracing" operation, which
> is not generally especially synchronous with anything else. If you want
> best effort that a task will pick up newly-requested events Real Soon Now,
> you can use utrace_control with just UTRACE_REPORT. For purposes relevant
> here, this just does set_notify_resume(). That will send an IPI if the task
> is running, and then it will start noticing before it returns to user mode.
> So:
> set_tsk_thread_flag(task, TIF_SYSCALL_TRACE);
> set_notify_resume(task);
> is what I would expect you to do on each task if you want to quickly get it
> to start hitting tracehook_report_syscall_*. (I'm a bit dubious that there
> is really any need to speed it up with set_notify_resume, but that's just me.)


Ideally, when we start tracing, setting the flag can be asynchronous,
but we need to have a way to figure out when tracing is actually active
(e.g. rcu quiescent state). So this can be seen as synchronous
activation. Stopping all tasks does not really make much sense for
system-wide tracing, especially if there are alternatives.

>
> Finally, some broader points about TIF_SYSCALL_TRACE that I think
> have been overlooked. The key special feature of TIF_SYSCALL_TRACE
> is that it gets you to a place where full user_regset access is
> available. Debuggers need this to read (and write) the full user
> register state arbitrarily, which they also need to do user
> backtraces and the like. If you do not need user_regset to work,
> then you don't need to be on this (slowest) code path.

LTTng had userspace backtraces on syscall entry and irq entry a while
ago, and this way particularly useful. But I agree than if this is not
needed, we should go for the warm path.

>
> If you are only interested in reading syscall arguments and results
> (or even in changing syscall results in exit tracing) then you do
> not need user_regset and you do not need to take the slowest syscall
> path. (If you are doing backtraces but already rely on full kernel
> stack unwinding to do it, you also do not need user_regset.) From
> anywhere inside the kernel, you can use the asm/syscall.h calls to
> read syscall args, whichever entry path the task took.
>
> The other mechanism to hook into every syscall entry/exit is
> TIF_SYSCALL_AUDIT. On some machines (like x86), this takes a third,
> "warm" code path that is faster than the TIF_SYSCALL_TRACE path
> (though obviously still off the fastest direct code path). It can
> be faster precisely because it doesn't need to allow for user_regset
> access, nor for modification of syscall arguments in entry tracing.
> For normal read-only tracing of just the actual syscall details,
> it has all you need.
>
> Unfortunately the arch code all looks like:
>
> if (unlikely(current->audit_context))
> audit_syscall_{entry,exit}(...);
>
> So we need to change that to:
>
> if (unlikely(test_thread_flag(TIF_SYSCALL_AUDIT)))
> audit_syscall_{entry,exit}(...);
>
> But that is pretty easy to get right, even doing it blind on arch's
> you can't test. Far better than adding new asm hackery for each arch
> that's almost identical to TIF_SYSCALL_TRACE or TIF_SYSCALL_AUDIT (and
> finding out that some are fresh out of TIF bits in the range that
> their asm code can handle).
>
> TIF_SYSCALL_AUDIT is only set when allocating audit_context, and its
> paths already have !context tests so overloading is harmless today.
> (Whereas with TIF_SYSCALL_TRACE, you have to wait for later ptrace
> cleanups or write off using ptrace simultaneously.)
>
> Then you can do the lazy disable in audit_syscall_{entry,exit} with:
>
> if (unlikely(!context)) {
> if (unlikely(!global_syscall_tracing_enabled))
> clear_thread_flag(TIF_SYSCALL_AUDIT);
> return;
> }
>
> Plus add there your tracepoint or whatnot.
>
> Unless you really plan to use user_regset in your tracepoints, then
> I think this is a better plan for global syscall tracing than either
> fiddling with TIF_SYSCALL_TRACE or adding new arch asm requirements.
> (IMHO, the latter is the worst idea on the table.)
>

Thanks for this thorough review of TIF flags. Hrm, racing with other
pieces of infrastructure is never fun, and given we might want to save
the userspace stack in some probes, I think it could be a good idea to
go with our own flag.

Mathieu

>
> Thanks,
> Roland

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68