hi,
Problem:
Currently, tracepoints are implemented using a conditional. The conditional
check requires testing a global variable for each tracepoint. Although
the overhead of this check is small, it increases under memory pressure. As we
increase the number of tracepoints in the kernel this may become more of an
issue. In addition, tracepoints are often dormant (disabled), and provide no
direct kernel functionality. Thus, it is highly desirable to reduce their
impact as much as possible. Mathieu Desnoyers has already suggested a number of
requirements for a solution to this issue.
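For reference, the check in question is roughly what DECLARE_TRACE() in
include/linux/tracepoint.h generates today at every call site (a sketch, with
a made-up tracepoint 'foo'):
/* one global load plus a conditional branch per call site, even while
 * the tracepoint is disabled
 */
static inline void trace_foo(int arg)
{
	if (unlikely(__tracepoint_foo.state))
		__DO_TRACE(&__tracepoint_foo, TP_PROTO(int arg), TP_ARGS(arg));
}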
Solution:
In discussing this problem with Roland McGrath and Richard Henderson, we came
up with a new 'asm goto' statement that allows branching to a label. Thus, this
patch set introduces a 'STATIC_JUMP_IF()' macro as follows:
#ifdef HAVE_STATIC_JUMP
#define STATIC_JUMP_IF(tag, label, cond) \
asm goto ("1:" /* 5-byte insn */ \
P6_NOP5 \
".pushsection __jump_table, \"a\" \n\t" \
_ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
".popsection \n\t" \
: : "i" (__sjstrtab_##tag) : : label)
#else
#define STATIC_JUMP_IF(tag, label, cond) \
if (unlikely(cond)) \
goto label;
#endif /* !HAVE_STATIC_JUMP */
which can be used as:
STATIC_JUMP_IF(trace, trace_label, jump_enabled);
printk("not doing tracing\n");
if (0) {
trace_label:
printk("doing tracing: %d\n", file);
}
---------------------------------------
Thus, if 'HAVE_STATIC_JUMP' is defined (which will depend ultimately on the
existence of 'asm goto' in the compiler version), we simply have a no-op
followed by a jump around the dormant (disabled) tracing code. The
'STATIC_JUMP_IF()' macro produces a 'jump_table' which has the following
format:
[instruction address] [jump target] [tracepoint name]
Thus, to enable a tracepoint, we simply patch the 'instruction address' with
a jump to the 'jump target'. The current implementation is using ftrace
infrastructure to accomplish the patching, which uses 'stop_machine'. In
subsequent versions, we will update the mechanism to use more efficient
code patching techniques.
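For reference, the table entry and the enable path in this series boil down to
the sketch below (condensed from patch 1; the helper name is illustrative, and
error handling plus the stop_machine()-based update are omitted):
/* One table entry per STATIC_JUMP_IF() call site. */
struct jump_entry {
	unsigned long code;	/* address of the 5-byte nop */
	unsigned long target;	/* label to jump to once enabled */
	char *name;		/* name string used to look the entry up */
};

/* Enabling a site rewrites its nop with a relative jmp to the target. */
static int jump_label_enable(struct jump_entry *entry)
{
	struct dyn_ftrace rec = { .ip = entry->code };

	return ftrace_make_jump(&rec, entry->target);
}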
I've tested the performance of this using 'get_cycles()' calls around the
tracepoint call sites. For an Intel Core 2 Quad cpu (in cycles, averages):
                 idle          after tbench run
                 ----          ----------------
old code          32                  88
new code           2                   4
The performance improvement can be reproduced very reliably (using patch 4
in this series) on both Intel and AMD hardware.
In terms of code analysis, the current code for the disabled case is a 'cmpl'
followed by a 'je' around the tracepoint code. So:
cmpl - 83 3d 0e 77 87 00 00 - 7 bytes
je - 74 3e - 2 bytes
total of 9 instruction bytes.
The new code is a 'nopl' followed by a 'jmp'. Thus:
nopl - 0f 1f 44 00 00 - 5 bytes
jmp - eb 3e - 2 bytes
total of 7 instruction bytes.
So, the new code also accounts for 2 fewer bytes in the instruction cache per tracepoint.
Here's a link to the gcc thread introducing this feature:
http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
Todo:
-convert the patching to a more optimal implementation (not using stop machine)
-expand infrastructure to add module support
-other use cases?
-mark the trace_label section with 'cold' attributes
-add support for other arches (besides x86)
thanks,
-Jason
arch/x86/kernel/ftrace.c | 36 +++++
arch/x86/kernel/test_nx.c | 3 +
include/asm-generic/vmlinux.lds.h | 11 ++
include/linux/ftrace.h | 3 +
include/linux/jump_label.h | 45 ++++++
include/linux/tracepoint.h | 50 +++++---
kernel/Makefile | 2 +-
kernel/jump_label.c | 271 +++++++++++++++++++++++++++++++++++++
kernel/trace/trace_workqueue.c | 2 +
kernel/tracepoint.c | 134 ++++++++++++++++++
10 files changed, 540 insertions(+), 17 deletions(-)
create mode 100644 include/linux/jump_label.h
create mode 100644 kernel/jump_label.c
introduce basic infrastructure for jump patching:
-STATIC_JUMP_IF() macro
-jump table infrastructure
-jump/nop patching in the ftrace layer
Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/kernel/ftrace.c | 36 ++++++++++
include/asm-generic/vmlinux.lds.h | 11 +++
include/linux/ftrace.h | 3 +
include/linux/jump_label.h | 45 ++++++++++++
kernel/Makefile | 2 +-
kernel/jump_label.c | 138 +++++++++++++++++++++++++++++++++++++
6 files changed, 234 insertions(+), 1 deletions(-)
create mode 100644 include/linux/jump_label.h
create mode 100644 kernel/jump_label.c
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 9dbb527..0907b8c 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -67,6 +67,20 @@ static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
return calc.code;
}
+static unsigned char *ftrace_jump_replace(unsigned long ip, unsigned long addr)
+{
+ static union ftrace_code_union calc;
+
+ calc.e8 = 0xe9;
+ calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
+
+ /*
+ * No locking needed, this must be called via kstop_machine
+ * which in essence is like running on a uniprocessor machine.
+ */
+ return calc.code;
+}
+
/*
* Modifying code must take extra care. On an SMP machine, if
* the code being modified is also being executed on another CPU
@@ -278,6 +292,28 @@ int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
return ftrace_modify_code(rec->ip, old, new);
}
+int ftrace_make_jump(struct dyn_ftrace *rec, unsigned long addr)
+{
+ unsigned char *new, *old;
+ unsigned long ip = rec->ip;
+
+ old = ftrace_nop_replace();
+ new = ftrace_jump_replace(ip, addr);
+
+ return ftrace_modify_code(rec->ip, old, new);
+}
+
+int ftrace_make_jump_nop(struct dyn_ftrace *rec, unsigned long addr)
+{
+ unsigned char *new, *old;
+ unsigned long ip = rec->ip;
+
+ old = ftrace_jump_replace(ip, addr);
+ new = ftrace_nop_replace();
+
+ return ftrace_modify_code(rec->ip, old, new);
+}
+
int ftrace_update_ftrace_func(ftrace_func_t func)
{
unsigned long ip = (unsigned long)(&ftrace_call);
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index a549465..d789646 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -212,6 +212,7 @@
*(__vermagic) /* Kernel version magic */ \
*(__markers_strings) /* Markers: strings */ \
*(__tracepoints_strings)/* Tracepoints: strings */ \
+ *(__jump_strings)/* Jump: strings */ \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
@@ -219,6 +220,8 @@
} \
\
BUG_TABLE \
+ JUMP_TABLE \
+ \
\
/* PCI quirks */ \
.pci_fixup : AT(ADDR(.pci_fixup) - LOAD_OFFSET) { \
@@ -563,6 +566,14 @@
#define BUG_TABLE
#endif
+#define JUMP_TABLE \
+ . = ALIGN(8); \
+ __jump_table : AT(ADDR(__jump_table) - LOAD_OFFSET) { \
+ VMLINUX_SYMBOL(__start___jump_table) = .; \
+ *(__jump_table) \
+ VMLINUX_SYMBOL(__stop___jump_table) = .; \
+ }
+
#ifdef CONFIG_PM_TRACE
#define TRACEDATA \
. = ALIGN(4); \
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index dc3b132..cf3d995 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -227,6 +227,9 @@ extern int ftrace_make_nop(struct module *mod,
* Any other value will be considered a failure.
*/
extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
+extern int ftrace_make_jump(struct dyn_ftrace *rec, unsigned long addr);
+extern int ftrace_make_jump_nop(struct dyn_ftrace *rec, unsigned long addr);
+
/* May be defined in arch */
extern int ftrace_arch_read_dyn_info(char *buf, int size);
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
new file mode 100644
index 0000000..1b80bbe
--- /dev/null
+++ b/include/linux/jump_label.h
@@ -0,0 +1,45 @@
+#ifndef _LINUX_JUMP_LABEL_H
+#define _LINUX_JUMP_LABEL_H
+
+#include <asm/nops.h>
+
+/* this will change to a compiler dependency that supports 'asm goto' */
+#define HAVE_JUMP_LABEL
+
+#ifdef HAVE_JUMP_LABEL
+
+#define JUMP_LABEL_NAME(tag) \
+ const char __sjstrtab_##tag[] \
+ __used __attribute__((section("__jump_strings"))) = #tag;
+
+#define JUMP_LABEL_IF(tag, label, cond) \
+ asm goto ("1:" /* 5-byte insn */ \
+ P6_NOP5 \
+ ".pushsection __jump_table, \"a\" \n\t" \
+ _ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
+ ".popsection \n\t" \
+ : : "i" (__sjstrtab_##tag) : : label)
+
+int run_make_nop(char *name);
+int run_make_jump(char *name);
+
+#else
+
+#define JUMP_LABEL_NAME(tag)
+#define JUMP_LABEL_IF(tag, label, cond) \
+ if (unlikely(cond)) \
+ goto label;
+
+static inline int run_make_nop(char *name)
+{
+ return 0;
+}
+
+static inline int run_make_jump(char *name)
+{
+ return 0;
+}
+
+#endif
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index ef1011b..d29ae98 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
- async.o
+ async.o jump_label.o
obj-y += groups.o
ifdef CONFIG_FUNCTION_TRACER
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
new file mode 100644
index 0000000..f6be1eb
--- /dev/null
+++ b/kernel/jump_label.c
@@ -0,0 +1,138 @@
+#include <linux/init.h>
+#include <linux/debugfs.h>
+#include <linux/jump_label.h>
+#include <linux/stop_machine.h>
+#include <linux/ftrace.h>
+#include <linux/uaccess.h>
+
+#include <asm/cacheflush.h>
+
+JUMP_LABEL_NAME(trace);
+JUMP_LABEL_NAME(trace2);
+
+static int jump_enabled;
+static int jump_enabled2;
+
+#ifdef HAVE_JUMP_LABEL
+
+extern int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr);
+
+struct jump_entry {
+ unsigned long code;
+ unsigned long target;
+ char *name;
+};
+
+extern struct jump_entry __start___jump_table[];
+extern struct jump_entry __stop___jump_table[];
+
+struct jump_entry *find_jump_entry(char *name)
+{
+
+ struct jump_entry *iter;
+
+ for (iter = __start___jump_table; iter < __stop___jump_table; iter++) {
+ if (!strcmp(name, iter->name)) {
+ printk("find_jump_entry matched: %s\n", iter->name);
+ return iter;
+ }
+ }
+ return NULL;
+}
+
+/* hard coded for testing */
+static int enable_jump(void *ptr)
+{
+ struct dyn_ftrace rec;
+ struct jump_entry *jentry;
+ unsigned long code, target;
+ int ret;
+
+ jentry = ((struct jump_entry *)ptr);
+ code = jentry->code;
+ target = jentry->target;
+ rec.ip = code;
+ ret = ftrace_make_jump(&rec, target);
+
+ return 0;
+}
+
+/* hard coded for testing */
+static int enable_nop(void *ptr)
+{
+ struct dyn_ftrace rec;
+ struct jump_entry *jentry;
+ unsigned long code, target;
+ int ret;
+
+ jentry = ((struct jump_entry *)ptr);
+ code = jentry->code;
+ target = jentry->target;
+ rec.ip = code;
+ ret = ftrace_make_jump_nop(&rec, target);
+
+ return 0;
+}
+
+int run_make_jump(char *name)
+{
+ int ret;
+ struct jump_entry *jentry;
+
+ jentry = find_jump_entry(name);
+ if (!jentry)
+ return -ENOENT;
+
+ ret = ftrace_arch_code_modify_prepare();
+ WARN_ON(ret);
+ if (ret)
+ return -1;
+
+ stop_machine(enable_jump, (void *)jentry, NULL);
+
+ ret = ftrace_arch_code_modify_post_process();
+ WARN_ON(ret);
+
+ return 0;
+}
+
+int run_make_nop(char *name)
+{
+ int ret;
+ struct jump_entry *jentry;
+
+ jentry = find_jump_entry(name);
+ if (!jentry)
+ return -ENOENT;
+
+ ret = ftrace_arch_code_modify_prepare();
+ WARN_ON(ret);
+ if (ret)
+ return -1;
+
+ stop_machine(enable_nop, (void *)jentry, NULL);
+
+ ret = ftrace_arch_code_modify_post_process();
+ WARN_ON(ret);
+
+ return 0;
+}
+
+#endif
+
+static int __jump_label_init(void)
+{
+ struct jump_entry *iter;
+
+
+#ifdef HAVE_JUMP_LABEL
+ printk("__start___jump_table is: %p\n", __start___jump_table);
+ printk("__stop___jump_table is: %p\n", __stop___jump_table);
+ for(iter = __start___jump_table; iter < __stop___jump_table; iter++)
+ printk("jump label: code: %p, target: %p, name: %s\n", iter->code, iter->target, iter->name);
+#endif
+
+ return 0;
+}
+late_initcall(__jump_label_init);
+
--
1.6.2.5
Example cases to see how jump patching can be used:
echo '1' into <debugfs>/jump/enabled to see a 'doing tracing' printk
echo '0' into <debugfs>/jump/enabled to see a 'not doing tracing' printk
The codepaths are updated using code patching.
Signed-off-by: Jason Baron <[email protected]>
---
kernel/jump_label.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 133 insertions(+), 0 deletions(-)
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index f6be1eb..0bc0b2d 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -120,10 +120,143 @@ int run_make_nop(char *name)
#endif
+static ssize_t read_enabled_file_jump(struct file *file,
+ char __user *user_buf, size_t count, loff_t *ppos)
+{
+ char buf[3];
+ ssize_t val;
+
+ if (jump_enabled)
+ buf[0] = '1';
+ else
+ buf[0] = '0';
+ buf[1] = '\n';
+ buf[2] = 0x00;
+
+ val = simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+ JUMP_LABEL_IF(trace, trace_label, jump_enabled);
+ printk("not doing tracing\n");
+if (0) {
+trace_label:
+ printk("doing tracing: %d\n", file);
+}
+ printk("val is: %d\n", val);
+ return val;
+}
+
+static ssize_t read_enabled_file_jump2(struct file *file,
+ char __user *user_buf, size_t count, loff_t *ppos)
+{
+ char buf[3];
+ ssize_t val;
+
+ if (jump_enabled2)
+ buf[0] = '1';
+ else
+ buf[0] = '0';
+ buf[1] = '\n';
+ buf[2] = 0x00;
+
+ val = simple_read_from_buffer(user_buf, count, ppos, buf, 2);
+ JUMP_LABEL_IF(trace2, trace_label, jump_enabled2);
+ printk("not doing tracing 2\n");
+if (0) {
+trace_label:
+ printk("doing tracing 2: %d\n", file);
+}
+ printk("val is: %d\n", val);
+ return val;
+}
+
+static ssize_t write_enabled_file_jump(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ char buf[32];
+ int buf_size;
+
+ buf_size = min(count, (sizeof(buf)-1));
+ if (copy_from_user(buf, user_buf, buf_size))
+ return -EFAULT;
+
+ switch (buf[0]) {
+ case 'y':
+ case 'Y':
+ case '1':
+ jump_enabled = 1;
+ run_make_jump("trace");
+ break;
+ case 'n':
+ case 'N':
+ case '0':
+ jump_enabled = 0;
+ run_make_nop("trace");
+ break;
+ }
+
+ return count;
+}
+
+static ssize_t write_enabled_file_jump2(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ char buf[32];
+ int buf_size;
+
+ buf_size = min(count, (sizeof(buf)-1));
+ if (copy_from_user(buf, user_buf, buf_size))
+ return -EFAULT;
+
+ switch (buf[0]) {
+ case 'y':
+ case 'Y':
+ case '1':
+ jump_enabled2 = 1;
+ run_make_jump("trace2");
+ break;
+ case 'n':
+ case 'N':
+ case '0':
+ jump_enabled2 = 0;
+ run_make_nop("trace2");
+ break;
+ }
+
+ return count;
+}
+
+static struct file_operations fops_jump = {
+ .read = read_enabled_file_jump,
+ .write = write_enabled_file_jump,
+};
+
+static struct file_operations fops_jump2 = {
+ .read = read_enabled_file_jump2,
+ .write = write_enabled_file_jump2,
+};
+
static int __jump_label_init(void)
{
+ struct dentry *dir, *file;
+ unsigned int value = 1;
struct jump_entry *iter;
+ dir = debugfs_create_dir("jump", NULL);
+ if (!dir)
+ return -ENOMEM;
+
+ file = debugfs_create_file("enabled", 0600, dir,
+ &value, &fops_jump);
+ if (!file) {
+ debugfs_remove(dir);
+ return -ENOMEM;
+ }
+
+ file = debugfs_create_file("enabled2", 0600, dir,
+ &value, &fops_jump2);
+ if (!file) {
+ debugfs_remove(dir);
+ return -ENOMEM;
+ }
#ifdef HAVE_JUMP_LABEL
printk("__start___jump_table is: %p\n", __start___jump_table);
--
1.6.2.5
Convert tracepoints to use the new jump patching infrastructure. See
how easy it is!
Also, notice there is a hunk that comments out the workqueue tracepoints. Without
it I hit a panic when doing the code patching. I think this is
probably due to an interaction between stop_machine() using workqueues to do
the actual patching and the workqueue code itself.
Signed-off-by: Jason Baron <[email protected]>
---
arch/x86/kernel/test_nx.c | 3 +++
include/linux/tracepoint.h | 38 ++++++++++++++++++++++----------------
kernel/trace/trace_workqueue.c | 2 ++
kernel/tracepoint.c | 9 +++++++++
4 files changed, 36 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/test_nx.c b/arch/x86/kernel/test_nx.c
index 787a5e4..3dab463 100644
--- a/arch/x86/kernel/test_nx.c
+++ b/arch/x86/kernel/test_nx.c
@@ -15,6 +15,9 @@
#include <asm/uaccess.h>
#include <asm/asm.h>
+#include <linux/jump_label.h>
+
+JUMP_LABEL_NAME(kmalloc);
extern int rodata_test_data;
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 63a3f7a..fdd2d8b 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -16,6 +16,7 @@
#include <linux/types.h>
#include <linux/rcupdate.h>
+#include <linux/jump_label.h>
struct module;
struct tracepoint;
@@ -63,21 +64,25 @@ struct tracepoint {
* not add unwanted padding between the beginning of the section and the
* structure. Force alignment to the same alignment as the section start.
*/
-#define DECLARE_TRACE(name, proto, args) \
- extern struct tracepoint __tracepoint_##name; \
- static inline void trace_##name(proto) \
- { \
- if (unlikely(__tracepoint_##name.state)) \
- __DO_TRACE(&__tracepoint_##name, \
- TP_PROTO(proto), TP_ARGS(args)); \
- } \
- static inline int register_trace_##name(void (*probe)(proto)) \
- { \
- return tracepoint_probe_register(#name, (void *)probe); \
- } \
- static inline int unregister_trace_##name(void (*probe)(proto)) \
- { \
- return tracepoint_probe_unregister(#name, (void *)probe);\
+#define DECLARE_TRACE(name, proto, args) \
+ extern struct tracepoint __tracepoint_##name; \
+ extern const char __sjstrtab_##name[]; \
+ static inline void trace_##name(proto) \
+ { \
+ JUMP_LABEL_IF(name, trace_label, __tracepoint_##name.state); \
+ if (0) { \
+trace_label: \
+ __DO_TRACE(&__tracepoint_##name, \
+ TP_PROTO(proto), TP_ARGS(args)); \
+ } \
+ } \
+ static inline int register_trace_##name(void (*probe)(proto)) \
+ { \
+ return tracepoint_probe_register(#name, (void *)probe); \
+ } \
+ static inline int unregister_trace_##name(void (*probe)(proto)) \
+ { \
+ return tracepoint_probe_unregister(#name, (void *)probe); \
}
@@ -86,7 +91,8 @@ struct tracepoint {
__attribute__((section("__tracepoints_strings"))) = #name; \
struct tracepoint __tracepoint_##name \
__attribute__((section("__tracepoints"), aligned(32))) = \
- { __tpstrtab_##name, 0, reg, unreg, NULL }
+ { __tpstrtab_##name, 0, reg, unreg, NULL }; \
+ JUMP_LABEL_NAME(name);
#define DEFINE_TRACE(name) \
DEFINE_TRACE_FN(name, NULL, NULL);
diff --git a/kernel/trace/trace_workqueue.c b/kernel/trace/trace_workqueue.c
index 40cafb0..bc38428 100644
--- a/kernel/trace/trace_workqueue.c
+++ b/kernel/trace/trace_workqueue.c
@@ -258,6 +258,7 @@ int __init trace_workqueue_early_init(void)
{
int ret, cpu;
+ /*
ret = register_trace_workqueue_insertion(probe_workqueue_insertion);
if (ret)
goto out;
@@ -273,6 +274,7 @@ int __init trace_workqueue_early_init(void)
ret = register_trace_workqueue_destruction(probe_workqueue_destruction);
if (ret)
goto no_creation;
+ */
for_each_possible_cpu(cpu) {
spin_lock_init(&workqueue_cpu_stat(cpu)->lock);
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 9489a0a..1b4acc0 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -25,6 +25,7 @@
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/sched.h>
+#include <linux/jump_label.h>
extern struct tracepoint __start___tracepoints[];
extern struct tracepoint __stop___tracepoints[];
@@ -256,6 +257,11 @@ static void set_tracepoint(struct tracepoint_entry **entry,
* is used.
*/
rcu_assign_pointer(elem->funcs, (*entry)->funcs);
+ if (!elem->state && active) {
+ printk("set_tracepoint: call run_make_jump: %s\n", elem->name);
+ run_make_jump(elem->name);
+ } else if (elem->state && !active)
+ run_make_nop(elem->name);
elem->state = active;
}
@@ -270,6 +276,9 @@ static void disable_tracepoint(struct tracepoint *elem)
if (elem->unregfunc && elem->state)
elem->unregfunc();
+ if (elem->state)
+ run_make_nop(elem->name);
+
elem->state = 0;
rcu_assign_pointer(elem->funcs, NULL);
}
--
1.6.2.5
Not intended to be merged; this is just a harness for micro-benchmarking tracepoint performance.
Signed-off-by: Jason Baron <[email protected]>
---
include/linux/tracepoint.h | 12 ++++
kernel/tracepoint.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 137 insertions(+), 0 deletions(-)
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index fdd2d8b..da657bf 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -27,6 +27,8 @@ struct tracepoint {
void (*regfunc)(void);
void (*unregfunc)(void);
void **funcs;
+ unsigned long cycle_total;
+ unsigned long count;
} __attribute__((aligned(32))); /*
* Aligned on 32 bytes because it is
* globally visible and gcc happily
@@ -69,12 +71,22 @@ struct tracepoint {
extern const char __sjstrtab_##name[]; \
static inline void trace_##name(proto) \
{ \
+ unsigned long profile_flags; \
+ u64 t1, t2; \
+ local_irq_save(profile_flags); \
+ preempt_disable(); \
+ t1 = get_cycles(); \
JUMP_LABEL_IF(name, trace_label, __tracepoint_##name.state); \
if (0) { \
trace_label: \
__DO_TRACE(&__tracepoint_##name, \
TP_PROTO(proto), TP_ARGS(args)); \
} \
+ t2 = get_cycles(); \
+ local_irq_restore(profile_flags); \
+ preempt_enable(); \
+ __tracepoint_##name.count += 1; \
+ __tracepoint_##name.cycle_total += (t2 - t1); \
} \
static inline int register_trace_##name(void (*probe)(proto)) \
{ \
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 1b4acc0..4be3991 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -26,6 +26,9 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/jump_label.h>
+#include <linux/seq_file.h>
+#include <linux/fs.h>
+#include <linux/debugfs.h>
extern struct tracepoint __start___tracepoints[];
extern struct tracepoint __stop___tracepoints[];
@@ -563,6 +566,78 @@ void tracepoint_iter_reset(struct tracepoint_iter *iter)
}
EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
+static void *tracepoint_seq_next(struct seq_file *seqf, void *v, loff_t *pos)
+{
+ struct tracepoint_iter *iter;
+ iter = seqf->private;
+
+ tracepoint_iter_next(iter);
+ if (iter->tracepoint) {
+ (*pos)++;
+ return iter->tracepoint;
+ }
+
+ return NULL;
+}
+
+static void tracepoint_seq_stop(struct seq_file *seqf, void *v)
+{
+ struct tracepoint_iter *iter = seqf->private;
+
+ /* stop is called even after start failed :-( */
+ if (iter)
+ kfree(iter);
+
+}
+
+static void *tracepoint_seq_start(struct seq_file *seqf, loff_t *pos)
+{
+ struct tracepoint_iter *iter;
+ loff_t skip = *pos;
+
+ iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return ERR_PTR(-ENOMEM);
+ seqf->private = iter;
+
+ tracepoint_iter_reset(iter);
+ tracepoint_iter_start(iter);
+ do {
+ tracepoint_iter_next(iter);
+ if (!iter->tracepoint)
+ return NULL;
+ } while (skip--);
+
+ return iter->tracepoint;
+}
+
+static int show_tracepoint(struct seq_file *seqf, void *v)
+{
+ struct tracepoint *tp = v;
+
+ seq_printf(seqf, "%s %lu %lu %lu\n", tp->name, tp->cycle_total, tp->count, tp->count ? tp->cycle_total / tp->count : 0 );
+ return 0;
+}
+
+static struct seq_operations tracepoints_seq_ops = {
+ .start = tracepoint_seq_start,
+ .next = tracepoint_seq_next,
+ .stop = tracepoint_seq_stop,
+ .show = show_tracepoint
+};
+
+static int tracepoints_open(struct inode *inode, struct file *filp)
+{
+ return seq_open(filp, &tracepoints_seq_ops);
+}
+
+static struct file_operations debugfs_tracepoint_operations = {
+ .open = tracepoints_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
#ifdef CONFIG_MODULES
int tracepoint_module_notify(struct notifier_block *self,
@@ -585,8 +660,57 @@ struct notifier_block tracepoint_module_nb = {
.priority = 0,
};
+static ssize_t write_enabled_file_bool(struct file *file,
+ const char __user *user_buf, size_t count, loff_t *ppos)
+{
+ struct tracepoint_iter *iter;
+
+ iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+ if (!iter)
+ return -ENOMEM;
+
+ tracepoint_iter_reset(iter);
+ tracepoint_iter_start(iter);
+ do {
+ tracepoint_iter_next(iter);
+ if (!iter->tracepoint)
+ break;
+ iter->tracepoint->count = 0;
+ iter->tracepoint->cycle_total = 0;
+ } while (1);
+
+ kfree(iter);
+ return count;
+}
+
+static struct file_operations fops_kp = {
+ .write = write_enabled_file_bool,
+};
+
+
+
static int init_tracepoints(void)
{
+ struct dentry *dir, *file;
+
+ dir = debugfs_create_dir("tracepoints", NULL);
+ if (!dir)
+ return -ENOMEM;
+
+ file = debugfs_create_file("list", 0444, dir, NULL,
+ &debugfs_tracepoint_operations);
+ if (!file) {
+ debugfs_remove(dir);
+ return -ENOMEM;
+ }
+
+ file = debugfs_create_file("clear", 0600, dir,
+ NULL, &fops_kp);
+ if (!file) {
+ debugfs_remove(dir);
+ return -ENOMEM;
+ }
+
return register_module_notifier(&tracepoint_module_nb);
}
__initcall(init_tracepoints);
@@ -630,3 +754,4 @@ void syscall_unregfunc(void)
}
}
#endif
+
--
1.6.2.5
On Thu, 2009-09-03 at 16:25 -0400, Jason Baron wrote:
> hi,
>
> Problem:
>
> Currently, tracepoints are implemented using a conditional. The conditional
> check requires checking a global variable for each tracepoint. Although,
> the overhead of this check is small, it increases under memory pressure. As we
> increase the number of tracepoints in the kernel this may become more of an
> issue. In addition, tracepoints are often dormant (disabled), and provide no
> direct kernel functionality. Thus, it is highly desirable to reduce their
> impact as much as possible. Mathieu Desnoyers has already suggested a number of
> requirements for a solution to this issue.
Could you run your patches through scripts/checkpatch.pl and clean up any
errors? It looks like you have some style issues in a few of the
patches.
Daniel
* Jason Baron <[email protected]> wrote:
> hi,
>
> Problem:
>
> Currently, tracepoints are implemented using a conditional. The
> conditional check requires checking a global variable for each
> tracepoint. Although, the overhead of this check is small, it
> increases under memory pressure. As we increase the number of
> tracepoints in the kernel this may become more of an issue. In
> addition, tracepoints are often dormant (disabled), and provide no
> direct kernel functionality. Thus, it is highly desirable to
> reduce their impact as much as possible. Mathieu Desnoyers has
> already suggested a number of requirements for a solution to this
> issue.
>
> Solution:
>
> In discussing this problem with Roland McGrath and Richard
> Henderson, we came up with a new 'asm goto' statement that allows
> branching to a label. Thus, this patch set introduces a
> 'STATIC_JUMP_IF()' macro as follows:
>
> #ifdef HAVE_STATIC_JUMP
>
> #define STATIC_JUMP_IF(tag, label, cond) \
> asm goto ("1:" /* 5-byte insn */ \
> P6_NOP5 \
> ".pushsection __jump_table, \"a\" \n\t" \
> _ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
> ".popsection \n\t" \
> : : "i" (__sjstrtab_##tag) : : label)
>
> #else
>
> #define STATIC_JUMP_IF(tag, label, cond) \
> if (unlikely(cond)) \
> goto label;
>
> #endif /* !HAVE_STATIC_JUMP */
>
>
> which can be used as:
>
> STATIC_JUMP_IF(trace, trace_label, jump_enabled);
> printk("not doing tracing\n");
> if (0) {
> trace_label:
> printk("doing tracing: %d\n", file);
> }
>
> ---------------------------------------
>
> Thus, if 'HAVE_STATIC_JUMP' is defined (which will depend
> ultimately on the existence of 'asm goto' in the compiler
> version), we simply have a no-op followed by a jump around the
> dormant (disabled) tracing code. The 'STATIC_JUMP_IF()' macro,
> produces a 'jump_table' which has the following format:
>
> [instruction address] [jump target] [tracepoint name]
>
> Thus, to enable a tracepoint, we simply patch the 'instruction
> address' with a jump to the 'jump target'. The current
> implementation is using ftrace infrastructure to accomplish the
> patching, which uses 'stop_machine'. In subsequent versions, we
> will update the mechanism to use more efficient code patching
> techniques.
>
> I've tested the performance of this using 'get_cycles()' calls
> around the tracepoint call sites. For an Intel Core 2 Quad cpu (in
> cycles, averages):
>
> idle after tbench run
> ---- ----------------
> old code 32 88
> new code 2 4
>
>
> The performance improvement can be reproduced very reliably (using
> patch 4 in this series) on both Intel and AMD hardware.
>
> In terms of code analysis the current code for the disabled case
> is a 'cmpl' followed by a 'je' around the tracepoint code. so:
>
> cmpl - 83 3d 0e 77 87 00 00 - 7 bytes
> je - 74 3e - 2 bytes
>
> total of 9 instruction bytes.
>
> The new code is a 'nopl' followed by a 'jmp'. Thus:
>
> nopl - 0f 1f 44 00 00 - 5 bytes
> jmp - eb 3e - 2 bytes
>
> total of 7 instruction bytes.
>
> So, the new code also accounts for 2 less bytes in the instruction
> cache per tracepoint.
>
> here's a link to the gcc thread introducing this feature:
>
> http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
This looks really interesting and desired. Once GCC adds this (or an
equivalent) feature, i'd love to have your optimization in the
kernel.
> Todo:
>
> - convert the patching to a more optimal implementation (not using stop machine)
> - expand infrastructure for modules
> - other use cases?
[...]
Other use cases might be kernel features that are turned on/off via
some slowpath. For example SLAB statistics could be patched in/out
using this method. Or scheduler statistics.
Basically everything that is optional and touches some very hot
codepath would be eligible - not just tracepoints.
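For instance, a hypothetical statistics hook could be guarded like this, reusing
the STATIC_JUMP_IF() pattern from the cover letter (the tag, flag, and counter
names are invented, and the corresponding name registration is omitted):
extern int slab_stats_enabled;		/* invented flag */

static inline void slab_stat_inc(unsigned long *counter)
{
	STATIC_JUMP_IF(slab_stats, stats_label, slab_stats_enabled);
	if (0) {
stats_label:
		(*counter)++;	/* reached only once the nop is patched to a jmp */
	}
}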
Ingo
> This looks really interesting and desired. Once GCC adds this (or an
> equivalent) feature, i'd love to have your optimization in the
> kernel.
We expect it will be in gcc 4.5 and backported to Fedora's 4.4 pretty soon.
But it's not finalized yet, so we are only doing proof-of-concept so far.
> Basically everything that is optional and touches some very hot
> codepath would be eligible - not just tracepoints.
Sure. Any place you have a load+conditional-jump where that is too costly,
we can make it a nop that costs only a cycle or so. We probably want to
think a little harder about the nicest macro setup we can do to make
employing this trivial in the source.
Thanks,
Roland
* Jason Baron ([email protected]) wrote:
> hi,
>
> Problem:
>
> Currently, tracepoints are implemented using a conditional. The conditional
> check requires checking a global variable for each tracepoint. Although,
> the overhead of this check is small, it increases under memory pressure. As we
> increase the number of tracepoints in the kernel this may become more of an
> issue. In addition, tracepoints are often dormant (disabled), and provide no
> direct kernel functionality. Thus, it is highly desirable to reduce their
> impact as much as possible. Mathieu Desnoyers has already suggested a number of
> requirements for a solution to this issue.
>
Hi Jason,
Thanks for working on this. It's a useful idea that's just been sitting
there for too long now. Please see comments below,
> Solution:
>
> In discussing this problem with Roland McGrath and Richard Henderson, we came
> up with a new 'asm goto' statement that allows branching to a label. Thus, this
> patch set introduces a 'STATIC_JUMP_IF()' macro as follows:
>
> #ifdef HAVE_STATIC_JUMP
>
> #define STATIC_JUMP_IF(tag, label, cond) \
> asm goto ("1:" /* 5-byte insn */ \
> P6_NOP5 \
Hrm, be careful there. P6_NOP5 is not always a single instruction. If
you are preempted in the middle of it, bad things could happen, even
with stop_machine, if you iret in the middle of the new jump
instruction. It could cause an illegal instruction fault. You should use
an atomic nop5. I think the function tracer already does so, since I
told Steven about this exact issue.
> ".pushsection __jump_table, \"a\" \n\t" \
> _ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
> ".popsection \n\t" \
> : : "i" (__sjstrtab_##tag) : : label)
>
Supporting multiple labels to create a real jump table would be a
nice-to-have future enhancement too. This could be called
STATIC_JUMP_TABLE(). Actually, the STATIC_JUMP_IF could probably be
implemented in terms of STATIC_JUMP_TABLE.
> #else
>
> #define STATIC_JUMP_IF(tag, label, cond) \
> if (unlikely(cond)) \
> goto label;
>
Hrm, however, it's not clear to me what the STATIC_JUMP_TABLE() fallback
would look like. In the case of STATIC_JUMP_IF, it's a simple if (),
which makes support of compilers lacking static jump support easy. We
could probably use a switch statement to replace the STATIC_JUMP_TABLE
though.
> #endif /* !HAVE_STATIC_JUMP */
>
>
> which can be used as:
>
> STATIC_JUMP_IF(trace, trace_label, jump_enabled);
> printk("not doing tracing\n");
> if (0) {
> trace_label:
> printk("doing tracing: %d\n", file);
> }
>
Hrm. Is there any way to make this a bit prettier? Given modifications
are made to gcc anyway...
Maybe:
static_jump_if (trace, jump_enabled) {
...
} else {
...
}
And for the jump table:
static_jump_table (trace, jump_select) {
case 0: ...
break;
case 1: ...
break;
default:
...
}
> ---------------------------------------
>
> Thus, if 'HAVE_STATIC_JUMP' is defined (which will depend ultimately on the
> existence of 'asm goto' in the compiler version), we simply have a no-op
> followed by a jump around the dormant (disabled) tracing code.
Hrm, why don't we collapse that into a single 5-byte jump instruction
instead?
> The
> 'STATIC_JUMP_IF()' macro, produces a 'jump_table' which has the following
> format:
>
> [instruction address] [jump target] [tracepoint name]
>
> Thus, to enable a tracepoint, we simply patch the 'instruction address' with
> a jump to the 'jump target'. The current implementation is using ftrace
> infrastructure to accomplish the patching, which uses 'stop_machine'. In
> subsequent versions, we will update the mechanism to use more efficient
> code patching techniques.
>
> I've tested the performance of this using 'get_cycles()' calls around the
> tracepoint call sites. For an Intel Core 2 Quad cpu (in cycles, averages):
>
> idle after tbench run
> ---- ----------------
> old code 32 88
> new code 2 4
>
>
> The performance improvement can be reproduced very reliably (using patch 4
> in this series) on both Intel and AMD hardware.
>
> In terms of code analysis the current code for the disabled case is a 'cmpl'
> followed by a 'je' around the tracepoint code. so:
>
> cmpl - 83 3d 0e 77 87 00 00 - 7 bytes
> je - 74 3e - 2 bytes
>
> total of 9 instruction bytes.
>
> The new code is a 'nopl' followed by a 'jmp'. Thus:
>
> nopl - 0f 1f 44 00 00 - 5 bytes
> jmp - eb 3e - 2 bytes
>
> total of 7 instruction bytes.
>
> So, the new code also accounts for 2 less bytes in the instruction cache per tracepoint.
>
With a single 5-byte jump, this would be 5 bytes total, which is 4 bytes
less than the current 9.
Thanks,
Mathieu
> here's a link to the gcc thread introducing this feature:
>
> http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
>
> Todo:
>
> -convert the patching to a more optimal implementation (not using stop machine)
> -expand infrastructure for modules
> -other use cases?
> -mark the trace_label section with 'cold' attributes
> -add module support
> -add support for other arches (besides x86)
>
> thanks,
>
> -Jason
>
> arch/x86/kernel/ftrace.c | 36 +++++
> arch/x86/kernel/test_nx.c | 3 +
> include/asm-generic/vmlinux.lds.h | 11 ++
> include/linux/ftrace.h | 3 +
> include/linux/jump_label.h | 45 ++++++
> include/linux/tracepoint.h | 50 +++++---
> kernel/Makefile | 2 +-
> kernel/jump_label.c | 271 +++++++++++++++++++++++++++++++++++++
> kernel/trace/trace_workqueue.c | 2 +
> kernel/tracepoint.c | 134 ++++++++++++++++++
> 10 files changed, 540 insertions(+), 17 deletions(-)
> create mode 100644 include/linux/jump_label.h
> create mode 100644 kernel/jump_label.c
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers ([email protected]) wrote:
> * Jason Baron ([email protected]) wrote:
[...]
> > Solution:
> >
> > In discussing this problem with Roland McGrath and Richard Henderson, we came
> > up with a new 'asm goto' statement that allows branching to a label. Thus, this
> > patch set introduces a 'STATIC_JUMP_IF()' macro as follows:
> >
> > #ifdef HAVE_STATIC_JUMP
> >
> > #define STATIC_JUMP_IF(tag, label, cond) \
> > asm goto ("1:" /* 5-byte insn */ \
> > P6_NOP5 \
>
> Hrm, be careful there. P6_NOP5 is not always a single instruction. If
> you are preempted in the middle of it, bad things could happen, even
> > with stop_machine, if you iret in the middle of the new jump
> instruction. It could cause an illegal instruction fault. You should use
> an atomic nop5. I think the function tracer already does, since I
> told Steven about this exact issue.
>
Just to clarify this statement:
P6_NOP5 happens to be an atomic nop, but nothing states this requirement
in arch/x86/include/asm/nops.h. Other 5-byte nops are defined as
multiple instructions (e.g. 2-byte + 3-byte nops). So I recommend
creating a family of ATOMIC_P6_NOP5 (and other ATOMIC_*_NOP5) defines to
document this atomicity requirement.
Ftrace could probably handle this more gracefully than it does at the
moment. It basically assumes that P6_NOP5 is atomic, and falls back on a
5-byte jmp if it detects that P6_NOP5 faults.
That's coherent with the
"TODO: check the cpuid to determine the best nop."
present in x86 ftrace.c.
So, at the very least, if we rely on nops.h having a single-instruction
P6_NOP5 5-byte nop, a comment to that effect should be added to nops.h.
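For example, a sketch of the sort of annotation that could go into
arch/x86/include/asm/nops.h (the ATOMIC_* naming is only a suggestion):
/*
 * Code patching (ftrace, jump labels) needs a 5-byte nop that is a single
 * instruction, so that an interrupt can never return into the middle of a
 * freshly written 5-byte jmp.  P6_NOP5 happens to satisfy this; 5-byte
 * nops built from shorter nops do not.
 */
#define ATOMIC_P6_NOP5	P6_NOP5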
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, Sep 07, 2009 at 11:48:24AM -0400, Mathieu Desnoyers wrote:
> > ".pushsection __jump_table, \"a\" \n\t" \
> > _ASM_PTR "1b, %l[" #label "], %c0 \n\t" \
> > ".popsection \n\t" \
> > : : "i" (__sjstrtab_##tag) : : label)
> >
>
> Supporting multiple labels to create a real jump table would be a
> nice-to-have as future enhancement too. This could be called
> STATIC_JUMP_TABLE(). Actually, the STATIC_JUMP_IF could probably be
> implemented in terms of STATIC_JUMP_TABLE.
>
> > #else
> >
> > #define STATIC_JUMP_IF(tag, label, cond) \
> > if (unlikely(cond)) \
> > goto label;
> >
>
> Hrm, however, it's not clear to me how the STATIC_JUMP_TABLE() fallback
> would look like. In the case of STATIC_JUMP_IF, it's a simple if (),
> which makes support of compilers lacking static jump support easy. We
> could probably use a switch statement to replace the STATIC_JUMP_TABLE
> though.
>
Right - if we have more labels passed into STATIC_JUMP_TABLE(), we can
probably do a switch statement with a 'goto' to the correct label, along
the lines of the sketch below.
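Something like this could serve as the non-'asm goto' fallback; it is only a
sketch with a fixed number of labels, and the real macro would still need to
settle how the label list is passed:
/* falls through when 'sel' has no matching label */
#define STATIC_JUMP_TABLE(tag, sel, label0, label1, label2)	\
	do {							\
		switch (sel) {					\
		case 0: goto label0;				\
		case 1: goto label1;				\
		case 2: goto label2;				\
		}						\
	} while (0)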
> > #endif /* !HAVE_STATIC_JUMP */
> >
> >
> > which can be used as:
> >
> > STATIC_JUMP_IF(trace, trace_label, jump_enabled);
> > printk("not doing tracing\n");
> > if (0) {
> > trace_label:
> > printk("doing tracing: %d\n", file);
> > }
> >
>
> Hrm. Is there any way to make this a bit prettier ? Given modifications
> are made to gcc anyway...
>
> Maybe:
>
> static_jump_if (trace, jump_enabled) {
> ...
>
> } else {
> ...
> }
>
> And for the jump table:
>
> static_jump_table (trace, jump_select) {
> case 0: ...
> break;
> case 1: ...
> break;
> default:
> ...
> }
>
Hmmm... yes, I agree it would be nice if the code looked a little prettier.
However, short of additional gcc changes I'm not sure how to do that.
Perhaps somebody has some better ideas? Note also that
'STATIC_JUMP_IF()' as defined implements both:
if () { }
and:
if () { } else { }
(see the sketch below). I'm not sure the code as proposed is that hideous,
but I would definitely be interested in other opinions. Also, in this case
note that STATIC_JUMP_IF() is only added in one place in the code, and
doesn't affect any of the normal tracepoint API.
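To spell that out, here is a minimal sketch of both forms as they would be
open-coded with the proposed macro; do_tracing(), do_other_work() and
jump_enabled are placeholders, and the 'trace' tag is assumed to have been
registered (JUMP_LABEL_NAME() in the patches):
extern void do_tracing(void);		/* placeholder */
extern void do_other_work(void);	/* placeholder */
extern int jump_enabled;		/* placeholder flag */

static void example(void)
{
	/* if () { } form */
	STATIC_JUMP_IF(trace, t1, jump_enabled);
	if (0) {
t1:
		do_tracing();	/* reached only when the nop is patched to a jmp */
	}

	/* if () { } else { } form */
	STATIC_JUMP_IF(trace, t2, jump_enabled);
	do_other_work();	/* 'else' path: the common, disabled case */
	if (0) {
t2:
		do_tracing();	/* 'if' path: skips do_other_work() when enabled */
	}
}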
>
> > ---------------------------------------
> >
> > Thus, if 'HAVE_STATIC_JUMP' is defined (which will depend ultimately on the
> > existence of 'asm goto' in the compiler version), we simply have a no-op
> > followed by a jump around the dormant (disabled) tracing code.
>
> Hrm, why don't we collapse that into a single 5-bytes jump instruction
> instead ?
That might be nice, but it would require more complex compiler support. I'm
not sure the extra complexity is worth the 2-byte i-cache savings; that is,
I think we're getting the majority of the savings with the proposed
solution.
thanks,
-Jason
On Mon, 2009-09-07 at 13:06 -0400, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers ([email protected]) wrote:
> > * Jason Baron ([email protected]) wrote:
> [...]
> > > Solution:
> > >
> > > In discussing this problem with Roland McGrath and Richard Henderson, we came
> > > up with a new 'asm goto' statement that allows branching to a label. Thus, this
> > > patch set introduces a 'STATIC_JUMP_IF()' macro as follows:
> > >
> > > #ifdef HAVE_STATIC_JUMP
> > >
> > > #define STATIC_JUMP_IF(tag, label, cond) \
> > > asm goto ("1:" /* 5-byte insn */ \
> > > P6_NOP5 \
> >
> > Hrm, be careful there. P6_NOP5 is not always a single instruction. If
> > you are preempted in the middle of it, bad things could happen, even
> > > with stop_machine, if you iret in the middle of the new jump
> > instruction. It could cause an illegal instruction fault. You should use
> > an atomic nop5. I think the function tracer already does, since I
> > told Steven about this exact issue.
> >
>
> Just to clarify this statement:
>
> P6_NOP5 happens to be an atomic nop, but nothing states this requirement
> in arch/x86/include/asm/nops.h. Other 5-bytes nops are defined as
> multiple instructions (e.g. 2 bytes + 3 bytes nops). So I recommend to
> create a family of ATOMIC_P6_NOP5 (and other ATOMIC_*_NOP5 defines) to
> document this atomicity requirement.
Although I agree that we probably should place a comment in that file, I
highly doubt anyone will change that. But who knows?
>
> Ftrace could probably handle this more gracefully than it does at the
> moment. It basically assumes that P6_NOP5 is atomic, and falls back on a
> 5-bytes jmp if it detects that P6_NOP5 faults.
>
> That's coherent with the
> "TODO: check the cpuid to determine the best nop."
>
> present in x86 ftrace.c.
>
> So, at the very least, if we rely on nops.h having a single-instruction
> P6_NOP5 5 bytes nop, a comment to that effect should be added to nops.h.
I might as well go add one.
-- Steve