2011-02-03 15:43:04

by Jiri Olsa

Subject: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

hi,

I recently saw the direct jump probing made for kprobes
and tried to use it inside the trace framework.

The general idea is to patch the function entry with a direct
jump to the trace code, instead of using the pregenerated gcc
profiling code.

I started this just to see whether it would even be possible
to hook the new probing into the current trace code. It
appears it's not that bad. I was able to run the function
and function_graph tracers on x86_64.

For details on direct jumps probe, please check:
http://www.linuxinsight.com/ols2007-djprobe-kernel-probing-with-the-smallest-overhead.html
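
(Illustration only, not extra code in the series - the names rel_insn
and make_rel_insn below are made up, but the layout and the arithmetic
mirror struct __arch_relative_insn / synthesize_relative_insn() in
patch 3/4. The whole patching boils down to writing a 5-byte relative
call/jmp:)

#include <linux/types.h>

/* sketch: same layout as struct __arch_relative_insn in patch 3/4 */
struct rel_insn {
	u8 op;      /* 0xe8 = call rel32, 0xe9 = jmp rel32 */
	s32 raddr;  /* displacement relative to the next instruction */
} __attribute__((packed));

static void make_rel_insn(u8 *buf, void *from, void *to, u8 op)
{
	struct rel_insn *insn = (struct rel_insn *)buf;

	/* rel32 is counted from the byte following the 5-byte insn */
	insn->raddr = (s32)((long)to - ((long)from + 5));
	insn->op = op;
}

The function entry gets such a call into a per-symbol detour buffer,
which saves the flags, calls the trace callback, restores the flags,
drops the return address pushed by the patched call, runs the saved
original instructions and finally jumps back behind the patched entry.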


I realize that using this way to hook the functions has some
drawbacks; from what I can see they are roughly:
- not all functions can be patched
- need to find a way to say which function is safe to patch
- memory consumption for detour buffers and symbol records

but it seems there are some advantages as well:
- trace code could be in a module
- no profiling code is needed
- framepointer can be disabled (framepointer is needed for
generating profile code)


As for the attached implementation, it's mostly a hack (expect bugs);
especially the ftrace/kprobe integration could probably be done better.
It's only for x86_64.

It can be used like this:

- new menu config item is added (function tracer engine),
to choose mcount or ktrace
- new file "ktrace" is added to the tracing dir
- to add symbols to trace run:
echo mutex_unlock > ./ktrace
echo mutex_lock >> ./ktrace
- to display trace symbols:
cat ktrace
- to enable the trace, the usual is needed:
echo function > ./current_tracer
echo function_graph > ./current_tracer
- to remove symbols from trace:
echo nop > ./current_tracer
echo > ./ktrace
- if the function is added while the tracer is running,
the symbol is enabled automatically.
- symbols can only be removed all at once, and only if there's
no tracer running.

I'm not sure how to determine from the kallsyms interface which functions
are safe to patch, so I have omitted patching of all symbols so far.


attached patches:
1/4 - kprobe - ktrace instruction slot cache interface
using kprobe detour buffer allocation, adding interface
to use it from trace framework

2/4 - tracing - adding size parameter to do_ftrace_mod_code
adding size parameter to be able to restore the saved
instructions, which could be longer than relative call

3/4 - ktrace - function trace support
adding ktrace support with function tracer

4/4 - ktrace - function graph trace support
adding function graph support


please let me know what you think, thanks
jirka
---
Makefile | 2 +-
arch/x86/Kconfig | 4 +-
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 50 +++++++
arch/x86/kernel/ftrace.c | 157 +++++++++++----------
arch/x86/kernel/ktrace.c | 256 ++++++++++++++++++++++++++++++++++
include/linux/ftrace.h | 36 +++++-
include/linux/kprobes.h | 8 +
kernel/kprobes.c | 33 +++++
kernel/trace/Kconfig | 28 ++++-
kernel/trace/Makefile | 1 +
kernel/trace/ftrace.c | 21 +++
kernel/trace/ktrace.c | 330 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 1 +
14 files changed, 846 insertions(+), 82 deletions(-)


2011-02-03 15:43:11

by Jiri Olsa

Subject: [PATCH 2/4] tracing - adding size parameter to do_ftrace_mod_code

adding size parameter to be able to restore the saved
instructions, which could be longer than relative call

wbr,
jirka
---
arch/x86/kernel/ftrace.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 382eb29..979ec14 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -124,6 +124,7 @@ static atomic_t nmi_running = ATOMIC_INIT(0);
static int mod_code_status; /* holds return value of text write */
static void *mod_code_ip; /* holds the IP to write to */
static void *mod_code_newcode; /* holds the text to write to the IP */
+static int mod_code_size; /* holds the size of the new code */

static unsigned nmi_wait_count;
static atomic_t nmi_update_count = ATOMIC_INIT(0);
@@ -161,7 +162,7 @@ static void ftrace_mod_code(void)
* to succeed, then they all should.
*/
mod_code_status = probe_kernel_write(mod_code_ip, mod_code_newcode,
- MCOUNT_INSN_SIZE);
+ mod_code_size);

/* if we fail, then kill any new writers */
if (mod_code_status)
@@ -225,7 +226,7 @@ within(unsigned long addr, unsigned long start, unsigned long end)
}

static int
-do_ftrace_mod_code(unsigned long ip, void *new_code)
+do_ftrace_mod_code(unsigned long ip, void *new_code, int size)
{
/*
* On x86_64, kernel text mappings are mapped read-only with
@@ -240,6 +241,7 @@ do_ftrace_mod_code(unsigned long ip, void *new_code)

mod_code_ip = (void *)ip;
mod_code_newcode = new_code;
+ mod_code_size = size;

/* The buffers need to be visible before we let NMIs write them */
smp_mb();
@@ -290,7 +292,7 @@ ftrace_modify_code(unsigned long ip, unsigned char *old_code,
return -EINVAL;

/* replace the text with the new text */
- if (do_ftrace_mod_code(ip, new_code))
+ if (do_ftrace_mod_code(ip, new_code, MCOUNT_INSN_SIZE))
return -EPERM;

sync_core();
@@ -361,7 +363,7 @@ static int ftrace_mod_jmp(unsigned long ip,

*(int *)(&code[1]) = new_offset;

- if (do_ftrace_mod_code(ip, &code))
+ if (do_ftrace_mod_code(ip, &code, MCOUNT_INSN_SIZE))
return -EPERM;

return 0;
--
1.7.1

2011-02-03 15:43:16

by Jiri Olsa

Subject: [PATCH 3/4] ktrace - function trace support

adding ktrace support with function tracer

wbr,
jirka
---
Makefile | 2 +-
arch/x86/Kconfig | 2 +-
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/entry_64.S | 23 +++
arch/x86/kernel/ftrace.c | 153 +++++++++++----------
arch/x86/kernel/ktrace.c | 256 ++++++++++++++++++++++++++++++++++
include/linux/ftrace.h | 36 +++++-
kernel/trace/Kconfig | 28 ++++-
kernel/trace/Makefile | 1 +
kernel/trace/ftrace.c | 11 ++
kernel/trace/ktrace.c | 330 ++++++++++++++++++++++++++++++++++++++++++++
kernel/trace/trace.c | 1 +
12 files changed, 764 insertions(+), 80 deletions(-)
create mode 100644 arch/x86/kernel/ktrace.c
create mode 100644 kernel/trace/ktrace.c

diff --git a/Makefile b/Makefile
index 66e7e97..26d3d60 100644
--- a/Makefile
+++ b/Makefile
@@ -577,7 +577,7 @@ ifdef CONFIG_DEBUG_INFO_REDUCED
KBUILD_CFLAGS += $(call cc-option, -femit-struct-debug-baseonly)
endif

-ifdef CONFIG_FUNCTION_TRACER
+ifdef CONFIG_FTRACE_MCOUNT_RECORD
KBUILD_CFLAGS += -pg
ifdef CONFIG_DYNAMIC_FTRACE
ifdef CONFIG_HAVE_C_RECORDMCOUNT
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 95c36c4..a02718c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -38,7 +38,7 @@ config X86
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_GRAPH_FP_TEST
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
- select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE
+ select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE || KTRACE
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_KVM
select HAVE_ARCH_KGDB
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 34244b2..b664584 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -73,6 +73,7 @@ obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_$(BITS).o
obj-$(CONFIG_X86_MPPARSE) += mpparse.o
obj-y += apic/
obj-$(CONFIG_X86_REBOOTFIXUPS) += reboot_fixups_32.o
+obj-$(CONFIG_KTRACE) += ktrace.o
obj-$(CONFIG_DYNAMIC_FTRACE) += ftrace.o
obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
obj-$(CONFIG_FTRACE_SYSCALLS) += ftrace.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index aed1ffb..4d70019 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -62,6 +62,29 @@

.code64
#ifdef CONFIG_FUNCTION_TRACER
+#ifdef CONFIG_KTRACE
+ENTRY(ktrace_callback)
+ cmpl $0, function_trace_stop
+ jne ftrace_stub
+
+ cmpq $ftrace_stub, ftrace_trace_function
+ jnz ktrace_trace
+ retq
+
+ktrace_trace:
+ MCOUNT_SAVE_FRAME
+
+ movq 0x48(%rsp), %rdi
+ movq 0x50(%rsp), %rsi
+
+ call *ftrace_trace_function
+
+ MCOUNT_RESTORE_FRAME
+
+ retq
+END(ktrace_callback)
+#endif /* CONFIG_KTRACE */
+
#ifdef CONFIG_DYNAMIC_FTRACE
ENTRY(mcount)
retq
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 979ec14..ffa87f9 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,67 +29,7 @@
#include <asm/nmi.h>


-#ifdef CONFIG_DYNAMIC_FTRACE
-
-/*
- * modifying_code is set to notify NMIs that they need to use
- * memory barriers when entering or exiting. But we don't want
- * to burden NMIs with unnecessary memory barriers when code
- * modification is not being done (which is most of the time).
- *
- * A mutex is already held when ftrace_arch_code_modify_prepare
- * and post_process are called. No locks need to be taken here.
- *
- * Stop machine will make sure currently running NMIs are done
- * and new NMIs will see the updated variable before we need
- * to worry about NMIs doing memory barriers.
- */
-static int modifying_code __read_mostly;
-static DEFINE_PER_CPU(int, save_modifying_code);
-
-int ftrace_arch_code_modify_prepare(void)
-{
- set_kernel_text_rw();
- set_all_modules_text_rw();
- modifying_code = 1;
- return 0;
-}
-
-int ftrace_arch_code_modify_post_process(void)
-{
- modifying_code = 0;
- set_all_modules_text_ro();
- set_kernel_text_ro();
- return 0;
-}
-
-union ftrace_code_union {
- char code[MCOUNT_INSN_SIZE];
- struct {
- char e8;
- int offset;
- } __attribute__((packed));
-};
-
-static int ftrace_calc_offset(long ip, long addr)
-{
- return (int)(addr - ip);
-}
-
-static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
-{
- static union ftrace_code_union calc;
-
- calc.e8 = 0xe8;
- calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
-
- /*
- * No locking needed, this must be called via kstop_machine
- * which in essence is like running on a uniprocessor machine.
- */
- return calc.code;
-}
-
+#if defined(CONFIG_DYNAMIC_FTRACE) || defined(CONFIG_KTRACE)
/*
* Modifying code must take extra care. On an SMP machine, if
* the code being modified is also being executed on another CPU
@@ -129,15 +69,21 @@ static int mod_code_size; /* holds the size of the new code */
static unsigned nmi_wait_count;
static atomic_t nmi_update_count = ATOMIC_INIT(0);

-int ftrace_arch_read_dyn_info(char *buf, int size)
-{
- int r;
-
- r = snprintf(buf, size, "%u %u",
- nmi_wait_count,
- atomic_read(&nmi_update_count));
- return r;
-}
+/*
+ * modifying_code is set to notify NMIs that they need to use
+ * memory barriers when entering or exiting. But we don't want
+ * to burden NMIs with unnecessary memory barriers when code
+ * modification is not being done (which is most of the time).
+ *
+ * A mutex is already held when ftrace_arch_code_modify_prepare
+ * and post_process are called. No locks need to be taken here.
+ *
+ * Stop machine will make sure currently running NMIs are done
+ * and new NMIs will see the updated variable before we need
+ * to worry about NMIs doing memory barriers.
+ */
+static int modifying_code __read_mostly;
+static DEFINE_PER_CPU(int, save_modifying_code);

static void clear_mod_flag(void)
{
@@ -226,7 +172,7 @@ within(unsigned long addr, unsigned long start, unsigned long end)
}

static int
-do_ftrace_mod_code(unsigned long ip, void *new_code, int size)
+__do_ftrace_mod_code(unsigned long ip, void *new_code, int size)
{
/*
* On x86_64, kernel text mappings are mapped read-only with
@@ -262,6 +208,67 @@ do_ftrace_mod_code(unsigned long ip, void *new_code, int size)
return mod_code_status;
}

+int do_ftrace_mod_code(unsigned long ip, void *new_code, int size)
+{
+ return __do_ftrace_mod_code(ip, new_code, size);
+}
+
+int ftrace_arch_code_modify_post_process(void)
+{
+ modifying_code = 0;
+ set_all_modules_text_ro();
+ set_kernel_text_ro();
+ return 0;
+}
+
+int ftrace_arch_code_modify_prepare(void)
+{
+ set_kernel_text_rw();
+ set_all_modules_text_rw();
+ modifying_code = 1;
+ return 0;
+}
+
+#endif
+
+#ifdef CONFIG_DYNAMIC_FTRACE
+int ftrace_arch_read_dyn_info(char *buf, int size)
+{
+ int r;
+
+ r = snprintf(buf, size, "%u %u",
+ nmi_wait_count,
+ atomic_read(&nmi_update_count));
+ return r;
+}
+
+union ftrace_code_union {
+ char code[MCOUNT_INSN_SIZE];
+ struct {
+ char e8;
+ int offset;
+ } __attribute__((packed));
+};
+
+static int ftrace_calc_offset(long ip, long addr)
+{
+ return (int)(addr - ip);
+}
+
+static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr)
+{
+ static union ftrace_code_union calc;
+
+ calc.e8 = 0xe8;
+ calc.offset = ftrace_calc_offset(ip + MCOUNT_INSN_SIZE, addr);
+
+ /*
+ * No locking needed, this must be called via kstop_machine
+ * which in essence is like running on a uniprocessor machine.
+ */
+ return calc.code;
+}
+
static unsigned char *ftrace_nop_replace(void)
{
return ideal_nop5;
@@ -292,7 +299,7 @@ ftrace_modify_code(unsigned long ip, unsigned char *old_code,
return -EINVAL;

/* replace the text with the new text */
- if (do_ftrace_mod_code(ip, new_code, MCOUNT_INSN_SIZE))
+ if (__do_ftrace_mod_code(ip, new_code, MCOUNT_INSN_SIZE))
return -EPERM;

sync_core();
@@ -363,7 +370,7 @@ static int ftrace_mod_jmp(unsigned long ip,

*(int *)(&code[1]) = new_offset;

- if (do_ftrace_mod_code(ip, &code, MCOUNT_INSN_SIZE))
+ if (__do_ftrace_mod_code(ip, &code, MCOUNT_INSN_SIZE))
return -EPERM;

return 0;
diff --git a/arch/x86/kernel/ktrace.c b/arch/x86/kernel/ktrace.c
new file mode 100644
index 0000000..2bfaa77
--- /dev/null
+++ b/arch/x86/kernel/ktrace.c
@@ -0,0 +1,256 @@
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/ftrace.h>
+#include <asm/insn.h>
+#include <asm/nops.h>
+#include <linux/kprobes.h>
+
+static void __used ktrace_template_holder(void)
+{
+ asm volatile (
+ ".global ktrace_template_entry \n"
+ "ktrace_template_entry: \n"
+ " pushfq \n"
+
+ ".global ktrace_template_call \n"
+ "ktrace_template_call: \n"
+ ASM_NOP5
+
+ " popfq \n"
+ /* eat ret value */
+ " addq $8, %rsp \n"
+ ".global ktrace_template_end \n"
+ "ktrace_template_end: \n"
+ );
+}
+
+extern u8 ktrace_template_entry;
+extern u8 ktrace_template_end;
+extern u8 ktrace_template_call;
+
+extern void ktrace_callback(void);
+
+#define TMPL_CALL_IDX \
+ ((long)&ktrace_template_call - (long)&ktrace_template_entry)
+
+#define TMPL_END_IDX \
+ ((long)&ktrace_template_end - (long)&ktrace_template_entry)
+
+#define RELATIVECALL_SIZE 5
+#define RELATIVE_ADDR_SIZE 4
+#define RELATIVECALL_OPCODE 0xe8
+#define RELATIVEJUMP_OPCODE 0xe9
+#define MAX_OPTIMIZED_LENGTH (MAX_INSN_SIZE + RELATIVE_ADDR_SIZE)
+
+#define MAX_KTRACE_INSN_SIZE \
+ (((unsigned long)&ktrace_template_end - \
+ (unsigned long)&ktrace_template_entry) + \
+ MAX_OPTIMIZED_LENGTH + RELATIVECALL_SIZE)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+ (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
+ (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) | \
+ (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) | \
+ (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf)) \
+ << (row % 32))
+ /*
+ * Undefined/reserved opcodes, conditional jump, Opcode Extension
+ * Groups, and some special opcodes can not boost.
+ */
+static const u32 twobyte_is_boostable[256 / 32] = {
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+ /* ---------------------------------------------- */
+ W(0x00, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) | /* 00 */
+ W(0x10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 10 */
+ W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 20 */
+ W(0x30, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+ W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+ W(0x50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 50 */
+ W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1) | /* 60 */
+ W(0x70, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+ W(0x80, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 80 */
+ W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+ W(0xa0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* a0 */
+ W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1) , /* b0 */
+ W(0xc0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+ W(0xd0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) , /* d0 */
+ W(0xe0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1) | /* e0 */
+ W(0xf0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0) /* f0 */
+ /* ----------------------------------------------- */
+ /* 0 1 2 3 4 5 6 7 8 9 a b c d e f */
+};
+#undef W
+
+static int __copy_instruction(u8 *dest, u8 *src)
+{
+ struct insn insn;
+
+ kernel_insn_init(&insn, src);
+ insn_get_length(&insn);
+ memcpy(dest, insn.kaddr, insn.length);
+
+#ifdef CONFIG_X86_64
+ if (insn_rip_relative(&insn)) {
+ s64 newdisp;
+ u8 *disp;
+ kernel_insn_init(&insn, dest);
+ insn_get_displacement(&insn);
+ /*
+ * The copied instruction uses the %rip-relative addressing
+ * mode. Adjust the displacement for the difference between
+ * the original location of this instruction and the location
+ * of the copy that will actually be run. The tricky bit here
+ * is making sure that the sign extension happens correctly in
+ * this calculation, since we need a signed 32-bit result to
+ * be sign-extended to 64 bits when it's added to the %rip
+ * value and yield the same 64-bit result that the sign-
+ * extension of the original signed 32-bit displacement would
+ * have given.
+ */
+ newdisp = (u8 *) src + (s64) insn.displacement.value -
+ (u8 *) dest;
+ BUG_ON((s64) (s32) newdisp != newdisp); /* Sanity check. */
+ disp = (u8 *) dest + insn_offset_displacement(&insn);
+ *(s32 *) disp = (s32) newdisp;
+ }
+#endif
+ return insn.length;
+}
+
+static int can_boost(u8 *opcodes)
+{
+ u8 opcode;
+ u8 *orig_opcodes = opcodes;
+
+ if (search_exception_tables((unsigned long)opcodes))
+ return 0; /* Page fault may occur on this address. */
+
+retry:
+ if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1)
+ return 0;
+ opcode = *(opcodes++);
+
+ /* 2nd-byte opcode */
+ if (opcode == 0x0f) {
+ if (opcodes - orig_opcodes > MAX_INSN_SIZE - 1)
+ return 0;
+ return test_bit(*opcodes,
+ (unsigned long *)twobyte_is_boostable);
+ }
+
+ switch (opcode & 0xf0) {
+#ifdef CONFIG_X86_64
+ case 0x40:
+ goto retry; /* REX prefix is boostable */
+#endif
+ case 0x60:
+ if (0x63 < opcode && opcode < 0x67)
+ goto retry; /* prefixes */
+ /* can't boost Address-size override and bound */
+ return (opcode != 0x62 && opcode != 0x67);
+ case 0x70:
+ return 0; /* can't boost conditional jump */
+ case 0xc0:
+ /* can't boost software-interruptions */
+ return (0xc1 < opcode && opcode < 0xcc) || opcode == 0xcf;
+ case 0xd0:
+ /* can boost AA* and XLAT */
+ return (opcode == 0xd4 || opcode == 0xd5 || opcode == 0xd7);
+ case 0xe0:
+ /* can boost in/out and absolute jmps */
+ return ((opcode & 0x04) || opcode == 0xea);
+ case 0xf0:
+ if ((opcode & 0x0c) == 0 && opcode != 0xf1)
+ goto retry; /* lock/rep(ne) prefix */
+ /* clear and set flags are boostable */
+ return (opcode == 0xf5 || (0xf7 < opcode && opcode < 0xfe));
+ default:
+ /* segment override prefixes are boostable */
+ if (opcode == 0x26 || opcode == 0x36 || opcode == 0x3e)
+ goto retry; /* prefixes */
+ /* CS override prefix and call are not boostable */
+ return (opcode != 0x2e && opcode != 0x9a);
+ }
+}
+
+static int copy_instructions(u8 *dest, u8 *src)
+{
+ int len = 0, ret;
+
+ while (len < RELATIVECALL_SIZE) {
+ ret = __copy_instruction(dest + len, src + len);
+ if (!ret || !can_boost(dest + len))
+ return -EINVAL;
+ len += ret;
+ }
+
+ return len;
+}
+
+static void synthesize_relative_insn(u8 *buf, void *from, void *to, u8 op)
+{
+ struct __arch_relative_insn {
+ u8 op;
+ s32 raddr;
+ } __attribute__((packed)) *insn;
+
+ insn = (struct __arch_relative_insn *) buf;
+ insn->raddr = (s32)((long)(to) - ((long)(from) + 5));
+ insn->op = op;
+}
+
+void ktrace_enable_sym(struct ktrace_symbol *ksym)
+{
+ u8 call_buf[RELATIVECALL_SIZE];
+
+ synthesize_relative_insn(call_buf,
+ ksym->addr,
+ ksym->insn_templ,
+ RELATIVECALL_OPCODE);
+
+ do_ftrace_mod_code((unsigned long) ksym->addr,
+ call_buf, RELATIVECALL_SIZE);
+ ksym->enabled = 1;
+}
+
+void ktrace_disable_sym(struct ktrace_symbol *ksym)
+{
+ do_ftrace_mod_code((unsigned long) ksym->addr,
+ ksym->insn_saved,
+ ksym->insn_saved_size);
+ ksym->enabled = 0;
+}
+
+int ktrace_init_template(struct ktrace_symbol *ksym)
+{
+ u8* insn_templ = ksym->insn_templ;
+ u8 *addr = ksym->addr;
+ int size;
+
+ size = copy_instructions(insn_templ + TMPL_END_IDX, addr);
+ if (size < 0)
+ return -EINVAL;
+
+ memcpy(insn_templ, &ktrace_template_entry, TMPL_END_IDX);
+
+ synthesize_relative_insn(insn_templ + TMPL_END_IDX + size,
+ insn_templ + TMPL_END_IDX + size,
+ addr + size,
+ RELATIVEJUMP_OPCODE);
+
+ synthesize_relative_insn(insn_templ + TMPL_CALL_IDX,
+ insn_templ + TMPL_CALL_IDX,
+ ktrace_callback,
+ RELATIVECALL_OPCODE);
+
+ ksym->insn_saved = insn_templ + TMPL_END_IDX;
+ ksym->insn_saved_size = size;
+ return 0;
+}
+
+int __init ktrace_arch_init(void)
+{
+ ktrace_insn_init(MAX_KTRACE_INSN_SIZE);
+ return 0;
+}
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index dcd6a7c..11c3d5b 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -116,9 +116,6 @@ struct ftrace_func_command {

#ifdef CONFIG_DYNAMIC_FTRACE

-int ftrace_arch_code_modify_prepare(void);
-int ftrace_arch_code_modify_post_process(void);
-
struct seq_file;

struct ftrace_probe_ops {
@@ -530,4 +527,37 @@ unsigned long arch_syscall_addr(int nr);

#endif /* CONFIG_FTRACE_SYSCALLS */

+#ifdef CONFIG_KTRACE
+enum {
+ KTRACE_ENABLE,
+ KTRACE_DISABLE
+};
+
+struct ktrace_symbol {
+ struct list_head list;
+ int enabled;
+
+ u8 *addr;
+ u8 *insn_templ;
+ u8 *insn_saved;
+ int insn_saved_size;
+};
+
+extern void ktrace_init(void);
+extern int ktrace_init_template(struct ktrace_symbol *ksym);
+extern int ktrace_arch_init(void);
+extern void ktrace_startup(void);
+extern void ktrace_shutdown(void);
+extern void ktrace_enable_sym(struct ktrace_symbol *ksym);
+extern void ktrace_disable_sym(struct ktrace_symbol *ksym);
+#else
+static inline void ktrace_init(void) {}
+#endif /* CONFIG_KTRACE */
+
+#if defined CONFIG_DYNAMIC_FTRACE || defined CONFIG_KTRACE
+extern int do_ftrace_mod_code(unsigned long ip, void *new_code, int size);
+extern int ftrace_arch_code_modify_prepare(void);
+extern int ftrace_arch_code_modify_post_process(void);
+#endif
+
#endif /* _LINUX_FTRACE_H */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 14674dc..1cf0aba 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -140,8 +140,6 @@ if FTRACE

config FUNCTION_TRACER
bool "Kernel Function Tracer"
- depends on HAVE_FUNCTION_TRACER
- select FRAME_POINTER if !ARM_UNWIND && !S390
select KALLSYMS
select GENERIC_TRACER
select CONTEXT_SWITCH_TRACER
@@ -168,6 +166,30 @@ config FUNCTION_GRAPH_TRACER
the return value. This is done by setting the current return
address on the current task structure into a stack of calls.

+config KTRACE
+ bool
+ depends on FTRACER_ENG_KTRACE
+
+choice
+ prompt "Function trace engine"
+ default FTRACER_ENG_MCOUNT_RECORD
+ depends on FUNCTION_TRACER
+
+config FTRACER_ENG_MCOUNT_RECORD
+ bool "mcount"
+ depends on HAVE_FUNCTION_TRACER
+ select FRAME_POINTER if !ARM_UNWIND && !S390
+ help
+ standard -pg mcount record generation
+
+config FTRACER_ENG_KTRACE
+ bool "ktrace"
+ select KTRACE
+ help
+ dynamic call probes
+
+endchoice
+

config IRQSOFF_TRACER
bool "Interrupts-off Latency Tracer"
@@ -389,6 +411,7 @@ config DYNAMIC_FTRACE
bool "enable/disable ftrace tracepoints dynamically"
depends on FUNCTION_TRACER
depends on HAVE_DYNAMIC_FTRACE
+ depends on FTRACER_ENG_MCOUNT_RECORD
default y
help
This option will modify all the calls to ftrace dynamically
@@ -422,6 +445,7 @@ config FTRACE_MCOUNT_RECORD
def_bool y
depends on DYNAMIC_FTRACE
depends on HAVE_FTRACE_MCOUNT_RECORD
+ depends on FTRACER_ENG_MCOUNT_RECORD

config FTRACE_SELFTEST
bool
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..f557200 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -21,6 +21,7 @@ endif
#
obj-y += trace_clock.o

+obj-$(CONFIG_KTRACE) += ktrace.o
obj-$(CONFIG_FUNCTION_TRACER) += libftrace.o
obj-$(CONFIG_RING_BUFFER) += ring_buffer.o
obj-$(CONFIG_RING_BUFFER_BENCHMARK) += ring_buffer_benchmark.o
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index f3dadae..762e2b3 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3152,7 +3152,12 @@ int register_ftrace_function(struct ftrace_ops *ops)
mutex_lock(&ftrace_lock);

ret = __register_ftrace_function(ops);
+
+#ifdef CONFIG_KTRACE
+ ktrace_startup();
+#else
ftrace_startup(0);
+#endif

mutex_unlock(&ftrace_lock);
return ret;
@@ -3170,7 +3175,13 @@ int unregister_ftrace_function(struct ftrace_ops *ops)

mutex_lock(&ftrace_lock);
ret = __unregister_ftrace_function(ops);
+
+#ifdef CONFIG_KTRACE
+ ktrace_shutdown();
+#else
ftrace_shutdown(0);
+#endif
+
mutex_unlock(&ftrace_lock);

return ret;
diff --git a/kernel/trace/ktrace.c b/kernel/trace/ktrace.c
new file mode 100644
index 0000000..3e45e2c
--- /dev/null
+++ b/kernel/trace/ktrace.c
@@ -0,0 +1,330 @@
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/kallsyms.h>
+#include <linux/ctype.h>
+#include <linux/slab.h>
+#include <linux/kprobes.h>
+#include <linux/slab.h>
+#include <linux/stop_machine.h>
+
+#include "trace.h"
+
+static DEFINE_MUTEX(symbols_mutex);
+static LIST_HEAD(symbols);
+
+static struct kmem_cache *symbols_cache;
+static int ktrace_disabled;
+static int ktrace_enabled;
+
+static void ktrace_enable_all(void);
+
+static struct ktrace_symbol* ktrace_find_symbol(u8 *addr)
+{
+ struct ktrace_symbol *ksym, *found = NULL;
+
+ mutex_lock(&symbols_mutex);
+
+ list_for_each_entry(ksym, &symbols, list) {
+ if (ksym->addr == addr) {
+ found = ksym;
+ break;
+ }
+ }
+
+ mutex_unlock(&symbols_mutex);
+ return found;
+}
+
+static int ktrace_unregister_symbol(struct ktrace_symbol *ksym)
+{
+ free_ktrace_insn_slot(ksym->insn_templ, 1);
+ kmem_cache_free(symbols_cache, ksym);
+ return 0;
+}
+
+static int ktrace_unregister_all_symbols(void)
+{
+ struct ktrace_symbol *ksym, *n;
+
+ if (ktrace_enabled)
+ return -EINVAL;
+
+ mutex_lock(&symbols_mutex);
+
+ list_for_each_entry_safe(ksym, n, &symbols, list) {
+ list_del(&ksym->list);
+ ktrace_unregister_symbol(ksym);
+ }
+
+ mutex_unlock(&symbols_mutex);
+ return 0;
+}
+
+static int ktrace_register_symbol(char *symbol)
+{
+ struct ktrace_symbol *ksym;
+ u8 *addr, *insn_templ;
+ int ret = -ENOMEM;
+
+ /* Is it really symbol address. */
+ addr = (void*) kallsyms_lookup_name(symbol);
+ if (!addr)
+ return -EINVAL;
+
+ /* Is it already registered. */
+ if (ktrace_find_symbol(addr))
+ return -EINVAL;
+
+ /* Register new symbol. */
+ ksym = kmem_cache_zalloc(symbols_cache, GFP_KERNEL);
+ if (!ksym)
+ return -ENOMEM;
+
+ insn_templ = get_ktrace_insn_slot();
+ if (!insn_templ)
+ goto err_release_ksym;
+
+ ksym->insn_templ = insn_templ;
+ ksym->addr = addr;
+
+ ret = ktrace_init_template(ksym);
+ if (ret)
+ goto err_release_insn;
+
+ mutex_lock(&symbols_mutex);
+ list_add(&ksym->list, &symbols);
+ mutex_unlock(&symbols_mutex);
+
+ return 0;
+
+ err_release_insn:
+ free_ktrace_insn_slot(insn_templ, 1);
+
+ err_release_ksym:
+ kmem_cache_free(symbols_cache, ksym);
+
+ return ret;
+}
+
+static inline int
+within(unsigned long addr, unsigned long start, unsigned long end)
+{
+ return addr >= start && addr < end;
+}
+
+static int ktrace_symbol(void *data, const char *symbol,
+ struct module *mod, unsigned long addr)
+{
+ if (!within(addr, (unsigned long)_text, (unsigned long)_etext))
+ return 0;
+
+ ktrace_register_symbol((char*) symbol);
+ return 0;
+}
+
+static int ktrace_register_all(void)
+{
+ printk("not supported\n");
+ return 0;
+
+ kallsyms_on_each_symbol(ktrace_symbol, NULL);
+ return 0;
+}
+
+static void *ktrace_start(struct seq_file *m, loff_t *pos)
+{
+ mutex_lock(&symbols_mutex);
+
+ if (list_empty(&symbols) && (!*pos))
+ return (void *) 1;
+
+ return seq_list_start(&symbols, *pos);
+}
+
+static void *ktrace_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ if (v == (void *)1)
+ return NULL;
+
+ return seq_list_next(v, &symbols, pos);
+}
+
+static void ktrace_stop(struct seq_file *m, void *p)
+{
+ mutex_unlock(&symbols_mutex);
+}
+
+static int ktrace_show(struct seq_file *m, void *v)
+{
+ const struct ktrace_symbol *ksym = list_entry(v, struct ktrace_symbol, list);
+
+ if (v == (void *)1) {
+ seq_printf(m, "no symbol\n");
+ return 0;
+ }
+
+ seq_printf(m, "%ps\n", ksym->addr);
+ return 0;
+}
+
+static const struct seq_operations ktrace_sops = {
+ .start = ktrace_start,
+ .next = ktrace_next,
+ .stop = ktrace_stop,
+ .show = ktrace_show,
+};
+
+static int
+ktrace_open(struct inode *inode, struct file *file)
+{
+ int ret = 0;
+
+ if ((file->f_mode & FMODE_WRITE) &&
+ (file->f_flags & O_TRUNC))
+ ktrace_unregister_all_symbols();
+
+ if (file->f_mode & FMODE_READ)
+ ret = seq_open(file, &ktrace_sops);
+
+ return ret;
+}
+
+static ssize_t
+ktrace_write(struct file *filp, const char __user *ubuf,
+ size_t cnt, loff_t *ppos)
+{
+#define SYMMAX 50
+ char symbol[SYMMAX];
+ int ret, i;
+
+ if (cnt >= SYMMAX)
+ return -EINVAL;
+
+ if (copy_from_user(&symbol, ubuf, cnt))
+ return -EFAULT;
+
+ symbol[cnt] = 0;
+
+ for (i = cnt - 1;
+ i >= 0 && (isspace(symbol[i]) || (symbol[i] == '\n')); i--)
+ symbol[i] = 0;
+
+ if (!symbol[0])
+ return cnt;
+
+ if (!strcmp(symbol, "all"))
+ ret = ktrace_register_all();
+ else
+ ret = ktrace_register_symbol(symbol);
+
+ if (ret)
+ return ret;
+
+ if (ktrace_enabled)
+ ktrace_startup();
+
+ return ret ? ret : cnt;
+}
+
+static const struct file_operations ktrace_fops = {
+ .open = ktrace_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .write = ktrace_write,
+};
+
+static void ktrace_enable_all(void)
+{
+ struct ktrace_symbol *ksym;
+
+ list_for_each_entry(ksym, &symbols, list) {
+ if (ksym->enabled)
+ continue;
+
+ ktrace_enable_sym(ksym);
+ }
+
+ ktrace_enabled = 1;
+}
+
+static void ktrace_disable_all(void)
+{
+ struct ktrace_symbol *ksym;
+
+ list_for_each_entry(ksym, &symbols, list) {
+ if (!ksym->enabled)
+ continue;
+
+ ktrace_disable_sym(ksym);
+ }
+
+ ktrace_enabled = 0;
+}
+
+static int __ktrace_modify_code(void *data)
+{
+ int *command = data;
+
+ if (*command == KTRACE_ENABLE)
+ ktrace_enable_all();
+
+ if (*command == KTRACE_DISABLE)
+ ktrace_disable_all();
+
+ return 0;
+}
+
+#define FTRACE_WARN_ON(cond) \
+do { \
+ if (WARN_ON(cond)) \
+ ftrace_kill(); \
+} while (0)
+
+static void ktrace_run_update_code(int command)
+{
+ int ret;
+
+ if (ktrace_disabled)
+ return;
+
+ ret = ftrace_arch_code_modify_prepare();
+ FTRACE_WARN_ON(ret);
+ if (ret)
+ return;
+
+ stop_machine(__ktrace_modify_code, &command, NULL);
+
+ ret = ftrace_arch_code_modify_post_process();
+ FTRACE_WARN_ON(ret);
+}
+
+void ktrace_startup(void)
+{
+ ktrace_run_update_code(KTRACE_ENABLE);
+}
+
+void ktrace_shutdown(void)
+{
+ ktrace_run_update_code(KTRACE_DISABLE);
+}
+
+void __init ktrace_init(void)
+{
+ struct dentry *d_tracer = tracing_init_dentry();
+
+ trace_create_file("ktrace", 0644, d_tracer,
+ NULL, &ktrace_fops);
+
+ symbols_cache = KMEM_CACHE(ktrace_symbol, 0);
+ if (!symbols_cache) {
+ printk("ktrace disabled - kmem cache allocation failed\n");
+ ktrace_disabled = 1;
+ return;
+ }
+
+ ktrace_arch_init();
+ printk("ktrace initialized\n");
+}
+
+MODULE_LICENSE("GPL");
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dc53ecb..b901c94 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4361,6 +4361,7 @@ static __init int tracer_init_debugfs(void)
for_each_tracing_cpu(cpu)
tracing_init_debugfs_percpu(cpu);

+ ktrace_init();
return 0;
}

--
1.7.1

2011-02-03 15:43:18

by Jiri Olsa

Subject: [PATCH 4/4] ktrace - function graph trace support

adding function graph support

wbr,
jirka
---
arch/x86/Kconfig | 2 +-
arch/x86/kernel/entry_64.S | 27 +++++++++++++++++++++++++++
kernel/trace/ftrace.c | 10 ++++++++++
3 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a02718c..befe1e0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -36,7 +36,7 @@ config X86
select HAVE_DYNAMIC_FTRACE
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_GRAPH_TRACER
- select HAVE_FUNCTION_GRAPH_FP_TEST
+ select HAVE_FUNCTION_GRAPH_FP_TEST if !KTRACE
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_FTRACE_NMI_ENTER if DYNAMIC_FTRACE || KTRACE
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 4d70019..ec9e234 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -69,6 +69,14 @@ ENTRY(ktrace_callback)

cmpq $ftrace_stub, ftrace_trace_function
jnz ktrace_trace
+
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+ cmpq $ftrace_stub, ftrace_graph_return
+ jnz ktrace_graph_caller
+
+ cmpq $ftrace_graph_entry_stub, ftrace_graph_entry
+ jnz ktrace_graph_caller
+#endif
retq

ktrace_trace:
@@ -83,6 +91,25 @@ ktrace_trace:

retq
END(ktrace_callback)
+
+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
+ENTRY(ktrace_graph_caller)
+ cmpl $0, function_trace_stop
+ jne ftrace_stub
+
+ MCOUNT_SAVE_FRAME
+
+ leaq 0x50(%rsp), %rdi
+ movq 0x48(%rsp), %rsi
+ movq $0, %rdx
+
+ call prepare_ftrace_return
+
+ MCOUNT_RESTORE_FRAME
+
+ retq
+END(ktrace_graph_caller)
+#endif /* CONFIG_FUNCTION_GRAPH_TRACER */
#endif /* CONFIG_KTRACE */

#ifdef CONFIG_DYNAMIC_FTRACE
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 762e2b3..f6e30a8 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3404,7 +3404,11 @@ int register_ftrace_graph(trace_func_graph_ret_t retfunc,
ftrace_graph_return = retfunc;
ftrace_graph_entry = entryfunc;

+#ifdef CONFIG_KTRACE
+ ktrace_startup();
+#else
ftrace_startup(FTRACE_START_FUNC_RET);
+#endif

out:
mutex_unlock(&ftrace_lock);
@@ -3421,7 +3425,13 @@ void unregister_ftrace_graph(void)
ftrace_graph_active--;
ftrace_graph_return = (trace_func_graph_ret_t)ftrace_stub;
ftrace_graph_entry = ftrace_graph_entry_stub;
+
+#ifdef CONFIG_KTRACE
+ ktrace_shutdown();
+#else
ftrace_shutdown(FTRACE_STOP_FUNC_RET);
+#endif
+
unregister_pm_notifier(&ftrace_suspend_notifier);
unregister_trace_sched_switch(ftrace_graph_probe_sched_switch, NULL);

--
1.7.1

2011-02-03 15:43:50

by Jiri Olsa

Subject: [PATCH 1/4] kprobe - ktrace instruction slot cache interface

using kprobe detour buffer allocation, adding interface
to use it from trace framework

wbr,
jirka
---
include/linux/kprobes.h | 8 ++++++++
kernel/kprobes.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index dd7c12e..1e984e9 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -436,4 +436,12 @@ static inline int enable_jprobe(struct jprobe *jp)
return enable_kprobe(&jp->kp);
}

+#ifdef CONFIG_KTRACE
+
+extern kprobe_opcode_t __kprobes *get_ktrace_insn_slot(void);
+extern void __kprobes free_ktrace_insn_slot(kprobe_opcode_t * slot, int dirty);
+extern void __init ktrace_insn_init(int size);
+
+#endif /* CONFIG_KTRACE */
+
#endif /* _LINUX_KPROBES_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 7798181..5bc31d6 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -285,6 +285,39 @@ void __kprobes free_insn_slot(kprobe_opcode_t * slot, int dirty)
__free_insn_slot(&kprobe_insn_slots, slot, dirty);
mutex_unlock(&kprobe_insn_mutex);
}
+
+#ifdef CONFIG_KTRACE
+static DEFINE_MUTEX(ktrace_insn_mutex);
+static struct kprobe_insn_cache ktrace_insn_slots = {
+ .pages = LIST_HEAD_INIT(ktrace_insn_slots.pages),
+ .insn_size = MAX_INSN_SIZE,
+ .nr_garbage = 0,
+};
+
+kprobe_opcode_t __kprobes *get_ktrace_insn_slot(void)
+{
+ kprobe_opcode_t *ret = NULL;
+
+ mutex_lock(&ktrace_insn_mutex);
+ ret = __get_insn_slot(&ktrace_insn_slots);
+ mutex_unlock(&ktrace_insn_mutex);
+
+ return ret;
+}
+
+void __kprobes free_ktrace_insn_slot(kprobe_opcode_t * slot, int dirty)
+{
+ mutex_lock(&ktrace_insn_mutex);
+ __free_insn_slot(&ktrace_insn_slots, slot, dirty);
+ mutex_unlock(&ktrace_insn_mutex);
+}
+
+void __init ktrace_insn_init(int size)
+{
+ ktrace_insn_slots.insn_size = size;
+}
+#endif /* CONFIG_KTRACE */
+
#ifdef CONFIG_OPTPROBES
/* For optimized_kprobe buffer */
static DEFINE_MUTEX(kprobe_optinsn_mutex); /* Protects kprobe_optinsn_slots */
--
1.7.1

2011-02-03 16:33:29

by Steven Rostedt

Subject: Re: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

On Thu, 2011-02-03 at 16:42 +0100, Jiri Olsa wrote:
> hi,
>
> I recently saw the direct jump probing made for kprobes
> and tried to use it inside the trace framework.
>
> The general idea is to patch the function entry with a direct
> jump to the trace code, instead of using the pregenerated gcc
> profiling code.

Interesting, but ideally, it would be nice if gcc provided a better
"mcount" mechanism. One that calls mcount (or whatever new name it would
have) before it does anything with the stack.

>
> I started this just to see whether it would even be possible
> to hook the new probing into the current trace code. It
> appears it's not that bad. I was able to run the function
> and function_graph tracers on x86_64.
>
> For details on direct jumps probe, please check:
> http://www.linuxinsight.com/ols2007-djprobe-kernel-probing-with-the-smallest-overhead.html
>
>
> I realize that using this way to hook the functions has some
> drawbacks; from what I can see they are roughly:
> - not all functions can be patched

What's the reason for not all functions?

> - need to find a way to say which function is safe to patch
> - memory consumption for detour buffers and symbol records
>
> but it seems there are some advantages as well:
> - trace code could be in a module

What makes this allow module code?

ftrace could do that now, but it would require a separate handler. I
would need to disable preemption before calling the module code function
handler.

> - no profiling code is needed
> - framepointer can be disabled (framepointer is needed for
> generating profile code)

Again ideally, gcc should fix this.

-- Steve

2011-02-03 17:35:51

by Frederic Weisbecker

Subject: Re: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

On Thu, Feb 03, 2011 at 11:33:25AM -0500, Steven Rostedt wrote:
> On Thu, 2011-02-03 at 16:42 +0100, Jiri Olsa wrote:
> > hi,
> >
> > I recently saw the direct jump probing made for kprobes
> > and tried to use it inside the trace framework.
> >
> > The general idea is to patch the function entry with a direct
> > jump to the trace code, instead of using the pregenerated gcc
> > profiling code.
>
> Interesting, but ideally, it would be nice if gcc provided a better
> "mcount" mechanism. One that calls mcount (or whatever new name it would
> have) before it does anything with the stack.
>
> >
> > I started this just to see whether it would even be possible
> > to hook the new probing into the current trace code. It
> > appears it's not that bad. I was able to run the function
> > and function_graph tracers on x86_64.
> >
> > For details on direct jumps probe, please check:
> > http://www.linuxinsight.com/ols2007-djprobe-kernel-probing-with-the-smallest-overhead.html
> >
> >
> > I realize that using this way to hook the functions has some
> > drawbacks; from what I can see they are roughly:
> > - not all functions can be patched
>
> What's the reason for not all functions?

Because of the functions that kprobes itself calls, to avoid recursion.
kprobes has some recursion detection mechanism, IIRC, but
until we reach that checkpoint, I think there are some functions
in the path.

Well, ftrace has the same problem. That's just due to the nature of
function tracing.

There may also be some places too fragile to use kprobes in.

Ah, the whole trap path for example :-(

> > - need to find a way to say which function is safe to patch
> > - memory consumption for detour buffers and symbol records
> >
> > but it seems there are some advantages as well:
> > - trace code could be in a module
>
> What makes this allow module code?
>
> ftrace could do that now, but it would require a separate handler. I
> would need to disable preemption before calling the module code function
> handler.

Kprobes takes care of handlers from modules already.
I'm not sure we want that; it makes the tracing code more sensitive.

Look, for example I think kprobes doesn't trace the kernel fault path,
because module space is allocated through vmalloc (hmm, is that still
the case?).

> > - no profiling code is needed
> > - framepointer can be disabled (framepointer is needed for
> > generating profile code)
>
> Again ideally, gcc should fix this.

Another drawback of using kprobes is the overhead.
I can't imagine a trap triggering for every function. But then
yeah, we have the jmp optimisation. But that needs the detour
buffer, which we can avoid with mcount.

So like Steve, I think mcount is still a better backend for function
tracing. It's more optimized by nature, even though it indeed needs
some fixes.

2011-02-03 19:00:31

by Steven Rostedt

Subject: Re: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

On Thu, 2011-02-03 at 18:35 +0100, Frederic Weisbecker wrote:

> > ftrace could do that now, but it would require a separate handler. I
> > would need to disable preemption before calling the module code function
> > handler.
>
> Kprobes takes care of handlers from modules already.
> I'm not sure we want that, it makes the tracing code more sensitive.

Masami,

I'm looking at the optimized kprobes code, particularly
kprobes_optinsn_template_holder(), which looks to be the template that
is called for optimized kprobes. I don't see where preemption or
interrupts are disabled when a probe is called.

If modules can register probes, and we can call them at any arbitrary
location in the kernel, then preemption must be disabled prior to
calling the module code. Otherwise you risk crashing the system on
module unload.


module:
-------
register_kprobe(probe);


Core:
-----
hit break point
call probe

module:
-------
in probe function
preempted

module:
-------
unregister_kprobe(probe);
stop_machine();
<module unloaded>

Core:
-----
module <zombie>:
----------------
gets CPU again
executes module code that's been freed
DEATH BY ZOMBIES
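
A minimal sketch of the kind of guard I mean (not code from this
series; the handler type and names here are made up just for
illustration):

#include <linux/preempt.h>

/* the handler may live in module text */
static void call_probe_handler(void (*handler)(unsigned long ip),
			       unsigned long ip)
{
	/*
	 * Module unload goes through stop_machine(), which has to
	 * wait for every CPU to reach a preemption point, so a
	 * handler invoked with preemption disabled finishes before
	 * the module text can be freed.
	 */
	preempt_disable();
	handler(ip);
	preempt_enable();
}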

Maybe I missed something. But does the optimized kprobes code disable
preemption or interrupts before calling the optimized probe?

-- Steve

Subject: Re: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

Hi,

(2011/02/04 0:42), Jiri Olsa wrote:
> hi,
>
> I recently saw the direct jump probing made for kprobes
> and tried to use it inside the trace framework.
>
> The general idea is to patch the function entry with a direct
> jump to the trace code, instead of using the pregenerated gcc
> profiling code.
>
> I started this just to see whether it would even be possible
> to hook the new probing into the current trace code. It
> appears it's not that bad. I was able to run the function
> and function_graph tracers on x86_64.
>
> For details on direct jumps probe, please check:
> http://www.linuxinsight.com/ols2007-djprobe-kernel-probing-with-the-smallest-overhead.html

Thank you for referring it ;-)

> I realize that using this way to hook the functions has some
> drawbacks; from what I can see they are roughly:
> - not all functions can be patched

Yeah, that is why "djprobe" became "optprobe". If kprobes
finds there is no space to patch, it just falls back to a
breakpoint. Since this check is done internally, kprobes
users get this benefit transparently (no need to
change the user's code).

> - need to find a way to say which function is safe to patch
> - memory consumption for detour buffers and symbol records

Also, you can't patch more than two instructions without the
int3 bypass method (or a special stack checker), because a processor
may be running and may have been interrupted on the 2nd instruction
when stop_machine is issued.
That's the 2nd reason why djprobe is a part of kprobes.
This "int3 bypass" method disallows you to probe NMI handlers,
since an int3 inside an NMI will clear the additional NMI masking by
issuing IRET.

> but it seems there are some advantages as well:
> - trace code could be in a module
> - no profiling code is needed
> - framepointer can be disabled (framepointer is needed for
> generating profile code)

Nowadays the profiling code with dynamic ftrace will not add
visible overhead, and if you need to do that without a
profiled binary, you can already use kprobe-tracer for it.
(Using kprobe-tracer via perf-probe allows you to probe not
only actual function entries but also inlined function entries ;-))


Thank you,

>
> As for the attached implementation, it's mostly a hack (expect bugs);
> especially the ftrace/kprobe integration could probably be done better.
> It's only for x86_64.
>
> It can be used like this:
>
> - new menu config item is added (function tracer engine),
> to choose mcount or ktrace
> - new file "ktrace" is added to the tracing dir
> - to add symbols to trace run:
> echo mutex_unlock > ./ktrace
> echo mutex_lock >> ./ktrace
> - to display trace symbols:
> cat ktrace
> - to enable the trace, the usual is needed:
> echo function > ./current_tracer
> echo function_graph > ./current_tracer
> - to remove symbols from trace:
> echo nop > ./current_tracer
> echo > ./ktrace
> - if the function is added while the tracer is running,
> the symbol is enabled automatically.
> - symbols can only be removed all at once, and only if there's
> no tracer running.
>
> I'm not sure how to determine from the kallsyms interface which functions
> are safe to patch, so I have omitted patching of all symbols so far.


>
>
> attached patches:
> 1/4 - kprobe - ktrace instruction slot cache interface
> using kprobe detour buffer allocation, adding interface
> to use it from trace framework
>
> 2/4 - tracing - adding size parameter to do_ftrace_mod_code
> adding size parameter to be able to restore the saved
> instructions, which could be longer than relative call
>
> 3/4 - ktrace - function trace support
> adding ktrace support with function tracer
>
> 4/4 - ktrace - function graph trace support
> adding function graph support
>
>
> please let me know what you think, thanks
> jirka
> ---
> Makefile | 2 +-
> arch/x86/Kconfig | 4 +-
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/entry_64.S | 50 +++++++
> arch/x86/kernel/ftrace.c | 157 +++++++++++----------
> arch/x86/kernel/ktrace.c | 256 ++++++++++++++++++++++++++++++++++
> include/linux/ftrace.h | 36 +++++-
> include/linux/kprobes.h | 8 +
> kernel/kprobes.c | 33 +++++
> kernel/trace/Kconfig | 28 ++++-
> kernel/trace/Makefile | 1 +
> kernel/trace/ftrace.c | 21 +++
> kernel/trace/ktrace.c | 330 ++++++++++++++++++++++++++++++++++++++++++++
> kernel/trace/trace.c | 1 +
> 14 files changed, 846 insertions(+), 82 deletions(-)


--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2011-02-07 21:22:54

by Josh Triplett

Subject: Re: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

On Thu, Feb 03, 2011 at 11:33:25AM -0500, Steven Rostedt wrote:
> On Thu, 2011-02-03 at 16:42 +0100, Jiri Olsa wrote:
> > hi,
> >
> > I recently saw the direct jump probing made for kprobes
> > and tried to use it inside the trace framework.
> >
> > The general idea is to patch the function entry with a direct
> > jump to the trace code, instead of using the pregenerated gcc
> > profiling code.
>
> Interesting, but ideally, it would be nice if gcc provided a better
> "mcount" mechanism. One that calls mcount (or whatever new name it would
> have) before it does anything with the stack.

GCC 4.6 may help here. According to
http://gcc.gnu.org/gcc-4.6/changes.html:

"Support for emitting profiler counter calls before function prologues.
This is enabled via a new command-line option -mfentry."

Looks like that option might only support x86 (32-bit and 64-bit) at the
moment, but it still seems like an improvement over the current
mechanism to work around GCC's placement of mcount.

- Josh Triplett

2011-02-07 21:32:51

by Steven Rostedt

Subject: Re: [RFC 0/4] tracing,x86_64 - function/graph trace without mcount/-pg/framepointer

On Mon, 2011-02-07 at 13:22 -0800, Josh Triplett wrote:

> GCC 4.6 may help here. According to
> http://gcc.gnu.org/gcc-4.6/changes.html:
>
> "Support for emitting profiler counter calls before function prologues.
> This is enabled via a new command-line option -mfentry."
>
> Looks like that option might only support x86 (32-bit and 64-bit) at the
> moment, but it still seems like an improvement over the current
> mechanism to work around GCC's placement of mcount.
>

I may need to download this and try it out.

Thanks for the reference!

-- Steve