[ NOT READY FOR INCLUSION! ]
Note, this is based off of my remove ftrace_start/stop() patch set.
I've been wanting to do this for years, and just never gotten around to it.
But with all this talk of kpatch and kgraft live kernel patching using
the ftrace infrastructure, it seems like a good time to do it.
The way the function callback mechanism works in ftrace is that if there's
only one function callback registered, it will set the mcount/fentry
trampoline to call that function directly. But as soon as you register
another callback, the mcount trampoline calls a loop function that iterates
over all the registered callbacks (ftrace_ops) checking their hash tables
to see if the called function matches the ops before calling its callback.
This happens even if the two registered functions are not even tracing
the same function!
This really sucks if you are tracing all functions, and then add a kprobe
or perf event that traces a single function. That will cause all the
other functions being traced to perform the loop test.
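To make the cost concrete, here is a rough C-level sketch of that loop
dispatch (a simplification of what ftrace_ops_list_func() does, not the
exact kernel source; the function name here is made up, and
ftrace_ops_test() is the per-ops hash check):

static void ftrace_ops_list_loop(unsigned long ip, unsigned long parent_ip,
				 struct ftrace_ops *op, struct pt_regs *regs)
{
	preempt_disable_notrace();
	/* every traced function pays for the walk over all registered ops */
	for (op = ftrace_ops_list; op != &ftrace_list_end; op = op->next) {
		/* hash test: does this ops actually trace ip? */
		if (ftrace_ops_test(op, ip, regs))
			op->func(ip, parent_ip, op, regs);
	}
	preempt_enable_notrace();
}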
Ideally, if only a single callback (ftrace_ops) is registered to a
function, then that function should call a trampoline that will only
call that one callback without doing any other tests.
This patch set adds this functionality to x86_64. If a callback is
registered to a function and there's no other callback registered to
that function, then that ftrace_ops will get its own trampoline allocated
for it, and that trampoline will call the callback directly.
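For illustration only, here is a conceptual C-level view of what such a
per-ops trampoline buys (the real thing is generated x86_64 assembly copied
from ftrace_caller/ftrace_regs_caller; my_ops is a made-up name for the one
registered ops):

static struct ftrace_ops my_ops;	/* the single ops registered to this function */

static void per_ops_trampoline(unsigned long ip, unsigned long parent_ip)
{
	/*
	 * The ops pointer is baked in when the trampoline is allocated,
	 * so there is no list walk and no hash test on this path.
	 */
	my_ops.func(ip, parent_ip, &my_ops, NULL);
}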
Note, for dynamically allocated ftrace_ops (kprobes, perf, instance
directory function tracing), the dynamic trampoline will only be created
if CONFIG_PREEMPT is not set. That's because, until Paul finishes his
rcu_call_task() code, there's no safe way to know if a task was preempted
while on the trampoline and is waiting to run on it some more.
I need to write up a bunch of tests for this code, but currently it works
on the few tests I did manually. I haven't even run this code yet under
my full test suite, so it may very well have bugs in it that might be
easily triggered. But I wanted to get the code out for review to see
if anyone has any other ideas to help enhance this feature.
If you want a git repo to play with this, you can get it from below.
That repo will rebase often, so do not build against it.
Enjoy,
-- Steve
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
rfc/trampoline
Head SHA1: 4d781e010842a56f8e7c1bbe309e38075c277c45
Steven Rostedt (Red Hat) (3):
ftrace/x86: Add dynamic allocated trampoline for ftrace_ops
ftrace/x86: Show trampoline call function in enabled_functions
ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines
----
arch/x86/kernel/ftrace.c | 240 ++++++++++++++++++++++++++++++++++++++++++--
arch/x86/kernel/mcount_64.S | 26 ++++-
include/linux/ftrace.h | 8 ++
kernel/trace/ftrace.c | 86 +++++++++++++++-
4 files changed, 344 insertions(+), 16 deletions(-)
On Thu, 03 Jul 2014 16:07:50 -0400
Steven Rostedt <[email protected]> wrote:
> [ NOT READY FOR INCLUSION! ]
One more thing to note. This will not go in before 3.18 (and depending
on issues, maybe later than that).
-- Steve
>
> Note, this is based off of my remove ftrace_start/stop() patch set.
>
> I've been wanting to do this for years, and just never gotten around to it.
> But with all this talk of kpatch and kgraft live kernel patching using
> the ftrace infrastructure, it seems like a good time to do it.
>
> The way the function callback mechanism works in ftrace is that if there's
> only one function callback registered, it will set the mcount/fentry
> trampoline to call that function directly. But as soon as you register
> another callback, the mcount trampoline calls a loop function that iterates
> over all the registered callbacks (ftrace_ops) checking their hash tables
> to see if the called function matches the ops before calling its callback.
> This happens even if the two registered functions are not even tracing
> the same function!
>
> This really sucks if you are tracing all functions, and then add a kprobe
> or perf event that traces a single function. That will cause all the
> other functions being traced to perform the loop test.
>
> Ideally, if only a single callback (ftrace_ops) is registered to a
> function, than that function should call a trampoline that will only
> call that one callback without doing any other tests.
>
> This patch set adds this functionality to x86_64. If a callback is
> registered to a function and there's no other callback registered to
> that function that ftrace_ops will get its own trampoline allocated
> for it that will call the function directly.
>
> Note, for dynamically allocated ftrace_ops (kprobes, perf, instance
> directory function tracing), the dynamic trampoline will only be created
> if CONFIG_PREEMPT is not set. That's because, until Paul finishes his
> rcu_call_task() code, there's no safe way to know if a task was preempted
> while on the trampoline and is waiting to run on it some more.
>
> I need to write up a bunch of tests for this code, but currently it works
> on the few tests I did manually. I didn't even run this code yet under
> my full test suite, so it may very well have bugs in it that might be
> easily triggered. But I wanted to get the code out for review to see
> if anyone has any other idea to help enhance this feature.
>
> If you want a git repo to play with this, you can get it from below.
> That repo will rebase often, so do not build against it.
>
> Enjoy,
>
> -- Steve
>
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> rfc/trampoline
>
> Head SHA1: 4d781e010842a56f8e7c1bbe309e38075c277c45
>
>
> Steven Rostedt (Red Hat) (3):
> ftrace/x86: Add dynamic allocated trampoline for ftrace_ops
> ftrace/x86: Show trampoline call function in enabled_functions
> ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines
>
> ----
> arch/x86/kernel/ftrace.c | 240 ++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/kernel/mcount_64.S | 26 ++++-
> include/linux/ftrace.h | 8 ++
> kernel/trace/ftrace.c | 86 +++++++++++++++-
> 4 files changed, 344 insertions(+), 16 deletions(-)
(2014/07/04 5:07), Steven Rostedt wrote:
> [ NOT READY FOR INCLUSION! ]
>
> Note, this is based off of my remove ftrace_start/stop() patch set.
>
> I've been wanting to do this for years, and just never gotten around to it.
> But with all this talk of kpatch and kgraft live kernel patching using
> the ftrace infrastructure, it seems like a good time to do it.
>
> The way the function callback mechanism works in ftrace is that if there's
> only one function callback registered, it will set the mcount/fentry
> trampoline to call that function directly. But as soon as you register
> another callback, the mcount trampoline calls a loop function that iterates
> over all the registered callbacks (ftrace_ops) checking their hash tables
> to see if the called function matches the ops before calling its callback.
> This happens even if the two registered functions are not even tracing
> the same function!
>
> This really sucks if you are tracing all functions, and then add a kprobe
> or perf event that traces a single function. That will cause all the
> other functions being traced to perform the loop test.
Ah, I had thought that ftrace already had different trampolines for the loop
and single cases and replaced each mcount-call instruction with the
appropriate one. But this series is what actually does that, doesn't it? :)
> Ideally, if only a single callback (ftrace_ops) is registered to a
> function, than that function should call a trampoline that will only
> call that one callback without doing any other tests.
>
> This patch set adds this functionality to x86_64. If a callback is
> registered to a function and there's no other callback registered to
> that function that ftrace_ops will get its own trampoline allocated
> for it that will call the function directly.
>
> Note, for dynamically allocated ftrace_ops (kprobes, perf, instance
> directory function tracing), the dynamic trampoline will only be created
> if CONFIG_PREEMPT is not set. That's because, until Paul finishes his
> rcu_call_task() code, there's no safe way to know if a task was preempted
> while on the trampoline and is waiting to run on it some more.
Hmm, if we can declare "this ftrace_ops is permanent" (like finalizing), then
we can allocate a trampoline for such a dynamic one. Since kprobes actually
doesn't need to free (or unregister) its ftrace_ops, I could use it.
> I need to write up a bunch of tests for this code, but currently it works
> on the few tests I did manually. I didn't even run this code yet under
> my full test suite, so it may very well have bugs in it that might be
> easily triggered. But I wanted to get the code out for review to see
> if anyone has any other idea to help enhance this feature.
Yeah, I'll review it.
Thank you,
>
> If you want a git repo to play with this, you can get it from below.
> That repo will rebase often, so do not build against it.
>
> Enjoy,
>
> -- Steve
>
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> rfc/trampoline
>
> Head SHA1: 4d781e010842a56f8e7c1bbe309e38075c277c45
>
>
> Steven Rostedt (Red Hat) (3):
> ftrace/x86: Add dynamic allocated trampoline for ftrace_ops
> ftrace/x86: Show trampoline call function in enabled_functions
> ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines
>
> ----
> arch/x86/kernel/ftrace.c | 240 ++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/kernel/mcount_64.S | 26 ++++-
> include/linux/ftrace.h | 8 ++
> kernel/trace/ftrace.c | 86 +++++++++++++++-
> 4 files changed, 344 insertions(+), 16 deletions(-)
>
>
>
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Fri, 04 Jul 2014 22:20:12 +0900
Masami Hiramatsu <[email protected]> wrote:
> (2014/07/04 5:07), Steven Rostedt wrote:
> > [ NOT READY FOR INCLUSION! ]
> >
> > Note, this is based off of my remove ftrace_start/stop() patch set.
> >
> > I've been wanting to do this for years, and just never gotten around to it.
> > But with all this talk of kpatch and kgraft live kernel patching using
> > the ftrace infrastructure, it seems like a good time to do it.
> >
> > The way the function callback mechanism works in ftrace is that if there's
> > only one function callback registered, it will set the mcount/fentry
> > trampoline to call that function directly. But as soon as you register
> > another callback, the mcount trampoline calls a loop function that iterates
> > over all the registered callbacks (ftrace_ops) checking their hash tables
> > to see if the called function matches the ops before calling its callback.
> > This happens even if the two registered functions are not even tracing
> > the same function!
> >
> > This really sucks if you are tracing all functions, and then add a kprobe
> > or perf event that traces a single function. That will cause all the
> > other functions being traced to perform the loop test.
>
> Ah, I've thought that ftrace already had different trampoline for loop and
> single and replaced each mcount-call instruction to appropriate one. But
> this series actually does that, doesn't this? :)
Well, I guess the answer to that depends on what you consider the
trampoline. I'm currently considering it to be the assembly code that
the mcount/fentry call jumps to. We only have two trampolines (three if
you count the function graph code that will be called directly come
3.17). Those two are the normal ftrace_caller and the
ftrace_regs_caller. Now what they call can be different. When only a
single ftrace_ops is registered, they call the ftrace_ops->func
directly. If more than one ftrace_ops is registered, then they
call the loop function instead.
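Roughly, the selection looks like this (paraphrased, not the exact source
of update_ftrace_function()):

static void update_ftrace_function_sketch(void)
{
	ftrace_func_t func;

	if (ftrace_ops_list == &ftrace_list_end)
		func = ftrace_stub;		/* nothing registered */
	else if (ftrace_ops_list->next == &ftrace_list_end)
		func = ftrace_ops_list->func;	/* one ops: call it directly */
	else
		func = ftrace_ops_list_func;	/* more than one: use the loop */

	ftrace_trace_function = func;		/* what the trampolines call */
}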
>
> > Ideally, if only a single callback (ftrace_ops) is registered to a
> > function, than that function should call a trampoline that will only
> > call that one callback without doing any other tests.
> >
> > This patch set adds this functionality to x86_64. If a callback is
> > registered to a function and there's no other callback registered to
> > that function that ftrace_ops will get its own trampoline allocated
> > for it that will call the function directly.
> >
> > Note, for dynamically allocated ftrace_ops (kprobes, perf, instance
> > directory function tracing), the dynamic trampoline will only be created
> > if CONFIG_PREEMPT is not set. That's because, until Paul finishes his
> > rcu_call_task() code, there's no safe way to know if a task was preempted
> > while on the trampoline and is waiting to run on it some more.
>
> Hmm, if we can declare "this ftrace_ops is permanent"(like finalizing) then
> we can allocate trampoline for such dynamic one. Since the kprobes actually
> doesn't need to free (or unregister) ftrace_ops, I can use it.
Yeah, if we add a ftrace_ops PERMANENT flag, then we could allow them
too.
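Just to sketch the idea (entirely hypothetical; no FTRACE_OPS_FL_PERMANENT
flag exists in this series, and the handler name is made up):

/* hypothetical flag meaning "this ops is never unregistered or freed" */
#define FTRACE_OPS_FL_PERMANENT		(1 << 30)

static struct ftrace_ops kprobe_like_ops = {
	.func	= my_kprobe_style_handler,	/* made-up handler name */
	.flags	= FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_PERMANENT,
};

Since such an ops (and therefore its trampoline) can never be freed out from
under a preempted task, the CONFIG_PREEMPT concern above would not apply to it.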
>
>
> > I need to write up a bunch of tests for this code, but currently it works
> > on the few tests I did manually. I didn't even run this code yet under
> > my full test suite, so it may very well have bugs in it that might be
> > easily triggered. But I wanted to get the code out for review to see
> > if anyone has any other idea to help enhance this feature.
>
> Yeah, I'll review it.
Thanks!
-- Steve
On Fri, 4 Jul 2014, Steven Rostedt wrote:
> Well, I guess the answer to that is what do you consider the trampoline?
> I'm currently considering it to be the assembly code that the
> mcount/fentry call jumps to. We only have two trampolines (three if you
> count the function graph code that will be called directly come 3.17).
> Those two are the normal ftrace_caller and the ftrace_regs_caller.
BTW, on those archs that support regs saving already, is there really a
reason not to kill ftrace_caller and keep just ftrace_regs_caller?
--
Jiri Kosina
SUSE Labs
On Thu, 3 Jul 2014, Steven Rostedt wrote:
> [ NOT READY FOR INCLUSION! ]
>
> Note, this is based off of my remove ftrace_start/stop() patch set.
>
> I've been wanting to do this for years, and just never gotten around to it.
> But with all this talk of kpatch and kgraft live kernel patching using
> the ftrace infrastructure, it seems like a good time to do it.
>
> The way the function callback mechanism works in ftrace is that if there's
> only one function callback registered, it will set the mcount/fentry
> trampoline to call that function directly. But as soon as you register
> another callback, the mcount trampoline calls a loop function that iterates
> over all the registered callbacks (ftrace_ops) checking their hash tables
> to see if the called function matches the ops before calling its callback.
> This happens even if the two registered functions are not even tracing
> the same function!
>
> This really sucks if you are tracing all functions, and then add a kprobe
> or perf event that traces a single function. That will cause all the
> other functions being traced to perform the loop test.
[ ... snip ... ]
For the record -- I just did a quick test, and kGraft works nicely on top
of this patchset.
Will be looking into the patches more closely later this week.
--
Jiri Kosina
SUSE Labs
On Mon, 7 Jul 2014 15:22:27 +0200 (CEST)
Jiri Kosina <[email protected]> wrote:
> On Fri, 4 Jul 2014, Steven Rostedt wrote:
>
> > Well, I guess the answer to that is what do you consider the trampoline?
> > I'm currently considering it to be the assembly code that the
> > mcount/fentry call jumps to. We only have two trampolines (three if you
> > count the function graph code that will be called directly come 3.17).
> > Those two are the normal ftrace_caller and the ftrace_regs_caller.
>
> BTW, on those archs that support regs saving already, is there really a
> reson not to kill ftrace_caller and keep just ftrace_regs_caller?
>
Consistency. Perhaps the two can be the same trampoline, which would be
trivial to implement, but I wouldn't kill one, as that would make the
generic code more complex.
-- Steve
On Thu, Jul 03, 2014 at 04:07:50PM -0400, Steven Rostedt wrote:
> [ NOT READY FOR INCLUSION! ]
>
> Note, this is based off of my remove ftrace_start/stop() patch set.
>
> I've been wanting to do this for years, and just never gotten around to it.
> But with all this talk of kpatch and kgraft live kernel patching using
> the ftrace infrastructure, it seems like a good time to do it.
>
> The way the function callback mechanism works in ftrace is that if there's
> only one function callback registered, it will set the mcount/fentry
> trampoline to call that function directly. But as soon as you register
> another callback, the mcount trampoline calls a loop function that iterates
> over all the registered callbacks (ftrace_ops) checking their hash tables
> to see if the called function matches the ops before calling its callback.
> This happens even if the two registered functions are not even tracing
> the same function!
>
> This really sucks if you are tracing all functions, and then add a kprobe
> or perf event that traces a single function. That will cause all the
> other functions being traced to perform the loop test.
>
> Ideally, if only a single callback (ftrace_ops) is registered to a
> function, than that function should call a trampoline that will only
> call that one callback without doing any other tests.
Hi Steven,
I did some testing with kpatch and I found one minor issue. The dynamically
allocated trampoline seems to confuse dump_stack() somewhat.
I added a dump_stack() call in my ftrace_ops callback function
(kpatch_ftrace_handler) which had a filter on meminfo_proc_show().
Without this patch set (3.14):
[ 1533.256720] [<ffffffff816ef90e>] dump_stack+0x45/0x56
[ 1533.256725] [<ffffffff8125cc80>] ? meminfo_proc_open+0x30/0x30
[ 1533.256731] [<ffffffffa07d1494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 1533.256734] [<ffffffff8125cc80>] ? meminfo_proc_open+0x30/0x30
[ 1533.256739] [<ffffffff81137a76>] ftrace_ops_list_func+0xf6/0x110
[ 1533.256744] [<ffffffff816ff9c2>] ftrace_regs_call+0x5/0x77
[ 1533.256748] [<ffffffff8120d4ce>] ? seq_read+0x2de/0x3b0
[ 1533.256751] [<ffffffff8120d4ce>] ? seq_read+0x2de/0x3b0
[ 1533.256755] [<ffffffff8125cc85>] ? meminfo_proc_show+0x5/0x5e0
[ 1533.256757] [<ffffffff8125cc85>] ? meminfo_proc_show+0x5/0x5e0
[ 1533.256760] [<ffffffff8120d35a>] ? seq_read+0x16a/0x3b0
[ 1533.256764] [<ffffffff81253f1d>] proc_reg_read+0x3d/0x80
[ 1533.256769] [<ffffffff811e93cb>] vfs_read+0x9b/0x160
[ 1533.256772] [<ffffffff811e9ed5>] SyS_read+0x55/0xd0
[ 1533.256776] [<ffffffff816ffc29>] system_call_fastpath+0x16/0x1b
With this patch set:
[ 934.546013] [<ffffffff81700312>] dump_stack+0x45/0x56
[ 934.546020] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
[ 934.546027] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 934.546058] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
[ 934.546062] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
[ 934.546067] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
[ 934.546071] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
[ 934.546075] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
[ 934.546081] [<ffffffff8125768d>] ? proc_reg_read+0x3d/0x80
[ 934.546088] [<ffffffff811f0668>] ? vfs_read+0x98/0x170
[ 934.546093] [<ffffffff811f1345>] ? SyS_read+0x55/0xd0
[ 934.546099] [<ffffffff81707969>] ? system_call_fastpath+0x16/0x1b
It correctly shows the traced function's callers, but they're shown as
unreliable (question marks). I'm not sure why that happens, but if
something goes wrong in a callback, this could make the reported stack
traces confusing. It could also cause trouble for code which relies on
reliable stack traces (e.g. kpatch).
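For reference, the test was essentially a module like this (a hedged sketch,
not actual kpatch code; the names test_ops and test_handler are made up):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/ftrace.h>
#include <linux/string.h>

static void notrace test_handler(unsigned long ip, unsigned long parent_ip,
				 struct ftrace_ops *op, struct pt_regs *regs)
{
	/* every call to the filtered function dumps a stack trace */
	dump_stack();
}

static struct ftrace_ops test_ops = {
	.func	= test_handler,
	.flags	= FTRACE_OPS_FL_SAVE_REGS,
};

static int __init test_init(void)
{
	/* only trace meminfo_proc_show() */
	ftrace_set_filter(&test_ops, "meminfo_proc_show",
			  strlen("meminfo_proc_show"), 0);
	return register_ftrace_function(&test_ops);
}

static void __exit test_exit(void)
{
	unregister_ftrace_function(&test_ops);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");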
>
> This patch set adds this functionality to x86_64. If a callback is
> registered to a function and there's no other callback registered to
> that function that ftrace_ops will get its own trampoline allocated
> for it that will call the function directly.
>
> Note, for dynamically allocated ftrace_ops (kprobes, perf, instance
> directory function tracing), the dynamic trampoline will only be created
> if CONFIG_PREEMPT is not set. That's because, until Paul finishes his
> rcu_call_task() code, there's no safe way to know if a task was preempted
> while on the trampoline and is waiting to run on it some more.
>
> I need to write up a bunch of tests for this code, but currently it works
> on the few tests I did manually. I didn't even run this code yet under
> my full test suite, so it may very well have bugs in it that might be
> easily triggered. But I wanted to get the code out for review to see
> if anyone has any other idea to help enhance this feature.
>
> If you want a git repo to play with this, you can get it from below.
> That repo will rebase often, so do not build against it.
>
> Enjoy,
>
> -- Steve
>
>
> git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
> rfc/trampoline
>
> Head SHA1: 4d781e010842a56f8e7c1bbe309e38075c277c45
>
>
> Steven Rostedt (Red Hat) (3):
> ftrace/x86: Add dynamic allocated trampoline for ftrace_ops
> ftrace/x86: Show trampoline call function in enabled_functions
> ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines
>
> ----
> arch/x86/kernel/ftrace.c | 240 ++++++++++++++++++++++++++++++++++++++++++--
> arch/x86/kernel/mcount_64.S | 26 ++++-
> include/linux/ftrace.h | 8 ++
> kernel/trace/ftrace.c | 86 +++++++++++++++-
> 4 files changed, 344 insertions(+), 16 deletions(-)
>
--
Josh
On Thu, 10 Jul 2014, Josh Poimboeuf wrote:
> I did some testing with kpatch and I found one minor issue. The dynamically
> allocated trampoline seems to confuse dump_stack() somewhat.
>
> I added a dump_stack() call in my ftrace_ops callback function
> (kpatch_ftrace_handler) which had a filter on meminfo_proc_show().
Interesting. Are you using the dwarf2 unwinder for stack dumping by any
chance? It seems to get things right here. Will look into it more
tomorrow.
--
Jiri Kosina
SUSE Labs
On Thu, Jul 10, 2014 at 11:44:43PM +0200, Jiri Kosina wrote:
> On Thu, 10 Jul 2014, Josh Poimboeuf wrote:
>
> > I did some testing with kpatch and I found one minor issue. The dynamically
> > allocated trampoline seems to confuse dump_stack() somewhat.
> >
> > I added a dump_stack() call in my ftrace_ops callback function
> > (kpatch_ftrace_handler) which had a filter on meminfo_proc_show().
>
> Interesting. Are you using dwarf2 unwinder for stack dumping by any
> chance?
I think so (but not sure). How can I check? I do see the following
macros being defined in my kernel build:
-DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1
> It seems to get things right here. Will look into it more tomorrow.
Thanks, please let me know what you find...
--
Josh
(2014/07/11 6:44), Jiri Kosina wrote:
> On Thu, 10 Jul 2014, Josh Poimboeuf wrote:
>
>> I did some testing with kpatch and I found one minor issue. The dynamically
>> allocated trampoline seems to confuse dump_stack() somewhat.
>>
>> I added a dump_stack() call in my ftrace_ops callback function
>> (kpatch_ftrace_handler) which had a filter on meminfo_proc_show().
>
> Interesting. Are you using dwarf2 unwinder for stack dumping by any
> chance? It seems to get things right here. Will look into it more
> tomorrow.
Hmm, can the dwarf2 unwinder work with the trampoline method?
Since the trampoline is just a copy of instructions, it will not
have CFI (which is stored in the dwarf section), so I guess it
may not work...
A frame pointer (push bp and save sp to bp on entry) can
work anyway.
Thank you,
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On Fri, 11 Jul 2014, Masami Hiramatsu wrote:
> >> I did some testing with kpatch and I found one minor issue. The dynamically
> >> allocated trampoline seems to confuse dump_stack() somewhat.
> >>
> >> I added a dump_stack() call in my ftrace_ops callback function
> >> (kpatch_ftrace_handler) which had a filter on meminfo_proc_show().
> >
> > Interesting. Are you using dwarf2 unwinder for stack dumping by any
> > chance? It seems to get things right here. Will look into it more
> > tomorrow.
>
> Hmm, can dwarf2 unwinder work on the trampoline method? Since the
> trampoline just a copy of instructions which will not have CFI(which is
> stored in dwarf section), I guess it may not work... Frame pointer (push
> bp and save sp to bp on the entry) can work anyway.
That was exactly my idea and that's why I asked, thanks for confirming.
I am afraid we'll have to declare dynamic trampolines incompatible with
dwarf2 stack dumping.
--
Jiri Kosina
SUSE Labs
On Fri, Jul 11, 2014 at 03:24:28PM +0200, Jiri Kosina wrote:
> On Fri, 11 Jul 2014, Masami Hiramatsu wrote:
>
> > >> I did some testing with kpatch and I found one minor issue. The dynamically
> > >> allocated trampoline seems to confuse dump_stack() somewhat.
> > >>
> > >> I added a dump_stack() call in my ftrace_ops callback function
> > >> (kpatch_ftrace_handler) which had a filter on meminfo_proc_show().
> > >
> > > Interesting. Are you using dwarf2 unwinder for stack dumping by any
> > > chance? It seems to get things right here. Will look into it more
> > > tomorrow.
> >
> > Hmm, can dwarf2 unwinder work on the trampoline method? Since the
> > trampoline just a copy of instructions which will not have CFI(which is
> > stored in dwarf section), I guess it may not work... Frame pointer (push
> > bp and save sp to bp on the entry) can work anyway.
>
> That was exactly my idea and that's why I asked, thanks for confirming.
>
> I am afraid we'll have to declare dynamic trampolines incompatible with
> drawf2 stack dumping.
In this case, the problem wasn't related to DWARF, because dump_stack()
uses the frame pointer to unwind the stack. I was able to fix the
problem with the following patch.
---
From 951d2aec17885a62905df6b910dc705d99c63993 Mon Sep 17 00:00:00 2001
From: Josh Poimboeuf <[email protected]>
Date: Fri, 11 Jul 2014 08:58:33 -0500
Subject: [PATCH] x86/dumpstack: fix stack traces for generated code
If a function in the stack trace is dynamically generated, for example an
ftrace dynamically generated trampoline, print_context_stack() gets confused
and ends up showing all the following addresses as unreliable:
[ 934.546013] [<ffffffff81700312>] dump_stack+0x45/0x56
[ 934.546020] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
[ 934.546027] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 934.546058] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
[ 934.546062] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
[ 934.546067] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
[ 934.546071] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
[ 934.546075] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
[ 934.546081] [<ffffffff8125768d>] ? proc_reg_read+0x3d/0x80
[ 934.546088] [<ffffffff811f0668>] ? vfs_read+0x98/0x170
[ 934.546093] [<ffffffff811f1345>] ? SyS_read+0x55/0xd0
[ 934.546099] [<ffffffff81707969>] ? system_call_fastpath+0x16/0x1b
Once it encounters an address which is not in the kernel's text area, it gets
confused and stops updating the frame pointer.
The __kernel_text_address() check isn't needed when determining whether an
address is reliable. It's only needed when deciding whether to print an
unreliable address.
Here's the same stack trace with this patch:
[ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
[ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
[ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
[ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
[ 1314.612309] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
[ 1314.612311] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
[ 1314.612312] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
[ 1314.612314] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
[ 1314.612315] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
[ 1314.612318] [<ffffffff8125768d>] proc_reg_read+0x3d/0x80
[ 1314.612320] [<ffffffff811f0668>] vfs_read+0x98/0x170
[ 1314.612322] [<ffffffff811f1345>] SyS_read+0x55/0xd0
[ 1314.612324] [<ffffffff81707969>] system_call_fastpath+0x16/0x1b
---
arch/x86/kernel/dumpstack.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index b74ebc7..db0a33c 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -102,14 +102,13 @@ print_context_stack(struct thread_info *tinfo,
 		unsigned long addr;
 
 		addr = *stack;
-		if (__kernel_text_address(addr)) {
-			if ((unsigned long) stack == bp + sizeof(long)) {
-				ops->address(data, addr, 1);
-				frame = frame->next_frame;
-				bp = (unsigned long) frame;
-			} else {
-				ops->address(data, addr, 0);
-			}
+		if ((unsigned long) stack == bp + sizeof(long)) {
+			ops->address(data, addr, 1);
+			frame = frame->next_frame;
+			bp = (unsigned long) frame;
+			print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
+		} else if (__kernel_text_address(addr)) {
+			ops->address(data, addr, 0);
 			print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
 		}
 		stack++;
--
1.9.3
(2014/07/11 23:29), Josh Poimboeuf wrote:
[...]
>
>>From 951d2aec17885a62905df6b910dc705d99c63993 Mon Sep 17 00:00:00 2001
> From: Josh Poimboeuf <[email protected]>
> Date: Fri, 11 Jul 2014 08:58:33 -0500
> Subject: [PATCH] x86/dumpstack: fix stack traces for generated code
>
> If a function in the stack trace is dynamically generated, for example an
> ftrace dynamically generated trampoline, print_context_stack() gets confused
> and ends up showing all the following addresses as unreliable:
>
> [ 934.546013] [<ffffffff81700312>] dump_stack+0x45/0x56
> [ 934.546020] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
> [ 934.546027] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
> [ 934.546058] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
> [ 934.546062] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
> [ 934.546067] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
> [ 934.546071] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
> [ 934.546075] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
> [ 934.546081] [<ffffffff8125768d>] ? proc_reg_read+0x3d/0x80
> [ 934.546088] [<ffffffff811f0668>] ? vfs_read+0x98/0x170
> [ 934.546093] [<ffffffff811f1345>] ? SyS_read+0x55/0xd0
> [ 934.546099] [<ffffffff81707969>] ? system_call_fastpath+0x16/0x1b
>
> Once it encounters an address which is not in the kernel's text area, it gets
> confused and stops updating the frame pointer.
Right, this uses module_alloc() to get memory for the trampoline, but
that just allocates a page in the executable vmalloc area. We need a hack
in __kernel_text_address() if we really want to use that.
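Something along these lines, for example (a sketch only;
is_ftrace_trampoline() is a hypothetical helper, nothing like it exists yet):

/*
 * Hypothetical helper: report whether addr falls inside one of the
 * dynamically allocated ftrace trampolines.  ftrace would have to keep
 * track of the trampolines it hands out for this to work.
 */
bool is_ftrace_trampoline(unsigned long addr);

	/* then, in __kernel_text_address(), after the existing checks: */
	if (is_ftrace_trampoline(addr))
		return 1;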
> The __kernel_text_address() check isn't needed when determining whether an
> address is reliable. It's only needed when deciding whether to print an
> unreliable address.
Yeah, I guess that is for the case that the frame pointer is broken.
>
> Here's the same stack trace with this patch:
>
> [ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
> [ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
> [ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
> [ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
Here, this still has a wrong entry. Maybe the trampoline doesn't set up
the frame pointer (bp) correctly.
Thank you,
> [ 1314.612309] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
> [ 1314.612311] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
> [ 1314.612312] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
> [ 1314.612314] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
> [ 1314.612315] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
> [ 1314.612318] [<ffffffff8125768d>] proc_reg_read+0x3d/0x80
> [ 1314.612320] [<ffffffff811f0668>] vfs_read+0x98/0x170
> [ 1314.612322] [<ffffffff811f1345>] SyS_read+0x55/0xd0
> [ 1314.612324] [<ffffffff81707969>] system_call_fastpath+0x16/0x1b
> ---
> arch/x86/kernel/dumpstack.c | 15 +++++++--------
> 1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index b74ebc7..db0a33c 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -102,14 +102,13 @@ print_context_stack(struct thread_info *tinfo,
> unsigned long addr;
>
> addr = *stack;
> - if (__kernel_text_address(addr)) {
> - if ((unsigned long) stack == bp + sizeof(long)) {
> - ops->address(data, addr, 1);
> - frame = frame->next_frame;
> - bp = (unsigned long) frame;
> - } else {
> - ops->address(data, addr, 0);
> - }
> + if ((unsigned long) stack == bp + sizeof(long)) {
> + ops->address(data, addr, 1);
> + frame = frame->next_frame;
> + bp = (unsigned long) frame;
> + print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
> + } else if (__kernel_text_address(addr)) {
> + ops->address(data, addr, 0);
> print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
> }
> stack++;
>
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
Hi Masami,
On Mon, 14 Jul 2014 10:35:21 +0900, Masami Hiramatsu wrote:
> (2014/07/11 23:29), Josh Poimboeuf wrote:
> [...]
>>
>>>From 951d2aec17885a62905df6b910dc705d99c63993 Mon Sep 17 00:00:00 2001
>> From: Josh Poimboeuf <[email protected]>
>> Date: Fri, 11 Jul 2014 08:58:33 -0500
>> Subject: [PATCH] x86/dumpstack: fix stack traces for generated code
>>
>> If a function in the stack trace is dynamically generated, for example an
>> ftrace dynamically generated trampoline, print_context_stack() gets confused
>> and ends up showing all the following addresses as unreliable:
>>
>> [ 934.546013] [<ffffffff81700312>] dump_stack+0x45/0x56
>> [ 934.546020] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
>> [ 934.546027] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
>> [ 934.546058] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>> [ 934.546062] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>> [ 934.546067] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>> [ 934.546071] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>> [ 934.546075] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
>> [ 934.546081] [<ffffffff8125768d>] ? proc_reg_read+0x3d/0x80
>> [ 934.546088] [<ffffffff811f0668>] ? vfs_read+0x98/0x170
>> [ 934.546093] [<ffffffff811f1345>] ? SyS_read+0x55/0xd0
>> [ 934.546099] [<ffffffff81707969>] ? system_call_fastpath+0x16/0x1b
>>
>> Once it encounters an address which is not in the kernel's text area, it gets
>> confused and stops updating the frame pointer.
>
> Right, this uses a module_alloc to get a memory for trampline, but
> it just allocates a page in executable vmalloc area. We need a hack
> in __kernel_text_address if we really want to use that.
>
>> The __kernel_text_address() check isn't needed when determining whether an
>> address is reliable. It's only needed when deciding whether to print an
>> unreliable address.
>
> Yeah, I guess that is for the case that the frame pointer is broken.
>
>>
>> Here's the same stack trace with this patch:
>>
>> [ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
>> [ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
>> [ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
>> [ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
>
> Here, this still has a wrong entry. Maybe the trampline doesn't setup
> frame pointer (bp) correctly.
Hmm.. are you talking about the hex address above? I guess it's a valid
entry in the (dynamic) trampoline, but it has no symbol..
>
>> [ 1314.612309] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>> [ 1314.612311] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>> [ 1314.612312] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>> [ 1314.612314] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>> [ 1314.612315] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
But these seem to be wrong - there are duplicate entries, and at least
some of these functions should be shown correctly IMHO. I guess it's
because the trampoline didn't save rbp to the stack right below the
return address, as dumpstack requires.
Thanks,
Namhyung
>> [ 1314.612318] [<ffffffff8125768d>] proc_reg_read+0x3d/0x80
>> [ 1314.612320] [<ffffffff811f0668>] vfs_read+0x98/0x170
>> [ 1314.612322] [<ffffffff811f1345>] SyS_read+0x55/0xd0
>> [ 1314.612324] [<ffffffff81707969>] system_call_fastpath+0x16/0x1b
>> ---
>> arch/x86/kernel/dumpstack.c | 15 +++++++--------
>> 1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
>> index b74ebc7..db0a33c 100644
>> --- a/arch/x86/kernel/dumpstack.c
>> +++ b/arch/x86/kernel/dumpstack.c
>> @@ -102,14 +102,13 @@ print_context_stack(struct thread_info *tinfo,
>> unsigned long addr;
>>
>> addr = *stack;
>> - if (__kernel_text_address(addr)) {
>> - if ((unsigned long) stack == bp + sizeof(long)) {
>> - ops->address(data, addr, 1);
>> - frame = frame->next_frame;
>> - bp = (unsigned long) frame;
>> - } else {
>> - ops->address(data, addr, 0);
>> - }
>> + if ((unsigned long) stack == bp + sizeof(long)) {
>> + ops->address(data, addr, 1);
>> + frame = frame->next_frame;
>> + bp = (unsigned long) frame;
>> + print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
>> + } else if (__kernel_text_address(addr)) {
>> + ops->address(data, addr, 0);
>> print_ftrace_graph_addr(addr, data, ops, tinfo, graph);
>> }
>> stack++;
>>
(2014/07/14 16:16), Namhyung Kim wrote:
> Hi Masami,
>
> On Mon, 14 Jul 2014 10:35:21 +0900, Masami Hiramatsu wrote:
>> (2014/07/11 23:29), Josh Poimboeuf wrote:
>> [...]
>>>
>>> >From 951d2aec17885a62905df6b910dc705d99c63993 Mon Sep 17 00:00:00 2001
>>> From: Josh Poimboeuf <[email protected]>
>>> Date: Fri, 11 Jul 2014 08:58:33 -0500
>>> Subject: [PATCH] x86/dumpstack: fix stack traces for generated code
>>>
>>> If a function in the stack trace is dynamically generated, for example an
>>> ftrace dynamically generated trampoline, print_context_stack() gets confused
>>> and ends up showing all the following addresses as unreliable:
>>>
>>> [ 934.546013] [<ffffffff81700312>] dump_stack+0x45/0x56
>>> [ 934.546020] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
>>> [ 934.546027] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
>>> [ 934.546058] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>>> [ 934.546062] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>>> [ 934.546067] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>>> [ 934.546071] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>>> [ 934.546075] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
>>> [ 934.546081] [<ffffffff8125768d>] ? proc_reg_read+0x3d/0x80
>>> [ 934.546088] [<ffffffff811f0668>] ? vfs_read+0x98/0x170
>>> [ 934.546093] [<ffffffff811f1345>] ? SyS_read+0x55/0xd0
>>> [ 934.546099] [<ffffffff81707969>] ? system_call_fastpath+0x16/0x1b
>>>
>>> Once it encounters an address which is not in the kernel's text area, it gets
>>> confused and stops updating the frame pointer.
>>
>> Right, this uses a module_alloc to get a memory for trampline, but
>> it just allocates a page in executable vmalloc area. We need a hack
>> in __kernel_text_address if we really want to use that.
>>
>>> The __kernel_text_address() check isn't needed when determining whether an
>>> address is reliable. It's only needed when deciding whether to print an
>>> unreliable address.
>>
>> Yeah, I guess that is for the case that the frame pointer is broken.
>>
>>>
>>> Here's the same stack trace with this patch:
>>>
>>> [ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
>>> [ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
>>> [ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
>>> [ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
>>
>> Here, this still has a wrong entry. Maybe the trampline doesn't setup
>> frame pointer (bp) correctly.
>
> Hmm.. are you saying about the hex address above? I guess it's a valid
> entry in the (dynamic) trampoline, but has no symbol..
Ah, indeed. (BTW, why is it one less than the address ...? printk's spec?)
>>> [ 1314.612309] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>>> [ 1314.612311] [<ffffffff812143ae>] ? seq_read+0x2de/0x3b0
>>> [ 1314.612312] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>>> [ 1314.612314] [<ffffffff8125f5b5>] ? meminfo_proc_show+0x5/0x5e0
>>> [ 1314.612315] [<ffffffff8121423a>] ? seq_read+0x16a/0x3b0
>
> But these seem to be wrong - there're duplicate entries and they should
> show some of these functions (at least) correctly IMHO. I guess it's
> because the trampoline didn't save rbp to the stack right below the
> return address as dumpstack requires.
Right, the last seq_read should be reliable. Thank you for pointing that out.
Thanks!
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
2014-07-14 (Mon), 17:18 +0900, Masami Hiramatsu:
> (2014/07/14 16:16), Namhyung Kim wrote:
> > Hi Masami,
> >
> > On Mon, 14 Jul 2014 10:35:21 +0900, Masami Hiramatsu wrote:
> >> (2014/07/11 23:29), Josh Poimboeuf wrote:
> >>> Here's the same stack trace with this patch:
> >>>
> >>> [ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
> >>> [ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
> >>> [ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
> >>> [ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
> >>
> >> Here, this still has a wrong entry. Maybe the trampline doesn't setup
> >> frame pointer (bp) correctly.
> >
> > Hmm.. are you saying about the hex address above? I guess it's a valid
> > entry in the (dynamic) trampoline, but has no symbol..
>
> Ah, indeed. (BTW, why is it one less than the address ...? printk's spec?)
Argh, it seems like a bug in printk's %pB format processing.. :-/
It subtracts 1 from the address for the stacktrace. Please see the commits
0f77a8d37825 ("vsprintf: Introduce %pB format specifier") and
71f9e59800e5 ("x86, dumpstack: Use %pB format specifier for stack
trace") for details. But it should restore the original address when it
fails to find a symbol for that address.
I'll send a fix soon.
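The fix will probably look something like this (a rough sketch, assuming the
-1 for %pB ends up in __sprint_symbol() via its symbol_offset argument; not
the final patch):

	/* in __sprint_symbol(): address already had symbol_offset added */
	name = kallsyms_lookup(address, &size, &offset, &modname, buffer);
	if (!name)
		/* lookup failed: print the original, un-adjusted address */
		return sprintf(buffer, "0x%lx", address - symbol_offset);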
Thanks,
Namhyung
(2014/07/14 23:18), Namhyung Kim wrote:
> 2014-07-14 (월), 17:18 +0900, Masami Hiramatsu:
>> (2014/07/14 16:16), Namhyung Kim wrote:
>>> Hi Masami,
>>>
>>> On Mon, 14 Jul 2014 10:35:21 +0900, Masami Hiramatsu wrote:
>>>> (2014/07/11 23:29), Josh Poimboeuf wrote:
>>>>> Here's the same stack trace with this patch:
>>>>>
>>>>> [ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56
>>>>> [ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30
>>>>> [ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch]
>>>>> [ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3
>>>>
>>>> Here, this still has a wrong entry. Maybe the trampline doesn't setup
>>>> frame pointer (bp) correctly.
>>>
>>> Hmm.. are you saying about the hex address above? I guess it's a valid
>>> entry in the (dynamic) trampoline, but has no symbol..
>>
>> Ah, indeed. (BTW, why is it one less than the address ...? printk's spec?)
>
> Argh, it seems like a bug in printk's %pB format processing.. :-/
>
> It subtract 1 from the address for stacktrace. Please see the commit
> 0f77a8d37825 ("vsprintf: Introduce %pB format specifier") and
> 71f9e59800e5 ("x86, dumpstack: Use %pB format specifier for stack
> trace") for details. But it should restore the original address when it
> failed to find a symbol for that address.
Ah I see.
>
> I'll send a fix soon.
Thanks! :)
--
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: [email protected]
On 07/03, Steven Rostedt wrote:
>
> [ NOT READY FOR INCLUSION! ]
>
> Note, this is based off of my remove ftrace_start/stop() patch set.
So I simply pulled your tree. I can't really comment on these changes simply
because I do not understand this code. But I am hunting for a RHEL bug in
(probably) this area, so I decided to take a look in the hope that maybe
this can help me understand the current code ;)
> The way the function callback mechanism works in ftrace is that if there's
> only one function callback registered, it will set the mcount/fentry
> trampoline to call that function directly. But as soon as you register
> another callback, the mcount trampoline calls a loop function that iterates
> over all the registered callbacks (ftrace_ops) checking their hash tables
> to see if the called function matches the ops before calling its callback.
> This happens even if the two registered functions are not even tracing
> the same function!
>
> This really sucks if you are tracing all functions, and then add a kprobe
> or perf event that traces a single function. That will cause all the
> other functions being traced to perform the loop test.
But isn't this even worse, or have I missed something? I mean, currently,
even if you trace nothing and then add a KPROBE_FLAG_FTRACE kprobe,
kprobe_ftrace_handler() is called by ftrace_ops_list_func()?
After these changes it seems that kprobe will use a trampoline.
And I can't understand even the basic code. Say, __ftrace_hash_rec_update:
	if (inc) {
		rec->flags++;
		if (FTRACE_WARN_ON(ftrace_rec_count(rec) == FTRACE_REF_MAX))
			return;
		/*
		 * If there's only a single callback registered to a
		 * function, and the ops has a trampoline registered
		 * for it, then we can call it directly.
		 */
		if (ftrace_rec_count(rec) == 1 && ops->trampoline) {
			rec->flags |= FTRACE_FL_TRAMP;
			ops->trampolines++;
		} else {
			/*
			 * If we are adding another function callback
			 * to this function, and the previous had a
			 * trampoline used, then we need to go back to
			 * the default trampoline.
			 */
			rec->flags &= ~FTRACE_FL_TRAMP;
			/* remove trampolines from any ops for this rec */
			ftrace_clear_tramps(rec);
		}
It seems that "else if (ftrace_rec_count(rec) == 2)" can avoid the unnecessary
ftrace_clear_tramps() ? And not only unnecessary, ftrace_clear_tramps() decrements
->trampolines, can't this break the accounting?
	} else {
		if (FTRACE_WARN_ON(ftrace_rec_count(rec) == 0))
			return;
		rec->flags--;
		if (ops->trampoline && !ftrace_rec_count(rec))
			ftrace_remove_tramp(ops, rec);
I am wondering what we should do if ftrace_rec_count() becomes 1 again...
ftrace_save_ops_tramp_hash():
	do_for_each_ftrace_rec(pg, rec) {
		if (ftrace_rec_count(rec) == 1 &&
		    ftrace_ops_test(ops, rec->ip, rec)) {
			/* This record had better have a trampoline */
			if (FTRACE_WARN_ON(!(rec->flags & FTRACE_FL_TRAMP_EN)))
				return -1;
Yes, but I can't understand how this can work.
This ops can have ->trampolines > 0, but FTRACE_FL_TRAMP_EN can be cleared
by another ftrace_ops?
Suppose there is a single tracer of this function, rec->flags = TRAMP | TRAMP_EN.
Suppose also that it traces more than 1 function, so ->trampolines > 1.
Another tracer comes, __ftrace_hash_rec_update() clears TRAMP. But it should
also do ftrace_check_record() and this should clear TRAMP_EN?
And yes, I can trigger this bug if I simply do "echo function > current_tracer"
and then add/remove a KPROBE_FLAG_FTRACE kprobe.
And you know, when I try to read this code I can't distinguish ->trampoline
from ->trampolines ;)
Oleg.
On Tue, 22 Jul 2014 18:47:07 +0200
Oleg Nesterov <[email protected]> wrote:
> On 07/03, Steven Rostedt wrote:
> >
> > [ NOT READY FOR INCLUSION! ]
> >
> > Note, this is based off of my remove ftrace_start/stop() patch set.
>
> So I simply pulled your tree. I can't really comment these changes simply
> because I do not understand this code. But I am hunting for RHEL bug in
> (probably) this area, so I decided to take a look in a hope that may be
> this can help me to understand the current code ;)
Send me a ping on my RH email and I'll take a look at that BZ.
Note, I've played a little with this recently. I should make sure that
I push the latest. Warning, that trampoline branch will rebase.
>
> > The way the function callback mechanism works in ftrace is that if there's
> > only one function callback registered, it will set the mcount/fentry
> > trampoline to call that function directly. But as soon as you register
> > another callback, the mcount trampoline calls a loop function that iterates
> > over all the registered callbacks (ftrace_ops) checking their hash tables
> > to see if the called function matches the ops before calling its callback.
> > This happens even if the two registered functions are not even tracing
> > the same function!
> >
> > This really sucks if you are tracing all functions, and then add a kprobe
> > or perf event that traces a single function. That will cause all the
> > other functions being traced to perform the loop test.
>
> But this is even worse or I missed something? I mean, currently even
> if you trace nothing and then add a KPROBE_FLAG_FTRACE kprobe, then
> kprobe_ftrace_handler() is called by ftrace_ops_list_func() ?
It shouldn't be. It should get called directly from the trampoline. The
allocated trampoline should never call the list op. Well, it might
during the conversion for safety, but after that, trampolines should
only call the registered ftrace_ops->func directly.
>
> After these changes it seems that kprobe will use a trampoline.
>
> And I can't understand even the basic code. Say, __ftrace_hash_rec_update:
Well, to make you feel better, __ftrace_hash_rec_update() happens to
be one of the most complex functions in ftrace.
>
> if (inc) {
> rec->flags++;
> if (FTRACE_WARN_ON(ftrace_rec_count(rec) == FTRACE_REF_MAX))
> return;
>
> /*
> * If there's only a single callback registered to a
> * function, and the ops has a trampoline registered
> * for it, then we can call it directly.
> */
> if (ftrace_rec_count(rec) == 1 && ops->trampoline) {
> rec->flags |= FTRACE_FL_TRAMP;
> ops->trampolines++;
> } else {
> /*
> * If we are adding another function callback
> * to this function, and the previous had a
> * trampoline used, then we need to go back to
> * the default trampoline.
> */
> rec->flags &= ~FTRACE_FL_TRAMP;
>
> /* remove trampolines from any ops for this rec */
> ftrace_clear_tramps(rec);
> }
>
> It seems that "else if (ftrace_rec_count(rec) == 2)" can avoid the unnecessary
> ftrace_clear_tramps() ? And not only unnecessary, ftrace_clear_tramps() decrements
> ->trampolines, can't this break the accounting?
Ug, you're right.
But I think it should be "else if (rec->flags & FTRACE_FL_TRAMP)"
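I.e. something like this (illustrative only, not a tested patch):

	if (ftrace_rec_count(rec) == 1 && ops->trampoline) {
		rec->flags |= FTRACE_FL_TRAMP;
		ops->trampolines++;
	} else if (rec->flags & FTRACE_FL_TRAMP) {
		/*
		 * Another callback was added to a function that was using
		 * a trampoline; fall back to the default trampoline and
		 * drop the per-ops trampoline accounting.
		 */
		rec->flags &= ~FTRACE_FL_TRAMP;
		ftrace_clear_tramps(rec);
	}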
>
> } else {
> if (FTRACE_WARN_ON(ftrace_rec_count(rec) == 0))
> return;
> rec->flags--;
>
> if (ops->trampoline && !ftrace_rec_count(rec))
> ftrace_remove_tramp(ops, rec);
>
> I am wondering what should we do if ftrace_rec_count() becomes 1 again...
>
> ftrace_save_ops_tramp_hash():
>
> do_for_each_ftrace_rec(pg, rec) {
> if (ftrace_rec_count(rec) == 1 &&
> ftrace_ops_test(ops, rec->ip, rec)) {
>
> /* This record had better have a trampoline */
> if (FTRACE_WARN_ON(!(rec->flags & FTRACE_FL_TRAMP_EN)))
> return -1;
>
> Yes, but I can't understand how this can work.
I wanted the back-to-1 case to happen after we get the up-to-one case
working. That is, I don't want to worry about it now ;-) As you can
see, this code has enough things to try to keep straight without adding
more complexity to the mix.
>
> This ops can have ->trampolines > 0, but FTRACE_FL_TRAMP_EN can be cleared
> by another ftrace_ops?
>
> Suppose there is a single tracer of this function, rec->flags = TRAMP | TRAMP_EN.
> Suppose also that it traces more than 1 function, so ->trampolines > 1.
>
> Another tracer comes, __ftrace_hash_rec_update() clears TRAMP. But it should
> also do ftrace_check_record() and this should clear TRAMP_EN?
>
> And yes, I can trigger this bug if I simply do "echo function > current_tracer"
> and then add/remove a KPROBE_FLAG_FTRACE kprobe.
>
>
> And you know, when I try to read this code I can't distinguish ->trampoline
> from ->trampolines ;)
Good point. I'll rename trampolines to nr_trampolines. At least that
way it will stick out that it's a counter.
BTW, for someone who says they can't understand the code, you give
pretty good feedback.
-- Steve
On 07/22, Steven Rostedt wrote:
>
> On Tue, 22 Jul 2014 18:47:07 +0200
> Oleg Nesterov <[email protected]> wrote:
>
> > On 07/03, Steven Rostedt wrote:
> >
> > > The way the function callback mechanism works in ftrace is that if there's
> > > only one function callback registered, it will set the mcount/fentry
> > > trampoline to call that function directly. But as soon as you register
> > > another callback, the mcount trampoline calls a loop function that iterates
> > > over all the registered callbacks (ftrace_ops) checking their hash tables
> > > to see if the called function matches the ops before calling its callback.
> > > This happens even if the two registered functions are not even tracing
> > > the same function!
> > >
> > > This really sucks if you are tracing all functions, and then add a kprobe
> > > or perf event that traces a single function. That will cause all the
> > > other functions being traced to perform the loop test.
> >
> > But this is even worse or I missed something? I mean, currently even
> > if you trace nothing and then add a KPROBE_FLAG_FTRACE kprobe, then
> > kprobe_ftrace_handler() is called by ftrace_ops_list_func() ?
>
> It shouldn't be. It should get called directly from the trampoline. The
> allocated trampoline should never call the list op. Well, it might
> during the conversion for safety, but after that, trampolines should
> only call the registered ftrace_ops->func directly.
I meant the current code (I am reading 3.16-rc2). Even if we have a single
KPROBE_FLAG_FTRACE kprobe, kprobe_ftrace_handler() won't be called directly.
Or did I misunderstand your reply? Just in case, let me check...
With this stupid patch
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -4464,6 +4464,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
 				printk("op=%p %pS\n", op, op);
 				goto out;
 			}
+			pr_crit("LIST_FUNC -> %pf()\n", op->func);
 			op->func(ip, parent_ip, op, regs);
 		}
 	} while_for_each_ftrace_op(op);
I do
# cd /sys/kernel/debug/tracing/
# echo "p:xx SyS_prctl+0x1c" >| kprobe_events
# cat ../kprobes/list
ffffffff81056c4c k SyS_prctl+0x1c [DISABLED][FTRACE]
# echo 1 >| events/kprobes/xx/enable
#
# perl -e 'syscall 157,-1'
# dmesg
LIST_FUNC -> kprobe_ftrace_handler()
so it is really called by the loop test code.
And I guess that after your patches kprobe_ftrace_handler() should be called
from the trampoline in this case.
> > ftrace_save_ops_tramp_hash():
> >
> > do_for_each_ftrace_rec(pg, rec) {
> > if (ftrace_rec_count(rec) == 1 &&
> > ftrace_ops_test(ops, rec->ip, rec)) {
> >
> > /* This record had better have a trampoline */
> > if (FTRACE_WARN_ON(!(rec->flags & FTRACE_FL_TRAMP_EN)))
> > return -1;
> >
> > Yes, but I can't understand how this can work.
>
> I wanted the back to 1 case to happen after we get the up to one case
> working. That is, I don't want to worry about it now ;-) As you can
> see, this code has enough things to try to keep straight without adding
> more complexity to the mix.
Yes, I see... but note that this WARN_ON() looks wrong in any case. At
least currently.
Oleg.
On Wed, 23 Jul 2014 14:08:05 +0200
Oleg Nesterov <[email protected]> wrote:
> On 07/22, Steven Rostedt wrote:
> >
> > On Tue, 22 Jul 2014 18:47:07 +0200
> > Oleg Nesterov <[email protected]> wrote:
> >
> > > On 07/03, Steven Rostedt wrote:
> > >
> > > > The way the function callback mechanism works in ftrace is that if there's
> > > > only one function callback registered, it will set the mcount/fentry
> > > > trampoline to call that function directly. But as soon as you register
> > > > another callback, the mcount trampoline calls a loop function that iterates
> > > > over all the registered callbacks (ftrace_ops) checking their hash tables
> > > > to see if the called function matches the ops before calling its callback.
> > > > This happens even if the two registered functions are not even tracing
> > > > the same function!
> > > >
> > > > This really sucks if you are tracing all functions, and then add a kprobe
> > > > or perf event that traces a single function. That will cause all the
> > > > other functions being traced to perform the loop test.
> > >
> > > But this is even worse or I missed something? I mean, currently even
> > > if you trace nothing and then add a KPROBE_FLAG_FTRACE kprobe, then
> > > kprobe_ftrace_handler() is called by ftrace_ops_list_func() ?
> >
> > It shouldn't be. It should get called directly from the trampoline. The
> > allocated trampoline should never call the list op. Well, it might
> > during the conversion for safety, but after that, trampolines should
> > only call the registered ftrace_ops->func directly.
>
> I meant the current code (I am reading 3.16-rc2). Even if we have a single
> KPROBE_FLAG_FTRACE kprobe, kprobe_ftrace_handler() won't be called directly.
>
Oh, I thought we were still talking about the trampolines; those aren't
in 3.16-rc2.
> Or did I misunderstand your reply? Just in case, let me check...
>
> With this stupid patch
>
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -4464,6 +4464,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
> printk("op=%p %pS\n", op, op);
> goto out;
> }
> + pr_crit("LIST_FUNC -> %pf()\n", op->func);
> op->func(ip, parent_ip, op, regs);
> }
> } while_for_each_ftrace_op(op);
>
> I do
> # cd /sys/kernel/debug/tracing/
> # echo "p:xx SyS_prctl+0x1c" >| kprobe_events
> # cat ../kprobes/list
> ffffffff81056c4c k SyS_prctl+0x1c [DISABLED][FTRACE]
> # echo 1 >| events/kprobes/xx/enable
> #
> # perl -e 'syscall 157,-1'
> # dmesg
> LIST_FUNC -> kprobe_ftrace_handler()
>
> so it is really called by the loop test code.
>
> And I guess that after your patches kprobe_ftrace_handler() should be called
> from the trampoline in this case.
No, it won't be. Not until we have Paul McKenney's task_rcu code that
will return after all tasks have either gone through userspace or a
schedule. Hmm, maybe on !CONFIG_PREEMPT it will be OK. Oh, I can make
that OK now on !CONFIG_PREEMPT. Maybe I'll do that too.
kprobe ftrace_ops are allocated which sets the FTRACE_OPS_FL_DYNAMIC
flag. You'll see that flag checked in update_ftrace_function(), and if
it is set, it forces the ftrace_ops_list_func() to be used.
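Roughly, the check in update_ftrace_function() looks like this (a simplified
sketch from memory of the 3.16-era logic, not the exact source):

static void update_ftrace_function(void)
{
	ftrace_func_t func;

	/*
	 * Call a single ops->func directly only if it is the sole ops
	 * registered, it is not dynamically allocated, and it handles
	 * its own recursion protection.  Otherwise every mcount/fentry
	 * site goes through the list function.
	 */
	if (ftrace_ops_list == &ftrace_list_end ||
	    (ftrace_ops_list->next == &ftrace_list_end &&
	     !(ftrace_ops_list->flags & FTRACE_OPS_FL_DYNAMIC) &&
	     (ftrace_ops_list->flags & FTRACE_OPS_FL_RECURSION_SAFE)))
		func = ftrace_ops_list->func;
	else
		func = ftrace_ops_list_func;

	ftrace_trace_function = func;
}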
Why?
Because we cannot know whether a task has just jumped into the trampoline
holding a pointer to an ops that has since been freed. Here's what I mean:
foo()
[mcount called --> ftrace_caller trampoline]
ftrace_caller
load ftrace_ops into parameter
<interrupt>
preempt_schedule()
[new task]
kfree(kprobe ftrace_ops);
[schedule back]
call kprobes ftrace ops func
uses ops, but it was freed!
[BOOM!]
A trampoline cannot access anything that can be freed. In this case,
the kprobes ftrace_ops can be, and that forces the list function to be
used. Why is the list function OK? Because it does:
preempt_disable();
loop all ops
ops->func()
preempt_enable();
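In the actual code that loop is essentially the following (a simplified
sketch of __ftrace_ops_list_func(), leaving out the recursion protection
and RCU details):

static void ftrace_ops_list_func_sketch(unsigned long ip, unsigned long parent_ip,
					struct ftrace_ops *ignored,
					struct pt_regs *regs)
{
	struct ftrace_ops *op;

	/*
	 * The callbacks run with preemption disabled, so an ops cannot
	 * be freed while any CPU is still inside this loop.
	 */
	preempt_disable_notrace();
	for (op = ftrace_ops_list; op != &ftrace_list_end; op = op->next) {
		if (ftrace_ops_test(op, ip, NULL))
			op->func(ip, parent_ip, op, regs);
	}
	preempt_enable_notrace();
}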
And after we disconnect an ops from the list, we call
schedule_on_each_cpu() before freeing it.
Why not use synchronize_sched()?
Because ftrace can be called from areas that RCU is not aware of,
including the RCU infrastructure itself. That means there's an extremely
rare case where synchronize_sched() can return before that loop over the
ops has finished.
Although it's extremely rare, and possibly even impossible to hit, the
fact that it can theoretically happen means we must use the even slower
but guaranteed-safe schedule_on_each_cpu() to make sure that nothing is
still in that preempt-disabled section before we free the ops.
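So the teardown of a dynamically allocated ops ends up being roughly this
(just a sketch; the helper names are from memory and may not match the
source exactly):

/* Sketch of the teardown order for an FTRACE_OPS_FL_DYNAMIC ops */
static void shutdown_dynamic_ops_sketch(struct ftrace_ops *ops)
{
	/* 1. Unlink the ops so no new callers can find it. */
	remove_ftrace_ops(&ftrace_ops_list, ops);

	/*
	 * 2. Wait for every CPU to schedule.  The list function runs the
	 *    callbacks with preemption disabled, so once this returns no
	 *    CPU can still hold a pointer to the ops.
	 */
	schedule_on_each_cpu(ftrace_sync);

	/*
	 * 3. Only now is it safe for the caller (kprobes, perf, ...) to
	 *    kfree() the ops.
	 */
}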
>
> > > ftrace_save_ops_tramp_hash():
> > >
> > > do_for_each_ftrace_rec(pg, rec) {
> > > if (ftrace_rec_count(rec) == 1 &&
> > > ftrace_ops_test(ops, rec->ip, rec)) {
> > >
> > > /* This record had better have a trampoline */
> > > if (FTRACE_WARN_ON(!(rec->flags & FTRACE_FL_TRAMP_EN)))
> > > return -1;
> > >
> > > Yes, but I can't understand how this can work.
> >
> > I wanted the back to 1 case to happen after we get the up to one case
> > working. That is, I don't want to worry about it now ;-) As you can
> > see, this code has enough things to try to keep straight without adding
> > more complexity to the mix.
>
> Yes, I see... but note that this WARN_ON() looks wrong in any case. At
> least currently.
Yeah, in my linux-next repo it quite possibly can be. I'm looking into
fixing that now.
Thanks!
-- Steve
On 07/23, Steven Rostedt wrote:
>
> On Wed, 23 Jul 2014 14:08:05 +0200
> Oleg Nesterov <[email protected]> wrote:
>
> > With this stupid patch
> >
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -4464,6 +4464,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
> > printk("op=%p %pS\n", op, op);
> > goto out;
> > }
> > + pr_crit("LIST_FUNC -> %pf()\n", op->func);
> > op->func(ip, parent_ip, op, regs);
> > }
> > } while_for_each_ftrace_op(op);
> >
> > I do
> > # cd /sys/kernel/debug/tracing/
> > # echo "p:xx SyS_prctl+0x1c" >| kprobe_events
> > # cat ../kprobes/list
> > ffffffff81056c4c k SyS_prctl+0x1c [DISABLED][FTRACE]
> > # echo 1 >| events/kprobes/xx/enable
> > #
> > # perl -e 'syscall 157,-1'
> > # dmesg
> > LIST_FUNC -> kprobe_ftrace_handler()
> >
> > so it is really called by the loop test code.
> >
> > And I guess that after your patches kprobe_ftrace_handler() should be called
> > from the trampoline in this case.
>
> No, it won't be. Not until we have Paul McKenney's task_rcu code that
> will return after all tasks have either gone through userspace or a
> schedule. Hmm, maybe on !CONFIG_PREEMPT it will be OK. Oh, I can make
> that OK now on !CONFIG_PREEMPT. Maybe I'll do that too.
>
> kprobe ftrace_ops are allocated which sets the FTRACE_OPS_FL_DYNAMIC
> flag. You'll see that flag checked in update_ftrace_function(), and if
> it is set, it forces the ftrace_ops_list_func() to be used.
No? __register_ftrace_function() only sets it if !core_kernel_data(ops),
and kprobe_ftrace_ops is not dynamically allocated?
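For reference, the check I mean is roughly:

	/* in __register_ftrace_function() */
	if (!core_kernel_data((unsigned long)ops))
		ops->flags |= FTRACE_OPS_FL_DYNAMIC;

kprobe_ftrace_ops is a static struct, so core_kernel_data() is true for it
and the flag is never set.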
> Why?
>
> [...snip..]
Yes, thanks, I understand why, at least to some degree.
> foo()
> [mcount called --> ftrace_caller trampoline]
> ftrace_caller
> load ftrace_ops into parameter
> <interrupt>
> preempt_schedule()
> [new task]
> kfree(kprobe ftrace_ops);
see above.
And to be sure, I applied the same patch to your rfc/trampoline kernel
(which I pulled yesterday), compiled it, and ran the same test.
__ftrace_ops_list_func() prints nothing.
So I also added WARN_ON(1) into kprobe_ftrace_handler() to ensure that
it is actually called, and yes, dmesg reports
WARNING: ... kprobe_ftrace_handler+0x38/0x140()
...
Call Trace:
[<ffffffff8136a3eb>] dump_stack+0x5b/0xa8
[<ffffffff810423ec>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8105772c>] ? SyS_prctl+0x1c/0x730
[<ffffffff8104243a>] warn_slowpath_null+0x1a/0x20
[<ffffffff810325c8>] kprobe_ftrace_handler+0x38/0x140
[<ffffffff8137148a>] ? retint_swapgs+0xe/0x13
[<ffffffff81057731>] ? SyS_prctl+0x21/0x730
[<ffffffff81057731>] ? SyS_prctl+0x21/0x730
[<ffffffff8122424e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff81370912>] ? system_call_fastpath+0x16/0x1b
after "perl -e 'syscall 157,-1'".
and, as expected, if I do "echo SyS_prctl >| set_ftrace_filter" and
"echo function >| current_tracer", then the command above also triggers
2 printk's in __ftrace_ops_list_func() :
LIST_FUNC -> function_trace_call()
LIST_FUNC -> kprobe_ftrace_handler()
so it seems that your patches can potentially buy more than you think ;)
Oleg.
On Wed, 23 Jul 2014 19:05:52 +0200
Oleg Nesterov <[email protected]> wrote:
> > kprobe ftrace_ops are allocated which sets the FTRACE_OPS_FL_DYNAMIC
> > flag. You'll see that flag checked in update_ftrace_function(), and if
> > it is set, it forces the ftrace_ops_list_func() to be used.
>
> No? __register_ftrace_function() only sets it if !core_kernel_data(ops),
> and kprobe_ftrace_ops is not dynamically allocated?
Oh, you're right. I thought it was allocated.
What it is missing is the FTRACE_OPS_FL_RECURSION_SAFE flag. Although I'm
working on a patch that adds a non-loop function that does the recursion
checks for just a single ftrace_ops->func when only one is registered.
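Something along these lines (just a sketch; the helpers are the existing
recursion-protection code from kernel/trace/trace.h):

static void ftrace_ops_recurs_func(unsigned long ip, unsigned long parent_ip,
				   struct ftrace_ops *op, struct pt_regs *regs)
{
	int bit;

	/*
	 * Same recursion protection the list function provides, but for
	 * a single registered ops, without walking the list.
	 */
	bit = trace_test_and_set_recursion(TRACE_LIST_START, TRACE_LIST_MAX);
	if (bit < 0)
		return;

	op->func(ip, parent_ip, op, regs);

	trace_clear_recursion(bit);
}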
>
> > Why?
> >
> > [...snip..]
>
> Yes, thanks, I understand why, at least to some degree.
>
> > foo()
> > [mcount called --> ftrace_caller trampoline]
> > ftrace_caller
> > load ftrace_ops into parameter
> > <interrupt>
> > preempt_schedule()
> > [new task]
> > kfree(kprobe ftrace_ops);
>
> see above.
>
> And to be sure, I compiled your rfc/trampoline kernel which I pulled
> yesterday with the same patch and did the same test. __ftrace_ops_list_func()
> prints nothing.
Note, I'm still working on fixes to that branch ;-)
>
> So I also added WARN_ON(1) into kprobe_ftrace_handler() to ensure that
> it is actually called, and yes, dmesg reports
>
> WARNING: ... kprobe_ftrace_handler+0x38/0x140()
> ...
> Call Trace:
> [<ffffffff8136a3eb>] dump_stack+0x5b/0xa8
> [<ffffffff810423ec>] warn_slowpath_common+0x8c/0xc0
> [<ffffffff8105772c>] ? SyS_prctl+0x1c/0x730
> [<ffffffff8104243a>] warn_slowpath_null+0x1a/0x20
> [<ffffffff810325c8>] kprobe_ftrace_handler+0x38/0x140
> [<ffffffff8137148a>] ? retint_swapgs+0xe/0x13
> [<ffffffff81057731>] ? SyS_prctl+0x21/0x730
> [<ffffffff81057731>] ? SyS_prctl+0x21/0x730
> [<ffffffff8122424e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff81370912>] ? system_call_fastpath+0x16/0x1b
>
BTW, you may want to look at
/sys/kernel/debug/tracing/enabled_functions
as that has a lot of debug info for the trampolines in that branch.
> after "perl -e 'syscall 157,-1'".
>
> and, as expected, if I do "echo SyS_prctl >| set_ftrace_filter" and
> "echo function >| current_tracer", then the command above also triggers
> 2 printk's in __ftrace_ops_list_func() :
>
> LIST_FUNC -> function_trace_call()
> LIST_FUNC -> kprobe_ftrace_handler()
>
> so it seems that your patches can potentially buy more than you think ;)
>
> Oleg.
I'll play with this some more to understand everything you are stating.
Thanks,
-- Steve