From: Steven Rostedt <[email protected]>
The function graph tracer replaces the return address with a hook to
trace the exit of the function call. This hook will finish by returning
to the real location the function should return to.
But the current implementation uses a ret to jump to the real return
location. This causes an imbalance between calls and rets: the original
function does a call, its ret goes to the handler, and then the handler
does a ret without a matching call.

Although the function graph tracer already breaks the branch predictor
by redirecting the original ret, using a second ret and causing an
imbalance breaks the predictor even more.

This patch replaces that ret with a jmp to keep calls and rets
balanced. On one box this showed a 1.7% increase in performance;
another box showed only a small 0.3% increase. But no box I tested
showed a decrease in performance from this change.
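
To illustrate the mechanism this patch touches, here is a small,
self-contained userspace model (toy code only, not the kernel
implementation; all names are made up): the entry hook parks the real
return address on a shadow stack and points the saved return address at
a trampoline, and the exit handler hands the real address back so the
trampoline can transfer control to it (with this patch, via a jmp
rather than a faked ret).

#include <stdio.h>

#define MAX_DEPTH 64

static unsigned long shadow_stack[MAX_DEPTH];	/* real return addresses */
static int depth;

/* entry side: the hook has rewritten the on-stack return address to
 * point at the trampoline; the real one is parked here */
static void push_return(unsigned long real_ret)
{
	shadow_stack[depth++] = real_ret;
}

/* exit side: what ftrace_return_to_handler conceptually does: report
 * the exit and hand the original return address back, so the asm stub
 * can jmp to it instead of faking another ret */
static unsigned long pop_return(void)
{
	printf("function exited\n");
	return shadow_stack[--depth];
}

int main(void)
{
	push_return(0xdeadbeefUL);			/* entry hook */
	printf("return to %#lx\n", pop_return());	/* exit hook */
	return 0;
}
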
Cc: Mathieu Desnoyers <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/kernel/entry_32.S | 7 ++-----
arch/x86/kernel/entry_64.S | 6 +++---
2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..7d52e9d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1185,17 +1185,14 @@ END(ftrace_graph_caller)
.globl return_to_handler
return_to_handler:
- pushl $0
pushl %eax
- pushl %ecx
pushl %edx
movl %ebp, %eax
call ftrace_return_to_handler
- movl %eax, 0xc(%esp)
+ movl %eax, %ecx
popl %edx
- popl %ecx
popl %eax
- ret
+ jmp *%ecx
#endif
.section .rodata,"a"
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b5c061f..bd5bbdd 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -155,11 +155,11 @@ GLOBAL(return_to_handler)
call ftrace_return_to_handler
- movq %rax, 16(%rsp)
+ movq %rax, %rdi
movq 8(%rsp), %rdx
movq (%rsp), %rax
- addq $16, %rsp
- retq
+ addq $24, %rsp
+ jmp *%rdi
#endif
--
1.6.3.3
* Steven Rostedt ([email protected]) wrote:
> From: Steven Rostedt <[email protected]>
>
> The function graph tracer replaces the return address with a hook to
> trace the exit of the function call. This hook will finish by returning
> to the real location the function should return to.
>
> But the current implementation uses a ret to jump to the real return
> location. This causes an imbalance between calls and rets: the original
> function does a call, its ret goes to the handler, and then the handler
> does a ret without a matching call.
>
> Although the function graph tracer already breaks the branch predictor
> by redirecting the original ret, using a second ret and causing an
> imbalance breaks the predictor even more.
>
> This patch replaces that ret with a jmp to keep calls and rets
> balanced. On one box this showed a 1.7% increase in performance;
> another box showed only a small 0.3% increase. But no box I tested
> showed a decrease in performance from this change.
This sounds exactly like what I proposed at LPC. I'm glad it shows
actual improvements.
Just to make sure I understand, the old sequence was:
call fct
call ftrace_entry
ret to fct
ret to ftrace_exit
ret to caller
and you now have:
call fct
call ftrace_entry
ret to fct
ret to ftrace_exit
jmp to caller
Am I correct ?
Mathieu
>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Signed-off-by: Steven Rostedt <[email protected]>
> ---
> arch/x86/kernel/entry_32.S | 7 ++-----
> arch/x86/kernel/entry_64.S | 6 +++---
> 2 files changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index c097e7d..7d52e9d 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -1185,17 +1185,14 @@ END(ftrace_graph_caller)
>
> .globl return_to_handler
> return_to_handler:
> - pushl $0
> pushl %eax
> - pushl %ecx
> pushl %edx
> movl %ebp, %eax
> call ftrace_return_to_handler
> - movl %eax, 0xc(%esp)
> + movl %eax, %ecx
> popl %edx
> - popl %ecx
> popl %eax
> - ret
> + jmp *%ecx
> #endif
>
> .section .rodata,"a"
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index b5c061f..bd5bbdd 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -155,11 +155,11 @@ GLOBAL(return_to_handler)
>
> call ftrace_return_to_handler
>
> - movq %rax, 16(%rsp)
> + movq %rax, %rdi
> movq 8(%rsp), %rdx
> movq (%rsp), %rax
> - addq $16, %rsp
> - retq
> + addq $24, %rsp
> + jmp *%rdi
> #endif
>
>
> --
> 1.6.3.3
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, Oct 13, 2009 at 04:33:50PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt <[email protected]>
>
> The function graph tracer replaces the return address with a hook to
> trace the exit of the function call. This hook will finish by returning
> to the real location the function should return to.
>
> But the current implementation uses a ret to jump to the real return
> location. This causes an imbalance between calls and rets: the original
> function does a call, its ret goes to the handler, and then the handler
> does a ret without a matching call.
>
> Although the function graph tracer already breaks the branch predictor
> by redirecting the original ret, using a second ret and causing an
> imbalance breaks the predictor even more.
I have trouble understanding why it breaks the predictor, especially
since there is no conditional branch in return_to_handler.
But I still don't understand why a ret would break branch prediction
more than a jmp would.
On Tue, 2009-10-13 at 16:47 -0400, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> > From: Steven Rostedt <[email protected]>
> >
> > The function graph tracer replaces the return address with a hook to
> > trace the exit of the function call. This hook will finish by returning
> > to the real location the function should return to.
> >
> > But the current implementation uses a ret to jump to the real return
> > location. This causes an imbalance between calls and rets: the original
> > function does a call, its ret goes to the handler, and then the handler
> > does a ret without a matching call.
> >
> > Although the function graph tracer already breaks the branch predictor
> > by redirecting the original ret, using a second ret and causing an
> > imbalance breaks the predictor even more.
> >
> > This patch replaces that ret with a jmp to keep calls and rets
> > balanced. On one box this showed a 1.7% increase in performance;
> > another box showed only a small 0.3% increase. But no box I tested
> > showed a decrease in performance from this change.
>
> This sounds exactly like what I proposed at LPC. I'm glad it shows
> actual improvements.
This is what we discussed at LPC. We were both under the assumption that
a jump would work; the question was how to make that jump without hosing
registers.

We lucked out that this is the back end of the return sequence, where we
can still clobber the callee-clobbered registers (just not the ones
holding the return value).
>
> Just to make sure I understand, the old sequence was:
>
> call fct
> call ftrace_entry
> ret to fct
> ret to ftrace_exit
> ret to caller
>
> and you now have:
>
> call fct
> call ftrace_entry
> ret to fct
> ret to ftrace_exit
> jmp to caller
>
> Am I correct ?
Almost.
What it was:
call function
function:
call mcount
mcount:
call ftrace_entry
ftrace_entry:
mess up with return code of caller
ret
ret
[function code]
ret to ftrace_exit
ftrace_exit:
get real return
ret to original
So for the function we have 3 calls and 4 rets
Now we have:
call function
function:
call mcount
mcount:
call ftrace_entry
ftrace_entry:
mess up with return code of caller
ret
ret
[function code]
ret to ftrace_exit
ftrace_exit:
get real return
jmp to original
Now we have 3 calls and 3 rets
Note the first call still does not match the ret, but we don't do two
rets anymore.
-- Steve
On Tue, 2009-10-13 at 23:02 +0200, Frederic Weisbecker wrote:
> On Tue, Oct 13, 2009 at 04:33:50PM -0400, Steven Rostedt wrote:
> > From: Steven Rostedt <[email protected]>
> >
> > The function graph tracer replaces the return address with a hook to
> > trace the exit of the function call. This hook will finish by returning
> > to the real location the function should return to.
> >
> > But the current implementation uses a ret to jump to the real return
> > location. This causes an imbalance between calls and rets: the original
> > function does a call, its ret goes to the handler, and then the handler
> > does a ret without a matching call.
> >
> > Although the function graph tracer already breaks the branch predictor
> > by redirecting the original ret, using a second ret and causing an
> > imbalance breaks the predictor even more.
>
>
>
> I have trouble understanding why it breaks the predictor, especially
> since there is no conditional branch in return_to_handler.
> But I still don't understand why a ret would break branch prediction
> more than a jmp would.
Calls are predicted branches: the processor associates the "ret" with
its call, and as execution approaches the ret it starts to prefetch the
code after the call.

But this works in stack order. Every call should hit exactly one ret.
With the original code we break this stack: we have one call and two
rets, which means the branch prediction for the previously stored call
also gets messed up.
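
A tiny, hypothetical userspace illustration of the two shapes (x86-64
GCC/Clang inline asm, toy code rather than kernel code; it only
executes each pattern once and is not a benchmark):

#include <stdio.h>

static void matched(void)
{
	/* one call, one ret, return address untouched: the return-stack
	 * predictor sees a balanced pair and predicts the ret correctly */
	asm volatile("call 1f\n\t"
		     "jmp 2f\n"
		     "1:\tret\n"
		     "2:" ::: "memory");
}

static void unmatched(void)
{
	/* a ret with no preceding call: the target was pushed by hand,
	 * so the predictor's top entry (left there by an older call) is
	 * wrong, and every return above it on the stack shifts as well */
	asm volatile("lea 1f(%%rip), %%rax\n\t"
		     "push %%rax\n\t"
		     "ret\n"
		     "1:" ::: "rax", "memory");
}

int main(void)
{
	matched();
	unmatched();
	puts("both sequences executed");
	return 0;
}

The second shape is what the old return_to_handler produced: its extra
ret consumed a predictor entry that belonged to an earlier call,
throwing off the predictions for the returns above it.
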
-- Steve
* Steven Rostedt ([email protected]) wrote:
> On Tue, 2009-10-13 at 16:47 -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > > From: Steven Rostedt <[email protected]>
> > >
> > > The function graph tracer replaces the return address with a hook to
> > > trace the exit of the function call. This hook will finish by returning
> > > to the real location the function should return to.
> > >
> > > But the current implementation uses a ret to jump to the real return
> > > location. This causes an imbalance between calls and rets: the original
> > > function does a call, its ret goes to the handler, and then the handler
> > > does a ret without a matching call.
> > >
> > > Although the function graph tracer already breaks the branch predictor
> > > by redirecting the original ret, using a second ret and causing an
> > > imbalance breaks the predictor even more.
> > >
> > > This patch replaces that ret with a jmp to keep calls and rets
> > > balanced. On one box this showed a 1.7% increase in performance;
> > > another box showed only a small 0.3% increase. But no box I tested
> > > showed a decrease in performance from this change.
> >
> > This sounds exactly like what I proposed at LPC. I'm glad it shows
> > actual improvements.
>
> This is what we discussed at LPC. We were both under the assumption that
> a jump would work; the question was how to make that jump without hosing
> registers.
>
> We lucked out that this is the back end of the return sequence, where we
> can still clobber the callee-clobbered registers (just not the ones
> holding the return value).
>
> >
> > Just to make sure I understand, the old sequence was:
> >
> > call fct
> > call ftrace_entry
> > ret to fct
> > ret to ftrace_exit
> > ret to caller
> >
> > and you now have:
> >
> > call fct
> > call ftrace_entry
> > ret to fct
> > ret to ftrace_exit
> > jmp to caller
> >
> > Am I correct ?
>
> Almost.
>
> What it was:
>
> call function
> function:
> call mcount
> mcount:
> call ftrace_entry
> ftrace_entry:
> mess up with return code of caller
> ret
> ret
>
> [function code]
>
> ret to ftrace_exit
> ftrace_exit:
> get real return
> ret to original
>
> So for the function we have 3 calls and 4 rets
>
> Now we have:
>
> call function
> function:
> call mcount
> mcount:
> call ftrace_entry
Can we manage to change this call
> ftrace_entry:
> mess up with return code of caller
> ret
... and this ret into 2 jmp instructions too?

Given that we have no choice but to kill the call/ret prediction logic, I
think it might be good to use this logic as little as possible (by
favoring jmp/jmp over call/ret when the return target is invariant).

That's just an idea; benchmarks could prove me right or wrong.
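
For illustration only: this is the shape compilers already produce for
tail calls. A tiny, hypothetical userspace example where the
continuation target is invariant, so the call/ret pair collapses into a
jmp when built with -O2:

static __attribute__((noinline)) int do_work(int x)
{
	return x * 2;
}

static __attribute__((noinline)) int hook_then_work(int x)
{
	/* the continuation target is statically known, so at -O2 gcc
	 * typically emits "jmp do_work" here (a sibling call) rather
	 * than "call do_work; ret" */
	return do_work(x);
}

int main(int argc, char **argv)
{
	(void)argv;
	return hook_then_work(argc) == 2 * argc ? 0 : 1;
}

do_work's ret then goes straight back to main, so calls and rets stay
balanced even though hook_then_work never executes a ret of its own.
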
Mathieu
> ret
>
> [function code]
>
> ret to ftrace_exit
> ftrace_exit:
> get real return
> jmp to original
>
> Now we have 3 calls and 3 rets
>
> Note the first call still does not match the ret, but we don't do two
> rets anymore.
>
> -- Steve
>
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 2009-10-13 at 17:21 -0400, Mathieu Desnoyers wrote:
> > What it was:
> >
> > call function
> > function:
> > call mcount
> > mcount:
> > call ftrace_entry
>
> Can we manage to change this call
Note, that call jumps to C code.
>
> > ftrace_entry:
> > mess up with return code of caller
> > ret
>
> ... and this ret into 2 jmp instructions too?
The code is all in C, and it too calls functions. I'm not sure how this
would help. The rets there match their calls, so the prediction will
work.
>
> Given that we have no choice but to kill the call/ret prediction logic, I
> think it might be good to use this logic as little as possible (by
> favoring jmp/jmp over call/ret when the return target is invariant).
>
> That's just an idea; benchmarks could prove me right or wrong.
I don't see how this would help, and I'm not about to waste time
experimenting. What's the rationale?
-- Steve
On Tue, Oct 13, 2009 at 05:12:46PM -0400, Steven Rostedt wrote:
> On Tue, 2009-10-13 at 23:02 +0200, Frederic Weisbecker wrote:
> > On Tue, Oct 13, 2009 at 04:33:50PM -0400, Steven Rostedt wrote:
> > > From: Steven Rostedt <[email protected]>
> > >
> > > The function graph tracer replaces the return address with a hook to
> > > trace the exit of the function call. This hook will finish by returning
> > > to the real location the function should return to.
> > >
> > > But the current implementation uses a ret to jump to the real return
> > > location. This causes an imbalance between calls and rets: the original
> > > function does a call, its ret goes to the handler, and then the handler
> > > does a ret without a matching call.
> > >
> > > Although the function graph tracer already breaks the branch predictor
> > > by redirecting the original ret, using a second ret and causing an
> > > imbalance breaks the predictor even more.
> >
> >
> >
> > I have trouble understanding why it breaks the predictor, especially
> > since there is no conditional branch in return_to_handler.
> > But I still don't understand why a ret would break branch prediction
> > more than a jmp would.
>
> Calls are predicted branches: the processor associates the "ret" with
> its call, and as execution approaches the ret it starts to prefetch the
> code after the call.
>
> But this works in stack order. Every call should hit exactly one ret.
> With the original code we break this stack: we have one call and two
> rets, which means the branch prediction for the previously stored call
> also gets messed up.
Oh, ok I got it.
Thanks.
* Steven Rostedt ([email protected]) wrote:
> On Tue, 2009-10-13 at 17:21 -0400, Mathieu Desnoyers wrote:
>
> > > What it was:
> > >
> > > call function
> > > function:
> > > call mcount
> > > mcount:
> > > call ftrace_entry
> >
> > Can we manage to change this call
>
> Note, that call jumps to C code.
>
> >
> > > ftrace_entry:
> > > mess up with return code of caller
> > > ret
> >
> > ... and this ret into 2 jmp instructions too?
>
> The code is all in C, and it too calls functions. I'm not sure how this
> would help. The rets there match their calls, so the prediction will
> work.
>
Oh, OK. I thought the callback was in assembly. That's a bit more work
than I thought.
> >
> > Given that we have no choice but to kill the call/ret prediction logic, I
> > think it might be good to use this logic as little as possible (by
> > favoring jmp/jmp over call/ret when the return target is invariant).
> >
> > That's just an idea; benchmarks could prove me right or wrong.
>
> I don't see how this would help, and I'm not about to waste time
> experimenting. What's the rationale?
>
The idea is that call/ret pairs are fast when the predictions are right.
In this case, I wonder whether trashing the call/ret prediction (even if
it happens after the paired call/ret) has an impact on the balanced
call/rets in the tracing code path. I doubt it, but who knows.

It's probably not worth spending much time on this; it would just have
been nice to try if the implementation had been trivial.
Thanks,
Mathieu
> -- Steve
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Commit-ID: 194ec34184869f0de1cf255c924fc5299e1b3d27
Gitweb: http://git.kernel.org/tip/194ec34184869f0de1cf255c924fc5299e1b3d27
Author: Steven Rostedt <[email protected]>
AuthorDate: Tue, 13 Oct 2009 16:33:50 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 14 Oct 2009 08:13:53 +0200
function-graph/x86: Replace unbalanced ret with jmp
The function graph tracer replaces the return address with a hook
to trace the exit of the function call. This hook will finish by
returning to the real location the function should return to.
But the current implementation uses a ret to jump to the real
return location. This causes an imbalance between calls and rets:
the original function does a call, its ret goes to the handler,
and then the handler does a ret without a matching call.

Although the function graph tracer already breaks the branch
predictor by redirecting the original ret, using a second ret and
causing an imbalance breaks the predictor even more.

This patch replaces that ret with a jmp to keep calls and rets
balanced. On one box this showed a 1.7% increase in performance;
another box showed only a small 0.3% increase. But no box tested
showed a decrease in performance from this change.
Signed-off-by: Steven Rostedt <[email protected]>
Acked-by: Mathieu Desnoyers <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/entry_32.S | 7 ++-----
arch/x86/kernel/entry_64.S | 6 +++---
2 files changed, 5 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index c097e7d..7d52e9d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1185,17 +1185,14 @@ END(ftrace_graph_caller)
.globl return_to_handler
return_to_handler:
- pushl $0
pushl %eax
- pushl %ecx
pushl %edx
movl %ebp, %eax
call ftrace_return_to_handler
- movl %eax, 0xc(%esp)
+ movl %eax, %ecx
popl %edx
- popl %ecx
popl %eax
- ret
+ jmp *%ecx
#endif
.section .rodata,"a"
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index b5c061f..bd5bbdd 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -155,11 +155,11 @@ GLOBAL(return_to_handler)
call ftrace_return_to_handler
- movq %rax, 16(%rsp)
+ movq %rax, %rdi
movq 8(%rsp), %rdx
movq (%rsp), %rax
- addq $16, %rsp
- retq
+ addq $24, %rsp
+ jmp *%rdi
#endif