On Tue, 17 Nov 2009, [email protected] wrote:
Back to LKML with some more cc's.
> From: Feng Tang <[email protected]>
>
> Recent hrtimer code sets the start info of an hrtimer only when the
> timer_stats_active flag is set, so the start info of all hrtimers
> stays uninitialised until an "echo 1 > /proc/timer_stats" has been
> done; /proc/timer_list will thus show something like:
>
> active timers:
> #0: <c27d46b0>, tick_sched_timer, S:01, <(null)>, /-1
> # expires at 91062000000-91062000000 nsecs [in 156071 to 156071 nsecs]
> #1: <efb81b6c>, hrtimer_wakeup, S:01, <(null)>, /-1
> # expires at 91062300331-91062350331 nsecs [in 456402 to 506402 nsecs]
> #2: <efac9b6c>, hrtimer_wakeup, S:01, <(null)>, /-1
> # expires at 91068699811-91068749811 nsecs [in 6855882 to 6905882 nsecs]
> #3: <efacdb6c>, hrtimer_wakeup, S:01, <(null)>, /-1
> # expires at 91068755511-91068805511 nsecs [in 6911582 to 6961582 nsecs]
> #4: <efa95b6c>, hrtimer_wakeup, S:01, <(null)>, /-1
> # expires at 91068806066-91068856066 nsecs [in 6962137 to 7012137 nsecs]
> .....
>
> This patch fixes it.
This patch does more than that, and the change log does not mention
it. The change log is also missing the context of what introduced
this regression.
Just to get the context straight:
commit 507e1231 (timer stats: Optimize by adding quick check to avoid
function calls) added the timer_stats_active check in
timer_stats_hrtimer_set_start_info() to reduce the overhead when timer
stats are inactive.
As an unintentional side effect it introduced the above regression in
/proc/timer_list.
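(For reference, the "<(null)>" and "/-1" come straight from the stats
fields of struct hrtimer, which hrtimer_init() initialises to NULL/-1
and which /proc/timer_list prints for each timer. A sketch, paraphrased
from print_timer() in kernel/time/timer_list.c of that era - details
may differ:)

	SEQ_printf(m, ", ");
	print_name_offset(m, timer->start_site);	/* NULL -> "<(null)>" */
	memcpy(tmp, timer->start_comm, TASK_COMM_LEN);
	tmp[TASK_COMM_LEN] = 0;
	SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);	/* -1 until set */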
> static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
> {
> - if (likely(!timer_stats_active))
> - return;
This is fine. It reverts the hrtimer part of commit 507e1231 and fixes
the /proc/timer_list regression at the price of the same overhead
which we had before that commit.
That's the immediate fix which needs to go to Linus and stable.
> __timer_stats_hrtimer_set_start_info(timer, __builtin_return_address(0));
> }
>
> diff -puN kernel/hrtimer.c~hrtimers-remove-the-timer_stats_active-check-when-setting-the-start-info kernel/hrtimer.c
> --- a/kernel/hrtimer.c~hrtimers-remove-the-timer_stats_active-check-when-setting-the-start-info
> +++ a/kernel/hrtimer.c
> @@ -750,7 +750,7 @@ static inline void hrtimer_init_timer_hr
> #ifdef CONFIG_TIMER_STATS
> void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer, void *addr)
> {
> - if (timer->start_site)
> + if (timer->start_site == addr && timer->start_pid == current->pid)
This part is unrelated to the above regression and is wrong for the
following reasons:
1) it runs the memcpy when the start_site changes even when the pid
   is the same, which is extremely bad for tick->sched_timer as this
   timer is (re)armed frequently from different places when NOHZ=y.
2) it runs the memcpy when the pid changes, which is even worse for
   tick->sched_timer as this timer is (re)armed frequently from
   hrtimer_interrupt, which hits random pids and would therefore
   report completely wrong pid/comm info.
This needs more thought and is definitely neither -rc7 nor stable
material.
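(To make the memcpy problem concrete, here is the helper in question,
paraphrased from kernel/hrtimer.c as it was before the patch; the early
return is the line the second hunk changes:)

	void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer, void *addr)
	{
		if (timer->start_site)	/* the hunk above relaxes this check */
			return;

		timer->start_site = addr;
		/* with the relaxed check this runs on every (re)arm
		   from a new site or pid: */
		memcpy(timer->start_comm, current->comm, TASK_COMM_LEN);
		timer->start_pid = current->pid;
	}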
Thanks,
tglx
Commit-ID: 887a29f59b93cf54e21814869a4ab6e80b6fa623
Gitweb: http://git.kernel.org/tip/887a29f59b93cf54e21814869a4ab6e80b6fa623
Author: Feng Tang <[email protected]>
AuthorDate: Thu, 3 Sep 2009 16:32:53 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 18 Nov 2009 21:20:37 +0100
hrtimer: Fix /proc/timer_list regression
commit 507e1231 (timer stats: Optimize by adding quick check to avoid
function calls) introduced a regression in /proc/timer_list.
/proc/timer_list shows now
#0: <c27d46b0>, tick_sched_timer, S:01, <(null)>, /-1
instead of
#0: <c27d46b0>, tick_sched_timer, S:01, hrtimer_start, swapper/0
Revert the hrtimer quick check for now. The optimization needs more
thought, but this is neither 2.6.32-rc7 nor stable material.
[ tglx: Removed unrelated changes from the original patch and massaged
the changelog ]
Signed-off-by: Feng Tang <[email protected]>
LKML-Reference: <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: [email protected]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/hrtimer.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index ff037f0..0a656d6 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -457,8 +457,6 @@ extern void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer,
static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
{
- if (likely(!timer_stats_active))
- return;
__timer_stats_hrtimer_set_start_info(timer, __builtin_return_address(0));
}
* tip-bot for Feng Tang <[email protected]> wrote:
> Commit-ID: 887a29f59b93cf54e21814869a4ab6e80b6fa623
> Gitweb: http://git.kernel.org/tip/887a29f59b93cf54e21814869a4ab6e80b6fa623
> Author: Feng Tang <[email protected]>
> AuthorDate: Thu, 3 Sep 2009 16:32:53 +0800
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Wed, 18 Nov 2009 21:20:37 +0100
>
> hrtimer: Fix /proc/timer_list regression
>
> commit 507e1231 (timer stats: Optimize by adding quick check to avoid
> function calls) introduced a regression in /proc/timer_list.
>
> /proc/timer_list shows now
> #0: <c27d46b0>, tick_sched_timer, S:01, <(null)>, /-1
> instead of
> #0: <c27d46b0>, tick_sched_timer, S:01, hrtimer_start, swapper/0
>
> Revert the hrtimer quick check for now. The optimization needs more
> thought, but this is neither 2.6.32-rc7 nor stable material.
>
> [ tglx: Removed unrelated changes from the original patch and massaged
> the changelog ]
>
> Signed-off-by: Feng Tang <[email protected]>
> LKML-Reference: <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: [email protected]
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
-tip testing found that this commit is causing boot crashes:
[...]
calling init_mmio_trace+0x0/0xf @ 1
initcall init_mmio_trace+0x0/0xf returned 0 after 699 usecs
calling init_graph_trace+0x0/0x29 @ 1
Testing tracer function_graph:
------------[ cut here ]------------
WARNING: at kernel/trace/trace_functions_graph.c:131 ftrace_return_to_handler+0x119/0x130()
Hardware name: System Product Name
Bad frame pointer: expected be04be80, received be04becc
from func timer_stats_update_stats return to 8106c6fc
Modules linked in:
Pid: 534, comm: kstop/0 Not tainted 2.6.32-rc7-00032-g887a29f-dirty #2762
Call Trace:
[<8104681f>] warn_slowpath_common+0x5f/0x90
[<810c1f79>] ? ftrace_return_to_handler+0x119/0x130
[<81046899>] warn_slowpath_fmt+0x29/0x30
[<810c1f79>] ftrace_return_to_handler+0x119/0x130
[<81077214>] ? timer_stats_update_stats+0x14/0x200
[<8106c6fc>] ? __run_hrtimer+0x14c/0x190
[<8106bdb0>] ? enqueue_hrtimer+0x80/0xd0
[<8106c657>] ? __run_hrtimer+0xa7/0x190
[<81004842>] return_to_handler+0xc/0x3a
[<81098700>] ? stop_cpu+0x0/0x110
[<81004836>] return_to_handler+0x0/0x3a
[<8106c6fc>] __run_hrtimer+0x14c/0x190
[<81004836>] return_to_handler+0x0/0x3a
[<8106c9e0>] hrtimer_interrupt+0x140/0x240
[<81098700>] ? stop_cpu+0x0/0x110
[<810987b1>] ? stop_cpu+0xb1/0x110
[<8106392f>] worker_thread+0x15f/0x2c0
[<810638f3>] ? worker_thread+0x123/0x2c0
[<81068960>] ? autoremove_wake_function+0x0/0x40
[<810637d0>] ? worker_thread+0x0/0x2c0
[<810686ec>] kthread+0x6c/0x70
[<81068680>] ? kthread+0x0/0x70
i've bisected back to it:
| 887a29f59b93cf54e21814869a4ab6e80b6fa623 is the first bad commit
| commit 887a29f59b93cf54e21814869a4ab6e80b6fa623
| Author: Feng Tang <[email protected]>
| Date: Thu Sep 3 16:32:53 2009 +0800
|
| hrtimer: Fix /proc/timer_list regression
Config attached.
I've removed it from timers/urgent for now.
Ingo
On Thu, 19 Nov 2009, Ingo Molnar wrote:
> i've bisected back to it:
>
> | 887a29f59b93cf54e21814869a4ab6e80b6fa623 is the first bad commit
> | commit 887a29f59b93cf54e21814869a4ab6e80b6fa623
> | Author: Feng Tang <[email protected]>
> | Date: Thu Sep 3 16:32:53 2009 +0800
> |
> | hrtimer: Fix /proc/timer_list regression
>
> Config attached.
>
> I've removed it from timers/urgent for now.
Come on, this is a patently wrong conclusion.
We call timer_stats_update_stats() which returns on top of the
function due to:
if (likely(!timer_stats_active))
return;
timer_stats_active is 0 during boot and you can only activate it by
writing to /proc/timer_stats which you certainly did not at this
point.
Can you please explain how a call to a function which returns right
away can cause that problem ?
That patch unearthed some other bug and your revert is just papering
over that fact.
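(For reference, the entry of the function in question, paraphrased from
kernel/time/timer_stats.c of that era - the argument list is abbreviated
and only the quick check is shown:)

	void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
				      void *timerf, char *comm,
				      unsigned int timer_flag)
	{
		/*
		 * timer_stats_active is 0 until "echo 1 > /proc/timer_stats",
		 * so during boot this returns immediately.
		 */
		if (likely(!timer_stats_active))
			return;
		...
	}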
Thanks,
tglx
On Thu, 19 Nov 2009, Thomas Gleixner wrote:
> On Thu, 19 Nov 2009, Ingo Molnar wrote:
> > i've bisected back to it:
> >
> > | 887a29f59b93cf54e21814869a4ab6e80b6fa623 is the first bad commit
> > | commit 887a29f59b93cf54e21814869a4ab6e80b6fa623
> > | Author: Feng Tang <[email protected]>
> > | Date: Thu Sep 3 16:32:53 2009 +0800
> > |
> > | hrtimer: Fix /proc/timer_list regression
> >
> > Config attached.
> >
> > I've removed it from timers/urgent for now.
>
> Come on, this is a patently wrong conclusion.
>
> We call timer_stats_update_stats() which returns on top of the
> function due to:
>
> if (likely(!timer_stats_active))
> return;
>
> timer_stats_active is 0 during boot and you can only activate it by
> writing to /proc/timer_stats which you certainly did not at this
> point.
>
> Can you please explain how a call to a function which returns right
> away can cause that problem ?
>
> That patch unearthed some other bug and your revert is just papering
> over that fact.
Looked deeper into that and found the root cause.
The function graph tracer expects that the first code sequence in a
function is:
function_start:
push %ebp
mov %esp, %ebp
....
call mcount
and on the return side:
pop %ebp
ret
which is the case for most functions, but not for all.
timer_stats_update_stats() and a few others have:
function_start:
push %edi
lea 0x8(%esp),%edi
and $0xfffffff0,%esp
pushl -0x4(%edi)
push %ebp
mov %esp,%ebp
...
call mcount
and the return does:
pop %ebp
lea -0x8(%edi),%esp
pop %edi
ret
mcount calls prepare_ftrace_return() with the following calling sequence:
pushl %eax
pushl %ecx
pushl %edx
movl 0xc(%esp), %edx
lea 0x4(%ebp), %eax
Here is where the shit hits the fan.
For the usual function frames this is correct:
ebp points to the stack location where ebp was pushed and ebp + 4
points to the return address.
For the timer_stats_update_stats case ebp points to the stack location
where ebp was pushed _BUT_ ebp + 4 is pointing to a stack entry
_BELOW_ the return address.
movl (%ebp), %ecx
subl $MCOUNT_INSN_SIZE, %edx
call prepare_ftrace_return
popl %edx
popl %ecx
popl %eax
ret
prepare_ftrace_return does:
void prepare_ftrace_return(unsigned long *parent, unsigned long self_addr,
unsigned long frame_pointer)
{
unsigned long old;
int faulted;
unsigned long return_hooker = (unsigned long)
&return_to_handler;
asm volatile(
"1: " _ASM_MOV " (%[parent]), %[old]\n"
"2: " _ASM_MOV " %[return_hooker], (%[parent])\n"
" movl $0, %[faulted]\n"
"3:\n"
So here we modify the "return" address on the stack where parent is
supposed to point to. That works for the standard function frames, but
not for the ones which look like timer_stats_update_stats().
In the timer_stats_update_stats() case we do not call the
return_to_handler when the function returns, because we did not modify
the return address. So the next return in the calling function will
trip over the sanity checking and panic.
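(The sanity check which trips is the frame pointer test on the return
path of the graph tracer - a rough paraphrase of the code around
kernel/trace/trace_functions_graph.c:131 cited in the WARNING above:)

	/* on exit the tracer pops the state it saved on entry */
	if (unlikely(current->ret_stack[index].fp != frame_pointer)) {
		ftrace_graph_stop();
		WARN(1, "Bad frame pointer: expected %lx, received %lx\n"
		     "  from func %ps return to %lx\n", ...);
	}

Because the modified return never went through return_to_handler, the
saved entries get out of sync and the next traced return sees a frame
pointer it did not record.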
I'm not yet sure whether this is a compiler problem (using gcc 4.4.1)
or just the stupid assumption that function frames always start with
push %ebp
mov %esp, %ebp
Thanks,
tglx
On Thu, 19 Nov 2009, Thomas Gleixner wrote:
Can the GCC folks please shed some light on this:
standard function start:
push %ebp
mov %esp, %ebp
....
call mcount
modified function start on a handful of functions only seen with gcc
4.4.x on x86 32 bit:
push %edi
lea 0x8(%esp),%edi
and $0xfffffff0,%esp
pushl -0x4(%edi)
push %ebp
mov %esp,%ebp
...
call mcount
This modification leads to a hard to solve problem in the kernel
function graph tracer which assumes that the stack looks like:
return address
saved ebp
With the modified function start sequence this is no longer true and
the manipulation of the return address on the stack fails silently.
Neither gcc 4.3 nor gcc 3.4 are generating such function frames, so it
looks like a gcc 4.4.x feature.
There is no real obvious reason why the edi magic needs to be done
_before_
push %ebp
mov %esp,%ebp
Thanks,
tglx
Thomas Gleixner wrote:
> On Thu, 19 Nov 2009, Thomas Gleixner wrote:
>
> Can the GCC folks please shed some light on this:
>
> standard function start:
>
> push %ebp
> mov %esp, %ebp
> ....
> call mcount
>
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
>
> push %edi
> lea 0x8(%esp),%edi
> and $0xfffffff0,%esp
> pushl -0x4(%edi)
> push %ebp
> mov %esp,%ebp
> ...
> call mcount
>
> This modification leads to a hard to solve problem in the kernel
> function graph tracer which assumes that the stack looks like:
>
> return address
> saved ebp
>
> With the modified function start sequence this is no longer true and
> the manipulation of the return address on the stack fails silently.
>
> Neither gcc 4.3 nor gcc 3.4 are generating such function frames, so it
> looks like a gcc 4.4.x feature.
>
> There is no real obvious reason why the edi magic needs to be done
> _before_
>
> push %ebp
> mov %esp,%ebp
Sure there is: unless you do the adjustment first %ebp won't be 16-aligned.
We're aligning the stack properly, as per the ABI requirements. Can't
you just fix the tracer?
Andrew.
On 11/19/2009 07:37 AM, Thomas Gleixner wrote:
>
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
>
> push %edi
> lea 0x8(%esp),%edi
> and $0xfffffff0,%esp
> pushl -0x4(%edi)
> push %ebp
> mov %esp,%ebp
> ...
> call mcount
>
The real question is why we're aligning the stack in the kernel. It is
probably not what we want -- we don't use SSE for anything but a handful
of special cases in the kernel, and we don't want the overhead.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Nov 19, 2009 at 4:45 PM, H. Peter Anvin <[email protected]> wrote:
> On 11/19/2009 07:37 AM, Thomas Gleixner wrote:
>>
>> modified function start on a handful of functions only seen with gcc
>> 4.4.x on x86 32 bit:
>>
>>       push   %edi
>>       lea    0x8(%esp),%edi
>>       and    $0xfffffff0,%esp
>>       pushl  -0x4(%edi)
>>       push   %ebp
>>       mov    %esp,%ebp
>>       ...
>>       call   mcount
>>
>
> The real question is why we're aligning the stack in the kernel. It is
> probably not what we want -- we don't use SSE for anything but a handful
> of special cases in the kernel, and we don't want the overhead.
It's likely because you have long long vars on the stack which is
faster when they are aligned. -mno-stackrealign may do what you
want (or may not, I have not checked). I assume you already
use -mpreferred-stack-boundary=2.
Richard.
On Thu, Nov 19, 2009 at 4:49 PM, Richard Guenther
<[email protected]> wrote:
> On Thu, Nov 19, 2009 at 4:45 PM, H. Peter Anvin <[email protected]> wrote:
>> On 11/19/2009 07:37 AM, Thomas Gleixner wrote:
>>>
>>> modified function start on a handful of functions only seen with gcc
>>> 4.4.x on x86 32 bit:
>>>
>>>       push   %edi
>>>       lea    0x8(%esp),%edi
>>>       and    $0xfffffff0,%esp
>>>       pushl  -0x4(%edi)
>>>       push   %ebp
>>>       mov    %esp,%ebp
>>>       ...
>>>       call   mcount
>>>
>>
>> The real question is why we're aligning the stack in the kernel. It is
>> probably not what we want -- we don't use SSE for anything but a handful
>> of special cases in the kernel, and we don't want the overhead.
>
> It's likely because you have long long vars on the stack which is
> faster when they are aligned. -mno-stackrealign may do what you
> want (or may not, I have not checked). I assume you already
> use -mpreferred-stack-boundary=2.
Just checking: it seems you must use -mincoming-stack-boundary=2
instead, but keep the preferred stack boundary at 4.
Richard.
On 11/19/2009 07:44 AM, Andrew Haley wrote:
>
> We're aligning the stack properly, as per the ABI requirements. Can't
> you just fix the tracer?
>
"Per the ABI requirements?" We're talking 32 bits, here.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Nov 19, 2009 at 4:54 PM, H. Peter Anvin <[email protected]> wrote:
> On 11/19/2009 07:44 AM, Andrew Haley wrote:
>>
>> We're aligning the stack properly, as per the ABI requirements. Can't
>> you just fix the tracer?
>>
>
> "Per the ABI requirements?" ?We're talking 32 bits, here.
Hm, even with
void bar (int *);
void foo (void)
{
int x;
bar (&x);
}
gcc -S -O2 -m32 -mincoming-stack-boundary=2 t.c
we re-align the stack. That looks indeed bogus.
HJ, you invented all this code, what's the reason for the above?
Richard.
On Thu, 2009-11-19 at 15:44 +0000, Andrew Haley wrote:
> Thomas Gleixner wrote:
> We're aligning the stack properly, as per the ABI requirements. Can't
> you just fix the tracer?
And how do we do that? The hooks that are in place have no idea of what
happened before they were called?
-- Steve
On Thu, 19 Nov 2009, Andrew Haley wrote:
> Thomas Gleixner wrote:
> > There is no real obvious reason why the edi magic needs to be done
> > _before_
> >
> > push %ebp
> > mov %esp,%ebp
>
> Sure there is: unless you do the adjustment first %ebp won't be 16-aligned.
And why is this not done in 99% of the functions in the kernel, just
in this one and some random others ?
> We're aligning the stack properly, as per the ABI requirements. Can't
> you just fix the tracer?
Where is that ABI requirement that
push %ebp
needs to happen on an aligned stack ?
And why is this something GCC did not care about until GCC4.4 ?
Thanks,
tglx
On 11/19/2009 08:02 AM, Steven Rostedt wrote:
> On Thu, 2009-11-19 at 15:44 +0000, Andrew Haley wrote:
>> Thomas Gleixner wrote:
>
>> We're aligning the stack properly, as per the ABI requirements. Can't
>> you just fix the tracer?
>
> And how do we do that? The hooks that are in place have no idea of what
> happened before they were called?
>
Furthermore, it is nonsense -- ABI stack alignment on *32 bits* is 4
bytes, not 16.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, 2009-11-19 at 15:44 +0000, Andrew Haley wrote:
> We're aligning the stack properly, as per the ABI requirements. Can't
> you just fix the tracer?
Unfortunately, this is the only fix we have:
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index b416512..cd39064 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -143,7 +143,6 @@ config FUNCTION_GRAPH_TRACER
bool "Kernel Function Graph Tracer"
depends on HAVE_FUNCTION_GRAPH_TRACER
depends on FUNCTION_TRACER
- depends on !X86_32 || !CC_OPTIMIZE_FOR_SIZE
default y
help
Enable the kernel to trace a function at both its return
diff --git a/kernel/trace/trace_functions_graph.c b/kernel/trace/trace_functions_graph.c
index 45e6c01..50c2251 100644
--- a/kernel/trace/trace_functions_graph.c
+++ b/kernel/trace/trace_functions_graph.c
@@ -1070,6 +1070,11 @@ static __init int init_graph_trace(void)
{
max_bytes_for_cpu = snprintf(NULL, 0, "%d", nr_cpu_ids - 1);
+#if defined(CONFIG_X86_32) && __GNUC__ >= 4 && __GNUC_MINOR__ >= 4
+ pr_info("WARNING: GCC 4.4.X breaks the function graph tracer on i686.\n"
+ " The function graph tracer will be disabled.\n");
+ return -1;
+#endif
return register_tracer(&graph_trace);
}
-- Steve
Thomas Gleixner wrote:
> On Thu, 19 Nov 2009, Andrew Haley wrote:
>> Thomas Gleixner wrote:
>>> There is no real obvious reason why the edi magic needs to be done
>>> _before_
>>>
>>> push %ebp
>>> mov %esp,%ebp
>> Sure there is: unless you do the adjustment first %ebp won't be 16-aligned.
>
> And why is this not done in 99% of the functions in the kernel, just
> in this one and some random others ?
If I could see the function I might be able to tell you. It's either a
performance enhancement, something to do with SSE, or it's a bug.
Andrew.
On Thu, Nov 19, 2009 at 11:02:32AM -0500, Steven Rostedt wrote:
> On Thu, 2009-11-19 at 15:44 +0000, Andrew Haley wrote:
> > Thomas Gleixner wrote:
>
> > We're aligning the stack properly, as per the ABI requirements. Can't
> > you just fix the tracer?
>
> And how do we do that? The hooks that are in place have no idea of what
> happened before they were called?
>
> -- Steve
Yep, this is really something we can't fix from the tracer....
On Thu, 19 Nov 2009, Andrew Haley wrote:
> Thomas Gleixner wrote:
> > On Thu, 19 Nov 2009, Andrew Haley wrote:
> >> Thomas Gleixner wrote:
> >>> There is no real obvious reason why the edi magic needs to be done
> >>> _before_
> >>>
> >>> push %ebp
> >>> mov %esp,%ebp
> >> Sure there is: unless you do the adjustment first %ebp won't be 16-aligned.
> >
> > And why is this not done in 99% of the functions in the kernel, just
> > in this one and some random others ?
>
> If I could see the function I might be able to tell you. It's either a
> performance enhancement, something to do with SSE, or it's a bug.
kernel/time/timer_stats.c timer_stats_update_stats()
Here is the disassembly:
8107ad50 <timer_stats_update_stats>:
8107ad50: 57 push %edi
8107ad51: 8d 7c 24 08 lea 0x8(%esp),%edi
8107ad55: 83 e4 f0 and $0xfffffff0,%esp
8107ad58: ff 77 fc pushl -0x4(%edi)
8107ad5b: 55 push %ebp
8107ad5c: 89 e5 mov %esp,%ebp
8107ad5e: 57 push %edi
8107ad5f: 56 push %esi
8107ad60: 53 push %ebx
8107ad61: 83 ec 6c sub $0x6c,%esp
8107ad64: e8 47 92 f8 ff call 81003fb0 <mcount>
8107ad69: 8b 77 04 mov 0x4(%edi),%esi
8107ad6c: 89 75 a4 mov %esi,-0x5c(%ebp)
8107ad6f: 65 8b 35 14 00 00 00 mov %gs:0x14,%esi
8107ad76: 89 75 e4 mov %esi,-0x1c(%ebp)
8107ad79: 31 f6 xor %esi,%esi
8107ad7b: 8b 35 60 5a cd 81 mov 0x81cd5a60,%esi
8107ad81: 8b 1f mov (%edi),%ebx
8107ad83: 85 f6 test %esi,%esi
8107ad85: 8b 7f 08 mov 0x8(%edi),%edi
8107ad88: 75 18 jne 8107ada2 <timer_stats_update_stats+0x52>
8107ad8a: 8b 45 e4 mov -0x1c(%ebp),%eax
8107ad8d: 65 33 05 14 00 00 00 xor %gs:0x14,%eax
8107ad94: 75 53 jne 8107ade9 <timer_stats_update_stats+0x99>
8107ad96: 83 c4 6c add $0x6c,%esp
8107ad99: 5b pop %ebx
8107ad9a: 5e pop %esi
8107ad9b: 5f pop %edi
8107ad9c: 5d pop %ebp
8107ad9d: 8d 67 f8 lea -0x8(%edi),%esp
8107ada0: 5f pop %edi
8107ada1: c3 ret
8107ada2: be 00 7a d6 81 mov $0x81d67a00,%esi
8107ada7: 89 45 ac mov %eax,-0x54(%ebp)
8107adaa: 89 75 a0 mov %esi,-0x60(%ebp)
8107adad: 89 5d b4 mov %ebx,-0x4c(%ebp)
8107adb0: 64 8b 35 78 6a d6 81 mov %fs:0x81d66a78,%esi
8107adb7: 8b 34 b5 20 50 cd 81 mov -0x7e32afe0(,%esi,4),%esi
8107adbe: 89 4d b0 mov %ecx,-0x50(%ebp)
8107adc1: 01 75 a0 add %esi,-0x60(%ebp)
8107adc4: 89 55 b8 mov %edx,-0x48(%ebp)
8107adc7: 8b 45 a0 mov -0x60(%ebp),%eax
8107adca: 89 7d c0 mov %edi,-0x40(%ebp)
8107adcd: e8 de f7 76 00 call 817ea5b0 <_spin_lock_irqsave>
8107add2: 83 3d 60 5a cd 81 00 cmpl $0x0,0x81cd5a60
8107add9: 89 c3 mov %eax,%ebx
8107addb: 75 11 jne 8107adee <timer_stats_update_stats+0x9e>
8107addd: 89 da mov %ebx,%edx
8107addf: 8b 45 a0 mov -0x60(%ebp),%eax
8107ade2: e8 79 fc 76 00 call 817eaa60 <_spin_unlock_irqrestore>
8107ade7: eb a1 jmp 8107ad8a <timer_stats_update_stats+0x3a>
8107ade9: e8 52 e4 fc ff call 81049240 <__stack_chk_fail>
8107adee: 8d 45 a8 lea -0x58(%ebp),%eax
8107adf1: 8b 55 a4 mov -0x5c(%ebp),%edx
8107adf4: e8 f7 fd ff ff call 8107abf0 <tstat_lookup>
8107adf9: 85 c0 test %eax,%eax
8107adfb: 74 05 je 8107ae02 <timer_stats_update_stats+0xb2>
8107adfd: ff 40 14 incl 0x14(%eax)
8107ae00: eb db jmp 8107addd <timer_stats_update_stats+0x8d>
8107ae02: f0 ff 05 00 67 fd 81 lock incl 0x81fd6700
8107ae09: eb d2 jmp 8107addd <timer_stats_update_stats+0x8d>
8107ae0b: 90 nop
8107ae0c: 90 nop
8107ae0d: 90 nop
8107ae0e: 90 nop
8107ae0f: 90 nop
There are a dozen more of those.
Thanks,
tglx
Richard Guenther <[email protected]> writes:
>
> It's likely because you have long long vars on the stack which is
> faster when they are aligned.
It's not faster for 32bit.
-Andi
--
[email protected] -- Speaking for myself only.
On Thu, 19 Nov 2009, Thomas Gleixner wrote:
>
> standard function start:
>
> push %ebp
> mov %esp, %ebp
> ....
> call mcount
>
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
>
> push %edi
> lea 0x8(%esp),%edi
> and $0xfffffff0,%esp
> pushl -0x4(%edi)
> push %ebp
> mov %esp,%ebp
> ...
> call mcount
That's some crazy sh*t anyway, since we don't _want_ the stack to be
16-byte aligned in the kernel. We do
KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2)
why is that not working?
So this looks like a gcc bug, plain and simple.
> This modification leads to a hard to solve problem in the kernel
> function graph tracer which assumes that the stack looks like:
>
> return address
> saved ebp
Umm. But it still does, doesn't it? That
pushl -0x4(%edi)
push %ebp
should do it - the "-0x4(%edi)" thing seems to be trying to reload the
return address. No?
Maybe I misread the code - but regardless, it does look like a gcc code
generation bug if only because we really don't want a 16-byte aligned
stack anyway, and have asked for it to not be done.
So I agree that gcc shouldn't do that crazy prologue (and certainly _not_
before calling mcount anyway), but I'm not sure I agree with that detail
of your analysis or explanation.
Linus
On Thu, 19 Nov 2009, Linus Torvalds wrote:
> Umm. But it still does, doesn't it? That
>
> pushl -0x4(%edi)
> push %ebp
>
> should do it - the "-0x4(%edi)" thing seems to be trying to reload the
> return address. No?
>
> Maybe I misread the code - but regardless, it does look like a gcc code
> generation bug if only because we really don't want a 16-byte aligned
> stack anyway, and have asked for it to not be done.
>
> So I agree that gcc shouldn't do that crazy prologue (and certainly _not_
> before calling mcount anyway), but I'm not sure I agree with that detail
> of your analysis or explanation.
Yes, it does store the return address before the pushed ebp, but this
is a copy of the real stack entry which is before the pushed edi.
The function graph tracer needs to redirect the return into the tracer,
and it therefore saves the real return address and modifies the stack
so the return ends up in the tracer code, which then goes back to the
real return address.
But in this prologue/alignment case we modify the copy and not the real
return address on the stack, so we return without calling into the
tracer, which is causing the headache because the state of the tracer
becomes confused.
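(For reference: on entry the tracer records, next to the real return
address, the frame pointer it saw - roughly, paraphrasing
ftrace_push_return_trace() of that kernel:)

	index = ++current->curr_ret_stack;
	current->ret_stack[index].ret  = ret;           /* real return address */
	current->ret_stack[index].func = func;
	current->ret_stack[index].fp   = frame_pointer; /* re-checked on exit */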
Thanks,
tglx
On Thu, 2009-11-19 at 09:39 -0800, Linus Torvalds wrote:
> > This modification leads to a hard to solve problem in the kernel
> > function graph tracer which assumes that the stack looks like:
> >
> > return address
> > saved ebp
>
> Umm. But it still does, doesn't it? That
>
> pushl -0x4(%edi)
> push %ebp
>
> should do it - the "-0x4(%edi)" thing seems to be trying to reload the
> return address. No?
Yes that is what it is doing. The problem we have is that it is putting
into the stack frame a "copy" of the return address, and not the actual
return-address slot. Which is fine for the function tracer, but breaks
the function graph tracer (which is a much more powerful tracer).
Technically, this is all that mcount must have. And yes, we are making
an assumption that the return address in the frame pointer is the one
that will be used to leave the function. But the reason for making this
copy just seems to be all messed up.
I don't know if the ABI says anything about the return address in the
frame pointer must be the actual return address. But it would be nice if
the gcc folks would let us guarantee that it is.
-- Steve
On Thu, Nov 19, 2009 at 6:59 PM, Steven Rostedt <[email protected]> wrote:
> On Thu, 2009-11-19 at 09:39 -0800, Linus Torvalds wrote:
>
>> > This modification leads to a hard to solve problem in the kernel
>> > function graph tracer which assumes that the stack looks like:
>> >
>> >        return address
>> >        saved ebp
>>
>> Umm. But it still does, doesn't it? That
>>
>>       pushl  -0x4(%edi)
>>       push   %ebp
>>
>> should do it - the "-0x4(%edi)" thing seems to be trying to reload the
>> return address. No?
>
> Yes that is what it is doing. The problem we have is that it is putting
> into the frame pointer a "copy" of the return address, and not the
> actual pointer. Which is fine for the function tracer, but breaks the
> function graph tracer (which is a much more powerful tracer).
>
> Technically, this is all that mcount must have. And yes, we are making
> an assumption that the return address in the frame pointer is the one
> that will be used to leave the function. But the reason for making this
> copy just seems to be all messed up.
>
> I don't know if the ABI says anything about the return address in the
> frame pointer must be the actual return address. But it would be nice if
> the gcc folks would let us guarantee that it is.
Note that I can only reproduce the issue with
-mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2.
And
you didn't provide us with a testcase either ... so please open
a bugzilla and attach preprocessed source of a file that
shows the problem, note the function it happens in and provide
the command-line options you used for building.
Otherwise it's going to be all speculation on our side.
Thanks,
Richard.
Thomas Gleixner wrote:
> On Thu, 19 Nov 2009, Thomas Gleixner wrote:
>
> Can the GCC folks please shed some light on this:
>
> standard function start:
>
> push %ebp
> mov %esp, %ebp
> ....
> call mcount
>
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
>
> push %edi
> lea 0x8(%esp),%edi
> and $0xfffffff0,%esp
> pushl -0x4(%edi)
> push %ebp
> mov %esp,%ebp
> ...
> call mcount
>
> This modification leads to a hard to solve problem in the kernel
> function graph tracer which assumes that the stack looks like:
>
> return address
> saved ebp
>
> With the modified function start sequence this is not longer true and
> the manipulation of the return address on the stack fails silently.
>
> Neither gcc 4.3 nor gcc 3.4 are generating such function frames, so it
> looks like a gcc 4.4.x feature.
>
> There is no real obvious reason why the edi magic needs to be done
> _before_
>
> push %ebp
> mov %esp,%ebp
OK, I found it. There is a struct defined as
struct entry {
...
} __attribute__((__aligned__((1 << (4)))));
and then in timer_stats_update_stats you have a local variable of type
struct entry:
void timer_stats_update_stats()
{
spinlock_t *lock;
struct entry *entry, input;
So, gcc has to 16-align the stack pointer to satisfy the alignment
for struct entry.
Andrew.
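(A minimal standalone illustration - a hypothetical test file, compiled
with "gcc-4.4 -m32 -O2 -S" - of how such an over-aligned local forces
the realignment prologue quoted above; consume() is just a made-up
extern so the variable stays on the stack:)

	struct entry {
		long data;
	} __attribute__((__aligned__(16)));

	void consume(struct entry *);

	void demo(void)
	{
		struct entry input;	/* must land on a 16-byte boundary */
		consume(&input);
	}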
Richard Guenther wrote:
> And
> you didn't provide us with a testcase either ... so please open
> a bugzilla and attach preprocessed source of a file that
> shows the problem, note the function it happens in and provide
> the command-line options you used for building.
I've got all that off-list. I found the cause, and replied in another
email. It's not a bug.
Andrew.
On Thu, 19 Nov 2009, Richard Guenther wrote:
> Note that I can only reproduce the issue with
> -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2.
> And
> you didn't provide us with a testcase either ... so please open
> a bugzilla and attach preprocessed source of a file that
> shows the problem, note the function it happens in and provide
> the command-line options you used for building.
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109
Thanks,
tglx
On Thu, 2009-11-19 at 18:20 +0000, Andrew Haley wrote:
> OK, I found it. There is a struct defined as
>
> struct entry {
> ...
> } __attribute__((__aligned__((1 << (4)))));
>
> and then in timer_stats_update_stats you have a local variable of type
> struct entry:
>
> void timer_stats_update_stats()
> {
> spinlock_t *lock;
> struct entry *entry, input;
>
> So, gcc has to 16-align the stack pointer to satisfy the alignment
> for struct entry.
It has to align the entire stack? Why not just the variable within the
stack?
-- Steve
On Thu, Nov 19, 2009 at 10:33 AM, Steven Rostedt <[email protected]> wrote:
> It has to align the entire stack? Why not just the variable within the
> stack?
I had proposed a patch which just aligns the variable but that patch
was never really commented on and HJL's patches to realign the whole
stack went in afterwards.
Thanks,
Andrew Pinski
Steven Rostedt wrote:
> On Thu, 2009-11-19 at 18:20 +0000, Andrew Haley wrote:
>
>> OK, I found it. There is a struct defined as
>>
>> struct entry {
>> ...
>> } __attribute__((__aligned__((1 << (4)))));
>>
>> and then in timer_stats_update_stats you have a local variable of type
>> struct entry:
>>
>> void timer_stats_update_stats()
>> {
>> spinlock_t *lock;
>> struct entry *entry, input;
>>
>> So, gcc has to 16-align the stack pointer to satisfy the alignment
>> for struct entry.
>
> It has to align the entire stack? Why not just the variable within the
> stack?
How? gcc has to know, at compile time, the offset from sp of each variable.
So, it of course makes sure that offset is 16-aligned, but it also has to
16-align the stack pointer.
Andrew.
On 11/19/2009 10:33 AM, Steven Rostedt wrote:
>
> It has to align the entire stack? Why not just the variable within the
> stack?
>
Because if the stack pointer isn't aligned, it won't know where it can
stuff the variable. It has to pad *somewhere*, and since you may have
more than one such variable, the most efficient way -- and by far least
complex -- is for the compiler to align the stack when it sets up the
stack frame.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, 19 Nov 2009, Richard Guenther wrote:
>
> Note that I can only reproduce the issue with
> -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2.
Since you can reproduce it with -mincoming-stack-boundary=2, I would
suggest just fixing mcount handling that way regardless of anything else.
The current code generated by gcc is just insane - even for the case where
you _want_ 16-byte stack alignment.
Instead of crazy code like
> push %edi
> lea 0x8(%esp),%edi
> and $0xfffffff0,%esp
> pushl -0x4(%edi)
> push %ebp
> mov %esp,%ebp
> ...
> call mcount
the sane thing to do would be to just do it as
push %ebp
mov %esp,%ebp
call mcount
and $0xfffffff0,%esp
since
- no sane 'mcount' implementation can ever care about 16-byte stack
alignment anyway, so aligning the stack before mcount is crazy.
- mcount is special anyway, and the only thing that cares about that
whole ebp/return address thing is mcount, and _all_ your games with
%edi are about that mcount thing.
IOW, once you as a compiler person understand that the 'mcount' call is
special, you should have realized that all the work you did for it was
totally pointless and stupid to begin with.
You must already have that special mcount logic (the whole code to save a
register early and push the fake mcount stack frame), so instead of _that_
special logic, change it to a different mcount special logic that
associates the 'mcount' call with the frame pointer pushing.
That will not only make the Linux kernel tracer happy, it will make all
your _other_ users happier too, since you can generate smaller and more
efficient code.
Admittedly, anybody who compiles with -pg probably doesn't care deeply
about smaller and more efficient code, since the mcount call overhead
tends to make the thing moot anyway, but it really looks like a win-win
situation to just fix the mcount call sequence regardless.
> And you didn't provide us with a testcase either ... so please open a
> bugzilla and attach preprocessed source of a file that shows the
> problem, note the function it happens in and provide the command-line
> options you used for building.
>
> Otherwise it's going to be all speculation on our side.
See above - all you need to do is to just fix mcount calling.
Now, there is a separate bug that shows that you seem to over-align the
stack when not asked for, and yes, since we noticed that I hope that
Thomas and friends will fix that, but I think your mcount logic could (and
should) be fixed as an independent silliness.
Linus
On Thu, 19 Nov 2009, Andrew Haley wrote:
> OK, I found it. There is a struct defined as
>
> struct entry {
> ...
> } __attribute__((__aligned__((1 << (4)))));
>
> and then in timer_stats_update_stats you have a local variable of type
> struct entry:
>
> void timer_stats_update_stats()
> {
> spinlock_t *lock;
> struct entry *entry, input;
>
> So, gcc has to 16-align the stack pointer to satisfy the alignment
> for struct entry.
This does not explain why GCC < 4.4.x actually puts
push %ebp
mov %esp, %ebp
first and why GCC 4.4.x decides to create an extra copy of the return
address instead of just keeping the mcount stack magic right at the
function entry.
Thanks,
tglx
On Thu, 19 Nov 2009, Andrew Haley wrote:
>
> I've got all that off-list. I found the cause, and replied in another
> email. It's not a bug.
Oh Gods, are we back to gcc people saying "sure, we do stupid things, but
it's allowed, so we don't consider it a bug because it doesn't matter that
real people care about real life, we only care about some paper, and real
life doesn't matter, if it's 'undefined' we can make our idiotic choices
regardless of what people need, and regardless of whether it actually
generates better code or not".
Linus
On Thu, 19 Nov 2009, Linus Torvalds wrote:
>
> Oh Gods, are we back to gcc people saying "sure, we do stupid things, but
> it's allowed, so we don't consider it a bug because it doesn't matter that
> real people care about real life, we only care about some paper, and real
> life doesn't matter, if it's 'undefined' we can make our idiotic choices
> regardless of what people need, and regardless of whether it actually
> generates better code or not".
Put another way: the stack alignment itself may not be a bug, but gcc
generating God-awful code for the mcount handling that results in problems
in real life sure as hell is *stupid* enough to be called a bug.
I bet other people than just the kernel use the mcount hook for subtler
things than just doing profiles. And even if they don't, the quoted code
generation is just crazy _crap_.
Linus
* Linus Torvalds <[email protected]> wrote:
> Admittedly, anybody who compiles with -pg probably doesn't care deeply
> about smaller and more efficient code, since the mcount call overhead
> tends to make the thing moot anyway, but it really looks like a
> win-win situation to just fix the mcount call sequence regardless.
Just a sidenote: due to dyn-ftrace, which patches out all mcounts during
bootup to be NOPs (and opt-in patches them in again if someone runs the
function tracer), the cost is not as large as one would have it with say
-pg based user-space profiling.
It's not completely zero-cost as the pure NOPs balloon the i$ footprint
a bit and GCC generates different code too in some cases. But it's
certainly good enough that it's generally pretty hard to prove, via
micro or macro benchmarks, any overhead from the patched-out mcount
call sites being there.
Ingo
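(Illustration only - not the kernel's actual patching code: the 5-byte
"call mcount" that gcc emits for -pg, and one common 5-byte NOP that
dynamic ftrace can substitute at boot:)

	/* 0xe8 plus a 32-bit relative offset: call mcount */
	static const unsigned char call_mcount[5] = { 0xe8, 0x00, 0x00, 0x00, 0x00 };
	/* nopl 0x0(%eax,%eax,1): a single 5-byte no-op */
	static const unsigned char nop5[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };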
On Thu, 19 Nov 2009, Linus Torvalds wrote:
>
> I bet other people than just the kernel use the mcount hook for subtler
> things than just doing profiles. And even if they don't, the quoted code
> generation is just crazy _crap_.
For the kernel, if the only case is that timer_stat.c thing that Thomas
pointed at, I guess we can at least work around it with something like the
appended. The kernel code is certainly ugly too, no question about that.
It's just that we'd like to be able to depend on mcount code generation
not being insane even in the presence of ugly code.
The alternative would be to have some warning when this happens, so that
we can at least see it. "mcount won't work reliably"
Linus
---
kernel/time/timer_stats.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index ee5681f..488c7b8 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -76,7 +76,7 @@ struct entry {
*/
char comm[TASK_COMM_LEN + 1];
-} ____cacheline_aligned_in_smp;
+};
/*
* Spinlock protecting the tables - not taken during lookup:
@@ -114,7 +114,7 @@ static ktime_t time_start, time_stop;
#define MAX_ENTRIES (1UL << MAX_ENTRIES_BITS)
static unsigned long nr_entries;
-static struct entry entries[MAX_ENTRIES];
+static struct entry entries[MAX_ENTRIES] ____cacheline_aligned_in_smp;
static atomic_t overflow_count;
On Thu, 19 Nov 2009, Linus Torvalds wrote:
> > I bet other people than just the kernel use the mcount hook for subtler
> > things than just doing profiles. And even if they don't, the quoted code
> > generation is just crazy _crap_.
>
> For the kernel, if the only case is that timer_stat.c thing that Thomas
> pointed at, I guess we can at least work around it with something like the
> appended. The kernel code is certainly ugly too, no question about that.
>
> It's just that we'd like to be able to depend on mcount code generation
> not being insane even in the presence of ugly code.
>
> The alternative would be to have some warning when this happens, so that
> we can at least see it. "mcount won't work reliably"
There are at least 20 other random functions which have the same
problem. Have not looked at the details yet.
Just compiled with -mincoming-stack-boundary=4 and the problem goes
away as gcc now thinks that the incoming stack is already 16 byte
aligned. But that might break code which actually uses SSE.
Thanks,
tglx
On Thu, 2009-11-19 at 19:47 +0100, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
> > Admittedly, anybody who compiles with -pg probably doesn't care deeply
> > about smaller and more efficient code, since the mcount call overhead
> > tends to make the thing moot anyway, but it really looks like a
> > win-win situation to just fix the mcount call sequence regardless.
>
> Just a sidenote: due to dyn-ftrace, which patches out all mcounts during
> bootup to be NOPs (and opt-in patches them in again if someone runs the
> function tracer), the cost is not as large as one would have it with say
> -pg based user-space profiling.
>
> It's not completely zero-cost as the pure NOPs balloon the i$ footprint
> a bit and GCC generates different code too in some cases. But it's
> certainly good enough that it's generally pretty hard to prove, via
> micro or macro benchmarks, any overhead from the patched-out mcount
> call sites being there.
And frame pointers do add a little overhead as well. Too bad the mcount
ABI wasn't something like this:
<function>:
call mcount
[...]
This way, the function address for mcount would have been (%esp) and the
parent address would be 4(%esp). Mcount would work without frame
pointers and this whole mess would also become moot.
-- Steve
Linus Torvalds wrote:
>
> On Thu, 19 Nov 2009, Linus Torvalds wrote:
>> I bet other people than just the kernel use the mcount hook for subtler
>> things than just doing profiles. And even if they don't, the quoted code
>> generation is just crazy _crap_.
>
> For the kernel, if the only case is that timer_stat.c thing that Thomas
> pointed at, I guess we can at least work around it with something like the
> appended. The kernel code is certainly ugly too, no question about that.
>
> It's just that we'd like to be able to depend on mcount code generation
> not being insane even in the presense of ugly code..
>
> The alternative would be to have some warning when this happens, so that
> we can at least see it. "mcount won't work reliably"
>
For the MIPS port of GCC and Linux I recently added the
-mmcount-ra-address switch. It causes the location of the return
address (on the stack) to be passed to mcount in a scratch register.
Perhaps something similar could be done for x86. It would make this
patching of the return location more reliable at the expense of more
code at the mcount invocation site.
For the MIPS case the code size doesn't increase, as it is done in the
delay slot of the call instruction, which would otherwise be a nop.
David Daney
On Thu, 2009-11-19 at 11:10 -0800, David Daney wrote:
> Linus Torvalds wrote:
> For the MIPS port of GCC and Linux I recently added the
> -mmcount-ra-address switch. It causes the location of the return
> address (on the stack) to be passed to mcount in a scratch register.
Hehe, scratch register on i686 ;-)
i686 has no extra regs. It just has:
%eax, %ebx, %ecx, %edx - as the general purpose regs
%esp - stack
%ebp - frame pointer
%edi, %esi - index regs
That's just 8 regs, and half of those are special.
>
> Perhaps something similar could be done for x86. It would make this
> patching of the return location more reliable at the expense of more
> code at the mcount invocation site.
I'd rather not put any more code in the call site.
>
> For the MIPS case the code size doesn't increase, as it is done in the
> delay slot of the call instruction, which would otherwise be a nop.
I showed in a previous post what the best would be for x86. That is just
calling mcount at the very beginning of the function. The return address
is automatically pushed onto the stack.
Perhaps we could create another profiler? Instead of calling mcount,
call a new function: __fentry__ or something. Have it activated with
another switch. This could make the performance of the function tracer
even better without all these exceptions.
<function>:
call __fentry__
[...]
-- Steve
On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote:
> On Thu, 2009-11-19 at 11:10 -0800, David Daney wrote:
> > Linus Torvalds wrote:
>
> > For the MIPS port of GCC and Linux I recently added the
> > -mmcount-ra-address switch. It causes the location of the return
> > address (on the stack) to be passed to mcount in a scratch register.
>
> Hehe, scratch register on i686 ;-)
>
> i686 has no extra regs. It just has:
>
> %eax, %ebx, %ecx, %edx - as the general purpose regs
> %esp - stack
> %ebp - frame pointer
> %edi, %esi - index regs
>
> That's just 8 regs, and half of those are special.
>
> >
> > Perhaps something similar could be done for x86. It would make this
> > patching of the return location more reliable at the expense of more
> > code at the mcount invocation site.
>
> I'd rather not put any more code in the call site.
>
> >
> > For the MIPS case the code size doesn't increase, as it is done in the
> > delay slot of the call instruction, which would otherwise be a nop.
>
> I showed in a previous post what the best would be for x86. That is just
> calling mcount at the very beginning of the function. The return address
> is automatically pushed onto the stack.
> Perhaps we could create another profiler? Instead of calling mcount,
> call a new function: __fentry__ or something. Have it activated with
> another switch. This could make the performance of the function tracer
> even better without all these exceptions.
>
> <function>:
> call __fentry__
> [...]
>
>
> -- Steve
I would really like this. So that we can forget about other possible
further surprises due to sophisticated function prologues being before
the mcount call.
And I guess that would fix it in every archs.
That said, Linus had a good point about the fact that there might be
other uses of mcount, even more tricky than what the function graph
tracer does, outside the kernel, and those may depend on the strict ABI
assumption that 4(ebp) is always the _real_ return address, and that
this holds through the whole call stack. This is a concern that
extends beyond the single mcount case.
So I wonder whether the real problem is actually the lack of something
that could provide this guarantee. We may need a -real-ra-before-fp
(yeah, I suck at naming).
On 11/19/2009 11:28 AM, Steven Rostedt wrote:
>
> Hehe, scratch register on i686 ;-)
>
> i686 has no extra regs. It just has:
>
> %eax, %ebx, %ecx, %edx - as the general purpose regs
> %esp - stack
> %ebp - frame pointer
> %edi, %esi - counter regs
>
> That's just 8 regs, and half of those are special.
>
For a modern ABI it is better described as:
%eax, %edx, %ecx - argument/return/scratch registers
%ebx, %esi, %edi - saved registers
%esp - stack pointer
%ebp - frame pointer (saved)
> Perhaps we could create another profiler? Instead of calling mcount,
> call a new function: __fentry__ or something. Have it activated with
> another switch. This could make the performance of the function tracer
> even better without all these exceptions.
>
> <function>:
> call __fentry__
> [...]
>
Calling the profiler immediately at the entry point is clearly the more
sane option. It means the ABI is well-defined, stable, and independent
of what the actual function contents are. It means that ABI isn't the
normal C ABI (the __fentry__ function would have to preserve all
registers), but that's fine...
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Steven Rostedt <[email protected]> wrote:
> On Thu, 2009-11-19 at 19:47 +0100, Ingo Molnar wrote:
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > Admittedly, anybody who compiles with -pg probably doesn't care deeply
> > > about smaller and more efficient code, since the mcount call overhead
> > > tends to make the thing moot anyway, but it really looks like a
> > > win-win situation to just fix the mcount call sequence regardless.
> >
> > Just a sidenote: due to dyn-ftrace, which patches out all mcounts during
> > bootup to be NOPs (and opt-in patches them in again if someone runs the
> > function tracer), the cost is not as large as one would have it with say
> > -pg based user-space profiling.
> >
> > It's not completely zero-cost as the pure NOPs balloon the i$ footprint
> > a bit and GCC generates different code too in some cases. But it's
> > certainly good enough that it's generally pretty hard to prove overhead
> > via micro or macro benchmarks that the patched out mcounts call sites
> > are there.
>
> And frame pointers do add a little overhead as well. Too bad the mcount
> ABI wasn't something like this:
>
>
> <function>:
> call mcount
> [...]
>
> This way, the function address for mcount would have been (%esp) and
> the parent address would be 4(%esp). Mcount would work without frame
> pointers and this whole mess would also become moot.
In that case we could also fix up static callsites to this address as
well (to jump +5 bytes into the function) and avoid the NOP as well in
most cases. (That would in essence merge any slow-path function epilogue
with the mcount call instruction in terms of I$ footprint - i.e. it would
be an even lower overhead feature.)
If only the kernel had its own compiler.
Ingo
2009/11/19 Frederic Weisbecker <[email protected]>:
> I would really like this. So that we can forget about other possible
> further surprises due to sophisticated function prologues being before
> the mcount call.
>
> And I guess that would fix it in every archs.
My 5 cent for this, too.
> That said, Linus had a good point about the fact that there might be
> other uses of mcount, even more tricky than what the function graph
> tracer does, outside the kernel, and those may depend on the strict ABI
> assumption that 4(ebp) is always the _real_ return address, and that
> this holds through the whole call stack. This is a concern that
> extends beyond the single mcount case.
>
> So I wonder whether the real problem is actually the lack of something
> that could provide this guarantee. We may need a -real-ra-before-fp
> (yeah, I suck at naming).
There are, especially in the Windows world. We noticed that, for
example, Sun's JDK (which is compiled by VC) can be used from
gcc-compiled code only with -fno-omit-frame-pointer, as otherwise it
fails badly due to wrong ebp accesses.
Kai
--
| (\_/) This is Bunny. Copy and paste
| (='.'=) Bunny into your signature to help
| (")_(") him gain world domination
On Thu, Nov 19, 2009 at 08:54:56PM +0100, Kai Tietz wrote:
> 2009/11/19 Frederic Weisbecker <[email protected]>:
> > I would really like this. So that we can forget about other possible
> > further surprises due to sophisticated function prologues being before
> > the mcount call.
> >
> > And I guess that would fix it in every archs.
>
> My 5 cent for this, too.
>
> > That said, Linus had a good point about the fact that there might be
> > other uses of mcount, even more tricky than what the function graph
> > tracer does, outside the kernel, and those may depend on the strict
> > ABI assumption that 4(ebp) is always the _real_ return address, and
> > that this holds through the whole call stack. This is a concern that
> > extends beyond the single mcount case.
> >
> > So I wonder whether the real problem is actually the lack of something
> > that could provide this guarantee. We may need a -real-ra-before-fp
> > (yeah, I suck at naming).
>
> There are, especially in the Windows world. We noticed that, for
> example, Sun's JDK (which is compiled by VC) can be used from
> gcc-compiled code only with -fno-omit-frame-pointer, as otherwise it
> fails badly due to wrong ebp accesses.
Yeah, but what we need is not only to ensure ebp is used as the frame
pointer but also that ebp + 4 is really the address that will be used
to return to the caller, and not a copy of the return address.
On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote:
> On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote:
> > <function>:
> > call __fentry__
> > [...]
> >
> >
> > -- Steve
>
>
> I would really like this. So that we can forget about other possible
> further surprises due to sophisticated function prologues being before
> the mcount call.
>
> And I guess that would fix it in every archs.
Well, other archs use a register to store the return address. But it
would also be easy to do (pseudo arch assembly):
<function>:
mov lr, (%sp)
add 8, %sp
blr __fentry__
sub 8, %sp
mov (%sp), lr
That way the lr would have the current function, and the parent would
still be at 8(%sp)
>
> That said, Linus had a good point about the fact that there might be
> other uses of mcount, even more tricky than what the function graph
> tracer does, outside the kernel, and those may depend on the strict ABI
> assumption that 4(ebp) is always the _real_ return address, and that
> this holds through the whole call stack. This is a concern that
> extends beyond the single mcount case.
As I said, I am proposing a new call. This means that mcount stays as-is
for legacy reasons. Yes, I know there exists -finstrument-functions, but
that adds way too much bloat to the code. One single call to the
profiler is all I want.
>
> So I wonder whether the real problem is actually the lack of something
> that could provide this guarantee. We may need a -real-ra-before-fp
> (yeah, I suck at naming).
Don't worry, so do the C compiler folks, I mean, come on "mcount"?
-- Steve
On Thu, 19 Nov 2009, H. Peter Anvin wrote:
>
> Calling the profiler immediately at the entry point is clearly the more
> sane option. It means the ABI is well-defined, stable, and independent
> of what the actual function contents are. It means that ABI isn't the
> normal C ABI (the __fentry__ function would have to preserve all
> registers), but that's fine...
As far as I know, that's true of _mcount already: it's not a normal ABI
and is rather a highly architecture-specific special case to begin with.
At least ARM has some (several?) special mcount calling conventions,
afaik.
(And then ARM people use __attribute__((naked)) and insert the code by
inline asm, or something. That seems to be "standard" in the embedded
world, where they often do even stranger things than we do in the
kernel. At least our low-level system call and interrupt handlers are
written as assembly language - the embedded world seems to commonly
write them as C functions with magic attributes and inline asm).
Linus
On Thu, 2009-11-19 at 11:50 -0800, H. Peter Anvin wrote:
> > Perhaps we could create another profiler? Instead of calling mcount,
> > call a new function: __fentry__ or something. Have it activated with
> > another switch. This could make the performance of the function tracer
> > even better without all these exceptions.
> >
> > <function>:
> > call __fentry__
> > [...]
> >
>
> Calling the profiler immediately at the entry point is clearly the more
> sane option. It means the ABI is well-defined, stable, and independent
> of what the actual function contents are. It means that ABI isn't the
> normal C ABI (the __fentry__ function would have to preserve all
> registers), but that's fine...
mcount already has that requirement (saving all/most regs). Anyway, you
are right, we don't care. The tracer should carry the brunt of the load,
not the individual callers.
-- Steve
On Thu, 2009-11-19 at 15:05 -0500, Steven Rostedt wrote:
> Well, other archs use a register to store the return address. But it
> would also be easy to do (pseudo arch assembly):
>
> <function>:
> mov lr, (%sp)
> add 8, %sp
> blr __fentry__
Should be bl __fentry__ for "branch and link".
> sub 8, %sp
> mov (%sp), lr
>
>
> That way the lr would have the current function, and the parent would
> still be at 8(%sp)
Actually, if we add a new profiler and can make our own specification, I
would say that the add and sub lines should be the responsibility of
__fentry__. Then we would have:
<function>:
mov lr, (%sp)
bl __fentry__
mov (%sp), lr
If sp points to the current content, then replace (%sp) above with
-8(%sp). Then the implementation of a nop __fentry__ would simply be:
__fentry__:
blr
For anything more elaborate, __fentry__ would be responsible for all
adjustments.
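On x86 the same scheme needs no juggling at all, since the call
instruction itself leaves both needed addresses on the stack (a sketch):

<function>:
        call __fentry__        # (%esp)  = address inside <function>
        [...]                  # 4(%esp) = parent's return address

__fentry__:
        ret                    # the nop implementation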
-- Steve
On Thu, Nov 19, 2009 at 03:05:41PM -0500, Steven Rostedt wrote:
> On Thu, 2009-11-19 at 20:46 +0100, Frederic Weisbecker wrote:
> > On Thu, Nov 19, 2009 at 02:28:06PM -0500, Steven Rostedt wrote:
>
> > > <function>:
> > > call __fentry__
> > > [...]
> > >
> > >
> > > -- Steve
> >
> >
> > I would really like this. So that we can forget about other possible
> > further surprises due to sophisticated function prologues being emitted
> > before the mcount call.
> >
> > And I guess that would fix it in every arch.
>
> Well, other archs use a register to store the return address. But it
> would also be easy to do (pseudo arch assembly):
>
> <function>:
> mov lr, (%sp)
> add 8, %sp
> blr __fentry__
> sub 8, %sp
> mov (%sp), lr
>
>
> That way the lr would have the current function, and the parent would
> still be at 8(%sp)
>
Yeah right, we need at least such a very tiny prologue for
archs that store return addresses in a reg.
> >
> > That said, Linus had a good point that there might be other uses of
> > mcount, even trickier than the function graph tracer, outside the
> > kernel, and those may depend on the strict ABI assumption that 4(ebp)
> > is always the _real_ return address, all the way up the call stack.
> > This concern even extends beyond the single mcount case.
>
> That is why I am proposing a new call; mcount stays as is for
> legacy reasons. Yes, I know there is -finstrument-functions, but
> that adds way too much bloat to the code. One single call to the
> profiler is all I want.
Sure, the purpose is not to change the existing -mcount thing.
What I meant is that we could have -mcount and -real-ra-before-fp
at the same time to guarantee fp + 4 is really what we want while
using -mcount.
The __fentry__ idea is neater, but a guarantee of a real pointer
to the return address is still something we lack.
> >
> > So I wonder whether the real problem is actually the lack of something
> > that could provide this guarantee. We may need a -real-ra-before-fp
> > (yeah, I suck at naming).
>
> Don't worry, so do the C compiler folks. I mean, come on, "mcount"?
I guess it was first created for the single purpose of counting
specific functions, but it has since been put to wider, unanticipated uses :)
On Thu, Nov 19, 2009 at 03:17:16PM -0500, Steven Rostedt wrote:
> On Thu, 2009-11-19 at 15:05 -0500, Steven Rostedt wrote:
>
> > Well, other archs use a register to store the return address. But it
> > would also be easy to do (pseudo arch assembly):
> >
> > <function>:
> > mov lr, (%sp)
> > add 8, %sp
> > blr __fentry__
>
> Should be bl __fentry__ for "branch and link".
>
> > sub 8, %sp
> > mov (%sp), lr
> >
> >
> > That way the lr would have the current function, and the parent would
> > still be at 8(%sp)
>
> Actually, if we add a new profiler and can make our own specification, I
> would say that the add and sub lines should be the responsibility of
> __fentry__. Then we would have:
>
> <function>:
> mov lr, (%sp)
> bl __fentry__
> mov (%sp), lr
>
> If sp points to the current content, then replace (%sp) above with
> -8(%sp). Then the implementation of a nop __fentry__ would simply be:
>
> __fentry__:
> blr
Good point!
> For anything more elaborate, __fentry__ would be responsible for all
> adjustments.
Yep. The more we control it from __fentry__, the less we run
into unexpected surprises.
On Thu, 19 Nov 2009, Linus Torvalds wrote:
> On Thu, 19 Nov 2009, Richard Guenther wrote:
> >
> > Note that I only can reproduce the issue with
> > -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2.
>
> Since you can reproduce it with -mincoming-stack-boundary=2, I would
> suggest just fixing mcount handling that way regardless of anything else.
> The current code generated by gcc is just insane - even for the case where
> you _want_ 16-byte stack alignment.
>
> Instead of crazy code like
>
> > push %edi
> > lea 0x8(%esp),%edi
> > and $0xfffffff0,%esp
> > pushl -0x4(%edi)
> > push %ebp
> > mov %esp,%ebp
> > ...
> > call mcount
>
> the sane thing to do would be to just do it as
>
> push %ebp
> mov %esp,%ebp
> call mcount
> and $0xfffffff0,%esp
which is what the 64-bit compile does, except that the mcount call
happens a bit later, which is fine.
ffffffff8107cd34 <timer_stats_update_stats>:
ffffffff8107cd34: 55 push %rbp
ffffffff8107cd35: 48 89 e5 mov %rsp,%rbp
ffffffff8107cd38: 48 83 e4 c0 and $0xffffffffffffffc0,%rsp
Thanks,
tglx
On Thu, 19 Nov 2009, Frederic Weisbecker wrote:
>
> > That way the lr would have the current function, and the parent would
> > still be at 8(%sp)
>
> Yeah right, we need at least such a very tiny prologue for
> archs that store return addresses in a reg.
Well, it will be architecture-dependent.
For example, alpha can store the return address in _any_ register if I
recall correctly, so you can do the "call to __fentry__" by just picking
a register other than the default one as the return address.
And powerpc has two special registers: link and ctr, but iirc you can only
load 'link' with a branch instruction. Which means that you could do
something like
mflr 0
bl __fentry__
in the caller (I forget if R0 is actually a call-trashed register or not),
and then __fentry__ could do something like
mflr 12 # save _new_ link
mtlr 0 # restore original link
mtctr 12 # move __fentry__ link to ctr
.. do whatever ..
bctr # return to __fentry__ caller
to return with 'link' restored (but ctr and r0/r12 trashed - I don't
recall the ppc calling conventions any more, but I think that is ok).
Saving to stack seems unnecessary and pointless.
Linus
On Thu, 2009-11-19 at 12:36 -0800, Linus Torvalds wrote:
>
> On Thu, 19 Nov 2009, Frederic Weisbecker wrote:
> >
> > > That way the lr would have the current function, and the parent would
> > > still be at 8(%sp)
> >
> > Yeah right, we need at least such a very tiny prologue for
> > archs that store return addresses in a reg.
>
> Well, it will be architecture-dependent.
Totally agree, as mcount today is architecture dependent.
>
> For example, alpha can store the return address in _any_ register if I
> recall correctly, so you can do the "call to __fentry__" by just picking
> a register other than the default one as the return address.
>
> And powerpc has two special registers: link and ctr, but iirc you can only
> load 'link' with a branch instruction. Which means that you could do
> something like
>
> mflr 0
> bl __fentry__
>
> in the caller (I forget if R0 is actually a call-trashed register or not),
> and then __fentry__ could do something like
>
> mflr 12 # save _new_ link
> mtlr 0 # restore original link
> mtctr 12 # move __fentry__ link to ctr
> .. do whatever ..
> bctr # return to __fentry__ caller
>
> to return with 'link' restored (but ctr and r0/r12 trashed - I don't
> recall the ppc calling conventions any more, but I think that is ok).
>
> Saving to stack seems unnecessary and pointless.
I was just using an example. But as you pointed out, each arch can find
its best way to handle it. Having the profiler called at the beginning
of the function is what I feel is the best.
We also get access to the function's parameters :-)
-- Steve
On 11/19/09 12:50, H. Peter Anvin wrote:
>
> Calling the profiler immediately at the entry point is clearly the more
> sane option. It means the ABI is well-defined, stable, and independent
> of what the actual function contents are. It means that ABI isn't the
> normal C ABI (the __fentry__ function would have to preserve all
> registers), but that's fine...
>
Note there are targets (even some old x86 variants) that required the
profiling calls to occur after the prologue. Unfortunately, nobody
documented *why* that was the case. Sigh.
Jeff
On 11/19/09 13:06, Linus Torvalds wrote:
>
> On Thu, 19 Nov 2009, H. Peter Anvin wrote:
>
>> Calling the profiler immediately at the entry point is clearly the more
>> sane option. It means the ABI is well-defined, stable, and independent
>> of what the actual function contents are. It means that ABI isn't the
>> normal C ABI (the __fentry__ function would have to preserve all
>> registers), but that's fine...
>>
> As far as I know, that's true of _mcount already: it's not a normal ABI
> and is rather a highly architecture-specific special case to begin with.
> At least ARM has some (several?) special mcount calling conventions,
> afaik.
>
Correct. _mcount's ABI typically has been defined by the implementation
of the vendor's C library mcount.
GCC has options to emit the profiling code prior to or after the
prologue, controllable through the usual variety of target macros &
hooks. I can't imagine anyone would object to a clean, tested patch to
change how x86-linux's profiling code was implemented.
jeff
Ingo,
Not sure if this is too much for this late in the -rc game, but it finds
the gcc bug at build time, and we don't need to disable function graph
tracer for all i386 builds.
This is built on my last urgent repo pull request.
Please pull the latest tip/tracing/urgent-2 tree, which can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace.git
tip/tracing/urgent-2
Steven Rostedt (1):
tracing/x86: Add check to detect GCC messing with mcount prologue
----
kernel/trace/Kconfig | 1 -
scripts/Makefile.build | 25 +++++++++++++++-
scripts/recordmcount.pl | 74 +++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 95 insertions(+), 5 deletions(-)
---------------------------
commit c7715fb611c69ac4b7f722a891de08b206fb7686
Author: Steven Rostedt <[email protected]>
Date: Thu Nov 19 23:41:02 2009 -0500
tracing/x86: Add check to detect GCC messing with mcount prologue
Latest versions of GCC create a funny prologue for some functions.
Instead of the typical:
push %ebp
mov %esp,%ebp
and $0xffffffe0,%esp
[...]
call mcount
GCC may try to align the stack before setting up the frame pointer
register:
push %edi
lea 0x8(%esp),%edi
and $0xffffffe0,%esp
pushl -0x4(%edi)
push %ebp
mov %esp,%ebp
[...]
call mcount
This crazy code places a copy of the return address into the
frame pointer. The function graph tracer uses this pointer to
save and replace the return address of the calling function to jump
to the function graph tracer's return handler, which will put back
the return address. But instead of the typical return:
mov %ebp,%esp
pop %ebp
ret
The return of the function performs:
lea -0x8(%edi),%esp
pop %edi
ret
The function graph tracer return handler will not be called at the exit
of the function, but the parent function will call it. Because we missed
the return of the child function, the handler will replace the parent's
return address with that of the child. Obviously this will cause a crash
(Note, there is code to detect this case and safely panic the kernel).
The kicker is that this happens to just a handful of functions.
And only with certain gcc options.
Compiling with: -march=pentium-mmx
will cause the problem to appear. But if you were to change
pentium-mmx to i686 or add -mtune=generic, then the problem goes away.
I first saw this problem when compiling with optimize for size.
But it seems that various other options may cause this issue to arise.
Instead of completely disabling the function graph tracer for i386 builds
this patch adds a check to recordmcount.pl to make sure that all
functions that contain a call to mcount start with "push %ebp".
If not, it will fail the compile and print out the nasty warning:
CC kernel/time/timer_stats.o
********************************************************
Your version of GCC breaks the function graph tracer
Please disable CONFIG_FUNCTION_GRAPH_TRACER
Failed function was "timer_stats_update_stats"
********************************************************
The script recordmcount.pl is given a new parameter "do_check". If
this is negative, the script will only perform this check without
creating the mcount caller section. This will be executed for x86_32
when CONFIG_FUNCTION_GRAPH_TRACER is enabled and CONFIG_DYNAMIC_FTRACE
is not.
If the arch is x86_32 and $do_check is positive, it will perform
the check while processing the mcount callers. If $do_check is 0, then
no check will be performed. This is for non x86_32 archs and when
compiling without CONFIG_FUNCTION_GRAPH_TRACER enabled, even on x86_32.
Reported-by: Thomas Gleixner <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index b416512..cd39064 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -143,7 +143,6 @@ config FUNCTION_GRAPH_TRACER
bool "Kernel Function Graph Tracer"
depends on HAVE_FUNCTION_GRAPH_TRACER
depends on FUNCTION_TRACER
- depends on !X86_32 || !CC_OPTIMIZE_FOR_SIZE
default y
help
Enable the kernel to trace a function at both its return
diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 341b589..3b897f2 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -206,10 +206,33 @@ cmd_modversions = \
endif
ifdef CONFIG_FTRACE_MCOUNT_RECORD
+
+ ifdef CONFIG_FUNCTION_GRAPH_TRACER
+ ifdef CONFIG_X86_32
+ rm_do_check = 1
+ else
+ rm_do_check = 0
+ endif
+ else
+ rm_do_check = 0
+ endif
+
cmd_record_mcount = set -e ; perl $(srctree)/scripts/recordmcount.pl "$(ARCH)" \
"$(if $(CONFIG_64BIT),64,32)" \
"$(OBJDUMP)" "$(OBJCOPY)" "$(CC)" "$(LD)" "$(NM)" "$(RM)" "$(MV)" \
- "$(if $(part-of-module),1,0)" "$(@)";
+ "$(if $(part-of-module),1,0)" "$(rm_do_check)" "$(@)";
+
+else
+
+ ifdef CONFIG_X86_32
+ ifdef CONFIG_FUNCTION_GRAPH_TRACER
+ cmd_record_mcount = set -e ; perl $(srctree)/scripts/recordmcount.pl "$(ARCH)" \
+ "$(if $(CONFIG_64BIT),64,32)" \
+ "$(OBJDUMP)" "$(OBJCOPY)" "$(CC)" "$(LD)" "$(NM)" "$(RM)" "$(MV)" \
+ "$(if $(part-of-module),1,0)" "-1" "$(@)";
+ endif
+ endif
+
endif
define rule_cc_o_c
diff --git a/scripts/recordmcount.pl b/scripts/recordmcount.pl
index 090d300..164a9d5 100755
--- a/scripts/recordmcount.pl
+++ b/scripts/recordmcount.pl
@@ -99,14 +99,14 @@ $P =~ s@.*/@@g;
my $V = '0.1';
-if ($#ARGV < 7) {
- print "usage: $P arch bits objdump objcopy cc ld nm rm mv is_module inputfile\n";
+if ($#ARGV < 11) {
+ print "usage: $P arch bits objdump objcopy cc ld nm rm mv is_module do_check inputfile\n";
print "version: $V\n";
exit(1);
}
my ($arch, $bits, $objdump, $objcopy, $cc,
- $ld, $nm, $rm, $mv, $is_module, $inputfile) = @ARGV;
+ $ld, $nm, $rm, $mv, $is_module, $do_check, $inputfile) = @ARGV;
# This file refers to mcount and shouldn't be ftraced, so lets' ignore it
if ($inputfile eq "kernel/trace/ftrace.o") {
@@ -129,6 +129,60 @@ $nm = "nm" if ((length $nm) == 0);
$rm = "rm" if ((length $rm) == 0);
$mv = "mv" if ((length $mv) == 0);
+# gcc can do stupid things with the stack pointer on x86_32.
+# It may pass a copy of the return address to mcount, which will
+# break the function graph tracer. If this happens then we need
+# to flag it and break the build.
+#
+# For x86_32, the parameter do_check will be negative if we only
+# want to perform the check, and positive if we should do the check.
+# If function graph tracing is not enabled, do_check will be zero.
+#
+
+my $check_next_line = 0;
+my $line_failed = 0;
+my $last_function;
+
+sub test_x86_32_prologue
+{
+ if ($check_next_line) {
+ if (!/push\s*\%ebp/) {
+ $line_failed = 1;
+ }
+ }
+
+ if ($line_failed && /mcount/) {
+ print STDERR "\n********************************************************\n";
+ print STDERR " Your version of GCC breaks the function graph tracer\n";
+ print STDERR " Please disable CONFIG_FUNCTION_GRAPH_TRACER\n";
+ print STDERR " Failed function was \"$last_function\"\n";
+ print STDERR "********************************************************\n\n";
+ exit -1;
+ }
+ $check_next_line = 0;
+
+ # check the first line after a function starts for
+ # push %ebp
+ if (/^[0-9a-fA-F]+\s+<([a-zA-Z_].*)>:$/) {
+ $last_function = $1;
+ $check_next_line = 1;
+ $line_failed = 0;
+ }
+}
+
+if ($do_check < 0) {
+ # Only perform the check and quit
+ open(IN, "$objdump -dr $inputfile|") || die "error running $objdump";
+
+ while (<IN>) {
+ test_x86_32_prologue;
+ }
+ close (IN);
+ exit 0;
+}
+
+my $check = 0;
+
#print STDERR "running: $P '$arch' '$objdump' '$objcopy' '$cc' '$ld' " .
# "'$nm' '$rm' '$mv' '$inputfile'\n";
@@ -153,6 +207,12 @@ if ($arch eq "x86") {
}
}
+if ($arch eq "i386") {
+ if ($do_check) {
+ $check = 1;
+ }
+}
+
#
# We base the defaults off of i386, the other archs may
# feel free to change them in the below if statements.
@@ -381,6 +441,14 @@ my $text;
my $read_headers = 1;
while (<IN>) {
+
+ # x86_32 may need to test the start of every function to see
+ # if GCC did not mess up the mcount prologue. All functions must
+ # start with push %ebp, otherwise it is broken.
+ if ($check) {
+ test_x86_32_prologue;
+ }
+
# is it a section?
if (/$section_regex/) {
$read_headers = 0;
This touches the Makefile scripts. I forgot to CC kbuild and Sam.
-- Steve
Steven Rostedt <[email protected]> writes:
>
> And frame pointers do add a little overhead as well. Too bad the mcount
> ABI wasn't something like this:
>
>
> <function>:
> call mcount
> [...]
>
> This way, the function address for mcount would have been (%esp) and the
> parent address would be 4(%esp). Mcount would work without frame
> pointers and this whole mess would also become moot.
I did a patch to do this in x86 gcc some time ago. The motivation
was indeed the frame pointer overhead on Atom with tracing.
Unfortunately it also requires glibc changes (I did those too). For
compatibility and catching mistakes the new function was called
__mcount_nofp.
I haven't tried it with current gcc and last time I missed the
gcc feature merge window with this.
But perhaps you find it useful. Of course it would need more
kernel changes to probe for the new option and handle it.
Here's the old patch. I haven't retested it with a current
gcc version, but I think it still applies at least.
If there's interest I can polish it up and submit formally.
-Andi
Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi (revision 149140)
+++ gcc/doc/tm.texi (working copy)
@@ -1884,6 +1884,12 @@
of words in each data entry.
@end defmac
+@defmac TARGET_FUNCTION_PROFILE
+Define if the target has a custom function_profiler function.
+The target should not set this macro; it is implicitly set from
+the PROFILE_BEFORE_PROLOGUE macro.
+@end defmac
+
@node Registers
@section Register Usage
@cindex register usage
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi (revision 149140)
+++ gcc/doc/invoke.texi (working copy)
@@ -593,7 +593,7 @@
-momit-leaf-frame-pointer -mno-red-zone -mno-tls-direct-seg-refs @gol
-mcmodel=@var{code-model} -mabi=@var{name} @gol
-m32 -m64 -mlarge-data-threshold=@var{num} @gol
--mfused-madd -mno-fused-madd -msse2avx}
+-mfused-madd -mno-fused-madd -msse2avx -mmcount-nofp}
@emph{IA-64 Options}
@gccoptlist{-mbig-endian -mlittle-endian -mgnu-as -mgnu-ld -mno-pic @gol
@@ -11749,6 +11749,11 @@
@opindex msse2avx
Specify that the assembler should encode SSE instructions with VEX
prefix. The option @option{-mavx} turns this on by default.
+
+@item -mmcount-nofp
+Don't force the frame pointer with @option{-pg} function profiling.
+Instead call a new @code{__mcount_nofp} function before a stack
+frame is set up.
@end table
These @samp{-m} switches are supported in addition to the above
Index: gcc/target.h
===================================================================
--- gcc/target.h (revision 149140)
+++ gcc/target.h (working copy)
@@ -1132,6 +1132,9 @@
*/
bool arm_eabi_unwinder;
+ /* True when the function profiler code is outputted before the prologue. */
+ bool profile_before_prologue;
+
/* Leave the boolean fields at the end. */
};
Index: gcc/final.c
===================================================================
--- gcc/final.c (revision 149140)
+++ gcc/final.c (working copy)
@@ -1520,10 +1520,8 @@
/* The Sun386i and perhaps other machines don't work right
if the profiling code comes after the prologue. */
-#ifdef PROFILE_BEFORE_PROLOGUE
- if (crtl->profile)
+ if (targetm.profile_before_prologue && crtl->profile)
profile_function (file);
-#endif /* PROFILE_BEFORE_PROLOGUE */
#if defined (DWARF2_UNWIND_INFO) && defined (HAVE_prologue)
if (dwarf2out_do_frame ())
@@ -1565,10 +1563,8 @@
static void
profile_after_prologue (FILE *file ATTRIBUTE_UNUSED)
{
-#ifndef PROFILE_BEFORE_PROLOGUE
- if (crtl->profile)
+ if (!targetm.profile_before_prologue && crtl->profile)
profile_function (file);
-#endif /* not PROFILE_BEFORE_PROLOGUE */
}
static void
Index: gcc/gcc.c
===================================================================
--- gcc/gcc.c (revision 149140)
+++ gcc/gcc.c (working copy)
@@ -797,6 +797,12 @@
# define SYSROOT_HEADERS_SUFFIX_SPEC ""
#endif
+/* Target can override this to allow -pg/-fomit-frame-pointer together */
+#ifndef TARGET_PG_OPTION_SPEC
+#define TARGET_PG_OPTION_SPEC \
+"%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}}"
+#endif
+
static const char *asm_debug;
static const char *cpp_spec = CPP_SPEC;
static const char *cc1_spec = CC1_SPEC;
@@ -866,8 +872,8 @@
/* NB: This is shared amongst all front-ends, except for Ada. */
static const char *cc1_options =
-"%{pg:%{fomit-frame-pointer:%e-pg and -fomit-frame-pointer are incompatible}}\
- %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*}\
+ TARGET_PG_OPTION_SPEC
+" %1 %{!Q:-quiet} -dumpbase %B %{d*} %{m*} %{a*}\
%{fcompare-debug-second:%:compare-debug-auxbase-opt(%b)} \
%{!fcompare-debug-second:%{c|S:%{o*:-auxbase-strip %*}%{!o*:-auxbase %b}}}%{!c:%{!S:-auxbase %b}} \
%{g*} %{O*} %{W*&pedantic*} %{w} %{std*&ansi&trigraphs}\
Index: gcc/target-def.h
===================================================================
--- gcc/target-def.h (revision 149140)
+++ gcc/target-def.h (working copy)
@@ -841,6 +841,12 @@
TARGET_OPTION_CAN_INLINE_P, \
}
+#ifdef PROFILE_BEFORE_PROLOGUE
+#define TARGET_FUNCTION_PROFILE true
+#else
+#define TARGET_FUNCTION_PROFILE false
+#endif
+
/* The whole shebang. */
#define TARGET_INITIALIZER \
{ \
@@ -960,7 +966,8 @@
TARGET_HANDLE_PRAGMA_REDEFINE_EXTNAME, \
TARGET_HANDLE_PRAGMA_EXTERN_PREFIX, \
TARGET_RELAXED_ORDERING, \
- TARGET_ARM_EABI_UNWINDER \
+ TARGET_ARM_EABI_UNWINDER, \
+ TARGET_FUNCTION_PROFILE \
}
#define TARGET_HANDLE_C_OPTION default_handle_c_option
Index: gcc/config/i386/i386.h
===================================================================
--- gcc/config/i386/i386.h (revision 149140)
+++ gcc/config/i386/i386.h (working copy)
@@ -2528,6 +2528,9 @@
#undef TARG_COND_NOT_TAKEN_BRANCH_COST
#define TARG_COND_NOT_TAKEN_BRANCH_COST ix86_cost->cond_not_taken_branch_cost
+/* Allow -pg with -fomit-frame-pointer */
+#define TARGET_PG_OPTION_SPEC ""
+
/*
Local variables:
version-control: t
Index: gcc/config/i386/i386.opt
===================================================================
--- gcc/config/i386/i386.opt (revision 149140)
+++ gcc/config/i386/i386.opt (working copy)
@@ -358,3 +358,7 @@
msse2avx
Target Report Var(ix86_sse2avx)
Encode SSE instructions with VEX prefix
+
+mmcount-nofp
+Target Report Var(ix86_mcount_nofp)
+Support function profiling without frame pointer
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c (revision 149140)
+++ gcc/config/i386/i386.c (working copy)
@@ -3413,6 +3413,9 @@
target_flags |= MASK_CLD & ~target_flags_explicit;
#endif
+ if (flag_omit_frame_pointer && profile_flag && ix86_mcount_nofp)
+ targetm.profile_before_prologue = true;
+
/* Save the initial options in case the user does function specific options */
if (main_args_p)
target_option_default_node = target_option_current_node
@@ -7442,7 +7445,7 @@
|| ix86_current_function_calls_tls_descriptor))
return true;
- if (crtl->profile)
+ if (crtl->profile && targetm.profile_before_prologue && !ix86_mcount_nofp)
return true;
return false;
@@ -27364,6 +27367,11 @@
void
x86_function_profiler (FILE *file, int labelno ATTRIBUTE_UNUSED)
{
+ const char *name = MCOUNT_NAME;
+
+ if (targetm.profile_before_prologue && ix86_mcount_nofp)
+ name = "__mcount_nofp";
+
if (TARGET_64BIT)
{
#ifndef NO_PROFILE_COUNTERS
@@ -27371,9 +27379,9 @@
#endif
if (DEFAULT_ABI == SYSV_ABI && flag_pic)
- fprintf (file, "\tcall\t*%s@GOTPCREL(%%rip)\n", MCOUNT_NAME);
+ fprintf (file, "\tcall\t*%s@GOTPCREL(%%rip)\n", name);
else
- fprintf (file, "\tcall\t%s\n", MCOUNT_NAME);
+ fprintf (file, "\tcall\t%s\n", name);
}
else if (flag_pic)
{
@@ -27381,7 +27389,7 @@
fprintf (file, "\tleal\t%sP%d@GOTOFF(%%ebx),%%%s\n",
LPREFIX, labelno, PROFILE_COUNT_REGISTER);
#endif
- fprintf (file, "\tcall\t*%s@GOT(%%ebx)\n", MCOUNT_NAME);
+ fprintf (file, "\tcall\t*%s@GOT(%%ebx)\n", name);
}
else
{
@@ -27389,7 +27397,7 @@
fprintf (file, "\tmovl\t$%sP%d,%%%s\n", LPREFIX, labelno,
PROFILE_COUNT_REGISTER);
#endif
- fprintf (file, "\tcall\t%s\n", MCOUNT_NAME);
+ fprintf (file, "\tcall\t%s\n", name);
}
}
--
[email protected] -- Speaking for myself only.
Commit-ID: 8629ea2eaba8ca0de2e38ce1b4a825e16255976e
Gitweb: http://git.kernel.org/tip/8629ea2eaba8ca0de2e38ce1b4a825e16255976e
Author: Feng Tang <[email protected]>
AuthorDate: Thu, 3 Sep 2009 16:32:53 +0800
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 20 Nov 2009 11:25:48 +0100
hrtimer: Fix /proc/timer_list regression
commit 507e1231 (timer stats: Optimize by adding quick check to avoid
function calls) introduced a regression in /proc/timer_list.
/proc/timer_list shows now
#0: <c27d46b0>, tick_sched_timer, S:01, <(null)>, /-1
instead of
#0: <c27d46b0>, tick_sched_timer, S:01, hrtimer_start, swapper/0
Revert the hrtimer quick check for now. The optimization needs more
thought, but this is neither 2.6.32-rc7 nor stable material.
[ tglx: - Removed unrelated changes from the original patch
        - Prevent unnecessary call to timer_stats_update_stats
- massaged the changelog ]
Signed-off-by: Feng Tang <[email protected]>
LKML-Reference: <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: [email protected]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/hrtimer.h | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index ff037f0..9bace4b 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -446,7 +446,7 @@ extern void timer_stats_update_stats(void *timer, pid_t pid, void *startf,
static inline void timer_stats_account_hrtimer(struct hrtimer *timer)
{
- if (likely(!timer->start_site))
+ if (likely(!timer_stats_active))
return;
timer_stats_update_stats(timer, timer->start_pid, timer->start_site,
timer->function, timer->start_comm, 0);
@@ -457,8 +457,6 @@ extern void __timer_stats_hrtimer_set_start_info(struct hrtimer *timer,
static inline void timer_stats_hrtimer_set_start_info(struct hrtimer *timer)
{
- if (likely(!timer_stats_active))
- return;
__timer_stats_hrtimer_set_start_info(timer, __builtin_return_address(0));
}
On Fri, 2009-11-20 at 10:57 +0100, Andi Kleen wrote:
> Steven Rostedt <[email protected]> writes:
> >
> > And frame pointers do add a little overhead as well. Too bad the mcount
> > ABI wasn't something like this:
> >
> >
> > <function>:
> > call mcount
> > [...]
> >
> > This way, the function address for mcount would have been (%esp) and the
> > parent address would be 4(%esp). Mcount would work without frame
> > pointers and this whole mess would also become moot.
>
> I did a patch to do this in x86 gcc some time ago. The motivation
> was indeed the frame pointer overhead on Atom with tracing.
>
Yes, I remember you talking about this but I don't remember how far it
went.
> Unfortunately it also requires glibc changes (I did those too). For
> compatibility and catching mistakes the new function was called
> __mcount_nofp.
Actually, could you change the name? I really hate the "mcount" name, it
is legacy and with a new feature, it should be abandoned. Something like
"__fentry__" would be nicer.
>
> I haven't tried it with current gcc and last time I missed the
> gcc feature merge window with this.
>
> But perhaps you find it useful. Of course it would need more
> kernel changes to probe for the new option and handle it.
>
> Here's the old patch. I haven't retested it with a current
> gcc version, but I think it still applies at least.
>
> If there's interest I can polish it up and submit formally.
I would definitely be interested, and I would also be willing to test
it.
Thanks!
-- Steve
On Fri, Nov 20, 2009 at 10:30:39AM +0000, tip-bot for Feng Tang wrote:
> Commit-ID: 8629ea2eaba8ca0de2e38ce1b4a825e16255976e
> Gitweb: http://git.kernel.org/tip/8629ea2eaba8ca0de2e38ce1b4a825e16255976e
> Author: Feng Tang <[email protected]>
> AuthorDate: Thu, 3 Sep 2009 16:32:53 +0800
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Fri, 20 Nov 2009 11:25:48 +0100
>
> hrtimer: Fix /proc/timer_list regression
>
> commit 507e1231 (timer stats: Optimize by adding quick check to avoid
> function calls) introduced a regression in /proc/timer_list.
>
> /proc/timer_list shows now
> #0: <c27d46b0>, tick_sched_timer, S:01, <(null)>, /-1
> instead of
> #0: <c27d46b0>, tick_sched_timer, S:01, hrtimer_start, swapper/0
For a 'real' fix I see three possible ways if timer_stats are disabled:
Print something like:
#0: <c27d46b0>, tick_sched_timer, S:01, unknown, unknown/-1
That way the format would still be the same and it should be quite obvious
that timer_stats are disabled and should be enabled.
Alternatively, leave out start_site, taskname and pid if the corresponding
timer has a NULL start_site. That would be the same as for the !TIMER_STATS
case:
#0: <c27d46b0>, tick_sched_timer, S:01
or (worst choice) revert the whole optimization and live with extra cpu cost.
Ingo, Thomas and Linus,
I know Thomas did a patch to force the -mtune=generic, but just in case
gcc decides to do something crazy again, this patch will catch it.
Should we try to get this in now?
-- Steve
On 11/20/2009 09:00 AM, Steven Rostedt wrote:
> Ingo, Thomas and Linus,
>
> I know Thomas did a patch to force the -mtune=generic, but just in case
> gcc decides to do something crazy again, this patch will catch it.
>
> Should we try to get this in now?
>
Sounds like a very good idea to me.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 11/20/2009 04:34 AM, Steven Rostedt wrote:
>>
>> If there's interest I can polish it up and submit formally.
>
> I would definitely be interested, and I would also be willing to test
> it.
>
I don't think there is any question that interception at the
architectural entry point would be the right thing to do. As such, it
would be great to get this patch productized.
-hpa
Steven Rostedt wrote:
> Ingo, Thomas and Linus,
>
> I know Thomas did a patch to force the -mtune=generic, but just in case
> gcc decides to do something crazy again, this patch will catch it.
>
> Should we try to get this in now?
I'm sure this makes sense, but a gcc test case would be even better.
If this can be detected in the gcc test suite it'll be found and
fixed long before y'all in kernel land get to see it. That's the
only way to guarantee this never bothers you again.
H.J., who wrote the code in question, is hopefully looking at why
this odd code is being generated. Once he's done I can put a
suitable test case in the gcc test suite.
Andrew.
On Fri, 2009-11-20 at 19:35 +0000, Andrew Haley wrote:
> Steven Rostedt wrote:
> > Ingo, Thomas and Linus,
> >
> > I know Thomas did a patch to force the -mtune=generic, but just in case
> > gcc decides to do something crazy again, this patch will catch it.
> >
> > Should we try to get this in now?
>
> I'm sure this makes sense, but a gcc test case would be even better.
> If this can be detected in the gcc test suite it'll be found and
> fixed long before y'all in kernel land get to see it. That's the
> only way to guarantee this never bothers you again.
>
> H.J., who wrote the code in question, is hopefully looking at why
> this odd code is being generated. Once he's done I can put a
> suitable test case in the gcc test suite.
Yes, a gcc test suite will help new versions of gcc. But we need to
worry about the versions of gcc that people have on their desktops now.
This test case will catch the discrepancy between gcc and the function
graph tracer. I'm not 100% convinced that just adding -mtune=generic will
help in all cases. If we miss another instance, then the function graph
tracer may crash someone's kernel.
-- Steve
On 11/20/2009 11:46 AM, Steven Rostedt wrote:
>
> Yes, a gcc test suite will help new versions of gcc. But we need to
> worry about the versions of gcc that people have on their desktops now.
> This test case will catch the discrepancy between gcc and the function
> graph tracer. I'm not 100% convinced that just adding -mtune=generic will
> help in all cases. If we miss another instance, then the function graph
> tracer may crash someone's kernel.
>
Furthermore, for future gcc versions what we really want is the early
interception support anyway.
-hpa
* Steven Rostedt <[email protected]> wrote:
> Ingo, Thomas and Linus,
>
> I know Thomas did a patch to force the -mtune=generic, but just in
> case gcc decides to do something crazy again, this patch will catch
> it.
>
> Should we try to get this in now?
Very nice example of defensive coding - I like this. I've queued it up
for .33 (unless anyone objects) as I think it's too late for .32.
Ingo
On Fri, Nov 20, 2009 at 11:35 AM, Andrew Haley <[email protected]> wrote:
> Steven Rostedt wrote:
>> Ingo, Thomas and Linus,
>>
>> I know Thomas did a patch to force the -mtune=generic, but just in case
>> gcc decides to do something crazy again, this patch will catch it.
>>
>> Should we try to get this in now?
>
> I'm sure this makes sense, but a gcc test case would be even better.
> If this can be detected in the gcc test suite it'll be found and
> fixed long before y'all in kernel land get to see it. That's the
> only way to guarantee this never bothers you again.
>
> H.J., who wrote the code in question, is hopefully looking at why
> this odd code is being generated. Once he's done I can put a
> suitable test case in the gcc test suite.
>
See:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109#c7
--
H.J.
H.J. Lu wrote:
> On Fri, Nov 20, 2009 at 11:35 AM, Andrew Haley <[email protected]> wrote:
>> Steven Rostedt wrote:
>>> Ingo, Thomas and Linus,
>>>
>>> I know Thomas did a patch to force the -mtune=generic, but just in case
>>> gcc decides to do something crazy again, this patch will catch it.
>>>
>>> Should we try to get this in now?
>> I'm sure this makes sense, but a gcc test case would be even better.
>> If this can be detected in the gcc test suite it'll be found and
>> fixed long before y'all in kernel land get to see it. That's the
>> only way to guarantee this never bothers you again.
>>
>> H.J., who wrote the code in question, is hopefully looking at why
>> this odd code is being generated. Once he's done I can put a
>> suitable test case in the gcc test suite.
>>
>
> See:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109#c7
I saw that, but does it mean you're going to investigate? There is
no obvious reason why -mtune=generic should affect code generation
in this way, but it does.
Andrew.
On Sun, Nov 22, 2009 at 9:20 AM, Andrew Haley <[email protected]> wrote:
> H.J. Lu wrote:
>> On Fri, Nov 20, 2009 at 11:35 AM, Andrew Haley <[email protected]> wrote:
>>> Steven Rostedt wrote:
>>>> Ingo, Thomas and Linus,
>>>>
>>>> I know Thomas did a patch to force the -mtune=generic, but just in case
>>>> gcc decides to do something crazy again, this patch will catch it.
>>>>
>>>> Should we try to get this in now?
>>> I'm sure this makes sense, but a gcc test case would be even better.
>>> If this can be detected in the gcc test suite it'll be found and
>>> fixed long before y'all in kernel land get to see it. That's the
>>> only way to guarantee this never bothers you again.
>>>
>>> H.J., who wrote the code in question, is hopefully looking at why
>>> this odd code is being generated. Once he's done I can put a
>>> suitable test case in the gcc test suite.
>>>
>>
>> See:
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109#c7
>
> I saw that, but does it mean you're going to investigate? There is
> no obvious reason why -mtune=generic should affect code generation
> in this way, but it does.
>
Why not, there is
static const unsigned int x86_accumulate_outgoing_args
= m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
| m_GENERIC;
-mtune=generic turns on -maccumulate-outgoing-args.
--
H.J.
On Thu, Nov 19, 2009 at 08:01:57PM +0100, Thomas Gleixner wrote:
> Just compiled with -mincoming-stack-boundary=4 and the problem goes
> away as gcc now thinks that the incoming stack is already 16 byte
> aligned. But that might break code which actually uses SSE
Please don't do this; lying to the compiler is just going to result in
wrong-code sooner or later. With the above switch gcc will assume the
incoming stack is 16-byte aligned (which is not true in the ix86 kernel)
and could very well e.g. optimize away code that looks at the
alignment of stack variables etc.
Jakub
On Mon, 23 Nov 2009, Jakub Jelinek wrote:
> On Thu, Nov 19, 2009 at 08:01:57PM +0100, Thomas Gleixner wrote:
> > Just compiled with -mincoming-stack-boundary=4 and the problem goes
> > away as gcc now thinks that the incoming stack is already 16 byte
> > aligned. But that might break code which actually uses SSE
>
> Please don't do this; lying to the compiler is just going to result in
> wrong-code sooner or later. With the above switch gcc will assume the
> incoming stack is 16-byte aligned (which is not true in the ix86 kernel)
> and could very well e.g. optimize away code that looks at the
> alignment of stack variables etc.
Right. I gave up the idea pretty fast. But in the current situation we
are forced to lie to the compiler in some way. Forcing -mtune=generic
when the function graph tracer is enabled seems to be a halfway sane
workaround.
Thanks,
tglx
H.J. Lu wrote:
> On Sun, Nov 22, 2009 at 9:20 AM, Andrew Haley <[email protected]> wrote:
>> H.J. Lu wrote:
>>> On Fri, Nov 20, 2009 at 11:35 AM, Andrew Haley <[email protected]> wrote:
>>>> Steven Rostedt wrote:
>>>>> Ingo, Thomas and Linus,
>>>>>
>>>>> I know Thomas did a patch to force the -mtune=generic, but just in case
>>>>> gcc decides to do something crazy again, this patch will catch it.
>>>>>
>>>>> Should we try to get this in now?
>>>> I'm sure this makes sense, but a gcc test case would be even better.
>>>> If this can be detected in the gcc test suite it'll be found and
>>>> fixed long before y'all in kernel land get to see it. That's the
>>>> only way to guarantee this never bothers you again.
>>>>
>>>> H.J., who wrote the code in question, is hopefully looking at why
>>>> this odd code is being generated. Once he's done I can put a
>>>> suitable test case in the gcc test suite.
>>>>
>>> See:
>>>
>>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109#c7
>> I saw that, but does it mean you're going to investigate? There is
>> no obvious reason why -mtune=generic should affect code generation
>> in this way, but it does.
>
> Why not, there is
>
> static const unsigned int x86_accumulate_outgoing_args
> = m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
> | m_GENERIC;
>
> -mtune=generic turns on -maccumulate-outgoing-args.
Alright, so let's at least try to give the kernel people the information
that they need.
What you're saying is, to avoid this:
000005f0 <timer_stats_update_stats>:
5f0: 57 push %edi
5f1: 8d 7c 24 08 lea 0x8(%esp),%edi
5f5: 83 e4 f0 and $0xfffffff0,%esp
5f8: ff 77 fc pushl -0x4(%edi)
5fb: 55 push %ebp
5fc: 89 e5 mov %esp,%ebp
you should compile your code with -maccumulate-outgoing-args, and there's
no need to use -mtune=generic. Is that right?
Andrew.
On Tue, 24 Nov 2009, Andrew Haley wrote:
> H.J. Lu wrote:
> > On Sun, Nov 22, 2009 at 9:20 AM, Andrew Haley <[email protected]> wrote:
> >> H.J. Lu wrote:
> >>> On Fri, Nov 20, 2009 at 11:35 AM, Andrew Haley <[email protected]> wrote:
> >>>> Steven Rostedt wrote:
> >>>>> Ingo, Thomas and Linus,
> >>>>>
> >>>>> I know Thomas did a patch to force the -mtune=generic, but just in case
> >>>>> gcc decides to do something crazy again, this patch will catch it.
> >>>>>
> >>>>> Should we try to get this in now?
> >>>> I'm sure this makes sense, but a gcc test case would be even better.
> >>>> If this can be detected in the gcc test suite it'll be found and
> >>>> fixed long before y'all in kernel land get to see it. That's the
> >>>> only way to guarantee this never bothers you again.
> >>>>
> >>>> H.J., who wrote the code in question, is hopefully looking at why
> >>>> this odd code is being generated. Once he's done I can put a
> >>>> suitable test case in the gcc test suite.
> >>>>
> >>> See:
> >>>
> >>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42109#c7
> >> I saw that, but does it mean you're going to investigate? There is
> >> no obvious reason why -mtune=generic should affect code generation
> >> in this way, but it does.
> >
> > Why not, there is
> >
> > static const unsigned int x86_accumulate_outgoing_args
> > = m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
> > | m_GENERIC;
> >
> > -mtune=generic turns on -maccumulate-outgoing-args.
>
> Alright, so let's at least try to give the kernel people the information
> that they need.
>
> What you're saying is, to avoid this:
>
> 000005f0 <timer_stats_update_stats>:
> 5f0: 57 push %edi
> 5f1: 8d 7c 24 08 lea 0x8(%esp),%edi
> 5f5: 83 e4 f0 and $0xfffffff0,%esp
> 5f8: ff 77 fc pushl -0x4(%edi)
> 5fb: 55 push %ebp
> 5fc: 89 e5 mov %esp,%ebp
>
> you should compile your code with -maccumulate-outgoing-args, and there's
> no need to use -mtune=generic. Is that right?
Seems to work. What other side effects does that have?
Thanks,
tglx
On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > you should compile your code with -maccumulate-outgoing-args, and there's
> > no need to use -mtune=generic. Is that right?
>
> Seems to work. What other side effects does that have?
Faster code, significant increase in code size though. Note that on many
architectures it is the only supported model.
Jakub
Jakub Jelinek wrote:
> On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
>>> you should compile your code with -maccumulate-outgoing-args, and there's
>>> no need to use -mtune=generic. Is that right?
>> Seems to work. What other side effects does that have?
>
> Faster code, significant increase in code size though.
Does it affect code size when we don't have to realign the stack pointer?
Andrew.
On Tue, Nov 24, 2009 at 03:32:20PM +0000, Andrew Haley wrote:
> Jakub Jelinek wrote:
> > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> >>> you should compile your code with -maccumulate-outgoing-args, and there's
> >>> no need to use -mtune=generic. Is that right?
> >> Seems to work. What other side effects does that have?
> >
> > Faster code, significant increase in code size though.
>
> Does it affect code size when we don't have to realign the stack pointer?
Yes, a lot. The difference is that -maccumulate-outgoing-args allocates
space in the prologue for the arguments of the callee with the most
arguments, using a subtraction from sp, then passes arguments with
movl XXX, 4(%esp) etc.,
and the stack pointer doesn't usually change within the function (except for
alloca/VLAs).
With -mno-accumulate-outgoing-args args are pushed using push instructions
and stack pointer is constantly changing.
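For a call like f(1, 2) the difference looks roughly like this (a sketch):

# -mno-accumulate-outgoing-args: sp moves at every call site
        pushl $2
        pushl $1
        call  f
        addl  $8,%esp

# -maccumulate-outgoing-args: space is reserved once in the prologue
        subl  $8,%esp          # prologue; sp then stays put
        [...]
        movl  $1,(%esp)
        movl  $2,4(%esp)
        call  f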
Jakub
Jakub Jelinek wrote:
> On Tue, Nov 24, 2009 at 03:32:20PM +0000, Andrew Haley wrote:
>> Jakub Jelinek wrote:
>>> On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
>>>>> you should compile your code with -maccumulate-outgoing-args, and there's
>>>>> no need to use -mtune=generic. Is that right?
>>>> Seems to work. What other side effects has that ?
>>> Faster code, significant increase in code size though.
>> Does it affect code size when we don't have to realign the stack pointer?
>
> Yes, a lot. The difference is that -maccumulate-outgoing-args allocates
> space in the prologue for the arguments of the callee that takes the most
> arguments, using a subtraction from sp; arguments are then passed with
> movl XXX, 4(%esp) etc., and the stack pointer doesn't usually change
> within the function (except for alloca/VLAs).
> With -mno-accumulate-outgoing-args, arguments are pushed using push
> instructions and the stack pointer is constantly changing.
Alright. So, it is possible in theory for gcc to generate code that
only uses -maccumulate-outgoing-args when it needs to realign SP.
And, therefore, we could have a nice option for the kernel: one that
keeps (mostly) good code density and never generates the bizarre code
sequence in the prologue.
Andrew.
On 11/24/2009 07:46 AM, Andrew Haley wrote:
>>
>> Yes, a lot. The difference is that -maccumulate-outgoing-args allocates
>> space in the prologue for the arguments of the callee that takes the most
>> arguments, using a subtraction from sp; arguments are then passed with
>> movl XXX, 4(%esp) etc., and the stack pointer doesn't usually change
>> within the function (except for alloca/VLAs).
>> With -mno-accumulate-outgoing-args, arguments are pushed using push
>> instructions and the stack pointer is constantly changing.
>
> Alright. So, it is possible in theory for gcc to generate code that
> only uses -maccumulate-outgoing-args when it needs to realign SP.
> And, therefore, we could have a nice option for the kernel: one that
> keeps (mostly) good code density and never generates the bizarre code
> sequence in the prologue.
>
If we're changing gcc anyway, then let's add the option of intercepting
the function at the point where the machine state is well-defined by
ABI, which is before the function stack frame is set up.
-maccumulate-outgoing-args sounds like it would be painful on x86 (not
using its cheap push/pop instructions), but I guess since it's only
needed when tracing, it's less of an issue.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
H. Peter Anvin wrote:
> On 11/24/2009 07:46 AM, Andrew Haley wrote:
>>> Yes, a lot. The difference is that -maccumulate-outgoing-args allocates
>>> space in the prologue for the arguments of the callee that takes the most
>>> arguments, using a subtraction from sp; arguments are then passed with
>>> movl XXX, 4(%esp) etc., and the stack pointer doesn't usually change
>>> within the function (except for alloca/VLAs).
>>> With -mno-accumulate-outgoing-args, arguments are pushed using push
>>> instructions and the stack pointer is constantly changing.
>> Alright. So, it is possible in theory for gcc to generate code that
>> only uses -maccumulate-outgoing-args when it needs to realign SP.
>> And, therefore, we could have a nice option for the kernel: one that
>> keeps (mostly) good code density and never generates the bizarre code
>> sequence in the prologue.
>
> If we're changing gcc anyway, then let's add the option of intercepting
> the function at the point where the machine state is well-defined by
> ABI, which is before the function stack frame is set up.
Hmm. On the x86 I suppose we could just inject a naked call instruction,
but not all arches allow us to call anything before we've saved the return
address. Or are you talking x86 only?
Andrew.
On Tue, 2009-11-24 at 17:12 +0000, Andrew Haley wrote:
> H. Peter Anvin wrote:
> > If we're changing gcc anyway, then let's add the option of intercepting
> > the function at the point where the machine state is well-defined by
> > ABI, which is before the function stack frame is set up.
>
> Hmm. On the x86 I suppose we could just inject a naked call instruction,
> but not all arches allow us to call anything before we've saved the return
> address. Or are you talking x86 only?
Earlier in the GCC bug thread we talked about this: adding a __fentry__
call at the beginning of the function. This could be done for other
archs as well, but yes, the return address must be stored. x86 is the
easiest because the call instruction automatically stores the return
address on the stack (Andi already has a working patch, I believe).
For other archs, Linus showed some examples:
http://lkml.org/lkml/2009/11/19/349
-- Steve
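As a rough illustration of the proposal (the names my_fentry and
traced_function, and the C-level shape, are assumptions made for this
sketch; the real mechanism would be a compiler-emitted "call __fentry__"
as the function's very first instruction, with the hook written in
assembly so it can preserve all call-clobbered registers):

	/* Hand-written stand-in for the injected call.  With the real
	 * scheme on x86, the call placed before the prologue leaves a
	 * return address pointing into the traced function on the stack,
	 * with the caller's return address in the word above it, so the
	 * full ABI entry state is still intact when the hook runs.
	 */
	static void my_fentry(void *parent_ip)
	{
		/* e.g. record parent_ip in a trace buffer */
	}

	void traced_function(void)
	{
		/* approximates what the compiler would inject: */
		my_fentry(__builtin_return_address(0));

		/* ... normal function body ... */
	}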
On 11/24/2009 09:12 AM, Andrew Haley wrote:
>>
>> If we're changing gcc anyway, then let's add the option of intercepting
>> the function at the point where the machine state is well-defined by
>> ABI, which is before the function stack frame is set up.
>
> Hmm. On the x86 I suppose we could just inject a naked call instruction,
> but not all arches allow us to call anything before we've saved the return
> address. Or are you talking x86 only?
>
For x86, we should use a naked call.
For architectures where that is not possible, we should use a minimal
sequence such that the ABI state at the invocation point is 100% derivable.
On MIPS, for example, we could use a sequence such as:
	move	at, ra		# stash the caller's return address in at
	jal	__fentry__	# jal clobbers ra with its own return address
It would be up to __fentry__ to save the value in at and to restore it
back into ra before resuming, meaning that __fentry__ has a nonstandard
calling convention.
-hpa
On Tue, 24 Nov 2009, Jakub Jelinek wrote:
> On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > you should compile your code with -maccumulate-outgoing-args, and there's
> > > no need to use -mtune=generic. Is that right?
> >
> > Seems to work. What other side effects does that have?
>
> Faster code, significant increase in code size though. Note that on many
> architectures it is the only supported model.
Just checked on the affected -march settings. The increase in code size
is about 3%, which is not that bad and definitely acceptable for the
tracing case. Will zap the -mtune=generic patch and use
-maccumulate-outgoing-args instead.
Thanks,
tglx
* Thomas Gleixner <[email protected]> wrote:
> On Tue, 24 Nov 2009, Jakub Jelinek wrote:
>
> > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > you should compile your code with -maccumulate-outgoing-args, and there's
> > > > no need to use -mtune=generic. Is that right?
> > >
> > > Seems to work. What other side effects does that have?
> >
> > Faster code, significant increase in code size though. Note that on many
> > architectures it is the only supported model.
>
> Just checked on the affected -march settings. The increase in code size
> is about 3%, which is not that bad and definitely acceptable for the
> tracing case. Will zap the -mtune=generic patch and use
> -maccumulate-outgoing-args instead.
hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs, so
3% is a big deal IMO.
Ingo
On Wed, 25 Nov 2009, Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
>
> > On Tue, 24 Nov 2009, Jakub Jelinek wrote:
> >
> > > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > > you should compile your code with -maccumulate-outgoing-args, and there's
> > > > > no need to use -mtune=generic. Is that right?
> > > >
> > > > Seems to work. What other side effects does that have?
> > >
> > > Faster code, significant increase in code size though. Note that on many
> > > architectures it is the only supported model.
> >
> > Just checked on the affected -march settings. The increase in code size
> > is about 3%, which is not that bad and definitely acceptable for the
> > tracing case. Will zap the -mtune=generic patch and use
> > -maccumulate-outgoing-args instead.
>
> hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs, so
> 3% is a big deal IMO.
Distro-configs have -mtune=generic anyway. So it's not changing
anything for them.
I'm talking about the -march flags which result in that weird code
(pentium-mmx ....).
Thanks,
tglx
* Thomas Gleixner <[email protected]> wrote:
> On Wed, 25 Nov 2009, Ingo Molnar wrote:
> > * Thomas Gleixner <[email protected]> wrote:
> >
> > > On Tue, 24 Nov 2009, Jakub Jelinek wrote:
> > >
> > > > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > > > you should compile your code with -maccumulate-outgoing-args, and there's
> > > > > > no need to use -mtune=generic. Is that right?
> > > > >
> > > > > Seems to work. What other side effects does that have?
> > > >
> > > > Faster code, significant increase in code size though. Note that on many
> > > > architectures it is the only supported model.
> > >
> > > Just checked on the affected -march settings. The increase in code size
> > > is about 3%, which is not that bad and definitely acceptable for the
> > > tracing case. Will zap the -mtune=generic patch and use
> > > -maccumulate-outgoing-args instead.
> >
> > hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs, so
> > 3% is a big deal IMO.
>
> Distro-configs have -mtune=generic anyway. So it's not changing
> anything for them.
>
> I'm talking about the -march flags which result in that weird code
> (pentium-mmx ....).
ok!
Ingo
On Wed, Nov 25, 2009 at 04:44:52PM +0100, Ingo Molnar wrote:
>
> * Thomas Gleixner <[email protected]> wrote:
>
> > On Tue, 24 Nov 2009, Jakub Jelinek wrote:
> >
> > > On Tue, Nov 24, 2009 at 03:55:49PM +0100, Thomas Gleixner wrote:
> > > > > you should compile your code with -maccumulate-outgoing-args, and there's
> > > > > no need to use -mtune=generic. Is that right?
> > > >
> > > > Seems to work. What other side effects does that have?
> > >
> > > Faster code, significant increase in code size though. Note that on many
> > > architectures it is the only supported model.
> >
> > Just checked on the affected -march settings. The increase in code size
> > is about 3%, which is not that bad and definitely acceptable for the
> > tracing case. Will zap the -mtune=generic patch and use
> > -maccumulate-outgoing-args instead.
>
> hm, 3% sounds quite large :( dyn-ftrace is enabled in distro configs, so
> 3% is a big deal IMO.
If the kernels that 90%+ of the people out there run are compiled with
-p on i?86/x86_64, then coming up with a new gcc switch and a new
profiling ABI is certainly desirable. -p on i?86/x86_64 e.g. forces
-fno-omit-frame-pointer, which makes code on these register-starved
arches significantly worse.
Making GCC output the profiling call before the prologue instead of
after it is a 4-liner in generic code and a few lines in target-specific
code.
The important thing is that we shouldn't have 100 different profiling
ABIs, so it is desirable to agree on something that will be generally
useful not just for the kernel, but perhaps for other purposes.
Jakub
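For the record, the -fno-omit-frame-pointer cost Jakub refers to looks
roughly like this on 32-bit x86 (the function inc is illustrative, and
the asm in the comments is typical output of the era; exact code depends
on the gcc version):

	/* int inc(int x) { return x + 1; }
	 *
	 * With the frame pointer forced (as -p does):
	 *
	 *	pushl	%ebp		# %ebp tied up for frame linkage
	 *	movl	%esp, %ebp
	 *	movl	8(%ebp), %eax
	 *	addl	$1, %eax
	 *	popl	%ebp
	 *	ret
	 *
	 * With -fomit-frame-pointer, %ebp stays available as a general
	 * register, which matters on a register-starved target:
	 *
	 *	movl	4(%esp), %eax
	 *	addl	$1, %eax
	 *	ret
	 */
	int inc(int x)
	{
		return x + 1;
	}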
On 11/24/2009 09:30 AM, Steven Rostedt wrote:
>
> For other archs, Linus showed some examples:
>
> http://lkml.org/lkml/2009/11/19/349
>
Yes; the key here is that the ABI-defined entry state is readily
mappable onto the state on entry to the __fentry__ function.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 11/25/2009 08:44 AM, Jakub Jelinek wrote:
>
> If the kernels that 90%+ of the people out there run are compiled with
> -p on i?86/x86_64, then coming up with a new gcc switch and a new
> profiling ABI is certainly desirable. -p on i?86/x86_64 e.g. forces
> -fno-omit-frame-pointer, which makes code on these register-starved
> arches significantly worse.
> Making GCC output the profiling call before the prologue instead of
> after it is a 4-liner in generic code and a few lines in target-specific
> code.
> The important thing is that we shouldn't have 100 different profiling
> ABIs, so it is desirable to agree on something that will be generally
> useful not just for the kernel, but perhaps for other purposes.
>
There is really just one that makes sense, which is providing the
ABI-defined entry state, which means intercepting at the point of entry.
Anything else is/was a mistake.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
H. Peter Anvin wrote:
> On 11/25/2009 08:44 AM, Jakub Jelinek wrote:
>> If the kernels that 90%+ of the people out there run are compiled with
>> -p on i?86/x86_64, then coming up with a new gcc switch and a new
>> profiling ABI is certainly desirable. -p on i?86/x86_64 e.g. forces
>> -fno-omit-frame-pointer, which makes code on these register-starved
>> arches significantly worse.
>> Making GCC output the profiling call before the prologue instead of
>> after it is a 4-liner in generic code and a few lines in target-specific
>> code.
>> The important thing is that we shouldn't have 100 different profiling
>> ABIs, so it is desirable to agree on something that will be generally
>> useful not just for the kernel, but perhaps for other purposes.
>
> There is really just one that makes sense, which is providing the
> ABI-defined entry state, which means intercepting at the point of entry.
>
> Anything else is/was a mistake.
Indeed. The problem, though, is that the "naked call" approach, while
attractive, requires the back end to be modified, and so needs the help
of the gcc maintainers for every Linux target. Not that this is a
terrible idea, but such coordination is going to take time.
Andrew.