2022-07-05 11:20:20

by Sascha Hauer

[permalink] [raw]
Subject: Performance impact of CONFIG_FUNCTION_TRACER

Hi,

I ran some lmbench subtests on a ARMv7 machine (NXP i.MX6q) with and
without CONFIG_FUNCTION_TRACER enabled (with CONFIG_DYNAMIC_FTRACE
enabled and no tracing active), see below. The Kconfig help text of this
option reads as:

> If it's runtime disabled (the bootup default), then the overhead of
> the instructions is very small and not measurable even in
> micro-benchmarks.

In my tests the overhead is small, but it surely exists and is
measurable at least on ARMv7 machines. Is this expected? Should the help
text be rephrased a little less optimistic?

lat_fs
======

CONFIG_FUNCTION_TRACER=y
------------------------

_ file size
/ _ repetitions
| / _ created files per second
| | / _ deleted files per second
| | | /
0k 205 38325 70264
1k 123 21949 41495
4k 119 21632 42365
10k 84 14408 29439

CONFIG_FUNCTION_TRACER=n
------------------------

0k 198 39317 76241
1k 125 22397 42420
4k 125 22038 44144
10k 84 14965 30032

./scripts/do_ctx
================

Measures context switch time (in us?)

CONFIG_FUNCTION_TRACER=y
------------------------
"size=0k ovr=4.60
2 16.57
4 15.75
8 16.99
16 16.93
24 18.79
32 20.11
64 22.20
96 21.59

CONFIG_FUNCTION_TRACER=n
------------------------
"size=0k ovr=4.88
2 15.26
4 17.29
8 18.25
16 18.40
24 22.19
32 20.85
64 23.32
96 23.11

lat_pipe
========

I think it measures the time between sending and receiving data
through a pipe.

CONFIG_FUNCTION_TRACER=y
------------------------
Pipe latency: 44.7039 microseconds

CONFIG_FUNCTION_TRACER=n
------------------------
Pipe latency: 41.9876 microseconds

--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |


2022-07-05 14:46:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: Performance impact of CONFIG_FUNCTION_TRACER

On Tue, 5 Jul 2022 12:54:16 +0200
Sascha Hauer <[email protected]> wrote:

> Hi,
>
> I ran some lmbench subtests on a ARMv7 machine (NXP i.MX6q) with and
> without CONFIG_FUNCTION_TRACER enabled (with CONFIG_DYNAMIC_FTRACE
> enabled and no tracing active), see below. The Kconfig help text of this
> option reads as:
>
> > If it's runtime disabled (the bootup default), then the overhead of
> > the instructions is very small and not measurable even in
> > micro-benchmarks.

Well, this is true for x86 ;-)

>
> In my tests the overhead is small, but it surely exists and is
> measurable at least on ARMv7 machines. Is this expected? Should the help
> text be rephrased a little less optimistic?

You mean "(but may vary by architecture)"

As I believe due to using a link register for function calls, ARM
requires adding two 4 byte nops to every function where as x86 only
adds a single 5 byte nop.

Although nops are very fast (they should not be processed in the CPU's
pipe line, but I don't know if that's true for every arch). It also
affects instruction cache misses, as adding 8 bytes around the code
will cause more cache misses than when they do not exist.

Also, there's some configurations that use the old mcount that does add
some more code to handle the mcount case.

So if this is just to have us change the kconfig, I'm happy to do that.

-- Steve

2022-07-05 22:29:49

by Sascha Hauer

[permalink] [raw]
Subject: Re: Performance impact of CONFIG_FUNCTION_TRACER

On Tue, Jul 05, 2022 at 10:39:01AM -0400, Steven Rostedt wrote:
> On Tue, 5 Jul 2022 12:54:16 +0200
> Sascha Hauer <[email protected]> wrote:
>
> > Hi,
> >
> > I ran some lmbench subtests on a ARMv7 machine (NXP i.MX6q) with and
> > without CONFIG_FUNCTION_TRACER enabled (with CONFIG_DYNAMIC_FTRACE
> > enabled and no tracing active), see below. The Kconfig help text of this
> > option reads as:
> >
> > > If it's runtime disabled (the bootup default), then the overhead of
> > > the instructions is very small and not measurable even in
> > > micro-benchmarks.
>
> Well, this is true for x86 ;-)

That was my assumption ;)

>
> >
> > In my tests the overhead is small, but it surely exists and is
> > measurable at least on ARMv7 machines. Is this expected? Should the help
> > text be rephrased a little less optimistic?
>
> You mean "(but may vary by architecture)"

Something like that, yes.

>
> As I believe due to using a link register for function calls, ARM
> requires adding two 4 byte nops to every function where as x86 only
> adds a single 5 byte nop.
>
> Although nops are very fast (they should not be processed in the CPU's
> pipe line, but I don't know if that's true for every arch). It also
> affects instruction cache misses, as adding 8 bytes around the code
> will cause more cache misses than when they do not exist.

Just digged around a bit and saw that on ARM it's not even a real nop.
The compiler emits:

push {lr}
bl 8010e7c0 <__gnu_mcount_nc>

Which is then turned into a nop by replacing the second instruction with

add sp, sp, #4

to bring the stack pointer back to its original value. This indeed must
be processed by the CPU pipeline. I wonder if that could be optimized by
replacing both instructions with a nop. I have no idea though if that's
feasible at all or if the overhead would even get smaller by that.

>
> Also, there's some configurations that use the old mcount that does add
> some more code to handle the mcount case.
>
> So if this is just to have us change the kconfig, I'm happy to do that.

Yes, would be good to make the kconfig text clear. The overhead itself
is fine when people know that's the price to pay for getting the
function tracer.

Sascha

--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

2022-07-05 22:49:06

by Steven Rostedt

[permalink] [raw]
Subject: Re: Performance impact of CONFIG_FUNCTION_TRACER

On Tue, 5 Jul 2022 23:59:48 +0200
Sascha Hauer <[email protected]> wrote:

> >
> > As I believe due to using a link register for function calls, ARM
> > requires adding two 4 byte nops to every function where as x86 only
> > adds a single 5 byte nop.
> >
> > Although nops are very fast (they should not be processed in the CPU's
> > pipe line, but I don't know if that's true for every arch). It also
> > affects instruction cache misses, as adding 8 bytes around the code
> > will cause more cache misses than when they do not exist.
>
> Just digged around a bit and saw that on ARM it's not even a real nop.
> The compiler emits:
>
> push {lr}
> bl 8010e7c0 <__gnu_mcount_nc>
>
> Which is then turned into a nop by replacing the second instruction with
>
> add sp, sp, #4
>
> to bring the stack pointer back to its original value. This indeed must
> be processed by the CPU pipeline. I wonder if that could be optimized by
> replacing both instructions with a nop. I have no idea though if that's
> feasible at all or if the overhead would even get smaller by that.

The problem is that there's no easy way to do that, because a task
could have been preempted after doing the 'push {lr}' and before the
'bl'. Thus, you create a race by changing either one to a nop first.

I wonder if it would have been better to change the first one to a jump
passed the second :-/

Actually, if you don't mind setups that take a long time, if you change
the first to a jump passed the second, then do synchronize_rcu_rude()
(which may take a while, possibly several seconds or more) then you know
that all users now only see the jump, and none will see the bl. Then
you could convert the bl to nop, and then even change the jump to nop
after that.

To convert back, you would need to reverse it. Convert the first nop
back to a jmp, run synchronize_rcu_rude(). Then convert the second nop
to the bl, and then convert the first to the push {lr}.

>
> >
> > Also, there's some configurations that use the old mcount that does add
> > some more code to handle the mcount case.
> >
> > So if this is just to have us change the kconfig, I'm happy to do that.
>
> Yes, would be good to make the kconfig text clear. The overhead itself
> is fine when people know that's the price to pay for getting the
> function tracer.

Agreed. I'll write up a patch.

-- Steve

2022-07-14 09:28:52

by Sascha Hauer

[permalink] [raw]
Subject: Re: Performance impact of CONFIG_FUNCTION_TRACER

On Tue, Jul 05, 2022 at 06:27:46PM -0400, Steven Rostedt wrote:
> On Tue, 5 Jul 2022 23:59:48 +0200
> Sascha Hauer <[email protected]> wrote:
>
> > >
> > > As I believe due to using a link register for function calls, ARM
> > > requires adding two 4 byte nops to every function where as x86 only
> > > adds a single 5 byte nop.
> > >
> > > Although nops are very fast (they should not be processed in the CPU's
> > > pipe line, but I don't know if that's true for every arch). It also
> > > affects instruction cache misses, as adding 8 bytes around the code
> > > will cause more cache misses than when they do not exist.
> >
> > Just digged around a bit and saw that on ARM it's not even a real nop.
> > The compiler emits:
> >
> > push {lr}
> > bl 8010e7c0 <__gnu_mcount_nc>
> >
> > Which is then turned into a nop by replacing the second instruction with
> >
> > add sp, sp, #4
> >
> > to bring the stack pointer back to its original value. This indeed must
> > be processed by the CPU pipeline. I wonder if that could be optimized by
> > replacing both instructions with a nop. I have no idea though if that's
> > feasible at all or if the overhead would even get smaller by that.
>
> The problem is that there's no easy way to do that, because a task
> could have been preempted after doing the 'push {lr}' and before the
> 'bl'. Thus, you create a race by changing either one to a nop first.
>
> I wonder if it would have been better to change the first one to a jump
> passed the second :-/

I gave this a try, but the performance was not better compared to the
stack push/pop operations we have now. I also tried to replace both
instructions with nops (mov r0, r0), still no better performance. I
guess we have to live with it then.

Sascha

--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |