One of the things that bothered me about the latest ftrace code was
this annoying daemon that would wake up once a second to see if it
had work to do. If it did not, it would go to sleep, otherwise it would
do its work and then go to sleep.
You see, the reason for this is that for ftrace to maintain performance
when configured in but disabled, it would need to change all the
locations that called "mcount" (enabled with the gcc -pg option) into
nops. The "-pg" option in gcc sets up a function profiler to call this
function called "mcount". If you simply have "mcount" return, it will
still add 15 to 18% overhead in performance. Changing all the calls to
nops moved the overhead into noise.
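To illustrate what that means (a conceptual sketch only, not the real kernel
code; the function names are made up), with -pg every function effectively
begins with a profiling call, so even a do-nothing mcount is paid for on
every single function entry:
	void mcount(void);            /* normally supplied by the kernel in assembly */
	void some_kernel_function(void)
	{
		mcount();             /* call inserted by gcc -pg at function entry */
		/* ... the real work ... */
	}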
To get rid of this, I had the mcount code record the location that called
it. Later, the "ftraced" daemon would wake up and look to see if
any new functions were recorded. If so, it would call kstop_machine
and convert the calls to "nops". We needed kstop_machine because bad
things happen on SMP if you modify code that happens to be in the
instruction cache of another CPU.
This "ftraced" kernel thread would be a happy little worker, but it caused
some pains. One is that it woke up once a second, and Ted Ts'o got mad
at me because it would show up on PowerTop. I could easily make the
default 5 seconds, and even have it runtime configurable, with a trivial
patch. I have not got around to doing that yet.
The other annoying thing, and this one bothers me the most, is that we
can not have this enabled on a production -rt kernel. The latency caused
by the kstop_machine when doing work is lost in the noise on a non-rt
kernel, but it can be up to 800 microseconds, and that would kill
the -rt kernel. The reason this bothered me the most is that -rt is
where it came from, and ftraced was not treating its motherland very well.
Along came Gregory Haskins, who was bickering about having ftrace enabled
on a production -rt kernel. I told him the reasons that this would be bad
and then he started thinking out loud, and suggesting wild ideas, like
patching gcc!
Since I had recently seen "The Dark Knight", Gregory's comments put me
into an "evil" mood. I then thought of the idea of using the relocation
entries of the mcount call sites in a prelinked object file to create a
separate section with a list of these sites. On boot up, we record them
and change the calls into nops.
That's it! No kstop_machine for turning them into nops. We would only need
stop_machine to enable or disable tracing, but a user not tracing will not have
to deal with this annoying "ftraced" kernel thread waking up every second
or ever running kstop_machine.
What's more, this means we can enable it on a production -rt kernel!
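To make the boot-up side concrete, here is a rough sketch of the idea in C,
assuming the linker script collects all the __mcount_loc entries between a
pair of marker symbols. All of the names below (__start_mcount_loc,
ftrace_record_ip, ftrace_make_nop) are illustrative, not necessarily what the
actual patches use:
	/* sketch: walk the collected call-site list once, early at boot */
	extern unsigned long __start_mcount_loc[];  /* bracketing symbols placed */
	extern unsigned long __stop_mcount_loc[];   /* around the section by the
						       linker script */
	extern void ftrace_record_ip(unsigned long ip);  /* illustrative helpers */
	extern void ftrace_make_nop(unsigned long ip);
	static void ftrace_convert_mcount_sites(void)
	{
		unsigned long *p;
		for (p = __start_mcount_loc; p < __stop_mcount_loc; p++) {
			ftrace_record_ip(*p);   /* remember the call site */
			ftrace_make_nop(*p);    /* turn the call into a nop; this
						   runs so early that no
						   kstop_machine is needed */
		}
	}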
Now, this was no easy task. We needed to add a section to every object
file with a list of pointers to the call sites to mcount. The idea I came
up with was to make a tmp.s file for every object just after it is compiled.
This tmp.s would then be compiled and relinked into the original object.
The tmp.s file would have something like:
.section __mcount_loc,"a",@progbits
.quad location_of_mcount1
.quad location_of_mcount2
(etc)
By running objdump on the object file, we can find the offsets within each
section at which mcount is called.
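(For reference, the listings below can be generated with something like
"objdump -dr kernel/hrtimer.o"; the -r flag makes objdump show the relocation
records inline with the disassembly.)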
For example, looking at hrtimer.o:
Disassembly of section .text:
0000000000000000 <hrtimer_init_sleeper>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: e8 00 00 00 00 callq 9 <hrtimer_init_sleeper+0x9>
5: R_X86_64_PC32 mcount+0xfffffffffffffffc
[...]
The '5' in the '5: R_X86_64_PC32' line is the offset at which the mcount
relocation will be applied for this call site. The offset is relative to the
start of the .text section, not necessarily to the function. If we look
further we see:
000000000000001e <ktime_add_safe>:
1e: 55 push %rbp
1f: 48 89 e5 mov %rsp,%rbp
22: e8 00 00 00 00 callq 27 <ktime_add_safe+0x9>
23: R_X86_64_PC32 mcount+0xfffffffffffffffc
This mcount call site is at offset 0x23 from the start of the .text section,
and obviously not from the start of ktime_add_safe.
If we make a tmp.s that has the following:
.section __mcount_loc,"a",@progbits
.quad hrtimer_init_sleeper + 0x5
.quad hrtimer_init_sleeper + 0x23
We have a section with the locations of these two call sites. After the final
linking, they will point to the actual address used.
All that would need to be done is:
gcc -c tmp.s -o tmp.o
ld -r tmp.o hrtimer.o -o tmp_hrtimer.o
mv tmp_hrtimer.o hrtimer.o
Easy as that! Not quite. What happens if that first function in the
section is a static function? That is, the symbol for the function
is local to the object. If for some reason hrtimer_init_sleeper were static,
then tmp_hrtimer.o would end up with two symbols for hrtimer_init_sleeper:
the local one that defines it, and a global one left unresolved by the
reference in tmp.o.
But we can be even more evil with this idea. We can do crazy things
with objcopy to solve it for us.
objcopy --globalize-symbol hrtimer_init_sleeper hrtimer.o tmp_hrtimer.o
Now the hrtimer_init_sleeper would be global for linking.
ld -r tmp_hrtimer.o tmp.o -o tmp2_hrtimer.o
Now the tmp.o can use the same global hrtimer_init_sleeper symbol.
But now tmp2_hrtimer.o has the symbols from both tmp.o and tmp_hrtimer.o,
and we can't just blindly leave local symbols converted to globals.
The solution is simply to put it back to local.
objcopy --localize-symbol hrtimer_init_sleeper tmp2_hrtimer.o hrtimer.o
Now our hrtimer.o file has our __mcount_loc section and the
reference to hrtimer_init_sleeper will be resolved.
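Putting the whole per-object dance together for the static-symbol case, the
sequence is just the commands from above, in order:
gcc -c tmp.s -o tmp.o
objcopy --globalize-symbol hrtimer_init_sleeper hrtimer.o tmp_hrtimer.o
ld -r tmp_hrtimer.o tmp.o -o tmp2_hrtimer.o
objcopy --localize-symbol hrtimer_init_sleeper tmp2_hrtimer.o hrtimer.o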
This is a bit complex to do in shell scripting and Makefiles, so I wrote
a well-documented recordmcount.pl Perl script that does all of the above
in one place.
With this new update, we can work to kill that kernel thread "ftraced"!
This patch set ports this to x86_64 and i386; the other archs will still use
the daemon until they are converted over.
I tested this on both x86_64 and i386 with and without CONFIG_RELOCATE
set.
* Steven Rostedt ([email protected]) wrote:
>
> One of the things that bothered me about the latest ftrace code was
> this annoying daemon that would wake up once a second to see if it
> had work to do. If it did not, it would go to sleep, otherwise it would
> do its work and then go to sleep.
>
> You see, the reason for this is that for ftrace to maintain performance
> when configured in but disabled, it would need to change all the
> locations that called "mcount" (enabled with the gcc -pg option) into
> nops. The "-pg" option in gcc sets up a function profiler to call this
> function called "mcount". If you simply have "mcount" return, it will
> still add 15 to 18% overhead in performance. Changing all the calls to
> nops moved the overhead into noise.
>
> To get rid of this, I had the mcount code record the location that called
> it. Later, the "ftraced" daemon would wake up and look to see if
> any new functions were recorded. If so, it would call kstop_machine
> and convert the calls to "nops". We needed kstop_machine because bad
> things happen on SMP if you modify code that happens to be in the
> instruction cache of another CPU.
>
> This "ftraced" kernel thread would be a happy little worker, but it caused
> some pains. One, is that it woke up once a second, and Ted Tso got mad
> at me because it would show up on PowerTop. I could easily make the
> default 5 seconds, and even have it runtime configurable, with a trivial
> patch. I have not got around to doing that yet.
>
> The other annoying thing, and this one bothers me the most, is that we
> can not have this enabled on a production -rt kernel. The latency caused
> by the kstop_machine when doing work is lost in the noise on a non-rt
> kernel, but it can be up to 800 microseconds, and that would kill
> the -rt kernel. The reason this bothered me the most, is that -rt is
> where it came from, and ftraced was not treating its motherland very well.
>
Hi Steven,
If you really want to stop using stop_machine, I think you should have a
look at my immediate values infrastructure :
see:
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=0958c02b49eed3bbc00bdc5aceeee17a6d0c7ab4;hb=HEAD
particularly replace_instruction_safe(), which uses a temporary
breakpoint to safely replace an instruction on a live SMP system without
using stop_machine. Feel free to try it in ftrace. :)
You may need to tweak the imv_notifier(), which is responsible for
executing the breakpoint. The only special thing this handler has to
know is the non-relocatable instructions you might want to insert (it
only cares about the new instructions, not the old ones being replaced).
The current code deals with 5-bytes jumps. Note that the instruction is
executed in the breakpoint handler with preemption disabled, which might
not be well suited for a call instruction.
I would recommend to patch in a 5-bytes jmp with 0 offset
(e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction
and thus makes sure no instruction pointer can iret in the middle).
When enabling the site, you could patch-in the original call, but you
should tweak the imv_notifier() so that it uses the jmp offset 0 in the
bypass instead of the function call, because preemption would otherwise
be disabled around the call when executed in the breakpoint bypass.
Therefore, simply statically forcing the bypass code to e9 00 00 00 00
in all the cases (a nop) would do the job for your needs.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Thu, 7 Aug 2008, Mathieu Desnoyers wrote:
>
> Hi Steven,
>
> If you really want to stop using stop_machine, I think you should have a
> look at my immediate values infrastructure :
>
> see:
> http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=0958c02b49eed3bbc00bdc5aceeee17a6d0c7ab4;hb=HEAD
>
> particularly replace_instruction_safe(), which uses a temporary
> breakpoint to safely replace an instruction on a live SMP system without
> using stop_machine. Feel free to try it in ftrace. :)
>
> You may need to tweak the imv_notifier(), which is responsible for
> executing the breakpoint. The only special thing this handler has to
> know is the non-relocatable instructions you might want to insert (it
> only cares about the new instructions, not the old ones being replaced).
> The current code deals with 5-bytes jumps. Note that the instruction is
> executed in the breakpoint handler with preemption disabled, which might
> not be well suited for a call instruction.
>
> I would recommend to patch in a 5-bytes jmp with 0 offset
> (e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction
> and thus makes sure no instruction pointer can iret in the middle).
>
> When enabling the site, you could patch-in the original call, but you
> should tweak the imv_notifier() so that it uses the jmp offset 0 in the
> bypass instead of the function call, because preemption would otherwise
> be disabled around the call when executed in the breakpoint bypass.
>
> Therefore, simply statically forcing the bypass code to e9 00 00 00 00
> in all the cases (a nop) would do the job for your needs.
I originally used jumps instead of nops, but unfortunately, they actually
hurt performance more than adding nops. Ingo told me it was probably due
to using up the jump predictions of the CPU.
-- Steve
Steven Rostedt wrote:
> Now, this was no easy task. We needed to add a section to every object
> file with a list of pointers to the call sites to mcount. The idea I came
> up with was to make a tmp.s file for every object just after it is compiled.
> This tmp.s would then be compiled and relinked into the original object.
> The tmp.s file would have something like:
>
> .section __mcount_loc,"a",@progbits
> .quad location_of_mcount1
> .quad location_of_mcount2
> (etc)
>
I have a few concerns about this scheme:
One is that you assume that all text is in a .text section, and that the
offsets you compute on a per-section basis are going to be valid in the
final kernel image. At the very least you should make each offset
relative to its own function rather than inter-function.
But it seems pretty fragile overall. Things like -ffunction-sections
and section gc will invalidate the table you build up.
It seems to me that you can achieve the same effect by:
1. link vmlinux with "ld -q", which will leave the relocation
information intact in the final object
2. use "readelf -r" to extract the relocation info, find the
references to mcount and build a table
3. link that table into the kernel image
4. repeat to get a stable result
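(For step 2, something along the lines of "readelf -r vmlinux | grep mcount",
massaged into a table, would presumably do the trick.)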
Note that steps 2-4 are identical to kallsyms, both in intent and
detail. The only difference is the precise table you build and the
command you use to extract the info from the kernel. From a quick peek
at the way Makefile implements kallsyms, it looks to me like it would be
fairly easy to generalize so that you can do the mcount reloc processing
in the same relink passes as kallsyms with minimal overhead on the
kernel build time.
It just seems incredibly fiddly to be doing all this per-object.
J
Steven Rostedt <[email protected]> wrote:
> The idea I came
> up with was to make a tmp.s file for every object just after it is compiled.
make -j would cause these files to overwrite each other, wouldn't it?
Bodo Eggert wrote:
> Steven Rostedt <[email protected]> wrote:
>
>
>> The idea I came
>> up with was to make a tmp.s file for every object just after it is compiled.
>>
>
> make -j would cause these files to overwrite each other, wouldn't it?
>
I'd assume the name is actually derived from the real filename...
J
On Thu, 7 Aug 2008, Jeremy Fitzhardinge wrote:
> Steven Rostedt wrote:
> > Now, this was no easy task. We needed to add a section to every object
> > file with a list of pointers to the call sites to mcount. The idea I came
> > up with was to make a tmp.s file for every object just after it is compiled.
> > This tmp.s would then be compiled and relinked into the original object.
> > The tmp.s file would have something like:
> >
> > .section __mcount_loc,"a",@progbits
> > .quad location_of_mcount1
> > .quad location_of_mcount2
> > (etc)
> >
>
> I have a few concerns about this scheme:
>
> One is that you assume that all text is in a .text section, and that the
> offsets you compute on a per-section basis are going to be valid in the final
> kernel image. At the very least you should make each offset relative to its
> own function rather than inter-function.
I assume neither ;-)
This works with .text.sched .text.init .text.cpu etc. I use the first
function in each section as my reference point. This is the key.
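So if a call site lived in, say, .text.sched, and the first function in that
section happened to be some function first_sched_func(), the generated entry
would simply be (symbol and offset made up for illustration):
.section __mcount_loc,"a",@progbits
.quad first_sched_func + 0x1b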
>
> But it seems pretty fragile overall. Things like -ffunction-sections and
> section gc will invalidate the table you build up.
I don't see how. I'm referencing a function as my pointer. If this were
true, then I wouldn't be able to do
static void my_func(void) { [...] }
struct func_struct {
	void (*func)(void);
} f = { my_func };
My references end up exactly the same as the f.func reference does.
If my references are broken, so is this f.func. Which would be bad, to
say the least.
>
> It seems to me that you can acheive the same effect by:
>
> 1. link vmlinux with "ld -q", which will leave the relocation
> information intact in the final object
> 2. use "readelf -r" to extract the relocation info, find the
> references to mcount and build a table
> 3. link that table into the kernel image
> 4. repeat to get a stable result
This doesn't seem any less complex than what I did. With this, I would
have to come up with another way to handle modules. This will make
things a lot more complex.
>
> Note that steps 2-4 are identical to kallsyms, both in intent and detail. The
> only difference is the precise table you build and the command you use to
> extract the info from the kernel. From a quick peek at the way Makefile
> implements kallsyms, it looks to me like it would be fairly easy to generalize
> so that you can do the mcount reloc processing in the same relink passes as
> kallsyms with minimal overhead on the kernel build time.
I could work on this. But again, this will put more work into module.c
that I would like to avoid.
>
> It just seems incredibly fiddly to be doing all this per-object.
Actually, this isn't that much fiddling. What I did was simply make a list
of pointers to the mcount call sites. The pointers are expressed as offsets
from the first function of each section. I relinked the list into the object,
making it just like a list built within C. All the relocation magic after
this point is handled just fine.
-- Steve
On Thu, 7 Aug 2008, Bodo Eggert wrote:
> Steven Rostedt <[email protected]> wrote:
>
> > The idea I came
> > up with was to make a tmp.s file for every object just after it is compiled.
>
> make -j would cause these files to overwrite each other, wouldn't it?
I simplified my explanation. I really used something like this:
for init/main.o:
init/.tmp_mc_main.s : for the list file
init/.tmp_mc_main.o : for the list compiled object.
init/.tmp_gl_main.o : for the object converted static to global.
init/.tmp_mx_main.o : for the merged list and global object.
The tmp.s was to keep the focus on the concept. But the actual
implementation handles multi-threaded builds.
I use distcc and build with -j40. I had no issues.
Thanks,
-- Steve
> This doesn't seem any less complex than what I did. With this, I would
> have to come up with another way to handle modules. This will make
> things a lot more complex.
The scheme you've implemented can apply fine to a .ko file after it's made.
They are just .o's really. It is presumably faster to do one step per
final .ko rather than many tiny ones (per source file).
That might be a benefit to doing it all at the end for vmlinux too. I
think the best way would be to have a vmlinux.o that we actually use in the
link, rather than just analyzing and discarding. Then you can just do your
existing hack on vmlinux.o before it's linked into vmlinux.
Thanks,
Roland
On Thu, 7 Aug 2008, Roland McGrath wrote:
> > This doesn't seem any less complex than what I did. With this, I would
> > have to come up with another way to handle modules. This will make
> > things a lot more complex.
>
> The scheme you've implemented can apply fine to a .ko file after it's made.
> They are just .o's really. It is presumably faster to do one step per
> final .ko rather than many tiny ones (per source file).
>
> That might be a benefit to doing it all at the end for vmlinux too. I
> think the best way would be to have a vmlinux.o that we actually use in the
> link, rather than just analyzing and discarding. Then you can just do your
> existing hack on vmlinux.o before it's linked into vmlinux.
OK, I am playing with this. And you are right, it does work with the
vmlinux.o file. It adds almost another minute to the build any time something
is changed, whereas the per-object way may take longer doing a full build,
but does not add that extra minute to every build.
Which would people prefer? A little longer full build that does the
modification on every object, or a shorter full build where every build
takes that extra minute?
Note, if you have DYNAMIC_FTRACE disabled, this is a nop, and will _not_
affect you.
-- Steve
On Thu, 7 Aug 2008, Steven Rostedt wrote:
> On Thu, 7 Aug 2008, Roland McGrath wrote:
> >
> > The scheme you've implemented can apply fine to a .ko file after it's made.
> > They are just .o's really. It is presumably faster to do one step per
> > final .ko rather than many tiny ones (per source file).
> >
> > That might be a benefit to doing it all at the end for vmlinux too. I
> > think the best way would be to have a vmlinux.o that we actually use in the
> > link, rather than just analyzing and discarding. Then you can just do your
> > existing hack on vmlinux.o before it's linked into vmlinux.
>
> OK, I am playing with this. And you are right, it does work with vmlinux.o
> file. It adds almost another minute to the build anytime something is
> changed. Where the per-object way may take longer doing a full build, but
> does not add that extra minute ever build.
>
> Which would people prefer? I little longer full build and do the
> modification on every object, or have a shorter full build, but every
> build will take that extra minute to do?
>
> Note, if you have DYNMAIC_FTRACE disabled, this is a nop, and will _not_
> affect you.
Another advantage of doing it at the end on the vmlinux.o, is that you do
not need to recompile the entire kernel when you change the config option.
But you do need to recompile all modules.
-- Steve
On Thu, 7 Aug 2008, Steven Rostedt wrote:
>
> On Thu, 7 Aug 2008, Roland McGrath wrote:
>
> > > This doesn't seem any less complex than what I did. With this, I would
> > > have to come up with another way to handle modules. This will make
> > > things a lot more complex.
> >
> > The scheme you've implemented can apply fine to a .ko file after it's made.
> > They are just .o's really. It is presumably faster to do one step per
> > final .ko rather than many tiny ones (per source file).
> >
> > That might be a benefit to doing it all at the end for vmlinux too. I
> > think the best way would be to have a vmlinux.o that we actually use in the
> > link, rather than just analyzing and discarding. Then you can just do your
> > existing hack on vmlinux.o before it's linked into vmlinux.
>
> OK, I am playing with this. And you are right, it does work with vmlinux.o
> file. It adds almost another minute to the build anytime something is
> changed. Where the per-object way may take longer doing a full build, but
> does not add that extra minute ever build.
I spoke too soon. I was only looking at the result of the vmlinux.o. But it
seems that kallsyms skips using this file and relinks everything itself.
Thus the __mcount_loc I added never made it into the vmlinux. The more I
try to get this method to work, the more complex it becomes. And I still
did not change the recordmcount.pl at all. Which means, by trying to do it
against the vmlinux.o, I'm actually making it more complex than it already
is.
Right now, doing it to every object seems to be the safest approach.
-- Steve
On Thu, Aug 07, 2008 at 03:26:42PM -0700, Roland McGrath wrote:
> > This doesn't seem any less complex than what I did. With this, I would
> > have to come up with another way to handle modules. This will make
> > things a lot more complex.
>
> The scheme you've implemented can apply fine to a .ko file after it's made.
> They are just .o's really. It is presumably faster to do one step per
> final .ko rather than many tiny ones (per source file).
>
> That might be a benefit to doing it all at the end for vmlinux too. I
> think the best way would be to have a vmlinux.o that we actually use in the
> link, rather than just analyzing and discarding. Then you can just do your
> existing hack on vmlinux.o before it's linked into vmlinux.
I have patches to actually use vmlinux.o as part of the link process,
but as I hit some subtle issues with um and sparc they are not
yet ready. I plan to dust them off again and fix up the sparc and
um issues but that is not this month.
Sam
On Thu, 2008-08-07 at 21:21 -0400, Steven Rostedt wrote:
> Which would people prefer? I little longer full build and do the
> modification on every object, or have a shorter full build, but every
> build will take that extra minute to do?
I have a very strong preference to do it per obj rather than at the end,
that way it can be distributed and make these fancy smp boxen useful.
Bonus points if you can make distcc dig it.
The vmlinux link path is the thing that is killing build performance,
anything to shorten that is golden, anything increasing it needs _very_
good justification.
On Fri, 8 Aug 2008, Peter Zijlstra wrote:
> On Thu, 2008-08-07 at 21:21 -0400, Steven Rostedt wrote:
>
> > Which would people prefer? I little longer full build and do the
> > modification on every object, or have a shorter full build, but every
> > build will take that extra minute to do?
>
> I have a very strong preference to do it per obj rather than at the end,
> that way it can be distributed and make these fancy smp boxen useful.
>
> Bonus points if you can make distcc dig it.
Yes, the CC used in the script does indeed use distcc (if you are using
it).
;-)
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Thu, 7 Aug 2008, Mathieu Desnoyers wrote:
>
> >
> > Hi Steven,
> >
> > If you really want to stop using stop_machine, I think you should have a
> > look at my immediate values infrastructure :
> >
> > see:
> > http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=0958c02b49eed3bbc00bdc5aceeee17a6d0c7ab4;hb=HEAD
> >
> > particularly replace_instruction_safe(), which uses a temporary
> > breakpoint to safely replace an instruction on a live SMP system without
> > using stop_machine. Feel free to try it in ftrace. :)
> >
> > You may need to tweak the imv_notifier(), which is responsible for
> > executing the breakpoint. The only special thing this handler has to
> > know is the non-relocatable instructions you might want to insert (it
> > only cares about the new instructions, not the old ones being replaced).
> > The current code deals with 5-bytes jumps. Note that the instruction is
> > executed in the breakpoint handler with preemption disabled, which might
> > not be well suited for a call instruction.
> >
> > I would recommend to patch in a 5-bytes jmp with 0 offset
> > (e9 00 00 00 00) when disabled (this makes a single 5-bytes instruction
> > and thus makes sure no instruction pointer can iret in the middle).
> >
> > When enabling the site, you could patch-in the original call, but you
> > should tweak the imv_notifier() so that it uses the jmp offset 0 in the
> > bypass instead of the function call, because preemption would otherwise
> > be disabled around the call when executed in the breakpoint bypass.
> >
> > Therefore, simply statically forcing the bypass code to e9 00 00 00 00
> > in all the cases (a nop) would do the job for your needs.
>
> I originally used jumps instead of nops, but unfortunately, they actually
> hurt performance more than adding nops. Ingo told me it was probably due
> to using up the jump predictions of the CPU.
>
Hrm, are you sure you use a single 5-bytes nop instruction then, or do
you use a mix of various nop sizes (add_nops) on some architectures ?
You can consume the branch prediction buffers for conditional branches,
but I doubt static jumps have this impact ? I don't see what "jump
predictions" you are referring to here exactly.
Mathieu
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > I originally used jumps instead of nops, but unfortunately, they actually
> > hurt performance more than adding nops. Ingo told me it was probably due
> > to using up the jump predictions of the CPU.
> >
>
> Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> you use a mix of various nop sizes (add_nops) on some architectures ?
I use (for x86) what is in include/asm-x86/nops.h depending on what the
cpuid gives us.
>
> You can consume the branch prediction buffers for conditional branches,
> but I doubt static jumps have this impact ? I don't see what "jump
> predictions" you are referring to here exactly.
I don't know the details, but we definitely saw a drop in performance
between using nops and static jumps.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > >
> > > I originally used jumps instead of nops, but unfortunately, they actually
> > > hurt performance more than adding nops. Ingo told me it was probably due
> > > to using up the jump predictions of the CPU.
> > >
> >
> > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > you use a mix of various nop sizes (add_nops) on some architectures ?
>
> I use (for x86) what is in include/asm-x86/nops.h depending on what the
> cpuid gives us.
>
That's bad :
#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
#define K8_NOP5 K8_NOP3 K8_NOP2
#define K7_NOP5 K7_NOP4 ASM_NOP1
So, when you try, later, to replace these instructions with a single
5-bytes instruction, a preempted thread could iret in the middle of your
5-bytes insn and cause an illegal instruction ?
> >
> > You can consume the branch prediction buffers for conditional branches,
> > but I doubt static jumps have this impact ? I don't see what "jump
> > predictions" you are referring to here exactly.
>
> I don't know the details, but we definitely saw a drop in preformance
> between using nops and static jumps.
>
Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
instead of the 5-bytes add_nops ? On which architectures ?
Mathieu
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > * Steven Rostedt ([email protected]) wrote:
> > > >
> > > > I originally used jumps instead of nops, but unfortunately, they actually
> > > > hurt performance more than adding nops. Ingo told me it was probably due
> > > > to using up the jump predictions of the CPU.
> > > >
> > >
> > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > > you use a mix of various nop sizes (add_nops) on some architectures ?
> >
> > I use (for x86) what is in include/asm-x86/nops.h depending on what the
> > cpuid gives us.
> >
>
> That's bad :
>
> #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
>
> #define K8_NOP5 K8_NOP3 K8_NOP2
>
> #define K7_NOP5 K7_NOP4 ASM_NOP1
>
> So, when you try, later, to replace these instructions with a single
> 5-bytes instruction, a preempted thread could iret in the middle of your
> 5-bytes insn and cause an illegal instruction ?
That's why I use kstop_machine.
>
>
> > >
> > > You can consume the branch prediction buffers for conditional branches,
> > > but I doubt static jumps have this impact ? I don't see what "jump
> > > predictions" you are referring to here exactly.
> >
> > I don't know the details, but we definitely saw a drop in preformance
> > between using nops and static jumps.
> >
>
> Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
> instead of the 5-bytes add_nops ? On which architectures ?
>
I ran this on my Dell (Intel Xeon), which IIRC did show the performance
degradation. I unfortunately don't have the time to redo those tests, but
you are welcome to.
Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
In fact, the comments in that file still say it is a jmp. Remember, my
first go was to use the jmp.
-- Steve
On Fri, 2008-08-08 at 14:13 -0400, Steven Rostedt wrote:
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > >
> > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt ([email protected]) wrote:
> > > > >
> > > > > I originally used jumps instead of nops, but unfortunately, they actually
> > > > > hurt performance more than adding nops. Ingo told me it was probably due
> > > > > to using up the jump predictions of the CPU.
> > > > >
> > > >
> > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > > > you use a mix of various nop sizes (add_nops) on some architectures ?
> > >
> > > I use (for x86) what is in include/asm-x86/nops.h depending on what the
> > > cpuid gives us.
> > >
> >
> > That's bad :
> >
> > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> >
> > #define K8_NOP5 K8_NOP3 K8_NOP2
> >
> > #define K7_NOP5 K7_NOP4 ASM_NOP1
> >
> > So, when you try, later, to replace these instructions with a single
> > 5-bytes instruction, a preempted thread could iret in the middle of your
> > 5-bytes insn and cause an illegal instruction ?
>
> That's why I use kstop_machine.
>
> >
> >
> > > >
> > > > You can consume the branch prediction buffers for conditional branches,
> > > > but I doubt static jumps have this impact ? I don't see what "jump
> > > > predictions" you are referring to here exactly.
> > >
> > > I don't know the details, but we definitely saw a drop in preformance
> > > between using nops and static jumps.
> > >
> >
> > Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
> > instead of the 5-bytes add_nops ? On which architectures ?
> >
>
> I ran this on my Dell (intel Xeon), which IIRC did show the performance
> degration. I unfortunately don't have the time to redo those tests, but
> you are welcome to.
>
> Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
> In fact, the comments in that file still say it is a jmp. Remember, my
> first go was to use the jmp.
5 single byte nops?
* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > * Steven Rostedt ([email protected]) wrote:
> > >
> > > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > > * Steven Rostedt ([email protected]) wrote:
> > > > >
> > > > > I originally used jumps instead of nops, but unfortunately, they actually
> > > > > hurt performance more than adding nops. Ingo told me it was probably due
> > > > > to using up the jump predictions of the CPU.
> > > > >
> > > >
> > > > Hrm, are you sure you use a single 5-bytes nop instruction then, or do
> > > > you use a mix of various nop sizes (add_nops) on some architectures ?
> > >
> > > I use (for x86) what is in include/asm-x86/nops.h depending on what the
> > > cpuid gives us.
> > >
> >
> > That's bad :
> >
> > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> >
> > #define K8_NOP5 K8_NOP3 K8_NOP2
> >
> > #define K7_NOP5 K7_NOP4 ASM_NOP1
> >
> > So, when you try, later, to replace these instructions with a single
> > 5-bytes instruction, a preempted thread could iret in the middle of your
> > 5-bytes insn and cause an illegal instruction ?
>
> That's why I use kstop_machine.
>
kstop_machine does not guarantee that you won't have a thread that was
preempted with its IP pointing exactly into the middle of your instructions
_before_ the modification and that is scheduled back in _after_ the
modification, thus causing an illegal instruction.
Still buggy. :/
> >
> >
> > > >
> > > > You can consume the branch prediction buffers for conditional branches,
> > > > but I doubt static jumps have this impact ? I don't see what "jump
> > > > predictions" you are referring to here exactly.
> > >
> > > I don't know the details, but we definitely saw a drop in preformance
> > > between using nops and static jumps.
> > >
> >
> > Generated by replacing all the call by 5-bytes jumps e9 00 00 00 00
> > instead of the 5-bytes add_nops ? On which architectures ?
> >
>
> I ran this on my Dell (intel Xeon), which IIRC did show the performance
> degration. I unfortunately don't have the time to redo those tests, but
> you are welcome to.
>
> Just look at arch/x86/kernel/ftrace.c and replace the nop with the jump.
> In fact, the comments in that file still say it is a jmp. Remember, my
> first go was to use the jmp.
>
I'll try to find time to compare:
- multi-instruction 5-byte nops (although this approach is just buggy)
- a 5-byte jump to the next address
- a 2-byte jump to offset +3.
Mathieu
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> > > >
> > >
> > > That's bad :
> > >
> > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> > >
> > > #define K8_NOP5 K8_NOP3 K8_NOP2
> > >
> > > #define K7_NOP5 K7_NOP4 ASM_NOP1
> > >
> > > So, when you try, later, to replace these instructions with a single
> > > 5-bytes instruction, a preempted thread could iret in the middle of your
> > > 5-bytes insn and cause an illegal instruction ?
> >
> > That's why I use kstop_machine.
> >
>
> kstop_machine does not guarantee that you won't have _any_ thread
> preempted with IP pointing exactly in the middle of your instructions
> _before_ the modification scheduled back in _after_ the modification and
> thus causing an illegal instruction.
>
> Still buggy. :/
Hmm, good point. Unless...
Can a processor be preempted in the middle of nops? What do nops do for a
processor? Can it skip them nicely in one shot?
This means I'll have to do the benchmarks again, and see whether the
performance difference between a jmp and a nop is significant. I'm thinking
that if the processor can safely skip nops without any type of processing,
this may be the reason that nops are better than a jmp. A jmp causes the
processor to do a little more work.
I might even run a test to see if I can force a processor that uses the
three-two nops to preempt between them.
I can add a test in x86 ftrace.c to check to see which nop was used, and
use the jmp if the arch does not have a 5 byte nop.
I'm assuming that jmp is more expensive than the nops because otherwise
a jmp 0 would have been used as a 5 byte nop.
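A rough sketch of what that check could look like (the names and the exact
byte values here are illustrative, not the actual ftrace.c code):
	/* P6_NOP5 is a single 5-byte instruction; the jmp targets the next
	   instruction and so is effectively a 5-byte nop as well */
	static const unsigned char nop5[] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };
	static const unsigned char jmp5[] = { 0xe9, 0x00, 0x00, 0x00, 0x00 };
	static int have_single_insn_nop5;  /* set from the cpuid checks at boot */
	static const unsigned char *ftrace_nop_replace(void)
	{
		/* prefer the real nop; fall back to the jmp when the CPU's
		   "5-byte nop" is really two instructions glued together */
		return have_single_insn_nop5 ? nop5 : jmp5;
	}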
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
>
> > * Steven Rostedt ([email protected]) wrote:
> > > > >
> > > >
> > > > That's bad :
> > > >
> > > > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> > > >
> > > > #define K8_NOP5 K8_NOP3 K8_NOP2
> > > >
> > > > #define K7_NOP5 K7_NOP4 ASM_NOP1
> > > >
> > > > So, when you try, later, to replace these instructions with a single
> > > > 5-bytes instruction, a preempted thread could iret in the middle of your
> > > > 5-bytes insn and cause an illegal instruction ?
> > >
> > > That's why I use kstop_machine.
> > >
> >
> > kstop_machine does not guarantee that you won't have _any_ thread
> > preempted with IP pointing exactly in the middle of your instructions
> > _before_ the modification scheduled back in _after_ the modification and
> > thus causing an illegal instruction.
> >
> > Still buggy. :/
>
> Hmm, good point. Unless...
>
> Can a processor be preempted in a middle of nops? What do nops do for a
> processor? Can it skip them nicely in one shot?
>
Given that those are multiple instructions, I think a processor has every
right to preempt in the middle of them. And even if some specific
architecture, for any obscure reason, happens to merge them, I don't
think this will be portable across Intel, AMD, ...
> This means I'll have to do the benchmarks again, and see what the
> performance difference of a jmp and a nop is significant. I'm thinking
> that if the processor can safely skip nops without any type of processing,
> this may be the reason that nops are better than a jmp. A jmp causes the
> processor to do a little more work.
>
> I might even run a test to see if I can force a processor that uses the
> three-two nops to preempt between them.
>
Yup, although one architecture not triggering this doesn't say much
about the various x86 flavors out there. In any case
- if you trigger the problem, we have to fix it.
- if you do not succeed in triggering the problem, we will have to test it
on a wider range of architectures and maybe end up fixing it anyway to play
it safe with the specs.
So, in every case, we end up fixing the issue.
> I can add a test in x86 ftrace.c to check to see which nop was used, and
> use the jmp if the arch does not have a 5 byte nop.
>
I would propose the following alternative :
Create new macros in include/asm-x86/nops.h :
/* short jump, offset 3 bytes : skips total of 5 bytes */
#define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"
#if defined(CONFIG_MK7)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#elif defined(CONFIG_X86_P6_NOP)
#define ATOMIC_NOP5 P6_NOP5
#elif defined(CONFIG_X86_64)
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#else
#define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
#endif
And then optimize if necessary. You will probably find plenty of
knowledgeable people who will know of a 5-byte nop instruction more
efficient than this "generic" short jump with offset 0x3.
Then you can use the (buggy) 3nops/2nops as a performance baseline and
see the performance hit on each architecture.
First get it right, then make it fast....
Mathieu
> I'm assuming that jmp is more expensive than the nops because otherwise
> a jmp 0 would have been used as a 5 byte nop.
>
> -- Steve
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Fri, 8 Aug 2008, Steven Rostedt wrote:
>
> Can a processor be preempted in a middle of nops?
Sure. If you have two nops in a row (and the kernel definition of the NOP
array does _not_ guarantee that it's a single-instruction one), you may
get a profile hit (ie any interrupt) on the second one. It's less
_likely_, but it certainly is not architecturally in any way guaranteed
that the kernel "nop[]" tables would be atomic.
> What do nops do for a processor?
Depends on the microarchitecture. But many will squash it in the decode
stage, and generate no uops for them at all, so it's purely a decode
throughput issue and has absolutely _zero_ impact for any later CPU
stages.
> Can it skip them nicely in one shot?
See above. It needs to decode them, and the decoder itself may well have
some limitations - for example, the longer nops may not even decode in all
decoders, which is why some uarchs might prefer two short nops to one long
one, but _generally_ a nop will not make it any further than the decoder.
But separate nops do count as separate instructions, ie they will hit all
the normal decode limits (mostly three or four instructions decoded per
cycle).
> I'm assuming that jmp is more expensive than the nops because otherwise
> a jmp 0 would have been used as a 5 byte nop.
Yes. A CPU core _could_ certainly do special decoding for 'jmp 0' too, but
I don't know any that do. The 'jmp' is much more likely to be special in
the front-end and the decoder, and can easily cause things like the
prefetcher to hiccup (ie it tries to start prefetching at the "new" target
address).
So I would absolutely _expect_ a 'jmp' to be noticeably more expensive
than one of the special nop's that can be squashed by the decoder.
A nop that is squashed by the decoder will literally take absolutely no
other resources. It doesn't even need to be tracked from an instruction
completion standpoint (which _may_ end up meaning that a profiler would
actually never see a hit on the second nop, but it's quite likely to
depend on which decoder it hits etc).
Linus
Steven Rostedt wrote:
> Hmm, good point. Unless...
>
> Can a processor be preempted in a middle of nops? What do nops do for a
> processor? Can it skip them nicely in one shot?
>
The CPU can be interrupted between any two instructions, with a couple
of exceptions (it must, otherwise an unbounded run of nops could cause
an unbounded interrupt latency). If you don't want eip to be in the
middle of your instruction site, you need to make sure it's nopped out
with a single instruction.
Unfortunately, aside from P6_NOP5, none of the standard asm/nops.h nops
are suitable. Something like "0x66, 0x66, 0x66, 0x66, 0x90" would work
everywhere, but it's possible that putting more than 3 prefixes on the
instruction will kick the decoder into a slow path, which may have a
noticeable effect.
> This means I'll have to do the benchmarks again, and see what the
> performance difference of a jmp and a nop is significant. I'm thinking
> that if the processor can safely skip nops without any type of processing,
> this may be the reason that nops are better than a jmp. A jmp causes the
> processor to do a little more work.
>
A no-op jmp could be easily converted to a nop early in the pipeline.
But I don't know whether anyone bothers to do that.
J
[ patch included ]
On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> > > > That's why I use kstop_machine.
> > > >
> > >
> > > kstop_machine does not guarantee that you won't have _any_ thread
> > > preempted with IP pointing exactly in the middle of your instructions
> > > _before_ the modification scheduled back in _after_ the modification and
> > > thus causing an illegal instruction.
> > >
> > > Still buggy. :/
> >
> > Hmm, good point. Unless...
> >
> > Can a processor be preempted in a middle of nops? What do nops do for a
> > processor? Can it skip them nicely in one shot?
> >
>
> Given that those are multiple instructions, I think a processor has all
> the rights to preempt in the middle of them. And even if some specific
> architecture, for any obscure reason, happens to merge them, I don't
> think this will be portable across Intel, AMD, ...
>
> > This means I'll have to do the benchmarks again, and see what the
> > performance difference of a jmp and a nop is significant. I'm thinking
> > that if the processor can safely skip nops without any type of processing,
> > this may be the reason that nops are better than a jmp. A jmp causes the
> > processor to do a little more work.
> >
> > I might even run a test to see if I can force a processor that uses the
> > three-two nops to preempt between them.
> >
>
> Yup, although one architecture not triggering this doesn't say much
> about the various x86 flavors out there. In any case
> - if you trigger the problem, we have to fix it.
> - if you do not succeed to trigger the problem, we will have to test it
> on a wider architecture range and maybe end up fixit it anyway to play
> safe with the specs.
>
> So, in every case, we end up fixing the issue.
>
>
> > I can add a test in x86 ftrace.c to check to see which nop was used, and
> > use the jmp if the arch does not have a 5 byte nop.
> >
>
> I would propose the following alternative :
>
> Create new macros in include/asm-x86/nops.h :
>
> /* short jump, offset 3 bytes : skips total of 5 bytes */
> #define GENERIC_ATOMIC_NOP5 ".byte 0xeb,0x03,0x00,0x00,0x00\n"
>
> #if defined(CONFIG_MK7)
> #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
> #elif defined(CONFIG_X86_P6_NOP)
> #define ATOMIC_NOP5 P6_NOP5
> #elif defined(CONFIG_X86_64)
> #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
> #else
> #define ATOMIC_NOP5 GENERIC_ATOMIC_NOP5
> #endif
>
> And then optimize if necessary. You will probably find plenty of
> knowledgeable people who will know better 5-bytes nop instruction more
> efficient than this "generic" short jump offset 0x3.
>
> Then you can use the (buggy) 3nops/2nops as a performance baseline and
> see the performance hit on each architecture.
>
> First get it right, then make it fast....
>
I'm stubborn, I want to get it right _and_ keep it fast.
I still want the NOPS. Using jmps will hurt performance and that would
keep this turned off on all distros.
But let's think outside the box here (and we will ignore Alan's cat).
Right now the issue is that we might preempt after the first nop, and when
we enable the code, that task will crash when it tries to read the second
nop.
Since we are doing the modifications from kstop_machine, all the tasks are
stopped. We can simply look to see if a task has been preempted in
kernel space and, if so, whether its instruction pointer is pointing at the
second nop. If it is, move the ip forward.
Here's a patch that does just that for both i386 and x86_64.
I added a field in the thread_info struct called "ip". This is a pointer
to the location of the task ip in the stack if it was preempted in kernel
space. Null otherwise:
jz restore_all
+ lea PT_EIP(%esp), %eax
+ movl %eax, TI_ip(%ebp)
call preempt_schedule_irq
+ GET_THREAD_INFO(%ebp)
+ movl $0, TI_ip(%ebp)
jmp need_resched
Then, just before we enable tracing (we only need to do this when we
enable tracing, since that is when we have a two instruction nop), we look
at all the tasks. If task->thread_info->ip is set, it means the task
was preempted while running in kernel space.
We look at the **ip and see if it compares with the second nop. If it
does, we increment the ip by the size of that nop:
if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
/* Match, move the ip forward */
*ip += x86_nop5_part2;
We do this just once before enabling all the locations, and we only do it
if we have a two part nop.
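The scan itself could look something like this (a sketch only; second_nop and
x86_nop5_part2 are the names from the snippet above, ->ip is the new
thread_info field, and the rest is made up; the real code is in the patch
at the end):
	extern int x86_nop5_part2;               /* size of the nop's second part */
	extern const unsigned char *second_nop;  /* the bytes of that second part */
	static void fixup_preempted_tasks(void)
	{
		struct task_struct *g, *p;
		/* called under kstop_machine, so the task list is stable */
		do_each_thread(g, p) {
			unsigned long *ip = task_thread_info(p)->ip;
			if (ip && memcmp((void *)*ip, second_nop, x86_nop5_part2) == 0)
				*ip += x86_nop5_part2;   /* step over the second nop */
		} while_each_thread(g, p);
	}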
Interestingly enough, I wrote a module that did the following:
#include <linux/module.h>
#include <linux/kthread.h>
void (*silly_func)(void);
void do_something_silly(void)
{
}
static int my_thread(void *arg)
{
	int i;
	while (!kthread_should_stop()) {
		for (i = 0; i < 100; i++)
			silly_func();
	}
	return 0;
}
static struct task_struct *p;
static int __init mcount_stress_init(void)
{
	silly_func = do_something_silly;
	p = kthread_run(my_thread, NULL, "sillytask");
	return 0;
}
static void __exit mcount_stress_exit(void)
{
	kthread_stop(p);
}
module_init(mcount_stress_init);
module_exit(mcount_stress_exit);
MODULE_LICENSE("GPL");
The do_something_silly function had an mcount call site recorded for it. I put printks in
the ftrace enabled code to see where this was preempted. It was preempted
several times before and after the nops, but never at either nop.
Maybe I didn't run it enough (almost 2 hours), but perhaps it is very
unlikely to be preempted at a nop if there's something coming up next.
Yes a string of nops may be preempted, but perhaps only two nops followed
by an actual command might be skipped quickly.
I'll write some hacks to look at where it is preempted in the scheduler
itself, and see if I see it preempting at the second nop ever.
But here's a patch that will work around the problem that we might be
preempted within the two nops.
Note, this is only in the slow path of enabling the function tracer. It is
only done at enabling time inside the kstop_machine, which has a large
overhead anyways.
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/kernel/alternative.c | 29 +++++++++++++++++++-------
arch/x86/kernel/asm-offsets_32.c | 1
arch/x86/kernel/asm-offsets_64.c | 1
arch/x86/kernel/entry_32.S | 4 +++
arch/x86/kernel/entry_64.S | 4 +++
arch/x86/kernel/ftrace.c | 43 +++++++++++++++++++++++++++++++++++++++
include/asm-x86/ftrace.h | 5 ++++
include/asm-x86/thread_info.h | 4 +++
kernel/trace/ftrace.c | 12 ++++++++++
9 files changed, 96 insertions(+), 7 deletions(-)
Index: linux-tip.git/arch/x86/kernel/alternative.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/alternative.c 2008-06-05 11:52:24.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/alternative.c 2008-08-08 16:20:23.000000000 -0400
@@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
};
#endif
+/*
+ * Some versions of x86 CPUs have a two part NOP5. This
+ * can break ftrace if a process is preempted between
+ * the two. ftrace needs to know what the second nop
+ * is to handle this case.
+ */
+int x86_nop5_part2;
+
#ifdef CONFIG_X86_64
extern char __vsyscall_0;
const unsigned char *const *find_nop_table(void)
{
- return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
- boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+ boot_cpu_data.x86 < 6) {
+ x86_nop5_part2 = 2; /* K8_NOP2 */
+ return k8_nops;
+ } else
+ /* keep k86_nop5_part2 NULL */
+ return p6_nops;
}
#else /* CONFIG_X86_64 */
@@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
static const struct nop {
int cpuid;
const unsigned char *const *noptable;
+ int nop5_part2; /* size of part2 nop */
} noptypes[] = {
- { X86_FEATURE_K8, k8_nops },
- { X86_FEATURE_K7, k7_nops },
- { X86_FEATURE_P4, p6_nops },
- { X86_FEATURE_P3, p6_nops },
- { -1, NULL }
+ { X86_FEATURE_K8, k8_nops, 2},
+ { X86_FEATURE_K7, k7_nops, 1 },
+ { X86_FEATURE_P4, p6_nops, 0 },
+ { X86_FEATURE_P3, p6_nops, 0 },
+ { -1, NULL, 0 }
};
const unsigned char *const *find_nop_table(void)
@@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab
for (i = 0; noptypes[i].cpuid >= 0; i++) {
if (boot_cpu_has(noptypes[i].cpuid)) {
noptable = noptypes[i].noptable;
+ x86_nop5_part2 = noptypes[i].nop5_part2;
break;
}
}
Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c 2008-08-08 15:46:55.000000000 -0400
@@ -59,6 +59,7 @@ void foo(void)
OFFSET(TI_restart_block, thread_info, restart_block);
OFFSET(TI_sysenter_return, thread_info, sysenter_return);
OFFSET(TI_cpu, thread_info, cpu);
+ OFFSET(TI_ip, thread_info, ip);
BLANK();
OFFSET(GDS_size, desc_ptr, size);
Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c 2008-08-08 15:52:34.000000000 -0400
@@ -41,6 +41,7 @@ int main(void)
ENTRY(addr_limit);
ENTRY(preempt_count);
ENTRY(status);
+ ENTRY(ip);
#ifdef CONFIG_IA32_EMULATION
ENTRY(sysenter_return);
#endif
Index: linux-tip.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_32.S 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_32.S 2008-08-08 17:13:27.000000000 -0400
@@ -304,7 +304,11 @@ need_resched:
jz restore_all
testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
jz restore_all
+ lea PT_EIP(%esp), %eax
+ movl %eax, TI_ip(%ebp)
call preempt_schedule_irq
+ GET_THREAD_INFO(%ebp)
+ movl $0, TI_ip(%ebp)
jmp need_resched
END(resume_kernel)
#endif
Index: linux-tip.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_64.S 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_64.S 2008-08-08 17:12:47.000000000 -0400
@@ -837,7 +837,11 @@ ENTRY(retint_kernel)
jnc retint_restore_args
bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc retint_restore_args
+ leaq RIP-ARGOFFSET(%rsp), %rax
+ movq %rax, TI_ip(%rcx)
call preempt_schedule_irq
+ GET_THREAD_INFO(%rcx)
+ movq $0, TI_ip(%rcx)
jmp exit_intr
#endif
Index: linux-tip.git/arch/x86/kernel/ftrace.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 17:48:04.000000000 -0400
@@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l
return 0;
}
+static const unsigned char *second_nop;
+
+void arch_ftrace_pre_enable(void)
+{
+ struct task_struct *g, *p;
+ unsigned long **ip;
+
+ int i;
+
+ if (!second_nop)
+ return;
+
+ /*
+ * x86 has a two part nop to handle 5 byte instructions.
+ * If a task was preempted after the first nop, and has
+ * not ran the second nop, if we modify the code, we can
+ * crash the system. Thus, we will look at all the tasks
+ * and if any of them was preempted and will run the
+ * second nop next, we simply move their ip pointer past
+ * the second nop.
+ */
+
+ /*
+ * Don't need to grab the task list lock, we are running
+ * in kstop_machine
+ */
+ do_each_thread(g, p) {
+ /*
+ * In entry.S we save the ip when a task is preempted
+ * and reset it when it is back running.
+ */
+ ip = task_thread_info(p)->ip;
+ if (!ip)
+ continue;
+ if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
+ /* Match, move the ip forward */
+ *ip += x86_nop5_part2;
+ } while_each_thread(g, p);
+}
+
int __init ftrace_dyn_arch_init(void *data)
{
const unsigned char *const *noptable = find_nop_table();
@@ -137,5 +177,8 @@ int __init ftrace_dyn_arch_init(void *da
ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE];
+ if (x86_nop5_part2)
+ second_nop = noptable[x86_nop5_part2];
+
return 0;
}
Index: linux-tip.git/include/asm-x86/ftrace.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/ftrace.h 2008-08-08 13:00:51.000000000 -0400
+++ linux-tip.git/include/asm-x86/ftrace.h 2008-08-08 16:41:09.000000000 -0400
@@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_
*/
return addr - 1;
}
+
+extern int x86_nop5_part2;
+extern void arch_ftrace_pre_enable(void);
+#define ftrace_pre_enable arch_ftrace_pre_enable
+
#endif
#endif /* CONFIG_FTRACE */
Index: linux-tip.git/include/asm-x86/thread_info.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/thread_info.h 2008-08-07 11:14:43.000000000 -0400
+++ linux-tip.git/include/asm-x86/thread_info.h 2008-08-08 17:06:15.000000000 -0400
@@ -29,6 +29,9 @@ struct thread_info {
__u32 cpu; /* current CPU */
int preempt_count; /* 0 => preemptable,
<0 => BUG */
+ unsigned long **ip; /* pointer to ip on stack when
+ preempted
+ */
mm_segment_t addr_limit;
struct restart_block restart_block;
void __user *sysenter_return;
@@ -47,6 +50,7 @@ struct thread_info {
.flags = 0, \
.cpu = 0, \
.preempt_count = 1, \
+ .ip = NULL, \
.addr_limit = KERNEL_DS, \
.restart_block = { \
.fn = do_no_restart_syscall, \
Index: linux-tip.git/kernel/trace/ftrace.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ftrace.c 2008-08-08 13:00:52.000000000 -0400
+++ linux-tip.git/kernel/trace/ftrace.c 2008-08-08 16:18:14.000000000 -0400
@@ -32,6 +32,10 @@
#include "trace.h"
+#ifndef ftrace_pre_enable
+# define ftrace_pre_enable() do { } while (0)
+#endif
+
/* ftrace_enabled is a method to turn ftrace on or off */
int ftrace_enabled __read_mostly;
static int last_ftrace_enabled;
@@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab
else
new = ftrace_nop_replace();
+ /*
+ * Some archs *cough*x86*cough* have more than one nop to cover
+ * the call to mcount. In these cases, special care must be taken
+ * before we start converting nops into calls.
+ */
+ if (enable)
+ ftrace_pre_enable();
+
for (pg = ftrace_pages_start; pg; pg = pg->next) {
for (i = 0; i < pg->index; i++) {
rec = &pg->records[i];
Steven Rostedt <[email protected]> writes:
> I'm stubborn, I want to get it right _and_ keep it fast.
For me it would seem better to just not use two part 5 byte nops
instead of adding such hacks. I doubt there are that many of them
anyways. I bet you won't be able to measure any difference between the
different nop types in any macro benchmark.
-Andi
[ updated patch ]
On Fri, 8 Aug 2008, Steven Rostedt wrote:
>
>
> The do_something_silly had an mcount pointer to it. I put in printks in
> the ftrace enabled code to see where this was preempted. It was preempted
> several times before and after the nops, but never at either nop.
>
> Maybe I didn't run it enough (almost 2 hours), but perhaps it is very
> unlikely to be preempted at a nop if there's something coming up next.
>
> Yes a string of nops may be preempted, but perhaps only two nops followed
> by an actual command might be skipped quickly.
>
> I'll write some hacks to look at where it is preempted in the scheduler
> itself, and see if I see it preempting at the second nop ever.
I put that code in the scheduler and it does indeed preempt in the nops!
It also uncovered a nasty bug I had in my code:
[...]
> Index: linux-tip.git/arch/x86/kernel/ftrace.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 17:48:04.000000000 -0400
> @@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l
> return 0;
> }
>
> +static const unsigned char *second_nop;
> +
> +void arch_ftrace_pre_enable(void)
> +{
> + struct task_struct *g, *p;
> + unsigned long **ip;
> +
[...]
> + ip = task_thread_info(p)->ip;
> + if (!ip)
> + continue;
> + if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
> + /* Match, move the ip forward */
> + *ip += x86_nop5_part2;
Notice the bug? Hint, *ip is of type unsigned long *.
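In case the pointer arithmetic isn't obvious: ip was declared unsigned long **,
so *ip is an unsigned long *, and "*ip += x86_nop5_part2" moves the saved ip
forward by that many longs, not bytes. The updated patch below uses void ** so
the increment is in bytes (gcc treats void pointer arithmetic as byte
arithmetic). A tiny user-space sketch of the difference (illustration only,
not kernel code):

#include <stdio.h>

int main(void)
{
        unsigned long buf[4] = { 0 };
        unsigned long *lp = buf;                  /* what the buggy code did */
        unsigned char *bp = (unsigned char *)buf; /* what the fix wants      */

        lp += 2;        /* advances 2 * sizeof(unsigned long) bytes */
        bp += 2;        /* advances exactly 2 bytes, the second nop */

        printf("long-pointer step: %ld bytes\n",
               (long)((char *)lp - (char *)buf));
        printf("byte-pointer step: %ld bytes\n",
               (long)((char *)bp - (char *)buf));
        return 0;
}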
Here's the updated fix:
Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/kernel/alternative.c | 29 ++++++++++++++++++++-------
arch/x86/kernel/asm-offsets_32.c | 1
arch/x86/kernel/asm-offsets_64.c | 1
arch/x86/kernel/entry_32.S | 4 +++
arch/x86/kernel/entry_64.S | 4 +++
arch/x86/kernel/ftrace.c | 41 +++++++++++++++++++++++++++++++++++++++
include/asm-x86/ftrace.h | 5 ++++
include/asm-x86/thread_info.h | 4 +++
kernel/trace/ftrace.c | 12 +++++++++++
9 files changed, 94 insertions(+), 7 deletions(-)
Index: linux-tip.git/arch/x86/kernel/alternative.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/alternative.c 2008-06-05 11:52:24.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/alternative.c 2008-08-08 16:20:23.000000000 -0400
@@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
};
#endif
+/*
+ * Some versions of x86 CPUs have a two part NOP5. This
+ * can break ftrace if a process is preempted between
+ * the two. ftrace needs to know what the second nop
+ * is to handle this case.
+ */
+int x86_nop5_part2;
+
#ifdef CONFIG_X86_64
extern char __vsyscall_0;
const unsigned char *const *find_nop_table(void)
{
- return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
- boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
+ if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+ boot_cpu_data.x86 < 6) {
+ x86_nop5_part2 = 2; /* K8_NOP2 */
+ return k8_nops;
+ } else
+ /* keep k86_nop5_part2 NULL */
+ return p6_nops;
}
#else /* CONFIG_X86_64 */
@@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
static const struct nop {
int cpuid;
const unsigned char *const *noptable;
+ int nop5_part2;
} noptypes[] = {
- { X86_FEATURE_K8, k8_nops },
- { X86_FEATURE_K7, k7_nops },
- { X86_FEATURE_P4, p6_nops },
- { X86_FEATURE_P3, p6_nops },
- { -1, NULL }
+ { X86_FEATURE_K8, k8_nops, 2},
+ { X86_FEATURE_K7, k7_nops, 1 },
+ { X86_FEATURE_P4, p6_nops, 0 },
+ { X86_FEATURE_P3, p6_nops, 0 },
+ { -1, NULL, 0 }
};
const unsigned char *const *find_nop_table(void)
@@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab
for (i = 0; noptypes[i].cpuid >= 0; i++) {
if (boot_cpu_has(noptypes[i].cpuid)) {
noptable = noptypes[i].noptable;
+ x86_nop5_part2 = noptypes[i].nop5_part2;
break;
}
}
Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c 2008-08-08 15:46:55.000000000 -0400
@@ -59,6 +59,7 @@ void foo(void)
OFFSET(TI_restart_block, thread_info, restart_block);
OFFSET(TI_sysenter_return, thread_info, sysenter_return);
OFFSET(TI_cpu, thread_info, cpu);
+ OFFSET(TI_ip, thread_info, ip);
BLANK();
OFFSET(GDS_size, desc_ptr, size);
Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c 2008-08-08 15:52:34.000000000 -0400
@@ -41,6 +41,7 @@ int main(void)
ENTRY(addr_limit);
ENTRY(preempt_count);
ENTRY(status);
+ ENTRY(ip);
#ifdef CONFIG_IA32_EMULATION
ENTRY(sysenter_return);
#endif
Index: linux-tip.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_32.S 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_32.S 2008-08-08 17:13:27.000000000 -0400
@@ -304,7 +304,11 @@ need_resched:
jz restore_all
testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
jz restore_all
+ lea PT_EIP(%esp), %eax
+ movl %eax, TI_ip(%ebp)
call preempt_schedule_irq
+ GET_THREAD_INFO(%ebp)
+ movl $0, TI_ip(%ebp)
jmp need_resched
END(resume_kernel)
#endif
Index: linux-tip.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/entry_64.S 2008-07-27 10:43:26.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/entry_64.S 2008-08-08 17:12:47.000000000 -0400
@@ -837,7 +837,11 @@ ENTRY(retint_kernel)
jnc retint_restore_args
bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
jnc retint_restore_args
+ leaq RIP-ARGOFFSET(%rsp), %rax
+ movq %rax, TI_ip(%rcx)
call preempt_schedule_irq
+ GET_THREAD_INFO(%rcx)
+ movq $0, TI_ip(%rcx)
jmp exit_intr
#endif
Index: linux-tip.git/arch/x86/kernel/ftrace.c
===================================================================
--- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400
+++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 20:22:19.000000000 -0400
@@ -127,6 +127,44 @@ notrace int ftrace_mcount_set(unsigned l
return 0;
}
+static const unsigned char *second_nop;
+
+void arch_ftrace_pre_enable(void)
+{
+ struct task_struct *g, *p;
+ void **ip;
+
+ if (!second_nop)
+ return;
+
+ /*
+ * x86 has a two part nop to handle 5 byte instructions.
+ * If a task was preempted after the first nop, and has
+ * not ran the second nop, if we modify the code, we can
+ * crash the system. Thus, we will look at all the tasks
+ * and if any of them was preempted and will run the
+ * second nop next, we simply move their ip pointer past
+ * the second nop.
+ */
+
+ /*
+ * Don't need to grab the task list lock, we are running
+ * in kstop_machine
+ */
+ do_each_thread(g, p) {
+ /*
+ * In entry.S we save the ip when a task is preempted
+ * and reset it when it is back running.
+ */
+ ip = (void **)task_thread_info(p)->ip;
+ if (!ip)
+ continue;
+ if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
+ /* Match, move the ip forward */
+ *ip += x86_nop5_part2;
+ } while_each_thread(g, p);
+}
+
int __init ftrace_dyn_arch_init(void *data)
{
const unsigned char *const *noptable = find_nop_table();
@@ -137,5 +175,8 @@ int __init ftrace_dyn_arch_init(void *da
ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE];
+ if (x86_nop5_part2)
+ second_nop = noptable[x86_nop5_part2];
+
return 0;
}
Index: linux-tip.git/include/asm-x86/ftrace.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/ftrace.h 2008-08-08 13:00:51.000000000 -0400
+++ linux-tip.git/include/asm-x86/ftrace.h 2008-08-08 16:41:09.000000000 -0400
@@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_
*/
return addr - 1;
}
+
+extern int x86_nop5_part2;
+extern void arch_ftrace_pre_enable(void);
+#define ftrace_pre_enable arch_ftrace_pre_enable
+
#endif
#endif /* CONFIG_FTRACE */
Index: linux-tip.git/include/asm-x86/thread_info.h
===================================================================
--- linux-tip.git.orig/include/asm-x86/thread_info.h 2008-08-07 11:14:43.000000000 -0400
+++ linux-tip.git/include/asm-x86/thread_info.h 2008-08-08 17:06:15.000000000 -0400
@@ -29,6 +29,9 @@ struct thread_info {
__u32 cpu; /* current CPU */
int preempt_count; /* 0 => preemptable,
<0 => BUG */
+ unsigned long **ip; /* pointer to ip on stack when
+ preempted
+ */
mm_segment_t addr_limit;
struct restart_block restart_block;
void __user *sysenter_return;
@@ -47,6 +50,7 @@ struct thread_info {
.flags = 0, \
.cpu = 0, \
.preempt_count = 1, \
+ .ip = NULL, \
.addr_limit = KERNEL_DS, \
.restart_block = { \
.fn = do_no_restart_syscall, \
Index: linux-tip.git/kernel/trace/ftrace.c
===================================================================
--- linux-tip.git.orig/kernel/trace/ftrace.c 2008-08-08 13:00:52.000000000 -0400
+++ linux-tip.git/kernel/trace/ftrace.c 2008-08-08 16:18:14.000000000 -0400
@@ -32,6 +32,10 @@
#include "trace.h"
+#ifndef ftrace_pre_enable
+# define ftrace_pre_enable() do { } while (0)
+#endif
+
/* ftrace_enabled is a method to turn ftrace on or off */
int ftrace_enabled __read_mostly;
static int last_ftrace_enabled;
@@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab
else
new = ftrace_nop_replace();
+ /*
+ * Some archs *cough*x86*cough* have more than one nop to cover
+ * the call to mcount. In these cases, special care must be taken
+ * before we start converting nops into calls.
+ */
+ if (enable)
+ ftrace_pre_enable();
+
for (pg = ftrace_pages_start; pg; pg = pg->next) {
for (i = 0; i < pg->index; i++) {
rec = &pg->records[i];
On Sat, 9 Aug 2008, Andi Kleen wrote:
> Steven Rostedt <[email protected]> writes:
>
> > I'm stubborn, I want to get it right _and_ keep it fast.
>
> For me it would seem better to just not use two part 5 byte nops
> instead of adding such hacks. I doubt there are that many of them
> anyways. I bet you won't be able to measure any difference between the
> different nop types in any macro benchmark.
I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
measurable. This is for replacing the mcount calls in a kernel compiled with
the -pg option. With a basic build (not counting modules), I have over 15,000
locations that are turned into these 5 byte nops.
# objdump -dr vmlinux.o | grep mcount |wc
15152 45489 764924
If we use the jmp 0, then yes, we will see the overhead. The double nop
that is used for 5 bytes is significantly better than the jump.
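For reference, here is a sketch of the byte patterns being compared, as I
read include/asm-x86/nops.h (treat the exact bytes as my assumption):

/* Fillers for the 5-byte call-to-mcount slot (bytes are assumptions
 * based on include/asm-x86/nops.h, not authoritative):
 */
static const unsigned char k8_nop5[5]  = { 0x66, 0x66, 0x90, 0x66, 0x90 }; /* K8_NOP3 + K8_NOP2: the two part nop */
static const unsigned char p6_nop5[5]  = { 0x0f, 0x1f, 0x44, 0x00, 0x00 }; /* P6_NOP5: a true single 5-byte nop   */
static const unsigned char jmp_zero[5] = { 0xe9, 0x00, 0x00, 0x00, 0x00 }; /* jmp with a 0 displacement (jmp 0)   */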
-- Steve
Steven Rostedt wrote:
> I wish we had a true 5 byte nop.
0x66 0x66 0x66 0x66 0x90
J
On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> Steven Rostedt wrote:
> > I wish we had a true 5 byte nop.
>
> 0x66 0x66 0x66 0x66 0x90
>
But I quote you:
> Unfortunately, aside from P6_NOP5, none of the standard asm/nops.h nops
> are suitable. Something like "0x66, 0x66, 0x66, 0x66, 0x90" would work
> everywhere, but its possible putting more than 3 prefixes on the
> instruction will kick the decoder into a slow path, which may have a
> noticable effect.
I guess the next step is to test if it is slow.
-- Steve
On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
>
> Steven Rostedt wrote:
> > I wish we had a true 5 byte nop.
>
> 0x66 0x66 0x66 0x66 0x90
I don't think so. Multiple redundant prefixes can be really expensive on
some uarchs.
A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
Linus
Are we sure there aren't some good available 5-byte nops?
e.g. for 64-bit maybe 3e 48 8d 76 00
for 32-bit maybe 3e 8d 74 26 00
Have the chip folks confirmed that exactly the set we have in
asm-x86/nops.h are the only optimal ones?
Thanks,
Roland
On Fri, Aug 08, 2008 at 05:53:26PM -0700, Roland McGrath wrote:
> Are we sure there aren't some good available 5-byte nops?
>
> e.g. for 64-bit maybe 3e 48 8d 76 00
> for 32-bit maybe 3e 8d 74 26 00
>
> Have the chip folks confirmed that exactly the set we have in
> asm-x86/nops.h are the only optimal ones?
They were based on the respective optimization manuals, so yes.
-Andi
> I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> measurable. This is replacing mcount from a kernel compile with the -pg
Scary usage. How much is the difference?
BTW a good way to get most of that back would be to not use frame
pointers if you have them enabled ;-) Unfortunately recently
the "select FRAME_POINTER"s keep creeping all over so far too many kernels
suffer from this slow mode now :-/
-Andi
On Fri, 8 Aug 2008, Linus Torvalds wrote:
>
>
> On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> >
> > Steven Rostedt wrote:
> > > I wish we had a true 5 byte nop.
> >
> > 0x66 0x66 0x66 0x66 0x90
>
> I don't think so. Multiple redundant prefixes can be really expensive on
> some uarchs.
>
> A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
A quick meaningless benchmark showed a slight performance hit.
Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
run 1
Time: 4.501
run 2
Time: 4.855
run 3
Time: 4.198
run 4
Time: 4.587
run 5
Time: 5.016
run 6
Time: 4.757
run 7
Time: 4.477
run 8
Time: 4.693
run 9
Time: 4.710
run 10
Time: 4.715
avg = 4.6509
And 10 runs using the above 5 byte nop:
run 1
Time: 4.832
run 2
Time: 5.319
run 3
Time: 5.213
run 4
Time: 4.830
run 5
Time: 4.363
run 6
Time: 4.391
run 7
Time: 4.772
run 8
Time: 4.992
run 9
Time: 4.727
run 10
Time: 4.825
avg = 4.8264
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 65
model name : Dual-Core AMD Opteron(tm) Processor 2220
stepping : 3
cpu MHz : 2799.992
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic
cr8_legacy
bogomips : 5599.98
clflush size : 64
power management: ts fid vid ttp tm stc
There's 4 of these.
Just to make sure, I ran the above nop test again:
[ this is reverse from the above runs ]
run 1
Time: 4.723
run 2
Time: 5.080
run 3
Time: 4.521
run 4
Time: 4.841
run 5
Time: 4.696
run 6
Time: 4.946
run 7
Time: 4.754
run 8
Time: 4.717
run 9
Time: 4.905
run 10
Time: 4.814
avg = 4.7997
And again the two part nop:
run 1
Time: 4.434
run 2
Time: 4.496
run 3
Time: 4.801
run 4
Time: 4.714
run 5
Time: 4.631
run 6
Time: 5.178
run 7
Time: 4.728
run 8
Time: 4.920
run 9
Time: 4.898
run 10
Time: 4.770
avg = 4.757
This time it was close, but still seems to have some difference.
heh, perhaps it's just noise.
-- Steve
On Sat, 9 Aug 2008, Andi Kleen wrote:
> > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> > measurable. This is replacing mcount from a kernel compile with the -pg
>
> Scary usage. How much is the difference?
>
> BTW a good way to get most of that back would be to not use frame
> pointers if you have them enabled ;-) Unfortunately recently
> the "select FRAME_POINTER"s keep creeping all over so far too many kernels
> suffer from this slow mode now :-/
Funny, CONFIG_FTRACE happens to select that. Now the question is, would
mcount work without it?
i386:

ENTRY(mcount)
        pushl %eax
        pushl %ecx
        pushl %edx
        movl 0xc(%esp), %eax
        movl 0x4(%ebp), %edx            <-- we record the parent ip here
        subl $MCOUNT_INSN_SIZE, %eax
        call *ftrace_trace_function
        popl %edx
        popl %ecx
        popl %eax
        ret

x86_64:

ENTRY(mcount)
        subq $0x38, %rsp
        movq %rax, (%rsp)
        movq %rcx, 8(%rsp)
        movq %rdx, 16(%rsp)
        movq %rsi, 24(%rsp)
        movq %rdi, 32(%rsp)
        movq %r8, 40(%rsp)
        movq %r9, 48(%rsp)
        movq 0x38(%rsp), %rdi
        movq 8(%rbp), %rsi              <-- we record the parent pointer here
        subq $MCOUNT_INSN_SIZE, %rdi
        call *ftrace_trace_function
        movq 48(%rsp), %r9
        movq 40(%rsp), %r8
        movq 32(%rsp), %rdi
        movq 24(%rsp), %rsi
        movq 16(%rsp), %rdx
        movq 8(%rsp), %rcx
        movq (%rsp), %rax
        addq $0x38, %rsp
        retq
-- Steve
> Funny, CONFIG_FTRACE happens to select that. Now the question is, would
> mcount work without it?
Not without fixing gcc first. It would work if gcc always called
mcount the first thing before setting up the stack frame. Not
sure why it doesn't do that.
Still do a benchmark of frame pointer vs no frame pointer kernel
and you'll see, especially on older CPUs without special hardware
to avoid stack stalls (e.g. not Core2)
BTW always forcing frame pointers also means that ftrace is far
from near zero overhead even when disabled. That is unless you
find a way to nop out the frame pointers too, but that would likely be
very difficult because the code will actually use them.
-Andi
On Sat, 9 Aug 2008, Andi Kleen wrote:
> > Funny, CONFIG_FTRACE happens to select that. Now the question is, would
> > mcount work without it?
>
> Not without fixing gcc first. It would work if gcc always called
> mcount the first thing before setting up the stack frame. Not
> sure why it doesn't do that.
I could take out the frame pointer option and pass in zero for the parent
in that case. But the parent is quite useful. Still, if it helps
with performance, it may be worth doing.
-- Steve
>
> Still do a benchmark of frame pointer vs no frame pointer kernel
> and you'll see, especially on a older CPUs without special hardware
> to avoid stack stalls (e.g. not Core2)
I'll see what it does with just a kernel build.
-- Steve
On Fri, Aug 08, 2008 at 10:03:42PM -0400, Steven Rostedt wrote:
>
> On Sat, 9 Aug 2008, Andi Kleen wrote:
>
> > > Funny, CONFIG_FTRACE happens to select that. Now the question is, would
> > > mcount work without it?
> >
> > Not without fixing gcc first. It would work if gcc always called
> > mcount the first thing before setting up the stack frame. Not
> > sure why it doesn't do that.
>
> I could take out the frame pointer option and pass in zero for the parent
> in such that case. But the parent is quite useful. But if it helps
> with performance, it may be worth doing this.
Right now gcc errors out unfortunately. We would need to ask the gcc
developers to fix this first. But when they move mcount to be always first,
you can still get at the parent; it will always be at 4(%esp) or 8(%rsp).
An alternative would also be to run an assembler post-processor that
inserts mcount calls without compiler help.
-Andi
On Sat, 9 Aug 2008, Andi Kleen wrote:
> > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> > measurable. This is replacing mcount from a kernel compile with the -pg
>
> Scary usage. How much is the difference?
>
> BTW a good way to get most of that back would be to not use frame
> pointers if you have them enabled ;-) Unfortunately recently
> the "select FRAME_POINTER"s keep creeping all over so far too many kernels
> suffer from this slow mode now :-/
I checked out the latest git tree from Linus, and set up a basic config
for it. I built it once. Then I rebooted into various versions of the
kernel and timed the makes. This is what I did for each kernel.
# cd /work/git/linux-compile.git
# make distclean
# cp config-use .config
# make oldconfig
# time make
Here are the results. Just for curiosity sake.
Using the two part nop:
real 3m26.751s
user 1m36.064s
sys 0m39.450s
Using the 5 byte nop with the four 0x66:
real 3m31.543s
user 1m36.197s
sys 0m39.309s
Using jmp 0 as a nop:
real 3m30.025s
user 1m35.948s
sys 0m39.540s
Then I realized that I had the irqsoff and preemptoff tracers configured.
These do add an overhead when configured in, even when they are disabled.
I disabled them and ran the code with the two part nop again:
Using two part nop without other tracers:
real 3m26.851s
user 1m36.425s
sys 0m35.437s
I then disabled ftrace completely:
no ftrace times:
real 3m25.807s
user 1m36.992s
sys 0m35.715s
Then I disabled frame pointers as well:
real 3m25.104s
user 1m35.726s
sys 0m34.797s
The important number here is the sys time. The user and real times should not
be that much affected by these changes. Well, the real time would be, but not
the user time.
I have an old Toshiba that I want to run this on too. That might give more
interesting results. But it's getting late, and I'll do that another time.
Looking at the results (and yes, I should run this more than once, but I
just wanted an idea), this box did not show much difference between the two
part nop, the jmp and the 5 byte nop.
Running with or without ftrace enabled did not make a difference either.
Heck, the build with ftrace disabled was actually slower.
But it does seem that Andi is correct about the frame pointers. It did
slow the compile down by almost a second.
Just some food for thought.
-- Steve
On Thu, Aug 7, 2008 at 11:50 PM, Steven Rostedt <[email protected]> wrote:
> You see, the reason for this is that for ftrace to maintain performance
> when configured in but disabled, it would need to change all the
> locations that called "mcount" (enabled with the gcc -pg option) into
> nops. The "-pg" option in gcc sets up a function profiler to call this
> function called "mcount". If you simply have "mcount" return, it will
> still add 15 to 18% overhead in performance. Changing all the calls to
> nops moved the overhead into noise.
>
> To get rid of this, I had the mcount code record the location that called
> it. Later, the "ftraced" daemon would wake up and look to see if
> any new functions were recorded. If so, it would call kstop_machine
> and convert the calls to "nops". We needed kstop_machine because bad
> things happen on SMP if you modify code that happens to be in the
> instruction cache of another CPU.
Is this new framework needed for x86 specific reasons only? From what
I gathered here, ftraced defers mcount patching simply because there's
no way to update a 5-byte nop atomically. If so, why can't mcount site
patching be left to arch specific ftrace code? For !SMP or archs which
generate word sized mcount branch calls (e.g., ARM), is there really no
way to patch mcount sites synchronously from inside ftrace_record_ip
by disabling interrupts?
Regards,
Abhishek Sagar
Steven Rostedt <[email protected]> wrote:
> On Sat, 9 Aug 2008, Andi Kleen wrote:
>> Steven Rostedt <[email protected]> writes:
>> > I'm stubborn, I want to get it right _and_ keep it fast.
>>
>> For me it would seem better to just not use two part 5 byte nops
>> instead of adding such hacks. I doubt there are that many of them
>> anyways. I bet you won't be able to measure any difference between the
>> different nop types in any macro benchmark.
>
> I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> measurable.
Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)
On Sat, 9 Aug 2008, Abhishek Sagar wrote:
> On Thu, Aug 7, 2008 at 11:50 PM, Steven Rostedt <[email protected]> wrote:
> > You see, the reason for this is that for ftrace to maintain performance
> > when configured in but disabled, it would need to change all the
> > locations that called "mcount" (enabled with the gcc -pg option) into
> > nops. The "-pg" option in gcc sets up a function profiler to call this
> > function called "mcount". If you simply have "mcount" return, it will
> > still add 15 to 18% overhead in performance. Changing all the calls to
> > nops moved the overhead into noise.
> >
> > To get rid of this, I had the mcount code record the location that called
> > it. Later, the "ftraced" daemon would wake up and look to see if
> > any new functions were recorded. If so, it would call kstop_machine
> > and convert the calls to "nops". We needed kstop_machine because bad
> > things happen on SMP if you modify code that happens to be in the
> > instruction cache of another CPU.
>
> Is this new framework needed for x86 specific reasons only? From what
> I gathered here, ftraced defers mcount patching simply because there's
> no way to update a 5-byte nop atomically. If so, why can't mcount site
> patching be left to arch specific ftrace code? For !SMP or archs which
> generate word sized mcount branch calls (e.g, ARM) is there really no
> way to patch mcount sites synchronously from inside ftrace_record_ip
> by disabling interrupts?
There's two topics in this thread.
1) the x86 issue of the 5 byte instruction. The problem with x86 is that on
some CPUs the nop used consists of two nops to fill the 5 bytes. There is
no way to change that atomically. The workarounds for this is the arch
specific ftrace_pre_enable() that will make sure no process is about to
execute the second part of that nop.
2) Getting rid of the daemon. The daemon is used to patch the code
dynamically some time after boot up. Now an arch may or may not be able to
modify code on SMP, but I've been told that this is dangerous to do even on
PPC. Dynamically modifying text that might be in the pipeline of another CPU
may or may not be dangerous, depending on the arch.
The fix here is to convert the mcount calls to nops at boot up. This is
really ideal on all archs. This means we know every mcount call, and we get
rid of the requirement that we need to run the code once before we can
trace it.
The kstop_machine is now only left at the start and stop of tracing.
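To give a rough idea of what that boot-up pass looks like (this is a sketch;
the section symbol names and the exact helpers here are assumptions, not the
final code):

/*
 * Sketch: walk the __mcount_loc table that the build emits, record every
 * mcount call site, then turn them all into nops before tracing is ever
 * enabled, so the ftraced daemon and its periodic kstop_machine are no
 * longer needed.
 */
extern unsigned long __start_mcount_loc[];
extern unsigned long __stop_mcount_loc[];

static int __init ftrace_init_mcount_sites(void)
{
        unsigned long *site;

        for (site = __start_mcount_loc; site < __stop_mcount_loc; site++)
                ftrace_record_ip(*site);        /* remember this call site */

        /* one pass converts every recorded call into the arch's 5-byte nop */
        ftrace_replace_code(0);
        return 0;
}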
-- Steve
On Sat, 9 Aug 2008, Bodo Eggert wrote:
> Steven Rostedt <[email protected]> wrote:
> > On Sat, 9 Aug 2008, Andi Kleen wrote:
> >> Steven Rostedt <[email protected]> writes:
>
> >> > I'm stubborn, I want to get it right _and_ keep it fast.
> >>
> >> For me it would seem better to just not use two part 5 byte nops
> >> instead of adding such hacks. I doubt there are that many of them
> >> anyways. I bet you won't be able to measure any difference between the
> >> different nop types in any macro benchmark.
> >
> > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> > measurable.
>
> Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)
What would those last three bytes be?
-- Steve
On Sat, 9 Aug 2008, Steven Rostedt wrote:
> >
> > Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)
>
> What would those last three bytes be?
Ah, I see, you jump 3 bytes forward.
I'm pretty sure that it is still more expensive than a nop, on some archs.
But I'm not 100% sure on that.
-- Steve
On Sat, 9 Aug 2008, Steven Rostedt wrote:
> On Sat, 9 Aug 2008, Bodo Eggert wrote:
> > Steven Rostedt <[email protected]> wrote:
> > > On Sat, 9 Aug 2008, Andi Kleen wrote:
> > >> Steven Rostedt <[email protected]> writes:
> > >> > I'm stubborn, I want to get it right _and_ keep it fast.
> > >>
> > >> For me it would seem better to just not use two part 5 byte nops
> > >> instead of adding such hacks. I doubt there are that many of them
> > >> anyways. I bet you won't be able to measure any difference between the
> > >> different nop types in any macro benchmark.
> > >
> > > I wish we had a true 5 byte nop. The alternative is a jmp 0, which is
> > > measurable.
> >
> > Did you try short jumps? (0xeb 0x03 0x?? 0x?? 0x??)
>
> What would those last three bytes be?
Anything, since the CPU will ignore them.
My hope is that different kinds of jump will behave differently,
but I fear the side effect of the jmp (reread memory in case of
self-modifying code) will cause the CPU to slow down anyway.
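Spelled out, the filler would be something like this (a sketch; the padding
bytes are arbitrary since the short jump skips over them):

/* A 2-byte short jump over the remaining 3 bytes of the 5-byte slot.
 * 0xeb is jmp rel8; rel8 = 3 lands execution right after the slot.
 * The padding bytes are never executed; 0x90 just keeps disassembly tidy.
 */
static const unsigned char short_jmp_filler[5] = {
        0xeb, 0x03,             /* jmp .+5 (skip the next 3 bytes) */
        0x90, 0x90, 0x90,       /* ignored */
};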
--
Is reading in the bathroom considered Multitasking?
On Sat, Aug 9, 2008 at 6:31 PM, Steven Rostedt <[email protected]> wrote:
> Dynamically modifying text that might be in the pipeline on another CPU
> may or may not be dangerous on all archs.
Kprobes leaves this mess for individual archs to handle (see
arch_arm_kprobe). What would be nice to have is an explanation as to
which archs are safe from this and under what conditions. Also, for
!SMP, this boot-time patching approach and the associated build time
paraphernalia would simply be a bloat. There we can simply have a
daemonless ftrace which patches mcount sites synchronously. Why not
let each arch have this hinge on CONFIG_SMP?
> The fix here is to convert the mcount calls to nops at boot up. This is
> really ideal on all archs. This means we know ever mcount call, and we get
> rid of the requirement that we need to run the code once before we can
> trace it.
This solution indeed would fit all archs well but for some it may be
an overkill (or is it?...I'd like to know that :-\ ).
Oh, and as you pointed out, it would mean that we no longer have to run
the code before starting to trace it. But why don't we do that now?
How about calling the tracer from ftrace_record_ip? All we need to do
is pass along the parent_ip from mcount.
> The kstop_machine is now only left at the start and stop of tracing.
This a nice fix. I'm just looking to find out what the self modifying
code problem here translates to on different archs for my own
understanding :-)
Thanks,
Abhishek Sagar
On Sat, 9 Aug 2008, Abhishek Sagar wrote:
> On Sat, Aug 9, 2008 at 6:31 PM, Steven Rostedt <[email protected]> wrote:
> > Dynamically modifying text that might be in the pipeline on another CPU
> > may or may not be dangerous on all archs.
>
> Kprobes leaves this mess for individual archs to handle (see
> arch_arm_kprobe). What would be nice to have is an explanation as to
> which archs are safe from this and under what conditions. Also, for
> !SMP, this boot-time patching approach and the associated build time
> paraphernalia would simply be a bloat. There we can simply have a
> daemonless ftrace which patches mcount sites synchronously. Why not
> let each arch have this hinge on CONFIG_SMP?
One of the things I tried to do was to make ftrace port easily to all
archs. David Miller ported it to Sparc64 in 20 minutes and that was mostly
compiling. Doing a kprobe fix would require much more knowledge of each
arch and be more prone to errors.
>
> > The fix here is to convert the mcount calls to nops at boot up. This is
> > really ideal on all archs. This means we know ever mcount call, and we get
> > rid of the requirement that we need to run the code once before we can
> > trace it.
>
> This solution indeed would fit all archs well but for some it may be
> an overkill (or is it?...I'd like to know that :-\ ).
It is not an issue at all. The mcount list is discarded with the init
section, and the change to nops is relatively fast. We do make a list of
all entries anyway, so that we can pick and choose which functions we want
to enable tracing on.
This recording at boot up should be fine on all archs, SMP or UP and will
get rid of that daemon.
>
> Oh and as you pointed out, it would mean that we have to no longer run
> the code before starting to trace it. But why don't we do that now?
> How about calling the tracer from ftrace_record_ip? All we need to do
> is pass along the parent_ip from mcount.
I want ftrace_record_ip to be as fast as possible, and also having it call
the registered function means we need to test if it was set in the filter
too. This is too much for what record_ip does. And as I said, doing it all
on boot up would be best.
>
> > The kstop_machine is now only left at the start and stop of tracing.
>
> This a nice fix. I'm just looking to find out what the self modifying
> code problem here translates to on different archs for my own
> understanding :-)
Now, what I can do is remove the kstop_machine from UP. Hmm, I need to
check if the kstop_machine code itself is smart enough to simply do a
local_irq_save; run function; local_irq_restore, if it is on UP and not
SMP.
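Something like this is what I would hope it boils down to on UP (just a
sketch of the idea with made-up names; whether the real stop_machine code
already does this is exactly what I need to check):

#include <linux/irqflags.h>

#ifndef CONFIG_SMP
/*
 * On a uniprocessor kernel there is no other CPU that could be executing
 * the code we are patching, so disabling interrupts around the patch
 * function would be enough.
 */
static int stop_machine_up(int (*fn)(void *), void *data)
{
        unsigned long flags;
        int ret;

        local_irq_save(flags);
        ret = fn(data);
        local_irq_restore(flags);
        return ret;
}
#endif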
-- Steve
On Sat, Aug 9, 2008 at 9:07 PM, Steven Rostedt <[email protected]> wrote:
> One of the things I tried to do was to make ftrace port easily to all
> archs. David Miller ported it to Sparc64 in 20 minutes and that was mostly
> compiling. Doing a kprobe fix would require much more knowledge to each
> arch and more prone to errors.
I agree with the point about porting. My kprobe reference was simply to
point out that an existing module seems to be overwriting code on the fly
and doesn't take care of much on non-x86 archs. Ftrace has a unique 5-byte
call replacement issue specific to x86 which kprobes doesn't have to deal
with.
>> This solution indeed would fit all archs well but for some it may be
>> an overkill (or is it?...I'd like to know that :-\ ).
>
> It is not an issue at all. The mcount list is discarded with the init
> section, and the change to nops is relatively fast. We do make a list of
> all entries anyway, so that we can pick and choose which functions we want
> to enable tracing on.
Ok.
> I want ftrace_record_ip to be as fast as possible, and also having it call
> the registered function means we need to test if it was set in the filter
> too. This is too much for what record_ip does. And as I said, doing it all
> on boot up would be best.
Alrighty.
Thanks,
Abhishek Sagar
On Saturday 09 August 2008 04:41:06 Steven Rostedt wrote:
> On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > kstop_machine does not guarantee that you won't have _any_ thread
> > preempted with IP pointing exactly in the middle of your instructions
> > _before_ the modification, scheduled back in _after_ the modification,
> > and thus causing an illegal instruction.
> >
> > Still buggy. :/
>
> Hmm, good point. Unless...
You can walk the task list and fix them up, if you have to. Of course, this
could be extremely slow with a million threads. Maybe just walking the
runqueues would be sufficient?
Rusty.
On Mon, 11 Aug 2008, Rusty Russell wrote:
> On Saturday 09 August 2008 04:41:06 Steven Rostedt wrote:
> > On Fri, 8 Aug 2008, Mathieu Desnoyers wrote:
> > > kstop_machine does not guarantee that you won't have _any_ thread
> > > preempted with IP pointing exactly in the middle of your instructions
> > > _before_ the modification, scheduled back in _after_ the modification,
> > > and thus causing an illegal instruction.
> > >
> > > Still buggy. :/
> >
> > Hmm, good point. Unless...
>
> You can walk the task list and fix them up, if you have to. Of course, this
> could be extremely slow with a million threads. Maybe just walking the
> runqueues would be sufficient?
True, we could do the run queues too. But we are only looping once
through the tasks and checking a single pointer on it, which would only be
set if the task was preempted while in the kernel.
As for being slow, this only happens when we enable the function tracer,
which is slow anyway. It does not need to happen on disabling the tracer.
Note, this is enabling the calls to the tracer function. The tracer itself
can enable and disable quickly by turning on or off a variable. This is
just the "setup" part. Not something that happens often.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> [ updated patch ]
>
> On Fri, 8 Aug 2008, Steven Rostedt wrote:
> >
> >
> > The do_something_silly had an mcount pointer to it. I put in printks in
> > the ftrace enabled code to see where this was preempted. It was preempted
> > several times before and after the nops, but never at either nop.
> >
> > Maybe I didn't run it enough (almost 2 hours), but perhaps it is very
> > unlikely to be preempted at a nop if there's something coming up next.
> >
> > Yes a string of nops may be preempted, but perhaps only two nops followed
> > by an actual command might be skipped quickly.
> >
> > I'll write some hacks to look at where it is preempted in the scheduler
> > itself, and see if I see it preempting at the second nop ever.
>
> I put that code in the scheduler and it does indeed preempt in the nops!
> It also uncovered a nasty bug I had in my code:
>
> [...]
>
> > Index: linux-tip.git/arch/x86/kernel/ftrace.c
> > ===================================================================
> > --- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400
> > +++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 17:48:04.000000000 -0400
> > @@ -127,6 +127,46 @@ notrace int ftrace_mcount_set(unsigned l
> > return 0;
> > }
> >
> > +static const unsigned char *second_nop;
> > +
> > +void arch_ftrace_pre_enable(void)
> > +{
> > + struct task_struct *g, *p;
> > + unsigned long **ip;
> > +
>
> [...]
>
> > + ip = task_thread_info(p)->ip;
> > + if (!ip)
> > + continue;
> > + if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
> > + /* Match, move the ip forward */
> > + *ip += x86_nop5_part2;
>
> Notice the bug? Hint, *ip is of type unsigned long *.
>
> Here's the updated fix:
>
Hi Steven,
I'm actually a bit worried about this scheduler-centric approach. The
problem is that any trap generated in the middle of the
3/2 nops (think of performance counters if the APIC is set up as a trap
gate) would stack an interruptible trap handler on top of those
instructions and leave a return IP pointing at the wrong
spot, and because the scheduler would interrupt the trap handler and not
the trapped code, it would not detect this.
I am not sure whether we do actually have the possibility to have these
interruptible trap handlers considering the way the kernel sets up the
gates, but I think the "let's figure out where the IP stopped" approach
is a bit complex and fragile.
Trying to find a fast atomic 5-bytes nop, involving the
microarchitecture guys, seems like a better approach to me.
Mathieu
> Signed-off-by: Steven Rostedt <[email protected]>
> ---
> arch/x86/kernel/alternative.c | 29 ++++++++++++++++++++-------
> arch/x86/kernel/asm-offsets_32.c | 1
> arch/x86/kernel/asm-offsets_64.c | 1
> arch/x86/kernel/entry_32.S | 4 +++
> arch/x86/kernel/entry_64.S | 4 +++
> arch/x86/kernel/ftrace.c | 41 +++++++++++++++++++++++++++++++++++++++
> include/asm-x86/ftrace.h | 5 ++++
> include/asm-x86/thread_info.h | 4 +++
> kernel/trace/ftrace.c | 12 +++++++++++
> 9 files changed, 94 insertions(+), 7 deletions(-)
>
> Index: linux-tip.git/arch/x86/kernel/alternative.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/alternative.c 2008-06-05 11:52:24.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/alternative.c 2008-08-08 16:20:23.000000000 -0400
> @@ -140,13 +140,26 @@ static const unsigned char *const p6_nop
> };
> #endif
>
> +/*
> + * Some versions of x86 CPUs have a two part NOP5. This
> + * can break ftrace if a process is preempted between
> + * the two. ftrace needs to know what the second nop
> + * is to handle this case.
> + */
> +int x86_nop5_part2;
> +
> #ifdef CONFIG_X86_64
>
> extern char __vsyscall_0;
> const unsigned char *const *find_nop_table(void)
> {
> - return boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> - boot_cpu_data.x86 < 6 ? k8_nops : p6_nops;
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> + boot_cpu_data.x86 < 6) {
> + x86_nop5_part2 = 2; /* K8_NOP2 */
> + return k8_nops;
> + } else
> + /* keep k86_nop5_part2 NULL */
> + return p6_nops;
> }
>
> #else /* CONFIG_X86_64 */
> @@ -154,12 +167,13 @@ const unsigned char *const *find_nop_tab
> static const struct nop {
> int cpuid;
> const unsigned char *const *noptable;
> + int nop5_part2;
> } noptypes[] = {
> - { X86_FEATURE_K8, k8_nops },
> - { X86_FEATURE_K7, k7_nops },
> - { X86_FEATURE_P4, p6_nops },
> - { X86_FEATURE_P3, p6_nops },
> - { -1, NULL }
> + { X86_FEATURE_K8, k8_nops, 2},
> + { X86_FEATURE_K7, k7_nops, 1 },
> + { X86_FEATURE_P4, p6_nops, 0 },
> + { X86_FEATURE_P3, p6_nops, 0 },
> + { -1, NULL, 0 }
> };
>
> const unsigned char *const *find_nop_table(void)
> @@ -170,6 +184,7 @@ const unsigned char *const *find_nop_tab
> for (i = 0; noptypes[i].cpuid >= 0; i++) {
> if (boot_cpu_has(noptypes[i].cpuid)) {
> noptable = noptypes[i].noptable;
> + x86_nop5_part2 = noptypes[i].nop5_part2;
> break;
> }
> }
> Index: linux-tip.git/arch/x86/kernel/asm-offsets_32.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_32.c 2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/asm-offsets_32.c 2008-08-08 15:46:55.000000000 -0400
> @@ -59,6 +59,7 @@ void foo(void)
> OFFSET(TI_restart_block, thread_info, restart_block);
> OFFSET(TI_sysenter_return, thread_info, sysenter_return);
> OFFSET(TI_cpu, thread_info, cpu);
> + OFFSET(TI_ip, thread_info, ip);
> BLANK();
>
> OFFSET(GDS_size, desc_ptr, size);
> Index: linux-tip.git/arch/x86/kernel/asm-offsets_64.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/asm-offsets_64.c 2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/asm-offsets_64.c 2008-08-08 15:52:34.000000000 -0400
> @@ -41,6 +41,7 @@ int main(void)
> ENTRY(addr_limit);
> ENTRY(preempt_count);
> ENTRY(status);
> + ENTRY(ip);
> #ifdef CONFIG_IA32_EMULATION
> ENTRY(sysenter_return);
> #endif
> Index: linux-tip.git/arch/x86/kernel/entry_32.S
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/entry_32.S 2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/entry_32.S 2008-08-08 17:13:27.000000000 -0400
> @@ -304,7 +304,11 @@ need_resched:
> jz restore_all
> testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
> jz restore_all
> + lea PT_EIP(%esp), %eax
> + movl %eax, TI_ip(%ebp)
> call preempt_schedule_irq
> + GET_THREAD_INFO(%ebp)
> + movl $0, TI_ip(%ebp)
> jmp need_resched
> END(resume_kernel)
> #endif
> Index: linux-tip.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/entry_64.S 2008-07-27 10:43:26.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/entry_64.S 2008-08-08 17:12:47.000000000 -0400
> @@ -837,7 +837,11 @@ ENTRY(retint_kernel)
> jnc retint_restore_args
> bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */
> jnc retint_restore_args
> + leaq RIP-ARGOFFSET(%rsp), %rax
> + movq %rax, TI_ip(%rcx)
> call preempt_schedule_irq
> + GET_THREAD_INFO(%rcx)
> + movq $0, TI_ip(%rcx)
> jmp exit_intr
> #endif
>
> Index: linux-tip.git/arch/x86/kernel/ftrace.c
> ===================================================================
> --- linux-tip.git.orig/arch/x86/kernel/ftrace.c 2008-06-26 14:58:54.000000000 -0400
> +++ linux-tip.git/arch/x86/kernel/ftrace.c 2008-08-08 20:22:19.000000000 -0400
> @@ -127,6 +127,44 @@ notrace int ftrace_mcount_set(unsigned l
> return 0;
> }
>
> +static const unsigned char *second_nop;
> +
> +void arch_ftrace_pre_enable(void)
> +{
> + struct task_struct *g, *p;
> + void **ip;
> +
> + if (!second_nop)
> + return;
> +
> + /*
> + * x86 has a two part nop to handle 5 byte instructions.
> + * If a task was preempted after the first nop, and has
> + * not ran the second nop, if we modify the code, we can
> + * crash the system. Thus, we will look at all the tasks
> + * and if any of them was preempted and will run the
> + * second nop next, we simply move their ip pointer past
> + * the second nop.
> + */
> +
> + /*
> + * Don't need to grab the task list lock, we are running
> + * in kstop_machine
> + */
> + do_each_thread(g, p) {
> + /*
> + * In entry.S we save the ip when a task is preempted
> + * and reset it when it is back running.
> + */
> + ip = (void **)task_thread_info(p)->ip;
> + if (!ip)
> + continue;
> + if (memcmp(*ip, second_nop, x86_nop5_part2) == 0)
> + /* Match, move the ip forward */
> + *ip += x86_nop5_part2;
> + } while_each_thread(g, p);
> +}
> +
> int __init ftrace_dyn_arch_init(void *data)
> {
> const unsigned char *const *noptable = find_nop_table();
> @@ -137,5 +175,8 @@ int __init ftrace_dyn_arch_init(void *da
>
> ftrace_nop = (unsigned long *)noptable[MCOUNT_INSN_SIZE];
>
> + if (x86_nop5_part2)
> + second_nop = noptable[x86_nop5_part2];
> +
> return 0;
> }
> Index: linux-tip.git/include/asm-x86/ftrace.h
> ===================================================================
> --- linux-tip.git.orig/include/asm-x86/ftrace.h 2008-08-08 13:00:51.000000000 -0400
> +++ linux-tip.git/include/asm-x86/ftrace.h 2008-08-08 16:41:09.000000000 -0400
> @@ -17,6 +17,11 @@ static inline unsigned long ftrace_call_
> */
> return addr - 1;
> }
> +
> +extern int x86_nop5_part2;
> +extern void arch_ftrace_pre_enable(void);
> +#define ftrace_pre_enable arch_ftrace_pre_enable
> +
> #endif
>
> #endif /* CONFIG_FTRACE */
> Index: linux-tip.git/include/asm-x86/thread_info.h
> ===================================================================
> --- linux-tip.git.orig/include/asm-x86/thread_info.h 2008-08-07 11:14:43.000000000 -0400
> +++ linux-tip.git/include/asm-x86/thread_info.h 2008-08-08 17:06:15.000000000 -0400
> @@ -29,6 +29,9 @@ struct thread_info {
> __u32 cpu; /* current CPU */
> int preempt_count; /* 0 => preemptable,
> <0 => BUG */
> + unsigned long **ip; /* pointer to ip on stack when
> + preempted
> + */
> mm_segment_t addr_limit;
> struct restart_block restart_block;
> void __user *sysenter_return;
> @@ -47,6 +50,7 @@ struct thread_info {
> .flags = 0, \
> .cpu = 0, \
> .preempt_count = 1, \
> + .ip = NULL, \
> .addr_limit = KERNEL_DS, \
> .restart_block = { \
> .fn = do_no_restart_syscall, \
> Index: linux-tip.git/kernel/trace/ftrace.c
> ===================================================================
> --- linux-tip.git.orig/kernel/trace/ftrace.c 2008-08-08 13:00:52.000000000 -0400
> +++ linux-tip.git/kernel/trace/ftrace.c 2008-08-08 16:18:14.000000000 -0400
> @@ -32,6 +32,10 @@
>
> #include "trace.h"
>
> +#ifndef ftrace_pre_enable
> +# define ftrace_pre_enable() do { } while (0)
> +#endif
> +
> /* ftrace_enabled is a method to turn ftrace on or off */
> int ftrace_enabled __read_mostly;
> static int last_ftrace_enabled;
> @@ -500,6 +504,14 @@ static void ftrace_replace_code(int enab
> else
> new = ftrace_nop_replace();
>
> + /*
> + * Some archs *cough*x86*cough* have more than one nop to cover
> + * the call to mcount. In these cases, special care must be taken
> + * before we start converting nops into calls.
> + */
> + if (enable)
> + ftrace_pre_enable();
> +
> for (pg = ftrace_pages_start; pg; pg = pg->next) {
> for (i = 0; i < pg->index; i++) {
> rec = &pg->records[i];
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 11 Aug 2008, Mathieu Desnoyers wrote:
>
> Hi Steven,
>
> I'm actually a bit worried about this scheduler-centric approach. The
> problem is that any trap that could be generated in the middle of the
> 3/2 nops (think of performance counters if the APIC is set as a trap
> gate) which would stack an interruptible trap handler on top of those
> instructions would lead to having a return IP pointing in the wrong
> spot, but because the scheduler would interrupt the trap handler and not
> the trapped code, it would not detect this.
>
> I am not sure wheter we do actually have the possibility to have these
> interruptible trap handlers considering the way the kernel sets up the
> gates, but I think the "let's figure out where the IP stopped" approach
> is a bit complex and fragile.
>
> Trying to find a fast atomic 5-bytes nop, involving the
> microarchitecture guys, seems like a better approach to me.
>
I agree that the fast atomic 5-byte nop is the optimal approach, but until
we have that, we need to do this.
I don't think this approach is too complex; I wrote the code in about an
hour, and the patch isn't that big. Aside from my one bug, which was a CS101
type bug (pointer arithmetic), there wasn't anything in it that I found
complex.
The thing is, I want this to be enabled in a production kernel. If there
is a 1 in a million chance of this theoretical bug, with the trap
happening in the middle of the double nop, it only affects those that enable
the tracer. No one else. That code only runs when tracing is being
enabled.
If we need to put in a non optimal nop in, to replace the call to mcount,
this affects everyone. Anyone that has CONFIG_FTRACE on, and never plans
on running a trace.
If it comes down to slowing everyone down, against hitting a 1 in a
million theoretical bug, I'll take my chances on the bug. Again, that code
is only performed when tracing is being enabled. That is, when it converts
all 20 thousand nops into calls to a tracer function. If the crash
happens, it will only happen after tracing has started. Then we can easily
point the bug at the tracer.
IOW, I do not want to slow down Linus' box.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 8 Aug 2008, Linus Torvalds wrote:
> >
> >
> > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > >
> > > Steven Rostedt wrote:
> > > > I wish we had a true 5 byte nop.
> > >
> > > 0x66 0x66 0x66 0x66 0x90
> >
> > I don't think so. Multiple redundant prefixes can be really expensive on
> > some uarchs.
> >
> > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
>
>
> A quick meaningless benchmark showed a slight perfomance hit.
>
Hi Steven,
I tried to run my own tests to see whether these numbers are actually
meaningful at all. My results seem to show that there is no significant
difference between the various configurations; the only tendency I see
is that the 2-byte jump with offset 0x03 would be slightly faster than
the 3/2 nop on an Intel Xeon. But we would have to run these a bit more
often to confirm that, I guess.
I am just trying to get a sense of whether we are trying hard to
optimize something that is worthless in practice, and to me it looks like
we are. But it could be the architecture I am using that produces these results.
Mathieu
Intel Xeon dual quad-core
Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
3/2 nop used :
K8_NOP3 K8_NOP2
#define K8_NOP2 ".byte 0x66,0x90\n"
#define K8_NOP3 ".byte 0x66,0x66,0x90\n"
** Summary **
Test A : make -j20 2.6.27-rc2 kernel (real time)
Avg. std.dev
Case 1 : ftrace not compiled-in. 1m9.76s 0.41s
Case 2 : 3/2 nops 1m9.95s 0.36s
Case 3 : 2-bytes jump, offset 0x03 1m9.10s 0.40s
Case 4 : 5-bytes jump, offset 0x00 1m9.25s 0.34s
Test B : hackbench 15
Case 1 : ftrace not compiled-in. 0.349s 0.007s
Case 2 : 3/2 nops 0.351s 0.014s
Case 3 : 2-bytes jump, offset 0x03 0.350s 0.007s
Case 4 : 5-bytes jump, offset 0x00 0.351s 0.010s
** Detail **
* Test A
benchmark : make -j20 2.6.27-rc2 kernel
make clean; make -j20; make clean done before the tests to prime caches.
Same .config used.
Case 1 : ftrace not compiled-in.
real 1m9.980s
user 7m27.664s
sys 0m48.771s
real 1m9.330s
user 7m27.244s
sys 0m50.567s
real 1m9.393s
user 7m27.408s
sys 0m50.511s
real 1m9.674s
user 7m28.088s
sys 0m50.327s
real 1m10.441s
user 7m27.736s
sys 0m49.687s
real time
average : 1m9.76s
std. dev. : 0.41s
after a reboot with the same kernel :
real 1m8.758s
user 7m26.012s
sys 0m48.835s
real 1m11.035s
user 7m26.432s
sys 0m49.171s
real 1m9.834s
user 7m25.768s
sys 0m49.167s
Case 2 : 3/2 nops
real 1m9.713s
user 7m27.524s
sys 0m48.315s
real 1m9.481s
user 7m27.144s
sys 0m48.587s
real 1m10.565s
user 7m27.048s
sys 0m48.715s
real 1m10.008s
user 7m26.436s
sys 0m49.295s
real 1m9.982s
user 7m27.160s
sys 0m48.667s
real time
avg : 1m9.95s
std. dev. : 0.36s
Case 3 : 2-bytes jump, offset 0x03
real 1m9.158s
user 7m27.108s
sys 0m48.775s
real 1m9.159s
user 7m27.320s
sys 0m48.659s
real 1m8.390s
user 7m27.976s
sys 0m48.359s
real 1m9.143s
user 7m26.624s
sys 0m48.719s
real 1m9.642s
user 7m26.228s
sys 0m49.483s
real time
avg : 1m9.10s
std. dev. : 0.40s
one extra after reboot with same kernel :
real 1m8.855s
user 7m27.372s
sys 0m48.543s
Case 4 : 5-bytes jump, offset 0x00
real 1m9.173s
user 7m27.228s
sys 0m48.151s
real 1m9.735s
user 7m26.852s
sys 0m48.499s
real 1m9.502s
user 7m27.148s
sys 0m48.107s
real 1m8.727s
user 7m27.416s
sys 0m48.071s
real 1m9.115s
user 7m26.932s
sys 0m48.727s
real time
avg : 1m9.25s
std. dev. : 0.34s
* Test B
Hackbench
Case 1 : ftrace not compiled-in.
./hackbench 15
Time: 0.358
./hackbench 15
Time: 0.342
./hackbench 15
Time: 0.354
./hackbench 15
Time: 0.338
./hackbench 15
Time: 0.347
Average : 0.349
std. dev. : 0.007
Case 2 : 3/2 nops
./hackbench 15
Time: 0.328
./hackbench 15
Time: 0.368
./hackbench 15
Time: 0.351
./hackbench 15
Time: 0.343
./hackbench 15
Time: 0.366
Average : 0.351
std. dev. : 0.014
Case 3 : jmp 2 bytes
./hackbench 15
Time: 0.346
./hackbench 15
Time: 0.359
./hackbench 15
Time: 0.356
./hackbench 15
Time: 0.350
./hackbench 15
Time: 0.340
Average : 0.350
std. dev. : 0.007
Case 3 : jmp 5 bytes
./hackbench 15
Time: 0.346
./hackbench 15
Time: 0.346
./hackbench 15
Time: 0.364
./hackbench 15
Time: 0.362
./hackbench 15
Time: 0.338
Average : 0.351
std. dev. : 0.010
Hardware used :
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.114
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3
cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4000.22
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
(7 other similar cpus)
> Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
>
> run 1
> Time: 4.501
> run 2
> Time: 4.855
> run 3
> Time: 4.198
> run 4
> Time: 4.587
> run 5
> Time: 5.016
> run 6
> Time: 4.757
> run 7
> Time: 4.477
> run 8
> Time: 4.693
> run 9
> Time: 4.710
> run 10
> Time: 4.715
> avg = 4.6509
>
>
> And 10 runs using the above 5 byte nop:
>
> run 1
> Time: 4.832
> run 2
> Time: 5.319
> run 3
> Time: 5.213
> run 4
> Time: 4.830
> run 5
> Time: 4.363
> run 6
> Time: 4.391
> run 7
> Time: 4.772
> run 8
> Time: 4.992
> run 9
> Time: 4.727
> run 10
> Time: 4.825
> avg = 4.8264
>
> # cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 65
> model name : Dual-Core AMD Opteron(tm) Processor 2220
> stepping : 3
> cpu MHz : 2799.992
> cache size : 1024 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic
> cr8_legacy
> bogomips : 5599.98
> clflush size : 64
> power management: ts fid vid ttp tm stc
>
> There's 4 of these.
>
> Just to make sure, I ran the above nop test again:
>
> [ this is reverse from the above runs ]
>
> run 1
> Time: 4.723
> run 2
> Time: 5.080
> run 3
> Time: 4.521
> run 4
> Time: 4.841
> run 5
> Time: 4.696
> run 6
> Time: 4.946
> run 7
> Time: 4.754
> run 8
> Time: 4.717
> run 9
> Time: 4.905
> run 10
> Time: 4.814
> avg = 4.7997
>
> And again the two part nop:
>
> run 1
> Time: 4.434
> run 2
> Time: 4.496
> run 3
> Time: 4.801
> run 4
> Time: 4.714
> run 5
> Time: 4.631
> run 6
> Time: 5.178
> run 7
> Time: 4.728
> run 8
> Time: 4.920
> run 9
> Time: 4.898
> run 10
> Time: 4.770
> avg = 4.757
>
>
> This time it was close, but still seems to have some difference.
>
> heh, perhaps it's just noise.
>
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers ([email protected]) wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > On Fri, 8 Aug 2008, Linus Torvalds wrote:
> > >
> > >
> > > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > > >
> > > > Steven Rostedt wrote:
> > > > > I wish we had a true 5 byte nop.
> > > >
> > > > 0x66 0x66 0x66 0x66 0x90
> > >
> > > I don't think so. Multiple redundant prefixes can be really expensive on
> > > some uarchs.
> > >
> > > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
> >
> >
> > A quick meaningless benchmark showed a slight perfomance hit.
> >
>
> Hi Steven,
>
> I tried to run my own tests to see if I could get to know if these
> numbers are actually meaningful at all. My results seems to show that
> there is not any significant difference between the various
> configurations, and actually that the only one tendency I see is that
> the 2-bytes jump offset 0x03 would be slightly faster than the 3/2 nop
> on Intel Xeon. But we would have to run these a bit more often to
> confirm that I guess.
>
> I am just trying to get a sense of whether we are really trying hard to
> optimize something worthless in practice, and to me it looks like it.
> But it could be the architecture I am using that brings these results.
>
> Mathieu
>
> Intel Xeon dual quad-core
> Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
>
> 3/2 nop used :
> K8_NOP3 K8_NOP2
> #define K8_NOP2 ".byte 0x66,0x90\n"
> #define K8_NOP3 ".byte 0x66,0x66,0x90\n"
>
Small correction: my architecture uses the P6_NOP5, which is an atomic
5-byte nop (I just looked at the runtime output of find_nop_table()).
5: nopl 0x00(%eax,%eax,1)
#define P6_NOP5 ".byte 0x0f,0x1f,0x44,0x00,0\n"
But the results stand. Maybe I should try to force a test with the
K8_NOP3 K8_NOP2 nop pair.
Mathieu
> ** Summary **
>
> Test A : make -j20 2.6.27-rc2 kernel (real time)
> Avg. std.dev
> Case 1 : ftrace not compiled-in. 1m9.76s 0.41s
> Case 2 : 3/2 nops 1m9.95s 0.36s
> Case 3 : 2-bytes jump, offset 0x03 1m9.10s 0.40s
> Case 4 : 5-bytes jump, offset 0x00 1m9.25s 0.34s
>
> Test B : hackbench 15
>
> Case 1 : ftrace not compiled-in. 0.349s 0.007s
> Case 2 : 3/2 nops 0.351s 0.014s
> Case 3 : 2-bytes jump, offset 0x03 0.350s 0.007s
> Case 4 : 5-bytes jump, offset 0x00 0.351s 0.010s
>
>
>
> ** Detail **
>
> * Test A
>
> benchmark : make -j20 2.6.27-rc2 kernel
> make clean; make -j20; make clean done before the tests to prime caches.
> Same .config used.
>
>
> Case 1 : ftrace not compiled-in.
>
> real 1m9.980s
> user 7m27.664s
> sys 0m48.771s
>
> real 1m9.330s
> user 7m27.244s
> sys 0m50.567s
>
> real 1m9.393s
> user 7m27.408s
> sys 0m50.511s
>
> real 1m9.674s
> user 7m28.088s
> sys 0m50.327s
>
> real 1m10.441s
> user 7m27.736s
> sys 0m49.687s
>
> real time
> average : 1m9.76s
> std. dev. : 0.41s
>
> after a reboot with the same kernel :
>
> real 1m8.758s
> user 7m26.012s
> sys 0m48.835s
>
> real 1m11.035s
> user 7m26.432s
> sys 0m49.171s
>
> real 1m9.834s
> user 7m25.768s
> sys 0m49.167s
>
>
> Case 2 : 3/2 nops
>
> real 1m9.713s
> user 7m27.524s
> sys 0m48.315s
>
> real 1m9.481s
> user 7m27.144s
> sys 0m48.587s
>
> real 1m10.565s
> user 7m27.048s
> sys 0m48.715s
>
> real 1m10.008s
> user 7m26.436s
> sys 0m49.295s
>
> real 1m9.982s
> user 7m27.160s
> sys 0m48.667s
>
> real time
> avg : 1m9.95s
> std. dev. : 0.36s
>
>
> Case 3 : 2-bytes jump, offset 0x03
>
> real 1m9.158s
> user 7m27.108s
> sys 0m48.775s
>
> real 1m9.159s
> user 7m27.320s
> sys 0m48.659s
>
> real 1m8.390s
> user 7m27.976s
> sys 0m48.359s
>
> real 1m9.143s
> user 7m26.624s
> sys 0m48.719s
>
> real 1m9.642s
> user 7m26.228s
> sys 0m49.483s
>
> real time
> avg : 1m9.10s
> std. dev. : 0.40s
>
> one extra after reboot with same kernel :
>
> real 1m8.855s
> user 7m27.372s
> sys 0m48.543s
>
>
> Case 4 : 5-bytes jump, offset 0x00
>
> real 1m9.173s
> user 7m27.228s
> sys 0m48.151s
>
> real 1m9.735s
> user 7m26.852s
> sys 0m48.499s
>
> real 1m9.502s
> user 7m27.148s
> sys 0m48.107s
>
> real 1m8.727s
> user 7m27.416s
> sys 0m48.071s
>
> real 1m9.115s
> user 7m26.932s
> sys 0m48.727s
>
> real time
> avg : 1m9.25s
> std. dev. : 0.34s
>
>
> * Test B
>
> Hackbench
>
> Case 1 : ftrace not compiled-in.
>
> ./hackbench 15
> Time: 0.358
> ./hackbench 15
> Time: 0.342
> ./hackbench 15
> Time: 0.354
> ./hackbench 15
> Time: 0.338
> ./hackbench 15
> Time: 0.347
>
> Average : 0.349
> std. dev. : 0.007
>
> Case 2 : 3/2 nops
>
> ./hackbench 15
> Time: 0.328
> ./hackbench 15
> Time: 0.368
> ./hackbench 15
> Time: 0.351
> ./hackbench 15
> Time: 0.343
> ./hackbench 15
> Time: 0.366
>
> Average : 0.351
> std. dev. : 0.014
>
> Case 3 : jmp 2 bytes
>
> ./hackbench 15
> Time: 0.346
> ./hackbench 15
> Time: 0.359
> ./hackbench 15
> Time: 0.356
> ./hackbench 15
> Time: 0.350
> ./hackbench 15
> Time: 0.340
>
> Average : 0.350
> std. dev. : 0.007
>
> Case 3 : jmp 5 bytes
>
> ./hackbench 15
> Time: 0.346
> ./hackbench 15
> Time: 0.346
> ./hackbench 15
> Time: 0.364
> ./hackbench 15
> Time: 0.362
> ./hackbench 15
> Time: 0.338
>
> Average : 0.351
> std. dev. : 0.010
>
>
> Hardware used :
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
> stepping : 6
> cpu MHz : 2000.114
> cache size : 6144 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 10
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
> constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3
> cx16 xtpr dca sse4_1 lahf_lm
> bogomips : 4000.22
> clflush size : 64
> cache_alignment : 64
> address sizes : 38 bits physical, 48 bits virtual
> power management:
>
> (7 other similar cpus)
>
>
> > Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
> >
> > run 1
> > Time: 4.501
> > run 2
> > Time: 4.855
> > run 3
> > Time: 4.198
> > run 4
> > Time: 4.587
> > run 5
> > Time: 5.016
> > run 6
> > Time: 4.757
> > run 7
> > Time: 4.477
> > run 8
> > Time: 4.693
> > run 9
> > Time: 4.710
> > run 10
> > Time: 4.715
> > avg = 4.6509
> >
> >
> > And 10 runs using the above 5 byte nop:
> >
> > run 1
> > Time: 4.832
> > run 2
> > Time: 5.319
> > run 3
> > Time: 5.213
> > run 4
> > Time: 4.830
> > run 5
> > Time: 4.363
> > run 6
> > Time: 4.391
> > run 7
> > Time: 4.772
> > run 8
> > Time: 4.992
> > run 9
> > Time: 4.727
> > run 10
> > Time: 4.825
> > avg = 4.8264
> >
> > # cat /proc/cpuinfo
> > processor : 0
> > vendor_id : AuthenticAMD
> > cpu family : 15
> > model : 65
> > model name : Dual-Core AMD Opteron(tm) Processor 2220
> > stepping : 3
> > cpu MHz : 2799.992
> > cache size : 1024 KB
> > physical id : 0
> > siblings : 2
> > core id : 0
> > cpu cores : 2
> > apicid : 0
> > initial apicid : 0
> > fdiv_bug : no
> > hlt_bug : no
> > f00f_bug : no
> > coma_bug : no
> > fpu : yes
> > fpu_exception : yes
> > cpuid level : 1
> > wp : yes
> > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> > cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> > rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic
> > cr8_legacy
> > bogomips : 5599.98
> > clflush size : 64
> > power management: ts fid vid ttp tm stc
> >
> > There's 4 of these.
> >
> > Just to make sure, I ran the above nop test again:
> >
> > [ this is reverse from the above runs ]
> >
> > run 1
> > Time: 4.723
> > run 2
> > Time: 5.080
> > run 3
> > Time: 4.521
> > run 4
> > Time: 4.841
> > run 5
> > Time: 4.696
> > run 6
> > Time: 4.946
> > run 7
> > Time: 4.754
> > run 8
> > Time: 4.717
> > run 9
> > Time: 4.905
> > run 10
> > Time: 4.814
> > avg = 4.7997
> >
> > And again the two part nop:
> >
> > run 1
> > Time: 4.434
> > run 2
> > Time: 4.496
> > run 3
> > Time: 4.801
> > run 4
> > Time: 4.714
> > run 5
> > Time: 4.631
> > run 6
> > Time: 5.178
> > run 7
> > Time: 4.728
> > run 8
> > Time: 4.920
> > run 9
> > Time: 4.898
> > run 10
> > Time: 4.770
> > avg = 4.757
> >
> >
> > This time it was close, but still seems to have some difference.
> >
> > heh, perhaps it's just noise.
> >
> > -- Steve
> >
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Steven Rostedt ([email protected]) wrote:
>
> On Fri, 8 Aug 2008, Linus Torvalds wrote:
> >
> >
> > On Fri, 8 Aug 2008, Jeremy Fitzhardinge wrote:
> > >
> > > Steven Rostedt wrote:
> > > > I wish we had a true 5 byte nop.
> > >
> > > 0x66 0x66 0x66 0x66 0x90
> >
> > I don't think so. Multiple redundant prefixes can be really expensive on
> > some uarchs.
> >
> > A no-op that isn't cheap isn't a no-op at all, it's a slow-op.
>
>
> A quick meaningless benchmark showed a slight perfomance hit.
>
Hi Steven,
I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
Intel Pentium 4 boxes to compare a baseline (function doing a bit of
memory read and arithmetic operations) to cases where nops are used.
Here are the results. The kernel module used for the benchmarks is
below, feel free to run it on your own architectures.
Xeon :
NR_TESTS 10000000
test empty cycles : 165472020
test 2-bytes jump cycles : 166666806
test 5-bytes jump cycles : 166978164
test 3/2 nops cycles : 169259406
test 5-bytes nop with long prefix cycles : 160000140
test 5-bytes P6 nop cycles : 163333458
AMD64 :
NR_TESTS 10000000
test empty cycles : 145142367
test 2-bytes jump cycles : 150000178
test 5-bytes jump cycles : 150000171
test 3/2 nops cycles : 159999994
test 5-bytes nop with long prefix cycles : 150000156
test 5-bytes P6 nop cycles : 150000148
Intel Pentium 4 :
NR_TESTS 10000000
test empty cycles : 290001045
test 2-bytes jump cycles : 310000568
test 5-bytes jump cycles : 310000478
test 3/2 nops cycles : 290000565
test 5-bytes nop with long prefix cycles : 311085510
test 5-bytes P6 nop cycles : 300000517
test Generic 1/4 5-bytes nops cycles : 310000553
test K7 1/4 5-bytes nops cycles : 300000533
These numbers show that on both Xeon and AMD64, the
.byte 0x66,0x66,0x66,0x66,0x90
(osp osp osp osp nop, which is not currently used in nops.h)
is the fastest nop.
The 3/2 nops currently used look like a _very_ bad choice for AMD64
cycle-wise.
The 5-byte P6 nop currently used on Xeon also seems to be a bit slower
than the 0x66,0x66,0x66,0x66,0x90 nop.
For the Intel Pentium 4, the best atomic choice seems to be the current
one (5-bytes P6 nop : .byte 0x0f,0x1f,0x44,0x00,0), although we can see
that the 3/2 nop used for K8 would be a bit faster. It is probably due
to the fact that P4 handles long instruction prefixes slowly.
Is there any reason not to use these atomic nops and kill our
instruction atomicity problems altogether?
(various cpuinfo can be found below)
Mathieu
/* test-nop-speed.c
 *
 */
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/marker.h>
#include <asm/ptrace.h>

#define NR_TESTS 10000000

int var, var2;

struct proc_dir_entry *pentry = NULL;

void empty(void)
{
	asm volatile ("");
	var += 50;
	var /= 10;
	var *= var2;
}

void twobytesjump(void)
{
	asm volatile ("jmp 1f\n\t"
		".byte 0x00, 0x00, 0x00\n\t"
		"1:\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void fivebytesjump(void)
{
	asm volatile (".byte 0xe9, 0x00, 0x00, 0x00, 0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void threetwonops(void)
{
	asm volatile (".byte 0x66,0x66,0x90,0x66,0x90\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void fivebytesnop(void)
{
	asm volatile (".byte 0x66,0x66,0x66,0x66,0x90\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void fivebytespsixnop(void)
{
	asm volatile (".byte 0x0f,0x1f,0x44,0x00,0\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

/*
 * GENERIC_NOP1 GENERIC_NOP4,
 * 1: nop
 * _not_ nops in 64-bit mode.
 * 4: leal 0x00(,%esi,1),%esi
 */
void genericfivebytesonefournops(void)
{
	asm volatile (".byte 0x90,0x8d,0x74,0x26,0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

/*
 * K7_NOP4 ASM_NOP1
 * 1: nop
 * assumed _not_ to be nops in 64-bit mode.
 * leal 0x00(,%eax,1),%eax
 */
void k7fivebytesonefournops(void)
{
	asm volatile (".byte 0x90,0x8d,0x44,0x20,0x00\n\t");
	var += 50;
	var /= 10;
	var *= var2;
}

void perform_test(const char *name, void (*callback)(void))
{
	unsigned int i;
	cycles_t cycles1, cycles2;
	unsigned long flags;

	local_irq_save(flags);
	rdtsc_barrier();
	cycles1 = get_cycles();
	rdtsc_barrier();
	for (i = 0; i < NR_TESTS; i++)
		callback();
	rdtsc_barrier();
	cycles2 = get_cycles();
	rdtsc_barrier();
	local_irq_restore(flags);
	printk("test %s cycles : %llu\n", name, cycles2 - cycles1);
}

static int my_open(struct inode *inode, struct file *file)
{
	printk("NR_TESTS %d\n", NR_TESTS);
	perform_test("empty", empty);
	perform_test("2-bytes jump", twobytesjump);
	perform_test("5-bytes jump", fivebytesjump);
	perform_test("3/2 nops", threetwonops);
	perform_test("5-bytes nop with long prefix", fivebytesnop);
	perform_test("5-bytes P6 nop", fivebytespsixnop);
#ifdef CONFIG_X86_32
	perform_test("Generic 1/4 5-bytes nops", genericfivebytesonefournops);
	perform_test("K7 1/4 5-bytes nops", k7fivebytesonefournops);
#endif
	return -EPERM;
}

static struct file_operations my_operations = {
	.open = my_open,
};

int init_module(void)
{
	pentry = create_proc_entry("testnops", 0444, NULL);
	if (pentry)
		pentry->proc_fops = &my_operations;
	return 0;
}

void cleanup_module(void)
{
	remove_proc_entry("testnops", NULL);
}

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("NOP Test");
Xeon cpuinfo :
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.126
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4000.25
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
AMD64 cpuinfo :
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 35
model name : AMD Athlon(tm)64 X2 Dual Core Processor 3800+
stepping : 2
cpu MHz : 2009.139
cache size : 512 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good pni lahf_lm cmp_legacy
bogomips : 4022.42
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
Pentium 4 :
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Pentium(R) 4 CPU 3.00GHz
stepping : 1
cpu MHz : 3000.138
cache size : 1024 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up pebs bts pni monitor ds_cpl cid xtpr
bogomips : 6005.70
clflush size : 64
power management:
> Here's 10 runs of "hackbench 50" using the two part 5 byte nop:
>
> run 1
> Time: 4.501
> run 2
> Time: 4.855
> run 3
> Time: 4.198
> run 4
> Time: 4.587
> run 5
> Time: 5.016
> run 6
> Time: 4.757
> run 7
> Time: 4.477
> run 8
> Time: 4.693
> run 9
> Time: 4.710
> run 10
> Time: 4.715
> avg = 4.6509
>
>
> And 10 runs using the above 5 byte nop:
>
> run 1
> Time: 4.832
> run 2
> Time: 5.319
> run 3
> Time: 5.213
> run 4
> Time: 4.830
> run 5
> Time: 4.363
> run 6
> Time: 4.391
> run 7
> Time: 4.772
> run 8
> Time: 4.992
> run 9
> Time: 4.727
> run 10
> Time: 4.825
> avg = 4.8264
>
> # cat /proc/cpuinfo
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 15
> model : 65
> model name : Dual-Core AMD Opteron(tm) Processor 2220
> stepping : 3
> cpu MHz : 2799.992
> cache size : 1024 KB
> physical id : 0
> siblings : 2
> core id : 0
> cpu cores : 2
> apicid : 0
> initial apicid : 0
> fdiv_bug : no
> hlt_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 1
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt
> rdtscp lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm extapic
> cr8_legacy
> bogomips : 5599.98
> clflush size : 64
> power management: ts fid vid ttp tm stc
>
> There's 4 of these.
>
> Just to make sure, I ran the above nop test again:
>
> [ this is reverse from the above runs ]
>
> run 1
> Time: 4.723
> run 2
> Time: 5.080
> run 3
> Time: 4.521
> run 4
> Time: 4.841
> run 5
> Time: 4.696
> run 6
> Time: 4.946
> run 7
> Time: 4.754
> run 8
> Time: 4.717
> run 9
> Time: 4.905
> run 10
> Time: 4.814
> avg = 4.7997
>
> And again the two part nop:
>
> run 1
> Time: 4.434
> run 2
> Time: 4.496
> run 3
> Time: 4.801
> run 4
> Time: 4.714
> run 5
> Time: 4.631
> run 6
> Time: 5.178
> run 7
> Time: 4.728
> run 8
> Time: 4.920
> run 9
> Time: 4.898
> run 10
> Time: 4.770
> avg = 4.757
>
>
> This time it was close, but still seems to have some difference.
>
> heh, perhaps it's just noise.
>
> -- Steve
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 13 Aug 2008, Mathieu Desnoyers wrote:
>
> I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
> Intel Pentium 4 boxes to compare a baseline
Note that the biggest problems of a jump-based nop are likely to happen
when there are I$ misses and/or when there are other jumps involved. Ie,
some microarchitectures tend to have issues with jumps to jumps, or when
there are multiple control changes in the same (possibly partial)
cacheline because the instruction stream prediction may be predecoded in
the L1 I$, and multiple branches in the same cacheline - or in the same
execution cycle - can pollute that kind of thing.
So microbenchmarking this way will probably make some things look
unrealistically good.
On the P4, the trace cache makes things even more interesting, since it's
another level of I$ entirely, with very different behavior for the hit
case vs the miss case.
And I$ misses for the kernel are actually fairly high. Not in
microbenchmarks that tend to have very repetitive behavior and a small I$
footprint, but in a lot of real-life loads the *bulk* of all action is in
user space, and then the kernel side is often invoked with few loops (the
kernel has very few loops indeed) and a cold I$.
So your numbers are interesting, but it would be really good to also get
some info from Intel/AMD who may know about microarchitectural issues for
the cases that don't show up in the hot-I$-cache environment.
Linus
> So microbenchmarking this way will probably make some things look
> unrealistically good.
We must be careful not to miss the big picture here.
We have two assumptions in this thread:
- Normal alternative() nops are relatively infrequent, typically
at points with enough pipeline bubbles anyway, and it likely doesn't
matter how they are encoded. They also don't have an issue
with multi-part instructions because they're not patched
at runtime, so the best known nop can always be used.
- The one case where nops are very frequent, where they matter, and where
multi-part nops are a problem, is ftrace noping out the call to mcount at
runtime, because that happens on every function entry.
Even there the overhead is not that big, but it is at least measurable
in kernel builds.
Now the numbers have shown that just by not using the frame pointer
(-pg right now implies the frame pointer) you gain more
than what you lose from using non-optimal nops.
So for me the best strategy would be to get rid of the frame pointer
and ignore the nops. This unfortunately would require going away
from -pg and instead post process gcc output to insert "call mcount"
manually. But the nice advantage of that is that you could actually
set up a custom table of callers built in an ELF section and with
that you don't actually need the runtime patching (which is only
done currently because there's no global table of mcount calls),
but could do everything in stop_machine(). Without
runtime patching you also don't need single part nops.
I think that would be the best option. I especially like it because
it would avoid forcing the frame pointer, which seems to be costlier
than any kind of nops.
-Andi
Andi Kleen wrote:
> So for me the best strategy would be to get rid of the frame pointer
> and ignore the nops. This unfortunately would require going away
> from -pg and instead post process gcc output to insert "call mcount"
> manually. But the nice advantage of that is that you could actually
> set up a custom table of callers built in a ELF section and with
> that you don't actually need the runtime patching (which is only
> done currently because there's no global table of mcount calls),
> but could do everything in stop_machine(). Without
> runtime patching you also don't need single part nops.
>
> I think that would be the best option. I especially like it because
> it would prevent forcing frame pointer which seems to be costlier
> than any kinds of nosp.
>
>
How would you deal with inlines? Using debug information?
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
> How would you deal with inlines? Using debug information?
-pg already ignores inlines, so they aren't even traced today.
It pretty much has to: if an inline gets spread out by
the global optimizer over the rest of the function, where would
the mcount calls be inserted?
-Andi
Andi Kleen wrote:
>> How would you deal with inlines? Using debug information?
>>
>
> -pg already ignores inlines, so they aren't even traced today.
>
> It pretty much has to, assume an inline gets spread out by
> the global optimizer over the rest of the function, where would
> the mcount calls be inserted?
>
Good point.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Linus Torvalds ([email protected]) wrote:
>
>
> On Wed, 13 Aug 2008, Mathieu Desnoyers wrote:
> >
> > I also did some microbenchmarks on my Intel Xeon 64 bits, AMD64 and
> > Intel Pentium 4 boxes to compare a baseline
>
> Note that the biggest problems of a jump-based nop are likely to happen
> when there are I$ misses and/or when there are other jumps involved. Ie a
> some microarchitectures tend to have issues with jumps to jumps, or when
> there are multiple control changes in the same (possibly partial)
> cacheline because the instruction stream prediction may be predecoded in
> the L1 I$, and multiple branches in the same cacheline - or in the same
> execution cycle - can pollute that kind of thing.
>
Yup, I agree. Actually, the tests I ran show that using jumps as nops
does not seem to be the best solution, even cycle-wise.
> So microbenchmarking this way will probably make some things look
> unrealistically good.
>
Yes, I am aware of these "high locality" effects. I use these tests as a
starting point to find out which nops are good candidates, and then it
can be later validated with more thorough testing on real workloads,
which will suffer from higher standard deviation.
Interestingly enough, the P6_NOPS seems to be a poor choice both at the
macro and micro levels for the Intel Xeon (referring to
http://lkml.org/lkml/2008/8/13/253 for the macro-benchmarks).
> On the P4, the trace cache makes things even more interesting, since it's
> another level of I$ entirely, with very different behavior for the hit
> case vs the miss case.
As long as the whole kernel agrees on which instructions should be used
for frequently used nops, the instruction trace cache should behave
properly.
>
> And I$ misses for the kernel are actually fairly high. Not in
> microbenchmarks that tend to have very repetive behavior and a small I$
> footprint, but in a lot of real-life loads the *bulk* of all action is in
> user space, and then the kernel side is often invoced with few loops (the
> kernel has very few loops indeed) and a cold I$.
I assume the effect of an I$ miss to be the same for all the tested
scenarios (except on P4, and maybe except for the jump cases), given
that in each case we load 5 bytes worth of instructions. Even
considering this, the results I get show that the choices made in the
current kernel might not be the best ones.
>
> So your numbers are interesting, but it would be really good to also get
> some info from Intel/AMD who may know about microarchitectural issues for
> the cases that don't show up in the hot-I$-cache environment.
>
Yep. I think it may make a difference if we use jumps, but I doubt it
will change anything for the various other nops. Still, having that
information would be good.
Some more numbers follow for older architectures.
Intel Pentium 3, 550MHz
NR_TESTS 10000000
test empty cycles : 510000254
test 2-bytes jump cycles : 510000077
test 5-bytes jump cycles : 510000101
test 3/2 nops cycles : 500000072
test 5-bytes nop with long prefix cycles : 500000107
test 5-bytes P6 nop cycles : 500000069 (current choice ok)
test Generic 1/4 5-bytes nops cycles : 514687590
test K7 1/4 5-bytes nops cycles : 530000012
Intel Pentium 3, 933MHz
NR_TESTS 10000000
test empty cycles : 510000565
test 2-bytes jump cycles : 510000133
test 5-bytes jump cycles : 510000363
test 3/2 nops cycles : 500000358
test 5-bytes nop with long prefix cycles : 500000331
test 5-bytes P6 nop cycles : 500000625 (current choice ok)
test Generic 1/4 5-bytes nops cycles : 514687797
test K7 1/4 5-bytes nops cycles : 530000273
Intel Pentium M, 2GHz
NR_TESTS 10000000
test empty cycles : 180000515
test 2-bytes jump cycles : 180000386 (would be the best)
test 5-bytes jump cycles : 205000435
test 3/2 nops cycles : 193333517
test 5-bytes nop with long prefix cycles : 205000167
test 5-bytes P6 nop cycles : 205937652
test Generic 1/4 5-bytes nops cycles : 187500174
test K7 1/4 5-bytes nops cycles : 193750161
Intel Pentium 3, 550MHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 7
model name : Pentium III (Katmai)
stepping : 3
cpu MHz : 551.295
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1103.44
clflush size : 32
Intel Pentium 3, 933MHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 6
cpu MHz : 933.134
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr sse
bogomips : 1868.22
clflush size : 32
Intel Pentium M, 2GHz
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 2.00GHz
stepping : 8
cpu MHz : 2000.000
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe nx bts est tm2
bogomips : 3994.64
clflush size : 64
Mathieu
> Linus
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Andi Kleen ([email protected]) wrote:
> > So microbenchmarking this way will probably make some things look
> > unrealistically good.
>
> Must be careful to miss the big picture here.
>
> We have two assumptions here in this thread:
>
> - Normal alternative() nops are relatively infrequent, typically
> in points with enough pipeline bubbles anyways, and it likely doesn't
> matter how they are encode. And also they don't have an issue
> with mult part instructions anyways because they're not patched
> at runtime, so always the best known can be used.
>
> - The one case where nops are very frequent and matter and multipart
> is a problem is with ftrace noping out the call to mcount at runtime
> because that happens on every function entry.
> Even there the overhead is not that big, but at least measurable
> in kernel builds.
>
> Now the numbers have shown that just by not using frame pointer (
> -pg right now implies frame pointer) you can get more benefit
> than what you lose from using non optimal nops.
>
> So for me the best strategy would be to get rid of the frame pointer
> and ignore the nops. This unfortunately would require going away
> from -pg and instead post process gcc output to insert "call mcount"
> manually. But the nice advantage of that is that you could actually
> set up a custom table of callers built in a ELF section and with
> that you don't actually need the runtime patching (which is only
> done currently because there's no global table of mcount calls),
> but could do everything in stop_machine(). Without
> runtime patching you also don't need single part nops.
>
I agree that if the frame pointer brings too much overhead, it should not
be used.
Sorry to ask, I feel I must be missing something, but I'm trying to
figure out where you propose to add the "call mcount": in the caller or
in the callee?
In the caller, I guess it would replace the normal function call and call
a trampoline which would then jump to the normal code.
In the callee, as is currently done with -pg, the callee would have
a call to mcount at the beginning of the function.
Or is it a different scheme I don't see? I am trying to figure out how
you would do all that without dynamic code modification and still not
hurt performance.
Mathieu
> I think that would be the best option. I especially like it because
> it would prevent forcing frame pointer which seems to be costlier
> than any kinds of nosp.
>
> -Andi
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
> Sorry to ask, I feel I must be missing something, but I'm trying to
> figure out where you propose to add the "call mcount" ? In the caller or
> in the callee ?
In the callee, like gcc does. The caller would likely be more bloated
because there are more calls than functions. Also, if it was done at the
caller, more code would be needed because the function currently executed
couldn't be gotten from the stack directly.
> Or is it a different scheme I don't see ? I am trying to figure out how
> you happen to do all that without dynamic code modification and manage
> not to hurt performance.
The dynamic code modification is only needed because there is no
global table of the mcount call sites. So instead it discovers
them at runtime, but that requires runtime-safe patching.
With a custom call scheme one could just build up a table of
call sites at link time using an ELF section and then when
tracing is enabled/disabled always patch them all in one go
in a stop_machine(). Then you wouldn't need parallel execution safe
patching anymore and it doesn't matter what the nops look like.
The other advantage is that it would allow getting rid of
the frame pointer.
-Andi
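To make the shape of this concrete, here is a rough sketch of the table-plus-stop_machine() scheme Andi describes above. This is my own illustration, not code from the thread: the __start_mcount_loc/__stop_mcount_loc symbols, the patch_site() helper and the stop_machine() call signature are all assumptions about how such a linker-built table could be wired up.

#include <linux/stop_machine.h>
#include <linux/types.h>

/* Assumed to be provided by the linker script around a custom ELF section
 * that collects the call sites. */
extern unsigned long __start_mcount_loc[];
extern unsigned long __stop_mcount_loc[];

/* Placeholder: a real version would rewrite the 5 bytes at 'ip' to either
 * the call instruction or the chosen 5-byte nop. */
static void patch_site(unsigned long ip, bool enable)
{
}

static int patch_all_sites(void *data)
{
	bool enable = (bool)(long)data;
	unsigned long *p;

	/* All other CPUs are held by stop_machine() while the sites are
	 * rewritten, so every site is patched in one go. */
	for (p = __start_mcount_loc; p < __stop_mcount_loc; p++)
		patch_site(*p, enable);
	return 0;
}

static void set_function_tracing(bool enable)
{
	stop_machine(patch_all_sites, (void *)(long)enable, NULL);
}

Note that, as Mathieu points out further down in the thread, stopping the machine alone does not remove the preempted-thread return-IP concern when the nop is built from more than one instruction.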
[
Thanks to Mathieu Desnoyers who forwarded this to me. Currently my ISP
for goodmis.org is having issues:
https://help.domaindirect.com/index.php?_m=news&_a=viewnews&newsid=104
]
> ----- Forwarded message from Andi Kleen <[email protected]> -----
>
>
>> So microbenchmarking this way will probably make some things look
>> unrealistically good.
>>
>
> Must be careful to miss the big picture here.
>
> We have two assumptions here in this thread:
>
> - Normal alternative() nops are relatively infrequent, typically
> in points with enough pipeline bubbles anyways, and it likely doesn't
> matter how they are encode. And also they don't have an issue
> with mult part instructions anyways because they're not patched
> at runtime, so always the best known can be used.
>
> - The one case where nops are very frequent and matter and multipart
> is a problem is with ftrace noping out the call to mcount at runtime
> because that happens on every function entry.
> Even there the overhead is not that big, but at least measurable
> in kernel builds.
>
The problem is not ftrace noping out the call at runtime. The problem is
ftrace changing the nops back to calls to mcount.
The nop part is simple, straightforward, and not the issue we are
discussing here. The issue is which kind of nop to use. The bug with the
multi-part nop happens when we _enable_ tracing. That is, when someone
runs the tracer. The issue with the multi-part nop is that a task could
have been preempted after it executed the first nop and before the
second part. Then we enable tracing, and when the task is scheduled back
in, it now will execute half the call to the mcount function.
I want this point very clear. If you never run tracing, this bug will
not happen. And the bug only happens on enabling the tracer, not on the
disabling part. Not to mention that the bug itself will only happen 1 in
a billion.
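To make the hazard concrete, here is a small user-space illustration (my own sketch, not code from this thread). The byte values are the 3/2 nop and the call opcode discussed above; the call's displacement bytes are left as zero, and the offset models where a preempted task's return IP could point.

#include <stdio.h>

int main(void)
{
	/* K8_NOP3 followed by K8_NOP2: two instructions, 3 + 2 bytes. */
	unsigned char before[5] = { 0x66, 0x66, 0x90, 0x66, 0x90 };
	/* e8 xx xx xx xx: a single 5-byte call (displacement left as zero). */
	unsigned char after[5]  = { 0xe8, 0x00, 0x00, 0x00, 0x00 };
	/* A task preempted between the two nops resumes at offset 3. */
	unsigned int resume = 3;

	/* Before patching, offset 3 is an instruction boundary... */
	printf("before: byte at resume point is 0x%02x (start of K8_NOP2)\n",
	       before[resume]);
	/* ...after patching, it is the middle of the call's displacement. */
	printf("after:  byte at resume point is 0x%02x (inside the call)\n",
	       after[resume]);
	return 0;
}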
> Now the numbers have shown that just by not using frame pointer (
> -pg right now implies frame pointer) you can get more benefit
> than what you lose from using non optimal nops.
>
No, I can easily make a patch that does not use frame pointers but still
uses -pg. We just can not print the parent function in the trace. This
can easily be added to a config, as well as easily implemented.
> So for me the best strategy would be to get rid of the frame pointer
> and ignore the nops. This unfortunately would require going away
> from -pg and instead post process gcc output to insert "call mcount"
> manually. But the nice advantage of that is that you could actually
> set up a custom table of callers built in a ELF section and with
> that you don't actually need the runtime patching (which is only
> done currently because there's no global table of mcount calls),
> but could do everything in stop_machine(). Without
> runtime patching you also don't need single part nops.
>
>
I'm totally confused here. How do you enable function tracing? How do
we make a call to the code that will record that a function was hit?
> I think that would be the best option. I especially like it because
> it would prevent forcing frame pointer which seems to be costlier
> than any kinds of nosp.
As I stated, the frame pointer part is only to record the parent
function in tracing. ie:
ls-4866 [00] 177596.041275: _spin_unlock <-journal_stop
Here we see that the function _spin_unlock was called by the function
journal_stop. We can easily turn off parent tracing now, with:
# echo noprint-parent > /debug/tracing/iter_ctrl
which gives us just:
ls-4866 [00] 177596.041275: _spin_unlock
If we disable frame pointers, the noprint-parent option would be forced.
Not that devastating, and it still gives the user function tracing
without requiring frame pointers.
I would still require that the irqsoff tracer add frame pointers, just
because knowing that the long latency of interrupts disabled happened at
local_irq_save doesn't cut it ;-)
Anyway, who would want to run with frame pointers disabled? If you ever
get a bug crash, the stack trace is pretty much useless.
-- Steve
* Andi Kleen ([email protected]) wrote:
> > Sorry to ask, I feel I must be missing something, but I'm trying to
> > figure out where you propose to add the "call mcount" ? In the caller or
> > in the callee ?
>
> callee like gcc. caller would be likely more bloated because
> there are more calls than functions. Also if it was at the
> callee more code would be needed because the function currently
> executed couldn't be gotten from stack directly.
>
> > Or is it a different scheme I don't see ? I am trying to figure out how
> > you happen to do all that without dynamic code modification and manage
> > not to hurt performance.
>
> The dynamic code modification is only needed because there is no
> global table of the mcount call sites. So instead it discovers
> them at runtime, but that requires runtime save patching
>
> With a custom call scheme one could just build up a table of
> call sites at link time using an ELF section and then when
> tracing is enabled/disabled always patch them all in one go
> in a stop_machine(). Then you wouldn't need parallel execution safe
> patching anymore and it doesn't matter what the nops look like.
>
I agree that the custom call scheme could let you know the mcount call
site addresses at link time, so you could replace the call instructions
with nops (at link time, so you actually don't know much about the
exact hardware the kernel will be running on, which makes it harder to
choose the best nop). To me, it seems that doing this at link time,
as you propose, is the best approach, as it won't impact the system
bootup time as much as the current ftrace scheme.
However, I disagree with you on one point: if you use nops which are
made of multiple instructions smaller than 5 bytes, enabling the tracer
(patching all the sites in a stop_machine()) still presents the risk of
having a preempted thread with a return IP pointing directly into the
middle of what will become a 5-byte call instruction. When the thread
is scheduled again after the stop_machine, an illegal instruction
fault (or any random effect) will occur.
Therefore, building a table of mcount call sites in an ELF section, and
declaring a _single_ 5-byte nop instruction in the instruction stream
that would fit all target architectures in lieu of the mcount call, so
that it can later be patched into the 5-byte call at runtime, seems like
a good way to go.
Mathieu
P.S. : It would be good to have a look at the alternative.c lock prefix
vs preemption race I identified a few weeks ago. Actually, this
currently existing cpu hotplug bug is related to the preemption issue I
just explained here. ref. http://lkml.org/lkml/2008/7/30/265,
especially:
"As a general rule, never try to combine smaller instructions into a
bigger one, except in the case of adding a lock-prefix to an instruction :
this case insures that the non-lock prefixed instruction is still
valid after the change has been done. We could however run into a nasty
non-synchronized atomic instruction use in SMP mode if a thread happens
to be scheduled out right after the lock prefix. Hopefully the
alternative code uses the refrigerator... (hrm, it doesn't).
Actually, alternative.c lock-prefix modification is O.K. for spinlocks
because they execute with preemption off, but not for other atomic
operations which may execute with preemption on."
> The other advantage is that it would allow getting rid of
> the frame pointer.
>
> -Andi
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Steven Rostedt wrote:
> No, I can easily make a patch that does not use frame pointers but
> still uses -pg. We just can not print the parent function in the
> trace. This can easily be added to a config, as well as easily
> implemented.
Why? You can always get the calling function, because its return
address is on the stack (assuming mcount is called before the function
puts its own frame on the stack). But without a frame pointer, you
can't necessarily get the caller's caller.
But I think Andi's point is that gcc forces frame pointers on when you
enable mcount, so there's no choice in the matter.
J
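Jeremy's point can be shown with a small user-space sketch (my own illustration; my_mcount and traced_function are hypothetical names, and it should be built with -fno-omit-frame-pointer for __builtin_return_address(1) to be reliable). From inside the mcount-like hook, level 0 of the return-address chain identifies the traced function and level 1 identifies its caller, the "parent" that ftrace prints.

#include <stdio.h>

static void __attribute__((noinline)) my_mcount(void)
{
	/* Points into the function that called us, i.e. the traced function. */
	void *traced = __builtin_return_address(0);
	/* Points into the traced function's caller, i.e. the parent. */
	void *parent = __builtin_return_address(1);

	printf("traced function near %p, parent near %p\n", traced, parent);
}

static void __attribute__((noinline)) traced_function(void)
{
	my_mcount();	/* what -pg, or a post-processing pass, would insert */
	/* ... the function's real work would follow ... */
}

int main(void)
{
	traced_function();
	return 0;
}

If the hook call is emitted before the traced function touches its stack frame, the parent's return address is also available at a fixed offset from the stack pointer even without frame pointers, which is what Andi describes below as 4(%esp)/8(%rsp).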
On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote:
> >Now the numbers have shown that just by not using frame pointer (
> >-pg right now implies frame pointer) you can get more benefit
> >than what you lose from using non optimal nops.
> >
>
> No, I can easily make a patch that does not use frame pointers but still
Not without patching gcc. Try it. The patch is not very difficult and I did
it here, but it needs a patch.
> If we disable frame pointers, the noprint-parent option would be forced.
Actually you can get the parent without a frame pointer if you just
force gcc to emit mcount before touching the stack frame (and a manual
insertion pass would do that). Then the parent is at 4(%esp)/8(%rsp).
Again, teaching gcc that is not very difficult, but it needs a patch.
> I would still require that the irqsoff tracer add frame pointers, just
> because knowing that the long latency of interrupts disabled happened at
> local_irq_save doesn't cut it ;-)
Nope.
>
> Anyway, who would want to run with frame pointers disabled? If you ever
> get a bug crash, the stack trace is pretty much useless.
First, that's not true (remember most production kernels run
without frame pointers, and e.g. crash or systemtap know how to do proper
unwinding without slow frame pointers), and if you want it at runtime you
can always add the dwarf2 unwinder (like the opensuse kernel does) and get
better backtraces than you could ever get with frame pointers (that is
because e.g. most assembler code doesn't even bother to set up frame
pointers, but it is all dwarf2 annotated).
Also, I must say the whole ftrace noping exercise is pretty pointless without
avoiding frame pointers, because it saves less than what you lose
unconditionally from the "select FRAME_POINTER".
-Andi
On Wed, 13 Aug 2008, Andi Kleen wrote:
>
> Also I must say the whole ftrace noping exercise is pretty pointless without
> avoiding frame pointers because it does save less than what you lose
> unconditionally from the "select FRAME_POINTER"
Andi, you seem to have missed the whole point. This is a _correctness_
issue as long as the nop is not a single instruction. And the workaround
for that is uglier than just making a single-instruction nop.
So the question now is to find a good nop that _is_ a single atomic
instruction. Your blathering about frame pointers is missing the whole
point!
Linus
Andi Kleen wrote:
> On Wed, Aug 13, 2008 at 04:00:37PM -0400, Steven Rostedt wrote:
>
>>> Now the numbers have shown that just by not using frame pointer (
>>> -pg right now implies frame pointer) you can get more benefit
>>> than what you lose from using non optimal nops.
>>>
>>>
>> No, I can easily make a patch that does not use frame pointers but still
>>
>
> Not without patching gcc. Try it. The patch is not very difficult and i did
> it here, but it needs a patch.
>
OK, I admit you are right ;-)
I got the error message:
gcc: -pg and -fomit-frame-pointer are incompatible
-- Steve
Just a curious run of Mathieu's micro benchmark:
NR_TESTS 10000000
test empty cycles : 182500444
test 2-bytes jump cycles : 195969127
test 5-bytes jump cycles : 197000202
test 3/2 nops cycles : 201333408
test 5-bytes nop with long prefix cycles : 205000067
test 5-bytes P6 nop cycles : 205000227
test Generic 1/4 5-bytes nops cycles : 200000077
test K7 1/4 5-bytes nops cycles : 197549045
And this was on a Pentium III 847.461 MHz box (my old toshiba laptop)
The jumps here played the best, but that could just be cache issues. But
it is interesting to see that, of the nops, the K7 1/4 fared the best.
-- Steve
If a kernel thread is preempted in single-cpu mode right after the NOP (a nop
about to be turned into a lock prefix), then we hotplug a CPU, and then the
thread is scheduled back again, an SMP-unsafe atomic operation will be used on
shared SMP variables, leading to corruption. No corruption would happen in the
reverse case: going from SMP to UP is ok because we split a big instruction
into tiny pieces, which does not present this condition.
Changing the 0x90 (single-byte nop) currently used into a 0x3E DS segment
override prefix should fix this issue. Since the default of the atomic
instructions is to use the DS segment anyway, it should not affect the
behavior.
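For illustration, here is a minimal user-space sketch of the three encodings being compared (my own, not part of the patch; the helper names are made up and the inline asm assumes GCC on x86). The 0x3E byte is emitted immediately in front of the xadd, exactly where the patched-in prefix would sit in the kernel text.

#include <stdio.h>

static int shared;

/* Current UP form: a free-standing 1-byte nop in front of the xadd. */
static void nop_xadd(int v)
{
	asm volatile(".byte 0x90\n\t"
		     "xaddl %0, %1"
		     : "+r" (v), "+m" (shared));
}

/* Proposed UP form: DS segment override prefix folded onto the xadd. */
static void ds_xadd(int v)
{
	asm volatile(".byte 0x3e\n\t"
		     "xaddl %0, %1"
		     : "+r" (v), "+m" (shared));
}

/* SMP form after alternatives patching: lock-prefixed xadd. */
static void lock_xadd(int v)
{
	asm volatile("lock; xaddl %0, %1"
		     : "+r" (v), "+m" (shared));
}

int main(void)
{
	nop_xadd(1);
	ds_xadd(1);
	lock_xadd(1);
	printf("shared = %d\n", shared);	/* expect 3 */
	return 0;
}

With the prefix in place, the sequence stays a single instruction at all times, so there is no point at which a preempted thread can sit between the nop and the atomic operation.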
** BIG FAT WARNING (tm) **: this patch assumes that the 0x3E prefix will
leave atomic operations as-is (thus assuming they normally touch data in the DS
segment). Is there an obnoxious corner case I might have missed?
This source :
AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System
Instructions
States
"Instructions that Reference a Non-Stack Segment—If an instruction encoding
references any base register other than rBP or rSP, or if an instruction
contains an immediate offset, the default segment is the data segment (DS).
These instructions can use the segment-override prefix to select one of the
non-default segments, as shown in Table 1-5."
Therefore, forcing the DS segment on the atomic operations, which already use
the DS segment, should change nothing.
This source :
http://wiki.osdev.org/X86_Instruction_Encoding
States
"In 64-bit the CS, SS, DS and ES segment overrides are ignored."
(since it's a wiki, it would have to be confirmed, but at least the worst that
happens is that the DS override is ignored)
This patch applies to 2.6.27-rc2, but would also have to be applied to earlier
kernels (2.6.26, 2.6.25, ...).
Performance impact of the fix: tests done on "xaddq" and "xaddl" show it
actually improves performance on Intel Xeon, AMD64 and Pentium M. It does not
change the performance on Pentium 3 and Pentium 4.
Xeon E5405 2.0GHz :
NR_TESTS 10000000
test empty cycles : 162207948
test test 1-byte nop xadd cycles : 170755422
test test DS override prefix xadd cycles : 170000118 *
test test LOCK xadd cycles : 472012134
AMD64 2.0GHz :
NR_TESTS 10000000
test empty cycles : 146674549
test test 1-byte nop xadd cycles : 150273860
test test DS override prefix xadd cycles : 149982382 *
test test LOCK xadd cycles : 270000690
Pentium 4 3.0GHz
NR_TESTS 10000000
test empty cycles : 290001195
test test 1-byte nop xadd cycles : 310000560
test test DS override prefix xadd cycles : 310000575 *
test test LOCK xadd cycles : 1050103740
Pentium M 2.0GHz
NR_TESTS 10000000
test empty cycles : 180000523
test test 1-byte nop xadd cycles : 320000345
test test DS override prefix xadd cycles : 310000374 *
test test LOCK xadd cycles : 480000357
Pentium 3 550MHz
NR_TESTS 10000000
test empty cycles : 510000231
test test 1-byte nop xadd cycles : 620000128
test test DS override prefix xadd cycles : 620000110 *
test test LOCK xadd cycles : 800000088
Speed test modules can be found at
http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-prefix-speed-32.c
http://ltt.polymtl.ca/svn/trunk/tests/kernel/test-prefix-speed.c
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Linus Torvalds <[email protected]>
Cc: Steven Rostedt <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andrew Morton <[email protected]>
CC: David Miller <[email protected]>
CC: Roland McGrath <[email protected]>
CC: Ulrich Drepper <[email protected]>
CC: Rusty Russell <[email protected]>
CC: Gregory Haskins <[email protected]>
CC: Arnaldo Carvalho de Melo <[email protected]>
CC: "Luis Claudio R. Goncalves" <[email protected]>
CC: Clark Williams <[email protected]>
CC: Christoph Lameter <[email protected]>
---
arch/x86/kernel/alternative.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
Index: linux-2.6-lttng/arch/x86/kernel/alternative.c
===================================================================
--- linux-2.6-lttng.orig/arch/x86/kernel/alternative.c 2008-08-13 18:42:47.000000000 -0400
+++ linux-2.6-lttng/arch/x86/kernel/alternative.c 2008-08-13 18:54:36.000000000 -0400
@@ -241,25 +241,25 @@ static void alternatives_smp_lock(u8 **s
continue;
if (*ptr > text_end)
continue;
- text_poke(*ptr, ((unsigned char []){0xf0}), 1); /* add lock prefix */
+ /* turn DS segment override prefix into lock prefix */
+ text_poke(*ptr, ((unsigned char []){0xf0}), 1);
};
}
static void alternatives_smp_unlock(u8 **start, u8 **end, u8 *text, u8 *text_end)
{
u8 **ptr;
- char insn[1];
if (noreplace_smp)
return;
- add_nops(insn, 1);
for (ptr = start; ptr < end; ptr++) {
if (*ptr < text)
continue;
if (*ptr > text_end)
continue;
- text_poke(*ptr, insn, 1);
+ /* turn lock prefix into DS segment override prefix */
+ text_poke(*ptr, ((unsigned char []){0x3E}), 1);
};
}
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> If a kernel thread is preempted in single-cpu mode right after the NOP (nop
> about to be turned into a lock prefix), then we CPU hotplug a CPU, and then the
> thread is scheduled back again, a SMP-unsafe atomic operation will be used on
> shared SMP variables, leading to corruption. No corruption would happen in the
> reverse case : going from SMP to UP is ok because we split a bit instruction
> into tiny pieces, which does not present this condition.
>
> Changing the 0x90 (single-byte nop) currently used into a 0x3E DS segment
> override prefix should fix this issue. Since the default of the atomic
> instructions is to use the DS segment anyway, it should not affect the
> behavior.
I believe this should be okay. In 32-bit mode some of the security and
hypervisor frameworks want to set segment limits, but I don't believe
they ever would set DS and SS inconsistently, or that we'd handle a #GP
versus an #SS differently (segment violations on the stack segment are
#SS, not #GP.) To be 100% sure we'd have to pick apart the modr/m byte
to figure out what the base register is but I think that's total overkill.
I have a vague notion that DS: prefixes came with a penalty on older
CPUs, so we may want to do this only when CPU hotplug is enabled, to
avoid penalizing older embedded systems.
-hpa
* H. Peter Anvin ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> If a kernel thread is preempted in single-cpu mode right after the NOP
>> (nop
>> about to be turned into a lock prefix), then we CPU hotplug a CPU, and
>> then the
>> thread is scheduled back again, a SMP-unsafe atomic operation will be used
>> on
>> shared SMP variables, leading to corruption. No corruption would happen in
>> the
>> reverse case : going from SMP to UP is ok because we split a bit
>> instruction
>> into tiny pieces, which does not present this condition.
>> Changing the 0x90 (single-byte nop) currently used into a 0x3E DS segment
>> override prefix should fix this issue. Since the default of the atomic
>> instructions is to use the DS segment anyway, it should not affect the
>> behavior.
>
> I believe this should be okay. In 32-bit mode some of the security and
> hypervisor frameworks want to set segment limits, but I don't believe they
> ever would set DS and SS inconsistently, or that we'd handle a #GP versus
> an #SS differently (segment violations on the stack segment are #SS, not
> #GP.) To be 100% sure we'd have to pick apart the modr/m byte to figure
> out what the base register is but I think that's total overkill.
>
I guess some testing of this patch under a virtualized Linux would not
hurt. Anyone have a setup ready ? The test case is simple : run a kernel
on a multi-CPU virtual guest, then offline every CPU except cpu0 :
export NR_CPUS=...
for a in `seq 1 $((NR_CPUS - 1))`; do echo 0 > /sys/devices/system/cpu/cpu$a/online; done
> I have a vague notion that DS: prefixes came with a penalty on older CPUs,
> so we may want to do this only when CPU hotplug is enabled, to avoid
> penalizing older embedded systems.
>
> -hpa
Reading the "Intel Architecture Optimization Manual" for older Intels :
http://developer.intel.com/design/pentium/manuals/242816.htm
Chapter 3.7 Prefixed Opcodes
The penalty for instructions carrying prefixes other than 0x0f,
0x66 or 0x67 seems to be 1 added clock cycle to detect the prefix when
the instruction cannot be paired.
Since we are choosing between the existing 0x90 nop followed by the
atomic instruction and this prefix applied to the atomic instruction,
we have to consider the penalty cost of this nop. From the same manual,
the NOP is decoded into 1 micro-op.
Unless these architectures (386SX/DX, 486, Pentium Pro, Pentium MMX,
Pentium II) can execute more than 1 micro-op per cycle, I doubt the DS
prefix would cause any degradation compared to the 0x90 nop. And this
would free the lower stages of the pipeline from executing this NOP
micro-op.
I guess some quick performance tests with the modules I provide on my
website (URL in the patch header) could confirm or refute this.
Actually, I just dusted off an old Pentium II; here are the
results. There is no performance overhead or degradation.
NR_TESTS 10000000
test empty cycles : 200833494
test test 1-byte nop xadd cycles : 340000130
test test DS override prefix xadd cycles : 340000126 *
test test LOCK xadd cycles : 530000078
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 350.867
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 mmx fxsr
bogomips : 690.17
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
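[ For readers who want to reproduce this kind of cycle count, here is a
rough stand-alone sketch of the measurement; it is not Mathieu's actual
test module (that one is linked from the patch header) and the names are
made up. It times a loop of DS-prefixed xadds against lock-prefixed ones
with rdtsc; the usual caveats about out-of-order execution and frequency
scaling apply. ]

#include <stdio.h>
#include <stdint.h>

#define NR_TESTS 10000000UL

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline void xadd_ds(int *v, int inc)	/* 0x3e prefix */
{
	asm volatile(".byte 0x3e\n\txaddl %0, %1"
		     : "+r" (inc), "+m" (*v) : : "memory");
}

static inline void xadd_lock(int *v, int inc)	/* 0xf0 prefix */
{
	asm volatile("lock; xaddl %0, %1"
		     : "+r" (inc), "+m" (*v) : : "memory");
}

int main(void)
{
	int v = 0;
	uint64_t t0, t1, t2;
	unsigned long i;

	t0 = rdtsc();
	for (i = 0; i < NR_TESTS; i++)
		xadd_ds(&v, 1);
	t1 = rdtsc();
	for (i = 0; i < NR_TESTS; i++)
		xadd_lock(&v, 1);
	t2 = rdtsc();

	printf("DS override prefix xadd cycles : %llu\n",
	       (unsigned long long)(t1 - t0));
	printf("LOCK xadd cycles               : %llu\n",
	       (unsigned long long)(t2 - t1));
	return 0;
}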
H. Peter Anvin wrote:
> I believe this should be okay. In 32-bit mode some of the security
> and hypervisor frameworks want to set segment limits, but I don't
> believe they ever would set DS and SS inconsistently, or that we'd
> handle a #GP versus an #SS differently (segment violations on the
> stack segment are #SS, not #GP.) To be 100% sure we'd have to pick
> apart the modr/m byte to figure out what the base register is but I
> think that's total overkill.
The kernel sets ds and ss to the same selector, so they're always going
to have the same underlying descriptor.
My only concern is whether there are any locked instructions which are
explicitly using a cs: override for those odd corners of the kernel. I
don't think so.
That said, I wonder how useful it is to do the SMP->UP code transition.
How often does a kernel go from being SMP to UP in a situation where we
really care about the performance? And that won't shortly be
becoming SMP again anyway?
J
> That said, I wonder how useful it is to do the SMP->UP code transition.
> How often does a kernel go from being SMP to UP in a situation where we
> really care about the performance? And that won't be shortly be
> becoming SMP again anyway?
I agree; it only seems worthwhile to do this substitution when it is
determined at boot, from the config+hardware, that nr_cpus_possible==1
(i.e. no CPU hotplug possibility).
Thanks,
Roland
* Jeremy Fitzhardinge ([email protected]) wrote:
> H. Peter Anvin wrote:
>> I believe this should be okay. In 32-bit mode some of the security and
>> hypervisor frameworks want to set segment limits, but I don't believe they
>> ever would set DS and SS inconsistently, or that we'd handle a #GP versus
>> an #SS differently (segment violations on the stack segment are #SS, not
>> #GP.) To be 100% sure we'd have to pick apart the modr/m byte to figure
>> out what the base register is but I think that's total overkill.
>
> The kernel sets ds and ss to the same selector, so they're always going to
> have the same underlying descriptor.
>
> My only concern is whether there are any locked instructions which are
> explicitly using a cs: override for those odd corners of the kernel. I
> don't think so.
>
> That said, I wonder how useful it is to do the SMP->UP code transition.
> How often does a kernel go from being SMP to UP in a situation where we
> really care about the performance? And that won't be shortly be becoming
> SMP again anyway?
>
A virtualized guest kernel could use that to limit its use of the
overall machine CPU resources in different time periods. Policies can
determine how many physical CPUs a virtual machine can be tied to, and
that may change depending on e.g. the workload or time of day. Having
the ability to efficiently switch to UP for a long period of time seems
important in this use-case.
Mathieu
> J
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> A virtualized guest kernel could use that to limit its use of the
> overall machine CPU resources in different time periods. Policies can
> determine how many physical CPU a virtual machine can be tied to, and
> that may change depending on e.g. the workload or time of day. Having
> the ability to efficiently switch to UP for a long period of time seems
> important in this use-case.
Not very convinced. Unplugging cpus is a pretty coarse way to control
resource use. There are lots of other mechanisms. And it's not like an
uncontended lock is all that expensive these days...
J
* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> A virtualized guest kernel could use that to limit its use of the
>> overall machine CPU resources in different time periods. Policies can
>> determine how many physical CPU a virtual machine can be tied to, and
>> that may change depending on e.g. the workload or time of day. Having
>> the ability to efficiently switch to UP for a long period of time seems
>> important in this use-case.
>
> Not very convinced. Unplugging cpus is a pretty coarse way to control
> resource use. There are lots of other mechanisms. And it's not like an
> uncontended lock is all that expensive these days...
>
> J
>
I can't argue about the benefit of using VM CPU pinning to manage
resources because I don't use it myself, but I ran some tests out of
curiosity to find if uncontended locks were that cheap, and it turns out
they aren't. Here are the results :
Xeon 2.0GHz
Summary

                   no lock prefix (s)    with lock prefix (s)    Speedup
make -j1 kernel/   33.94 +/- 0.07        34.91 +/- 0.27          2.8 %
hackbench 50       2.99 +/- 0.01         3.74 +/- 0.01           25.1 %
Detail :
1 CPU, replace smp lock prefixes with DS segment selector prefixes
make -j1 kernel/
real 0m34.067s
user 0m30.630s
sys 0m2.980s
real 0m33.867s
user 0m30.582s
sys 0m3.024s
real 0m33.939s
user 0m30.738s
sys 0m2.876s
real 0m33.913s
user 0m30.806s
sys 0m2.808s
avg : 33.94s
std. dev. : 0.07s
hackbench 50
Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982
avg : 2.99
std. dev. : 0.01
1 CPU, noreplace-smp
make -j1 kernel/
real 0m35.326s
user 0m30.630s
sys 0m3.260s
real 0m34.325s
user 0m30.802s
sys 0m3.084s
real 0m35.568s
user 0m30.722s
sys 0m3.168s
real 0m34.435s
user 0m30.886s
sys 0m2.996s
avg.: 34.91s
std. dev. : 0.27s
hackbench 50
Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741
avg : 3.74
std. dev. : 0.01
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Thu, 14 Aug 2008, Mathieu Desnoyers wrote:
>
> I can't argue about the benefit of using VM CPU pinning to manage
> resources because I don't use it myself, but I ran some tests out of
> curiosity to find if uncontended locks were that cheap, and it turns out
> they aren't.
Absolutely.
Locked ops show up not just in microbenchmarks looping over the
instruction, they show up in "real" benchmarks too. We added a single
locked instruction (maybe it was two) to the page fault handling code some
time ago, and the reason I noticed it was that it actually made the page
fault cost visibly more expensive in lmbench. That was a _single_
instruction in the hot path (or maybe two).
And the page fault path is some of the most timing critical in the whole
kernel - if you have everything cached, the cost of doing the page faults
to populate new processes for some fork/exec-heavy workload (and compiling
the kernel is just one of those - any traditional unix behaviour will show
this) is critical.
This is one of the things AMD does a _lot_ better than Intel. Intel tends
to have a 30-50 cycle cost (with later P4s being *much* worse), while AMD
tends to have a cost of around 10-15 cycles.
It's one of the things Intel promises to have improved in the next-gen
uarch (Nehalem), and while I am not supposed to give out any benchmarks, I
can confirm that Intel is getting much better at it. But it's going to be
visible still, and it's really a _big_ issue on P4.
(Of course, on P4, the page fault exception cost itself is so high that
the cost of atomics may be _relatively_ less noticeable in that particular
path)
Linus
Mathieu Desnoyers wrote:
>
> I can't argue about the benefit of using VM CPU pinning to manage
> resources because I don't use it myself, but I ran some tests out of
> curiosity to find if uncontended locks were that cheap, and it turns out
> they aren't. Here are the results :
>
> Xeon 2.0GHz
>
> Summary
>
> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>
> 1 CPU, replace smp lock prefixes with DS segment selector prefixes
> 1 CPU, noreplace-smp
For reference, could you also compare replace smp lock with NOPs?
-hpa
* H. Peter Anvin ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> I can't argue about the benefit of using VM CPU pinning to manage
>> resources because I don't use it myself, but I ran some tests out of
>> curiosity to find if uncontended locks were that cheap, and it turns out
>> they aren't. Here are the results :
>> Xeon 2.0GHz
>> Summary
>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>> 1 CPU, replace smp lock prefixes with DS segment selector prefixes
>> 1 CPU, noreplace-smp
>
> For reference, could you also compare replace smp lock with NOPs?
>
> -hpa
Sure, here are the updated tables. Basically, they show no significant
difference between the NOP and the DS segment selector prefix
approaches.
Xeon 2.0GHz
Summary

* replace smp lock prefixes with DS segment selector prefixes

                   no lock prefix (s)    with lock prefix (s)    Speedup
make -j1 kernel/   33.94 +/- 0.07        34.91 +/- 0.27          2.8 %
hackbench 50       2.99 +/- 0.01         3.74 +/- 0.01           25.1 %

* replace smp lock prefixes with 0x90 nops

                   no lock prefix (s)    with lock prefix (s)    Speedup
make -j1 kernel/   34.16 +/- 0.32        34.91 +/- 0.27          2.2 %
hackbench 50       3.00 +/- 0.01         3.74 +/- 0.01           24.7 %
Detail :
1 CPU, replace smp lock prefixes with DS segment selector prefixes
make -j1 kernel/
real 0m34.067s
user 0m30.630s
sys 0m2.980s
real 0m33.867s
user 0m30.582s
sys 0m3.024s
real 0m33.939s
user 0m30.738s
sys 0m2.876s
real 0m33.913s
user 0m30.806s
sys 0m2.808s
avg : 33.94s
std. dev. : 0.07s
hackbench 50
Time: 2.978
Time: 2.982
Time: 3.010
Time: 2.984
Time: 2.982
avg : 2.99
std. dev. : 0.01
1 CPU, noreplace-smp
make -j1 kernel/
real 0m35.326s
user 0m30.630s
sys 0m3.260s
real 0m34.325s
user 0m30.802s
sys 0m3.084s
real 0m35.568s
user 0m30.722s
sys 0m3.168s
real 0m34.435s
user 0m30.886s
sys 0m2.996s
avg.: 34.91s
std. dev. : 0.27s
hackbench 50
Time: 3.733
Time: 3.750
Time: 3.761
Time: 3.737
Time: 3.741
avg : 3.74
std. dev. : 0.01
1 CPU, replace smp lock prefixes with 0x90 nops
make -j1 kernel/
real 0m34.139s
user 0m30.782s
sys 0m2.820s
real 0m34.010s
user 0m30.630s
sys 0m2.976s
real 0m34.777s
user 0m30.658s
sys 0m2.916s
real 0m33.924s
user 0m30.634s
sys 0m2.924s
real 0m33.962s
user 0m30.774s
sys 0m2.800s
real 0m34.141s
user 0m30.770s
sys 0m2.828s
avg : 34.16
std. dev. : 0.32
hackbench 50
Time: 2.999
Time: 2.994
Time: 3.004
Time: 2.991
Time: 2.988
avg : 3.00
std. dev. : 0.01
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> I can't argue about the benefit of using VM CPU pinning to manage
> resources because I don't use it myself, but I ran some tests out of
> curiosity to find if uncontended locks were that cheap, and it turns out
> they aren't. Here are the results :
>
OK, let me clarify my point a bit. If you've got a kernel which is
switching between UP and SMP on a regular basis, you're going to be
taking the hit for the locked instructions whenever you're in the SMP
state anyway. It's only going to be a workload where you're mostly UP
with occasional excursions into being SMP that patching out the lock
prefixes is actually going to make a difference.
And that just doesn't seem like a very likely use-case to me. Certainly
I don't think it would ever happen on a physical machine. And it
doesn't seem all that likely on a virtual machine either. Certainly
resources are more dynamic in a virtual environment, but I think there's
a fairly good chance that the domain knows from the outset whether it's
going to be UP or SMP, or does the UP->SMP transition once.
(That said, the XenServer product does precisely what I say is unusual:
the dom0 kernel hotplugs all the cpus so it can do ucode updates, etc,
and then unplugs all but one...)
> Xeon 2.0GHz
>
>
> Summary
>
> no lock prefix (s) with lock prefix (s) Speedup
> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>
Yeah, that's more severe than I would have expected. Perhaps I have AMD
numbers in my head.
J
Mathieu Desnoyers wrote:
> * H. Peter Anvin ([email protected]) wrote:
>
>> Mathieu Desnoyers wrote:
>>
>>> I can't argue about the benefit of using VM CPU pinning to manage
>>> resources because I don't use it myself, but I ran some tests out of
>>> curiosity to find if uncontended locks were that cheap, and it turns out
>>> they aren't. Here are the results :
>>> Xeon 2.0GHz
>>> Summary
>>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>>> 1 CPU, replace smp lock prefixes with DS segment selector prefixes
>>> 1 CPU, noreplace-smp
>>>
>> For reference, could you also compare replace smp lock with NOPs?
>>
>> -hpa
>>
>
> Sure, here are the updated tables. Basically, they show no significant
> difference between the NOP and the DS segment selector prefix
> approaches.
>
BTW, are you changing the initial prefix to DS too? Ie, are you doing a
nop->lock->ds transition, or ds->lock->ds?
J
Mathieu Desnoyers wrote:
>
> Sure, here are the updated tables. Basically, they show no significant
> difference between the NOP and the DS segment selector prefix
> approaches.
>
Actually, unless I have blown my T-test completely, they show with 80%
and 74% confidence (respectively, for the two benchmarks) that the DS case
is slightly *better* (0.26% and 0.20% better, respectively), which makes
it a no-brainer. Doing around 10 runs of each is likely to confirm this
conclusion by pushing it into the 90+% interval.
Note, though, that the difference is so small that it could also be due to
some kind of systematic error (lower ambient temperature during the DS
run making the disk drive slightly faster, what have you).
-hpa
Jeremy Fitzhardinge wrote:
>
>> Xeon 2.0GHz
>>
>>
>> Summary
>>
>> no lock prefix (s) with lock prefix (s) Speedup
>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>>
>
> Yeah, that's more severe than I would have expected. Perhaps I have AMD
> numbers in my head.
>
He doesn't specify what kind of Xeon it is (there are three completely
different CPUs under the Xeon brand name: P3 Xeon, P4 Xeon, and Core 2
Xeon), but I think it's a P4 Xeon.
Netburst sucked in so many ways, and this is one of them.
-hpa
H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>>
>>> Xeon 2.0GHz
>>>
>>>
>>> Summary
>>>
>>> no lock prefix (s) with lock prefix (s) Speedup
>>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>>>
>>
>> Yeah, that's more severe than I would have expected. Perhaps I have
>> AMD numbers in my head.
>>
>
> He doesn't specify what kind of Xeon it is (there are three completely
> different CPUs under the Xeon brand name: P3 Xeon, P4 Xeon, and Core 2
> Xeon), but I think it's a P4 Xeon.
>
> Netburst sucked in so many ways, and this is one of them.
I presume it's this one:
> Xeon cpuinfo :
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 23
> model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
> stepping : 6
> cpu MHz : 2000.126
> cache size : 6144 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 10
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
> bogomips : 4000.25
> clflush size : 64
> cache_alignment : 64
> address sizes : 38 bits physical, 48 bits virtual
> power management:
I've given up trying to remember what "E5405" might mean, but that
feature list suggests a relatively modern microarchitecture.
J
* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> * H. Peter Anvin ([email protected]) wrote:
>>
>>> Mathieu Desnoyers wrote:
>>>
>>>> I can't argue about the benefit of using VM CPU pinning to manage
>>>> resources because I don't use it myself, but I ran some tests out of
>>>> curiosity to find if uncontended locks were that cheap, and it turns out
>>>> they aren't. Here are the results :
>>>> Xeon 2.0GHz
>>>> Summary
>>>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>>>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>>>> 1 CPU, replace smp lock prefixes with DS segment selector prefixes
>>>> 1 CPU, noreplace-smp
>>>>
>>> For reference, could you also compare replace smp lock with NOPs?
>>>
>>> -hpa
>>>
>>
>> Sure, here are the updated tables. Basically, they show no significant
>> difference between the NOP and the DS segment selector prefix
>> approaches.
>>
>
> BTW, are you changing the initial prefix to DS too? Ie, are you doing a
> nop->lock->ds transition, or ds->lock->ds?
>
> J
Yeah, I thought about this case yesterday; good thing you asked.
include/asm-x86/alternative.h defines LOCK_PREFIX as :
#define LOCK_PREFIX \
		".section .smp_locks,\"a\"\n"	\
		_ASM_ALIGN "\n"			\
		_ASM_PTR "661f\n" /* address */	\
		".previous\n"			\
		"661:\n\tlock; "
So we have the locked instructions built into the kernel, not the nop'd
ones. Therefore, the only transition I am doing for my benchmarks is :
lock->ds
but I tried switching back to SMP and it worked fine.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
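[ As an aside, the macro above can be mimicked in user space to see what
it records. This is only a simplified mock-up: the kernel's _ASM_ALIGN and
_ASM_PTR are spelled out for 64-bit and the function name is made up. Each
use stores the address of its lock byte in a .smp_locks section; that list
is what alternatives_smp_lock()/alternatives_smp_unlock() in the patch
walk when they poke the byte to 0xf0 or 0x3e. ]

#include <stdio.h>

/* Mock of LOCK_PREFIX: record the address of the lock byte in .smp_locks,
 * then emit the lock prefix itself in front of the instruction. */
#define MOCK_LOCK_PREFIX			\
	".section .smp_locks,\"a\"\n"		\
	".balign 8\n"				\
	".quad 661f\n"				\
	".previous\n"				\
	"661:\n\tlock; "

static int atomic_add_return_mock(int *v, int i)
{
	int old = i;

	/* lock; xaddl: add i to *v, leaving the old value of *v in old */
	asm volatile(MOCK_LOCK_PREFIX "xaddl %0, %1"
		     : "+r" (old), "+m" (*v) : : "memory");
	return old + i;
}

int main(void)
{
	int v = 4;

	printf("%d\n", atomic_add_return_mock(&v, 3));	/* prints 7, v is 7 */
	return 0;
}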
Jeremy Fitzhardinge wrote:
>>
>> He doesn't specify what kind of Xeon it is (there are three completely
>> different CPUs under the Xeon brand name: P3 Xeon, P4 Xeon, and Core 2
>> Xeon), but I think it's a P4 Xeon.
>>
>> Netburst sucked in so many ways, and this is one of them.
>
> I presume it's this one:
>> Xeon cpuinfo :
>>
>> processor : 0
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 23
>> model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
>
> I've given up trying to remember what "E5405" might mean, but that
> feature list suggests a relatively modern microarchitecture.
>
E5405 is the 45 nm Core 2 Quad Xeon.
-hpa
Mathieu Desnoyers wrote:
> * Jeremy Fitzhardinge ([email protected]) wrote:
>
>> Mathieu Desnoyers wrote:
>>
>>> * H. Peter Anvin ([email protected]) wrote:
>>>
>>>
>>>> Mathieu Desnoyers wrote:
>>>>
>>>>
>>>>> I can't argue about the benefit of using VM CPU pinning to manage
>>>>> resources because I don't use it myself, but I ran some tests out of
>>>>> curiosity to find if uncontended locks were that cheap, and it turns out
>>>>> they aren't. Here are the results :
>>>>> Xeon 2.0GHz
>>>>> Summary
>>>>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>>>>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>>>>> 1 CPU, replace smp lock prefixes with DS segment selector prefixes
>>>>> 1 CPU, noreplace-smp
>>>>>
>>>>>
>>>> For reference, could you also compare replace smp lock with NOPs?
>>>>
>>>> -hpa
>>>>
>>>>
>>> Sure, here are the updated tables. Basically, they show no significant
>>> difference between the NOP and the DS segment selector prefix
>>> approaches.
>>>
>>>
>> BTW, are you changing the initial prefix to DS too? Ie, are you doing a
>> nop->lock->ds transition, or ds->lock->ds?
>>
>> J
>>
>
> Yeah, I thought about this case yesterday, good thing you ask.
>
> include/asm-x86/alternative.h defines LOCK_PREFIX as :
>
> #define LOCK_PREFIX \
> ".section .smp_locks,\"a\"\n" \
> _ASM_ALIGN "\n" \
> _ASM_PTR "661f\n" /* address */ \
> ".previous\n" \
> "661:\n\tlock; "
>
> So we have the locked instructions built into the kernel, not the nop'd
> one. Therefore, the only transition I am doing for my benchmarks is :
>
> lock->ds
>
> but I tried to switch back to SMP and it worked fine.
>
Ah, OK. I'd thought we started unlocked, but given that I've just been
disassembling the kernel and looking at the lock prefixes, that's a bit
of a braino on my part.
BTW, using the ds prefix allows us to undo the hack of dealing with
locked instructions with exception handlers. There was a bug where if
we do lock->nop, then the address of a faulting instruction changes, so
we need exception records for both the locked and unlocked forms. Using
ds means the instruction size doesn't change, so we only need one
exception record. I don't remember off hand where that happens.
J
* H. Peter Anvin ([email protected]) wrote:
> Jeremy Fitzhardinge wrote:
>>> Xeon 2.0GHz
>>>
>>>
>>> Summary
>>>
>>> no lock prefix (s) with lock prefix (s) Speedup
>>> make -j1 kernel/ 33.94 +/- 0.07 34.91 +/- 0.27 2.8 %
>>> hackbench 50 2.99 +/- 0.01 3.74 +/- 0.01 25.1 %
>>>
>> Yeah, that's more severe than I would have expected. Perhaps I have AMD
>> numbers in my head.
>
> He doesn't specify what kind of Xeon it is (there are three completely
> different CPUs under the Xeon brand name: P3 Xeon, P4 Xeon, and Core 2
> Xeon), but I think it's a P4 Xeon.
>
> Netburst sucked in so many ways, and this is one of them.
>
Here are the specs. It's a brand new 64-bit dual quad-core (Christmas in
August) :-) None of the specs seems to mention either Core 2 or
Pentium 4, though. Those appellations seem to apply to desktop
processors, not server processors...
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 6
cpu MHz : 2000.109
cache size : 6144 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm
constant_tsc up arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx
tm2 ssse3 cx16 xtpr dca sse4_1 lahf_lm
bogomips : 4000.21
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
Booting processor 7/7 ip 6000
Initializing CPU#7
Calibrating delay using timer specific routine.. 4000.26 BogoMIPS (lpj=8000538)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 6144K
CPU: Physical Processor ID: 1
CPU: Processor Core ID: 3
CPU7: Intel(R) Xeon(R) CPU E5405 @ 2.00GHz stepping 06
checking TSC synchronization [CPU#0 -> CPU#7]: passed.
Mathieu
> -hpa
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> Here are the specs. It's a brand new 64bits dual quad-core (Christmas in
> August) :-) None of the specs seems to talk about neither Core2 or
> Pentium 4 though. Those appellations seems to hold for desktop
> processors, not server processors...
>
Is it a Dell thingy? Turns out my main test box has exactly the same
processor.
J
* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> Here are the specs. It's a brand new 64bits dual quad-core (Christmas in
>> August) :-) None of the specs seems to talk about neither Core2 or
>> Pentium 4 though. Those appellations seems to hold for desktop
>> processors, not server processors...
>>
>
> Is it a Dell thingy? Turns out my main test box has exactly the same
> processor.
>
> J
Nope, it's a custom-made machine from a company in the Montreal area. It
has a Supermicro X7DAL-E motherboard and, well, I got 16GB of ram in it.
Pretty useful to take large traces.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* H. Peter Anvin ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> Sure, here are the updated tables. Basically, they show no significant
>> difference between the NOP and the DS segment selector prefix
>> approaches.
>
> Actually, unless I have blown my T-test completely, they show with a 80%
> and 74% confidence (respective for the two benchmarks) that the DS case is
> slightly *better* (0.26% and 0.20% better, respective), which makes it a
> no-brainer. Doing around 10 runs of each is likely to confirm this
> conclusion by pushing it into the 90+% interval.
>
Yes, I think you are right.
> Note that since the difference is so small, and so can also be due to some
> kind of systematic error (lower ambient temperature during the DS run
> making the disk drive slightly faster, what have you.)
>
I doubt it, because I made a "cache priming" run before the tests, which
made sure the data was all populated in memory. But yeah, having
different cpu clock calibration values/cpu clock frequency between
reboots could also cause that kind of difference.
Mathieu
> -hpa
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Jeremy Fitzhardinge wrote:
>
> Ah, OK. I'd thought we started unlocked, but given that I've just been
> disassembling the kernel and looking at the lock prefixes, that's a bit
> of a braino on my part.
>
> BTW, using the ds prefix allows us to undo the hack of dealing with
> locked instructions with exception handlers. There was a bug where if
> we do lock->nop, then the address of a faulting instruction changes, so
> we need exception records for both the locked and unlocked forms. Using
> ds means the instruction size doesn't change, so we only need one
> exception record. I don't remember off hand where that happens.
>
Using %ds: rather than nop really seems to solve a whole lot of
problems, and might even be faster to boot. It really sounds like a
no-brainer.
-hpa
* H. Peter Anvin ([email protected]) wrote:
> Jeremy Fitzhardinge wrote:
>> Ah, OK. I'd thought we started unlocked, but given that I've just been
>> disassembling the kernel and looking at the lock prefixes, that's a bit of
>> a braino on my part.
>> BTW, using the ds prefix allows us to undo the hack of dealing with locked
>> instructions with exception handlers. There was a bug where if we do
>> lock->nop, then the address of a faulting instruction changes, so we need
>> exception records for both the locked and unlocked forms. Using ds means
>> the instruction size doesn't change, so we only need one exception record.
>> I don't remember off hand where that happens.
>
> Using %ds: rather than nop really seems to solve a whole lot of problems,
> and might even be faster to boot. It really sounds like a no-brainer.
>
> -hpa
So should I wait a bit for more comments or straightforwardly submit
this as a patch rather than RFC ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> * Jeremy Fitzhardinge ([email protected]) wrote:
>
>> Mathieu Desnoyers wrote:
>>
>>> Here are the specs. It's a brand new 64bits dual quad-core (Christmas in
>>> August) :-) None of the specs seems to talk about neither Core2 or
>>> Pentium 4 though. Those appellations seems to hold for desktop
>>> processors, not server processors...
>>>
>>>
>> Is it a Dell thingy? Turns out my main test box has exactly the same
>> processor.
>>
>> J
>>
>
> Nope, it's a custom-made machine from a company in the Montreal area. It
> has a Supermicro X7DAL-E motherboard and, well, I got 16GB of ram in it.
> Pretty useful to take large traces.
>
> Mathieu
>
>
>
FWIW, I think its a Harpertown:
http://processorfinder.intel.com/details.aspx?sSpec=SLAP2
-Greg
Mathieu Desnoyers wrote:
> So should I wait a bit for more comments or straightforwardly submit
> this as a patch rather than RFC ?
>
Looks like all the relevant people have reviewed it now, so I don't
think there's much more to say.
J
* H. Peter Anvin ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> Sure, here are the updated tables. Basically, they show no significant
>> difference between the NOP and the DS segment selector prefix
>> approaches.
>
> Actually, unless I have blown my T-test completely, they show with a 80%
> and 74% confidence (respective for the two benchmarks) that the DS case is
> slightly *better* (0.26% and 0.20% better, respective), which makes it a
> no-brainer. Doing around 10 runs of each is likely to confirm this
> conclusion by pushing it into the 90+% interval.
>
I did more runs (20 runs of each) to compare the nop case to the DS
prefix case. Results are in seconds. They actually do not seem to show a
significant difference.
NOP
34.155
33.955
34.012
35.299
35.679
34.141
33.995
35.016
34.254
33.957
33.957
34.008
35.013
34.494
33.893
34.295
34.314
34.854
33.991
34.132
DS
34.080
34.304
34.374
35.095
34.291
34.135
33.940
34.208
35.276
34.288
33.861
33.898
34.610
34.709
33.851
34.256
35.161
34.283
33.865
35.078
Used http://www.graphpad.com/quickcalcs/ttest1.cfm?Format=C to do the
T-test (yeah, I'm lazy) :
Group   Group One (DS prefix)   Group Two (nops)
Mean    34.37815                34.37070
SD      0.46108                 0.51905
SEM     0.10310                 0.11606
N       20                      20
P value and statistical significance:
The two-tailed P value equals 0.9620
By conventional criteria, this difference is considered to be not
statistically significant.
Confidence interval:
The mean of Group One minus Group Two equals 0.00745
95% confidence interval of this difference: From -0.30682 to 0.32172
Intermediate values used in calculations:
t = 0.0480
df = 38
standard error of difference = 0.155
So, unless these calculations are completely bogus, the difference between
the nop and the DS cases does not seem to be statistically significant.
Mathieu
> Note that since the difference is so small, and so can also be due to some
> kind of systematic error (lower ambient temperature during the DS run
> making the disk drive slightly faster, what have you.)
>
> -hpa
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
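[ For the curious, the web calculator's numbers can be checked with a few
lines of code. This is a stand-alone sketch of the classic unpaired,
pooled-variance t statistic; the data arrays are copied from the runs
listed above, everything else is generic. Build with -lm. With equal
sample sizes the pooled standard error matches Welch's formula, so either
way t comes out around 0.05 with 38 degrees of freedom. ]

#include <stdio.h>
#include <math.h>

/* 20 "make -j1 kernel/" times in seconds, copied from the runs above. */
static const double nop_s[] = {
	34.155, 33.955, 34.012, 35.299, 35.679, 34.141, 33.995, 35.016,
	34.254, 33.957, 33.957, 34.008, 35.013, 34.494, 33.893, 34.295,
	34.314, 34.854, 33.991, 34.132,
};
static const double ds_s[] = {
	34.080, 34.304, 34.374, 35.095, 34.291, 34.135, 33.940, 34.208,
	35.276, 34.288, 33.861, 33.898, 34.610, 34.709, 33.851, 34.256,
	35.161, 34.283, 33.865, 35.078,
};

static double mean(const double *x, int n)
{
	double s = 0.0;
	int i;

	for (i = 0; i < n; i++)
		s += x[i];
	return s / n;
}

static double sample_var(const double *x, int n, double m)
{
	double s = 0.0;
	int i;

	for (i = 0; i < n; i++)
		s += (x[i] - m) * (x[i] - m);
	return s / (n - 1);
}

int main(void)
{
	int n1 = sizeof(ds_s) / sizeof(ds_s[0]);
	int n2 = sizeof(nop_s) / sizeof(nop_s[0]);
	double m1 = mean(ds_s, n1), m2 = mean(nop_s, n2);
	double v1 = sample_var(ds_s, n1, m1), v2 = sample_var(nop_s, n2, m2);
	double sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2);
	double se = sqrt(sp2 * (1.0 / n1 + 1.0 / n2));

	printf("DS  : mean %.5f  SD %.5f\n", m1, sqrt(v1));
	printf("NOP : mean %.5f  SD %.5f\n", m2, sqrt(v2));
	printf("t = %.4f, df = %d, SE of difference = %.3f\n",
	       (m1 - m2) / se, n1 + n2 - 2, se);
	return 0;
}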
* Jeremy Fitzhardinge ([email protected]) wrote:
> Mathieu Desnoyers wrote:
>> So should I wait a bit for more comments or straightforwardly submit
>> this as a patch rather than RFC ?
>>
>
> Looks like all the relevant people have reviewed it now, so I don't think
> there's much more to say.
>
> J
I'm just worried about this comment from Harvey Harrison :
arch/x86/mm/fault.c : is_prefetch()
* Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes.
* In X86_64 long mode, the CPU will signal invalid
* opcode if some of these prefixes are present so
* X86_64 will never get here anyway
*/
This comment refers to :
0x26 : ES segment override prefix
0x2E : CS segment override prefix
0x36 : SS segment override prefix
0x3E : DS segment override prefix
AMD documentation seems to indicate that these prefixes will be null, not
that the cpu would signal "invalid opcodes" :
"AMD 64-Bit Technology" A.7
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/x86-64_overview.pdf
"In 64-bit mode, the DS, ES, SS and CS segment-override prefixes have no effect.
These four prefixes are no longer treated as segment-override prefixes in the
context of multiple-prefix rules. Instead, they are treated as null prefixes."
Intel does not seem to state anything particular about these prefixes
for the 64-bit mode.
So, is this comment misleading, or is it using the term "invalid opcode"
in a way that does not imply generating a fault ?
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
>
> I'm just worried about this comment from Harvey Harrison :
>
> arch/x86/mm/fault.c : is_prefetch()
>
> * Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes.
> * In X86_64 long mode, the CPU will signal invalid
> * opcode if some of these prefixes are present so
> * X86_64 will never get here anyway
> */
>
> This comment refers to :
>
> 0x26 : ES segment override prefix
> 0x2E : CS segment override prefix
> 0x36 : SS segment override prefix
> 0x3E : DS segment override prefix
>
> AMD documentation seems to indicate that these prefix will be null, not
> that the cpu would signal "invalid opcodes" :
>
> "AMD 64-Bit Technology" A.7
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/x86-64_overview.pdf
>
> "In 64-bit mode, the DS, ES, SS and CS segment-override prefixes have no effect.
> These four prefixes are no longer treated as segment-override prefixes in the
> context of multipleprefix rules. Instead, they are treated as null prefixes."
>
> Intel does not seem to state anything particular about these prefixes
> for the 64-bit mode.
>
> So, is this comment misleading, or is it using the term "invalid opcode"
> in a way that does not imply generating a fault ?
>
They do not signal faults; there just aren't any base addresses behind
them. Some AMD chips allow limits to be set on these segments --
apparently added on behalf of some hypervisor makers; I suspect that
VMX/SVM is making that quickly obsolete.
So it should be just fine.
Acked-by: H. Peter Anvin <[email protected]>
-hpa
Mathieu Desnoyers wrote:
> I'm just worried about this comment from Harvey Harrison :
>
> arch/x86/mm/fault.c : is_prefetch()
>
> * Values 0x26,0x2E,0x36,0x3E are valid x86 prefixes.
> * In X86_64 long mode, the CPU will signal invalid
> * opcode if some of these prefixes are present so
> * X86_64 will never get here anyway
>
I would say that comment is wrong. But we'd never put a lock prefix on
a prefetch, so you won't be adding a ds prefix either.
J
Jeremy Fitzhardinge wrote:
> Mathieu Desnoyers wrote:
>> I'm just worried about this comment from Harvey Harrison :
>>
>> arch/x86/mm/fault.c : is_prefetch()
>>
>> * Values 0x26,0x2E,0x36,0x3E are valid x86
>> prefixes.
>> * In X86_64 long mode, the CPU will signal
>> invalid
>> * opcode if some of these prefixes are
>> present so
>> * X86_64 will never get here anyway
>>
>
> I would say that comment is wrong. But we'd never put a lock prefix on
> a prefetch, so you won't be added a ds prefix either.
>
Locking prefetches is definitely wrong :)
I don't know if the comment is correct; I've certainly never heard of
that myself.
-hpa
[ Finally got my goodmis email back ]
On Wed, 13 Aug 2008, Andi Kleen wrote:
> > Sorry to ask, I feel I must be missing something, but I'm trying to
> > figure out where you propose to add the "call mcount" ? In the caller or
> > in the callee ?
>
> callee like gcc. caller would be likely more bloated because
> there are more calls than functions. Also if it was at the
> callee more code would be needed because the function currently
> executed couldn't be gotten from stack directly.
>
> > Or is it a different scheme I don't see ? I am trying to figure out how
> > you happen to do all that without dynamic code modification and manage
> > not to hurt performance.
>
> The dynamic code modification is only needed because there is no
> global table of the mcount call sites. So instead it discovers
> them at runtime, but that requires runtime save patching
The new code does not discover the call sites at runtime. The old code did
that. The "to kill a daemon" series removed the runtime discovery and replaced it
with discovery at compile time.
>
> With a custom call scheme one could just build up a table of
> call sites at link time using an ELF section and then when
> tracing is enabled/disabled always patch them all in one go
> in a stop_machine(). Then you wouldn't need parallel execution safe
> patching anymore and it doesn't matter what the nops look like.
The current patch set pretty much does exactly this. Yes, I patch
at boot up, all in one go, before the other CPUs are even active.
This takes all of 6 milliseconds to do. Not much extra time for bootup.
>
> The other advantage is that it would allow getting rid of
> the frame pointer.
This is the only advantage that you have.
-- Steve
> > The other advantage is that it would allow getting rid of
> > the frame pointer.
>
> This is the only advantage that you have.
Ok. But it's a serious one. It gives slightly more gain than your whole
complicated patching exercise.
Ok, maybe it would be better to just properly fix gcc, but the problem
is that it takes forever for the user base to actually start using a new
gcc :/
-Andi