2006-09-30 18:18:40

by Mathieu Desnoyers

[permalink] [raw]
Subject: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17


Hi,

Following the huge discussion thread about tracing/static vs dynamic
instrumentation/markers, a consensus seems to emerge about the need for a
marker system in the Linux kernel. The main issues this mechanism addresses are:

- Identify code important to runtime data collection/analysis tools in tree so
that it follows the code changes naturally.
- Be visually appealing to kernel developers.
- Have a very low impact on the system performance.
- Integrate in the standard kernel infrastructure : use C and loadable modules.

The time has come for some performance measurements of the Linux Kernel Markers,
which follows. I attach a PDF with tables and charts which condense these
results.


* Micro-benchmarks

Use timestamp counter to calculate the time spent, with interrupts disabled.
Machine : Pentium 4 3GHz, 1GB ram
Fully preemptible kernel
Linux Kernel Markers 0.19
Kernel : Linux 2.6.17

marker : MARK(subsys_mark1, "%d %p", 1, NULL);

This marker, with two elements (integer and pointer) have been chosen because it
is representative of high volume events. For instance, a trap entry event logs a
trap_id (long) and an address (pointer). The same applies to system calls, where
a system call entry event logs both the ID of the system call and the address of
the caller.


* Execute an empty loop

- Without marker
NR_LOOPS : 10000000
time delta (cycles): 15026497
cycles per loop : 1.50

- i386 "optimized" : immediate value, test and predicted branch
(non connected marker)
NR_LOOPS : 10000000
time delta (cycles): 40031640
cycles per loop : 4.00
cycles per loop for marker : 4.00-1.50=2.50

- i386 "generic" : load, test and predicted branch
(non connected marker)
NR_LOOPS : 10000000
time delta (cycles): 26697878
cycles per loop : 2.67
cycles per loop for marker : 2.67-1.50=1.17


* Execute a loop of memcpy 4096 bytes

This test has been done to show the impact of markers on a system where the
memory is already used, which is more representative of a running kernel.

- Without marker
NR_LOOPS : 10000
time delta (cycles): 12981555
cycles per loop : 1298.16

- i386 "optimized" : immediate value, test and predicted branch
(non connected marker)
NR_LOOPS : 10000
time delta (cycles): 12982290
cycles per loop : 1298.23
cycles per loop for marker : 1298.23-1298.16=0.074

- i386 "generic" : load, test and predicted branch
(non connected marker)
NR_LOOPS : 10000
time delta (cycles): 13002788
cycles per loop : 1300.28
cycles per loop for marker : 1300.28-1298.16=2.123


The following tests are done with the "optimized" markers only

- Execute a loop with marker enabled, with i386 "fastcall" register argument
setup, probe empty. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 4407608
cycles per loop : 44.08
cycles per loop to disable preemption and setup arguments in registers :
44.08-4.00=40.08

- Execute a loop with a marker enabled, with an empty probe. Var args argument
setup, probe empty. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 5210587
cycles per loop : 52.11
additional cycles per loop to setup var args : 52.11-44.08=8.03

- Execute a loop with a marker enabled, with an empty probe. Var args argument
setup, probe empty. No preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 3363450
cycles per loop : 33.63
cycles per loop to disable preemption : 44.08-33.63=10.45

- Execute a loop with marker enabled, with i386 "asmlinkage" arguments expected.
Data is copied by the probe. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 5299837
cycles per loop : 53.00
additional cycles per loop to get arguments in probe (from stack) on x86 :
53.00-52.11=0.89

- Execute a loop with marker enabled, with var args probe expecting arguments.
Data is copied by the probe. With preemption disabling.
NR_LOOPS : 100000
time delta (cycles): 5574300
cycles per loop : 55.74
additional cycles per loop to get expected variable arguments on x86 :
55.74-53.00=2.74

- Execute a loop with marker enabled, with var args probe, format string
Data is copied by the probe. This is a 6 bytes string to decode.
NR_LOOPS : 100000
time delta (cycles): 9622117
cycles per loop : 96.22
additional cycles per loop to dynamically parse arguments with a 6 bytes format
string : 96.22-55.74=40.48

- Execute a loop with marker enabled, with var args probe expecting arguments.
Data is copied by the probe. With preemption disabling. An empty "kprobe" is
connected to the probe.
NR_LOOPS : 100000
time delta (cycles): 423397455
cycles per loop : 4233.97
additional cycles per loop to execute the kprobe : 4233.97-55.74=4178.23


* Assembly code

The disassembly of the following marked function will be shown :

static int my_open(struct inode *inode, struct file *file)
{
MARK(subsys_mark1, "%d %p", 1, NULL);

return -EPERM;
}


- Optimized

static int my_open(struct inode *inode, struct file *file)
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 0c sub $0xc,%esp
MARK(subsys_mark1, "%d %p", 1, NULL);
6: b0 00 mov $0x0,%al <-- immediate load 0 in al
8: 84 c0 test %al,%al
a: 75 07 jne 13 <my_open+0x13>

return -EPERM;
}
c: b8 ff ff ff ff mov $0xffffffff,%eax
11: c9 leave
12: c3 ret
13: b8 01 00 00 00 mov $0x1,%eax
18: e8 fc ff ff ff call 19 <my_open+0x19> <-- preempt_disable
1d: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
24: 00
25: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
2c: 00
2d: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
34: ff 15 74 10 00 00 call *0x1074 <-- function pointer
3a: b8 01 00 00 00 mov $0x1,%eax
3f: e8 fc ff ff ff call 40 <my_open+0x40> <-- preempt_enable
44: eb c6 jmp c <my_open+0xc>


- Generic

static int my_open(struct inode *inode, struct file *file)
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 0c sub $0xc,%esp
MARK(subsys_mark1, "%d %p", 1, NULL);
6: 0f b6 05 20 10 00 00 movzbl 0x1020,%eax <-- memory load byte
d: 84 c0 test %al,%al
f: 75 07 jne 18 <my_open+0x18>

return -EPERM;
}
11: b8 ff ff ff ff mov $0xffffffff,%eax
16: c9 leave
17: c3 ret
18: b8 01 00 00 00 mov $0x1,%eax
1d: e8 fc ff ff ff call 1e <my_open+0x1e> <-- preempt_disable
22: c7 44 24 08 00 00 00 movl $0x0,0x8(%esp)
29: 00
2a: c7 44 24 04 01 00 00 movl $0x1,0x4(%esp)
31: 00
32: c7 04 24 0d 00 00 00 movl $0xd,(%esp)
39: ff 15 74 10 00 00 call *0x1074 <-- function pointer
3f: b8 01 00 00 00 mov $0x1,%eax
44: e8 fc ff ff ff call 45 <my_open+0x45> <-- preempt_enable
49: eb c6 jmp 11 <my_open+0x11>



Here is the typical var arg probe that has been used in those tests. It saves
the values expectes as parameters in global variables. The DO_MARK1_FORMAT
define is used for probe registration to make sure that it will be connected
with a marker that has a matching format string. Note that this checking is
optional : the probe can register with a NULL format and afterward check itself
the format string received in parameter dynamically.

int value;
void *ptr;

#define DO_MARK1_FORMAT "%d %p"
void do_mark1(const char *format, ...)
{
va_list ap;

va_start(ap, format);
value = va_arg(ap, int);
ptr = va_arg(ap, void*);

va_end(ap);
}


Here is the disassembly of the probe :

#define DO_MARK1_FORMAT "%d %p"
void do_mark1(const char *format, ...)
{
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 04 sub $0x4,%esp
va_list ap;

va_start(ap, format);
value = va_arg(ap, int);
6: 8b 45 0c mov 0xc(%ebp),%eax
9: a3 00 00 00 00 mov %eax,0x0
ptr = va_arg(ap, void*);
e: 8b 45 10 mov 0x10(%ebp),%eax
11: a3 00 00 00 00 mov %eax,0x0

va_end(ap);
}
16: c9 leave
17: c3 ret


* Size (x86)

This is the size added by each marker to the memory image :

- Optimized

.text section : instructions
Adds 6 bytes in the "likely" path.
Adds 32 bytes in the "unlikely" path.
.data section : r/w data
0 byte
.rodata.str1 : strings
Length of the marker name
.debug_str : strings (if loaded..)
Length of the marker name + 7 bytes (__mark_)
.markers
8 bytes (2 pointers)
.markers.c
12 bytes (3 pointers)

- Generic

.text section : instructions
Adds 11 bytes in the "likely" path.
Adds 32 bytes in the "unlikely" path.
.data section : r/w data
1 byte (the activation flag)
.rodata.str1 : strings
Length of the marker name
.debug_str : strings (if loaded..)
Length of the marker name + 7 bytes (__mark_)
.markers
8 bytes (2 pointers)
.markers.c
12 bytes (3 pointers)


* Macro-benchmarks

Compiling a 2.6.17 kernel on a Pentium 4 3GHz, 1GB ram, cold cache.
Running a 2.6.17 vanilla kernel :
real 8m2.443s
user 7m35.124s
sys 0m34.950s

Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
real 8m1.635s
user 7m34.552s
sys 0m36.298s

--> 0.98 % speedup with markers

Ping flood on loopback interface :
Running a 2.6.17 vanilla kernel :
136596 packets transmitted, 136596 packets received, 0% packet loss
round-trip min/avg/max = 0.0/0.0/0.1 ms

real 0m10.840s
user 0m0.360s
sys 0m10.485s

12601 packets transmitted/s

Running a 2.6.17 kernel with lttng-0.6.0pre11 markers (no probe connected) :
108504 packets transmitted, 108504 packets received, 0% packet loss
round-trip min/avg/max = 0.0/0.0/0.1 ms

real 0m8.614s
user 0m0.264s
sys 0m8.353s

12596 packets transmitted/s

--> 0.03 % slowdown with markers


Conclusion

In an empty loop, the generic marker is faster than the optimized marker. This
may be due to better performances of the movzbl instruction over the movb on the
Pentium 4 architecture. However, when we execute a loop of 4kB copy, the impact
of the movzbl becomes greater because it uses the memory bandwidth.

The preemption disabling and call to a probe itself costs 48.11 cycles, almost
as much as dynamically parsing the format string to get the variable arguments
(40.48 cycles).

There is almost no difference, on x86, between passing the arguments directly on
the stack and using a variable argument list when its layout is known
statically (0.89 cycles vs 2.74 cycles).

The int3 approach for adding instrumentation dynamically saves the 0.074 cycle
(typcal use, high memory usage) used by the optimized marker by adding the
ability to insert a breakpoint at any location without any impact on the code
when inactive. This breakpoint based approach is very useful to instrument core
kernel code that has not been previously marked without need to recompile and
reboot. We can therefore compare the case "without markers" to the null impact
of an inactive int3 breakpoint.

However, the performance impact for using a kprobe is non negligible when
activated. Assuming that kprobes would have a mechanism to get the variables
from the caller's stack, it would perform the same task in at least 4178.23
cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
useful for the reason explained earlier, the high event rate paths in the kernel
would clearly benefit from a marker mechanism when the are probed.

Code size and memory footprints are smaller with the optimized version : 6
bytes of code in the likely path compared to 11 bytes. The memory footprint of
the optimized approach saves 4 bytes of data memory that would otherwise have to
stay in cache.

On the macro-benchmark side, no significant difference in performance has been
found between the vanilla kernel and a kernel "marked" with the standard LTTng
instrumentation.




OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


Attachments:
(No filename) (12.16 kB)
markers-tables-charts.pdf (53.75 kB)
Download all attachments

2006-09-30 19:02:20

by Nicholas Miell

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

On Sat, 2006-09-30 at 14:01 -0400, Mathieu Desnoyers wrote:
> Hi,
>
> Following the huge discussion thread about tracing/static vs dynamic
> instrumentation/markers, a consensus seems to emerge about the need for a
> marker system in the Linux kernel. The main issues this mechanism addresses are:
>
> - Identify code important to runtime data collection/analysis tools in tree so
> that it follows the code changes naturally.
> - Be visually appealing to kernel developers.
> - Have a very low impact on the system performance.
> - Integrate in the standard kernel infrastructure : use C and loadable modules.
>
> The time has come for some performance measurements of the Linux Kernel Markers,
> which follows. I attach a PDF with tables and charts which condense these
> results.

Has anyone done any performance measurements with the "regular function
call replaced by a NOP" type of marker?

--
Nicholas Miell <[email protected]>

2006-10-01 03:42:19

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

* Nicholas Miell ([email protected]) wrote:
> On Sat, 2006-09-30 at 14:01 -0400, Mathieu Desnoyers wrote:
> > Hi,
> >
> > Following the huge discussion thread about tracing/static vs dynamic
> > instrumentation/markers, a consensus seems to emerge about the need for a
> > marker system in the Linux kernel. The main issues this mechanism addresses are:
> >
> > - Identify code important to runtime data collection/analysis tools in tree so
> > that it follows the code changes naturally.
> > - Be visually appealing to kernel developers.
> > - Have a very low impact on the system performance.
> > - Integrate in the standard kernel infrastructure : use C and loadable modules.
> >
> > The time has come for some performance measurements of the Linux Kernel Markers,
> > which follows. I attach a PDF with tables and charts which condense these
> > results.
>
> Has anyone done any performance measurements with the "regular function
> call replaced by a NOP" type of marker?
>

Here it is (on the same setup as the other tests : Pentium 4, 3 GHz) :

* Execute an empty loop

- Without marker
NR_LOOPS : 10000000
time delta (cycles): 15026497
cycles per loop : 1.50

- With 5 NOPs
NR_LOOPS : 100000
time delta (cycles): 300157
cycles per loop : 3.00
added cycles per loop for nops : 3.00-1.50 = 1.50


* Execute a loop of memcpy 4096 bytes

- Without marker
NR_LOOPS : 10000
time delta (cycles): 12981555
cycles per loop : 1298.16

- With 5 NOPs
NR_LOOPS : 10000
time delta (cycles): 12983925
cycles per loop : 1298.39
added cycles per loop for nops : 0.23


If we compare this approach to the jump-over-call markers (in cycles per loop) :

NOPs Jump over call generic Jump over call optimized
empty loop 1.50 1.17 2.50
memcpy 0.23 2.12 0.07



Mathieu


OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2006-10-01 04:19:48

by Nicholas Miell

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

On Sat, 2006-09-30 at 23:42 -0400, Mathieu Desnoyers wrote:
> * Nicholas Miell ([email protected]) wrote:
> >
> > Has anyone done any performance measurements with the "regular function
> > call replaced by a NOP" type of marker?
> >
>
> Here it is (on the same setup as the other tests : Pentium 4, 3 GHz) :
>
> * Execute an empty loop
>
> - Without marker
> NR_LOOPS : 10000000
> time delta (cycles): 15026497
> cycles per loop : 1.50
>
> - With 5 NOPs
> NR_LOOPS : 100000
> time delta (cycles): 300157
> cycles per loop : 3.00
> added cycles per loop for nops : 3.00-1.50 = 1.50
>
>
> * Execute a loop of memcpy 4096 bytes
>
> - Without marker
> NR_LOOPS : 10000
> time delta (cycles): 12981555
> cycles per loop : 1298.16
>
> - With 5 NOPs
> NR_LOOPS : 10000
> time delta (cycles): 12983925
> cycles per loop : 1298.39
> added cycles per loop for nops : 0.23
>
>
> If we compare this approach to the jump-over-call markers (in cycles per loop) :
>
> NOPs Jump over call generic Jump over call optimized
> empty loop 1.50 1.17 2.50
> memcpy 0.23 2.12 0.07
>
>
>
> Mathieu

What about with two NOPs (".byte 0x66, 0x66, 0x90, 0x66, 0x90" - this
should work with everything) or one (".byte 0x0f, 0x1f, 0x44, 0x00,
0x00" - AFAIK, this should work with P6 or newer).

(Sorry, I should have mentioned this the first time.)

--
Nicholas Miell <[email protected]>

2006-10-01 15:33:29

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

* Nicholas Miell ([email protected]) wrote:
> On Sat, 2006-09-30 at 23:42 -0400, Mathieu Desnoyers wrote:
> > * Nicholas Miell ([email protected]) wrote:
> > >
> > > Has anyone done any performance measurements with the "regular function
> > > call replaced by a NOP" type of marker?
> > >
> >
> > Here it is (on the same setup as the other tests : Pentium 4, 3 GHz) :
> >
> > * Execute an empty loop
> >
> > - Without marker
> > NR_LOOPS : 10000000
> > time delta (cycles): 15026497
> > cycles per loop : 1.50
> >
> > - With 5 NOPs
> > NR_LOOPS : 100000
> > time delta (cycles): 300157
> > cycles per loop : 3.00
> > added cycles per loop for nops : 3.00-1.50 = 1.50
> >
> >
> > * Execute a loop of memcpy 4096 bytes
> >
> > - Without marker
> > NR_LOOPS : 10000
> > time delta (cycles): 12981555
> > cycles per loop : 1298.16
> >
> > - With 5 NOPs
> > NR_LOOPS : 10000
> > time delta (cycles): 12983925
> > cycles per loop : 1298.39
> > added cycles per loop for nops : 0.23
> >
> >
> > If we compare this approach to the jump-over-call markers (in cycles per loop) :
> >
> > NOPs Jump over call generic Jump over call optimized
> > empty loop 1.50 1.17 2.50
> > memcpy 0.23 2.12 0.07
> >
> >
> >
> > Mathieu
>
> What about with two NOPs (".byte 0x66, 0x66, 0x90, 0x66, 0x90" - this
> should work with everything) or one (".byte 0x0f, 0x1f, 0x44, 0x00,
> 0x00" - AFAIK, this should work with P6 or newer).
>
> (Sorry, I should have mentioned this the first time.)
>

Hi,

The tests I made were with :
#define GENERIC_NOP1 ".byte 0x90\n"
#define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
#define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4

Now with the tests you ask for :

* Execute an empty loop
- 2 NOPs ".byte 0x66, 0x66, 0x90, 0x66, 0x90"
NR_LOOPS : 100000
time delta (cycles): 200190
cycles per loop : 2.00
cycles per loop for nops : 2.00-1.50 = 0.50

- 1 NOP "0x0f, 0x1f, 0x44, 0x00, 0x00"
NR_LOOPS : 100000
time delta (cycles): 300172
cycles per loop : 3.00
cycles per loop for nops : 3.00-1.50 = 2.50


* Execute a loop of memcpy 4096 bytes
- 2 NOPs ".byte 0x66, 0x66, 0x90, 0x66, 0x90"
NR_LOOPS : 10000
time delta (cycles): 12981293
cycles per loop : 1298.13
cycles per loop for nops : 1298.16-1298.13=0.03

- 1 NOP "0x0f, 0x1f, 0x44, 0x00, 0x00"
NR_LOOPS : 10000
time delta (cycles): 12985590
cycles per loop : 1298.56
cycles per loop for nops : 0.43


Mathieu


OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2006-10-01 23:57:54

by Nicholas Miell

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

On Sun, 2006-10-01 at 11:33 -0400, Mathieu Desnoyers wrote:
> * Nicholas Miell ([email protected]) wrote:
> > On Sat, 2006-09-30 at 23:42 -0400, Mathieu Desnoyers wrote:
> > > * Nicholas Miell ([email protected]) wrote:
> > > >
> > > > Has anyone done any performance measurements with the "regular function
> > > > call replaced by a NOP" type of marker?
> > > >
> > >
> > > Here it is (on the same setup as the other tests : Pentium 4, 3 GHz) :
> > >
> > > * Execute an empty loop
> > >
> > > - Without marker
> > > NR_LOOPS : 10000000
> > > time delta (cycles): 15026497
> > > cycles per loop : 1.50
> > >
> > > - With 5 NOPs
> > > NR_LOOPS : 100000
> > > time delta (cycles): 300157
> > > cycles per loop : 3.00
> > > added cycles per loop for nops : 3.00-1.50 = 1.50
> > >
> > >
> > > * Execute a loop of memcpy 4096 bytes
> > >
> > > - Without marker
> > > NR_LOOPS : 10000
> > > time delta (cycles): 12981555
> > > cycles per loop : 1298.16
> > >
> > > - With 5 NOPs
> > > NR_LOOPS : 10000
> > > time delta (cycles): 12983925
> > > cycles per loop : 1298.39
> > > added cycles per loop for nops : 0.23
> > >
> > >
> > > If we compare this approach to the jump-over-call markers (in cycles per loop) :
> > >
> > > NOPs Jump over call generic Jump over call optimized
> > > empty loop 1.50 1.17 2.50
> > > memcpy 0.23 2.12 0.07
> > >
> > >
> > >
> > > Mathieu
> >
> > What about with two NOPs (".byte 0x66, 0x66, 0x90, 0x66, 0x90" - this
> > should work with everything) or one (".byte 0x0f, 0x1f, 0x44, 0x00,
> > 0x00" - AFAIK, this should work with P6 or newer).
> >
> > (Sorry, I should have mentioned this the first time.)
> >
>
> Hi,
>
> The tests I made were with :
> #define GENERIC_NOP1 ".byte 0x90\n"
> #define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
> #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
>
> Now with the tests you ask for :
>
> * Execute an empty loop
> - 2 NOPs ".byte 0x66, 0x66, 0x90, 0x66, 0x90"
> NR_LOOPS : 100000
> time delta (cycles): 200190
> cycles per loop : 2.00
> cycles per loop for nops : 2.00-1.50 = 0.50
>
> - 1 NOP "0x0f, 0x1f, 0x44, 0x00, 0x00"
> NR_LOOPS : 100000
> time delta (cycles): 300172
> cycles per loop : 3.00
> cycles per loop for nops : 3.00-1.50 = 2.50
>
>
> * Execute a loop of memcpy 4096 bytes
> - 2 NOPs ".byte 0x66, 0x66, 0x90, 0x66, 0x90"
> NR_LOOPS : 10000
> time delta (cycles): 12981293
> cycles per loop : 1298.13
> cycles per loop for nops : 1298.16-1298.13=0.03
>
> - 1 NOP "0x0f, 0x1f, 0x44, 0x00, 0x00"
> NR_LOOPS : 10000
> time delta (cycles): 12985590
> cycles per loop : 1298.56
> cycles per loop for nops : 0.43
>

To summarize in chart form:

JoC JoCo 2NOP 1NOP
empty loop 1.17 2.50 0.50 2.50
memcpy 2.12 0.07 0.03 0.43

JoC = Jump over call - generic
JoCo = Jump over call - optimized
2NOP = "data16 data16 nop; data16 nop"
1NOP = NOP with ModRM

I left out your "nop; lea 0(%esi), %esi" because it isn't actually a NOP
(the CPU will do actual work even if it has no effect, and on AMD64,
that insn is "nop; lea 0(%rdi), %esi", which will truncate RDI+0 to fit
32-bits.)

The performance of NOP with ModRM doesn't suprise me -- AFAIK, only the
most recent of Intel CPUs actually special case that to be a true
no-work-done NOP.

It'd be nice to see the results of "jump to an out-of-line call with the
jump replaced by a NOP", but even if it performs well (and it should,
the argument passing and stack alignment overhead won't be executed in
the disabled probe case), actually using it in practice would be
difficult without compiler support (call instructions are easy to find
thanks to their relocations, which local jumps don't have).

--
Nicholas Miell <[email protected]>

2006-10-02 00:07:43

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

* Nicholas Miell ([email protected]) wrote:
> On Sun, 2006-10-01 at 11:33 -0400, Mathieu Desnoyers wrote:
> > * Nicholas Miell ([email protected]) wrote:
> > > On Sat, 2006-09-30 at 23:42 -0400, Mathieu Desnoyers wrote:
> > > > * Nicholas Miell ([email protected]) wrote:
> > > > >
> > > > > Has anyone done any performance measurements with the "regular function
> > > > > call replaced by a NOP" type of marker?
> > > > >
> > > >
> > > > Here it is (on the same setup as the other tests : Pentium 4, 3 GHz) :
> > > >
> > > > * Execute an empty loop
> > > >
> > > > - Without marker
> > > > NR_LOOPS : 10000000
> > > > time delta (cycles): 15026497
> > > > cycles per loop : 1.50
> > > >
> > > > - With 5 NOPs
> > > > NR_LOOPS : 100000
> > > > time delta (cycles): 300157
> > > > cycles per loop : 3.00
> > > > added cycles per loop for nops : 3.00-1.50 = 1.50
> > > >
> > > >
> > > > * Execute a loop of memcpy 4096 bytes
> > > >
> > > > - Without marker
> > > > NR_LOOPS : 10000
> > > > time delta (cycles): 12981555
> > > > cycles per loop : 1298.16
> > > >
> > > > - With 5 NOPs
> > > > NR_LOOPS : 10000
> > > > time delta (cycles): 12983925
> > > > cycles per loop : 1298.39
> > > > added cycles per loop for nops : 0.23
> > > >
> > > >
> > > > If we compare this approach to the jump-over-call markers (in cycles per loop) :
> > > >
> > > > NOPs Jump over call generic Jump over call optimized
> > > > empty loop 1.50 1.17 2.50
> > > > memcpy 0.23 2.12 0.07
> > > >
> > > >
> > > >
> > > > Mathieu
> > >
> > > What about with two NOPs (".byte 0x66, 0x66, 0x90, 0x66, 0x90" - this
> > > should work with everything) or one (".byte 0x0f, 0x1f, 0x44, 0x00,
> > > 0x00" - AFAIK, this should work with P6 or newer).
> > >
> > > (Sorry, I should have mentioned this the first time.)
> > >
> >
> > Hi,
> >
> > The tests I made were with :
> > #define GENERIC_NOP1 ".byte 0x90\n"
> > #define GENERIC_NOP4 ".byte 0x8d,0x74,0x26,0x00\n"
> > #define GENERIC_NOP5 GENERIC_NOP1 GENERIC_NOP4
> >
> > Now with the tests you ask for :
> >
> > * Execute an empty loop
> > - 2 NOPs ".byte 0x66, 0x66, 0x90, 0x66, 0x90"
> > NR_LOOPS : 100000
> > time delta (cycles): 200190
> > cycles per loop : 2.00
> > cycles per loop for nops : 2.00-1.50 = 0.50
> >
> > - 1 NOP "0x0f, 0x1f, 0x44, 0x00, 0x00"
> > NR_LOOPS : 100000
> > time delta (cycles): 300172
> > cycles per loop : 3.00
> > cycles per loop for nops : 3.00-1.50 = 2.50
> >
> >
> > * Execute a loop of memcpy 4096 bytes
> > - 2 NOPs ".byte 0x66, 0x66, 0x90, 0x66, 0x90"
> > NR_LOOPS : 10000
> > time delta (cycles): 12981293
> > cycles per loop : 1298.13
> > cycles per loop for nops : 1298.16-1298.13=0.03
> >
> > - 1 NOP "0x0f, 0x1f, 0x44, 0x00, 0x00"
> > NR_LOOPS : 10000
> > time delta (cycles): 12985590
> > cycles per loop : 1298.56
> > cycles per loop for nops : 0.43
> >
>
> To summarize in chart form:
>
> JoC JoCo 2NOP 1NOP
> empty loop 1.17 2.50 0.50 2.50
> memcpy 2.12 0.07 0.03 0.43
>
> JoC = Jump over call - generic
> JoCo = Jump over call - optimized
> 2NOP = "data16 data16 nop; data16 nop"
> 1NOP = NOP with ModRM
>
> I left out your "nop; lea 0(%esi), %esi" because it isn't actually a NOP
> (the CPU will do actual work even if it has no effect, and on AMD64,
> that insn is "nop; lea 0(%rdi), %esi", which will truncate RDI+0 to fit
> 32-bits.)
>
> The performance of NOP with ModRM doesn't suprise me -- AFAIK, only the
> most recent of Intel CPUs actually special case that to be a true
> no-work-done NOP.
>
> It'd be nice to see the results of "jump to an out-of-line call with the
> jump replaced by a NOP", but even if it performs well (and it should,
> the argument passing and stack alignment overhead won't be executed in
> the disabled probe case), actually using it in practice would be
> difficult without compiler support (call instructions are easy to find
> thanks to their relocations, which local jumps don't have).
>

Hi,

Just to make sure we see things the same way : the JoC approach is similar to
the out-of-line call in that the argument passing and stack alignment are not
executed when the probe is disabled.

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2006-10-02 00:53:10

by Nicholas Miell

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

On Sun, 2006-10-01 at 20:07 -0400, Mathieu Desnoyers wrote:
> * Nicholas Miell ([email protected]) wrote:
> > To summarize in chart form:
> >
> > JoC JoCo 2NOP 1NOP
> > empty loop 1.17 2.50 0.50 2.50
> > memcpy 2.12 0.07 0.03 0.43
> >
> > JoC = Jump over call - generic
> > JoCo = Jump over call - optimized
> > 2NOP = "data16 data16 nop; data16 nop"
> > 1NOP = NOP with ModRM
> >
> > I left out your "nop; lea 0(%esi), %esi" because it isn't actually a NOP
> > (the CPU will do actual work even if it has no effect, and on AMD64,
> > that insn is "nop; lea 0(%rdi), %esi", which will truncate RDI+0 to fit
> > 32-bits.)
> >
> > The performance of NOP with ModRM doesn't suprise me -- AFAIK, only the
> > most recent of Intel CPUs actually special case that to be a true
> > no-work-done NOP.
> >
> > It'd be nice to see the results of "jump to an out-of-line call with the
> > jump replaced by a NOP", but even if it performs well (and it should,
> > the argument passing and stack alignment overhead won't be executed in
> > the disabled probe case), actually using it in practice would be
> > difficult without compiler support (call instructions are easy to find
> > thanks to their relocations, which local jumps don't have).
> >
>
> Hi,
>
> Just to make sure we see things the same way : the JoC approach is similar to
> the out-of-line call in that the argument passing and stack alignment are not
> executed when the probe is disabled.
>

Yeah, I assumed that.

For the jump-over-call, you'll always have to do a test and a
conditional jump (even when the probe is disabled), and that test takes
work and that conditional jump will consume "useless" space in the
predictor cache.

For an unconditional-call-replaced-by-NOP, you'll always be doing the
work involved in the setup and cleanup for a function call, but there's
no conditional branching (which is a win, as your test results
demonstrate).

For the ideal case, you'd have a single unconditional jump to an
out-of-line function call, which you'd replace with a single NOP. No
unnecessary work (beyond the NOP instruction itself) gets done in the
disabled probe case, and in the enabled case, you don't have to do any
tests to see if the probe should be run. It should be an improvement all
around, if we could just get gcc to do the hard part of replacing the
unconditional jump with a NOP for us.

--
Nicholas Miell <[email protected]>

2006-10-02 14:31:45

by Mathieu Desnoyers

[permalink] [raw]
Subject: [UPDATE] Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

Here are two minor corrections of my results :

* Mathieu Desnoyers ([email protected]) wrote:
[...]
> * Micro-benchmarks
[...]
> The following tests are done with the "optimized" markers only
>
[...]
> - Execute a loop with a marker enabled, with an empty probe. Var args argument
> setup, probe empty. No preemption disabling.
> NR_LOOPS : 100000
> time delta (cycles): 3363450
> cycles per loop : 33.63
- cycles per loop to disable preemption : 44.08-33.63=10.45
+ cycles per loop to disable preemption : 52.11-33.63=18.48
[...]
> * Size (x86)
>
> This is the size added by each marker to the memory image :
>
> - Optimized
>
> .text section : instructions
> Adds 6 bytes in the "likely" path.
> Adds 32 bytes in the "unlikely" path.
> .data section : r/w data
> 0 byte
+ 4 bytes for the call address
> .rodata.str1 : strings
> Length of the marker name
> .debug_str : strings (if loaded..)
> Length of the marker name + 7 bytes (__mark_)
> .markers
> 8 bytes (2 pointers)
> .markers.c
> 12 bytes (3 pointers)
>
> - Generic
>
> .text section : instructions
> Adds 11 bytes in the "likely" path.
> Adds 32 bytes in the "unlikely" path.
> .data section : r/w data
> 1 byte (the activation flag)
+ 4 bytes for the call address
> .rodata.str1 : strings
> Length of the marker name
> .debug_str : strings (if loaded..)
> Length of the marker name + 7 bytes (__mark_)
> .markers
> 8 bytes (2 pointers)
> .markers.c
> 12 bytes (3 pointers)
>


Mathieu


OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2006-10-02 15:26:47

by Jose R. Santos

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

Mathieu Desnoyers wrote:
> However, the performance impact for using a kprobe is non negligible when
> activated. Assuming that kprobes would have a mechanism to get the variables
> from the caller's stack, it would perform the same task in at least 4178.23
> cycles vs 55.74 for a marker and a probe (ratio : 75). While kprobes are very
> useful for the reason explained earlier, the high event rate paths in the kernel
> would clearly benefit from a marker mechanism when the are probed.
>

The problem now is how do we define "high event rate". This is
something that is highly dependent on the workload being run as well as
the system configuration for such workload. There are a lot of places
in the kernel that can be turned into high event rates with with the
right workload even though the may not represent 99% of most user cases.

I would guess that anything above 500 event/s per-CPU on several
realistic workloads is a good place to start.


> On the macro-benchmark side, no significant difference in performance has been
> found between the vanilla kernel and a kernel "marked" with the standard LTTng
> instrumentation.
>

Out of curiosity, how many cycles does it take to process a complete
LTTng event up until the point were it has been completely stored into
the trace buffer. Since this should take a lot more than 55.74 cycles,
it would be interesting to know at what event rate would a static marker
stop showing as big of a performance advantage compared to dynamic probing.

-JRS

2006-10-02 15:49:36

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

Hi Jose,

* Jose R. Santos ([email protected]) wrote:
>
> The problem now is how do we define "high event rate". This is
> something that is highly dependent on the workload being run as well as
> the system configuration for such workload. There are a lot of places
> in the kernel that can be turned into high event rates with with the
> right workload even though the may not represent 99% of most user cases.
>
> I would guess that anything above 500 event/s per-CPU on several
> realistic workloads is a good place to start.
>
Yes, it seems like a good starting point. But besides the event rate, just
having the most widely used events marked in the code should also be the
target. The markup mechanism serves two purposes :
1 - identify important events in a way that follows code change.
2 - speed up instrumentation.

>
> >On the macro-benchmark side, no significant difference in performance has
> >been
> >found between the vanilla kernel and a kernel "marked" with the standard
> >LTTng
> >instrumentation.
> >
>
> Out of curiosity, how many cycles does it take to process a complete
> LTTng event up until the point were it has been completely stored into
> the trace buffer. Since this should take a lot more than 55.74 cycles,
> it would be interesting to know at what event rate would a static marker
> stop showing as big of a performance advantage compared to dynamic probing.
>

In my OLS paper, I pointed out that, in its current state, LTTng took about 278
cycles on the same Pentium 4. I think I could lower that by implementing per-cpu
atomic operations (removing the LOCK prefix, as the data is not shared between
the CPUs; the atomic operations are only useful to protect from higher priority
execution contexts on the same CPU).

Regards,

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2006-10-08 19:31:29

by dean gaudet

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

On Sat, 30 Sep 2006, Mathieu Desnoyers wrote:

> - Optimized
>
> static int my_open(struct inode *inode, struct file *file)
> {
> 0: 55 push %ebp
> 1: 89 e5 mov %esp,%ebp
> 3: 83 ec 0c sub $0xc,%esp
> MARK(subsys_mark1, "%d %p", 1, NULL);
> 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
> 8: 84 c0 test %al,%al
> a: 75 07 jne 13 <my_open+0x13>

why not replace the mov+test with "xor %eax,%eax" and then change the 0x75
to a 0x74 to change from jne to je when you want to enable the marker?

i.e. disabled:

31 c0 xor %eax,%eax
75 07 jne 13

enabled:

31 c0 xor %eax,%eax
74 07 je 13

it would save 2 bytes, 1 instruction and avoid partial register writes...
and still has the nice property that a single byte store into the code is
required for enable/disable (which sounds like a great property -- i
assume you were deliberately going for that).

i assume there's probably no reason to tie the sequence to eax either --
you could let gcc choose it (or maybe you already do).

-dean

2006-10-08 19:40:31

by dean gaudet

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

On Sun, 8 Oct 2006, dean gaudet wrote:

> On Sat, 30 Sep 2006, Mathieu Desnoyers wrote:
>
> > - Optimized
> >
> > static int my_open(struct inode *inode, struct file *file)
> > {
> > 0: 55 push %ebp
> > 1: 89 e5 mov %esp,%ebp
> > 3: 83 ec 0c sub $0xc,%esp
> > MARK(subsys_mark1, "%d %p", 1, NULL);
> > 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
> > 8: 84 c0 test %al,%al
> > a: 75 07 jne 13 <my_open+0x13>
>
> why not replace the mov+test with "xor %eax,%eax" and then change the 0x75
> to a 0x74 to change from jne to je when you want to enable the marker?
>
> i.e. disabled:
>
> 31 c0 xor %eax,%eax
> 75 07 jne 13
>
> enabled:
>
> 31 c0 xor %eax,%eax
> 74 07 je 13

actually... why even destroy the register... i bet this is the best
choice:

39 c0 cmp %eax,%eax
75 07 jne 13

that satisfies the macro-op fusion rules on core2 as well... now you
shouldn't even ask gcc for a register you should just hardwire %eax so it
doesn't get confused (probably need to tell it "cc" is modified).

-dean

2006-10-10 13:17:36

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Performance analysis of Linux Kernel Markers 0.20 for 2.6.17

* dean gaudet ([email protected]) wrote:
> On Sat, 30 Sep 2006, Mathieu Desnoyers wrote:
>
> > - Optimized
> >
> > static int my_open(struct inode *inode, struct file *file)
> > {
> > 0: 55 push %ebp
> > 1: 89 e5 mov %esp,%ebp
> > 3: 83 ec 0c sub $0xc,%esp
> > MARK(subsys_mark1, "%d %p", 1, NULL);
> > 6: b0 00 mov $0x0,%al <-- immediate load 0 in al
> > 8: 84 c0 test %al,%al
> > a: 75 07 jne 13 <my_open+0x13>
>
> why not replace the mov+test with "xor %eax,%eax" and then change the 0x75
> to a 0x74 to change from jne to je when you want to enable the marker?
>
> i.e. disabled:
>
> 31 c0 xor %eax,%eax
> 75 07 jne 13
>
> enabled:
>
> 31 c0 xor %eax,%eax
> 74 07 je 13
>
> it would save 2 bytes, 1 instruction and avoid partial register writes...
> and still has the nice property that a single byte store into the code is
> required for enable/disable (which sounds like a great property -- i
> assume you were deliberately going for that).
>
> i assume there's probably no reason to tie the sequence to eax either --
> you could let gcc choose it (or maybe you already do).
>
Hi dean,

Yes, good idea. I effectively did the 1 byte write to make sure it could be
coherent for any code alignment on the machine. The only problem with modifying
the jne into a je is this : How can we take a pointer to this assembly
instruction when it is generated by gcc ? We can always search for an
instruction pattern after the inline assembly, but we have no guarantee that
gcc will not put any code between the inline asm and the actual jne.

Mathieu

OpenPGP public key: http://krystal.dyndns.org:8080/key/compudj.gpg
Key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68