2007-05-30 14:05:49

by Mathieu Desnoyers

[permalink] [raw]
Subject: [patch 9/9] Scheduler profiling - Use conditional calls

Use conditional calls with lower d-cache hit in optimized version as a
condition for scheduler profiling call.

Signed-off-by: Mathieu Desnoyers <[email protected]>

---
kernel/profile.c | 4 ++++
kernel/sched.c | 4 +++-
2 files changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/kernel/profile.c
===================================================================
--- linux-2.6-lttng.orig/kernel/profile.c 2007-05-30 08:42:29.000000000 -0400
+++ linux-2.6-lttng/kernel/profile.c 2007-05-30 08:45:22.000000000 -0400
@@ -23,6 +23,7 @@
#include <linux/profile.h>
#include <linux/highmem.h>
#include <linux/mutex.h>
+#include <linux/condcall.h>
#include <asm/sections.h>
#include <asm/semaphore.h>
#include <asm/irq_regs.h>
@@ -92,6 +93,8 @@
printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
prof_shift);
}
+ if (prof_on)
+ BUG_ON(cond_call_arm("profile_on"));
return 1;
}
__setup("profile=", profile_setup);
@@ -556,6 +559,7 @@
return 0;
out_cleanup:
prof_on = 0;
+ cond_call_disarm("profile_on");
smp_mb();
on_each_cpu(profile_nop, NULL, 0, 1);
for_each_online_cpu(cpu) {
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c 2007-05-30 08:43:02.000000000 -0400
+++ linux-2.6-lttng/kernel/sched.c 2007-05-30 08:45:22.000000000 -0400
@@ -59,6 +59,7 @@
#include <linux/kprobes.h>
#include <linux/delayacct.h>
#include <linux/reciprocal_div.h>
+#include <linux/condcall.h>

#include <asm/tlb.h>
#include <asm/unistd.h>
@@ -2990,7 +2991,8 @@
print_irqtrace_events(prev);
dump_stack();
}
- profile_hit(SCHED_PROFILING, __builtin_return_address(0));
+ cond_call(profile_on,
+ profile_hit(SCHED_PROFILING, __builtin_return_address(0)));

/*
* The idle thread is not allowed to schedule!

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2007-05-30 20:35:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

On Wed, 30 May 2007 10:00:34 -0400
Mathieu Desnoyers <[email protected]> wrote:

> @@ -2990,7 +2991,8 @@
> print_irqtrace_events(prev);
> dump_stack();
> }
> - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> + cond_call(profile_on,
> + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
>

That's looking pretty neat. Do you have any before-and-after performance
figures for i386 and for a non-optimised architecture?

2007-05-30 20:58:38

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
> Use conditional calls with lower d-cache hit in optimized version as a
> condition for scheduler profiling call.
[...]
> + if (prof_on)
> + BUG_ON(cond_call_arm("profile_on"));

What's the point of this BUG_ON()? The condition is a priori impossible.


On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
> - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> + cond_call(profile_on,
> + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));

It'd probably be prettier to stuff cond_call() inside a macro that does
something like this named profile_hit() with the old profile_hit()
renamed to __profile_hit() etc. Otherwise fine.


-- wli

2007-05-30 21:45:54

by Matt Mackall

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
> Use conditional calls with lower d-cache hit in optimized version as a
> condition for scheduler profiling call.
>
> Signed-off-by: Mathieu Desnoyers <[email protected]>
>
> ---
> kernel/profile.c | 4 ++++
> kernel/sched.c | 4 +++-
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6-lttng/kernel/profile.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/profile.c 2007-05-30 08:42:29.000000000 -0400
> +++ linux-2.6-lttng/kernel/profile.c 2007-05-30 08:45:22.000000000 -0400
> @@ -23,6 +23,7 @@
> #include <linux/profile.h>
> #include <linux/highmem.h>
> #include <linux/mutex.h>
> +#include <linux/condcall.h>
> #include <asm/sections.h>
> #include <asm/semaphore.h>
> #include <asm/irq_regs.h>
> @@ -92,6 +93,8 @@
> printk(KERN_INFO "kernel profiling enabled (shift: %ld)\n",
> prof_shift);
> }
> + if (prof_on)
> + BUG_ON(cond_call_arm("profile_on"));
> return 1;
> }
> __setup("profile=", profile_setup);
> @@ -556,6 +559,7 @@
> return 0;
> out_cleanup:
> prof_on = 0;
> + cond_call_disarm("profile_on");
> smp_mb();
> on_each_cpu(profile_nop, NULL, 0, 1);
> for_each_online_cpu(cpu) {
> Index: linux-2.6-lttng/kernel/sched.c
> ===================================================================
> --- linux-2.6-lttng.orig/kernel/sched.c 2007-05-30 08:43:02.000000000 -0400
> +++ linux-2.6-lttng/kernel/sched.c 2007-05-30 08:45:22.000000000 -0400
> @@ -59,6 +59,7 @@
> #include <linux/kprobes.h>
> #include <linux/delayacct.h>
> #include <linux/reciprocal_div.h>
> +#include <linux/condcall.h>
>
> #include <asm/tlb.h>
> #include <asm/unistd.h>
> @@ -2990,7 +2991,8 @@
> print_irqtrace_events(prev);
> dump_stack();
> }
> - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> + cond_call(profile_on,
> + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));

I think we could do better here pretty trivially. profile hit still
has an if (unlikely(prof_on == TYPE)). Shouldn't we just have a
cond_call type for "sched_profiling" and cond_call profile_hits(type,
ip, 1) directly?

--
Mathematics is the supreme nostalgia of our time.

2007-05-31 12:43:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

Mathieu Desnoyers <[email protected]> writes:
> }
> - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> + cond_call(profile_on,
> + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));

Would it be possible to use a syntax like

if (unlikely_cond_call(variable)) { (or better name)
...
}

instead? I think that would be much nicer to read than having
code in a macro argument

-Andi

2007-05-31 21:13:15

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* William Lee Irwin III ([email protected]) wrote:
> On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
> > Use conditional calls with lower d-cache hit in optimized version as a
> > condition for scheduler profiling call.
> [...]
> > + if (prof_on)
> > + BUG_ON(cond_call_arm("profile_on"));
>
> What's the point of this BUG_ON()? The condition is a priori impossible.
>

Not impossible: hash_add_cond_call() can return -ENOMEM if kmalloc lacks
memory.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-31 21:36:56

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* Matt Mackall ([email protected]) wrote:
> > - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > + cond_call(profile_on,
> > + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
>
> I think we could do better here pretty trivially. profile hit still
> has an if (unlikely(prof_on == TYPE)). Shouldn't we just have a
> cond_call type for "sched_profiling" and cond_call profile_hits(type,
> ip, 1) directly?
>

Yes, that's much better. I will leave the profile_hit as a simple call
to profile hits with one parameter for now, although it might be a
little bit useless. Being able to select independently which specific
site must be enabled is interesting too, therefore I will use the more
precise "sched_profiling" to replace "profile_on".

I guess we could also rework profile.c:profile_tick() in the same way,
using "cpu_profiling". It would be interesting to wrap the whole
profile_tick() call in a cond_call, but it would involve dealing with
more than one different things that are part of this function and can
have to run. The same will apply to kvm and sleep profiling.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-31 22:07:51

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* Andi Kleen ([email protected]) wrote:
> Mathieu Desnoyers <[email protected]> writes:
> > }
> > - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > + cond_call(profile_on,
> > + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
>
> Would it be possible to use a syntax like
>
> if (unlikely_cond_call(variable)) { (or better name)
> ...
> }
>
> instead? I think that would be much nicer to read than having
> code in a macro argument
>

I see your point, but there is a level of control on the branch I would
lack by doing so: the ability to put the call in either the if or else
branch. It is an optimization on i386.

I could do it by defining my home-made if() :

cond_if (cond_call_name) {
code
}

The macro cond_if could then expand (this is a simplified example) in either in

if (cond)

or

if (cond)
else

Also, I live in the expectation that, someday, the gcc guys will be nice
enough to add some kind of support for a nop-based jump that would
require code patching to put a jump instead. If it ever happens, my
macro could evolve into this for newer compiler versions, which I could
not do with the if() statement you are proposing.


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-05-31 22:33:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

> I see your point, but there is a level of control on the branch I would
> lack by doing so: the ability to put the call in either the if or else
> branch. It is an optimization on i386.

What does it optimize exactly?

> Also, I live in the expectation that, someday, the gcc guys will be nice
> enough to add some kind of support for a nop-based jump that would
> require code patching to put a jump instead. If it ever happens, my
> macro could evolve into this for newer compiler versions, which I could
> not do with the if() statement you are proposing.

If that ever happens we couldn't use it anyways because Linux still
has to support old compilers for a long time. And when those are dropped the
code could be updated.

-Andi

2007-05-31 23:41:36

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
>>> + if (prof_on)
>>> + BUG_ON(cond_call_arm("profile_on"));

* William Lee Irwin III ([email protected]) wrote:
>> What's the point of this BUG_ON()? The condition is a priori impossible.

On Thu, May 31, 2007 at 05:12:58PM -0400, Mathieu Desnoyers wrote:
> Not impossible: hash_add_cond_call() can return -ENOMEM if kmalloc lacks
> memory.

Shouldn't it just propagate the errors like anything else instead of
going BUG(), then? One can easily live without profiling if the profile
buffers should fail to be allocated e.g. due to memory fragmentation.

These things all have to handle errors for hotplugging anyway AIUI.


-- wli

2007-06-01 15:54:25

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* Andrew Morton ([email protected]) wrote:
> On Wed, 30 May 2007 10:00:34 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
> > @@ -2990,7 +2991,8 @@
> > print_irqtrace_events(prev);
> > dump_stack();
> > }
> > - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > + cond_call(profile_on,
> > + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
> >
>
> That's looking pretty neat. Do you have any before-and-after performance
> figures for i386 and for a non-optimised architecture?

Sure, here is the result of a small test comparing:
1 - Branch depending on a cache miss (has to fetch in memory, caused by a 128
bytes stride)). This is the test that is likely to look like what
side-effect the original profile_hit code was causing, under the
assumption that the kernel is already using L1 and L2 caches at
their full capacity and that a supplementary data load would cause
cache trashing.
2 - Branch depending on L1 cache hit. Just for comparison.
3 - Branch depending on a load immediate in the instruction stream.

It has been compiled with gcc -O2. Tests done on a 3GHz P4.

In the first test series, the branch is not taken:

number of tests : 1000
number of branches per test : 81920
memory hit cycles per iteration (mean) : 48.252
L1 cache hit cycles per iteration (mean) : 16.1693
instruction stream based test, cycles per iteration (mean) : 16.0432


In the second test series, the branch is taken and an integer is
incremented within the block:

number of tests : 1000
number of branches per test : 81920
memory hit cycles per iteration (mean) : 48.2691
L1 cache hit cycles per iteration (mean) : 16.396
instruction stream based test, cycles per iteration (mean) : 16.0441

Therefore, the memory fetch based test seems to be 200% slower than the
load immediate based test.

(I am adding these results to the documentation)

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-01 16:19:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

On Fri, 1 Jun 2007 11:54:13 -0400 Mathieu Desnoyers <[email protected]> wrote:

> * Andrew Morton ([email protected]) wrote:
> > On Wed, 30 May 2007 10:00:34 -0400
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> > > @@ -2990,7 +2991,8 @@
> > > print_irqtrace_events(prev);
> > > dump_stack();
> > > }
> > > - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > > + cond_call(profile_on,
> > > + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
> > >
> >
> > That's looking pretty neat. Do you have any before-and-after performance
> > figures for i386 and for a non-optimised architecture?
>
> Sure, here is the result of a small test comparing:
> 1 - Branch depending on a cache miss (has to fetch in memory, caused by a 128
> bytes stride)). This is the test that is likely to look like what
> side-effect the original profile_hit code was causing, under the
> assumption that the kernel is already using L1 and L2 caches at
> their full capacity and that a supplementary data load would cause
> cache trashing.
> 2 - Branch depending on L1 cache hit. Just for comparison.
> 3 - Branch depending on a load immediate in the instruction stream.
>
> It has been compiled with gcc -O2. Tests done on a 3GHz P4.
>
> In the first test series, the branch is not taken:
>
> number of tests : 1000
> number of branches per test : 81920
> memory hit cycles per iteration (mean) : 48.252
> L1 cache hit cycles per iteration (mean) : 16.1693
> instruction stream based test, cycles per iteration (mean) : 16.0432
>
>
> In the second test series, the branch is taken and an integer is
> incremented within the block:
>
> number of tests : 1000
> number of branches per test : 81920
> memory hit cycles per iteration (mean) : 48.2691
> L1 cache hit cycles per iteration (mean) : 16.396
> instruction stream based test, cycles per iteration (mean) : 16.0441
>
> Therefore, the memory fetch based test seems to be 200% slower than the
> load immediate based test.

Confused. From what did you calculate that 200%?

> (I am adding these results to the documentation)

Good, thanks.

2007-06-01 16:33:42

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* Andrew Morton ([email protected]) wrote:
> On Fri, 1 Jun 2007 11:54:13 -0400 Mathieu Desnoyers <[email protected]> wrote:
>
> > * Andrew Morton ([email protected]) wrote:
> > > On Wed, 30 May 2007 10:00:34 -0400
> > > Mathieu Desnoyers <[email protected]> wrote:
> > >
> > > > @@ -2990,7 +2991,8 @@
> > > > print_irqtrace_events(prev);
> > > > dump_stack();
> > > > }
> > > > - profile_hit(SCHED_PROFILING, __builtin_return_address(0));
> > > > + cond_call(profile_on,
> > > > + profile_hit(SCHED_PROFILING, __builtin_return_address(0)));
> > > >
> > >
> > > That's looking pretty neat. Do you have any before-and-after performance
> > > figures for i386 and for a non-optimised architecture?
> >
> > Sure, here is the result of a small test comparing:
> > 1 - Branch depending on a cache miss (has to fetch in memory, caused by a 128
> > bytes stride)). This is the test that is likely to look like what
> > side-effect the original profile_hit code was causing, under the
> > assumption that the kernel is already using L1 and L2 caches at
> > their full capacity and that a supplementary data load would cause
> > cache trashing.
> > 2 - Branch depending on L1 cache hit. Just for comparison.
> > 3 - Branch depending on a load immediate in the instruction stream.
> >
> > It has been compiled with gcc -O2. Tests done on a 3GHz P4.
> >
> > In the first test series, the branch is not taken:
> >
> > number of tests : 1000
> > number of branches per test : 81920
> > memory hit cycles per iteration (mean) : 48.252
> > L1 cache hit cycles per iteration (mean) : 16.1693
> > instruction stream based test, cycles per iteration (mean) : 16.0432
> >
> >
> > In the second test series, the branch is taken and an integer is
> > incremented within the block:
> >
> > number of tests : 1000
> > number of branches per test : 81920
> > memory hit cycles per iteration (mean) : 48.2691
> > L1 cache hit cycles per iteration (mean) : 16.396
> > instruction stream based test, cycles per iteration (mean) : 16.0441
> >
> > Therefore, the memory fetch based test seems to be 200% slower than the
> > load immediate based test.
>
> Confused. From what did you calculate that 200%?
>
> > (I am adding these results to the documentation)
>
> Good, thanks.


(48.2691-16.0441)/16.0441 = 2.00

Which means that it is 200% slower to run this test while fetching the
branch condition from main memory rather than using the load immediate.

We could also put it like this : the speedup of the load immediate over
the memory fetch is 3.

48.2691/16.0441 = 3.00

Is there a preferred way to present these results in the documentation ?

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-04 22:20:26

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* Andi Kleen ([email protected]) wrote:
> > I see your point, but there is a level of control on the branch I would
> > lack by doing so: the ability to put the call in either the if or else
> > branch. It is an optimization on i386.
>
> What does it optimize exactly?
>

Nicholas McGuire told me that the non common cases should be put in
else branches of if statements for i386. At the time, I did a quick test
that correlated what he said, but I seem to be unable to reproduce this
behavior now (maybe my code snippet is too simple?): I will then assume
that the likely/unlikely (builtin expects) tells everything that is
needed to gcc until further notice. Therefore, we can use the form :

if (cond_call(var)), as you proposed.

> > Also, I live in the expectation that, someday, the gcc guys will be nice
> > enough to add some kind of support for a nop-based jump that would
> > require code patching to put a jump instead. If it ever happens, my
> > macro could evolve into this for newer compiler versions, which I could
> > not do with the if() statement you are proposing.
>
> If that ever happens we couldn't use it anyways because Linux still
> has to support old compilers for a long time. And when those are dropped the
> code could be updated.
>

Agreed.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-06-04 22:21:15

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [patch 9/9] Scheduler profiling - Use conditional calls

* William Lee Irwin III ([email protected]) wrote:
> On Wed, May 30, 2007 at 10:00:34AM -0400, Mathieu Desnoyers wrote:
> >>> + if (prof_on)
> >>> + BUG_ON(cond_call_arm("profile_on"));
>
> * William Lee Irwin III ([email protected]) wrote:
> >> What's the point of this BUG_ON()? The condition is a priori impossible.
>
> On Thu, May 31, 2007 at 05:12:58PM -0400, Mathieu Desnoyers wrote:
> > Not impossible: hash_add_cond_call() can return -ENOMEM if kmalloc lacks
> > memory.
>
> Shouldn't it just propagate the errors like anything else instead of
> going BUG(), then? One can easily live without profiling if the profile
> buffers should fail to be allocated e.g. due to memory fragmentation.
>
> These things all have to handle errors for hotplugging anyway AIUI.
>

Cond call arming will not reserve memory anymore in the next release.
(hash table is gone).

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68