2002-09-23 06:50:25

by Con Kolivas

Subject: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results


One of those 2.96 compilers snuck in (my poor use of autocomplete). I take back
all the previous data, submit accurate results below, and apologise profusely
for ruining the signal-to-noise ratio on this list.

To rehash what the last results were _supposed_ to be and what these are: I have
identically configured 2.5.38 kernels compiled with either gcc2.95.3 or gcc3.2.
I then run these two kernels through the contest benchmark
(http://contest.kolivas.net), using ONLY gcc2.95.3 to run the benchmark itself.

Kernel Time CPU
NoLoad:
2.5.38 68.25 99%
2.5.38-gcc32 67.28 99%
Process Load:
2.5.38 71.60 95%
2.5.38-gcc32 70.86 94%
IO Half Load:
2.5.38 81.26 90%
2.5.38-gcc32 88.11 82%
IO Full Load:
2.5.38 170.21 42%
2.5.38-gcc32 230.77 30%
Mem Load:
2.5.38 104.22 70%
2.5.38-gcc32 104.97 70%

This time only the IO loads showed a statistically significant difference.

Terribly sorry about that previous mess.


Full logs:

2.5.38 (with gcc2.95.3)
noload Time: 68.25 CPU: 99% Major Faults: 204613 Minor Faults: 255906
process_load Time: 71.60 CPU: 95% Major Faults: 204019 Minor Faults: 255238
io_halfmem Time: 81.26 CPU: 90% Major Faults: 204019 Minor Faults: 255325
Was writing number 4 of a 112Mb sized io_load file after 90 seconds
io_fullmem Time: 170.21 CPU: 42% Major Faults: 204019 Minor Faults: 255272
Was writing number 6 of a 224Mb sized io_load file after 194 seconds
mem_load Time: 104.22 CPU: 70% Major Faults: 204120 Minor Faults: 256271

2.5.38 (with gcc 3.2)
noload Time: 67.28 CPU: 99% Major Faults: 205108 Minor Faults: 256153
process_load Time: 70.86 CPU: 94% Major Faults: 204019 Minor Faults: 254983
io_halfmem Time: 88.11 CPU: 82% Major Faults: 204019 Minor Faults: 255110
Was writing number 5 of a 112Mb sized io_load file after 99 seconds
io_fullmem Time: 230.77 CPU: 30% Major Faults: 204019 Minor Faults: 254998
Was writing number 11 of a 224Mb sized io_load file after 303 seconds
mem_load Time: 104.97 CPU: 70% Major Faults: 204208 Minor Faults: 255956

Con.


2002-09-23 07:37:00

by Ingo Molnar

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results


On Mon, 23 Sep 2002, Con Kolivas wrote:

> IO Full Load:
> 2.5.38 170.21 42%
> 2.5.38-gcc32 230.77 30%

> This time only the IO loads showed a statistically significant
> difference.

how many times are you running each test? You should run them at least
twice (ideally 3 times at least), to establish some sort of statistical
noise measure. Especially IO benchmarks tend to fluctuate very heavily
depending on various things - they are also very dependent on the initial
state - ie. how the pagecache happens to lay out, etc. Ie. a meaningful
measurement result would be something like:

IO Full Load:
2.5.38 170.21 +- 55.21 sec 42%
2.5.38-gcc32 230.77 +- 60.22 sec 30%

where the first column is the average of two measurements, the second
column is the delta of the two measurements divided by 2. This way we can
see the 'spread' of the results.
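
A minimal sketch of that computation, for illustration only (this is not part
of contest, and the run times below are hypothetical placeholders):

/*
 * Report mean +/- half the spread over repeated runs of the same kernel.
 * The times are hypothetical placeholders, in seconds.
 */
#include <stdio.h>

int main(void)
{
	double runs[] = { 171.3, 228.9, 190.4 };
	int n = sizeof(runs) / sizeof(runs[0]), i;
	double sum = 0.0, min = runs[0], max = runs[0];

	for (i = 0; i < n; i++) {
		sum += runs[i];
		if (runs[i] < min)
			min = runs[i];
		if (runs[i] > max)
			max = runs[i];
	}
	printf("%.2f +- %.2f sec\n", sum / n, (max - min) / 2.0);
	return 0;
}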

I simply cannot believe that gcc32 can produce any visible effect in any
of the IO benchmarks, the only explanation would be heavy fluctuation of
IO results.

Ingo

2002-09-23 10:25:13

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting Ingo Molnar <[email protected]>:

> On Mon, 23 Sep 2002, Con Kolivas wrote:
>
> > IO Full Load:
> > 2.5.38 170.21 42%
> > 2.5.38-gcc32 230.77 30%
>
> how many times are you running each test? You should run them at least
> twice (ideally 3 times at least), to establish some sort of statistical
> noise measure. Especially IO benchmarks tend to fluctuate very heavily
> depending on various things - they are also very dependent on the initial
> state - ie. how the pagecache happens to lay out, etc. Ie. a meaningful
> measurement result would be something like:

Yes you make a very valid point and something I've been stewing over privately
for some time. contest runs benchmarks in a fixed order with a "priming" compile
to try and get pagecaches etc back to some sort of baseline (I've been trying
hard to make the results accurate and repeatable).

Despite that, you're correct in assuming the IO load will fluctuate widely. My
initial tests show that noload and process_load (not surprisingly) vary very
little. Mem_load varies a little. IO Loads can vary wildly, and the worse the
average performance is, the greater the variation (I mean percentage variation
not just absolute).

> IO Full Load:
> 2.5.38 170.21 +- 55.21 sec 42%
> 2.5.38-gcc32 230.77 +- 60.22 sec 30%
>
> where the first column is the average of two measurements, the second
> column is the delta of the two measurements divided by 2. This way we can
> see the 'spread' of the results.

I'll create some results based on 3 runs soon.

> I simply cannot believe that gcc32 can produce any visible effect in any
> of the IO benchmarks, the only explanation would be heavy fluctuation of
> IO results.

Agreed. There probably is no statistically significant difference in the
different gcc versions.

Contest is very new and I appreciate any feedback I can get to make it as
worthwhile a benchmark as possible to those who know.

Con.

2002-09-23 10:58:51

by jw schultz

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> Quoting Ingo Molnar <[email protected]>:
>
> > On Mon, 23 Sep 2002, Con Kolivas wrote:
> >
> > > IO Full Load:
> > > 2.5.38 170.21 42%
> > > 2.5.38-gcc32 230.77 30%
> >
> > how many times are you running each test? You should run them at least
> > twice (ideally 3 times at least), to establish some sort of statistical
> > noise measure. Especially IO benchmarks tend to fluctuate very heavily
> > depending on various things - they are also very dependent on the initial
> > state - ie. how the pagecache happens to lay out, etc. Ie. a meaningful
> > measurement result would be something like:
>
> Yes you make a very valid point and something I've been stewing over privately
> for some time. contest runs benchmarks in a fixed order with a "priming" compile
> to try and get pagecaches etc back to some sort of baseline (I've been trying
> hard to make the results accurate and repeatable).
>
> Despite that, you're correct in assuming the IO load will fluctuate widely. My
> initial tests show that noload and process_load (not surprisingly) vary very
> little. Mem_load varies a little. IO Loads can vary wildly, and the worse the
> average performance is, the greater the variation (I mean percentage variation
> not just absolute).
>
> > IO Full Load:
> > 2.5.38 170.21 +- 55.21 sec 42%
> > 2.5.38-gcc32 230.77 +- 60.22 sec 30%
> >
> > where the first column is the average of two measurements, the second
> > column is the delta of the two measurements divided by 2. This way we can
> > see the 'spread' of the results.
>
> I'll create some results based on 3 runs soon.
>
> > I simply cannot believe that gcc32 can produce any visible effect in any
> > of the IO benchmarks, the only explanation would be heavy fluctuation of
> > IO results.
>
> Agreed. There probably is no statistically significant difference in the
> different gcc versions.
>
> Contest is very new and I appreciate any feedback I can get to make it as
> worthwhile a benchmark as possible to those who know.

What happened to the relative improvement (ratio against
baseline)? In this test it didn't matter much because the
baselines were almost identical, but in others lately, especially
between different platforms, it would have helped.

Perhaps someone who is a statistician could give Con a hand?
This looks like a good test, but Ingo is right. We need
p-values and/or confidence intervals, and enough runs to get
them to at least 90% if possible. I only know enough to
look at reported figures and say (non)random or (in)significant
correlation; I couldn't begin to calculate them. Of course we
don't want the measured data smothered in analytical data, just
enough to see if the numbers are meaningful.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-09-23 12:42:22

by Erik Andersen

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> Yes you make a very valid point and something I've been stewing over privately
> for some time. contest runs benchmarks in a fixed order with a "priming" compile
> to try and get pagecaches etc back to some sort of baseline (I've been trying
> hard to make the results accurate and repeatable).

It would sure be nice for this sort of test if there were
some sort of a "flush-all-caches" syscall...

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2002-09-23 12:54:56

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting Erik Andersen <[email protected]>:

> On Mon Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > Yes you make a very valid point and something I've been stewing over
> privately
> > for some time. contest runs benchmarks in a fixed order with a "priming"
> compile
> > to try and get pagecaches etc back to some sort of baseline (I've been
> trying
> > hard to make the results accurate and repeatable).
>
> It would sure be nice for this sortof test if there were
> some sort of a "flush-all-caches" syscall...

For the moment I think I'll also add a swapoff/swapon before each compile as
well (thanks Luuk for the suggestion). I'm still looking at the raw data to
figure out what to do.

Con.

2002-09-23 13:08:47

by Richard B. Johnson

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, 23 Sep 2002, Erik Andersen wrote:

> On Mon Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > Yes you make a very valid point and something I've been stewing over privately
> > for some time. contest runs benchmarks in a fixed order with a "priming" compile
> > to try and get pagecaches etc back to some sort of baseline (I've been trying
> > hard to make the results accurate and repeatable).
>
> It would sure be nice for this sortof test if there were
> some sort of a "flush-all-caches" syscall...
>
> -Erik

I think all you need to do is reload the code-segment register
and you end up flushing caches in ix86.


#
# This forces a cache-line refill by reloading the code segment (CS)
# register. This would normally slow things down. However,
# if I put this at the start of a procedure that suffers a cache-line
# refill within the procedure, it is possible to speed things up.
#
.section .text
.global	cflush
.type	cflush,@function

cflush:	pushl	%cs		# Put code segment on the stack
	pushl	$goto		# Put offset on the stack
	lret			# Do a 'long' return (reloads cs)
goto:	ret
.end




Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-23 13:22:58

by Ingo Molnar

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results


On Mon, 23 Sep 2002, Richard B. Johnson wrote:

> > It would sure be nice for this sortof test if there were
> > some sort of a "flush-all-caches" syscall...
>
> I think all you need to do is reload the code-segment register
> and you end up flushing caches in ix86.

i'm pretty sure what was meant was the flushing of the pagecache mainly.
The state of CPU caches does not really play a role in these several-minute
benchmarks; they are at most a few millisecs worth of CPU time to rebuild.

Ingo

2002-09-23 14:02:18

by Richard B. Johnson

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, 23 Sep 2002, Ingo Molnar wrote:

>
> On Mon, 23 Sep 2002, Richard B. Johnson wrote:
>
> > > It would sure be nice for this sortof test if there were
> > > some sort of a "flush-all-caches" syscall...
> >
> > I think all you need to do is reload the code-segment register
> > and you end up flushing caches in ix86.
>
> i'm pretty sure what was meant was the flushing of the pagecache mainly.
> The state of CPU caches does not really play in these several-minutes
> benchmarks, they are at most a few millisecs worth of CPU time to build.
>

Okay. Sorry about that.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-23 13:58:07

by Ryan Anderson

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> Quoting Ingo Molnar <[email protected]>:
> > On Mon, 23 Sep 2002, Con Kolivas wrote:
> >
> > how many times are you running each test? You should run them at least
> > twice (ideally 3 times at least), to establish some sort of statistical
> > noise measure. Especially IO benchmarks tend to fluctuate very heavily
> > depending on various things - they are also very dependent on the initial
> > state - ie. how the pagecache happens to lay out, etc. Ie. a meaningful
> > measurement result would be something like:
>
> Yes you make a very valid point and something I've been stewing over privately
> for some time. contest runs benchmarks in a fixed order with a "priming" compile
> to try and get pagecaches etc back to some sort of baseline (I've been trying
> hard to make the results accurate and repeatable).

Well, run contest once, discard the results. Run it 3 more times, and
you should have started the second, third and fourth runs with similar initial conditions.

Or you could run the contest 3 times, rebooting between each run....
(automating that is a little harder, of course.)

IANAS, however.

--

Ryan Anderson
sometimes Pug Majere

2002-09-23 14:13:17

by Ingo Molnar

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results


On Mon, 23 Sep 2002, Con Kolivas wrote:

> Agreed. There probably is no statistically significant difference in the
> different gcc versions.
>
> Contest is very new and I appreciate any feedback I can get to make it
> as worthwhile a benchmark as possible to those who know.

your measurements are really useful i think, and people like Andrew
started to watch those numbers - this is why at this point a bit more
effort can/should be taken to filter out fluctuations better. Ie. a single
fluctuation could send Andrew out on a wild goose chase while perhaps in
reality his kernel was the fastest. Running every test twice should at
least give a ballpark figure wrt. fluctuations, without increasing the
runtime unrealistically.

i agree that only the IO benchmarks are problematic from this POV - things
like the process load and your other CPU-saturating numbers look perfectly
valid.

obviously another concern is to make testing not take days to accomplish.
This i think is one of the hardest things - making timely measurements
which are still meaningful and provide stable results.

Ingo

2002-09-23 14:08:45

by Richard B. Johnson

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, 23 Sep 2002, Ryan Anderson wrote:

> On Mon, Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > Quoting Ingo Molnar <[email protected]>:
> > > On Mon, 23 Sep 2002, Con Kolivas wrote:
> > >
> > > how many times are you running each test? You should run them at least
> > > twice (ideally 3 times at least), to establish some sort of statistical
> > > noise measure. Especially IO benchmarks tend to fluctuate very heavily
> > > depending on various things - they are also very dependent on the initial
> > > state - ie. how the pagecache happens to lay out, etc. Ie. a meaningful
> > > measurement result would be something like:
> >
> > Yes you make a very valid point and something I've been stewing over privately
> > for some time. contest runs benchmarks in a fixed order with a "priming" compile
> > to try and get pagecaches etc back to some sort of baseline (I've been trying
> > hard to make the results accurate and repeatable).
>
> Well, run contest once, discard the results. Run it 3 more times, and
> you should have started the second, third and fourth runs with similar initial conditions.
>
> Or you could run the contest 3 times, rebooting between each run....
> (automating that is a little harder, of course.)
>
> IANAS, however.
>

(1) Obtain statistics from a number of runs.
(2) Throw away the smallest and largest.
(3) Average whatever remains.

This works for many "real-world" things because it removes noise-spikes
that could unfairly poison the average.
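
A minimal sketch of that procedure, for illustration (a hypothetical helper,
not part of contest; the sample times are made up):

#include <stdio.h>

/* Drop the single smallest and largest run, then average the rest. */
static double trimmed_mean(const double *t, int n)
{
	double sum = 0.0, min = t[0], max = t[0];
	int i;

	for (i = 0; i < n; i++) {
		sum += t[i];
		if (t[i] < min)
			min = t[i];
		if (t[i] > max)
			max = t[i];
	}
	return (sum - min - max) / (n - 2);	/* needs n > 2 */
}

int main(void)
{
	double runs[] = { 81.3, 88.1, 79.9, 120.4, 84.0 };	/* made up */

	printf("trimmed mean: %.2f\n", trimmed_mean(runs, 5));
	return 0;
}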

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-23 14:19:40

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting "Richard B. Johnson" <[email protected]>:

> On Mon, 23 Sep 2002, Ryan Anderson wrote:
>
> > On Mon, Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > > Quoting Ingo Molnar <[email protected]>:
> > > > On Mon, 23 Sep 2002, Con Kolivas wrote:
> > > >
> > > > how many times are you running each test? You should run them at
> least
> > > > twice (ideally 3 times at least), to establish some sort of
> statistical
> > > > noise measure. Especially IO benchmarks tend to fluctuate very
> heavily
> > > > depending on various things - they are also very dependent on the
> initial
> > > > state - ie. how the pagecache happens to lay out, etc. Ie. a
> meaningful
> > > > measurement result would be something like:
> > >
> > > Yes you make a very valid point and something I've been stewing over
> privately
> > > for some time. contest runs benchmarks in a fixed order with a "priming"
> compile
> > > to try and get pagecaches etc back to some sort of baseline (I've been
> trying
> > > hard to make the results accurate and repeatable).
> >
> > Well, run contest once, discard the results. Run it 3 more times, and
> > you should have started the second, third and fourth runs with similar
> initial conditions.
> >
> > Or you could run the contest 3 times, rebooting between each run....
> > (automating that is a little harder, of course.)
> >
> > IANAS, however.
> >
>
> (1) Obtain statistics from a number of runs.
> (2) Throw away the smallest and largest.
> (3) Average whatever remains.
>
> This works for many "real-world" things because it removes noise-spikes
> that could unfairly poison the average.

That is the system I was considering. I just need to run enough benchmarks to
make this worthwhile though. That means about 5 for each it seems - which may
take me a while. A basic mean will suffice for a measure of central tendency. I
also need to quote some measure of variability. Standard deviation?

Con

2002-09-23 14:29:24

by Jakub Jelinek

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Tue, Sep 24, 2002 at 12:24:49AM +1000, Con Kolivas wrote:
> That is the system I was considering. I just need to run enough benchmarks to
> make this worthwhile though. That means about 5 for each it seems - which may
> take me a while. A basic mean will suffice for a measure of central tendency. I
> also need to quote some measure of variability. Standard deviation?

BTW: Have you tried gcc 3.2 with, say, -finline-limit=2000 too?
By default gcc 3.2 has a smaller inlining cutoff for usual C code, so the IO
difference might well be because some important but big function was
inlined by 2.95.x and not by 3.2.x. On the other hand, there is
__attribute__((always_inline)), which you can use to tell gcc you don't
want any cutoff for a particular function.
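
For illustration, a hedged sketch of using that attribute (the function below
is made up and not from the kernel); raising the cutoff globally with
-finline-limit=2000, as suggested above, avoids annotating individual
functions:

/* Hypothetical helper: always_inline tells gcc 3.2 to inline it regardless
 * of the -finline-limit cutoff that would otherwise apply. */
static inline __attribute__((always_inline)) unsigned long
hypothetical_fast_path(unsigned long x)
{
	return (x << 1) | 1;
}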

Jakub

2002-09-23 14:31:13

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting Ingo Molnar <[email protected]>:

>
> On Mon, 23 Sep 2002, Con Kolivas wrote:
>
> > Agreed. There probably is no statistically significant difference in the
> > different gcc versions.
> >
> > Contest is very new and I appreciate any feedback I can get to make it
> > as worthwhile a benchmark as possible to those who know.
>
> your measurements are really useful i think, and people like Andrew

Thank you. I was beginning to wonder about this.

> started to watch those numbers - this is why at this point a bit more
> effort can/should be taken to filter out fluctuations better. Ie. a single
> fluctuation could send Andrew out on a wild goose chase while perhaps in
> reality his kernel was the fastest. Running every test twice should at
> least give a ballpart figure wrt. fluctuations, without increasing the
> runtime unrealistically.

Absolutely. In my real profession I deal with statistics all the time so I'm
acutely aware of the problem.

> i agree that only the IO benchmarks are problematic from this POV - things
> like the process load and your other CPU-saturating numbers look perfectly
> valid.

Yes, the IO load is proving to be a pain and I'm afraid it will take numerous
measurements to get some idea of the real average. So far I think the trends in
the results I've reported are still correct. The variability, though, is
another matter.

> obviously another concern to to make testing not take days to accomplish.
> This i think is one of the hardest things - making timely measurements
> which are still meaningful and provide stable results.

I know. I have already had complaints about some of the changes I've made to
make results reproducible; the situation will only get worse :(

Con

2002-09-23 14:35:57

by Richard B. Johnson

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Tue, 24 Sep 2002, Con Kolivas wrote:

> Quoting "Richard B. Johnson" <[email protected]>:
>
> > On Mon, 23 Sep 2002, Ryan Anderson wrote:
> >
> > > On Mon, Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > > > Quoting Ingo Molnar <[email protected]>:
> > > > > On Mon, 23 Sep 2002, Con Kolivas wrote:
> > > > >
> > > > > how many times are you running each test? You should run them at
> > least
> > > > > twice (ideally 3 times at least), to establish some sort of
> > statistical
> > > > > noise measure. Especially IO benchmarks tend to fluctuate very
> > heavily
> > > > > depending on various things - they are also very dependent on the
> > initial
> > > > > state - ie. how the pagecache happens to lay out, etc. Ie. a
> > meaningful
> > > > > measurement result would be something like:
> > > >
> > > > Yes you make a very valid point and something I've been stewing over
> > privately
> > > > for some time. contest runs benchmarks in a fixed order with a "priming"
> > compile
> > > > to try and get pagecaches etc back to some sort of baseline (I've been
> > trying
> > > > hard to make the results accurate and repeatable).
> > >
> > > Well, run contest once, discard the results. Run it 3 more times, and
> > > you should have started the second, third and fourth runs with similar
> > initial conditions.
> > >
> > > Or you could run the contest 3 times, rebooting between each run....
> > > (automating that is a little harder, of course.)
> > >
> > > IANAS, however.
> > >
> >
> > (1) Obtain statistics from a number of runs.
> > (2) Throw away the smallest and largest.
> > (3) Average whatever remains.
> >
> > This works for many "real-world" things because it removes noise-spikes
> > that could unfairly poison the average.
>
> That is the system I was considering. I just need to run enough benchmarks to
> make this worthwhile though. That means about 5 for each it seems - which may
> take me a while. A basic mean will suffice for a measure of central tendency. I
> also need to quote some measure of variability. Standard deviation?
>
> Con
>
> .... Standard deviation?
^^^^^^^^^^^^^^^^^^^

Yes I like that, but does this measure "goodness of the test" or
something else? To make myself clear, let's look at some ridiculous
extreme condition. Your test really takes 1 second, but during your
tests there is a ping-flood that causes your test to take an hour.
Since the ping-flood is continuous, it smoothes out the noise of
your one-second test, making it 1/3600 of its true value. The
standard deviation looks very good but instead of showing that
your measurements were "good", it really shows that they are "bad".

I think a goodness-of-the-test indicator relates to the ratio of
the faster:slower tests. I don't know what you would call this, but
if your average was generated by 3 fast tests plus 1 slow test, it
would indicate a better "goodness" than 1 fast test and 3 slow ones.
It shows that external effects are not influencing the test results
as much with the "more-good" goodness.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.

2002-09-23 18:31:34

by Andrew Morton

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Erik Andersen wrote:
>
> On Mon Sep 23, 2002 at 08:30:21PM +1000, Con Kolivas wrote:
> > Yes you make a very valid point and something I've been stewing over privately
> > for some time. contest runs benchmarks in a fixed order with a "priming" compile
> > to try and get pagecaches etc back to some sort of baseline (I've been trying
> > hard to make the results accurate and repeatable).
>
> It would sure be nice for this sortof test if there were
> some sort of a "flush-all-caches" syscall...
>

Yes, it would be nice.

Unmounting and remounting the test filesystem is usually
sufficient. Or you can run

#include <stdlib.h>
#include <string.h>

int main(void)
{
	/* dirty ~1GB of anonymous memory to push existing pagecache out */
	memset(malloc(1024*1024*1024), 0, 1024*1024*1024);
}

a couple of times.

2002-09-23 19:08:18

by Måns Rullgård

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Jakub Jelinek <[email protected]> writes:

> BTW: Have you tried gcc 3.2 with say -finline-limit=2000 too?
> By default gcc 3.2 has for usual C code smaller inlining cutoff, so the IO
> difference might as well be because some important, but big function was
> inlined by 2.95.x and not by 3.2.x. On the other side there is
> __attribute__((always_inline)) which you can use to tell gcc you don't
> want any cutoff for a particular function.

How about using -Winline?

--
Måns Rullgård
[email protected]

2002-09-23 19:32:08

by Oliver Xymoron

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Tue, Sep 24, 2002 at 12:24:49AM +1000, Con Kolivas wrote:
>
> That is the system I was considering. I just need to run enough
> benchmarks to make this worthwhile though. That means about 5 for
> each it seems - which may take me a while. A basic mean will suffice
> for a measure of central tendency. I also need to quote some measure
> of variability. Standard deviation?

No, standard deviation is inappropriate here. We have no reason to
expect the distribution of problem cases to be normal or even smooth.
What we'd really like is range and mean. Don't throw out the outliers
either, the pathological cases are of critical interest.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-09-23 21:43:05

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting Oliver Xymoron <[email protected]>:

> On Tue, Sep 24, 2002 at 12:24:49AM +1000, Con Kolivas wrote:
> >
> > That is the system I was considering. I just need to run enough
> > benchmarks to make this worthwhile though. That means about 5 for
> > each it seems - which may take me a while. A basic mean will suffice
> > for a measure of central tendency. I also need to quote some measure
> > of variability. Standard deviation?
>
> No, standard deviation is inappropriate here. We have no reason to
> expect the distribution of problem cases to be normal or even smooth.
> What we'd really like is range and mean. Don't throw out the outliers
> either, the pathological cases are of critical interest.

Yes. Definitely the outliers appear to make the difference to the results. The
mean and range appear to be the most important when examining this data. The
only purpose in quoting other figures would be for inferential statistics to
determine whether there is a statistically significant difference between the
groups. My overnight benchmarking has generated a few results and I will publish
something soon.

Con.

2002-09-24 01:07:27

by jw schultz

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Tue, Sep 24, 2002 at 07:47:45AM +1000, Con Kolivas wrote:
> Quoting Oliver Xymoron <[email protected]>:
>
> > On Tue, Sep 24, 2002 at 12:24:49AM +1000, Con Kolivas wrote:
> > >
> > > That is the system I was considering. I just need to run enough
> > > benchmarks to make this worthwhile though. That means about 5 for
> > > each it seems - which may take me a while. A basic mean will suffice
> > > for a measure of central tendency. I also need to quote some measure
> > > of variability. Standard deviation?
> >
> > No, standard deviation is inappropriate here. We have no reason to
> > expect the distribution of problem cases to be normal or even smooth.
> > What we'd really like is range and mean. Don't throw out the outliers
> > either, the pathological cases are of critical interest.
>
> Yes. Definitely the outliers appear to make the difference to the results. The
> mean and range appear to be the most important on examining this data. The only
> purpose to quoting other figures would be for inferential statistics to
> determine if there is a statistically significant difference to the groups. My
> overnight benchmarking has generated a few results and I will publish something
> soon.

Happy am i to be wrong in suggesting you would benefit from
the help of a statistician. My apologies.

Sounds like we are getting to relative performance and
confidence intervals (much better than +/- x), which would be
useful for those doing performance improvements and for those of us
who must tune or are watching the improvements take place.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-09-24 02:40:41

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting Mark Hahn <[email protected]>:

> > Yes. Definitely the outliers appear to make the difference to the results.
> The
>
> best score is clearly the most important, along with some measure of
> spread.
>
> best-worst is a lousy measure of spread; stdev is not bad for that
> (or closely related measures like absdev, or stdev-from-mean, etc.)
>
> for contests, best is definitely the first score you want.
>

Normally yes. This is quite different. We want to know if there can be periods
where the machine is busy doing file IO to the exclusion of everything else. If
anything, the worst is the measure we want. Even the worst performing kernels
I've tried can have the occasional very good score, but look at these results
as I've presented them in the follow-up and you'll see what I mean:

from the new thread I've started entitled
[BENCHMARK] Statistical representation of IO load results with contest

[...SNIP]
n=5 for number of samples

Kernel Mean CI(95%)
2.5.38 411 344-477
2.5.39-gcc32 371 224-519
2.5.38-mm2 95 84-105


The mean is a simple average of the results, and the CI(95%) is the 95%
confidence interval: the mean lies between those numbers. These numbers seem to
be the most useful for comparison.

Comparing 2.5.38 (gcc2.95.3) with 2.5.38 (gcc3.2) there is NO significant
difference (p=0.56)

Comparing 2.5.38 with 2.5.38-mm2 there is a significant difference (p<0.001)
[SNIP...]

When I've run dozens of tests previously on the same kernel, I've found that
even with a mean of 400, the occasional value of 80 will come up. Clearly this
lowest score does not give us the information we need.
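
The exact test used isn't stated here; as a rough illustration, a plain
Student-t 95% confidence interval for n=5 samples could be computed along
these lines (the sample values below are hypothetical, not the real data):

#include <math.h>
#include <stdio.h>

int main(void)
{
	double x[] = { 400, 430, 360, 390, 420 };	/* hypothetical times */
	double t95 = 2.776;	/* t value for n-1 = 4 degrees of freedom */
	double mean = 0.0, var = 0.0, half;
	int n = 5, i;

	for (i = 0; i < n; i++)
		mean += x[i];
	mean /= n;
	for (i = 0; i < n; i++)
		var += (x[i] - mean) * (x[i] - mean);
	var /= (n - 1);				/* sample variance */

	half = t95 * sqrt(var / n);		/* half-width of the 95% CI */
	printf("mean %.0f, CI(95%%) %.0f-%.0f\n", mean, mean - half, mean + half);
	return 0;
}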

Con.

2002-09-24 02:56:16

by Andrew Morton

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Con Kolivas wrote:
>
> ...
> n=5 for number of samples
>
> Kernel Mean CI(95%)
> 2.5.38 411 344-477
> 2.5.39-gcc32 371 224-519
> 2.5.38-mm2 95 84-105
>
> The mean is a simple average of the results, and the CI(95%) are the 95%
> confidence intervals the mean lies between those numbers. These numbers seem to
> be the most useful for comparison.
>
> Comparing 2.5.38(gcc2.95.3) with 2.5.38(gcc3.2) there is NO significant
> difference (p 0.56)
>
> Comparing 2.5.38 with 2.5.38-mm2 there is a significant diffence (p<0.001)
> [SNIP...]
>
> when I've run dozens of tests previously on the same kernel I've found that even
> with a mean of 400 rarely a value of 80 will come up. Clearly this lowest score
> does not give us the information we need.
>

I think this is really going way too far. I mean, the datum which
we take away from the above result is that 2.5.38 sucks. No more
accuracy is required.

Yes, if the differences are small then a few extra runs may be needed
to drill down into the finer margins. The tester should be able to
judge that during the test. You get a feel for these things.

I believe that your time would be better spent developing and incorporating
more tests (wider coverage) than worrying about super-high accuracy.

(And if there's more than a 1% variation between same kernel, compiled
with different compilers then the test is bust. Kernel CPU time is
dominated by cache misses and runtime is dominated by IO wait.
Quality of code generation is of tiny significance)

2002-09-24 08:53:41

by Denis Vlasenko

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On 24 September 2002 01:01, Andrew Morton wrote:
> (And if there's more than a 1% variation between same kernel, compiled
> with different compilers then the test is bust. Kernel CPU time is
> dominated by cache misses and runtime is dominated by IO wait.
> Quality of code generation is of tiny significance)

Well, not exactly. If it is true that Intel/MS compilers beat GCC
by 30% on code size, 30% smaller kernel ought to make some difference.

However, that will become a GCC code quality benchmark then.
--
vda

2002-09-24 09:13:39

by Jan Hudec

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, Sep 23, 2002 at 06:12:34PM -0700, jw schultz wrote:
> On Tue, Sep 24, 2002 at 07:47:45AM +1000, Con Kolivas wrote:
> > Quoting Oliver Xymoron <[email protected]>:
> >
> > > On Tue, Sep 24, 2002 at 12:24:49AM +1000, Con Kolivas wrote:
> > > >
> > > > That is the system I was considering. I just need to run enough
> > > > benchmarks to make this worthwhile though. That means about 5 for
> > > > each it seems - which may take me a while. A basic mean will suffice
> > > > for a measure of central tendency. I also need to quote some measure
> > > > of variability. Standard deviation?
> > >
> > > No, standard deviation is inappropriate here. We have no reason to
> > > expect the distribution of problem cases to be normal or even smooth.
> > > What we'd really like is range and mean. Don't throw out the outliers
> > > either, the pathological cases are of critical interest.
> >
> > Yes. Definitely the outliers appear to make the difference to the results. The
> > mean and range appear to be the most important on examining this data. The only
> > purpose to quoting other figures would be for inferential statistics to
> > determine if there is a statistically significant difference to the groups. My
> > overnight benchmarking has generated a few results and I will publish something
> > soon.
>
> Happy am i to be wrong in suggesting you would benefit from
> the help of a statistician. My apologies.
>
> Sounds like we are getting to relative performance and
> confidence interval (much bettern than +/- x) which would be
> useful for those doing performance improvements and for us
> who must tune or are watching the improvments take place.

There is no reason why separate tests should be distributed normally.
But according to the central limit theorem, the distribution of the mean
converges to normal with an increasing number of tests. So the standard
deviation will tell us to what precision we can trust the mean, that is, it
lets us compute the confidence interval.

We should have a bit more than 3 tests (first run can't be considered,
it has different starting conditions). About 5 would do, 10 would be
perfect.

I would like to see the complete set of results anyway. There may be
some more interesting things to compute.

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>

2002-09-24 09:21:40

by Con Kolivas

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

Quoting Denis Vlasenko <[email protected]>:

> On 24 September 2002 01:01, Andrew Morton wrote:
> > (And if there's more than a 1% variation between same kernel, compiled
> > with different compilers then the test is bust. Kernel CPU time is
> > dominated by cache misses and runtime is dominated by IO wait.
> > Quality of code generation is of tiny significance)
>
> Well, not exactly. If it is true that Intel/MS compilers beat GCC
> by 30% on code size, 30% smaller kernel ought to make some difference.
>
> However, that will become a GCC code quality benchmark then.

Great, well, if someone has access to one of these compilers and can successfully
compile me a kernel using my .config, I'd love to benchmark it for them.

Con.

2002-09-24 09:26:31

by Denis Vlasenko

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On 24 September 2002 07:26, Con Kolivas wrote:
> Quoting Denis Vlasenko <[email protected]>:
> > On 24 September 2002 01:01, Andrew Morton wrote:
> > > (And if there's more than a 1% variation between same kernel, compiled
> > > with different compilers then the test is bust. Kernel CPU time is
> > > dominated by cache misses and runtime is dominated by IO wait.
> > > Quality of code generation is of tiny significance)
> >
> > Well, not exactly. If it is true that Intel/MS compilers beat GCC
> > by 30% on code size, 30% smaller kernel ought to make some difference.
> >
> > However, that will become a GCC code quality benchmark then.
>
> Great well if someone has access to one of these compilers and can
> successfully compile me a kernel using my .config I'd love to benchmark it
> for them.

No, they can't compile the kernel, it's too GCC-centric.
Worse, they're not open source.
--
vda

2002-09-24 09:28:57

by Jan Hudec

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, Sep 23, 2002 at 08:01:20PM -0700, Andrew Morton wrote:
> Con Kolivas wrote:
> >
> > ...
> > n=5 for number of samples
> >
> > Kernel Mean CI(95%)
> > 2.5.38 411 344-477
> > 2.5.39-gcc32 371 224-519
> > 2.5.38-mm2 95 84-105
> >
> > The mean is a simple average of the results, and the CI(95%) are the 95%
> > confidence intervals the mean lies between those numbers. These numbers seem to
> > be the most useful for comparison.
> >
> > Comparing 2.5.38(gcc2.95.3) with 2.5.38(gcc3.2) there is NO significant
> > difference (p 0.56)
> >
> > Comparing 2.5.38 with 2.5.38-mm2 there is a significant diffence (p<0.001)
> > [SNIP...]
> >
> > when I've run dozens of tests previously on the same kernel I've found that even
> > with a mean of 400 rarely a value of 80 will come up. Clearly this lowest score
> > does not give us the information we need.
> >
>
> I think this is really going way too far. I mean, the datum which
> we take away from the above result is that 2.5.38 sucks. No more
> accuracy is required.

5 samples is about the minimum where you can compute the confidence
intervals and trust them to a reasonable extent. We see that it's enough
samples to compare to the -mm2. But we can't say anything about the
gcc32-compiled version.

> Yes, if the differences are small then a few extra runs may be needed
> to drill down into the finer margins. The tester should be able to
> judge that during the test. You get a feel for these things.
>
> I believe that your time would be better spent developing and incorporating
> more tests (wider coverage) than worrying about super-high accuracy.

Accuracy is increased by just getting more samples. More tests are of
course important, but each must be run enough times that the results are
statistically significant.

> (And if there's more than a 1% variation between same kernel, compiled
> with different compilers then the test is bust. Kernel CPU time is
> dominated by cache misses and runtime is dominated by IO wait.
> Quality of code generation is of tiny significance)

So we will need a lot of runs to see if there is a difference...

-------------------------------------------------------------------------------
Jan 'Bulb' Hudec <[email protected]>

2002-09-24 15:34:30

by Mark Hahn

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

> > (And if there's more than a 1% variation between same kernel, compiled
> > with different compilers then the test is bust. Kernel CPU time is
> > dominated by cache misses and runtime is dominated by IO wait.
> > Quality of code generation is of tiny significance)
>
> Well, not exactly. If it is true that Intel/MS compilers beat GCC
> by 30% on code size, 30% smaller kernel ought to make some difference.

if you think that's true, then have you tried a modern GCC with -Os?

afaict, this comparison of gcc's is primarily interesting because it might
show up either some misoptimizations or perhaps semantic problems in the
kernel (ie, perhaps violations of strict aliasing).
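
As a made-up illustration of the kind of strict-aliasing problem meant here
(this is not actual kernel code):

unsigned long word;

unsigned short read_low_half(void)
{
	/* Reading 'word' through an incompatible pointer type violates C's
	 * aliasing rules; with -fstrict-aliasing the compiler may assume the
	 * load cannot alias the store and reorder or cache it. */
	unsigned short *p = (unsigned short *)&word;

	word = 0x12345678UL;
	return *p;
}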

2002-09-24 21:33:48

by Bill Davidsen

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, 23 Sep 2002, Richard B. Johnson wrote:

> Yes I like that, but does this measure "goodness of the test" or
> something else? To make myself clear, let's look at some ridiculous
> extreme condition. Your test really takes 1 second, but during your
> tests there is a ping-flood that causes your test to take an hour.

If you run in single user mode as suggested, that's pretty unlikely. I
would think having the power go off and your laptop dropping into power-save
slow mode is more likely ;-)

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-24 21:30:23

by Bill Davidsen

Subject: Re: [BENCHMARK] Corrected gcc3.2 v gcc2.95.3 contest results

On Mon, 23 Sep 2002, Ingo Molnar wrote:

>
> On Mon, 23 Sep 2002, Con Kolivas wrote:
>
> > IO Full Load:
> > 2.5.38 170.21 42%
> > 2.5.38-gcc32 230.77 30%
>
> > This time only the IO loads showed a statistically significant
> > difference.
>
> how many times are you running each test? You should run them at least
> twice (ideally 3 times at least), to establish some sort of statistical
> noise measure. Especially IO benchmarks tend to fluctuate very heavily
> depending on various things - they are also very dependent on the initial
> state - ie. how the pagecache happens to lay out, etc. Ie. a meaningful
> measurement result would be something like:

Do note that the instructions for the benchmark suggest you boot single
user, which cuts down one problem. And since Con adopted my suggestion to
allow the user to set the location of the test file, I put the big file in
a filesystem which is formatted just before the test (I knew I'd find a
use for all that disk ;-), so that stays pretty constant.

The problem of memory size on the halfmem io is more serious: on a large
system the writes are all in memory, on a small system they cause
thrashing. I run in 256m for all tests just for this reason.

Not disagreeing with what you said, but the test is not inherently subject
to much jitter given care in running it.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.