2009-11-26 08:00:46

by tip-bot for Ma Ling

Subject: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

From: Ma Ling <[email protected]>

Hi All

The current default kernel configuration prefers Os to O2. Os noticeably
reduces the size of the compiled kernel code, while O2 favors performance over
code size. The usual argument is that in a real environment O2 causes more
i-cache misses than Os, so overall performance should be worse.

On our test machine the kernel code size is 12M with Os and 14M with O2.

But we question this reasoning on the latest platforms, for two reasons:
1. Even 10% of the current kernel code size from Os (the CPU execution path)
is far larger than the L1 i-cache, so the difference in i-cache miss counts
between the two options should be small.
2. Our latest platforms should have excellent prefetch capability, guided by
the predicted execution path.
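
As a rough check of point 1, the arithmetic can be sketched as below. The 32 KiB L1 instruction cache size is an assumption (it is not stated in this mail), and "12M" is read as 12 MiB.

```python
# Rough check of point 1: hot kernel code versus L1 i-cache capacity.
# Assumptions (not from the mail): 32 KiB L1 i-cache, "12M" means 12 MiB.
KERNEL_CODE_OS_BYTES = 12 * 1024 * 1024
HOT_FRACTION = 0.10            # "10% * current kernel code size" from the mail
L1_ICACHE_BYTES = 32 * 1024    # assumed, not measured

hot_bytes = KERNEL_CODE_OS_BYTES * HOT_FRACTION
ratio = hot_bytes / L1_ICACHE_BYTES
print(f"hot code: {hot_bytes / 1024:.0f} KiB, {ratio:.1f}x the L1 i-cache")
```

Under these assumptions the hot path exceeds the L1 i-cache many times over with either option, which is the premise of the argument, not a proof of it.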

For these reasons we recompiled the Linux kernel with the O2 option on the following platform:
CPU type: 2P Quad-core Core i7 (2 sockets * 4 cores * 2 hyper-threads)
CPU frequency: 2670MHz
Memory: 6 x 1GB

We ran common, stable benchmarks twice each. The results show that
O2 performs better than Os (Linux kernel version 2.6.32-rc8):

Benchmark      Improvement
volano         8%
netperf        6.7%
tbench         6.45%
Kbuild         5.5% (average of 3 runs)
specjbb2000    2%
fio            2%
specjbb2005    No change
cpu2000        No change
aim7           No change
hackbench      No change
oltp           No change

This patch enables the O2 option and disables the Os option in the x86_64 defconfig.

Appreciate any comments.

Thanks
Ling

---
arch/x86/configs/x86_64_defconfig | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/configs/x86_64_defconfig b/arch/x86/configs/x86_64_defconfig
index 6c86acd..d564b90 100644
--- a/arch/x86/configs/x86_64_defconfig
+++ b/arch/x86/configs/x86_64_defconfig
@@ -126,7 +126,7 @@ CONFIG_INITRAMFS_SOURCE=""
 CONFIG_RD_GZIP=y
 CONFIG_RD_BZIP2=y
 CONFIG_RD_LZMA=y
-CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+CONFIG_CC_OPTIMIZE_FOR_SIZE=n
 CONFIG_SYSCTL=y
 CONFIG_ANON_INODES=y
 # CONFIG_EMBEDDED is not set
--
1.6.2.5


2009-11-26 09:49:40

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform


* [email protected] <[email protected]> wrote:

> Benchmarks: improvement
> volano 8%
> netperf 6.7%
> tbench 6.45%
> Kbuild 5.5% (3 time test, average improvement)

that Kbuild result looks suspicious. A kbuild only uses 25% of system
time, so a 5.5% improvement means that system utilization dropped from
25% to 19.5%, a 28% improvement in the kernel! That looks rather
unlikely.
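
The arithmetic behind this objection can be sketched as follows. It assumes, as the paragraph above does, that the whole 5.5% gain would have to come out of the 25% spent in system time; the variable names are illustrative only.

```python
# Sketch of the kbuild sanity check: how much faster must the kernel be
# for a 5.5% overall gain, if only 25% of the run is system (kernel) time?
# Assumption: user-space time is unchanged by the kernel's compile flags.
system_share_before = 0.25
total_improvement = 0.055

system_share_after = system_share_before - total_improvement
kernel_speedup = system_share_before / system_share_after - 1.0
kernel_time_saved = total_improvement / system_share_before

print(f"system share after:  {system_share_after:.1%}")
print(f"kernel speedup:      {kernel_speedup:.1%}")
print(f"kernel time saved:   {kernel_time_saved:.1%}")
```

The 28% figure is the speedup ratio (25 / 19.5); expressed as a share of kernel time saved it is 22%.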

Could you please post before/after 'perf stat --repeat 3' results so
that we can see the noise level?

Thanks,

Ingo

2009-12-01 08:54:49

by tip-bot for Ma Ling

Subject: RE: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

Hi Ingo

Thanks for your correction. We used 'perf stat --repeat 3' to test volano, tbench, and kbuild.
Because netperf has multiple test items, we may send its results later.

volano_Os:

Performance counter stats for '/bm/bin/runs -t volano -r /bm/recipes/lkp-ne02.recipe' (3 runs):

6386111.436735 task-clock-msecs # 13.554 CPUs ( +- 0.336% )
914192633 context-switches # 0.143 M/sec ( +- 0.046% )
49186605 CPU-migrations # 0.008 M/sec ( +- 0.962% )
768344 page-faults # 0.000 M/sec ( +- 0.338% )
18680627716893 cycles # 2925.196 M/sec ( +- 0.339% )
7247421283541 instructions # 0.388 IPC ( +- 0.124% )
226838591574 cache-references # 35.521 M/sec ( +- 0.971% )
9420427393 cache-misses # 1.475 M/sec ( +- 0.897% )

471.172398867 seconds time elapsed ( +- 1.292% )

volano_O2:

Performance counter stats for '/bm/bin/runs -t volano -r /bm/recipes/lkp-ne02.recipe' (3 runs):

5873675.998422 task-clock-msecs # 13.447 CPUs ( +- 0.338% )
916070728 context-switches # 0.156 M/sec ( +- 0.050% )
48759104 CPU-migrations # 0.008 M/sec ( +- 0.614% )
738964 page-faults # 0.000 M/sec ( +- 0.082% )
17145170491943 cycles # 2918.985 M/sec ( +- 0.288% )
7324126478801 instructions # 0.427 IPC ( +- 0.090% )
219064318074 cache-references # 37.296 M/sec ( +- 0.792% )
9491237013 cache-misses # 1.616 M/sec ( +- 0.439% )

436.806579899 seconds time elapsed ( +- 0.392% )

O2 is better than Os for volano

tbench_Os:

Performance counter stats for '/bm/bin/runs -t tbench -r /bm/recipes/lkp-ne02.recipe' (3 runs):

11630970.099215 task-clock-msecs # 15.476 CPUs ( +- 1.285% )
1162148139 context-switches # 0.100 M/sec ( +- 0.372% )
39772 CPU-migrations # 0.000 M/sec ( +- 0.502% )
1536289 page-faults # 0.000 M/sec ( +- 0.020% )
33408973681696 cycles # 2872.415 M/sec ( +- 0.028% )
14229765107716 instructions # 0.426 IPC ( +- 0.113% )
290717607018 cache-references # 24.995 M/sec ( +- 10.425% )
2525058529 cache-misses # 0.217 M/sec ( +- 1.798% )

751.537009428 seconds time elapsed ( +- 0.173% )

tbench_O2:

Performance counter stats for '/bm/bin/runs -t tbench -r /bm/recipes/lkp-ne02.recipe' (3 runs):

12093825.537708 task-clock-msecs # 16.084 CPUs ( +- 6.363% )
1235837814 context-switches # 0.102 M/sec ( +- 0.857% )
42363 CPU-migrations # 0.000 M/sec ( +- 3.968% )
1535481 page-faults # 0.000 M/sec ( +- 0.350% )
33028312063911 cycles # 2731.006 M/sec ( +- 0.908% )
15535465986643 instructions # 0.470 IPC ( +- 0.058% )
280118529329 cache-references # 23.162 M/sec ( +- 0.695% )
2866275183 cache-misses # 0.237 M/sec ( +- 0.893% )

751.921568581 seconds time elapsed ( +- 0.182% )

O2 is no different from Os for tbench

kbuild_Os:

Performance counter stats for '/bm/bin/runs -t kbuild -r /bm/recipes/lkp-ne02.recipe' (3 runs):

886426.102100 task-clock-msecs # 1.053 CPUs ( +- 1.712% )
980944 context-switches # 0.001 M/sec ( +- 1.149% )
285613 CPU-migrations # 0.000 M/sec ( +- 1.543% )
81244856 page-faults # 0.092 M/sec ( +- 1.611% )
2610381816839 cycles # 2944.839 M/sec ( +- 1.696% )
2907701964460 instructions # 1.114 IPC ( +- 1.726% )
14758764510 cache-references # 16.650 M/sec ( +- 1.581% )
3212068899 cache-misses # 3.624 M/sec ( +- 1.729% )

841.492770793 seconds time elapsed ( +- 0.209% )

kbuild_O2:

Performance counter stats for '/bm/bin/runs -t kbuild -r /bm/recipes/lkp-ne02.recipe' (3 runs):

897281.428095 task-clock-msecs # 1.062 CPUs ( +- 0.524% )
964812 context-switches # 0.001 M/sec ( +- 1.630% )
287443 CPU-migrations # 0.000 M/sec ( +- 0.532% )
82509345 page-faults # 0.092 M/sec ( +- 0.071% )
2635837258275 cycles # 2937.581 M/sec ( +- 0.150% )
2955626723788 instructions # 1.121 IPC ( +- 0.117% )
14939108242 cache-references # 16.649 M/sec ( +- 0.609% )
3267365744 cache-misses # 3.641 M/sec ( +- 0.066% )

844.891541856 seconds time elapsed ( +- 0.468% )

O2 is no different from Os for kbuild
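
One way to read the three comparisons above against their noise levels is sketched below. The elapsed times and "+-" percentages are copied from the perf stat output in this mail; the rule "a change counts only if it exceeds the sum of both noise figures" is a crude heuristic assumed for illustration, not a proper statistical test.

```python
# (elapsed seconds, relative noise in percent) from the perf stat runs above
runs = {
    "volano": {"Os": (471.172398867, 1.292), "O2": (436.806579899, 0.392)},
    "tbench": {"Os": (751.537009428, 0.173), "O2": (751.921568581, 0.182)},
    "kbuild": {"Os": (841.492770793, 0.209), "O2": (844.891541856, 0.468)},
}

def compare(name):
    """Return (elapsed change in %, combined noise in %, exceeds noise?)."""
    (t_os, n_os), (t_o2, n_o2) = runs[name]["Os"], runs[name]["O2"]
    change_pct = (t_o2 - t_os) / t_os * 100.0   # negative means O2 is faster
    noise_pct = n_os + n_o2                     # assumed heuristic threshold
    return change_pct, noise_pct, abs(change_pct) > noise_pct

for name in runs:
    change, noise, significant = compare(name)
    print(f"{name}: {change:+.2f}% elapsed, noise {noise:.2f}%, "
          f"{'above' if significant else 'within'} noise")
```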

Thanks
Ling

2009-12-01 10:13:16

by Arjan van de Ven

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

On Tue, 1 Dec 2009 16:54:04 +0800
"Ma, Ling" <[email protected]> wrote:

> Hi Ingo
>
> Thanks for your correction, so we use perf stat --repeat 3 to test
> volano, tbench, and kbuild, Because netperf has multiple items we may
> send out later.

a key question is.. how much more memory do you have free due to -Os?
(because memory is cache is performance on a system level as well)
and how much less icache pressure is there?


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-12-01 16:16:17

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

On 12/01/2009 02:14 AM, Arjan van de Ven wrote:
> On Tue, 1 Dec 2009 16:54:04 +0800
> "Ma, Ling" <[email protected]> wrote:
>
>> Hi Ingo
>>
>> Thanks for your correction, so we use perf stat --repeat 3 to test
>> volano, tbench, and kbuild, Because netperf has multiple items we may
>> send out later.
>
> a key question is.. how much more memory do you have free due to -Os?
> (because memory is cache is performance on a system level as well)
> and how much less icache pressure is there?
>

From the re-run, it sounds like the only test that actually shows a
significant difference is volano. From reading the numbers, it looks
like the improvements are almost exclusively in IPC i.e. better
scheduling -- all the other metrics are substantially worse; including a
10% increase in cache misses.

It would be interesting to see what functions are hot in volano. It
might very well be that we could get a boost without significantly bloating
the kernel as a whole by picking out a couple of hot object files and
compiling those with -O2 or -O3.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-12-02 09:47:39

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform


* Ma, Ling <[email protected]> wrote:

> Hi Ingo
>
> Thanks for your correction, so we use perf stat --repeat 3 to test
> volano, tbench, and kbuild, Because netperf has multiple items we may
> send out later.
>
> volano_Os:

> 18680627716893 cycles # 2925.196 M/sec ( +- 0.339% )
> 7247421283541 instructions # 0.388 IPC ( +- 0.124% )
> 226838591574 cache-references # 35.521 M/sec ( +- 0.971% )
> 9420427393 cache-misses # 1.475 M/sec ( +- 0.897% )

> volano_O2:

> 17145170491943 cycles # 2918.985 M/sec ( +- 0.288% )
> 7324126478801 instructions # 0.427 IPC ( +- 0.090% )
> 219064318074 cache-references # 37.296 M/sec ( +- 0.792% )
> 9491237013 cache-misses # 1.616 M/sec ( +- 0.439% )

> O2 is better than Os for volano
> O2 is not different with Os for tbench
> O2 is not different with Os for kbuild

Ok, this looks pretty credible, thanks for going through it.

For Volano, the difference is 8.9%, well above the 0.3% noise level, so
it's significant.

Would it be possible to do a 'perf record' and 'perf report' comparison
between two volano runs, to see where the nearly 10% overhead comes
from? It might be one or two functions mis-optimized by GCC perhaps. Or
it could be across-the-spectrum slowdown.

Note that the number of instructions increased only by 1%, but the
overhead by 9%. So we might be hitting some nasty corner case - or it
might be some caching effect. (which does not seem to be supported by
the numbers though - the LLC cache-misses do not look significantly
higher in the Os case)
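
These ratios can be re-derived from the quoted volano counters with a short sketch like the one below. The raw values are copied from the perf stat output quoted above; the variable names are illustrative.

```python
# Raw volano counters from the quoted perf stat output.
cycles = {"Os": 18680627716893, "O2": 17145170491943}
instructions = {"Os": 7247421283541, "O2": 7324126478801}
cache_misses = {"Os": 9420427393, "O2": 9491237013}

# Extra cycles spent by the Os kernel relative to O2.
cycle_overhead_pct = (cycles["Os"] / cycles["O2"] - 1.0) * 100.0
# Extra instructions executed by the O2 kernel relative to Os.
insn_increase_pct = (instructions["O2"] / instructions["Os"] - 1.0) * 100.0
# Absolute cache-miss change going from Os to O2.
miss_change_pct = (cache_misses["O2"] / cache_misses["Os"] - 1.0) * 100.0
# Instructions per cycle for each build.
ipc = {k: instructions[k] / cycles[k] for k in cycles}

print(f"Os cycle overhead:    {cycle_overhead_pct:.2f}%")
print(f"O2 instruction delta: {insn_increase_pct:.2f}%")
print(f"O2 cache-miss delta:  {miss_change_pct:.2f}%")
print(f"IPC: Os {ipc['Os']:.3f}, O2 {ipc['O2']:.3f}")
```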

'perf annotate fn_name' will also help you see where the overhead
hot-spots are. If you build the vmlinux via CONFIG_DEBUG_INFO the perf
annotate output will interleave assembly and source code output.
(otherwise it will be assembly output only)

You probably want to use the latest version of 'perf' for all that
analysis, from:

http://people.redhat.com/mingo/tip.git/README

Thanks,

Ingo

2009-12-03 15:03:15

by tip-bot for Ma Ling

Subject: RE: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

> a key question is.. how much more memory do you have free due to -Os?
> (because memory is cache is performance on a system level as well)
The kernel code size from Os is 12M, that from O2 is 14M.
> and how much less icache pressure is there?
From the perf stat report, cache references (unified cache) with O2 are almost the same as with Os.

Thanks
Ling

2009-12-03 15:09:41

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

On 12/03/2009 07:03 AM, Ma, Ling wrote:
>> a key question is.. how much more memory do you have free due to -Os?
>> (because memory is cache is performance on a system level as well)
> The kernel code size from Os is 12M, that from O2 is 14M.
>> and how much less icache pressure is there?
> From perf stat report, cache reference(unified cache) from O2 is almost the same with Os.

The icache pressure was substantially higher (by ~10%) in the reports
that I saw.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-12-03 15:31:22

by Ingo Molnar

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform


* H. Peter Anvin <[email protected]> wrote:

> On 12/03/2009 07:03 AM, Ma, Ling wrote:
> >> a key question is.. how much more memory do you have free due to -Os?
> >> (because memory is cache is performance on a system level as well)
> > The kernel code size from Os is 12M, that from O2 is 14M.
> >> and how much less icache pressure is there?
> > From perf stat report, cache reference(unified cache) from O2 is almost the same with Os.
>
> The icache pressure was substantially higher (by ~10%) in the reports
> that I saw.

hm, icache numbers are not included in perf stat runs by default. Are
there some icache numbers i missed perhaps?

Ingo

2009-12-03 15:50:52

by H. Peter Anvin

Subject: Re: [PATCH RFC] [X86] Compile Option Os versus O2 on latest x86 platform

On 12/03/2009 07:31 AM, Ingo Molnar wrote:
>
> * H. Peter Anvin <[email protected]> wrote:
>
>> On 12/03/2009 07:03 AM, Ma, Ling wrote:
>>>> a key question is.. how much more memory do you have free due to -Os?
>>>> (because memory is cache is performance on a system level as well)
>>> The kernel code size from Os is 12M, that from O2 is 14M.
>>>> and how much less icache pressure is there?
>>> From perf stat report, cache reference(unified cache) from O2 is almost the same with Os.
>>
>> The icache pressure was substantially higher (by ~10%) in the reports
>> that I saw.
>
> hm, icache numbers are not included in perf stat runs by default. Are
> there some icache numbers i missed perhaps?
>

Sorry, you're right; cache references and cache misses. Furthermore,
I'm wrong, I was looking at references *per unit time*, which just show
that roughly the same number was squeezed into a shorter time.
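
The distinction between absolute counts and per-unit-time rates can be illustrated with the volano cache-miss numbers posted earlier in the thread. This is only a sketch: the counts and task-clock values are copied from that perf stat output, and the names are illustrative.

```python
# volano cache-misses and task-clock (msecs) from the earlier perf stat output
cache_misses = {"Os": 9420427393, "O2": 9491237013}
task_clock_ms = {"Os": 6386111.436735, "O2": 5873675.998422}

# Absolute change in the number of misses, Os -> O2.
absolute_change_pct = (cache_misses["O2"] / cache_misses["Os"] - 1.0) * 100.0

# Rate in M/sec, which is how perf stat reports it in the right-hand column.
rate = {k: cache_misses[k] / task_clock_ms[k] / 1000.0 for k in cache_misses}
rate_change_pct = (rate["O2"] / rate["Os"] - 1.0) * 100.0

print(f"absolute misses: {absolute_change_pct:+.2f}%")
print(f"miss rate:       {rate['Os']:.3f} -> {rate['O2']:.3f} M/sec "
      f"({rate_change_pct:+.2f}%)")
```

The rate rises by nearly 10% mostly because the O2 run finishes in less CPU time, while the absolute number of misses changes by under 1%.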

Never mind me... :-/

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.