2003-02-03 23:03:56

by Martin J. Bligh

Subject: gcc 2.95 vs 3.21 performance

People keep extolling the virtues of gcc 3.2 to me, which I'm
reluctant to switch to, since it compiles so much slower. But
it supposedly generates better code, so I thought I'd compile
the kernel with both and compare the results. This is gcc 2.95
and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
tests still use 2.95 for the compile-time stuff.

The results below leave me distinctly unconvinced by the supposed
merits of modern gcc's. Not really better or worse, within experimental
error. But much slower to compile things with.

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                  Elapsed      User    System       CPU
2.5.59              46.08    563.88    118.38   1480.00
2.5.59-gcc3.2       45.86    563.63    119.58   1489.33

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                  Elapsed      User    System       CPU
2.5.59              47.45    568.02    143.17   1498.17
2.5.59-gcc3.2       47.15    567.41    143.72   1507.50

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed

SDET 1 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      0.8%
2.5.59-gcc3.2          95.3%      5.2%

SDET 2 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      0.6%
2.5.59-gcc3.2          91.9%      7.1%

SDET 4 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      5.7%
2.5.59-gcc3.2          98.8%      5.3%

SDET 8 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      1.4%
2.5.59-gcc3.2         105.3%      4.7%

SDET 16 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      1.7%
2.5.59-gcc3.2         103.1%      1.8%

SDET 32 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      1.5%
2.5.59-gcc3.2         101.0%      1.6%

SDET 64 (see disclaimer)
                  Throughput  Std. Dev
2.5.59                100.0%      0.7%
2.5.59-gcc3.2         103.1%      1.1%

SDET 128 (see disclaimer)
                  Throughput  Std. Dev

NUMA schedbench 4:
                  AvgUser   Elapsed TotalUser  TotalSys
2.5.59               0.00     38.88     82.78      0.65
2.5.59-gcc3.2        0.00     41.80    107.76      0.73

NUMA schedbench 8:
                  AvgUser   Elapsed TotalUser  TotalSys
2.5.59               0.00     49.30    247.80      1.93
2.5.59-gcc3.2        0.00     38.00    229.83      2.11

NUMA schedbench 16:
                  AvgUser   Elapsed TotalUser  TotalSys
2.5.59               0.00     57.37    843.12      3.77
2.5.59-gcc3.2        0.00     57.28    839.21      2.85

NUMA schedbench 32:
                  AvgUser   Elapsed TotalUser  TotalSys
2.5.59               0.00    116.99   1805.79      6.05
2.5.59-gcc3.2        0.00    118.44   1788.09      6.25

NUMA schedbench 64:
                  AvgUser   Elapsed TotalUser  TotalSys
2.5.59               0.00    235.18   3632.73     15.45
2.5.59-gcc3.2        0.00    234.55   3633.76     15.02



------------------------------------------------------------------------------


And with the same kernel, comparing the compile times for gcc 2.95 to 3.2:

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                  Elapsed      User    System       CPU
gcc2.95             46.08    563.88    118.38   1480.00
gcc3.21             69.93    923.17    114.36   1483.17

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                  Elapsed      User    System       CPU
gcc2.95             47.45    568.02    143.17   1498.17
gcc3.21             71.44    926.45    134.89   1485.33

pft.
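
To put a number on "much slower", the elapsed and user times in the compile-time tables can be turned into ratios. The helper names below (`slowdown`, `report`) are made up for illustration; the only inputs are figures copied from the tables above.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of new to old time; a value above 1.0 means the new compiler
   took longer. */
double slowdown(double old_secs, double new_secs)
{
    assert(old_secs > 0.0);
    return new_secs / old_secs;
}

/* Print one comparison line for a pair of figures from the tables. */
void report(const char *label, double old_secs, double new_secs)
{
    printf("%-22s %.2fx\n", label, slowdown(old_secs, new_secs));
}
```

Calling `report("Kernbench-2 elapsed", 46.08, 69.93)` gives a ratio of about 1.52, and the user-time pair (563.88 vs 923.17) gives about 1.64, so on these runs gcc 3.2.1 needs roughly half as much wall-clock time again to build the same tree.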


2003-02-03 23:13:30

by Andi Kleen

Subject: Re: [Lse-tech] gcc 2.95 vs 3.21 performance

On Mon, Feb 03, 2003 at 03:05:06PM -0800, Martin J. Bligh wrote:
> The results below leave me distinctly unconvinced by the supposed
> merits of modern gcc's. Not really better or worse, within experimental
> error. But much slower to compile things with.

Curious - could you compare it with a gcc 3.3 snapshot too?

It should be even slower at compiling, but generate better code.

-Andi

2003-02-03 23:19:09

by Richard B. Johnson

Subject: Re: gcc 2.95 vs 3.21 performance

On Mon, 3 Feb 2003, Martin J. Bligh wrote:

> People keep extolling the virtues of gcc 3.2 to me, which I'm
> reluctant to switch to, since it compiles so much slower. But
> it supposedly generates better code, so I thought I'd compile
> the kernel with both and compare the results. This is gcc 2.95
> and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> tests still use 2.95 for the compile-time stuff.
>
[SNIPPED tests...]

Don't let this get out, but egcs-2.91.66 compiled FFT code
works about 50 percent of the speed of whatever M$ uses for
Visual C++ Version 6.0. I was awfully disheartened when I
found that identical code executed twice as fast on M$ than
it does on Linux. I tried to isolate what was causing the
difference. So I replaced 'hypot()' with some 'C' code that
does sqrt(x^2 + y^2) just to see if it was the 'C' library.
It didn't help. When I find out what type (section) of code
is running slower, I'll report. In the meantime, it's fast
enough, but I don't like being beat by M$.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2003-02-04 00:33:56

by J.A. Magallon

Subject: Re: gcc 2.95 vs 3.21 performance


On 2003.02.04 Richard B. Johnson wrote:
> On Mon, 3 Feb 2003, Martin J. Bligh wrote:
>
> > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > reluctant to switch to, since it compiles so much slower. But
> > it supposedly generates better code, so I thought I'd compile
> > the kernel with both and compare the results. This is gcc 2.95
> > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > tests still use 2.95 for the compile-time stuff.
> >
> [SNIPPED tests...]
>
> Don't let this get out, but egcs-2.91.66 compiled FFT code
> works about 50 percent of the speed of whatever M$ uses for
> Visual C++ Version 6.0. I was awfully disheartened when I
> found that identical code executed twice as fast on M$ than
> it does on Linux. I tried to isolate what was causing the
> difference. So I replaced 'hypot()' with some 'C' code that
> does sqrt(x^2 + y^2) just to see if it was the 'C' library.
> It didn't help. When I find out what type (section) of code
> is running slower, I'll report. In the meantime, it's fast
> enough, but I don't like being beat by M$.
>

I face a similar problem. As everybody says that SSE is so marvelous,
we are trying to put some SSE code in our render engine, to speed it up.
But look at the results of the code below (box is a [email protected], Xeon with ht):
annwn:~/sse> ss-g
Proc std:
5020 kticks
Proc std inline:
4320 kticks
Proc sse:
4290 kticks
Proc sse inline:
3890 kticks

So what? Just around 500 kticks gained by moving to SSE? As the Computer
Architecture people at the school say, it is something called 'spill code'
(did I write that correctly?). In short, too much SSE but too few registers,
so Intel ia32 turns into crap when you need some indexes: it runs out of
registers and has to copy to and from the stack.

#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#if defined(__INTEL_COMPILER)
#include <xmmintrin.h>
#endif

#define LOOPS 1000
#define SZ 100000

#if defined(__GNUC__) && defined(__SSE__)
typedef void __ve_reg __attribute__((__mode__(V4SF)));
#endif

typedef struct point point;
struct point {
    float v[4];
};

/* Scalar multiply, called out of line. */
void mulp_std(const point* a,const point* b,point* r)
{
    int i;
    for (i=0; i<4; i++)
        r->v[i] = a->v[i] * b->v[i];
}

/* Scalar multiply, inline. 'static' keeps this linkable under C99
   inline rules as well as the old GNU ones. */
static inline void mulpi_std(const point* a,const point* b,point* r)
{
    int i;
    for (i=0; i<4; i++)
        r->v[i] = a->v[i] * b->v[i];
}

/* SSE multiply, called out of line. */
void mulp_sse(const point* a,const point* b,point* r)
{
#if defined(__GNUC__) && defined(__SSE__)
    __ve_reg xmm0,xmm1,xmm2;
    xmm0 = __builtin_ia32_loadups((float*)a->v);
    xmm1 = __builtin_ia32_loadups((float*)b->v);
    xmm2 = __builtin_ia32_mulps(xmm0,xmm1);
    __builtin_ia32_storeups(r->v,xmm2);
#endif
#if defined(__INTEL_COMPILER)
    __m128 xmm0,xmm1,xmm2;
    xmm0 = _mm_loadu_ps((float*)a->v);
    xmm1 = _mm_loadu_ps((float*)b->v);
    xmm2 = _mm_mul_ps(xmm0,xmm1);
    _mm_storeu_ps(r->v,xmm2);
#endif
}

/* SSE multiply, inline. */
static inline void mulpi_sse(const point* a,const point* b,point* r)
{
#if defined(__GNUC__) && defined(__SSE__)
    __ve_reg xmm0,xmm1,xmm2;
    xmm0 = __builtin_ia32_loadups((float*)a->v);
    xmm1 = __builtin_ia32_loadups((float*)b->v);
    xmm2 = __builtin_ia32_mulps(xmm0,xmm1);
    __builtin_ia32_storeups(r->v,xmm2);
#endif
#if defined(__INTEL_COMPILER)
    __m128 xmm0,xmm1,xmm2;
    xmm0 = _mm_loadu_ps((float*)a->v);
    xmm1 = _mm_loadu_ps((float*)b->v);
    xmm2 = _mm_mul_ps(xmm0,xmm1);
    _mm_storeu_ps(r->v,xmm2);
#endif
}

int main(int argc, char** argv)
{
    point *a;
    point *b;
    point *c;
    int i,j;
    unsigned long t0,t1;

    a = malloc(SZ*sizeof(point));
    b = malloc(SZ*sizeof(point));
    c = malloc(SZ*sizeof(point));

    /* Each test runs the same two passes over the arrays and reports
       the elapsed clock() ticks divided by 1000. */
    printf("Proc std:\n");
    t0 = clock();
    for (i=0; i<LOOPS; i++)
    {
        for (j=0; j<SZ; j++)
            mulp_std(&a[j],&b[j],&c[j]);
        for (j=0; j<SZ; j++)
            mulp_std(&b[j],&b[j],&a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n",(t1-t0)/1000);

    printf("Proc std inline:\n");
    t0 = clock();
    for (i=0; i<LOOPS; i++)
    {
        for (j=0; j<SZ; j++)
            mulpi_std(&a[j],&b[j],&c[j]);
        for (j=0; j<SZ; j++)
            mulpi_std(&b[j],&b[j],&a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n",(t1-t0)/1000);

    printf("Proc sse:\n");
    t0 = clock();
    for (i=0; i<LOOPS; i++)
    {
        for (j=0; j<SZ; j++)
            mulp_sse(&a[j],&b[j],&c[j]);
        for (j=0; j<SZ; j++)
            mulp_sse(&b[j],&b[j],&a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n",(t1-t0)/1000);

    printf("Proc sse inline:\n");
    t0 = clock();
    for (i=0; i<LOOPS; i++)
    {
        for (j=0; j<SZ; j++)
            mulpi_sse(&a[j],&b[j],&c[j]);
        for (j=0; j<SZ; j++)
            mulpi_sse(&b[j],&b[j],&a[j]);
    }
    t1 = clock();
    printf("%10lu kticks\n",(t1-t0)/1000);

    free(c);
    free(b);
    free(a);

    return 0;
}
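
The benchmark above multiplies one point per call. One variant that might be worth measuring keeps the loop over the array inside a single function, so the loads, multiply and store sit next to the index arithmetic. The sketch below uses the `<xmmintrin.h>` intrinsics that the Intel branch of the posted code already relies on; `mulp_array` is a hypothetical name, the scalar fallback is an addition for builds without SSE, and whether this actually reduces the spill traffic described above has not been measured here.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

typedef struct point { float v[4]; } point;

/* Element-wise multiply of n point pairs: r[j] = a[j] * b[j]. */
void mulp_array(const point *a, const point *b, point *r, size_t n)
{
    size_t j;

    assert(a != NULL && b != NULL && r != NULL);
    for (j = 0; j < n; j++) {
#if defined(__SSE__)
        /* Unaligned loads and store, as in the posted code. */
        __m128 x = _mm_loadu_ps(a[j].v);
        __m128 y = _mm_loadu_ps(b[j].v);
        _mm_storeu_ps(r[j].v, _mm_mul_ps(x, y));
#else
        /* Scalar fallback when SSE is not enabled at compile time. */
        int i;
        for (i = 0; i < 4; i++)
            r[j].v[i] = a[j].v[i] * b[j].v[i];
#endif
    }
}
```

In the benchmark, each inner `for (j=0; j<SZ; j++)` loop would then collapse to one call such as `mulp_array(a, b, c, SZ)`.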


--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.21-pre4-jam1 (gcc 3.2.1 (Mandrake Linux 9.1 3.2.1-5mdk))

2003-02-04 06:54:46

by Denis Vlasenko

Subject: Re: gcc 2.95 vs 3.21 performance

On 4 February 2003 01:31, Richard B. Johnson wrote:
> On Mon, 3 Feb 2003, Martin J. Bligh wrote:
> > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > reluctant to switch to, since it compiles so much slower. But
> > it supposedly generates better code, so I thought I'd compile
> > the kernel with both and compare the results. This is gcc 2.95
> > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > tests still use 2.95 for the compile-time stuff.
>
> [SNIPPED tests...]

What was the size of uncompressed kernel binaries?
This is a simple (and somewhat inaccurate) measure of compiler
improvement ;)

> Don't let this get out, but egcs-2.91.66 compiled FFT code
> works about 50 percent of the speed of whatever M$ uses for
> Visual C++ Version 6.0. I was awfully disheartened when I

Yes. M$ (and some other compilers) beat GCC badly.

> found that identical code executed twice as fast on M$ than
> it does on Linux. I tried to isolate what was causing the
> difference. So I replaced 'hypot()' with some 'C' code that
> does sqrt(x^2 + y^2) just to see if it was the 'C' library.
> It didn't help. When I find out what type (section) of code
> is running slower, I'll report. In the meantime, it's fast
> enough, but I don't like being beat by M$.

I'm afraid it's the code generation engine. It is just worse than
M$'s or Intel's. It is not easily fixable;
the GCC folks have a tremendous task at hand.

I wonder whether some big companies supposedly supporting
Linux (e.g. Intel) can help the GCC team (for example by giving
away some code and/or developer time).
--
vda

2003-02-04 07:05:34

by Martin J. Bligh

Subject: Re: gcc 2.95 vs 3.21 performance

> I'm afraid it's the code generation engine. It is just worse than
> M$'s or Intel's. It is not easily fixable;
> the GCC folks have a tremendous task at hand.
>
> I wonder whether some big companies supposedly supporting
> Linux (e.g. Intel) can help the GCC team (for example by giving
> away some code and/or developer time).

Comparing Intel's compiler vs GCC on Linux would be more interesting.
Anyone got a copy and some time to burn?

M.


2003-02-04 09:45:38

by Bryan Andersen

Subject: Re: gcc 2.95 vs 3.21 performance


Personal opinion here but I know it is also held by many developers I
know and work with. I'd rather have a compiler that produces correct
and fast code but runs slowly than one that produces slow or bad code and
runs fast. Remember compilation is done far less often than run time
execution. Yes I too noticed a difference when I switched over to 3.2
but I also noticed some of my code speed up.

>>>People keep extolling the virtues of gcc 3.2 to me, which I'm
>>>reluctant to switch to, since it compiles so much slower. But
>>>it supposedly generates better code, so I thought I'd compile
>>>the kernel with both and compare the results. This is gcc 2.95
>>>and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
>>>tests still use 2.95 for the compile-time stuff.
>>
>>[SNIPPED tests...]
>
>
> What was the size of uncompressed kernel binaries?
> This is a simple (and somewhat inaccurate) measure of compiler
> improvement ;)

While I too like smaller tighter output code, I'd trade it for code that
runs faster in real world situations. As an example identifying the
most likely execution path through a routine and keeping it contiguous
in memory will do more for average execution speed than optimizing to
use the smallest number of bytes. If the compiler could tell which
blocks of code are for handling exceptions it then can place them outside
of the main execution path. This makes the normal code execution path
smaller and more compact. In doing so it also reduces the number of
memory fetch operations and cache space needed to run the code. With
cache misses being 100+ clock cycles and page faults well into the
millions, keeping that normal execution path short means a lot.
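
One way a programmer can hint at the likely path is GCC's `__builtin_expect`. The sketch below is only an illustration of the idea in the paragraph above: the `likely`/`unlikely` macro names and the `sum_checked` function are hypothetical, and how far the compiler actually moves the rare branch out of line depends on the GCC version and optimisation flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convenience macros wrapping the GCC builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sum n ints. The NULL-pointer check is marked as the rare case so the
   compiler is free to lay the error handling out away from the loop. */
long sum_checked(const int *v, size_t n, int *err)
{
    size_t i;
    long total = 0;

    assert(err != NULL);
    if (unlikely(v == NULL)) {
        *err = 1;        /* rare path: report the error and bail out */
        return 0;
    }
    *err = 0;
    for (i = 0; i < n; i++)  /* common path stays contiguous */
        total += v[i];
    return total;
}
```

The hint changes only code layout and branch ordering, not behaviour, so a wrong guess costs speed rather than correctness.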

>>Don't let this get out, but egcs-2.91.66 compiled FFT code
>>works about 50 percent of the speed of whatever M$ uses for
>>Visual C++ Version 6.0. I was awfully disheartened when I
>
> Yes. M$ (and some other compilers) beat GCC badly.

But can M$'s compiler produce code for many radically different CPU
architectures? Most people only work with gcc on one type of CPU so
they never think about just how flexible and good GCC really is. I see
it often compared against compilers that are dedicated to a single CPU
where the development team only has to worry about one CPU type. GCC's
development team needs to worry about many different architectures. Some
are radically different in their fundamental structure. This really
complicates the job of producing a compiler that works correctly.

- Bryan



2003-02-04 10:51:02

by Padraig

Subject: Re: gcc 2.95 vs 3.21 performance

.file "testfunc.c"
.globl TEST_NUMBER
.data
.align 2
.type TEST_NUMBER,@object
.size TEST_NUMBER,2
TEST_NUMBER:
.value 256
.globl count
.align 4
.type count,@object
.size count,4
count:
.long 0
.globl exit_flag
.align 4
.type exit_flag,@object
.size exit_flag,4
exit_flag:
.long 0
.align 4
.type throttle_print.0,@object
.size throttle_print.0,4
throttle_print.0:
.long 0
.section .rodata.str1.32,"aMS",@progbits,1
.align 32
.LC2:
.string "\nAdding & dropping random array elements,(from a set of 000..%03u)\n"
.section .rodata.str1.1,"aMS",@progbits,1
.LC3:
.string "Ctrl C to exit"
.section .rodata.str1.32
.align 32
.LC0:
.string "\n%lu array elements randomly dropped and added in %lus"
.align 32
.LC1:
.string " (%lu/s)\n \n"
.text
.p2align 2,,3
.globl main
.type main,@function
main:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $12, %esp
andl $-16, %esp
cmpl $1, 8(%ebp)
movl $1, %edi
jle .L2
pushl $0
pushl $10
pushl $0
movl 12(%ebp), %eax
pushl 4(%eax)
call __strtol_internal
addl $16, %esp
testl %eax, %eax
jle .L2
movw %ax, TEST_NUMBER
.L2:
movzwl TEST_NUMBER, %edx
subl $12, %esp
sall $1, %edx
pushl %edx
call malloc
movl %eax, %esi
movl $0, (%esp)
call time
popl %ebx
movl %eax, start
popl %eax
pushl $exit_info_sig
pushl $2
call signal
xorl %edx, %edx
movw TEST_NUMBER, %cx
addl $16, %esp
cmpw %cx, %dx
jae .L24
.L10:
movzwl %dx, %ebx
movw %dx, (%esi,%ebx,2)
incl %edx
cmpw %cx, %dx
jb .L10
.p2align 2,,3
.L24:
incl count
call rand
movw TEST_NUMBER, %bx
movzwl %bx, %edx
movl %edx, %ecx
cltd
idivl %ecx
cmpw %bx, %dx
movl %edx, %ecx
jae .L27
.p2align 2,,3
.L18:
movzwl %cx, %edx
incl %ecx
movw (%esi,%edx,2), %ax
cmpw %bx, %cx
movw %ax, -2(%esi,%edx,2)
jb .L18
.L27:
leal -1(%ebx), %ecx
subl $8, %esp
movzwl %cx, %edx
pushl %edx
pushl %esi
call GetLowestValueAvailable
movzwl TEST_NUMBER, %edx
movw %ax, -2(%esi,%edx,2)
movl exit_flag, %eax
addl $16, %esp
testl %eax, %eax
jne .L28
testl %edi, %edi
je .L24
subl $8, %esp
leal -1(%edx), %ebx
pushl %ebx
pushl $.LC2
call printf
xorl %edi, %edi
movl $.LC3, (%esp)
call puts
addl $16, %esp
jmp .L24
.L28:
subl $12, %esp
pushl $0
call time
movl %eax, %esi
addl $12, %esp
subl start, %esi
pushl %esi
pushl count
pushl $.LC0
call printf
popl %eax
popl %edx
movl count, %eax
xorl %edx, %edx
divl %esi
pushl %eax
pushl $.LC1
call printf
movl $1, (%esp)
call exit
.Lfe1:
.size main,.Lfe1-main
.p2align 2,,3
.globl RemoveNumber
.type RemoveNumber,@function
RemoveNumber:
pushl %ebp
movl %esp, %ebp
movl 12(%ebp), %ecx
cmpw TEST_NUMBER, %cx
pushl %ebx
movl 8(%ebp), %ebx
jae .L69
.p2align 2,,3
.L67:
movzwl %cx, %edx
movw (%ebx,%edx,2), %ax
movw %ax, -2(%ebx,%edx,2)
incl %ecx
cmpw TEST_NUMBER, %cx
jb .L67
.L69:
popl %ebx
leave
ret
.Lfe2:
.size RemoveNumber,.Lfe2-RemoveNumber
.section .rodata.str1.1
.LC4:
.string "\033[H"
.LC5:
.string "%03d "
.text
.p2align 2,,3
.globl printArray
.type printArray,@function
printArray:
pushl %ebp
movl %esp, %ebp
pushl %esi
pushl %ebx
subl $12, %esp
pushl $.LC4
movl 8(%ebp), %esi
call printf
popl %eax
pushl stdout
xorl %ebx, %ebx
call fflush
addl $16, %esp
cmpw TEST_NUMBER, %bx
jb .L75
.L77:
leal -8(%ebp), %esp
popl %ebx
popl %esi
leave
ret
.p2align 2,,3
.L75:
movzwl %bx, %ecx
subl $8, %esp
movzwl (%esi,%ecx,2), %edx
pushl %edx
pushl $.LC5
incl %ebx
call printf
addl $16, %esp
cmpw TEST_NUMBER, %bx
jb .L75
jmp .L77
.Lfe3:
.size printArray,.Lfe3-printArray
.p2align 2,,3
.globl exit_info
.type exit_info,@function
exit_info:
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $16, %esp
pushl $0
call time
movl %eax, %ebx
addl $12, %esp
subl start, %ebx
pushl %ebx
pushl count
pushl $.LC0
call printf
popl %eax
popl %edx
movl count, %eax
xorl %edx, %edx
divl %ebx
pushl %eax
pushl $.LC1
call printf
movl $1, (%esp)
call exit
.Lfe4:
.size exit_info,.Lfe4-exit_info
.p2align 2,,3
.globl exit_info_sig
.type exit_info_sig,@function
exit_info_sig:
pushl %ebp
movl %esp, %ebp
movl $1, exit_flag
leave
ret
.Lfe5:
.size exit_info_sig,.Lfe5-exit_info_sig
.comm start,4,4
.ident "GCC: (GNU) 3.2.1 20021207 (Red Hat Linux 8.0 3.2.1-2)"


Attachments:
slow.s (4.36 kB)
fast.s (4.24 kB)

2003-02-04 12:11:40

by Dave Jones

Subject: Re: [Lse-tech] gcc 2.95 vs 3.21 performance

On Mon, Feb 03, 2003 at 03:05:06PM -0800, Martin J. Bligh wrote:
> People keep extolling the virtues of gcc 3.2 to me, which I'm
> reluctant to switch to, since it compiles so much slower. But
> it supposedly generates better code, so I thought I'd compile
> the kernel with both and compare the results. This is gcc 2.95
> and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> tests still use 2.95 for the compile-time stuff.
>
> The results below leave me distinctly unconvinced by the supposed
> merits of modern gcc's. Not really better or worse, within experimental
> error. But much slower to compile things with.

What kernel was kernbench compiling ? The reason I'm asking is that
2.5s (and more recent 2.4.21pre's) will use -march flags for more
aggressive optimisation on newer gcc's.
If you want to compare apples to apples, make sure you choose
something like i386 in the processor menu, and then it'll always
use -march=i386 instead of getting fancy with things like -march=pentium4

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2003-02-04 12:16:08

by Adrian Bunk

Subject: Re: gcc 2.95 vs 3.21 performance

On Mon, Feb 03, 2003 at 11:13:31PM -0800, Martin J. Bligh wrote:
> > I'm afraid it's the code generation engine. It is just worse than
> > M$'s or Intel's. It is not easily fixable;
> > the GCC folks have a tremendous task at hand.
> >
> > I wonder whether some big companies supposedly supporting
> > Linux (e.g. Intel) can help the GCC team (for example by giving
> > away some code and/or developer time).
>
> Comparing Intel's compiler vs GCC on Linux would be more interesting.
> Anyone got a copy and some time to burn?

There are already people who have done this, e.g.

http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html

compares g++ and Intel's C++ compiler with C++ code.

> M.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2003-02-04 13:01:40

by Helge Hafting

Subject: Re: gcc 2.95 vs 3.21 performance

[email protected] wrote:
[...]
> Interesting. I just noticed that I get 50% decrease in
> the speed of my program if I just insert a printf(). I.E.
> my program is like:
>
> printf()
> for(;;) {
> do_sorting_loop_test();
> }
>
> If I remove the initial printf it doubles in speed?
> I assume this is some weird caching thing?

Looks like a cacheline alignment issue to me.
This loop of yours occupies x cachelines on your cpu;
moving it in memory by adding the printf
might cause it to occupy x+1 cachelines.
That might be noticeable if x is a really small number,
such as 1.

> gcc is 3.2.1 (same happens for 2.95..)
>
> <boggle>
> Note this is with -O3. If I don't specify -O then
> leaving the printf in speeds things up by about 15%
> </boggle>

Sure - going from -O3 to -O changes code generation so
your loop code hits the cachelines differently.
In this case the printf moved the loop into
better alignment.

My advice is to put your test loop in a function of its own,
and do the printing in the function that calls it.
Functions are always aligned the same (good) way so
that calling them will be fast.

You can tune the speed of your inner loop by experimenting
with the insertion of one or more NOP asms in front
of the loop. Just be aware that all such tuning is wasted once
you change anything at all in that function - you'll have to
re-do the tuning each time.

The compiler should ideally align the loops for maximum performance.
That can be hard though, considering all the different processors
that might run your program. And aligning everything optimally
could waste a _lot_ of code space - so do this only for
small loops with lots of iterations.
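
The advice to keep the test loop in a function of its own can be sketched as follows. Everything here is illustrative: `work_once` is a made-up stand-in for the poster's do_sorting_loop_test(), and the GCC `noinline` attribute is used on the assumption that the compiler would otherwise be free to fold the loop back into the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical workload: one step of a linear congruential generator. */
static unsigned work_once(unsigned x)
{
    return x * 1664525u + 1013904223u;
}

/* The hot loop lives in its own function, so code added to the caller
   (such as a printf) does not shift the loop within this function. */
__attribute__((noinline))
unsigned run_test_loop(unsigned long iterations)
{
    unsigned long i;
    unsigned x = 1;

    for (i = 0; i < iterations; i++)
        x = work_once(x);
    return x;
}

/* All printing happens in the caller, outside the measured function.
   Returns the CPU time used by the loop, in seconds. */
double time_test_loop(unsigned long iterations)
{
    clock_t t0, t1;
    unsigned result;

    assert(iterations > 0);
    printf("starting %lu iterations\n", iterations);
    t0 = clock();
    result = run_test_loop(iterations);
    t1 = clock();
    printf("result %u\n", result);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Printing the result keeps the compiler from discarding the loop as dead code, which would otherwise make the timing meaningless.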

Helge Hafting

2003-02-04 13:31:29

by Richard B. Johnson

Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, 4 Feb 2003, J.A. Magallon wrote:

>
> On 2003.02.04 Richard B. Johnson wrote:
> > On Mon, 3 Feb 2003, Martin J. Bligh wrote:
> >
> > > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > > reluctant to switch to, since it compiles so much slower. But
> > > it supposedly generates better code, so I thought I'd compile
> > > the kernel with both and compare the results. This is gcc 2.95
> > > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > > tests still use 2.95 for the compile-time stuff.
> > >
> > [SNIPPED tests...]
> >
> > Don't let this get out, but egcs-2.91.66 compiled FFT code
> > works about 50 percent of the speed of whatever M$ uses for
> > > Visual C++ Version 6.0. I was awfully disheartened when I
> > found that identical code executed twice as fast on M$ than
> > it does on Linux. I tried to isolate what was causing the
> > difference. So I replaced 'hypot()' with some 'C' code that
> > does sqrt(x^2 + y^2) just to see if it was the 'C' library.
> > It didn't help. When I find out what type (section) of code
> > is running slower, I'll report. In the meantime, it's fast
> > enough, but I don't like being beat by M$.
> >
>
> I face a similar problem. As everybody says that SSE is so marvelous,
> we are trying to put some SSE code in our render engine, to speed it up.
> But look at the results of the code below (box is a [email protected], Xeon with ht):

[SNIPPED good demo code]

I'm going to answer all the comments on this topic with just
one observation. Sorry that I don't have the time to answer
all who responded personally, but I have to take a "work break"
today and tomorrow (design review).

gcc is a marvelous compiler because it was designed
to be readily ported to different architectures. However, it
is not an optimum compiler for ix86 machines and probably
is not optimum for any one kind of machine.

I often hear complaints about the ix86 processors as being
"register starved", etc. This could not be further from
fact. There are enough registers. However, various registers
were designed to do various things. Once you decide that
you know more than the processor developers, and start
using registers for things they were not designed for,
you start to have excellent test benchmarks, but awful
overall performance.

For example, the ECX register was designed to be used as
a counter. It can be told to decrement and perform a
conditional jump with the 'loop' instruction. The loop
instruction comes in various flavors, also, like loopz,
loopnz. Somebody decided that 'dec ecx; jnz' was faster.
They measured this to "prove" that it's faster. In the
meantime, other code suffers (stumbles) because there
was really no spare time to be grabbed. Data needs to
be fetched to and from memory. The instruction unit
ends up being starved while data are acquired. This
would not normally hurt anything because the RAM bandwidth
ends up being the dominant pole in the transfer function,
but you end up with something I call the "accordion problem".

I will first demonstrate the accordion problem and then
explain where it comes from. Note a smooth flow of traffic
on a highway. All the cars are traveling at the same speed.
Their speed increases until they don't dare go any faster.
They are now "bandwidth limited". Somebody sees a traffic
cop. Somebody slows down, it takes a few hundred milliseconds
for the next car to slow down, this transient moves backwards
through the line of cars until cars several miles back actually
have to perform emergency braking to stay off the bumper
ahead. Then, the cars start accelerating again. This acceleration,
deceleration ripple moves through the line of cars like the
bellows of an accordion. The average speed of the line of
traffic is now reduced even though there are oscillatory
accelerations above the speed-limit.

Now, visualize a CPU and RAM combination running in lock-step.
The speed of the execution unit is matched to the speed of the
processor I/O so the instructions are fetched and executed in
a more-or-less synchronized manner. This is like the high-speed
line of cars before somebody sees the traffic cop. Now, perturb
this execution by throwing in some faster-than-normal program
sequences. You may start the accordion effect. The problem is
that both instructions and data come through the same
hole-in-the-wall, regardless of caching. When the prefetch unit needs
more data (instructions) it must contend with the data I/O.
This may cause an oscillatory condition, actually reducing
throughput.

Anybody who uses CPUs in laboratories with sensitive receiving
equipment knows that, regardless of the FCC rules, these
machines generate great gobs of radio frequency interference.
That's why they need to be in shielded boxes. If you want
to "hear" the stumble I'm talking about, just listen to
the AM audio output using a field-intensity meter. When you
have a fast smoothly-running machine, the interference sounds
like noise. When you have the accordion effect, the interference
has a repetitive pattern to it, a tone, usually low-frequency.
If you capture enough data in a logic analyzer, you will see
the pattern and can see actual pauses in bus I/O where the
CPU just isn't doing a damn thing at all!

FYI, there is a difference in power supply current required
to write 0xffffffff to RAM than 0x00000000 (honest!). If you
are doing a memory-test, writing such a pattern that the
load on the power supply changes at a rate that will disturb
the power supply servo-loop, you can make the voltage bounce!
This has nothing to do with slow CPU execution speed, but
just demonstrates that there are a lot of interactions that
should be considered when designing or proving-out a system.
It's not just a local bench-mark that counts.

The Intel Compiler(s) I have used generate code that uses
the registers just like Intel specified. It uses EBX, ESI, EDI
as index registers just like the 16-bit BX, SI, DI. I have
never seen code output from an Intel 'C' compiler that uses
EAX as an index register, even though it's available and
"faster". They seem to stick with the "un-optimized" string
instructions like rep movsb, repnz cmpsb, etc., and they
use 'loop'. Maybe, just maybe, Intel knows something about
their processor that shouldn't be second-guessed by clever
programmers.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.


2003-02-04 13:43:39

by Jörn Engel

Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, 4 February 2003 14:11:56 +0100, Helge Hafting wrote:
>
> Looks like a cacheline alignment issue to me.
> This loop of yours occupies x cachelines on your cpu;
> moving it in memory by adding the printf
> might cause it to occupy x+1 cachelines.
> That might be noticeable if x is a really small number,
> such as 1.

Makes a lot of sense.

> My advice is to put your test loop in a function of its own,
> and do the printing in the function that calls it.
> Functions are always aligned the same (good) way so
> that calling them will be fast.
>
> You can tune the speed of your inner loop by experimenting
> with the insertion of one or more NOP asms in front
> of the loop. Just be aware that all such tuning is wasted once
> you change anything at all in that function - you'll have to
> re-do the tuning each time.
>
> The compiler should ideally align the loops for maximum performance.
> That can be hard though, considering all the different processors
> that might run your program. And aligning everything optimally
> could waste a _lot_ of code space - so do this only for
> small loops with lots of iterations.

The compiler has a hard time identifying those loops that affect
performance as opposed to those that are run 2-3 times.

But the developer can usually profile and figure out where those
loops are. I wonder if the following would be possible.

printf();
__cacheline_aligned_code;
for(;;)
do_sorting_loop_test();

include/linux/cache.h appears to define such for data structures, but
not for code.
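
A statement-level marker like the one above does not exist as far as this sketch knows, but GCC does accept an `aligned` attribute on whole functions, which gets close if the hot loop is first moved into a function of its own. The macro name `cacheline_aligned_func`, the 64-byte line size and the `hot_loop` body are all assumptions made for illustration. The attribute aligns the function's entry point, not the loop inside it; GCC's -falign-loops option is the knob for loops themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache-line size; the real value depends on the CPU. */
#define CACHELINE_BYTES 64

/* Hypothetical macro: align the function entry and keep it out of line. */
#define cacheline_aligned_func \
    __attribute__((aligned(CACHELINE_BYTES), noinline))

/* Stand-in for the hot loop: sum of 0 .. n-1. */
cacheline_aligned_func
unsigned long hot_loop(unsigned long n)
{
    unsigned long i, sum = 0;

    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Offset of hot_loop's entry point within its cache line (0 if aligned). */
unsigned long hot_loop_entry_offset(void)
{
    /* The modulo below assumes a power-of-two line size. */
    assert((CACHELINE_BYTES & (CACHELINE_BYTES - 1)) == 0);
    return (unsigned long)((uintptr_t)&hot_loop % CACHELINE_BYTES);
}
```

Checking `hot_loop_entry_offset()` at run time is a cheap way to confirm that the toolchain honoured the request before drawing conclusions from a benchmark.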

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c

2003-02-04 13:58:52

by Pádraig Brady

Subject: Re: gcc 2.95 vs 3.21 performance

Helge Hafting wrote:
> [email protected] wrote:
> [...]
>
>>Interesting. I just noticed that I get 50% decrease in
>>the speed of my program if I just insert a printf(). I.E.
>>my program is like:
>>
>>printf()
>>for(;;) {
>> do_sorting_loop_test();
>>}
>>
>>If I remove the initial printf it doubles in speed?
>>I assume this is some weird caching thing?
>
>
> Looks like a cacheline alignment issue to me.
> This loop of yours occupies x cachelines on your cpu;
> moving it in memory by adding the printf
> might cause it to occupy x+1 cachelines.
> That might be noticeable if x is a really small number,
> such as 1.

OK it is (as I suspected and as you explained nicely)
related to the cachelines on my CPU (866 celery).

====================================
GCC options                  loops/s
====================================
gcc                             2283
gcc -O3 -falign-loops=2         3451
gcc -O3 -falign-loops=4         3443
gcc -O3 -falign-loops=8         7045
gcc -march=i686 -O3             9101
====================================
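
For anyone wanting to repeat this kind of measurement, a loops/s figure can be gathered with a small timing harness along the following lines. This is a sketch under assumptions, not the program that produced the table: `do_sorting_loop_test()` here is a hypothetical stand-in that sorts a small array with `qsort`, and `loops_per_second()` and `N_ITEMS` are made-up names.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define N_ITEMS 256     /* assumed working-set size, small enough to stay cached */

static int cmp_int(const void *a, const void *b)
{
        int x = *(const int *)a, y = *(const int *)b;

        return (x > y) - (x < y);
}

/* Hypothetical stand-in for the test body: refill and sort a small array. */
void do_sorting_loop_test(void)
{
        static int data[N_ITEMS];

        for (int i = 0; i < N_ITEMS; i++)
                data[i] = (i * 7919) % N_ITEMS;
        qsort(data, N_ITEMS, sizeof data[0], cmp_int);
        assert(data[0] <= data[N_ITEMS - 1]);
}

/* Call fn for roughly `seconds` of CPU time; return iterations per second. */
double loops_per_second(void (*fn)(void), double seconds)
{
        unsigned long loops = 0;
        clock_t start = clock();
        clock_t now = start;
        double elapsed;

        do {
                fn();
                loops++;
                now = clock();
        } while ((double)(now - start) / CLOCKS_PER_SEC < seconds);

        elapsed = (double)(now - start) / CLOCKS_PER_SEC;
        return elapsed > 0.0 ? (double)loops / elapsed : 0.0;
}
```

Building the same file with each flag set from the table and printing `loops_per_second(do_sorting_loop_test, 1.0)` gives comparable numbers. Note that calling `clock()` on every iteration adds overhead of its own, so very short loop bodies should be batched.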

cheers,
Pádraig.

2003-02-04 14:11:47

by John Bradford

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

There is some discussion about compiler optimisations in this Linux
Journal article:

http://www.linuxjournal.com/article.php?sid=4885

John.

2003-02-04 15:37:22

by Martin J. Bligh

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

> Personal opinion here but I know it is also held by many developers I
> know and work with. I'd rather have a compiler that produces correct and
> fast code but ran slow than one that produces slow or bad code and runs
> fast. Remember compilation is done far less often than run time
> execution.

Yeah, I'd make that tradeoff too, but gcc 3.2 doesn't give me that.
People keep saying it does, but I see no real evidence of it.
Show me the money.

M.

2003-02-04 15:41:34

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] gcc 2.95 vs 3.21 performance

> > People keep extolling the virtues of gcc 3.2 to me, which I'm
> > reluctant to switch to, since it compiles so much slower. But
> > it supposedly generates better code, so I thought I'd compile
> > the kernel with both and compare the results. This is gcc 2.95
> > and 3.2.1 from debian unstable on a 16-way NUMA-Q. The kernbench
> > tests still use 2.95 for the compile-time stuff.
> >
> > The results below leaves me distinctly unconvinced by the supposed
> > merits of modern gcc's. Not really better or worse, within experimental
> > error. But much slower to compile things with.
>
> What kernel was kernbench compiling ? The reason I'm asking is that
> 2.5s (and more recent 2.4.21pre's) will use -march flags for more
> aggressive optimisation on newer gcc's.
> If you want to compare apples to apples, make sure you choose
> something like i386 in the processor menu, and then it'll always
> use -march=i386 instead of getting fancy with things like -march=pentium4

Kernbench compiles 2.4.17, because I'm old, slow and lazy, and that
was what was around when I started doing this test ;-)

But the point is still the same ... even if it is doing more aggressive
optimisation, it's not actually buying us anything (at least for the kernel)

M.

2003-02-04 15:46:11

by Martin J. Bligh

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

>> Comparing Intel's compiler vs GCC on Linux would be more interesting.
>> Anyone got a copy and some time to burn?
>
> There are already people who have done this, e.g.
>
> http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html
>
> compares g++ and Intel's C++ compiler with C++ code.

C would be infinitely more interesting ;-)

M.

2003-02-04 16:21:12

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: gcc 2.95 vs 3.21 performance

>>> Comparing Intel's compiler vs GCC on Linux would be more interesting.
>>> Anyone got a copy and some time to burn?
>>
>> There are already people who have done this, e.g.
>>
>> http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html
>>
>> compares g++ and Intel's C++ compiler with C++ code.
>
> C would be infinitely more interesting ;-)

Speaking of which, has anyone ever compiled the ia32 Linux kernel with the
Intel compiler? I thought I saw some patches floating around to make it
compile the ia64 kernel .... that'd be an interesting test case ... might
give us some ideas about what could be tweaked in GCC (or code rejiggled in
the kernel).

M.

2003-02-04 17:40:45

by Patrick Mansfield

[permalink] [raw]
Subject: Re: [Lse-tech] Re: gcc 2.95 vs 3.21 performance

On Tue, Feb 04, 2003 at 08:27:28AM -0800, Martin J. Bligh wrote:
> >>> Comparing Intel's compiler vs GCC on Linux would be more interesting.
> >>> Anyone got a copy and some time to burn?
> >>
> >> There are already people who have done this, e.g.
> >>
> >> http://www.coyotegulch.com/reviews/intel_comp/intel_gcc_bench2.html
> >>
> >> compares g++ and Intel's C++ compiler with C++ code.
> >
> > C would be infinitely more interesting ;-)
>
> Speaking of which, has anyone ever compiled the ia32 Linux kernel with the
> Intel compiler? I thought I saw some patches floating around to make it
> compile the ia64 kernel .... that'd be an interesting test case ... might
> give us some ideas about what could be tweaked in GCC (or code rejiggled in
> the kernel).
>
> M.

Martin -

Like this?

http://marc.theaimsgroup.com/?l=linux-kernel&m=103559880923586&w=2

-- Patrick Mansfield

2003-02-04 17:49:29

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: gcc 2.95 vs 3.21 performance

>> Speaking of which, has anyone ever compiled the ia32 Linux kernel with
>> the Intel compiler? I thought I saw some patches floating around to make
>> it compile the ia64 kernel .... that'd be an interesting test case ...
>> might give us some ideas about what could be tweaked in GCC (or code
>> rejiggled in the kernel).
>>
>> M.
>
> Martin -
>
> Like this?
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103559880923586&w=2

Yeah, something very like that ;-) Thanks.
Preferably less micro-benchmarky though ....

M.

2003-02-04 19:03:29

by Timothy D. Witham

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

On Mon, 2003-02-03 at 22:54, Denis Vlasenko wrote:
snip

>
> I'm afraid it's code generation engine. It is just worse than
> M$ or Intel's one. It is not easily fixable,
> GCC folks have tremendous task at hand.
>
> I wonder whether some big companies supposedly supporting
> Linux (e.g. Intel) can help GCC team (for example by giving
> away some code and/or developer time).
> --

I'm hesitant to enter into this. But from my own experience,
the issue with big companies supporting these sorts of changes
in gcc has more to do with the acceptance process of changes
into gcc than a lack of desire on the large companies' part.

Tim

> vda
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2003-02-04 19:26:58

by John Bradford

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

> I'm hesitant to enter into this. But from my own experience,
> the issue with big companies supporting these sorts of changes
> in gcc has more to do with the acceptance process of changes
> into gcc than a lack of desire on the large companies' part.

Maybe we should create a KGCC fork, optimise it for kernel
compilations, then try to get our changes merged back in to GCC
mainline at a later date.

John.

2003-02-04 19:41:22

by Dave Jones

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, Feb 04, 2003 at 07:35:06PM +0000, John Bradford wrote:

> Maybe we should create a KGCC fork, optimise it for kernel
> compilations, then try to get our changes merged back in to GCC
> mainline at a later date.

What exactly do you mean by "optimise for kernel compilations" ?

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2003-02-04 20:02:27

by John Bradford

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

> > Maybe we should create a KGCC fork, optimise it for kernel
> > compilations, then try to get our changes merged back in to GCC
> > mainline at a later date.
>
> What exactly do you mean by "optimise for kernel compilations" ?

I don't, that was a bad way of phrasing it - I didn't mean fork GCC
just to create one which compiles the kernel so it runs faster, as the
expense of other code.

What I was thinking was that if we forked GCC, we could try out all of
these ideas that have been floating around in this thread, and if, as
was hinted at earlier in this thread, $bigcompanies[] have not offered
contributions because of reluctance to accept them by the GCC team, we
would be more in a position to try them out, because we only need to
concern ourselves with breaking the compilation of the kernel, not
every single program that currently compiles with GCC.

The way I see it, the development series would be optimised for KGCC,
and when we start to think about stabilising that development series,
we try to get our KGCC changes merged back in to GCC mainline. If
they are not accepted, either KGCC becomes the recommended kernel
compiler, which should cause no great difficulties, (having one
compiler for kernels, and one for userland applications), or we start
making sure that we haven't broken compilation with GCC, (and since
there would probably always be people compiling with GCC anyway, even
if there was a KGCC, we would effectively always know if we broke
compilation with GCC), and then the recommended compiler is just not
the optimal one, and it would be up to the various distributions to
decide which one they are going to use.

John.

2003-02-04 20:15:36

by John Bradford

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

Sorry, that last post didn't make sense, please apply this diff:

- just to create one which compiles the kernel so it runs faster, as the
+ just to create one which compiles the kernel so it runs faster, at the
expense of other code.

John.

2003-02-04 20:24:01

by Herman Oosthuysen

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

Hi there,

More than anything else, the execution speed on modern processors seems
to be a function of code and data alignment. Some processors are OK with
16-bit word alignment, others require 32-bit word alignment, and the new
crop of processors will probably require 64-bit word alignment.

If the data accesses are not aligned for your type of processor, then
SDRAM accesses go to hell as the bursting gets upset.

Unfortunately, this is a factor of processor architecture, and the MS and
Intel compilers support a small number of processors and can therefore
be more easily optimized than GCC, which supports every processor in the
whole world.

If some application of yours is very speed sensitive, then you'll have
to insert specific alignment control switches/pragmas to force GCC to
do things the right way for speed, but that will typically increase the
code and data size a little.
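
As one illustration of such an alignment control, GCC's `aligned` attribute can be put on a type so that every instance starts on a chosen boundary. The sketch below assumes a 64-byte cache line; `struct padded_counter`, `counters` and `counters_are_aligned()` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; the right value depends on the CPU. */
#define CACHELINE_BYTES 64

/* Each counter is aligned (and therefore padded) to a full cache line. */
struct padded_counter {
        unsigned long value;
} __attribute__((aligned(CACHELINE_BYTES)));

struct padded_counter counters[4];

/* Returns 1 if every array element starts on a cache-line boundary. */
int counters_are_aligned(void)
{
        for (size_t i = 0; i < sizeof counters / sizeof counters[0]; i++)
                if ((uintptr_t)&counters[i] % CACHELINE_BYTES != 0)
                        return 0;
        return 1;
}
```

This also shows the size cost mentioned above: the struct holds one `unsigned long` but occupies 64 bytes.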

Cheers,
--

------------------------------------------------------------------------
Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: [email protected]
Phone: 1.403.569-5687, Fax: 1.403.235-3965
------------------------------------------------------------------------


[email protected] wrote:
> Helge Hafting wrote:
>
>>[email protected] wrote:
>>[...]
>>
>>
>>>Interesting. I just noticed that I get 50% decrease in
>>>the speed of my program if I just insert a printf(). I.E.
>>>my program is like:
>>>
>>>printf()
>>>for(;;) {
>>> do_sorting_loop_test();
>>>}
>>>
>>>If I remove the initial printf it doubles in speed?
>>>I assume this is some weird caching thing?
>>
>>
>>Looks like a cacheline alignment issue to me.
>>This loop of yours occupies x cachelines on your cpu,
>>moving it in memory by adding the printf
>>might cause it to occupy x+1 cachelines.
>>That might be noticeable if x is a really small number,
>>such as 1.
>
>
> OK it is (as I suspected and as you explained nicely)
> related to the cachelines on my CPU (866 celery).
>
> ====================================
> GCC options                  loops/s
> ====================================
> gcc                             2283
> gcc -O3 -falign-loops=2         3451
> gcc -O3 -falign-loops=4         3443
> gcc -O3 -falign-loops=8         7045
> gcc -march=i686 -O3             9101
> ====================================
>
> cheers,
> Pádraig.



2003-02-04 20:32:14

by Herman Oosthuysen

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

Hi there,

From my experience, the speed issue is caused by misaligned memory
accesses, causing inefficient SDRAM to Cache movement of data and
instructions.

I don't think that you necessarily need a modification to the compiler.
What you can do is carefully place the ALIGN switch in a few critical
places in the kernel code, to ensure that the code and data will be
properly aligned for whatever processor it is compiled for, be that a
Pentium, an ARM, a MIPS or whatever.

It would be nice if GCC can be suitably improved to do this correctly for
all architectures, but a little bit of human help can do wonders,
without having to fork the GCC project.

Cheers,
--

------------------------------------------------------------------------
Herman Oosthuysen
B.Eng.(E), Member of IEEE
Wireless Networks Inc.
http://www.WirelessNetworksInc.com
E-mail: [email protected]
Phone: 1.403.569-5687, Fax: 1.403.235-3965
------------------------------------------------------------------------



John Bradford wrote:
>> > Maybe we should create a KGCC fork, optimise it for kernel
>> > compilations, then try to get our changes merged back in to GCC
>> > mainline at a later date.
>>
>>What exactly do you mean by "optimise for kernel compilations" ?
>
>
> I don't, that was a bad way of phrasing it - I didn't mean fork GCC
> just to create one which compiles the kernel so it runs faster, as the
> expense of other code.
>
> What I was thinking was that if we forked GCC, we could try out all of
> these ideas that have been floating around in this thread, and if, as
> was hinted at earlier in this thread, $bigcompanies[] have not offered
> contributions because of reluctance to accept them by the GCC team, we
> would be more in a position to try them out, because we only need to
> concern ourselves with breaking the compilation of the kernel, not
> every single program that currently compiles with GCC.
>
> The way I see it, the development series would be optimised for KGCC,
> and when we start to think about stabilising that development series,
> we try to get our KGCC changes merged back in to GCC mainline. If
> they are not accepted, either KGCC becomes the recommended kernel
> compiler, which should cause no great difficulties, (having one
> compiler for kernels, and one for userland applications), or we start
> making sure that we haven't broken compilation with GCC, (and since
> there would probably always be people compiling with GCC anyway, even
> if there was a KGCC, we would effectively always know if we broke
> compilation with GCC), and then the recommended compiler is just not
> the optimal one, and it would be up to the various distributions to
> decide which one they are going to use.
>
> John.


2003-02-04 21:32:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

In article <[email protected]>,
John Bradford <[email protected]> wrote:
>> I'm hesitant to enter into this. But from my own experience,
>> the issue with big companies supporting these sorts of changes
>> in gcc has more to do with the acceptance process of changes
>> into gcc than a lack of desire on the large companies' part.
>
>Maybe we should create a KGCC fork, optimise it for kernel
>compilations, then try to get our changes merged back in to GCC
>mainline at a later date.

That's not really the problem.

I think the problem with gcc is that many of the developers are actually
much more interested in Ada or C++ (or even Fortran!), than in plain
old-fashioned C. So it's not a kernel issue per se, gcc is slow to
compile _any_ C project.

And a lot of the optimizations gcc does aren't even interesting to most
C projects. Most "old-fashioned" C projects tend to be written in ways
that mean that the most important optimizations are the truly trivial
ones, and then doing good register allocation.

I'd love to see a small - and fast - C compiler, and I'd be willing to
make kernel changes to make it work with it.

Let's see. There has been some noise on the gcc lists about splitting up
the languages for easier maintenance, we'll see what happens.

Linus

2003-02-04 21:39:10

by Timothy D. Witham

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance


On Tue, 2003-02-04 at 12:45, Herman Oosthuysen wrote:
> Hi there,
>
> From my experience, the speed issue is caused by misaligned memory
> accesses, causing inefficient SDRAM to Cache movement of data and
> instructions.
>
> I don't think that you necessarily need a modification to the compiler.
> What you can do is carefully place the ALIGN switch in a few critical
> places in the kernel code, to ensure that the code and data will be
> properly aligned for whatever processor it is compiled for, be that a
> Pentium, an ARM, a MIPS or whatever.
>

I guess I would like the compiler to do that without having to go
in and futz with the code.

> It would be nice if GCC can be suitably improved to do this correctly for
> all architectures, but a little bit of human help can do wonders,
> without having to fork the GCC project.
>
> Cheers,
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2003-02-04 21:44:12

by John Bradford

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

> I'd love to see a small - and fast - C compiler, and I'd be willing to
> make kernel changes to make it work with it.

How IA-32 centric would your preferred compiler choice be? In other
words, if a small and fast C compiler turns up, which lacks support
for some architectures the kernel is currently ported to, are you
likely to encourage kernel changes which will make it difficult for
the other architectures that have to stay with GCC to keep up?

John.

2003-02-04 21:56:08

by Andi Kleen

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

[email protected] (Linus Torvalds) writes:
>
> I'd love to see a small - and fast - C compiler, and I'd be willing to
> make kernel changes to make it work with it.

If you want small and fast use lcc.

Unfortunately it's not completely free (some weird license), doesn't
really support real inline assembly and generates rather bad code compared
to gcc.

I'm still looking forward to Open Watcom (http://www.openwatcom.org) -
they are near self hosting on Linux. The inline assembly is very VC++ style
though; very different from gcc and worse you have to write it in
Intel syntax.

Another alternative would be TenDRA, but it also has no inline assembly
and its C understanding can only be described as "fascist".

If you don't care about free software you could also use the Intel
compiler, which seems to be often faster in compile time than gcc now
and can already compile kernels.

-Andi

2003-02-04 22:04:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance


On Tue, 4 Feb 2003, John Bradford wrote:
> > I'd love to see a small - and fast - C compiler, and I'd be willing to
> > make kernel changes to make it work with it.
>
> How IA-32 centric would your preferred compiler choice be? In other
> words, if a small and fast C compiler turns up, which lacks support
> for some architectures the kernel is currently ported to, are you
> likely to encourage kernel changes which will make it difficult for
> the other architectures that have to stay with GCC to keep up?

I don't think being architecture-specific is necessarily a bad thing in
compilers, although most compiler writers obviously try to avoid it.

The kernel shouldn't really care: it does want to have a compiler with
support for inline functions, but other than that it's fairly close to
ANSI C.

Yes, I know we use a _lot_ of gcc extensions (inline asms, variadic macros
etc), but that's at least partly because there simply aren't any really
viable alternatives to gcc, so we've had no incentives to abstract any of
that out.

So the gcc'isms aren't really fundamental per se. Although, quite frankly,
even inline asms are pretty much a "standard" thing for any reasonable C
compiler (since C is often used for things that really want it), and the
main issue tends to be the exact syntax rather than anything else. So I
don't think I'd like to use a compiler that is _so_ limited that it
doesn't have some support for something like that. I certainly would
refuse to use a C compiler that didn't support inline functions.
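
To make the distinction concrete, here is a small sketch of the three constructs mentioned above: an inline function, a variadic macro (in its standard C99 spelling), and a GCC-style inline asm used purely as a compiler barrier. The names `clamp_int`, `LOG_TO` and `compiler_barrier` are made up for this example and are not kernel identifiers; the asm syntax is the GCC one, which is exactly the part another compiler would be likely to spell differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Inline function: type-checked, and usually as cheap as a macro. */
static inline int clamp_int(int v, int lo, int hi)
{
        return v < lo ? lo : (v > hi ? hi : v);
}

/* C99 variadic macro; needs at least one argument after fmt. */
#define LOG_TO(buf, size, fmt, ...) \
        snprintf((buf), (size), "log: " fmt, __VA_ARGS__)

/*
 * GCC-syntax inline asm with an empty body: emits no instructions but
 * stops the compiler from reordering memory accesses across it.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Formats a clamped value into buf and returns the clamped value. */
int log_clamped(char *buf, size_t size, int v)
{
        int c = clamp_int(v, 0, 3);

        compiler_barrier();
        LOG_TO(buf, size, "%d", c);
        return c;
}
```

Only the `compiler_barrier()` line depends on a GCC extension here; the other two constructs are standard C99.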

Linus

2003-02-04 22:07:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance


On 4 Feb 2003, Andi Kleen wrote:
>
> If you want small and fast use lcc.

lcc isn't really something I want to use, since the license is so strange,
and thus can't be improved upon if there are issues with it.

Some people have used the Intel compiler - which obviously also cannot be
improved upon, but which is likely to start off pretty good. I don't
really want to use it myself - what I'd really like to see is gcc
splitting up just the C compiler as a separate project with more attention
to size and speed.

Linus

2003-02-04 22:49:46

by Jeff Muizelaar

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

Andi Kleen wrote:

>If you want small and fast use lcc.
>
>Unfortunately it's not completely free (some weird license), doesn't
>really support real inline assembly and generates rather bad code compared
>to gcc.
>
>I'm still looking forward to Open Watcom (http://www.openwatcom.org) -
>they are near self hosting on Linux. The inline assembly is very VC++ style
>though; very different from gcc and worse you have to write it in
>Intel syntax.
>
>Another alternative would be TenDRA, but it also has no inline assembly
>and its C understanding can only be described as "fascist".
>
>If you don't care about free software you could also use the Intel
>compiler, which seems to be often faster in compile time than gcc now
>and can already compile kernels.
>
There is also tcc (http://fabrice.bellard.free.fr/tcc/).
It claims to support gcc-like inline assembler, and appears to be much
smaller and faster than gcc. Plus it is GPL, so the license isn't a
problem either.
Though, I am not really sure of the quality of code generated or of how
mature it is.

-Jeff


2003-02-04 23:02:36

by Balram Adlakha

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

Jeff Muizelaar writes:

> Andi Kleen wrote:
>
>> If you want small and fast use lcc.
>>
>> Unfortunately it's not completely free (some weird license), doesn't
>> really support real inline assembly and generates rather bad code
>> compared to gcc.
>>
>> I'm still looking forward to Open Watcom (http://www.openwatcom.org) -
>> they are near self hosting on Linux. The inline assembly is very VC++
>> style though; very different from gcc and worse you have to write it in
>> Intel syntax.
>>
>> Another alternative would be TenDRA, but it also has no inline assembly
>> and its C understanding can only be described as "fascist".
>>
>> If you don't care about free software you could also use the Intel
>> compiler, which seems to be often faster in compile time than gcc now
>> and can already compile kernels.
>>
> There is also tcc (http://fabrice.bellard.free.fr/tcc/)
> It claims to support gcc-like inline assembler, appears to be much smaller
> and faster than gcc. Plus it is GPL so the license isn't a problem
> either.
> Though, I am not really sure of the quality of code generated or of how
> mature it is.
>
> -Jeff

wow, looks like some teenage kid like me made it...
it's a 170 kb gzipped tar!
nice for a C compiler... But I'm not sure if it could compile half of the
linux kernel successfully...

2003-02-04 23:11:35

by Larry McVoy

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

> I'd love to see a small - and fast - C compiler, and I'd be willing to
> make kernel changes to make it work with it.

I can't offer any immediate help with this but I want the same thing. At
some point, we're planning on funding some extensions into GCC or whatever
reasonable C compiler is around:

- associative arrays as a builtin type

    {
        assoc bar = {}; // anonymous, no file backing

        bar{"some key"} = "some value";
        if (defined(bar{"some other value"})) ...
    }

- regular expressions

    {
        char *foo = "blech";

        if (foo =~ /regex are nice/) {
            printf("Well isn't that special?\n");
        }
    }

- tk bindings built in

and then we'll port BK to that compiler. It's likely to be GCC because we
want to support all the different architectures but if a kernel sponsored
cc shows up we'll happily throw money at that.
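
For comparison, this is roughly what the regular expression case looks like in plain C today using the POSIX `regcomp()`/`regexec()` interface, with no language extension. It is only a sketch: the wrapper name `regex_matches` is made up for illustration, and it recompiles the pattern on every call, which a real program would avoid.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if text matches the POSIX extended
 * regular expression, 0 if it does not, and -1 if the pattern is invalid.
 */
int regex_matches(const char *text, const char *pattern)
{
        regex_t re;
        int rc;

        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
                return -1;
        rc = regexec(&re, text, 0, NULL, 0);
        regfree(&re);
        return rc == 0;
}
```

With this, the `foo =~ /regex are nice/` test becomes `regex_matches(foo, "regex are nice") == 1`, which is more typing but needs nothing beyond the C library.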
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-04 23:20:27

by Timothy D. Witham

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

If needed we could build this compiler's tree into our testing
process. (PLM/STP) So that patches or changes could be automatically
tested against a matrix of kernels, hardware configurations on
different regression and stress tests.

Tim

On Tue, 2003-02-04 at 14:11, Linus Torvalds wrote:
> On Tue, 4 Feb 2003, John Bradford wrote:
> > > I'd love to see a small - and fast - C compiler, and I'd be willing to
> > > make kernel changes to make it work with it.
> >
> > How IA-32 centric would your preferred compiler choice be? In other
> > words, if a small and fast C compiler turns up, which lacks support
> > for some architectures the kernel is currently ported to, are you
> > likely to encourage kernel changes which will make it difficult for
> > the other architectures that have to stay with GCC to keep up?
>
> I don't think being architecture-specific is necessarily a bad thing in
> compilers, although most compiler writers obviously try to avoid it.
>
> The kernel shouldn't really care: it does want to have a compiler with
> support for inline functions, but other than that it's fairly close to
> ANSI C.
>
> Yes, I know we use a _lot_ of gcc extensions (inline asms, variadic macros
> etc), but that's at least partly because there simply aren't any really
> viable alternatives to gcc, so we've had no incentives to abstract any of
> that out.
>
> So the gcc'isms aren't really fundamental per se. Although, quite frankly,
> even inline asms are pretty much a "standard" thing for any reasonable C
> compiler (since C is often used for things that really want it), and the
> main issue tends to be the exact syntax rather than anything else. So I
> don't think I'd like to use a compiler that is _so_ limited that it
> doesn't have some support for something like that. I certainly would
> refuse to use a C compiler that didn't support inline functions.
>
> Linus
>
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2003-02-04 23:32:35

by Balram Adlakha

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

>> I'd love to see a small - and fast - C compiler, and I'd be willing to
>> make kernel changes to make it work with it.

tcc looks like a cool project to me...
It's small enough to be distributed through this mailing list!

and the "C scripts" look like a cool feature...

2003-02-04 23:42:05

by Eli Carter

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

Larry McVoy wrote:
>>I'd love to see a small - and fast - C compiler, and I'd be willing to
>>make kernel changes to make it work with it.
>
>
> I can't offer any immediate help with this but I want the same thing. At
> some point, we're planning on funding some extensions into GCC or whatever
> reasonable C compiler is around:
>
> - associative arrays as a builtin type
[snip]
> - regular expressions
[snip]
> - tk bindings built in
>
> and then we'll port BK to that compiler. It's likely to be GCC because we
> want to support all the different architectures but if a kernel sponsored
> cc shows up we'll happily throw money at that.

Ok, dumb, (and probably flamebait) question time: I read your list and
thought "In C? Why not Python?" I'm guessing speed issues?

Eli
--------------------. "If it ain't broke now,
Eli Carter \ it will be soon." -- crypto-gram
eli.carter(a)inet.com `-------------------------------------------------

2003-02-04 23:41:40

by Jakob Oestergaard

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, Feb 04, 2003 at 03:21:01PM -0800, Larry McVoy wrote:
> > I'd love to see a small - and fast - C compiler, and I'd be willing to
> > make kernel changes to make it work with it.
>
> I can't offer any immediate help with this but I want the same thing. At
> some point, we're planning on funding some extensions into GCC or whatever
> reasonable C compiler is around:

[snipping Linus from To:]

Cool.

>
> - associative arrays as a builtin type
>
> {
> assoc bar = {}; // anonymous, no file backing
>
> bar{"some key"} = "some value";
> if (defined(bar{"some other value"})) ...
> }

Allow me:

    {
        std::map<std::string,std::string> bar;

        bar["some key"] = "some value";
        if (bar.find("some other value") != bar.end()) ...
    }

Works beautifully, all you need is to pick the existing language which
allows for the existing standard library which already provides that
functionality.

I doubt there's much need for a C+ or C 2+/3 language variant ;)
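
For readers who want to stay in plain C, POSIX already ships a basic string-keyed table in `<search.h>`. The following is a minimal sketch of the same "some key" example on top of `hcreate()`/`hsearch()`; the wrapper names `assoc_init`, `assoc_set` and `assoc_get` are made up for illustration. The limitations are real: there is one global table per process, its size is fixed at creation, and keys and values are stored by pointer, not copied.

```c
#include <assert.h>
#include <search.h>
#include <stddef.h>
#include <string.h>

/* Create the (single, process-wide) table; returns 1 on success. */
int assoc_init(size_t capacity)
{
        return hcreate(capacity) != 0;
}

/* Insert or overwrite key -> value; returns 1 on success, 0 if full. */
int assoc_set(const char *key, const char *value)
{
        ENTRY item;
        ENTRY *slot;

        item.key = (char *)key;         /* stored by pointer, not copied */
        item.data = (void *)value;
        slot = hsearch(item, ENTER);
        if (slot == NULL)
                return 0;
        slot->data = (void *)value;     /* overwrite if the key existed */
        return 1;
}

/* Look up key; returns the stored value or NULL if it is not present. */
const char *assoc_get(const char *key)
{
        ENTRY probe;
        ENTRY *slot;

        probe.key = (char *)key;
        probe.data = NULL;
        slot = hsearch(probe, FIND);
        return slot ? (const char *)slot->data : NULL;
}
```

The `defined(bar{...})` test then maps to `assoc_get(...) != NULL`, and `hdestroy()` releases the table when it is no longer needed.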

>
> - regular expressions
>
> {
> char *foo = "blech";
>
> if (foo =~ /regex are nice/) {
> printf("Well isn't that special?\n");
> }
> }

Ok, I can't help you with that.

You have probably seen a Perl program before... Now imagine a two
million line Perl program... That is why the above is not a good idea ;)

It's still your right to want it of course...

>
> - tk bindings built in

Built into the language (not a library)?

<sarcasm>
Then I'd want the compiler in a kernel module ;)
</>

> and then we'll port BK to that compiler. It's likely to be GCC because we
> want to support all the different architectures but if a kernel sponsored
> cc shows up we'll happily throw money at that.

If you look at http://www.codesourcery.com, you can see that there
really are some people who do GCC extensions or optimizations for money
- various institutions have funded additions to GCC this way.

It's a cool idea - I have a few things I'd like my company to fund as
well... Some time in the future... Unless someone beats us to it.

--
................................................................
: [email protected] : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :
:.........................:............{Konkhra}...............:

2003-02-05 00:11:01

by Andy Pfiffer

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, 2003-02-04 at 15:42, [email protected] wrote:
> >> I'd love to see a small - and fast - C compiler, and I'd be willing to
> >> make kernel changes to make it work with it.
>
> tcc looks like a cool project to me...
> It's small enough to be distributed through this mailing list!

Don't overlook lcc -- last I knew most users were using GNU's cpp, but
other than that, it is available for the curious:

http://www.cs.princeton.edu/software/lcc/




2003-02-05 00:18:17

by Larry McVoy

[permalink] [raw]
Subject: Re: gcc 2.95 vs 3.21 performance

> Ok, dumb, (and probably flamebait) question time: I read your list and
> thought "In C? Why not Python?" I'm guessing speed issues?

Scripting languages are unacceptable for products. Flat out unacceptable.
I spoke to Chip when he was running the perl effort, his answer was "if
you are worried about new releases of perl breaking your scripts, ship
your own version of perl". I spoke with Guido or some other Python
luminary and he said the same thing.

For something which a company has to support, it needs to be a compiled
language with fairly minimal dependencies. Otherwise the customer
upgrades and the tool breaks.

Don't get me wrong, I love perl (well, perl 4, perl 5 got a bit weird
for my tastes but some people seem to like it) and python looks cool as
well. They are great for prototyping but they are just useless as an
application platform. Our support costs would be through the roof.

Before the inevitable flameage, please consider that we have to support
people who insist on using all sorts of weird things. Richard Gooch
maintains his own a.out based linux distribution, for example. Do we
get to tell him to upgrade? Nope. And it just gets worse from there.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-02-05 00:54:15

by Hugo Mills

Subject: Re: gcc 2.95 vs 3.21 performance

On Wed, Feb 05, 2003 at 12:51:12AM +0100, Jakob Oestergaard wrote:
> On Tue, Feb 04, 2003 at 03:21:01PM -0800, Larry McVoy wrote:
> > I can't offer any immediate help with this but I want the same thing. At
> > some point, we're planning on funding some extensions into GCC or whatever
> > reasonable C compiler is around:
> >
> > - regular expressions
> >
> > {
> > char *foo = "blech";
> >
> > if (foo =~ /regex are nice/) {
> > printf("Well isn't that special?\n");
> > }
> > }
>
> Ok, I can't help you with that.

I wanted something like that a while ago, so I wrote a couple of
classes in C++ to handle regexps. Some of the test code looks like
this:

string str = "fum foo";
rejex exp("f(o*)");
// Search for a regex
if( str/exp )
    cout << "Found it!" << endl;
// Count matches
cout << str/exp << " matches" << endl;

replace rep("g$0");

// Search & replace
str/exp/rep;
cout << str << endl;

// All in one
"foo bar"/rejex("ba")/replace();

It's not perfect by any stretch of the imagination, but it works.
I've not released it, because I haven't had a chance to get it into a
releasable form yet. Actually, looking at it, I should probably play a
couple of tricks with overloading operators to give you instead

str =~ search/replace;

or even

"str" =~ "search"/"replace";

> You have probably seen a Perl program before... Now imagine a two
> million line Perl program... That is why the above is not a good idea ;)
>
> It's still your right to want it of course...

That's a good point, but I've always felt that the main problem
with perl isn't the regexes, but the rest of the language(*).

Hugo.

(*) Some may feel that, coming from a C++ programmer, this is a case
of the pot calling the kettle black. :)

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 1C335860 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Our so-called leaders speak/with words they try to jail ya/ ---
They subjugate the meek/but it's the rhetoric of failure.



2003-02-05 02:35:46

by Peter Chubb

Subject: Re: gcc 2.95 vs 3.21 performance

>>>>> "Bryan" == Bryan Andersen <[email protected]> writes:

Bryan> Personal opinion here but I know it is also held by many
Bryan> developers I know and work with. I'd rather have a compiler
Bryan> that produces correct and fast code but ran slow than one that
Bryan> produces slow or bad code and runs fast. Remember compilation
Bryan> is done far less often than run time execution. Yes I too
Bryan> noticed a difference when I switched over to 3.2 but I also
Bryan> noticed some of my code speed up.

A different personal opinion: I'd prefer a compiler that can be told
either to run fast and produce correct but suboptimal code or to
produce the fastest correct code it can.

While developing, the compile/test/think/edit cycle is dominated by compile
time for me. So fast compilation is important while developing
algorithms.

--
Dr Peter Chubb [email protected]
You are lost in a maze of BitKeeper repositories, all almost the same.

2003-02-05 02:54:34

by Tomas Szepe

Subject: Re: gcc 2.95 vs 3.21 performance

> [[email protected]]
>
> I can't offer any immediate help with this but I want the same thing. At
> some point, we're planning on funding some extensions into GCC or whatever
> reasonable C compiler is around:
>
> - associative arrays as a builtin type
> - regular expressions
> - tk bindings built in

Is it April 1st already?

I can't see why this should be a language extension other than you want
to make a real mess out of it.

> and then we'll port BK to that compiler. It's likely to be GCC because we
> want to support all the different architectures but if a kernel sponsored
> cc shows up we'll happily throw money at that.

Ever heard of glib?
#include <glib.h> and be done with it.

--
Tomas Szepe <[email protected]>

2003-02-05 05:45:59

by Mark Mielke

Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, Feb 04, 2003 at 03:21:01PM -0800, Larry McVoy wrote:
> > I'd love to see a small - and fast - C compiler, and I'd be willing to
> > make kernel changes to make it work with it.
> I can't offer any immediate help with this but I want the same thing. At
> some point, we're planning on funding some extensions into GCC or whatever
> reasonable C compiler is around:
> - associative arrays as a builtin type
> - regular expressions
> - tk bindings built in

What is the problem with C++ or objective C?

I doubt that the GCC people would accept these sort of additions, even
if complete.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2003-02-05 07:19:42

by Denis Vlasenko

Subject: Re: gcc 2.95 vs 3.21 performance

On 4 February 2003 22:45, Herman Oosthuysen wrote:
> Hi there,
>
> From my experience, the speed issue is caused by misaligned memory
> accesses, causing inefficient SDRAM to Cache movement of data and
> instructions.
>
> I don't think that you necessarily need a modification to the
> compiler. What you can do is carefully place the ALIGN switch in a
> few critical places in the kernel code, to ensure that the code and
> data will be properly aligned for whatever processor it is compiled
> for, be that a Pentium, an ARM, a MIPS or whatever.
>
> It would be nice if GCC can be suitably improved to do this correctly
> for all architectures, but a little bit of human help can do wonders,
> without having to fork the GCC project.

NO.

GCC already went this way, i.e. it aligns functions and loops by
ridiculous (IMHO) amounts like 16 bytes. That's 7.5 bytes per alignment
on average. Now count lk functions and loops and mourn for lost icache.
Or just disassemble any .o module and read the damn code.
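
The 7.5-byte figure is the mean padding when code is rounded up to a 16-byte boundary. A minimal Python sketch of that arithmetic, assuming the unpadded address is equally likely to fall on any of the 16 residues (an idealization; `padding` and `average_padding` are illustrative names, not anything from GCC):

```python
from fractions import Fraction


def padding(addr: int, align: int) -> int:
    """Bytes of padding needed to round addr up to a multiple of align."""
    return (-addr) % align


def average_padding(align: int) -> Fraction:
    """Mean padding, assuming addresses are uniform modulo align (an assumption)."""
    return Fraction(sum(padding(r, align) for r in range(align)), align)


print(float(average_padding(16)))  # 7.5
print(float(average_padding(4)))   # 1.5
```

Under the same assumption a 4-byte boundary costs 1.5 bytes on average, which is the kind of trade-off GCC's `-falign-functions` and `-falign-loops` options control.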

This is the primary reason why people report larger kernels for GCC 3.x

I am damn sure that if you compile with less sadistic alignment
you will get smaller *and* faster kernel.
--
vda

2003-02-05 08:32:13

by Horst von Brand

Subject: Re: gcc 2.95 vs 3.21 performance

[Massive Cc: snippage]

Jeff Muizelaar <[email protected]> said:

[...]

> There is also tcc (http://fabrice.bellard.free.fr/tcc/)
> It claims to support gcc-like inline assembler, appears to be much
> smaller and faster than gcc. Plus it is GPL so the license isn't a
> problem either.
> Though, I am not really sure of the quality of code generated

Horrible.

> or of how
> mature it is.

Nice for one-file throwaway C proggies. But then again, Perl is so much
better at what you'd want to do most of the time...

Look, people, the gcc folks have recently redone the guts of the compiler
to make more advanced optimizations possible/easier (look at the news for
2000-2002 at <http://gcc.gnu.org>). It still needs a lot of porting over of
optimizations and developing new ones, plus tuning, AFAIU.

The other open(ish) C compilers I know about are mere toys.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2003-02-05 10:27:02

by Andreas Schwab

Subject: Re: gcc 2.95 vs 3.21 performance

Denis Vlasenko <[email protected]> writes:

|> I am damn sure that if you compile with less sadistic alignment
|> you will get smaller *and* faster kernel.

So why don't you try it out? GCC offers everything you need for this
experiment.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nürnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2003-02-05 11:41:43

by Denis Vlasenko

Subject: Re: gcc 2.95 vs 3.21 performance

On 5 February 2003 12:36, Andreas Schwab wrote:
> Denis Vlasenko <[email protected]> writes:
> |> I am damn sure that if you compile with less sadistic alignment
> |> you will get smaller *and* faster kernel.
>
> So why don't you try it out? GCC offers everything you need for this
> experiment.

I did. Others did it too on occasion.

My argument was against overusing optimization techniques.
You cannot speed up kernel by aligning *everything* to 32 bytes,
or by unrolling all loops, or by aggressive inlining.
That's too easy to work. You get kernel which is bigger
*and* slower.
--
vda

2003-02-05 12:15:25

by Dave Jones

Subject: Re: gcc 2.95 vs 3.21 performance

On Wed, Feb 05, 2003 at 01:41:34PM +0200, Denis Vlasenko wrote:

> > So why don't you try it out? GCC offers everything you need for this
> > experiment.
>
> I did. Others did it too on occasion.

You seem to have forgotten to attach the numbers to your mail.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2003-02-05 12:58:01

by Dipankar Sarma

Subject: Re: [Lse-tech] Re: gcc 2.95 vs 3.21 performance

On Wed, Feb 05, 2003 at 01:41:34PM +0200, Denis Vlasenko wrote:
> My argument was against overusing optimization techniques.
> You cannot speed up kernel by aligning *everything* to 32 bytes,
> or by unrolling all loops, or by aggressive inlining.
> That's too easy to work. You get kernel which is bigger
> *and* slower.

I am not getting into this debate, just wanted to point out that
the effect of compiler optimization on UNIX kernels has been studied
before. One paper I recall is -

http://www.usenix.org/publications/library/proceedings/sf94/full_papers/partridge.ps

They used profile-guided optimization, so that is a whole other angle altogether.

Thanks
Dipankar

2003-02-05 15:21:13

by Martin J. Bligh

Subject: Re: gcc 2.95 vs 3.21 performance

> GCC already went this way, i.e. it aligns functions and loops by
> ridiculous (IMHO) amounts like 16 bytes. That's 7.5 bytes per alignment
> on average. Now count lk functions and loops and mourn for lost icache.
> Or just disassemble any .o module and read the damn code.
>
> This is the primary reason why people report larger kernels for GCC 3.x
>
> I am damn sure that if you compile with less sadistic alignment
> you will get smaller *and* faster kernel.

There's only one real way to know that. Do it, test it.

M.

2003-02-05 19:02:54

by Linus Torvalds

Subject: Re: gcc 2.95 vs 3.21 performance

In article <[email protected]>,
Jeff Muizelaar <[email protected]> wrote:
>
>There is also tcc (http://fabrice.bellard.free.fr/tcc/)
>It claims to support gcc-like inline assembler, appears to be much
>smaller and faster than gcc. Plus it is GPL so the license isn't a
>problem either.
>Though, I am not really sure of the quality of code generated or of how
>mature it is.

tcc is interesting. The code generation is pretty simplistic (read:
trivially horrible for most things), but it sure is fast and small. And
judging by the changelog, Fabrice is trying to compile the kernel with
it.

For a lot of problems, small-and-fast is good. Hell, some of the things
I'd personally find interesting don't have any code generation part at
all (static analysis of annotated source-code - stanford checker on the
cheap). And development doesn't always need good code generation (right
now some people use "gcc -O0" for that, because anything else hurts too
much. Now, the code from tcc will probably look more like "-O-1", but
at least you can test out things _quickly_).

Linus

2003-02-05 19:14:32

by Randy.Dunlap

Subject: Re: gcc 2.95 vs 3.21 performance

On Wed, 5 Feb 2003, Linus Torvalds wrote:

| In article <[email protected]>,
| Jeff Muizelaar <[email protected]> wrote:
| >
| >There is also tcc (http://fabrice.bellard.free.fr/tcc/)
| >It claims to support gcc-like inline assembler, appears to be much
| >smaller and faster than gcc. Plus it is GPL so the license isn't a
| >problem either.
| >Though, I am not really sure of the quality of code generated or of how
| >mature it is.
|
| tcc is interesting. The code generation is pretty simplistic (read:
| trivially horrible for most things), but it sure is fast and small. And
| judging by the changelog, Fabrice is trying to compile the kernel with
| it.
|
| For a lot of problems, small-and-fast is good. Hell, some of the things
| I'd personally find interesting don't have any code generation part at
| all (static analysis of annotated source-code - stanford checker on the
| cheap).
Yep, that's exactly why I'm interested...

| And development doesn't always need good code generation (right
| now some people use "gcc -O0" for that, because anything else hurts too
| much. Now, the code from tcc will probably look more like "-O-1", but
| at least you can test out things _quickly_).

--
~Randy

2003-02-05 19:14:17

by John Bradford

Subject: Re: gcc 2.95 vs 3.21 performance

> >There is also tcc (http://fabrice.bellard.free.fr/tcc/)
> >It claims to support gcc-like inline assembler, appears to be much
> >smaller and faster than gcc. Plus it is GPL so the license isn't a
> >problem either.
> >Though, I am not really sure of the quality of code generated or of how
> >mature it is.
>
> tcc is interesting. The code generation is pretty simplistic (read:
> trivially horrible for most things), but it sure is fast and small. And
> judging by the changelog, Fabrice is trying to compile the kernel with
> it.
>
> For a lot of problems, small-and-fast is good.

Maybe otcc is a better choice, then?

http://fabrice.bellard.free.fr/otcc/

:-)

John.

2003-02-05 19:40:14

by Pavel

Subject: Re: gcc 2.95 vs 3.21 performance

From: Linus Torvalds <[email protected]>
Date: Tue, 4 Feb 2003 14:14:06 -0800 (PST)

Hi Linus,

> lcc isn't really something I want to use, since the license is so
> strange, and thus can't be improved upon if there are issues with it.

what is the difference between compiler and source management system
regarding licenses and improvements?
--
Pavel Janík

I think I started with hitting C-h a lot. Really a LOT.
-- Kai Grossjohann in gnu.emacs.help about Emacs knowledge

2003-02-05 20:03:48

by Linus Torvalds

Subject: Re: gcc 2.95 vs 3.21 performance


On Wed, 5 Feb 2003, Pavel Janík wrote:
>
> Hi Linus,
>
> > lcc isn't really something I want to use, since the license is so
> > strange, and thus can't be improved upon if there are issues with it.
>
> what is the difference between compiler and source management system
> regarding licenses and improvements?

You snipped the part where I said that the intel compiler is likely to be
more interesting to a number of people, since it's at a higher level. So
no, I'm not religious about licenses.

But the real issue is "does it do what we want it to do?" and "do we have
a choice?". There are no open-source SCM's that work for me. But there
_is_ an open-source compiler that does work for me. At which point the
license matters - simply because there is choice in the matter.

Gcc mostly works. But it's slower than I'd like. And it prioritizes things
I don't care about. And competition is always good. So I would definitely
love to see some alternatives.

And if you have issues with BK, maybe you can try to encourage the SCM
people to see why I consider BK to not even have alternatives right now.

Linus

2003-02-05 20:19:11

by Balram Adlakha

Subject: Re: gcc 2.95 vs 3.21 performance

John Bradford writes:

>> No really, I downloaded tcc yesterday, compiled a few things with it and it
>> is REALLY fast...and as I wrote yesterday, it's small enough so people might
>> say:
>>
>> A: "I can't compile linux, what is wrong?"
>> B: "Here, compile it with the compiler attached to this message"
>>
>> Sounds like fun doesn't it? I mean, tcc is a working C compiler (that's
>> supposed to be a great thing), and it's only 170 kb gzipped tar!
>
> I haven't actually had chance to test tcc yet, but I'll try to
> tomorrow. How close is it to being able to compile the kernel?
>
> John.
Far away, it doesn't even compile the ncurses based menuconfig... I think we
need to hack (seriously) either tcc or linux... Since tcc is so small it
would be easier to make it run a bit more like gcc than to modify the
whole kernel...

2003-02-06 06:53:25

by Neil Booth

Subject: Re: gcc 2.95 vs 3.21 performance

Jeff Muizelaar wrote:-

> There is also tcc (http://fabrice.bellard.free.fr/tcc/)
> It claims to support gcc-like inline assembler, appears to be much
> smaller and faster than gcc. Plus it is GPL so the license isn't a
> problem either.

It doesn't expand macros correctly, however, and accepts an enormous
range of invalid code without a single diagnostic. I'm pretty sure
its arithmetic rules are incorrect, too. It's certainly nowhere
near C89 compliance.

Neil.

2003-02-06 14:52:25

by Horst von Brand

Subject: Re: gcc 2.95 vs 3.21 performance

[email protected] (Pavel Janík) said:
> Linus Torvalds <[email protected]> said:
> > lcc isn't really something I want to use, since the license is so
> > strange, and thus can't be improved upon if there are issues with it.

> what is the difference between compiler and source management system
> regarding licenses and improvements?

That bk was designed around Linus' and other head kernel hackers' ideas of
how it should work, and they are still bending over backwards to keep this
biggest _*non*_customer of theirs happy.

OTOH, lcc as a project seems to be dead for all practical purposes (it
looks like 4.2 will be the end of the line, no patches or updates have
shown up for quite some time). Its licence
<http://www.cs.princeton.edu/software/lcc/pkg/CPYRIGHT> is vaguely BSDish,
but with a "you can't make money off this or any modified versions/software
based on it" clause.

I've been inside lcc 4.1 (current version is 4.2, somewhat different, so
YMMV...) myself a bit, and while it is a marvelous showpiece for classroom
use, it is sorely lacking in what makes a _real_ C compiler (for kernel
use). For one, it only knows about i486-ish ia32 CPUs; to get others
supported in its current incarnation would be a massive exercise in
duplication or macro-massaging the backend source. Other than the (very
good) optimal instruction selection there is very little optimization (what
there is is a bit of strength reduction). The organization of the compiler
makes adding additional higher-level optimization almost impossible; a
separate SSA or similar intermediate form would have to be retrofitted. The
register selection is very simplistic and doesn't work correctly (some
experimental patches we had for generating PIC code on ia32 kept it
crashing by running out of registers; the code for fixing this case up just
doesn't work). No hint of instruction scheduling or the like.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2003-02-06 15:33:06

by Martin J. Bligh

Subject: gcc -O2 vs gcc -Os performance

Compiled the kernel with gcc -O2 (default) vs -Os
(which people sometimes predict will be faster due to better
cache usage). Didn't bother to measure how much time the compile
itself took like that, but the resultant kernels were compared.
Summary ... -Os is a little slower (note system times on kernbench,
SDET and NUMAschedbench I consider within experimental error),
but not drastically. I wouldn't switch to it though ;-)
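
One rough way to read "within experimental error" for the SDET figures in this message is to compare the gap between two results with their combined standard deviation. The sketch below is an assumption about how to apply that idea, not the method used for these runs: it treats the reported Std. Dev values as percentage points and uses a one-sigma threshold, and `within_noise` is a hypothetical helper.

```python
import math


def within_noise(a: float, sd_a: float, b: float, sd_b: float) -> bool:
    """True if |a - b| is no larger than the combined standard deviation."""
    return abs(a - b) <= math.hypot(sd_a, sd_b)


# SDET 1 from this message: -O2 at 100.0% (sd 4.1), -Os at 95.1% (sd 6.7)
print(within_noise(100.0, 4.1, 95.1, 6.7))  # True
# SDET 32 from this message: 100.0% (sd 2.2) versus 97.2% (sd 1.6)
print(within_noise(100.0, 2.2, 97.2, 1.6))  # False
```

By this crude criterion the SDET 1 gap is noise while the SDET 32 gap sits just outside one combined standard deviation; a stricter or looser threshold would change that reading.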

All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
with a puny cache if someone wants to try that out.

M.

sizes:

894822 Feb 5 23:50 /boot/vmlinuz-2.5.59-mjb3-Os
906203 Feb 5 22:46 /boot/vmlinuz-2.5.59-mjb3.old

Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                   Elapsed      User    System       CPU
2.5.59-mjb3          45.66    565.33    110.18   1479.00
2.5.59-mjb3-Os       45.58    565.38    111.42   1484.33

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                   Elapsed      User    System       CPU
2.5.59-mjb3          46.87    569.77    133.32   1499.67
2.5.59-mjb3-Os       46.86    569.30    134.63   1501.50

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed

SDET 1 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       4.1%
2.5.59-mjb3-Os           95.1%       6.7%

SDET 2 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       8.0%
2.5.59-mjb3-Os          101.2%       5.8%

SDET 4 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       6.2%
2.5.59-mjb3-Os           99.4%      14.1%

SDET 8 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       3.3%
2.5.59-mjb3-Os          100.5%       2.2%

SDET 16 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       3.2%
2.5.59-mjb3-Os           98.9%       2.4%

SDET 32 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       2.2%
2.5.59-mjb3-Os           97.2%       1.6%

SDET 64 (see disclaimer)
                    Throughput   Std. Dev
2.5.59-mjb3             100.0%       0.4%
2.5.59-mjb3-Os           99.9%       0.3%

SDET 128 (see disclaimer)
Throughput Std. Dev

NUMA schedbench 4:
                    AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3            0.00     34.62       90.63       0.91
2.5.59-mjb3-Os         0.00     40.35       81.94       0.69

NUMA schedbench 8:
                    AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3            0.00     52.16      266.45       1.51
2.5.59-mjb3-Os         0.00     46.61      248.47       1.49

NUMA schedbench 16:
                    AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3            0.00     57.38      845.30       3.58
2.5.59-mjb3-Os         0.00     58.34      851.12       2.94

NUMA schedbench 32:
                    AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3            0.00    118.05     1806.79       6.24
2.5.59-mjb3-Os         0.00    115.85     1803.72       6.29

NUMA schedbench 64:
                    AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3            0.00    236.59     3627.47      15.24
2.5.59-mjb3-Os         0.00    236.90     3631.11      15.35

2003-02-06 15:41:55

by Andi Kleen

Subject: Re: [Lse-tech] gcc -O2 vs gcc -Os performance

> All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
> 700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
> with a puny cache if someone wants to try that out.

-Os on 2.95 is not too useful. It only started becoming useful on 3.1+,
even more so on the upcoming 3.3.

e.g. there was one report of ACPI shrinking by >60k by recompiling it
with -Os on 3.1. ACPI is only slow path code so that is completely reasonable.

Best would be of course to use profile feedback to let the compiler
decide where to generate small and where to generate fast&big code.
But that has problems with the maintainability (it will be hard to generate
the same vmlinux as users for debugging/ksymoops reading purposes)

-Andi

2003-02-06 16:42:06

by Alan

Subject: Re: gcc -O2 vs gcc -Os performance

On Thu, 2003-02-06 at 15:42, Martin J. Bligh wrote:
> All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
> 700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
> with a puny cache if someone wants to try that out

gcc 3.2 is a lot smarter about -Os and it makes a very big size
difference according to the numbers from the ACPI guys.

I'm not sure testing with a gcc from the last millennium is useful 8)

2003-02-06 16:57:11

by Martin J. Bligh

Subject: Re: gcc -O2 vs gcc -Os performance

>> All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
>> 700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
>> with a puny cache if someone wants to try that out
>
> gcc 3.2 is a lot smarter about -Os and it makes a very big size
> difference according to the numbers from the ACPI guys.
>
> I'm not sure testing with a gcc from the last millennium is useful 8)

I'll retest with gcc-3.2 ... maybe it'll finally show a case where it's
better than 2.95 this way?

<ducks> <runs>

M.

2003-02-06 20:28:53

by Martin J. Bligh

Subject: Re: gcc -O2 vs gcc -Os performance

>> All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
>> 700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
>> with a puny cache if someone wants to try that out
>
> gcc 3.2 is a lot smarter about -Os and it makes a very big size
> difference according to the numbers from the ACPI guys.
>
> I'm not sure testing with a gcc from the last millennium is useful 8)

Still no use.
/me throws gcc-3.2 in the trash can.

2901299 vmlinux.O2
2667827 vmlinux.Os


Kernbench-2: (make -j N vmlinux, where N = 2 x num_cpus)
                       Elapsed      User    System       CPU
2.5.59-mjb3-gcc32-O2     45.86    564.75    110.91   1472.67
2.5.59-mjb3-gcc32-Os     45.74    563.96    111.06   1475.17

Kernbench-16: (make -j N vmlinux, where N = 16 x num_cpus)
                       Elapsed      User    System       CPU
2.5.59-mjb3-gcc32-O2     46.83    569.15    133.88   1500.50
2.5.59-mjb3-gcc32-Os     46.90    568.17    134.58   1497.83

DISCLAIMER: SPEC(tm) and the benchmark name SDET(tm) are registered
trademarks of the Standard Performance Evaluation Corporation. This
benchmarking was performed for research purposes only, and the run results
are non-compliant and not-comparable with any published results.

Results are shown as percentages of the first set displayed

SDET 1 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       3.4%
2.5.59-mjb3-gcc32-Os         99.8%       2.8%

SDET 2 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       6.7%
2.5.59-mjb3-gcc32-Os        101.2%       4.9%

SDET 4 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       3.8%
2.5.59-mjb3-gcc32-Os         95.1%       3.0%

SDET 8 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       1.1%
2.5.59-mjb3-gcc32-Os         98.1%       1.4%

SDET 16 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       1.6%
2.5.59-mjb3-gcc32-Os         97.7%       1.7%

SDET 32 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       1.1%
2.5.59-mjb3-gcc32-Os        103.7%       1.9%

SDET 64 (see disclaimer)
                        Throughput   Std. Dev
2.5.59-mjb3-gcc32-O2        100.0%       1.4%
2.5.59-mjb3-gcc32-Os         96.6%       9.7%

NUMA schedbench 4:
                        AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3-gcc32-O2       0.00     36.93       88.84       0.62
2.5.59-mjb3-gcc32-Os       0.00     44.28       96.95       0.67

NUMA schedbench 8:
                        AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3-gcc32-O2       0.00     54.16      327.57       1.58
2.5.59-mjb3-gcc32-Os       0.00     50.66      248.42       1.89

NUMA schedbench 16:
                        AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3-gcc32-O2       0.00     57.17      851.44       3.09
2.5.59-mjb3-gcc32-Os       0.00     57.25      849.20       3.14

NUMA schedbench 32:
                        AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3-gcc32-O2       0.00    117.82     1808.42       6.34
2.5.59-mjb3-gcc32-Os       0.00    130.02     1814.74       6.52

NUMA schedbench 64:
                        AvgUser   Elapsed   TotalUser   TotalSys
2.5.59-mjb3-gcc32-O2       0.00    236.82     3616.04      15.17
2.5.59-mjb3-gcc32-Os       0.00    241.34     3624.50      16.39

2003-02-06 20:33:32

by Paul Jakma

Subject: Re: gcc 2.95 vs 3.21 performance

On Tue, 4 Feb 2003, Larry McVoy wrote:

> Scripting languages are unacceptable for products. Flat out unacceptable.
> I spoke to Chip when he was running the perl effort, his answer was "if
> you are worried about new releases of perl breaking your scripts, ship
> your own version of perl".

There is a perl compiler, perlcc, but it's not perfect. Why not fund it
to have it made perfect? Then you get the best of all worlds - perl and
interpretation at run time for developers and the ability to ship binary
files to customers.

regards,
--
Paul Jakma Sys Admin Alphyra
[email protected]
Warning: /never/ send email to [email protected] or [email protected]

2003-02-06 21:24:42

by John Bradford

Subject: Re: gcc -O2 vs gcc -Os performance

> >> All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
> >> 700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
> >> with a puny cache if someone wants to try that out
> >
> > gcc 3.2 is a lot smarter about -Os and it makes a very big size
> > difference according to the numbers from the ACPI guys.
> >
> > I'm not sure testing with a gcc from the last millennium is useful 8)
>
> Still no use.
> /me throws gcc-3.2 in the trash can.

What submodel options are you using? If you're compiling with
-march=i386, I wouldn't expect -Os to have much effect.

Note that, of all architectures, GCC is almost certainly most
efficient on IA-32. Although I haven't done any benchmarks against
other compilers on $arch!=IA32, the ones I've seen claim that the
native compiler generates much better code.

John.

2003-02-06 22:06:02

by Linus Torvalds

Subject: Re: gcc -O2 vs gcc -Os performance

In article <263740000.1044563891@[10.10.2.4]>,
Martin J. Bligh <[email protected]> wrote:
>>> All done with gcc-2.95.4 (Debian Woody). These machines (16x NUMA-Q) have
>>> 700MHz P3 Xeons with 2Mb L2 cache ... -Os might fare better on celeron
>>> with a puny cache if someone wants to try that out
>>
>> gcc 3.2 is a lot smarter about -Os and it makes a very big size
>> difference according to the numbers from the ACPI guys.
>>
>> I'm not sure testing with a gcc from the last millennium is useful 8)
>
>Still no use.
>/me throws gcc-3.2 in the trash can.
>
>2901299 vmlinux.O2
>2667827 vmlinux.Os

Well, Os is certainly smaller. One thing to look out for is that
microbenchmarks for kernels are usually the _worst_ things to test with
Os.

That's since a large part of the premise of the -Os speed advantage is
that it is better for icache (usually not an issue for microbenchmarks)
and that it is better for load/startup times (generally not a huge issue
for kernels, since the real startup costs of kernels tend to be entirely
elsewhere).

So I suspect -Os tends to be more appropriate for user-mode code, and
especially code with low repeat rates. Possibly the "low repeat rate"
thing ends up being true of certain kernel subsystems too.

Think of it this way: if you win 10% in size, you're likely to map and
load 10% less code pages at run-time. Which is not a big issue for
traditional data-centric loads, but can be a _huge_ deal for things like
GUI programs etc where there is often more code than data.
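
For scale, the image sizes quoted in this thread can be turned into percentages. A small sketch (the helper name `size_reduction_percent` is illustrative):

```python
def size_reduction_percent(before: int, after: int) -> float:
    """Relative size saving of `after` compared with `before`, in percent."""
    return 100.0 * (before - after) / before


# gcc 3.2, uncompressed vmlinux: -O2 versus -Os
print(round(size_reduction_percent(2901299, 2667827), 2))  # 8.05
# gcc 2.95, compressed vmlinuz: -O2 versus -Os
print(round(size_reduction_percent(906203, 894822), 2))    # 1.26
```

The two results are not directly comparable, since one pair is an uncompressed vmlinux and the other a compressed vmlinuz.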

Linus

2003-02-06 22:48:49

by Martin J. Bligh

Subject: Re: gcc -O2 vs gcc -Os performance

>> 2901299 vmlinux.O2
>> 2667827 vmlinux.Os
>
> Well, Os is certainly smaller.

Yup. I have lots of RAM though, so unless I can see the perf increase
from cache effects, it's not desperately interesting to me personally.
If someone could do similar measurements with a puny-cache celeron chip,
it would be interesting ...

> So I suspect -Os tends to be more appropriate for user-mode code, and
> especially code with low repeat rates. Possibly the "low repeat rate"
> thing ends up being true of certain kernel subsystems too.

Fair enough. I'm not desperately interested in user-land code at the
moment, personally, but gcc is admittedly more general. Maybe we should
compile gcc itself with -Os ;-) Andi (I think) also made the observation
that the garbage-collect size for gcc3.2 may be rather small.

The observation re low repeat rate is interesting ... might be amusing
to do some really basic profile-guided optimisation on these grounds,
take readprofile / oprofile output, and compile the files that don't
get hammered at all with -Os rather than -O2. Given their low frequency
(by definition), I'm not sure that improving their icache footprint will
have a measurable effect though.

M.
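
The per-file idea above can be sketched as a small script. Everything here is an assumption made for illustration: the input is taken to be simplified `ticks symbol` lines (real readprofile or oprofile output has more columns), `symbol_to_file` is a hypothetical mapping the caller has to supply, and the cutoff is arbitrary. It only picks candidate files; it does not touch the kernel build.

```python
from collections import defaultdict


def ticks_per_file(profile_lines, symbol_to_file):
    """Sum profile ticks per source file.

    profile_lines: iterable of "ticks symbol" strings (simplified format).
    symbol_to_file: hypothetical dict mapping symbol name -> source file.
    """
    totals = defaultdict(int)
    for line in profile_lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip headers and malformed lines
        ticks, symbol = int(fields[0]), fields[1]
        source = symbol_to_file.get(symbol)
        if source is not None:
            totals[source] += ticks
    return dict(totals)


def cold_files(all_files, totals, max_share=0.001):
    """Files whose share of all ticks is at most max_share: -Os candidates."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return sorted(all_files)
    return sorted(
        f for f in all_files if totals.get(f, 0) / grand_total <= max_share
    )


if __name__ == "__main__":
    # Made-up sample data, not measurements from the thread.
    sample = ["9000 hot_path", "990 warm_path", "10 rare_path"]
    mapping = {"hot_path": "hot.c", "warm_path": "warm.c", "rare_path": "rare.c"}
    files = ["hot.c", "warm.c", "rare.c", "never.c"]
    print(cold_files(files, ticks_per_file(sample, mapping), max_share=0.005))
```

Whether compiling the selected files with -Os makes any measurable difference is exactly the open question raised above; the script only makes the selection step concrete.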

2003-02-06 23:12:15

by Roger Larsson

Subject: Re: gcc -O2 vs gcc -Os performance

On Thursday 06 February 2003 21:38, Martin J. Bligh wrote:
> gcc-3.2
>
> 2901299 vmlinux.O2
> 2667827 vmlinux.Os
>

In an earlier message, Martin J. Bligh wrote:
>
> 894822 Feb 5 23:50 /boot/vmlinuz-2.5.59-mjb3-Os
> 906203 Feb 5 22:46 /boot/vmlinuz-2.5.59-mjb3.old

And if you compare both with same/no compression?

/RogerL

--
Roger Larsson
Skellefteå
Sweden

2003-02-06 23:10:45

by Linus Torvalds

Subject: Re: gcc -O2 vs gcc -Os performance


On Thu, 6 Feb 2003, Martin J. Bligh wrote:
>
> The observation re low repeat rate is interesting ... might be amusing
> to do some really basic profile-guided optimisation on these grounds,
> take readprofile / oprofile output, and compile the files that don't
> get hammered at all with -Os rather than -O2. Given their low frequency
> (by definition), I'm not sure that improving their icache footprint will
> have a measurable effect though.

Icache footprint has nothing to do with repeat rates, which is exactly why
repeat rates are interesting for -Os.

Icache footprint is directly proportional to the _static_ size of the code
(ie exactly the thing that -Os is supposed to optimize for), while
instruction-level performance measurement is only valid on the _dynamic_
code.

And with modern CPUs with big caches, a _lot_ of cache misses are the
forced kind - the startup costs, not the actual runtime cost. That's not
always true (if you touch big data sets, you'll have replacement misses
too, of course), but it's not really false either.

So think of the I$ (and TLB, and page load/map - all the same) cost as a
fixed cost that will always be there, but that -Os tries to minimize.
That's _one_ dimension in the total cost.

The "traditional" -O2 kind of "try to make the code run fast"
optimizations tend to try to minimize a totally different dimension,
namely the dynamic code speed.

And the time required for running the program is the sum of the static and
dynamic factors. In other words, a _good_ optimization should try to
minimize not one or the other, but the sum.

And low repeat rates means that the dynamic component is smaller, which
clearly makes the static component more important.

For example, if you are doing mp3 encoding, the repeat rates for the core
loop are huge, and the code is small, so clearly the static component is
largely insignificant. Use -O2.

But if you're running a GUI program then just the loading time is often
quite noticeable, and if you can improve that by, say, 10%, then that can
_more_ than make up for almost any amount of stupidity in your code.
Especially since a lot of the code isn't even all that loopy and tends to
have low repeat rates. You're almost guaranteed to be better off using -Os
than -O2.

If you've got performance counter data, check the I$ and ITLB miss ratios,
and if they are at all noticeable, think about the fact that an I$ miss
tends to cost a lot more than a few more dynamic instructions.

I suspect the kernel I$ behaviour is generally pretty good, and the ITLB
behaviour is improved even further thanks to large pages etc. That said, a
user app that blows the I$ will blow the kernel out of the I$ too, so
small is always beautiful, even in the kernel.

Linus
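
The "sum of the static and dynamic factors" argument in the message above can be written as a toy cost model. The numbers below are invented purely to show the shape of the trade-off; they are not measurements, and real I$/TLB/page-load costs do not reduce to a single constant.

```python
from dataclasses import dataclass


@dataclass
class Build:
    """Toy model of one compiled variant of the same code."""
    static_cost: float   # one-off cost: page load/map, cold I$ and TLB misses
    cost_per_run: float  # dynamic cost of executing the code once


def total_cost(build: Build, repeats: int) -> float:
    """Total time = static component + repeats * dynamic component."""
    return build.static_cost + repeats * build.cost_per_run


def break_even_repeats(small: Build, fast: Build) -> float:
    """Repeat count above which the faster-but-larger build wins.

    Assumes small has the lower static cost and fast the lower per-run cost.
    """
    return (fast.static_cost - small.static_cost) / (
        small.cost_per_run - fast.cost_per_run
    )


if __name__ == "__main__":
    # Hypothetical units and values, chosen only for illustration.
    os_build = Build(static_cost=900.0, cost_per_run=1.05)
    o2_build = Build(static_cost=1000.0, cost_per_run=1.00)
    for repeats in (10, 100_000):
        os_total = total_cost(os_build, repeats)
        o2_total = total_cost(o2_build, repeats)
        print(repeats, "-Os" if os_total < o2_total else "-O2")
    print("break-even:", break_even_repeats(os_build, o2_build))
```

With these made-up values the smaller build wins at low repeat counts and loses once the code runs more than roughly 2000 times, which is the mp3-encoder versus GUI-program contrast in miniature.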

2003-02-06 23:24:22

by Martin J. Bligh

Subject: Re: gcc -O2 vs gcc -Os performance

>> gcc-3.2
>>
>> 2901299 vmlinux.O2
>> 2667827 vmlinux.Os
>>
>
> In an earlier message, Martin J. Bligh wrote:
>>
>> 894822 Feb 5 23:50 /boot/vmlinuz-2.5.59-mjb3-Os
>> 906203 Feb 5 22:46 /boot/vmlinuz-2.5.59-mjb3.old
>
> And if you compare both with same/no compression?

980233 Feb 6 11:15 /boot/vmlinuz-2.5.59-mjb3
914965 Feb 6 09:34 /boot/vmlinuz-2.5.59-mjb3.old

Those were probably the right files. (O2 and Os respectively)
I didn't look too closely at the time. Looks like 2.95 produces
smaller files with O2 than 3.2 does with -Os. Bah.

/me cheers for gcc 2.95.4

M.

2003-02-06 23:50:26

by Martin J. Bligh

Subject: Re: gcc -O2 vs gcc -Os performance

>> The observation re low repeat rate is interesting ... might be amusing
>> to do some really basic profile-guided optimisation on these grounds,
>> take readprofile / oprofile output, and compile the files that don't
>> get hammered at all with -Os rather than -O2. Given their low frequency
>> (by definition), I'm not sure that improving their icache footprint will
>> have a measurable effect though.
>
> Icache footprint has nothing to do with repeat rates, which is exactly why
> repeat rates are interesting for -Os.

Reading the below, I think I just misinterpreted what you meant by
"repeat rate". My point was that if you hardly ever run that section
of code, -Os might be better. If we call how often you call that code
section its "frequency" (nothing to do with how tightly it loops inside
it), then if the frequency of the code is low, the icache footprint
might be better off smaller, as it'll just blow the icache when we do
run it and those cachelines are fetched. On the other hand, that won't
happen often, so it may well be unobservable for real loads.

M.



2003-02-07 10:22:14

by Balram Adlakha

Subject: Re: gcc 2.95 vs 3.21 performance

Neil Booth writes:

> [email protected] wrote:-
>
>> Maybe that's why it's a 0.9* version, and the author has stated on his site
>> that not all C98 features are implemented...but then even GCC doesn't
>> implement them...
>
> No, I said C89. He's got a *long* way to go for that. Forget C99.
>
> However, he does claim C89 compliance, which is quite disingenuous.
>
>> I checked tcc out, and it's damn fast, much much much much faster than gcc.
>> gcc is bloated and it's slow even on my pentium 4 machine, let alone my 1.2
>> celeron. It takes 20 minutes to compile a new kernel on that, now if you're
>> gonna test kernels/patches, you can wait 20 minutes for every compile!
>
> I agree. I'm trying to fix it.
>
> GCC is larger for a reason: it does things properly. It's easy to be
> fast if you're willing to be wrong, and not emit warnings or errors, and
> not implement half the standard. And not optimize.
>
>> Even icc is much better than gcc, but it's very particular about code (and
>> it's not gcc compatible as the intel site says)
>> And it's non-free also...
>
> Only better in terms of compile speed.

Cool (you're trying to fix it), maybe you can modify tcc so it is optimized
for compiling linux (optimized for compiling speed and runtime speed for
linux). I think it'll be easier and quicker to just make it compile linux
properly first, then do the testing/fixing for other things, as there are so
many compilers for other things anyway...And maybe it can be called "Linux C
Compiler"? lol.

2003-02-07 18:37:15

by Horst von Brand

Subject: Re: gcc 2.95 vs 3.21 performance

[email protected] said:
> Neil Booth writes:
> > [email protected] wrote:-
> >> Maybe that's why it's a 0.9* version, and the author has stated on his site
> >> that not all C98 features are implemented...but then even GCC doesn't
> >> implement them...

> > No, I said C89. He's got a *long* way to go for that. Forget C99.

> > However, he does claim C89 compliance, which is quite disingenuous.

> >> I checked tcc out, and it's damn fast, much much much much faster than
> >> gcc. gcc is bloated and it's slow even on my pentium 4 machine, let
> >> alone my 1.2 celeron. It takes 20 minutes to compile a new kernel on
> >> that, now if you're gonna test kernels/patches, you can wait 20
> >> minutes for every compile!

Come on, quit whining already. When I started out fooling around with egcs
and the kernel, it took 45 to 60 minutes to build a kernel for me. And the
kernel was a lot smaller, and the compiler much faster.

> > I agree. I'm trying to fix it.
> >
> > GCC is larger for a reason: it does things properly. It's easy to be
> > fast if you're willing to be wrong, and not emit warnings or errors, and
> > not implement half the standard. And not optimize.

> >> Even icc is much better than gcc, but it's very particular about code (and
> >> it's not gcc compatible as the intel site says)
> >> And it's non-free also...

Pour manpower, and people who _know_ that _one_ CPU you are targeting inside
and out, into the project, and it sure will get further along...

> > Only better in terms of compile speed.
>
> Cool (you're trying to fix it), maybe you can modify tcc so it is optimized
> for compiling linux (optimized for compiling speed and runtime speed for
> linux).

Sorry, you can pick just one. Either you compile very fast (because you don't
analyze the code you are compiling very much, i.e., generate lousy code) or
generate excellent code (that requires complex analysis, large data
structures to build and use, and takes time).

> I think it'll be easier and quicker to just make it compile linux
> properly first, then do the testing/fixing for other things, as there are so
> many compilers for other things anyway...And maybe it can be called "Linux C
> Compiler"? lol.

"Easier and quicker" as in 5 or 6 years of hard work. Sure enough, come
back when you're done.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria +56 32 654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 797513

2003-02-07 21:39:44

by Neil Booth

Subject: Re: gcc 2.95 vs 3.21 performance

[email protected] wrote:-

> Cool (you're trying to fix it), maybe you can modify tcc so it is optimized
> for compiling linux (optimized for compiling speed and runtime speed for
> linux). I think it'll be easier and quicker to just make it compile linux
> properly first, then do the testing/fixing for other things, as there are so
> many compilers for other things anyway...And maybe it can be called "Linux
> C Compiler"? lol.

Sorry, I only care about GCC.

Neil.

2003-02-08 16:39:30

by Pavel Machek

Subject: Re: gcc 2.95 vs 3.21 performance

Hi!

> >> I'm hesitant to enter into this. But from my own experience
> >> the issue with big companies supporting these sort of changes
> >> in gcc have more to do with the acceptance process of changes
> >> into gcc than a lack of desire on the large companies' part.
> >
> >Maybe we should create a KGCC fork, optimise it for kernel
> >compilations, then try to get our changes merged back in to GCC
> >mainline at a later date.
>
> That's not really the problem.
>
> I think the problem with gcc is that many of the developers are actually
> much more interested in Ada or C++ (or even Fortran!), than in plain
> old-fashioned C. So it's not a kernel issue per se, gcc is slow to
> compile _any_ C project.
>
> And a lot of the optimizations gcc does aren't even interesting to most
> C projects. Most "old-fashioned" C projects tend to be written in ways
> that mean that the most important optimizations are the truly trivial
> ones, and then doing good register allocation.
>
> I'd love to see a small - and fast - C compiler, and I'd be willing to
> make kernel changes to make it work with it.

What about gcc-1.4 or something like that? If you go back in time,
you'll find gcc is getting smaller and faster ;-). Actually making the
kernel compile with gcc-2.7.2 should make it a few times faster than
gcc-3.2...
Pavel
--
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?

2003-02-10 02:05:09

by Jeff Garzik

Subject: Re: gcc 2.95 vs 3.21 performance

Neil Booth wrote:
> Jeff Muizelaar wrote:-
>
>
>>There is also tcc (http://fabrice.bellard.free.fr/tcc/)
>>It claims to support gcc-like inline assembler, appears to be much
>>smaller and faster than gcc. Plus it is GPL so the license isn't a
>>problem either.
>
>
> It doesn't expand macros correctly, however, and accepts an enormous
> range of invalid code without a single diagnostic. I'm pretty sure
> its arithmetic rules are incorrect, too. It's certainly nowhere
> near C89 compliance.


100% agreed.

However, for our purposes, TinyCC is only missing two pieces needed for
successfully building a bootable kernel:

* __builtin_constant_p
* function inlining

Given the existing TinyCC source base, function inlining is a big step
(since tcc doesn't do AST-like things currently), so don't expect that
very soon. TinyCC is a fun little project to watch and play around
with, though, and can compile most major open source projects, as well
as itself.

Jeff



2003-02-10 09:10:05

by Tomas Szepe

Subject: Re: gcc 2.95 vs 3.21 performance

> [[email protected]]
>
> Given the existing TinyCC source base, function inlining is a big step
> (since tcc doesn't do AST-like things currently), so don't expect that
> very soon. TinyCC is a fun little project to watch and play around
> with, though, and can compile most major open source projects, as well
> as itself.

I wonder how that can be, though, because I've failed getting it to
compile code as trivial as

walk_de = (dirent_t *) debug_malloc(sizeof(dirent_t));

where dirent_t is a simple structure and debug_malloc is prototyped
to void *debug_malloc(size_t size);

--
Tomas Szepe <[email protected]>

2003-02-10 12:37:07

by Momchil Velikov

Subject: Re: [Lse-tech] gcc 2.95 vs 3.21 performance

>>>>> "Martin" == Martin J Bligh <[email protected]> writes:

Martin> But the point is still the same ... even if it is doing
Martin> more aggressive optimisation, it's not actually buying us
Martin> anything (at least for the kernel)

which might be due in part to ``-fno-strict-aliasing'' used to compile
the Linux kernel.

~velco

2003-02-10 22:16:49

by Andrea Arcangeli

Subject: Re: gcc 2.95 vs 3.21 performance

On Wed, Feb 05, 2003 at 12:51:12AM +0100, Jakob Oestergaard wrote:
> On Tue, Feb 04, 2003 at 03:21:01PM -0800, Larry McVoy wrote:
> > > I'd love to see a small - and fast - C compiler, and I'd be willing to
> > > make kernel changes to make it work with it.
> >
> > I can't offer any immediate help with this but I want the same thing. At
> > some point, we're planning on funding some extensions into GCC or whatever
> > reasonable C compiler is around:
>
> [snipping Linus from To:]
>
> Cool.
>
> >
> > - associative arrays as a builtin type
> >
> > {
> > assoc bar = {}; // anonymous, no file backing
> >
> > bar{"some key"} = "some value";
> > if (defined(bar{"some other value"})) ...
> > }
>
> Allow me:
>
> {
> std::map<std::string,std::string> bar;
>
> bar["some key"] = "some value";
> if (bar.find("some other value") != bar.end()) ...
> }

Indeed. Hardcoding map and multimap templates with string,string
parameters in the language sounds like a very worthless effort. If he
wants a high-level syntax on top of the abstractions he should use a
higher-level language. C can do everything but it's going to be a
syntax like what we do in the kernel, with lists, rbtrees, structures of
pointers to functions etc..

> Works beautifully, all you need is to pick the existing language which
> allows for the existing standard library which already provide that
> functionality.
>
> I doubt there's much need for a C+ or C 2+/3 language variant ;)
>
> >
> > - regular expressions
> >
> > {
> > char *foo = "blech";
> >
> > if (foo =~ /regex are nice/) {
> > printf("Well isn't that special?\n");
> > }
> > }
>
> Ok, I can't help you with that.
>
> You have probably seen a Perl program before... Now imagine a two
> million line Perl program... That is why the above is not a good idea ;)

Actually the python syntax for re is quite nice, and would be pretty
compatible with C, no magic perl =~ operator etc.. Again, a library like
STL in a high-level language would do the trick just fine.

>
> It's still your right to want it of course...
>
> >
> > - tk bindings built in
>
> Built into the language (not a library)?

Oh my.

>
> <sarcasm>
> Then I'd want the compiler in a kernel module ;)
> </>

then I want insmod kde.o too ;)

Andrea
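
For comparison, here is roughly what the two quoted examples look like in Python, where an associative array is a built-in type and regular expressions live in a standard library module rather than in an operator. It mirrors the quoted snippets (the `bar` and `foo` names and the strings come from them); the `demo` wrapper is just for illustration, and none of this says anything about how a C extension would behave.

```python
import re


def demo():
    """Return (key_present, regex_matched) for the quoted examples."""
    # Associative array: dict is a built-in type.
    bar = {}
    bar["some key"] = "some value"
    key_present = "some other value" in bar

    # Regular expressions: a library module, no special operator.
    foo = "blech"
    regex_matched = re.search(r"regex are nice", foo) is not None

    return key_present, regex_matched


if __name__ == "__main__":
    print(demo())
```

Neither test succeeds here, since the key was never stored and "blech" does not contain the pattern, which matches what the quoted pseudo-code would be expected to do.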

2003-02-10 23:19:02

by J.A. Magallon

Subject: Re: gcc 2.95 vs 3.21 performance


On 2003.02.10 Andrea Arcangeli wrote:
> On Wed, Feb 05, 2003 at 12:51:12AM +0100, Jakob Oestergaard wrote:
> > On Tue, Feb 04, 2003 at 03:21:01PM -0800, Larry McVoy wrote:
> > > > I'd love to see a small - and fast - C compiler, and I'd be willing to
> > > > make kernel changes to make it work with it.
> > >
> > > I can't offer any immediate help with this but I want the same thing. At
> > > some point, we're planning on funding some extensions into GCC or whatever
> > > reasonable C compiler is around:
> >
> > [snipping Linus from To:]
> >
> > Cool.
> >
> > >
> > > - associative arrays as a builtin type
> > >
> > > {
> > > assoc bar = {}; // anonymous, no file backing
> > >
> > > bar{"some key"} = "some value";
> > > if (defined(bar{"some other value"})) ...
> > > }
> >
> > Allow me:
> >
> > {
> > std::map<std::string,std::string> bar;
> >
> > bar["some key"] = "some value";
> > if (bar.find("some other value") != bar.end()) ...
> > }
>

And don't forget smart pointers with reference counting so you can get rid of
all those stupid kfree's... ;)

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.1 (Cooker) for i586
Linux 2.4.21-pre4-jam1 (gcc 3.2.1 (Mandrake Linux 9.1 3.2.1-5mdk))