The conventional wisdom is that compiling x86 without frame pointers
results in smaller code. It turns out to be the opposite: compiling
with frame pointers results in a smaller kernel. This is with gcc
version 3.2 20020822 (Red Hat Linux Rawhide 3.2-4).

# size 2.4.20-rc2-*/vmlinux
   text    data     bss     dec     hex filename
2669584  337972  402697 3410253  34094d 2.4.20-rc2-fp/vmlinux
2676919  337972  402697 3417588  3425f4 2.4.20-rc2-nofp/vmlinux

Without frame pointers, vmlinux is 7K bigger. The difference is that
code with frame pointers can use ebp to access the stack directly;
without frame pointers it has to use esp with an index.
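
For reference, the arithmetic behind the "7K" figure can be checked with a
small Python sketch. The numbers are copied from the size output above; the
variable names are only illustrative.

```python
# text, data and bss columns copied from the `size` output above.
fp = {"text": 2669584, "data": 337972, "bss": 402697}
nofp = {"text": 2676919, "data": 337972, "bss": 402697}

# Growth of the no-frame-pointer build over the frame-pointer build, per section.
delta = {section: nofp[section] - fp[section] for section in fp}
total_delta = sum(delta.values())

print(delta)                                   # only the text section differs
print(total_delta, "bytes")                    # total growth
print(round(total_delta / 1024, 1), "KiB")     # the "7K" quoted above
print(round(100 * delta["text"] / fp["text"], 2), "% of text")
```

The whole difference sits in the text section, which is what one would
expect from a code generation option.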

With frame pointers:

00000c10 <inet_dgram_connect>:
     c10:  55                   push   %ebp
     c11:  89 e5                mov    %esp,%ebp
     c13:  83 ec 14             sub    $0x14,%esp
     c16:  89 75 fc             mov    %esi,0xfffffffc(%ebp)
     c19:  8b 45 08             mov    0x8(%ebp),%eax
     c1c:  8b 75 0c             mov    0xc(%ebp),%esi
     c1f:  89 5d f8             mov    %ebx,0xfffffff8(%ebp)
     c22:  8b 58 18             mov    0x18(%eax),%ebx
     c25:  66 83 3e 00          cmpw   $0x0,(%esi)
     c29:  74 3d                je     c68 <inet_dgram_connect+0x58>

Without frame pointers:

00000c10 <inet_dgram_connect>:
     c10:  83 ec 14             sub    $0x14,%esp
     c13:  8b 44 24 18          mov    0x18(%esp,1),%eax
     c17:  89 74 24 10          mov    %esi,0x10(%esp,1)
     c1b:  8b 74 24 1c          mov    0x1c(%esp,1),%esi
     c1f:  89 5c 24 0c          mov    %ebx,0xc(%esp,1)
     c23:  8b 58 18             mov    0x18(%eax),%ebx
     c26:  66 83 3e 00          cmpw   $0x0,(%esi)
     c2a:  74 44                je     c70 <inet_dgram_connect+0x60>

The difference is that stack accesses via ebp are 3 bytes, while stack
accesses via esp+index are 4 bytes (the extra byte is the SIB byte,
0x24 in the second listing). On any function with a large number of
stack accesses, this quickly outweighs the extra prologue code for
frame pointers.
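
The byte counts can be tallied directly from the two listings. The Python
sketch below only counts the encoding bytes shown above, up to but not
including the je; the list and function names are made up for illustration.

```python
# Hex encodings copied from the two listings above (one string per instruction).
prologue_fp = ["55", "89 e5"]                       # push %ebp; mov %esp,%ebp
alloc = ["83 ec 14"]                                # sub $0x14,%esp (both builds)
stack_fp = ["89 75 fc", "8b 45 08", "8b 75 0c", "89 5d f8"]
stack_nofp = ["8b 44 24 18", "89 74 24 10", "8b 74 24 1c", "89 5c 24 0c"]
tail = ["8b 58 18", "66 83 3e 00"]                  # identical in both builds


def code_bytes(instructions):
    """Total number of encoding bytes in a list of hex strings."""
    return sum(len(insn.split()) for insn in instructions)


fp_bytes = code_bytes(prologue_fp + alloc + stack_fp + tail)
nofp_bytes = code_bytes(alloc + stack_nofp + tail)

print("with frame pointer:   ", fp_bytes)
print("without frame pointer:", nofp_bytes)
print("prologue cost:", code_bytes(prologue_fp),
      "saved on stack accesses:", code_bytes(stack_nofp) - code_bytes(stack_fp))
```

In this excerpt the four stack accesses save 4 bytes and the prologue costs
3, which is why the je lands at c29 in the first listing and at c2a in the
second.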
The smaller code will improve icache usage. Whether this is offset by
the increased register pressure is something for benchmarking. Would
any of the benchmarkers care to test x86 kernels with and without
frame pointers?
A few weeks ago I was surprised to find that code compiled with
-fomit-frame-pointer reliably executed a few percent slower.
Since the functions I was testing were nowhere near big enough to
fill even the L1 instruction cache, I wrote it off as 'the CPU is
obviously optimized to expect certain instruction sequences after call
and before ret'. Something to think about anyway...
mark
On Thu, Nov 21, 2002 at 03:47:13PM +1100, Keith Owens wrote:
> The conventional wisdom is that compiling x86 without frame pointer
> results in smaller code. It turns out to be the opposite, compiling
> with frame pointers results in a smaller kernel. gcc version 3.2
> 20020822 (Red Hat Linux Rawhide 3.2-4).
>
> # size 2.4.20-rc2-*/vmlinux
> text data bss dec hex filename
> 2669584 337972 402697 3410253 34094d 2.4.20-rc2-fp/vmlinux
> 2676919 337972 402697 3417588 3425f4 2.4.20-rc2-nofp/vmlinux
>
> Without frame pointers, vmlinux is 7K bigger. The difference is that
> code with frame pointers can use ebp to directly access the stack,
> without frame pointers it has to use esp with an index.
>
> With frame pointers:
>
> 00000c10 <inet_dgram_connect>:
> c10: 55 push %ebp
> c11: 89 e5 mov %esp,%ebp
> c13: 83 ec 14 sub $0x14,%esp
> c16: 89 75 fc mov %esi,0xfffffffc(%ebp)
> c19: 8b 45 08 mov 0x8(%ebp),%eax
> c1c: 8b 75 0c mov 0xc(%ebp),%esi
> c1f: 89 5d f8 mov %ebx,0xfffffff8(%ebp)
> c22: 8b 58 18 mov 0x18(%eax),%ebx
> c25: 66 83 3e 00 cmpw $0x0,(%esi)
> c29: 74 3d je c68 <inet_dgram_connect+0x58>
>
> Without frame pointers:
>
> 00000c10 <inet_dgram_connect>:
> c10: 83 ec 14 sub $0x14,%esp
> c13: 8b 44 24 18 mov 0x18(%esp,1),%eax
> c17: 89 74 24 10 mov %esi,0x10(%esp,1)
> c1b: 8b 74 24 1c mov 0x1c(%esp,1),%esi
> c1f: 89 5c 24 0c mov %ebx,0xc(%esp,1)
> c23: 8b 58 18 mov 0x18(%eax),%ebx
> c26: 66 83 3e 00 cmpw $0x0,(%esi)
> c2a: 74 44 je c70 <inet_dgram_connect+0x60>
>
> The difference is that stack accesses via ebp are 3 bytes, stack
> accesses via esp+index are 4 bytes. On any function with a large
> number of stack accesses, this quickly outweighs the extra prologue
> code for frame pointers.
>
> The smaller instruction set will improve icache usage. Whether this is
> offset by the increased register pressure is something for
> benchmarking. Any of the benchmarkers care to test x86 kernels with
> and without frame pointers?
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
[email protected]/[email protected]/[email protected]
Neighbourhood Coder
Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/
I use -momit-leaf-frame-pointer for optimization in some of my own
projects, instead of "-fomit-frame-pointer". For me, this results in
better code size/speed compared to both "-fomit-frame-pointer" and no
option at all. Actually gcc-2.95 seems to support this option as well,
but it never made it into the 2.95 docs... It makes debugging a lot
easier too.

So anyone "caring to benchmark", could you please test the
"-momit-leaf-frame-pointer" option for x86 as well...
Mark Mielke wrote:
> A few weeks ago I was surprised to find that code compiled with
> -fomit-frame-pointers reliably executed a few percentages slower.
> Since the functions I was testing were not anywhere big enough to
> fill even the I1 cache, I wrote it off as 'the CPU is obviously
> optimized to expect certain instruction sequences after call and
> before ret'. Something to think about anyways...
On Thu, Nov 21, 2002 at 03:47:13PM +1100, Keith Owens wrote:
> The conventional wisdom is that compiling x86 without frame pointer
> results in smaller code. It turns out to be the opposite, compiling
> with frame pointers results in a smaller kernel. gcc version 3.2
> 20020822 (Red Hat Linux Rawhide 3.2-4).
I've been pushing a forward port of the CONFIG_FRAME_POINTER changes
that went into 2.4 for a while, but Linus hasn't taken them each time.
I'll keep pushing until I get a comment..
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
On Thu, 2002-11-21 at 12:55, Dave Jones wrote:
> On Thu, Nov 21, 2002 at 03:47:13PM +1100, Keith Owens wrote:
> > The conventional wisdom is that compiling x86 without frame pointer
> > results in smaller code. It turns out to be the opposite, compiling
> > with frame pointers results in a smaller kernel. gcc version 3.2
> > 20020822 (Red Hat Linux Rawhide 3.2-4).
>
> I've been pushing a forward port of the CONFIG_FRAME_POINTER changes
> that went into 2.4 for a while, but Linus hasn't taken them each time.
> I'll keep pushing until I get a comment..
Send it this way 8)
> The conventional wisdom is that compiling x86 without frame pointer
> results in smaller code. It turns out to be the opposite, compiling
> with frame pointers results in a smaller kernel. gcc version 3.2
> 20020822 (Red Hat Linux Rawhide 3.2-4).
I looked at 2.5.47 (with a splattering of performance patches) using
gcc 2.95.4 (Debian Woody), on a 16-way NUMA-Q, and did some kernel
compile testing. The times to do the tests were almost identical
(within error noise), but the kernel was indeed smaller

   text    data     bss     dec     hex filename
1873293  396231  459388 2728912  29a3d0 2.5.47-mjb1/vmlinux
1427355  396875  455356 2279586  22c8a2 2.5.47-mjb1-frameptr/vmlinux

Wow ... that's quite some difference ;-)
> I use -momit-leaf-frame-pointer for optimization in some own
> projects, instead of the "-fomit-frame-pointer". For me, this
> results in better codesize/speed compared to both "-fomit-frame-pointer"
> or no option at all. Actually gcc-2.95 seems to support this feature
> as well, but it never made it into the 2.95 docs...
I tried this, but it seemed to be the same as -fomit-frame-pointer
(on 2.95 at least).
Given that leaving out -fomit-frame-pointer makes a smaller kernel
that's easier to debug, I'd say this is a good thing to do unless someone
can get *negative* benchmark results.
M.
On Thu, Nov 21, 2002 at 10:30:49AM +0100, David Zaffiro wrote:
> I use -momit-leaf-frame-pointer for optimization in some own projects,
> instead of the "-fomit-frame-pointer". For me, this results in better
> codesize/speed compared to both "-fomit-frame-pointer" or no option at
> all. Actually gcc-2.95 seems to support this feature as well, but it
> never made it into the 2.95 docs... It makes debugging a lot easier too.
>
> So anyone "caring to benchmark", could you please test the
> "-momit-leaf-frame-pointer" option for x86 as well...
Well, I tried on 2.4.18+patches with gcc 2.95.3. bzImage is:
  538481 bytes with -fomit-frame-pointer
  538510 bytes with no particular flag
  542137 bytes with -momit-leaf-frame-pointer

So -fomit-frame-pointer shows the same as others' observations, but in this
particular case, -momit-leaf-frame-pointer made a slightly bigger kernel.
I didn't have time to inspect all sections, though.
Cheers,
Willy
On Thu, Nov 21, 2002 at 08:20:45PM +0100, Willy Tarreau wrote:
> On Thu, Nov 21, 2002 at 10:30:49AM +0100, David Zaffiro wrote:
> > I use -momit-leaf-frame-pointer for optimization in some own projects,
> > instead of the "-fomit-frame-pointer". For me, this results in better
> > codesize/speed compared to both "-fomit-frame-pointer" or no option at
> > all. Actually gcc-2.95 seems to support this feature as well, but it
> > never made it into the 2.95 docs... It makes debugging a lot easier too.
> >
> > So anyone "caring to benchmark", could you please test the
> > "-momit-leaf-frame-pointer" option for x86 as well...
>
> Well, I tried on a 2.4.18+patches with gcc 2.95.3. bzImage is :
> 538481 bytes with -fomit-frame-pointer
> 538510 bytes with no particular flag
> 542137 bytes with -momit-leaf-frame-pointer.
These numbers are useless. Since a change in frame pointer setup changes
the code sequences in the text section, it is likely to also change the
maximum achieved compression. Therefore, the sizes of the compressed
images cannot be compared to yield any usable data; you need to
compare the sizes of the uncompressed images.
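
To illustrate why compressed sizes can mislead, here is a minimal Python
sketch using the standard zlib module. The two inputs are synthetic
stand-ins, not kernel images, and all names are made up for the example:
one blob is larger but highly repetitive, the other smaller but
pseudo-random.

```python
import random
import zlib

# Synthetic data, NOT kernel images. The repetitive blob repeats the
# "push %ebp; mov %esp,%ebp" encoding (55 89 e5) from the listing above.
repetitive = bytes.fromhex("5589e5") * 2000               # 6000 bytes
rng = random.Random(0)                                     # fixed seed, repeatable
noisy = bytes(rng.getrandbits(8) for _ in range(4000))     # 4000 bytes

packed_repetitive = zlib.compress(repetitive, 9)
packed_noisy = zlib.compress(noisy, 9)

print(len(repetitive), "->", len(packed_repetitive))
print(len(noisy), "->", len(packed_noisy))
```

The smaller input ends up as the larger compressed output, so the ordering
of compressed sizes says little about the ordering of uncompressed sizes.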
--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606
On Thu, Nov 21, 2002 at 02:32:31PM -0500, Doug Ledford wrote:
> On Thu, Nov 21, 2002 at 08:20:45PM +0100, Willy Tarreau wrote:
> > On Thu, Nov 21, 2002 at 10:30:49AM +0100, David Zaffiro wrote:
> > > I use -momit-leaf-frame-pointer for optimization in some own projects,
> > > instead of the "-fomit-frame-pointer". For me, this results in better
> > > codesize/speed compared to both "-fomit-frame-pointer" or no option at
> > > all. Actually gcc-2.95 seems to support this feature as well, but it
> > > never made it into the 2.95 docs... It makes debugging a lot easier too.
> > >
> > > So anyone "caring to benchmark", could you please test the
> > > "-momit-leaf-frame-pointer" option for x86 as well...
> >
> > Well, I tried on a 2.4.18+patches with gcc 2.95.3. bzImage is :
> > 538481 bytes with -fomit-frame-pointer
> > 538510 bytes with no particular flag
> > 542137 bytes with -momit-leaf-frame-pointer.
>
> These numbers are useless. Since a change in frame pointer setup changes
> the code sequences in the text section, it is likely to also change
> maximum acheived compression. Therefore, the size of the compressed
> images can not be compared and result in any useable data, you need to
> compare the size of the uncompressed images.
Yes, you're quite right about this. I had been obsessed all day with reducing
a bzImage to fit it on a diskette, and didn't immediately realise that other
people were talking about the plain vmlinux in this discussion :-)

So I retried, and the vmlinux built with -momit-leaf-frame-pointer is nearly
1 kB smaller than the one built with -fomit-frame-pointer (difference in text
only). So David was right here. Please also note that the code really is less
compressible, because 1 kB less gives 4 kB more after compression.
Even after upx, the difference is still 3 kB between the two images.

Anyway, the compressed size is sometimes more relevant than the vmlinux one,
when it comes to putting it on very limited devices such as diskettes. In my
case, I don't need this extra 1 kB of RAM; I'd rather have those 4 kB of
floppy image for another NIC driver!

I haven't benchmarked anything with these options. Maybe David's suggestion
is interesting for userland, where compression is rarely used.
Cheers,
Willy
On Thu, Nov 21, 2002 at 08:41:27PM +0100, Willy Tarreau wrote:
> Yes, you're quite right about this. I had my mind obsessed all the day reducing
> a bzImage to fit it on a diskette, and didn't immediately realise that other
> people were speaking pure vmlinux in this discussion :-)
I had thought about that as well, but then my answer was that if the
space is that important on the floppy, then we (meaning Red Hat) could
compile our BOOT kernel with whatever option gave the smallest compressed
image and compile installed kernels with whatever gave actual best
performance (with a + given to kernels that have a frame pointer in the
event of a tie or insignificant performance difference).
Of course you may be talking about a system that always boots from floppy
and sits in some closet for years or some embedded system where that 4k in
flash is super important, so situational decision rules apply ;-)
--
Doug Ledford <[email protected]> 919-754-3700 x44233
Red Hat, Inc.
1801 Varsity Dr.
Raleigh, NC 27606
On Thursday 21 November 2002 18:44, Martin J. Bligh wrote:
> > The conventional wisdom is that compiling x86 without frame pointer
> > results in smaller code. It turns out to be the opposite, compiling
> > with frame pointers results in a smaller kernel. gcc version 3.2
> > 20020822 (Red Hat Linux Rawhide 3.2-4).
>
> I looked at 2.5.47 (with a splattering of performance patches) using
> gcc 2.95.4 (Debian Woody), on a 16-way NUMA-Q, and did some kernel
> compile testing. The times to do the tests were almost identical
> (within error noise), but the kernel was indeed smaller
>
> text data bss dec hex filename
> 1873293 396231 459388 2728912 29a3d0 2.5.47-mjb1/vmlinux
> 1427355 396875 455356 2279586 22c8a2 2.5.47-mjb1-frameptr/vmlinux
>
> Wow ... that's quite some difference ;-)
I also tried it, but it is not that big a difference:

   text    data     bss     dec     hex filename  flags
1991125  306324  270484 2567933  272efd vmlinux   -fomit-frame-pointer
1981477  306324  270484 2558285  27094d vmlinux   (none)
1990965  306324  270484 2567773  272e5d vmlinux   -momit-leaf-frame-pointer

This was with gcc 2.95.3 and binutils 2.12 on my LFS system.
Rudmer
>
> > I use -momit-leaf-frame-pointer for optimization in some own
> > projects, instead of the "-fomit-frame-pointer". For me, this
> > results in better codesize/speed compared to both "-fomit-frame-pointer"
> > or no option at all. Actually gcc-2.95 seems to support this feature
> > as well, but it never made it into the 2.95 docs...
>
> I tried this, but it seemed to be the same as -fomit-frame-pointer
> (on 2.95 at least).
>
> Given that omitting the -fomit-frame-pointer makes a smaller kernel,
> that's easier to debug, I'd say this is a good thing to do unless someone
> can get *negative* benchmark results.
>
> M.
I can understand why not omitting frame pointers generates more
compressible code, since every function will start with:
push %ebp
mov %esp,%ebp
and end with:
leave
ret

But it's harder to find a reason why -fomit-frame-pointer is more
compressible than -momit-leaf-frame-pointer (it's probably related
to a lot of movs involving the stack pointer), especially since
"-momit-leaf-frame-pointer" is a trade-off between the other two
options: it omits frame pointers for leaf functions (callees that aren't
callers as well) and keeps them for branch functions.

The mixture of functions with frame pointers and those without is
probably causing bzip to compress less optimally.

Anyway, it makes me wonder whether kernel compilation shouldn't be
configurable between an "optimize for (compressed image) size" and an
"optimize for speed" option... I'd go for speed... (and always omitting
frame pointers doesn't seem to be as fast as omitting them only in leaf
functions).
> Well, I tried on a 2.4.18+patches with gcc 2.95.3. bzImage is :
> 538481 bytes with -fomit-frame-pointer
> 538510 bytes with no particular flag
> 542137 bytes with -momit-leaf-frame-pointer.
>
> So -fomit-frame-pointer shows the same as other's observation, but in this
> particular case, -momit-leaf-frame-pointer made a slightly bigger kernel.
> Anyway it makes me wonder, whether kernelcompilation shouldn't be
> configurable between a "optimize for (compressed image) size" and a
> "optimize for speed" option... I'd go for speed... (and always omitting
> frame-pointers doesn't seem to as fast as omitting them only in leaf
> functions).
hehe :-)
I've had this in my kernels for about 2 years now. You can also reduce the
image size with -malign-jumps=0 -mpreferred-stack-boundary=2 and -mcpu=i386.
I also use some other options, but don't have them at hand right now. It
basically gives me slightly smaller kernels, which is pretty good for install
CDs or diskettes.
Cheers,
Willy
> I looked at 2.5.47 (with a splattering of performance patches) using
> gcc 2.95.4 (Debian Woody), on a 16-way NUMA-Q, and did some kernel
> compile testing. The times to do the tests were almost identical
> (within error noise), but the kernel was indeed smaller
>
> text data bss dec hex filename
> 1873293 396231 459388 2728912 29a3d0 2.5.47-mjb1/vmlinux
> 1427355 396875 455356 2279586 22c8a2 2.5.47-mjb1-frameptr/vmlinux
>
I can't think of any reason why the data and bss parts of the kernel would
be influenced by a frame pointer option; this seems highly illogical. It
shouldn't make any difference as far as I can tell. Maybe you altered
other options as well? (It could be strange compiler behaviour, though.)

Keith's results seem more reliable:

# size 2.4.20-rc2-*/vmlinux
   text    data     bss     dec     hex filename
2669584  337972  402697 3410253  34094d 2.4.20-rc2-fp/vmlinux
2676919  337972  402697 3417588  3425f4 2.4.20-rc2-nofp/vmlinux
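
The comparison can be made explicit with a short Python sketch. The figures
are copied from the two size outputs quoted in this thread; the dictionary
and function names are only illustrative.

```python
# (text, data, bss) copied from the two `size` outputs quoted in this thread.
martin = {"nofp": (1873293, 396231, 459388), "fp": (1427355, 396875, 455356)}
keith = {"nofp": (2676919, 337972, 402697), "fp": (2669584, 337972, 402697)}


def section_deltas(result):
    """Return (text, data, bss) of the nofp build minus the fp build."""
    return tuple(a - b for a, b in zip(result["nofp"], result["fp"]))


print("martin:", section_deltas(martin))
print("keith: ", section_deltas(keith))
```

Keith's builds differ only in text, while Martin's also differ in data and
bss, which suggests that something other than the frame pointer option
changed between his two builds.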
On 25 November 2002 06:47, David Zaffiro wrote:
> I can understand why not omitting framepointers generates better
> compressible code, since every function will start with:
> push %ebp
> mov %esp,%ebp
> and end with:
> leave
> ret
>
> But it's harder to find a reason why -fomit-frame-pointer is better
> compressible that -momit-leaf-frame-pointer (but it's probably
> related to a lot of mov's with stackpointer involved), especially
> since "-momit-leaf-frame-pointer" makes a trade-off between both
> other options: it omits framepointers for leaf functions (callees
> that aren't callers as well) and it doesn't for branch-functions.
Which does not sound quite right to me. FP should be omitted
only if the function contains fewer than half a dozen stack references,
otherwise not. It does not matter whether it is a leaf function or not.
OTOH, AFAIK frame pointers make debugging easier, so development kernels
are better compiled with fp in every func.
--
vda
On 25 November 2002 06:52, Willy Tarreau wrote:
> > Anyway it makes me wonder, whether kernelcompilation shouldn't be
> > configurable between a "optimize for (compressed image) size" and a
> > "optimize for speed" option... I'd go for speed... (and always
> > omitting frame-pointers doesn't seem to as fast as omitting them
> > only in leaf functions).
>
> hehe :-)
> I've put this in my kernels for about 2 years now. You can also
> reduce the image size with -malign-jumps=0
> -mpreferred-stack-boundary=2 and -mcpu=i386.
Hehe indeed ;)
--
vda
>>since "-momit-leaf-frame-pointer" makes a trade-off between both
>>other options: it omits framepointers for leaf functions (callees
>>that aren't callers as well) and it doesn't for branch-functions.
>
>
> Which does not sound quite right for me. FP should be omitted
> only if function contains less than half dozen stack references,
> otherwise not. It does not matter whether it is a leaf function or not.
Leaf functions generally do not contain more than half a dozen
stack references, and are generally called at least as often as their
callers. The slight overhead of leaf functions that do contain a dozen
stack references is much smaller than the overhead of omitting
frame pointers in /all/ branch functions, including those with dozens of
stack references. Maybe gcc's optimizer could be adapted in the (near)
future to compare either the speed or the size of the code it could
generate, with and without frame pointer, if the compile is not a debug one.
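
As a rough illustration of such a per-function comparison, here is a
size-only Python sketch. It is not how gcc decides anything: the function
names are hypothetical, the 3 and 4 byte access sizes are taken from Keith's
listing, and the default overhead of 4 bytes (push %ebp, mov %esp,%ebp,
leave) is an assumption that ignores epilogue differences, register pressure
and speed.

```python
def frame_sizes(stack_refs, fp_overhead=4):
    """Estimated bytes spent on stack accesses, (with_fp, without_fp).

    Assumptions: an ebp-relative access takes 3 bytes, an esp-relative one
    4 bytes, and keeping the frame pointer costs fp_overhead extra bytes.
    """
    with_fp = fp_overhead + 3 * stack_refs
    without_fp = 4 * stack_refs
    return with_fp, without_fp


def prefer_frame_pointer(stack_refs, fp_overhead=4):
    """True if keeping the frame pointer gives the smaller estimate."""
    with_fp, without_fp = frame_sizes(stack_refs, fp_overhead)
    return with_fp < without_fp


for refs in (2, 4, 6, 12):
    print(refs, frame_sizes(refs), prefer_frame_pointer(refs))
```

Under these assumptions the break-even point is simply the overhead in
bytes, so the threshold moves with whatever overhead one assumes.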
But in the meantime, in most "userland" projects I've tested with,
-momit-leaf-frame-pointer resulted in almost the same code size as
compiles with frame pointers, along with more or less the same speed as
"-fomit-frame-pointer". I wouldn't know how to benchmark kernel configs
though, and I haven't seen anyone doing this with the frame pointer
options yet...
> OTOH, AFAIK frame pointers make debugging easier, development kernels
> are better to be compiled with fp in every func.
Honestly, I think that's a shortcoming of the debugger if that's true.
The debugger could store the stack pointer position after a call or
calculate it based on sub/add/push/pop's, instead of borrowing it from
ebp. I'm just concerned about the extra costs (in speed and size) of
always omitting the frame pointer.
(It shouldn't be impossible to debug regparm- and stdcall-functions as
well, I wonder why this could be a problem at the moment. But just
"omitting framepointers" at least doesn't mix up the (IMHO: somewhat
thoughtlessly defined) i386 32-bit C-callingconvention.)