2001-07-04 03:25:14

by Rick Hohensee

Subject: Why Plan 9 C compilers don't have asm("")

Because it's messy and unnecessary. Break this into asmlinkbuild,
asmlink.c, asmlink.h and asmlink.S, chmod +x asmlinkbuild, run it, and
behold a 6.
__________________________________________________________________

#..........................................................
# asmlinkbuild

gcc -c asmlink.S
gcc -o asmlinked asmlink.c asmlink.o
./asmlinked

cat asmlinkbuild asmlink.S asmlink.c > asmlink.post


/* ***************************************************
asmlink.S

int bla (int ha, int hahaha, int uh) ;

That does...

push uh
push hahaha
push ha

*/

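.globl bla
bla:
mov 4(%esp), %eax	/* eax = ha; load, since eax is undefined on entry */
add 8(%esp), %eax	/* eax += hahaha */
add 12(%esp), %eax	/* eax += uh */
ret			/* result is returned in eax */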



/* ******************************************** asmlink.c */
#include "asmlink.h"


int main () {
printf("%d\n", bla(1, 2 , 3 ) ) ;

}

_________________________________________________________________

That's with the GNU tools, without asm(), and without proper declaration
of printf, as is my tendency. I don't actually return an int either, do I?
LAAETTR.

In other words, if you know the push sequence of your C compiler's
function calls, you don't need asm("");. x86 Gcc is "push last declared
first, return in EAX". Plan 9 guys, not surprisingly, seem to prefer to
keep C as C, and asm as asm. I encountered this while trying to build
Linux 1.2.13 with current GNU tools. It breaks on changes in GNU C
asm()'s. Rather a silly thing to break on, eh?

I don't think this is much less clear than the : "=r" $0; stuff, if at
all. This thing didn't take as long to code as it did to construct this
post. Perhaps the C-labels-in-asms optimizes better. I doubt if it's by
much, or if it's worth it.
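
As a point of comparison, here is a minimal sketch of what the same three-way add might look like in the : "=r" style of GCC extended asm. The name bla_inline is hypothetical and not part of the original post, and the constraints assume GCC or Clang on x86; other targets fall back to plain C.

```c
/* Hypothetical inline counterpart of bla() from asmlink.S.
 * Assumes GCC/Clang extended asm on x86; other targets use plain C. */
static inline int bla_inline(int ha, int hahaha, int uh)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int sum = ha;
    /* "+&r": sum is both read and written, and must not share a
     * register with an input, because it changes before uh is read. */
    __asm__("addl %1, %0\n\t"
            "addl %2, %0"
            : "+&r"(sum)
            : "r"(hahaha), "r"(uh)
            : "cc");
    return sum;
#else
    return ha + hahaha + uh;
#endif
}
```

bla_inline(1, 2, 3) returns 6, the same result as the separately assembled bla, but the compiler picks the registers and no arguments are pushed.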

Oops. I didn't include asmlink.h in the above, except as a comment
in asmlink.S. Here it is by itself...

/* ********************************************asmlink.h*/
int bla (int ha, int hahaha, int uh) ;


Another easy win from Plan 9 that's related to this but that is not in
evidence here is that this thing on Plan 9 could build asmlinkbuild for
itself on the fly based on #pragma's in the headers that simply state what
library they are the header for. This to me is so obviously an improvement
to the usual state of affairs, an ornate system of dead-ends, as to be
depressing. The guys that wrote UNIX don't do such things to themselves
anymore.

Rick Hohensee
:; cLIeNUX /dev/tty11 11:00:14 /
:;d
ABOUT LGPL boot device log subroutine
ABOUT.Linux Linux command floppy mounts suite
GPL README configure guest owner temp
H3nix RIGHTS dev help source
:; cLIeNUX /dev/tty11 22:44:25 /
:;

2001-07-04 03:36:23

by Olivier Galibert

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Tue, Jul 03, 2001 at 11:37:28PM -0400, Rick Hohensee wrote:
> In other words, if you know the push sequence of your C compiler's
> function calls, you don't need asm("");.

You are very much forgetting _inline_ asm. And if you think that's
unimportant for performance, well, as Al would say, go back playing
with Hurd.

OG.

2001-07-04 06:24:03

by Cort Dougan

Subject: Re: Why Plan 9 C compilers don't have asm("")

There isn't such a crippling difference between straight-line and code with
unconditional branches in it with modern processors. In fact, there's very
little measurable difference.

If you're looking for something to blame hurd performance on I'd suggest
the entire design of Mach, not inline asm vs procedure calls. Tossing a
few context switches into calls is a lot more expensive.

} > In other words, if you know the push sequence of your C compiler's
} > function calls, you don't need asm("");.
}
} You are very much forgetting _inline_ asm. And if you think that's
} unimportant for performance, well, as Al would say, go back playing
} with Hurd.

2001-07-04 07:14:22

by Andrey Panin

Subject: Re: Why Plan 9 C compilers don't have asm("")


What are the advantages of this approach?

--
Andrey Panin | Embedded systems software engineer
[email protected] | PGP key: http://www.orbita1.ru/~pazke/AndreyPanin.asc



2001-07-04 08:03:59

by H. Peter Anvin

Subject: Re: Why Plan 9 C compilers don't have asm("")

Followup to: <[email protected]>
By author: Cort Dougan <[email protected]>
In newsgroup: linux.dev.kernel
>
> There isn't such a crippling difference between straight-line and code with
> unconditional branches in it with modern processors. In fact, there's very
> little measurable difference.
>
> If you're looking for something to blame hurd performance on I'd suggest
> the entire design of Mach, not inline asm vs procedure calls. Tossing a
> few context switches into calls is a lot more expensive.
>

That's not where the bulk of the penalty of a function call comes in
(and it's a call/return, not an unconditional branch.) The penalty
comes in because of the additional need to obey the calling
convention, and from the icache discontinuity.

Not to mention that certain things simply cannot be done that way.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-07-04 09:58:31

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

Cort Dougan
>> There isn't such a crippling difference between straight-line and code
>> with unconditional branches in it with modern processors. In fact,
>> there's very little measurable difference.
>>
>> If you're looking for something to blame hurd performance on I'd
>> suggest the entire design of Mach, not inline asm vs procedure calls.
>> Tossing a few context switches into calls is a lot more expensive.
>
hpa
>That's not where the bulk of the penalty of a function call comes in
>(and it's a call/return, not an unconditional branch.) The penalty
>comes in because of the additional need to obey the calling
>convention, and from the icache discontinuity.
>

call/return is two unconditional branches and a push and a pop (is that
right?), which is I think what CD means, i.e. in terms of branch
prediction. The push/pop is a hit on old CPUs, donno about >386. You're
right though. The big hit is you can't lose the pushes to set up the args
for a separately assembled function, or the frame drop that follows it.

>Not to mention that certain things simply cannot be done that way.
>

Don't tell me that. Then I can't use my subroutine-threaded Forth
variant, in which + is a subroutine call. ;o)

Anyway, yes it's a performance hit to not inline asms. Is it worth the
bletchery? It's worth asking that once in a while. I've looked at set_bit
both ways. Now I'm curious how it does as straight C.

Rick Hohensee

2001-07-04 17:23:38

by Linus Torvalds

Subject: Re: Why Plan 9 C compilers don't have asm("")

In article <[email protected]>,
Cort Dougan <[email protected]> wrote:
>
>There isn't such a crippling difference between straight-line and code with
>unconditional branches in it with modern processors. In fact, there's very
>little measurable difference.

Oh, the small details get to you eventually.

And it's not just the "call" and "ret" instructions. They _do_ hurt,
even on modern CPU's, btw. They tend to break up the prefetching, and
often mean that you cannot get as good an instruction mix.

But there's an even more serious problem: a function call in C is a
VERY heavy operation as far as the compiler is concerned. It's a major
sequence point, and the compiler doesn't know what memory locations are
potentially dead etc.

Which means that the compiler has to save everything that might be
relevant to memory, and depending on the calling convention has to
assume that registers are trashed. And when you come back, you have to
re-load everything again, on the assumption that the function might have
changed state.
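
The reload cost described above can be sketched in C. This is an illustration only: the names shared, opaque_call, delta_across_call and delta_across_inline are hypothetical, and a volatile function pointer is used as a stand-in for a callee the compiler cannot see into.

```c
int shared = 1;

static void touch_shared(void)
{
    shared += 10;
}

/* A volatile function pointer stands in for a separately compiled callee:
 * at the call site the compiler cannot tell what code will run. */
void (*volatile opaque_call)(void) = touch_shared;

int delta_across_call(void)
{
    int before = shared;     /* load */
    opaque_call();           /* unknown callee may have written shared */
    return shared - before;  /* so shared has to be loaded again here */
}

static inline void add_ten(int *p)
{
    *p += 10;
}

int delta_across_inline(void)
{
    int value = 1;
    int before = value;

    add_ten(&value);         /* body is visible: value can stay in a register */
    return value - before;   /* an optimizer is free to fold this to 10 */
}
```

Both functions return 10, but only the second gives the compiler enough information to keep everything in registers or to compute the answer at compile time.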

You also often have issues like reloading the gp pointer on many 64-bit
architectures, where functions can be in different "domains", and
returning from an unknown function means that you have to do other nasty
setup in order to get at your global data.

And trust me, it's noticeable. On alpha, a fast function call _should_
be a simple two-cycle thing - branch and return. But because of
practical linker issues, what the compiler ends up having to generate
for calls to targets that it doesn't know where they are is

- load a 64-bit address off the GP area that the linker will have fixed
up.
- do an indirect branch to that address
- the callee re-loads the GP with _its_ copy of the GP if it needs any
global data or needs to call anybody else.
- we return to the caller
- the caller reloads its GP.

Your theoretical two cycles that the CPU could follow in the front end
and speculate around turns into multiple loads, an indirect branch and
about 10 instructions. And that's without any of the other effects even
being taken into account. No matter _how_ good the CPU is, that's going
to be slower than not doing it.

[ And yes, I know there are optimizing linkers for the alpha around that
improve this and notice when they don't need to change GP and can do a
straight branch etc. I don't think GNU ld _still_ does that, but who
knows. Even the "good" Digital compilers tended to nop out unnecessary
instructions rather than remove them, causing more icache pressure on
a CPU that was already famous for needing tons of icache ]

Now, you could get around a bit of this by allowing for special calling
conventions. Gcc actually has this for some details - namely the
"register arguments" part, which actually makes for much more readable
code (that's my main personal use for it - never mind the fact that it
is probably faster _too_).

But gcc doesn't have a good "you can re-order this call wrt other stuff"
setup, and gcc lacks the ability to change the calling convention
on-the-fly ("this function will not clobber any registers").
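
A minimal sketch of the "register arguments" convention, assuming GCC's regparm function attribute on 32-bit x86. The macro REGARGS and the function bla_regs are hypothetical names chosen for illustration; on other targets the macro expands to nothing and the default convention applies.

```c
/* regparm(3) asks GCC on i386 to pass the first three integer
 * arguments in EAX, EDX and ECX instead of on the stack. */
#if defined(__GNUC__) && defined(__i386__)
#define REGARGS __attribute__((regparm(3)))
#else
#define REGARGS   /* assumption: keep the default convention elsewhere */
#endif

/* The attribute belongs on the declaration so that every caller and
 * the callee agree on where the arguments live. */
REGARGS int bla_regs(int ha, int hahaha, int uh);

REGARGS int bla_regs(int ha, int hahaha, int uh)
{
    return ha + hahaha + uh;
}
```

The convention is still fixed per function at declaration time, which is the limitation noted above: the caller cannot be told that a particular callee leaves other registers untouched.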

Try it and see. There are good reasons for "inline asm", not the least
of which is that it often makes the produced code much more readable.

And if you never look at the produced assembler code, then you'll never
have a fast system. Really. Compilers can do only so much. People who
understand what the end result is make a difference.

Now, you could probably argue that instead of inline asms we should have
more flexibility in doing a per-callee calling convention. That would be
good too, no question about it.

Linus

2001-07-04 17:33:10

by Benjamin LaHaise

Subject: Don't feed the troll [offtopic] Re: Why Plan 9 C compilers don't have asm("")

Hey folks,

Just a quick reminder: don't feed the troll. He's very hungry.

-ben

2001-07-05 01:03:31

by Michael Meissner

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Tue, Jul 03, 2001 at 11:37:28PM -0400, Rick Hohensee wrote:
> That's with the GNU tools, without asm(), and without proper declaration
> of printf, as is my tendency. I don't actually return an int either, do I?
> LAAETTR.

Under ISO C rules, this is illegal, since you must have a proper prototype in
scope when calling variable argument functions. In fact, I have worked on
several GCC ports, where the compiler uses a different calling sequence for
variable argument functions than it does for normal functions. For example, on
the Mips, if the first argument is floating point and the number of arguments
is not variable, it is passed in a FP register, instead of an integer
register. For variable argument functions, everything is passed in the integer
registers.

--
Michael Meissner, Red Hat, Inc. (GCC group)
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work: [email protected] phone: +1 978-486-9304
Non-work: [email protected] fax: +1 978-692-4482

2001-07-05 01:41:41

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

>
> On Tue, Jul 03, 2001 at 11:37:28PM -0400, Rick Hohensee wrote:
> > That's with the GNU tools, without asm(), and without proper declaration
> > of printf, as is my tendency. I don't actually return an int either, do I?
> > LAAETTR.
>
> Under ISO C rules, this is illegal, since you must have a proper prototype in
> scope when calling variable argument functions. In fact, I have worked on
> several GCC ports, where the compiler uses a different calling sequence for
> variable argument functions than it does for normal functions. For example, on
> the Mips, if the first argument is floating point and the number of arguments
> is not variable, it is passed in a FP register, instead of an integer
> register. For variable argument functions, everything is passed in the integer
> registers.
>

I didn't know that, but...

You seem to be saying the use of assumptions about args passing is
non-standard. I know. It's more standard than GNU extensions to C though,
C_labels_in_asms in particular, and even in your examples it appears that
the particular function abusing these tenets will know what it can expect
from a particular compiler, since it knows what its arguments are. It
can't know what it can expect from any compiler. This perhaps is where
#ifdef comes in, or similar. Well, it's not more standard than GNU, but
the differences would be less detailed in the case of just dealing with
various args passing schemes, and there may be some compiler-to-compiler
overlap, where there won't be any with stuff like C_labels_in_asms.

It's illegal to not declare main() as int. I don't know of a unix that
actually passes anything but a byte to the calling process. I got flamed
mightily for this in comp.unix.programmer until people ran some checks on
their big Real Unix(TM) boxes of various types. Linux won't pass void
either, you have to get a 0 at least. Compliance is subjective. It's
easier when things make sense.

Rick Hohensee
http://www.clienux.com


> --
> Michael Meissner, Red Hat, Inc. (GCC group)
> PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
> Work: [email protected] phone: +1 978-486-9304
> Non-work: [email protected] fax: +1 978-692-4482
>

2001-07-05 03:14:12

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

>Now, you could probably argue that instead of inline asms we should have
>more flexibility in doing a per-callee calling convention. That would be
>good too, no question about it.
>
> Linus
>

Today's flamebait has been postponed. Happy July 4th. Peace.

Rick Hohensee
http://www.clienux.com

2001-07-05 16:54:33

by Michael Meissner

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Wed, Jul 04, 2001 at 09:54:05PM -0400, Rick Hohensee wrote:
> >
> > On Tue, Jul 03, 2001 at 11:37:28PM -0400, Rick Hohensee wrote:
> > > That's with the GNU tools, without asm(), and without proper declaration
> > > of printf, as is my tendency. I don't actually return an int either, do I?
> > > LAAETTR.
> >
> > Under ISO C rules, this is illegal, since you must have a proper prototype in
> > scope when calling variable argument functions. In fact, I have worked on
> > several GCC ports, where the compiler uses a different calling sequence for
> > variable argument functions than it does for normal functions. For example, on
> > the Mips, if the first argument is floating point and the number of arguments
> > is not variable, it is passed in a FP register, instead of an integer
> > register. For variable argument functions, everything is passed in the integer
> > registers.
> >
>
> I didn't know that, but...
>
> You seem to be saying the use of assumptions about args passing is
> non-standard. I know. It's more standard than GNU extensions to C though,
> C_labels_in_asms in particular, and even in your examples it appears that
> the particular function abusing these tenets will know what it can expect
> from a particular compiler, since it knows what its arguments are. It
> can't know what it can expect from any compiler. This perhaps is where
> #ifdef comes in, or similar. Well, it's not more standard than GNU, but
> the differences would be less detailed in the case of just dealing with
> various args passing schemes, and there may be some compiler-to-compiler
> overlap, where there won't be any with stuff like C_labels_in_asms.

Doing this is a losing game. How many different platforms does Linux currently
run on? Do you know exactly what the ABI is for each of the machines? What
happens when Linux is ported to a new machine? My point is:

extern int printf (const char *, ...);
printf ("%d %d\n", 1, 2);

and

extern int my_printf (const char *, int, int);
my_printf ("%d %d\n", 1, 2);

under some ABIs will pass arguments completely differently and as I said, I
have worked on various GCC ports that did this, so it is not a theoretical
possibility.
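
To make the distinction concrete, here is a small sketch of the two kinds of callee. The names sum_ints and sum_three are hypothetical. The variadic function has to fetch its arguments through va_arg, which follows whatever the ABI prescribes for variable arguments, while the fixed-prototype function can receive the same values by a different route.

```c
#include <stdarg.h>

/* Hypothetical variadic helper: sums the `count` ints that follow. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);  /* fetched by the ABI's vararg rules */
    va_end(ap);
    return total;
}

/* Same values, fixed prototype: an ABI is free to pass these differently. */
int sum_three(int a, int b, int c)
{
    return a + b + c;
}
```

A caller that sees the correct prototype for each function gets the same sum from both. A caller that guesses the convention instead is relying on the two happening to coincide on that platform.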

> It's illegal to not declare main() as int. I don't know of a unix that
> actually passes anything but a byte to the calling process. I got flamed
> mightily for this in comp.unix.programmer until people ran some checks on
> their big Real Unix(TM) boxes of various types. Linux won't pass void
> either, you have to get a 0 at least. Compliance is subjective. It's
> easier when things make sense.

Yes, that is an artifact of the original UNIX implementation on the PDP-11 (16
bit ints, signal number is passed back in one byte, and the return value in
another byte).
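
The one-byte exit status is easy to observe. This is a minimal sketch assuming a POSIX system with fork and waitpid; the helper name observed_exit_status is hypothetical.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a child that exits with `code` and return the
 * exit status the parent observes, or -1 on error or death by signal. */
int observed_exit_status(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                 /* child: hand `code` to the kernel */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    /* WEXITSTATUS exposes only the low 8 bits of the child's value. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling observed_exit_status(262) yields 6, because only the low byte of 262 (256 + 6) reaches the parent.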

--
Michael Meissner, Red Hat, Inc. (GCC group)
PMB 198, 174 Littleton Road #3, Westford, Massachusetts 01886, USA
Work: [email protected] phone: +1 978-486-9304
Non-work: [email protected] fax: +1 978-692-4482

2001-07-06 08:38:05

by Cort Dougan

Subject: Re: Why Plan 9 C compilers don't have asm("")

I'm talking about _modern_ processors, not processors that dominate the
modern age. This isn't x86. I don't believe that even aggressive
re-ordering will cause a serious hit in performance on function calls.
Unconditional branches are definitely predictable so icache pre-fetches are
not more complicated than straight-line code.

Measurement is more important, though. I've rejected a number of
optimizations from people (including many of my own) that were "obvious
enhancements" because of what they showed in real-world measurements. If
it doesn't run faster, despite the theory being "right", it's worthless.

2001-07-06 11:43:30

by David Miller

Subject: Re: Why Plan 9 C compilers don't have asm("")


Cort Dougan writes:
> I'm talking about _modern_ processors, not processors that dominate the
> modern age. This isn't x86.

Linus mentioned Alpha specifically. I don't see how any of the things
he said were x86-centric in any way shape or form.

All of his examples are entirely accurate on sparc64 for example, and
even more so, his Alpha commentary can be applied nearly directly to
the MIPS.

Calls suck ass, even on modern cpus. I've seen several hundreds of
cycles go out of the fault path by eliminating them. If you can kill
a leaf level call, you can avoid saving the whole frame, and on Sparc
(for example) this means saving a potential window spill trap which
can be quite costly.

Calls are less simple than branches to do (via prediction etc.) at
"zero cost" because usually there is a write port necessary (to write
the call instruction's address into the "return" register).

Let's not even start talking about calls in PIC code :-)

Later,
David S. Miller
[email protected]

2001-07-06 17:12:05

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

>Cort Dougan writes:
> > I'm talking about _modern_ processors, not processors that dominate
>the
> > modern age. This isn't x86.
>
>Linus mentioned Alpha specifically. I don't see how any of the things
>he said were x86-centric in any way shape or form.
>
>All of his examples are entirely accurate on sparc64 for example, and
>even more so, his Alpha commentary can be applied nearly directly to
>the MIPS.
>
>Calls suck ass, even on modern cpus. I've seen several hundreds of
>

Modern? How many stacks?
There's a couple of Forth engines out there that pay the usual for a call
and get returns in zero time. Forth code, and Forth engine machine
instructions, have about twice as many calls as Linux code,
proportionately. Therefore, a return on some designs is one bit in every
instruction. Every instruction is "...and maybe do a return in parallel."
Forth engines don't have caches. They have on-chip stacks, or the Novix
has separate busses to the stacks. Both stacks, return and data.

Forth chips aren't modern in the true-multi-user sense, but if an
individual were to design such a beast they could get several of them,
hundreds maybe, on FPGAs available now. Such things are coming, because a
Forth chip IS something an individual can design.

Rick Hohensee
http://www.clienux.com

2001-07-06 18:45:14

by Linus Torvalds

Subject: Re: Why Plan 9 C compilers don't have asm("")

In article <[email protected]>,
Cort Dougan <[email protected]> wrote:
>I'm talking about _modern_ processors, not processors that dominate the
>modern age. This isn't x86.

NONE of my examples were about the x86.

I gave the alpha as a specific example. The same issues are true on
ia64, sparc64, and mips64. How more "modern" can you get? Name _one_
reasonably important high-end CPU that is more modern than alpha and
ia64..

On ia64, you probably end up with function calls costing even more than
alpha, because not only does the function call end up being a
synchronization point for the compiler, it also means that the compiler
cannot expose any parallelism, so you get an added hit from there. At
least with other CPU's that find the parallelism dynamically they can do
out-of-order stuff across function calls.

>Unconditional branches are definitely predictable so icache pre-fetches are
>not more complicated than straight-line code.

Did you READ my mail at all?

Most of these "unconditional branches" are indirect, because rather few
64-bit architectures have a full 64-bit branch. That means that in
order to predict them, you either have to do data-prediction (pretty
much nobody does this), or you have a branch target prediction cache,
which works very well indeed but has the problem that it only works for
stuff in the cache, and the cache tends to be fairly limited (because
you need to cache the whole address - it's more than a "which direction
do we go in").

There are lots of good arguments for function calls: they improve icache
when done right, but if you have some non-C-semantics assembler sequence
like "cli" or a spinlock that you use a function call for, that would
_decrease_ icache effectiveness simply because the call itself is bigger
than the instruction (and it breaks up the instruction sequence so you
get padding issues).
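
A minimal sketch of the kind of one-instruction, non-C-semantics primitive meant here, assuming GCC or Clang. Since "cli" is privileged and cannot run in user space, the x86 "pause" instruction stands in for it; cpu_relax and spin_wait are names chosen for illustration, and non-x86 targets get only a compiler barrier.

```c
/* Hypothetical one-instruction primitive with no C equivalent. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("pause" ::: "memory");
#elif defined(__GNUC__)
    __asm__ __volatile__("" ::: "memory");   /* compiler barrier only */
#endif
}

/* Spin until *flag becomes nonzero or `limit` iterations pass; returns
 * the number of iterations spent waiting. */
unsigned spin_wait(const volatile int *flag, unsigned limit)
{
    unsigned spins = 0;

    while (!*flag && spins < limit) {
        cpu_relax();   /* expands in place: no call, no argument setup */
        spins++;
    }
    return spins;
}
```

Inlined, the primitive costs one short instruction in the loop body. As an out-of-line function, each use would need a call instruction that is longer than the instruction it wraps.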

Linus

2001-07-06 20:01:55

by Cort Dougan

Subject: Re: Why Plan 9 C compilers don't have asm("")

Yes, that was not easy to miss. I was simply being clear. The plan9
compiler, thus its take on inline asm, doesn't run on ia64 and alpha as far
as I can see from the latest release.

} NONE of my examples were about the x86.
}
} I gave the alpha as a specific example. The same issues are true on
} ia64, sparc64, and mips64. How more "modern" can you get? Name _one_
} reasonably important high-end CPU that is more modern than alpha and
} ia64..
}
} On ia64, you probably end up with function calls costing even more than
} alpha, because not only does the function call end up being a
} synchronization point for the compiler, it also means that the compiler
} cannot expose any parallelism, so you get an added hit from there. At
} least with other CPU's that find the parallelism dynamically they can do
} out-of-order stuff across function calls.

Yes, that's how I saw it didn't relate to the topic at hand. I suggested
measurement rather than theory to determine whether the branch washes out
or not. "Everyone knows" is a much weaker statement than "I can show".

} Did you READ my mail at all?

I definitely agree there. If you need an instruction or two that the
compiler doesn't offer then it's a loss to call a function and inline asm
is worthwhile. If this is a common enough case it's worth the compiler
adding inline asm support. I'm not sure how often that is. My own
subjective experience has been that calls to such code are rare enough that
they fall well into the realm of optimizing the uncommon case.

I've used inline asm gratuitously in linux (it's peppered all over the ppc
code) because I had the feature. I don't think that's a strong argument
for adding it to a compiler that doesn't support it, though.

} There are lots of good arguments for function calls: they improve icache
} when done right, but if you have some non-C-semantics assembler sequence
} like "cli" or a spinlock that you use a function call for, that would
} _decrease_ icache effectiveness simply because the call itself is bigger
} than the instruction (and it breaks up the instruction sequence so you
} get padding issues).

2001-07-06 23:54:36

by David Miller

Subject: Re: Why Plan 9 C compilers don't have asm("")


Rick Hohensee writes:
> Forth chips aren't modern in the true-multi-user sense, but if an
> individual were to design such a beast they could get several of them,
> hundreds maybe, on FPGAs available now. Such things are coming, because a
> Forth chip IS something an individual can design.

And I suppose this zero-cost call is also handling things like keeping
an N stage deep pipeline full during this call right?

Later,
David S. Miller
[email protected]

2001-07-07 00:17:33

by H. Peter Anvin

Subject: Re: Why Plan 9 C compilers don't have asm("")

Followup to: <[email protected]>
By author: "David S. Miller" <[email protected]>
In newsgroup: linux.dev.kernel
>
> Rick Hohensee writes:
> > Forth chips aren't modern in the true-multi-user sense, but if an
> > individual were to design such a beast they could get several of them,
> > hundreds maybe, on FPGAs available now. Such things are coming, because a
> > Forth chip IS something an individual can design.
>
> And I suppose this zero-cost call is also handling things like keeping
> an N stage deep pipeline full during this call right?
>

Believe it or not, that's actually a fairly simple part of the whole
machinery. All you need for that is to maintain a call/return stack
in the front end of the pipe. That way, a return that is indeed a
return can be speculated properly; obviously, if the speculation
doesn't work out when you get the return address in the execution
stage you suffer a branch mispredict penalty.
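
The call/return stack described above can be modelled in a few lines of C. This is a toy software model, not a description of any particular CPU: the depth of 8 and the names struct ras, ras_on_call and ras_on_return are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define RAS_DEPTH 8   /* assumed depth; real predictors vary */

struct ras {
    uintptr_t slot[RAS_DEPTH];
    size_t top;       /* number of outstanding calls recorded */
};

/* On a call, remember the address the matching return should go to. */
void ras_on_call(struct ras *r, uintptr_t return_addr)
{
    r->slot[r->top % RAS_DEPTH] = return_addr;  /* oldest entry is overwritten when full */
    r->top++;
}

/* On a return, pop the prediction and report whether it matched the
 * address that the execution stage actually produced. */
int ras_on_return(struct ras *r, uintptr_t actual_addr)
{
    if (r->top == 0)
        return 0;     /* nothing to predict from: counts as a mispredict */
    r->top--;
    return r->slot[r->top % RAS_DEPTH] == actual_addr;
}
```

Properly nested calls and returns always predict correctly in this model. A return to an unexpected address, or call nesting deeper than the stack, shows up as a 0 result, which corresponds to the mispredict penalty mentioned above.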

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-07-07 00:37:35

by David Miller

Subject: Re: Why Plan 9 C compilers don't have asm("")


H. Peter Anvin writes:
> Believe it or not, that's actually a fairly simple part of the whole
> machinery. All you need for that is to maintain a call/return stack
> in the front end of the pipe.

I understand how RAS stacks work, I even mentioned them in another
posting of this thread :-) It's the CALL _itself_ that I'm talking
about.

Later,
David S. Miller
[email protected]

2001-07-07 06:04:05

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

I replied to davem at length but I think I forgot to "reply to all
recipients". The gist of it is Forth code density is so high on Forth
hardware that things like icaches aren't as important, and the factors
involved are entirely different. Like high-performance Forth engines
are tiny and draw negligible current. Two URL's...

http://forth.gsfc.nasa.gov/
http://www.mindspring.com/chipchuck/forth.html

Rick Hohensee
http://www.clienux.com

2001-07-08 21:58:56

by Victor Yodaiken

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Fri, Jul 06, 2001 at 06:44:31PM +0000, Linus Torvalds wrote:
> On ia64, you probably end up with function calls costing even more than
> alpha, because not only does the function call end up being a
> synchronization point for the compiler, it also means that the compiler
> cannot expose any parallelism, so you get an added hit from there. At

That seems amazingly dumb. You'd think a new processor design would
optimize parallel computation over calls, but what do I know?

> Most of these "unconditional branches" are indirect, because rather few
> 64-bit architectures have a full 64-bit branch. That means that in

This is something I don't get: I never understood why 32bit risc designers
were so damn obstinate about "every instruction fits in 32 bits"
and refused to have "call 32 bit immediate given in next word" not
to mention a "load 32bit immediate given in next word".
Note, the superior x86 instruction set has a 5 byte call immediate.


> There are lots of good arguments for function calls: they improve icache
> when done right, but if you have some non-C-semantics assembler sequence
> like "cli" or a spinlock that you use a function call for, that would
> _decrease_ icache effectiveness simply because the call itself is bigger
> than the instruction (and it breaks up the instruction sequence so you
> get padding issues).

I think anywhere that you have inner-loop or often-used operations
that are short assembler sequences, inline asm is a win. It's easy to
show, for example, that the Linux x86 asm macro for semaphore down is
three times as fast as a called version. I wish, however, that GCC did
not use a horrible, overly complex Lisp-like syntax and that there was
a way to inline functions written in .S files.

And the feature is way too easy to abuse - same argument here as in
the threads argument.
It's a far better thing to not need a semaphore at all than to rely
on handcoded semaphore down to make your poorly synchronized design
sort-of perform.


2001-07-08 22:29:49

by David Miller

Subject: Re: Why Plan 9 C compilers don't have asm("")


Victor Yodaiken writes:
> This is something I don't get: I never understood why 32bit risc designers
> were so damn obstinate about "every instruction fits in 32 bits"
> and refused to have "call 32 bit immediate given in next word" not
> to mention a "load 32bit immediate given in next word".

Sparc has such an instruction, in fact the instruction and the 32-bit
immediate fit in a single 32-bit instruction (since the immediate is
guaranteed to be modulo 4 you have some extra bits for the instruction
opcode itself).
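
The encoding trick can be sketched as follows, assuming the 32-bit SPARC CALL format: a 2-bit opcode of 01 followed by a 30-bit word displacement relative to the call's own address. The function names are hypothetical, and both addresses are assumed to be 4-byte aligned.

```c
#include <stdint.h>

/* Bits 31..30 hold the opcode 01; bits 29..0 hold the displacement
 * (target - pc) counted in 4-byte words. */
uint32_t sparc_encode_call(uint32_t pc, uint32_t target)
{
    uint32_t disp30 = ((target - pc) >> 2) & 0x3fffffffu;
    return 0x40000000u | disp30;
}

/* Shifting left by 2 drops the opcode bits and restores byte units;
 * unsigned wraparound handles backward displacements. */
uint32_t sparc_call_target(uint32_t pc, uint32_t insn)
{
    return pc + (insn << 2);
}
```

Because the two low bits of an aligned displacement are always zero, dropping them leaves room for the opcode, and a single 32-bit instruction can still reach any aligned address in a 32-bit address space.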

Later,
David S. Miller
[email protected]

2001-07-08 22:29:28

by Alan

Subject: Re: Why Plan 9 C compilers don't have asm("")

> That seems amazingly dumb. You'd think a new processor design would
> optimize parallel computation over calls, but what do I know?

They try to. Take a look at the trace cache on the Pentium IV. They certainly
seem to have badly screwed the chip design up elsewhere but the trace cache
has some very clever ideas in it.

> were so damn obstinate about "every instruction fits in 32 bits"
> and refused to have "call 32 bit immediate given in next word" not
> to mention a "load 32bit immediate given in next word".
> Note, the superior x86 instruction set has a 5 byte call immediate.

Some do. After all there is nothing unrisc about

call [ip]++

Maybe the idea died with the PDP-11 designers 8)

Alan

2001-07-09 00:08:42

by Pete Zaitcev

Subject: Re: Why Plan 9 C compilers don't have asm("")

In linux-kernel, you wrote:
> On Fri, Jul 06, 2001 at 06:44:31PM +0000, Linus Torvalds wrote:
> > On ia64, you probably end up with function calls costing even more than
> > alpha, because not only does the function call end up being a
> > synchronization point for the compiler, it also means that the compiler
> > cannot expose any parallelism, so you get an added hit from there. At
>
> That seems amazingly dumb. You'd think a new processor design would
> optimize parallel computation over calls, but what do I know?

Register windows do help some, in that sense ia64 is a big
step forward over x86. As I read what Linus wrote, he talked
about a different thing: inside a procedure you do not
know whence you are called, therefore you must start scheduling
anew from the first instruction of the procedure; before your
results hit the writeback stage, a lot of bubbles are in the
pipeline meanwhile. Your only hope is that they are used up
by unfinished computations in the caller. In this, rational
argument passing helps to exploit a possible overlap.

> > Most of these "unconditional branches" are indirect, because rather few
> > 64-bit architectures have a full 64-bit branch. That means that in
>
> This is something I don't get: I never understood why 32bit risc designers
> were so damn obstinate about "every instruction fits in 32 bits"
> and refused to have "call 32 bit immediate given in next word" not
> to mention a "load 32bit immediate given in next word".
> Note, the superior x86 instruction set has a 5 byte call immediate.

You must take into account that early riscs had minuscule dies,
for example the first Fujitsu made SPARC had 10,000 gates
all told. An alignment to the next instruction wastes hardware,
and, perhaps, a clock cycle.

-- Pete

2001-07-09 00:32:13

by Victor Yodaiken

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Sun, Jul 08, 2001 at 08:08:24PM -0400, Pete Zaitcev wrote:
> Register windows do help some, in that sense ia64 is a big
> step forward over x86.

It seems to me that x86 instruction set has lived long enough to
become efficient again. Register windows I think are bad. I'd rather
see a couple of hundred K of 1 cycle memory that the compiler/programmer
could use. But then I don't like the property "test for 1 year
and still don't uncover the production case where there is a window
spill that comes at just the wrong time when the write cache is
full, ... - and timing changes by hundreds of microseconds."

>As I read what Linus wrote, he talked
> about a different thing: inside a procedure you do not
> know whence you are called, therefore you must start scheduling
> anew from the first instruction of the procedure; before your

This is a hard part for any VLIW-type machine - if the compiler
can't figure it out or if the processor requires a sync point, then
performance will be terrible. My understanding is that this is
just a Merced problem, not an ia64 fundamental, but it seems hard.
As Alan points out, the PIV tries to do better with a trace cache,
so
    code; call x; code
is essentially dynamically inlined by caching
    code; code of x; code
if I understand it right, and that's pretty cool
- maybe McKinley will use the same technique if
the compiler can't figure it out.


Anyway, any processor that does badly on calls is going to be
a disaster, the real question is when it's good to use assembler
escapes.

> You must take into account that early riscs had minuscule dies,
> for example the first Fujitsu made SPARC had 10,000 gates
> all told. An alignment to the next instruction wastes hardware,
> and, perhaps, a clock cycle.

PowerPC has no excuse.

>
> -- Pete

2001-07-09 01:24:31

by Johan Kullstam

Subject: Re: Why Plan 9 C compilers don't have asm("")

Victor Yodaiken <[email protected]> writes:

> On Fri, Jul 06, 2001 at 06:44:31PM +0000, Linus Torvalds wrote:
> > On ia64, you probably end up with function calls costing even more than
> > alpha, because not only does the function call end up being a
> > synchronization point for the compiler, it also means that the compiler
> > cannot expose any parallelism, so you get an added hit from there. At
>
> That seems amazingly dumb. You'd think a new processor design would
> optimize parallel computation over calls, but what do I know?
>
> > Most of these "unconditional branches" are indirect, because rather few
> > 64-bit architectures have a full 64-bit branch. That means that in
>
> This is something I don't get: I never understood why 32bit risc designers
> were so damn obstinate about "every instruction fits in 32 bits"
> and refused to have "call 32 bit immediate given in next word" not
> to mention a "load 32bit immediate given in next word".
> Note, the superior x86 instruction set has a 5 byte call immediate.

the 32 bit MIPS (R3K series, at least) has a 32 bit instruction which
loads a 16 bit immediate (which fits within the instruction itself).
thus to load a 32 bit number takes two instructions. since the
instructions are all 32 bits and must live on a multiple of 4 bytes,
this is as compact as you can get given the alignment constraint.
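
a sketch of that two-instruction sequence, assuming the usual lui/ori pair
and the standard MIPS I-type layout (6-bit opcode, two 5-bit register
fields, 16-bit immediate). the helper names are invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* MIPS I-type word: opcode(6) | rs(5) | rt(5) | imm(16). */
static uint32_t mips_itype(unsigned op, unsigned rs, unsigned rt,
                           uint16_t imm)
{
    assert(op < 64 && rs < 32 && rt < 32);
    return ((uint32_t)op << 26) | ((uint32_t)rs << 21) |
           ((uint32_t)rt << 16) | imm;
}

/* Emit the two instructions that load a full 32-bit constant into rt:
   lui places the upper half, ori fills in the lower half. */
static void mips_load_imm32(unsigned rt, uint32_t value, uint32_t out[2])
{
    out[0] = mips_itype(0x0F, 0, rt, (uint16_t)(value >> 16));      /* lui rt, hi16     */
    out[1] = mips_itype(0x0D, rt, rt, (uint16_t)(value & 0xFFFFu)); /* ori rt, rt, lo16 */
}
```

each constant therefore costs 8 bytes of instruction stream, against 5 for
an x86 mov with a 32-bit immediate.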

note that x86 is also fussy about alignment in various cases, e.g.,
double-precision floats.

> > There are lots of good arguments for function calls: they improve icache
> > when done right, but if you have some non-C-semantics assembler sequence
> > like "cli" or a spinlock that you use a function call for, that would
> > _decrease_ icache effectiveness simply because the call itself is bigger
> > than the instruction (and it breaks up the instruction sequence so you
> > get padding issues).
>
> I think anywhere that you have inner loop or often used operations
> that are short assembler sequences, inline asm is a win - it's easy to
> show for example, that the Linux asm x86 macro semaphore down
> is three times as fast as
> a called version. I wish, however
> that GCC did not use a horrible overly complex lisplike syntax

lisp syntax is extremely simple. i am not sure what GCC does to make
it complex.

> and
> that there was a way to inline functions written in .S files.
>
> And the feature is way too easy to abuse - same argument here as in
> the threads argument.
> It's a far better thing to not need a semaphore at all than to rely
> on handcoded semaphore down to make your poorly synchronized design
> sort-of perform.

--
J o h a n K u l l s t a m
[[email protected]]
Don't Fear the Penguin!

2001-07-09 02:49:55

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

>Victor Yodaiken <[email protected]>
>
>I think anywhere that you have inner loop or often used operations
>that are short assembler sequences, inline asm is a win - it's easy to
>show for example, that the Linux asm x86 macro semaphore down
>is three times as fast as
>a called version. I wish, however
>that GCC did not use a horrible overly complex lisplike syntax and
>that there was a way to inline functions written in .S files.

If you can loop faster in asm, and you surely can on x86/Gcc in many
cases, that's a win, and probably quite a worthwhile one, but that's
independent of inline in terms of "not a C call". I think that distinction
may be prone to being overlooked. The longer your average loop, the less
asm("") matters, i.e. the less of a proportional hit a C stack ceremony
is. You can loop in asm and still not need asm(""), if you pay for the
stack frame. Plan 9 has about 4 string functions that are hand-coded, but
they are C-called, from what I can tell, and have been told.
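
To put a number on "proportional hit", here is a small sketch of the
arithmetic. The cycle counts are hypothetical inputs, not measurements, and
the function name is made up:

```c
#include <assert.h>

/* Fraction of total time spent on call/return overhead when a routine
   is entered once and then loops `iterations` times.
   All cycle counts are illustrative inputs, not measured values. */
static double call_overhead_fraction(double call_cycles,
                                     double cycles_per_iteration,
                                     double iterations)
{
    assert(call_cycles >= 0 && cycles_per_iteration > 0 && iterations > 0);
    double loop_cycles = cycles_per_iteration * iterations;
    return call_cycles / (call_cycles + loop_cycles);
}
```

With a 10-cycle call and a 1-cycle loop body, 10 iterations spend half the
time in the call ceremony, while 990 iterations spend about 1% there.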

Rick Hohensee
http://www.clienux.com


2001-07-21 22:13:07

by Richard Henderson

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Wed, Jul 04, 2001 at 05:22:44PM +0000, Linus Torvalds wrote:
> [ And yes, I know there are optimizing linkers for the alpha around that
> improve this and notice when they don't need to change GP and can do a
> straight branch etc. I don't think GNU ld _still_ does that, but who
> knows.

GNU ld does it with the "-relax" flag.

> Even the "good" Digital compilers tended to nop out unnecessary
> instructions rather than remove them, causing more icache pressure on
> a CPU that was already famous for needing tons of icache ]

But you're absolutely right about the nopping -- removing the nops would
require debug info and EH info to be re-coded. The latter being a matter
of correctness. This is a bit nastier than I ever cared to deal with.


r~

2001-07-22 03:45:12

by Linus Torvalds

Subject: Re: Why Plan 9 C compilers don't have asm("")


On Sat, 21 Jul 2001, Richard Henderson wrote:
>
> > Even the "good" Digital compilers tended to nop out unnecessary
> > instructions rather than remove them, causing more icache pressure on
> > a CPU that was already famous for needing tons of icache ]
>
> But you're absolutely right about the nopping -- removing the nops would
> require debug info and EH info to be re-coded. The latter being a matter
> of correctness. This is a bit nastier than I ever cared to deal with.

I don't see it as being all that nasty. It's only nasty if you make the
compiler default to the long form - because it's so hard to reduce the
size later (branches around call-sites etc). But if you make the compiler
default to the short form, you're ok - you can trivially expand the short
form later without any of the same problems.

Isn't this what mips32 used to do too - it didn't have the GP reload
issue, but it has 16-bit branch offsets if I remember correctly. So they
ended up adding trampoline branches on demand later. And a 16-bit branch
offset is a lot more constraining than a 20-bit one. 128kB is not a huge
jump, but 2MB is getting to be pretty far in most applications..
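
The 128kB and 2MB figures follow from treating the offset as a signed count
of 4-byte instructions. A sketch of the arithmetic, taking the bit widths
above at face value (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Reach in bytes, in each direction, of a signed displacement that is
   `bits` wide and counted in 4-byte instruction words. */
static int64_t branch_reach_bytes(unsigned bits)
{
    assert(bits >= 1 && bits <= 32);
    return ((int64_t)1 << (bits - 1)) * 4;
}
```

branch_reach_bytes(16) is 131072 (128kB) and branch_reach_bytes(20) is
2097152 (2MB).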

So:
- you _always_ generate the fast case. A call is always considered to be
a short one, simple "bsr", no GP change, no nothing.
- you generate a trampoline as well, and teach the linker to go through
the trampoline if it has to do a far call (one trampoline per target,
not per caller). Think of it as an "overflow" case for a .rel20.
- If it's not a weak reference and you can satisfy it at link-time, you
can obviously just get rid of the trampoline then and there. This takes
care of all the normal "intra-GP" things.
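
The link-time decision in those three points could be sketched like this.
It is only an illustration: the names are invented, the displacement is
assumed to be a signed word count relative to the instruction after the
branch, and the width is passed in rather than taken from any real
relocation type:

```c
#include <stdbool.h>
#include <stdint.h>

enum call_kind { CALL_DIRECT, CALL_VIA_TRAMPOLINE };

/* Does a byte offset fit a signed, `bits`-wide, word-counted field? */
static bool fits_disp(int64_t byte_offset, unsigned bits)
{
    int64_t limit = (int64_t)1 << (bits - 1);
    int64_t words = byte_offset / 4;
    return byte_offset % 4 == 0 && words >= -limit && words < limit;
}

/* Pick the short branch when the target is known and in range,
   otherwise fall back to the out-of-line trampoline. */
static enum call_kind resolve_call(uint64_t site, uint64_t target,
                                   bool target_known, unsigned bits)
{
    if (!target_known)              /* weak or not resolvable at link time */
        return CALL_VIA_TRAMPOLINE;
    /* Assumption: offsets are measured from the next instruction. */
    int64_t offset = (int64_t)target - (int64_t)(site + 4);
    return fits_disp(offset, bits) ? CALL_DIRECT : CALL_VIA_TRAMPOLINE;
}
```

A linker doing this would then drop the trampoline of every symbol for
which all call sites resolved to CALL_DIRECT.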

Sure, if you want to be fancy, you also drop unused GOT entries for
anything that ends up not having a trampoline.

So the above takes care of correctness. For bonus points, you allow the
user to specify "this will be a far call (or weak)" as an attribute, which
you use on inter-module code. Which is almost entirely library
interfaces, so you'd have the system header files use this so that shared
library calls don't get the hit of the trampoline.

As far as I can see, this should take care of about 99% of all static
jumps. Most applications have less than 2MB code-space, and the only real
reason for the long form is for inter-module calls which tend to be fairly
well specified (ie they are declared in standard headers etc).

Sure, you could sometimes get the slower case: more than a 2MB offset
within a module, so that you'd have to use the trampoline, or if you're
lazy and don't update the headers for dynamically linked libraries. But
even then there would be the potential for icache win. And you could
always have a "-mlarge-model" compiler option for those cases, so if you
notice that you lose on this optimization, you just disable it.

No?

Linus

2001-07-22 04:00:12

by Mike Castle

Subject: Re: Why Plan 9 C compilers don't have asm("")


Why do I feel like I just read a description of the various real-mode x86
memory models?

mrc

On Sat, Jul 21, 2001 at 08:43:43PM -0700, Linus Torvalds wrote:
> Sure, you could sometimes get the slower case: more than a 2MB offset
> within a module, so that you'd have to use the trampoline, or if you're
> lazy and don't update the headers for dynamically linked libraries. But
> even then there would be the potential for icache win. And you could
> always have a "-mlarge-model" compiler option for those cases, so if you
> notice that you lose on this optimization, you just disable it.

--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-07-22 06:51:56

by Richard Henderson

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Sat, Jul 21, 2001 at 08:43:43PM -0700, Linus Torvalds wrote:
> So:
> - you _always_ generate the fast case. A call is always considered to be
> a short one, simple "bsr", no GP change, no nothing.
> - you generate a trampoline as well, and teach the linker to go through
> the trampoline if it has to do a far call (one trampoline per target,
> not per caller). Think of it as an "overflow" case for a .rel20.

This is all well and good if one is designing an ABI from scratch.
You can't retrofit it onto the current ABI, however. Not without
pain anyway.

The call-clobbered GP means that your trampoline has to play games
in order to get the GP restored when coming back from an inter
module call. Which means a new stack frame. Which is a tad more
than you bargained for, really. I can't see that kind of heavyweight
solution being any better than the nops.


r~

2001-07-22 07:46:37

by Linus Torvalds

Subject: Re: Why Plan 9 C compilers don't have asm("")


On Sat, 21 Jul 2001, Richard Henderson wrote:
>
> The call-clobbered GP means that your trampoline has to play games
> in order to get the GP restored when coming back from an inter
> module call. Which means a new stack frame.

Ahh, only if you do my optimization of sharing trampolines among users.
And you're right, that won't work.

But if you don't do that, you don't need a stack frame. You just reload GP
and jump back to the caller.

And assuming most calls don't need the trampoline (and hey, they really
shouldn't), you're still way ahead. The only thing you lost was the icache
win of re-using the trampoline (and a few cycles for scheduling and the
extra short branch).

Think of it as nothing more than a branch prediction thing - you predict
that you can take a short branch, and emit the long-branch code
out-of-line.

So the code would be roughly (this is not how the compiler would see it,
this is the very last stage of outputting the actual assembly. Nothing
else needs to know):

...
bsr $26,trampoline // linker overflow case
retpoint:
....

trampoline:
ldq $27,fn($gp) // load the full address
jsr $26,($27) // branch to it
ldgp $29,($26) // reload our GP
jsr $31,retpoint // and go back to where we came from.

And the linker can just use the special .rel20 thing to turn the bsr into
a direct call when it can.

Overhead when it cannot: one extra "bsr", one extra "jsr" back, and the
lack of scheduling. You lost two cycles and maybe a pipeline stall or
something (branching around is never nice, even if it's unconditional).

But you only lose this on mispredicts. And you can have a pretty high
prediction accuracy, even with just static knowledge.

And when you _do_ predict right, you're going to win in icache footprint,
code size (and because you can drop the trampoline for non-weak symbols
the executable size also goes down) and cycles.

That still doesn't look complicated to me. Of course, it clearly does
depend on whether I'm right that you can fairly easily get 99% prediction
accuracy. And I could just be full of sh*t.

Linus

2001-07-22 15:55:44

by Richard Henderson

Subject: Re: Why Plan 9 C compilers don't have asm("")

On Sun, Jul 22, 2001 at 12:44:57AM -0700, Linus Torvalds wrote:
> But if you don't do that, you don't need a stack frame. You just reload GP
> and jump back to the caller.

Hmm. Yes, that could work. We'd still be changing the ABI, since
the original source "bsr foo" would really mean "bsr foo+skip ldgp".
But perhaps one that wouldn't matter for all practical purposes.

This would need to be done with a new relocation type so that old
linkers that didn't handle this sort of thing choke, but that's not
a big deal.

If I find the time I may give it a shot and see what sort of
effect it has on typical code.


r~

2001-07-22 19:09:56

by Linus Torvalds

Subject: Re: Why Plan 9 C compilers don't have asm("")


On Sun, 22 Jul 2001, Richard Henderson wrote:
>
> Hmm. Yes, that could work. We'd still be changing the ABI, since
> the original source "bsr foo" would really mean "bsr foo+skip ldgp".
> But perhaps one that wouldn't matter for all practical purposes.

Ahh.. Well, that would be more of a linker relocation ABI change, not
really a run-time ABI change. And as it needs a new relocation type
anyway, because of the overflow case (I assume the current .rel20
complains loudly when it overflows, right?), this should be ok.

And the code to jump over ldgp obviously exists already if there's a
"-relax" linker option.

Linus

2001-07-23 04:23:32

by Rick Hohensee

Subject: Re: Why Plan 9 C compilers don't have asm("")

What if Forth only had one stack?

Looking at optimizations and calling conventions, I did some gas/cpp
macros that implement caller-hikes, callee passes. The caller makes the
space for the callee's stack frame, but it's up to the callee to populate
it if necessary. Sometimes it isn't. In assembly the current context can
see the whole stack, and "osimpa" macros not all included here make the
parent frame, the current frame, and the most recently exited child frame
3 sets of named locals. This is in conjunction with x86 RET imm16, which
does a stack frame drop for free. I got the Ackermann function, a nasty
little recursion exercise, and rather C's home court, about 50% faster
than Gcc 3.0 -O3 -fomit-frame-pointer. The Gcc version does optimize out
the two tail recursions, leaving one non-tail recursion. I beat that with
all 3 tail recursions remaining in my code. i.e. this is the first version
that worked. I stared at this monster for 2 full days looking for where I
had written "increment" instead of "decrement". Now it appears to produce
the correct results.

..........................................................................

#define cell 4
#define cells *4
#define sM 4 (%esp)
#define sN 8 (%esp)
/* some of the parent's locals */
#define pM ((def_hike +2) cells) (%esp)
#define pN ((def_hike +3) cells) (%esp)


#define def(routine,HIKE) \
def_hike = HIKE ; \
.globl routine ; \
routine:

#define fed ret $(def_hike cells)

#define child(callee) child_hike = callee ## _hike

#define hike(by) subl $(by cells) , %esp

#define do(callee) \
hike(def_hike) ;\
call callee
/* Asmacs excerpts as they pertain */
#define testsubtract cmpl
#define ifzero jz
#define decrement decl
#define increment incl
#define to ,
#define with ,
#define copy movl
#define A %eax

def(Ack,2)
        testsubtract $0 with pM
        ifzero alpha
        testsubtract $0 with pN
        ifzero beta
# return( Ack(M - 1, Ack(M, (N - 1))) );
        copy pN to A
        decrement A
        copy A to sN
        copy pM to A
        copy A to sM
        do(Ack)
        copy A to sN
        decrement sM
        do(Ack)
        fed
# return( N + 1 );
alpha:  copy pN to A            # M=0
        increment A
        fed
# return( Ack(M - 1, 1) );
beta:   copy $1 to sN           # N=0
        copy pM to A
        decrement A
        copy A to sM
        do(Ack)
        fed

#___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___ ___

def(main,2)                     # known OK
        copy $2 to sM
        copy $8 to sN
        do(Ack)
        fed



/* Rick Hohensee 2001 */


/* The Ackermann function in GNU gas with cpp macros for "asmacs"
verbosifications and "osimpa" caller-hikes, callee-passes subroutine
parameter passing

Parts of asmacs, osimpa.cpp and local renamings included in this file for
clarity.

osimpa stuff, with locals names to reflect the C code example and osimpa
callee-passes, i.e. pM instead of Pa, sM instead of a, etc.

The full asmacs is in Janet_Reno and H3sm. osimpa isn't out yet.
*/

....................................................................

I compared this to the C version on Bagley's language shootout
page by hacking that down to use

<snip>

main () {
return Ack(3,8);
}

so it just returns the low byte of the result, as does my code.
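
For readers without the shootout source to hand, the snipped part is
presumably close to the textbook recursion below. This is a sketch written
to match the return( ... ) comments in the assembly above, not a copy of
Bagley's file, and low_byte is a made-up helper:

```c
/* Two-argument Ackermann function in its textbook recursive form. */
static unsigned ack(unsigned m, unsigned n)
{
    if (m == 0)
        return n + 1;                     /* alpha case: M = 0 */
    if (n == 0)
        return ack(m - 1, 1);             /* beta case:  N = 0 */
    return ack(m - 1, ack(m, n - 1));     /* general case      */
}

/* What an exit status keeps of a returned value: only the low byte. */
static unsigned low_byte(unsigned value)
{
    return value & 0xFFu;
}
```

ack(3, 8) is 2045, so a main that returns it exits with status 253.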

C can pick this up after the expressions are parsed. Whereas this models
"stack-array plus accumulator", that's actually less aggravation to
program directly than Forth stack manipulations (well, maybe), so I'll
probably code this on top of shasm without an expression parser. osimpa
stands for "one stack in memory plus accumulator".

Rick Hohensee
http://www.clienux.com