Has Objective-C ever been considered for kernel development?
regards,
BPC
On Thu, 2007-11-29 at 12:14 +0000, Ben Crowhurst wrote:
> Has Objective-C ever been considered for kernel development?
Why not C# instead ?
> > Has Objective-C ever been considered for kernel development?
>
> Why not C# instead ?
Why not Haskell nor Erlang instead ? :-D
On Fri, 2007-11-30 at 19:09 +0900, KOSAKI Motohiro wrote:
> > > Has Objective-C ever been considered for kernel development?
> >
> > Why not C# instead ?
>
> Why not Haskell nor Erlang instead ? :-D
I heard of a bash compiler. That would enable development time
rationalization and maximize the collaborative convergence of a
community-oriented synergy.
2007/11/29, Ben Crowhurst <[email protected]>:
> Has Objective-C ever been considered for kernel development?
>
> regards,
> BPC
No, it has not. No language that looks remotely like an OO language
has ever been considered for (Linux) kernel development, nor for
most, if not all, other operating system kernels.
Various problems occur in an object oriented language. One of them
is garbage collection: it provokes asynchronous delays and, during
an interrupt or a system call for a real time task, the kernel cannot
wait. Another is memory overhead: all the magic that OO languages
provide takes space in memory, and the Linux kernel is used in embedded
systems with very tight memory requirements.
Lots of people will think of better reasons why ObjC is not used...
Loïc Grenié
On Nov 30 2007 11:20, Xavier Bestel wrote:
>On Fri, 2007-11-30 at 19:09 +0900, KOSAKI Motohiro wrote:
>> > > Has Objective-C ever been considered for kernel development?
>> >
>> > Why not C# instead ?
>>
>> Why not Haskell nor Erlang instead ? :-D
>
>I heard of a bash compiler. That would enable development time
>rationalization and maximize the collaborative convergence of a
>community-oriented synergy.
>
Fortran90 it has to be.
Loïc Grenié wrote:
> 2007/11/29, Ben Crowhurst <[email protected]>:
>
>> Has Objective-C ever been considered for kernel development?
>>
>> regards,
>> BPC
>>
>
> No, it has not. No language that looks remotely like an OO language
> has ever been considered for (Linux) kernel development, nor for
> most, if not all, other operating system kernels.
>
> Various problems occur in an object oriented language. One of them
> is garbage collection: it provokes asynchronous delays and, during
> an interrupt or a system call for a real time task, the kernel cannot
> wait.
Objective C 1.0 does not force nor have garbage collection.
> Another is memory overhead: all the magic that OO languages
> provide takes space in memory, and the Linux kernel is used in embedded
> systems with very tight memory requirements.
>
But are embedded systems not rapidly moving on? Turning to stare at the
ADSL X6 modem with MBs of RAM.
> Lots of people will think of better reasons why ObjC is not used...
>
> Loïc Grenié
>
>
>
Which I'm looking forward to hearing :)
Thank you for your appropriate response.
--
Regards
BPC
On 30/11/2007, Ben Crowhurst <[email protected]> wrote:
> Loïc Grenié wrote:
> > 2007/11/29, Ben Crowhurst <[email protected]>:
> >
> >> Has Objective-C ever been considered for kernel development?
> >>
<snip>
> > Lots of people will think of better reasons why ObjC is not used...
> >
> > Loïc Grenié
> >
> Which I'm looking forward to hearing :)
>
> Thank you for your appropriate response.
Here are a few reasons off the top of my head:
1. Adding extra unneeded complexity. Debugging would be harder.
2. Not many people can code ObjC when compared to the number of C coders.
3. If it ain't broken... Why fix it. The kernel works, right? Good.
You can find a great explanation somewhere out there, I'm not sure who
wrote it, but it explained why C++ is not a great choice
for the Linux kernel. Some of the arguments against C++ will also go
against ObjC. I cannot find it, but it is out there somewhere.
I'm a newbie and I might be wrong, but the above is what I believe to be true.
Karol Swietlicki
On Thu, Nov 29, 2007 at 12:14:16PM +0000, Ben Crowhurst wrote:
> Has Objective-C ever been considered for kernel development?
>
> regards,
> BPC
To my recall: Never.
Some limited subset of C++ was tried, but was soon abandoned.
Overall the kernel data structures are done in an objectish manner,
although there are no strong type mechanisms being used.
Could the kernel be written in a limited subset[*] of ObjC ? Very likely.
Would it be worth the job ? It would mean a radical decrease in the number
of available programmers...
*) A subset enforcing the rule of never using dynamic memory allocation,
not even indirectly, when operating in interrupt state.
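For illustration, that rule is already what plain kernel C has to follow
today; a minimal sketch using the existing interfaces (the helper function
itself is made up):

#include <linux/slab.h>
#include <linux/hardirq.h>

/* Illustrative helper only: code running in interrupt state must never
 * sleep, so it allocates with GFP_ATOMIC; ordinary process context may
 * use GFP_KERNEL and sleep for memory reclaim. */
static void *alloc_for_context(size_t size)
{
	if (in_interrupt())
		return kmalloc(size, GFP_ATOMIC);	/* may fail, never sleeps */

	return kmalloc(size, GFP_KERNEL);		/* may sleep */
}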
/Matti Aarnio
Jan Engelhardt wrote:
> On Nov 30 2007 11:20, Xavier Bestel wrote:
>
>> On Fri, 2007-11-30 at 19:09 +0900, KOSAKI Motohiro wrote:
>>
>>>>> Has Objective-C ever been considered for kernel development?
>>>>>
>>>> Why not C# instead ?
>>>>
>>> Why not Haskell nor Erlang instead ? :-D
>>>
>> I heard of a bash compiler. That would enable development time
>> rationalization and maximize the collaborative convergence of a
>> community-oriented synergy.
>>
>>
> Fortran90 it has to be.
It used to be written in BCPL; or was that Multics?
On Thu, Nov 29, 2007 at 12:14:16PM +0000, Ben Crowhurst wrote:
> Has Objective-C ever been considered for kernel development?
Doesn't objective C essentially require a runtime to provide a lot of
the features of the language? If it does (as I suspect) then it is
totally unsuitable for kernel development.
That and object oriented languages in general are badly designed and a
bad idea. Having not used objective C I have no idea if it qualifies as
badly designed or not. Certainly C++ and java are both very badly
designed.
Besides the kernel does a wonderful job doing object oriented design
where appropriate using C without any of the stupidities added by the
common OO languages.
--
Len Sorensen
On Fri, Nov 30, 2007 at 11:16:14AM +0000, Ben Crowhurst wrote:
> But are embedded systems not rapidly moving on? Turning to stare at the
> ADSL X6 modem with MBs of RAM.
Some embedded systems run on batteries, so the less ram they have to
power the better, and the less cpu cycles that have to spend executing
code the less power they consume. An ADSL modem on your desk doesn't
have any of those worries, it just has to work and if doubling the ram
cuts the development problems by a lot, then that might have been a
worthwhile trade off.
--
Len Sorensen
Ben Crowhurst wrote:
> Has Objective-C ever been considered for kernel development?
No. Kernel programming requires what is essentially assembly language with a
lot of syntactic sugar, which C provides. Higher-level languages abstract away
too much detail to be suitable for the sort of bit-perfect control you need when
you're directly controlling bare metal. You can still use object-oriented
programming techniques in C, and we do this all the time in the kernel, but we
do so with more fine-grained explicit control than a language like Objective-C
would give us. More to the point, if we tried to use Objective-C, we'd find
ourselves needing to fall back to C-style explicitness so often that it wouldn't
be worth the trouble.
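For readers who have not seen the idiom, here is a hedged sketch of what
"object-oriented C" in the kernel style looks like (the names below are
invented; the real examples are ops tables such as file_operations):

/* A hand-rolled "class": callers go through the ops pointer, each
 * driver supplies its own table and private state. */
struct blkdev;

struct blkdev_ops {
	int  (*open)(struct blkdev *dev);
	int  (*read)(struct blkdev *dev, void *buf, unsigned long len);
	void (*release)(struct blkdev *dev);
};

struct blkdev {
	const struct blkdev_ops *ops;	/* explicit "vtable" */
	void *private_data;		/* per-driver state */
};

static int blkdev_read(struct blkdev *dev, void *buf, unsigned long len)
{
	return dev->ops->read(dev, buf, len);	/* explicit dynamic dispatch */
}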
In other news, I hear Hurd boots again!
-- Chris
On Nov 30, 2007, at 09:34:45, Lennart Sorensen wrote:
> On Thu, Nov 29, 2007 at 12:14:16PM +0000, Ben Crowhurst wrote:
>> Has Objective-C ever been considered for kernel development?
>
> Doesn't objective C essentially require a runtime to provide a lot
> of the features of the language? If it does (as I suspect) then it
> is totally unsuitable for kernel development.
>
> That and object oriented languages in general are badly designed
> and a bad idea. Having not used objective C I have no idea if it
> qualifies as badly designed or not. Certainly C++ and java are
> both very badly designed.
Objective-C is actually a pretty minimal wrapper around C; it was
originally implemented as a C preprocessor. It generally does not
have any kind of memory management, garbage collection, or anything
else (although typically a "runtime" will provide those features).
There are no first-class exceptions, so there would be nothing to
worry about there (the exceptions used in GUI programs are built
around the setjmp/longjmp primitives). Objective-C is also almost
completely backwards-compatible with C, much more so than C++ ever
was. As far as the runtime goes the kernel would be expected to
write its own, the same way that it implements "kmalloc()" as part of
a "C runtime". Since the runtime itself never does any implicit
memory allocation, I think it would conceivably even be relatively
safe for kernel usage.
With that said, there is a significant performance penalty as all
Objective-C method calls are looked up symbolically at runtime for
every single call. For GUI programs where large chunks of the code
are event-loops and not performance-sensitive that provides a huge
amount of extra flexibility. In the kernel though, there are many
codepaths where *every* *single* instruction counts; that could be a
serious performance hit.
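To make that concrete, here is a deliberately naive C sketch of the
difference (this is not how the real Objective-C runtime is implemented --
it uses selector pointers and caching rather than string compares -- but
the extra work on every call is the point):

#include <string.h>

struct object;
typedef void (*imp_t)(struct object *self);

struct method {
	const char *name;	/* "selector" */
	imp_t imp;		/* implementation */
};

struct class {
	const struct method *methods;
	int nmethods;
};

struct object {
	const struct class *isa;
};

/* vtable-style dispatch: a couple of loads and an indirect call */
static void call_slot(struct object *o, int slot)
{
	o->isa->methods[slot].imp(o);
}

/* message-send-style dispatch: a lookup on every single call */
static void send_message(struct object *o, const char *selector)
{
	int i;

	for (i = 0; i < o->isa->nmethods; i++) {
		if (strcmp(o->isa->methods[i].name, selector) == 0) {
			o->isa->methods[i].imp(o);
			return;
		}
	}
}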
Cheers,
Kyle Moffett
Kyle Moffett wrote:
> With that said, there is a significant performance penalty as all
> Objective-C method calls are looked up symbolically at runtime for every
> single call.
GACK!
At least C++ has vtables.
-hpa
On Nov 30, 2007, at 13:40:07, H. Peter Anvin wrote:
> Kyle Moffett wrote:
>> With that said, there is a significant performance penalty as all
>> Objective-C method calls are looked up symbolically at runtime for
>> every single call.
>
> GACK!
>
> At least C++ has vtables.
In a tight loop there is a way to do a single symbolic lookup and
just call directly through a function pointer, but typically it isn't
necessary for GUI programs and the like. The flexibility of being
able to dynamically add new methods to an existing class (at least
for desktop user interfaces) significantly outweighs the performance
cost. Any performance-sensitive code is typically written in
straight C anyways.
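In C terms the trick amounts to something like the sketch below, where
resolve_method() is a hypothetical stand-in for the one-off symbolic
lookup:

struct object;				/* opaque */
typedef void (*imp_t)(struct object *self);

/* hypothetical: do the expensive symbolic lookup exactly once */
imp_t resolve_method(struct object *o, const char *selector);

static void process_all(struct object *o, long n)
{
	imp_t do_work = resolve_method(o, "doWork");	/* one lookup */
	long i;

	for (i = 0; i < n; i++)
		do_work(o);	/* plain indirect call on every iteration */
}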
Cheers,
Kyle Moffett
On Fri, 30 Nov 2007 19:09:45 +0900, KOSAKI Motohiro <[email protected]> wrote:
> > > Has Objective-C ever been considered for kernel development?
> >
> > Why not C# instead ?
>
> Why not Haskell nor Erlang instead ? :-D
>
Flash
http://www.lagmonster.info/humor/windowsrg.html
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
David Newall wrote:
> Jan Engelhardt wrote:
>> On Nov 30 2007 11:20, Xavier Bestel wrote:
>>
>>> On Fri, 2007-11-30 at 19:09 +0900, KOSAKI Motohiro wrote:
>>>
>>>>>> Has Objective-C ever been considered for kernel development?
>>>>>>
>>>>> Why not C# instead ?
>>>>>
>>>> Why not Haskell nor Erlang instead ? :-D
>>>>
>>> I heard of a bash compiler. That would enable development time
>>> rationalization and maximize the collaborative convergence of a
>>> community-oriented synergy.
>>>
>>>
>> Fortran90 it has to be.
>
> It used to be written in BCPL; or was that Multics?
BCPL was typeless, as was the successor B (between Bell Labs and GE we
wrote thousands of lines of B, ported to 8080, GE600, etc). C introduced
types, and the rest is history. Multics is written in PL/1, and I wrote
a lot of PL/1 subset G back when as well. You don't know slow compile
until you get a seven pass compiler with each pass on floppy.
--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
On Fri, 30 Nov 2007 11:29:55 +0100, "Loïc Grenié" <[email protected]> wrote:
> 2007/11/29, Ben Crowhurst <[email protected]>:
> > Has Objective-C ever been considered for kernel development?
> >
> > regards,
> > BPC
>
Well, I really would like to learn some things here, could we
keep this off-topic thread alive just a bit, please ?
(I know, I'm going to get a reputation as a troll because I can't avoid these
discussions, it's one of my secret vices...)
> No, it has not. No language that looks remotely like an OO language
> has ever been considered for (Linux) kernel development, nor for
> most, if not all, other operating system kernels.
>
I think BeOS was C++ and OSX is C+ObjectiveC (and runs on an iPhone).
Original MacOS (from 6 to 9) was Pascal (and a Mac SE was very near
to embedded hardware :) ).
I do not advocate rewriting Linux in C++, but don't say a kernel written
in C++ cannot be efficient.
> Various problems occur in an object oriented language. One of them
> is garbage collection: it provokes asynchronous delays and, during
> an interrupt or a system call for a real time task, the kernel cannot
> wait.
C++ (and, from what I read in another answer, Objective-C) has no garbage
collection. It does not do anything you did not tell it to do. It just allows
you to change this
struct buffer *x;
x = kmalloc(...)
x->sz = 128
x->buff = kmalloc(...)
...
kfree(x->buff)
kfree(x)
to
struct buffer *x;
x = new buffer(128); (that itself allocates x->buff,
because _you_ programmed it,
so you poor programmer don't forget)
...
delete x; (that also was programmed to deallocate
x->buff itself, so you have one less
memory leak to worry about)
> Another is memory overhead: all the magic that OO languages
> provide takes space in memory, and the Linux kernel is used in embedded
> systems with very tight memory requirements.
>
A vtable in C++ takes exactly the same space as the function
table pointer present in every driver nowadays... and probably
the virtual method call that C++ does itself with
thing->do_something(with,this)
like
push thing
push with
push this
call THING_vtable+indexof(do_something) // constants at compile time
is much more efficient than what gcc can manage to do with
thing->do_something(with,this,thing)
push with
push this
push thing
get thing+offsetof(do_something) // not constant at compile time
dereference it
call it
(that is, get a generic field on a structure and use it as jump address)
In short, the kernel is object oriented, implements OO programming by
hand, but the compiler lacks the knowledge that it is object oriented
programming so it could do some optimizations.
> Lots of people will think of better reasons why ObjC is not used...
People usually complain about RTTI or exceptions, but the benefits versus
memory space should be seriously considered (surely there is something
in current drivers to ask 'are you a SATA or an IDE disk?').
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
> BCPL was typeless, as was the successor B (between Bell Labs and GE we
B isn't quite typeless. It has minimal inbuilt support for concepts like
strings (although you can of course multiply a string by an array
pointer ;))
It also had some elegances that C lost, notably
case 1..5:
the ability to do non-zero-based arrays
x[40];
x-=10;
and the ability to reassign function names.
printk = wombat;
as well as stuff like free(function);
Alan (who learned B before C, and is still waiting for P)
On Sat, 2007-12-01 at 00:19 +0100, J.A. Magallón wrote:
> A vtable in C++ takes exactly the same space as the function
> table pointer present in every driver nowadays... and probably
> the virtual method call that C++ does itself with
>
> thing->do_something(with,this)
>
> like
> push thing
> push with
> push this
> call THING_vtable+indexof(do_something) // constants at compile time
>
> is much more efficient than what gcc can manage to do with
>
> thing->do_something(with,this,thing)
>
> push with
> push this
> push thing
> get thing+offsetof(do_something) // not constant at compile time
> dereference it
> call it
>
> (that is, get a generic field on a structure and use it as jump address)
>
> In short, the kernel is object oriented, implements OO programming by
> hand, but the compiler lacks the knowledge that it is object oriented
> programming so it could do some optimizations.
struct test;

struct testVtbl
{
	int (*fn1)(struct test *t, int x, int y);
	int (*fn2)(struct test *t, int x, int y);
};

struct test
{
	struct testVtbl *vtbl;
	int x, y;
};

void testCall(struct test *t, int x, int y)
{
	t->vtbl->fn1(t, x, y);
	t->vtbl->fn2(t, x, y);
}

and

struct test
{
	virtual int fn1(int x, int y);
	virtual int fn2(int x, int y);
	int x, y;
};

void testCall(struct test *t, int x, int y)
{
	t->fn1(x, y);
	t->fn2(x, y);
}
generate instruction-for-instruction identical code.
--
Nicholas Miell <[email protected]>
On Fri, Nov 30, 2007 at 11:40:13PM +0000, Alan Cox wrote:
> > BCPL was typeless, as was the successor B (between Bell Labs and GE we
>
> B isn't quite typeless. It has minimal inbuilt support for concepts like
> strings (although you can of course multiply a string by an array
> pointer ;))
>
> It also had some elegances that C lost, notably
>
> case 1..5:
Hey, the language we use, GNU C, has this too 8-)
[acme@doppio net-2.6.25]$ find . -name "*.c" | xargs grep 'case.\+\.\.' | wc -l
400
[acme@doppio net-2.6.25]$ find . -name "*.c" | xargs grep 'case.\+\.\.' | head
./kernel/signal.c: default: /* this is just in case for now ... */
./kernel/audit.c: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
./kernel/audit.c: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
./kernel/audit.c: case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
./kernel/audit.c: case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
./kernel/timer.c: * well, in that case 2.2.x was broken anyways...
./arch/frv/kernel/traps.c: case TBR_TT_TRAP2 ... TBR_TT_TRAP126:
./arch/frv/kernel/ptrace.c: case 0 ... PT__END - 1:
./arch/frv/kernel/ptrace.c: case 0 ... PT__END-1:
./arch/frv/kernel/gdb-stub.c: case GDB_REG_GR(1) ... GDB_REG_GR(63):
[acme@doppio net-2.6.25]$
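A minimal standalone example of the extension, for anyone who has not run
into it (the values are arbitrary):

/* GNU C case ranges; the gcc manual recommends spaces around the "...". */
static int classify(int c)
{
	switch (c) {
	case '0' ... '9':
		return 1;	/* digit */
	case 'a' ... 'z':
	case 'A' ... 'Z':
		return 2;	/* letter */
	default:
		return 0;
	}
}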
- Arnaldo
On Sat, Dec 01, 2007 at 12:19:50AM +0100, J.A. Magallón wrote:
> A vtable in C++ takes exactly the same space as the function
> table pointer present in every driver nowadays... and probably
> the virtual method call that C++ does itself with
>
> thing->do_something(with,this)
>
> like
> push thing
> push with
> push this
> call THING_vtable+indexof(do_something) // constants at compile time
This is not what vtables are. Think for a minute - all codepaths arriving
to that point in your code will pick the address to call from the same
location. Either the contents of that location is constant (in which case
you could bloody well call it directly in the first place) *or* it has to
somehow be reassigned back and forth, according to the value of this. The
former is dumb, the latter - outright insane.
The contents of vtables is constant. The whole point of that thing is
to deal with the situations where we _can't_ tell which derived class
this ->do_something() is from; if we could tell which vtable it is at
compile time, we wouldn't need to bother at all.
It's a tradeoff - we pay the extra memory access (fetch vtable pointer, then
fetch method from vtable) for not having to store a slew of method pointers
in each instance of base class. But the extra memory access is very much
there. It can be further optimized away if you have several method calls
for the same object next to each other (then vtable can be picked once),
but it's still done at runtime.
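Spelled out in C, the tradeoff looks roughly like this (the names are
invented):

struct obj;

/* shared, constant table: costs an extra load on every call */
struct obj_ops {
	void (*do_something)(struct obj *self);
};

struct obj {
	const struct obj_ops *ops;
	int data;
};

/* the alternative: a method pointer stored in every instance */
struct fat_obj {
	void (*do_something)(struct fat_obj *self);
	int data;
};

static void call_thin(struct obj *o)
{
	o->ops->do_something(o);	/* load o->ops, load the slot, call */
}

static void call_fat(struct fat_obj *o)
{
	o->do_something(o);		/* one load, but every object is bigger */
}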
On Sat, Dec 01, 2007 at 12:31:19AM +0000, Al Viro wrote:
> somehow be reassigned back and forth, according to the value of this. The
s/this/thing/, of course
On Sat, 1 Dec 2007 00:31:19 +0000, Al Viro <[email protected]> wrote:
> On Sat, Dec 01, 2007 at 12:19:50AM +0100, J.A. Magallón wrote:
> > A vtable in C++ takes exactly the same space as the function
> > table pointer present in every driver nowadays... and probably
> > the virtual method call that C++ does itself with
> >
> > thing->do_something(with,this)
> >
> > like
> > push thing
> > push with
> > push this
> > call THING_vtable+indexof(do_something) // constants at compile time
>
> This is not what vtables are. Think for a minute - all codepaths arriving
> to that point in your code will pick the address to call from the same
> location. Either the contents of that location is constant (in which case
> you could bloody well call it directly in the first place) *or* it has to
> somehow be reassigned back and forth, according to the value of this. The
> former is dumb, the latter - outright insane.
>
> The contents of vtables is constant. The whole point of that thing is
> to deal with the situations where we _can't_ tell which derived class
> this ->do_something() is from; if we could tell which vtable it is at
> compile time, we wouldn't need to bother at all.
>
Yup, my mistake (that's why I said I will learn something). I was thinking
of non-virtual methods. For virtual ones you have to fetch the vtable
start address and index into it.
> It's a tradeoff - we pay the extra memory access (fetch vtable pointer, then
> fetch method from vtable) for not having to store a slew of method pointers
> in each instance of base class. But the extra memory access is very much
> there. It can be further optimized away if you have several method calls
> for the same object next to each other (then vtable can be picked once),
> but it's still done at runtime.
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Chris Snook wrote:
> Ben Crowhurst wrote:
>> Has Objective-C ever been considered for kernel development?
>
> No. Kernel programming requires what is essentially assembly language
> with a lot of syntactic sugar, which C provides.
I somewhat disagree. Kernel programming requires and deserves the same
care, rigor and eye to details as all other serious systems. Whilst
performance is always a consideration, high-level languages give a
reward in ease of expression and improved reliability, such that a
notional performance cost is easily justified. Occasionally, precise
bit-diddling or tight timing requirements might necessitate use of
assembly; even so, a lot of bit-diddling can be expressed in high-level
languages.
Kernel programming might require a scintilla of assembly language, but
the very vast majority of it should be written in a high-level language.
There's an old joke that claims, "real programmers can write FORTRAN in
any language." It's true. Object orientation is a style of
programming, not a language, and while certain languages have intrinsic
support for this style, objects, methods, properties and inheritance can
probably be written in any language. It's an issue of putting in
care and eye to detail.
Linux could be written in Objective-C, it could be written in Pascal,
but it is written in plain C, with a smattering of assembler. Does it
need to be more complicated than that?
Alan Cox wrote:
>> BCPL was typeless, as was the successor B (between Bell Labs and GE we
>
> B isn't quite typeless. It has minimal inbuilt support for concepts like
> strings (although you can of course multiply a string by an array
> pointer ;))
>
> It also had some elegances that C lost, notably
>
> case 1..5:
>
> the ability to do non-zero-based arrays
>
> x[40];
> x-=10;
Well, original C allowed you to do what you wanted with pointers (I used
to teach that back when K&R was "the" C manual). Now people whine about
having pointers outside the array, which is a crock in practice, as long
as you don't actually /use/ an out of range value.
>
> and the ability to reassign function names.
>
> printk = wombat;
I had forgotten that; the function name was actually a variable holding the
entry point, it says so in section 3.11. And as I recall the code, arrays
were the same thing: a length-ten vector was actually the vector and a
variable with the address of the start. I was more familiar with the B
stuff, I wrote both the interpreter and the code generator+library for
the 8080 and GE600 machines. B on MULTICS, those were the days... :-D
>
> as well as stuff like free(function);
>
> Alan (who learned B before C, and is still waiting for P)
I had the BCPL book still on the reference shelf in the office, along
with goodies like the four candidates to be Ada, and a TRAC manual. I
too expected the next language to be "P".
--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
> Well, original C allowed you to do what you wanted with pointers (I used
> to teach that back when K&R was "the" C manual). Now people whine about
> having pointers outside the array, which is a crock in practice, as long
> as you don't actually /use/ an out of range value.
Actually the standards had good reasons to bar this use, because many
runtime environments used segmentation and unsigned segment offsets. On a
286 you could get into quite a mess with out of array reference tricks.
> variable with the address of the start. I was more familiar with the B
> stuff, I wrote both the interpreter and the code generator+library for
> the 8080 and GE600 machines. B on MULTICS, those were the days... :-D
B on Honeywell L66, so that may well have been a relative of your code
generator ?
Al Viro wrote:
> On Sat, Dec 01, 2007 at 12:19:50AM +0100, J.A. Magallón wrote:
>
>> A vtable in C++ takes exactly the same space as the function
>> table pointer present in every driver nowadays... and probably
>> the virtual method call that C++ does itself with
>>
>> thing->do_something(with,this)
>>
>> like
>> push thing
>> push with
>> push this
>> call THING_vtable+indexof(do_something) // constants at compile time
>>
>
> This is not what vtables are. Think for a minute - all codepaths arriving
> to that point in your code will pick the address to call from the same
> location. Either the contents of that location is constant (in which case
> you could bloody well call it directly in the first place) *or* it has to
> somehow be reassigned back and forth, according to the value of this. The
> former is dumb, the latter - outright insane.
>
> The contents of vtables is constant. The whole point of that thing is
> to deal with the situations where we _can't_ tell which derived class
> this ->do_something() is from; if we could tell which vtable it is at
> compile time, we wouldn't need to bother at all.
>
> It's a tradeoff - we pay the extra memory access (fetch vtable pointer, then
> fetch method from vtable) for not having to store a slew of method pointers
> in each instance of base class. But the extra memory access is very much
> there. It can be further optimized away if you have several method calls
> for the same object next to each other (then vtable can be picked once),
> but it's still done at runtime.
>
True. C++ vtables have no performance advantage over C ->ops->function()
calls. But they have no disadvantage either and they do offer many
syntactic advantages (such as automatically casting the object type to
the *correct* derived class).
Lennart Sorensen wrote:
> On Thu, Nov 29, 2007 at 12:14:16PM +0000, Ben Crowhurst wrote:
>
>> Has Objective-C ever been considered for kernel development?
>>
>
> Doesn't objective C essentially require a runtime to provide a lot of
> the features of the language? If it does (as I suspect) then it is
> totally unsuitable for kernel development.
>
>
C also requires a (very minimal) runtime. And I don't see how having a
runtime disqualifies a language from being usable in a kernel; the
runtime is just one more library, either supplied by the compiler or by
the kernel.
>
> Besides the kernel does a wonderful job doing object oriented design
> where appropriate using C without any of the stupidities added by the
> common OO languages
Object orientation in C leaves much to be desired; see the huge number
of void pointers and container_of()s in the kernel.
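For anyone who has not met the idiom, here is a userspace-compilable sketch
of the pattern being referred to (the kernel's real container_of() also
includes a type check; the structures below are invented):

#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct base {			/* generic "base class" part */
	int refcount;
};

struct my_device {		/* "derived" type embedding it */
	int irq;
	struct base base;
};

/* callbacks receive only the generic pointer... */
static void my_device_release(struct base *b)
{
	/* ...and recover the containing object by hand */
	struct my_device *dev = container_of(b, struct my_device, base);

	dev->irq = -1;
}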
Kyle Moffett wrote:
> In the kernel though, there are many codepaths where *every* *single*
> instruction counts; that could be a serious performance hit.
Write *those* *codepaths* in *C* or *assembly*. But only after you
manage to measure a difference compared to the object-oriented systems
language.
[I really doubt there are that many of these; syscall
entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Avi Kivity <[email protected]> writes:
>
> [I really doubt there are that many of these; syscall
> entry/dispatch/exit, interrupt dispatch, context switch, what else?]
Networking, block IO, page fault, ... But only the fast paths in these
cases. A lot of the kernel is slow path code and could probably
be written even in an interpreted language without much trouble.
-Andi
On Sat, 1 December 2007 21:59:31 +0200, Avi Kivity wrote:
>
> Object orientation in C leaves much to be desired; see the huge number
> of void pointers and container_of()s in the kernel.
While true, this isn't such a bad problem. A language really sucks when
it tries to disallow something useful. Back in university I was forced
to write system software in Pascal. Simple pointer arithmetic became a
5-line piece of code.
Imo the main advantage of C is simply that it doesn't get in the way.
Jörn
--
But this is not to say that the main benefit of Linux and other GPL
software is lower-cost. Control is the main benefit--cost is secondary.
-- Bruce Perens
Alan Cox wrote:
>> Well, original C allowed you to do what you wanted with pointers (I used
>> to teach that back when K&R was "the" C manual). Now people whine about
>> having pointers outside the array, which is a crock in practice, as long
>> as you don't actually /use/ an out of range value.
>>
>
> Actually the standards had good reasons to bar this use, because many
> runtime environments used segmentation and unsigned segment offsets. On a
> 286 you could get into quite a mess with out of array reference tricks.
>
>
>> variable with the address of the start. I was more familiar with the B
>> stuff, I wrote both the interpreter and the code generator+library for
>> the 8080 and GE600 machines. B on MULTICS, those were the days... :-D
>>
>
> B on Honeywell L66, so that may well have been a relative of your code
> generator ?
>
>
Probably the Bell Labs one. I did an optimizer on the Pcode which caught
jumps to jumps, then had separate 8080 and L66 code generators into GMAP
on the GE and the CP/M assembler or the Intel (ISIS) assembler for 8080.
There was also an 8085 code generator using the "ten undocumented
instructions" from the Dr Dobbs article. GE actually had a contract with
Intel to provide CPUs with those instructions, and we used them in the
Terminet(r) printers.
Those were the days ;-)
--
Bill Davidsen <[email protected]>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismarck
Andi Kleen wrote:
> Avi Kivity <[email protected]> writes:
>
>> [I really doubt there are that many of these; syscall
>> entry/dispatch/exit, interrupt dispatch, context switch, what else?]
>>
>
> Networking, block IO, page fault, ... But only the fast paths in these
> cases. A lot of the kernel is slow path code and could probably
> be written even in an interpreted language without much trouble.
>
>
Even these (with the exception of the page fault path) are hardly "we
care about a single instruction" material suggested above. Even with a
million packets per second per core (does such a setup actually exist?)
You have a few thousand cycles per packet. For block you'd need around
5,000 disks per core to reach such rates.
The real benefits aren't in keeping close to the metal, but in high
level optimizations. Ironically, these are easier when the code is a
little more abstracted. You can add quite a lot of instructions if it
allows you not to do some of the I/O at all.
> Even these (with the exception of the page fault path) are hardly "we
> care about a single instruction" material suggested above. Even with a
With 10Gbit/s ethernet working you start to care about every cycle.
Similar with highend routing or in some latency sensitive network
applications (e.g. in HPC). Another simple noticeable case is Unix
sockets and your X server communication.
And there are some special cases where block IO is also pretty critical.
A popular one is TPC-* benchmarking, but there are also others and it
looks likely in the future that this will become more critical
as block devices become faster (e.g. highend SSDs)
> The real benefits aren't in keeping close to the metal, but in high
> level optimizations. Ironically, these are easier when the code is a
> little more abstracted. You can add quite a lot of instructions if it
> allows you not to do some of the I/O at all.
While that's partly true -- cache misses are good for a lot of cycles --
it is not the whole truth and at some point raw code efficiency matters
too.
For example there are some CPUs who are relatively slow at indirect
function calls and there are actually cases where this can be measured.
-Andi
Andi Kleen wrote:
>> Even these (with the exception of the page fault path) are hardly "we
>> care about a single instruction" material suggested above. Even with a
>>
>
> With 10Gbit/s ethernet working you start to care about every cycle.
>
If you have 10M packets/sec no amount of cycle-saving will help you.
You need high level optimizations like TSO. I'm not saying we should
sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
> Similar with highend routing or in some latency sensitive network
> applications (e.g. in HPC).
True. And here, the hardware can cut hundreds of cycles by avoiding the
kernel completely for the fast path.
> Another simple noticeable case is Unix
> sockets and your X server communication.
Your reflexes are *much* better than mine if you can measure half a
nanosecond on X.
Here, it's scheduling that matters, avoiding large transfers, and
avoiding ping-pongs, not some cycles on the unix domain socket. You
already paid 150 cycles or so by issuing the syscall and thousands for
copying the data, 50 more won't be noticeable except in nanobenchmarks.
>
>
> And there are some special cases where block IO is also pretty critical.
> A popular one is TPC-* benchmarking, but there are also others and it
> looks likely in the future that this will become more critical
> as block devices become faster (e.g. highend SSDs)
>
And again the key is batching, improving cpu affinity, and caching, not
looking for a faster instruction sequence.
>
>> The real benefits aren't in keeping close to the metal, but in high
>> level optimizations. Ironically, these are easier when the code is a
>> little more abstracted. You can add quite a lot of instructions if it
>> allows you not to do some of the I/O at all.
>>
>
> While that's partly true -- cache misses are good for a lot of cycles --
> it is not the whole truth and at some point raw code efficiency matters
> too.
>
> For example there are some CPUs who are relatively slow at indirect
> function calls and there are actually cases where this can be measured.
>
>
That is true. But any self-respecting systems language will let you
choose between direct and indirect calls.
If adding an indirect call allows you to avoid even 1% of I/O, you save
much more than you lose, so again the high level optimizations win.
Nanooptimizations are fun (I do them myself, I admit) but that's not
where performance as measured by the end user lies.
--
error compiling committee.c: too many arguments to function
On Mon, Dec 03, 2007 at 01:46:45PM +0200, Avi Kivity wrote:
> If you have 10M packets/sec no amount of cycle-saving will help you.
> You need high level optimizations like TSO. I'm not saying we should
> sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Both high and low level optimizations are needed for good performance.
> >Similar with highend routing or in some latency sensitive network
> >applications (e.g. in HPC).
>
> True. And here, the hardware can cut hundreds of cycles by avoiding the
> kernel completely for the fast path.
A lot of applications don't and the user space networking schemes
tend to have their own drawbacks anyways.
> >Another simple noticeable case is Unix
> >sockets and your X server communication.
>
> Your reflexes are *much* better than mine if you can measure half a
> nanosecond on X.
That's not about mouse/keyboard input, but about all X protocol communication
between X clients and X server. The key is not large copies here
anyways (large data is put into shm) but latency.
> And again the key is batching, improving cpu affinity, and caching, not
> looking for a faster instruction sequence.
That's not the whole story no. Batching etc are needed, but the
faster instruction sequences are needed too.
> Nanooptimizations are fun (I do them myself, I admit) but that's not
> where performance as measured by the end user lies.
It depends. Often high level (and then caching) optimizations are better
bang for the buck, but completely disregarding the fast path work is a bad
thing too. As an example see Christoph's recent work on the slub fastpath
which makes a quite measurable difference on benchmarks.
-Andi
On Mon, 2007-12-03 at 07:12 +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> > Avi Kivity <[email protected]> writes:
> >
> >> [I really doubt there are that many of these; syscall
> >> entry/dispatch/exit, interrupt dispatch, context switch, what else?]
> >>
> >
> > Networking, block IO, page fault, ... But only the fast paths in these
> > cases. A lot of the kernel is slow path code and could probably
> > be written even in an interpreted language without much trouble.
> >
> >
>
> Even these (with the exception of the page fault path) are hardly "we
> care about a single instruction" material suggested above. Even with a
> million packets per second per core (does such a setup actually exist?)
> You have a few thousand cycles per packet. For block you'd need around
> 5,000 disks per core to reach such rate
Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
second. (theoretical peak at 1514bytes/frame)
Granted, installing such a device on a single CPU/single core machine is
absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
Barcelona) it can still generate ~1M packets/s per core.
Now assuming you're doing low-level (passive) filtering of some sort
(frame/packet routing, traffic interception and/or packet analysis)
using hardware assistance (TSO, complete TCP offloading, etc) is off the
table and each and every cycle within netif_receive_skb (and friends)
-counts-.
I don't suggest that the kernel should be (re)designed for such (niche)
applications but on the other hand, if it works...
- Gilboa
On Mon, 2007-12-03 at 14:35 +0200, Gilboa Davara wrote:
> Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
> second. (theoretical peak at 1514bytes/frame)
> Granted, installing such a device on a single CPU/single core machine is
> absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
> Barcelona) it can still generate ~1M packets/s per core.
Sigh... Sorry. Please ignore the broken math on my part.
Make that 1.8M frames/second per card and ~100K packets/second per core.
- Gilboa
--- Gilboa Davara <[email protected]> wrote:
>
> On Mon, 2007-12-03 at 07:12 +0200, Avi Kivity wrote:
> > Andi Kleen wrote:
> > > Avi Kivity <[email protected]> writes:
> > >
> > >> [I really doubt there are that many of these; syscall
> > >> entry/dispatch/exit, interrupt dispatch, context switch, what else?]
> > >>
> > >
> > > Networking, block IO, page fault, ... But only the fast paths in these
> > > cases. A lot of the kernel is slow path code and could probably
> > > be written even in an interpreted language without much trouble.
> > >
> > >
> >
> > Even these (with the exception of the page fault path) are hardly "we
> > care about a single instruction" material suggested above. Even with a
> > million packets per second per core (does such a setup actually exist?)
> > You have a few thousand cycles per packet. For block you'd need around
> > 5,000 disks per core to reach such rate
>
> Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
> second. (theoretical peak at 1514bytes/frame)
> Granted, installing such a device on a single CPU/single core machine is
> absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
> Barcelona) it can still generate ~1M packets/s per core.
>
> Now assuming you're doing low-level (passive) filtering of some sort
> (frame/packet routing, traffic interception and/or packet analysis)
> using hardware assistance (TSO, complete TCP offloading, etc) is off the
> table and each and every cycle within netif_receive_skb (and friends)
> -counts-.
>
> I don't suggest that the kernel should be (re)designed for such (niche)
> applications but on the other hand, if it works...
I was involved in a 10GBe project like you're describing not too
long ago. Only the driver, and only a tight, lean, special purpose
driver at that, was able to deal with line rate volumes. This was
in a real appliance, where faster CPUs were not an option. In fact,
not hardware changes were possible due to the issues with squeezing
in the 10GBe nics. This project would have been impossible without
the speed and deterministic behavior of the kernel C environment.
Casey Schaufler
[email protected]
On Sat, Dec 01, 2007 at 09:59:31PM +0200, Avi Kivity wrote:
> C also requires a (very minimal) runtime. And I don't see how having a
> runtime disqualifies a language from being usable in a kernel; the
> runtime is just one more library, either supplied by the compiler or by
> the kernel.
Well the majority of C syntax requires no runtime library. There are
some system call like things that you often want that need a library
(like malloc and such), but those aren't really part of C itself. Of
course without malloc and printf and file i/o calls the program would
probably be a bit boring. I have written some small C programs without
a runtime, where the few things I needed were implemented in assembly
and poked the hardware directly and called from the C program.
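As a rough illustration of that kind of runtime-free C (everything here --
the entry point, the register address, the build line -- is an assumption
about some hypothetical bare-metal target, not a recipe):

/* Built with something like: cc -ffreestanding -nostdlib -c blink.c
 * plus a target-specific linker script; no libc, no startup files. */

#define LED_REG	((volatile unsigned char *)0x40000000)	/* made-up MMIO address */

void _start(void)	/* assumed entry point */
{
	for (;;)
		*LED_REG ^= 1;	/* poke the hardware directly, no library calls */
}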
> Object orientation in C leaves much to be desired; see the huge number
> of void pointers and container_of()s in the kernel.
As a programming language, C leaves much to be desired.
--
Len Sorensen
On Mon, Dec 03, 2007 at 01:46:45PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >>Even these (with the exception of the page fault path) are hardly "we
> >>care about a single instruction" material suggested above. Even with a
> >>
> >
> >With 10Gbit/s ethernet working you start to care about every cycle.
> >
>
> If you have 10M packets/sec no amount of cycle-saving will help you.
> You need high level optimizations like TSO. I'm not saying we should
> sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
route those packets, those cycles may just be what you need to lookup a
forwarding table and perform a few MMIO on an accelerated chip which will
take care of the transfer. But you need those cycles. If you start to waste
them 30 by 30, the performance can drop by a critical factor.
> >Similar with highend routing or in some latency sensitive network
> >applications (e.g. in HPC).
>
> True. And here, the hardware can cut hundreds of cycles by avoiding the
> kernel completely for the fast path.
>
> >Another simple noticeable case is Unix
> >sockets and your X server communication.
>
> Your reflexes are *much* better than mine if you can measure half a
> nanosecond on X.
It just depends how many times a second it happens. For instance, consider
this trivial loop (fct is a two-function array which just return 1 or 2) :
	i = 0;
	for (j = 0; j < (1 << 28); j++) {
		k = (j >> 8) & 1;
		i += fct[k]();
	}
It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
changing the function once every 256 calls, you change it to every call :
	i = 0;
	for (j = 0; j < (1 << 28); j++) {
		k = (j >> 0) & 1;
		i += fct[k]();
	}
Then it takes 4.3 seconds, which is about 3 times slower. The number
of calls per function remains the same (128M calls each), it's just the
branch prediction which is wrong every time. The very few nanoseconds added
at each call are enough to slow down a program from 1.6 to 4.3 seconds while
it executes the exact same code (it may even save one shift). If you have
such stupid code, say, to compute the color or alpha of each pixel in an
image, you will certainly notice the difference.
And such poorly efficient code may happen very often when you blindly rely
on function pointers instead of explicit calls.
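For anyone who wants to reproduce the effect, here is a self-contained
version of the toy benchmark above (timings will differ per machine; build
with a low optimization level such as -O1 so the indirect calls are not
optimized away):

#include <stdio.h>
#include <time.h>

static int f1(void) { return 1; }
static int f2(void) { return 2; }

int main(void)
{
	int (*fct[2])(void) = { f1, f2 };
	int shift;

	/* shift 8: the target changes every 256 calls; shift 0: every call */
	for (shift = 8; shift >= 0; shift -= 8) {
		clock_t t0 = clock();
		volatile long i = 0;
		long j;

		for (j = 0; j < (1L << 28); j++) {
			int k = (int)(j >> shift) & 1;
			i += fct[k]();
		}
		printf("shift=%d: i=%ld, %.2f seconds\n", shift, (long)i,
		       (double)(clock() - t0) / CLOCKS_PER_SEC);
	}
	return 0;
}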
> Here, it's scheduling that matters, avoiding large transfers, and
> avoiding ping-pongs, not some cycles on the unix domain socket. You
> already paid 150 cycles or so by issuing the syscall and thousands for
> copying the data, 50 more won't be noticeable except in nanobenchmarks.
You are forgetting something very important : once you start stacking
functions to perform the dirty work for you, you end up with so much
abstraction that even new stupid code cannot be written at all without
relying on them, and it's where the problem takes its roots, because
when you need to write a fast function and you notice that you cannot
touch a variable without passing through a slow pinhole, your fast
function will remain slow whatever you do, and the worst of all is that
you will think that it is normally fast and that it cannot be written
faster.
> >And there are some special cases where block IO is also pretty critical.
> >A popular one is TPC-* benchmarking, but there are also others and it
> >looks likely in the future that this will become more critical
> >as block devices become faster (e.g. highend SSDs)
> >
>
> And again the key is batching, improving cpu affinity, and caching, not
> looking for a faster instruction sequence.
Every cycle burned is definitely lost. The time cannot go backwards. So
for each cycle that you lose to laziness, you have to become more and more
clever to find out how to write an alternative. Lazy people simply put
caches everywhere and after that they find it normal that "hello world" requires
2 Gigs of RAM to be displayed. The only true solution is to create better
algorithms, but you will find even less people capable of creating efficient
algorithms than you will find capable of coding correctly.
> >For example there are some CPUs who are relatively slow at indirect
> >function calls and there are actually cases where this can be measured.
>
> That is true. But any self-respecting systems language will let you
> choose between direct and indirect calls.
>
> If adding an indirect call allows you to avoid even 1% of I/O, you save
> much more than you lose, so again the high level optimizations win.
It depends which type of I/O. If the I/O is non-blocking, you end up doing
something else instead of actively burning cycles.
> Nanooptimizations are fun (I do them myself, I admit) but that's not
> where performance as measured by the end user lies.
I do not agree. It's not uncommon to find 2- or 3-fold performance factors
between equivalent components when one is carefully optimized and the other
one is not. Granted it takes an awful lot of time doing all those nano-opts
at the beginning, but the more you learn about how the hardware reacts to
your code, the more efficiently you write future code, with the least bloat.
End users notice bloat a lot (especially when CPU and RAM are excessively
wasted).
Best regards,
Willy
On Mon, 3 Dec 2007 22:13:53 +0100, Willy Tarreau <[email protected]> wrote:
...
>
> It just depends how many times a second it happens. For instance, consider
> this trivial loop (fct is a two-function array which just return 1 or 2) :
>
> i = 0;
> for (j = 0; j < (1 << 28); j++) {
> k = (j >> 8) & 1;
> i += fct[k]();
> }
>
> It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
> changing the function once every 256 calls, you change it to every call :
>
> i = 0;
> for (j = 0; j < (1 << 28); j++) {
> k = (j >> 0) & 1;
> i += fct[k]();
> }
>
> Then it takes 4.3 seconds, which is about 3 times slower. The number
> of calls per function remains the same (128M calls each), it's just the
> branch prediction which is wrong every time. The very few nanoseconds added
> at each call are enough to slow down a program from 1.6 to 4.3 seconds while
> it executes the exact same code (it may even save one shift). If you have
> such stupid code, say, to compute the color or alpha of each pixel in an
> image, you will certainly notice the difference.
>
> And such poorly efficient code may happen very often when you blindly rely
> on function pointers instead of explicit calls.
>
...
>
> You are forgetting something very important : once you start stacking
> functions to perform the dirty work for you, you end up with so much
> abstraction that even new stupid code cannot be written at all without
> relying on them, and it's where the problem takes its roots, because
> when you need to write a fast function and you notice that you cannot
> touch a variable without passing through a slow pinhole, your fast
> function will remain slow whatever you do, and the worst of all is that
> you will think that it is normally fast and that it cannot be written
> faster.
>
But don't forget that OOP is just another way to organize your code,
and to let the language/compiler do some things you shouldn't be doing,
like filling in a vtable pointer, that are error prone.
And of course everything depends on what language you choose and how
you use it.
You could write an equally efficient kernel in languages like C++,
using C++ abstractions as a high level organization, where
the fast paths could be coded the right way; we are not talking about
C# or Java, where even a sum is a call to an overloaded method.
It's the difference between doing school-book pushes and pops on lists,
and suddenly inventing the splice operator...
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
> You could write an equally efficient kernel in languages like C++,
> using C++ abstractions as a high level organization, where
It's very very hard to generate good code for C++ because of the numerous ways
objects get temporarily created, and the weak aliasing rules (as with C).
There are reasons that Fortran lives on (and no I'm not suggesting one
should rewrite the kernel in Fortran ;)) and the fact it's not really got
pointer aliasing or "address of" operators and all the resulting
optimisation problems is one of the big ones.
Alan
On Mon, Dec 03, 2007 at 02:35:31PM +0200, Gilboa Davara wrote:
> Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
> second. (theoretical peak at 1514bytes/frame)
> Granted, installing such a device on a single CPU/single core machine is
> absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
> Barcelona) it can still generate ~1M packets/s per core.
10GbE can't do 14M packets per second if the packets are 1514 bytes. At
10M packets per second you have less than 1000 bits per packet, which is
far from 1514 bytes.
10Gbps gives you at most 1.25GBps, which at 1514 bytes per packet works
out to 825627 packets per second. You could reach ~14M packets per
second with only the smallest packet size, which is rather unusual for
high throughput traffic, since you waste almost all the bytes on
overhead in that case. But you do want to be able to handle at least a
million or two packets per second to do 10GbE.
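A quick back-of-the-envelope check of those numbers; the extra 20 bytes in
the second line are the Ethernet preamble plus inter-frame gap, which is
what keeps the theoretical minimum-size-frame rate around 14.9M/s:

#include <stdio.h>

int main(void)
{
	const double link_bps = 10e9;	/* 10 Gbit/s */

	printf("1514-byte frames: %.0f pkt/s\n", link_bps / (1514 * 8));
	printf("64-byte frames (+20B preamble/IFG): %.0f pkt/s\n",
	       link_bps / ((64 + 20) * 8));
	return 0;
}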
--
Len Sorensen
On Sat, Dec 01, 2007 at 12:19:50AM +0100, J.A. Magallón wrote:
> I think BeOS was C++ and OSX is C+ObjectiveC (and runs on an iPhone).
> Original MacOS (from 6 to 9) was Pascal (and a Mac SE was very near
> to embedded hardware :) ).
>
> I do not advocate rewriting Linux in C++, but don't say a kernel written
> in C++ cannot be efficient.
Well I am pretty sure the micro kernel of OS X is in C, and certainly
the BSD layer is as well. So the only ObjC part would be the nextstep
framework and other parts of the Mac GUI and other Mac APIs they
provide, which all at some point probably end up calling down into the C
stuff below.
> C++ (and, from what I read in another answer, Objective-C) has no garbage
> collection. It does not do anything you did not tell it to do. It just allows
> you to change this
>
> struct buffer *x;
> x = kmalloc(...)
> x->sz = 128
> x->buff = kmalloc(...)
> ...
> kfree(x->buff)
> kfree(x)
>
> to
> struct buffer *x;
> x = new buffer(128); (that itself allocates x->buff,
> because _you_ programmed it,
> so you poor programmer don't forget)
> ...
> delete x; (that also was programmed to deallocate
> x->buff itself, so you have one less
> memory leak to worry about)
But kmalloc is implemented by the kernel. Who implements 'new'?
--
Len Sorensen
Willy Tarreau wrote:
>>>>
>>>>
>>> With 10Gbit/s ethernet working you start to care about every cycle.
>>>
>>>
>> If you have 10M packets/sec no amount of cycle-saving will help you.
>> You need high level optimizations like TSO. I'm not saying we should
>> sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
>>
>
> Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
> route those packets, those cycles may just be what you need to lookup a
> forwarding table and perform a few MMIO on an accelerated chip which will
> take care of the transfer. But you need those cycles. If you start to waste
> them 30 by 30, the performance can drop by a critical factor.
>
>
I really doubt Linux spends 400 cycles routing a packet. Look what an
skbuff looks like.
A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
16,000 cycles. Sure it involves userspace, but you're about two orders
of magnitude off. And the localhost interface is nicely cached in L1
without mmio at all, unlike real devices.
>>> Another simple noticeable case is Unix
>>> sockets and your X server communication.
>>>
>> Your reflexes are *much* better than mine if you can measure half a
>> nanosecond on X.
>>
>
> It just depends how many times a second it happens. For instance, consider
> this trivial loop (fct is a two-function array which just return 1 or 2) :
>
> i = 0;
> for (j = 0; j < (1 << 28); j++) {
> k = (j >> 8) & 1;
> i += fct[k]();
> }
>
> It takes 1.6 seconds to execute on my athlon-xp 1.5 GHz. If, instead of
> changing the function once every 256 calls, you change it to every call :
>
> i = 0;
> for (j = 0; j < (1 << 28); j++) {
> k = (j >> 0) & 1;
> i += fct[k]();
> }
>
> Then it takes 4.3 seconds, which is about 3 times slower. The number
> of calls per function remains the same (128M calls each), it's just the
> branch prediction which is wrong every time. The very few nanoseconds added
> at each call are enough to slow down a program from 1.6 to 4.3 seconds while
> it executes the exact same code (it may even save one shift). If you have
> such stupid code, say, to compute the color or alpha of each pixel in an
> image, you will certainly notice the difference.
>
>
This happens very often in HPC, and when it does, it is often worthwhile
to invest in manual optimizations or even assembly coding.
Unfortunately it is very rare in the kernel (memcmp, raid xor, what
else?). Loops with high iteration counts are very rare, so any
attention you give to the loop body is not amortized over a large number
of executions.
> And such poorly efficient code may happen very often when you blindly rely
> on function pointers instead of explicit calls.
>
Using an indirect call where a direct call is sufficient will also
reduce the compiler's optimization opportunities. However, I don't see
anyone recommending it in the context of systems programming.
It is not true that the number of indirect calls necessarily increases
if you use a language other than C.
(Actually, with templates you can reduce the number of indirect calls)
>> Here, it's scheduling that matters, avoiding large transfers, and
>> avoiding ping-pongs, not some cycles on the unix domain socket. You
>> already paid 150 cycles or so by issuing the syscall and thousands for
>> copying the data, 50 more won't be noticeable except in nanobenchmarks.
>>
>
> You are forgetting something very important : once you start stacking
> functions to perform the dirty work for you, you end up with so much
> abstraction that even new stupid code cannot be written at all without
> relying on them, and it's where the problem takes its roots, because
> when you need to write a fast function and you notice that you cannot
> touch a variable without passing through a slow pinhole, your fast
> function will remain slow whatever you do, and the worst of all is that
> you will think that it is normally fast and that it cannot be written
> faster.
>
>
I don't understand. Can you give an example?
There are two cases where abstraction hurts performance: the first is
where the mechanisms used to achieve the abstraction (functions instead
of direct access to variables, function pointers instead of duplicating
the caller) introduce performance overhead. I don't think C has any
advantage here -- actually a disadvantage as it lacks templates and is
forced to use function pointers for nontrivial cases. Usually the
abstraction penalty is nil with modern compilers.
The second case is where too much abstraction clouds the programmer's
mind. But this is independent of the programming language.
>>> And there are some special cases where block IO is also pretty critical.
>>> A popular one is TPC-* benchmarking, but there are also others and it
>>> looks likely in the future that this will become more critical
>>> as block devices become faster (e.g. highend SSDs)
>>>
>>>
>> And again the key is batching, improving cpu affinity, and caching, not
>> looking for a faster instruction sequence.
>>
>
> Every cycle burned is definitely lost. The time cannot go backwards. So
> for each cycle that you lose to laziness, you have to become more and more
> clever to find out how to write an alternative. Lazy people simply put
> caches everywhere and after that they find normal that "hello world" requires
> 2 Gigs of RAM to be displayed.
A 100 byte program will print "hello world" on a UART and stop. A
modern program will load a vector description of a font, scale it to the
desired size, render it using anti-aliasing and sub-pixel positioning,
lay it out according to the language rules of wherever you live, and
place it on a multi-megabyte frame buffer. Yes, it needs hundreds of
megabytes and lots of nasty algorithms to do that.
> The only true solution is to create better
> algorithms, but you will find even less people capable of creating efficient
> algorithms than you will find capable of coding correctly.
>
>
That is true, that is why we see a lot more microoptimizations than
algorithmic progress.
But if you want a fast streaming filesystem you choose XFS over ext3,
even though the latter is much smaller and easier to optimize. If you
write a network server you choose epoll() instead of trying to optimize
select() somehow. True algorithmic improvements are rare but they are
the ones that are actually measurable.
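(For reference, the heart of an epoll()-based server really is just a handful
of calls -- a bare sketch with error handling omitted and handle_client() left
as a placeholder:)

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void event_loop(int listen_fd)
{
        int ep = epoll_create(1024);    /* the size argument is only a hint */

        struct epoll_event ev = {};
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event events[64];
        for (;;) {
                /* cost scales with the number of ready fds, not all fds */
                int n = epoll_wait(ep, events, 64, -1);
                for (int i = 0; i < n; i++) {
                        if (events[i].data.fd == listen_fd) {
                                int c = accept(listen_fd, 0, 0);
                                ev.events = EPOLLIN;
                                ev.data.fd = c;
                                epoll_ctl(ep, EPOLL_CTL_ADD, c, &ev);
                        } else {
                                /* handle_client(events[i].data.fd); */
                        }
                }
        }
}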
>>> For example there are some CPUs who are relatively slow at indirect
>>> function calls and there are actually cases where this can be measured.
>>>
>> That is true. But any self-respecting systems language will let you
>> choose between direct and indirect calls.
>>
>> If adding an indirect call allows you to avoid even 1% of I/O, you save
>> much more than you lose, so again the high level optimizations win.
>>
>
> It depends which type of I/O. If the I/O is non-blocking, you end up doing
> something else instead of actively burning cycles.
>
>
Unless you are I/O bound, which is usually the case when you have 2GHz
cpus driving 200Hz disks.
>> Nanooptimizations are fun (I do them myself, I admit) but that's not
>> where performance as measured by the end user lies.
>>
>
> I do not agree. It's not uncommon to find 2- or 3-fold performance factors
> between equivalent components when one is carefully optimized and the other
> one is not. Granted it takes an awful lot of time doing all those nano-opts
> at the beginning, but the more you learn about how the hardware reacts to
> your code, the more efficiently you write future code, with the fewest bloat.
> End users notice bloat a lot (especially when CPU and RAM are excessively
> wasted).
>
Can you give an example of a 2- or 3- fold factor on an end-user
workload achieved by microopts?
I agree about bloat.
Lennart Sorensen wrote:
> But kmalloc is implemented by the kernel. Who implements 'new'?
>
The kernel.
On Tue, 4 Dec 2007 12:54:13 -0500, [email protected] (Lennart Sorensen) wrote:
> On Sat, Dec 01, 2007 at 12:19:50AM +0100, J.A. Magallón wrote:
> > I think BeOS was C++ and OSX is C+ObjectiveC (and runs on an iPhone).
> > Original MacOS (from 6 to 9) was Pascal (and a Mac SE was very near
> > to embedded hardware :) ).
> >
> > I do not advocate to rewrite Linux in C++, but don't say a kernel written
> > in C++ can not be efficient.
>
> Well I am pretty sure the micro kernel of OS X is in C, and certainly
> the BSD layer is as well. So the only ObjC part would be the nextstep
> framework and other parts of the Mac GUI and other Mac APIs they
> provide, which all at some point probably end up calling down into the C
> stuff below.
>
Yup, thanks.
> > C++ (and, from what I read in another answer, Objective-C as well) has no garbage
> > collection. It does not do anything you did not tell it to do. It just allows
> > you to change this
> >
> > struct buffer *x;
> > x = kmalloc(...)
> > x->sz = 128
> > x->buff = kmalloc(...)
> > ...
> > kfree(x->buff)
> > kfree(x)
> >
> > to
> > struct buffer *x;
> > x = new buffer(128); (that itself allocates x->buff,
> > because _you_ programmed it,
> > so you poor programmer don't forget)
> > ...
> > delete x; (that also was programmed to deallocate
> > x->buff itself, so you have one less
> > memory leak to worry about)
>
> But kmalloc is implemented by the kernel. Who implements 'new'?
>
Help yourself... just as kmalloc() is a replacement for userspace glibc's
malloc, you can write your own replacements for the functions/operators in
libstdc++ (operators are just cosmetic, like many other features in C++).
In fact, for someone who dared to write a kernel C++ framework, the
very first function he would have to write could be something like:

void *operator new(size_t sz)
{
        return kmalloc(sz, GFP_KERNEL);
}
And he could write alternatives like

operator new(size_t sz, int flags)    -> x = new (GFP_ATOMIC) X;
operator new(size_t sz, MemPool &pl)  -> x = new (pool) X;
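Spelled out (just a sketch: kmalloc()/kfree() and the GFP flags are the
kernel's, but the exact flag type and the MemPool class with its alloc()
method are hypothetical here), those variants could look like:

void *operator new(size_t sz, int gfp_flags)
{
        return kmalloc(sz, gfp_flags);  /* x = new (GFP_ATOMIC) X; */
}

void *operator new(size_t sz, MemPool &pool)
{
        return pool.alloc(sz);          /* x = new (pool) X;       */
}

void operator delete(void *p)
{
        kfree(p);                       /* plain delete x;         */
}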
If you are curious, this page http://www.osdev.org/wiki/C_PlusPlus
has some clues about what you should implement to get rid of
libstdc++.
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
On Mon, 3 Dec 2007 21:57:27 +0000, Alan Cox <[email protected]> wrote:
> > You could write an equally efficient kernel in languages like C++,
> > using C++ abstractions as a high level organization, where
>
> It's very very hard to generate good code because of the numerous ways
> objects get temporarily created, and the weak aliasing rules (as with C).
>
That is what I like about C++: with good placement of high-level features
like const and & (references), one can gain fine control over what
gets copied or not.
Try writing a Vector class that does ops with SSE without storing
temporaries on the stack. It's a good example of how one can get low-level
control, and gcc is pretty good at simplifying things like u = v + 2*w
without putting anything on the stack, keeping it all in xmm registers.
The advantage is you only have to be careful once, when you write
the class.
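A minimal sketch of such a class (my own, using the SSE intrinsics from
xmmintrin.h; with everything inline and passed by value, gcc can keep
u = v + 2*w entirely in xmm registers at -O2):

#include <xmmintrin.h>

/* 4-float vector wrapper: all operations are inline and by value, so the
 * compiler is free to keep temporaries in xmm registers, not on the stack. */
struct Vec4 {
        __m128 v;
        Vec4() : v(_mm_setzero_ps()) {}
        explicit Vec4(__m128 x) : v(x) {}
        Vec4(float a, float b, float c, float d) : v(_mm_setr_ps(a, b, c, d)) {}
};

inline Vec4 operator+(Vec4 a, Vec4 b) { return Vec4(_mm_add_ps(a.v, b.v)); }
inline Vec4 operator*(float s, Vec4 a) { return Vec4(_mm_mul_ps(_mm_set1_ps(s), a.v)); }

Vec4 axpy(const Vec4 &v, const Vec4 &w)
{
        return v + 2.0f * w;    /* one mulps and one addps, no stack traffic */
}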
> There are reasons that Fortran lives on (and no I'm not suggesting one
> should rewrite the kernel in Fortran ;)) and the fact its not really got
> pointer aliasing or "address of" operators and all the resulting
> optimisation problems is one of the big ones.
>
--
J.A. Magallon <jamagallon()ono!com> \ Software is like sex:
\ It's better when it's free
Mandriva Linux release 2008.1 (Cooker) for i586
Linux 2.6.23-jam03 (gcc 4.2.2 (4.2.2-1mdv2008.1)) SMP Sat Nov
El Tue, 4 Dec 2007 22:47:45 +0100, "J.A. Magallón" <[email protected]> escribió:
> That is what I like of C++, with good placement of high level features
> like const's and & (references) one can gain fine control over what
> gets copied or not.
But... if there's some way Linux can get "language improvements", it's with
new C standards/gcc extensions/etc. It'd be nice if people tried to add
(useful) C extensions to gcc, instead of proposing some random language :)
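(The kernel already leans on a few such extensions; simplified, illustrative
versions of the usual macros built on __builtin_expect and typeof look like
this:)

/* Branch prediction hints, as used throughout the kernel sources: */
#define likely(x)       __builtin_expect(!!(x), 1)
#define unlikely(x)     __builtin_expect(!!(x), 0)

/* typeof() plus gcc statement expressions give a type-safe, single-evaluation
 * min() -- a simplified version of the kernel's own macro: */
#define min(a, b) ({                    \
        typeof(a) _a = (a);             \
        typeof(b) _b = (b);             \
        _a < _b ? _a : _b; })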
Hi Avi,
On Tue, Dec 04, 2007 at 11:07:05PM +0200, Avi Kivity wrote:
> Willy Tarreau wrote:
> >>>>
> >>>>
> >>>With 10Gbit/s ethernet working you start to care about every cycle.
> >>>
> >>>
> >>If you have 10M packets/sec no amount of cycle-saving will help you.
> >>You need high level optimizations like TSO. I'm not saying we should
> >>sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
> >>
> >
> >Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
> >route those packets, those cycles may just be what you need to lookup a
> >forwarding table and perform a few MMIO on an accelerated chip which will
> >take care of the transfer. But you need those cycles. If you start to waste
> >them 30 by 30, the performance can drop by a critical factor.
> >
> >
>
> I really doubt Linux spends 400 cycles routing a packet. Look what an
> skbuff looks like.
That's not what I wrote. I just wrote about doing forwarding table lookup
and MMIO so that dedicated hardware NICs can process the recv/send to the
correct ends. If you just need to scan a list of DMAed packets, look at
their destination IP address, look up that IP in a table to find the output
NIC and destination MAC address, link them into an output list and wake
the output NIC up, there's nothing which requires more than 400 cycles
here. I never said that it was a requirement to pass through the existing
network stack.
> A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
> 16,000 cycles. Sure it involves userspace, but you're about two orders
> of magnitude off.
I don't see where userspace comes into it (or I don't understand your test).
For the traffic generation I often do from user space, I can send 630k raw
Ethernet packets per second from userspace on a 1.8 GHz Opteron and PCI-e
NICs. That's 2857 cycles per packet, including the (small amount of)
userspace work. That's quite cheap.
> And the localhost interface is nicely cached in L1 without mmio at all,
> unlike real devices.
(...)
> This happens very often in HPC, and when it does, it is often worthwhile
> to invest in manual optimizations or even assembly coding.
> Unfortunately it is very rare in the kernel (memcmp, raid xor, what
> else?). Loops with high iteration counts are very rare, so any
> attention you give to the loop body is not amortized over a large number
> of executions.
Well, in my example above, everything in the path of the send() syscall down
to the bare-metal NIC is under high pressure in a fast loop. 30 cycles
already represents 1% of the performance! In fact, to modulate speed, I
use a busy loop with a volatile int and small values.
> >And such poorly efficient code may happen very often when you blindly rely
> >on function pointers instead of explicit calls.
> >
>
> Using an indirect call where a direct call is sufficient will also
> reduce the compiler's optimization opportunities.
That's true.
> However, I don't see
> anyone recommending it in the context of systems programming.
>
> It is not true that the number of indirect calls necessarily increases
> if you use a language other than C.
>
> (Actually, with templates you can reduce the number of indirect calls)
>
> >>Here, it's scheduling that matters, avoiding large transfers, and
> >>avoiding ping-pongs, not some cycles on the unix domain socket. You
> >>already paid 150 cycles or so by issuing the syscall and thousands for
> >>copying the data, 50 more won't be noticeable except in nanobenchmarks.
> >>
> >
> >You are forgetting something very important : once you start stacking
> >functions to perform the dirty work for you, you end up with so much
> >abstraction that even new stupid code cannot be written at all without
> >relying on them, and it's where the problem takes its roots, because
> >when you need to write a fast function and you notice that you cannot
> >touch a variable without passing through a slow pinhole, your fast
> >function will remain slow whatever you do, and the worst of all is that
> >you will think that it is normally fast and that it cannot be written
> >faster.
> >
> >
>
> I don't understand. Can you give an example?
Yes, the most common examples found today involve applications reading
data from databases. For instance, let's say that one function in your
program must count the number of unique people whose name starts
with an "A". It is very common to see "low-level" primitives that abstract
the database away for portability purposes. One such primitive will
generally consist of retrieving a list of people with their names,
ages and sexes in one well-formatted 3-column array. Many lazy people will
not see any problem in calling this one from the function described
above. Basically, what they would do is:
count_people_with_name_starting_with_a()
-> array[name,age,sex] = get_list_of_people()
-> while read_one_people_entry() {
alloc(one_line_of_3_columns)
read then parse the 3 fields
format_them_appropriately
}
-> create a new array "name2" by duplicating the "name" column
-> name3 = sort_unique(name2)
-> name4 = name3.grep("^A")
-> return name4.count
Don't laugh, I've recently read such a horrible thing. It was done
that way just because it was easier. Without the abstraction layer,
the coder would have been forced to access the database anyway and would
have seen the added value of just counting from the inner while
loop, saving lots of copies, greps, sorts, etc.:
count_people_with_name_starting_with_a() {
count = 0;
while read_one_people_entry() {
read the 3 fields into a statically-allocated buffer
if (name[0] == 'A') count++;
}
return count;
}
I'm not saying that the above was not possible, just that it's
1000% easier to do the former without even having to think about
the horrible things the final code ends up relying on. And yes, I can
confirm that when you see this, you want to shoot the author!
> There are two cases where abstraction hurts performance: the first is
> where the mechanisms used to achieve the abstraction (functions instead
> of direct access to variables, function pointers instead of duplicating
> the caller) introduce performance overhead. I don't think C has any
> advantage here -- actually a disadvantage as it lacks templates and is
> forced to use function pointers for nontrivial cases. Usually the
> abstraction penalty is nil with modern compilers.
>
> The second case is where too much abstraction clouds the programmer's
> mind. But this is independent of the programming language.
Agreed. But most often, the abstraction prevents the user from accessing
some information directly and that becomes nasty. I remember when I was
a teen, I wrote a program designed to inventory what you had in your PC,
and run a few performance tests. It ran in one of those semi-graphical DOS modes
where you use graphics characters to draw boxes. I initially wrote all
the windowing code myself and it ran perfectly. I once decided to rewrite
it using TurboVision, the windowing framework from Borland (it was written
in TurboPascal). I made intensive use of the equivalent of a putchar()
function to write text in a window. You cannot imagine my pain when I
ran it on my old 8088: it wrote at the speed of a 1200 bps terminal. I
then tried to find how to write faster, even by accessing the window
buffer directly. I couldn't. I had to reverse-engineer the internal
structures by debugging memory contents in order to find the pointers
to the window buffer to write to them directly. After this disastrous
experience with abstraction, I thought "never that crap again".
> >Every cycle burned is definitely lost. The time cannot go backwards. So
> >for each cycle that you lose to laziness, you have to become more and more
> >clever to find out how to write an alternative. Lazy people simply put
> >caches everywhere and after that they find normal that "hello world"
> >requires
> >2 Gigs of RAM to be displayed.
>
> A 100 byte program will print "hello world" on a UART and stop. A
> modern program will load a vector description of a font, scale it to the
> desired size, render it using anti aliasing and sub-pixel positioning,
> lay it out according to the language rules of whereever you live, and
> place it on a multi-megabyte frame buffer. Yes it needs hundreds of
> megabytes and lots of nasty algorithms to do that.
What I'm complaining about is that when you don't want those fancy things,
you still have them to justify the hundreds of megs. And even if you manage
to print to stdout, you still have a huge runtime just in case you'd like
to use the fancy features.
> >The only true solution is to create better
> >algorithms, but you will find even less people capable of creating
> >efficient
> >algorithms than you will find capable of coding correctly.
>
> That is true, that is why we see a lot more microoptimizations than
> algorithmic progress.
Also, algorithmic research is not very rewarding. You can work for
months or years thinking you have found the right algo for the job, then
finally discover a limitation you did not expect and throw all that
work into the bin in a few minutes.
> But if you want a fast streaming filesystem you choose XFS over ext3,
> even though the latter is much smaller and easier to optimize. If you
> write a network server you choose epoll() instead of trying to optimize
> select() somehow.
It's interesting that you cite epoll() vs select(). I measured the
break-even point at around 1000 FDs. Below that, select() is faster; above
it, epoll() is faster. With a small number of entries (fewer than 100), a
select()-based proxy can be 20-30% faster than the same one running on
epoll(), because select(), while dumber, is cheaper to set up.
> True algorithmic improvements are rare but they are the ones that are
> actually measurable.
I generally agree with this.
> >>>For example there are some CPUs who are relatively slow at indirect
> >>>function calls and there are actually cases where this can be measured.
> >>>
> >>That is true. But any self-respecting systems language will let you
> >>choose between direct and indirect calls.
> >>
> >>If adding an indirect call allows you to avoid even 1% of I/O, you save
> >>much more than you lose, so again the high level optimizations win.
> >>
> >
> >It depends which type of I/O. If the I/O is non-blocking, you end up doing
> >something else instead of actively burning cycles.
> >
> >
>
> Unless you are I/O bound, which is usually the case when you have 2GHz
> cpus driving 200Hz disks.
That's true when you seek a lot. When you manage to mostly perform sequential
reads (such as what you do when processing large files such as logs), you can
easily achieve 80 MB/s, which is 20000 pages/s, or 100 times faster.
> >>Nanooptimizations are fun (I do them myself, I admit) but that's not
> >>where performance as measured by the end user lies.
> >>
> >
> >I do not agree. It's not uncommon to find 2- or 3-fold performance factors
> >between equivalent components when one is carefully optimized and the other
> >one is not. Granted it takes an awful lot of time doing all those nano-opts
> >at the beginning, but the more you learn about how the hardware reacts to
> >your code, the more efficiently you write future code, with the fewest
> >bloat.
> >End users notice bloat a lot (especially when CPU and RAM are excessively
> >wasted).
> >
>
> Can you give an example of a 2- or 3- fold factor on an end-user
> workload achieved by microopts?
Oh, there are many primitives which are generally optimized in assembly for
this reason. What randomly comes to mind:
- graphics libraries. Saving 1 cycle per pixel in a rectangle drawing
primitive can have an important impact in animated graphics for
instance.
- video/audio and generally multimedia code. I remember a specially
written version of mpg123 about 10 years ago, which was optimized
for i486 and which was the only one able to run on a 486 without
skipping.
- crypto code. It's common to find CPU-specific DES or AES functions.
Take a look at John The Ripper. I don't know if it still exists,
but there was an Alpha-optimized DES function which was something
like 5 times faster than the generic C one. It changes a lot of
things when you have 1 day to check your users' passwords.
I also wrote a netfilter log analyzer which parses 300000 lines per
second on my 1.7 GHz notebook. That's 5600 cycles to read a full
line, look up the field names, extract the values, parse them (atoi,
aton), save them in a structure, apply a filter, insert the result
in a tree containing up to 12 million of them, and dump a report
of the counts by any criteria. That saved me a lot of time working
on log analysis. But to achieve such a speed, I had to optimize at
every level, including rewriting a faster atoi() equivalent, a
faster aton() equivalent (with no multiplies), and playing with
likely/unlikely a lot. The code slowly improved from about 75k
lines/s to 300k lines/s with no algorithmic change, just by
way of careful code placement and ordering.
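(For the flavour of it, a multiply-free dotted-quad parser along those lines
might look like the sketch below -- purely illustrative, not the analyzer's
actual code, and with no range checking on the octets:)

#include <stdint.h>

/* Parse "a.b.c.d" into a host-order 32-bit value using only shifts and adds.
 * Returns 0 if a dot is missing; a real parser would validate more. */
static uint32_t fast_aton(const char *s)
{
        uint32_t addr = 0;

        for (int octet = 0; octet < 4; octet++) {
                uint32_t val = 0;
                while (*s >= '0' && *s <= '9') {
                        /* val = val * 10 + digit, written with shifts */
                        val = (val << 3) + (val << 1) + (uint32_t)(*s - '0');
                        s++;
                }
                addr = (addr << 8) | val;
                if (octet < 3) {
                        if (*s != '.')
                                return 0;
                        s++;
                }
        }
        return addr;
}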
In fact, you could say that micro-optimizations are not important
if you are doing them in a crappy environment where the fast path
is already wasted by a big dirty function. But when you have the
ability to master the whole environment, every single cycle counts
because there's almost no waste.
I find it essential not to be the first one bringing crap somewhere
and serving as an excuse for others not to care about their code.
If everyone cares, you can still produce very good software, and
that's what I care about.
Cheers,
Willy
On Tue, 2007-12-04 at 12:50 -0500, Lennart Sorensen wrote:
> On Mon, Dec 03, 2007 at 02:35:31PM +0200, Gilboa Davara wrote:
> > Intel's newest dual 10GbE NIC can easily (?) throw ~14M packets per
> > second. (theoretical peak at 1514bytes/frame)
> > Granted, installing such a device on a single CPU/single core machine is
> > absurd - but even on an 8 core machine (2 x Xeon 53xx/54xx / AMD
> > Barcelona) it can still generate ~1M packets/s per core.
>
> 10GbE can't do 14M packets per second if the packets are 1514 bytes. At
> 10M packets per second you have less than 1000 bits per packet, which is
> far from 1514bytes.
>
> 10Gbps gives you at most 1.25GBps, which at 1514 bytes per packet works
> out to 825627 packets per second. You could reach ~14M packets per
> second with only the smallest packet size, which is rather unusual for
> high throughput traffic, since you waste almost all the bytes on
> overhead in that case. But you do want to be able to handle at least a
> million or two packets per second to do 10GbE.
... I corrected my math in the second email. [1]
Nevertheless, a VoIP network (e.g. G.729 and friends) can generate the
maximum number of frames allowed on 10GbE Ethernet, which is, AFAIR, just
below 15M per port (~29M on a dual-port card).
While I doubt that any non-NPU-based NIC can handle such a load, on
mixed networks we're already seeing well above 1M frames per port.
- Gilboa
[1] http://lkml.org/lkml/2007/12/3/69
Diego Calleja wrote:
> El Tue, 4 Dec 2007 22:47:45 +0100, "J.A. Magallón" <[email protected]> escribió:
>
>> That is what I like of C++, with good placement of high level features
>> like const's and & (references) one can gain fine control over what
>> gets copied or not.
>
> But...if there's some way Linux can get "language improvements", is with
> new C standards/gccextensions/etc. It'd be nice if people tried to add
> (useful) C extensions to gcc, instead of proposing some random language :)
But nobody knows of such extensions.
I think that the core kernel will remain in C, because
there are no problems there and no improvement possible
(with another language).
But the driver side has more problems. There is a lot
of copy-paste, quality is often not high, not all developers
know the Linux kernel well, and drivers are not always well
maintained as internal APIs are added or improved. So if we
found a good template or a good language to help *some*
drivers without causing a lot of problems for the rest of the
community, it would be nice.
I don't think we have it written in stone that kernel
drivers should be written only in C, but currently there is
no good alternative.
But I think it is a huge task to find a language, prototype
an API and convert some test drivers.
And there is no guarantee of a good result.
ciao
cate
Willy Tarreau wrote:
> Hi Avi,
>
> On Tue, Dec 04, 2007 at 11:07:05PM +0200, Avi Kivity wrote:
>
>> Willy Tarreau wrote:
>>
>>>>>>
>>>>>>
>>>>>>
>>>>> With 10Gbit/s ethernet working you start to care about every cycle.
>>>>>
>>>>>
>>>>>
>>>> If you have 10M packets/sec no amount of cycle-saving will help you.
>>>> You need high level optimizations like TSO. I'm not saying we should
>>>> sacrifice cycles like there's no tomorrow, but the big wins are elsewhere.
>>>>
>>>>
>>> Huh? At 4 GHz, you have 400 cycles to process each packet. If you need to
>>> route those packets, those cycles may just be what you need to lookup a
>>> forwarding table and perform a few MMIO on an accelerated chip which will
>>> take care of the transfer. But you need those cycles. If you start to waste
>>> them 30 by 30, the performance can drop by a critical factor.
>>>
>>>
>>>
>> I really doubt Linux spends 400 cycles routing a packet. Look what an
>> skbuff looks like.
>>
>
> That's not what I wrote. I just wrote about doing forwarding table lookup
> and MMIO so that dedicated hardware NICs can process the recv/send to the
> correct ends. If you just need to scan a list of DMAed packets, look at
> their destination IP address, lookup that IP in a table to find the output
> NIC and destination MAC address, link them into an output list and waking
> the output NIC up, there's nothing which requires more than 400 cycles
> here. I never said that it was a requirement to pass through the existing
> network stack.
>
If you're writing a single-purpose program then there is justification
to micro-optimize it to the death. Write it in VHDL, even. But that
description doesn't fit the kernel.
>
>> A flood ping to localhost on a 2GHz system takes 8 microseconds, that's
>> 16,000 cycles. Sure it involves userspace, but you're about two orders
>> of magnitude off.
>>
>
> I don't see where you see a userspace (or I don't understand your test).
>
ping -f -q localhost; the ping client is in userspace.
> On traffic generation I often do from user space, I can send 630 k raw
> ethernet packets per second from userspace on a 1.8 GHz opteron and PCI-e
> NICs. That's 2857 cycles per packet, including the (small amount of)
> userspace work. That's quite cheap.
>
>
Yes, it is.
>> This happens very often in HPC, and when it does, it is often worthwhile
>> to invest in manual optimizations or even assembly coding.
>> Unfortunately it is very rare in the kernel (memcmp, raid xor, what
>> else?). Loops with high iteration counts are very rare, so any
>> attention you give to the loop body is not amortized over a large number
>> of executions.
>>
>
> Well, in my example above, everythin in the path of the send() syscall down
> to the bare metal NIC is under high pressure in a fast loop. 30 cycles
> already represent 1% of the performance! In fact, to modulate speed, I
> use a busy loop with a volatile int and small values.
>
>
Having an interface to send multiple packets in one syscall would cut
way more than 30 cycles.
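(Linux did later grow exactly this kind of interface in sendmmsg(2); a
batched send looks roughly like the sketch below, with error handling omitted
and the iovecs assumed to already describe the packet buffers:)

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>

enum { BATCH = 64 };

/* One kernel entry for up to BATCH packets instead of one syscall each. */
int send_batch(int sock, struct iovec *iov, int count)
{
        struct mmsghdr msgs[BATCH] = {};

        if (count > BATCH)
                count = BATCH;
        for (int i = 0; i < count; i++) {
                msgs[i].msg_hdr.msg_iov = &iov[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }
        return sendmmsg(sock, msgs, count, 0);
}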
>>>>
>>>>
>>> You are forgetting something very important : once you start stacking
>>> functions to perform the dirty work for you, you end up with so much
>>> abstraction that even new stupid code cannot be written at all without
>>> relying on them, and it's where the problem takes its roots, because
>>> when you need to write a fast function and you notice that you cannot
>>> touch a variable without passing through a slow pinhole, your fast
>>> function will remain slow whatever you do, and the worst of all is that
>>> you will think that it is normally fast and that it cannot be written
>>> faster.
>>>
>>>
>>>
>> I don't understand. Can you give an example?
>>
>
> Yes, the most common examples found today involve applications reading
> data from databases. For instance, let's say that one function in your
> program must count the number of unique people with the name starting
> with an "A". It is very common to see "low-level" primitives to abstract
> the database for portability purposes. One of such primitives will
> generally be consist in retrieving a list of people with their names,
> age and sex in one well-formated 3-column array. Many lazy people will
> not see any problem in calling this one from the function described
> above. Basically, what they would do is :
>
> count_people_with_name_starting_with_a()
> -> array[name,age,sex] = get_list_of_people()
> -> while read_one_people_entry() {
> alloc(one_line_of_3_columns)
> read then parse the 3 fields
> format_them_appropriately
> }
> -> create a new array "name2" by duplicating the "name" column
> -> name3 = sort_unique(name2)
> -> name4 = name3.grep("^A")
> -> return name4.count
>
> Don't laugh, I've recently read such a horrible thing. It was done
> that way just because it was easier. Without the abstraction layer,
> the coder would have been forced to access the base anyway and would
> have seen an added value into just counting from the inner while
> loop, saving lots of copies, greps, sort, etc... :
>
> count_people_with_name_starting_with_a() {
> count = 0;
> while read_one_people_entry() {
> read the 3 fields into a statically-allocated buffer
> if (name[0] == 'A') count++;
> }
> return count;
> }
>
> I'm not saying that the above was not possible, just that it's
> 1000% easier to do the former without even having to think that
> the final code uses such horrible things.
Your optimized version is wrong. It counts duplicated names, while you
stated you needed unique names. Otherwise the sort_unique step is
completely redundant.
Databases are good examples of where the abstraction helps. If you had
hundreds of millions of records in your example, you'd connect to a
database, present it with an ASCII string describing what you want, upon
which it would parse it, compile it into an internal language against
the schema, optimize that and then execute it. Despite all that
abstraction it would win against your example because it would implement
the inner loop as
open index (by name)
seek to 'A'
while (current starts with 'A')
++count (taking care of the uniqueness requirement if
needed)
close index
Thus it would never see people whose name begins with 'W'. If the
database had a materialized view feature, and this particular query was
deemed important enough, it would optimize it to
open materialized view
read count
close materialized view
The database does all this while allowing concurrent reads and writes
and keeping your data in case someone trips on the power cord. You
can't do that without a zillion layers of abstraction.
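(In ordinary C++ terms, that index scan is just a bounded walk over a sorted
structure -- an illustrative sketch with std::map standing in for the index
and names assumed to be unique keys:)

#include <map>
#include <string>

/* The map plays the role of a B-tree index on name: seek to the first "A"
 * and stop as soon as the names no longer start with 'A'. */
long count_names_starting_with_a(const std::map<std::string, long> &by_name)
{
        long count = 0;
        std::map<std::string, long>::const_iterator it = by_name.lower_bound("A");

        for (; it != by_name.end() && it->first[0] == 'A'; ++it)
                ++count;        /* uniqueness comes from the map's keys */
        return count;
}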
>> There are two cases where abstraction hurts performance: the first is
>> where the mechanisms used to achieve the abstraction (functions instead
>> of direct access to variables, function pointers instead of duplicating
>> the caller) introduce performance overhead. I don't think C has any
>> advantage here -- actually a disadvantage as it lacks templates and is
>> forced to use function pointers for nontrivial cases. Usually the
>> abstraction penalty is nil with modern compilers.
>>
>> The second case is where too much abstraction clouds the programmer's
>> mind. But this is independent of the programming language.
>>
>
> Agreed. But most often, the abstraction prevents the user from accessing
> some information directly and that becomes nasty. I remember when I was
> a teen, I wrote a program designed to inventory what you had in your PC,
> and run a few performance tests. It ran in those semi-graphical DOS mode
> where you use graphics characters to draw boxes. I initially wrote all
> the windowing code myself and it ran perfectly. I once decided to rewrite
> it using TurboVision, the windowing framework from Borland (it was written
> in TurboPascal). I made intensive use of the equivalent of a putchar()
> function to write text in a window. You cannot imagine my pain when I
> ran it on my old 8088, it wrote at the speed of a 1200 bps terminal. I
> then tried to find how to write faster, even by accessing the window
> buffer directly. I couldn't. I had to reverse-engineer the internal
> structures by debugging memory contents in order to find the pointers
> to the window buffer to write to them directly. After this disastrous
> experience with abstraction, I thought "never that crap again".
>
>
If the abstraction is badly written, and furthermore you cannot change it,
then of course it hurts. But if the abstraction is well written, or if
it can be fixed, then all is well. The problem here is not that
abstractions exist, but that you persist in using a broken API instead
of fixing it.
>>> Every cycle burned is definitely lost. The time cannot go backwards. So
>>> for each cycle that you lose to laziness, you have to become more and more
>>> clever to find out how to write an alternative. Lazy people simply put
>>> caches everywhere and after that they find normal that "hello world"
>>> requires
>>> 2 Gigs of RAM to be displayed.
>>>
>> A 100 byte program will print "hello world" on a UART and stop. A
>> modern program will load a vector description of a font, scale it to the
>> desired size, render it using anti aliasing and sub-pixel positioning,
>> lay it out according to the language rules of whereever you live, and
>> place it on a multi-megabyte frame buffer. Yes it needs hundreds of
>> megabytes and lots of nasty algorithms to do that.
>>
>
> What I'm complaining about is that when you don't want those fancy things,
> you still have them to justify the hundreds of megs. And even if you manage
> to print to stdout, you still have a huge runtime just in case you'd like
> to use the fancy features.
>
>
That's life. The fact is that users demand features, and programmers
cater to them. If you can find a way to provide all those features
without the bloat, more power to you. The abstractions here are not the
cause of the bloat, they are the tool used to provide the features while
keeping a reasonable level of maintainability and reliability.
>>> The only true solution is to create better
>>> algorithms, but you will find even less people capable of creating
>>> efficient
>>> algorithms than you will find capable of coding correctly.
>>>
>> That is true, that is why we see a lot more microoptimizations than
>> algorithmic progress.
>>
>
> Also, algorithmic research is very little rewarding. You can work for
> months or years thinking you found the nice algo for the job, then
> finally discover a limitation you did not expect and throw that amount
> of work to the bin in a few minutes.
>
You don't need to prove that P == NP to improve things. Most
improvements are in adding new APIs and data structures to keep the
inner loops working on more data. And of course scalability work to
keep data local to a processing core.
It won't win you a Nobel prize, but you'll be able to measure a few
percent improvement on a real-life workload instead of 10 cycles on a
microbenchmark.
>
>> But if you want a fast streaming filesystem you choose XFS over ext3,
>> even though the latter is much smaller and easier to optimize. If you
>> write a network server you choose epoll() instead of trying to optimize
>> select() somehow.
>>
>
> That's interesting that you cite epoll() vs select(). I measured the
> break-even point around 1000 FDs. Below, select() is faster. Above,
> epoll() is faster. On small number of entries (less than 100), a select
> based proxy can be 20-30% faster than the same one running on epoll()
> because select() while dumber is cheaper to set up.
>
[IIRC epoll() setup is done outside the loop, just once]
The small proxy probably doesn't have a performance problem, while 10K
connection servers do.
>> Unless you are I/O bound, which is usually the case when you have 2GHz
>> cpus driving 200Hz disks.
>>
>
> That's true when you seek a lot. When you manage to mostly perform sequential
> reads (such as what you do when processing large files such as logs), you can
> easily achieve 80 MB/s, which is 20000 pages/s, or 100 times faster.
>
>
Right, and this was achieved by having very good batching in the bio layer.
>> Can you give an example of a 2- or 3- fold factor on an end-user
>> workload achieved by microopts?
>>
>
> Oh there are many primitives which are generally optimized in assembly for
> this reason. What randomly comes to my mind :
> - graphics libraries. Saving 1 cycle per pixel in a rectangle drawing
> primitive can have an important impact in animated graphics for
> instance.
>
> - video/audio and generally multimedia code. I remember a specially
> written version of mpg123 about 10 years ago, which was optimized
> for i486 and which was the only one able to run on a 486 without
> skipping.
>
> - crypto code. It's common to find CPU-specific DES or AES functions.
> Take a look at John The Ripper. I don't know if it still exists,
> but there was an Alpha-optimized DES function which was something
> like 5 times faster than the generic C one. It changes a lot of
> things when you have 1 day to check your users passwords.
>
These are indeed cases where the inner loop is executed millions of times
per second. Of course it is perfectly reasonable to hand-code these in
assembly. I'm talking about regular C code. Most C code is decision
making and pointer chasing, which is why traditional microopts don't help much.
> I also wrote a netfilter log analyzer which parses 300000 lines per
> second on my 1.7 GHz notebook. That's 5600 cycles to read a full
> line, lookup the field names, extract the values, parse them (atoi,
> aton) save them in a structure, apply a filter, insert the result
> in a tree containing up to 12 millions of them, and dump a report
> of the counts by any creteria. That saved me a lot of time working
> on log analysis. But to achieve such a speed, I had to optimize at
> every level, including rewriting a faster atoi() equivalent, a
> faster aton() equivalent (with no multiplies), and playing with
> likely/unlikely a lot. The code slowly improved from about 75k
> lines/s to 300k lines/s with no algorithmic change. Just by the
> way of careful code placement and ordering.
>
Curious: wasn't the time dominated by the tree code? 12M nodes is 24
levels, and probably unpredictable to the processor unless the data is
very regular.
> In fact, you could say that micro-optimizations are not important
> if you are doing them in a crappy environment where the fast path
> is already wasted by a big dirty function. But when you have the
> ability to master all the environment, every single cycle counts
> because there's almost no waste.
>
>
That only works if the environment is very small. A large scale project
needs abstractions, otherwise you spend all your time re-learning all
the details.
> I find it essential not to be the first one bringing crap somewhere
> and serving as an excuse for others not to care about their code.
> If everyone cares, you can still produce very good software, and
> that's what I care about.
We just disagree about the methods.
--
error compiling committee.c: too many arguments to function
Ben Crowhurst wrote:
> Loïc Grenié wrote:
>> 2007/11/29, Ben Crowhurst <[email protected]>:
>>
>>> Has Objective-C ever been considered for kernel development?
>>>
>>> regards,
>>> BPC
>>>
>>
I have tried it in a toy kernel, OSKit style. The code reuse is very
high, especially with string ops and driver interfaces. It's also very easy
to do unit testing with. My main problem was the quality of the compiler
optimization. It's just not good enough. I think if the compiler can do
the right kind of optimizations correctly, then a low-overhead OO
language like Objective-C can be used in a kernel.
On the other hand, it's the automated testing part that really matters for
me. Imagine adding features to Linux week after week without ever
getting a serious panic or two. And then getting a big performance boost
whenever the compiler does more and more optimizations correctly.
>> No, it has not. Any language that looks remotely like an OO language
>> has not ever been considered for (Linux) kernel development and for
>> most, if not all, other operating systems kernels.
>>
>> Various problems occur in an object oriented language. One of them
>> is garbage collection: it provokes asynchronous delays and, during
>> an interrupt or a system call for a real time task, the kernel cannot
>> wait.
> Objective C 1.0 does not force nor have garbage collection.
>
True.
>> Another is memory overhead: all the magic that OO languages
>> provide take space in memory and Linux kernel is used in embedded
>> systems with very tight memory requirements.
>>
> But are embedded systems not rapidly moving on. Turning to stare at
> the ADSL X6 modem with MB's of ram.
It's all about optimizations.
--
Democracy is about two wolves and a sheep deciding what to eat for dinner.