2001-11-06 07:19:07

by H. Peter Anvin

Subject: Using %cr2 to reference "current"

2.4.13-ac8 uses %cr2 rather than (%esp & 0xffffe000) to get "current".
I've been trying to figure out the point of this... writing a control
register is microcode on all the x86 implementations I know (and you
have to re-set it after every pagefault), and reading one probably is
one on most (not Transmeta, but...)

On the other hand, %esp is a GPR and available to the core directly,
and so are usually plain immediates.

Is using %cr2 really faster than the old implementation, or is there
another reason? It seems that the alignment constraints on the stack
still remain, since the %esp solution still remains in places...

It might also be worth considering a segment-register based
implementation instead. The reason we're not using %fs and %gs in the
kernel anymore is because of the setup slowness, but perhaps using
them (use %fs since it's much more likely to be NULL and thus faster
to restore) would be faster than using %cr2?

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>
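
For readers following the thread, the traditional scheme can be sketched in
user space. This is only an illustration, assuming an 8 kB, size-aligned
stack area with the task structure at its lowest address. `struct task`,
`union task_union`, `task_from_sp` and `demo` are hypothetical names, and
the stack pointer is passed in explicitly instead of being read from %esp.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE 8192u   /* assumed: 8 kB stack area, aligned to its size */

struct task { int pid; };  /* hypothetical stand-in for struct task_struct */

/* The task structure shares one size-aligned block with its kernel stack. */
union task_union {
	struct task task;
	unsigned char stack[STACK_SIZE];
};

/* Clear the low bits of any address inside the block to find its base,
 * which is where the task structure lives in this layout. */
static struct task *task_from_sp(uintptr_t sp)
{
	uintptr_t base = sp & ~(uintptr_t)(STACK_SIZE - 1);

	assert(base <= sp);
	return (struct task *)base;
}

/* Demo block used to exercise the helper. */
static _Alignas(STACK_SIZE) union task_union demo;
```

Any address inside `demo.stack` maps back to `&demo.task`, which is why the
scheme forces every task structure onto the same alignment.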


2001-11-06 08:01:46

by Robert Love

Subject: Re: Using %cr2 to reference "current"

On Tue, 2001-11-06 at 02:18, H. Peter Anvin wrote:
> 2.4.13-ac8 uses %cr2 rather than (%esp & 0xffffe000) to get "current".
> I've been trying to figure out the point of this... <snip>

I too am confused. More so, the difference between hard_get_current and
get_current is confusing. I further question things because I suspect
there is a problem: hard_get_current is commented as "for within NMI,
do_page_fault, cpu_init" but all these functions call other functions
that may very well use get_current. How is this going to work?

Further, the preemptible kernel patch oopses with this patch (IOW, don't
use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
did :>). I think it may be because of:

Manfred Spraul wrote:
> error_code:
> [...]
> - GET_CURRENT(%ebx)
> call *%edi
> addl $8,%esp
> + GET_CURRENT(%ebx)
> The pointer to current was loaded into %ebx before the call to the error
> handler, now that only happens after the call. As far as I can see the
> load before the call is not required.

this change but I am unsure. Would Manfred or someone knowledgeable in
this mind letting me pick their brain?

Robert Love

2001-11-06 10:48:49

by Alan

Subject: Re: Using %cr2 to reference "current"

> I too am confused. More so, the difference between hard_get_current and
> get_current is confusing. I further question things because I suspect

hard_get_current always works
get_current assumes %cr2 is loaded correctly

> do_page_fault, cpu_init" but all these functions call other functions
> that may very well use get_current. How is this going to work?

do_page_fault and cpu_init load %cr2

> Further, the preemptible kernel patch oopses with this patch (IOW, don't
> use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
> did :>). I think it may be because of:

You must ensure that you don't pre-empt until %cr2 is loaded. Obviously this
isn't a problem with the traditional low latency patch but if you pre-empt
very early in page fault handling then I suspect you might get the odd
surprise.

The reasoning behind all this is to fix the cache pessimal nature of the x86
stack layout - we had all task structs on the same cache colour and all
stacks aligned within pages (so every apache thread waiting at the same
point is on the same colour too and each wait queue entry on their stacks
is linked to entries all the same colour)

Alan
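
The colouring problem described above can be shown with a little
arithmetic. The sketch below assumes a made-up cache geometry (32-byte
lines, 128 sets) purely for illustration; `cache_set` and `distinct_sets`
are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry, for illustration only: 32-byte lines, 128 sets. */
#define LINE_SIZE 32u
#define NUM_SETS  128u

/* Cache set ("colour") that an address maps to under the assumed geometry. */
static unsigned cache_set(uintptr_t addr)
{
	return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

/* Number of distinct sets touched by n objects placed stride bytes apart. */
static unsigned distinct_sets(uintptr_t base, uintptr_t stride, unsigned n)
{
	unsigned char seen[NUM_SETS] = {0};
	unsigned count = 0;

	assert(stride != 0);
	for (unsigned i = 0; i < n; i++) {
		unsigned s = cache_set(base + (uintptr_t)i * stride);

		if (!seen[s]) {
			seen[s] = 1;
			count++;
		}
	}
	return count;
}
```

With an 8 kB stride all sixteen objects fall into one set; nudging each
object by one extra cache line spreads them over sixteen sets.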

2001-11-06 10:51:40

by Alan

Subject: Re: Using %cr2 to reference "current"

> Is using %cr2 really faster than the old implementation, or is there
> another reason? It seems that the alignment constraints on the stack
> still remain, since the %esp solution still remains in places...

The stack is no longer aligned. We allocate two pages and disturb the stack
by up to 1.5K. We slab the task structs.

> It might also be worth considering a segment-register based
> implementation instead. The reason we're not using %fs and %gs in the
> kernel anymore is because of the setup slowness, but perhaps using
> them (use %fs since it's much more likely to be NULL and thus faster
> to restore) would be faster than using %cr2?

It may be. Likewise it's not clear if %cr2 should hold current or a cpu ident
pointer (so you don't reload on switch of task). This needs more
benchmarking. It's in current -ac to verify the theory is correct, not the
tuning.

2001-11-06 14:15:04

by Manfred Spraul

Subject: Re: Using %cr2 to reference "current"

Robert Love wrote:
>
> Further, the preemptible kernel patch oopses with this patch (IOW, don't
> use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
> did :>). I think it may be because of:
>

Could you send me an oops?
I assume that a
set_current(hard_get_current());
is missing somewhere.
The assumption is that get_current() is faster than hard_get_current(),
and that there are so many get_current() calls that the overhead for the
set_current() in __switch_to and do_page_fault is small.

> Manfred Spraul wrote:
> > error_code:
> > [...]
> > - GET_CURRENT(%ebx)
> > call *%edi
> > addl $8,%esp
> > + GET_CURRENT(%ebx)
> > The pointer to current was loaded into %ebx before the call to the error
> > handler, now that only happens after the call. As far as I can see the
> > load before the call is not required.
>
> this change but I am unsure. Would Manfred or someone knowledgeable in
> this mind letting me pick their brain?
>
I would be very surprised if that's a problem: the error handlers are C
functions, and they don't expect parameters in register %ebx.

--
Manfred

2001-11-06 17:06:11

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

In article <[email protected]>,
H. Peter Anvin <[email protected]> wrote:
>
>Is using %cr2 really faster than the old implementation, or is there
>another reason? It seems that the alignment constraints on the stack
>still remain, since the %esp solution still remains in places...

I think the _real_ issue with that patch is that %cr2 is by no means
architecturally even guaranteed to work the way the patches want it to
work.

It's simply not a general-purpose register, and I don't see why it is
assumed to be (a) fast (b) stable and (c) writable.

I could well imagine a x86-compatible chip where %cr2 isn't even
writable. In fact, reading the intel documentation, I see _nowhere_ a
mention of %cr2 being writable at all - it all just says "contains the
fault address".

Similarly, there is _nothing_ that guarantees that the low bits of %cr2
are meaningful, writable, or even implemented.

Which means that the whole approach is just depending on undocumented
implementation behaviour. That's asking for trouble.

Linus

2001-11-06 17:08:11

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>It may be. Likewise it's not clear if %cr2 should hold current or a cpu ident
>pointer (so you don't reload on switch of task). This needs more
>benchmarking. It's in current -ac to verify the theory is correct, not the
>tuning.

We pretty much know the _theory_ is not correct, just by virtue of
depending on non-architected behaviour. The only thing -ac can do is
test whether it works in practice. Which is a totally different thing.

Especially on x86 chips.

Linus

2001-11-06 17:14:21

by Benjamin LaHaise

Subject: Re: Using %cr2 to reference "current"

On Tue, Nov 06, 2001 at 05:02:32PM +0000, Linus Torvalds wrote:
> Which means that the whole approach is just depending on undocumented
> implementation behaviour. That's asking for trouble.

NetWare uses it and has for a long time.

-ben

2001-11-06 17:39:24

by Alan

Subject: Re: Using %cr2 to reference "current"

> We pretty much know the _theory_ is not correct, just by virtue of
> depending on non-architected behaviour. The only thing -ac can do is
> test whether it works in practice. Which is a totally different thing.

Yep

> Especially on x86 chips.

Well so far I've found one laptop that eats %cr2 on APM calls, and we have
some mystery cases. Peter's suggestion of using %fs or %gs looks more
promising at the moment

2001-11-06 17:37:34

by Michael Barabanov

Subject: Re: Using %cr2 to reference "current"

Here's my version of hard cpu id (RTLinux version):

extern inline int rtl_getcpuid(void)
{
	unsigned cpu;
	__asm__ (
		"str %%ax\n\t"
		"shr $5, %%eax\n\t"
		"sub $3, %%eax\n\t"
		: "=a"(cpu));
	return cpu;
}

No cr2 involved; extremely fast. This takes advantage of the fact that
TSS-CPU mapping is 1-1 in 2.4.

Michael.
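
The shift and subtract above can be checked with plain arithmetic. The
sketch below assumes the GDT layout that those constants imply, namely that
CPU n's TSS descriptor sits at GDT index 12 + 4*n; `FIRST_TSS_ENTRY`,
`ENTRIES_PER_CPU`, `tss_selector` and `cpu_from_selector` are names chosen
here for illustration.

```c
#include <assert.h>

/* Assumed layout: first TSS descriptor at GDT index 12, four descriptor
 * slots reserved per CPU. These values are inferred from "shr $5; sub $3". */
#define FIRST_TSS_ENTRY 12u
#define ENTRIES_PER_CPU 4u

/* A GDT selector with RPL 0 is the descriptor index shifted left by 3. */
static unsigned tss_selector(unsigned cpu)
{
	return (FIRST_TSS_ENTRY + ENTRIES_PER_CPU * cpu) << 3;
}

/* What "str; shr $5; sub $3" computes from the task register value. */
static unsigned cpu_from_selector(unsigned sel)
{
	assert((sel >> 5) >= 3);
	return (sel >> 5) - 3;
}
```

Under that assumption CPU 0 has selector 0x60, and shifting right by 5
divides the descriptor index by four, leaving 3 + n.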

Alan Cox ([email protected]) wrote:
> > I too am confused. More so, the difference between hard_get_current and
> > get_current is confusing. I further question things because I suspect
>
> hard_get_current always works
> get_current assumes %cr2 is loaded correctly
>
> > do_page_fault, cpu_init" but all these functions call other functions
> > that may very well use get_current. How is this going to work?
>
> do_page_fault and cpu_init load %cr2
>
> > Further, the preemptible kernel patch oopses with this patch (IOW, don't
> > use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
> > did :>). I think it may be because of:
>
> You must ensure that you don't pre-empt until %cr2 is loaded. Obviously this
> isn't a problem with the traditional low latency patch but if you pre-empt
> very early in page fault handling then I suspect you might get the odd
> surprise.
>
> The reasoning behind all this is to fix the cache pessimal nature of the x86
> stack layout - we had all task structs on the same cache colour and all
> stacks aligned within pages (so every apache thread waiting at the same
> point is on the same colour too and each wait queue entry on their stacks
> is linked to entries all the same colour)
>
> Alan

2001-11-06 17:53:14

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"


On Tue, 6 Nov 2001, Benjamin LaHaise wrote:
>
> On Tue, Nov 06, 2001 at 05:02:32PM +0000, Linus Torvalds wrote:
> > Which means that the whole approach is just depending on undocumented
> > implementation behaviour. That's asking for trouble.
>
> NetWare uses it and has for a long time.

Does anybody know if WNT uses it? Quite frankly, I don't see Intel
worrying over-much about NetWare compatibility. They've broken small OS's
before (ie older versions of SCO Xenix wouldn't boot on a Pentium MMU
because of some changes to error reporting, if I remember correctly).

That said, how expensive is loading %cr2 anyway? We can do all the same
tricks with a 16kB stack and just playing games with using the higher bits
as the "offset", ie things like

/* Return "current" in %eax, trash %edx */
do_get_current:
	movl $0x0003c000,%eax	// 4 bits at bit 14
	movl $-16384,%edx	// remove low 14 bits
	andl %esp,%eax
	andl %esp,%edx
	shrl $7,%eax		// color it by 128 bytes
	addl %edx,%eax
	ret

which is going to be ~5 cycles _without_ doing anything that is
undocumented (add a push/pop to not trash a register, that might be
worthwhile - it makes the function marginally slower but might make
callers happier).

Oh, and call using inline assembly, not a C call (so that gcc can take
advantage of better calling convention, and not think memory is trashed
etc). So

static inline struct task_struct *get_current(void)
{
	struct task_struct *tsk;
	asm("call do_get_current":"=a" (tsk)::"dx");
	return tsk;
}

See? You don't have to play games with control registers.

(actually, entry.S seems to want the return value in %ebx, so change to
taste. Or you could have two different versions of the thing, or even
inline it for any place where that makes sense).

The above also allows you to keep fork with just one allocation, and makes
the stack larger (we steal 2kB for the coloring, but we'd use an order-2
allocation that at least SGI wants to do regardless).

The 2kB is, of course, tunable. The above is with a 128-byte cacheline and
16 colors - that may be overkill. 32-byte increments with 32 colors might
be more appropriate (I don't know what the effect of the P4 half-cacheline
thing is, I don't know if the CPU can have just a 64-byte block coherent,
or what.. But a 32-byte color is fine for _most_ CPU's).

The 32-byte by 32-color thing would just change the bitmasks to 0x0007c000
and the shift to 9 (bit 14+ shifted down to bit 5+).

Note that there are lots of advantages to using simple regular
instructions over using "special" instructions like "move from control
register". Historically, the special instructions tend to always become
slower, while the regular instructions become faster.

I would not be surprised if "mov %cr2,%reg" will break a netburst trace
cache entity, or even cause microcode to be executed. While I _guarantee_
that all future Intel CPU's will continue to be fast at mixtures of simple
arithmetic operations like "add" and "and".

(And I bet that the likelihood of Intel speeding up shifts in the next P4
derivative is a _lot_ higher than Intel speeding up "mov %cr2,xx"..)

Linus

2001-11-06 18:02:46

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"


On Tue, 6 Nov 2001, Alan Cox wrote:
>
> > Especially on x86 chips.
>
> Well so far I've found one laptop that eats %cr2 on APM calls, and we have
> some mystery cases.

Well, APM is going away, and it should be easy enough to work around it
(and I don't _think_ you can reasonably do the same in ACPI or SMM: SMM
will save the whole CPU state and has to do that anyway, and ACPI doesn't
actually get to touch things like %cr2).

So I'd be more nervous about future CPU's just not having the register
writable (or having only parts of it, or..)

> Peter's suggestion of using %fs or %gs looks more
> promising at the moment

The problem with using a segment register is that then you have to
save/restore it over system calls - pretty much whether the call needs it
or not. Ie you can pretty much _guarantee_ that any system call will be
slowed down by something on the order of 10-15 cycles (on a good day, some
CPU's are slower at it). Same goes for task switch etc.

Which is why I'd much rather just color using the high bits of %esp, and
spend a few more cycles inside "get_current()". I can guarantee you that
it won't slow down paths that don't even need current at all (unlike the
segment register approach), and even the paths that _do_ need current will
only be ~5 cycles slower (plus possibly the cache miss of doing the
function call, but the call-site itself will actually be slightly smaller
than the current in-lined 32-bit immediate and "andl").

Using high bits of %esp has zero impact on task-switch, and makes
"get_current" interrupt safe (ie switching tasks is totally atomic, as
it's the one single "movl ..,%esp" instruction that does the real switch
as far as the kernel is concerned).

It does require using an order-2 allocation, which the current VM will
allow anyway, but which is obviously nastier than an order-1.

Linus

2001-11-06 18:08:06

by Alan

Subject: Re: Using %cr2 to reference "current"

> "get_current" interrupt safe (ie switching tasks is totally atomic, as
> it's the one single "movl ..,%esp" instruction that does the real switch
> as far as the kernel is concerned).
>
> It does require using an order-2 allocation, which the current VM will
> allow anyway, but which is obviously nastier than an order-1.

I've seen boxes dead in the water from 8K NFS (ie 16K order-2 allocations),
let alone the huge memory hit. Michael's rtlinux approach looks even more
interesting and I may have to play with that (using the TSS to ident the
cpu)

Our memory bloat is already pretty gross in 2.4 without adding 16K task
stacks to the oversized struct page, bootmem and excess double linked lists.

I also need to try sticking a pointer to the task struct at the top of the
stack and loading that - since that should be a cache line that isn't being
shared around or swapped between processors

2001-11-06 18:13:16

by Alan

Subject: Re: Using %cr2 to reference "current"

> That said, how expensive is loading %cr2 anyway? We can do all the same
> tricks with a 16kB stack and just playing games with using the higher bits
> as the "offset", ie things like

So that's another 600K on my box vanished. I suspect the page faults will
outweigh it

> the stack larger (we steal 2kB for the coloring, but we'd use an order-2
> allocation that at least SGI wants to do regardless).

16K stack is serious "people who can't program" country.

> I would not be surprised if "mov %cr2,%reg" will break a netburst trace
> cache entity, or even cause microcode to be executed. While I _guarantee_
> that all future Intel CPU's will continue to be fast at mixtures of simple
> arithmetic operations like "add" and "and".

True enough, but then we can go to

andl %%esp, %0
movl (%%eax), %%eax

which doesn't really change the cost much, lets us colour the task structs
nicely, and lets us colour the stack somewhat by offsetting esp from the base
- and all in standard instructions

Alan
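
The mask-then-load idea can be sketched in user space. This is only an
illustration, assuming an 8 kB size-aligned stack area whose lowest word
holds a pointer to a task structure allocated elsewhere; `struct task`,
`union stack_area`, `current_from_sp` and the demo objects are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE 8192u   /* assumed: 8 kB stack area, aligned to its size */

struct task { int pid; };  /* hypothetical stand-in for struct task_struct */

/* Only a pointer lives in the stack area; the task structure itself can be
 * placed (and coloured) independently. */
union stack_area {
	struct task *owner;
	unsigned char stack[STACK_SIZE];
};

/* Equivalent of "andl %esp,reg; movl (reg),reg": mask, then one load. */
static struct task *current_from_sp(uintptr_t sp)
{
	union stack_area *area =
		(union stack_area *)(sp & ~(uintptr_t)(STACK_SIZE - 1));

	assert((uintptr_t)area <= sp);
	return area->owner;
}

/* Demo objects used to exercise the helper. */
static struct task demo_task = { 7 };
static _Alignas(STACK_SIZE) union stack_area demo_area;
```

The extra load costs one memory access, but the line it touches belongs to
one task's stack area only.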




2001-11-06 18:15:36

by Marcelo Tosatti

Subject: Re: Using %cr2 to reference "current"



On Tue, 6 Nov 2001, Alan Cox wrote:

> > "get_current" interrupt safe (ie switching tasks is totally atomic, as
> > it's the one single "movl ..,%esp" instruction that does the real switch
> > as far as the kernel is concerned).
> >
> > It does require using an order-2 allocation, which the current VM will
> > allow anyway, but which is obviously nastier than an order-1.
>
> I've seen boxes dead in the water from 8K NFS (ie 16K order-2 allocations),
> let alone the huge memory hit. Michael's rtlinux approach looks even more
> interesting and I may have to play with that (using the TSS to ident the
> cpu)

Btw, I also want to see what intense "for-optimization" high-order
allocators are going to do to the current VM.

Think about the possible intensive pressure (and CPU wasted) caused by,
for example, SCSI code which _always_ tries to do 1-order allocations (or
bigger?) to allocate scatter/gather tables. We want those allocations to
fall back to 0-order allocations instead of looping madly inside the VM
freeing routines.


2001-11-06 18:18:06

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"


On Tue, 6 Nov 2001, Alan Cox wrote:
>
> Our memory bloat is already pretty gross in 2.4 without adding 16K task
> stacks to the oversized struct page, bootmem and excess double linked lists.

There are some people who think that the 5kB stack we have now is too
small ;(

> I also need to try sticking a pointer to the task struct at the top of the
> stack and loading that - since that should be a cache line that isnt being
> shared around or swapped between processors

That should work fairly well, and has the advantage that you can hide more
state there if you want (ie it allows us, on demand, to move hot state of
"struct task_struct" up there).

There is a subset of "struct task_struct" that is basically completely
local to the task, and could be advantageous to move around. Things like

- need_resched/sigpending/process attributes
- ptrace
- processor
- addr_limit

are all things that we don't actually _need_ to go all the way to the task
structure to fetch, and that we mostly need to modify anyway on task
switch (ie "need_resched" and "processor" both need to be written on
task-switch anyway, and are not touched by any other CPU)

So it would basically be a small per-CPU/thread area, not just the "struct
task_struct".

Linus

2001-11-06 18:24:56

by Alan

Subject: Re: Using %cr2 to reference "current"

> > Our memory bloat is already pretty gross in 2.4 without adding 16K task
> > stacks to the oversized struct page, bootmem and excess double linked lists.
>
> There are some people who think that the 5kB stack we have now is too
> small ;(

Yes but we don't want to let them win or next year 16K will be too small and
then they'll want 16K C++ stack objects. At the very least we should
make them have to use

really_slow_vmalloc_and_switch_to_big_temporary_stack()
really_slow_vfree_and_return_to_old_stack()

_and_ make them type function names that long.

Granted it's less of an issue in 2.5 because we can afford to finally make
DMA off the stack a crime (right now it's an offence but one that is violated
in too many places to be sure of killing them all off) - scsi for one does
it.

> That should work fairly well, and has the advantage that you can hide more
> state there if you want (ie it allows us, on demand, to move hot state of
> "struct task_struct" up there).

Sweet. Now that I'd completely missed. Task private state and task
public state splitting

> So it would basically be a small per-CPU/thread area, not just the "struct
> task_struct".

Yep

Alan

2001-11-06 18:43:09

by Benjamin LaHaise

Subject: Re: Using %cr2 to reference "current"

On Tue, Nov 06, 2001 at 09:49:15AM -0800, Linus Torvalds wrote:
> That said, how expensive is loading %cr2 anyway? We can do all the same
> tricks with a 16kB stack and just playing games with using the higher bits
> as the "offset", ie things like

Here are some numbers:

read cr2 best: 11 av: 11.12
write cr2 cr2 best: 61 av: 64.42
read cr2 best: 11 av: 11.12
write cr2 cr2 best: 61 av: 65.01
read stk best: 10 av: 11.03
write cr2 stk best: 61 av: 64.95
read stk best: 10 av: 11.03
write cr2 stk best: 61 av: 65.23

Which come from insmod of the below two modules. I didn't test writing to
the stack register, but I expect it's similarly expensive as it affects the
call return stack and other behind the scenes dependencies. Suffice it to
say that reading %cr2 is essentially free on my box (athlon mp). Maybe
we should use it as a pointer into a per-cpu area to avoid writing it?

-ben

----teststk_k.c----
#define USE_STK 1
#include "testcr2_k.c"
----testcr2_k.c----
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/errno.h>
#include <linux/init.h>

static inline long long rdtsc(void)
{
	unsigned int low,high;
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
	return low + (((long long)high)<<32);
}

long dummy;

long doit(void)
{
	long long start, end;
	long val;

	start = rdtsc();
#ifdef USE_STK
#define WHICH "stk"
	__asm__ __volatile__(
		"movl $0x0003c000,%%eax \n" // 4 bits at bit 14
		"movl $-16384,%%edx \n" // remove low 14 bits
		"andl %%esp,%%eax \n"
		"andl %%esp,%%edx \n"
		"shrl $7,%%eax \n" // color it by 128 bytes
		"addl %%edx,%%eax \n"
		: "=a" (val) :: "edx");
#else
#define WHICH "cr2"
	__asm__ __volatile__("movl %%cr2,%0" : "=r" (val));
#endif
	val += 100;
	dummy = val;
	end = rdtsc();

	return end - start;
}

long doit2(void)
{
	long long start, end;
	long val;

	start = rdtsc();
	val = dummy;
	/* val is read by the asm, so it has to be an input operand */
	__asm__ __volatile__("movl %0,%%cr2" : : "r" (val));
	end = rdtsc();

	return end - start;
}

int test_init (void)
{
	long min = 1000000000, av = 0;
	int i;
	for (i=0; i<100; i++) {
		long dur = doit();
		if (dur < min)
			min = dur;
		av += dur;
	}
	printk("read " WHICH " best: %ld av: %ld.%02ld\n", min, av / 100, av % 100);

	min = 10000000;
	av = 0;
	for (i=0; i<100; i++) {
		long dur = doit2();
		if (dur < min)
			min = dur;
		av += dur;
	}
	printk("write cr2 " WHICH " best: %ld av: %ld.%02ld\n", min, av / 100, av % 100);
	/* fail the load on purpose so the module never stays resident */
	return -ENODEV;
}

void test_exit(void)
{
	return;
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
---snip---
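
The "best" and "av" columns come from a small fixed-point trick: with
exactly 100 samples, sum / 100 is the integer part of the mean and
sum % 100 gives its two decimal digits. A user-space sketch of that
bookkeeping follows; `struct stats`, `summarize` and `format_stats` are
hypothetical names, and the samples are supplied by the caller instead of
being measured with rdtsc.

```c
#include <assert.h>
#include <stdio.h>

struct stats {
	long best;	/* smallest sample seen */
	long sum;	/* sum of all samples */
	int n;		/* number of samples */
};

static struct stats summarize(const long *samples, int n)
{
	struct stats s;

	assert(n > 0);
	s.best = samples[0];
	s.sum = 0;
	s.n = n;
	for (int i = 0; i < n; i++) {
		if (samples[i] < s.best)
			s.best = samples[i];
		s.sum += samples[i];
	}
	return s;
}

/* Format in the same shape as the module's printk. Only valid for 100
 * samples, because the two-digit fraction relies on dividing by 100. */
static int format_stats(char *buf, size_t len, const char *what,
			struct stats s)
{
	assert(s.n == 100);
	return snprintf(buf, len, "%s best: %ld av: %ld.%02ld",
			what, s.best, s.sum / 100, s.sum % 100);
}
```

For example, 88 samples of 11 cycles and 12 samples of 12 cycles sum to
1112, which prints as an average of 11.12.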

2001-11-06 19:10:38

by H. Peter Anvin

Subject: Re: Using %cr2 to reference "current"

Followup to: <[email protected]>
By author: Benjamin LaHaise <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Tue, Nov 06, 2001 at 09:49:15AM -0800, Linus Torvalds wrote:
> > That said, how expensive is loading %cr2 anyway? We can do all the same
> > tricks with a 16kB stack and just playing games with using the higher bits
> > as the "offset", ie things like
>
> Here are some numbers:
>
> read cr2 best: 11 av: 11.12
> write cr2 cr2 best: 61 av: 64.42
> read cr2 best: 11 av: 11.12
> write cr2 cr2 best: 61 av: 65.01
> read stk best: 10 av: 11.03
> write cr2 stk best: 61 av: 64.95
> read stk best: 10 av: 11.03
> write cr2 stk best: 61 av: 65.23
>
> Which come from insmod of the below two modules. I didn't test writing to
> the stack register, but I expect it's similarly expensive as it affects the
> call return stack and other behind the scenes dependencies. Suffice it to
> say that reading %cr2 is essentially free on my box (athlon mp). Maybe
> we should use it as a pointer into a per-cpu area to avoid writing it?
>

You still have to write it every time you take a page fault. You're
adding 60-odd cycles to the page fault path at least.

Not to mention any system which does microcoded reads of %cr2, which
apparently the Athlon XP doesn't.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-06 19:16:48

by Dave Jones

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Benjamin LaHaise wrote:

> Here are some numbers:
> Which come from insmod of the below two modules. I didn't test writing to
> the stack register, but I expect it's similarly expensive as it affects the
> call return stack and other behind the scenes dependencies. Suffice it to
> say that reading %cr2 is essentially free on my box (athlon mp). Maybe
> we should use it as a pointer into a per-cpu area to avoid writing it?

If this is done, it should perhaps be done only on certain x86s,
as some show the results go the other way. For example, the Cyrix III..

read stk best: 42 av: 42.60
read cr2 best: 61 av: 61.28

regards,

Dave.

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2001-11-06 20:10:43

by Ricky Beam

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Dave Jones wrote:
>If this is done, it should perhaps be done only on certain x86s,
>as some show the results go the other way. For example, the Cyrix III..

And for some (P150) it makes no difference...

read cr2 best: 25 av: 27.09
write cr2 cr2 best: 32 av: 34.39

read stk best: 26 av: 28.22
write cr2 stk best: 32 av: 33.04

--Ricky


2001-11-06 22:05:38

by Mikael Pettersson

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001 09:49:15 -0800 (PST), Linus Torvalds wrote:
> /* Return "current" in %eax, trash %edx */
> do_get_current:
> movl $0x0003c000,%eax // 4 bits at bit 14
> movl $-16384,%edx // remove low 14 bits
> andl %esp,%eax
> andl %esp,%edx
> shrl $7,%eax // color it by 128 bytes
> addl %edx,%eax
> ret
>...
>I would not be surprised if "mov %cr2,%reg" will break a netburst trace
>cache entity, or even cause microcode to be executed. While I _guarantee_
>that all future Intel CPU's will continue to be fast at mixtures of simple
>arithmetic operations like "add" and "and".

On my Pentium 4:
- 6.30 cycles to copy %cr2 to %eax
- 1.05 cycles to compute a non-coloured current by masking %esp
- 2.31 cycles to compute a coloured current by your code above

I did some tests on using %cr2 for get_processor_id() a while ago,
but it was clearly slower (58% on P6, 20% on K6-III, 3% on P5MMX)
than *((%esp & mask)+offset), even though the latter also does a load.

/Mikael

2001-11-06 22:42:18

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>> That should work fairly well, and has the advantage that you can hide more
>> state there if you want (ie it allows us, on demand, to move hot state of
>> "struct task_struct" up there).
>
>Sweet. Now that I'd completely missed. Task private state and task
>public state splitting

Yes. It would be a waste to have to bring in a cache-line into the L1
cache, and then only use 4 bytes of it. So it should make sense to set
this up somewhat like:

struct local_task_struct {
	struct task_struct *tsk;
	.. other fields ..
};

and then use the _exact_ existing infrastructure to get
"local_task_struct" instead of "task_struct", and let the compiler do
all the rest at a higher level. So we'd just rename "get_current()" to
"get_local_current()", and then do

#define get_current() (get_local_current()->tsk)

and people who want to know about the local task struct can use that.

Linus
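
A user-space sketch of the split described above, under stated assumptions:
an 8 kB size-aligned stack area with the local structure at its base, and
a stack pointer passed explicitly. The fields other than `tsk`, plus
`union local_area` and the demo objects, are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SIZE 8192u   /* assumed: 8 kB stack area, aligned to its size */

struct task_struct { int pid; };	/* hypothetical shared part */

struct local_task_struct {
	struct task_struct *tsk;
	int need_resched;		/* example of task-private hot state */
	int processor;
};

union local_area {
	struct local_task_struct local;
	unsigned char stack[STACK_SIZE];
};

/* Same masking as before, but it now yields the local structure. */
static struct local_task_struct *get_local_current(uintptr_t sp)
{
	union local_area *area =
		(union local_area *)(sp & ~(uintptr_t)(STACK_SIZE - 1));

	assert((uintptr_t)area <= sp);
	return &area->local;
}

/* Existing callers keep using get_current(); it costs one extra load. */
#define get_current(sp) (get_local_current(sp)->tsk)

/* Demo objects used to exercise the helpers. */
static struct task_struct demo_tsk = { 99 };
static _Alignas(STACK_SIZE) union local_area demo_local;
```

Code that only needs `need_resched` or `processor` never has to touch the
shared `task_struct` cache lines.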

2001-11-06 23:02:28

by Alan

Subject: Re: Using %cr2 to reference "current"

> If this is done, it should perhaps be done only on certain x86s,
> as some show the results go the other way. For example, the Cyrix III..
>
> read stk best: 42 av: 42.60
> read cr2 best: 61 av: 61.28

Do we have many SMP Cyrix III's ?

2001-11-06 23:08:18

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > "get_current" interrupt safe (ie switching tasks is totally atomic, as
> > it's the one single "movl ..,%esp" instruction that does the real switch
> > as far as the kernel is concerned).
> >
> > It does require using an order-2 allocation, which the current VM will
> > allow anyway, but which is obviously nastier than an order-1.
>
> I've seen boxes dead in the water from 8K NFS (ie 16K order-2 allocations),
> let alone the huge memory hit. Michael's rtlinux approach looks even more
> interesting and I may have to play with that (using the TSS to ident the
> cpu)
>
> Our memory bloat is already pretty gross in 2.4 without adding 16K task
> stacks to the oversized struct page, bootmem and excess double linked lists.

If we are talking about memory bloat, let's ask a question. Is somebody
there working seriously on changing the default function call conventions
on IA32 from stack parameter pushing to register passing throughout the
kernel? The implications for I-cache pressure in particular seem to be
quite significant, and apparently one of the areas where GCC got
much better is precisely this. The recent comparisons of gcc against
the Intel compiler show as well that this may be really worth it.

2001-11-06 23:13:28

by Alan

Subject: Re: Using %cr2 to reference "current"

> If we are talking about memory bloat, let's ask a question. Is somebody
> there working seriously on changing the default function call conventions
> on IA32

That's pure noise

On a 256MB machine you have 65536 page map entries. Those are 64 bytes but
it's not hard to get it down to 56 bytes (0.5MB saved) and probably to 48
bytes. We can probably also shave 8 bytes off each cached inode if not
more (the nfs changes in -ac are a big help there already) - that's typically
another 200K on a reasonable size box - and the new bootmem code can save a
chunk too

I'm not sure how much the code change for function call patterns would be
but I doubt it's so big for such little effort

Alan
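
The page map figures above are easy to reproduce. The sketch assumes 4 kB
pages, as on i386; `page_entries` and `bytes_saved` are hypothetical
helper names.

```c
#include <assert.h>

#define PAGE_SIZE 4096ul   /* assumed: 4 kB pages */

/* One page map entry per physical page. */
static unsigned long page_entries(unsigned long ram_bytes)
{
	return ram_bytes / PAGE_SIZE;
}

/* Memory recovered by shrinking every entry from old_size to new_size. */
static unsigned long bytes_saved(unsigned long ram_bytes,
				 unsigned long old_size,
				 unsigned long new_size)
{
	assert(new_size <= old_size);
	return page_entries(ram_bytes) * (old_size - new_size);
}
```

For 256 MB this gives 65536 entries; going from 64 to 56 bytes saves
512 kB, and going to 48 bytes saves 1 MB.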

2001-11-06 23:15:59

by Dave Jones

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Alan Cox wrote:

> > If this is done, it should perhaps be done only on certain x86s,
> > as some show the results go the other way. For example, the Cyrix III..
> Do we have many SMP Cyrix III's ?

I wish :) Today no, tomorrow only VIA knows.
I just used that as an example that it may not be a win everywhere.
A better example perhaps was the P5 case Ricky posted; those, as you
know, are seen in the real world in SMP.

regards,

Dave.

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2001-11-07 13:07:45

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:

> I'm not sure how much the code change for function call patterns would be
> but I doubt it's so big for such little effort

Let numbers talk to us, or allow me to quote the gloriously politically
incorrect Dave: "Numbers talk - bullshit walks!":

Without register passing, we have the following size situation:

   text    data     bss     dec     hex filename
1332132  260804  288080 1881016  1cb3b8 vmlinux

With the following options enabled we get:
-freg-struct-return -mrtd -mregparm=3

   text    data     bss     dec     hex filename
1302372  260804  288080 1851256  1c3f78 vmlinux

Quite a significant difference if you ask me!!!

With the following options enabled we get:
-mrtd -mregparm=3

   text    data     bss     dec     hex filename
1302404  260804  288080 1851288  1c3f98 vmlinux

Here it's just a few bytes here and there, not really
significant, because the kernel apparently doesn't
use structs as return values frequently.

With the following options enabled we get:
-mregparm=3

   text    data     bss     dec     hex filename
1303476  260804  288080 1852360  1c43c8 vmlinux

So apparently the -mrtd option is quite significant as well.

With the following options enabled we get:
-mregparm=2

   text    data     bss     dec     hex filename
1307876  260804  288080 1856760  1c54f8 vmlinux

As expected the influence here isn't too significant.

So the conclusion is that apparently the change in calling convention can
result in a saving of about 2.3% in code size. This may not sound great in
relative numbers, but for a compiler designer this would already sound
hilarious, and in absolute numbers it's 29760 bytes. Not to mention the
speed improvement...

Oh, for completeness' sake, the compiler used was:
gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-99)
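
The size comparison above can be reproduced with a few lines of integer arithmetic. This is only a sketch; the function names are invented here, and the percentage is expressed in hundredths of a percent so that no floating point is needed.

```c
#include <assert.h>

/* Bytes of text saved by a variant build relative to a reference build. */
static long text_saved(long reference_text, long variant_text)
{
    return reference_text - variant_text;
}

/* Saving as hundredths of a percent of base_text (integer, rounds down). */
static long saving_hundredths_pct(long saved, long base_text)
{
    assert(base_text > 0);
    return saved * 10000 / base_text;
}
```

Fed with the text sizes quoted above, the full option set saves 29760 bytes, which is roughly 2.2% of the original text size or 2.3% of the reduced one. Adding -mrtd on top of -mregparm=3 accounts for 1072 bytes, and going from -mregparm=2 to -mregparm=3 for 4400 bytes.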

2001-11-07 13:36:51

by Alan

Subject: Re: Using %cr2 to reference "current"

> With the following options enabled we get:
> -freg-struct-return -mrtd -mregparm=3
>
>    text    data     bss     dec     hex filename
> 1302372  260804  288080 1851256  1c3f78 vmlinux
>
> Quite a significant difference if you ask me!!!

30K is nice to have but still a scratch on the surface compared with 500K 8)

> in a saving of about 2.3% in code size. This may not sound great in
> relative numbers, but for a compiler designer this would already sound
> hilarious, and in absolute numbers it's 29760 bytes. Not to mention the
> speed improvement...

The obvious question is - have you tried running the kernel built like that
with any asm fixups needed ?

2001-11-07 14:06:56

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > With the following options enabled we get:
> > -freg-struct-return -mrtd -mregparm=3
> >
> >    text    data     bss     dec     hex filename
> > 1302372  260804  288080 1851256  1c3f78 vmlinux
> >
> > Quite a significant difference if you ask me!!!
>
> 30K is nice to have but still a scratch on the surface compared with 500K 8)
>
> > in a saving of about 2.3% in code size. This may not sound great in
> > relative numbers, but for a compiler designer this would already sound
> > hilarious, and in absolute numbers it's 29760 bytes. Not to mention the
> > speed improvement...
>
> The obvious question is - have you tried running the kernel built like that
> with any asm fixups needed ?

Once, a long time ago, I already tried to do the fixups myself, and got
to the stage of init starting... It wasn't THAT difficult. However,
somehow encouraged by the compiler comparisons between gcc and Intel's
free compiler, which uses register passing for anything local
to the actual code, where the speed gains are up to 20%, I'm currently
quite inclined to redo and finish the experiment.
BTW: It's not just asm fixups that have to be done for this
to work. For example, all the C files with -fno-omit-frame-pointer
as an additional compilation flag have to be looked at seriously
again. And of course UML makes the debugging of at least this easier.

--
- phone: +49 214 8656 283
- job: eVision-Ventures AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last ressort:
ru_RU.KOI8-R

2001-11-07 14:10:56

by Alan

Subject: Re: Using %cr2 to reference "current"

> somehow encouraged by the compiler comparisons between gcc and Intel's
> free compiler, which uses register passing for anything local
> to the actual code, where the speed gains are up to 20%, I'm currently

I was under the impression Intel's compiler was profoundly non-free ?

> quite inclined to redo and finish the experiment.
> BTW: It's not just asm fixups that have to be done for this
> to work. For example, all the C files with -fno-omit-frame-pointer

20% is a nice large number

Alan

2001-11-07 14:35:34

by Dirk Moerenhout

Subject: Re: Using %cr2 to reference "current"

> > somehow encouraged by the compiler comparisons between gcc and Intel's
> > free compiler, which uses register passing for anything local
> > to the actual code, where the speed gains are up to 20%, I'm currently
>
> I was under the impression Intel's compiler was profoundly non-free ?

Thought that too until a minute ago. Went to the Intel site and read the
information.

http://developer.intel.com/software/products/eval/

Gives details about _two_ ways to get it free. The known 30 day free trial
with support but also a less known "non commercial unsupported" option. So
for non-commercial use you can use it as much as you want, you just don't
get support.

Downloading it now to play some with it :-)

Dirk Moerenhout ///// System Administrator ///// Planet Internet NV

2001-11-07 14:40:15

by Sebastian Heidl

Subject: Intel compiler [Re: Using %cr2 to reference "current"]

On Wed, Nov 07, 2001 at 02:17:33PM +0000, Alan Cox wrote:
> > somehow encouraged by the compiler comparisons between gcc and Intel's
> > free compiler, which uses register passing for anything local
> > to the actual code, where the speed gains are up to 20%, I'm currently
>
> I was under the impression Intel's compiler was profoundly non-free ?

have a look:
http://developer.intel.com/software/products/eval/


_sh_

2001-11-07 14:44:15

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > somehow encouraged by the compiler comparisons between gcc and Intel's
> > free compiler, which uses register passing for anything local
> > to the actual code, where the speed gains are up to 20%, I'm currently
>
> I was under the impression Intel's compiler was profoundly non-free ?

Well, it's free in terms of money, read: download and "personal usage"
blabla.
This doesn't deter me from having a look at it ;-).
>
> > quite inclined to redo and finish the experiment.
> > BTW: It's not just asm fixups that have to be done for this
> > to work. For example, all the C files with -fno-omit-frame-pointer
>
> 20% is a nice large number.

Yes, I was impressed as well, and twiddling with compiler flags
actually indicates that the calling convention stuff is one
of the main contributors to this. BTW: The speed differences
can go up to 40% for floating point. OK, this is irrelevant for
the kernel, but it shows very well that there is still
plenty of room for improvement.

2001-11-07 14:47:05

by Alan

Subject: Re: Using %cr2 to reference "current"

> Thought that too until a minute ago. Went to the Intel site and read the
> information.
>
> http://developer.intel.com/software/products/eval/

> Gives details about _two_ ways to get it free. The known 30 day free trial

Seems to be non free to me

May well be non-fee non-free but it's still most definitely non-free

2001-11-07 15:32:38

by David Howells

Subject: Re: Using %cr2 to reference "current"


Instead of using %cr2, how about giving each CPU its own GDT (the GDT doesn't
need to contain many entries)? Have one segment number point to a CPU-specific
data area that contains things like the current task pointer for that CPU, the
CPU number, etc. This same segment number will be used on all CPUs, but
will be multiplexed via the per-CPU GDTs instead.

Then you can load up a segment register with this segment on entry to the
kernel, and then make CPU data accesses relative to that.
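
The idea above can be modelled in plain C as a toy simulation. Nothing here touches real descriptor tables or segment registers: the "GDT" is just an array, the selector is an index, and every name (`cpu_data`, `PERCPU_SEL`, `setup_percpu_gdt`, `percpu`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS     2   /* hypothetical CPU count for the sketch */
#define GDT_ENTRIES 2   /* the per-CPU table can stay tiny */
#define PERCPU_SEL  1   /* the same slot number is used on every CPU */

struct task { int pid; };

/* Per-CPU data area that the shared segment number points at. */
struct cpu_data {
    struct task *current;
    int cpu_number;
};

/* Toy descriptor: only the base address matters for this model. */
struct descriptor { void *base; };

static struct cpu_data cpu_area[NR_CPUS];
static struct descriptor gdt[NR_CPUS][GDT_ENTRIES];

/* Give one CPU its own table whose PERCPU_SEL slot points at its data. */
static void setup_percpu_gdt(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cpu_area[cpu].cpu_number = cpu;
    cpu_area[cpu].current = NULL;
    gdt[cpu][PERCPU_SEL].base = &cpu_area[cpu];
}

/* Models a segment-relative access: same selector, per-CPU table. */
static struct cpu_data *percpu(int running_on)
{
    assert(running_on >= 0 && running_on < NR_CPUS);
    return (struct cpu_data *)gdt[running_on][PERCPU_SEL].base;
}
```

In the real scheme the lookup in `percpu()` would be done by the hardware when a segment register loaded with that selector is used as a prefix, so the code itself would be identical on every CPU.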

David

2001-11-07 20:10:09

by Andrew Morton

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > With the following options enabled we get:
> > -freg-struct-return -mrtd -mregparm=3
> >
> >    text    data     bss     dec     hex filename
> > 1302372  260804  288080 1851256  1c3f78 vmlinux
> >
> > Quite a significant difference if you ask me!!!
>
> 30K is nice to have but still a scratch on the surface compared with 500K 8)
>

It's a lot of L1 though.

If this sort of change breaks the ability to build with
conventional argument passing and no-omit-frame-pointer then
the happy kgdb users of this world will be most aggrieved.

-

2001-11-07 22:06:16

by Genes Lists

Subject: Re: Intel compiler [Re: Using %cr2 to reference "current"]


Just as another data point - a simple test: I ran the Intel
compiler on flops v2.

Run 3 ways - gcc3, icc (v 5) and the beta 6 icc. All run
on dual p4 with 1 Gb mem on Rh 7.2

At least on this test the differences are quite dramatic.

Regards,

gene/

---------------------------------------------------------------------
Summary
------

gcc -DUNIX -O3 -march=i686 flops2.c
icc -xMKW -o flops2 -DUNIX -O3 flops2.c

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module                 MFLOPS
                        gcc       icc 5       icc 6
                  ---------   ---------   ---------
   1               444.9410    439.4850    674.3180
   2               265.4815    362.3862    362.3862
   3               298.1843    604.0250   1270.6569
   4               337.7309   1224.8804   1373.8819
   5               392.7003   1138.6503   1131.7073
   6               391.7678   1334.0521   1422.2222
   7               163.5783    193.3900    193.5118
   8               395.7743   1317.3242   1372.6542

Iterations      = 512000000   512000000   512000000
NullTime (usec) =    0.0029      0.0000      0.0000
MFLOPS(1)       =  275.3542    416.9120    472.8952
MFLOPS(2)       =  264.7165    413.4297    448.2175
MFLOPS(3)       =  339.5966    714.7146    834.5651
MFLOPS(4)       =  362.1891   1071.8196   1367.5374

---------------------------------------------------------------------

On Wed, Nov 07, 2001 at 03:39:46PM +0100, Sebastian Heidl wrote:
> On Wed, Nov 07, 2001 at 02:17:33PM +0000, Alan Cox wrote:
> > > somehow encouraged by the compiler comparisons between gcc and Intel's
> > > free compiler, which uses register passing for anything local
> > > to the actual code, where the speed gains are up to 20%, I'm currently
> >
> > I was under the impression Intel's compiler was profoundly non-free ?
>
> have a look:
> http://developer.intel.com/software/products/eval/

2001-11-08 13:15:55

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > somehow encouraged by the compiler comparisons between gcc and Intel's
> > free compiler, which uses register passing for anything local
> > to the actual code, where the speed gains are up to 20%, I'm currently
>
> I was under the impression Intel's compiler was profoundly non-free ?
>
> > quite inclined to redo and finish the experiment.
> > BTW: It's not just asm fixups that have to be done for this
> > to work. For example, all the C files with -fno-omit-frame-pointer
>
> 20% is a nice large number

I just wanted to note that I have already got the whole fixup to
the stage where the first schedule() occurs during kernel
initialization... printk and so on all seem to work nicely ;-).
Well, there were some errors which had to be fixed to get this far.
For example, the decompress_kernel function should have the
attribute asmlinkage, and boot/compressed/misc.c should not export
anything else.

Further debugging will occur this evening...

2001-11-09 21:53:48

by Jamie Lokier

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
> True enough, but then we can go to
>
> andl %%esp, %0
> movl (%%eax), %%eax
>
> which doesn't really change the cost much, lets us colour the task structs
> nicely, and lets us colour the stack somewhat by offsetting esp from the base
> - and all in standard instructions

A variant lets you put the pointer at the top of the stack, where it can
sometimes share a cache line with the freshly pushed context:

movl $0x1ffc,%0
orl %esp,%0
movl (%0), %0

This works because GCC keeps the stack aligned to 4 bytes at all times,
I believe.
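
Both address tricks can be tried out in user space. The sketch below is an illustration only: it assumes an 8 KiB stack area aligned to its own size, uses the pointer size instead of the literal 4 so it also runs on 64-bit hosts (on i386 the OR mask works out to the 0x1ffc shown above), and all function names are invented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 0x2000u   /* 8 KiB stack area, matching the 0x1ffc constant */

struct task { int pid; };

/* Stand-in for a kernel stack: aligned to its own size, as both tricks need. */
static _Alignas(0x2000) unsigned char demo_stack[STACK_SIZE];

static void store_task_ptr(uintptr_t addr, struct task *t)
{
    memcpy((void *)addr, &t, sizeof t);
}

static struct task *load_task_ptr(uintptr_t addr)
{
    struct task *t;
    memcpy(&t, (const void *)addr, sizeof t);
    return t;
}

/* Alan's variant: mask the stack pointer down to the base and load there. */
static struct task *current_from_base(uintptr_t sp)
{
    return load_task_ptr(sp & ~(uintptr_t)(STACK_SIZE - 1));
}

/* Jamie's variant: OR the stack pointer up to the last pointer-sized slot. */
static struct task *current_from_top(uintptr_t sp)
{
    /* Only valid if sp is aligned to the pointer size (4 bytes on i386). */
    assert((sp & (sizeof(struct task *) - 1)) == 0);
    return load_task_ptr(sp | (STACK_SIZE - sizeof(struct task *)));
}
```

Storing a task pointer at `demo_stack` and at `demo_stack + STACK_SIZE - sizeof(struct task *)` and then calling either function with any suitably aligned address inside the area returns that pointer, which is the property both instruction sequences rely on.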

Both this simple sequence, and Alan's code, suffer from the problem that
the pointer itself is not cache-coloured, but it is a lot better than
having the whole context and task state on the same colour.

This could perhaps be improved using Linus' idea of shifting upper address
bits to colour the pointer as well.

-- Jamie

2001-11-11 12:24:50

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > With the following options enabled we get:
> > -freg-struct-return -mrtd -mregparm=3
> >
> >    text    data     bss     dec     hex filename
> > 1302372  260804  288080 1851256  1c3f78 vmlinux
> >
> > Quite a significant difference if you ask me!!!
>
> 30K is nice to have but still a scratch on the surface compared with 500K 8)
>
> > in a saving of about 2.3% in code size. This may not sound great in
> > relative numbers, but for a compiler designer this would already sound
> > hilarious, and in absolute numbers it's 29760 bytes. Not to mention the
> > speed improvement...
>
> The obvious question is - have you tried running the kernel built like that
> with any asm fixups needed ?

I now have a nice kernel at home, compiled with -mregparm=3, up
and running. Full interactive session, full kernel compiles working,
X11 and so on. Everything seems fine so far.

However, I still have to build an RPM-feature grade kernel and test it.
Further, the precise benchmarking will take some time as well.
I think that I will use the byte benchmark in particular, since it is
quite "kernel intensive" in some parts. Patch will follow on
Monday (if nothing comes in between...).

2001-11-11 13:07:38

by Keith Owens

Subject: Re: Using %cr2 to reference "current"

On Sun, 11 Nov 2001 14:16:36 +0100,
Martin Dalecki <[email protected]> wrote:
>I now have a nice kernel at home, compiled with -mregparm=3, up
>... Patch will follow on Monday

Compiling the kernel with mregparm is going to play havoc with binary
only modules (BOMs), interface mismatches all over the place. I know
we do not support BOMs but there is a big difference between not
supporting them and having them actively destroy the kernel because of
different calling sequences.

A new feature of kbuild 2.5 is defining which CONFIG options are
critical, any change to any critical config option forces a complete
kernel rebuild. Modutils 2.5 will also refuse to load a module if its
critical config options are different from the kernel. The current
list of critical options is

CONFIG_SMP
UP modules in SMP kernel or vice versa just go splat. This
replaces the modversions '_smp' prefix.

CONFIG_KBUILD_GCC_VERSION
Inserting a module compiled with gcc 3.0.1 into a kernel compiled
with gcc 3.0.2 is a recipe for disaster. Kernel and module must
be built with the same compiler.

Any changes that affect the ABI for modules must be handled via config
options and those options must be on the critical list in 2.5.

Please add CONFIG_MREGPARM with a huge warning that, until kbuild 2.5
and modutils 2.5 are available, inserting a BOM is likely to destroy a
kernel compiled with CONFIG_MREGPARM.
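
The kind of check described above could look roughly like the following. This is a purely hypothetical sketch: the struct layout and `module_compatible()` are invented here and do not reflect how kbuild 2.5 or modutils 2.5 actually record or compare critical options.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record of the ABI-critical options a build was made with. */
struct critical_config {
    int smp;                  /* CONFIG_SMP */
    const char *gcc_version;  /* CONFIG_KBUILD_GCC_VERSION */
    int regparm;              /* proposed CONFIG_MREGPARM, 0 = stack passing */
};

/* Returns 1 if a module built with 'mod' may go into a kernel built with 'kern'. */
static int module_compatible(const struct critical_config *kern,
                             const struct critical_config *mod)
{
    assert(kern && mod && kern->gcc_version && mod->gcc_version);
    return kern->smp == mod->smp
        && strcmp(kern->gcc_version, mod->gcc_version) == 0
        && kern->regparm == mod->regparm;
}
```

The point of the sketch is only that a calling-convention switch would have to be one more field in such a comparison; any mismatch rejects the module before it can run with the wrong calling sequence.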

2001-11-12 10:36:32

by Martin Dalecki

Subject: PATCH 2.4.14 mregparm=3 compilation fixes


BYTE UNIX Benchmarks (Version 3.11)
System -- Linux kozaczek 2.4.14-2 #1 pią lis 9 22:22:10 CET 2001 i686 unknown
Start Benchmark Run: nie lis 11 16:10:53 CET 2001
1 interactive users.
Dhrystone 2 without register variables   1263134.8 lps  (10 secs, 6 samples)
Dhrystone 2 using register variables     1263583.6 lps  (10 secs, 6 samples)
Arithmetic Test (type = arithoh)         3177830.7 lps  (10 secs, 6 samples)
Arithmetic Test (type = register)         189076.1 lps  (10 secs, 6 samples)
Arithmetic Test (type = short)            190665.1 lps  (10 secs, 6 samples)
Arithmetic Test (type = int)              188753.5 lps  (10 secs, 6 samples)
Arithmetic Test (type = long)             190094.2 lps  (10 secs, 6 samples)
Arithmetic Test (type = float)            182872.2 lps  (10 secs, 6 samples)
Arithmetic Test (type = double)           183902.9 lps  (10 secs, 6 samples)
System Call Overhead Test                 360235.7 lps  (10 secs, 6 samples)
Pipe Throughput Test                      421456.7 lps  (10 secs, 6 samples)
Pipe-based Context Switching Test         194915.8 lps  (10 secs, 6 samples)
Process Creation Test                       3605.4 lps  (10 secs, 6 samples)
Execl Throughput Test                        608.6 lps  (9 secs, 6 samples)
File Read (10 seconds)                   1294487.0 KBps (10 secs, 6 samples)
File Write (10 seconds)                   138403.0 KBps (10 secs, 6 samples)
File Copy (10 seconds)                     19158.0 KBps (10 secs, 6 samples)
File Read (30 seconds)                   1278293.0 KBps (30 secs, 6 samples)
File Write (30 seconds)                   147556.0 KBps (30 secs, 6 samples)
File Copy (30 seconds)                     15129.0 KBps (30 secs, 6 samples)
C Compiler Test                              388.8 lpm  (60 secs, 3 samples)
Shell scripts (1 concurrent)                1063.2 lpm  (60 secs, 3 samples)
Shell scripts (2 concurrent)                 563.1 lpm  (60 secs, 3 samples)
Shell scripts (4 concurrent)                 287.4 lpm  (60 secs, 3 samples)
Shell scripts (8 concurrent)                 145.7 lpm  (60 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places           28576.1 lpm  (60 secs, 6 samples)
Recursion Test--Tower of Hanoi             16445.3 lps  (10 secs, 6 samples)


INDEX VALUES
TEST                                      BASELINE      RESULT     INDEX

Arithmetic Test (type = double)             2541.7    183902.9      72.4
Dhrystone 2 without register variables     22366.3   1263134.8      56.5
Execl Throughput Test                         16.5       608.6      36.9
File Copy (30 seconds)                       179.0     15129.0      84.5
Pipe-based Context Switching Test           1318.5    194915.8     147.8
Shell scripts (8 concurrent)                   4.0       145.7      36.4
                                                               =========
SUM of 6 items                                                     434.5
AVERAGE                                                             72.4


Attachments:
mregparm.patch (4.04 kB)
regparm3.report (3.01 kB)
report (3.00 kB)

2001-11-12 16:10:38

by Keith Owens

Subject: Re: PATCH 2.4.14 mregparm=3 compilation fixes

On Mon, 12 Nov 2001 12:28:33 +0100,
Martin Dalecki <[email protected]> wrote:
>diff -ur linux-2.4.14-2/arch/i386/Makefile linux-mdcki/arch/i386/Makefile
>--- linux-2.4.14-2/arch/i386/Makefile Thu Apr 12 21:20:31 2001
>+++ linux-mdcki/arch/i386/Makefile Sat Nov 10 00:07:17 2001
>@@ -21,7 +21,7 @@
> LDFLAGS=-e stext
> LINKFLAGS =-T $(TOPDIR)/arch/i386/vmlinux.lds $(LDFLAGS)
>
>-CFLAGS += -pipe
>+CFLAGS += -freg-struct-return -mregparm=3
>
> # prevent gcc from keeping the stack 16 byte aligned
> CFLAGS += $(shell if $(CC) -mpreferred-stack-boundary=2 -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo "-mpreferred-stack-boundary=2"; fi)

Setting mregparm must be a CONFIG_ option, with a huge warning that

A) Changing CONFIG_MREGPARM requires make mrproper.

B) Loading binary only modules into a kernel compiled with mregparm is
even more likely to destroy your kernel.

2001-11-12 16:27:00

by Christoph Hellwig

Subject: Re: PATCH 2.4.14 mregparm=3 compilation fixes

In article <[email protected]> you wrote:
> Setting mregparm must be a CONFIG_ option, with a huge warning that
>
> A) Changing CONFIG_MREGPARM requires make mrproper.

The above patch changes the kernel to always use mregparm -
it should be caught by the .flags dependencies anyway.

> B) Loading binary only modules into a kernel compiled with mregparm is
> even more likely to destroy your kernel.

Nope - people who use those are just doomed.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2001-11-12 16:47:01

by Linus Torvalds

Subject: Re: PATCH 2.4.14 mregparm=3 compilation fixes


On Mon, 12 Nov 2001, Martin Dalecki wrote:
>
> The attached patch is fixing compilation and running
> of the kernel with -mregparm=3 on IA32. The fixes excluding
> the change in arch/i386/Makefile of course apply to the stock kernel
> as well, so Linus please include it in 2.4.15 - it just won't hurt...

I certainly won't enable it in the stock kernel, considering the bad track
record gcc has had with regparm under register pressure, but the
"asmlinkage" parts look like real fixes.

However, it's kind of sad to make some of the more timing-critical stuff
(like schedule_tail) be asmlinkage - it might be worth it to do it the
other way around, and make it FASTCALL() and change the assembly code to
pass arguments in registers. That way, the calling convention is still the
same on both regparm=3 and without, but instead of defaulting to the slow
method we'd default to the fast one..
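
The distinction drawn above can be sketched with GCC's `regparm` function attribute, which is only meaningful on 32-bit x86. The two macros below are simplified stand-ins modelled on the kernel's `asmlinkage` and `FASTCALL()`, not the kernel's actual definitions, and they expand to nothing on other targets so the example still builds there.

```c
#include <assert.h>

#if defined(__i386__)
/* Pin the convention per function, independent of any -mregparm setting. */
#define asmlinkage __attribute__((regparm(0)))  /* all arguments on the stack */
#define fastcall   __attribute__((regparm(3)))  /* first three in registers */
#else
#define asmlinkage
#define fastcall
#endif

asmlinkage int sum_on_stack(int a, int b, int c);
fastcall int sum_in_regs(int a, int b, int c);

/* What an assembly caller that pushes its arguments would expect. */
asmlinkage int sum_on_stack(int a, int b, int c)
{
    return a + b + c;
}

/* What an assembly caller that loads %eax, %edx, %ecx would expect. */
fastcall int sum_in_regs(int a, int b, int c)
{
    return a + b + c;
}

/* From C the two are interchangeable; only the generated call differs. */
static int conventions_agree(int a, int b, int c)
{
    int s = sum_on_stack(a, b, c);
    assert(s == sum_in_regs(a, b, c));
    return s;
}
```

Either annotation makes the function's calling convention the same whether or not the rest of the kernel is built with -mregparm=3; the choice is which side, the C function or the assembly caller, gets adapted.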

Linus

2001-11-12 17:04:25

by Martin Dalecki

Subject: Re: PATCH 2.4.14 mregparm=3 compilation fixes

Keith Owens wrote:
>
> On Mon, 12 Nov 2001 12:28:33 +0100,
> Martin Dalecki <[email protected]> wrote:
> >diff -ur linux-2.4.14-2/arch/i386/Makefile linux-mdcki/arch/i386/Makefile
> >--- linux-2.4.14-2/arch/i386/Makefile Thu Apr 12 21:20:31 2001
> >+++ linux-mdcki/arch/i386/Makefile Sat Nov 10 00:07:17 2001
> >@@ -21,7 +21,7 @@
> > LDFLAGS=-e stext
> > LINKFLAGS =-T $(TOPDIR)/arch/i386/vmlinux.lds $(LDFLAGS)
> >
> >-CFLAGS += -pipe
> >+CFLAGS += -freg-struct-return -mregparm=3
> >
> > # prevent gcc from keeping the stack 16 byte aligned
> > CFLAGS += $(shell if $(CC) -mpreferred-stack-boundary=2 -S -o /dev/null -xc /dev/null >/dev/null 2>&1; then echo "-mpreferred-stack-boundary=2"; fi)
>
> Setting mregparm must be a CONFIG_ option, with a huge warning that
>
> A) Changing CONFIG_MREGPARM requires make mrproper.
>
> B) Loading binary only modules into a kernel compiled with mregparm is
> even more likely to destroy your kernel.

Ehmm... In fact my feeling about this is that _this part_ of the
patch _should not_ be included in the mainstream kernel at all. It
should maybe just be made the default (in 2.5 perhaps)
if it turns out that the performance, code size and other gains are
worth it, since I haven't encountered any problems thus far, even with
a "distro RPM grade kernel" containing USB, TCP and whatnot. GCC really
got better over the last years!

So there is no real need for an option at all, in my opinion.
We already have enough of them.

The REST OF THE PATCH contains only pure, true, clear-cut bugfixes
which should be applied STRAIGHT away. Those fixes do not influence
the current compilation output at all (with the exception of hiding
global symbols in misc.c that are not used externally). But they enable
somebody who knows what he is doing to add the above CFLAGS for his
system to gain a significant amount of free space, for example in the
PROM, or to gain a bit of performance - supposedly.

I hope this makes my intentions clear. OK?

BTW: Try it out, it doesn't interfere with any module handling.
However, I just don't share your objections about binary only
modules, because I just don't care about them... In particular, my
nonexistent interest in computer games doesn't compel me to
buy any NVIDIA graphics cards. A plain nice old Mach64,
which always was one of the most UNIX friendly VGA designs ever,
does just fine for me ;-).

2001-11-12 17:58:39

by Martin Dalecki

Subject: Re: PATCH 2.4.14 mregparm=3 compilation fixes

Linus Torvalds wrote:
>
> On Mon, 12 Nov 2001, Martin Dalecki wrote:
> >
> > The attached patch is fixing compilation and running
> > of the kernel with -mregparm=3 on IA32. The fixes excluding
> > the change in arch/i386/Makefile of course apply to the stock kernel
> > as well, so Linus please include it in 2.4.15 - it just won't hurt...
>
> I certainly won't enable it in the stock kernel, considering the bad track
> record gcc has had with regparm under register pressure, but the
> "asmlinkage" parts look like real fixes.

Yes, that was always my intention. The chunk changing the CFLAGS was
left in the patch only for the purpose of reference. I did hope
that I made this clear in my announcement, but I apparently failed ;-).

Despite this I would like to make clear that I have already compiled my
own "RedHat 7.2" compatible kernel-RPM set with the patch applied and
haven't encountered any problems thus far... Even an ORACLE DB just
started without noticing that anything had changed beneath it.
Since this was all done on my notebook, I can say that there were
no problems even with any of the "less mature" kernel parts
like USB handling, CardBus and so on.
(Anybody please note: I didn't say "immature", just "less mature",
more like "fresh", no pun intended.)

Apparently GCC got really much better in this regard recently.
I'm using the RedHat GCC 2.96 brand, gcc-2.96-99...
And I reiterate that I'm just happy running a whole
kernel compiled with mregparm=3 without any anomalies thus far.

> However, it's kind of sad to make some of the more timing-critical stuff
> (like schedule_tail) be asmlinkage - it might be worth it to do it the
> other way around, and make it FASTCALL() and change the assembly code to
> pass arguments in registers. That way, the calling convention is still the
> same on both regparm=3 and without, but instead of defaulting to the slow
> method we'd default to the fast one..

Yes, that's right. However, if you look closely you will notice that
asmlinkage is quite a bad name. Ideally there should be an asmlinkage
with mregparm=3 and a syslinkage macro with mregparm=0 for system call
entry points.
And then fixes are fixes, and with the current semantics my patch is
really just fixing bugs (though not "tragic" ones). So if I see
this fix applied I will make the improvements described above in 2.5
;-).
They are not difficult anyway, just a bit tedious... and then they would
affect a bit more code around there, in particular the system call
declarations, and we have a lot of them already ;-).


So long...

2001-11-12 19:13:14

by Martin Dalecki

Subject: Crosspatch patch-2.4.15-pre2 patch-2.4.15-pre3

Hello out there!

Doing an X-patch between, ehmm, the pre-patches 2 and 3, I noticed
that a call to sa1100_irda_init() will be added TWICE in
patch-2.4.15-pre3. This *may* work, but I think this isn't
quite what the inventor intended :-). So Linus/Alan, please
watch out...

It's in the file linux/net/irda/irda_device.c.
The following will be there twice after pre3:
#ifdef CONFIG_SA1100_FIR
	sa1100_irda_init();
#endif

2001-11-12 19:21:04

by Martin Dalecki

Subject: BUG BUG hunt the bugs!!! patch-2.4.15-pre5

Hello out there!

Same symptom from patch-2.4.15-pre4:

diff -u --recursive --new-file v2.4.14/linux/net/irda/irda_device.c linux/net/irda/irda_device.c
--- v2.4.14/linux/net/irda/irda_device.c	Sun Sep 23 11:41:02 2001
+++ linux/net/irda/irda_device.c	Sun Nov 11 10:20:21 2001

bla bla bla...

@@ -124,6 +127,12 @@
 #ifdef CONFIG_WINBOND_FIR
 	w83977af_init();
 #endif
+#ifdef CONFIG_SA1100_FIR
+	sa1100_irda_init();
+#endif
+#ifdef CONFIG_SA1100_FIR
+	sa1100_irda_init();
+#endif
 #ifdef CONFIG_NSC_FIR
 	nsc_ircc_init();
 #endif
@@ -151,6 +160,12 @@
 #ifdef CONFIG_OLD_BELKIN
 	old_belkin_init();
 #endif
+#ifdef CONFIG_EP7211_IR
+	ep7211_ir_init();
+#endif
+#ifdef CONFIG_EP7211_IR
+	ep7211_ir_init();
+#endif
 	return 0;

You see the initialization done twice!

2001-11-13 15:57:42

by Martin Dalecki

Subject: Merge BUG in 2.4.15-pre4 serial.c

I have found the following code in serial.c around line 5565

#ifdef __i386__
	if (i == NR_PORTS) {
		for (i = 4; i < NR_PORTS; i++)
			if ((rs_table[i].type == PORT_UNKNOWN) &&
			    (rs_table[i].count == 0))
				break;
	}
#endif
	if (i == NR_PORTS) {
		for (i = 0; i < NR_PORTS; i++)
			if ((rs_table[i].type == PORT_UNKNOWN) &&
			    (rs_table[i].count == 0))
				break;
	}

This is supposedly the result of applying some patch twice.
Let me guess the first 8 lines of this can be deleted.

Regards!

2001-11-13 16:22:32

by Russell King

Subject: Re: Merge BUG in 2.4.15-pre4 serial.c

On Tue, Nov 13, 2001 at 05:49:24PM +0100, Martin Dalecki wrote:
> I have found the following code in serial.c around line 5565
>
> #ifdef __i386__
> 	if (i == NR_PORTS) {
> 		for (i = 4; i < NR_PORTS; i++)
> 			if ((rs_table[i].type == PORT_UNKNOWN) &&
> 			    (rs_table[i].count == 0))
> 				break;
> 	}
> #endif
> 	if (i == NR_PORTS) {
> 		for (i = 0; i < NR_PORTS; i++)
> 			if ((rs_table[i].type == PORT_UNKNOWN) &&
> 			    (rs_table[i].count == 0))
> 				break;
> 	}
>
> This is supposedly the result of applying some patch twice.
> Let me guess the first 8 lines of this can be deleted.

Look at it closer, in particular the for() loops.

It's basically there so that on x86, we don't normally use ttyS0-3
for pcmcia and other similar ports, unless we run out of other ports
to use.
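
The two-pass search described above can be exercised in isolation. The sketch below keeps the shape of the quoted loops but is not the serial.c code: the table size, `PORT_IN_USE`, the `is_i386` flag standing in for the `#ifdef`, and the function names are all assumptions made for the example.

```c
#include <assert.h>

#define NR_PORTS     8   /* hypothetical table size for the sketch */
#define PORT_UNKNOWN 0
#define PORT_IN_USE  1   /* hypothetical stand-in for any detected port type */

struct serial_state {
    int type;
    int count;
};

static int slot_is_free(const struct serial_state *s)
{
    return s->type == PORT_UNKNOWN && s->count == 0;
}

/* Returns the index of a free slot, or -1 if the table is full. */
static int find_free_slot(const struct serial_state *table, int is_i386)
{
    int i = NR_PORTS;

    assert(table != 0);
    if (is_i386) {
        /* First pass: skip slots 0-3 so ttyS0-3 stay available. */
        for (i = 4; i < NR_PORTS; i++)
            if (slot_is_free(&table[i]))
                break;
    }
    if (i == NR_PORTS) {
        /* Second pass: nothing found above, so fall back to any slot. */
        for (i = 0; i < NR_PORTS; i++)
            if (slot_is_free(&table[i]))
                break;
    }
    return i == NR_PORTS ? -1 : i;
}
```

With an empty table the x86 path hands out slot 4 first, and slots 0-3 are only used once slots 4 and up are taken, which is why the first block is not a duplicate of the second.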

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-11-13 16:45:53

by Martin Dalecki

Subject: Re: Merge BUG in 2.4.15-pre4 serial.c

Russell King wrote:
>
> On Tue, Nov 13, 2001 at 05:49:24PM +0100, Martin Dalecki wrote:
> > I have found the following code in serial.c around line 5565
> >
> > #ifdef __i386__
> > 	if (i == NR_PORTS) {
> > 		for (i = 4; i < NR_PORTS; i++)
> > 			if ((rs_table[i].type == PORT_UNKNOWN) &&
> > 			    (rs_table[i].count == 0))
> > 				break;
> > 	}
> > #endif
> > 	if (i == NR_PORTS) {
> > 		for (i = 0; i < NR_PORTS; i++)
> > 			if ((rs_table[i].type == PORT_UNKNOWN) &&
> > 			    (rs_table[i].count == 0))
> > 				break;
> > 	}
> >
> > This is supposedly the result of applying some patch twice.
> > Let me guess the first 8 lines of this can be deleted.
>
> Look at it closer, in particular the for() loops.
>
> It's basically there so that on x86, we don't normally use ttyS0-3
> for pcmcia and other similar ports, unless we run out of other ports
> to use.

Well, I still think that the 8 lines can be deleted. Once again, my famous
notebook is perfectly __i386__ and doesn't contain any devices served by
serial.c unless I configure IrDA. Pushing the port numbers artificially
behind doesn't make sense for me and makes some setserial unknown tricks
necessary for irtty setup.

2001-11-13 16:54:23

by Russell King

Subject: Re: Merge BUG in 2.4.15-pre4 serial.c

On Tue, Nov 13, 2001 at 06:37:54PM +0100, Martin Dalecki wrote:
> Pushing the port numbers artificially behind doesn't make sense for me
> and makes some setserial unknown tricks necessary for irtty setup.

The key words here are "for me".

What setserial "unknown tricks" are you referring to?

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2001-11-13 17:12:43

by Martin Dalecki

Subject: Re: Merge BUG in 2.4.15-pre4 serial.c

Russell King wrote:
>
> On Tue, Nov 13, 2001 at 06:37:54PM +0100, Martin Dalecki wrote:
> > Pushing the port numbers artificially behind doesn't make sense for me
> > and makes some setserial unknown tricks necessary for irtty setup.
>
> The key words here are "for me".
>
> What setserial "unknown tricks" are you referring to?

I refer to the IrDA-HOWTO problems with the serial driver. I think
this may be precisely the culprit.

2001-11-13 17:15:23

by Alan

Subject: Re: Merge BUG in 2.4.15-pre4 serial.c

> Well, I still think that the 8 lines can be deleted. Once again, my famous
> notebook is perfectly __i386__ and doesn't contain any devices served by
> serial.c unless I configure IrDA. Pushing the port numbers artificially
> behind doesn't make sense for me and makes some setserial unknown tricks
> necessary

Renumbering everyone's serial ports by surprise seems to be a 2.5 thing

2001-11-13 17:31:16

by Martin Dalecki

Subject: Re: Merge BUG in 2.4.15-pre4 serial.c

Alan Cox wrote:
>
> > Well I still think that the 8 lines can be deleted. Once again my famous
> > notbook is perfectly __i386__ and doesn't contain any devices served by
> > serial.c
> > unless I configure IrDA. Pushing the port numbers artificially behind
> > doesn't make sense for me and makes some setserial unknown tricks
> > neccessary
>
> Renumbering everyones serial ports by suprise seems to be a 2.5 thing

OK, that's an argument with which I fully agree.

2001-11-06 23:51:08

by Martin Dalecki

Subject: Re: Using %cr2 to reference "current"

Alan Cox wrote:
>
> > If we are talking about memory bloat, let's ask a question. Is somebody
> > there working seriously on changing the default function call conventions
> > on IA32
>
> That's pure noise
>
> On a 256MB machine you have 65536 page map entries. Those are 64 bytes but
> it's not hard to get it down to 56 bytes (.5MB saved) and probably to 48
> bytes. We can probably also shave 8 bytes off each cached inode if not
> more (the nfs changes in -ac are a big help there already) - that's typically
> another 200K on a reasonable size box - and the new bootmem code can save a
> chunk too
>
> I'm not sure how much the code change for function call patterns would be
> but I doubt it's so big for such little effort
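
The page map figures quoted above can be checked with a few lines of C.
This is only a sketch of the arithmetic: it assumes 4 KiB pages, and the
function names are made up for illustration, not taken from the kernel.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, the usual IA32 page size. */
#define PAGE_SIZE_BYTES 4096UL

/* Number of page map entries needed to describe ram_bytes of memory. */
unsigned long page_map_entries(unsigned long ram_bytes)
{
    return ram_bytes / PAGE_SIZE_BYTES;
}

/* Bytes saved when every entry shrinks from old_size to new_size. */
unsigned long page_map_savings(unsigned long ram_bytes,
                               unsigned long old_size,
                               unsigned long new_size)
{
    assert(new_size <= old_size);
    return page_map_entries(ram_bytes) * (old_size - new_size);
}
```

For a 256MB machine, page_map_entries(256UL << 20) gives the 65536
entries mentioned, and shrinking each entry from 64 to 56 bytes saves
65536 * 8 bytes, i.e. the quoted .5MB.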

Please also count the removal of the *very* sparse read_ahead array
(the patch went to this list a long time ago).
It doesn't cost anything and saves a few pages, depending on the
number of drivers you have loaded... (Well, in comparison to the above
that's nitpicking, but...)

And then there is the overloaded struct inode. It would be worth
quite a bit of memory to not overlay the private, filesystem-specific
parts but to attach them by a pointer instead, especially
if you do this in a way where the private part can be used
without the public interface in drivers. Currently the most ridiculous
inode layout determines the overall size in the compiled
kernel.
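
To illustrate the point about the overlay, here is a small sketch in
plain C. All struct names and sizes are invented for the example; they
are not the real kernel definitions. It only shows that a union makes
every inode as large as its largest member, while a pointer keeps the
generic part small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-filesystem private data of very different sizes. */
struct small_fs_info { int block_group; };
struct large_fs_info { char state[256]; };

/* Overlay layout: every inode is at least as big as the largest member. */
struct inode_overlay {
    unsigned long ino;
    union {
        struct small_fs_info small;
        struct large_fs_info large;
    } u;
};

/* Pointer layout: the generic part stays small; the private data is
 * allocated separately by whichever filesystem owns the inode. */
struct inode_pointer {
    unsigned long ino;
    void *fs_private;
};

/* Bytes by which the generic inode shrinks with the pointer layout. */
size_t overlay_overhead(void)
{
    assert(sizeof(struct inode_overlay) >= sizeof(struct inode_pointer));
    return sizeof(struct inode_overlay) - sizeof(struct inode_pointer);
}
```

Note that the saving applies to inodes of filesystems with small private
parts; a filesystem with large private data still has to allocate it,
plus the pointer.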

2001-11-07 00:21:34

by Alan

Subject: Re: Using %cr2 to reference "current"

> Please also count the removal of the *very* sparse read_ahead array
> (the patch went to this list a long time ago).
> It doesn't cost anything and saves a few pages, depending on the
> number of drivers you have loaded... (Well, in comparison to the above
> that's nitpicking, but...)

Sounds quite believable. Several of the hashes are oversized too, it seems

> And then there is the overloaded struct inode. It would be worth
> quite a bit of memory to not overlay the private, filesystem-specific
> parts but to attach them by a pointer instead, especially

That's what -ac has started doing. Al Viro has done the worst-case ones
so far.

Alan

2001-11-07 00:37:26

by Jeff Garzik

Subject: Re: Using %cr2 to reference "current"

Martin Dalecki wrote:
> And then there is the overloaded struct inode. It would be worth
> quite a bit of memory to not overlay the private, filesystem-specific
> parts but to attach them by a pointer instead, especially
> if you do this in a way where the private part can be used
> without the public interface in drivers.

I think there are plans for several filesystems to use the generic_ip
and generic_sbp members of the unions, instead of further adding to the
unions.

FreeVxFS is an example of one such filesystem which already does this.
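
As a rough sketch of what using generic_ip looks like from a
filesystem's side (userspace C, with malloc standing in for the kernel
allocator; struct inode here is a cut-down stand-in and every myfs_*
name is hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for struct inode; only the generic_ip member
 * of the union is modelled. */
struct inode {
    unsigned long i_ino;
    union {
        void *generic_ip;
    } u;
};

/* Hypothetical private data for an imaginary filesystem "myfs". */
struct myfs_inode_info {
    unsigned long first_block;
};

/* Accessor: recover the private data from the generic inode. */
struct myfs_inode_info *myfs_i(struct inode *inode)
{
    assert(inode != NULL);
    return (struct myfs_inode_info *)inode->u.generic_ip;
}

/* Attach freshly allocated private data; returns 0 on success. */
int myfs_attach(struct inode *inode, unsigned long first_block)
{
    struct myfs_inode_info *info = malloc(sizeof(*info));

    if (!info)
        return -1;
    info->first_block = first_block;
    inode->u.generic_ip = info;
    return 0;
}

/* Release the private data when the inode is torn down. */
void myfs_detach(struct inode *inode)
{
    free(inode->u.generic_ip);
    inode->u.generic_ip = NULL;
}
```

The filesystem pays for one extra allocation per inode, but adds nothing
to the size of the union.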

--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno

2002-11-10 21:17:17

by Igor Levicki

Subject: Re: Using %cr2 to reference "current"


Hi,

>I could well imagine a x86-compatible chip where %cr2 isn't even
>writable. In fact, reading the intel documentation, I see _nowhere_ a
>mention of %cr2 being writable at all - it all just says "contains the
>fault address".

From the Intel System Programmer's Guide:

"The control registers can be read and loaded (or modified) using the
move-to-or-from-control-registers forms of the MOV instruction. In
protected mode, the MOV instructions allow the control registers to be
read or loaded (at privilege level 0 only). This restriction means that
application programs or operating-system procedures (running at
privilege levels 1, 2, or 3) are prevented from reading or loading the
control registers.
When loading the control register, reserved bits should always be set
to the values previously read."
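
The last sentence of that quote describes a read-modify-write pattern.
A minimal sketch of it in ordinary C follows; it works on plain integers
rather than a real control register (which can only be touched at
privilege level 0), and the function and parameter names are mine, not
Intel's.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the value to load into a register: bits selected by
 * writable_mask are taken from wanted, all other (reserved) bits keep
 * the value that was previously read. */
uint32_t control_reg_update(uint32_t previously_read,
                            uint32_t writable_mask,
                            uint32_t wanted)
{
    return (previously_read & ~writable_mask) | (wanted & writable_mask);
}
```

With an empty writable_mask the function returns previously_read
unchanged, which is the "reserved bits keep their old value" rule in its
simplest form.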

>(I don't know what the effect of the P4 half-cacheline
>thing is, I don't know if the CPU can have just a 64-byte block coherent,
>or what..

The cache sector size is 64 bytes on the Pentium 4. When the CPU reads
from memory it reads 2 sectors x 64 bytes = one 128-byte cache line. The
hardware prefetcher fetches 2 x 128-byte cache lines = 256 bytes of
memory. On a write, the CPU always writes 64 bytes.
Now if you read 16 bytes from some address, then for example add
something to them and write them back to the same address, you will pay
a penalty when you read the next 16 bytes from that address, because you
have just trashed the 64-byte sector and you have to wait for
back-propagation.
Hope this helps.
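
To make the sector arithmetic concrete, here is a small sketch that maps
an address to its 64-byte sector and 128-byte line. The sizes are the
ones given above, taken as assumptions; the helper names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES 64u   /* sector size as stated above */
#define LINE_BYTES   128u  /* 2 sectors per cache line */

/* Start address of the 64-byte sector containing addr. */
uintptr_t sector_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(SECTOR_BYTES - 1);
}

/* Start address of the 128-byte cache line containing addr. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(LINE_BYTES - 1);
}

/* Nonzero if both addresses fall into the same 64-byte sector. */
int same_sector(uintptr_t a, uintptr_t b)
{
    return sector_base(a) == sector_base(b);
}
```

A 16-byte write at some address and a following 16-byte read at the
next 16 bytes will usually satisfy same_sector(), which is the case
described above where the read has to wait.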

Regards,
Igor Levicki
[email protected]