LinuxLists.cc - Using %cr2 to reference "current"

2001-11-06 07:19:07

Subject: Using %cr2 to reference "current"

2.4.13-ac8 uses %cr2 rather than (%esp & 0xffffe000) to get "current".
I've been trying to figure out the point of this... writing a control
register is microcode on all the x86 implementations I know (and you
have to re-set it after every pagefault), and reading one probably is
too on most (not Transmeta, but...)

On the other hand, %esp is a GPR and available to the core directly,
and so are usually plain immediates.
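
For comparison, a minimal user-space sketch of the %esp-masking computation, assuming the 8 kB, 8 kB-aligned block that holds both task_struct and the kernel stack in stock 2.4 on i386. The helper name current_from_sp is hypothetical and only illustrates the arithmetic.

```c
#include <stdint.h>

/* Assumption: task_struct and kernel stack share one 8 kB block,
 * aligned to 8 kB, as in stock 2.4 on i386. */
#define THREAD_SIZE 8192u

/* Hypothetical helper: given any stack pointer value inside the block,
 * return the block's base address, where task_struct would live. */
static inline uint32_t current_from_sp(uint32_t sp)
{
	return sp & ~(THREAD_SIZE - 1u);	/* same as sp & 0xffffe000 */
}
```

The whole operation is one AND with an immediate on a general-purpose register, which is the cost any replacement has to beat.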

Is using %cr2 really faster than the old implementation, or is there
another reason? It seems that the alignment constraints on the stack
still remain, since the %esp solution still remains in places...

It might also be worth considering a segment-register based
implementation instead. The reason we're not using %fs and %gs in the
kernel anymore is because of the setup slowness, but perhaps using
them (use %fs since it's much more likely to be NULL and thus faster
to restore) would be faster than using %cr2?

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-06 08:01:46

by Robert Love

Subject: Re: Using %cr2 to reference "current"

On Tue, 2001-11-06 at 02:18, H. Peter Anvin wrote:
> 2.4.13-ac8 uses %cr2 rather than (%esp & 0xffffe000) to get "current".
> I've been trying to figure out the point of this... <snip>

I too am confused. More so, the difference between hard_get_current and
get_current is confusing. I further question things because I suspect
there is a problem: hard_get_current is commented as "for within NMI,
do_page_fault, cpu_init" but all these functions call other functions
that may very well use get_current. How is this going to work?

Further, the preemptible kernel patch oopses with this patch (IOW, don't
use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
did :>). I think it may be because of:

Manfred Spraul wrote:
> error_code:
> [...]
> - GET_CURRENT(%ebx)
> call *%edi
> addl $8,%esp
> + GET_CURRENT(%ebx)
> The pointer to current was loaded into %ebx before the call to the error
> handler, now that only happens after the call. As far as I can see the
> load before the call is not required.

this change but I am unsure. Would Manfred or someone knowledgeable in
this mind letting me pick their brain?

Robert Love

2001-11-06 10:48:49

by Alan

Subject: Re: Using %cr2 to reference "current"

> I too am confused. More so, the difference between hard_get_current and
> get_current is confusing. I further question things because I suspect

hard_get_current always works
get_current assumes %cr2 is loaded correctly

> do_page_fault, cpu_init" but all these functions call other functions
> that may very well use get_current. How is this going to work?

do_page_fault and cpu_init load %cr2

> Further, the preemptible kernel patch oopses with this patch (IOW, don't
> use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
> did :>). I think it may be because of:

You must ensure that you don't pre-empt until %cr2 is loaded. Obviously this
isn't a problem with the traditional low latency patch but if you pre-empt
very early in page fault handling then I suspect you might get the odd
surprise.

The reasoning behind all this is to fix the cache pessimal nature of the x86
stack layout - we had all task structs on the same cache colour and all
stacks aligned within pages (so every apache thread waiting at the same
point is on the same colour too and each wait queue entry on their stacks
is linked to entries all the same colour)
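
A minimal sketch of why identical alignment hurts, assuming a purely hypothetical cache geometry (32-byte lines, 128 sets). The helper cache_set is illustrative and not taken from any kernel source.

```c
#include <stdint.h>

/* Hypothetical cache geometry, chosen only for illustration. */
#define LINE_SIZE 32u	/* bytes per cache line */
#define NUM_SETS  128u	/* e.g. 16 kB, 4-way: 16384 / (32 * 4) */

/* Set ("colour") that an address maps to. */
static inline unsigned cache_set(uint32_t addr)
{
	return (addr / LINE_SIZE) % NUM_SETS;
}
```

With this geometry the sets repeat every 4 kB, so every 8 kB-aligned task struct lands in set 0 and competes for the same few ways, while offsetting each structure by a different multiple of the line size spreads them out.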

Alan

2001-11-06 10:51:40

by Alan

Subject: Re: Using %cr2 to reference "current"

> Is using %cr2 really faster than the old implementation, or is there
> another reason? It seems that the alignment constraints on the stack
> still remains, since the %esp solution still remains in places...

The stack is no longer aligned. We allocate two pages and disturb the stack
by up to 1.5K. We slab the task structs.

> It might also be worth considering a segment-register based
> implementation instead. The reason we're not using %fs and %gs in the
> kernel anymore is because of the setup slowness, but perhaps using
> them (use %fs since it's much more likely to be NULL and thus faster
> to restore) would be faster than using %cr2?

It may be. Likewise it's not clear if %cr2 should hold current or a cpu ident
pointer (so you don't reload on switch of task). This needs more
benchmarking. It's in current -ac to verify the theory is correct, not the
tuning.

2001-11-06 14:15:04

by Manfred Spraul

Subject: Re: Using %cr2 to reference "current"

Robert Love wrote:
>
> Further, the preemptible kernel patch oopses with this patch (IOW, don't
> use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
> did :>). I think it may be because of:
>

Could you send me an oops?
I assume that a
set_current(hard_get_current());
is missing somewhere.
The assumption is that get_current() is faster than hard_get_current(),
and that there are so many get_current() calls that the overhead for the
set_current() in __switch_to and do_page_fault is small.

> Manfred Spraul wrote:
> > error_code:
> > [...]
> > - GET_CURRENT(%ebx)
> > call *%edi
> > addl $8,%esp
> > + GET_CURRENT(%ebx)
> > The pointer to current was loaded into %ebx before the call to the error
> > handler, now that only happens after the call. As far as I can see the
> > load before the call is not required.
>
> this change but I am unsure. Would Manfred or someone knowledgeable in
> this mind letting me pick their brain?
>
I would be very surprised if that's a problem: the error handlers are C
functions, and they don't expect parameters in register %ebx.

--
Manfred

2001-11-06 17:06:11

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

In article <[email protected]>,
H. Peter Anvin <[email protected]> wrote:
>
>Is using %cr2 really faster than the old implementation, or is there
>another reason? It seems that the alignment constraints on the stack
>still remains, since the %esp solution still remains in places...

I think the _real_ issue with that patch is that %cr2 is by no means
architecturally even guaranteed to work the way the patches want it to
work.

It's simply not a general-purpose register, and I don't see why it is
assumed to be (a) fast (b) stable and (c) writable.

I could well imagine an x86-compatible chip where %cr2 isn't even
writable. In fact, reading the intel documentation, I see _nowhere_ a
mention of %cr2 being writable at all - it all just says "contains the
fault address".

Similarly, there is _nothing_ that guarantees that the low bits of %cr2
are meaningful, writable, or even implemented.

Which means that the whole approach is just depending on undocumented
implementation behaviour. That's asking for trouble.

Linus

2001-11-06 17:08:11

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>It may be. Likewise it's not clear if %cr2 should hold current or a cpu ident
>pointer (so you don't reload on switch of task). This needs more
>benchmarking. It's in current -ac to verify the theory is correct, not the
>tuning.

We pretty much know the _theory_ is not correct, just by virtue of
depending on non-architected behaviour. The only thing -ac can do is
test whether it works in practice. Which is a totally different thing.

Especially on x86 chips.

Linus

2001-11-06 17:14:21

by Benjamin LaHaise

Subject: Re: Using %cr2 to reference "current"

On Tue, Nov 06, 2001 at 05:02:32PM +0000, Linus Torvalds wrote:
> Which means that the whole approach is just depending on undocumented
> implementation behaviour. That's asking for trouble.

NetWare uses it and has for a long time.

-ben

2001-11-06 17:39:24

by Alan

Subject: Re: Using %cr2 to reference "current"

> We pretty much know the _theory_ is not correct, just by virtue of
> depending on non-architected behaviour. The only thing -ac can do is
> test whether it works in practice. Which is a totally different thing.

Yep

> Especially on x86 chips.

Well so far I've found one laptop that eats %cr2 on APM calls, and we have
some mystery cases. Peter's suggestion of using %fs or %gs looks more
promising at the moment

2001-11-06 17:37:34

by Michael Barabanov

Subject: Re: Using %cr2 to reference "current"

Here's my version of hard cpu id (RTLinux version):

extern inline int rtl_getcpuid(void)
{
	unsigned cpu;
	__asm__ (
		"str %%ax\n\t"
		"shr $5, %%eax\n\t"
		"sub $3, %%eax\n\t"
		: "=a"(cpu));
	return cpu;
}

No cr2 involved; extremely fast. This takes advantage of the fact that
TSS-CPU mapping is 1-1 in 2.4.
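
The shift and subtract can be checked in plain C. This is a sketch under an assumption about the 2.4 i386 GDT layout (first TSS descriptor at index 12, four descriptors per CPU); the helpers tss_selector and cpu_from_selector are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Assumption: 2.4 i386 GDT layout, where CPU n's TSS descriptor sits at
 * GDT index 12 + 4*n (first TSS entry 12, four descriptors per CPU). */
#define FIRST_TSS_ENTRY  12u
#define ENTRIES_PER_CPU   4u

/* Selector that "str" would return on CPU n (GDT, RPL 0): index * 8. */
static inline uint16_t tss_selector(unsigned cpu)
{
	return (uint16_t)((FIRST_TSS_ENTRY + ENTRIES_PER_CPU * cpu) << 3);
}

/* The arithmetic from rtl_getcpuid(): shift right by 5, subtract 3. */
static inline unsigned cpu_from_selector(uint16_t sel)
{
	return (unsigned)(sel >> 5) - 3u;
}
```

Under that layout CPU 0 has selector 96 (96 >> 5 = 3) and each further CPU adds 32, so the result is the CPU number.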

Michael.

Alan Cox ([email protected]) wrote:
> > I too am confused. More so, the difference between hard_get_current and
> > get_current is confusing. I further question things because I suspect
>
> hard_get_current always works
> get_current assumes %cr2 is loaded correctly
>
> > do_page_fault, cpu_init" but all these functions call other functions
> > that may very well use get_current. How is this going to work?
>
> do_page_fault and cpu_init load %cr2
>
> > Further, the preemptible kernel patch oopses with this patch (IOW, don't
> > use 2.4.13-ac8 + preempt-kernel, unless you remove all these bits like I
> > did :>). I think it may be because of:
>
> You must ensure that you don't pre-empt until %cr2 is loaded. Obviously this
> isnt a problem with the traditional low latency patch but if you pre-empty
> very early in page fault handling then I suspect you might get the odd
> suprise.
>
> The reasoning behind all this is to fix the cache pessimal nature of the x86
> stack layout - we had all task structs on the same cache colour and all
> stacks aligned within pages (so every apache thread waiting at the same
> point is on the same colour too and each wait queue entry on their stacks
> is linked to entries all the same colour)
>
> Alan

2001-11-06 17:53:14

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Benjamin LaHaise wrote:
>
> On Tue, Nov 06, 2001 at 05:02:32PM +0000, Linus Torvalds wrote:
> > Which means that the whole approach is just depending on undocumented
> > implementation behaviour. That's asking for trouble.
>
> NetWare uses it and has for a long time.

Does anybody know if WNT uses it? Quite frankly, I don't see Intel
worrying over-much about NetWare compatibility. They've broken small OS's
before (ie older versions of SCO Xenix wouldn't boot on a Pentium MMU
because of some changes to error reporting, if I remember correctly).

That said, how expensive is loading %cr2 anyway? We can do all the same
tricks with a 16kB stack and just playing games with using the higher bits
as the "offset", ie things like

/* Return "current" in %eax, trash %edx */
do_get_current:
	movl $0x0003c000,%eax	// 4 bits at bit 14
	movl $-16384,%edx	// remove low 14 bits
	andl %esp,%eax
	andl %esp,%edx
	shrl $7,%eax		// color it by 128 bytes
	addl %edx,%eax
	ret

which is going to be ~5 cycles _without_ doing anything that is
undocumented (add a push/pop to not trash a register, that might be
worthwhile - it makes the function marginally slower but might make
callers happier).

Oh, and call using inline assembly, not a C call (so that gcc can take
advantage of better calling convention, and not think memory is trashed
etc). So

static inline struct task_struct *get_current(void)
{
	struct task_struct *tsk;
	asm("call do_get_current":"=a" (tsk)::"dx");
	return tsk;
}

See? You don't have to play games with control registers.

(actually, entry.S seems to want the return value in %ebx, so change to
taste. Or you could have two different versions of the thing, or even
inline it for any place where that makes sense).

The above also allows you to keep fork with just one allocation, and makes
the stack larger (we steal 2kB for the coloring, but we'd use an order-2
allocation that at least SGI wants to do regardless).

The 2kB is, of course, tunable. The above is with a 128-byte cacheline and
16 colors - that may be overkill. 32-byte increments with 32 colors might
be more appropriate (I don't know what the effect of the P4 half-cacheline
thing is, I don't know if the CPU can have just a 64-byte block coherent,
or what.. But a 32-byte color is fine for _most_ CPU's).

The 32-byte by 32-color thing would just change the bitmasks to 0x0007c000
and the shift to 9 (bit 14+ shifted down to bit 5+).
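
Both tunings can be expressed with one helper. This is a user-space sketch of the arithmetic only, assuming a 16 kB-aligned stack block; colored_current and its parameters are hypothetical names, not part of any patch.

```c
#include <stdint.h>

#define STACK_SHIFT 14u	/* 16 kB block: low 14 bits are the offset */
#define STACK_MASK  (~((1u << STACK_SHIFT) - 1u))

/* Hypothetical helper mirroring do_get_current:
 * colours     - number of colours (power of two), taken from the address
 *               bits just above the 16 kB block offset
 * color_shift - log2 of the byte step between colours */
static inline uint32_t colored_current(uint32_t sp, uint32_t colours,
				       unsigned color_shift)
{
	uint32_t color_mask = (colours - 1u) << STACK_SHIFT;	/* e.g. 0x0003c000 */
	uint32_t base = sp & STACK_MASK;			/* andl %esp,%edx */
	uint32_t offset = (sp & color_mask) >> (STACK_SHIFT - color_shift);

	return base + offset;
}
```

Calling it with (16, 7) gives the 0x0003c000 mask and shift of 7 from the assembly above, and (32, 5) gives the 0x0007c000 mask and shift of 9. The largest offset is 15 * 128 = 1920 bytes in the first case, which is where the roughly 2 kB of stolen stack comes from.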

Note that there are lots of advantages to using simple regular
instructions over using "special" instructions like "move from control
register". Historically, the special instructions tend to always become
slower, while the regular instructions become faster.

I would not be surprised if "mov %cr2,%reg" will break a netburst trace
cache entity, or even cause microcode to be executed. While I _guarantee_
that all future Intel CPU's will continue to be fast at mixtures of simple
arithmetic operations like "add" and "and".

(And I bet that the likelihood of Intel speeding up shifts in the next P4
derivative is a _lot_ higher than Intel speeding up "mov %cr2,xx"..)

Linus

2001-11-06 18:02:46

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Alan Cox wrote:
>
> > Especially on x86 chips.
>
> Well so far I've found one laptop that eats %cr2 on APM calls, and we have
> some mystery cases.

Well, APM is going away, and it should be easy enough to work around it
(and I don't _think_ you can reasonably do the same in ACPI or SMM: SMM
will save the whole CPU state and has to do that anyway, and ACPI doesn't
actually get to touch things like %cr2).

So I'd be more nervous about future CPU's just not having the register
writable (or having only parts of it, or..)

> Peter's suggestion of using %fs or %gs looks more
> promising at the moment

The problem with using a segment register is that then you have to
save/restore it over system calls - pretty much whether the call needs it
or not. Ie you can pretty much _guarantee_ that any system call will be
slowed down by something on the order of 10-15 cycles (on a good day, some
CPU's are slower at it). Same goes for task switch etc.

Which is why I'd much rather just color using the high bits of %esp, and
spend a few more cycles inside "get_current()". I can guarantee you that
it won't slow down paths that don't even need current at all (unlike the
segment register approach), and even the paths that _do_ need current will
only be ~5 cycles slower (plus possibly the cache miss of doing the
function call, but the call-site itself will actually be slightly smaller
than the current in-lined 32-bit immediate and "andl").

Using high bits of %esp has zero impact on task-switch, and makes
"get_current" interrupt safe (ie switching tasks is totally atomic, as
it's the one single "movl ..,%esp" instruction that does the real switch
as far as the kernel is concerned).

It does require using an order-2 allocation, which the current VM will
allow anyway, but which is obviously nastier than an order-1.

Linus

2001-11-06 18:08:06

by Alan

Subject: Re: Using %cr2 to reference "current"

> "get_current" interrupt safe (ie switching tasks is totally atomic, as
> it's the one single "movl ..,%esp" instruction that does the real switch
> as far as the kernel is concerned).
>
> It does require using an order-2 allocation, which the current VM will
> allow anyway, but which is obviously nastier than an order-1.

I've seen boxes dead in the water from 8K NFS (ie 16K order-2 allocations),
let alone the huge memory hit. Michael's rtlinux approach looks even more
interesting and I may have to play with that (using the TSS to ident the
cpu)

Our memory bloat is already pretty gross in 2.4 without adding 16K task
stacks to the oversized struct page, bootmem and excess doubly linked lists.

I also need to try sticking a pointer to the task struct at the top of the
stack and loading that - since that should be a cache line that isnt being
shared around or swapped between processors

2001-11-06 18:13:16

by Alan

Subject: Re: Using %cr2 to reference "current"

> That said, how expensive is loading %cr2 anyway? We can do all the same
> tricks with a 16kB stack and just playing games with using the higher bits
> as the "offset", ie things like

So thats another 600K on my box vanished. I suspect the page faults will
outweigh it

> the stack larger (we steal 2kB for the coloring, but we'd use an order-2
> allocation that at least SGI wants to do regardless).

16K stack is serious "people who cant program" country.

> I would not be surprised if "mov %cr2,%reg" will break a netburst trace
> cache entity, or even cause microcode to be executed. While I _guarantee_
> that all future Intel CPU's will continue to be fast at mixtures of simple
> arithmetic operations like "add" and "and".

True enough, but then we can go to

andl %%esp, %0
movl (%%eax), %%eax

which doesn't really change the cost much, lets us colour the task structs
nicely, and lets us colour the stack somewhat by offsetting esp from the base
- and all in standard instructions
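
The same mask-then-load idea in user-space C, as a rough sketch. It assumes an 8 kB, 8 kB-aligned stack block with the task pointer stored in its first word; alloc_stack and current_via_stack_slot are hypothetical helpers, and struct task_struct here is a stand-in.

```c
#include <stdint.h>
#include <stdlib.h>

#define STACK_SIZE 8192u	/* assumption: 8 kB, 8 kB-aligned stack block */

struct task_struct { int pid; };	/* stand-in, not the kernel's definition */

/* Allocate an aligned stack block and store the task pointer in its first word. */
static void *alloc_stack(struct task_struct *tsk)
{
	void *stack = aligned_alloc(STACK_SIZE, STACK_SIZE);

	if (stack)
		*(struct task_struct **)stack = tsk;
	return stack;
}

/* Mask any address inside the block down to its base, then load the pointer. */
static struct task_struct *current_via_stack_slot(const void *sp)
{
	uintptr_t base = (uintptr_t)sp & ~(uintptr_t)(STACK_SIZE - 1u);

	return *(struct task_struct **)base;
}
```

Because the task struct is reached through a pointer, it no longer has to sit at a fixed offset from the stack and can be allocated wherever the colouring works best.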

Alan

2001-11-06 18:15:36

by Marcelo Tosatti

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Alan Cox wrote:

> > "get_current" interrupt safe (ie switching tasks is totally atomic, as
> > it's the one single "movl ..,%esp" instruction that does the real switch
> > as far as the kernel is concerned).
> >
> > It does require using an order-2 allocation, which the current VM will
> > allow anyway, but which is obviously nastier than an order-1.
>
> I've seen boxes dead in the water from 8K NFS (ie 16K order-2 allocations),
> let alone the huge memory hit. Michael's rtlinux approach looks even more
> interesting and I may have to play with that (using the TSS to ident the
> cpu)

Btw, I also want to see what intense "for-optimization" high-order
allocators are going to do to the current VM.

Think about the possible intensive pressure (and CPU wasted) caused by,
for example, SCSI code which _always_ tries to do order-1 allocations (or
bigger?) to allocate scatter/gather tables. We want those allocations to
fall back to order-0 allocations instead of looping madly inside the VM
freeing routines.

2001-11-06 18:18:06

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Alan Cox wrote:
>
> Our memory bloat is already pretty gross in 2.4 without adding 16K task
> stacks to the oversized struct page, bootmem and excess doubly linked lists.

There are some people who think that the 5kB stack we have now is too
small ;(

> I also need to try sticking a pointer to the task struct at the top of the
> stack and loading that - since that should be a cache line that isnt being
> shared around or swapped between processors

That should work fairly well, and has the advantage that you can hide more
state there if you want (ie it allows us, on demand, to move hot state of
"struct task_struct" up there).

There is a subset of "struct task_struct" that is basically completely
local to the task, and could be advantageous to move around. Things like

- need_resched/sigpending/process attributes
- ptrace
- processor
- addr_limit

are all things that we don't actually _need_ to go all the way to the task
structure to fetch, and that we mostly need to modify anyway on task
switch (ie "need_resched" and "processor" both need to be written on
task-switch anyway, and are not touched by any other CPU)

So it would basically be a small per-CPU/thread area, not just the "struct
task_struct".

Linus

2001-11-06 18:24:56

by Alan

Subject: Re: Using %cr2 to reference "current"

> > Our memory bloat is already pretty gross in 2.4 without adding 16K task
> > stacks to the oversized struct page, bootmem and excess doubly linked lists.
>
> There are some people who think that the 5kB stack we have now is too
> small ;(

Yes but we don't want to let them win or next year 16K will be too small and
then they'll want 16K C++ stack objects. At the very least we should
make them have to use

really_slow_vmalloc_and_switch_to_big_temporary_stack()
really_slow_vfree_and_return_to_old_stack()

_and_ make them type function names that long.

Granted its less of an issue in 2.5 because we can afford to finally make
DMA off the stack a crime (right now its an offence but one that is violated
in too many places to be sure of killing them all off) - scsi for one does
it.

> That should work fairly well, and has the advantage that you can hide more
> state there if you want (ie it allows us, on demand, to move hot state of
> "struct task_struct" up there).

Sweet. Now that I'd completely missed. Task private state and task
public state splitting

> So it would basically be a small per-CPU/thread area, not just the "struct
> task_struct".

Yep

Alan

2001-11-06 18:43:09

by Benjamin LaHaise

Subject: Re: Using %cr2 to reference "current"

On Tue, Nov 06, 2001 at 09:49:15AM -0800, Linus Torvalds wrote:
> That said, how expensive is loading %cr2 anyway? We can do all the same
> tricks with a 16kB stack and just playing games with using the higher bits
> as the "offset", ie things like

Here are some numbers:

read cr2 best: 11 av: 11.12
write cr2 cr2 best: 61 av: 64.42
read cr2 best: 11 av: 11.12
write cr2 cr2 best: 61 av: 65.01
read stk best: 10 av: 11.03
write cr2 stk best: 61 av: 64.95
read stk best: 10 av: 11.03
write cr2 stk best: 61 av: 65.23

Which come from insmod of the below two modules. I didn't test writing to
the stack register, but I expect it's similarly expensive as it affects the
call return stack and other behind the scenes dependencies. Suffice it to
say that reading %cr2 is essentially free on my box (athlon mp). Maybe
we should use it as a pointer into a per-cpu area to avoid writing it?

-ben

----teststk_k.c----
#define USE_STK 1
#include "testcr2_k.c"
----testcr2_k.c----
#include <linux/module.h>
#include <linux/kernel.h>
#include <asm/errno.h>
#include <linux/init.h>

/* Read the CPU timestamp counter. */
static inline long long rdtsc(void)
{
	unsigned int low,high;
	__asm__ __volatile__("rdtsc" : "=a" (low), "=d" (high));
	return low + (((long long)high)<<32);
}

long dummy;

/* Time one "read": either the %esp arithmetic or a move from %cr2. */
long doit(void)
{
	long long start, end;
	long val;

	start = rdtsc();
#ifdef USE_STK
#define WHICH "stk"
	__asm__ __volatile__(
		"movl $0x0003c000,%%eax \n" // 4 bits at bit 14
		"movl $-16384,%%edx \n" // remove low 14 bits
		"andl %%esp,%%eax \n"
		"andl %%esp,%%edx \n"
		"shrl $7,%%eax \n" // color it by 128 bytes
		"addl %%edx,%%eax \n"
		: "=a" (val) :: "edx");
#else
#define WHICH "cr2"
	__asm__ __volatile__("movl %%cr2,%0" : "=r" (val));
#endif
	val += 100;
	dummy = val;
	end = rdtsc();

	return end - start;
}

/* Time one write to %cr2. */
long doit2(void)
{
	long long start, end;
	long val;

	start = rdtsc();
	val = dummy;
	/* val is an input operand here; "=r" would leave the register undefined */
	__asm__ __volatile__("movl %0,%%cr2" : : "r" (val));
	end = rdtsc();

	return end - start;
}

int test_init (void)
{
	long min = 1000000000, av = 0;
	int i;
	for (i=0; i<100; i++) {
		long dur = doit();
		if (dur < min)
			min = dur;
		av += dur;
	}
	printk("read " WHICH " best: %ld av: %ld.%02ld\n", min, av / 100, av % 100);

	min = 10000000;
	av = 0;
	for (i=0; i<100; i++) {
		long dur = doit2();
		if (dur < min)
			min = dur;
		av += dur;
	}
	printk("write cr2 " WHICH " best: %ld av: %ld.%02ld\n", min, av / 100, av % 100);
	/* fail the load on purpose so the module never stays resident */
	return -ENODEV;
}

void test_exit(void)
{
	return;
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");
---snip---

2001-11-06 19:10:38

by H. Peter Anvin

Subject: Re: Using %cr2 to reference "current"

Followup to: <[email protected]>
By author: Benjamin LaHaise <[email protected]>
In newsgroup: linux.dev.kernel
>
> On Tue, Nov 06, 2001 at 09:49:15AM -0800, Linus Torvalds wrote:
> > That said, how expensive is loading %cr2 anyway? We can do all the same
> > tricks with a 16kB stack and just playing games with using the higher bits
> > as the "offset", ie things like
>
> Here are some numbers:
>
> read cr2 best: 11 av: 11.12
> write cr2 cr2 best: 61 av: 64.42
> read cr2 best: 11 av: 11.12
> write cr2 cr2 best: 61 av: 65.01
> read stk best: 10 av: 11.03
> write cr2 stk best: 61 av: 64.95
> read stk best: 10 av: 11.03
> write cr2 stk best: 61 av: 65.23
>
> Which come from insmod of the below two modules. I didn't test writing to
> the stack register, but I expect it's similarly expensive as it affects the
> call return stack and other behind the scenes dependencies. Suffice it to
> say that reading %cr2 is essentially free on my box (athlon mp). Maybe
> we should use it as a pointer into a per-cpu area to avoid writing it?
>

You still have to write it every time you take a page fault. You're
adding 60-odd cycles to the page fault path at least.

Not to mention any system which does microcoded reads of %cr2, which
apparently the Athlon XP doesn't.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt <[email protected]>

2001-11-06 19:16:48

by Dave Jones

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Benjamin LaHaise wrote:

> Here are some numbers:
> Which come from insmod of the below two modules. I didn't test writing to
> the stack register, but I expect it's similarly expensive as it affects the
> call return stack and other behind the scenes dependencies. Suffice it to
> say that reading %cr2 is essentially free on my box (athlon mp). Maybe
> we should use it as a pointer into a per-cpu area to avoid writing it?

If this is done, it should perhaps be done only on certain x86s,
as some show the results go the other way. For example, the Cyrix III..

read stk best: 42 av: 42.60
read cr2 best: 61 av: 61.28

regards,

Dave.

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2001-11-06 20:10:43

by Ricky Beam

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001, Dave Jones wrote:
>If this is done, it should perhaps be done on only on certain x86s,
>as some show the results go the other way. For example, the Cyrix III..

And for some (P150) it makes no difference...

read cr2 best: 25 av: 27.09
write cr2 cr2 best: 32 av: 34.39

read stk best: 26 av: 28.22
write cr2 stk best: 32 av: 33.04

--Ricky

2001-11-06 22:05:38

by Mikael Pettersson

Subject: Re: Using %cr2 to reference "current"

On Tue, 6 Nov 2001 09:49:15 -0800 (PST), Linus Torvalds wrote:
> /* Return "current" in %eax, trash %edx */
> do_get_current:
> movl $0x0003c000,%eax // 4 bits at bit 14
> movl $-16384,%edx // remove low 14 bits
> andl %esp,%eax
> andl %esp,%edx
> shrl $7,%eax // color it by 128 bytes
> addl %edx,%eax
> ret
>...
>I would not be surprised if "mov %cr2,%reg" will break a netburst trace
>cache entity, or even cause microcode to be executed. While I _guarantee_
>that all future Intel CPU's will continue to be fast at mixtures of simple
>arithmetic operations like "add" and "and".

On my Pentium 4:
- 6.30 cycles to copy %cr2 to %eax
- 1.05 cycles to compute a non-coloured current by masking %esp
- 2.31 cycles to compute a coloured current by your code above

I did some tests on using %cr2 for get_processor_id() a while ago,
but it was clearly slower (58% on P6, 20% on K6-III, 3% on P5MMX)
than *((%esp & mask)+offset), even though the latter also does a load.

/Mikael

2001-11-06 22:42:18

by Linus Torvalds

Subject: Re: Using %cr2 to reference "current"

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>
>> That should work fairly well, and has the advantage that you can hide more
>> state there if you want (ie it allows us, on demand, to move hot state of
>> "struct task_struct" up there).
>
>Sweet. Now that I'd completely missed. Task private state and task
>public state splitting

Yes. It would be a waste to have to bring in a cache-line into the L1
cache, and then only use 4 bytes of it. So it should make sense to set
this up somewhat like:

struct local_task_struct {
struct task_struct *tsk;
.. other fields ..
};

and then use the _exact_ existing infrastructure to get
"local_task_struct" instead of "task_struct", and let the compiler do
all the rest at a higher level. So we'd just rename "get_current()" to
"get_local_current()", and then do

#define get_current() (get_local_current()->tsk)

and people who want to know about the local task struct can use that.
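
Put together, the split might look roughly like the sketch below. The fields after tsk and the static stand-in for the per-task area are assumptions made for illustration; in the kernel the area would be located from %esp rather than from a global.

```c
struct task_struct { int pid; };	/* stand-in for the real structure */

/* Small task-private area; fields beyond tsk are illustrative picks from
 * the candidates listed earlier in the thread. */
struct local_task_struct {
	struct task_struct *tsk;
	int need_resched;
	int processor;
};

/* Stand-in: the kernel would derive this from %esp; here it is a static. */
static struct local_task_struct local_area;

static inline struct local_task_struct *get_local_current(void)
{
	return &local_area;
}

#define get_current() (get_local_current()->tsk)
```

Code that only needs need_resched or processor touches the local area and never dereferences tsk, so the shared task struct's cache lines stay out of the picture on those paths.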

Linus

2001-11-06 23:02:28

by Alan

Subject: Re: Using %cr2 to reference "current"

> If this is done, it should perhaps be done on only on certain x86s,
> as some show the results go the other way. For example, the Cyrix III..
>
> read stk best: 42 av: 42.60
> read cr2 best: 61 av: 61.28

Do we have many SMP Cyrix III's ?
