Hi,
I'm currently wondering how to implement the recent thread_info changes
for m68k. Unfortunately, I can't find any discussion on lkml about
why it was done this way.
1. I liked the previous byte fields better than the bitmasks.
bitfield/bitmask instructions are at least 2 bytes longer than a simple
test instruction for m68k. I got this now nicely optimized (see
http://linux-m68k-cvs.apia.dhs.org/c/cvsweb/linux/arch/m68k/kernel/entry.S?rev=1.6&content-type=text/x-cvsweb-markup)
and changing it back to bitmasks would make it worse again.
2. I can understand that we split the task structure from the stack, but
why do thread_info and task_struct have to be two different pointers? I'd
prefer to keep one pointer through which everything is accessed; that means
thread_info would be part of task_struct.
bye, Roman
> why do thread_info and task_struct have to be two different pointers? I'd
> prefer to keep one pointer through which everything is accessed; that means
> thread_info would be part of task_struct.
Paul and I talked about this and we guessed it is an intel hack so they
can use the "mask the stack pointer to get to struct thread_info" trick.
On archs where we use a register to point to current I can't see why we
need this thread_info junk. I'd be happy if we could put it all in the
task struct for non intel.
Anton
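For readers unfamiliar with the "mask the stack pointer" trick mentioned above, here is a minimal sketch in C. It assumes the kernel stack area is 8192 bytes, aligned to its own size, with struct thread_info at its base; the size, the structure contents and the helper name thread_info_from_sp() are illustrative assumptions, not the actual i386 code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size and alignment of the kernel stack area (a power of two). */
#define THREAD_SIZE ((uintptr_t)8192)

/* Hypothetical, stripped-down stand-in for the real structure. */
struct thread_info {
    unsigned long flags;
};

/*
 * Round a stack address down to the start of its THREAD_SIZE-aligned area,
 * where this sketch assumes struct thread_info is stored.
 */
static inline struct thread_info *thread_info_from_sp(uintptr_t sp)
{
    /* The mask below only works for power-of-two sizes. */
    assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);
    return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}
```

Any address inside the same aligned area maps to the same base, so no register has to be reserved for the pointer. The cost is that the stack and thread_info must share one aligned block.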
Hi,
I wrote:
> 1. I liked the previous byte fields better than the bitmasks.
> bitfield/bitmask instructions are at least 2 bytes longer than a simple
> test instruction for m68k.
Additional note: The currently used bitfield instruction can be quite
expensive, where all we need is an atomic read and write, but not an
atomic test/modify.
bye, Roman
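The distinction drawn above, between a plain atomic read or write and an atomic test-and-modify, can be sketched in portable C11 atomics. This is only an illustration of the two layouts, not m68k code; the flag names and both structures are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
#define TIF_SIGPENDING   (1u << 0)
#define TIF_NEED_RESCHED (1u << 1)

/* Bitmask layout: all flags share one word. */
struct ti_bitmask {
    atomic_uint flags;
};

/* Setting one bit must not lose the others: an atomic read-modify-write. */
static void bitmask_set(struct ti_bitmask *ti, unsigned int mask)
{
    assert(mask != 0);
    atomic_fetch_or(&ti->flags, mask);
}

static bool bitmask_test(struct ti_bitmask *ti, unsigned int mask)
{
    return (atomic_load(&ti->flags) & mask) != 0;
}

/* Byte layout: each flag owns a byte, so a plain store is enough. */
struct ti_bytes {
    atomic_uchar sigpending;
    atomic_uchar need_resched;
};

static void bytes_set_need_resched(struct ti_bytes *ti)
{
    atomic_store(&ti->need_resched, 1);
}

static bool bytes_test_need_resched(struct ti_bytes *ti)
{
    return atomic_load(&ti->need_resched) != 0;
}
```

With one byte per flag, a writer never has to preserve neighbouring bits, which is why a simple store and a simple test are sufficient in that layout.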
Why are you using bitfield instructions on the m68k arch? why not use just
simple bit instructions (or and/or/xor where masking)? All the flags are
single bit width fields.
David
Hi,
David Howells wrote:
> Why are you using bitfield instructions on the m68k arch? why not use just
> simple bit instructions (or and/or/xor where masking)? All the flags are
> single bit width fields.
These are two bytes longer as well. Right now I'm doing this:
	tstw	%d0
	jeq	do_signal_return
	tstb	%d0
	jne	do_delayed_trace
Using btst or andw, this would be four bytes longer.
bye, Roman
> From: Anton Blanchard <[email protected]>
> Date: Tue, 12 Feb 2002 07:50:48 +1100
> On archs where we use a register to point to current I can't see why we
> need this thread_info junk. I'd be happy if we could put it all in the
> task struct for non intel.
I am in fact very happy with the thread_info implementation
on sparc64.
I was able to blow away all of the assembler offset stuff because now
all the stuff assembly wants to get at is in one structure and it is
trivial to compute the offsets by hand.
Hi,
"David S. Miller" wrote:
> I was able to blow away all of the assembler offset stuff because now
> all the stuff assembly wants to get at is in one structure and it is
> trivial to compute the offsets by hand.
Where's the problem to compute them automatically?
bye, Roman
> From: Roman Zippel <[email protected]>
> Date: Tue, 12 Feb 2002 01:57:03 +0100
> "David S. Miller" wrote:
> > I was able to blow away all of the assembler offset stuff because now
> > all the stuff assembly wants to get at is in one structure and it is
> > trivial to compute the offsets by hand.
> Where's the problem to compute them automatically?
It requires ugly scripts that parse assembler files if you want it to
work in a cross compilation requirement. Check out
arch/sparc64/kernel/check_asm.sh and the "check_asm" rule in the
Makefile or the same directory in older trees to see what I mean.
> Date: Mon, 11 Feb 2002 16:46:17 -0800 (PST)
> From: "David S. Miller" <[email protected]>
>
> From: Anton Blanchard <[email protected]>
> Date: Tue, 12 Feb 2002 07:50:48 +1100
>
> On archs where we use a register to point to current I can't see why we
> need this thread_info junk. I'd be happy if we could put it all in the
> task struct for non intel.
>
> I am in fact very happy with the thread_info implementation
> on sparc64.
>
> I was able to blow away all of the assembler offset stuff because now
> all the stuff assembly wants to get at is in one structure and it is
> trivial to compute the offsets by hand.
I hope you don't consider this a good argument to force all the other
platforms to throw away their perfectly good low-core code. Like
Anton, I do not expect any benefits from thread_info for ia64. In
fact, if anything it's going to slow things down. And we don't need
it for task coloring either. We can do that perfectly fine with the
old setup.
Thanks,
--david
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 17:01:24 -0800
> I hope you don't consider this a good argument to force all the other
> platforms to throw away their perfectly good low-core code.
I didn't have to change any of my locore code, what the heck
are you talking about? :-) All of the changes to _ANY_ assembly
on sparc64 looked like this:
- lduw [%g6 + AOFF_task_thread + AOFF_thread_flags], %l0
+ lduw [%g6 + TI_FLAGS], %l0
It actually cleaned up my locore code :-)
I think, in fact, that everything people have right now in
thread_struct should move to thread_info and we should kill
off thread_struct entirely. It has no reason to exist anymore.
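As an illustration of hand-computed offsets such as TI_FLAGS, the following sketch lets the compiler verify them at build time, so nothing has to be run on the target. It uses C11 static_assert, which did not exist when this thread was written. The structure layout and every constant other than the name TI_FLAGS are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; the real sparc64 structure is not shown in the thread. */
struct thread_info {
    uint64_t task;   /* would be a pointer in real code */
    uint32_t flags;
    uint32_t cpu;
};

/* Offsets written by hand, as an assembly header might define them. */
#define TI_TASK   0
#define TI_FLAGS  8
#define TI_CPU    12

/* The build fails if the structure changes and a constant goes stale. */
static_assert(offsetof(struct thread_info, task)  == TI_TASK,  "TI_TASK out of date");
static_assert(offsetof(struct thread_info, flags) == TI_FLAGS, "TI_FLAGS out of date");
static_assert(offsetof(struct thread_info, cpu)   == TI_CPU,   "TI_CPU out of date");

/* C rendering of a load from [base + TI_FLAGS]. */
static uint32_t load_flags(const void *base)
{
    uint32_t value;
    memcpy(&value, (const unsigned char *)base + TI_FLAGS, sizeof value);
    return value;
}
```

Because the checks are evaluated by the compiler that builds the kernel, they work the same way in a cross-compilation setup.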
Hi,
"David S. Miller" wrote:
> It requires ugly scripts that parse assembler files if you want it to
> work in a cross compilation requirement. Check out
> arch/sparc64/kernel/check_asm.sh and the "check_asm" rule in the
> Makefile or the same directory in older trees to see what I mean.
Why is that complicated???
I crosscompile m68k/ppc all the time without problems, what am I doing
wrong?
bye, Roman
>>>>> On Mon, 11 Feb 2002 17:07:09 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> I didn't have to change any of my locore code, what the heck
DaveM> are you talking about? :-) All of the changes to _ANY_
DaveM> assembly on sparc64 looked like this:
The task pointer lives in the thread pointer register (r13), so it's
trivial to address off of it. The stack pointer is a poor replacement
for that. The bit-masking that the x86 code is doing to get the
thread info pointer is atrocious and is part of the reason task coloring
is harder to do on x86.
--david
On Mon, 11 Feb 2002, David Mosberger wrote:
> >>>>> On Mon, 11 Feb 2002 18:22:08 -0800 (PST), "David S. Miller" <[email protected]> said:
>
> DavidM> I implemented the thread_info stuff, and I checked out the
> DavidM> performance, have you?
>
> So why don't you share the results? Perhaps then I can see the light,
> too. With the exception of task coloring, the thread_info is strictly
> more work and it's possible to do task coloring without thread_info.
This one does task colouring and stack pointer jittering for x86 :
http://www.xmailserver.org/linux-patches/misc.html#TskStackCol
Also, I think Manfred Spraul has something that does task colouring. It has
been tested by a guy at Fujitsu (Japan) on 8-way machines, giving pretty
good results. The stack jittering part did not seem to give much
improvement, though ...
- Davide
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 18:46:00 -0800
> OK, so back to square one: why am I supposed to do all this work for
> something that will likely slow things slightly down and, at best,
> doesn't hurt performance? The old set up works great and as far as
> I'm concerned, is not broken.
It keeps your platform the same, and it does help other platforms.
It is the nature of any abstraction change we make in the kernel
that platforms have to deal with.
So at least we're to the point where you could be convinced that
there are no down sides to the change? Let's go over your list:
1) massive locore assembly changes
   ummm no, just put current_thread_info into your thread register
2) pointer dereference causes performance problems
   ummm no, not really, go test it for yourself if you don't
   believe me
This only leaves "I don't want to do the conversion because it has
no benefit to ia64." Well, it doesn't hurt your platform either,
so just cope :-)
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 17:21:40 -0800
> The task pointer lives in the thread pointer register (r13), so it's
> trivial to address off of it.
So why don't you put the, oh my gosh, "THREAD INFO POINTER" into the
thread pointer register instead? That is what I did and everything
transforms naturally, you will need to make zero modifications to
assembly code besides the offset macro names and that you can even
script :-)
>>>>> On Mon, 11 Feb 2002 18:51:00 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> It keeps your platform the same, and it does help other
DaveM> platforms.
No, it will slow down ia64 and you haven't shown that it helps others.
DaveM> This only leaves "I don't want to do the conversion because
DaveM> it has no benefit to ia64." Well, it doesn't hurt your
DaveM> platform either, so just cope :-)
There are 9 other platforms. Anton doesn't seem too happy about this
change either. I don't know how the maintainers of the others feel.
--david
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 19:01:27 -0800
> No, it will slow down ia64 and you haven't shown that it helps others.
That's crap. You haven't shown this yet, it didn't slow down sparc64
so I doubt you'll be able to.
You don't have any facts, you just "think" it will slow things down
because of the pointer dereference. I challenge you to show it
actually shows up on the performance radar.
The thing is going to be fully hot in the cache all the time, there
is no way you'll take a cache miss for this dereference.
>>>>> On Mon, 11 Feb 2002 19:04:49 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> You don't have any facts, you just "think" it will slow
DaveM> things down because of the pointer dereference. I challenge
DaveM> you to show it actually shows up on the performance radar.
DaveM> The thing is going to be fully hot in the cache all the time,
DaveM> there is no way you'll take a cache miss for this
DaveM> dereference.
Let's see: on Itanium, a ld takes up an M slot and has a 2 cycle
access latency (if in the first level cache). This may or may not be
noticeable in benchmarks, but it certainly won't go faster. And all
this just for task coloring (which we can do with the old set up just
fine)?
--david
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 19:18:38 -0800
> Let's see: on Itanium, a ld takes up an M slot and has a 2 cycle
> access latency (if in the first level cache). This may or may not be
> noticeable in benchmarks, but it certainly won't go faster. And all
> this just for task coloring (which we can do with the old set up just
> fine)?
The compiler will schedule the latency out of existence.
>>>>> On Mon, 11 Feb 2002 19:23:34 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> The compiler will schedule the latency out of existence.
The kernel has many paths that have sequential dependencies. If there
is no other work to do, the compiler won't help you.
--david
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 19:32:58 -0800
> DaveM> The compiler will schedule the latency out of existence.
> The kernel has many paths that have sequential dependencies. If there
> is no other work to do, the compiler won't help you.
You mean the company with the most register starved modern processor
can't make a load go fast? :-) I totally beg to differ, and I think
people like Linus will too.
>>>>> On Mon, 11 Feb 2002 19:42:22 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> I totally beg to
DaveM> differ, and I think people like Linus will too.
I don't think so. In the not-so-distant past, Linus has rejected a
patch for precisely this reason. And in that particular case, only a
handful of places had a local variable replaced with current->cpu.
--david
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 20:16:28 -0800
> >>>>> On Mon, 11 Feb 2002 19:42:22 -0800 (PST), "David S. Miller" <[email protected]> said:
> DaveM> I totally beg to
> DaveM> differ, and I think people like Linus will too.
> I don't think so.
Linus has actually told me and others his opinion of the patch, and he
has accepted it whole heartedly. If he didn't accept it, it wouldn't
be in the tree right?
In fact, why don't you ask him yourself? The "loads are fast on x86"
comment I made is basically derived from something he told someone
else earlier today wrt. the thread_info stuff.
So let me rephrase what you've quoted "I totally beg to differ, and I
_know_ people like Linus will too." :-)
>>>>> On Mon, 11 Feb 2002 20:21:08 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> The "loads are fast
DaveM> on x86" comment I made is basically derived from something he
DaveM> told someone else earlier today wrt. the thread_info stuff.
DaveM> So let me rephrase what you've quoted "I totally beg to
DaveM> differ, and I _know_ people like Linus will too." :-)
And I know that Linus rejected earlier patches (from Nov 2001) on
these grounds.
--david
> From: Richard Henderson <[email protected]>
> Date: Mon, 11 Feb 2002 21:26:44 -0800
> On another topic, I'm considering having $8 continue to be current
> and using the two-insn stack mask to get current_thread_info and
> measuring the size difference that makes.
I might put 'current' into %g7 on sparc but currently I don't think
it's worth it myself.
BTW, your "4 issue" comments assume the cpu can do 4 non-FPU
instructions per cycle, most I am aware of cannot and I think ia64
even falls into the "cannot" category. Doesn't it?
On Mon, Feb 11, 2002 at 07:32:58PM -0800, David Mosberger wrote:
> The kernel has many paths that have sequential dependencies. If there
> is no other work to do, the compiler won't help you.
Indeed. A 2 cycle latency on a 4-issue processor means you have
to have quite a large block of code in order for the hot load to
be "free".
On another topic, I'm considering having $8 continue to be current
and using the two-insn stack mask to get current_thread_info and
measuring the size difference that makes.
r~
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 17:36:48 -0800
> Umh, perhaps because it adds one level of indirection to every access
> of "current"??
Ummm, which is totally cached and therefore costs nothing?
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 17:53:12 -0800
> Loads are certainly not free on many CPU models. This is made worse
> by the fact that C alias analysis has to be so pessimistic, especially
> given that the kernel is compiled with -fno-strict-aliasing.
I implemented the thread_info stuff, and I checked out the
performance, have you?
> From: David Mosberger <[email protected]>
> Date: Mon, 11 Feb 2002 18:30:58 -0800
> So why don't you share the results? Perhaps then I can see the light,
> too. With the exception of task coloring, the thread_info is strictly
> more work and it's possible to do task coloring without thread_info.
All performance tests I ran were "about the same" on sparc64, on x86
we really only have one anomaly on one of Linus's SMP x86 machines
(fork+exec from lmbench on dual-Athlon) and I'm going to push to
investigate that further.
>>>>> On Mon, 11 Feb 2002 18:22:08 -0800 (PST), "David S. Miller" <[email protected]> said:
DavidM> I implemented the thread_info stuff, and I checked out the
DavidM> performance, have you?
So why don't you share the results? Perhaps then I can see the light,
too. With the exception of task coloring, the thread_info is strictly
more work and it's possible to do task coloring without thread_info.
--david
>>>>> On Mon, 11 Feb 2002 17:41:02 -0800 (PST), "David S. Miller" <[email protected]> said:
David> From: David Mosberger <[email protected]> Date: Mon, 11
David> Feb 2002 17:36:48 -0800
David> Umh, perhaps because it adds one level of indirection to
David> every access of "current"??
DaveM> Ummm, which is totally cached and therefore costs nothing?
Loads are certainly not free on many CPU models. This is made worse
by the fact that C alias analysis has to be so pessimistic, especially
given that the kernel is compiled with -fno-strict-aliasing.
--david
>>>>> On Mon, 11 Feb 2002 17:32:36 -0800 (PST), "David S. Miller" <[email protected]> said:
David> From: David Mosberger <[email protected]> Date: Mon, 11
David> Feb 2002 17:21:40 -0800
David> The task pointer lives in the thread pointer register
David> (r13), so it's trivial to address off of it.
DaveM> So why don't you put the, oh my gosh, "THREAD INFO POINTER"
DaveM> into the thread pointer register instead? That is what I did
DaveM> and everything transforms naturally, you will need to make
DaveM> zero modifications to assembly code besides the offset macro
DaveM> names and that you can even script :-)
Umh, perhaps because it adds one level of indirection to every access
of "current"??
--david
>>>>> On Mon, 11 Feb 2002 18:36:03 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> All performance tests I ran were "about the same" on sparc64,
DaveM> on x86 we really only have one anomaly on one of Linus's SMP
DaveM> x86 machines (fork+exec from lmbench on dual-Athlon) and I'm
DaveM> going to push to investigate that further.
OK, so back to square one: why am I supposed to do all this work for
something that will likely slow things slightly down and, at best,
doesn't hurt performance? The old set up works great and as far as
I'm concerned, is not broken.
Don't get me wrong. I'm willing to invest time to switch to the new
setup, but I'd like to have a good reason before doing so. That's not
asking for too much, is it?
--david
>>>>> On Mon, 11 Feb 2002 21:32:48 -0800 (PST), "David S. Miller" <[email protected]> said:
DaveM> BTW, your "4 issue" comments assume the cpu can do 4 non-FPU
DaveM> instructions per cycle, most I am aware of cannot and I think ia64
DaveM> even falls into the "cannot" category. Doesn't it?
Itanium can certainly issue 5 non-fp instructions per cycle (not very
common, but possible). 4-issue is easy.
--david
On Mon, Feb 11, 2002 at 09:32:48PM -0800, David S. Miller wrote:
> BTW, your "4 issue" comments assume the cpu can do 4 non-FPU
> instructions per cycle, most I am aware of cannot and I think ia64
> even falls into the "cannot" category. Doesn't it?
ia64 and alpha ev6 can do this easily. They both have
four integer pipelines.
r~
Hi,
On Mon, 11 Feb 2002, David S. Miller wrote:
> It keeps your platform the same, and it does help other platforms.
> It is the nature of any abstraction change we make in the kernel
> that platforms have to deal with.
Of what "abstraction change" are you talking about?
Any change should usually help most architectures, and so far the
thread_info change has only been done for a few.
> 2) pointer dereference causes performance problems
>
> ummm no, not really, go test it for yourself if you don't
> believe me
>
> This only leaves "I don't want to do the conversion because it has
> no benefit to ia64." Well, it doesn't hurt your platform either,
> so just cope :-)
That's simply not true. An extra load might be cheap, maybe on sparc it's
even free, but on most architectures it has a cost. Additionally every
access to current requires an extra load, so every function which uses
current will be larger, all embedded targets will thank you for that.
Where is the problem to allow these two implementations:
1.
#define current_thread_info() asm(...)
#define current current_thread_info()->task
2.
#define current asm(...)
#define current_thread_info() &current->thread_info
If you're unable to properly compute your structure offsets, you're free
to use the first version, I prefer the second.
bye, Roman
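The two alternatives listed above can be compared in a small, self-contained C sketch. Plain variables stand in for the thread register that the original reads via asm(...); all structure layouts and the numeric suffixes on the names are hypothetical and only serve to keep both variants in one file.

```c
#include <assert.h>
#include <stddef.h>

/* Variant 1: thread_info is separate and points back to the task. */
struct task1 { int pid; };
struct thread_info1 { struct task1 *task; unsigned long flags; };

/* Variant 2: thread_info is embedded in the task structure. */
struct thread_info2 { unsigned long flags; };
struct task2 { int pid; struct thread_info2 thread_info; };

/* Stand-ins for the thread register (asm(...) in the original). */
static struct thread_info1 *thread_reg1;
static struct task2 *thread_reg2;

/* Variant 1: reaching the task costs one extra load through ->task. */
#define current_thread_info1() (thread_reg1)
#define current1               (current_thread_info1()->task)

/* Variant 2: reaching thread_info is only an address computation. */
#define current2               (thread_reg2)
#define current_thread_info2() (&current2->thread_info)

static int current1_pid(void)
{
    assert(thread_reg1 != NULL);
    return current1->pid;       /* load task pointer, then load pid */
}

static int current2_pid(void)
{
    assert(thread_reg2 != NULL);
    return current2->pid;       /* load pid directly off the register */
}
```

In variant 2 the compiler can fold the offset of thread_info into the addressing mode, whereas variant 1 needs the additional dereference for every use of current.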
Roman Zippel wrote:
>
> Hi,
>
> On Mon, 11 Feb 2002, David S. Miller wrote:
>
> > It keeps your platform the same, and it does help other platforms.
> > It is the nature of any abstraction change we make in the kernel
> > that platforms have to deal with.
>
> Of what "abstraction change" are you talking about?
> Any change should usually help most architectures, and so far the
> thread_info change has only been done for a few.
>
> > 2) pointer dereference causes performance problems
> >
> > ummm no, not really, go test it for yourself if you don't
> > believe me
> >
> > This only leaves "I don't want to do the conversion because it has
> > no benefit to ia64." Well, it doesn't hurt your platform either,
> > so just cope :-)
>
> That's simply not true. An extra load might be cheap, maybe on sparc it's
> even free, but on most architectures it has a cost. Additionally every
> access to current requires an extra load, so every function which uses
> current will be larger, all embedded targets will thank you for that.
> Where is the problem to allow these two implementations:
> 1.
> #define current_thread_info() asm(...)
> #define current current_thread_info()->task
> 2.
> #define current asm(...)
> #define current_thread_info() &current->thread_info
...or number 3, do a conversion to 2.5.4 thread_info then embed the task
structure inside struct thread_info, like what was just done with VFS
inodes. (embedding the general struct in the arch-specific struct would
make sense to me, whereas I can definitely see how embedding the
arch-specific struct in the general struct would be annoying)
Jeff
--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com
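Option 3 above, embedding the generic task structure inside the arch-specific thread_info, might look roughly like the following sketch. The field names and the two helper functions are hypothetical; the pointer arithmetic is the usual offsetof-based way of getting from an embedded member back to its container.

```c
#include <assert.h>
#include <stddef.h>

/* Generic part, shared by all architectures (contents are illustrative). */
struct task_struct {
    int pid;
};

/* Arch-specific part that embeds the generic structure. */
struct thread_info {
    unsigned long flags;
    struct task_struct task;
};

/* thread_info -> task: just an address offset, no load. */
static inline struct task_struct *ti_to_task(struct thread_info *ti)
{
    return &ti->task;
}

/* task -> thread_info: subtract the member offset from the task address. */
static inline struct thread_info *task_to_ti(struct task_struct *task)
{
    assert(task != NULL);
    return (struct thread_info *)((char *)task -
                                  offsetof(struct thread_info, task));
}
```

Either direction is a constant offset, so one pointer is enough to reach both structures.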
Hi,
On Tue, 12 Feb 2002, Jeff Garzik wrote:
> ...or number 3, do a conversion to 2.5.4 thread_info then embed the task
> structure inside struct thread_info, like what was just done with VFS
> inodes.
That's possible.
> (embedding the general struct in the arch-specific struct would
> make sense to me, whereas I can definitely see how embedding the
> arch-specific struct in the general struct would be annoying)
It's not really the same, the private part of the inode is really private
to the specific fs. thread_info is arch specific, but its fields have to be
accessed by generic code. At compile time there is also always only one
thread_info contrary to vfs inodes.
Anyway, it's not really important which structure includes which, it just
has to be decided for all archs uniformly to get include dependencies
right.
bye, Roman
On Mon, 11 Feb 2002, David S. Miller wrote:
> Where's the problem to compute them automatically?
>
> It requires ugly scripts that parse assembler files if you want it to
> work in a cross compilation requirement. Check out
> arch/sparc64/kernel/check_asm.sh and the "check_asm" rule in the
> Makefile or the same directory in older trees to see what I mean.
We cross-compile all the time and don't have to parse assembler-files,
just compile a c-file and include the resulting asm into entry.S:
/* linux/arch/cris/entryoffsets.c
 *
 * Copyright (C) 2001 Axis Communications AB
 *
 * Generate structure offsets for use in entry.S. No extra processing
 * needed more than compiling this file to assembly code. Horrendous
 * assembly code will be generated, so don't look at that.
 *
 * Authors: Hans-Peter Nilsson ([email protected])
 */
/BW
"David S. Miller" <[email protected]> wrote:
> OK, so back to square one: why am I supposed to do all this work for
> something that will likely slow things slightly down and, at best,
> doesn't hurt performance? The old set up works great and as far as
> I'm concerned, is not broken.
>
> It keeps your platform the same, and it does help other platforms.
> It is the nature of any abstraction change we make in the kernel
> that platforms have to deal with.
It wasn't all that big a change for the i386 arch either. Most of the changes
to assembly actually involved cleaning up various assembly sources and
sharing constants (something that should probably have been done a lot
earlier).
What might be worth doing is to move the task_struct slab cache and
(de-)allocator out of fork.c and to stick it in the arch somewhere. Then archs
aren't bound to have the two separate. So for a system that can handle lots of
memory, you can allocate the thread_info, task_struct and supervisor stack all
on one very large chunk if you so wish.
David
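A rough userspace sketch of the "one very large chunk" idea follows: thread_info, task_struct and the stack share a single aligned allocation, so the start of the chunk can be recovered from any stack address without stored pointers. The 8192-byte size, the structure contents and the helper names are assumptions for illustration; aligned_alloc() from C11 stands in for a kernel allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define THREAD_SIZE 8192u   /* assumed chunk size, a power of two */

/* Hypothetical, minimal stand-ins for the real structures. */
struct thread_info { unsigned long flags; };
struct task_struct { int pid; };

/* One chunk: both structures at the bottom, the remainder is stack. */
struct thread_chunk {
    struct thread_info ti;
    struct task_struct task;
    unsigned char stack[THREAD_SIZE - sizeof(struct thread_info)
                                    - sizeof(struct task_struct)];
};

static_assert(sizeof(struct thread_chunk) == THREAD_SIZE,
              "chunk must be exactly THREAD_SIZE bytes");

/* Allocate a zeroed chunk aligned to its own size; NULL on failure. */
static struct thread_chunk *alloc_thread_chunk(int pid)
{
    struct thread_chunk *chunk = aligned_alloc(THREAD_SIZE, sizeof *chunk);
    if (chunk == NULL)
        return NULL;
    memset(chunk, 0, sizeof *chunk);
    chunk->task.pid = pid;
    return chunk;
}

/* Recover the chunk from any address inside its stack by masking. */
static struct thread_chunk *chunk_from_sp(const void *sp)
{
    return (struct thread_chunk *)((uintptr_t)sp &
                                   ~((uintptr_t)THREAD_SIZE - 1));
}
```

A caller would release the chunk with free() once the thread is gone. Keeping everything in one block trades some flexibility in stack sizing for pointer-free access to both structures.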
> From: Bjorn Wesen <[email protected]>
> Date: Tue, 12 Feb 2002 14:01:13 +0100 (CET)
> We cross-compile all the time and don't have to parse assembler-files,
> just compile a c-file and include the resulting asm into entry.S:
I didn't say "undoable", we were doing it too. I said ugly,
and what you're showing me isn't pretty :-)
>>>>> On Tue, 12 Feb 2002 13:21:10 +0000, David Howells <[email protected]> said:
David.H> What might be worth doing is to move the task_struct slab
David.H> cache and (de-)allocator out of fork.c and to stick it in
David.H> the arch somewhere. Then archs aren't bound to have the two
David.H> separate. So for a system that can handle lots of memory,
David.H> you can allocate the thread_info, task_struct and
David.H> supervisor stack all on one very large chunk if you so
David.H> wish.
Could you do this? I'd prefer if task_info could be completely hidden
inside the x86/sparc arch-specific code, but if things are set up such
that we at least have the option to keep the stack, task_info, and
task_struct in a single chunk of memory (and without pointers between
them), I'd have much less of an issue with it.
--david
David Mosberger <[email protected]> wrote:
> David.H> What might be worth doing is to move the task_struct slab
> David.H> cache and (de-)allocator out of fork.c and to stick it in
> David.H> the arch somewhere. Then archs aren't bound to have the two
> David.H> separate. So for a system that can handle lots of memory,
> David.H> you can allocate the thread_info, task_struct and
> David.H> supervisor stack all on one very large chunk if you so
> David.H> wish.
>
> Could you do this? I'd prefer if task_info could be completely hidden
> inside the x86/sparc arch-specific code, but if things are set up such that
> we at least have the option to keep the stack, task_info, and task_struct in
> a single chunk of memory (and without pointers between them), I'd have much
> less of an issue with it.
Well, I can do a patch for it tomorrow (it's not particularly difficult to
actually do), but whether Linus'll take it is an entirely different matter.
David
Hi!
> No, it will slow down ia64 and you haven't shown that it helps others.
>
> That's crap. You haven't shown this yet, it didn't slow down sparc64
> so I doubt you'll be able to.
>
> You don't have any facts, you just "think" it will slow things down
> because of the pointer dereference. I challenge you to show it
> actually shows up on the performance radar.
>
> The thing is going to be fully hot in the cache all the time, there
> is no way you'll take a cache miss for this dereference.
So you essentially made your cache one cacheline smaller.
I guess it is easy to add 100 minor modifications, none of them
showing on the performance radar, and slow the kernel down by a factor
of two as a result.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa
> From: Pavel Machek <[email protected]>
> Date: Tue, 12 Feb 2002 18:14:22 +0100
> > The thing is going to be fully hot in the cache all the time, there
> > is no way you'll take a cache miss for this dereference.
> So you essentially made your cache one cacheline smaller.
Not at all, that cacheline has to be in the cache anyways because
it also holds all the other information which needs to be accessed
during trap entry/exit.
Try again.
Hi,
"David S. Miller" wrote:
> So you essentially made your cache one cacheline smaller.
>
> Not at all, that cacheline has to be in the cache anyways because
> it also holds all the other information which needs to be accessed
> during trap entry/exit.
>
> Try again.
Larger code size due to the extra load?
At least two cache lines needed for any access to task_struct?
David, what are you trying to prove? Any architecture which has a thread
register prefers to access data directly through this register and it's
not really difficult to avoid this indirection, that might be needed on
ia32.
bye, Roman
> On another topic, I'm considering having $8 continue to be current
> and using the two-insn stack mask to get current_thread_info and
> measuring the size difference that makes.
This is what paulus has done for ppc32 and what I have shamelessly
copied for ppc64. I was more comfortable doing that than requiring a
load for each use of current.
Anton