On Mon, 2011-02-14 at 11:14 -0500, Mathieu Desnoyers wrote:
>
> I remember that atomic_t is defined in types.h now rather than atomic.h.
> Any reason why you should keep including atomic.h from jump_label.h ?
Ooh, shiny.. we could probably move the few atomic_{read,inc,dec} users
in jump_label.h into out of line functions and have this sorted.
On Mon, Feb 14, 2011 at 05:25:54PM +0100, Peter Zijlstra wrote:
> >
> > I remember that atomic_t is defined in types.h now rather than atomic.h.
> > Any reason why you should keep including atomic.h from jump_label.h ?
>
> Ooh, shiny.. we could probably move the few atomic_{read,inc,dec} users
> in jump_label.h into out of line functions and have this sorted.
>
inc and dec sure, but atomic_read() for the disabled case needs to be
inline....
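
(For context, the "disabled case" is the fallback used when the arch has no
jump label support, and it is roughly of this shape -- an illustrative sketch
only; the struct and field names below are stand-ins, not quotes from the
patch set:)

	/* Illustrative only: assumes atomic_t from <linux/types.h> and
	 * unlikely()/__always_inline from <linux/compiler.h>. */
	struct jump_label_key {			/* name is illustrative */
		atomic_t enabled;
	};

	/* Every call site tests the enable count, so atomic_read()
	 * has to stay inline here. */
	static __always_inline bool static_branch(struct jump_label_key *key)
	{
		if (unlikely(atomic_read(&key->enabled)))
			return true;
		return false;
	}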
On Mon, 2011-02-14 at 11:29 -0500, Jason Baron wrote:
> On Mon, Feb 14, 2011 at 05:25:54PM +0100, Peter Zijlstra wrote:
> > >
> > > I remember that atomic_t is defined in types.h now rather than atomic.h.
> > > Any reason why you should keep including atomic.h from jump_label.h ?
> >
> > Ooh, shiny.. we could probably move the few atomic_{read,inc,dec} users
> > in jump_label.h into out of line functions and have this sorted.
> >
>
> inc and dec sure, but atomic_read() for the disabled case needs to be
> inline....
D'oh, yes of course, I was thinking about jump_label_enabled(), but
there's still the static_branch() implementation to consider.

We could of course cheat and implement our own version of atomic_read() in
order to avoid the whole header mess, but that's not pretty at all.
* Peter Zijlstra ([email protected]) wrote:
> On Mon, 2011-02-14 at 11:29 -0500, Jason Baron wrote:
> > On Mon, Feb 14, 2011 at 05:25:54PM +0100, Peter Zijlstra wrote:
> > > >
> > > > I remember that atomic_t is defined in types.h now rather than atomic.h.
> > > > Any reason why you should keep including atomic.h from jump_label.h ?
> > >
> > > Ooh, shiny.. we could probably move the few atomic_{read,inc,dec} users
> > > in jump_label.h into out of line functions and have this sorted.
> > >
> >
> > inc and dec sure, but atomic_read() for the disabled case needs to be
> > inline....
>
> D'0h yes of course, I was thinking about jump_label_enabled(), but
> there's still the static_branch() implementation to consider.
>
> We could of course cheat implement our own version of atomic_read() in
> order to avoid the whole header mess, but that's not pretty at all
>
OK, so the other way around then: why does kernel.h need to include
dynamic_debug.h (which includes jump_label.h)?
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
On Mon, 2011-02-14 at 17:37 +0100, Peter Zijlstra wrote:
> We could of course cheat implement our own version of atomic_read() in
> order to avoid the whole header mess, but that's not pretty at all
Oh God please no! ;)
atomic_read() is implemented per arch.
-- Steve
On Mon, 2011-02-14 at 11:46 -0500, Steven Rostedt wrote:
> On Mon, 2011-02-14 at 17:37 +0100, Peter Zijlstra wrote:
>
> > We could of course cheat implement our own version of atomic_read() in
> > order to avoid the whole header mess, but that's not pretty at all
>
> Oh God please no! ;)
>
> atomic_read() is implemented per arch.
Ah, but it needn't be:
static inline int atomic_read(atomic_t *a)
{
	return ACCESS_ONCE(a->counter);
}
is basically it.
On Mon, 2011-02-14 at 11:46 -0500, Steven Rostedt wrote:
> On Mon, 2011-02-14 at 17:37 +0100, Peter Zijlstra wrote:
>
> > We could of course cheat implement our own version of atomic_read() in
> > order to avoid the whole header mess, but that's not pretty at all
>
> Oh God please no! ;)
>
> atomic_read() is implemented per arch.
Hmm, maybe this isn't so bad:
alpha:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
arm:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
avr32:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
blackfin:
#define atomic_read(v) __raw_uncached_fetch_asm(&(v)->counter)
cris:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
frv:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
h8300:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
ia64:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
m32r:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
m68k:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
microblaze: uses generic which is:
mips:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
mn10300:
#define atomic_read(v) ((v)->counter)
parisc:
static __inline__ int atomic_read(const atomic_t *v)
{
	return (*(volatile int *)&(v)->counter);
}
powerpc:
static __inline__ int atomic_read(const atomic_t *v)
{
	int t;

	__asm__ __volatile__("lwz%U1%X1 %0,%1" : "=r"(t) : "m"(v->counter));
	return t;
}
which is still pretty much a volatile read
s390:
static inline int atomic_read(const atomic_t *v)
{
	barrier();
	return v->counter;
}
score:
uses generic
sh:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
sparc 32:
sparc 64:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
tile:
static inline int atomic_read(const atomic_t *v)
{
	return v->counter;
}
Hmm, nothing volatile at all?
x86:
static inline int atomic_read(const atomic_t *v)
{
	return (*(volatile int *)&(v)->counter);
}
xtensa:
#define atomic_read(v) (*(volatile int *)&(v)->counter)
So all but a few have basically (as you said on IRC)
#define atomic_read(v) ACCESS_ONCE(v)
Those few are blackfin, s390, powerpc and tile.
s390 probably doesn't need that much of a big hammer with atomic_read()
(unless it uses it in its own arch that expects it to be such).
powerpc could probably be converted to just the volatile code as
everything else. Not sure why it did it that way. To be different?
tile just looks wrong, but won't be hurt by adding volatile to that.
blackfin, seems to be doing quite a lot. Not sure if it is required, but
that may need a bit of investigating to understand why it does the
raw_uncached thing.
Maybe we could move the atomic_read() out of atomic and make it a
standard inline for all (in kernel.h)?
-- Steve
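
(A minimal sketch of what such a shared definition could look like, assuming
ACCESS_ONCE() from <linux/compiler.h> and atomic_t from <linux/types.h>;
archs whose atomics bypass the cache would still have to override it:)

	#include <linux/compiler.h>	/* ACCESS_ONCE() */
	#include <linux/types.h>	/* atomic_t */

	/* Common atomic_read() sketch: a single volatile-qualified load. */
	static inline int atomic_read(const atomic_t *v)
	{
		return ACCESS_ONCE(v->counter);
	}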
On Mon, Feb 14, 2011 at 12:18, Steven Rostedt wrote:
> blackfin:
> #define atomic_read(v) __raw_uncached_fetch_asm(&(v)->counter)
>
> blackfin, seems to be doing quite a lot. Not sure if it is required, but
> that may need a bit of investigating to understand why it does the
> raw_uncached thing.
this is only for SMP ports, and it's due to our lack of
cache-coherency. for non-SMP, we use asm-generic.
-mike
On Mon, 2011-02-14 at 12:18 -0500, Steven Rostedt wrote:
> mn10300:
> #define atomic_read(v) ((v)->counter)
> tile:
> static inline int atomic_read(const atomic_t *v)
> {
> return v->counter;
> }
Yeah, I already sent email to the respective maintainers telling them
they might want to fix this ;-)
> So all but a few have basically (as you said on IRC)
> #define atomic_read(v) ACCESS_ONCE(v)
ACCESS_ONCE(v->counter), but yeah :-)
> Those few are blackfin, s390, powerpc and tile.
>
> s390 probably doesn't need that much of a big hammer with atomic_read()
> (unless it uses it in its own arch that expects it to be such).
Right, it could just do the volatile thing..
> powerpc could probably be converted to just the volatile code as
> everything else. Not sure why it did it that way. To be different?
Maybe that code was written before we all got inventive with the
volatile cast stuff..
> blackfin, seems to be doing quite a lot. Not sure if it is required, but
> that may need a bit of investigating to understand why it does the
> raw_uncached thing.
From what I can tell it's flushing its write cache, invalidating its
d-cache and then issuing the read, something which is _way_ overboard.
> Maybe we could move the atomic_read() out of atomic and make it a
> standard inline for all (in kernel.h)?
Certainly looks like that might work..
On Mon, Feb 14, 2011 at 12:27, Peter Zijlstra wrote:
> On Mon, 2011-02-14 at 12:18 -0500, Steven Rostedt wrote:
>> blackfin, seems to be doing quite a lot. Not sure if it is required, but
>> that may need a bit of investigating to understand why it does the
>> raw_uncached thing.
>
> From what I can tell its flushing its write cache, invalidating its
> d-cache and then issue the read, something which is _way_ overboard.
not when the cores in an SMP system lack cache coherency
please to review:
http://docs.blackfin.uclinux.org/doku.php?id=linux-kernel:smp-like
-mike
On Mon, 2011-02-14 at 12:29 -0500, Mike Frysinger wrote:
> On Mon, Feb 14, 2011 at 12:27, Peter Zijlstra wrote:
> > On Mon, 2011-02-14 at 12:18 -0500, Steven Rostedt wrote:
> >> blackfin, seems to be doing quite a lot. Not sure if it is required, but
> >> that may need a bit of investigating to understand why it does the
> >> raw_uncached thing.
> >
> > From what I can tell its flushing its write cache, invalidating its
> > d-cache and then issue the read, something which is _way_ overboard.
>
> not when the cores in a SMP system lack cache coherency
But atomic_read() is completely unordered, so even on a non-coherent
system a regular read should suffice, any old value is correct.
The only problem would be when you could get cache aliasing and read
something totally unrelated.
On Mon, Feb 14, 2011 at 5:27 PM, Peter Zijlstra <[email protected]> wrote:
>> So all but a few have basically (as you said on IRC)
>> #define atomic_read(v) ACCESS_ONCE(v)
>
> ACCESS_ONCE(v->counter), but yeah :-)
I maintain an out-of-tree architecture where that isn't the case
unfortunately [1]. Not expecting any special favours for being
out-of-tree of course, but just thought I would add that data point.
[1] Our atomic operations go around the cache rather than through it,
so the value of an atomic cannot be read with a normal load
instruction.
On Mon, 2011-02-14 at 17:38 +0000, Will Newton wrote:
> On Mon, Feb 14, 2011 at 5:27 PM, Peter Zijlstra <[email protected]> wrote:
>
> >> So all but a few have basically (as you said on IRC)
> >> #define atomic_read(v) ACCESS_ONCE(v)
> >
> > ACCESS_ONCE(v->counter), but yeah :-)
>
> I maintain an out-of-tree architecture where that isn't the case
> unfortunately [1]. Not expecting any special favours for being
> out-of-tree of course, but just thought I would add that data point.
>
> [1] Our atomic operations go around the cache rather than through it,
> so the value of an atomic cannot be read with a normal load
> instruction.
Cannot how? It would observe a stale value? That is acceptable for
atomic_read().
On Mon, Feb 14, 2011 at 12:38, Peter Zijlstra wrote:
> On Mon, 2011-02-14 at 12:29 -0500, Mike Frysinger wrote:
>> On Mon, Feb 14, 2011 at 12:27, Peter Zijlstra wrote:
>> > On Mon, 2011-02-14 at 12:18 -0500, Steven Rostedt wrote:
>> >> blackfin, seems to be doing quite a lot. Not sure if it is required, but
>> >> that may need a bit of investigating to understand why it does the
>> >> raw_uncached thing.
>> >
>> > From what I can tell its flushing its write cache, invalidating its
>> > d-cache and then issue the read, something which is _way_ overboard.
>>
>> not when the cores in a SMP system lack cache coherency
>
> But atomic_read() is completely unordered, so even on a non-coherent
> system a regular read should suffice, any old value is correct.
the words you use seem to form a line of reasoning that makes sense to
me. we'll have to play first though to double check.
> The only problem would be when you could get cache aliasing and read
> something totally unrelated.
being a nommu arch, there shouldn't be any cache aliasing issues.
we're just trying to make sure that what another core has pushed out
isn't stale in another core's cache when the other core does the read.
-mike
On Mon, Feb 14, 2011 at 5:43 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2011-02-14 at 17:38 +0000, Will Newton wrote:
>> On Mon, Feb 14, 2011 at 5:27 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> >> So all but a few have basically (as you said on IRC)
>> >> #define atomic_read(v) ACCESS_ONCE(v)
>> >
>> > ACCESS_ONCE(v->counter), but yeah :-)
>>
>> I maintain an out-of-tree architecture where that isn't the case
>> unfortunately [1]. Not expecting any special favours for being
>> out-of-tree of course, but just thought I would add that data point.
>>
>> [1] Our atomic operations go around the cache rather than through it,
>> so the value of an atomic cannot be read with a normal load
>> instruction.
>
> Cannot how? It would observe a stale value? That is acceptable for
> atomic_read().
It would observe a stale value, but that value would only be updated
when the cache line was reloaded from main memory which would have to
be triggered by either eviction or cache flushing. So it could get
pretty stale. Whilst that's probably within the spec. of atomic_read I
suspect it would lead to problems in practice. I could be wrong
though.
On Mon, 2011-02-14 at 17:50 +0000, Will Newton wrote:
> On Mon, Feb 14, 2011 at 5:43 PM, Peter Zijlstra <[email protected]> wrote:
> > On Mon, 2011-02-14 at 17:38 +0000, Will Newton wrote:
> >> On Mon, Feb 14, 2011 at 5:27 PM, Peter Zijlstra <[email protected]> wrote:
> >>
> >> >> So all but a few have basically (as you said on IRC)
> >> >> #define atomic_read(v) ACCESS_ONCE(v)
> >> >
> >> > ACCESS_ONCE(v->counter), but yeah :-)
> >>
> >> I maintain an out-of-tree architecture where that isn't the case
> >> unfortunately [1]. Not expecting any special favours for being
> >> out-of-tree of course, but just thought I would add that data point.
> >>
> >> [1] Our atomic operations go around the cache rather than through it,
> >> so the value of an atomic cannot be read with a normal load
> >> instruction.
> >
> > Cannot how? It would observe a stale value? That is acceptable for
> > atomic_read().
>
> It would observe a stale value, but that value would only be updated
> when the cache line was reloaded from main memory which would have to
> be triggered by either eviction or cache flushing. So it could get
> pretty stale. Whilst that's probably within the spec. of atomic_read I
> suspect it would lead to problems in practice. I could be wrong
> though.
Arguably, finding such cases would be a Good (TM) thing... but yeah, I
can imagine you're not too keen on being the one finding them.
Luckily it looks like you're in the same boat as blackfin-smp is.
On Mon, 2011-02-14 at 17:50 +0000, Will Newton wrote:
>
> It would observe a stale value, but that value would only be updated
> when the cache line was reloaded from main memory which would have to
> be triggered by either eviction or cache flushing. So it could get
> pretty stale. Whilst that's probably within the spec. of atomic_read I
> suspect it would lead to problems in practice. I could be wrong
> though.
Right, so the typical scenario that could cause pain is something like:
	while (atomic_read(&foo) != n)
		cpu_relax();
and the problem is that cpu_relax() doesn't know which particular
cacheline to flush in order to make things go faster, hm?
* Peter Zijlstra ([email protected]) wrote:
> On Mon, 2011-02-14 at 17:50 +0000, Will Newton wrote:
> >
> > It would observe a stale value, but that value would only be updated
> > when the cache line was reloaded from main memory which would have to
> > be triggered by either eviction or cache flushing. So it could get
> > pretty stale. Whilst that's probably within the spec. of atomic_read I
> > suspect it would lead to problems in practice. I could be wrong
> > though.
>
> Right, so the typical scenario that could cause pain is something like:
>
> while (atomic_read(&foo) != n)
> cpu_relax();
>
> and the problem is that cpu_relax() doesn't know which particular
> cacheline to flush in order to make things go faster, hm?
As an information point, this is why I mapped "uatomic_read()" to
"CMM_LOAD_SHARED" in my userspace RCU library rather than just doing a
volatile access. On cache-coherent architectures, the arch-specific code
turns CMM_LOAD_SHARED into a simple volatile access, but for
non-cache-coherent architectures, it can call the required
architecture-level primitives to fetch the stale data.
FWIW, I also have "CMM_STORE_SHARED" which does pretty much the same
thing. I use these for rcu_assign_pointer() and rcu_dereference() (thus
replacing "ACCESS_ONCE()").
The more detailed comment and macros are found at
http://git.lttng.org/?p=userspace-rcu.git;a=blob;f=urcu/system.h
I hope this helps,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
On Mon, 2011-02-14 at 19:24 +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-14 at 17:50 +0000, Will Newton wrote:
> >
> > It would observe a stale value, but that value would only be updated
> > when the cache line was reloaded from main memory which would have to
> > be triggered by either eviction or cache flushing. So it could get
> > pretty stale. Whilst that's probably within the spec. of atomic_read I
> > suspect it would lead to problems in practice. I could be wrong
> > though.
>
> Right, so the typical scenario that could cause pain is something like:
>
> while (atomic_read(&foo) != n)
> cpu_relax();
>
> and the problem is that cpu_relax() doesn't know which particular
> cacheline to flush in order to make things go faster, hm?
But what about any global variable? Can't we also just have:
	while (global != n)
		cpu_relax();
?
-- Steve
On Mon, 2011-02-14 at 16:29 -0500, Steven Rostedt wrote:
> > while (atomic_read(&foo) != n)
> > cpu_relax();
> >
> > and the problem is that cpu_relax() doesn't know which particular
> > cacheline to flush in order to make things go faster, hm?
>
> But what about any global variable? Can't we also just have:
>
> while (global != n)
> cpu_relax();
>
> ?
Matt Fleming answered this for me on IRC, and I'll share the answer here
(for those that are dying to know ;)
Seems that the atomic_inc() uses ll/sc operations that do not affect the
cache. Thus the problem is only with atomic_read() as
	while (atomic_read(&foo) != n)
		cpu_relax();

will just check the cached version of foo. But because ll/sc skips the
cache, foo will never update. That is, atomic_inc() and friends do
not touch the cache, and the CPU spinning in this loop is only
checking the cache, and will spin forever.
Thus it is not about global, as global is updated by normal means and
will update the caches. atomic_t is updated via the ll/sc that ignores
the cache and causes all this to break down. IOW... broken hardware ;)
Matt, feel free to correct this if it is wrong.
-- Steve
From: Steven Rostedt <[email protected]>
Date: Mon, 14 Feb 2011 16:39:36 -0500
> Thus it is not about global, as global is updated by normal means and
> will update the caches. atomic_t is updated via the ll/sc that ignores
> the cache and causes all this to break down. IOW... broken hardware ;)
I don't see how cache coherency can possibly work if the hardware
behaves this way.
In cache aliasing situations, yes, I can understand an L1 cache visibility
issue being present, but with kernel-only stuff that should never happen,
otherwise we have a bug in the arch cache flushing support.
On Mon, 2011-02-14 at 13:46 -0800, David Miller wrote:
> From: Steven Rostedt <[email protected]>
> Date: Mon, 14 Feb 2011 16:39:36 -0500
>
> > Thus it is not about global, as global is updated by normal means and
> > will update the caches. atomic_t is updated via the ll/sc that ignores
> > the cache and causes all this to break down. IOW... broken hardware ;)
>
> I don't see how cache coherency can possibly work if the hardware
> behaves this way.
>
> In cache aliasing situations, yes I can understand a L1 cache visibility
> issue being present, but with kernel only stuff that should never happen
> otherwise we have a bug in the arch cache flushing support.
I guess the issue is, if you use ll/sc on memory, you must always use
ll/sc on that memory, otherwise any normal read won't read the proper
cache.
The atomic_read() in this arch uses ll to read the memory directly and
skips the cache. If we make atomic_read() like the other archs:

	#define atomic_read(v)	(*(volatile int *)&(v)->counter)

this pulls the counter into cache, and it will not be updated by an
atomic_inc() from another CPU.
Ideally, we would like a single atomic_read() but due to these wacky
archs, it may not be possible.
-- Steve
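
(Purely for illustration, such an arch would have to keep something along
these lines instead, where ll_read() stands in for a hypothetical wrapper
around its load-linked instruction -- it is not a real kernel interface:)

	/* Hypothetical sketch: go through the ll path so the value comes
	 * from wherever atomic_inc()'s sc actually stored it, not from a
	 * possibly stale L1 line.  ll_read() is an invented name. */
	static inline int atomic_read(const atomic_t *v)
	{
		return ll_read(&v->counter);
	}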
On Mon, 2011-02-14 at 17:20 -0500, Steven Rostedt wrote:
> I guess the issue is, if you use ll/sc on memory, you must always use
> ll/sc on that memory, otherwise any normal read won't read the proper
> cache.
s/cache/memory/
-- Steve
On 02/14/2011 02:20 PM, Steven Rostedt wrote:
>
> Ideally, we would like a single atomic_read() but due to these wacky
> archs, it may not be possible.
>
#ifdef ARCH_ATOMIC_READ_SUCKS_EGGS?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Mon, 14 Feb 2011 16:39:36 -0500
Steven Rostedt <[email protected]> wrote:
> On Mon, 2011-02-14 at 16:29 -0500, Steven Rostedt wrote:
>
> > > while (atomic_read(&foo) != n)
> > > cpu_relax();
> > >
> > > and the problem is that cpu_relax() doesn't know which particular
> > > cacheline to flush in order to make things go faster, hm?
> >
> > But what about any global variable? Can't we also just have:
> >
> > while (global != n)
> > cpu_relax();
> >
> > ?
>
> Matt Fleming answered this for me on IRC, and I'll share the answer
> here (for those that are dying to know ;)
>
> Seems that the atomic_inc() uses ll/sc operations that do not affect
> the cache. Thus the problem is only with atomic_read() as
>
> while(atomic_read(&foo) != n)
> cpu_relax();
>
> Will just check the cache version of foo. But because ll/sc skips the
> cache, the foo will never update. That is, atomic_inc() and friends do
> not touch the cache, and the CPU spinning in this loop will is only
> checking the cache, and will spin forever.
Right. When I wrote the atomic_read() implementation that Will is
talking about, I used the ll-equivalent instruction to bypass the cache,
i.e. I wrote it in assembly because the compiler didn't emit that
instruction.
And that is what it boils down to really, the ll/sc instructions are
different from any other instructions in the ISA as they bypass the
cache and are not emitted by the compiler. So, in order to maintain
coherence with other cpus doing atomic updates on memory addresses, or
rather to avoid reading stale values, it's necessary to use the ll
instruction - and this isn't possible from C.
> Thus it is not about global, as global is updated by normal means and
> will update the caches. atomic_t is updated via the ll/sc that ignores
> the cache and causes all this to break down. IOW... broken hardware ;)
Well, to be precise it's about read-modify-write operations - the
architecture does maintain cache coherence in that writes from one CPU
are immediately visible to other CPUs.
FYI spinlocks are also implemented with ll/sc instructions.
* H. Peter Anvin ([email protected]) wrote:
> On 02/14/2011 02:20 PM, Steven Rostedt wrote:
> >
> > Ideally, we would like a single atomic_read() but due to these wacky
> > archs, it may not be possible.
> >
>
> #ifdef ARCH_ATOMIC_READ_SUCKS_EGGS?
>
> -hpa
lol :)
Hrm, I wonder if it might cause problems with combinations of "cmpxchg"
and "read" performed on a variable (without using atomic.h).
Mathieu
>
>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
From: Steven Rostedt <[email protected]>
Date: Mon, 14 Feb 2011 17:20:30 -0500
> On Mon, 2011-02-14 at 13:46 -0800, David Miller wrote:
>> From: Steven Rostedt <[email protected]>
>> Date: Mon, 14 Feb 2011 16:39:36 -0500
>>
>> > Thus it is not about global, as global is updated by normal means and
>> > will update the caches. atomic_t is updated via the ll/sc that ignores
>> > the cache and causes all this to break down. IOW... broken hardware ;)
>>
>> I don't see how cache coherency can possibly work if the hardware
>> behaves this way.
>>
>> In cache aliasing situations, yes I can understand a L1 cache visibility
>> issue being present, but with kernel only stuff that should never happen
>> otherwise we have a bug in the arch cache flushing support.
>
> I guess the issue is, if you use ll/sc on memory, you must always use
> ll/sc on that memory, otherwise any normal read won't read the proper
> cache.
That also makes no sense at all.
Any update to the L2 cache must be snooped by the L1 cache and cause
an update, otherwise nothing can work correctly.
So every object we use cmpxchg() on in the kernel cannot work on this
architecture? Is that what you're saying?
If so, a lot of things we do will not work.
On Mon, 14 Feb 2011 13:46:00 -0800 (PST)
David Miller <[email protected]> wrote:
> From: Steven Rostedt <[email protected]>
> Date: Mon, 14 Feb 2011 16:39:36 -0500
>
> > Thus it is not about global, as global is updated by normal means
> > and will update the caches. atomic_t is updated via the ll/sc that
> > ignores the cache and causes all this to break down. IOW... broken
> > hardware ;)
>
> I don't see how cache coherency can possibly work if the hardware
> behaves this way.
Cache coherency is still maintained provided writes/reads both go
through the cache ;-)
The problem is that for read-modify-write operations the arbitration
logic that decides who "wins" and is allowed to actually perform the
write, assuming two or more CPUs are competing for a single memory
address, is not implemented in the cache controller, I think. I'm not a
hardware engineer and I never understood how the arbitration logic
worked but I'm guessing that's the reason that the ll/sc instructions
bypass the cache.
Which is why the atomic_t functions worked out really well for that
arch, such that any accesses to an atomic_t * had to go through the
wrapper functions.
* Matt Fleming ([email protected]) wrote:
> On Mon, 14 Feb 2011 13:46:00 -0800 (PST)
> David Miller <[email protected]> wrote:
>
> > From: Steven Rostedt <[email protected]>
> > Date: Mon, 14 Feb 2011 16:39:36 -0500
> >
> > > Thus it is not about global, as global is updated by normal means
> > > and will update the caches. atomic_t is updated via the ll/sc that
> > > ignores the cache and causes all this to break down. IOW... broken
> > > hardware ;)
> >
> > I don't see how cache coherency can possibly work if the hardware
> > behaves this way.
>
> Cache coherency is still maintained provided writes/reads both go
> through the cache ;-)
>
> The problem is that for read-modify-write operations the arbitration
> logic that decides who "wins" and is allowed to actually perform the
> write, assuming two or more CPUs are competing for a single memory
> address, is not implemented in the cache controller, I think. I'm not a
> hardware engineer and I never understood how the arbitration logic
> worked but I'm guessing that's the reason that the ll/sc instructions
> bypass the cache.
>
> Which is why the atomic_t functions worked out really well for that
> arch, such that any accesses to an atomic_t * had to go through the
> wrapper functions.
If this is true, then we have bugs in lots of xchg/cmpxchg users (which
do not reside in atomic.h), e.g.:
fs/fs_struct.c:

int current_umask(void)
{
	return current->fs->umask;
}
EXPORT_SYMBOL(current_umask);

kernel/sys.c:

SYSCALL_DEFINE1(umask, int, mask)
{
	mask = xchg(&current->fs->umask, mask & S_IRWXUGO);
	return mask;
}
The solution to this would be to force all xchg/cmpxchg users to swap to
atomic.h variables, which would force the ll semantic on read. But I'd
really like to see where this is documented first -- or which PowerPC
engineer we should talk to.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
On 02/14/2011 02:37 PM, Matt Fleming wrote:
>>
>> I don't see how cache coherency can possibly work if the hardware
>> behaves this way.
>
> Cache coherency is still maintained provided writes/reads both go
> through the cache ;-)
>
> The problem is that for read-modify-write operations the arbitration
> logic that decides who "wins" and is allowed to actually perform the
> write, assuming two or more CPUs are competing for a single memory
> address, is not implemented in the cache controller, I think. I'm not a
> hardware engineer and I never understood how the arbitration logic
> worked but I'm guessing that's the reason that the ll/sc instructions
> bypass the cache.
>
> Which is why the atomic_t functions worked out really well for that
> arch, such that any accesses to an atomic_t * had to go through the
> wrapper functions.
I'm sorry... this doesn't compute. Either reads can work normally (go
through the cache) in which case atomic_read() can simply be a read or
they don't, so I don't understand this at all.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Mon, Feb 14, 2011 at 11:19 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/14/2011 02:37 PM, Matt Fleming wrote:
>>>
>>> I don't see how cache coherency can possibly work if the hardware
>>> behaves this way.
>>
>> Cache coherency is still maintained provided writes/reads both go
>> through the cache ;-)
>>
>> The problem is that for read-modify-write operations the arbitration
>> logic that decides who "wins" and is allowed to actually perform the
>> write, assuming two or more CPUs are competing for a single memory
>> address, is not implemented in the cache controller, I think. I'm not a
>> hardware engineer and I never understood how the arbitration logic
>> worked but I'm guessing that's the reason that the ll/sc instructions
>> bypass the cache.
>>
>> Which is why the atomic_t functions worked out really well for that
>> arch, such that any accesses to an atomic_t * had to go through the
>> wrapper functions.
>
> I'm sorry... this doesn't compute. Either reads can work normally (go
> through the cache) in which case atomic_read() can simply be a read or
> they don't, so I don't understand this at all.
The CPU in question has two sets of instructions:
load/store - these go via the cache (write through)
ll/sc - these operate literally as if there is no cache (they do not
hit on read or write)
This may or may not be a sensible way to architect a CPU, but I think
it is possible to make it work. Making it work efficiently is more of
a challenge.
On 02/15/2011 03:01 AM, Will Newton wrote:
>
> The CPU in question has two sets of instructions:
>
> load/store - these go via the cache (write through)
> ll/sc - these operate literally as if there is no cache (they do not
> hit on read or write)
>
> This may or may not be a sensible way to architect a CPU, but I think
> it is possible to make it work. Making it work efficiently is more of
> a challenge.
>
a) What "CPU in question" is this?
b) Why should we let this particular insane CPU slow ALL OTHER CPUs down?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Tue, 2011-02-15 at 05:31 -0800, H. Peter Anvin wrote:
> On 02/15/2011 03:01 AM, Will Newton wrote:
> b) Why should we let this particular insane CPU slow ALL OTHER CPUs down?
Yesterday I got around to reading Linus's interview here:
http://www.itwire.com/opinion-and-analysis/open-sauce/44975-linus-torvalds-looking-back-looking-forward?start=4
This seems appropriate:
"When it comes to "feature I had to include for reasons beyond my
control", it tends to be about crazy hardware doing stupid things that
we just have to work around. Most of the time that's limited to some
specific driver or other, and it's not something that has any relevance
in the "big picture", or that really affects core kernel design very
much. But sometimes it does, and then I really detest it."
-- Steve
On Tue, Feb 15, 2011 at 1:31 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/15/2011 03:01 AM, Will Newton wrote:
>>
>> The CPU in question has two sets of instructions:
>>
> >> load/store - these go via the cache (write through)
> >> ll/sc - these operate literally as if there is no cache (they do not
>> hit on read or write)
>>
>> This may or may not be a sensible way to architect a CPU, but I think
>> it is possible to make it work. Making it work efficiently is more of
>> a challenge.
>>
>
> a) What "CPU in question" is this?
http://imgtec.com/meta/meta-technology.asp
> b) Why should we let this particular insane CPU slow ALL OTHER CPUs down?
I didn't propose we do that. I brought it up just to make people aware
that there are these odd architectures out there, and indeed it turns
out Blackfin has some superficially similar issues.
On Mon, Feb 14, 2011 at 06:27:27PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-14 at 12:18 -0500, Steven Rostedt wrote:
>
> > mn10300:
> > #define atomic_read(v) ((v)->counter)
>
> > tile:
> > static inline int atomic_read(const atomic_t *v)
> > {
> > return v->counter;
> > }
>
> Yeah, I already send email to the respective maintainers telling them
> they might want to fix this ;-)
>
>
> > So all but a few have basically (as you said on IRC)
> > #define atomic_read(v) ACCESS_ONCE(v)
>
> ACCESS_ONCE(v->counter), but yeah :-)
>
> > Those few are blackfin, s390, powerpc and tile.
> >
> > s390 probably doesn't need that much of a big hammer with atomic_read()
> > (unless it uses it in its own arch that expects it to be such).
>
> Right, it could just do the volatile thing..
The reason that the code on s390 looks the way it does is that the volatile
cast was known to generate really bad code.
However, I just tried a few variants (inline asm / ACCESS_ONCE / leave as is)
with gcc 4.5.2 and the resulting code was always identical.
So I'm going to change it to the ACCESS_ONCE variant so it's the same as
everywhere else.
On 11:01 Tue 17 Feb , Will Newton wrote:
> On Mon, Feb 14, 2011 at 11:19 PM, H. Peter Anvin <[email protected]> wrote:
> > On 02/14/2011 02:37 PM, Matt Fleming wrote:
> >>>
> >>> I don't see how cache coherency can possibly work if the hardware
> >>> behaves this way.
> >>
> >> Cache coherency is still maintained provided writes/reads both go
> >> through the cache ;-)
> >>
> >> The problem is that for read-modify-write operations the arbitration
> >> logic that decides who "wins" and is allowed to actually perform the
> >> write, assuming two or more CPUs are competing for a single memory
> >> address, is not implemented in the cache controller, I think. I'm not a
> >> hardware engineer and I never understood how the arbitration logic
> >> worked but I'm guessing that's the reason that the ll/sc instructions
> >> bypass the cache.
> >>
> >> Which is why the atomic_t functions worked out really well for that
> >> arch, such that any accesses to an atomic_t * had to go through the
> >> wrapper functions.
> >
> > I'm sorry... this doesn't compute. Either reads can work normally (go
> > through the cache) in which case atomic_read() can simply be a read or
> > they don't, so I don't understand this at all.
>
> The CPU in question has two sets of instructions:
>
> load/store - these go via the cache (write through)
> ll/sc - these operate literally as if there is no cache (they do not
> hit on read or write)
>
> This may or may not be a sensible way to architect a CPU, but I think
> it is possible to make it work. Making it work efficiently is more of
> a challenge.
Speaking as a (non-commercial) processor designer here, but feel free to point
out anything I'm wrong on. I have direct experience implementing these
operations in hardware so I'd hope what I say here is right. This information
is definitely relevant to the MIPS R4000 as well as my own hardware. A quick
look at the PPC documentation seems to indicate it's the same there too, and it
should agree with the Wikipedia article on the subject:
http://en.wikipedia.org/wiki/Load-link/store-conditional
The entire point of implementing load-linked (ll) / store-conditional (sc)
instructions is to have lockless atomic primitives *using the cache*. Proper
implementations do not bypass the cache; in fact, the cache coherence protocol
must get involved for them to be correct. If properly implemented, these
operations cause no external bus traffic if the critical section is uncontended
and hits the cache (good for scalability). These are the semantics:
ll: Essentially the same as a normal word load. Implementations will need to do
a little internal book-keeping (i.e. save physical address of last ll
instruction and/or modify coherence state for the cacheline).
sc: Store a word if and only if the address has not been written by any other
processor since the last ll. If the store fails, write 0 to a register,
otherwise write 1.
The address may be tracked on cacheline granularity; this operation may
spuriously fail, depending on implementation details (called "weak" ll/sc).
Arguably the "obvious" way to implement this is to have sc fail if the local
CPU snoops a read-for-ownership for the address in question coming from a
remote CPU. This works because the remote CPU will need to gain the cacheline
for exclusive access before its competing sc can execute. Code is supposed to
put ll/sc in a loop and simply retry the operation until the sc succeeds.
Note how the cache and cache coherence protocol are fundamental parts of this
operation; if these instructions simply bypassed the cache, they *could not*
work correctly - how do you detect when the underlying memory has been
modified? You can't simply detect whether the value has changed - it may have
been changed to another value and then back ("ABA" problem). You have to snoop
bus transactions, and that is what the cache and its coherence algorithm
already do. ll/sc can be implemented entirely using the side-effects of the
cache coherence algorithm; my own working hardware implementation does this.
So, atomically reading the variable can be accomplished with a normal load
instruction. I can't speak for unaligned loads on implementations that do them
in hardware, but at least an aligned load of word size should be atomic on any
sane architecture. Only an atomic read-modify-write of the variable needs to
use ll/sc at all, and only for the reason of preventing another concurrent
modification between the load and store. A plain aligned word store should be
atomic too, but it's not too useful because another concurrent store would
not be ordered relative to the local store.
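
(For reference, the retry loop described above looks roughly like this when
written as a MIPS-style inline-asm helper -- a simplified sketch, not lifted
from any particular arch's atomic.h, and without the barriers and errata
workarounds a real implementation carries:)

	/* ll/sc-based atomic add: reload and retry until our sc wins. */
	static inline void atomic_add_llsc(int i, atomic_t *v)
	{
		int temp;

		do {
			__asm__ __volatile__(
			"	ll	%0, %1	# load-linked the counter\n"
			"	addu	%0, %0, %2	# add the increment\n"
			"	sc	%0, %1	# store-conditional: %0 = 1 on success\n"
			: "=&r" (temp), "+m" (v->counter)
			: "Ir" (i));
		} while (!temp);	/* another CPU's sc intervened: retry */
	}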
From: Will Simoneau <[email protected]>
Date: Tue, 15 Feb 2011 16:11:23 -0500
> Note how the cache and cache coherence protocol are fundamental parts of this
> operation; if these instructions simply bypassed the cache, they *could not*
> work correctly - how do you detect when the underlying memory has been
> modified?
The issue here isn't L2 cache bypassing, it's local L1 cache bypassing.
The chips in question apparently do not consult the local L1 cache on
"ll" instructions.
Therefore you must only ever access such atomic data using "ll"
instructions.
On 13:27 Tue 15 Feb , David Miller wrote:
> From: Will Simoneau <[email protected]>
> Date: Tue, 15 Feb 2011 16:11:23 -0500
>
> > Note how the cache and cache coherence protocol are fundamental parts of this
> > operation; if these instructions simply bypassed the cache, they *could not*
> > work correctly - how do you detect when the underlying memory has been
> > modified?
>
> The issue here isn't L2 cache bypassing, it's local L1 cache bypassing.
>
> The chips in question aparently do not consult the local L1 cache on
> "ll" instructions.
>
> Therefore you must only ever access such atomic data using "ll"
> instructions.
(I should not have said "underlying memory", since it is correct that
only the L1 caches are the problem here)
That's some really crippled hardware... it does seem like *any* loads
from *any* address updated by an sc would have to be done with ll as
well, else they may load stale values. One could work this into
atomic_read(), but surely there are other places that are problems.
It would be OK if the caches on the hardware in question were to
back-invalidate matching cachelines when the sc is snooped from the bus,
but it sounds like this doesn't happen?
On Tue, 2011-02-15 at 13:27 -0800, David Miller wrote:
> From: Will Simoneau <[email protected]>
> Date: Tue, 15 Feb 2011 16:11:23 -0500
>
> > Note how the cache and cache coherence protocol are fundamental parts of this
> > operation; if these instructions simply bypassed the cache, they *could not*
> > work correctly - how do you detect when the underlying memory has been
> > modified?
>
> The issue here isn't L2 cache bypassing, it's local L1 cache bypassing.
>
> The chips in question aparently do not consult the local L1 cache on
> "ll" instructions.
>
> Therefore you must only ever access such atomic data using "ll"
> instructions.
Note that it's actually a reasonable design choice to not consult the L1
in these cases .... as long as you invalidate it on the way. That's how
current powerpcs do it afaik, they send a kill to any matching L1 line
along with reading from the L2. (Of course, L1 has to be write-through for
that to work).
Cheers,
Ben.
* Benjamin Herrenschmidt <[email protected]> wrote:
> On Tue, 2011-02-15 at 13:27 -0800, David Miller wrote:
> > From: Will Simoneau <[email protected]>
> > Date: Tue, 15 Feb 2011 16:11:23 -0500
> >
> > > Note how the cache and cache coherence protocol are fundamental parts of this
> > > operation; if these instructions simply bypassed the cache, they *could not*
> > > work correctly - how do you detect when the underlying memory has been
> > > modified?
> >
> > The issue here isn't L2 cache bypassing, it's local L1 cache bypassing.
> >
> > The chips in question aparently do not consult the local L1 cache on
> > "ll" instructions.
> >
> > Therefore you must only ever access such atomic data using "ll"
> > instructions.
>
> Note that it's actually a reasonable design choice to not consult the L1
> in these case .... as long as you invalidate it on the way. That's how
> current powerpcs do it afaik, they send a kill to any matching L1 line
> along as reading from the L2. (Of course, L1 has to be write-through for
> that to work).
Just curious: how does this work if there's an interrupt (or NMI) right after the
invalidate instruction but before the 'll' instruction? The IRQ/NMI may refill the
L1. Or are the two instructions coupled by hw (they form a single instruction in
essence) and irqs/NMIs are inhibited in between?
Thanks,
Ingo
On Tue, Feb 15, 2011 at 9:56 PM, Will Simoneau <[email protected]> wrote:
> On 13:27 Tue 15 Feb , David Miller wrote:
>> From: Will Simoneau <[email protected]>
>> Date: Tue, 15 Feb 2011 16:11:23 -0500
>>
>> > Note how the cache and cache coherence protocol are fundamental parts of this
>> > operation; if these instructions simply bypassed the cache, they *could not*
>> > work correctly - how do you detect when the underlying memory has been
>> > modified?
>>
>> The issue here isn't L2 cache bypassing, it's local L1 cache bypassing.
>>
>> The chips in question aparently do not consult the local L1 cache on
>> "ll" instructions.
>>
>> Therefore you must only ever access such atomic data using "ll"
>> instructions.
>
> (I should not have said "underlying memory", since it is correct that
> only the L1 caches are the problem here)
>
> That's some really crippled hardware... it does seem like *any* loads
> from *any* address updated by an sc would have to be done with ll as
> well, else they may load stale values. One could work this into
> atomic_read(), but surely there are other places that are problems.
I think it's actually ok, atomics have arch-implemented accessors, as
do spinlocks and atomic bitops. Those are the only places we do sc so
we can make sure we always ll or invalidate manually.
> It would be OK if the caches on the hardware in question were to
> back-invalidate matching cachelines when the sc is snooped from the bus,
> but it sounds like this doesn't happen?
Yes it's possible to manually invalidate the line but it is not
automatic. Manual invalidation is actually quite reasonable in many
cases because you never see a bad value, just a potentially stale one,
so many of the races are harmless in practice.
I think you're correct in your comments re multi-processor cache
coherence and the performance problems associated with not doing ll/sc
in the cache. I believe some of the reasoning behind the current
implementation is to allow different processors in the same SoC to
participate in the atomic store protocol without having a fully
coherent cache (and implementing a full cache coherence protocol).
It's my understanding that the ll/sc is implemented somewhere beyond
the cache in the bus fabric.
On Wed, 2011-02-16 at 10:15 +0000, Will Newton wrote:
> > That's some really crippled hardware... it does seem like *any* loads
> > from *any* address updated by an sc would have to be done with ll as
> > well, else they may load stale values. One could work this into
> > atomic_read(), but surely there are other places that are problems.
>
> I think it's actually ok, atomics have arch implemented accessors, as
> do spinlocks and atomic bitops. Those are the only place we do sc so
> we can make sure we always ll or invalidate manually.
I'm curious, how is cmpxchg() implemented on this architecture? As there
are several places in the kernel that uses this on regular variables
without any "accessor" functions.
-- Steve
On Wed, Feb 16, 2011 at 12:18 PM, Steven Rostedt <[email protected]> wrote:
> On Wed, 2011-02-16 at 10:15 +0000, Will Newton wrote:
>
>> > That's some really crippled hardware... it does seem like *any* loads
>> > from *any* address updated by an sc would have to be done with ll as
>> > well, else they may load stale values. One could work this into
>> > atomic_read(), but surely there are other places that are problems.
>>
>> I think it's actually ok, atomics have arch implemented accessors, as
>> do spinlocks and atomic bitops. Those are the only place we do sc so
>> we can make sure we always ll or invalidate manually.
>
> I'm curious, how is cmpxchg() implemented on this architecture? As there
> are several places in the kernel that uses this on regular variables
> without any "accessor" functions.
We can invalidate the cache manually. The current cpu will see the new
value (post-cache invalidate) and the other cpus will see either the
old value or the new value depending on whether they read before or
after the invalidate, which is racy but I don't think it is
problematic. Unless I'm missing something...
* Will Newton ([email protected]) wrote:
> On Wed, Feb 16, 2011 at 12:18 PM, Steven Rostedt <[email protected]> wrote:
> > On Wed, 2011-02-16 at 10:15 +0000, Will Newton wrote:
> >
> >> > That's some really crippled hardware... it does seem like *any* loads
> >> > from *any* address updated by an sc would have to be done with ll as
> >> > well, else they may load stale values. One could work this into
> >> > atomic_read(), but surely there are other places that are problems.
> >>
> >> I think it's actually ok, atomics have arch implemented accessors, as
> >> do spinlocks and atomic bitops. Those are the only place we do sc so
> >> we can make sure we always ll or invalidate manually.
> >
> > I'm curious, how is cmpxchg() implemented on this architecture? As there
> > are several places in the kernel that uses this on regular variables
> > without any "accessor" functions.
>
> We can invalidate the cache manually. The current cpu will see the new
> value (post-cache invalidate) and the other cpus will see either the
> old value or the new value depending on whether they read before or
> after the invalidate, which is racy but I don't think it is
> problematic. Unless I'm missing something...
Assuming the invalidate is specific to a cache-line, I'm concerned about
the failure of a scenario like the following:
initially:
	foo = 0
	bar = 0

CPU A                                   CPU B

xchg(&foo, 1);
  ll foo
  sc foo
  -> interrupt
     if (foo == 1)
       xchg(&bar, 1);
         ll bar
         sc bar
         invalidate bar
                                        lbar = bar;
                                        smp_mb()
                                        lfoo = foo;
                                        BUG_ON(lbar == 1 && lfoo == 0);
  invalidate foo
It should be valid to expect that every time CPU B reads "bar" as 1,
"foo" also reads as 1. However, in this case, the missing invalidate on
foo keeps the updated cacheline from reaching CPU B. There seems to be
a problem with interrupts/NMIs coming right between sc and invalidate,
as Ingo pointed out.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
On 12:41 Wed 16 Feb , Will Newton wrote:
> On Wed, Feb 16, 2011 at 12:18 PM, Steven Rostedt <[email protected]> wrote:
> > I'm curious, how is cmpxchg() implemented on this architecture? As there
> > are several places in the kernel that uses this on regular variables
> > without any "accessor" functions.
>
> We can invalidate the cache manually. The current cpu will see the new
> value (post-cache invalidate) and the other cpus will see either the
> old value or the new value depending on whether they read before or
> after the invalidate, which is racy but I don't think it is
> problematic. Unless I'm missing something...
If I understand this correctly, the manual invalidates must propagate to
all CPUs that potentially read the value, even if there is no
contention. Doesn't this involve IPIs? How does it not suck?
Folks, if you want to invent new masochistic programming models
like this, please do it in a thread of your own with a reduced cc list.
Thank you,
-Andi
--
[email protected] -- Speaking for myself only.
From: Andi Kleen <[email protected]>
Date: Thu, 17 Feb 2011 01:53:00 +0100
>
> Folks, if you want to invent new masochistic programming models
> like this please do it on an own thread with a reduced cc list.
Well, Andi, since you removed the subject nobody has any idea
what thread you are even referring to.
This makes you as much of a bozo as the people you are chiding
right now.
On Wed, Feb 16, 2011 at 18:56, David Miller <[email protected]> wrote:
> Well, Andi, since you removed the subject nobody has any idea
> what thread you are even referring to.
However, all of the necessary information is in the email headers:
In-Reply-To: <[email protected]>
References: <[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
<[email protected]>
It's a damn shame that our email tools ignore such useful information.
On 02/16/2011 12:35 AM, Ingo Molnar wrote:
>
> Just curious: how does this work if there's an interrupt (or NMI) right after the
> invalidate instruction but before the 'll' instruction? The IRQ/NMI may refill the
> L1. Or are the two instructions coupled by hw (they form a single instruction in
> essence) and irqs/NMIs are inhibited inbetween?
>
http://en.wikipedia.org/wiki/Load-link/store-conditional
-hpa
On Wed, Feb 16, 2011 at 10:51 PM, Will Simoneau <[email protected]> wrote:
> On 12:41 Wed 16 Feb , Will Newton wrote:
>> On Wed, Feb 16, 2011 at 12:18 PM, Steven Rostedt <[email protected]> wrote:
>> > I'm curious, how is cmpxchg() implemented on this architecture? As there
>> > are several places in the kernel that uses this on regular variables
>> > without any "accessor" functions.
>>
>> We can invalidate the cache manually. The current cpu will see the new
>> value (post-cache invalidate) and the other cpus will see either the
>> old value or the new value depending on whether they read before or
>> after the invalidate, which is racy but I don't think it is
>> problematic. Unless I'm missing something...
>
> If I understand this correctly, the manual invalidates must propagate to
> all CPUs that potentially read the value, even if there is no
> contention. Doesn't this involve IPIs? How does it not suck?
The cache is shared between cores (in that regard it's more like a
hyper-threaded core than a true multi-core) so is completely coherent,
so this is the one bit that doesn't really suck! Having spoken to our
hardware guys I'm confident that we'll only ever build a handful of
chip designs with the current way of doing ll/sc and hopefully future
cores will do this the "right" way.
* H. Peter Anvin <[email protected]> wrote:
> On 02/16/2011 12:35 AM, Ingo Molnar wrote:
> >
> > Just curious: how does this work if there's an interrupt (or NMI) right after the
> > invalidate instruction but before the 'll' instruction? The IRQ/NMI may refill the
> > L1. Or are the two instructions coupled by hw (they form a single instruction in
> > essence) and irqs/NMIs are inhibited inbetween?
> >
>
> http://en.wikipedia.org/wiki/Load-link/store-conditional
Oh, ll/sc, that indeed clicks - I even wrote such assembly code many years ago ;-)
Thanks,
Ingo