From: Chris Snook <[email protected]>
Make atomic_read() volatile on frv. This ensures that busy-waiting
for an interrupt handler to change an atomic_t won't get compiled to an
infinite loop, consistent with SMP architectures.
Signed-off-by: Chris Snook <[email protected]>
--- linux-2.6.23-rc2-orig/include/asm-frv/atomic.h 2007-07-08 19:32:17.000000000 -0400
+++ linux-2.6.23-rc2/include/asm-frv/atomic.h 2007-08-09 06:41:48.000000000 -0400
@@ -40,7 +40,12 @@ typedef struct {
} atomic_t;
#define ATOMIC_INIT(i) { (i) }
-#define atomic_read(v) ((v)->counter)
+
+/*
+ * Casting to volatile here minimizes the need for barriers,
+ * without having to declare the type itself as volatile.
+ */
+#define atomic_read(v) (*(volatile int *)&(v)->counter)
#define atomic_set(v, i) (((v)->counter) = (i))
#ifndef CONFIG_FRV_OUTOFLINE_ATOMIC_OPS
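[Editor's sketch] The failure mode this patch guards against can be shown in plain user-space C; the names here are illustrative, not the kernel's actual code:

```c
/* Sketch of the problem the patch addresses: with a plain int read,
 * the compiler may hoist the load out of a busy-wait loop and spin
 * forever on a stale register value.  "flag" stands in for an
 * atomic_t counter. */
static int flag;

/* The patch's technique: cast to volatile at the point of the read,
 * forcing a fresh load on every iteration without declaring the
 * field itself volatile. */
#define atomic_read_sketch(p) (*(volatile int *)(p))

static int busy_wait(void)
{
	while (atomic_read_sketch(&flag) == 0)
		;	/* an interrupt handler would set flag here */
	return flag;
}
```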
Chris Snook wrote:
> From: Chris Snook <[email protected]>
>
> Make atomic_read() volatile on frv. This ensures that busy-waiting
> for an interrupt handler to change an atomic_t won't get compiled to an
> infinite loop, consistent with SMP architectures.
To head off the criticism, I admit this is an oversimplification, and true
busy-waiters should be using cpu_relax(), which contains a barrier. As
discussed in recent threads, there are other cases which can be optimized by
removing the need for a barrier, and having behavior consistent with
architectures where the benefit is more profound is also valuable.
-- Chris
Chris Snook <[email protected]> wrote:
> To head off the criticism, I admit this is an oversimplification, and true
> busy-waiters should be using cpu_relax(), which contains a barrier.
Why would you want to use cpu_relax()? That's there to waste time efficiently,
isn't it? Shouldn't you be using smp_rmb() or something like that?
David
David Howells wrote:
> Chris Snook <[email protected]> wrote:
>
>> To head off the criticism, I admit this is an oversimplification, and true
>> busy-waiters should be using cpu_relax(), which contains a barrier.
>
> Why would you want to use cpu_relax()? That's there to waste time efficiently,
> isn't it? Shouldn't you be using smp_rmb() or something like that?
>
> David
cpu_relax() contains a barrier, so it should do the right thing. For
non-smp architectures, I'm concerned about interacting with interrupt
handlers. Some drivers do use atomic_* operations.
-- Chris
Chris Snook <[email protected]> wrote:
>
> cpu_relax() contains a barrier, so it should do the right thing. For
> non-smp architectures, I'm concerned about interacting with interrupt
> handlers. Some drivers do use atomic_* operations.
What problems with interrupt handlers? Access to int/long must
be atomic or we're in big trouble anyway.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote:
> Chris Snook <[email protected]> wrote:
> >
> > cpu_relax() contains a barrier, so it should do the right thing. For
> > non-smp architectures, I'm concerned about interacting with interrupt
> > handlers. Some drivers do use atomic_* operations.
>
> What problems with interrupt handlers? Access to int/long must
> be atomic or we're in big trouble anyway.
Reordering due to compiler optimizations. CPU reordering does not
affect interactions with interrupt handlers on a given CPU, but
reordering due to compiler code-movement optimization does. Since
volatile can in some cases suppress code-movement optimizations,
it can affect interactions with interrupt handlers.
Thanx, Paul
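[Editor's sketch] The compiler code-movement hazard Paul describes, in illustrative (non-kernel) C:

```c
/* With plain stores the compiler may move the assignment to "data"
 * past the assignment to "ready", so an interrupt handler on the same
 * CPU that observes ready == 1 could still read stale data.  Volatile
 * casts pin the order of the two accesses relative to each other,
 * without the global effect of a full barrier. */
static int data;
static int ready;

static void publish_plain(void)
{
	data = 42;
	ready = 1;	/* compiler is free to reorder these two stores */
}

static void publish_volatile(void)
{
	*(volatile int *)&data = 42;
	*(volatile int *)&ready = 1;	/* order fixed: the compiler may
					 * not reorder volatile accesses
					 * against each other */
}
```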
Chris Snook <[email protected]> wrote:
> cpu_relax() contains a barrier, so it should do the right thing. For non-smp
> architectures, I'm concerned about interacting with interrupt handlers. Some
> drivers do use atomic_* operations.
I'm not sure that actually answers my question. Why not smp_rmb()?
David
Paul E. McKenney <[email protected]> wrote:
> On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote:
>> Chris Snook <[email protected]> wrote:
>> >
>> > cpu_relax() contains a barrier, so it should do the right thing. For
>> > non-smp architectures, I'm concerned about interacting with interrupt
>> > handlers. Some drivers do use atomic_* operations.
>>
>> What problems with interrupt handlers? Access to int/long must
>> be atomic or we're in big trouble anyway.
>
> Reordering due to compiler optimizations. CPU reordering does not
> affect interactions with interrupt handlers on a given CPU, but
> reordering due to compiler code-movement optimization does. Since
> volatile can in some cases suppress code-movement optimizations,
> it can affect interactions with interrupt handlers.
If such reordering matters, then you should use one of the
*mb macros or barrier() rather than relying on possibly
hidden volatile cast.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Aug 13, 2007 at 01:15:52PM +0800, Herbert Xu wrote:
> Paul E. McKenney <[email protected]> wrote:
> > On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote:
> >> Chris Snook <[email protected]> wrote:
> >> >
> >> > cpu_relax() contains a barrier, so it should do the right thing. For
> >> > non-smp architectures, I'm concerned about interacting with interrupt
> >> > handlers. Some drivers do use atomic_* operations.
> >>
> >> What problems with interrupt handlers? Access to int/long must
> >> be atomic or we're in big trouble anyway.
> >
> > Reordering due to compiler optimizations. CPU reordering does not
> > affect interactions with interrupt handlers on a given CPU, but
> > reordering due to compiler code-movement optimization does. Since
> > volatile can in some cases suppress code-movement optimizations,
> > it can affect interactions with interrupt handlers.
>
> If such reordering matters, then you should use one of the
> *mb macros or barrier() rather than relying on possibly
> hidden volatile cast.
If communicating among CPUs, sure. However, when communicating between
mainline and interrupt/NMI handlers on the same CPU, the barrier() and
most especially the *mb() macros are gross overkill. So there really
truly is a place for volatile -- not a large place, to be sure, but a
place nonetheless.
Thanx, Paul
David Howells wrote:
> Chris Snook <[email protected]> wrote:
>
>> cpu_relax() contains a barrier, so it should do the right thing. For non-smp
>> architectures, I'm concerned about interacting with interrupt handlers. Some
>> drivers do use atomic_* operations.
>
> I'm not sure that actually answers my question. Why not smp_rmb()?
>
> David
I would assume because we want to waste time efficiently even on non-smp
architectures, rather than frying the CPU or draining the battery. Certain
looping execution patterns can cause the CPU to operate above thermal design
power. I have fans on my workstation that only ever come on when running
LINPACK, and that's generally memory bandwidth-bound. Just imagine what happens
when you're executing the same few non-serializing instructions in a tight loop
without ever stalling on memory fetches, or being scheduled out.
If there's another reason, I'd like to hear it too, because I'm just guessing here.
-- Chris
Paul E. McKenney wrote:
> On Mon, Aug 13, 2007 at 01:15:52PM +0800, Herbert Xu wrote:
>
>>Paul E. McKenney <[email protected]> wrote:
>>
>>>On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote:
>>>
>>>>Chris Snook <[email protected]> wrote:
>>>>
>>>>>cpu_relax() contains a barrier, so it should do the right thing. For
>>>>>non-smp architectures, I'm concerned about interacting with interrupt
>>>>>handlers. Some drivers do use atomic_* operations.
>>>>
>>>>What problems with interrupt handlers? Access to int/long must
>>>>be atomic or we're in big trouble anyway.
>>>
>>>Reordering due to compiler optimizations. CPU reordering does not
>>>affect interactions with interrupt handlers on a given CPU, but
>>>reordering due to compiler code-movement optimization does. Since
>>>volatile can in some cases suppress code-movement optimizations,
>>>it can affect interactions with interrupt handlers.
>>
>>If such reordering matters, then you should use one of the
>>*mb macros or barrier() rather than relying on possibly
>>hidden volatile cast.
>
>
> If communicating among CPUs, sure. However, when communicating between
> mainline and interrupt/NMI handlers on the same CPU, the barrier() and
> most especially the *mb() macros are gross overkill. So there really
> truly is a place for volatile -- not a large place, to be sure, but a
> place nonetheless.
I really would like all volatile users to go away and be replaced
by explicit barriers. It makes things nicer and more explicit... for
atomic_t type there probably aren't many optimisations that can be
made which volatile would disallow (in actual kernel code), but for
others (eg. bitops, maybe atomic ops in UP kernels), there would be.
Maybe it is the safe way to go, but it does obscure cases where there
is a real need for barriers.
Many atomic operations are allowed to be reordered between CPUs, so
I don't have a good idea for the rationale to order them within the
CPU (also loads and stores to long and ptr types are not ordered like
this, although we do consider those to be atomic operations too).
barrier() in a way is like enforcing sequential memory ordering
between process and interrupt context, whereas volatile is just
enforcing coherency of a single memory location (and as such is
cheaper).
What do you think of this crazy idea?
/* Enforce a compiler barrier for only operations to location X.
* Call multiple times to provide an ordering between multiple
* memory locations. Other memory operations can be assumed by
* the compiler to remain unchanged and may be reordered
*/
#define order(x) asm volatile("" : "+m" (x))
--
SUSE Labs, Novell Inc.
Chris Snook wrote:
> David Howells wrote:
>
>> Chris Snook <[email protected]> wrote:
>>
>>> cpu_relax() contains a barrier, so it should do the right thing. For
>>> non-smp
>>> architectures, I'm concerned about interacting with interrupt
>>> handlers. Some
>>> drivers do use atomic_* operations.
>>
>>
>> I'm not sure that actually answers my question. Why not smp_rmb()?
>>
>> David
>
>
> I would assume because we want to waste time efficiently even on non-smp
> architectures, rather than frying the CPU or draining the battery.
> Certain looping execution patterns can cause the CPU to operate above
> thermal design power. I have fans on my workstation that only ever come
> on when running LINPACK, and that's generally memory bandwidth-bound.
> Just imagine what happens when you're executing the same few
> non-serializing instructions in a tight loop without ever stalling on
> memory fetches, or being scheduled out.
>
> If there's another reason, I'd like to hear it too, because I'm just
> guessing here.
Well if there is only one memory location involved, then smp_rmb() isn't
going to really do anything anyway, so it would be incorrect to use it.
Consider that smp_rmb basically will do anything from flushing the
pipeline to invalidating loads speculatively executed out of order. AFAIK
it will not control the visibility of stores coming from other CPUs (that
is up to the cache coherency).
--
SUSE Labs, Novell Inc.
On Tue, Aug 14, 2007 at 03:34:25PM +1000, Nick Piggin wrote:
>
> What do you think of this crazy idea?
>
> /* Enforce a compiler barrier for only operations to location X.
> * Call multiple times to provide an ordering between multiple
> * memory locations. Other memory operations can be assumed by
> * the compiler to remain unchanged and may be reordered
> */
> #define order(x) asm volatile("" : "+m" (x))
Yes this is a very good idea. This also makes it explicit
that the coder is depending on this rather than the more
vague connotations of atomic_read.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Aug 14, 2007 at 03:34:25PM +1000, Nick Piggin wrote:
> Paul E. McKenney wrote:
> >On Mon, Aug 13, 2007 at 01:15:52PM +0800, Herbert Xu wrote:
> >
> >>Paul E. McKenney <[email protected]> wrote:
> >>
> >>>On Sat, Aug 11, 2007 at 08:54:46AM +0800, Herbert Xu wrote:
> >>>
> >>>>Chris Snook <[email protected]> wrote:
> >>>>
> >>>>>cpu_relax() contains a barrier, so it should do the right thing. For
> >>>>>non-smp architectures, I'm concerned about interacting with interrupt
> >>>>>handlers. Some drivers do use atomic_* operations.
> >>>>
> >>>>What problems with interrupt handlers? Access to int/long must
> >>>>be atomic or we're in big trouble anyway.
> >>>
> >>>Reordering due to compiler optimizations. CPU reordering does not
> >>>affect interactions with interrupt handlers on a given CPU, but
> >>>reordering due to compiler code-movement optimization does. Since
> >>>volatile can in some cases suppress code-movement optimizations,
> >>>it can affect interactions with interrupt handlers.
> >>
> >>If such reordering matters, then you should use one of the
> >>*mb macros or barrier() rather than relying on possibly
> >>hidden volatile cast.
> >
> >
> >If communicating among CPUs, sure. However, when communicating between
> >mainline and interrupt/NMI handlers on the same CPU, the barrier() and
> >most especially the *mb() macros are gross overkill. So there really
> >truly is a place for volatile -- not a large place, to be sure, but a
> >place nonetheless.
>
> I really would like all volatile users to go away and be replaced
> by explicit barriers. It makes things nicer and more explicit... for
> atomic_t type there probably aren't many optimisations that can be
> made which volatile would disallow (in actual kernel code), but for
> others (eg. bitops, maybe atomic ops in UP kernels), there would be.
>
> Maybe it is the safe way to go, but it does obscure cases where there
> is a real need for barriers.
I prefer burying barriers into other primitives.
> Many atomic operations are allowed to be reordered between CPUs, so
> I don't have a good idea for the rationale to order them within the
> CPU (also loads and stores to long and ptr types are not ordered like
> this, although we do consider those to be atomic operations too).
>
> barrier() in a way is like enforcing sequential memory ordering
> between process and interrupt context, whereas volatile is just
> enforcing coherency of a single memory location (and as such is
> cheaper).
barrier() is useful, but it has the very painful side-effect of forcing
the compiler to dump temporaries. So we do need something that is
not quite so global in effect.
> What do you think of this crazy idea?
>
> /* Enforce a compiler barrier for only operations to location X.
> * Call multiple times to provide an ordering between multiple
> * memory locations. Other memory operations can be assumed by
> * the compiler to remain unchanged and may be reordered
> */
> #define order(x) asm volatile("" : "+m" (x))
There was something very similar discussed earlier in this thread,
with quite a bit of debate as to exactly what the "m" flag should
look like. I suggested something similar named ACCESS_ONCE in the
context of RCU (http://lkml.org/lkml/2007/7/11/664):
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
The nice thing about this is that it works for both loads and stores.
Not clear that order() above does this -- I get compiler errors when
I try something like "b = order(a)" or "order(a) = 1" using gcc 4.1.2.
Thanx, Paul
On Tuesday 14 August 2007, Paul E. McKenney wrote:
> > #define order(x) asm volatile("" : "+m" (x))
>
> There was something very similar discussed earlier in this thread,
> with quite a bit of debate as to exactly what the "m" flag should
> look like.  I suggested something similar named ACCESS_ONCE in the
> context of RCU (http://lkml.org/lkml/2007/7/11/664):
>
>         #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
>
> The nice thing about this is that it works for both loads and stores.
> Not clear that order() above does this -- I get compiler errors when
> I try something like "b = order(a)" or "order(a) = 1" using gcc 4.1.2.
Well, it serves a different purpose: While your ACCESS_ONCE() macro is
an lvalue, the order() macro is a statement that can be used in place
of the barrier() macro. order() is the most lightweight barrier as it
only enforces ordering on a single variable in the compiler, but does
not have any side-effects visible to other threads, like the cache
line access in ACCESS_ONCE has.
Arnd <><
On Wed, Aug 15, 2007 at 12:01:54AM +0200, Arnd Bergmann wrote:
> On Tuesday 14 August 2007, Paul E. McKenney wrote:
> > > #define order(x) asm volatile("" : "+m" (x))
> >
> > There was something very similar discussed earlier in this thread,
> > with quite a bit of debate as to exactly what the "m" flag should
> > look like.  I suggested something similar named ACCESS_ONCE in the
> > context of RCU (http://lkml.org/lkml/2007/7/11/664):
> >
> >         #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> >
> > The nice thing about this is that it works for both loads and stores.
> > Not clear that order() above does this -- I get compiler errors when
> > I try something like "b = order(a)" or "order(a) = 1" using gcc 4.1.2.
>
> Well, it serves a different purpose: While your ACCESS_ONCE() macro is
> an lvalue, the order() macro is a statement that can be used in place
> of the barrier() macro. order() is the most lightweight barrier as it
> only enforces ordering on a single variable in the compiler, but does
> not have any side-effects visible to other threads, like the cache
> line access in ACCESS_ONCE has.
ACCESS_ONCE() is indeed intended to be used when actually loading or
storing the variable. That said, I must admit that it is not clear to me
why you would want to add an extra order() rather than ACCESS_ONCE()ing
one or both of the adjacent accesses to that same variable.
So, what am I missing?
Thanx, Paul
Paul E. McKenney wrote:
> On Tue, Aug 14, 2007 at 03:34:25PM +1000, Nick Piggin wrote:
>>Maybe it is the safe way to go, but it does obscure cases where there
>>is a real need for barriers.
>
>
> I prefer burying barriers into other primitives.
When they should naturally be there, eg. locking or the RCU primitives,
I agree.
I don't like having them scattered in various "just in case" places,
because it makes both the users and the APIs hard to understand and
change.
Especially since several big architectures don't have volatile in their
atomic_read and _set, I think it would be a step backwards to add them in
as a "just in case" thing now (unless there is a better reason).
>>Many atomic operations are allowed to be reordered between CPUs, so
>>I don't have a good idea for the rationale to order them within the
>>CPU (also loads and stores to long and ptr types are not ordered like
>>this, although we do consider those to be atomic operations too).
>>
>>barrier() in a way is like enforcing sequential memory ordering
>>between process and interrupt context, whereas volatile is just
>>enforcing coherency of a single memory location (and as such is
>>cheaper).
>
>
> barrier() is useful, but it has the very painful side-effect of forcing
> the compiler to dump temporaries. So we do need something that is
> not quite so global in effect.
Yep.
>>What do you think of this crazy idea?
>>
>>/* Enforce a compiler barrier for only operations to location X.
>> * Call multiple times to provide an ordering between multiple
>> * memory locations. Other memory operations can be assumed by
>> * the compiler to remain unchanged and may be reordered
>> */
>>#define order(x) asm volatile("" : "+m" (x))
>
>
> There was something very similar discussed earlier in this thread,
> with quite a bit of debate as to exactly what the "m" flag should
> look like. I suggested something similar named ACCESS_ONCE in the
> context of RCU (http://lkml.org/lkml/2007/7/11/664):
Oh, I missed that earlier debate. Will go have a look.
> #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
>
> The nice thing about this is that it works for both loads and stores.
> Not clear that order() above does this -- I get compiler errors when
> I try something like "b = order(a)" or "order(a) = 1" using gcc 4.1.2.
As Arnd pointed out, order() is not supposed to be an lvalue, but a
statement like the rest of our memory ordering API.
As to your followup question of why to use it over ACCESS_ONCE. I
guess, aside from consistency with the rest of the barrier APIs, you
can use it in other primitives when you don't actually know what the
caller is going to do or if it even will make an access. You could
also use it between calls to _other_ primitives, etc... it just
seems more flexible to me, but I haven't actually used such a thing
in real code...
ACCESS_ONCE doesn't seem as descriptive. What it results in is the
memory location being loaded or stored (presumably once exactly),
but I think the more general underlying idea is a barrier point.
--
SUSE Labs, Novell Inc.
On Wednesday 15 August 2007, Paul E. McKenney wrote:
> ACCESS_ONCE() is indeed intended to be used when actually loading or
> storing the variable. That said, I must admit that it is not clear to me
> why you would want to add an extra order() rather than ACCESS_ONCE()ing
> one or both of the adjacent accesses to that same variable.
>
> So, what am I missing?
You're probably right, the only case I can construct is something like
if (ACCESS_ONCE(x)) {
...
ACCESS_ONCE(x)++;
}
which would be slightly less efficient than
if (x)
x++;
order(x);
because in the first case, you need to do two ordered accesses
but only one in the second case. However, I can't think of a case
where this actually makes a noticable difference in real life.
Arnd <><
On Wednesday 15 August 2007 15:29:43 Arnd Bergmann wrote:
> On Wednesday 15 August 2007, Paul E. McKenney wrote:
>
> > ACCESS_ONCE() is indeed intended to be used when actually loading or
> > storing the variable. That said, I must admit that it is not clear to me
> > why you would want to add an extra order() rather than ACCESS_ONCE()ing
> > one or both of the adjacent accesses to that same variable.
> >
> > So, what am I missing?
>
> You're probably right, the only case I can construct is something like
>
> if (ACCESS_ONCE(x)) {
> ...
> ACCESS_ONCE(x)++;
> }
>
> which would be slightly less efficient than
>
> if (x)
> x++;
> order(x);
>
> because in the first case, you need to do two ordered accesses
> but only one in the second case. However, I can't think of a case
> where this actually makes a noticable difference in real life.
How can this example actually get used in a sane and race-free
way? This would need locking around the whole if
statement. But locking is a barrier.
--
Greetings Michael.
> Well if there is only one memory location involved, then smp_rmb()
> isn't
> going to really do anything anyway, so it would be incorrect to use it.
rmb() orders *any* two reads; that includes two reads from the same
location.
> Consider that smp_rmb basically will do anything from flushing the
> pipeline to invalidating loads speculatively executed out of order.
> AFAIK
> it will not control the visibility of stores coming from other CPUs
> (that
> is up to the cache coherency).
The writer side should typically use wmb() whenever the reader side
uses rmb(), sure.
Segher
On Wed, Aug 15, 2007 at 08:51:58PM +0200, Segher Boessenkool wrote:
> >Well if there is only one memory location involved, then smp_rmb()
> >isn't
> >going to really do anything anyway, so it would be incorrect to use it.
>
> rmb() orders *any* two reads; that includes two reads from the same
> location.
If the two reads are to the same location, all CPUs I am aware of
will maintain the ordering without need for a memory barrier.
Thanx, Paul
> >Consider that smp_rmb basically will do anything from flushing the
> >pipeline to invalidating loads speculatively executed out of order.
> >AFAIK
> >it will not control the visibility of stores coming from other CPUs
> >(that
> >is up to the cache coherency).
>
> The writer side should typically use wmb() whenever the reader side
> uses rmb(), sure.
>
>
> Segher
>
>>> Well if there is only one memory location involved, then smp_rmb()
>>> isn't
>>> going to really do anything anyway, so it would be incorrect to use
>>> it.
>>
>> rmb() orders *any* two reads; that includes two reads from the same
>> location.
>
> If the two reads are to the same location, all CPUs I am aware of
> will maintain the ordering without need for a memory barrier.
That's true of course, although there is no real guarantee for that.
Segher
On Wed, Aug 15, 2007 at 09:46:55PM +0200, Segher Boessenkool wrote:
> >>>Well if there is only one memory location involved, then smp_rmb()
> >>>isn't
> >>>going to really do anything anyway, so it would be incorrect to use
> >>>it.
> >>
> >>rmb() orders *any* two reads; that includes two reads from the same
> >>location.
> >
> >If the two reads are to the same location, all CPUs I am aware of
> >will maintain the ordering without need for a memory barrier.
>
> That's true of course, although there is no real guarantee for that.
A CPU that did not provide this property ("cache coherence") would be
most emphatically reviled. So we are pretty safe assuming that CPUs
will provide it.
Thanx, Paul
On Wed, Aug 15, 2007 at 11:30:05PM +1000, Nick Piggin wrote:
> Paul E. McKenney wrote:
> >On Tue, Aug 14, 2007 at 03:34:25PM +1000, Nick Piggin wrote:
>
> >>Maybe it is the safe way to go, but it does obscure cases where there
> >>is a real need for barriers.
> >
> >
> >I prefer burying barriers into other primitives.
>
> When they should naturally be there, eg. locking or the RCU primitives,
> I agree.
>
> I don't like having them scattered in various "just in case" places,
> because it makes both the users and the APIs hard to understand and
> change.
I certainly agree that we shouldn't do volatile just for the fun of it,
and also that use of volatile should be quite rare.
> Especially since several big architectures don't have volatile in their
> atomic_read and _set, I think it would be a step backwards to add them in
> as a "just in case" thing now (unless there is a better reason).
Good point, except that I would expect gcc's optimization to continue
to improve. I would like the kernel to be able to take advantage of
improved optimization, which means that we are going to have to make
a few changes. Adding volatile to atomic_read() and atomic_set() is
IMHO one of those changes.
> >>Many atomic operations are allowed to be reordered between CPUs, so
> >>I don't have a good idea for the rationale to order them within the
> >>CPU (also loads and stores to long and ptr types are not ordered like
> >>this, although we do consider those to be atomic operations too).
> >>
> >>barrier() in a way is like enforcing sequential memory ordering
> >>between process and interrupt context, whereas volatile is just
> >>enforcing coherency of a single memory location (and as such is
> >>cheaper).
> >
> >
> >barrier() is useful, but it has the very painful side-effect of forcing
> >the compiler to dump temporaries. So we do need something that is
> >not quite so global in effect.
>
> Yep.
>
> >>What do you think of this crazy idea?
> >>
> >>/* Enforce a compiler barrier for only operations to location X.
> >>* Call multiple times to provide an ordering between multiple
> >>* memory locations. Other memory operations can be assumed by
> >>* the compiler to remain unchanged and may be reordered
> >>*/
> >>#define order(x) asm volatile("" : "+m" (x))
> >
> >There was something very similar discussed earlier in this thread,
> >with quite a bit of debate as to exactly what the "m" flag should
> >look like. I suggested something similar named ACCESS_ONCE in the
> >context of RCU (http://lkml.org/lkml/2007/7/11/664):
>
> Oh, I missed that earlier debate. Will go have a look.
>
> > #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> >
> >The nice thing about this is that it works for both loads and stores.
> >Not clear that order() above does this -- I get compiler errors when
> >I try something like "b = order(a)" or "order(a) = 1" using gcc 4.1.2.
>
> As Arnd ponted out, order() is not supposed to be an lvalue, but a
> statement like the rest of our memory ordering API.
>
> As to your followup question of why to use it over ACCESS_ONCE. I
> guess, aside from consistency with the rest of the barrier APIs, you
> can use it in other primitives when you don't actually know what the
> caller is going to do or if it even will make an access. You could
> also use it between calls to _other_ primitives, etc... it just
> seems more flexible to me, but I haven't actually used such a thing
> in real code...
>
> ACCESS_ONCE doesn't seem as descriptive. What it results in is the
> memory location being loaded or stored (presumably once exactly),
> but I think the more general underlying idea is a barrier point.
OK, first, I am not arguing that ACCESS_ONCE() can replace all current
uses of barrier(). Similarly, rcu_dereference(), rcu_assign_pointer(),
and the various RCU versions of the list-manipulation API in no way
replaced all uses of explicit memory barriers. However, I do believe that
these API are much easier to use (where they apply, of course) than were
the earlier idioms involving explicit memory barriers.
But we of course need to keep the explicit memory and compiler barriers
for other situations.
Thanx, Paul
>>>>> Well if there is only one memory location involved, then smp_rmb()
>>>>> isn't
>>>>> going to really do anything anyway, so it would be incorrect to use
>>>>> it.
>>>>
>>>> rmb() orders *any* two reads; that includes two reads from the same
>>>> location.
>>>
>>> If the two reads are to the same location, all CPUs I am aware of
>>> will maintain the ordering without need for a memory barrier.
>>
>> That's true of course, although there is no real guarantee for that.
>
> A CPU that did not provide this property ("cache coherence") would be
> most emphatically reviled.
That doesn't have anything to do with coherency as far as I can see.
It's just about the order in which a CPU (speculatively) performs the
loads (which isn't necessarily the same as the order in which it
executes the corresponding instructions, even).
> So we are pretty safe assuming that CPUs
> will provide it.
Yeah, pretty safe. I just don't like undocumented assumptions :-)
Segher
On Wed, Aug 15, 2007 at 10:13:49PM +0200, Segher Boessenkool wrote:
> >>>>>Well if there is only one memory location involved, then smp_rmb()
> >>>>>isn't
> >>>>>going to really do anything anyway, so it would be incorrect to use
> >>>>>it.
> >>>>
> >>>>rmb() orders *any* two reads; that includes two reads from the same
> >>>>location.
> >>>
> >>>If the two reads are to the same location, all CPUs I am aware of
> >>>will maintain the ordering without need for a memory barrier.
> >>
> >>That's true of course, although there is no real guarantee for that.
> >
> >A CPU that did not provide this property ("cache coherence") would be
> >most emphatically reviled.
>
> That doesn't have anything to do with coherency as far as I can see.
>
> It's just about the order in which a CPU (speculatively) performs the
> loads
> (which isn't necessarily the same as the order in which it executes the
> corresponding instructions, even).
Please check the definition of "cache coherence".
Summary: the CPU is indeed within its rights to execute loads and stores
to a single variable out of order, -but- only if it gets the same result
that it would have obtained by executing them in order. Which means that
any reordering of accesses by a single CPU to a single variable will be
invisible to the software.
> >So we are pretty safe assuming that CPUs
> >will provide it.
>
> Yeah, pretty safe. I just don't like undocumented assumptions :-)
Can't help you there! ;-)
Thanx, Paul
> Please check the definition of "cache coherence".
Which of the twelve thousand such definitions? :-)
> Summary: the CPU is indeed within its rights to execute loads and
> stores to a single variable out of order, -but- only if it gets the
> same result that it would have obtained by executing them in order.
> Which means that any reordering of accesses by a single CPU to a
> single variable will be invisible to the software.
I'm still not sure if that applies to all architectures.
Doesn't matter anyway, let's kill this thread :-)
Segher
Paul E. McKenney wrote:
> On Wed, Aug 15, 2007 at 11:30:05PM +1000, Nick Piggin wrote:
>>Especially since several big architectures don't have volatile in their
>>atomic_get and _set, I think it would be a step backwards to add them in
>>as a "just in case" thing now (unless there is a better reason).
>
>
> Good point, except that I would expect gcc's optimization to continue
> to improve. I would like the kernel to be able to take advantage of
> improved optimization, which means that we are going to have to make
> a few changes. Adding volatile to atomic_get() and atomic_set() is
> IMHO one of those changes.
What optimisations? gcc already does most of the things you need a
barrier/volatile for, like reordering non-dependent loads and stores,
and eliminating mem ops completely by caching in registers.
>>As to your followup question of why to use it over ACCESS_ONCE. I
>>guess, aside from consistency with the rest of the barrier APIs, you
>>can use it in other primitives when you don't actually know what the
>>caller is going to do or if it even will make an access. You could
>>also use it between calls to _other_ primitives, etc... it just
>>seems more flexible to me, but I haven't actually used such a thing
>>in real code...
>>
>>ACCESS_ONCE doesn't seem as descriptive. What it results in is the
>>memory location being loaded or stored (presumably once exactly),
>>but I think the more general underlying idea is a barrier point.
>
>
> OK, first, I am not arguing that ACCESS_ONCE() can replace all current
> uses of barrier().
OK. Well I also wasn't saying that ACCESS_ONCE should not be
implemented. But if we want something like it, then it would make
sense to have an equivalent barrier statement as well (ie. order()).
--
SUSE Labs, Novell Inc.
Segher Boessenkool wrote:
>> Please check the definition of "cache coherence".
>
>
> Which of the twelve thousand such definitions? :-)
Every definition I have seen says that writes to a single memory
location have a serial order as seen by all CPUs, and that a read
will return the most recent write in the sequence (with a bit of
extra mumbo jumbo to cover store queues and store forwarding).
Within such a definition, I don't see how it would be allowed for a
single CPU to execute reads out of order and have the second read
return an earlier write than the first read.
--
SUSE Labs, Novell Inc.
On Thu, Aug 16, 2007 at 11:09:25AM +1000, Nick Piggin wrote:
> Paul E. McKenney wrote:
> >On Wed, Aug 15, 2007 at 11:30:05PM +1000, Nick Piggin wrote:
>
> >>Especially since several big architectures don't have volatile in their
> >>atomic_get and _set, I think it would be a step backwards to add them in
> >>as a "just in case" thing now (unless there is a better reason).
> >
> >Good point, except that I would expect gcc's optimization to continue
> >to improve. I would like the kernel to be able to take advantage of
> >improved optimization, which means that we are going to have to make
> >a few changes. Adding volatile to atomic_get() and atomic_set() is
> >IMHO one of those changes.
>
> What optimisations? gcc already does most of the things you need a
> barrier/volatile for, like reordering non-dependent loads and stores,
> and eliminating mem ops completely by caching in registers.
Yep, most of the currently practiced optimizations. Given that CPU clock
frequencies are not rising as quickly as they once did, I would expect
that there will be added effort on implementing the known more-aggressive
optimization techniques, and on coming up with new ones as well.
Some of these new optimizations will likely be inappropriate for kernel
code, but I expect that some will be things that we want.
> >>As to your followup question of why to use it over ACCESS_ONCE. I
> >>guess, aside from consistency with the rest of the barrier APIs, you
> >>can use it in other primitives when you don't actually know what the
> >>caller is going to do or if it even will make an access. You could
> >>also use it between calls to _other_ primitives, etc... it just
> >>seems more flexible to me, but I haven't actually used such a thing
> >>in real code...
> >>
> >>ACCESS_ONCE doesn't seem as descriptive. What it results in is the
> >>memory location being loaded or stored (presumably once exactly),
> >>but I think the more general underlying idea is a barrier point.
> >
> >OK, first, I am not arguing that ACCESS_ONCE() can replace all current
> >uses of barrier().
>
> OK. Well I also wasn't saying that ACCESS_ONCE should not be
> implemented. But if we want something like it, then it would make
> sense to have an equivalent barrier statement as well (ie. order()).
And I am not arguing against use of asms to implement the volatility
in atomic_read() and atomic_set(), and in fact it appears that one
of the architectures will be taking this approach.
Sounds like we might be in violent agreement. ;-)
Thanx, Paul