From: Chris Snook <[email protected]>
Purify volatile use for atomic[64]_t on alpha.
Signed-off-by: Chris Snook <[email protected]>
--- linux-2.6.23-rc2-orig/include/asm-alpha/atomic.h 2007-07-08 19:32:17.000000000 -0400
+++ linux-2.6.23-rc2/include/asm-alpha/atomic.h 2007-08-09 09:19:00.000000000 -0400
@@ -14,18 +14,22 @@


/*
- * Counter is volatile to make sure gcc doesn't try to be clever
- * and move things around on us. We need to use _exactly_ the address
- * the user gave us, not some alias that contains the same information.
+ * Make sure gcc doesn't try to be clever and move things around
+ * on us. We need to use _exactly_ the address the user gave us,
+ * not some alias that contains the same information.
 */
-typedef struct { volatile int counter; } atomic_t;
-typedef struct { volatile long counter; } atomic64_t;
+typedef struct { int counter; } atomic_t;
+typedef struct { long counter; } atomic64_t;

#define ATOMIC_INIT(i) ( (atomic_t) { (i) } )
#define ATOMIC64_INIT(i) ( (atomic64_t) { (i) } )

-#define atomic_read(v) ((v)->counter + 0)
-#define atomic64_read(v) ((v)->counter + 0)
+/*
+ * Casting to volatile here minimizes the need for barriers,
+ * without having to declare the type itself as volatile.
+ */
+#define atomic_read(v) (*(volatile int *)&(v)->counter + 0)
+#define atomic64_read(v) (*(volatile long *)&(v)->counter + 0)

#define atomic_set(v,i) ((v)->counter = (i))
#define atomic64_set(v,i) ((v)->counter = (i))
On Thu, Aug 09, 2007 at 09:24:42AM -0400, Chris Snook wrote:
> From: Chris Snook <[email protected]>
>
> Purify volatile use for atomic[64]_t on alpha.
Why not the same access-once semantics for atomic_set() as
for atomic_read()? As this patch stands, it might introduce
architecture-specific compiler-induced bugs due to the fact that
atomic_set() used to imply volatile behavior but no longer does.
See below for one approach to fixing this.
> Signed-off-by: Chris Snook <[email protected]>
>
> --- linux-2.6.23-rc2-orig/include/asm-alpha/atomic.h 2007-07-08 19:32:17.000000000 -0400
> +++ linux-2.6.23-rc2/include/asm-alpha/atomic.h 2007-08-09 09:19:00.000000000 -0400
> @@ -14,18 +14,22 @@
>
>
> /*
> - * Counter is volatile to make sure gcc doesn't try to be clever
> - * and move things around on us. We need to use _exactly_ the address
> - * the user gave us, not some alias that contains the same information.
> + * Make sure gcc doesn't try to be clever and move things around
> + * on us. We need to use _exactly_ the address the user gave us,
> + * not some alias that contains the same information.
> */
> -typedef struct { volatile int counter; } atomic_t;
> -typedef struct { volatile long counter; } atomic64_t;
> +typedef struct { int counter; } atomic_t;
> +typedef struct { long counter; } atomic64_t;
>
> #define ATOMIC_INIT(i) ( (atomic_t) { (i) } )
> #define ATOMIC64_INIT(i) ( (atomic64_t) { (i) } )
>
> -#define atomic_read(v) ((v)->counter + 0)
> -#define atomic64_read(v) ((v)->counter + 0)
> +/*
> + * Casting to volatile here minimizes the need for barriers,
> + * without having to declare the type itself as volatile.
> + */
> +#define atomic_read(v) (*(volatile int *)&(v)->counter + 0)
> +#define atomic64_read(v) (*(volatile long *)&(v)->counter + 0)
>
> #define atomic_set(v,i) ((v)->counter = (i))
> #define atomic64_set(v,i) ((v)->counter = (i))
How about:
#define atomic_set(v,i) ((*(volatile int *)&(v)->counter) = (i))
#define atomic64_set(v,i) ((*(volatile long *)&(v)->counter) = (i))
Thanx, Paul
Paul E. McKenney wrote:
> Why not the same access-once semantics for atomic_set() as
> for atomic_read()? As this patch stands, it might introduce
> architecture-specific compiler-induced bugs due to the fact that
> atomic_set() used to imply volatile behavior but no longer does.
When we make the volatile cast in atomic_read(), we're casting an rvalue to
volatile. This unambiguously tells the compiler that we want to re-load that
register from memory. What's "volatile behavior" for an lvalue? A write to an
lvalue already implies an eventual write to memory, so this would be a no-op.
Maybe you'll write to the register a few times before flushing it to memory, but
it will happen eventually. With an rvalue, there's no guarantee that it will
*ever* load from memory, which is what volatile fixes.
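To make that concrete, here is a minimal sketch (hypothetical wait loop, not taken from any driver) of what the volatile cast buys on the read side:

	static atomic_t flag;

	void wait_for_flag(void)
	{
		/*
		 * Without the volatile cast in atomic_read(), gcc may load
		 * flag.counter into a register once and spin on that stale
		 * copy forever.  The cast forces a fresh load on each pass.
		 */
		while (atomic_read(&flag) == 0)
			cpu_relax();
	}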
I think what you have in mind is LOCK_PREFIX behavior, which is not the purpose
of atomic_set. We use LOCK_PREFIX in the inline assembly for the atomic_*
operations that read, modify, and write a value, only because it is necessary to
perform that entire transaction atomically.
-- Chris
On Thu, Aug 09, 2007 at 10:53:14AM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >Why not the same access-once semantics for atomic_set() as
> >for atomic_read()? As this patch stands, it might introduce
> >architecture-specific compiler-induced bugs due to the fact that
> >atomic_set() used to imply volatile behavior but no longer does.
>
> When we make the volatile cast in atomic_read(), we're casting an rvalue to
> volatile. This unambiguously tells the compiler that we want to re-load
> that register from memory. What's "volatile behavior" for an lvalue?
I was absolutely -not- suggesting volatile behavior for lvalues.
Instead, I am asking for volatile behavior from an -rvalue-. In the
case of atomic_read(), it is the atomic_t being read from. In the case
of atomic_set(), it is the atomic_t being written to. As suggested in
my previous email:
#define atomic_set(v,i) ((*(volatile int *)&(v)->counter) = (i))
#define atomic64_set(v,i) ((*(volatile long *)&(v)->counter) = (i))
Again, the architectures that used to have their "counter" declared
as volatile will lose volatile semantics on atomic_set() with your
patch, which might result in bugs due to overly imaginative compiler
optimizations. The above would prevent any such bugs from appearing.
> A
> write to an lvalue already implies an eventual write to memory, so this
> would be a no-op. Maybe you'll write to the register a few times before
> flushing it to memory, but it will happen eventually. With an rvalue,
> there's no guarantee that it will *ever* load from memory, which is what
> volatile fixes.
>
> I think what you have in mind is LOCK_PREFIX behavior, which is not the
> purpose of atomic_set. We use LOCK_PREFIX in the inline assembly for the
> atomic_* operations that read, modify, and write a value, only because it
> is necessary to perform that entire transaction atomically.
No LOCK_PREFIX, thank you!!! I just want to make sure that the compiler
doesn't push the store down out of a loop, split the store, allow the
store to happen twice (e.g., to allow different code paths to be merged),
and all the other tricks that the C standard permits compilers to pull.
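To sketch the first of those tricks (hypothetical names, assuming atomic_set() compiles to a plain, non-volatile store):

	/* What was written: publish progress on every pass. */
	while (!done) {
		do_work();
		atomic_set(&progress, ++i);
	}

	/*
	 * What the optimizer may then legally emit: the store is sunk out
	 * of the loop, so no intermediate value is ever visible to an
	 * interrupt handler or another CPU.
	 */
	while (!done) {
		do_work();
		++i;
	}
	progress.counter = i;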
Thanx, Paul
Paul E. McKenney wrote:
> On Thu, Aug 09, 2007 at 10:53:14AM -0400, Chris Snook wrote:
>> Paul E. McKenney wrote:
>>> Why not the same access-once semantics for atomic_set() as
>>> for atomic_read()? As this patch stands, it might introduce
>>> architecture-specific compiler-induced bugs due to the fact that
>>> atomic_set() used to imply volatile behavior but no longer does.
>> When we make the volatile cast in atomic_read(), we're casting an rvalue to
>> volatile. This unambiguously tells the compiler that we want to re-load
>> that register from memory. What's "volatile behavior" for an lvalue?
>
> I was absolutely -not- suggesting volatile behavior for lvalues.
>
> Instead, I am asking for volatile behavior from an -rvalue-. In the
> case of atomic_read(), it is the atomic_t being read from. In the case
> of atomic_set(), it is the atomic_t being written to. As suggested in
> my previous email:
>
> #define atomic_set(v,i) ((*(volatile int *)&(v)->counter) = (i))
> #define atomic64_set(v,i) ((*(volatile long *)&(v)->counter) = (i))
That looks like a volatile lvalue to me. I confess I didn't exactly ace
compilers. Care to explain this?
> Again, the architectures that used to have their "counter" declared
> as volatile will lose volatile semantics on atomic_set() with your
> patch, which might result in bugs due to overly imaginative compiler
> optimizations. The above would prevent any such bugs from appearing.
>
>> A
>> write to an lvalue already implies an eventual write to memory, so this
>> would be a no-op. Maybe you'll write to the register a few times before
>> flushing it to memory, but it will happen eventually. With an rvalue,
>> there's no guarantee that it will *ever* load from memory, which is what
>> volatile fixes.
>>
>> I think what you have in mind is LOCK_PREFIX behavior, which is not the
>> purpose of atomic_set. We use LOCK_PREFIX in the inline assembly for the
>> atomic_* operations that read, modify, and write a value, only because it
>> is necessary to perform that entire transaction atomically.
>
> No LOCK_PREFIX, thank you!!! I just want to make sure that the compiler
> doesn't push the store down out of a loop, split the store, allow the
> store to happen twice (e.g., to allow different code paths to be merged),
> and all the other tricks that the C standard permits compilers to pull.
We can't have split stores because we don't use atomic64_t on 32-bit
architectures. volatile won't save you from pushing stores out of loops unless
you're also doing reads. This is why we use reads to flush writes to mmio
registers. In this case, an atomic_read() with volatile in it will suffice.
Storing twice is perfectly legal here, though it's unlikely that an optimizing
compiler smart enough to create the problem we're addressing would ever do that.
-- Chris
> We can't have split stores because we don't use atomic64_t on 32-bit
> architectures.
That's not true; the compiler is free to split all stores
(and reads) from memory however it wants. It is debatable
whether "volatile" would prevent this as well, certainly
it is unsafe if you want to be portable. GCC will do its
best to not split volatile memory accesses, but bugs in
this area do happen a lot (because the compiler code for
volatile isn't as well exercised as most other compiler
code, and because it is simply a hard subject; and there
is no real formalised model for what GCC should do).
The only safe way to get atomic accesses is to write
assembler code. Are there any downsides to that? I don't
see any.
Segher
On Thu, Aug 09, 2007 at 11:24:22AM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >On Thu, Aug 09, 2007 at 10:53:14AM -0400, Chris Snook wrote:
> >>Paul E. McKenney wrote:
> >>>Why not the same access-once semantics for atomic_set() as
> >>>for atomic_read()? As this patch stands, it might introduce
> >>>architecture-specific compiler-induced bugs due to the fact that
> >>>atomic_set() used to imply volatile behavior but no longer does.
> >>When we make the volatile cast in atomic_read(), we're casting an rvalue
> >>to volatile. This unambiguously tells the compiler that we want to
> >>re-load that register from memory. What's "volatile behavior" for an
> >>lvalue?
> >
> >I was absolutely -not- suggesting volatile behavior for lvalues.
> >
> >Instead, I am asking for volatile behavior from an -rvalue-. In the
> >case of atomic_read(), it is the atomic_t being read from. In the case
> >of atomic_set(), it is the atomic_t being written to. As suggested in
> >my previous email:
> >
> >#define atomic_set(v,i) ((*(volatile int *)&(v)->counter) =
> >(i))
> >#define atomic64_set(v,i) ((*(volatile long *)&(v)->counter) = (i))
>
> That looks like a volatile lvalue to me. I confess I didn't exactly ace
> compilers. Care to explain this?
OK, so I am dyslexic this morning. Never could tell left from right...
This is indeed an lvalue, as is any non-function expression that you
can apply the "&" prefix operator to. Including the expression in
your proposed definitions of atomic_set() and atomic64_set().
An lvalue is any expression that -could- appear on the left-hand side
of an assignment operator, regardless of where it actually appears.
More precisely, an lvalue is an expression that refers to a variable in
such a way that the variable might be both loaded from and stored to,
ignoring things like "const" for the moment.
Because "(v)->counter" could be either loaded from or stored to, it
is an lvalue, which is why the compiler didn't scream bloody murder
at you when you took the address of it.
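A short sketch of the distinction:

	void example(atomic_t *v)
	{
		/* Both of these are lvalues; the second is simply accessed
		 * through a volatile-qualified pointer type. */
		v->counter = 1;
		*(volatile int *)&v->counter = 1;
	}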
> >Again, the architectures that used to have their "counter" declared
> >as volatile will lose volatile semantics on atomic_set() with your
> >patch, which might result in bugs due to overly imaginative compiler
> >optimizations. The above would prevent any such bugs from appearing.
> >
> >> A
> >>write to an lvalue already implies an eventual write to memory, so this
> >>would be a no-op. Maybe you'll write to the register a few times before
> >>flushing it to memory, but it will happen eventually. With an rvalue,
> >>there's no guarantee that it will *ever* load from memory, which is what
> >>volatile fixes.
> >>
> >>I think what you have in mind is LOCK_PREFIX behavior, which is not the
> >>purpose of atomic_set. We use LOCK_PREFIX in the inline assembly for the
> >>atomic_* operations that read, modify, and write a value, only because it
> >>is necessary to perform that entire transaction atomically.
> >
> >No LOCK_PREFIX, thank you!!! I just want to make sure that the compiler
> >doesn't push the store down out of a loop, split the store, allow the
> >store to happen twice (e.g., to allow different code paths to be merged),
> >and all the other tricks that the C standard permits compilers to pull.
>
> We can't have split stores because we don't use atomic64_t on 32-bit
> architectures. volatile won't save you from pushing stores out of loops
> unless you're also doing reads. This is why we use reads to flush writes
> to mmio registers. In this case, an atomic_read() with volatile in it will
> suffice. Storing twice is perfectly legal here, though it's unlikely that
> an optimizing compiler smart enough to create the problem we're addressing
> would ever do that.
The compiler is within its rights to read a 32-bit quantity 16 bits at
a time, even on a 32-bit machine. I would be glad to help pummel any
compiler writer that pulls such a dirty trick, but the C standard really
does permit this.
Use of volatile does in fact save you from the compiler pushing stores out
of loops regardless of whether you are also doing reads. The C standard
has the notion of sequence points, which occur at various places including
the ends of statements and the control expressions for "if" and "while"
statements. The compiler is not permitted to move volatile references
across a sequence point. Therefore, the compiler is not allowed to
push a volatile store out of a loop. Now the CPU might well do such a
reordering, but that is a separate issue to be dealt with via memory
barriers. Note that it is the CPU and I/O system, not the compiler,
that is forcing you to use reads to flush writes to MMIO registers.
And you would be amazed at what compiler writers will do in order to
get an additional fraction of a percent out of SpecCPU...
In short, please retain atomic_set()'s volatility, especially on those
architectures that declared the atomic_t's counter to be volatile.
Thanx, Paul
Segher Boessenkool wrote:
>> We can't have split stores because we don't use atomic64_t on 32-bit
>> architectures.
>
> That's not true; the compiler is free to split all stores
> (and reads) from memory however it wants. It is debatable
> whether "volatile" would prevent this as well, certainly
> it is unsafe if you want to be portable. GCC will do its
> best to not split volatile memory accesses, but bugs in
> this area do happen a lot (because the compiler code for
> volatile isn't as well exercised as most other compiler
> code, and because it is simply a hard subject; and there
> is no real formalised model for what GCC should do).
>
> The only safe way to get atomic accesses is to write
> assembler code. Are there any downsides to that? I don't
> see any.
The assumption that aligned word reads and writes are atomic, and that words are
aligned unless explicitly packed otherwise, is endemic in the kernel. No sane
compiler violates this assumption. It's true that we're not portable to insane
compilers after this patch, but we never were in the first place.
-- Chris
Paul E. McKenney wrote:
> The compiler is within its rights to read a 32-bit quantity 16 bits at
> a time, even on a 32-bit machine. I would be glad to help pummel any
> compiler writer that pulls such a dirty trick, but the C standard really
> does permit this.
Yes, but we don't write code for these compilers. There are countless pieces of
kernel code which would break in this condition, and there doesn't seem to be
any interest in fixing this.
> Use of volatile does in fact save you from the compiler pushing stores out
> of loops regardless of whether you are also doing reads. The C standard
> has the notion of sequence points, which occur at various places including
> the ends of statements and the control expressions for "if" and "while"
> statements. The compiler is not permitted to move volatile references
> across a sequence point. Therefore, the compiler is not allowed to
> push a volatile store out of a loop. Now the CPU might well do such a
> reordering, but that is a separate issue to be dealt with via memory
> barriers. Note that it is the CPU and I/O system, not the compiler,
> that is forcing you to use reads to flush writes to MMIO registers.
Sequence points enforce read-after-write ordering, not write-after-write. We
flush writes with reads for MMIO because of this effect as well as the CPU/bus
effects.
> And you would be amazed at what compiler writers will do in order to
> get an additional fraction of a percent out of SpecCPU...
Probably not :)
> In short, please retain atomic_set()'s volatility, especially on those
> architectures that declared the atomic_t's counter to be volatile.
Like i386 and x86_64? These used to have volatile in the atomic_t declaration.
We removed it, and the sky did not fall.
-- Chris
On Thu, Aug 09, 2007 at 12:36:17PM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >The compiler is within its rights to read a 32-bit quantity 16 bits at
> >a time, even on a 32-bit machine. I would be glad to help pummel any
> >compiler writer that pulls such a dirty trick, but the C standard really
> >does permit this.
>
> Yes, but we don't write code for these compilers. There are countless
> pieces of kernel code which would break in this condition, and there
> doesn't seem to be any interest in fixing this.
>
> >Use of volatile does in fact save you from the compiler pushing stores out
> >of loops regardless of whether you are also doing reads. The C standard
> >has the notion of sequence points, which occur at various places including
> >the ends of statements and the control expressions for "if" and "while"
> >statements. The compiler is not permitted to move volatile references
> >across a sequence point. Therefore, the compiler is not allowed to
> >push a volatile store out of a loop. Now the CPU might well do such a
> >reordering, but that is a separate issue to be dealt with via memory
> >barriers. Note that it is the CPU and I/O system, not the compiler,
> >that is forcing you to use reads to flush writes to MMIO registers.
>
> Sequence points enforce read-after-write ordering, not write-after-write.
> We flush writes with reads for MMIO because of this effect as well as the
> CPU/bus effects.
Neither volatile reads nor volatile writes may be moved across sequence
points.
> >And you would be amazed at what compiler writers will do in order to
> >get an additional fraction of a percent out of SpecCPU...
>
> Probably not :)
>
> >In short, please retain atomic_set()'s volatility, especially on those
> >architectures that declared the atomic_t's counter to be volatile.
>
> Like i386 and x86_64? These used to have volatile in the atomic_t
> declaration. We removed it, and the sky did not fall.
Interesting. You tested all possible configs on all possible hardware?
Thanx, Paul
Paul E. McKenney wrote:
> On Thu, Aug 09, 2007 at 12:36:17PM -0400, Chris Snook wrote:
>> Paul E. McKenney wrote:
>>> The compiler is within its rights to read a 32-bit quantity 16 bits at
>>> a time, even on a 32-bit machine. I would be glad to help pummel any
>>> compiler writer that pulls such a dirty trick, but the C standard really
>>> does permit this.
>> Yes, but we don't write code for these compilers. There are countless
>> pieces of kernel code which would break in this condition, and there
>> doesn't seem to be any interest in fixing this.
>>
>>> Use of volatile does in fact save you from the compiler pushing stores out
>>> of loops regardless of whether you are also doing reads. The C standard
>>> has the notion of sequence points, which occur at various places including
>>> the ends of statements and the control expressions for "if" and "while"
>>> statements. The compiler is not permitted to move volatile references
>>> across a sequence point. Therefore, the compiler is not allowed to
>>> push a volatile store out of a loop. Now the CPU might well do such a
>>> reordering, but that is a separate issue to be dealt with via memory
>>> barriers. Note that it is the CPU and I/O system, not the compiler,
>>> that is forcing you to use reads to flush writes to MMIO registers.
>> Sequence points enforce read-after-write ordering, not write-after-write.
>> We flush writes with reads for MMIO because of this effect as well as the
>> CPU/bus effects.
>
> Neither volatile reads nor volatile writes may be moved across sequence
> points.
By the compiler, or by the CPU? If you're depending on volatile writes being
visible to other CPUs, you're screwed either way, because the CPU can hold that
data in cache as long as it wants before it writes it to memory. When this
finally does happen, it will happen atomically, which is all that atomic_set
guarantees. If you need to guarantee that the value is written to memory at a
particular time in your execution sequence, you either have to read it from
memory to force the compiler to store it first (and a volatile cast in
atomic_read will suffice for this) or you have to use LOCK_PREFIX instructions
which will invalidate remote cache lines containing the same variable. This
patch doesn't change either of these cases.
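A sketch of that read-back idiom (hypothetical variable v):

	atomic_set(&v, 1);
	(void)atomic_read(&v);	/* volatile load of v.counter, so the
				 * store above must be emitted first */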
>>> And you would be amazed at what compiler writers will do in order to
>>> get an additional fraction of a percent out of SpecCPU...
>> Probably not :)
>>
>>> In short, please retain atomic_set()'s volatility, especially on those
>>> architectures that declared the atomic_t's counter to be volatile.
>> Like i386 and x86_64? These used to have volatile in the atomic_t
>> declaration. We removed it, and the sky did not fall.
>
> Interesting. You tested all possible configs on all possible hardware?
No, but I can reason about it and be confident that this won't break anything
that isn't already broken. At worst, this patch will make any existing subtly
incorrect uses of atomic_t much more obvious and easier to track down.
-- Chris
On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >On Thu, Aug 09, 2007 at 12:36:17PM -0400, Chris Snook wrote:
> >>Paul E. McKenney wrote:
> >>>The compiler is within its rights to read a 32-bit quantity 16 bits at
> >>>a time, even on a 32-bit machine. I would be glad to help pummel any
> >>>compiler writer that pulls such a dirty trick, but the C standard really
> >>>does permit this.
> >>Yes, but we don't write code for these compilers. There are countless
> >>pieces of kernel code which would break in this condition, and there
> >>doesn't seem to be any interest in fixing this.
> >>
> >>>Use of volatile does in fact save you from the compiler pushing stores
> >>>out
> >>>of loops regardless of whether you are also doing reads. The C standard
> >>>has the notion of sequence points, which occur at various places
> >>>including
> >>>the ends of statements and the control expressions for "if" and "while"
> >>>statements. The compiler is not permitted to move volatile references
> >>>across a sequence point. Therefore, the compiler is not allowed to
> >>>push a volatile store out of a loop. Now the CPU might well do such a
> >>>reordering, but that is a separate issue to be dealt with via memory
> >>>barriers. Note that it is the CPU and I/O system, not the compiler,
> >>>that is forcing you to use reads to flush writes to MMIO registers.
> >>Sequence points enforce read-after-write ordering, not write-after-write.
> >>We flush writes with reads for MMIO because of this effect as well as the
> >>CPU/bus effects.
> >
> >Neither volatile reads nor volatile writes may be moved across sequence
> >points.
>
> By the compiler, or by the CPU?
As mentioned in earlier emails, by the compiler. The CPU knows nothing
of C sequence points.
> If you're depending on volatile writes
> being visible to other CPUs, you're screwed either way, because the CPU can
> hold that data in cache as long as it wants before it writes it to memory.
> When this finally does happen, it will happen atomically, which is all that
> atomic_set guarantees. If you need to guarantee that the value is written
> to memory at a particular time in your execution sequence, you either have
> to read it from memory to force the compiler to store it first (and a
> volatile cast in atomic_read will suffice for this) or you have to use
> LOCK_PREFIX instructions which will invalidate remote cache lines
> containing the same variable. This patch doesn't change either of these
> cases.
The case that it -can- change is interactions with interrupt handlers.
And NMI/SMI handlers, for that matter.
> >>>And you would be amazed at what compiler writers will do in order to
> >>>get an additional fraction of a percent out of SpecCPU...
> >>Probably not :)
> >>
> >>>In short, please retain atomic_set()'s volatility, especially on those
> >>>architectures that declared the atomic_t's counter to be volatile.
> >>Like i386 and x86_64? These used to have volatile in the atomic_t
> >>declaration. We removed it, and the sky did not fall.
> >
> >Interesting. You tested all possible configs on all possible hardware?
>
> No, but I can reason about it and be confident that this won't break
> anything that isn't already broken. At worst, this patch will make any
> existing subtly incorrect uses of atomic_t much more obvious and easier to
> track down.
You took interrupt and NMI/SMI handlers into account?
Thanx, Paul
Paul E. McKenney wrote:
> On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
>> If you're depending on volatile writes
>> being visible to other CPUs, you're screwed either way, because the CPU can
>> hold that data in cache as long as it wants before it writes it to memory.
>> When this finally does happen, it will happen atomically, which is all that
>> atomic_set guarantees. If you need to guarantee that the value is written
>> to memory at a particular time in your execution sequence, you either have
>> to read it from memory to force the compiler to store it first (and a
>> volatile cast in atomic_read will suffice for this) or you have to use
>> LOCK_PREFIX instructions which will invalidate remote cache lines
>> containing the same variable. This patch doesn't change either of these
>> cases.
>
> The case that it -can- change is interactions with interrupt handlers.
> And NMI/SMI handlers, for that matter.
You have a point here, but only if you can guarantee that the interrupt handler
is running on a processor sharing the cache that has the not-yet-written
volatile value. That implies a strictly non-SMP architecture. At the moment,
none of those have volatile in their declaration of atomic_t, so this patch
can't break any of them.
-- Chris
>> The only safe way to get atomic accesses is to write
>> assembler code. Are there any downsides to that? I don't
>> see any.
>
> The assumption that aligned word reads and writes are atomic, and that
> words are aligned unless explicitly packed otherwise, is endemic in
> the kernel. No sane compiler violates this assumption. It's true
> that we're not portable to insane compilers after this patch, but we
> never were in the first place.
You didn't answer my question: are there any downsides to using
explicit coded-in-assembler accesses for atomic accesses? You
can handwave all you want that it should "just work" with
volatile accesses, but volatility != atomicity, volatile in C
is really badly defined, GCC never officially gave stronger
guarantees, and we have a bugzilla full of PRs to show what a
minefield it is.
So, why not use the well-defined alternative?
Segher
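For reference, the kind of explicitly coded access being suggested might look like this (x86 sketch, hypothetical helper, not code from the tree):

	static inline int atomic_read_asm(const atomic_t *v)
	{
		int ret;

		/* A single 32-bit load; a volatile asm cannot be split,
		 * merged, or elided by the compiler. */
		asm volatile("movl %1, %0" : "=r" (ret) : "m" (v->counter));
		return ret;
	}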
On Thu, Aug 09, 2007 at 02:13:52PM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
> >> If you're depending on volatile writes
> >>being visible to other CPUs, you're screwed either way, because the CPU
> >>can hold that data in cache as long as it wants before it writes it to
> >>memory. When this finally does happen, it will happen atomically, which
> >>is all that atomic_set guarantees. If you need to guarantee that the
> >>value is written to memory at a particular time in your execution
> >>sequence, you either have to read it from memory to force the compiler to
> >>store it first (and a volatile cast in atomic_read will suffice for this)
> >>or you have to use LOCK_PREFIX instructions which will invalidate remote
> >>cache lines containing the same variable. This patch doesn't change
> >>either of these cases.
> >
> >The case that it -can- change is interactions with interrupt handlers.
> >And NMI/SMI handlers, for that matter.
>
> You have a point here, but only if you can guarantee that the interrupt
> handler is running on a processor sharing the cache that has the
> not-yet-written volatile value. That implies a strictly non-SMP
> architecture. At the moment, none of those have volatile in their
> declaration of atomic_t, so this patch can't break any of them.
This can also happen when using per-CPU variables. And there are a
number of per-CPU variables that are either atomic themselves or are
structures containing atomic fields.
Thanx, Paul
>> The compiler is within its rights to read a 32-bit quantity 16 bits at
>> a time, even on a 32-bit machine. I would be glad to help pummel any
>> compiler writer that pulls such a dirty trick, but the C standard
>> really
>> does permit this.
>
> Yes, but we don't write code for these compilers. There are countless
> pieces of kernel code which would break in this condition, and there
> doesn't seem to be any interest in fixing this.
"Other things are broken too". Great argument :-)
> Sequence points enforce read-after-write ordering, not
> write-after-write.
Sequence points order *all* side effects; sequence points exist in the
domain of the abstract sequential model of the C language only. The
compiler translates that to machine code that is equivalent to that C
code under the "as-if" rule; but this is still in that abstract model,
which doesn't include things such as SMP, visibility by I/O devices,
store queues, etc. etc.
> We flush writes with reads for MMIO because of this effect as well as
> the CPU/bus effects.
You cannot flush all MMIO writes with reads; this is a PCI-specific
thing. And even then, you need more than just the read itself: you
have to make sure the read completed and returned data.
>> In short, please retain atomic_set()'s volatility, especially on those
>> architectures that declared the atomic_t's counter to be volatile.
>
> Like i386 and x86_64? These used to have volatile in the atomic_t
> declaration. We removed it, and the sky did not fall.
And this proves what? Lots of stuff "works" by accident.
Segher
Segher Boessenkool wrote:
>>> The only safe way to get atomic accesses is to write
>>> assembler code. Are there any downsides to that? I don't
>>> see any.
>>
>> The assumption that aligned word reads and writes are atomic, and that
>> words are aligned unless explicitly packed otherwise, is endemic in
>> the kernel. No sane compiler violates this assumption. It's true
>> that we're not portable to insane compilers after this patch, but we
>> never were in the first place.
>
> You didn't answer my question: are there any downsides to using
> explicit coded-in-assembler accesses for atomic accesses? You
> can handwave all you want that it should "just work" with
> volatile accesses, but volatility != atomicity, volatile in C
> is really badly defined, GCC never officially gave stronger
> guarantees, and we have a bugzilla full of PRs to show what a
> minefield it is.
>
> So, why not use the well-defined alternative?
Because we don't need to, and it hurts performance.
>>>> The only safe way to get atomic accesses is to write
>>>> assembler code. Are there any downsides to that? I don't
>>>> see any.
>>>
>>> The assumption that aligned word reads and writes are atomic, and
>>> that words are aligned unless explicitly packed otherwise, is
>>> endemic in the kernel. No sane compiler violates this assumption.
>>> It's true that we're not portable to insane compilers after this
>>> patch, but we never were in the first place.
>> You didn't answer my question: are there any downsides to using
>> explicit coded-in-assembler accesses for atomic accesses? You
>> can handwave all you want that it should "just work" with
>> volatile accesses, but volatility != atomicity, volatile in C
>> is really badly defined, GCC never officially gave stronger
>> guarantees, and we have a bugzilla full of PRs to show what a
>> minefield it is.
>> So, why not use the well-defined alternative?
>
> Because we don't need to,
You don't need to use volatile objects, or accesses through
volatile-cast pointers, either.
> and it hurts performance.
Please show how it does this -- one load is one load either way,
and one store is one store.
Segher
On Thu, 9 Aug 2007, Chris Snook wrote:
> Segher Boessenkool wrote:
> > > > The only safe way to get atomic accesses is to write
> > > > assembler code. Are there any downsides to that? I don't
> > > > see any.
> > >
> > > The assumption that aligned word reads and writes are atomic, and that
> > > words are aligned unless explicitly packed otherwise, is endemic in the
> > > kernel. No sane compiler violates this assumption. It's true that we're
> > > not portable to insane compilers after this patch, but we never were in
> > > the first place.
> >
> > You didn't answer my question: are there any downsides to using
> > explicit coded-in-assembler accesses for atomic accesses? You
> > can handwave all you want that it should "just work" with
> > volatile accesses, but volatility != atomicity, volatile in C
> > is really badly defined, GCC never officially gave stronger
> > guarantees, and we have a bugzilla full of PRs to show what a
> > minefield it is.
> >
> > So, why not use the well-defined alternative?
>
> Because we don't need to, and it hurts performance.
It hurts performance by implementing 32-bit atomic reads in assembler?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
> If you need to guarantee that the value is written to memory at a
> particular time in your execution sequence, you either have to read it
> from memory to force the compiler to store it first
That isn't enough. The CPU will happily read the datum back from
its own store queue before it ever hit memory or cache or any
external bus. If you need the value to be globally visible before
another store, you use wmb(); if you need it before some read,
you use mb().
These concepts are completely separate from both atomicity and
volatility (which aren't the same themselves).
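A minimal sketch of that distinction (hypothetical producer code):

	/* Make the payload globally visible before the flag. */
	shared_data = compute();
	wmb();			/* orders the two stores as seen by other
				 * CPUs and devices */
	atomic_set(&ready, 1);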
> No, but I can reason about it and be confident that this won't break
> anything that isn't already broken. At worst, this patch will make
> any existing subtly incorrect uses of atomic_t much more obvious and
> easier to track down.
PR22278, PR29753 -- both P2 bugs, and both about basic *(volatile *)
usage. Yeah, it's definitely not problematic, esp. not if you want
to support older compilers than 4.2 too.</sarcasm>
Segher
Paul E. McKenney wrote:
> On Thu, Aug 09, 2007 at 02:13:52PM -0400, Chris Snook wrote:
>> Paul E. McKenney wrote:
>>> On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
>>>> If you're depending on volatile writes
>>>> being visible to other CPUs, you're screwed either way, because the CPU
>>>> can hold that data in cache as long as it wants before it writes it to
>>>> memory. When this finally does happen, it will happen atomically, which
>>>> is all that atomic_set guarantees. If you need to guarantee that the
>>>> value is written to memory at a particular time in your execution
>>>> sequence, you either have to read it from memory to force the compiler to
>>>> store it first (and a volatile cast in atomic_read will suffice for this)
>>>> or you have to use LOCK_PREFIX instructions which will invalidate remote
>>>> cache lines containing the same variable. This patch doesn't change
>>>> either of these cases.
>>> The case that it -can- change is interactions with interrupt handlers.
>>> And NMI/SMI handlers, for that matter.
>> You have a point here, but only if you can guarantee that the interrupt
>> handler is running on a processor sharing the cache that has the
>> not-yet-written volatile value. That implies a strictly non-SMP
>> architecture. At the moment, none of those have volatile in their
>> declaration of atomic_t, so this patch can't break any of them.
>
> This can also happen when using per-CPU variables. And there are a
> number of per-CPU variables that are either atomic themselves or are
> structures containing atomic fields.
Accessing per-CPU variables in this fashion reliably already requires a suitable
smp/non-smp read/write memory barrier. I maintain that if we break anything
with this change, it was really already broken, if less obviously. Can you give
a real or synthetic example of legitimate code that could break?
-- Chris
Segher Boessenkool wrote:
>>> The compiler is within its rights to read a 32-bit quantity 16 bits at
>>> at time, even on a 32-bit machine. I would be glad to help pummel any
>>> compiler writer that pulls such a dirty trick, but the C standard really
>>> does permit this.
>>
>> Yes, but we don't write code for these compilers. There are countless
>> pieces of kernel code which would break in this condition, and there
>> doesn't seem to be any interest in fixing this.
>
> "Other things are broken too". Great argument :-)
We make plenty of practical assumptions in the kernel, and declare incorrect
things which violate them, even in cases where there's no commandment from the
heavens forbidding them. Since the whole point of this exercise is to prevent
badness with *optimizing* compilers, it's quite reasonable to declare broken any
so-called optimizer which violates these trivial assumptions.
>>> In short, please retain atomic_set()'s volatility, especially on those
>>> architectures that declared the atomic_t's counter to be volatile.
>>
>> Like i386 and x86_64? These used to have volatile in the atomic_t
>> declaration. We removed it, and the sky did not fall.
>
> And this proves what? Lots of stuff "works" by accident.
If something breaks because of this, it was already broken, but hidden a lot
better. I don't see much of a downside to exposing and fixing those bugs.
-- Chris
Geert Uytterhoeven wrote:
> On Thu, 9 Aug 2007, Chris Snook wrote:
>> Segher Boessenkool wrote:
>>>>> The only safe way to get atomic accesses is to write
>>>>> assembler code. Are there any downsides to that? I don't
>>>>> see any.
>>>> The assumption that aligned word reads and writes are atomic, and that
>>>> words are aligned unless explicitly packed otherwise, is endemic in the
>>>> kernel. No sane compiler violates this assumption. It's true that we're
>>>> not portable to insane compilers after this patch, but we never were in
>>>> the first place.
>>> You didn't answer my question: are there any downsides to using
>>> explicit coded-in-assembler accesses for atomic accesses? You
>>> can handwave all you want that it should "just work" with
>>> volatile accesses, but volatility != atomicity, volatile in C
>>> is really badly defined, GCC never officially gave stronger
>>> guarantees, and we have a bugzilla full of PRs to show what a
>>> minefield it is.
>>>
>>> So, why not use the well-defined alternative?
>> Because we don't need to, and it hurts performance.
>
> It hurts performance by implementing 32-bit atomic reads in assembler?
No, I misunderstood the question. Implementing 32-bit atomic reads in assembler
is redundant, because any sane compiler, *particularly* an optimizing compiler
(and we're only in this mess because of optimizing compilers) will give us that
automatically without the assembler. Yes, it is legal for a compiler to violate
this assumption. It is also legal for us to refuse to maintain compatibility
with compilers that suck this badly. That decision was made a very long time
ago, and I consider it the correct decision.
-- Chris
>>>> So, why not use the well-defined alternative?
>>> Because we don't need to, and it hurts performance.
>> It hurts performance by implementing 32-bit atomic reads in assembler?
>
> No, I misunderstood the question. Implementing 32-bit atomic reads in
> assembler is redundant, because any sane compiler, *particularly* an
> optimizing compiler (and we're only in this mess because of optimizing
> compilers)
Oh please, don't tell me you don't want an optimising compiler.
And if you _do_ want one, well you're in this mess because you
chose C as implementation language and C has some pretty strange
rules. Trying to use not-all-that-well-defined-and-completely-
misunderstood features of the language doesn't make things easier;
trying to use something that isn't even part of the language and
that your particular compiler originally supported by accident,
and that isn't yet an officially supported feature, and that on
top of it all has a track record of problems -- well it makes me
wonder if you're in this game for fun or what.
> will give us that automatically without the assembler.
No, it does *not* give it to you automatically; you have to do
either the asm() thing, or the not-defined-at-all *(volatile *)&
thing.
> Yes, it is legal for a compiler to violate this assumption. It is
> also legal for us to refuse to maintain compatibility with compilers
> that suck this badly.
So that's rm include/linux/compiler-gcc*.h then. Good luck with
the intel compiler, maybe it works more to your liking.
Segher
On Thu, Aug 09, 2007 at 03:24:40PM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >On Thu, Aug 09, 2007 at 02:13:52PM -0400, Chris Snook wrote:
> >>Paul E. McKenney wrote:
> >>>On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
> >>>> If you're depending on volatile writes
> >>>>being visible to other CPUs, you're screwed either way, because the CPU
> >>>>can hold that data in cache as long as it wants before it writes it to
> >>>>memory. When this finally does happen, it will happen atomically,
> >>>>which is all that atomic_set guarantees. If you need to guarantee that
> >>>>the value is written to memory at a particular time in your execution
> >>>>sequence, you either have to read it from memory to force the compiler
> >>>>to store it first (and a volatile cast in atomic_read will suffice for
> >>>>this) or you have to use LOCK_PREFIX instructions which will invalidate
> >>>>remote cache lines containing the same variable. This patch doesn't
> >>>>change either of these cases.
> >>>The case that it -can- change is interactions with interrupt handlers.
> >>>And NMI/SMI handlers, for that matter.
> >>You have a point here, but only if you can guarantee that the interrupt
> >>handler is running on a processor sharing the cache that has the
> >>not-yet-written volatile value. That implies a strictly non-SMP
> >>architecture. At the moment, none of those have volatile in their
> >>declaration of atomic_t, so this patch can't break any of them.
> >
> >This can also happen when using per-CPU variables. And there are a
> >number of per-CPU variables that are either atomic themselves or are
> >structures containing atomic fields.
>
> Accessing per-CPU variables in this fashion reliably already requires a
> suitable smp/non-smp read/write memory barrier. I maintain that if we
> break anything with this change, it was really already broken, if less
> obviously. Can you give a real or synthetic example of legitimate code
> that could break?
My main concern is actually the lack of symmetry -- I would expect
that an atomic_set() would have the same properties as atomic_read().
It is easy and cheap to provide them with similar properties, so why not?
Debugging even a single problem would consume far more time than simply
giving them corresponding semantics.
But you asked for examples. These are synthetic, and of course legitimacy
is in the eye of the beholder.
1. Watchdog variable.
atomic_t watchdog = ATOMIC_INIT(0);
...
	int i;
	while (!done) {

		/* Do some stuff that doesn't take more than a few us. */
		/* Could do atomic increment, but throughput penalty. */

		i++;
		atomic_set(&watchdog, i);
	}
	do_something_with(&watchdog);


	/* Every so often on some other CPU... */

	if ((new_watchdog = atomic_read(&watchdog)) == old_watchdog)
		die_horribly();
	old_watchdog = new_watchdog;
If atomic_set() did not have volatile semantics, the compiler
would be within its rights optimizing it to simply get the
final value of "i" after exit from the loop. This would cause
the watchdog check to fail spuriously. Memory barriers are
not required in this case, because the CPU cannot hang onto
the value for very long -- we don't care about the exact value,
or about exact synchronization, but rather about whether or
not the value is changing.
In this (toy) example, one might replace the atomic_set() with
an atomic increment (though that might be too expensive in some
cases) or with something like:
atomic_set(&watchdog, atomic_read(&watchdog) + 1);
However, other cases might not permit this transformation,
for example, an existing heavily used API might take int rather
than atomic_t.
Some will no doubt argue that this example should use a
macro or an asm similar to the "forget()" asm put forward
elsewhere in this thread.
2. Communicating both with interrupt handler and with other CPUs.
For example, data elements that are built up in a location visible
to interrupts and NMIs, and then added as a unit to a data structure
visible to other CPUs. This more-realistic example is abbreviated
to the point of pointlessness as follows:
	struct foo {
		atomic_t a;
		atomic_t b;
	};

	DEFINE_PER_CPU(struct foo *, staging) = NULL;

	/* Create element in staging area. */

	__get_cpu_var(staging) = kzalloc(sizeof(struct foo), GFP_WHATEVER);
	if (__get_cpu_var(staging) == NULL)
		die_horribly();
	/* allocate an element of some per-CPU array, get the result in "i" */
	atomic_set(&__get_cpu_var(staging)->a, i);
	/* allocate another element of a per-CPU array, with result in "i" */
	atomic_set(&__get_cpu_var(staging)->b, i);
	rcu_assign_pointer(some_global_place, __get_cpu_var(staging));
If atomic_set() didn't have volatile semantics, then an interrupt
or NMI handler could see the atomic_set() to .a and .b out of
order due to compiler optimizations.
Remember, you -did- ask for these!!! ;-)
Thanx, Paul
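(For reference, one common form of the "forget()" asm that example 1 alludes to is sketched below; the exact macro varied between posters, so treat this as an approximation.)

	/*
	 * Tell gcc the variable may have been written behind its back:
	 * pending stores to it must be emitted before the asm, and later
	 * reads must reload it from memory.
	 */
	#define forget(x)	__asm__ __volatile__ ("" : "=m" (x) : "m" (x))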
Paul E. McKenney <[email protected]> wrote:
>
> The compiler is within its rights to read a 32-bit quantity 16 bits at
> a time, even on a 32-bit machine. I would be glad to help pummel any
> compiler writer that pulls such a dirty trick, but the C standard really
> does permit this.
Code all over the kernel assumes that 32-bit reads/writes
are atomic so while such a compiler might be legal it certainly
can't compile Linux.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Friday 10 August 2007 10:21:46 Herbert Xu wrote:
> Paul E. McKenney <[email protected]> wrote:
> >
> > The compiler is within its rights to read a 32-bit quantity 16 bits at
> > a time, even on a 32-bit machine. I would be glad to help pummel any
> > compiler writer that pulls such a dirty trick, but the C standard really
> > does permit this.
>
> Code all over the kernel assumes that 32-bit reads/writes
> are atomic so while such a compiler might be legal it certainly
> can't compile Linux.
Yes, the kernel requirements are much stricter than ISO-C. And besides
it is a heavy user of C extensions anyways. On the other hand some of the
C99 extensions are not allowed. And then there is sparse, which enforces
a language which sometimes is quite far from standard C. You could say it is
written in Linux-C, not ISO C.
-Andi
On Fri, Aug 10, 2007 at 11:08:20AM +0200, Andi Kleen wrote:
> On Friday 10 August 2007 10:21:46 Herbert Xu wrote:
> > Paul E. McKenney <[email protected]> wrote:
> > >
> > > The compiler is within its rights to read a 32-bit quantity 16 bits at
> > > a time, even on a 32-bit machine. I would be glad to help pummel any
> > > compiler writer that pulls such a dirty trick, but the C standard really
> > > does permit this.
> >
> > Code all over the kernel assumes that 32-bit reads/writes
> > are atomic so while such a compiler might be legal it certainly
> > can't compile Linux.
>
> Yes, the kernel requirements are much stricter than ISO-C. And besides
> it is a heavy user of C extensions anyways. On the other hand some of the
> C99 extensions are not allowed. And then there is sparse, which enforces
> a language which sometimes is quite far from standard C. You could say it is
> written in Linux-C, not ISO C.
Understood. My question is "why do we want the semantics of atomic_read()
and atomic_set() to differ?"
Thanx, Paul
Paul E. McKenney wrote:
> On Thu, Aug 09, 2007 at 03:24:40PM -0400, Chris Snook wrote:
>> Paul E. McKenney wrote:
>>> On Thu, Aug 09, 2007 at 02:13:52PM -0400, Chris Snook wrote:
>>>> Paul E. McKenney wrote:
>>>>> On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
>>>>>> If you're depending on volatile writes
>>>>>> being visible to other CPUs, you're screwed either way, because the CPU
>>>>>> can hold that data in cache as long as it wants before it writes it to
>>>>>> memory. When this finally does happen, it will happen atomically,
>>>>>> which is all that atomic_set guarantees. If you need to guarantee that
>>>>>> the value is written to memory at a particular time in your execution
>>>>>> sequence, you either have to read it from memory to force the compiler
>>>>>> to store it first (and a volatile cast in atomic_read will suffice for
>>>>>> this) or you have to use LOCK_PREFIX instructions which will invalidate
>>>>>> remote cache lines containing the same variable. This patch doesn't
>>>>>> change either of these cases.
>>>>> The case that it -can- change is interactions with interrupt handlers.
>>>>> And NMI/SMI handlers, for that matter.
>>>> You have a point here, but only if you can guarantee that the interrupt
>>>> handler is running on a processor sharing the cache that has the
>>>> not-yet-written volatile value. That implies a strictly non-SMP
>>>> architecture. At the moment, none of those have volatile in their
>>>> declaration of atomic_t, so this patch can't break any of them.
>>> This can also happen when using per-CPU variables. And there are a
>>> number of per-CPU variables that are either atomic themselves or are
>>> structures containing atomic fields.
>> Accessing per-CPU variables in this fashion reliably already requires a
>> suitable smp/non-smp read/write memory barrier. I maintain that if we
>> break anything with this change, it was really already broken, if less
>> obviously. Can you give a real or synthetic example of legitimate code
>> that could break?
>
> My main concern is actually the lack of symmetry -- I would expect
> that an atomic_set() would have the same properties as atomic_read().
> It is easy and cheap to provide them with similar properties, so why not?
> Debugging even a single problem would consume far more time than simply
> giving them corresponding semantics.
>
> But you asked for examples. These are synthetic, and of course legitimacy
> is in the eye of the beholder.
>
> 1. Watchdog variable.
>
> atomic_t watchdog = ATOMIC_INIT(0);
>
> ...
>
> int i;
> while (!done) {
>
> /* Do some stuff that doesn't take more than a few us. */
> /* Could do atomic increment, but throughput penalty. */
>
> i++;
> atomic_set(&watchdog, i);
> }
> do_something_with(&watchdog);
>
>
> /* Every so often on some other CPU... */
>
> if ((new_watchdog = atomic_read(&watchdog)) == old_watchdog)
> die_horribly();
> old_watchdog = new_watchdog;
>
>
> If atomic_set() did not have volatile semantics, the compiler
> would be within its rights optimizing it to simply get the
> final value of "i" after exit from the loop. This would cause
> the watchdog check to fail spuriously. Memory barriers are
> not required in this case, because the CPU cannot hang onto
> the value for very long -- we don't care about the exact value,
> or about exact synchronization, but rather about whether or
> not the value is changing.
>
> In this (toy) example, one might replace the atomic_set() with
> an atomic increment (though that might be too expensive in some
> cases) or with something like:
>
> atomic_set(&watchdog, atomic_read(&watchdog) + 1);
>
> However, other cases might not permit this transformation,
> for example, an existing heavily used API might take int rather
> than atomic_t.
>
> Some will no doubt argue that this example should use a
> macro or an asm similar to the "forget()" asm put forward
> elsewhere in this thread.
>
> 2. Communicating both with interrupt handler and with other CPUs.
> For example, data elements that are built up in a location visible
> to interrupts and NMIs, and then added as a unit to a data structure
> visible to other CPUs. This more-realistic example is abbreviated
> to the point of pointlessness as follows:
>
> struct foo {
> atomic_t a;
> atomic_t b;
> };
>
> DEFINE_PER_CPU(struct foo *, staging) = NULL;
>
> /* Create element in staging area. */
>
> __get_cpu_var(staging) = kzalloc(sizeof(struct foo), GFP_WHATEVER);
> if (__get_cpu_var(staging) == NULL)
> die_horribly();
> /* allocate an element of some per-CPU array, get the result in "i" */
> atomic_set(&__get_cpu_var(staging)->a, i);
> /* allocate another element of a per-CPU array, with result in "i" */
> atomic_set(&__get_cpu_var(staging)->b, i);
> rcu_assign_pointer(some_global_place, __get_cpu_var(staging));
>
> If atomic_set() didn't have volatile semantics, then an interrupt
> or NMI handler could see the atomic_set() to .a and .b out of
> order due to compiler optimizations.
>
> Remember, you -did- ask for these!!! ;-)
Ok, I'm convinced. Part of the motivation here is to avoid heisenbugs,
so if people expect volatile atomic_set behavior, I'm inclined to give
it to them. I don't really feel like indulging the compiler bug
paranoiacs, but developer expectations are a legitimate motivation, and
a major part of why I posted this in the first place. I'll resubmit the
patchset with a volatile cast in atomic_set. Before I do, is there
anything *else* that desperately needs such a cast? As far as I can
tell, all the other functions are implemented with __asm__ __volatile__,
or with spinlocks that use that under the hood.
-- Chris
>> The compiler is within its rights to read a 32-bit quantity 16 bits at
>> a time, even on a 32-bit machine. I would be glad to help pummel any
>> compiler writer that pulls such a dirty trick, but the C standard
>> really
>> does permit this.
>
> Code all over the kernel assumes that 32-bit reads/writes
> are atomic so while such a compiler might be legal it certainly
> can't compile Linux.
That means GCC cannot compile Linux; it already optimises
some accesses to scalars to smaller accesses when it knows
it is allowed to. Not often though, since it hardly ever
helps in the cost model it employs.
Segher
On Fri, Aug 10, 2007 at 03:49:03PM -0400, Chris Snook wrote:
> Paul E. McKenney wrote:
> >On Thu, Aug 09, 2007 at 03:24:40PM -0400, Chris Snook wrote:
> >>Paul E. McKenney wrote:
> >>>On Thu, Aug 09, 2007 at 02:13:52PM -0400, Chris Snook wrote:
> >>>>Paul E. McKenney wrote:
> >>>>>On Thu, Aug 09, 2007 at 01:14:35PM -0400, Chris Snook wrote:
> >>>>>> If you're depending on volatile writes
> >>>>>>being visible to other CPUs, you're screwed either way, because the
> >>>>>>CPU can hold that data in cache as long as it wants before it writes
> >>>>>>it to memory. When this finally does happen, it will happen
> >>>>>>atomically, which is all that atomic_set guarantees. If you need to
> >>>>>>guarantee that the value is written to memory at a particular time in
> >>>>>>your execution sequence, you either have to read it from memory to
> >>>>>>force the compiler to store it first (and a volatile cast in
> >>>>>>atomic_read will suffice for this) or you have to use LOCK_PREFIX
> >>>>>>instructions which will invalidate remote cache lines containing the
> >>>>>>same variable. This patch doesn't change either of these cases.
> >>>>>The case that it -can- change is interactions with interrupt handlers.
> >>>>>And NMI/SMI handlers, for that matter.
> >>>>You have a point here, but only if you can guarantee that the interrupt
> >>>>handler is running on a processor sharing the cache that has the
> >>>>not-yet-written volatile value. That implies a strictly non-SMP
> >>>>architecture. At the moment, none of those have volatile in their
> >>>>declaration of atomic_t, so this patch can't break any of them.
> >>>This can also happen when using per-CPU variables. And there are a
> >>>number of per-CPU variables that are either atomic themselves or are
> >>>structures containing atomic fields.
> >>Accessing per-CPU variables in this fashion reliably already requires a
> >>suitable smp/non-smp read/write memory barrier. I maintain that if we
> >>break anything with this change, it was really already broken, if less
> >>obviously. Can you give a real or synthetic example of legitimate code
> >>that could break?
> >
> >My main concern is actually the lack of symmetry -- I would expect
> >that an atomic_set() would have the same properties as atomic_read().
> >It is easy and cheap to provide them with similar properties, so why not?
> >Debugging even a single problem would consume far more time than simply
> >giving them corresponding semantics.
> >
> >But you asked for examples. These are synthetic, and of course legitimacy
> >is in the eye of the beholder.
> >
> >1. Watchdog variable.
> >
> > atomic_t watchdog = ATOMIC_INIT(0);
> >
> > ...
> >
> > int i;
> > while (!done) {
> >
> > /* Do some stuff that doesn't take more than a few us. */
> > /* Could do atomic increment, but throughput penalty. */
> >
> > i++;
> > atomic_set(&watchdog, i);
> > }
> > do_something_with(&watchdog);
> >
> >
> > /* Every so often on some other CPU... */
> >
> > if ((new_watchdog = atomic_read(&watchdog)) == old_watchdog)
> > die_horribly();
> > old_watchdog = new_watchdog;
> >
> >
> > If atomic_set() did not have volatile semantics, the compiler
> > would be within its rights optimizing it to simply get the
> > final value of "i" after exit from the loop. This would cause
> > the watchdog check to fail spuriously. Memory barriers are
> > not required in this case, because the CPU cannot hang onto
> > the value for very long -- we don't care about the exact value,
> > or about exact synchronization, but rather about whether or
> > not the value is changing.
> >
> > In this (toy) example, one might replace the atomic_set() with
> > an atomic increment (though that might be too expensive in some
> > cases) or with something like:
> >
> > atomic_set(&watchdog, atomic_read(&watchdog) + 1);
> >
> > However, other cases might not permit this transformation,
> > for example, an existing heavily used API might take int rather
> > than atomic_t.
> >
> > Some will no doubt argue that this example should use a
> > macro or an asm similar to the "forget()" asm put forward
> > elsewhere in this thread.
> >
> >2. Communicating both with interrupt handler and with other CPUs.
> > For example, data elements that are built up in a location visible
> > to interrupts and NMIs, and then added as a unit to a data structure
> > visible to other CPUs. This more-realistic example is abbreviated
> > to the point of pointlessness as follows:
> >
> > struct foo {
> > atomic_t a;
> > atomic_t b;
> > };
> >
> > DEFINE_PER_CPU(struct foo *, staging) = NULL;
> >
> > /* Create element in staging area. */
> >
> > __get_cpu_var(staging) = kzalloc(sizeof(struct foo), GFP_WHATEVER);
> > if (__get_cpu_var(staging) == NULL)
> > 	die_horribly();
> > /* allocate an element of some per-CPU array, get the result in "i" */
> > atomic_set(&__get_cpu_var(staging)->a, i);
> > /* allocate another element of a per-CPU array, with result in "i" */
> > atomic_set(&__get_cpu_var(staging)->b, i);
> > rcu_assign_pointer(some_global_place, __get_cpu_var(staging));
> >
> > If atomic_set() didn't have volatile semantics, then an interrupt
> > or NMI handler could see the atomic_set() to .a and .b out of
> > order due to compiler optimizations.
> >
> >Remember, you -did- ask for these!!! ;-)
>
> Ok, I'm convinced. Part of the motivation here is to avoid heisenbugs,
> so if people expect volatile atomic_set behavior, I'm inclined to give
> it to them. I don't really feel like indulging the compiler bug
> paranoiacs, but developer expectations are a legitimate motivation, and
> a major part of why I posted this in the first place. I'll resubmit the
> patchset with a volatile cast in atomic_set. Before I do, is there
> anything *else* that desperately needs such a cast? As far as I can
> tell, all the other functions are implemented with __asm__ __volatile__,
> or with spinlocks that use that under the hood.
Sounds good!!!
The only other API that I am aware of needing volatile semantics is
rcu_dereference(), but I already sent a patch in for it. So as far
as I know, atomic_read() and atomic_set() should cover it.
Thanx, Paul
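For context on that last point, here is a hedged sketch of what "volatile semantics for rcu_dereference()" could look like if done in the same volatile-cast style as the atomic_read() change being discussed. This is an illustration only, not the patch Paul refers to; the _sketch name is made up, and it assumes kernel context for typeof() and smp_read_barrier_depends().

	/*
	 * Illustration only -- not Paul's actual patch.  The pointer is
	 * fetched exactly once through a volatile-qualified lvalue, and
	 * smp_read_barrier_depends() (a no-op except on Alpha) orders
	 * later dereferences against that fetch.
	 */
	#define rcu_dereference_sketch(p) ({				\
		typeof(p) _p1 = *(volatile typeof(p) *)&(p);		\
		smp_read_barrier_depends();				\
		(_p1);							\
	})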
On Fri, Aug 10, 2007 at 10:07:27PM +0200, Segher Boessenkool wrote:
>
> That means GCC cannot compile Linux; it already optimises
> some accesses to scalars to smaller accesses when it knows
> it is allowed to. Not often though, since it hardly ever
> helps in the cost model it employs.
Please give an example code snippet + gcc version + arch
to back this up.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>> That means GCC cannot compile Linux; it already optimises
>> some accesses to scalars to smaller accesses when it knows
>> it is allowed to. Not often though, since it hardly ever
>> helps in the cost model it employs.
>
> Please give an example code snippet + gcc version + arch
> to back this up.
unsigned char f(unsigned long *p)
{
return *p & 1;
}
with both
powerpc64-linux-gcc (GCC) 4.3.0 20070731 (experimental)
and
powerpc64-linux-gcc-4.2.0 (GCC) 4.2.0
(sorry, I don't have anything newer or older right now; if you
really care, I can test with those too)
generate (in 64-bit mode):
.L.f:
lbz 3,7(3)
rldicl 3,3,0,63
blr
and in 32-bit mode:
f:
stwu 1,-16(1)
nop
nop
lbz 3,3(3)
addi 1,1,16
rlwinm 3,3,0,31,31
blr
(the nops are because I use --with-cpu=970).
But perhaps you do not care for PowerPC, in which case:
i686-linux-gcc (GCC) 4.2.0 20060410 (experimental)
(sorry for the old version, I don't build x86 compilers
all that often; also I don't have a 64-bit version right
now):
f:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
popl %ebp
movzbl (%eax), %eax
andl $1, %eax
ret
If you want testing with any other versions, and/or for
any other target architecture, I can do that; it takes a
few minutes to build a compiler.
It is quite hard to build a testcase that reads more than
one part of the "long", since for small testcases the
compiler will almost always be smart enough to do one
bigger read instead; but it certainly isn't inconceivable,
and anyway the compiler would be fully in its right to do
reads non-atomically if not instructed otherwise.
Segher
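A hedged follow-up to Segher's testcase: if the pointer is volatile-qualified, gcc is expected to keep the full-width load rather than narrowing it to a byte, since it performs accesses to standard-width volatile objects with their declared width. That is the behaviour the volatile cast in atomic_read() relies on; the generated code for this variant has not been re-verified here.

	unsigned char g(volatile unsigned long *p)
	{
		return *p & 1;	/* expected: full-width load, then mask */
	}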
On Sat, Aug 11, 2007 at 02:38:40AM +0200, Segher Boessenkool wrote:
> >>That means GCC cannot compile Linux; it already optimises
> >>some accesses to scalars to smaller accesses when it knows
> >>it is allowed to. Not often though, since it hardly ever
> >>helps in the cost model it employs.
> >
> >Please give an example code snippet + gcc version + arch
> >to back this up.
>
> unsigned char f(unsigned long *p)
> {
> return *p & 1;
> }
This doesn't really matter since we only care about the LSB.
Do you have an example where gcc reads it non-atomically and
we care about all parts?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>>>> That means GCC cannot compile Linux; it already optimises
>>>> some accesses to scalars to smaller accesses when it knows
>>>> it is allowed to. Not often though, since it hardly ever
>>>> helps in the cost model it employs.
>>>
>>> Please give an example code snippet + gcc version + arch
>>> to back this up.
>>
>> unsigned char f(unsigned long *p)
>> {
>> return *p & 1;
>> }
>
> This doesn't really matter since we only care about the LSB.
It is exactly what I claimed, and what you asked for proof of.
> Do you have an example where gcc reads it non-atomically and
> we care about all parts?
Like I explained in the original mail: no, I suspect such
a testcase will be really hard to construct, esp. as a small
testcase. I have no reason to believe it is impossible to
do so though -- maybe someone else can write trickier code
than I can, in which case, please do so.
Segher
On Sat, 11 Aug 2007 02:38:40 +0200, Segher Boessenkool said:
> >> That means GCC cannot compile Linux; it already optimises
> >> some accesses to scalars to smaller accesses when it knows
> >> it is allowed to. Not often though, since it hardly ever
> >> helps in the cost model it employs.
> >
> > Please give an example code snippet + gcc version + arch
> > to back this up.
>
> unsigned char f(unsigned long *p)
> {
> return *p & 1;
> }
Not really valid, because it's still able to do one atomic access to
compute the result.
Now, if you had found an example where it converts a 32-bit atomic access into
2 separate 16-bit accesses that weren't atomic as a whole....
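For concreteness, this is a purely hypothetical sketch of what such a split would be equivalent to at the source level. Nobody in the thread has produced a compiler that actually does this for a plain aligned 32-bit object; the function below is an illustration, not generated code.

	/*
	 * Hypothetical illustration: what a compiler-split 32-bit read
	 * would amount to.  An interrupt or another CPU updating *p
	 * between the two halfword loads could yield a value that was
	 * never stored -- the torn read described above.  (The type pun
	 * is only for illustration; the kernel builds with
	 * -fno-strict-aliasing, and the reassembly assumes a
	 * little-endian layout.)
	 */
	unsigned int split_read(unsigned int *p)
	{
		unsigned short *q = (unsigned short *)p;
		unsigned int lo = q[0];		/* first 16-bit access */
		unsigned int hi = q[1];		/* second 16-bit access */

		return lo | (hi << 16);
	}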