The attached patch documents the Linux kernel's memory barriers.
Signed-Off-By: David Howells <[email protected]>
---
warthog>diffstat -p1 mb.diff
Documentation/memory-barriers.txt | 359 ++++++++++++++++++++++++++++++++++++++
1 files changed, 359 insertions(+)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..c2fc51b
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,359 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implied kernel memory barriers.
+
+ (*) i386 and x86_64 arch specific notes.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose a
+partial ordering between the memory access operations specified either side of
+the barrier.
+
+Older and less complex CPUs will perform memory accesses in exactly the order
+specified, so if one is given the following piece of code:
+
+ a = *A;
+ *B = b;
+ c = *C;
+ d = *D;
+ *E = e;
+
+It can be guaranteed that it will complete the memory access for each
+instruction before moving on to the next line, leading to a definite sequence
+of operations on the bus:
+
+ read *A, write *B, read *C, read *D, write *E.
+
+However, with newer and more complex CPUs, this isn't always true because:
+
+ (*) they can rearrange the order of the memory accesses to promote better use
+ of the CPU buses and caches;
+
+ (*) reads are synchronous and may need to be done immediately to permit
+ progress, whereas writes can often be deferred without a problem;
+
+ (*) and they are able to combine reads and writes to improve performance when
+ talking to the SDRAM (modern SDRAM chips can do batched accesses of
+ adjacent locations, cutting down on transaction setup costs).
+
+So what you might actually get from the above piece of code is:
+
+ read *A, read *C+*D, write *E, write *B
+
+Under normal operation, this is probably not going to be a problem; however,
+there are two circumstances where it definitely _can_ be a problem:
+
+ (1) I/O
+
+ Many I/O devices can be memory mapped, and so appear to the CPU as if
+ they're just memory locations. However, to control the device, the driver
+ has to make the right accesses in exactly the right order.
+
+ Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+ presents to the CPU an "address register" and a bunch of "data registers".
+ The way it's accessed is to write the index of the internal register you
+ want to access to the address register, and then read or write the
+ appropriate data register to access the chip's internal register:
+
+ *ADR = ctl_reg_3;
+ reg = *DATA;
+
+ The problem with a clever CPU or a clever compiler is that the write to
+ the address register isn't guaranteed to happen before the access to the
+ data register, if the CPU or the compiler thinks it is more efficient to
+ defer the address write:
+
+ read *DATA, write *ADR
+
+ then things will break.
+
+ The way to deal with this is to insert an I/O memory barrier between the
+ two accesses:
+
+ *ADR = ctl_reg_3;
+ mb();
+ reg = *DATA;
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete.
+
+ (2) Multiprocessor interaction
+
+ When there's a system with more than one processor, these may be working
+ on the same set of data, but attempting not to use locks as locks are
+ quite expensive. This means that accesses that affect both CPUs may have
+ to be carefully ordered to prevent error.
+
+ Consider the R/W semaphore slow path. In that, a waiting process is
+ queued on the semaphore, as noted by it having a record on its stack
+ linked to the semaphore's list:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+     To wake up the waiter, the up_read() or up_write() functions have to read
+     the pointer from this record to know where the next waiter record is,
+     clear the task pointer, call wake_up_process() on the task, and release
+     the task struct reference held:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ If any of these steps occur out of order, then the whole thing may fail.
+
+ Note that the waiter does not get the semaphore lock again - it just waits
+ for its task pointer to be cleared. Since the record is on its stack, this
+ means that if the task pointer is cleared _before_ the next pointer in the
+ list is read, then another CPU might start processing the waiter and it
+ might clobber its stack before up*() functions have a chance to read the
+ next pointer.
+
+ CPU 0 CPU 1
+ =============================== ===============================
+ down_xxx()
+ Queue waiter
+ Sleep
+ up_yyy()
+ READ waiter->task;
+ WRITE waiter->task;
+ <preempt>
+ Resume processing
+ down_xxx() returns
+ call foo()
+ foo() clobbers *waiter
+ </preempt>
+ READ waiter->list.next;
+ --- OOPS ---
+
+ This could be dealt with using a spinlock, but then the down_xxx()
+ function has to get the spinlock again after it's been woken up, which is
+ a waste of resources.
+
+ The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ smp_mb();
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+ In this case, the barrier makes a guarantee that all memory accesses
+ before the barrier will happen before all the memory accesses after the
+ barrier. It does _not_ guarantee that all memory accesses before the
+ barrier will be complete by the time the barrier is complete.
+
+ SMP memory barriers are normally no-ops on a UP system because the CPU
+ orders overlapping accesses with respect to itself.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic memory barriers:
+
+ MANDATORY (I/O) SMP
+ =============== ================
+ GENERAL mb() smp_mb()
+ READ rmb() smp_rmb()
+ WRITE wmb() smp_wmb()
+
+General memory barriers make a guarantee that all memory accesses specified
+before the barrier will happen before all memory accesses specified after the
+barrier.
+
+Read memory barriers make a guarantee that all memory reads specified before
+the barrier will happen before all memory reads specified after the barrier.
+
+Write memory barriers make a guarantee that all memory writes specified before
+the barrier will happen before all memory writes specified after the barrier.
+
+SMP memory barriers are no-ops on uniprocessor compiled systems because it is
+assumed that a CPU will be self-consistent, and will order overlapping accesses
+with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in the access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order the first CPU commits its accesses to the bus.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write
+ barrier after it, depending on the function.
+
+
+==============================
+IMPLIED KERNEL MEMORY BARRIERS
+==============================
+
+Some of the other functions in the linux kernel imply memory barriers. For
+instance all the following (pseudo-)locking functions imply barriers.
+
+ (*) interrupt disablement and/or interrupts
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+ Memory accesses issued after the LOCK will be completed after the LOCK
+ accesses have completed.
+
+ Memory accesses issued before the LOCK may be completed after the LOCK
+ accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+ Memory accesses issued before the UNLOCK will be completed before the
+ UNLOCK accesses have completed.
+
+ Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+ accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+ The LOCK accesses will be completed before the unlock accesses.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do
+anything at all, especially with respect to I/O memory barriering.
+
+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
+memory and I/O accesses individually, or interrupt handling will barrier
+memory and I/O accesses on entry and on exit. This prevents an interrupt
+routine interfering with accesses made in a disabled-interrupt section of code
+and vice versa.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events on the bus is acceptable:
+
+ LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
+
+But none of the following are:
+
+ *F+*A, *B, LOCK, *C, *D, UNLOCK, *E
+ *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+ *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+ *B, LOCK, *C, *D, UNLOCK, *F+*A, *E
+
+
+Consider also the following (going back to the AMD PCnet example):
+
+ DISABLE IRQ
+ *ADR = ctl_reg_3;
+ mb();
+ x = *DATA;
+ *ADR = ctl_reg_4;
+ mb();
+ *DATA = y;
+ *ADR = ctl_reg_5;
+ mb();
+ z = *DATA;
+ ENABLE IRQ
+ <interrupt>
+ *ADR = ctl_reg_7;
+ mb();
+ q = *DATA
+ </interrupt>
+
+What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
+wrong register? (There's no guarantee that the process of handling an
+interrupt will barrier memory accesses in any way).
+
+
+==============================
+I386 AND X86_64 SPECIFIC NOTES
+==============================
+
+Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
+bus appear in program order - and so there's no requirement for any sort of
+explicit memory barriers.
+
+From the Pentium III onwards, there are three new memory barrier instructions:
+LFENCE, SFENCE and MFENCE, which correspond to the kernel memory barrier
+functions rmb(), wmb() and mb(). However, there are additional implicit memory
+barriers in the CPU implementation:
+
+ (*) Interrupt processing implies mb().
+
+ (*) The LOCK prefix adds implication of mb() on whatever instruction it is
+ attached to.
+
+ (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
+ required].
+
+ (*) Normal writes imply a semi-rmb(): reads before a write may not complete
+ after that write, but reads after a write may complete before the write
+ (ie: reads may go _ahead_ of writes).
+
+ (*) Non-temporal writes imply no memory barrier, and are the intended target
+ of SFENCE.
+
+ (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
+
+
+======================
+POWERPC SPECIFIC NOTES
+======================
+
+The powerpc is weakly ordered, and its read and write accesses may be
+completed generally in any order. Its memory barriers are also to some extent
+more substantial than the minimum requirement, and may directly affect
+hardware outside of the CPU.
This has been needed for quite some time but needs some more
additions:
1) Access to i/o mapped memory does not need memory barriers.
2) Explain difference between mb() and barrier() (see the sketch below).
3) Explain wmb() versus mmiowb().
Give some more examples of correct usage in drivers.
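On point 2, the short version is that barrier() constrains only the compiler,
while mb() constrains the CPU as well. A minimal sketch - the definitions
shown follow the kernel's, but are quoted from memory and may differ in detail
between arches and versions:

    /* Compiler barrier: an empty asm with a "memory" clobber.  It stops
     * gcc reordering or caching memory accesses across this point, but it
     * emits no instruction, so the CPU itself remains free to reorder. */
    #define barrier() __asm__ __volatile__("" : : : "memory")

    /* Mandatory barrier: also emits a fencing instruction so that the CPU
     * is constrained too; on x86_64, for instance, mb() is essentially: */
    #define mb() asm volatile("mfence" : : : "memory")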
On Tuesday 07 March 2006 18:40, David Howells wrote:
> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:
> +
> + a = *A;
> + *B = b;
> + c = *C;
> + d = *D;
> + *E = e;
> +
> +It can be guaranteed that it will complete the memory access for each
> +instruction before moving on to the next line, leading to a definite sequence
> +of operations on the bus:
Actually gcc is free to reorder it
(often it will not when it cannot prove that they don't alias, but sometimes
it can)
> +
> + Consider, for example, an ethernet chipset such as the AMD PCnet32. It
> + presents to the CPU an "address register" and a bunch of "data registers".
> + The way it's accessed is to write the index of the internal register you
> + want to access to the address register, and then read or write the
> + appropriate data register to access the chip's internal register:
> +
> + *ADR = ctl_reg_3;
> + reg = *DATA;
You're not supposed to do it this way anyways. The official way to access
MMIO space is using read/write[bwlq]
Haven't read all of it sorry, but thanks for the work of documenting
it.
-Andi
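For reference, the PCnet fragment would look something like the following when
written with the accessor routines. This is a sketch only: the base pointer and
the PCNET_RAP/PCNET_RDP offsets are invented names, and base would come from
ioremap() of the device's MMIO BAR:

    void __iomem *base;	/* from ioremap() of the chip's MMIO BAR */
    u32 reg;

    writel(ctl_reg_3, base + PCNET_RAP);	/* select internal register 3 */
    reg = readl(base + PCNET_RDP);		/* access it via the data port */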
Andi Kleen <[email protected]> wrote:
> Actually gcc is free to reorder it
> (often it will not when it cannot prove that they don't alias, but sometimes
> it can)
Yeah... I have mentioned the fact that compilers can reorder too, but
obviously not enough.
> You're not supposed to do it this way anyways. The official way to access
> MMIO space is using read/write[bwlq]
True, I suppose. I should make it clear that these accessor functions imply
memory barriers, if indeed they do, and that you should use them rather than
accessing I/O registers directly (at least, outside the arch you should).
David
On Maw, 2006-03-07 at 17:40 +0000, David Howells wrote:
> +Older and less complex CPUs will perform memory accesses in exactly the order
> +specified, so if one is given the following piece of code:
Not really true. Some of the fairly old dumb processors don't do this to
the bus, and just about anything with a cache won't (as it'll burst cache
lines to main memory)
> + want to access to the address register, and then read or write the
> + appropriate data register to access the chip's internal register:
> +
> + *ADR = ctl_reg_3;
> + reg = *DATA;
Not allowed anyway
> + In this case, the barrier makes a guarantee that all memory accesses
> + before the barrier will happen before all the memory accesses after the
> + barrier. It does _not_ guarantee that all memory accesses before the
> + barrier will be complete by the time the barrier is complete.
Better meaningful example would be barriers versus an IRQ handler. Which
leads nicely onto section 2
> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.
No. They guarantee that to an observer also running on that set of
processors the accesses to main memory will appear to be ordered in that
manner. They don't guarantee I/O related ordering for non main memory
due to things like PCI posting rules and NUMA goings on.
As an example of the difference here a Geode will reorder stores as it
feels but snoop the bus such that it can ensure an external bus master
cannot observe this by holding it off the bus to fix up ordering
violations first.
> +Read memory barriers make a guarantee that all memory reads specified before
> +the barrier will happen before all memory reads specified after the barrier.
> +
> +Write memory barriers make a guarantee that all memory writes specified before
> +the barrier will happen before all memory writes specified after the barrier.
Both with the caveat above
> +There is no guarantee that any of the memory accesses specified before a memory
> +barrier will be complete by the completion of a memory barrier; the barrier can
> +be considered to draw a line in the access queue that accesses of the
> +appropriate type may not cross.
CPU generated accesses to main memory
> + (*) interrupt disablement and/or interrupts
> + (*) spin locks
> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores
Should probably cover schedule() here.
> +Locks and semaphores may not provide any guarantee of ordering on UP compiled
> +systems, and so can't be counted on in such a situation to actually do
> +anything at all, especially with respect to I/O memory barriering.
_irqsave/_irqrestore ...
> +==============================
> +I386 AND X86_64 SPECIFIC NOTES
> +==============================
> +
> +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
> +bus appear in program order - and so there's no requirement for any sort of
> +explicit memory barriers.
Actually they are not. Processors prior to Pentium Pro ensure that the
perceived ordering between processors of writes to main memory is
preserved. The Pentium Pro is supposed to but does not in SMP cases. Our
spin_unlock code knows about this. It also has some problems with this
situation when handling write combining memory. The IDT Winchip series
processors are run in out of order store mode and our lock functions and
dmamappers should know enough about this.
On x86, memory barriers for reads serialize ordering using lock instructions;
on writes, the WinChip at least generates serializing instructions.
barrier() is pure CPU level of course
> + (*) Normal writes to memory imply wmb() [and so SFENCE is normally not
> + required].
Only at an on-processor level and not for all clones; also there are
errata here for PPro.
> + (*) Accesses to uncached memory imply mb() [eg: memory mapped I/O].
Not always. MMIO ordering is outside of the CPU ordering rules and into
PCI and other bus ordering rules. Consider
writel(STOP_DMA, &foodev->ctrl);
free_dma_buffers(foodev);
This leads to horrible disasters.
> +
> +======================
> +POWERPC SPECIFIC NOTES
Can't comment on PPC
On Tuesday 07 March 2006 19:30, David Howells wrote:
> > You're not supposed to do it this way anyways. The official way to access
> > MMIO space is using read/write[bwlq]
>
> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,
I don't think they do.
> and that you should use them rather than
> accessing I/O registers directly (at least, outside the arch you should).
Even inside the architecture it's a good idea.
-Andi
On Tuesday, March 7, 2006 10:30 am, David Howells wrote:
> True, I suppose. I should make it clear that these accessor functions
> imply memory barriers, if indeed they do, and that you should use them
> rather than accessing I/O registers directly (at least, outside the
> arch you should).
But they don't, that's why we have mmiowb(). There are lots of cases to
handle:
1) memory vs. memory
2) memory vs. I/O
3) I/O vs. I/O
(reads and writes for every case).
AFAIK, we have (1) fairly well handled with a plethora of barrier ops.
(2) is a bit fuzzy with the current operations I think, and for (3) all
we have is mmiowb() afaik. Maybe one of the ppc64 guys can elaborate on
the barriers their hw needs for the above cases (I think they're the
pathological case, so covering them should be good enough for everybody).
Btw, thanks for putting together this documentation, it's desperately
needed.
Jesse
On Tue, 7 Mar 2006, Alan Cox wrote:
[SNIPPED...]
>
> Not always. MMIO ordering is outside of the CPU ordering rules and into
> PCI and other bus ordering rules. Consider
>
> writel(STOP_DMA, &foodev->ctrl);
> free_dma_buffers(foodev);
>
> This leads to horrible disasters.
This might be a good place to document:
dummy = readl(&foodev->ctrl);
Will flush all pending writes to the PCI bus and that:
(void) readl(&foodev->ctrl);
... won't because `gcc` may optimize it away. In fact, variable
"dummy" should be global or `gcc` may make it go away as well.
Cheers,
Dick Johnson
On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
> This might be a good place to document:
> dummy = readl(&foodev->ctrl);
>
> Will flush all pending writes to the PCI bus and that:
> (void) readl(&foodev->ctrl);
> ... won't because `gcc` may optimize it away. In fact, variable
> "dummy" should be global or `gcc` may make it go away as well.
static inline unsigned int readl(const volatile void __iomem *addr)
{
	return *(volatile unsigned int __force *) addr;
}
The cast is volatile, so gcc knows not to optimise it away.
On Tue, 7 Mar 2006, Matthew Wilcox wrote:
> On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
>> This might be a good place to document:
>> dummy = readl(&foodev->ctrl);
>>
>> Will flush all pending writes to the PCI bus and that:
>> (void) readl(&foodev->ctrl);
>> ... won't because `gcc` may optimize it away. In fact, variable
>> "dummy" should be global or `gcc` may make it go away as well.
>
> static inline unsigned int readl(const volatile void __iomem *addr)
> {
> 	return *(volatile unsigned int __force *) addr;
> }
>
> The cast is volatile, so gcc knows not to optimise it away.
>
When the assignment is not made, a.k.a. cast to void, or when the
assignment is made to an otherwise unused variable, `gcc` does
indeed make it go away. These problems caused weeks of chagrin
after it was found that a PCI DMA operation took 20 or more times
longer than it should. The writel(START_DMA, &control), followed by
a dummy = readl(&control), ended up with the readl() missing.
That meant that the DMA didn't start until some timer code
read a status register, wondering why it hadn't completed yet.
Cheers,
Dick Johnson
On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
> True, I suppose. I should make it clear that these accessor functions imply
> memory barriers, if indeed they do,
They don't, but according to Documentation/DocBook/deviceiobook.tmpl
they are performed by the compiler in the order specified.
They also convert between PCI byte order and CPU byte order. If you
want to avoid that, you need the __raw_* versions, which are not
guaranteed to be provided by all arches.
<b
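Concretely, the difference looks like this - a sketch, where regs and CTRL are
invented names and the __raw_* variants are, as noted, not available on all
arches:

    void __iomem *regs;		/* hypothetical ioremap()ed register block */
    u32 val;

    writel(val, regs + CTRL);		/* stored little-endian (PCI byte
					 * order): byteswapped on BE CPUs */
    __raw_writel(val, regs + CTRL);	/* stored as-is: no byteswap and no
					 * ordering guarantees */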
Andi Kleen <[email protected]> wrote:
> > > You're not supposed to do it this way anyways. The official way to access
> > > MMIO space is using read/write[bwlq]
> >
> > True, I suppose. I should make it clear that these accessor functions imply
> > memory barriers, if indeed they do,
>
> I don't think they do.
Hmmm.. Seems Stephen Hemminger disagrees:
| > > 1) Access to i/o mapped memory does not need memory barriers.
| >
| > There's no guarantee of that. On FRV you have to insert barriers as
| > appropriate when you're accessing I/O mapped memory if ordering is required
| > (accessing an ethernet card vs accessing a frame buffer), but support for
| > inserting the appropriate barriers is built into gcc - which knows the rules
| > for when to insert them.
| >
| > Or are you referring to the fact that this should be implicit in inX(),
| > outX(), readX(), writeX() and similar?
|
| yes
David
On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote:
> On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
>
> > True, I suppose. I should make it clear that these accessor functions imply
> > memory barriers, if indeed they do,
>
> They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> they are performed by the compiler in the order specified.
I don't think that's correct. Probably the documentation should
be fixed.
-Andi
On Maw, 2006-03-07 at 13:54 -0500, linux-os (Dick Johnson) wrote:
> On Tue, 7 Mar 2006, Alan Cox wrote:
> > writel(STOP_DMA, &foodev->ctrl);
> > free_dma_buffers(foodev);
> >
> > This leads to horrible disasters.
>
> This might be a good place to document:
> dummy = readl(&foodev->ctrl);
Absolutely. And this falls outside of the memory barrier functions.
>
> Will flush all pending writes to the PCI bus and that:
> (void) readl(&foodev->ctrl);
> ... won't because `gcc` may optimize it away. In fact, variable
> "dummy" should be global or `gcc` may make it go away as well.
If they were ordinary functions then maybe, but they are not, so a simple
readl(&foodev->ctrl) will be sufficient and isn't optimised away.
Alan
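Putting Alan's two fragments together, the safe pattern reads the register
back to flush the posted write before the buffers are recycled (foodev and its
ctrl member are hypothetical, as in the original example):

    writel(STOP_DMA, &foodev->ctrl);
    readl(&foodev->ctrl);	/* read back: forces the posted write out
				 * to the device before we continue */
    free_dma_buffers(foodev);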
On Tue, 07 Mar 2006 19:24:03 +0000
David Howells <[email protected]> wrote:
> Andi Kleen <[email protected]> wrote:
>
> > > > You're not supposed to do it this way anyways. The official way to access
> > > > MMIO space is using read/write[bwlq]
> > >
> > > True, I suppose. I should make it clear that these accessor functions imply
> > > memory barriers, if indeed they do,
> >
> > I don't think they do.
>
> Hmmm.. Seems Stephen Hemminger disagrees:
>
> | > > 1) Access to i/o mapped memory does not need memory barriers.
> | >
> | > There's no guarantee of that. On FRV you have to insert barriers as
> | > appropriate when you're accessing I/O mapped memory if ordering is required
> | > (accessing an ethernet card vs accessing a frame buffer), but support for
> | > inserting the appropriate barriers is built into gcc - which knows the rules
> | > for when to insert them.
> | >
> | > Or are you referring to the fact that this should be implicit in inX(),
> | > outX(), readX(), writeX() and similar?
> |
The problem with all this is that, like physics, it is all relative to the observer.
I get confused and lost when talking about the general case because there are so many possible
specific examples where a barrier is or is not needed.
On Tuesday, March 7, 2006 3:57 am, Andi Kleen wrote:
> On Tuesday 07 March 2006 20:23, Bryan O'Sullivan wrote:
> > On Tue, 2006-03-07 at 18:30 +0000, David Howells wrote:
> > > True, I suppose. I should make it clear that these accessor
> > > functions imply memory barriers, if indeed they do,
> >
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.
On ia64 I'm pretty sure it's true, and it seems like it should be in the
general case too. The compiler shouldn't reorder uncached memory
accesses with volatile semantics...
Jesse
Alan Cox <[email protected]> wrote:
> Better meaningful example would be barriers versus an IRQ handler. Which
> leads nicely onto section 2
Yes, except that I can't think of one that's feasible that doesn't have to do
with I/O - which isn't a problem if you are using the proper accessor
functions.
Such an example has to involve more than one CPU, because you don't tend to
get memory/memory ordering problems on UP.
The obvious one might be circular buffers, except there's no problem there
provided you have a memory barrier between accessing the buffer and updating
your pointer into it.
David
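For concreteness, that circular buffer pattern might look like the following
sketch - the names are invented, it assumes a single producer and a single
consumer on different CPUs, and ring/datum are set up elsewhere:

    #define RING_SIZE 256			/* must be a power of two */

    struct ring {
    	int		buf[RING_SIZE];
    	unsigned int	head;		/* written only by the producer */
    	unsigned int	tail;		/* written only by the consumer */
    };

    /* producer */
    ring->buf[ring->head & (RING_SIZE - 1)] = datum;
    smp_wmb();			/* commit the datum before publishing the index */
    ring->head++;

    /* consumer */
    if (ring->tail != ring->head) {
    	smp_rmb();		/* read the index before reading the datum */
    	datum = ring->buf[ring->tail & (RING_SIZE - 1)];
    	ring->tail++;
    }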
On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > > True, I suppose. I should make it clear that these accessor functions imply
> > > memory barriers, if indeed they do,
> >
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.
That's why I hedged my words with "according to ..." :-)
But on most arches those accesses do indeed seem to happen in-order. On
i386 and x86_64, it's a natural consequence of program store ordering.
On at least some other arches, there are explicit memory barriers in the
implementation of the access macros to force this ordering to occur.
<b
On Tuesday 07 March 2006 22:14, Bryan O'Sullivan wrote:
> On Tue, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > > > True, I suppose. I should make it clear that these accessor functions
> > > > imply memory barriers, if indeed they do,
> > >
> > > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > > they are performed by the compiler in the order specified.
> >
> > I don't think that's correct. Probably the documentation should
> > be fixed.
>
> That's why I hedged my words with "according to ..." :-)
>
> But on most arches those accesses do indeed seem to happen in-order. On
> i386 and x86_64, it's a natural consequence of program store ordering.
Not true for reads on x86.
-Andi
In-Reply-To: <[email protected]>
On Tue, 07 Mar 2006 17:40:45 +0000, David Howells wrote:
> The attached patch documents the Linux kernel's memory barriers.
References:
AMD64 Architecture Programmer's Manual Volume 2: System Programming
Chapter 7.1: Memory-Access Ordering
Chapter 7.4: Buffering and Combining Memory Writes
IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide
Chapter 7.1: Locked Atomic Operations
Chapter 7.2: Memory Ordering
Chapter 7.4: Serializing Instructions
--
Chuck
"Penguins don't come from next door, they come from the Antarctic!"
From: Chuck Ebbert <[email protected]>
Date: Tue, 7 Mar 2006 18:17:19 -0500
> In-Reply-To: <[email protected]>
>
> On Tue, 07 Mar 2006 17:40:45 +0000, David Howells wrote:
>
> > The attached patch documents the Linux kernel's memory barriers.
>
> References:
Here are some good ones for Sparc64:
The SPARC Architecture Manual, Version 9
Chapter 8: Memory Models
Appendix D: Formal Specification of the Memory Models
Appendix J: Programming with the Memory Models
UltraSPARC Programmer Reference Manual
Chapter 5: Memory Accesses and Cacheability
Chapter 15: Sparc-V9 Memory Models
UltraSPARC III Cu User's Manual
Chapter 9: Memory Models
UltraSPARC IIIi Processor User's Manual
Chapter 8: Memory Models
UltraSPARC Architecture 2005
Chapter 9: Memory
Appendix D: Formal Specifications of the Memory Models
UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
Chapter 8: Memory Models
Appendix F: Caches and Cache Coherency
Jesse Barnes wrote:
> On Tuesday, March 7, 2006 10:30 am, David Howells wrote:
>> True, I suppose. I should make it clear that these accessor functions
>> imply memory barriers, if indeed they do, and that you should use them
>> rather than accessing I/O registers directly (at least, outside the
>> arch you should).
>
> But they don't, that's why we have mmiowb().
I don't think that is why that function exists.. It's a no-op on most
architectures, even where you would need to be able to do write barriers
on IO accesses (i.e. x86_64 using CONFIG_UNORDERED_IO). I believe that
function is intended for a more limited special case.
I think any complete memory barrier description should document that
function as well as EXPLICITLY specify whether or not the
readX/writeX, etc. functions imply barriers.
> Btw, thanks for putting together this documentation, it's desperately
> needed.
Seconded.. The fact that there's debate over what the rules even are
shows why this is needed so badly.
--
Robert Hancock Saskatoon, SK, Canada
>>The attached patch documents the Linux kernel's memory barriers.
>
> References:
>
> AMD64 Architecture Programmer's Manual Volume 2: System Programming
> Chapter 7.1: Memory-Access Ordering
> Chapter 7.4: Buffering and Combining Memory Writes
>
> IA-32 Intel Architecture Software Developer's Manual, Volume 3:
> System Programming Guide
> Chapter 7.1: Locked Atomic Operations
> Chapter 7.2: Memory Ordering
> Chapter 7.4: Serializing Instructions
Do you guys reckon it might be worthwhile adding Sparc's sequential
consistency, TSO, RMO and PSO models, although I think only RMO is used
in the Linux kernel? References can be found for example in:
Solaris Internals, Core Kernel Architecture, p63-68:
Chapter 3.3: Hardware Considerations for Locks and
Synchronization
Unix Systems for Modern Architectures, Symmetric Multiprocessing
and Caching for Kernel Programmers:
Chapter 13 : Other Memory Models
Or is DaveM the only one fiddling with Sparc memory barriers implementation?
Regards,
Roberto Nibali, ratz
On Maw, 2006-03-07 at 20:09 +0000, David Howells wrote:
> Alan Cox <[email protected]> wrote:
>
> > Better meaningful example would be barriers versus an IRQ handler. Which
> > leads nicely onto section 2
>
> Yes, except that I can't think of one that's feasible that doesn't have to do
> with I/O - which isn't a problem if you are using the proper accessor
> functions.
We get them off bus masters for one and you can construct silly versions
of the other.
There are several kernel instances of
while(*ptr != HAVE_RESPONDED && time_before(jiffies, timeout))
rmb();
where we wait for the hardware to respond by bus mastering when it is fast
and doesn't IRQ.
On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
> > But on most arches those accesses do indeed seem to happen in-order. On
> > i386 and x86_64, it's a natural consequence of program store ordering.
>
> Not true for reads on x86.
You must have a strange kernel Andi. Mine marks them as volatile
unsigned char * references.
Alan
On Maw, 2006-03-07 at 12:57 +0100, Andi Kleen wrote:
> > They don't, but according to Documentation/DocBook/deviceiobook.tmpl
> > they are performed by the compiler in the order specified.
>
> I don't think that's correct. Probably the documentation should
> be fixed.
It would be wiser to ensure they are performed in the order specified.
As far as I can see this is currently true due to the volatile cast, and
most drivers rely on this property, so the brown and sticky will impact
the rotating air impeller pretty fast if it isn't.
Alan Cox wrote:
> On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
>>> But on most arches those accesses do indeed seem to happen in-order. On
>>> i386 and x86_64, it's a natural consequence of program store ordering.
>> Not true for reads on x86.
>
> You must have a strange kernel Andi. Mine marks them as volatile
> unsigned char * references.
Well, that and the fact that IO memory should be mapped as uncacheable
in the MTRRs should ensure that readl and writel won't be reordered on
i386 and x86_64.. except in the case where CONFIG_UNORDERED_IO is
enabled on x86_64 which can reorder writes since it uses nontemporal
stores..
--
Robert Hancock Saskatoon, SK, Canada
David Howells wrote:
>The attached patch documents the Linux kernel's memory barriers.
>
>Signed-Off-By: David Howells <[email protected]>
>---
>
>
Good :)
>+==============================
>+IMPLIED KERNEL MEMORY BARRIERS
>+==============================
>+
>+Some of the other functions in the linux kernel imply memory barriers. For
>+instance all the following (pseudo-)locking functions imply barriers.
>+
>+ (*) interrupt disablement and/or interrupts
>
Is this really the case? I mean interrupt disablement only synchronises with
the local CPU, so it probably should not _have_ to imply barriers (eg. some
architectures are playing around with "virtual" interrupt disablement).
[...]
>+
>+Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
>+memory and I/O accesses individually, or interrupt handling will barrier
>+memory and I/O accesses on entry and on exit. This prevents an interrupt
>+routine interfering with accesses made in a disabled-interrupt section of code
>+and vice versa.
>+
>
But CPUs should always be consistent WRT themselves, so I'm not sure that
it is needed?
Thanks,
Nick
David Howells writes:
> The attached patch documents the Linux kernel's memory barriers.
Thanks for venturing into this particular lion's den. :)
> +Memory barriers are instructions to both the compiler and the CPU to impose a
> +partial ordering between the memory access operations specified either side of
> +the barrier.
... as observed from another agent in the system - another CPU or a
bus-mastering I/O device. A given CPU will always see its own memory
accesses in order.
> + (*) reads are synchronous and may need to be done immediately to permit
Leave out the "are synchronous and". It's not true.
I also think you need to avoid talking about "the bus". Some systems
don't have a bus, but rather have an interconnection fabric between
the CPUs and the memories. Talking about a bus implies that all
memory accesses in fact get serialized (by having to be sent one after
the other over the bus) and that you can therefore talk about the
order in which they get to memory. In some systems, no such order
exists.
It's possible to talk sensibly about the order in which memory
accesses get done without talking about a bus or requiring a total
ordering on the memory access. The PowerPC architecture spec does
this by specifying that in certain circumstances one load or store has
to be "performed with respect to other processors and mechanisms"
before another. A load is said to be performed with respect to
another agent when a store by that agent can no longer change the
value returned by the load. Similarly, a store is performed w.r.t.
an agent when any load done by the agent will return the value stored
(or a later value).
> + The way to deal with this is to insert an I/O memory barrier between the
> + two accesses:
> +
> + *ADR = ctl_reg_3;
> + mb();
> + reg = *DATA;
Ummm, this implies mb() is "an I/O memory barrier". I can see people
getting confused if they read this and then see mb() being used when
no I/O is being done.
> +The Linux kernel has six basic memory barriers:
> +
> + MANDATORY (I/O) SMP
> + =============== ================
> + GENERAL mb() smp_mb()
> + READ rmb() smp_rmb()
> + WRITE wmb() smp_wmb()
> +
> +General memory barriers make a guarantee that all memory accesses specified
> +before the barrier will happen before all memory accesses specified after the
> +barrier.
By "memory accesses" do you mean accesses to system memory, or do you
mean loads and stores - which may be to system memory, memory on an I/O
device (e.g. a framebuffer) or to memory-mapped I/O registers?
Linus explained recently that wmb() on x86 does not order stores to
system memory w.r.t. stores to prefetchable I/O memory (at least
that's what I think he said ;).
> +Some of the other functions in the linux kernel imply memory barriers. For
> +instance all the following (pseudo-)locking functions imply barriers.
> +
> + (*) interrupt disablement and/or interrupts
Enabling/disabling interrupts doesn't imply a barrier on powerpc, and
nor does taking an interrupt or returning from one.
> + (*) spin locks
I think it's still an open question as to whether spin locks do any
ordering between accesses to system memory and accesses to I/O
registers.
> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores
> +
> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> +
> + (*) LOCK operation implication:
> +
> + Memory accesses issued after the LOCK will be completed after the LOCK
> + accesses have completed.
> +
> + Memory accesses issued before the LOCK may be completed after the LOCK
> + accesses have completed.
> +
> + (*) UNLOCK operation implication:
> +
> + Memory accesses issued before the UNLOCK will be completed before the
> + UNLOCK accesses have completed.
> +
> + Memory accesses issued after the UNLOCK may be completed before the UNLOCK
> + accesses have completed.
And therefore an UNLOCK followed by a LOCK is equivalent to a full
barrier, but a LOCK followed by an UNLOCK isn't.
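For example (a sketch; the locks and variables are invented):

    *A = a;
    spin_unlock(&lock_1);	/* accesses before the UNLOCK stay before it */
    spin_lock(&lock_2);		/* accesses after the LOCK stay after it */
    *B = b;
    /* the store to *A is guaranteed to be seen before the store to *B;
     * with LOCK ... UNLOCK instead, both stores could drift into the
     * critical section and pass each other */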
> +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
> +memory and I/O accesses individually, or interrupt handling will barrier
> +memory and I/O accesses on entry and on exit. This prevents an interrupt
> +routine interfering with accesses made in a disabled-interrupt section of code
> +and vice versa.
I don't think this is right, and I don't think it is necessary to
achieve the end you state, since a CPU will always see its own memory
accesses in program order.
> +The following sequence of events on the bus is acceptable:
> +
> + LOCK, *F+*A, *E, *C+*D, *B, UNLOCK
What does *F+*A mean?
> +Consider also the following (going back to the AMD PCnet example):
> +
> + DISABLE IRQ
> + *ADR = ctl_reg_3;
> + mb();
> + x = *DATA;
> + *ADR = ctl_reg_4;
> + mb();
> + *DATA = y;
> + *ADR = ctl_reg_5;
> + mb();
> + z = *DATA;
> + ENABLE IRQ
> + <interrupt>
> + *ADR = ctl_reg_7;
> + mb();
> + q = *DATA
> + </interrupt>
> +
> +What's to stop "z = *DATA" crossing "*ADR = ctl_reg_7" and reading from the
> +wrong register? (There's no guarantee that the process of handling an
> +interrupt will barrier memory accesses in any way).
Well, the driver should *not* be doing *ADR at all, it should be using
read[bwl]/write[bwl]. The architecture code has to implement
read*/write* in such a way that the accesses generated can't be
reordered. I _think_ it also has to make sure the write accesses
can't be write-combined, but it would be good to have that clarified.
> +======================
> +POWERPC SPECIFIC NOTES
> +======================
> +
> +The powerpc is weakly ordered, and its read and write accesses may be
> +completed generally in any order. Its memory barriers are also to some extent
> +more substantial than the minimum requirement, and may directly affect
> +hardware outside of the CPU.
Unfortunately mb()/smp_mb() are quite expensive on PowerPC, since the
only instruction we have that implies a strong enough barrier is sync,
which also performs several other kinds of synchronization, such as
waiting until all previous instructions have completed executing to
the point where they can no longer cause an exception.
Paul.
On Wed, 8 Mar 2006, Paul Mackerras wrote:
>
> Linus explained recently that wmb() on x86 does not order stores to
> system memory w.r.t. stores to prefetchable I/O memory (at least
> that's what I think he said ;).
In fact, it won't order stores to normal memory even wrt any
_non-prefetchable_ IO memory.
PCI (and any other sane IO fabric, for that matter) will do IO posting, so
the fact that the CPU _core_ may order them due to a wmb() doesn't
actually mean anything.
The only way to _really_ synchronize with a store to an IO device is
literally to read from that device (*). No amount of memory barriers will
do it.
So you can really only order stores to regular memory wrt each other, and
stores to IO memory wrt each other. For the former, "smp_wmb()" does it.
For IO memory, normal IO memory is _always_ supposed to be in program
order (at least for PCI. It's part of how the bus is supposed to work),
unless the IO range allows prefetching (and you've set some MTRR). And if
you do, that, currently you're kind of screwed. mmiowb() should do it, but
nobody really uses it, and I think it's broken on x86 (it's a no-op, it
really should be an "sfence").
A full "mb()" is probably most likely to work in practice. And yes, we
should clean this up.
Linus
(*) The "read" can of course be any event that tells you that the store
has happened - it doesn't necessarily have to be an actual "read[bwl]()"
operation. Eg the store might start a command, and when you get the
completion interrupt, you obviously know that the store is done, just from
a causal reason.
Paul Mackerras wrote:
> David Howells writes:
>>+ The way to deal with this is to insert an I/O memory barrier between the
>>+ two accesses:
>>+
>>+ *ADR = ctl_reg_3;
>>+ mb();
>>+ reg = *DATA;
>
>
> Ummm, this implies mb() is "an I/O memory barrier". I can see people
> getting confused if they read this and then see mb() being used when
> no I/O is being done.
>
Isn't it? Why wouldn't you just use smp_mb() if no IO is being done?
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Tuesday 7 March 2006 21:09, David Howells wrote:
> Alan Cox <[email protected]> wrote:
>
> > Better meaningful example would be barriers versus an IRQ handler. Which
> > leads nicely onto section 2
>
> Yes, except that I can't think of one that's feasible that doesn't have to do
> with I/O - which isn't a problem if you are using the proper accessor
> functions.
>
> Such an example has to involve more than one CPU, because you don't tend to
> get memory/memory ordering problems on UP.
On UP you at least need compiler barriers, right? You're in trouble if you think
you are writing in a certain order, and expect to see the same order from an
interrupt handler, but the compiler decided to rearrange the order of the writes...
> The obvious one might be circular buffers, except there's no problem there
> provided you have a memory barrier between accessing the buffer and updating
> your pointer into it.
>
> David
Ciao,
Duncan.
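In code, Duncan's scenario is roughly the following - names invented; on UP
the CPU sees its own accesses in order, so the compiler barrier is the missing
piece:

    static int dev_data;
    static int dev_data_ready;

    /* mainline, interrupts enabled */
    dev_data = calculate_result();	/* calculate_result() is hypothetical */
    barrier();				/* stop gcc hoisting the flag write
					 * above the data write */
    dev_data_ready = 1;

    /* interrupt handler, same CPU */
    if (dev_data_ready)
    	handle(dev_data);		/* handle() is hypothetical */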
On Maw, 2006-03-07 at 19:10 -0600, Robert Hancock wrote:
> Alan Cox wrote:
> > You must have a strange kernel Andi. Mine marks them as volatile
> > unsigned char * references.
>
> Well, that and the fact that IO memory should be mapped as uncacheable
> in the MTRRs should ensure that readl and writel won't be reordered on
> i386 and x86_64.. except in the case where CONFIG_UNORDERED_IO is
> enabled on x86_64 which can reorder writes since it uses nontemporal
> stores..
You need both:
readl/writel need the volatile to stop gcc removing/reordering the
accesses at compiler level, and the mtrr/pci bridge stuff then deals
with bus level ordering for that CPU.
Linus Torvalds <[email protected]> wrote:
> > Linus explained recently that wmb() on x86 does not order stores to
> > system memory w.r.t. stores to prefetchable I/O memory (at least
> > that's what I think he said ;).
On i386 and x86_64, do IN and OUT instructions imply MFENCE? It's not obvious
from the x86_64 docs.
David
Paul Mackerras <[email protected]> wrote:
> By "memory accesses" do you mean accesses to system memory, or do you
> mean loads and stores - which may be to system memory, memory on an I/O
> device (e.g. a framebuffer) or to memory-mapped I/O registers?
Well, I meant all loads and stores, irrespective of their destination.
However, on i386, for example, you've actually got at least two different I/O
access domains, and I don't know how they impinge upon each other (IN/OUT vs
MOV).
> Enabling/disabling interrupts doesn't imply a barrier on powerpc, and
> nor does taking an interrupt or returning from one.
Surely it ought to, otherwise what's to stop accesses done with interrupts
disabled crossing with accesses done inside an interrupt handler?
> > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
> ...
> I don't think this is right, and I don't think it is necessary to
> achieve the end you state, since a CPU will always see its own memory
> accesses in program order.
But what about a driver accessing some memory that its device is going to
observe under irq disablement, and then getting an interrupt immediately after
from that same device, the handler for which communicates with the device,
possibly then being broken because the CPU hasn't completed all the memory
accesses that the driver made while interrupts are disabled?
Alternatively, might it be possible for communications between two CPUs to be
stuffed because one took an interrupt that also modified common data before
the it had committed the memory accesses done under interrupt disablement?
This would suggest using a lock though.
I'm not sure that I can come up with a feasible example for this, but Alan Cox
seems to think that it's a valid problem too.
The only likely way I can see this being a problem is with unordered I/O
writes, which would suggest you have to place an mmiowb() before unlocking the
spinlock in such a case, assuming it is possible to get unordered I/O writes
(which I think it is).
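That is, something like the following sketch (the device and lock are
invented):

    spin_lock(&dev->lock);
    writel(cmd, dev->regs + CTRL_REG);
    mmiowb();			/* order the MMIO write before the unlock is
				 * visible to the next acquirer of the lock */
    spin_unlock(&dev->lock);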
> What does *F+*A mean?
Combined accesses.
> Well, the driver should *not* be doing *ADR at all, it should be using
> read[bwl]/write[bwl]. The architecture code has to implement
> read*/write* in such a way that the accesses generated can't be
> reordered. I _think_ it also has to make sure the write accesses
> can't be write-combined, but it would be good to have that clarified.
Then what use is mmiowb()?
Surely write combining and out-of-order reads are reasonable for cacheable
devices like framebuffers.
David
The attached patch documents the Linux kernel's memory barriers.
I've updated it from the comments I've been given.
Note that the per-arch notes sections are gone because it's clear that there
are so many exceptions that it's not worth having them.
I've added a list of references to other documents.
I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.
I'm not sure that any mention of interrupts vs interrupt disablement should be
retained... it's unclear that there is actually anything that guarantees that
stuff won't leak out of an interrupt-disabled section and into an interrupt
handler. Paul Mackerras says this isn't valid on powerpc, and looking at the
code seems to confirm that, barring implicit enforcement by the CPU.
Signed-Off-By: David Howells <[email protected]>
---
warthog>diffstat -p1 /tmp/mb.diff
Documentation/memory-barriers.txt | 589 ++++++++++++++++++++++++++++++++++++++
1 files changed, 589 insertions(+)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..1340c8d
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,589 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+ - Accessing devices.
+ - Multiprocessor interaction.
+ - Interrupts.
+
+ (*) Linux kernel compiler barrier functions.
+
+ (*) Linux kernel memory barrier functions.
+
+ (*) Implicit kernel memory barriers.
+
+ - Locking functions.
+ - Interrupt disablement functions.
+ - Miscellaneous functions.
+
+ (*) Linux kernel I/O barriering.
+
+ (*) References.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified either
+side of the barrier. They request that the sequence of memory events generated
+appears to other components of the system as if the barrier is effective on
+that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+ ordered. It's possible for the CPU to do out-of-order accesses _as long
+ as no-one is looking_, and then fix up the memory if someone else tries to
+ see what's going on (for instance a bus master device); what matters is
+ the _apparent_ order as far as other processors and devices are concerned;
+ and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+ and are not, for the most part, guaranteed to percolate down to other CPUs
+ in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is, for
+example, given the following piece of code:
+
+ a = *A;
+ *B = b;
+ c = *C;
+ d = *D;
+ *E = e;
+
+They would then expect that the CPU will complete the memory access for each
+instruction before moving on to the next one, leading to a definite sequence of
+operations as seen by external observers in the system:
+
+ read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier. With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+ execution progress, whereas writes can often be deferred without a
+ problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+ prove not to be required;
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+ of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+ the memory or I/O hardware that can do batched accesses of adjacent
+ locations, thus cutting down on transaction setup costs (memory and PCI
+ devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, though cache-coherency
+ mechanisms should alleviate this - once the write has actually hit the
+ cache.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+ read *A, read {*C,*D}, write *E, write *B
+
+ (By "read {*C,*D}" I mean a combined single read).
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory
+barrier. For instance with the following code:
+
+ X = *A;
+ *A = Y;
+ Z = *A;
+
+assuming no intervention by an external influence, it can be taken that:
+
+ (*) X will hold the old value of *A; the read will never happen after the
+     write, and so X will never end up being given the value that was
+     assigned to *A from Y instead; and
+
+ (*) Z will always be given the value in *A that was assigned there from Y;
+     the read will never happen before the write, and so Z will never end up
+     with the value that was in *A initially.
+
+(This is ignoring the fact that the value initially in *A may appear to be the
+same as the value assigned to *A from Y).
+
+
+=================================
+WHERE ARE MEMORY BARRIERS NEEDED?
+=================================
+
+Under normal operation, access reordering is probably not going to be a problem
+as a linear program will still appear to operate correctly. There are,
+however, three circumstances where reordering definitely _could_ be a problem:
+
+
+ACCESSING DEVICES
+-----------------
+
+Many devices can be memory mapped, and so appear to the CPU as if they're just
+memory locations. However, to control the device, the driver has to make the
+right accesses in exactly the right order.
+
+Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+presents to the CPU an "address register" and a bunch of "data registers". The
+way it's accessed is to write the index of the internal register to be accessed
+to the address register, and then read or write the appropriate data register
+to access the chip's internal register, which could - theoretically - be done
+by:
+
+ *ADR = ctl_reg_3;
+ reg = *DATA;
+
+The problem with a clever CPU or a clever compiler is that the write to the
+address register isn't guaranteed to happen before the access to the data
+register, if the CPU or the compiler thinks it is more efficient to defer the
+address write:
+
+ read *DATA, write *ADR
+
+then things will break.
+
+
+In the Linux kernel, however, I/O should be done through the appropriate
+accessor routines - such as inb() or writel() - which know how to make such
+accesses appropriately sequential.
+
+On some systems, I/O writes are not strongly ordered across all CPUs, and so
+locking should be used, and mmiowb() should be issued prior to unlocking the
+critical section.
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+MULTIPROCESSOR INTERACTION
+--------------------------
+
+When there's a system with more than one processor, these may be working on the
+same set of data, but attempting not to use locks as locks are quite expensive.
+This means that accesses that affect both CPUs may have to be carefully ordered
+to prevent error.
+
+Consider the R/W semaphore slow path. In that, a waiting process is queued on
+the semaphore, as noted by it having a record on its stack linked to the
+semaphore's list:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to know where the next waiter record is, clear
+the task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared. Since the record is on its stack, this means
+that if the task pointer is cleared _before_ the next pointer in the list is
+read, another CPU might start processing the waiter and might clobber the
+waiter's stack before the up*() function has a chance to read the next
+pointer.
+
+ CPU 0 CPU 1
+ =============================== ===============================
+ down_xxx()
+ Queue waiter
+ Sleep
+ up_yyy()
+ READ waiter->task;
+ WRITE waiter->task;
+ <preempt>
+ Resume processing
+ down_xxx() returns
+ call foo()
+ foo() clobbers *waiter
+ </preempt>
+ READ waiter->list.next;
+ --- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ smp_mb();
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system. It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
+
+SMP memory barriers are normally mere compiler barriers on a UP system because
+the CPU orders overlapping accesses with respect to itself.
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two may interfere with each other's attempts to control or access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+disabled-interrupt section in the driver. Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+
+However, consider the following example:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+ [A is 0 and B is 0]
+ DISABLE IRQ
+ *A = 1;
+ smp_wmb();
+ *B = 2;
+ ENABLE IRQ
+ <interrupt>
+ *A = 3
+ a = *A;
+ b = *B;
+ smp_wmb();
+ *B = 4;
+ </interrupt>
+
+CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B
+== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3.
+
+This might happen because the write "*B = 2" might occur after the write "*A =
+3" - in which case the former write has leaked from the interrupt-disabled
+section into the interrupt handler. In this case a lock of some description
+should very probably be used.
+
+
+This sort of problem might also occur with relaxed I/O ordering rules, if it's
+permitted for I/O writes to cross. For instance, if a driver was talking to an
+ethernet card that sports an address register and a data register:
+
+ DISABLE IRQ
+	writew(ctl_reg_3, ADR);
+	writew(y, DATA);
+ ENABLE IRQ
+ <interrupt>
+	writew(ctl_reg_4, ADR);
+ q = readw(DATA);
+ </interrupt>
+
+In such a case, an mmiowb() is needed, firstly to prevent the first write to
+the address register from occurring after the write to the data register, and
+secondly to prevent the write to the data register from happening after the
+second write to the address register.
+
+
+=======================================
+LINUX KERNEL COMPILER BARRIER FUNCTIONS
+=======================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+ barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers.
+
+
+=====================================
+LINUX KERNEL MEMORY BARRIER FUNCTIONS
+=====================================
+
+The Linux kernel has six basic CPU memory barriers:
+
+ MANDATORY SMP CONDITIONAL
+ =============== ===============
+ GENERAL mb() smp_mb()
+ READ rmb() smp_rmb()
+ WRITE wmb() smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on uniprocessor compiled systems
+because it is assumed that a CPU will be apparently self-consistent, and will
+order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware will
+not reorder the memory accesses. CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barriering functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write
+ barrier after it, depending on the function.
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the Linux kernel imply memory barriers; amongst
+them are locking, scheduling and interrupt management functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+For instance all the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+ Memory accesses issued after the LOCK will be completed after the LOCK
+ accesses have completed.
+
+ Memory accesses issued before the LOCK may be completed after the LOCK
+ accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+ Memory accesses issued before the UNLOCK will be completed before the
+ UNLOCK accesses have completed.
+
+ Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+ accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+ The LOCK accesses will be completed before the UNLOCK accesses.
+
+And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
+a LOCK followed by an UNLOCK isn't.
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O barriering, unless combined with
+interrupt disablement operations.
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events is acceptable:
+
+ LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+ {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
+ *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+ *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+ *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLEMENT FUNCTIONS
+-------------------------------
+
+Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will
+barrier memory and I/O accesses versus memory and I/O accesses done in the
+interrupt handler. This prevents an interrupt routine interfering with
+accesses made in a disabled-interrupt section of code and vice versa.
+
+Note that whilst interrupt disablement operations all act as compiler barriers,
+they only act as memory barriers with respect to interrupts, not with respect
+to nested sections.
+
+Consider the following:
+
+ <interrupt>
+ *X = x;
+ </interrupt>
+ *A = a;
+ SAVE IRQ AND DISABLE
+ *B = b;
+ SAVE IRQ AND DISABLE
+ *C = c;
+ RESTORE IRQ
+ *D = d;
+ RESTORE IRQ
+ *E = e;
+ <interrupt>
+ *Y = y;
+ </interrupt>
+
+It is acceptable to observe the following sequences of events:
+
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E }
+ { INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+ { INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+ { INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y }
+
+But not the following:
+
+ { INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E }
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+===========================
+LINUX KERNEL I/O BARRIERING
+===========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions:
+
+ (*) inX(), outX():
+
+ These are intended to talk to legacy i386 hardware using an alternate bus
+ addressing mode. They are synchronous as far as the x86 CPUs are
+ concerned, but other CPUs and intermediary bridges may not honour that.
+
+ They are guaranteed to be fully ordered with respect to each other.
+
+ (*) readX(), writeX():
+
+ These are guaranteed to be fully ordered and uncombined with respect to
+ each other on the issuing CPU, provided they're not accessing a
+ prefetchable device. However, intermediary hardware (such as a PCI
+ bridge) may indulge in deferral if it so wishes; to flush a write, a read
+ from the same location must be performed.
+
+ Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+ force writes to be ordered.
+
+ (*) readX_relaxed()
+
+ These are not guaranteed to be ordered in any way. There is no I/O read
+ barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+ These will perform as appropriate for the type of access they're actually
+ doing, be it in/out or read/write.
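+
+As an illustration of the write-flushing rule described for readX() and
+writeX() above (a sketch only; START_DMA, CMD, STATUS and dev->regs are
+hypothetical definitions for an imaginary device):
+
+	writel(START_DMA, dev->regs + CMD);
+	(void) readl(dev->regs + STATUS);	/* flush the posted write */
+
+The read cannot complete until intermediary bridges have flushed the posted
+write through to the device, so the device is guaranteed to see the command
+before any access that follows the read.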
+
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+ Chapter 7.1: Memory-Access Ordering
+ Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+ Chapter 7.1: Locked Atomic Operations
+ Chapter 7.2: Memory Ordering
+ Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+ Chapter 8: Memory Models
+ Appendix D: Formal Specification of the Memory Models
+ Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+ Chapter 5: Memory Accesses and Cacheability
+ Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+ Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+ Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+ Chapter 9: Memory
+ Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+ Chapter 8: Memory Models
+ Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+ Chapter 3.3: Hardware Considerations for Locks and
+ Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+ Chapter 13: Other Memory Models
On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote:
> + (*) reads can be done speculatively, and then the result discarded should it
> + prove not to be required;
That might be worth an example with an if() because PPC will do this and if
its a read with a side effect (eg I/O space) you get singed..
> +same set of data, but attempting not to use locks as locks are quite expensive.
s/are quite/is quite
and is quite confusing to read
> +SMP memory barriers are normally mere compiler barriers on a UP system because
s/mere//
Makes it easier to read if you are not 1st language English.
> +In addition, accesses to "volatile" memory locations and volatile asm
> +statements act as implicit compiler barriers.
Add
The use of volatile generates poorer code and hides the serialization in
type declarations that may be far from the code. The Linux coding style therefore
strongly favours the use of explicit barriers except in small and specific cases.
> +SMP memory barriers are only compiler barriers on uniprocessor compiled systems
> +because it is assumed that a CPU will be apparently self-consistent, and will
> +order overlapping accesses correctly with respect to itself.
Is this true of IA-64 ??
> +There is no guarantee that some intervening piece of off-the-CPU hardware will
> +not reorder the memory accesses. CPU cache coherency mechanisms should
> +propagate the indirect effects of a memory barrier between CPUs.
[For information on bus mastering DMA and coherency please read ....]
since we have a doc on this
> +There are some more advanced barriering functions:
"barriering" ... ick, barrier.
> +LOCKING FUNCTIONS
> +-----------------
> +
> +For instance all the following locking functions imply barriers:
s/For instance//
> + (*) spin locks
> + (*) R/W spin locks
> + (*) mutexes
> + (*) semaphores
> + (*) R/W semaphores
> +
> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> +
> + (*) LOCK operation implication:
> +
> + Memory accesses issued after the LOCK will be completed after the LOCK
> + accesses have completed.
> +
> + Memory accesses issued before the LOCK may be completed after the LOCK
> + accesses have completed.
> +
> + (*) UNLOCK operation implication:
> +
> + Memory accesses issued before the UNLOCK will be completed before the
> + UNLOCK accesses have completed.
> +
> + Memory accesses issued after the UNLOCK may be completed before the UNLOCK
> + accesses have completed.
> +
> + (*) LOCK vs UNLOCK implication:
> +
> + The LOCK accesses will be completed before the UNLOCK accesses.
> +
> +And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
> +a LOCK followed by an UNLOCK isn't.
> +
> +Locks and semaphores may not provide any guarantee of ordering on UP compiled
> +systems, and so can't be counted on in such a situation to actually do anything
> +at all, especially with respect to I/O barriering, unless combined with
> +interrupt disablement operations.
s/disablement/disabling/
Should clarify local ordering v SMP ordering for locks implied here.
> +INTERRUPT DISABLEMENT FUNCTIONS
> +-------------------------------
s/Disablement/Disabling/
> +Interrupt disablement (LOCK equivalent) and enablement (UNLOCK equivalent) will
disable
> +===========================
> +LINUX KERNEL I/O BARRIERING
/barriering/barriers
> + (*) inX(), outX():
> +
> + These are intended to talk to legacy i386 hardware using an alternate bus
> + addressing mode. They are synchronous as far as the x86 CPUs are
Not really true. Lots of PCI devices use them. Need to talk about "I/O space"
> + concerned, but other CPUs and intermediary bridges may not honour that.
> +
> + They are guaranteed to be fully ordered with respect to each other.
And make clear I/O space is a CPU property and that inX()/outX() may well map
to read/write variant functions on many processors
> + (*) readX(), writeX():
> +
> + These are guaranteed to be fully ordered and uncombined with respect to
> + each other on the issuing CPU, provided they're not accessing a
MTRRs
> + prefetchable device. However, intermediary hardware (such as a PCI
> + bridge) may indulge in deferral if it so wishes; to flush a write, a read
> + from the same location must be performed.
False. It's not so tightly restricted, and on many devices the location you
write to is not safe to read, so you must use another. I'd have to dig the
PCI spec out, but I believe it says the same devfn. It also says stuff about
rules for visibility of bus mastering relative to these accesses, and PCI
config space accesses relative to the lot (the latter several chipsets get
wrong). We should probably point people at the PCI 2.2 spec.
Looks much much better than the first version and just goes to prove how complex
this all is
Robert Hancock <[email protected]> writes:
> Alan Cox wrote:
> > On Maw, 2006-03-07 at 22:24 +0100, Andi Kleen wrote:
> >>> But on most arches those accesses do indeed seem to happen in-order. On
> >>> i386 and x86_64, it's a natural consequence of program store ordering.
> >> Not true for reads on x86.
> > You must have a strange kernel Andi. Mine marks them as volatile
> > unsigned char * references.
>
> Well, that and the fact that IO memory should be mapped as uncacheable
> in the MTRRs should ensure that readl and writel won't be reordered on
> i386 and x86_64.. except in the case where CONFIG_UNORDERED_IO is
> enabled on x86_64 which can reorder writes since it uses nontemporal
> stores..
CONFIG_UNORDERED_IO is a failed experiment. I just removed it.
-Andi
On Wed, Mar 08, 2006 at 09:55:06AM -0500, Alan Cox wrote:
> On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote:
> > + (*) reads can be done speculatively, and then the result discarded should it
> > + prove not to be required;
>
> That might be worth an example with an if() because PPC will do this and if
> its a read with a side effect (eg I/O space) you get singed..
PPC does speculative memory accesses to IO? Are you *sure*?
> > +same set of data, but attempting not to use locks as locks are quite expensive.
>
> s/are quite/is quite
>
> and is quite confusing to read
His grammar's right ... but I'd just leave out the 'as' part. As
you're right that it's confusing ;-)
> > +SMP memory barriers are normally mere compiler barriers on a UP system because
>
> s/mere//
>
> Makes it easier to read if you are not 1st language English.
Maybe s/mere/only/?
> > +SMP memory barriers are only compiler barriers on uniprocessor compiled systems
> > +because it is assumed that a CPU will be apparently self-consistent, and will
> > +order overlapping accesses correctly with respect to itself.
>
> Is this true of IA-64 ??
Yes:
#else
# define smp_mb() barrier()
# define smp_rmb() barrier()
# define smp_wmb() barrier()
# define smp_read_barrier_depends() do { } while(0)
#endif
> > + (*) inX(), outX():
> > +
> > + These are intended to talk to legacy i386 hardware using an alternate bus
> > + addressing mode. They are synchronous as far as the x86 CPUs are
>
> Not really true. Lots of PCI devices use them. Need to talk about "I/O space"
Port space is deprecated though. PCI 2.3 says:
"Devices are recommended always to map control functions into Memory Space."
> > +
> > + These are guaranteed to be fully ordered and uncombined with respect to
> > + each other on the issuing CPU, provided they're not accessing a
>
> MTRRs
>
> > + prefetchable device. However, intermediary hardware (such as a PCI
> > + bridge) may indulge in deferral if it so wishes; to flush a write, a read
> > + from the same location must be performed.
>
> False. It's not so tightly restricted, and on many devices the location you
> write to is not safe to read, so you must use another. I'd have to dig the
> PCI spec out, but I believe it says the same devfn. It also says stuff about
> rules for visibility of bus mastering relative to these accesses, and PCI
> config space accesses relative to the lot (the latter several chipsets get
> wrong). We should probably point people at the PCI 2.2 spec.
3.2.5 of PCI 2.3 seems most relevant:
Since memory write transactions may be posted in bridges anywhere
in the system, and I/O writes may be posted in the host bus bridge,
a master cannot automatically tell when its write transaction completes
at the final destination. For a device driver to guarantee that a write
has completed at the actual target (and not at an intermediate bridge),
it must complete a read to the same device that the write targeted. The
read (memory or I/O) forces all bridges between the originating master
and the actual target to flush all posted data before allowing the
read to complete. For additional details on device drivers, refer to
Section 6.5. Refer to Section 3.10., item 6, for other cases where a
read is necessary.
Appendix E is also of interest:
2. Memory writes can be posted in both directions in a bridge. I/O and
Configuration writes are not posted. (I/O writes can be posted in the
Host Bridge, but some restrictions apply.) Read transactions (Memory,
I/O, or Configuration) are not posted.
5. A read transaction must push ahead of it through the bridge any posted
writes originating on the same side of the bridge and posted before
the read. Before the read transaction can complete on its originating
bus, it must pull out of the bridge any posted writes that originated
on the opposite side and were posted before the read command completes
on the read-destination bus.
I like the way they contradict each other slightly wrt config reads and
whether you have to read from the same device, or merely the same bus.
One thing that is clear is that a read of a status register on the bridge
isn't enough, it needs to be *through* the bridge, not *to* the bridge.
I wonder if a config read of a non-existent device on the other side of
the bridge would force the write to complete ...
You need to explain the difference between compiler reordering - the
compiler's arrangement of loads and stores - and the CPU's reordering of
stores and loads. Note that IA64 has a much more complete set of means to
reorder stores and loads. i386 and x86_64 processors can only do limited
reordering. So it may make sense to deal with general reordering and then
explain i386 as a specific limited case.
See the "Intel Itanium Architecture Software Developer's Manual"
(available from intels website). Look at Volume 1 section 2.6
"Speculation" and 4.4 "Memory Access"
Also the specific barrier functions of various locking elements vary to
some extent.
On Wed, 2006-03-08 at 12:34 +0000, David Howells wrote:
> On i386 and x86_64, do IN and OUT instructions imply MFENCE?
No.
Alan Cox <[email protected]> wrote:
> [For information on bus mastering DMA and coherency please read ....]
>
> since we have a doc on this
Documentation/pci.txt?
> The use of volatile generates poorer code and hides the serialization in
> type declarations that may be far from the code.
I'm not sure what you mean by that.
> Is this true of IA-64 ??
Are you referring to non-temporal loads and stores?
> > +There are some more advanced barriering functions:
>
> "barriering" ... ick, barrier.
Picky:-)
> Should clarify local ordering v SMP ordering for locks implied here.
Do you mean explain what each sort of lock does?
> > + (*) inX(), outX():
> > +
> > + These are intended to talk to legacy i386 hardware using an alternate bus
> > + addressing mode. They are synchronous as far as the x86 CPUs are
>
> Not really true. Lots of PCI devices use them. Need to talk about "I/O space"
Which bit is not really true?
David
Matthew Wilcox <[email protected]> wrote:
> > That might be worth an example with an if() because PPC will do this and
> > if its a read with a side effect (eg I/O space) you get singed..
>
> PPC does speculative memory accesses to IO? Are you *sure*?
Can you do speculative reads from frame buffers?
> # define smp_read_barrier_depends() do { } while(0)
What's this one meant to do?
> Port space is deprecated though. PCI 2.3 says:
That's sort of irrelevant here. I still need to document the
interaction.
> Since memory write transactions may be posted in bridges anywhere
> in the system, and I/O writes may be posted in the host bus bridge,
I'm not sure whether this is beyond the scope of this document. Maybe the
document's scope needs to be expanded.
David
Christoph Lameter <[email protected]> wrote:
> You need to explain the difference between compiler reordering - the
> compiler's arrangement of loads and stores - and the CPU's reordering of
> stores and loads.
Hmmm... I would hope people looking at this doc would understand that, but
I'll see what I can come up with.
> Note that IA64 has a much more complete set of means to reorder stores and
> loads. i386 and x86_64 processors can only do limited reordering. So it may
> make sense to deal with general reordering and then explain i386 as a
> specific limited case.
Don't you need to use sacrifice_goat() for controlling the IA64? :-)
Besides, I'm not sure that I need to explain that any CPU is a limited case;
I'm primarily trying to define the basic minimal guarantees you can expect
from using a memory barrier, and what might happen if you don't. It shouldn't
matter which arch you're dealing with, especially if you're writing a driver.
I tried to create arch-specific sections for describing arch-specific implicit
barriers and the extent of the explicit memory barriers on each arch, but the
i386 section was generating so many exceptions that it looked infeasible to
describe them; besides, you aren't allowed to rely on such features outside of
arch code (I count arch-specific drivers as "arch code" for this).
> See the "Intel Itanium Architecture Software Developer's Manual"
> (available from intels website). Look at Volume 1 section 2.6
> "Speculation" and 4.4 "Memory Access"
I've added that to the refs, thanks.
> Also the specific barrier functions of various locking elements vary to
> some extent.
Please elaborate.
David
On Wed, Mar 08, 2006 at 05:04:51PM +0000, David Howells wrote:
> > [For information on bus mastering DMA and coherency please read ....]
> > since we have a doc on this
>
> Documentation/pci.txt?
and:
Documentation/DMA-mapping.txt
Documentation/DMA-API.txt
>
> > The use of volatile generates poorer code and hides the serialization in
> > type declarations that may be far from the code.
>
> I'm not sure what you mean by that.
in foo.h:

	struct blah {
		volatile int x;	/* need serialization */
		int y;
	};

2 million miles away:

	blah.x = 1;
	blah.y = 4;
And you've no idea that it's magically serialized due to a type declaration
in a header you've never read. Hence the "don't use volatile" rule.
> > Is this true of IA-64 ??
>
> Are you referring to non-temporal loads and stores?
Yep. But Matthew answered that
> > Should clarify local ordering v SMP ordering for locks implied here.
>
> Do you mean explain what each sort of lock does?
spin_unlock ensures that local CPU writes before the lock are visible
to all processors before the lock is dropped but it has no effect on
I/O ordering. Just a need for clarity.
> > > + (*) inX(), outX():
> > > +
> > > + These are intended to talk to legacy i386 hardware using an alternate bus
> > > + addressing mode. They are synchronous as far as the x86 CPUs are
> >
> > Not really true. Lots of PCI devices use them. Need to talk about "I/O space"
>
> Which bit is not really true?
The "legacy i386 hardware" bit. Many processors have an I/O space.
On Wed, 8 Mar 2006, David Howells wrote:
> Hmmm... I would hope people looking at this doc would understand that, but
> I'll see what I can come up with.
>
> > Note that IA64 has a much more complete set of means to reorder stores and
> > loads. i386 and x86_64 processors can only do limited reordering. So it may
> > make sense to deal with general reordering and then explain i386 as a
> > specific limited case.
>
> Don't you need to use sacrifice_goat() for controlling the IA64? :-)
Likely...
> Besides, I'm not sure that I need to explain that any CPU is a limited case;
> I'm primarily trying to define the basic minimal guarantees you can expect
> from using a memory barrier, and what might happen if you don't. It shouldn't
> matter which arch you're dealing with, especially if you're writing a driver.
memory barrier functions have to be targeted to the processor with the
ability to do the widest amount of reordering. This is the Itanium AFAIK.
> I tried to create arch-specific sections for describing arch-specific implicit
> barriers and the extent of the explicit memory barriers on each arch, but the
> i386 section was generating lots of exceptions that it looked infeasible to
> describe them; besides, you aren't allowed to rely on such features outside of
> arch code (I count arch-specific drivers as "arch code" for this).
i386 does not fully implement things like write barriers since they have
an implicit ordering of stores.
> > Also the specific barrier functions of various locking elements vary to
> > some extent.
>
> Please elaborate.
F.e. spin_unlock has "release" semantics on IA64. That means that prior
write accesses are visible before the store, read accesses are also
completed before the store. However, the processor may perform later read
and write accesses before the results of the store become visible.
> i386 does not fully implement things like write barriers since they have
> an implicit ordering of stores.
Except when they don't (PPro errata cases, and the explicit support for
this in the IDT Winchip)
Alan Cox <[email protected]> wrote:
> spin_unlock ensures that local CPU writes before the lock are visible
> to all processors before the lock is dropped but it has no effect on
> I/O ordering. Just a need for clarity.
So I can't use spinlocks in my driver to make sure two different CPUs don't
interfere with each other when trying to communicate with a device because the
spinlocks don't guarantee that I/O operations will stay in effect within the
locking section?
David
On Wed, Mar 08, 2006 at 06:35:07PM +0000, David Howells wrote:
> Alan Cox <[email protected]> wrote:
>
> > spin_unlock ensures that local CPU writes before the lock are visible
> > to all processors before the lock is dropped but it has no effect on
> > I/O ordering. Just a need for clarity.
>
> So I can't use spinlocks in my driver to make sure two different CPUs don't
> interfere with each other when trying to communicate with a device because the
> spinlocks don't guarantee that I/O operations will stay in effect within the
> locking section?
If you have
CPU #0
spin_lock(&foo->lock)
writel(0, &foo->regnum)
writel(1, &foo->data);
spin_unlock(&foo->lock);
CPU #1
spin_lock(&foo->lock);
writel(4, &foo->regnum);
writel(5, &foo->data);
spin_unlock(&foo->lock);
then on some NUMA infrastructures the order may not be as you expect. The
CPU will execute writel 0, writel 1 and the second CPU later will execute
writel 4 writel 5, but the order they hit the PCI bridge may not be the
same order. Usually such things don't matter but in a register windowed
case getting 0/4/1/5 might be rather unfortunate.
See Documentation/DocBook/deviceiobook.tmpl (or its output)
The following case is safe
spin_lock(&foo->lock);
writel(0, &foo->regnum);
reg = readl(&foo->data);
spin_unlock(&foo->lock);
as the read must complete and it forces the write to complete. The pure write
case used above should be implemented as
spin_lock(&foo->lock);
writel(0, &foo->regnum);
writel(1, &foo->data);
mmiowb();
spin_unlock(&foo->lock);
The mmiowb() ensures that the writels will occur before the writels from
another CPU that then takes the lock and issues its own writels.
Welcome to the wonderful world of NUMA
Alan
Alan Cox <[email protected]> wrote:
> then on some NUMA infrastructures the order may not be as you expect.
Oh, yuck!
Okay... does NUMA guarantee the same for ordinary memory accesses inside the
critical section?
David
On Wednesday 08 March 2006 19:59, David Howells wrote:
> Alan Cox <[email protected]> wrote:
>
> > then on some NUMA infrastructures the order may not be as you expect.
>
> Oh, yuck!
>
> Okay... does NUMA guarantee the same for ordinary memory accesses inside the
> critical section?
If you use barriers the ordering should be the same on cc/NUMA vs SMP.
Otherwise it wouldn't be "cc"
But it might be quite unfair.
-Andi
Alan Cox <[email protected]> wrote:
> spin_lock(&foo->lock);
> writel(0, &foo->regnum);
I presume there only needs to be an mmiowb() here if you've got the
appropriate CPU's I/O memory window set up to be weakly ordered.
> writel(1, &foo->data);
> mmiowb();
> spin_unlock(&foo->lock);
David
On Wed, 8 Mar 2006, David Howells wrote:
> Alan Cox <[email protected]> wrote:
>
> > spin_lock(&foo->lock);
> > writel(0, &foo->regnum);
>
> I presume there only needs to be an mmiowb() here if you've got the
> appropriate CPU's I/O memory window set up to be weakly ordered.
Actually, since the different NUMA things may have different paths to the
PCI thing, I don't think even the mmiowb() will really help. It has
nothing to serialize _with_.
It only orders mmio from within _one_ CPU and "path" to the destination.
The IO might be posted somewhere on a PCI bridge, and depending on the
posting rules, the mmiowb() just isn't relevant for IO coming through
another path.
Of course, to get into that deep doo-doo, your IO fabric must be separate
from the memory fabric, and the hardware must be pretty special, I think.
So for example, if you are using an Opteron with its NUMA memory setup
between CPU's over HT links, from an _IO_ standpoint it's not really
anything strange, since it uses the same fabric for memory coherency and
IO coherency, and from an IO ordering standpoint it's just normal SMP.
But if you have a separate IO fabric and basically two different CPU's can
get to one device through two different paths, no amount of write barriers
of any kind will ever help you.
So in the really general case, it's still basically true that the _only_
thing that serializes a MMIO write to a device is a _read_ from that
device, since then the _device_ ends up being the serialization point.
So in the extreme case, you literally have to do a read from the device
before you release the spinlock, if ordering to the device from two
different CPU's matters to you. The IO paths simply may not be
serializable with the normal memory paths, so spinlocks have absolutely
_zero_ ordering capability, and a write barrier on either the normal
memory side or the IO side doesn't affect anything.
Now, I'm by no means claiming that we necessarily get this right in
general, or even very commonly. The undeniable fact is that "big NUMA"
machines need to validate the drivers they use separately. The fact that
it works on a normal PC - and that it's been tested to death there - does
not guarantee much anything.
The good news, of course, is that you don't use that kind of "big NUMA"
system the same way you'd use a regular desktop SMP. You don't plug in
random devices into it and just expect them to work. I'd hope ;)
Linus
Linus Torvalds <[email protected]> wrote:
> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.
On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those
do inter-component synchronisation.
David
The attached patch documents the Linux kernel's memory barriers.
I've updated it from the comments I've been given.
Note that the per-arch notes sections are gone because it's clear that there
are so many exceptions that it's not worth having them.
I've added a list of references to other documents.
I've tried to get rid of the concept of memory accesses appearing on the bus;
what matters is apparent behaviour with respect to other observers in the
system.
I'm not sure that any mention of interrupts vs interrupt disablement should be
retained... it's unclear that there is actually anything that guarantees that
stuff won't leak out of an interrupt-disabled section and into an interrupt
handler. Paul Mackerras says this isn't valid on powerpc, and looking at the
code seems to confirm that, barring implicit enforcement by the CPU.
There's also some uncertainty with respect to spinlocks vs I/O accesses on
NUMA.
Signed-Off-By: David Howells <[email protected]>
---
warthog>diffstat -p1 /tmp/mb.diff
Documentation/memory-barriers.txt | 781 ++++++++++++++++++++++++++++++++++++++
1 files changed, 781 insertions(+)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
new file mode 100644
index 0000000..6eeb7e4
--- /dev/null
+++ b/Documentation/memory-barriers.txt
@@ -0,0 +1,781 @@
+ ============================
+ LINUX KERNEL MEMORY BARRIERS
+ ============================
+
+Contents:
+
+ (*) What are memory barriers?
+
+ (*) Where are memory barriers needed?
+
+ - Accessing devices.
+ - Multiprocessor interaction.
+ - Interrupts.
+
+ (*) Explicit kernel compiler barriers.
+
+ (*) Explicit kernel memory barriers.
+
+ (*) Implicit kernel memory barriers.
+
+ - Locking functions.
+ - Interrupt disabling functions.
+ - Miscellaneous functions.
+
+ (*) Inter-CPU locking barrier effects.
+
+ - Locks vs memory accesses.
+ - Locks vs I/O accesses.
+
+ (*) Kernel I/O barrier effects.
+
+ (*) References.
+
+
+=========================
+WHAT ARE MEMORY BARRIERS?
+=========================
+
+Memory barriers are instructions to both the compiler and the CPU to impose an
+apparent partial ordering between the memory access operations specified either
+side of the barrier. They request that the sequence of memory events generated
+appears to other components of the system as if the barrier is effective on
+that CPU.
+
+Note that:
+
+ (*) there's no guarantee that the sequence of memory events is _actually_ so
+ ordered. It's possible for the CPU to do out-of-order accesses _as long
+ as no-one is looking_, and then fix up the memory if someone else tries to
+ see what's going on (for instance a bus master device); what matters is
+ the _apparent_ order as far as other processors and devices are concerned;
+ and
+
+ (*) memory barriers are only guaranteed to act within the CPU processing them,
+ and are not, for the most part, guaranteed to percolate down to other CPUs
+ in the system or to any I/O hardware that that CPU may communicate with.
+
+
+For example, a programmer might take it for granted that the CPU will perform
+memory accesses in exactly the order specified, so that if a CPU is, for
+example, given the following piece of code:
+
+ a = *A;
+ *B = b;
+ c = *C;
+ d = *D;
+ *E = e;
+
+They would then expect that the CPU will complete the memory access for each
+instruction before moving on to the next one, leading to a definite sequence of
+operations as seen by external observers in the system:
+
+ read *A, write *B, read *C, read *D, write *E.
+
+
+Reality is, of course, much messier. With many CPUs and compilers, this isn't
+always true because:
+
+ (*) reads are more likely to need to be completed immediately to permit
+ execution progress, whereas writes can often be deferred without a
+ problem;
+
+ (*) reads can be done speculatively, and then the result discarded should it
+     prove not to be required (see the example below);
+
+ (*) the order of the memory accesses may be rearranged to promote better use
+ of the CPU buses and caches;
+
+ (*) reads and writes may be combined to improve performance when talking to
+ the memory or I/O hardware that can do batched accesses of adjacent
+ locations, thus cutting down on transaction setup costs (memory and PCI
+ devices may be able to do this); and
+
+ (*) the CPU's data cache may affect the ordering, though cache-coherency
+ mechanisms should alleviate this - once the write has actually hit the
+ cache.
+
+So what another CPU, say, might actually observe from the above piece of code
+is:
+
+ read *A, read {*C,*D}, write *E, write *B
+
+ (By "read {*C,*D}" I mean a combined single read).
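+
+As an illustration of speculation (a hypothetical fragment, not taken from the
+kernel), given:
+
+	q = *flag;
+	if (q)
+		d = *data;
+
+the CPU may elect to read *data before it reads *flag, discarding the result
+should *flag prove to be zero. That is harmless for ordinary memory, but not
+if *data refers to an I/O location where reads have side effects.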
+
+
+It is also guaranteed that a CPU will be self-consistent: it will see its _own_
+accesses appear to be correctly ordered, without the need for a memory
+barrier. For instance with the following code:
+
+ X = *A;
+ *A = Y;
+ Z = *A;
+
+assuming no intervention by an external influence, it can be taken that:
+
+ (*) X will hold the old value of *A; the read will never appear to happen
+     after the write, and thus X will never end up being given the value that
+     was assigned to *A from Y; and
+
+ (*) Z will always be given the value that was assigned to *A from Y; the read
+     will never appear to happen before the write, and thus Z will never end
+     up with the value that was in *A initially.
+
+(This is ignoring the fact that the value initially in *A may appear to be the
+same as the value assigned to *A from Y).
+
+
+=================================
+WHERE ARE MEMORY BARRIERS NEEDED?
+=================================
+
+Under normal operation, access reordering is probably not going to be a problem
+as a linear program will still appear to operate correctly. There are,
+however, three circumstances where reordering definitely _could_ be a problem:
+
+
+ACCESSING DEVICES
+-----------------
+
+Many devices can be memory mapped, and so appear to the CPU as if they're just
+memory locations. However, to control the device, the driver has to make the
+right accesses in exactly the right order.
+
+Consider, for example, an ethernet chipset such as the AMD PCnet32. It
+presents to the CPU an "address register" and a bunch of "data registers". The
+way it's accessed is to write the index of the internal register to be accessed
+to the address register, and then read or write the appropriate data register
+to access the chip's internal register, which could - theoretically - be done
+by:
+
+ *ADR = ctl_reg_3;
+ reg = *DATA;
+
+The problem with a clever CPU or a clever compiler is that the write to the
+address register isn't guaranteed to happen before the access to the data
+register, if the CPU or the compiler thinks it is more efficient to defer the
+address write:
+
+ read *DATA, write *ADR
+
+then things will break.
+
+
+In the Linux kernel, however, I/O should be done through the appropriate
+accessor routines - such as inb() or writel() - which know how to make such
+accesses appropriately sequential.
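+
+For instance, the chip access above might be rendered as follows (a sketch
+only; ADR and DATA stand for the kernel-mapped addresses of the two
+registers):
+
+	writel(ctl_reg_3, ADR);
+	reg = readl(DATA);
+
+with the accessor routines supplying whatever ordering the issuing CPU
+requires.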
+
+On some systems, I/O writes are not strongly ordered across all CPUs, and so
+locking should be used, and mmiowb() should be issued prior to unlocking the
+critical section.
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+MULTIPROCESSOR INTERACTION
+--------------------------
+
+When there's a system with more than one processor, the CPUs in the system may
+be working on the same set of data at the same time. This can cause
+synchronisation problems, and the usual way of dealing with them is to use
+locks - but locks are quite expensive, and so it may be preferable to operate
+without the use of a lock if at all possible. In such a case accesses that
+affect both CPUs may have to be carefully ordered to prevent error.
+
+Consider the R/W semaphore slow path. In that, a waiting process is queued on
+the semaphore, as noted by it having a record on its stack linked to the
+semaphore's list:
+
+ struct rw_semaphore {
+ ...
+ struct list_head waiters;
+ };
+
+ struct rwsem_waiter {
+ struct list_head list;
+ struct task_struct *task;
+ };
+
+To wake up the waiter, the up_read() or up_write() functions have to read the
+pointer from this record to know where the next waiter record is, clear
+the task pointer, call wake_up_process() on the task, and release the reference
+held on the waiter's task struct:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+If any of these steps occur out of order, then the whole thing may fail.
+
+Note that the waiter does not get the semaphore lock again - it just waits for
+its task pointer to be cleared. Since the record is on its stack, this means
+that if the task pointer is cleared _before_ the next pointer in the list is
+read, another CPU might start processing the waiter and might clobber the
+waiter's stack before the up*() function has a chance to read the next
+pointer.
+
+ CPU 0 CPU 1
+ =============================== ===============================
+ down_xxx()
+ Queue waiter
+ Sleep
+ up_yyy()
+ READ waiter->task;
+ WRITE waiter->task;
+ <preempt>
+ Resume processing
+ down_xxx() returns
+ call foo()
+ foo() clobbers *waiter
+ </preempt>
+ READ waiter->list.next;
+ --- OOPS ---
+
+This could be dealt with using a spinlock, but then the down_xxx() function has
+to get the spinlock again after it's been woken up, which is a waste of
+resources.
+
+The way to deal with this is to insert an SMP memory barrier:
+
+ READ waiter->list.next;
+ READ waiter->task;
+ smp_mb();
+ WRITE waiter->task;
+ CALL wakeup
+ RELEASE task
+
+In this case, the barrier makes a guarantee that all memory accesses before the
+barrier will appear to happen before all the memory accesses after the barrier
+with respect to the other CPUs on the system. It does _not_ guarantee that all
+the memory accesses before the barrier will be complete by the time the barrier
+itself is complete.
+
+SMP memory barriers are normally nothing more than compiler barriers on a
+kernel compiled for a UP system because the CPU orders overlapping accesses
+with respect to itself, and so CPU barriers aren't needed.
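+
+On a kernel compiled for UP, for instance, the SMP barriers may simply be
+defined down to compiler barriers:
+
+	#define smp_mb()	barrier()
+	#define smp_rmb()	barrier()
+	#define smp_wmb()	barrier()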
+
+
+INTERRUPTS
+----------
+
+A driver may be interrupted by its own interrupt service routine, and thus the
+two may interfere with each other's attempts to control or access the device.
+
+This may be alleviated - at least in part - by disabling interrupts (a form of
+locking), such that the critical operations are all contained within the
+interrupt-disabled section in the driver. Whilst the driver's interrupt
+routine is executing, the driver's core may not run on the same CPU, and its
+interrupt is not permitted to happen again until the current interrupt has been
+handled, thus the interrupt handler does not need to lock against that.
+
+
+However, consider the following example:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+ [A is 0 and B is 0]
+ DISABLE IRQ
+ *A = 1;
+ smp_wmb();
+ *B = 2;
+ ENABLE IRQ
+ <interrupt>
+ *A = 3
+ a = *A;
+ b = *B;
+ smp_wmb();
+ *B = 4;
+ </interrupt>
+
+CPU 2 might see *A == 3 and *B == 0, when what it probably ought to see is *B
+== 2 and *A == 1 or *A == 3, or *B == 4 and *A == 3.
+
+This might happen because the write "*B = 2" might occur after the write "*A =
+3" - in which case the former write has leaked from the interrupt-disabled
+section into the interrupt handler. In this case a lock of some description
+should very probably be used.
+
+
+This sort of problem might also occur with relaxed I/O ordering rules, if it's
+permitted for I/O writes to cross. For instance, if a driver was talking to an
+ethernet card that sports an address register and a data register:
+
+ DISABLE IRQ
+	writew(ctl_reg_3, ADR);
+	writew(y, DATA);
+ ENABLE IRQ
+ <interrupt>
+	writew(ctl_reg_4, ADR);
+ q = readw(DATA);
+ </interrupt>
+
+In such a case, an mmiowb() is needed, firstly to prevent the first write to
+the address register from occurring after the write to the data register, and
+secondly to prevent the write to the data register from happening after the
+second write to the address register.
+
+
+=================================
+EXPLICIT KERNEL COMPILER BARRIERS
+=================================
+
+The Linux kernel has an explicit compiler barrier function that prevents the
+compiler from moving the memory accesses either side of it to the other side:
+
+ barrier();
+
+This has no direct effect on the CPU, which may then reorder things however it
+wishes.
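+
+For instance (an illustrative sketch):
+
+	a = *X;
+	barrier();
+	b = *Y;
+
+Here the compiler may not move the read of *Y before the read of *X, but the
+CPU remains free to perform the two reads in either order.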
+
+
+In addition, accesses to "volatile" memory locations and volatile asm
+statements act as implicit compiler barriers. Note, however, that the use of
+volatile has two negative consequences:
+
+ (1) it causes the generation of poorer code, and
+
+ (2) it can affect serialisation of events in code distant from the declaration
+ (consider a structure defined in a header file that has a volatile member
+ being accessed by the code in a source file).
+
+The Linux coding style therefore strongly favours the use of explicit barriers
+except in small and specific cases. In general, volatile should be avoided.
+
+
+===============================
+EXPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+The Linux kernel has six basic CPU memory barriers:
+
+ MANDATORY SMP CONDITIONAL
+ =============== ===============
+ GENERAL mb() smp_mb()
+ READ rmb() smp_rmb()
+ WRITE wmb() smp_wmb()
+
+General memory barriers give a guarantee that all memory accesses specified
+before the barrier will appear to happen before all memory accesses specified
+after the barrier with respect to the other components of the system.
+
+Read and write memory barriers give similar guarantees, but only for memory
+reads versus memory reads and memory writes versus memory writes respectively.
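+
+As a sketch of how the read and write barriers might be paired (illustrative
+only; *data and *ready are hypothetical locations, both initially zero):
+
+	CPU 1				CPU 2
+	===============================	===============================
+	*data = 1;
+	smp_wmb();
+	*ready = 1;
+					while (*ready == 0) {}
+					smp_rmb();
+					d = *data;
+
+The write barrier makes the write to *data appear to happen before the write
+to *ready, and the read barrier makes the read of *ready appear to happen
+before the read of *data; in combination they guarantee that CPU 2 sees d == 1.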
+
+All memory barriers imply compiler barriers.
+
+SMP memory barriers are only compiler barriers on uniprocessor compiled systems
+because it is assumed that a CPU will be apparently self-consistent, and will
+order overlapping accesses correctly with respect to itself.
+
+There is no guarantee that any of the memory accesses specified before a memory
+barrier will be complete by the completion of a memory barrier; the barrier can
+be considered to draw a line in that CPU's access queue that accesses of the
+appropriate type may not cross.
+
+There is no guarantee that issuing a memory barrier on one CPU will have any
+direct effect on another CPU or any other hardware in the system. The indirect
+effect will be the order in which the second CPU sees the first CPU's accesses
+occur.
+
+There is no guarantee that some intervening piece of off-the-CPU hardware[*]
+will not reorder the memory accesses. CPU cache coherency mechanisms should
+propagate the indirect effects of a memory barrier between CPUs.
+
+ [*] For information on bus mastering DMA and coherency please read:
+
+ Documentation/pci.txt
+ Documentation/DMA-mapping.txt
+ Documentation/DMA-API.txt
+
+Note that these are the _minimum_ guarantees. Different architectures may give
+more substantial guarantees, but they may not be relied upon outside of arch
+specific code.
+
+
+There are some more advanced barrier functions:
+
+ (*) set_mb(var, value)
+ (*) set_wmb(var, value)
+
+ These assign the value to the variable and then insert at least a write
+ barrier after it, depending on the function. They aren't guaranteed to
+ insert anything more than a compiler barrier in a UP compilation.
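+
+     As a sketch only - the exact expansion varies between architectures -
+     these might be defined as:
+
+	#define set_mb(var, value)  do { var = value; mb();  } while (0)
+	#define set_wmb(var, value) do { var = value; wmb(); } while (0)
+
+     (i386, for instance, implements set_mb() with an atomic xchg() instead.)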
+
+
+===============================
+IMPLICIT KERNEL MEMORY BARRIERS
+===============================
+
+Some of the other functions in the Linux kernel imply memory barriers; amongst
+them are locking, scheduling and interrupt management functions.
+
+This specification is a _minimum_ guarantee; any particular architecture may
+provide more substantial guarantees, but these may not be relied upon outside
+of arch specific code.
+
+
+LOCKING FUNCTIONS
+-----------------
+
+All the following locking functions imply barriers:
+
+ (*) spin locks
+ (*) R/W spin locks
+ (*) mutexes
+ (*) semaphores
+ (*) R/W semaphores
+
+In all cases there are variants on a LOCK operation and an UNLOCK operation.
+
+ (*) LOCK operation implication:
+
+ Memory accesses issued after the LOCK will be completed after the LOCK
+ accesses have completed.
+
+ Memory accesses issued before the LOCK may be completed after the LOCK
+ accesses have completed.
+
+ (*) UNLOCK operation implication:
+
+ Memory accesses issued before the UNLOCK will be completed before the
+ UNLOCK accesses have completed.
+
+ Memory accesses issued after the UNLOCK may be completed before the UNLOCK
+ accesses have completed.
+
+ (*) LOCK vs UNLOCK implication:
+
+ The LOCK accesses will be completed before the UNLOCK accesses.
+
+And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
+a LOCK followed by an UNLOCK isn't.
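+
+For example (a sketch):
+
+	*A = a;
+	UNLOCK
+	LOCK
+	*B = b;
+
+Here the write to *A will always appear to happen before the write to *B: the
+access to *A can't move forwards past the UNLOCK, and the access to *B can't
+move backwards past the LOCK.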
+
+Locks and semaphores may not provide any guarantee of ordering on UP compiled
+systems, and so can't be counted on in such a situation to actually do anything
+at all, especially with respect to I/O accesses, unless combined with interrupt
+disabling operations.
+
+See also the section on "Inter-CPU locking barrier effects".
+
+
+As an example, consider the following:
+
+ *A = a;
+ *B = b;
+ LOCK
+ *C = c;
+ *D = d;
+ UNLOCK
+ *E = e;
+ *F = f;
+
+The following sequence of events is acceptable:
+
+ LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+
+But none of the following are:
+
+ {*F,*A}, *B, LOCK, *C, *D, UNLOCK, *E
+ *A, *B, *C, LOCK, *D, UNLOCK, *E, *F
+ *A, *B, LOCK, *C, UNLOCK, *D, *E, *F
+ *B, LOCK, *C, *D, UNLOCK, {*F,*A}, *E
+
+
+INTERRUPT DISABLING FUNCTIONS
+-----------------------------
+
+Functions that disable interrupts (LOCK equivalent) and enable interrupts
+(UNLOCK equivalent) will barrier memory and I/O accesses versus memory and I/O
+accesses done in the interrupt handler. This prevents an interrupt routine
+interfering with accesses made in an interrupt-disabled section of code and vice
+versa.
+
+Note that whilst disabling or enabling interrupts acts as a compiler barrier
+under all circumstances, these operations only act as memory barriers with
+respect to interrupts, not with respect to nested sections.
+
+Consider the following:
+
+ <interrupt>
+ *X = x;
+ </interrupt>
+ *A = a;
+ SAVE IRQ AND DISABLE
+ *B = b;
+ SAVE IRQ AND DISABLE
+ *C = c;
+ RESTORE IRQ
+ *D = d;
+ RESTORE IRQ
+ *E = e;
+ <interrupt>
+ *Y = y;
+ </interrupt>
+
+It is acceptable to observe the following sequences of events:
+
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, *D, REST, { INT, *Y, *E }
+ { INT, *X }, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+ { INT }, *X, SAVE, SAVE, *A, *B, *C, *D, *E, REST, REST, { INT, *Y }
+ { INT }, *A, *X, SAVE, SAVE, *B, *C, *D, *E, REST, REST, { INT, *Y }
+
+But not the following:
+
+ { INT }, SAVE, *A, *B, *X, SAVE, *C, REST, *D, REST, *E, { INT, *Y }
+ { INT, *X }, *A, SAVE, *B, SAVE, *C, REST, REST, { INT, *Y, *D, *E }
+
+
+MISCELLANEOUS FUNCTIONS
+-----------------------
+
+Other functions that imply barriers:
+
+ (*) schedule() and similar imply full memory barriers.
+
+
+=================================
+INTER-CPU LOCKING BARRIER EFFECTS
+=================================
+
+On SMP systems locking primitives give a more substantial form of barrier: one
+that does affect memory access ordering on other CPUs, within the context of
+conflict on any particular lock.
+
+
+LOCKS VS MEMORY ACCESSES
+------------------------
+
+Consider the following: the system has a pair of spinlocks (M) and (Q), and
+three CPUs; then should the following sequence of events occur:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+ *A = a; *E = e;
+ LOCK M LOCK Q
+ *B = b; *F = f;
+ *C = c; *G = g;
+ UNLOCK M UNLOCK Q
+ *D = d; *H = h;
+
+Then there is no guarantee as to what order CPU #3 will see the accesses to *A
+through *H occur in, other than the constraints imposed by the separate locks
+on the separate CPUs. It might, for example, see:
+
+ *E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+
+But it won't see any of:
+
+ *B, *C or *D preceding LOCK M
+ *A, *B or *C following UNLOCK M
+ *F, *G or *H preceding LOCK Q
+ *E, *F or *G following UNLOCK Q
+
+
+However, if the following occurs:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+ *A = a;
+ LOCK M [1]
+ *B = b;
+ *C = c;
+ UNLOCK M [1]
+ *D = d; *E = e;
+ LOCK M [2]
+ *F = f;
+ *G = g;
+ UNLOCK M [2]
+ *H = h;
+
+CPU #3 might see:
+
+ *E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
+ LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+
+But assuming CPU #1 gets the lock first, it won't see any of:
+
+ *B, *C, *D, *F, *G or *H preceding LOCK M [1]
+ *A, *B or *C following UNLOCK M [1]
+ *F, *G or *H preceding LOCK M [2]
+ *A, *B, *C, *E, *F or *G following UNLOCK M [2]
+
+
+LOCKS VS I/O ACCESSES
+---------------------
+
+Under certain circumstances (such as NUMA), I/O accesses within two spinlocked
+sections on two different CPUs may be seen as interleaved by the PCI bridge.
+
+For example:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+	spin_lock(Q);
+	writel(0, ADDR);
+ writel(1, DATA);
+ spin_unlock(Q);
+ spin_lock(Q);
+ writel(4, ADDR);
+ writel(5, DATA);
+ spin_unlock(Q);
+
+may be seen by the PCI bridge as follows:
+
+ WRITE *ADDR = 0, WRITE *ADDR = 4, WRITE *DATA = 1, WRITE *DATA = 5
+
+which would probably break.
+
+What is necessary here is to insert an mmiowb() before dropping the spinlock,
+for example:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+	spin_lock(Q);
+	writel(0, ADDR);
+ writel(1, DATA);
+ mmiowb();
+ spin_unlock(Q);
+ spin_lock(Q);
+ writel(4, ADDR);
+ writel(5, DATA);
+ mmiowb();
+ spin_unlock(Q);
+
+this will ensure that the two writes issued on CPU #1 appear at the PCI bridge
+before either of the writes issued on CPU #2.
+
+
+Furthermore, following a write by a read to the same device is okay, because
+the read forces the write to complete before the read is performed:
+
+ CPU 1 CPU 2
+ =============================== ===============================
+	spin_lock(Q);
+	writel(0, ADDR);
+ a = readl(DATA);
+ spin_unlock(Q);
+ spin_lock(Q);
+ writel(4, ADDR);
+ b = readl(DATA);
+ spin_unlock(Q);
+
+
+See Documentation/DocBook/deviceiobook.tmpl for more information.
+
+
+==========================
+KERNEL I/O BARRIER EFFECTS
+==========================
+
+When accessing I/O memory, drivers should use the appropriate accessor
+functions (a brief usage sketch follows this list):
+
+ (*) inX(), outX():
+
+ These are intended to talk to I/O space rather than memory space, but
+ that's primarily a CPU-specific concept. The i386 and x86_64 processors do
+ indeed have special I/O space access cycles and instructions, but many
+ CPUs don't have such a concept.
+
+     The PCI bus, amongst others, defines an I/O space concept - which on
+     such CPUs as the i386 and x86_64 readily maps to the CPU's concept of
+     I/O space. However, it may also be mapped as a virtual I/O space in the
+     CPU's memory map, particularly on those CPUs that don't support
+     alternate I/O spaces.
+
+ Accesses to this space may be fully synchronous (as on i386), but
+ intermediary bridges (such as the PCI host bridge) may not fully honour
+ that.
+
+ They are guaranteed to be fully ordered with respect to each other.
+
+ They are not guaranteed to be fully ordered with respect to other types of
+ memory and I/O operation.
+
+ (*) readX(), writeX():
+
+ Whether these are guaranteed to be fully ordered and uncombined with
+ respect to each other on the issuing CPU depends on the characteristics
+ defined for the memory window through which they're accessing. On later
+ i386 architecture machines, for example, this is controlled by way of the
+ MTRR registers.
+
+     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
+     provided they're not accessing a prefetchable device.
+
+ However, intermediary hardware (such as a PCI bridge) may indulge in
+ deferral if it so wishes; to flush a write, a read from the same location
+ is preferred[*], but a read from the same device or from configuration
+ space should suffice for PCI.
+
+ [*] NOTE! attempting to read from the same location as was written to may
+ cause a malfunction - consider the 16550 Rx/Tx serial registers for
+ example.
+
+ Used with prefetchable I/O memory, an mmiowb() barrier may be required to
+ force writes to be ordered.
+
+ Please refer to the PCI specification for more information on interactions
+ between PCI transactions.
+
+ (*) readX_relaxed()
+
+ These are similar to readX(), but are not guaranteed to be ordered in any
+ way. Be aware that there is no I/O read barrier available.
+
+ (*) ioreadX(), iowriteX()
+
+ These will perform as appropriate for the type of access they're actually
+ doing, be it inX()/outX() or readX()/writeX().
+
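+As a brief usage sketch covering the above accessors (the port number and
+register offsets are illustrative only):
+
+	/* port space */
+	outb(0x12, 0x3f8);			/* write a byte to a port */
+	ch = inb(0x3f8);			/* read a byte from a port */
+
+	/* MMIO space, through a cookie obtained from ioremap() */
+	writel(cmd, base + CMD_REG);
+	status = readl(base + STATUS_REG);
+
+	/* either kind, through a cookie from pci_iomap() or similar */
+	iowrite32(cmd, iobase + CMD_REG);
+	status = ioread32(iobase + STATUS_REG);
+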
+
+==========
+REFERENCES
+==========
+
+AMD64 Architecture Programmer's Manual Volume 2: System Programming
+ Chapter 7.1: Memory-Access Ordering
+ Chapter 7.4: Buffering and Combining Memory Writes
+
+IA-32 Intel Architecture Software Developer's Manual, Volume 3:
+System Programming Guide
+ Chapter 7.1: Locked Atomic Operations
+ Chapter 7.2: Memory Ordering
+ Chapter 7.4: Serializing Instructions
+
+The SPARC Architecture Manual, Version 9
+ Chapter 8: Memory Models
+ Appendix D: Formal Specification of the Memory Models
+ Appendix J: Programming with the Memory Models
+
+UltraSPARC Programmer Reference Manual
+ Chapter 5: Memory Accesses and Cacheability
+ Chapter 15: Sparc-V9 Memory Models
+
+UltraSPARC III Cu User's Manual
+ Chapter 9: Memory Models
+
+UltraSPARC IIIi Processor User's Manual
+ Chapter 8: Memory Models
+
+UltraSPARC Architecture 2005
+ Chapter 9: Memory
+ Appendix D: Formal Specifications of the Memory Models
+
+UltraSPARC T1 Supplement to the UltraSPARC Architecture 2005
+ Chapter 8: Memory Models
+ Appendix F: Caches and Cache Coherency
+
+Solaris Internals, Core Kernel Architecture, p63-68:
+ Chapter 3.3: Hardware Considerations for Locks and
+ Synchronization
+
+Unix Systems for Modern Architectures, Symmetric Multiprocessing and Caching
+for Kernel Programmers:
+ Chapter 13: Other Memory Models
+
+Intel Itanium Architecture Software Developer's Manual: Volume 1:
+ Section 2.6: Speculation
+ Section 4.4: Memory Access
On Wed, Mar 08, 2006 at 11:26:41AM -0800, Linus Torvalds wrote:
> > I presume there only needs to be an mmiowb() here if you've got the
> > appropriate CPU's I/O memory window set up to be weakly ordered.
>
> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.
>
> It only orders mmio from within _one_ CPU and "path" to the destination.
> The IO might be posted somewhere on a PCI bridge, and and depending on the
> posting rules, the mmiowb() just isn't relevant for IO coming through
> another path.
Looking at the SGI implementation, it's smarter than you think. Looks
like there's a register in the local I/O hub that lets you determine
when this write has been queued in the appropriate host->pci bridge.
So by the time __sn_mmiowb() returns, you're guaranteed no other CPU
can bypass the write because the write's got far enough.
Hi!
> +There are some more advanced barriering functions:
> +
> + (*) set_mb(var, value)
> + (*) set_wmb(var, value)
> +
> + These assign the value to the variable and then insert at least a write
> + barrier after it, depending on the function.
> +
I... don't understand what these do. Better explanation would
help... what is the function?
Does it try to say that set_mb(var, value) is equivalent to var =
value; mb(); but here mb() affects that one variable, only?
> +In all cases there are variants on a LOCK operation and an UNLOCK operation.
> +
> + (*) LOCK operation implication:
> +
> + Memory accesses issued after the LOCK will be completed after the LOCK
> + accesses have completed.
"LOCK access"? Does it try to say that ...will be completed after any
access inside lock region is completed?
("LOCK" looks very much like well-known i386 prefix. Calling it
*_lock() or something would avoid that confusion. Fortunately there's
no UNLOCK instruction :-)
> + (*) UNLOCK operation implication:
> +
> + Memory accesses issued before the UNLOCK will be completed before the
> + UNLOCK accesses have completed.
> +
> + Memory accesses issued after the UNLOCK may be completed before the UNLOCK
> + accesses have completed.
> +
> + (*) LOCK vs UNLOCK implication:
> +
> + The LOCK accesses will be completed before the unlock accesses.
~~~~~~
capital? Or
lower it everywhere?
> +==============================
> +I386 AND X86_64 SPECIFIC NOTES
> +==============================
> +
> +Earlier i386 CPUs (pre-Pentium-III) are fully ordered - the operations on the
> +bus appear in program order - and so there's no requirement for any sort of
> +explicit memory barriers.
> +
> +From the Pentium-III onwards were three new memory barrier instructions:
> +LFENCE, SFENCE and MFENCE which correspond to the kernel memory barrier
> +functions rmb(), wmb() and mb(). However, there are additional implicit memory
> +barriers in the CPU implementation:
> +
> + (*) Normal writes imply a semi-rmb(): reads before a write may not complete
> + after that write, but reads after a write may complete before the write
> + (ie: reads may go _ahead_ of writes).
This makes it sound like pentium-III+ is incompatible with previous
CPUs. Is it really the case?
Pavel
--
Web maintainer for suspend.sf.net (http://www.sf.net/projects/suspend) wanted...
On Wednesday, March 8, 2006 11:26 am, Linus Torvalds wrote:
> But if you have a separate IO fabric and basically two different CPU's
> can get to one device through two different paths, no amount of write
> barriers of any kind will ever help you.
No, that's exactly the case that mmiowb() was designed to protect
against. It ensures that your writes have arrived at the destination
bridge, which means after that point any other CPUs writing to the same
device will have their data actually hit the device afterwards.
Hopefully deviceiobook.tmpl makes that clear...
Jesse
On Wed, Mar 08, 2006 at 11:26:41AM -0800, Linus Torvalds wrote:
> Actually, since the different NUMA things may have different paths to the
> PCI thing, I don't think even the mmiowb() will really help. It has
> nothing to serialize _with_.
It serializes to the bridge. On the Altix for example this is done by reading
a local status register with the pending write count in it and waiting until
the chip reports the write has propagated across the fabric. At that point it
has hit the bridge and the usual PCI posting applies, but the PCI ordering
rule will also apply so the write won't be passed by another write issued
after the spinlock is then dropped.
> The IO might be posted somewhere on a PCI bridge, and and depending on the
> posting rules, the mmiowb() just isn't relevant for IO coming through
> another path.
Yes. mmiowb only serializes to the bridge. That's how it is defined in the
documentation. That's enough to sort out things like the example with locks,
but where a read from the device would be overkill.
> general, or even very commonly. The undeniable fact is that "big NUMA"
> machines need to validate the drivers they use separately. The fact that
> it works on a normal PC - and that it's been tested to death there - does
> not guarantee much anything.
mmiowb comes about from the Altix folks, strangely enough.
> The good news, of course, is that you don't use that kind of "big NUMA"
> system the same way you'd use a regular desktop SMP. You don't plug in
> random devices into it and just expect them to work. I'd hope ;)
Various core drivers like tg3 use mmiowb()
Alan
Pavel Machek <[email protected]> wrote:
> > + (*) set_mb(var, value)
> > + (*) set_wmb(var, value)
> > +
> > + These assign the value to the variable and then insert at least a write
> > + barrier after it, depending on the function.
> > +
>
> I... don't understand what these do. Better explanation would
> help.. .what is function?
I can only guess, and hope someone corrects me if I'm wrong.
> Does it try to say that set_mb(var, value) is equivalent to var =
> value; mb();
Yes.
> but here mb() affects that one variable, only?
No. set_*mb() is simply a canned sequence of assignment, memory barrier.
The type of barrier inserted depends on which function you choose. set_mb()
inserts an mb() and set_wmb() inserts a wmb().
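A sketch of the generic forms (architectures may substitute something more
efficient; i386's set_mb(), for example, uses xchg()):

	#define set_mb(var, value)  do { (var) = (value); mb(); } while (0)
	#define set_wmb(var, value) do { (var) = (value); wmb(); } while (0)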
> "LOCK access"?
The LOCK and UNLOCK functions presumably make at least one memory write apiece
to manipulate the target lock (on SMP at least).
> Does it try to say that ...will be completed after any access inside lock
> region is completed?
No. What you get in effect is something like:
LOCK { *lock = q; }
*A = a;
*B = b;
UNLOCK { *lock = u; }
Except that the accesses to the lock memory are made using special procedures
(LOCK prefixed instructions, XCHG, CAS/CMPXCHG, LL/SC, etc).
> This makes it sound like pentium-III+ is incompatible with previous
> CPUs. Is it really the case?
Yes - hence the alternative instruction stuff.
David
David Howells writes:
> > Enabling/disabling interrupts doesn't imply a barrier on powerpc, and
> > nor does taking an interrupt or returning from one.
>
> Surely it ought to, otherwise what's to stop accesses done with interrupts
> disabled crossing with accesses done inside an interrupt handler?
The rule that the CPU always sees its own loads and stores in program
order.
If a CPU takes an interrupt after doing some stores, and the interrupt
handler does loads from the same location(s), it has to see the new
values, even if they haven't got to memory yet. The interrupt isn't
special in this situation; if the instruction stream has a store to a
location followed by a load from it, the load *has* to see the value
stored by the store (assuming no other store to the same location in
the meantime, of course). That's true whether or not the CPU takes an
exception or interrupt between the store and the load. Anything else
would make programming really ... um ... interesting. :)
> > > +Either interrupt disablement (LOCK) and enablement (UNLOCK) will barrier
> > ...
> > I don't think this is right, and I don't think it is necessary to
> > achieve the end you state, since a CPU will always see its own memory
> > accesses in program order.
>
> But what about a driver accessing some memory that its device is going to
> observe under irq disablement, and then getting an interrupt immediately after
> from that same device, the handler for which communicates with the device,
> possibly then being broken because the CPU hasn't completed all the memory
> accesses that the driver made while interrupts are disabled?
Well, we have to be clear about what causes what here. Is the device
accessing this memory just at a random time, or is the access caused
by (in response to) an MMIO store? And what causes the interrupt?
Does it just happen to come along at this time or is it in response to
one of the stores?
If the device accesses to memory are in response to an MMIO store,
then the code needs an explicit wmb() between the memory stores and
the MMIO store. Disabling interrupts isn't going to help here because
the device doesn't see the CPU interrupt enable state.
In general it is possible for the CPU to see a different state of
memory than the device sees. If the driver needs to be sure that they
both see the same view then it needs to use some sort of
synchronization. A memory barrier followed by a store to the device,
with no further stores to memory until we have an indication from the
device that it has received the MMIO store, would be a suitable way to
synchronize. Enabling or disabling interrupts does nothing useful
here because the device doesn't see that. That applies whether we are
in an interrupt routine or not.
Do you have a specific scenario in mind, with a particular device and
driver?
One thing that driver writers do need to be careful about is that if a
device writes some data to memory and then causes an interrupt, the
fact that the interrupt has reached the CPU and the CPU has invoked
the driver's interrupt routine does *not* mean that the data has got
to memory from the CPU's point of view. The data could still be
queued up in the PCI host bridge or elsewhere. Doing an MMIO read
from the device is sufficient to ensure that the CPU will then see the
correct data in memory.
> Alternatively, might it be possible for communications between two CPUs to be
> stuffed because one took an interrupt that also modified common data before
> the it had committed the memory accesses done under interrupt disablement?
> This would suggest using a lock though.
Disabling interrupts doesn't do *anything* to help with communication
between CPUs. You have to use locks or explicit barriers for that.
It is possible for one CPU to see memory accesses done by another CPU
in a different order from the program order on the CPU that did the
accesses. That applies whether or not some of the accesses were done
inside an interrupt routine.
> > What does *F+*A mean?
>
> Combined accesses.
Still opaque, sorry: you mean they both happen in some unspecified
order?
> > Well, the driver should *not* be doing *ADR at all, it should be using
> > read[bwl]/write[bwl]. The architecture code has to implement
> > read*/write* in such a way that the accesses generated can't be
> > reordered. I _think_ it also has to make sure the write accesses
> > can't be write-combined, but it would be good to have that clarified.
>
> Than what use mmiowb()?
That was introduced to help some platforms that have difficulty
ensuring that MMIO accesses hit the device in the right order, IIRC.
I'm still not entirely clear on exactly where it's needed or what
guarantees you can rely on if you do or don't use it.
> Surely write combining and out-of-order reads are reasonable for cacheable
> devices like framebuffers.
They are. read*/write* to non-cacheable non-prefetchable MMIO
shouldn't be reordered or write-combined, but for prefetchable MMIO
I'm not sure whether read*/write* should allow reordering, or whether
drivers should use __raw_read/write* if they want that. (Of course,
with the __raw_ functions they don't get the endian conversion
either...)
Paul.
On Mer, 2006-03-08 at 20:16 +0000, David Howells wrote:
> The LOCK and UNLOCK functions presumably make at least one memory write apiece
> to manipulate the target lock (on SMP at least).
No they merely perform the bus transactions necessary to perform an
update atomically. They are however "serializing" instructions which
means they do cause a certain amount of serialization (see the Intel
architecture manual on serializing instructions for detail).
Athlon and later know how to turn it from locked memory accesses into
merely an exclusive cache line grab.
> > This makes it sound like pentium-III+ is incompatible with previous
> > CPUs. Is it really the case?
>
> Yes - hence the alternative instruction stuff.
It is the case for certain specialist instructions and the fences are
provided to go with those but can also help in other cases. PIII and
later in particular support explicit non temporal stores.
On Iau, 2006-03-09 at 08:49 +1100, Paul Mackerras wrote:
> If the device accesses to memory are in response to an MMIO store,
> then the code needs an explicit wmb() between the memory stores and
> the MMIO store. Disabling interrupts isn't going to help here because
> the device doesn't see the CPU interrupt enable state.
Interrupts are themselves entirely asynchronous anyway. The following
can occur on SMP Pentium-PIII.
	Device				CPU
	Raise IRQ
					writel(MASK_IRQ, &dev->ctrl);
					readl(&dev->ctrl);
	IRQ arrives
CPU specific IRQ masking is synchronous, but IRQ delivery is not,
including IPI delivery (which is asynchronous and not guaranteed to
occur only once per IPI but can be replayed in obscure cases on x86).
Alan Cox writes:
> On Wed, Mar 08, 2006 at 02:37:58PM +0000, David Howells wrote:
> > + (*) reads can be done speculatively, and then the result discarded should it
> > + prove not to be required;
>
> That might be worth an example with an if() because PPC will do this and if
> it's a read with a side effect (eg I/O space) you get singed.
On PPC machines, the PTE has a bit called G (for Guarded) which
indicates that the memory mapped by it has side effects. It prevents
the CPU from doing speculative accesses (i.e. the CPU can't send out a
load from the page until it knows for sure that the program will get
to that instruction) and from prefetching from the page.
The kernel sets G=1 on MMIO and PIO pages in general, as you would
expect, although you can get G=0 mappings for framebuffers etc. if you
ask specifically for that.
Paul.
David Howells writes:
> > # define smp_read_barrier_depends() do { } while(0)
>
> What's this one meant to do?
On most CPUs, if you load one value and use the value you get to
compute the address for a second load, there is an implicit read
barrier between the two loads because of the dependency. That's not
true on alpha, apparently, because of the way their caches are
structured. The smp_read_barrier_depends is a read barrier that you
use between two loads when there is already a dependency between the
loads, and it is a no-op on everything except alpha (IIRC).
Paul.
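[ A sketch of the pattern being described, for reference:

	p = global_ptr;			/* load #1 */
	smp_read_barrier_depends();	/* no-op everywhere except Alpha */
	val = *p;			/* load #2, depends on load #1 */
]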
Duncan Sands writes:
> On UP you at least need compiler barriers, right? You're in trouble if you think
> you are writing in a certain order, and expect to see the same order from an
> interrupt handler, but the compiler decided to rearrange the order of the writes...
I'd be interested to know what the C standard says about whether the
compiler can reorder writes that may be visible to a signal handler.
An interrupt handler in the kernel is logically equivalent to a signal
handler in normal C code.
Surely there are some C language lawyers on one of the lists that this
thread is going to?
Paul.
From: Paul Mackerras <[email protected]>
Date: Thu, 9 Mar 2006 09:01:57 +1100
> On PPC machines, the PTE has a bit called G (for Guarded) which
> indicates that the memory mapped by it has side effects. It prevents
> the CPU from doing speculative accesses (i.e. the CPU can't send out a
> load from the page until it knows for sure that the program will get
> to that instruction) and from prefetching from the page.
>
> The kernel sets G=1 on MMIO and PIO pages in general, as you would
> expect, although you can get G=0 mappings for framebuffers etc. if you
> ask specifically for that.
Sparc64 has a similar PTE bit called "E" for "side-Effect".
And we also do the same thing as powerpc for framebuffers.
Note that on sparc64 in our asm/io.h PIO/MMIO accessor macros
we use physical addresses, so we don't have to map anything
in ioremap(), and use a special address space identifier on
the loads and stores that indicates "E" behavior is desired.
From: Paul Mackerras <[email protected]>
Date: Thu, 9 Mar 2006 09:06:05 +1100
> I'd be interested to know what the C standard says about whether the
> compiler can reorder writes that may be visible to a signal handler.
> An interrupt handler in the kernel is logically equivalent to a signal
> handler in normal C code.
>
> Surely there are some C language lawyers on one of the lists that this
> thread is going to?
Just like for setjmp() I think you have to mark such things
as volatile.
On Wed, 8 Mar 2006, David S. Miller wrote:
>
> Just like for setjmp() I think you have to mark such things
> as volatile.
.. and sig_atomic_t.
Linus
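[ For reference, a minimal sketch of the volatile + sig_atomic_t pattern:

	static volatile sig_atomic_t seen_signal;

	static void handler(int sig)
	{
		seen_signal = 1;	/* the only store C guarantees safe here */
	}
]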
On Iau, 2006-03-09 at 09:06 +1100, Paul Mackerras wrote:
> I'd be interested to know what the C standard says about whether the
> compiler can reorder writes that may be visible to a signal handler.
> An interrupt handler in the kernel is logically equivalent to a signal
> handler in normal C code.
The C standard doesn't have much to say. POSIX has a lot to say and yes
it can do this. You do need volatile or store barriers in signal-touched
code quite often, or for that matter locks.
POSIX/SuS also has stuff to say about what functions are signal safe and
what is not allowed.
Alan
On Thu, Mar 09, 2006 at 09:10:49AM +1100, Paul Mackerras wrote:
> David Howells writes:
>
> > > # define smp_read_barrier_depends() do { } while(0)
> >
> > What's this one meant to do?
>
> On most CPUs, if you load one value and use the value you get to
> compute the address for a second load, there is an implicit read
> barrier between the two loads because of the dependency. That's not
> true on alpha, apparently, because of the way their caches are
> structured.
Who said?! ;-)
> The smp_read_barrier_depends is a read barrier that you
> use between two loads when there is already a dependency between the
> loads, and it is a no-op on everything except alpha (IIRC).
My "Compiler Writer's Guide for the Alpha 21264" says that if the
result of the first load contributes to the address calculation
of the second load, then the second load cannot issue until the data
from the first load is available.
Obviously, we don't care about earlier alphas as they are executing
strictly in program order.
Ivan.
David Howells writes:
> On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction then? Those
> do inter-component synchronisation.
We actually have quite heavy synchronization in read*/write* on PPC,
and mmiowb can safely be a no-op. It would be nice to be able to have
lighter-weight synchronization, but I'm sure we would see lots of
subtle driver bugs cropping up if we did. write* do a full memory
barrier (sync) after the store, and read* explicitly wait for the data
to come back before.
If you ask me, the need for mmiowb on some platforms merely shows that
those platforms' implementations of spinlocks and read*/write* are
buggy...
Paul.
On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
> If you ask me, the need for mmiowb on some platforms merely shows that
> those platforms' implementations of spinlocks and read*/write* are
> buggy...
You could also state that same as
"If you ask me, the need for mmiowb on some platforms merely shows
that those platforms perform like a bat out of hell, and I think
they should be slower"
because the fact is, x86 memory barrier rules are just about optimal for
performance.
Linus
Matthew Wilcox writes:
> Looking at the SGI implementation, it's smarter than you think. Looks
> like there's a register in the local I/O hub that lets you determine
> when this write has been queued in the appropriate host->pci bridge.
Given that mmiowb takes no arguments, how does it know which is the
appropriate PCI host bridge?
Paul.
On Wednesday, March 8, 2006 4:35 pm, Paul Mackerras wrote:
> David Howells writes:
> > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction
> > then? Those do inter-component synchronisation.
>
> We actually have quite heavy synchronization in read*/write* on PPC,
> and mmiowb can safely be a no-op. It would be nice to be able to have
> lighter-weight synchronization, but I'm sure we would see lots of
> subtle driver bugs cropping up if we did. write* do a full memory
> barrier (sync) after the store, and read* explicitly wait for the data
> to come back before.
>
> If you ask me, the need for mmiowb on some platforms merely shows that
> those platforms' implementations of spinlocks and read*/write* are
> buggy...
Or maybe they just wanted to keep them fast. I don't know why you
compromised so much in your implementation of read/write and
lock/unlock, but given how expensive synchronization is, I'd think it
would be better in the long run to make the barrier types explicit (or
at least a subset of them) to maximize performance. The rules for using
the barriers really aren't that bad... for mmiowb() you basically want
to do it before an unlock in any critical section where you've done PIO
writes.
Of course, that doesn't mean there isn't confusion about existing
barriers. There was a long thread a few years ago (Jes worked it all
out, iirc) regarding some subtle memory ordering bugs in the tty layer
that ended up being due to ia64's very weak spin_unlock ordering
guarantees (one way memory barrier only), but I think that's mainly an
artifact of how ill defined the semantics of the various arch specific
routines are in some cases.
That's why I suggested in an earlier thread that you enumerate all the
memory ordering combinations on ppc and see if we can't define them all.
Then David can roll the implicit ones up into his document, or we can
add the appropriate new operations to the kernel. Really getting
barriers right shouldn't be much harder than getting DMA mapping right,
from a driver writers POV (though people often get that wrong I guess).
Jesse
On Wednesday, March 8, 2006 4:37 pm, Paul Mackerras wrote:
> Matthew Wilcox writes:
> > Looking at the SGI implementation, it's smarter than you think.
> > Looks like there's a register in the local I/O hub that lets you
> > determine when this write has been queued in the appropriate
> > host->pci bridge.
>
> Given that mmiowb takes no arguments, how does it know which is the
> appropriate PCI host bridge?
It uses a per-node address space to reference the local bridge.  The
local bridge then waits until the remote bridge has acked the write,
and then sets the outstanding write register to the appropriate value.
Jesse
Ivan Kokshaysky writes:
> On Thu, Mar 09, 2006 at 09:10:49AM +1100, Paul Mackerras wrote:
> > David Howells writes:
> >
> > > > # define smp_read_barrier_depends() do { } while(0)
> > >
> > > What's this one meant to do?
> >
> > On most CPUs, if you load one value and use the value you get to
> > compute the address for a second load, there is an implicit read
> > barrier between the two loads because of the dependency. That's not
> > true on alpha, apparently, because of the way their caches are
> > structured.
>
> Who said?! ;-)
Paul McKenney, after much discussion with Alpha chip designers IIRC.
> > The smp_read_barrier_depends is a read barrier that you
> > use between two loads when there is already a dependency between the
> > loads, and it is a no-op on everything except alpha (IIRC).
>
> My "Compiler Writer's Guide for the Alpha 21264" says that if the
> result of the first load contributes to the address calculation
> of the second load, then the second load cannot issue until the data
> from the first load is available.
Sure, but because of the partitioned caches on some systems, the
second load can get older data than the first load, even though it
issues later.
If you do:
	CPU 0				CPU 1

	foo = val;
	wmb();
	p = &foo;
					reg = p;
					bar = *reg;
it is apparently possible for CPU 1 to see the new value of p
(i.e. &foo) but an old value of foo (i.e. not val). This can happen
if p and foo are in different halves of the cache on CPU 1, and there
are a lot of updates coming in for the half containing foo but the
half containing p is quiet.
I added Paul McKenney to the cc list so he can correct anything I have
wrong here.
Paul.
Linus Torvalds writes:
> > If you ask me, the need for mmiowb on some platforms merely shows that
> > those platforms' implementations of spinlocks and read*/write* are
> > buggy...
>
> You could also state that same as
>
> "If you ask me, the need for mmiowb on some platforms merely shows
> that those platforms perform like a bat out of hell, and I think
> they should be slower"
>
> because the fact is, x86 memory barrier rules are just about optimal for
> performance.
... and x86 mmiowb is a no-op. It's not x86 that I think is buggy.
Paul.
On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
> ... and x86 mmiowb is a no-op. It's not x86 that I think is buggy.
x86 mmiowb would have to be a real op too if there were any multi-pathed
PCI buses out there for x86, methinks.
Basically, the issue boils down to one thing: no "normal" barrier will
_ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any
situation where there are multiple paths to one physical device means that
mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the
serialization you look for.
Put another way: the only way to avoid mmiowb() being special is either
one of:
(a) have the bus fabric itself be synchronizing
(b) pay a huge expense on the much more critical _regular_ barriers
Now, I claim that (b) is just broken. I'd rather take the hit when I need
to, than every time.
Now, (a) is trivial for small cases, but scales badly unless you do some
fancy footwork. I suspect you could do some scalable multi-pathable
version with using similar approaches to resolving device conflicts as the
cache coherency protocol does (or by having a token-passing thing), but it
seems SGI's solution was fairly well thought out.
That said, when I heard of the NUMA IO issues on the SGI platform, I was
initially pretty horrified. It seems to have worked out ok, and as long as
we're talking about machines where you can concentrate on validating just
a few drivers, it seems to be a good tradeoff.
Would I want the hard-to-think-about IO ordering on a regular desktop
platform? No.
Linus
Jesse Barnes writes:
> On Wednesday, March 8, 2006 4:35 pm, Paul Mackerras wrote:
> > David Howells writes:
> > > On NUMA PowerPC, should mmiowb() be a SYNC or an EIEIO instruction
> > > then? Those do inter-component synchronisation.
> >
> > We actually have quite heavy synchronization in read*/write* on PPC,
> > and mmiowb can safely be a no-op. It would be nice to be able to have
> > lighter-weight synchronization, but I'm sure we would see lots of
> > subtle driver bugs cropping up if we did. write* do a full memory
> > barrier (sync) after the store, and read* explicitly wait for the data
> > to come back before.
> >
> > If you ask me, the need for mmiowb on some platforms merely shows that
> > those platforms' implementations of spinlocks and read*/write* are
> > buggy...
>
> Or maybe they just wanted to keep them fast. I don't know why you
> compromised so much in your implementation of read/write and
> lock/unlock, but given how expensive synchronization is, I'd think it
> would be better in the long run to make the barrier types explicit (or
> at least a subset of them) to maximize performance.
The PPC read*/write* and in*/out* aim to implement x86 semantics, in
order to minimize the number of subtle driver bugs that only show up
under heavy load.
I agree that in the long run making the barriers more explicit is a
good thing.
> The rules for using
> the barriers really aren't that bad... for mmiowb() you basically want
> to do it before an unlock in any critical section where you've done PIO
> writes.
Do you mean just PIO, or do you mean PIO or MMIO writes?
> Of course, that doesn't mean there isn't confusion about existing
> barriers. There was a long thread a few years ago (Jes worked it all
> out, iirc) regarding some subtle memory ordering bugs in the tty layer
> that ended up being due to ia64's very weak spin_unlock ordering
> guarantees (one way memory barrier only), but I think that's mainly an
> artifact of how ill defined the semantics of the various arch specific
> routines are in some cases.
Yes, there is a lot of confusion, unfortunately. There is also some
difficulty in defining things to be any different from what x86 does.
> That's why I suggested in an earlier thread that you enumerate all the
> memory ordering combinations on ppc and see if we can't define them all.
The main difficulty we strike on PPC is that cacheable accesses tend
to get ordered independently of noncacheable accesses. The only
instruction we have that orders cacheable accesses with respect to
noncacheable accesses is the sync instruction, which is a heavyweight
"synchronize everything" operation. It acts as a full memory barrier
for both cacheable and noncacheable loads and stores.
The other barriers we have are the lwsync instruction and the eieio
instruction. The lwsync instruction (light-weight sync) acts as a
memory barrier for cacheable loads and stores except that it allows a
following load to go before a preceding store.
The eieio instruction has two separate and independent effects. It
acts as a full barrier for accesses to noncacheable nonprefetchable
memory (i.e. MMIO or PIO registers), and it acts as a write barrier
for accesses to cacheable memory. It doesn't do any ordering between
cacheable and noncacheable accesses though.
There is also the isync (instruction synchronize) instruction, which
isn't explicitly a memory barrier. It prevents any following
instructions from executing until the outcome of any previous
conditional branches are known, and until it is known that no previous
instruction can generate an exception. Thus it can be used to create
a one-way barrier in spin_lock and read*.
> Then David can roll the implicit ones up into his document, or we can
> add the appropriate new operations to the kernel. Really getting
> barriers right shouldn't be much harder than getting DMA mapping right,
> from a driver writers POV (though people often get that wrong I guess).
Unfortunately, if you get the barriers wrong your driver will still
work most of the time on pretty much any machine, whereas if you get
the DMA mapping wrong your driver won't work at all on some machines.
Nevertheless, we should get these things defined properly and then try
to make sure drivers do the right things.
Paul.
Jesse Barnes writes:
> It uses a per-node address space to reference the local bridge.  The
> local bridge then waits until the remote bridge has acked the write,
> and then sets the outstanding write register to the appropriate value.
That sounds like mmiowb can only be used when preemption is disabled,
such as inside a spin-locked region - is that right?
Paul.
Linus Torvalds wrote:
>
>On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
>>... and x86 mmiowb is a no-op. It's not x86 that I think is buggy.
>>
>
>x86 mmiowb would have to be a real op too if there were any multi-pathed
>PCI buses out there for x86, methinks.
>
>Basically, the issue boils down to one thing: no "normal" barrier will
>_ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any
>situation where there are multiple paths to one physical device means that
>mmiowb() _has_ to be a special op, and no spinlocks etc will _ever_ do the
>serialization you look for.
>
>Put another way: the only way to avoid mmiowb() being special is either
>one of:
> (a) have the bus fabric itself be synchronizing
> (b) pay a huge expense on the much more critical _regular_ barriers
>
>Now, I claim that (b) is just broken. I'd rather take the hit when I need
>to, than every time.
>
I'm not very driver-minded; would it make sense to have io versions of
locks, which can provide critical sections for IO operations?
The number of (uncommented) memory barriers sprinkled around drivers
looks pretty scary...
Linus Torvalds writes:
> x86 mmiowb would have to be a real op too if there were any multi-pathed
> PCI buses out there for x86, methinks.
Not if the manufacturers wanted to be able to run existing standard
x86 operating systems on it, surely.
I presume that on x86 the PCI host bridges and caches are all part of
the coherence domain, and that the rule about stores being observed in
order applies to what the PCI host bridge can see as much as it does
to any other agent in the coherence domain. And if I have understood
you correctly, the store ordering rule applies both to stores to
regular cacheable memory and stores to noncacheable nonprefetchable
MMIO registers without distinction.
If that is so, then I don't see how the writel's can get out of order.
Put another way, we expect spinlock regions to order stores to regular
memory, and AFAICS the x86 ordering rules mean that the same guarantee
should apply to stores to MMIO registers. (It's entirely possible
that I don't fully understand the x86 memory ordering rules, of
course. :)
> Basically, the issue boils down to one thing: no "normal" barrier will
> _ever_ show up on the bus on x86 (ie ia64, afaik). That, together with any
A spin_lock does show up on the bus, doesn't it?
> Would I want the hard-to-think-about IO ordering on a regular desktop
> platform? No.
In fact I think that mmiowb can actually be useful on PPC, if we can
be sure that all the drivers we care about will use it correctly.
If we can have the following rules:
* If you have stores to regular memory, followed by an MMIO store, and
you want the device to see the stores to regular memory at the point
where it receives the MMIO store, then you need a wmb() between the
stores to regular memory and the MMIO store.
* If you have PIO or MMIO accesses, and you need to ensure the
PIO/MMIO accesses don't get reordered with respect to PIO/MMIO
accesses on another CPU, put the accesses inside a spin-locked
region, and put a mmiowb() between the last access and the
spin_unlock.
* smp_wmb() doesn't necessarily do any ordering of MMIO accesses
vs. other accesses, and in that sense it is weaker than wmb().
... then I can remove the sync from write*, which would be nice, and
make mmiowb() be a sync. I wonder how long we're going to spend
chasing driver bugs after that, though. :)
Paul.
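[ A sketch of driver code following those proposed rules; the lock and
  register offsets are illustrative only:

	/* rule 1: order memory stores before the MMIO store that kicks
	   the device */
	desc->status = DESC_READY;
	wmb();
	writel(DOORBELL, base + DB_REG);

	/* rule 2: order MMIO against other CPUs by doing it inside a
	   spinlocked region, with mmiowb() before the unlock */
	spin_lock(&dev_lock);
	writel(cmd, base + CMD_REG);
	mmiowb();
	spin_unlock(&dev_lock);
]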
On Wednesday, March 08, 2006 5:36 pm, Paul Mackerras wrote:
> Jesse Barnes writes:
> > It uses a per-node address space to reference the local bridge.
> > The local bridge then waits until the remote bridge has acked the
> > write, and then sets the outstanding write register to the
> > appropriate value.
>
> That sounds like mmiowb can only be used when preemption is disabled,
> such as inside a spin-locked region - is that right?
There's a scheduler hook to flush things if a process moves. I think
Brent Casavant submitted that patch recently.
Jesse
On Wednesday, March 08, 2006 5:57 pm, Paul Mackerras wrote:
> > The rules for using
> > the barriers really aren't that bad... for mmiowb() you basically
> > want to do it before an unlock in any critical section where you've
> > done PIO writes.
>
> Do you mean just PIO, or do you mean PIO or MMIO writes?
I'd have to check, but iirc it was just MMIO. We assumed PIO (inX/outX)
was defined to be very strongly ordered (and thus slow) in Linux. But
Linus is apparently flexible on that point for the new ioreadX/iowriteX
stuff.
> Yes, there is a lot of confusion, unfortunately. There is also some
> difficulty in defining things to be any different from what x86 does.
Well, Alpha has smp_barrier_depends or whatever, that's *really* funky.
> > That's why I suggested in an earlier thread that you enumerate all
> > the memory ordering combinations on ppc and see if we can't define
> > them all.
>
> The main difficulty we strike on PPC is that cacheable accesses tend
> to get ordered independently of noncacheable accesses. The only
> instruction we have that orders cacheable accesses with respect to
> noncacheable accesses is the sync instruction, which is a heavyweight
> "synchronize everything" operation. It acts as a full memory barrier
> for both cacheable and noncacheable loads and stores.
Ah, ok, sounds like your chip needs an ISA extension or two then. :)
> The other barriers we have are the lwsync instruction and the eieio
> instruction. The lwsync instruction (light-weight sync) acts as a
> memory barrier for cacheable loads and stores except that it allows a
> following load to go before a preceding store.
This sounds like ia64 acquire semantics, a fence, but only in the
downward direction.
> The eieio instruction has two separate and independent effects. It
> acts as a full barrier for accesses to noncacheable nonprefetchable
> memory (i.e. MMIO or PIO registers), and it acts as a write barrier
> for accesses to cacheable memory. It doesn't do any ordering between
> cacheable and noncacheable accesses though.
Weird, ok, so for cacheable stuff it's equivalent to ia64's release
semantics, but has additional effects for noncacheable accesses. Too
bad it doesn't tie the two together somehow.
> There is also the isync (instruction synchronize) instruction, which
> isn't explicitly a memory barrier. It prevents any following
> instructions from executing until the outcome of any previous
> conditional branches are known, and until it is known that no
> previous instruction can generate an exception. Thus it can be used
> to create a one-way barrier in spin_lock and read*.
Hm, interesting.
> Unfortunately, if you get the barriers wrong your driver will still
> work most of the time on pretty much any machine, whereas if you get
> the DMA mapping wrong your driver won't work at all on some machines.
> Nevertheless, we should get these things defined properly and then
> try to make sure drivers do the right things.
Agreed. Having a set of rules that driver writers can use would help
too. Given that PPC doesn't appear to have a lightweight way of
synchronizing between I/O and memory accesses, it sounds like full
syncs will be needed in a lot of cases.
Jesse
On Wednesday, March 08, 2006 5:27 pm, Linus Torvalds wrote:
> That said, when I heard of the NUMA IO issues on the SGI platform, I
> was initially pretty horrified. It seems to have worked out ok, and
> as long as we're talking about machines where you can concentrate on
> validating just a few drivers, it seems to be a good tradeoff.
It's actually not too bad. We tried hard to make the arch code support
the semantics that Linux drivers expect. mmiowb() was an optimization
we added (though it's much less of an optimization than read_relaxed()
was) to make things a little faster. Like you say, the alternative was
to embed the same functionality into spin_unlock or something (IRIX
actually had an io_spin_unlock that did that iirc), but that would mean
an MMIO access on every unlock, which would be bad.
So ultimately mmiowb() is the only thing drivers really have to care
about on Altix (assuming they do DMA mapping correctly), and the rules
for that are fairly simple. Then they can additionally use
read_relaxed() to optimize performance a bit (quite a bit on big
systems).
> Would I want the hard-to-think-about IO ordering on a regular desktop
> platform? No.
I guess you don't want anyone to send you an O2 then? :)
Jesse
On Wednesday, March 08, 2006 7:45 pm, Paul Mackerras wrote:
> If we can have the following rules:
>
> * If you have stores to regular memory, followed by an MMIO store,
> and you want the device to see the stores to regular memory at the
> point where it receives the MMIO store, then you need a wmb() between
> the stores to regular memory and the MMIO store.
>
> * If you have PIO or MMIO accesses, and you need to ensure the
> PIO/MMIO accesses don't get reordered with respect to PIO/MMIO
> accesses on another CPU, put the accesses inside a spin-locked
> region, and put a mmiowb() between the last access and the
> spin_unlock.
>
> * smp_wmb() doesn't necessarily do any ordering of MMIO accesses
> vs. other accesses, and in that sense it is weaker than wmb().
This is a good set of rules. Hopefully David can add something like
this to his doc.
> ... then I can remove the sync from write*, which would be nice, and
> make mmiowb() be a sync. I wonder how long we're going to spend
> chasing driver bugs after that, though. :)
Hm, a static checker should be able to find this stuff, shouldn't it?
Jesse
Jesse Barnes writes:
> So ultimately mmiowb() is the only thing drivers really have to care
> about on Altix (assuming they do DMA mapping correctly), and the rules
> for that are fairly simple. Then they can additionally use
> read_relaxed() to optimize performance a bit (quite a bit on big
> systems).
If I can be sure that all the drivers we care about on PPC use mmiowb
correctly, I can reduce or eliminate the barrier in write*, which
would be nice.
Which drivers have been audited to make sure they use mmiowb
correctly? In particular, has the USB driver been audited?
Paul.
On Thu, 9 Mar 2006, Paul Mackerras wrote:
>
> A spin_lock does show up on the bus, doesn't it?
Nope.
If the lock entity is in a exclusive cache-line, a spinlock does not show
up on the bus at _all_. It's all purely in the core. In fact, I think AMD
does a spinlock in ~15 CPU cycles (that's the serialization overhead in
the core). I think a P-M core is ~25, while the NetBurst (P4) core is much
more because they have horrible serialization issues (I think it's on the
order of 100 cycles there).
Anyway, try doing a spinlock in 15 CPU cycles and going out on the bus for
it..
(Couple that with spin_unlock basically being free).
Now, if the spinlocks end up _bouncing_ between CPU's, they'll obviously
be a lot more expensive.
Linus
Jesse Barnes writes:
> Hm, a static checker should be able to find this stuff, shouldn't it?
Good idea. I wonder if sparse could be extended to do it.
Alternatively, it wouldn't be hard to check dynamically. Just have a
per-cpu count of outstanding MMIO stores. Zero it in spin_lock and
mmiowb, increment it in write*, and grizzle if spin_unlock finds it
non-zero. Should be very little overhead.
Paul.
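[ A sketch of that dynamic check, using the per-CPU helpers of the day; the
  hook points in write*(), spin_lock(), spin_unlock() and mmiowb() are
  hypothetical:

	static DEFINE_PER_CPU(int, mmio_writes_pending);

	/* in write*() */
	__get_cpu_var(mmio_writes_pending)++;

	/* in spin_lock() and mmiowb() */
	__get_cpu_var(mmio_writes_pending) = 0;

	/* in spin_unlock() */
	WARN_ON(__get_cpu_var(mmio_writes_pending) != 0);
]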
>>>>> "Paul" == Paul Mackerras <[email protected]> writes:
Paul> Jesse Barnes writes:
>> So ultimately mmiowb() is the only thing drivers really have to
>> care about on Altix (assuming they do DMA mapping correctly), and
>> the rules for that are fairly simple. Then they can additionally
>> use read_relaxed() to optimize performance a bit (quite a bit on
>> big systems).
Paul> If I can be sure that all the drivers we care about on PPC use
Paul> mmiowb correctly, I can reduce or eliminate the barrier in
Paul> write*, which would be nice.
Paul> Which drivers have been audited to make sure they use mmiowb
Paul> correctly? In particular, has the USB driver been audited?
I think the primary drivers we've looked at are drivers/net/tg3.c,
drivers/net/s2io.c, drivers/scsi/qla1280.c, and possibly the
qla[234]xxx series - that's probably it!
While we have USB on the systems, I don't think anyone has spent a lot
of time verifying it in this context. At least the keyboard and mouse
I have on this box seem to behave.
Cheers,
Jes
"linux-os \(Dick Johnson\)" <[email protected]> writes:
> On Tue, 7 Mar 2006, Matthew Wilcox wrote:
>
>> On Tue, Mar 07, 2006 at 01:54:33PM -0500, linux-os (Dick Johnson) wrote:
>>> This might be a good place to document:
>>> dummy = readl(&foodev->ctrl);
>>>
>>> Will flush all pending writes to the PCI bus and that:
>>> (void) readl(&foodev->ctrl);
>>> ... won't because `gcc` may optimize it away. In fact, variable
>>> "dummy" should be global or `gcc` may make it go away as well.
>>
>> static inline unsigned int readl(const volatile void __iomem *addr)
>> {
>> return *(volatile unsigned int __force *) addr;
>> }
>>
>> The cast is volatile, so gcc knows not to optimise it away.
>>
>
> When the assignment is not made, a.k.a. cast to void, or when the
> assignment is made to an otherwise unused variable, `gcc` does,
> indeed, make it go away.
Wrong. From the GCC texinfo documentation:
" Less obvious expressions are where something which looks like an access
is used in a void context. An example would be,
volatile int *src = SOMEVALUE;
*src;
With C, such expressions are rvalues, and as rvalues cause a read of
the object, GCC interprets this as a read of the volatile being pointed
to. "
So, did you report the bug to the GCC maintainers?
-- Sergei.
Alan Cox <[email protected]> wrote:
> > The LOCK and UNLOCK functions presumably make at least one memory write apiece
> > to manipulate the target lock (on SMP at least).
>
> No they merely perform the bus transactions neccessary to perform an
> update atomically. They are however "serializing" instructions which
> means they do cause a certain amount of serialization (see the intel
> architecture manual on serializing instructions for detail).
>
> Athlon and later know how to turn it from locked memory accesses into
> merely an exclusive cache line grab.
So, you're saying that the LOCK and UNLOCK primitives don't actually modify
memory, but rather simply pin the cacheline into the CPU's cache and refuse to
let anyone else touch it?
No... it can't work like that. It *must* make a memory modification - after
all, the CPU doesn't know that what it's doing is a spin_unlock(), say, rather
than an atomic_set().
David
On Thursday 09 March 2006 04:45, you wrote:
> ... then I can remove the sync from write*, which would be nice, and
> make mmiowb() be a sync. I wonder how long we're going to spend
> chasing driver bugs after that, though. :)
Can you do a patch, which does the change, so people can actually
test their drivers?
--
Greetings Michael.
David Howells <[email protected]> writes:
[...]
> +=======================================
> +LINUX KERNEL COMPILER BARRIER FUNCTIONS
> +=======================================
> +
> +The Linux kernel has an explicit compiler barrier function that prevents the
> +compiler from moving the memory accesses either side of it to the other side:
> +
> + barrier();
> +
> +This has no direct effect on the CPU, which may then reorder things however it
> +wishes.
> +
> +In addition, accesses to "volatile" memory locations and volatile asm
> +statements act as implicit compiler barriers.
This last statement seems to contradict what the GCC manual says about
volatile asm statements:
"You can prevent an `asm' instruction from being deleted by writing the
keyword `volatile' after the `asm'. [...]
The `volatile' keyword indicates that the instruction has important
side-effects. GCC will not delete a volatile `asm' if it is reachable.
(The instruction can still be deleted if GCC can prove that
control-flow will never reach the location of the instruction.) *Note
that even a volatile `asm' instruction can be moved relative to other
code, including across jump instructions.*"
I think that volatile memory locations aren't compiler barriers either,
-- GCC only guarantees that it won't remove the access and that it won't
re-arrange the access w.r.t. other *volatile* accesses. On the other
hand, barrier() indeed prevents *any* memory access from being moved
across the barrier.
-- Sergei.
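[ For reference, the kernel's barrier() under GCC is a volatile asm with an
  explicit "memory" clobber; it is the clobber, not the volatile, that stops
  the compiler from moving memory accesses across it:

	#define barrier() __asm__ __volatile__("": : :"memory")
]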
On Iau, 2006-03-09 at 11:41 +0000, David Howells wrote:
> Alan Cox <[email protected]> wrote:
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> let anyone else touch it?
Basically yes
> No... it can't work like that. It *must* make a memory modification
Then you'll have to argue with the chip designers because it doesn't.
Its all built around the cache coherency. To make a write to a cache
line I must be the sole owner of the line. Look up "MESI cache" in a
good book on the subject.
If we own the affected line then we can update just the cache and be
sure that since we own the cache line and we will write it back if
anyone else asks for it (or nowadays on some systems transfer it direct
to the other cpu) that we get locked semantics
Linus Torvalds <[email protected]> wrote:
> > A spin_lock does show up on the bus, doesn't it?
>
> Nope.
Yes, sort of, under some circumstances. If the CPU doing the spin_lock()
doesn't own the cacheline with the lock, it'll have to resort to the bus to
grab the cacheline from the current owner (so another CPU would at least see a
read).
The effect of the spin_lock() might not be seen outside of the CPU before the
spin_unlock() occurs, but it *will* be committed to the CPU's cache, and given
cache coherency mechanisms, that's effectively the same as main memory.
So it's in effect visible on the bus, given that it will be transferred to
another CPU when requested; and as long as the other CPUs expect to see the
effects and the ordering imposed, it's immaterial whether the content of the
spinlock is actually ever committed to SDRAM or whether it remains perpetually
in one or another's CPU cache.
David
Alan Cox <[email protected]> wrote:
> > Alan Cox <[email protected]> wrote:
> > So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> > memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> > let anyone else touch it?
>
> Basically yes
What you said is incomplete: the cacheline is wangled into the Exclusive
state, and there it sits until modified (at which point it shifts to the
Modified state) or stolen (when it shifts to the Shared state). Whilst the x86
CPU might pin it there for the duration of the execution of the locked
instruction, it can't leave it there until it detects a spin_unlock() or
equivalent.
I guess LL/SC and LWARX/STWCX work by the reserved load wangling the cacheline
into the Exclusive state, and then the conditional store only doing the store
if the cacheline is still in that state. I don't know whether the conditional
store may modify a cacheline that's in the Modified state, but I'd guess you'd
need more state than that, because you have to pair it with a load reserved.
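As a sketch of that guess, where load_reserved() and store_conditional() are
hypothetical stand-ins for LWARX/STWCX or LL/SC:

	extern int load_reserved(int *p);	/* line -> Exclusive, sets
						 * the reservation */
	extern int store_conditional(int *p, int v); /* succeeds only if the
						      * reservation held */

	void atomic_add_sketch(int i, int *v)
	{
		int tmp;

		do {
			tmp = load_reserved(v);
			tmp += i;
		} while (!store_conditional(v, tmp));	/* retry if the line
							 * was stolen */
	}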
With inter-CPU memory barriers I think you have to consider the cache to be
part of the memory, not part of the CPU. The CPU _does_ make a memory
modification;
it's just that it doesn't proceed any further than the cache, until the cache
coherency mechanisms transfer the change to another CPU, or until the cache
becomes full and the lock's line gets ejected.
> > No... it can't work like that. It *must* make a memory modification
>
> Then you'll have to argue with the chip designers because it doesn't.
>
> Its all built around the cache coherency. To make a write to a cache
> line I must be the sole owner of the line. Look up "MESI cache" in a
> good book on the subject.
http://en.wikipedia.org/wiki/MESI_protocol
And a picture of the state machine may be found here:
https://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/MESIHelp.htm
David
I'm thinking of adding the attached to the document. Any comments or
objections?
David
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 6eeb7e4..f9a9192 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -4,6 +4,8 @@
Contents:
+ (*) What do we consider memory?
+
(*) What are memory barriers?
(*) Where are memory barriers needed?
@@ -32,6 +34,82 @@ Contents:
(*) References.
+===========================
+WHAT DO WE CONSIDER MEMORY?
+===========================
+
+For the purpose of this specification, "memory", at least as far as cached CPU
+vs CPU interactions go, has to include the CPU caches in the system. Although
+any particular read or write may not actually appear outside of the CPU that
+issued it because the CPU was able to satisfy it from its own cache, it's still
+as if the memory access had taken place as far as the other CPUs are concerned
+since the cache coherency and ejection mechanisms will propagate the effects
+upon conflict.
+
+Consider the system logically as:
+
+ <--- CPU ---> : <----------- Memory ----------->
+ :
+ +--------+ +--------+ : +--------+ +-----------+
+ | | | | : | | | | +---------+
+ | CPU | | Memory | : | CPU | | | | |
+ | Core |--->| Access |----->| Cache |<-->| | | |
+ | | | Queue | : | | | |--->| Memory |
+ | | | | : | | | | | |
+ +--------+ +--------+ : +--------+ | | | |
+ : | Cache | +---------+
+ : | Coherency |
+ : | Mechanism | +---------+
+ +--------+ +--------+ : +--------+ | | | |
+ | | | | : | | | | | |
+ | CPU | | Memory | : | CPU | | |--->| Device |
+ | Core |--->| Access |----->| Cache |<-->| | | |
+ | | | Queue | : | | | | | |
+ | | | | : | | | | +---------+
+ +--------+ +--------+ : +--------+ +-----------+
+ :
+ :
+
+The CPU core may execute instructions in any order it deems fit, provided the
+expected program causality appears to be maintained. Some of the instructions
+generate load and store operations which then go into the memory access queue
+to be performed. The core may place these in the queue in any order it wishes,
+and continue execution until it is forced to wait for an instruction to
+complete.
+
+What memory barriers are concerned with is controlling the order in which
+accesses cross from the CPU side of things to the memory side of things, and
+the order in which the effects are perceived to happen by the other observers
+in the system.
+
+
+Note that the above model does not show uncached memory or I/O accesses. These
+proceed directly from the queue to the memory or the devices, bypassing any
+cache coherency:
+
+ <--- CPU ---> :
+ : +-----+
+ +--------+ +--------+ : | |
+ | | | | : | | +---------+
+ | CPU | | Memory | : | | | |
+ | Core |--->| Access |--------------->| | | |
+ | | | Queue | : | |------------->| Memory |
+ | | | | : | | | |
+ +--------+ +--------+ : | | | |
+ : | | +---------+
+ : | Bus |
+ : | | +---------+
+ +--------+ +--------+ : | | | |
+ | | | | : | | | |
+ | CPU | | Memory | : | |<------------>| Device |
+ | Core |--->| Access |--------------->| | | |
+ | | | Queue | : | | | |
+ | | | | : | | +---------+
+ +--------+ +--------+ : | |
+ : +-----+
+ :
+
+
=========================
WHAT ARE MEMORY BARRIERS?
=========================
@@ -448,8 +526,8 @@ In all cases there are variants on a LOC
The LOCK accesses will be completed before the UNLOCK accesses.
-And therefore an UNLOCK followed by a LOCK is equivalent to a full barrier, but
-a LOCK followed by an UNLOCK isn't.
+ Therefore an UNLOCK followed by a LOCK is equivalent to a full barrier,
+ but a LOCK followed by an UNLOCK is not.
Locks and semaphores may not provide any guarantee of ordering on UP compiled
systems, and so can't be counted on in such a situation to actually do anything
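To illustrate the reworded hunk, a sketch (A, B, C, D and the lock are
illustrative):

	*A = a;
	spin_lock(&lock);	/* the store to A may drift down past the LOCK */
	spin_unlock(&lock);	/* and the store to B may drift up past the
				 * UNLOCK, so A and B may cross inside */
	*B = b;

	*C = c;
	spin_unlock(&lock);	/* C must complete before the UNLOCK */
	spin_lock(&lock);	/* and D must wait for the LOCK, so C and D
				 * cannot cross: a full-barrier effect */
	*D = d;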
On Thu, Mar 09, 2006 at 12:01:45PM +1100, Paul Mackerras wrote:
> If you do:
>
>         CPU 0                           CPU 1
>
>         foo = val;
>         wmb();
>         p = &foo;
>                                         reg = p;
>                                         bar = *reg;
>
> it is apparently possible for CPU 1 to see the new value of p
> (i.e. &foo) but an old value of foo (i.e. not val). This can happen
> if p and foo are in different halves of the cache on CPU 1, and there
> are a lot of updates coming in for the half containing foo but the
> half containing p is quiet.
Indeed, this can happen according to the architecture reference manual,
so CPU 1 needs mb() as well.
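Spelled out as a sketch (val is illustrative; the point is that the wmb() on
CPU 0 must be paired with a barrier on CPU 1):

	extern int val;
	int foo, bar;
	int *p;

	void cpu0(void)			/* the writer */
	{
		foo = val;
		wmb();			/* commit foo before publishing p */
		p = &foo;
	}

	void cpu1(void)			/* the reader */
	{
		int *reg = p;
		mb();			/* pairs with CPU 0's wmb(); without
					 * it, a split cache may deliver the
					 * new p but the stale foo */
		bar = *reg;
	}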
Thanks for clarification.
Ivan.
On Thu, 9 Mar 2006, David Howells wrote:
>
> So, you're saying that the LOCK and UNLOCK primitives don't actually modify
> memory, but rather simply pin the cacheline into the CPU's cache and refuse to
> let anyone else touch it?
>
> No... it can't work like that. It *must* make a memory modification - after
> all, the CPU doesn't know that what it's doing is a spin_unlock(), say, rather
> than an atomic_set().
Basically, as long as nobody else is reading the lock, the lock will stay
in the caches.
Only old and stupid architectures go out to the bus for locking. For
example, I remember the original alpha "load-locked"/"store-conditional",
and it was totally _horrible_ for anything that wanted performance,
because it would do the "pending lock" bit on the bus, so it took hundreds
of cycles even on UP. Gods, how I hated that. It made it almost totally
useless for anything that just wanted to be irq-safe - it was cheaper to
just disable interrupts, iirc. STUPID.
All modern CPUs do atomic operations entirely within the cache coherency
logic. I think x86 still supports the notion of a "locked cycle" on the
bus, but I think that's entirely relegated to horrible people doing locked
operations across PCI, and quite frankly, I suspect that it doesn't
actually mean a thing (ie I'd expect no external hardware to actually
react to the lock signal). However, nobody really cares, since nobody
would be crazy enough to do locked cycles over PCI even if they were to
work.
So in practice, as far as I know, the way _all_ modern CPUs do locked
cycles is by getting exclusive ownership of the cacheline on the read,
and either having logic in place to refuse to release the cacheline
until the write is complete (ie "locked cycles to the cache"), or
re-trying the instruction if the cacheline has been released by the
time the write is ready (ie "load-locked" + "store-conditional" +
"potentially loop" to the cache).
NOBODY goes out to the bus for locking any more. That would be insane and
stupid.
Yes, many spinlocks see contention, and end up going out to the bus. But
similarly, many spinlocks do _not_ see any contention at all (or other
CPUs even looking at them), and may end up staying exclusive in a CPU
cache for a long time.
The "no contention" case is actually pretty important. Many real loads on
SMP end up being largely single-threaded, and together with some basic CPU
affinity, you really _really_ want to make that single-threaded case go as
fast as possible. And a pretty big part of that is locking: the difference
between a lock that goes to the bus and one that does not is _huge_.
And lots of trivial code is almost dominated by locking costs. In some
system calls on an SMP kernel, the locking cost can be (depending on how
good or bad the CPU is at them) quite noticeable. Just a simple small
read() will take several locks and/or do atomic ops, even if it was cached
and it looks "trivial".
Linus
Linus Torvalds <[email protected]> wrote:
> Basically, as long as nobody else is reading the lock, the lock will stay
> in the caches.
I think for the purposes of talking about memory barriers, we consider the
cache to be part of the memory since the cache coherency mechanisms will give
the same effect.
I suppose the way the cache can be viewed as working is that bits of memory
are shuttled around between the CPUs, RAM and any other devices that partake
of the coherency mechanism.
> All modern CPUs do atomic operations entirely within the cache coherency
> logic.
I know that, and I think it's irrelevant to specifying memory barriers.
> I think x86 still supports the notion of a "locked cycle" on the
> bus,
I wonder if that's what XCHG and XADD do... There's no particular reason they
should be that much slower than LOCK INCL/DECL. Of course, I've only measured
this on my Dual-PPro test box, so other i386 arch CPUs may exhibit other
behaviour.
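For what it's worth, XCHG with a memory operand is implicitly locked on x86
whether or not a LOCK prefix is given, which may account for some of the
cost. A sketch of the classic xchg-based lock acquire (illustrative only;
the real i386 spin_lock has more to it):

	void spin_lock_sketch(volatile int *lock)
	{
		int old;

		do {
			asm volatile("xchgl %0,%1"	/* implicitly locked */
				     : "=r" (old), "+m" (*lock)
				     : "0" (1));
		} while (old);		/* old == 0 means we took the lock */
	}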
David
On Thu, 9 Mar 2006, David Howells wrote:
>
> I think for the purposes of talking about memory barriers, we consider the
> cache to be part of the memory since the cache coherency mechanisms will give
> the same effect.
Yes and no.
The yes comes from the normal "smp_xxx()" barriers. As far as they are
concerned, the cache coherency means that caches are invisible.
The "no" comes from the IO side. Basically, since IO bypasses caches and
sometimes write buffers, it's simply not ordered wrt normal accesses.
And that's where "bus cycles" actually matter wrt barriers. If you have a
barrier that creates a bus cycle, it suddenly can be ordered wrt IO.
So the fact that x86 SMP ops basically never guarantee any bus cycles
basically means that they are fundamentally no-ops when it comes to IO
serialization. That was really my only point.
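For illustration of why that matters to a driver (desc, buffer_dma, len and
doorbell are hypothetical): a DMA descriptor set up in coherent memory must
be ordered against the MMIO doorbell with a mandatory barrier, not an
smp_*() one:

	desc->addr = buffer_dma;	/* set up the descriptor in RAM */
	desc->len  = len;
	wmb();				/* mandatory barrier: must order the
					 * memory writes before the IO write
					 * even on UP; smp_wmb() need not */
	writel(1, doorbell);		/* MMIO write that starts the DMA */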
> > I think x86 still supports the notion of a "locked cycle" on the
> > bus,
>
> I wonder if that's what XCHG and XADD do... There's no particular reason they
> should be that much slower than LOCK INCL/DECL. Of course, I've only measured
> this on my Dual-PPro test box, so other i386 arch CPUs may exhibit other
> behaviour.
I think it's an internal core implementation detail. I don't think they do
anything on the bus, but I suspect that they could easily generate less
optimized uops, simply because they didn't matter as much and didn't fit
the "normal" core uop sequence.
Linus
On Thu, 9 Mar 2006, Linus Torvalds wrote:
>
> So the fact that x86 SMP ops basically never guarantee any bus cycles
> basically means that they are fundamentally no-ops when it comes to IO
> serialization. That was really my only point.
Side note: of course, locked cycles _do_ "serialize" the core. So they'll
stop at least the core's write merging and speculative reads, and thus have
some impact on IO; but they have no way of impacting things like write
posting etc. that happen outside the CPU.
Linus
Followup to: <[email protected]>
By author: David Howells <[email protected]>
In newsgroup: linux.dev.kernel
>
> However, on i386, for example, you've actually got at least two different I/O
> access domains, and I don't know how they impinge upon each other (IN/OUT vs
> MOV).
>
You do, but those aren't the ones.
What you have is instead MOVNT versus everything else. IN/OUT are
total sledgehammers: they imply not only nonposted operation, but also
that the instruction waits for completion; this is required since
IN/OUT support emulation via SMI.
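As a sketch of the MOVNT point: non-temporal stores are weakly ordered even
on x86, so they need an explicit SFENCE before anything that relies on them
being visible:

	void store_nt(int *p, int val)
	{
		asm volatile("movnti %1,%0"	/* non-temporal store */
			     : "=m" (*p)
			     : "r" (val));
		asm volatile("sfence" ::: "memory");	/* order it wrt
							 * later stores */
	}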
-hpa