2006-03-03 16:04:18

by David Howells

Subject: Memory barriers and spin_unlock safety


Hi,

We've just had an interesting discussion on IRC, and it has thrown up two
unanswered questions:

(1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where ?FENCE
instructions are available?

Consider the following case, where you want to do two reads effectively
atomically, and so wrap them in a spinlock:

	spin_lock(&mtx);
	a = *A;
	b = *B;
	spin_unlock(&mtx);

On x86 Pentium3+ and x86_64, what's to stop you from getting the reads
done after the unlock since there's no LFENCE instruction there to stop
you?

What you'd expect is:

	LOCK WRITE mtx
	--> implies MFENCE
	READ *A     } which may be reordered
	READ *B     }
	WRITE mtx

But what you might get instead is this:

	LOCK WRITE mtx
	--> implies MFENCE
	WRITE mtx
	--> implies SFENCE
	READ *A     } which may be reordered
	READ *B     }

There doesn't seem to be anything that says that the reads can't leak
outside of the locked section; at least, there doesn't in AMD's system
programming manual for AMD64 (book 2, section 7.1).

Writes on the other hand may not happen out of order, so changing things
inside a critical section would seem to be okay.

On PowerPC, on the other hand, the barriers have to be made explicit
because they're not implied by LWARX/STWCX or by ordinary stores:

	LWARX mtx
	STWCX mtx
	ISYNC
	READ *A     } which may be reordered
	READ *B     }
	LWSYNC
	WRITE mtx

So, should the spin_unlock() on i386 and x86_64 be doing an LFENCE
instruction before unlocking?
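
For reference, the unlock itself on i386 and x86_64 today is just a plain,
non-LOCKed store plus a compiler clobber, roughly this (illustrative helper
name; it mirrors the current __raw_spin_unlock_string):

        static inline void example_raw_spin_unlock(raw_spinlock_t *lock)
        {
                __asm__ __volatile__("movl $1,%0"
                                     : "=m" (lock->slock) : : "memory");
        }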


(2) What is the minimum functionality that can be expected of a memory
barrier? I was of the opinion that all we could expect is for the CPU
executing one of them to force the instructions it is executing to be
complete up to a point - depending on the type of barrier - before
continuing past it.

On pentiums, x86_64, and frv this seems to be exactly what you get for a
barrier; there doesn't seem to be any external evidence of it that appears
on the bus, other than that the CPU does a load of memory transactions.

However, on ppc/ppc64, it seems to be more thorough than that, and there
seems to be some special interaction between the CPU processing the
instruction and the other CPUs in the system. It's not entirely obvious
from the manual just what this does.

As I understand it, Andrew Morton is of the opinion that issuing a read
barrier on one CPU will cause the other CPUs in the system to sync up, but
that doesn't look likely on all archs.

David


2006-03-03 16:46:09

by David Howells

Subject: Re: Memory barriers and spin_unlock safety

David Howells <[email protected]> wrote:

> WRITE mtx
> --> implies SFENCE

Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of
SFENCE implies that writes can be reordered, which sort of contradicts what
the AMD64 System Programming Manual says.

If this isn't true, then x86_64 at least should do MFENCE before the store in
spin_unlock() or change the store to be LOCK'ed. The same may also apply for
Pentium3+ class CPUs with the i386 arch.

David

2006-03-03 16:55:58

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Fri, 3 Mar 2006, David Howells wrote:
>
> We've just had an interesting discussion on IRC, and it has thrown up two
> unanswered questions:
>
> (1) Is spin_unlock() entirely safe on Pentium3+ and x86_64 where ?FENCE
> instructions are available?
>
> Consider the following case, where you want to do two reads effectively
> atomically, and so wrap them in a spinlock:
>
> spin_lock(&mtx);
> a = *A;
> b = *B;
> spin_unlock(&mtx);
>
> On x86 Pentium3+ and x86_64, what's to stop you from getting the reads
> done after the unlock since there's no LFENCE instruction there to stop
> you?

The rules are, afaik, that reads can pass buffered writes, BUT WRITES
CANNOT PASS READS (aka "writes to memory are always carried out in program
order").

IOW, reads can bubble up, but writes cannot.

So the way I read the Intel rules is that "passing" is always about being
done earlier than otherwise allowed, not about being done later.

(You only "pass" somebody in traffic when you go ahead of them. If you
fall behind them, you don't "pass" them, _they_ pass you).

Now, this is not so much meant to be a semantic argument (the meaning of
the word "pass") as an explanation of what I believe Intel meant, since
we know from Intel designers that the simple non-atomic write is
supposedly a perfectly fine unlock instruction.

So when Intel says "reads can be carried out speculatively and in any
order", that just says that reads are not ordered wrt other _reads_. They
_are_ ordered wrt other writes, but only one way: they can pass an earlier
write, but they can't fall back behind a later one.

This is consistent with
(a) optimization (you want to do reads _early_, not late)
(b) behaviour (we've been told that a single write is sufficient, with
the exception of an early P6 core revision)
(c) at least one way of reading the documentation.

And I claim that (a) and (b) are the important parts, and that (c) is just
the rationale.
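
If you want to see the one reordering that _is_ allowed, the classic
store-buffer litmus test shows it. A user-space sketch (not kernel code,
and the interesting outcome is only occasionally observable):

        /* Each thread stores to one variable and then loads the other.  On
         * x86 the load may be satisfied before the buffered store becomes
         * visible, so r0 == 0 && r1 == 0 can show up: the later read has
         * passed the earlier buffered write. */
        #include <pthread.h>
        #include <stdio.h>

        static volatile int X, Y, r0, r1;

        static void *t0(void *arg) { X = 1; r0 = Y; return NULL; }
        static void *t1(void *arg) { Y = 1; r1 = X; return NULL; }

        int main(void)
        {
                int i, seen = 0;

                for (i = 0; i < 1000000; i++) {
                        pthread_t a, b;

                        X = Y = 0;
                        pthread_create(&a, NULL, t0, NULL);
                        pthread_create(&b, NULL, t1, NULL);
                        pthread_join(a, NULL);
                        pthread_join(b, NULL);
                        if (r0 == 0 && r1 == 0)
                                seen++;
                }
                printf("store->load reordering seen %d times\n", seen);
                return 0;
        }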

> (2) What is the minimum functionality that can be expected of a memory
> barrier? I was of the opinion that all we could expect is for the CPU
> executing one of them to force the instructions it is executing to be
> complete up to a point - depending on the type of barrier - before
> continuing past it.

Well, no. You should expect even _less_.

The core can continue doing things past a barrier. For example, a write
barrier may not actually serialize anything at all: the sane way of doing
write barriers is to just put a note in the write-queue, and that note
just disallows write queue entries from being moved around it. So you
might have a write barrier with two writes on either side, and the writes
might _both_ be outstanding wrt the core despite the barrier.

So there's not necessarily any synchronization at all on an execution core
level, just a partial ordering between the resulting actions of the core.

> However, on ppc/ppc64, it seems to be more thorough than that, and there
> seems to be some special interaction between the CPU processing the
> instruction and the other CPUs in the system. It's not entirely obvious
> from the manual just what this does.

PPC has an absolutely _horrible_ memory ordering implementation, as far as
I can tell. The thing is broken. I think it's just implementation
breakage, not anything really fundamental, but the fact that their write
barriers are expensive is a big sign that they are doing something bad.

For example, their write buffers may not have a way to serialize in the
buffers, and at that point from an _implementation_ standpoint, you just
have to serialize the whole core to make sure that writes don't pass each
other.

> As I understand it, Andrew Morton is of the opinion that issuing a read
> barrier on one CPU will cause the other CPUs in the system to sync up, but
> that doesn't look likely on all archs.

No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
other CPU. All barriers are purely local to one CPU, and do not generate
any bus traffic what-so-ever. They only potentially affect the order of
bus traffic due to the instructions around them (obviously).

So a read barrier on one CPU _has_ to be paired with a write barrier on
the other side in order to make sense (although the write barrier can
obviously be of the implied kind, ie a lock/unlock event, or just
architecture-specific knowledge of write behaviour, ie for example knowing
that writes are always seen in-order on x86).
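
The canonical pairing looks something like this in code (a sketch with
made-up names, using the normal kernel primitives):

        struct msg {
                int data;
                int ready;
        };

        /* CPU 0: producer */
        static void publish(struct msg *m, int val)
        {
                m->data = val;
                smp_wmb();              /* order the data store before the flag store */
                m->ready = 1;
        }

        /* CPU 1: consumer */
        static int consume(struct msg *m)
        {
                while (!m->ready)
                        cpu_relax();
                smp_rmb();              /* order the flag read before the data read */
                return m->data;         /* guaranteed to see the value published above */
        }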

Linus

2006-03-03 17:03:21

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Fri, 3 Mar 2006, David Howells wrote:

> David Howells <[email protected]> wrote:
>
> > WRITE mtx
> > --> implies SFENCE
>
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of
> SFENCE implies that writes can be reordered, which sort of contradicts what
> the AMD64 System Programming Manual says.

Note that _normal_ writes never need an SFENCE, because they are ordered
by the core.

The reason to use SFENCE is because of _special_ writes.

For example, if you use a non-temporal store, then the write buffer
ordering goes away, because there is no write buffer involved (the store
goes directly to the L2 or outside the bus).

Or when you talk to weakly ordered memory (ie a frame buffer that isn't
cached, and where the MTRR memory ordering bits say that writes can be done
speculatively), you may want to say "I'm going to do the store that starts
the graphics pipeline, all my previous stores need to be done now".

THAT is when you need to use SFENCE.

So SFENCE really isn't about the "smp_wmb()" kind of fencing at all. It's
about the much weaker ordering that is allowed by the special IO memory
types and nontemporal instructions.

(Actually, I think one special case of non-temporal instruction is the
"repeat movs/stos" thing: I think you should _not_ use a "repeat stos" to
unlock a spinlock, exactly because those stores are not ordered wrt each
other, and they can bypass the write queue. Of course, doing that would
be insane anyway, so no harm done ;^).
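
A user-space illustration of the same point, using the SSE2 intrinsics
(function and variable names here are just an example):

        #include <emmintrin.h>

        /* movnti bypasses the normal store ordering, so without the sfence
         * the flag store below could become visible before the data does. */
        static void publish_nontemporal(int *data, volatile int *flag, int val)
        {
                _mm_stream_si32(data, val);     /* non-temporal (weakly ordered) store */
                _mm_sfence();                   /* fence the NT store ... */
                *flag = 1;                      /* ... before this ordinary store */
        }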

> If this isn't true, then x86_64 at least should do MFENCE before the store in
> spin_unlock() or change the store to be LOCK'ed. The same may also apply for
> Pentium3+ class CPUs with the i386 arch.

No. But if you want to make sure, you can always check with Intel
engineers. I'm pretty sure I have this right, though, because Intel
engineers have certainly looked at Linux sources and locking, and nobody
has ever said that we'd need an SFENCE.

Linus

2006-03-03 20:02:56

by Arjan van de Ven

Subject: Re: Memory barriers and spin_unlock safety

On Fri, 2006-03-03 at 16:45 +0000, David Howells wrote:
> David Howells <[email protected]> wrote:
>
> > WRITE mtx
> > --> implies SFENCE
>
> Actually, I'm not sure this is true. The AMD64 Instruction Manual's writeup of
> SFENCE implies that writes can be reordered, which sort of contradicts what
> the AMD64 System Programming Manual says.

there are 2 or 3 special instructions which do "non-temporal
stores" (movntq and movnti and maybe one more). sfence is designed for
those.


2006-03-03 20:16:09

by David Howells

Subject: Re: Memory barriers and spin_unlock safety

Linus Torvalds <[email protected]> wrote:

> The rules are, afaik, that reads can pass buffered writes, BUT WRITES
> CANNOT PASS READS (aka "writes to memory are always carried out in program
> order").

So in the example I gave, a read after the spin_unlock() may actually get
executed before the store in the spin_unlock(), but a read before the unlock
will not get executed after.

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
> other CPU.

Well, I think you mean it will guarantee absolutely _nothing_ on the other CPU for
the Linux kernel. According to the IBM powerpc book I have, it does actually
do something on the other CPUs, though it doesn't say exactly what.

Anyway, thanks.

I'll write up some documentation on barriers for inclusion in the kernel.

David

2006-03-03 20:17:28

by David Howells

Subject: Re: Memory barriers and spin_unlock safety

Linus Torvalds <[email protected]> wrote:

> Note that _normal_ writes never need an SFENCE, because they are ordered
> by the core.
>
> The reason to use SFENCE is because of _special_ writes.

I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and that
only io_wmb() should have that.

David

2006-03-03 21:06:58

by Benjamin Herrenschmidt

Subject: Re: Memory barriers and spin_unlock safety


> PPC has an absolutely _horrible_ memory ordering implementation, as far as
> I can tell. The thing is broken. I think it's just implementation
> breakage, not anything really fundamental, but the fact that their write
> barriers are expensive is a big sign that they are doing something bad.

Are they? Read barriers and full barriers are, write barriers should be
fairly cheap (but then, I haven't measured).

> For example, their write buffers may not have a way to serialize in the
> buffers, and at that point from an _implementation_ standpoint, you just
> have to serialize the whole core to make sure that writes don't pass each
> other.

The main problem I've had in the past with the ppc barriers is more a
subtle thing in the spec that unfortunately was taken to the letter by
implementors, and is that the simple write barrier (eieio) will only
order within the same storage space, that is, it will not order between
cacheable and non-cacheable storage. That means IOs could leak out of
locks etc., which is why we use expensive barriers in MMIO wrappers for
now (though we might investigate the use of mmioXb instead in the
future).

> No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
> other CPU. All barriers are purely local to one CPU, and do not generate
> any bus traffic what-so-ever. They only potentially affect the order of
> bus traffic due to the instructions around them (obviously).

Actually, the ppc's full barrier (sync) will generate bus traffic, and I
think in some cases eieio barriers can propagate to the chipset to
enforce ordering there too depending on some voodoo settings and whether
the storage space is cacheable or not.

> So a read barrier on one CPU _has_ to be paired with a write barrier on
> the other side in order to make sense (although the write barrier can
> obviously be of the implied kind, ie a lock/unlock event, or just
> architecture-specific knowledge of write behaviour, ie for example knowing
> that writes are always seen in-order on x86).
>
> Linus

2006-03-03 21:18:36

by Hollis Blanchard

Subject: Re: Memory barriers and spin_unlock safety

On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote:
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the letter by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is, it will not order between
> cacheable and non-cacheable storage.

I've heard Sparc has the same issue... in which case it may not be a "chip
designer was too literal" thing, but rather it really simplifies chip
implementation to do it that way.

--
Hollis Blanchard
IBM Linux Technology Center

2006-03-03 21:34:43

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Fri, 3 Mar 2006, David Howells wrote:
>
> So in the example I gave, a read after the spin_unlock() may actually get
> executed before the store in the spin_unlock(), but a read before the unlock
> will not get executed after.

Yes.

> > No. Issuing a read barrier on one CPU will do absolutely _nothing_ on the
> > other CPU.
>
> Well, I think you mean it will guarantee absolutely _nothing_ on the other CPU for
> the Linux kernel. According to the IBM powerpc book I have, it does actually
> do something on the other CPUs, though it doesn't say exactly what.

Yeah, Power really does have some funky stuff in their memory ordering.
I'm not quite sure why, though. And it definitely isn't implied by any of
the Linux kernel barriers.

(They also do TLB coherency in hw, among other strange things.)

Linus

2006-03-03 21:34:54

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Fri, 3 Mar 2006, David Howells wrote:
>
> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and that
> only io_wmb() should have that.

Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie
just compile to a "barrier()" (and not even that on UP, of course).
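
What that would look like, roughly (the barrier() definition is the usual
compiler-only fence; exact header layout glossed over):

        /* compiler barrier: emits no instruction, just stops gcc from
         * reordering or caching memory accesses across it */
        #define barrier()       __asm__ __volatile__("" : : : "memory")

        /* plain x86-64 stores are already ordered by the core, so: */
        #define smp_wmb()       barrier()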

Linus

2006-03-03 21:52:34

by David Miller

Subject: Re: Memory barriers and spin_unlock safety

From: Hollis Blanchard <[email protected]>
Date: Fri, 3 Mar 2006 15:18:13 -0600

> On Friday 03 March 2006 15:06, Benjamin Herrenschmidt wrote:
> > The main problem I've had in the past with the ppc barriers is more a
> > subtle thing in the spec that unfortunately was taken to the letter by
> > implementors, and is that the simple write barrier (eieio) will only
> > order within the same storage space, that is, it will not order between
> > cacheable and non-cacheable storage.
>
> I've heard Sparc has the same issue... in which case it may not be a "chip
> designer was too literal" thing, but rather it really simplifies chip
> implementation to do it that way.

There is a "membar #MemIssue" that is meant to deal with this should
it ever matter, but for most sparc64 chips it doesn't, which is why we
don't use that memory barrier type at all in the Linux kernel.

For UltraSPARC-I and II it technically could matter in Relaxed Memory
Ordering (RMO) mode which is what we run the kernel and 64-bit
userspace in, but I've never seen an issue resulting from it.

For UltraSPARC-III and later, the chip only implements the Total Store
Ordering (TSO) memory model and the manual explicitly states that
cacheable and non-cacheable memory operations are ordered, even using
language such as "there is an implicit 'membar #MemIssue' between
them". It further goes on to say:

	The UltraSPARC III Cu processor maintains ordering between
	cacheable and non-cacheable accesses. The UltraSPARC III Cu
	processor maintains TSO ordering between memory references
	regardless of their cacheability.

Niagara behaves almost identically to UltraSPARC-III in this area.
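
For reference, that barrier as sparc64 inline asm would just be something
like the following (invented macro name) - though, as above, the kernel has
no need for it:

        #define membar_memissue() \
                __asm__ __volatile__("membar #MemIssue" : : : "memory")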

2006-03-03 22:02:13

by Benjamin LaHaise

Subject: Re: Memory barriers and spin_unlock safety

On Fri, Mar 03, 2006 at 01:34:17PM -0800, Linus Torvalds wrote:
> Indeed. I think smp_wmb() should be a compiler fence only on x86(-64), ie
> just compile to a "barrier()" (and not even that on UP, of course).

Actually, no. At least in testing an implementation of Dekker's and
Peterson's algorithms as a replacement for the locked operation in
our spinlocks, it is absolutely necessary to have an sfence in the lock
to ensure the lock is visible to the other CPU before proceeding. I'd
use smp_wmb() as the fence is completely unnecessary on UP and is even
irq-safe. Here's a copy of the Peterson's implementation to illustrate
(it works, it's just slower than the existing spinlocks).

-ben

diff --git a/include/asm-x86_64/spinlock.h b/include/asm-x86_64/spinlock.h
index fe484a6..45bd386 100644
--- a/include/asm-x86_64/spinlock.h
+++ b/include/asm-x86_64/spinlock.h
@@ -4,6 +4,8 @@
#include <asm/atomic.h>
#include <asm/rwlock.h>
#include <asm/page.h>
+#include <asm/pda.h>
+#include <asm/processor.h>
#include <linux/config.h>

/*
@@ -18,50 +20,53 @@
*/

#define __raw_spin_is_locked(x) \
- (*(volatile signed int *)(&(x)->slock) <= 0)
-
-#define __raw_spin_lock_string \
- "\n1:\t" \
- "lock ; decl %0\n\t" \
- "js 2f\n" \
- LOCK_SECTION_START("") \
- "2:\t" \
- "rep;nop\n\t" \
- "cmpl $0,%0\n\t" \
- "jle 2b\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END
-
-#define __raw_spin_unlock_string \
- "movl $1,%0" \
- :"=m" (lock->slock) : : "memory"
+ ((*(volatile signed int *)(x) & ~0xff) != 0)

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
- __asm__ __volatile__(
- __raw_spin_lock_string
- :"=m" (lock->slock) : : "memory");
+ int cpu = read_pda(cpunumber);
+
+ barrier();
+ lock->flags[cpu] = 1;
+ lock->turn = cpu ^ 1;
+ barrier();
+
+ asm volatile("sfence":::"memory");
+
+ while (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+ cpu_relax();
+ barrier();
+ }
}

#define __raw_spin_lock_flags(lock, flags) __raw_spin_lock(lock)

static inline int __raw_spin_trylock(raw_spinlock_t *lock)
{
- int oldval;
-
- __asm__ __volatile__(
- "xchgl %0,%1"
- :"=q" (oldval), "=m" (lock->slock)
- :"0" (0) : "memory");
-
- return oldval > 0;
+ int cpu = read_pda(cpunumber);
+ barrier();
+ if (__raw_spin_is_locked(lock))
+ return 0;
+
+ lock->flags[cpu] = 1;
+ lock->turn = cpu ^ 1;
+ asm volatile("sfence":::"memory");
+
+ if (lock->flags[cpu ^ 1] && (lock->turn != cpu)) {
+ lock->flags[cpu] = 0;
+ barrier();
+ return 0;
+ }
+ return 1;
}

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
- __asm__ __volatile__(
- __raw_spin_unlock_string
- );
+ int cpu;
+ //asm volatile("lfence":::"memory");
+ cpu = read_pda(cpunumber);
+ lock->flags[cpu] = 0;
+ barrier();
}

#define __raw_spin_unlock_wait(lock) \
diff --git a/include/asm-x86_64/spinlock_types.h b/include/asm-x86_64/spinlock_types.h
index 59efe84..a409cbf 100644
--- a/include/asm-x86_64/spinlock_types.h
+++ b/include/asm-x86_64/spinlock_types.h
@@ -6,10 +6,11 @@
#endif

typedef struct {
- volatile unsigned int slock;
+ volatile unsigned char turn;
+ volatile unsigned char flags[3];
} raw_spinlock_t;

-#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED { 0, { 0, } }

typedef struct {
volatile unsigned int lock;

2006-03-03 22:04:41

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Sat, 4 Mar 2006, Benjamin Herrenschmidt wrote:
>
> The main problem I've had in the past with the ppc barriers is more a
> subtle thing in the spec that unfortunately was taken to the letter by
> implementors, and is that the simple write barrier (eieio) will only
> order within the same storage space, that is, it will not order between
> cacheable and non-cacheable storage.

If so, a simple write barrier should be sufficient. That's exactly what
the x86 write barriers do too, ie stores to magic IO space are _not_
ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
spin_unlock()) at all.

On x86, we actually have this "CONFIG_X86_OOSTORE" configuration option
that gets enabled if you select a WINCHIP device, because that allows a
weaker memory ordering for normal memory too, and that will end up using
an "sfence" instruction for store buffers. But it's not normally enabled.

So the eieio should be sufficient, then.

Of course, the x86 store buffers do tend to flush out stuff after a
certain cycle-delay too, so there may be drivers that technically are
buggy on x86, but where the store buffer in practice is small and flushes
out quickly enough that you'll never _see_ the bug.

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some cases eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and whether
> the storage space is cacheable or not.

Well, the regular kernel ops definitely won't depend on that, since that's
not the case anywhere else.

Linus

2006-03-03 22:22:20

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Fri, 3 Mar 2006, Benjamin LaHaise wrote:
>
> Actually, no. At least in testing an implementation of Dekker's and
> Peterson's algorithms as a replacement for the locked operation in
> our spinlocks, it is absolutely necessary to have an sfence in the lock
> to ensure the lock is visible to the other CPU before proceeding.

I suspect you have some bug in your implementation. I think Dekker's
algorithm depends on the reads and writes being ordered, and you don't
seem to do that.

The thing is, you pretty much _have_ to be wrong, because the x86-64
memory ordering rules are _exactly_ the same as for x86, and we've had
that simple store as an unlock for a long long time.

Linus

2006-03-03 22:37:00

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Fri, 3 Mar 2006, Linus Torvalds wrote:
>
> I suspect you have some bug in your implementation. I think Dekker's
> algorithm depends on the reads and writes being ordered, and you don't
> seem to do that.

IOW, I think you need a full memory barrier after the
"lock->turn = cpu ^ 1;"
and you should have a "smp_rmb()" in between your reads of
"lock->flags[cpu ^ 1]"
and
"lock->turn"
to give the ordering that Dekker (or Peterson) expects.

IOW, the code should be something like

	lock->flags[other] = 1;
	smp_wmb();
	lock->turn = other;
	smp_mb();
	while (lock->turn == cpu) {
		smp_rmb();
		if (!lock->flags[other])
			break;
	}

where the wmb's are no-ops on x86, but the rmb's certainly are not.

I _suspect_ that the fact that it starts working with an 'sfence' in there
somewhere is just because the sfence ends up being "serializing enough"
that it just happens to work, but that it has nothing to do with the
current kernel wmb() being wrong.

Linus

2006-03-04 10:58:16

by Paul Mackerras

Subject: Re: Memory barriers and spin_unlock safety

Linus Torvalds writes:

> PPC has an absolutely _horrible_ memory ordering implementation, as far as
> I can tell. The thing is broken. I think it's just implementation
> breakage, not anything really fundamental, but the fact that their write
> barriers are expensive is a big sign that they are doing something bad.

An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
wmb() be a sync though, because it seemed that there were drivers that
expected wmb() to provide an ordering between a write to memory and a
write to an MMIO register. If that is a bogus assumption then we
could make wmb() lighter-weight (after auditing all the drivers we're
interested in, of course, ...).
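
In header terms that is roughly (exact arch header details elided):

        #define smp_wmb()       __asm__ __volatile__("eieio" : : : "memory")    /* cheap */
        #define wmb()           __asm__ __volatile__("sync"  : : : "memory")    /* heavy, orders MMIO too */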

And in a subsequent message:

> If so, a simple write barrier should be sufficient. That's exactly what
> the x86 write barriers do too, ie stores to magic IO space are _not_
> ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> spin_unlock()) at all.

By magic IO space, do you mean just any old memory-mapped device
register in a PCI device, or do you mean something else?

Paul.

2006-03-04 10:58:15

by Paul Mackerras

Subject: Re: Memory barriers and spin_unlock safety

Benjamin Herrenschmidt writes:

> Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> think in some cases eieio barriers can propagate to the chipset to
> enforce ordering there too depending on some voodoo settings and whether
> the storage space is cacheable or not.

Eieio has to go to the PCI host bridge because it is supposed to
prevent write-combining, both in the host bridge and in the CPU.

Paul.

2006-03-04 17:29:22

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Sat, 4 Mar 2006, Paul Mackerras wrote:
>
> > If so, a simple write barrier should be sufficient. That's exactly what
> > the x86 write barriers do too, ie stores to magic IO space are _not_
> > ordered wrt a normal [smp_]wmb() (or, as per how this thread started, a
> > spin_unlock()) at all.
>
> By magic IO space, do you mean just any old memory-mapped device
> register in a PCI device, or do you mean something else?

Any old memory-mapped device that has been marked as write-combining in
the MTRR's or page tables.

So the rules from the PC side (and like it or not, they end up being
what all the drivers are tested with) are:

 - regular stores are ordered by write barriers
 - PIO stores are always synchronous
 - MMIO stores are ordered by IO semantics
 - PCI ordering must be honored:
   * write combining is only allowed on PCI memory resources
     that are marked prefetchable. If your host bridge does write
     combining in general, it's not a "host bridge", it's a "host
     disaster".
   * for others, writes can always be posted, but they cannot
     be re-ordered wrt either reads or writes to that device
     (ie a read will always be fully synchronizing)
 - io_wmb must be honored

In addition, it will help a hell of a lot if you follow the PC notion of
"per-region extra rules", ie you'd default to the non-prefetchable
behaviour even for areas that are prefetchable from a PCI standpoint, but
allow some way to relax the ordering rules in various ways.

PC's use MTRR's or page table hints for this, but it's actually perfectly
possible to do it by virtual address (ie decide on "ioremap()" time by
looking at some bits that you've saved away to remap it to a certain
virtual address range, and then use the virtual address as a hint for
readl/writel whether you need to serialize or not).

On x86, we already use the "virtual address" trick to distinguish between
PIO and MMIO for the newer ioread/iowrite interface (the older
inb/outb/readb/writeb interfaces obviously don't need that, since the IO
space is statically encoded in the function call itself).
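
A sketch of that trick - a simplified version of what lib/iomap.c does, with
an illustrative constant and helper name:

        #define PIO_LIMIT       0x10000UL       /* fake addresses below this encode a port number */

        static unsigned int my_ioread32(void __iomem *addr)
        {
                unsigned long port = (unsigned long)addr;

                if (port < PIO_LIMIT)
                        return inl(port);       /* port I/O */
                return readl(addr);             /* real MMIO mapping */
        }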

The reason I mention the MTRR emulation is again just purely compatibility
with drivers that get 99.9% of all the testing on a PC platform.

Linus

2006-03-04 22:51:20

by Benjamin Herrenschmidt

Subject: Re: Memory barriers and spin_unlock safety

On Sat, 2006-03-04 at 21:58 +1100, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Actually, the ppc's full barrier (sync) will generate bus traffic, and I
> > think in some cases eieio barriers can propagate to the chipset to
> > enforce ordering there too depending on some voodoo settings and whether
> > the storage space is cacheable or not.
>
> Eieio has to go to the PCI host bridge because it is supposed to
> prevent write-combining, both in the host bridge and in the CPU.

That can be disabled with HID bits tho ;)

Ben.


2006-03-05 02:04:46

by Michael Buesch

Subject: Re: Memory barriers and spin_unlock safety

On Saturday 04 March 2006 11:58, you wrote:
> Linus Torvalds writes:
>
> > PPC has an absolutely _horrible_ memory ordering implementation, as far as
> > I can tell. The thing is broken. I think it's just implementation
> > breakage, not anything really fundamental, but the fact that their write
> > barriers are expensive is a big sign that they are doing something bad.
>
> An smp_wmb() is just an eieio on PPC, which is pretty cheap. I made
> wmb() be a sync though, because it seemed that there were drivers that
> expected wmb() to provide an ordering between a write to memory and a
> write to an MMIO register. If that is a bogus assumption then we
> could make wmb() lighter-weight (after auditing all the drivers we're
> interested in, of course, ...).

In the bcm43xx driver there is code which looks like the following:

	/* Write some coherent DMA memory */
	wmb();
	/* Write MMIO, which depends on the DMA memory
	 * write to be finished.
	 */

Are the assumptions in this code correct? Is wmb() the correct thing
to do here?
I heavily tested this code on PPC UP and did not see any anomaly, yet.

--
Greetings Michael.



2006-03-06 09:06:53

by Helge Hafting

Subject: Re: Memory barriers and spin_unlock safety

Linus Torvalds wrote:

>(Actually, I think one special case of non-temporal instruction is the
>"repeat movs/stos" thing: I think you should _not_ use a "repeat stos" to
>unlock a spinlock, exactly because those stores are not ordered wrt each
>other, and they can bypass the write queue. Of course, doing that would
>be insane anyway, so no harm done ;^).
>
>
oops - there goes the "unlock an array of spinlocks
in a single instruction" idea. :-)

Helge Hafting

2006-03-07 17:37:22

by David Howells

Subject: Re: Memory barriers and spin_unlock safety

David Howells <[email protected]> wrote:

> I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and
> that only io_wmb() should have that.

Hmmm... We don't actually have io_wmb()... Should the following be added to
all archs?

io_mb()
io_rmb()
io_wmb()

David

2006-03-07 17:41:00

by Matthew Wilcox

Subject: Re: Memory barriers and spin_unlock safety

On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> David Howells <[email protected]> wrote:
>
> > I suspect, then, that x86_64 should not have an SFENCE for smp_wmb(), and
> > that only io_wmb() should have that.
>
> Hmmm... We don't actually have io_wmb()... Should the following be added to
> all archs?
>
> io_mb()
> io_rmb()
> io_wmb()

it's spelled mmiowb(), and reads from IO space are synchronous, so don't
need barriers.

2006-03-07 17:56:50

by Jesse Barnes

Subject: Re: Memory barriers and spin_unlock safety

On Tuesday, March 7, 2006 9:40 am, Matthew Wilcox wrote:
> On Tue, Mar 07, 2006 at 05:36:59PM +0000, David Howells wrote:
> > David Howells <[email protected]> wrote:
> > > I suspect, then, that x86_64 should not have an SFENCE for
> > > smp_wmb(), and that only io_wmb() should have that.
> >
> > Hmmm... We don't actually have io_wmb()... Should the following be
> > added to all archs?
> >
> > io_mb()
> > io_rmb()
> > io_wmb()
>
> it's spelled mmiowb(), and reads from IO space are synchronous, so
> don't need barriers.

To expand on willy's note, the reason it's called mmiowb as opposed to
iowb is because I/O port accesses (inX/outX) are inherently synchronous
and don't need barriers. MMIO writes (writeX), however, need barrier
operations to ensure ordering on some platforms.
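
The canonical use is keeping MMIO writes done under a spinlock ordered
between CPUs; a sketch with invented device and register names:

        static void kick_device(struct mydev *dev, u32 val)
        {
                unsigned long flags;

                spin_lock_irqsave(&dev->lock, flags);
                writel(val, dev->regs + CTRL_REG);
                mmiowb();       /* keep the MMIO write inside the lock on
                                 * platforms where it could otherwise leak */
                spin_unlock_irqrestore(&dev->lock, flags);
        }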

This raises the question of what semantics the unified I/O mapping
routines have... are ioreadX/iowriteX synchronous or should we define
the barriers you mention above for them? (IIRC ppc64 can use an io read
ordering op).

Jesse

2006-03-07 18:15:11

by Alan

Subject: Re: Memory barriers and spin_unlock safety

On Maw, 2006-03-07 at 17:36 +0000, David Howells wrote:
> Hmmm... We don't actually have io_wmb()... Should the following be added to
> all archs?
>
> io_mb()
> io_rmb()
> io_wmb()

What kind of mb/rmb/wmb goes with ioread/iowrite? It seems we actually
need one that can work out what to do for the general io API?

2006-03-07 18:29:12

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Tue, 7 Mar 2006, Alan Cox wrote:
>
> What kind of mb/rmb/wmb goes with ioread/iowrite? It seems we actually
> need one that can work out what to do for the general io API?

The ioread/iowrite things only guarantee the laxer MMIO rules, since it
_might_ be mmio. So you'd use the mmio barriers.

In fact, I would suggest that architectures that can do PIO in a more
relaxed manner (x86 cannot, since all the serialization is in hardware)
would do even a PIO in the more relaxed ordering (ie writes can at least
be posted, but obviously not merged, since that would be against PCI
specs).

x86 tends to serialize PIO too much (I think at least Intel CPU's will
actually wait for the PIO write to be acknowledged by _something_ on the
bus, although it obviously can't wait for the device to have acted on it).

Linus

2006-03-07 18:50:10

by Alan

Subject: Re: Memory barriers and spin_unlock safety

On Maw, 2006-03-07 at 10:28 -0800, Linus Torvalds wrote:
> x86 tends to serialize PIO too much (I think at least Intel CPU's will
> actually wait for the PIO write to be acknowledged by _something_ on the
> bus, although it obviously can't wait for the device to have acted on it).

Don't bet on that 8(

In the PCI case the I/O write appears to be acked by the bridges used on
x86 when the write completes on the PCI bus and then back to the CPU.
MMIO is thankfully posted. At least that's how the timings on some
devices look.



2006-03-07 20:22:39

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Tue, 7 Mar 2006, Alan Cox wrote:
>
> In the PCI case the I/O write appears to be acked by the bridges used on
> x86 when the write completes on the PCI bus and then back to the CPU.
> MMIO is thankfully posted. At least thats how the timings on some
> devices look.

Oh, absolutely. I'm saying that you shouldn't wait for even that, since
it's totally pointless (it's not synchronized _anyway_) and adds
absolutely zero gain. To really synchronize, you need to read from the
device anyway.

So the "wait for bus activity" is just making PIO slower for no good
reason, and keeps the core waiting when it could do something more useful.

On an x86, there are legacy reasons to do it (people expect certain
timings). But that was what I was saying - on non-x86 architectures,
there's no reason for the ioread/iowrite interfaces to be as serializing
as the old-fashioned PIO ones are. Might as well do the MMIO rules for a
non-cacheable region: no re-ordering, but no waiting for the bus either.

Linus

2006-03-08 03:21:13

by Paul Mackerras

Subject: Re: Memory barriers and spin_unlock safety

Linus Torvalds writes:

> So the rules from the PC side (and like it or not, they end up being
> what all the drivers are tested with) are:
>
> - regular stores are ordered by write barriers

I thought regular stores were always ordered anyway?

> - PIO stores are always synchronous

By synchronous, do you mean ordered with respect to all other accesses
(regular memory, MMIO, prefetchable MMIO, PIO)?

In other words, if I store a value in regular memory, then do an
outb() to a device, and the device does a DMA read to the location I
just stored to, is the device guaranteed to see the value I just
stored (assuming no other later store to the location)?

> - MMIO stores are ordered by IO semantics
> - PCI ordering must be honored:
> * write combining is only allowed on PCI memory resources
> that are marked prefetchable. If your host bridge does write
> combining in general, it's not a "host bridge", it's a "host
> disaster".

Presumably the host bridge doesn't know what sort of PCI resource is
mapped at a given address, so that information (whether the resource
is prefetchable) must come from the CPU, which would get it from the
TLB entry or an MTRR entry - is that right?

Or is there some gentleman's agreement between the host bridge and the
BIOS that certain address ranges are only used for certain types of
PCI memory resources?

> * for others, writes can always be posted, but they cannot
> be re-ordered wrt either reads or writes to that device
> (ie a read will always be fully synchronizing)
> - io_wmb must be honored

What ordering is there between stores to regular memory and stores to
non-prefetchable MMIO?

If a store to regular memory can be performed before a store to MMIO,
does a wmb() suffice to enforce an ordering, or do you have to use
mmiowb()?

Do PCs ever use write-through caching on prefetchable MMIO resources?

Thanks,
Paul.

2006-03-08 03:55:21

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Wed, 8 Mar 2006, Paul Mackerras wrote:
>
> Linus Torvalds writes:
>
> > So the rules from the PC side (and like it or not, they end up being
> > what all the drivers are tested with) are:
> >
> > - regular stores are ordered by write barriers
>
> I thought regular stores were always ordered anyway?

For the hw, yes. For the compiler no.

So you actually do end up needing write barriers even on x86. It won't
compile to any actual _instruction_, but it will be a compiler barrier (ie
it just ends up being an empty inline asm that "modifies" memory).

So forgetting the wmb() is a bug even on x86, unless you happen to program
in assembly.

Of course, the x86 hw semantics _do_ mean that forgetting it is less
likely to cause problems, just because the compiler re-ordering is fairly
unlikely most of the time.

> > - PIO stores are always synchronous
>
> By synchronous, do you mean ordered with respect to all other accesses
> (regular memory, MMIO, prefetchable MMIO, PIO)?

Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
don't think PCI bridges are supposed to post PIO writes, but an x86 CPU
basically won't stall for them forever. I _think_ they'll wait for it to
hit that external bus, though.

So it's totally serializing in the sense that all preceding reads have
completed and all preceding writes have hit the cache-coherency point, but
you don't necessarily know when the write itself will hit the device (the
write will return before that necessarily happens).

> In other words, if I store a value in regular memory, then do an
> outb() to a device, and the device does a DMA read to the location I
> just stored to, is the device guaranteed to see the value I just
> stored (assuming no other later store to the location)?

Yes, assuming that the DMA read is in response to (ie causally related to)
the write.

> > - MMIO stores are ordered by IO semantics
> > - PCI ordering must be honored:
> > * write combining is only allowed on PCI memory resources
> > that are marked prefetchable. If your host bridge does write
> > combining in general, it's not a "host bridge", it's a "host
> > disaster".
>
> Presumably the host bridge doesn't know what sort of PCI resource is
> mapped at a given address, so that information (whether the resource
> is prefetchable) must come from the CPU, which would get it from the
> TLB entry or an MTRR entry - is that right?

Correct. Although it could of course be a map in the host bridge itself,
not on the CPU.

If the host bridge doesn't know, then the host bridge had better not
combine or the CPU had better tell it not to combine, using something like
a "sync" instruction that causes bus traffic. Either of those approaches
is likely a performance disaster, so you do want to have the CPU and/or
hostbridge do this all automatically for you.

Which is what the PC world does.

> Or is there some gentleman's agreement between the host bridge and the
> BIOS that certain address ranges are only used for certain types of
> PCI memory resources?

Not that I know. I _think_ all of the PC world just depends on the CPU
doing the write combining, and the CPU knows thanks to MTRR's and page
tables. But I could well imagine that there is some situation where the
logic is further out.

> What ordering is there between stores to regular memory and stores to
> non-prefetchable MMIO?

Non-prefetchable MMIO will be in-order on x86 wrt regular memory (unless
you use one of the non-temporal stores).

To get out-of-order stores you have to use a special MTRR setting (mtrr
type "WC" for "write combining").

Or possibly non-temporal writes to an uncached area. I don't think we do.

> If a store to regular memory can be performed before a store to MMIO,
> does a wmb() suffice to enforce an ordering, or do you have to use
> mmiowb()?

On x86, MMIO normally doesn't need memory barriers either for the normal
case (see above). We don't even need the compiler barrier, because we use
a "volatile" pointer for that, telling the compiler to keep its hands off.
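
Roughly speaking, that works because the x86 writel boils down to little
more than a volatile store - an illustrative definition, not the literal
header:

        static inline void my_writel(unsigned int val, volatile void __iomem *addr)
        {
                /* volatile store: the compiler may not reorder, merge or drop it */
                *(volatile unsigned int *)addr = val;
        }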

> Do PCs ever use write-through caching on prefetchable MMIO resources?

Basically only for frame buffers, with MTRR rules (and while write-through
is an option, normally you'd use "write-combining", which doesn't cache at
all, but write combines in the write buffers and writes the combined
results out to the bus - there's usually something like four or eight
write buffers of up to a cacheline in size for combining).

Yeah, I realize this can be awkward. PC's actually get good performance
(ie they normally can easily fill the bus bandwidth) _and_ the sw doesn't
even need to do anything. That's what you get from several decades of hw
tweaking with a fixed - or almost-fixed - software base.

I _like_ PC's. Almost every other architecture decided to be lazy in hw,
and put the onus on the software to tell it what was right. The PC
platform hardware competition didn't allow for the "let's recompile the
software" approach, so the hardware does it all for you. Very well too.

It does make it somewhat hard for other platforms.

Linus

2006-03-08 13:08:45

by Alan

Subject: Re: Memory barriers and spin_unlock safety

On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> don't think PCI bridges are supposed to post PIO writes, but a x86 CPU
> basically won't stall for them forever.

The bridges I have will stall forever. You can observe this directly if
an IDE device decides to hang the IORDY line on the IDE cable or you
crash the GPU on an S3 card.

Alan

2006-03-08 15:30:49

by Linus Torvalds

Subject: Re: Memory barriers and spin_unlock safety



On Wed, 8 Mar 2006, Alan Cox wrote:
>
> On Maw, 2006-03-07 at 19:54 -0800, Linus Torvalds wrote:
> > Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> > don't think PCI bridges are supposed to post PIO writes, but an x86 CPU
> > basically won't stall for them forever.
>
> The bridges I have will stall forever. You can observe this directly if
> an IDE device decides to hang the IORDY line on the IDE cable or you
> crash the GPU on an S3 card.

Ok. The only thing I have tested is the timing of "outb()" on its own,
which is definitely long enough that it clearly waits for _some_ bus
activity (ie the CPU doesn't just post the write internally), but I don't
know exactly what the rules are as far as the core itself is concerned: I
suspect the core just waits until it has hit the northbridge or something.

In contrast, a MMIO write to a WC region at least will not necessarily
pause the core at all: it just hits the write queue in the core, and the
core continues on (and may generate other writes that will be combined in
the write buffers before the first one even hits the bus).

Linus

2006-03-11 04:02:49

by Robert Hancock

Subject: Re: Memory barriers and spin_unlock safety

Linus Torvalds wrote:
> Close, yes. HOWEVER, it's only really ordered wrt the "innermost" bus. I
> don't think PCI bridges are supposed to post PIO writes, but an x86 CPU
> basically won't stall for them forever. I _think_ they'll wait for it to
> hit that external bus, though.

PCI I/O writes are allowed to be posted by the host bus bridge (not
other bridges) according to the PCI 2.2 spec. Maybe no x86 platform
actually does this, but it's allowed, so technically a device would need
to do a read in order to ensure that I/O writes have arrived at the
device as well.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/


2006-03-12 18:03:12

by Alan

Subject: Re: Memory barriers and spin_unlock safety

On Gwe, 2006-03-10 at 19:19 -0600, Robert Hancock wrote:
> PCI I/O writes are allowed to be posted by the host bus bridge (not
> other bridges) according to the PCI 2.2 spec. Maybe no x86 platform
> actually does this, but it's allowed, so technically a device would need
> to do a read in order to ensure that I/O writes have arrived at the
> device as well.

Existing Linux drivers largely believe that PCI I/O cycles as opposed to
MMIO cycles are not posted. At least one MIPS platform that did post
them ended up ensuring PCI I/O cycle posting didn't occur to get a
running Linux system - so it's quite a deep assumption.