2014-11-11 18:57:09

by Alexander Duyck

Subject: [PATCH] arch: Introduce read_acquire()

From: Alexander Duyck <[email protected]>

In the case of device drivers it is common to utilize receive descriptors
in which a single field is used to determine if the descriptor is currently
in the possession of the device or the CPU. In order to prevent any other
fields from being read too early, an rmb() is used, resulting in something
like this code snippet from ixgbe_main.c:

if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
	break;

/*
 * This memory barrier is needed to keep us from reading
 * any other fields out of the rx_desc until we know the
 * RXD_STAT_DD bit is set
 */
rmb();

On reviewing the documentation and code for smp_load_acquire() it occurred
to me that implementing something similar for CPU <-> device interaction
would be worthwhile. This commit provides just the load/read side of this
in the form of read_acquire(). This new primitive orders the specified
read against any subsequent reads. As a result we can reduce the above
code snippet down to:

/* This memory barrier is needed to keep us from reading
 * any other fields out of the rx_desc until we know the
 * RXD_STAT_DD bit is set
 */
if (!(read_acquire(&rx_desc->wb.upper.status_error) &
      cpu_to_le32(IXGBE_RXD_STAT_DD)))
	break;

With this commit and the above change I have seen a reduction in processing
time of at least 7ns per 64B frame in the ixgbe driver on an Intel
Core i7-4930K.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Russell King <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
arch/arm/include/asm/barrier.h | 8 ++++++++
arch/arm64/include/asm/barrier.h | 10 ++++++++++
arch/ia64/include/asm/barrier.h | 2 ++
arch/metag/include/asm/barrier.h | 8 ++++++++
arch/mips/include/asm/barrier.h | 8 ++++++++
arch/powerpc/include/asm/barrier.h | 8 ++++++++
arch/s390/include/asm/barrier.h | 2 ++
arch/sparc/include/asm/barrier_64.h | 1 +
arch/x86/include/asm/barrier.h | 10 ++++++++++
include/asm-generic/barrier.h | 8 ++++++++
10 files changed, 65 insertions(+)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..b082578 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,14 @@
#define smp_wmb() dmb(ishst)
#endif

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..5b0bfa7 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -52,6 +52,14 @@ do { \
___p1; \
})

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#else

#define smp_mb() dmb(ish)
@@ -90,6 +98,8 @@ do { \
___p1; \
})

+#define read_acquire(p) smp_load_acquire(p)
+
#endif

#define read_barrier_depends() do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..2288d09 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -78,6 +78,8 @@ do { \
___p1; \
})

+#define read_acquire(p) smp_load_acquire(p)
+
/*
* XXX check on this ---I suspect what Linus really wants here is
* acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..670b679 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -100,6 +100,14 @@ do { \
___p1; \
})

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()

diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..aa5eb06 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -195,6 +195,14 @@ do { \
___p1; \
})

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#define smp_mb__before_atomic() smp_mb__before_llsc()
#define smp_mb__after_atomic() smp_llsc_mb()

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..3ddc884 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -84,6 +84,14 @@ do { \
___p1; \
})

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#define smp_mb__before_atomic() smp_mb()
#define smp_mb__after_atomic() smp_mb()

diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..516ad04 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -50,4 +50,6 @@ do { \
___p1; \
})

+#define read_acquire(p) smp_load_acquire(p)
+
#endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..c0ba305 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -68,6 +68,7 @@ do { \
___p1; \
})

+#define read_acquire(p) smp_load_acquire(p)
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..6aa9641 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -118,6 +118,14 @@ do { \
___p1; \
})

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#else /* regular x86 TSO memory ordering */

#define smp_store_release(p, v) \
@@ -135,6 +143,8 @@ do { \
___p1; \
})

+#define read_acquire(p) smp_load_acquire(p)
+
#endif

/* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c186bfb 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,14 @@
#define smp_mb__after_atomic() smp_mb()
#endif

+#define read_acquire(p) \
+({ \
+	typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+	compiletime_assert_atomic_type(*p); \
+	rmb(); \
+	___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \


2014-11-11 19:40:28

by Linus Torvalds

Subject: Re: [PATCH] arch: Introduce read_acquire()

On Tue, Nov 11, 2014 at 10:57 AM, <[email protected]> wrote:
>
>
> On reviewing the documentation and code for smp_load_acquire() it occurred
> to me that implementing something similar for CPU <-> device interaction
> would be worthwhile. This commit provides just the load/read side of this
> in the form of read_acquire().

So I don't hate the concept, but there are a couple of reasons to think
this is broken.

One is just the name. Why do we have "smp_load_acquire()", but then
call the non-smp version "read_acquire()"? That makes very little
sense to me. Why did "load" become "read"?

The other is more of a technical issue. Namely the fact that being
*device* ordered is completely and totally different from being *CPU*
ordered.

All sane modern architectures do memory ordering as part of the cache
coherency protocol. But part of that is that it actually requires all
the CPU's to follow said cache coherency protocol.

And bus-master devices don't necessarily follow the same ordering
rules. Yes, any sane DMA will be cache-coherent, although sadly the
insane kind still exists. But even when DMA is cache _coherent_, that
does not necessarily mean that that coherency follows the *ordering*
guarantees.

Now, in practice, I think that DMA tends to be more strictly ordered
than CPU memory ordering is, and the above all "just works". But I'd
really want a lot of ack's from architecture maintainers. Particularly
PowerPC and ARM64. I'm not 100% sure that "smp_load_acquire()" will
necessarily order the read wrt DMA traffic. PPC in particular has some
really odd IO ordering rules, but I *think* all the problems come up
with just MMIO, not with DMA.

But we do have a very real difference between "smp_rmb()" (inter-cpu
cache coherency read barrier) and "rmb()" (full memory barrier that
synchronizes with IO).

And your patch is very confused about this. In *some* places you use
"rmb()", and in other places you just use "smp_load_acquire()". Have
you done extensive verification to check that this is actually ok?
Because the performance difference you quote very much seems to be
about your x86 testing now skipping the IO-synchronizing "rmb()", and
depending on DMA being ordered even without it.

And I'm pretty sure that's actually fine on x86. The real
IO-synchronizing rmb() (which translates into a lfence) is only needed
for when you have uncached accesses (ie mmio) on x86. So I don't think
your code is wrong, I just want to verify that everybody understands
the issues. I'm not even sure DMA can ever really have weaker memory
ordering (I really don't see how you'd be able to do a read barrier
without DMA stores being ordered natively), so maybe I worry too much,
but the ppc people in particular should look at this, because the ppc
memory ordering rules and serialization are some completely odd ad-hoc
black magic....

But anything with non-cache-coherent DMA is obviously very suspect too.

Linus

2014-11-11 19:47:44

by Will Deacon

Subject: Re: [PATCH] arch: Introduce read_acquire()

Hello,

On Tue, Nov 11, 2014 at 06:57:05PM +0000, [email protected] wrote:
> From: Alexander Duyck <[email protected]>
>
> In the case of device drivers it is common to utilize receive descriptors
> in which a single field is used to determine if the descriptor is currently
> in the possession of the device or the CPU. In order to prevent any other
> fields from being read an rmb() is used resulting in something like code
> snippet from ixgbe_main.c:
>
> if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
> break;
>
> /*
> * This memory barrier is needed to keep us from reading
> * any other fields out of the rx_desc until we know the
> * RXD_STAT_DD bit is set
> */
> rmb();
>
> On reviewing the documentation and code for smp_load_acquire() it occurred
> to me that implementing something similar for CPU <-> device interaction
> would be worthwhile. This commit provides just the load/read side of this
> in the form of read_acquire(). This new primitive orders the specified
> read against any subsequent reads. As a result we can reduce the above
> code snippet down to:
>
> /* This memory barrier is needed to keep us from reading
> * any other fields out of the rx_desc until we know the
> * RXD_STAT_DD bit is set
> */
> if (!(read_acquire(&rx_desc->wb.upper.status_error) &

Minor nit on naming, but load_acquire would match what we do with barriers,
where you simply drop the smp_ prefix if you want the thing to work on UP
systems too.

> cpu_to_le32(IXGBE_RXD_STAT_DD)))
> break;

I'm not familiar with the driver in question, but how are the descriptors
mapped? Is the read barrier here purely limiting re-ordering of normal
memory accesses by the CPU? If so, isn't there also scope for store_release
when updating, e.g. next_to_watch in the same driver?

We also need to understand how this plays out with
smp_mb__after_unlock_lock, which is currently *only* implemented by PowerPC.
If we end up having a similar mess to mmiowb, where PowerPC both implements
the barrier *and* plays tricks in its spin_unlock code, then everybody
loses because we'd end up with release doing the right thing anyway.

Peter and I spoke with Paul at LPC about strengthening
smp_load_acquire/smp_store_release so that release->acquire ordering is
maintained, which would allow us to drop smp_mb__after_unlock_lock
altogether. That's stronger than acquire/release in C11, but I think it's
an awful lot easier to use, particularly if device drivers are going to
start using these primitives.

Thoughts?

Will

2014-11-11 20:46:53

by Alexander Duyck

Subject: Re: [PATCH] arch: Introduce read_acquire()


On 11/11/2014 11:40 AM, Linus Torvalds wrote:
> On Tue, Nov 11, 2014 at 10:57 AM, <[email protected]> wrote:
>>
>>
>> On reviewing the documentation and code for smp_load_acquire() it occurred
>> to me that implementing something similar for CPU <-> device interaction
>> would be worthwhile. This commit provides just the load/read side of this
>> in the form of read_acquire().
>
> So I don't hate the concept, but there are a couple of reasons to think
> this is broken.
>
> One is just the name. Why do we have "smp_load_acquire()", but then
> call the non-smp version "read_acquire()"? That makes very little
> sense to me. Why did "load" become "read"?

The idea behind read vs. load in the name was that smp_load_acquire() is
using a full smp_mb() whereas this just falls back to rmb() for the
cases it is dealing with. My main concern is that a full memory barrier
would be more expensive, so I didn't want to give the impression that
this is as complete as smp_load_acquire(). The read_acquire() call is not
strictly enforcing any limitations on writes/stores, although there are
a few cases where the barriers used do leak that functionality over as a
side-effect.

> The other is more of a technical issue. Namely the fact that being
> *device* ordered is completely and totally different from being *CPU*
> ordered.
>
> All sane modern architectures do memory ordering as part of the cache
> coherency protocol. But part of that is that it actually requires all
> the CPU's to follow said cache coherency protocol.
>
> And bus-master devices don't necessarily follow the same ordering
> rules. Yes, any sane DMA will be cache-coherent, although sadly the
> insane kind still exists. But even when DMA is cache _coherent_, that
> does not necessarily mean that that coherency follows the *ordering*
> guarantees.
>
> Now, in practice, I think that DMA tends to be more strictly ordered
> than CPU memory ordering is, and the above all "just works". But I'd
> really want a lot of ack's from architecture maintainers. Particularly
> PowerPC and ARM64. I'm not 100% sure that "smp_load_acquire()" will
> necessarily order the read wrt DMA traffic. PPC in particular has some
> really odd IO ordering rules, but I *think* all the problems come up
> with just MMIO, not with DMA.
>
> But we do have a very real difference between "smp_rmb()" (inter-cpu
> cache coherency read barrier) and "rmb()" (full memory barrier that
> synchronizes with IO).
>
> And your patch is very confused about this. In *some* places you use
> "rmb()", and in other places you just use "smp_load_acquire()". Have
> you done extensive verification to check that this is actually ok?
> Because the performance difference you quote very much seems to be
> about your x86 testing now skipping the IO-synchronizing "rmb()", and
> depending on DMA being ordered even without it.

I am far from an expert on some of these synchronization primitives, so I
probably did make some blunders. I meant to send this as an RFC, but as
it is I plan to submit a v2 to just fix the spelling errors in the patch
description anyway.

The architectures where I thought I might be able to take advantage of
the smp_load_acquire functionality are arm, x86, ia64, sparc, and s390.
The rest are essentially still just an rmb() call following the
volatile read.

> And I'm pretty sure that's actually fine on x86. The real
> IO-synchronizing rmb() (which translates into a lfence) is only needed
> for when you have uncached accesses (ie mmio) on x86. So I don't think
> your code is wrong, I just want to verify that everybody understands
> the issues. I'm not even sure DMA can ever really have weaker memory
> ordering (I really don't see how you'd be able to do a read barrier
> without DMA stores being ordered natively), so maybe I worry too much,
> but the ppc people in particular should look at this, because the ppc
> memory ordering rules and serialization are some completely odd ad-hoc
> black magic....

Like I said before, the PowerPC version is likely the lowest risk. It is
just a standard rmb() call after the read.

I'm pretty sure x86 is safe as well, since the issue that triggered the
introduction of the rmb() into the Intel network drivers was on PowerPC,
as I recall, and the code had been running on x86 for quite some time
without issue when it was reported.

> But anything with non-cache-coherent DMA is obviously very suspect too.
>
> Linus

Well, for now I am only focused on what should be cache-coherent DMA,
which usually applies to things like descriptor rings where interaction
is expected to be bidirectional without the need for a map or unmap.

Thanks,

Alex

2014-11-11 21:14:49

by Alexander Duyck

Subject: Re: [PATCH] arch: Introduce read_acquire()

On 11/11/2014 11:47 AM, Will Deacon wrote:
> Hello,
>
> On Tue, Nov 11, 2014 at 06:57:05PM +0000, [email protected] wrote:
>> From: Alexander Duyck <[email protected]>
>>
>> In the case of device drivers it is common to utilize receive descriptors
>> in which a single field is used to determine if the descriptor is currently
>> in the possession of the device or the CPU. In order to prevent any other
>> fields from being read an rmb() is used resulting in something like code
>> snippet from ixgbe_main.c:
>>
>> if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
>> break;
>>
>> /*
>> * This memory barrier is needed to keep us from reading
>> * any other fields out of the rx_desc until we know the
>> * RXD_STAT_DD bit is set
>> */
>> rmb();
>>
>> On reviewing the documentation and code for smp_load_acquire() it occurred
>> to me that implementing something similar for CPU <-> device interaction
>> would be worthwhile. This commit provides just the load/read side of this
>> in the form of read_acquire(). This new primitive orders the specified
>> read against any subsequent reads. As a result we can reduce the above
>> code snippet down to:
>>
>> /* This memory barrier is needed to keep us from reading
>> * any other fields out of the rx_desc until we know the
>> * RXD_STAT_DD bit is set
>> */
>> if (!(read_acquire(&rx_desc->wb.upper.status_error) &
> Minor nit on naming, but load_acquire would match what we do with barriers,
> where you simply drop the smp_ prefix if you want the thing to work on UP
> systems too.

The problem is this is slightly different: load_acquire in my mind would
use an mb() call, while I only use an rmb(). That is why I chose
read_acquire as the name.

>> cpu_to_le32(IXGBE_RXD_STAT_DD)))
>> break;
> I'm not familiar with the driver in question, but how are the descriptors
> mapped? Is the read barrier here purely limiting re-ordering of normal
> memory accesses by the CPU? If so, isn't there also scope for store_release
> when updating, e.g. next_to_watch in the same driver?

So the driver in question is using descriptor rings allocated via
dma_alloc_coherent. The device is notified that new descriptors are
present via a memory-mapped I/O register; the device then reads the
descriptor via a DMA operation and writes it back with another DMA
operation, setting the IXGBE_RXD_STAT_DD bit in the process.

The problem with the store_release logic is that it would need to key
off of a write to memory-mapped I/O. The idea had crossed my mind, but
I wasn't confident I had a good enough understanding of things to try
and deal with memory ordering for cacheable and uncacheable memory in
the same call. I would have to do some more research to see if something
like that is even possible, as I suspect some of the architectures may
not support it.

> We also need to understand how this plays out with
> smp_mb__after_unlock_lock, which is currently *only* implemented by PowerPC.
> If we end up having a similar mess to mmiowb, where PowerPC both implements
> the barrier *and* plays tricks in its spin_unlock code, then everybody
> loses because we'd end up with release doing the right thing anyway.

PowerPC is not much of a risk in this patch. The implementation I did
just fell back to an rmb().

The architectures I need to sort out are arm, x86, sparc, ia64, and s390,
as they are the only ones that tried to make use of the smp_load_acquire
logic.

> Peter and I spoke with Paul at LPC about strengthening
> smp_load_acquire/smp_store_release so that release->acquire ordering is
> maintained, which would allow us to drop smp_mb__after_unlock_lock
> altogether. That's stronger than acquire/release in C11, but I think it's
> an awful lot easier to use, particularly if device drivers are going to
> start using these primitives.
>
> Thoughts?
>
> Will

I generally want just enough of a barrier in place to keep things
working properly without costing much in terms of CPU time. If you can
come up with a generic load_acquire/store_release that could take the
place of this function, I am fine with that, as long as it functions at
the same level of performance.

Thanks,

Alex

2014-11-12 10:11:15

by Will Deacon

Subject: Re: [PATCH] arch: Introduce read_acquire()

On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote:
> On Tue, Nov 11, 2014 at 10:57 AM, <[email protected]> wrote:
> > On reviewing the documentation and code for smp_load_acquire() it occurred
> > to me that implementing something similar for CPU <-> device interaction
> > would be worthwhile. This commit provides just the load/read side of this
> > in the form of read_acquire().
>
> So I don't hate the concept, but there are a couple of reasons to think
> this is broken.
>
> One is just the name. Why do we have "smp_load_acquire()", but then
> call the non-smp version "read_acquire()"? That makes very little
> sense to me. Why did "load" become "read"?

[...]

> But we do have a very real difference between "smp_rmb()" (inter-cpu
> cache coherency read barrier) and "rmb()" (full memory barrier that
> synchronizes with IO).
>
> And your patch is very confused about this. In *some* places you use
> "rmb()", and in other places you just use "smp_load_acquire()". Have
> you done extensive verification to check that this is actually ok?
> Because the performance difference you quote very much seems to be
> about your x86 testing now skipping the IO-synchronizing "rmb()", and
> depending on DMA being ordered even without it.
>
> And I'm pretty sure that's actually fine on x86. The real
> IO-synchronizing rmb() (which translates into a lfence) is only needed
> for when you have uncached accesses (ie mmio) on x86. So I don't think
> your code is wrong, I just want to verify that everybody understands
> the issues. I'm not even sure DMA can ever really have weaker memory
> ordering (I really don't see how you'd be able to do a read barrier
> without DMA stores being ordered natively), so maybe I worry too much,
> but the ppc people in particular should look at this, because the ppc
> memory ordering rules and serialization are some completely odd ad-hoc
> black magic....

Right, so now I see what's going on here. This isn't actually anything
to do with acquire/release (I don't know of any architectures that have
a read-barrier-acquire instruction), it's all about DMA to main memory.

If a device is DMA'ing data *and* control information (e.g. 'descriptor
valid') to memory, then it must be maintaining order between those writes
with respect to memory. In that case, using the usual MMIO barriers can
be overkill because we really just want to enforce read-ordering on the CPU
side. In fact, I think you could even do this with a fake address dependency
on ARM (although I'm not actually suggesting we do that).

In light of that, it actually sounds like we want a new set of barrier
macros that apply only to DMA buffer accesses by the CPU -- they wouldn't
enforce ordering against things like MMIO registers. I wonder whether any
architectures would implement them differently to the smp_* flavours?

> But anything with non-cache-coherent DMA is obviously very suspect too.

I think non-cache-coherent DMA should work too (at least, on ARM), but
only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable
mapping).

Will

2014-11-12 10:11:37

by Peter Zijlstra

Subject: Re: [PATCH] arch: Introduce read_acquire()

On Tue, Nov 11, 2014 at 12:45:23PM -0800, Alexander Duyck wrote:
>
> On 11/11/2014 11:40 AM, Linus Torvalds wrote:
> >On Tue, Nov 11, 2014 at 10:57 AM, <[email protected]> wrote:
> >>
> >>
> >>On reviewing the documentation and code for smp_load_acquire() it occurred
> >>to me that implementing something similar for CPU <-> device interaction
> >>would be worthwhile. This commit provides just the load/read side of this
> >>in the form of read_acquire().
> >
> >So I don't hate the concept, but there are a couple of reasons to think
> >this is broken.
> >
> >One is just the name. Why do we have "smp_load_acquire()", but then
> >call the non-smp version "read_acquire()"? That makes very little
> >sense to me. Why did "load" become "read"?
>
> The idea behind read vs. load in the name was that smp_load_acquire is
> using a full smp_mb() whereas this just falls back to rmb() for the cases it
> is dealing with. My main concern is that a full memory barrier would be more
> expensive, so I didn't want to give the impression that this is as complete
> as smp_load_acquire(). The read_acquire() call is not strictly enforcing any
> limitations on writes/stores, although there are a few cases where the
> barriers used do leak that functionality over as a side-effect.

Then I object. We should not name it acquire if it does not in fact
provide acquire semantics.

Memory ordering is hard enough, we don't need random weird semantics
mixed in just because.

2014-11-12 10:15:52

by Peter Zijlstra

Subject: Re: [PATCH] arch: Introduce read_acquire()

On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote:
> >Minor nit on naming, but load_acquire would match what we do with barriers,
> >where you simply drop the smp_ prefix if you want the thing to work on UP
> >systems too.
>
> The problem is this is slightly different: load_acquire in my mind would use
> an mb() call, while I only use an rmb(). That is why I chose read_acquire as
> the name.

Acquire is not about rmb vs. mb; do read up on
Documentation/memory-barriers.txt. It's a distinctly different semantic.
Some archs simply lack the means of implementing these semantics and have
to revert to mb (stronger is always allowed).

Using read vs. load to wreck the acquire semantics is just insane.

2014-11-12 15:24:39

by Alexander Duyck

Subject: Re: [PATCH] arch: Introduce read_acquire()


On 11/12/2014 02:15 AM, Peter Zijlstra wrote:
> On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote:
>>> Minor nit on naming, but load_acquire would match what we do with barriers,
>>> where you simply drop the smp_ prefix if you want the thing to work on UP
>>> systems too.
>> The problem is this is slightly different: load_acquire in my mind would use
>> an mb() call, while I only use an rmb(). That is why I chose read_acquire as
>> the name.
> Acquire is not about rmb vs. mb; do read up on
> Documentation/memory-barriers.txt. It's a distinctly different semantic.
> Some archs simply lack the means of implementing these semantics and have
> to revert to mb (stronger is always allowed).
>
> Using read vs. load to wreck the acquire semantics is just insane.

Actually I have been reading up on it, as I wasn't familiar with C11.
Most of what I was doing was actually based on the documentation in
memory-barriers.txt, which was referring to memory operations, not
loads/stores, when describing acquire/release, so I assumed the full
memory barrier was required. I wasn't aware that smp_load_acquire was
only supposed to be ordering loads, or that smp_store_release only
applied to stores. I will probably go back and re-implement this patch,
introducing load_acquire and adding store_release as well. I figure it is
possible that a device could be doing something like reading a linked
list in memory ("Example combining sync and lwsync",
http://www.ibm.com/developerworks/systems/articles/powerpc.html#N102B1),
and then you would need both the store_release and the wmb() to deal
with system-memory vs. system-memory ordering, and system memory vs. MMIO.

Base drivers have been doing this kind of stuff for a while. The
problem is we have been doing it using barriers that are heavier than
they need to be. For example, the reason I want to shift to using
something like an acquire operation is that the current barrier we have
been using, rmb(), is far heavier than we need for system-memory vs.
system-memory ordering on systems such as PowerPC. What I am looking
for with load_acquire/store_release is to allow the use of
lighter-weight barriers for updates that only affect system memory.

Thanks,

Alex

2014-11-12 15:38:07

by Peter Zijlstra

Subject: Re: [PATCH] arch: Introduce read_acquire()

On Wed, Nov 12, 2014 at 07:23:22AM -0800, Alexander Duyck wrote:
>
> On 11/12/2014 02:15 AM, Peter Zijlstra wrote:
> >On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote:
> >>>Minor nit on naming, but load_acquire would match what we do with barriers,
> >>>where you simply drop the smp_ prefix if you want the thing to work on UP
> >>>systems too.
> >>The problem is this is slightly different, load_acquire in my mind would use
> >>a mb() call, I only use a rmb(). That is why I chose read_acquire as the
> >>name.
> >acquire is not about rmb vs mb, do read up on
> >Documentation/memory-barriers.txt. Its a distinctly different semantic.
> >Some archs simply lack the means of implementing this semantics and have
> >to revert to mb (stronger is always allowed).
> >
> >Using the read vs load to wreck the acquire semantics is just insane.
>
> Actually I have been reading up on it as I wasn't familiar with C11.

C11 is _different_ although somewhat related.

> Most
> of what I was doing was actually based on the documentation in barriers.txt
> which was referring to memory operations not loads/stores when referring to
> the acquire/release so I assumed the full memory barrier was required. I
> wasn't aware that smp_load_acquire was only supposed to be ordering loads,
> or that smp_store_release only applied to stores.

It does not. An ACQUIRE is a semi-permeable barrier that doesn't allow
LOADs or STOREs issued _after_ it to appear to happen _before_ it.
RELEASE is the opposite number; it ensures LOADs and STOREs issued
_before_ it cannot appear to happen _after_ it.

This typically matches locking, where the lock operations (mutex_lock,
spin_lock, etc.) have ACQUIRE semantics and the unlocks RELEASE. Such that:

	spin_lock();
	a = 1;
	b = x;
	spin_unlock();

guarantees all LOADs (x) and STOREs (a, b) happen _inside_ the locked
region. What they do not guarantee is:


	y = 1;
	spin_lock();
	a = 1;
	b = x;
	spin_unlock();
	z = 4;

An ordering between y and z; both are allowed to move _into_ the region
and can cross there, like:

	spin_lock();
	...
	z = 4;
	y = 1;
	...
	spin_unlock();


The only 'open' issue at the moment is whether RELEASE+ACQUIRE := MB.
Currently we say this is not so, but Will (and I) would very much like
it to be so -- PPC64 being the only arch that actually makes this
distinction.

2014-11-12 15:44:14

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH] arch: Introduce read_acquire()


On 11/12/2014 02:10 AM, Will Deacon wrote:
> On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote:
>> On Tue, Nov 11, 2014 at 10:57 AM, <[email protected]> wrote:
>>> On reviewing the documentation and code for smp_load_acquire() it occurred
>>> to me that implementing something similar for CPU <-> device interaction
>>> would be worthwhile. This commit provides just the load/read side of this
>>> in the form of read_acquire().
>> So I don't hate the concept, but there's a couple of reasons to think
>> this is broken.
>>
>> One is just the name. Why do we have "smp_load_acquire()", but then
>> call the non-smp version "read_acquire()"? That makes very little
>> sense to me. Why did "load" become "read"?
> [...]
>
>> But we do have a very real difference between "smp_rmb()" (inter-cpu
>> cache coherency read barrier) and "rmb()" (full memory barrier that
>> synchronizes with IO).
>>
>> And your patch is very confused about this. In *some* places you use
>> "rmb()", and in other places you just use "smp_load_acquire()". Have
>> you done extensive verification to check that this is actually ok?
>> Because the performance difference you quote very much seems to be
>> about your x86 testing now skipping the IO-synchronizing "rmb()", and
>> depending on DMA being ordered even without it.
>>
>> And I'm pretty sure that's actually fine on x86. The real
>> IO-synchronizing rmb() (which translates into a lfence) is only needed
>> for when you have uncached accesses (ie mmio) on x86. So I don't think
>> your code is wrong, I just want to verify that everybody understands
>> the issues. I'm not even sure DMA can ever really have weaker memory
>> ordering (I really don't see how you'd be able to do a read barrier
>> without DMA stores being ordered natively), so maybe I worry too much,
>> but the ppc people in particular should look at this, because the ppc
>> memory ordering rules and serialization are some completely odd ad-hoc
>> black magic....
> Right, so now I see what's going on here. This isn't actually anything
> to do with acquire/release (I don't know of any architectures that have
> a read-barrier-acquire instruction), it's all about DMA to main memory.

Actually it is, sort of. I hadn't realized it until I read some of the
explanations of the C11 acquire/release memory-order specifics, but I
believe most network drivers are effectively engaged in acquire/release
logic, because we usually use something like a lockless descriptor ring
to pass data back and forth between the device and the system. The net
win for device drivers is that we can replace some of the heavy-weight
barriers currently in use with lighter barriers or primitives, such as
lwsync vs sync on PowerPC or ldar vs dsb(ld) on arm64.

> If a device is DMA'ing data *and* control information (e.g. 'descriptor
> valid') to memory, then it must be maintaining order between those writes
> with respect to memory. In that case, using the usual MMIO barriers can
> be overkill because we really just want to enforce read-ordering on the CPU
> side. In fact, I think you could even do this with a fake address dependency
> on ARM (although I'm not actually suggesting we do that).
>
> In light of that, it actually sounds like we want a new set of barrier
> macros that apply only to DMA buffer accesses by the CPU -- they wouldn't
> enforce ordering against things like MMIO registers. I wonder whether any
> architectures would implement them differently to the smp_* flavours?

My concern would be the cost of the barriers vs the acquire/release
primitives. In the case of arm64 I am assuming there is a reason for
preferring ldar over dsb instructions. I would imagine device drivers
would want the same kind of advantages.

>> But anything with non-cache-coherent DMA is obviously very suspect too.
> I think non-cache-coherent DMA should work too (at least, on ARM), but
> only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable
> mapping).
>
> Will

For now my plan is to focus on coherent memory only with this.
Specifically it is only really intended for use with dma_alloc_coherent.

Thanks,

Alex

2014-11-12 19:26:08

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH] arch: Introduce read_acquire()



On 11/12/2014 07:37 AM, Peter Zijlstra wrote:
> On Wed, Nov 12, 2014 at 07:23:22AM -0800, Alexander Duyck wrote:
>>
>> On 11/12/2014 02:15 AM, Peter Zijlstra wrote:
>>> On Tue, Nov 11, 2014 at 01:12:32PM -0800, Alexander Duyck wrote:
>>>>> Minor nit on naming, but load_acquire would match what we do with barriers,
>>>>> where you simply drop the smp_ prefix if you want the thing to work on UP
>>>>> systems too.
>>>> The problem is this is slightly different, load_acquire in my mind would use
>>>> a mb() call, I only use a rmb(). That is why I chose read_acquire as the
>>>> name.
>>> acquire is not about rmb vs mb, do read up on
>>> Documentation/memory-barriers.txt. Its a distinctly different semantic.
>>> Some archs simply lack the means of implementing this semantics and have
>>> to revert to mb (stronger is always allowed).
>>>
>>> Using the read vs load to wreck the acquire semantics is just insane.
>>
>> Actually I have been reading up on it as I wasn't familiar with C11.
>
> C11 is _different_ although somewhat related.

Honestly I find this quite confusing. If you have some other
documentation you can point me at, it would be useful in terms of what
behaviour and names you are expecting.

>> Most
>> of what I was doing was actually based on the documentation in barriers.txt
>> which was referring to memory operations not loads/stores when referring to
>> the acquire/release so I assumed the full memory barrier was required. I
>> wasn't aware that smp_load_acquire was only supposed to be ordering loads,
>> or that smp_store_release only applied to stores.
>
> It does not. An ACQUIRE is a semi-permeable barrier that doesn't allow
> LOADs or STOREs that are issued _after_ it to appear to happen _before_.
> The RELEASE is the opposite number; it ensures LOADs and STOREs that are
> issued _before_ cannot happen _after_.
>
> This typically matches locking, where a lock (mutex_lock, spin_lock
> etc..) have ACQUIRE semantics and the unlock RELEASE. Such that:
>
> spin_lock();
> a = 1;
> b = x;
> spin_unlock();
>
> guarantees all LOADs (x) and STOREs (a,b) happen _inside_ the lock
> region. What they do not guarantee is:
>
>
> y = 1;
> spin_lock()
> a = 1;
> b = x;
> spin_unlock()
> z = 4;
>
> An order between y and z, both are allowed _into_ the region and can
> cross there like:
>
> spin_lock();
> ...
> z = 4;
> y = 1;
> ...
> spin_unlock();
>
>
> The only 'open' issue at the moment is if RELEASE+ACQUIRE := MB.
> Currently we say this is not so, but Will (and me) would very much like
> this to be so -- PPC64 being the only arch that actually makes this
> distinction.

In the grand scheme of things I suppose it doesn't matter too much. I
actually found some documentation that explains the subtle nuances of
things a bit. Specifically, Acquire represents the first row in the
table below, and Release represents the second column:

Acquire ->  LoadLoad   LoadStore
            StoreLoad  StoreStore
                           ^
                           |
                        Release

The LoadStore bit was in question in a few different discussions I read;
however, as it turns out, x86, sparc, s390, PowerPC, arm64, and ia64
would all give you that as a free benefit anyway. I think that covers
a wide enough gamut that I don't really care if I take a performance hit
on the other architectures for implementing a full mb() versus a rmb().

Thanks,

Alex


2014-11-12 20:43:52

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] arch: Introduce read_acquire()

From: Alexander Duyck <[email protected]>
Date: Tue, 11 Nov 2014 13:12:32 -0800

> The architectures I need to sort out are arm, x86, sparc, ia64, and
> s390 as they are the only ones that tried to make use of the
> smp_load_acquire logic.

On sparc64, you don't have to worry about anything really.

Only mb() does anything. The CPUs execute in a reasonably strict
memory-ordering mode, and that's why rmb/wmb and friends are simply
nops with a compiler barrier.