2014-11-19 01:24:55

by Alexander Duyck

[permalink] [raw]
Subject: [PATCH v5 0/4] arch: Add lightweight memory barriers for coherent memory access

These patches introduce two new primitives for synchronizing cache coherent
memory writes and reads. These two new primitives are:

dma_rmb()
dma_wmb()

The first patch cleans up some unnecessary overhead related to the
definition of read_barrier_depends, smp_read_barrier_depends, and comments
related to the barrier.

The second patch adds the primitives for the applicable architectures and
asm-generic.

The third patch adds the barriers to r8169 which turns out to be a good
example of where the new barriers might be useful as they have full
rmb()/wmb() barriers ordering accesses to the descriptors and the DescOwn
bit.

The fourth patch adds support for coherent_rmb() to the Intel fm10k, igb,
and ixgbe drivers. Testing with the ixgbe driver has shown a processing
time reduction of at least 7ns per 64B frame on a Core i7-4930K.

This patch series is essentially the v4 for:
v4: Add lightweight memory barriers for coherent memory access
v3: Add lightweight memory barriers fast_rmb() and fast_wmb()
v2: Introduce load_acquire() and store_release()
v1: Introduce read_acquire()

The key changes in this patch series versus the earlier patches are:
v5:
- Renamed barriers dma_rmb and dma_wmb
- Undid smp_wmb changes in x86 and PowerPC
- Defined smp_rmb as __lwsync for SMP case on PowerPC
v4:
- Renamed barriers coherent_rmb and coherent_wmb
- Added smp_lwsync for use in smp_load_acquire/smp_store_release
v3:
- Moved away from acquire()/store() and instead focused on barriers
- Added cleanup of read_barrier_depends
- Added change in r8169 to fix cur_tx/DescOwn ordering
- Simplified changes to just replacing/moving barriers in r8169
- Added update to documentation with code example
v2:
- Renamed read_acquire() to be consistent with smp_load_acquire()
- Changed barrier used to be consistent with smp_load_acquire()
- Updated PowerPC code to use __lwsync based on IBM article
- Added store_release() as this is a viable use case for drivers
- Added r8169 patch which is able to fully use primitives
- Added fm10k/igb/ixgbe patch which is able to test performance

---

Alexander Duyck (4):
arch: Cleanup read_barrier_depends() and comments
arch: Add lightweight memory barriers dma_rmb() and dma_wmb()
r8169: Use dma_rmb() and dma_wmb() for DescOwn checks
fm10k/igb/ixgbe: Use dma_rmb on Rx descriptor reads


Documentation/memory-barriers.txt | 41 +++++++++++++++
arch/alpha/include/asm/barrier.h | 51 ++++++++++++++++++
arch/arm/include/asm/barrier.h | 4 +
arch/arm64/include/asm/barrier.h | 3 +
arch/blackfin/include/asm/barrier.h | 51 ++++++++++++++++++
arch/ia64/include/asm/barrier.h | 25 ++++-----
arch/metag/include/asm/barrier.h | 19 ++++---
arch/mips/include/asm/barrier.h | 61 ++--------------------
arch/powerpc/include/asm/barrier.h | 19 ++++---
arch/s390/include/asm/barrier.h | 7 ++-
arch/sparc/include/asm/barrier_64.h | 7 ++-
arch/x86/include/asm/barrier.h | 70 ++++---------------------
arch/x86/um/asm/barrier.h | 20 ++++---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 +-
drivers/net/ethernet/intel/igb/igb_main.c | 6 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 +--
drivers/net/ethernet/realtek/r8169.c | 29 ++++++++--
include/asm-generic/barrier.h | 8 +++
18 files changed, 257 insertions(+), 179 deletions(-)

--


2014-11-19 01:26:03

by Alexander Duyck

[permalink] [raw]
Subject: [PATCH v5 1/4] arch: Cleanup read_barrier_depends() and comments

This patch is meant to cleanup the handling of read_barrier_depends and
smp_read_barrier_depends. In multiple spots in the kernel headers
read_barrier_depends is defined as "do {} while (0)", however we then go
into the SMP vs non-SMP sections and have the SMP version reference
read_barrier_depends, and the non-SMP define it as yet another empty
do/while.

With this commit I went through and cleaned out the duplicate definitions
and reduced the number of definitions down to 2 per header. In addition I
moved the 50 line comments for the macro from the x86 and mips headers that
defined it as an empty do/while to those that were actually defining the
macro, alpha and blackfin.

Signed-off-by: Alexander Duyck <[email protected]>
---
arch/alpha/include/asm/barrier.h | 51 ++++++++++++++++++++++++++++++
arch/blackfin/include/asm/barrier.h | 51 ++++++++++++++++++++++++++++++
arch/ia64/include/asm/barrier.h | 22 +++++--------
arch/metag/include/asm/barrier.h | 7 ++--
arch/mips/include/asm/barrier.h | 52 -------------------------------
arch/powerpc/include/asm/barrier.h | 6 ++--
arch/s390/include/asm/barrier.h | 5 ++-
arch/sparc/include/asm/barrier_64.h | 4 +-
arch/x86/include/asm/barrier.h | 59 ++---------------------------------
arch/x86/um/asm/barrier.h | 7 ++--
10 files changed, 129 insertions(+), 135 deletions(-)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index 3832bdb..77516c8 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -7,6 +7,57 @@
#define rmb() __asm__ __volatile__("mb": : :"memory")
#define wmb() __asm__ __volatile__("wmb": : :"memory")

+/**
+ * read_barrier_depends - Flush all pending reads that subsequents reads
+ * depend on.
+ *
+ * No data-dependent reads from memory-like regions are ever reordered
+ * over this barrier. All reads preceding this primitive are guaranteed
+ * to access memory (but not necessarily other CPUs' caches) before any
+ * reads following this primitive that depend on the data return by
+ * any of the preceding reads. This primitive is much lighter weight than
+ * rmb() on most CPUs, and is never heavier weight than is
+ * rmb().
+ *
+ * These ordering constraints are respected by both the local CPU
+ * and the compiler.
+ *
+ * Ordering is not guaranteed by anything other than these primitives,
+ * not even by data dependencies. See the documentation for
+ * memory_barrier() for examples and URLs to more information.
+ *
+ * For example, the following code would force ordering (the initial
+ * value of "a" is zero, "b" is one, and "p" is "&a"):
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * b = 2;
+ * memory_barrier();
+ * p = &b; q = p;
+ * read_barrier_depends();
+ * d = *q;
+ * </programlisting>
+ *
+ * because the read of "*q" depends on the read of "p" and these
+ * two reads are separated by a read_barrier_depends(). However,
+ * the following code, with the same initial values for "a" and "b":
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * a = 2;
+ * memory_barrier();
+ * b = 3; y = b;
+ * read_barrier_depends();
+ * x = a;
+ * </programlisting>
+ *
+ * does not enforce ordering, since there is no data dependency between
+ * the read of "a" and the read of "b". Therefore, on some CPUs, such
+ * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
+ * in cases like this where there are no data dependencies.
+ */
#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")

#ifdef CONFIG_SMP
diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
index 4200068..dfb66fe 100644
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -22,6 +22,57 @@
# define mb() do { barrier(); smp_check_barrier(); smp_mark_barrier(); } while (0)
# define rmb() do { barrier(); smp_check_barrier(); } while (0)
# define wmb() do { barrier(); smp_mark_barrier(); } while (0)
+/*
+ * read_barrier_depends - Flush all pending reads that subsequents reads
+ * depend on.
+ *
+ * No data-dependent reads from memory-like regions are ever reordered
+ * over this barrier. All reads preceding this primitive are guaranteed
+ * to access memory (but not necessarily other CPUs' caches) before any
+ * reads following this primitive that depend on the data return by
+ * any of the preceding reads. This primitive is much lighter weight than
+ * rmb() on most CPUs, and is never heavier weight than is
+ * rmb().
+ *
+ * These ordering constraints are respected by both the local CPU
+ * and the compiler.
+ *
+ * Ordering is not guaranteed by anything other than these primitives,
+ * not even by data dependencies. See the documentation for
+ * memory_barrier() for examples and URLs to more information.
+ *
+ * For example, the following code would force ordering (the initial
+ * value of "a" is zero, "b" is one, and "p" is "&a"):
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * b = 2;
+ * memory_barrier();
+ * p = &b; q = p;
+ * read_barrier_depends();
+ * d = *q;
+ * </programlisting>
+ *
+ * because the read of "*q" depends on the read of "p" and these
+ * two reads are separated by a read_barrier_depends(). However,
+ * the following code, with the same initial values for "a" and "b":
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * a = 2;
+ * memory_barrier();
+ * b = 3; y = b;
+ * read_barrier_depends();
+ * x = a;
+ * </programlisting>
+ *
+ * does not enforce ordering, since there is no data dependency between
+ * the read of "a" and the read of "b". Therefore, on some CPUs, such
+ * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
+ * in cases like this where there are no data dependencies.
+ */
# define read_barrier_depends() do { barrier(); smp_check_barrier(); } while (0)
#endif

diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..e8fffb0 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -35,26 +35,22 @@
* it's (presumably) much slower than mf and (b) mf.a is supported for
* sequential memory pages only.
*/
-#define mb() ia64_mf()
-#define rmb() mb()
-#define wmb() mb()
-#define read_barrier_depends() do { } while(0)
+#define mb() ia64_mf()
+#define rmb() mb()
+#define wmb() mb()

#ifdef CONFIG_SMP
# define smp_mb() mb()
-# define smp_rmb() rmb()
-# define smp_wmb() wmb()
-# define smp_read_barrier_depends() read_barrier_depends()
-
#else
-
# define smp_mb() barrier()
-# define smp_rmb() barrier()
-# define smp_wmb() barrier()
-# define smp_read_barrier_depends() do { } while(0)
-
#endif

+#define smp_rmb() smp_mb()
+#define smp_wmb() smp_mb()
+
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()

diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..6d8b8c9 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -47,8 +47,6 @@ static inline void wmb(void)
wr_fence();
}

-#define read_barrier_depends() do { } while (0)
-
#ifndef CONFIG_SMP
#define fence() do { } while (0)
#define smp_mb() barrier()
@@ -82,7 +80,10 @@ static inline void fence(void)
#define smp_wmb() barrier()
#endif
#endif
-#define smp_read_barrier_depends() do { } while (0)
+
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
#define set_mb(var, value) do { var = value; smp_mb(); } while (0)

#define smp_store_release(p, v) \
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..3d69aa8 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -10,58 +10,6 @@

#include <asm/addrspace.h>

-/*
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier. All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads. This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies. See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * b = 2;
- * memory_barrier();
- * p = &b; q = p;
- * read_barrier_depends();
- * d = *q;
- * </programlisting>
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends(). However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * a = 2;
- * memory_barrier();
- * b = 3; y = b;
- * read_barrier_depends();
- * x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b". Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
- * in cases like this where there are no data dependencies.
- */
-
#define read_barrier_depends() do { } while(0)
#define smp_read_barrier_depends() do { } while(0)

diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..cb6d66c 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -33,7 +33,6 @@
#define mb() __asm__ __volatile__ ("sync" : : : "memory")
#define rmb() __asm__ __volatile__ ("sync" : : : "memory")
#define wmb() __asm__ __volatile__ ("sync" : : : "memory")
-#define read_barrier_depends() do { } while(0)

#define set_mb(var, value) do { var = value; mb(); } while (0)

@@ -50,16 +49,17 @@
#define smp_mb() mb()
#define smp_rmb() __lwsync()
#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
-#define smp_read_barrier_depends() read_barrier_depends()
#else
#define __lwsync() barrier()

#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() do { } while(0)
#endif /* CONFIG_SMP */

+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
/*
* This is a barrier which prevents following instructions from being
* started until the value of the argument x is known. For example, if
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..33d191d 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -24,11 +24,12 @@

#define rmb() mb()
#define wmb() mb()
-#define read_barrier_depends() do { } while(0)
#define smp_mb() mb()
#define smp_rmb() rmb()
#define smp_wmb() wmb()
-#define smp_read_barrier_depends() read_barrier_depends()
+
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)

#define smp_mb__before_atomic() smp_mb()
#define smp_mb__after_atomic() smp_mb()
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..6c974c0 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -37,7 +37,6 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define rmb() __asm__ __volatile__("":::"memory")
#define wmb() __asm__ __volatile__("":::"memory")

-#define read_barrier_depends() do { } while(0)
#define set_mb(__var, __value) \
do { __var = __value; membar_safe("#StoreLoad"); } while(0)

@@ -51,7 +50,8 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define smp_wmb() __asm__ __volatile__("":::"memory")
#endif

-#define smp_read_barrier_depends() do { } while(0)
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)

#define smp_store_release(p, v) \
do { \
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..5238000 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,60 +24,6 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif

-/**
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier. All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads. This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies. See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * b = 2;
- * memory_barrier();
- * p = &b; q = p;
- * read_barrier_depends();
- * d = *q;
- * </programlisting>
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends(). However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * a = 2;
- * memory_barrier();
- * b = 3; y = b;
- * read_barrier_depends();
- * x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b". Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
- * in cases like this where there are no data dependencies.
- **/
-
-#define read_barrier_depends() do { } while (0)
-
#ifdef CONFIG_SMP
#define smp_mb() mb()
#ifdef CONFIG_X86_PPRO_FENCE
@@ -86,16 +32,17 @@
# define smp_rmb() barrier()
#endif
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* !SMP */
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() do { } while (0)
#define set_mb(var, value) do { var = value; barrier(); } while (0)
#endif /* SMP */

+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
#if defined(CONFIG_X86_PPRO_FENCE)

/*
diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
index cc04e67..d6511d9 100644
--- a/arch/x86/um/asm/barrier.h
+++ b/arch/x86/um/asm/barrier.h
@@ -29,8 +29,6 @@

#endif /* CONFIG_X86_32 */

-#define read_barrier_depends() do { } while (0)
-
#ifdef CONFIG_SMP

#define smp_mb() mb()
@@ -42,7 +40,6 @@

#define smp_wmb() barrier()

-#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)

#else /* CONFIG_SMP */
@@ -50,11 +47,13 @@
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() do { } while (0)
#define set_mb(var, value) do { var = value; barrier(); } while (0)

#endif /* CONFIG_SMP */

+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
/*
* Stop RDTSC speculation. This is needed when you need to use RDTSC
* (or get_cycles or vread that possibly accesses the TSC) in a defined

2014-11-19 01:26:20

by Alexander Duyck

[permalink] [raw]
Subject: [PATCH v5 2/4] arch: Add lightweight memory barriers dma_rmb() and dma_wmb()

There are a number of situations where the mandatory barriers rmb() and
wmb() are used to order memory/memory operations in the device drivers
and those barriers are much heavier than they actually need to be. For
example in the case of PowerPC wmb() calls the heavy-weight sync
instruction when for coherent memory operations all that is really needed
is an lsync or eieio instruction.

This commit adds a coherent only version of the mandatory memory barriers
rmb() and wmb(). In most cases this should result in the barrier being the
same as the SMP barriers for the SMP case, however in some cases we use a
barrier that is somewhere in between rmb() and smp_rmb(). For example on
ARM the rmb barriers break down as follows:

Barrier Call Explanation
--------- -------- ----------------------------------
rmb() dsb() Data synchronization barrier - system
dma_rmb() dmb(osh) data memory barrier - outer sharable
smp_rmb() dmb(ish) data memory barrier - inner sharable

These new barriers are not as safe as the standard rmb() and wmb().
Specifically they do not guarantee ordering between coherent and incoherent
memories. The primary use case for these would be to enforce ordering of
reads and writes when accessing coherent memory that is shared between the
CPU and a device.

It may also be noted that there is no dma_mb(). Most architectures don't
provide a good mechanism for performing a coherent only full barrier without
resorting to the same mechanism used in mb(). As such there isn't much to
be gained in trying to define such a function.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Russell King <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
Documentation/memory-barriers.txt | 41 +++++++++++++++++++++++++++++++++++
arch/arm/include/asm/barrier.h | 4 +++
arch/arm64/include/asm/barrier.h | 3 +++
arch/ia64/include/asm/barrier.h | 3 +++
arch/metag/include/asm/barrier.h | 14 ++++++------
arch/mips/include/asm/barrier.h | 9 ++++----
arch/powerpc/include/asm/barrier.h | 13 +++++++----
arch/s390/include/asm/barrier.h | 2 ++
arch/sparc/include/asm/barrier_64.h | 3 +++
arch/x86/include/asm/barrier.h | 11 ++++++---
arch/x86/um/asm/barrier.h | 13 ++++++-----
include/asm-generic/barrier.h | 8 +++++++
12 files changed, 98 insertions(+), 26 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 22a969c..a1c589b 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
operations" subsection for information on where to use these.


+ (*) dma_wmb();
+ (*) dma_rmb();
+
+ These are for use with memory based device I/O to guarantee the ordering
+ of cache coherent writes or reads with respect to other writes or reads
+ to cache coherent DMA memory.
+
+ For example, consider a device driver that shares memory with a device
+ and uses a descriptor status value to indicate if the descriptor belongs
+ to the device or the CPU, and a doorbell to notify it when new
+ descriptors are available:
+
+ if (desc->status != DEVICE_OWN) {
+ /* do not read data until we own descriptor */
+ dma_rmb();
+
+ /* read/modify data */
+ read_data = desc->data;
+ desc->data = write_data;
+
+ /* flush modifications before status update */
+ dma_wmb();
+
+ /* assign ownership */
+ desc->status = DEVICE_OWN;
+
+ /* force memory to sync before notifying device via MMIO */
+ wmb();
+
+ /* notify device of new descriptors */
+ writel(DESC_NOTIFY, doorbell);
+ }
+
+ The dma_rmb() allows us guarantee the device has released ownership
+ before we read the data from the descriptor, and he dma_wmb() allows
+ us to guarantee the data is written to the descriptor before the device
+ can see it now has ownership. The wmb() is needed to guarantee that the
+ cache coherent memory writes have completed before attempting a write to
+ the cache incoherent MMIO region.
+
+
MMIO WRITE BARRIER
------------------

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..d2f81e6 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -43,10 +43,14 @@
#define mb() do { dsb(); outer_sync(); } while (0)
#define rmb() dsb()
#define wmb() do { dsb(st); outer_sync(); } while (0)
+#define dma_rmb() dmb(osh)
+#define dma_wmb() dmb(oshst)
#else
#define mb() barrier()
#define rmb() barrier()
#define wmb() barrier()
+#define dma_rmb() barrier()
+#define dma_wmb() barrier()
#endif

#ifndef CONFIG_SMP
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..a5abb00 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,6 +32,9 @@
#define rmb() dsb(ld)
#define wmb() dsb(st)

+#define dma_rmb() dmb(oshld)
+#define dma_wmb() dmb(oshst)
+
#ifndef CONFIG_SMP
#define smp_mb() barrier()
#define smp_rmb() barrier()
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index e8fffb0..f6769eb 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -39,6 +39,9 @@
#define rmb() mb()
#define wmb() mb()

+#define dma_rmb() mb()
+#define dma_wmb() mb()
+
#ifdef CONFIG_SMP
# define smp_mb() mb()
#else
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index 6d8b8c9..d703d8e 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -4,8 +4,6 @@
#include <asm/metag_mem.h>

#define nop() asm volatile ("NOP")
-#define mb() wmb()
-#define rmb() barrier()

#ifdef CONFIG_METAG_META21

@@ -41,11 +39,13 @@ static inline void wr_fence(void)

#endif /* !CONFIG_METAG_META21 */

-static inline void wmb(void)
-{
- /* flush writes through the write combiner */
- wr_fence();
-}
+/* flush writes through the write combiner */
+#define mb() wr_fence()
+#define rmb() barrier()
+#define wmb() mb()
+
+#define dma_rmb() rmb()
+#define dma_wmb() wmb()

#ifndef CONFIG_SMP
#define fence() do { } while (0)
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index 3d69aa8..2b8bbbc 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -75,20 +75,21 @@

#include <asm/wbflush.h>

-#define wmb() fast_wmb()
-#define rmb() fast_rmb()
#define mb() wbflush()
#define iob() wbflush()

#else /* !CONFIG_CPU_HAS_WB */

-#define wmb() fast_wmb()
-#define rmb() fast_rmb()
#define mb() fast_mb()
#define iob() fast_iob()

#endif /* !CONFIG_CPU_HAS_WB */

+#define wmb() fast_wmb()
+#define rmb() fast_rmb()
+#define dma_wmb() fast_wmb()
+#define dma_rmb() fast_rmb()
+
#if defined(CONFIG_WEAK_ORDERING) && defined(CONFIG_SMP)
# ifdef CONFIG_CPU_CAVIUM_OCTEON
# define smp_mb() __sync()
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index cb6d66c..a3bf5be 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -36,8 +36,6 @@

#define set_mb(var, value) do { var = value; mb(); } while (0)

-#ifdef CONFIG_SMP
-
#ifdef __SUBARCH_HAS_LWSYNC
# define SMPWMB LWSYNC
#else
@@ -45,12 +43,17 @@
#endif

#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define dma_rmb() __lwsync()
+#define dma_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
+
+#ifdef CONFIG_SMP
+#define smp_lwsync() __lwsync()

#define smp_mb() mb()
#define smp_rmb() __lwsync()
#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
#else
-#define __lwsync() barrier()
+#define smp_lwsync() barrier()

#define smp_mb() barrier()
#define smp_rmb() barrier()
@@ -72,7 +75,7 @@
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
- __lwsync(); \
+ smp_lwsync(); \
ACCESS_ONCE(*p) = (v); \
} while (0)

@@ -80,7 +83,7 @@ do { \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
- __lwsync(); \
+ smp_lwsync(); \
___p1; \
})

diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 33d191d..8d72471 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -24,6 +24,8 @@

#define rmb() mb()
#define wmb() mb()
+#define dma_rmb() rmb()
+#define dma_wmb() wmb()
#define smp_mb() mb()
#define smp_rmb() rmb()
#define smp_wmb() wmb()
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 6c974c0..7664894 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -37,6 +37,9 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define rmb() __asm__ __volatile__("":::"memory")
#define wmb() __asm__ __volatile__("":::"memory")

+#define dma_rmb() rmb()
+#define dma_wmb() wmb()
+
#define set_mb(__var, __value) \
do { __var = __value; membar_safe("#StoreLoad"); } while(0)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 5238000..2ab1eb3 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,13 +24,16 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif

-#ifdef CONFIG_SMP
-#define smp_mb() mb()
#ifdef CONFIG_X86_PPRO_FENCE
-# define smp_rmb() rmb()
+#define dma_rmb() rmb()
#else
-# define smp_rmb() barrier()
+#define dma_rmb() barrier()
#endif
+#define dma_wmb() barrier()
+
+#ifdef CONFIG_SMP
+#define smp_mb() mb()
+#define smp_rmb() dma_rmb()
#define smp_wmb() barrier()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* !SMP */
diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
index d6511d9..2d7d9a1 100644
--- a/arch/x86/um/asm/barrier.h
+++ b/arch/x86/um/asm/barrier.h
@@ -29,17 +29,18 @@

#endif /* CONFIG_X86_32 */

-#ifdef CONFIG_SMP
-
-#define smp_mb() mb()
#ifdef CONFIG_X86_PPRO_FENCE
-#define smp_rmb() rmb()
+#define dma_rmb() rmb()
#else /* CONFIG_X86_PPRO_FENCE */
-#define smp_rmb() barrier()
+#define dma_rmb() barrier()
#endif /* CONFIG_X86_PPRO_FENCE */
+#define dma_wmb() barrier()

-#define smp_wmb() barrier()
+#ifdef CONFIG_SMP

+#define smp_mb() mb()
+#define smp_rmb() dma_rmb()
+#define smp_wmb() barrier()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)

#else /* CONFIG_SMP */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..f5c40b0 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -42,6 +42,14 @@
#define wmb() mb()
#endif

+#ifndef dma_rmb
+#define dma_rmb() rmb()
+#endif
+
+#ifndef dma_wmb
+#define dma_wmb() wmb()
+#endif
+
#ifndef read_barrier_depends
#define read_barrier_depends() do { } while (0)
#endif

2014-11-19 01:26:30

by Alexander Duyck

[permalink] [raw]
Subject: [PATCH v5 3/4] r8169: Use dma_rmb() and dma_wmb() for DescOwn checks

The r8169 use a pair of wmb() calls when setting up the descriptor rings.
The first is to synchronize the descriptor data with the descriptor status,
and the second is to synchronize the descriptor status with the use of the
MMIO doorbell to notify the device that descriptors are ready. This can
come at a heavy price on some systems, and is not really necessary on
systems such as x86 as a simple barrier() would suffice to order store/store
accesses. As such we can replace the first memory barrier with
dma_wmb() to reduce the cost for these accesses.

In addition the r8169 uses a rmb() to prevent compiler optimization in the
cleanup paths, however by moving the barrier down a few lines and replacing
it with a dma_rmb() we should be able to use it to guarantee
descriptor accesses do not occur until the device has updated the DescOwn
bit from its end.

One last change I made is to move the update of cur_tx in the xmit path to
after the wmb. This way we can guarantee the device and all CPUs should
see the DescOwn update before they see the cur_tx value update.

Cc: Realtek linux nic maintainers <[email protected]>
Cc: Francois Romieu <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/net/ethernet/realtek/r8169.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index cf154f7..39e9796 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6601,6 +6601,9 @@ static inline void rtl8169_mark_to_asic(struct RxDesc *desc, u32 rx_buf_sz)
{
u32 eor = le32_to_cpu(desc->opts1) & RingEnd;

+ /* Force memory writes to complete before releasing descriptor */
+ dma_wmb();
+
desc->opts1 = cpu_to_le32(DescOwn | eor | rx_buf_sz);
}

@@ -6608,7 +6611,6 @@ static inline void rtl8169_map_to_asic(struct RxDesc *desc, dma_addr_t mapping,
u32 rx_buf_sz)
{
desc->addr = cpu_to_le64(mapping);
- wmb();
rtl8169_mark_to_asic(desc, rx_buf_sz);
}

@@ -7077,16 +7079,18 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,

skb_tx_timestamp(skb);

- wmb();
+ /* Force memory writes to complete before releasing descriptor */
+ dma_wmb();

/* Anti gcc 2.95.3 bugware (sic) */
status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
txd->opts1 = cpu_to_le32(status);

- tp->cur_tx += frags + 1;
-
+ /* Force all memory writes to complete before notifying device */
wmb();

+ tp->cur_tx += frags + 1;
+
RTL_W8(TxPoll, NPQ);

mmiowb();
@@ -7185,11 +7189,16 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
struct ring_info *tx_skb = tp->tx_skb + entry;
u32 status;

- rmb();
status = le32_to_cpu(tp->TxDescArray[entry].opts1);
if (status & DescOwn)
break;

+ /* This barrier is needed to keep us from reading
+ * any other fields out of the Tx descriptor until
+ * we know the status of DescOwn
+ */
+ dma_rmb();
+
rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
tp->TxDescArray + entry);
if (status & LastFrag) {
@@ -7284,11 +7293,16 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
struct RxDesc *desc = tp->RxDescArray + entry;
u32 status;

- rmb();
status = le32_to_cpu(desc->opts1) & tp->opts1_mask;
-
if (status & DescOwn)
break;
+
+ /* This barrier is needed to keep us from reading
+ * any other fields out of the Rx descriptor until
+ * we know the status of DescOwn
+ */
+ dma_rmb();
+
if (unlikely(status & RxRES)) {
netif_info(tp, rx_err, dev, "Rx ERROR. status = %08x\n",
status);
@@ -7350,7 +7364,6 @@ process_pkt:
}
release_descriptor:
desc->opts2 = 0;
- wmb();
rtl8169_mark_to_asic(desc, rx_buf_sz);
}

2014-11-19 01:26:53

by Alexander Duyck

[permalink] [raw]
Subject: [PATCH v5 4/4] fm10k/igb/ixgbe: Use dma_rmb on Rx descriptor reads

This change makes it so that dma_rmb is used when reading the Rx
descriptor. The advantage of dma_rmb is that it allows for a much
lower cost barrier on x86, powerpc, arm, and arm64 architectures than a
traditional memory barrier when dealing with reads that only have to
synchronize to coherent memory.

In addition I have updated the code so that it just checks to see if any
bits have been set instead of just the DD bit since the DD bit will always
be set as a part of a descriptor write-back so we just need to check for a
non-zero value being present at that memory location rather than just
checking for any specific bit. This allows the code itself to appear much
cleaner and allows the compiler more room to optimize.

Cc: Jeff Kirsher <[email protected]>
Cc: Matthew Vick <[email protected]>
Cc: Don Skidmore <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 +++---
drivers/net/ethernet/intel/igb/igb_main.c | 6 +++---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++++-----
3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e645af4..06e2d65 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -620,14 +620,14 @@ static bool fm10k_clean_rx_irq(struct fm10k_q_vector *q_vector,

rx_desc = FM10K_RX_DESC(rx_ring, rx_ring->next_to_clean);

- if (!fm10k_test_staterr(rx_desc, FM10K_RXD_STATUS_DD))
+ if (!rx_desc->d.staterr)
break;

/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STATUS_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ dma_rmb();

/* retrieve a buffer from the ring */
skb = fm10k_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a2d72a8..2246f45 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6918,14 +6918,14 @@ static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)

rx_desc = IGB_RX_DESC(rx_ring, rx_ring->next_to_clean);

- if (!igb_test_staterr(rx_desc, E1000_RXD_STAT_DD))
+ if (!rx_desc->wb.upper.status_error)
break;

/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STAT_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ dma_rmb();

/* retrieve a buffer from the ring */
skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d2df4e3..d814b58 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2003,15 +2003,14 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,

rx_desc = IXGBE_RX_DESC(rx_ring, rx_ring->next_to_clean);

- if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
+ if (!rx_desc->wb.upper.status_error)
break;

- /*
- * This memory barrier is needed to keep us from reading
+ /* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STAT_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ dma_rmb();

/* retrieve a buffer from the ring */
skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);

2014-11-19 01:28:54

by Jeff Kirsher

[permalink] [raw]
Subject: Re: [PATCH v5 4/4] fm10k/igb/ixgbe: Use dma_rmb on Rx descriptor reads

On Tue, 2014-11-18 at 17:24 -0800, Alexander Duyck wrote:
> This change makes it so that dma_rmb is used when reading the Rx
> descriptor. The advantage of dma_rmb is that it allows for a much
> lower cost barrier on x86, powerpc, arm, and arm64 architectures than
> a
> traditional memory barrier when dealing with reads that only have to
> synchronize to coherent memory.
>
> In addition I have updated the code so that it just checks to see if
> any
> bits have been set instead of just the DD bit since the DD bit will
> always
> be set as a part of a descriptor write-back so we just need to check
> for a
> non-zero value being present at that memory location rather than just
> checking for any specific bit. This allows the code itself to appear
> much
> cleaner and allows the compiler more room to optimize.
>
> Cc: Jeff Kirsher <[email protected]>
> Cc: Matthew Vick <[email protected]>
> Cc: Don Skidmore <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 +++---
> drivers/net/ethernet/intel/igb/igb_main.c | 6 +++---
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++++-----
> 3 files changed, 10 insertions(+), 11 deletions(-)

Acked-by: Jeff Kirsher <[email protected]>


Attachments:
signature.asc (819.00 B)
This is a digitally signed message part

2014-11-25 14:01:34

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] arch: Add lightweight memory barriers dma_rmb() and dma_wmb()

On Wed, Nov 19, 2014 at 01:24:02AM +0000, Alexander Duyck wrote:
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 22a969c..a1c589b 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
> operations" subsection for information on where to use these.
>
>
> + (*) dma_wmb();
> + (*) dma_rmb();
> +
> + These are for use with memory based device I/O to guarantee the ordering
> + of cache coherent writes or reads with respect to other writes or reads
> + to cache coherent DMA memory.

Can you please make it crystal clear that "memory based device I/O" != MMIO?
If people get these barriers wrong, then debugging will be a nightmare.

> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index c6a3e73..d2f81e6 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -43,10 +43,14 @@
> #define mb() do { dsb(); outer_sync(); } while (0)
> #define rmb() dsb()
> #define wmb() do { dsb(st); outer_sync(); } while (0)
> +#define dma_rmb() dmb(osh)
> +#define dma_wmb() dmb(oshst)
> #else
> #define mb() barrier()
> #define rmb() barrier()
> #define wmb() barrier()
> +#define dma_rmb() barrier()
> +#define dma_wmb() barrier()
> #endif
>
> #ifndef CONFIG_SMP
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 6389d60..a5abb00 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -32,6 +32,9 @@
> #define rmb() dsb(ld)
> #define wmb() dsb(st)
>
> +#define dma_rmb() dmb(oshld)
> +#define dma_wmb() dmb(oshst)
> +
> #ifndef CONFIG_SMP
> #define smp_mb() barrier()
> #define smp_rmb() barrier()

The arm/arm64 bits look fine to me.

Acked-by: Will Deacon <[email protected]>

If we ever see platforms using Linux/dma_alloc_coherent with devices
mastering from a different outer-shareable domain that the one containing
the CPUs, then we'll need to revisit this.

Will

2014-11-25 16:28:28

by Alexander Duyck

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] arch: Add lightweight memory barriers dma_rmb() and dma_wmb()


On 11/25/2014 06:01 AM, Will Deacon wrote:
> On Wed, Nov 19, 2014 at 01:24:02AM +0000, Alexander Duyck wrote:
>> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
>> index 22a969c..a1c589b 100644
>> --- a/Documentation/memory-barriers.txt
>> +++ b/Documentation/memory-barriers.txt
>> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
>> operations" subsection for information on where to use these.
>>
>>
>> + (*) dma_wmb();
>> + (*) dma_rmb();
>> +
>> + These are for use with memory based device I/O to guarantee the ordering
>> + of cache coherent writes or reads with respect to other writes or reads
>> + to cache coherent DMA memory.
> Can you please make it crystal clear that "memory based device I/O" != MMIO?
> If people get these barriers wrong, then debugging will be a nightmare.

I think I'll update that to the following to avoid any confusion:
These are for use with consistent memory to guarantee the ordering
of writes or reads of shared memory accessible to both the CPU and a
DMA capable device.

I will also add a reference to DMA-API.txt at the end for more info on
what consistent memory is.

>> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
>> index c6a3e73..d2f81e6 100644
>> --- a/arch/arm/include/asm/barrier.h
>> +++ b/arch/arm/include/asm/barrier.h
>> @@ -43,10 +43,14 @@
>> #define mb() do { dsb(); outer_sync(); } while (0)
>> #define rmb() dsb()
>> #define wmb() do { dsb(st); outer_sync(); } while (0)
>> +#define dma_rmb() dmb(osh)
>> +#define dma_wmb() dmb(oshst)
>> #else
>> #define mb() barrier()
>> #define rmb() barrier()
>> #define wmb() barrier()
>> +#define dma_rmb() barrier()
>> +#define dma_wmb() barrier()
>> #endif
>>
>> #ifndef CONFIG_SMP
>> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
>> index 6389d60..a5abb00 100644
>> --- a/arch/arm64/include/asm/barrier.h
>> +++ b/arch/arm64/include/asm/barrier.h
>> @@ -32,6 +32,9 @@
>> #define rmb() dsb(ld)
>> #define wmb() dsb(st)
>>
>> +#define dma_rmb() dmb(oshld)
>> +#define dma_wmb() dmb(oshst)
>> +
>> #ifndef CONFIG_SMP
>> #define smp_mb() barrier()
>> #define smp_rmb() barrier()
> The arm/arm64 bits look fine to me.
>
> Acked-by: Will Deacon <[email protected]>

Thanks for the review.

> If we ever see platforms using Linux/dma_alloc_coherent with devices
> mastering from a different outer-shareable domain that the one containing
> the CPUs, then we'll need to revisit this.

Would we just need a system wide memory barrier in that case instead of
an outer shareable memory barrier, or would we need to look as something
like a sync barrier?

- Alex

2014-11-25 23:17:33

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] arch: Add lightweight memory barriers dma_rmb() and dma_wmb()

On Tue, 2014-11-18 at 17:24 -0800, Alexander Duyck wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be. For
> example in the case of PowerPC wmb() calls the heavy-weight sync
> instruction when for coherent memory operations all that is really needed
> is an lsync or eieio instruction.
> .../...

For powerpc:

Acked-by: Benjamin Herrenschmidt <[email protected]>


2014-11-26 16:05:00

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] arch: Add lightweight memory barriers dma_rmb() and dma_wmb()

On Tue, Nov 25, 2014 at 04:26:28PM +0000, Alexander Duyck wrote:
> On 11/25/2014 06:01 AM, Will Deacon wrote:
> > If we ever see platforms using Linux/dma_alloc_coherent with devices
> > mastering from a different outer-shareable domain that the one containing
> > the CPUs, then we'll need to revisit this.
>
> Would we just need a system wide memory barrier in that case instead of
> an outer shareable memory barrier, or would we need to look as something
> like a sync barrier?

I think dmb(sy) would do the trick, but let's cross that bridge if/when we
have to.

Will