These patches introduce two new primitives for synchronizing cache-enabled
memory writes and reads:
fast_rmb()
fast_wmb()
The first patch cleans up some unnecessary overhead related to the
definitions of read_barrier_depends and smp_read_barrier_depends, along
with the comments describing them.
The second patch adds the primitives for the applicable architectures and
asm-generic. The names for the new primitives are based on the names of
similar primitives that already exist in the mips and tile trees.
The third patch adds the barriers to r8169, which turns out to be a good
example of where the new barriers can help, since the driver currently uses
full rmb()/wmb() barriers to order accesses to the descriptors and the
DescOwn bit.
The fourth patch adds support for fast_rmb() to the Intel fm10k, igb, and
ixgbe drivers. Testing with the ixgbe driver has shown a processing
time reduction of at least 7ns per 64B frame on a Core i7-4930K.
This patch series is essentially the v3 for:
arch: Introduce load_acquire() and store_release()
or
arch: Introduce read_acquire()
The key changes in this patch series versus the earlier patches are:
v3:
- Added cleanup of read_barrier_depends
- Focus on rmb()/wmb() instead of acquire()/store()
- Added update to documentation with code example
- Added change in r8169 to fix cur_tx/DescOwn ordering
- Simplified changes to just replacing/moving barriers in r8169
v2:
- Renamed read_acquire() to be consistent with smp_load_acquire()
- Changed barrier used to be consistent with smp_load_acquire()
- Updated PowerPC code to use __lwsync based on IBM article
- Added store_release() as this is a viable use case for drivers
- Added r8169 patch which is able to fully use primitives
- Added fm10k/igb/ixgbe patch which is able to test performance
---
Alexander Duyck (4):
arch: Cleanup read_barrier_depends() and comments
arch: Add lightweight memory barriers fast_rmb() and fast_wmb()
r8169: Use fast_rmb() and fast_wmb() for DescOwn checks
fm10k/igb/ixgbe: Use fast_rmb on Rx descriptor reads
Documentation/memory-barriers.txt | 41 +++++++++++++++
arch/alpha/include/asm/barrier.h | 51 ++++++++++++++++++
arch/arm/include/asm/barrier.h | 4 +
arch/arm64/include/asm/barrier.h | 3 +
arch/blackfin/include/asm/barrier.h | 51 ++++++++++++++++++
arch/ia64/include/asm/barrier.h | 25 ++++-----
arch/metag/include/asm/barrier.h | 19 ++++---
arch/mips/include/asm/barrier.h | 52 -------------------
arch/powerpc/include/asm/barrier.h | 28 ++++++----
arch/s390/include/asm/barrier.h | 7 ++-
arch/sparc/include/asm/barrier_64.h | 7 ++-
arch/x86/include/asm/barrier.h | 70 ++++---------------------
arch/x86/um/asm/barrier.h | 20 ++++---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 +-
drivers/net/ethernet/intel/igb/igb_main.c | 6 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 +--
drivers/net/ethernet/realtek/r8169.c | 29 ++++++++--
include/asm-generic/barrier.h | 8 +++
18 files changed, 257 insertions(+), 179 deletions(-)
--
This patch is meant to clean up the handling of read_barrier_depends and
smp_read_barrier_depends. In multiple spots in the kernel headers
read_barrier_depends is defined as "do {} while (0)", yet the SMP section
then defines smp_read_barrier_depends in terms of read_barrier_depends,
while the non-SMP section defines it as yet another empty do/while.
With this commit I went through and cleaned out the duplicate definitions,
reducing them to two per header. In addition I moved the 50-line comment
for the macro out of the x86 and mips headers, which define it as an empty
do/while, and into the headers that actually implement it: alpha and
blackfin.
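For illustration, the layout each header converges on after the cleanup is
roughly the following (modeled on the ia64 hunk below; the actual barrier
instructions differ per architecture):

        #define mb()    ia64_mf()
        #define rmb()   mb()
        #define wmb()   mb()

        #ifdef CONFIG_SMP
        # define smp_mb()       mb()
        #else
        # define smp_mb()       barrier()
        #endif
        #define smp_rmb()       smp_mb()
        #define smp_wmb()       smp_mb()

        /* exactly one definition of each, outside the SMP/!SMP split */
        #define read_barrier_depends()          do { } while (0)
        #define smp_read_barrier_depends()      do { } while (0)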
Signed-off-by: Alexander Duyck <[email protected]>
---
arch/alpha/include/asm/barrier.h | 51 ++++++++++++++++++++++++++++++
arch/blackfin/include/asm/barrier.h | 51 ++++++++++++++++++++++++++++++
arch/ia64/include/asm/barrier.h | 22 +++++--------
arch/metag/include/asm/barrier.h | 7 ++--
arch/mips/include/asm/barrier.h | 52 -------------------------------
arch/powerpc/include/asm/barrier.h | 6 ++--
arch/s390/include/asm/barrier.h | 5 ++-
arch/sparc/include/asm/barrier_64.h | 4 +-
arch/x86/include/asm/barrier.h | 59 ++---------------------------------
arch/x86/um/asm/barrier.h | 7 ++--
10 files changed, 129 insertions(+), 135 deletions(-)
diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index 3832bdb..77516c8 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -7,6 +7,57 @@
#define rmb() __asm__ __volatile__("mb": : :"memory")
#define wmb() __asm__ __volatile__("wmb": : :"memory")
+/**
+ * read_barrier_depends - Flush all pending reads that subsequents reads
+ * depend on.
+ *
+ * No data-dependent reads from memory-like regions are ever reordered
+ * over this barrier. All reads preceding this primitive are guaranteed
+ * to access memory (but not necessarily other CPUs' caches) before any
+ * reads following this primitive that depend on the data return by
+ * any of the preceding reads. This primitive is much lighter weight than
+ * rmb() on most CPUs, and is never heavier weight than is
+ * rmb().
+ *
+ * These ordering constraints are respected by both the local CPU
+ * and the compiler.
+ *
+ * Ordering is not guaranteed by anything other than these primitives,
+ * not even by data dependencies. See the documentation for
+ * memory_barrier() for examples and URLs to more information.
+ *
+ * For example, the following code would force ordering (the initial
+ * value of "a" is zero, "b" is one, and "p" is "&a"):
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * b = 2;
+ * memory_barrier();
+ * p = &b; q = p;
+ * read_barrier_depends();
+ * d = *q;
+ * </programlisting>
+ *
+ * because the read of "*q" depends on the read of "p" and these
+ * two reads are separated by a read_barrier_depends(). However,
+ * the following code, with the same initial values for "a" and "b":
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * a = 2;
+ * memory_barrier();
+ * b = 3; y = b;
+ * read_barrier_depends();
+ * x = a;
+ * </programlisting>
+ *
+ * does not enforce ordering, since there is no data dependency between
+ * the read of "a" and the read of "b". Therefore, on some CPUs, such
+ * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
+ * in cases like this where there are no data dependencies.
+ */
#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
#ifdef CONFIG_SMP
diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
index 4200068..dfb66fe 100644
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -22,6 +22,57 @@
# define mb() do { barrier(); smp_check_barrier(); smp_mark_barrier(); } while (0)
# define rmb() do { barrier(); smp_check_barrier(); } while (0)
# define wmb() do { barrier(); smp_mark_barrier(); } while (0)
+/*
+ * read_barrier_depends - Flush all pending reads that subsequents reads
+ * depend on.
+ *
+ * No data-dependent reads from memory-like regions are ever reordered
+ * over this barrier. All reads preceding this primitive are guaranteed
+ * to access memory (but not necessarily other CPUs' caches) before any
+ * reads following this primitive that depend on the data return by
+ * any of the preceding reads. This primitive is much lighter weight than
+ * rmb() on most CPUs, and is never heavier weight than is
+ * rmb().
+ *
+ * These ordering constraints are respected by both the local CPU
+ * and the compiler.
+ *
+ * Ordering is not guaranteed by anything other than these primitives,
+ * not even by data dependencies. See the documentation for
+ * memory_barrier() for examples and URLs to more information.
+ *
+ * For example, the following code would force ordering (the initial
+ * value of "a" is zero, "b" is one, and "p" is "&a"):
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * b = 2;
+ * memory_barrier();
+ * p = &b; q = p;
+ * read_barrier_depends();
+ * d = *q;
+ * </programlisting>
+ *
+ * because the read of "*q" depends on the read of "p" and these
+ * two reads are separated by a read_barrier_depends(). However,
+ * the following code, with the same initial values for "a" and "b":
+ *
+ * <programlisting>
+ * CPU 0 CPU 1
+ *
+ * a = 2;
+ * memory_barrier();
+ * b = 3; y = b;
+ * read_barrier_depends();
+ * x = a;
+ * </programlisting>
+ *
+ * does not enforce ordering, since there is no data dependency between
+ * the read of "a" and the read of "b". Therefore, on some CPUs, such
+ * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
+ * in cases like this where there are no data dependencies.
+ */
# define read_barrier_depends() do { barrier(); smp_check_barrier(); } while (0)
#endif
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..e8fffb0 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -35,26 +35,22 @@
* it's (presumably) much slower than mf and (b) mf.a is supported for
* sequential memory pages only.
*/
-#define mb() ia64_mf()
-#define rmb() mb()
-#define wmb() mb()
-#define read_barrier_depends() do { } while(0)
+#define mb() ia64_mf()
+#define rmb() mb()
+#define wmb() mb()
#ifdef CONFIG_SMP
# define smp_mb() mb()
-# define smp_rmb() rmb()
-# define smp_wmb() wmb()
-# define smp_read_barrier_depends() read_barrier_depends()
-
#else
-
# define smp_mb() barrier()
-# define smp_rmb() barrier()
-# define smp_wmb() barrier()
-# define smp_read_barrier_depends() do { } while(0)
-
#endif
+#define smp_rmb() smp_mb()
+#define smp_wmb() smp_mb()
+
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..6d8b8c9 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -47,8 +47,6 @@ static inline void wmb(void)
wr_fence();
}
-#define read_barrier_depends() do { } while (0)
-
#ifndef CONFIG_SMP
#define fence() do { } while (0)
#define smp_mb() barrier()
@@ -82,7 +80,10 @@ static inline void fence(void)
#define smp_wmb() barrier()
#endif
#endif
-#define smp_read_barrier_depends() do { } while (0)
+
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
#define set_mb(var, value) do { var = value; smp_mb(); } while (0)
#define smp_store_release(p, v) \
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..3d69aa8 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -10,58 +10,6 @@
#include <asm/addrspace.h>
-/*
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier. All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads. This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies. See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * b = 2;
- * memory_barrier();
- * p = &b; q = p;
- * read_barrier_depends();
- * d = *q;
- * </programlisting>
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends(). However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * a = 2;
- * memory_barrier();
- * b = 3; y = b;
- * read_barrier_depends();
- * x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b". Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
- * in cases like this where there are no data dependencies.
- */
-
#define read_barrier_depends() do { } while(0)
#define smp_read_barrier_depends() do { } while(0)
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..cb6d66c 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -33,7 +33,6 @@
#define mb() __asm__ __volatile__ ("sync" : : : "memory")
#define rmb() __asm__ __volatile__ ("sync" : : : "memory")
#define wmb() __asm__ __volatile__ ("sync" : : : "memory")
-#define read_barrier_depends() do { } while(0)
#define set_mb(var, value) do { var = value; mb(); } while (0)
@@ -50,16 +49,17 @@
#define smp_mb() mb()
#define smp_rmb() __lwsync()
#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
-#define smp_read_barrier_depends() read_barrier_depends()
#else
#define __lwsync() barrier()
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() do { } while(0)
#endif /* CONFIG_SMP */
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
/*
* This is a barrier which prevents following instructions from being
* started until the value of the argument x is known. For example, if
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..33d191d 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -24,11 +24,12 @@
#define rmb() mb()
#define wmb() mb()
-#define read_barrier_depends() do { } while(0)
#define smp_mb() mb()
#define smp_rmb() rmb()
#define smp_wmb() wmb()
-#define smp_read_barrier_depends() read_barrier_depends()
+
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
#define smp_mb__before_atomic() smp_mb()
#define smp_mb__after_atomic() smp_mb()
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..6c974c0 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -37,7 +37,6 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define rmb() __asm__ __volatile__("":::"memory")
#define wmb() __asm__ __volatile__("":::"memory")
-#define read_barrier_depends() do { } while(0)
#define set_mb(__var, __value) \
do { __var = __value; membar_safe("#StoreLoad"); } while(0)
@@ -51,7 +50,8 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define smp_wmb() __asm__ __volatile__("":::"memory")
#endif
-#define smp_read_barrier_depends() do { } while(0)
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
#define smp_store_release(p, v) \
do { \
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..5238000 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,60 +24,6 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif
-/**
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier. All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads. This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies. See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * b = 2;
- * memory_barrier();
- * p = &b; q = p;
- * read_barrier_depends();
- * d = *q;
- * </programlisting>
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends(). However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- * CPU 0 CPU 1
- *
- * a = 2;
- * memory_barrier();
- * b = 3; y = b;
- * read_barrier_depends();
- * x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b". Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0. Use rmb()
- * in cases like this where there are no data dependencies.
- **/
-
-#define read_barrier_depends() do { } while (0)
-
#ifdef CONFIG_SMP
#define smp_mb() mb()
#ifdef CONFIG_X86_PPRO_FENCE
@@ -86,16 +32,17 @@
# define smp_rmb() barrier()
#endif
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* !SMP */
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() do { } while (0)
#define set_mb(var, value) do { var = value; barrier(); } while (0)
#endif /* SMP */
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
#if defined(CONFIG_X86_PPRO_FENCE)
/*
diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
index cc04e67..d6511d9 100644
--- a/arch/x86/um/asm/barrier.h
+++ b/arch/x86/um/asm/barrier.h
@@ -29,8 +29,6 @@
#endif /* CONFIG_X86_32 */
-#define read_barrier_depends() do { } while (0)
-
#ifdef CONFIG_SMP
#define smp_mb() mb()
@@ -42,7 +40,6 @@
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() read_barrier_depends()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* CONFIG_SMP */
@@ -50,11 +47,13 @@
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
-#define smp_read_barrier_depends() do { } while (0)
#define set_mb(var, value) do { var = value; barrier(); } while (0)
#endif /* CONFIG_SMP */
+#define read_barrier_depends() do { } while (0)
+#define smp_read_barrier_depends() do { } while (0)
+
/*
* Stop RDTSC speculation. This is needed when you need to use RDTSC
* (or get_cycles or vread that possibly accesses the TSC) in a defined
The r8169 driver uses a pair of wmb() calls when setting up the descriptor
rings. The first synchronizes the descriptor data with the descriptor
status, and the second synchronizes the descriptor status with the MMIO
doorbell write used to notify the device that descriptors are ready. This
can come at a heavy price on some systems, and is not really necessary on
systems such as x86 where a simple barrier() would suffice to order
store/store accesses. As such we can replace the first memory barrier with
fast_wmb() to reduce the cost of these accesses.
In addition the r8169 uses an rmb() to prevent compiler optimization in the
cleanup paths. By moving the barrier down a few lines and replacing it with
a fast_rmb() we can use it to guarantee that descriptor accesses do not
occur until the device has updated the DescOwn bit from its end.
One last change I made is to move the update of cur_tx in the xmit path to
after the wmb(). This way we can guarantee that the device and all CPUs see
the DescOwn update before they see the cur_tx update.
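For reference, a condensed sketch of the resulting ordering in the xmit
path (taken from the hunk below); the fast_wmb() orders the descriptor
writes against the DescOwn store, while the full wmb() orders the
cache-enabled stores against the MMIO doorbell write:

        skb_tx_timestamp(skb);

        /* Force memory writes to complete before releasing descriptor */
        fast_wmb();

        status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
        txd->opts1 = cpu_to_le32(status);

        /* Force all memory writes to complete before notifying device */
        wmb();

        tp->cur_tx += frags + 1;

        RTL_W8(TxPoll, NPQ);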
Cc: Realtek linux nic maintainers <[email protected]>
Cc: Francois Romieu <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/net/ethernet/realtek/r8169.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index cf154f7..be79447 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6601,6 +6601,9 @@ static inline void rtl8169_mark_to_asic(struct RxDesc *desc, u32 rx_buf_sz)
{
u32 eor = le32_to_cpu(desc->opts1) & RingEnd;
+ /* Force memory writes to complete before releasing descriptor */
+ fast_wmb();
+
desc->opts1 = cpu_to_le32(DescOwn | eor | rx_buf_sz);
}
@@ -6608,7 +6611,6 @@ static inline void rtl8169_map_to_asic(struct RxDesc *desc, dma_addr_t mapping,
u32 rx_buf_sz)
{
desc->addr = cpu_to_le64(mapping);
- wmb();
rtl8169_mark_to_asic(desc, rx_buf_sz);
}
@@ -7077,16 +7079,18 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
skb_tx_timestamp(skb);
- wmb();
+ /* Force memory writes to complete before releasing descriptor */
+ fast_wmb();
/* Anti gcc 2.95.3 bugware (sic) */
status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
txd->opts1 = cpu_to_le32(status);
- tp->cur_tx += frags + 1;
-
+ /* Force all memory writes to complete before notifying device */
wmb();
+ tp->cur_tx += frags + 1;
+
RTL_W8(TxPoll, NPQ);
mmiowb();
@@ -7185,11 +7189,16 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
struct ring_info *tx_skb = tp->tx_skb + entry;
u32 status;
- rmb();
status = le32_to_cpu(tp->TxDescArray[entry].opts1);
if (status & DescOwn)
break;
+ /* This barrier is needed to keep us from reading
+ * any other fields out of the Tx descriptor until
+ * we know the status of DescOwn
+ */
+ fast_rmb();
+
rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
tp->TxDescArray + entry);
if (status & LastFrag) {
@@ -7284,11 +7293,16 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
struct RxDesc *desc = tp->RxDescArray + entry;
u32 status;
- rmb();
status = le32_to_cpu(desc->opts1) & tp->opts1_mask;
-
if (status & DescOwn)
break;
+
+ /* This barrier is needed to keep us from reading
+ * any other fields out of the Rx descriptor until
+ * we know the status of DescOwn
+ */
+ fast_rmb();
+
if (unlikely(status & RxRES)) {
netif_info(tp, rx_err, dev, "Rx ERROR. status = %08x\n",
status);
@@ -7350,7 +7364,6 @@ process_pkt:
}
release_descriptor:
desc->opts2 = 0;
- wmb();
rtl8169_mark_to_asic(desc, rx_buf_sz);
}
There are a number of situations where the mandatory barriers rmb() and
wmb() are used to order memory/memory operations in device drivers, and
those barriers are much heavier than they actually need to be. For
example, in the case of PowerPC wmb() calls the heavy-weight sync
instruction when, for memory/memory operations, all that is really needed
is an lwsync or eieio instruction.
This commit adds fast (and loose) versions of the mandatory memory
barriers rmb() and wmb(). The prefix for the names is based on similarly
named functions that already exist in the mips and tile trees. I thought
it applicable since it gets at what we are trying to accomplish with these
barriers and somewhat implies their risky nature.
These new barriers are not as safe as the standard rmb() and wmb().
Specifically they do not guarantee ordering between cache-enabled and
cache-inhibited memories. The primary use case for these would be to
enforce ordering of memory reads/writes when accessing cache-enabled memory
that is shared between the CPU and a device.
It may also be noted that there is no fast_mb(). This is due to the fact
that most architectures don't seem to have a good way to do a full memory
barrier quickly, so they usually resort to an mb() for their smp_mb()
call. As such there is no point in adding a fast_mb() function if it is
going to map to mb() on all architectures anyway.
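As a rough usage sketch (condensed from the consumer-side changes later in
this series; the descriptor and flag names are taken from r8169), the
intended pattern for polling a descriptor that may still be owned by the
device looks like:

        status = le32_to_cpu(desc->opts1);
        if (status & DescOwn)
                break;

        /* Do not read any other descriptor fields until we have seen
         * the device clear DescOwn.
         */
        fast_rmb();

        /* now safe to read the remaining descriptor fields */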
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Russell King <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
Documentation/memory-barriers.txt | 41 +++++++++++++++++++++++++++++++++++
arch/arm/include/asm/barrier.h | 4 +++
arch/arm64/include/asm/barrier.h | 3 +++
arch/ia64/include/asm/barrier.h | 3 +++
arch/metag/include/asm/barrier.h | 14 ++++++------
arch/powerpc/include/asm/barrier.h | 22 +++++++++++--------
arch/s390/include/asm/barrier.h | 2 ++
arch/sparc/include/asm/barrier_64.h | 3 +++
arch/x86/include/asm/barrier.h | 11 ++++++---
arch/x86/um/asm/barrier.h | 13 ++++++-----
include/asm-generic/barrier.h | 8 +++++++
11 files changed, 98 insertions(+), 26 deletions(-)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 22a969c..fe55dea 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
operations" subsection for information on where to use these.
+ (*) fast_wmb();
+ (*) fast_rmb();
+
+ These are for use with memory based device I/O to guarantee the ordering
+ of cache-enabled writes or reads with respect to other writes or reads
+ to cache-enabled memory.
+
+ For example, consider a device driver that shares memory with a device
+ and uses a descriptor status value to indicate if the descriptor belongs
+ to the device or the CPU, and a doorbell to notify it when new
+ descriptors are available:
+
+ if (desc->status != DEVICE_OWN) {
+ /* do not read data until we own descriptor */
+ fast_rmb();
+
+ /* read/modify data */
+ read_data = desc->data;
+ desc->data = write_data;
+
+ /* flush modifications before status update */
+ fast_wmb();
+
+ /* assign ownership */
+ desc->status = DEVICE_OWN;
+
+ /* force memory to sync before notifying device via MMIO */
+ wmb();
+
+ /* notify device of new descriptors */
+ writel(DESC_NOTIFY, doorbell);
+ }
+
+ The fast_rmb() allows us to guarantee the device has released ownership
+ before we read the data from the descriptor, and the fast_wmb() allows
+ us to guarantee the data is written to the descriptor before the device
+ can see it now has ownership. The wmb() is needed to guarantee that the
+ cache-enabled memory writes have completed before attempting a write to
+ the cache-inhibited MMIO region.
+
+
MMIO WRITE BARRIER
------------------
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..c57903c 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -43,10 +43,14 @@
#define mb() do { dsb(); outer_sync(); } while (0)
#define rmb() dsb()
#define wmb() do { dsb(st); outer_sync(); } while (0)
+#define fast_rmb() dmb()
+#define fast_wmb() dmb(st)
#else
#define mb() barrier()
#define rmb() barrier()
#define wmb() barrier()
+#define fast_rmb() barrier()
+#define fast_wmb() barrier()
#endif
#ifndef CONFIG_SMP
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..3ca1a15 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,6 +32,9 @@
#define rmb() dsb(ld)
#define wmb() dsb(st)
+#define fast_rmb() dmb(ld)
+#define fast_wmb() dmb(st)
+
#ifndef CONFIG_SMP
#define smp_mb() barrier()
#define smp_rmb() barrier()
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index e8fffb0..997f5b0 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -39,6 +39,9 @@
#define rmb() mb()
#define wmb() mb()
+#define fast_rmb() mb()
+#define fast_wmb() mb()
+
#ifdef CONFIG_SMP
# define smp_mb() mb()
#else
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index 6d8b8c9..edddb6d 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -4,8 +4,6 @@
#include <asm/metag_mem.h>
#define nop() asm volatile ("NOP")
-#define mb() wmb()
-#define rmb() barrier()
#ifdef CONFIG_METAG_META21
@@ -41,11 +39,13 @@ static inline void wr_fence(void)
#endif /* !CONFIG_METAG_META21 */
-static inline void wmb(void)
-{
- /* flush writes through the write combiner */
- wr_fence();
-}
+/* flush writes through the write combiner */
+#define mb() wr_fence()
+#define rmb() barrier()
+#define wmb() mb()
+
+#define fast_rmb() rmb()
+#define fast_wmb() wmb()
#ifndef CONFIG_SMP
#define fence() do { } while (0)
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index cb6d66c..f480097 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -36,22 +36,20 @@
#define set_mb(var, value) do { var = value; mb(); } while (0)
-#ifdef CONFIG_SMP
-
#ifdef __SUBARCH_HAS_LWSYNC
# define SMPWMB LWSYNC
#else
# define SMPWMB eieio
#endif
-#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
+#ifdef CONFIG_SMP
#define smp_mb() mb()
-#define smp_rmb() __lwsync()
-#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
+#define smp_rmb() fast_rmb()
+#define smp_wmb() fast_wmb()
#else
-#define __lwsync() barrier()
-
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
@@ -69,10 +67,16 @@
#define data_barrier(x) \
asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
+/*
+ * The use of smp_rmb() in these functions is actually meant to map from
+ * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
+ * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
+ * map to the more heavy-weight sync.
+ */
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
- __lwsync(); \
+ smp_rmb(); \
ACCESS_ONCE(*p) = (v); \
} while (0)
@@ -80,7 +84,7 @@ do { \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
- __lwsync(); \
+ smp_rmb(); \
___p1; \
})
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 33d191d..9a31301 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -24,6 +24,8 @@
#define rmb() mb()
#define wmb() mb()
+#define fast_rmb() rmb()
+#define fast_wmb() wmb()
#define smp_mb() mb()
#define smp_rmb() rmb()
#define smp_wmb() wmb()
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 6c974c0..eac2777 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -37,6 +37,9 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
#define rmb() __asm__ __volatile__("":::"memory")
#define wmb() __asm__ __volatile__("":::"memory")
+#define fast_rmb() rmb()
+#define fast_wmb() wmb()
+
#define set_mb(__var, __value) \
do { __var = __value; membar_safe("#StoreLoad"); } while(0)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 5238000..cf440e7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,13 +24,16 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif
-#ifdef CONFIG_SMP
-#define smp_mb() mb()
#ifdef CONFIG_X86_PPRO_FENCE
-# define smp_rmb() rmb()
+#define fast_rmb() rmb()
#else
-# define smp_rmb() barrier()
+#define fast_rmb() barrier()
#endif
+#define fast_wmb() barrier()
+
+#ifdef CONFIG_SMP
+#define smp_mb() mb()
+#define smp_rmb() fast_rmb()
#define smp_wmb() barrier()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* !SMP */
diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
index d6511d9..2c0405d 100644
--- a/arch/x86/um/asm/barrier.h
+++ b/arch/x86/um/asm/barrier.h
@@ -29,17 +29,18 @@
#endif /* CONFIG_X86_32 */
-#ifdef CONFIG_SMP
-
-#define smp_mb() mb()
#ifdef CONFIG_X86_PPRO_FENCE
-#define smp_rmb() rmb()
+#define fast_rmb() rmb()
#else /* CONFIG_X86_PPRO_FENCE */
-#define smp_rmb() barrier()
+#define fast_rmb() barrier()
#endif /* CONFIG_X86_PPRO_FENCE */
+#define fast_wmb() barrier()
-#define smp_wmb() barrier()
+#ifdef CONFIG_SMP
+#define smp_mb() mb()
+#define smp_rmb() fast_rmb()
+#define smp_wmb() fast_wmb()
#define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
#else /* CONFIG_SMP */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c1d60b9 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -42,6 +42,14 @@
#define wmb() mb()
#endif
+#ifndef fast_rmb
+#define fast_rmb() rmb()
+#endif
+
+#ifndef fast_wmb
+#define fast_wmb() wmb()
+#endif
+
#ifndef read_barrier_depends
#define read_barrier_depends() do { } while (0)
#endif
This change makes it so that fast_rmb is used when reading the Rx
descriptor. The advantage of fast_rmb is that it allows for a much
lower cost barrier on x86, powerpc, arm, and arm64 architectures than a
traditional memory barrier when dealing with reads that only have to
synchronize to system memory.
In addition I have updated the code so that it simply checks whether any
bits have been set instead of testing specifically for the DD bit. Since
the DD bit is always set as part of a descriptor write-back, it is enough
to check for a non-zero value at that memory location rather than for a
specific bit. This makes the code cleaner and gives the compiler more room
to optimize.
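The resulting pattern in the Rx cleanup loops looks as follows (fm10k
shown; the igb and ixgbe changes are analogous):

        if (!rx_desc->d.staterr)
                break;

        /* This memory barrier is needed to keep us from reading
         * any other fields out of the rx_desc until we know the
         * descriptor has been written back
         */
        fast_rmb();

        /* retrieve a buffer from the ring */
        skb = fm10k_fetch_rx_buffer(rx_ring, rx_desc, skb);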
Cc: Jeff Kirsher <[email protected]>
Cc: Matthew Vick <[email protected]>
Cc: Don Skidmore <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 +++---
drivers/net/ethernet/intel/igb/igb_main.c | 6 +++---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++++-----
3 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e645af4..ba959b9 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -620,14 +620,14 @@ static bool fm10k_clean_rx_irq(struct fm10k_q_vector *q_vector,
rx_desc = FM10K_RX_DESC(rx_ring, rx_ring->next_to_clean);
- if (!fm10k_test_staterr(rx_desc, FM10K_RXD_STATUS_DD))
+ if (!rx_desc->d.staterr)
break;
/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STATUS_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ fast_rmb();
/* retrieve a buffer from the ring */
skb = fm10k_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a2d72a8..3dbbaa8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6918,14 +6918,14 @@ static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)
rx_desc = IGB_RX_DESC(rx_ring, rx_ring->next_to_clean);
- if (!igb_test_staterr(rx_desc, E1000_RXD_STAT_DD))
+ if (!rx_desc->wb.upper.status_error)
break;
/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STAT_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ fast_rmb();
/* retrieve a buffer from the ring */
skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d2df4e3..890d740 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2003,15 +2003,14 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
rx_desc = IXGBE_RX_DESC(rx_ring, rx_ring->next_to_clean);
- if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
+ if (!rx_desc->wb.upper.status_error)
break;
- /*
- * This memory barrier is needed to keep us from reading
+ /* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STAT_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ fast_rmb();
/* retrieve a buffer from the ring */
skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);
On Mon, 2014-11-17 at 09:18 -0800, Alexander Duyck wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be. For
> example in the case of PowerPC wmb() calls the heavy-weight sync
> instruction when for memory/memory operations all that is really needed is
> an lsync or eieio instruction.
So essentially those are the same as the smp_* variants but not nop'ed
out on !CONFIG_SMP, right?
Ben.
> This commit adds a fast (and loose) version of the mandatory memory
> barriers rmb() and wmb(). The prefix to the name is actually based on the
> version of the functions that already exist in the mips and tile trees.
> However I thought it applicable since it gets at what we are trying to
> accomplish with these barriers and somethat implies their risky nature.
>
> These new barriers are not as safe as the standard rmb() and wmb().
> Specifically they do not guarantee ordering between cache-enabled and
> cache-inhibited memories. The primary use case for these would be to
> enforce ordering of memory reads/writes when accessing cache-enabled memory
> that is shared between the CPU and a device.
>
> It may also be noted that there is no fast_mb(). This is due to the fact
> that most architectures didn't seem to have a good way to do a full memory
> barrier quickly and so they usually resorted to an mb() for their smp_mb
> call. As such there no point in adding a fast_mb() function if it is going
> to map to mb() for all architectures anyway.
>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Michael Neuling <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: David Miller <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> Documentation/memory-barriers.txt | 41 +++++++++++++++++++++++++++++++++++
> arch/arm/include/asm/barrier.h | 4 +++
> arch/arm64/include/asm/barrier.h | 3 +++
> arch/ia64/include/asm/barrier.h | 3 +++
> arch/metag/include/asm/barrier.h | 14 ++++++------
> arch/powerpc/include/asm/barrier.h | 22 +++++++++++--------
> arch/s390/include/asm/barrier.h | 2 ++
> arch/sparc/include/asm/barrier_64.h | 3 +++
> arch/x86/include/asm/barrier.h | 11 ++++++---
> arch/x86/um/asm/barrier.h | 13 ++++++-----
> include/asm-generic/barrier.h | 8 +++++++
> 11 files changed, 98 insertions(+), 26 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 22a969c..fe55dea 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
> operations" subsection for information on where to use these.
>
>
> + (*) fast_wmb();
> + (*) fast_rmb();
> +
> + These are for use with memory based device I/O to guarantee the ordering
> + of cache-enabled writes or reads with respect to other writes or reads
> + to cache-enabled memory.
> +
> + For example, consider a device driver that shares memory with a device
> + and uses a descriptor status value to indicate if the descriptor belongs
> + to the device or the CPU, and a doorbell to notify it when new
> + descriptors are available:
> +
> + if (desc->status != DEVICE_OWN) {
> + /* do not read data until we own descriptor */
> + fast_rmb();
> +
> + /* read/modify data */
> + read_data = desc->data;
> + desc->data = write_data;
> +
> + /* flush modifications before status update */
> + fast_wmb();
> +
> + /* assign ownership */
> + desc->status = DEVICE_OWN;
> +
> + /* force memory to sync before notifying device via MMIO */
> + wmb();
> +
> + /* notify device of new descriptors */
> + writel(DESC_NOTIFY, doorbell);
> + }
> +
> + The fast_rmb() allows us guarantee the device has released ownership
> + before we read the data from the descriptor, and he fast_wmb() allows
> + us to guarantee the data is written to the descriptor before the device
> + can see it now has ownership. The wmb() is needed to guarantee that the
> + cache-enabled memory writes have completed before attempting a write to
> + the cache-inhibited MMIO region.
> +
> +
> MMIO WRITE BARRIER
> ------------------
>
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index c6a3e73..c57903c 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -43,10 +43,14 @@
> #define mb() do { dsb(); outer_sync(); } while (0)
> #define rmb() dsb()
> #define wmb() do { dsb(st); outer_sync(); } while (0)
> +#define fast_rmb() dmb()
> +#define fast_wmb() dmb(st)
> #else
> #define mb() barrier()
> #define rmb() barrier()
> #define wmb() barrier()
> +#define fast_rmb() barrier()
> +#define fast_wmb() barrier()
> #endif
>
> #ifndef CONFIG_SMP
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 6389d60..3ca1a15 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -32,6 +32,9 @@
> #define rmb() dsb(ld)
> #define wmb() dsb(st)
>
> +#define fast_rmb() dmb(ld)
> +#define fast_wmb() dmb(st)
> +
> #ifndef CONFIG_SMP
> #define smp_mb() barrier()
> #define smp_rmb() barrier()
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index e8fffb0..997f5b0 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -39,6 +39,9 @@
> #define rmb() mb()
> #define wmb() mb()
>
> +#define fast_rmb() mb()
> +#define fast_wmb() mb()
> +
> #ifdef CONFIG_SMP
> # define smp_mb() mb()
> #else
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index 6d8b8c9..edddb6d 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -4,8 +4,6 @@
> #include <asm/metag_mem.h>
>
> #define nop() asm volatile ("NOP")
> -#define mb() wmb()
> -#define rmb() barrier()
>
> #ifdef CONFIG_METAG_META21
>
> @@ -41,11 +39,13 @@ static inline void wr_fence(void)
>
> #endif /* !CONFIG_METAG_META21 */
>
> -static inline void wmb(void)
> -{
> - /* flush writes through the write combiner */
> - wr_fence();
> -}
> +/* flush writes through the write combiner */
> +#define mb() wr_fence()
> +#define rmb() barrier()
> +#define wmb() mb()
> +
> +#define fast_rmb() rmb()
> +#define fast_wmb() wmb()
>
> #ifndef CONFIG_SMP
> #define fence() do { } while (0)
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index cb6d66c..f480097 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,22 +36,20 @@
>
> #define set_mb(var, value) do { var = value; mb(); } while (0)
>
> -#ifdef CONFIG_SMP
> -
> #ifdef __SUBARCH_HAS_LWSYNC
> # define SMPWMB LWSYNC
> #else
> # define SMPWMB eieio
> #endif
>
> -#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>
> +#ifdef CONFIG_SMP
> #define smp_mb() mb()
> -#define smp_rmb() __lwsync()
> -#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> +#define smp_rmb() fast_rmb()
> +#define smp_wmb() fast_wmb()
> #else
> -#define __lwsync() barrier()
> -
> #define smp_mb() barrier()
> #define smp_rmb() barrier()
> #define smp_wmb() barrier()
> @@ -69,10 +67,16 @@
> #define data_barrier(x) \
> asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>
> +/*
> + * The use of smp_rmb() in these functions are actually meant to map from
> + * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> + * map to the more heavy-weight sync.
> + */
> #define smp_store_release(p, v) \
> do { \
> compiletime_assert_atomic_type(*p); \
> - __lwsync(); \
> + smp_rmb(); \
> ACCESS_ONCE(*p) = (v); \
> } while (0)
>
> @@ -80,7 +84,7 @@ do { \
> ({ \
> typeof(*p) ___p1 = ACCESS_ONCE(*p); \
> compiletime_assert_atomic_type(*p); \
> - __lwsync(); \
> + smp_rmb(); \
> ___p1; \
> })
>
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 33d191d..9a31301 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -24,6 +24,8 @@
>
> #define rmb() mb()
> #define wmb() mb()
> +#define fast_rmb() rmb()
> +#define fast_wmb() wmb()
> #define smp_mb() mb()
> #define smp_rmb() rmb()
> #define smp_wmb() wmb()
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 6c974c0..eac2777 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -37,6 +37,9 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
> #define rmb() __asm__ __volatile__("":::"memory")
> #define wmb() __asm__ __volatile__("":::"memory")
>
> +#define fast_rmb() rmb()
> +#define fast_wmb() wmb()
> +
> #define set_mb(__var, __value) \
> do { __var = __value; membar_safe("#StoreLoad"); } while(0)
>
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 5238000..cf440e7 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -24,13 +24,16 @@
> #define wmb() asm volatile("sfence" ::: "memory")
> #endif
>
> -#ifdef CONFIG_SMP
> -#define smp_mb() mb()
> #ifdef CONFIG_X86_PPRO_FENCE
> -# define smp_rmb() rmb()
> +#define fast_rmb() rmb()
> #else
> -# define smp_rmb() barrier()
> +#define fast_rmb() barrier()
> #endif
> +#define fast_wmb() barrier()
> +
> +#ifdef CONFIG_SMP
> +#define smp_mb() mb()
> +#define smp_rmb() fast_rmb()
> #define smp_wmb() barrier()
> #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> #else /* !SMP */
> diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
> index d6511d9..2c0405d 100644
> --- a/arch/x86/um/asm/barrier.h
> +++ b/arch/x86/um/asm/barrier.h
> @@ -29,17 +29,18 @@
>
> #endif /* CONFIG_X86_32 */
>
> -#ifdef CONFIG_SMP
> -
> -#define smp_mb() mb()
> #ifdef CONFIG_X86_PPRO_FENCE
> -#define smp_rmb() rmb()
> +#define fast_rmb() rmb()
> #else /* CONFIG_X86_PPRO_FENCE */
> -#define smp_rmb() barrier()
> +#define fast_rmb() barrier()
> #endif /* CONFIG_X86_PPRO_FENCE */
> +#define fast_wmb() barrier()
>
> -#define smp_wmb() barrier()
> +#ifdef CONFIG_SMP
>
> +#define smp_mb() mb()
> +#define smp_rmb() fast_rmb()
> +#define smp_wmb() fast_wmb()
> #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
>
> #else /* CONFIG_SMP */
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 1402fa8..c1d60b9 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -42,6 +42,14 @@
> #define wmb() mb()
> #endif
>
> +#ifndef fast_rmb
> +#define fast_rmb() rmb()
> +#endif
> +
> +#ifndef fast_wmb
> +#define fast_wmb() wmb()
> +#endif
> +
> #ifndef read_barrier_depends
> #define read_barrier_depends() do { } while (0)
> #endif
>
On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be. For
> example in the case of PowerPC wmb() calls the heavy-weight sync
> instruction when for memory/memory operations all that is really needed is
> an lsync or eieio instruction.
Is this still the case if one of the memory operations is MMIO? Last
I knew, it was not.
> This commit adds a fast (and loose) version of the mandatory memory
> barriers rmb() and wmb(). The prefix to the name is actually based on the
> version of the functions that already exist in the mips and tile trees.
> However I thought it applicable since it gets at what we are trying to
> accomplish with these barriers and somethat implies their risky nature.
>
> These new barriers are not as safe as the standard rmb() and wmb().
> Specifically they do not guarantee ordering between cache-enabled and
> cache-inhibited memories. The primary use case for these would be to
> enforce ordering of memory reads/writes when accessing cache-enabled memory
> that is shared between the CPU and a device.
>
> It may also be noted that there is no fast_mb(). This is due to the fact
> that most architectures didn't seem to have a good way to do a full memory
> barrier quickly and so they usually resorted to an mb() for their smp_mb
> call. As such there no point in adding a fast_mb() function if it is going
> to map to mb() for all architectures anyway.
I must confess that I still don't entirely understand the motivation.
Some problems in PowerPC barrier.h called out below.
Thanx, Paul
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Michael Neuling <[email protected]>
> Cc: Russell King <[email protected]>
> Cc: Geert Uytterhoeven <[email protected]>
> Cc: Heiko Carstens <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
> Cc: Tony Luck <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: David Miller <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> Documentation/memory-barriers.txt | 41 +++++++++++++++++++++++++++++++++++
> arch/arm/include/asm/barrier.h | 4 +++
> arch/arm64/include/asm/barrier.h | 3 +++
> arch/ia64/include/asm/barrier.h | 3 +++
> arch/metag/include/asm/barrier.h | 14 ++++++------
> arch/powerpc/include/asm/barrier.h | 22 +++++++++++--------
> arch/s390/include/asm/barrier.h | 2 ++
> arch/sparc/include/asm/barrier_64.h | 3 +++
> arch/x86/include/asm/barrier.h | 11 ++++++---
> arch/x86/um/asm/barrier.h | 13 ++++++-----
> include/asm-generic/barrier.h | 8 +++++++
> 11 files changed, 98 insertions(+), 26 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 22a969c..fe55dea 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
> operations" subsection for information on where to use these.
>
>
> + (*) fast_wmb();
> + (*) fast_rmb();
> +
> + These are for use with memory based device I/O to guarantee the ordering
> + of cache-enabled writes or reads with respect to other writes or reads
> + to cache-enabled memory.
> +
> + For example, consider a device driver that shares memory with a device
> + and uses a descriptor status value to indicate if the descriptor belongs
> + to the device or the CPU, and a doorbell to notify it when new
> + descriptors are available:
> +
> + if (desc->status != DEVICE_OWN) {
> + /* do not read data until we own descriptor */
> + fast_rmb();
> +
> + /* read/modify data */
> + read_data = desc->data;
> + desc->data = write_data;
> +
> + /* flush modifications before status update */
> + fast_wmb();
> +
> + /* assign ownership */
> + desc->status = DEVICE_OWN;
> +
> + /* force memory to sync before notifying device via MMIO */
> + wmb();
> +
> + /* notify device of new descriptors */
> + writel(DESC_NOTIFY, doorbell);
> + }
> +
> + The fast_rmb() allows us guarantee the device has released ownership
> + before we read the data from the descriptor, and he fast_wmb() allows
> + us to guarantee the data is written to the descriptor before the device
> + can see it now has ownership. The wmb() is needed to guarantee that the
> + cache-enabled memory writes have completed before attempting a write to
> + the cache-inhibited MMIO region.
> +
> +
> MMIO WRITE BARRIER
> ------------------
>
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index c6a3e73..c57903c 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -43,10 +43,14 @@
> #define mb() do { dsb(); outer_sync(); } while (0)
> #define rmb() dsb()
> #define wmb() do { dsb(st); outer_sync(); } while (0)
> +#define fast_rmb() dmb()
> +#define fast_wmb() dmb(st)
> #else
> #define mb() barrier()
> #define rmb() barrier()
> #define wmb() barrier()
> +#define fast_rmb() barrier()
> +#define fast_wmb() barrier()
> #endif
>
> #ifndef CONFIG_SMP
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 6389d60..3ca1a15 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -32,6 +32,9 @@
> #define rmb() dsb(ld)
> #define wmb() dsb(st)
>
> +#define fast_rmb() dmb(ld)
> +#define fast_wmb() dmb(st)
> +
> #ifndef CONFIG_SMP
> #define smp_mb() barrier()
> #define smp_rmb() barrier()
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index e8fffb0..997f5b0 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -39,6 +39,9 @@
> #define rmb() mb()
> #define wmb() mb()
>
> +#define fast_rmb() mb()
> +#define fast_wmb() mb()
> +
> #ifdef CONFIG_SMP
> # define smp_mb() mb()
> #else
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index 6d8b8c9..edddb6d 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -4,8 +4,6 @@
> #include <asm/metag_mem.h>
>
> #define nop() asm volatile ("NOP")
> -#define mb() wmb()
> -#define rmb() barrier()
>
> #ifdef CONFIG_METAG_META21
>
> @@ -41,11 +39,13 @@ static inline void wr_fence(void)
>
> #endif /* !CONFIG_METAG_META21 */
>
> -static inline void wmb(void)
> -{
> - /* flush writes through the write combiner */
> - wr_fence();
> -}
> +/* flush writes through the write combiner */
> +#define mb() wr_fence()
> +#define rmb() barrier()
> +#define wmb() mb()
> +
> +#define fast_rmb() rmb()
> +#define fast_wmb() wmb()
>
> #ifndef CONFIG_SMP
> #define fence() do { } while (0)
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index cb6d66c..f480097 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,22 +36,20 @@
>
> #define set_mb(var, value) do { var = value; mb(); } while (0)
>
> -#ifdef CONFIG_SMP
> -
> #ifdef __SUBARCH_HAS_LWSYNC
> # define SMPWMB LWSYNC
> #else
> # define SMPWMB eieio
> #endif
>
> -#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>
> +#ifdef CONFIG_SMP
> #define smp_mb() mb()
> -#define smp_rmb() __lwsync()
> -#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> +#define smp_rmb() fast_rmb()
> +#define smp_wmb() fast_wmb()
> #else
> -#define __lwsync() barrier()
> -
> #define smp_mb() barrier()
> #define smp_rmb() barrier()
> #define smp_wmb() barrier()
> @@ -69,10 +67,16 @@
> #define data_barrier(x) \
> asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>
> +/*
> + * The use of smp_rmb() in these functions are actually meant to map from
> + * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> + * map to the more heavy-weight sync.
> + */
> #define smp_store_release(p, v) \
> do { \
> compiletime_assert_atomic_type(*p); \
> - __lwsync(); \
> + smp_rmb(); \
This is not good at all. For smp_store_release(), we absolutely
must order prior loads and stores against the assignment on the following
line. This is not something that smp_rmb() does, nor is it something
that smp_wmb() does. Yes, it might happen to work now, but this could easily
break in the future -- plus this change is extremely misleading.
The original __lwsync() is much more clear.
> ACCESS_ONCE(*p) = (v); \
> } while (0)
>
> @@ -80,7 +84,7 @@ do { \
> ({ \
> typeof(*p) ___p1 = ACCESS_ONCE(*p); \
> compiletime_assert_atomic_type(*p); \
> - __lwsync(); \
> + smp_rmb(); \
> ___p1; \
> })
>
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 33d191d..9a31301 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -24,6 +24,8 @@
>
> #define rmb() mb()
> #define wmb() mb()
> +#define fast_rmb() rmb()
> +#define fast_wmb() wmb()
> #define smp_mb() mb()
> #define smp_rmb() rmb()
> #define smp_wmb() wmb()
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 6c974c0..eac2777 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -37,6 +37,9 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \
> #define rmb() __asm__ __volatile__("":::"memory")
> #define wmb() __asm__ __volatile__("":::"memory")
>
> +#define fast_rmb() rmb()
> +#define fast_wmb() wmb()
> +
> #define set_mb(__var, __value) \
> do { __var = __value; membar_safe("#StoreLoad"); } while(0)
>
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 5238000..cf440e7 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -24,13 +24,16 @@
> #define wmb() asm volatile("sfence" ::: "memory")
> #endif
>
> -#ifdef CONFIG_SMP
> -#define smp_mb() mb()
> #ifdef CONFIG_X86_PPRO_FENCE
> -# define smp_rmb() rmb()
> +#define fast_rmb() rmb()
> #else
> -# define smp_rmb() barrier()
> +#define fast_rmb() barrier()
> #endif
> +#define fast_wmb() barrier()
> +
> +#ifdef CONFIG_SMP
> +#define smp_mb() mb()
> +#define smp_rmb() fast_rmb()
> #define smp_wmb() barrier()
> #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> #else /* !SMP */
> diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
> index d6511d9..2c0405d 100644
> --- a/arch/x86/um/asm/barrier.h
> +++ b/arch/x86/um/asm/barrier.h
> @@ -29,17 +29,18 @@
>
> #endif /* CONFIG_X86_32 */
>
> -#ifdef CONFIG_SMP
> -
> -#define smp_mb() mb()
> #ifdef CONFIG_X86_PPRO_FENCE
> -#define smp_rmb() rmb()
> +#define fast_rmb() rmb()
> #else /* CONFIG_X86_PPRO_FENCE */
> -#define smp_rmb() barrier()
> +#define fast_rmb() barrier()
> #endif /* CONFIG_X86_PPRO_FENCE */
> +#define fast_wmb() barrier()
>
> -#define smp_wmb() barrier()
> +#ifdef CONFIG_SMP
>
> +#define smp_mb() mb()
> +#define smp_rmb() fast_rmb()
> +#define smp_wmb() fast_wmb()
> #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
>
> #else /* CONFIG_SMP */
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 1402fa8..c1d60b9 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -42,6 +42,14 @@
> #define wmb() mb()
> #endif
>
> +#ifndef fast_rmb
> +#define fast_rmb() rmb()
> +#endif
> +
> +#ifndef fast_wmb
> +#define fast_wmb() wmb()
> +#endif
> +
> #ifndef read_barrier_depends
> #define read_barrier_depends() do { } while (0)
> #endif
>
On 11/17/2014 12:04 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2014-11-17 at 09:18 -0800, Alexander Duyck wrote:
>> There are a number of situations where the mandatory barriers rmb() and
>> wmb() are used to order memory/memory operations in the device drivers
>> and those barriers are much heavier than they actually need to be. For
>> example in the case of PowerPC wmb() calls the heavy-weight sync
>> instruction when for memory/memory operations all that is really needed is
>> an lwsync or eieio instruction.
> So essentially those are the same as the smp_* variants but not nop'ed
> out on !CONFIG_SMP right ?
>
> Ben.
>
Yes and no. For example on ARM I used the dmb() operation; however, I
have to use the barrier at the system level instead of just the inner
shareable domain. On many other architectures, though, they are just the
same as the smp_* variants.
Basically the resultant code is somewhere between the smp and non-smp
barriers in terms of what they cover.
Thanks,
Alex
On Mon, Nov 17, 2014 at 9:18 AM, Alexander Duyck
<[email protected]> wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be.
Ugh. I absolutely despise the name.
It's not "fast". It's just limited. It's the same as "smp_*mb()", in
that it works on cacheable memory, but it actually stays around even
for non-SMP builds.
So I think the name is actively misleading.
Naming should be about what it does, not about some kind of PR thing
that confuses people into thinking it's "better".
Maybe "dma_*mb()" would be acceptable, and ends up having the same
naming convention as "smp_*mb()", and explains what it's about.
And yes, in the same spirit, it would probably be good to try to
eventually get rid of the plain "*mb()" functions, and perhaps call
them "mmio_*mb()" to clarify that they are about ordering memory wrt
mmio.
Hmm?
Linus
On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
> On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
>> There are a number of situations where the mandatory barriers rmb() and
>> wmb() are used to order memory/memory operations in the device drivers
>> and those barriers are much heavier than they actually need to be. For
>> example in the case of PowerPC wmb() calls the heavy-weight sync
>> instruction when for memory/memory operations all that is really needed is
>> an lwsync or eieio instruction.
>
> Is this still the case if one of the memory operations is MMIO? Last
> I knew, it was not.
This barrier is not meant for use in MMIO operations, for that you still
need a full barrier as I call out in the documentation section. What the
barrier does is allow for a lightweight barrier for accesses to coherent
system memory. So for example many device drivers have to perform a read
of the descriptor to see if the device is done with it. We need an rmb()
following that check to keep the other descriptor reads from being
reordered ahead of it.
Right now on x86 that rmb() becomes an lfence instruction and is quite
expensive, and as it turns out we don't need it since the x86 doesn't
reorder reads. The same kind of thing applies to PowerPC, only in that
case we use a sync when what we really need is an lwsync.
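To make that concrete, here is a minimal sketch of the polling pattern being
described (the descriptor layout and the RX_DESC_DD flag are made up for
illustration; this is not code from the series):

    /* poll a write-back descriptor sitting in coherent DMA memory */
    if (rx_desc->status & RX_DESC_DD) {     /* hypothetical "done" flag */
            /*
             * Today this has to be a full rmb(); with the new primitive a
             * fast_rmb() is enough because both loads hit cache-enabled
             * system memory.  On x86 that turns an lfence into a plain
             * compiler barrier().
             */
            fast_rmb();

            len = rx_desc->length;          /* now safe to read the rest */
    }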
>> This commit adds a fast (and loose) version of the mandatory memory
>> barriers rmb() and wmb(). The prefix to the name is actually based on the
>> version of the functions that already exist in the mips and tile trees.
>> However I thought it applicable since it gets at what we are trying to
>> accomplish with these barriers and somewhat implies their risky nature.
>>
>> These new barriers are not as safe as the standard rmb() and wmb().
>> Specifically they do not guarantee ordering between cache-enabled and
>> cache-inhibited memories. The primary use case for these would be to
>> enforce ordering of memory reads/writes when accessing cache-enabled memory
>> that is shared between the CPU and a device.
>>
>> It may also be noted that there is no fast_mb(). This is due to the fact
>> that most architectures didn't seem to have a good way to do a full memory
>> barrier quickly and so they usually resorted to an mb() for their smp_mb
>> call. As such there is no point in adding a fast_mb() function if it is going
>> to map to mb() for all architectures anyway.
>
> I must confess that I still don't entirely understand the motivation.
The motivation is to provide finer grained barriers. So this provides an
in-between that allows us to "choose the right hammer". In the case of
PowerPC it is the difference between sync/lwsync, on ARM it is
dsb()/dmb(), and on x86 it is lfence/barrier().
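Summarizing the hunks in this series (a rough map only -- the exact
instruction also depends on options such as CONFIG_X86_PPRO_FENCE and
__SUBARCH_HAS_LWSYNC):

    /*
     *              rmb() / wmb()           fast_rmb() / fast_wmb()
     * x86          lfence / sfence         barrier() / barrier()
     * powerpc      sync / sync             lwsync / lwsync-or-eieio
     * arm          dsb() / dsb(st)         dmb() / dmb(st)
     * arm64        dsb(ld) / dsb(st)       dmb(ld) / dmb(st)
     */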
> Some problems in PowerPC barrier.h called out below.
>
> Thanx, Paul
>
<snip>
>> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
>> index 22a969c..fe55dea 100644
>> --- a/Documentation/memory-barriers.txt
>> +++ b/Documentation/memory-barriers.txt
>> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
>> operations" subsection for information on where to use these.
>>
>>
>> + (*) fast_wmb();
>> + (*) fast_rmb();
>> +
>> + These are for use with memory based device I/O to guarantee the ordering
>> + of cache-enabled writes or reads with respect to other writes or reads
>> + to cache-enabled memory.
>> +
>> + For example, consider a device driver that shares memory with a device
>> + and uses a descriptor status value to indicate if the descriptor belongs
>> + to the device or the CPU, and a doorbell to notify it when new
>> + descriptors are available:
>> +
>> + if (desc->status != DEVICE_OWN) {
>> + /* do not read data until we own descriptor */
>> + fast_rmb();
>> +
>> + /* read/modify data */
>> + read_data = desc->data;
>> + desc->data = write_data;
>> +
>> + /* flush modifications before status update */
>> + fast_wmb();
>> +
>> + /* assign ownership */
>> + desc->status = DEVICE_OWN;
>> +
>> + /* force memory to sync before notifying device via MMIO */
>> + wmb();
>> +
>> + /* notify device of new descriptors */
>> + writel(DESC_NOTIFY, doorbell);
>> + }
>> +
>> + The fast_rmb() allows us guarantee the device has released ownership
>> + before we read the data from the descriptor, and he fast_wmb() allows
>> + us to guarantee the data is written to the descriptor before the device
>> + can see it now has ownership. The wmb() is needed to guarantee that the
>> + cache-enabled memory writes have completed before attempting a write to
>> + the cache-inhibited MMIO region.
>> +
>> +
>> MMIO WRITE BARRIER
>> ------------------
The general idea is that the device/CPU follow acquire/release style
semantics and we need the lightweight barriers to enforce ordering to
prevent us from accessing the descriptors during those periods where we
do not own them. As the example shows we still need a full barrier when
going between MMIO and standard memory. Hopefully that is what is
conveyed in the documentation bits I have above.
<snip>
>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>> index cb6d66c..f480097 100644
>> --- a/arch/powerpc/include/asm/barrier.h
>> +++ b/arch/powerpc/include/asm/barrier.h
>> @@ -36,22 +36,20 @@
>>
>> #define set_mb(var, value) do { var = value; mb(); } while (0)
>>
>> -#ifdef CONFIG_SMP
>> -
>> #ifdef __SUBARCH_HAS_LWSYNC
>> # define SMPWMB LWSYNC
>> #else
>> # define SMPWMB eieio
>> #endif
>>
>> -#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>> +#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>> +#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>>
>> +#ifdef CONFIG_SMP
>> #define smp_mb() mb()
>> -#define smp_rmb() __lwsync()
>> -#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>> +#define smp_rmb() fast_rmb()
>> +#define smp_wmb() fast_wmb()
>> #else
>> -#define __lwsync() barrier()
>> -
>> #define smp_mb() barrier()
>> #define smp_rmb() barrier()
>> #define smp_wmb() barrier()
>> @@ -69,10 +67,16 @@
>> #define data_barrier(x) \
>> asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>>
>> +/*
>> + * The use of smp_rmb() in these functions are actually meant to map from
>> + * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
>> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
>> + * map to the more heavy-weight sync.
>> + */
>> #define smp_store_release(p, v) \
>> do { \
>> compiletime_assert_atomic_type(*p); \
>> - __lwsync(); \
>> + smp_rmb(); \
>
> This is not good at all. For smp_store_release(), we absolutely
> must order prior loads and stores against the assignment on the following
> line. This is not something that smp_rmb() does, nor is it something
> that smp_wmb() does. Yes, it might happen to work now, but this could easily
> break in the future -- plus this change is extremely misleading.
>
> The original __lwsync() is much more clear.
The problem I had with __lwsync is that it really wasn't all that clear.
It was the lwsync instruction if SMP was enabled, otherwise it was just
a barrier call. What I did is move the definition of __lwsync in the SMP
case into fast_rmb, which in turn is accessed by smp_rmb. I tried to
make this clear in the comment just above the two calls. The resultant
assembly code should be exactly the same.
What I could do is have it added back as a smp_lwsync if that works for
you. That way there is something there to give you a hint that it
becomes a barrier() call as soon as SMP is disabled.
Thanks,
Alex
On Mon, 2014-11-17 at 09:18 -0800, Alexander Duyck wrote:
> This change makes it so that fast_rmb is used when reading the Rx
> descriptor. The advantage of fast_rmb is that it allows for a much
> lower cost barrier on x86, powerpc, arm, and arm64 architectures than a
> traditional memory barrier when dealing with reads that only have to
> synchronize to system memory.
>
> In addition I have updated the code so that it just checks to see if any
> bits have been set instead of just the DD bit. The DD bit will always be
> set as part of a descriptor write-back, so we just need to check for a
> non-zero value at that memory location rather than any specific bit.
> This allows the code itself to appear much cleaner and gives the
> compiler more room to optimize.
>
> Cc: Jeff Kirsher <[email protected]>
> Cc: Matthew Vick <[email protected]>
> Cc: Don Skidmore <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> drivers/net/ethernet/intel/fm10k/fm10k_main.c | 6 +++---
> drivers/net/ethernet/intel/igb/igb_main.c | 6 +++---
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 9 ++++-----
> 3 files changed, 10 insertions(+), 11 deletions(-)
Looks like more changes will be coming, based on the feedback on earlier
patches. So I won't be picking this up for validation purposes.
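As a rough illustration of the change being described in the commit message
above (the field and macro names here are assumptions, not the actual Intel
driver code):

    /* before: test the DD bit explicitly, then pay for a full rmb() */
    if (!(rx_desc->wb.status & cpu_to_le32(RX_DESC_STAT_DD)))
            break;
    rmb();

    /* after: any non-zero write-back value means the descriptor is done */
    if (!rx_desc->wb.status)
            break;
    fast_rmb();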
On 11/17/2014 12:52 PM, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 9:18 AM, Alexander Duyck
> <[email protected]> wrote:
>> There are a number of situations where the mandatory barriers rmb() and
>> wmb() are used to order memory/memory operations in the device drivers
>> and those barriers are much heavier than they actually need to be.
> Ugh. I absolutely despise the name.
>
> It's not "fast". It's just limited. It's the same as "smp_*mb()", in
> that it works on cacheable memory, but it actually stays around even
> for non-SMP builds.
>
> So I think the name is actively misleading.
>
> Naming should be about what it does, not about some kind of PR thing
> that confuses people into thinking it's "better".
>
> Maybe "dma_*mb()" would be acceptable, and ends up having the same
> naming convention as "smp_*mb()", and explains what it's about.
What would you think of the name "coherent_*mb()"? I would prefer to
avoid dma in the name since, at least in my mind, that implies MMIO.
It also ties in well with dma_alloc_coherent/dma_free_coherent which is
what would typically be used to allocate the memory we would be using
the barrier to protect anyway.
> And yes, in the same spirit, it would probably be good to try to
> eventually get rid of the plain "*mb()" functions, and perhaps call
> them "mmio_*mb()" to clarify that they are about ordering memory wrt
> mmio.
>
> Hmm?
>
> Linus
I will work on pulling all of the coherent barrier cases out of using
the plain "*mb()" calls first. We need to sort that out before we could
look at renaming the plain barrier functions.
- Alex
On Mon, Nov 17, 2014 at 01:11:57PM -0800, Alexander Duyck wrote:
> On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
> >On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> >>There are a number of situations where the mandatory barriers rmb() and
> >>wmb() are used to order memory/memory operations in the device drivers
> >>and those barriers are much heavier than they actually need to be. For
> >>example in the case of PowerPC wmb() calls the heavy-weight sync
> >>instruction when for memory/memory operations all that is really needed is
> >>an lwsync or eieio instruction.
> >
> >Is this still the case if one of the memory operations is MMIO? Last
> >I knew, it was not.
>
> This barrier is not meant for use in MMIO operations, for that you
> still need a full barrier as I call out in the documentation
> section. What the barrier does is allow for a lightweight barrier
> for accesses to coherent system memory. So for example many device
> drivers have to perform a read of the descriptor to see if the
> device is done with it. We need an rmb() following that check to
> prevent any other accesses.
>
> Right now on x86 that rmb() becomes an lfence instruction and is
> quite expensive, and as it turns out we don't need it since the x86
> doesn't reorder reads. The same kind of thing applies to PowerPC,
> only in that case we use a sync when what we really need is a
> lwsync.
Would it make sense to have a memory barrier that enforced the
non-store-buffer orderings, that is prior reads before later
reads and writes and prior writes before later writes? This was
discussed earlier this year (http://lwn.net/Articles/586838/,
https://lwn.net/Articles/588300/). If I recall correctly, one of
the biggest obstacles was the name. ;-)
> >>This commit adds a fast (and loose) version of the mandatory memory
> >>barriers rmb() and wmb(). The prefix to the name is actually based on the
> >>version of the functions that already exist in the mips and tile trees.
> >>However I thought it applicable since it gets at what we are trying to
> >>accomplish with these barriers and somewhat implies their risky nature.
> >>
> >>These new barriers are not as safe as the standard rmb() and wmb().
> >>Specifically they do not guarantee ordering between cache-enabled and
> >>cache-inhibited memories. The primary use case for these would be to
> >>enforce ordering of memory reads/writes when accessing cache-enabled memory
> >>that is shared between the CPU and a device.
> >>
> >>It may also be noted that there is no fast_mb(). This is due to the fact
> >>that most architectures didn't seem to have a good way to do a full memory
> >>barrier quickly and so they usually resorted to an mb() for their smp_mb
> >>call. As such there is no point in adding a fast_mb() function if it is going
> >>to map to mb() for all architectures anyway.
> >
> >I must confess that I still don't entirely understand the motivation.
>
> The motivation is to provide finer grained barriers. So this
> provides an in-between that allows us to "choose the right hammer".
> In the case of PowerPC it is the difference between sync/lwsync, on
> ARM it is dsb()/dmb(), and on x86 it is lfence/barrier().
Ah, so ARM will motivate a fast_wmb(), given its instruction set.
> >Some problems in PowerPC barrier.h called out below.
> >
> > Thanx, Paul
> >
>
> <snip>
>
> >>diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> >>index 22a969c..fe55dea 100644
> >>--- a/Documentation/memory-barriers.txt
> >>+++ b/Documentation/memory-barriers.txt
> >>@@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
> >> operations" subsection for information on where to use these.
> >>
> >>
> >>+ (*) fast_wmb();
> >>+ (*) fast_rmb();
> >>+
> >>+ These are for use with memory based device I/O to guarantee the ordering
> >>+ of cache-enabled writes or reads with respect to other writes or reads
> >>+ to cache-enabled memory.
> >>+
> >>+ For example, consider a device driver that shares memory with a device
> >>+ and uses a descriptor status value to indicate if the descriptor belongs
> >>+ to the device or the CPU, and a doorbell to notify it when new
> >>+ descriptors are available:
> >>+
> >>+ if (desc->status != DEVICE_OWN) {
> >>+ /* do not read data until we own descriptor */
> >>+ fast_rmb();
> >>+
> >>+ /* read/modify data */
> >>+ read_data = desc->data;
> >>+ desc->data = write_data;
> >>+
> >>+ /* flush modifications before status update */
> >>+ fast_wmb();
> >>+
> >>+ /* assign ownership */
> >>+ desc->status = DEVICE_OWN;
> >>+
> >>+ /* force memory to sync before notifying device via MMIO */
> >>+ wmb();
> >>+
> >>+ /* notify device of new descriptors */
> >>+ writel(DESC_NOTIFY, doorbell);
> >>+ }
> >>+
> >>+ The fast_rmb() allows us guarantee the device has released ownership
> >>+ before we read the data from the descriptor, and he fast_wmb() allows
> >>+ us to guarantee the data is written to the descriptor before the device
> >>+ can see it now has ownership. The wmb() is needed to guarantee that the
> >>+ cache-enabled memory writes have completed before attempting a write to
> >>+ the cache-inhibited MMIO region.
> >>+
> >>+
> >> MMIO WRITE BARRIER
> >> ------------------
>
> The general idea is that the device/CPU follow acquire/release style
> semantics and we need the lightweight barriers to enforce ordering
> to prevent us from accessing the descriptors during those periods
> where we do not own them. As the example shows we still need a full
> barrier when going between MMIO and standard memory. Hopefully that
> is what is conveyed in the documentation bits I have above.
Thank you for the clarification.
> <snip>
>
> >>diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> >>index cb6d66c..f480097 100644
> >>--- a/arch/powerpc/include/asm/barrier.h
> >>+++ b/arch/powerpc/include/asm/barrier.h
> >>@@ -36,22 +36,20 @@
> >>
> >> #define set_mb(var, value) do { var = value; mb(); } while (0)
> >>
> >>-#ifdef CONFIG_SMP
> >>-
> >> #ifdef __SUBARCH_HAS_LWSYNC
> >> # define SMPWMB LWSYNC
> >> #else
> >> # define SMPWMB eieio
> >> #endif
> >>
> >>-#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >>+#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >>+#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >>
> >>+#ifdef CONFIG_SMP
> >> #define smp_mb() mb()
> >>-#define smp_rmb() __lwsync()
> >>-#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >>+#define smp_rmb() fast_rmb()
> >>+#define smp_wmb() fast_wmb()
> >> #else
> >>-#define __lwsync() barrier()
> >>-
> >> #define smp_mb() barrier()
> >> #define smp_rmb() barrier()
> >> #define smp_wmb() barrier()
> >>@@ -69,10 +67,16 @@
> >> #define data_barrier(x) \
> >> asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> >>
> >>+/*
> >>+ * The use of smp_rmb() in these functions are actually meant to map from
> >>+ * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
> >>+ * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> >>+ * map to the more heavy-weight sync.
> >>+ */
> >> #define smp_store_release(p, v) \
> >> do { \
> >> compiletime_assert_atomic_type(*p); \
> >>- __lwsync(); \
> >>+ smp_rmb(); \
> >
> >This is not good at all. For smp_store_release(), we absolutely
> >must order prior loads and stores against the assignment on the following
> >line. This is not something that smp_rmb() does, nor is it something
> >that smp_wmb() does. Yes, it might happen to work now, but this could easily
> >break in the future -- plus this change is extremely misleading.
> >
> >The original __lwsync() is much more clear.
>
> The problem I had with __lwsync is that it really wasn't all that
> clear. It was the lwsync instruction if SMP was enabled, otherwise
> it was just a barrier call. What I did is move the definition of
> __lwsync in the SMP case into fast_rmb, which in turn is accessed by
> smp_rmb. I tried to make this clear in the comment just above the
> two calls. The resultant assembly code should be exactly the same.
>
> What I could do is have it added back as a smp_lwsync if that works
> for you. That way there is something there to give you a hint that
> it becomes a barrier() call as soon as SMP is disabled.
An smp_lwsync() would be a great improvement!
Thanx, Paul
On Mon, 2014-11-17 at 12:18 -0800, Paul E. McKenney wrote:
> On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> > There are a number of situations where the mandatory barriers rmb() and
> > wmb() are used to order memory/memory operations in the device drivers
> > and those barriers are much heavier than they actually need to be. For
> > example in the case of PowerPC wmb() calls the heavy-weight sync
> > instruction when for memory/memory operations all that is really needed is
> > an lwsync or eieio instruction.
>
> Is this still the case if one of the memory operations is MMIO? Last
> I knew, it was not.
I *think* (Alexander, correct me if I'm wrong) that what he wants is
the memory<->memory barriers (the smp_* ones) basically for ordering his
loads or stores from/to the DMA area.
The problem is that the smp_* ones aren't compiled for !CONFIG_SMP
IE. Something like:
- Read valid bit from descriptor
- Read rest of descriptor
That needs an rmb of some sort in between, but a full blown "rmb" will
also order vs. MMIOs and end up being a full sync, while an smp_rmb is a
lwsync which is more lightweight.
Similarly:
- Populate descriptor
- Write valid bit
Same deal with wmb ...
Basically, rmb and wmb order both cacheable and non-cacheable (memory and
MMIO), which makes them needlessly heavy on powerpc and possibly others
when all you need is to order memory accesses to some DMA data
structures. In that case you really want the normal smp_* variants
except they may not be around...
Cheers,
Ben.
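A short sketch of the two orderings Ben lists above, annotated with what the
lightweight barrier costs on powerpc (the descriptor fields are illustrative):

    /* consumer: read the valid bit, then the rest of the descriptor */
    if (desc->valid) {
            fast_rmb();     /* lwsync rather than the sync a full rmb() emits */
            data = desc->data;
    }

    /* producer: populate the descriptor, then hand it over */
    desc->data = data;
    fast_wmb();             /* lwsync or eieio rather than sync */
    desc->valid = 1;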
On Mon, 2014-11-17 at 12:52 -0800, Linus Torvalds wrote:
> Maybe "dma_*mb()" would be acceptable, and ends up having the same
> naming convention as "smp_*mb()", and explains what it's about.
Yes, that was what I was about to suggest as well.
> And yes, in the same spirit, it would probably be good to try to
> eventually get rid of the plain "*mb()" functions, and perhaps call
> them "mmio_*mb()" to clarify that they are about ordering memory wrt
> mmio.
It will always be somewhat unclear to users who don't read the doc
anyway :)
IE. the dma_* ones do only DMA vs DMA (or vs other processors) but the
mmio_* ones do anything vs anything. Not a huge deal tho. I still like
dma_* for Alexander's new stuff but I wouldn't bother with changing the
existing ones.
Ben.
On Mon, 2014-11-17 at 13:54 -0800, Alexander Duyck wrote:
> What would you think of the name "coherent_*mb()"? I would prefer to
> avoid dma in the name since, at least in my mind, that implies MMIO.
I'm lazy, I like typing less, so I like dma_* but I don't object to
coherent_* if at least one more person thinks it makes the semantics
clearer :)
> It also ties in well with dma_alloc_coherent/dma_free_coherent which is
> what would typically be used to allocate the memory we would be using
> the barrier to protect anyway.
Agreed.
> > And yes, in the same spirit, it would probably be good to try to
> > eventually get rid of the plain "*mb()" functions, and perhaps call
> > them "mmio_*mb()" to clarify that they are about ordering memory wrt
> > mmio.
> >
> > Hmm?
> >
> > Linus
>
> I will work on pulling all of the coherent barrier cases out of using
> the plain "*mb()" calls first. We need to sort that out before we could
> look at renaming the plain barrier functions.
Makes sense.
Cheers,
Ben.
> - Alex
On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
> Yes and no. So for example on ARM I used the dmb() operation, however
> I
> have to use the barrier at the system level instead of just the inner
> shared domain. However on many other architectures they are just the
> same as the smp_* variants.
>
> Basically the resultant code is somewhere between the smp and non-smp
> barriers in terms of what they cover.
There I don't quite follow you. You need to explain better especially in
the documentation because otherwise people will get it wrong...
If it's ordering in the coherent domain, I fail to see how a DMA agent
is different than another processor when it comes to barriers, so I fail
to see the difference with smp_*
I understand the MMIO vs. memory issue, we do have the same on powerpc,
but that other aspect eludes me.
Ben.
On 11/17/2014 04:39 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
>> Yes and no. So for example on ARM I used the dmb() operation, however
>> I
>> have to use the barrier at the system level instead of just the inner
>> shared domain. However on many other architectures they are just the
>> same as the smp_* variants.
>>
>> Basically the resultant code is somewhere between the smp and non-smp
>> barriers in terms of what they cover.
> There I don't quite follow you. You need to explain better especially in
> the documentation because otherwise people will get it wrong...
>
> If it's ordering in the coherent domain, I fail to see how a DMA agent
> is different than another processor when it comes to barriers, so I fail
> to see the difference with smp_*
>
> I understand the MMIO vs. memory issue, we do have the same on powerpc,
> but that other aspect eludes me.
>
> Ben.
ARM adds some funky things. They have two different types of
primitives, a dmb() which is a data memory barrier, and a dsb() which is
a data synchronization barrier. Then with each of those they have the
"domains" the barriers are effective within.
So for example on ARM a rmb() is dsb(sy) which means it is a system wide
synchronization barrier which stops execution on the CPU core until the
read completes. However the smp_rmb() is a dmb(ish) which means it is
only a barrier as far as the inner shareable domain which I believe only
goes as far as the local shared cache hierarchy and only guarantees read
ordering without necessarily halting the CPU or stopping in-order
speculative reads. So what a coherent_rmb() would be in my setup is
dmb(sy) which means the barrier runs all the way out to memory, and it
is allowed to read speculatively as long as it does so in order.
If it is still unclear you might check out Will Deacon's talk on the
topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in
he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().
- Alex
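For reference, a sketch of the three strengths being compared on 32-bit ARM
(coherent_rmb() here is the proposed rename, not code from this series):

    #define rmb()           dsb()     /* system wide, stalls the core until the read completes */
    #define smp_rmb()       dmb(ish)  /* inner-shareable domain only */
    #define coherent_rmb()  dmb(sy)   /* whole system, but without stalling the core */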
On 11/17/2014 03:17 PM, Paul E. McKenney wrote:
> On Mon, Nov 17, 2014 at 01:11:57PM -0800, Alexander Duyck wrote:
>> On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
>>>> There are a number of situations where the mandatory barriers rmb() and
>>>> wmb() are used to order memory/memory operations in the device drivers
>>>> and those barriers are much heavier than they actually need to be. For
>>>> example in the case of PowerPC wmb() calls the heavy-weight sync
>>>> instruction when for memory/memory operations all that is really needed is
>>>> an lwsync or eieio instruction.
>>> Is this still the case if one of the memory operations is MMIO? Last
>>> I knew, it was not.
>> This barrier is not meant for use in MMIO operations, for that you
>> still need a full barrier as I call out in the documentation
>> section. What the barrier does is allow for a lightweight barrier
>> for accesses to coherent system memory. So for example many device
>> drivers have to perform a read of the descriptor to see if the
>> device is done with it. We need an rmb() following that check to
>> prevent any other accesses.
>>
>> Right now on x86 that rmb() becomes an lfence instruction and is
>> quite expensive, and as it turns out we don't need it since the x86
>> doesn't reorder reads. The same kind of thing applies to PowerPC,
>> only in that case we use a sync when what we really need is a
>> lwsync.
> Would it make sense to have a memory barrier that enforced the
> non-store-buffer orderings, that is prior reads before later
> reads and writes and prior writes before later writes? This was
> discussed earlier this year (http://lwn.net/Articles/586838/,
> https://lwn.net/Articles/588300/). If I recall correctly, one of
> the biggest obstacles was the name. ;-)
You're talking about acquire and release barriers, or something else?
For most devices the two barriers I have defined should do the job. I
had tried doing load_acquire/store_release type functions in the
previous patch set and that was shot down, as the preference seemed to be
for barriers instead, to remove some of the abstraction as to what was
occurring.
>>>> This commit adds a fast (and loose) version of the mandatory memory
>>>> barriers rmb() and wmb(). The prefix to the name is actually based on the
>>>> version of the functions that already exist in the mips and tile trees.
>>>> However I thought it applicable since it gets at what we are trying to
>>>> accomplish with these barriers and somewhat implies their risky nature.
>>>>
>>>> These new barriers are not as safe as the standard rmb() and wmb().
>>>> Specifically they do not guarantee ordering between cache-enabled and
>>>> cache-inhibited memories. The primary use case for these would be to
>>>> enforce ordering of memory reads/writes when accessing cache-enabled memory
>>>> that is shared between the CPU and a device.
>>>>
>>>> It may also be noted that there is no fast_mb(). This is due to the fact
>>>> that most architectures didn't seem to have a good way to do a full memory
>>>> barrier quickly and so they usually resorted to an mb() for their smp_mb
>>>> call. As such there is no point in adding a fast_mb() function if it is going
>>>> to map to mb() for all architectures anyway.
>>> I must confess that I still don't entirely understand the motivation.
>> The motivation is to provide finer grained barriers. So this
>> provides an in-between that allows us to "choose the right hammer".
>> In the case of PowerPC it is the difference between sync/lwsync, on
>> ARM it is dsb()/dmb(), and on x86 it is lfence/barrier().
> Ah, so ARM will motivate a fast_wmb(), given its instruction set.
Actually it was x86 that started this: lfence or sfence is much more
expensive than just barrier(). From there I realized we had issues in
PowerPC as well with sync vs lwsync, and ARM with dsb() vs dmb().
>> <snip>
>>
>>>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>>>> index cb6d66c..f480097 100644
>>>> --- a/arch/powerpc/include/asm/barrier.h
>>>> +++ b/arch/powerpc/include/asm/barrier.h
>>>> @@ -36,22 +36,20 @@
>>>>
>>>> #define set_mb(var, value) do { var = value; mb(); } while (0)
>>>>
>>>> -#ifdef CONFIG_SMP
>>>> -
>>>> #ifdef __SUBARCH_HAS_LWSYNC
>>>> # define SMPWMB LWSYNC
>>>> #else
>>>> # define SMPWMB eieio
>>>> #endif
>>>>
>>>> -#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>>>> +#define fast_rmb() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>>>> +#define fast_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>>>>
>>>> +#ifdef CONFIG_SMP
>>>> #define smp_mb() mb()
>>>> -#define smp_rmb() __lwsync()
>>>> -#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>>>> +#define smp_rmb() fast_rmb()
>>>> +#define smp_wmb() fast_wmb()
>>>> #else
>>>> -#define __lwsync() barrier()
>>>> -
>>>> #define smp_mb() barrier()
>>>> #define smp_rmb() barrier()
>>>> #define smp_wmb() barrier()
>>>> @@ -69,10 +67,16 @@
>>>> #define data_barrier(x) \
>>>> asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>>>>
>>>> +/*
>>>> + * The use of smp_rmb() in these functions are actually meant to map from
>>>> + * smp_rmb()->fast_rmb()->LWSYNC. This way if smp is disabled then
>>>> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
>>>> + * map to the more heavy-weight sync.
>>>> + */
>>>> #define smp_store_release(p, v) \
>>>> do { \
>>>> compiletime_assert_atomic_type(*p); \
>>>> - __lwsync(); \
>>>> + smp_rmb(); \
>>> This is not good at all. For smp_store_release(), we absolutely
>>> must order prior loads and stores against the assignment on the following
>>> line. This is not something that smp_rmb() does, nor is it something
>>> that smp_wmb() does. Yes, it might happen to work now, but this could easily
>>> break in the future -- plus this change is extremely misleading.
>>>
>>> The original __lwsync() is much more clear.
>> The problem I had with __lwsync is that it really wasn't all that
>> clear. It was the lwsync instruction if SMP was enabled, otherwise
>> it was just a barrier call. What I did is move the definition of
>> __lwsync in the SMP case into fast_rmb, which in turn is accessed by
>> smp_rmb. I tried to make this clear in the comment just above the
>> two calls. The resultant assembly code should be exactly the same.
>>
>> What I could do is have it added back as a smp_lwsync if that works
>> for you. That way there is something there to give you a hint that
>> it becomes a barrier() call as soon as SMP is disabled.
> An smp_lwsync() would be a great improvement!
>
> Thanx, Paul
>
Okay, that will be in the next patch then.
Thanks,
Alex
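A rough sketch of what reintroducing smp_lwsync() might look like in the next
revision of the powerpc patch (an assumption about the follow-up, not posted
code):

    #ifdef CONFIG_SMP
    #define smp_lwsync()  __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
    #else
    #define smp_lwsync()  barrier()
    #endif

    #define smp_store_release(p, v)                                      \
    do {                                                                 \
            compiletime_assert_atomic_type(*p);                          \
            smp_lwsync();  /* order prior loads *and* stores first */    \
            ACCESS_ONCE(*p) = (v);                                       \
    } while (0)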
From: Alexander Duyck
> These patches introduce two new primitives for synchronizing cache-enabled
> memory writes and reads. These two new primitives are:
>
> fast_rmb()
> fast_wmb()
Not sure I like the names.
If the aim is to sync data into the local cache so that hardware
that is doing cache-snooping accesses sees the data then maybe
local_rmb() and local_wmb()
IIRC read_barrier_depends() is a nop on everything except alpha.
Maybe add the default if it isn't defined by the MD file?
David
On Tue, Nov 18, 2014 at 03:13:29AM +0000, Alexander Duyck wrote:
> On 11/17/2014 04:39 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
> >> Yes and no. So for example on ARM I used the dmb() operation, however
> >> I
> >> have to use the barrier at the system level instead of just the inner
> >> shared domain. However on many other architectures they are just the
> >> same as the smp_* variants.
> >>
> >> Basically the resultant code is somewhere between the smp and non-smp
> >> barriers in terms of what they cover.
> > There I don't quite follow you. You need to explain better especially in
> > the documentation because otherwise people will get it wrong...
> >
> > If it's ordering in the coherent domain, I fail to see how a DMA agent
> > is different than another processor when it comes to barriers, so I fail
> > to see the difference with smp_*
> >
> > I understand the MMIO vs. memory issue, we do have the same on powerpc,
> > but that other aspect eludes me.
> >
>
> ARM adds some funky things. They have two different types of
> primitives, a dmb() which is a data memory barrier, and a dsb() which is
> a data synchronization barrier. Then with each of those they have the
> "domains" the barriers are effective within.
>
> So for example on ARM a rmb() is dsb(sy) which means it is a system wide
> synchronization barrier which stops execution on the CPU core until the
> read completes. However the smp_rmb() is a dmb(ish) which means it is
> only a barrier as far as the inner shareable domain which I believe only
> goes as far as the local shared cache hierarchy and only guarantees read
> ordering without necessarily halting the CPU or stopping in-order
> speculative reads. So what a coherent_rmb() would be in my setup is
> dmb(sy) which means the barrier runs all the way out to memory, and it
> is allowed to read speculatively as long as it does so in order.
>
> If it is still unclear you might check out Will Deacon's talk on the
> topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in
> he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().
So actually, this is an interesting case where the barrier would like to
know whether the memory returned by dma_alloc_coherent is h/w coherent
(normal, cacheable) or s/w coherent (normal, non-cacheable). I think Ben
is thinking of the h/w coherent case (i.e. actual snooping into the CPU
caches by the DMA master).
For the former, we could use inner-shareable barriers. For the latter, we'd
need to use outer-shareable barriers.
If we can't tell, then these should be dmb(osh), which will work for both.
Will
On 11/18/2014 01:57 AM, David Laight wrote:
> From: Alexander Duyck
>> These patches introduce two new primitives for synchronizing cache-enabled
>> memory writes and reads. These two new primitives are:
>>
>> fast_rmb()
>> fast_wmb()
> Not sure I like the names.
> If the aim is to sync data into the local cache so that hardware
> that is doing cache-snooping accesses sees the data then maybe
> local_rmb() and local_wmb()
Yeah, that is the general consensus. I am planning to change them to
coherent_rmb() and coherent_wmb().
> IIRC read_barrier_depends() is a nop on everything except alpha.
> Maybe add the default if it isn't defined by the MD file?
>
> David
>
From my patch the only two I saw define it were alpha and blackfin. It
is already defined in asm-generic; the rest is just clean-up, since I
suspect some of the arch tree barrier.h files just borrowed from
asm-generic without sorting out what became redundant.
Thanks,
Alex
On 11/18/2014 03:58 AM, Will Deacon wrote:
> On Tue, Nov 18, 2014 at 03:13:29AM +0000, Alexander Duyck wrote:
>> On 11/17/2014 04:39 PM, Benjamin Herrenschmidt wrote:
>>> On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
>>>> Yes and no. So for example on ARM I used the dmb() operation, however
>>>> I
>>>> have to use the barrier at the system level instead of just the inner
>>>> shared domain. However on many other architectures they are just the
>>>> same as the smp_* variants.
>>>>
>>>> Basically the resultant code is somewhere between the smp and non-smp
>>>> barriers in terms of what they cover.
>>> There I don't quite follow you. You need to explain better especially in
>>> the documentation because otherwise people will get it wrong...
>>>
>>> If it's ordering in the coherent domain, I fail to see how a DMA agent
>>> is different than another processor when it comes to barriers, so I fail
>>> to see the difference with smp_*
>>>
>>> I understand the MMIO vs. memory issue, we do have the same on powerpc,
>>> but that other aspect eludes me.
>>>
>> ARM adds some funky things. They have two different types of
>> primitives, a dmb() which is a data memory barrier, and a dsb() which is
>> a data synchronization barrier. Then with each of those they have the
>> "domains" the barriers are effective within.
>>
>> So for example on ARM a rmb() is dsb(sy) which means it is a system wide
>> synchronization barrier which stops execution on the CPU core until the
>> read completes. However the smp_rmb() is a dmb(ish) which means it is
>> only a barrier as far as the inner shareable domain which I believe only
>> goes as far as the local shared cache hierarchy and only guarantees read
>> ordering without necessarily halting the CPU or stopping in-order
>> speculative reads. So what a coherent_rmb() would be in my setup is
>> dmb(sy) which means the barrier runs all the way out to memory, and it
>> is allowed to read speculatively as long as it does so in order.
>>
>> If it is still unclear you might check out Will Deacon's talk on the
>> topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in
>> he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().
> So actually, this is an interesting case where the barrier would like to
> know whether the memory returned by dma_alloc_coherent is h/w coherent
> (normal, cacheable) or s/w coherent (normal, non-cacheable). I think Ben
> is thinking of the h/w coherent case (i.e. actual snooping into the CPU
> caches by the DMA master).
>
> For the former, we could use inner-shareable barriers. For the latter, we'd
> need to use outer-shareable barriers.
>
> If we can't tell, then these should be dmb(osh), which will work for both.
>
> Will
Okay, so I will update the ARM portion of my patches to use osh and
oshst then, since it sounds like I was using barriers that were too strong.
- Alex
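The updated ARM hunk would then presumably look something like this (an
assumption based on Will's suggestion, not posted code):

    #define fast_rmb()      dmb(osh)
    #define fast_wmb()      dmb(oshst)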
On Tue, Nov 18, 2014 at 04:20:46PM +0000, Alexander Duyck wrote:
> On 11/18/2014 03:58 AM, Will Deacon wrote:
> > So actually, this is an interesting case where the barrier would like to
> > know whether the memory returned by dma_alloc_coherent is h/w coherent
> > (normal, cacheable) or s/w coherent (normal, non-cacheable). I think Ben
> > is thinking of the h/w coherent case (i.e. actual snooping into the CPU
> > caches by the DMA master).
> >
> > For the former, we could use inner-shareable barriers. For the latter, we'd
> > need to use outer-shareable barriers.
> >
> > If we can't tell, then these should be dmb(osh), which will work for both.
> >
>
> Okay, so I will update the ARM portion of my patches to use osh and
> oshst then, since it sounds like I was using barriers that were too strong.
Sounds good. Another reason this is interesting is because the native
acquire/release instructions on ARMv8 actually take into account the
shareability domain of the virtual address, so using them would give you
the shareability domain you want but slightly stronger ordering guarantees
within that domain.
Still, either of them will be a damn sight better than the dsb we currently
have courtesy of the mandatory barriers.
Will
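For context, a sketch of the native ARMv8 load-acquire Will mentions, for a
32-bit value (this mirrors the shape of the arm64 smp_load_acquire()
implementation; "p" is assumed to point at the coherent buffer):

    u32 v;

    /* ldar honours the shareability domain of the mapping behind p */
    asm volatile("ldar %w0, [%1]" : "=r" (v) : "r" (p) : "memory");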
On Mon, 2014-11-17 at 19:13 -0800, Alexander Duyck wrote:
>
> ARM adds some funky things. They have two different types of
> primitives, a dmb() which is a data memory barrier, and a dsb() which is
> a data synchronization barrier. Then with each of those they have the
> "domains" the barriers are effective within.
>
> So for example on ARM a rmb() is dsb(sy) which means it is a system wide
> synchronization barrier which stops execution on the CPU core until the
> read completes.
That's amazingly heavy handed ... I can see that being useful for MMIO;
we do something similar in our MMIO accessors by using a special variant
of the trap instruction that never traps, to make the core think the load
value has been consumed. But that's typically only needed to guarantee
MMIO timings.
> However the smp_rmb() is a dmb(ish) which means it is
> only a barrier as far as the inner shareable domain which I believe only
> goes as far as the local shared cache hierarchy and only guarantees read
> ordering without necessarily halting the CPU or stopping in-order
> speculative reads. So what a coherent_rmb() would be in my setup is
> dmb(sy) which means the barrier runs all the way out to memory, and it
> is allowed to speculative read as long as it does it in order.
Correct, which is thus the same as smp_rmb() ... which was my original
point, or am I missing something else?
> If it is still unclear you might check out Will Deacon's talk on the
> topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in
> he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().
Ok, I'll try to watch that when I get a chance.
Cheers,
Ben.