2014-11-13 19:28:36

by Alexander Duyck

Subject: [PATCH 0/3] Introduce load_acquire() and store_release()

These patches introduce uniprocessor or CPU<->device equivalents for
smp_load_acquire() and smp_store_release(). These two new primitives are:

load_acquire()
store_release()
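
For reference, the intended usage pattern mirrors the smp_* variants. As a
minimal sketch against a hypothetical descriptor layout (desc, DESC_OWN, and
the field names here are illustrative only, not taken from any of the
drivers below):

	/* Producer: fill in the descriptor, then publish ownership to
	 * the device; the release orders the writes before the flag.
	 */
	desc->addr = cpu_to_le64(mapping);
	store_release(&desc->flags, cpu_to_le32(DESC_OWN));

	/* Consumer: test ownership, then read the remaining fields; the
	 * acquire orders the reads after the flag check.
	 */
	if (load_acquire(&desc->flags) & cpu_to_le32(DESC_OWN))
		return;		/* device still owns the descriptor */
	len = le32_to_cpu(desc->len);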

The first patch adds the primitives for the applicable architectures and
asm-generic.

The second patch adds the primitives to the r8169 driver, which turns out
to be a good example of where the new primitives might be useful: it has
memory barriers ordering accesses to the descriptors and to the DescOwn bit
within the descriptors, and these follow acquire/release style semantics.

The third patch adds support for load_acquire() to the Intel fm10k, igb,
and ixgbe drivers. Testing with the ixgbe driver has shown a processing
time reduction of at least 7ns per 64B frame on a Core i7-4930K.

This patch series is essentially the v2 for:
arch: Introduce read_acquire()

The key changes in this patch series versus that patch are:
- Renamed read_acquire() to load_acquire() for consistency with smp_load_acquire()
- Changed the barrier used to be consistent with smp_load_acquire()
- Updated the PowerPC code to use __lwsync, based on the IBM article
- Added store_release(), as drivers present a viable use case for it
- Added the r8169 patch, which is able to make full use of both primitives
- Added the fm10k/igb/ixgbe patch, which makes performance testing possible

---

Alexander Duyck (3):
arch: Introduce load_acquire() and store_release()
r8169: Use load_acquire() and store_release() to reduce memory barrier overhead
fm10k/igb/ixgbe: Use load_acquire on Rx descriptor


arch/arm/include/asm/barrier.h | 15 ++++++
arch/arm64/include/asm/barrier.h | 59 +++++++++++++------------
arch/ia64/include/asm/barrier.h | 7 ++-
arch/metag/include/asm/barrier.h | 15 ++++++
arch/mips/include/asm/barrier.h | 15 ++++++
arch/powerpc/include/asm/barrier.h | 24 ++++++++--
arch/s390/include/asm/barrier.h | 7 ++-
arch/sparc/include/asm/barrier_64.h | 6 ++-
arch/x86/include/asm/barrier.h | 22 ++++++++-
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 8 +--
drivers/net/ethernet/intel/igb/igb_main.c | 8 +--
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 11 ++---
drivers/net/ethernet/realtek/r8169.c | 23 ++++------
include/asm-generic/barrier.h | 15 ++++++
14 files changed, 163 insertions(+), 72 deletions(-)

--


2014-11-13 19:28:48

by Alexander Duyck

Subject: [PATCH 2/3] r8169: Use load_acquire() and store_release() to reduce memory barrier overhead

The r8169 driver uses a pair of wmb() calls when setting up the descriptor
rings. The first synchronizes the descriptor data with the descriptor
status, and the second synchronizes the descriptor status with the use of
the MMIO doorbell to notify the device that descriptors are ready. This can
come at a heavy price on some systems, and is not really necessary on
systems such as x86 where a simple barrier() would suffice to order
store/store accesses.

In addition the r8169 uses a rmb(), however I believe it is placed
incorrectly, as I assume it is supposed to be ordering descriptor reads
after the check for ownership. As such I believe this is actually an
acquire access pattern, so I have updated the code and replaced the barrier
with a load_acquire() call.

Cc: Realtek linux nic maintainers <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/net/ethernet/realtek/r8169.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index cf154f7..42b4645 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -6601,14 +6601,13 @@ static inline void rtl8169_mark_to_asic(struct RxDesc *desc, u32 rx_buf_sz)
{
u32 eor = le32_to_cpu(desc->opts1) & RingEnd;

- desc->opts1 = cpu_to_le32(DescOwn | eor | rx_buf_sz);
+ store_release(&desc->opts1, cpu_to_le32(DescOwn | eor | rx_buf_sz));
}

static inline void rtl8169_map_to_asic(struct RxDesc *desc, dma_addr_t mapping,
u32 rx_buf_sz)
{
desc->addr = cpu_to_le64(mapping);
- wmb();
rtl8169_mark_to_asic(desc, rx_buf_sz);
}

@@ -7077,11 +7076,9 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,

skb_tx_timestamp(skb);

- wmb();
-
/* Anti gcc 2.95.3 bugware (sic) */
status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
- txd->opts1 = cpu_to_le32(status);
+ store_release(&txd->opts1, cpu_to_le32(status));

tp->cur_tx += frags + 1;

@@ -7183,16 +7180,15 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
while (tx_left > 0) {
unsigned int entry = dirty_tx % NUM_TX_DESC;
struct ring_info *tx_skb = tp->tx_skb + entry;
- u32 status;
+ __le32 status;

- rmb();
- status = le32_to_cpu(tp->TxDescArray[entry].opts1);
- if (status & DescOwn)
+ status = load_acquire(&tp->TxDescArray[entry].opts1);
+ if (status & cpu_to_le32(DescOwn))
break;

rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
tp->TxDescArray + entry);
- if (status & LastFrag) {
+ if (status & cpu_to_le32(LastFrag)) {
pkts_compl++;
bytes_compl += tx_skb->skb->len;
dev_kfree_skb_any(tx_skb->skb);
@@ -7284,11 +7280,11 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
struct RxDesc *desc = tp->RxDescArray + entry;
u32 status;

- rmb();
- status = le32_to_cpu(desc->opts1) & tp->opts1_mask;
-
+ status = cpu_to_le32(load_acquire(&desc->opts1));
if (status & DescOwn)
break;
+
+ status &= tp->opts1_mask;
if (unlikely(status & RxRES)) {
netif_info(tp, rx_err, dev, "Rx ERROR. status = %08x\n",
status);
@@ -7350,7 +7346,6 @@ process_pkt:
}
release_descriptor:
desc->opts2 = 0;
- wmb();
rtl8169_mark_to_asic(desc, rx_buf_sz);
}

2014-11-13 19:28:55

by Alexander Duyck

Subject: [PATCH 3/3] fm10k/igb/ixgbe: Use load_acquire on Rx descriptor

This change makes it so that load_acquire() is used when reading the Rx
descriptor. The advantage of load_acquire() is that it allows for a much
lower cost barrier on the x86, ia64, powerpc, arm64, and s390 architectures
than a traditional memory barrier when dealing with reads that only have
to synchronize to system memory.

In addition I have updated the code so that it just checks whether any
bits have been set instead of testing only the DD bit. Since the DD bit
is always set as part of a descriptor write-back, we only need to check
for a non-zero value at that memory location rather than for any specific
bit. This makes the code itself appear much cleaner and gives the
compiler more room to optimize.

Cc: Jeff Kirsher <[email protected]>
Cc: Matthew Vick <[email protected]>
Cc: Don Skidmore <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 8 +++-----
drivers/net/ethernet/intel/igb/igb_main.c | 8 +++-----
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 11 ++++-------
3 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e645af4..2d4923c 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -620,14 +620,12 @@ static bool fm10k_clean_rx_irq(struct fm10k_q_vector *q_vector,

rx_desc = FM10K_RX_DESC(rx_ring, rx_ring->next_to_clean);

- if (!fm10k_test_staterr(rx_desc, FM10K_RXD_STATUS_DD))
- break;
-
/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STATUS_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ if (!load_acquire(&rx_desc->d.staterr))
+ break;

/* retrieve a buffer from the ring */
skb = fm10k_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a2d72a8..a55cd54 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6918,14 +6918,12 @@ static bool igb_clean_rx_irq(struct igb_q_vector *q_vector, const int budget)

rx_desc = IGB_RX_DESC(rx_ring, rx_ring->next_to_clean);

- if (!igb_test_staterr(rx_desc, E1000_RXD_STAT_DD))
- break;
-
/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STAT_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ if (!load_acquire(&rx_desc->wb.upper.status_error))
+ break;

/* retrieve a buffer from the ring */
skb = igb_fetch_rx_buffer(rx_ring, rx_desc, skb);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d2df4e3..007f174 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2003,15 +2003,12 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,

rx_desc = IXGBE_RX_DESC(rx_ring, rx_ring->next_to_clean);

- if (!ixgbe_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
- break;
-
- /*
- * This memory barrier is needed to keep us from reading
+ /* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
- * RXD_STAT_DD bit is set
+ * descriptor has been written back
*/
- rmb();
+ if (!load_acquire(&rx_desc->wb.upper.status_error))
+ break;

/* retrieve a buffer from the ring */
skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);

2014-11-13 19:29:43

by Alexander Duyck

Subject: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

It is common for device drivers to make use of acquire/release semantics
when dealing with descriptors stored in device memory. On reviewing the
documentation and code for smp_load_acquire() and smp_store_release() as
well as reviewing an IBM website that goes over the use of PowerPC barriers
at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
occurred to me that the same code could likely be applied to device drivers.

As a result this patch introduces load_acquire() and store_release(). The
load_acquire() function can be used in situations where a test for
ownership must be followed by a memory barrier. The below example is
from ixgbe:

if (!rx_desc->wb.upper.status_error)
break;

/* This memory barrier is needed to keep us from reading
* any other fields out of the rx_desc until we know the
* descriptor has been written back
*/
rmb();

With load_acquire() this can be changed to:

if (!load_acquire(&rx_desc->wb.upper.status_error))
break;

A similar change can be made in the release path of many drivers. For
example in the Realtek r8169 driver there are a number of flows that
consist of something like the following:

wmb();

status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
txd->opts1 = cpu_to_le32(status);

tp->cur_tx += frags + 1;

wmb();

With store_release() this can be changed to the following:

status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
store_release(&txd->opts1, cpu_to_le32(status));

tp->cur_tx += frags + 1;

wmb();

The resulting assembler code can be significantly less expensive on
architectures such as x86 and s390 that support strong ordering. On
architectures such as powerpc, ia64, and arm64, which can use different
primitives than their rmb()/wmb(), we should also see gains from the less
expensive barriers. For the remaining architectures we end up using a mb(),
which may come at the same cost as a rmb()/wmb() or more, as we must ensure
Load/Store ordering.
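
For example, per the arch/x86 hunk below, on regular TSO x86 the release in
the example above boils down to nothing more than:

	barrier();			/* compiler-only; no fence instruction */
	ACCESS_ONCE(txd->opts1) = v;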

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Russell King <[email protected]>
Cc: Geert Uytterhoeven <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Alexander Duyck <[email protected]>
---
arch/arm/include/asm/barrier.h | 15 +++++++++
arch/arm64/include/asm/barrier.h | 59 ++++++++++++++++++-----------------
arch/ia64/include/asm/barrier.h | 7 +++-
arch/metag/include/asm/barrier.h | 15 +++++++++
arch/mips/include/asm/barrier.h | 15 +++++++++
arch/powerpc/include/asm/barrier.h | 24 +++++++++++---
arch/s390/include/asm/barrier.h | 7 +++-
arch/sparc/include/asm/barrier_64.h | 6 ++--
arch/x86/include/asm/barrier.h | 22 ++++++++++++-
include/asm-generic/barrier.h | 15 +++++++++
10 files changed, 144 insertions(+), 41 deletions(-)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..bbdcd34 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,21 @@
#define smp_wmb() dmb(ishst)
#endif

+#define store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..c91571c 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,33 +32,7 @@
#define rmb() dsb(ld)
#define wmb() dsb(st)

-#ifndef CONFIG_SMP
-#define smp_mb() barrier()
-#define smp_rmb() barrier()
-#define smp_wmb() barrier()
-
-#define smp_store_release(p, v) \
-do { \
- compiletime_assert_atomic_type(*p); \
- barrier(); \
- ACCESS_ONCE(*p) = (v); \
-} while (0)
-
-#define smp_load_acquire(p) \
-({ \
- typeof(*p) ___p1 = ACCESS_ONCE(*p); \
- compiletime_assert_atomic_type(*p); \
- barrier(); \
- ___p1; \
-})
-
-#else
-
-#define smp_mb() dmb(ish)
-#define smp_rmb() dmb(ishld)
-#define smp_wmb() dmb(ishst)
-
-#define smp_store_release(p, v) \
+#define store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
switch (sizeof(*p)) { \
@@ -73,7 +47,7 @@ do { \
} \
} while (0)

-#define smp_load_acquire(p) \
+#define load_acquire(p) \
({ \
typeof(*p) ___p1; \
compiletime_assert_atomic_type(*p); \
@@ -90,6 +64,35 @@ do { \
___p1; \
})

+#ifndef CONFIG_SMP
+#define smp_mb() barrier()
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+
+#define smp_store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ barrier(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ barrier(); \
+ ___p1; \
+})
+
+#else
+
+#define smp_mb() dmb(ish)
+#define smp_rmb() dmb(ishld)
+#define smp_wmb() dmb(ishst)
+
+#define smp_store_release(p, v) store_release(p, v)
+#define smp_load_acquire(p) load_acquire(p)
+
#endif

#define read_barrier_depends() do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..d7fe208 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -63,14 +63,14 @@
* need for asm trickery!
*/

-#define smp_store_release(p, v) \
+#define store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
barrier(); \
ACCESS_ONCE(*p) = (v); \
} while (0)

-#define smp_load_acquire(p) \
+#define load_acquire(p) \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
@@ -78,6 +78,9 @@ do { \
___p1; \
})

+#define smp_store_release(p, v) store_release(p, v)
+#define smp_load_acquire(p) load_acquire(p)
+
/*
* XXX check on this ---I suspect what Linus really wants here is
* acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..9beb687 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -85,6 +85,21 @@ static inline void fence(void)
#define smp_read_barrier_depends() do { } while (0)
#define set_mb(var, value) do { var = value; smp_mb(); } while (0)

+#define store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..fc7323c 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,6 +180,21 @@
#define nudge_writes() mb()
#endif

+#define store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..f2a0d73 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -37,6 +37,23 @@

#define set_mb(var, value) do { var = value; mb(); } while (0)

+#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+
+#define store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ __lwsync(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ __lwsync(); \
+ ___p1; \
+})
+
#ifdef CONFIG_SMP

#ifdef __SUBARCH_HAS_LWSYNC
@@ -45,15 +62,12 @@
# define SMPWMB eieio
#endif

-#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")

#define smp_mb() mb()
#define smp_rmb() __lwsync()
#define smp_wmb() __asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
#define smp_read_barrier_depends() read_barrier_depends()
#else
-#define __lwsync() barrier()
-
#define smp_mb() barrier()
#define smp_rmb() barrier()
#define smp_wmb() barrier()
@@ -72,7 +86,7 @@
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
- __lwsync(); \
+ smp_rmb(); \
ACCESS_ONCE(*p) = (v); \
} while (0)

@@ -80,7 +94,7 @@ do { \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
- __lwsync(); \
+ smp_rmb(); \
___p1; \
})

diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..637d7a9 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -35,14 +35,14 @@

#define set_mb(var, value) do { var = value; mb(); } while (0)

-#define smp_store_release(p, v) \
+#define store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
barrier(); \
ACCESS_ONCE(*p) = (v); \
} while (0)

-#define smp_load_acquire(p) \
+#define load_acquire(p) \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
@@ -50,4 +50,7 @@ do { \
___p1; \
})

+#define smp_store_release(p, v) store_release(p, v)
+#define smp_load_acquire(p) load_acquire(p)
+
#endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..7de3c69 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,14 +53,14 @@ do { __asm__ __volatile__("ba,pt %%xcc, 1f\n\t" \

#define smp_read_barrier_depends() do { } while(0)

-#define smp_store_release(p, v) \
+#define store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
barrier(); \
ACCESS_ONCE(*p) = (v); \
} while (0)

-#define smp_load_acquire(p) \
+#define load_acquire(p) \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
@@ -68,6 +68,8 @@ do { \
___p1; \
})

+#define smp_store_release(p, v) store_release(p, v)
+#define smp_load_acquire(p) load_acquire(p)
#define smp_mb__before_atomic() barrier()
#define smp_mb__after_atomic() barrier()

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..3d2aa18 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -103,6 +103,21 @@
* model and we should fall back to full barriers.
*/

+#define store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
@@ -120,14 +135,14 @@ do { \

#else /* regular x86 TSO memory ordering */

-#define smp_store_release(p, v) \
+#define store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \
barrier(); \
ACCESS_ONCE(*p) = (v); \
} while (0)

-#define smp_load_acquire(p) \
+#define load_acquire(p) \
({ \
typeof(*p) ___p1 = ACCESS_ONCE(*p); \
compiletime_assert_atomic_type(*p); \
@@ -135,6 +150,9 @@ do { \
___p1; \
})

+#define smp_store_release(p, v) store_release(p, v)
+#define smp_load_acquire(p) load_acquire(p)
+
#endif

/* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c6e4b99 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,21 @@
#define smp_mb__after_atomic() smp_mb()
#endif

+#define store_release(p, v) \
+do { \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ACCESS_ONCE(*p) = (v); \
+} while (0)
+
+#define load_acquire(p) \
+({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ compiletime_assert_atomic_type(*p); \
+ mb(); \
+ ___p1; \
+})
+
#define smp_store_release(p, v) \
do { \
compiletime_assert_atomic_type(*p); \

2014-11-13 21:33:31

by Francois Romieu

Subject: Re: [PATCH 2/3] r8169: Use load_acquire() and store_release() to reduce memory barrier overhead

Alexander Duyck <[email protected]> :
[...]
> In addition the r8169 uses a rmb(), however I believe it is placed
> incorrectly, as I assume it is supposed to be ordering descriptor reads
> after the check for ownership.

Not exactly. It's a barrier against compiler optimization from 2004.
It should not matter.

However I disagree with the change below:

> @@ -7284,11 +7280,11 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
> struct RxDesc *desc = tp->RxDescArray + entry;
> u32 status;
>
> - rmb();
> - status = le32_to_cpu(desc->opts1) & tp->opts1_mask;
> -
> + status = cpu_to_le32(load_acquire(&desc->opts1));
> if (status & DescOwn)
> break;
> +
> + status &= tp->opts1_mask;

-> tp->opts1_mask is not __le32 tainted.

Btw, should I consider the sketch below as a skeleton in my r8169 closet?

NIC CPU0 CPU1
| CPU | NIC | CPU | CPU |

| CPU | NIC | CPU | CPU |
^ tx_dirty

[start_xmit...

| CPU | CPU | CPU | CPU |
(NIC did its job)
[rtl_tx...
| ... | ... | NIC | NIC |
(ring update)
(tx_dirty increases)

| CPU | CPU | ??? | ??? |
tx_dirty ?
reaping about-to-be-sent
buffers on some platforms ?
...start_xmit]


--
Ueimor

2014-11-13 23:12:07

by Alexander Duyck

Subject: Re: [PATCH 2/3] r8169: Use load_acquire() and store_release() to reduce memory barrier overhead

On 11/13/2014 01:30 PM, Francois Romieu wrote:
> Alexander Duyck <[email protected]> :
> [...]
>> In addition the r8169 uses a rmb(), however I believe it is placed
>> incorrectly, as I assume it is supposed to be ordering descriptor reads
>> after the check for ownership.
> Not exactly. It's a barrier against compiler optimization from 2004.
> It should not matter.

Okay. Do you recall what kind of problem you were seeing?

The origin of the rmb() for the Intel drivers was a PowerPC issue in
which it was fetching the length of a buffer before it checked the DD
bit (equivalent of DescOwn). I'm wondering if the issue you were seeing
was something similar, where reordered reads within the descriptor caused
that type of result.

> However I disagree with the change below:
>
>> @@ -7284,11 +7280,11 @@ static int rtl_rx(struct net_device *dev, struct rtl8169_private *tp, u32 budget
>> struct RxDesc *desc = tp->RxDescArray + entry;
>> u32 status;
>>
>> - rmb();
>> - status = le32_to_cpu(desc->opts1) & tp->opts1_mask;
>> -
>> + status = cpu_to_le32(load_acquire(&desc->opts1));
>> if (status & DescOwn)
>> break;
>> +
>> + status &= tp->opts1_mask;
> -> tp->opts1_mask is not __le32 tainted.

Sorry, I just noticed I got my byte ordering messed up on that. It should
have been le32_to_cpu(): desc->opts1 is __le32, and status should be CPU
ordered. I will have that updated for v2.

> Btw, should I consider the sketch below as a skeleton in my r8169 closet?
>
> NIC CPU0 CPU1
> | CPU | NIC | CPU | CPU |
>
> | CPU | NIC | CPU | CPU |
> ^ tx_dirty
>
> [start_xmit...
>
> | CPU | CPU | CPU | CPU |
> (NIC did its job)
> [rtl_tx...
> | ... | ... | NIC | NIC |
> (ring update)
> (tx_dirty increases)
>
> | CPU | CPU | ??? | ??? |
> tx_dirty ?
> reaping about-to-be-sent
> buffers on some platforms ?
> ...start_xmit]

Actually it looks like that could be due to the tp->cur_tx update and the
txd->opts1 update being placed in the same spot in start_xmit with no
barrier to separate them. As such the compiler is free to update
tp->cur_tx first, and only then update txd->opts1 to set the DescOwn bit.

I will move the update of tp->cur_tx down a few lines, past where the
second wmb() is/was. That should be enough to guarantee the cur_tx update
only becomes visible after the descriptors have been updated, so reaping
should only occur once the descriptor writes are complete; a minimal
sketch follows.

Thanks,

Alex

2014-11-14 10:19:12

by Will Deacon

Subject: Re: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

Hi Alex,

On Thu, Nov 13, 2014 at 07:27:23PM +0000, Alexander Duyck wrote:
> It is common for device drivers to make use of acquire/release semantics
> when dealing with descriptors stored in device memory. On reviewing the
> documentation and code for smp_load_acquire() and smp_store_release() as
> well as reviewing an IBM website that goes over the use of PowerPC barriers
> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
> occurred to me that the same code could likely be applied to device drivers.
>
> As a result this patch introduces load_acquire() and store_release(). The
> load_acquire() function can be used in situations where a test for
> ownership must be followed by a memory barrier. The below example is
> from ixgbe:
>
> if (!rx_desc->wb.upper.status_error)
> break;
>
> /* This memory barrier is needed to keep us from reading
> * any other fields out of the rx_desc until we know the
> * descriptor has been written back
> */
> rmb();
>
> With load_acquire() this can be changed to:
>
> if (!load_acquire(&rx_desc->wb.upper.status_error))
> break;

I still don't think this is a good idea for the specific use-case you're
highlighting.

On ARM, an mb() can be *significantly* more expensive than an rmb() (since
we may have to drain store buffers on an outer L2 cache) and on arm64 it's
not at all clear that an LDAR is more efficient than an LDR; DMB LD
sequence. I can certainly imagine implementations where the latter would
be preferred.

So, whilst I'm perfectly fine to go along with mandatory acquire/release
macros (we should probably add a check to barf on __iomem pointers), I
don't agree with using them in preference to finer-grained read/write
barriers. Doing so will have a real impact on I/O performance.

Finally, do you know of any architectures where load_acquire/store_release
aren't implemented the same way as the smp_* variants on SMP kernels?

Will

2014-11-14 10:46:16

by David Laight

Subject: RE: [PATCH 1/3] arch: Introduce load_acquire() and store_release()

From: Alexander Duyck
> It is common for device drivers to make use of acquire/release semantics
> when dealing with descriptors stored in device memory. On reviewing the
> documentation and code for smp_load_acquire() and smp_store_release() as
> well as reviewing an IBM website that goes over the use of PowerPC barriers
> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
> occurred to me that the same code could likely be applied to device drivers.
>
> As a result this patch introduces load_acquire() and store_release(). The
> load_acquire() function can be used in situations where a test for
> ownership must be followed by a memory barrier. The below example is
> from ixgbe:
>
> if (!rx_desc->wb.upper.status_error)
> break;
>
> /* This memory barrier is needed to keep us from reading
> * any other fields out of the rx_desc until we know the
> * descriptor has been written back
> */
> rmb();
>
> With load_acquire() this can be changed to:
>
> if (!load_acquire(&rx_desc->wb.upper.status_error))
> break;

If I'm quickly reading the 'new' code I need to look up yet another
function; with the 'old' code I can easily see the logic.

You've also added a memory barrier to the 'break' path - which isn't needed.

The driver might also have additional code that can be added before the
barrier, reducing the effective cost of the barrier.

The driver may also be able to perform multiple actions before a barrier is needed.

Hiding barriers isn't necessarily a good idea anyway.
If you are writing a driver you need to understand when and where they are needed.

Maybe you need a new (weaker) barrier to replace rmb() on some architectures.

...


David


2014-11-14 16:01:40

by Alexander Duyck

Subject: Re: [PATCH 1/3] arch: Introduce load_acquire() and store_release()


On 11/14/2014 02:19 AM, Will Deacon wrote:
> Hi Alex,
>
> On Thu, Nov 13, 2014 at 07:27:23PM +0000, Alexander Duyck wrote:
>> It is common for device drivers to make use of acquire/release semantics
>> when dealing with descriptors stored in device memory. On reviewing the
>> documentation and code for smp_load_acquire() and smp_store_release() as
>> well as reviewing an IBM website that goes over the use of PowerPC barriers
>> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
>> occurred to me that the same code could likely be applied to device drivers.
>>
>> As a result this patch introduces load_acquire() and store_release(). The
>> load_acquire() function can be used in situations where a test for
>> ownership must be followed by a memory barrier. The below example is
>> from ixgbe:
>>
>> if (!rx_desc->wb.upper.status_error)
>> break;
>>
>> /* This memory barrier is needed to keep us from reading
>> * any other fields out of the rx_desc until we know the
>> * descriptor has been written back
>> */
>> rmb();
>>
>> With load_acquire() this can be changed to:
>>
>> if (!load_acquire(&rx_desc->wb.upper.status_error))
>> break;
> I still don't think this is a good idea for the specific use-case you're
> highlighting.
>
> On ARM, an mb() can be *significantly* more expensive than an rmb() (since
> we may have to drain store buffers on an outer L2 cache) and on arm64 it's
> not at all clear that an LDAR is more efficient than an LDR; DMB LD
> sequence. I can certainly imagine implementations where the latter would
> be preferred.

Yeah, I am pretty sure I overdid it in using a mb() for arm. I think what
I should probably be using is something like the dmb(ish) that is used for
smp_mb() instead. The general idea is to enforce ordering of
memory-to-memory accesses. The memory-to-MMIO accesses should still be
using a full rmb()/wmb() barrier.
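
In other words, the arm hunk would move in this direction (a sketch of the
idea only, not the posted patch):

#define store_release(p, v) \
do { \
	compiletime_assert_atomic_type(*p); \
	dmb(ish); /* system-memory ordering only, cheaper than mb() */ \
	ACCESS_ONCE(*p) = (v); \
} while (0)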

The alternative I am mulling over is creating something like a lightweight
set of memory barriers named lw_mb(), lw_rmb(), and lw_wmb() that could be
used instead. The general idea is that on many architectures a full
mb()/rmb()/wmb() is far too much for merely guaranteeing ordering of
system-memory-only writes or reads. I'm thinking I could probably use the
smp_ varieties as a template for them, since I suspect that in most cases
this would be correct; a rough sketch follows.
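
Purely as an illustration (the lw_* names are hypothetical and nothing
below was posted in this series), the asm-generic fallback and an x86
override might look like:

/* asm-generic: conservative defaults for ordering CPU<->device
 * accesses to coherent system memory, modeled on the smp_* variants.
 */
#define lw_mb()		mb()
#define lw_rmb()	rmb()
#define lw_wmb()	wmb()

/* arch/x86: TSO already orders load/load and store/store to cacheable
 * memory, so compiler barriers suffice for the read/write flavors.
 */
#define lw_rmb()	barrier()
#define lw_wmb()	barrier()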

Also, just to be clear I am not advocating replacing the wmb() in most
I/O setups where we have to sync the system memory before doing the MMIO
write. This is for the case where the device descriptor ring has some
bit indicating ownership by either the device or the CPU. So for
example on the r8169 they have to do a wmb() before writing the DescOwn
bit in the first descriptor of a given set of Tx descriptors to
guarantee the rest are written, then they set the DescOwn bit, then they
call wmb() again to flush that last bit before notifying the device it
can start fetching the descriptors. My goal is to deal with that first
wmb() and leave the second as it since it is correct.

> So, whilst I'm perfectly fine to go along with mandatory acquire/release
> macros (we should probably add a check to barf on __iomem pointers), I
> don't agree with using them in preference to finer-grained read/write
> barriers. Doing so will have a real impact on I/O performance.

Couldn't that type of check be added to compiletime_assert_atomic_type()?
That seems like it would be the best place for something like that.


> Finally, do you know of any architectures where load_acquire/store_release
> aren't implemented the same way as the smp_* variants on SMP kernels?
>
> Will

I should probably go back through and sort out the cases where mb() and
smp_mb() are not the same thing. I suspect I went with too strong a
barrier in a couple of other cases as well.

Thanks,

Alex

2014-11-14 16:59:50

by Alexander Duyck

Subject: Re: [PATCH 1/3] arch: Introduce load_acquire() and store_release()


On 11/14/2014 02:45 AM, David Laight wrote:
> From: Alexander Duyck
>> It is common for device drivers to make use of acquire/release semantics
>> when dealing with descriptors stored in device memory. On reviewing the
>> documentation and code for smp_load_acquire() and smp_store_release() as
>> well as reviewing an IBM website that goes over the use of PowerPC barriers
>> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
>> occurred to me that the same code could likely be applied to device drivers.
>>
>> As a result this patch introduces load_acquire() and store_release(). The
>> load_acquire() function can be used in situations where a test for
>> ownership must be followed by a memory barrier. The below example is
>> from ixgbe:
>>
>> if (!rx_desc->wb.upper.status_error)
>> break;
>>
>> /* This memory barrier is needed to keep us from reading
>> * any other fields out of the rx_desc until we know the
>> * descriptor has been written back
>> */
>> rmb();
>>
>> With load_acquire() this can be changed to:
>>
>> if (!load_acquire(&rx_desc->wb.upper.status_error))
>> break;
> If I'm quickly reading the 'new' code I need to look up yet another
> function, with the 'old' code I can easily see the logic.
>
> You've also added a memory barrier to the 'break' path - which isn't needed.
>
> The driver might also have additional code that can be added before the barrier
> so reducing the cost of the barrier.
>
> The driver may also be able to perform multiple actions before a barrier is needed.
>
> Hiding barriers isn't necessarily a good idea anyway.
> If you are writing a driver you need to understand when and where they are needed.
>
> Maybe you need a new (weaker) barrier to replace rmb() on some architectures.
>
> ...
>
>
> David

Yeah, I think I might explore creating some lightweight barriers. The
load_acquire/store_release stuff is a bit overkill for what is needed.

Thanks,

Alex

2014-11-14 17:25:50

by Jeff Kirsher

Subject: Re: [PATCH 3/3] fm10k/igb/ixgbe: Use load_acquire on Rx descriptor

On Thu, 2014-11-13 at 11:27 -0800, Alexander Duyck wrote:
> This change makes it so that load_acquire() is used when reading the Rx
> descriptor. The advantage of load_acquire() is that it allows for a much
> lower cost barrier on the x86, ia64, powerpc, arm64, and s390
> architectures than a traditional memory barrier when dealing with reads
> that only have to synchronize to system memory.
>
> In addition I have updated the code so that it just checks whether any
> bits have been set instead of testing only the DD bit. Since the DD bit
> is always set as part of a descriptor write-back, we only need to check
> for a non-zero value at that memory location rather than for any specific
> bit. This makes the code itself appear much cleaner and gives the
> compiler more room to optimize.
>
> Cc: Jeff Kirsher <[email protected]>
> Cc: Matthew Vick <[email protected]>
> Cc: Don Skidmore <[email protected]>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> drivers/net/ethernet/intel/fm10k/fm10k_main.c | 8 +++-----
> drivers/net/ethernet/intel/igb/igb_main.c | 8 +++-----
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 11 ++++-------
> 3 files changed, 10 insertions(+), 17 deletions(-)

Based on the discussion on patch 01 of the series, it appears changes
are coming to the series, so I won't be picking up this patch. I will
wait for Alex to re-spin the series with the suggested changes.

2014-11-15 21:16:32

by Francois Romieu

Subject: Re: [PATCH 2/3] r8169: Use load_acquire() and store_release() to reduce memory barrier overhead

Alexander Duyck <[email protected]> :
> On 11/13/2014 01:30 PM, Francois Romieu wrote:
> > Alexander Duyck <[email protected]> :
> > [...]
> >> In addition the r8169 uses a rmb(), however I believe it is placed
> >> incorrectly, as I assume it is supposed to be ordering descriptor reads
> >> after the check for ownership.
> > Not exactly. It's a barrier against compiler optimization from 2004.
> > It should not matter.
>
> Okay. Do you recall what kind of problem you were seeing?

Mildly, I had to grep the local archives.

The relevant code used to be included in the irq handler at that time
(NAPI support for this driver arrived in May 2004). One did not want
a runaway loop in the Tx reaper.

Compiler optimization was suggested by Manfred Spraul in the thread below:
http://marc.info/?l=linux-kernel&m=108096868119004

> The origin of the rmb() for the Intel drivers was a PowerPC issue in
> which it was fetching the length of a buffer before it checked the DD
> bit (equivalent of DescOwn). I'm wondering if the issue you were seeing
> was something similar where it had reordered reads in the descriptor to
> cause that type of result.

The problem was only reported on 32-bit Intel + Slackware + gcc 3.2.3.

Adam Nielsen - Cc: added - did not follow up on this bug.

--
Ueimor