2014-12-10 14:15:54

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

The network stack has some use-cases that put extreme demands
on the memory allocator. One use-case, 10Gbit/s wirespeed at the smallest
packet size[1], requires handling a packet every 67.2 ns (nanosec).

Micro benchmarking[2] the SLUB allocator (with 256 byte skb-sized
elements) shows that "fast-path" instant reuse only costs 19 ns, but a
pattern closer to real network usage shows the cost rising to 45 ns.

This patchset introduces a quick mempool (qmempool), which, when used
in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on a "fast-path"
drop in the iptables "raw" table, but more importantly saves 40 ns with
IP-forwarding, which was hitting the slower SLUB use-case.


One of the building blocks for achieving this speedup is a cmpxchg
based Lock-Free queue that supports bulking, named alf_queue for
Array-based Lock-Free queue. By bulking elements (pointers) from the
queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
elements.
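
To illustrate how bulking amortizes the cmpxchg, here is a minimal
usage sketch of the alf_queue API from patch 1 (not code from the
patchset; BULK and example_drain() are just illustrative names):

    #include <linux/alf_queue.h>

    #define BULK 16

    /* Drain up to BULK pointers from an MPMC alf_queue in one go */
    static int example_drain(struct alf_queue *q)
    {
            void *objs[BULK];
            int num;

            /* A single locked cmpxchg on consumer.head reserves up to
             * BULK slots for this CPU.
             */
            num = alf_mc_dequeue(q, objs, BULK);

            /* 'num' pointers are now owned by this CPU; the approx 8 ns
             * cmpxchg cost is paid once, i.e. ~8/num ns per element.
             */
            return num;
    }

alf_mp_enqueue() behaves symmetrically on producer.head and, with the
current semantics, returns 0 if the full bulk cannot fit.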

Patch1: alf_queue (Lock-Free queue)

Patch2: qmempool using alf_queue

Patch3: usage of qmempool for SKB caching


Notice, this patchset depends on the introduction of napi_alloc_skb(),
which is part of Alexander Duyck's patchset [3].

Different correctness tests and micro benchmarks are available via my
github repo "prototype-kernel"[4], where the alf_queue and qmempool are
also kept in sync with this patchset.

Links:
[1]: http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
[2]: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/qmempool_bench.c
[3]: http://thread.gmane.org/gmane.linux.network/342347
[4]: https://github.com/netoptimizer/prototype-kernel

---

Jesper Dangaard Brouer (3):
net: use qmempool in-front of sk_buff kmem_cache
mm: qmempool - quick queue based memory pool
lib: adding an Array-based Lock-Free (ALF) queue


include/linux/alf_queue.h | 303 ++++++++++++++++++++++++++++++++++++++++++
include/linux/qmempool.h | 205 +++++++++++++++++++++++++++++
include/linux/skbuff.h | 4 -
lib/Kconfig | 13 ++
lib/Makefile | 2
lib/alf_queue.c | 47 +++++++
mm/Kconfig | 12 ++
mm/Makefile | 1
mm/qmempool.c | 322 +++++++++++++++++++++++++++++++++++++++++++++
net/core/dev.c | 5 +
net/core/skbuff.c | 43 +++++-
11 files changed, 950 insertions(+), 7 deletions(-)
create mode 100644 include/linux/alf_queue.h
create mode 100644 include/linux/qmempool.h
create mode 100644 lib/alf_queue.c
create mode 100644 mm/qmempool.c

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer


2014-12-10 14:16:41

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: [RFC PATCH 3/3] net: use qmempool in-front of sk_buff kmem_cache

This patch uses qmempool for faster than SLAB caching of SKBs.

Only use this caching in connection with napi_alloc_skb() which runs
in softirq context. This softirq context provides the needed
protection for qmempool and the underlying alf_queue.

Current caching settings are max 32 elements per CPU, which is 8192
bytes given the SKB is SLAB_HWCACHE_ALIGN'ed. The shared queue max limit
is 1024, which corresponds to a worst-case 262KB memory usage. Systems
with NR_CPUS <= 8 will get a smaller max shared queue.
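
For reference, with the 256 byte skbuff_head_cache elements used in
these benchmarks the bounds work out as:

  localq:  32 elems per CPU * 256 bytes      = 8192 bytes per CPU
  sharedq: min(1024, roundup_pow_of_two(NR_CPUS * 32 * 2)) elems
           worst-case 1024 elems * 256 bytes = 262144 bytes (~262KB)
           e.g. NR_CPUS=8: roundup_pow_of_two(8 * 64) = 512 elems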

Benchmarked on an E5-2695 (12 cores, no-HT) with ixgbe.
Baseline is Alex's napi_alloc_skb patchset.

Single flow/CPU, early drop in iptables RAW table (fast path compare):
* baseline: 3,159,160 pps
* qmempool: 3,282,508 pps
- Diff to baseline: +123348 pps => -11.89 ns

IP-forward single flow/cpu (slower path compare):
* baseline: 1,137,284 pps
* qmempool: 1,191,856 pps
- Diff to baseline: +54572 pps => -40.26 ns
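
(The ns deltas are simply the difference in per-packet cost, 10^9/pps:
e.g. 10^9/1,137,284 - 10^9/1,191,856 = 879.3 ns - 839.0 ns ~= 40.3 ns;
the same conversion gives the fast-path and multi-flow numbers.)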

Some of the improvements also come from qmempool_{alloc,free} having
a smaller code size than kmem_cache_{alloc,free}, which helps reduce
instruction-cache misses.

Also did some scaling tests, to stress qmempool sharedq allocs (which
stress the alf_queue's concurrency).

IP-forward MULTI flow/cpu (12 CPUs E5-2695 no-HT, 12 HWQs):
* baseline: 11,946,666 pps
* qmempool: 11,988,042 pps
- Diff to baseline: +41376 pps => -0.29 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---

include/linux/skbuff.h | 4 +++-
net/core/dev.c | 5 ++++-
net/core/skbuff.c | 43 ++++++++++++++++++++++++++++++++++++++-----
3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index af79302..8881215 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -152,6 +152,7 @@ struct scatterlist;
struct pipe_inode_info;
struct iov_iter;
struct napi_struct;
+struct qmempool;

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct nf_conntrack {
@@ -557,8 +558,8 @@ struct sk_buff {
fclone:2,
peeked:1,
head_frag:1,
+ qmempool:1,
xmit_more:1;
- /* one bit hole */
kmemcheck_bitfield_end(flags1);

/* fields enclosed in headers_start/headers_end are copied
@@ -755,6 +756,7 @@ void skb_tx_error(struct sk_buff *skb);
void consume_skb(struct sk_buff *skb);
void __kfree_skb(struct sk_buff *skb);
extern struct kmem_cache *skbuff_head_cache;
+extern struct qmempool *skbuff_head_pool;

void kfree_skb_partial(struct sk_buff *skb, bool head_stolen);
bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
diff --git a/net/core/dev.c b/net/core/dev.c
index 80f798d..0c95fbd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -135,6 +135,7 @@
#include <linux/if_macvlan.h>
#include <linux/errqueue.h>
#include <linux/hrtimer.h>
+#include <linux/qmempool.h>

#include "net-sysfs.h"

@@ -4125,7 +4126,9 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)

case GRO_MERGED_FREE:
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
- kmem_cache_free(skbuff_head_cache, skb);
+ (skb->qmempool) ?
+ qmempool_free(skbuff_head_pool, skb) :
+ kmem_cache_free(skbuff_head_cache, skb);
else
__kfree_skb(skb);
break;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ae13ef6..a96ce75 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -74,10 +74,24 @@
#include <asm/uaccess.h>
#include <trace/events/skb.h>
#include <linux/highmem.h>
+#include <linux/qmempool.h>
+#include <linux/log2.h>

struct kmem_cache *skbuff_head_cache __read_mostly;
static struct kmem_cache *skbuff_fclone_cache __read_mostly;

+/* Keep max 32 skbs per CPU = 8192 bytes per CPU (as skb is
+ * SLAB_HWCACHE_ALIGN). The sharedq cache is limited to max 1024 elems,
+ * which is max 262KB of skb memory; on small systems the sharedq is
+ * sized as 32*2 elems * NR_CPUS, rounded up to a power of two.
+ */
+struct qmempool *skbuff_head_pool __read_mostly;
+#define QMEMPOOL_LOCALQ 32
+#define QMEMPOOL_SCALE (QMEMPOOL_LOCALQ * 2)
+#define QMEMPOOL_SYSTEM_SIZE roundup_pow_of_two(NR_CPUS * QMEMPOOL_SCALE)
+#define QMEMPOOL_SHAREDQ min(1024UL, QMEMPOOL_SYSTEM_SIZE)
+#define QMEMPOOL_PREALLOC 0
+
/**
* skb_panic - private function for out-of-line support
* @skb: buffer
@@ -278,13 +292,14 @@ nodata:
EXPORT_SYMBOL(__alloc_skb);

/**
- * build_skb - build a network buffer
+ * __build_skb - build a network buffer
* @data: data buffer provided by caller
* @frag_size: size of fragment, or 0 if head was kmalloced
*
* Allocate a new &sk_buff. Caller provides space holding head and
* skb_shared_info. @data must have been allocated by kmalloc() only if
* @frag_size is 0, otherwise data should come from the page allocator.
+ * @flags: FIXME-DESCRIBE
* The return is the new skb buffer.
* On a failure the return is %NULL, and @data is not freed.
* Notes :
@@ -295,13 +310,16 @@ EXPORT_SYMBOL(__alloc_skb);
* before giving packet to stack.
* RX rings only contains data buffers, not full skbs.
*/
-struct sk_buff *build_skb(void *data, unsigned int frag_size)
+static struct sk_buff *__build_skb(void *data, unsigned int frag_size,
+ int flags)
{
struct skb_shared_info *shinfo;
struct sk_buff *skb;
unsigned int size = frag_size ? : ksize(data);

- skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+ skb = (flags & SKB_ALLOC_NAPI) ?
+ qmempool_alloc_softirq(skbuff_head_pool, GFP_ATOMIC) :
+ kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
if (!skb)
return NULL;

@@ -310,6 +328,7 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
memset(skb, 0, offsetof(struct sk_buff, tail));
skb->truesize = SKB_TRUESIZE(size);
skb->head_frag = frag_size != 0;
+ skb->qmempool = !!(flags & SKB_ALLOC_NAPI);
atomic_set(&skb->users, 1);
skb->head = data;
skb->data = data;
@@ -326,6 +345,11 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)

return skb;
}
+
+struct sk_buff *build_skb(void *data, unsigned int frag_size)
+{
+ return __build_skb(data, frag_size, 0);
+}
EXPORT_SYMBOL(build_skb);

struct netdev_alloc_cache {
@@ -477,7 +501,7 @@ static struct sk_buff *__alloc_rx_skb(unsigned int length, gfp_t gfp_mask,
__netdev_alloc_frag(fragsz, gfp_mask);

if (likely(data)) {
- skb = build_skb(data, fragsz);
+ skb = __build_skb(data, fragsz, flags);
if (unlikely(!skb))
put_page(virt_to_head_page(data));
}
@@ -637,7 +661,9 @@ static void kfree_skbmem(struct sk_buff *skb)

switch (skb->fclone) {
case SKB_FCLONE_UNAVAILABLE:
- kmem_cache_free(skbuff_head_cache, skb);
+ (skb->qmempool) ?
+ qmempool_free(skbuff_head_pool, skb) :
+ kmem_cache_free(skbuff_head_cache, skb);
return;

case SKB_FCLONE_ORIG:
@@ -862,6 +888,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, struct sk_buff *skb)
C(end);
C(head);
C(head_frag);
+ C(qmempool);
C(data);
C(truesize);
atomic_set(&n->users, 1);
@@ -3370,6 +3397,12 @@ void __init skb_init(void)
0,
SLAB_HWCACHE_ALIGN|SLAB_PANIC,
NULL);
+ /* connect qmempools to slabs */
+ skbuff_head_pool = qmempool_create(QMEMPOOL_LOCALQ,
+ QMEMPOOL_SHAREDQ,
+ QMEMPOOL_PREALLOC,
+ skbuff_head_cache, GFP_ATOMIC);
+ BUG_ON(skbuff_head_pool == NULL);
}

/**

2014-12-10 14:21:39

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: [RFC PATCH 2/3] mm: qmempool - quick queue based memory pool

A quick queue-based memory pool that functions as a cache in-front
of kmem_cache SLAB/SLUB allocators, which allows faster than
SLAB/SLUB reuse/caching of fixed size memory elements.

The speed gain comes from the shared storage using a Lock-Free
queue that supports bulk refilling of elements (to a percpu cache)
with a single cmpxchg. Thus, the (lock-prefixed) cmpxchg cost is
amortized over the bulk size.

Qmempool cannot easily replace all kmem_cache usage, because it is
restricted in which contexts it can be used, as the Lock-Free
queue is not preemption safe. E.g. it only supports GFP_ATOMIC
allocations from SLAB.

This version is optimized for usage from softirq context, and cannot
be used from hardirq context. Usage from non-softirq context requires
local_bh_{disable,enable}, which has a fairly high cost.

Performance micro benchmarks against SLUB. The first test is fast-path
reuse of the same element. The second test allocates 256 elements
before freeing them again; this pattern comes from how NIC ring queue
cleanups often run.

On CPU E5-2695, CONFIG_PREEMPT=y, showing cost of alloc+free:

                SLUB - qmempool (softirq) - qmempool (non-softirq)
fastpath-reuse: 19.563 ns - 7.837 ns - 18.536 ns
N(256)-pattern: 45.039 ns - 11.782 ns - 24.186 ns

A significant win for usage from softirq, and a smaller win for
non-softirq usage, which requires local_bh_{disable,enable}.
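
A minimal usage sketch of the API added by this patch (illustrative
only; 'my_cache' and the example functions are not part of the patch,
and the queue sizes are just example values -- patch 3 contains the
real in-tree user):

    #include <linux/errno.h>
    #include <linux/qmempool.h>
    #include <linux/slab.h>

    static struct qmempool *pool;

    static int example_init(struct kmem_cache *my_cache)
    {
            /* localq = 32 elems per CPU, sharedq = 1024, no prealloc */
            pool = qmempool_create(32, 1024, 0, my_cache, GFP_ATOMIC);
            return pool ? 0 : -ENOMEM;
    }

    static void example_softirq_user(void)
    {
            void *elem;

            /* Cheap variant: only valid when running in softirq */
            elem = qmempool_alloc_softirq(pool, GFP_ATOMIC);
            if (!elem)
                    return;
            /* ... use elem ... */
            qmempool_free(pool, elem);
    }

Other (non-hardirq) contexts must use qmempool_alloc(), which pays the
local_bh_{disable,enable} cost measured above.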

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---

include/linux/qmempool.h | 205 +++++++++++++++++++++++++++++
mm/Kconfig | 12 ++
mm/Makefile | 1
mm/qmempool.c | 322 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 540 insertions(+), 0 deletions(-)
create mode 100644 include/linux/qmempool.h
create mode 100644 mm/qmempool.c

diff --git a/include/linux/qmempool.h b/include/linux/qmempool.h
new file mode 100644
index 0000000..922ed27
--- /dev/null
+++ b/include/linux/qmempool.h
@@ -0,0 +1,205 @@
+/*
+ * qmempool - a quick queue based mempool
+ *
+ * A quick queue-based memory pool that functions as a cache in-front
+ * of kmem_cache SLAB/SLUB allocators, which allows faster than
+ * SLAB/SLUB reuse/caching of fixed size memory elements.
+ *
+ * The speed gain comes from the shared storage using a Lock-Free
+ * queue that supports bulk refilling of elements (to a percpu cache)
+ * with a single cmpxchg. Thus, the lock-prefixed cmpxchg cost is
+ * amortized over the bulk size.
+ *
+ * The Lock-Free queue is based on an array (of pointers to elements).
+ * This makes access more cache optimal, as e.g. on 64bit 8 pointers
+ * can be stored per cache-line (which is superior to a linked list
+ * approach). Only storing the pointers to elements is also
+ * beneficial as we don't touch the elements' data.
+ *
+ * Qmempool cannot easily replace all kmem_cache usage, because it is
+ * restricted in which contexts it can be used, as the Lock-Free
+ * queue is not preemption safe. This version is optimized for usage
+ * from softirq context, and cannot be used from hardirq context.
+ *
+ * Only supports GFP_ATOMIC allocations from SLAB.
+ *
+ * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
+ * for licensing details see kernel-base/COPYING
+ */
+
+#ifndef _LINUX_QMEMPOOL_H
+#define _LINUX_QMEMPOOL_H
+
+#include <linux/alf_queue.h>
+#include <linux/prefetch.h>
+#include <linux/hardirq.h>
+
+/* Bulking is an essential part of the performance gains as this
+ * amortizes the cost of the cmpxchg ops used when accessing sharedq
+ */
+#define QMEMPOOL_BULK 16
+#define QMEMPOOL_REFILL_MULTIPLIER 2
+
+struct qmempool_percpu {
+ struct alf_queue *localq;
+};
+
+struct qmempool {
+ /* The shared queue (sharedq) is a Multi-Producer-Multi-Consumer
+ * queue where access is protected by an atomic cmpxchg operation.
+ * The queue supports bulk transfers, which amortize the cost
+ * of the atomic cmpxchg operation.
+ */
+ struct alf_queue *sharedq;
+
+ /* Per CPU local "cache" queues for faster atomic free access.
+ * The local queues (localq) are Single-Producer-Single-Consumer
+ * queues as they are per CPU.
+ */
+ struct qmempool_percpu __percpu *percpu;
+
+ /* Backed by some SLAB kmem_cache */
+ struct kmem_cache *kmem;
+
+ /* Setup */
+ uint32_t prealloc;
+ gfp_t gfp_mask;
+};
+
+extern void qmempool_destroy(struct qmempool *pool);
+extern struct qmempool *qmempool_create(
+ uint32_t localq_sz, uint32_t sharedq_sz, uint32_t prealloc,
+ struct kmem_cache *kmem, gfp_t gfp_mask);
+
+extern void *__qmempool_alloc_from_sharedq(
+ struct qmempool *pool, gfp_t gfp_mask, struct alf_queue *localq);
+extern void __qmempool_free_to_sharedq(void *elem, struct qmempool *pool,
+ struct alf_queue *localq);
+
+/* The percpu variables (SPSC queues) need preempt protection, and
+ * the shared MPMC queue also needs protection against the same CPU
+ * accessing the same queue.
+ *
+ * Specialize and optimize the qmempool to run from softirq.
+ * Don't allow qmempool to be used from interrupt context.
+ *
+ * IDEA: When used from softirq, take advantage of the protection
+ * softirq gives. A softirq will never preempt another softirq,
+ * running on the same CPU. The only event that can preempt a softirq
+ * is an interrupt handler (and perhaps we don't need to support
+ * calling qmempool from an interrupt). Another softirq, even the
+ * same one, can run on another CPU however, but these helpers are
+ * only protecting our percpu variables.
+ *
+ * Thus, our percpu variables are safe if the current CPU is the one
+ * serving the softirq (tested via in_serving_softirq()), like:
+ *
+ * if (!in_serving_softirq())
+ * local_bh_disable();
+ *
+ * This makes qmempool very fast when accessed from softirq, but
+ * slower when accessed outside softirq. The other contexts need to
+ * disable bottom-halves "bh" via local_bh_{disable,enable} (which
+ * has been measured to add a cost of 7.5 ns on a CPU E5-2695).
+ *
+ * MUST not be used from interrupt context, when relying on softirq usage.
+ */
+static inline int __qmempool_preempt_disable(void)
+{
+ int in_serving_softirq = in_serving_softirq();
+
+ if (!in_serving_softirq)
+ local_bh_disable();
+
+ return in_serving_softirq;
+}
+
+static inline void __qmempool_preempt_enable(int in_serving_softirq)
+{
+ if (!in_serving_softirq)
+ local_bh_enable();
+}
+
+/* Elements - alloc and free functions are inlined here for
+ * performance reasons, as the per CPU lockless access should be as
+ * fast as possible.
+ */
+
+/* Main allocation function
+ *
+ * Caller must make sure this is called from a preemptive safe context
+ */
+static inline void *main_qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+ /* NUMA considerations: for now the NUMA node is not handled,
+ * this could be handled via e.g. numa_mem_id()
+ */
+ void *elem;
+ struct qmempool_percpu *cpu;
+ int num;
+
+ /* 1. attempt get element from local per CPU queue */
+ cpu = this_cpu_ptr(pool->percpu);
+ num = alf_sc_dequeue(cpu->localq, (void **)&elem, 1);
+ if (num == 1) /* Success: alloc elem by deq from localq cpu cache */
+ return elem;
+
+ /* 2. attempt get element from shared queue. This involves
+ * refilling the localq for next round. Side-effect can be
+ * alloc from SLAB.
+ */
+ elem = __qmempool_alloc_from_sharedq(pool, gfp_mask, cpu->localq);
+ return elem;
+}
+
+static inline void *__qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+ void *elem;
+ int state;
+
+ state = __qmempool_preempt_disable();
+ elem = main_qmempool_alloc(pool, gfp_mask);
+ __qmempool_preempt_enable(state);
+ return elem;
+}
+
+static inline void *__qmempool_alloc_softirq(struct qmempool *pool,
+ gfp_t gfp_mask)
+{
+ return main_qmempool_alloc(pool, gfp_mask);
+}
+
+/* Main free function */
+static inline void __qmempool_free(struct qmempool *pool, void *elem)
+{
+ struct qmempool_percpu *cpu;
+ int num;
+ int state;
+
+ /* NUMA considerations, how do we make sure to avoid caching
+ * elements from a different NUMA node.
+ */
+ state = __qmempool_preempt_disable();
+
+ /* 1. attempt to free/return element to local per CPU queue */
+ cpu = this_cpu_ptr(pool->percpu);
+ num = alf_sp_enqueue(cpu->localq, &elem, 1);
+ if (num == 1) /* success: element free'ed by enqueue to localq */
+ goto done;
+
+ /* 2. localq cannot store more elements, need to return some
+ * from localq to sharedq, to make room. Side-effect can be
+ * free to SLAB.
+ */
+ __qmempool_free_to_sharedq(elem, pool, cpu->localq);
+
+done:
+ __qmempool_preempt_enable(state);
+}
+
+/* API users can choose to use "__" prefixed versions for inlining */
+extern void *qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask);
+extern void *qmempool_alloc_softirq(struct qmempool *pool, gfp_t gfp_mask);
+extern void qmempool_free(struct qmempool *pool, void *elem);
+
+#endif /* _LINUX_QMEMPOOL_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..abaa94c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,15 @@ config MAX_STACK_SIZE_MB
changed to a smaller value in which case that is used.

A sane initial value is 80 MB.
+
+config QMEMPOOL
+ bool "Quick queue based mempool (qmempool)"
+ default y
+ select ALF_QUEUE
+ help
+ A mempool designed for faster than SLAB/kmem_cache
+ reuse/caching of fixed size memory elements. Works as a
+ caching layer in-front of existing kmem_cache SLABs. Speed
+ is achieved by _bulk_ refilling percpu local cache, from a
+ Lock-Free queue requiring a single (locked) cmpxchg per bulk
+ transfer, thus amortizing the cost of the cmpxchg.
diff --git a/mm/Makefile b/mm/Makefile
index 8405eb0..49c1e18 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -69,3 +69,4 @@ obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
obj-$(CONFIG_CMA) += cma.o
obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
+obj-$(CONFIG_QMEMPOOL) += qmempool.o
diff --git a/mm/qmempool.c b/mm/qmempool.c
new file mode 100644
index 0000000..d6debcc
--- /dev/null
+++ b/mm/qmempool.c
@@ -0,0 +1,322 @@
+/*
+ * qmempool - a quick queue based mempool
+ *
+ * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
+ * for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/export.h>
+#include <linux/percpu.h>
+#include <linux/qmempool.h>
+#include <linux/log2.h>
+
+/* Due to hotplug CPU support, we need access to all qmempools
+ * in order to clean up elements in localq for the CPU going offline.
+ *
+ * TODO: implement HOTPLUG_CPU
+#ifdef CONFIG_HOTPLUG_CPU
+static LIST_HEAD(qmempool_list);
+static DEFINE_SPINLOCK(qmempool_list_lock);
+#endif
+ */
+
+void qmempool_destroy(struct qmempool *pool)
+{
+ void *elem = NULL;
+ int j;
+
+ if (pool->percpu) {
+ for_each_possible_cpu(j) {
+ struct qmempool_percpu *cpu =
+ per_cpu_ptr(pool->percpu, j);
+
+ while (alf_mc_dequeue(cpu->localq, &elem, 1) == 1)
+ kmem_cache_free(pool->kmem, elem);
+ BUG_ON(!alf_queue_empty(cpu->localq));
+ alf_queue_free(cpu->localq);
+ }
+ free_percpu(pool->percpu);
+ }
+
+ if (pool->sharedq) {
+ while (alf_mc_dequeue(pool->sharedq, &elem, 1) == 1)
+ kmem_cache_free(pool->kmem, elem);
+ BUG_ON(!alf_queue_empty(pool->sharedq));
+ alf_queue_free(pool->sharedq);
+ }
+
+ kfree(pool);
+}
+EXPORT_SYMBOL(qmempool_destroy);
+
+struct qmempool *
+qmempool_create(uint32_t localq_sz, uint32_t sharedq_sz, uint32_t prealloc,
+ struct kmem_cache *kmem, gfp_t gfp_mask)
+{
+ struct qmempool *pool;
+ int i, j, num;
+ void *elem;
+
+ /* Validate constraints, e.g. due to bulking */
+ if (localq_sz < QMEMPOOL_BULK) {
+ pr_err("%s() localq size(%d) too small for bulking\n",
+ __func__, localq_sz);
+ return NULL;
+ }
+ if (sharedq_sz < (QMEMPOOL_BULK * QMEMPOOL_REFILL_MULTIPLIER)) {
+ pr_err("%s() sharedq size(%d) too small for bulk refill\n",
+ __func__, sharedq_sz);
+ return NULL;
+ }
+ if (!is_power_of_2(localq_sz) || !is_power_of_2(sharedq_sz)) {
+ pr_err("%s() queue sizes (%d/%d) must be power-of-2\n",
+ __func__, localq_sz, sharedq_sz);
+ return NULL;
+ }
+ if (prealloc > sharedq_sz) {
+ pr_err("%s() prealloc(%d) req > sharedq size(%d)\n",
+ __func__, prealloc, sharedq_sz);
+ return NULL;
+ }
+ if ((prealloc % QMEMPOOL_BULK) != 0) {
+ pr_warn("%s() prealloc(%d) should be div by BULK size(%d)\n",
+ __func__, prealloc, QMEMPOOL_BULK);
+ }
+ if (!kmem) {
+ pr_err("%s() kmem_cache is a NULL ptr\n", __func__);
+ return NULL;
+ }
+
+ pool = kzalloc(sizeof(*pool), gfp_mask);
+ if (!pool)
+ return NULL;
+ pool->kmem = kmem;
+ pool->gfp_mask = gfp_mask;
+
+ /* MPMC (Multi-Producer-Multi-Consumer) queue */
+ pool->sharedq = alf_queue_alloc(sharedq_sz, gfp_mask);
+ if (IS_ERR_OR_NULL(pool->sharedq)) {
+ pr_err("%s() failed to create shared queue(%d) ERR_PTR:0x%p\n",
+ __func__, sharedq_sz, pool->sharedq);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+
+ pool->prealloc = prealloc;
+ for (i = 0; i < prealloc; i++) {
+ elem = kmem_cache_alloc(pool->kmem, gfp_mask);
+ if (!elem) {
+ pr_err("%s() kmem_cache out of memory?!\n", __func__);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+ /* Could use the SP version given it is not visible yet */
+ num = alf_mp_enqueue(pool->sharedq, &elem, 1);
+ BUG_ON(num <= 0);
+ }
+
+ pool->percpu = alloc_percpu(struct qmempool_percpu);
+ if (pool->percpu == NULL) {
+ pr_err("%s() failed to alloc percpu\n", __func__);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+
+ /* SPSC (Single-Consumer-Single-Producer) queue per CPU */
+ for_each_possible_cpu(j) {
+ struct qmempool_percpu *cpu = per_cpu_ptr(pool->percpu, j);
+
+ cpu->localq = alf_queue_alloc(localq_sz, gfp_mask);
+ if (IS_ERR_OR_NULL(cpu->localq)) {
+ pr_err("%s() failed alloc localq(sz:%d) on cpu:%d\n",
+ __func__, localq_sz, j);
+ qmempool_destroy(pool);
+ return NULL;
+ }
+ }
+
+ return pool;
+}
+EXPORT_SYMBOL(qmempool_create);
+
+/* Element handling
+ */
+
+/* This function is called when sharedq runs out of elements.
+ * Thus, sharedq needs to be refilled (enq) with elems from slab.
+ *
+ * Caller must assure this is called in a preemption-safe context due
+ * to the alf_mp_enqueue() call.
+ */
+void *__qmempool_alloc_from_slab(struct qmempool *pool, gfp_t gfp_mask)
+{
+ void *elems[QMEMPOOL_BULK]; /* on stack variable */
+ void *elem;
+ int num, i, j;
+
+ /* Cannot use SLAB that can sleep if (gfp_mask & __GFP_WAIT),
+ * else preemption disable/enable scheme becomes too complicated
+ */
+ BUG_ON(gfp_mask & __GFP_WAIT);
+
+ elem = kmem_cache_alloc(pool->kmem, gfp_mask);
+ if (elem == NULL) /* slab depleted, no reason to call below allocs */
+ return NULL;
+
+ /* SLAB considerations, we need a kmem_cache interface that
+ * supports allocating a bulk of elements.
+ */
+
+ for (i = 0; i < QMEMPOOL_REFILL_MULTIPLIER; i++) {
+ for (j = 0; j < QMEMPOOL_BULK; j++) {
+ elems[j] = kmem_cache_alloc(pool->kmem, gfp_mask);
+ /* Handle if slab gives us NULL elem */
+ if (elems[j] == NULL) {
+ pr_err("%s() ARGH - slab returned NULL",
+ __func__);
+ num = alf_mp_enqueue(pool->sharedq, elems, j-1);
+ BUG_ON(num == 0); //FIXME handle
+ return elem;
+ }
+ }
+ num = alf_mp_enqueue(pool->sharedq, elems, QMEMPOOL_BULK);
+ /* FIXME: There is a theoretical chance that multiple
+ * CPU enter here, refilling sharedq at the same time,
+ * thus we must handle "full" situation, for now die
+ * hard so someone will need to fix this.
+ */
+ BUG_ON(num == 0); /* sharedq should have room */
+ }
+
+ /* What about refilling localq here? (else it will happen on
+ * next cycle, and will cost an extra cmpxchg).
+ */
+ return elem;
+}
+
+/* This function is called when the localq runs out of elements.
+ * Thus, localq is refilled (enq) with elements (deq) from sharedq.
+ *
+ * Caller must assure this is called in a preemption-safe context due
+ * to the alf_mc_dequeue() call.
+ */
+void *__qmempool_alloc_from_sharedq(struct qmempool *pool, gfp_t gfp_mask,
+ struct alf_queue *localq)
+{
+ void *elems[QMEMPOOL_BULK]; /* on stack variable */
+ void *elem;
+ int num;
+
+ /* Costs atomic "cmpxchg", but amortize cost by bulk dequeue */
+ num = alf_mc_dequeue(pool->sharedq, elems, QMEMPOOL_BULK);
+ if (likely(num > 0)) {
+ /* Consider prefetching data part of elements here, it
+ * should be an optimal place to hide memory prefetching.
+ * Especially given the localq is known to be an empty FIFO
+ * which guarantees the order objs are accessed in.
+ */
+ elem = elems[0]; /* extract one element */
+ if (num > 1) {
+ num = alf_sp_enqueue(localq, &elems[1], num-1);
+ /* Refill localq, should be empty, must succeed */
+ BUG_ON(num == 0);
+ }
+ return elem;
+ }
+ /* Use slab if sharedq runs out of elements */
+ elem = __qmempool_alloc_from_slab(pool, gfp_mask);
+ return elem;
+}
+EXPORT_SYMBOL(__qmempool_alloc_from_sharedq);
+
+/* Called when sharedq is full. Thus, also make room in sharedq,
+ * besides freeing the "elems" given.
+ */
+bool __qmempool_free_to_slab(struct qmempool *pool, void **elems, int n)
+{
+ int num, i, j;
+ /* SLAB considerations, we could use kmem_cache interface that
+ * supports returning a bulk of elements.
+ */
+
+ /* free these elements for real */
+ for (i = 0; i < n; i++)
+ kmem_cache_free(pool->kmem, elems[i]);
+
+ /* Make room in sharedq for next round */
+ for (i = 0; i < QMEMPOOL_REFILL_MULTIPLIER; i++) {
+ num = alf_mc_dequeue(pool->sharedq, elems, QMEMPOOL_BULK);
+ for (j = 0; j < num; j++)
+ kmem_cache_free(pool->kmem, elems[j]);
+ }
+ return true;
+}
+
+/* This function is called when the localq is full. Thus, elements
+ * from localq need to be dequeued and returned (enqueued) to
+ * sharedq (or, if sharedq is full, freed to the slab).
+ *
+ * MUST be called from a preemption-safe context.
+ */
+void __qmempool_free_to_sharedq(void *elem, struct qmempool *pool,
+ struct alf_queue *localq)
+{
+ void *elems[QMEMPOOL_BULK]; /* on stack variable */
+ int num_enq, num_deq;
+
+ elems[0] = elem;
+ /* Make room in localq */
+ num_deq = alf_sc_dequeue(localq, &elems[1], QMEMPOOL_BULK-1);
+ if (unlikely(num_deq == 0))
+ goto failed;
+ num_deq++; /* count first 'elem' */
+
+ /* Successfully dequeued 'num_deq' elements from localq, "free"
+ * these elems by enqueuing to sharedq
+ */
+ num_enq = alf_mp_enqueue(pool->sharedq, elems, num_deq);
+ if (likely(num_enq == num_deq)) /* Success enqueued to sharedq */
+ return;
+
+ /* If sharedq is full (num_enq == 0) the dequeued elements will be
+ * returned directly to the SLAB allocator.
+ *
+ * Note: This usage of the alf_queue API depends on the enqueue
+ * being "fixed", i.e. only enqueueing if all elements fit; this
+ * is an API detail that might change.
+ */
+
+ __qmempool_free_to_slab(pool, elems, num_deq);
+ return;
+failed:
+ /* dequeuing from a full localq should always be possible */
+ BUG();
+}
+EXPORT_SYMBOL(__qmempool_free_to_sharedq);
+
+/* API users can choose to use "__" prefixed versions for inlining */
+void *qmempool_alloc(struct qmempool *pool, gfp_t gfp_mask)
+{
+ return __qmempool_alloc(pool, gfp_mask);
+}
+EXPORT_SYMBOL(qmempool_alloc);
+
+void *qmempool_alloc_softirq(struct qmempool *pool, gfp_t gfp_mask)
+{
+ return __qmempool_alloc_softirq(pool, gfp_mask);
+}
+EXPORT_SYMBOL(qmempool_alloc_softirq);
+
+void qmempool_free(struct qmempool *pool, void *elem)
+{
+ return __qmempool_free(pool, elem);
+}
+EXPORT_SYMBOL(qmempool_free);
+
+MODULE_DESCRIPTION("Quick queue based mempool (qmempool)");
+MODULE_AUTHOR("Jesper Dangaard Brouer <[email protected]>");
+MODULE_LICENSE("GPL");

2014-12-10 14:23:29

by David Laight

[permalink] [raw]
Subject: RE: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

From: Jesper Dangaard Brouer
> The network stack has some use-cases that put extreme demands
> on the memory allocator. One use-case, 10Gbit/s wirespeed at the smallest
> packet size[1], requires handling a packet every 67.2 ns (nanosec).
>
> Micro benchmarking[2] the SLUB allocator (with 256 byte skb-sized
> elements) shows that "fast-path" instant reuse only costs 19 ns, but a
> pattern closer to real network usage shows the cost rising to 45 ns.
>
> This patchset introduces a quick mempool (qmempool), which, when used
> in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on a "fast-path"
> drop in the iptables "raw" table, but more importantly saves 40 ns with
> IP-forwarding, which was hitting the slower SLUB use-case.
>
>
> One of the building blocks for achieving this speedup is a cmpxchg
> based Lock-Free queue that supports bulking, named alf_queue for
> Array-based Lock-Free queue. By bulking elements (pointers) from the
> queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> elements.

It seems to me that these improvements could be added to the
underlying allocator itself.
Nesting allocators doesn't really seem right to me.

David


2014-12-10 14:40:50

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

On Wed, 10 Dec 2014 14:22:22 +0000
David Laight <[email protected]> wrote:

> From: Jesper Dangaard Brouer
> > The network stack has some use-cases that put extreme demands
> > on the memory allocator. One use-case, 10Gbit/s wirespeed at the smallest
> > packet size[1], requires handling a packet every 67.2 ns (nanosec).
> >
> > Micro benchmarking[2] the SLUB allocator (with 256 byte skb-sized
> > elements) shows that "fast-path" instant reuse only costs 19 ns, but a
> > pattern closer to real network usage shows the cost rising to 45 ns.
> >
> > This patchset introduces a quick mempool (qmempool), which, when used
> > in-front of the SKB (sk_buff) kmem_cache, saves 12 ns on a "fast-path"
> > drop in the iptables "raw" table, but more importantly saves 40 ns with
> > IP-forwarding, which was hitting the slower SLUB use-case.
> >
> >
> > One of the building blocks for achieving this speedup is a cmpxchg
> > based Lock-Free queue that supports bulking, named alf_queue for
> > Array-based Lock-Free queue. By bulking elements (pointers) from the
> > queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> > elements.
>
> It seems to me that these improvements could be added to the
> underlying allocator itself.
> Nesting allocators doesn't really seem right to me.

Yes, I would very much like to see these ideas integrated into the
underlying allocators (hence addressing the mm-list).

This patchset demonstrates that it is possible to do something faster
than the existing SLUB allocator, which the network stack has a need
for.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

2014-12-10 15:15:10

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue

This Array-based Lock-Free (ALF) queue is a very fast bounded
Producer-Consumer queue, supporting bulking. The MPMC
(Multi-Producer/Multi-Consumer) variant uses a locked cmpxchg, but the
cost can be amortized by utilizing bulk enqueue/dequeue.

Results on x86_64 CPU E5-2695, for variants:
MPMC = Multi-Producer-Multi-Consumer
SPSC = Single-Producer-Single-Consumer

(non-bulking): per element cost, MPMC and SPSC
MPMC -- SPSC
simple : 9.519 ns -- 1.282 ns
multi(step:128): 12.905 ns -- 2.240 ns

The majority of the cost is associated with the locked cmpxchg in the
MPMC variant. Bulking helps amortize this cost:

(bulking) cost per element comparing MPMC -> SPSC:
MPMC -- SPSC
bulk2 : 5.849 ns -- 1.748 ns
bulk3 : 4.102 ns -- 1.531 ns
bulk4 : 3.281 ns -- 1.383 ns
bulk6 : 2.530 ns -- 1.238 ns
bulk8 : 2.125 ns -- 1.196 ns
bulk16: 1.552 ns -- 1.109 ns
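
(Amortization check: the MPMC-over-SPSC overhead drops from
9.519 - 1.282 ~= 8.2 ns per element without bulking to
1.552 - 1.109 ~= 0.44 ns at bulk16, i.e. roughly the non-bulking
overhead divided by the bulk factor, as expected when the cmpxchg
cost is paid once per bulk instead of once per element.)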

Joint work with Hannes Frederic Sowa.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
---

Correctness of the memory barriers on different archs needs to be evaluated.

The API might need some adjustments/discussions regarding:

1) the semantics of enq/deq when doing bulking, choosing what
to do when e.g. a bulk enqueue cannot fit fully.

2) some better way to detect misuse of the API, e.g. calling a
single-enqueue variant function on a multi-enqueue variant data
structure. Currently no detection happens.

include/linux/alf_queue.h | 303 +++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 13 ++
lib/Makefile | 2
lib/alf_queue.c | 47 +++++++
4 files changed, 365 insertions(+), 0 deletions(-)
create mode 100644 include/linux/alf_queue.h
create mode 100644 lib/alf_queue.c

diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..fb1a774
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,303 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ * - Array based for cache-line optimization
+ * - Bounded by the array size
+ * - FIFO Producer/Consumer queue, no queue traversal supported
+ * - Very fast
+ * - Designed as a queue for pointers to objects
+ * - Bulk enqueue and dequeue support
+ * - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+ u32 head;
+ u32 tail;
+};
+
+struct alf_queue {
+ u32 size;
+ u32 mask;
+ u32 flags;
+ struct alf_actor producer ____cacheline_aligned_in_smp;
+ struct alf_actor consumer ____cacheline_aligned_in_smp;
+ void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ * 1. They can be reused for both "Single" and "Multi" variants
+ * 2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+static inline void
+__helper_alf_enqueue_store(u32 p_head, struct alf_queue *q,
+ void **ptr, const u32 n)
+{
+ int i, index = p_head;
+
+ for (i = 0; i < n; i++, index++)
+ q->ring[index & q->mask] = ptr[i];
+}
+
+static inline void
+__helper_alf_dequeue_load(u32 c_head, struct alf_queue *q,
+ void **ptr, const u32 elems)
+{
+ int i, index = c_head;
+
+ for (i = 0; i < elems; i++, index++)
+ ptr[i] = q->ring[index & q->mask];
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even though the current API has "fixed" semantics of aborting if it
+ * cannot enqueue the full bulk size, users of this API should check
+ * that the returned number of enqueued elements matches, to verify the
+ * enqueue was successful. This allows us to introduce a "variable"
+ * enqueue scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ do {
+ p_head = ACCESS_ONCE(q->producer.head);
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ }
+ while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Wait for other concurrent preceding enqueues not yet done,
+ * this part makes us non-wait-free and could be problematic
+ * in case of congestion with many CPUs
+ */
+ while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+ cpu_relax();
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ do {
+ c_head = ACCESS_ONCE(q->consumer.head);
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ }
+ while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+ /* LOAD the elems from the queue array.
+ * We don't need a smp_rmb() Read-Memory-Barrier here because
+ * the above cmpxchg is an implied full Memory-Barrier.
+ */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough. Paired with enqueue
+ * implicit full-MB in cmpxchg.
+ */
+ smp_wmb();
+
+ /* Wait for other concurrent preceding dequeues not yet done */
+ while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+ cpu_relax();
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x) \
+ do { \
+ if (unlikely(!(x))) { \
+ pr_crit("Assertion failed %s:%d: \"%s\"\n", \
+ __FILE__, __LINE__, #x); \
+ BUG(); \
+ } \
+ } while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ p_head = q->producer.head;
+ smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+ q->producer.head = p_next;
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ c_head = q->consumer.head;
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+ q->consumer.head = c_next;
+
+ smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough.
+ */
+ smp_wmb();
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+ /* The empty (and initial) state is when the consumer has caught
+ * up with the producer.
+ *
+ * DOUBLE-CHECK: Should we use producer.head, as this indicates
+ * a producer is in-progress(?)
+ */
+ return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+ u32 c_head = ACCESS_ONCE(q->consumer.head);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+ u32 elems;
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ elems = p_tail - c_head;
+ /* Thus, same as:
+ * elems = (p_tail - c_head) & q->mask;
+ */
+ return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+ u32 p_head = ACCESS_ONCE(q->producer.head);
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 space;
+
+ /* The max avail space is q->size and
+ * the empty state is when (consumer == producer)
+ */
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ space = q->size + c_tail - p_head;
+ /* Thus, same as:
+ * space = (q->size + c_tail - p_head) & q->mask;
+ */
+ return space;
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 54cf309..3c0cd58 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -439,6 +439,19 @@ config NLATTR
bool

#
+# ALF queue
+#
+config ALF_QUEUE
+ bool "ALF: Array-based Lock-Free (Producer-Consumer) queue"
+ default y
+ help
+ This Array-based Lock-Free (ALF) queue is a very fast
+ bounded Producer-Consumer queue, supporting bulking. The
+ MPMC (Multi-Producer/Multi-Consumer) variant uses a locked
+ cmpxchg, but the cost can be amortized by utilizing bulk
+ enqueue/dequeue.
+
+#
# Generic 64-bit atomic support is selected if needed
#
config GENERIC_ATOMIC64
diff --git a/lib/Makefile b/lib/Makefile
index 0211d2b..cd3a2d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -119,6 +119,8 @@ obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o

obj-$(CONFIG_NLATTR) += nlattr.o

+obj-$(CONFIG_ALF_QUEUE) += alf_queue.o
+
obj-$(CONFIG_LRU_CACHE) += lru_cache.o

obj-$(CONFIG_DMA_API_DEBUG) += dma-debug.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..d6c9b69
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,47 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ * - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h> /* kzalloc */
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+ struct alf_queue *q;
+ size_t mem_size;
+
+ if (!(is_power_of_2(size)) || size > 65536)
+ return ERR_PTR(-EINVAL);
+
+ /* The ring array is allocated together with the queue struct */
+ mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+ q = kzalloc(mem_size, gfp);
+ if (!q)
+ return ERR_PTR(-ENOMEM);
+
+ q->size = size;
+ q->mask = size - 1;
+
+ return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+ kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);
+
+MODULE_DESCRIPTION("ALF: Array-based Lock-Free queue");
+MODULE_AUTHOR("Jesper Dangaard Brouer <[email protected]>");
+MODULE_LICENSE("GPL");

Subject: Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:

> Patch1: alf_queue (Lock-Free queue)

For some reason that key patch is not in my linux-mm archives nor in my
inbox.

2014-12-10 15:33:32

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

On Wed, 10 Dec 2014 09:17:32 -0600 (CST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
>
> > Patch1: alf_queue (Lock-Free queue)
>
> For some reason that key patch is not in my linux-mm archives nor in my
> inbox.

That is very strange! I did notice that it was somehow delayed in
showing up on gmane.org (http://thread.gmane.org/gmane.linux.network/342347/focus=126148)
and didn't show up on netdev either...

Inserting it below (hoping my email client does not screw it up):


[PATCH RFC] lib: adding an Array-based Lock-Free (ALF) queue

From: Jesper Dangaard Brouer <[email protected]>

This Array-based Lock-Free (ALF) queue is a very fast bounded
Producer-Consumer queue, supporting bulking. The MPMC
(Multi-Producer/Multi-Consumer) variant uses a locked cmpxchg, but the
cost can be amortized by utilizing bulk enqueue/dequeue.

Results on x86_64 CPU E5-2695, for variants:
MPMC = Multi-Producer-Multi-Consumer
SPSC = Single-Producer-Single-Consumer

(non-bulking): per element cost, MPMC and SPSC
MPMC -- SPSC
simple : 9.519 ns -- 1.282 ns
multi(step:128): 12.905 ns -- 2.240 ns

The majority of the cost is associated with the locked cmpxchg in the
MPMC variant. Bulking helps amortize this cost:

(bulking) cost per element comparing MPMC -> SPSC:
MPMC -- SPSC
bulk2 : 5.849 ns -- 1.748 ns
bulk3 : 4.102 ns -- 1.531 ns
bulk4 : 3.281 ns -- 1.383 ns
bulk6 : 2.530 ns -- 1.238 ns
bulk8 : 2.125 ns -- 1.196 ns
bulk16: 1.552 ns -- 1.109 ns

Joint work with Hannes Frederic Sowa.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Hannes Frederic Sowa <[email protected]>
----

Correctness of the memory barriers on different archs needs to be evaluated.

The API might need some adjustments/discussions regarding:

1) the semantics of enq/deq when doing bulking, choosing what
to do when e.g. a bulk enqueue cannot fit fully.

2) some better way to detect misuse of the API, e.g. calling a
single-enqueue variant function on a multi-enqueue variant data
structure. Currently no detection happens.
---

include/linux/alf_queue.h | 303 +++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 13 ++
lib/Makefile | 2
lib/alf_queue.c | 47 +++++++
4 files changed, 365 insertions(+), 0 deletions(-)
create mode 100644 include/linux/alf_queue.h
create mode 100644 lib/alf_queue.c


diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..fb1a774
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,303 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ * - Array based for cache-line optimization
+ * - Bounded by the array size
+ * - FIFO Producer/Consumer queue, no queue traversal supported
+ * - Very fast
+ * - Designed as a queue for pointers to objects
+ * - Bulk enqueue and dequeue support
+ * - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+ u32 head;
+ u32 tail;
+};
+
+struct alf_queue {
+ u32 size;
+ u32 mask;
+ u32 flags;
+ struct alf_actor producer ____cacheline_aligned_in_smp;
+ struct alf_actor consumer ____cacheline_aligned_in_smp;
+ void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ * 1. They can be reused for both "Single" and "Multi" variants
+ * 2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+static inline void
+__helper_alf_enqueue_store(u32 p_head, struct alf_queue *q,
+ void **ptr, const u32 n)
+{
+ int i, index = p_head;
+
+ for (i = 0; i < n; i++, index++)
+ q->ring[index & q->mask] = ptr[i];
+}
+
+static inline void
+__helper_alf_dequeue_load(u32 c_head, struct alf_queue *q,
+ void **ptr, const u32 elems)
+{
+ int i, index = c_head;
+
+ for (i = 0; i < elems; i++, index++)
+ ptr[i] = q->ring[index & q->mask];
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even though the current API has "fixed" semantics of aborting if it
+ * cannot enqueue the full bulk size, users of this API should check
+ * that the returned number of enqueued elements matches, to verify the
+ * enqueue was successful. This allows us to introduce a "variable"
+ * enqueue scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ do {
+ p_head = ACCESS_ONCE(q->producer.head);
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ }
+ while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Wait for other concurrent preceding enqueues not yet done,
+ * this part makes us non-wait-free and could be problematic
+ * in case of congestion with many CPUs
+ */
+ while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+ cpu_relax();
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ do {
+ c_head = ACCESS_ONCE(q->consumer.head);
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ }
+ while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+ /* LOAD the elems from the queue array.
+ * We don't need a smp_rmb() Read-Memory-Barrier here because
+ * the above cmpxchg is an implied full Memory-Barrier.
+ */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough. Paired with enqueue
+ * implicit full-MB in cmpxchg.
+ */
+ smp_wmb();
+
+ /* Wait for other concurrent preceding dequeues not yet done */
+ while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+ cpu_relax();
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x) \
+ do { \
+ if (unlikely(!(x))) { \
+ pr_crit("Assertion failed %s:%d: \"%s\"\n", \
+ __FILE__, __LINE__, #x); \
+ BUG(); \
+ } \
+ } while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 p_head, p_next, c_tail, space;
+
+ /* Reserve part of the array for enqueue STORE/WRITE */
+ p_head = q->producer.head;
+ smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+ c_tail = ACCESS_ONCE(q->consumer.tail);
+
+ space = q->size + c_tail - p_head;
+ if (n > space)
+ return 0;
+
+ p_next = p_head + n;
+ ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+ q->producer.head = p_next;
+
+ /* STORE the elems into the queue array */
+ __helper_alf_enqueue_store(p_head, q, ptr, n);
+ smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+ /* Mark this enq done and avail for consumption */
+ ACCESS_ONCE(q->producer.tail) = p_next;
+
+ return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ * caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+ struct alf_queue *q, void *ptr[n], const u32 n)
+{
+ u32 c_head, c_next, p_tail, elems;
+
+ /* Reserve part of the array for dequeue LOAD/READ */
+ c_head = q->consumer.head;
+ p_tail = ACCESS_ONCE(q->producer.tail);
+
+ elems = p_tail - c_head;
+
+ if (elems == 0)
+ return 0;
+ else
+ elems = min(elems, n);
+
+ c_next = c_head + elems;
+ ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+ q->consumer.head = c_next;
+
+ smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+ __helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+ /* Archs with weak Memory Ordering need a memory barrier here.
+ * As the STORE to q->consumer.tail, must happen after the
+ * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+ * ptr, thus a smp_wmb() is enough.
+ */
+ smp_wmb();
+
+ /* Assert no other CPU (or same CPU via preemption) changed queue */
+ ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+ /* Mark this deq done and avail for producers */
+ ACCESS_ONCE(q->consumer.tail) = c_next;
+
+ return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+ /* The empty (and initial) state is when the consumer has caught
+ * up with the producer.
+ *
+ * DOUBLE-CHECK: Should we use producer.head, as this indicates
+ * a producer is in-progress(?)
+ */
+ return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+ u32 c_head = ACCESS_ONCE(q->consumer.head);
+ u32 p_tail = ACCESS_ONCE(q->producer.tail);
+ u32 elems;
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ elems = p_tail - c_head;
+ /* Thus, same as:
+ * elems = (p_tail - c_head) & q->mask;
+ */
+ return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+ u32 p_head = ACCESS_ONCE(q->producer.head);
+ u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+ u32 space;
+
+ /* The max avail space is q->size and
+ * the empty state is when (consumer == producer)
+ */
+
+ /* Due to u32 arithmetic the values are implicitly
+ * masked/modulo 32-bit, thus saving one mask operation
+ */
+ space = q->size + c_tail - p_head;
+ /* Thus, same as:
+ * space = (q->size + c_tail - p_head) & q->mask;
+ */
+ return space;
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 54cf309..3c0cd58 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -439,6 +439,19 @@ config NLATTR
bool

#
+# ALF queue
+#
+config ALF_QUEUE
+ bool "ALF: Array-based Lock-Free (Producer-Consumer) queue"
+ default y
+ help
+ This Array-based Lock-Free (ALF) queue is a very fast
+ bounded Producer-Consumer queue, supporting bulking. The
+ MPMC (Multi-Producer/Multi-Consumer) variant uses a locked
+ cmpxchg, but the cost can be amortized by utilizing bulk
+ enqueue/dequeue.
+
+#
# Generic 64-bit atomic support is selected if needed
#
config GENERIC_ATOMIC64
diff --git a/lib/Makefile b/lib/Makefile
index 0211d2b..cd3a2d0 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -119,6 +119,8 @@ obj-$(CONFIG_DYNAMIC_DEBUG) += dynamic_debug.o

obj-$(CONFIG_NLATTR) += nlattr.o

+obj-$(CONFIG_ALF_QUEUE) += alf_queue.o
+
obj-$(CONFIG_LRU_CACHE) += lru_cache.o

obj-$(CONFIG_DMA_API_DEBUG) += dma-debug.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..d6c9b69
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,47 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ * - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ * by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ * for licensing details see kernel-base/COPYING
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h> /* kzalloc */
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+ struct alf_queue *q;
+ size_t mem_size;
+
+ if (!(is_power_of_2(size)) || size > 65536)
+ return ERR_PTR(-EINVAL);
+
+ /* The ring array is allocated together with the queue struct */
+ mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+ q = kzalloc(mem_size, gfp);
+ if (!q)
+ return ERR_PTR(-ENOMEM);
+
+ q->size = size;
+ q->mask = size - 1;
+
+ return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+ kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);
+
+MODULE_DESCRIPTION("ALF: Array-based Lock-Free queue");
+MODULE_AUTHOR("Jesper Dangaard Brouer <[email protected]>");
+MODULE_LICENSE("GPL");

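A minimal usage sketch of the queue API from the patch above (not part of
the posted patches). The single-producer enqueue name alf_sp_enqueue() is
an assumption here, mirroring the alf_sc_dequeue() shown in the header;
return values follow the header code (full count on success, 0 if the
queue has no room).

	/* Sketch only: bulk a few pointers through an alf_queue.
	 * The SP/SC variants require a single producer/consumer and
	 * disabled preemption, per the comments in the header.
	 */
	static int alf_queue_usage_sketch(void)
	{
		struct alf_queue *q;
		void *objs[3] = { (void *)1, (void *)2, (void *)3 };
		void *out[3];
		int n;

		q = alf_queue_alloc(64, GFP_KERNEL); /* size: power of 2, <= 65536 */
		if (IS_ERR(q))
			return PTR_ERR(q);

		n = alf_sp_enqueue(q, objs, 3); /* returns 3, or 0 if no space */
		n = alf_sc_dequeue(q, out, 3);  /* returns number of elems dequeued */

		alf_queue_free(q);
		return n == 3 ? 0 : -EINVAL;
	}
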
Subject: Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:

> That is very strange! I did notice that it was somehow delayed in
> showing up on gmane.org (http://thread.gmane.org/gmane.linux.network/342347/focus=126148)
> and didn't show up on netdev either...

It finally got through.

Subject: Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:

> One of the building blocks for achieving this speedup is a cmpxchg
> based Lock-Free queue that supports bulking, named alf_queue for
> Array-based Lock-Free queue. By bulking elements (pointers) from the
> queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> elements.

This is a bit of an issue since the design of the SLUB allocator is such
that you should pick up an object, apply some processing and then take the
next one. The fetching of an object warms up the first cacheline and this
is tied into the way free objects are linked in SLUB.

So a bulk fetch from SLUB will not be that effective and will cause the touching
of many cachelines if we are dealing with just a few objects. If we are
looking at whole slab pages with all objects then SLUB can be effective
since we do not have to build up the linked pointer structure in each
page. SLAB has a different architecture there and a bulk fetch there is
possible without touching objects even for small sets since the freelist
management is separate from the objects.

If you do this bulking, then you will later access cache cold objects?
Doesn't that negate the benefit that you gain? Or are these objects written
to by hardware and therefore by necessity cache cold?

We could provide a faster bulk alloc/free function.

int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
size_t objects, void **array)

and this could be optimized by each slab allocator to provide fast
population of objects in that array. We then assume that the number of
objects is in the hundreds or so, right?

The corresponding free function

void kmem_cache_free_array(struct kmem_cache *s,
size_t objects, void **array)


I think the queue management of the array can be improved by using a
similar technique as used in the SLUB allocator, using cmpxchg_local.
cmpxchg_local is much faster than a full cmpxchg and we are operating on
per cpu structures anyways. So the overhead could still be reduced.
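
As a rough illustration of how a driver could consume such an interface
(a sketch only; the two functions above are proposals in this thread, not
existing kernel API, and the int return of the alloc variant is assumed to
be the number of objects actually provided):

	/* Hypothetical RX refill: grab 64 buffers in one bulk call.
	 * 'rx_cache', 'ring' and rx_ring_give_buffer() are placeholder
	 * driver names.
	 */
	void *bufs[64];
	int i, got;

	got = kmem_cache_alloc_array(rx_cache, GFP_ATOMIC, 64, bufs);
	for (i = 0; i < got; i++)
		rx_ring_give_buffer(ring, bufs[i]);

	/* The TX completion path could likewise hand back a cleaned
	 * batch in one call:
	 *   kmem_cache_free_array(rx_cache, nr_cleaned, cleaned_ptrs);
	 */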

2014-12-11 10:19:58

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue)

On Wed, 10 Dec 2014 13:51:32 -0600 (CST)
Christoph Lameter <[email protected]> wrote:

> On Wed, 10 Dec 2014, Jesper Dangaard Brouer wrote:
>
> > One of the building blocks for achieving this speedup is a cmpxchg
> > based Lock-Free queue that supports bulking, named alf_queue for
> > Array-based Lock-Free queue. By bulking elements (pointers) from the
> > queue, the cost of the cmpxchg (approx 8 ns) is amortized over several
> > elements.
>
> This is a bit of an issue since the design of the SLUB allocator is such
> that you should pick up an object, apply some processing and then take the
> next one. The fetching of an object warms up the first cacheline and this
> is tied into the way free objects are linked in SLUB.
>
> So a bulk fetch from SLUB will not be that effective and will cause the touching
> of many cachelines if we are dealing with just a few objects. If we are
> looking at whole slab pages with all objects then SLUB can be effective
> since we do not have to build up the linked pointer structure in each
> page. SLAB has a different architecture there and a bulk fetch there is
> possible without touching objects even for small sets since the freelist
> management is separate from the objects.
>
> If you do this bulking, then you will later access cache cold objects?
> Doesn't that negate the benefit that you gain? Or are these objects written
> to by hardware and therefore by necessity cache cold?

Cache warmup is a concern, but perhaps it's the caller's responsibility
to prefetch for their use-case. For qmempool I do have patches that
prefetch elems when going from the sharedq to the localq (per CPU), but
I didn't see much gain, and I could prove my point (of being faster than
slab) without it. And I would use/need the slab bulk interface to add
elems to sharedq, which I consider semi-cache cold.
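
A rough sketch of that prefetch-on-refill idea (not the posted qmempool
code; the field names, the batch size and the single-producer enqueue name
alf_sp_enqueue() are assumptions):

	/* Bulk-move elements from the MPMC sharedq to this CPU's SPSC
	 * localq, prefetching each element on the way.
	 */
	struct alf_queue *localq = this_cpu_ptr(pool->percpu)->localq; /* assumed layout */
	void *elems[16];	/* assumed refill batch size */
	int i, num;

	num = alf_mc_dequeue(pool->sharedq, elems, 16); /* one cmpxchg for the whole batch */
	for (i = 0; i < num; i++)
		prefetchw(elems[i]);			/* warm the first cacheline */
	if (num)
		alf_sp_enqueue(localq, elems, num);	/* SPSC localq: no atomic op */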


> We could provide a faster bulk alloc/free function.
>
> int kmem_cache_alloc_array(struct kmem_cache *s, gfp_t flags,
> size_t objects, void **array)

I like it :-)

> and this could be optimized by each slab allocator to provide fast
> population of objects in that array. We then assume that the number of
> objects is in the hundreds or so, right?

I'm already seeing a benefit with 16-packet alloc/free "bulking".

On RX we have a "budget" of 64 packets/descriptors (taken from the NIC
RX ring) that need SKBs.

On TX, packets are put into the TX ring, and later at TX completion the
TX ring is cleaned up, as many as 256 at a time (e.g. in the ixgbe driver).

Scientific articles on userspace networking (like netmap) report that
they need at least 8-packet bulking to see 10G wirespeed at 64 bytes.


> The corresponding free function
>
> void kmem_cache_free_array(struct kmem_cache *s,
> size_t objects, void **array)
>
>
> I think the queue management of the array can be improved by using a
> similar technique as used in the SLUB allocator, using cmpxchg_local.
> cmpxchg_local is much faster than a full cmpxchg and we are operating on
> per cpu structures anyways. So the overhead could still be reduced.

I think you missed that the per-CPU localq is already not using cmpxchg
(it is an SPSC queue). The sharedq (MPMC queue) does need and use the
locked cmpxchg.
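
To make that split concrete, the alloc fast path could be sketched as
below (struct layout and the slow-path helper are assumptions, not the
posted code); only the slow path, which bulk-refills the localq from the
sharedq, pays for the locked cmpxchg:

	void *qmempool_alloc_sketch(struct qmempool *pool)
	{
		struct qmempool_percpu *cpu = this_cpu_ptr(pool->percpu);
		void *elem;

		/* Per-CPU SPSC localq: no locked cmpxchg on the fast path */
		if (alf_sc_dequeue(cpu->localq, &elem, 1) == 1)
			return elem;

		/* Hypothetical slow path: bulk refill localq from the MPMC
		 * sharedq via alf_mc_dequeue() (see the earlier sketch),
		 * then retry the local dequeue.
		 */
		return qmempool_alloc_slow(pool);
	}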

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer

2014-12-11 19:15:48

by David Miller

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3] lib: adding an Array-based Lock-Free (ALF) queue

From: Jesper Dangaard Brouer <[email protected]>
Date: Wed, 10 Dec 2014 15:15:26 +0100

> +static inline int
> +alf_mp_enqueue(const u32 n;
> + struct alf_queue *q, void *ptr[n], const u32 n)
> +{
...
> +/* Main Multi-Consumer DEQUEUE */
> +static inline int
> +alf_mc_dequeue(const u32 n;
> + struct alf_queue *q, void *ptr[n], const u32 n)
> +{

I would seriously consider not inlining these.
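
One way to follow that suggestion (sketched here, not part of the posted
patches) would be to keep double-underscore inline helpers in the header
and export thin out-of-line wrappers from lib/alf_queue.c:

	/* lib/alf_queue.c: out-of-line wrappers; the __-prefixed inline
	 * names are assumptions about how the header would be reworked.
	 */
	int alf_mp_enqueue(struct alf_queue *q, void **ptr, const u32 n)
	{
		return __alf_mp_enqueue(q, ptr, n);
	}
	EXPORT_SYMBOL_GPL(alf_mp_enqueue);

	int alf_mc_dequeue(struct alf_queue *q, void **ptr, const u32 n)
	{
		return __alf_mc_dequeue(q, ptr, n);
	}
	EXPORT_SYMBOL_GPL(alf_mc_dequeue);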