2023-09-25 09:56:27

by Masami Hiramatsu

Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

Hi Wuqiang,

On Tue, 5 Sep 2023 09:52:51 +0800
"wuqiang.matt" <[email protected]> wrote:

> The object pool is a scalable implementation of a high-performance queue
> for object allocation and reclamation, such as kretprobe instances.
>
> By leveraging a percpu ring-array to mitigate the hot spots of memory
> contention, it can deliver near-linear scalability for highly parallel
> scenarios. The objpool is best suited for the following cases:
> 1) Memory allocation or reclamation is prohibited or too expensive
> 2) Consumers are of different priorities, such as irqs and threads
>
> Limitations:
> 1) The maximum number of objects (capacity) is determined at pool
> initialization and can't be modified (extended) after objpool creation
> 2) The memory of objects won't be freed until the objpool is finalized
> 3) Object allocation (pop) may fail after trying all cpu slots
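
(Side note: typical usage of this API looks roughly like the sketch below.
It is only illustrative -- the my_* names are made up -- but the objpool_*
calls follow the prototypes in the patch further down; error handling is
mostly elided.)

#include <linux/objpool.h>

struct my_item {
	int id;
	/* ... payload ... */
};

static struct objpool_head my_pool;

/* invoked once for each pre-allocated object at pool-init time */
static int my_objinit(void *obj, void *context)
{
	struct my_item *item = obj;

	item->id = 0;
	return 0;
}

static int my_setup(void)
{
	/* pre-allocate 128 objects of sizeof(struct my_item) */
	return objpool_init(&my_pool, 128, sizeof(struct my_item),
			    GFP_KERNEL, NULL, my_objinit, NULL);
}

static void my_use(void)
{
	struct my_item *item;

	/* pop may return NULL once all cpu slots are empty */
	item = objpool_pop(&my_pool);
	if (!item)
		return;
	/* ... use the object, possibly in irq context ... */
	objpool_push(item, &my_pool);	/* reclaim it to the pool */
}

static void my_cleanup(void)
{
	objpool_fini(&my_pool);	/* drop unused objects, deref the pool */
}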

I made a simplifying patch for this by (mainly) removing the ages array.
I also renamed local variables to use more readable names, like slot,
pool, and obj.

Here are the results from running test_objpool.ko.

Original:
[ 50.500235] Summary of testcases:
[ 50.503139] duration: 1027135us hits: 30628293 miss: 0 sync: percpu objpool
[ 50.510416] duration: 1047977us hits: 30453023 miss: 0 sync: percpu objpool from vmalloc
[ 50.517421] duration: 1047098us hits: 31145034 miss: 0 sync & hrtimer: percpu objpool
[ 50.524671] duration: 1053233us hits: 30919640 miss: 0 sync & hrtimer: percpu objpool from vmalloc
[ 50.531382] duration: 1055822us hits: 3407221 miss: 830219 sync overrun: percpu objpool
[ 50.538135] duration: 1055998us hits: 3404624 miss: 854160 sync overrun: percpu objpool from vmalloc
[ 50.546686] duration: 1046104us hits: 19464798 miss: 0 async: percpu objpool
[ 50.552989] duration: 1033054us hits: 18957836 miss: 0 async: percpu objpool from vmalloc
[ 50.560289] duration: 1041778us hits: 33806868 miss: 0 async & hrtimer: percpu objpool
[ 50.567425] duration: 1048901us hits: 34211860 miss: 0 async & hrtimer: percpu objpool from vmalloc

Simplified:
[ 48.393236] Summary of testcases:
[ 48.395384] duration: 1013002us hits: 29661448 miss: 0 sync: percpu objpool
[ 48.400351] duration: 1057231us hits: 31035578 miss: 0 sync: percpu objpool from vmalloc
[ 48.405660] duration: 1043196us hits: 30546652 miss: 0 sync & hrtimer: percpu objpool
[ 48.411216] duration: 1047128us hits: 30601306 miss: 0 sync & hrtimer: percpu objpool from vmalloc
[ 48.417231] duration: 1051695us hits: 3468287 miss: 892881 sync overrun: percpu objpool
[ 48.422405] duration: 1054003us hits: 3549491 miss: 898141 sync overrun: percpu objpool from vmalloc
[ 48.428425] duration: 1052946us hits: 19005228 miss: 0 async: percpu objpool
[ 48.433597] duration: 1051757us hits: 19670866 miss: 0 async: percpu objpool from vmalloc
[ 48.439280] duration: 1042951us hits: 37135332 miss: 0 async & hrtimer: percpu objpool
[ 48.445085] duration: 1029803us hits: 37093248 miss: 0 async & hrtimer: percpu objpool from vmalloc

Can you test it too?

Thanks,
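
For clarity, the invariant the patch below relies on can be spelled out
with a small worked example (my sketch, not part of the patch):

	/*
	 * nr_objs  = 100;
	 * capacity = roundup_pow_of_two(nr_objs + 1);	// -> 128
	 * mask     = capacity - 1;			// -> 127
	 *
	 * Only nr_objs objects exist in total, so for any slot at any
	 * point in time:
	 *
	 *	0 <= tail - head <= nr_objs < capacity
	 *
	 * The ring can never be completely full, (tail & mask) never
	 * aliases (head & mask) while tail != head, and head == tail is
	 * therefore a reliable "this slot is empty" test -- no ages[]
	 * needed to tell a full ring from an empty one.
	 */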

From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
From: "Masami Hiramatsu (Google)" <[email protected]>
Date: Mon, 25 Sep 2023 16:07:12 +0900
Subject: [PATCH] objpool: Simplify objpool by removing ages array

Simplify the objpool code by removing the ages array from the per-cpu
slot. Since a big enough capacity (nr_objects + 1 rounded up to a power
of 2) is chosen for the entries array, the tail can never catch up to
the head in a per-cpu slot. Thus tail == head means the slot is empty.

This also uses consistent local variable names for pool, slot and obj.

Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
---
include/linux/objpool.h | 61 ++++----
lib/objpool.c | 310 ++++++++++++++++------------------------
2 files changed, 147 insertions(+), 224 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index 33c832216b98..ecd5ecaffcd3 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -38,33 +38,23 @@
* struct objpool_slot - percpu ring array of objpool
* @head: head of the local ring array (to retrieve at)
* @tail: tail of the local ring array (to append at)
- * @bits: log2 of capacity (for bitwise operations)
- * @mask: capacity - 1
+ * @mask: capacity of entries - 1
+ * @entries: object entries on this slot.
*
* Represents a cpu-local array-based ring buffer, its size is specialized
* during initialization of object pool. The percpu objpool slot is to be
* allocated from local memory for NUMA system, and to be kept compact in
- * continuous memory: ages[] is stored just after the body of objpool_slot,
- * and then entries[]. ages[] describes revision of each item, solely used
- * to avoid ABA; entries[] contains pointers of the actual objects
- *
- * Layout of objpool_slot in memory:
- *
- * 64bit:
- * 4 8 12 16 32 64
- * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
- *
- * 32bit:
- * 4 8 12 16 32 48 64
- * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
+ * continuous memory: this CPU's share of the pre-allocated objects is
+ * stored just after the body of objpool_slot.
*
*/
struct objpool_slot {
- uint32_t head;
- uint32_t tail;
- uint32_t bits;
- uint32_t mask;
-} __packed;
+ uint32_t head;
+ uint32_t tail;
+ uint32_t mask;
+ uint32_t dummyr;
+ void * entries[];
+};

struct objpool_head;

@@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
* @obj_size: object & element size
* @nr_objs: total objs (to be pre-allocated)
* @nr_cpus: nr_cpu_ids
- * @capacity: max objects per cpuslot
+ * @capacity: max objects on percpu slot
* @gfp: gfp flags for kmalloc & vmalloc
* @ref: refcount for objpool
* @flags: flags for objpool management
* @cpu_slots: array of percpu slots
- * @slot_sizes: size in bytes of slots
* @release: resource cleanup callback
* @context: caller-provided context
*/
@@ -100,7 +89,6 @@ struct objpool_head {
refcount_t ref;
unsigned long flags;
struct objpool_slot **cpu_slots;
- int *slot_sizes;
objpool_fini_cb release;
void *context;
};
@@ -108,9 +96,12 @@ struct objpool_head {
#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */

+#define OBJPOOL_NR_OBJECT_MAX (1 << 24)
+#define OBJPOOL_OBJECT_SIZE_MAX (1 << 16)
+
/**
* objpool_init() - initialize objpool and pre-allocated objects
- * @head: the object pool to be initialized, declared by caller
+ * @pool: the object pool to be initialized, declared by caller
* @nr_objs: total objects to be pre-allocated by this object pool
* @object_size: size of an object (should be > 0)
* @gfp: flags for memory allocation (via kmalloc or vmalloc)
@@ -128,47 +119,47 @@ struct objpool_head {
* pop (object allocation) or do clearance before each push (object
* reclamation).
*/
-int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
gfp_t gfp, void *context, objpool_init_obj_cb objinit,
objpool_fini_cb release);

/**
* objpool_pop() - allocate an object from objpool
- * @head: object pool
+ * @pool: object pool
*
* return value: object ptr or NULL if failed
*/
-void *objpool_pop(struct objpool_head *head);
+void *objpool_pop(struct objpool_head *pool);

/**
* objpool_push() - reclaim the object and return back to objpool
* @obj: object ptr to be pushed to objpool
- * @head: object pool
+ * @pool: object pool
*
* return: 0 or error code (it fails only when user tries to push
* the same object multiple times or wrong "objects" into objpool)
*/
-int objpool_push(void *obj, struct objpool_head *head);
+int objpool_push(void *obj, struct objpool_head *pool);

/**
* objpool_drop() - discard the object and deref objpool
* @obj: object ptr to be discarded
- * @head: object pool
+ * @pool: object pool
*
* return: 0 if objpool was released or error code
*/
-int objpool_drop(void *obj, struct objpool_head *head);
+int objpool_drop(void *obj, struct objpool_head *pool);

/**
* objpool_free() - release objpool forcely (all objects to be freed)
- * @head: object pool to be released
+ * @pool: object pool to be released
*/
-void objpool_free(struct objpool_head *head);
+void objpool_free(struct objpool_head *pool);

/**
* objpool_fini() - deref object pool (also releasing unused objects)
- * @head: object pool to be dereferenced
+ * @pool: object pool to be dereferenced
*/
-void objpool_fini(struct objpool_head *head);
+void objpool_fini(struct objpool_head *pool);

#endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/objpool.c b/lib/objpool.c
index 22e752371820..f8e8f70d7253 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -15,104 +15,55 @@
* Copyright: [email protected]
*/

-#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
-#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
- (sizeof(uint32_t) << (s)->bits)))
-#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
- ((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
-#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
-
-/* compute the suitable num of objects to be managed per slot */
-static int objpool_nobjs(int size)
-{
- return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
- (sizeof(uint32_t) + sizeof(void *)));
-}
-
/* allocate and initialize percpu slots */
static int
-objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
- void *context, objpool_init_obj_cb objinit)
+objpool_init_percpu_slots(struct objpool_head *pool, void *context,
+ objpool_init_obj_cb objinit)
{
- int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
-
- /* aligned object size by sizeof(void *) */
- objsz = ALIGN(head->obj_size, sizeof(void *));
- /* shall we allocate objects along with percpu-slot */
- if (objsz)
- head->flags |= OBJPOOL_HAVE_OBJECTS;
-
- /* vmalloc is used in default to allocate percpu-slots */
- if (!(head->gfp & GFP_ATOMIC))
- head->flags |= OBJPOOL_FROM_VMALLOC;
-
- for (i = 0; i < head->nr_cpus; i++) {
- struct objpool_slot *os;
+ int i, j, n, size, slot_size, cpu_count = 0;
+ struct objpool_slot *slot;

+ for (i = 0; i < pool->nr_cpus; i++) {
/* skip the cpus which could never be present */
if (!cpu_possible(i))
continue;

/* compute how many objects to be managed by this slot */
- n = nobjs / num_possible_cpus();
- if (cpu < (nobjs % num_possible_cpus()))
+ n = pool->nr_objs / num_possible_cpus();
+ if (cpu_count < (pool->nr_objs % num_possible_cpus()))
n++;
- size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
- sizeof(uint32_t) * nents + objsz * n;
+ cpu_count++;
+
+ slot_size = struct_size(slot, entries, pool->capacity);
+ size = slot_size + pool->obj_size * n;

/*
* here we allocate percpu-slot & objects together in a single
- * allocation, taking advantage of warm caches and TLB hits as
- * vmalloc always aligns the request size to pages
+ * allocation, taking advantage of NUMA locality.
*/
- if (head->flags & OBJPOOL_FROM_VMALLOC)
- os = __vmalloc_node(size, sizeof(void *), head->gfp,
+ if (pool->flags & OBJPOOL_FROM_VMALLOC)
+ slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
cpu_to_node(i), __builtin_return_address(0));
else
- os = kmalloc_node(size, head->gfp, cpu_to_node(i));
- if (!os)
+ slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
+ if (!slot)
return -ENOMEM;

/* initialize percpu slot for the i-th slot */
- memset(os, 0, size);
- os->bits = ilog2(head->capacity);
- os->mask = head->capacity - 1;
- head->cpu_slots[i] = os;
- head->slot_sizes[i] = size;
- cpu = cpu + 1;
-
- /*
- * manually set head & tail to avoid possible conflict:
- * We assume that the head item is ready for retrieval
- * iff head is equal to ages[head & mask]. but ages is
- * initialized as 0, so in view of the caller of pop(),
- * the 1st item (0th) is always ready, but the reality
- * could be: push() is stalled before the final update,
- * thus the item being inserted will be lost forever
- */
- os->head = os->tail = head->capacity;
-
- if (!objsz)
- continue;
+ memset(slot, 0, size);
+ slot->mask = pool->capacity - 1;
+ pool->cpu_slots[i] = slot;

for (j = 0; j < n; j++) {
- uint32_t *ages = SLOT_AGES(os);
- void **ents = SLOT_ENTS(os);
- void *obj = SLOT_OBJS(os) + j * objsz;
- uint32_t ie = os->tail & os->mask;
+ void *obj = (void *)slot + slot_size + pool->obj_size * j;

- /* perform object initialization */
if (objinit) {
int rc = objinit(obj, context);
if (rc)
return rc;
}
-
- /* add obj into the ring array */
- ents[ie] = obj;
- ages[ie] = os->tail;
- os->tail++;
- head->nr_objs++;
+ slot->entries[j] = obj;
+ slot->tail++;
}
}

@@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
}

/* cleanup all percpu slots of the object pool */
-static void objpool_fini_percpu_slots(struct objpool_head *head)
+static void objpool_fini_percpu_slots(struct objpool_head *pool)
{
int i;

- if (!head->cpu_slots)
+ if (!pool->cpu_slots)
return;

- for (i = 0; i < head->nr_cpus; i++) {
- if (!head->cpu_slots[i])
- continue;
- if (head->flags & OBJPOOL_FROM_VMALLOC)
- vfree(head->cpu_slots[i]);
- else
- kfree(head->cpu_slots[i]);
- }
- kfree(head->cpu_slots);
- head->cpu_slots = NULL;
- head->slot_sizes = NULL;
+ for (i = 0; i < pool->nr_cpus; i++)
+ kvfree(pool->cpu_slots[i]);
+ kfree(pool->cpu_slots);
}

/* initialize object pool and pre-allocate objects */
-int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
gfp_t gfp, void *context, objpool_init_obj_cb objinit,
objpool_fini_cb release)
{
int nents, rc;

/* check input parameters */
- if (nr_objs <= 0 || object_size <= 0)
+ if (nr_objs <= 0 || object_size <= 0 ||
+ nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
+ return -EINVAL;
+
+ /* Align up to unsigned long size */
+ object_size = ALIGN(object_size, sizeof(unsigned long));
+
+ /*
+ * To avoid filling up the entries array in the per-cpu slot,
+ * use a power of 2 which is no less than nr_objs + 1. Thus, the tail
+ * can never catch up to the head, and head/tail can be used as the
+ * sequential number.
+ */
+ nents = roundup_pow_of_two(nr_objs + 1);
+ if (!nents)
return -EINVAL;

- /* calculate percpu slot size (rounded to pow of 2) */
- nents = max_t(int, roundup_pow_of_two(nr_objs),
- objpool_nobjs(L1_CACHE_BYTES));
-
- /* initialize objpool head */
- memset(head, 0, sizeof(struct objpool_head));
- head->nr_cpus = nr_cpu_ids;
- head->obj_size = object_size;
- head->capacity = nents;
- head->gfp = gfp & ~__GFP_ZERO;
- head->context = context;
- head->release = release;
-
- /* allocate array for percpu slots */
- head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
- head->nr_cpus * sizeof(int), head->gfp);
- if (!head->cpu_slots)
+ /* initialize objpool head */
+ memset(pool, 0, sizeof(struct objpool_head));
+ pool->nr_cpus = nr_cpu_ids;
+ pool->obj_size = object_size;
+ pool->nr_objs = nr_objs;
+ /* capacity > nr_objs, so the per-cpu ring array can never fill up */
+ pool->capacity = nents;
+ pool->gfp = gfp & ~__GFP_ZERO;
+ pool->context = context;
+ pool->release = release;
+ /* vmalloc is used by default to allocate percpu-slots */
+ if (!(pool->gfp & GFP_ATOMIC))
+ pool->flags |= OBJPOOL_FROM_VMALLOC;
+
+ pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
+ if (!pool->cpu_slots)
return -ENOMEM;
- head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];

/* initialize per-cpu slots */
- rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
+ rc = objpool_init_percpu_slots(pool, context, objinit);
if (rc)
- objpool_fini_percpu_slots(head);
+ objpool_fini_percpu_slots(pool);
else
- refcount_set(&head->ref, nr_objs + 1);
+ refcount_set(&pool->ref, nr_objs + 1);

return rc;
}
EXPORT_SYMBOL_GPL(objpool_init);

/* adding object to slot, abort if the slot was already full */
-static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
+static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
{
- uint32_t *ages = SLOT_AGES(os);
- void **ents = SLOT_ENTS(os);
- uint32_t head, tail;
+ struct objpool_slot *slot = pool->cpu_slots[cpu];
+ uint32_t tail, next;

do {
- /* perform memory loading for both head and tail */
- head = READ_ONCE(os->head);
- tail = READ_ONCE(os->tail);
- /* just abort if slot is full */
- if (tail - head > os->mask)
- return -ENOENT;
- /* try to extend tail by 1 using CAS to avoid races */
- if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
- break;
- } while (1);
+ uint32_t head = READ_ONCE(slot->head);

- /* the tail-th of slot is reserved for the given obj */
- WRITE_ONCE(ents[tail & os->mask], obj);
- /* update epoch id to make this object available for pop() */
- smp_store_release(&ages[tail & os->mask], tail);
+ tail = READ_ONCE(slot->tail);
+ next = tail + 1;
+
+ /* This must never happen because capacity >= N + 1 */
+ if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
+ (next < head && next > head + pool->nr_objs)))
+ return -EINVAL;
+
+ } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
+
+ WRITE_ONCE(slot->entries[tail & slot->mask], obj);
return 0;
}

/* reclaim an object to object pool */
-int objpool_push(void *obj, struct objpool_head *oh)
+int objpool_push(void *obj, struct objpool_head *pool)
{
unsigned long flags;
- int cpu, rc;
+ int rc;

/* disable local irq to avoid preemption & interruption */
raw_local_irq_save(flags);
- cpu = raw_smp_processor_id();
- do {
- rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
- if (!rc)
- break;
- cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
- } while (1);
+ rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
raw_local_irq_restore(flags);

return rc;
}
EXPORT_SYMBOL_GPL(objpool_push);

-/* drop the allocated object, rather reclaim it to objpool */
-int objpool_drop(void *obj, struct objpool_head *head)
-{
- if (!obj || !head)
- return -EINVAL;
-
- if (refcount_dec_and_test(&head->ref)) {
- objpool_free(head);
- return 0;
- }
-
- return -EAGAIN;
-}
-EXPORT_SYMBOL_GPL(objpool_drop);
-
/* try to retrieve object from slot */
-static inline void *objpool_try_get_slot(struct objpool_slot *os)
+static inline void *objpool_try_get_slot(struct objpool_slot *slot)
{
- uint32_t *ages = SLOT_AGES(os);
- void **ents = SLOT_ENTS(os);
/* do memory load of head to local head */
- uint32_t head = smp_load_acquire(&os->head);
+ uint32_t head = smp_load_acquire(&slot->head);

/* loop if slot isn't empty */
- while (head != READ_ONCE(os->tail)) {
- uint32_t id = head & os->mask, prev = head;
+ while (head != READ_ONCE(slot->tail)) {

/* do prefetching of object ents */
- prefetch(&ents[id]);
-
- /* check whether this item was ready for retrieval */
- if (smp_load_acquire(&ages[id]) == head) {
- /* node must have been udpated by push() */
- void *node = READ_ONCE(ents[id]);
- /* commit and move forward head of the slot */
- if (try_cmpxchg_release(&os->head, &head, head + 1))
- return node;
- /* head was already updated by others */
- }
+ prefetch(&slot->entries[head & slot->mask]);
+
+ /* commit and move forward head of the slot */
+ if (try_cmpxchg_release(&slot->head, &head, head + 1))
+ /*
+ * TBD: check wraparound of the tail/head counters and warn
+ * if broken. But this can only happen if this process
+ * slows down a lot and another CPU updates the head/tail
+ * just 2^32 + 1 times in the meantime, while this slot
+ * is empty.
+ */
+ return slot->entries[head & slot->mask];

/* re-load head from memory and continue trying */
- head = READ_ONCE(os->head);
- /*
- * head stays unchanged, so it's very likely there's an
- * ongoing push() on other cpu nodes but yet not update
- * ages[] to mark it's completion
- */
- if (head == prev)
- break;
+ head = READ_ONCE(slot->head);
}

return NULL;
@@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
EXPORT_SYMBOL_GPL(objpool_pop);

/* release whole objpool forcely */
-void objpool_free(struct objpool_head *head)
+void objpool_free(struct objpool_head *pool)
{
- if (!head->cpu_slots)
+ if (!pool->cpu_slots)
return;

/* release percpu slots */
- objpool_fini_percpu_slots(head);
+ objpool_fini_percpu_slots(pool);

/* call user's cleanup callback if provided */
- if (head->release)
- head->release(head, head->context);
+ if (pool->release)
+ pool->release(pool, pool->context);
}
EXPORT_SYMBOL_GPL(objpool_free);

-/* drop unused objects and defref objpool for releasing */
-void objpool_fini(struct objpool_head *head)
+/* drop the allocated object, rather than reclaiming it to the objpool */
+int objpool_drop(void *obj, struct objpool_head *pool)
{
- void *nod;
+ if (!obj || !pool)
+ return -EINVAL;

- do {
- /* grab object from objpool and drop it */
- nod = objpool_pop(head);
+ if (refcount_dec_and_test(&pool->ref)) {
+ objpool_free(pool);
+ return 0;
+ }
+
+ return -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(objpool_drop);
+
+/* drop unused objects and deref objpool for releasing */
+void objpool_fini(struct objpool_head *pool)
+{
+ void *obj;

- /* drop the extra ref of objpool */
- if (refcount_dec_and_test(&head->ref))
- objpool_free(head);
- } while (nod);
+ /* grab object from objpool and drop it */
+ while ((obj = objpool_pop(pool)))
+ objpool_drop(obj, pool);
}
EXPORT_SYMBOL_GPL(objpool_fini);
--
2.34.1


--
Masami Hiramatsu (Google) <[email protected]>


2023-10-08 19:05:25

by wuqiang.matt

Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
>
> On Tue, 5 Sep 2023 09:52:51 +0800
> "wuqiang.matt" <[email protected]> wrote:
>
>> The object pool is a scalable implementation of a high-performance queue
>> for object allocation and reclamation, such as kretprobe instances.
>>
>> By leveraging a percpu ring-array to mitigate the hot spots of memory
>> contention, it can deliver near-linear scalability for highly parallel
>> scenarios. The objpool is best suited for the following cases:
>> 1) Memory allocation or reclamation is prohibited or too expensive
>> 2) Consumers are of different priorities, such as irqs and threads
>>
>> Limitations:
>> 1) The maximum number of objects (capacity) is determined at pool
>> initialization and can't be modified (extended) after objpool creation
>> 2) The memory of objects won't be freed until the objpool is finalized
>> 3) Object allocation (pop) may fail after trying all cpu slots
>
> I made a simplifying patch for this by (mainly) removing the ages array.
> I also renamed local variables to use more readable names, like slot,
> pool, and obj.
>
> Here are the results from running test_objpool.ko.
>
> Original:
> [ 50.500235] Summary of testcases:
> [ 50.503139] duration: 1027135us hits: 30628293 miss: 0 sync: percpu objpool
> [ 50.510416] duration: 1047977us hits: 30453023 miss: 0 sync: percpu objpool from vmalloc
> [ 50.517421] duration: 1047098us hits: 31145034 miss: 0 sync & hrtimer: percpu objpool
> [ 50.524671] duration: 1053233us hits: 30919640 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> [ 50.531382] duration: 1055822us hits: 3407221 miss: 830219 sync overrun: percpu objpool
> [ 50.538135] duration: 1055998us hits: 3404624 miss: 854160 sync overrun: percpu objpool from vmalloc
> [ 50.546686] duration: 1046104us hits: 19464798 miss: 0 async: percpu objpool
> [ 50.552989] duration: 1033054us hits: 18957836 miss: 0 async: percpu objpool from vmalloc
> [ 50.560289] duration: 1041778us hits: 33806868 miss: 0 async & hrtimer: percpu objpool
> [ 50.567425] duration: 1048901us hits: 34211860 miss: 0 async & hrtimer: percpu objpool from vmalloc
>
> Simplified:
> [ 48.393236] Summary of testcases:
> [ 48.395384] duration: 1013002us hits: 29661448 miss: 0 sync: percpu objpool
> [ 48.400351] duration: 1057231us hits: 31035578 miss: 0 sync: percpu objpool from vmalloc
> [ 48.405660] duration: 1043196us hits: 30546652 miss: 0 sync & hrtimer: percpu objpool
> [ 48.411216] duration: 1047128us hits: 30601306 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> [ 48.417231] duration: 1051695us hits: 3468287 miss: 892881 sync overrun: percpu objpool
> [ 48.422405] duration: 1054003us hits: 3549491 miss: 898141 sync overrun: percpu objpool from vmalloc
> [ 48.428425] duration: 1052946us hits: 19005228 miss: 0 async: percpu objpool
> [ 48.433597] duration: 1051757us hits: 19670866 miss: 0 async: percpu objpool from vmalloc
> [ 48.439280] duration: 1042951us hits: 37135332 miss: 0 async & hrtimer: percpu objpool
> [ 48.445085] duration: 1029803us hits: 37093248 miss: 0 async & hrtimer: percpu objpool from vmalloc
>
> Can you test it too?

Sure, I'll do some testing locally, and also do a performance comparison
with 'entries' managed in objpool_head.

>
> Thanks,

Thank you for your time, I appreciate it.

> [... the rest of the quoted patch trimmed; see the message above ...]

2023-10-09 09:27:13

by wuqiang.matt

Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

Hello Masami,

I just got time to check the new patch and noticed that ages[] was removed.
ages[] was introduced as a 2-phase-commit-like mechanism to keep consistency,
and must be kept.

Consider the following 2 cases, where two cpu nodes operate on the same
objpool_slot simultaneously:

Case 1:

NODE 1:                                         NODE 2:
push to an empty slot                           pop will get wrong value

try_cmpxchg_acquire(&slot->tail, &tail, next)
                                                try_cmpxchg_release(&slot->head, &head, head + 1)
                                                return slot->entries[head & slot->mask]
WRITE_ONCE(slot->entries[tail & slot->mask], obj)


Case 2:

NODE 1:                                         NODE 2:
push to slot w/ 1 obj                           pop will get wrong value

                                                try_cmpxchg_release(&slot->head, &head, head + 1)
try_cmpxchg_acquire(&slot->tail, &tail, next)
WRITE_ONCE(slot->entries[tail & slot->mask], obj)
                                                return slot->entries[head & slot->mask]
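
To make this concrete, the removed code closed exactly this window with a
2-phase publish. Condensed from the code that your patch deletes (ages and
ents were located via the old SLOT_AGES/SLOT_ENTS macros; here they are
passed in for brevity):

	/* push() side: 2-phase publish */
	static int push_slot(void *obj, struct objpool_slot *os,
			     uint32_t *ages, void **ents)
	{
		uint32_t head, tail;

		do {	/* phase 1: reserve the tail-th entry */
			head = READ_ONCE(os->head);
			tail = READ_ONCE(os->tail);
			if (tail - head > os->mask)
				return -ENOENT;	/* slot full, abort */
		} while (!try_cmpxchg_acquire(&os->tail, &tail, tail + 1));

		WRITE_ONCE(ents[tail & os->mask], obj);
		/* phase 2: commit -- only now may pop() consume the entry */
		smp_store_release(&ages[tail & os->mask], tail);
		return 0;
	}

	/* pop() side: consume only when the age stamp matches head */
	static void *pop_slot(struct objpool_slot *os,
			      uint32_t *ages, void **ents)
	{
		uint32_t head = smp_load_acquire(&os->head);

		while (head != READ_ONCE(os->tail)) {
			uint32_t id = head & os->mask, prev = head;

			if (smp_load_acquire(&ages[id]) == head) {
				void *node = READ_ONCE(ents[id]);

				if (try_cmpxchg_release(&os->head, &head, head + 1))
					return node;
			}
			head = READ_ONCE(os->head);
			if (head == prev)	/* in-flight push not committed */
				break;
		}
		return NULL;
	}

Without ages[], the simplified pop() has no way to tell that an in-flight
push() has not yet performed its WRITE_ONCE(), so the load from entries[]
is unordered against it and can return a stale pointer, as in the two
cases above.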


Regards,
wuqiang

On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
>
> On Tue, 5 Sep 2023 09:52:51 +0800
> "wuqiang.matt" <[email protected]> wrote:
>
>> The object pool is a scalable implementaion of high performance queue
>> for object allocation and reclamation, such as kretprobe instances.
>>
>> With leveraging percpu ring-array to mitigate the hot spot of memory
>> contention, it could deliver near-linear scalability for high parallel
>> scenarios. The objpool is best suited for following cases:
>> 1) Memory allocation or reclamation are prohibited or too expensive
>> 2) Consumers are of different priorities, such as irqs and threads
>>
>> Limitations:
>> 1) Maximum objects (capacity) is determined during pool initializing
>> and can't be modified (extended) after objpool creation
>> 2) The memory of objects won't be freed until objpool is finalized
>> 3) Object allocation (pop) may fail after trying all cpu slots
>
> I made a simplifying patch on this by (mainly) removing ages array.
> I also rename local variable to use more readable names, like slot,
> pool, and obj.
>
> Here the results which I run the test_objpool.ko.
>
> Original:
> [ 50.500235] Summary of testcases:
> [ 50.503139] duration: 1027135us hits: 30628293 miss: 0 sync: percpu objpool
> [ 50.510416] duration: 1047977us hits: 30453023 miss: 0 sync: percpu objpool from vmalloc
> [ 50.517421] duration: 1047098us hits: 31145034 miss: 0 sync & hrtimer: percpu objpool
> [ 50.524671] duration: 1053233us hits: 30919640 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> [ 50.531382] duration: 1055822us hits: 3407221 miss: 830219 sync overrun: percpu objpool
> [ 50.538135] duration: 1055998us hits: 3404624 miss: 854160 sync overrun: percpu objpool from vmalloc
> [ 50.546686] duration: 1046104us hits: 19464798 miss: 0 async: percpu objpool
> [ 50.552989] duration: 1033054us hits: 18957836 miss: 0 async: percpu objpool from vmalloc
> [ 50.560289] duration: 1041778us hits: 33806868 miss: 0 async & hrtimer: percpu objpool
> [ 50.567425] duration: 1048901us hits: 34211860 miss: 0 async & hrtimer: percpu objpool from vmalloc
>
> Simplified:
> [ 48.393236] Summary of testcases:
> [ 48.395384] duration: 1013002us hits: 29661448 miss: 0 sync: percpu objpool
> [ 48.400351] duration: 1057231us hits: 31035578 miss: 0 sync: percpu objpool from vmalloc
> [ 48.405660] duration: 1043196us hits: 30546652 miss: 0 sync & hrtimer: percpu objpool
> [ 48.411216] duration: 1047128us hits: 30601306 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> [ 48.417231] duration: 1051695us hits: 3468287 miss: 892881 sync overrun: percpu objpool
> [ 48.422405] duration: 1054003us hits: 3549491 miss: 898141 sync overrun: percpu objpool from vmalloc
> [ 48.428425] duration: 1052946us hits: 19005228 miss: 0 async: percpu objpool
> [ 48.433597] duration: 1051757us hits: 19670866 miss: 0 async: percpu objpool from vmalloc
> [ 48.439280] duration: 1042951us hits: 37135332 miss: 0 async & hrtimer: percpu objpool
> [ 48.445085] duration: 1029803us hits: 37093248 miss: 0 async & hrtimer: percpu objpool from vmalloc
>
> Can you test it too?
>
> Thanks,
>
> From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> From: "Masami Hiramatsu (Google)" <[email protected]>
> Date: Mon, 25 Sep 2023 16:07:12 +0900
> Subject: [PATCH] objpool: Simplify objpool by removing ages array
>
> Simplify the objpool code by removing ages array from per-cpu slot.
> It chooses enough big capacity (which is a rounded up power of 2 value
> of nr_objects + 1) for the entries array, the tail never catch up to
> the head in per-cpu slot. Thus tail == head means the slot is empty.
>
> This also uses consistent local variable names for pool, slot and obj.
>
> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
> ---
> include/linux/objpool.h | 61 ++++----
> lib/objpool.c | 310 ++++++++++++++++------------------------
> 2 files changed, 147 insertions(+), 224 deletions(-)
>
> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> index 33c832216b98..ecd5ecaffcd3 100644
> --- a/include/linux/objpool.h
> +++ b/include/linux/objpool.h
> @@ -38,33 +38,23 @@
> * struct objpool_slot - percpu ring array of objpool
> * @head: head of the local ring array (to retrieve at)
> * @tail: tail of the local ring array (to append at)
> - * @bits: log2 of capacity (for bitwise operations)
> - * @mask: capacity - 1
> + * @mask: capacity of entries - 1
> + * @entries: object entries on this slot.
> *
> * Represents a cpu-local array-based ring buffer, its size is specialized
> * during initialization of object pool. The percpu objpool slot is to be
> * allocated from local memory for NUMA system, and to be kept compact in
> - * continuous memory: ages[] is stored just after the body of objpool_slot,
> - * and then entries[]. ages[] describes revision of each item, solely used
> - * to avoid ABA; entries[] contains pointers of the actual objects
> - *
> - * Layout of objpool_slot in memory:
> - *
> - * 64bit:
> - * 4 8 12 16 32 64
> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> - *
> - * 32bit:
> - * 4 8 12 16 32 48 64
> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> + * continuous memory: CPU assigned number of objects are stored just after
> + * the body of objpool_slot.
> *
> */
> struct objpool_slot {
> - uint32_t head;
> - uint32_t tail;
> - uint32_t bits;
> - uint32_t mask;
> -} __packed;
> + uint32_t head;
> + uint32_t tail;
> + uint32_t mask;
> + uint32_t dummyr;
> + void * entries[];
> +};
>
> struct objpool_head;
>
> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> * @obj_size: object & element size
> * @nr_objs: total objs (to be pre-allocated)
> * @nr_cpus: nr_cpu_ids
> - * @capacity: max objects per cpuslot
> + * @capacity: max objects on percpu slot
> * @gfp: gfp flags for kmalloc & vmalloc
> * @ref: refcount for objpool
> * @flags: flags for objpool management
> * @cpu_slots: array of percpu slots
> - * @slot_sizes: size in bytes of slots
> * @release: resource cleanup callback
> * @context: caller-provided context
> */
> @@ -100,7 +89,6 @@ struct objpool_head {
> refcount_t ref;
> unsigned long flags;
> struct objpool_slot **cpu_slots;
> - int *slot_sizes;
> objpool_fini_cb release;
> void *context;
> };
> @@ -108,9 +96,12 @@ struct objpool_head {
> #define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
> #define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
>
> +#define OBJPOOL_NR_OBJECT_MAX (1 << 24)
> +#define OBJPOOL_OBJECT_SIZE_MAX (1 << 16)
> +
> /**
> * objpool_init() - initialize objpool and pre-allocated objects
> - * @head: the object pool to be initialized, declared by caller
> + * @pool: the object pool to be initialized, declared by caller
> * @nr_objs: total objects to be pre-allocated by this object pool
> * @object_size: size of an object (should be > 0)
> * @gfp: flags for memory allocation (via kmalloc or vmalloc)
> @@ -128,47 +119,47 @@ struct objpool_head {
> * pop (object allocation) or do clearance before each push (object
> * reclamation).
> */
> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> objpool_fini_cb release);
>
> /**
> * objpool_pop() - allocate an object from objpool
> - * @head: object pool
> + * @pool: object pool
> *
> * return value: object ptr or NULL if failed
> */
> -void *objpool_pop(struct objpool_head *head);
> +void *objpool_pop(struct objpool_head *pool);
>
> /**
> * objpool_push() - reclaim the object and return back to objpool
> * @obj: object ptr to be pushed to objpool
> - * @head: object pool
> + * @pool: object pool
> *
> * return: 0 or error code (it fails only when user tries to push
> * the same object multiple times or wrong "objects" into objpool)
> */
> -int objpool_push(void *obj, struct objpool_head *head);
> +int objpool_push(void *obj, struct objpool_head *pool);
>
> /**
> * objpool_drop() - discard the object and deref objpool
> * @obj: object ptr to be discarded
> - * @head: object pool
> + * @pool: object pool
> *
> * return: 0 if objpool was released or error code
> */
> -int objpool_drop(void *obj, struct objpool_head *head);
> +int objpool_drop(void *obj, struct objpool_head *pool);
>
> /**
> * objpool_free() - release objpool forcely (all objects to be freed)
> - * @head: object pool to be released
> + * @pool: object pool to be released
> */
> -void objpool_free(struct objpool_head *head);
> +void objpool_free(struct objpool_head *pool);
>
> /**
> * objpool_fini() - deref object pool (also releasing unused objects)
> - * @head: object pool to be dereferenced
> + * @pool: object pool to be dereferenced
> */
> -void objpool_fini(struct objpool_head *head);
> +void objpool_fini(struct objpool_head *pool);
>
> #endif /* _LINUX_OBJPOOL_H */
> diff --git a/lib/objpool.c b/lib/objpool.c
> index 22e752371820..f8e8f70d7253 100644
> --- a/lib/objpool.c
> +++ b/lib/objpool.c
> @@ -15,104 +15,55 @@
> * Copyright: [email protected]
> */
>
> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> - (sizeof(uint32_t) << (s)->bits)))
> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> - ((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> -
> -/* compute the suitable num of objects to be managed per slot */
> -static int objpool_nobjs(int size)
> -{
> - return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> - (sizeof(uint32_t) + sizeof(void *)));
> -}
> -
> /* allocate and initialize percpu slots */
> static int
> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> - void *context, objpool_init_obj_cb objinit)
> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> + objpool_init_obj_cb objinit)
> {
> - int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> -
> - /* aligned object size by sizeof(void *) */
> - objsz = ALIGN(head->obj_size, sizeof(void *));
> - /* shall we allocate objects along with percpu-slot */
> - if (objsz)
> - head->flags |= OBJPOOL_HAVE_OBJECTS;
> -
> - /* vmalloc is used in default to allocate percpu-slots */
> - if (!(head->gfp & GFP_ATOMIC))
> - head->flags |= OBJPOOL_FROM_VMALLOC;
> -
> - for (i = 0; i < head->nr_cpus; i++) {
> - struct objpool_slot *os;
> + int i, j, n, size, slot_size, cpu_count = 0;
> + struct objpool_slot *slot;
>
> + for (i = 0; i < pool->nr_cpus; i++) {
> /* skip the cpus which could never be present */
> if (!cpu_possible(i))
> continue;
>
> /* compute how many objects to be managed by this slot */
> - n = nobjs / num_possible_cpus();
> - if (cpu < (nobjs % num_possible_cpus()))
> + n = pool->nr_objs / num_possible_cpus();
> + if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> n++;
> - size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> - sizeof(uint32_t) * nents + objsz * n;
> + cpu_count++;
> +
> + slot_size = struct_size(slot, entries, pool->capacity);
> + size = slot_size + pool->obj_size * n;
>
> /*
> * here we allocate percpu-slot & objects together in a single
> - * allocation, taking advantage of warm caches and TLB hits as
> - * vmalloc always aligns the request size to pages
> + * allocation, taking advantage on NUMA.
> */
> - if (head->flags & OBJPOOL_FROM_VMALLOC)
> - os = __vmalloc_node(size, sizeof(void *), head->gfp,
> + if (pool->flags & OBJPOOL_FROM_VMALLOC)
> + slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> cpu_to_node(i), __builtin_return_address(0));
> else
> - os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> - if (!os)
> + slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> + if (!slot)
> return -ENOMEM;
>
> /* initialize percpu slot for the i-th slot */
> - memset(os, 0, size);
> - os->bits = ilog2(head->capacity);
> - os->mask = head->capacity - 1;
> - head->cpu_slots[i] = os;
> - head->slot_sizes[i] = size;
> - cpu = cpu + 1;
> -
> - /*
> - * manually set head & tail to avoid possible conflict:
> - * We assume that the head item is ready for retrieval
> - * iff head is equal to ages[head & mask]. but ages is
> - * initialized as 0, so in view of the caller of pop(),
> - * the 1st item (0th) is always ready, but the reality
> - * could be: push() is stalled before the final update,
> - * thus the item being inserted will be lost forever
> - */
> - os->head = os->tail = head->capacity;
> -
> - if (!objsz)
> - continue;
> + memset(slot, 0, size);
> + slot->mask = pool->capacity - 1;
> + pool->cpu_slots[i] = slot;
>
> for (j = 0; j < n; j++) {
> - uint32_t *ages = SLOT_AGES(os);
> - void **ents = SLOT_ENTS(os);
> - void *obj = SLOT_OBJS(os) + j * objsz;
> - uint32_t ie = os->tail & os->mask;
> + void *obj = (void *)slot + slot_size + pool->obj_size * j;
>
> - /* perform object initialization */
> if (objinit) {
> int rc = objinit(obj, context);
> if (rc)
> return rc;
> }
> -
> - /* add obj into the ring array */
> - ents[ie] = obj;
> - ages[ie] = os->tail;
> - os->tail++;
> - head->nr_objs++;
> + slot->entries[j] = obj;
> + slot->tail++;
> }
> }
>
> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> }
>
> /* cleanup all percpu slots of the object pool */
> -static void objpool_fini_percpu_slots(struct objpool_head *head)
> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> {
> int i;
>
> - if (!head->cpu_slots)
> + if (!pool->cpu_slots)
> return;
>
> - for (i = 0; i < head->nr_cpus; i++) {
> - if (!head->cpu_slots[i])
> - continue;
> - if (head->flags & OBJPOOL_FROM_VMALLOC)
> - vfree(head->cpu_slots[i]);
> - else
> - kfree(head->cpu_slots[i]);
> - }
> - kfree(head->cpu_slots);
> - head->cpu_slots = NULL;
> - head->slot_sizes = NULL;
> + for (i = 0; i < pool->nr_cpus; i++)
> + kvfree(pool->cpu_slots[i]);
> + kfree(pool->cpu_slots);
> }
>
> /* initialize object pool and pre-allocate objects */
> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> objpool_fini_cb release)
> {
> int nents, rc;
>
> /* check input parameters */
> - if (nr_objs <= 0 || object_size <= 0)
> + if (nr_objs <= 0 || object_size <= 0 ||
> + nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> + return -EINVAL;
> +
> + /* Align up to unsigned long size */
> + object_size = ALIGN(object_size, sizeof(unsigned long));
> +
> + /*
> + * To avoid filling up the entries array in the per-cpu slot,
> + * use a power of 2 which is at least N + 1. Thus, the tail never
> + * catches up to the head, and we are able to use head/tail as the
> + * sequential number.
> + */
> + nents = roundup_pow_of_two(nr_objs + 1);
> + if (!nents)
> return -EINVAL;
>
> - /* calculate percpu slot size (rounded to pow of 2) */
> - nents = max_t(int, roundup_pow_of_two(nr_objs),
> - objpool_nobjs(L1_CACHE_BYTES));
> -
> - /* initialize objpool head */
> - memset(head, 0, sizeof(struct objpool_head));
> - head->nr_cpus = nr_cpu_ids;
> - head->obj_size = object_size;
> - head->capacity = nents;
> - head->gfp = gfp & ~__GFP_ZERO;
> - head->context = context;
> - head->release = release;
> -
> - /* allocate array for percpu slots */
> - head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> - head->nr_cpus * sizeof(int), head->gfp);
> - if (!head->cpu_slots)
> + /* initialize the objpool head */
> + memset(pool, 0, sizeof(struct objpool_head));
> + pool->nr_cpus = nr_cpu_ids;
> + pool->obj_size = object_size;
> + pool->nr_objs = nr_objs;
> + /* just to keep the per-cpu ring array from being completely filled */
> + pool->capacity = nents;
> + pool->gfp = gfp & ~__GFP_ZERO;
> + pool->context = context;
> + pool->release = release;
> + /* vmalloc is used by default to allocate percpu-slots */
> + if (!(pool->gfp & GFP_ATOMIC))
> + pool->flags |= OBJPOOL_FROM_VMALLOC;
> +
> + pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> + if (!pool->cpu_slots)
> return -ENOMEM;
> - head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>
> /* initialize per-cpu slots */
> - rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> + rc = objpool_init_percpu_slots(pool, context, objinit);
> if (rc)
> - objpool_fini_percpu_slots(head);
> + objpool_fini_percpu_slots(pool);
> else
> - refcount_set(&head->ref, nr_objs + 1);
> + refcount_set(&pool->ref, nr_objs + 1);
>
> return rc;
> }
> EXPORT_SYMBOL_GPL(objpool_init);
>
> /* adding object to slot, abort if the slot was already full */
> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> {
> - uint32_t *ages = SLOT_AGES(os);
> - void **ents = SLOT_ENTS(os);
> - uint32_t head, tail;
> + struct objpool_slot *slot = pool->cpu_slots[cpu];
> + uint32_t tail, next;
>
> do {
> - /* perform memory loading for both head and tail */
> - head = READ_ONCE(os->head);
> - tail = READ_ONCE(os->tail);
> - /* just abort if slot is full */
> - if (tail - head > os->mask)
> - return -ENOENT;
> - /* try to extend tail by 1 using CAS to avoid races */
> - if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> - break;
> - } while (1);
> + uint32_t head = READ_ONCE(slot->head);
>
> - /* the tail-th of slot is reserved for the given obj */
> - WRITE_ONCE(ents[tail & os->mask], obj);
> - /* update epoch id to make this object available for pop() */
> - smp_store_release(&ages[tail & os->mask], tail);
> + tail = READ_ONCE(slot->tail);
> + next = tail + 1;
> +
> + /* This must never happen because capacity >= N + 1 */
> + if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> + (next < head && next > head + pool->nr_objs)))
> + return -EINVAL;
> +
> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> +
> + WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> return 0;
> }
>
> /* reclaim an object to object pool */
> -int objpool_push(void *obj, struct objpool_head *oh)
> +int objpool_push(void *obj, struct objpool_head *pool)
> {
> unsigned long flags;
> - int cpu, rc;
> + int rc;
>
> /* disable local irq to avoid preemption & interruption */
> raw_local_irq_save(flags);
> - cpu = raw_smp_processor_id();
> - do {
> - rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> - if (!rc)
> - break;
> - cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> - } while (1);
> + rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> raw_local_irq_restore(flags);
>
> return rc;
> }
> EXPORT_SYMBOL_GPL(objpool_push);
>
> -/* drop the allocated object, rather reclaim it to objpool */
> -int objpool_drop(void *obj, struct objpool_head *head)
> -{
> - if (!obj || !head)
> - return -EINVAL;
> -
> - if (refcount_dec_and_test(&head->ref)) {
> - objpool_free(head);
> - return 0;
> - }
> -
> - return -EAGAIN;
> -}
> -EXPORT_SYMBOL_GPL(objpool_drop);
> -
> /* try to retrieve object from slot */
> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> {
> - uint32_t *ages = SLOT_AGES(os);
> - void **ents = SLOT_ENTS(os);
> /* do memory load of head to local head */
> - uint32_t head = smp_load_acquire(&os->head);
> + uint32_t head = smp_load_acquire(&slot->head);
>
> /* loop if slot isn't empty */
> - while (head != READ_ONCE(os->tail)) {
> - uint32_t id = head & os->mask, prev = head;
> + while (head != READ_ONCE(slot->tail)) {
>
> /* do prefetching of object ents */
> - prefetch(&ents[id]);
> -
> - /* check whether this item was ready for retrieval */
> - if (smp_load_acquire(&ages[id]) == head) {
> - /* node must have been udpated by push() */
> - void *node = READ_ONCE(ents[id]);
> - /* commit and move forward head of the slot */
> - if (try_cmpxchg_release(&os->head, &head, head + 1))
> - return node;
> - /* head was already updated by others */
> - }
> + prefetch(&slot->entries[head & slot->mask]);
> +
> + /* commit and move forward head of the slot */
> + if (try_cmpxchg_release(&slot->head, &head, head + 1))
> + /*
> + * TBD: check for wraparound of the tail/head counter
> + * and warn if it is broken. But this happens only if
> + * this process slows down a lot and another CPU updates
> + * the head/tail just 2^32 + 1 times, and this slot
> + * is empty.
> + */
> + return slot->entries[head & slot->mask];
>
> /* re-load head from memory and continue trying */
> - head = READ_ONCE(os->head);
> - /*
> - * head stays unchanged, so it's very likely there's an
> - * ongoing push() on other cpu nodes but yet not update
> - * ages[] to mark it's completion
> - */
> - if (head == prev)
> - break;
> + head = READ_ONCE(slot->head);
> }
>
> return NULL;
> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> EXPORT_SYMBOL_GPL(objpool_pop);
>
> /* release whole objpool forcibly */
> -void objpool_free(struct objpool_head *head)
> +void objpool_free(struct objpool_head *pool)
> {
> - if (!head->cpu_slots)
> + if (!pool->cpu_slots)
> return;
>
> /* release percpu slots */
> - objpool_fini_percpu_slots(head);
> + objpool_fini_percpu_slots(pool);
>
> /* call user's cleanup callback if provided */
> - if (head->release)
> - head->release(head, head->context);
> + if (pool->release)
> + pool->release(pool, pool->context);
> }
> EXPORT_SYMBOL_GPL(objpool_free);
>
> -/* drop unused objects and defref objpool for releasing */
> -void objpool_fini(struct objpool_head *head)
> +/* drop the allocated object rather than reclaiming it to the objpool */
> +int objpool_drop(void *obj, struct objpool_head *pool)
> {
> - void *nod;
> + if (!obj || !pool)
> + return -EINVAL;
>
> - do {
> - /* grab object from objpool and drop it */
> - nod = objpool_pop(head);
> + if (refcount_dec_and_test(&pool->ref)) {
> + objpool_free(pool);
> + return 0;
> + }
> +
> + return -EAGAIN;
> +}
> +EXPORT_SYMBOL_GPL(objpool_drop);
> +
> +/* drop unused objects and deref objpool for releasing */
> +void objpool_fini(struct objpool_head *pool)
> +{
> + void *obj;
>
> - /* drop the extra ref of objpool */
> - if (refcount_dec_and_test(&head->ref))
> - objpool_free(head);
> - } while (nod);
> + /* grab object from objpool and drop it */
> + while ((obj = objpool_pop(pool)))
> + objpool_drop(obj, pool);
> }
> EXPORT_SYMBOL_GPL(objpool_fini);

2023-10-09 13:51:47

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

On Mon, 9 Oct 2023 17:23:34 +0800
wuqiang <[email protected]> wrote:

> Hello Masami,
>
> Just got time for the new patch and noticed that ages[] was removed. ages[]
> is introduced as a kind of two-phase commit to keep consistency, and must
> be kept.
>
> Think of the following 2 cases, where two cpu nodes are operating on the
> same objpool_slot simultaneously:
>
> Case 1:
>
> NODE 1:                                   NODE 2:
> push to an empty slot                     pop will get wrong value
>
> try_cmpxchg_acquire(&slot->tail, &tail, next)
>                                           try_cmpxchg_release(&slot->head, &head, head + 1)
>                                           return slot->entries[head & slot->mask]
> WRITE_ONCE(slot->entries[tail & slot->mask], obj)

Oh, good catch! Hmm, indeed. I considered using a separate 'commit_tail',
but it may not work (there is a small window that can leak an object in the
nested case).
Thanks for reviewing it!


>
>
> Case 2:
>
> NODE 1:                                   NODE 2:
> push to slot w/ 1 obj                     pop will get wrong value
>
>                                           try_cmpxchg_release(&slot->head, &head, head + 1)
> try_cmpxchg_acquire(&slot->tail, &tail, next)
> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>                                           return slot->entries[head & slot->mask]
>
>
> Regards,
> wuqiang
>
> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> > [ full quote of the original message and patch snipped ]


--
Masami Hiramatsu (Google) <[email protected]>

2023-10-12 14:03:06

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

Hi Wuqiang,

On Mon, 9 Oct 2023 17:23:34 +0800
wuqiang <[email protected]> wrote:

> Hello Masami,
>
> Just got time for the new patch and noticed that ages[] was removed. ages[]
> is introduced as a kind of two-phase commit to keep consistency, and must
> be kept.
>
> Think of the following 2 cases, where two cpu nodes are operating on the
> same objpool_slot simultaneously:
>
> Case 1:
>
> NODE 1:                                   NODE 2:
> push to an empty slot                     pop will get wrong value
>
> try_cmpxchg_acquire(&slot->tail, &tail, next)
>                                           try_cmpxchg_release(&slot->head, &head, head + 1)
>                                           return slot->entries[head & slot->mask]
> WRITE_ONCE(slot->entries[tail & slot->mask], obj)

Today I rethought the algorithm.

For this case, we can just add a `commit` counter to the slot, which records
how far the tail has been committed, instead of the ages[].

CPU1                                      CPU2
push to an empty slot                     pop from the same slot

commit = tail = slot->tail;
next = tail + 1;
try_cmpxchg_acquire(&slot->tail, &tail, next);
                                          while (head != commit) -> NG1
WRITE_ONCE(slot->entries[tail & slot->mask], obj);
                                          while (head != commit) -> NG2
WRITE_ONCE(slot->commit, next);
                                          while (head != commit) -> OK
                                          try_cmpxchg_acquire(&slot->head, &head, head + 1);
                                          return slot->entries[head & slot->mask];

So at the NG1 and NG2 timings the pop fails; only after the commit update
(the OK case) can it observe the newly pushed entry.
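
Just for reference, here is how the pop side would look under this scheme;
it spins on 'commit' instead of 'tail'. This is only a sketch, assuming
objpool_slot gains a 'commit' field initialized to the same value as 'tail':

---
static inline void *objpool_try_get_slot(struct objpool_slot *slot)
{
        uint32_t head = smp_load_acquire(&slot->head);

        /*
         * Only entries below 'commit' are guaranteed to be written;
         * the acquire pairs with a release store of 'commit' on the
         * push side.
         */
        while (head != smp_load_acquire(&slot->commit)) {
                prefetch(&slot->entries[head & slot->mask]);

                /* commit and move forward the head of the slot */
                if (try_cmpxchg_release(&slot->head, &head, head + 1))
                        return slot->entries[head & slot->mask];

                /* re-load head from memory and continue trying */
                head = READ_ONCE(slot->head);
        }

        return NULL;
}
---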

This does not handle the nested "push" case, because the reserve-commit
block runs with interrupts disabled. This doesn't support NMI, but that is OK.
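
Since the reserve-commit block cannot nest while irqs are disabled, the push
side stays simple. Again only a sketch with the assumed 'commit' field; I use
a release store so that the entry write is visible before 'commit' moves:

---
do {
        tail = READ_ONCE(slot->tail);
        next = tail + 1;
} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));

WRITE_ONCE(slot->entries[tail & slot->mask], obj);
/* publish the new entry to pop */
smp_store_release(&slot->commit, next);
---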

If we need to support NMI or remove the irq-disable, we can do the following
(remember that the push operation only happens on the slot owned by that
CPU, so this is a per-cpu process):

---
do {
        commit = tail = slot->tail;
        next = tail + 1;
} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));

WRITE_ONCE(slot->entries[tail & slot->mask], obj);

// At this point, "commit == slot->tail - 1" in this nest level.
// Only in the outermost (non-nested) case does "commit == slot->commit".
if (commit != slot->commit)
        return; // do nothing; it must be updated by the outermost one.

// catch up commit to tail.
do {
        commit = READ_ONCE(slot->tail);
        WRITE_ONCE(slot->commit, commit);
        // note that someone can interrupt this loop too.
} while (commit != READ_ONCE(slot->tail));
---
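
For example, starting from tail == commit == T: the outer push reserves T
and is interrupted right after the cmpxchg; a nested push reserves T+1,
writes entries[T+1], sees its snapshot (T+1) != slot->commit (T), and
returns; the outer push then writes entries[T], sees its snapshot (T) ==
slot->commit, and its catch-up loop advances commit to T+2, publishing both
entries at once.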

For the rethook this may be too much.

Thank you,

>
>
> Case 2:
>
> NODE 1:                                   NODE 2:
> push to slot w/ 1 obj                     pop will get wrong value
>
>                                           try_cmpxchg_release(&slot->head, &head, head + 1)
> try_cmpxchg_acquire(&slot->tail, &tail, next)
> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>                                           return slot->entries[head & slot->mask]
>
>
> Regards,
> wuqiang
>
> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> >
> > On Tue, 5 Sep 2023 09:52:51 +0800
> > "wuqiang.matt" <[email protected]> wrote:
> >
> >> The object pool is a scalable implementaion of high performance queue
> >> for object allocation and reclamation, such as kretprobe instances.
> >>
> >> With leveraging percpu ring-array to mitigate the hot spot of memory
> >> contention, it could deliver near-linear scalability for high parallel
> >> scenarios. The objpool is best suited for following cases:
> >> 1) Memory allocation or reclamation are prohibited or too expensive
> >> 2) Consumers are of different priorities, such as irqs and threads
> >>
> >> Limitations:
> >> 1) Maximum objects (capacity) is determined during pool initializing
> >> and can't be modified (extended) after objpool creation
> >> 2) The memory of objects won't be freed until objpool is finalized
> >> 3) Object allocation (pop) may fail after trying all cpu slots
> >
> > I made a simplifying patch on this by (mainly) removing ages array.
> > I also rename local variable to use more readable names, like slot,
> > pool, and obj.
> >
> > Here the results which I run the test_objpool.ko.
> >
> > Original:
> > [ 50.500235] Summary of testcases:
> > [ 50.503139] duration: 1027135us hits: 30628293 miss: 0 sync: percpu objpool
> > [ 50.510416] duration: 1047977us hits: 30453023 miss: 0 sync: percpu objpool from vmalloc
> > [ 50.517421] duration: 1047098us hits: 31145034 miss: 0 sync & hrtimer: percpu objpool
> > [ 50.524671] duration: 1053233us hits: 30919640 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> > [ 50.531382] duration: 1055822us hits: 3407221 miss: 830219 sync overrun: percpu objpool
> > [ 50.538135] duration: 1055998us hits: 3404624 miss: 854160 sync overrun: percpu objpool from vmalloc
> > [ 50.546686] duration: 1046104us hits: 19464798 miss: 0 async: percpu objpool
> > [ 50.552989] duration: 1033054us hits: 18957836 miss: 0 async: percpu objpool from vmalloc
> > [ 50.560289] duration: 1041778us hits: 33806868 miss: 0 async & hrtimer: percpu objpool
> > [ 50.567425] duration: 1048901us hits: 34211860 miss: 0 async & hrtimer: percpu objpool from vmalloc
> >
> > Simplified:
> > [ 48.393236] Summary of testcases:
> > [ 48.395384] duration: 1013002us hits: 29661448 miss: 0 sync: percpu objpool
> > [ 48.400351] duration: 1057231us hits: 31035578 miss: 0 sync: percpu objpool from vmalloc
> > [ 48.405660] duration: 1043196us hits: 30546652 miss: 0 sync & hrtimer: percpu objpool
> > [ 48.411216] duration: 1047128us hits: 30601306 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> > [ 48.417231] duration: 1051695us hits: 3468287 miss: 892881 sync overrun: percpu objpool
> > [ 48.422405] duration: 1054003us hits: 3549491 miss: 898141 sync overrun: percpu objpool from vmalloc
> > [ 48.428425] duration: 1052946us hits: 19005228 miss: 0 async: percpu objpool
> > [ 48.433597] duration: 1051757us hits: 19670866 miss: 0 async: percpu objpool from vmalloc
> > [ 48.439280] duration: 1042951us hits: 37135332 miss: 0 async & hrtimer: percpu objpool
> > [ 48.445085] duration: 1029803us hits: 37093248 miss: 0 async & hrtimer: percpu objpool from vmalloc
> >
> > Can you test it too?
> >
> > Thanks,
> >
> > From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> > From: "Masami Hiramatsu (Google)" <[email protected]>
> > Date: Mon, 25 Sep 2023 16:07:12 +0900
> > Subject: [PATCH] objpool: Simplify objpool by removing ages array
> >
> > Simplify the objpool code by removing ages array from per-cpu slot.
> > It chooses enough big capacity (which is a rounded up power of 2 value
> > of nr_objects + 1) for the entries array, the tail never catch up to
> > the head in per-cpu slot. Thus tail == head means the slot is empty.
> >
> > This also uses consistent local variable names for pool, slot and obj.
> >
> > Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
> > ---
> > include/linux/objpool.h | 61 ++++----
> > lib/objpool.c | 310 ++++++++++++++++------------------------
> > 2 files changed, 147 insertions(+), 224 deletions(-)
> >
> > diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> > index 33c832216b98..ecd5ecaffcd3 100644
> > --- a/include/linux/objpool.h
> > +++ b/include/linux/objpool.h
> > @@ -38,33 +38,23 @@
> > * struct objpool_slot - percpu ring array of objpool
> > * @head: head of the local ring array (to retrieve at)
> > * @tail: tail of the local ring array (to append at)
> > - * @bits: log2 of capacity (for bitwise operations)
> > - * @mask: capacity - 1
> > + * @mask: capacity of entries - 1
> > + * @entries: object entries on this slot.
> > *
> > * Represents a cpu-local array-based ring buffer, its size is specialized
> > * during initialization of object pool. The percpu objpool slot is to be
> > * allocated from local memory for NUMA system, and to be kept compact in
> > - * continuous memory: ages[] is stored just after the body of objpool_slot,
> > - * and then entries[]. ages[] describes revision of each item, solely used
> > - * to avoid ABA; entries[] contains pointers of the actual objects
> > - *
> > - * Layout of objpool_slot in memory:
> > - *
> > - * 64bit:
> > - * 4 8 12 16 32 64
> > - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> > - *
> > - * 32bit:
> > - * 4 8 12 16 32 48 64
> > - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> > + * continuous memory: CPU assigned number of objects are stored just after
> > + * the body of objpool_slot.
> > *
> > */
> > struct objpool_slot {
> > - uint32_t head;
> > - uint32_t tail;
> > - uint32_t bits;
> > - uint32_t mask;
> > -} __packed;
> > + uint32_t head;
> > + uint32_t tail;
> > + uint32_t mask;
> > + uint32_t dummyr;
> > + void * entries[];
> > +};
> >
> > struct objpool_head;
> >
> > @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> > * @obj_size: object & element size
> > * @nr_objs: total objs (to be pre-allocated)
> > * @nr_cpus: nr_cpu_ids
> > - * @capacity: max objects per cpuslot
> > + * @capacity: max objects on percpu slot
> > * @gfp: gfp flags for kmalloc & vmalloc
> > * @ref: refcount for objpool
> > * @flags: flags for objpool management
> > * @cpu_slots: array of percpu slots
> > - * @slot_sizes: size in bytes of slots
> > * @release: resource cleanup callback
> > * @context: caller-provided context
> > */
> > @@ -100,7 +89,6 @@ struct objpool_head {
> > refcount_t ref;
> > unsigned long flags;
> > struct objpool_slot **cpu_slots;
> > - int *slot_sizes;
> > objpool_fini_cb release;
> > void *context;
> > };
> > @@ -108,9 +96,12 @@ struct objpool_head {
> > #define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
> > #define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
> >
> > +#define OBJPOOL_NR_OBJECT_MAX (1 << 24)
> > +#define OBJPOOL_OBJECT_SIZE_MAX (1 << 16)
> > +
> > /**
> > * objpool_init() - initialize objpool and pre-allocated objects
> > - * @head: the object pool to be initialized, declared by caller
> > + * @pool: the object pool to be initialized, declared by caller
> > * @nr_objs: total objects to be pre-allocated by this object pool
> > * @object_size: size of an object (should be > 0)
> > * @gfp: flags for memory allocation (via kmalloc or vmalloc)
> > @@ -128,47 +119,47 @@ struct objpool_head {
> > * pop (object allocation) or do clearance before each push (object
> > * reclamation).
> > */
> > -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> > +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> > gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> > objpool_fini_cb release);
> >
> > /**
> > * objpool_pop() - allocate an object from objpool
> > - * @head: object pool
> > + * @pool: object pool
> > *
> > * return value: object ptr or NULL if failed
> > */
> > -void *objpool_pop(struct objpool_head *head);
> > +void *objpool_pop(struct objpool_head *pool);
> >
> > /**
> > * objpool_push() - reclaim the object and return back to objpool
> > * @obj: object ptr to be pushed to objpool
> > - * @head: object pool
> > + * @pool: object pool
> > *
> > * return: 0 or error code (it fails only when user tries to push
> > * the same object multiple times or wrong "objects" into objpool)
> > */
> > -int objpool_push(void *obj, struct objpool_head *head);
> > +int objpool_push(void *obj, struct objpool_head *pool);
> >
> > /**
> > * objpool_drop() - discard the object and deref objpool
> > * @obj: object ptr to be discarded
> > - * @head: object pool
> > + * @pool: object pool
> > *
> > * return: 0 if objpool was released or error code
> > */
> > -int objpool_drop(void *obj, struct objpool_head *head);
> > +int objpool_drop(void *obj, struct objpool_head *pool);
> >
> > /**
> > * objpool_free() - release objpool forcely (all objects to be freed)
> > - * @head: object pool to be released
> > + * @pool: object pool to be released
> > */
> > -void objpool_free(struct objpool_head *head);
> > +void objpool_free(struct objpool_head *pool);
> >
> > /**
> > * objpool_fini() - deref object pool (also releasing unused objects)
> > - * @head: object pool to be dereferenced
> > + * @pool: object pool to be dereferenced
> > */
> > -void objpool_fini(struct objpool_head *head);
> > +void objpool_fini(struct objpool_head *pool);
> >
> > #endif /* _LINUX_OBJPOOL_H */
> > diff --git a/lib/objpool.c b/lib/objpool.c
> > index 22e752371820..f8e8f70d7253 100644
> > --- a/lib/objpool.c
> > +++ b/lib/objpool.c
> > @@ -15,104 +15,55 @@
> > * Copyright: [email protected]
> > */
> >
> > -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> > -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> > - (sizeof(uint32_t) << (s)->bits)))
> > -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> > - ((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> > -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> > -
> > -/* compute the suitable num of objects to be managed per slot */
> > -static int objpool_nobjs(int size)
> > -{
> > - return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> > - (sizeof(uint32_t) + sizeof(void *)));
> > -}
> > -
> > /* allocate and initialize percpu slots */
> > static int
> > -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> > - void *context, objpool_init_obj_cb objinit)
> > +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> > + objpool_init_obj_cb objinit)
> > {
> > - int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> > -
> > - /* aligned object size by sizeof(void *) */
> > - objsz = ALIGN(head->obj_size, sizeof(void *));
> > - /* shall we allocate objects along with percpu-slot */
> > - if (objsz)
> > - head->flags |= OBJPOOL_HAVE_OBJECTS;
> > -
> > - /* vmalloc is used in default to allocate percpu-slots */
> > - if (!(head->gfp & GFP_ATOMIC))
> > - head->flags |= OBJPOOL_FROM_VMALLOC;
> > -
> > - for (i = 0; i < head->nr_cpus; i++) {
> > - struct objpool_slot *os;
> > + int i, j, n, size, slot_size, cpu_count = 0;
> > + struct objpool_slot *slot;
> >
> > + for (i = 0; i < pool->nr_cpus; i++) {
> > /* skip the cpus which could never be present */
> > if (!cpu_possible(i))
> > continue;
> >
> > /* compute how many objects to be managed by this slot */
> > - n = nobjs / num_possible_cpus();
> > - if (cpu < (nobjs % num_possible_cpus()))
> > + n = pool->nr_objs / num_possible_cpus();
> > + if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> > n++;
> > - size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> > - sizeof(uint32_t) * nents + objsz * n;
> > + cpu_count++;
> > +
> > + slot_size = struct_size(slot, entries, pool->capacity);
> > + size = slot_size + pool->obj_size * n;
> >
> > /*
> > * here we allocate percpu-slot & objects together in a single
> > - * allocation, taking advantage of warm caches and TLB hits as
> > - * vmalloc always aligns the request size to pages
> > + * allocation, taking advantage on NUMA.
> > */
> > - if (head->flags & OBJPOOL_FROM_VMALLOC)
> > - os = __vmalloc_node(size, sizeof(void *), head->gfp,
> > + if (pool->flags & OBJPOOL_FROM_VMALLOC)
> > + slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> > cpu_to_node(i), __builtin_return_address(0));
> > else
> > - os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> > - if (!os)
> > + slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> > + if (!slot)
> > return -ENOMEM;
> >
> > /* initialize percpu slot for the i-th slot */
> > - memset(os, 0, size);
> > - os->bits = ilog2(head->capacity);
> > - os->mask = head->capacity - 1;
> > - head->cpu_slots[i] = os;
> > - head->slot_sizes[i] = size;
> > - cpu = cpu + 1;
> > -
> > - /*
> > - * manually set head & tail to avoid possible conflict:
> > - * We assume that the head item is ready for retrieval
> > - * iff head is equal to ages[head & mask]. but ages is
> > - * initialized as 0, so in view of the caller of pop(),
> > - * the 1st item (0th) is always ready, but the reality
> > - * could be: push() is stalled before the final update,
> > - * thus the item being inserted will be lost forever
> > - */
> > - os->head = os->tail = head->capacity;
> > -
> > - if (!objsz)
> > - continue;
> > + memset(slot, 0, size);
> > + slot->mask = pool->capacity - 1;
> > + pool->cpu_slots[i] = slot;
> >
> > for (j = 0; j < n; j++) {
> > - uint32_t *ages = SLOT_AGES(os);
> > - void **ents = SLOT_ENTS(os);
> > - void *obj = SLOT_OBJS(os) + j * objsz;
> > - uint32_t ie = os->tail & os->mask;
> > + void *obj = (void *)slot + slot_size + pool->obj_size * j;
> >
> > - /* perform object initialization */
> > if (objinit) {
> > int rc = objinit(obj, context);
> > if (rc)
> > return rc;
> > }
> > -
> > - /* add obj into the ring array */
> > - ents[ie] = obj;
> > - ages[ie] = os->tail;
> > - os->tail++;
> > - head->nr_objs++;
> > + slot->entries[j] = obj;
> > + slot->tail++;
> > }
> > }
> >
> > @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> > }
> >
> > /* cleanup all percpu slots of the object pool */
> > -static void objpool_fini_percpu_slots(struct objpool_head *head)
> > +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> > {
> > int i;
> >
> > - if (!head->cpu_slots)
> > + if (!pool->cpu_slots)
> > return;
> >
> > - for (i = 0; i < head->nr_cpus; i++) {
> > - if (!head->cpu_slots[i])
> > - continue;
> > - if (head->flags & OBJPOOL_FROM_VMALLOC)
> > - vfree(head->cpu_slots[i]);
> > - else
> > - kfree(head->cpu_slots[i]);
> > - }
> > - kfree(head->cpu_slots);
> > - head->cpu_slots = NULL;
> > - head->slot_sizes = NULL;
> > + for (i = 0; i < pool->nr_cpus; i++)
> > + kvfree(pool->cpu_slots[i]);
> > + kfree(pool->cpu_slots);
> > }
> >
> > /* initialize object pool and pre-allocate objects */
> > -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> > +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> > gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> > objpool_fini_cb release)
> > {
> > int nents, rc;
> >
> > /* check input parameters */
> > - if (nr_objs <= 0 || object_size <= 0)
> > + if (nr_objs <= 0 || object_size <= 0 ||
> > + nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> > + return -EINVAL;
> > +
> > + /* Align up to unsigned long size */
> > + object_size = ALIGN(object_size, sizeof(unsigned long));
> > +
> > + /*
> > + * To avoid filling up the entries array in the per-cpu slot,
> > + * use the power of 2 which is more than N + 1. Thus, the tail never
> > + * catch up to the pool, and able to use pool/tail as the sequencial
> > + * number.
> > + */
> > + nents = roundup_pow_of_two(nr_objs + 1);
> > + if (!nents)
> > return -EINVAL;
> >
> > - /* calculate percpu slot size (rounded to pow of 2) */
> > - nents = max_t(int, roundup_pow_of_two(nr_objs),
> > - objpool_nobjs(L1_CACHE_BYTES));
> > -
> > - /* initialize objpool head */
> > - memset(head, 0, sizeof(struct objpool_head));
> > - head->nr_cpus = nr_cpu_ids;
> > - head->obj_size = object_size;
> > - head->capacity = nents;
> > - head->gfp = gfp & ~__GFP_ZERO;
> > - head->context = context;
> > - head->release = release;
> > -
> > - /* allocate array for percpu slots */
> > - head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> > - head->nr_cpus * sizeof(int), head->gfp);
> > - if (!head->cpu_slots)
> > + /* initialize objpool pool */
> > + memset(pool, 0, sizeof(struct objpool_head));
> > + pool->nr_cpus = nr_cpu_ids;
> > + pool->obj_size = object_size;
> > + pool->nr_objs = nr_objs;
> > + /* just prevent to fullfill the per-cpu ring array */
> > + pool->capacity = nents;
> > + pool->gfp = gfp & ~__GFP_ZERO;
> > + pool->context = context;
> > + pool->release = release;
> > + /* vmalloc is used in default to allocate percpu-slots */
> > + if (!(pool->gfp & GFP_ATOMIC))
> > + pool->flags |= OBJPOOL_FROM_VMALLOC;
> > +
> > + pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> > + if (!pool->cpu_slots)
> > return -ENOMEM;
> > - head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >
> > /* initialize per-cpu slots */
> > - rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> > + rc = objpool_init_percpu_slots(pool, context, objinit);
> > if (rc)
> > - objpool_fini_percpu_slots(head);
> > + objpool_fini_percpu_slots(pool);
> > else
> > - refcount_set(&head->ref, nr_objs + 1);
> > + refcount_set(&pool->ref, nr_objs + 1);
> >
> > return rc;
> > }
> > EXPORT_SYMBOL_GPL(objpool_init);
> >
> > /* adding object to slot, abort if the slot was already full */
> > -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> > +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> > {
> > - uint32_t *ages = SLOT_AGES(os);
> > - void **ents = SLOT_ENTS(os);
> > - uint32_t head, tail;
> > + struct objpool_slot *slot = pool->cpu_slots[cpu];
> > + uint32_t tail, next;
> >
> > do {
> > - /* perform memory loading for both head and tail */
> > - head = READ_ONCE(os->head);
> > - tail = READ_ONCE(os->tail);
> > - /* just abort if slot is full */
> > - if (tail - head > os->mask)
> > - return -ENOENT;
> > - /* try to extend tail by 1 using CAS to avoid races */
> > - if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> > - break;
> > - } while (1);
> > + uint32_t head = READ_ONCE(slot->head);
> >
> > - /* the tail-th of slot is reserved for the given obj */
> > - WRITE_ONCE(ents[tail & os->mask], obj);
> > - /* update epoch id to make this object available for pop() */
> > - smp_store_release(&ages[tail & os->mask], tail);
> > + tail = READ_ONCE(slot->tail);
> > + next = tail + 1;
> > +
> > + /* This must never happen because capacity >= N + 1 */
> > + if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> > + (next < head && next > head + pool->nr_objs)))
> > + return -EINVAL;
> > +
> > + } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> > +
> > + WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> > return 0;
> > }
> >
> > /* reclaim an object to object pool */
> > -int objpool_push(void *obj, struct objpool_head *oh)
> > +int objpool_push(void *obj, struct objpool_head *pool)
> > {
> > unsigned long flags;
> > - int cpu, rc;
> > + int rc;
> >
> > /* disable local irq to avoid preemption & interruption */
> > raw_local_irq_save(flags);
> > - cpu = raw_smp_processor_id();
> > - do {
> > - rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> > - if (!rc)
> > - break;
> > - cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> > - } while (1);
> > + rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> > raw_local_irq_restore(flags);
> >
> > return rc;
> > }
> > EXPORT_SYMBOL_GPL(objpool_push);
> >
> > -/* drop the allocated object, rather reclaim it to objpool */
> > -int objpool_drop(void *obj, struct objpool_head *head)
> > -{
> > - if (!obj || !head)
> > - return -EINVAL;
> > -
> > - if (refcount_dec_and_test(&head->ref)) {
> > - objpool_free(head);
> > - return 0;
> > - }
> > -
> > - return -EAGAIN;
> > -}
> > -EXPORT_SYMBOL_GPL(objpool_drop);
> > -
> > /* try to retrieve object from slot */
> > -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> > +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> > {
> > - uint32_t *ages = SLOT_AGES(os);
> > - void **ents = SLOT_ENTS(os);
> > /* do memory load of head to local head */
> > - uint32_t head = smp_load_acquire(&os->head);
> > + uint32_t head = smp_load_acquire(&slot->head);
> >
> > /* loop if slot isn't empty */
> > - while (head != READ_ONCE(os->tail)) {
> > - uint32_t id = head & os->mask, prev = head;
> > + while (head != READ_ONCE(slot->tail)) {
> >
> > /* do prefetching of object ents */
> > - prefetch(&ents[id]);
> > -
> > - /* check whether this item was ready for retrieval */
> > - if (smp_load_acquire(&ages[id]) == head) {
> > - /* node must have been udpated by push() */
> > - void *node = READ_ONCE(ents[id]);
> > - /* commit and move forward head of the slot */
> > - if (try_cmpxchg_release(&os->head, &head, head + 1))
> > - return node;
> > - /* head was already updated by others */
> > - }
> > + prefetch(&slot->entries[head & slot->mask]);
> > +
> > + /* commit and move forward head of the slot */
> > + if (try_cmpxchg_release(&slot->head, &head, head + 1))
> > + /*
> > +		 * TBD: check wraparound of the tail/head counter and warn
> > +		 * if it is broken. But this happens only if this
> > +		 * process slows down a lot and another CPU updates
> > +		 * the head/tail exactly 2^32 + 1 times, and this slot
> > + * is empty.
> > + */
> > + return slot->entries[head & slot->mask];
> >
> > /* re-load head from memory and continue trying */
> > - head = READ_ONCE(os->head);
> > - /*
> > - * head stays unchanged, so it's very likely there's an
> > - * ongoing push() on other cpu nodes but yet not update
> > - * ages[] to mark it's completion
> > - */
> > - if (head == prev)
> > - break;
> > + head = READ_ONCE(slot->head);
> > }
> >
> > return NULL;
> > @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> > EXPORT_SYMBOL_GPL(objpool_pop);
> >
> > /* release whole objpool forcely */
> > -void objpool_free(struct objpool_head *head)
> > +void objpool_free(struct objpool_head *pool)
> > {
> > - if (!head->cpu_slots)
> > + if (!pool->cpu_slots)
> > return;
> >
> > /* release percpu slots */
> > - objpool_fini_percpu_slots(head);
> > + objpool_fini_percpu_slots(pool);
> >
> > /* call user's cleanup callback if provided */
> > - if (head->release)
> > - head->release(head, head->context);
> > + if (pool->release)
> > + pool->release(pool, pool->context);
> > }
> > EXPORT_SYMBOL_GPL(objpool_free);
> >
> > -/* drop unused objects and defref objpool for releasing */
> > -void objpool_fini(struct objpool_head *head)
> > +/* drop the allocated object, rather than reclaiming it to the objpool */
> > +int objpool_drop(void *obj, struct objpool_head *pool)
> > {
> > - void *nod;
> > + if (!obj || !pool)
> > + return -EINVAL;
> >
> > - do {
> > - /* grab object from objpool and drop it */
> > - nod = objpool_pop(head);
> > + if (refcount_dec_and_test(&pool->ref)) {
> > + objpool_free(pool);
> > + return 0;
> > + }
> > +
> > + return -EAGAIN;
> > +}
> > +EXPORT_SYMBOL_GPL(objpool_drop);
> > +
> > +/* drop unused objects and deref objpool for releasing */
> > +void objpool_fini(struct objpool_head *pool)
> > +{
> > + void *obj;
> >
> > - /* drop the extra ref of objpool */
> > - if (refcount_dec_and_test(&head->ref))
> > - objpool_free(head);
> > - } while (nod);
> > + /* grab object from objpool and drop it */
> > + while ((obj = objpool_pop(pool)))
> > + objpool_drop(obj, pool);
> > }
> > EXPORT_SYMBOL_GPL(objpool_fini);
>


--
Masami Hiramatsu (Google) <[email protected]>

2023-10-12 17:39:07

by wuqiang.matt

[permalink] [raw]
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
> Hi Wuqiang,
>
> On Mon, 9 Oct 2023 17:23:34 +0800
> wuqiang <[email protected]> wrote:
>
>> Hello Masami,
>>
>> Just got time for the new patch and saw that ages[] was removed. ages[] is
>> introduced like a 2-phase commit to keep consistency and must be kept.
>>
>> Think of the following 2 cases where two cpu nodes are operating on the same
>> objpool_slot simultaneously:
>>
>> Case 1:
>>
>> NODE 1:                                    NODE 2:
>> push to an empty slot                      pop will get wrong value
>>
>> try_cmpxchg_acquire(&slot->tail, &tail, next)
>>                                            try_cmpxchg_release(&slot->head, &head, head + 1)
>>                                            return slot->entries[head & slot->mask]
>> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>
> Today, I rethink the algorithm.
>
> For this case, we can just add a `commit` field to the slot for committing the
> tail update instead of using the ages[].
>
> CPU1                                        CPU2
> push to an empty slot                       pop from the same slot
>
> commit = tail = slot->tail;
> next = tail + 1;
> try_cmpxchg_acquire(&slot->tail,
>                     &tail, next);
>                                             while (head != commit) -> NG1
> WRITE_ONCE(slot->entries[tail & slot->mask],
>            obj)
>                                             while (head != commit) -> NG2
> WRITE_ONCE(slot->commit, next);
>                                             while (head != commit) -> OK
>                                             try_cmpxchg_acquire(&slot->head,
>                                                                 &head, next);
>                                             return slot->entries[head & slot->mask]
>
> So at the NG1 and NG2 timings, the pop will fail.
>
> This does not cover the nested "push" case because the reserve-commit block
> will run with interrupts disabled. This doesn't support NMI but that is OK.

If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so CPU2
won't meet the condition "head == commit".

Since slot->commit is always synced to slot->tail after a successful push,
should pop check "tail != commit"? In that case, when a push is ongoing on
the slot, pop() has to wait for its completion even if there are objects in
the same slot.

> If we need to support such NMI or remove the irq-disable, we can do the
> following (please remember the push operation only happens on the slot owned
> by that CPU, so this is a per-cpu process):
>
> ---
> do {
>         commit = tail = slot->tail;
>         next = tail + 1;
> } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>
> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>
> // At this point, "commit == slot->tail - 1" in this nest level.
> // Only in the outermost (non-nested) case, "commit == slot->commit".
> if (commit != slot->commit)
>         return; // do nothing. it must be updated by the outermost one.
>
> // catch up commit to tail.
> do {
>         commit = READ_ONCE(slot->tail);
>         WRITE_ONCE(slot->commit, commit);
>         // note that someone can interrupt this loop too.
> } while (commit != READ_ONCE(slot->tail));
> ---

Yes, I got it: push can only happen on the local node, and the first try has
the right to extend 'commit'. It does resolve nested push attempts, and
preemption must be disabled.
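
If I read it right, the whole push path would then look roughly like the
sketch below. This is an untested sketch assuming the new 'commit' field in
struct objpool_slot (not in the posted patch); I also used smp_store_release()
for the commit update so that a pop() on another CPU observes the entries[]
write before it sees the new commit value, where your sketch used WRITE_ONCE():

---
static inline int objpool_try_add_slot(void *obj, struct objpool_slot *slot)
{
	uint32_t tail, commit, next;

	/* reserve one entry; tail is only advanced by pushes on this
	 * CPU, possibly nested ones from irq context */
	do {
		commit = tail = READ_ONCE(slot->tail);
		next = tail + 1;
	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));

	WRITE_ONCE(slot->entries[tail & slot->mask], obj);

	/* nested push: the interrupted outermost push will catch
	 * slot->commit up to slot->tail after we return */
	if (commit != READ_ONCE(slot->commit))
		return 0;

	/* outermost push: catch commit up to tail, re-checking in case
	 * a nested push advanced tail while we updated commit */
	do {
		commit = READ_ONCE(slot->tail);
		smp_store_release(&slot->commit, commit);
	} while (commit != READ_ONCE(slot->tail));

	return 0;
}
---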

The overflow/ABA issue can still happen if the cpu is interrupted for a
long time.

> For the rethook this may be too much.
>
> Thank you,
>
>>
>>
>> Case 2:
>>
>> NODE 1:                                    NODE 2:
>> push to slot w/ 1 obj                      pop will get wrong value
>>
>>                                            try_cmpxchg_release(&slot->head, &head, head + 1)
>> try_cmpxchg_acquire(&slot->tail, &tail, next)
>> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>                                            return slot->entries[head & slot->mask]

The pre-condition should be: CPU 1 tries to push to a full slot; in this case
tail == head + capacity, but tail & mask == head & mask.
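
For example, with capacity = 4 (mask = 3), a full slot can have head = 4 and
tail = 8: tail == head + capacity, and tail & 3 == head & 3 == 0, so the
push's WRITE_ONCE() lands exactly on the entry the concurrent pop is about
to return.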

>>
>> Regards,
>> wuqiang
>>
>> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
>>> [...]

2023-10-13 02:01:41

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

On Fri, 13 Oct 2023 01:36:05 +0800
"wuqiang.matt" <[email protected]> wrote:

> On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> >
> > On Mon, 9 Oct 2023 17:23:34 +0800
> > wuqiang <[email protected]> wrote:
> >
> >> Hello Masami,
> >>
> >> Just got time for the new patch and saw that ages[] was removed. ages[] is
> >> introduced like a 2-phase commit to keep consistency and must be kept.
> >>
> >> Think of the following 2 cases where two cpu nodes are operating on the same
> >> objpool_slot simultaneously:
> >>
> >> Case 1:
> >>
> >> NODE 1:                                    NODE 2:
> >> push to an empty slot                      pop will get wrong value
> >>
> >> try_cmpxchg_acquire(&slot->tail, &tail, next)
> >>                                            try_cmpxchg_release(&slot->head, &head, head + 1)
> >>                                            return slot->entries[head & slot->mask]
> >> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> >
> > Today, I rethink the algorithm.
> >
> > For this case, we can just add a `commit` field to the slot for committing
> > the tail update instead of using the ages[].
> >
> > CPU1                                        CPU2
> > push to an empty slot                       pop from the same slot
> >
> > commit = tail = slot->tail;
> > next = tail + 1;
> > try_cmpxchg_acquire(&slot->tail,
> >                     &tail, next);
> >                                             while (head != commit) -> NG1
> > WRITE_ONCE(slot->entries[tail & slot->mask],
> >            obj)
> >                                             while (head != commit) -> NG2
> > WRITE_ONCE(slot->commit, next);
> >                                             while (head != commit) -> OK
> >                                             try_cmpxchg_acquire(&slot->head,
> >                                                                 &head, next);
> >                                             return slot->entries[head & slot->mask]
> >
> > So at the NG1 and NG2 timings, the pop will fail.
> >
> > This does not cover the nested "push" case because the reserve-commit block
> > will run with interrupts disabled. This doesn't support NMI but that is OK.
>
> If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so CPU2
> won't meet the condition "head == commit".

No, it doesn't happen, because pop() may fetch an object from another CPU's
slot, but push() only happens on the CPU's own slot. So push() on the same
slot never runs in parallel.

>
> Since slot->commit is always synced to slot->tail after a successful push,
> should pop check "tail != commit"?

No, pop() only needs to check "head != commit". If "head == commit", this slot
is either empty or the slot owner is trying to push() (not committed yet).

> In that case, when a push is ongoing on
> the slot, pop() has to wait for its completion even if there are objects in
> the same slot.

No, pop() will go to the other slots to find an available object.
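
To make it concrete, the pop side would only need something like the sketch
below (again just a sketch assuming the 'commit' field; I load commit with
acquire so the pusher's entries[] write is visible before we trust the
counter, pairing with a release on the commit update). objpool_pop() itself
keeps scanning all the per-cpu slots as in the patch:

---
static inline void *objpool_try_get_slot(struct objpool_slot *slot)
{
	uint32_t head = smp_load_acquire(&slot->head);

	/* head == commit means the slot is empty, or the owner has
	 * reserved an entry but not committed it yet */
	while (head != smp_load_acquire(&slot->commit)) {
		/* do prefetching of object ents */
		prefetch(&slot->entries[head & slot->mask]);

		/* commit and move forward head of the slot; on failure
		 * try_cmpxchg() reloads head for the next iteration */
		if (try_cmpxchg_release(&slot->head, &head, head + 1))
			return slot->entries[head & slot->mask];
	}

	/* caller falls back to the other CPUs' slots */
	return NULL;
}
---

So while the owner is in the middle of a push, pop() just treats that slot as
empty and moves on to the next CPU's slot.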

>
> > If we need to support such NMI or remove the irq-disable, we can do the
> > following (please remember the push operation only happens on the slot
> > owned by that CPU, so this is a per-cpu process):
> >
> > ---
> > do {
> >         commit = tail = slot->tail;
> >         next = tail + 1;
> > } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> >
> > WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >
> > // At this point, "commit == slot->tail - 1" in this nest level.
> > // Only in the outermost (non-nested) case, "commit == slot->commit".
> > if (commit != slot->commit)
> >         return; // do nothing. it must be updated by the outermost one.
> >
> > // catch up commit to tail.
> > do {
> >         commit = READ_ONCE(slot->tail);
> >         WRITE_ONCE(slot->commit, commit);
> >         // note that someone can interrupt this loop too.
> > } while (commit != READ_ONCE(slot->tail));
> > ---
>
> Yes, I got it: push can only happen on the local node, and the first try has
> the right to extend 'commit'. It does resolve nested push attempts, and
> preemption must be disabled.

Yes.

>
> The overflow/ABA issue can still happen if the cpu is interrupted for a
> long time.

Yes, but ages[] also cannot avoid that. (Note that head, tail and commit are
32-bit counters, and the per-cpu slot ring is big enough to store the
pointers of all objects.)
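
(Rough numbers: wrapping a 32-bit counter needs 2^32 push/pop operations on
one slot. Even at an optimistic 10^9 operations per second, that is
2^32 / 10^9 ~= 4.3 seconds, so the interrupted CPU would have to stall for
seconds inside a window of a few instructions before the ABA case could hit.)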

Thank you,

>
> > For the rethook this may be too much.
> >
> > Thank you,
> >
> >>
> >>
> >> Case 2:
> >>
> >> NODE 1:                                    NODE 2:
> >> push to slot w/ 1 obj                      pop will get wrong value
> >>
> >>                                            try_cmpxchg_release(&slot->head, &head, head + 1)
> >> try_cmpxchg_acquire(&slot->tail, &tail, next)
> >> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> >>                                            return slot->entries[head & slot->mask]
>
> The pre-condition should be: CPU 1 tries to push to a full slot; in this case
> tail == head + capacity, but tail & mask == head & mask.
>
> >>
> >> Regards,
> >> wuqiang
> >>
> >> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> >>> Hi Wuqiang,
> >>>
> >>> On Tue, 5 Sep 2023 09:52:51 +0800
> >>> "wuqiang.matt" <[email protected]> wrote:
> >>>
> >>>> The object pool is a scalable implementaion of high performance queue
> >>>> for object allocation and reclamation, such as kretprobe instances.
> >>>>
> >>>> With leveraging percpu ring-array to mitigate the hot spot of memory
> >>>> contention, it could deliver near-linear scalability for high parallel
> >>>> scenarios. The objpool is best suited for following cases:
> >>>> 1) Memory allocation or reclamation are prohibited or too expensive
> >>>> 2) Consumers are of different priorities, such as irqs and threads
> >>>>
> >>>> Limitations:
> >>>> 1) Maximum objects (capacity) is determined during pool initializing
> >>>> and can't be modified (extended) after objpool creation
> >>>> 2) The memory of objects won't be freed until objpool is finalized
> >>>> 3) Object allocation (pop) may fail after trying all cpu slots
> >>>
> >>> I made a simplifying patch on this by (mainly) removing ages array.
> >>> I also rename local variable to use more readable names, like slot,
> >>> pool, and obj.
> >>>
> >>> Here the results which I run the test_objpool.ko.
> >>>
> >>> Original:
> >>> [ 50.500235] Summary of testcases:
> >>> [ 50.503139] duration: 1027135us hits: 30628293 miss: 0 sync: percpu objpool
> >>> [ 50.510416] duration: 1047977us hits: 30453023 miss: 0 sync: percpu objpool from vmalloc
> >>> [ 50.517421] duration: 1047098us hits: 31145034 miss: 0 sync & hrtimer: percpu objpool
> >>> [ 50.524671] duration: 1053233us hits: 30919640 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> >>> [ 50.531382] duration: 1055822us hits: 3407221 miss: 830219 sync overrun: percpu objpool
> >>> [ 50.538135] duration: 1055998us hits: 3404624 miss: 854160 sync overrun: percpu objpool from vmalloc
> >>> [ 50.546686] duration: 1046104us hits: 19464798 miss: 0 async: percpu objpool
> >>> [ 50.552989] duration: 1033054us hits: 18957836 miss: 0 async: percpu objpool from vmalloc
> >>> [ 50.560289] duration: 1041778us hits: 33806868 miss: 0 async & hrtimer: percpu objpool
> >>> [ 50.567425] duration: 1048901us hits: 34211860 miss: 0 async & hrtimer: percpu objpool from vmalloc
> >>>
> >>> Simplified:
> >>> [ 48.393236] Summary of testcases:
> >>> [ 48.395384] duration: 1013002us hits: 29661448 miss: 0 sync: percpu objpool
> >>> [ 48.400351] duration: 1057231us hits: 31035578 miss: 0 sync: percpu objpool from vmalloc
> >>> [ 48.405660] duration: 1043196us hits: 30546652 miss: 0 sync & hrtimer: percpu objpool
> >>> [ 48.411216] duration: 1047128us hits: 30601306 miss: 0 sync & hrtimer: percpu objpool from vmalloc
> >>> [ 48.417231] duration: 1051695us hits: 3468287 miss: 892881 sync overrun: percpu objpool
> >>> [ 48.422405] duration: 1054003us hits: 3549491 miss: 898141 sync overrun: percpu objpool from vmalloc
> >>> [ 48.428425] duration: 1052946us hits: 19005228 miss: 0 async: percpu objpool
> >>> [ 48.433597] duration: 1051757us hits: 19670866 miss: 0 async: percpu objpool from vmalloc
> >>> [ 48.439280] duration: 1042951us hits: 37135332 miss: 0 async & hrtimer: percpu objpool
> >>> [ 48.445085] duration: 1029803us hits: 37093248 miss: 0 async & hrtimer: percpu objpool from vmalloc
> >>>
> >>> Can you test it too?
> >>>
> >>> Thanks,
> >>>
> >>> From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> >>> From: "Masami Hiramatsu (Google)" <[email protected]>
> >>> Date: Mon, 25 Sep 2023 16:07:12 +0900
> >>> Subject: [PATCH] objpool: Simplify objpool by removing ages array
> >>>
> >>> Simplify the objpool code by removing ages array from per-cpu slot.
> >>> It chooses enough big capacity (which is a rounded up power of 2 value
> >>> of nr_objects + 1) for the entries array, the tail never catch up to
> >>> the head in per-cpu slot. Thus tail == head means the slot is empty.
> >>>
> >>> This also uses consistent local variable names for pool, slot and obj.
> >>>
> >>> Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
> >>> ---
> >>> include/linux/objpool.h | 61 ++++----
> >>> lib/objpool.c | 310 ++++++++++++++++------------------------
> >>> 2 files changed, 147 insertions(+), 224 deletions(-)
> >>>
> >>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> >>> index 33c832216b98..ecd5ecaffcd3 100644
> >>> --- a/include/linux/objpool.h
> >>> +++ b/include/linux/objpool.h
> >>> @@ -38,33 +38,23 @@
> >>> * struct objpool_slot - percpu ring array of objpool
> >>> * @head: head of the local ring array (to retrieve at)
> >>> * @tail: tail of the local ring array (to append at)
> >>> - * @bits: log2 of capacity (for bitwise operations)
> >>> - * @mask: capacity - 1
> >>> + * @mask: capacity of entries - 1
> >>> + * @entries: object entries on this slot.
> >>> *
> >>> * Represents a cpu-local array-based ring buffer, its size is specialized
> >>> * during initialization of object pool. The percpu objpool slot is to be
> >>> * allocated from local memory for NUMA system, and to be kept compact in
> >>> - * continuous memory: ages[] is stored just after the body of objpool_slot,
> >>> - * and then entries[]. ages[] describes revision of each item, solely used
> >>> - * to avoid ABA; entries[] contains pointers of the actual objects
> >>> - *
> >>> - * Layout of objpool_slot in memory:
> >>> - *
> >>> - * 64bit:
> >>> - * 4 8 12 16 32 64
> >>> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> >>> - *
> >>> - * 32bit:
> >>> - * 4 8 12 16 32 48 64
> >>> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> >>> + * continuous memory: the CPU-assigned objects are stored just after
> >>> + * the body of objpool_slot.
> >>> *
> >>> */
> >>> struct objpool_slot {
> >>> - uint32_t head;
> >>> - uint32_t tail;
> >>> - uint32_t bits;
> >>> - uint32_t mask;
> >>> -} __packed;
> >>> + uint32_t head;
> >>> + uint32_t tail;
> >>> + uint32_t mask;
> >>> + uint32_t dummyr;
> >>> + void * entries[];
> >>> +};
> >>>
> >>> struct objpool_head;
> >>>
> >>> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> >>> * @obj_size: object & element size
> >>> * @nr_objs: total objs (to be pre-allocated)
> >>> * @nr_cpus: nr_cpu_ids
> >>> - * @capacity: max objects per cpuslot
> >>> + * @capacity: max objects on percpu slot
> >>> * @gfp: gfp flags for kmalloc & vmalloc
> >>> * @ref: refcount for objpool
> >>> * @flags: flags for objpool management
> >>> * @cpu_slots: array of percpu slots
> >>> - * @slot_sizes: size in bytes of slots
> >>> * @release: resource cleanup callback
> >>> * @context: caller-provided context
> >>> */
> >>> @@ -100,7 +89,6 @@ struct objpool_head {
> >>> refcount_t ref;
> >>> unsigned long flags;
> >>> struct objpool_slot **cpu_slots;
> >>> - int *slot_sizes;
> >>> objpool_fini_cb release;
> >>> void *context;
> >>> };
> >>> @@ -108,9 +96,12 @@ struct objpool_head {
> >>> #define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
> >>> #define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
> >>>
> >>> +#define OBJPOOL_NR_OBJECT_MAX (1 << 24)
> >>> +#define OBJPOOL_OBJECT_SIZE_MAX (1 << 16)
> >>> +
> >>> /**
> >>> * objpool_init() - initialize objpool and pre-allocated objects
> >>> - * @head: the object pool to be initialized, declared by caller
> >>> + * @pool: the object pool to be initialized, declared by caller
> >>> * @nr_objs: total objects to be pre-allocated by this object pool
> >>> * @object_size: size of an object (should be > 0)
> >>> * @gfp: flags for memory allocation (via kmalloc or vmalloc)
> >>> @@ -128,47 +119,47 @@ struct objpool_head {
> >>> * pop (object allocation) or do clearance before each push (object
> >>> * reclamation).
> >>> */
> >>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >>> gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >>> objpool_fini_cb release);
> >>>
> >>> /**
> >>> * objpool_pop() - allocate an object from objpool
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>> *
> >>> * return value: object ptr or NULL if failed
> >>> */
> >>> -void *objpool_pop(struct objpool_head *head);
> >>> +void *objpool_pop(struct objpool_head *pool);
> >>>
> >>> /**
> >>> * objpool_push() - reclaim the object and return back to objpool
> >>> * @obj: object ptr to be pushed to objpool
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>> *
> >>> * return: 0 or error code (it fails only when user tries to push
> >>> * the same object multiple times or wrong "objects" into objpool)
> >>> */
> >>> -int objpool_push(void *obj, struct objpool_head *head);
> >>> +int objpool_push(void *obj, struct objpool_head *pool);
> >>>
> >>> /**
> >>> * objpool_drop() - discard the object and deref objpool
> >>> * @obj: object ptr to be discarded
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>> *
> >>> * return: 0 if objpool was released or error code
> >>> */
> >>> -int objpool_drop(void *obj, struct objpool_head *head);
> >>> +int objpool_drop(void *obj, struct objpool_head *pool);
> >>>
> >>> /**
> >>> * objpool_free() - release objpool forcely (all objects to be freed)
> >>> - * @head: object pool to be released
> >>> + * @pool: object pool to be released
> >>> */
> >>> -void objpool_free(struct objpool_head *head);
> >>> +void objpool_free(struct objpool_head *pool);
> >>>
> >>> /**
> >>> * objpool_fini() - deref object pool (also releasing unused objects)
> >>> - * @head: object pool to be dereferenced
> >>> + * @pool: object pool to be dereferenced
> >>> */
> >>> -void objpool_fini(struct objpool_head *head);
> >>> +void objpool_fini(struct objpool_head *pool);
> >>>
> >>> #endif /* _LINUX_OBJPOOL_H */
> >>> diff --git a/lib/objpool.c b/lib/objpool.c
> >>> index 22e752371820..f8e8f70d7253 100644
> >>> --- a/lib/objpool.c
> >>> +++ b/lib/objpool.c
> >>> @@ -15,104 +15,55 @@
> >>> * Copyright: [email protected]
> >>> */
> >>>
> >>> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> >>> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> >>> - (sizeof(uint32_t) << (s)->bits)))
> >>> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> >>> - ((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> >>> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> >>> -
> >>> -/* compute the suitable num of objects to be managed per slot */
> >>> -static int objpool_nobjs(int size)
> >>> -{
> >>> - return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> >>> - (sizeof(uint32_t) + sizeof(void *)));
> >>> -}
> >>> -
> >>> /* allocate and initialize percpu slots */
> >>> static int
> >>> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >>> - void *context, objpool_init_obj_cb objinit)
> >>> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> >>> + objpool_init_obj_cb objinit)
> >>> {
> >>> - int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> >>> -
> >>> - /* aligned object size by sizeof(void *) */
> >>> - objsz = ALIGN(head->obj_size, sizeof(void *));
> >>> - /* shall we allocate objects along with percpu-slot */
> >>> - if (objsz)
> >>> - head->flags |= OBJPOOL_HAVE_OBJECTS;
> >>> -
> >>> - /* vmalloc is used in default to allocate percpu-slots */
> >>> - if (!(head->gfp & GFP_ATOMIC))
> >>> - head->flags |= OBJPOOL_FROM_VMALLOC;
> >>> -
> >>> - for (i = 0; i < head->nr_cpus; i++) {
> >>> - struct objpool_slot *os;
> >>> + int i, j, n, size, slot_size, cpu_count = 0;
> >>> + struct objpool_slot *slot;
> >>>
> >>> + for (i = 0; i < pool->nr_cpus; i++) {
> >>> /* skip the cpus which could never be present */
> >>> if (!cpu_possible(i))
> >>> continue;
> >>>
> >>> /* compute how many objects to be managed by this slot */
> >>> - n = nobjs / num_possible_cpus();
> >>> - if (cpu < (nobjs % num_possible_cpus()))
> >>> + n = pool->nr_objs / num_possible_cpus();
> >>> + if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> >>> n++;
> >>> - size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> >>> - sizeof(uint32_t) * nents + objsz * n;
> >>> + cpu_count++;
> >>> +
> >>> + slot_size = struct_size(slot, entries, pool->capacity);
> >>> + size = slot_size + pool->obj_size * n;
> >>>
> >>> /*
> >>> * here we allocate percpu-slot & objects together in a single
> >>> - * allocation, taking advantage of warm caches and TLB hits as
> >>> - * vmalloc always aligns the request size to pages
> >>> +	 * allocation, taking advantage of NUMA.
> >>> */
> >>> - if (head->flags & OBJPOOL_FROM_VMALLOC)
> >>> - os = __vmalloc_node(size, sizeof(void *), head->gfp,
> >>> + if (pool->flags & OBJPOOL_FROM_VMALLOC)
> >>> + slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> >>> cpu_to_node(i), __builtin_return_address(0));
> >>> else
> >>> - os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> >>> - if (!os)
> >>> + slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> >>> + if (!slot)
> >>> return -ENOMEM;
> >>>
> >>> /* initialize percpu slot for the i-th slot */
> >>> - memset(os, 0, size);
> >>> - os->bits = ilog2(head->capacity);
> >>> - os->mask = head->capacity - 1;
> >>> - head->cpu_slots[i] = os;
> >>> - head->slot_sizes[i] = size;
> >>> - cpu = cpu + 1;
> >>> -
> >>> - /*
> >>> - * manually set head & tail to avoid possible conflict:
> >>> - * We assume that the head item is ready for retrieval
> >>> - * iff head is equal to ages[head & mask]. but ages is
> >>> - * initialized as 0, so in view of the caller of pop(),
> >>> - * the 1st item (0th) is always ready, but the reality
> >>> - * could be: push() is stalled before the final update,
> >>> - * thus the item being inserted will be lost forever
> >>> - */
> >>> - os->head = os->tail = head->capacity;
> >>> -
> >>> - if (!objsz)
> >>> - continue;
> >>> + memset(slot, 0, size);
> >>> + slot->mask = pool->capacity - 1;
> >>> + pool->cpu_slots[i] = slot;
> >>>
> >>> for (j = 0; j < n; j++) {
> >>> - uint32_t *ages = SLOT_AGES(os);
> >>> - void **ents = SLOT_ENTS(os);
> >>> - void *obj = SLOT_OBJS(os) + j * objsz;
> >>> - uint32_t ie = os->tail & os->mask;
> >>> + void *obj = (void *)slot + slot_size + pool->obj_size * j;
> >>>
> >>> - /* perform object initialization */
> >>> if (objinit) {
> >>> int rc = objinit(obj, context);
> >>> if (rc)
> >>> return rc;
> >>> }
> >>> -
> >>> - /* add obj into the ring array */
> >>> - ents[ie] = obj;
> >>> - ages[ie] = os->tail;
> >>> - os->tail++;
> >>> - head->nr_objs++;
> >>> + slot->entries[j] = obj;
> >>> + slot->tail++;
> >>> }
> >>> }
> >>>
> >>> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >>> }
> >>>
> >>> /* cleanup all percpu slots of the object pool */
> >>> -static void objpool_fini_percpu_slots(struct objpool_head *head)
> >>> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> >>> {
> >>> int i;
> >>>
> >>> - if (!head->cpu_slots)
> >>> + if (!pool->cpu_slots)
> >>> return;
> >>>
> >>> - for (i = 0; i < head->nr_cpus; i++) {
> >>> - if (!head->cpu_slots[i])
> >>> - continue;
> >>> - if (head->flags & OBJPOOL_FROM_VMALLOC)
> >>> - vfree(head->cpu_slots[i]);
> >>> - else
> >>> - kfree(head->cpu_slots[i]);
> >>> - }
> >>> - kfree(head->cpu_slots);
> >>> - head->cpu_slots = NULL;
> >>> - head->slot_sizes = NULL;
> >>> + for (i = 0; i < pool->nr_cpus; i++)
> >>> + kvfree(pool->cpu_slots[i]);
> >>> + kfree(pool->cpu_slots);
> >>> }
> >>>
> >>> /* initialize object pool and pre-allocate objects */
> >>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >>> gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >>> objpool_fini_cb release)
> >>> {
> >>> int nents, rc;
> >>>
> >>> /* check input parameters */
> >>> - if (nr_objs <= 0 || object_size <= 0)
> >>> + if (nr_objs <= 0 || object_size <= 0 ||
> >>> + nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> >>> + return -EINVAL;
> >>> +
> >>> + /* Align up to unsigned long size */
> >>> + object_size = ALIGN(object_size, sizeof(unsigned long));
> >>> +
> >>> + /*
> >>> + * To avoid filling up the entries array in the per-cpu slot,
> >>> +	 * use a power of 2 which is more than N + 1. Thus, the tail never
> >>> +	 * catches up to the head, and head/tail can be used as sequential
> >>> +	 * numbers.
> >>> + */
> >>> + nents = roundup_pow_of_two(nr_objs + 1);
> >>> + if (!nents)
> >>> return -EINVAL;
> >>>
> >>> - /* calculate percpu slot size (rounded to pow of 2) */
> >>> - nents = max_t(int, roundup_pow_of_two(nr_objs),
> >>> - objpool_nobjs(L1_CACHE_BYTES));
> >>> -
> >>> - /* initialize objpool head */
> >>> - memset(head, 0, sizeof(struct objpool_head));
> >>> - head->nr_cpus = nr_cpu_ids;
> >>> - head->obj_size = object_size;
> >>> - head->capacity = nents;
> >>> - head->gfp = gfp & ~__GFP_ZERO;
> >>> - head->context = context;
> >>> - head->release = release;
> >>> -
> >>> - /* allocate array for percpu slots */
> >>> - head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> >>> - head->nr_cpus * sizeof(int), head->gfp);
> >>> - if (!head->cpu_slots)
> >>> +	/* initialize the objpool head */
> >>> + memset(pool, 0, sizeof(struct objpool_head));
> >>> + pool->nr_cpus = nr_cpu_ids;
> >>> + pool->obj_size = object_size;
> >>> + pool->nr_objs = nr_objs;
> >>> +	/* just to prevent the per-cpu ring array from being fully filled */
> >>> + pool->capacity = nents;
> >>> + pool->gfp = gfp & ~__GFP_ZERO;
> >>> + pool->context = context;
> >>> + pool->release = release;
> >>> +	/* vmalloc is used by default to allocate the percpu slots */
> >>> + if (!(pool->gfp & GFP_ATOMIC))
> >>> + pool->flags |= OBJPOOL_FROM_VMALLOC;
> >>> +
> >>> + pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> >>> + if (!pool->cpu_slots)
> >>> return -ENOMEM;
> >>> - head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >>>
> >>> /* initialize per-cpu slots */
> >>> - rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> >>> + rc = objpool_init_percpu_slots(pool, context, objinit);
> >>> if (rc)
> >>> - objpool_fini_percpu_slots(head);
> >>> + objpool_fini_percpu_slots(pool);
> >>> else
> >>> - refcount_set(&head->ref, nr_objs + 1);
> >>> + refcount_set(&pool->ref, nr_objs + 1);
> >>>
> >>> return rc;
> >>> }
> >>> EXPORT_SYMBOL_GPL(objpool_init);
> >>>
> >>> /* adding object to slot, abort if the slot was already full */
> >>> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> >>> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> >>> {
> >>> - uint32_t *ages = SLOT_AGES(os);
> >>> - void **ents = SLOT_ENTS(os);
> >>> - uint32_t head, tail;
> >>> + struct objpool_slot *slot = pool->cpu_slots[cpu];
> >>> + uint32_t tail, next;
> >>>
> >>> do {
> >>> - /* perform memory loading for both head and tail */
> >>> - head = READ_ONCE(os->head);
> >>> - tail = READ_ONCE(os->tail);
> >>> - /* just abort if slot is full */
> >>> - if (tail - head > os->mask)
> >>> - return -ENOENT;
> >>> - /* try to extend tail by 1 using CAS to avoid races */
> >>> - if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> >>> - break;
> >>> - } while (1);
> >>> + uint32_t head = READ_ONCE(slot->head);
> >>>
> >>> - /* the tail-th of slot is reserved for the given obj */
> >>> - WRITE_ONCE(ents[tail & os->mask], obj);
> >>> - /* update epoch id to make this object available for pop() */
> >>> - smp_store_release(&ages[tail & os->mask], tail);
> >>> + tail = READ_ONCE(slot->tail);
> >>> + next = tail + 1;
> >>> +
> >>> + /* This must never happen because capacity >= N + 1 */
> >>> + if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> >>> + (next < head && next > head + pool->nr_objs)))
> >>> + return -EINVAL;
> >>> +
> >>> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> >>> +
> >>> + WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >>> return 0;
> >>> }
> >>>
> >>> /* reclaim an object to object pool */
> >>> -int objpool_push(void *obj, struct objpool_head *oh)
> >>> +int objpool_push(void *obj, struct objpool_head *pool)
> >>> {
> >>> unsigned long flags;
> >>> - int cpu, rc;
> >>> + int rc;
> >>>
> >>> /* disable local irq to avoid preemption & interruption */
> >>> raw_local_irq_save(flags);
> >>> - cpu = raw_smp_processor_id();
> >>> - do {
> >>> - rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> >>> - if (!rc)
> >>> - break;
> >>> - cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> >>> - } while (1);
> >>> + rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >>> raw_local_irq_restore(flags);
> >>>
> >>> return rc;
> >>> }
> >>> EXPORT_SYMBOL_GPL(objpool_push);
> >>>
> >>> -/* drop the allocated object, rather reclaim it to objpool */
> >>> -int objpool_drop(void *obj, struct objpool_head *head)
> >>> -{
> >>> - if (!obj || !head)
> >>> - return -EINVAL;
> >>> -
> >>> - if (refcount_dec_and_test(&head->ref)) {
> >>> - objpool_free(head);
> >>> - return 0;
> >>> - }
> >>> -
> >>> - return -EAGAIN;
> >>> -}
> >>> -EXPORT_SYMBOL_GPL(objpool_drop);
> >>> -
> >>> /* try to retrieve object from slot */
> >>> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> >>> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> >>> {
> >>> - uint32_t *ages = SLOT_AGES(os);
> >>> - void **ents = SLOT_ENTS(os);
> >>> /* do memory load of head to local head */
> >>> - uint32_t head = smp_load_acquire(&os->head);
> >>> + uint32_t head = smp_load_acquire(&slot->head);
> >>>
> >>> /* loop if slot isn't empty */
> >>> - while (head != READ_ONCE(os->tail)) {
> >>> - uint32_t id = head & os->mask, prev = head;
> >>> + while (head != READ_ONCE(slot->tail)) {
> >>>
> >>> /* do prefetching of object ents */
> >>> - prefetch(&ents[id]);
> >>> -
> >>> - /* check whether this item was ready for retrieval */
> >>> - if (smp_load_acquire(&ages[id]) == head) {
> >>> - /* node must have been udpated by push() */
> >>> - void *node = READ_ONCE(ents[id]);
> >>> - /* commit and move forward head of the slot */
> >>> - if (try_cmpxchg_release(&os->head, &head, head + 1))
> >>> - return node;
> >>> - /* head was already updated by others */
> >>> - }
> >>> + prefetch(&slot->entries[head & slot->mask]);
> >>> +
> >>> + /* commit and move forward head of the slot */
> >>> + if (try_cmpxchg_release(&slot->head, &head, head + 1))
> >>> +			/*
> >>> +			 * TBD: check for wraparound of the tail/head counters
> >>> +			 * and warn if broken. But this happens only if this
> >>> +			 * process stalls for a long time while another CPU
> >>> +			 * updates the head/tail just 2^32 + 1 times, and this
> >>> +			 * slot is empty.
> >>> +			 */
> >>> + return slot->entries[head & slot->mask];
> >>>
> >>> /* re-load head from memory and continue trying */
> >>> - head = READ_ONCE(os->head);
> >>> - /*
> >>> - * head stays unchanged, so it's very likely there's an
> >>> - * ongoing push() on other cpu nodes but yet not update
> >>> - * ages[] to mark it's completion
> >>> - */
> >>> - if (head == prev)
> >>> - break;
> >>> + head = READ_ONCE(slot->head);
> >>> }
> >>>
> >>> return NULL;
> >>> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> >>> EXPORT_SYMBOL_GPL(objpool_pop);
> >>>
> >>> /* release whole objpool forcely */
> >>> -void objpool_free(struct objpool_head *head)
> >>> +void objpool_free(struct objpool_head *pool)
> >>> {
> >>> - if (!head->cpu_slots)
> >>> + if (!pool->cpu_slots)
> >>> return;
> >>>
> >>> /* release percpu slots */
> >>> - objpool_fini_percpu_slots(head);
> >>> + objpool_fini_percpu_slots(pool);
> >>>
> >>> /* call user's cleanup callback if provided */
> >>> - if (head->release)
> >>> - head->release(head, head->context);
> >>> + if (pool->release)
> >>> + pool->release(pool, pool->context);
> >>> }
> >>> EXPORT_SYMBOL_GPL(objpool_free);
> >>>
> >>> -/* drop unused objects and defref objpool for releasing */
> >>> -void objpool_fini(struct objpool_head *head)
> >>> +/* drop the allocated object, rather than reclaiming it to the objpool */
> >>> +int objpool_drop(void *obj, struct objpool_head *pool)
> >>> {
> >>> - void *nod;
> >>> + if (!obj || !pool)
> >>> + return -EINVAL;
> >>>
> >>> - do {
> >>> - /* grab object from objpool and drop it */
> >>> - nod = objpool_pop(head);
> >>> + if (refcount_dec_and_test(&pool->ref)) {
> >>> + objpool_free(pool);
> >>> + return 0;
> >>> + }
> >>> +
> >>> + return -EAGAIN;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(objpool_drop);
> >>> +
> >>> +/* drop unused objects and deref objpool for releasing */
> >>> +void objpool_fini(struct objpool_head *pool)
> >>> +{
> >>> + void *obj;
> >>>
> >>> - /* drop the extra ref of objpool */
> >>> - if (refcount_dec_and_test(&head->ref))
> >>> - objpool_free(head);
> >>> - } while (nod);
> >>> + /* grab object from objpool and drop it */
> >>> + while ((obj = objpool_pop(pool)))
> >>> + objpool_drop(obj, pool);
> >>> }
> >>> EXPORT_SYMBOL_GPL(objpool_fini);
> >>
> >
> >
>


--
Masami Hiramatsu (Google) <[email protected]>

2023-10-13 03:03:48

by wuqiang.matt

[permalink] [raw]
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC

On 2023/10/13 09:59, Masami Hiramatsu (Google) wrote:
> On Fri, 13 Oct 2023 01:36:05 +0800
> "wuqiang.matt" <[email protected]> wrote:
>
>> On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
>>> Hi Wuqiang,
>>>
>>> On Mon, 9 Oct 2023 17:23:34 +0800
>>> wuqiang <[email protected]> wrote:
>>>
>>>> Hello Masami,
>>>>
>>>> Just got time for the new patch and noticed that ages[] was removed. ages[] is
>>>> introduced like a 2-phase commit to keep consistency and must be kept.
>>>>
>>>> Consider the following 2 cases where two cpu nodes are operating on the same
>>>> objpool_slot simultaneously:
>>>>
>>>> Case 1:
>>>>
>>>> NODE 1:                                 NODE 2:
>>>> push to an empty slot                   pop will get wrong value
>>>>
>>>> try_cmpxchg_acquire(&slot->tail, &tail, next)
>>>>                                         try_cmpxchg_release(&slot->head, &head, head + 1)
>>>>                                         return slot->entries[head & slot->mask]
>>>> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
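
(That is, NODE 2 wins the head cmpxchg and returns entries[head & mask]
before NODE 1's WRITE_ONCE of the new entry lands, so the pop hands back a
stale pointer.)
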
>>>
>>> Today, I rethink the algorithm.
>>>
>>> For this case, we can just add a `commit` counter to the slot for committing
>>> the tail, instead of the ages[].
>>>
>>> CPU1                                     CPU2
>>> push to an empty slot                    pop from the same slot
>>>
>>> commit = tail = slot->tail;
>>> next = tail + 1;
>>> try_cmpxchg_acquire(&slot->tail,
>>>                     &tail, next);
>>>                                          while (head != commit) -> NG1
>>> WRITE_ONCE(slot->entries[tail & slot->mask],
>>>            obj)
>>>                                          while (head != commit) -> NG2
>>> WRITE_ONCE(slot->commit, next);
>>>                                          while (head != commit) -> OK
>>>                                          try_cmpxchg_acquire(&slot->head,
>>>                                                              &head, next);
>>>                                          return slot->entries[head & slot->mask]
>>>
>>> So at the NG1 and NG2 timings, the pop will fail.
>>>
>>> This does not need to handle the nested "push" case because the reserve-commit
>>> block runs with interrupts disabled. This doesn't support NMI but that is OK.
>>
>> If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so CPU2
>> won't meet the condition "head == commit".
>
> No, that doesn't happen, because pop() may fetch an object from another CPU's
> slot, but push() only happens on the CPU's own slot. So pushes to the same slot
> never run in parallel.

I got it: 'commit' is now the actual tail. Then "head != commit" means there
are free objects ready for pop.

Great idea indeed. I'll give it a try.
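
Just to make sure I've read the idea correctly, the non-nested push/pop pair
would then look roughly like below. This is only a sketch of my understanding
(it assumes a new 'commit' member appended to struct objpool_slot, push
running with local irqs disabled, and no NMI nesting; untested):

---
/* push side: only ever runs on the local CPU's own slot */
static int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
{
	struct objpool_slot *slot = pool->cpu_slots[cpu];
	uint32_t tail, next;

	do {
		tail = READ_ONCE(slot->tail);
		next = tail + 1;
	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));

	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
	/* publish the entry; pairs with smp_load_acquire() in pop */
	smp_store_release(&slot->commit, next);
	return 0;
}

/* pop side: may race with pops on other CPUs, so cmpxchg the head */
static void *objpool_try_get_slot(struct objpool_slot *slot)
{
	uint32_t head = smp_load_acquire(&slot->head);

	while (head != smp_load_acquire(&slot->commit)) {
		void *obj = READ_ONCE(slot->entries[head & slot->mask]);

		/* only the winner of this cmpxchg owns the entry at head */
		if (try_cmpxchg_release(&slot->head, &head, head + 1))
			return obj;
		/* head was moved by another pop(); retry with the new head */
	}
	return NULL;
}
---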

>> Since slot->commit is always synced to slot->tail after a successful push,
>> should pop check "tail != commit"?
>
> No, pop() only needs to check "head != commit". If "head == commit", this slot
> is empty or the slot owner is trying to push() (not committed yet).
>
>> In this case, when a push is ongoing on the slot, pop() has to wait for its
>> completion even if there are objects in the same slot.
>
> No, pop() will go on to other slots to find an available object.
>
>>
>>> If we need to support NMI or remove the irq-disable, we can do the following
>>> (remember that the push operation only happens on the slot owned by that
>>> CPU, so this is a per-cpu process):
>>>
>>> ---
>>> do {
>>> 	commit = tail = slot->tail;
>>> 	next = tail + 1;
>>> } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>>>
>>> WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>
>>> // At this point, "commit == slot->tail - 1" in this nesting level.
>>> // Only in the outermost (non-nested) case, "commit == slot->commit".
>>> if (commit != slot->commit)
>>> 	return; // do nothing. it must be updated by the outermost one.
>>>
>>> // catch up commit to tail.
>>> do {
>>> 	commit = READ_ONCE(slot->tail);
>>> 	WRITE_ONCE(slot->commit, commit);
>>> 	// note that someone can interrupt this loop too.
>>> } while (commit != READ_ONCE(slot->tail));
>>> ---
>>
>> Yes, I got you: push can only happen on the local node, and the first try has
>> the right to extend 'commit'. It does resolve nested push attempts, and
>> preemption must be disabled.
>
> Yes.
>
>>
>> The overflow/ABA issue can still happen if the CPU is interrupted for a
>> long time.
>
> Yes, but ages[] can not avoid that either. (Note that head, tail and commit are
> 32-bit counters, and the per-cpu slot ring is big enough to store the pointers
> of all objects.)
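
Right, and plain unsigned arithmetic keeps working across the wrap point as
long as the true distance stays below 2^32, e.g.:

	uint32_t head = 0xfffffffeU, tail = 2;	/* tail has wrapped past head */
	uint32_t used = tail - head;		/* == 4, still correct */
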
>
> Thank you,
>
>>
>>> For the rethook this may be too much.
>>>
>>> Thank you,
>>>
>>>>
>>>>
>>>> Case 2:
>>>>
>>>> NODE 1:                                 NODE 2:
>>>> push to slot w/ 1 obj                   pop will get wrong value
>>>>
>>>>                                         try_cmpxchg_release(&slot->head, &head, head + 1)
>>>> try_cmpxchg_acquire(&slot->tail, &tail, next)
>>>> WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>>>                                         return slot->entries[head & slot->mask]
>>
>> The pre-condition should be: CPU 1 tries to push to a full slot, in which case
>> tail = head + capacity but tail & mask == head & mask
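
(Concretely: with capacity = 8 and mask = 7, head = 5 and tail = 13 give
tail == head + capacity, and 13 & 7 == 5 & 7 == 5, so the push would
overwrite the very entry the concurrent pop is about to return.)
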
>>
>>>>
>>>> Regards,
>>>> wuqiang
>>>>