2022-11-01 01:56:07

by wuqiang.matt

Subject: [PATCH v3] kprobes,lib: kretprobe scalability improvement

kretprobe is using freelist to manage return-instances, but freelist, as
a LIFO queue based on a singly linked list, scales badly and reduces the
overall throughput of kretprobed routines, especially in high-contention
scenarios.

Here's a typical throughput test of sys_flock (counts in 10 seconds,
measured with perf stat -a -I 10000 -e syscalls:sys_enter_flock):

OS: Debian 10 X86_64, Linux 6.1-rc2
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s

      1X       2X       4X       6X       8X      12X      16X
34762430 36546920 17949900 13101899 12569595 12646601 14729195
     24X      32X      48X      64X      72X      96X     128X
19263546 10102064  8985418 11936495 11493980  7127789  9330985
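
A load generator for such a test could look roughly like the sketch below
(illustrative only; not necessarily the exact program used to produce the
numbers above), with the thread count matching the 1X/2X/... columns:

  #include <pthread.h>
  #include <stdlib.h>
  #include <sys/file.h>
  #include <unistd.h>

  static void *worker(void *arg)
  {
          int fd = (int)(long)arg;

          /* each iteration enters sys_flock twice (lock + unlock) */
          for (;;) {
                  flock(fd, LOCK_EX);
                  flock(fd, LOCK_UN);
          }
          return NULL;
  }

  int main(int argc, char **argv)
  {
          int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
          pthread_t tid;

          for (i = 0; i < nthreads; i++) {
                  char path[] = "/tmp/flock-XXXXXX";
                  int fd = mkstemp(path);

                  if (fd < 0)
                          return 1;
                  unlink(path);
                  pthread_create(&tid, NULL, worker, (void *)(long)fd);
          }
          pause();
          return 0;
  }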

This patch implements a scalable, lockless and NUMA-aware object pool,
which brings near-linear scalability to kretprobed routines. Tests of
kretprobe throughput show an improvement of up to 333.9x over the
original freelist. Here's the comparison:

                  1X         2X         4X         8X        16X
freelist:   34762430   36546920   17949900   12569595   14729195
objpool:    35627544   72182095  144068494  287564688  576903916
                 32X        48X        64X        96X       128X
freelist:   10102064    8985418   11936495    7127789    9330985
objpool:  1158876372 1737828164 2324371724 2380310472 2463182819

Tests on a 96-core ARM64 system show similar results, with the biggest
ratio up to 642.2x:

OS: Debian 10 AARCH64, Linux 6.1-rc2
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s

                  1X         2X         4X         8X        16X
freelist:   17498299   10887037   10224710    8499132    6421751
objpool:    18715726   35549845   71615884  144258971  283707220
                 24X        32X        48X        64X        96X
freelist:    5339868    4819116    3593919    3121575    2687167
objpool:   419830913  571609748  877456139 1143316315 1725668029

The object pool is a scalable implementation of a high-performance queue
for object allocation and reclamation, such as kretprobe instances.

By leveraging per-cpu ring arrays to mitigate the hot spots of memory
contention, it can deliver near-linear scalability for highly parallel
scenarios. The ring array is compactly managed within a single cacheline
in most cases, or within contiguous cachelines if more than 4 instances
are pre-allocated for every core.
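
For illustration, here is a minimal usage sketch of the objpool API added
by this patch (the struct/function names such as my_obj and my_objinit are
hypothetical; only the objpool_* calls come from this patch):

  #include <linux/objpool.h>

  struct my_obj {
          unsigned long data;
  };

  static struct objpool_head pool;

  /* one-time setup of each pre-allocated object */
  static int my_objinit(void *context, void *obj)
  {
          ((struct my_obj *)obj)->data = 0;
          return 0;
  }

  /* pre-allocate 128 objects, spread over per-cpu slots */
  static int my_pool_setup(void)
  {
          return objpool_init(&pool, 128, 128, sizeof(struct my_obj),
                              GFP_KERNEL, NULL, my_objinit, NULL);
  }

  static void my_pool_use(void)
  {
          struct my_obj *obj = objpool_pop(&pool); /* lockless, any context */

          if (!obj)
                  return; /* pool exhausted */
          /* ... use obj ... */
          objpool_push(obj, &pool); /* return it to a per-cpu slot */
  }

  static void my_pool_teardown(void)
  {
          objpool_fini(&pool); /* release slots and pre-allocated objects */
  }

Since objpool_pop() and objpool_push() never allocate memory at runtime,
they stay safe on the hot paths where kretprobe instances are grabbed and
recycled.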

Changes since V2:
1) the percpu-extended version of the freelist is replaced by the new
percpu ring-array. freelist has data contention on freelist_node (refs
and next) even after a node is removed from the freelist, and the node
could easily be polluted (since freelist_node is defined as a union)
2) routines split into objpool.h and objpool.c, the latter moved to lib
3) test module (test_objpool.ko) added to lib for functional testing

Changes since V1:
1) reformatted into a single patch as Masami Hiramatsu suggested
2) use __vmalloc_node to replace vmalloc_node for vmalloc
3) a few minor fixes: typos and coding-style issues

Signed-off-by: wuqiang <[email protected]>
---
include/linux/freelist.h | 129 -----
include/linux/kprobes.h | 9 +-
include/linux/objpool.h | 151 ++++++
include/linux/rethook.h | 15 +-
kernel/kprobes.c | 95 ++--
kernel/trace/fprobe.c | 17 +-
kernel/trace/rethook.c | 80 +--
lib/Kconfig.debug | 11 +
lib/Makefile | 4 +-
lib/objpool.c | 480 ++++++++++++++++++
lib/test_objpool.c | 1031 ++++++++++++++++++++++++++++++++++++++
11 files changed, 1772 insertions(+), 250 deletions(-)
delete mode 100644 include/linux/freelist.h
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c
create mode 100644 lib/test_objpool.c

diff --git a/include/linux/freelist.h b/include/linux/freelist.h
deleted file mode 100644
index fc1842b96469..000000000000
--- a/include/linux/freelist.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
-#ifndef FREELIST_H
-#define FREELIST_H
-
-#include <linux/atomic.h>
-
-/*
- * Copyright: [email protected]
- *
- * A simple CAS-based lock-free free list. Not the fastest thing in the world
- * under heavy contention, but simple and correct (assuming nodes are never
- * freed until after the free list is destroyed), and fairly speedy under low
- * contention.
- *
- * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
- */
-
-struct freelist_node {
- atomic_t refs;
- struct freelist_node *next;
-};
-
-struct freelist_head {
- struct freelist_node *head;
-};
-
-#define REFS_ON_FREELIST 0x80000000
-#define REFS_MASK 0x7FFFFFFF
-
-static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * Since the refcount is zero, and nobody can increase it once it's
- * zero (except us, and we run only one copy of this method per node at
- * a time, i.e. the single thread case), then we know we can safely
- * change the next pointer of the node; however, once the refcount is
- * back above zero, then other threads could increase it (happens under
- * heavy contention, when the refcount goes to zero in between a load
- * and a refcount increment of a node in try_get, then back up to
- * something non-zero, then the refcount increment is done by the other
- * thread) -- so if the CAS to add the node to the actual list fails,
- * decrese the refcount and leave the add operation to the next thread
- * who puts the refcount back to zero (which could be us, hence the
- * loop).
- */
- struct freelist_node *head = READ_ONCE(list->head);
-
- for (;;) {
- WRITE_ONCE(node->next, head);
- atomic_set_release(&node->refs, 1);
-
- if (!try_cmpxchg_release(&list->head, &head, node)) {
- /*
- * Hmm, the add failed, but we can only try again when
- * the refcount goes back to zero.
- */
- if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
- continue;
- }
- return;
- }
-}
-
-static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * We know that the should-be-on-freelist bit is 0 at this point, so
- * it's safe to set it using a fetch_add.
- */
- if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
- /*
- * Oh look! We were the last ones referencing this node, and we
- * know we want to add it to the free list, so let's do it!
- */
- __freelist_add(node, list);
- }
-}
-
-static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
-{
- struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
- unsigned int refs;
-
- while (head) {
- prev = head;
- refs = atomic_read(&head->refs);
- if ((refs & REFS_MASK) == 0 ||
- !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
- head = smp_load_acquire(&list->head);
- continue;
- }
-
- /*
- * Good, reference count has been incremented (it wasn't at
- * zero), which means we can read the next and not worry about
- * it changing between now and the time we do the CAS.
- */
- next = READ_ONCE(head->next);
- if (try_cmpxchg_acquire(&list->head, &head, next)) {
- /*
- * Yay, got the node. This means it was on the list,
- * which means should-be-on-freelist must be false no
- * matter the refcount (because nobody else knows it's
- * been taken off yet, it can't have been put back on).
- */
- WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
-
- /*
- * Decrease refcount twice, once for our ref, and once
- * for the list's ref.
- */
- atomic_fetch_add(-2, &head->refs);
-
- return head;
- }
-
- /*
- * OK, the head must have changed on us, but we still need to decrement
- * the refcount we increased.
- */
- refs = atomic_fetch_add(-1, &prev->refs);
- if (refs == REFS_ON_FREELIST + 1)
- __freelist_add(prev, list);
- }
-
- return NULL;
-}
-
-#endif /* FREELIST_H */
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index a0b92be98984..f13f01e600c2 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -27,7 +27,7 @@
#include <linux/mutex.h>
#include <linux/ftrace.h>
#include <linux/refcount.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/rethook.h>
#include <asm/kprobes.h>

@@ -141,6 +141,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
*/
struct kretprobe_holder {
struct kretprobe *rp;
+ struct objpool_head oh;
refcount_t ref;
};

@@ -154,7 +155,6 @@ struct kretprobe {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook *rh;
#else
- struct freelist_head freelist;
struct kretprobe_holder *rph;
#endif
};
@@ -165,10 +165,7 @@ struct kretprobe_instance {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook_node node;
#else
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct kretprobe_holder *rph;
kprobe_opcode_t *ret_addr;
diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..0b746187482a
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,151 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ *
+ * The object pool is a scalable implementation of a high performance
+ * queue for object allocation and reclamation, such as kretprobe instances.
+ *
+ * By leveraging per-cpu ring arrays to mitigate the hot spots of memory
+ * contention, it can deliver near-linear scalability for highly parallel
+ * cases. Meanwhile, it also achieves high throughput by benefiting from
+ * the warmed cache on each core.
+ *
+ * The object pool is best suited for the following cases:
+ * 1) memory allocation or reclamation is prohibited or too expensive
+ * 2) the objects are allocated/used/reclaimed very frequently
+ *
+ * Before using, you must be aware of its limitations:
+ * 1) Maximum number of objects is determined during pool initialization
+ * 2) The memory of objects won't be freed until the pool is de-allocated
+ * 3) Both allocation and reclamation could be nested
+ */
+
+/*
+ * objpool_slot: per-cpu ring array
+ *
+ * Represents a cpu-local array-based ring buffer; its size is specified
+ * during initialization of the object pool.
+ *
+ * The objpool_slot is allocated from local memory on NUMA systems, and is
+ * kept compact within a single cacheline. ages[] is stored right after the
+ * body of objpool_slot, and ents[] follows ages[]. ages[] records the epoch
+ * (revision) of each item, used solely to avoid the ABA problem. ents[]
+ * contains the object pointers.
+ *
+ * The default size of objpool_slot is a single cacheline, i.e. 64 bytes.
+ *
+ * 64bit:
+ * 4 8 12 16 32 64
+ * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) |
+ *
+ * 32bit:
+ * 4 8 12 16 32 48 64
+ * | head | tail | size | mask | ages[4] | ents[4] | unused |
+ *
+ */
+
+struct objpool_slot {
+ uint32_t os_head; /* head of ring array */
+ uint32_t os_tail; /* tail of ring array */
+ uint32_t os_size; /* max item slots, pow of 2 */
+ uint32_t os_mask; /* os_size - 1 */
+/*
+ * uint32_t os_ages[]; // ring epoch id
+ * void *os_ents[]; // objects array
+ */
+};
+
+/* caller-specified object initialization callback to set up each object, called only once */
+typedef int (*objpool_init_node_cb)(void *context, void *obj);
+
+/* caller-specified cleanup callback for private objects/pool/context */
+typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
+
+/* called for object releasing: ptr points to an object */
+#define OBJPOOL_FLAG_NODE (0x00000001)
+/* for user pool and context releasing, ptr could be NULL */
+#define OBJPOOL_FLAG_POOL (0x00001000)
+/* the object or pool to be released is user-managed */
+#define OBJPOOL_FLAG_USER (0x00008000)
+
+/*
+ * objpool_head: object pooling metadata
+ */
+
+struct objpool_head {
+ uint32_t oh_objsz; /* object & element size */
+ uint32_t oh_nobjs; /* total objs (pre-allocated) */
+ uint32_t oh_nents; /* max objects per cpuslot */
+ uint32_t oh_ncpus; /* num of possible cpus */
+ uint32_t oh_in_user:1; /* user-specified buffer */
+ uint32_t oh_in_slot:1; /* objs alloced with slots */
+ uint32_t oh_vmalloc:1; /* alloc from vmalloc zone */
+ gfp_t oh_gfp; /* k/vmalloc gfp flags */
+ uint32_t oh_sz_pool; /* user pool size in bytes */
+ void *oh_pool; /* user managed memory pool */
+ struct objpool_slot **oh_slots; /* array of percpu slots */
+ uint32_t *oh_sz_slots; /* size in bytes of slots */
+ objpool_release_cb oh_release; /* resource cleanup callback */
+ void *oh_context; /* caller-provided context */
+};
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *oh,
+ int nobjs, int max, int objsz,
+ gfp_t gfp, void *context,
+ objpool_init_node_cb objinit,
+ objpool_release_cb release);
+
+/* add objects in batch from user provided pool */
+int objpool_populate(struct objpool_head *oh, void *buf,
+ int size, int objsz, void *context,
+ objpool_init_node_cb objinit);
+
+/* add pre-allocated object (managed by user) to objpool */
+int objpool_add(void *obj, struct objpool_head *oh);
+
+/* allocate an object from objects pool */
+void *objpool_pop(struct objpool_head *oh);
+
+/* reclaim an object and return it back to objects pool */
+int objpool_push(void *node, struct objpool_head *oh);
+
+/* cleanup the whole object pool (including all chained objects) */
+void objpool_fini(struct objpool_head *oh);
+
+/* whether the object is pre-allocated with percpu slots */
+static inline int objpool_is_inslot(void *obj, struct objpool_head *oh)
+{
+ void *slot;
+ int i;
+
+ if (!obj)
+ return 0;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ slot = oh->oh_slots[i];
+ if (obj >= slot && obj < slot + oh->oh_sz_slots[i])
+ return 1;
+ }
+
+ return 0;
+}
+
+/* whether the object is from user pool (batched adding) */
+static inline int objpool_is_inpool(void *obj, struct objpool_head *oh)
+{
+ return (obj && oh->oh_pool && obj >= oh->oh_pool &&
+ obj < oh->oh_pool + oh->oh_sz_pool);
+}
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/include/linux/rethook.h b/include/linux/rethook.h
index c8ac1e5afcd1..278ec65e71fe 100644
--- a/include/linux/rethook.h
+++ b/include/linux/rethook.h
@@ -6,7 +6,7 @@
#define _LINUX_RETHOOK_H

#include <linux/compiler.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/kallsyms.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
@@ -30,14 +30,14 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, struct pt_regs
struct rethook {
void *data;
rethook_handler_t handler;
- struct freelist_head pool;
+ struct objpool_head pool;
refcount_t ref;
struct rcu_head rcu;
};

/**
* struct rethook_node - The rethook shadow-stack entry node.
- * @freelist: The freelist, linked to struct rethook::pool.
+ * @nod: The objpool node, linked to struct rethook::pool.
* @rcu: The rcu_head for deferred freeing.
* @llist: The llist, linked to a struct task_struct::rethooks.
* @rethook: The pointer to the struct rethook.
@@ -48,19 +48,15 @@ struct rethook {
* on each entry of the shadow stack.
*/
struct rethook_node {
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct rethook *rethook;
unsigned long ret_addr;
unsigned long frame;
};

-struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, gfp_t gfp, int size, int max);
void rethook_free(struct rethook *rh);
-void rethook_add_node(struct rethook *rh, struct rethook_node *node);
struct rethook_node *rethook_try_get(struct rethook *rh);
void rethook_recycle(struct rethook_node *node);
void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
@@ -97,4 +93,3 @@ void rethook_flush_task(struct task_struct *tk);
#endif

#endif
-
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index b781dee3f552..42cb708c3248 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1865,10 +1865,12 @@ static struct notifier_block kprobe_exceptions_nb = {
static void free_rp_inst_rcu(struct rcu_head *head)
{
struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
+ struct kretprobe_holder *rph = ri->rph;

- if (refcount_dec_and_test(&ri->rph->ref))
- kfree(ri->rph);
- kfree(ri);
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->oh);
+ kfree(rph);
+ }
}
NOKPROBE_SYMBOL(free_rp_inst_rcu);

@@ -1877,7 +1879,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
struct kretprobe *rp = get_kretprobe(ri);

if (likely(rp))
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rp->rph->oh);
else
call_rcu(&ri->rcu, free_rp_inst_rcu);
}
@@ -1914,23 +1916,19 @@ NOKPROBE_SYMBOL(kprobe_flush_task);

static inline void free_rp_inst(struct kretprobe *rp)
{
- struct kretprobe_instance *ri;
- struct freelist_node *node;
- int count = 0;
-
- node = rp->freelist.head;
- while (node) {
- ri = container_of(node, struct kretprobe_instance, freelist);
- node = node->next;
-
- kfree(ri);
- count++;
- }
+ struct kretprobe_holder *rph = rp->rph;
+ void *nod;

- if (refcount_sub_and_test(count, &rp->rph->ref)) {
- kfree(rp->rph);
- rp->rph = NULL;
- }
+ rp->rph = NULL;
+ do {
+ nod = objpool_pop(&rph->oh);
+ /* deref anyway since we've grabbed one extra ref */
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->oh);
+ kfree(rph);
+ break;
+ }
+ } while (nod);
}

/* This assumes the 'tsk' is the current task or the is not running. */
@@ -2072,19 +2070,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
{
struct kretprobe *rp = container_of(p, struct kretprobe, kp);
+ struct kretprobe_holder *rph = rp->rph;
struct kretprobe_instance *ri;
- struct freelist_node *fn;

- fn = freelist_try_get(&rp->freelist);
- if (!fn) {
+ ri = objpool_pop(&rph->oh);
+ if (!ri) {
rp->nmissed++;
return 0;
}

- ri = container_of(fn, struct kretprobe_instance, freelist);
-
if (rp->entry_handler && rp->entry_handler(ri, regs)) {
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rph->oh);
return 0;
}

@@ -2174,10 +2170,19 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
return 0;
}

+#ifndef CONFIG_KRETPROBE_ON_RETHOOK
+static int kretprobe_init_inst(void *context, void *nod)
+{
+ struct kretprobe_instance *ri = nod;
+
+ ri->rph = context;
+ return 0;
+}
+#endif
+
int register_kretprobe(struct kretprobe *rp)
{
int ret;
- struct kretprobe_instance *inst;
int i;
void *addr;

@@ -2215,20 +2220,12 @@ int register_kretprobe(struct kretprobe *rp)
#endif
}
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
- rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
+ rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler, GFP_KERNEL,
+ sizeof(struct kretprobe_instance) + rp->data_size,
+ rp->maxactive);
if (!rp->rh)
return -ENOMEM;

- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- rethook_free(rp->rh);
- rp->rh = NULL;
- return -ENOMEM;
- }
- rethook_add_node(rp->rh, &inst->node);
- }
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
@@ -2237,25 +2234,19 @@ int register_kretprobe(struct kretprobe *rp)
rp->rh = NULL;
}
#else /* !CONFIG_KRETPROBE_ON_RETHOOK */
- rp->freelist.head = NULL;
rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
if (!rp->rph)
return -ENOMEM;

- rp->rph->rp = rp;
- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- refcount_set(&rp->rph->ref, i);
- free_rp_inst(rp);
- return -ENOMEM;
- }
- inst->rph = rp->rph;
- freelist_add(&inst->freelist, &rp->freelist);
+ if (objpool_init(&rp->rph->oh, rp->maxactive, rp->maxactive,
+ rp->data_size + sizeof(struct kretprobe_instance),
+ GFP_KERNEL, rp->rph, kretprobe_init_inst, NULL)) {
+ kfree(rp->rph);
+ rp->rph = NULL;
+ return -ENOMEM;
}
- refcount_set(&rp->rph->ref, i);
-
+ refcount_set(&rp->rph->ref, rp->maxactive + 1);
+ rp->rph->rp = rp;
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index aac63ca9c3d1..d2521a0ab2ae 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -140,18 +140,11 @@ static int fprobe_init_rethook(struct fprobe *fp, int num)
if (size < 0)
return -E2BIG;

- fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
- for (i = 0; i < size; i++) {
- struct fprobe_rethook_node *node;
-
- node = kzalloc(sizeof(*node), GFP_KERNEL);
- if (!node) {
- rethook_free(fp->rethook);
- fp->rethook = NULL;
- return -ENOMEM;
- }
- rethook_add_node(fp->rethook, &node->node);
- }
+ fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler, GFP_KERNEL,
+ sizeof(struct fprobe_rethook_node), size);
+ if (!fp->rethook)
+ return -ENOMEM;
+
return 0;
}

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index c69d82273ce7..01df98db2fbe 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -36,21 +36,17 @@ void rethook_flush_task(struct task_struct *tk)
static void rethook_free_rcu(struct rcu_head *head)
{
struct rethook *rh = container_of(head, struct rethook, rcu);
- struct rethook_node *rhn;
- struct freelist_node *node;
- int count = 1;
+ struct rethook_node *nod;

- node = rh->pool.head;
- while (node) {
- rhn = container_of(node, struct rethook_node, freelist);
- node = node->next;
- kfree(rhn);
- count++;
- }
-
- /* The rh->ref is the number of pooled node + 1 */
- if (refcount_sub_and_test(count, &rh->ref))
- kfree(rh);
+ do {
+ nod = objpool_pop(&rh->pool);
+ /* deref anyway since we've grabbed one extra ref */
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ kfree(rh);
+ break;
+ }
+ } while (nod);
}

/**
@@ -70,16 +66,28 @@ void rethook_free(struct rethook *rh)
call_rcu(&rh->rcu, rethook_free_rcu);
}

+static int rethook_init_node(void *context, void *nod)
+{
+ struct rethook_node *node = nod;
+
+ node->rethook = context;
+ return 0;
+}
+
/**
* rethook_alloc() - Allocate struct rethook.
* @data: a data to pass the @handler when hooking the return.
* @handler: the return hook callback function.
+ * @gfp: default gfp for objpool allocation
+ * @size: rethook node size
+ * @max: number of rethook nodes to be preallocated
*
* Allocate and initialize a new rethook with @data and @handler.
* Return NULL if memory allocation fails or @handler is NULL.
* Note that @handler == NULL means this rethook is going to be freed.
*/
-struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, gfp_t gfp,
+ int size, int max)
{
struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);

@@ -88,34 +96,26 @@ struct rethook *rethook_alloc(void *data, rethook_handler_t handler)

rh->data = data;
rh->handler = handler;
- rh->pool.head = NULL;
- refcount_set(&rh->ref, 1);

+ /* initialize the objpool for rethook nodes */
+ if (objpool_init(&rh->pool, max, max, size, gfp, rh, rethook_init_node,
+ NULL)) {
+ kfree(rh);
+ return NULL;
+ }
+ refcount_set(&rh->ref, max + 1);
return rh;
}

-/**
- * rethook_add_node() - Add a new node to the rethook.
- * @rh: the struct rethook.
- * @node: the struct rethook_node to be added.
- *
- * Add @node to @rh. User must allocate @node (as a part of user's
- * data structure.) The @node fields are initialized in this function.
- */
-void rethook_add_node(struct rethook *rh, struct rethook_node *node)
-{
- node->rethook = rh;
- freelist_add(&node->freelist, &rh->pool);
- refcount_inc(&rh->ref);
-}
-
static void free_rethook_node_rcu(struct rcu_head *head)
{
struct rethook_node *node = container_of(head, struct rethook_node, rcu);
+ struct rethook *rh = node->rethook;

- if (refcount_dec_and_test(&node->rethook->ref))
- kfree(node->rethook);
- kfree(node);
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ kfree(rh);
+ }
}

/**
@@ -130,7 +130,7 @@ void rethook_recycle(struct rethook_node *node)
lockdep_assert_preemption_disabled();

if (likely(READ_ONCE(node->rethook->handler)))
- freelist_add(&node->freelist, &node->rethook->pool);
+ objpool_push(node, &node->rethook->pool);
else
call_rcu(&node->rcu, free_rethook_node_rcu);
}
@@ -146,7 +146,7 @@ NOKPROBE_SYMBOL(rethook_recycle);
struct rethook_node *rethook_try_get(struct rethook *rh)
{
rethook_handler_t handler = READ_ONCE(rh->handler);
- struct freelist_node *fn;
+ struct rethook_node *nod;

lockdep_assert_preemption_disabled();

@@ -163,11 +163,11 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!rcu_is_watching()))
return NULL;

- fn = freelist_try_get(&rh->pool);
- if (!fn)
+ nod = (struct rethook_node *)objpool_pop(&rh->pool);
+ if (!nod)
return NULL;

- return container_of(fn, struct rethook_node, freelist);
+ return nod;
}
NOKPROBE_SYMBOL(rethook_try_get);

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3fc7abffc7aa..b12cc71754cf 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2737,6 +2737,17 @@ config TEST_CLOCKSOURCE_WATCHDOG

If unsure, say N.

+config TEST_OBJPOOL
+ tristate "Test module for correctness and stress of objpool"
+ default n
+ depends on m
+ help
+ This builds the "test_objpool" module that should be used for
+ correctness verification and concurrency testing of object
+ allocation and reclamation.
+
+ If unsure, say N.
+
endif # RUNTIME_TESTING_MENU

config ARCH_USE_MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 161d6a724ff7..4aa282fa0cfc 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
- buildid.o
+ buildid.o objpool.o

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o
@@ -99,6 +99,8 @@ obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o
CFLAGS_test_fprobe.o += $(CC_FLAGS_FTRACE)
obj-$(CONFIG_FPROBE_SANITY_TEST) += test_fprobe.o
+obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o
+
#
# CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns
# off the generation of FPU/SSE* instructions for kernel proper but FPU_FLAGS
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..51b3499ff9da
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,480 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ */
+
+/* compute the suitable num of objects to be managed by slot */
+static inline uint32_t __objpool_num_of_objs(uint32_t size)
+{
+ return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
+ (sizeof(uint32_t) + sizeof(void *)));
+}
+
+#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
+#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
+ sizeof(uint32_t) * (s)->os_size))
+#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
+ (sizeof(uint32_t) + sizeof(void *)) * (s)->os_size))
+
+/* allocate and initialize percpu slots */
+static inline int
+__objpool_init_percpu_slots(struct objpool_head *oh, uint32_t nobjs,
+ void *context, objpool_init_node_cb objinit)
+{
+ uint32_t i, j, size, objsz, nents = oh->oh_nents;
+
+ /* aligned object size by sizeof(void *) */
+ objsz = ALIGN(oh->oh_objsz, sizeof(void *));
+ /* shall we allocate objects along with objpool_slot */
+ if (objsz)
+ oh->oh_in_slot = 1;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ struct objpool_slot *os;
+ uint32_t n;
+
+ /* compute how many objects to be managed by this slot */
+ n = nobjs / oh->oh_ncpus;
+ if (i < (nobjs % oh->oh_ncpus))
+ n++;
+ size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
+ sizeof(uint32_t) * nents + objsz * n;
+
+ /* decide which pool shall the slot be allocated from */
+ if (0 == i) {
+ if ((oh->oh_gfp & GFP_ATOMIC) || size < PAGE_SIZE / 2)
+ oh->oh_vmalloc = 0;
+ else
+ oh->oh_vmalloc = 1;
+ }
+
+ /* allocate percpu slot & objects from local memory */
+ if (oh->oh_vmalloc)
+ os = vmalloc_node(size, cpu_to_node(i));
+ else
+ os = kmalloc_node(size, oh->oh_gfp, cpu_to_node(i));
+ if (!os)
+ return -ENOMEM;
+
+ /* initialize percpu slot for the i-th cpu */
+ memset(os, 0, size);
+ os->os_size = oh->oh_nents;
+ os->os_mask = os->os_size - 1;
+ oh->oh_slots[i] = os;
+ oh->oh_sz_slots[i] = size;
+
+ /*
+ * start from the 2nd round to avoid conflict with the 1st item.
+ * we assume the head item is ready for retrieval iff head is
+ * equal to ages[head & mask]. but ages[] is initialized to 0,
+ * so from the view of the caller of pop() the 1st item (0th)
+ * would always appear ready, while in fact push() could have
+ * been stalled before its final update, and the item being
+ * inserted would then be lost forever.
+ */
+ os->os_head = os->os_tail = oh->oh_nents;
+
+ for (j = 0; oh->oh_in_slot && j < n; j++) {
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ void *obj = SLOT_OBJS(os) + j * objsz;
+ uint32_t ie = os->os_tail & os->os_mask;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* add obj into the ring array */
+ ents[ie] = obj;
+ ages[ie] = os->os_tail;
+ os->os_tail++;
+ oh->oh_nobjs++;
+ }
+ }
+
+ return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static inline void __objpool_fini_percpu_slots(struct objpool_head *oh)
+{
+ uint32_t i;
+
+ if (!oh->oh_slots)
+ return;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ if (!oh->oh_slots[i])
+ continue;
+ if (oh->oh_vmalloc)
+ vfree(oh->oh_slots[i]);
+ else
+ kfree(oh->oh_slots[i]);
+ }
+ kfree(oh->oh_slots);
+ oh->oh_slots = NULL;
+ oh->oh_sz_slots = NULL;
+}
+
+/**
+ * objpool_init: initialize object pool and pre-allocate objects
+ *
+ * args:
+ * @oh: the object pool to be initialized, declared by the caller
+ * @nobjs: total objects to be allocated by this object pool
+ * @max: max objs this objpool could manage, use nobjs if 0
+ * @objsz: size of an object, to be pre-allocated if objsz is not 0
+ * @gfp: gfp flags of caller's context for memory allocation
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setting-up
+ * @release: cleanup callback for private objects/pool/context
+ *
+ * return:
+ * 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are zeroed. The caller can do extra
+ * initialization in the objinit callback, which will be called
+ * once and only once right after slot allocation.
+ */
+int objpool_init(struct objpool_head *oh,
+ int nobjs, int max, int objsz,
+ gfp_t gfp, void *context,
+ objpool_init_node_cb objinit,
+ objpool_release_cb release)
+{
+ uint32_t nents, cpus = num_possible_cpus();
+ int rc;
+
+ /* calculate percpu slot size (rounded to pow of 2) */
+ if (max < nobjs)
+ max = nobjs;
+ nents = max / cpus;
+ if (nents < __objpool_num_of_objs(L1_CACHE_BYTES))
+ nents = __objpool_num_of_objs(L1_CACHE_BYTES);
+ nents = roundup_pow_of_two(nents);
+ while (nents * cpus < nobjs)
+ nents = nents << 1;
+
+ memset(oh, 0, sizeof(struct objpool_head));
+ oh->oh_ncpus = cpus;
+ oh->oh_objsz = objsz;
+ oh->oh_nents = nents;
+ oh->oh_gfp = gfp & ~__GFP_ZERO;
+ oh->oh_context = context;
+ oh->oh_release = release;
+
+ /* allocate array for percpu slots */
+ oh->oh_slots = kzalloc(oh->oh_ncpus * sizeof(void *) +
+ oh->oh_ncpus * sizeof(uint32_t), oh->oh_gfp);
+ if (!oh->oh_slots)
+ return -ENOMEM;
+ oh->oh_sz_slots = (uint32_t *)&oh->oh_slots[oh->oh_ncpus];
+
+ /* initialize per-cpu slots */
+ rc = __objpool_init_percpu_slots(oh, nobjs, context, objinit);
+ if (rc)
+ __objpool_fini_percpu_slots(oh);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot tail, the given slot mustn't be full */
+static inline int __objpool_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t tail = atomic_inc_return((atomic_t *)&os->os_tail) - 1;
+
+ WRITE_ONCE(ents[tail & os->os_mask], obj);
+
+ /* order matters: obj must be updated before tail updating */
+ smp_store_release(&ages[tail & os->os_mask], tail);
+ return 0;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int __objpool_try_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t head, tail;
+
+ do {
+ /* perform memory loading for both head and tail */
+ head = READ_ONCE(os->os_head);
+ tail = READ_ONCE(os->os_tail);
+ /* just abort if slot is full */
+ if (tail >= head + os->os_size)
+ return -ENOENT;
+ /* try to extend tail by 1 using CAS to avoid races */
+ if (try_cmpxchg_acquire(&os->os_tail, &tail, tail + 1))
+ break;
+ } while (1);
+
+ /* the tail-th of slot is reserved for the given obj */
+ WRITE_ONCE(ents[tail & os->os_mask], obj);
+ /* update epoch id to make this object available for pop() */
+ smp_store_release(&ages[tail & os->os_mask], tail);
+ return 0;
+}
+
+/**
+ * objpool_populate: add objects from user provided pool in batch
+ *
+ * args:
+ * @oh: object pool
+ * @buf: user buffer for pre-allocated objects
+ * @size: size of user buffer
+ * @objsz: size of object & element
+ * @context: user context for objinit callback
+ * @objinit: object initialization callback
+ *
+ * return: 0 or error code
+ */
+int objpool_populate(struct objpool_head *oh, void *buf, int size, int objsz,
+ void *context, objpool_init_node_cb objinit)
+{
+ int n = oh->oh_nobjs, used = 0, i;
+
+ if (oh->oh_pool || !buf || size < objsz)
+ return -EINVAL;
+ if (oh->oh_objsz && oh->oh_objsz != objsz)
+ return -EINVAL;
+ if (oh->oh_context && context && oh->oh_context != context)
+ return -EINVAL;
+ if (oh->oh_nobjs >= oh->oh_ncpus * oh->oh_nents)
+ return -ENOENT;
+
+ WARN_ON_ONCE(((unsigned long)buf) & (sizeof(void *) - 1));
+ WARN_ON_ONCE(((uint32_t)objsz) & (sizeof(void *) - 1));
+
+ /* align object size by sizeof(void *) */
+ oh->oh_objsz = objsz;
+ objsz = ALIGN(objsz, sizeof(void *));
+ if (objsz <= 0)
+ return -EINVAL;
+
+ while (used + objsz <= size) {
+ void *obj = buf + used;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* insert obj to its corresponding objpool slot */
+ i = (n + used * oh->oh_ncpus/size) % oh->oh_ncpus;
+ if (!__objpool_try_add_slot(obj, oh->oh_slots[i]))
+ oh->oh_nobjs++;
+
+ used += objsz;
+ }
+
+ if (!used)
+ return -ENOENT;
+
+ oh->oh_context = context;
+ oh->oh_pool = buf;
+ oh->oh_sz_pool = size;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(objpool_populate);
+
+/**
+ * objpool_add: add pre-allocated object to objpool during pool
+ * initialization
+ *
+ * args:
+ * @obj: object pointer to be added to objpool
+ * @oh: object pool to be inserted into
+ *
+ * return:
+ * 0 or error code
+ *
+ * objpool_add doesn't handle race conditions, and can only be
+ * called during objpool initialization
+ */
+int objpool_add(void *obj, struct objpool_head *oh)
+{
+ uint32_t i, cpu;
+
+ if (!obj)
+ return -EINVAL;
+ if (oh->oh_nobjs >= oh->oh_ncpus * oh->oh_nents)
+ return -ENOENT;
+
+ cpu = oh->oh_nobjs % oh->oh_ncpus;
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ if (!__objpool_try_add_slot(obj, oh->oh_slots[cpu])) {
+ oh->oh_nobjs++;
+ return 0;
+ }
+
+ if (++cpu >= oh->oh_ncpus)
+ cpu = 0;
+ }
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_add);
+
+/**
+ * objpool_push: reclaim the object and return it back to the object pool
+ *
+ * args:
+ * @obj: object pointer to be pushed to object pool
+ * @oh: object pool
+ *
+ * return:
+ * 0 or error code: it fails only when the object pool is full
+ *
+ * objpool_push is non-blocking, and can be nested
+ */
+int objpool_push(void *obj, struct objpool_head *oh)
+{
+ uint32_t cpu = raw_smp_processor_id();
+
+ do {
+ if (oh->oh_nobjs > oh->oh_nents) {
+ if (!__objpool_try_add_slot(obj, oh->oh_slots[cpu]))
+ return 0;
+ } else {
+ if (!__objpool_add_slot(obj, oh->oh_slots[cpu]))
+ return 0;
+ }
+ if (++cpu >= oh->oh_ncpus)
+ cpu = 0;
+ } while (1);
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ /* do memory load of os_head to local head */
+ uint32_t head = smp_load_acquire(&os->os_head);
+
+ /* loop if slot isn't empty */
+ while (head != READ_ONCE(os->os_tail)) {
+ uint32_t id = head & os->os_mask, prev = head;
+
+ /* do prefetching of object ents */
+ prefetch(&ents[id]);
+
+ /*
+ * check whether this item is ready for retrieval. In theory
+ * there's a possibility we might retrieve the wrong object, in
+ * case ages[id] overflows while the current task is sleeping,
+ * but it would take very long to overflow a uint32_t
+ */
+ if (smp_load_acquire(&ages[id]) == head) {
+ /* node must have been updated by push() */
+ void *node = READ_ONCE(ents[id]);
+ /* commit and move forward head of the slot */
+ if (try_cmpxchg_release(&os->os_head, &head, head + 1))
+ return node;
+ }
+
+ /* re-load head from memory continue trying */
+ head = READ_ONCE(os->os_head);
+ /*
+ * head stays unchanged, so it's very likely current pop()
+ * just preempted/interrupted an ongoing push() operation
+ */
+ if (head == prev)
+ break;
+ }
+
+ return NULL;
+}
+
+/**
+ * objpool_pop: allocate an object from objects pool
+ *
+ * args:
+ * @oh: object pool
+ *
+ * return:
+ * object: NULL if failed (object pool is empty)
+ *
+ * objpool_pop can be nested, so it can be used in any context.
+ */
+void *objpool_pop(struct objpool_head *oh)
+{
+ uint32_t i, cpu = raw_smp_processor_id();
+ void *obj = NULL;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ struct objpool_slot *slot = oh->oh_slots[cpu];
+ obj = __objpool_try_get_slot(slot);
+ if (obj)
+ break;
+ if (++cpu >= oh->oh_ncpus)
+ cpu = 0;
+ }
+
+ return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/**
+ * objpool_fini: cleanup the whole object pool (releasing all objects)
+ *
+ * args:
+ * @oh: object pool to be released
+ *
+ */
+void objpool_fini(struct objpool_head *oh)
+{
+ uint32_t i, flags;
+
+ if (!oh->oh_slots)
+ return;
+
+ if (!oh->oh_release) {
+ __objpool_fini_percpu_slots(oh);
+ return;
+ }
+
+ /* cleanup all objects remaining in objpool */
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ void *obj;
+ do {
+ flags = OBJPOOL_FLAG_NODE;
+ obj = __objpool_try_get_slot(oh->oh_slots[i]);
+ if (!obj)
+ break;
+ if (!objpool_is_inpool(obj, oh) &&
+ !objpool_is_inslot(obj, oh)) {
+ flags |= OBJPOOL_FLAG_USER;
+ }
+ oh->oh_release(oh->oh_context, obj, flags);
+ } while (obj);
+ }
+
+ /* release percpu slots */
+ __objpool_fini_percpu_slots(oh);
+
+ /* cleanup user private pool and related context */
+ flags = OBJPOOL_FLAG_POOL;
+ if (oh->oh_pool)
+ flags |= OBJPOOL_FLAG_USER;
+ oh->oh_release(oh->oh_context, oh->oh_pool, flags);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
diff --git a/lib/test_objpool.c b/lib/test_objpool.c
new file mode 100644
index 000000000000..c1341ddf77b5
--- /dev/null
+++ b/lib/test_objpool.c
@@ -0,0 +1,1031 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for lockless object pool
+ * (C) 2022 Matt Wu <[email protected]>
+ */
+
+#include <linux/version.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/objpool.h>
+
+#define OT_NR_MAX_BULK (16)
+
+struct ot_ctrl {
+ unsigned int mode;
+ unsigned int duration; /* ms */
+ unsigned int delay; /* ms */
+ unsigned int bulk_normal;
+ unsigned int bulk_irq;
+ unsigned long hrtimer; /* ms */
+ const char *name;
+};
+
+struct ot_stat {
+ unsigned long nhits;
+ unsigned long nmiss;
+};
+
+struct ot_item {
+ struct objpool_head *pool; /* pool head */
+ struct ot_ctrl *ctrl; /* ctrl parameters */
+
+ void (*worker)(struct ot_item *item, int irq);
+
+ /* hrtimer control */
+ ktime_t hrtcycle;
+ struct hrtimer hrtimer;
+
+ int bulk[2]; /* for thread and irq */
+ int delay;
+ u32 niters;
+
+ /* results summary */
+ struct ot_stat stat[2]; /* thread and irq */
+
+ u64 duration;
+};
+
+struct ot_mem_stat {
+ atomic_long_t alloc;
+ atomic_long_t free;
+};
+
+struct ot_data {
+ struct rw_semaphore start;
+ struct completion wait;
+ struct completion rcu;
+ atomic_t nthreads ____cacheline_aligned_in_smp;
+ atomic_t stop ____cacheline_aligned_in_smp;
+ struct ot_mem_stat kmalloc;
+ struct ot_mem_stat vmalloc;
+} g_ot_data;
+
+/*
+ * memory leakage checking
+ */
+
+void *ot_kzalloc(long size)
+{
+ void *ptr = kzalloc(size, GFP_KERNEL);
+
+ if (ptr)
+ atomic_long_add(size, &g_ot_data.kmalloc.alloc);
+ return ptr;
+}
+
+void ot_kfree(void *ptr, long size)
+{
+ if (!ptr)
+ return;
+ atomic_long_add(size, &g_ot_data.kmalloc.free);
+ kfree(ptr);
+}
+
+void *ot_vmalloc(long size)
+{
+ void *ptr = vmalloc(size);
+
+ if (ptr)
+ atomic_long_add(size, &g_ot_data.vmalloc.alloc);
+ return ptr;
+}
+
+void ot_vfree(void *ptr, long size)
+{
+ if (!ptr)
+ return;
+ atomic_long_add(size, &g_ot_data.vmalloc.free);
+ vfree(ptr);
+}
+
+static void ot_mem_report(struct ot_ctrl *ctrl)
+{
+ long alloc, free;
+
+ pr_info("memory allocation summary for %s\n", ctrl->name);
+
+ alloc = atomic_long_read(&g_ot_data.kmalloc.alloc);
+ free = atomic_long_read(&g_ot_data.kmalloc.free);
+ pr_info(" kmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+
+ alloc = atomic_long_read(&g_ot_data.vmalloc.alloc);
+ free = atomic_long_read(&g_ot_data.vmalloc.free);
+ pr_info(" vmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+}
+
+/*
+ * general structs & routines
+ */
+
+struct ot_node {
+ void *owner;
+ unsigned long data;
+ unsigned long refs;
+};
+
+struct ot_context {
+ struct objpool_head pool;
+ void *ptr;
+ unsigned long size;
+ refcount_t refs;
+ struct rcu_head rcu;
+};
+
+static DEFINE_PER_CPU(struct ot_item, ot_pcup_items);
+
+static int ot_init_data(struct ot_data *data)
+{
+ memset(data, 0, sizeof(*data));
+ init_rwsem(&data->start);
+ init_completion(&data->wait);
+ init_completion(&data->rcu);
+ atomic_set(&data->nthreads, 1);
+
+ return 0;
+}
+
+static void ot_reset_data(struct ot_data *data)
+{
+ reinit_completion(&data->wait);
+ reinit_completion(&data->rcu);
+ atomic_set(&data->nthreads, 1);
+ atomic_set(&data->stop, 0);
+ memset(&data->kmalloc, 0, sizeof(data->kmalloc));
+ memset(&data->vmalloc, 0, sizeof(data->vmalloc));
+}
+
+static int ot_init_node(void *context, void *nod)
+{
+ struct ot_context *sop = context;
+ struct ot_node *on = nod;
+
+ on->owner = &sop->pool;
+ return 0;
+}
+
+static enum hrtimer_restart ot_hrtimer_handler(struct hrtimer *hrt)
+{
+ struct ot_item *item = container_of(hrt, struct ot_item, hrtimer);
+
+ if (atomic_read_acquire(&g_ot_data.stop))
+ return HRTIMER_NORESTART;
+
+ /* do bulk-testing of object pop/push */
+ item->worker(item, 1);
+
+ hrtimer_forward(hrt, hrt->base->get_time(), item->hrtcycle);
+ return HRTIMER_RESTART;
+}
+
+static void ot_start_hrtimer(struct ot_item *item)
+{
+ if (!item->ctrl->hrtimer)
+ return;
+ hrtimer_start(&item->hrtimer, item->hrtcycle, HRTIMER_MODE_REL);
+}
+
+static void ot_stop_hrtimer(struct ot_item *item)
+{
+ if (!item->ctrl->hrtimer)
+ return;
+ hrtimer_cancel(&item->hrtimer);
+}
+
+static int ot_init_hrtimer(struct ot_item *item, unsigned long hrtimer)
+{
+ struct hrtimer *hrt = &item->hrtimer;
+
+ if (!hrtimer)
+ return -ENOENT;
+
+ item->hrtcycle = ktime_set(0, hrtimer * 1000000UL);
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrt->function = ot_hrtimer_handler;
+ return 0;
+}
+
+static int ot_init_cpu_item(struct ot_item *item,
+ struct ot_ctrl *ctrl,
+ struct objpool_head *pool,
+ void (*worker)(struct ot_item *, int))
+{
+ memset(item, 0, sizeof(*item));
+ item->pool = pool;
+ item->ctrl = ctrl;
+ item->worker = worker;
+
+ item->bulk[0] = ctrl->bulk_normal;
+ item->bulk[1] = ctrl->bulk_irq;
+ item->delay = ctrl->delay;
+
+ /* initialize hrtimer */
+ ot_init_hrtimer(item, item->ctrl->hrtimer);
+ return 0;
+}
+
+static int ot_thread_worker(void *arg)
+{
+ struct ot_item *item = arg;
+ ktime_t start;
+
+ sched_set_normal(current, 50);
+
+ atomic_inc(&g_ot_data.nthreads);
+ down_read(&g_ot_data.start);
+ up_read(&g_ot_data.start);
+ start = ktime_get();
+ ot_start_hrtimer(item);
+ do {
+ if (atomic_read_acquire(&g_ot_data.stop))
+ break;
+ /* do bulk-testing of object pop/push */
+ item->worker(item, 0);
+ } while (!kthread_should_stop());
+ ot_stop_hrtimer(item);
+ item->duration = (u64) ktime_us_delta(ktime_get(), start);
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ return 0;
+}
+
+static void ot_perf_report(struct ot_ctrl *ctrl, u64 duration)
+{
+ struct ot_stat total, normal = {0}, irq = {0};
+ int cpu, nthreads = 0;
+
+ pr_info("\n");
+ pr_info("Testing summary for %s\n", ctrl->name);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ if (!item->duration)
+ continue;
+ normal.nhits += item->stat[0].nhits;
+ normal.nmiss += item->stat[0].nmiss;
+ irq.nhits += item->stat[1].nhits;
+ irq.nmiss += item->stat[1].nmiss;
+ pr_info("CPU: %d duration: %lluus\n", cpu, item->duration);
+ pr_info("\tthread:\t%16lu hits \t%16lu miss\n",
+ item->stat[0].nhits, item->stat[0].nmiss);
+ pr_info("\tirq: \t%16lu hits \t%16lu miss\n",
+ item->stat[1].nhits, item->stat[1].nmiss);
+ pr_info("\ttotal: \t%16lu hits \t%16lu miss\n",
+ item->stat[0].nhits + item->stat[1].nhits,
+ item->stat[0].nmiss + item->stat[1].nmiss);
+ nthreads++;
+ }
+
+ total.nhits = normal.nhits + irq.nhits;
+ total.nmiss = normal.nmiss + irq.nmiss;
+
+ pr_info("ALL: \tnthreads: %d duration: %lluus\n", nthreads, duration);
+ pr_info("SUM: \t%16lu hits \t%16lu miss\n",
+ total.nhits, total.nmiss);
+}
+
+/*
+ * synchronous test cases for objpool manipulation
+ */
+
+/* objpool manipulation for synchronous mode 0 (percpu objpool) */
+static struct ot_context *ot_init_sync_m0(void)
+{
+ struct ot_context *sop = NULL;
+ int max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ if (objpool_init(&sop->pool, max, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, NULL)) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON(max != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m0(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+ ot_kfree(sop, sizeof(*sop));
+}
+
+/* objpool manipulation for synchronous mode 1 (private pool) */
+static struct ot_context *ot_init_sync_m1(void)
+{
+ struct ot_context *sop = NULL;
+ unsigned long size;
+ int rc, szobj, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ size = sizeof(struct ot_node) * max;
+ sop->ptr = ot_vmalloc(size);
+ sop->size = size;
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ memset(sop->ptr, 0, size);
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL, NULL);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ WARN_ON((size / szobj) != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m1(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+
+ ot_vfree(sop->ptr, sop->size);
+ ot_kfree(sop, sizeof(*sop));
+}
+
+/* objpool manipulation for synchronous mode 2 (private objects) */
+static int ot_objpool_release(void *context, void *ptr, uint32_t flags)
+{
+ struct ot_context *sop = context;
+
+ /* here we need to release all user-allocated objects */
+ if ((flags & OBJPOOL_FLAG_NODE) && (flags & OBJPOOL_FLAG_USER)) {
+ struct ot_node *on = ptr;
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sizeof(struct ot_node));
+ } else if (flags & OBJPOOL_FLAG_POOL) {
+ /* release user preallocated pool */
+ if (sop->ptr) {
+ WARN_ON(sop->ptr != ptr);
+ WARN_ON(!(flags & OBJPOOL_FLAG_USER));
+ ot_vfree(sop->ptr, sop->size);
+ }
+ /* do context cleaning if needed */
+ ot_kfree(sop, sizeof(*sop));
+ }
+
+ return 0;
+}
+
+static struct ot_context *ot_init_sync_m2(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ int rc, i, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < max; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ }
+ }
+ WARN_ON(max != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m2(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+}
+
+/* objpool manipulation for synchronous mode 3 (mixed mode) */
+static struct ot_context *ot_init_sync_m3(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ unsigned long size;
+ int rc, i, szobj, nobjs;
+ int max = num_possible_cpus() << 4;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ nobjs = num_possible_cpus() * 2;
+ rc = objpool_init(&sop->pool, nobjs, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ size = sizeof(struct ot_node) * num_possible_cpus() * 4;
+ sop->ptr = ot_vmalloc(size);
+ if (!sop->ptr) {
+ objpool_fini(&sop->pool);
+ return NULL;
+ }
+ sop->size = size;
+ memset(sop->ptr, 0, size);
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ nobjs += size / szobj;
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < num_possible_cpus() * 2; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ if (!objpool_add(on, &sop->pool))
+ nobjs++;
+ else
+ ot_kfree(on, sizeof(struct ot_node));
+ }
+ }
+ WARN_ON(nobjs != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m3(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+}
+
+struct {
+ struct ot_context * (*init)(void);
+ void (*fini)(struct ot_context *sop);
+} g_ot_sync_ops[4] = {
+ {ot_init_sync_m0, ot_fini_sync_m0},
+ {ot_init_sync_m1, ot_fini_sync_m1},
+ {ot_init_sync_m2, ot_fini_sync_m2},
+ {ot_init_sync_m3, ot_fini_sync_m3},
+};
+
+/*
+ * synchronous test cases: performance mode
+ */
+
+static void ot_bulk_sync(struct ot_item *item, int irq)
+{
+ struct ot_node *nods[OT_NR_MAX_BULK];
+ int i;
+
+ for (i = 0; i < item->bulk[irq]; i++)
+ nods[i] = objpool_pop(item->pool);
+
+ if (!irq && (item->delay || !(++(item->niters) & 0x7FFF)))
+ msleep(item->delay);
+
+ while (i-- > 0) {
+ struct ot_node *on = nods[i];
+ if (on) {
+ on->refs++;
+ objpool_push(on, item->pool);
+ item->stat[irq].nhits++;
+ } else {
+ item->stat[irq].nmiss++;
+ }
+ }
+}
+
+static int ot_start_sync(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop;
+ ktime_t start;
+ u64 duration;
+ unsigned long timeout;
+ int cpu, rc;
+
+ /* initialize objpool for synchronous testcase */
+ sop = g_ot_sync_ops[ctrl->mode].init();
+ if (!sop)
+ return -ENOMEM;
+
+ /* grab rwsem to block testing threads */
+ down_write(&g_ot_data.start);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ struct task_struct *work;
+
+ ot_init_cpu_item(item, ctrl, &sop->pool, ot_bulk_sync);
+
+ /* skip offline cpus */
+ if (!cpu_online(cpu))
+ continue;
+
+ work = kthread_create_on_node(ot_thread_worker, item,
+ cpu_to_node(cpu), "ot_worker_%d", cpu);
+ if (IS_ERR(work)) {
+ pr_err("failed to create thread for cpu %d\n", cpu);
+ } else {
+ kthread_bind(work, cpu);
+ wake_up_process(work);
+ }
+ }
+
+ /* wait a while to make sure all threads are waiting at the start line */
+ msleep(20);
+
+ /* in case no threads were created: insufficient memory? */
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ // sched_set_fifo_low(current);
+
+ /* start objpool testing threads */
+ start = ktime_get();
+ up_write(&g_ot_data.start);
+
+ /* yield cpu to worker threads for duration ms */
+ timeout = msecs_to_jiffies(ctrl->duration);
+ rc = schedule_timeout_interruptible(timeout);
+
+ /* tell worker threads to quit */
+ atomic_set_release(&g_ot_data.stop, 1);
+
+ /* wait for all worker threads to finish and quit */
+ wait_for_completion(&g_ot_data.wait);
+ duration = (u64) ktime_us_delta(ktime_get(), start);
+
+ /* cleanup objpool */
+ g_ot_sync_ops[ctrl->mode].fini(sop);
+
+ /* report testing summary and performance results */
+ ot_perf_report(ctrl, duration);
+
+ /* report memory allocation summary */
+ ot_mem_report(ctrl);
+
+ return rc;
+}
+
+/*
+ * asynchronous test cases: pool lifecycle controlled by refcount
+ */
+
+static void ot_fini_async_rcu(struct rcu_head *rcu)
+{
+ struct ot_context *sop = container_of(rcu, struct ot_context, rcu);
+ struct ot_node *on;
+
+ /* here all cpus are aware of the stop event: g_ot_data.stop = 1 */
+ WARN_ON(!atomic_read_acquire(&g_ot_data.stop));
+
+ do {
+ /* release all objects remaining in objpool */
+ on = objpool_pop(&sop->pool);
+ if (on && !objpool_is_inslot(on, &sop->pool) &&
+ !objpool_is_inpool(on, &sop->pool)) {
+ /* private object managed by user */
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sizeof(struct ot_node));
+ }
+
+ /* deref anyway since we've grabbed one extra ref */
+ if (refcount_dec_and_test(&sop->refs)) {
+ objpool_fini(&sop->pool);
+ break;
+ }
+ } while (on);
+
+ complete(&g_ot_data.rcu);
+}
+
+static void ot_fini_async(struct ot_context *sop)
+{
+ /* make sure the stop event is acknowledged by all cores */
+ call_rcu(&sop->rcu, ot_fini_async_rcu);
+}
+
+static struct ot_context *ot_init_async_m0(void)
+{
+ struct ot_context *sop = NULL;
+ int max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ if (objpool_init(&sop->pool, max, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, ot_objpool_release)) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON(max != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, max + 1);
+
+ return sop;
+}
+
+static struct ot_context *ot_init_async_m1(void)
+{
+ struct ot_context *sop = NULL;
+ unsigned long size;
+ int szobj, rc, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ size = sizeof(struct ot_node) * max;
+ sop->ptr = ot_vmalloc(size);
+ sop->size = size;
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ memset(sop->ptr, 0, size);
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+
+ /* calculate total number of objects stored in ptr */
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ WARN_ON(size / szobj != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, size / szobj + 1);
+
+ return sop;
+}
+
+static struct ot_context *ot_init_async_m2(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ int rc, i, nobjs = 0, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < max; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ nobjs++;
+ }
+ }
+ WARN_ON(nobjs != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, nobjs + 1);
+
+ return sop;
+}
+
+/* objpool manipulation for asynchronous mode 3 (mixed mode) */
+static struct ot_context *ot_init_async_m3(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ unsigned long size;
+ int szobj, nobjs, rc, i, max = num_possible_cpus() << 4;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool with pre-allocated percpu objects */
+ nobjs = num_possible_cpus() * 2;
+ rc = objpool_init(&sop->pool, nobjs, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ size = sizeof(struct ot_node) * num_possible_cpus() * 4;
+ sop->ptr = ot_vmalloc(size);
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ sop->size = size;
+ memset(sop->ptr, 0, size);
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+
+ /* calculate total number of objects stored in ptr */
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ nobjs += size / szobj;
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < num_possible_cpus() * 2; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ nobjs++;
+ }
+ }
+ WARN_ON(nobjs != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, nobjs + 1);
+
+ return sop;
+}
+
+struct {
+ struct ot_context * (*init)(void);
+ void (*fini)(struct ot_context *sop);
+} g_ot_async_ops[4] = {
+ {ot_init_async_m0, ot_fini_async},
+ {ot_init_async_m1, ot_fini_async},
+ {ot_init_async_m2, ot_fini_async},
+ {ot_init_async_m3, ot_fini_async},
+};
+
+static void ot_nod_recycle(struct ot_node *on, struct objpool_head *pool,
+ int release)
+{
+ struct ot_context *sop;
+
+ on->refs++;
+
+ if (!release) {
+ /* push object back to objpool for reuse */
+ objpool_push(on, pool);
+ return;
+ }
+
+ sop = container_of(pool, struct ot_context, pool);
+ WARN_ON(sop != pool->oh_context);
+
+ if (objpool_is_inslot(on, pool)) {
+ /* object is allocated from percpu slots */
+ } else if (objpool_is_inpool(on, pool)) {
+ /* object is allocated from the user-managed pool */
+ } else {
+ /* private object managed by user */
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sizeof(struct ot_node));
+ }
+
+ /* unref objpool since this node is removed for good */
+ if (refcount_dec_and_test(&sop->refs))
+ objpool_fini(pool);
+}
+
+static void ot_bulk_async(struct ot_item *item, int irq)
+{
+ struct ot_node *nods[OT_NR_MAX_BULK];
+ int i, stop;
+
+ for (i = 0; i < item->bulk[irq]; i++)
+ nods[i] = objpool_pop(item->pool);
+
+ if (!irq) {
+ if (item->delay || !(++(item->niters) & 0x7FFF))
+ msleep(item->delay);
+ get_cpu();
+ }
+
+ stop = atomic_read_acquire(&g_ot_data.stop);
+
+ /* drop all objects and deref objpool */
+ while (i-- > 0) {
+ struct ot_node *on = nods[i];
+
+ if (on) {
+ on->refs++;
+ ot_nod_recycle(on, item->pool, stop);
+ item->stat[irq].nhits++;
+ } else {
+ item->stat[irq].nmiss++;
+ }
+ }
+
+ if (!irq)
+ put_cpu();
+}
+
+static int ot_start_async(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop;
+ ktime_t start;
+ u64 duration;
+ unsigned long timeout;
+ int cpu, rc;
+
+ /* initialize objpool for asynchronous testcase */
+ sop = g_ot_async_ops[ctrl->mode].init();
+ if (!sop)
+ return -ENOMEM;
+
+ /* grab rwsem to block testing threads */
+ down_write(&g_ot_data.start);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ struct task_struct *work;
+
+ ot_init_cpu_item(item, ctrl, &sop->pool, ot_bulk_async);
+
+ /* skip offline cpus */
+ if (!cpu_online(cpu))
+ continue;
+
+ work = kthread_create_on_node(ot_thread_worker, item,
+ cpu_to_node(cpu), "ot_worker_%d", cpu);
+ if (IS_ERR(work)) {
+ pr_err("failed to create thread for cpu %d\n", cpu);
+ } else {
+ kthread_bind(work, cpu);
+ wake_up_process(work);
+ }
+ }
+
+ /* wait a while to make sure all threads are waiting at the start line */
+ msleep(20);
+
+ /* in case no threads were created: memory insufficient ? */
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ /* start objpool testing threads */
+ start = ktime_get();
+ up_write(&g_ot_data.start);
+
+ /* yield cpu to worker threads for duration ms */
+ timeout = msecs_to_jiffies(ctrl->duration);
+ rc = schedule_timeout_interruptible(timeout);
+
+ /* tell worker threads to quit */
+ atomic_set_release(&g_ot_data.stop, 1);
+
+ /* do async-finalization */
+ g_ot_async_ops[ctrl->mode].fini(sop);
+
+ /* wait for all worker threads to finish and quit */
+ wait_for_completion(&g_ot_data.wait);
+ duration = (u64) ktime_us_delta(ktime_get(), start);
+
+ /* make sure the rcu callback has completed */
+ wait_for_completion(&g_ot_data.rcu);
+
+ /*
+ * now we are sure that objpool is finalized either
+ * by rcu callback or by worker threads
+ */
+
+ /* report testing summary and performance results */
+ ot_perf_report(ctrl, duration);
+
+ /* report memory allocation summary */
+ ot_mem_report(ctrl);
+
+ return rc;
+}
+
+/*
+ * predefined testing cases:
+ * 4 synchronous cases / 4 overrun cases / 4 async cases
+ *
+ * mode: unsigned int, could be 0/1/2/3, see name
+ * duration: unsigned int, total test time in ms
+ * delay: unsigned int, delay (in ms) between each iteration
+ * bulk_normal: unsigned int, repeat times for thread worker
+ * bulk_irq: unsigned int, repeat times for irq consumer
+ * hrtimer: unsigned long, hrtimer interval in ms
+ * name: char *, tag for current test ot_item
+ */
+
+struct ot_ctrl g_ot_sync[] = {
+ {0, 1000, 0, 1, 0, 0, "sync: percpu objpool"},
+ {1, 1000, 0, 1, 0, 0, "sync: user objpool"},
+ {2, 1000, 0, 1, 0, 0, "sync: user objects"},
+ {3, 1000, 0, 1, 0, 0, "sync: mixed pools & objs"},
+};
+
+struct ot_ctrl g_ot_miss[] = {
+ {0, 1000, 0, 16, 0, 0, "sync overrun: percpu objpool"},
+ {1, 1000, 0, 16, 0, 0, "sync overrun: user objpool"},
+ {2, 1000, 0, 16, 0, 0, "sync overrun: user objects"},
+ {3, 1000, 0, 16, 0, 0, "sync overrun: mixed pools & objs"},
+};
+
+struct ot_ctrl g_ot_async[] = {
+ {0, 1000, 4, 8, 8, 6, "async: percpu objpool"},
+ {1, 1000, 4, 8, 8, 6, "async: user objpool"},
+ {2, 1000, 4, 8, 8, 6, "async: user objects"},
+ {3, 1000, 4, 8, 8, 6, "async: mixed pools & objs"},
+};
+
+static int __init ot_mod_init(void)
+{
+ int i;
+
+ ot_init_data(&g_ot_data);
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_sync); i++) {
+ if (ot_start_sync(&g_ot_sync[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_miss); i++) {
+ if (ot_start_sync(&g_ot_miss[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_async); i++) {
+ if (ot_start_async(&g_ot_async[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+out:
+ return -EAGAIN;
+}
+
+static void __exit ot_mod_exit(void)
+{
+}
+
+module_init(ot_mod_init);
+module_exit(ot_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Matt Wu");
--
2.34.1



2022-11-01 12:52:48

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v3] kprobes,lib: kretprobe scalability improvement

Hi wuqiang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.1-rc3 next-20221101]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/wuqiang/kprobes-lib-kretprobe-scalability-improvement/20221101-110242
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20221101014346.150812-1-wuqiang.matt%40bytedance.com
patch subject: [PATCH v3] kprobes,lib: kretprobe scalability improvement
config: x86_64-rhel-8.3-kselftests
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/a0deeba1c316e59b94856c8eda40f6680fd511f8
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review wuqiang/kprobes-lib-kretprobe-scalability-improvement/20221101-110242
git checkout a0deeba1c316e59b94856c8eda40f6680fd511f8
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/trace/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

kernel/trace/fprobe.c: In function 'fprobe_init_rethook':
>> kernel/trace/fprobe.c:128:13: warning: unused variable 'i' [-Wunused-variable]
128 | int i, size;
| ^


vim +/i +128 kernel/trace/fprobe.c

cad9931f64dc7f Masami Hiramatsu 2022-03-15 125
5b0ab78998e325 Masami Hiramatsu 2022-03-15 126 static int fprobe_init_rethook(struct fprobe *fp, int num)
5b0ab78998e325 Masami Hiramatsu 2022-03-15 127 {
5b0ab78998e325 Masami Hiramatsu 2022-03-15 @128 int i, size;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 129
5b0ab78998e325 Masami Hiramatsu 2022-03-15 130 if (num < 0)
5b0ab78998e325 Masami Hiramatsu 2022-03-15 131 return -EINVAL;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 132
5b0ab78998e325 Masami Hiramatsu 2022-03-15 133 if (!fp->exit_handler) {
5b0ab78998e325 Masami Hiramatsu 2022-03-15 134 fp->rethook = NULL;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 135 return 0;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 136 }
5b0ab78998e325 Masami Hiramatsu 2022-03-15 137
5b0ab78998e325 Masami Hiramatsu 2022-03-15 138 /* Initialize rethook if needed */
5b0ab78998e325 Masami Hiramatsu 2022-03-15 139 size = num * num_possible_cpus() * 2;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 140 if (size < 0)
5b0ab78998e325 Masami Hiramatsu 2022-03-15 141 return -E2BIG;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 142
a0deeba1c316e5 wuqiang 2022-11-01 143 fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler, GFP_KERNEL,
a0deeba1c316e5 wuqiang 2022-11-01 144 sizeof(struct fprobe_rethook_node), size);
a0deeba1c316e5 wuqiang 2022-11-01 145 if (!fp->rethook)
5b0ab78998e325 Masami Hiramatsu 2022-03-15 146 return -ENOMEM;
a0deeba1c316e5 wuqiang 2022-11-01 147
5b0ab78998e325 Masami Hiramatsu 2022-03-15 148 return 0;
5b0ab78998e325 Masami Hiramatsu 2022-03-15 149 }
5b0ab78998e325 Masami Hiramatsu 2022-03-15 150

--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (3.54 kB)
config (175.14 kB)

2022-11-02 02:38:59

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v4] kprobes,lib: kretprobe scalability improvement

kretprobe is using freelist to manage return-instances, but freelist,
as LIFO queue based on singly linked list, scales badly and reduces
the overall throughput of kretprobed routines, especially for high
contention scenarios.

Here's a typical throughput test of sys_flock (counts in 10 seconds,
measured with perf stat -a -I 10000 -e syscalls:sys_enter_flock):

OS: Debian 10 X86_64, Linux 6.1rc2
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s

1X 2X 4X 6X 8X 12X 16X
34762430 36546920 17949900 13101899 12569595 12646601 14729195
24X 32X 48X 64X 72X 96X 128X
19263546 10102064 8985418 11936495 11493980 7127789 9330985

This patch implements a scalable, lock-less and numa-aware object pool,
which brings near-linear scalability to kretprobed routines. Tests of
kretprobe throughput show an improvement of up to 333.9x over the
original freelist. Here's the comparison:

1X 2X 4X 8X 16X
freelist: 34762430 36546920 17949900 12569595 14729195
objpool: 35627544 72182095 144068494 287564688 576903916
32X 48X 64X 96X 128X
freelist: 10102064 8985418 11936495 7127789 9330985
objpool: 1158876372 1737828164 2324371724 2380310472 2463182819

Tests on a 96-core ARM64 system show similar results, with the biggest
ratio up to 642.2x:

OS: Debian 10 AARCH64, Linux 6.1rc2
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s

1X 2X 4X 8X 16X
freelist: 17498299 10887037 10224710 8499132 6421751
objpool: 18715726 35549845 71615884 144258971 283707220
24X 32X 48X 64X 96X
freelist: 5339868 4819116 3593919 3121575 2687167
objpool: 419830913 571609748 877456139 1143316315 1725668029

The object pool leverages a percpu ring-array to mitigate hot spots
of memory contention, delivering near-linear scalability for highly
parallel scenarios. The ring-array is compactly managed in a single
cacheline (64 bytes) to benefit from a warmed L1 cache in most cases
(<= 4 instances per core), and objects are managed in the continuous
cachelines just after the ring-array.
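
A minimal usage sketch of the new objpool API (struct my_obj, my_obj_init
and the pool sizes below are illustrative placeholders only):

	#include <linux/objpool.h>

	struct my_obj {
		int id;
	};

	/* called once for each pre-allocated object */
	static int my_obj_init(void *context, void *obj)
	{
		((struct my_obj *)obj)->id = 0;
		return 0;
	}

	static struct objpool_head my_pool;

	static int my_pool_demo(void)
	{
		struct my_obj *obj;

		/* pre-allocate 64 objects, spread over the percpu ring-arrays */
		if (objpool_init(&my_pool, 64, 64, sizeof(struct my_obj),
				 GFP_KERNEL, NULL, my_obj_init, NULL))
			return -ENOMEM;

		obj = objpool_pop(&my_pool);		/* lockless allocation */
		if (obj)
			objpool_push(obj, &my_pool);	/* lockless reclamation */

		objpool_fini(&my_pool);	/* release pool and all objects */
		return 0;
	}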

Changes since V3:
1) build warning: unused variable in fprobe_init_rethook
Reported-by: kernel test robot <[email protected]>

Changes since V2:
1) the percpu-extended version of the freelist replaced by new percpu-
ring-array. freelist has data-contention in freelist_node (refs and
next) even after node is removed from freelist and the node could
be polluted easily (with freelist_node defined in union)
2) routines split to objpool.h and objpool.c according to cold & hot
paths, and the latter moved to lib, as suggested by Masami
3) test module (test_objpool.ko) added to lib for functional testings

Changes since V1:
1) reformat to a single patch as Masami Hiramatsu suggested
2) use __vmalloc_node to replace vmalloc_node for vmalloc
3) a few minor fixes: typo and coding-style issues

Signed-off-by: wuqiang <[email protected]>
---
include/linux/freelist.h | 129 -----
include/linux/kprobes.h | 9 +-
include/linux/objpool.h | 151 ++++++
include/linux/rethook.h | 15 +-
kernel/kprobes.c | 95 ++--
kernel/trace/fprobe.c | 17 +-
kernel/trace/rethook.c | 80 +--
lib/Kconfig.debug | 11 +
lib/Makefile | 4 +-
lib/objpool.c | 480 ++++++++++++++++++
lib/test_objpool.c | 1031 ++++++++++++++++++++++++++++++++++++++
11 files changed, 1772 insertions(+), 250 deletions(-)
delete mode 100644 include/linux/freelist.h
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c
create mode 100644 lib/test_objpool.c

diff --git a/include/linux/freelist.h b/include/linux/freelist.h
deleted file mode 100644
index fc1842b96469..000000000000
--- a/include/linux/freelist.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
-#ifndef FREELIST_H
-#define FREELIST_H
-
-#include <linux/atomic.h>
-
-/*
- * Copyright: [email protected]
- *
- * A simple CAS-based lock-free free list. Not the fastest thing in the world
- * under heavy contention, but simple and correct (assuming nodes are never
- * freed until after the free list is destroyed), and fairly speedy under low
- * contention.
- *
- * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
- */
-
-struct freelist_node {
- atomic_t refs;
- struct freelist_node *next;
-};
-
-struct freelist_head {
- struct freelist_node *head;
-};
-
-#define REFS_ON_FREELIST 0x80000000
-#define REFS_MASK 0x7FFFFFFF
-
-static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * Since the refcount is zero, and nobody can increase it once it's
- * zero (except us, and we run only one copy of this method per node at
- * a time, i.e. the single thread case), then we know we can safely
- * change the next pointer of the node; however, once the refcount is
- * back above zero, then other threads could increase it (happens under
- * heavy contention, when the refcount goes to zero in between a load
- * and a refcount increment of a node in try_get, then back up to
- * something non-zero, then the refcount increment is done by the other
- * thread) -- so if the CAS to add the node to the actual list fails,
- * decrese the refcount and leave the add operation to the next thread
- * who puts the refcount back to zero (which could be us, hence the
- * loop).
- */
- struct freelist_node *head = READ_ONCE(list->head);
-
- for (;;) {
- WRITE_ONCE(node->next, head);
- atomic_set_release(&node->refs, 1);
-
- if (!try_cmpxchg_release(&list->head, &head, node)) {
- /*
- * Hmm, the add failed, but we can only try again when
- * the refcount goes back to zero.
- */
- if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
- continue;
- }
- return;
- }
-}
-
-static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * We know that the should-be-on-freelist bit is 0 at this point, so
- * it's safe to set it using a fetch_add.
- */
- if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
- /*
- * Oh look! We were the last ones referencing this node, and we
- * know we want to add it to the free list, so let's do it!
- */
- __freelist_add(node, list);
- }
-}
-
-static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
-{
- struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
- unsigned int refs;
-
- while (head) {
- prev = head;
- refs = atomic_read(&head->refs);
- if ((refs & REFS_MASK) == 0 ||
- !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
- head = smp_load_acquire(&list->head);
- continue;
- }
-
- /*
- * Good, reference count has been incremented (it wasn't at
- * zero), which means we can read the next and not worry about
- * it changing between now and the time we do the CAS.
- */
- next = READ_ONCE(head->next);
- if (try_cmpxchg_acquire(&list->head, &head, next)) {
- /*
- * Yay, got the node. This means it was on the list,
- * which means should-be-on-freelist must be false no
- * matter the refcount (because nobody else knows it's
- * been taken off yet, it can't have been put back on).
- */
- WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
-
- /*
- * Decrease refcount twice, once for our ref, and once
- * for the list's ref.
- */
- atomic_fetch_add(-2, &head->refs);
-
- return head;
- }
-
- /*
- * OK, the head must have changed on us, but we still need to decrement
- * the refcount we increased.
- */
- refs = atomic_fetch_add(-1, &prev->refs);
- if (refs == REFS_ON_FREELIST + 1)
- __freelist_add(prev, list);
- }
-
- return NULL;
-}
-
-#endif /* FREELIST_H */
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index a0b92be98984..f13f01e600c2 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -27,7 +27,7 @@
#include <linux/mutex.h>
#include <linux/ftrace.h>
#include <linux/refcount.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/rethook.h>
#include <asm/kprobes.h>

@@ -141,6 +141,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
*/
struct kretprobe_holder {
struct kretprobe *rp;
+ struct objpool_head oh;
refcount_t ref;
};

@@ -154,7 +155,6 @@ struct kretprobe {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook *rh;
#else
- struct freelist_head freelist;
struct kretprobe_holder *rph;
#endif
};
@@ -165,10 +165,7 @@ struct kretprobe_instance {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook_node node;
#else
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct kretprobe_holder *rph;
kprobe_opcode_t *ret_addr;
diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..0b746187482a
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,151 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ *
+ * The object pool is a scalable implementation of a high performance
+ * queue for object allocation and reclamation, such as kretprobe instances.
+ *
+ * By leveraging a per-cpu ring-array to mitigate the hot spots of memory
+ * contention, it can deliver near-linear scalability for highly parallel
+ * cases. Meanwhile, it also achieves high throughput by benefiting from
+ * the warmed cache on each core.
+ *
+ * The object pool is best suited for the following cases:
+ * 1) memory allocation or reclamation is prohibited or too expensive
+ * 2) the objects are allocated/used/reclaimed very frequently
+ *
+ * Before using, you must be aware of its limitations:
+ * 1) The maximum number of objects is determined during pool initialization
+ * 2) The memory of objects won't be freed until the pool is de-allocated
+ * 3) Both allocation and reclamation can be nested
+ */
+
+/*
+ * objpool_slot: per-cpu ring array
+ *
+ * Represents a cpu-local array-based ring buffer; its size is specified
+ * during initialization of the object pool.
+ *
+ * The objpool_slot is allocated from local memory on NUMA systems, and is
+ * kept compact in a single cacheline. ages[] is stored just after the
+ * body of objpool_slot, and ents[] follows ages[]. ages[] records the
+ * epoch (revision) of each item, solely used to avoid ABA issues. ents[]
+ * contains the object pointers.
+ *
+ * The default size of objpool_slot is a single cacheline, aka. 64 bytes.
+ *
+ * 64bit:
+ * 4 8 12 16 32 64
+ * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) |
+ *
+ * 32bit:
+ * 4 8 12 16 32 48 64
+ * | head | tail | size | mask | ages[4] | ents[4] | unused |
+ *
+ */
+
+struct objpool_slot {
+ uint32_t os_head; /* head of ring array */
+ uint32_t os_tail; /* tail of ring array */
+ uint32_t os_size; /* max item slots, pow of 2 */
+ uint32_t os_mask; /* os_size - 1 */
+/*
+ * uint32_t os_ages[]; // ring epoch id
+ * void *os_ents[]; // objects array
+ */
+};
+
+/* caller-specified object initialization callback to set up each object, only called once */
+typedef int (*objpool_init_node_cb)(void *context, void *obj);
+
+/* caller-specified cleanup callback for private objects/pool/context */
+typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
+
+/* called for object releasing: ptr points to an object */
+#define OBJPOOL_FLAG_NODE (0x00000001)
+/* for user pool and context releasing, ptr could be NULL */
+#define OBJPOOL_FLAG_POOL (0x00001000)
+/* the object or pool to be released is user-managed */
+#define OBJPOOL_FLAG_USER (0x00008000)
+
+/*
+ * objpool_head: object pooling metadata
+ */
+
+struct objpool_head {
+ uint32_t oh_objsz; /* object & element size */
+ uint32_t oh_nobjs; /* total objs (pre-allocated) */
+ uint32_t oh_nents; /* max objects per cpuslot */
+ uint32_t oh_ncpus; /* num of possible cpus */
+ uint32_t oh_in_user:1; /* user-specified buffer */
+ uint32_t oh_in_slot:1; /* objs alloced with slots */
+ uint32_t oh_vmalloc:1; /* alloc from vmalloc zone */
+ gfp_t oh_gfp; /* k/vmalloc gfp flags */
+ uint32_t oh_sz_pool; /* user pool size in bytes */
+ void *oh_pool; /* user managed memory pool */
+ struct objpool_slot **oh_slots; /* array of percpu slots */
+ uint32_t *oh_sz_slots; /* size in bytes of slots */
+ objpool_release_cb oh_release; /* resource cleanup callback */
+ void *oh_context; /* caller-provided context */
+};
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *oh,
+ int nobjs, int max, int objsz,
+ gfp_t gfp, void *context,
+ objpool_init_node_cb objinit,
+ objpool_release_cb release);
+
+/* add objects in batch from user provided pool */
+int objpool_populate(struct objpool_head *oh, void *buf,
+ int size, int objsz, void *context,
+ objpool_init_node_cb objinit);
+
+/* add pre-allocated object (managed by user) to objpool */
+int objpool_add(void *obj, struct objpool_head *oh);
+
+/* allocate an object from objects pool */
+void *objpool_pop(struct objpool_head *oh);
+
+/* reclaim an object and return it back to objects pool */
+int objpool_push(void *node, struct objpool_head *oh);
+
+/* cleanup the whole object pool (including all chained objects) */
+void objpool_fini(struct objpool_head *oh);
+
+/* whether the object is pre-allocated with percpu slots */
+static inline int objpool_is_inslot(void *obj, struct objpool_head *oh)
+{
+ void *slot;
+ int i;
+
+ if (!obj)
+ return 0;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ slot = oh->oh_slots[i];
+ if (obj >= slot && obj < slot + oh->oh_sz_slots[i])
+ return 1;
+ }
+
+ return 0;
+}
+
+/* whether the object is from user pool (batched adding) */
+static inline int objpool_is_inpool(void *obj, struct objpool_head *oh)
+{
+ return (obj && oh->oh_pool && obj >= oh->oh_pool &&
+ obj < oh->oh_pool + oh->oh_sz_pool);
+}
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/include/linux/rethook.h b/include/linux/rethook.h
index c8ac1e5afcd1..278ec65e71fe 100644
--- a/include/linux/rethook.h
+++ b/include/linux/rethook.h
@@ -6,7 +6,7 @@
#define _LINUX_RETHOOK_H

#include <linux/compiler.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/kallsyms.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
@@ -30,14 +30,14 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, struct pt_regs
struct rethook {
void *data;
rethook_handler_t handler;
- struct freelist_head pool;
+ struct objpool_head pool;
refcount_t ref;
struct rcu_head rcu;
};

/**
* struct rethook_node - The rethook shadow-stack entry node.
- * @freelist: The freelist, linked to struct rethook::pool.
+ * @nod: The objpool node, linked to struct rethook::pool.
* @rcu: The rcu_head for deferred freeing.
* @llist: The llist, linked to a struct task_struct::rethooks.
* @rethook: The pointer to the struct rethook.
@@ -48,19 +48,15 @@ struct rethook {
* on each entry of the shadow stack.
*/
struct rethook_node {
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct rethook *rethook;
unsigned long ret_addr;
unsigned long frame;
};

-struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, gfp_t gfp, int size, int max);
void rethook_free(struct rethook *rh);
-void rethook_add_node(struct rethook *rh, struct rethook_node *node);
struct rethook_node *rethook_try_get(struct rethook *rh);
void rethook_recycle(struct rethook_node *node);
void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
@@ -97,4 +93,3 @@ void rethook_flush_task(struct task_struct *tk);
#endif

#endif
-
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index b781dee3f552..42cb708c3248 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1865,10 +1865,12 @@ static struct notifier_block kprobe_exceptions_nb = {
static void free_rp_inst_rcu(struct rcu_head *head)
{
struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
+ struct kretprobe_holder *rph = ri->rph;

- if (refcount_dec_and_test(&ri->rph->ref))
- kfree(ri->rph);
- kfree(ri);
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->oh);
+ kfree(rph);
+ }
}
NOKPROBE_SYMBOL(free_rp_inst_rcu);

@@ -1877,7 +1879,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
struct kretprobe *rp = get_kretprobe(ri);

if (likely(rp))
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rp->rph->oh);
else
call_rcu(&ri->rcu, free_rp_inst_rcu);
}
@@ -1914,23 +1916,19 @@ NOKPROBE_SYMBOL(kprobe_flush_task);

static inline void free_rp_inst(struct kretprobe *rp)
{
- struct kretprobe_instance *ri;
- struct freelist_node *node;
- int count = 0;
-
- node = rp->freelist.head;
- while (node) {
- ri = container_of(node, struct kretprobe_instance, freelist);
- node = node->next;
-
- kfree(ri);
- count++;
- }
+ struct kretprobe_holder *rph = rp->rph;
+ void *nod;

- if (refcount_sub_and_test(count, &rp->rph->ref)) {
- kfree(rp->rph);
- rp->rph = NULL;
- }
+ rp->rph = NULL;
+ do {
+ nod = objpool_pop(&rph->oh);
+ /* always drop a reference since we grabbed one extra ref */
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->oh);
+ kfree(rph);
+ break;
+ }
+ } while (nod);
}

/* This assumes the 'tsk' is the current task or the is not running. */
@@ -2072,19 +2070,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
{
struct kretprobe *rp = container_of(p, struct kretprobe, kp);
+ struct kretprobe_holder *rph = rp->rph;
struct kretprobe_instance *ri;
- struct freelist_node *fn;

- fn = freelist_try_get(&rp->freelist);
- if (!fn) {
+ ri = objpool_pop(&rph->oh);
+ if (!ri) {
rp->nmissed++;
return 0;
}

- ri = container_of(fn, struct kretprobe_instance, freelist);
-
if (rp->entry_handler && rp->entry_handler(ri, regs)) {
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rph->oh);
return 0;
}

@@ -2174,10 +2170,19 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
return 0;
}

+#ifndef CONFIG_KRETPROBE_ON_RETHOOK
+static int kretprobe_init_inst(void *context, void *nod)
+{
+ struct kretprobe_instance *ri = nod;
+
+ ri->rph = context;
+ return 0;
+}
+#endif
+
int register_kretprobe(struct kretprobe *rp)
{
int ret;
- struct kretprobe_instance *inst;
int i;
void *addr;

@@ -2215,20 +2220,12 @@ int register_kretprobe(struct kretprobe *rp)
#endif
}
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
- rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
+ rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler, GFP_KERNEL,
+ sizeof(struct kretprobe_instance) + rp->data_size,
+ rp->maxactive);
if (!rp->rh)
return -ENOMEM;

- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- rethook_free(rp->rh);
- rp->rh = NULL;
- return -ENOMEM;
- }
- rethook_add_node(rp->rh, &inst->node);
- }
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
@@ -2237,25 +2234,19 @@ int register_kretprobe(struct kretprobe *rp)
rp->rh = NULL;
}
#else /* !CONFIG_KRETPROBE_ON_RETHOOK */
- rp->freelist.head = NULL;
rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
if (!rp->rph)
return -ENOMEM;

- rp->rph->rp = rp;
- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- refcount_set(&rp->rph->ref, i);
- free_rp_inst(rp);
- return -ENOMEM;
- }
- inst->rph = rp->rph;
- freelist_add(&inst->freelist, &rp->freelist);
+ if (objpool_init(&rp->rph->oh, rp->maxactive, rp->maxactive,
+ rp->data_size + sizeof(struct kretprobe_instance),
+ GFP_KERNEL, rp->rph, kretprobe_init_inst, NULL)) {
+ kfree(rp->rph);
+ rp->rph = NULL;
+ return -ENOMEM;
}
- refcount_set(&rp->rph->ref, i);
-
+ refcount_set(&rp->rph->ref, rp->maxactive + 1);
+ rp->rph->rp = rp;
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index aac63ca9c3d1..99b4ab0f6468 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -125,7 +125,7 @@ static void fprobe_init(struct fprobe *fp)

static int fprobe_init_rethook(struct fprobe *fp, int num)
{
- int i, size;
+ int size;

if (num < 0)
return -EINVAL;
@@ -140,18 +140,11 @@ static int fprobe_init_rethook(struct fprobe *fp, int num)
if (size < 0)
return -E2BIG;

- fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
- for (i = 0; i < size; i++) {
- struct fprobe_rethook_node *node;
+ fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler, GFP_KERNEL,
+ sizeof(struct fprobe_rethook_node), size);
+ if (!fp->rethook)
+ return -ENOMEM;

- node = kzalloc(sizeof(*node), GFP_KERNEL);
- if (!node) {
- rethook_free(fp->rethook);
- fp->rethook = NULL;
- return -ENOMEM;
- }
- rethook_add_node(fp->rethook, &node->node);
- }
return 0;
}

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index c69d82273ce7..01df98db2fbe 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -36,21 +36,17 @@ void rethook_flush_task(struct task_struct *tk)
static void rethook_free_rcu(struct rcu_head *head)
{
struct rethook *rh = container_of(head, struct rethook, rcu);
- struct rethook_node *rhn;
- struct freelist_node *node;
- int count = 1;
+ struct rethook_node *nod;

- node = rh->pool.head;
- while (node) {
- rhn = container_of(node, struct rethook_node, freelist);
- node = node->next;
- kfree(rhn);
- count++;
- }
-
- /* The rh->ref is the number of pooled node + 1 */
- if (refcount_sub_and_test(count, &rh->ref))
- kfree(rh);
+ do {
+ nod = objpool_pop(&rh->pool);
+ /* always drop a reference since we grabbed one extra ref */
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ kfree(rh);
+ break;
+ }
+ } while (nod);
}

/**
@@ -70,16 +66,28 @@ void rethook_free(struct rethook *rh)
call_rcu(&rh->rcu, rethook_free_rcu);
}

+static int rethook_init_node(void *context, void *nod)
+{
+ struct rethook_node *node = nod;
+
+ node->rethook = context;
+ return 0;
+}
+
/**
* rethook_alloc() - Allocate struct rethook.
* @data: a data to pass the @handler when hooking the return.
* @handler: the return hook callback function.
+ * @gfp: default gfp for objpool allocation
+ * @size: rethook node size
+ * @max: number of rethook nodes to be preallocated
*
* Allocate and initialize a new rethook with @data and @handler.
* Return NULL if memory allocation fails or @handler is NULL.
* Note that @handler == NULL means this rethook is going to be freed.
*/
-struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, gfp_t gfp,
+ int size, int max)
{
struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);

@@ -88,34 +96,26 @@ struct rethook *rethook_alloc(void *data, rethook_handler_t handler)

rh->data = data;
rh->handler = handler;
- rh->pool.head = NULL;
- refcount_set(&rh->ref, 1);

+ /* initialize the objpool for rethook nodes */
+ if (objpool_init(&rh->pool, max, max, size, gfp, rh, rethook_init_node,
+ NULL)) {
+ kfree(rh);
+ return NULL;
+ }
+ refcount_set(&rh->ref, max + 1);
return rh;
}

-/**
- * rethook_add_node() - Add a new node to the rethook.
- * @rh: the struct rethook.
- * @node: the struct rethook_node to be added.
- *
- * Add @node to @rh. User must allocate @node (as a part of user's
- * data structure.) The @node fields are initialized in this function.
- */
-void rethook_add_node(struct rethook *rh, struct rethook_node *node)
-{
- node->rethook = rh;
- freelist_add(&node->freelist, &rh->pool);
- refcount_inc(&rh->ref);
-}
-
static void free_rethook_node_rcu(struct rcu_head *head)
{
struct rethook_node *node = container_of(head, struct rethook_node, rcu);
+ struct rethook *rh = node->rethook;

- if (refcount_dec_and_test(&node->rethook->ref))
- kfree(node->rethook);
- kfree(node);
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ kfree(rh);
+ }
}

/**
@@ -130,7 +130,7 @@ void rethook_recycle(struct rethook_node *node)
lockdep_assert_preemption_disabled();

if (likely(READ_ONCE(node->rethook->handler)))
- freelist_add(&node->freelist, &node->rethook->pool);
+ objpool_push(node, &node->rethook->pool);
else
call_rcu(&node->rcu, free_rethook_node_rcu);
}
@@ -146,7 +146,7 @@ NOKPROBE_SYMBOL(rethook_recycle);
struct rethook_node *rethook_try_get(struct rethook *rh)
{
rethook_handler_t handler = READ_ONCE(rh->handler);
- struct freelist_node *fn;
+ struct rethook_node *nod;

lockdep_assert_preemption_disabled();

@@ -163,11 +163,11 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!rcu_is_watching()))
return NULL;

- fn = freelist_try_get(&rh->pool);
- if (!fn)
+ nod = (struct rethook_node *)objpool_pop(&rh->pool);
+ if (!nod)
return NULL;

- return container_of(fn, struct rethook_node, freelist);
+ return nod;
}
NOKPROBE_SYMBOL(rethook_try_get);

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3fc7abffc7aa..b12cc71754cf 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2737,6 +2737,17 @@ config TEST_CLOCKSOURCE_WATCHDOG

If unsure, say N.

+config TEST_OBJPOOL
+ tristate "Test module for correctness and stress of objpool"
+ default n
+ depends on m
+ help
+ This builds the "test_objpool" module that should be used for
+ correctness verification and concurrent testings of objects
+ allocation and reclamation.
+
+ If unsure, say N.
+
endif # RUNTIME_TESTING_MENU

config ARCH_USE_MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 161d6a724ff7..4aa282fa0cfc 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
- buildid.o
+ buildid.o objpool.o

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o
@@ -99,6 +99,8 @@ obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o
CFLAGS_test_fprobe.o += $(CC_FLAGS_FTRACE)
obj-$(CONFIG_FPROBE_SANITY_TEST) += test_fprobe.o
+obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o
+
#
# CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns
# off the generation of FPU/SSE* instructions for kernel proper but FPU_FLAGS
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..51b3499ff9da
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,480 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ */
+
+/* compute the suitable number of objects to be managed by each slot */
+static inline uint32_t __objpool_num_of_objs(uint32_t size)
+{
+ return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
+ (sizeof(uint32_t) + sizeof(void *)));
+}
+
+#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
+#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
+ sizeof(uint32_t) * (s)->os_size))
+#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
+ (sizeof(uint32_t) + sizeof(void *)) * (s)->os_size))
+
+/* allocate and initialize percpu slots */
+static inline int
+__objpool_init_percpu_slots(struct objpool_head *oh, uint32_t nobjs,
+ void *context, objpool_init_node_cb objinit)
+{
+ uint32_t i, j, size, objsz, nents = oh->oh_nents;
+
+ /* aligned object size by sizeof(void *) */
+ objsz = ALIGN(oh->oh_objsz, sizeof(void *));
+ /* shall we allocate objects along with objpool_slot */
+ if (objsz)
+ oh->oh_in_slot = 1;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ struct objpool_slot *os;
+ uint32_t n;
+
+ /* compute how many objects to be managed by this slot */
+ n = nobjs / oh->oh_ncpus;
+ if (i < (nobjs % oh->oh_ncpus))
+ n++;
+ size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
+ sizeof(uint32_t) * nents + objsz * n;
+
+ /* decide which pool shall the slot be allocated from */
+ if (0 == i) {
+ if ((oh->oh_gfp & GFP_ATOMIC) || size < PAGE_SIZE / 2)
+ oh->oh_vmalloc = 0;
+ else
+ oh->oh_vmalloc = 1;
+ }
+
+ /* allocate percpu slot & objects from local memory */
+ if (oh->oh_vmalloc)
+ os = vmalloc_node(size, cpu_to_node(i));
+ else
+ os = kmalloc_node(size, oh->oh_gfp, cpu_to_node(i));
+ if (!os)
+ return -ENOMEM;
+
+ /* initialize percpu slot for the i-th cpu */
+ memset(os, 0, size);
+ os->os_size = oh->oh_nents;
+ os->os_mask = os->os_size - 1;
+ oh->oh_slots[i] = os;
+ oh->oh_sz_slots[i] = size;
+
+ /*
+ * start from the 2nd round to avoid conflicts on the 1st item.
+ * we assume that the head item is ready for retrieval iff head
+ * is equal to ages[head & mask]. but ages[] is initialized as 0,
+ * so from the view of the caller of pop(), the 1st item (index 0)
+ * would always look ready. in fact push() could be stalled just
+ * before its final update, and the item being inserted would
+ * then be lost forever.
+ */
+ os->os_head = os->os_tail = oh->oh_nents;
+
+ for (j = 0; oh->oh_in_slot && j < n; j++) {
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ void *obj = SLOT_OBJS(os) + j * objsz;
+ uint32_t ie = os->os_tail & os->os_mask;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* add obj into the ring array */
+ ents[ie] = obj;
+ ages[ie] = os->os_tail;
+ os->os_tail++;
+ oh->oh_nobjs++;
+ }
+ }
+
+ return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static inline void __objpool_fini_percpu_slots(struct objpool_head *oh)
+{
+ uint32_t i;
+
+ if (!oh->oh_slots)
+ return;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ if (!oh->oh_slots[i])
+ continue;
+ if (oh->oh_vmalloc)
+ vfree(oh->oh_slots[i]);
+ else
+ kfree(oh->oh_slots[i]);
+ }
+ kfree(oh->oh_slots);
+ oh->oh_slots = NULL;
+ oh->oh_sz_slots = NULL;
+}
+
+/**
+ * objpool_init: initialize object pool and pre-allocate objects
+ *
+ * args:
+ * @oh: the object pool to be initialized, declared by the caller
+ * @nobjs: total objects to be allocated by this object pool
+ * @max: max objs this objpool could manage, use nobjs if 0
+ * @objsz: size of an object, to be pre-allocated if objsz is not 0
+ * @gfp: gfp flags of caller's context for memory allocation
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setting-up
+ * @release: cleanup callback for private objects/pool/context
+ *
+ * return:
+ * 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are zeroed. The caller can do extra
+ * initialization in the objinit callback, which will be called once
+ * and only once after the slot allocation.
+ */
+int objpool_init(struct objpool_head *oh,
+ int nobjs, int max, int objsz,
+ gfp_t gfp, void *context,
+ objpool_init_node_cb objinit,
+ objpool_release_cb release)
+{
+ uint32_t nents, cpus = num_possible_cpus();
+ int rc;
+
+ /* calculate percpu slot size (rounded to pow of 2) */
+ if (max < nobjs)
+ max = nobjs;
+ nents = max / cpus;
+ if (nents < __objpool_num_of_objs(L1_CACHE_BYTES))
+ nents = __objpool_num_of_objs(L1_CACHE_BYTES);
+ nents = roundup_pow_of_two(nents);
+ while (nents * cpus < nobjs)
+ nents = nents << 1;
+
+ memset(oh, 0, sizeof(struct objpool_head));
+ oh->oh_ncpus = cpus;
+ oh->oh_objsz = objsz;
+ oh->oh_nents = nents;
+ oh->oh_gfp = gfp & ~__GFP_ZERO;
+ oh->oh_context = context;
+ oh->oh_release = release;
+
+ /* allocate array for percpu slots */
+ oh->oh_slots = kzalloc(oh->oh_ncpus * sizeof(void *) +
+ oh->oh_ncpus * sizeof(uint32_t), oh->oh_gfp);
+ if (!oh->oh_slots)
+ return -ENOMEM;
+ oh->oh_sz_slots = (uint32_t *)&oh->oh_slots[oh->oh_ncpus];
+
+ /* initialize per-cpu slots */
+ rc = __objpool_init_percpu_slots(oh, nobjs, context, objinit);
+ if (rc)
+ __objpool_fini_percpu_slots(oh);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot tail, the given slot mustn't be full */
+static inline int __objpool_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t tail = atomic_inc_return((atomic_t *)&os->os_tail) - 1;
+
+ WRITE_ONCE(ents[tail & os->os_mask], obj);
+
+ /* order matters: obj must be updated before tail updating */
+ smp_store_release(&ages[tail & os->os_mask], tail);
+ return 0;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int __objpool_try_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t head, tail;
+
+ do {
+ /* perform memory loading for both head and tail */
+ head = READ_ONCE(os->os_head);
+ tail = READ_ONCE(os->os_tail);
+ /* just abort if slot is full */
+ if (tail >= head + os->os_size)
+ return -ENOENT;
+ /* try to extend tail by 1 using CAS to avoid races */
+ if (try_cmpxchg_acquire(&os->os_tail, &tail, tail + 1))
+ break;
+ } while (1);
+
+ /* the tail-th entry of the slot is now reserved for the given obj */
+ WRITE_ONCE(ents[tail & os->os_mask], obj);
+ /* update epoch id to make this object available for pop() */
+ smp_store_release(&ages[tail & os->os_mask], tail);
+ return 0;
+}
+
+/**
+ * objpool_populate: add objects from user provided pool in batch
+ *
+ * args:
+ * @oh: object pool
+ * @buf: user buffer for pre-allocated objects
+ * @size: size of user buffer
+ * @objsz: size of object & element
+ * @context: user context for objinit callback
+ * @objinit: object initialization callback
+ *
+ * return: 0 or error code
+ */
+int objpool_populate(struct objpool_head *oh, void *buf, int size, int objsz,
+ void *context, objpool_init_node_cb objinit)
+{
+ int n = oh->oh_nobjs, used = 0, i;
+
+ if (oh->oh_pool || !buf || size < objsz)
+ return -EINVAL;
+ if (oh->oh_objsz && oh->oh_objsz != objsz)
+ return -EINVAL;
+ if (oh->oh_context && context && oh->oh_context != context)
+ return -EINVAL;
+ if (oh->oh_nobjs >= oh->oh_ncpus * oh->oh_nents)
+ return -ENOENT;
+
+ WARN_ON_ONCE(((unsigned long)buf) & (sizeof(void *) - 1));
+ WARN_ON_ONCE(((uint32_t)objsz) & (sizeof(void *) - 1));
+
+ /* align object size by sizeof(void *) */
+ oh->oh_objsz = objsz;
+ objsz = ALIGN(objsz, sizeof(void *));
+ if (objsz <= 0)
+ return -EINVAL;
+
+ while (used + objsz <= size) {
+ void *obj = buf + used;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* insert obj to its corresponding objpool slot */
+ i = (n + used * oh->oh_ncpus/size) % oh->oh_ncpus;
+ if (!__objpool_try_add_slot(obj, oh->oh_slots[i]))
+ oh->oh_nobjs++;
+
+ used += objsz;
+ }
+
+ if (!used)
+ return -ENOENT;
+
+ oh->oh_context = context;
+ oh->oh_pool = buf;
+ oh->oh_sz_pool = size;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(objpool_populate);
+
+/**
+ * objpool_add: add pre-allocated object to objpool during pool
+ * initialization
+ *
+ * args:
+ * @obj: object pointer to be added to objpool
+ * @oh: object pool to be inserted into
+ *
+ * return:
+ * 0 or error code
+ *
+ * objpool_add doesn't handle race conditions and can only be
+ * called during objpool initialization
+ */
+int objpool_add(void *obj, struct objpool_head *oh)
+{
+ uint32_t i, cpu;
+
+ if (!obj)
+ return -EINVAL;
+ if (oh->oh_nobjs >= oh->oh_ncpus * oh->oh_nents)
+ return -ENOENT;
+
+ cpu = oh->oh_nobjs % oh->oh_ncpus;
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ if (!__objpool_try_add_slot(obj, oh->oh_slots[cpu])) {
+ oh->oh_nobjs++;
+ return 0;
+ }
+
+ if (++cpu >= oh->oh_ncpus)
+ cpu = 0;
+ }
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_add);
+
+/**
+ * objpool_push: reclaim the object and return it back to the object pool
+ *
+ * args:
+ * @obj: object pointer to be pushed to object pool
+ * @oh: object pool
+ *
+ * return:
+ * 0 or error code: it fails only when the object pool is full
+ *
+ * objpool_push is non-blocking, and can be nested
+ */
+int objpool_push(void *obj, struct objpool_head *oh)
+{
+ uint32_t cpu = raw_smp_processor_id();
+
+ do {
+ if (oh->oh_nobjs > oh->oh_nents) {
+ if (!__objpool_try_add_slot(obj, oh->oh_slots[cpu]))
+ return 0;
+ } else {
+ if (!__objpool_add_slot(obj, oh->oh_slots[cpu]))
+ return 0;
+ }
+ if (++cpu >= oh->oh_ncpus)
+ cpu = 0;
+ } while (1);
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ /* do memory load of os_head to local head */
+ uint32_t head = smp_load_acquire(&os->os_head);
+
+ /* loop if slot isn't empty */
+ while (head != READ_ONCE(os->os_tail)) {
+ uint32_t id = head & os->os_mask, prev = head;
+
+ /* do prefetching of object ents */
+ prefetch(&ents[id]);
+
+ /*
+ * check whether this item is ready for retrieval. in theory
+ * there's a possibility that we might retrieve a wrong object,
+ * in case ages[id] overflows while the current task is sleeping,
+ * but it would take a very long time to overflow a uint32_t
+ */
+ if (smp_load_acquire(&ages[id]) == head) {
+ /* node must have been updated by push() */
+ void *node = READ_ONCE(ents[id]);
+ /* commit and move forward head of the slot */
+ if (try_cmpxchg_release(&os->os_head, &head, head + 1))
+ return node;
+ }
+
+ /* re-load head from memory and continue trying */
+ head = READ_ONCE(os->os_head);
+ /*
+ * head stays unchanged, so it's very likely current pop()
+ * just preempted/interrupted an ongoing push() operation
+ */
+ if (head == prev)
+ break;
+ }
+
+ return NULL;
+}
+
+/**
+ * objpool_pop: allocate an object from objects pool
+ *
+ * args:
+ * @oh: object pool
+ *
+ * return:
+ * object: NULL if failed (object pool is empty)
+ *
+ * objpool_pop can be nested, so can be used in any context.
+ */
+void *objpool_pop(struct objpool_head *oh)
+{
+ uint32_t i, cpu = raw_smp_processor_id();
+ void *obj = NULL;
+
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ struct objpool_slot *slot = oh->oh_slots[cpu];
+ obj = __objpool_try_get_slot(slot);
+ if (obj)
+ break;
+ if (++cpu >= oh->oh_ncpus)
+ cpu = 0;
+ }
+
+ return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/**
+ * objpool_fini: cleanup the whole object pool (releasing all objects)
+ *
+ * args:
+ * @oh: object pool to be released
+ *
+ */
+void objpool_fini(struct objpool_head *oh)
+{
+ uint32_t i, flags;
+
+ if (!oh->oh_slots)
+ return;
+
+ if (!oh->oh_release) {
+ __objpool_fini_percpu_slots(oh);
+ return;
+ }
+
+ /* clean up all objects remaining in the objpool */
+ for (i = 0; i < oh->oh_ncpus; i++) {
+ void *obj;
+ do {
+ flags = OBJPOOL_FLAG_NODE;
+ obj = __objpool_try_get_slot(oh->oh_slots[i]);
+ if (!obj)
+ break;
+ if (!objpool_is_inpool(obj, oh) &&
+ !objpool_is_inslot(obj, oh)) {
+ flags |= OBJPOOL_FLAG_USER;
+ }
+ oh->oh_release(oh->oh_context, obj, flags);
+ } while (obj);
+ }
+
+ /* release percpu slots */
+ __objpool_fini_percpu_slots(oh);
+
+ /* cleanup user private pool and related context */
+ flags = OBJPOOL_FLAG_POOL;
+ if (oh->oh_pool)
+ flags |= OBJPOOL_FLAG_USER;
+ oh->oh_release(oh->oh_context, oh->oh_pool, flags);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
diff --git a/lib/test_objpool.c b/lib/test_objpool.c
new file mode 100644
index 000000000000..c1341ddf77b5
--- /dev/null
+++ b/lib/test_objpool.c
@@ -0,0 +1,1031 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for lockless object pool
+ * (C) 2022 Matt Wu <[email protected]>
+ */
+
+#include <linux/version.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/slab.h>
+#include <linux/delay.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/objpool.h>
+
+#define OT_NR_MAX_BULK (16)
+
+struct ot_ctrl {
+ unsigned int mode;
+ unsigned int duration; /* ms */
+ unsigned int delay; /* ms */
+ unsigned int bulk_normal;
+ unsigned int bulk_irq;
+ unsigned long hrtimer; /* ms */
+ const char *name;
+};
+
+struct ot_stat {
+ unsigned long nhits;
+ unsigned long nmiss;
+};
+
+struct ot_item {
+ struct objpool_head *pool; /* pool head */
+ struct ot_ctrl *ctrl; /* ctrl parameters */
+
+ void (*worker)(struct ot_item *item, int irq);
+
+ /* hrtimer control */
+ ktime_t hrtcycle;
+ struct hrtimer hrtimer;
+
+ int bulk[2]; /* for thread and irq */
+ int delay;
+ u32 niters;
+
+ /* results summary */
+ struct ot_stat stat[2]; /* thread and irq */
+
+ u64 duration;
+};
+
+struct ot_mem_stat {
+ atomic_long_t alloc;
+ atomic_long_t free;
+};
+
+struct ot_data {
+ struct rw_semaphore start;
+ struct completion wait;
+ struct completion rcu;
+ atomic_t nthreads ____cacheline_aligned_in_smp;
+ atomic_t stop ____cacheline_aligned_in_smp;
+ struct ot_mem_stat kmalloc;
+ struct ot_mem_stat vmalloc;
+} g_ot_data;
+
+/*
+ * memory leak checking
+ */
+
+void *ot_kzalloc(long size)
+{
+ void *ptr = kzalloc(size, GFP_KERNEL);
+
+ if (ptr)
+ atomic_long_add(size, &g_ot_data.kmalloc.alloc);
+ return ptr;
+}
+
+void ot_kfree(void *ptr, long size)
+{
+ if (!ptr)
+ return;
+ atomic_long_add(size, &g_ot_data.kmalloc.free);
+ kfree(ptr);
+}
+
+void *ot_vmalloc(long size)
+{
+ void *ptr = vmalloc(size);
+
+ if (ptr)
+ atomic_long_add(size, &g_ot_data.vmalloc.alloc);
+ return ptr;
+}
+
+void ot_vfree(void *ptr, long size)
+{
+ if (!ptr)
+ return;
+ atomic_long_add(size, &g_ot_data.vmalloc.free);
+ vfree(ptr);
+}
+
+static void ot_mem_report(struct ot_ctrl *ctrl)
+{
+ long alloc, free;
+
+ pr_info("memory allocation summary for %s\n", ctrl->name);
+
+ alloc = atomic_long_read(&g_ot_data.kmalloc.alloc);
+ free = atomic_long_read(&g_ot_data.kmalloc.free);
+ pr_info(" kmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+
+ alloc = atomic_long_read(&g_ot_data.vmalloc.alloc);
+ free = atomic_long_read(&g_ot_data.vmalloc.free);
+ pr_info(" vmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+}
+
+/*
+ * general structs & routines
+ */
+
+struct ot_node {
+ void *owner;
+ unsigned long data;
+ unsigned long refs;
+};
+
+struct ot_context {
+ struct objpool_head pool;
+ void *ptr;
+ unsigned long size;
+ refcount_t refs;
+ struct rcu_head rcu;
+};
+
+static DEFINE_PER_CPU(struct ot_item, ot_pcup_items);
+
+static int ot_init_data(struct ot_data *data)
+{
+ memset(data, 0, sizeof(*data));
+ init_rwsem(&data->start);
+ init_completion(&data->wait);
+ init_completion(&data->rcu);
+ atomic_set(&data->nthreads, 1);
+
+ return 0;
+}
+
+static void ot_reset_data(struct ot_data *data)
+{
+ reinit_completion(&data->wait);
+ reinit_completion(&data->rcu);
+ atomic_set(&data->nthreads, 1);
+ atomic_set(&data->stop, 0);
+ memset(&data->kmalloc, 0, sizeof(data->kmalloc));
+ memset(&data->vmalloc, 0, sizeof(data->vmalloc));
+}
+
+static int ot_init_node(void *context, void *nod)
+{
+ struct ot_context *sop = context;
+ struct ot_node *on = nod;
+
+ on->owner = &sop->pool;
+ return 0;
+}
+
+static enum hrtimer_restart ot_hrtimer_handler(struct hrtimer *hrt)
+{
+ struct ot_item *item = container_of(hrt, struct ot_item, hrtimer);
+
+ if (atomic_read_acquire(&g_ot_data.stop))
+ return HRTIMER_NORESTART;
+
+ /* do bulk-testings for objects pop/push */
+ item->worker(item, 1);
+
+ hrtimer_forward(hrt, hrt->base->get_time(), item->hrtcycle);
+ return HRTIMER_RESTART;
+}
+
+static void ot_start_hrtimer(struct ot_item *item)
+{
+ if (!item->ctrl->hrtimer)
+ return;
+ hrtimer_start(&item->hrtimer, item->hrtcycle, HRTIMER_MODE_REL);
+}
+
+static void ot_stop_hrtimer(struct ot_item *item)
+{
+ if (!item->ctrl->hrtimer)
+ return;
+ hrtimer_cancel(&item->hrtimer);
+}
+
+static int ot_init_hrtimer(struct ot_item *item, unsigned long hrtimer)
+{
+ struct hrtimer *hrt = &item->hrtimer;
+
+ if (!hrtimer)
+ return -ENOENT;
+
+ item->hrtcycle = ktime_set(0, hrtimer * 1000000UL);
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrt->function = ot_hrtimer_handler;
+ return 0;
+}
+
+static int ot_init_cpu_item(struct ot_item *item,
+ struct ot_ctrl *ctrl,
+ struct objpool_head *pool,
+ void (*worker)(struct ot_item *, int))
+{
+ memset(item, 0, sizeof(*item));
+ item->pool = pool;
+ item->ctrl = ctrl;
+ item->worker = worker;
+
+ item->bulk[0] = ctrl->bulk_normal;
+ item->bulk[1] = ctrl->bulk_irq;
+ item->delay = ctrl->delay;
+
+ /* initialize hrtimer */
+ ot_init_hrtimer(item, item->ctrl->hrtimer);
+ return 0;
+}
+
+static int ot_thread_worker(void *arg)
+{
+ struct ot_item *item = arg;
+ ktime_t start;
+
+ sched_set_normal(current, 50);
+
+ atomic_inc(&g_ot_data.nthreads);
+ down_read(&g_ot_data.start);
+ up_read(&g_ot_data.start);
+ start = ktime_get();
+ ot_start_hrtimer(item);
+ do {
+ if (atomic_read_acquire(&g_ot_data.stop))
+ break;
+ /* do bulk-testings for objects pop/push */
+ item->worker(item, 0);
+ } while (!kthread_should_stop());
+ ot_stop_hrtimer(item);
+ item->duration = (u64) ktime_us_delta(ktime_get(), start);
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ return 0;
+}
+
+static void ot_perf_report(struct ot_ctrl *ctrl, u64 duration)
+{
+ struct ot_stat total, normal = {0}, irq = {0};
+ int cpu, nthreads = 0;
+
+ pr_info("\n");
+ pr_info("Testing summary for %s\n", ctrl->name);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ if (!item->duration)
+ continue;
+ normal.nhits += item->stat[0].nhits;
+ normal.nmiss += item->stat[0].nmiss;
+ irq.nhits += item->stat[1].nhits;
+ irq.nmiss += item->stat[1].nmiss;
+ pr_info("CPU: %d duration: %lluus\n", cpu, item->duration);
+ pr_info("\tthread:\t%16lu hits \t%16lu miss\n",
+ item->stat[0].nhits, item->stat[0].nmiss);
+ pr_info("\tirq: \t%16lu hits \t%16lu miss\n",
+ item->stat[1].nhits, item->stat[1].nmiss);
+ pr_info("\ttotal: \t%16lu hits \t%16lu miss\n",
+ item->stat[0].nhits + item->stat[1].nhits,
+ item->stat[0].nmiss + item->stat[1].nmiss);
+ nthreads++;
+ }
+
+ total.nhits = normal.nhits + irq.nhits;
+ total.nmiss = normal.nmiss + irq.nmiss;
+
+ pr_info("ALL: \tnthreads: %d duration: %lluus\n", nthreads, duration);
+ pr_info("SUM: \t%16lu hits \t%16lu miss\n",
+ total.nhits, total.nmiss);
+}
+
+/*
+ * synchronous test cases for objpool manipulation
+ */
+
+/* objpool manipulation for synchronous mode 0 (percpu objpool) */
+static struct ot_context *ot_init_sync_m0(void)
+{
+ struct ot_context *sop = NULL;
+ int max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ if (objpool_init(&sop->pool, max, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, NULL)) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON(max != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m0(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+ ot_kfree(sop, sizeof(*sop));
+}
+
+/* objpool manipulation for synchronous mode 1 (private pool) */
+static struct ot_context *ot_init_sync_m1(void)
+{
+ struct ot_context *sop = NULL;
+ unsigned long size;
+ int rc, szobj, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ size = sizeof(struct ot_node) * max;
+ sop->ptr = ot_vmalloc(size);
+ sop->size = size;
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ memset(sop->ptr, 0, size);
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL, NULL);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ WARN_ON((size / szobj) != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m1(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+
+ ot_vfree(sop->ptr, sop->size);
+ ot_kfree(sop, sizeof(*sop));
+}
+
+/* objpool manipulation for synchronous mode 2 (private objects) */
+static int ot_objpool_release(void *context, void *ptr, uint32_t flags)
+{
+ struct ot_context *sop = context;
+
+	/* here we need to release all user-allocated objects */
+ if ((flags & OBJPOOL_FLAG_NODE) && (flags & OBJPOOL_FLAG_USER)) {
+ struct ot_node *on = ptr;
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sizeof(struct ot_node));
+ } else if (flags & OBJPOOL_FLAG_POOL) {
+ /* release user preallocated pool */
+ if (sop->ptr) {
+ WARN_ON(sop->ptr != ptr);
+ WARN_ON(!(flags & OBJPOOL_FLAG_USER));
+ ot_vfree(sop->ptr, sop->size);
+ }
+ /* do context cleaning if needed */
+ ot_kfree(sop, sizeof(*sop));
+ }
+
+ return 0;
+}
+
+static struct ot_context *ot_init_sync_m2(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ int rc, i, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < max; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ }
+ }
+ WARN_ON(max != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m2(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+}
+
+/* objpool manipulation for synchronous mode 3 (mixed mode) */
+static struct ot_context *ot_init_sync_m3(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ unsigned long size;
+ int rc, i, szobj, nobjs;
+ int max = num_possible_cpus() << 4;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ nobjs = num_possible_cpus() * 2;
+ rc = objpool_init(&sop->pool, nobjs, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ size = sizeof(struct ot_node) * num_possible_cpus() * 4;
+ sop->ptr = ot_vmalloc(size);
+ if (!sop->ptr) {
+ objpool_fini(&sop->pool);
+ return NULL;
+ }
+ sop->size = size;
+ memset(sop->ptr, 0, size);
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ nobjs += size / szobj;
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < num_possible_cpus() * 2; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ if (!objpool_add(on, &sop->pool))
+ nobjs++;
+ else
+ ot_kfree(on, sizeof(struct ot_node));
+ }
+ }
+ WARN_ON(nobjs != sop->pool.oh_nobjs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m3(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+}
+
+struct {
+ struct ot_context * (*init)(void);
+ void (*fini)(struct ot_context *sop);
+} g_ot_sync_ops[4] = {
+ {ot_init_sync_m0, ot_fini_sync_m0},
+ {ot_init_sync_m1, ot_fini_sync_m1},
+ {ot_init_sync_m2, ot_fini_sync_m2},
+ {ot_init_sync_m3, ot_fini_sync_m3},
+};
+
+/*
+ * synchronous test cases: performance mode
+ */
+
+static void ot_bulk_sync(struct ot_item *item, int irq)
+{
+ struct ot_node *nods[OT_NR_MAX_BULK];
+ int i;
+
+ for (i = 0; i < item->bulk[irq]; i++)
+ nods[i] = objpool_pop(item->pool);
+
+ if (!irq && (item->delay || !(++(item->niters) & 0x7FFF)))
+ msleep(item->delay);
+
+ while (i-- > 0) {
+ struct ot_node *on = nods[i];
+ if (on) {
+ on->refs++;
+ objpool_push(on, item->pool);
+ item->stat[irq].nhits++;
+ } else {
+ item->stat[irq].nmiss++;
+ }
+ }
+}
+
+static int ot_start_sync(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop;
+ ktime_t start;
+ u64 duration;
+ unsigned long timeout;
+ int cpu, rc;
+
+	/* initialize objpool for synchronous testcase */
+ sop = g_ot_sync_ops[ctrl->mode].init();
+ if (!sop)
+ return -ENOMEM;
+
+ /* grab rwsem to block testing threads */
+ down_write(&g_ot_data.start);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ struct task_struct *work;
+
+ ot_init_cpu_item(item, ctrl, &sop->pool, ot_bulk_sync);
+
+ /* skip offline cpus */
+ if (!cpu_online(cpu))
+ continue;
+
+ work = kthread_create_on_node(ot_thread_worker, item,
+ cpu_to_node(cpu), "ot_worker_%d", cpu);
+ if (IS_ERR(work)) {
+ pr_err("failed to create thread for cpu %d\n", cpu);
+ } else {
+ kthread_bind(work, cpu);
+ wake_up_process(work);
+ }
+ }
+
+	/* wait a while to make sure all threads are waiting at the start line */
+ msleep(20);
+
+ /* in case no threads were created: memory insufficient ? */
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ // sched_set_fifo_low(current);
+
+ /* start objpool testing threads */
+ start = ktime_get();
+ up_write(&g_ot_data.start);
+
+	/* yield cpu to worker threads for duration ms */
+ timeout = msecs_to_jiffies(ctrl->duration);
+ rc = schedule_timeout_interruptible(timeout);
+
+	/* tell worker threads to quit */
+ atomic_set_release(&g_ot_data.stop, 1);
+
+	/* wait for all worker threads to finish and quit */
+ wait_for_completion(&g_ot_data.wait);
+ duration = (u64) ktime_us_delta(ktime_get(), start);
+
+ /* cleanup objpool */
+ g_ot_sync_ops[ctrl->mode].fini(sop);
+
+ /* report testing summary and performance results */
+ ot_perf_report(ctrl, duration);
+
+ /* report memory allocation summary */
+ ot_mem_report(ctrl);
+
+ return rc;
+}
+
+/*
+ * asynchronous test cases: pool lifecycle controlled by refcount
+ */
+
+static void ot_fini_async_rcu(struct rcu_head *rcu)
+{
+ struct ot_context *sop = container_of(rcu, struct ot_context, rcu);
+ struct ot_node *on;
+
+ /* here all cpus are aware of the stop event: g_ot_data.stop = 1 */
+ WARN_ON(!atomic_read_acquire(&g_ot_data.stop));
+
+ do {
+		/* release all objects remaining in the objpool */
+ on = objpool_pop(&sop->pool);
+ if (on && !objpool_is_inslot(on, &sop->pool) &&
+ !objpool_is_inpool(on, &sop->pool)) {
+ /* private object managed by user */
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sizeof(struct ot_node));
+ }
+
+		/* deref anyway since we've grabbed one extra ref */
+ if (refcount_dec_and_test(&sop->refs)) {
+ objpool_fini(&sop->pool);
+ break;
+ }
+ } while (on);
+
+ complete(&g_ot_data.rcu);
+}
+
+static void ot_fini_async(struct ot_context *sop)
+{
+ /* make sure the stop event is acknowledged by all cores */
+ call_rcu(&sop->rcu, ot_fini_async_rcu);
+}
+
+static struct ot_context *ot_init_async_m0(void)
+{
+ struct ot_context *sop = NULL;
+ int max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ if (objpool_init(&sop->pool, max, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, ot_objpool_release)) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON(max != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, max + 1);
+
+ return sop;
+}
+
+static struct ot_context *ot_init_async_m1(void)
+{
+ struct ot_context *sop = NULL;
+ unsigned long size;
+ int szobj, rc, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ size = sizeof(struct ot_node) * max;
+ sop->ptr = ot_vmalloc(size);
+ sop->size = size;
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ memset(sop->ptr, 0, size);
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+
+ /* calculate total number of objects stored in ptr */
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ WARN_ON(size / szobj != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, size / szobj + 1);
+
+ return sop;
+}
+
+static struct ot_context *ot_init_async_m2(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ int rc, i, nobjs = 0, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < max; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ nobjs++;
+ }
+ }
+ WARN_ON(nobjs != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, nobjs + 1);
+
+ return sop;
+}
+
+/* objpool manipulation for asynchronous mode 3 (mixed mode) */
+static struct ot_context *ot_init_async_m3(void)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ unsigned long size;
+ int szobj, nobjs, rc, i, max = num_possible_cpus() << 4;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+
+ /* create and initialize objpool as empty (no objects) */
+ nobjs = num_possible_cpus() * 2;
+ rc = objpool_init(&sop->pool, nobjs, max, sizeof(struct ot_node),
+ GFP_KERNEL, sop, ot_init_node, ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ size = sizeof(struct ot_node) * num_possible_cpus() * 4;
+ sop->ptr = ot_vmalloc(size);
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ sop->size = size;
+ memset(sop->ptr, 0, size);
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ sizeof(struct ot_node), sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+
+ /* calculate total number of objects stored in ptr */
+ szobj = ALIGN(sizeof(struct ot_node), sizeof(void *));
+ nobjs += size / szobj;
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < num_possible_cpus() * 2; i++) {
+ on = ot_kzalloc(sizeof(struct ot_node));
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ nobjs++;
+ }
+ }
+ WARN_ON(nobjs != sop->pool.oh_nobjs);
+ refcount_set(&sop->refs, nobjs + 1);
+
+ return sop;
+}
+
+struct {
+ struct ot_context * (*init)(void);
+ void (*fini)(struct ot_context *sop);
+} g_ot_async_ops[4] = {
+ {ot_init_async_m0, ot_fini_async},
+ {ot_init_async_m1, ot_fini_async},
+ {ot_init_async_m2, ot_fini_async},
+ {ot_init_async_m3, ot_fini_async},
+};
+
+static void ot_nod_recycle(struct ot_node *on, struct objpool_head *pool,
+ int release)
+{
+ struct ot_context *sop;
+
+ on->refs++;
+
+ if (!release) {
+		/* push object back to objpool for reuse */
+ objpool_push(on, pool);
+ return;
+ }
+
+ sop = container_of(pool, struct ot_context, pool);
+ WARN_ON(sop != pool->oh_context);
+
+ if (objpool_is_inslot(on, pool)) {
+		/* object is allocated from percpu slots */
+	} else if (objpool_is_inpool(on, pool)) {
+		/* object is allocated from user-managed pool */
+ } else {
+ /* private object managed by user */
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sizeof(struct ot_node));
+ }
+
+	/* unref objpool with node removed forever */
+ if (refcount_dec_and_test(&sop->refs))
+ objpool_fini(pool);
+}
+
+static void ot_bulk_async(struct ot_item *item, int irq)
+{
+ struct ot_node *nods[OT_NR_MAX_BULK];
+ int i, stop;
+
+ for (i = 0; i < item->bulk[irq]; i++)
+ nods[i] = objpool_pop(item->pool);
+
+ if (!irq) {
+ if (item->delay || !(++(item->niters) & 0x7FFF))
+ msleep(item->delay);
+ get_cpu();
+ }
+
+ stop = atomic_read_acquire(&g_ot_data.stop);
+
+ /* drop all objects and deref objpool */
+ while (i-- > 0) {
+ struct ot_node *on = nods[i];
+
+ if (on) {
+ on->refs++;
+ ot_nod_recycle(on, item->pool, stop);
+ item->stat[irq].nhits++;
+ } else {
+ item->stat[irq].nmiss++;
+ }
+ }
+
+ if (!irq)
+ put_cpu();
+}
+
+static int ot_start_async(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop;
+ ktime_t start;
+ u64 duration;
+ unsigned long timeout;
+ int cpu, rc;
+
+	/* initialize objpool for asynchronous testcase */
+ sop = g_ot_async_ops[ctrl->mode].init();
+ if (!sop)
+ return -ENOMEM;
+
+ /* grab rwsem to block testing threads */
+ down_write(&g_ot_data.start);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ struct task_struct *work;
+
+ ot_init_cpu_item(item, ctrl, &sop->pool, ot_bulk_async);
+
+ /* skip offline cpus */
+ if (!cpu_online(cpu))
+ continue;
+
+ work = kthread_create_on_node(ot_thread_worker, item,
+ cpu_to_node(cpu), "ot_worker_%d", cpu);
+ if (IS_ERR(work)) {
+ pr_err("failed to create thread for cpu %d\n", cpu);
+ } else {
+ kthread_bind(work, cpu);
+ wake_up_process(work);
+ }
+ }
+
+	/* wait a while to make sure all threads are waiting at the start line */
+ msleep(20);
+
+ /* in case no threads were created: memory insufficient ? */
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ /* start objpool testing threads */
+ start = ktime_get();
+ up_write(&g_ot_data.start);
+
+	/* yield cpu to worker threads for duration ms */
+ timeout = msecs_to_jiffies(ctrl->duration);
+ rc = schedule_timeout_interruptible(timeout);
+
+	/* tell worker threads to quit */
+ atomic_set_release(&g_ot_data.stop, 1);
+
+ /* do async-finalization */
+ g_ot_async_ops[ctrl->mode].fini(sop);
+
+	/* wait for all worker threads to finish and quit */
+ wait_for_completion(&g_ot_data.wait);
+ duration = (u64) ktime_us_delta(ktime_get(), start);
+
+	/* make sure the rcu callback has been triggered */
+ wait_for_completion(&g_ot_data.rcu);
+
+ /*
+ * now we are sure that objpool is finalized either
+ * by rcu callback or by worker threads
+ */
+
+ /* report testing summary and performance results */
+ ot_perf_report(ctrl, duration);
+
+ /* report memory allocation summary */
+ ot_mem_report(ctrl);
+
+ return rc;
+}
+
+/*
+ * predefined testing cases:
+ * 4 synchronous cases / 4 overrun cases / 4 async cases
+ *
+ * mode: unsigned int, could be 0/1/2/3, see name
+ * duration: unsigned int, total test time in ms
+ * delay: unsigned int, delay (in ms) between each iteration
+ * bulk_normal: unsigned int, repeat times for thread worker
+ * bulk_irq: unsigned int, repeat times for irq consumer
+ * hrtimer: unsigned long, hrtimer interval in ms
+ * name: char *, tag for current test ot_item
+ */
+
+struct ot_ctrl g_ot_sync[] = {
+ {0, 1000, 0, 1, 0, 0, "sync: percpu objpool"},
+ {1, 1000, 0, 1, 0, 0, "sync: user objpool"},
+ {2, 1000, 0, 1, 0, 0, "sync: user objects"},
+ {3, 1000, 0, 1, 0, 0, "sync: mixed pools & objs"},
+};
+
+struct ot_ctrl g_ot_miss[] = {
+ {0, 1000, 0, 16, 0, 0, "sync overrun: percpu objpool"},
+ {1, 1000, 0, 16, 0, 0, "sync overrun: user objpool"},
+ {2, 1000, 0, 16, 0, 0, "sync overrun: user objects"},
+ {3, 1000, 0, 16, 0, 0, "sync overrun: mixed pools & objs"},
+};
+
+struct ot_ctrl g_ot_async[] = {
+ {0, 1000, 4, 8, 8, 6, "async: percpu objpool"},
+ {1, 1000, 4, 8, 8, 6, "async: user objpool"},
+ {2, 1000, 4, 8, 8, 6, "async: user objects"},
+ {3, 1000, 4, 8, 8, 6, "async: mixed pools & objs"},
+};
+
+static int __init ot_mod_init(void)
+{
+ int i;
+
+ ot_init_data(&g_ot_data);
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_sync); i++) {
+ if (ot_start_sync(&g_ot_sync[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_miss); i++) {
+ if (ot_start_sync(&g_ot_miss[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_async); i++) {
+ if (ot_start_async(&g_ot_async[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+out:
+ return -EAGAIN;
+}
+
+static void __exit ot_mod_exit(void)
+{
+}
+
+module_init(ot_mod_init);
+module_exit(ot_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Matt Wu");
--
2.34.1


2022-11-02 21:57:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4] kprobes,lib: kretprobe scalability improvement

On Wed, 2 Nov 2022 10:30:12 +0800 wuqiang <[email protected]> wrote:

> Tests of
> kretprobe throughput show the biggest ratio as 333.9x of the original
> freelist.

Seriously.

I'll add this for some runtime testing.

Are you able to identify other parts of the kernel which could use
(and benefit from) the new objpool?

2022-11-03 03:14:18

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v4] kprobes,lib: kretprobe scalability improvement

Hi wuqiang (or Matt?),

Thanks for updating the patch! I have some comments below,

On Wed, 2 Nov 2022 10:30:12 +0800
wuqiang <[email protected]> wrote:

> kretprobe is using freelist to manage return-instances, but freelist,
> as LIFO queue based on singly linked list, scales badly and reduces
> the overall throughput of kretprobed routines, especially for high
> contention scenarios.
>
> Here's a typical throughput test of sys_flock (counts in 10 seconds,
> measured with perf stat -a -I 10000 -e syscalls:sys_enter_flock):
>
> OS: Debian 10 X86_64, Linux 6.1rc2
> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>
> 1X 2X 4X 6X 8X 12X 16X
> 34762430 36546920 17949900 13101899 12569595 12646601 14729195
> 24X 32X 48X 64X 72X 96X 128X
> 19263546 10102064 8985418 11936495 11493980 7127789 9330985
>
> This patch implements a scalable, lock-less and numa-aware object pool,
> which brings near-linear scalability to kretprobed routines. Tests of
> kretprobe throughput show the biggest ratio as 333.9x of the original
> freelist. Here's the comparison:
>
> 1X 2X 4X 8X 16X
> freelist: 34762430 36546920 17949900 12569595 14729195
> objpool: 35627544 72182095 144068494 287564688 576903916
> 32X 48X 64X 96X 128X
> freelist: 10102064 8985418 11936495 7127789 9330985
> objpool: 1158876372 1737828164 2324371724 2380310472 2463182819
>
> Tests on 96-core ARM64 system output similarly, but with the biggest
> ratio up to 642.2x:
>
> OS: Debian 10 AARCH64, Linux 6.1rc2
> HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
>
> 1X 2X 4X 8X 16X
> freelist: 17498299 10887037 10224710 8499132 6421751
> objpool: 18715726 35549845 71615884 144258971 283707220
> 24X 32X 48X 64X 96X
> freelist: 5339868 4819116 3593919 3121575 2687167
> objpool: 419830913 571609748 877456139 1143316315 1725668029
>
> The object pool, leveraging percpu ring-array to mitigate hot spots
> of memory contention, could deliver near-linear scalability for high
> parallel scenarios. The ring-array is compactly managed in a single
> cacheline (64 bytes) to benefit from warmed L1 cache for most cases
> (<= 4 instances per core) and objects are managed in the continuous
> cachelines just after ring-array.
>
> Changes since V3:
> 1) build warning: unused variable in fprobe_init_rethook
> Reported-by: kernel test robot <[email protected]>
>
> Changes since V2:
> 1) the percpu-extended version of the freelist replaced by new percpu-
> ring-array. freelist has data-contention in freelist_node (refs and
> next) even after node is removed from freelist and the node could
> be polluted easily (with freelist_node defined in union)
> 2) routines split to objpool.h and objpool.c according to cold & hot
> pathes, and the latter moved to lib, as suggested by Masami
> 3) test module (test_objpool.ko) added to lib for functional testings
>
> Changes since V1:
> 1) reformat to a single patch as Masami Hiramatsu suggested
> 2) use __vmalloc_node to replace vmalloc_node for vmalloc
> 3) a few minor fixes: typo and coding-style issues

Recording change log is very good. But if it becomes too long,
you can put a URL of the previous series on LKML instead of
appending the change logs.
You can get the URL (permlink) by "lkml.kernel.org/r/<your-message-id>"

>
> Signed-off-by: wuqiang <[email protected]>
> ---
> include/linux/freelist.h | 129 -----
> include/linux/kprobes.h | 9 +-
> include/linux/objpool.h | 151 ++++++
> include/linux/rethook.h | 15 +-
> kernel/kprobes.c | 95 ++--
> kernel/trace/fprobe.c | 17 +-
> kernel/trace/rethook.c | 80 +--
> lib/Kconfig.debug | 11 +
> lib/Makefile | 4 +-
> lib/objpool.c | 480 ++++++++++++++++++
> lib/test_objpool.c | 1031 ++++++++++++++++++++++++++++++++++++++
> 11 files changed, 1772 insertions(+), 250 deletions(-)

Hmm, this does too much things in 1 patch.
Can you split this in below 5 patches? This also makes clear that who
needs to review which part.

- Add generic scalable objpool
- Add objpool test
- Use objpool in kretprobe
- Use objpool in fprobe and rethook
- Remove unused freelist

> delete mode 100644 include/linux/freelist.h
> create mode 100644 include/linux/objpool.h
> create mode 100644 lib/objpool.c
> create mode 100644 lib/test_objpool.c
>
[...]
> +
> +struct objpool_slot {
> + uint32_t os_head; /* head of ring array */

If all fields have "os_" prefix, it is meaningless.

> + uint32_t os_tail; /* tail of ring array */
> + uint32_t os_size; /* max item slots, pow of 2 */
> + uint32_t os_mask; /* os_size - 1 */
> +/*
> + * uint32_t os_ages[]; // ring epoch id
> + * void *os_ents[]; // objects array

"entries[]"

> + */
> +};
> +
> +/* caller-specified object initial callback to setup each object, only called once */
> +typedef int (*objpool_init_node_cb)(void *context, void *obj);
> +
> +/* caller-specified cleanup callback for private objects/pool/context */
> +typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
> +
> +/* called for object releasing: ptr points to an object */
> +#define OBJPOOL_FLAG_NODE (0x00000001)
> +/* for user pool and context releasing, ptr could be NULL */
> +#define OBJPOOL_FLAG_POOL (0x00001000)
> +/* the object or pool to be released is user-managed */
> +#define OBJPOOL_FLAG_USER (0x00008000)
> +
> +/*
> + * objpool_head: object pooling metadata
> + */
> +
> +struct objpool_head {
> + uint32_t oh_objsz; /* object & element size */
> + uint32_t oh_nobjs; /* total objs (pre-allocated) */
> + uint32_t oh_nents; /* max objects per cpuslot */
> + uint32_t oh_ncpus; /* num of possible cpus */

If all fields have "oh_" prefix, it is meaningless.
Also, if there is no reason to be 32bit (like align the structure size
for cache, or pack the structure for streaming etc.) use appropriate types.

And please do not align the length of field name unnaturally. e.g.

size_t obj_size; /* size_t or unsigned int, I don't care. */
int nr_objs; /* I think just 'int' is enough because the value should be
checked and limited under INT_MAX */
int max_entries;
unsigned int nr_cpus;

(BTW why we need to limit the nr_cpus here? we have num_possible_cpus())

> + uint32_t oh_in_user:1; /* user-specified buffer */
> + uint32_t oh_in_slot:1; /* objs alloced with slots */
> + uint32_t oh_vmalloc:1; /* alloc from vmalloc zone */

Please use "bool" or "unsigned long flags" with flag bits.

> + gfp_t oh_gfp; /* k/vmalloc gfp flags */
> + uint32_t oh_sz_pool; /* user pool size in byes */

size_t pool_size

> + void *oh_pool; /* user managed memory pool */
> + struct objpool_slot **oh_slots; /* array of percpu slots */
> + uint32_t *oh_sz_slots; /* size in bytes of slots */

size_t slot_size

> + objpool_release_cb oh_release; /* resource cleanup callback */
> + void *oh_context; /* caller-provided context */
> +};
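
Putting these suggestions together, the struct might end up looking
something like this (just a rough sketch):

struct objpool_head {
	unsigned int obj_size;		/* object & element size */
	int nr_objs;			/* total pre-allocated objects */
	int max_entries;		/* max objects per cpu slot */
	unsigned long flags;		/* user-buffer/in-slot/vmalloc bits */
	gfp_t gfp;			/* k/vmalloc gfp flags */
	size_t pool_size;		/* user pool size in bytes */
	void *pool;			/* user managed memory pool */
	struct objpool_slot **cpu_slots; /* array of percpu slots */
	size_t *slot_sizes;		/* size in bytes of each slot */
	objpool_release_cb release;	/* resource cleanup callback */
	void *context;			/* caller-provided context */
};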

Thank you,

--
Masami Hiramatsu (Google) <[email protected]>

2022-11-03 16:56:30

by wuqiang.matt

[permalink] [raw]
Subject: Re: [PATCH v4] kprobes,lib: kretprobe scalability improvement

On 2022/11/3 05:33, Andrew Morton wrote:
> On Wed, 2 Nov 2022 10:30:12 +0800 wuqiang <[email protected]> wrote:
>
>> Tests of
>> kretprobe throughput show the biggest ratio as 333.9x of the original
>> freelist.
>
> Seriously.
>
> I'll add this for some runtime testing.

Thanks.

>
> Are you able to identify other parts of the kernel which could use
> (and benefit from) the new objpool?

The scalability issue is caused by freelist. Currently kretprobe and rethook
are the only use cases.

I'm working on the evaluation of bpf percpu-freelist, which scales well but
uses raw_spinlock and needs local irq disabled.

2022-11-03 17:03:54

by wuqiang.matt

[permalink] [raw]
Subject: Re: [PATCH v4] kprobes,lib: kretprobe scalability improvement

On 2022/11/3 10:51, Masami Hiramatsu (Google) wrote:
> Hi wuqiang (or Matt?),
>
> Thanks for updating the patch! I have some comments below,

Thanks for your time :)

> On Wed, 2 Nov 2022 10:30:12 +0800
> wuqiang <[email protected]> wrote:
>
...
>> Changes since V3:
>> 1) build warning: unused variable in fprobe_init_rethook
>> Reported-by: kernel test robot <[email protected]>
>>
>> Changes since V2:
>> 1) the percpu-extended version of the freelist replaced by new percpu-
>> ring-array. freelist has data-contention in freelist_node (refs and
>> next) even after node is removed from freelist and the node could
>> be polluted easily (with freelist_node defined in union)
>> 2) routines split to objpool.h and objpool.c according to cold & hot
>> pathes, and the latter moved to lib, as suggested by Masami
>> 3) test module (test_objpool.ko) added to lib for functional testings
>>
>> Changes since V1:
>> 1) reformat to a single patch as Masami Hiramatsu suggested
>> 2) use __vmalloc_node to replace vmalloc_node for vmalloc
>> 3) a few minor fixes: typo and coding-style issues
>
> Recording change log is very good. But if it becomes too long,
> you can put a URL of the previous series on LKML instead of
> appending the change logs.
> You can get the URL (permlink) by "lkml.kernel.org/r/<your-message-id>"

Got it.

>>
>> Signed-off-by: wuqiang <[email protected]>
>> ---
>> include/linux/freelist.h | 129 -----
>> include/linux/kprobes.h | 9 +-
>> include/linux/objpool.h | 151 ++++++
>> include/linux/rethook.h | 15 +-
>> kernel/kprobes.c | 95 ++--
>> kernel/trace/fprobe.c | 17 +-
>> kernel/trace/rethook.c | 80 +--
>> lib/Kconfig.debug | 11 +
>> lib/Makefile | 4 +-
>> lib/objpool.c | 480 ++++++++++++++++++
>> lib/test_objpool.c | 1031 ++++++++++++++++++++++++++++++++++++++
>> 11 files changed, 1772 insertions(+), 250 deletions(-)
>
> Hmm, this does too much things in 1 patch.
> Can you split this in below 5 patches? This also makes clear that who
> needs to review which part.

I did consider splitting, but finally decided to combine them into one big
patch, mostly because all of it is for the kretprobe improvement.

Next version I'll split to smaller patches. Thank you for the advice.

>
> - Add generic scalable objpool
> - Add objpool test
> - Use objpool in kretprobe
> - Use objpool in fprobe and rethook
> - Remove unused freelist
>
>> delete mode 100644 include/linux/freelist.h
>> create mode 100644 include/linux/objpool.h
>> create mode 100644 lib/objpool.c
>> create mode 100644 lib/test_objpool.c
>>
> [...]
>> +
>> +struct objpool_slot {
>> + uint32_t os_head; /* head of ring array */
>
> If all fields have "os_" prefix, it is meaningless.
>
>> + uint32_t os_tail; /* tail of ring array */
>> + uint32_t os_size; /* max item slots, pow of 2 */
>> + uint32_t os_mask; /* os_size - 1 */
>> +/*
>> + * uint32_t os_ages[]; // ring epoch id
>> + * void *os_ents[]; // objects array
>
> "entries[]"

I'll refine the comments here to better explain the memory layout.

>
>> + */
>> +};
>> +
>> +/* caller-specified object initial callback to setup each object, only called once */
>> +typedef int (*objpool_init_node_cb)(void *context, void *obj);
>> +
>> +/* caller-specified cleanup callback for private objects/pool/context */
>> +typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
>> +
>> +/* called for object releasing: ptr points to an object */
>> +#define OBJPOOL_FLAG_NODE (0x00000001)
>> +/* for user pool and context releasing, ptr could be NULL */
>> +#define OBJPOOL_FLAG_POOL (0x00001000)
>> +/* the object or pool to be released is user-managed */
>> +#define OBJPOOL_FLAG_USER (0x00008000)
>> +
>> +/*
>> + * objpool_head: object pooling metadata
>> + */
>> +
>> +struct objpool_head {
>> + uint32_t oh_objsz; /* object & element size */
>> + uint32_t oh_nobjs; /* total objs (pre-allocated) */
>> + uint32_t oh_nents; /* max objects per cpuslot */
>> + uint32_t oh_ncpus; /* num of possible cpus */
>
> If all fields have "oh_" prefix, it is meaningless.
> Also, if there is no reason to be 32bit (like align the structure size
> for cache, or pack the structure for streaming etc.) use appropriate types.
>
> And please do not align the length of field name unnaturally. e.g.

Kind of obsessive-compulsive symptom, haha :) The struct size of objpool_head
doesn't matter. The size of objpool_slot does matter, as it's managed within
a single cache-line.

>
> size_t obj_size; /* size_t or unsigned int, I don't care. */
> int nr_objs; /* I think just 'int' is enough because the value should be
> checked and limited under INT_MAX */
> int max_entries;
> unsigned int nr_cpus;
>
> (BTW why we need to limit the nr_cpus here? we have num_possible_cpus())

You are right that nr_cpus is unnecessary. num_possible_cpus() stays the
same even when new cpus are hot-plugged.

>
>> + uint32_t oh_in_user:1; /* user-specified buffer */
>> + uint32_t oh_in_slot:1; /* objs alloced with slots */
>> + uint32_t oh_vmalloc:1; /* alloc from vmalloc zone */
>
> Please use "bool" or "unsigned long flags" with flag bits.
>
>> + gfp_t oh_gfp; /* k/vmalloc gfp flags */
>> + uint32_t oh_sz_pool; /* user pool size in byes */
>
> size_t pool_size
>
>> + void *oh_pool; /* user managed memory pool */
>> + struct objpool_slot **oh_slots; /* array of percpu slots */
>> + uint32_t *oh_sz_slots; /* size in bytes of slots */
>
> size_t slot_size
>

Will apply these changes in next version.

>> + objpool_release_cb oh_release; /* resource cleanup callback */
>> + void *oh_context; /* caller-provided context */
>> +};
>
> Thank you,
>

Best regards,


2022-11-04 01:47:46

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v4] kprobes,lib: kretprobe scalability improvement

On Fri, 4 Nov 2022 00:45:19 +0800
wuqiang <[email protected]> wrote:

> On 2022/11/3 10:51, Masami Hiramatsu (Google) wrote:
> > Hi wuqiang (or Matt?),
> >
> > Thanks for updating the patch! I have some comments below,
>
> Thanks for your time :)
>
> > On Wed, 2 Nov 2022 10:30:12 +0800
> > wuqiang <[email protected]> wrote:
> >
> ...
> >> Changes since V3:
> >> 1) build warning: unused variable in fprobe_init_rethook
> >> Reported-by: kernel test robot <[email protected]>
> >>
> >> Changes since V2:
> >> 1) the percpu-extended version of the freelist replaced by new percpu-
> >> ring-array. freelist has data-contention in freelist_node (refs and
> >> next) even after node is removed from freelist and the node could
> >> be polluted easily (with freelist_node defined in union)
> >> 2) routines split to objpool.h and objpool.c according to cold & hot
> >> pathes, and the latter moved to lib, as suggested by Masami
> >> 3) test module (test_objpool.ko) added to lib for functional testings
> >>
> >> Changes since V1:
> >> 1) reformat to a single patch as Masami Hiramatsu suggested
> >> 2) use __vmalloc_node to replace vmalloc_node for vmalloc
> >> 3) a few minor fixes: typo and coding-style issues
> >
> > Recording change log is very good. But if it becomes too long,
> > you can put a URL of the previous series on LKML instead of
> > appending the change logs.
> > You can get the URL (permlink) by "lkml.kernel.org/r/<your-message-id>"
>
> Got it.
>
> >>
> >> Signed-off-by: wuqiang <[email protected]>
> >> ---
> >> include/linux/freelist.h | 129 -----
> >> include/linux/kprobes.h | 9 +-
> >> include/linux/objpool.h | 151 ++++++
> >> include/linux/rethook.h | 15 +-
> >> kernel/kprobes.c | 95 ++--
> >> kernel/trace/fprobe.c | 17 +-
> >> kernel/trace/rethook.c | 80 +--
> >> lib/Kconfig.debug | 11 +
> >> lib/Makefile | 4 +-
> >> lib/objpool.c | 480 ++++++++++++++++++
> >> lib/test_objpool.c | 1031 ++++++++++++++++++++++++++++++++++++++
> >> 11 files changed, 1772 insertions(+), 250 deletions(-)
> >
> > Hmm, this does too much things in 1 patch.
> > Can you split this in below 5 patches? This also makes clear that who
> > needs to review which part.
>
> I was ever considering of splitting, but finally decided to mix them in a big
> one mostly because it's only for kretprobe improvement.
>
> Next version I'll split to smaller patches. Thank you for the advice.
>
> >
> > - Add generic scalable objpool
> > - Add objpool test
> > - Use objpool in kretprobe
> > - Use objpool in fprobe and rethook
> > - Remove unused freelist
> >
> >> delete mode 100644 include/linux/freelist.h
> >> create mode 100644 include/linux/objpool.h
> >> create mode 100644 lib/objpool.c
> >> create mode 100644 lib/test_objpool.c
> >>
> > [...]
> >> +
> >> +struct objpool_slot {
> >> + uint32_t os_head; /* head of ring array */
> >
> > If all fields have "os_" prefix, it is meaningless.
> >
> >> + uint32_t os_tail; /* tail of ring array */
> >> + uint32_t os_size; /* max item slots, pow of 2 */
> >> + uint32_t os_mask; /* os_size - 1 */
> >> +/*
> >> + * uint32_t os_ages[]; // ring epoch id
> >> + * void *os_ents[]; // objects array
> >
> > "entries[]"
>
> I'll refine the comments here to better explain the memory layout.
>
> >
> >> + */
> >> +};
> >> +
> >> +/* caller-specified object initial callback to setup each object, only called once */
> >> +typedef int (*objpool_init_node_cb)(void *context, void *obj);
> >> +
> >> +/* caller-specified cleanup callback for private objects/pool/context */
> >> +typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
> >> +
> >> +/* called for object releasing: ptr points to an object */
> >> +#define OBJPOOL_FLAG_NODE (0x00000001)
> >> +/* for user pool and context releasing, ptr could be NULL */
> >> +#define OBJPOOL_FLAG_POOL (0x00001000)
> >> +/* the object or pool to be released is user-managed */
> >> +#define OBJPOOL_FLAG_USER (0x00008000)
> >> +
> >> +/*
> >> + * objpool_head: object pooling metadata
> >> + */
> >> +
> >> +struct objpool_head {
> >> + uint32_t oh_objsz; /* object & element size */
> >> + uint32_t oh_nobjs; /* total objs (pre-allocated) */
> >> + uint32_t oh_nents; /* max objects per cpuslot */
> >> + uint32_t oh_ncpus; /* num of possible cpus */
> >
> > If all fields have "oh_" prefix, it is meaningless.
> > Also, if there is no reason to be 32bit (like align the structure size
> > for cache, or pack the structure for streaming etc.) use appropriate types.
> >
> > And please do not align the length of field name unnaturally. e.g.
>
> Kind of obsessive-compulsive symptom, haha :) The struct size of objpool_head
> doesn't matter. The size of objpool_slot does matter, as managed in a single
> cache-line.

Yeah, so objpool_slot is good to use uint32_t. You may also need __packed and
__cacheline_aligned for objpool_slot ;)
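
E.g. something along these lines (just a sketch; ____cacheline_aligned being
the variant for type declarations):

struct objpool_slot {
	uint32_t head;	/* head of ring array */
	uint32_t tail;	/* tail of ring array */
	uint32_t size;	/* array size, pow of 2 */
	uint32_t mask;	/* size - 1 */
} __packed ____cacheline_aligned;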

Thank you!

>
> >
> > size_t obj_size; /* size_t or unsigned int, I don't care. */
> > int nr_objs; /* I think just 'int' is enough because the value should be
> > checked and limited under INT_MAX */
> > int max_entries;
> > unsigned int nr_cpus;
> >
> > (BTW why we need to limit the nr_cpus here? we have num_possible_cpus())
>
> You are right that nr_cpus is unnecessary. num_possible_cpus() just keeps the
> same even with new hot-plugged cpus.
>
> >
> >> + uint32_t oh_in_user:1; /* user-specified buffer */
> >> + uint32_t oh_in_slot:1; /* objs alloced with slots */
> >> + uint32_t oh_vmalloc:1; /* alloc from vmalloc zone */
> >
> > Please use "bool" or "unsigned long flags" with flag bits.
> >
> >> + gfp_t oh_gfp; /* k/vmalloc gfp flags */
> >> + uint32_t oh_sz_pool; /* user pool size in byes */
> >
> > size_t pool_size
> >
> >> + void *oh_pool; /* user managed memory pool */
> >> + struct objpool_slot **oh_slots; /* array of percpu slots */
> >> + uint32_t *oh_sz_slots; /* size in bytes of slots */
> >
> > size_t slot_size
> >
>
> Will apply these changes in next version.
>
> >> + objpool_release_cb oh_release; /* resource cleanup callback */
> >> + void *oh_context; /* caller-provided context */
> >> +};
> >
> > Thank you,
> >
>
> Best regards,
>


--
Masami Hiramatsu (Google) <[email protected]>

2022-11-06 05:46:53

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v5 0/4] lib,kprobes: kretprobe scalability improvement

This patch series introduces a scalable and lockless ring-array based
object pool and replaces the original freelist (a LIFO queue based on
singly linked list) to improve the scalability of kretprobed routines.

Changes from v4:
1) fixed a compiling failure with [-Werror=designated-init]
2) fixed a compiling failure for sparc: prefetch() not defined
3) refined comments & code of the objpool routines

v4 and more:
https://lore.kernel.org/all/[email protected]

---
include/linux/freelist.h | 129 ------------
include/linux/kprobes.h | 9 +-
include/linux/objpool.h | 153 ++++++++++++++
include/linux/rethook.h | 15 +-
kernel/kprobes.c | 95 ++++-----
kernel/trace/fprobe.c | 17 +-
kernel/trace/rethook.c | 80 +++----
lib/Kconfig.debug | 11 +
lib/Makefile | 4 +-
lib/objpool.c | 487 +++++++++++++++++++++++++++++++++++++++++++
lib/test_objpool.c | 1052 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
11 files changed, 1802 insertions(+), 250 deletions(-)
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c
create mode 100644 lib/test_objpool.c
delete mode 100644 include/linux/freelist.h

2022-11-06 05:47:05

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v5 1/4] lib: objpool added: ring-array based lockless MPMC queue

The object pool is a scalable implementation of a high performance queue
for object allocation and reclamation, such as kretprobe instances.

With leveraging per-cpu ring-array to mitigate the hot spots of memory
contention, it could deliver near-linear scalability for high parallel
scenarios. The ring-array is compactly managed in a single cache-line
to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
The body of pre-allocated objects is stored in continuous cache-lines
just after the ring-array.

The object pool is interrupt safe. Both allocation and reclamation
(object pop and push operations) can be preemptible or interruptible.

It's best suited for the following cases:
1) Memory allocation or reclamation are prohibited or too expensive
2) Consumers are of different priorities, such as irqs and threads

Limitations:
1) Maximum objects (capacity) is determined during pool initialization
2) The memory of objects won't be freed until the pool is finalized
3) Object allocation (pop) may fail after trying all cpu slots
4) Object reclamation (push) won't fail, but may take a long time to
   finish for imbalanced scenarios. You can try a larger max_entries
   to mitigate, or ( >= CPUS * nr_objs) to avoid it
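
A minimal usage sketch (illustrative only: 'struct my_obj' and 'nr' are
placeholders, and error checking of objpool_init is omitted):

	struct objpool_head pool;
	struct my_obj *obj;

	/* pre-allocate 'nr' zeroed objects of 'struct my_obj' */
	objpool_init(&pool, nr, nr, sizeof(struct my_obj),
		     GFP_KERNEL, NULL, NULL, NULL);

	obj = objpool_pop(&pool);	/* NULL if the pool is empty */
	if (obj)
		objpool_push(obj, &pool);	/* return the object for reuse */

	objpool_fini(&pool);		/* release the pool and its objects */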

Signed-off-by: wuqiang <[email protected]>
---
include/linux/objpool.h | 153 +++++++++++++
lib/Makefile | 2 +-
lib/objpool.c | 487 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 641 insertions(+), 1 deletion(-)
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..7899b054b50c
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/types.h>
+
+/*
+ * objpool: ring-array based lockless MPMC queue
+ *
+ * Copyright: [email protected]
+ *
+ * The object pool is a scalable implementation of a high performance queue
+ * for object allocation and reclamation, such as kretprobe instances.
+ *
+ * With leveraging per-cpu ring-array to mitigate the hot spots of memory
+ * contention, it could deliver near-linear scalability for high parallel
+ * scenarios. The ring-array is compactly managed in a single cache-line
+ * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
+ * The body of pre-allocated objects is stored in continuous cache-lines
+ * just after the ring-array.
+ *
+ * The object pool is interrupt safe. Both allocation and reclamation
+ * (object pop and push operations) can be preemptible or interruptible.
+ *
+ * It's best suited for the following cases:
+ * 1) Memory allocation or reclamation are prohibited or too expensive
+ * 2) Consumers are of different priorities, such as irqs and threads
+ *
+ * Limitations:
+ * 1) Maximum objects (capacity) is determined during pool initialization
+ * 2) The memory of objects won't be freed until the pool is finalized
+ * 3) Object allocation (pop) may fail after trying all cpu slots
+ * 4) Object reclamation (push) won't fail, but may take a long time to
+ *    finish for imbalanced scenarios. You can try a larger max_entries
+ *    to mitigate, or ( >= CPUS * nr_objs) to avoid it
+ */
+
+/*
+ * objpool_slot: per-cpu ring array
+ *
+ * Represents a cpu-local array-based ring buffer; its size is specified
+ * during initialization of the object pool.
+ *
+ * The objpool_slot is allocated from local memory for NUMA systems, and is
+ * kept compact within a single cacheline. ages[] is stored just after the
+ * body of objpool_slot, followed by entries[]. The array of ages[] records
+ * the revision of each item, solely used to avoid ABA; the array of
+ * entries[] contains the object pointers.
+ *
+ * The default size of objpool_slot is a single cache-line, i.e. 64 bytes.
+ *
+ * 64bit:
+ * 4 8 12 16 32 64
+ * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) | objects
+ *
+ * 32bit:
+ * 4 8 12 16 32 48 64
+ * | head | tail | size | mask | ages[4] | ents[4] | unused | objects
+ *
+ */
+
+struct objpool_slot {
+ uint32_t head; /* head of ring array */
+ uint32_t tail; /* tail of ring array */
+ uint32_t size; /* array size, pow of 2 */
+ uint32_t mask; /* size - 1 */
+} __attribute__((packed));
+
+/* caller-specified object initialization callback to set up each object, called only once */
+typedef int (*objpool_init_obj_cb)(void *context, void *obj);
+
+/* caller-specified cleanup callback for private objects/pool/context */
+typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
+
+/* called for object releasing: ptr points to an object */
+#define OBJPOOL_FLAG_NODE (0x00000001)
+/* for user pool and context releasing, ptr could be NULL */
+#define OBJPOOL_FLAG_POOL (0x00001000)
+/* the object or pool to be released is user-managed */
+#define OBJPOOL_FLAG_USER (0x00008000)
+
+/*
+ * objpool_head: object pooling metadata
+ */
+
+struct objpool_head {
+ unsigned int obj_size; /* object & element size */
+ unsigned int nr_objs; /* total objs (to be pre-allocated) */
+ unsigned int nr_cpus; /* num of possible cpus */
+ unsigned int capacity; /* max objects per cpuslot */
+ unsigned long flags; /* flags for objpool management */
+ gfp_t gfp; /* gfp flags for kmalloc & vmalloc */
+	unsigned int pool_size;		/* user pool size in bytes */
+ void *pool; /* user managed memory pool */
+ struct objpool_slot **cpu_slots; /* array of percpu slots */
+ unsigned int *slot_sizes; /* size in bytes of slots */
+ objpool_release_cb release; /* resource cleanup callback */
+ void *context; /* caller-provided context */
+};
+
+#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
+#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *head, unsigned int nr_objs,
+ unsigned int max_objs, unsigned int object_size,
+ gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+ objpool_release_cb release);
+
+/* add objects in batch from user provided pool */
+int objpool_populate(struct objpool_head *head, void *pool,
+ unsigned int size, unsigned int object_size,
+ void *context, objpool_init_obj_cb objinit);
+
+/* add pre-allocated object (managed by user) to objpool */
+int objpool_add(void *obj, struct objpool_head *head);
+
+/* allocate an object from objects pool */
+void *objpool_pop(struct objpool_head *head);
+
+/* reclaim an object to objects pool */
+int objpool_push(void *node, struct objpool_head *head);
+
+/* cleanup the whole object pool (objects including) */
+void objpool_fini(struct objpool_head *head);
+
+/* whether the object is pre-allocated with percpu slots */
+static inline int objpool_is_inslot(void *obj, struct objpool_head *head)
+{
+ void *slot;
+ int i;
+
+ if (!obj || !(head->flags & OBJPOOL_HAVE_OBJECTS))
+ return 0;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ slot = head->cpu_slots[i];
+ if (obj >= slot && obj < slot + head->slot_sizes[i])
+ return 1;
+ }
+
+ return 0;
+}
+
+/* whether the object is from user pool (batched adding) */
+static inline int objpool_is_inpool(void *obj, struct objpool_head *head)
+{
+ return (obj && head->pool && obj >= head->pool &&
+ obj < head->pool + head->pool_size);
+}
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/Makefile b/lib/Makefile
index 161d6a724ff7..e938703a321f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
- buildid.o
+ buildid.o objpool.o

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..ecffa0795f3d
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,487 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/prefetch.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ */
+
+/* compute the suitable num of objects to be managed by slot */
+static inline unsigned int __objpool_num_of_objs(unsigned int size)
+{
+ return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
+ (sizeof(uint32_t) + sizeof(void *)));
+}
+
+#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
+#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
+ sizeof(uint32_t) * (s)->size))
+#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
+ (sizeof(uint32_t) + sizeof(void *)) * (s)->size))
+
+/* allocate and initialize percpu slots */
+static inline int
+__objpool_init_percpu_slots(struct objpool_head *head, unsigned int nobjs,
+ void *context, objpool_init_obj_cb objinit)
+{
+ unsigned int i, j, n, size, objsz, nents = head->capacity;
+
+ /* aligned object size by sizeof(void *) */
+ objsz = ALIGN(head->obj_size, sizeof(void *));
+ /* shall we allocate objects along with objpool_slot */
+ if (objsz)
+ head->flags |= OBJPOOL_HAVE_OBJECTS;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ struct objpool_slot *os;
+
+ /* compute how many objects to be managed by this slot */
+ n = nobjs / head->nr_cpus;
+ if (i < (nobjs % head->nr_cpus))
+ n++;
+ size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
+ sizeof(uint32_t) * nents + objsz * n;
+
+ /* decide memory area for cpu-slot allocation */
+ if (!i && !(head->gfp & GFP_ATOMIC) && size > PAGE_SIZE / 2)
+ head->flags |= OBJPOOL_FROM_VMALLOC;
+
+ /* allocate percpu slot & objects from local memory */
+ if (head->flags & OBJPOOL_FROM_VMALLOC)
+ os = __vmalloc_node(size, sizeof(void *), head->gfp,
+ cpu_to_node(i), __builtin_return_address(0));
+ else
+ os = kmalloc_node(size, head->gfp, cpu_to_node(i));
+ if (!os)
+ return -ENOMEM;
+
+ /* initialize percpu slot for the i-th cpu */
+ memset(os, 0, size);
+ os->size = head->capacity;
+ os->mask = os->size - 1;
+ head->cpu_slots[i] = os;
+ head->slot_sizes[i] = size;
+
+ /*
+		 * start from the 2nd round to avoid conflict with the 1st item.
+		 * we assume the head item is ready for retrieval iff head is
+		 * equal to ages[head & mask]. but ages[] is initialized as 0,
+		 * so from the view of pop()'s caller the 1st item (0th) would
+		 * always look ready, while in fact a push() could be stalled
+		 * before its final update, and the item being inserted would
+		 * then be lost forever.
+ */
+ os->head = os->tail = head->capacity;
+
+ if (!objsz)
+ continue;
+
+ for (j = 0; j < n; j++) {
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ void *obj = SLOT_OBJS(os) + j * objsz;
+ uint32_t ie = os->tail & os->mask;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* add obj into the ring array */
+ ents[ie] = obj;
+ ages[ie] = os->tail;
+ os->tail++;
+ head->nr_objs++;
+ }
+ }
+
+ return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static inline void __objpool_fini_percpu_slots(struct objpool_head *head)
+{
+ unsigned int i;
+
+ if (!head->cpu_slots)
+ return;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ if (!head->cpu_slots[i])
+ continue;
+ if (head->flags & OBJPOOL_FROM_VMALLOC)
+ vfree(head->cpu_slots[i]);
+ else
+ kfree(head->cpu_slots[i]);
+ }
+ kfree(head->cpu_slots);
+ head->cpu_slots = NULL;
+ head->slot_sizes = NULL;
+}
+
+/**
+ * objpool_init: initialize object pool and pre-allocate objects
+ *
+ * args:
+ * @head: the object pool to be initialized, declared by caller
+ * @nr_objs: total objects to be pre-allocated by this object pool
+ * @max_objs: max entries (object pool capacity), use nr_objs if 0
+ * @object_size: size of an object, no objects pre-allocated if 0
+ * @gfp: flags for memory allocation (via kmalloc or vmalloc)
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setting-up
+ * @release: cleanup callback for private objects/pool/context
+ *
+ * return:
+ * 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are zeroed. The caller can do extra
+ * initialization in the objinit callback, which is called once and
+ * only once after slot allocation; objpool won't touch the content
+ * of the objects after that. It's the caller's duty to perform
+ * reinitialization after object allocation (pop) or clearance
+ * before object reclamation (push) if required.
+ */
+int objpool_init(struct objpool_head *head, unsigned int nr_objs,
+ unsigned int max_objs, unsigned int object_size,
+ gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+ objpool_release_cb release)
+{
+ unsigned int nents, ncpus = num_possible_cpus();
+ int rc;
+
+ /* calculate percpu slot size (rounded to pow of 2) */
+ if (max_objs < nr_objs)
+ max_objs = nr_objs;
+ nents = max_objs / ncpus;
+ if (nents < __objpool_num_of_objs(L1_CACHE_BYTES))
+ nents = __objpool_num_of_objs(L1_CACHE_BYTES);
+ nents = roundup_pow_of_two(nents);
+ while (nents * ncpus < nr_objs)
+ nents = nents << 1;
+
+ memset(head, 0, sizeof(struct objpool_head));
+ head->nr_cpus = ncpus;
+ head->obj_size = object_size;
+ head->capacity = nents;
+ head->gfp = gfp & ~__GFP_ZERO;
+ head->context = context;
+ head->release = release;
+
+ /* allocate array for percpu slots */
+ head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
+ head->nr_cpus * sizeof(uint32_t), head->gfp);
+ if (!head->cpu_slots)
+ return -ENOMEM;
+ head->slot_sizes = (uint32_t *)&head->cpu_slots[head->nr_cpus];
+
+ /* initialize per-cpu slots */
+ rc = __objpool_init_percpu_slots(head, nr_objs, context, objinit);
+ if (rc)
+ __objpool_fini_percpu_slots(head);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot tail, the given slot must NOT be full */
+static inline int __objpool_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t tail = atomic_inc_return((atomic_t *)&os->tail) - 1;
+
+ WRITE_ONCE(ents[tail & os->mask], obj);
+
+ /* order matters: obj must be updated before tail updating */
+ smp_store_release(&ages[tail & os->mask], tail);
+ return 0;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int __objpool_try_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t head, tail;
+
+ do {
+ /* perform memory loading for both head and tail */
+ head = READ_ONCE(os->head);
+ tail = READ_ONCE(os->tail);
+ /* just abort if slot is full */
+ if (tail >= head + os->size)
+ return -ENOENT;
+ /* try to extend tail by 1 using CAS to avoid races */
+ if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
+ break;
+ } while (1);
+
+ /* the tail-th of slot is reserved for the given obj */
+ WRITE_ONCE(ents[tail & os->mask], obj);
+ /* update epoch id to make this object available for pop() */
+ smp_store_release(&ages[tail & os->mask], tail);
+ return 0;
+}
+
+/**
+ * objpool_populate: add objects from user provided pool in batch
+ *
+ * args:
+ * @head: object pool
+ * @pool: user buffer for pre-allocated objects
+ * @size: size of user buffer
+ * @object_size: size of object & element
+ * @context: user context for objinit callback
+ * @objinit: object initialization callback
+ *
+ * return: 0 or error code
+ */
+int objpool_populate(struct objpool_head *head, void *pool,
+ unsigned int size, unsigned int object_size,
+ void *context, objpool_init_obj_cb objinit)
+{
+ unsigned int n = head->nr_objs, used = 0, i;
+
+ if (head->pool || !pool || size < object_size)
+ return -EINVAL;
+ if (head->obj_size && head->obj_size != object_size)
+ return -EINVAL;
+ if (head->context && context && head->context != context)
+ return -EINVAL;
+ if (head->nr_objs >= head->nr_cpus * head->capacity)
+ return -ENOENT;
+
+ WARN_ON_ONCE(((unsigned long)pool) & (sizeof(void *) - 1));
+ WARN_ON_ONCE(((uint32_t)object_size) & (sizeof(void *) - 1));
+
+ /* align object size by sizeof(void *) */
+ head->obj_size = object_size;
+ object_size = ALIGN(object_size, sizeof(void *));
+ if (object_size == 0)
+ return -EINVAL;
+
+ while (used + object_size <= size) {
+ void *obj = pool + used;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* insert obj to its corresponding objpool slot */
+ i = (n + used * head->nr_cpus/size) % head->nr_cpus;
+ if (!__objpool_try_add_slot(obj, head->cpu_slots[i]))
+ head->nr_objs++;
+
+ used += object_size;
+ }
+
+ if (!used)
+ return -ENOENT;
+
+ head->context = context;
+ head->pool = pool;
+ head->pool_size = size;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(objpool_populate);
+
+/**
+ * objpool_add: add pre-allocated object to objpool during pool
+ * initialization
+ *
+ * args:
+ * @obj: object pointer to be added to objpool
+ * @head: object pool to be inserted into
+ *
+ * return:
+ * 0 or error code
+ *
+ * objpool_add doesn't handle race conditions and can only be
+ * called during objpool initialization
+ */
+int objpool_add(void *obj, struct objpool_head *head)
+{
+ unsigned int i, cpu;
+
+ if (!obj)
+ return -EINVAL;
+ if (head->nr_objs >= head->nr_cpus * head->capacity)
+ return -ENOENT;
+
+ cpu = head->nr_objs % head->nr_cpus;
+ for (i = 0; i < head->nr_cpus; i++) {
+ if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu])) {
+ head->nr_objs++;
+ return 0;
+ }
+
+ if (++cpu >= head->nr_cpus)
+ cpu = 0;
+ }
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_add);
+
+/**
+ * objpool_push: reclaim the object and return back to objects pool
+ *
+ * args:
+ * @obj: object pointer to be pushed to object pool
+ * @head: object pool
+ *
+ * return:
+ * 0 or error code: it fails only when the object pool is full
+ *
+ * objpool_push is non-blocking and can be nested
+ */
+int objpool_push(void *obj, struct objpool_head *head)
+{
+ unsigned int cpu = raw_smp_processor_id() % head->nr_cpus;
+
+ do {
+ if (head->nr_objs > head->capacity) {
+ if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu]))
+ return 0;
+ } else {
+ if (!__objpool_add_slot(obj, head->cpu_slots[cpu]))
+ return 0;
+ }
+ if (++cpu >= head->nr_cpus)
+ cpu = 0;
+ } while (1);
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ /* do memory load of head to local head */
+ uint32_t head = smp_load_acquire(&os->head);
+
+ /* loop if slot isn't empty */
+ while (head != READ_ONCE(os->tail)) {
+ uint32_t id = head & os->mask, prev = head;
+
+ /* do prefetching of object ents */
+ prefetch(&ents[id]);
+
+ /*
+ * check whether this item is ready for retrieval. In theory we
+ * might retrieve a wrong object if ages[id] overflows while the
+ * current task is sleeping, but it would take a very long time
+ * to overflow a uint32_t
+ */
+ if (smp_load_acquire(&ages[id]) == head) {
+ /* node must have been updated by push() */
+ void *node = READ_ONCE(ents[id]);
+ /* commit and move forward head of the slot */
+ if (try_cmpxchg_release(&os->head, &head, head + 1))
+ return node;
+ }
+
+ /* re-load head from memory and continue trying */
+ head = READ_ONCE(os->head);
+ /*
+ * head stays unchanged, so it's very likely that the current pop()
+ * just preempted/interrupted an ongoing push() operation
+ */
+ if (head == prev)
+ break;
+ }
+
+ return NULL;
+}
+
+/**
+ * objpool_pop: allocate an object from objects pool
+ *
+ * args:
+ * @head: object pool
+ *
+ * return:
+ * object: NULL if failed (object pool is empty)
+ *
+ * objpool_pop can be nested, so can be used in any context.
+ */
+void *objpool_pop(struct objpool_head *head)
+{
+ unsigned int i, cpu;
+ void *obj = NULL;
+
+ cpu = raw_smp_processor_id() % head->nr_cpus;
+ for (i = 0; i < head->nr_cpus; i++) {
+ struct objpool_slot *slot = head->cpu_slots[cpu];
+ obj = __objpool_try_get_slot(slot);
+ if (obj)
+ break;
+ if (++cpu >= head->nr_cpus)
+ cpu = 0;
+ }
+
+ return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/**
+ * objpool_fini: cleanup the whole object pool (releasing all objects)
+ *
+ * args:
+ * @head: object pool to be released
+ *
+ */
+void objpool_fini(struct objpool_head *head)
+{
+ uint32_t i, flags;
+
+ if (!head->cpu_slots)
+ return;
+
+ if (!head->release) {
+ __objpool_fini_percpu_slots(head);
+ return;
+ }
+
+ /* clean up all objects remaining in the objpool */
+ for (i = 0; i < head->nr_cpus; i++) {
+ void *obj;
+ do {
+ flags = OBJPOOL_FLAG_NODE;
+ obj = __objpool_try_get_slot(head->cpu_slots[i]);
+ if (!obj)
+ break;
+ if (!objpool_is_inpool(obj, head) &&
+ !objpool_is_inslot(obj, head)) {
+ flags |= OBJPOOL_FLAG_USER;
+ }
+ head->release(head->context, obj, flags);
+ } while (obj);
+ }
+
+ /* release percpu slots */
+ __objpool_fini_percpu_slots(head);
+
+ /* cleanup user private pool and related context */
+ flags = OBJPOOL_FLAG_POOL;
+ if (head->pool)
+ flags |= OBJPOOL_FLAG_USER;
+ head->release(head->context, head->pool, flags);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
--
2.34.1


2022-11-06 06:07:04

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v5 4/4] kprobes: freelist.h removed

This patch removes freelist.h from the kernel source tree, since its
only users (kretprobe and rethook) have been converted to objpool.

Signed-off-by: wuqiang <[email protected]>
---
include/linux/freelist.h | 129 ---------------------------------------
1 file changed, 129 deletions(-)
delete mode 100644 include/linux/freelist.h

diff --git a/include/linux/freelist.h b/include/linux/freelist.h
deleted file mode 100644
index fc1842b96469..000000000000
--- a/include/linux/freelist.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
-#ifndef FREELIST_H
-#define FREELIST_H
-
-#include <linux/atomic.h>
-
-/*
- * Copyright: [email protected]
- *
- * A simple CAS-based lock-free free list. Not the fastest thing in the world
- * under heavy contention, but simple and correct (assuming nodes are never
- * freed until after the free list is destroyed), and fairly speedy under low
- * contention.
- *
- * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
- */
-
-struct freelist_node {
- atomic_t refs;
- struct freelist_node *next;
-};
-
-struct freelist_head {
- struct freelist_node *head;
-};
-
-#define REFS_ON_FREELIST 0x80000000
-#define REFS_MASK 0x7FFFFFFF
-
-static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * Since the refcount is zero, and nobody can increase it once it's
- * zero (except us, and we run only one copy of this method per node at
- * a time, i.e. the single thread case), then we know we can safely
- * change the next pointer of the node; however, once the refcount is
- * back above zero, then other threads could increase it (happens under
- * heavy contention, when the refcount goes to zero in between a load
- * and a refcount increment of a node in try_get, then back up to
- * something non-zero, then the refcount increment is done by the other
- * thread) -- so if the CAS to add the node to the actual list fails,
- * decrese the refcount and leave the add operation to the next thread
- * who puts the refcount back to zero (which could be us, hence the
- * loop).
- */
- struct freelist_node *head = READ_ONCE(list->head);
-
- for (;;) {
- WRITE_ONCE(node->next, head);
- atomic_set_release(&node->refs, 1);
-
- if (!try_cmpxchg_release(&list->head, &head, node)) {
- /*
- * Hmm, the add failed, but we can only try again when
- * the refcount goes back to zero.
- */
- if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
- continue;
- }
- return;
- }
-}
-
-static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * We know that the should-be-on-freelist bit is 0 at this point, so
- * it's safe to set it using a fetch_add.
- */
- if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
- /*
- * Oh look! We were the last ones referencing this node, and we
- * know we want to add it to the free list, so let's do it!
- */
- __freelist_add(node, list);
- }
-}
-
-static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
-{
- struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
- unsigned int refs;
-
- while (head) {
- prev = head;
- refs = atomic_read(&head->refs);
- if ((refs & REFS_MASK) == 0 ||
- !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
- head = smp_load_acquire(&list->head);
- continue;
- }
-
- /*
- * Good, reference count has been incremented (it wasn't at
- * zero), which means we can read the next and not worry about
- * it changing between now and the time we do the CAS.
- */
- next = READ_ONCE(head->next);
- if (try_cmpxchg_acquire(&list->head, &head, next)) {
- /*
- * Yay, got the node. This means it was on the list,
- * which means should-be-on-freelist must be false no
- * matter the refcount (because nobody else knows it's
- * been taken off yet, it can't have been put back on).
- */
- WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
-
- /*
- * Decrease refcount twice, once for our ref, and once
- * for the list's ref.
- */
- atomic_fetch_add(-2, &head->refs);
-
- return head;
- }
-
- /*
- * OK, the head must have changed on us, but we still need to decrement
- * the refcount we increased.
- */
- refs = atomic_fetch_add(-1, &prev->refs);
- if (refs == REFS_ON_FREELIST + 1)
- __freelist_add(prev, list);
- }
-
- return NULL;
-}
-
-#endif /* FREELIST_H */
--
2.34.1


2022-11-06 06:09:34

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v5 3/4] kprobes: kretprobe scalability improvement with objpool

kretprobe uses freelist to manage return-instances, but freelist, as a
LIFO queue based on a singly linked list, scales badly and reduces the
overall throughput of kretprobed routines, especially in high
contention scenarios.

Here's a typical throughput test of sys_flock (counts in 10 seconds,
measured with perf stat -a -I 10000 -e syscalls:sys_enter_flock):

OS: Debian 10 X86_64, Linux 6.1rc2
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s

1X 2X 4X 6X 8X 12X 16X
34762430 36546920 17949900 13101899 12569595 12646601 14729195
24X 32X 48X 64X 72X 96X 128X
19263546 10102064 8985418 11936495 11493980 7127789 9330985

This patch introduces objpool to kretprobe and rethook, replacing the
original freelist and bringing near-linear scalability to kretprobed
routines. Tests of kretprobe throughput show an improvement of up to
333.9x over the original freelist. Here's the comparison:

1X 2X 4X 8X 16X
freelist: 34762430 36546920 17949900 12569595 14729195
objpool: 35627544 72182095 144068494 287564688 576903916
32X 48X 64X 96X 128X
freelist: 10102064 8985418 11936495 7127789 9330985
objpool: 1158876372 1737828164 2324371724 2380310472 2463182819

Tests on a 96-core ARM64 system show similar results, with the biggest
ratio up to 642.2x:

OS: Debian 10 AARCH64, Linux 6.1rc2
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s

1X 2X 4X 8X 16X
freelist: 17498299 10887037 10224710 8499132 6421751
objpool: 18715726 35549845 71615884 144258971 283707220
24X 32X 48X 64X 96X
freelist: 5339868 4819116 3593919 3121575 2687167
objpool: 419830913 571609748 877456139 1143316315 1725668029

Signed-off-by: wuqiang <[email protected]>
---
include/linux/kprobes.h | 9 ++--
include/linux/rethook.h | 15 +++----
kernel/kprobes.c | 95 +++++++++++++++++++----------------------
kernel/trace/fprobe.c | 17 +++-----
kernel/trace/rethook.c | 80 +++++++++++++++++-----------------
5 files changed, 96 insertions(+), 120 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index a0b92be98984..f13f01e600c2 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -27,7 +27,7 @@
#include <linux/mutex.h>
#include <linux/ftrace.h>
#include <linux/refcount.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/rethook.h>
#include <asm/kprobes.h>

@@ -141,6 +141,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
*/
struct kretprobe_holder {
struct kretprobe *rp;
+ struct objpool_head oh;
refcount_t ref;
};

@@ -154,7 +155,6 @@ struct kretprobe {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook *rh;
#else
- struct freelist_head freelist;
struct kretprobe_holder *rph;
#endif
};
@@ -165,10 +165,7 @@ struct kretprobe_instance {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook_node node;
#else
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct kretprobe_holder *rph;
kprobe_opcode_t *ret_addr;
diff --git a/include/linux/rethook.h b/include/linux/rethook.h
index c8ac1e5afcd1..278ec65e71fe 100644
--- a/include/linux/rethook.h
+++ b/include/linux/rethook.h
@@ -6,7 +6,7 @@
#define _LINUX_RETHOOK_H

#include <linux/compiler.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/kallsyms.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
@@ -30,14 +30,14 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, struct pt_regs
struct rethook {
void *data;
rethook_handler_t handler;
- struct freelist_head pool;
+ struct objpool_head pool;
refcount_t ref;
struct rcu_head rcu;
};

/**
* struct rethook_node - The rethook shadow-stack entry node.
- * @freelist: The freelist, linked to struct rethook::pool.
+ * @nod: The objpool node, linked to struct rethook::pool.
* @rcu: The rcu_head for deferred freeing.
* @llist: The llist, linked to a struct task_struct::rethooks.
* @rethook: The pointer to the struct rethook.
@@ -48,19 +48,15 @@ struct rethook {
* on each entry of the shadow stack.
*/
struct rethook_node {
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct rethook *rethook;
unsigned long ret_addr;
unsigned long frame;
};

-struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, gfp_t gfp, int size, int max);
void rethook_free(struct rethook *rh);
-void rethook_add_node(struct rethook *rh, struct rethook_node *node);
struct rethook_node *rethook_try_get(struct rethook *rh);
void rethook_recycle(struct rethook_node *node);
void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
@@ -97,4 +93,3 @@ void rethook_flush_task(struct task_struct *tk);
#endif

#endif
-
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 3220b0a2fb4a..e8526a0d29b6 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1865,10 +1865,12 @@ static struct notifier_block kprobe_exceptions_nb = {
static void free_rp_inst_rcu(struct rcu_head *head)
{
struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
+ struct kretprobe_holder *rph = ri->rph;

- if (refcount_dec_and_test(&ri->rph->ref))
- kfree(ri->rph);
- kfree(ri);
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->oh);
+ kfree(rph);
+ }
}
NOKPROBE_SYMBOL(free_rp_inst_rcu);

@@ -1877,7 +1879,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
struct kretprobe *rp = get_kretprobe(ri);

if (likely(rp))
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rp->rph->oh);
else
call_rcu(&ri->rcu, free_rp_inst_rcu);
}
@@ -1914,23 +1916,19 @@ NOKPROBE_SYMBOL(kprobe_flush_task);

static inline void free_rp_inst(struct kretprobe *rp)
{
- struct kretprobe_instance *ri;
- struct freelist_node *node;
- int count = 0;
-
- node = rp->freelist.head;
- while (node) {
- ri = container_of(node, struct kretprobe_instance, freelist);
- node = node->next;
-
- kfree(ri);
- count++;
- }
+ struct kretprobe_holder *rph = rp->rph;
+ void *nod;

- if (refcount_sub_and_test(count, &rp->rph->ref)) {
- kfree(rp->rph);
- rp->rph = NULL;
- }
+ rp->rph = NULL;
+ do {
+ nod = objpool_pop(&rph->oh);
+ /* deref anyway since we hold one extra ref */
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->oh);
+ kfree(rph);
+ break;
+ }
+ } while (nod);
}

/* This assumes the 'tsk' is the current task or the is not running. */
@@ -2072,19 +2070,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
{
struct kretprobe *rp = container_of(p, struct kretprobe, kp);
+ struct kretprobe_holder *rph = rp->rph;
struct kretprobe_instance *ri;
- struct freelist_node *fn;

- fn = freelist_try_get(&rp->freelist);
- if (!fn) {
+ ri = objpool_pop(&rph->oh);
+ if (!ri) {
rp->nmissed++;
return 0;
}

- ri = container_of(fn, struct kretprobe_instance, freelist);
-
if (rp->entry_handler && rp->entry_handler(ri, regs)) {
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rph->oh);
return 0;
}

@@ -2174,10 +2170,19 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
return 0;
}

+#ifndef CONFIG_KRETPROBE_ON_RETHOOK
+static int kretprobe_init_inst(void *context, void *nod)
+{
+ struct kretprobe_instance *ri = nod;
+
+ ri->rph = context;
+ return 0;
+}
+#endif
+
int register_kretprobe(struct kretprobe *rp)
{
int ret;
- struct kretprobe_instance *inst;
int i;
void *addr;

@@ -2215,20 +2220,12 @@ int register_kretprobe(struct kretprobe *rp)
#endif
}
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
- rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
+ rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler, GFP_KERNEL,
+ sizeof(struct kretprobe_instance) + rp->data_size,
+ rp->maxactive);
if (!rp->rh)
return -ENOMEM;

- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- rethook_free(rp->rh);
- rp->rh = NULL;
- return -ENOMEM;
- }
- rethook_add_node(rp->rh, &inst->node);
- }
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
@@ -2237,25 +2234,19 @@ int register_kretprobe(struct kretprobe *rp)
rp->rh = NULL;
}
#else /* !CONFIG_KRETPROBE_ON_RETHOOK */
- rp->freelist.head = NULL;
rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
if (!rp->rph)
return -ENOMEM;

- rp->rph->rp = rp;
- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- refcount_set(&rp->rph->ref, i);
- free_rp_inst(rp);
- return -ENOMEM;
- }
- inst->rph = rp->rph;
- freelist_add(&inst->freelist, &rp->freelist);
+ if (objpool_init(&rp->rph->oh, rp->maxactive, rp->maxactive,
+ rp->data_size + sizeof(struct kretprobe_instance),
+ GFP_KERNEL, rp->rph, kretprobe_init_inst, NULL)) {
+ kfree(rp->rph);
+ rp->rph = NULL;
+ return -ENOMEM;
}
- refcount_set(&rp->rph->ref, i);
-
+ refcount_set(&rp->rph->ref, rp->maxactive + 1);
+ rp->rph->rp = rp;
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index aac63ca9c3d1..99b4ab0f6468 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -125,7 +125,7 @@ static void fprobe_init(struct fprobe *fp)

static int fprobe_init_rethook(struct fprobe *fp, int num)
{
- int i, size;
+ int size;

if (num < 0)
return -EINVAL;
@@ -140,18 +140,11 @@ static int fprobe_init_rethook(struct fprobe *fp, int num)
if (size < 0)
return -E2BIG;

- fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
- for (i = 0; i < size; i++) {
- struct fprobe_rethook_node *node;
+ fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler, GFP_KERNEL,
+ sizeof(struct fprobe_rethook_node), size);
+ if (!fp->rethook)
+ return -ENOMEM;

- node = kzalloc(sizeof(*node), GFP_KERNEL);
- if (!node) {
- rethook_free(fp->rethook);
- fp->rethook = NULL;
- return -ENOMEM;
- }
- rethook_add_node(fp->rethook, &node->node);
- }
return 0;
}

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index c69d82273ce7..01df98db2fbe 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -36,21 +36,17 @@ void rethook_flush_task(struct task_struct *tk)
static void rethook_free_rcu(struct rcu_head *head)
{
struct rethook *rh = container_of(head, struct rethook, rcu);
- struct rethook_node *rhn;
- struct freelist_node *node;
- int count = 1;
+ struct rethook_node *nod;

- node = rh->pool.head;
- while (node) {
- rhn = container_of(node, struct rethook_node, freelist);
- node = node->next;
- kfree(rhn);
- count++;
- }
-
- /* The rh->ref is the number of pooled node + 1 */
- if (refcount_sub_and_test(count, &rh->ref))
- kfree(rh);
+ do {
+ nod = objpool_pop(&rh->pool);
+ /* deref anyway since we hold one extra ref */
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ kfree(rh);
+ break;
+ }
+ } while (nod);
}

/**
@@ -70,16 +66,28 @@ void rethook_free(struct rethook *rh)
call_rcu(&rh->rcu, rethook_free_rcu);
}

+static int rethook_init_node(void *context, void *nod)
+{
+ struct rethook_node *node = nod;
+
+ node->rethook = context;
+ return 0;
+}
+
/**
* rethook_alloc() - Allocate struct rethook.
* @data: a data to pass the @handler when hooking the return.
* @handler: the return hook callback function.
+ * @gfp: default gfp for objpool allocation
+ * @size: rethook node size
+ * @max: number of rethook nodes to be preallocated
*
* Allocate and initialize a new rethook with @data and @handler.
* Return NULL if memory allocation fails or @handler is NULL.
* Note that @handler == NULL means this rethook is going to be freed.
*/
-struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, gfp_t gfp,
+ int size, int max)
{
struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);

@@ -88,34 +96,26 @@ struct rethook *rethook_alloc(void *data, rethook_handler_t handler)

rh->data = data;
rh->handler = handler;
- rh->pool.head = NULL;
- refcount_set(&rh->ref, 1);

+ /* initialize the objpool for rethook nodes */
+ if (objpool_init(&rh->pool, max, max, size, gfp, rh, rethook_init_node,
+ NULL)) {
+ kfree(rh);
+ return NULL;
+ }
+ refcount_set(&rh->ref, max + 1);
return rh;
}

-/**
- * rethook_add_node() - Add a new node to the rethook.
- * @rh: the struct rethook.
- * @node: the struct rethook_node to be added.
- *
- * Add @node to @rh. User must allocate @node (as a part of user's
- * data structure.) The @node fields are initialized in this function.
- */
-void rethook_add_node(struct rethook *rh, struct rethook_node *node)
-{
- node->rethook = rh;
- freelist_add(&node->freelist, &rh->pool);
- refcount_inc(&rh->ref);
-}
-
static void free_rethook_node_rcu(struct rcu_head *head)
{
struct rethook_node *node = container_of(head, struct rethook_node, rcu);
+ struct rethook *rh = node->rethook;

- if (refcount_dec_and_test(&node->rethook->ref))
- kfree(node->rethook);
- kfree(node);
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ kfree(rh);
+ }
}

/**
@@ -130,7 +130,7 @@ void rethook_recycle(struct rethook_node *node)
lockdep_assert_preemption_disabled();

if (likely(READ_ONCE(node->rethook->handler)))
- freelist_add(&node->freelist, &node->rethook->pool);
+ objpool_push(node, &node->rethook->pool);
else
call_rcu(&node->rcu, free_rethook_node_rcu);
}
@@ -146,7 +146,7 @@ NOKPROBE_SYMBOL(rethook_recycle);
struct rethook_node *rethook_try_get(struct rethook *rh)
{
rethook_handler_t handler = READ_ONCE(rh->handler);
- struct freelist_node *fn;
+ struct rethook_node *nod;

lockdep_assert_preemption_disabled();

@@ -163,11 +163,11 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!rcu_is_watching()))
return NULL;

- fn = freelist_try_get(&rh->pool);
- if (!fn)
+ nod = (struct rethook_node *)objpool_pop(&rh->pool);
+ if (!nod)
return NULL;

- return container_of(fn, struct rethook_node, freelist);
+ return nod;
}
NOKPROBE_SYMBOL(rethook_try_get);

--
2.34.1


2022-11-08 07:33:34

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v6 0/4] lib,kprobes: kretprobe scalability improvement

This patch series introduces a scalable and lockless ring-array based
object pool and replaces the original freelist (a LIFO queue based on a
singly linked list) to improve the scalability of kretprobed routines.
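
As a rough illustration (not part of any patch in this series), here is
how a consumer's hot path looks once converted. The demo_* wrappers
below are made up for this note, while objpool_pop()/objpool_push() and
the rethook fields come from patches 1/4 and 3/4:

  #include <linux/objpool.h>
  #include <linux/rethook.h>

  /* after this series, struct rethook embeds an objpool_head named 'pool' */
  static struct rethook_node *demo_get_node(struct rethook *rh)
  {
          /* replaces freelist_try_get() + container_of(); NULL when exhausted */
          return objpool_pop(&rh->pool);
  }

  static void demo_put_node(struct rethook_node *node, struct rethook *rh)
  {
          /* replaces freelist_add(); returns the node to the pool for reuse */
          objpool_push(node, &rh->pool);
  }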

Changes from v5 (https://lore.kernel.org/lkml/[email protected]/):
1) PATCH 2/4: test_objpool: build warnings with [-Wmissing-prototypes]
2) PATCH 3/4: fprobe.c: conflicts resolved for linux-6.1-rc4 merging

wuqiang (4):
lib: objpool added: ring-array based lockless MPMC queue
lib: objpool test module added
kprobes: kretprobe scalability improvement with objpool
kprobes: freelist.h removed

include/linux/freelist.h | 129 -----
include/linux/kprobes.h | 9 +-
include/linux/objpool.h | 153 ++++++
include/linux/rethook.h | 15 +-
kernel/kprobes.c | 95 ++--
kernel/trace/fprobe.c | 17 +-
kernel/trace/rethook.c | 80 +--
lib/Kconfig.debug | 11 +
lib/Makefile | 4 +-
lib/objpool.c | 487 ++++++++++++++++++
lib/test_objpool.c | 1052 ++++++++++++++++++++++++++++++++++++++
11 files changed, 1801 insertions(+), 251 deletions(-)
delete mode 100644 include/linux/freelist.h
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c
create mode 100644 lib/test_objpool.c

--
2.34.1


2022-11-08 07:36:05

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v6 2/4] lib: objpool test module added

The test_objpool module runs several testcases for objpool stress and
performance evaluation. Each testcase involves all available cpu cores
to create a situation of high parallelism and high contention.

As of now there are 3 groups and 3 * 6 testcases in total:

1) group 1: synchronous mode
objpool is managed synchronously, that is, all objects are to be
reclaimed before objpool finalization and the objpool owner makes
sure of it. All threads on different cores run at the same pace.
2) group 2: synchronous + miss mode
This test group is mainly for performance evaluation of miss cases
when pre-allocated objects are fewer than requested.
3) group 3: asynchronous mode
This group emulates kretprobe usage. The objpool owner has no
control over an object after it's allocated. An hrtimer irq is
introduced to stress the objpool with thread preemption.

Signed-off-by: wuqiang <[email protected]>
---
lib/Kconfig.debug | 11 +
lib/Makefile | 2 +
lib/test_objpool.c | 1052 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 1065 insertions(+)
create mode 100644 lib/test_objpool.c

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 29280072dc0e..0749335d79db 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2738,6 +2738,17 @@ config TEST_CLOCKSOURCE_WATCHDOG

If unsure, say N.

+config TEST_OBJPOOL
+ tristate "Test module for correctness and stress of objpool"
+ default n
+ depends on m
+ help
+ This builds the "test_objpool" module that should be used for
+ correctness verification and concurrent testing of object
+ allocation and reclamation.
+
+ If unsure, say N.
+
endif # RUNTIME_TESTING_MENU

config ARCH_USE_MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index e938703a321f..4aa282fa0cfc 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -99,6 +99,8 @@ obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
obj-$(CONFIG_TEST_REF_TRACKER) += test_ref_tracker.o
CFLAGS_test_fprobe.o += $(CC_FLAGS_FTRACE)
obj-$(CONFIG_FPROBE_SANITY_TEST) += test_fprobe.o
+obj-$(CONFIG_TEST_OBJPOOL) += test_objpool.o
+
#
# CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns
# off the generation of FPU/SSE* instructions for kernel proper but FPU_FLAGS
diff --git a/lib/test_objpool.c b/lib/test_objpool.c
new file mode 100644
index 000000000000..a4c1814ac3b7
--- /dev/null
+++ b/lib/test_objpool.c
@@ -0,0 +1,1052 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for lockless object pool
+ * (C) 2022 Matt Wu <[email protected]>
+ */
+
+#include <linux/version.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/sched.h>
+#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+#include <linux/cpu.h>
+#include <linux/cpuset.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/delay.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+#include <linux/objpool.h>
+
+#define OT_NR_MAX_BULK (16)
+
+struct ot_ctrl {
+ unsigned int mode; /* test no */
+ unsigned int objsz; /* object size */
+ unsigned int duration; /* ms */
+ unsigned int delay; /* ms */
+ unsigned int bulk_normal;
+ unsigned int bulk_irq;
+ unsigned long hrtimer; /* ms */
+ const char *name;
+};
+
+struct ot_stat {
+ unsigned long nhits;
+ unsigned long nmiss;
+};
+
+struct ot_item {
+ struct objpool_head *pool; /* pool head */
+ struct ot_ctrl *ctrl; /* ctrl parameters */
+
+ void (*worker)(struct ot_item *item, int irq);
+
+ /* hrtimer control */
+ ktime_t hrtcycle;
+ struct hrtimer hrtimer;
+
+ int bulk[2]; /* for thread and irq */
+ int delay;
+ u32 niters;
+
+ /* results summary */
+ struct ot_stat stat[2]; /* thread and irq */
+
+ u64 duration;
+};
+
+struct ot_mem_stat {
+ atomic_long_t alloc;
+ atomic_long_t free;
+};
+
+struct ot_data {
+ struct rw_semaphore start;
+ struct completion wait;
+ struct completion rcu;
+ atomic_t nthreads ____cacheline_aligned_in_smp;
+ atomic_t stop ____cacheline_aligned_in_smp;
+ struct ot_mem_stat kmalloc;
+ struct ot_mem_stat vmalloc;
+} g_ot_data;
+
+/*
+ * memory leakage checking
+ */
+
+static void *ot_kzalloc(long size)
+{
+ void *ptr = kzalloc(size, GFP_KERNEL);
+
+ if (ptr)
+ atomic_long_add(size, &g_ot_data.kmalloc.alloc);
+ return ptr;
+}
+
+static void ot_kfree(void *ptr, long size)
+{
+ if (!ptr)
+ return;
+ atomic_long_add(size, &g_ot_data.kmalloc.free);
+ kfree(ptr);
+}
+
+static void *ot_vmalloc(long size)
+{
+ void *ptr = vmalloc(size);
+
+ if (ptr)
+ atomic_long_add(size, &g_ot_data.vmalloc.alloc);
+ return ptr;
+}
+
+static void ot_vfree(void *ptr, long size)
+{
+ if (!ptr)
+ return;
+ atomic_long_add(size, &g_ot_data.vmalloc.free);
+ vfree(ptr);
+}
+
+static void ot_mem_report(struct ot_ctrl *ctrl)
+{
+ long alloc, free;
+
+ pr_info("memory allocation summary for %s\n", ctrl->name);
+
+ alloc = atomic_long_read(&g_ot_data.kmalloc.alloc);
+ free = atomic_long_read(&g_ot_data.kmalloc.free);
+ pr_info(" kmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+
+ alloc = atomic_long_read(&g_ot_data.vmalloc.alloc);
+ free = atomic_long_read(&g_ot_data.vmalloc.free);
+ pr_info(" vmalloc: %lu - %lu = %lu\n", alloc, free, alloc - free);
+}
+
+/*
+ * general structs & routines
+ */
+
+struct ot_node {
+ void *owner;
+ unsigned long data;
+ unsigned long refs;
+ unsigned long payload[32];
+};
+
+struct ot_context {
+ struct objpool_head pool; /* objpool head */
+ struct ot_ctrl *ctrl; /* ctrl parameters */
+ void *ptr; /* user pool buffer */
+ unsigned long size; /* buffer size */
+ refcount_t refs;
+ struct rcu_head rcu;
+};
+
+static DEFINE_PER_CPU(struct ot_item, ot_pcup_items);
+
+static int ot_init_data(struct ot_data *data)
+{
+ memset(data, 0, sizeof(*data));
+ init_rwsem(&data->start);
+ init_completion(&data->wait);
+ init_completion(&data->rcu);
+ atomic_set(&data->nthreads, 1);
+
+ return 0;
+}
+
+static void ot_reset_data(struct ot_data *data)
+{
+ reinit_completion(&data->wait);
+ reinit_completion(&data->rcu);
+ atomic_set(&data->nthreads, 1);
+ atomic_set(&data->stop, 0);
+ memset(&data->kmalloc, 0, sizeof(data->kmalloc));
+ memset(&data->vmalloc, 0, sizeof(data->vmalloc));
+}
+
+static int ot_init_node(void *context, void *nod)
+{
+ struct ot_context *sop = context;
+ struct ot_node *on = nod;
+
+ on->owner = &sop->pool;
+ return 0;
+}
+
+static enum hrtimer_restart ot_hrtimer_handler(struct hrtimer *hrt)
+{
+ struct ot_item *item = container_of(hrt, struct ot_item, hrtimer);
+
+ if (atomic_read_acquire(&g_ot_data.stop))
+ return HRTIMER_NORESTART;
+
+ /* do bulk-testings for objects pop/push */
+ item->worker(item, 1);
+
+ hrtimer_forward(hrt, hrt->base->get_time(), item->hrtcycle);
+ return HRTIMER_RESTART;
+}
+
+static void ot_start_hrtimer(struct ot_item *item)
+{
+ if (!item->ctrl->hrtimer)
+ return;
+ hrtimer_start(&item->hrtimer, item->hrtcycle, HRTIMER_MODE_REL);
+}
+
+static void ot_stop_hrtimer(struct ot_item *item)
+{
+ if (!item->ctrl->hrtimer)
+ return;
+ hrtimer_cancel(&item->hrtimer);
+}
+
+static int ot_init_hrtimer(struct ot_item *item, unsigned long hrtimer)
+{
+ struct hrtimer *hrt = &item->hrtimer;
+
+ if (!hrtimer)
+ return -ENOENT;
+
+ item->hrtcycle = ktime_set(0, hrtimer * 1000000UL);
+ hrtimer_init(hrt, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ hrt->function = ot_hrtimer_handler;
+ return 0;
+}
+
+static int ot_init_cpu_item(struct ot_item *item,
+ struct ot_ctrl *ctrl,
+ struct objpool_head *pool,
+ void (*worker)(struct ot_item *, int))
+{
+ memset(item, 0, sizeof(*item));
+ item->pool = pool;
+ item->ctrl = ctrl;
+ item->worker = worker;
+
+ item->bulk[0] = ctrl->bulk_normal;
+ item->bulk[1] = ctrl->bulk_irq;
+ item->delay = ctrl->delay;
+
+ /* initialize hrtimer */
+ ot_init_hrtimer(item, item->ctrl->hrtimer);
+ return 0;
+}
+
+static int ot_thread_worker(void *arg)
+{
+ struct ot_item *item = arg;
+ ktime_t start;
+
+ sched_set_normal(current, 50);
+
+ atomic_inc(&g_ot_data.nthreads);
+ down_read(&g_ot_data.start);
+ up_read(&g_ot_data.start);
+ start = ktime_get();
+ ot_start_hrtimer(item);
+ do {
+ if (atomic_read_acquire(&g_ot_data.stop))
+ break;
+ /* do bulk-testings for objects pop/push */
+ item->worker(item, 0);
+ } while (!kthread_should_stop());
+ ot_stop_hrtimer(item);
+ item->duration = (u64) ktime_us_delta(ktime_get(), start);
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ return 0;
+}
+
+static void ot_perf_report(struct ot_ctrl *ctrl, u64 duration)
+{
+ struct ot_stat total, normal = {0}, irq = {0};
+ int cpu, nthreads = 0;
+
+ pr_info("\n");
+ pr_info("Testing summary for %s\n", ctrl->name);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ if (!item->duration)
+ continue;
+ normal.nhits += item->stat[0].nhits;
+ normal.nmiss += item->stat[0].nmiss;
+ irq.nhits += item->stat[1].nhits;
+ irq.nmiss += item->stat[1].nmiss;
+ pr_info("CPU: %d duration: %lluus\n", cpu, item->duration);
+ pr_info("\tthread:\t%16lu hits \t%16lu miss\n",
+ item->stat[0].nhits, item->stat[0].nmiss);
+ pr_info("\tirq: \t%16lu hits \t%16lu miss\n",
+ item->stat[1].nhits, item->stat[1].nmiss);
+ pr_info("\ttotal: \t%16lu hits \t%16lu miss\n",
+ item->stat[0].nhits + item->stat[1].nhits,
+ item->stat[0].nmiss + item->stat[1].nmiss);
+ nthreads++;
+ }
+
+ total.nhits = normal.nhits + irq.nhits;
+ total.nmiss = normal.nmiss + irq.nmiss;
+
+ pr_info("ALL: \tnthreads: %d duration: %lluus\n", nthreads, duration);
+ pr_info("SUM: \t%16lu hits \t%16lu miss\n",
+ total.nhits, total.nmiss);
+}
+
+/*
+ * synchronous test cases for objpool manipulation
+ */
+
+/* objpool manipulation for synchronous mode 0 (percpu objpool) */
+static struct ot_context *ot_init_sync_m0(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ int max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ if (objpool_init(&sop->pool, max, max, ctrl->objsz,
+ GFP_KERNEL, sop, ot_init_node, NULL)) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON(max != sop->pool.nr_objs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m0(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+ ot_kfree(sop, sizeof(*sop));
+}
+
+/* objpool manipulation for synchronous mode 1 (private pool) */
+static struct ot_context *ot_init_sync_m1(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ unsigned long size;
+ int rc, szobj, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ szobj = ALIGN(ctrl->objsz, sizeof(void *));
+ size = szobj * max;
+ sop->ptr = ot_vmalloc(size);
+ sop->size = size;
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ memset(sop->ptr, 0, size);
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL, NULL);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ ctrl->objsz, sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON((size / szobj) != sop->pool.nr_objs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m1(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+
+ ot_vfree(sop->ptr, sop->size);
+ ot_kfree(sop, sizeof(*sop));
+}
+
+/* objpool manipulation for synchronous mode 2 (private objects) */
+static int ot_objpool_release(void *context, void *ptr, uint32_t flags)
+{
+ struct ot_context *sop = context;
+
+ /* here we need release all user-allocated objects */
+ if ((flags & OBJPOOL_FLAG_NODE) && (flags & OBJPOOL_FLAG_USER)) {
+ struct ot_node *on = ptr;
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sop->ctrl->objsz);
+ } else if (flags & OBJPOOL_FLAG_POOL) {
+ /* release user preallocated pool */
+ if (sop->ptr) {
+ WARN_ON(sop->ptr != ptr);
+ WARN_ON(!(flags & OBJPOOL_FLAG_USER));
+ ot_vfree(sop->ptr, sop->size);
+ }
+ /* do context cleaning if needed */
+ ot_kfree(sop, sizeof(*sop));
+ }
+
+ return 0;
+}
+
+static struct ot_context *ot_init_sync_m2(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ int rc, i, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < max; i++) {
+ on = ot_kzalloc(ctrl->objsz);
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ }
+ }
+ WARN_ON(max != sop->pool.nr_objs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m2(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+}
+
+/* objpool manipulation for synchronous mode 3 (mixed mode) */
+static struct ot_context *ot_init_sync_m3(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ unsigned long size;
+ int rc, i, szobj, nobjs;
+ int max = num_possible_cpus() << 4;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ /* create and initialize objpool as empty (no objects) */
+ nobjs = num_possible_cpus() * 2;
+ rc = objpool_init(&sop->pool, nobjs, max, ctrl->objsz, GFP_KERNEL,
+ sop, ot_init_node, ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ szobj = ALIGN(ctrl->objsz, sizeof(void *));
+ size = szobj * num_possible_cpus() * 4;
+ sop->ptr = ot_vmalloc(size);
+ if (!sop->ptr) {
+ objpool_fini(&sop->pool);
+ return NULL;
+ }
+ sop->size = size;
+ memset(sop->ptr, 0, size);
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ ctrl->objsz, sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+ nobjs += size / szobj;
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < num_possible_cpus() * 2; i++) {
+ on = ot_kzalloc(ctrl->objsz);
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ if (!objpool_add(on, &sop->pool))
+ nobjs++;
+ else
+ ot_kfree(on, ctrl->objsz);
+ }
+ }
+ WARN_ON(nobjs != sop->pool.nr_objs);
+
+ return sop;
+}
+
+static void ot_fini_sync_m3(struct ot_context *sop)
+{
+ objpool_fini(&sop->pool);
+}
+
+struct {
+ struct ot_context * (*init)(struct ot_ctrl *);
+ void (*fini)(struct ot_context *sop);
+} g_ot_sync_ops[4] = {
+ {.init = ot_init_sync_m0, .fini = ot_fini_sync_m0},
+ {.init = ot_init_sync_m1, .fini = ot_fini_sync_m1},
+ {.init = ot_init_sync_m2, .fini = ot_fini_sync_m2},
+ {.init = ot_init_sync_m3, .fini = ot_fini_sync_m3},
+};
+
+/*
+ * synchronous test cases: performance mode
+ */
+
+static void ot_bulk_sync(struct ot_item *item, int irq)
+{
+ struct ot_node *nods[OT_NR_MAX_BULK];
+ int i;
+
+ for (i = 0; i < item->bulk[irq]; i++)
+ nods[i] = objpool_pop(item->pool);
+
+ if (!irq && (item->delay || !(++(item->niters) & 0x7FFF)))
+ msleep(item->delay);
+
+ while (i-- > 0) {
+ struct ot_node *on = nods[i];
+ if (on) {
+ on->refs++;
+ objpool_push(on, item->pool);
+ item->stat[irq].nhits++;
+ } else {
+ item->stat[irq].nmiss++;
+ }
+ }
+}
+
+static int ot_start_sync(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop;
+ ktime_t start;
+ u64 duration;
+ unsigned long timeout;
+ int cpu, rc;
+
+ /* initialize objpool for synchronous testcase */
+ sop = g_ot_sync_ops[ctrl->mode].init(ctrl);
+ if (!sop)
+ return -ENOMEM;
+
+ /* grab rwsem to block testing threads */
+ down_write(&g_ot_data.start);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ struct task_struct *work;
+
+ ot_init_cpu_item(item, ctrl, &sop->pool, ot_bulk_sync);
+
+ /* skip offline cpus */
+ if (!cpu_online(cpu))
+ continue;
+
+ work = kthread_create_on_node(ot_thread_worker, item,
+ cpu_to_node(cpu), "ot_worker_%d", cpu);
+ if (IS_ERR(work)) {
+ pr_err("failed to create thread for cpu %d\n", cpu);
+ } else {
+ kthread_bind(work, cpu);
+ wake_up_process(work);
+ }
+ }
+
+ /* wait a while to make sure all threads are waiting at the start line */
+ msleep(20);
+
+ /* in case no threads were created: memory insufficient ? */
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ // sched_set_fifo_low(current);
+
+ /* start objpool testing threads */
+ start = ktime_get();
+ up_write(&g_ot_data.start);
+
+ /* yield cpu to worker threads for duration ms */
+ timeout = msecs_to_jiffies(ctrl->duration);
+ rc = schedule_timeout_interruptible(timeout);
+
+ /* tell worker threads to quit */
+ atomic_set_release(&g_ot_data.stop, 1);
+
+ /* wait for all worker threads to finish and quit */
+ wait_for_completion(&g_ot_data.wait);
+ duration = (u64) ktime_us_delta(ktime_get(), start);
+
+ /* cleanup objpool */
+ g_ot_sync_ops[ctrl->mode].fini(sop);
+
+ /* report testing summary and performance results */
+ ot_perf_report(ctrl, duration);
+
+ /* report memory allocation summary */
+ ot_mem_report(ctrl);
+
+ return rc;
+}
+
+/*
+ * asynchronous test cases: pool lifecycle controlled by refcount
+ */
+
+static void ot_fini_async_rcu(struct rcu_head *rcu)
+{
+ struct ot_context *sop = container_of(rcu, struct ot_context, rcu);
+ struct ot_node *on;
+
+ /* here all cpus are aware of the stop event: g_ot_data.stop = 1 */
+ WARN_ON(!atomic_read_acquire(&g_ot_data.stop));
+
+ do {
+ /* release all objects remaining in the objpool */
+ on = objpool_pop(&sop->pool);
+ if (on && !objpool_is_inslot(on, &sop->pool) &&
+ !objpool_is_inpool(on, &sop->pool)) {
+ /* private object managed by user */
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sop->ctrl->objsz);
+ }
+
+ /* deref anyway since we hold one extra ref */
+ if (refcount_dec_and_test(&sop->refs)) {
+ objpool_fini(&sop->pool);
+ break;
+ }
+ } while (on);
+
+ complete(&g_ot_data.rcu);
+}
+
+static void ot_fini_async(struct ot_context *sop)
+{
+ /* make sure the stop event is acknowledged by all cores */
+ call_rcu(&sop->rcu, ot_fini_async_rcu);
+}
+
+static struct ot_context *ot_init_async_m0(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ int max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ if (objpool_init(&sop->pool, max, max, ctrl->objsz, GFP_KERNEL,
+ sop, ot_init_node, ot_objpool_release)) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ WARN_ON(max != sop->pool.nr_objs);
+ refcount_set(&sop->refs, max + 1);
+
+ return sop;
+}
+
+static struct ot_context *ot_init_async_m1(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ unsigned long size;
+ int szobj, rc, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ szobj = ALIGN(ctrl->objsz, sizeof(void *));
+ size = szobj * max;
+ sop->ptr = ot_vmalloc(size);
+ sop->size = size;
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ memset(sop->ptr, 0, size);
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ ctrl->objsz, sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+
+ /* calculate total number of objects stored in ptr */
+ WARN_ON(size / szobj != sop->pool.nr_objs);
+ refcount_set(&sop->refs, size / szobj + 1);
+
+ return sop;
+}
+
+static struct ot_context *ot_init_async_m2(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ int rc, i, nobjs = 0, max = num_possible_cpus() << 3;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ /* create and initialize objpool as empty (no objects) */
+ rc = objpool_init(&sop->pool, 0, max, 0, GFP_KERNEL, sop, NULL,
+ ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < max; i++) {
+ on = ot_kzalloc(ctrl->objsz);
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ nobjs++;
+ }
+ }
+ WARN_ON(nobjs != sop->pool.nr_objs);
+ refcount_set(&sop->refs, nobjs + 1);
+
+ return sop;
+}
+
+/* objpool manipulation for synchronous mode 3 (mixed mode) */
+static struct ot_context *ot_init_async_m3(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop = NULL;
+ struct ot_node *on;
+ unsigned long size;
+ int szobj, nobjs, rc, i, max = num_possible_cpus() << 4;
+
+ sop = (struct ot_context *)ot_kzalloc(sizeof(*sop));
+ if (!sop)
+ return NULL;
+ sop->ctrl = ctrl;
+
+ /* create and initialize objpool as empty (no objects) */
+ nobjs = num_possible_cpus() * 2;
+ rc = objpool_init(&sop->pool, nobjs, max, ctrl->objsz, GFP_KERNEL,
+ sop, ot_init_node, ot_objpool_release);
+ if (rc) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+
+ szobj = ALIGN(ctrl->objsz, sizeof(void *));
+ size = szobj * num_possible_cpus() * 4;
+ sop->ptr = ot_vmalloc(size);
+ if (!sop->ptr) {
+ ot_kfree(sop, sizeof(*sop));
+ return NULL;
+ }
+ sop->size = size;
+ memset(sop->ptr, 0, size);
+
+ /* populate given buffer to objpool */
+ rc = objpool_populate(&sop->pool, sop->ptr, size,
+ ctrl->objsz, sop, ot_init_node);
+ if (rc) {
+ objpool_fini(&sop->pool);
+ ot_vfree(sop->ptr, size);
+ return NULL;
+ }
+
+ /* calculate total number of objects stored in ptr */
+ nobjs += size / szobj;
+
+ /* allocate private objects and insert to objpool */
+ for (i = 0; i < num_possible_cpus() * 2; i++) {
+ on = ot_kzalloc(ctrl->objsz);
+ if (on) {
+ ot_init_node(sop, on);
+ on->data = 0xDEADBEEF;
+ objpool_add(on, &sop->pool);
+ nobjs++;
+ }
+ }
+ WARN_ON(nobjs != sop->pool.nr_objs);
+ refcount_set(&sop->refs, nobjs + 1);
+
+ return sop;
+}
+
+struct {
+ struct ot_context * (*init)(struct ot_ctrl *);
+ void (*fini)(struct ot_context *sop);
+} g_ot_async_ops[4] = {
+ {.init = ot_init_async_m0, .fini = ot_fini_async},
+ {.init = ot_init_async_m1, .fini = ot_fini_async},
+ {.init = ot_init_async_m2, .fini = ot_fini_async},
+ {.init = ot_init_async_m3, .fini = ot_fini_async},
+};
+
+static void ot_nod_recycle(struct ot_node *on, struct objpool_head *pool,
+ int release)
+{
+ struct ot_context *sop;
+
+ on->refs++;
+
+ if (!release) {
+ /* push object back to objpool for reuse */
+ objpool_push(on, pool);
+ return;
+ }
+
+ sop = container_of(pool, struct ot_context, pool);
+ WARN_ON(sop != pool->context);
+
+ if (objpool_is_inslot(on, pool)) {
+ /* object is allocated from percpu slots */
+ } else if (objpool_is_inpool(on, pool)) {
+ /* object is allocated from the user-managed pool */
+ } else {
+ /* private object managed by user */
+ WARN_ON(on->data != 0xDEADBEEF);
+ ot_kfree(on, sop->ctrl->objsz);
+ }
+
+ /* unref objpool with nod removed forever */
+ if (refcount_dec_and_test(&sop->refs))
+ objpool_fini(pool);
+}
+
+static void ot_bulk_async(struct ot_item *item, int irq)
+{
+ struct ot_node *nods[OT_NR_MAX_BULK];
+ int i, stop;
+
+ for (i = 0; i < item->bulk[irq]; i++)
+ nods[i] = objpool_pop(item->pool);
+
+ if (!irq) {
+ if (item->delay || !(++(item->niters) & 0x7FFF))
+ msleep(item->delay);
+ get_cpu();
+ }
+
+ stop = atomic_read_acquire(&g_ot_data.stop);
+
+ /* drop all objects and deref objpool */
+ while (i-- > 0) {
+ struct ot_node *on = nods[i];
+
+ if (on) {
+ on->refs++;
+ ot_nod_recycle(on, item->pool, stop);
+ item->stat[irq].nhits++;
+ } else {
+ item->stat[irq].nmiss++;
+ }
+ }
+
+ if (!irq)
+ put_cpu();
+}
+
+static int ot_start_async(struct ot_ctrl *ctrl)
+{
+ struct ot_context *sop;
+ ktime_t start;
+ u64 duration;
+ unsigned long timeout;
+ int cpu, rc;
+
+ /* initialize objpool for asynchronous testcase */
+ sop = g_ot_async_ops[ctrl->mode].init(ctrl);
+ if (!sop)
+ return -ENOMEM;
+
+ /* grab rwsem to block testing threads */
+ down_write(&g_ot_data.start);
+
+ for_each_possible_cpu(cpu) {
+ struct ot_item *item = per_cpu_ptr(&ot_pcup_items, cpu);
+ struct task_struct *work;
+
+ ot_init_cpu_item(item, ctrl, &sop->pool, ot_bulk_async);
+
+ /* skip offline cpus */
+ if (!cpu_online(cpu))
+ continue;
+
+ work = kthread_create_on_node(ot_thread_worker, item,
+ cpu_to_node(cpu), "ot_worker_%d", cpu);
+ if (IS_ERR(work)) {
+ pr_err("failed to create thread for cpu %d\n", cpu);
+ } else {
+ kthread_bind(work, cpu);
+ wake_up_process(work);
+ }
+ }
+
+ /* wait a while to make sure all threads are waiting at the start line */
+ msleep(20);
+
+ /* in case no threads were created: memory insufficient ? */
+ if (atomic_dec_and_test(&g_ot_data.nthreads))
+ complete(&g_ot_data.wait);
+
+ /* start objpool testing threads */
+ start = ktime_get();
+ up_write(&g_ot_data.start);
+
+ /* yield cpu to worker threads for duration ms */
+ timeout = msecs_to_jiffies(ctrl->duration);
+ rc = schedule_timeout_interruptible(timeout);
+
+ /* tell worker threads to quit */
+ atomic_set_release(&g_ot_data.stop, 1);
+
+ /* do async-finalization */
+ g_ot_async_ops[ctrl->mode].fini(sop);
+
+ /* wait for all worker threads to finish and quit */
+ wait_for_completion(&g_ot_data.wait);
+ duration = (u64) ktime_us_delta(ktime_get(), start);
+
+ /* assure rcu callback is triggered */
+ wait_for_completion(&g_ot_data.rcu);
+
+ /*
+ * now we are sure that objpool is finalized either
+ * by rcu callback or by worker threads
+ */
+
+ /* report testing summary and performance results */
+ ot_perf_report(ctrl, duration);
+
+ /* report memory allocation summary */
+ ot_mem_report(ctrl);
+
+ return rc;
+}
+
+/*
+ * predefined testing cases:
+ * 4 synchronous cases / 4 overrun cases / 2 async cases
+ *
+ * mode: unsigned int, could be 0/1/2/3, see name
+ * duration: unsigned int, total test time in ms
+ * delay: unsigned int, delay (in ms) between each iteration
+ * bulk_normal: unsigned int, repeat times for thread worker
+ * bulk_irq: unsigned int, repeat times for irq consumer
+ * hrtimer: unsigned long, hrtimer interval in ms
+ * name: char *, tag for current test ot_item
+ */
+
+#define NODE_COMPACT sizeof(struct ot_node)
+#define NODE_VMALLOC (512)
+
+struct ot_ctrl g_ot_sync[] = {
+ {0, NODE_COMPACT, 1000, 0, 1, 0, 0, "sync: percpu objpool"},
+ {0, NODE_VMALLOC, 1000, 0, 1, 0, 0, "sync: percpu objpool from vmalloc"},
+ {1, NODE_COMPACT, 1000, 0, 1, 0, 0, "sync: user objpool"},
+ {2, NODE_COMPACT, 1000, 0, 1, 0, 0, "sync: user objects"},
+ {3, NODE_COMPACT, 1000, 0, 1, 0, 0, "sync: mixed pools & objs"},
+ {3, NODE_VMALLOC, 1000, 0, 1, 0, 0, "sync: mixed pools & objs (vmalloc)"},
+};
+
+struct ot_ctrl g_ot_miss[] = {
+ {0, NODE_COMPACT, 1000, 0, 16, 0, 0, "sync overrun: percpu objpool"},
+ {0, NODE_VMALLOC, 1000, 0, 16, 0, 0, "sync overrun: percpu objpool from vmalloc"},
+ {1, NODE_COMPACT, 1000, 0, 16, 0, 0, "sync overrun: user objpool"},
+ {2, NODE_COMPACT, 1000, 0, 16, 0, 0, "sync overrun: user objects"},
+ {3, NODE_COMPACT, 1000, 0, 16, 0, 0, "sync overrun: mixed pools & objs"},
+ {3, NODE_VMALLOC, 1000, 0, 16, 0, 0, "sync overrun: mixed pools & objs (vmalloc)"},
+};
+
+struct ot_ctrl g_ot_async[] = {
+ {0, NODE_COMPACT, 1000, 4, 8, 8, 6, "async: percpu objpool"},
+ {0, NODE_VMALLOC, 1000, 4, 8, 8, 6, "async: percpu objpool from vmalloc"},
+ {1, NODE_COMPACT, 1000, 4, 8, 8, 6, "async: user objpool"},
+ {2, NODE_COMPACT, 1000, 4, 8, 8, 6, "async: user objects"},
+ {3, NODE_COMPACT, 1000, 4, 8, 8, 6, "async: mixed pools & objs"},
+ {3, NODE_VMALLOC, 1000, 4, 8, 8, 6, "async: mixed pools & objs (vmalloc)"},
+};
+
+static int __init ot_mod_init(void)
+{
+ int i;
+
+ ot_init_data(&g_ot_data);
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_sync); i++) {
+ if (ot_start_sync(&g_ot_sync[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_miss); i++) {
+ if (ot_start_sync(&g_ot_miss[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(g_ot_async); i++) {
+ if (ot_start_async(&g_ot_async[i]))
+ goto out;
+ ot_reset_data(&g_ot_data);
+ }
+
+out:
+ return -EAGAIN;
+}
+
+static void __exit ot_mod_exit(void)
+{
+}
+
+module_init(ot_mod_init);
+module_exit(ot_mod_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Matt Wu");
--
2.34.1


2022-11-08 07:36:45

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v6 1/4] lib: objpool added: ring-array based lockless MPMC queue

The object pool is a scalable implementation of a high-performance
queue for object allocation and reclamation, such as kretprobe
instances.

By leveraging a per-cpu ring-array to mitigate the hot spots of memory
contention, it delivers near-linear scalability for highly parallel
scenarios. The ring-array is compactly managed in a single cache-line
to benefit from a warmed L1 cache for most cases (<= 4 objects per-core).
The body of pre-allocated objects is stored in contiguous cache-lines
just after the ring-array.

The object pool is interrupt safe. Both allocation and reclamation
(object pop and push operations) can be preempted or interrupted.

It's best suited for the following cases:
1) Memory allocation or reclamation is prohibited or too expensive
2) Consumers are of different priorities, such as irqs and threads

Limitations:
1) The maximum number of objects (capacity) is fixed at pool initialization
2) The memory of objects won't be freed until the pool is finalized
3) Object allocation (pop) may fail after trying all cpu slots
4) Object reclamation (push) won't fail but may take a long time to
finish in imbalanced scenarios. A larger max_entries mitigates this,
and max_entries >= CPUS * nr_objs avoids it entirely
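
For reference, here is a minimal usage sketch of the API added by this
patch (illustrative only and not part of the patch itself; the demo_*
names are placeholders and error handling is kept to a bare minimum):

  #include <linux/objpool.h>
  #include <linux/slab.h>

  struct demo_obj {
          unsigned long data;
  };

  /* one-time setup callback for each pre-allocated object */
  static int demo_init_obj(void *context, void *obj)
  {
          struct demo_obj *o = obj;

          o->data = 0;
          return 0;
  }

  static int demo_objpool_usage(void)
  {
          struct objpool_head oh;
          struct demo_obj *obj;
          int rc;

          /* pre-allocate 16 objects, with max_objs also set to 16 */
          rc = objpool_init(&oh, 16, 16, sizeof(struct demo_obj),
                            GFP_KERNEL, NULL, demo_init_obj, NULL);
          if (rc)
                  return rc;

          obj = objpool_pop(&oh);          /* NULL if the pool is empty */
          if (obj) {
                  obj->data++;
                  objpool_push(obj, &oh);  /* give the object back for reuse */
          }

          /* release percpu slots and all pre-allocated objects */
          objpool_fini(&oh);
          return 0;
  }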

Signed-off-by: wuqiang <[email protected]>
---
include/linux/objpool.h | 153 +++++++++++++
lib/Makefile | 2 +-
lib/objpool.c | 487 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 641 insertions(+), 1 deletion(-)
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..7899b054b50c
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,153 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/types.h>
+
+/*
+ * objpool: ring-array based lockless MPMC queue
+ *
+ * Copyright: [email protected]
+ *
+ * The object pool is a scalable implementation of a high-performance
+ * queue for object allocation and reclamation, such as kretprobe instances.
+ *
+ * By leveraging a per-cpu ring-array to mitigate the hot spots of memory
+ * contention, it delivers near-linear scalability for highly parallel
+ * scenarios. The ring-array is compactly managed in a single cache-line
+ * to benefit from a warmed L1 cache for most cases (<= 4 objects per-core).
+ * The body of pre-allocated objects is stored in contiguous cache-lines
+ * just after the ring-array.
+ *
+ * The object pool is interrupt safe. Both allocation and reclamation
+ * (object pop and push operations) can be preempted or interrupted.
+ *
+ * It's best suited for the following cases:
+ * 1) Memory allocation or reclamation is prohibited or too expensive
+ * 2) Consumers are of different priorities, such as irqs and threads
+ *
+ * Limitations:
+ * 1) The maximum number of objects (capacity) is fixed at initialization
+ * 2) The memory of objects won't be freed until the pool is finalized
+ * 3) Object allocation (pop) may fail after trying all cpu slots
+ * 4) Object reclamation (push) won't fail but may take a long time to
+ *    finish in imbalanced scenarios. A larger max_entries mitigates
+ *    this, and max_entries >= CPUS * nr_objs avoids it entirely
+ */
+
+/*
+ * objpool_slot: per-cpu ring array
+ *
+ * Represents a cpu-local array-based ring buffer; its size is specified
+ * during initialization of the object pool.
+ *
+ * The objpool_slot is allocated from local memory on NUMA systems, and is
+ * kept compact in a single cacheline. ages[] is stored just after the
+ * body of objpool_slot, followed by entries[]. The ages[] array records
+ * the revision of each item, solely to avoid ABA, and the entries[] array
+ * contains the pointers of objects.
+ *
+ * The default size of objpool_slot is a single cache-line, i.e. 64 bytes.
+ *
+ * 64bit:
+ * 4 8 12 16 32 64
+ * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) | objects
+ *
+ * 32bit:
+ * 4 8 12 16 32 48 64
+ * | head | tail | size | mask | ages[4] | ents[4] | unused | objects
+ *
+ */
+
+struct objpool_slot {
+ uint32_t head; /* head of ring array */
+ uint32_t tail; /* tail of ring array */
+ uint32_t size; /* array size, pow of 2 */
+ uint32_t mask; /* size - 1 */
+} __attribute__((packed));
+
+/* caller-specified object initial callback to setup each object, only called once */
+typedef int (*objpool_init_obj_cb)(void *context, void *obj);
+
+/* caller-specified cleanup callback for private objects/pool/context */
+typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
+
+/* called for object releasing: ptr points to an object */
+#define OBJPOOL_FLAG_NODE (0x00000001)
+/* for user pool and context releasing, ptr could be NULL */
+#define OBJPOOL_FLAG_POOL (0x00001000)
+/* the object or pool to be released is user-managed */
+#define OBJPOOL_FLAG_USER (0x00008000)
+
+/*
+ * objpool_head: object pooling metadata
+ */
+
+struct objpool_head {
+ unsigned int obj_size; /* object & element size */
+ unsigned int nr_objs; /* total objs (to be pre-allocated) */
+ unsigned int nr_cpus; /* num of possible cpus */
+ unsigned int capacity; /* max objects per cpuslot */
+ unsigned long flags; /* flags for objpool management */
+ gfp_t gfp; /* gfp flags for kmalloc & vmalloc */
+ unsigned int pool_size; /* user pool size in bytes */
+ void *pool; /* user managed memory pool */
+ struct objpool_slot **cpu_slots; /* array of percpu slots */
+ unsigned int *slot_sizes; /* size in bytes of slots */
+ objpool_release_cb release; /* resource cleanup callback */
+ void *context; /* caller-provided context */
+};
+
+#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
+#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *head, unsigned int nr_objs,
+ unsigned int max_objs, unsigned int object_size,
+ gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+ objpool_release_cb release);
+
+/* add objects in batch from user provided pool */
+int objpool_populate(struct objpool_head *head, void *pool,
+ unsigned int size, unsigned int object_size,
+ void *context, objpool_init_obj_cb objinit);
+
+/* add pre-allocated object (managed by user) to objpool */
+int objpool_add(void *obj, struct objpool_head *head);
+
+/* allocate an object from objects pool */
+void *objpool_pop(struct objpool_head *head);
+
+/* reclaim an object to objects pool */
+int objpool_push(void *node, struct objpool_head *head);
+
+/* cleanup the whole object pool (objects including) */
+void objpool_fini(struct objpool_head *head);
+
+/* whether the object is pre-allocated with percpu slots */
+static inline int objpool_is_inslot(void *obj, struct objpool_head *head)
+{
+ void *slot;
+ int i;
+
+ if (!obj || !(head->flags & OBJPOOL_HAVE_OBJECTS))
+ return 0;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ slot = head->cpu_slots[i];
+ if (obj >= slot && obj < slot + head->slot_sizes[i])
+ return 1;
+ }
+
+ return 0;
+}
+
+/* whether the object is from user pool (batched adding) */
+static inline int objpool_is_inpool(void *obj, struct objpool_head *head)
+{
+ return (obj && head->pool && obj >= head->pool &&
+ obj < head->pool + head->pool_size);
+}
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/Makefile b/lib/Makefile
index 161d6a724ff7..e938703a321f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
- buildid.o
+ buildid.o objpool.o

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..ecffa0795f3d
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,487 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/prefetch.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ */
+
+/* compute the suitable num of objects to be managed by slot */
+static inline unsigned int __objpool_num_of_objs(unsigned int size)
+{
+ return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
+ (sizeof(uint32_t) + sizeof(void *)));
+}
+
+#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
+#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
+ sizeof(uint32_t) * (s)->size))
+#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
+ (sizeof(uint32_t) + sizeof(void *)) * (s)->size))
+
+/* allocate and initialize percpu slots */
+static inline int
+__objpool_init_percpu_slots(struct objpool_head *head, unsigned int nobjs,
+ void *context, objpool_init_obj_cb objinit)
+{
+ unsigned int i, j, n, size, objsz, nents = head->capacity;
+
+ /* align object size by sizeof(void *) */
+ objsz = ALIGN(head->obj_size, sizeof(void *));
+ /* shall we allocate objects along with objpool_slot */
+ if (objsz)
+ head->flags |= OBJPOOL_HAVE_OBJECTS;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ struct objpool_slot *os;
+
+ /* compute how many objects to be managed by this slot */
+ n = nobjs / head->nr_cpus;
+ if (i < (nobjs % head->nr_cpus))
+ n++;
+ size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
+ sizeof(uint32_t) * nents + objsz * n;
+
+ /* decide memory area for cpu-slot allocation */
+ if (!i && !(head->gfp & GFP_ATOMIC) && size > PAGE_SIZE / 2)
+ head->flags |= OBJPOOL_FROM_VMALLOC;
+
+ /* allocate percpu slot & objects from local memory */
+ if (head->flags & OBJPOOL_FROM_VMALLOC)
+ os = __vmalloc_node(size, sizeof(void *), head->gfp,
+ cpu_to_node(i), __builtin_return_address(0));
+ else
+ os = kmalloc_node(size, head->gfp, cpu_to_node(i));
+ if (!os)
+ return -ENOMEM;
+
+ /* initialize percpu slot for the i-th cpu */
+ memset(os, 0, size);
+ os->size = head->capacity;
+ os->mask = os->size - 1;
+ head->cpu_slots[i] = os;
+ head->slot_sizes[i] = size;
+
+ /*
+ * start from 2nd round to avoid conflict of 1st item.
+ * we assume that the head item is ready for retrieval
+ * iff head is equal to ages[head & mask]. but ages is
+ * initialized as 0, so in view of the caller of pop(),
+ * the 1st item (0th) is always ready; but in fact, push()
+ * could be stalled before the final update, and thus the
+ * item being inserted would be lost forever.
+ */
+ os->head = os->tail = head->capacity;
+
+ if (!objsz)
+ continue;
+
+ for (j = 0; j < n; j++) {
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ void *obj = SLOT_OBJS(os) + j * objsz;
+ uint32_t ie = os->tail & os->mask;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* add obj into the ring array */
+ ents[ie] = obj;
+ ages[ie] = os->tail;
+ os->tail++;
+ head->nr_objs++;
+ }
+ }
+
+ return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static inline void __objpool_fini_percpu_slots(struct objpool_head *head)
+{
+ unsigned int i;
+
+ if (!head->cpu_slots)
+ return;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ if (!head->cpu_slots[i])
+ continue;
+ if (head->flags & OBJPOOL_FROM_VMALLOC)
+ vfree(head->cpu_slots[i]);
+ else
+ kfree(head->cpu_slots[i]);
+ }
+ kfree(head->cpu_slots);
+ head->cpu_slots = NULL;
+ head->slot_sizes = NULL;
+}
+
+/**
+ * objpool_init: initialize object pool and pre-allocate objects
+ *
+ * args:
+ * @head: the object pool to be initialized, declared by caller
+ * @nr_objs: total objects to be pre-allocated by this object pool
+ * @max_objs: max entries (object pool capacity), use nr_objs if 0
+ * @object_size: size of an object, no objects pre-allocated if 0
+ * @gfp: flags for memory allocation (via kmalloc or vmalloc)
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setting-up
+ * @release: cleanup callback for private objects/pool/context
+ *
+ * return:
+ * 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are to be zeroed. Caller could do extra
+ * initialization in objinit callback. The objinit callback will be
+ * called once and only once after the slot allocation; objpool won't
+ * touch any content of the objects after that. It's the caller's
+ * duty to perform reinitialization after object allocation (pop) or
+ * clearance before object reclamation (push) if required.
+ */
+int objpool_init(struct objpool_head *head, unsigned int nr_objs,
+ unsigned int max_objs, unsigned int object_size,
+ gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+ objpool_release_cb release)
+{
+ unsigned int nents, ncpus = num_possible_cpus();
+ int rc;
+
+ /* calculate percpu slot size (rounded to pow of 2) */
+ if (max_objs < nr_objs)
+ max_objs = nr_objs;
+ nents = max_objs / ncpus;
+ if (nents < __objpool_num_of_objs(L1_CACHE_BYTES))
+ nents = __objpool_num_of_objs(L1_CACHE_BYTES);
+ nents = roundup_pow_of_two(nents);
+ while (nents * ncpus < nr_objs)
+ nents = nents << 1;
+
+ memset(head, 0, sizeof(struct objpool_head));
+ head->nr_cpus = ncpus;
+ head->obj_size = object_size;
+ head->capacity = nents;
+ head->gfp = gfp & ~__GFP_ZERO;
+ head->context = context;
+ head->release = release;
+
+ /* allocate array for percpu slots */
+ head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
+ head->nr_cpus * sizeof(uint32_t), head->gfp);
+ if (!head->cpu_slots)
+ return -ENOMEM;
+ head->slot_sizes = (uint32_t *)&head->cpu_slots[head->nr_cpus];
+
+ /* initialize per-cpu slots */
+ rc = __objpool_init_percpu_slots(head, nr_objs, context, objinit);
+ if (rc)
+ __objpool_fini_percpu_slots(head);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot tail, the given slot must NOT be full */
+static inline int __objpool_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t tail = atomic_inc_return((atomic_t *)&os->tail) - 1;
+
+ WRITE_ONCE(ents[tail & os->mask], obj);
+
+ /* order matters: obj must be updated before tail updating */
+ smp_store_release(&ages[tail & os->mask], tail);
+ return 0;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int __objpool_try_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t head, tail;
+
+ do {
+ /* perform memory loading for both head and tail */
+ head = READ_ONCE(os->head);
+ tail = READ_ONCE(os->tail);
+ /* just abort if slot is full */
+ if (tail >= head + os->size)
+ return -ENOENT;
+ /* try to extend tail by 1 using CAS to avoid races */
+ if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
+ break;
+ } while (1);
+
+ /* the tail-th of slot is reserved for the given obj */
+ WRITE_ONCE(ents[tail & os->mask], obj);
+ /* update epoch id to make this object available for pop() */
+ smp_store_release(&ages[tail & os->mask], tail);
+ return 0;
+}
+
+/**
+ * objpool_populate: add objects from user provided pool in batch
+ *
+ * args:
+ * @head: object pool
+ * @pool: user buffer for pre-allocated objects
+ * @size: size of user buffer
+ * @object_size: size of object & element
+ * @context: user context for objinit callback
+ * @objinit: object initialization callback
+ *
+ * return: 0 or error code
+ */
+int objpool_populate(struct objpool_head *head, void *pool,
+ unsigned int size, unsigned int object_size,
+ void *context, objpool_init_obj_cb objinit)
+{
+ unsigned int n = head->nr_objs, used = 0, i;
+
+ if (head->pool || !pool || size < object_size)
+ return -EINVAL;
+ if (head->obj_size && head->obj_size != object_size)
+ return -EINVAL;
+ if (head->context && context && head->context != context)
+ return -EINVAL;
+ if (head->nr_objs >= head->nr_cpus * head->capacity)
+ return -ENOENT;
+
+ WARN_ON_ONCE(((unsigned long)pool) & (sizeof(void *) - 1));
+ WARN_ON_ONCE(((uint32_t)object_size) & (sizeof(void *) - 1));
+
+ /* align object size by sizeof(void *) */
+ head->obj_size = object_size;
+ object_size = ALIGN(object_size, sizeof(void *));
+ if (object_size == 0)
+ return -EINVAL;
+
+ while (used + object_size <= size) {
+ void *obj = pool + used;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(context, obj);
+ if (rc)
+ return rc;
+ }
+
+ /* insert obj to its corresponding objpool slot */
+ i = (n + used * head->nr_cpus/size) % head->nr_cpus;
+ if (!__objpool_try_add_slot(obj, head->cpu_slots[i]))
+ head->nr_objs++;
+
+ used += object_size;
+ }
+
+ if (!used)
+ return -ENOENT;
+
+ head->context = context;
+ head->pool = pool;
+ head->pool_size = size;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(objpool_populate);
+
+/**
+ * objpool_add: add pre-allocated object to objpool during pool
+ * initialization
+ *
+ * args:
+ * @obj: object pointer to be added to objpool
+ * @head: object pool to be inserted into
+ *
+ * return:
+ * 0 or error code
+ *
+ * objpool_add doesn't handle race conditions, and can only be
+ * called during objpool initialization
+ */
+int objpool_add(void *obj, struct objpool_head *head)
+{
+ unsigned int i, cpu;
+
+ if (!obj)
+ return -EINVAL;
+ if (head->nr_objs >= head->nr_cpus * head->capacity)
+ return -ENOENT;
+
+ cpu = head->nr_objs % head->nr_cpus;
+ for (i = 0; i < head->nr_cpus; i++) {
+ if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu])) {
+ head->nr_objs++;
+ return 0;
+ }
+
+ if (++cpu >= head->nr_cpus)
+ cpu = 0;
+ }
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_add);
+
+/**
+ * objpool_push: reclaim the object and return back to objects pool
+ *
+ * args:
+ * @obj: object pointer to be pushed to object pool
+ * @head: object pool
+ *
+ * return:
+ * 0 or error code: it fails only when the object pool is full
+ *
+ * objpool_push is non-blocking, and can be nested
+ */
+int objpool_push(void *obj, struct objpool_head *head)
+{
+ unsigned int cpu = raw_smp_processor_id() % head->nr_cpus;
+
+ do {
+ if (head->nr_objs > head->capacity) {
+ if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu]))
+ return 0;
+ } else {
+ if (!__objpool_add_slot(obj, head->cpu_slots[cpu]))
+ return 0;
+ }
+ if (++cpu >= head->nr_cpus)
+ cpu = 0;
+ } while (1);
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ /* do memory load of head to local head */
+ uint32_t head = smp_load_acquire(&os->head);
+
+ /* loop if slot isn't empty */
+ while (head != READ_ONCE(os->tail)) {
+ uint32_t id = head & os->mask, prev = head;
+
+ /* do prefetching of object ents */
+ prefetch(&ents[id]);
+
+ /*
+ * check whether this item is ready for retrieval. In theory there's
+ * a possibility we might retrieve the wrong object, in case ages[id]
+ * overflows while the current task is sleeping, but it would take a
+ * very long time to overflow a uint32_t
+ */
+ if (smp_load_acquire(&ages[id]) == head) {
+ /* node must have been updated by push() */
+ void *node = READ_ONCE(ents[id]);
+ /* commit and move forward head of the slot */
+ if (try_cmpxchg_release(&os->head, &head, head + 1))
+ return node;
+ }
+
+ /* re-load head from memory and continue trying */
+ head = READ_ONCE(os->head);
+ /*
+ * head stays unchanged, so it's very likely current pop()
+ * just preempted/interrupted an ongoing push() operation
+ */
+ if (head == prev)
+ break;
+ }
+
+ return NULL;
+}
+
+/**
+ * objpool_pop: allocate an object from objects pool
+ *
+ * args:
+ * @head: object pool
+ *
+ * return:
+ * object: NULL if failed (object pool is empty)
+ *
+ * objpool_pop can be nested, so can be used in any context.
+ */
+void *objpool_pop(struct objpool_head *head)
+{
+ unsigned int i, cpu;
+ void *obj = NULL;
+
+ cpu = raw_smp_processor_id() % head->nr_cpus;
+ for (i = 0; i < head->nr_cpus; i++) {
+ struct objpool_slot *slot = head->cpu_slots[cpu];
+ obj = __objpool_try_get_slot(slot);
+ if (obj)
+ break;
+ if (++cpu >= head->nr_cpus)
+ cpu = 0;
+ }
+
+ return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/**
+ * objpool_fini: cleanup the whole object pool (releasing all objects)
+ *
+ * args:
+ * @head: object pool to be released
+ *
+ */
+void objpool_fini(struct objpool_head *head)
+{
+ uint32_t i, flags;
+
+ if (!head->cpu_slots)
+ return;
+
+ if (!head->release) {
+ __objpool_fini_percpu_slots(head);
+ return;
+ }
+
+ /* cleanup all objects remaining in objpool */
+ for (i = 0; i < head->nr_cpus; i++) {
+ void *obj;
+ do {
+ flags = OBJPOOL_FLAG_NODE;
+ obj = __objpool_try_get_slot(head->cpu_slots[i]);
+ if (!obj)
+ break;
+ if (!objpool_is_inpool(obj, head) &&
+ !objpool_is_inslot(obj, head)) {
+ flags |= OBJPOOL_FLAG_USER;
+ }
+ head->release(head->context, obj, flags);
+ } while (obj);
+ }
+
+ /* release percpu slots */
+ __objpool_fini_percpu_slots(head);
+
+ /* cleanup user private pool and related context */
+ flags = OBJPOOL_FLAG_POOL;
+ if (head->pool)
+ flags |= OBJPOOL_FLAG_USER;
+ head->release(head->context, head->pool, flags);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
--
2.34.1


2022-11-08 08:27:01

by wuqiang.matt

[permalink] [raw]
Subject: [PATCH v6 4/4] kprobes: freelist.h removed

This patch removes freelist.h from the kernel source tree, since its
only users (kretprobe and rethook) have been converted to objpool.

Signed-off-by: wuqiang <[email protected]>
---
include/linux/freelist.h | 129 ---------------------------------------
1 file changed, 129 deletions(-)
delete mode 100644 include/linux/freelist.h

diff --git a/include/linux/freelist.h b/include/linux/freelist.h
deleted file mode 100644
index fc1842b96469..000000000000
--- a/include/linux/freelist.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
-#ifndef FREELIST_H
-#define FREELIST_H
-
-#include <linux/atomic.h>
-
-/*
- * Copyright: [email protected]
- *
- * A simple CAS-based lock-free free list. Not the fastest thing in the world
- * under heavy contention, but simple and correct (assuming nodes are never
- * freed until after the free list is destroyed), and fairly speedy under low
- * contention.
- *
- * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
- */
-
-struct freelist_node {
- atomic_t refs;
- struct freelist_node *next;
-};
-
-struct freelist_head {
- struct freelist_node *head;
-};
-
-#define REFS_ON_FREELIST 0x80000000
-#define REFS_MASK 0x7FFFFFFF
-
-static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * Since the refcount is zero, and nobody can increase it once it's
- * zero (except us, and we run only one copy of this method per node at
- * a time, i.e. the single thread case), then we know we can safely
- * change the next pointer of the node; however, once the refcount is
- * back above zero, then other threads could increase it (happens under
- * heavy contention, when the refcount goes to zero in between a load
- * and a refcount increment of a node in try_get, then back up to
- * something non-zero, then the refcount increment is done by the other
- * thread) -- so if the CAS to add the node to the actual list fails,
- * decrese the refcount and leave the add operation to the next thread
- * who puts the refcount back to zero (which could be us, hence the
- * loop).
- */
- struct freelist_node *head = READ_ONCE(list->head);
-
- for (;;) {
- WRITE_ONCE(node->next, head);
- atomic_set_release(&node->refs, 1);
-
- if (!try_cmpxchg_release(&list->head, &head, node)) {
- /*
- * Hmm, the add failed, but we can only try again when
- * the refcount goes back to zero.
- */
- if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
- continue;
- }
- return;
- }
-}
-
-static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * We know that the should-be-on-freelist bit is 0 at this point, so
- * it's safe to set it using a fetch_add.
- */
- if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
- /*
- * Oh look! We were the last ones referencing this node, and we
- * know we want to add it to the free list, so let's do it!
- */
- __freelist_add(node, list);
- }
-}
-
-static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
-{
- struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
- unsigned int refs;
-
- while (head) {
- prev = head;
- refs = atomic_read(&head->refs);
- if ((refs & REFS_MASK) == 0 ||
- !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
- head = smp_load_acquire(&list->head);
- continue;
- }
-
- /*
- * Good, reference count has been incremented (it wasn't at
- * zero), which means we can read the next and not worry about
- * it changing between now and the time we do the CAS.
- */
- next = READ_ONCE(head->next);
- if (try_cmpxchg_acquire(&list->head, &head, next)) {
- /*
- * Yay, got the node. This means it was on the list,
- * which means should-be-on-freelist must be false no
- * matter the refcount (because nobody else knows it's
- * been taken off yet, it can't have been put back on).
- */
- WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
-
- /*
- * Decrease refcount twice, once for our ref, and once
- * for the list's ref.
- */
- atomic_fetch_add(-2, &head->refs);
-
- return head;
- }
-
- /*
- * OK, the head must have changed on us, but we still need to decrement
- * the refcount we increased.
- */
- refs = atomic_fetch_add(-1, &prev->refs);
- if (refs == REFS_ON_FREELIST + 1)
- __freelist_add(prev, list);
- }
-
- return NULL;
-}
-
-#endif /* FREELIST_H */
--
2.34.1


2022-11-14 16:22:48

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] lib: objpool added: ring-array based lockless MPMC queue

On Tue, 8 Nov 2022 15:14:40 +0800
wuqiang <[email protected]> wrote:

> The object pool is a scalable implementaion of high performance queue
> for objects allocation and reclamation, such as kretprobe instances.
>
> With leveraging per-cpu ring-array to mitigate the hot spots of memory
> contention, it could deliver near-linear scalability for high parallel
> scenarios. The ring-array is compactly managed in a single cache-line
> to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
> The body of pre-allocated objects is stored in continuous cache-lines
> just after the ring-array.
>
> The object pool is interrupt safe. Both allocation and reclamation
> (object pop and push operations) can be preemptible or interruptable.
>
> It's best suited for following cases:
> 1) Memory allocation or reclamation are prohibited or too expensive
> 2) Consumers are of different priorities, such as irqs and threads
>
> Limitations:
> 1) Maximum objects (capacity) is determined during pool initializing
> 2) The memory of objects won't be freed until the poll is finalized
> 3) Object allocation (pop) may fail after trying all cpu slots
> 4) Object reclamation (push) won't fail but may take long time to
> finish for imbalanced scenarios. You can try larger max_entries
> to mitigate, or ( >= CPUS * nr_objs) to avoid
>
> Signed-off-by: wuqiang <[email protected]>
> ---
> include/linux/objpool.h | 153 +++++++++++++
> lib/Makefile | 2 +-
> lib/objpool.c | 487 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 641 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/objpool.h
> create mode 100644 lib/objpool.c
>
> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> new file mode 100644
> index 000000000000..7899b054b50c
> --- /dev/null
> +++ b/include/linux/objpool.h
> @@ -0,0 +1,153 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_OBJPOOL_H
> +#define _LINUX_OBJPOOL_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * objpool: ring-array based lockless MPMC queue
> + *
> + * Copyright: [email protected]
> + *
> + * The object pool is a scalable implementaion of high performance queue
> + * for objects allocation and reclamation, such as kretprobe instances.
> + *
> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
> + * contention, it could deliver near-linear scalability for high parallel
> + * scenarios. The ring-array is compactly managed in a single cache-line
> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
> + * The body of pre-allocated objects is stored in continuous cache-lines
> + * just after the ring-array.
> + *
> + * The object pool is interrupt safe. Both allocation and reclamation
> + * (object pop and push operations) can be preemptible or interruptable.
> + *
> + * It's best suited for following cases:
> + * 1) Memory allocation or reclamation are prohibited or too expensive
> + * 2) Consumers are of different priorities, such as irqs and threads
> + *
> + * Limitations:
> + * 1) Maximum objects (capacity) is determined during pool initializing
> + * 2) The memory of objects won't be freed until the poll is finalized
> + * 3) Object allocation (pop) may fail after trying all cpu slots
> + * 4) Object reclamation (push) won't fail but may take long time to
> + * finish for imbalanced scenarios. You can try larger max_entries
> + * to mitigate, or ( >= CPUS * nr_objs) to avoid
> + */
> +
> +/*
> + * objpool_slot: per-cpu ring array
> + *
> + * Represents a cpu-local array-based ring buffer, its size is specialized
> + * during initialization of object pool.
> + *
> + * The objpool_slot is allocated from local memory for NUMA system, and to
> + * be kept compact in a single cacheline. ages[] is stored just after the
> + * body of objpool_slot, and then entries[]. The Array of ages[] describes
> + * revision of each item, solely used to avoid ABA. And array of entries[]
> + * contains the pointers of objects.
> + *
> + * The default size of objpool_slot is a single cache-line, aka. 64 bytes.
> + *
> + * 64bit:
> + * 4 8 12 16 32 64
> + * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) | objects
> + *
> + * 32bit:
> + * 4 8 12 16 32 48 64
> + * | head | tail | size | mask | ages[4] | ents[4] | unused | objects
> + *
> + */
> +
> +struct objpool_slot {
> + uint32_t head; /* head of ring array */
> + uint32_t tail; /* tail of ring array */
> + uint32_t size; /* array size, pow of 2 */
> + uint32_t mask; /* size - 1 */
> +} __attribute__((packed));
> +
> +/* caller-specified object initial callback to setup each object, only called once */
> +typedef int (*objpool_init_obj_cb)(void *context, void *obj);

It seems a bit confusing that this "initialize object" callback
doesn't have @obj as the first argument.
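
For illustration, the suggested ordering would presumably read roughly
like this (a sketch of the suggestion only, not what the patch currently
declares):

	typedef int (*objpool_init_obj_cb)(void *obj, void *context);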

> +
> +/* caller-specified cleanup callback for private objects/pool/context */
> +typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);

Do you have any use-case for this release callback?
If not, I recommend deferring its implementation until an actual
use-case comes up.

> +
> +/* called for object releasing: ptr points to an object */
> +#define OBJPOOL_FLAG_NODE (0x00000001)
> +/* for user pool and context releasing, ptr could be NULL */
> +#define OBJPOOL_FLAG_POOL (0x00001000)
> +/* the object or pool to be released is user-managed */
> +#define OBJPOOL_FLAG_USER (0x00008000)

Ditto.

> +
> +/*
> + * objpool_head: object pooling metadata
> + */
> +
> +struct objpool_head {
> + unsigned int obj_size; /* object & element size */
> + unsigned int nr_objs; /* total objs (to be pre-allocated) */
> + unsigned int nr_cpus; /* num of possible cpus */
> + unsigned int capacity; /* max objects per cpuslot */
> + unsigned long flags; /* flags for objpool management */
> + gfp_t gfp; /* gfp flags for kmalloc & vmalloc */
> + unsigned int pool_size; /* user pool size in byes */
> + void *pool; /* user managed memory pool */
> + struct objpool_slot **cpu_slots; /* array of percpu slots */
> + unsigned int *slot_sizes; /* size in bytes of slots */
> + objpool_release_cb release; /* resource cleanup callback */
> + void *context; /* caller-provided context */
> +};
> +
> +#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
> +#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */

This is also not needed at this moment. Please start from a simple
design for review.

> +
> +/* initialize object pool and pre-allocate objects */
> +int objpool_init(struct objpool_head *head, unsigned int nr_objs,
> + unsigned int max_objs, unsigned int object_size,
> + gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> + objpool_release_cb release);
> +
> +/* add objects in batch from user provided pool */
> +int objpool_populate(struct objpool_head *head, void *pool,
> + unsigned int size, unsigned int object_size,
> + void *context, objpool_init_obj_cb objinit);
> +
> +/* add pre-allocated object (managed by user) to objpool */
> +int objpool_add(void *obj, struct objpool_head *head);
> +
> +/* allocate an object from objects pool */
> +void *objpool_pop(struct objpool_head *head);
> +
> +/* reclaim an object to objects pool */
> +int objpool_push(void *node, struct objpool_head *head);
> +
> +/* cleanup the whole object pool (objects including) */
> +void objpool_fini(struct objpool_head *head);
> +
> +/* whether the object is pre-allocated with percpu slots */
> +static inline int objpool_is_inslot(void *obj, struct objpool_head *head)
> +{
> + void *slot;
> + int i;
> +
> + if (!obj || !(head->flags & OBJPOOL_HAVE_OBJECTS))
> + return 0;
> +
> + for (i = 0; i < head->nr_cpus; i++) {
> + slot = head->cpu_slots[i];
> + if (obj >= slot && obj < slot + head->slot_sizes[i])
> + return 1;
> + }
> +
> + return 0;
> +}

Ditto.

It is too complicated to mix internally allocated objects and
external ones. This will expose the implementation of the
objpool (users must understand that they have to free objects
only outside of the slot).

You can add it afterwards if it is really needed :)

> +
> +/* whether the object is from user pool (batched adding) */
> +static inline int objpool_is_inpool(void *obj, struct objpool_head *head)
> +{
> + return (obj && head->pool && obj >= head->pool &&
> + obj < head->pool + head->pool_size);
> +}
> +
> +#endif /* _LINUX_OBJPOOL_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index 161d6a724ff7..e938703a321f 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
> nmi_backtrace.o win_minmax.o memcat_p.o \
> - buildid.o
> + buildid.o objpool.o
>
> lib-$(CONFIG_PRINTK) += dump_stack.o
> lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/objpool.c b/lib/objpool.c
> new file mode 100644
> index 000000000000..ecffa0795f3d
> --- /dev/null
> +++ b/lib/objpool.c
> @@ -0,0 +1,487 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/objpool.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +#include <linux/atomic.h>
> +#include <linux/prefetch.h>
> +
> +/*
> + * objpool: ring-array based lockless MPMC/FIFO queues
> + *
> + * Copyright: [email protected]
> + */
> +
> +/* compute the suitable num of objects to be managed by slot */
> +static inline unsigned int __objpool_num_of_objs(unsigned int size)
> +{
> + return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> + (sizeof(uint32_t) + sizeof(void *)));
> +}
> +
> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> + sizeof(uint32_t) * (s)->size))
> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> + (sizeof(uint32_t) + sizeof(void *)) * (s)->size))
> +
> +/* allocate and initialize percpu slots */
> +static inline int
> +__objpool_init_percpu_slots(struct objpool_head *head, unsigned int nobjs,
> + void *context, objpool_init_obj_cb objinit)
> +{
> + unsigned int i, j, n, size, objsz, nents = head->capacity;
> +
> + /* aligned object size by sizeof(void *) */
> + objsz = ALIGN(head->obj_size, sizeof(void *));
> + /* shall we allocate objects along with objpool_slot */
> + if (objsz)
> + head->flags |= OBJPOOL_HAVE_OBJECTS;
> +
> + for (i = 0; i < head->nr_cpus; i++) {
> + struct objpool_slot *os;
> +
> + /* compute how many objects to be managed by this slot */
> + n = nobjs / head->nr_cpus;
> + if (i < (nobjs % head->nr_cpus))
> + n++;
> + size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> + sizeof(uint32_t) * nents + objsz * n;
> +
> + /* decide memory area for cpu-slot allocation */
> + if (!i && !(head->gfp & GFP_ATOMIC) && size > PAGE_SIZE / 2)
> + head->flags |= OBJPOOL_FROM_VMALLOC;
> +
> + /* allocate percpu slot & objects from local memory */
> + if (head->flags & OBJPOOL_FROM_VMALLOC)
> + os = __vmalloc_node(size, sizeof(void *), head->gfp,
> + cpu_to_node(i), __builtin_return_address(0));
> + else
> + os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> + if (!os)
> + return -ENOMEM;
> +
> + /* initialize percpu slot for the i-th cpu */
> + memset(os, 0, size);
> + os->size = head->capacity;
> + os->mask = os->size - 1;
> + head->cpu_slots[i] = os;
> + head->slot_sizes[i] = size;
> +
> + /*
> + * start from 2nd round to avoid conflict of 1st item.
> + * we assume that the head item is ready for retrieval
> + * iff head is equal to ages[head & mask]. but ages is
> + * initialized as 0, so in view of the caller of pop(),
> + * the 1st item (0th) is always ready, but fact could
> + * be: push() is stalled before the final update, thus
> + * the item being inserted will be lost forever.
> + */
> + os->head = os->tail = head->capacity;
> +
> + if (!objsz)
> + continue;
> +
> + for (j = 0; j < n; j++) {
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + void *obj = SLOT_OBJS(os) + j * objsz;
> + uint32_t ie = os->tail & os->mask;
> +
> + /* perform object initialization */
> + if (objinit) {
> + int rc = objinit(context, obj);
> + if (rc)
> + return rc;
> + }
> +
> + /* add obj into the ring array */
> + ents[ie] = obj;
> + ages[ie] = os->tail;
> + os->tail++;
> + head->nr_objs++;
> + }
> + }
> +
> + return 0;
> +}
> +
> +/* cleanup all percpu slots of the object pool */
> +static inline void __objpool_fini_percpu_slots(struct objpool_head *head)
> +{
> + unsigned int i;
> +
> + if (!head->cpu_slots)
> + return;
> +
> + for (i = 0; i < head->nr_cpus; i++) {
> + if (!head->cpu_slots[i])
> + continue;
> + if (head->flags & OBJPOOL_FROM_VMALLOC)
> + vfree(head->cpu_slots[i]);
> + else
> + kfree(head->cpu_slots[i]);
> + }
> + kfree(head->cpu_slots);
> + head->cpu_slots = NULL;
> + head->slot_sizes = NULL;
> +}
> +
> +/**
> + * objpool_init: initialize object pool and pre-allocate objects
> + *
> + * args:
> + * @head: the object pool to be initialized, declared by caller
> + * @nr_objs: total objects to be pre-allocated by this object pool
> + * @max_objs: max entries (object pool capacity), use nr_objs if 0
> + * @object_size: size of an object, no objects pre-allocated if 0
> + * @gfp: flags for memory allocation (via kmalloc or vmalloc)
> + * @context: user context for object initialization callback
> + * @objinit: object initialization callback for extra setting-up
> + * @release: cleanup callback for private objects/pool/context
> + *
> + * return:
> + * 0 for success, otherwise error code
> + *
> + * All pre-allocated objects are to be zeroed. Caller could do extra
> + * initialization in objinit callback. The objinit callback will be
> + * called once and only once after the slot allocation. Then objpool
> + * won't touch any content of the objects since then. It's caller's
> + * duty to perform reinitialization after object allocation (pop) or
> + * clearance before object reclamation (push) if required.
> + */
> +int objpool_init(struct objpool_head *head, unsigned int nr_objs,
> + unsigned int max_objs, unsigned int object_size,
> + gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> + objpool_release_cb release)
> +{
> + unsigned int nents, ncpus = num_possible_cpus();
> + int rc;
> +
> + /* calculate percpu slot size (rounded to pow of 2) */
> + if (max_objs < nr_objs)

This should be an error case.

if (!max_objs)

> + max_objs = nr_objs;

else if (max_objs < nr_objs)
return -EINVAL;

But to simplify that, I think it should use only nr_objs.
I mean, if we can pass @objinit, there seems to be no reason to
have both nr_objs and max_objs.
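
Roughly, that check could read like this (a sketch of the suggestion
only, not the current patch):

	/* 0 means "default to nr_objs"; otherwise it must cover nr_objs */
	if (!max_objs)
		max_objs = nr_objs;
	else if (max_objs < nr_objs)
		return -EINVAL;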

> + nents = max_objs / ncpus;
> + if (nents < __objpool_num_of_objs(L1_CACHE_BYTES))
> + nents = __objpool_num_of_objs(L1_CACHE_BYTES);
> + nents = roundup_pow_of_two(nents);
> + while (nents * ncpus < nr_objs)
> + nents = nents << 1;
> +
> + memset(head, 0, sizeof(struct objpool_head));
> + head->nr_cpus = ncpus;
> + head->obj_size = object_size;
> + head->capacity = nents;
> + head->gfp = gfp & ~__GFP_ZERO;
> + head->context = context;
> + head->release = release;
> +
> + /* allocate array for percpu slots */
> + head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> + head->nr_cpus * sizeof(uint32_t), head->gfp);
> + if (!head->cpu_slots)
> + return -ENOMEM;
> + head->slot_sizes = (uint32_t *)&head->cpu_slots[head->nr_cpus];
> +
> + /* initialize per-cpu slots */
> + rc = __objpool_init_percpu_slots(head, nr_objs, context, objinit);
> + if (rc)
> + __objpool_fini_percpu_slots(head);
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(objpool_init);
> +
> +/* adding object to slot tail, the given slot must NOT be full */
> +static inline int __objpool_add_slot(void *obj, struct objpool_slot *os)
> +{
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + uint32_t tail = atomic_inc_return((atomic_t *)&os->tail) - 1;
> +
> + WRITE_ONCE(ents[tail & os->mask], obj);
> +
> + /* order matters: obj must be updated before tail updating */
> + smp_store_release(&ages[tail & os->mask], tail);
> + return 0;
> +}
> +
> +/* adding object to slot, abort if the slot was already full */
> +static inline int __objpool_try_add_slot(void *obj, struct objpool_slot *os)
> +{
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + uint32_t head, tail;
> +
> + do {
> + /* perform memory loading for both head and tail */
> + head = READ_ONCE(os->head);
> + tail = READ_ONCE(os->tail);
> + /* just abort if slot is full */
> + if (tail >= head + os->size)
> + return -ENOENT;
> + /* try to extend tail by 1 using CAS to avoid races */
> + if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> + break;
> + } while (1);
> +
> + /* the tail-th of slot is reserved for the given obj */
> + WRITE_ONCE(ents[tail & os->mask], obj);
> + /* update epoch id to make this object available for pop() */
> + smp_store_release(&ages[tail & os->mask], tail);
> + return 0;
> +}
> +
> +/**
> + * objpool_populate: add objects from user provided pool in batch
> + *
> + * args:
> + * @head: object pool
> + * @pool: user buffer for pre-allocated objects
> + * @size: size of user buffer
> + * @object_size: size of object & element
> + * @context: user context for objinit callback
> + * @objinit: object initialization callback
> + *
> + * return: 0 or error code
> + */
> +int objpool_populate(struct objpool_head *head, void *pool,
> + unsigned int size, unsigned int object_size,
> + void *context, objpool_init_obj_cb objinit)
> +{
> + unsigned int n = head->nr_objs, used = 0, i;
> +
> + if (head->pool || !pool || size < object_size)
> + return -EINVAL;
> + if (head->obj_size && head->obj_size != object_size)
> + return -EINVAL;
> + if (head->context && context && head->context != context)
> + return -EINVAL;
> + if (head->nr_objs >= head->nr_cpus * head->capacity)
> + return -ENOENT;
> +
> + WARN_ON_ONCE(((unsigned long)pool) & (sizeof(void *) - 1));
> + WARN_ON_ONCE(((uint32_t)object_size) & (sizeof(void *) - 1));
> +
> + /* align object size by sizeof(void *) */
> + head->obj_size = object_size;
> + object_size = ALIGN(object_size, sizeof(void *));
> + if (object_size == 0)
> + return -EINVAL;
> +
> + while (used + object_size <= size) {
> + void *obj = pool + used;
> +
> + /* perform object initialization */
> + if (objinit) {
> + int rc = objinit(context, obj);
> + if (rc)
> + return rc;
> + }
> +
> + /* insert obj to its corresponding objpool slot */
> + i = (n + used * head->nr_cpus/size) % head->nr_cpus;
> + if (!__objpool_try_add_slot(obj, head->cpu_slots[i]))
> + head->nr_objs++;
> +
> + used += object_size;
> + }
> +
> + if (!used)
> + return -ENOENT;
> +
> + head->context = context;
> + head->pool = pool;
> + head->pool_size = size;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(objpool_populate);
> +
> +/**
> + * objpool_add: add pre-allocated object to objpool during pool
> + * initialization
> + *
> + * args:
> + * @obj: object pointer to be added to objpool
> + * @head: object pool to be inserted into
> + *
> + * return:
> + * 0 or error code
> + *
> + * objpool_add_node doesn't handle race conditions, can only be
> + * called during objpool initialization
> + */
> +int objpool_add(void *obj, struct objpool_head *head)
> +{
> + unsigned int i, cpu;
> +
> + if (!obj)
> + return -EINVAL;
> + if (head->nr_objs >= head->nr_cpus * head->capacity)
> + return -ENOENT;
> +
> + cpu = head->nr_objs % head->nr_cpus;
> + for (i = 0; i < head->nr_cpus; i++) {
> + if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu])) {
> + head->nr_objs++;
> + return 0;
> + }
> +
> + if (++cpu >= head->nr_cpus)
> + cpu = 0;
> + }
> +
> + return -ENOENT;
> +}
> +EXPORT_SYMBOL_GPL(objpool_add);
> +
> +/**
> + * objpool_push: reclaim the object and return back to objects pool
> + *
> + * args:
> + * @obj: object pointer to be pushed to object pool
> + * @head: object pool
> + *
> + * return:
> + * 0 or error code: it fails only when objects pool are full
> + *
> + * objpool_push is non-blockable, and can be nested
> + */
> +int objpool_push(void *obj, struct objpool_head *head)
> +{
> + unsigned int cpu = raw_smp_processor_id() % head->nr_cpus;
> +
> + do {
> + if (head->nr_objs > head->capacity) {
> + if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu]))
> + return 0;
> + } else {
> + if (!__objpool_add_slot(obj, head->cpu_slots[cpu]))
> + return 0;
> + }
> + if (++cpu >= head->nr_cpus)
> + cpu = 0;
> + } while (1);
> +
> + return -ENOENT;
> +}
> +EXPORT_SYMBOL_GPL(objpool_push);
> +
> +/* try to retrieve object from slot */
> +static inline void *__objpool_try_get_slot(struct objpool_slot *os)
> +{
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + /* do memory load of head to local head */
> + uint32_t head = smp_load_acquire(&os->head);
> +
> + /* loop if slot isn't empty */
> + while (head != READ_ONCE(os->tail)) {
> + uint32_t id = head & os->mask, prev = head;
> +
> + /* do prefetching of object ents */
> + prefetch(&ents[id]);
> +
> + /*
> + * check whether this item was ready for retrieval ? There's
> + * possibility * in theory * we might retrieve wrong object,
> + * in case ages[id] overflows when current task is sleeping,
> + * but it will take very very long to overflow an uint32_t
> + */
> + if (smp_load_acquire(&ages[id]) == head) {
> + /* node must have been udpated by push() */
> + void *node = READ_ONCE(ents[id]);
> + /* commit and move forward head of the slot */
> + if (try_cmpxchg_release(&os->head, &head, head + 1))
> + return node;
> + }
> +
> + /* re-load head from memory continue trying */
> + head = READ_ONCE(os->head);
> + /*
> + * head stays unchanged, so it's very likely current pop()
> + * just preempted/interrupted an ongoing push() operation
> + */
> + if (head == prev)
> + break;
> + }
> +
> + return NULL;
> +}
> +
> +/**
> + * objpool_pop: allocate an object from objects pool
> + *
> + * args:
> + * @oh: object pool
> + *
> + * return:
> + * object: NULL if failed (object pool is empty)
> + *
> + * objpool_pop can be nested, so can be used in any context.
> + */
> +void *objpool_pop(struct objpool_head *head)
> +{
> + unsigned int i, cpu;
> + void *obj = NULL;
> +
> + cpu = raw_smp_processor_id() % head->nr_cpus;

(Not sure, do we really need this?)

Thank you,

> + for (i = 0; i < head->nr_cpus; i++) {
> + struct objpool_slot *slot = head->cpu_slots[cpu];
> + obj = __objpool_try_get_slot(slot);
> + if (obj)
> + break;
> + if (++cpu >= head->nr_cpus)
> + cpu = 0;
> + }
> +
> + return obj;
> +}
> +EXPORT_SYMBOL_GPL(objpool_pop);
> +
> +/**
> + * objpool_fini: cleanup the whole object pool (releasing all objects)
> + *
> + * args:
> + * @head: object pool to be released
> + *
> + */
> +void objpool_fini(struct objpool_head *head)
> +{
> + uint32_t i, flags;
> +
> + if (!head->cpu_slots)
> + return;
> +
> + if (!head->release) {
> + __objpool_fini_percpu_slots(head);
> + return;
> + }
> +
> + /* cleanup all objects remained in objpool */
> + for (i = 0; i < head->nr_cpus; i++) {
> + void *obj;
> + do {
> + flags = OBJPOOL_FLAG_NODE;
> + obj = __objpool_try_get_slot(head->cpu_slots[i]);
> + if (!obj)
> + break;
> + if (!objpool_is_inpool(obj, head) &&
> + !objpool_is_inslot(obj, head)) {
> + flags |= OBJPOOL_FLAG_USER;
> + }
> + head->release(head->context, obj, flags);
> + } while (obj);
> + }
> +
> + /* release percpu slots */
> + __objpool_fini_percpu_slots(head);
> +
> + /* cleanup user private pool and related context */
> + flags = OBJPOOL_FLAG_POOL;
> + if (head->pool)
> + flags |= OBJPOOL_FLAG_USER;
> + head->release(head->context, head->pool, flags);
> +}
> +EXPORT_SYMBOL_GPL(objpool_fini);
> --
> 2.34.1
>


--
Masami Hiramatsu (Google) <[email protected]>

2022-11-16 11:52:04

by wuqiang.matt

[permalink] [raw]
Subject: Re: [PATCH v6 1/4] lib: objpool added: ring-array based lockless MPMC queue

On 2022/11/14 23:54, Masami Hiramatsu (Google) wrote:
> On Tue, 8 Nov 2022 15:14:40 +0800
> wuqiang <[email protected]> wrote:
>
>> The object pool is a scalable implementaion of high performance queue
>> for objects allocation and reclamation, such as kretprobe instances.
>>
>> With leveraging per-cpu ring-array to mitigate the hot spots of memory
>> contention, it could deliver near-linear scalability for high parallel
>> scenarios. The ring-array is compactly managed in a single cache-line
>> to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
>> The body of pre-allocated objects is stored in continuous cache-lines
>> just after the ring-array.
>>
>> The object pool is interrupt safe. Both allocation and reclamation
>> (object pop and push operations) can be preemptible or interruptable.
>>
>> It's best suited for following cases:
>> 1) Memory allocation or reclamation are prohibited or too expensive
>> 2) Consumers are of different priorities, such as irqs and threads
>>
>> Limitations:
>> 1) Maximum objects (capacity) is determined during pool initializing
>> 2) The memory of objects won't be freed until the poll is finalized
>> 3) Object allocation (pop) may fail after trying all cpu slots
>> 4) Object reclamation (push) won't fail but may take long time to
>> finish for imbalanced scenarios. You can try larger max_entries
>> to mitigate, or ( >= CPUS * nr_objs) to avoid
>>
>> Signed-off-by: wuqiang <[email protected]>
>> ---
>> include/linux/objpool.h | 153 +++++++++++++
>> lib/Makefile | 2 +-
>> lib/objpool.c | 487 ++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 641 insertions(+), 1 deletion(-)
>> create mode 100644 include/linux/objpool.h
>> create mode 100644 lib/objpool.c
>>
>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
>> new file mode 100644
>> index 000000000000..7899b054b50c
>> --- /dev/null
>> +++ b/include/linux/objpool.h
>> @@ -0,0 +1,153 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +
>> +#ifndef _LINUX_OBJPOOL_H
>> +#define _LINUX_OBJPOOL_H
>> +
>> +#include <linux/types.h>
>> +
>> +/*
>> + * objpool: ring-array based lockless MPMC queue
>> + *
>> + * Copyright: [email protected]
>> + *
>> + * The object pool is a scalable implementaion of high performance queue
>> + * for objects allocation and reclamation, such as kretprobe instances.
>> + *
>> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
>> + * contention, it could deliver near-linear scalability for high parallel
>> + * scenarios. The ring-array is compactly managed in a single cache-line
>> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
>> + * The body of pre-allocated objects is stored in continuous cache-lines
>> + * just after the ring-array.
>> + *
>> + * The object pool is interrupt safe. Both allocation and reclamation
>> + * (object pop and push operations) can be preemptible or interruptable.
>> + *
>> + * It's best suited for following cases:
>> + * 1) Memory allocation or reclamation are prohibited or too expensive
>> + * 2) Consumers are of different priorities, such as irqs and threads
>> + *
>> + * Limitations:
>> + * 1) Maximum objects (capacity) is determined during pool initializing
>> + * 2) The memory of objects won't be freed until the poll is finalized
>> + * 3) Object allocation (pop) may fail after trying all cpu slots
>> + * 4) Object reclamation (push) won't fail but may take long time to
>> + * finish for imbalanced scenarios. You can try larger max_entries
>> + * to mitigate, or ( >= CPUS * nr_objs) to avoid
>> + */
>> +
>> +/*
>> + * objpool_slot: per-cpu ring array
>> + *
>> + * Represents a cpu-local array-based ring buffer, its size is specialized
>> + * during initialization of object pool.
>> + *
>> + * The objpool_slot is allocated from local memory for NUMA system, and to
>> + * be kept compact in a single cacheline. ages[] is stored just after the
>> + * body of objpool_slot, and then entries[]. The Array of ages[] describes
>> + * revision of each item, solely used to avoid ABA. And array of entries[]
>> + * contains the pointers of objects.
>> + *
>> + * The default size of objpool_slot is a single cache-line, aka. 64 bytes.
>> + *
>> + * 64bit:
>> + * 4 8 12 16 32 64
>> + * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) | objects
>> + *
>> + * 32bit:
>> + * 4 8 12 16 32 48 64
>> + * | head | tail | size | mask | ages[4] | ents[4] | unused | objects
>> + *
>> + */
>> +
>> +struct objpool_slot {
>> + uint32_t head; /* head of ring array */
>> + uint32_t tail; /* tail of ring array */
>> + uint32_t size; /* array size, pow of 2 */
>> + uint32_t mask; /* size - 1 */
>> +} __attribute__((packed));
>> +
>> +/* caller-specified object initial callback to setup each object, only called once */
>> +typedef int (*objpool_init_obj_cb)(void *context, void *obj);
>
> It seems a bit confused that this "initialize object" callback
> don't have the @obj as the first argument.

Sure, will update in next version.

>> +
>> +/* caller-specified cleanup callback for private objects/pool/context */
>> +typedef int (*objpool_release_cb)(void *context, void *ptr, uint32_t flags);
>
> Do you have any use-case for this release callback?
> If not, until actual use-case comes up, I recommend you to defer
> implementing it.

No actual use-case for now, since both kretprobe and rethook use internal
objects. It's mainly for user-managed objects and asynchronous finalization.

I'll reconsider your advice. Thanks.

>> +
>> +/* called for object releasing: ptr points to an object */
>> +#define OBJPOOL_FLAG_NODE (0x00000001)
>> +/* for user pool and context releasing, ptr could be NULL */
>> +#define OBJPOOL_FLAG_POOL (0x00001000)
>> +/* the object or pool to be released is user-managed */
>> +#define OBJPOOL_FLAG_USER (0x00008000)
>
> Ditto.
>
>> +
>> +/*
>> + * objpool_head: object pooling metadata
>> + */
>> +
>> +struct objpool_head {
>> + unsigned int obj_size; /* object & element size */
>> + unsigned int nr_objs; /* total objs (to be pre-allocated) */
>> + unsigned int nr_cpus; /* num of possible cpus */
>> + unsigned int capacity; /* max objects per cpuslot */
>> + unsigned long flags; /* flags for objpool management */
>> + gfp_t gfp; /* gfp flags for kmalloc & vmalloc */
>> + unsigned int pool_size; /* user pool size in byes */
>> + void *pool; /* user managed memory pool */
>> + struct objpool_slot **cpu_slots; /* array of percpu slots */
>> + unsigned int *slot_sizes; /* size in bytes of slots */
>> + objpool_release_cb release; /* resource cleanup callback */
>> + void *context; /* caller-provided context */
>> +};
>> +
>> +#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
>> +#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
>
> This also doesn't need at this moment. Please start from simple
> design for review.
>
>> +
>> +/* initialize object pool and pre-allocate objects */
>> +int objpool_init(struct objpool_head *head, unsigned int nr_objs,
>> + unsigned int max_objs, unsigned int object_size,
>> + gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>> + objpool_release_cb release);
>> +
>> +/* add objects in batch from user provided pool */
>> +int objpool_populate(struct objpool_head *head, void *pool,
>> + unsigned int size, unsigned int object_size,
>> + void *context, objpool_init_obj_cb objinit);
>> +
>> +/* add pre-allocated object (managed by user) to objpool */
>> +int objpool_add(void *obj, struct objpool_head *head);
>> +
>> +/* allocate an object from objects pool */
>> +void *objpool_pop(struct objpool_head *head);
>> +
>> +/* reclaim an object to objects pool */
>> +int objpool_push(void *node, struct objpool_head *head);
>> +
>> +/* cleanup the whole object pool (objects including) */
>> +void objpool_fini(struct objpool_head *head);
>> +
>> +/* whether the object is pre-allocated with percpu slots */
>> +static inline int objpool_is_inslot(void *obj, struct objpool_head *head)
>> +{
>> + void *slot;
>> + int i;
>> +
>> + if (!obj || !(head->flags & OBJPOOL_HAVE_OBJECTS))
>> + return 0;
>> +
>> + for (i = 0; i < head->nr_cpus; i++) {
>> + slot = head->cpu_slots[i];
>> + if (obj >= slot && obj < slot + head->slot_sizes[i])
>> + return 1;
>> + }
>> +
>> + return 0;
>> +}
>
> Ditto.
>
> It is too complicated to mix the internal allocated objects
> and external ones. This will expose the implementation of the
> objpool (users must understand they have to free the object
> only outside of slot)

Mixing is NOT recommended and normally it should be one of internal / user
pool / external. But as a general-purpose lib, objpool supports mixing just
because we don't know what use-cases real users will face.
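
For reference, the three flavors map to these interfaces of the patch;
a rough sketch only, where pool/user_buf/user_obj and the size variables
are placeholders:

	/* internal: objects pre-allocated inside the percpu slots */
	objpool_init(&pool, nr_objs, 0, obj_size, GFP_KERNEL, ctx, objinit, release);

	/* user pool: a caller-provided buffer carved into objects in batch */
	objpool_populate(&pool, user_buf, buf_size, obj_size, ctx, objinit);

	/* external: individual caller-managed objects added one by one */
	objpool_add(user_obj, &pool);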

> You can add it afterwards if it is really needed :)

Sure, I'll rethink it. For kretprobe and rethook, a performance-oriented
version (with internal objects) should simply be enough and pretty compact.

>> +
>> +/* whether the object is from user pool (batched adding) */
>> +static inline int objpool_is_inpool(void *obj, struct objpool_head *head)
>> +{
>> + return (obj && head->pool && obj >= head->pool &&
>> + obj < head->pool + head->pool_size);
>> +}
>> +
>> +#endif /* _LINUX_OBJPOOL_H */
>> diff --git a/lib/Makefile b/lib/Makefile
>> index 161d6a724ff7..e938703a321f 100644
>> --- a/lib/Makefile
>> +++ b/lib/Makefile
>> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
>> is_single_threaded.o plist.o decompress.o kobject_uevent.o \
>> earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
>> nmi_backtrace.o win_minmax.o memcat_p.o \
>> - buildid.o
>> + buildid.o objpool.o
>>
>> lib-$(CONFIG_PRINTK) += dump_stack.o
>> lib-$(CONFIG_SMP) += cpumask.o
>> diff --git a/lib/objpool.c b/lib/objpool.c
>> new file mode 100644
>> index 000000000000..ecffa0795f3d
>> --- /dev/null
>> +++ b/lib/objpool.c
>> @@ -0,0 +1,487 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/objpool.h>
>> +#include <linux/slab.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/atomic.h>
>> +#include <linux/prefetch.h>
>> +
>> +/*
>> + * objpool: ring-array based lockless MPMC/FIFO queues
>> + *
>> + * Copyright: [email protected]
>> + */
>> +
>> +/* compute the suitable num of objects to be managed by slot */
>> +static inline unsigned int __objpool_num_of_objs(unsigned int size)
>> +{
>> + return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
>> + (sizeof(uint32_t) + sizeof(void *)));
>> +}
>> +
>> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
>> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
>> + sizeof(uint32_t) * (s)->size))
>> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
>> + (sizeof(uint32_t) + sizeof(void *)) * (s)->size))
>> +
>> +/* allocate and initialize percpu slots */
>> +static inline int
>> +__objpool_init_percpu_slots(struct objpool_head *head, unsigned int nobjs,
>> + void *context, objpool_init_obj_cb objinit)
>> +{
>> + unsigned int i, j, n, size, objsz, nents = head->capacity;
>> +
>> + /* aligned object size by sizeof(void *) */
>> + objsz = ALIGN(head->obj_size, sizeof(void *));
>> + /* shall we allocate objects along with objpool_slot */
>> + if (objsz)
>> + head->flags |= OBJPOOL_HAVE_OBJECTS;
>> +
>> + for (i = 0; i < head->nr_cpus; i++) {
>> + struct objpool_slot *os;
>> +
>> + /* compute how many objects to be managed by this slot */
>> + n = nobjs / head->nr_cpus;
>> + if (i < (nobjs % head->nr_cpus))
>> + n++;
>> + size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
>> + sizeof(uint32_t) * nents + objsz * n;
>> +
>> + /* decide memory area for cpu-slot allocation */
>> + if (!i && !(head->gfp & GFP_ATOMIC) && size > PAGE_SIZE / 2)
>> + head->flags |= OBJPOOL_FROM_VMALLOC;
>> +
>> + /* allocate percpu slot & objects from local memory */
>> + if (head->flags & OBJPOOL_FROM_VMALLOC)
>> + os = __vmalloc_node(size, sizeof(void *), head->gfp,
>> + cpu_to_node(i), __builtin_return_address(0));
>> + else
>> + os = kmalloc_node(size, head->gfp, cpu_to_node(i));
>> + if (!os)
>> + return -ENOMEM;
>> +
>> + /* initialize percpu slot for the i-th cpu */
>> + memset(os, 0, size);
>> + os->size = head->capacity;
>> + os->mask = os->size - 1;
>> + head->cpu_slots[i] = os;
>> + head->slot_sizes[i] = size;
>> +
>> + /*
>> + * start from 2nd round to avoid conflict of 1st item.
>> + * we assume that the head item is ready for retrieval
>> + * iff head is equal to ages[head & mask]. but ages is
>> + * initialized as 0, so in view of the caller of pop(),
>> + * the 1st item (0th) is always ready, but fact could
>> + * be: push() is stalled before the final update, thus
>> + * the item being inserted will be lost forever.
>> + */
>> + os->head = os->tail = head->capacity;
>> +
>> + if (!objsz)
>> + continue;
>> +
>> + for (j = 0; j < n; j++) {
>> + uint32_t *ages = SLOT_AGES(os);
>> + void **ents = SLOT_ENTS(os);
>> + void *obj = SLOT_OBJS(os) + j * objsz;
>> + uint32_t ie = os->tail & os->mask;
>> +
>> + /* perform object initialization */
>> + if (objinit) {
>> + int rc = objinit(context, obj);
>> + if (rc)
>> + return rc;
>> + }
>> +
>> + /* add obj into the ring array */
>> + ents[ie] = obj;
>> + ages[ie] = os->tail;
>> + os->tail++;
>> + head->nr_objs++;
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/* cleanup all percpu slots of the object pool */
>> +static inline void __objpool_fini_percpu_slots(struct objpool_head *head)
>> +{
>> + unsigned int i;
>> +
>> + if (!head->cpu_slots)
>> + return;
>> +
>> + for (i = 0; i < head->nr_cpus; i++) {
>> + if (!head->cpu_slots[i])
>> + continue;
>> + if (head->flags & OBJPOOL_FROM_VMALLOC)
>> + vfree(head->cpu_slots[i]);
>> + else
>> + kfree(head->cpu_slots[i]);
>> + }
>> + kfree(head->cpu_slots);
>> + head->cpu_slots = NULL;
>> + head->slot_sizes = NULL;
>> +}
>> +
>> +/**
>> + * objpool_init: initialize object pool and pre-allocate objects
>> + *
>> + * args:
>> + * @head: the object pool to be initialized, declared by caller
>> + * @nr_objs: total objects to be pre-allocated by this object pool
>> + * @max_objs: max entries (object pool capacity), use nr_objs if 0
>> + * @object_size: size of an object, no objects pre-allocated if 0
>> + * @gfp: flags for memory allocation (via kmalloc or vmalloc)
>> + * @context: user context for object initialization callback
>> + * @objinit: object initialization callback for extra setting-up
>> + * @release: cleanup callback for private objects/pool/context
>> + *
>> + * return:
>> + * 0 for success, otherwise error code
>> + *
>> + * All pre-allocated objects are to be zeroed. Caller could do extra
>> + * initialization in objinit callback. The objinit callback will be
>> + * called once and only once after the slot allocation. Then objpool
>> + * won't touch any content of the objects since then. It's caller's
>> + * duty to perform reinitialization after object allocation (pop) or
>> + * clearance before object reclamation (push) if required.
>> + */
>> +int objpool_init(struct objpool_head *head, unsigned int nr_objs,
>> + unsigned int max_objs, unsigned int object_size,
>> + gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>> + objpool_release_cb release)
>> +{
>> + unsigned int nents, ncpus = num_possible_cpus();
>> + int rc;
>> +
>> + /* calculate percpu slot size (rounded to pow of 2) */
>> + if (max_objs < nr_objs)
>
> This should be an error case.
>
> if (!max_objs)
>
>> + max_objs = nr_objs;
>
> else if (max_objs < nr_objs)
> return -EINVAL;

Got it.

> But to simplify that, I think it should use only nr_objs.
> I mean, if we can pass the @objinit, there seems no reason to
> have both nr_objs and max_objs.

I kept both of them just to give users the flexibility to best meet their
requirements. For cases where objects are unevenly distributed among CPUs,
max_objs should be big enough (e.g. nr_cpus * nr_objs) to deliver good
performance, at the cost of a little more memory.
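
For example (hypothetical numbers, not taken from the patch), a caller
expecting such imbalance could oversize the per-cpu capacity so that any
single slot can hold all pre-allocated objects:

        static struct objpool_head imbalanced_pool;

        static int imbalanced_pool_init(void)
        {
                unsigned int nr_objs = 256;

                /* capacity is split per-cpu; 64 is a made-up object size */
                return objpool_init(&imbalanced_pool, nr_objs,
                                    num_possible_cpus() * nr_objs, 64,
                                    GFP_KERNEL, NULL, NULL, NULL);
        }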

>> + nents = max_objs / ncpus;
>> + if (nents < __objpool_num_of_objs(L1_CACHE_BYTES))
>> + nents = __objpool_num_of_objs(L1_CACHE_BYTES);
>> + nents = roundup_pow_of_two(nents);
>> + while (nents * ncpus < nr_objs)
>> + nents = nents << 1;
>> +
>> + memset(head, 0, sizeof(struct objpool_head));
>> + head->nr_cpus = ncpus;
>> + head->obj_size = object_size;
>> + head->capacity = nents;
>> + head->gfp = gfp & ~__GFP_ZERO;
>> + head->context = context;
>> + head->release = release;
>> +
>> + /* allocate array for percpu slots */
>> + head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
>> + head->nr_cpus * sizeof(uint32_t), head->gfp);
>> + if (!head->cpu_slots)
>> + return -ENOMEM;
>> + head->slot_sizes = (uint32_t *)&head->cpu_slots[head->nr_cpus];
>> +
>> + /* initialize per-cpu slots */
>> + rc = __objpool_init_percpu_slots(head, nr_objs, context, objinit);
>> + if (rc)
>> + __objpool_fini_percpu_slots(head);
>> +
>> + return rc;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_init);
>> +
>> +/* adding object to slot tail, the given slot must NOT be full */
>> +static inline int __objpool_add_slot(void *obj, struct objpool_slot *os)
>> +{
>> + uint32_t *ages = SLOT_AGES(os);
>> + void **ents = SLOT_ENTS(os);
>> + uint32_t tail = atomic_inc_return((atomic_t *)&os->tail) - 1;
>> +
>> + WRITE_ONCE(ents[tail & os->mask], obj);
>> +
>> + /* order matters: obj must be updated before tail updating */
>> + smp_store_release(&ages[tail & os->mask], tail);
>> + return 0;
>> +}
>> +
>> +/* adding object to slot, abort if the slot was already full */
>> +static inline int __objpool_try_add_slot(void *obj, struct objpool_slot *os)
>> +{
>> + uint32_t *ages = SLOT_AGES(os);
>> + void **ents = SLOT_ENTS(os);
>> + uint32_t head, tail;
>> +
>> + do {
>> + /* perform memory loading for both head and tail */
>> + head = READ_ONCE(os->head);
>> + tail = READ_ONCE(os->tail);
>> + /* just abort if slot is full */
>> + if (tail >= head + os->size)
>> + return -ENOENT;
>> + /* try to extend tail by 1 using CAS to avoid races */
>> + if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
>> + break;
>> + } while (1);
>> +
>> + /* the tail-th of slot is reserved for the given obj */
>> + WRITE_ONCE(ents[tail & os->mask], obj);
>> + /* update epoch id to make this object available for pop() */
>> + smp_store_release(&ages[tail & os->mask], tail);
>> + return 0;
>> +}
>> +
>> +/**
>> + * objpool_populate: add objects from user provided pool in batch
>> + *
>> + * args:
>> + * @head: object pool
>> + * @pool: user buffer for pre-allocated objects
>> + * @size: size of user buffer
>> + * @object_size: size of object & element
>> + * @context: user context for objinit callback
>> + * @objinit: object initialization callback
>> + *
>> + * return: 0 or error code
>> + */
>> +int objpool_populate(struct objpool_head *head, void *pool,
>> + unsigned int size, unsigned int object_size,
>> + void *context, objpool_init_obj_cb objinit)
>> +{
>> + unsigned int n = head->nr_objs, used = 0, i;
>> +
>> + if (head->pool || !pool || size < object_size)
>> + return -EINVAL;
>> + if (head->obj_size && head->obj_size != object_size)
>> + return -EINVAL;
>> + if (head->context && context && head->context != context)
>> + return -EINVAL;
>> + if (head->nr_objs >= head->nr_cpus * head->capacity)
>> + return -ENOENT;
>> +
>> + WARN_ON_ONCE(((unsigned long)pool) & (sizeof(void *) - 1));
>> + WARN_ON_ONCE(((uint32_t)object_size) & (sizeof(void *) - 1));
>> +
>> + /* align object size by sizeof(void *) */
>> + head->obj_size = object_size;
>> + object_size = ALIGN(object_size, sizeof(void *));
>> + if (object_size == 0)
>> + return -EINVAL;
>> +
>> + while (used + object_size <= size) {
>> + void *obj = pool + used;
>> +
>> + /* perform object initialization */
>> + if (objinit) {
>> + int rc = objinit(context, obj);
>> + if (rc)
>> + return rc;
>> + }
>> +
>> + /* insert obj to its corresponding objpool slot */
>> + i = (n + used * head->nr_cpus/size) % head->nr_cpus;
>> + if (!__objpool_try_add_slot(obj, head->cpu_slots[i]))
>> + head->nr_objs++;
>> +
>> + used += object_size;
>> + }
>> +
>> + if (!used)
>> + return -ENOENT;
>> +
>> + head->context = context;
>> + head->pool = pool;
>> + head->pool_size = size;
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_populate);
>> +
>> +/**
>> + * objpool_add: add pre-allocated object to objpool during pool
>> + * initialization
>> + *
>> + * args:
>> + * @obj: object pointer to be added to objpool
>> + * @head: object pool to be inserted into
>> + *
>> + * return:
>> + * 0 or error code
>> + *
>> + * objpool_add doesn't handle race conditions, can only be
>> + * called during objpool initialization
>> + */
>> +int objpool_add(void *obj, struct objpool_head *head)
>> +{
>> + unsigned int i, cpu;
>> +
>> + if (!obj)
>> + return -EINVAL;
>> + if (head->nr_objs >= head->nr_cpus * head->capacity)
>> + return -ENOENT;
>> +
>> + cpu = head->nr_objs % head->nr_cpus;
>> + for (i = 0; i < head->nr_cpus; i++) {
>> + if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu])) {
>> + head->nr_objs++;
>> + return 0;
>> + }
>> +
>> + if (++cpu >= head->nr_cpus)
>> + cpu = 0;
>> + }
>> +
>> + return -ENOENT;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_add);
>> +
>> +/**
>> + * objpool_push: reclaim the object and return back to objects pool
>> + *
>> + * args:
>> + * @obj: object pointer to be pushed to object pool
>> + * @head: object pool
>> + *
>> + * return:
>> + * 0 or error code: it fails only when objects pool are full
>> + *
>> + * objpool_push is non-blockable, and can be nested
>> + */
>> +int objpool_push(void *obj, struct objpool_head *head)
>> +{
>> + unsigned int cpu = raw_smp_processor_id() % head->nr_cpus;
>> +
>> + do {
>> + if (head->nr_objs > head->capacity) {
>> + if (!__objpool_try_add_slot(obj, head->cpu_slots[cpu]))
>> + return 0;
>> + } else {
>> + if (!__objpool_add_slot(obj, head->cpu_slots[cpu]))
>> + return 0;
>> + }
>> + if (++cpu >= head->nr_cpus)
>> + cpu = 0;
>> + } while (1);
>> +
>> + return -ENOENT;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_push);
>> +
>> +/* try to retrieve object from slot */
>> +static inline void *__objpool_try_get_slot(struct objpool_slot *os)
>> +{
>> + uint32_t *ages = SLOT_AGES(os);
>> + void **ents = SLOT_ENTS(os);
>> + /* do memory load of head to local head */
>> + uint32_t head = smp_load_acquire(&os->head);
>> +
>> + /* loop if slot isn't empty */
>> + while (head != READ_ONCE(os->tail)) {
>> + uint32_t id = head & os->mask, prev = head;
>> +
>> + /* do prefetching of object ents */
>> + prefetch(&ents[id]);
>> +
>> + /*
>> + * check whether this item was ready for retrieval ? There's
>> + * possibility * in theory * we might retrieve wrong object,
>> + * in case ages[id] overflows when current task is sleeping,
>> + * but it will take very very long to overflow an uint32_t
>> + */
>> + if (smp_load_acquire(&ages[id]) == head) {
>> + /* node must have been updated by push() */
>> + void *node = READ_ONCE(ents[id]);
>> + /* commit and move forward head of the slot */
>> + if (try_cmpxchg_release(&os->head, &head, head + 1))
>> + return node;
>> + }
>> +
>> + /* re-load head from memory continue trying */
>> + head = READ_ONCE(os->head);
>> + /*
>> + * head stays unchanged, so it's very likely current pop()
>> + * just preempted/interrupted an ongoing push() operation
>> + */
>> + if (head == prev)
>> + break;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +/**
>> + * objpool_pop: allocate an object from objects pool
>> + *
>> + * args:
>> + * @head: object pool
>> + *
>> + * return:
>> + * object: NULL if failed (object pool is empty)
>> + *
>> + * objpool_pop can be nested, so can be used in any context.
>> + */
>> +void *objpool_pop(struct objpool_head *head)
>> +{
>> + unsigned int i, cpu;
>> + void *obj = NULL;
>> +
>> + cpu = raw_smp_processor_id() % head->nr_cpus;
>
> (Not sure, do we really need this?)

V6 (this version) needs it, since it's using num_possible_cpus() to manage
the array of cpu_slots.

Last week I finished an improved version, which takes into account holes in
cpu_possible_mask. Testing is going well so far.
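
In case it helps to picture it, something along these lines (purely a sketch
of the idea, not the code I actually have; the names are made up):

        /*
         * map a raw CPU number to a dense index into cpu_slots[], so the
         * array can stay sized by num_possible_cpus() even when
         * cpu_possible_mask has holes
         */
        static int *cpu_to_slot;        /* nr_cpu_ids entries */

        static int build_cpu_slot_map(gfp_t gfp)
        {
                int cpu, idx = 0;

                cpu_to_slot = kmalloc_array(nr_cpu_ids, sizeof(*cpu_to_slot), gfp);
                if (!cpu_to_slot)
                        return -ENOMEM;

                for_each_possible_cpu(cpu)
                        cpu_to_slot[cpu] = idx++;

                return 0;
        }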

> Thank you,

Your advice would be greatly appreciated. Thank you for your time.

>
>> + for (i = 0; i < head->nr_cpus; i++) {
>> + struct objpool_slot *slot = head->cpu_slots[cpu];
>> + obj = __objpool_try_get_slot(slot);
>> + if (obj)
>> + break;
>> + if (++cpu >= head->nr_cpus)
>> + cpu = 0;
>> + }
>> +
>> + return obj;
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_pop);
>> +
>> +/**
>> + * objpool_fini: cleanup the whole object pool (releasing all objects)
>> + *
>> + * args:
>> + * @head: object pool to be released
>> + *
>> + */
>> +void objpool_fini(struct objpool_head *head)
>> +{
>> + uint32_t i, flags;
>> +
>> + if (!head->cpu_slots)
>> + return;
>> +
>> + if (!head->release) {
>> + __objpool_fini_percpu_slots(head);
>> + return;
>> + }
>> +
>> + /* cleanup all objects remained in objpool */
>> + for (i = 0; i < head->nr_cpus; i++) {
>> + void *obj;
>> + do {
>> + flags = OBJPOOL_FLAG_NODE;
>> + obj = __objpool_try_get_slot(head->cpu_slots[i]);
>> + if (!obj)
>> + break;
>> + if (!objpool_is_inpool(obj, head) &&
>> + !objpool_is_inslot(obj, head)) {
>> + flags |= OBJPOOL_FLAG_USER;
>> + }
>> + head->release(head->context, obj, flags);
>> + } while (obj);
>> + }
>> +
>> + /* release percpu slots */
>> + __objpool_fini_percpu_slots(head);
>> +
>> + /* cleanup user private pool and related context */
>> + flags = OBJPOOL_FLAG_POOL;
>> + if (head->pool)
>> + flags |= OBJPOOL_FLAG_USER;
>> + head->release(head->context, head->pool, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(objpool_fini);
>> --
>> 2.34.1
>>