LinuxLists.cc - [PATCH v7 0/5] lib,kprobes: kretprobe scalability improvement

2022-12-12 13:00:33

Subject: [PATCH v7 0/5] lib,kprobes: kretprobe scalability improvement

This patch series introduces a scalable, lockless ring-array based
object pool and uses it to replace the original freelist (a LIFO queue
based on a singly linked list), improving the scalability of kretprobed
routines.

Changes from v6 (https://lore.kernel.org/lkml/[email protected]/):
1) objpool: implementation simplified as Masami advised
2) rethook_alloc: now returns error codes (ERR_PTR)
3) MAINTAINERS: entry added for objpool files
4) synced to latest 6.1; verified on x86_64/x86/aarch64

wuqiang (5):
lib: objpool added: ring-array based lockless MPMC queue
lib: objpool test module added
kprobes: kretprobe scalability improvement with objpool
kprobes: freelist.h removed
MAINTAINERS: objpool added

MAINTAINERS | 7 +
include/linux/freelist.h | 129 --------
include/linux/kprobes.h | 9 +-
include/linux/objpool.h | 109 ++++++
include/linux/rethook.h | 14 +-
kernel/kprobes.c | 101 +++---
kernel/trace/fprobe.c | 37 +--
kernel/trace/rethook.c | 99 +++---
lib/Kconfig.debug | 11 +
lib/Makefile | 4 +-
lib/objpool.c | 320 ++++++++++++++++++
lib/test_objpool.c | 696 +++++++++++++++++++++++++++++++++++++++
12 files changed, 1264 insertions(+), 272 deletions(-)
delete mode 100644 include/linux/freelist.h
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c
create mode 100644 lib/test_objpool.c

--
2.34.1

2022-12-12 13:00:45

by wuqiang.matt

Subject: [PATCH v7 1/5] lib: objpool added: ring-array based lockless MPMC queue

The object pool is a scalable implementation of a high-performance queue
for object allocation and reclamation, such as kretprobe instances.

By leveraging per-cpu ring-arrays to mitigate memory contention hot
spots, it can deliver near-linear scalability in highly parallel
scenarios. The ring-array is kept compact in a single cache-line
to benefit from a warm L1 cache in most cases (<= 4 objects per core).
The pre-allocated objects are stored in contiguous cache-lines
just after the ring-array.
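
The "<= 4 objects per core" figure follows from the slot layout: a 16-byte
header (head, tail, size, mask) plus one 32-bit age and one pointer per
entry, with the entry count rounded down to a power of two. Below is a
minimal userspace sketch of that arithmetic, assuming a 64-byte cache line.
The names slot_header, rounddown_pow2() and slot_entries() are illustrative
and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed part of a slot: head, tail, size, mask (4 x uint32_t = 16 bytes). */
struct slot_header {
	uint32_t head;
	uint32_t tail;
	uint32_t size;
	uint32_t mask;
};

/* Largest power of two <= n, for n >= 1 (userspace stand-in for the
 * kernel's rounddown_pow_of_two()). */
static size_t rounddown_pow2(size_t n)
{
	size_t p = 1;

	assert(n >= 1);
	while (p <= n / 2)
		p *= 2;
	return p;
}

/*
 * Number of ring entries that fit into `bytes` of slot memory when each
 * entry costs one uint32_t age plus one pointer of `ptr_size` bytes.
 */
static size_t slot_entries(size_t bytes, size_t ptr_size)
{
	assert(bytes >= sizeof(struct slot_header) + sizeof(uint32_t) + ptr_size);
	return rounddown_pow2((bytes - sizeof(struct slot_header)) /
			      (sizeof(uint32_t) + ptr_size));
}
```

For a 64-byte line this gives slot_entries(64, 8) == 4 with 8-byte pointers
and slot_entries(64, 4) == 4 with 4-byte pointers (six entries would fit in
the latter case, but the count is rounded down to a power of two).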

The object pool is interrupt safe. Both allocation and reclamation
(object pop and push operations) can be preempted or interrupted.

It's best suited for the following cases:
1) Memory allocation or reclamation is prohibited or too expensive
2) Consumers have different priorities, such as irqs and threads

Limitations:
1) The maximum number of objects (capacity) is fixed at pool initialization
2) The memory of objects won't be freed until the pool is finalized
3) Object allocation (pop) may fail after trying all cpu slots
4) Object reclamation (push) won't fail but may take a long time to
   finish in imbalanced scenarios. You can try a larger max_entries
   to mitigate this, or ( >= CPUS * nr_objs) to avoid it
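
Why a pop that interrupts an in-flight push is safe, and why it may come
back empty-handed, can be illustrated with a single-threaded model of one
slot. This is only a sketch: it uses plain loads and stores where the patch
uses atomic_inc_return(), smp_store_release() and try_cmpxchg_release(), and
the names ring, ring_reserve(), ring_publish() and ring_pop() are made up
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u	/* must be a power of two */

struct ring {
	uint32_t head;			/* next position to pop */
	uint32_t tail;			/* next position to claim */
	uint32_t ages[RING_SIZE];	/* position last published per entry */
	void *ents[RING_SIZE];		/* object pointers */
};

/* Zero everything and start head/tail at the given position. */
static void ring_init(struct ring *r, uint32_t start)
{
	*r = (struct ring){ .head = start, .tail = start };
}

/* Push, step 1: claim the next tail position (ring must not be full). */
static uint32_t ring_reserve(struct ring *r)
{
	assert(r->tail - r->head < RING_SIZE);
	return r->tail++;
}

/* Push, step 2: store the pointer, then publish it by writing the age. */
static void ring_publish(struct ring *r, uint32_t pos, void *obj)
{
	r->ents[pos & (RING_SIZE - 1)] = obj;
	r->ages[pos & (RING_SIZE - 1)] = pos;
}

/* Pop: returns 1 and sets *out on success, 0 if nothing is ready. */
static int ring_pop(struct ring *r, void **out)
{
	uint32_t id = r->head & (RING_SIZE - 1);

	if (r->head == r->tail)
		return 0;	/* empty */
	if (r->ages[id] != r->head)
		return 0;	/* claimed, but not published yet */
	*out = r->ents[id];
	r->head++;
	return 1;
}
```

With the ring started at RING_SIZE (the "2nd round"), a pop that runs between
ring_reserve() and ring_publish() sees a stale age and returns 0 instead of
handing out an unpublished entry. Started at 0, the zero-initialized ages[0]
would match head and the pop would return the not-yet-written pointer.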

Signed-off-by: wuqiang <[email protected]>
---
include/linux/objpool.h | 109 ++++++++++++++
lib/Makefile | 2 +-
lib/objpool.c | 320 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 430 insertions(+), 1 deletion(-)
create mode 100644 include/linux/objpool.h
create mode 100644 lib/objpool.c

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
new file mode 100644
index 000000000000..922e1bc96f2b
--- /dev/null
+++ b/include/linux/objpool.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_OBJPOOL_H
+#define _LINUX_OBJPOOL_H
+
+#include <linux/types.h>
+
+/*
+ * objpool: ring-array based lockless MPMC queue
+ *
+ * Copyright: [email protected]
+ *
+ * The object pool is a scalable implementation of high performance queue
+ * for objects allocation and reclamation, such as kretprobe instances.
+ *
+ * With leveraging per-cpu ring-array to mitigate the hot spots of memory
+ * contention, it could deliver near-linear scalability for high parallel
+ * scenarios. The ring-array is compactly managed in a single cache-line
+ * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
+ * The body of pre-allocated objects is stored in continuous cache-lines
+ * just after the ring-array.
+ *
+ * The object pool is interrupt safe. Both allocation and reclamation
+ * (object pop and push operations) can be preemptible or interruptible.
+ *
+ * It's best suited for following cases:
+ * 1) Memory allocation or reclamation are prohibited or too expensive
+ * 2) Consumers are of different priorities, such as irqs and threads
+ *
+ * Limitations:
+ * 1) Maximum objects (capacity) is determined during pool initializing
+ * 2) The memory of objects won't be freed until the pool is finalized
+ * 3) Object allocation (pop) may fail after trying all cpu slots
+ */
+
+/*
+ * objpool_slot: per-cpu ring array
+ *
+ * Represents a cpu-local array-based ring buffer, its size is specified
+ * during initialization of object pool.
+ *
+ * The objpool_slot is allocated from local memory for NUMA system, and to
+ * be kept compact in a single cacheline. ages[] is stored just after the
+ * body of objpool_slot, and then entries[]. The Array of ages[] describes
+ * revision of each item, solely used to avoid ABA. And array of entries[]
+ * contains the pointers of objects.
+ *
+ * The default size of objpool_slot is a single cache-line, aka. 64 bytes.
+ *
+ * 64bit:
+ * 4 8 12 16 32 64
+ * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) | objects
+ *
+ * 32bit:
+ * 4 8 12 16 32 48 64
+ * | head | tail | size | mask | ages[4] | ents[4] | unused | objects
+ *
+ */
+
+struct objpool_slot {
+ uint32_t head; /* head of ring array */
+ uint32_t tail; /* tail of ring array */
+ uint32_t size; /* array size, pow of 2 */
+ uint32_t mask; /* size - 1 */
+} __packed;
+
+struct objpool_head;
+
+/* caller-specified callback for object initial setup, only called once */
+typedef int (*objpool_init_obj_cb)(void *obj, void *context);
+
+/* caller-specified cleanup callback for objpool destruction */
+typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
+
+/*
+ * objpool_head: object pooling metadata
+ */
+
+struct objpool_head {
+ int obj_size; /* object & element size */
+ int nr_objs; /* total objs (to be pre-allocated) */
+ int nr_cpus; /* nr_cpu_ids */
+ int capacity; /* max objects per cpuslot */
+ gfp_t gfp; /* gfp flags for kmalloc & vmalloc */
+ unsigned long flags; /* flags for objpool management */
+ struct objpool_slot **cpu_slots; /* array of percpu slots */
+ int *slot_sizes; /* size in bytes of slots */
+ objpool_fini_cb release; /* resource cleanup callback */
+ void *context; /* caller-provided context */
+};
+
+#define OBJPOOL_FROM_VMALLOC (0x80000000) /* objpool allocated from vmalloc area */
+#define OBJPOOL_HAVE_OBJECTS (0x40000000) /* objects allocated along with objpool */
+
+/* initialize object pool and pre-allocate objects */
+int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+ gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+ objpool_fini_cb release);
+
+/* allocate an object from objects pool */
+void *objpool_pop(struct objpool_head *head);
+
+/* reclaim an object to objects pool */
+int objpool_push(void *node, struct objpool_head *head);
+
+/* cleanup the whole object pool (objects including) */
+void objpool_fini(struct objpool_head *head);
+
+#endif /* _LINUX_OBJPOOL_H */
diff --git a/lib/Makefile b/lib/Makefile
index 59bd7c2f793a..f23d9c4fe639 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
- buildid.o
+ buildid.o objpool.o

lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/objpool.c b/lib/objpool.c
new file mode 100644
index 000000000000..bab8b27e75d7
--- /dev/null
+++ b/lib/objpool.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/objpool.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/atomic.h>
+#include <linux/prefetch.h>
+#include <linux/cpumask.h>
+
+/*
+ * objpool: ring-array based lockless MPMC/FIFO queues
+ *
+ * Copyright: [email protected]
+ */
+
+/* compute the suitable num of objects to be managed by slot */
+static inline int objpool_nobjs(int size)
+{
+ return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
+ (sizeof(uint32_t) + sizeof(void *)));
+}
+
+#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
+#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
+ sizeof(uint32_t) * (s)->size))
+#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
+ (sizeof(uint32_t) + sizeof(void *)) * (s)->size))
+#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
+
+/* allocate and initialize percpu slots */
+static inline int
+objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
+ void *context, objpool_init_obj_cb objinit)
+{
+ int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
+
+ /* aligned object size by sizeof(void *) */
+ objsz = ALIGN(head->obj_size, sizeof(void *));
+ /* shall we allocate objects along with objpool_slot */
+ if (objsz)
+ head->flags |= OBJPOOL_HAVE_OBJECTS;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ struct objpool_slot *os;
+
+ /* skip the cpus which could never be present */
+ if (!cpu_possible(i))
+ continue;
+
+ /* compute how many objects to be managed by this slot */
+ n = nobjs / num_possible_cpus();
+ if (cpu < (nobjs % num_possible_cpus()))
+ n++;
+ size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
+ sizeof(uint32_t) * nents + objsz * n;
+
+ /* decide memory area for cpu-slot allocation */
+ if (!cpu && !(head->gfp & GFP_ATOMIC) && size > PAGE_SIZE / 2)
+ head->flags |= OBJPOOL_FROM_VMALLOC;
+
+ /* allocate percpu slot & objects from local memory */
+ if (head->flags & OBJPOOL_FROM_VMALLOC)
+ os = __vmalloc_node(size, sizeof(void *), head->gfp,
+ cpu_to_node(i), __builtin_return_address(0));
+ else
+ os = kmalloc_node(size, head->gfp, cpu_to_node(i));
+ if (!os)
+ return -ENOMEM;
+
+ /* initialize percpu slot for the i-th slot */
+ memset(os, 0, size);
+ os->size = head->capacity;
+ os->mask = os->size - 1;
+ head->cpu_slots[i] = os;
+ head->slot_sizes[i] = size;
+ cpu = cpu + 1;
+
+ /*
+ * start from 2nd round to avoid conflict of 1st item.
+ * we assume that the head item is ready for retrieval
+ * iff head is equal to ages[head & mask]. but ages is
+ * initialized as 0, so in view of the caller of pop(),
+ * the 1st item (0th) is always ready, but fact could
+ * be: push() is stalled before the final update, thus
+ * the item being inserted will be lost forever.
+ */
+ os->head = os->tail = head->capacity;
+
+ if (!objsz)
+ continue;
+
+ for (j = 0; j < n; j++) {
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ void *obj = SLOT_OBJS(os) + j * objsz;
+ uint32_t ie = os->tail & os->mask;
+
+ /* perform object initialization */
+ if (objinit) {
+ int rc = objinit(obj, context);
+ if (rc)
+ return rc;
+ }
+
+ /* add obj into the ring array */
+ ents[ie] = obj;
+ ages[ie] = os->tail;
+ os->tail++;
+ head->nr_objs++;
+ }
+ }
+
+ return 0;
+}
+
+/* cleanup all percpu slots of the object pool */
+static inline void objpool_fini_percpu_slots(struct objpool_head *head)
+{
+ int i;
+
+ if (!head->cpu_slots)
+ return;
+
+ for (i = 0; i < head->nr_cpus; i++) {
+ if (!head->cpu_slots[i])
+ continue;
+ if (head->flags & OBJPOOL_FROM_VMALLOC)
+ vfree(head->cpu_slots[i]);
+ else
+ kfree(head->cpu_slots[i]);
+ }
+ kfree(head->cpu_slots);
+ head->cpu_slots = NULL;
+ head->slot_sizes = NULL;
+}
+
+/**
+ * objpool_init: initialize object pool and pre-allocate objects
+ *
+ * args:
+ * @head: the object pool to be initialized, declared by caller
+ * @nr_objs: total objects to be pre-allocated by this object pool
+ * @object_size: size of an object, no objects pre-allocated if 0
+ * @gfp: flags for memory allocation (via kmalloc or vmalloc)
+ * @context: user context for object initialization callback
+ * @objinit: object initialization callback for extra setting-up
+ * @release: cleanup callback for private objects/pool/context
+ *
+ * return:
+ * 0 for success, otherwise error code
+ *
+ * All pre-allocated objects are to be zeroed. Caller could do extra
+ * initialization in objinit callback. The objinit callback will be
+ * called once and only once after the slot allocation. Then objpool
+ * won't touch any content of the objects since then. It's caller's
+ * duty to perform reinitialization after object allocation (pop) or
+ * clearance before object reclamation (push) if required.
+ */
+int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
+ gfp_t gfp, void *context, objpool_init_obj_cb objinit,
+ objpool_fini_cb release)
+{
+ int nents, rc;
+
+ /* check input parameters */
+ if (nr_objs <= 0 || object_size < 0)
+ return -EINVAL;
+
+ /* calculate percpu slot size (rounded to pow of 2) */
+ nents = max_t(int, roundup_pow_of_two(nr_objs),
+ objpool_nobjs(L1_CACHE_BYTES));
+
+ /* initialize objpool head */
+ memset(head, 0, sizeof(struct objpool_head));
+ head->nr_cpus = nr_cpu_ids;
+ head->obj_size = object_size;
+ head->capacity = nents;
+ head->gfp = gfp & ~__GFP_ZERO;
+ head->context = context;
+ head->release = release;
+
+ /* allocate array for percpu slots */
+ head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
+ head->nr_cpus * sizeof(uint32_t), head->gfp);
+ if (!head->cpu_slots)
+ return -ENOMEM;
+ head->slot_sizes = (uint32_t *)&head->cpu_slots[head->nr_cpus];
+
+ /* initialize per-cpu slots */
+ rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
+ if (rc)
+ objpool_fini_percpu_slots(head);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(objpool_init);
+
+/* adding object to slot tail, the given slot must NOT be full */
+static inline int objpool_add_slot(void *obj, struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ uint32_t tail = atomic_inc_return((atomic_t *)&os->tail) - 1;
+
+ WRITE_ONCE(ents[tail & os->mask], obj);
+
+ /* order matters: obj must be updated before tail updating */
+ smp_store_release(&ages[tail & os->mask], tail);
+ return 0;
+}
+
+/**
+ * objpool_push: reclaim the object and return back to objects pool
+ *
+ * args:
+ * @obj: object pointer to be pushed to object pool
+ * @head: object pool
+ *
+ * return:
+ * 0 or error code: it fails only when the object pool is full
+ *
+ * objpool_push is non-blocking and can be nested
+ */
+int objpool_push(void *obj, struct objpool_head *head)
+{
+ int cpu = raw_smp_processor_id();
+
+ return objpool_add_slot(obj, head->cpu_slots[cpu]);
+}
+EXPORT_SYMBOL_GPL(objpool_push);
+
+/* try to retrieve object from slot */
+static inline void *objpool_try_get_slot(struct objpool_slot *os)
+{
+ uint32_t *ages = SLOT_AGES(os);
+ void **ents = SLOT_ENTS(os);
+ /* do memory load of head to local head */
+ uint32_t head = smp_load_acquire(&os->head);
+
+ /* loop if slot isn't empty */
+ while (head != READ_ONCE(os->tail)) {
+ uint32_t id = head & os->mask, prev = head;
+
+ /* do prefetching of object ents */
+ prefetch(&ents[id]);
+
+ /*
+ * check whether this item was ready for retrieval ? There's
+ * possibility * in theory * we might retrieve wrong object,
+ * in case ages[id] overflows when current task is sleeping,
+ * but it will take very very long to overflow an uint32_t
+ */
+ if (smp_load_acquire(&ages[id]) == head) {
+ /* node must have been updated by push() */
+ void *node = READ_ONCE(ents[id]);
+ /* commit and move forward head of the slot */
+ if (try_cmpxchg_release(&os->head, &head, head + 1))
+ return node;
+ }
+
+ /* re-load head from memory and continue trying */
+ head = READ_ONCE(os->head);
+ /*
+ * head stays unchanged, so it's very likely current pop()
+ * just preempted/interrupted an ongoing push() operation
+ */
+ if (head == prev)
+ break;
+ }
+
+ return NULL;
+}
+
+/**
+ * objpool_pop: allocate an object from objects pool
+ *
+ * args:
+ * @head: object pool
+ *
+ * return:
+ * object: NULL if failed (object pool is empty)
+ *
+ * objpool_pop can be nested, so can be used in any context.
+ */
+void *objpool_pop(struct objpool_head *head)
+{
+ int i, cpu = raw_smp_processor_id();
+ void *obj = NULL;
+
+ for (i = 0; i < num_possible_cpus(); i++) {
+ obj = objpool_try_get_slot(head->cpu_slots[cpu]);
+ if (obj)
+ break;
+ cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+ }
+
+ return obj;
+}
+EXPORT_SYMBOL_GPL(objpool_pop);
+
+/**
+ * objpool_fini: cleanup the whole object pool (releasing all objects)
+ *
+ * args:
+ * @head: object pool to be released
+ *
+ */
+void objpool_fini(struct objpool_head *head)
+{
+ if (!head->cpu_slots)
+ return;
+
+ /* release percpu slots */
+ objpool_fini_percpu_slots(head);
+
+ /* call user's cleanup callback if provided */
+ if (head->release)
+ head->release(head, head->context);
+}
+EXPORT_SYMBOL_GPL(objpool_fini);
--
2.34.1

2022-12-12 13:00:47

by wuqiang.matt

Subject: [PATCH v7 4/5] kprobes: freelist.h removed

This patch removes freelist.h from the kernel source tree, since its
only users (kretprobe and rethook) have been converted to objpool.

Signed-off-by: wuqiang <[email protected]>
---
include/linux/freelist.h | 129 ---------------------------------------
1 file changed, 129 deletions(-)
delete mode 100644 include/linux/freelist.h

diff --git a/include/linux/freelist.h b/include/linux/freelist.h
deleted file mode 100644
index fc1842b96469..000000000000
--- a/include/linux/freelist.h
+++ /dev/null
@@ -1,129 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause */
-#ifndef FREELIST_H
-#define FREELIST_H
-
-#include <linux/atomic.h>
-
-/*
- * Copyright: [email protected]
- *
- * A simple CAS-based lock-free free list. Not the fastest thing in the world
- * under heavy contention, but simple and correct (assuming nodes are never
- * freed until after the free list is destroyed), and fairly speedy under low
- * contention.
- *
- * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
- */
-
-struct freelist_node {
- atomic_t refs;
- struct freelist_node *next;
-};
-
-struct freelist_head {
- struct freelist_node *head;
-};
-
-#define REFS_ON_FREELIST 0x80000000
-#define REFS_MASK 0x7FFFFFFF
-
-static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * Since the refcount is zero, and nobody can increase it once it's
- * zero (except us, and we run only one copy of this method per node at
- * a time, i.e. the single thread case), then we know we can safely
- * change the next pointer of the node; however, once the refcount is
- * back above zero, then other threads could increase it (happens under
- * heavy contention, when the refcount goes to zero in between a load
- * and a refcount increment of a node in try_get, then back up to
- * something non-zero, then the refcount increment is done by the other
- * thread) -- so if the CAS to add the node to the actual list fails,
- * decrese the refcount and leave the add operation to the next thread
- * who puts the refcount back to zero (which could be us, hence the
- * loop).
- */
- struct freelist_node *head = READ_ONCE(list->head);
-
- for (;;) {
- WRITE_ONCE(node->next, head);
- atomic_set_release(&node->refs, 1);
-
- if (!try_cmpxchg_release(&list->head, &head, node)) {
- /*
- * Hmm, the add failed, but we can only try again when
- * the refcount goes back to zero.
- */
- if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
- continue;
- }
- return;
- }
-}
-
-static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
-{
- /*
- * We know that the should-be-on-freelist bit is 0 at this point, so
- * it's safe to set it using a fetch_add.
- */
- if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
- /*
- * Oh look! We were the last ones referencing this node, and we
- * know we want to add it to the free list, so let's do it!
- */
- __freelist_add(node, list);
- }
-}
-
-static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
-{
- struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
- unsigned int refs;
-
- while (head) {
- prev = head;
- refs = atomic_read(&head->refs);
- if ((refs & REFS_MASK) == 0 ||
- !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
- head = smp_load_acquire(&list->head);
- continue;
- }
-
- /*
- * Good, reference count has been incremented (it wasn't at
- * zero), which means we can read the next and not worry about
- * it changing between now and the time we do the CAS.
- */
- next = READ_ONCE(head->next);
- if (try_cmpxchg_acquire(&list->head, &head, next)) {
- /*
- * Yay, got the node. This means it was on the list,
- * which means should-be-on-freelist must be false no
- * matter the refcount (because nobody else knows it's
- * been taken off yet, it can't have been put back on).
- */
- WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
-
- /*
- * Decrease refcount twice, once for our ref, and once
- * for the list's ref.
- */
- atomic_fetch_add(-2, &head->refs);
-
- return head;
- }
-
- /*
- * OK, the head must have changed on us, but we still need to decrement
- * the refcount we increased.
- */
- refs = atomic_fetch_add(-1, &prev->refs);
- if (refs == REFS_ON_FREELIST + 1)
- __freelist_add(prev, list);
- }
-
- return NULL;
-}
-
-#endif /* FREELIST_H */
--
2.34.1

2022-12-12 13:02:55

by wuqiang.matt

Subject: [PATCH v7 3/5] kprobes: kretprobe scalability improvement with objpool

kretprobe uses freelist to manage return instances, but freelist,
a LIFO queue based on a singly linked list, scales badly and reduces
the overall throughput of kretprobed routines, especially in
high-contention scenarios.

Here's a typical throughput test of sys_flock (counts in 10 seconds,
measured with perf stat -a -I 10000 -e syscalls:sys_enter_flock):

OS: Debian 10 X86_64, Linux 6.1rc2
HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s

      1X       2X       4X       6X       8X      12X      16X
34762430 36546920 17949900 13101899 12569595 12646601 14729195
     24X      32X      48X      64X      72X      96X     128X
19263546 10102064  8985418 11936495 11493980  7127789  9330985

This patch introduces objpool to kretprobe and rethook, replacing the
original freelist, and brings near-linear scalability to kretprobed
routines. Kretprobe throughput improves by up to 333.9x over the
original freelist (at 96X). Here's the comparison:

                  1X         2X         4X         8X        16X
freelist:   34762430   36546920   17949900   12569595   14729195
objpool:    35627544   72182095  144068494  287564688  576903916
                 32X        48X        64X        96X       128X
freelist:   10102064    8985418   11936495    7127789    9330985
objpool:  1158876372 1737828164 2324371724 2380310472 2463182819

Tests on a 96-core ARM64 system show similar results, with the biggest
ratio reaching 642.2x:

OS: Debian 10 AARCH64, Linux 6.1rc2
HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s

                  1X         2X         4X         8X        16X
freelist:   17498299   10887037   10224710    8499132    6421751
objpool:    18715726   35549845   71615884  144258971  283707220
                 24X        32X        48X        64X        96X
freelist:    5339868    4819116    3593919    3121575    2687167
objpool:   419830913  571609748  877456139 1143316315 1725668029
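
The headline ratios can be re-derived from the two freelist/objpool
comparison tables above. The following is a small sketch that does so; the
struct and function names are made up for illustration, and the figures are
copied from the tables.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

struct sample {
	const char *label;	/* number of parallel tasks */
	double freelist;	/* calls counted in 10 seconds with freelist */
	double objpool;		/* calls counted in 10 seconds with objpool */
};

static const struct sample x86_samples[] = {
	{ "1X",   34762430.0,   35627544.0 },
	{ "2X",   36546920.0,   72182095.0 },
	{ "4X",   17949900.0,  144068494.0 },
	{ "8X",   12569595.0,  287564688.0 },
	{ "16X",  14729195.0,  576903916.0 },
	{ "32X",  10102064.0, 1158876372.0 },
	{ "48X",   8985418.0, 1737828164.0 },
	{ "64X",  11936495.0, 2324371724.0 },
	{ "96X",   7127789.0, 2380310472.0 },
	{ "128X",  9330985.0, 2463182819.0 },
};

static const struct sample arm64_samples[] = {
	{ "1X",  17498299.0,   18715726.0 },
	{ "2X",  10887037.0,   35549845.0 },
	{ "4X",  10224710.0,   71615884.0 },
	{ "8X",   8499132.0,  144258971.0 },
	{ "16X",  6421751.0,  283707220.0 },
	{ "24X",  5339868.0,  419830913.0 },
	{ "32X",  4819116.0,  571609748.0 },
	{ "48X",  3593919.0,  877456139.0 },
	{ "64X",  3121575.0, 1143316315.0 },
	{ "96X",  2687167.0, 1725668029.0 },
};

/*
 * Return the index of the sample with the largest objpool/freelist ratio
 * and store that ratio in *ratio.
 */
static size_t best_ratio(const struct sample *s, size_t n, double *ratio)
{
	size_t i, best = 0;

	assert(n > 0);
	for (i = 1; i < n; i++) {
		if (s[i].objpool / s[i].freelist >
		    s[best].objpool / s[best].freelist)
			best = i;
	}
	*ratio = s[best].objpool / s[best].freelist;
	return best;
}
```

For the x86 table the largest ratio is at 96X (about 333.9x), and for the
ARM64 table it is also at 96X (about 642.2x), matching the numbers quoted
in this commit message.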

Signed-off-by: wuqiang <[email protected]>
---
include/linux/kprobes.h | 9 ++--
include/linux/rethook.h | 14 ++----
kernel/kprobes.c | 101 +++++++++++++++++++---------------------
kernel/trace/fprobe.c | 37 ++++++---------
kernel/trace/rethook.c | 99 ++++++++++++++++++++-------------------
5 files changed, 118 insertions(+), 142 deletions(-)

diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index a0b92be98984..122b1f21f3a9 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -27,7 +27,7 @@
#include <linux/mutex.h>
#include <linux/ftrace.h>
#include <linux/refcount.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/rethook.h>
#include <asm/kprobes.h>

@@ -141,6 +141,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
*/
struct kretprobe_holder {
struct kretprobe *rp;
+ struct objpool_head pool;
refcount_t ref;
};

@@ -154,7 +155,6 @@ struct kretprobe {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook *rh;
#else
- struct freelist_head freelist;
struct kretprobe_holder *rph;
#endif
};
@@ -165,10 +165,7 @@ struct kretprobe_instance {
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
struct rethook_node node;
#else
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct kretprobe_holder *rph;
kprobe_opcode_t *ret_addr;
diff --git a/include/linux/rethook.h b/include/linux/rethook.h
index c8ac1e5afcd1..f97283c622b7 100644
--- a/include/linux/rethook.h
+++ b/include/linux/rethook.h
@@ -6,7 +6,7 @@
#define _LINUX_RETHOOK_H

#include <linux/compiler.h>
-#include <linux/freelist.h>
+#include <linux/objpool.h>
#include <linux/kallsyms.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
@@ -30,14 +30,13 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, struct pt_regs
struct rethook {
void *data;
rethook_handler_t handler;
- struct freelist_head pool;
+ struct objpool_head pool;
refcount_t ref;
struct rcu_head rcu;
};

/**
* struct rethook_node - The rethook shadow-stack entry node.
- * @freelist: The freelist, linked to struct rethook::pool.
* @rcu: The rcu_head for deferred freeing.
* @llist: The llist, linked to a struct task_struct::rethooks.
* @rethook: The pointer to the struct rethook.
@@ -48,19 +47,15 @@ struct rethook {
* on each entry of the shadow stack.
*/
struct rethook_node {
- union {
- struct freelist_node freelist;
- struct rcu_head rcu;
- };
+ struct rcu_head rcu;
struct llist_node llist;
struct rethook *rethook;
unsigned long ret_addr;
unsigned long frame;
};

-struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler, int size, int num);
void rethook_free(struct rethook *rh);
-void rethook_add_node(struct rethook *rh, struct rethook_node *node);
struct rethook_node *rethook_try_get(struct rethook *rh);
void rethook_recycle(struct rethook_node *node);
void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
@@ -97,4 +92,3 @@ void rethook_flush_task(struct task_struct *tk);
#endif

#endif
-
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 3050631e528d..5f35997b61f7 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1868,13 +1868,28 @@ static struct notifier_block kprobe_exceptions_nb = {
#ifdef CONFIG_KRETPROBES

#if !defined(CONFIG_KRETPROBE_ON_RETHOOK)
+
+/* callbacks for objpool of kretprobe instances */
+static int kretprobe_init_inst(void *nod, void *context)
+{
+ struct kretprobe_instance *ri = nod;
+
+ ri->rph = context;
+ return 0;
+}
+static int kretprobe_fini_pool(struct objpool_head *head, void *context)
+{
+ kfree(context);
+ return 0;
+}
+
static void free_rp_inst_rcu(struct rcu_head *head)
{
struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
+ struct kretprobe_holder *rph = ri->rph;

- if (refcount_dec_and_test(&ri->rph->ref))
- kfree(ri->rph);
- kfree(ri);
+ if (refcount_dec_and_test(&rph->ref))
+ objpool_fini(&rph->pool);
}
NOKPROBE_SYMBOL(free_rp_inst_rcu);

@@ -1883,7 +1898,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
struct kretprobe *rp = get_kretprobe(ri);

if (likely(rp))
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rp->rph->pool);
else
call_rcu(&ri->rcu, free_rp_inst_rcu);
}
@@ -1920,23 +1935,18 @@ NOKPROBE_SYMBOL(kprobe_flush_task);

static inline void free_rp_inst(struct kretprobe *rp)
{
- struct kretprobe_instance *ri;
- struct freelist_node *node;
- int count = 0;
-
- node = rp->freelist.head;
- while (node) {
- ri = container_of(node, struct kretprobe_instance, freelist);
- node = node->next;
-
- kfree(ri);
- count++;
- }
+ struct kretprobe_holder *rph = rp->rph;
+ void *nod;

- if (refcount_sub_and_test(count, &rp->rph->ref)) {
- kfree(rp->rph);
- rp->rph = NULL;
- }
+ rp->rph = NULL;
+ do {
+ nod = objpool_pop(&rph->pool);
+ /* deref anyway since we've one extra ref grabbed */
+ if (refcount_dec_and_test(&rph->ref)) {
+ objpool_fini(&rph->pool);
+ break;
+ }
+ } while (nod);
}

/* This assumes the 'tsk' is the current task or the is not running. */
@@ -2078,19 +2088,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
{
struct kretprobe *rp = container_of(p, struct kretprobe, kp);
+ struct kretprobe_holder *rph = rp->rph;
struct kretprobe_instance *ri;
- struct freelist_node *fn;

- fn = freelist_try_get(&rp->freelist);
- if (!fn) {
+ ri = objpool_pop(&rph->pool);
+ if (!ri) {
rp->nmissed++;
return 0;
}

- ri = container_of(fn, struct kretprobe_instance, freelist);
-
if (rp->entry_handler && rp->entry_handler(ri, regs)) {
- freelist_add(&ri->freelist, &rp->freelist);
+ objpool_push(ri, &rph->pool);
return 0;
}

@@ -2183,7 +2191,6 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
int register_kretprobe(struct kretprobe *rp)
{
int ret;
- struct kretprobe_instance *inst;
int i;
void *addr;

@@ -2221,20 +2228,12 @@ int register_kretprobe(struct kretprobe *rp)
#endif
}
#ifdef CONFIG_KRETPROBE_ON_RETHOOK
- rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
- if (!rp->rh)
- return -ENOMEM;
+ rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler,
+ sizeof(struct kretprobe_instance) +
+ rp->data_size, rp->maxactive);
+ if (IS_ERR(rp->rh))
+ return PTR_ERR(rp->rh);

- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- rethook_free(rp->rh);
- rp->rh = NULL;
- return -ENOMEM;
- }
- rethook_add_node(rp->rh, &inst->node);
- }
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
@@ -2243,25 +2242,19 @@ int register_kretprobe(struct kretprobe *rp)
rp->rh = NULL;
}
#else /* !CONFIG_KRETPROBE_ON_RETHOOK */
- rp->freelist.head = NULL;
rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
if (!rp->rph)
return -ENOMEM;

- rp->rph->rp = rp;
- for (i = 0; i < rp->maxactive; i++) {
- inst = kzalloc(sizeof(struct kretprobe_instance) +
- rp->data_size, GFP_KERNEL);
- if (inst == NULL) {
- refcount_set(&rp->rph->ref, i);
- free_rp_inst(rp);
- return -ENOMEM;
- }
- inst->rph = rp->rph;
- freelist_add(&inst->freelist, &rp->freelist);
+ if (objpool_init(&rp->rph->pool, rp->maxactive, rp->data_size +
+ sizeof(struct kretprobe_instance), GFP_KERNEL,
+ rp->rph, kretprobe_init_inst, kretprobe_fini_pool)) {
+ kfree(rp->rph);
+ rp->rph = NULL;
+ return -ENOMEM;
}
- refcount_set(&rp->rph->ref, i);
-
+ refcount_set(&rp->rph->ref, rp->maxactive + 1);
+ rp->rph->rp = rp;
rp->nmissed = 0;
/* Establish function entry probe point */
ret = register_kprobe(&rp->kp);
diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
index e8143e368074..9b685d6921d1 100644
--- a/kernel/trace/fprobe.c
+++ b/kernel/trace/fprobe.c
@@ -125,41 +125,32 @@ static void fprobe_init(struct fprobe *fp)

static int fprobe_init_rethook(struct fprobe *fp, int num)
{
- int i, size;
-
- if (num < 0)
- return -EINVAL;
+ int max;

if (!fp->exit_handler) {
fp->rethook = NULL;
return 0;
}

- /* Initialize rethook if needed */
- size = num * num_possible_cpus() * 2;
- if (size < 0)
+ if (num <= 0)
+ return -EINVAL;
+ max = num * num_possible_cpus() * 2;
+ /* Fail if max overflows */
+ if (max <= 0)
return -E2BIG;

- fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
- if (!fp->rethook)
- return -ENOMEM;
- for (i = 0; i < size; i++) {
- struct fprobe_rethook_node *node;
-
- node = kzalloc(sizeof(*node), GFP_KERNEL);
- if (!node) {
- rethook_free(fp->rethook);
- fp->rethook = NULL;
- return -ENOMEM;
- }
- rethook_add_node(fp->rethook, &node->node);
- }
+ /* Initialize rethook */
+ fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler,
+ sizeof(struct fprobe_rethook_node), max);
+ if (IS_ERR(fp->rethook))
+ return PTR_ERR(fp->rethook);
+
return 0;
}

static void fprobe_fail_cleanup(struct fprobe *fp)
{
- if (fp->rethook) {
+ if (!IS_ERR_OR_NULL(fp->rethook)) {
/* Don't need to cleanup rethook->handler because this is not used. */
rethook_free(fp->rethook);
fp->rethook = NULL;
@@ -313,7 +304,7 @@ int unregister_fprobe(struct fprobe *fp)
* current running handlers are finished, call unregister_ftrace_function()
* after this.
*/
- if (fp->rethook)
+ if (!IS_ERR_OR_NULL(fp->rethook))
rethook_free(fp->rethook);

ret = unregister_ftrace_function(&fp->ops);
diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index 32c3dfdb4d6a..6e1014e4f2f7 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -36,21 +36,16 @@ void rethook_flush_task(struct task_struct *tk)
static void rethook_free_rcu(struct rcu_head *head)
{
struct rethook *rh = container_of(head, struct rethook, rcu);
- struct rethook_node *rhn;
- struct freelist_node *node;
- int count = 1;
-
- node = rh->pool.head;
- while (node) {
- rhn = container_of(node, struct rethook_node, freelist);
- node = node->next;
- kfree(rhn);
- count++;
- }
+ struct rethook_node *nod;

- /* The rh->ref is the number of pooled node + 1 */
- if (refcount_sub_and_test(count, &rh->ref))
- kfree(rh);
+ do {
+ nod = objpool_pop(&rh->pool);
+ /* deref anyway since we've one extra ref grabbed */
+ if (refcount_dec_and_test(&rh->ref)) {
+ objpool_fini(&rh->pool);
+ break;
+ }
+ } while (nod);
}

/**
@@ -70,54 +65,65 @@ void rethook_free(struct rethook *rh)
call_rcu(&rh->rcu, rethook_free_rcu);
}

+static int rethook_init_node(void *nod, void *context)
+{
+ struct rethook_node *node = nod;
+
+ node->rethook = context;
+ return 0;
+}
+
+static int rethook_fini_pool(struct objpool_head *head, void *context)
+{
+ kfree(context);
+ return 0;
+}
+
/**
* rethook_alloc() - Allocate struct rethook.
* @data: a data to pass the @handler when hooking the return.
- * @handler: the return hook callback function.
+ * @handler: the return hook callback function, must NOT be NULL
+ * @gfp: default gfp for objpool allocation
+ * @size: node size: rethook node and additional data
+ * @num: number of rethook nodes to be preallocated
*
* Allocate and initialize a new rethook with @data and @handler.
- * Return NULL if memory allocation fails or @handler is NULL.
+ * Return pointer of new rethook, or error codes for failures.
+ *
* Note that @handler == NULL means this rethook is going to be freed.
*/
-struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
+struct rethook *rethook_alloc(void *data, rethook_handler_t handler,
+ int size, int num)
{
- struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
+ struct rethook *rh;

- if (!rh || !handler) {
- kfree(rh);
- return NULL;
- }
+ if (!handler || num <= 0 || size < sizeof(struct rethook_node))
+ return ERR_PTR(-EINVAL);
+
+ rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
+ if (!rh)
+ return ERR_PTR(-ENOMEM);

rh->data = data;
rh->handler = handler;
- rh->pool.head = NULL;
- refcount_set(&rh->ref, 1);

+ /* initialize the objpool for rethook nodes */
+ if (objpool_init(&rh->pool, num, size, GFP_KERNEL, rh,
+ rethook_init_node, rethook_fini_pool)) {
+ kfree(rh);
+ return ERR_PTR(-ENOMEM);
+ }
+ refcount_set(&rh->ref, num + 1);
return rh;
}

-/**
- * rethook_add_node() - Add a new node to the rethook.
- * @rh: the struct rethook.
- * @node: the struct rethook_node to be added.
- *
- * Add @node to @rh. User must allocate @node (as a part of user's
- * data structure.) The @node fields are initialized in this function.
- */
-void rethook_add_node(struct rethook *rh, struct rethook_node *node)
-{
- node->rethook = rh;
- freelist_add(&node->freelist, &rh->pool);
- refcount_inc(&rh->ref);
-}
-
static void free_rethook_node_rcu(struct rcu_head *head)
{
struct rethook_node *node = container_of(head, struct rethook_node, rcu);
+ struct rethook *rh = node->rethook;

- if (refcount_dec_and_test(&node->rethook->ref))
- kfree(node->rethook);
- kfree(node);
+ if (refcount_dec_and_test(&rh->ref))
+ objpool_fini(&rh->pool);
}

/**
@@ -132,7 +138,7 @@ void rethook_recycle(struct rethook_node *node)
lockdep_assert_preemption_disabled();

if (likely(READ_ONCE(node->rethook->handler)))
- freelist_add(&node->freelist, &node->rethook->pool);
+ objpool_push(node, &node->rethook->pool);
else
call_rcu(&node->rcu, free_rethook_node_rcu);
}
@@ -148,7 +154,6 @@ NOKPROBE_SYMBOL(rethook_recycle);
struct rethook_node *rethook_try_get(struct rethook *rh)
{
rethook_handler_t handler = READ_ONCE(rh->handler);
- struct freelist_node *fn;

lockdep_assert_preemption_disabled();

@@ -165,11 +170,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!rcu_is_watching()))
return NULL;

- fn = freelist_try_get(&rh->pool);
- if (!fn)
- return NULL;
-
- return container_of(fn, struct rethook_node, freelist);
+ return (struct rethook_node *)objpool_pop(&rh->pool);
}
NOKPROBE_SYMBOL(rethook_try_get);

--
2.34.1

2022-12-22 15:53:35

by Masami Hiramatsu

Subject: Re: [PATCH v7 1/5] lib: objpool added: ring-array based lockless MPMC queue

Hi Matt,

Thanks for your update! I reviewed it and have some comments and questions.

BTW, from the next version, can you Cc the series to the
linux-trace-kernel@vger mailing list too?

http://vger.kernel.org/vger-lists.html#linux-trace-kernel

On Mon, 12 Dec 2022 20:31:49 +0800
wuqiang <[email protected]> wrote:

> The object pool is a scalable implementation of high performance queue
> for objects allocation and reclamation, such as kretprobe instances.
>
> With leveraging per-cpu ring-array to mitigate the hot spots of memory
> contention, it could deliver near-linear scalability for high parallel
> scenarios. The ring-array is compactly managed in a single cache-line
> to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
> The body of pre-allocated objects is stored in continuous cache-lines
> just after the ring-array.
>
> The object pool is interrupt safe. Both allocation and reclamation
> (object pop and push operations) can be preemptible or interruptible.
>
> It's best suited for following cases:
> 1) Memory allocation or reclamation are prohibited or too expensive
> 2) Consumers are of different priorities, such as irqs and threads
>
> Limitations:
> 1) Maximum objects (capacity) is determined during pool initializing
> 2) The memory of objects won't be freed until the pool is finalized
> 3) Object allocation (pop) may fail after trying all cpu slots
> 4) Object reclamation (push) won't fail but may take long time to
> finish for imbalanced scenarios. You can try larger max_entries
> to mitigate, or ( >= CPUS * nr_objs) to avoid
>
> Signed-off-by: wuqiang <[email protected]>
> ---
> include/linux/objpool.h | 109 ++++++++++++++
> lib/Makefile | 2 +-
> lib/objpool.c | 320 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 430 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/objpool.h
> create mode 100644 lib/objpool.c
>
> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> new file mode 100644
> index 000000000000..922e1bc96f2b
> --- /dev/null
> +++ b/include/linux/objpool.h
> @@ -0,0 +1,109 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _LINUX_OBJPOOL_H
> +#define _LINUX_OBJPOOL_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * objpool: ring-array based lockless MPMC queue
> + *
> + * Copyright: [email protected]
> + *
> + * The object pool is a scalable implementation of high performance queue
> + * for objects allocation and reclamation, such as kretprobe instances.
> + *
> + * With leveraging per-cpu ring-array to mitigate the hot spots of memory
> + * contention, it could deliver near-linear scalability for high parallel
> + * scenarios. The ring-array is compactly managed in a single cache-line
> + * to benefit from warmed L1 cache for most cases (<= 4 objects per-core).
> + * The body of pre-allocated objects is stored in continuous cache-lines
> + * just after the ring-array.
> + *
> + * The object pool is interrupt safe. Both allocation and reclamation
> + * (object pop and push operations) can be preemptible or interruptible.
> + *
> + * It's best suited for following cases:
> + * 1) Memory allocation or reclamation are prohibited or too expensive
> + * 2) Consumers are of different priorities, such as irqs and threads
> + *
> + * Limitations:
> + * 1) Maximum objects (capacity) is determined during pool initializing
> + * 2) The memory of objects won't be freed until the pool is finalized
> + * 3) Object allocation (pop) may fail after trying all cpu slots
> + */
> +
> +/*
> + * objpool_slot: per-cpu ring array
> + *
> + * Represents a cpu-local array-based ring buffer, its size is specialized
> + * during initialization of object pool.
> + *
> + * The objpool_slot is allocated from local memory for NUMA system, and to
> + * be kept compact in a single cacheline. ages[] is stored just after the
> + * body of objpool_slot, and then entries[]. The Array of ages[] describes
> + * revision of each item, solely used to avoid ABA. And array of entries[]
> + * contains the pointers of objects.
> + *
> + * The default size of objpool_slot is a single cache-line, aka. 64 bytes.
> + *
> + * 64bit:
> + * 4 8 12 16 32 64
> + * | head | tail | size | mask | ages[4] | ents[4]: (8 * 4) | objects
> + *
> + * 32bit:
> + * 4 8 12 16 32 48 64
> + * | head | tail | size | mask | ages[4] | ents[4] | unused | objects
> + *
> + */
> +
> +struct objpool_slot {
> + uint32_t head; /* head of ring array */
> + uint32_t tail; /* tail of ring array */
> + uint32_t size; /* array size, pow of 2 */
> + uint32_t mask; /* size - 1 */
> +} __packed;
> +
> +struct objpool_head;
> +
> +/* caller-specified callback for object initial setup, only called once */

This comment is a bit confusing. It is "called once for each object", right?

> +typedef int (*objpool_init_obj_cb)(void *obj, void *context);
> +
> +/* caller-specified cleanup callback for objpool destruction */
> +typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> +
> +/*
> + * objpool_head: object pooling metadata
> + */
> +
> +struct objpool_head {
> + int obj_size; /* object & element size */
> + int nr_objs; /* total objs (to be pre-allocated) */
> + int nr_cpus; /* nr_cpu_ids */
> + int capacity; /* max objects per cpuslot */
> + gfp_t gfp; /* gfp flags for kmalloc & vmalloc */
> + unsigned long flags; /* flags for objpool management */
> + struct objpool_slot **cpu_slots; /* array of percpu slots */
> + int *slot_sizes; /* size in bytes of slots */
> + objpool_fini_cb release; /* resource cleanup callback */
> + void *context; /* caller-provided context */
> +};
> +
> +#define OBJPOOL_FROM_VMALLOC (0x800000000) /* objpool allocated from vmalloc area */
> +#define OBJPOOL_HAVE_OBJECTS (0x400000000) /* objects allocated along with objpool */
> +
> +/* initialize object pool and pre-allocate objects */
> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> + gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> + objpool_fini_cb release);
> +
> +/* allocate an object from objects pool */
> +void *objpool_pop(struct objpool_head *head);
> +
> +/* reclaim an object to objects pool */
> +int objpool_push(void *node, struct objpool_head *head);
> +
> +/* cleanup the whole object pool (objects including) */
> +void objpool_fini(struct objpool_head *head);
> +
> +#endif /* _LINUX_OBJPOOL_H */
> diff --git a/lib/Makefile b/lib/Makefile
> index 59bd7c2f793a..f23d9c4fe639 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
> nmi_backtrace.o win_minmax.o memcat_p.o \
> - buildid.o
> + buildid.o objpool.o
>
> lib-$(CONFIG_PRINTK) += dump_stack.o
> lib-$(CONFIG_SMP) += cpumask.o
> diff --git a/lib/objpool.c b/lib/objpool.c
> new file mode 100644
> index 000000000000..bab8b27e75d7
> --- /dev/null
> +++ b/lib/objpool.c
> @@ -0,0 +1,320 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/objpool.h>
> +#include <linux/slab.h>
> +#include <linux/vmalloc.h>
> +#include <linux/atomic.h>
> +#include <linux/prefetch.h>
> +#include <linux/cpumask.h>
> +
> +/*
> + * objpool: ring-array based lockless MPMC/FIFO queues
> + *
> + * Copyright: [email protected]
> + */
> +
> +/* compute the suitable num of objects to be managed by slot */
> +static inline int objpool_nobjs(int size)
> +{
> + return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> + (sizeof(uint32_t) + sizeof(void *)));
> +}
> +
> +#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> +#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> + sizeof(uint32_t) * (s)->size))
> +#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> + (sizeof(uint32_t) + sizeof(void *)) * (s)->size))
> +#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> +
> +/* allocate and initialize percpu slots */
> +static inline int

nit: I don't think this needs to be inlined, since it is only
used once and not on a hot path. I would suggest making it just
a 'static' function and letting the compiler decide whether
to inline it or not.

> +objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> + void *context, objpool_init_obj_cb objinit)
> +{
> + int i, j, n, size, objsz, cpu = 0, nents = head->capacity;

'nents' and 'n' are a bit confusing. Please use 'capacity' or always
use 'head->capacity'.

> +
> + /* aligned object size by sizeof(void *) */
> + objsz = ALIGN(head->obj_size, sizeof(void *));
> + /* shall we allocate objects along with objpool_slot */
> + if (objsz)
> + head->flags |= OBJPOOL_HAVE_OBJECTS;
> +
> + for (i = 0; i < head->nr_cpus; i++) {
> + struct objpool_slot *os;
> +
> + /* skip the cpus which could never be present */
> + if (!cpu_possible(i))
> + continue;

You can use:

for_each_possible_cpu(i) {

> +
> + /* compute how many objects to be managed by this slot */
> + n = nobjs / num_possible_cpus();
> + if (cpu < (nobjs % num_possible_cpus()))
> + n++;
> + size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> + sizeof(uint32_t) * nents + objsz * n;
> +
> + /* decide memory area for cpu-slot allocation */
> + if (!cpu && !(head->gfp & GFP_ATOMIC) && size > PAGE_SIZE / 2)
> + head->flags |= OBJPOOL_FROM_VMALLOC;

Why 'PAGE_SIZE / 2' ?

> +
> + /* allocate percpu slot & objects from local memory */
> + if (head->flags & OBJPOOL_FROM_VMALLOC)
> + os = __vmalloc_node(size, sizeof(void *), head->gfp,
> + cpu_to_node(i), __builtin_return_address(0));
> + else
> + os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> + if (!os)
> + return -ENOMEM;
> +
> + /* initialize percpu slot for the i-th slot */
> + memset(os, 0, size);
> + os->size = head->capacity;
> + os->mask = os->size - 1;
> + head->cpu_slots[i] = os;
> + head->slot_sizes[i] = size;
> + cpu = cpu + 1;
> +
> + /*
> + * start from 2nd round to avoid conflict of 1st item.
> + * we assume that the head item is ready for retrieval
> + * iff head is equal to ages[head & mask]. but ages is
> + * initialized as 0, so in view of the caller of pop(),
> + * the 1st item (0th) is always ready, but fact could
> + * be: push() is stalled before the final update, thus
> + * the item being inserted will be lost forever.
> + */
> + os->head = os->tail = head->capacity;
> +
> + if (!objsz)
> + continue;
> +
> + for (j = 0; j < n; j++) {
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + void *obj = SLOT_OBJS(os) + j * objsz;
> + uint32_t ie = os->tail & os->mask;
> +
> + /* perform object initialization */
> + if (objinit) {
> + int rc = objinit(obj, context);
> + if (rc)
> + return rc;
> + }
> +
> + /* add obj into the ring array */
> + ents[ie] = obj;
> + ages[ie] = os->tail;
> + os->tail++;
> + head->nr_objs++;
> + }
> + }
> +
> + return 0;
> +}
> +
> +/* cleanup all percpu slots of the object pool */
> +static inline void objpool_fini_percpu_slots(struct objpool_head *head)

This also does not need to be inlined.

> +{
> + int i;
> +
> + if (!head->cpu_slots)
> + return;
> +
> + for (i = 0; i < head->nr_cpus; i++) {
> + if (!head->cpu_slots[i])
> + continue;
> + if (head->flags & OBJPOOL_FROM_VMALLOC)
> + vfree(head->cpu_slots[i]);
> + else
> + kfree(head->cpu_slots[i]);
> + }
> + kfree(head->cpu_slots);
> + head->cpu_slots = NULL;
> + head->slot_sizes = NULL;
> +}
> +
> +/**
> + * objpool_init: initialize object pool and pre-allocate objects
> + *
> + * args:

First, please write this in kernel-doc style.
See Documentation/doc-guide/kernel-doc.rst

> + * @head: the object pool to be initialized, declared by caller
> + * @nr_objs: total objects to be pre-allocated by this object pool
> + * @object_size: size of an object, no objects pre-allocated if 0

This is a bit strange. If no objects are pre-allocated, @nr_objs should
be 0 instead of (or in addition to) @object_size.

And anyway, if there is no actual use-case for non pre-allocated
feature (except for the test case, of course), I would suggest
dropping it from this first version.

> + * @gfp: flags for memory allocation (via kmalloc or vmalloc)
> + * @context: user context for object initialization callback
> + * @objinit: object initialization callback for extra setting-up
> + * @release: cleanup callback for private objects/pool/context
> + *
> + * return:
> + * 0 for success, otherwise error code
> + *
> + * All pre-allocated objects are to be zeroed. Caller could do extra
> + * initialization in objinit callback. The objinit callback will be
> + * called once and only once after the slot allocation. Then objpool
> + * won't touch any content of the objects since then. It's caller's
> + * duty to perform reinitialization after object allocation (pop) or
> + * clearance before object reclamation (push) if required.
> + */
> +int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> + gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> + objpool_fini_cb release)
> +{
> + int nents, rc;
> +
> + /* check input parameters */
> + if (nr_objs <= 0 || object_size < 0)
> + return -EINVAL;
> +
> + /* calculate percpu slot size (rounded to pow of 2) */
> + nents = max_t(int, roundup_pow_of_two(nr_objs),
> + objpool_nobjs(L1_CACHE_BYTES));
> +
> + /* initialize objpool head */
> + memset(head, 0, sizeof(struct objpool_head));
> + head->nr_cpus = nr_cpu_ids;
> + head->obj_size = object_size;
> + head->capacity = nents;
> + head->gfp = gfp & ~__GFP_ZERO;
> + head->context = context;
> + head->release = release;
> +
> + /* allocate array for percpu slots */
> + head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> + head->nr_cpus * sizeof(uint32_t), head->gfp);

This looks weird. Please allocate the arrays for cpu_slots and slot_sizes
separately.

> + if (!head->cpu_slots)
> + return -ENOMEM;
> + head->slot_sizes = (uint32_t *)&head->cpu_slots[head->nr_cpus];

And do not do this. If it allocates 2 arrays, we can use many debug
features to detect overruns, but this weird allocation will prevent that.
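
As a rough illustration of the separate-allocation suggestion, here is a
userspace sketch. It is not the patch's code: struct pool_head_sketch and
both helper names are hypothetical, and calloc()/free() stand in for what
would presumably be kcalloc()/kfree() in the kernel. The point is only that
each array is its own allocation, so allocator-level overrun detection can
see the boundary between them.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct objpool_head. */
struct pool_head_sketch {
	int nr_cpus;
	void **cpu_slots;      /* one slot pointer per CPU */
	uint32_t *slot_sizes;  /* size in bytes of each slot */
};

/* Allocate the two arrays independently; returns 0, or -1 on failure. */
static int pool_alloc_arrays(struct pool_head_sketch *head, int nr_cpus)
{
	if (nr_cpus <= 0)
		return -1;

	head->nr_cpus = nr_cpus;
	head->cpu_slots = calloc(nr_cpus, sizeof(*head->cpu_slots));
	if (!head->cpu_slots)
		return -1;

	head->slot_sizes = calloc(nr_cpus, sizeof(*head->slot_sizes));
	if (!head->slot_sizes) {
		/* undo the first allocation so nothing leaks */
		free(head->cpu_slots);
		head->cpu_slots = NULL;
		return -1;
	}
	return 0;
}

/* Release both arrays; safe to call on an already released head. */
static void pool_free_arrays(struct pool_head_sketch *head)
{
	free(head->slot_sizes);
	free(head->cpu_slots);
	head->slot_sizes = NULL;
	head->cpu_slots = NULL;
}
```

With this shape, the cleanup path has to free slot_sizes explicitly instead
of relying on it living inside the cpu_slots allocation.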

> +
> + /* initialize per-cpu slots */
> + rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> + if (rc)
> + objpool_fini_percpu_slots(head);
> +
> + return rc;
> +}
> +EXPORT_SYMBOL_GPL(objpool_init);
> +
> +/* adding object to slot tail, the given slot must NOT be full */
> +static inline int objpool_add_slot(void *obj, struct objpool_slot *os)
> +{
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + uint32_t tail = atomic_inc_return((atomic_t *)&os->tail) - 1;

No, don't cast u32 to atomic_t.
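
One possible direction, sketched in userspace C11 rather than kernel code:
declare the counter as an atomic type from the start, so no cast is needed
at the call site. The struct and function names below are hypothetical. In
the kernel the field would presumably become an atomic_t and keep using
atomic_inc_return(), which is an assumption about how the patch might be
reworked, not something stated in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical slot header: tail is declared atomic, not cast to it. */
struct slot_sketch {
	uint32_t head;
	_Atomic uint32_t tail;
	uint32_t size;
	uint32_t mask;
};

/*
 * Reserve the next tail position. Returns the value of tail before the
 * increment, matching "atomic_inc_return(...) - 1" in the quoted code.
 */
static uint32_t slot_reserve_tail(struct slot_sketch *os)
{
	return atomic_fetch_add(&os->tail, 1);
}
```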

> +
> + WRITE_ONCE(ents[tail & os->mask], obj);
> +
> + /* order matters: obj must be updated before tail updating */
> + smp_store_release(&ages[tail & os->mask], tail);
> + return 0;
> +}
> +
> +/**
> + * objpool_push: reclaim the object and return back to objects pool

Ditto, please use kernel-doc.

> + *
> + * args:
> + * @obj: object pointer to be pushed to object pool
> + * @head: object pool
> + *
> + * return:
> + * 0 or error code: it fails only when objects pool are full
> + *
> + * objpool_push is non-blockable, and can be nested
> + */
> +int objpool_push(void *obj, struct objpool_head *head)
> +{
> + int cpu = raw_smp_processor_id();
> +
> + return objpool_add_slot(obj, head->cpu_slots[cpu]);
> +}
> +EXPORT_SYMBOL_GPL(objpool_push);
> +
> +/* try to retrieve object from slot */
> +static inline void *objpool_try_get_slot(struct objpool_slot *os)
> +{
> + uint32_t *ages = SLOT_AGES(os);
> + void **ents = SLOT_ENTS(os);
> + /* do memory load of head to local head */
> + uint32_t head = smp_load_acquire(&os->head);
> +
> + /* loop if slot isn't empty */
> + while (head != READ_ONCE(os->tail)) {
> + uint32_t id = head & os->mask, prev = head;
> +
> + /* do prefetching of object ents */
> + prefetch(&ents[id]);
> +
> + /*
> + * check whether this item was ready for retrieval ? There's
> + * possibility * in theory * we might retrieve wrong object,
> + * in case ages[id] overflows when current task is sleeping,
> + * but it will take very very long to overflow an uint32_t
> + */

We should make sure it is safe in theory. The hot path can be loose, but
safety must be ensured before use. An OS can run for a very long time in
some cases (e.g. 10 years), and uint32 is too short ... (uint64 is OK)
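
To put numbers on the wrap-around concern, the following userspace sketch
checks when a stale head snapshot would compare equal to ages[id] again,
and estimates the time to wrap. All function names are hypothetical, and
the rate of 1e8 pushes per second used in the note below is an assumed
figure for illustration only, not a measurement from this series.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if a stale 32-bit head snapshot would equal the age stored
 * at the same ring position after `pushes` further pushes.
 */
static int age32_matches_after(uint32_t stale_head, uint64_t pushes)
{
	uint32_t age = (uint32_t)(stale_head + pushes);

	return age == stale_head;
}

/* Same check with a 64-bit age counter. */
static int age64_matches_after(uint64_t stale_head, uint64_t pushes)
{
	uint64_t age = stale_head + pushes;

	return age == stale_head;
}

/* Seconds until a `bits`-wide counter wraps at `ops_per_sec` pushes. */
static double seconds_to_wrap(unsigned int bits, double ops_per_sec)
{
	double span = 1.0;
	unsigned int i;

	for (i = 0; i < bits; i++)
		span *= 2.0;
	return span / ops_per_sec;
}
```

At the assumed 1e8 pushes per second on one slot, a 32-bit counter wraps in
roughly 43 seconds, while a 64-bit counter would take thousands of years.
A wrong match still requires the reader to stay stalled for a whole wrap.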

> + if (smp_load_acquire(&ages[id]) == head) {
> + /* node must have been updated by push() */
> + void *node = READ_ONCE(ents[id]);
> + /* commit and move forward head of the slot */
> + if (try_cmpxchg_release(&os->head, &head, head + 1))
> + return node;
> + }
> +
> + /* re-load head from memory and continue trying */
> + head = READ_ONCE(os->head);
> + /*
> + * head stays unchanged, so it's very likely current pop()
> + * just preempted/interrupted an ongoing push() operation
> + */
> + if (head == prev)
> + break;
> + }
> +
> + return NULL;
> +}
> +
> +/**
> + * objpool_pop: allocate an object from objects pool

Ditto.

Thank you,

> + *
> + * args:
> + * @head: object pool
> + *
> + * return:
> + * object: NULL if failed (object pool is empty)
> + *
> + * objpool_pop can be nested, so can be used in any context.
> + */
> +void *objpool_pop(struct objpool_head *head)
> +{
> + int i, cpu = raw_smp_processor_id();
> + void *obj = NULL;
> +
> + for (i = 0; i < num_possible_cpus(); i++) {
> + obj = objpool_try_get_slot(head->cpu_slots[cpu]);
> + if (obj)
> + break;
> + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> + }
> +
> + return obj;
> +}
> +EXPORT_SYMBOL_GPL(objpool_pop);
> +
> +/**
> + * objpool_fini: cleanup the whole object pool (releasing all objects)
> + *
> + * args:
> + * @head: object pool to be released
> + *
> + */
> +void objpool_fini(struct objpool_head *head)
> +{
> + if (!head->cpu_slots)
> + return;
> +
> + /* release percpu slots */
> + objpool_fini_percpu_slots(head);
> +
> + /* call user's cleanup callback if provided */
> + if (head->release)
> + head->release(head, head->context);
> +}
> +EXPORT_SYMBOL_GPL(objpool_fini);
> --
> 2.34.1
>

--
Masami Hiramatsu (Google) <[email protected]>

2022-12-23 03:08:34

by Masami Hiramatsu

Subject: Re: [PATCH v7 1/5] lib: objpool added: ring-array based lockless MPMC queue

Hi,

On Fri, 23 Dec 2022 00:47:01 +0900
Masami Hiramatsu (Google) <[email protected]> wrote:

> > +/* try to retrieve object from slot */
> > +static inline void *objpool_try_get_slot(struct objpool_slot *os)
> > +{
> > + uint32_t *ages = SLOT_AGES(os);
> > + void **ents = SLOT_ENTS(os);
> > + /* do memory load of head to local head */
> > + uint32_t head = smp_load_acquire(&os->head);
> > +
> > + /* loop if slot isn't empty */
> > + while (head != READ_ONCE(os->tail)) {
> > + uint32_t id = head & os->mask, prev = head;
> > +
> > + /* do prefetching of object ents */
> > + prefetch(&ents[id]);
> > +
> > + /*
> > + * check whether this item was ready for retrieval ? There's
> > + * possibility * in theory * we might retrieve wrong object,
> > + * in case ages[id] overflows when current task is sleeping,
> > + * but it will take very very long to overflow an uint32_t
> > + */
>
> We should make sure it is safe in theory. The hot path can be loose, but
> safety must be ensured before use. An OS can run for a very long time in
> some cases (e.g. 10 years), and uint32 is too short ... (uint64 is OK)

OK, I understand what you are concerned about here. This means that
ages[id] overflows *and* comes back to the same value during this window.
But must objpool_pop() be called with preemption disabled? If not, it
should be (and that must not be a big overhead).

> > + if (smp_load_acquire(&ages[id]) == head) {
> > + /* node must have been updated by push() */

Please correct me if I'm wrong. In my understanding, since the size of
ents[] is always bigger than nr_objs, (tail & mask) == (head & mask) only if
ents[] is full and there is no free object (thus no push() happens) or
ents[] is empty (in this case ages[id] != head). Thus the node is not
updated if the cmpxchg below succeeds.

> > + void *node = READ_ONCE(ents[id]);
> > + /* commit and move forward head of the slot */
> > + if (try_cmpxchg_release(&os->head, &head, head + 1))
> > + return node;
> > + }
> > +
> > + /* re-load head from memory and continue trying */
> > + head = READ_ONCE(os->head);
> > + /*
> > + * head stays unchanged, so it's very likely current pop()
> > + * just preempted/interrupted an ongoing push() operation

Since this can touch other CPUs' slots, there should be another ongoing
push() running on the same slot (so there is no need to limit this to the
preempt/interrupt cases). Also, this happens only when the push() is running
on *the empty slot*. Thus we can consider the slot empty and return NULL.

Can you update the comment above and make it clear that this exits here
because the slot is empty until the ongoing push() is done?
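
The "empty until the ongoing push() is done" behaviour can be reproduced in
a small userspace model of one slot. This is a simplified sketch with C11
atomics and a fixed ring size, not the objpool code itself. The names
ring_sketch, ring_push_reserve() and ring_push_publish() are hypothetical,
and push is split into two steps only to simulate a push() that has
reserved a position but has not yet published it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring_sketch {
	_Atomic uint32_t head;
	_Atomic uint32_t tail;
	_Atomic uint32_t ages[RING_SIZE];
	_Atomic(void *) ents[RING_SIZE];
};

static void ring_init(struct ring_sketch *r)
{
	uint32_t i;

	/* start from the 2nd round so zeroed ages[] never match head */
	atomic_store(&r->head, RING_SIZE);
	atomic_store(&r->tail, RING_SIZE);
	for (i = 0; i < RING_SIZE; i++) {
		atomic_store(&r->ages[i], 0);
		atomic_store(&r->ents[i], NULL);
	}
}

/* Step 1 of push: reserve a position by advancing tail. */
static uint32_t ring_push_reserve(struct ring_sketch *r)
{
	return atomic_fetch_add(&r->tail, 1);
}

/* Step 2 of push: store the object, then publish it through ages[]. */
static void ring_push_publish(struct ring_sketch *r, uint32_t pos, void *obj)
{
	atomic_store_explicit(&r->ents[pos & RING_MASK], obj,
			      memory_order_relaxed);
	atomic_store_explicit(&r->ages[pos & RING_MASK], pos,
			      memory_order_release);
}

static void *ring_pop(struct ring_sketch *r)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	while (head != atomic_load_explicit(&r->tail, memory_order_relaxed)) {
		uint32_t id = head & RING_MASK, prev = head;

		/* the entry is ready only once its age equals head */
		if (atomic_load_explicit(&r->ages[id],
					 memory_order_acquire) == head) {
			void *obj = atomic_load_explicit(&r->ents[id],
							 memory_order_relaxed);
			if (atomic_compare_exchange_strong(&r->head, &head,
							   head + 1))
				return obj;
		}

		head = atomic_load(&r->head);
		/*
		 * head did not move: the position is reserved but not yet
		 * published, so the slot is treated as empty for now.
		 */
		if (head == prev)
			break;
	}
	return NULL;
}
```

Calling ring_pop() between ring_push_reserve() and ring_push_publish()
returns NULL even though tail has already advanced. That is the case the
comment in objpool_try_get_slot() would need to describe.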

Overall, some comments must be clarified, but the code itself seems OK to me.

Thank you,

> > + */
> > + if (head == prev)
> > + break;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +/**
> > + * objpool_pop: allocate an object from objects pool
>
> Ditto.
>
> Thank you,
>
> > + *
> > + * args:
> > + * @head: object pool
> > + *
> > + * return:
> > + * object: NULL if failed (object pool is empty)
> > + *
> > + * objpool_pop can be nested, so can be used in any context.
> > + */
> > +void *objpool_pop(struct objpool_head *head)
> > +{
> > + int i, cpu = raw_smp_processor_id();
> > + void *obj = NULL;
> > +
> > + for (i = 0; i < num_possible_cpus(); i++) {
> > + obj = objpool_try_get_slot(head->cpu_slots[cpu]);
> > + if (obj)
> > + break;
> > + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> > + }
> > +
> > + return obj;
> > +}
> > +EXPORT_SYMBOL_GPL(objpool_pop);
> > +
> > +/**
> > + * objpool_fini: cleanup the whole object pool (releasing all objects)
> > + *
> > + * args:
> > + * @head: object pool to be released
> > + *
> > + */
> > +void objpool_fini(struct objpool_head *head)
> > +{
> > + if (!head->cpu_slots)
> > + return;
> > +
> > + /* release percpu slots */
> > + objpool_fini_percpu_slots(head);
> > +
> > + /* call user's cleanup callback if provided */
> > + if (head->release)
> > + head->release(head, head->context);
> > +}
> > +EXPORT_SYMBOL_GPL(objpool_fini);
> > --
> > 2.34.1
> >
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

--
Masami Hiramatsu (Google) <[email protected]>

2022-12-27 16:15:13

by Masami Hiramatsu

Subject: Re: [PATCH v7 3/5] kprobes: kretprobe scalability improvement with objpool

Hi Matt,

On Mon, 12 Dec 2022 20:31:51 +0800
wuqiang <[email protected]> wrote:

> kretprobe is using freelist to manage return-instances, but freelist,
> as LIFO queue based on singly linked list, scales badly and reduces
> the overall throughput of kretprobed routines, especially for high
> contention scenarios.
>
> Here's a typical throughput test of sys_flock (counts in 10 seconds,
> measured with perf stat -a -I 10000 -e syscalls:sys_enter_flock):
>
> OS: Debian 10 X86_64, Linux 6.1rc2
> HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
>
> 1X 2X 4X 6X 8X 12X 16X
> 34762430 36546920 17949900 13101899 12569595 12646601 14729195
> 24X 32X 48X 64X 72X 96X 128X
> 19263546 10102064 8985418 11936495 11493980 7127789 9330985
>
> This patch introduces objpool to kretprobe and rethook, with original
> freelist replaced and brings near-linear scalability to kretprobed
> routines. Tests of kretprobe throughput show the biggest ratio as
> 333.9x of the original freelist. Here's the comparison:
>
> 1X 2X 4X 8X 16X
> freelist: 34762430 36546920 17949900 12569595 14729195
> objpool: 35627544 72182095 144068494 287564688 576903916
> 32X 48X 64X 96X 128X
> freelist: 10102064 8985418 11936495 7127789 9330985
> objpool: 1158876372 1737828164 2324371724 2380310472 2463182819
>
> Tests on 96-core ARM64 system output similarly, but with the biggest
> ratio up to 642.2x:
>
> OS: Debian 10 AARCH64, Linux 6.1rc2
> HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
>
> 1X 2X 4X 8X 16X
> freelist: 17498299 10887037 10224710 8499132 6421751
> objpool: 18715726 35549845 71615884 144258971 283707220
> 24X 32X 48X 64X 96X
> freelist: 5339868 4819116 3593919 3121575 2687167
> objpool: 419830913 571609748 877456139 1143316315 1725668029

So for x86, rethook is used, and for arm64, kretprobes is used.
OK, looks good to me.
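
The 333.9x and 642.2x figures in the quoted commit message can be
re-derived from the two tables. The sketch below copies the quoted
throughput counts into arrays and takes the largest objpool/freelist ratio.
The helper name max_speedup() and the array names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest objpool/freelist throughput ratio across n test points. */
static double max_speedup(const double *freelist, const double *objpool,
			  size_t n)
{
	double best = 0.0;
	size_t i;

	for (i = 0; i < n; i++) {
		double ratio = objpool[i] / freelist[i];

		if (ratio > best)
			best = ratio;
	}
	return best;
}

/* x86_64 table: 1X 2X 4X 8X 16X 32X 48X 64X 96X 128X */
static const double x86_freelist[] = {
	34762430.0, 36546920.0, 17949900.0, 12569595.0, 14729195.0,
	10102064.0, 8985418.0, 11936495.0, 7127789.0, 9330985.0,
};
static const double x86_objpool[] = {
	35627544.0, 72182095.0, 144068494.0, 287564688.0, 576903916.0,
	1158876372.0, 1737828164.0, 2324371724.0, 2380310472.0, 2463182819.0,
};

/* arm64 table: 1X 2X 4X 8X 16X 24X 32X 48X 64X 96X */
static const double arm_freelist[] = {
	17498299.0, 10887037.0, 10224710.0, 8499132.0, 6421751.0,
	5339868.0, 4819116.0, 3593919.0, 3121575.0, 2687167.0,
};
static const double arm_objpool[] = {
	18715726.0, 35549845.0, 71615884.0, 144258971.0, 283707220.0,
	419830913.0, 571609748.0, 877456139.0, 1143316315.0, 1725668029.0,
};
```

In both tables the maximum falls on the 96X column.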

I have some comments below (but just nitpicks.)

>
> Signed-off-by: wuqiang <[email protected]>

BTW, can you clarify your name on Signed-off-by? I see your name on
[5/5] that says "Matt Wu <[email protected]>". It should be
the same on Signed-off-by (and MODULE_AUTHOR) so that we can make sure
it is you.

> ---
> include/linux/kprobes.h | 9 ++--
> include/linux/rethook.h | 14 ++----
> kernel/kprobes.c | 101 +++++++++++++++++++---------------------
> kernel/trace/fprobe.c | 37 ++++++---------
> kernel/trace/rethook.c | 99 ++++++++++++++++++++-------------------
> 5 files changed, 118 insertions(+), 142 deletions(-)
>
> diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
> index a0b92be98984..122b1f21f3a9 100644
> --- a/include/linux/kprobes.h
> +++ b/include/linux/kprobes.h
> @@ -27,7 +27,7 @@
> #include <linux/mutex.h>
> #include <linux/ftrace.h>
> #include <linux/refcount.h>
> -#include <linux/freelist.h>
> +#include <linux/objpool.h>
> #include <linux/rethook.h>
> #include <asm/kprobes.h>
>
> @@ -141,6 +141,7 @@ static inline bool kprobe_ftrace(struct kprobe *p)
> */
> struct kretprobe_holder {
> struct kretprobe *rp;
> + struct objpool_head pool;
> refcount_t ref;
> };
>
> @@ -154,7 +155,6 @@ struct kretprobe {
> #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> struct rethook *rh;
> #else
> - struct freelist_head freelist;
> struct kretprobe_holder *rph;
> #endif
> };
> @@ -165,10 +165,7 @@ struct kretprobe_instance {
> #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> struct rethook_node node;
> #else
> - union {
> - struct freelist_node freelist;
> - struct rcu_head rcu;
> - };
> + struct rcu_head rcu;
> struct llist_node llist;
> struct kretprobe_holder *rph;
> kprobe_opcode_t *ret_addr;
> diff --git a/include/linux/rethook.h b/include/linux/rethook.h
> index c8ac1e5afcd1..f97283c622b7 100644
> --- a/include/linux/rethook.h
> +++ b/include/linux/rethook.h
> @@ -6,7 +6,7 @@
> #define _LINUX_RETHOOK_H
>
> #include <linux/compiler.h>
> -#include <linux/freelist.h>
> +#include <linux/objpool.h>
> #include <linux/kallsyms.h>
> #include <linux/llist.h>
> #include <linux/rcupdate.h>
> @@ -30,14 +30,13 @@ typedef void (*rethook_handler_t) (struct rethook_node *, void *, struct pt_regs
> struct rethook {
> void *data;
> rethook_handler_t handler;
> - struct freelist_head pool;
> + struct objpool_head pool;
> refcount_t ref;
> struct rcu_head rcu;
> };
>
> /**
> * struct rethook_node - The rethook shadow-stack entry node.
> - * @freelist: The freelist, linked to struct rethook::pool.
> * @rcu: The rcu_head for deferred freeing.
> * @llist: The llist, linked to a struct task_struct::rethooks.
> * @rethook: The pointer to the struct rethook.
> @@ -48,19 +47,15 @@ struct rethook {
> * on each entry of the shadow stack.
> */
> struct rethook_node {
> - union {
> - struct freelist_node freelist;
> - struct rcu_head rcu;
> - };
> + struct rcu_head rcu;
> struct llist_node llist;
> struct rethook *rethook;
> unsigned long ret_addr;
> unsigned long frame;
> };
>
> -struct rethook *rethook_alloc(void *data, rethook_handler_t handler);
> +struct rethook *rethook_alloc(void *data, rethook_handler_t handler, int size, int num);
> void rethook_free(struct rethook *rh);
> -void rethook_add_node(struct rethook *rh, struct rethook_node *node);
> struct rethook_node *rethook_try_get(struct rethook *rh);
> void rethook_recycle(struct rethook_node *node);
> void rethook_hook(struct rethook_node *node, struct pt_regs *regs, bool mcount);
> @@ -97,4 +92,3 @@ void rethook_flush_task(struct task_struct *tk);
> #endif
>
> #endif
> -
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 3050631e528d..5f35997b61f7 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1868,13 +1868,28 @@ static struct notifier_block kprobe_exceptions_nb = {
> #ifdef CONFIG_KRETPROBES
>
> #if !defined(CONFIG_KRETPROBE_ON_RETHOOK)
> +
> +/* callbacks for objpool of kretprobe instances */
> +static int kretprobe_init_inst(void *nod, void *context)
> +{
> + struct kretprobe_instance *ri = nod;
> +
> + ri->rph = context;
> + return 0;
> +}
> +static int kretprobe_fini_pool(struct objpool_head *head, void *context)
> +{
> + kfree(context);
> + return 0;
> +}
> +
> static void free_rp_inst_rcu(struct rcu_head *head)
> {
> struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu);
> + struct kretprobe_holder *rph = ri->rph;
>
> - if (refcount_dec_and_test(&ri->rph->ref))
> - kfree(ri->rph);
> - kfree(ri);
> + if (refcount_dec_and_test(&rph->ref))
> + objpool_fini(&rph->pool);
> }
> NOKPROBE_SYMBOL(free_rp_inst_rcu);
>
> @@ -1883,7 +1898,7 @@ static void recycle_rp_inst(struct kretprobe_instance *ri)
> struct kretprobe *rp = get_kretprobe(ri);
>
> if (likely(rp))
> - freelist_add(&ri->freelist, &rp->freelist);
> + objpool_push(ri, &rp->rph->pool);
> else
> call_rcu(&ri->rcu, free_rp_inst_rcu);
> }
> @@ -1920,23 +1935,18 @@ NOKPROBE_SYMBOL(kprobe_flush_task);
>
> static inline void free_rp_inst(struct kretprobe *rp)
> {
> - struct kretprobe_instance *ri;
> - struct freelist_node *node;
> - int count = 0;
> -
> - node = rp->freelist.head;
> - while (node) {
> - ri = container_of(node, struct kretprobe_instance, freelist);
> - node = node->next;
> -
> - kfree(ri);
> - count++;
> - }
> + struct kretprobe_holder *rph = rp->rph;
> + void *nod;
>
> - if (refcount_sub_and_test(count, &rp->rph->ref)) {
> - kfree(rp->rph);
> - rp->rph = NULL;
> - }
> + rp->rph = NULL;
> + do {
> + nod = objpool_pop(&rph->pool);
> + /* deref anyway since we've one extra ref grabbed */
> + if (refcount_dec_and_test(&rph->ref)) {
> + objpool_fini(&rph->pool);
> + break;
> + }
> + } while (nod);
> }
>
> /* This assumes the 'tsk' is the current task or the is not running. */
> @@ -2078,19 +2088,17 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler)
> static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
> {
> struct kretprobe *rp = container_of(p, struct kretprobe, kp);
> + struct kretprobe_holder *rph = rp->rph;
> struct kretprobe_instance *ri;
> - struct freelist_node *fn;
>
> - fn = freelist_try_get(&rp->freelist);
> - if (!fn) {
> + ri = objpool_pop(&rph->pool);
> + if (!ri) {
> rp->nmissed++;
> return 0;
> }
>
> - ri = container_of(fn, struct kretprobe_instance, freelist);
> -
> if (rp->entry_handler && rp->entry_handler(ri, regs)) {
> - freelist_add(&ri->freelist, &rp->freelist);
> + objpool_push(ri, &rph->pool);
> return 0;
> }
>
> @@ -2183,7 +2191,6 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o
> int register_kretprobe(struct kretprobe *rp)
> {
> int ret;
> - struct kretprobe_instance *inst;
> int i;
> void *addr;
>
> @@ -2221,20 +2228,12 @@ int register_kretprobe(struct kretprobe *rp)
> #endif
> }
> #ifdef CONFIG_KRETPROBE_ON_RETHOOK
> - rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler);
> - if (!rp->rh)
> - return -ENOMEM;
> + rp->rh = rethook_alloc((void *)rp, kretprobe_rethook_handler,
> + sizeof(struct kretprobe_instance) +
> + rp->data_size, rp->maxactive);
> + if (IS_ERR(rp->rh))
> + return PTR_ERR(rp->rh);
>
> - for (i = 0; i < rp->maxactive; i++) {
> - inst = kzalloc(sizeof(struct kretprobe_instance) +
> - rp->data_size, GFP_KERNEL);
> - if (inst == NULL) {
> - rethook_free(rp->rh);
> - rp->rh = NULL;
> - return -ENOMEM;
> - }
> - rethook_add_node(rp->rh, &inst->node);
> - }
> rp->nmissed = 0;
> /* Establish function entry probe point */
> ret = register_kprobe(&rp->kp);
> @@ -2243,25 +2242,19 @@ int register_kretprobe(struct kretprobe *rp)
> rp->rh = NULL;
> }
> #else /* !CONFIG_KRETPROBE_ON_RETHOOK */
> - rp->freelist.head = NULL;
> rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL);
> if (!rp->rph)
> return -ENOMEM;
>
> - rp->rph->rp = rp;
> - for (i = 0; i < rp->maxactive; i++) {
> - inst = kzalloc(sizeof(struct kretprobe_instance) +
> - rp->data_size, GFP_KERNEL);
> - if (inst == NULL) {
> - refcount_set(&rp->rph->ref, i);
> - free_rp_inst(rp);
> - return -ENOMEM;
> - }
> - inst->rph = rp->rph;
> - freelist_add(&inst->freelist, &rp->freelist);
> + if (objpool_init(&rp->rph->pool, rp->maxactive, rp->data_size +
> + sizeof(struct kretprobe_instance), GFP_KERNEL,
> + rp->rph, kretprobe_init_inst, kretprobe_fini_pool)) {
> + kfree(rp->rph);
> + rp->rph = NULL;
> + return -ENOMEM;
> }
> - refcount_set(&rp->rph->ref, i);
> -
> + refcount_set(&rp->rph->ref, rp->maxactive + 1);
> + rp->rph->rp = rp;
> rp->nmissed = 0;
> /* Establish function entry probe point */
> ret = register_kprobe(&rp->kp);
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index e8143e368074..9b685d6921d1 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -125,41 +125,32 @@ static void fprobe_init(struct fprobe *fp)
>
> static int fprobe_init_rethook(struct fprobe *fp, int num)
> {
> - int i, size;
> -
> - if (num < 0)
> - return -EINVAL;
> + int max;
>
> if (!fp->exit_handler) {
> fp->rethook = NULL;
> return 0;
> }
>
> - /* Initialize rethook if needed */
> - size = num * num_possible_cpus() * 2;
> - if (size < 0)
> + if (num <= 0)
> + return -EINVAL;
> + max = num * num_possible_cpus() * 2;
> + /* Fail if max overflows */
> + if (max <= 0)
> return -E2BIG;
>
> - fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler);
> - if (!fp->rethook)
> - return -ENOMEM;
> - for (i = 0; i < size; i++) {
> - struct fprobe_rethook_node *node;
> -
> - node = kzalloc(sizeof(*node), GFP_KERNEL);
> - if (!node) {
> - rethook_free(fp->rethook);
> - fp->rethook = NULL;
> - return -ENOMEM;
> - }
> - rethook_add_node(fp->rethook, &node->node);
> - }
> + /* Initialize rethook */
> + fp->rethook = rethook_alloc((void *)fp, fprobe_exit_handler,
> + sizeof(struct fprobe_rethook_node), max);
> + if (IS_ERR(fp->rethook))
> + return PTR_ERR(fp->rethook);
> +
> return 0;
> }
>
> static void fprobe_fail_cleanup(struct fprobe *fp)
> {
> - if (fp->rethook) {
> + if (!IS_ERR_OR_NULL(fp->rethook)) {
> /* Don't need to cleanup rethook->handler because this is not used. */
> rethook_free(fp->rethook);
> fp->rethook = NULL;
> @@ -313,7 +304,7 @@ int unregister_fprobe(struct fprobe *fp)
> * current running handlers are finished, call unregister_ftrace_function()
> * after this.
> */
> - if (fp->rethook)
> + if (!IS_ERR_OR_NULL(fp->rethook))
> rethook_free(fp->rethook);
>
> ret = unregister_ftrace_function(&fp->ops);
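
A note on the fp->rethook checks above: now that rethook_alloc() returns
ERR_PTR() codes instead of NULL, a plain "if (fp->rethook)" test would treat
an error pointer as a valid rethook. Below is a minimal userspace sketch of
the encoding, to show why IS_ERR_OR_NULL() is needed. The toy_* names and
TOY_MAX_ERRNO are hypothetical stand-ins modelled on include/linux/err.h, not
the kernel helpers themselves.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: error codes occupy the top 4095 addresses, as in the kernel. */
#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Decode the errno value back out of an error pointer. */
static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers that fall in the reserved error range. */
static inline int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* True for NULL as well, which is what the cleanup paths need. */
static inline int toy_is_err_or_null(const void *ptr)
{
	return !ptr || toy_is_err(ptr);
}
```

In this sketch toy_err_ptr(-ENOMEM) is non-NULL, so only the
toy_is_err_or_null() form keeps a cleanup path from calling a free routine on
it.
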
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index 32c3dfdb4d6a..6e1014e4f2f7 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -36,21 +36,16 @@ void rethook_flush_task(struct task_struct *tk)
> static void rethook_free_rcu(struct rcu_head *head)
> {
> struct rethook *rh = container_of(head, struct rethook, rcu);
> - struct rethook_node *rhn;
> - struct freelist_node *node;
> - int count = 1;
> -
> - node = rh->pool.head;
> - while (node) {
> - rhn = container_of(node, struct rethook_node, freelist);
> - node = node->next;
> - kfree(rhn);
> - count++;
> - }
> + struct rethook_node *nod;

Why 'nod' and not 'node'? free_rethook_node_rcu() below uses 'node' for the
same struct rethook_node pointer.

>
> - /* The rh->ref is the number of pooled node + 1 */
> - if (refcount_sub_and_test(count, &rh->ref))
> - kfree(rh);
> + do {
> + nod = objpool_pop(&rh->pool);
> + /* deref anyway since we've one extra ref grabbed */
> + if (refcount_dec_and_test(&rh->ref)) {
> + objpool_fini(&rh->pool);
> + break;
> + }
> + } while (nod);
> }
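
To check my reading of the "num + 1" refcount scheme, here is a minimal
userspace sketch of the drain loop above. It is only a model under stated
assumptions: struct toy_pool and the toy_* helpers are hypothetical, the pool
is reduced to a counter, and nothing here is concurrent. It shows that the
pool teardown runs exactly once whether or not nodes are still in flight when
the rethook is freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for objpool_head plus the owner's refcount. */
struct toy_pool {
	int pooled;     /* nodes currently sitting in the pool */
	int ref;        /* one per node, plus one held by the owner */
	int fini_calls; /* how many times the pool was torn down */
};

static void toy_init(struct toy_pool *p, int num)
{
	assert(num > 0);     /* mirrors the num <= 0 check in rethook_alloc() */
	p->pooled = num;
	p->ref = num + 1;    /* mirrors refcount_set(&rh->ref, num + 1) */
	p->fini_calls = 0;
}

/* Models objpool_pop(): false when the pool is empty. */
static bool toy_pop(struct toy_pool *p)
{
	if (p->pooled == 0)
		return false;
	p->pooled--;
	return true;
}

/* Models refcount_dec_and_test(): true when the count reaches zero. */
static bool toy_dec_and_test(struct toy_pool *p)
{
	return --p->ref == 0;
}

/* Models the drain loop in rethook_free_rcu(). */
static void toy_free(struct toy_pool *p)
{
	bool got;

	do {
		got = toy_pop(p);
		/* drop a ref even on a failed pop: that one is the owner's */
		if (toy_dec_and_test(p)) {
			p->fini_calls++;
			break;
		}
	} while (got);
}

/* Models free_rethook_node_rcu(): a late node drops its own ref. */
static void toy_recycle_late(struct toy_pool *p)
{
	if (toy_dec_and_test(p))
		p->fini_calls++;
}

/* Runs one lifecycle and returns the number of teardown calls. */
static int toy_lifecycle(int num, int in_flight)
{
	struct toy_pool p;
	int i;

	toy_init(&p, num);
	for (i = 0; i < in_flight; i++)
		toy_pop(&p);           /* nodes handed out before the free */
	toy_free(&p);
	for (i = 0; i < in_flight; i++)
		toy_recycle_late(&p);  /* nodes coming back after the free */
	return p.fini_calls;
}
```

With num = 4, toy_lifecycle() returns 1 for 0, 2, or 4 nodes in flight: either
the drain loop or the last late node performs the single teardown.
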
>
> /**
> @@ -70,54 +65,65 @@ void rethook_free(struct rethook *rh)
> call_rcu(&rh->rcu, rethook_free_rcu);
> }
>
> +static int rethook_init_node(void *nod, void *context)
> +{
> + struct rethook_node *node = nod;
> +
> + node->rethook = context;
> + return 0;
> +}
> +
> +static int rethook_fini_pool(struct objpool_head *head, void *context)
> +{
> + kfree(context);
> + return 0;
> +}
> +
> /**
> * rethook_alloc() - Allocate struct rethook.
> * @data: a data to pass the @handler when hooking the return.
> - * @handler: the return hook callback function.
> + * @handler: the return hook callback function, must NOT be NULL
> + * @gfp: default gfp for objpool allocation

There is no @gfp in the function arguments; rethook_alloc() takes only @data,
@handler, @size and @num.

> + * @size: node size: rethook node and additional data
> + * @num: number of rethook nodes to be preallocated
> *
> * Allocate and initialize a new rethook with @data and @handler.
> - * Return NULL if memory allocation fails or @handler is NULL.
> + * Return pointer of new rethook, or error codes for failures.
> + *
> * Note that @handler == NULL means this rethook is going to be freed.
> */
> -struct rethook *rethook_alloc(void *data, rethook_handler_t handler)
> +struct rethook *rethook_alloc(void *data, rethook_handler_t handler,
> + int size, int num)
> {
> - struct rethook *rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
> + struct rethook *rh;
>
> - if (!rh || !handler) {
> - kfree(rh);
> - return NULL;
> - }
> + if (!handler || num <= 0 || size < sizeof(struct rethook_node))
> + return ERR_PTR(-EINVAL);
> +
> + rh = kzalloc(sizeof(struct rethook), GFP_KERNEL);
> + if (!rh)
> + return ERR_PTR(-ENOMEM);
>
> rh->data = data;
> rh->handler = handler;
> - rh->pool.head = NULL;
> - refcount_set(&rh->ref, 1);
>
> + /* initialize the objpool for rethook nodes */
> + if (objpool_init(&rh->pool, num, size, GFP_KERNEL, rh,
> + rethook_init_node, rethook_fini_pool)) {
> + kfree(rh);
> + return ERR_PTR(-ENOMEM);
> + }
> + refcount_set(&rh->ref, num + 1);
> return rh;
> }
>
> -/**
> - * rethook_add_node() - Add a new node to the rethook.
> - * @rh: the struct rethook.
> - * @node: the struct rethook_node to be added.
> - *
> - * Add @node to @rh. User must allocate @node (as a part of user's
> - * data structure.) The @node fields are initialized in this function.
> - */
> -void rethook_add_node(struct rethook *rh, struct rethook_node *node)
> -{
> - node->rethook = rh;
> - freelist_add(&node->freelist, &rh->pool);
> - refcount_inc(&rh->ref);
> -}
> -
> static void free_rethook_node_rcu(struct rcu_head *head)
> {
> struct rethook_node *node = container_of(head, struct rethook_node, rcu);
> + struct rethook *rh = node->rethook;
>
> - if (refcount_dec_and_test(&node->rethook->ref))
> - kfree(node->rethook);
> - kfree(node);
> + if (refcount_dec_and_test(&rh->ref))
> + objpool_fini(&rh->pool);
> }
>
> /**
> @@ -132,7 +138,7 @@ void rethook_recycle(struct rethook_node *node)
> lockdep_assert_preemption_disabled();
>
> if (likely(READ_ONCE(node->rethook->handler)))
> - freelist_add(&node->freelist, &node->rethook->pool);
> + objpool_push(node, &node->rethook->pool);
> else
> call_rcu(&node->rcu, free_rethook_node_rcu);
> }
> @@ -148,7 +154,6 @@ NOKPROBE_SYMBOL(rethook_recycle);
> struct rethook_node *rethook_try_get(struct rethook *rh)
> {
> rethook_handler_t handler = READ_ONCE(rh->handler);
> - struct freelist_node *fn;
>
> lockdep_assert_preemption_disabled();
>
> @@ -165,11 +170,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> if (unlikely(!rcu_is_watching()))
> return NULL;
>
> - fn = freelist_try_get(&rh->pool);
> - if (!fn)
> - return NULL;
> -
> - return container_of(fn, struct rethook_node, freelist);
> + return (struct rethook_node *)objpool_pop(&rh->pool);
> }
> NOKPROBE_SYMBOL(rethook_try_get);
>
> --
> 2.34.1
>

--
Masami Hiramatsu (Google) <[email protected]>