Date: Tue, 31 Aug 2021 19:03:55 +0900
From: Masami Hiramatsu
To: wuqiang
Cc: naveen.n.rao@linux.ibm.com, anil.s.keshavamurthy@intel.com,
    davem@davemloft.net, mingo@kernel.org, peterz@infradead.org,
    linux-kernel@vger.kernel.org, mattwu@163.com
Subject: Re: [PATCH v2] kretprobe scalability improvement
Message-Id: <20210831190355.04095270e78368a55d29ce36@kernel.org>
In-Reply-To: <20210830173324.32507-1-wuqiang.matt@bytedance.com>
References: <20210830173324.32507-1-wuqiang.matt@bytedance.com>

On Tue, 31 Aug 2021 01:33:24 +0800 wuqiang wrote:

> kretprobe is using freelist to manage return-instances, but freelist,
> as a LIFO queue based on a singly linked list, scales badly and lowers
> the throughput of kretprobed routines, especially for high contention
> scenarios. Here's a typical case (2 XEON 8260, 48 cores/96 threads):
>
>         1X        2X        4X        6X        8X       12X       16X
>   10880312  18121228  23214783  13155457  11190217  10991228   9623992
>        24X       32X       48X       64X       96X      128X      192X
>    8484455   8376786   6766684   5698349   4113405   4528009   4081401
>
> This patch implements a scalable, lock-less and numa-aware object pool,
> which brings near-linear scalability to kretprobed routines. Tests of
> kretprobe throughput show the biggest gain as 181.5x of the original
> freelist. And extreme tests of raw queue throughput can be up to 388.8x
> over freelist. Here are the comparison results:
>
>                    1X          2X          4X          8X         16X
> freelist:   237911411   163596418    33048459    15506757    10640043
> objpool:    234799081   294086132   585290693  1164205947  2334923746
>                   24X         32X         48X         64X         96X
> freelist:     9025299     7965531     6800225     5507639     4284752
> objpool:   3508905695  1106760339  1101385147  1221763856  1211654038

Good numbers. Just for confirmation, "objpool" is the new one, right?
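
As a minimal, single-threaded userspace sketch of the per-CPU pooling idea
described above (not the patch's actual code; the real pool makes each slot
a lock-free CAS list with refcount-based ABA protection), push goes to the
local slot and pop tries the local slot before scanning the others:

#include <stdio.h>
#include <stdlib.h>

struct node {
	struct node *next;
	int id;
};

/* one LIFO slot per CPU; the real patch allocates these NUMA-locally */
struct slot {
	struct node *head;
};

#define NR_CPUS 4
static struct slot slots[NR_CPUS];

/* push always targets the local slot, so no cross-CPU cache contention */
static void pool_push(struct node *n, int cpu)
{
	n->next = slots[cpu].head;
	slots[cpu].head = n;
}

/* pop prefers the local slot and falls back to scanning the other slots */
static struct node *pool_pop(int cpu)
{
	for (int i = 0; i < NR_CPUS; i++) {
		struct slot *s = &slots[(cpu + i) % NR_CPUS];
		struct node *n = s->head;

		if (n) {
			s->head = n->next;
			return n;
		}
	}
	return NULL;	/* pool empty: kretprobe would count this as a miss */
}

int main(void)
{
	/* pre-populate the pool, spreading objects across the slots */
	for (int i = 0; i < 8; i++) {
		struct node *n = calloc(1, sizeof(*n));

		if (!n)
			return 1;
		n->id = i;
		pool_push(n, i % NR_CPUS);
	}

	struct node *n = pool_pop(1);	/* hits the local slot of "CPU" 1 */
	printf("popped node %d\n", n ? n->id : -1);
	return 0;
}

The near-linear scaling in the numbers above comes from the push path
touching only the local slot; cross-slot traffic only occurs on the pop
fallback when the local slot runs dry.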

>
> The object pool is a percpu-extended version of the original freelist,
> with compact memory footprints and balanced performance results for 3
> test cases: nonblockable retrieval (most common kretprobe use cases),
> bulk retrieval in a row (multiple-threaded blockable kretprobe), huge
> misses (preallocated objects much less than what's required). More
> information is available at: https://github.com/mattwuq/kretprobe.

The code itself (what you are doing) is good to me. But the name and
the code don't match: the 'freelist' is a 'list' even if kretprobe uses
it for managing its instances, while your new 'freelist' is pooling
objects. That is not a 'list' anymore.

I would suggest introducing a completely new 'objpool.h' and
lib/objpool.c for this scalable object pool, instead of overwriting the
existing freelist. (I think __freelist_init_slots() and the other
'cold-path' functions do not need to be inlined.) Then replace the
freelists in kretprobe with it. If the current 'freelist.h' is no
longer used afterwards, remove it in a separate patch.

Thank you,

>
> Changes since V1 (Aug 30, 2021):
>  1) reformat to a single patch as Masami Hiramatsu suggested
>  2) use __vmalloc_node to replace vmalloc_node for vmalloc
>  3) a few minor fixes: typo and coding-style issues
>
> Signed-off-by: wuqiang
> ---
>  include/linux/freelist.h | 522 ++++++++++++++++++++++++++++++++++++---
>  include/linux/kprobes.h  |   2 +-
>  kernel/kprobes.c         |  83 ++++---
>  3 files changed, 537 insertions(+), 70 deletions(-)
>
> diff --git a/include/linux/freelist.h b/include/linux/freelist.h
> index fc1842b96469..7306a34e3811 100644
> --- a/include/linux/freelist.h
> +++ b/include/linux/freelist.h
> @@ -2,32 +2,376 @@
>  #ifndef FREELIST_H
>  #define FREELIST_H
>
> +#include
> +#include
>  #include
>
>  /*
> - * Copyright: cameron@moodycamel.com
> + * freelist: a lock-less version of object pool implementation
>   *
> - * A simple CAS-based lock-free free list. Not the fastest thing in the world
> - * under heavy contention, but simple and correct (assuming nodes are never
> - * freed until after the free list is destroyed), and fairly speedy under low
> - * contention.
> + * Copyright: cameron@moodycamel.com, wuqiang.matt@bytedance.com
>   *
> - * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
> + * The object pool is a scalable implementaion of high performance queue
> + * for objects allocation and reclamation, such as kretprobe instances.
> + *
> + * It's basd on cameron's CAS-based lock-free freelist:
> + * https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
> + *
> + * With leveraging per-cpu lockless queue to mitigate hot spots of memory
> + * contention, it could deliver near-linear scalability for high parallel
> + * loads. The object pool are best suited for the following cases:
> + * 1) memory allocation or reclamation is prohibited or too expensive
> + * 2) the objects are allocated/used/reclaimed very frequently
> + *
> + * Before using, you must be aware of it's limitations:
> + * 1) Memory of all objects won't be freed until pool is de-allocated
> + * 2) Order and fairness are not guaranteed. So some threads might stay
> + *    hungry much longer than other competitors
> + *
> + * Objects could be pre-allocated during initialization or filled later
> + * with user's buffer or private allocations. Mixing different objects
> + * of self-managed/batched/manually-added is NOT recommended, though
> + * it's supported.
For mixed case, the caller should take care of the > + * releasing of objects or user pool. > + * > + * Typical use cases: > + * > + * 1) self-managed objects > + * > + * obj_init(): > + * static int obj_init(void *context, struct freelist_node *obj) > + * { > + * struct my_node *node; > + * node = container_of(obj, struct my_node, obj); > + * do_init_node(context, node); > + * return 0; > + * } > + * > + * main(): > + * freelist_init(&fh, num_possible_cpus() * 4, 16, GFP_KERNEL, context, obj_init); > + * > + * > + * obj = freelist_pop(&fh); > + * do_something_with(obj); > + * freelist_push(obj, &fh); > + * > + * > + * freelist_fini(&fh, NULL, NULL); > + * > + * 2) batced with user's buffer > + * > + * obj_init(): > + * static int obj_init(void *context, struct freelist_node *obj) > + * { > + * struct my_node *node; > + * node = container_of(obj, struct my_node, obj); > + * do_init_node(context, node); > + * return 0; > + * } > + * > + * free_buf(): > + * static int free_buf(void *context, void *obj, int user, int element) > + * { > + * if (obj && user && !element) > + * kfree(obj); > + * } > + * > + * main(): > + * freelist_init(&fh, num_possible_cpus() * 4, 0, GFP_KERNEL, 0, 0); > + * buffer = kmalloc(size, ...); > + * freelist_populate(&fh, buffer, size, 16, context, obj_init); > + * > + * > + * obj = freelist_pop(&fh); > + * do_something_with(obj); > + * freelist_push(obj, &fh); > + * > + * > + * freelist_fini(&fh, context, free_buf); > + * > + * 3) manually added with user objects > + * > + * free_obj(): > + * static int free_obj(void *context, void *obj, int user, int element) > + * { > + * struct my_node *node; > + * node = container_of(obj, struct my_node, obj); > + * if (obj && user && element) > + * kfree(node); > + * } > + * > + * main(): > + * freelist_init(&fh, num_possible_cpus() * 4, 0, 0, GFP_KERNEL, 0, 0); > + * for () { > + * node = kmalloc(objsz, ...); > + * do_init_node(node); > + * freelist_add_scattered(&node.obj, oh); > + * } > + * > + * > + * obj = freelist_pop(&fh); > + * do_something_with(obj); > + * freelist_push(obj, &fh); > + * > + * > + * freelist_fini(&fh, context, free_obj); > */ > > +/* > + * common componment of every node > + */ > struct freelist_node { > - atomic_t refs; > - struct freelist_node *next; > + struct freelist_node *next; > + atomic_t refs; > +}; > + > +#define REFS_ON_FREELIST 0x80000000 > +#define REFS_MASK 0x7FFFFFFF > + > +/* > + * freelist_slot: per-cpu singly linked list > + * > + * All pre-allocated objects are next to freelist_slot. Objects and > + * freelist_slot are to be allocated from the memory pool local node. 
> + */ > +struct freelist_slot { > + struct freelist_node *fs_head; /* head of percpu list */ > }; > +#define SLOT_OBJS(s) ((void *)(s) + sizeof(struct freelist_slot)) > > +/* > + * freelist_head: object pooling metadata > + */ > struct freelist_head { > - struct freelist_node *head; > + uint32_t fh_objsz; /* object & element size */ > + uint32_t fh_nobjs; /* total objs in freelist */ > + uint32_t fh_ncpus; /* num of possible cpus */ > + uint32_t fh_in_slot:1; /* objs alloced with slots */ > + uint32_t fh_vmalloc:1; /* alloc from vmalloc zone */ > + gfp_t fh_gfp; /* k/vmalloc gfp flags */ > + uint32_t fh_sz_pool; /* user pool size in byes */ > + void *fh_pool; /* user managed memory pool */ > + struct freelist_slot **fh_slots; /* array of percpu slots */ > + uint32_t *fh_sz_slots; /* size in bytes of slots */ > }; > > -#define REFS_ON_FREELIST 0x80000000 > -#define REFS_MASK 0x7FFFFFFF > +typedef int (*freelist_init_node_cb)(void *context, struct freelist_node *); > + > +/* attach object to percpu slot */ > +static inline void > +__freelist_insert_node(struct freelist_node *node, struct freelist_slot *slot) > +{ > + atomic_set_release(&node->refs, 1); > + node->next = slot->fs_head; > + slot->fs_head = node; > +} > + > +/* allocate and initialize percpu slots */ > +static inline int > +__freelist_init_slots(struct freelist_head *head, uint32_t nobjs, > + void *context, freelist_init_node_cb objinit) > +{ > + uint32_t i, objsz, cpus = head->fh_ncpus; > + gfp_t gfp = head->fh_gfp; > + > + /* allocate array for percpu slots */ > + head->fh_slots = kzalloc(cpus * sizeof(uint32_t) + > + cpus * sizeof(void *), gfp); > + if (!head->fh_slots) > + return -ENOMEM; > + head->fh_sz_slots = (uint32_t *)&head->fh_slots[cpus]; > + > + /* aligned object size by sizeof(void *) */ > + objsz = ALIGN(head->fh_objsz, sizeof(void *)); > + > + /* shall we allocate objects along with freelist_slot */ > + if (objsz) > + head->fh_in_slot = 1; > + > + /* initialize per-cpu slots */ > + for (i = 0; i < cpus; i++) { > + struct freelist_slot *slot; > + uint32_t j, n, s; > + > + /* compute how many objects to be managed by this slot */ > + n = nobjs / cpus; > + if (i < (nobjs % cpus)) > + n++; > + s = sizeof(struct freelist_slot) + objsz * n; > + > + /* decide which zone shall the slot be allocated from */ > + if (0 == i) { > + if ((gfp & GFP_ATOMIC) || s < PAGE_SIZE) > + head->fh_vmalloc = 0; > + else > + head->fh_vmalloc = 1; > + } > + > + /* allocate percpu slot & objects from local memory */ > + if (head->fh_vmalloc) > + slot = __vmalloc_node(s, 1, gfp, cpu_to_node(i), > + __builtin_return_address(0)); > + else > + slot = kmalloc_node(s, gfp, cpu_to_node(i)); > + if (!slot) > + return -ENOMEM; > + > + head->fh_slots[i] = slot; > + head->fh_sz_slots[i] = s; > + > + /* initialize percpu slot for the i-th cpu */ > + memset(slot, 0, s); > + /* initialize pre-allocated record entries */ > + for (j = 0; head->fh_in_slot && j < n; j++) { > + struct freelist_node *node; > + node = SLOT_OBJS(slot) + j * objsz; > + if (objinit) { > + int rc = objinit(context, node); > + if (rc) > + return rc; > + } > + __freelist_insert_node(node, slot); > + head->fh_nobjs++; > + } > + } > + > + return 0; > +} > > -static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list) > +/* cleanup all percpu slots of the object pool */ > +static inline void __freelist_fini_slots(struct freelist_head *head) > +{ > + uint32_t i; > + > + if (!head->fh_slots) > + return; > + > + for (i = 0; i < head->fh_ncpus; i++) { > + if 
(!head->fh_slots[i]) > + continue; > + if (head->fh_vmalloc) > + vfree(head->fh_slots[i]); > + else > + kfree(head->fh_slots[i]); > + } > + kfree(head->fh_slots); > + head->fh_slots = NULL; > + head->fh_sz_slots = NULL; > +} > + > +/** > + * freelist_init: initialize object pool and pre-allocate objects > + * > + * args: > + * @fh: the object pool to be initialized, declared by the caller > + * @nojbs: total objects to be managed by this object pool > + * @ojbsz: size of an object, to be pre-allocated if objsz is not 0 > + * @gfp: gfp flags of caller's context for memory allocation > + * @context: user context for object initialization callback > + * @objinit: object initialization callback > + * > + * return: > + * 0 for success, otherwise error code > + * > + * All pre-allocated objects are to be zeroed. Caller should do extra > + * initialization before using. > + */ > +static inline int > +freelist_init(struct freelist_head *head, int nobjs, int objsz, gfp_t gfp, > + void *context, freelist_init_node_cb objinit) > +{ > + memset(head, 0, sizeof(struct freelist_head)); > + head->fh_ncpus = num_possible_cpus(); > + head->fh_objsz = objsz; > + head->fh_gfp = gfp & ~__GFP_ZERO; > + > + if (__freelist_init_slots(head, nobjs, context, objinit)) { > + __freelist_fini_slots(head); > + return -ENOMEM; > + } > + > + return 0; > +} > + > +/** > + * freelist_add_scattered: adding pre-allocated objects to objects pool > + * during initialization. it will try to balance the object numbers of > + * all slots. > + * > + * args: > + * @node: object pointer to be added to object pool > + * @head: object pool > + * > + * return: > + * 0 or error code > + * > + * freelist_add_scattered doesn't handle race conditions, can only be > + * called during object pool initialization > + */ > +static inline int > +freelist_add_scattered(struct freelist_node *node, struct freelist_head *head) > +{ > + uint32_t cpu; > + > + /* try balance object numbers among slots */ > + cpu = head->fh_nobjs % head->fh_ncpus; > + __freelist_insert_node(node, head->fh_slots[cpu]); > + head->fh_nobjs++; > + > + return 0; > +} > + > +/** > + * freelist_populate: add objects from user provided pool in batch > + * * > + * args: > + * @oh: object pool > + * @buf: user buffer for pre-allocated objects > + * @size: size of user buffer > + * @objsz: size of object & element > + * @context: user context for objinit callback > + * @objinit: object initialization callback > + * > + * return: > + * 0 or error code > + */ > +static inline int > +freelist_populate(struct freelist_head *head, void *buf, int size, int objsz, > + void *context, freelist_init_node_cb objinit) > +{ > + int used = 0; > + > + if (head->fh_pool || !buf || !objsz || size < objsz) > + return -EINVAL; > + if (head->fh_objsz && head->fh_objsz != objsz) > + return -EINVAL; > + > + WARN_ON_ONCE(((unsigned long)buf) & (sizeof(void *) - 1)); > + WARN_ON_ONCE(((uint32_t)objsz) & (sizeof(void *) - 1)); > + > + while (used + objsz <= size) { > + struct freelist_node *node = buf + used; > + if (objinit) { > + int rc = objinit(context, node); > + if (rc) > + return rc; > + } > + if (freelist_add_scattered(node, head)) > + break; > + used += objsz; > + } > + > + if (!used) > + return -ENOENT; > + > + head->fh_pool = buf; > + head->fh_sz_pool = size; > + head->fh_objsz = objsz; > + > + return 0; > +} > + > +static inline void __freelist_cas_add(struct freelist_node *node, struct freelist_slot *slot) > { > /* > * Since the refcount is zero, and nobody can increase it once it's > @@ 
-43,25 +387,26 @@ static inline void __freelist_add(struct freelist_node *node, struct freelist_he > * who puts the refcount back to zero (which could be us, hence the > * loop). > */ > - struct freelist_node *head = READ_ONCE(list->head); > + struct freelist_node *head = READ_ONCE(slot->fs_head); > > for (;;) { > WRITE_ONCE(node->next, head); > atomic_set_release(&node->refs, 1); > > - if (!try_cmpxchg_release(&list->head, &head, node)) { > - /* > - * Hmm, the add failed, but we can only try again when > - * the refcount goes back to zero. > - */ > - if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1) > - continue; > - } > - return; > + if (try_cmpxchg_release(&slot->fs_head, &head, node)) > + break; > + > + /* > + * Hmm, the add failed, but we can only try again when refcount > + * goes back to zero (with REFS_ON_FREELIST set). > + */ > + if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) != 1) > + break; > } > } > > -static inline void freelist_add(struct freelist_node *node, struct freelist_head *list) > +/* adding object to slot */ > +static inline int __freelist_add_slot(struct freelist_node *node, struct freelist_slot *slot) > { > /* > * We know that the should-be-on-freelist bit is 0 at this point, so > @@ -72,13 +417,34 @@ static inline void freelist_add(struct freelist_node *node, struct freelist_head > * Oh look! We were the last ones referencing this node, and we > * know we want to add it to the free list, so let's do it! > */ > - __freelist_add(node, list); > + __freelist_cas_add(node, slot); > } > + > + return 0; > } > > -static inline struct freelist_node *freelist_try_get(struct freelist_head *list) > +/** > + * freelist_push: reclaim the object and return back to objects pool > + * > + * args: > + * @node: object pointer to be pushed to object pool > + * @head: object pool > + * > + * return: > + * 0 (freelist_push never fail) > + * > + * freelist_push() can be nested (irp/softirq/preemption) > + */ > +static inline int freelist_push(struct freelist_node *node, struct freelist_head *head) > { > - struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head); > + int cpu = raw_smp_processor_id(); > + return __freelist_add_slot(node, head->fh_slots[cpu]); > +} > + > +/* try to retrieve object from slot */ > +static inline struct freelist_node *__freelist_pop_slot(struct freelist_slot *slot) > +{ > + struct freelist_node *prev, *next, *head = smp_load_acquire(&slot->fs_head); > unsigned int refs; > > while (head) { > @@ -86,7 +452,7 @@ static inline struct freelist_node *freelist_try_get(struct freelist_head *list) > refs = atomic_read(&head->refs); > if ((refs & REFS_MASK) == 0 || > !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) { > - head = smp_load_acquire(&list->head); > + head = smp_load_acquire(&slot->fs_head); > continue; > } > > @@ -96,7 +462,7 @@ static inline struct freelist_node *freelist_try_get(struct freelist_head *list) > * it changing between now and the time we do the CAS. > */ > next = READ_ONCE(head->next); > - if (try_cmpxchg_acquire(&list->head, &head, next)) { > + if (try_cmpxchg_acquire(&slot->fs_head, &head, next)) { > /* > * Yay, got the node. 
This means it was on the list, > * which means should-be-on-freelist must be false no > @@ -120,10 +486,108 @@ static inline struct freelist_node *freelist_try_get(struct freelist_head *list) > */ > refs = atomic_fetch_add(-1, &prev->refs); > if (refs == REFS_ON_FREELIST + 1) > - __freelist_add(prev, list); > + __freelist_cas_add(prev, slot); > + } > + > + return NULL; > +} > + > +/** > + * freelist_pop: allocate an object from objects pool > + * > + * args: > + * @head: object pool > + * > + * return: > + * node: NULL if failed (object pool is empty) > + * > + * freelist_pop can be nesed, and guaranteed to be deadlock-free. > + * So it can be called in any context, like irq/softirq/nmi. > + */ > +static inline struct freelist_node *freelist_pop(struct freelist_head *head) > +{ > + struct freelist_node *node; > + int i, cpu = raw_smp_processor_id(); > + > + for (i = 0; i < head->fh_ncpus; i++) { > + struct freelist_slot *slot; > + slot = head->fh_slots[cpu]; > + node = __freelist_pop_slot(slot); > + if (node) > + return node; > + if (++cpu >= head->fh_ncpus) > + cpu = 0; > } > > return NULL; > } > > +/* whether this object is from user buffer (batched adding) */ > +static inline int freelist_is_inpool(void *obj, struct freelist_head *fh) > +{ > + return (obj && obj >= fh->fh_pool && > + obj < fh->fh_pool + fh->fh_sz_pool); > +} > + > +/* whether this object is pre-allocated with percpu slots */ > +static inline int freelist_is_inslot(void *obj, struct freelist_head *fh) > +{ > + uint32_t i; > + > + for (i = 0; i < fh->fh_ncpus; i++) { > + void *ptr = fh->fh_slots[i]; > + if (obj && obj >= ptr && obj < ptr + fh->fh_sz_slots[i]) > + return 1; > + } > + > + return 0; > +} > + > +/** > + * freelist_fini: cleanup the whole object pool (releasing all objects) > + * > + * args: > + * @head: object pool > + * @context: user provided value for the callback of release() funciton > + * @release: user provided callback for resource cleanup or statistics > + * > + * prototype of release callback: > + * static int release(void *context, void *obj, int user, int element); > + * args: > + * context: user provided value > + * obj: the object (element or buffer) to be cleaned up > + * user: the object is manually provided by user > + * element: obj is an object or user-provided buffer > + */ > +static inline void freelist_fini(struct freelist_head *head, void *context, > + int (*release)(void *, void *, int, int)) > +{ > + uint32_t i; > + > + if (!head->fh_slots) > + return; > + > + for (i = 0; release && i < head->fh_ncpus; i++) { > + void *obj; > + if (!head->fh_slots[i]) > + continue; > + do { > + obj = __freelist_pop_slot(head->fh_slots[i]); > + if (obj) { > + int user = !freelist_is_inpool(obj, head) && > + !freelist_is_inslot(obj, head); > + release(context, obj, user, 1); > + } > + } while (obj); > + } > + > + if (head->fh_pool && release) { > + release(context, head->fh_pool, 1, 0); > + head->fh_pool = NULL; > + head->fh_sz_pool = 0; > + } > + > + __freelist_fini_slots(head); > +} > + > #endif /* FREELIST_H */ > diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h > index e4f3bfe08757..c63b8981d4c5 100644 > --- a/include/linux/kprobes.h > +++ b/include/linux/kprobes.h > @@ -140,6 +140,7 @@ static inline int kprobe_ftrace(struct kprobe *p) > */ > struct kretprobe_holder { > struct kretprobe *rp; > + struct freelist_head fh; > refcount_t ref; > }; > > @@ -150,7 +151,6 @@ struct kretprobe { > int maxactive; > int nmissed; > size_t data_size; > - struct freelist_head freelist; > struct 
kretprobe_holder *rph; > }; > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > index 790a573bbe00..12879a3c582f 100644 > --- a/kernel/kprobes.c > +++ b/kernel/kprobes.c > @@ -1211,10 +1211,12 @@ NOKPROBE_SYMBOL(kprobes_inc_nmissed_count); > static void free_rp_inst_rcu(struct rcu_head *head) > { > struct kretprobe_instance *ri = container_of(head, struct kretprobe_instance, rcu); > + struct kretprobe_holder *rph = ri->rph; > > - if (refcount_dec_and_test(&ri->rph->ref)) > - kfree(ri->rph); > - kfree(ri); > + if (refcount_dec_and_test(&rph->ref)) { > + freelist_fini(&rph->fh, NULL, NULL); > + kfree(rph); > + } > } > NOKPROBE_SYMBOL(free_rp_inst_rcu); > > @@ -1223,9 +1225,10 @@ static void recycle_rp_inst(struct kretprobe_instance *ri) > struct kretprobe *rp = get_kretprobe(ri); > > if (likely(rp)) { > - freelist_add(&ri->freelist, &rp->freelist); > - } else > + freelist_push(&ri->freelist, &rp->rph->fh); > + } else { > call_rcu(&ri->rcu, free_rp_inst_rcu); > + } > } > NOKPROBE_SYMBOL(recycle_rp_inst); > > @@ -1280,23 +1283,19 @@ NOKPROBE_SYMBOL(kprobe_flush_task); > > static inline void free_rp_inst(struct kretprobe *rp) > { > - struct kretprobe_instance *ri; > - struct freelist_node *node; > - int count = 0; > - > - node = rp->freelist.head; > - while (node) { > - ri = container_of(node, struct kretprobe_instance, freelist); > - node = node->next; > - > - kfree(ri); > - count++; > - } > + struct kretprobe_holder *rph = rp->rph; > + struct freelist_node *fn; > > - if (refcount_sub_and_test(count, &rp->rph->ref)) { > - kfree(rp->rph); > - rp->rph = NULL; > - } > + rp->rph = NULL; > + do { > + /* must do pop() first since we have one extra ref grabbed */ > + fn = freelist_pop(&rph->fh); > + if (refcount_dec_and_test(&rph->ref)) { > + freelist_fini(&rph->fh, NULL, NULL); > + kfree(rph); > + break; > + } > + } while (fn); > } > > /* Add the new probe to ap->list */ > @@ -1922,19 +1921,18 @@ NOKPROBE_SYMBOL(__kretprobe_trampoline_handler) > static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs) > { > struct kretprobe *rp = container_of(p, struct kretprobe, kp); > - struct kretprobe_instance *ri; > struct freelist_node *fn; > + struct kretprobe_instance *ri; > > - fn = freelist_try_get(&rp->freelist); > + fn = freelist_pop(&rp->rph->fh); > if (!fn) { > - rp->nmissed++; > + atomic_inc((atomic_t *)&rp->nmissed); > return 0; > } > - > ri = container_of(fn, struct kretprobe_instance, freelist); > > if (rp->entry_handler && rp->entry_handler(ri, regs)) { > - freelist_add(&ri->freelist, &rp->freelist); > + freelist_push(fn, &rp->rph->fh); > return 0; > } > > @@ -1980,10 +1978,19 @@ int kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long o > return 0; > } > > +static int kretprobe_init_inst(void *context, struct freelist_node *fn) > +{ > + struct kretprobe_instance *ri; > + > + ri = container_of(fn, struct kretprobe_instance, freelist); > + ri->rph = context; > + > + return 0; > +} > + > int register_kretprobe(struct kretprobe *rp) > { > int ret; > - struct kretprobe_instance *inst; > int i; > void *addr; > > @@ -2017,24 +2024,20 @@ int register_kretprobe(struct kretprobe *rp) > rp->maxactive = num_possible_cpus(); > #endif > } > - rp->freelist.head = NULL; > + > rp->rph = kzalloc(sizeof(struct kretprobe_holder), GFP_KERNEL); > if (!rp->rph) > return -ENOMEM; > > - rp->rph->rp = rp; > - for (i = 0; i < rp->maxactive; i++) { > - inst = kzalloc(sizeof(struct kretprobe_instance) + > - rp->data_size, GFP_KERNEL); > - if (inst == NULL) { > - 
refcount_set(&rp->rph->ref, i);
> -			free_rp_inst(rp);
> -			return -ENOMEM;
> -		}
> -		inst->rph = rp->rph;
> -		freelist_add(&inst->freelist, &rp->freelist);
> +	if (freelist_init(&rp->rph->fh, rp->maxactive, rp->data_size +
> +			  sizeof(struct kretprobe_instance), GFP_KERNEL,
> +			  rp->rph, kretprobe_init_inst)) {
> +		kfree(rp->rph);
> +		rp->rph = NULL;
> +		return -ENOMEM;
>  	}
> -	refcount_set(&rp->rph->ref, i);
> +	refcount_set(&rp->rph->ref, rp->maxactive + 1);
> +	rp->rph->rp = rp;
>
>  	rp->nmissed = 0;
>  	/* Establish function entry probe point */
> --
> 2.25.1
>

-- 
Masami Hiramatsu
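
As a rough sketch of the objpool.h interface suggested above (simply the
patch's freelist_* API renamed, with the cold-path helpers moved out of
line into lib/objpool.c; all names and signatures here are illustrative
assumptions, not an existing kernel API):

/* include/linux/objpool.h -- hypothetical sketch, not an existing header */
#ifndef _LINUX_OBJPOOL_H
#define _LINUX_OBJPOOL_H

#include <linux/types.h>
#include <linux/atomic.h>
#include <linux/gfp.h>

/* same node layout as the patch's struct freelist_node */
struct objpool_node {
	struct objpool_node *next;
	atomic_t refs;
};

/* would mirror the patch's struct freelist_head under the new name */
struct objpool_head;

typedef int (*objpool_init_node_cb)(void *context, struct objpool_node *node);

/* cold path: not inlined, implemented in lib/objpool.c */
int objpool_init(struct objpool_head *head, int nobjs, int objsz, gfp_t gfp,
		 void *context, objpool_init_node_cb objinit);
int objpool_populate(struct objpool_head *head, void *buf, int size, int objsz,
		     void *context, objpool_init_node_cb objinit);
int objpool_add_scattered(struct objpool_node *node, struct objpool_head *head);
void objpool_fini(struct objpool_head *head, void *context,
		  int (*release)(void *, void *, int, int));

/*
 * hot path: in a real header these would stay static inline, exactly as
 * freelist_pop()/freelist_push() are in the patch above
 */
struct objpool_node *objpool_pop(struct objpool_head *head);
int objpool_push(struct objpool_node *node, struct objpool_head *head);

#endif /* _LINUX_OBJPOOL_H */

kernel/kprobes.c would then call objpool_pop()/objpool_push() at the points
where this patch calls freelist_pop()/freelist_push(), and
include/linux/freelist.h could be removed in a follow-up patch once nothing
else uses it.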