Date: Fri, 13 Oct 2023 10:59:15 +0900
From: Masami Hiramatsu (Google)
To: "wuqiang.matt"
Cc: linux-trace-kernel@vger.kernel.org, davem@davemloft.net,
    anil.s.keshavamurthy@intel.com, naveen.n.rao@linux.ibm.com,
    rostedt@goodmis.org, peterz@infradead.org, akpm@linux-foundation.org,
    sander@svanheule.net, ebiggers@google.com, dan.j.williams@intel.com,
    jpoimboe@kernel.org, linux-kernel@vger.kernel.org, lkp@intel.com,
    mattwu@163.com
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
Message-Id: <20231013105915.6ea704e738de484558b55ce1@kernel.org>
In-Reply-To: <1fb679ad-04b0-c1f6-1675-1921539382bd@bytedance.com>
References: <20230905015255.81545-1-wuqiang.matt@bytedance.com>
    <20230905015255.81545-2-wuqiang.matt@bytedance.com>
    <20230925184212.5a0b6c627a837b50424c4a75@kernel.org>
    <5ba828ef-e5f9-45bc-b0c7-b8b71ae8748a@bytedance.com>
    <20231012230237.45726dfad125cf0e8d00ba53@kernel.org>
    <1fb679ad-04b0-c1f6-1675-1921539382bd@bytedance.com>

On Fri, 13 Oct 2023 01:36:05 +0800
"wuqiang.matt" wrote:

> On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
> > Hi Wuqiang,
> >
> > On Mon, 9 Oct 2023 17:23:34 +0800
> > wuqiang wrote:
> >
> >> Hello Masami,
> >>
> >> Just got time for the new patch and got that ages[] was removed. ages[] is
> >> introduced the way like 2-phase commit to keep consistency and must be kept.
> >>
> >> Thinking of the following 2 cases that two cpu nodes are operating the same
> >> objpool_slot simultaneously:
> >>
> >> Case 1:
> >>
> >>   NODE 1:                           NODE 2:
> >>   push to an empty slot             pop will get wrong value
> >>
> >>   try_cmpxchg_acquire(&slot->tail, &tail, next)
> >>                                     try_cmpxchg_release(&slot->head, &head, head + 1)
> >>                                     return slot->entries[head & slot->mask]
> >>   WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> >
> > Today, I rethink the algorithm.
> >
> > For this case, we can just add a `commit` to the slot for committing the
> > tail instead of the ages[].
> >
> >  CPU1                                CPU2
> >  push to an empty slot               pop from the same slot
> >
> >  commit = tail = slot->tail;
> >  next = tail + 1;
> >  try_cmpxchg_acquire(&slot->tail,
> >                      &tail, next);
> >                                      while (head != commit) -> NG1
> >  WRITE_ONCE(slot->entries[tail & slot->mask],
> >             obj)
> >                                      while (head != commit) -> NG2
> >  WRITE_ONCE(&slot->commit, next);
> >                                      while (head != commit) -> OK
> >                                      try_cmpxchg_acquire(&slot->head,
> >                                                          &head, next);
> >                                      return slot->entries[head & slot->mask]
> >
> > So at the NG1 and NG2 timing, the pop will fail.
> >
> > This does not expect the nested "push" case because the reserve-commit block
> > will be interrupt disabled. This doesn't support NMI but that is OK.

> If 2 pushes are performed in a row, slot->commit will be 'tail + 2', and CPU2
> won't meet the condition "head == commit".

No, that doesn't happen, because pop() may fetch the object from another
CPU's slot, but push() only happens on its own slot. So push() on the same
slot is never in parallel.

> Since slot->commit is always synced to slot->tail after a successful push,
> should pop check "tail != commit" ?

No, pop() only needs to check "head != commit".
If "head == commit", this slot is empty or the slot owner is trying to
push() (not committed yet).

> In this case when a push is ongoing with
> the slot, pop() has to wait for its completion even if there are objects in
> the same slot.

No, pop() will go to other slots to find an available object.

> > If we need to support such NMI or remove irq-disable, we can do the
> > following (please remember the push operation only happens on the slot
> > owned by that CPU. So this is a per-cpu process)
> >
> > ---
> > do {
> >   commit = tail = slot->tail;
> >   next = tail + 1;
> > } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> >
> > WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >
> > // At this point, "commit == slot->tail - 1" in this nest level.
> > // Only in the outmost (non-nested) case, "commit == slot->commit".
> > if (commit != slot->commit)
> >   return; // do nothing. it must be updated by the outmost one.
> >
> > // catch up commit to tail.
> > do {
> >   commit = READ_ONCE(slot->tail);
> >   WRITE_ONCE(slot->commit, commit);
> >   // note that someone can interrupt this loop too.
> > } while (commit != READ_ONCE(slot->tail));
> > ---

> Yes, I got you: push can only happen on the local node and the first try has
> the right to extend 'commit'. It does resolve nested push attempts, and
> preemption must be disabled.

Yes.

> The overflow/ABA issue can still happen if the cpu is being interrupted for
> a long time.

Yes, but ages[] also cannot avoid that. (Note that head, tail and commit are
32bit counters, and the per-cpu slot ring is big enough to store the pointers
of all objects.)

Thank you,

> > For the rethook this may be too much.
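The commit-based scheme above can be modeled in userspace C11 for clarity. This is only a sketch under stated assumptions: slot_push()/slot_pop() are illustrative names (not the kernel API), C11 atomics stand in for try_cmpxchg_acquire()/smp_store_release(), and push is assumed never to nest (the irq-disabled case above):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative capacity: a power of two larger than the max number of
 * live objects, so the ring can never be completely full. */
#define CAP 8

struct slot {
        _Atomic uint32_t head;   /* consumed position (any CPU) */
        _Atomic uint32_t tail;   /* reserved position (owner CPU only) */
        _Atomic uint32_t commit; /* published position (owner CPU only) */
        uint32_t mask;           /* CAP - 1 */
        void *entries[CAP];
};

/* Owner-CPU push, assumed non-nested: reserve a tail position, store the
 * object, then publish it by advancing 'commit' with a release store.
 * The release store plays the role of the removed ages[] 2-phase commit. */
static void slot_push(struct slot *s, void *obj)
{
        uint32_t tail = atomic_load(&s->tail);
        uint32_t next;

        do {
                next = tail + 1;
        } while (!atomic_compare_exchange_weak(&s->tail, &tail, next));

        s->entries[tail & s->mask] = obj;
        atomic_store_explicit(&s->commit, next, memory_order_release);
}

/* Any-CPU pop: only entries below 'commit' are visible. head == commit
 * means the slot is empty or a push is still in flight (NG1/NG2 above). */
static void *slot_pop(struct slot *s)
{
        uint32_t head = atomic_load(&s->head);

        while (head != atomic_load_explicit(&s->commit, memory_order_acquire)) {
                void *obj = s->entries[head & s->mask];

                if (atomic_compare_exchange_weak(&s->head, &head, head + 1))
                        return obj;
                /* head was advanced by a racing pop; retry with new value */
        }
        return NULL;
}
```

In the NG1/NG2 windows, pop() simply sees head == commit and returns NULL, instead of reading a not-yet-written entry.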
> >
> > Thank you,
> >
> >> Case 2:
> >>
> >>   NODE 1:                           NODE 2:
> >>   push to slot w/ 1 obj             pop will get wrong value
> >>
> >>                                     try_cmpxchg_release(&slot->head, &head, head + 1)
> >>   try_cmpxchg_acquire(&slot->tail, &tail, next)
> >>   WRITE_ONCE(slot->entries[tail & slot->mask], obj)
> >>                                     return slot->entries[head & slot->mask]

> The pre-condition should be: CPU 1 tries to push to a full slot, in this case
> tail = head + capacity but tail & mask == head & mask

> >>
> >> Regards,
> >> wuqiang
> >>
> >> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
> >>> Hi Wuqiang,
> >>>
> >>> On Tue, 5 Sep 2023 09:52:51 +0800
> >>> "wuqiang.matt" wrote:
> >>>
> >>>> The object pool is a scalable implementation of a high performance
> >>>> queue for object allocation and reclamation, such as kretprobe
> >>>> instances.
> >>>>
> >>>> With leveraging percpu ring-array to mitigate the hot spot of memory
> >>>> contention, it could deliver near-linear scalability for highly
> >>>> parallel scenarios. The objpool is best suited for the following cases:
> >>>> 1) Memory allocation or reclamation is prohibited or too expensive
> >>>> 2) Consumers are of different priorities, such as irqs and threads
> >>>>
> >>>> Limitations:
> >>>> 1) Maximum objects (capacity) is determined during pool initialization
> >>>>    and can't be modified (extended) after objpool creation
> >>>> 2) The memory of objects won't be freed until the objpool is finalized
> >>>> 3) Object allocation (pop) may fail after trying all cpu slots
> >>>
> >>> I made a simplifying patch on this by (mainly) removing the ages array.
> >>> I also renamed local variables to use more readable names, like slot,
> >>> pool, and obj.
> >>>
> >>> Here are the results from running test_objpool.ko.
> >>>
> >>> Original:
> >>> [   50.500235] Summary of testcases:
> >>> [   50.503139]   duration: 1027135us  hits: 30628293  miss: 0  sync: percpu objpool
> >>> [   50.510416]   duration: 1047977us  hits: 30453023  miss: 0  sync: percpu objpool from vmalloc
> >>> [   50.517421]   duration: 1047098us  hits: 31145034  miss: 0  sync & hrtimer: percpu objpool
> >>> [   50.524671]   duration: 1053233us  hits: 30919640  miss: 0  sync & hrtimer: percpu objpool from vmalloc
> >>> [   50.531382]   duration: 1055822us  hits: 3407221   miss: 830219  sync overrun: percpu objpool
> >>> [   50.538135]   duration: 1055998us  hits: 3404624   miss: 854160  sync overrun: percpu objpool from vmalloc
> >>> [   50.546686]   duration: 1046104us  hits: 19464798  miss: 0  async: percpu objpool
> >>> [   50.552989]   duration: 1033054us  hits: 18957836  miss: 0  async: percpu objpool from vmalloc
> >>> [   50.560289]   duration: 1041778us  hits: 33806868  miss: 0  async & hrtimer: percpu objpool
> >>> [   50.567425]   duration: 1048901us  hits: 34211860  miss: 0  async & hrtimer: percpu objpool from vmalloc
> >>>
> >>> Simplified:
> >>> [   48.393236] Summary of testcases:
> >>> [   48.395384]   duration: 1013002us  hits: 29661448  miss: 0  sync: percpu objpool
> >>> [   48.400351]   duration: 1057231us  hits: 31035578  miss: 0  sync: percpu objpool from vmalloc
> >>> [   48.405660]   duration: 1043196us  hits: 30546652  miss: 0  sync & hrtimer: percpu objpool
> >>> [   48.411216]   duration: 1047128us  hits: 30601306  miss: 0  sync & hrtimer: percpu objpool from vmalloc
> >>> [   48.417231]   duration: 1051695us  hits: 3468287   miss: 892881  sync overrun: percpu objpool
> >>> [   48.422405]   duration: 1054003us  hits: 3549491   miss: 898141  sync overrun: percpu objpool from vmalloc
> >>> [   48.428425]   duration: 1052946us  hits: 19005228  miss: 0  async: percpu objpool
> >>> [   48.433597]   duration: 1051757us  hits: 19670866  miss: 0  async: percpu objpool from vmalloc
> >>> [   48.439280]   duration: 1042951us  hits: 37135332  miss: 0  async & hrtimer: percpu objpool
> >>> [   48.445085]   duration: 1029803us  hits: 37093248  miss: 0  async & hrtimer: percpu objpool from vmalloc
> >>>
> >>> Can you test it too?
> >>>
> >>> Thanks,
> >>>
> >>> From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
> >>> From: "Masami Hiramatsu (Google)"
> >>> Date: Mon, 25 Sep 2023 16:07:12 +0900
> >>> Subject: [PATCH] objpool: Simplify objpool by removing ages array
> >>>
> >>> Simplify the objpool code by removing the ages array from the per-cpu
> >>> slot. It chooses a large enough capacity (nr_objects + 1 rounded up to
> >>> a power of 2) for the entries array so that the tail never catches up
> >>> to the head in a per-cpu slot. Thus tail == head means the slot is
> >>> empty.
> >>>
> >>> This also uses consistent local variable names for pool, slot and obj.
> >>>
> >>> Signed-off-by: Masami Hiramatsu (Google)
> >>> ---
> >>>  include/linux/objpool.h |  61 ++++----
> >>>  lib/objpool.c           | 310 ++++++++++++++++------------------
> >>>  2 files changed, 147 insertions(+), 224 deletions(-)
> >>>
> >>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
> >>> index 33c832216b98..ecd5ecaffcd3 100644
> >>> --- a/include/linux/objpool.h
> >>> +++ b/include/linux/objpool.h
> >>> @@ -38,33 +38,23 @@
> >>>   * struct objpool_slot - percpu ring array of objpool
> >>>   * @head: head of the local ring array (to retrieve at)
> >>>   * @tail: tail of the local ring array (to append at)
> >>> - * @bits: log2 of capacity (for bitwise operations)
> >>> - * @mask: capacity - 1
> >>> + * @mask: capacity of entries - 1
> >>> + * @entries: object entries on this slot.
> >>>   *
> >>>   * Represents a cpu-local array-based ring buffer, its size is specialized
> >>>   * during initialization of object pool. The percpu objpool slot is to be
> >>>   * allocated from local memory for NUMA system, and to be kept compact in
> >>> - * continuous memory: ages[] is stored just after the body of objpool_slot,
> >>> - * and then entries[]. ages[] describes revision of each item, solely used
> >>> - * to avoid ABA; entries[] contains pointers of the actual objects
> >>> - *
> >>> - * Layout of objpool_slot in memory:
> >>> - *
> >>> - * 64bit:
> >>> - *        4      8      12     16        32                 64
> >>> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
> >>> - *
> >>> - * 32bit:
> >>> - *        4      8      12     16        32        48       64
> >>> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
> >>> + * continuous memory: CPU assigned number of objects are stored just after
> >>> + * the body of objpool_slot.
> >>>   *
> >>>   */
> >>>  struct objpool_slot {
> >>> -        uint32_t head;
> >>> -        uint32_t tail;
> >>> -        uint32_t bits;
> >>> -        uint32_t mask;
> >>> -} __packed;
> >>> +        uint32_t head;
> >>> +        uint32_t tail;
> >>> +        uint32_t mask;
> >>> +        uint32_t dummyr;
> >>> +        void * entries[];
> >>> +};
> >>>  
> >>>  struct objpool_head;
> >>>  
> >>> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
> >>>   * @obj_size:   object & element size
> >>>   * @nr_objs:    total objs (to be pre-allocated)
> >>>   * @nr_cpus:    nr_cpu_ids
> >>> - * @capacity:   max objects per cpuslot
> >>> + * @capacity:   max objects on percpu slot
> >>>   * @gfp:        gfp flags for kmalloc & vmalloc
> >>>   * @ref:        refcount for objpool
> >>>   * @flags:      flags for objpool management
> >>>   * @cpu_slots:  array of percpu slots
> >>> - * @slot_sizes: size in bytes of slots
> >>>   * @release:    resource cleanup callback
> >>>   * @context:    caller-provided context
> >>>   */
> >>> @@ -100,7 +89,6 @@ struct objpool_head {
> >>>          refcount_t ref;
> >>>          unsigned long flags;
> >>>          struct objpool_slot **cpu_slots;
> >>> -        int *slot_sizes;
> >>>          objpool_fini_cb release;
> >>>          void *context;
> >>>  };
> >>> @@ -108,9 +96,12 @@ struct objpool_head {
> >>>  #define OBJPOOL_FROM_VMALLOC	(0x800000000) /* objpool allocated from vmalloc area */
> >>>  #define OBJPOOL_HAVE_OBJECTS	(0x400000000) /* objects allocated along with objpool */
> >>>  
> >>> +#define OBJPOOL_NR_OBJECT_MAX	(1 << 24)
> >>> +#define OBJPOOL_OBJECT_SIZE_MAX	(1 << 16)
> >>> +
> >>>  /**
> >>>   * objpool_init() - initialize objpool and pre-allocated objects
> >>> - * @head: the object pool to be initialized, declared by caller
> >>> + * @pool: the object pool to be initialized, declared by caller
> >>>   * @nr_objs: total objects to be pre-allocated by this object pool
> >>>   * @object_size: size of an object (should be > 0)
> >>>   * @gfp: flags for memory allocation (via kmalloc or vmalloc)
> >>> @@ -128,47 +119,47 @@ struct objpool_head {
> >>>   * pop (object allocation) or do clearance before each push (object
> >>>   * reclamation).
> >>>   */
> >>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >>>  		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >>>  		 objpool_fini_cb release);
> >>>  
> >>>  /**
> >>>   * objpool_pop() - allocate an object from objpool
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>>   *
> >>>   * return value: object ptr or NULL if failed
> >>>   */
> >>> -void *objpool_pop(struct objpool_head *head);
> >>> +void *objpool_pop(struct objpool_head *pool);
> >>>  
> >>>  /**
> >>>   * objpool_push() - reclaim the object and return back to objpool
> >>>   * @obj: object ptr to be pushed to objpool
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>>   *
> >>>   * return: 0 or error code (it fails only when user tries to push
> >>>   * the same object multiple times or wrong "objects" into objpool)
> >>>   */
> >>> -int objpool_push(void *obj, struct objpool_head *head);
> >>> +int objpool_push(void *obj, struct objpool_head *pool);
> >>>  
> >>>  /**
> >>>   * objpool_drop() - discard the object and deref objpool
> >>>   * @obj: object ptr to be discarded
> >>> - * @head: object pool
> >>> + * @pool: object pool
> >>>   *
> >>>   * return: 0 if objpool was released or error code
> >>>   */
> >>> -int objpool_drop(void *obj, struct objpool_head *head);
> >>> +int objpool_drop(void *obj, struct objpool_head *pool);
> >>>  
> >>>  /**
> >>>   * objpool_free() - release objpool forcely (all objects to be freed)
> >>> - * @head: object pool to be released
> >>> + * @pool: object pool to be released
> >>>   */
> >>> -void objpool_free(struct objpool_head *head);
> >>> +void objpool_free(struct objpool_head *pool);
> >>>  
> >>>  /**
> >>>   * objpool_fini() - deref object pool (also releasing unused objects)
> >>> - * @head: object pool to be dereferenced
> >>> + * @pool: object pool to be dereferenced
> >>>   */
> >>> -void objpool_fini(struct objpool_head *head);
> >>> +void objpool_fini(struct objpool_head *pool);
> >>>  
> >>>  #endif /* _LINUX_OBJPOOL_H */
> >>> diff --git a/lib/objpool.c b/lib/objpool.c
> >>> index 22e752371820..f8e8f70d7253 100644
> >>> --- a/lib/objpool.c
> >>> +++ b/lib/objpool.c
> >>> @@ -15,104 +15,55 @@
> >>>   * Copyright: wuqiang.matt@bytedance.com
> >>>   */
> >>>  
> >>> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
> >>> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
> >>> -			(sizeof(uint32_t) << (s)->bits)))
> >>> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
> >>> -			((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
> >>> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
> >>> -
> >>> -/* compute the suitable num of objects to be managed per slot */
> >>> -static int objpool_nobjs(int size)
> >>> -{
> >>> -	return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
> >>> -			(sizeof(uint32_t) + sizeof(void *)));
> >>> -}
> >>> -
> >>>  /* allocate and initialize percpu slots */
> >>>  static int
> >>> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >>> -			  void *context, objpool_init_obj_cb objinit)
> >>> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
> >>> +			  objpool_init_obj_cb objinit)
> >>>  {
> >>> -	int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
> >>> -
> >>> -	/* aligned object size by sizeof(void *) */
> >>> -	objsz = ALIGN(head->obj_size, sizeof(void *));
> >>> -	/* shall we allocate objects along with percpu-slot */
> >>> -	if (objsz)
> >>> -		head->flags |= OBJPOOL_HAVE_OBJECTS;
> >>> -
> >>> -	/* vmalloc is used in default to allocate percpu-slots */
> >>> -	if (!(head->gfp & GFP_ATOMIC))
> >>> -		head->flags |= OBJPOOL_FROM_VMALLOC;
> >>> -
> >>> -	for (i = 0; i < head->nr_cpus; i++) {
> >>> -		struct objpool_slot *os;
> >>> +	int i, j, n, size, slot_size, cpu_count = 0;
> >>> +	struct objpool_slot *slot;
> >>>  
> >>> +	for (i = 0; i < pool->nr_cpus; i++) {
> >>>  		/* skip the cpus which could never be present */
> >>>  		if (!cpu_possible(i))
> >>>  			continue;
> >>>  
> >>>  		/* compute how many objects to be managed by this slot */
> >>> -		n = nobjs / num_possible_cpus();
> >>> -		if (cpu < (nobjs % num_possible_cpus()))
> >>> +		n = pool->nr_objs / num_possible_cpus();
> >>> +		if (cpu_count < (pool->nr_objs % num_possible_cpus()))
> >>>  			n++;
> >>> -		size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
> >>> -		       sizeof(uint32_t) * nents + objsz * n;
> >>> +		cpu_count++;
> >>> +
> >>> +		slot_size = struct_size(slot, entries, pool->capacity);
> >>> +		size = slot_size + pool->obj_size * n;
> >>>  
> >>>  		/*
> >>>  		 * here we allocate percpu-slot & objects together in a single
> >>> -		 * allocation, taking advantage of warm caches and TLB hits as
> >>> -		 * vmalloc always aligns the request size to pages
> >>> +		 * allocation, taking advantage on NUMA.
> >>>  		 */
> >>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> >>> -			os = __vmalloc_node(size, sizeof(void *), head->gfp,
> >>> +		if (pool->flags & OBJPOOL_FROM_VMALLOC)
> >>> +			slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
> >>>  				cpu_to_node(i), __builtin_return_address(0));
> >>>  		else
> >>> -			os = kmalloc_node(size, head->gfp, cpu_to_node(i));
> >>> -		if (!os)
> >>> +			slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
> >>> +		if (!slot)
> >>>  			return -ENOMEM;
> >>>  
> >>>  		/* initialize percpu slot for the i-th slot */
> >>> -		memset(os, 0, size);
> >>> -		os->bits = ilog2(head->capacity);
> >>> -		os->mask = head->capacity - 1;
> >>> -		head->cpu_slots[i] = os;
> >>> -		head->slot_sizes[i] = size;
> >>> -		cpu = cpu + 1;
> >>> -
> >>> -		/*
> >>> -		 * manually set head & tail to avoid possible conflict:
> >>> -		 * We assume that the head item is ready for retrieval
> >>> -		 * iff head is equal to ages[head & mask]. but ages is
> >>> -		 * initialized as 0, so in view of the caller of pop(),
> >>> -		 * the 1st item (0th) is always ready, but the reality
> >>> -		 * could be: push() is stalled before the final update,
> >>> -		 * thus the item being inserted will be lost forever
> >>> -		 */
> >>> -		os->head = os->tail = head->capacity;
> >>> -
> >>> -		if (!objsz)
> >>> -			continue;
> >>> +		memset(slot, 0, size);
> >>> +		slot->mask = pool->capacity - 1;
> >>> +		pool->cpu_slots[i] = slot;
> >>>  
> >>>  		for (j = 0; j < n; j++) {
> >>> -			uint32_t *ages = SLOT_AGES(os);
> >>> -			void **ents = SLOT_ENTS(os);
> >>> -			void *obj = SLOT_OBJS(os) + j * objsz;
> >>> -			uint32_t ie = os->tail & os->mask;
> >>> +			void *obj = (void *)slot + slot_size + pool->obj_size * j;
> >>>  
> >>> -			/* perform object initialization */
> >>>  			if (objinit) {
> >>>  				int rc = objinit(obj, context);
> >>>  				if (rc)
> >>>  					return rc;
> >>>  			}
> >>> -
> >>> -			/* add obj into the ring array */
> >>> -			ents[ie] = obj;
> >>> -			ages[ie] = os->tail;
> >>> -			os->tail++;
> >>> -			head->nr_objs++;
> >>> +			slot->entries[j] = obj;
> >>> +			slot->tail++;
> >>>  		}
> >>>  	}
> >>>  
> >>> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
> >>>  }
> >>>  
> >>>  /* cleanup all percpu slots of the object pool */
> >>> -static void objpool_fini_percpu_slots(struct objpool_head *head)
> >>> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
> >>>  {
> >>>  	int i;
> >>>  
> >>> -	if (!head->cpu_slots)
> >>> +	if (!pool->cpu_slots)
> >>>  		return;
> >>>  
> >>> -	for (i = 0; i < head->nr_cpus; i++) {
> >>> -		if (!head->cpu_slots[i])
> >>> -			continue;
> >>> -		if (head->flags & OBJPOOL_FROM_VMALLOC)
> >>> -			vfree(head->cpu_slots[i]);
> >>> -		else
> >>> -			kfree(head->cpu_slots[i]);
> >>> -	}
> >>> -	kfree(head->cpu_slots);
> >>> -	head->cpu_slots = NULL;
> >>> -	head->slot_sizes = NULL;
> >>> +	for (i = 0; i < pool->nr_cpus; i++)
> >>> +		kvfree(pool->cpu_slots[i]);
> >>> +	kfree(pool->cpu_slots);
> >>>  }
> >>>  
> >>>  /* initialize object pool and pre-allocate objects */
> >>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
> >>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
> >>>  		 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
> >>>  		 objpool_fini_cb release)
> >>>  {
> >>>  	int nents, rc;
> >>>  
> >>>  	/* check input parameters */
> >>> -	if (nr_objs <= 0 || object_size <= 0)
> >>> +	if (nr_objs <= 0 || object_size <= 0 ||
> >>> +	    nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
> >>> +		return -EINVAL;
> >>> +
> >>> +	/* Align up to unsigned long size */
> >>> +	object_size = ALIGN(object_size, sizeof(unsigned long));
> >>> +
> >>> +	/*
> >>> +	 * To avoid filling up the entries array in the per-cpu slot,
> >>> +	 * use a power of 2 which is more than N + 1. Thus, the tail never
> >>> +	 * catches up to the head, and we can use head/tail as the sequential
> >>> +	 * number.
> >>> +	 */
> >>> +	nents = roundup_pow_of_two(nr_objs + 1);
> >>> +	if (!nents)
> >>>  		return -EINVAL;
> >>>  
> >>> -	/* calculate percpu slot size (rounded to pow of 2) */
> >>> -	nents = max_t(int, roundup_pow_of_two(nr_objs),
> >>> -			objpool_nobjs(L1_CACHE_BYTES));
> >>> -
> >>> -	/* initialize objpool head */
> >>> -	memset(head, 0, sizeof(struct objpool_head));
> >>> -	head->nr_cpus = nr_cpu_ids;
> >>> -	head->obj_size = object_size;
> >>> -	head->capacity = nents;
> >>> -	head->gfp = gfp & ~__GFP_ZERO;
> >>> -	head->context = context;
> >>> -	head->release = release;
> >>> -
> >>> -	/* allocate array for percpu slots */
> >>> -	head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
> >>> -			       head->nr_cpus * sizeof(int), head->gfp);
> >>> -	if (!head->cpu_slots)
> >>> +	/* initialize objpool pool */
> >>> +	memset(pool, 0, sizeof(struct objpool_head));
> >>> +	pool->nr_cpus = nr_cpu_ids;
> >>> +	pool->obj_size = object_size;
> >>> +	pool->nr_objs = nr_objs;
> >>> +	/* just to prevent filling up the per-cpu ring array */
> >>> +	pool->capacity = nents;
> >>> +	pool->gfp = gfp & ~__GFP_ZERO;
> >>> +	pool->context = context;
> >>> +	pool->release = release;
> >>> +	/* vmalloc is used in default to allocate percpu-slots */
> >>> +	if (!(pool->gfp & GFP_ATOMIC))
> >>> +		pool->flags |= OBJPOOL_FROM_VMALLOC;
> >>> +
> >>> +	pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
> >>> +	if (!pool->cpu_slots)
> >>>  		return -ENOMEM;
> >>> -	head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
> >>>  
> >>>  	/* initialize per-cpu slots */
> >>> -	rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
> >>> +	rc = objpool_init_percpu_slots(pool, context, objinit);
> >>>  	if (rc)
> >>> -		objpool_fini_percpu_slots(head);
> >>> +		objpool_fini_percpu_slots(pool);
> >>>  	else
> >>> -		refcount_set(&head->ref, nr_objs + 1);
> >>> +		refcount_set(&pool->ref, nr_objs + 1);
> >>>  
> >>>  	return rc;
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(objpool_init);
> >>>  
> >>>  /* adding object to slot, abort if the slot was already full */
> >>> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
> >>> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
> >>>  {
> >>> -	uint32_t *ages = SLOT_AGES(os);
> >>> -	void **ents = SLOT_ENTS(os);
> >>> -	uint32_t head, tail;
> >>> +	struct objpool_slot *slot = pool->cpu_slots[cpu];
> >>> +	uint32_t tail, next;
> >>>  
> >>>  	do {
> >>> -		/* perform memory loading for both head and tail */
> >>> -		head = READ_ONCE(os->head);
> >>> -		tail = READ_ONCE(os->tail);
> >>> -		/* just abort if slot is full */
> >>> -		if (tail - head > os->mask)
> >>> -			return -ENOENT;
> >>> -		/* try to extend tail by 1 using CAS to avoid races */
> >>> -		if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
> >>> -			break;
> >>> -	} while (1);
> >>> +		uint32_t head = READ_ONCE(slot->head);
> >>>  
> >>> -	/* the tail-th of slot is reserved for the given obj */
> >>> -	WRITE_ONCE(ents[tail & os->mask], obj);
> >>> -	/* update epoch id to make this object available for pop() */
> >>> -	smp_store_release(&ages[tail & os->mask], tail);
> >>> +		tail = READ_ONCE(slot->tail);
> >>> +		next = tail + 1;
> >>> +
> >>> +		/* This must never happen because capacity >= N + 1 */
> >>> +		if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
> >>> +				 (next < head && next > head + pool->nr_objs)))
> >>> +			return -EINVAL;
> >>> +
> >>> +	} while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
> >>> +
> >>> +	WRITE_ONCE(slot->entries[tail & slot->mask], obj);
> >>>  	return 0;
> >>>  }
> >>>  
> >>>  /* reclaim an object to object pool */
> >>> -int objpool_push(void *obj, struct objpool_head *oh)
> >>> +int objpool_push(void *obj, struct objpool_head *pool)
> >>>  {
> >>>  	unsigned long flags;
> >>> -	int cpu, rc;
> >>> +	int rc;
> >>>  
> >>>  	/* disable local irq to avoid preemption & interruption */
> >>>  	raw_local_irq_save(flags);
> >>> -	cpu = raw_smp_processor_id();
> >>> -	do {
> >>> -		rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
> >>> -		if (!rc)
> >>> -			break;
> >>> -		cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
> >>> -	} while (1);
> >>> +	rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
> >>>  	raw_local_irq_restore(flags);
> >>>  
> >>>  	return rc;
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(objpool_push);
> >>>  
> >>> -/* drop the allocated object, rather reclaim it to objpool */
> >>> -int objpool_drop(void *obj, struct objpool_head *head)
> >>> -{
> >>> -	if (!obj || !head)
> >>> -		return -EINVAL;
> >>> -
> >>> -	if (refcount_dec_and_test(&head->ref)) {
> >>> -		objpool_free(head);
> >>> -		return 0;
> >>> -	}
> >>> -
> >>> -	return -EAGAIN;
> >>> -}
> >>> -EXPORT_SYMBOL_GPL(objpool_drop);
> >>> -
> >>>  /* try to retrieve object from slot */
> >>> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
> >>> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
> >>>  {
> >>> -	uint32_t *ages = SLOT_AGES(os);
> >>> -	void **ents = SLOT_ENTS(os);
> >>>  	/* do memory load of head to local head */
> >>> -	uint32_t head = smp_load_acquire(&os->head);
> >>> +	uint32_t head = smp_load_acquire(&slot->head);
> >>>  
> >>>  	/* loop if slot isn't empty */
> >>> -	while (head != READ_ONCE(os->tail)) {
> >>> -		uint32_t id = head & os->mask, prev = head;
> >>> +	while (head != READ_ONCE(slot->tail)) {
> >>>  
> >>>  		/* do prefetching of object ents */
> >>> -		prefetch(&ents[id]);
> >>> -
> >>> -		/* check whether this item was ready for retrieval */
> >>> -		if (smp_load_acquire(&ages[id]) == head) {
> >>> -			/* node must have been udpated by push() */
> >>> -			void *node = READ_ONCE(ents[id]);
> >>> -			/* commit and move forward head of the slot */
> >>> -			if (try_cmpxchg_release(&os->head, &head, head + 1))
> >>> -				return node;
> >>> -			/* head was already updated by others */
> >>> -		}
> >>> +		prefetch(&slot->entries[head & slot->mask]);
> >>> +
> >>> +		/* commit and move forward head of the slot */
> >>> +		if (try_cmpxchg_release(&slot->head, &head, head + 1))
> >>> +			/*
> >>> +			 * TBD: check wrap-around of the tail/head counter and
> >>> +			 * warn if it is broken. But this happens only if this
> >>> +			 * process slows down a lot and another CPU updates
> >>> +			 * the head/tail just 2^32 + 1 times, and this slot
> >>> +			 * is empty.
> >>> +			 */
> >>> +			return slot->entries[head & slot->mask];
> >>>  
> >>>  		/* re-load head from memory and continue trying */
> >>> -		head = READ_ONCE(os->head);
> >>> -		/*
> >>> -		 * head stays unchanged, so it's very likely there's an
> >>> -		 * ongoing push() on other cpu nodes but yet not update
> >>> -		 * ages[] to mark it's completion
> >>> -		 */
> >>> -		if (head == prev)
> >>> -			break;
> >>> +		head = READ_ONCE(slot->head);
> >>>  	}
> >>>  
> >>>  	return NULL;
> >>> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
> >>>  EXPORT_SYMBOL_GPL(objpool_pop);
> >>>  
> >>>  /* release whole objpool forcely */
> >>> -void objpool_free(struct objpool_head *head)
> >>> +void objpool_free(struct objpool_head *pool)
> >>>  {
> >>> -	if (!head->cpu_slots)
> >>> +	if (!pool->cpu_slots)
> >>>  		return;
> >>>  
> >>>  	/* release percpu slots */
> >>> -	objpool_fini_percpu_slots(head);
> >>> +	objpool_fini_percpu_slots(pool);
> >>>  
> >>>  	/* call user's cleanup callback if provided */
> >>> -	if (head->release)
> >>> -		head->release(head, head->context);
> >>> +	if (pool->release)
> >>> +		pool->release(pool, pool->context);
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(objpool_free);
> >>>  
> >>> -/* drop unused objects and defref objpool for releasing */
> >>> -void objpool_fini(struct objpool_head *head)
> >>> +/* drop the allocated object, rather reclaim it to objpool */
> >>> +int objpool_drop(void *obj, struct objpool_head *pool)
> >>>  {
> >>> -	void *nod;
> >>> +	if (!obj || !pool)
> >>> +		return -EINVAL;
> >>>  
> >>> -	do {
> >>> -		/* grab object from objpool and drop it */
> >>> -		nod = objpool_pop(head);
> >>> +	if (refcount_dec_and_test(&pool->ref)) {
> >>> +		objpool_free(pool);
> >>> +		return 0;
> >>> +	}
> >>> +
> >>> +	return -EAGAIN;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(objpool_drop);
> >>> +
> >>> +/* drop unused objects and defref objpool for releasing */
> >>> +void objpool_fini(struct objpool_head *pool)
> >>> +{
> >>> +	void *obj;
> >>>  
> >>> -		/* drop the extra ref of objpool */
> >>> -		if (refcount_dec_and_test(&head->ref))
> >>> -			objpool_free(head);
> >>> -	} while (nod);
> >>> +	/* grab object from objpool and drop it */
> >>> +	while ((obj = objpool_pop(pool)))
> >>> +		objpool_drop(obj, pool);
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(objpool_fini);
> >>
> 
> 

-- 
Masami Hiramatsu (Google)
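The capacity choice discussed in the patch (nents = roundup_pow_of_two(nr_objs + 1)) can be checked with a small userspace sketch. roundup_pow2() and slot_capacity() below are illustrative stand-ins, not kernel APIs:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's roundup_pow_of_two();
 * valid for 0 < v <= 2^31. */
static uint32_t roundup_pow2(uint32_t v)
{
        v--;
        v |= v >> 1;
        v |= v >> 2;
        v |= v >> 4;
        v |= v >> 8;
        v |= v >> 16;
        return v + 1;
}

/* Capacity rule from the simplified objpool_init(): a power of two of
 * at least nr_objs + 1. With at most nr_objs objects ever in one slot,
 * the ring can never be completely full, so head == tail always and
 * unambiguously means "empty" -- the property that lets the patch drop
 * the ages[] array. */
static uint32_t slot_capacity(uint32_t nr_objs)
{
        return roundup_pow2(nr_objs + 1);
}
```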