Date: Fri, 13 Oct 2023 11:03:02 +0800
From: "wuqiang.matt"
To: "Masami Hiramatsu (Google)"
Cc: linux-trace-kernel@vger.kernel.org, davem@davemloft.net,
 anil.s.keshavamurthy@intel.com, naveen.n.rao@linux.ibm.com,
 rostedt@goodmis.org, peterz@infradead.org, akpm@linux-foundation.org,
 sander@svanheule.net, ebiggers@google.com, dan.j.williams@intel.com,
 jpoimboe@kernel.org, linux-kernel@vger.kernel.org, lkp@intel.com,
 mattwu@163.com
Subject: Re: [PATCH v9 1/5] lib: objpool added: ring-array based lockless MPMC
References: <20230905015255.81545-1-wuqiang.matt@bytedance.com>
 <20230905015255.81545-2-wuqiang.matt@bytedance.com>
 <20230925184212.5a0b6c627a837b50424c4a75@kernel.org>
 <5ba828ef-e5f9-45bc-b0c7-b8b71ae8748a@bytedance.com>
 <20231012230237.45726dfad125cf0e8d00ba53@kernel.org>
 <1fb679ad-04b0-c1f6-1675-1921539382bd@bytedance.com>
 <20231013105915.6ea704e738de484558b55ce1@kernel.org>
In-Reply-To: <20231013105915.6ea704e738de484558b55ce1@kernel.org>

On 2023/10/13 09:59, Masami Hiramatsu (Google) wrote:
> On Fri, 13 Oct 2023 01:36:05 +0800
> "wuqiang.matt" wrote:
>
>> On 2023/10/12 22:02, Masami Hiramatsu (Google) wrote:
>>> Hi Wuqiang,
>>>
>>> On Mon, 9 Oct 2023 17:23:34 +0800
>>> wuqiang wrote:
>>>
>>>> Hello Masami,
>>>>
>>>> I just got time for the new patch and noticed that ages[] was removed.
>>>> ages[] is introduced as a 2-phase-commit-like mechanism to keep
>>>> consistency and must be kept.
>>>>
>>>> Thinking of the following 2 cases where two cpu nodes are operating the
>>>> same objpool_slot simultaneously:
>>>>
>>>> Case 1:
>>>>
>>>>    NODE 1:                          NODE 2:
>>>>    push to an empty slot            pop will get wrong value
>>>>
>>>>    try_cmpxchg_acquire(&slot->tail, &tail, next)
>>>>    try_cmpxchg_release(&slot->head, &head, head + 1)
>>>>    return slot->entries[head & slot->mask]
>>>>    WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>>
>>> Today, I rethought the algorithm.
>>>
>>> For this case, we can just add a `commit` field to the slot for committing
>>> the tail, instead of the ages[].
>>>
>>>    CPU1                                 CPU2
>>>    push to an empty slot                pop from the same slot
>>>
>>>    commit = tail = slot->tail;
>>>    next = tail + 1;
>>>    try_cmpxchg_acquire(&slot->tail,
>>>                        &tail, next);
>>>                                         while (head != commit) -> NG1
>>>    WRITE_ONCE(slot->entries[tail & slot->mask],
>>>               obj)
>>>                                         while (head != commit) -> NG2
>>>    WRITE_ONCE(&slot->commit, next);
>>>                                         while (head != commit) -> OK
>>>                                         try_cmpxchg_acquire(&slot->head,
>>>                                                             &head, next);
>>>                                         return slot->entries[head & slot->mask]
>>>
>>> So at the NG1 and NG2 timings, the pop will fail.
>>>
>>> This does not expect the nested "push" case, because the reserve-commit
>>> block will run with interrupts disabled. This doesn't support NMI, but
>>> that is OK.
>>
>> If 2 pushes are performed in a row, slot->commit will be 'tail + 2', so
>> CPU2 won't meet the condition "head == commit".
>
> No, that doesn't happen, because pop() may fetch an object from another
> CPU's slot, but push() only happens on its own slot. So push() on the same
> slot is never in parallel.

I got it: 'commit' is now the actual tail. Then "head != commit" means there
are free objects ready for pop. Great idea indeed. I'll give it a try.

>> Since slot->commit is always synced to slot->tail after a successful push,
>> should pop check "tail != commit"?
>
> No, pop() only needs to check "head != commit". If "head == commit", the
> slot is empty or the slot owner is trying to push() (not committed yet).
>
>> In this case, when a push is ongoing with the slot, pop() has to wait for
>> its completion even if there are objects in the same slot.
>
> No, pop() will go to other slots to find an available object.
>
>>
>>> If we need to support such NMI or remove irq-disable, we can do the
>>> following (please remember the push operation only happens on the slot
>>> owned by that CPU, so this is a per-cpu process):
>>>
>>> ---
>>>    do {
>>>        commit = tail = slot->tail;
>>>        next = tail + 1;
>>>    } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>>>
>>>    WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>
>>>    // At this point, "commit == slot->tail - 1" in this nest level.
>>>    // Only in the outermost (non-nested) case, "commit == slot->commit".
>>>    if (commit != slot->commit)
>>>        return; // do nothing. it must be updated by the outermost one.
>>>
>>>    // catch up commit to tail.
>>>    do {
>>>        commit = READ_ONCE(slot->tail);
>>>        WRITE_ONCE(slot->commit, commit);
>>>        // note that someone can interrupt this loop too.
>>>    } while (commit != READ_ONCE(slot->tail));
>>> ---
>>
>> Yes, I got you: push can only happen on the local node, and the first try
>> has the right to extend 'commit'. It does resolve nested push attempts,
>> and preemption must be disabled.
>
> Yes.
>
>>
>> The overflow/ABA issue can still happen if the cpu is interrupted for a
>> long time.
>
> Yes, but ages[] also cannot avoid that. (Note that head, tail and commit
> are 32-bit counters, and the per-cpu slot ring is big enough to store the
> pointers of all objects.)
>
> Thank you,
>
>>
>>> For the rethook this may be too much.
>>>
>>> Thank you,
>>>
>>>>
>>>>
>>>> Case 2:
>>>>
>>>>    NODE 1:                          NODE 2
>>>>    push to slot w/ 1 obj            pop will get wrong value
>>>>
>>>>    try_cmpxchg_release(&slot->head, &head, head + 1)
>>>>    try_cmpxchg_acquire(&slot->tail, &tail, next)
>>>>    WRITE_ONCE(slot->entries[tail & slot->mask], obj)
>>>>    return slot->entries[head & slot->mask]
>>
>> The pre-condition should be: CPU 1 tries to push to a full slot; in this
>> case tail = head + capacity, but tail & mask == head & mask.
>>
>>>>
>>>> Regards,
>>>> wuqiang
>>>>
>>>> On 2023/9/25 17:42, Masami Hiramatsu (Google) wrote:
>>>>> Hi Wuqiang,
>>>>>
>>>>> On Tue, 5 Sep 2023 09:52:51 +0800
>>>>> "wuqiang.matt" wrote:
>>>>>
>>>>>> The object pool is a scalable implementation of a high-performance
>>>>>> queue for object allocation and reclamation, such as kretprobe
>>>>>> instances.
>>>>>>
>>>>>> By leveraging a percpu ring-array to mitigate the hot spots of memory
>>>>>> contention, it can deliver near-linear scalability for highly parallel
>>>>>> scenarios. The objpool is best suited for the following cases:
>>>>>> 1) Memory allocation or reclamation is prohibited or too expensive
>>>>>> 2) Consumers are of different priorities, such as irqs and threads
>>>>>>
>>>>>> Limitations:
>>>>>> 1) Maximum objects (capacity) is determined during pool initialization
>>>>>>    and can't be modified (extended) after objpool creation
>>>>>> 2) The memory of objects won't be freed until the objpool is finalized
>>>>>> 3) Object allocation (pop) may fail after trying all cpu slots
>>>>>
>>>>> I made a simplifying patch on this by (mainly) removing the ages array.
>>>>> I also renamed local variables to more readable names, like slot, pool,
>>>>> and obj.
>>>>>
>>>>> Here are the results from running test_objpool.ko.
>>>>>
>>>>> Original:
>>>>> [ 50.500235] Summary of testcases:
>>>>> [ 50.503139] duration: 1027135us hits: 30628293 miss: 0 sync: percpu objpool
>>>>> [ 50.510416] duration: 1047977us hits: 30453023 miss: 0 sync: percpu objpool from vmalloc
>>>>> [ 50.517421] duration: 1047098us hits: 31145034 miss: 0 sync & hrtimer: percpu objpool
>>>>> [ 50.524671] duration: 1053233us hits: 30919640 miss: 0 sync & hrtimer: percpu objpool from vmalloc
>>>>> [ 50.531382] duration: 1055822us hits: 3407221 miss: 830219 sync overrun: percpu objpool
>>>>> [ 50.538135] duration: 1055998us hits: 3404624 miss: 854160 sync overrun: percpu objpool from vmalloc
>>>>> [ 50.546686] duration: 1046104us hits: 19464798 miss: 0 async: percpu objpool
>>>>> [ 50.552989] duration: 1033054us hits: 18957836 miss: 0 async: percpu objpool from vmalloc
>>>>> [ 50.560289] duration: 1041778us hits: 33806868 miss: 0 async & hrtimer: percpu objpool
>>>>> [ 50.567425] duration: 1048901us hits: 34211860 miss: 0 async & hrtimer: percpu objpool from vmalloc
>>>>>
>>>>> Simplified:
>>>>> [ 48.393236] Summary of testcases:
>>>>> [ 48.395384] duration: 1013002us hits: 29661448 miss: 0 sync: percpu objpool
>>>>> [ 48.400351] duration: 1057231us hits: 31035578 miss: 0 sync: percpu objpool from vmalloc
>>>>> [ 48.405660] duration: 1043196us hits: 30546652 miss: 0 sync & hrtimer: percpu objpool
>>>>> [ 48.411216] duration: 1047128us hits: 30601306 miss: 0 sync & hrtimer: percpu objpool from vmalloc
>>>>> [ 48.417231] duration: 1051695us hits: 3468287 miss: 892881 sync overrun: percpu objpool
>>>>> [ 48.422405] duration: 1054003us hits: 3549491 miss: 898141 sync overrun: percpu objpool from vmalloc
>>>>> [ 48.428425] duration: 1052946us hits: 19005228 miss: 0 async: percpu objpool
>>>>> [ 48.433597] duration: 1051757us hits: 19670866 miss: 0 async: percpu objpool from vmalloc
>>>>> [ 48.439280] duration: 1042951us hits: 37135332 miss: 0 async & hrtimer: percpu objpool
>>>>> [ 48.445085] duration: 1029803us hits: 37093248 miss: 0 async & hrtimer: percpu objpool from vmalloc
>>>>>
>>>>> Can you test it too?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> From f1f442ff653e329839e5452b8b88463a80a12ff3 Mon Sep 17 00:00:00 2001
>>>>> From: "Masami Hiramatsu (Google)"
>>>>> Date: Mon, 25 Sep 2023 16:07:12 +0900
>>>>> Subject: [PATCH] objpool: Simplify objpool by removing ages array
>>>>>
>>>>> Simplify the objpool code by removing ages array from per-cpu slot.
>>>>> It chooses enough big capacity (which is a rounded up power of 2 value
>>>>> of nr_objects + 1) for the entries array, the tail never catch up to
>>>>> the head in per-cpu slot. Thus tail == head means the slot is empty.
>>>>>
>>>>> This also uses consistent local variable names for pool, slot and obj.
>>>>>
>>>>> Signed-off-by: Masami Hiramatsu (Google)
>>>>> ---
>>>>>  include/linux/objpool.h |  61 ++++----
>>>>>  lib/objpool.c           | 310 ++++++++++++++++------------------------
>>>>>  2 files changed, 147 insertions(+), 224 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/objpool.h b/include/linux/objpool.h
>>>>> index 33c832216b98..ecd5ecaffcd3 100644
>>>>> --- a/include/linux/objpool.h
>>>>> +++ b/include/linux/objpool.h
>>>>> @@ -38,33 +38,23 @@
>>>>>   * struct objpool_slot - percpu ring array of objpool
>>>>>   * @head: head of the local ring array (to retrieve at)
>>>>>   * @tail: tail of the local ring array (to append at)
>>>>> - * @bits: log2 of capacity (for bitwise operations)
>>>>> - * @mask: capacity - 1
>>>>> + * @mask: capacity of entries - 1
>>>>> + * @entries: object entries on this slot.
>>>>>   *
>>>>>   * Represents a cpu-local array-based ring buffer, its size is specialized
>>>>>   * during initialization of object pool. The percpu objpool slot is to be
>>>>>   * allocated from local memory for NUMA system, and to be kept compact in
>>>>> - * continuous memory: ages[] is stored just after the body of objpool_slot,
>>>>> - * and then entries[]. ages[] describes revision of each item, solely used
>>>>> - * to avoid ABA; entries[] contains pointers of the actual objects
>>>>> - *
>>>>> - * Layout of objpool_slot in memory:
>>>>> - *
>>>>> - * 64bit:
>>>>> - *        4      8      12     16        32                 64
>>>>> - * | head | tail | bits | mask | ages[4] | ents[4]: (8 * 4) | objects
>>>>> - *
>>>>> - * 32bit:
>>>>> - *        4      8      12     16        32        48       64
>>>>> - * | head | tail | bits | mask | ages[4] | ents[4] | unused | objects
>>>>> + * continuous memory: CPU assigned number of objects are stored just after
>>>>> + * the body of objpool_slot.
>>>>>   *
>>>>>   */
>>>>>  struct objpool_slot {
>>>>> -        uint32_t        head;
>>>>> -        uint32_t        tail;
>>>>> -        uint32_t        bits;
>>>>> -        uint32_t        mask;
>>>>> -} __packed;
>>>>> +        uint32_t        head;
>>>>> +        uint32_t        tail;
>>>>> +        uint32_t        mask;
>>>>> +        uint32_t        dummyr;
>>>>> +        void *          entries[];
>>>>> +};
>>>>>
>>>>>  struct objpool_head;
>>>>>
>>>>> @@ -82,12 +72,11 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context);
>>>>>   * @obj_size: object & element size
>>>>>   * @nr_objs: total objs (to be pre-allocated)
>>>>>   * @nr_cpus: nr_cpu_ids
>>>>> - * @capacity: max objects per cpuslot
>>>>> + * @capacity: max objects on percpu slot
>>>>>   * @gfp: gfp flags for kmalloc & vmalloc
>>>>>   * @ref: refcount for objpool
>>>>>   * @flags: flags for objpool management
>>>>>   * @cpu_slots: array of percpu slots
>>>>> - * @slot_sizes: size in bytes of slots
>>>>>   * @release: resource cleanup callback
>>>>>   * @context: caller-provided context
>>>>>   */
>>>>> @@ -100,7 +89,6 @@ struct objpool_head {
>>>>>          refcount_t ref;
>>>>>          unsigned long flags;
>>>>>          struct objpool_slot **cpu_slots;
>>>>> -        int *slot_sizes;
>>>>>          objpool_fini_cb release;
>>>>>          void *context;
>>>>>  };
>>>>> @@ -108,9 +96,12 @@ struct objpool_head {
>>>>>  #define OBJPOOL_FROM_VMALLOC        (0x800000000) /* objpool allocated from vmalloc area */
>>>>>  #define OBJPOOL_HAVE_OBJECTS        (0x400000000) /* objects allocated along with objpool */
>>>>>
>>>>> +#define OBJPOOL_NR_OBJECT_MAX        (1 << 24)
>>>>> +#define OBJPOOL_OBJECT_SIZE_MAX        (1 << 16)
>>>>> +
>>>>>  /**
>>>>>   * objpool_init() - initialize objpool and pre-allocated objects
>>>>> - * @head: the object pool to be initialized, declared by caller
>>>>> + * @pool: the object pool to be initialized, declared by caller
>>>>>   * @nr_objs: total objects to be pre-allocated by this object pool
>>>>>   * @object_size: size of an object (should be > 0)
>>>>>   * @gfp: flags for memory allocation (via kmalloc or vmalloc)
>>>>> @@ -128,47 +119,47 @@ struct objpool_head {
>>>>>   * pop (object allocation) or do clearance before each push (object
>>>>>   * reclamation).
>>>>>   */
>>>>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>>>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>>>>                   gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>>>                   objpool_fini_cb release);
>>>>>
>>>>>  /**
>>>>>   * objpool_pop() - allocate an object from objpool
>>>>> - * @head: object pool
>>>>> + * @pool: object pool
>>>>>   *
>>>>>   * return value: object ptr or NULL if failed
>>>>>   */
>>>>> -void *objpool_pop(struct objpool_head *head);
>>>>> +void *objpool_pop(struct objpool_head *pool);
>>>>>
>>>>>  /**
>>>>>   * objpool_push() - reclaim the object and return back to objpool
>>>>>   * @obj: object ptr to be pushed to objpool
>>>>> - * @head: object pool
>>>>> + * @pool: object pool
>>>>>   *
>>>>>   * return: 0 or error code (it fails only when user tries to push
>>>>>   * the same object multiple times or wrong "objects" into objpool)
>>>>>   */
>>>>> -int objpool_push(void *obj, struct objpool_head *head);
>>>>> +int objpool_push(void *obj, struct objpool_head *pool);
>>>>>
>>>>>  /**
>>>>>   * objpool_drop() - discard the object and deref objpool
>>>>>   * @obj: object ptr to be discarded
>>>>> - * @head: object pool
>>>>> + * @pool: object pool
>>>>>   *
>>>>>   * return: 0 if objpool was released or error code
>>>>>   */
>>>>> -int objpool_drop(void *obj, struct objpool_head *head);
>>>>> +int objpool_drop(void *obj, struct objpool_head *pool);
>>>>>
>>>>>  /**
>>>>>   * objpool_free() - release objpool forcely (all objects to be freed)
>>>>> - * @head: object pool to be released
>>>>> + * @pool: object pool to be released
>>>>>   */
>>>>> -void objpool_free(struct objpool_head *head);
>>>>> +void objpool_free(struct objpool_head *pool);
>>>>>
>>>>>  /**
>>>>>   * objpool_fini() - deref object pool (also releasing unused objects)
>>>>> - * @head: object pool to be dereferenced
>>>>> + * @pool: object pool to be dereferenced
>>>>>   */
>>>>> -void objpool_fini(struct objpool_head *head);
>>>>> +void objpool_fini(struct objpool_head *pool);
>>>>>
>>>>>  #endif /* _LINUX_OBJPOOL_H */
>>>>> diff --git a/lib/objpool.c b/lib/objpool.c
>>>>> index 22e752371820..f8e8f70d7253 100644
>>>>> --- a/lib/objpool.c
>>>>> +++ b/lib/objpool.c
>>>>> @@ -15,104 +15,55 @@
>>>>>   * Copyright: wuqiang.matt@bytedance.com
>>>>>   */
>>>>>
>>>>> -#define SLOT_AGES(s) ((uint32_t *)((char *)(s) + sizeof(struct objpool_slot)))
>>>>> -#define SLOT_ENTS(s) ((void **)((char *)(s) + sizeof(struct objpool_slot) + \
>>>>> -                        (sizeof(uint32_t) << (s)->bits)))
>>>>> -#define SLOT_OBJS(s) ((void *)((char *)(s) + sizeof(struct objpool_slot) + \
>>>>> -                        ((sizeof(uint32_t) + sizeof(void *)) << (s)->bits)))
>>>>> -#define SLOT_CORE(n) cpumask_nth((n) % num_possible_cpus(), cpu_possible_mask)
>>>>> -
>>>>> -/* compute the suitable num of objects to be managed per slot */
>>>>> -static int objpool_nobjs(int size)
>>>>> -{
>>>>> -        return rounddown_pow_of_two((size - sizeof(struct objpool_slot)) /
>>>>> -                        (sizeof(uint32_t) + sizeof(void *)));
>>>>> -}
>>>>> -
>>>>>  /* allocate and initialize percpu slots */
>>>>>  static int
>>>>> -objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>>>> -                        void *context, objpool_init_obj_cb objinit)
>>>>> +objpool_init_percpu_slots(struct objpool_head *pool, void *context,
>>>>> +                          objpool_init_obj_cb objinit)
>>>>>  {
>>>>> -        int i, j, n, size, objsz, cpu = 0, nents = head->capacity;
>>>>> -
>>>>> -        /* aligned object size by sizeof(void *) */
>>>>> -        objsz = ALIGN(head->obj_size, sizeof(void *));
>>>>> -        /* shall we allocate objects along with percpu-slot */
>>>>> -        if (objsz)
>>>>> -                head->flags |= OBJPOOL_HAVE_OBJECTS;
>>>>> -
>>>>> -        /* vmalloc is used in default to allocate percpu-slots */
>>>>> -        if (!(head->gfp & GFP_ATOMIC))
>>>>> -                head->flags |= OBJPOOL_FROM_VMALLOC;
>>>>> -
>>>>> -        for (i = 0; i < head->nr_cpus; i++) {
>>>>> -                struct objpool_slot *os;
>>>>> +        int i, j, n, size, slot_size, cpu_count = 0;
>>>>> +        struct objpool_slot *slot;
>>>>>
>>>>> +        for (i = 0; i < pool->nr_cpus; i++) {
>>>>>                  /* skip the cpus which could never be present */
>>>>>                  if (!cpu_possible(i))
>>>>>                          continue;
>>>>>
>>>>>                  /* compute how many objects to be managed by this slot */
>>>>> -                n = nobjs / num_possible_cpus();
>>>>> -                if (cpu < (nobjs % num_possible_cpus()))
>>>>> +                n = pool->nr_objs / num_possible_cpus();
>>>>> +                if (cpu_count < (pool->nr_objs % num_possible_cpus()))
>>>>>                          n++;
>>>>> -                size = sizeof(struct objpool_slot) + sizeof(void *) * nents +
>>>>> -                        sizeof(uint32_t) * nents + objsz * n;
>>>>> +                cpu_count++;
>>>>> +
>>>>> +                slot_size = struct_size(slot, entries, pool->capacity);
>>>>> +                size = slot_size + pool->obj_size * n;
>>>>>
>>>>>                  /*
>>>>>                   * here we allocate percpu-slot & objects together in a single
>>>>> -                 * allocation, taking advantage of warm caches and TLB hits as
>>>>> -                 * vmalloc always aligns the request size to pages
>>>>> +                 * allocation, taking advantage on NUMA.
>>>>>                   */
>>>>> -                if (head->flags & OBJPOOL_FROM_VMALLOC)
>>>>> -                        os = __vmalloc_node(size, sizeof(void *), head->gfp,
>>>>> +                if (pool->flags & OBJPOOL_FROM_VMALLOC)
>>>>> +                        slot = __vmalloc_node(size, sizeof(void *), pool->gfp,
>>>>>                                  cpu_to_node(i), __builtin_return_address(0));
>>>>>                  else
>>>>> -                        os = kmalloc_node(size, head->gfp, cpu_to_node(i));
>>>>> -                if (!os)
>>>>> +                        slot = kmalloc_node(size, pool->gfp, cpu_to_node(i));
>>>>> +                if (!slot)
>>>>>                          return -ENOMEM;
>>>>>
>>>>>                  /* initialize percpu slot for the i-th slot */
>>>>> -                memset(os, 0, size);
>>>>> -                os->bits = ilog2(head->capacity);
>>>>> -                os->mask = head->capacity - 1;
>>>>> -                head->cpu_slots[i] = os;
>>>>> -                head->slot_sizes[i] = size;
>>>>> -                cpu = cpu + 1;
>>>>> -
>>>>> -                /*
>>>>> -                 * manually set head & tail to avoid possible conflict:
>>>>> -                 * We assume that the head item is ready for retrieval
>>>>> -                 * iff head is equal to ages[head & mask]. but ages is
>>>>> -                 * initialized as 0, so in view of the caller of pop(),
>>>>> -                 * the 1st item (0th) is always ready, but the reality
>>>>> -                 * could be: push() is stalled before the final update,
>>>>> -                 * thus the item being inserted will be lost forever
>>>>> -                 */
>>>>> -                os->head = os->tail = head->capacity;
>>>>> -
>>>>> -                if (!objsz)
>>>>> -                        continue;
>>>>> +                memset(slot, 0, size);
>>>>> +                slot->mask = pool->capacity - 1;
>>>>> +                pool->cpu_slots[i] = slot;
>>>>>
>>>>>                  for (j = 0; j < n; j++) {
>>>>> -                        uint32_t *ages = SLOT_AGES(os);
>>>>> -                        void **ents = SLOT_ENTS(os);
>>>>> -                        void *obj = SLOT_OBJS(os) + j * objsz;
>>>>> -                        uint32_t ie = os->tail & os->mask;
>>>>> +                        void *obj = (void *)slot + slot_size + pool->obj_size * j;
>>>>>
>>>>> -                        /* perform object initialization */
>>>>>                          if (objinit) {
>>>>>                                  int rc = objinit(obj, context);
>>>>>                                  if (rc)
>>>>>                                          return rc;
>>>>>                          }
>>>>> -
>>>>> -                        /* add obj into the ring array */
>>>>> -                        ents[ie] = obj;
>>>>> -                        ages[ie] = os->tail;
>>>>> -                        os->tail++;
>>>>> -                        head->nr_objs++;
>>>>> +                        slot->entries[j] = obj;
>>>>> +                        slot->tail++;
>>>>>                  }
>>>>>          }
>>>>>
>>>>> @@ -120,164 +71,135 @@ objpool_init_percpu_slots(struct objpool_head *head, int nobjs,
>>>>>  }
>>>>>
>>>>>  /* cleanup all percpu slots of the object pool */
>>>>> -static void objpool_fini_percpu_slots(struct objpool_head *head)
>>>>> +static void objpool_fini_percpu_slots(struct objpool_head *pool)
>>>>>  {
>>>>>          int i;
>>>>>
>>>>> -        if (!head->cpu_slots)
>>>>> +        if (!pool->cpu_slots)
>>>>>                  return;
>>>>>
>>>>> -        for (i = 0; i < head->nr_cpus; i++) {
>>>>> -                if (!head->cpu_slots[i])
>>>>> -                        continue;
>>>>> -                if (head->flags & OBJPOOL_FROM_VMALLOC)
>>>>> -                        vfree(head->cpu_slots[i]);
>>>>> -                else
>>>>> -                        kfree(head->cpu_slots[i]);
>>>>> -        }
>>>>> -        kfree(head->cpu_slots);
>>>>> -        head->cpu_slots = NULL;
>>>>> -        head->slot_sizes = NULL;
>>>>> +        for (i = 0; i < pool->nr_cpus; i++)
>>>>> +                kvfree(pool->cpu_slots[i]);
>>>>> +        kfree(pool->cpu_slots);
>>>>>  }
>>>>>
>>>>>  /* initialize object pool and pre-allocate objects */
>>>>> -int objpool_init(struct objpool_head *head, int nr_objs, int object_size,
>>>>> +int objpool_init(struct objpool_head *pool, int nr_objs, int object_size,
>>>>>                   gfp_t gfp, void *context, objpool_init_obj_cb objinit,
>>>>>                   objpool_fini_cb release)
>>>>>  {
>>>>>          int nents, rc;
>>>>>
>>>>>          /* check input parameters */
>>>>> -        if (nr_objs <= 0 || object_size <= 0)
>>>>> +        if (nr_objs <= 0 || object_size <= 0 ||
>>>>> +            nr_objs > OBJPOOL_NR_OBJECT_MAX || object_size > OBJPOOL_OBJECT_SIZE_MAX)
>>>>> +                return -EINVAL;
>>>>> +
>>>>> +        /* Align up to unsigned long size */
>>>>> +        object_size = ALIGN(object_size, sizeof(unsigned long));
>>>>> +
>>>>> +        /*
>>>>> +         * To avoid filling up the entries array in the per-cpu slot,
>>>>> +         * use the power of 2 which is more than N + 1. Thus, the tail never
>>>>> +         * catch up to the pool, and able to use pool/tail as the sequencial
>>>>> +         * number.
>>>>> +         */
>>>>> +        nents = roundup_pow_of_two(nr_objs + 1);
>>>>> +        if (!nents)
>>>>>                  return -EINVAL;
>>>>>
>>>>> -        /* calculate percpu slot size (rounded to pow of 2) */
>>>>> -        nents = max_t(int, roundup_pow_of_two(nr_objs),
>>>>> -                        objpool_nobjs(L1_CACHE_BYTES));
>>>>> -
>>>>> -        /* initialize objpool head */
>>>>> -        memset(head, 0, sizeof(struct objpool_head));
>>>>> -        head->nr_cpus = nr_cpu_ids;
>>>>> -        head->obj_size = object_size;
>>>>> -        head->capacity = nents;
>>>>> -        head->gfp = gfp & ~__GFP_ZERO;
>>>>> -        head->context = context;
>>>>> -        head->release = release;
>>>>> -
>>>>> -        /* allocate array for percpu slots */
>>>>> -        head->cpu_slots = kzalloc(head->nr_cpus * sizeof(void *) +
>>>>> -                               head->nr_cpus * sizeof(int), head->gfp);
>>>>> -        if (!head->cpu_slots)
>>>>> +        /* initialize objpool pool */
>>>>> +        memset(pool, 0, sizeof(struct objpool_head));
>>>>> +        pool->nr_cpus = nr_cpu_ids;
>>>>> +        pool->obj_size = object_size;
>>>>> +        pool->nr_objs = nr_objs;
>>>>> +        /* just prevent to fullfill the per-cpu ring array */
>>>>> +        pool->capacity = nents;
>>>>> +        pool->gfp = gfp & ~__GFP_ZERO;
>>>>> +        pool->context = context;
>>>>> +        pool->release = release;
>>>>> +        /* vmalloc is used in default to allocate percpu-slots */
>>>>> +        if (!(pool->gfp & GFP_ATOMIC))
>>>>> +                pool->flags |= OBJPOOL_FROM_VMALLOC;
>>>>> +
>>>>> +        pool->cpu_slots = kzalloc(pool->nr_cpus * sizeof(void *), pool->gfp);
>>>>> +        if (!pool->cpu_slots)
>>>>>                  return -ENOMEM;
>>>>> -        head->slot_sizes = (int *)&head->cpu_slots[head->nr_cpus];
>>>>>
>>>>>          /* initialize per-cpu slots */
>>>>> -        rc = objpool_init_percpu_slots(head, nr_objs, context, objinit);
>>>>> +        rc = objpool_init_percpu_slots(pool, context, objinit);
>>>>>          if (rc)
>>>>> -                objpool_fini_percpu_slots(head);
>>>>> +                objpool_fini_percpu_slots(pool);
>>>>>          else
>>>>> -                refcount_set(&head->ref, nr_objs + 1);
>>>>> +                refcount_set(&pool->ref, nr_objs + 1);
>>>>>
>>>>>          return rc;
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(objpool_init);
>>>>>
>>>>>  /* adding object to slot, abort if the slot was already full */
>>>>> -static inline int objpool_try_add_slot(void *obj, struct objpool_slot *os)
>>>>> +static inline int objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
>>>>>  {
>>>>> -        uint32_t *ages = SLOT_AGES(os);
>>>>> -        void **ents = SLOT_ENTS(os);
>>>>> -        uint32_t head, tail;
>>>>> +        struct objpool_slot *slot = pool->cpu_slots[cpu];
>>>>> +        uint32_t tail, next;
>>>>>
>>>>>          do {
>>>>> -                /* perform memory loading for both head and tail */
>>>>> -                head = READ_ONCE(os->head);
>>>>> -                tail = READ_ONCE(os->tail);
>>>>> -                /* just abort if slot is full */
>>>>> -                if (tail - head > os->mask)
>>>>> -                        return -ENOENT;
>>>>> -                /* try to extend tail by 1 using CAS to avoid races */
>>>>> -                if (try_cmpxchg_acquire(&os->tail, &tail, tail + 1))
>>>>> -                        break;
>>>>> -        } while (1);
>>>>> +                uint32_t head = READ_ONCE(slot->head);
>>>>>
>>>>> -        /* the tail-th of slot is reserved for the given obj */
>>>>> -        WRITE_ONCE(ents[tail & os->mask], obj);
>>>>> -        /* update epoch id to make this object available for pop() */
>>>>> -        smp_store_release(&ages[tail & os->mask], tail);
>>>>> +                tail = READ_ONCE(slot->tail);
>>>>> +                next = tail + 1;
>>>>> +
>>>>> +                /* This must never happen because capacity >= N + 1 */
>>>>> +                if (WARN_ON_ONCE((next > head && next - head > pool->nr_objs) ||
>>>>> +                                 (next < head && next > head + pool->nr_objs)))
>>>>> +                        return -EINVAL;
>>>>> +
>>>>> +        } while (!try_cmpxchg_acquire(&slot->tail, &tail, next));
>>>>> +
>>>>> +        WRITE_ONCE(slot->entries[tail & slot->mask], obj);
>>>>>          return 0;
>>>>>  }
>>>>>
>>>>>  /* reclaim an object to object pool */
>>>>> -int objpool_push(void *obj, struct objpool_head *oh)
>>>>> +int objpool_push(void *obj, struct objpool_head *pool)
>>>>>  {
>>>>>          unsigned long flags;
>>>>> -        int cpu, rc;
>>>>> +        int rc;
>>>>>
>>>>>          /* disable local irq to avoid preemption & interruption */
>>>>>          raw_local_irq_save(flags);
>>>>> -        cpu = raw_smp_processor_id();
>>>>> -        do {
>>>>> -                rc = objpool_try_add_slot(obj, oh->cpu_slots[cpu]);
>>>>> -                if (!rc)
>>>>> -                        break;
>>>>> -                cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
>>>>> -        } while (1);
>>>>> +        rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id());
>>>>>          raw_local_irq_restore(flags);
>>>>>
>>>>>          return rc;
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(objpool_push);
>>>>>
>>>>> -/* drop the allocated object, rather reclaim it to objpool */
>>>>> -int objpool_drop(void *obj, struct objpool_head *head)
>>>>> -{
>>>>> -        if (!obj || !head)
>>>>> -                return -EINVAL;
>>>>> -
>>>>> -        if (refcount_dec_and_test(&head->ref)) {
>>>>> -                objpool_free(head);
>>>>> -                return 0;
>>>>> -        }
>>>>> -
>>>>> -        return -EAGAIN;
>>>>> -}
>>>>> -EXPORT_SYMBOL_GPL(objpool_drop);
>>>>> -
>>>>>  /* try to retrieve object from slot */
>>>>> -static inline void *objpool_try_get_slot(struct objpool_slot *os)
>>>>> +static inline void *objpool_try_get_slot(struct objpool_slot *slot)
>>>>>  {
>>>>> -        uint32_t *ages = SLOT_AGES(os);
>>>>> -        void **ents = SLOT_ENTS(os);
>>>>>          /* do memory load of head to local head */
>>>>> -        uint32_t head = smp_load_acquire(&os->head);
>>>>> +        uint32_t head = smp_load_acquire(&slot->head);
>>>>>
>>>>>          /* loop if slot isn't empty */
>>>>> -        while (head != READ_ONCE(os->tail)) {
>>>>> -                uint32_t id = head & os->mask, prev = head;
>>>>> +        while (head != READ_ONCE(slot->tail)) {
>>>>>
>>>>>                  /* do prefetching of object ents */
>>>>> -                prefetch(&ents[id]);
>>>>> -
>>>>> -                /* check whether this item was ready for retrieval */
>>>>> -                if (smp_load_acquire(&ages[id]) == head) {
>>>>> -                        /* node must have been udpated by push() */
>>>>> -                        void *node = READ_ONCE(ents[id]);
>>>>> -                        /* commit and move forward head of the slot */
>>>>> -                        if (try_cmpxchg_release(&os->head, &head, head + 1))
>>>>> -                                return node;
>>>>> -                        /* head was already updated by others */
>>>>> -                }
>>>>> +                prefetch(&slot->entries[head & slot->mask]);
>>>>> +
>>>>> +                /* commit and move forward head of the slot */
>>>>> +                if (try_cmpxchg_release(&slot->head, &head, head + 1))
>>>>> +                        /*
>>>>> +                         * TBD: check overwrap the tail/head counter and warn
>>>>> +                         * if it is broken. But this happens only if this
>>>>> +                         * process slows down a lot and another CPU updates
>>>>> +                         * the haed/tail just 2^32 + 1 times, and this slot
>>>>> +                         * is empty.
>>>>> +                         */
>>>>> +                        return slot->entries[head & slot->mask];
>>>>>
>>>>>                  /* re-load head from memory and continue trying */
>>>>> -                head = READ_ONCE(os->head);
>>>>> -                /*
>>>>> -                 * head stays unchanged, so it's very likely there's an
>>>>> -                 * ongoing push() on other cpu nodes but yet not update
>>>>> -                 * ages[] to mark it's completion
>>>>> -                 */
>>>>> -                if (head == prev)
>>>>> -                        break;
>>>>> +                head = READ_ONCE(slot->head);
>>>>>          }
>>>>>
>>>>>          return NULL;
>>>>> @@ -307,32 +229,42 @@ void *objpool_pop(struct objpool_head *head)
>>>>>  EXPORT_SYMBOL_GPL(objpool_pop);
>>>>>
>>>>>  /* release whole objpool forcely */
>>>>> -void objpool_free(struct objpool_head *head)
>>>>> +void objpool_free(struct objpool_head *pool)
>>>>>  {
>>>>> -        if (!head->cpu_slots)
>>>>> +        if (!pool->cpu_slots)
>>>>>                  return;
>>>>>
>>>>>          /* release percpu slots */
>>>>> -        objpool_fini_percpu_slots(head);
>>>>> +        objpool_fini_percpu_slots(pool);
>>>>>
>>>>>          /* call user's cleanup callback if provided */
>>>>> -        if (head->release)
>>>>> -                head->release(head, head->context);
>>>>> +        if (pool->release)
>>>>> +                pool->release(pool, pool->context);
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(objpool_free);
>>>>>
>>>>> -/* drop unused objects and defref objpool for releasing */
>>>>> -void objpool_fini(struct objpool_head *head)
>>>>> +/* drop the allocated object, rather reclaim it to objpool */
>>>>> +int objpool_drop(void *obj, struct objpool_head *pool)
>>>>>  {
>>>>> -        void *nod;
>>>>> +        if (!obj || !pool)
>>>>> +                return -EINVAL;
>>>>>
>>>>> -        do {
>>>>> -                /* grab object from objpool and drop it */
>>>>> -                nod = objpool_pop(head);
>>>>> +        if (refcount_dec_and_test(&pool->ref)) {
>>>>> +
objpool_free(pool); >>>>> + return 0; >>>>> + } >>>>> + >>>>> + return -EAGAIN; >>>>> +} >>>>> +EXPORT_SYMBOL_GPL(objpool_drop); >>>>> + >>>>> +/* drop unused objects and defref objpool for releasing */ >>>>> +void objpool_fini(struct objpool_head *pool) >>>>> +{ >>>>> + void *obj; >>>>> >>>>> - /* drop the extra ref of objpool */ >>>>> - if (refcount_dec_and_test(&head->ref)) >>>>> - objpool_free(head); >>>>> - } while (nod); >>>>> + /* grab object from objpool and drop it */ >>>>> + while ((obj = objpool_pop(pool))) >>>>> + objpool_drop(obj, pool); >>>>> } >>>>> EXPORT_SYMBOL_GPL(objpool_fini); >>>> >>> >>> >> > >