From: Vlastimil Babka
To: "Liam R.
	Howlett", Matthew Wilcox, Christoph Lameter, David Rientjes,
	Pekka Enberg, Joonsoo Kim
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>, Roman Gushchin,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev, Vlastimil Babka
Subject: [RFC v1 2/5] mm, slub: add opt-in slub_percpu_array
Date: Tue, 8 Aug 2023 11:53:45 +0200
Message-ID: <20230808095342.12637-9-vbabka@suse.cz>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20230808095342.12637-7-vbabka@suse.cz>
References: <20230808095342.12637-7-vbabka@suse.cz>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

kmem_cache_setup_percpu_array() will allocate a per-cpu array of the
given size for caching objects on alloc/free for the cache. The cache
has to be created with the SLAB_NO_MERGE flag.

The array is filled by freeing. When it is empty on alloc or full on
free, the operation simply bypasses it; there's currently no batch
freeing/allocation.

The locking is copied from the page allocator's pcplists, based on
embedded spin locks. Interrupts are not disabled, only preemption (cpu
migration on RT). Trylock is attempted to avoid deadlock due to an
interrupt; trylock failure means the array is bypassed.

Sysfs stat counters alloc_cpu_cache and free_cpu_cache count operations
that used the percpu array.

Bulk allocation bypasses the array, bulk freeing does not.

kmem_cache_prefill_percpu_array() can be called to ensure the array on
the current cpu holds at least the given number of objects. However this
is only opportunistic as there's no cpu pinning and the trylocks may
always fail.
Therefore allocations cannot rely on the array for success even after
the prefill. But misses should be rare enough that e.g. GFP_ATOMIC
allocations should be acceptable after the prefill. The operation is
currently not optimized.

More TODO/FIXMEs:

- NUMA awareness - preferred node currently ignored, __GFP_THISNODE not
  honored
- slub_debug - will not work for allocations from the array. Normally in
  SLUB implementation the slub_debug kills all fast paths, but that could
  lead to depleting the reserves if we ignore the prefill and use
  GFP_ATOMIC. Needs more thought.
---
 include/linux/slab.h     |   4 +
 include/linux/slub_def.h |  10 ++
 mm/slub.c                | 210 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 223 insertions(+), 1 deletion(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 848c7c82ad5a..f6c91cbc1544 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -196,6 +196,8 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
 void kmem_cache_destroy(struct kmem_cache *s);
 int kmem_cache_shrink(struct kmem_cache *s);
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count);
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
@@ -494,6 +496,8 @@ void kmem_cache_free(struct kmem_cache *s, void *objp);
 void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
 int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
 
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count, gfp_t gfp);
+
 static __always_inline void kfree_bulk(size_t size, void **p)
 {
 	kmem_cache_free_bulk(NULL, size, p);
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index deb90cf4bffb..c85434668419 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -13,8 +13,10 @@
 #include
 
 enum stat_item {
+	ALLOC_PERCPU_CACHE,	/* Allocation from percpu array cache */
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
 	ALLOC_SLOWPATH,		/* Allocation by getting a new cpu slab */
+	FREE_PERCPU_CACHE,	/* Free to percpu array cache */
 	FREE_FASTPATH,		/* Free to cpu slab */
 	FREE_SLOWPATH,		/* Freeing not to cpu slab */
 	FREE_FROZEN,		/* Freeing to frozen slab */
@@ -66,6 +68,13 @@ struct kmem_cache_cpu {
 };
 #endif /* CONFIG_SLUB_TINY */
 
+struct slub_percpu_array {
+	spinlock_t lock;
+	unsigned int count;
+	unsigned int used;
+	void * objects[];
+};
+
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 #define slub_percpu_partial(c)		((c)->partial)
 
@@ -99,6 +108,7 @@ struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
 #endif
+	struct slub_percpu_array __percpu *cpu_array;
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
 	unsigned long min_partial;
diff --git a/mm/slub.c b/mm/slub.c
index a9437d48840c..7fc9f7c124eb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -188,6 +188,79 @@ do {					\
 #define USE_LOCKLESS_FAST_PATH()	(false)
 #endif
 
+/* copy/pasted from mm/page_alloc.c */
+
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
+/*
+ * On SMP, spin_trylock is sufficient protection.
+ * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ */
+#define pcp_trylock_prepare(flags)	do { } while (0)
+#define pcp_trylock_finish(flag)	do { } while (0)
+#else
+
+/* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */
+#define pcp_trylock_prepare(flags)	local_irq_save(flags)
+#define pcp_trylock_finish(flags)	local_irq_restore(flags)
+#endif
+
+/*
+ * Locking a pcp requires a PCP lookup followed by a spinlock. To avoid
+ * a migration causing the wrong PCP to be locked and remote memory being
+ * potentially allocated, pin the task to the CPU for the lookup+lock.
+ * preempt_disable is used on !RT because it is faster than migrate_disable.
+ * migrate_disable is used on RT because otherwise RT spinlock usage is
+ * interfered with and a high priority task cannot preempt the allocator.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define pcpu_task_pin()		preempt_disable()
+#define pcpu_task_unpin()	preempt_enable()
+#else
+#define pcpu_task_pin()		migrate_disable()
+#define pcpu_task_unpin()	migrate_enable()
+#endif
+
+/*
+ * Generic helper to lookup and a per-cpu variable with an embedded spinlock.
+ * Return value should be used with equivalent unlock helper.
+ */
+#define pcpu_spin_lock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	spin_lock(&_ret->member);					\
+	_ret;								\
+})
+
+#define pcpu_spin_trylock(type, member, ptr)				\
+({									\
+	type *_ret;							\
+	pcpu_task_pin();						\
+	_ret = this_cpu_ptr(ptr);					\
+	if (!spin_trylock(&_ret->member)) {				\
+		pcpu_task_unpin();					\
+		_ret = NULL;						\
+	}								\
+	_ret;								\
+})
+
+#define pcpu_spin_unlock(member, ptr)					\
+({									\
+	spin_unlock(&ptr->member);					\
+	pcpu_task_unpin();						\
+})
+
+/* struct slub_percpu_array specific helpers. */
+#define pca_spin_lock(ptr)						\
+	pcpu_spin_lock(struct slub_percpu_array, lock, ptr)
+
+#define pca_spin_trylock(ptr)						\
+	pcpu_spin_trylock(struct slub_percpu_array, lock, ptr)
+
+#define pca_spin_unlock(ptr)						\
+	pcpu_spin_unlock(lock, ptr)
+
 #ifndef CONFIG_SLUB_TINY
 #define __fastpath_inline __always_inline
 #else
@@ -3326,6 +3399,32 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	return p;
 }
 
+static inline void *alloc_from_pca(struct kmem_cache *s)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+	void *object = NULL;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (unlikely(!pca))
+		goto failed;
+
+	if (likely(pca->used > 0)) {
+		object = pca->objects[--pca->used];
+		pca_spin_unlock(pca);
+		pcp_trylock_finish(UP_flags);
+		stat(s, ALLOC_PERCPU_CACHE);
+		return object;
+	}
+	pca_spin_unlock(pca);
+
+failed:
+	pcp_trylock_finish(UP_flags);
+	return NULL;
+}
+
 static __always_inline void *__slab_alloc_node(struct kmem_cache *s,
 		gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
 {
@@ -3465,7 +3564,11 @@ static __fastpath_inline void *slab_alloc_node(struct kmem_cache *s, struct list
 	if (unlikely(object))
 		goto out;
 
-	object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
+	if (s->cpu_array)
+		object = alloc_from_pca(s);
+
+	if (!object)
+		object = __slab_alloc_node(s, gfpflags, node, addr, orig_size);
 
 	maybe_wipe_obj_freeptr(s, object);
 	init = slab_want_init_on_alloc(gfpflags, s);
@@ -3715,6 +3818,34 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+static inline bool free_to_pca(struct kmem_cache *s, void *object)
+{
+	unsigned long __maybe_unused UP_flags;
+	struct slub_percpu_array *pca;
+	bool ret = false;
+
+	pcp_trylock_prepare(UP_flags);
+	pca = pca_spin_trylock(s->cpu_array);
+
+	if (!pca) {
+		pcp_trylock_finish(UP_flags);
+		return false;
+	}
+
+	if (pca->used < pca->count) {
+		pca->objects[pca->used++] = object;
+		ret = true;
+	}
+
+	pca_spin_unlock(pca);
+	pcp_trylock_finish(UP_flags);
+
+	if (ret)
+		stat(s, FREE_PERCPU_CACHE);
+
+	return ret;
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -3740,6 +3871,11 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	unsigned long tid;
 	void **freelist;
 
+	if (s->cpu_array && cnt == 1) {
+		if (free_to_pca(s, head))
+			return;
+	}
+
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3793,6 +3929,11 @@ static void do_slab_free(struct kmem_cache *s,
 {
 	void *tail_obj = tail ? : head;
 
+	if (s->cpu_array && cnt == 1) {
+		if (free_to_pca(s, head))
+			return;
+	}
+
 	__slab_free(s, slab, head, tail_obj, cnt, addr);
 }
 #endif /* CONFIG_SLUB_TINY */
@@ -4060,6 +4201,45 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 }
 EXPORT_SYMBOL(kmem_cache_alloc_bulk);
 
+int kmem_cache_prefill_percpu_array(struct kmem_cache *s, unsigned int count,
+		gfp_t gfp)
+{
+	struct slub_percpu_array *pca;
+	void *objects[32];
+	unsigned int used;
+	unsigned int allocated;
+
+	if (!s->cpu_array)
+		return -EINVAL;
+
+	/* racy but we don't care */
+	pca = raw_cpu_ptr(s->cpu_array);
+
+	used = READ_ONCE(pca->used);
+
+	if (used >= count)
+		return 0;
+
+	if (pca->count < count)
+		return -EINVAL;
+
+	count -= used;
+
+	/* TODO fix later */
+	if (count > 32)
+		count = 32;
+
+	for (int i = 0; i < count; i++)
+		objects[i] = NULL;
+	allocated = kmem_cache_alloc_bulk(s, gfp, count, &objects[0]);
+
+	for (int i = 0; i < count; i++) {
+		if (objects[i]) {
+			kmem_cache_free(s, objects[i]);
+		}
+	}
+	return allocated;
+}
 
 /*
  * Object placement in a slab is made very easy because we always start at
@@ -5131,6 +5311,30 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 	return 0;
 }
 
+int kmem_cache_setup_percpu_array(struct kmem_cache *s, unsigned int count)
+{
+	int cpu;
+
+	if (WARN_ON_ONCE(!(s->flags & SLAB_NO_MERGE)))
+		return -EINVAL;
+
+	s->cpu_array = __alloc_percpu(struct_size(s->cpu_array, objects, count),
+					sizeof(void *));
+
+	if (!s->cpu_array)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct slub_percpu_array *pca = per_cpu_ptr(s->cpu_array, cpu);
+
+		spin_lock_init(&pca->lock);
+		pca->count = count;
+		pca->used = 0;
+	}
+
+	return 0;
+}
+
 #ifdef SLAB_SUPPORTS_SYSFS
 static int count_inuse(struct slab *slab)
 {
@@ -5908,8 +6112,10 @@ static ssize_t text##_store(struct kmem_cache *s,		\
 }								\
 SLAB_ATTR(text);						\
 
+STAT_ATTR(ALLOC_PERCPU_CACHE, alloc_cpu_cache);
 STAT_ATTR(ALLOC_FASTPATH, alloc_fastpath);
 STAT_ATTR(ALLOC_SLOWPATH, alloc_slowpath);
+STAT_ATTR(FREE_PERCPU_CACHE, free_cpu_cache);
 STAT_ATTR(FREE_FASTPATH, free_fastpath);
 STAT_ATTR(FREE_SLOWPATH, free_slowpath);
 STAT_ATTR(FREE_FROZEN, free_frozen);
@@ -5995,8 +6201,10 @@ static struct attribute *slab_attrs[] = {
 	&remote_node_defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_SLUB_STATS
+	&alloc_cpu_cache_attr.attr,
 	&alloc_fastpath_attr.attr,
 	&alloc_slowpath_attr.attr,
+	&free_cpu_cache_attr.attr,
 	&free_fastpath_attr.attr,
 	&free_slowpath_attr.attr,
 	&free_frozen_attr.attr,
-- 
2.41.0
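
For readers who want to poke at the bypass semantics outside the kernel,
here is a minimal user-space sketch in C. It is not part of the patch and
is not the kernel implementation: struct pca_model and the pca_model_*()
helpers are hypothetical names, and a C11 atomic_bool stands in for the
embedded spinlock plus the pcp_trylock/pcpu_task_pin machinery. It only
models what the changelog describes: a LIFO array that is filled by
freeing and bypassed when it is empty on alloc, full on free, or when the
trylock fails.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct slub_percpu_array. */
struct pca_model {
	atomic_bool locked;	/* assumption: replaces the embedded spinlock */
	unsigned int count;	/* capacity of objects[] */
	unsigned int used;	/* number of objects currently cached */
	void *objects[];	/* LIFO stack of cached object pointers */
};

/* Allocate one array with room for 'count' pointers; NULL on failure. */
struct pca_model *pca_model_create(unsigned int count)
{
	struct pca_model *pca;

	pca = malloc(sizeof(*pca) + (size_t)count * sizeof(void *));
	if (!pca)
		return NULL;

	atomic_init(&pca->locked, false);
	pca->count = count;
	pca->used = 0;
	return pca;
}

/* Returns true if the lock was taken, false if it was already held. */
bool pca_model_trylock(struct pca_model *pca)
{
	return !atomic_exchange_explicit(&pca->locked, true,
					 memory_order_acquire);
}

void pca_model_unlock(struct pca_model *pca)
{
	atomic_store_explicit(&pca->locked, false, memory_order_release);
}

/*
 * Pop the most recently cached object. NULL means "bypass": the array
 * was empty or the trylock failed, so the caller must use its normal
 * allocation path.
 */
void *pca_model_alloc(struct pca_model *pca)
{
	void *object = NULL;

	if (!pca_model_trylock(pca))
		return NULL;

	if (pca->used > 0)
		object = pca->objects[--pca->used];

	pca_model_unlock(pca);
	return object;
}

/*
 * Push an object. false means "bypass": the array was full or the
 * trylock failed, so the caller must use its normal free path.
 */
bool pca_model_free(struct pca_model *pca, void *object)
{
	bool cached = false;

	if (!pca_model_trylock(pca))
		return false;

	if (pca->used < pca->count) {
		pca->objects[pca->used++] = object;
		cached = true;
	}

	pca_model_unlock(pca);
	return cached;
}
```

With a capacity of 2, a third pca_model_free() returns false and the
caller would take the regular free path. A pca_model_alloc() on an empty
array, or while the lock is held elsewhere, returns NULL and the caller
would fall back to the regular allocation path, which is why the
changelog says allocations cannot rely on the array even after a prefill.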