From: Vlastimil Babka
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Christoph Lameter,
    David Rientjes, Pekka Enberg, Joonsoo Kim
Cc: Sebastian Andrzej Siewior, Thomas Gleixner, Mel Gorman,
    Jesper Dangaard Brouer, Peter Zijlstra, Jann Horn, Vlastimil Babka
Subject: [RFC v2 34/34] mm, slub: convert kmem_cpu_slab protection to local_lock
Date: Wed, 9 Jun 2021 13:39:03 +0200
Message-Id:
<20210609113903.1421-35-vbabka@suse.cz>
X-Mailer: git-send-email 2.31.1
In-Reply-To: <20210609113903.1421-1-vbabka@suse.cz>
References: <20210609113903.1421-1-vbabka@suse.cz>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

Embed local_lock into struct kmem_cache_cpu and use the irq-safe versions
of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT that
is equivalent, but gains lockdep visibility. On PREEMPT_RT the local lock
is a per-cpu spinlock that does not disable irqs, which improves
preemption.

However, the cost on PREEMPT_RT is the loss of the lockless fast paths,
which only work with the cpu freelist. Those are designed to detect and
recover from being preempted by other conflicting operations (both fast
and slow path), but the slow path operations assume they cannot be
preempted by a fast path operation, which is guaranteed naturally with
disabled irqs. With local locks on PREEMPT_RT, the fast paths now also
need to take the local lock to avoid races.

In the allocation fastpath, slab_alloc_node(), we can simply defer to the
slowpath __slab_alloc(), which also works with the cpu freelist, but under
the local lock. In the free fastpath, do_slab_free(), we have to add a new
local-lock-protected version of freeing to the cpu freelist, as the
existing slowpath only works with the page freelist.

Also update the comment about the locking scheme in SLUB to reflect the
changes done by this series.
Signed-off-by: Vlastimil Babka
---
 include/linux/slub_def.h |   2 +
 mm/slub.c                | 138 ++++++++++++++++++++++++++++++---------
 2 files changed, 110 insertions(+), 30 deletions(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..b5bcac29b979 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -10,6 +10,7 @@
 #include <linux/kfence.h>
 #include <linux/kobject.h>
 #include <linux/reciprocal_div.h>
+#include <linux/local_lock.h>

 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -41,6 +42,7 @@ enum stat_item {
 	NR_SLUB_STAT_ITEMS };

 struct kmem_cache_cpu {
+	local_lock_t lock;	/* Protects the fields below except stat */
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
 	struct page *page;	/* The slab from which we are allocating */
diff --git a/mm/slub.c b/mm/slub.c
index caa206213e72..500720ec1e57 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -43,13 +43,22 @@
 /*
  * Lock order:
  *   1. slab_mutex (Global Mutex)
- *   2. node->list_lock
- *   3. slab_lock(page) (Only on some arches and for debugging)
+ *   2. node->list_lock (Spinlock)
+ *      OR
+ *      kmem_cache->cpu_slab->lock (Local lock)
+ *   3. slab_lock(page) (Only on some arches or for debugging)
+ *   4. object_map_lock (Only for debugging)
  *
  *   slab_mutex
  *
  *   The role of the slab_mutex is to protect the list of all the slabs
  *   and to synchronize major metadata changes to slab cache structures.
+ *   Also synchronizes memory hotplug callbacks.
+ *
+ *   slab_lock
+ *
+ *   The slab_lock is a wrapper around the page lock, thus it is a bit
+ *   spinlock.
  *
  *   The slab_lock is only used for debugging and on arches that do not
  *   have the ability to do a cmpxchg_double. It only protects:
@@ -58,6 +67,8 @@
  *	C. page->objects	-> Number of objects in page
  *	D. page->frozen		-> frozen state
  *
+ *   Frozen slabs
+ *
  *   If a slab is frozen then it is exempt from list management. It is not
  *   on any list except per cpu partial list. The processor that froze the
  *   slab is the one who can perform list operations on the page. Other
@@ -65,6 +76,8 @@
  *   froze the slab is the only one that can retrieve the objects from the
  *   page's freelist.
  *
+ *   list_lock
+ *
  *   The list_lock protects the partial and full list on each node and
  *   the partial slab counter. If taken then no new slabs may be added or
  *   removed from the lists nor make the number of partial slabs be modified.
@@ -76,10 +89,36 @@
  *   slabs, operations can continue without any centralized lock. F.e.
  *   allocating a long series of objects that fill up slabs does not require
  *   the list lock.
- *   Interrupts are disabled during allocation and deallocation in order to
- *   make the slab allocator safe to use in the context of an irq. In addition
- *   interrupts are disabled to ensure that the processor does not change
- *   while handling per_cpu slabs, due to kernel preemption.
+ *
+ *   cpu_slab->lock local lock
+ *
+ *   This lock protects slowpath manipulation of all kmem_cache_cpu fields
+ *   except the stat counters. This is a percpu structure manipulated only by
+ *   the local cpu, so the lock protects against being preempted or interrupted
+ *   by an irq. Fast path operations rely on lockless operations instead.
+ *   On PREEMPT_RT, the local lock does not actually disable irqs (and thus
+ *   prevent the lockless operations), so fastpath operations also need to take
+ *   the lock and are no longer lockless.
+ *
+ *   lockless fastpaths
+ *
+ *   The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
+ *   are fully lockless when satisfied from the percpu slab (and when
+ *   cmpxchg_double is possible to use, otherwise slab_lock is taken).
+ *   They also don't disable preemption or migration or irqs. They rely on
+ *   the transaction id (tid) field to detect being preempted or moved to
+ *   another cpu.
+ *
+ *   irq, preemption, migration considerations
+ *
+ *   Interrupts are disabled as part of list_lock or local_lock operations, or
+ *   around the slab_lock operation, in order to make the slab allocator safe
+ *   to use in the context of an irq.
+ *
+ *   In addition, preemption (or migration on PREEMPT_RT) is disabled in the
+ *   allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
+ *   local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
+ *   doesn't have to be revalidated in each section protected by the local lock.
  *
  * SLUB assigns one slab for allocation to each processor.
  * Allocations only occur from these slabs called cpu slabs.
@@ -2179,9 +2218,13 @@ static inline void note_cmpxchg_failure(const char *n,
 static void init_kmem_cache_cpus(struct kmem_cache *s)
 {
 	int cpu;
+	struct kmem_cache_cpu *c;

-	for_each_possible_cpu(cpu)
-		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+	for_each_possible_cpu(cpu) {
+		c = per_cpu_ptr(s->cpu_slab, cpu);
+		local_lock_init(&c->lock);
+		c->tid = init_tid(cpu);
+	}
 }

 /*
@@ -2482,7 +2525,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
 	struct page *page;

 	if (lock)
-		local_irq_save(flags);
+		local_lock_irqsave(&s->cpu_slab->lock, flags);

 	freelist = c->freelist;
 	page = c->page;
@@ -2492,7 +2535,7 @@ static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
 	c->tid = next_tid(c->tid);

 	if (lock)
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);

 	if (page)
 		deactivate_slab(s, page, freelist);
@@ -2780,9 +2823,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto deactivate_slab;

 	/* must check again c->page in case we got preempted and it changed */
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (unlikely(page != c->page)) {
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_page;
 	}
 	freelist = c->freelist;
@@ -2793,7 +2836,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

 	if (!freelist) {
 		c->page = NULL;
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -2802,7 +2845,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

 load_freelist:

-	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));

 	/*
 	 * freelist is pointing to the list of objects to be used.
@@ -2812,39 +2855,39 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	return freelist;

 deactivate_slab:

-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (page != c->page) {
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_page;
 	}
 	freelist = c->freelist;
 	c->page = NULL;
 	c->freelist = NULL;
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	deactivate_slab(s, page, freelist);

 new_slab:

 	if (slub_percpu_partial(c)) {
-		local_irq_save(flags);
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
 		if (unlikely(c->page)) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			goto reread_page;
 		}
 		if (unlikely(!slub_percpu_partial(c))) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			/* we were preempted and partial list got empty */
 			goto new_objects;
 		}

 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, CPU_PARTIAL_ALLOC);
 		goto redo;
 	}
@@ -2897,7 +2940,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,

 retry_load_page:

-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (unlikely(c->page)) {
 		void *flush_freelist = c->freelist;
 		struct page *flush_page = c->page;
@@ -2906,7 +2949,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		c->freelist = NULL;
 		c->tid = next_tid(c->tid);

-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);

 		deactivate_slab(s, flush_page, flush_freelist);
@@ -3025,7 +3068,15 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !page || !node_match(page, node))) {
+	/*
+	 * We cannot use the lockless fastpath on PREEMPT_RT because if a
+	 * slowpath has taken the local_lock_irqsave(), it is not protected
+	 * against a fast path operation in an irq handler. So we need to take
+	 * the slow path which uses local_lock. It is still relatively fast if
+	 * there is a suitable cpu freelist.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
+	    unlikely(!object || !page || !node_match(page, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
@@ -3285,6 +3336,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	barrier();

 	if (likely(page == c->page)) {
+#ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);

 		set_freepointer(s, tail_obj, freelist);
@@ -3297,6 +3349,32 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+#else /* CONFIG_PREEMPT_RT */
+		/*
+		 * We cannot use the lockless fastpath on PREEMPT_RT because if
+		 * a slowpath has taken the local_lock_irqsave(), it is not
+		 * protected against a fast path operation in an irq handler. So
+		 * we need to take the local_lock. We shouldn't simply defer to
+		 * __slab_free() as that wouldn't use the cpu freelist at all.
+		 */
+		unsigned long flags;
+		void **freelist;
+
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
+		c = this_cpu_ptr(s->cpu_slab);
+		if (unlikely(page != c->page)) {
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			goto redo;
+		}
+		tid = c->tid;
+		freelist = c->freelist;
+
+		set_freepointer(s, tail_obj, freelist);
+		c->freelist = head;
+		c->tid = next_tid(tid);
+
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+#endif
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, head, tail_obj, cnt, addr);
@@ -3467,7 +3545,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 	 * handlers invoking normal fastpath.
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
-	local_irq_disable();
+	local_lock_irq(&s->cpu_slab->lock);

 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
@@ -3488,7 +3566,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			 */
 			c->tid = next_tid(c->tid);

-			local_irq_enable();
+			local_unlock_irq(&s->cpu_slab->lock);

 			/*
 			 * Invoking slow path likely have side-effect
@@ -3502,7 +3580,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);

-			local_irq_disable();
+			local_lock_irq(&s->cpu_slab->lock);

 			continue; /* goto for-loop */
 		}
@@ -3511,7 +3589,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 		maybe_wipe_obj_freeptr(s, p[i]);
 	}
 	c->tid = next_tid(c->tid);
-	local_irq_enable();
+	local_unlock_irq(&s->cpu_slab->lock);
 	slub_put_cpu_ptr(s->cpu_slab);

 	/*
@@ -3522,7 +3600,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
 				slab_want_init_on_alloc(flags, s));
 	return i;
 error:
-	local_irq_enable();
+	local_unlock_irq(&s->cpu_slab->lock);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
-- 
2.31.1