From: Vlastimil Babka <vbabka@suse.cz>
To: vbabka@suse.cz
Cc: akpm@linux-foundation.org, bigeasy@linutronix.de, cl@linux.com, guro@fb.com,
    hannes@cmpxchg.org, iamjoonsoo.kim@lge.com, jannh@google.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org,
    minchan@kernel.org, penberg@kernel.org, rientjes@google.com,
    shakeelb@google.com, surenb@google.com, tglx@linutronix.de
Subject: [RFC 2/2] mm, slub: add shrinker to reclaim cached slabs
Date: Thu, 21 Jan 2021 18:21:54 +0100
Message-Id: <20210121172154.27580-2-vbabka@suse.cz>
In-Reply-To: <20210121172154.27580-1-vbabka@suse.cz>
References: <20210121172154.27580-1-vbabka@suse.cz>

For performance reasons, SLUB doesn't keep all slabs on shared lists and
doesn't always free slabs immediately after all objects are freed.
Namely:

- for each cache and cpu, there might be a "CPU slab" page, partially or
  fully free
- with SLUB_CPU_PARTIAL enabled (default y), there might be a number of
  "percpu partial slabs" for each cache and cpu, also partially or fully
  free
- for each cache and numa node, there are slabs on the per-node partial
  list, up to 10 of which may be empty

As Jann reports [1], the number of percpu partial slabs should be limited
by the number of free objects (up to 30), but due to imprecise accounting,
this can deteriorate so that there are up to 30 free slabs. He notes:

> Even on an old-ish Android phone (Pixel 2), with normal-ish usage, I
> see something like 1.5MiB of pages with zero inuse objects stuck in
> percpu lists.

My observations match Jann's, and we've seen e.g. cases with 10 free slabs
per cpu. We can also confirm Jann's theory that on kernels before the
kmemcg rewrite (in v5.9), this issue is amplified, as there are separate
sets of kmem caches with cpu caches, per-cpu partial and per-node partial
lists for each memcg and cache that deals with kmemcg-accounted objects.

The cached free slabs can therefore become a memory waste, raising memory
pressure, causing more reclaim of actually used LRU pages, and possibly
even causing OOM (global, or memcg on older kernels).

SLUB provides __kmem_cache_shrink(), which can flush all the
abovementioned slabs, but it is currently called only in rare situations
or from a sysfs handler. The standard way to cooperate with reclaim is to
provide a shrinker, so this patch adds such a shrinker to call
__kmem_cache_shrink() systematically.

The shrinker design is, however, atypical. The usual design assumes that
a shrinker can easily count how many objects can be reclaimed, and then
reclaim a given number of objects. For SLUB, determining the number of
the various cached slabs would be a lot of work, and controlling how many
to shrink precisely would be impractical.
Instead, the shrinker is based on reclaim priority: at the lowest
priority it shrinks a single kmem cache, while at the highest it shrinks
all of them. To do that effectively, there is a new list,
caches_to_shrink, where caches are taken from the head and then moved to
the tail. The existing slab_caches list is unaffected, so e.g.
/proc/slabinfo order is not disrupted.

This approach should not cause excessive shrinking and IPI storms:

- If there are multiple reclaimers in parallel, only one can proceed,
  thanks to mutex_trylock(&slab_mutex). After unlocking, caches that
  were just shrunk are at the tail of the list.
- In flush_all(), we actually check whether there's anything to flush on
  a CPU (has_cpu_slab()) before sending an IPI.
- CPU slab deactivation became more efficient with "mm, slub: splice cpu
  and page freelists in deactivate_slab()".

The result is that SLUB's per-cpu and per-node caches are trimmed of
free pages, and partially used pages have a higher chance of being
either reused or freed. The trimming effort is controlled by reclaim
activity and thus memory pressure. Before an OOM, a reclaim attempt at
the highest priority ensures shrinking of all caches. Also, being a
proper slab shrinker, the shrinking is now invoked as part of the
drop_caches sysctl operation.

[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/slub_def.h |  1 +
 mm/slub.c                | 76 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index dcde82a4434c..6c4eeb30764d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -107,6 +107,7 @@ struct kmem_cache {
 	unsigned int red_left_pad;	/* Left redzone padding size */
 	const char *name;	/* Name (only for display!) */
 	struct list_head list;	/* List of slab caches */
+	struct list_head shrink_list;	/* List ordered for shrinking */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 #endif
diff --git a/mm/slub.c b/mm/slub.c
index c3141aa962be..bba05bd9287a 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -123,6 +123,8 @@ DEFINE_STATIC_KEY_FALSE(slub_debug_enabled);
 #endif
 #endif
 
+static LIST_HEAD(caches_to_shrink);
+
 static inline bool kmem_cache_debug(struct kmem_cache *s)
 {
 	return kmem_cache_debug_flags(s, SLAB_DEBUG_FLAGS);
@@ -3933,6 +3935,8 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 	int node;
 	struct kmem_cache_node *n;
 
+	list_del(&s->shrink_list);
+
 	flush_all(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
@@ -3985,6 +3989,69 @@ void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page)
 }
 #endif
 
+static unsigned long count_shrinkable_caches(struct shrinker *shrink,
+					     struct shrink_control *sc)
+{
+	/*
+	 * Determining how much there is to shrink would be so complex, it's
+	 * better to just pretend there always is and scale the actual effort
+	 * based on sc->priority.
+	 */
+	return shrink->batch;
+}
+
+static unsigned long shrink_caches(struct shrinker *shrink,
+				   struct shrink_control *sc)
+{
+	struct kmem_cache *s;
+	int nr_to_shrink;
+	int ret = sc->nr_to_scan / 2;
+
+	nr_to_shrink = DEF_PRIORITY - sc->priority;
+	if (nr_to_shrink < 0)
+		nr_to_shrink = 0;
+
+	nr_to_shrink = 1 << nr_to_shrink;
+	if (sc->priority == 0) {
+		nr_to_shrink = INT_MAX;
+		ret = 0;
+	}
+
+	if (!mutex_trylock(&slab_mutex))
+		return SHRINK_STOP;
+
+	list_for_each_entry(s, &caches_to_shrink, shrink_list) {
+		__kmem_cache_shrink(s);
+		if (--nr_to_shrink == 0) {
+			list_bulk_move_tail(&caches_to_shrink,
+					    caches_to_shrink.next,
+					    &s->shrink_list);
+			break;
+		}
+	}
+
+	mutex_unlock(&slab_mutex);
+
+	/*
+	 * As long as we are not at the highest priority, pretend we freed
+	 * something, as we might not have processed all caches. This
+	 * should signal that it's worth retrying. Once we are at the highest
+	 * priority and shrink the whole list, pretend we didn't free
+	 * anything, because there's no point in trying again.
+	 *
+	 * Note the value is currently ignored in "normal" reclaim, but
+	 * drop_slab_node(), which handles the drop_caches sysctl, works
+	 * like this.
+	 */
+	return ret;
+}
+
+static struct shrinker slub_cache_shrinker = {
+	.count_objects = count_shrinkable_caches,
+	.scan_objects = shrink_caches,
+	.batch = 128,
+	.seeks = 0,
+};
+
 /********************************************************************
  *			Kmalloc subsystem
  *******************************************************************/
@@ -4424,6 +4491,8 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 #endif
 	}
 	list_add(&s->list, &slab_caches);
+	list_del(&static_cache->shrink_list);
+	list_add(&s->shrink_list, &caches_to_shrink);
 	return s;
 }
 
@@ -4480,6 +4549,8 @@ void __init kmem_cache_init(void)
 
 void __init kmem_cache_init_late(void)
 {
+	if (register_shrinker(&slub_cache_shrinker))
+		pr_err("SLUB: failed to register shrinker\n");
 }
 
 struct kmem_cache *
@@ -4518,11 +4589,14 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
 
 	/* Mutex is not taken during early boot */
 	if (slab_state <= UP)
-		return 0;
+		goto out;
 
 	err = sysfs_slab_add(s);
 	if (err)
 		__kmem_cache_release(s);
 
+out:
+	if (!err)
+		list_add(&s->shrink_list, &caches_to_shrink);
 	return err;
 }
-- 
2.30.0