From: Yang Shi
Date: Tue, 9 Feb 2021 17:25:16 -0800
Subject: Re: [v7 PATCH 08/12] mm: vmscan: add per memcg shrinker nr_deferred
To: Roman Gushchin
Cc: Kirill Tkhai, Vlastimil Babka, Shakeel Butt, Dave Chinner,
 Johannes Weiner, Michal Hocko, Andrew Morton, Linux MM,
 Linux FS-devel Mailing List, Linux Kernel Mailing List
In-Reply-To: <20210210011020.GL524633@carbon.DHCP.thefacebook.com>
References: <20210209174646.1310591-1-shy828301@gmail.com>
 <20210209174646.1310591-9-shy828301@gmail.com>
 <20210210011020.GL524633@carbon.DHCP.thefacebook.com>

On Tue, Feb 9, 2021 at 5:10 PM Roman Gushchin wrote:
>
> On Tue, Feb 09, 2021 at 09:46:42AM -0800, Yang Shi wrote:
> > Currently the number of deferred objects is per shrinker, but some
> > slabs, for example the vfs inode/dentry caches, are per memcg. This
> > results in poor isolation among memcgs.
> >
> > Deferred objects are typically generated by __GFP_NOFS allocations.
> > One memcg with excessive __GFP_NOFS allocations may blow up its
> > deferred count, and then other, innocent memcgs may suffer from
> > over-shrinking, excessive reclaim latency, etc.
> >
> > For example, consider two workloads running in memcgA and memcgB
> > respectively, where the workload in B is vfs-heavy. If the workload
> > in A generates excessive deferred objects, then B's vfs cache may be
> > hit heavily (half of the caches dropped) by B's limit reclaim or by
> > global reclaim.
> >
> > We observed this in our production environment, which was running a
> > vfs-heavy workload, as shown in the tracing log below:
> >
> > <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > cache items 246404277 delta 31345 total_scan 123202138
> > <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458:
> > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602
> > last shrinker return val 123186855
> >
> > The vfs cache to page cache ratio was 10:1 on this machine, and half
> > of the caches were dropped. This also caused a significant amount of
> > page cache to be dropped due to inode eviction.
> >
> > Making nr_deferred per memcg for memcg-aware shrinkers would solve
> > the unfairness and bring better isolation.
> >
> > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the
> > shrinker's own nr_deferred is used, and non-memcg-aware shrinkers
> > use the shrinker's nr_deferred all the time.
> >
> > Signed-off-by: Yang Shi
> > ---
> >  include/linux/memcontrol.h |  7 +++---
> >  mm/vmscan.c                | 49 +++++++++++++++++++++++++-------------
> >  2 files changed, 37 insertions(+), 19 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 4c9253896e25..c457fc7bc631 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -93,12 +93,13 @@ struct lruvec_stat {
> >  };
> >
> >  /*
> > - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> > - * which have elements charged to this memcg.
> > + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> > + * shrinkers, which have elements charged to this memcg.
> >   */
> >  struct shrinker_info {
> >  	struct rcu_head rcu;
> > -	unsigned long map[];
> > +	atomic_long_t *nr_deferred;
> > +	unsigned long *map;
> >  };
> >
> >  /*
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a047980536cf..d4b030a0b2a9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -187,9 +187,13 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  #ifdef CONFIG_MEMCG
> >  static int shrinker_nr_max;
> >
> > +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> >  #define NR_MAX_TO_SHR_MAP_SIZE(nr_max) \
> >  	(DIV_ROUND_UP(nr_max, BITS_PER_LONG) * sizeof(unsigned long))
> >
> > +#define NR_MAX_TO_SHR_DEF_SIZE(nr_max) \
> > +	(round_up(nr_max, BITS_PER_LONG) * sizeof(atomic_long_t))
> > +
> >  static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
> >  						     int nid)
> >  {
> > @@ -203,10 +207,12 @@ static void free_shrinker_info_rcu(struct rcu_head *head)
> >  }
> >
> >  static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> > -				    int size, int old_size)
> > +				    int m_size, int d_size,
> > +				    int old_m_size, int old_d_size)
> >  {
> >  	struct shrinker_info *new, *old;
> >  	int nid;
> > +	int size = m_size + d_size;
> >
> >  	for_each_node(nid) {
> >  		old = shrinker_info_protected(memcg, nid);
> > @@ -218,9 +224,15 @@ static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> >  		if (!new)
> >  			return -ENOMEM;
> >
> > -		/* Set all old bits, clear all new bits */
> > -		memset(new->map, (int)0xff, old_size);
> > -		memset((void *)new->map + old_size, 0, size - old_size);
> > +		new->nr_deferred = (atomic_long_t *)(new + 1);
> > +		new->map = (void *)new->nr_deferred + d_size;
> > +
> > +		/* map: set all old bits, clear all new bits */
> > +		memset(new->map, (int)0xff, old_m_size);
> > +		memset((void *)new->map + old_m_size, 0, m_size - old_m_size);
> > +		/* nr_deferred: copy old values, clear all new values */
> > +		memcpy(new->nr_deferred, old->nr_deferred, old_d_size);
> > +		memset((void *)new->nr_deferred + old_d_size, 0, d_size - old_d_size);
> >
> >  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
> >  		call_rcu(&old->rcu, free_shrinker_info_rcu);
> > @@ -235,9 +247,6 @@ void free_shrinker_info(struct mem_cgroup *memcg)
> >  	struct shrinker_info *info;
> >  	int nid;
> >
> > -	if (mem_cgroup_is_root(memcg))
> > -		return;
> > -
> >  	for_each_node(nid) {
> >  		pn = mem_cgroup_nodeinfo(memcg, nid);
> >  		info = shrinker_info_protected(memcg, nid);
> > @@ -250,12 +259,13 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> >  {
> >  	struct shrinker_info *info;
> >  	int nid, size, ret = 0;
> > -
> > -	if (mem_cgroup_is_root(memcg))
> > -		return 0;
> > +	int m_size, d_size = 0;
> >
> >  	down_write(&shrinker_rwsem);
> > -	size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
> > +	m_size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
> > +	d_size = NR_MAX_TO_SHR_DEF_SIZE(shrinker_nr_max);
> > +	size = m_size + d_size;
> > +
> >  	for_each_node(nid) {
> >  		info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> >  		if (!info) {
> > @@ -263,6 +273,8 @@ int alloc_shrinker_info(struct mem_cgroup *memcg)
> >  			ret = -ENOMEM;
> >  			break;
> >  		}
> > +		info->nr_deferred = (atomic_long_t *)(info + 1);
> > +		info->map = (void *)info->nr_deferred + d_size;
> >  		rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
> >  	}
> >  	up_write(&shrinker_rwsem);
> > @@ -274,10 +286,16 @@ static int expand_shrinker_info(int new_id)
> >  {
> >  	int size, old_size, ret = 0;
> >  	int new_nr_max = new_id + 1;
> > +	int m_size, d_size = 0;
> > +	int old_m_size, old_d_size = 0;
> >  	struct mem_cgroup *memcg;
> >
> > -	size = NR_MAX_TO_SHR_MAP_SIZE(new_nr_max);
> > -	old_size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
> > +	m_size = NR_MAX_TO_SHR_MAP_SIZE(new_nr_max);
> > +	d_size = NR_MAX_TO_SHR_DEF_SIZE(new_nr_max);
> > +	size = m_size + d_size;
> > +	old_m_size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
> > +	old_d_size = NR_MAX_TO_SHR_DEF_SIZE(shrinker_nr_max);
> > +	old_size = old_m_size + old_d_size;
> >  	if (size <= old_size)
> >  		goto out;
>
> It looks correct, but a bit bulky. Can we check that the new maximum
> number of elements is larger than the old one here?

I don't think so. For example, suppose shrinker_nr_max is 1; then a new
shrinker is registered and new_nr_max becomes 2, but the new size is
actually equal to the old size. We should be able to do:

    if (round_up(new_nr_max, BITS_PER_LONG) <=
        round_up(shrinker_nr_max, BITS_PER_LONG))

Does that seem better? (A small user-space sketch of this check follows
at the end of this mail.)

> > @@ -286,9 +304,8 @@ static int expand_shrinker_info(int new_id)
> >
> >  	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> >  	do {
> > -		if (mem_cgroup_is_root(memcg))
> > -			continue;
> > -		ret = expand_one_shrinker_info(memcg, size, old_size);
> > +		ret = expand_one_shrinker_info(memcg, m_size, d_size,
> > +					       old_m_size, old_d_size);
>
> Pass the old and the new numbers to expand_one_shrinker_info() and
> have all the size manipulation there?

With the above proposal we could move the size manipulation to right
before the memcg iteration, so we could save some cycles when we don't
have to expand.

> >  		if (ret) {
> >  			mem_cgroup_iter_break(NULL, memcg);
> >  			goto out;
> > --
> > 2.26.2
> >
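
As a quick sanity check of the proposed comparison, here is a small
user-space sketch (ROUND_UP stands in for the kernel's round_up();
this is illustrative only, not the patch code):

#include <stdio.h>

#define BITS_PER_LONG 64
#define ROUND_UP(x, y) ((((x) + (y) - 1) / (y)) * (y))

/*
 * Expansion is only needed when new_nr_max crosses a BITS_PER_LONG
 * boundary, because both the map bitmap and the nr_deferred array
 * grow in batches of BITS_PER_LONG entries.
 */
static int need_expand(int new_nr_max, int shrinker_nr_max)
{
	return ROUND_UP(new_nr_max, BITS_PER_LONG) >
	       ROUND_UP(shrinker_nr_max, BITS_PER_LONG);
}

int main(void)
{
	/* the example from above: growing from 1 to 2 ids, same size */
	printf("need_expand(2, 1)   = %d\n", need_expand(2, 1));   /* 0 */
	/* crossing a word boundary does require expansion */
	printf("need_expand(65, 64) = %d\n", need_expand(65, 64)); /* 1 */
	return 0;
}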
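
Similarly, for anyone following the (info + 1) pointer arithmetic in
alloc_shrinker_info() and expand_one_shrinker_info(), here is a minimal
user-space sketch of the single-allocation layout (calloc() in place of
kvzalloc_node(), a plain long in place of atomic_long_t, simplified
names; illustrative only):

#include <stdio.h>
#include <stdlib.h>

#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct info {
	long *nr_deferred;	/* stands in for atomic_long_t [] */
	unsigned long *map;	/* bitmap of registered shrinker ids */
};

static struct info *alloc_info(int nr_max)
{
	int longs = DIV_ROUND_UP(nr_max, BITS_PER_LONG);
	/* one counter per id, allocated in batches of BITS_PER_LONG */
	size_t d_size = (size_t)longs * BITS_PER_LONG * sizeof(long);
	/* one bitmap word per BITS_PER_LONG ids */
	size_t m_size = (size_t)longs * sizeof(unsigned long);
	struct info *info = calloc(1, sizeof(*info) + d_size + m_size);

	if (!info)
		return NULL;
	/* carve both arrays out of the tail of the single allocation */
	info->nr_deferred = (long *)(info + 1);
	info->map = (unsigned long *)((char *)info->nr_deferred + d_size);
	return info;
}

int main(void)
{
	struct info *info = alloc_info(100);

	if (!info)
		return 1;
	printf("deferred[99] = %ld, map word 1 = %lx\n",
	       info->nr_deferred[99], info->map[1]);
	free(info);
	return 0;
}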