From: Joonsoo Kim
Date: Thu, 19 Mar 2020 15:20:54 +0900
Subject: Re: [PATCH v3 5/9] mm/workingset: use the node counter if memcg is the root memcg
To: Johannes Weiner
Cc: Andrew Morton, Linux Memory Management List, LKML, Michal Hocko, Hugh Dickins, Minchan Kim, Vlastimil Babka, Mel Gorman, kernel-team@lge.com, Joonsoo Kim

On Thu, Mar 19, 2020 at 4:18 AM, Johannes Weiner wrote:
>
> On Tue, Mar 17, 2020 at 02:41:53PM +0900, js1304@gmail.com wrote:
> > From: Joonsoo Kim
> >
> > In the following patch, workingset detection is implemented for the
> > swap cache. The swap cache's nodes are usually allocated by kswapd,
> > and they aren't charged to a kmemcg since the allocation comes from
> > a kernel thread. So the swap cache's shadow nodes are managed by the
> > node list of the list_lru rather than a memcg-specific one.
> >
> > If shadow node counting for the root memcg happens during slab
> > reclaim, the shadow node count returns the number of shadow nodes on
> > the node list of the list_lru, since the root memcg has the
> > kmem_cache_id -1.
> >
> > However, the number of pages on the LRU is calculated for the
> > specific memcg, so a mismatch happens. This keeps the number of
> > shadow nodes from growing to a sufficient size and, therefore,
> > workingset detection cannot work correctly. This patch fixes the bug
> > by checking whether the memcg is the root memcg. If it is, the
> > system-wide LRU is used instead of the memcg-specific one to
> > calculate the proper size for the shadow nodes, so that their number
> > can grow as expected.
> >
> > Signed-off-by: Joonsoo Kim
> > ---
> >  mm/workingset.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 5fb8f85..a9f474a 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -468,7 +468,13 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >  	 * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
> >  	 */
> >  #ifdef CONFIG_MEMCG
> > -	if (sc->memcg) {
> > +	/*
> > +	 * A kernel allocation on the root memcg isn't regarded as an
> > +	 * allocation of a specific memcg. So, if sc->memcg is the root
> > +	 * memcg, we need to use the count for the node rather than the
> > +	 * one for the specific memcg.
> > +	 */
> > +	if (sc->memcg && !mem_cgroup_is_root(sc->memcg)) {
>
> This is no good, unfortunately.
>
> It allows the root cgroup's shadows to grow way too large. Consider a
> large memory system where several workloads run in containers and
> only some host software runs in the root, yet that tiny root group
> will grow shadow entries in proportion to the entire RAM.

Okay.

> IMO, we have some choices here:
>
> 1. We say the swapcache is a shared system facility and its memory is
>    not accounted to anyone. In that case, we should either
>    1a. Reclaim them to a fixed threshold, regardless of cgroup, or
>    1b. Not reclaim them at all. Or
>
> 2. We account all nodes to the groups for which they are allocated.
>    Essentially like this:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 8e7ce9a9bc5e..d0d0dcc357fb 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -125,6 +125,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>  	page_ref_add(page, nr);
>  	SetPageSwapCache(page);
>
> +	memalloc_use_memcg(page_memcg(page));
>  	do {
>  		xas_lock_irq(&xas);
>  		xas_create_range(&xas);
> @@ -142,6 +143,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
>  unlock:
>  		xas_unlock_irq(&xas);
>  	} while (xas_nomem(&xas, gfp));
> +	memalloc_unuse_memcg();
>
>  	if (!xas_error(&xas))
>  		return 0;
> @@ -605,7 +607,8 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
>  		return -ENOMEM;
>  	for (i = 0; i < nr; i++) {
>  		space = spaces + i;
> -		xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> +		xa_init_flags(&space->i_pages,
> +			      XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
>  		atomic_set(&space->i_mmap_writable, 0);
>  		space->a_ops = &swap_aops;
>  		/* swap cache doesn't use writeback related tags */
>
> (A reclaimer has PF_MEMALLOC set, so we'll bypass the limit when
> recursing into charging the node.)
>
> I'm leaning more toward 1b, actually. The shadow shrinker was written
> because the combined address space of files on the filesystem can
> easily be in the terabytes, and practically unbounded with sparse
> files. The shadow shrinker is there to keep users from DoSing the
> system with shadow entries for files.
>
> However, the swap address space is bounded by a privileged user, and
> the size is usually in the GB range.
> On my system, radix_tree_node is ~583 bytes, so for a 16G swapfile,
> the swapcache xarray should max out below 40M (36M worth of leaf
> nodes, plus some intermediate nodes).
>
> It doesn't seem worth messing with the shrinker at all for these.

40M for 16G, i.e. 0.25% of the used swap, looks okay to me.

I will rework the patch in that way.

Thanks.
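The arithmetic discussed in this thread can be sanity-checked with a short script. The simplified target formula and the example counts below are illustrative assumptions for this sketch, not the kernel's exact code:

```python
# Sanity check of the shrinker arithmetic discussed above.
# Assumptions: 4 KiB pages, 64-slot xarray nodes, and the ~583-byte
# radix_tree_node size quoted in the thread.

PAGE_SIZE = 4096        # bytes per page
XA_CHUNK_SHIFT = 6      # xarray nodes hold 2**6 = 64 slots
NODE_BYTES = 583        # approximate sizeof(radix_tree_node)

def max_shadow_nodes(lru_pages: int) -> int:
    """Illustrative cap on shadow nodes relative to LRU size:
    roughly one 64-slot node per 8 pages on the LRU."""
    return lru_pages >> (XA_CHUNK_SHIFT - 3)

# The bug under discussion: for the root memcg, the *count* comes from
# the node-wide list_lru (kswapd's uncharged allocations land there),
# while the *target* is computed from the root memcg's own LRU pages,
# so the target is far too small and shadow nodes are reclaimed early.
node_wide_count = 50_000   # hypothetical shadow nodes on the node list
root_lru_pages = 10_000    # hypothetical pages charged to the root memcg
target = max_shadow_nodes(root_lru_pages)
print(target, node_wide_count > target)   # tiny target -> constant reclaim

# Johannes' back-of-the-envelope bound for a 16G swapfile:
swap_pages = (16 * 2**30) // PAGE_SIZE      # 4,194,304 swap slots
leaf_nodes = swap_pages // (1 << XA_CHUNK_SHIFT)
leaf_mib = leaf_nodes * NODE_BYTES / 2**20
print(leaf_nodes, round(leaf_mib, 1))       # 65536 leaves, ~36.4 MiB
```

The leaf-node figure matches the "36M worth of leaf nodes" estimate in the quoted reply; intermediate nodes add a small constant factor on top, staying under the 40M bound.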