Date: Fri, 30 Apr 2021 13:27:39 +1000
From: Dave Chinner <david@fromorbit.com>
To: Roman Gushchin
Cc: Muchun Song, willy@infradead.org, akpm@linux-foundation.org,
	hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
	shakeelb@google.com, shy828301@gmail.com, alexs@kernel.org,
	alexander.h.duyck@linux.intel.com, richard.weiyang@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/9] Shrink the list lru size on memory cgroup removal
Message-ID: <20210430032739.GG1872259@dread.disaster.area>
References: <20210428094949.43579-1-songmuchun@bytedance.com>
	<20210430004903.GF1872259@dread.disaster.area>
On Thu, Apr 29, 2021 at 06:39:40PM -0700, Roman Gushchin wrote:
> On Fri, Apr 30, 2021 at 10:49:03AM +1000, Dave Chinner wrote:
> > On Wed, Apr 28, 2021 at 05:49:40PM +0800, Muchun Song wrote:
> > > In our server, we found a suspected memory leak problem. The
> > > kmalloc-32 slab cache consumes more than 6GB of memory. Other
> > > kmem_caches consume less than 2GB of memory.
> > >
> > > After our in-depth analysis, the memory consumption of the
> > > kmalloc-32 slab cache is caused by list_lru_one allocations.
> > >
> > > crash> p memcg_nr_cache_ids
> > > memcg_nr_cache_ids = $2 = 24574
> > >
> > > memcg_nr_cache_ids is very large, and the memory consumption of
> > > each list_lru can be calculated with the following formula:
> > >
> > > num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)
> > >
> > > There are 4 NUMA nodes in our system, so each list_lru consumes
> > > ~3MB.
> > >
> > > crash> list super_blocks | wc -l
> > > 952
> >
> > The more I see people trying to work around this, the more I think
> > that the way memcgs have been grafted into the list_lru is back to
> > front.
> >
> > We currently allocate scope for every memcg to be tracked on every
> > superblock instantiated in the system, regardless of whether that
> > superblock is even accessible to that memcg.
> >
> > These huge memcg counts come from container hosts where memcgs are
> > confined to just a small subset of the total number of superblocks
> > instantiated at any given point in time.
> >
> > IOWs, for these systems with huge container counts, list_lru does
> > not need the capability of tracking every memcg on every
> > superblock.
> >
> > What it comes down to is that the list_lru is only needed for a
> > given memcg if that memcg is instantiating and freeing objects on
> > a given list_lru.
> >
> > Which makes me think we should be moving more towards an "add the
> > memcg to the list_lru at the first insert" model rather than
> > "instantiate all at memcg init time just in case". The model we
> > originally came up with for supporting memcgs is really starting
> > to show its limits, and we should address those limitations rather
> > than hack more complexity into the system, which does nothing to
> > remove the limitations that are causing the problems in the first
> > place.
>
> I totally agree.
>
> It looks like the initial implementation of the whole kernel memory
> accounting and memcg-aware shrinkers was based on the idea that the
> number of memory cgroups is relatively small and stable.

Yes, that was one of the original assumptions - tens to maybe low
hundreds of memcgs at most. The other was that memcgs weren't NUMA
aware, and so would only need a single LRU list per memcg. Hence even
with "lots" of memcgs and superblocks, the total overhead wasn't that
great.

Then came "memcgs need to be NUMA aware" because of the size of the
machines they were being used for resource management in, and that
greatly increased the per-memcg, per-LRU overhead. Now we're talking
about needing to support a couple of orders of magnitude more memcgs
and superblocks than were originally designed for.

So, really, we're way beyond the original design scope of this
subsystem now.
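[ For reference, the arithmetic behind the ~3MB-per-list_lru figure
quoted above, written out as a standalone userspace sketch. It assumes
each superblock carries two memcg-aware list_lrus (s_dentry_lru and
s_inode_lru); the input numbers are the ones from Muchun's report. ]

/* Back-of-the-envelope check of the figures quoted above. */
#include <stdio.h>

int main(void)
{
	unsigned long long memcg_nr_cache_ids = 24574;	/* crash> p memcg_nr_cache_ids */
	unsigned long long nr_numa_nodes = 4;
	unsigned long long object_size = 32;		/* kmalloc-32 */
	unsigned long long nr_superblocks = 952;	/* crash> list super_blocks | wc -l */
	unsigned long long lrus_per_sb = 2;		/* s_dentry_lru + s_inode_lru */

	unsigned long long per_lru = nr_numa_nodes * memcg_nr_cache_ids * object_size;
	unsigned long long total = per_lru * nr_superblocks * lrus_per_sb;

	printf("per list_lru:    %llu bytes (~%.1f MB)\n",
	       per_lru, per_lru / (1024.0 * 1024.0));
	printf("all superblocks: ~%.1f GB\n",
	       total / (1024.0 * 1024.0 * 1024.0));
	return 0;
}

That works out to ~3.0MB per list_lru and ~5.6GB across 952
superblocks, which lines up with the reported >6GB of kmalloc-32 usage.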
> With systemd creating a separate cgroup for everything, including
> short-living processes, it's simply not true anymore.

Yeah, that too. Everything is much more dynamic these days...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
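[ For illustration only: a very rough sketch of what the "add the memcg
to the list_lru at the first insert" model discussed above might look
like. Every identifier in it is invented for the example - it is not
the kernel's actual list_lru implementation or API. The only point it
tries to make is that per-memcg state is allocated on the first insert
from a memcg, rather than for every memcg on every lru up front. ]

/*
 * Illustrative sketch of lazy per-memcg list allocation.  All names
 * (example_lru_node, example_lru_add, memcg_lists, ...) are invented
 * for this example; this is not the kernel's list_lru code.
 */
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/xarray.h>

struct example_lru_memcg {
	struct list_head list;		/* objects owned by one memcg */
	long nr_items;
};

struct example_lru_node {
	spinlock_t lock;
	struct xarray memcg_lists;	/* memcg id -> struct example_lru_memcg */
};

static void example_lru_init(struct example_lru_node *nlru)
{
	spin_lock_init(&nlru->lock);
	xa_init(&nlru->memcg_lists);
}

static int example_lru_add(struct example_lru_node *nlru,
			   struct list_head *item, unsigned long memcg_id,
			   gfp_t gfp)
{
	struct example_lru_memcg *mlru;
	int err;

	mlru = xa_load(&nlru->memcg_lists, memcg_id);
	if (!mlru) {
		/* First insert from this memcg on this lru: allocate now. */
		mlru = kzalloc(sizeof(*mlru), gfp);
		if (!mlru)
			return -ENOMEM;
		INIT_LIST_HEAD(&mlru->list);

		err = xa_insert(&nlru->memcg_lists, memcg_id, mlru, gfp);
		if (err == -EBUSY) {
			/* Lost a race; use the list the winner installed. */
			kfree(mlru);
			mlru = xa_load(&nlru->memcg_lists, memcg_id);
		} else if (err) {
			kfree(mlru);
			return err;
		}
	}

	spin_lock(&nlru->lock);
	list_add_tail(item, &mlru->list);
	mlru->nr_items++;
	spin_unlock(&nlru->lock);
	return 0;
}

A scheme along these lines trades a lookup (and a possible allocation)
on the insert path for not paying the num_memcgs * num_lrus memory
cost up front.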