From: Muchun Song
Date: Sat, 18 Sep 2021 15:59:23 +0800
Message-ID:
Subject: Re: [PATCH v3 00/76] Optimize list lru memory consumption
To: Kari Argillander
Cc: Matthew Wilcox, Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Shakeel Butt, Roman Gushchin, Yang Shi, Alex Shi, Wei Yang, Dave Chinner, trond.myklebust@hammerspace.com, anna.schumaker@netapp.com, linux-fsdevel, LKML, Linux Memory Management List, linux-nfs@vger.kernel.org, Qi Zheng, Xiongchun duan, fam.zheng@bytedance.com, Muchun Song
References: <20210914072938.6440-1-songmuchun@bytedance.com> <20210918065624.dbaar4lss5olrfhu@kari-VirtualBox>
In-Reply-To: <20210918065624.dbaar4lss5olrfhu@kari-VirtualBox>
X-Mailing-List: linux-nfs@vger.kernel.org

On Sat, Sep 18, 2021 at 2:56 PM Kari Argillander wrote:
>
> On Tue, Sep 14, 2021 at 03:28:22PM +0800, Muchun Song wrote:
> > We introduced alloc_inode_sb() in version 2, which sets up the inode
> > reclaim context properly, to allocate filesystem-specific inodes. So we
> > had to convert all filesystems to the new API, which was done in one
> > patch.
> > Some filesystems are easy to convert (just replace kmem_cache_alloc()
> > with alloc_inode_sb()), while other filesystems need more work. To make
> > it easy for the maintainers of the different filesystems to review
> > their own parts, in this version I split the patch into per-filesystem
> > patches. I am not sure whether this is a good idea, since it results in
> > more commits.
> >
> > On one of our servers, we found a suspected memory leak: the kmalloc-32
> > slab cache consumes more than 6GB of memory, while every other
> > kmem_cache consumes less than 2GB.
> >
> > Our in-depth analysis showed that the kmalloc-32 consumption is caused
> > by list_lru_one allocations.
> >
> > crash> p memcg_nr_cache_ids
> > memcg_nr_cache_ids = $2 = 24574
> >
> > memcg_nr_cache_ids is very large, and the memory consumption of each
> > list_lru can be calculated with the following formula:
> >
> >   num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32 object size)
> >
> > There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.
> >
> > crash> list super_blocks | wc -l
> > 952
> >
> > Every mount registers 2 list_lrus, one for inodes and one for dentries.
> > There are 952 super_blocks, so the total memory is 952 * 2 * 3MB
> > (~5.6GB). But the current number of memory cgroups is less than 500, so
> > I guess more than 12286 memory cgroups have been created on this
> > machine over time (I do not know why there are so many; it may be a
> > user bug, or the user really wanted to do that), and
> > memcg_nr_cache_ids was never reduced back to a suitable value, so a lot
> > of memory is wasted. The only way to shrink memcg_nr_cache_ids is to
> > *reboot* the server, which is not what we want.
> >
> > To reduce memcg_nr_cache_ids, I had posted a patchset [1], but that did
> > not fundamentally solve the problem.
> >
> > We currently allocate scope for every memcg to be trackable on every
> > superblock instantiated in the system, regardless of whether that
> > superblock is even accessible to that memcg.
> >
> > These huge memcg counts come from container hosts where each memcg is
> > confined to just a small subset of the total number of superblocks
> > instantiated at any given point in time.
> >
> > For these systems with huge container counts, list_lru does not need
> > the capability to track every memcg on every superblock.
> >
> > What it comes down to is that a given list_lru only needs a memcg if
> > that memcg is instantiating and freeing objects on that list_lru.
> >
> > As Dave said, "Which makes me think we should be moving more towards
> > 'add the memcg to the list_lru at the first insert' model rather than
> > 'instantiate all at memcg init time just in case'."
> >
> > This patchset aims to optimize list_lru memory consumption from
> > several angles:
> >
> > Patch 1-6 are code simplifications.
> > Patch 7 converts the array from per-memcg per-node to per-memcg.
> > Patch 8 introduces kmem_cache_alloc_lru().
> > Patch 9 introduces alloc_inode_sb().
> > Patch 10-66 convert all filesystems to alloc_inode_sb() respectively.
>
> Nowadays there is also ntfs3. If you do not plan to convert it, please
> at least CC me so that I can do it when these land.
>
> Argillander

Wow, a new filesystem. I didn't notice it before. I'll cover it in the
next version and Cc you if you can do a review. Thanks for your reminder.
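The per-filesystem conversions in patches 10-66 are mostly mechanical, per the cover letter's description ("just replace kmem_cache_alloc() with alloc_inode_sb()"). As a non-runnable sketch of what such a conversion looks like in a filesystem's ->alloc_inode hook: "foofs", "foofs_inode_cachep", and "foofs_inode_info" are hypothetical names, and alloc_inode_sb()'s argument order is assumed from the helper introduced in patch 9.

```c
/* Hypothetical filesystem "foofs": typical ->alloc_inode conversion. */
static struct inode *foofs_alloc_inode(struct super_block *sb)
{
	struct foofs_inode_info *fi;

	/*
	 * Before the series:
	 *   fi = kmem_cache_alloc(foofs_inode_cachep, GFP_KERNEL);
	 *
	 * After: route the allocation through the superblock so the
	 * inode's memcg/list_lru reclaim context is set up properly.
	 */
	fi = alloc_inode_sb(sb, foofs_inode_cachep, GFP_KERNEL);
	if (!fi)
		return NULL;
	return &fi->vfs_inode;
}
```

Filesystems whose allocation path is more involved (e.g. those that zero or partially initialize the object themselves) would need correspondingly more rework, which is why the series splits them into separate patches.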