From: Muchun Song
To: willy@infradead.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
    mhocko@kernel.org, vdavydov.dev@gmail.com, shakeelb@google.com,
    roman.gushchin@linux.dev, shy828301@gmail.com, alexs@kernel.org,
    richard.weiyang@gmail.com, david@fromorbit.com,
    trond.myklebust@hammerspace.com, anna.schumaker@netapp.com,
    jaegeuk@kernel.org, chao@kernel.org, kari.argillander@gmail.com,
    vbabka@suse.cz
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-nfs@vger.kernel.org,
    zhengqi.arch@bytedance.com, duanxiongchun@bytedance.com,
    fam.zheng@bytedance.com, smuchun@gmail.com, Muchun Song
Subject: [PATCH v6 00/16] Optimize list lru memory consumption
Date: Mon, 28 Feb 2022 20:21:10 +0800
Message-Id: <20220228122126.37293-1-songmuchun@bytedance.com>

This series is based on Linux v5.17-rc5. I have also replaced Roman's email
address in the Acked-by and Reviewed-by tags with roman.gushchin@linux.dev.

On one of our servers, we found a suspected memory leak. The kmalloc-32
slab cache consumes more than 6GB of memory, while the other kmem_caches
consume less than 2GB. After in-depth analysis, we found that the memory
consumption of the kmalloc-32 slab cache is caused by list_lru_one
allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumption of each
list_lru can be calculated with the following formula.

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 numa nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers 2 list lrus, one for inodes and another for dentries.
There are 952 super_blocks, so the total memory is 952 * 2 * 3 MB (~5.6GB).
But the number of memory cgroups is now less than 500. So I guess more than
12286 memory cgroups have been created on this machine (I do not know why
there are so many cgroups; it may be a user's bug, or the user may really
want to do that). Because memcg_nr_cache_ids has not been reduced to a
suitable value, a lot of memory is wasted. If we want to reduce
memcg_nr_cache_ids, we have to *reboot* the server. This is not what we
want.

In order to reduce memcg_nr_cache_ids, I posted a patchset [1]. But this
did not fundamentally solve the problem.

We currently allocate scope for every memcg to be able to be tracked on
every superblock instantiated in the system, regardless of whether that
superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where memcgs are confined
to just a small subset of the total number of superblocks that are
instantiated at any given point in time.

For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock.

What it comes down to is that the list_lru is only needed for a given memcg
if that memcg is instantiating and freeing objects on a given list_lru.

As Dave said, "Which makes me think we should be moving more towards 'add
the memcg to the list_lru at the first insert' model rather than
'instantiate all at memcg init time just in case'."

This patchset aims to optimize the list lru memory consumption from
different aspects.

I ran a simple test to show the optimization. I created 10k memory cgroups
and mounted 10k filesystems on the system, then used the free command to
show how much memory the system consumes after this operation (there are
2 numa nodes in the system).

+-----------------------+------------------------+
|       condition       |   memory consumption   |
+-----------------------+------------------------+
| without this patchset |        24464 MB        |
+-----------------------+------------------------+
|     after patch 1     |        21957 MB        | <--------+
+-----------------------+------------------------+          |
|    after patch 10     |         6895 MB        |          |
+-----------------------+------------------------+          |
|    after patch 12     |         4367 MB        |          |
+-----------------------+------------------------+          |
                                                            |
The more the number of nodes, the more obvious the effect---+

BTW, there was a recent discussion [2] on the same issue.

[1] https://lore.kernel.org/all/20210428094949.43579-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/all/20210405054848.GA1077931@in.ibm.com/

This series not only optimizes the memory usage of list_lru but also
simplifies the code.

v5: https://lore.kernel.org/all/20211220085649.8196-1-songmuchun@bytedance.com/
v4: https://lore.kernel.org/all/20211213165342.74704-1-songmuchun@bytedance.com/
v3: https://lore.kernel.org/all/20210914072938.6440-1-songmuchun@bytedance.com/
v2: https://lore.kernel.org/all/20210527062148.9361-1-songmuchun@bytedance.com/
v1: https://lore.kernel.org/all/20210511104647.604-1-songmuchun@bytedance.com/

v6:
 - Collect Acked-by from Roman and replace his old email with
   roman.gushchin@linux.dev.
 - Rework patch 1's commit log suggested by Roman.
 - Reuse memory cgroup ID for kmem ID directly suggested by Mika Penttilä.
 - Add a couple of words to Documentation/filesystems/porting.rst suggested
   by Roman.

 Thanks for your review.

v5:
 - Fix sleeping from atomic context reported by kernel test robot.
 - Add a figure to patch 1 suggested by Johannes.
 - Squash patch 9 into patch 8 suggested by Johannes.
 - Remove LRUS_CLEAR_MASK and use GFP_RECLAIM_MASK directly suggested by
   Johannes.
 - Collect Acked-by from Johannes.

v4:
 - Remove some code cleanup patches since they are already merged.
 - Collect Acked-by from Theodore.

v3:
 - Fix mixing advanced and normal XArray concepts (Thanks to Matthew).
 - Split one patch into per-filesystem patches.

v2:
 - Update Documentation/filesystems/porting.rst suggested by Dave.
 - Add a comment above alloc_inode_sb() suggested by Dave.
 - Rework some patch's commit log.
 - Add patch 18-21.

Muchun Song (16):
  mm: list_lru: transpose the array of per-node per-memcg lru lists
  mm: introduce kmem_cache_alloc_lru
  fs: introduce alloc_inode_sb() to allocate filesystems specific inode
  fs: allocate inode by using alloc_inode_sb()
  f2fs: allocate inode by using alloc_inode_sb()
  nfs42: use a specific kmem_cache to allocate nfs4_xattr_entry
  mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
  xarray: use kmem_cache_alloc_lru to allocate xa_node
  mm: memcontrol: move memcg_online_kmem() to mem_cgroup_css_online()
  mm: list_lru: allocate list_lru_one only when needed
  mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus
  mm: list_lru: replace linear array with xarray
  mm: memcontrol: reuse memory cgroup ID for kmem ID
  mm: memcontrol: fix cannot alloc the maximum memcg ID
  mm: list_lru: rename list_lru_per_memcg to list_lru_memcg
  mm: memcontrol: rename memcg_cache_id to memcg_kmem_id

 Documentation/filesystems/porting.rst |   6 +
 block/bdev.c                          |   2 +-
 drivers/dax/super.c                   |   2 +-
 fs/9p/vfs_inode.c                     |   2 +-
 fs/adfs/super.c                       |   2 +-
 fs/affs/super.c                       |   2 +-
 fs/afs/super.c                        |   2 +-
 fs/befs/linuxvfs.c                    |   2 +-
 fs/bfs/inode.c                        |   2 +-
 fs/btrfs/inode.c                      |   2 +-
 fs/ceph/inode.c                       |   2 +-
 fs/cifs/cifsfs.c                      |   2 +-
 fs/coda/inode.c                       |   2 +-
 fs/dcache.c                           |   3 +-
 fs/ecryptfs/super.c                   |   2 +-
 fs/efs/super.c                        |   2 +-
 fs/erofs/super.c                      |   2 +-
 fs/exfat/super.c                      |   2 +-
 fs/ext2/super.c                       |   2 +-
 fs/ext4/super.c                       |   2 +-
 fs/f2fs/super.c                       |   8 +-
 fs/fat/inode.c                        |   2 +-
 fs/freevxfs/vxfs_super.c              |   2 +-
 fs/fuse/inode.c                       |   2 +-
 fs/gfs2/super.c                       |   2 +-
 fs/hfs/super.c                        |   2 +-
 fs/hfsplus/super.c                    |   2 +-
 fs/hostfs/hostfs_kern.c               |   2 +-
 fs/hpfs/super.c                       |   2 +-
 fs/hugetlbfs/inode.c                  |   2 +-
 fs/inode.c                            |   2 +-
 fs/isofs/inode.c                      |   2 +-
 fs/jffs2/super.c                      |   2 +-
 fs/jfs/super.c                        |   2 +-
 fs/minix/inode.c                      |   2 +-
 fs/nfs/inode.c                        |   2 +-
 fs/nfs/nfs42xattr.c                   |  95 ++++----
 fs/nilfs2/super.c                     |   2 +-
 fs/ntfs/inode.c                       |   2 +-
 fs/ntfs3/super.c                      |   2 +-
 fs/ocfs2/dlmfs/dlmfs.c                |   2 +-
 fs/ocfs2/super.c                      |   2 +-
 fs/openpromfs/inode.c                 |   2 +-
 fs/orangefs/super.c                   |   2 +-
 fs/overlayfs/super.c                  |   2 +-
 fs/proc/inode.c                       |   2 +-
 fs/qnx4/inode.c                       |   2 +-
 fs/qnx6/inode.c                       |   2 +-
 fs/reiserfs/super.c                   |   2 +-
 fs/romfs/super.c                      |   2 +-
 fs/squashfs/super.c                   |   2 +-
 fs/sysv/inode.c                       |   2 +-
 fs/ubifs/super.c                      |   2 +-
 fs/udf/super.c                        |   2 +-
 fs/ufs/super.c                        |   2 +-
 fs/vboxsf/super.c                     |   2 +-
 fs/xfs/xfs_icache.c                   |   2 +-
 fs/zonefs/super.c                     |   2 +-
 include/linux/fs.h                    |  11 +
 include/linux/list_lru.h              |  17 +-
 include/linux/memcontrol.h            |  41 ++--
 include/linux/slab.h                  |   3 +
 include/linux/swap.h                  |   5 +-
 include/linux/xarray.h                |   9 +-
 ipc/mqueue.c                          |   2 +-
 lib/xarray.c                          |  10 +-
 mm/list_lru.c                         | 417 ++++++++++++++++------------------
 mm/memcontrol.c                       | 160 ++-----------
 mm/shmem.c                            |   2 +-
 mm/slab.c                             |  39 +++-
 mm/slab.h                             |  25 +-
 mm/slob.c                             |   6 +
 mm/slub.c                             |  42 ++--
 mm/workingset.c                       |   2 +-
 net/socket.c                          |   2 +-
 net/sunrpc/rpc_pipe.c                 |   2 +-
 76 files changed, 476 insertions(+), 539 deletions(-)

-- 
2.11.0