Received: by 2002:a6b:fb09:0:0:0:0:0 with SMTP id h9csp2838604iog; Mon, 27 Jun 2022 04:06:08 -0700 (PDT) X-Google-Smtp-Source: AGRyM1upwTgSk4HtTnnYavr8dxt+lf8xV4WF9n2soAmM6RVs7tJctsHuT9w7j0DHDJLh0UY0KdF+ X-Received: by 2002:a63:2a4e:0:b0:40c:6ff9:9978 with SMTP id q75-20020a632a4e000000b0040c6ff99978mr12215452pgq.447.1656327968649; Mon, 27 Jun 2022 04:06:08 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1656327968; cv=none; d=google.com; s=arc-20160816; b=uxT7g4sHbCRHOo+jmyy+8KiKOizxxLUu7fOV2P/2vHkXNZ9dlb4R7WZlA/w/U5vkXT yPfzfdwcrb5+c8eS2yPN0n5/p4FpOVoTRqHc+wyQwQsbyYB9VsFZFpxkxrE/Rs9qlbFz jM7MQEQCs0fA3EnLcecz6iUsl1azib/bpRkEOan0/8HvqX2AETz69knzMCeeFVNAY52P QKApjACLl4rU2b0BMMWVUcg6ICg9hQtQ7gufk1BrwgbYGtEA3OMxtyHIE6U0VeOFTtau iKAJOP10PO7lFKOppYj3dmERdZdRDYWu0y7sl+G40ioPFxFOPT31DunEDAkknrD/BVnI 4vRQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=kr/grU3kkgeweF+fN7RAjBDwzZqei/Z54c4DEpy1KMs=; b=QcNZytdZCwRarxCgAGzSTXtiUvtJj2KtVqZxmDtr7JzZ6hdEMHfm0uYBEg6C2T0Q9W gytgSUz15LQgUs5uiJjVOU42jJA5N3EGxo+SFitaLxnibg8CVpRggp7PMNycICrsU+3G SYxIUb7ujEM3Vqip8+Mmf5E3LWIL2VzIhXIBpz/VOcaBjCMcJaU0Ci9QIXY6+eLz1KbQ moEUQRTRBHqs/xte0+fT80Y3oVz8Jhb70Nc14eR9RnkHxIRYFWvWiCORe2DVJmmhIXfi Xu6RHoZrcpiKJueM645+jM+qJkbxP3D60YYAjgDdDl/UgpbQ10Nz476BBWUAQ88AAeoB tfcw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="cpR/mdhv"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id p11-20020a17090a868b00b001df4ea7a026si17881471pjn.34.2022.06.27.04.05.56; Mon, 27 Jun 2022 04:06:08 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="cpR/mdhv"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234015AbiF0Kns (ORCPT + 99 others); Mon, 27 Jun 2022 06:43:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44236 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233879AbiF0Knr (ORCPT ); Mon, 27 Jun 2022 06:43:47 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 4B4D963B8 for ; Mon, 27 Jun 2022 03:43:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1656326625; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=kr/grU3kkgeweF+fN7RAjBDwzZqei/Z54c4DEpy1KMs=; b=cpR/mdhvmKtoV4nOgDnYYNfJjR/S+MqVogf5uxZuQ3+HY3BkRU/DwpvfqegvWsFB2Wi8e2 I7ZQrxU5NeKwvI7LZWdihP2prBPmmHnIeZH14AEW3hy1NyUM67X5/JOWlxrQiLqTh6/39x VfR6jwu2CxhdNrpfg9YsDBcGW/Jv2vE= Received: from mail-lf1-f70.google.com (mail-lf1-f70.google.com [209.85.167.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-656-PXnuZ99VO6CZnhqnFVXmSw-1; Mon, 27 Jun 2022 06:43:44 -0400 X-MC-Unique: PXnuZ99VO6CZnhqnFVXmSw-1 Received: by mail-lf1-f70.google.com with SMTP id be18-20020a056512251200b0048120ff434dso713012lfb.19 for ; Mon, 27 Jun 2022 03:43:44 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent:subject :content-language:to:cc:references:from:in-reply-to :content-transfer-encoding; bh=kr/grU3kkgeweF+fN7RAjBDwzZqei/Z54c4DEpy1KMs=; b=JKDyCF9/kO0mdO3zMv6RdP24xJnOih+059DnwePdVexSuFvzNwRS1LbIyzlF8dIZ/g 77bV+lK+JWNFfSR8cZvP0e61OEhf0qcIF7lt7uPtrhF2r1H6UzRobhfbZMNGjLcTIgLt Z2SHiKp6v3Xot/r21wTxXCWXLUJeYrZYwHYg0YzxN9DewFgHmn0bjnVl+EKNFTI9bi9A WMb/asJGdtvDvhwYkadf3LsTNoVDP2S0c0JW17dfZ9egqfkgsJSH09r3fWue0+0XLmPT WpDnNfUY185W98te7cI8bCK1Jc3CuWmq3TVxf710VvA1aoU3NFHzySf8/oIQFJuOWOaW VA7w== X-Gm-Message-State: AJIora9IOKY05yWo+ofI8O1yn+LSGGiA4ZEUu0DaGmJmaZtRT0ZbcYk/ wrdB45ubCptMwmkM0FgorGmI5HKz95xJxdaKBcF1WPkIss28Q9EcX5uoR0JvDdPcr+Kvl36RCEg dRshH+vawimIqKGDQue6jcAI= X-Received: by 2002:a2e:8696:0:b0:25a:7673:d22a with SMTP id l22-20020a2e8696000000b0025a7673d22amr6354317lji.494.1656326621637; Mon, 27 Jun 2022 03:43:41 -0700 (PDT) X-Received: by 2002:a2e:8696:0:b0:25a:7673:d22a with SMTP id l22-20020a2e8696000000b0025a7673d22amr6354296lji.494.1656326621311; Mon, 27 Jun 2022 03:43:41 -0700 (PDT) Received: from [192.168.1.121] (91-145-109-188.bb.dnainternet.fi. [91.145.109.188]) by smtp.gmail.com with ESMTPSA id q10-20020a056512210a00b00477a287438csm1764197lfr.2.2022.06.27.03.43.39 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 27 Jun 2022 03:43:39 -0700 (PDT) Message-ID: Date: Mon, 27 Jun 2022 13:43:38 +0300 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Subject: Re: [PATCH v6 00/11] Use obj_cgroup APIs to charge the LRU pages Content-Language: en-US To: Yosry Ahmed , Muchun Song Cc: Andrew Morton , Johannes Weiner , longman@redhat.com, Michal Hocko , Roman Gushchin , Shakeel Butt , Cgroups , duanxiongchun@bytedance.com, Linux Kernel Mailing List , Linux-MM References: <20220621125658.64935-1-songmuchun@bytedance.com> From: =?UTF-8?Q?Mika_Penttil=c3=a4?= In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-2.5 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 27.6.2022 11.05, Yosry Ahmed wrote: > On Mon, Jun 27, 2022 at 12:11 AM Muchun Song wrote: >> >> On Sun, Jun 26, 2022 at 03:32:02AM -0700, Yosry Ahmed wrote: >>> On Tue, Jun 21, 2022 at 5:57 AM Muchun Song wrote: >>>> >>>> This version is rebased on mm-unstable. Hopefully, Andrew can get this series >>>> into mm-unstable which will help to determine whether there is a problem or >>>> degradation. I am also doing some benchmark tests in parallel. >>>> >>>> Since the following patchsets applied. All the kernel memory are charged >>>> with the new APIs of obj_cgroup. >>>> >>>> commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages") >>>> commit b4e0b68fbd9d ("mm: memcontrol: use obj_cgroup APIs to charge kmem pages") >>>> >>>> But user memory allocations (LRU pages) pinning memcgs for a long time - >>>> it exists at a larger scale and is causing recurring problems in the real >>>> world: page cache doesn't get reclaimed for a long time, or is used by the >>>> second, third, fourth, ... instance of the same job that was restarted into >>>> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory, >>>> and make page reclaim very inefficient. >>>> >>>> We can convert LRU pages and most other raw memcg pins to the objcg direction >>>> to fix this problem, and then the LRU pages will not pin the memcgs. >>>> >>>> This patchset aims to make the LRU pages to drop the reference to memory >>>> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number >>>> of the dying cgroups will not increase if we run the following test script. >>> >>> This is amazing work! >>> >>> Sorry if I came late, I didn't follow the threads of previous versions >>> so this might be redundant, I just have a couple of questions. >>> >>> a) If LRU pages keep getting parented until they reach root_mem_cgroup >>> (assuming they can), aren't these pages effectively unaccounted at >>> this point or leaked? Is there protection against this? >>> >> >> In this case, those pages are accounted in root memcg level. Unfortunately, >> there is no mechanism now to transfer a page's memcg from one to another. >> >>> b) Since moving charged pages between memcgs is now becoming easier by >>> using the APIs of obj_cgroup, I wonder if this opens the door for >>> future work to transfer charges to memcgs that are actually using >>> reparented resources. For example, let's say cgroup A reads a few >>> pages into page cache, and then they are no longer used by cgroup A. >>> cgroup B, however, is using the same pages that are currently charged >>> to cgroup A, so it keeps taxing cgroup A for its use. When cgroup A >>> dies, and these pages are reparented to A's parent, can we possibly >>> mark these reparented pages (maybe in the page tables somewhere) so >>> that next time they get accessed we recharge them to B instead >>> (possibly asynchronously)? >>> I don't have much experience about page tables but I am pretty sure >>> they are loaded so maybe there is no room in PTEs for something like >>> this, but I have always wondered about what we can do for this case >>> where a cgroup is consistently using memory charged to another cgroup. >>> Maybe when this memory is reparented is a good point in time to decide >>> to recharge appropriately. It would also fix the reparenty leak to >>> root problem (if it even exists). >>> >> >> From my point of view, this is going to be an improvement to the memcg >> subsystem in the future. IIUC, most reparented pages are page cache >> pages without be mapped to users. So page tables are not a suitable >> place to record this information. However, we already have this information >> in struct obj_cgroup and struct mem_cgroup. If a page's obj_cgroup is not >> equal to the page's obj_cgroup->memcg->objcg, it means this page have >> been reparented. I am thinking if a place where a page is mapped (probably >> page fault patch) or page (cache) is written (usually vfs write path) >> is suitable to transfer page's memcg from one to another. But need more > > Very good point about unmapped pages, I missed this. Page tables will > do us no good here. Such a change would indeed require careful thought > because (like you mentioned) there are multiple points in time where > it might be suitable to consider recharging the page (e.g. when the > page is mapped). This could be an incremental change though. Right now > we have no recharging at all, so maybe we can gradually add recharging > to suitable paths. > >> thinking, e.g. How to decide if a reparented page needs to be transferred? > > Maybe if (page's obj_cgroup->memcg == root_mem_cgroup) OR (memcg of > current is not a descendant of page's obj_cgroup->memcg) is a good > place to start? > > My rationale is that if the page is charged to root_mem_cgroup through > reparenting and a process in a memcg is using it then this is probably > an accounting leak. If a page is charged to a memcg A through > reparenting and is used by a memcg B in a different subtree, then > probably memcg B is getting away with using the page for free while A > is being taxed. If B is a descendant of A, it is still getting away > with using the page unaccounted, but at least it makes no difference > for A. > > One could argue that we might as well recharge a reparented page > anyway if the process is cheap (or done asynchronously), and the paths > where we do recharging are not very common. > > All of this might be moot, I am just thinking out loud. In any way > this would be future work and not part of this work. > I think you have to uncharge at the reparented parent to keep balances right (because parent is hierarchically charged thru page_counter). And maybe recharge after that if appropriate. > >> If we need more information to make this decision, where to store those >> information? This is my primary thoughts on this question. > >> >> Thanks. >> >>> Thanks again for this work and please excuse my ignorance if any part >>> of what I said doesn't make sense :) >>> >>>> >>>> ```bash >>>> #!/bin/bash >>>> >>>> dd if=/dev/zero of=temp bs=4096 count=1 >>>> cat /proc/cgroups | grep memory >>>> >>>> for i in {0..2000} >>>> do >>>> mkdir /sys/fs/cgroup/memory/test$i >>>> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs >>>> cat temp >> log >>>> echo $$ > /sys/fs/cgroup/memory/cgroup.procs >>>> rmdir /sys/fs/cgroup/memory/test$i >>>> done >>>> >>>> cat /proc/cgroups | grep memory >>>> >>>> rm -f temp log >>>> ``` >>>> >>>> v5: https://lore.kernel.org/all/20220530074919.46352-1-songmuchun@bytedance.com/ >>>> v4: https://lore.kernel.org/all/20220524060551.80037-1-songmuchun@bytedance.com/ >>>> v3: https://lore.kernel.org/all/20220216115132.52602-1-songmuchun@bytedance.com/ >>>> v2: https://lore.kernel.org/all/20210916134748.67712-1-songmuchun@bytedance.com/ >>>> v1: https://lore.kernel.org/all/20210814052519.86679-1-songmuchun@bytedance.com/ >>>> RFC v4: https://lore.kernel.org/all/20210527093336.14895-1-songmuchun@bytedance.com/ >>>> RFC v3: https://lore.kernel.org/all/20210421070059.69361-1-songmuchun@bytedance.com/ >>>> RFC v2: https://lore.kernel.org/all/20210409122959.82264-1-songmuchun@bytedance.com/ >>>> RFC v1: https://lore.kernel.org/all/20210330101531.82752-1-songmuchun@bytedance.com/ >>>> >>>> v6: >>>> - Collect Acked-by and Reviewed-by from Roman and Michal Koutný. Thanks. >>>> - Rebase to mm-unstable. >>>> >>>> v5: >>>> - Lots of improvements from Johannes, Roman and Waiman. >>>> - Fix lockdep warning reported by kernel test robot. >>>> - Add two new patches to do code cleanup. >>>> - Collect Acked-by and Reviewed-by from Johannes and Roman. >>>> - I didn't replace local_irq_disable/enable() to local_lock/unlock_irq() since >>>> local_lock/unlock_irq() takes an parameter, it needs more thinking to transform >>>> it to local_lock. It could be an improvement in the future. >>>> >>>> v4: >>>> - Resend and rebased on v5.18. >>>> >>>> v3: >>>> - Removed the Acked-by tags from Roman since this version is based on >>>> the folio relevant. >>>> >>>> v2: >>>> - Rename obj_cgroup_release_kmem() to obj_cgroup_release_bytes() and the >>>> dependencies of CONFIG_MEMCG_KMEM (suggested by Roman, Thanks). >>>> - Rebase to linux 5.15-rc1. >>>> - Add a new pacth to cleanup mem_cgroup_kmem_disabled(). >>>> >>>> v1: >>>> - Drop RFC tag. >>>> - Rebase to linux next-20210811. >>>> >>>> RFC v4: >>>> - Collect Acked-by from Roman. >>>> - Rebase to linux next-20210525. >>>> - Rename obj_cgroup_release_uncharge() to obj_cgroup_release_kmem(). >>>> - Change the patch 1 title to "prepare objcg API for non-kmem usage". >>>> - Convert reparent_ops_head to an array in patch 8. >>>> >>>> Thanks for Roman's review and suggestions. >>>> >>>> RFC v3: >>>> - Drop the code cleanup and simplification patches. Gather those patches >>>> into a separate series[1]. >>>> - Rework patch #1 suggested by Johannes. >>>> >>>> RFC v2: >>>> - Collect Acked-by tags by Johannes. Thanks. >>>> - Rework lruvec_holds_page_lru_lock() suggested by Johannes. Thanks. >>>> - Fix move_pages_to_lru(). >>>> >>>> Muchun Song (11): >>>> mm: memcontrol: remove dead code and comments >>>> mm: rename unlock_page_lruvec{_irq, _irqrestore} to >>>> lruvec_unlock{_irq, _irqrestore} >>>> mm: memcontrol: prepare objcg API for non-kmem usage >>>> mm: memcontrol: make lruvec lock safe when LRU pages are reparented >>>> mm: vmscan: rework move_pages_to_lru() >>>> mm: thp: make split queue lock safe when LRU pages are reparented >>>> mm: memcontrol: make all the callers of {folio,page}_memcg() safe >>>> mm: memcontrol: introduce memcg_reparent_ops >>>> mm: memcontrol: use obj_cgroup APIs to charge the LRU pages >>>> mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance function >>>> mm: lru: use lruvec lock to serialize memcg changes >>>> >>>> fs/buffer.c | 4 +- >>>> fs/fs-writeback.c | 23 +- >>>> include/linux/memcontrol.h | 218 +++++++++------ >>>> include/linux/mm_inline.h | 6 + >>>> include/trace/events/writeback.h | 5 + >>>> mm/compaction.c | 39 ++- >>>> mm/huge_memory.c | 153 ++++++++-- >>>> mm/memcontrol.c | 584 +++++++++++++++++++++++++++------------ >>>> mm/migrate.c | 4 + >>>> mm/mlock.c | 2 +- >>>> mm/page_io.c | 5 +- >>>> mm/swap.c | 49 ++-- >>>> mm/vmscan.c | 66 ++--- >>>> 13 files changed, 776 insertions(+), 382 deletions(-) >>>> >>>> >>>> base-commit: 882be1ed6b1b5073fc88552181b99bd2b9c0031f >>>> -- >>>> 2.11.0 >>>> >>>> >>> >