From: Jingbo Xu
To: tj@kernel.org, guro@fb.com
Cc: lizefan.x@bytedance.com, hannes@cmpxchg.org, cgroups@vger.kernel.org, jack@suse.cz, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org, joseph.qi@linux.alibaba.com
Subject: [PATCH v2] writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
Date: Fri, 13 Oct 2023 13:52:08 +0800
Message-Id: <20231013055208.15457-1-jefflexu@linux.alibaba.com>
X-Mailer: git-send-email 2.19.1.6.gb485710b
X-Mailing-List: linux-kernel@vger.kernel.org

The cgwb cleanup routine tries to release a dying cgwb by switching the
inodes attached to it. It fetches those inodes from the wb->b_attached
list only, overlooking the fact that inodes with only dirty timestamps
reside on the wb->b_dirty_time list, which is the case when lazytime is
enabled. As a result, enabling lazytime leaves a large number of zombie
memory cgroups behind, because inodes with dirty timestamps cannot be
switched to a live cgwb for a long time.

It is reasonable not to switch cgwbs for inodes with dirty data, as doing
so may break the bandwidth restrictions. However, since writeback of
inode metadata is not accounted for, also switch inodes with dirty
timestamps to avoid zombie memory and block cgroups when lazytime is
enabled.

Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Jingbo Xu
---
v2: add comment explaining why switching for inodes with dirty
    timestamps is needed
v1: https://lore.kernel.org/all/20231011084228.77615-1-jefflexu@linux.alibaba.com/
---
 fs/fs-writeback.c | 41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c1af01b2c42d..1767493dffda 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -613,6 +613,24 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	kfree(isw);
 }
 
+static bool isw_prepare_wbs_switch(struct inode_switch_wbs_context *isw,
+				   struct list_head *list, int *nr)
+{
+	struct inode *inode;
+
+	list_for_each_entry(inode, list, i_io_list) {
+		if (!inode_prepare_wbs_switch(inode, isw->new_wb))
+			continue;
+
+		isw->inodes[*nr] = inode;
+		(*nr)++;
+
+		if (*nr >= WB_MAX_INODES_PER_ISW - 1)
+			return true;
+	}
+	return false;
+}
+
 /**
  * cleanup_offline_cgwb - detach associated inodes
  * @wb: target wb
@@ -625,7 +643,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb)
 {
 	struct cgroup_subsys_state *memcg_css;
 	struct inode_switch_wbs_context *isw;
-	struct inode *inode;
 	int nr;
 	bool restart = false;
 
@@ -647,17 +664,17 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb)
 
 	nr = 0;
 	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_attached, i_io_list) {
-		if (!inode_prepare_wbs_switch(inode, isw->new_wb))
-			continue;
-
-		isw->inodes[nr++] = inode;
-
-		if (nr >= WB_MAX_INODES_PER_ISW - 1) {
-			restart = true;
-			break;
-		}
-	}
+	/*
+	 * In addition to the inodes that have completed writeback, also switch
+	 * cgwbs for those inodes only with dirty timestamps. Otherwise, those
+	 * inodes won't be written back for a long time when lazytime is
+	 * enabled, and thus pinning the dying cgwbs. It won't break the
+	 * bandwidth restrictions, as writeback of inode metadata is not
+	 * accounted for.
+	 */
+	restart = isw_prepare_wbs_switch(isw, &wb->b_attached, &nr);
+	if (!restart)
+		restart = isw_prepare_wbs_switch(isw, &wb->b_dirty_time, &nr);
 	spin_unlock(&wb->list_lock);
 
 	/* no attached inodes? bail out */
-- 
2.19.1.6.gb485710b
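
For readers who want to check the batching behaviour outside the kernel,
the following is a minimal user-space sketch of the two-list collection
that the patch introduces. It is not kernel code: toy_inode, MAX_BATCH,
collect() and collect_both() are hypothetical names made up for this
illustration. Plain arrays stand in for the wb->b_attached and
wb->b_dirty_time lists, the switchable flag stands in for
inode_prepare_wbs_switch() succeeding, and MAX_BATCH plays the role of
WB_MAX_INODES_PER_ISW - 1. Locking is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cap; stands in for WB_MAX_INODES_PER_ISW - 1. */
#define MAX_BATCH 4

/* Hypothetical stand-in for struct inode. */
struct toy_inode {
	int ino;
	bool switchable;	/* models inode_prepare_wbs_switch() returning true */
};

/*
 * Append the switchable entries of list[0..len) to batch[], continuing from
 * the shared counter *nr. Returns true as soon as the batch is full, which
 * tells the caller to restart the cleanup later.
 */
static bool collect(const struct toy_inode *list, size_t len,
		    const struct toy_inode **batch, int *nr)
{
	assert(*nr >= 0 && *nr < MAX_BATCH);

	for (size_t i = 0; i < len; i++) {
		if (!list[i].switchable)
			continue;

		batch[*nr] = &list[i];
		(*nr)++;

		if (*nr >= MAX_BATCH)
			return true;
	}
	return false;
}

/*
 * Walk the "attached" list first, and the "dirty time" list only if the
 * batch still has room, mirroring the order of the two calls in the patch.
 */
static bool collect_both(const struct toy_inode *attached, size_t n_attached,
			 const struct toy_inode *dirty_time, size_t n_dirty,
			 const struct toy_inode **batch, int *nr)
{
	bool restart = collect(attached, n_attached, batch, nr);

	if (!restart)
		restart = collect(dirty_time, n_dirty, batch, nr);
	return restart;
}
```

Sharing one counter across both calls is what keeps the combined batch
within the cap. Skipping the second list once the first call reports a
full batch avoids writing past the end of the array.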