From: Jingbo Xu
To: tj@kernel.org, guro@fb.com
Cc: lizefan.x@bytedance.com, hannes@cmpxchg.org, cgroups@vger.kernel.org, jack@suse.cz, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk, brauner@kernel.org, willy@infradead.org, joseph.qi@linux.alibaba.com
Subject: [PATCH] writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs
Date: Wed, 11 Oct 2023 16:42:28 +0800
Message-Id: <20231011084228.77615-1-jefflexu@linux.alibaba.com>

The cgwb cleanup routine tries to release a dying cgwb by switching the
inodes attached to it. It fetches those inodes from the wb->b_attached
list, overlooking the fact that inodes with only dirty timestamps reside
on the wb->b_dirty_time list, which is the case when lazytime is
enabled. This leads to an enormous number of zombie memory cgroups when
lazytime is enabled, as inodes with dirty timestamps cannot be switched
to a live cgwb for a long time.

It is reasonable not to switch cgwbs for inodes with dirty data, as
doing so may break the bandwidth restrictions. However, since the
writeback of inode metadata is not accounted, let's also switch inodes
with dirty timestamps to avoid zombie memory and block cgroups when
lazytime is enabled.

Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
Signed-off-by: Jingbo Xu
---
This issue was reported in our production environment (5.10 kernel with
the dying cgwbs optimization[1] backported, and ext4 mounted with
"-o relatime,lazytime"), and I can also reproduce it on the latest
mainline kernel.

[1] c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes")
---
 fs/fs-writeback.c | 33 +++++++++++++++++++++------------
 1 file changed, 21 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c1af01b2c42d..89125760e4ad 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -613,6 +613,24 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	kfree(isw);
 }
 
+static bool isw_prepare_wbs_switch(struct inode_switch_wbs_context *isw,
+				   struct list_head *list, int *nr)
+{
+	struct inode *inode;
+
+	list_for_each_entry(inode, list, i_io_list) {
+		if (!inode_prepare_wbs_switch(inode, isw->new_wb))
+			continue;
+
+		isw->inodes[*nr] = inode;
+		(*nr)++;
+
+		if (*nr >= WB_MAX_INODES_PER_ISW - 1)
+			return true;
+	}
+	return false;
+}
+
 /**
  * cleanup_offline_cgwb - detach associated inodes
  * @wb: target wb
@@ -625,7 +643,6 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb)
 {
 	struct cgroup_subsys_state *memcg_css;
 	struct inode_switch_wbs_context *isw;
-	struct inode *inode;
 	int nr;
 	bool restart = false;
 
@@ -647,17 +664,9 @@ bool cleanup_offline_cgwb(struct bdi_writeback *wb)
 
 	nr = 0;
 	spin_lock(&wb->list_lock);
-	list_for_each_entry(inode, &wb->b_attached, i_io_list) {
-		if (!inode_prepare_wbs_switch(inode, isw->new_wb))
-			continue;
-
-		isw->inodes[nr++] = inode;
-
-		if (nr >= WB_MAX_INODES_PER_ISW - 1) {
-			restart = true;
-			break;
-		}
-	}
+	restart = isw_prepare_wbs_switch(isw, &wb->b_attached, &nr);
+	if (!restart)
+		restart = isw_prepare_wbs_switch(isw, &wb->b_dirty_time, &nr);
 	spin_unlock(&wb->list_lock);
 
 	/* no attached inodes? bail out */
-- 
2.19.1.6.gb485710b
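
For readers who want to see the control flow outside the kernel, the following is a minimal user-space sketch of the two-list scan that the patch introduces. It is not kernel code: `toy_inode`, `toy_batch`, `toy_collect`, `toy_collect_all` and `MAX_BATCH` are hypothetical names, plain arrays stand in for the `wb->b_attached` and `wb->b_dirty_time` lists, and the `switchable` flag stands in for `inode_prepare_wbs_switch()` succeeding. Locking is omitted entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BATCH 4	/* hypothetical stand-in for WB_MAX_INODES_PER_ISW */

/* Hypothetical stand-in for an inode sitting on one of the wb lists. */
struct toy_inode {
	int id;
	bool switchable;	/* models inode_prepare_wbs_switch() succeeding */
};

/* Hypothetical stand-in for the inode array in the switch context. */
struct toy_batch {
	const struct toy_inode *inodes[MAX_BATCH];
};

/*
 * Append the switchable entries of one list to the batch, continuing
 * from the shared counter *nr. Returns true once the batch holds
 * MAX_BATCH - 1 entries, meaning the caller should come back later
 * for the remaining inodes.
 */
static bool toy_collect(struct toy_batch *batch,
			const struct toy_inode *list, size_t len, int *nr)
{
	for (size_t i = 0; i < len; i++) {
		if (!list[i].switchable)
			continue;

		/* The last slot is never used, as in the patch. */
		assert(*nr < MAX_BATCH - 1);
		batch->inodes[*nr] = &list[i];
		(*nr)++;

		if (*nr >= MAX_BATCH - 1)
			return true;
	}
	return false;
}

/*
 * Scan the "attached" list first, then the "dirty_time" list, sharing
 * one counter so both lists fill the same batch. The second list is
 * skipped when the first one already filled the batch.
 */
static bool toy_collect_all(struct toy_batch *batch,
			    const struct toy_inode *attached, size_t n_attached,
			    const struct toy_inode *dirty_time, size_t n_dirty_time,
			    int *nr)
{
	bool restart;

	*nr = 0;
	restart = toy_collect(batch, attached, n_attached, nr);
	if (!restart)
		restart = toy_collect(batch, dirty_time, n_dirty_time, nr);
	return restart;
}
```

Passing the counter by pointer is what lets the second scan continue filling the batch where the first one stopped, which mirrors the `int *nr` parameter of `isw_prepare_wbs_switch()` in the patch.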