From: Roman Gushchin
To: Jan Kara, Tejun Heo
CC: Alexander Viro, Dennis Zhou, Dave Chinner, Roman Gushchin
Subject: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes
Date: Wed, 26 May 2021 15:25:57 -0700
Message-ID: <20210526222557.3118114-3-guro@fb.com>
In-Reply-To: <20210526222557.3118114-1-guro@fb.com>
References: <20210526222557.3118114-1-guro@fb.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Asynchronously try to release dying cgwbs by switching clean attached
inodes to the bdi's wb. This gets rid of the per-cgroup writeback
structures themselves and of the pinned memory and block cgroups, which
are much larger structures (mostly due to large per-cpu statistics
data). It helps to prevent memory waste and various scalability
problems caused by large piles of dying cgroups.

A cgwb cleanup operation can fail for various reasons (e.g. the cgwb
has in-flight/pending io, an attached inode is locked or isn't clean,
etc).
In this case the next scheduled cleanup will make a new attempt. An
attempt is made each time a new cgwb is offlined (in other words a memcg
and/or a blkcg is deleted by a user). In the future an additional
attempt scheduled by a timer can be implemented.

Signed-off-by: Roman Gushchin
---
 fs/fs-writeback.c                | 35 ++++++++++++++++++
 include/linux/backing-dev-defs.h |  1 +
 include/linux/writeback.h        |  1 +
 mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
 4 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 631ef6366293..8fbcd50844f0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
 	kfree(isw);
 }
 
+/**
+ * cleanup_offline_wb - detach associated clean inodes
+ * @wb: target wb
+ *
+ * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
+ * drop the corresponding per-cgroup wb's reference. Skip inodes which are
+ * dirty, freeing, in the active writeback process or are in any way busy.
+ */
+void cleanup_offline_wb(struct bdi_writeback *wb)
+{
+	struct inode *inode, *tmp;
+
+	spin_lock(&wb->list_lock);
+restart:
+	list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
+		if (!spin_trylock(&inode->i_lock))
+			continue;
+		xa_lock_irq(&inode->i_mapping->i_pages);
+		if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
+			struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
+
+			WARN_ON_ONCE(inode->i_wb != wb);
+
+			inode->i_wb = bdi_wb;
+			list_del_init(&inode->i_io_list);
+			wb_put(wb);
+		}
+		xa_unlock_irq(&inode->i_mapping->i_pages);
+		spin_unlock(&inode->i_lock);
+		if (cond_resched_lock(&wb->list_lock))
+			goto restart;
+	}
+	spin_unlock(&wb->list_lock);
+}
+
 /**
  * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
  * @wbc: writeback_control of interest
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e5dc238ebe4f..07d6b6d6dbdf 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -155,6 +155,7 @@ struct bdi_writeback {
 	struct list_head memcg_node;	/* anchored at memcg->cgwb_list */
 	struct list_head blkcg_node;	/* anchored at blkcg->cgwb_list */
 	struct list_head b_attached;	/* attached inodes, protected by list_lock */
+	struct list_head offline_node;	/* anchored at offline_cgwbs */
 
 	union {
 		struct work_struct release_work;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 572a13c40c90..922f15fe6ad4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
 int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
 			   enum wb_reason reason, struct wb_completion *done);
 void cgroup_writeback_umount(void);
+void cleanup_offline_wb(struct bdi_writeback *wb);
 
 /**
  * inode_attach_wb - associate an inode with its wb
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 54c5dc4b8c24..92a00bcaa504 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
 #include
 
 /*
- * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
- * bdi->cgwb_tree is also RCU protected.
+ * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
+ * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
  */
 static DEFINE_SPINLOCK(cgwb_lock);
 static struct workqueue_struct *cgwb_release_wq;
 
+static LIST_HEAD(offline_cgwbs);
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
+static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
+
 static void cgwb_release_workfn(struct work_struct *work)
 {
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
@@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 	percpu_ref_exit(&wb->refcnt);
+	WARN_ON(!list_empty(&wb->offline_node));
 	wb_exit(wb);
 	WARN_ON_ONCE(!list_empty(&wb->b_attached));
 	kfree_rcu(wb, rcu);
@@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
 	WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
 	list_del(&wb->memcg_node);
 	list_del(&wb->blkcg_node);
+	if (!list_empty(&wb->b_attached))
+		list_add(&wb->offline_node, &offline_cgwbs);
+	else
+		INIT_LIST_HEAD(&wb->offline_node);
 	percpu_ref_kill(&wb->refcnt);
 }
 
@@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
 	mutex_unlock(&bdi->cgwb_release_mutex);
 }
 
+/**
+ * cleanup_offline_cgwbs - try to release dying cgwbs
+ *
+ * Try to release dying cgwbs by switching attached inodes to the wb
+ * belonging to the root memory cgroup. Processed wbs are placed at the
+ * end of the list to guarantee the forward progress.
+ *
+ * Should be called with the acquired cgwb_lock lock, which might
+ * be released and re-acquired in the process.
+ */
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
+{
+	struct bdi_writeback *wb;
+	LIST_HEAD(processed);
+
+	spin_lock_irq(&cgwb_lock);
+
+	while (!list_empty(&offline_cgwbs)) {
+		wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
+				      offline_node);
+		list_move(&wb->offline_node, &processed);
+
+		if (wb_has_dirty_io(wb))
+			continue;
+
+		if (!percpu_ref_tryget(&wb->refcnt))
+			continue;
+
+		spin_unlock_irq(&cgwb_lock);
+		cleanup_offline_wb(wb);
+		spin_lock_irq(&cgwb_lock);
+
+		if (list_empty(&wb->b_attached))
+			list_del_init(&wb->offline_node);
+
+		wb_put(wb);
+	}
+
+	if (!list_empty(&processed))
+		list_splice_tail(&processed, &offline_cgwbs);
+
+	spin_unlock_irq(&cgwb_lock);
+}
+
 /**
  * wb_memcg_offline - kill all wb's associated with a memcg being offlined
  * @memcg: memcg being offlined
@@ -650,6 +703,10 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
 	list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
 		cgwb_kill(wb);
 	memcg_cgwb_list->next = NULL;	/* prevent new wb's */
+
+	if (!list_empty(&offline_cgwbs))
+		schedule_work(&cleanup_offline_cgwbs_work);
+
 	spin_unlock_irq(&cgwb_lock);
 }
 
-- 
2.31.1
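
Illustration only, not part of the patch: the "move each entry to a
private processed list, then splice it back" idea used by
cleanup_offline_cgwbs_workfn() can be tried out in user space. The
sketch below is a simplified model under stated assumptions. struct
node, struct fake_wb, the busy flag and every helper name are
hypothetical stand-ins, not kernel APIs. Unlike list_move(), which
inserts at the head of the target list, the helper here appends at the
tail, and a non-busy entry is assumed to be cleaned completely in one
pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct list_head. */
struct node {
	struct node *prev, *next;
};

static void list_init(struct node *h) { h->prev = h->next = h; }
static bool list_is_empty(const struct node *h) { return h->next == h; }

/* Remove @n from whatever list it is on and leave it self-linked. */
static void list_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static void list_push_tail(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Append every entry of @from to the tail of @to, leaving @from empty. */
static void list_splice_to_tail(struct node *from, struct node *to)
{
	while (!list_is_empty(from)) {
		struct node *n = from->next;

		list_unlink(n);
		list_push_tail(n, to);
	}
}

/* Hypothetical model of a dying cgwb. */
struct fake_wb {
	struct node offline_node;	/* first member, so a node pointer can be cast back */
	int attached;			/* number of inodes still attached */
	bool busy;			/* models dirty io or a failed reference grab */
};

/* One cleanup pass over @offline; returns how many entries were released. */
static int cleanup_pass(struct node *offline)
{
	struct node processed;
	int released = 0;

	list_init(&processed);
	while (!list_is_empty(offline)) {
		struct fake_wb *wb = (struct fake_wb *)offline->next;

		/* Move first, so a skipped entry is not visited again in this pass. */
		list_unlink(&wb->offline_node);
		list_push_tail(&wb->offline_node, &processed);

		if (wb->busy)
			continue;

		wb->attached = 0;			/* assumed: cleanup fully succeeds */
		list_unlink(&wb->offline_node);		/* nothing attached, drop the entry */
		released++;
	}

	/* Skipped entries go back on the shared list for the next pass. */
	list_splice_to_tail(&processed, offline);
	assert(list_is_empty(&processed));
	return released;
}
```

Because each entry leaves the shared list before it is examined, the
loop terminates even when every entry is skipped, and skipped entries
are retried on the next pass instead of blocking the ones behind them.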