Date: Fri, 14 May 2021 13:23:20 +0200
From: Jan Kara
To: Roman Gushchin
Cc: Jan Kara, Tejun Heo, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org, Alexander Viro,
	Dennis Zhou, Dave Chinner, cgroups@vger.kernel.org
Subject: Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups
Message-ID: <20210514112320.GB27655@quack2.suse.cz>
References: <20210513004258.1610273-1-guro@fb.com> <20210513101239.GE2734@quack2.suse.cz>

On Thu 13-05-21 11:12:16, Roman Gushchin wrote:
> On Thu, May 13, 2021 at 12:12:39PM +0200, Jan Kara wrote:
> > On Wed 12-05-21 17:42:58, Roman Gushchin wrote:
> > > +	WARN_ON_ONCE(inode->i_wb != wb);
> > > +	inode->i_wb = NULL;
> > > +	wb_put(wb);
> > > +	list_del_init(&inode->i_io_list);
> > 
> > So I was thinking about this and I'm still a bit nervous that setting
> > i_wb to NULL is going to cause subtle crashes somewhere. Granted you are
> > very careful about when not to touch the inode but still, even stuff
> > like inode_to_bdi() is not safe to call when inode->i_wb is NULL. So I'm
> > afraid that some place in the writeback code will be looking at i_wb
> > without having any of those bits set and boom. E.g. the inode_to_wb()
> > call in test_clear_page_writeback() - what protects that one?
> 
> I assume that if the page is dirty/under writeback, the inode must be
> dirty too, so we can't zero inode->i_wb.

If a page is under writeback, the inode can already be clean. You could
check !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) but see how fragile
that is... For every place using inode_to_wb() you have to come up with a
test excluding it...

> > I forgot what possibilities we already discussed in the past, but can't
> > we just switch inode->i_wb to inode_to_bdi(inode)->wb (i.e., the root
> > cgroup's writeback structure)? That would certainly be safer...
> 
> I am/was nervous too. I had several BUG_ON()'s in all such places and ran
> the test kernel for about a day on my dev desktop (even updated it to
> Fedora 34 lol), and haven't seen any panics. And certainly I can give it
> a comprehensive test in a more production-like environment.

I appreciate the testing, but it is also about how likely this is to break
sometime in the future, because someone unaware of this corner case will
add a new inode_to_wb() call that is not excluded by one of your
conditions.

> Re switching to the root wb: it's certainly a possibility too, however
> zeroing has an advantage: the next potential writeback will be accounted
> to the right cgroup without the need for an additional switch.
> I'd try to go with zeroing if possible, and keep switching to the root
> wb as plan B.

Yes, there will be the cost of an additional switch. But inodes attached
to dying cgroups shouldn't be the fast path anyway, so does it matter?
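
Just to make plan B concrete, here is a completely untested sketch of what
I mean (hypothetical, reusing the names from your hunk above; in reality
the reassignment would have to go through the same careful synchronization
that inode_switch_wbs() uses, which I'm eliding here):

	/*
	 * Instead of zeroing inode->i_wb, park the inode on the
	 * bdi-embedded root wb. That wb lives as long as the bdi itself,
	 * so inode_to_wb() stays valid for every caller.
	 */
	struct bdi_writeback *root_wb = &inode_to_bdi(inode)->wb;

	WARN_ON_ONCE(inode->i_wb != wb);
	list_del_init(&inode->i_io_list);
	inode->i_wb = root_wb;	/* root wb needs no extra reference */
	wb_put(wb);		/* drop the dying cgwb's reference */

The next writeback would then be accounted to the root wb rather than the
right cgroup - that's the additional switch cost from above - but no
caller of inode_to_wb() could ever see a NULL pointer.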

> > > @@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> > >  	mutex_unlock(&bdi->cgwb_release_mutex);
> > >  }
> > >  
> > > +/**
> > > + * cleanup_offline_cgwbs - try to release dying cgwbs
> > > + *
> > > + * Try to release dying cgwbs by switching attached inodes to the wb
> > > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > > + * end of the list to guarantee forward progress.
> > > + *
> > > + * Should be called with cgwb_lock acquired; the lock might be
> > > + * released and re-acquired in the process.
> > > + */
> > > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > > +{
> > > +	struct bdi_writeback *wb;
> > > +	LIST_HEAD(processed);
> > > +
> > > +	spin_lock_irq(&cgwb_lock);
> > > +
> > > +	while (!list_empty(&offline_cgwbs)) {
> > > +		wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > > +				      offline_node);
> > > +
> > > +		list_move(&wb->offline_node, &processed);
> > > +
> > > +		if (wb_has_dirty_io(wb))
> > > +			continue;
> > > +
> > > +		if (!percpu_ref_tryget(&wb->refcnt))
> > > +			continue;
> > > +
> > > +		spin_unlock_irq(&cgwb_lock);
> > > +		cleanup_offline_wb(wb);
> > > +		spin_lock_irq(&cgwb_lock);
> > > +
> > > +		wb_put(wb);
> > > +	}
> > > +
> > > +	if (!list_empty(&processed))
> > > +		list_splice_tail(&processed, &offline_cgwbs);
> > > +
> > > +	spin_unlock_irq(&cgwb_lock);
> > 
> > Shouldn't we reschedule this work with some delay if offline_cgwbs is
> > non-empty? Otherwise we can end up with a non-empty &offline_cgwbs and
> > no cleanup scheduled...
> 
> I'm not sure. In general it's not a big problem to have a few outstanding
> wb structures, we just need to make sure we don't pile them up.
> I'm scheduling the cleanup from the memcg offlining path, so if new
> cgroups are regularly created and destroyed, some pressure will be
> applied.
> 
> To reschedule based on time we need to come up with some heuristic for
> how to calculate the required delay, and I don't have any specific ideas.
> If you do, I'm totally fine with incorporating them.

Well, I'd pick e.g. dirty_expire_interval (30s by default) as the time
after which we try again, because after that time writeback has likely
already happened. But I don't insist here. If you think leaving inodes
attached to dead cgroups for a potentially long time in some cases isn't a
problem, then we can leave this as is. If we find it's a problem in the
future, we can always add the time-based retry.
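
For the record, the kind of retry I have in mind is the following
completely untested sketch. It assumes the cleanup work is converted to a
delayed_work; note that dirty_expire_interval is in centisecs, hence the
"* 10" to get msecs:

	static DECLARE_DELAYED_WORK(cleanup_offline_cgwbs_dwork,
				    cleanup_offline_cgwbs_workfn);

	/*
	 * At the end of cleanup_offline_cgwbs_workfn(), while still
	 * holding cgwb_lock (queue_delayed_work() doesn't sleep, so
	 * calling it under the spinlock is fine):
	 */
	if (!list_empty(&offline_cgwbs))
		queue_delayed_work(system_unbound_wq,
				   &cleanup_offline_cgwbs_dwork,
				   msecs_to_jiffies(dirty_expire_interval * 10));
	spin_unlock_irq(&cgwb_lock);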

								Honza
-- 
Jan Kara
SUSE Labs, CR