Date: Thu, 22 Apr 2021 11:04:11 +0200
From: Jan Kara
To: Theodore Ts'o
Cc: Jan Kara, Christoph Hellwig, Zhang Yi, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, adilger.kernel@dilger.ca,
	yukuai3@huawei.com
Subject: Re: [RFC PATCH v2 7/7] ext4: fix race between blkdev_releasepage() and ext4_put_super()
Message-ID: <20210422090410.GA26221@quack2.suse.cz>
References: <20210414134737.2366971-1-yi.zhang@huawei.com>
	<20210414134737.2366971-8-yi.zhang@huawei.com>
	<20210415145235.GD2069063@infradead.org>
	<20210420130841.GA3618564@infradead.org>
	<20210421134634.GT8706@quack2.suse.cz>

On Wed 21-04-21 12:57:39, Theodore Ts'o wrote:
> On Wed, Apr 21, 2021 at 03:46:34PM +0200,
Jan Kara wrote:
> >
> > Indeed, after 12 years in kernel .bdev_try_to_free_page is implemented only
> > by ext4. So maybe it is not that important? I agree with Zhang and
> > Christoph that getting the lifetime rules sorted out will be hairy and it
> > is questionable, whether it is worth the additional pages we can reclaim.
> > Ted, do you remember what was the original motivation for this?
>
> The comment in fs/ext4/super.c is I thought a pretty good explanation:
>
> /*
>  * Try to release metadata pages (indirect blocks, directories) which are
>  * mapped via the block device. Since these pages could have journal heads
>  * which would prevent try_to_free_buffers() from freeing them, we must use
>  * jbd2 layer's try_to_free_buffers() function to release them.
>  */
>
> When we modify a metadata block, we attach a journal_head (jh)
> structure to the buffer_head, and bump the ref count to prevent the
> buffer from being freed. Before the transaction is committed, the
> buffer is marked jbddirty, but the dirty bit is not set until the
> transaction commit.
>
> At that back, writeback happens entirely at the discretion of the
> buffer cache. The jbd layer doesn't get notification when the I/O is
> completed, nor when there is an I/O error. (There was an attempt to
> add a callback but that was NACK'ed because of a complaint that it was
> jbd specific.)
>
> So we don't actually know when it's safe to detach the jh from the
> buffer_head and can drop the refcount so that the buffer_head can be
> freed. When the space in the journal starts getting low, we'll look
> at at the jh's attached to completed transactions, and see how many of
> them have clean bh's, and at that point, we can release the buffer
> heads.
>
> The other time when we'll attempt to detach jh's from clean buffers is
> via bdev_try_to_free_buffers().
> So if we drop the
> bdev_try_to_free_page hook, then when we are under memory pressure,
> there could be potentially a large percentage of the buffer cache
> which can't be freed, and so the OOM-killer might trigger more often.

Yes, I understand that. What I was more asking about is: Does it really
matter if we leave those buffer heads and journal heads unreclaimed? I
understand it could trigger premature OOM in theory, but is it a problem
in practice? Was there some observed practical case for which this was
added, or was it just added due to the theoretical concern?

> Now, if we could get a callback on I/O completion on a per-bh basis,
> then we could detach the jh when the buffer is clean --- and as a
> bonus, we'd get a notification when there was an I/O error writing
> back a metadata block, which would be even better.
>
> So how about an even swap? If we can get a buffer I/O completion
> callback, we can drop bdev_to_free_swap hook.....

I'm OK with that, mainly because it makes sense to me for IO error
reporting. For this memory reclaim problem I think we also have other
reasonably sensible options. E.g. we could have a shrinker that would just
walk the checkpoint list and reclaim journal heads for whatever is already
written out... Or we could just release journal heads right after commit,
and during checkpoint we'd fetch the list of blocks that may need to be
written out, e.g. from journal descriptor blocks. This would be a larger
overhaul, but as a bonus we'd get rid of probably the last place in the
kernel which can write out page contents through buffer heads without
updating page state properly (and thus get rid of some workarounds in mm
code as well).

								Honza
--
Jan Kara
SUSE Labs, CR
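
To make the shrinker idea above a bit more concrete, here is a rough
userspace sketch of what I have in mind. It is only a model: struct toy_bh,
struct toy_jh, toy_attach() and toy_shrink_checkpoint() are made-up names
for illustration, not jbd2 structures or functions, and locking is ignored
completely. The point is just that a journal head pins its buffer with a
reference, and a walk over the checkpoint list can drop that reference for
every buffer that has already been written out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins, NOT the kernel's struct buffer_head / struct journal_head. */
struct toy_bh {
	bool dirty;     /* still waiting for writeback */
	int refcount;   /* an attached journal head holds one reference */
};

struct toy_jh {
	struct toy_bh *bh;
	struct toy_jh *next;   /* next entry on the (toy) checkpoint list */
};

/* Attach a journal head to bh and push it onto the list; this pins bh. */
static struct toy_jh *toy_attach(struct toy_jh *head, struct toy_bh *bh)
{
	struct toy_jh *jh = malloc(sizeof(*jh));

	assert(jh != NULL);
	bh->refcount++;
	jh->bh = bh;
	jh->next = head;
	return jh;
}

/*
 * Walk at most nr_to_scan entries of the checkpoint list. Entries whose
 * buffer is clean are unlinked and freed, dropping the reference that kept
 * the buffer pinned. Dirty buffers are skipped. Returns the number of
 * journal heads released.
 */
static size_t toy_shrink_checkpoint(struct toy_jh **list, size_t nr_to_scan)
{
	struct toy_jh **pp = list;
	size_t freed = 0;

	while (*pp != NULL && nr_to_scan > 0) {
		struct toy_jh *jh = *pp;

		nr_to_scan--;
		if (jh->bh->dirty) {
			pp = &jh->next;   /* not written out yet, keep it */
			continue;
		}
		*pp = jh->next;           /* unlink from the list */
		jh->bh->refcount--;       /* buffer is reclaimable again */
		free(jh);
		freed++;
	}
	return freed;
}
```

In the kernel this would presumably hang off a registered struct shrinker
(count_objects / scan_objects callbacks) and would need the proper journal
locking around the list walk, none of which is modelled here.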