Date: Wed, 19 Jan 2022 10:25:47 +1100
From: Dave Chinner
To: Al Viro
Cc: Brian Foster, Ian Kent, "Darrick J. Wong", Christoph Hellwig,
	Miklos Szeredi, David Howells, Kernel Mailing List, linux-fsdevel, xfs
Wong" , Christoph Hellwig , Miklos Szeredi , David Howells , Kernel Mailing List , linux-fsdevel , xfs Subject: Re: [PATCH] vfs: check dentry is still valid in get_link() Message-ID: <20220118232547.GD59729@dread.disaster.area> References: <164180589176.86426.501271559065590169.stgit@mickey.themaw.net> <275358741c4ee64b5e4e008d514876ed4ec1071c.camel@themaw.net> <20220118030041.GB59729@dread.disaster.area> <20220118041253.GC59729@dread.disaster.area> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.4 cv=e9dl9Yl/ c=1 sm=1 tr=0 ts=61e74c7f a=Eslsx4mF8WGvnV49LKizaA==:117 a=Eslsx4mF8WGvnV49LKizaA==:17 a=kj9zAlcOel0A:10 a=DghFqjY3_ZEA:10 a=7-415B0cAAAA:8 a=sC8CWpTUz8-MwOmwIq8A:9 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 18, 2022 at 05:58:14AM +0000, Al Viro wrote: > On Tue, Jan 18, 2022 at 03:12:53PM +1100, Dave Chinner wrote: > > > No, that just creates a black hole where the VFS inode has been > > destroyed but the XFS inode cache doesn't know it's been trashed. > > Hence setting XFS_IRECLAIMABLE needs to remain in the during > > ->destroy_inode, otherwise the ->lookup side of the cache will think > > that are currently still in use by the VFS and hand them straight > > back out without going through the inode recycling code. > > > > i.e. XFS_IRECLAIMABLE is the flag that tells xfs_iget() that the VFS > > part of the inode has been torn down, and that it must go back > > through VFS re-initialisation before it can be re-instantiated as a > > VFS inode. > > OK... > > > It would also mean that the inode will need to go through two RCU > > grace periods before it gets reclaimed, because XFS uses RCU > > protected inode cache lookups internally (e.g. for clustering dirty > > inode writeback) and so freeing the inode from the internal > > XFS inode cache requires RCU freeing... > > Wait a minute. Where is that RCU delay of yours, relative to > xfs_vn_unlink() and xfs_vn_rename() (for target)? Both of those drop the inode on an on-disk unlinked list. When the last reference goes away, ->destroy_inode then runs inactivation. Inactivation then runs transactions to free all the space attached to the inode and then removes the inode from the unlinked list and frees it. It then goes into the XFS_IRECLAIMABLE state and is dirty in memory. It can't be reclaimed until the inode is written to disk or the whole inode cluster is freed and the inode marked XFS_ISTALE (so won't get written back). At that point, a background inode reclaim thread (runs every 5s) does a RCU protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes (via radix tree tags). If they are clean, it moves them to XFS_IRECLAIM state, deletes them from the radix tree and frees them via a call_rcu() callback. If memory reclaim comes along sooner than this, the ->free_cached_objects() superblock shrinker callback runs that RCU protected lockless radix tree walk to find XFS_IRECLAIMABLE inodes. > And where does > it happen in case of e.g. open() + unlink() + close()? Same thing - close() drops the last reference, the unlinked inode goes through inactivation, then moves into the XFS_IRECLAIMABLE state. The problem is not -quite- open-unlink-close. The problem case is the reallocation of an on-disk inode in the case of unlink-close-open(O_CREATE) operations because of the on-disk inode allocator policy of aggressive reuse of recently freed inodes. 
The problem is not -quite- open-unlink-close. The problem case is the
reallocation of an on-disk inode in unlink-close-open(O_CREAT)
workloads, because the on-disk inode allocator policy aggressively
reuses recently freed inodes. In that case the xfs_iget() lookup will
reinstantiate the inode via xfs_iget_recycle() and the inode will
change identity between VFS instantiations. This is where an RCU grace
period is absolutely required, and we don't currently have one. The bug
was introduced with RCU freeing of inodes (what, 15 years ago now?) and
it's only recently that we've realised via code inspection that this
bug exists. We really have no evidence that it's actually been tripped
over in the wild....

Unfortunately, the simple fix of adding synchronize_rcu() to
xfs_iget_recycle() causes significant performance regressions, because
we hit this path quite frequently when workloads use lots of temporary
files - the on-disk inode allocator policy tends towards aggressive
re-use of inodes for small sets of temporary files. (A rough sketch of
that naive fix is appended at the end of this mail.)

The problem XFS is trying to address is that the VFS inode lifecycle
does not cater for filesystems that need to both dirty and then clean
unlinked inodes between iput_final() and ->destroy_inode. Once we've
decided to drop the inode, it's too late to put it back on the LRU if
we then need to dirty it again. And because evict() is part of
non-blocking memory reclaim, we aren't supposed to block for
arbitrarily long periods of time or create unbound memory demand while
processing inode eviction (both of which XFS can do in inactivation).

IOWs, XFS can't free the inode until its journal releases the internal
reference on the dirty inode. ext4 doesn't track inodes in its journal
- it only tracks the inode buffers that contain the changes made to the
inode, so once the transaction is committed in ext4_evict_inode() the
inode can be immediately freed via either ->destroy_inode or
->free_inode. That option does not exist for XFS because we have to
wait for the journal to finish with the inode before it can be freed.
Hence all the background reclaim stuff.

We've recently solved several of the problems we needed to solve to
reduce this mismatch; avoiding blocking on inode writeback in reclaim
and background inactivation are two of the major pieces of work we
needed done before we could even consider aligning XFS more closely
with the VFS inode cache life cycle model.

The next step is to move the background inode inactivation triggers up
into ->drop_inode so we can catch inodes that need to be dirtied by the
filesystem before they have been marked for eviction by the VFS. This
will allow us to keep the inode on the VFS LRU (probably marked with
I_WILL_FREE so everyone else keeps away from it) whilst we are waiting
for the background inactivation work to be done, the journal flushed
and the metadata written back. Once clean, we can directly evict the
inode from the VFS ourselves.

This would mean we only get clean, reclaimable inodes hitting the
evict() path, and so at that point we can just remove the inode
directly from the XFS inode cache from either ->destroy_inode or
->free_inode and RCU free it. The recycling of in-memory inodes in
xfs_iget_cache_hit() can go away entirely because no inodes will linger
in the XFS inode cache without being visible at the VFS layer as they
do now...

That's going to take a fair bit of work to realise, and I'm not sure
yet exactly what mods are going to be needed to either the VFS inode
infrastructure or the XFS inode cache.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
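P.S. For illustration only - this is not the actual fs/xfs code, just a
sketch of where the naive synchronize_rcu() fix discussed above would
sit in a stripped-down recycle path. sketch_iget_recycle() and the
sketch_fs_inode layout are stand-ins, and all the real locking, state
flags and error handling are omitted.

/*
 * Illustrative only, not the actual fs/xfs code.
 */
#include <linux/fs.h>
#include <linux/rcupdate.h>

struct sketch_fs_inode {
	struct inode	vfs_inode;	/* VFS inode embedded in the fs inode */
};

/*
 * Cache hit on an inode that has already been through evict() and
 * ->destroy_inode but is still sitting in the fs-internal cache.
 * Before re-initialising the VFS part and handing it back out under a
 * new identity, wait out a full grace period so no lockless user that
 * sampled the old incarnation under rcu_read_lock() can still be
 * looking at it.
 */
static int sketch_iget_recycle(struct sketch_fs_inode *ip)
{
	/*
	 * The "simple fix": correct, but it stalls every
	 * unlink-close-open(O_CREAT) style inode reuse for a full RCU
	 * grace period, which is what makes it too expensive.
	 */
	synchronize_rcu();

	/* re-initialise the VFS inode for its new identity */
	return inode_init_always(ip->vfs_inode.i_sb, &ip->vfs_inode);
}

That single synchronize_rcu() on every recycle hit is where the
performance regression comes from.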