Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932762Ab2FGTgP (ORCPT ); Thu, 7 Jun 2012 15:36:15 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:46897 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751707Ab2FGTgN (ORCPT ); Thu, 7 Jun 2012 15:36:13 -0400 Date: Thu, 7 Jun 2012 20:36:07 +0100 From: Al Viro To: Dave Jones , Linus Torvalds , Linux Kernel , Miklos Szeredi , Jan Kara , Peter Zijlstra Cc: linux-fsdevel@vger.kernel.org, "J. Bruce Fields" , "Eric W. Biederman" , Sage Weil Subject: Re: processes hung after sys_renameat, and 'missing' processes Message-ID: <20120607193607.GI30000@ZenIV.linux.org.uk> References: <20120603232820.GQ30000@ZenIV.linux.org.uk> <20120606194233.GA1537@redhat.com> <20120606230040.GA18089@redhat.com> <20120606235403.GC30000@ZenIV.linux.org.uk> <20120607002914.GB22223@redhat.com> <20120607011915.GA17566@redhat.com> <20120607012900.GE30000@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20120607012900.GE30000@ZenIV.linux.org.uk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7689 Lines: 161 [maintainers of assorted code involved Cc'd] On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote: > On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote: > > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote: > > > > > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of > > > the parents. > > > > ok, I ended up with.. > > > > WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex)); > > > > if (dentry->d_parent != NULL) > != dentry) > > ->d_parent *never* should be NULL; when dentry is disconnected from the > tree (or hadn't been connected to it yet), it points to that dentry itself. BTW, I've started to put together some documentation on ->d_parent use. And already have found idiocies. 0) *all* instances of struct dentry should come from __d_alloc(); never embed it into another object or declare variables of type struct dentry. AFAICS, that one is satisfied now; any out-of-tree code violating that deserves to be broken and *will* be broken. 1) For all code outside of fs/dcache.c it's read-only. Never change it directly, period. Again, satisfied in mainline, out-of-tree code can go play in the traffic, for all I care. 2) Places that set ->d_parent: __d_alloc(), d_alloc(), __d_move(), __d_materialize_dentry(). The first two are doing that to new instance of struct dentry with no references to it anywhere in shared data structures. The other two have had dentry passed to dentry_lock_for_move(). 3) d_alloc(NULL, name) should never happen (will instantly oops in that case). 4) ->d_parent is *never* NULL; the only way to stumble across a dentry with that not being true is to crawl through the slab page directly and to run into an object that has been allocated by not initialized yet. Don't do that (nothing in mainline does, thanks $DEITY). "No parent" == "dentry has ->d_parent pointing to dentry itself". There's a helper checking that, albeit with slightly misleading name - IS_ROOT(dentry). Doesn't have to be fs root - anything still not attached to the tree or already detached from it will be that way. Any code checking that ->d_parent is NULL is almost certainly a bug. And we have a bunch of such places in fs/ceph: ceph_init_dentry() ceph_d_prune() ceph_mdsc_build_path() a few in fs/cifs: build_path_from_dentry() complete idiocy in fs/debugfs (not only check for NULL ->d_parent, but one for ->d_parent being a negative dentry; also can't happen). debugfs_remove() debugfs_remove_recursive() one in fs/ocfs2: ocfs2_match_dentry() and one in security/inode.c: securityfs_remove() (nicked from debugfs, at a guess). Some of those guys can be simply dropped, but I really wonder what ceph_init_dentry/ceph_d_prune intended there. Incidentall, build_path_from_dentry() ought to be lifted into fs/dcache.c; we have at least three copies and one of them is not an instant roothole only because hppfs doesn't follow underlying procfs changes. And I'm going to export is_subdir() - at least gfs2 has grown itself a b0rken copy... 5) all assignments to dentry->d_parent after initialization are under dentry->d_lock; dentry_lock_for_move() takes care of that. 6) all reassignments to dentry->d_parent have ->d_lock held on old and new parent. Again, dentry_lock_for_move() takes care of that and so does d_alloc() (that one - for new parent only; old parent is dentry itself here and nothing could have seen it yet). 7) attached dentry moving away from old parent must have ->i_mutex on that old parent held. 8) dentry getting attached to a new parent must have ->i_mutex held on that new parent. 9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex held. Now, (7--9) are interesting; the call graph looks so: __d_move() <- d_move() <- __d_unalias() <- d_materialise_unique() __d_materialise_dentry() <- d_materialise_unique() rename-related callers of d_move() (vfs_rename_dir/vfs_rename_other/ nfs_rename/ocfs2_dentry_move from ocfs2_rename/ceph_rename/v9fs_vfs_rename) share the same locking environment; we have both parents + ->s_vfs_rename_mutex in case of cross-directory move held. Pretty much by an accident we have one difference between vfs_rename_dir and vfs_rename_other - holding ->i_mutex on *victim* of overwriting rename() is not something d_move() cares about. We can make that more uniform, but that's really unrelated to that. Note that in all these cases both dentries are guaranteed to be attached all along. Other callers of d_move(): vfat_lookup() does d_move() of an alias; the validity of that strongly relies on the lack of anything resembling hardlinks on VFAT; the parents will be the same (and parent of target is locked). sysfs_lookup() plays insane games with d_move(); I _really_ don't like the look of that code. It tries to play catch-up after sysfs_rename() done its thing. AFAICS, the parent *can* be changed by sysfs_rename(). No lock is held on said parent here; no ->s_vfs_rename_mutex either. Moreover, none of the sysfs_rename() callers (kobject_move(), etc.) appear to give a damn about checking if it would create a loop. * debugfs_rename() - imitates what vfs_rename() is doing. Same locking environment. BTW, trap = lock_rename(new_dir, old_dir); /* Source or destination directories don't exist? */ if (!old_dir->d_inode || !new_dir->d_inode) goto exit; is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake! If this can be called with old_dir or new_dir negative, it's buggered. * ceph_fill_trace() in CEPH_MDS_OP_RENAME case. Seems to be done while ceph_rename() is being executed; i.e. the usual locking environment holds. No idea why it feels the need to grab references to old_dentry and new_dentry in ceph_rename(), though - if that thing can outlive ceph_rename(), we have a bad trouble. And if it can't, there's no reasons to pin them down that way. * d_splice_alias() - broken. Called without any locking on the old directory, nevermind ->s_vfs_rename_mutex. I really believe that this pair of commits needs to be reverted. The earlier code used to guarantee that alias would be detached. d_materialise_unique(dentry, inode) must be called with dentry->d_parent having its ->d_inode->i_mutex held. AFAICS, that's satisfied for all callers - they come either from ->lookup() or from ->readdir(), both having the parent inode locked. ceph is a possible exceptions; it's damn hard to trace back to the places triggering ceph_fill_trace() in general ;-/ __d_unalias(), in case we are asked for cross-directory move, grabs ->s_vfs_rename_mutex first (with trylock). That makes ->d_parent of alias stable; then it grabs ->i_mutex on that parent (again with trylock) and we are all set. __d_materialise_dentry(dentry, alias) is called only when alias is unattached. That looks like potential race, BTW - what's to prevent it from becoming attached right under us? We do have __d_materialise_dentry() and __d_ualias serialized by alias->d_inode->i_lock, but... there's another way for that unattached alias to get attached - d_splice_alias(). And it's done without ->i_lock there. OTOH, it's not a problem to extend the protection of ->i_lock to that spot as well. I'll continue this series in a couple of hours - there's more. In the meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above for the reasons. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/