Date: Thu, 7 Jun 2012 20:36:07 +0100
From: Al Viro <viro@ZenIV.linux.org.uk>
To: Dave Jones <davej@redhat.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Linux Kernel <linux-kernel@vger.kernel.org>,
        Miklos Szeredi <mszeredi@suse.cz>, Jan Kara <jack@suse.cz>,
        Peter Zijlstra <peterz@infradead.org>
Cc: linux-fsdevel@vger.kernel.org, "J. Bruce Fields" <bfields@redhat.com>,
        "Eric W. Biederman" <ebiederm@xmission.com>,
        Sage Weil <sage@newdream.net>
Subject: Re: processes hung after sys_renameat, and 'missing' processes
Message-ID: <20120607193607.GI30000@ZenIV.linux.org.uk>
References: <20120603232820.GQ30000@ZenIV.linux.org.uk>
 <20120606194233.GA1537@redhat.com>
 <CA+55aFwxFVLPk4-dp73uccKXepKwjpGFgh0NW1u8d0zYzr=hBA@mail.gmail.com>
 <20120606230040.GA18089@redhat.com>
 <CA+55aFx_oREEd2XS_aG7PJ7dwzWjgx689V_uRPGWpVS01_++6A@mail.gmail.com>
 <20120606235403.GC30000@ZenIV.linux.org.uk>
 <20120607002914.GB22223@redhat.com>
 <CA+55aFyp2GUbc218eb5BDyAqz+cOPTJMVcb4ptk+f1_RBB+-mg@mail.gmail.com>
 <20120607011915.GA17566@redhat.com>
 <20120607012900.GE30000@ZenIV.linux.org.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20120607012900.GE30000@ZenIV.linux.org.uk>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7689
Lines: 161

[maintainers of assorted code involved Cc'd]

On Thu, Jun 07, 2012 at 02:29:00AM +0100, Al Viro wrote:
> On Wed, Jun 06, 2012 at 09:19:15PM -0400, Dave Jones wrote:
> > On Wed, Jun 06, 2012 at 05:42:35PM -0700, Linus Torvalds wrote:
> > 
> >  > So Al meant you to test mutex_is_locked(dentry->d_inode->i_mutex) of
> >  > the parents.
> > 
> > ok, I ended up with..
> > 
> >         WARN_ON_ONCE(!mutex_is_locked(&target->d_parent->d_inode->i_mutex));
> > 
> >         if (dentry->d_parent != NULL)
> 			       != dentry)
> 
> ->d_parent *never* should be NULL; when dentry is disconnected from the
> tree (or hadn't been connected to it yet), it points to that dentry itself.

BTW, I've started to put together some documentation on ->d_parent use.
And already have found idiocies.

0) *all* instances of struct dentry should come from __d_alloc(); never
embed it into another object or declare variables of type struct dentry.
AFAICS, that one is satisfied now; any out-of-tree code violating that
deserves to be broken and *will* be broken.

1) For all code outside of fs/dcache.c it's read-only.  Never change it
directly, period.  Again, satisfied in mainline, out-of-tree code can
go play in the traffic, for all I care.

2) Places that set ->d_parent: __d_alloc(), d_alloc(), __d_move(),
__d_materialize_dentry().  The first two are doing that to new instance
of struct dentry with no references to it anywhere in shared data
structures.  The other two have had dentry passed to dentry_lock_for_move().

3) d_alloc(NULL, name) should never happen (will instantly oops in that case).

4) ->d_parent is *never* NULL; the only way to stumble across a dentry with
that not being true is to crawl through the slab page directly and to run
into an object that has been allocated by not initialized yet.  Don't do
that (nothing in mainline does, thanks $DEITY).  "No parent" == "dentry
has ->d_parent pointing to dentry itself".  There's a helper checking that,
albeit with slightly misleading name - IS_ROOT(dentry).  Doesn't have to
be fs root - anything still not attached to the tree or already detached
from it will be that way.  Any code checking that ->d_parent is NULL is
almost certainly a bug.  And we have a bunch of such places in fs/ceph:
	ceph_init_dentry()
	ceph_d_prune()
	ceph_mdsc_build_path()
a few in fs/cifs:
	build_path_from_dentry()
complete idiocy in fs/debugfs (not only check for NULL ->d_parent, but
one for ->d_parent being a negative dentry; also can't happen).
	debugfs_remove()
	debugfs_remove_recursive()
one in fs/ocfs2:
	ocfs2_match_dentry()
and one in security/inode.c:
	securityfs_remove() (nicked from debugfs, at a guess).
Some of those guys can be simply dropped, but I really wonder what
ceph_init_dentry/ceph_d_prune intended there.

Incidentall, build_path_from_dentry() ought to be lifted into fs/dcache.c;
we have at least three copies and one of them is not an instant roothole
only because hppfs doesn't follow underlying procfs changes.  And I'm going
to export is_subdir() - at least gfs2 has grown itself a b0rken copy...

5) all assignments to dentry->d_parent after initialization are under
dentry->d_lock; dentry_lock_for_move() takes care of that.

6) all reassignments to dentry->d_parent have ->d_lock held on old and new
parent.  Again, dentry_lock_for_move() takes care of that and so does
d_alloc() (that one - for new parent only; old parent is dentry itself
here and nothing could have seen it yet).

7) attached dentry moving away from old parent must have ->i_mutex on
that old parent held.

8) dentry getting attached to a new parent must have ->i_mutex held on
that new parent.

9) attached dentry getting moved to a new parent must have ->s_vfs_rename_mutex
held.

Now, (7--9) are interesting; the call graph looks so:
__d_move()
	<- d_move()
	<- __d_unalias()
		<- d_materialise_unique()
__d_materialise_dentry()
	<- d_materialise_unique()

rename-related callers of d_move() (vfs_rename_dir/vfs_rename_other/
nfs_rename/ocfs2_dentry_move from ocfs2_rename/ceph_rename/v9fs_vfs_rename)
share the same locking environment; we have both parents + ->s_vfs_rename_mutex
in case of cross-directory move held.  Pretty much by an accident we have
one difference between vfs_rename_dir and vfs_rename_other - holding ->i_mutex
on *victim* of overwriting rename() is not something d_move() cares about.
We can make that more uniform, but that's really unrelated to that.
Note that in all these cases both dentries are guaranteed to be attached
all along.

Other callers of d_move():
	vfat_lookup() does d_move() of an alias; the validity of that
strongly relies on the lack of anything resembling hardlinks on VFAT;
the parents will be the same (and parent of target is locked).
	sysfs_lookup() plays insane games with d_move(); I _really_
don't like the look of that code.  It tries to play catch-up after
sysfs_rename() done its thing.  AFAICS, the parent *can* be changed
by sysfs_rename().  No lock is held on said parent here; no
->s_vfs_rename_mutex either.  Moreover, none of the sysfs_rename() callers
(kobject_move(), etc.) appear to give a damn about checking if it would
create a loop.
	* debugfs_rename() - imitates what vfs_rename() is doing.  Same
locking environment.  BTW,
        trap = lock_rename(new_dir, old_dir);
        /* Source or destination directories don't exist? */
        if (!old_dir->d_inode || !new_dir->d_inode)
                goto exit;
is bogus - lock_rename() is taking ->i_mutex on these inodes, for fsck sake!
If this can be called with old_dir or new_dir negative, it's buggered.
	* ceph_fill_trace() in CEPH_MDS_OP_RENAME case.  Seems to be
done while ceph_rename() is being executed; i.e. the usual locking
environment holds.  No idea why it feels the need to grab references
to old_dentry and new_dentry in ceph_rename(), though - if that thing
can outlive ceph_rename(), we have a bad trouble.  And if it can't, there's
no reasons to pin them down that way.
	* d_splice_alias() - broken.  Called without any locking on the
old directory, nevermind ->s_vfs_rename_mutex.  I really believe that
this pair of commits needs to be reverted.  The earlier code used to
guarantee that alias would be detached.

d_materialise_unique(dentry, inode) must be called with dentry->d_parent
having its ->d_inode->i_mutex held.  AFAICS, that's satisfied for all
callers - they come either from ->lookup() or from ->readdir(), both
having the parent inode locked.  ceph is a possible exceptions; it's
damn hard to trace back to the places triggering ceph_fill_trace() in
general ;-/

__d_unalias(), in case we are asked for cross-directory move, grabs
->s_vfs_rename_mutex first (with trylock).  That makes ->d_parent of
alias stable; then it grabs ->i_mutex on that parent (again with trylock)
and we are all set.

__d_materialise_dentry(dentry, alias) is called only when alias is
unattached.  That looks like potential race, BTW - what's to prevent it
from becoming attached right under us?  We do have __d_materialise_dentry()
and __d_ualias serialized by alias->d_inode->i_lock, but... there's another
way for that unattached alias to get attached - d_splice_alias().  And
it's done without ->i_lock there.  OTOH, it's not a problem to extend
the protection of ->i_lock to that spot as well.

I'll continue this series in a couple of hours - there's more.  In the
meanwhile, I'd like to hear from ceph, sysfs and nfsd folks - see above
for the reasons.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/