Date: Fri, 24 Apr 2015 14:42:03 +0100
From: Al Viro <viro@ZenIV.linux.org.uk>
To: NeilBrown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing
 stack footprint
Message-ID: <20150424134203.GC889@ZenIV.linux.org.uk>
References: <20150421150408.GA29838@infradead.org>
 <553668C1.8030707@nod.at>
 <20150421154504.GT889@ZenIV.linux.org.uk>
 <20150421212007.GU889@ZenIV.linux.org.uk>
 <20150422180702.GA15209@ZenIV.linux.org.uk>
 <20150422201238.GW889@ZenIV.linux.org.uk>
 <20150422210553.GX889@ZenIV.linux.org.uk>
 <20150423174544.48ae3122@notabene.brown>
 <20150423180754.GA889@ZenIV.linux.org.uk>
 <20150424163534.6eb109eb@notabene.brown>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150424163534.6eb109eb@notabene.brown>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4846
Lines: 111

On Fri, Apr 24, 2015 at 04:35:34PM +1000, NeilBrown wrote:
> > * path_init() arranges for that to be true in the beginning of the walk.
> > 
> > * symlinks aside, walk_component() preserves that.
> > 	+ normal name components go through lookup_fast(), where we have
> > 	  __d_lookup_rcu() find a child of current nd->path with matching
> > 	  name and (atomically) picks ->d_seq of that child, which had the
> > 	  name matching our component.  Atomicity is provided by ->d_lock
> > 	  on child.  Then we proceed to pick ->d_inode (and verify that
> 
> I don't see d_lock being taken  in __d_lookup_rcu.

Do'h...  No, it isn't - sorry about the braino.

> I think this atomicity is provided by ->d_seq on the child.

Yes.  It's fetch ->d_seq, compare names, then in caller fetch ->d_inode and
compare ->d_seq.

> So.....
> Where I have:
> 
> +                       if (nd->flags & LOOKUP_RCU) {
> +                               if (!nd->root.mnt)
> +                                       set_root_rcu(nd);
> +                               nd->path = nd->root;
> 
> 
> in the case where the symlink starts '/', I need to set nd->seq
> 
> 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
> 
> 
> but in the case where symlink doesn't start '/', I don't change nd->path,
> so nd->seq should still be correct?

It would be correct, if walk_component() hadn't _already_ flipped it to
match the symlink.  By that point it's too late - we'd already lost the
old value.  This is the area where it hits the fan:
                if (__read_seqcount_retry(&parent->d_seq, nd->seq))
                        return -ECHILD;
OK, parent is still valid
                nd->seq = seq;
... and we lose its seq number, replacing it with that of a child.

                if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) {
                        status = d_revalidate(dentry, nd->flags);
                        if (unlikely(status <= 0)) {
                                if (status != -ECHILD)
                                        need_reval = 0;
                                goto unlazy;
                        }
                }
                path->mnt = mnt;
                path->dentry = dentry;
set path to (nd->path.mnt, dentry of child)
                if (likely(__follow_mount_rcu(nd, path, inode)))
see if we need to (and can) cross the mountpoint here.  That can change
(vfsmount,dentry) pair *and* nd->seq - making it match whatever path->dentry
ends us being.
                        return 0;
if everything's fine, we return to caller, still in RCU mode.
unlazy:
                if (unlazy_walk(nd, dentry))
... and here we attempt to fall back to non-lazy mode, expecting nd->seq to
match dentry.  Win or lose, that's the last point where nd->seq will be looked
at.
                        return -ECHILD;

See the problem?  We have discarded the nd->seq value for parent and we can't
re-pick it without opening a race.

AFAICS, the simplest solution is to have nd->next_seq and set _that_ in
lookup_fast() and __follow_mount_rcu(), using it in unlazy_walk().  And
in callers have it copied into nd->seq after we'd established that it's not
a symlink to be followed.

Handling of absolute symlinks also needs some care - we have no idea whether
nd->root still makes any sense.  It might have become negative by that point.
unlazy_walk() done first would've checked it's still our process' root, but
if we stay in lazy mode, we obviously can't rely on that check.

One possible solution is to do the same kind of check we do in unlazy_walk() - 
        if (!(nd->flags & LOOKUP_ROOT)) {
                spin_lock(&fs->lock);
                if (nd->root.mnt != fs->root.mnt || nd->root.dentry != fs->root.dentry)
			bugger off
		fetch ->d_inode and ->d_seq
                spin_unlock(&fs->lock);
	} else {
		fetch ->d_inode and ->d_seq
	}
Another - to store the ->d_seq of root in nd->seq_root, have set_root_rcu()
set that instead of (or, better, in addition to) returning it to caller,
have path_init() initialize it explicitly in LOOKUP_ROOT case and have this
"reset to root" in following an absolute symlink just do
	set_root_rcu(nd);
	nd->path = nd->root;
	nd->seq = nd->seq_root;
	nd->inode = nd->path.dentry->d_inode;
	check if nd->path.dentry is stale, bugger off if it is

That avoids this spin_lock() on each absolute symlink at the price of extra
32 bits in struct nameidata.  It looks like doing on-demand reallocation
of nd->stack is the right way to go anyway, so the pressure on nameidata size
is going to be weaker and that might be the right way to go...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/