Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932910AbbDWHp4 (ORCPT ); Thu, 23 Apr 2015 03:45:56 -0400 Received: from cantor2.suse.de ([195.135.220.15]:37028 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932189AbbDWHpy (ORCPT ); Thu, 23 Apr 2015 03:45:54 -0400 Date: Thu, 23 Apr 2015 17:45:44 +1000 From: NeilBrown To: Al Viro Cc: Christoph Hellwig , Linus Torvalds , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint Message-ID: <20150423174544.48ae3122@notabene.brown> In-Reply-To: <20150422210553.GX889@ZenIV.linux.org.uk> References: <20150420181222.GK889@ZenIV.linux.org.uk> <20150421144959.GR889@ZenIV.linux.org.uk> <20150421150408.GA29838@infradead.org> <553668C1.8030707@nod.at> <20150421154504.GT889@ZenIV.linux.org.uk> <20150421212007.GU889@ZenIV.linux.org.uk> <20150422180702.GA15209@ZenIV.linux.org.uk> <20150422201238.GW889@ZenIV.linux.org.uk> <20150422210553.GX889@ZenIV.linux.org.uk> X-Mailer: Claws Mail 3.10.1-162-g4d0ed6 (GTK+ 2.24.25; x86_64-suse-linux-gnu) MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; boundary="Sig_/Tz/GW/gD1ge__05+GC3RFf2"; protocol="application/pgp-signature" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6392 Lines: 154 --Sig_/Tz/GW/gD1ge__05+GC3RFf2 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Wed, 22 Apr 2015 22:05:53 +0100 Al Viro wrote: > On Wed, Apr 22, 2015 at 09:12:38PM +0100, Al Viro wrote: > > On Wed, Apr 22, 2015 at 07:07:59PM +0100, Al Viro wrote: > > > And one more: may_follow_link() is now potentially oopsable. Look: s= uppose > > > we've reached the link in RCU mode, just as it got unlinked. link->d= entry > > > has become negative and may_follow_link() steps into > > > /* Allowed if owner and follower match. */ > > > inode =3D link->dentry->d_inode; > > > if (uid_eq(current_cred()->fsuid, inode->i_uid)) > > > return 0; > > > Oops... Incidentally, I suspect that your __read_seqcount_retry() in > > > follow_link() might be lacking a barrier; why isn't full read_seqcoun= t_retry() > > > needed? Blind copy-paste error I suspect. > > >=20 > > > FWIW, I would rather fetch ->d_inode *and* checked ->seq proir to cal= ling > > > get_link(), and passed inode to it as an explicit argument. And pass= ed it > > > to may_follow_link() as well... > >=20 > > Hrm... You know, something really weird is going on here. Where are > > you setting nd->seq? I don't see anything in follow_link() doing that. follow_link calls link_path_walk -> walk_component -> lookup_fast which se= ts nd->seq. Is that not enough? I guess not when nd_jump_link is called. Is that what I missed? > > And nd->seq _definitely_ needs setting if you want to stay in RCU mode - > > at that point it matches the dentry of symlink, not that of nd->path > > (=3D=3D parent directory). Neil, could you tell me which kernel you'd = been > > testing (ideally - commit ID is a public git tree), what config and what > > tests had those been? The 'devel' branch of git://neil.brown.name/md, which is based on 4.0-rc7. Tests have been fairly causal so far, making sure I can follow symlinks on a small variety of filesystems, and running a particular test which repeatedly stats a large number of non-existent files with paths that go through a symlink. >=20 > FWIW, there's a wart that had been annoying me for quite a while, and it > might be related to dealing with that properly. Namely, walk_component() > calling conventions. We have walk_component(nd, &path, follow), which can > * return -E..., and leave us with pathwalk terminated; path contains > junk, and so does nd->path. > * return 0, update nd->path, nd->inode and nd->seq. The contents > of path is in undefined state - it might be unchanged, it might be equal = to > nd->path (and not pinned down, RCU mode or not). In any case, callers do > not touch it afterwards. That's the normal case. > * return 1, update nd->seq, leave nd->path and nd->inode unchanged and > set path pointing to our symlink. nd->seq matches path, not nd->path. >=20 > In all cases the original contents of path is ignored - it's purely 'out' > parameter, but compiler can't establish that on its own; it _might_ be > left untouched. In all cases when its contents survives we don't look at > it afterwards, but proving that requires a non-trivial analysis. >=20 > And in case when we return 1 (=3D=3D symlink to be followed), we bugger n= d->seq. > It's left as we need it for unlazy_walk() (and after unlazy_walk() we don= 't > care about it at all), so currently everything works, but if we want to > stay in RCU mode for symlink traversal, we _will_ need ->d_seq of parent > directory. >=20 > I wonder if the right way to solve that would be to drop the path argument > entirely and store the bugger in nameidata. As in > union { > struct qstr last; > struct path link; > }; > ... > union { > int last_type; > unsigned link_seq; > }; > in struct nameidata. We never need both at the same time; after > walk_component() (or its analogue in do_last()) we don't need the compone= nt > name anymore. That way walk_component() would not trash nd->seq when > finding a symlink... >=20 > It would also shrink the stack footprint a bit - local struct path next > in link_path_walk() would be gone, along with the same thing in path_look= upat() > and friends. Not a lot of win (4 pointers total), but it might be enough > to excuse putting ->d_seq of root in there, along with ->link.dentry->d_i= node, > to avoid rechecking its ->d_seq. As the matter of fact, we do have an > odd number of 32bit fields in there, so ->d_seq of root would fit nicely.= .. >=20 > Comments? One thing that is clear to me is that I don't really understand all the handling of 'seq' numbers, making me unable to comment usefully. I'll try to go through the current code and you current patch with that iss= ue in mind and make sure I do understand it. Then I'll try to comment. Meanwhile, I like your suggestion in separate email to legitimize the whole symlink stack when unlazy_walk is needed. One reason that I didn't even try that originally is that it was not practical to walk the stack to find everything that needed legitimizing. Now that you have that in an explicit stack it is at least work looking into. Thanks, NeilBrown --Sig_/Tz/GW/gD1ge__05+GC3RFf2 Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUBVTijKTnsnt1WYoG5AQLDfhAAr2E53Xawd2d535v72TU8Iqu231lTjgVr O18likmrhO2nlcbh8Nq41e0GnBxTnWfupJnqPxnKE+shKoE7yWx19jvXP8z5y8My 5g0YTK+dCLJD+MbAn8QIo4Rga6yPTcvwCCZve+aQdbAs+AayB5gSfOSSQbS71LAj L1kLsdn4T6OHSmSCzc70OCa1Jb3aayL1ZNnN1VKDUiREKuKaOAj/i4IFLK7FWnTi RiEz5G5LOX3QRz1ho1beCWZ8Eizl/Llmwb3pcal/Gwj2NlG8/CarnGCRVcIgAykN PZbo32rkLYu9HpTteGk3V8CeIzOQ5DpbS7mLv4Y/E6O7Dnsx5f6X5Xpi29cDShHd 9fVKwWD0x0oz1GsqnhZ5QdRbrVobVInUwkIsax5mZXlcvQ5A71x/sBgcGaO5Kb4G 1y3SvydCjs8oD1FiQ3AB3ru6+L5kB2t45tbQxd5RcDZmO485vGafxWcK7awkEk97 k2fXYjuY/eSt8XkuxCRR7jckHfTojE0Zn6MMoVA2pF0iDHli0NysEESmu269K7xN ZBipLiUshd87l8NEsX760IvRlqszi94eXiD4ddP2I3VL7qWKr5nEN9BLnprRHxjW lHhSXVvNWt4r8W4Km+vDXa0aY0laivYAhFTQWZ0tLCDSj1rEO/vlPcFYSmdAtnpO wNrJvzrXYLY= =uWCw -----END PGP SIGNATURE----- --Sig_/Tz/GW/gD1ge__05+GC3RFf2-- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/