Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932287AbbDXGfr (ORCPT ); Fri, 24 Apr 2015 02:35:47 -0400 Received: from cantor2.suse.de ([195.135.220.15]:34818 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932076AbbDXGfq (ORCPT ); Fri, 24 Apr 2015 02:35:46 -0400 Date: Fri, 24 Apr 2015 16:35:34 +1000 From: NeilBrown To: Al Viro Cc: Christoph Hellwig , Linus Torvalds , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint Message-ID: <20150424163534.6eb109eb@notabene.brown> In-Reply-To: <20150423180754.GA889@ZenIV.linux.org.uk> References: <20150420181222.GK889@ZenIV.linux.org.uk> <20150421144959.GR889@ZenIV.linux.org.uk> <20150421150408.GA29838@infradead.org> <553668C1.8030707@nod.at> <20150421154504.GT889@ZenIV.linux.org.uk> <20150421212007.GU889@ZenIV.linux.org.uk> <20150422180702.GA15209@ZenIV.linux.org.uk> <20150422201238.GW889@ZenIV.linux.org.uk> <20150422210553.GX889@ZenIV.linux.org.uk> <20150423174544.48ae3122@notabene.brown> <20150423180754.GA889@ZenIV.linux.org.uk> X-Mailer: Claws Mail 3.10.1-162-g4d0ed6 (GTK+ 2.24.25; x86_64-suse-linux-gnu) MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; boundary="Sig_/e_LDiU38DlgFNAZsf6dvkgZ"; protocol="application/pgp-signature" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11183 Lines: 261 --Sig_/e_LDiU38DlgFNAZsf6dvkgZ Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable On Thu, 23 Apr 2015 19:07:55 +0100 Al Viro wrote: > On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote: >=20 > > follow_link calls link_path_walk -> walk_component -> lookup_fast whic= h sets > > nd->seq. Is that not enough? I guess not when nd_jump_link is called.= Is > > that what I missed? >=20 > No. Potential nd_jump_link() callers are just refusing to go there in la= zy > mode, end of story. That's not the problem; see below. >=20 > > One thing that is clear to me is that I don't really understand all the > > handling of 'seq' numbers, making me unable to comment usefully. > > I'll try to go through the current code and you current patch with that= issue > > in mind and make sure I do understand it. Then I'll try to comment. >=20 > OK, here's the basic theory behind the lazy pathwalk: >=20 > * during the entire exercise we never drop rcu_read_lock, therefore any > objects that have all references to them removed before RCU-delayed > freeing (inodes, dentries, superblocks and vfsmounts are among such) > that we might find in process won't be freed until after we are done. >=20 > * invariant we are keeping: > at some point between the beginning of walk and now the pathname > traversed so far would lead to nd->path, with nd->seq and nd->inode equal= to > ->d_seq and ->d_inode of the dentry in question. Thanks for all of this! I think one part that was confusing me is that there is a place where nd->s= eq does related to nd->path.dentry. Usually the two are updated at the same time. However updated nd->seq, but *doesn't* update nd->path.dentry. Instead, the dentry is stored in a separate 'path' variable. This discrepancy is eventually resolved in a call to path_to_nameidata(). Which is a bit confusing until you know what is happening :-) >=20 > * path_init() arranges for that to be true in the beginning of the walk. >=20 > * symlinks aside, walk_component() preserves that. > + normal name components go through lookup_fast(), where we have > __d_lookup_rcu() find a child of current nd->path with matching > name and (atomically) picks ->d_seq of that child, which had the > name matching our component. Atomicity is provided by ->d_lock > on child. Then we proceed to pick ->d_inode (and verify that I don't see d_lock being taken in __d_lookup_rcu. I think this atomicity is provided by ->d_seq on the child. > ->d_seq has not changed, thus making sure that ->d_inode value > at the moment when the name matched had been the same and child > is still hashed. Then we verify that parent's ->d_seq has not > changed, guaranteeing that parent hadn't been moved or unhashed > from the moment we'd found it until after we'd found its child. > Assuming nothing's mounted on top of that thing, and there's no > problem with ->d_revalidate()), that's it - we have new nd->seq, > nd->path and nd->inode satisfying our invariant for longer > piece of pathname. > + "." needs nothing - we just stay where we are > + ".." is handled by follow_dotdot_rcu(), which in the normal case > (no mountpoint crossing) picks ->d_parent of where we are, > fetches its ->d_inode and ->d_seq and verifies that our directory > still hadn't changed _its_ ->d_seq. The last part guarantees that > it hadn't been moved since the time we'd found it, and thus its > ->d_parent had remained unchanged at least until that verification. > Therefore, it remained pinned all along, and it ->d_inode had > remained stable, including the moment when we fetched ->d_seq. > Which means that the value we had picked *and* its ->d_inode and > ->d_seq would satisfy the invariant for the longer piece of > pathname. > + mountpoint crossing towards leaves is handled in __follow_mount_rcu(); > it is simple (->mnt_root never changes and is always pinned, > stabilizing its ->d_inode), but we might need to worry about > automount points *and* need to make sure that we stop right > there if mount_lock has been bumped. See commit b37199e6 for > the details on the last part - basically, making sure that false > negatives from __lookup_mnt() won't end up with hard error when > we walk into whatever had been under the mountpoint we'd missed. > + mountpoint crossing towards root (in follow_dotdot_rcu()) is > similar, but there we obviously don't care about automounts. > Looking at it now, it might make sense to recheck mount_lock there > as well, though - potential danger is to hit the moment when > mnt_parent and mnt_mountpoint are out of sync, leaving us with > mismatched vfsmount,dentry pair. Generally, that will be caught > when we try to leave lazy mode (legitimize_mnt() will fail) or > to cross towards leaves, but the next crossing towards root > might be bogus as well, and we could end up with unwarranted hard > error. Should be very hard to hit, but it's easy enough to check > *and* backport, so it looks like a good idea for -stable. Linus, > do you have any objections against adding > if (read_seqretry(&mount_lock, nd->m_seq)) > goto failed; > right after > if (!follow_up_rcu(&nd->path)) > break; > in follow_dotdot_rcu()? It's a very narrow race with mount --move > and most of the time it ends up being completely harmless, but > it's possible to construct a case when we'll get a bogus hard error > instead of falling back to non-lazy walk... OTOH, anyone doing > that will get something inherently timing-dependent as the result, > lazy mode or not. I'm in favour of adding that check, but IMO it's > not a critical problem. >=20 > * if we find that we can't continue in lazy mode because some verification > fails, we just fail with -ECHILD. However, in cases when our current pos= ition > might be fine, but the next step can't be done in lazy mode, we attempt to > fall back to non-lazy without full restart that would be caused by -ECHIL= D. > That's what unlazy_walk() is. Of course, if we reach the end of walk we > need to leave the lazy mode as well (complete_walk()). Either can fail, > and such a failure means restart from scratch in non-lazy mode. We need > to grab references on vfsmount and dentry (or two dentries, when we have > parent and child to deal with). The interesting part is vfsmount - we ne= ed > to make sure we won't interfere with umount(2), having walked in that suc= ker > *after* umount(2) has checked that it's not busy. See legitimize_mnt() f= or > details; basically, we have umount(2) mark the "I've verified it's not bu= sy > and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely > on RCU delays on that path - if we find one of those, we undo the > increment of its refcount we'd just done, without having dropped > rcu_read_lock(). Full-blown mntput() is done only on mismatches that > had _not_ been marked that way. >=20 > BTW, something similar to the above probably needs to be turned into cohe= rent > text, either in parallel to Documentation/filesystems/path-lookup.txt, or > as update to it. That might be fun.... if I find time. >=20 > The reason why walk_component() in your call chain won't suffice is simpl= e - > it will fail this > /* > * This sequence count validates that the parent had no > * changes while we did the lookup of the dentry above. > * > * The memory barrier in read_seqcount_begin of child is > * enough, we can use __read_seqcount_retry here. > */ > if (__read_seqcount_retry(&parent->d_seq, nd->seq)) > return -ECHILD; > in lookup_fast(), just before updating nd->seq to new value. Basically, > it has no way to tell if parent has been buggered to hell and back. >=20 > It's not _that_ hard to prevent - we just need to stop discarding the par= ent's > seq number in "need to follow a symlink" case of walk_component(). Will = take > some massage, but not too much... So..... Where I have: + if (nd->flags & LOOKUP_RCU) { + if (!nd->root.mnt) + set_root_rcu(nd); + nd->path =3D nd->root; in the case where the symlink starts '/', I need to set nd->seq nd->seq =3D read_seqcount_begin(&nd->path.dentry->d_seq); but in the case where symlink doesn't start '/', I don't change nd->path, so nd->seq should still be correct? Also, just after that (still in follow_link()), we have nd->inode =3D nd->path.dentry->d_inode; *after* the test for (*s =3D=3D '/'). Wouldn't is be OK to have that *inside* the if? In the relative-symlink ca= se it should be unchanged, and in the jump-symlink case nd_jump_link() has alr= eady done that (and probably NULL was returns for 's' so this code doesn't execu= te). So following on top of my previous patches? Thanks heaps, NeilBrown diff --git a/fs/namei.c b/fs/namei.c index d13b4315447f..ce6387d5317c 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -947,6 +947,8 @@ follow_link(struct path *link, struct nameidata *nd, vo= id **p) if (!nd->root.mnt) set_root_rcu(nd); nd->path =3D nd->root; + nd->seq =3D read_seqcount_begin( + &nd->path->dentry->d_seq); } else { if (!nd->root.mnt) set_root(nd); @@ -954,9 +956,9 @@ follow_link(struct path *link, struct nameidata *nd, vo= id **p) nd->path =3D nd->root; path_get(&nd->root); } + nd->inode =3D nd->path.dentry->d_inode; nd->flags |=3D LOOKUP_JUMPED; } - nd->inode =3D nd->path.dentry->d_inode; error =3D link_path_walk(s, nd); if (unlikely(error)) put_link(nd, link, inode, *p); --Sig_/e_LDiU38DlgFNAZsf6dvkgZ Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIVAwUBVTnkNjnsnt1WYoG5AQLDshAAotMwCa1TBd0aEngsFTWRw5fdFU/B+Gc4 N5c3joKhkEl9jjPJVQd7Uq7wbr03nZ80RDbrhFMFw5rN9e0RTqtY8e2OQT/0uiTJ Llsf/97anPWR0EpRCLWvGHWJrpiwptkgAo3jrWk9zAg44tSB89xNnRrwOV4XhqEp aqZsUve0WTs1vq7mOyVLnBAkd1tUv0UTRR4P7lMgkhkS2FUfL85t5jB9BgPCBf3U GAYidYzM6qYaFsi3oTI3q0CI+1dsSKDTuH6e2q6J3quWT+YQsIkPYKn0EW8B8g3/ VyyX9ybJ3w/doIl/p6Z/OpDXskKhlkKbYXYNKtJzKj/L7JmuK69Y7YNupcs76sJD enevBW1Yc////dgJLj6yukBRP8W+hY1NrYSODrjXWdUb9/Amj8RU4bdX0JqU3BjV RlNwcW7SSiIEa+tUDzp1sttxdAY59BAb8nIT4I1HFwthbzmf7bYZIgNXScpm3HF0 7gV/6rThrLNPD18d4Xck65Zl1ISgsLEERbcINhL+wo/wNs++kemj0Pzqui6kFt0R gHIZt6mPgrhE2nfN6brH7GOJRV9jKkf4rsrqohJYIF+EPl6vSprtUiJRylQn6KEM k70fln6Mw9TSHYzEQz7wY571fMo8wutpLFLreim6XMwUiy9z1yX33wBIdTPMy0aT tHIcdzhCcOE= =w94J -----END PGP SIGNATURE----- --Sig_/e_LDiU38DlgFNAZsf6dvkgZ-- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/