Date: Mon, 18 May 2015 09:16:01 +1000
From: NeilBrown
To: Linus Torvalds
Cc: Al Viro, Andreas Dilger, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel, Christoph Hellwig
Subject: Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks
Message-ID: <20150518091601.5c95322c@notabene.brown>

On Sun, 17 May 2015 09:43:34 -0700 Linus Torvalds wrote:

> On Sun, May 17, 2015 at 3:55 AM, Al Viro wrote:
> >
> > And that is complete crap.  Multi-component lookups do make sense; once
> > we are at the edge of the area present in dcache, we _know_ there won't
> > be any existing mountpoints involved; parsing the components and feeding
> > them to fs at once, along with an array of dentries to fill makes perfect
> > sense.  Why bother with a bunch of roundtrips when we can have one?
>
> Yes, the edges are easier. And yes, it's fine to do components one by one.
>
> Maybe I misunderstood, but I thought that was exactly what Neil
> *didn't* want to do, though. It sounded like he wanted to do
> path-based lookup, not component-based one.

Just to be crystal clear about what I want:

   I want the filesystem to be in control.

Any examples, whether about multi-component lookup or path-based lookup or
O_EXCL opens, are just throw-away examples.  I have no desire to implement or
re-implement anything like that.  I just want the filesystem to have control.

The reason I want this is that it will (ultimately) make the code easier to
understand and so easier to verify.  And it will make implementing unusual
filesystems easier.

The dcache is just a cache.  It is a great cache, but it isn't the filesystem.
So filesystems should be able to put things in the cache.  And the VFS should
be able to look up things in the cache.  And if the VFS finds everything it
needs to follow a full path all the way to the inode at the end, that is
great.  But as soon as it hits something that the cache doesn't have an answer
for, it asks the filesystem.

As a useful simple case it can ask via d_revalidate in RCU mode, in which case
the filesystem either says (based on its own caching rules) "Yeah, this one's
OK really" and the VFS just keeps going, or the filesystem says "Nope, I need
more time with this one" and we drop out of RCU and into the more general case.

In that general case it just hands everything to the filesystem.
The filesystem then uses generic helpers (or not) to find the answers and adds
more current information to the cache.

It could potentially just return and let the VFS continue down the cache (now
with current data), but it probably makes more sense for the filesystem to
explicitly return what it has.
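
To make that flow concrete, here is a rough userspace sketch.  It is a toy
model only, not kernel code and not a proposed interface: every name in it
(toy_dentry, toy_revalidate, toy_fs_lookup_rest, toy_walk) is made up for
illustration, and the "valid" flag merely stands in for whatever caching
rules a real filesystem would apply.  The walker follows the cache while
entries are present and revalidate cheaply, and on the first miss or "I need
more time" answer it hands the whole remaining path to the filesystem, which
returns what it ended up with.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_ENTRIES 16
#define NAME_LEN 32

/* Toy stand-in for a dentry: a (parent, name) pair plus a validity flag. */
struct toy_dentry {
    int parent;             /* index of the parent entry, -1 for the root */
    char name[NAME_LEN];
    int valid;              /* 0 models "the filesystem needs more time" */
};

static struct toy_dentry cache[MAX_ENTRIES];
static int cache_used;
static int fs_calls;        /* how often the filesystem was asked */

static int toy_cache_add(int parent, const char *name, int valid)
{
    assert(cache_used < MAX_ENTRIES);
    struct toy_dentry *d = &cache[cache_used];
    d->parent = parent;
    snprintf(d->name, sizeof(d->name), "%s", name);
    d->valid = valid;
    return cache_used++;
}

static int toy_cache_find(int parent, const char *name, size_t len)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].parent == parent && strlen(cache[i].name) == len &&
            strncmp(cache[i].name, name, len) == 0)
            return i;
    return -1;
}

/* Cheap check: the filesystem answers from its own caching rules. */
static int toy_revalidate(int idx)
{
    return cache[idx].valid;
}

/* General case: the filesystem receives the whole remaining path in one
 * call, refreshes or creates a cache entry for each component, and
 * returns the entry it ended up at. */
static int toy_fs_lookup_rest(int parent, const char *rest)
{
    fs_calls++;
    while (*rest) {
        size_t len = strcspn(rest, "/");
        int idx = toy_cache_find(parent, rest, len);
        if (idx < 0) {
            char name[NAME_LEN];
            snprintf(name, sizeof(name), "%.*s", (int)len, rest);
            idx = toy_cache_add(parent, name, 1);
        } else {
            cache[idx].valid = 1;
        }
        parent = idx;
        rest += len;
        if (*rest == '/')
            rest++;
    }
    return parent;
}

/* VFS side: follow the cache as far as it goes, then hand over. */
static int toy_walk(int root, const char *path)
{
    int cur = root;
    while (*path) {
        size_t len = strcspn(path, "/");
        int idx = toy_cache_find(cur, path, len);
        if (idx < 0 || !toy_revalidate(idx))
            return toy_fs_lookup_rest(cur, path);
        cur = idx;
        path += len;
        if (*path == '/')
            path++;
    }
    return cur;
}
```

With "a" cached and valid, "b" cached but stale, and "c" absent, a walk of
"a/b/c" in this model asks the filesystem exactly once, with "b/c" as the
remainder; a second walk of the same path is then served from the cache alone.
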
So for Al's example of revalidating multiple components at once, once the VFS
gets to a point in the path where d_revalidate says "I need more time", the
VFS just passes the rest of the path to the filesystem.
The filesystem can then see what is in the cache and revalidate multiple
dentries in parallel.  Or it could just send the rest of the path to the server
requesting attributes for each directory in the path, and then can pop
all of that into the dcache/icache and let the lookup complete.
Or it can just do one component at a time.

>
> But yes, if it's purely about preloading the cache, then *that* should
> be reasonably easy. In fact, it should work as-is today, if we just
> added a "const char *hint" to the lookup callback which told the
> filesystem what will come after this lookup. But it would be a hint
> for pre-loading the dcache, nothing more.

"hint" being a synonym for "layering violation" ??

NeilBrown

>
> So if we have a pathname like "a/b/c" that we don't have in the
> dcache, and we're going to look up component "a", we could give "b/c"
> as the hint, and a filesystem that currently populates the dcache with
> "a" by doing
>
>     d_instantiate(dentry, inode);
>
> could decide that *before* it does that "d_instantiate()", it could
> pre-populate the child list of 'dentry' with the lookup information
> for 'b' (and possibly recursively for 'c' too under 'd').
>
> But you'd still have to do the components one by one, you couldn't
> just do the "final" tip.
>
> And no, I absolutely refuse to even entertain the thought of the
> filesystem actually doing any of the do_last crap. It would be purely
> about pre-populating the dcache deeper than the one single component,
> and then the VFS layer would just find the pre-populated dentries and
> do the normal thing.
>
> Doing things that way means that not only does do_last() at the vfs
> level already do the right thing, but we get all the per-component
> semantics (with security checks etc) right, because we'd still be
> traversing the pathname one component at a time. It's just the
> filesystem that could prime the cache.
>
> If *that* was what Neil wanted to do (rather than do "a/b/c" as one
> single lookup to the server), then I withdraw all my complaints and am
> sorry for having misunderstood.
>
>                 Linus