Date: Tue, 19 May 2015 18:33:28 +1000
From: NeilBrown
To: Linus Torvalds
Cc: Al Viro, Andreas Dilger, Dave Chinner, Linux Kernel Mailing List, linux-fsdevel, Christoph Hellwig
Subject: Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks
Message-ID: <20150519183328.137c1b73@notabene.brown>

On Sun, 17 May 2015 19:56:26 -0700 Linus Torvalds wrote:

> On Sun, May 17, 2015 at 4:16 PM, NeilBrown wrote:
> >
> > Just to be crystal clear about what I want:
> >   I want the filesystem to be in control
>
> Yeah, no. Not going to happen.
>
> You seem to think that the dcache is "just" a cache. It's not. It's a
> cache, but that is absolutely not all that it is. It's very much a
> cache with strong semantics.
>
> And no, we're not handing over those semantics over to the filesystem.
> The dcache is not just a cache, it's the *primary* data structure that
> we use for pathname validation, local security checking, and for doing
> things like "getcwd()" and handling ".." etc.

A fact that makes it relatively easy to create a situation where 'getcwd()'
returns a string for which 'stat' says ENOENT or where "cd .." puts you
somewhere that "getcwd" gets quite upset:

$ cd ..
cd: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
$ ls -l /proc/self/cwd
lrwxrwxrwx 1 neilb users 0 May 19 17:28 /proc/self/cwd -> /mnt/tmp/cdir (deleted)

(and no: I hadn't deleted the cwd, just renamed some things on the server)

>
> So there's no way the filesystem is "in control". You as a filesystem
> are not really even doing the actual pathname lookup. The *only* thing
> you're doing is filling in the dcache. The actual real pathname lookup
> is done by the VFS layer using the dcache data.
>
> That's how it very fundamentally works. It's *so* much more than a
> cache - it really *is* the primary path lookup. The filesystem is the
> slave in this relationship.

This requires the VFS to have knowledge, sometimes intimate knowledge, of how
each filesystem works.  DCACHE_NFSFS_RENAMED ?  Oh wait, afs and btrfs know
about that too, so it can't be too intimate.

>
> > The filesystem then uses generic helpers (or not) to find the answers and adds
> > more current information to the cache.
>
> You can do that already. There *are* those generic helpers to add data
> to the cache. That's what "d_instantiate()" and friends _are_ for.
>
> But no, you do *not* control name lookup. You get notified when
> there's not enough data in the cache, and then you can fill it up any
> which way you want.
>
> You can populate the dcache with other entries than the one we asked
> for, and you can ask the dcache to revalidate and throw dentries out.
>
> But no, you do *not* get access to things like do_last() or to the
> decision to follow symlinks or namespace rules, or mountpoints or
> things like that.

Obviously the important rules that you mention would be handled by library
code.  But do_last() could be a lot simpler if the filesystem could manage the
'stale dentry' handling and call one version or the other of do_last
depending on whether it had an 'atomic_open' callback or not.

>
> > So for Al's example of revalidating multiple components at once, once the VFS
> > gets to a point in the path where d_revalidate says "I need more time",
> > the VFS just passes the rest of the path to the filesystem.
>
> That's bullshit, for a very simple and basic reason: "the rest of the
> path" is not necessarily at all for your filesystem!

For revalidate: probably not, though the filesystem can ask questions of the
dcache just as easily as the VFS.
For lookup, the rest of the path up to a ".." or symlink (which the
filesystem can easily recognise) does belong to the filesystem.

On this topic, Al suggested:

> With bulk revalidate covering
> all the chain when we stumble across .., mountpoint or something we believe
> to be a symlink, or when the chain reaches fs-specified limit.

That "fs-specific limit" is what really bothers me.  This is feeding more
information about the fs into the VFS, and it assumes that a "limit" is the
thing that is meaningful for the VFS to know.
Just let the FS take over and use the approved interfaces to collect the
dentries that it thinks might be useful to revalidate, and then revalidate
them.

>
> Really. There might be mount-points, there might be symlinks, there
> might be tons of stuff like that.
>
> You're not getting control, for the very simple reason that IT IS NOT
> YOUR DATA. And it really never ever will be.
>
> Now, this is why I said we can do a "hint" style thing. Part of that
> "hint" issue is very very much that it has no semantic meaning. You
> can't screw it up, because if it turns out that the path component
> we're looking up is a symlink and we actually end up in some other
> filesystem, if you end up looking up the hint part, it just would
> never actually get used.
>
> So it's kind of like a prefetch for names. It's semantically much
> weaker than saying "look up this name". The hint would be "this is
> likely the next part of the name that the VFS layer will look up".
>
> And the key part of that statement is
>  (a) "likely" (it might not happen, and even if it does happen, it
> might not be for your filesystem)
> and
>  (b) "the VFS layer will look up" because it won't be the low-level
> filesystem doing it.
>
> So it would be the low-level filesystem pre-populating the dcache - if
> the low-level filesystem decides the hint is worth using for that -
> and the VFS layer then uses the data in the dcache without further
> bothering the filesystem.
>
> Exactly because the dcache is *so* much more than "just a cache".
>
>                      Linus

Well, let's just start with that "in-lookup" or "unknown" dentry that has
been mentioned, so that the VFS doesn't have to hold i_mutex across lookup
and create, and so that the filesystem can at least control its own locking.
That would be a big step forward in my mind.
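
A toy userspace model of that "in-lookup" idea, for illustration only.  It is
not kernel code and does not reflect any existing VFS interface: Dentry,
ToyDir and the backend callable are hypothetical names invented here.  The
directory lock is held only long enough to insert a placeholder; the slow
lookup runs outside it, and anyone else asking for the same name waits on the
placeholder rather than on the whole directory.

```python
import threading


class Dentry:
    """Toy stand-in for a cache entry (hypothetical, not struct dentry)."""

    def __init__(self, name):
        self.name = name
        self.inode = None          # stays None for a negative entry
        self.in_lookup = True      # placeholder state until finish() runs
        self._done = threading.Event()

    def finish(self, inode):
        # Publish the result and wake everyone waiting on this entry.
        self.inode = inode
        self.in_lookup = False
        self._done.set()

    def wait(self):
        self._done.wait()


class ToyDir:
    """Toy directory: the lock protects only the children table."""

    def __init__(self, backend):
        self._lock = threading.Lock()
        self._children = {}
        # Assumption: backend(name) returns an inode number or None.
        self._backend = backend

    def lookup(self, name):
        with self._lock:
            dentry = self._children.get(name)
            owner = dentry is None
            if owner:
                # Insert an in-lookup placeholder, then drop the lock.
                dentry = Dentry(name)
                self._children[name] = dentry
        if owner:
            inode = None
            try:
                # Slow part ("ask the server") runs without the directory lock.
                inode = self._backend(name)
            finally:
                # Always complete the entry so waiters cannot hang.
                dentry.finish(inode)
        else:
            # Someone else is (or was) looking this name up: wait on the entry.
            dentry.wait()
        return dentry
```

Calling ToyDir(backend).lookup("a") from several threads at once should reach
the backend only once for "a", while lookups of other names in the same
directory are not held up behind it.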

Thanks,
NeilBrown