Date: Sat, 16 May 2015 11:25:03 +1000
From: NeilBrown
To: Linus Torvalds
Cc: Andreas Dilger, Dave Chinner, Al Viro,
 Linux Kernel Mailing List, linux-fsdevel, Christoph Hellwig
Subject: Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks
Message-ID: <20150516112503.2f970573@notabene.brown>
In-Reply-To:
References: <20150505052205.GS889@ZenIV.linux.org.uk>
 <20150511180650.GA4147@ZenIV.linux.org.uk>
 <20150513222533.GA24192@ZenIV.linux.org.uk>
 <20150514033040.GF7232@ZenIV.linux.org.uk>
 <20150514112304.GT15721@dastard>
 <20150516093022.51e1464e@notabene.brown>

On Fri, 15 May 2015 17:45:56 -0700 Linus Torvalds wrote:

> On Fri, May 15, 2015 at 4:30 PM, NeilBrown wrote:
> >
> > .. and I've been wondering what to do about i_mutex and NFS.  I've had
> > customer reports of slowness in creating files that seems to be due to
> > i_mutex on the directory being held over the whole 'create' RPC, so
> > only one of those can be in flight at a time.
> > "make -j" on a large source directory can easily want to create lots
> > of "*.o" files at "the same time".
> >
> > And NFS doesn't need i_mutex at all, because the server will provide
> > the needed guarantees.
>
> So i_mutex on a directory is probably the nastiest lock we have in the
> fs layer.
>
> It's used for several different half-related things:
>
>  - serialize filename creation/deletion
>
>    This is partly for the benefit of the filesystem itself (and not
>    helpful for NFS, as you note), but it's also very much about making
>    sure we have uniqueness guarantees at the VFS layer too.
>
>    So even with NFS, it's not just "the server provides the needed
>    guarantees", because some of the guarantees are really client-local.
>
>    For example, simply that we only ever have one single dentry for a
>    particular name, and that we only ever have one active lookup per
>    dentry.  Those things happen independently of - and before - the
>    server even sees the operation.

But surely those things can be managed with a spinlock.

I think a big part of the problem is that the VFS tries to control
filesystems rather than provide services to them.  If NFS were in
control, it might:

 - Ask the dcache to look up a name and get back a dentry: positive,
   negative, or not-validated.
 - If positive, NFS returns an error or uses the inode, depending on the
   operation.
 - Otherwise, send a 'create' request to the server.  At this point it
   holds references to the dentries for directory and target, but no
   locks.
 - If an error is returned, just drop the references and return.
 - On successful create, turn the filehandle into an inode, and then ask
   the dcache to link the inode with the target dentry.  If the dentry is
   still negative or not-validated, this is trivial.  If it is positive
   and already has the right inode, again trivial.  If it has the wrong
   inode, you certainly have an interesting problem, but one that is
   specific to NFS (or similar filesystems), and one that is up to NFS to
   solve, not up to the VFS to avoid.  If the syscall doesn't need to
   return an 'fd', then just drop the references and report success.  If
   an 'fd' is required, create a 'deleted' dentry attached to the
   original parent.  As 'mkdir' doesn't return an fd, that should be safe
   (not that all the fuss about directory dentries having exactly one
   parent is particularly relevant to NFS).
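To make the shape of that concrete, here is a rough C sketch of such a
"dcache as a service" create path.  Everything marked "invented" is
hypothetical: dcache_lookup(), dcache_link_inode(), nfs_rpc_create() and
the tristate result are names made up for illustration, not existing VFS
or NFS interfaces.

/* Hypothetical sketch of the lock-free create flow described above. */
enum dstate { D_POSITIVE, D_NEGATIVE, D_NOT_VALIDATED };

static int nfs_service_create(struct inode *dir, const struct qstr *name,
                              umode_t mode)
{
        struct dentry *dentry;
        struct inode *inode;
        enum dstate state;
        int err = 0;

        /* 1: ask the dcache for a dentry; no i_mutex is taken */
        dentry = dcache_lookup(dir, name, &state);      /* invented */
        if (IS_ERR(dentry))
                return PTR_ERR(dentry);

        if (state == D_POSITIVE) {
                /* 2: name already exists: error, or use the inode */
                err = -EEXIST;
                goto out_put;
        }

        /* 3: send CREATE to the server, holding only dentry references */
        inode = nfs_rpc_create(dir, name, mode);        /* invented */
        if (IS_ERR(inode)) {
                /* 4: on error, just drop the references and return */
                err = PTR_ERR(inode);
                goto out_put;
        }

        /* 5: ask the dcache to bind the new inode to the dentry */
        err = dcache_link_inode(dentry, inode);         /* invented */
out_put:
        dput(dentry);
        return err;
}

The point of the shape is that the only locking is whatever the dcache
needs internally; NFS itself never holds a directory lock across the RPC.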
I'm not convinced that serialising 'lookup' calls is vital.  If two
threads find a not-validated dentry and both try to look up the inode,
they will both ultimately get the same struct inode from the icache, and
both will succeed in connecting it to the dentry.  Obviously it would be
better to avoid two concurrent NFS "LOOKUP" requests, but that is a
problem for NFS to solve.  I suspect that using d_fsdata to point to a
pending LOOKUP request would allow the "second" thread to wait for that
request to finish.
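A minimal sketch of that d_fsdata idea, assuming the field can carry a
pointer to a pending-request record.  struct nfs_pending and
nfs_rpc_lookup() are invented names; freeing of the record and inode
refcounting are elided to keep the shape visible:

struct nfs_pending {
        struct completion done;
        struct inode *inode;    /* result, or ERR_PTR() on failure */
};

static struct inode *nfs_shared_lookup(struct dentry *dentry)
{
        struct nfs_pending *p = kzalloc(sizeof(*p), GFP_KERNEL);

        if (!p)
                return ERR_PTR(-ENOMEM);
        init_completion(&p->done);

        spin_lock(&dentry->d_lock);
        if (dentry->d_fsdata) {
                /* a LOOKUP for this dentry is already in flight: wait */
                struct nfs_pending *q = dentry->d_fsdata;

                spin_unlock(&dentry->d_lock);
                kfree(p);
                wait_for_completion(&q->done);
                return q->inode;        /* same struct inode as the winner */
        }
        dentry->d_fsdata = p;
        spin_unlock(&dentry->d_lock);

        p->inode = nfs_rpc_lookup(dentry);      /* the one real RPC */

        spin_lock(&dentry->d_lock);
        dentry->d_fsdata = NULL;
        spin_unlock(&dentry->d_lock);
        complete_all(&p->done);
        return p->inode;
}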
Other filesystems would take a completely different approach.  But with
the VFS trying to be in control, and trying to accommodate the needs of
wildly different filesystems, I imagine it might not be so easy.

>    So the whole local directory tree consistency ends up depending on
>    this.
>
>  - readdir().  This is mostly to make it hard for filesystems to do the
>    wrong thing when there is concurrent file creation.

and makes it impossible for them to do the right thing too?

> I suspect readdir could fairly easily push the i_mutex down from the
> caller and into the filesystem, and then filesystems might narrow down
> the use (or even get rid of it).  The initial patch might even be
> automated with coccinelle.  However, rather few loads actually have a
> lot of readdir() activity, and samba is probably the only major one.
> I've seen benchmarks where it matters, but they are rare (and I
> haven't seen one in literally years).
>
> So the readdir case could probably be at least relaxed fairly easily.
> But the thing that tends to hurt on more loads is, as you note, the
> filename lookup/creation/movement case.  And that's much harder to fix.
>
> Al, do you have any ideas?  Personally, I've wanted to make i_mutex a
> rwsem for a long time, but right now pretty much everything uses it
> for exclusion.  For example, filename lookup is clearly just reading
> the directory, so it should take the rwsem for reading, right?  No.
> Not the way it is done now.  Filename lookup wants the directory inode
> exclusively, because that guarantees that we create just one dentry
> and call the filesystem ->lookup only once on that dentry.
>
> Again, there tend to be no simple benchmarks or loads that people care
> about that show this.  Most of the time it's fairly hard to see.

Which "this"?  Readdir?  I haven't come across one (NFS READDIR issues
are all about when to use READDIRPLUS and when not to).

Or create-contention-on-i_mutex?  The load my customer had was
'make -j60' on a many-CPU machine.  I'm not sure whether all the .o files
were in the one directory, or whether some logging was being recorded to
multiple files in that one directory, but the serialization of
file-creation over NFS was measurable.  It wasn't horribly bad, but it
was noticeably suboptimal.  I think I recommended creating
whatever-it-was in multiple directories.

NeilBrown