Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751263AbcDPDCK (ORCPT ); Fri, 15 Apr 2016 23:02:10 -0400 Received: from mail-ig0-f172.google.com ([209.85.213.172]:38284 "EHLO mail-ig0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751069AbcDPDCI (ORCPT ); Fri, 15 Apr 2016 23:02:08 -0400 Subject: Re: [PATCHSET][RFC][CFT] parallel lookups Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Content-Type: multipart/signed; boundary="Apple-Mail=_6FED75FE-E144-4549-8E0E-274922210B20"; protocol="application/pgp-signature"; micalg=pgp-sha256 X-Pgp-Agent: GPGMail 2.6b2 From: Andreas Dilger In-Reply-To: <20160416005232.GV25498@ZenIV.linux.org.uk> Date: Fri, 15 Apr 2016 21:02:02 -0600 Cc: Linus Torvalds , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Message-Id: <958A657E-1983-4695-8869-4E03FEBA2F90@dilger.ca> References: <20160416005232.GV25498@ZenIV.linux.org.uk> To: Al Viro X-Mailer: Apple Mail (2.3124) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7927 Lines: 215 --Apple-Mail=_6FED75FE-E144-4549-8E0E-274922210B20 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii On Apr 15, 2016, at 6:52 PM, Al Viro wrote: >=20 > The thing appears to be working. It's in vfs.git#work.lookups; = the > last 5 commits are the infrastructure (fs/namei.c and fs/dcache.c; no = changes > in fs/*/*) + actual switch to rwsem. >=20 > The missing bits: down_write_killable() (there had been a series > posted introducing just that; for now I've replaced = mutex_lock_killable() > calls with plain inode_lock() - they are not critical for any testing = and > as soon as down_write_killable() gets there I'll replace those), = lockdep > bits might need corrections and right now it's only for lookups. >=20 > I'm going to add readdir to the mix; the primitive added in this > series (d_alloc_parallel()) will need to be used in dcache pre-seeding > paths, ncpfs use of dentry_update_name_case() will need to be changed = to > something less hacky and syscalls calling iterate_dir() will need to > switch to fdget_pos() (with FMODE_ATOMIC_POS set for directories as = well > as regulars). The last bit is needed for exclusion on struct file > level - there's a bunch of cases where we maintain data structures > hanging off file->private and those really need to be serialized. = Besides, > serializing ->f_pos updates is needed for sane semantics; right now we > tend to use ->i_mutex for that, but it would be easier to go for the = same > mechanism as for regular files. With any luck we'll have working = parallel > readdir in addition to parallel lookups in this cycle as well. >=20 > The patchset is on top of switching getxattr to passing dentry = and > inode separately; that part will get changes (in particular, the stuff > agruen has posted lately), but the lookups queue proper cares only = about > being able to move security_d_instantiate() to the point before dentry > is attached to inode. >=20 > 1/15: security_d_instantiate(): move to the point prior to attaching = dentry > to inode. Depends on getxattr changes, allows to do the "attach to = inode" > and "add to dentry hash" parts without dropping ->d_lock in between. >=20 > 2/15 -- 8/15: preparations - stuff similar to what went in during the = last > cycle; several places switched to lookup_one_len_unlocked(), a bunch = of > direct manipulations of ->i_mutex replaced with inode_lock, etc. = helpers. >=20 > kernfs: use lookup_one_len_unlocked(). > configfs_detach_prep(): make sure that wait_mutex won't go away > ocfs2: don't open-code inode_lock/inode_unlock > orangefs: don't open-code inode_lock/inode_unlock > reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack() > reconnect_one(): use lookup_one_len_unlocked() > ovl_lookup_real(): use lookup_one_len_unlocked() >=20 > 9/15: lookup_slow(): bugger off on IS_DEADDIR() from the very = beginning > open-code real_lookup() call in lookup_slow(), move IS_DEADDIR check = upwards. >=20 > 10/15: __d_add(): don't drop/regain ->d_lock > that's what 1/15 had been for; might make sense to reorder closer to = it. >=20 > 11/15 -- 14/15: actual machinery for parallel lookups. This stuff = could've > been a single commit, along with the actual switch to rwsem and shared = lock > in lookup_slow(), but it's easier to review if carved up like that. = =46rom the > testing POV it's one chunk - it is bisect-safe, but the added code = really > comes into play only after we go for shared lock, which happens in = 15/15. > That's the core of the series. >=20 > beginning of transition to parallel lookups - marking in-lookup = dentries > parallel lookups machinery, part 2 > parallel lookups machinery, part 3 > parallel lookups machinery, part 4 (and last) >=20 > 15/15: parallel lookups: actual switch to rwsem >=20 > Note that filesystems would be free to switch some of their own uses = of > inode_lock() to grabbing it shared - it's really up to them. This = series > works only with directories locking, but this field has become an = rwsem > for all inodes. XFS folks in particular might be interested in using = it... Looks very interesting, and long awaited. How do you see the parallel operations moving forward? Staying as lookup only, or moving on to = parallel modifications as well? We've been carrying an out-of-tree patch for ext4 for several years to = allow parallel create/unlink for directory entries*, as I discussed a few = times with you in the past. It is still a bit heavyweight for doing read-only = lookups, but after this patch series it might finally be interesting to merge = into ext4, with a hope that the VFS might allow parallel directory changes in the = future? We can already do this on a Lustre server, and it would be nice to be = able to so on the client, since the files may even be on different servers = (hashed by name at the client to decide which server to contact) and network = latency during parallel file creates (one thread per CPU core, which is getting into = the low hundreds these days) is a much bigger deal than for local = filesystems. The actual inode_*lock() handling would need to be delegated to the = filesystems, with the VFS just using i_rwsem if the filesystem has no ->inode_lock() = method, but calling the inode_lock() method of the filesystem if available, so = that they can do parallel create/unlink for files and directories (though not = rename since that way lies insanity). We can discuss more at LSF, but now that you posted your patch series, = I'm curious. :-) Cheers, Andreas [*] Patch not usable as-is since it has no VFS interface and is only = uptodate for 3.12, just linked here in case of interest. Essentially, it creates = a lock tree for the ext4 htree directory, and opportunistically only locks the = leaf blocks of the directory on the expectation that there will be ~50-60 = entries added to each leaf before it is split, before it needs to back out to = write lock the parent htree block. The larger the directory, the more leaf = and parent blocks that can be locked concurrently, and the better scaling = gets. = http://git.hpdd.intel.com/fs/lustre-release.git/blob/HEAD:/ldiskfs/kernel_= patches/patches/sles12/ext4-pdirop.patch > I'll post the individual patches in followups. Again, this is also = available > in vfs.git #work.lookups (head at e2d622a right now). The thing = survives > LTP and xfstests without regressions, but more testing would certainly = be > appreciated. So would review, of course. > -- > To unsubscribe from this list: send the line "unsubscribe = linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Cheers, Andreas --Apple-Mail=_6FED75FE-E144-4549-8E0E-274922210B20 Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP using GPGMail -----BEGIN PGP SIGNATURE----- Comment: GPGTools - http://gpgtools.org iQIVAwUBVxGrK3Kl2rkXzB/gAQj47hAApu+B3S2dx6LFwLsLjOwfHpivgEdRQsVb A/v5USdktw2Is910sYkbEfSMmyK//i80IVlCUI0wByVMVC0Y0MXHZhQKB/4G/21R 7lhq6LCaclsAA/j0UEqutOlc9Tp261h2+a5v+mrfd39LYvG7Yhs4uv/qz82nBmPH sh3UuQO10NH8eVt65OvF3QAVLncf1LAYsu55WIDoPZinYOKhtcb4RvSddrU/iLbc YvhlKs8wk7UhKP0NVgKXmtheeewetCYAwIeMrCBSkjNQVVqdw05miOwA5zcxdnW7 c1/bFLfDNYBuxXbZxrTEKvsY3lhyViy2ljlhzVZ35t0xSrdkpj0Qhczub6CAHV5i f/mgn9QKhBEyi11IWogWAKJKDbDVcCrbSrwDD8zCXLDSU/BAb49OIuwrflBhKV3w 5J1L36mYL0NuDD1xB7jYzDL26fU5pfe5w8VHqC+pb4fuOA0tdzyKPIcPrcs4s1/G AqGVdPuqoz3CnwEqels2h5VwW699CU+YveYvnwEauzIB8AwEf7gEmXNBoyIxx25k q/wXZuNBxodcp2qkl56QCPAmZHt261FxEqWUpLgdV6HFE8noTUlANOZ2DFuF5jmy e+x7YKgu51xIDkz3voQL5KuE8ufsThCJVkKPGj8YC7HRg2y58mvNPmN3g4Y6v6Zg nIdZkANUyf0= =H4rh -----END PGP SIGNATURE----- --Apple-Mail=_6FED75FE-E144-4549-8E0E-274922210B20--