From: Jeff Layton
To: viro@zeniv.linux.org.uk, matthew@wil.cx, bfields@fieldses.org
Cc: dhowells@redhat.com, sage@inktank.com, smfrench@gmail.com,
    swhiteho@redhat.com, Trond.Myklebust@netapp.com,
    akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
    linux-afs@lists.infradead.org, ceph-devel@vger.kernel.org,
    linux-cifs@vger.kernel.org, samba-technical@lists.samba.org,
    cluster-devel@redhat.com, linux-nfs@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, piastryyy@gmail.com
Subject: [PATCH v1 00/11] locks: scalability improvements for file locking
Date: Fri, 31 May 2013 23:07:23 -0400
Message-Id: <1370056054-25449-1-git-send-email-jlayton@redhat.com>

Executive summary (tl;dr version): This patchset represents an overhaul
of the file locking code with an aim toward improving its scalability
and making the code a bit easier to understand.

Longer version:

When the BKL was finally ripped out of the kernel in 2010, the strategy
taken for the file locking code was to simply turn it into a new
file_lock_lock spinlock. It was an expedient way to deal with the file
locking code at the time, but having a giant spinlock around all of this
code is clearly not great for scalability. Red Hat has bug reports that
go back into the 2.6.18 era that point to BKL scalability problems in
the file locking code, and the file_lock_lock suffers from the same
issues.

This patchset is my first attempt to make this code less dependent on
global locking.
The main change is to switch most of the file locking code to be
protected by the inode->i_lock instead of the file_lock_lock.

While that works for most things, there are a couple of global data
structures (lists in the current code) that need a global lock to
protect them. So we still need a global lock in order to deal with
those. The remaining patches are intended to make that global locking
less painful. The big gain is made by turning the blocked_list into a
hashtable, which greatly speeds up the deadlock detection code.

I rolled a couple of small programs in order to test this code. The
first one just forks off 128 children and has them lock and unlock the
same file 10k times. Running this under "time" against a file on tmpfs
gives typical values like this:

    Unpatched (3.10-rc3-ish):
    real    0m5.283s
    user    0m0.380s
    sys     0m20.469s

    Patched (same base kernel):
    real    0m5.099s
    user    0m0.478s
    sys     0m19.662s

...so there seems to be some modest performance gain in this test. I
think that's almost entirely due to the change to a hashtable and to
optimizing the removal and re-adding of blocked locks on the global
lists. Note that with this code we have to take two spinlocks instead of
just one, and that has some performance impact too. So the real
performance gain from the hashtable conversion is eaten up to some
degree by this.

The next test just forks off a bunch of children that each create their
own file and then lock and unlock it 20k times. Obviously, the locks in
this case are uncontended. Running that under "time" typically gives
these rough numbers:

    Unpatched (3.10-rc3-ish):
    real    0m8.836s
    user    0m1.018s
    sys     0m34.094s

    Patched (same base kernel):
    real    0m4.965s
    user    0m1.043s
    sys     0m18.651s

In this test, we see the real benefit of moving to the i_lock for most
of this code. The run time is almost cut in half. With these changes,
locking different inodes needs very little serialization.
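For anyone who wants to reproduce numbers of this kind, here is a
minimal sketch of what such a test program might look like. It is not
the program used for the figures above. The names run_lock_test() and
lock_unlock_loop(), the "<base>.<n>" file naming, and the use of
whole-file fcntl() write locks are all assumptions made for
illustration; the real tests may well have used different lock types or
ranges.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: take and drop a whole-file POSIX write lock
 * `iterations` times. Returns 0 if every lock/unlock succeeded.
 */
static int lock_unlock_loop(const char *path, int iterations)
{
	struct flock fl = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 0 };
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	int i;

	if (fd < 0)
		return -1;
	for (i = 0; i < iterations; i++) {
		fl.l_type = F_WRLCK;
		if (fcntl(fd, F_SETLKW, &fl) < 0)	/* blocks while contended */
			break;
		fl.l_type = F_UNLCK;
		if (fcntl(fd, F_SETLK, &fl) < 0)
			break;
	}
	close(fd);
	return i == iterations ? 0 : -1;
}

/*
 * Hypothetical driver: fork `nchildren` workers. With shared != 0 every
 * child locks `base` itself (contended case); otherwise each child locks
 * its own "<base>.<n>" file (uncontended case) and removes it afterwards.
 * Returns the number of children that exited cleanly.
 */
int run_lock_test(const char *base, int nchildren, int iterations, int shared)
{
	int n, ok = 0, status;

	for (n = 0; n < nchildren; n++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			char path[4096];
			int ret;

			if (shared)
				snprintf(path, sizeof(path), "%s", base);
			else
				snprintf(path, sizeof(path), "%s.%d", base, n);
			ret = lock_unlock_loop(path, iterations);
			if (!shared)
				unlink(path);
			_exit(ret ? 1 : 0);
		}
	}
	/* Reap every child and count the ones that succeeded. */
	while (wait(&status) > 0)
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			ok++;
	return ok;
}
```

A main() that calls run_lock_test(path, 128, 10000, 1) and is run under
"time" would roughly correspond to the first test; passing 20000
iterations and shared == 0 would roughly correspond to the second. The
path should point at a file on tmpfs to keep disk I/O out of the
measurement.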
If people know of other file locking performance tests, then I'd be
happy to try them out too. It's possible that this might make some
workloads slower, and it would be helpful to know what they are (and
address them) if so.

This is not the first attempt at doing this. The conversion to the
i_lock was originally attempted by Bruce Fields a few years ago. His
approach was NAK'ed since it involved ripping out the deadlock
detection. People also really seem to like /proc/locks for debugging, so
keeping that in is probably worthwhile.

There's more work to be done in this area and this patchset is just a
start. There's a horrible thundering herd problem when a blocking lock
is released, for instance. There was also interest in solving the goofy
"unlock on any close" POSIX lock semantics at this year's LSF. I think
this patchset will help lay the groundwork for those changes as well.

Comments and suggestions welcome.

Jeff Layton (11):
  cifs: use posix_unblock_lock instead of locks_delete_block
  locks: make generic_add_lease and generic_delete_lease static
  locks: comment cleanups and clarifications
  locks: make "added" in __posix_lock_file a bool
  locks: encapsulate the fl_link list handling
  locks: convert to i_lock to protect i_flock list
  locks: only pull entries off of blocked_list when they are really unblocked
  locks: convert fl_link to a hlist_node
  locks: turn the blocked_list into a hashtable
  locks: add a new "lm_owner_key" lock operation
  locks: give the blocked_hash its own spinlock

 Documentation/filesystems/Locking |  27 +++-
 fs/afs/flock.c                    |   5 +-
 fs/ceph/locks.c                   |   2 +-
 fs/ceph/mds_client.c              |   8 +-
 fs/cifs/cifsfs.c                  |   2 +-
 fs/cifs/file.c                    |  15 +-
 fs/gfs2/file.c                    |   2 +-
 fs/lockd/svclock.c                |  12 ++
 fs/lockd/svcsubs.c                |  12 +-
 fs/locks.c                        | 254 +++++++++++++++++++++++++------------
 fs/nfs/delegation.c               |  11 +-
 fs/nfs/nfs4state.c                |   8 +-
 fs/nfsd/nfs4state.c               |   8 +-
 include/linux/fs.h                |  25 +--
 14 files changed, 249 insertions(+), 142 deletions(-)