From: Jeff Layton <jlayton@redhat.com>
To: viro@zeniv.linux.org.uk, matthew@wil.cx, bfields@fieldses.org
Cc: dhowells@redhat.com, sage@inktank.com, smfrench@gmail.com,
        swhiteho@redhat.com, Trond.Myklebust@netapp.com,
        akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
        linux-afs@lists.infradead.org, ceph-devel@vger.kernel.org,
        linux-cifs@vger.kernel.org, samba-technical@lists.samba.org,
        cluster-devel@redhat.com, linux-nfs@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, piastryyy@gmail.com
Subject: [PATCH v1 00/11] locks: scalability improvements for file locking
Date: Fri, 31 May 2013 23:07:23 -0400
Message-Id: <1370056054-25449-1-git-send-email-jlayton@redhat.com>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5220
Lines: 121

Executive summary (tl;dr version): This patchset represents an overhaul
of the file locking code with an aim toward improving its scalability
and making the code a bit easier to understand.

Longer version:

When the BKL was finally ripped out of the kernel in 2010, the strategy
taken for the file locking code was to simply turn it into a new
file_lock_locks spinlock. It was an expedient way to deal with the file
locking code at the time, but having a giant spinlock around all of this
code is clearly not great for scalability. Red Hat has bug reports that
go back into the 2.6.18 era that point to BKL scalability problems in
the file locking code and the file_lock_lock suffers from the same
issues.

This patchset is my first attempt to make this code less dependent on
global locking. The main change is to switch most of the file locking
code to be protected by the inode->i_lock instead of the file_lock_lock.

While that works for most things, there are a couple of global data
structures (lists in the current code) that need a global lock to
protect them. So we still need a global lock in order to deal with
those. The remaining patches are intended to make that global locking
less painful. The big gain is made by turning the blocked_list into a
hashtable, which greatly speeds up the deadlock detection code.

I rolled a couple of small programs in order to test this code. The
first one just forks off 128 children and has them lock and unlock the
same file 10k times. Running this under "time" against a file on tmpfs
gives typical values like this:

Unpatched (3.10-rc3-ish):
real	0m5.283s
user	0m0.380s
sys	0m20.469s

Patched (same base kernel):
real	0m5.099s
user	0m0.478s
sys	0m19.662s

...so there seems to be some modest performance gain in this test. I
think that's almost entirely due to the change to a hashtable and to
optimize removing and readding blocked locks to the global lists. Note
that with this code we have to take two spinlocks instead of just one,
and that has some performance impact too. So the real peformance gain
from that hashtable conversion is eaten up to some degree by this.

The next test just forks off a bunch of children that each create their
own file and then lock and unlock it 20k times. Obviously, the locks in
this case are uncontended. Running that under "time" typically gives
these rough numbers.

Unpatched (3.10-rc3-ish):
real	0m8.836s
user	0m1.018s
sys	0m34.094s

Patched (same base kernel):
real	0m4.965s
user	0m1.043s
sys	0m18.651s

In this test, we see the real benefit of moving to the i_lock for most
of this code. The run time is almost cut in half in this test. With
these changes locking different inodes needs very little serialization.

If people know of other file locking performance tests, then I'd be
happy to try them out too. It's possible that this might make some
workloads slower, and it would be helpful to know what they are (and
address them) if so.

This is not the first attempt at doing this. The conversion to the
i_lock was originally attempted by Bruce Fields a few years ago. His
approach was NAK'ed since it involved ripping out the deadlock
detection. People also really seem to like /proc/locks for debugging, so
keeping that in is probably worthwhile.

There's more work to be done in this area and this patchset is just a
start. There's a horrible thundering herd problem when a blocking lock
is released, for instance. There was also interest in solving the goofy
"unlock on any close" POSIX lock semantics at this year's LSF. I think
this patchset will help lay the groundwork for those changes as well.

Comments and suggestions welcome.

Jeff Layton (11):
  cifs: use posix_unblock_lock instead of locks_delete_block
  locks: make generic_add_lease and generic_delete_lease static
  locks: comment cleanups and clarifications
  locks: make "added" in __posix_lock_file a bool
  locks: encapsulate the fl_link list handling
  locks: convert to i_lock to protect i_flock list
  locks: only pull entries off of blocked_list when they are really
    unblocked
  locks: convert fl_link to a hlist_node
  locks: turn the blocked_list into a hashtable
  locks: add a new "lm_owner_key" lock operation
  locks: give the blocked_hash its own spinlock

 Documentation/filesystems/Locking |   27 +++-
 fs/afs/flock.c                    |    5 +-
 fs/ceph/locks.c                   |    2 +-
 fs/ceph/mds_client.c              |    8 +-
 fs/cifs/cifsfs.c                  |    2 +-
 fs/cifs/file.c                    |   15 +-
 fs/gfs2/file.c                    |    2 +-
 fs/lockd/svclock.c                |   12 ++
 fs/lockd/svcsubs.c                |   12 +-
 fs/locks.c                        |  254 +++++++++++++++++++++++++------------
 fs/nfs/delegation.c               |   11 +-
 fs/nfs/nfs4state.c                |    8 +-
 fs/nfsd/nfs4state.c               |    8 +-
 include/linux/fs.h                |   25 +---
 14 files changed, 249 insertions(+), 142 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/