2013-06-01 03:10:34

by Jeff Layton

Subject: [PATCH v1 00/11] locks: scalability improvements for file locking

Executive summary (tl;dr version): This patchset represents an overhaul
of the file locking code with an aim toward improving its scalability
and making the code a bit easier to understand.

Longer version:

When the BKL was finally ripped out of the kernel in 2010, the strategy
taken for the file locking code was to simply turn it into a new
file_lock_lock spinlock. That was an expedient way to deal with the file
locking code at the time, but having a single giant spinlock around all
of this code is clearly not great for scalability. Red Hat has bug
reports going back to the 2.6.18 era that point to BKL scalability
problems in the file locking code, and the file_lock_lock suffers from
the same issues.

This patchset is my first attempt to make this code less dependent on
global locking. The main change is to switch most of the file locking
code to be protected by the inode->i_lock instead of the file_lock_lock.

While that works for most things, there are a couple of global data
structures (lists in the current code) that need a global lock to
protect them. So we still need a global lock in order to deal with
those. The remaining patches are intended to make that global locking
less painful. The big gain is made by turning the blocked_list into a
hashtable, which greatly speeds up the deadlock detection code.

I rolled a couple of small programs in order to test this code. The
first one just forks off 128 children and has them lock and unlock the
same file 10k times. Running this under "time" against a file on tmpfs
gives typical values like this:

Unpatched (3.10-rc3-ish):
real 0m5.283s
user 0m0.380s
sys 0m20.469s

Patched (same base kernel):
real 0m5.099s
user 0m0.478s
sys 0m19.662s

...so there seems to be a modest performance gain in this test. I think
that's almost entirely due to the hashtable conversion and to the
optimization that avoids repeatedly removing and re-adding blocked locks
to the global lists. Note that with this code we have to take two
spinlocks instead of just one, and that has some performance impact of
its own, so it eats up some of the real performance gain from the
hashtable conversion.
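
For reference, here is a minimal sketch of the sort of contended-lock
test described above. It is not the actual program used to generate the
numbers; the lock type (whole-file POSIX write locks via fcntl), the
child count, the file location, and the minimal error handling are all
assumptions:

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC	128
#define NITER	10000

int main(void)
{
	int i, j, fd;
	struct flock fl = {
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* lock the whole file */
	};

	/* all children inherit this fd, but each is a separate lock owner */
	fd = open("/tmp/lockfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			for (j = 0; j < NITER; j++) {
				fl.l_type = F_WRLCK;
				if (fcntl(fd, F_SETLKW, &fl) < 0)
					_exit(1);
				fl.l_type = F_UNLCK;
				if (fcntl(fd, F_SETLK, &fl) < 0)
					_exit(1);
			}
			_exit(0);
		}
	}

	for (i = 0; i < NPROC; i++)
		wait(NULL);
	return 0;
}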

The next test just forks off a bunch of children, each of which creates
its own file and then locks and unlocks it 20k times. Obviously, the
locks in this case are uncontended. Running that under "time" typically
gives these rough numbers:

Unpatched (3.10-rc3-ish):
real 0m8.836s
user 0m1.018s
sys 0m34.094s

Patched (same base kernel):
real 0m4.965s
user 0m1.043s
sys 0m18.651s

This test shows the real benefit of moving most of this code to the
i_lock: the run time is almost cut in half. With these changes, locking
different inodes requires very little serialization.
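
A correspondingly minimal sketch of this uncontended variant might look
like the following (again hypothetical; the child count, file naming,
and lock type are assumptions):

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC	128
#define NITER	20000

int main(void)
{
	int i, j;

	for (i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			char name[64];
			struct flock fl = { .l_whence = SEEK_SET };
			int fd;

			/* each child locks a file that nobody else touches */
			snprintf(name, sizeof(name), "/tmp/lockfile.%d", getpid());
			fd = open(name, O_RDWR | O_CREAT, 0644);
			if (fd < 0)
				_exit(1);
			for (j = 0; j < NITER; j++) {
				fl.l_type = F_WRLCK;
				if (fcntl(fd, F_SETLKW, &fl) < 0)
					_exit(1);
				fl.l_type = F_UNLCK;
				if (fcntl(fd, F_SETLK, &fl) < 0)
					_exit(1);
			}
			_exit(0);
		}
	}

	for (i = 0; i < NPROC; i++)
		wait(NULL);
	return 0;
}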

If people know of other file locking performance tests, I'd be happy to
try them out too. It's possible that these changes make some workloads
slower; if so, it would be helpful to know what those workloads are so
they can be addressed.

This is not the first attempt at doing this. The conversion to the
i_lock was originally attempted by Bruce Fields a few years ago. His
approach was NAK'ed since it involved ripping out the deadlock
detection. People also really seem to like /proc/locks for debugging, so
keeping that in is probably worthwhile.

There's more work to be done in this area; this patchset is just a
start. There's a horrible thundering herd problem when a blocking lock
is released, for instance. There was also interest at this year's LSF in
solving the goofy "unlock on any close" POSIX lock semantics. I think
this patchset will help lay the groundwork for those changes as well.

Comments and suggestions welcome.

Jeff Layton (11):
cifs: use posix_unblock_lock instead of locks_delete_block
locks: make generic_add_lease and generic_delete_lease static
locks: comment cleanups and clarifications
locks: make "added" in __posix_lock_file a bool
locks: encapsulate the fl_link list handling
locks: convert to i_lock to protect i_flock list
locks: only pull entries off of blocked_list when they are really unblocked
locks: convert fl_link to a hlist_node
locks: turn the blocked_list into a hashtable
locks: add a new "lm_owner_key" lock operation
locks: give the blocked_hash its own spinlock

Documentation/filesystems/Locking | 27 +++-
fs/afs/flock.c | 5 +-
fs/ceph/locks.c | 2 +-
fs/ceph/mds_client.c | 8 +-
fs/cifs/cifsfs.c | 2 +-
fs/cifs/file.c | 15 +-
fs/gfs2/file.c | 2 +-
fs/lockd/svclock.c | 12 ++
fs/lockd/svcsubs.c | 12 +-
fs/locks.c | 254 +++++++++++++++++++++++++------------
fs/nfs/delegation.c | 11 +-
fs/nfs/nfs4state.c | 8 +-
fs/nfsd/nfs4state.c | 8 +-
include/linux/fs.h | 25 +---
14 files changed, 249 insertions(+), 142 deletions(-)


2013-06-01 03:08:21

by Jeff Layton

Subject: [PATCH v1 03/11] locks: comment cleanups and clarifications

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 24 +++++++++++++++++++-----
include/linux/fs.h | 6 ++++++
2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index e3140b8..a7d2253 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -150,6 +150,16 @@ static int target_leasetype(struct file_lock *fl)
int leases_enable = 1;
int lease_break_time = 45;

+/*
+ * The i_flock list is ordered by:
+ *
+ * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
+ * 2) lock owner
+ * 3) lock range start
+ * 4) lock range end
+ *
+ * Obviously, the last two criteria only matter for POSIX locks.
+ */
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)

@@ -806,6 +816,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
}

lock_flocks();
+ /*
+ * New lock request. Walk all POSIX locks and look for conflicts. If
+ * there are any, either return -EAGAIN or put the request on the
+ * blocker's list of waiters.
+ */
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
fl = *before;
@@ -844,7 +859,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
before = &fl->fl_next;
}

- /* Process locks with this owner. */
+ /* Process locks with this owner. */
while ((fl = *before) && posix_same_owner(request, fl)) {
/* Detect adjacent or overlapping regions (if same lock type)
*/
@@ -930,10 +945,9 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
}

/*
- * The above code only modifies existing locks in case of
- * merging or replacing. If new lock(s) need to be inserted
- * all modifications are done bellow this, so it's safe yet to
- * bail out.
+ * The above code only modifies existing locks in case of merging or
+ * replacing. If new lock(s) need to be inserted all modifications are
+ * done below this, so it's safe yet to bail out.
*/
error = -ENOLCK; /* "no luck" */
if (right && left == right && !new_fl2)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9d7816..ae377e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -926,6 +926,12 @@ int locks_in_grace(struct net *);
/* that will die - we need it for nfs_lock_info */
#include <linux/nfs_fs_i.h>

+/*
+ * struct file_lock represents a generic "file lock". It's used to represent
+ * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
+ * note that the same struct is used to represent both a request for a lock and
+ * the lock itself, but the same object is never used for both.
+ */
struct file_lock {
struct file_lock *fl_next; /* singly linked list for this inode */
struct list_head fl_link; /* doubly linked list of all locks */
--
1.7.1

2013-06-01 03:09:03

by Jeff Layton

Subject: [PATCH v1 01/11] cifs: use posix_unblock_lock instead of locks_delete_block

commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
the locks_delete_block symbol. There's already an exported helper
function that provides this capability however, so make cifs use that
instead and turn locks_delete_block back into a static function.

Note that if fl->fl_next == NULL then this lock has already been through
locks_delete_block(), so we should be OK to ignore an ENOENT error here
and simply not retry the lock.

Cc: Pavel Shilovsky <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
---
fs/cifs/file.c | 2 +-
fs/locks.c | 3 +--
include/linux/fs.h | 5 -----
3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 48b29d2..44a4f18 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -999,7 +999,7 @@ try_again:
rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
if (!rc)
goto try_again;
- locks_delete_block(flock);
+ posix_unblock_lock(file, flock);
}
return rc;
}
diff --git a/fs/locks.c b/fs/locks.c
index cb424a4..7a02064 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock *waiter)

/*
*/
-void locks_delete_block(struct file_lock *waiter)
+static void locks_delete_block(struct file_lock *waiter)
{
lock_flocks();
__locks_delete_block(waiter);
unlock_flocks();
}
-EXPORT_SYMBOL(locks_delete_block);

/* Insert waiter into blocker's block list.
* We use a circular list so that processes can be easily woken up in
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..b9d7816 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
extern int lease_modify(struct file_lock **, int);
extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void locks_delete_block(struct file_lock *waiter);
extern void lock_flocks(void);
extern void unlock_flocks(void);
#else /* !CONFIG_FILE_LOCKING */
@@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
return 1;
}

-static inline void locks_delete_block(struct file_lock *waiter)
-{
-}
-
static inline void lock_flocks(void)
{
}
--
1.7.1

2013-06-01 03:09:33

by Jeff Layton

Subject: [PATCH v1 04/11] locks: make "added" in __posix_lock_file a bool

...save 3 bytes of stack space.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index a7d2253..cef0e04 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -800,7 +800,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
struct file_lock *left = NULL;
struct file_lock *right = NULL;
struct file_lock **before;
- int error, added = 0;
+ int error;
+ bool added = false;

/*
* We may need two file_lock structures for this operation,
@@ -894,7 +895,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
continue;
}
request = fl;
- added = 1;
+ added = true;
}
else {
/* Processing for different lock types is a bit
@@ -905,7 +906,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
if (fl->fl_start > request->fl_end)
break;
if (request->fl_type == F_UNLCK)
- added = 1;
+ added = true;
if (fl->fl_start < request->fl_start)
left = fl;
/* If the next lock in the list has a higher end
@@ -935,7 +936,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_release_private(fl);
locks_copy_private(fl, request);
request = fl;
- added = 1;
+ added = true;
}
}
/* Go on to next lock.
--
1.7.1

2013-06-01 03:09:21

by Jeff Layton

Subject: [PATCH v1 05/11] locks: encapsulate the fl_link list handling

Move the fl_link list handling routines into a separate set of helpers.
Also move the global list handling out of locks_insert_block and into
the caller that ends up triggering it, as that allows us to eliminate
the IS_POSIX check there.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 34 +++++++++++++++++++++++++++++-----
1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index cef0e04..caca466 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -494,13 +494,38 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
return fl1->fl_owner == fl2->fl_owner;
}

+/* Remove a blocker or lock from one of the global lists */
+static inline void
+locks_insert_global_blocked(struct file_lock *waiter)
+{
+ list_add(&waiter->fl_link, &blocked_list);
+}
+
+static inline void
+locks_delete_global_blocked(struct file_lock *waiter)
+{
+ list_del_init(&waiter->fl_link);
+}
+
+static inline void
+locks_insert_global_locks(struct file_lock *waiter)
+{
+ list_add_tail(&waiter->fl_link, &file_lock_list);
+}
+
+static inline void
+locks_delete_global_locks(struct file_lock *waiter)
+{
+ list_del_init(&waiter->fl_link);
+}
+
/* Remove waiter from blocker's block list.
* When blocker ends up pointing to itself then the list is empty.
*/
static void __locks_delete_block(struct file_lock *waiter)
{
list_del_init(&waiter->fl_block);
- list_del_init(&waiter->fl_link);
+ locks_delete_global_blocked(waiter);
waiter->fl_next = NULL;
}

@@ -524,8 +549,6 @@ static void locks_insert_block(struct file_lock *blocker,
BUG_ON(!list_empty(&waiter->fl_block));
list_add_tail(&waiter->fl_block, &blocker->fl_block);
waiter->fl_next = blocker;
- if (IS_POSIX(blocker))
- list_add(&waiter->fl_link, &blocked_list);
}

/* Wake up processes blocked waiting for blocker.
@@ -552,7 +575,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
*/
static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
{
- list_add(&fl->fl_link, &file_lock_list);
+ locks_insert_global_locks(fl);

fl->fl_nspid = get_pid(task_tgid(current));

@@ -573,7 +596,7 @@ static void locks_delete_lock(struct file_lock **thisfl_p)

*thisfl_p = fl->fl_next;
fl->fl_next = NULL;
- list_del_init(&fl->fl_link);
+ locks_delete_global_locks(fl);

if (fl->fl_nspid) {
put_pid(fl->fl_nspid);
@@ -839,6 +862,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
goto out;
error = FILE_LOCK_DEFERRED;
locks_insert_block(fl, request);
+ locks_insert_global_blocked(request);
goto out;
}
}
--
1.7.1

2013-06-01 03:09:45

by Jeff Layton

Subject: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

There's no reason we have to protect the blocked_hash and file_lock_list
with the same spinlock. With the tests I have, breaking it in two gives
a barely measurable performance benefit, but it seems reasonable to make
this locking as granular as possible.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/Locking | 16 ++++++++--------
fs/locks.c | 17 ++++++++++-------
2 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index ee351ac..8d8d040 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -359,20 +359,20 @@ prototypes:

locking rules:

- inode->i_lock file_lock_lock may block
-lm_compare_owner: yes maybe no
-lm_owner_key yes yes no
-lm_notify: yes no no
-lm_grant: no no no
-lm_break: yes no no
-lm_change yes no no
+ inode->i_lock blocked_hash_lock may block
+lm_compare_owner: yes maybe no
+lm_owner_key yes yes no
+lm_notify: yes no no
+lm_grant: no no no
+lm_break: yes no no
+lm_change yes no no

->lm_compare_owner and ->lm_owner_key are generally called with
*an* inode->i_lock held. It may not be the i_lock of the inode
associated with either file_lock argument! This is the case with deadlock
detection, since the code has to chase down the owners of locks that may
be entirely unrelated to the one on which the lock is being acquired.
-For deadlock detection however, the file_lock_lock is also held. The
+For deadlock detection however, the blocked_hash_lock is also held. The
fact that these locks are held ensures that the file_locks do not
disappear out from under you while doing the comparison or generating an
owner key.
diff --git a/fs/locks.c b/fs/locks.c
index 8219187..520f32b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -172,12 +172,13 @@ int lease_break_time = 45;
*/
#define BLOCKED_HASH_BITS 7

+static DEFINE_SPINLOCK(blocked_hash_lock);
static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);

+static DEFINE_SPINLOCK(file_lock_lock);
static HLIST_HEAD(file_lock_list);

/* Protects the file_lock_list and the blocked_hash */
-static DEFINE_SPINLOCK(file_lock_lock);

static struct kmem_cache *filelock_cache __read_mostly;

@@ -503,17 +504,17 @@ posix_owner_key(struct file_lock *fl)
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
- spin_lock(&file_lock_lock);
+ spin_lock(&blocked_hash_lock);
hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter));
- spin_unlock(&file_lock_lock);
+ spin_unlock(&blocked_hash_lock);
}

static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
- spin_lock(&file_lock_lock);
+ spin_lock(&blocked_hash_lock);
hash_del(&waiter->fl_link);
- spin_unlock(&file_lock_lock);
+ spin_unlock(&blocked_hash_lock);
}

static inline void
@@ -739,7 +740,7 @@ static int posix_locks_deadlock(struct file_lock *caller_fl,
int i = 0;
int ret = 0;

- spin_lock(&file_lock_lock);
+ spin_lock(&blocked_hash_lock);
while ((block_fl = what_owner_is_waiting_for(block_fl))) {
if (i++ > MAX_DEADLK_ITERATIONS)
break;
@@ -748,7 +749,7 @@ static int posix_locks_deadlock(struct file_lock *caller_fl,
break;
}
}
- spin_unlock(&file_lock_lock);
+ spin_unlock(&blocked_hash_lock);
return ret;
}

@@ -2300,10 +2301,12 @@ static int locks_show(struct seq_file *f, void *v)

lock_get_status(f, fl, *((loff_t *)f->private), "");

+ spin_lock(&blocked_hash_lock);
hash_for_each(blocked_hash, bkt, bfl, fl_link) {
if (bfl->fl_next == fl)
lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
}
+ spin_unlock(&blocked_hash_lock);

return 0;
}
--
1.7.1

2013-06-01 03:09:56

by Jeff Layton

Subject: [PATCH v1 06/11] locks: convert to i_lock to protect i_flock list

Having a global lock that protects all of this code is a clear
scalability problem. Instead, move most of this code to be protected by
the inode->i_lock.

The exceptions are the global lists that file_lock->fl_link sits on.
Those still need a global lock of some sort, so wrap just those accesses
in the file_lock_lock spinlock.

Note that this requires a small change to the /proc/locks code. Instead
of walking the fl_block list to look for locks blocked on the current
lock, we must instead walk the global blocker list and skip any that
aren't blocked on the current lock. Otherwise, we'd need to take the
i_lock on each inode as we go and that would create a lock inversion
problem.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
Documentation/filesystems/Locking | 23 +++++--
fs/afs/flock.c | 5 +-
fs/ceph/locks.c | 2 +-
fs/ceph/mds_client.c | 8 +-
fs/cifs/cifsfs.c | 2 +-
fs/cifs/file.c | 13 ++--
fs/gfs2/file.c | 2 +-
fs/lockd/svcsubs.c | 12 ++--
fs/locks.c | 121 ++++++++++++++++++++-----------------
fs/nfs/delegation.c | 11 ++--
fs/nfs/nfs4state.c | 8 +-
fs/nfsd/nfs4state.c | 8 +-
include/linux/fs.h | 11 ----
13 files changed, 119 insertions(+), 107 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 0706d32..13f91ab 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -344,7 +344,7 @@ prototypes:


locking rules:
- file_lock_lock may block
+ inode->i_lock may block
fl_copy_lock: yes no
fl_release_private: maybe no

@@ -357,12 +357,21 @@ prototypes:
int (*lm_change)(struct file_lock **, int);

locking rules:
- file_lock_lock may block
-lm_compare_owner: yes no
-lm_notify: yes no
-lm_grant: no no
-lm_break: yes no
-lm_change yes no
+
+ inode->i_lock file_lock_lock may block
+lm_compare_owner: yes maybe no
+lm_notify: yes no no
+lm_grant: no no no
+lm_break: yes no no
+lm_change yes no no
+
+ ->lm_compare_owner is generally called with *an* inode->i_lock
+held. It may not be the i_lock of the inode for either file_lock being
+compared! This is the case with deadlock detection, since the code has
+to chase down the owners of locks that may be entirely unrelated to the
+one on which the lock is being acquired. For deadlock detection however,
+the file_lock_lock is also held. The locks primarily ensure that neither
+file_lock disappear out from under you while doing the comparison.

--------------------------- buffer_head -----------------------------------
prototypes:
diff --git a/fs/afs/flock.c b/fs/afs/flock.c
index 2497bf3..03fc0d1 100644
--- a/fs/afs/flock.c
+++ b/fs/afs/flock.c
@@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key)
*/
static int afs_do_setlk(struct file *file, struct file_lock *fl)
{
+ struct inode *inode = file_inode(file);
struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
afs_lock_type_t type;
struct key *key = file->private_data;
@@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl)

type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;

- lock_flocks();
+ spin_lock(&inode->i_lock);

/* make sure we've got a callback on this file and that our view of the
* data version is up to date */
@@ -420,7 +421,7 @@ given_lock:
afs_vnode_fetch_status(vnode, NULL, key);

error:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
_leave(" = %d", ret);
return ret;

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 202dd3d..cd0a664 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -194,7 +194,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
* Encode the flock and fcntl locks for the given inode into the pagelist.
* Format is: #fcntl locks, sequential fcntl locks, #flock locks,
* sequential flock locks.
- * Must be called with lock_flocks() already held.
+ * Must be called with inode->i_lock already held.
* If we encounter more of a specific lock type than expected,
* we return the value 1.
*/
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..ae621b5 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2482,13 +2482,13 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,

ceph_pagelist_set_cursor(pagelist, &trunc_point);
do {
- lock_flocks();
+ spin_lock(&inode->i_lock);
ceph_count_locks(inode, &num_fcntl_locks,
&num_flock_locks);
rec.v2.flock_len = (2*sizeof(u32) +
(num_fcntl_locks+num_flock_locks) *
sizeof(struct ceph_filelock));
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

/* pre-alloc pagelist */
ceph_pagelist_truncate(pagelist, &trunc_point);
@@ -2499,12 +2499,12 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,

/* encode locks */
if (!err) {
- lock_flocks();
+ spin_lock(&inode->i_lock);
err = ceph_encode_locks(inode,
pagelist,
num_fcntl_locks,
num_flock_locks);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
}
} while (err == -ENOSPC);
} else {
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 72e4efe..29952d5 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -768,7 +768,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)

static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
{
- /* note that this is called by vfs setlease with lock_flocks held
+ /* note that this is called by vfs setlease with i_lock held
to protect *lease from going away */
struct inode *inode = file_inode(file);
struct cifsFileInfo *cfile = file->private_data;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 44a4f18..0dd10cd 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1092,6 +1092,7 @@ struct lock_to_push {
static int
cifs_push_posix_locks(struct cifsFileInfo *cfile)
{
+ struct inode *inode = cfile->dentry->d_inode;
struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
struct file_lock *flock, **before;
unsigned int count = 0, i = 0;
@@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)

xid = get_xid();

- lock_flocks();
- cifs_for_each_lock(cfile->dentry->d_inode, before) {
+ spin_lock(&inode->i_lock);
+ cifs_for_each_lock(inode, before) {
if ((*before)->fl_flags & FL_POSIX)
count++;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

INIT_LIST_HEAD(&locks_to_send);

@@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
}

el = locks_to_send.next;
- lock_flocks();
- cifs_for_each_lock(cfile->dentry->d_inode, before) {
+ spin_lock(&inode->i_lock);
+ cifs_for_each_lock(inode, before) {
flock = *before;
if ((flock->fl_flags & FL_POSIX) == 0)
continue;
@@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
lck->offset = flock->fl_start;
el = el->next;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) {
int stored_rc;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index acd1676..9e634e0 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -889,7 +889,7 @@ out_uninit:
* cluster; until we do, disable leases (by just returning -EINVAL),
* unless the administrator has requested purely local locking.
*
- * Locking: called under lock_flocks
+ * Locking: called under i_lock
*
* Returns: errno
*/
diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index 97e8741..dc5c759 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file,

again:
file->f_locks = 0;
- lock_flocks(); /* protects i_flock list */
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl; fl = fl->fl_next) {
if (fl->fl_lmops != &nlmsvc_lock_operations)
continue;
@@ -181,7 +181,7 @@ again:
if (match(lockhost, host)) {
struct file_lock lock = *fl;

- unlock_flocks();
+ spin_unlock(&inode->i_lock);
lock.fl_type = F_UNLCK;
lock.fl_start = 0;
lock.fl_end = OFFSET_MAX;
@@ -193,7 +193,7 @@ again:
goto again;
}
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

return 0;
}
@@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file)
if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares)
return 1;

- lock_flocks();
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl; fl = fl->fl_next) {
if (fl->fl_lmops == &nlmsvc_lock_operations) {
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return 1;
}
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
file->f_locks = 0;
return 0;
}
diff --git a/fs/locks.c b/fs/locks.c
index caca466..055c06c 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -165,22 +165,9 @@ int lease_break_time = 45;

static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);
-static DEFINE_SPINLOCK(file_lock_lock);
-
-/*
- * Protects the two list heads above, plus the inode->i_flock list
- */
-void lock_flocks(void)
-{
- spin_lock(&file_lock_lock);
-}
-EXPORT_SYMBOL_GPL(lock_flocks);

-void unlock_flocks(void)
-{
- spin_unlock(&file_lock_lock);
-}
-EXPORT_SYMBOL_GPL(unlock_flocks);
+/* Protects the two list heads above */
+static DEFINE_SPINLOCK(file_lock_lock);

static struct kmem_cache *filelock_cache __read_mostly;

@@ -498,25 +485,33 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_add(&waiter->fl_link, &blocked_list);
+ spin_unlock(&file_lock_lock);
}

static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_del_init(&waiter->fl_link);
+ spin_unlock(&file_lock_lock);
}

static inline void
locks_insert_global_locks(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_add_tail(&waiter->fl_link, &file_lock_list);
+ spin_unlock(&file_lock_lock);
}

static inline void
locks_delete_global_locks(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_del_init(&waiter->fl_link);
+ spin_unlock(&file_lock_lock);
}

/* Remove waiter from blocker's block list.
@@ -533,9 +528,11 @@ static void __locks_delete_block(struct file_lock *waiter)
*/
static void locks_delete_block(struct file_lock *waiter)
{
- lock_flocks();
+ struct inode *inode = file_inode(waiter->fl_file);
+
+ spin_lock(&inode->i_lock);
__locks_delete_block(waiter);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
}

/* Insert waiter into blocker's block list.
@@ -657,8 +654,9 @@ void
posix_test_lock(struct file *filp, struct file_lock *fl)
{
struct file_lock *cfl;
+ struct inode *inode = file_inode(filp);

- lock_flocks();
+ spin_lock(&inode->i_lock);
for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -671,7 +669,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
fl->fl_pid = pid_vnr(cfl->fl_nspid);
} else
fl->fl_type = F_UNLCK;
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return;
}
EXPORT_SYMBOL(posix_test_lock);
@@ -719,14 +717,19 @@ static int posix_locks_deadlock(struct file_lock *caller_fl,
struct file_lock *block_fl)
{
int i = 0;
+ int ret = 0;

+ spin_lock(&file_lock_lock);
while ((block_fl = what_owner_is_waiting_for(block_fl))) {
if (i++ > MAX_DEADLK_ITERATIONS)
- return 0;
- if (posix_same_owner(caller_fl, block_fl))
- return 1;
+ break;
+ if (posix_same_owner(caller_fl, block_fl)) {
+ ++ret;
+ break;
+ }
}
- return 0;
+ spin_unlock(&file_lock_lock);
+ return ret;
}

/* Try to create a FLOCK lock on filp. We always insert new FLOCK locks
@@ -750,7 +753,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
return -ENOMEM;
}

- lock_flocks();
+ spin_lock(&inode->i_lock);
if (request->fl_flags & FL_ACCESS)
goto find_conflict;

@@ -780,9 +783,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
* give it the opportunity to lock the file.
*/
if (found) {
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
cond_resched();
- lock_flocks();
+ spin_lock(&inode->i_lock);
}

find_conflict:
@@ -809,7 +812,7 @@ find_conflict:
error = 0;

out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
if (new_fl)
locks_free_lock(new_fl);
return error;
@@ -839,7 +842,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
new_fl2 = locks_alloc_lock();
}

- lock_flocks();
+ spin_lock(&inode->i_lock);
/*
* New lock request. Walk all POSIX locks and look for conflicts. If
* there are any, either return -EAGAIN or put the request on the
@@ -1012,7 +1015,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_wake_up_blocks(left);
}
out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
/*
* Free any unused locks.
*/
@@ -1087,14 +1090,14 @@ int locks_mandatory_locked(struct inode *inode)
/*
* Search the lock list for this inode for any POSIX locks.
*/
- lock_flocks();
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!IS_POSIX(fl))
continue;
if (fl->fl_owner != owner)
break;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return fl ? -EAGAIN : 0;
}

@@ -1237,7 +1240,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
if (IS_ERR(new_fl))
return PTR_ERR(new_fl);

- lock_flocks();
+ spin_lock(&inode->i_lock);

time_out_leases(inode);

@@ -1287,10 +1290,10 @@ restart:
break_time++;
}
locks_insert_block(flock, new_fl);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
- lock_flocks();
+ spin_lock(&inode->i_lock);
__locks_delete_block(new_fl);
if (error >= 0) {
if (error == 0)
@@ -1308,7 +1311,7 @@ restart:
}

out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
locks_free_lock(new_fl);
return error;
}
@@ -1361,9 +1364,10 @@ EXPORT_SYMBOL(lease_get_mtime);
int fcntl_getlease(struct file *filp)
{
struct file_lock *fl;
+ struct inode *inode = file_inode(filp);
int type = F_UNLCK;

- lock_flocks();
+ spin_lock(&inode->i_lock);
time_out_leases(file_inode(filp));
for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl);
fl = fl->fl_next) {
@@ -1372,7 +1376,7 @@ int fcntl_getlease(struct file *filp)
break;
}
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return type;
}

@@ -1466,7 +1470,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp)
* The (input) flp->fl_lmops->lm_break function is required
* by break_lease().
*
- * Called with file_lock_lock held.
+ * Called with inode->i_lock held.
*/
int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
{
@@ -1535,11 +1539,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease)

int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
{
+ struct inode *inode = file_inode(filp);
int error;

- lock_flocks();
+ spin_lock(&inode->i_lock);
error = __vfs_setlease(filp, arg, lease);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

return error;
}
@@ -1557,6 +1562,7 @@ static int do_fcntl_delete_lease(struct file *filp)
static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
{
struct file_lock *fl, *ret;
+ struct inode *inode = file_inode(filp);
struct fasync_struct *new;
int error;

@@ -1570,10 +1576,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
return -ENOMEM;
}
ret = fl;
- lock_flocks();
+ spin_lock(&inode->i_lock);
error = __vfs_setlease(filp, arg, &ret);
if (error) {
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
locks_free_lock(fl);
goto out_free_fasync;
}
@@ -1590,7 +1596,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
new = NULL;

error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

out_free_fasync:
if (new)
@@ -2114,7 +2120,7 @@ void locks_remove_flock(struct file *filp)
fl.fl_ops->fl_release_private(&fl);
}

- lock_flocks();
+ spin_lock(&inode->i_lock);
before = &inode->i_flock;

while ((fl = *before) != NULL) {
@@ -2132,7 +2138,7 @@ void locks_remove_flock(struct file *filp)
}
before = &fl->fl_next;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
}

/**
@@ -2145,14 +2151,15 @@ void locks_remove_flock(struct file *filp)
int
posix_unblock_lock(struct file *filp, struct file_lock *waiter)
{
+ struct inode *inode = file_inode(filp);
int status = 0;

- lock_flocks();
+ spin_lock(&inode->i_lock);
if (waiter->fl_next)
__locks_delete_block(waiter);
else
status = -ENOENT;
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return status;
}

@@ -2257,8 +2264,10 @@ static int locks_show(struct seq_file *f, void *v)

lock_get_status(f, fl, *((loff_t *)f->private), "");

- list_for_each_entry(bfl, &fl->fl_block, fl_block)
- lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
+ list_for_each_entry(bfl, &blocked_list, fl_link) {
+ if (bfl->fl_next == fl)
+ lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
+ }

return 0;
}
@@ -2267,7 +2276,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
{
loff_t *p = f->private;

- lock_flocks();
+ spin_lock(&file_lock_lock);
*p = (*pos + 1);
return seq_list_start(&file_lock_list, *pos);
}
@@ -2281,7 +2290,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)

static void locks_stop(struct seq_file *f, void *v)
{
- unlock_flocks();
+ spin_unlock(&file_lock_lock);
}

static const struct seq_operations locks_seq_operations = {
@@ -2328,7 +2337,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_flocks();
+
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if (fl->fl_type == F_RDLCK)
@@ -2345,7 +2355,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return result;
}

@@ -2368,7 +2378,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_flocks();
+
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -2383,7 +2394,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return result;
}

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index 57db324..43ee7f9 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -73,20 +73,21 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_
if (inode->i_flock == NULL)
goto out;

- /* Protect inode->i_flock using the file locks lock */
- lock_flocks();
+ /* Protect inode->i_flock using the i_lock */
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
continue;
if (nfs_file_open_context(fl->fl_file) != ctx)
continue;
- unlock_flocks();
+ /* FIXME: safe to drop lock here while walking list? */
+ spin_unlock(&inode->i_lock);
status = nfs4_lock_delegation_recall(fl, state, stateid);
if (status < 0)
goto out;
- lock_flocks();
+ spin_lock(&inode->i_lock);
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
out:
return status;
}
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 1fab140..ff10b4a 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
/* Guard against delegation returns and new lock/unlock calls */
down_write(&nfsi->rwsem);
/* Protect inode->i_flock using the BKL */
- lock_flocks();
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
continue;
if (nfs_file_open_context(fl->fl_file)->state != state)
continue;
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
status = ops->recover_lock(state, fl);
switch (status) {
case 0:
@@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
/* kill_proc(fl->fl_pid, SIGLOST, 1); */
status = 0;
}
- lock_flocks();
+ spin_lock(&inode->i_lock);
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
out:
up_write(&nfsi->rwsem);
return status;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 316ec84..f170518 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)

list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);

- /* only place dl_time is set. protected by lock_flocks*/
+ /* Only place dl_time is set; protected by i_lock: */
dp->dl_time = get_seconds();

nfsd4_cb_recall(dp);
}

-/* Called from break_lease() with lock_flocks() held. */
+/* Called from break_lease() with i_lock held. */
static void nfsd_break_deleg_cb(struct file_lock *fl)
{
struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner;
@@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
struct inode *inode = filp->fi_inode;
int status = 0;

- lock_flocks();
+ spin_lock(&inode->i_lock);
for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) {
if ((*flpp)->fl_owner == (fl_owner_t)lowner) {
status = 1;
@@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
}
}
out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return status;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae377e9..ccb44ea 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1012,8 +1012,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
extern int lease_modify(struct file_lock **, int);
extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void lock_flocks(void);
-extern void unlock_flocks(void);
#else /* !CONFIG_FILE_LOCKING */
static inline int fcntl_getlk(struct file *file, struct flock __user *user)
{
@@ -1155,15 +1153,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
{
return 1;
}
-
-static inline void lock_flocks(void)
-{
-}
-
-static inline void unlock_flocks(void)
-{
-}
-
#endif /* !CONFIG_FILE_LOCKING */


--
1.7.1

2013-06-01 03:10:25

by Jeff Layton

Subject: [PATCH v1 09/11] locks: turn the blocked_list into a hashtable

Break up the blocked_list into a hashtable, using the fl_owner as the
key. Hashing keeps the per-bucket chains short, which greatly speeds up
the searches done by the deadlock detection code.

Note that the initial implementation assumes that hashing on fl_owner is
sufficient. In most cases it should be, with the notable exception being
server-side lockd, which compares ownership using a tuple of the
nlm_host and the pid sent in the lock request. So, this may degrade to a
single hash bucket when you only have a single NFS client. That will be
addressed in a later patch.

The careful observer may note that this patch leaves the file_lock_list
alone. There's much less of a case for turning the file_lock_list into a
hashtable. The only user of that list is the code that generates
/proc/locks, and it always walks the entire list.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 24 ++++++++++++++++++------
1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 5ed056b..0d030ce 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -126,6 +126,7 @@
#include <linux/time.h>
#include <linux/rcupdate.h>
#include <linux/pid_namespace.h>
+#include <linux/hashtable.h>

#include <asm/uaccess.h>

@@ -163,10 +164,19 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)

+/*
+ * By breaking up the blocked locks list into a hashtable, we speed up the
+ * deadlock detection.
+ *
+ * FIXME: make this value scale via some heuristic?
+ */
+#define BLOCKED_HASH_BITS 7
+
+static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
+
static HLIST_HEAD(file_lock_list);
-static HLIST_HEAD(blocked_list);

-/* Protects the two list heads above */
+/* Protects the file_lock_list and the blocked_hash */
static DEFINE_SPINLOCK(file_lock_lock);

static struct kmem_cache *filelock_cache __read_mostly;
@@ -486,7 +496,8 @@ static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- hlist_add_head(&waiter->fl_link, &blocked_list);
+ hash_add(blocked_hash, &waiter->fl_link,
+ (unsigned long)waiter->fl_owner);
spin_unlock(&file_lock_lock);
}

@@ -494,7 +505,7 @@ static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- hlist_del_init(&waiter->fl_link);
+ hash_del(&waiter->fl_link);
spin_unlock(&file_lock_lock);
}

@@ -705,7 +716,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
struct file_lock *fl, *ret = NULL;

- hlist_for_each_entry(fl, &blocked_list, fl_link) {
+ hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
if (posix_same_owner(fl, block_fl)) {
ret = fl->fl_next;
if (likely(ret))
@@ -2275,13 +2286,14 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,

static int locks_show(struct seq_file *f, void *v)
{
+ int bkt;
struct file_lock *fl, *bfl;

fl = hlist_entry(v, struct file_lock, fl_link);

lock_get_status(f, fl, *((loff_t *)f->private), "");

- hlist_for_each_entry(bfl, &blocked_list, fl_link) {
+ hash_for_each(blocked_hash, bkt, bfl, fl_link) {
if (bfl->fl_next == fl)
lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
}
--
1.7.1

2013-06-01 03:10:13

by Jeff Layton

Subject: [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

Currently, when there is a lot of lock contention the kernel spends an
inordinate amount of time taking blocked locks off of the global
blocked_list and then putting them right back on again. When all of this
code was protected by a single lock, that didn't matter much, but now it
means a lot of file_lock_lock thrashing.

Optimize this a bit by deferring the removal from the blocked_list until
we're either applying or cancelling the lock. By doing this, and using a
lockless list_empty check, we can avoid taking the file_lock_lock in
many cases.

Because the fl_link check is lockless, we must ensure that only the task
that "owns" the request manipulates the fl_link. Also, with this change,
it's possible that we'll see an entry on the blocked_list that has a
NULL fl_next pointer. In that event, just ignore it and continue walking
the list.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 29 +++++++++++++++++++++++------
1 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 055c06c..fc35b9e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -520,7 +520,6 @@ locks_delete_global_locks(struct file_lock *waiter)
static void __locks_delete_block(struct file_lock *waiter)
{
list_del_init(&waiter->fl_block);
- locks_delete_global_blocked(waiter);
waiter->fl_next = NULL;
}

@@ -704,13 +703,16 @@ EXPORT_SYMBOL(posix_test_lock);
/* Find a lock that the owner of the given block_fl is blocking on. */
static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
- struct file_lock *fl;
+ struct file_lock *fl, *ret = NULL;

list_for_each_entry(fl, &blocked_list, fl_link) {
- if (posix_same_owner(fl, block_fl))
- return fl->fl_next;
+ if (posix_same_owner(fl, block_fl)) {
+ ret = fl->fl_next;
+ if (likely(ret))
+ break;
+ }
}
- return NULL;
+ return ret;
}

static int posix_locks_deadlock(struct file_lock *caller_fl,
@@ -865,7 +867,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
goto out;
error = FILE_LOCK_DEFERRED;
locks_insert_block(fl, request);
- locks_insert_global_blocked(request);
+ if (list_empty(&request->fl_link))
+ locks_insert_global_blocked(request);
goto out;
}
}
@@ -876,6 +879,16 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
goto out;

/*
+ * Now that we know the request is no longer blocked, we can take it
+ * off the global list. Some callers send down partially initialized
+ * requests, so we only do this if FL_SLEEP is set. Also, avoid taking
+ * the lock if the list is empty, as that indicates a request that
+ * never blocked.
+ */
+ if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
+ locks_delete_global_blocked(request);
+
+ /*
* Find the first old lock with the same owner as the new lock.
*/

@@ -1069,6 +1082,7 @@ int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
continue;

locks_delete_block(fl);
+ locks_delete_global_blocked(fl);
break;
}
return error;
@@ -1147,6 +1161,7 @@ int locks_mandatory_area(int read_write, struct inode *inode,
}

locks_delete_block(&fl);
+ locks_delete_global_blocked(&fl);
break;
}

@@ -1859,6 +1874,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
continue;

locks_delete_block(fl);
+ locks_delete_global_blocked(fl);
break;
}

@@ -2160,6 +2176,7 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
else
status = -ENOENT;
spin_unlock(&inode->i_lock);
+ locks_delete_global_blocked(waiter);
return status;
}

--
1.7.1

2013-06-01 03:10:19

by Jeff Layton

Subject: [PATCH v1 08/11] locks: convert fl_link to a hlist_node

Testing has shown that iterating over the blocked_list for deadlock
detection turns out to be a bottleneck. In order to alleviate that,
begin the process of turning it into a hashtable. We start by turning
the fl_link into a hlist_node and the global lists into hlists. A later
patch will do the conversion of the blocked_list to a hashtable.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 32 ++++++++++++++++----------------
include/linux/fs.h | 2 +-
2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index fc35b9e..5ed056b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -163,8 +163,8 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)

-static LIST_HEAD(file_lock_list);
-static LIST_HEAD(blocked_list);
+static HLIST_HEAD(file_lock_list);
+static HLIST_HEAD(blocked_list);

/* Protects the two list heads above */
static DEFINE_SPINLOCK(file_lock_lock);
@@ -173,7 +173,7 @@ static struct kmem_cache *filelock_cache __read_mostly;

static void locks_init_lock_heads(struct file_lock *fl)
{
- INIT_LIST_HEAD(&fl->fl_link);
+ INIT_HLIST_NODE(&fl->fl_link);
INIT_LIST_HEAD(&fl->fl_block);
init_waitqueue_head(&fl->fl_wait);
}
@@ -207,7 +207,7 @@ void locks_free_lock(struct file_lock *fl)
{
BUG_ON(waitqueue_active(&fl->fl_wait));
BUG_ON(!list_empty(&fl->fl_block));
- BUG_ON(!list_empty(&fl->fl_link));
+ BUG_ON(!hlist_unhashed(&fl->fl_link));

locks_release_private(fl);
kmem_cache_free(filelock_cache, fl);
@@ -486,7 +486,7 @@ static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_add(&waiter->fl_link, &blocked_list);
+ hlist_add_head(&waiter->fl_link, &blocked_list);
spin_unlock(&file_lock_lock);
}

@@ -494,7 +494,7 @@ static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_del_init(&waiter->fl_link);
+ hlist_del_init(&waiter->fl_link);
spin_unlock(&file_lock_lock);
}

@@ -502,7 +502,7 @@ static inline void
locks_insert_global_locks(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_add_tail(&waiter->fl_link, &file_lock_list);
+ hlist_add_head(&waiter->fl_link, &file_lock_list);
spin_unlock(&file_lock_lock);
}

@@ -510,7 +510,7 @@ static inline void
locks_delete_global_locks(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_del_init(&waiter->fl_link);
+ hlist_del_init(&waiter->fl_link);
spin_unlock(&file_lock_lock);
}

@@ -705,7 +705,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
struct file_lock *fl, *ret = NULL;

- list_for_each_entry(fl, &blocked_list, fl_link) {
+ hlist_for_each_entry(fl, &blocked_list, fl_link) {
if (posix_same_owner(fl, block_fl)) {
ret = fl->fl_next;
if (likely(ret))
@@ -867,7 +867,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
goto out;
error = FILE_LOCK_DEFERRED;
locks_insert_block(fl, request);
- if (list_empty(&request->fl_link))
+ if (hlist_unhashed(&request->fl_link))
locks_insert_global_blocked(request);
goto out;
}
@@ -882,10 +882,10 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
* Now that we know the request is no longer blocked, we can take it
* off the global list. Some callers send down partially initialized
* requests, so we only do this if FL_SLEEP is set. Also, avoid taking
- * the lock if the list is empty, as that indicates a request that
+ * the lock if the hlist is unhashed, as that indicates a request that
* never blocked.
*/
- if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
+ if ((request->fl_flags & FL_SLEEP) && !hlist_unhashed(&request->fl_link))
locks_delete_global_blocked(request);

/*
@@ -2277,11 +2277,11 @@ static int locks_show(struct seq_file *f, void *v)
{
struct file_lock *fl, *bfl;

- fl = list_entry(v, struct file_lock, fl_link);
+ fl = hlist_entry(v, struct file_lock, fl_link);

lock_get_status(f, fl, *((loff_t *)f->private), "");

- list_for_each_entry(bfl, &blocked_list, fl_link) {
+ hlist_for_each_entry(bfl, &blocked_list, fl_link) {
if (bfl->fl_next == fl)
lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
}
@@ -2295,14 +2295,14 @@ static void *locks_start(struct seq_file *f, loff_t *pos)

spin_lock(&file_lock_lock);
*p = (*pos + 1);
- return seq_list_start(&file_lock_list, *pos);
+ return seq_hlist_start(&file_lock_list, *pos);
}

static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
{
loff_t *p = f->private;
++*p;
- return seq_list_next(v, &file_lock_list, pos);
+ return seq_hlist_next(v, &file_lock_list, pos);
}

static void locks_stop(struct seq_file *f, void *v)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ccb44ea..07a009e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -934,7 +934,7 @@ int locks_in_grace(struct net *);
*/
struct file_lock {
struct file_lock *fl_next; /* singly linked list for this inode */
- struct list_head fl_link; /* doubly linked list of all locks */
+ struct hlist_node fl_link; /* node in global lists */
struct list_head fl_block; /* circular list of blocked processes */
fl_owner_t fl_owner;
unsigned int fl_flags;
--
1.7.1

2013-06-01 03:10:07

by Jeff Layton

Subject: [PATCH v1 10/11] locks: add a new "lm_owner_key" lock operation

Currently, the hash key that the locking code uses to add blocked locks
to the blocked_hash is calculated from the fl_owner field alone. That's
valid in most cases, the notable exception being server-side lockd,
which validates the owner of a lock based on both fl_owner and fl_pid.

In the case where you have a small number of NFS clients doing a lot
of locking between different processes, you could end up with all
the blocked requests sitting in a very small number of hash buckets.

Add a new lm_owner_key operation to the lock_manager_operations that
will generate an unsigned long to use as the key in the hashtable.
That function is only implemented for server-side lockd, and simply
XORs the fl_owner and fl_pid.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/Locking | 18 +++++++++++-------
fs/lockd/svclock.c | 12 ++++++++++++
fs/locks.c | 13 ++++++++++---
include/linux/fs.h | 1 +
4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 13f91ab..ee351ac 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -351,6 +351,7 @@ fl_release_private: maybe no
----------------------- lock_manager_operations ---------------------------
prototypes:
int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
+ unsigned long (*lm_owner_key)(struct file_lock *);
void (*lm_notify)(struct file_lock *); /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */
@@ -360,18 +361,21 @@ locking rules:

inode->i_lock file_lock_lock may block
lm_compare_owner: yes maybe no
+lm_owner_key yes yes no
lm_notify: yes no no
lm_grant: no no no
lm_break: yes no no
lm_change yes no no

- ->lm_compare_owner is generally called with *an* inode->i_lock
-held. It may not be the i_lock of the inode for either file_lock being
-compared! This is the case with deadlock detection, since the code has
-to chase down the owners of locks that may be entirely unrelated to the
-one on which the lock is being acquired. For deadlock detection however,
-the file_lock_lock is also held. The locks primarily ensure that neither
-file_lock disappear out from under you while doing the comparison.
+ ->lm_compare_owner and ->lm_owner_key are generally called with
+*an* inode->i_lock held. It may not be the i_lock of the inode
+associated with either file_lock argument! This is the case with deadlock
+detection, since the code has to chase down the owners of locks that may
+be entirely unrelated to the one on which the lock is being acquired.
+For deadlock detection however, the file_lock_lock is also held. The
+fact that these locks are held ensures that the file_locks do not
+disappear out from under you while doing the comparison or generating an
+owner key.

--------------------------- buffer_head -----------------------------------
prototypes:
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index e703318..ce2cdab 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -744,8 +744,20 @@ static int nlmsvc_same_owner(struct file_lock *fl1, struct file_lock *fl2)
return fl1->fl_owner == fl2->fl_owner && fl1->fl_pid == fl2->fl_pid;
}

+/*
+ * Since NLM uses two "keys" for tracking locks, we need to hash them down
+ * to one for the blocked_hash. Here, we're just xor'ing the host address
+ * with the pid in order to create a key value for picking a hash bucket.
+ */
+static unsigned long
+nlmsvc_owner_key(struct file_lock *fl)
+{
+ return (unsigned long)fl->fl_owner ^ (unsigned long)fl->fl_pid;
+}
+
const struct lock_manager_operations nlmsvc_lock_operations = {
.lm_compare_owner = nlmsvc_same_owner,
+ .lm_owner_key = nlmsvc_owner_key,
.lm_notify = nlmsvc_notify_blocked,
.lm_grant = nlmsvc_grant_deferred,
};
diff --git a/fs/locks.c b/fs/locks.c
index 0d030ce..8219187 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -491,13 +491,20 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
return fl1->fl_owner == fl2->fl_owner;
}

+static unsigned long
+posix_owner_key(struct file_lock *fl)
+{
+ if (fl->fl_lmops && fl->fl_lmops->lm_owner_key)
+ return fl->fl_lmops->lm_owner_key(fl);
+ return (unsigned long)fl->fl_owner;
+}
+
/* Remove a blocker or lock from one of the global lists */
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- hash_add(blocked_hash, &waiter->fl_link,
- (unsigned long)waiter->fl_owner);
+ hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter));
spin_unlock(&file_lock_lock);
}

@@ -716,7 +723,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
struct file_lock *fl, *ret = NULL;

- hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
+ hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
if (posix_same_owner(fl, block_fl)) {
ret = fl->fl_next;
if (likely(ret))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 07a009e..4906cf5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -908,6 +908,7 @@ struct file_lock_operations {

struct lock_manager_operations {
int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
+ unsigned long (*lm_owner_key)(struct file_lock *);
void (*lm_notify)(struct file_lock *); /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *);
--
1.7.1

2013-06-01 03:08:51

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v1 02/11] locks: make generic_add_lease and generic_delete_lease static

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 7a02064..e3140b8 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp)
return type;
}

-int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
+static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
{
struct file_lock *fl, **before, **my_before = NULL, *lease;
struct dentry *dentry = filp->f_path.dentry;
@@ -1402,7 +1402,7 @@ out:
return error;
}

-int generic_delete_lease(struct file *filp, struct file_lock **flp)
+static int generic_delete_lease(struct file *filp, struct file_lock **flp)
{
struct file_lock *fl, **before;
struct dentry *dentry = filp->f_path.dentry;
--
1.7.1

2013-06-03 19:05:00

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] locks: scalability improvements for file locking

On Fri, 2013-05-31 at 23:07 -0400, Jeff Layton wrote:
> This is not the first attempt at doing this. The conversion to the
> i_lock was originally attempted by Bruce Fields a few years ago. His
> approach was NAK'ed since it involved ripping out the deadlock
> detection. People also really seem to like /proc/locks for debugging, so
> keeping that in is probably worthwhile.

Yep, we need to keep this. FWIW, lslocks(8) relies on /proc/locks.

Thanks,
Davidlohr

2013-06-03 21:31:59

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] locks: scalability improvements for file locking

On Fri, May 31, 2013 at 11:07:23PM -0400, Jeff Layton wrote:
> Executive summary (tl;dr version): This patchset represents an overhaul
> of the file locking code with an aim toward improving its scalability
> and making the code a bit easier to understand.

Thanks for working on this, that code could use some love!

> Longer version:
>
> When the BKL was finally ripped out of the kernel in 2010, the strategy
> taken for the file locking code was to simply turn it into a new
> file_lock_locks spinlock. It was an expedient way to deal with the file
> locking code at the time, but having a giant spinlock around all of this
> code is clearly not great for scalability. Red Hat has bug reports that
> go back into the 2.6.18 era that point to BKL scalability problems in
> the file locking code and the file_lock_lock suffers from the same
> issues.
>
> This patchset is my first attempt to make this code less dependent on
> global locking. The main change is to switch most of the file locking
> code to be protected by the inode->i_lock instead of the file_lock_lock.
>
> While that works for most things, there are a couple of global data
> structures (lists in the current code) that need a global lock to
> protect them. So we still need a global lock in order to deal with
> those. The remaining patches are intended to make that global locking
> less painful. The big gain is made by turning the blocked_list into a
> hashtable, which greatly speeds up the deadlock detection code.
>
> I rolled a couple of small programs in order to test this code. The
> first one just forks off 128 children and has them lock and unlock the
> same file 10k times. Running this under "time" against a file on tmpfs
> gives typical values like this:

What kind of hardware was this?

>
> Unpatched (3.10-rc3-ish):
> real 0m5.283s
> user 0m0.380s
> sys 0m20.469s
>
> Patched (same base kernel):
> real 0m5.099s
> user 0m0.478s
> sys 0m19.662s
>
> ...so there seems to be some modest performance gain in this test. I
> think that's almost entirely due to the change to a hashtable and to
> optimize removing and readding blocked locks to the global lists. Note
> that with this code we have to take two spinlocks instead of just one,
> and that has some performance impact too. So the real peformance gain
> from that hashtable conversion is eaten up to some degree by this.

Might be nice to look at some profiles to confirm all of that. I'd also
be curious how much variation there was in the results above, as they're
pretty close.

> The next test just forks off a bunch of children that each create their
> own file and then lock and unlock it 20k times. Obviously, the locks in
> this case are uncontended. Running that under "time" typically gives
> these rough numbers.
>
> Unpatched (3.10-rc3-ish):
> real 0m8.836s
> user 0m1.018s
> sys 0m34.094s
>
> Patched (same base kernel):
> real 0m4.965s
> user 0m1.043s
> sys 0m18.651s
>
> In this test, we see the real benefit of moving to the i_lock for most
> of this code. The run time is almost cut in half in this test. With
> these changes locking different inodes needs very little serialization.
>
> If people know of other file locking performance tests, then I'd be
> happy to try them out too. It's possible that this might make some
> workloads slower, and it would be helpful to know what they are (and
> address them) if so.
>
> This is not the first attempt at doing this. The conversion to the
> i_lock was originally attempted by Bruce Fields a few years ago. His
> approach was NAK'ed since it involved ripping out the deadlock
> detection. People also really seem to like /proc/locks for debugging, so
> keeping that in is probably worthwhile.

Yes, there's already code that depends on it.

The deadlock detection, though--I still wonder if we could get away with
ripping it out. Might be worth at least giving an option to configure
it out as a first step.

--b.

> There's more work to be done in this area and this patchset is just a
> start. There's a horrible thundering herd problem when a blocking lock
> is released, for instance. There was also interest in solving the goofy
> "unlock on any close" POSIX lock semantics at this year's LSF. I think
> this patchset will help lay the groundwork for those changes as well.
>
> Comments and suggestions welcome.
>
> Jeff Layton (11):
> cifs: use posix_unblock_lock instead of locks_delete_block
> locks: make generic_add_lease and generic_delete_lease static
> locks: comment cleanups and clarifications
> locks: make "added" in __posix_lock_file a bool
> locks: encapsulate the fl_link list handling
> locks: convert to i_lock to protect i_flock list
> locks: only pull entries off of blocked_list when they are really
> unblocked
> locks: convert fl_link to a hlist_node
> locks: turn the blocked_list into a hashtable
> locks: add a new "lm_owner_key" lock operation
> locks: give the blocked_hash its own spinlock
>
> Documentation/filesystems/Locking | 27 +++-
> fs/afs/flock.c | 5 +-
> fs/ceph/locks.c | 2 +-
> fs/ceph/mds_client.c | 8 +-
> fs/cifs/cifsfs.c | 2 +-
> fs/cifs/file.c | 15 +-
> fs/gfs2/file.c | 2 +-
> fs/lockd/svclock.c | 12 ++
> fs/lockd/svcsubs.c | 12 +-
> fs/locks.c | 254 +++++++++++++++++++++++++------------
> fs/nfs/delegation.c | 11 +-
> fs/nfs/nfs4state.c | 8 +-
> fs/nfsd/nfs4state.c | 8 +-
> include/linux/fs.h | 25 +---
> 14 files changed, 249 insertions(+), 142 deletions(-)
>

2013-06-03 21:54:23

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 01/11] cifs: use posix_unblock_lock instead of locks_delete_block

On Fri, May 31, 2013 at 11:07:24PM -0400, Jeff Layton wrote:
> commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
> the locks_delete_block symbol. There's already an exported helper
> function that provides this capability however, so make cifs use that
> instead and turn locks_delete_block back into a static function.
>
> Note that if fl->fl_next == NULL then this lock has already been through
> locks_delete_block(), so we should be OK to ignore an ENOENT error here
> and simply not retry the lock.

ACK.

--b.

>
> Cc: Pavel Shilovsky <[email protected]>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/cifs/file.c | 2 +-
> fs/locks.c | 3 +--
> include/linux/fs.h | 5 -----
> 3 files changed, 2 insertions(+), 8 deletions(-)
>
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 48b29d2..44a4f18 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -999,7 +999,7 @@ try_again:
> rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
> if (!rc)
> goto try_again;
> - locks_delete_block(flock);
> + posix_unblock_lock(file, flock);
> }
> return rc;
> }
> diff --git a/fs/locks.c b/fs/locks.c
> index cb424a4..7a02064 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock *waiter)
>
> /*
> */
> -void locks_delete_block(struct file_lock *waiter)
> +static void locks_delete_block(struct file_lock *waiter)
> {
> lock_flocks();
> __locks_delete_block(waiter);
> unlock_flocks();
> }
> -EXPORT_SYMBOL(locks_delete_block);
>
> /* Insert waiter into blocker's block list.
> * We use a circular list so that processes can be easily woken up in
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 43db02e..b9d7816 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
> extern int lease_modify(struct file_lock **, int);
> extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
> extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
> -extern void locks_delete_block(struct file_lock *waiter);
> extern void lock_flocks(void);
> extern void unlock_flocks(void);
> #else /* !CONFIG_FILE_LOCKING */
> @@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> return 1;
> }
>
> -static inline void locks_delete_block(struct file_lock *waiter)
> -{
> -}
> -
> static inline void lock_flocks(void)
> {
> }
> --
> 1.7.1
>

2013-06-03 21:54:38

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 02/11] locks: make generic_add_lease and generic_delete_lease static

On Fri, May 31, 2013 at 11:07:25PM -0400, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <[email protected]>

ACK.

--b.

> ---
> fs/locks.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 7a02064..e3140b8 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp)
> return type;
> }
>
> -int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
> +static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
> {
> struct file_lock *fl, **before, **my_before = NULL, *lease;
> struct dentry *dentry = filp->f_path.dentry;
> @@ -1402,7 +1402,7 @@ out:
> return error;
> }
>
> -int generic_delete_lease(struct file *filp, struct file_lock **flp)
> +static int generic_delete_lease(struct file *filp, struct file_lock **flp)
> {
> struct file_lock *fl, **before;
> struct dentry *dentry = filp->f_path.dentry;
> --
> 1.7.1
>

2013-06-03 22:00:58

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] locks: comment cleanups and clarifications

On Fri, May 31, 2013 at 11:07:26PM -0400, Jeff Layton wrote:
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 24 +++++++++++++++++++-----
> include/linux/fs.h | 6 ++++++
> 2 files changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index e3140b8..a7d2253 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -150,6 +150,16 @@ static int target_leasetype(struct file_lock *fl)
> int leases_enable = 1;
> int lease_break_time = 45;
>
> +/*
> + * The i_flock list is ordered by:
> + *
> + * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
> + * 2) lock owner
> + * 3) lock range start
> + * 4) lock range end
> + *
> + * Obviously, the last two criteria only matter for POSIX locks.
> + */

Thanks, yes, that needs documenting! Though I wonder if this is the
place people will look for it.

> #define for_each_lock(inode, lockp) \
> for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>
> @@ -806,6 +816,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> }
>
> lock_flocks();
> + /*
> + * New lock request. Walk all POSIX locks and look for conflicts. If
> + * there are any, either return -EAGAIN or put the request on the
> + * blocker's list of waiters.
> + */

This though, seems a) not 100% accurate (it could also return EDEADLCK,
for example), b) mostly redundant with respect to the following code.

> if (request->fl_type != F_UNLCK) {
> for_each_lock(inode, before) {
> fl = *before;
> @@ -844,7 +859,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> before = &fl->fl_next;
> }
>
> - /* Process locks with this owner. */
> + /* Process locks with this owner. */
> while ((fl = *before) && posix_same_owner(request, fl)) {
> /* Detect adjacent or overlapping regions (if same lock type)
> */
> @@ -930,10 +945,9 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> }
>
> /*
> - * The above code only modifies existing locks in case of
> - * merging or replacing. If new lock(s) need to be inserted
> - * all modifications are done bellow this, so it's safe yet to
> - * bail out.
> + * The above code only modifies existing locks in case of merging or
> + * replacing. If new lock(s) need to be inserted all modifications are
> + * done below this, so it's safe yet to bail out.
> */
> error = -ENOLCK; /* "no luck" */
> if (right && left == right && !new_fl2)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b9d7816..ae377e9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -926,6 +926,12 @@ int locks_in_grace(struct net *);
> /* that will die - we need it for nfs_lock_info */
> #include <linux/nfs_fs_i.h>
>
> +/*
> + * struct file_lock represents a generic "file lock". It's used to represent
> + * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
> + * note that the same struct is used to represent both a request for a lock and
> + * the lock itself, but the same object is never used for both.

Yes, and I do find that confusing. I wonder if there's a sensible way
to use separate structs for the different uses.

--b.

> + */
> struct file_lock {
> struct file_lock *fl_next; /* singly linked list for this inode */
> struct list_head fl_link; /* doubly linked list of all locks */
> --
> 1.7.1
>

2013-06-04 10:54:59

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] locks: scalability improvements for file locking

On Mon, 3 Jun 2013 17:31:01 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 31, 2013 at 11:07:23PM -0400, Jeff Layton wrote:
> > Executive summary (tl;dr version): This patchset represents an overhaul
> > of the file locking code with an aim toward improving its scalability
> > and making the code a bit easier to understand.
>
> Thanks for working on this, that code could use some love!
>
> > Longer version:
> >
> > When the BKL was finally ripped out of the kernel in 2010, the strategy
> > taken for the file locking code was to simply turn it into a new
> > file_lock_locks spinlock. It was an expedient way to deal with the file
> > locking code at the time, but having a giant spinlock around all of this
> > code is clearly not great for scalability. Red Hat has bug reports that
> > go back into the 2.6.18 era that point to BKL scalability problems in
> > the file locking code and the file_lock_lock suffers from the same
> > issues.
> >
> > This patchset is my first attempt to make this code less dependent on
> > global locking. The main change is to switch most of the file locking
> > code to be protected by the inode->i_lock instead of the file_lock_lock.
> >
> > While that works for most things, there are a couple of global data
> > structures (lists in the current code) that need a global lock to
> > protect them. So we still need a global lock in order to deal with
> > those. The remaining patches are intended to make that global locking
> > less painful. The big gain is made by turning the blocked_list into a
> > hashtable, which greatly speeds up the deadlock detection code.
> >
> > I rolled a couple of small programs in order to test this code. The
> > first one just forks off 128 children and has them lock and unlock the
> > same file 10k times. Running this under "time" against a file on tmpfs
> > gives typical values like this:
>
> What kind of hardware was this?
>

Mostly a KVM guest on Intel hardware. I've done some testing on bare
metal too, and the results are pretty similar.

> >
> > Unpatched (3.10-rc3-ish):
> > real 0m5.283s
> > user 0m0.380s
> > sys 0m20.469s
> >
> > Patched (same base kernel):
> > real 0m5.099s
> > user 0m0.478s
> > sys 0m19.662s
> >
> > ...so there seems to be some modest performance gain in this test. I
> > think that's almost entirely due to the change to a hashtable and to
> > optimize removing and readding blocked locks to the global lists. Note
> > that with this code we have to take two spinlocks instead of just one,
> > and that has some performance impact too. So the real peformance gain
> > from that hashtable conversion is eaten up to some degree by this.
>
> Might be nice to look at some profiles to confirm all of that. I'd also
> be curious how much variation there was in the results above, as they're
> pretty close.
>

The above is just a random representative sample. The results are
pretty close when running this test, but I can average up several runs
and present the numbers. I plan to get a bare-metal test box on which
to run some more detailed testing and maybe some profiling this week.

> > The next test just forks off a bunch of children that each create their
> > own file and then lock and unlock it 20k times. Obviously, the locks in
> > this case are uncontended. Running that under "time" typically gives
> > these rough numbers.
> >
> > Unpatched (3.10-rc3-ish):
> > real 0m8.836s
> > user 0m1.018s
> > sys 0m34.094s
> >
> > Patched (same base kernel):
> > real 0m4.965s
> > user 0m1.043s
> > sys 0m18.651s
> >
> > In this test, we see the real benefit of moving to the i_lock for most
> > of this code. The run time is almost cut in half in this test. With
> > these changes locking different inodes needs very little serialization.
> >
> > If people know of other file locking performance tests, then I'd be
> > happy to try them out too. It's possible that this might make some
> > workloads slower, and it would be helpful to know what they are (and
> > address them) if so.
> >
> > This is not the first attempt at doing this. The conversion to the
> > i_lock was originally attempted by Bruce Fields a few years ago. His
> > approach was NAK'ed since it involved ripping out the deadlock
> > detection. People also really seem to like /proc/locks for debugging, so
> > keeping that in is probably worthwhile.
>
> Yes, there's already code that depends on it.
>
> The deadlock detection, though--I still wonder if we could get away with
> ripping it out. Might be worth at least giving an option to configure
> it out as a first step.
>
> --b.
>


I considered that, and have patches that add such a config option. Some
of the later patches in this set optimize the code that is necessary to
support deadlock detection. In particular, turning the blocked_list
into a hashtable really speeds up the list walking. So much so that I
think the case for compiling it out is less obvious.

Should we add an option for people who really want their locks to
scream? Maybe, but I think it would be easy to add that later,
especially now that the code to handle the blocked_hash is fairly well
encapsulated with this patch.

Thanks for taking a look!

-- Jeff

> > There's more work to be done in this area and this patchset is just a
> > start. There's a horrible thundering herd problem when a blocking lock
> > is released, for instance. There was also interest in solving the goofy
> > "unlock on any close" POSIX lock semantics at this year's LSF. I think
> > this patchset will help lay the groundwork for those changes as well.
> >
> > Comments and suggestions welcome.
> >
> > Jeff Layton (11):
> > cifs: use posix_unblock_lock instead of locks_delete_block
> > locks: make generic_add_lease and generic_delete_lease static
> > locks: comment cleanups and clarifications
> > locks: make "added" in __posix_lock_file a bool
> > locks: encapsulate the fl_link list handling
> > locks: convert to i_lock to protect i_flock list
> > locks: only pull entries off of blocked_list when they are really
> > unblocked
> > locks: convert fl_link to a hlist_node
> > locks: turn the blocked_list into a hashtable
> > locks: add a new "lm_owner_key" lock operation
> > locks: give the blocked_hash its own spinlock
> >
> > Documentation/filesystems/Locking | 27 +++-
> > fs/afs/flock.c | 5 +-
> > fs/ceph/locks.c | 2 +-
> > fs/ceph/mds_client.c | 8 +-
> > fs/cifs/cifsfs.c | 2 +-
> > fs/cifs/file.c | 15 +-
> > fs/gfs2/file.c | 2 +-
> > fs/lockd/svclock.c | 12 ++
> > fs/lockd/svcsubs.c | 12 +-
> > fs/locks.c | 254 +++++++++++++++++++++++++------------
> > fs/nfs/delegation.c | 11 +-
> > fs/nfs/nfs4state.c | 8 +-
> > fs/nfsd/nfs4state.c | 8 +-
> > include/linux/fs.h | 25 +---
> > 14 files changed, 249 insertions(+), 142 deletions(-)
> >

2013-06-04 11:09:46

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 03/11] locks: comment cleanups and clarifications

On Mon, 3 Jun 2013 18:00:24 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 31, 2013 at 11:07:26PM -0400, Jeff Layton wrote:
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/locks.c | 24 +++++++++++++++++++-----
> > include/linux/fs.h | 6 ++++++
> > 2 files changed, 25 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/locks.c b/fs/locks.c
> > index e3140b8..a7d2253 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -150,6 +150,16 @@ static int target_leasetype(struct file_lock *fl)
> > int leases_enable = 1;
> > int lease_break_time = 45;
> >
> > +/*
> > + * The i_flock list is ordered by:
> > + *
> > + * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
> > + * 2) lock owner
> > + * 3) lock range start
> > + * 4) lock range end
> > + *
> > + * Obviously, the last two criteria only matter for POSIX locks.
> > + */
>
> Thanks, yes, that needs documenting! Though I wonder if this is the
> place people will look for it.
>

Agreed. If you can think of a better spot to put this, I'd be happy to
move it in the next iteration of this series.

> > #define for_each_lock(inode, lockp) \
> > for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
> >
> > @@ -806,6 +816,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > }
> >
> > lock_flocks();
> > + /*
> > + * New lock request. Walk all POSIX locks and look for conflicts. If
> > + * there are any, either return -EAGAIN or put the request on the
> > + * blocker's list of waiters.
> > + */
>
> This though, seems a) not 100% accurate (it could also return EDEADLCK,
> for example), b) mostly redundant with respect to the following code.
>

Good point -- I'll fix that up. That one was mostly for my own benefit
while trying to figure out how this code works. It's still probably worth
some more verbose comments here since this certainly wasn't 100%
obvious to me when I first dove into this code.

> > if (request->fl_type != F_UNLCK) {
> > for_each_lock(inode, before) {
> > fl = *before;
> > @@ -844,7 +859,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > before = &fl->fl_next;
> > }
> >
> > - /* Process locks with this owner. */
> > + /* Process locks with this owner. */
> > while ((fl = *before) && posix_same_owner(request, fl)) {
> > /* Detect adjacent or overlapping regions (if same lock type)
> > */
> > @@ -930,10 +945,9 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > }
> >
> > /*
> > - * The above code only modifies existing locks in case of
> > - * merging or replacing. If new lock(s) need to be inserted
> > - * all modifications are done bellow this, so it's safe yet to
> > - * bail out.
> > + * The above code only modifies existing locks in case of merging or
> > + * replacing. If new lock(s) need to be inserted all modifications are
> > + * done below this, so it's safe yet to bail out.
> > */
> > error = -ENOLCK; /* "no luck" */
> > if (right && left == right && !new_fl2)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b9d7816..ae377e9 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -926,6 +926,12 @@ int locks_in_grace(struct net *);
> > /* that will die - we need it for nfs_lock_info */
> > #include <linux/nfs_fs_i.h>
> >
> > +/*
> > + * struct file_lock represents a generic "file lock". It's used to represent
> > + * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
> > + * note that the same struct is used to represent both a request for a lock and
> > + * the lock itself, but the same object is never used for both.
>
> Yes, and I do find that confusing. I wonder if there's a sensible way
> to use separate structs for the different uses.
>
> --b.
>

Yeah. I think that latter point accounts for a lot of the difficulty
the uninitiated have in understanding this code.

I considered creating a "struct lock_request" too. If I were designing
this code from scratch I certainly would do so, but it's harder to
justify such a change to the existing code.

It would mean a lot of churn in fs-specific lock routines, and we'd
have to completely revamp things like locks_copy_lock. I just don't see
that giving us a lot of benefit in the near term and I really want this
set to focus on concrete benefits.

We might consider it in the future though. I'll add a FIXME comment
here so we can record the idea for posterity.
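For the record, the rough shape I had in mind was something like the
struct below -- purely hypothetical, none of this exists, and the field
set is only illustrative. A request would carry just the fields that
describe the lock, with none of the list linkage or wait-queue
machinery that a granted lock needs:

/* hypothetical sketch -- not in any tree */
struct file_lock_request {
        struct file *fl_file;
        fl_owner_t fl_owner;
        unsigned int fl_pid;
        unsigned int fl_flags;
        unsigned char fl_type;
        loff_t fl_start;
        loff_t fl_end;
        const struct lock_manager_operations *fl_lmops;
};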

Thanks for the comments!

> > + */
> > struct file_lock {
> > struct file_lock *fl_next; /* singly linked list for this inode */
> > struct list_head fl_link; /* doubly linked list of all locks */
> > --
> > 1.7.1
> >


--
Jeff Layton <[email protected]>

2013-06-04 11:57:35

by Jim Rees

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] locks: scalability improvements for file locking

Jeff Layton wrote:

> Might be nice to look at some profiles to confirm all of that. I'd also
> be curious how much variation there was in the results above, as they're
> pretty close.
>

The above is just a random representative sample. The results are
pretty close when running this test, but I can average up several runs
and present the numbers. I plan to get a bare-metal test box on which
to run some more detailed testing and maybe some profiling this week.

Just contributing more runs into the mean doesn't tell us anything about the
variance. With numbers that close you need the variance to tell whether it's
a significant change.

2013-06-04 12:15:58

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 00/11] locks: scalability improvements for file locking

On Tue, 4 Jun 2013 07:56:44 -0400
Jim Rees <[email protected]> wrote:

> Jeff Layton wrote:
>
> > Might be nice to look at some profiles to confirm all of that. I'd also
> > be curious how much variation there was in the results above, as they're
> > pretty close.
> >
>
> The above is just a random representative sample. The results are
> pretty close when running this test, but I can average up several runs
> and present the numbers. I plan to get a bare-metal test box on which
> to run some more detailed testing and maybe some profiling this week.
>
> Just contributing more runs into the mean doesn't tell us anything about the
> variance. With numbers that close you need the variance to tell whether it's
> a significant change.

Thanks. I'll see if I can get some standard deviation numbers here, and
I'll do it on some bare metal to ensure that virtualization doesn't
skew any results.

FWIW, they were all consistently very close to one another when I ran
these tests, and the times were all consistently shorter than the
unpatched kernel.

That said, this test is pretty rough. Doing this with "time" measures
other things that aren't related to locking. So I'll also see if I
can come up with a way to measure the actual locking performance more
accurately too.
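Something like this, maybe -- a rough userspace sketch (the path and
iteration count are arbitrary) that times just the fcntl() calls with
CLOCK_MONOTONIC instead of wrapping the whole run in "time":

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 20000

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/tmp/lockfile";
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        struct timespec start, end;
        double elapsed;
        int fd, i;

        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (i = 0; i < ITERATIONS; i++) {
                /* take and then drop a whole-file write lock */
                fl.l_type = F_WRLCK;
                if (fcntl(fd, F_SETLKW, &fl) < 0) {
                        perror("F_SETLKW");
                        return 1;
                }
                fl.l_type = F_UNLCK;
                if (fcntl(fd, F_SETLK, &fl) < 0) {
                        perror("F_UNLCK");
                        return 1;
                }
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        elapsed = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%d lock/unlock cycles in %.6f seconds\n", ITERATIONS, elapsed);
        close(fd);
        return 0;
}

(May need -lrt for clock_gettime on older glibc.)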

--
Jeff Layton <[email protected]>

2013-06-04 14:21:27

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

Hi Jeff,

> There's no reason we have to protect the blocked_hash and file_lock_list
> with the same spinlock. With the tests I have, breaking it in two gives
> a barely measurable performance benefit, but it seems reasonable to make
> this locking as granular as possible.

as file_lock_{list,lock} is only used for debugging (/proc/locks) after this
change, I guess it would be possible to use RCU instead of a spinlock.

@others: this was the related discussion on IRC
(http://irclog.samba.org/) about this:

16:02 < metze> jlayton: do you have time to discuss your file_lock_lock
changes?
16:02 < jlayton> metze: sure, what's up?
16:03 < jlayton> metze: note that it won't help vl's thundering herd
problems...
16:03 < metze> is it correct that after your last patch file_lock_lock
is only used for /proc/locks?
16:03 < jlayton> well, it's only used to protect the list that is used
for /proc/locks
16:04 < jlayton> it still gets taken whenever a lock is acquired or
released in order to manipulate that list
16:04 < metze> would it be a good idea to use rcu instead of a spin lock?
16:04 < jlayton> I tried using RCU, but it turned out to slow everything
down
16:04 < jlayton> this is not a read-mostly workload unfortunately
16:04 < jlayton> so doing it with mutual exclusion turns out to be faster
16:04 < metze> ok
16:05 < jlayton> I might play around with it again sometime, but I don't
think it really helps. What we need to ensure is
that we optimize the code that manipulates that list,
and RCU list manipulations have larger overhead
16:06 < jlayton> metze: that's a good question though so if you want to
ask it on the list, please do
16:06 < jlayton> others will probably be wondering the same thing
16:08 < metze> maybe it's worth a comment in commit message and the code
16:08 < metze> btw, why don't you remove the ' /* Protects the
file_lock_list and the blocked_hash */' comment?

metze



2013-06-04 14:40:19

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

On Tue, 04 Jun 2013 16:19:53 +0200
"Stefan (metze) Metzmacher" <[email protected]> wrote:

> Hi Jeff,
>
> > There's no reason we have to protect the blocked_hash and file_lock_list
> > with the same spinlock. With the tests I have, breaking it in two gives
> > a barely measurable performance benefit, but it seems reasonable to make
> > this locking as granular as possible.
>
> as file_lock_{list,lock} is only used for debugging (/proc/locks) after this
> change, I guess it would be possible to use RCU instead of a spinlock.
>
> @others: this was the related discussion on IRC
> (http://irclog.samba.org/) about this:
>
> 16:02 < metze> jlayton: do you have time to discuss your file_lock_lock
> changes?
> 16:02 < jlayton> metze: sure, what's up?
> 16:03 < jlayton> metze: note that it won't help vl's thundering herd
> problems...
> 16:03 < metze> is it correct that after your last patch file_lock_lock
> is only used for /proc/locks?
> 16:03 < jlayton> well, it's only used to protect the list that is used
> for /proc/locks
> 16:04 < jlayton> it still gets taken whenever a lock is acquired or
> released in order to manipulate that list
> 16:04 < metze> would it be a good idea to use rcu instead of a spin lock?
> 16:04 < jlayton> I tried using RCU, but it turned out to slow everything
> down
> 16:04 < jlayton> this is not a read-mostly workload unfortunately
> 16:04 < jlayton> so doing it with mutual exclusion turns out to be faster
> 16:04 < metze> ok
> 16:05 < jlayton> I might play around with it again sometime, but I don't
> think it really helps. What we need to ensure is
> that we optimize the code that manipulates that list,
> and RCU list manipulations have larger overhead
> 16:06 < jlayton> metze: that's a good question though so if you want to
> ask it on the list, please do
> 16:06 < jlayton> others will probably be wondering the same thing
> 16:08 < metze> maybe it's worth a comment in commit message and the code


I'm not sure it's worth commenting about RCU in the code. In the future
it may be possible to do this -- who knows. It does seem to have been
consistently slower in my testing here though.

> 16:08 < metze> btw, why don't you remove the ' /* Protects the
> file_lock_list and the blocked_hash */' comment?
>

Removed in my git tree -- thanks for pointing that out.

--
Jeff Layton <[email protected]>



2013-06-04 14:46:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

Having RCU for modification-mostly workloads is never a good idea, so
I don't think it makes sense to mention it here.

If you care about the overhead it's worth trying to use per-cpu lists,
though.

2013-06-04 14:54:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

On Tue, Jun 04, 2013 at 07:46:40AM -0700, Christoph Hellwig wrote:
> Having RCU for modification mostly workloads never is a good idea, so
> I don't think it makes sense to mention it here.
>
> If you care about the overhead it's worth trying to use per-cpu lists,
> though.

Yes. The lock and unlock could happen on different CPUs--but I think
you can make the rule that the lock stays associated with the list it
was first put on, and then it's correct in general and hopefully quick
in the common case.

--b.

2013-06-04 14:58:01

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

On Tue, 4 Jun 2013 07:46:40 -0700
Christoph Hellwig <[email protected]> wrote:

> Having RCU for modification mostly workloads never is a good idea, so
> I don't think it makes sense to mention it here.
>
> If you care about the overhead it's worth trying to use per-cpu lists,
> though.
>

Yeah, I looked at those too in an earlier set and it did help some.
Moving to a hashtable for the blocked_list really seemed to help the
most there, but percpu lists with lglocks or something might help a lot
on the file_lock_list.

--
Jeff Layton <[email protected]>

2013-06-04 15:16:04

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 11/11] locks: give the blocked_hash its own spinlock

On Tue, 4 Jun 2013 10:53:22 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Jun 04, 2013 at 07:46:40AM -0700, Christoph Hellwig wrote:
> > Having RCU for modification mostly workloads never is a good idea, so
> > I don't think it makes sense to mention it here.
> >
> > If you care about the overhead it's worth trying to use per-cpu lists,
> > though.
>
> Yes. The lock and unlock could happen on different CPU's--but I think
> you can make the rule that the lock stays associated with the list it
> was first put on, and then it's correct in general and hopefully quick
> in the common case.
>

It's important to distinguish between the blocked_list/hash and the
file_lock_list. Yes, they use the same embedded list_head or hlist_node
in the file_lock, but they have very different characteristics and use
cases.

In the testing I did, having a hashtable for the blocked locks helped a
lot more than a percpu list. The trick with deadlock detection is to
ensure that you don't spend a lot of time walking the lists. Since we
do deadlock detection frequently, we need to optimize for that case.

For the file_lock_list, it might make sense to have percpu hlists with
an lglock however. The thing to note here is that the file_lock_list is
almost never read. Only /proc/locks uses it, so anything we can do to
optimize the list manipulation is probably worth it.
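Roughly like the sketch below, I'd think -- untested, and fl_link_cpu
would be a new field in struct file_lock recording which CPU's list the
lock first went on, along the lines of Bruce's suggestion:

#include <linux/fs.h>
#include <linux/lglock.h>
#include <linux/percpu.h>
#include <linux/smp.h>

static DEFINE_PER_CPU(struct hlist_head, file_lock_list);
DEFINE_STATIC_LGLOCK(file_lock_lglock);

static void locks_insert_global_locks(struct file_lock *fl)
{
        lg_local_lock(&file_lock_lglock);
        /* remember which CPU's list this lock went on */
        fl->fl_link_cpu = smp_processor_id();
        hlist_add_head(&fl->fl_link, this_cpu_ptr(&file_lock_list));
        lg_local_unlock(&file_lock_lglock);
}

static void locks_delete_global_locks(struct file_lock *fl)
{
        /* always come back to the list the lock was first put on */
        lg_local_lock_cpu(&file_lock_lglock, fl->fl_link_cpu);
        hlist_del_init(&fl->fl_link);
        lg_local_unlock_cpu(&file_lock_lglock, fl->fl_link_cpu);
}

/proc/locks would then take lg_global_lock() to walk all of the per-cpu
lists, which should be fine given how rarely it gets read.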

All that said, I'm leery about changing too much of this code too fast.
It's pretty old and poorly understood, so I think we need to be
cautious and take an incremental approach to changing it.

--
Jeff Layton <[email protected]>

2013-06-04 20:18:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 04/11] locks: make "added" in __posix_lock_file a bool

On Fri, May 31, 2013 at 11:07:27PM -0400, Jeff Layton wrote:
> ...save 3 bytes of stack space.
>
> Signed-off-by: Jeff Layton <[email protected]>

ACK.

--b.

> ---
> fs/locks.c | 9 +++++----
> 1 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index a7d2253..cef0e04 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -800,7 +800,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> struct file_lock *left = NULL;
> struct file_lock *right = NULL;
> struct file_lock **before;
> - int error, added = 0;
> + int error;
> + bool added = false;
>
> /*
> * We may need two file_lock structures for this operation,
> @@ -894,7 +895,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> continue;
> }
> request = fl;
> - added = 1;
> + added = true;
> }
> else {
> /* Processing for different lock types is a bit
> @@ -905,7 +906,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> if (fl->fl_start > request->fl_end)
> break;
> if (request->fl_type == F_UNLCK)
> - added = 1;
> + added = true;
> if (fl->fl_start < request->fl_start)
> left = fl;
> /* If the next lock in the list has a higher end
> @@ -935,7 +936,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> locks_release_private(fl);
> locks_copy_private(fl, request);
> request = fl;
> - added = 1;
> + added = true;
> }
> }
> /* Go on to next lock.
> --
> 1.7.1
>

2013-06-04 20:18:32

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 05/11] locks: encapsulate the fl_link list handling

On Fri, May 31, 2013 at 11:07:28PM -0400, Jeff Layton wrote:
> Move the fl_link list handling routines into a separate set of helpers.
> Also move the global list handling out of locks_insert_block, and into
> the caller that ends up triggering it as that allows us to eliminate the
> IS_POSIX check there.

ACK.

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 34 +++++++++++++++++++++++++++++-----
> 1 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index cef0e04..caca466 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -494,13 +494,38 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> return fl1->fl_owner == fl2->fl_owner;
> }
>
> +/* Remove a blocker or lock from one of the global lists */
> +static inline void
> +locks_insert_global_blocked(struct file_lock *waiter)
> +{
> + list_add(&waiter->fl_link, &blocked_list);
> +}
> +
> +static inline void
> +locks_delete_global_blocked(struct file_lock *waiter)
> +{
> + list_del_init(&waiter->fl_link);
> +}
> +
> +static inline void
> +locks_insert_global_locks(struct file_lock *waiter)
> +{
> + list_add_tail(&waiter->fl_link, &file_lock_list);
> +}
> +
> +static inline void
> +locks_delete_global_locks(struct file_lock *waiter)
> +{
> + list_del_init(&waiter->fl_link);
> +}
> +
> /* Remove waiter from blocker's block list.
> * When blocker ends up pointing to itself then the list is empty.
> */
> static void __locks_delete_block(struct file_lock *waiter)
> {
> list_del_init(&waiter->fl_block);
> - list_del_init(&waiter->fl_link);
> + locks_delete_global_blocked(waiter);
> waiter->fl_next = NULL;
> }
>
> @@ -524,8 +549,6 @@ static void locks_insert_block(struct file_lock *blocker,
> BUG_ON(!list_empty(&waiter->fl_block));
> list_add_tail(&waiter->fl_block, &blocker->fl_block);
> waiter->fl_next = blocker;
> - if (IS_POSIX(blocker))
> - list_add(&waiter->fl_link, &blocked_list);
> }
>
> /* Wake up processes blocked waiting for blocker.
> @@ -552,7 +575,7 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
> */
> static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
> {
> - list_add(&fl->fl_link, &file_lock_list);
> + locks_insert_global_locks(fl);
>
> fl->fl_nspid = get_pid(task_tgid(current));
>
> @@ -573,7 +596,7 @@ static void locks_delete_lock(struct file_lock **thisfl_p)
>
> *thisfl_p = fl->fl_next;
> fl->fl_next = NULL;
> - list_del_init(&fl->fl_link);
> + locks_delete_global_locks(fl);
>
> if (fl->fl_nspid) {
> put_pid(fl->fl_nspid);
> @@ -839,6 +862,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> goto out;
> error = FILE_LOCK_DEFERRED;
> locks_insert_block(fl, request);
> + locks_insert_global_blocked(request);
> goto out;
> }
> }
> --
> 1.7.1
>

2013-06-04 21:22:51

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] locks: convert to i_lock to protect i_flock list

On Fri, May 31, 2013 at 11:07:29PM -0400, Jeff Layton wrote:
> Having a global lock that protects all of this code is a clear
> scalability problem. Instead of doing that, move most of the code to be
> protected by the i_lock instead.
>
> The exceptions are the global lists that file_lock->fl_link sits on.
> Those still need a global lock of some sort, so wrap just those accesses
> in the file_lock_lock().
>
> Note that this requires a small change to the /proc/locks code. Instead
> of walking the fl_block list to look for locks blocked on the current
> lock, we must instead walk the global blocker list and skip any that
> aren't blocked on the current lock. Otherwise, we'd need to take the
> i_lock on each inode as we go and that would create a lock inversion
> problem.

locks_insert_lock sets fl_nspid after adding to the global list, so in
theory I think that could leave /proc/locks seeing a garbage fl_nspid
or fl_pid field.

I'm similarly worried about the deadlock detection code using the
fl_next value before it's set but I'm not sure what the consequences
are.

But I think both of those are fixable, and this looks OK otherwise?
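(For the first one, I'd guess it's just a matter of fully initializing
the lock before publishing it on the global list -- something like the
following, though I haven't tried it:)

static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
{
        fl->fl_nspid = get_pid(task_tgid(current));

        /* insert into the inode's i_flock list first... */
        fl->fl_next = *pos;
        *pos = fl;

        /* ...and only publish on the global list once it's fully set up */
        locks_insert_global_locks(fl);
}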

Patch-sequencing nit: it'd be nice to have the big-but-boring
lock_flocks->spin_lock(&i_lock) search-and-replace in a separate patch
from the (briefer, but more subtle) deadlock and /proc/locks changes.
Off the top of my head all I can think of is bulk-replace of
lock_flocks() by lock_flocks(inode) (which ignores inode argument), then
change definition of the lock_flocks(inode) in the patch which does the
other stuff, then bulk replace of lock_flocks(inode) by
spin_lock(&inode->i_lock), but maybe there's something simpler.

--b.


>
> Signed-off-by: Jeff Layton <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> Documentation/filesystems/Locking | 23 +++++--
> fs/afs/flock.c | 5 +-
> fs/ceph/locks.c | 2 +-
> fs/ceph/mds_client.c | 8 +-
> fs/cifs/cifsfs.c | 2 +-
> fs/cifs/file.c | 13 ++--
> fs/gfs2/file.c | 2 +-
> fs/lockd/svcsubs.c | 12 ++--
> fs/locks.c | 121 ++++++++++++++++++++-----------------
> fs/nfs/delegation.c | 11 ++--
> fs/nfs/nfs4state.c | 8 +-
> fs/nfsd/nfs4state.c | 8 +-
> include/linux/fs.h | 11 ----
> 13 files changed, 119 insertions(+), 107 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 0706d32..13f91ab 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -344,7 +344,7 @@ prototypes:
>
>
> locking rules:
> - file_lock_lock may block
> + inode->i_lock may block
> fl_copy_lock: yes no
> fl_release_private: maybe no
>
> @@ -357,12 +357,21 @@ prototypes:
> int (*lm_change)(struct file_lock **, int);
>
> locking rules:
> - file_lock_lock may block
> -lm_compare_owner: yes no
> -lm_notify: yes no
> -lm_grant: no no
> -lm_break: yes no
> -lm_change yes no
> +
> + inode->i_lock file_lock_lock may block
> +lm_compare_owner: yes maybe no
> +lm_notify: yes no no
> +lm_grant: no no no
> +lm_break: yes no no
> +lm_change yes no no
> +
> + ->lm_compare_owner is generally called with *an* inode->i_lock
> +held. It may not be the i_lock of the inode for either file_lock being
> +compared! This is the case with deadlock detection, since the code has
> +to chase down the owners of locks that may be entirely unrelated to the
> +one on which the lock is being acquired. For deadlock detection however,
> +the file_lock_lock is also held. The locks primarily ensure that neither
> +file_lock disappear out from under you while doing the comparison.
>
> --------------------------- buffer_head -----------------------------------
> prototypes:
> diff --git a/fs/afs/flock.c b/fs/afs/flock.c
> index 2497bf3..03fc0d1 100644
> --- a/fs/afs/flock.c
> +++ b/fs/afs/flock.c
> @@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key)
> */
> static int afs_do_setlk(struct file *file, struct file_lock *fl)
> {
> + struct inode *inode = file_inode(file);
> struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
> afs_lock_type_t type;
> struct key *key = file->private_data;
> @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl)
>
> type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
>
> /* make sure we've got a callback on this file and that our view of the
> * data version is up to date */
> @@ -420,7 +421,7 @@ given_lock:
> afs_vnode_fetch_status(vnode, NULL, key);
>
> error:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> _leave(" = %d", ret);
> return ret;
>
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index 202dd3d..cd0a664 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -194,7 +194,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
> * Encode the flock and fcntl locks for the given inode into the pagelist.
> * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
> * sequential flock locks.
> - * Must be called with lock_flocks() already held.
> + * Must be called with inode->i_lock already held.
> * If we encounter more of a specific lock type than expected,
> * we return the value 1.
> */
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 4f22671..ae621b5 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2482,13 +2482,13 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
>
> ceph_pagelist_set_cursor(pagelist, &trunc_point);
> do {
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> ceph_count_locks(inode, &num_fcntl_locks,
> &num_flock_locks);
> rec.v2.flock_len = (2*sizeof(u32) +
> (num_fcntl_locks+num_flock_locks) *
> sizeof(struct ceph_filelock));
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> /* pre-alloc pagelist */
> ceph_pagelist_truncate(pagelist, &trunc_point);
> @@ -2499,12 +2499,12 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
>
> /* encode locks */
> if (!err) {
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> err = ceph_encode_locks(inode,
> pagelist,
> num_fcntl_locks,
> num_flock_locks);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> }
> } while (err == -ENOSPC);
> } else {
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 72e4efe..29952d5 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -768,7 +768,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)
>
> static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
> {
> - /* note that this is called by vfs setlease with lock_flocks held
> + /* note that this is called by vfs setlease with i_lock held
> to protect *lease from going away */
> struct inode *inode = file_inode(file);
> struct cifsFileInfo *cfile = file->private_data;
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 44a4f18..0dd10cd 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -1092,6 +1092,7 @@ struct lock_to_push {
> static int
> cifs_push_posix_locks(struct cifsFileInfo *cfile)
> {
> + struct inode *inode = cfile->dentry->d_inode;
> struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
> struct file_lock *flock, **before;
> unsigned int count = 0, i = 0;
> @@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
>
> xid = get_xid();
>
> - lock_flocks();
> - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> + spin_lock(&inode->i_lock);
> + cifs_for_each_lock(inode, before) {
> if ((*before)->fl_flags & FL_POSIX)
> count++;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> INIT_LIST_HEAD(&locks_to_send);
>
> @@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> }
>
> el = locks_to_send.next;
> - lock_flocks();
> - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> + spin_lock(&inode->i_lock);
> + cifs_for_each_lock(inode, before) {
> flock = *before;
> if ((flock->fl_flags & FL_POSIX) == 0)
> continue;
> @@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> lck->offset = flock->fl_start;
> el = el->next;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) {
> int stored_rc;
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index acd1676..9e634e0 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -889,7 +889,7 @@ out_uninit:
> * cluster; until we do, disable leases (by just returning -EINVAL),
> * unless the administrator has requested purely local locking.
> *
> - * Locking: called under lock_flocks
> + * Locking: called under i_lock
> *
> * Returns: errno
> */
> diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> index 97e8741..dc5c759 100644
> --- a/fs/lockd/svcsubs.c
> +++ b/fs/lockd/svcsubs.c
> @@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file,
>
> again:
> file->f_locks = 0;
> - lock_flocks(); /* protects i_flock list */
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> if (fl->fl_lmops != &nlmsvc_lock_operations)
> continue;
> @@ -181,7 +181,7 @@ again:
> if (match(lockhost, host)) {
> struct file_lock lock = *fl;
>
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> lock.fl_type = F_UNLCK;
> lock.fl_start = 0;
> lock.fl_end = OFFSET_MAX;
> @@ -193,7 +193,7 @@ again:
> goto again;
> }
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> return 0;
> }
> @@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file)
> if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares)
> return 1;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> if (fl->fl_lmops == &nlmsvc_lock_operations) {
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return 1;
> }
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> file->f_locks = 0;
> return 0;
> }
> diff --git a/fs/locks.c b/fs/locks.c
> index caca466..055c06c 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -165,22 +165,9 @@ int lease_break_time = 45;
>
> static LIST_HEAD(file_lock_list);
> static LIST_HEAD(blocked_list);
> -static DEFINE_SPINLOCK(file_lock_lock);
> -
> -/*
> - * Protects the two list heads above, plus the inode->i_flock list
> - */
> -void lock_flocks(void)
> -{
> - spin_lock(&file_lock_lock);
> -}
> -EXPORT_SYMBOL_GPL(lock_flocks);
>
> -void unlock_flocks(void)
> -{
> - spin_unlock(&file_lock_lock);
> -}
> -EXPORT_SYMBOL_GPL(unlock_flocks);
> +/* Protects the two list heads above */
> +static DEFINE_SPINLOCK(file_lock_lock);
>
> static struct kmem_cache *filelock_cache __read_mostly;
>
> @@ -498,25 +485,33 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> static inline void
> locks_insert_global_blocked(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_add(&waiter->fl_link, &blocked_list);
> + spin_unlock(&file_lock_lock);
> }
>
> static inline void
> locks_delete_global_blocked(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_del_init(&waiter->fl_link);
> + spin_unlock(&file_lock_lock);
> }
>
> static inline void
> locks_insert_global_locks(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_add_tail(&waiter->fl_link, &file_lock_list);
> + spin_unlock(&file_lock_lock);
> }
>
> static inline void
> locks_delete_global_locks(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_del_init(&waiter->fl_link);
> + spin_unlock(&file_lock_lock);
> }
>
> /* Remove waiter from blocker's block list.
> @@ -533,9 +528,11 @@ static void __locks_delete_block(struct file_lock *waiter)
> */
> static void locks_delete_block(struct file_lock *waiter)
> {
> - lock_flocks();
> + struct inode *inode = file_inode(waiter->fl_file);
> +
> + spin_lock(&inode->i_lock);
> __locks_delete_block(waiter);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> }
>
> /* Insert waiter into blocker's block list.
> @@ -657,8 +654,9 @@ void
> posix_test_lock(struct file *filp, struct file_lock *fl)
> {
> struct file_lock *cfl;
> + struct inode *inode = file_inode(filp);
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) {
> if (!IS_POSIX(cfl))
> continue;
> @@ -671,7 +669,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> fl->fl_pid = pid_vnr(cfl->fl_nspid);
> } else
> fl->fl_type = F_UNLCK;
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return;
> }
> EXPORT_SYMBOL(posix_test_lock);
> @@ -719,14 +717,19 @@ static int posix_locks_deadlock(struct file_lock *caller_fl,
> struct file_lock *block_fl)
> {
> int i = 0;
> + int ret = 0;
>
> + spin_lock(&file_lock_lock);
> while ((block_fl = what_owner_is_waiting_for(block_fl))) {
> if (i++ > MAX_DEADLK_ITERATIONS)
> - return 0;
> - if (posix_same_owner(caller_fl, block_fl))
> - return 1;
> + break;
> + if (posix_same_owner(caller_fl, block_fl)) {
> + ++ret;
> + break;
> + }
> }
> - return 0;
> + spin_unlock(&file_lock_lock);
> + return ret;
> }
>
> /* Try to create a FLOCK lock on filp. We always insert new FLOCK locks
> @@ -750,7 +753,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> return -ENOMEM;
> }
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> if (request->fl_flags & FL_ACCESS)
> goto find_conflict;
>
> @@ -780,9 +783,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> * give it the opportunity to lock the file.
> */
> if (found) {
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> cond_resched();
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> }
>
> find_conflict:
> @@ -809,7 +812,7 @@ find_conflict:
> error = 0;
>
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> if (new_fl)
> locks_free_lock(new_fl);
> return error;
> @@ -839,7 +842,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> new_fl2 = locks_alloc_lock();
> }
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> /*
> * New lock request. Walk all POSIX locks and look for conflicts. If
> * there are any, either return -EAGAIN or put the request on the
> @@ -1012,7 +1015,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> locks_wake_up_blocks(left);
> }
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> /*
> * Free any unused locks.
> */
> @@ -1087,14 +1090,14 @@ int locks_mandatory_locked(struct inode *inode)
> /*
> * Search the lock list for this inode for any POSIX locks.
> */
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!IS_POSIX(fl))
> continue;
> if (fl->fl_owner != owner)
> break;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return fl ? -EAGAIN : 0;
> }
>
> @@ -1237,7 +1240,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
> if (IS_ERR(new_fl))
> return PTR_ERR(new_fl);
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
>
> time_out_leases(inode);
>
> @@ -1287,10 +1290,10 @@ restart:
> break_time++;
> }
> locks_insert_block(flock, new_fl);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> error = wait_event_interruptible_timeout(new_fl->fl_wait,
> !new_fl->fl_next, break_time);
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> __locks_delete_block(new_fl);
> if (error >= 0) {
> if (error == 0)
> @@ -1308,7 +1311,7 @@ restart:
> }
>
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> locks_free_lock(new_fl);
> return error;
> }
> @@ -1361,9 +1364,10 @@ EXPORT_SYMBOL(lease_get_mtime);
> int fcntl_getlease(struct file *filp)
> {
> struct file_lock *fl;
> + struct inode *inode = file_inode(filp);
> int type = F_UNLCK;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> time_out_leases(file_inode(filp));
> for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl);
> fl = fl->fl_next) {
> @@ -1372,7 +1376,7 @@ int fcntl_getlease(struct file *filp)
> break;
> }
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return type;
> }
>
> @@ -1466,7 +1470,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp)
> * The (input) flp->fl_lmops->lm_break function is required
> * by break_lease().
> *
> - * Called with file_lock_lock held.
> + * Called with inode->i_lock held.
> */
> int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> {
> @@ -1535,11 +1539,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
>
> int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> {
> + struct inode *inode = file_inode(filp);
> int error;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> error = __vfs_setlease(filp, arg, lease);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> return error;
> }
> @@ -1557,6 +1562,7 @@ static int do_fcntl_delete_lease(struct file *filp)
> static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> {
> struct file_lock *fl, *ret;
> + struct inode *inode = file_inode(filp);
> struct fasync_struct *new;
> int error;
>
> @@ -1570,10 +1576,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> return -ENOMEM;
> }
> ret = fl;
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> error = __vfs_setlease(filp, arg, &ret);
> if (error) {
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> locks_free_lock(fl);
> goto out_free_fasync;
> }
> @@ -1590,7 +1596,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> new = NULL;
>
> error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> out_free_fasync:
> if (new)
> @@ -2114,7 +2120,7 @@ void locks_remove_flock(struct file *filp)
> fl.fl_ops->fl_release_private(&fl);
> }
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> before = &inode->i_flock;
>
> while ((fl = *before) != NULL) {
> @@ -2132,7 +2138,7 @@ void locks_remove_flock(struct file *filp)
> }
> before = &fl->fl_next;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> }
>
> /**
> @@ -2145,14 +2151,15 @@ void locks_remove_flock(struct file *filp)
> int
> posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> {
> + struct inode *inode = file_inode(filp);
> int status = 0;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> if (waiter->fl_next)
> __locks_delete_block(waiter);
> else
> status = -ENOENT;
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return status;
> }
>
> @@ -2257,8 +2264,10 @@ static int locks_show(struct seq_file *f, void *v)
>
> lock_get_status(f, fl, *((loff_t *)f->private), "");
>
> - list_for_each_entry(bfl, &fl->fl_block, fl_block)
> - lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> + list_for_each_entry(bfl, &blocked_list, fl_link) {
> + if (bfl->fl_next == fl)
> + lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> + }
>
> return 0;
> }
> @@ -2267,7 +2276,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
> {
> loff_t *p = f->private;
>
> - lock_flocks();
> + spin_lock(&file_lock_lock);
> *p = (*pos + 1);
> return seq_list_start(&file_lock_list, *pos);
> }
> @@ -2281,7 +2290,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
>
> static void locks_stop(struct seq_file *f, void *v)
> {
> - unlock_flocks();
> + spin_unlock(&file_lock_lock);
> }
>
> static const struct seq_operations locks_seq_operations = {
> @@ -2328,7 +2337,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_flocks();
> +
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if (fl->fl_type == F_RDLCK)
> @@ -2345,7 +2355,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return result;
> }
>
> @@ -2368,7 +2378,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_flocks();
> +
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> @@ -2383,7 +2394,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return result;
> }
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index 57db324..43ee7f9 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -73,20 +73,21 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_
> if (inode->i_flock == NULL)
> goto out;
>
> - /* Protect inode->i_flock using the file locks lock */
> - lock_flocks();
> + /* Protect inode->i_flock using the i_lock */
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> continue;
> if (nfs_file_open_context(fl->fl_file) != ctx)
> continue;
> - unlock_flocks();
> + /* FIXME: safe to drop lock here while walking list? */
> + spin_unlock(&inode->i_lock);
> status = nfs4_lock_delegation_recall(fl, state, stateid);
> if (status < 0)
> goto out;
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> out:
> return status;
> }
> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> index 1fab140..ff10b4a 100644
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> /* Guard against delegation returns and new lock/unlock calls */
> down_write(&nfsi->rwsem);
> /* Protect inode->i_flock using the BKL */
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> continue;
> if (nfs_file_open_context(fl->fl_file)->state != state)
> continue;
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> status = ops->recover_lock(state, fl);
> switch (status) {
> case 0:
> @@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> /* kill_proc(fl->fl_pid, SIGLOST, 1); */
> status = 0;
> }
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> out:
> up_write(&nfsi->rwsem);
> return status;
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 316ec84..f170518 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>
> list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);
>
> - /* only place dl_time is set. protected by lock_flocks*/
> + /* Only place dl_time is set; protected by i_lock: */
> dp->dl_time = get_seconds();
>
> nfsd4_cb_recall(dp);
> }
>
> -/* Called from break_lease() with lock_flocks() held. */
> +/* Called from break_lease() with i_lock held. */
> static void nfsd_break_deleg_cb(struct file_lock *fl)
> {
> struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner;
> @@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> struct inode *inode = filp->fi_inode;
> int status = 0;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) {
> if ((*flpp)->fl_owner == (fl_owner_t)lowner) {
> status = 1;
> @@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> }
> }
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return status;
> }
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ae377e9..ccb44ea 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1012,8 +1012,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
> extern int lease_modify(struct file_lock **, int);
> extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
> extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
> -extern void lock_flocks(void);
> -extern void unlock_flocks(void);
> #else /* !CONFIG_FILE_LOCKING */
> static inline int fcntl_getlk(struct file *file, struct flock __user *user)
> {
> @@ -1155,15 +1153,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> {
> return 1;
> }
> -
> -static inline void lock_flocks(void)
> -{
> -}
> -
> -static inline void unlock_flocks(void)
> -{
> -}
> -
> #endif /* !CONFIG_FILE_LOCKING */
>
>
> --
> 1.7.1
>

2013-06-04 21:59:18

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

On Fri, May 31, 2013 at 11:07:30PM -0400, Jeff Layton wrote:
> Currently, when there is a lot of lock contention the kernel spends an
> inordinate amount of time taking blocked locks off of the global
> blocked_list and then putting them right back on again. When all of this
> code was protected by a single lock, then it didn't matter much, but now
> it means a lot of file_lock_lock thrashing.
>
> Optimize this a bit by deferring the removal from the blocked_list until
> we're either applying or cancelling the lock. By doing this, and using a
> lockless list_empty check, we can avoid taking the file_lock_lock in
> many cases.
>
> Because the fl_link check is lockless, we must ensure that only the task
> that "owns" the request manipulates the fl_link. Also, with this change,
> it's possible that we'll see an entry on the blocked_list that has a
> NULL fl_next pointer. In that event, just ignore it and continue walking
> the list.

OK, that sounds safe as in it shouldn't crash, but does the deadlock
detection still work, or can it miss loops?

Those locks whose fl_next is temporarily NULL would previously not have been on
the list at all, OK, but... I'm having trouble reasoning about how this
works now.

Previously a single lock was held uninterrupted across
posix_locks_deadlock and locks_insert_block(), which guaranteed we
shouldn't be adding a loop. Is that still true?

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 29 +++++++++++++++++++++++------
> 1 files changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 055c06c..fc35b9e 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -520,7 +520,6 @@ locks_delete_global_locks(struct file_lock *waiter)
> static void __locks_delete_block(struct file_lock *waiter)
> {
> list_del_init(&waiter->fl_block);
> - locks_delete_global_blocked(waiter);
> waiter->fl_next = NULL;
> }
>
> @@ -704,13 +703,16 @@ EXPORT_SYMBOL(posix_test_lock);
> /* Find a lock that the owner of the given block_fl is blocking on. */
> static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> {
> - struct file_lock *fl;
> + struct file_lock *fl, *ret = NULL;
>
> list_for_each_entry(fl, &blocked_list, fl_link) {
> - if (posix_same_owner(fl, block_fl))
> - return fl->fl_next;
> + if (posix_same_owner(fl, block_fl)) {
> + ret = fl->fl_next;
> + if (likely(ret))
> + break;
> + }
> }
> - return NULL;
> + return ret;
> }
>
> static int posix_locks_deadlock(struct file_lock *caller_fl,
> @@ -865,7 +867,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> goto out;
> error = FILE_LOCK_DEFERRED;
> locks_insert_block(fl, request);
> - locks_insert_global_blocked(request);
> + if (list_empty(&request->fl_link))
> + locks_insert_global_blocked(request);
> goto out;
> }
> }
> @@ -876,6 +879,16 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> goto out;
>
> /*
> + * Now that we know the request is no longer blocked, we can take it
> + * off the global list. Some callers send down partially initialized
> + * requests, so we only do this if FL_SLEEP is set. Also, avoid taking
> + * the lock if the list is empty, as that indicates a request that
> + * never blocked.
> + */
> + if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
> + locks_delete_global_blocked(request);
> +
> + /*
> * Find the first old lock with the same owner as the new lock.
> */
>
> @@ -1069,6 +1082,7 @@ int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
> continue;
>
> locks_delete_block(fl);
> + locks_delete_global_blocked(fl);
> break;
> }
> return error;
> @@ -1147,6 +1161,7 @@ int locks_mandatory_area(int read_write, struct inode *inode,
> }
>
> locks_delete_block(&fl);
> + locks_delete_global_blocked(&fl);
> break;
> }
>
> @@ -1859,6 +1874,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
> continue;
>
> locks_delete_block(fl);
> + locks_delete_global_blocked(fl);
> break;
> }
>
> @@ -2160,6 +2176,7 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> else
> status = -ENOENT;
> spin_unlock(&inode->i_lock);
> + locks_delete_global_blocked(waiter);
> return status;
> }
>
> --
> 1.7.1
>

2013-06-04 22:00:21

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 08/11] locks: convert fl_link to a hlist_node

On Fri, May 31, 2013 at 11:07:31PM -0400, Jeff Layton wrote:
> Testing has shown that iterating over the blocked_list for deadlock
> detection turns out to be a bottleneck. In order to alleviate that,
> begin the process of turning it into a hashtable. We start by turning
> the fl_link into a hlist_node and the global lists into hlists. A later
> patch will do the conversion of the blocked_list to a hashtable.

Even simpler would be if we could add a pointer to the (well, a) lock
that a lockowner is blocking on, and then we'd just have to follow a
pointer. I haven't thought that through, though; perhaps that's hard to
make work....

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 32 ++++++++++++++++----------------
> include/linux/fs.h | 2 +-
> 2 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index fc35b9e..5ed056b 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -163,8 +163,8 @@ int lease_break_time = 45;
> #define for_each_lock(inode, lockp) \
> for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>
> -static LIST_HEAD(file_lock_list);
> -static LIST_HEAD(blocked_list);
> +static HLIST_HEAD(file_lock_list);
> +static HLIST_HEAD(blocked_list);
>
> /* Protects the two list heads above */
> static DEFINE_SPINLOCK(file_lock_lock);
> @@ -173,7 +173,7 @@ static struct kmem_cache *filelock_cache __read_mostly;
>
> static void locks_init_lock_heads(struct file_lock *fl)
> {
> - INIT_LIST_HEAD(&fl->fl_link);
> + INIT_HLIST_NODE(&fl->fl_link);
> INIT_LIST_HEAD(&fl->fl_block);
> init_waitqueue_head(&fl->fl_wait);
> }
> @@ -207,7 +207,7 @@ void locks_free_lock(struct file_lock *fl)
> {
> BUG_ON(waitqueue_active(&fl->fl_wait));
> BUG_ON(!list_empty(&fl->fl_block));
> - BUG_ON(!list_empty(&fl->fl_link));
> + BUG_ON(!hlist_unhashed(&fl->fl_link));
>
> locks_release_private(fl);
> kmem_cache_free(filelock_cache, fl);
> @@ -486,7 +486,7 @@ static inline void
> locks_insert_global_blocked(struct file_lock *waiter)
> {
> spin_lock(&file_lock_lock);
> - list_add(&waiter->fl_link, &blocked_list);
> + hlist_add_head(&waiter->fl_link, &blocked_list);
> spin_unlock(&file_lock_lock);
> }
>
> @@ -494,7 +494,7 @@ static inline void
> locks_delete_global_blocked(struct file_lock *waiter)
> {
> spin_lock(&file_lock_lock);
> - list_del_init(&waiter->fl_link);
> + hlist_del_init(&waiter->fl_link);
> spin_unlock(&file_lock_lock);
> }
>
> @@ -502,7 +502,7 @@ static inline void
> locks_insert_global_locks(struct file_lock *waiter)
> {
> spin_lock(&file_lock_lock);
> - list_add_tail(&waiter->fl_link, &file_lock_list);
> + hlist_add_head(&waiter->fl_link, &file_lock_list);
> spin_unlock(&file_lock_lock);
> }
>
> @@ -510,7 +510,7 @@ static inline void
> locks_delete_global_locks(struct file_lock *waiter)
> {
> spin_lock(&file_lock_lock);
> - list_del_init(&waiter->fl_link);
> + hlist_del_init(&waiter->fl_link);
> spin_unlock(&file_lock_lock);
> }
>
> @@ -705,7 +705,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> {
> struct file_lock *fl, *ret = NULL;
>
> - list_for_each_entry(fl, &blocked_list, fl_link) {
> + hlist_for_each_entry(fl, &blocked_list, fl_link) {
> if (posix_same_owner(fl, block_fl)) {
> ret = fl->fl_next;
> if (likely(ret))
> @@ -867,7 +867,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> goto out;
> error = FILE_LOCK_DEFERRED;
> locks_insert_block(fl, request);
> - if (list_empty(&request->fl_link))
> + if (hlist_unhashed(&request->fl_link))
> locks_insert_global_blocked(request);
> goto out;
> }
> @@ -882,10 +882,10 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> * Now that we know the request is no longer blocked, we can take it
> * off the global list. Some callers send down partially initialized
> * requests, so we only do this if FL_SLEEP is set. Also, avoid taking
> - * the lock if the list is empty, as that indicates a request that
> + * the lock if the hlist is unhashed, as that indicates a request that
> * never blocked.
> */
> - if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
> + if ((request->fl_flags & FL_SLEEP) && !hlist_unhashed(&request->fl_link))
> locks_delete_global_blocked(request);
>
> /*
> @@ -2277,11 +2277,11 @@ static int locks_show(struct seq_file *f, void *v)
> {
> struct file_lock *fl, *bfl;
>
> - fl = list_entry(v, struct file_lock, fl_link);
> + fl = hlist_entry(v, struct file_lock, fl_link);
>
> lock_get_status(f, fl, *((loff_t *)f->private), "");
>
> - list_for_each_entry(bfl, &blocked_list, fl_link) {
> + hlist_for_each_entry(bfl, &blocked_list, fl_link) {
> if (bfl->fl_next == fl)
> lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> }
> @@ -2295,14 +2295,14 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
>
> spin_lock(&file_lock_lock);
> *p = (*pos + 1);
> - return seq_list_start(&file_lock_list, *pos);
> + return seq_hlist_start(&file_lock_list, *pos);
> }
>
> static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
> {
> loff_t *p = f->private;
> ++*p;
> - return seq_list_next(v, &file_lock_list, pos);
> + return seq_hlist_next(v, &file_lock_list, pos);
> }
>
> static void locks_stop(struct seq_file *f, void *v)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ccb44ea..07a009e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -934,7 +934,7 @@ int locks_in_grace(struct net *);
> */
> struct file_lock {
> struct file_lock *fl_next; /* singly linked list for this inode */
> - struct list_head fl_link; /* doubly linked list of all locks */
> + struct hlist_node fl_link; /* node in global lists */
> struct list_head fl_block; /* circular list of blocked processes */
> fl_owner_t fl_owner;
> unsigned int fl_flags;
> --
> 1.7.1
>

2013-06-05 00:46:50

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 06/11] locks: convert to i_lock to protect i_flock list

On Tue, 4 Jun 2013 17:22:08 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 31, 2013 at 11:07:29PM -0400, Jeff Layton wrote:
> > Having a global lock that protects all of this code is a clear
> > scalability problem. Instead of doing that, move most of the code to be
> > protected by the i_lock instead.
> >
> > The exceptions are the global lists that file_lock->fl_link sits on.
> > Those still need a global lock of some sort, so wrap just those accesses
> > in the file_lock_lock().
> >
> > Note that this requires a small change to the /proc/locks code. Instead
> > of walking the fl_block list to look for locks blocked on the current
> > lock, we must instead walk the global blocker list and skip any that
> > aren't blocked on the current lock. Otherwise, we'd need to take the
> > i_lock on each inode as we go and that would create a lock inversion
> > problem.
>
> locks_insert_lock sets fl_nspid after adding to the global list, so in
> theory I think that could leave /proc/locks seeing a garbage fl_nspid
> or fl_pid field.
>

Well spotted. I think we can simply fix that by ensuring that we do the
insert onto the global list as the last thing in locks_insert_lock and
remove it from the global list first thing in locks_delete_lock. I'll
do that and test it out before I send the next iteration.
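
Roughly, the idea would be something like this (illustrative sketch only,
not the actual fs/locks.c bodies; the real locks_insert_lock and
locks_delete_lock do more than what's shown here):

static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
{
	fl->fl_nspid = get_pid(task_tgid(current));

	/* link into the per-inode i_flock list first */
	fl->fl_next = *pos;
	*pos = fl;

	/* only now make it visible on the global list, fully initialized */
	locks_insert_global_locks(fl);
}

static void locks_delete_lock(struct file_lock **thisfl_p)
{
	struct file_lock *fl = *thisfl_p;

	/* hide it from /proc/locks before any other teardown */
	locks_delete_global_locks(fl);

	/* unlink from the per-inode list; the remaining teardown is omitted */
	*thisfl_p = fl->fl_next;
	fl->fl_next = NULL;
}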

> I'm similarly worried about the deadlock detection code using the
> fl_next value before it's set but I'm not sure what the consequences
> are.
>

This I'm less clear on. We already do the insert onto the global
blocked list after locks_insert_block. Am I missing something here?

> But I think both of those are fixable, and this looks OK otherwise?
>
> Patch-sequencing nit: it'd be nice to have the big-but-boring
> lock_flocks->spin_lock(&i_lock) search-and-replace in a separate patch
> from the (briefer, but more subtle) deadlock and /proc/locks changes.
> Off the top of my head all I can think of is a bulk replace of
> lock_flocks() by lock_flocks(inode) (which ignores the inode argument), then
> change the definition of lock_flocks(inode) in the patch which does the
> other stuff, then a bulk replace of lock_flocks(inode) by
> spin_lock(&inode->i_lock), but maybe there's something simpler.
>

Ok, maybe I can do some of these other changes before we switch to the
i_lock, and then do the big boring change after that. I'll see if I can
cook up a reasonable patch for that for the next iteration.

> --b.
>
>
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > Signed-off-by: J. Bruce Fields <[email protected]>
> > ---
> > Documentation/filesystems/Locking | 23 +++++--
> > fs/afs/flock.c | 5 +-
> > fs/ceph/locks.c | 2 +-
> > fs/ceph/mds_client.c | 8 +-
> > fs/cifs/cifsfs.c | 2 +-
> > fs/cifs/file.c | 13 ++--
> > fs/gfs2/file.c | 2 +-
> > fs/lockd/svcsubs.c | 12 ++--
> > fs/locks.c | 121 ++++++++++++++++++++-----------------
> > fs/nfs/delegation.c | 11 ++--
> > fs/nfs/nfs4state.c | 8 +-
> > fs/nfsd/nfs4state.c | 8 +-
> > include/linux/fs.h | 11 ----
> > 13 files changed, 119 insertions(+), 107 deletions(-)
> >
> > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > index 0706d32..13f91ab 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -344,7 +344,7 @@ prototypes:
> >
> >
> > locking rules:
> > - file_lock_lock may block
> > + inode->i_lock may block
> > fl_copy_lock: yes no
> > fl_release_private: maybe no
> >
> > @@ -357,12 +357,21 @@ prototypes:
> > int (*lm_change)(struct file_lock **, int);
> >
> > locking rules:
> > - file_lock_lock may block
> > -lm_compare_owner: yes no
> > -lm_notify: yes no
> > -lm_grant: no no
> > -lm_break: yes no
> > -lm_change yes no
> > +
> > + inode->i_lock file_lock_lock may block
> > +lm_compare_owner: yes maybe no
> > +lm_notify: yes no no
> > +lm_grant: no no no
> > +lm_break: yes no no
> > +lm_change yes no no
> > +
> > + ->lm_compare_owner is generally called with *an* inode->i_lock
> > +held. It may not be the i_lock of the inode for either file_lock being
> > +compared! This is the case with deadlock detection, since the code has
> > +to chase down the owners of locks that may be entirely unrelated to the
> > +one on which the lock is being acquired. For deadlock detection however,
> > +the file_lock_lock is also held. The locks primarily ensure that neither
> > +file_lock disappear out from under you while doing the comparison.
> >
> > --------------------------- buffer_head -----------------------------------
> > prototypes:
> > diff --git a/fs/afs/flock.c b/fs/afs/flock.c
> > index 2497bf3..03fc0d1 100644
> > --- a/fs/afs/flock.c
> > +++ b/fs/afs/flock.c
> > @@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key)
> > */
> > static int afs_do_setlk(struct file *file, struct file_lock *fl)
> > {
> > + struct inode *inode = file_inode(file);
> > struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
> > afs_lock_type_t type;
> > struct key *key = file->private_data;
> > @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl)
> >
> > type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> >
> > /* make sure we've got a callback on this file and that our view of the
> > * data version is up to date */
> > @@ -420,7 +421,7 @@ given_lock:
> > afs_vnode_fetch_status(vnode, NULL, key);
> >
> > error:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > _leave(" = %d", ret);
> > return ret;
> >
> > diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> > index 202dd3d..cd0a664 100644
> > --- a/fs/ceph/locks.c
> > +++ b/fs/ceph/locks.c
> > @@ -194,7 +194,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
> > * Encode the flock and fcntl locks for the given inode into the pagelist.
> > * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
> > * sequential flock locks.
> > - * Must be called with lock_flocks() already held.
> > + * Must be called with inode->i_lock already held.
> > * If we encounter more of a specific lock type than expected,
> > * we return the value 1.
> > */
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 4f22671..ae621b5 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2482,13 +2482,13 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
> >
> > ceph_pagelist_set_cursor(pagelist, &trunc_point);
> > do {
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > ceph_count_locks(inode, &num_fcntl_locks,
> > &num_flock_locks);
> > rec.v2.flock_len = (2*sizeof(u32) +
> > (num_fcntl_locks+num_flock_locks) *
> > sizeof(struct ceph_filelock));
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > /* pre-alloc pagelist */
> > ceph_pagelist_truncate(pagelist, &trunc_point);
> > @@ -2499,12 +2499,12 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
> >
> > /* encode locks */
> > if (!err) {
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > err = ceph_encode_locks(inode,
> > pagelist,
> > num_fcntl_locks,
> > num_flock_locks);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > }
> > } while (err == -ENOSPC);
> > } else {
> > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > index 72e4efe..29952d5 100644
> > --- a/fs/cifs/cifsfs.c
> > +++ b/fs/cifs/cifsfs.c
> > @@ -768,7 +768,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)
> >
> > static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
> > {
> > - /* note that this is called by vfs setlease with lock_flocks held
> > + /* note that this is called by vfs setlease with i_lock held
> > to protect *lease from going away */
> > struct inode *inode = file_inode(file);
> > struct cifsFileInfo *cfile = file->private_data;
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > index 44a4f18..0dd10cd 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -1092,6 +1092,7 @@ struct lock_to_push {
> > static int
> > cifs_push_posix_locks(struct cifsFileInfo *cfile)
> > {
> > + struct inode *inode = cfile->dentry->d_inode;
> > struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
> > struct file_lock *flock, **before;
> > unsigned int count = 0, i = 0;
> > @@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> >
> > xid = get_xid();
> >
> > - lock_flocks();
> > - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> > + spin_lock(&inode->i_lock);
> > + cifs_for_each_lock(inode, before) {
> > if ((*before)->fl_flags & FL_POSIX)
> > count++;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > INIT_LIST_HEAD(&locks_to_send);
> >
> > @@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> > }
> >
> > el = locks_to_send.next;
> > - lock_flocks();
> > - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> > + spin_lock(&inode->i_lock);
> > + cifs_for_each_lock(inode, before) {
> > flock = *before;
> > if ((flock->fl_flags & FL_POSIX) == 0)
> > continue;
> > @@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> > lck->offset = flock->fl_start;
> > el = el->next;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) {
> > int stored_rc;
> > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > index acd1676..9e634e0 100644
> > --- a/fs/gfs2/file.c
> > +++ b/fs/gfs2/file.c
> > @@ -889,7 +889,7 @@ out_uninit:
> > * cluster; until we do, disable leases (by just returning -EINVAL),
> > * unless the administrator has requested purely local locking.
> > *
> > - * Locking: called under lock_flocks
> > + * Locking: called under i_lock
> > *
> > * Returns: errno
> > */
> > diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> > index 97e8741..dc5c759 100644
> > --- a/fs/lockd/svcsubs.c
> > +++ b/fs/lockd/svcsubs.c
> > @@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file,
> >
> > again:
> > file->f_locks = 0;
> > - lock_flocks(); /* protects i_flock list */
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> > if (fl->fl_lmops != &nlmsvc_lock_operations)
> > continue;
> > @@ -181,7 +181,7 @@ again:
> > if (match(lockhost, host)) {
> > struct file_lock lock = *fl;
> >
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > lock.fl_type = F_UNLCK;
> > lock.fl_start = 0;
> > lock.fl_end = OFFSET_MAX;
> > @@ -193,7 +193,7 @@ again:
> > goto again;
> > }
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > return 0;
> > }
> > @@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file)
> > if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares)
> > return 1;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> > if (fl->fl_lmops == &nlmsvc_lock_operations) {
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return 1;
> > }
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > file->f_locks = 0;
> > return 0;
> > }
> > diff --git a/fs/locks.c b/fs/locks.c
> > index caca466..055c06c 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -165,22 +165,9 @@ int lease_break_time = 45;
> >
> > static LIST_HEAD(file_lock_list);
> > static LIST_HEAD(blocked_list);
> > -static DEFINE_SPINLOCK(file_lock_lock);
> > -
> > -/*
> > - * Protects the two list heads above, plus the inode->i_flock list
> > - */
> > -void lock_flocks(void)
> > -{
> > - spin_lock(&file_lock_lock);
> > -}
> > -EXPORT_SYMBOL_GPL(lock_flocks);
> >
> > -void unlock_flocks(void)
> > -{
> > - spin_unlock(&file_lock_lock);
> > -}
> > -EXPORT_SYMBOL_GPL(unlock_flocks);
> > +/* Protects the two list heads above */
> > +static DEFINE_SPINLOCK(file_lock_lock);
> >
> > static struct kmem_cache *filelock_cache __read_mostly;
> >
> > @@ -498,25 +485,33 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> > static inline void
> > locks_insert_global_blocked(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_add(&waiter->fl_link, &blocked_list);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static inline void
> > locks_delete_global_blocked(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_del_init(&waiter->fl_link);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static inline void
> > locks_insert_global_locks(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_add_tail(&waiter->fl_link, &file_lock_list);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static inline void
> > locks_delete_global_locks(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_del_init(&waiter->fl_link);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > /* Remove waiter from blocker's block list.
> > @@ -533,9 +528,11 @@ static void __locks_delete_block(struct file_lock *waiter)
> > */
> > static void locks_delete_block(struct file_lock *waiter)
> > {
> > - lock_flocks();
> > + struct inode *inode = file_inode(waiter->fl_file);
> > +
> > + spin_lock(&inode->i_lock);
> > __locks_delete_block(waiter);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > }
> >
> > /* Insert waiter into blocker's block list.
> > @@ -657,8 +654,9 @@ void
> > posix_test_lock(struct file *filp, struct file_lock *fl)
> > {
> > struct file_lock *cfl;
> > + struct inode *inode = file_inode(filp);
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) {
> > if (!IS_POSIX(cfl))
> > continue;
> > @@ -671,7 +669,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> > fl->fl_pid = pid_vnr(cfl->fl_nspid);
> > } else
> > fl->fl_type = F_UNLCK;
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return;
> > }
> > EXPORT_SYMBOL(posix_test_lock);
> > @@ -719,14 +717,19 @@ static int posix_locks_deadlock(struct file_lock *caller_fl,
> > struct file_lock *block_fl)
> > {
> > int i = 0;
> > + int ret = 0;
> >
> > + spin_lock(&file_lock_lock);
> > while ((block_fl = what_owner_is_waiting_for(block_fl))) {
> > if (i++ > MAX_DEADLK_ITERATIONS)
> > - return 0;
> > - if (posix_same_owner(caller_fl, block_fl))
> > - return 1;
> > + break;
> > + if (posix_same_owner(caller_fl, block_fl)) {
> > + ++ret;
> > + break;
> > + }
> > }
> > - return 0;
> > + spin_unlock(&file_lock_lock);
> > + return ret;
> > }
> >
> > /* Try to create a FLOCK lock on filp. We always insert new FLOCK locks
> > @@ -750,7 +753,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> > return -ENOMEM;
> > }
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > if (request->fl_flags & FL_ACCESS)
> > goto find_conflict;
> >
> > @@ -780,9 +783,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> > * give it the opportunity to lock the file.
> > */
> > if (found) {
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > cond_resched();
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > }
> >
> > find_conflict:
> > @@ -809,7 +812,7 @@ find_conflict:
> > error = 0;
> >
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > if (new_fl)
> > locks_free_lock(new_fl);
> > return error;
> > @@ -839,7 +842,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > new_fl2 = locks_alloc_lock();
> > }
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > /*
> > * New lock request. Walk all POSIX locks and look for conflicts. If
> > * there are any, either return -EAGAIN or put the request on the
> > @@ -1012,7 +1015,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > locks_wake_up_blocks(left);
> > }
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > /*
> > * Free any unused locks.
> > */
> > @@ -1087,14 +1090,14 @@ int locks_mandatory_locked(struct inode *inode)
> > /*
> > * Search the lock list for this inode for any POSIX locks.
> > */
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (!IS_POSIX(fl))
> > continue;
> > if (fl->fl_owner != owner)
> > break;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return fl ? -EAGAIN : 0;
> > }
> >
> > @@ -1237,7 +1240,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
> > if (IS_ERR(new_fl))
> > return PTR_ERR(new_fl);
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> >
> > time_out_leases(inode);
> >
> > @@ -1287,10 +1290,10 @@ restart:
> > break_time++;
> > }
> > locks_insert_block(flock, new_fl);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > error = wait_event_interruptible_timeout(new_fl->fl_wait,
> > !new_fl->fl_next, break_time);
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > __locks_delete_block(new_fl);
> > if (error >= 0) {
> > if (error == 0)
> > @@ -1308,7 +1311,7 @@ restart:
> > }
> >
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > locks_free_lock(new_fl);
> > return error;
> > }
> > @@ -1361,9 +1364,10 @@ EXPORT_SYMBOL(lease_get_mtime);
> > int fcntl_getlease(struct file *filp)
> > {
> > struct file_lock *fl;
> > + struct inode *inode = file_inode(filp);
> > int type = F_UNLCK;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > time_out_leases(file_inode(filp));
> > for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl);
> > fl = fl->fl_next) {
> > @@ -1372,7 +1376,7 @@ int fcntl_getlease(struct file *filp)
> > break;
> > }
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return type;
> > }
> >
> > @@ -1466,7 +1470,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp)
> > * The (input) flp->fl_lmops->lm_break function is required
> > * by break_lease().
> > *
> > - * Called with file_lock_lock held.
> > + * Called with inode->i_lock held.
> > */
> > int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> > {
> > @@ -1535,11 +1539,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> >
> > int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> > {
> > + struct inode *inode = file_inode(filp);
> > int error;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > error = __vfs_setlease(filp, arg, lease);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > return error;
> > }
> > @@ -1557,6 +1562,7 @@ static int do_fcntl_delete_lease(struct file *filp)
> > static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> > {
> > struct file_lock *fl, *ret;
> > + struct inode *inode = file_inode(filp);
> > struct fasync_struct *new;
> > int error;
> >
> > @@ -1570,10 +1576,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> > return -ENOMEM;
> > }
> > ret = fl;
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > error = __vfs_setlease(filp, arg, &ret);
> > if (error) {
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > locks_free_lock(fl);
> > goto out_free_fasync;
> > }
> > @@ -1590,7 +1596,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> > new = NULL;
> >
> > error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > out_free_fasync:
> > if (new)
> > @@ -2114,7 +2120,7 @@ void locks_remove_flock(struct file *filp)
> > fl.fl_ops->fl_release_private(&fl);
> > }
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > before = &inode->i_flock;
> >
> > while ((fl = *before) != NULL) {
> > @@ -2132,7 +2138,7 @@ void locks_remove_flock(struct file *filp)
> > }
> > before = &fl->fl_next;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > }
> >
> > /**
> > @@ -2145,14 +2151,15 @@ void locks_remove_flock(struct file *filp)
> > int
> > posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> > {
> > + struct inode *inode = file_inode(filp);
> > int status = 0;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > if (waiter->fl_next)
> > __locks_delete_block(waiter);
> > else
> > status = -ENOENT;
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return status;
> > }
> >
> > @@ -2257,8 +2264,10 @@ static int locks_show(struct seq_file *f, void *v)
> >
> > lock_get_status(f, fl, *((loff_t *)f->private), "");
> >
> > - list_for_each_entry(bfl, &fl->fl_block, fl_block)
> > - lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> > + list_for_each_entry(bfl, &blocked_list, fl_link) {
> > + if (bfl->fl_next == fl)
> > + lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> > + }
> >
> > return 0;
> > }
> > @@ -2267,7 +2276,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
> > {
> > loff_t *p = f->private;
> >
> > - lock_flocks();
> > + spin_lock(&file_lock_lock);
> > *p = (*pos + 1);
> > return seq_list_start(&file_lock_list, *pos);
> > }
> > @@ -2281,7 +2290,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
> >
> > static void locks_stop(struct seq_file *f, void *v)
> > {
> > - unlock_flocks();
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static const struct seq_operations locks_seq_operations = {
> > @@ -2328,7 +2337,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> > {
> > struct file_lock *fl;
> > int result = 1;
> > - lock_flocks();
> > +
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (IS_POSIX(fl)) {
> > if (fl->fl_type == F_RDLCK)
> > @@ -2345,7 +2355,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> > result = 0;
> > break;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return result;
> > }
> >
> > @@ -2368,7 +2378,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> > {
> > struct file_lock *fl;
> > int result = 1;
> > - lock_flocks();
> > +
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (IS_POSIX(fl)) {
> > if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> > @@ -2383,7 +2394,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> > result = 0;
> > break;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return result;
> > }
> >
> > diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> > index 57db324..43ee7f9 100644
> > --- a/fs/nfs/delegation.c
> > +++ b/fs/nfs/delegation.c
> > @@ -73,20 +73,21 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_
> > if (inode->i_flock == NULL)
> > goto out;
> >
> > - /* Protect inode->i_flock using the file locks lock */
> > - lock_flocks();
> > + /* Protect inode->i_flock using the i_lock */
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> > continue;
> > if (nfs_file_open_context(fl->fl_file) != ctx)
> > continue;
> > - unlock_flocks();
> > + /* FIXME: safe to drop lock here while walking list? */
> > + spin_unlock(&inode->i_lock);
> > status = nfs4_lock_delegation_recall(fl, state, stateid);
> > if (status < 0)
> > goto out;
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > out:
> > return status;
> > }
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> > index 1fab140..ff10b4a 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> > /* Guard against delegation returns and new lock/unlock calls */
> > down_write(&nfsi->rwsem);
> > /* Protect inode->i_flock using the BKL */
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> > continue;
> > if (nfs_file_open_context(fl->fl_file)->state != state)
> > continue;
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > status = ops->recover_lock(state, fl);
> > switch (status) {
> > case 0:
> > @@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> > /* kill_proc(fl->fl_pid, SIGLOST, 1); */
> > status = 0;
> > }
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > out:
> > up_write(&nfsi->rwsem);
> > return status;
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 316ec84..f170518 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
> >
> > list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);
> >
> > - /* only place dl_time is set. protected by lock_flocks*/
> > + /* Only place dl_time is set; protected by i_lock: */
> > dp->dl_time = get_seconds();
> >
> > nfsd4_cb_recall(dp);
> > }
> >
> > -/* Called from break_lease() with lock_flocks() held. */
> > +/* Called from break_lease() with i_lock held. */
> > static void nfsd_break_deleg_cb(struct file_lock *fl)
> > {
> > struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner;
> > @@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> > struct inode *inode = filp->fi_inode;
> > int status = 0;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) {
> > if ((*flpp)->fl_owner == (fl_owner_t)lowner) {
> > status = 1;
> > @@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> > }
> > }
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return status;
> > }
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index ae377e9..ccb44ea 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1012,8 +1012,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
> > extern int lease_modify(struct file_lock **, int);
> > extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
> > extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
> > -extern void lock_flocks(void);
> > -extern void unlock_flocks(void);
> > #else /* !CONFIG_FILE_LOCKING */
> > static inline int fcntl_getlk(struct file *file, struct flock __user *user)
> > {
> > @@ -1155,15 +1153,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> > {
> > return 1;
> > }
> > -
> > -static inline void lock_flocks(void)
> > -{
> > -}
> > -
> > -static inline void unlock_flocks(void)
> > -{
> > -}
> > -
> > #endif /* !CONFIG_FILE_LOCKING */
> >
> >
> > --
> > 1.7.1
> >


--
Jeff Layton <[email protected]>

2013-06-05 11:39:01

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

On Tue, 4 Jun 2013 17:58:39 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 31, 2013 at 11:07:30PM -0400, Jeff Layton wrote:
> > Currently, when there is a lot of lock contention the kernel spends an
> > inordinate amount of time taking blocked locks off of the global
> > blocked_list and then putting them right back on again. When all of this
> > code was protected by a single lock, then it didn't matter much, but now
> > it means a lot of file_lock_lock thrashing.
> >
> > Optimize this a bit by deferring the removal from the blocked_list until
> > we're either applying or cancelling the lock. By doing this, and using a
> > lockless list_empty check, we can avoid taking the file_lock_lock in
> > many cases.
> >
> > Because the fl_link check is lockless, we must ensure that only the task
> > that "owns" the request manipulates the fl_link. Also, with this change,
> > it's possible that we'll see an entry on the blocked_list that has a
> > NULL fl_next pointer. In that event, just ignore it and continue walking
> > the list.
>
> OK, that sounds safe as in it shouldn't crash, but does the deadlock
> detection still work, or can it miss loops?
>
> Those locks whose fl_next is temporarily NULL would previously not have been on
> the list at all, OK, but... I'm having trouble reasoning about how this
> works now.
>
> Previously a single lock was held uninterrupted across
> posix_locks_deadlock and locks_insert_block(), which guaranteed we
> shouldn't be adding a loop. Is that still true?
>
> --b.
>

I had thought it was when I originally looked at this, but now that I
consider it again I think you may be correct and that there are possible
races here. Since we might end up reblocking behind a different lock
without taking the global spinlock, a loop could be created if you had a
complex (>2) chain of locks.

I think I'm going to have to drop this approach and instead make it so
that the deadlock detection and insertion into the global blocker
list/hash are atomic. Ditto for locks_wake_up_blocks on posix locks and
taking the entries off the list/hash.

Now that I look, I think that approach may actually improve performance
too since we'll be taking the global spinlock less than we would have,
but I'll need to test it out to know for sure.
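
Something along these lines is what I'm picturing (just a sketch; the
helper name is made up, and it assumes posix_locks_deadlock() is reworked
so that the caller holds file_lock_lock rather than taking it itself):

static int locks_block_on(struct file_lock *request, struct file_lock *blocker)
{
	int error = 0;

	spin_lock(&file_lock_lock);
	/* assumes posix_locks_deadlock() no longer takes file_lock_lock itself */
	if (posix_locks_deadlock(request, blocker)) {
		error = -EDEADLK;
	} else {
		/* insert under the same lock the deadlock check ran under */
		list_add(&request->fl_link, &blocked_list);
	}
	spin_unlock(&file_lock_lock);

	return error;
}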

> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/locks.c | 29 +++++++++++++++++++++++------
> > 1 files changed, 23 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 055c06c..fc35b9e 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -520,7 +520,6 @@ locks_delete_global_locks(struct file_lock *waiter)
> > static void __locks_delete_block(struct file_lock *waiter)
> > {
> > list_del_init(&waiter->fl_block);
> > - locks_delete_global_blocked(waiter);
> > waiter->fl_next = NULL;
> > }
> >
> > @@ -704,13 +703,16 @@ EXPORT_SYMBOL(posix_test_lock);
> > /* Find a lock that the owner of the given block_fl is blocking on. */
> > static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> > {
> > - struct file_lock *fl;
> > + struct file_lock *fl, *ret = NULL;
> >
> > list_for_each_entry(fl, &blocked_list, fl_link) {
> > - if (posix_same_owner(fl, block_fl))
> > - return fl->fl_next;
> > + if (posix_same_owner(fl, block_fl)) {
> > + ret = fl->fl_next;
> > + if (likely(ret))
> > + break;
> > + }
> > }
> > - return NULL;
> > + return ret;
> > }
> >
> > static int posix_locks_deadlock(struct file_lock *caller_fl,
> > @@ -865,7 +867,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > goto out;
> > error = FILE_LOCK_DEFERRED;
> > locks_insert_block(fl, request);
> > - locks_insert_global_blocked(request);
> > + if (list_empty(&request->fl_link))
> > + locks_insert_global_blocked(request);
> > goto out;
> > }
> > }
> > @@ -876,6 +879,16 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > goto out;
> >
> > /*
> > + * Now that we know the request is no longer blocked, we can take it
> > + * off the global list. Some callers send down partially initialized
> > + * requests, so we only do this if FL_SLEEP is set. Also, avoid taking
> > + * the lock if the list is empty, as that indicates a request that
> > + * never blocked.
> > + */
> > + if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
> > + locks_delete_global_blocked(request);
> > +
> > + /*
> > * Find the first old lock with the same owner as the new lock.
> > */
> >
> > @@ -1069,6 +1082,7 @@ int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
> > continue;
> >
> > locks_delete_block(fl);
> > + locks_delete_global_blocked(fl);
> > break;
> > }
> > return error;
> > @@ -1147,6 +1161,7 @@ int locks_mandatory_area(int read_write, struct inode *inode,
> > }
> >
> > locks_delete_block(&fl);
> > + locks_delete_global_blocked(&fl);
> > break;
> > }
> >
> > @@ -1859,6 +1874,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
> > continue;
> >
> > locks_delete_block(fl);
> > + locks_delete_global_blocked(fl);
> > break;
> > }
> >
> > @@ -2160,6 +2176,7 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> > else
> > status = -ENOENT;
> > spin_unlock(&inode->i_lock);
> > + locks_delete_global_blocked(waiter);
> > return status;
> > }
> >
> > --
> > 1.7.1
> >


--
Jeff Layton <[email protected]>

2013-06-05 11:43:55

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 08/11] locks: convert fl_link to a hlist_node

On Tue, 4 Jun 2013 17:59:50 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 31, 2013 at 11:07:31PM -0400, Jeff Layton wrote:
> > Testing has shown that iterating over the blocked_list for deadlock
> > detection turns out to be a bottleneck. In order to alleviate that,
> > begin the process of turning it into a hashtable. We start by turning
> > the fl_link into a hlist_node and the global lists into hlists. A later
> > patch will do the conversion of the blocked_list to a hashtable.
>
> Even simpler would be if we could add a pointer to the (well, a) lock
> that a lockowner is blocking on, and then we'd just have to follow a
> pointer. I haven't thought that through, though; perhaps that's hard to
> make work....
>
> --b.
>

I considered that as well and it makes sense for the simple local
filesystem case where you just track ownership based on fl_owner_t.

But...what about lockd? It considers ownership to be a tuple of the
nlm_host and the pid sent in a lock request. I can't seem to wrap my
brain around how to make such an approach work there. I'll confess,
though, that I haven't tried *too* hard yet since I had bigger problems
to sort through at the time. Maybe we can consider that for a later set?
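
For the simple fl_owner_t case, a rough sketch of the shape your
suggestion might take (everything below is hypothetical -- neither the
structure nor the field exists in the posted code):

/*
 * Hypothetical only: a per-owner record carrying the request that owner
 * is currently blocked on, so deadlock detection can follow a pointer
 * instead of scanning blocked_list.
 */
struct lock_owner_info {
	struct file_lock *blocked_request;	/* request this owner waits on */
};

static struct file_lock *owner_blocked_on(struct lock_owner_info *owner)
{
	struct file_lock *waiter = owner->blocked_request;

	/* fl_next is the blocker that waiter is queued behind */
	return waiter ? waiter->fl_next : NULL;
}

Mapping lockd's (nlm_host, pid) tuple onto one of these records is the
part I can't see how to do cleanly.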

> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/locks.c | 32 ++++++++++++++++----------------
> > include/linux/fs.h | 2 +-
> > 2 files changed, 17 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/locks.c b/fs/locks.c
> > index fc35b9e..5ed056b 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -163,8 +163,8 @@ int lease_break_time = 45;
> > #define for_each_lock(inode, lockp) \
> > for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
> >
> > -static LIST_HEAD(file_lock_list);
> > -static LIST_HEAD(blocked_list);
> > +static HLIST_HEAD(file_lock_list);
> > +static HLIST_HEAD(blocked_list);
> >
> > /* Protects the two list heads above */
> > static DEFINE_SPINLOCK(file_lock_lock);
> > @@ -173,7 +173,7 @@ static struct kmem_cache *filelock_cache __read_mostly;
> >
> > static void locks_init_lock_heads(struct file_lock *fl)
> > {
> > - INIT_LIST_HEAD(&fl->fl_link);
> > + INIT_HLIST_NODE(&fl->fl_link);
> > INIT_LIST_HEAD(&fl->fl_block);
> > init_waitqueue_head(&fl->fl_wait);
> > }
> > @@ -207,7 +207,7 @@ void locks_free_lock(struct file_lock *fl)
> > {
> > BUG_ON(waitqueue_active(&fl->fl_wait));
> > BUG_ON(!list_empty(&fl->fl_block));
> > - BUG_ON(!list_empty(&fl->fl_link));
> > + BUG_ON(!hlist_unhashed(&fl->fl_link));
> >
> > locks_release_private(fl);
> > kmem_cache_free(filelock_cache, fl);
> > @@ -486,7 +486,7 @@ static inline void
> > locks_insert_global_blocked(struct file_lock *waiter)
> > {
> > spin_lock(&file_lock_lock);
> > - list_add(&waiter->fl_link, &blocked_list);
> > + hlist_add_head(&waiter->fl_link, &blocked_list);
> > spin_unlock(&file_lock_lock);
> > }
> >
> > @@ -494,7 +494,7 @@ static inline void
> > locks_delete_global_blocked(struct file_lock *waiter)
> > {
> > spin_lock(&file_lock_lock);
> > - list_del_init(&waiter->fl_link);
> > + hlist_del_init(&waiter->fl_link);
> > spin_unlock(&file_lock_lock);
> > }
> >
> > @@ -502,7 +502,7 @@ static inline void
> > locks_insert_global_locks(struct file_lock *waiter)
> > {
> > spin_lock(&file_lock_lock);
> > - list_add_tail(&waiter->fl_link, &file_lock_list);
> > + hlist_add_head(&waiter->fl_link, &file_lock_list);
> > spin_unlock(&file_lock_lock);
> > }
> >
> > @@ -510,7 +510,7 @@ static inline void
> > locks_delete_global_locks(struct file_lock *waiter)
> > {
> > spin_lock(&file_lock_lock);
> > - list_del_init(&waiter->fl_link);
> > + hlist_del_init(&waiter->fl_link);
> > spin_unlock(&file_lock_lock);
> > }
> >
> > @@ -705,7 +705,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> > {
> > struct file_lock *fl, *ret = NULL;
> >
> > - list_for_each_entry(fl, &blocked_list, fl_link) {
> > + hlist_for_each_entry(fl, &blocked_list, fl_link) {
> > if (posix_same_owner(fl, block_fl)) {
> > ret = fl->fl_next;
> > if (likely(ret))
> > @@ -867,7 +867,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > goto out;
> > error = FILE_LOCK_DEFERRED;
> > locks_insert_block(fl, request);
> > - if (list_empty(&request->fl_link))
> > + if (hlist_unhashed(&request->fl_link))
> > locks_insert_global_blocked(request);
> > goto out;
> > }
> > @@ -882,10 +882,10 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > * Now that we know the request is no longer blocked, we can take it
> > * off the global list. Some callers send down partially initialized
> > * requests, so we only do this if FL_SLEEP is set. Also, avoid taking
> > - * the lock if the list is empty, as that indicates a request that
> > + * the lock if the hlist is unhashed, as that indicates a request that
> > * never blocked.
> > */
> > - if ((request->fl_flags & FL_SLEEP) && !list_empty(&request->fl_link))
> > + if ((request->fl_flags & FL_SLEEP) && !hlist_unhashed(&request->fl_link))
> > locks_delete_global_blocked(request);
> >
> > /*
> > @@ -2277,11 +2277,11 @@ static int locks_show(struct seq_file *f, void *v)
> > {
> > struct file_lock *fl, *bfl;
> >
> > - fl = list_entry(v, struct file_lock, fl_link);
> > + fl = hlist_entry(v, struct file_lock, fl_link);
> >
> > lock_get_status(f, fl, *((loff_t *)f->private), "");
> >
> > - list_for_each_entry(bfl, &blocked_list, fl_link) {
> > + hlist_for_each_entry(bfl, &blocked_list, fl_link) {
> > if (bfl->fl_next == fl)
> > lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> > }
> > @@ -2295,14 +2295,14 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
> >
> > spin_lock(&file_lock_lock);
> > *p = (*pos + 1);
> > - return seq_list_start(&file_lock_list, *pos);
> > + return seq_hlist_start(&file_lock_list, *pos);
> > }
> >
> > static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
> > {
> > loff_t *p = f->private;
> > ++*p;
> > - return seq_list_next(v, &file_lock_list, pos);
> > + return seq_hlist_next(v, &file_lock_list, pos);
> > }
> >
> > static void locks_stop(struct seq_file *f, void *v)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index ccb44ea..07a009e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -934,7 +934,7 @@ int locks_in_grace(struct net *);
> > */
> > struct file_lock {
> > struct file_lock *fl_next; /* singly linked list for this inode */
> > - struct list_head fl_link; /* doubly linked list of all locks */
> > + struct hlist_node fl_link; /* node in global lists */
> > struct list_head fl_block; /* circular list of blocked processes */
> > fl_owner_t fl_owner;
> > unsigned int fl_flags;
> > --
> > 1.7.1
> >


--
Jeff Layton <[email protected]>

2013-06-05 12:25:19

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

On Wed, Jun 05, 2013 at 07:38:22AM -0400, Jeff Layton wrote:
> On Tue, 4 Jun 2013 17:58:39 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Fri, May 31, 2013 at 11:07:30PM -0400, Jeff Layton wrote:
> > > Currently, when there is a lot of lock contention the kernel spends an
> > > inordinate amount of time taking blocked locks off of the global
> > > blocked_list and then putting them right back on again. When all of this
> > > code was protected by a single lock, then it didn't matter much, but now
> > > it means a lot of file_lock_lock thrashing.
> > >
> > > Optimize this a bit by deferring the removal from the blocked_list until
> > > we're either applying or cancelling the lock. By doing this, and using a
> > > lockless list_empty check, we can avoid taking the file_lock_lock in
> > > many cases.
> > >
> > > Because the fl_link check is lockless, we must ensure that only the task
> > > that "owns" the request manipulates the fl_link. Also, with this change,
> > > it's possible that we'll see an entry on the blocked_list that has a
> > > NULL fl_next pointer. In that event, just ignore it and continue walking
> > > the list.
> >
> > OK, that sounds safe as in it shouldn't crash, but does the deadlock
> > detection still work, or can it miss loops?
> >
> > Those locks that are temporarily NULL would previously not have been on
> > the list at all, OK, but... I'm having trouble reasoning about how this
> > works now.
> >
> > Previously a single lock was held uninterrupted across
> > posix_locks_deadlock and locks_insert_block() which guaranteed we
> > shouldn't be adding a loop, is that still true?
> >
> > --b.
> >
>
> I had thought it was when I originally looked at this, but now that I
> consider it again I think you may be correct and that there are possible
> races here. Since we might end up reblocking behind a different lock
> without taking the global spinlock, a loop could be created if you had
> a complex (>2) chain of locks.
>
> I think I'm going to have to drop this approach and instead make it so
> that the deadlock detection and insertion into the global blocker
> list/hash are atomic.

Right. Once you drop the lock you can no longer be sure that what you
learned about the file-lock graph stays true.

> Ditto for locks_wake_up_blocks on posix locks and
> taking the entries off the list/hash.

Here I'm not sure what you mean.

--b.

2013-06-05 12:39:38

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

On Wed, 5 Jun 2013 08:24:32 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Wed, Jun 05, 2013 at 07:38:22AM -0400, Jeff Layton wrote:
> > On Tue, 4 Jun 2013 17:58:39 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Fri, May 31, 2013 at 11:07:30PM -0400, Jeff Layton wrote:
> > > > Currently, when there is a lot of lock contention the kernel spends an
> > > > inordinate amount of time taking blocked locks off of the global
> > > > blocked_list and then putting them right back on again. When all of this
> > > > code was protected by a single lock, then it didn't matter much, but now
> > > > it means a lot of file_lock_lock thrashing.
> > > >
> > > > Optimize this a bit by deferring the removal from the blocked_list until
> > > > we're either applying or cancelling the lock. By doing this, and using a
> > > > lockless list_empty check, we can avoid taking the file_lock_lock in
> > > > many cases.
> > > >
> > > > Because the fl_link check is lockless, we must ensure that only the task
> > > > that "owns" the request manipulates the fl_link. Also, with this change,
> > > > it's possible that we'll see an entry on the blocked_list that has a
> > > > NULL fl_next pointer. In that event, just ignore it and continue walking
> > > > the list.
> > >
> > > OK, that sounds safe as in it shouldn't crash, but does the deadlock
> > > detection still work, or can it miss loops?
> > >
> > > Those locks that are temporarily NULL would previously not have been on
> > > the list at all, OK, but... I'm having trouble reasoning about how this
> > > works now.
> > >
> > > Previously a single lock was held uninterrupted across
> > > posix_locks_deadlock and locks_insert_block() which guaranteed we
> > > shouldn't be adding a loop, is that still true?
> > >
> > > --b.
> > >
> >
> > I had thought it was when I originally looked at this, but now that I
> > consider it again I think you may be correct and that there are possible
> > races here. Since we might end up reblocking behind a different lock
> > without taking the global spinlock, a loop could be created if you had
> > a complex (>2) chain of locks.
> >
> > I think I'm going to have to drop this approach and instead make it so
> > that the deadlock detection and insertion into the global blocker
> > list/hash are atomic.
>
> Right. Once you drop the lock you can no longer be sure that what you
> learned about the file-lock graph stays true.
>
> > Ditto for locks_wake_up_blocks on posix locks and
> > taking the entries off the list/hash.
>
> Here I'm not sure what you mean.
>

Basically, I mean that rather than setting the fl_next pointer to NULL
while holding only the inode lock and then ignoring those locks in the
deadlock detection code, we should also take the global lock in
locks_wake_up_blocks and take the blocked locks off the global list
and the i_flock list at the same time.

That actually might not be completely necessary, but it'll make the
logic clearer and easier to understand and probably won't hurt
performance too much. Again, I'll need to do some perf testing to be
sure.
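
Something like this is what I have in mind -- a rough, untested sketch
against this series rather than a real patch. It assumes the caller
still holds the inode->i_lock, and that fl_link is the hlist_node from
patch 8:

static void locks_wake_up_blocks(struct file_lock *blocker)
{
        /*
         * Take the global lock so that waiters come off the global
         * blocked_list at the same time they're unlinked from the
         * blocker and have their fl_next cleared.
         */
        spin_lock(&file_lock_lock);
        while (!list_empty(&blocker->fl_block)) {
                struct file_lock *waiter;

                waiter = list_first_entry(&blocker->fl_block,
                                          struct file_lock, fl_block);
                hlist_del_init(&waiter->fl_link);  /* off the global blocked_list */
                __locks_delete_block(waiter);      /* off fl_block, fl_next = NULL */
                if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
                        waiter->fl_lmops->lm_notify(waiter);
                else
                        wake_up(&waiter->fl_wait);
        }
        spin_unlock(&file_lock_lock);
}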

--
Jeff Layton <[email protected]>

2013-06-05 12:46:51

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 08/11] locks: convert fl_link to a hlist_node

On Wed, Jun 05, 2013 at 07:43:09AM -0400, Jeff Layton wrote:
> On Tue, 4 Jun 2013 17:59:50 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Fri, May 31, 2013 at 11:07:31PM -0400, Jeff Layton wrote:
> > > Testing has shown that iterating over the blocked_list for deadlock
> > > detection turns out to be a bottleneck. In order to alleviate that,
> > > begin the process of turning it into a hashtable. We start by turning
> > > the fl_link into a hlist_node and the global lists into hlists. A later
> > > patch will do the conversion of the blocked_list to a hashtable.
> >
> > Even simpler would be if we could add a pointer to the (well, a) lock
> > that a lockowner is blocking on, and then we'd just have to follow a
> > pointer. I haven't thought that through, though, perhaps that's hard to
> > make work....
> >
> > --b.
> >
>
> I considered that as well and it makes sense for the simple local
> filesystem case where you just track ownership based on fl_owner_t.
>
> But...what about lockd? It considers ownership to be a tuple of the
> nlm_host and the pid sent in a lock request. I can't seem to wrap my
> brain around how to make such an approach work there.

I wonder if we could do something vaguely like

struct lock_owner_common {
        struct file_lock *blocker;      /* the lock this owner is currently blocked on */
};

struct nlmsvc_lock_owner {
        struct lock_owner_common owner;
        unsigned int client_pid;        /* the pid sent in the NLM lock request */
};

and make fl_owner a (struct lock_owner_common *) and have lockd create
nlmsvc_lock_owners as necessary on the fly. The lm_compare_owner
callback could then be replaced by a pointer comparison. I'm not sure
what kind of locking or refcounting might be needed. But...

> I'll confess, though, that I haven't tried *too* hard yet

... me neither, so...

> since I had bigger problems to sort through at the time. Maybe
> we can consider that for a later set?

sounds fine.
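
FWIW, the "just follow a pointer" deadlock check that a blocker field
like the above would enable might look roughly like this -- a
hypothetical sketch only; none of these names exist in the tree, and it
assumes fl_owner has become a struct lock_owner_common pointer:

#define MAX_DEADLOCK_DEPTH      10      /* arbitrary bound, just for the sketch */

static int posix_locks_deadlock_sketch(struct file_lock *caller_fl,
                                       struct file_lock *block_fl)
{
        struct lock_owner_common *caller = caller_fl->fl_owner;
        struct file_lock *fl = block_fl;
        int i = 0;

        /* Follow the chain of blockers; a cycle back to us is a deadlock. */
        while (fl != NULL && i++ < MAX_DEADLOCK_DEPTH) {
                struct lock_owner_common *owner = fl->fl_owner;

                if (owner == caller)
                        return 1;       /* granting this would make us wait on ourselves */
                fl = owner->blocker;
        }
        return 0;
}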

--b.

2013-06-05 12:59:49

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v1 07/11] locks: only pull entries off of blocked_list when they are really unblocked

On Wed, Jun 05, 2013 at 08:38:59AM -0400, Jeff Layton wrote:
> On Wed, 5 Jun 2013 08:24:32 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Wed, Jun 05, 2013 at 07:38:22AM -0400, Jeff Layton wrote:
> > > On Tue, 4 Jun 2013 17:58:39 -0400
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > On Fri, May 31, 2013 at 11:07:30PM -0400, Jeff Layton wrote:
> > > > > Currently, when there is a lot of lock contention the kernel spends an
> > > > > inordinate amount of time taking blocked locks off of the global
> > > > > blocked_list and then putting them right back on again. When all of this
> > > > > code was protected by a single lock, then it didn't matter much, but now
> > > > > it means a lot of file_lock_lock thrashing.
> > > > >
> > > > > Optimize this a bit by deferring the removal from the blocked_list until
> > > > > we're either applying or cancelling the lock. By doing this, and using a
> > > > > lockless list_empty check, we can avoid taking the file_lock_lock in
> > > > > many cases.
> > > > >
> > > > > Because the fl_link check is lockless, we must ensure that only the task
> > > > > that "owns" the request manipulates the fl_link. Also, with this change,
> > > > > it's possible that we'll see an entry on the blocked_list that has a
> > > > > NULL fl_next pointer. In that event, just ignore it and continue walking
> > > > > the list.
> > > >
> > > > OK, that sounds safe as in it shouldn't crash, but does the deadlock
> > > > detection still work, or can it miss loops?
> > > >
> > > > Those locks that are temporarily NULL would previously not have been on
> > > > the list at all, OK, but... I'm having trouble reasoning about how this
> > > > works now.
> > > >
> > > > Previously a single lock was held uninterrupted across
> > > > posix_locks_deadlock and locks_insert_block() which guaranteed we
> > > > shouldn't be adding a loop, is that still true?
> > > >
> > > > --b.
> > > >
> > >
> > > I had thought it was when I originally looked at this, but now that I
> > > consider it again I think you may be correct and that there are possible
> > > races here. Since we might end up reblocking behind a different lock
> > > without taking the global spinlock, a loop could be created if you had
> > > a complex (>2) chain of locks.
> > >
> > > I think I'm going to have to drop this approach and instead make it so
> > > that the deadlock detection and insertion into the global blocker
> > > list/hash are atomic.
> >
> > Right. Once you drop the lock you can no longer be sure that what you
> > learned about the file-lock graph stays true.
> >
> > > Ditto for locks_wake_up_blocks on posix locks and
> > > taking the entries off the list/hash.
> >
> > Here I'm not sure what you mean.
> >
>
> Basically, I mean that rather than setting the fl_next pointer to NULL
> while holding only the inode lock and then ignoring those locks in the
> deadlock detection code, we should also take the global lock in
> locks_wake_up_blocks and take the blocked locks off the global list
> and the i_flock list at the same time.

OK, thanks, got it. I have a hard time thinking about that.... But yes,
it bothers me that the deadlock detection code could see an out-of-date
value of fl_next, and I can't convince myself that this wouldn't result
in false positives or false negatives.

> That actually might not be completely necessary, but it'll make the
> logic clearer and easier to understand and probably won't hurt
> performance too much. Again, I'll need to do some perf testing to be
> sure.

OK!

--b.