2013-06-11 11:13:33

by Jeff Layton

Subject: [PATCH v2 00/14] locks: scalability improvements for file locking

Summary of Significant Changes:
-------------------------------
v2:
- Fix potential races in deadlock detection. Manipulation of global
blocked_hash and deadlock detection are now atomic. This is a
little slower than the earlier set, but is provably correct. Also,
the patch that converts to using the i_lock has been split out from
most of the other changes. That should make it easier to review, but
it does leave a potential race in the deadlock detection that is fixed
up by the following patch. It may make sense to fold patches 7 and 8
together before merging.

- Add percpu hlists and lglocks for global file_lock_list. This gives
us some speedup since this list is seldom read.

Abstract (tl;dr version):
-------------------------
This patchset represents an overhaul of the file locking code with an
aim toward improving its scalability and making the code a bit easier to
understand.

Longer version:
---------------
When the BKL was finally ripped out of the kernel in 2010, the strategy
taken for the file locking code was to simply turn it into a new
file_lock_lock spinlock. It was an expedient way to deal with the file
locking code at the time, but having a giant spinlock around all of this
code is clearly not great for scalability. Red Hat has bug reports that
go back into the 2.6.18 era that point to BKL scalability problems in
the file locking code, and the file_lock_lock suffers from the same
issues.

This patchset is my first attempt to make this code less dependent on
global locking. The main change is to switch most of the file locking
code to be protected by the inode->i_lock instead of the file_lock_lock.

While that works for most things, there are a couple of global data
structures (lists in the current code) that need a global lock to
protect them. So we still need a global lock in order to deal with
those. The remaining patches are intended to make that global locking
less painful. The big gains come from turning the blocked_list into a
hashtable, which greatly speeds up the deadlock detection code.

This is not the first attempt at doing this. The conversion to the
i_lock was originally attempted by Bruce Fields a few years ago. His
approach was NAK'ed since it involved ripping out the deadlock
detection. People also really seem to like /proc/locks for debugging, so
keeping that in is probably worthwhile.

There's more work to be done in this area and this patchset is just a
start. There's a horrible thundering herd problem when a blocking lock
is released, for instance. There was also interest in solving the goofy
"unlock on any close" POSIX lock semantics at this year's LSF. I think
this patchset will help lay the groundwork for those changes as well.

While file locking is not usually considered to be a high-performance
codepath, it *is* an IPC mechanism and I think it behooves us to try to
make it as fast as possible.

I'd like to see this considered for 3.11. Comments and suggestions
welcome.

Performance testing and results:
--------------------------------
In order to measure performance here, I've written some locking
performance tests that I've made available here:

git://git.samba.org/jlayton/lockperf.git

...there are only 4 tests so far that measure performance of BSD and
POSIX locking. I ran each test 100 times. The first number in each set
of results is the unpatched (baseline) and the second is the patched
results. The arch in both cases was x86_64. I've also done some testing
on i686 and the results are more or less in line with the ones below.
The files were all on tmpfs, to eliminate any storage-related latencies
that might creep in.

Here's a description of each test with their results. I've also included
the standard deviation with each:

posix01: fork off 128 children, and have them all open the same file and
lock and unlock it 10k times (whole file, F_WRLCK). Measure the time that
it takes to perform all of the locks and unlocks in each process, and
then total them all up. Note that the "mean time" here is not the time to
run the test, but the total time that all the tasks spent locking and
unlocking (a rough sketch of the per-child loop follows the results):

2-way Intel machine, 1G RAM, UMA:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 780.50511606027 20.0773085693901
3.10.0-rc4-00049-ge0b71e1 420.178537863755 24.4692621642102

32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 30889.1603525361 267.767126085587
3.10.0-rc4-00049-ge0b71e1 25504.7748006928 336.232893246214

The first thing to note here is how painful this test is as the number
of CPUs scales out, and when NUMA is thrown into the mix. It takes a
staggeringly long time to run on such a machine. The patchset helps this
use-case a lot, but it's still painful. I think these numbers help
illustrate why we need to worry about this as the machines get more
parallel.
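
To make the workload concrete, here's a minimal sketch of the per-child
loop in posix01. This is not the actual lockperf code; the file name,
child count and iteration count below are placeholders. posix02 is the
same thing except that each child opens its own private file:

/*
 * Illustrative sketch of the posix01 per-child loop (not the real
 * lockperf test). Each child repeatedly takes and drops a whole-file
 * write lock on the same file, timing only the locking calls.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double posix_lock_loop(const char *path, int iterations)
{
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
			    .l_start = 0, .l_len = 0 };	/* whole file */
	struct timespec a, b;
	double spent = 0.0;
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	for (int i = 0; i < iterations; i++) {
		clock_gettime(CLOCK_MONOTONIC, &a);
		fl.l_type = F_WRLCK;
		fcntl(fd, F_SETLKW, &fl);	/* block until we own the lock */
		fl.l_type = F_UNLCK;
		fcntl(fd, F_SETLK, &fl);	/* drop it again */
		clock_gettime(CLOCK_MONOTONIC, &b);
		spent += (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	}
	close(fd);
	return spent;
}

int main(void)
{
	/* illustrative scale: 4 children x 1000 iterations vs. 128 x 10k */
	for (int i = 0; i < 4; i++) {
		if (fork() == 0) {
			/* the real harness reports this back to the parent,
			 * which sums the per-child times into "mean time" */
			printf("%d: %f\n", (int)getpid(),
			       posix_lock_loop("/tmp/posix01.dat", 1000));
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}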

posix02: fork off 128 children, and have them each open their own file
and lock and unlock it 20k times. Measure the time that it takes to
complete the locks and unlocks in each process and then total them up:

2-way Intel machine, 1G RAM, UMA:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 319.86610077757 11.0827165395713
3.10.0-rc4-00049-ge0b71e1 257.24231139548 8.0772950753018

32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 1354.73702552958 40.5202787251742
3.10.0-rc4-00049-ge0b71e1 541.35526474265 12.0492755786782

...so again, multiple CPUs and NUMA take their toll on this workload,
but the results are still universally positive from the patchset.

flock01: fork off 128 children, and have them all open the same file and
lock (LOCK_EX) and unlock it 10k times. Measure the time that it takes to
perform all of the locks and unlocks in each process, and then total them
all up. Note that the "mean time" here is not the time to run the test,
but the total time that all the tasks spent locking and unlocking (a
rough sketch of the per-child flock loop follows the results):

2-way Intel machine, 1G RAM, UMA:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 704.60957498018 10.6182455520549
3.10.0-rc4-00049-ge0b71e1 729.30782576048 4.62185698266119

32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 23617.876013741 221.445828515533
3.10.0-rc4-00049-ge0b71e1 24743.3849476248 340.664907591012

...so here we see some slight slowdown (3-5%). In the case of the UMA
box, it's on the order of the standard deviation, and in a similar test
on a different box, the patched kernel turned out to be faster. Go
figure...

In any case, I suspect that this is due to the fact that with this
set, we need to take two spinlocks in order to do the work (the i_lock
and the file_lock_lglock). Since the flock codepath doesn't use the
global blocked_list, it sees no benefit from converting that list to a
hash table.
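
Here's the corresponding sketch for the flock tests (again illustrative
only, not the actual lockperf code). flock01 has every child hammering
the same file; flock02, described next, just gives each child its own
file, so there is no contention at all:

/*
 * Illustrative sketch of the flock01 per-child loop. It can be driven
 * by the same fork-and-sum harness as the posix01 sketch above;
 * flock02 just passes each child a unique path.
 */
#include <fcntl.h>
#include <sys/file.h>
#include <time.h>
#include <unistd.h>

static double flock_lock_loop(const char *path, int iterations)
{
	struct timespec a, b;
	double spent = 0.0;
	int fd = open(path, O_RDWR | O_CREAT, 0644);

	for (int i = 0; i < iterations; i++) {
		clock_gettime(CLOCK_MONOTONIC, &a);
		flock(fd, LOCK_EX);	/* block until the exclusive lock is granted */
		flock(fd, LOCK_UN);
		clock_gettime(CLOCK_MONOTONIC, &b);
		spent += (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
	}
	close(fd);
	return spent;
}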

flock02: fork off 128 children, and have them each open their own file
and lock (LOCK_EX) and unlock it 20k times. Measure the time that it
takes to complete the locks and unlocks in each process and then total
them up:

2-way Intel machine, 1G RAM, UMA:
kernel mean time std. deviation
-------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 241.26319896997 7.69747220750147
3.10.0-rc4-00049-ge0b71e1 167.23062027739 5.1643997156006

32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
kernel mean time std. deviation
--------------------------------------------------------------------------
3.10.0-rc4-00034-g042dd60 1348.64594042422 39.5186585988528
3.10.0-rc4-00049-ge0b71e1 8.4762987879 0.347121187462215

...the speedup here is really dramatic, especially in the NUMA case
(roughly 160x on the mean).
This was so much faster that I had to double check that flock locking
still worked properly afterward.

Jeff Layton (14):
cifs: use posix_unblock_lock instead of locks_delete_block
locks: make generic_add_lease and generic_delete_lease static
locks: comment cleanups and clarifications
locks: make "added" in __posix_lock_file a bool
locks: encapsulate the fl_link list handling
locks: don't walk inode->i_flock list in locks_show
locks: convert to i_lock to protect i_flock list
locks: ensure that deadlock detection is atomic with respect to
blocked_list modification
locks: convert fl_link to a hlist_node
locks: turn the blocked_list into a hashtable
locks: add a new "lm_owner_key" lock operation
locks: give the blocked_hash its own spinlock
seq_file: add seq_list_*_percpu helpers
locks: move file_lock_list to a set of percpu hlist_heads and convert
file_lock_lock to an lglock

Documentation/filesystems/Locking | 27 +++-
fs/afs/flock.c | 5 +-
fs/ceph/locks.c | 2 +-
fs/ceph/mds_client.c | 8 +-
fs/cifs/cifsfs.c | 2 +-
fs/cifs/file.c | 15 +-
fs/gfs2/file.c | 2 +-
fs/lockd/svclock.c | 12 ++
fs/lockd/svcsubs.c | 12 +-
fs/locks.c | 306 +++++++++++++++++++++++++------------
fs/nfs/delegation.c | 11 +-
fs/nfs/nfs4state.c | 8 +-
fs/nfsd/nfs4state.c | 8 +-
fs/seq_file.c | 54 +++++++
include/linux/fs.h | 38 +++--
include/linux/seq_file.h | 6 +
16 files changed, 362 insertions(+), 154 deletions(-)


2013-06-11 11:09:55

by Jeff Layton

Subject: [PATCH v2 13/14] seq_file: add seq_list_*_percpu helpers

When we convert the file_lock_list to a set of percpu lists, we'll need
a way to iterate over them in order to output /proc/locks info. Add
some seq_list_*_percpu helpers to handle that.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/seq_file.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/seq_file.h | 6 +++++
2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index 774c1eb..ce9f97a 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -921,3 +921,57 @@ struct hlist_node *seq_hlist_next_rcu(void *v,
return rcu_dereference(node->next);
}
EXPORT_SYMBOL(seq_hlist_next_rcu);
+
+/**
+ * seq_hlist_start_percpu - start an iteration of a percpu hlist
+ * @head: pointer to percpu array of struct hlist_heads
+ * @cpu: pointer to cpu "cursor"
+ * @pos: start position of sequence
+ *
+ * Called at seq_file->op->start().
+ */
+struct hlist_node *
+seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos)
+{
+ struct hlist_node *node;
+
+ for_each_possible_cpu(*cpu) {
+ hlist_for_each(node, per_cpu_ptr(head, *cpu)) {
+ if (pos-- == 0)
+ return node;
+ }
+ }
+ return NULL;
+}
+EXPORT_SYMBOL(seq_hlist_start_percpu);
+
+/**
+ * seq_hlist_next_percpu - move to the next position of the hlist
+ * @v: pointer to current hlist_node
+ * @head: pointer to percpu array of struct hlist_heads
+ * @cpu: pointer to cpu "cursor"
+ * @pos: start position of sequence
+ *
+ * Called at seq_file->op->next().
+ */
+struct hlist_node *
+seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head,
+ int *cpu, loff_t *pos)
+{
+ struct hlist_node *node = v;
+
+ ++*pos;
+
+ if (node->next)
+ return node->next;
+
+ for (*cpu = cpumask_next(*cpu, cpu_possible_mask); *cpu < nr_cpu_ids;
+ *cpu = cpumask_next(*cpu, cpu_possible_mask)) {
+ struct hlist_head *bucket = per_cpu_ptr(head, *cpu);
+
+ if (!hlist_empty(bucket))
+ return bucket->first;
+ }
+ return NULL;
+}
+EXPORT_SYMBOL(seq_hlist_next_percpu);
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 2da29ac..4e32edc 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -173,4 +173,10 @@ extern struct hlist_node *seq_hlist_start_head_rcu(struct hlist_head *head,
extern struct hlist_node *seq_hlist_next_rcu(void *v,
struct hlist_head *head,
loff_t *ppos);
+
+/* Helpers for iterating over per-cpu hlist_head-s in seq_files */
+extern struct hlist_node *seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos);
+
+extern struct hlist_node *seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head, int *cpu, loff_t *pos);
+
#endif
--
1.7.1

2013-06-11 11:10:00

by Jeff Layton

Subject: [PATCH v2 12/14] locks: give the blocked_hash its own spinlock

There's no reason we have to protect the blocked_hash and file_lock_list
with the same spinlock. With the tests I have, breaking it in two gives
a barely measurable performance benefit, but it seems reasonable to make
this locking as granular as possible.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/Locking | 16 ++++++++--------
fs/locks.c | 25 +++++++++++++------------
2 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index ee351ac..8d8d040 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -359,20 +359,20 @@ prototypes:

locking rules:

- inode->i_lock file_lock_lock may block
-lm_compare_owner: yes maybe no
-lm_owner_key yes yes no
-lm_notify: yes no no
-lm_grant: no no no
-lm_break: yes no no
-lm_change yes no no
+ inode->i_lock blocked_hash_lock may block
+lm_compare_owner: yes maybe no
+lm_owner_key yes yes no
+lm_notify: yes no no
+lm_grant: no no no
+lm_break: yes no no
+lm_change yes no no

->lm_compare_owner and ->lm_owner_key are generally called with
*an* inode->i_lock held. It may not be the i_lock of the inode
associated with either file_lock argument! This is the case with deadlock
detection, since the code has to chase down the owners of locks that may
be entirely unrelated to the one on which the lock is being acquired.
-For deadlock detection however, the file_lock_lock is also held. The
+For deadlock detection however, the blocked_hash_lock is also held. The
fact that these locks are held ensures that the file_locks do not
disappear out from under you while doing the comparison or generating an
owner key.
diff --git a/fs/locks.c b/fs/locks.c
index 11e7784..8124fc1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -162,12 +162,11 @@ int lease_break_time = 45;
*/
#define BLOCKED_HASH_BITS 7

+static DEFINE_SPINLOCK(blocked_hash_lock);
static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);

-static HLIST_HEAD(file_lock_list);
-
-/* Protects the file_lock_list and the blocked_hash */
static DEFINE_SPINLOCK(file_lock_lock);
+static HLIST_HEAD(file_lock_list);

static struct kmem_cache *filelock_cache __read_mostly;

@@ -505,9 +504,9 @@ __locks_delete_global_blocked(struct file_lock *waiter)
static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
- spin_lock(&file_lock_lock);
+ spin_lock(&blocked_hash_lock);
__locks_delete_global_blocked(waiter);
- spin_unlock(&file_lock_lock);
+ spin_unlock(&blocked_hash_lock);
}

static inline void
@@ -581,14 +580,14 @@ static void locks_wake_up_blocks(struct file_lock *blocker)

/*
* Wake up processes blocked waiting for blocker. In the FL_POSIX case, we must
- * also take the global file_lock_lock and dequeue it from the global blocked
- * list as we wake the processes.
+ * also take the global blocked_hash_lock and dequeue it from the global
+ * blocked list as we wake the processes.
*
* Must be called with the inode->i_lock of the blocker held!
*/
static void locks_wake_up_posix_blocks(struct file_lock *blocker)
{
- spin_lock(&file_lock_lock);
+ spin_lock(&blocked_hash_lock);
while (!list_empty(&blocker->fl_block)) {
struct file_lock *waiter;

@@ -601,7 +600,7 @@ static void locks_wake_up_posix_blocks(struct file_lock *blocker)
else
wake_up(&waiter->fl_wait);
}
- spin_unlock(&file_lock_lock);
+ spin_unlock(&blocked_hash_lock);
}
/* Insert file lock fl into an inode's lock list at the position indicated
* by pos. At the same time add the lock to the global file lock list.
@@ -754,7 +753,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
return NULL;
}

-/* Must be called with the file_lock_lock held! */
+/* Must be called with the blocked_hash_lock held! */
static int posix_locks_deadlock(struct file_lock *caller_fl,
struct file_lock *block_fl)
{
@@ -898,13 +897,13 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
if (!(request->fl_flags & FL_SLEEP))
goto out;
error = -EDEADLK;
- spin_lock(&file_lock_lock);
+ spin_lock(&blocked_hash_lock);
if (likely(!posix_locks_deadlock(request, fl))) {
error = FILE_LOCK_DEFERRED;
locks_insert_block(fl, request);
locks_insert_global_blocked(request);
}
- spin_unlock(&file_lock_lock);
+ spin_unlock(&blocked_hash_lock);
goto out;
}
}
@@ -2309,10 +2308,12 @@ static int locks_show(struct seq_file *f, void *v)

lock_get_status(f, fl, *((loff_t *)f->private), "");

+ spin_lock(&blocked_hash_lock);
hash_for_each(blocked_hash, bkt, bfl, fl_link) {
if (bfl->fl_next == fl)
lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
}
+ spin_unlock(&blocked_hash_lock);

return 0;
}
--
1.7.1

2013-06-11 11:10:59

by Jeff Layton

Subject: [PATCH v2 08/14] locks: ensure that deadlock detection is atomic with respect to blocked_list modification

Sound deadlock detection requires that we hold the file-lock state
steady while checking for deadlocks, and also ensure that updates to that
state are atomic with respect to those checks.

For the checking and insertion side, push the acquisition of the
global lock into __posix_lock_file and ensure that checking and update
of the global lists are done without dropping the lock in between.

On the removal side, when waking up blocked POSIX lock waiters, take
the global lock before walking the blocked list and dequeue the waiters
from the global list prior to removal from the i_flock list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 71 +++++++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index d7342a3..b8cd1b1 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -475,16 +475,20 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
- spin_lock(&file_lock_lock);
list_add(&waiter->fl_link, &blocked_list);
- spin_unlock(&file_lock_lock);
+}
+
+static inline void
+__locks_delete_global_blocked(struct file_lock *waiter)
+{
+ list_del_init(&waiter->fl_link);
}

static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_del_init(&waiter->fl_link);
+ __locks_delete_global_blocked(waiter);
spin_unlock(&file_lock_lock);
}

@@ -509,7 +513,6 @@ locks_delete_global_locks(struct file_lock *waiter)
*/
static void __locks_delete_block(struct file_lock *waiter)
{
- locks_delete_global_blocked(waiter);
list_del_init(&waiter->fl_block);
waiter->fl_next = NULL;
}
@@ -558,6 +561,30 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
}
}

+/*
+ * Wake up processes blocked waiting for blocker. In the FL_POSIX case, we must
+ * also take the global file_lock_lock and dequeue it from the global blocked
+ * list as we wake the processes.
+ *
+ * Must be called with the inode->i_lock of the blocker held!
+ */
+static void locks_wake_up_posix_blocks(struct file_lock *blocker)
+{
+ spin_lock(&file_lock_lock);
+ while (!list_empty(&blocker->fl_block)) {
+ struct file_lock *waiter;
+
+ waiter = list_first_entry(&blocker->fl_block,
+ struct file_lock, fl_block);
+ __locks_delete_global_blocked(waiter);
+ __locks_delete_block(waiter);
+ if (waiter->fl_lmops && waiter->fl_lmops->lm_notify)
+ waiter->fl_lmops->lm_notify(waiter);
+ else
+ wake_up(&waiter->fl_wait);
+ }
+ spin_unlock(&file_lock_lock);
+}
/* Insert file lock fl into an inode's lock list at the position indicated
* by pos. At the same time add the lock to the global file lock list.
*/
@@ -592,7 +619,11 @@ static void locks_delete_lock(struct file_lock **thisfl_p)
fl->fl_nspid = NULL;
}

- locks_wake_up_blocks(fl);
+ if (IS_POSIX(fl))
+ locks_wake_up_posix_blocks(fl);
+ else
+ locks_wake_up_blocks(fl);
+
locks_free_lock(fl);
}

@@ -705,6 +736,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
return NULL;
}

+/* Must be called with the file_lock_lock held! */
static int posix_locks_deadlock(struct file_lock *caller_fl,
struct file_lock *block_fl)
{
@@ -848,17 +880,13 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
if (!(request->fl_flags & FL_SLEEP))
goto out;
error = -EDEADLK;
- /*
- * XXX: potential race here. We should be adding the
- * file_lock to the global list before releasing lock.
- */
spin_lock(&file_lock_lock);
- if (posix_locks_deadlock(request, fl))
- goto out;
+ if (likely(!posix_locks_deadlock(request, fl))) {
+ error = FILE_LOCK_DEFERRED;
+ locks_insert_block(fl, request);
+ locks_insert_global_blocked(request);
+ }
spin_unlock(&file_lock_lock);
- error = FILE_LOCK_DEFERRED;
- locks_insert_block(fl, request);
- locks_insert_global_blocked(request);
goto out;
}
}
@@ -949,7 +977,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
* as the change in lock type might satisfy
* their needs.
*/
- locks_wake_up_blocks(fl);
+ locks_wake_up_posix_blocks(fl);
fl->fl_start = request->fl_start;
fl->fl_end = request->fl_end;
fl->fl_type = request->fl_type;
@@ -1001,11 +1029,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_insert_lock(before, left);
}
right->fl_start = request->fl_end + 1;
- locks_wake_up_blocks(right);
+ locks_wake_up_posix_blocks(right);
}
if (left) {
left->fl_end = request->fl_start - 1;
- locks_wake_up_blocks(left);
+ locks_wake_up_posix_blocks(left);
}
out:
spin_unlock(&inode->i_lock);
@@ -1061,6 +1089,7 @@ int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
if (!error)
continue;

+ locks_delete_global_blocked(fl);
locks_delete_block(fl);
break;
}
@@ -1139,6 +1168,7 @@ int locks_mandatory_area(int read_write, struct inode *inode,
continue;
}

+ locks_delete_global_blocked(&fl);
locks_delete_block(&fl);
break;
}
@@ -1851,6 +1881,7 @@ static int do_lock_file_wait(struct file *filp, unsigned int cmd,
if (!error)
continue;

+ locks_delete_global_blocked(fl);
locks_delete_block(fl);
break;
}
@@ -2148,10 +2179,12 @@ posix_unblock_lock(struct file *filp, struct file_lock *waiter)
int status = 0;

spin_lock(&inode->i_lock);
- if (waiter->fl_next)
+ if (waiter->fl_next) {
+ locks_delete_global_blocked(waiter);
__locks_delete_block(waiter);
- else
+ } else {
status = -ENOENT;
+ }
spin_unlock(&inode->i_lock);
return status;
}
--
1.7.1

2013-06-11 11:11:50

by Jeff Layton

Subject: [PATCH v2 07/14] locks: convert to i_lock to protect i_flock list

Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead.

The exceptions are the global lists that file_lock->fl_link sits on.
Those still need a global lock of some sort, so wrap just those accesses
in the file_lock_lock spinlock.

Note that this patch introduces a potential race in the deadlock
detection code which could result in either false positives or missed
file_lock loops. The blocked_list management needs to be done atomically
with the deadlock detection. This will be fixed in a later patch.

Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
Documentation/filesystems/Locking | 23 +++++--
fs/afs/flock.c | 5 +-
fs/ceph/locks.c | 2 +-
fs/ceph/mds_client.c | 8 +-
fs/cifs/cifsfs.c | 2 +-
fs/cifs/file.c | 13 ++--
fs/gfs2/file.c | 2 +-
fs/lockd/svcsubs.c | 12 ++--
fs/locks.c | 110 ++++++++++++++++++++-----------------
fs/nfs/delegation.c | 11 ++--
fs/nfs/nfs4state.c | 8 +-
fs/nfsd/nfs4state.c | 8 +-
include/linux/fs.h | 11 ----
13 files changed, 113 insertions(+), 102 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 0706d32..13f91ab 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -344,7 +344,7 @@ prototypes:


locking rules:
- file_lock_lock may block
+ inode->i_lock may block
fl_copy_lock: yes no
fl_release_private: maybe no

@@ -357,12 +357,21 @@ prototypes:
int (*lm_change)(struct file_lock **, int);

locking rules:
- file_lock_lock may block
-lm_compare_owner: yes no
-lm_notify: yes no
-lm_grant: no no
-lm_break: yes no
-lm_change yes no
+
+ inode->i_lock file_lock_lock may block
+lm_compare_owner: yes maybe no
+lm_notify: yes no no
+lm_grant: no no no
+lm_break: yes no no
+lm_change yes no no
+
+ ->lm_compare_owner is generally called with *an* inode->i_lock
+held. It may not be the i_lock of the inode for either file_lock being
+compared! This is the case with deadlock detection, since the code has
+to chase down the owners of locks that may be entirely unrelated to the
+one on which the lock is being acquired. For deadlock detection however,
+the file_lock_lock is also held. The locks primarily ensure that neither
+file_lock disappear out from under you while doing the comparison.

--------------------------- buffer_head -----------------------------------
prototypes:
diff --git a/fs/afs/flock.c b/fs/afs/flock.c
index 2497bf3..03fc0d1 100644
--- a/fs/afs/flock.c
+++ b/fs/afs/flock.c
@@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key)
*/
static int afs_do_setlk(struct file *file, struct file_lock *fl)
{
+ struct inode *inode = file_inode(file);
struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
afs_lock_type_t type;
struct key *key = file->private_data;
@@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl)

type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;

- lock_flocks();
+ spin_lock(&inode->i_lock);

/* make sure we've got a callback on this file and that our view of the
* data version is up to date */
@@ -420,7 +421,7 @@ given_lock:
afs_vnode_fetch_status(vnode, NULL, key);

error:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
_leave(" = %d", ret);
return ret;

diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 202dd3d..cd0a664 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -194,7 +194,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
* Encode the flock and fcntl locks for the given inode into the pagelist.
* Format is: #fcntl locks, sequential fcntl locks, #flock locks,
* sequential flock locks.
- * Must be called with lock_flocks() already held.
+ * Must be called with inode->i_lock already held.
* If we encounter more of a specific lock type than expected,
* we return the value 1.
*/
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..ae621b5 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2482,13 +2482,13 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,

ceph_pagelist_set_cursor(pagelist, &trunc_point);
do {
- lock_flocks();
+ spin_lock(&inode->i_lock);
ceph_count_locks(inode, &num_fcntl_locks,
&num_flock_locks);
rec.v2.flock_len = (2*sizeof(u32) +
(num_fcntl_locks+num_flock_locks) *
sizeof(struct ceph_filelock));
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

/* pre-alloc pagelist */
ceph_pagelist_truncate(pagelist, &trunc_point);
@@ -2499,12 +2499,12 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,

/* encode locks */
if (!err) {
- lock_flocks();
+ spin_lock(&inode->i_lock);
err = ceph_encode_locks(inode,
pagelist,
num_fcntl_locks,
num_flock_locks);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
}
} while (err == -ENOSPC);
} else {
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 3752b9f..e81655c 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -765,7 +765,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)

static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
{
- /* note that this is called by vfs setlease with lock_flocks held
+ /* note that this is called by vfs setlease with i_lock held
to protect *lease from going away */
struct inode *inode = file_inode(file);
struct cifsFileInfo *cfile = file->private_data;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 44a4f18..0dd10cd 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1092,6 +1092,7 @@ struct lock_to_push {
static int
cifs_push_posix_locks(struct cifsFileInfo *cfile)
{
+ struct inode *inode = cfile->dentry->d_inode;
struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
struct file_lock *flock, **before;
unsigned int count = 0, i = 0;
@@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)

xid = get_xid();

- lock_flocks();
- cifs_for_each_lock(cfile->dentry->d_inode, before) {
+ spin_lock(&inode->i_lock);
+ cifs_for_each_lock(inode, before) {
if ((*before)->fl_flags & FL_POSIX)
count++;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

INIT_LIST_HEAD(&locks_to_send);

@@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
}

el = locks_to_send.next;
- lock_flocks();
- cifs_for_each_lock(cfile->dentry->d_inode, before) {
+ spin_lock(&inode->i_lock);
+ cifs_for_each_lock(inode, before) {
flock = *before;
if ((flock->fl_flags & FL_POSIX) == 0)
continue;
@@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
lck->offset = flock->fl_start;
el = el->next;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) {
int stored_rc;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index acd1676..9e634e0 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -889,7 +889,7 @@ out_uninit:
* cluster; until we do, disable leases (by just returning -EINVAL),
* unless the administrator has requested purely local locking.
*
- * Locking: called under lock_flocks
+ * Locking: called under i_lock
*
* Returns: errno
*/
diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index 97e8741..dc5c759 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file,

again:
file->f_locks = 0;
- lock_flocks(); /* protects i_flock list */
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl; fl = fl->fl_next) {
if (fl->fl_lmops != &nlmsvc_lock_operations)
continue;
@@ -181,7 +181,7 @@ again:
if (match(lockhost, host)) {
struct file_lock lock = *fl;

- unlock_flocks();
+ spin_unlock(&inode->i_lock);
lock.fl_type = F_UNLCK;
lock.fl_start = 0;
lock.fl_end = OFFSET_MAX;
@@ -193,7 +193,7 @@ again:
goto again;
}
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

return 0;
}
@@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file)
if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares)
return 1;

- lock_flocks();
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl; fl = fl->fl_next) {
if (fl->fl_lmops == &nlmsvc_lock_operations) {
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return 1;
}
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
file->f_locks = 0;
return 0;
}
diff --git a/fs/locks.c b/fs/locks.c
index 3fd27f0..d7342a3 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -155,22 +155,9 @@ int lease_break_time = 45;

static LIST_HEAD(file_lock_list);
static LIST_HEAD(blocked_list);
-static DEFINE_SPINLOCK(file_lock_lock);
-
-/*
- * Protects the two list heads above, plus the inode->i_flock list
- */
-void lock_flocks(void)
-{
- spin_lock(&file_lock_lock);
-}
-EXPORT_SYMBOL_GPL(lock_flocks);

-void unlock_flocks(void)
-{
- spin_unlock(&file_lock_lock);
-}
-EXPORT_SYMBOL_GPL(unlock_flocks);
+/* Protects the two list heads above */
+static DEFINE_SPINLOCK(file_lock_lock);

static struct kmem_cache *filelock_cache __read_mostly;

@@ -488,25 +475,33 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_add(&waiter->fl_link, &blocked_list);
+ spin_unlock(&file_lock_lock);
}

static inline void
locks_delete_global_blocked(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_del_init(&waiter->fl_link);
+ spin_unlock(&file_lock_lock);
}

static inline void
locks_insert_global_locks(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_add_tail(&waiter->fl_link, &file_lock_list);
+ spin_unlock(&file_lock_lock);
}

static inline void
locks_delete_global_locks(struct file_lock *waiter)
{
+ spin_lock(&file_lock_lock);
list_del_init(&waiter->fl_link);
+ spin_unlock(&file_lock_lock);
}

/* Remove waiter from blocker's block list.
@@ -523,9 +518,11 @@ static void __locks_delete_block(struct file_lock *waiter)
*/
static void locks_delete_block(struct file_lock *waiter)
{
- lock_flocks();
+ struct inode *inode = file_inode(waiter->fl_file);
+
+ spin_lock(&inode->i_lock);
__locks_delete_block(waiter);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
}

/* Insert waiter into blocker's block list.
@@ -544,7 +541,7 @@ static void locks_insert_block(struct file_lock *blocker,
/*
* Wake up processes blocked waiting for blocker.
*
- * Must be called with the file_lock_lock held!
+ * Must be called with the inode->i_lock of the blocker held!
*/
static void locks_wake_up_blocks(struct file_lock *blocker)
{
@@ -649,8 +646,9 @@ void
posix_test_lock(struct file *filp, struct file_lock *fl)
{
struct file_lock *cfl;
+ struct inode *inode = file_inode(filp);

- lock_flocks();
+ spin_lock(&inode->i_lock);
for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) {
if (!IS_POSIX(cfl))
continue;
@@ -663,7 +661,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
fl->fl_pid = pid_vnr(cfl->fl_nspid);
} else
fl->fl_type = F_UNLCK;
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return;
}
EXPORT_SYMBOL(posix_test_lock);
@@ -742,7 +740,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
return -ENOMEM;
}

- lock_flocks();
+ spin_lock(&inode->i_lock);
if (request->fl_flags & FL_ACCESS)
goto find_conflict;

@@ -772,9 +770,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
* give it the opportunity to lock the file.
*/
if (found) {
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
cond_resched();
- lock_flocks();
+ spin_lock(&inode->i_lock);
}

find_conflict:
@@ -801,7 +799,7 @@ find_conflict:
error = 0;

out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
if (new_fl)
locks_free_lock(new_fl);
return error;
@@ -831,7 +829,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
new_fl2 = locks_alloc_lock();
}

- lock_flocks();
+ spin_lock(&inode->i_lock);
/*
* New lock request. Walk all POSIX locks and look for conflicts. If
* there are any, either return error or put the request on the
@@ -850,8 +848,14 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
if (!(request->fl_flags & FL_SLEEP))
goto out;
error = -EDEADLK;
+ /*
+ * XXX: potential race here. We should be adding the
+ * file_lock to the global list before releasing lock.
+ */
+ spin_lock(&file_lock_lock);
if (posix_locks_deadlock(request, fl))
goto out;
+ spin_unlock(&file_lock_lock);
error = FILE_LOCK_DEFERRED;
locks_insert_block(fl, request);
locks_insert_global_blocked(request);
@@ -1004,7 +1008,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_wake_up_blocks(left);
}
out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
/*
* Free any unused locks.
*/
@@ -1079,14 +1083,14 @@ int locks_mandatory_locked(struct inode *inode)
/*
* Search the lock list for this inode for any POSIX locks.
*/
- lock_flocks();
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!IS_POSIX(fl))
continue;
if (fl->fl_owner != owner)
break;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return fl ? -EAGAIN : 0;
}

@@ -1229,7 +1233,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
if (IS_ERR(new_fl))
return PTR_ERR(new_fl);

- lock_flocks();
+ spin_lock(&inode->i_lock);

time_out_leases(inode);

@@ -1279,10 +1283,10 @@ restart:
break_time++;
}
locks_insert_block(flock, new_fl);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
error = wait_event_interruptible_timeout(new_fl->fl_wait,
!new_fl->fl_next, break_time);
- lock_flocks();
+ spin_lock(&inode->i_lock);
__locks_delete_block(new_fl);
if (error >= 0) {
if (error == 0)
@@ -1300,7 +1304,7 @@ restart:
}

out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
locks_free_lock(new_fl);
return error;
}
@@ -1353,9 +1357,10 @@ EXPORT_SYMBOL(lease_get_mtime);
int fcntl_getlease(struct file *filp)
{
struct file_lock *fl;
+ struct inode *inode = file_inode(filp);
int type = F_UNLCK;

- lock_flocks();
+ spin_lock(&inode->i_lock);
time_out_leases(file_inode(filp));
for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl);
fl = fl->fl_next) {
@@ -1364,7 +1369,7 @@ int fcntl_getlease(struct file *filp)
break;
}
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return type;
}

@@ -1458,7 +1463,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp)
* The (input) flp->fl_lmops->lm_break function is required
* by break_lease().
*
- * Called with file_lock_lock held.
+ * Called with inode->i_lock held.
*/
int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
{
@@ -1527,11 +1532,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease)

int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
{
+ struct inode *inode = file_inode(filp);
int error;

- lock_flocks();
+ spin_lock(&inode->i_lock);
error = __vfs_setlease(filp, arg, lease);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

return error;
}
@@ -1549,6 +1555,7 @@ static int do_fcntl_delete_lease(struct file *filp)
static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
{
struct file_lock *fl, *ret;
+ struct inode *inode = file_inode(filp);
struct fasync_struct *new;
int error;

@@ -1562,10 +1569,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
return -ENOMEM;
}
ret = fl;
- lock_flocks();
+ spin_lock(&inode->i_lock);
error = __vfs_setlease(filp, arg, &ret);
if (error) {
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
locks_free_lock(fl);
goto out_free_fasync;
}
@@ -1582,7 +1589,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
new = NULL;

error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
- unlock_flocks();
+ spin_unlock(&inode->i_lock);

out_free_fasync:
if (new)
@@ -2106,7 +2113,7 @@ void locks_remove_flock(struct file *filp)
fl.fl_ops->fl_release_private(&fl);
}

- lock_flocks();
+ spin_lock(&inode->i_lock);
before = &inode->i_flock;

while ((fl = *before) != NULL) {
@@ -2124,7 +2131,7 @@ void locks_remove_flock(struct file *filp)
}
before = &fl->fl_next;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
}

/**
@@ -2137,14 +2144,15 @@ void locks_remove_flock(struct file *filp)
int
posix_unblock_lock(struct file *filp, struct file_lock *waiter)
{
+ struct inode *inode = file_inode(filp);
int status = 0;

- lock_flocks();
+ spin_lock(&inode->i_lock);
if (waiter->fl_next)
__locks_delete_block(waiter);
else
status = -ENOENT;
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return status;
}

@@ -2261,7 +2269,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
{
loff_t *p = f->private;

- lock_flocks();
+ spin_lock(&file_lock_lock);
*p = (*pos + 1);
return seq_list_start(&file_lock_list, *pos);
}
@@ -2275,7 +2283,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)

static void locks_stop(struct seq_file *f, void *v)
{
- unlock_flocks();
+ spin_unlock(&file_lock_lock);
}

static const struct seq_operations locks_seq_operations = {
@@ -2322,7 +2330,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_flocks();
+
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if (fl->fl_type == F_RDLCK)
@@ -2339,7 +2348,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return result;
}

@@ -2362,7 +2371,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
{
struct file_lock *fl;
int result = 1;
- lock_flocks();
+
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (IS_POSIX(fl)) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -2377,7 +2387,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
result = 0;
break;
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return result;
}

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index 57db324..43ee7f9 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -73,20 +73,21 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_
if (inode->i_flock == NULL)
goto out;

- /* Protect inode->i_flock using the file locks lock */
- lock_flocks();
+ /* Protect inode->i_flock using the i_lock */
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
continue;
if (nfs_file_open_context(fl->fl_file) != ctx)
continue;
- unlock_flocks();
+ /* FIXME: safe to drop lock here while walking list? */
+ spin_unlock(&inode->i_lock);
status = nfs4_lock_delegation_recall(fl, state, stateid);
if (status < 0)
goto out;
- lock_flocks();
+ spin_lock(&inode->i_lock);
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
out:
return status;
}
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 1fab140..ff10b4a 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
/* Guard against delegation returns and new lock/unlock calls */
down_write(&nfsi->rwsem);
/* Protect inode->i_flock using the BKL */
- lock_flocks();
+ spin_lock(&inode->i_lock);
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
continue;
if (nfs_file_open_context(fl->fl_file)->state != state)
continue;
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
status = ops->recover_lock(state, fl);
switch (status) {
case 0:
@@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
/* kill_proc(fl->fl_pid, SIGLOST, 1); */
status = 0;
}
- lock_flocks();
+ spin_lock(&inode->i_lock);
}
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
out:
up_write(&nfsi->rwsem);
return status;
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 316ec84..f170518 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)

list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);

- /* only place dl_time is set. protected by lock_flocks*/
+ /* Only place dl_time is set; protected by i_lock: */
dp->dl_time = get_seconds();

nfsd4_cb_recall(dp);
}

-/* Called from break_lease() with lock_flocks() held. */
+/* Called from break_lease() with i_lock held. */
static void nfsd_break_deleg_cb(struct file_lock *fl)
{
struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner;
@@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
struct inode *inode = filp->fi_inode;
int status = 0;

- lock_flocks();
+ spin_lock(&inode->i_lock);
for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) {
if ((*flpp)->fl_owner == (fl_owner_t)lowner) {
status = 1;
@@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
}
}
out:
- unlock_flocks();
+ spin_unlock(&inode->i_lock);
return status;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94105d2..e2f896d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1024,8 +1024,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
extern int lease_modify(struct file_lock **, int);
extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void lock_flocks(void);
-extern void unlock_flocks(void);
#else /* !CONFIG_FILE_LOCKING */
static inline int fcntl_getlk(struct file *file, struct flock __user *user)
{
@@ -1167,15 +1165,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
{
return 1;
}
-
-static inline void lock_flocks(void)
-{
-}
-
-static inline void unlock_flocks(void)
-{
-}
-
#endif /* !CONFIG_FILE_LOCKING */


--
1.7.1

2013-06-11 11:11:49

by Jeff Layton

Subject: [PATCH v2 14/14] locks: move file_lock_list to a set of percpu hlist_heads and convert file_lock_lock to an lglock

The file_lock_list is only used for /proc/locks. The vastly common case
is for locks to be put onto the list and come off again, without ever
being traversed.

Help optimize for this use-case by moving to percpu hlist_head-s. At the
same time, we can make the locking less contentious by moving to an
lglock. When iterating over the lists for /proc/locks, we must take the
global lock and then iterate over each CPU's list in turn.

This change necessitates a new fl_link_cpu field to keep track of which
CPU the entry is on. On x86_64 at least, this field is placed within an
existing hole in the struct to avoid growing its real size.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 57 +++++++++++++++++++++++++++++++++++----------------
include/linux/fs.h | 1 +
2 files changed, 40 insertions(+), 18 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 8124fc1..094eb4d 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -127,6 +127,8 @@
#include <linux/rcupdate.h>
#include <linux/pid_namespace.h>
#include <linux/hashtable.h>
+#include <linux/percpu.h>
+#include <linux/lglock.h>

#include <asm/uaccess.h>

@@ -165,8 +167,8 @@ int lease_break_time = 45;
static DEFINE_SPINLOCK(blocked_hash_lock);
static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);

-static DEFINE_SPINLOCK(file_lock_lock);
-static HLIST_HEAD(file_lock_list);
+DEFINE_STATIC_LGLOCK(file_lock_lglock);
+static DEFINE_PER_CPU(struct hlist_head, file_lock_list);

static struct kmem_cache *filelock_cache __read_mostly;

@@ -512,17 +514,21 @@ locks_delete_global_blocked(struct file_lock *waiter)
static inline void
locks_insert_global_locks(struct file_lock *waiter)
{
- spin_lock(&file_lock_lock);
- hlist_add_head(&waiter->fl_link, &file_lock_list);
- spin_unlock(&file_lock_lock);
+ lg_local_lock(&file_lock_lglock);
+ waiter->fl_link_cpu = smp_processor_id();
+ hlist_add_head(&waiter->fl_link, this_cpu_ptr(&file_lock_list));
+ lg_local_unlock(&file_lock_lglock);
}

static inline void
locks_delete_global_locks(struct file_lock *waiter)
{
- spin_lock(&file_lock_lock);
+ /* avoid taking lock if already unhashed */
+ if (hlist_unhashed(&waiter->fl_link))
+ return;
+ lg_local_lock_cpu(&file_lock_lglock, waiter->fl_link_cpu);
hlist_del_init(&waiter->fl_link);
- spin_unlock(&file_lock_lock);
+ lg_local_unlock_cpu(&file_lock_lglock, waiter->fl_link_cpu);
}

/* Remove waiter from blocker's block list.
@@ -2228,6 +2234,11 @@ EXPORT_SYMBOL_GPL(vfs_cancel_lock);
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

+struct locks_iterator {
+ int li_cpu;
+ loff_t li_pos;
+};
+
static void lock_get_status(struct seq_file *f, struct file_lock *fl,
loff_t id, char *pfx)
{
@@ -2302,16 +2313,17 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
static int locks_show(struct seq_file *f, void *v)
{
int bkt;
+ struct locks_iterator *iter = f->private;
struct file_lock *fl, *bfl;

fl = hlist_entry(v, struct file_lock, fl_link);

- lock_get_status(f, fl, *((loff_t *)f->private), "");
+ lock_get_status(f, fl, iter->li_pos, "");

spin_lock(&blocked_hash_lock);
hash_for_each(blocked_hash, bkt, bfl, fl_link) {
if (bfl->fl_next == fl)
- lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
+ lock_get_status(f, bfl, iter->li_pos, " ->");
}
spin_unlock(&blocked_hash_lock);

@@ -2320,23 +2332,24 @@ static int locks_show(struct seq_file *f, void *v)

static void *locks_start(struct seq_file *f, loff_t *pos)
{
- loff_t *p = f->private;
+ struct locks_iterator *iter = f->private;

- spin_lock(&file_lock_lock);
- *p = (*pos + 1);
- return seq_hlist_start(&file_lock_list, *pos);
+ iter->li_pos = *pos + 1;
+ lg_global_lock(&file_lock_lglock);
+ return seq_hlist_start_percpu(&file_lock_list, &iter->li_cpu, *pos);
}

static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
{
- loff_t *p = f->private;
- ++*p;
- return seq_hlist_next(v, &file_lock_list, pos);
+ struct locks_iterator *iter = f->private;
+
+ ++iter->li_pos;
+ return seq_hlist_next_percpu(v, &file_lock_list, &iter->li_cpu, pos);
}

static void locks_stop(struct seq_file *f, void *v)
{
- spin_unlock(&file_lock_lock);
+ lg_global_unlock(&file_lock_lglock);
}

static const struct seq_operations locks_seq_operations = {
@@ -2348,7 +2361,8 @@ static const struct seq_operations locks_seq_operations = {

static int locks_open(struct inode *inode, struct file *filp)
{
- return seq_open_private(filp, &locks_seq_operations, sizeof(loff_t));
+ return seq_open_private(filp, &locks_seq_operations,
+ sizeof(struct locks_iterator));
}

static const struct file_operations proc_locks_operations = {
@@ -2448,9 +2462,16 @@ EXPORT_SYMBOL(lock_may_write);

static int __init filelock_init(void)
{
+ int i;
+
filelock_cache = kmem_cache_create("file_lock_cache",
sizeof(struct file_lock), 0, SLAB_PANIC, NULL);

+ lg_lock_init(&file_lock_lglock, "file_lock_lglock");
+
+ for_each_possible_cpu(i)
+ INIT_HLIST_HEAD(per_cpu_ptr(&file_lock_list, i));
+
return 0;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 232a345..18e59b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -953,6 +953,7 @@ struct file_lock {
unsigned int fl_flags;
unsigned char fl_type;
unsigned int fl_pid;
+ int fl_link_cpu; /* what cpu's list is this on? */
struct pid *fl_nspid;
wait_queue_head_t fl_wait;
struct file *fl_file;
--
1.7.1

2013-06-11 11:09:51

by Jeff Layton

Subject: [PATCH v2 05/14] locks: encapsulate the fl_link list handling

Move the fl_link list handling routines into a separate set of helpers.
Also move the global list handling out of locks_insert_block, and into
the caller that ends up triggering it, as that allows us to eliminate the
IS_POSIX check there.

Signed-off-by: Jeff Layton <[email protected]>
Acked-by: J. Bruce Fields <[email protected]>
---
fs/locks.c | 37 +++++++++++++++++++++++++++++++------
1 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index c186649..e451d18 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -484,13 +484,38 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
return fl1->fl_owner == fl2->fl_owner;
}

+/* Remove a blocker or lock from one of the global lists */
+static inline void
+locks_insert_global_blocked(struct file_lock *waiter)
+{
+ list_add(&waiter->fl_link, &blocked_list);
+}
+
+static inline void
+locks_delete_global_blocked(struct file_lock *waiter)
+{
+ list_del_init(&waiter->fl_link);
+}
+
+static inline void
+locks_insert_global_locks(struct file_lock *waiter)
+{
+ list_add_tail(&waiter->fl_link, &file_lock_list);
+}
+
+static inline void
+locks_delete_global_locks(struct file_lock *waiter)
+{
+ list_del_init(&waiter->fl_link);
+}
+
/* Remove waiter from blocker's block list.
* When blocker ends up pointing to itself then the list is empty.
*/
static void __locks_delete_block(struct file_lock *waiter)
{
+ locks_delete_global_blocked(waiter);
list_del_init(&waiter->fl_block);
- list_del_init(&waiter->fl_link);
waiter->fl_next = NULL;
}

@@ -514,8 +539,6 @@ static void locks_insert_block(struct file_lock *blocker,
BUG_ON(!list_empty(&waiter->fl_block));
list_add_tail(&waiter->fl_block, &blocker->fl_block);
waiter->fl_next = blocker;
- if (IS_POSIX(blocker))
- list_add(&waiter->fl_link, &blocked_list);
}

/*
@@ -543,13 +566,13 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
*/
static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
{
- list_add(&fl->fl_link, &file_lock_list);
-
fl->fl_nspid = get_pid(task_tgid(current));

/* insert into file's list */
fl->fl_next = *pos;
*pos = fl;
+
+ locks_insert_global_locks(fl);
}

/*
@@ -562,9 +585,10 @@ static void locks_delete_lock(struct file_lock **thisfl_p)
{
struct file_lock *fl = *thisfl_p;

+ locks_delete_global_locks(fl);
+
*thisfl_p = fl->fl_next;
fl->fl_next = NULL;
- list_del_init(&fl->fl_link);

if (fl->fl_nspid) {
put_pid(fl->fl_nspid);
@@ -830,6 +854,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
goto out;
error = FILE_LOCK_DEFERRED;
locks_insert_block(fl, request);
+ locks_insert_global_blocked(request);
goto out;
}
}
--
1.7.1

2013-06-11 11:12:45

by Jeff Layton

Subject: [PATCH v2 11/14] locks: add a new "lm_owner_key" lock operation

Currently, the key that the locking code uses to add a blocked lock to
the blocked_hash is calculated using only the fl_owner field. That's
valid in most cases, but not for server-side lockd, which identifies the
owner of a lock by both fl_owner and fl_pid.

In the case where you have a small number of NFS clients doing a lot
of locking between different processes, you could end up with all
the blocked requests sitting in a very small number of hash buckets.

Add a new lm_owner_key operation to the lock_manager_operations that
will generate an unsigned long to use as the key in the hashtable.
That function is only implemented for server-side lockd, and simply
XORs the fl_owner and fl_pid.

Signed-off-by: Jeff Layton <[email protected]>
---
Documentation/filesystems/Locking | 18 +++++++++++-------
fs/lockd/svclock.c | 12 ++++++++++++
fs/locks.c | 12 ++++++++++--
include/linux/fs.h | 1 +
4 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 13f91ab..ee351ac 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -351,6 +351,7 @@ fl_release_private: maybe no
----------------------- lock_manager_operations ---------------------------
prototypes:
int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
+ unsigned long (*lm_owner_key)(struct file_lock *);
void (*lm_notify)(struct file_lock *); /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *); /* break_lease callback */
@@ -360,18 +361,21 @@ locking rules:

inode->i_lock file_lock_lock may block
lm_compare_owner: yes maybe no
+lm_owner_key yes yes no
lm_notify: yes no no
lm_grant: no no no
lm_break: yes no no
lm_change yes no no

- ->lm_compare_owner is generally called with *an* inode->i_lock
-held. It may not be the i_lock of the inode for either file_lock being
-compared! This is the case with deadlock detection, since the code has
-to chase down the owners of locks that may be entirely unrelated to the
-one on which the lock is being acquired. For deadlock detection however,
-the file_lock_lock is also held. The locks primarily ensure that neither
-file_lock disappear out from under you while doing the comparison.
+ ->lm_compare_owner and ->lm_owner_key are generally called with
+*an* inode->i_lock held. It may not be the i_lock of the inode
+associated with either file_lock argument! This is the case with deadlock
+detection, since the code has to chase down the owners of locks that may
+be entirely unrelated to the one on which the lock is being acquired.
+For deadlock detection however, the file_lock_lock is also held. The
+fact that these locks are held ensures that the file_locks do not
+disappear out from under you while doing the comparison or generating an
+owner key.

--------------------------- buffer_head -----------------------------------
prototypes:
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index e703318..ce2cdab 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -744,8 +744,20 @@ static int nlmsvc_same_owner(struct file_lock *fl1, struct file_lock *fl2)
return fl1->fl_owner == fl2->fl_owner && fl1->fl_pid == fl2->fl_pid;
}

+/*
+ * Since NLM uses two "keys" for tracking locks, we need to hash them down
+ * to one for the blocked_hash. Here, we're just xor'ing the host address
+ * with the pid in order to create a key value for picking a hash bucket.
+ */
+static unsigned long
+nlmsvc_owner_key(struct file_lock *fl)
+{
+ return (unsigned long)fl->fl_owner ^ (unsigned long)fl->fl_pid;
+}
+
const struct lock_manager_operations nlmsvc_lock_operations = {
.lm_compare_owner = nlmsvc_same_owner,
+ .lm_owner_key = nlmsvc_owner_key,
.lm_notify = nlmsvc_notify_blocked,
.lm_grant = nlmsvc_grant_deferred,
};
diff --git a/fs/locks.c b/fs/locks.c
index 76fb7af..11e7784 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -481,11 +481,19 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
return fl1->fl_owner == fl2->fl_owner;
}

+static unsigned long
+posix_owner_key(struct file_lock *fl)
+{
+ if (fl->fl_lmops && fl->fl_lmops->lm_owner_key)
+ return fl->fl_lmops->lm_owner_key(fl);
+ return (unsigned long)fl->fl_owner;
+}
+
/* Remove a blocker or lock from one of the global lists */
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
- hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner);
+ hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter));
}

static inline void
@@ -739,7 +747,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
struct file_lock *fl;

- hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
+ hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
if (posix_same_owner(fl, block_fl))
return fl->fl_next;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b340f7..232a345 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -908,6 +908,7 @@ struct file_lock_operations {

struct lock_manager_operations {
int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
+ unsigned long (*lm_owner_key)(struct file_lock *);
void (*lm_notify)(struct file_lock *); /* unblock callback */
int (*lm_grant)(struct file_lock *, struct file_lock *, int);
void (*lm_break)(struct file_lock *);
--
1.7.1

2013-06-11 11:09:50

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 01/14] cifs: use posix_unblock_lock instead of locks_delete_block

commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported
the locks_delete_block symbol. There's already an exported helper
function that provides this capability however, so make cifs use that
instead and turn locks_delete_block back into a static function.

Note that if fl->fl_next == NULL then this lock has already been through
locks_delete_block(), so we should be OK to ignore an ENOENT error here
and simply not retry the lock.

Cc: Pavel Shilovsky <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Acked-by: J. Bruce Fields <[email protected]>
---
fs/cifs/file.c | 2 +-
fs/locks.c | 3 +--
include/linux/fs.h | 5 -----
3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 48b29d2..44a4f18 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -999,7 +999,7 @@ try_again:
rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
if (!rc)
goto try_again;
- locks_delete_block(flock);
+ posix_unblock_lock(file, flock);
}
return rc;
}
diff --git a/fs/locks.c b/fs/locks.c
index cb424a4..7a02064 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock *waiter)

/*
*/
-void locks_delete_block(struct file_lock *waiter)
+static void locks_delete_block(struct file_lock *waiter)
{
lock_flocks();
__locks_delete_block(waiter);
unlock_flocks();
}
-EXPORT_SYMBOL(locks_delete_block);

/* Insert waiter into blocker's block list.
* We use a circular list so that processes can be easily woken up in
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..b9d7816 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
extern int lease_modify(struct file_lock **, int);
extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void locks_delete_block(struct file_lock *waiter);
extern void lock_flocks(void);
extern void unlock_flocks(void);
#else /* !CONFIG_FILE_LOCKING */
@@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
return 1;
}

-static inline void locks_delete_block(struct file_lock *waiter)
-{
-}
-
static inline void lock_flocks(void)
{
}
--
1.7.1

2013-06-11 11:13:35

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 09/14] locks: convert fl_link to a hlist_node

Testing has shown that iterating over the blocked_list for deadlock
detection turns out to be a bottleneck. In order to alleviate that,
begin the process of turning it into a hashtable. We start by turning
the fl_link into a hlist_node and the global lists into hlists. A later
patch will do the conversion of the blocked_list to a hashtable.
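
As an aside, the size argument for hlists can be seen with a trivial
userspace program; the struct layouts below are copied from the kernel's
include/linux/types.h. An hlist bucket head is a single pointer, half the
size of a list_head, which is part of what makes the later hashtable
conversion cheap:

#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

int main(void)
{
    printf("list_head bucket head:  %zu bytes\n", sizeof(struct list_head));
    printf("hlist_head bucket head: %zu bytes\n", sizeof(struct hlist_head));
    return 0;
}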

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 26 +++++++++++++-------------
include/linux/fs.h | 2 +-
2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index b8cd1b1..28959bc 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -153,8 +153,8 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)

-static LIST_HEAD(file_lock_list);
-static LIST_HEAD(blocked_list);
+static HLIST_HEAD(file_lock_list);
+static HLIST_HEAD(blocked_list);

/* Protects the two list heads above */
static DEFINE_SPINLOCK(file_lock_lock);
@@ -163,7 +163,7 @@ static struct kmem_cache *filelock_cache __read_mostly;

static void locks_init_lock_heads(struct file_lock *fl)
{
- INIT_LIST_HEAD(&fl->fl_link);
+ INIT_HLIST_NODE(&fl->fl_link);
INIT_LIST_HEAD(&fl->fl_block);
init_waitqueue_head(&fl->fl_wait);
}
@@ -197,7 +197,7 @@ void locks_free_lock(struct file_lock *fl)
{
BUG_ON(waitqueue_active(&fl->fl_wait));
BUG_ON(!list_empty(&fl->fl_block));
- BUG_ON(!list_empty(&fl->fl_link));
+ BUG_ON(!hlist_unhashed(&fl->fl_link));

locks_release_private(fl);
kmem_cache_free(filelock_cache, fl);
@@ -475,13 +475,13 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
- list_add(&waiter->fl_link, &blocked_list);
+ hlist_add_head(&waiter->fl_link, &blocked_list);
}

static inline void
__locks_delete_global_blocked(struct file_lock *waiter)
{
- list_del_init(&waiter->fl_link);
+ hlist_del_init(&waiter->fl_link);
}

static inline void
@@ -496,7 +496,7 @@ static inline void
locks_insert_global_locks(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_add_tail(&waiter->fl_link, &file_lock_list);
+ hlist_add_head(&waiter->fl_link, &file_lock_list);
spin_unlock(&file_lock_lock);
}

@@ -504,7 +504,7 @@ static inline void
locks_delete_global_locks(struct file_lock *waiter)
{
spin_lock(&file_lock_lock);
- list_del_init(&waiter->fl_link);
+ hlist_del_init(&waiter->fl_link);
spin_unlock(&file_lock_lock);
}

@@ -729,7 +729,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
struct file_lock *fl;

- list_for_each_entry(fl, &blocked_list, fl_link) {
+ hlist_for_each_entry(fl, &blocked_list, fl_link) {
if (posix_same_owner(fl, block_fl))
return fl->fl_next;
}
@@ -2286,11 +2286,11 @@ static int locks_show(struct seq_file *f, void *v)
{
struct file_lock *fl, *bfl;

- fl = list_entry(v, struct file_lock, fl_link);
+ fl = hlist_entry(v, struct file_lock, fl_link);

lock_get_status(f, fl, *((loff_t *)f->private), "");

- list_for_each_entry(bfl, &blocked_list, fl_link) {
+ hlist_for_each_entry(bfl, &blocked_list, fl_link) {
if (bfl->fl_next == fl)
lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
}
@@ -2304,14 +2304,14 @@ static void *locks_start(struct seq_file *f, loff_t *pos)

spin_lock(&file_lock_lock);
*p = (*pos + 1);
- return seq_list_start(&file_lock_list, *pos);
+ return seq_hlist_start(&file_lock_list, *pos);
}

static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
{
loff_t *p = f->private;
++*p;
- return seq_list_next(v, &file_lock_list, pos);
+ return seq_hlist_next(v, &file_lock_list, pos);
}

static void locks_stop(struct seq_file *f, void *v)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e2f896d..3b340f7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -946,7 +946,7 @@ int locks_in_grace(struct net *);
*/
struct file_lock {
struct file_lock *fl_next; /* singly linked list for this inode */
- struct list_head fl_link; /* doubly linked list of all locks */
+ struct hlist_node fl_link; /* node in global lists */
struct list_head fl_block; /* circular list of blocked processes */
fl_owner_t fl_owner;
unsigned int fl_flags;
--
1.7.1

2013-06-11 11:14:11

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 10/14] locks: turn the blocked_list into a hashtable

Break up the blocked_list into a hashtable, using the fl_owner as a key.
This speeds up searching the hash chains, which is especially significant
for deadlock detection.

Note that the initial implementation assumes that hashing on fl_owner is
sufficient. In most cases it should be, with the notable exception being
server-side lockd, which compares ownership using a tuple of the
nlm_host and the pid sent in the lock request. So, this may degrade to a
single hash bucket when you only have a single NFS client. That will be
addressed in a later patch.

The careful observer may note that this patch leaves the file_lock_list
alone. There's much less of a case for turning the file_lock_list into a
hashtable. The only user of that list is the code that generates
/proc/locks, and it always walks the entire list.
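
For illustration only (not from the patch), here is a compressed userspace
analogue of the lookup that deadlock detection does after this change: hash
the owner, then walk just that chain instead of scanning every blocked lock
in the system. The hash function and owner values are placeholders:

#include <stdio.h>

#define BLOCKED_HASH_BITS 7
#define NBUCKETS (1U << BLOCKED_HASH_BITS)

struct blocked {
    void *owner;            /* stands in for fl_owner */
    struct blocked *next;   /* hash chain, like fl_link */
};

static struct blocked *blocked_hash[NBUCKETS];

static unsigned int bucket_of(void *owner)
{
    /* crude stand-in for the kernel's hash_long() */
    return ((unsigned long)owner >> 4) & (NBUCKETS - 1);
}

static void insert_blocked(struct blocked *b)
{
    unsigned int i = bucket_of(b->owner);

    b->next = blocked_hash[i];
    blocked_hash[i] = b;
}

/* analogue of what_owner_is_waiting_for(): only one chain gets scanned */
static struct blocked *find_blocked(void *owner)
{
    struct blocked *b;

    for (b = blocked_hash[bucket_of(owner)]; b; b = b->next)
        if (b->owner == owner)
            return b;
    return NULL;
}

int main(void)
{
    struct blocked a = { (void *)0x1000, NULL };
    struct blocked b = { (void *)0x2000, NULL };

    insert_blocked(&a);
    insert_blocked(&b);
    printf("owner 0x2000 blocked? %s\n",
           find_blocked((void *)0x2000) ? "yes" : "no");
    return 0;
}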

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 25 ++++++++++++++++++-------
1 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 28959bc..76fb7af 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -126,6 +126,7 @@
#include <linux/time.h>
#include <linux/rcupdate.h>
#include <linux/pid_namespace.h>
+#include <linux/hashtable.h>

#include <asm/uaccess.h>

@@ -153,10 +154,19 @@ int lease_break_time = 45;
#define for_each_lock(inode, lockp) \
for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)

+/*
+ * By breaking up the blocked locks list into a hashtable, we speed up the
+ * deadlock detection.
+ *
+ * FIXME: make this value scale via some heuristic?
+ */
+#define BLOCKED_HASH_BITS 7
+
+static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
+
static HLIST_HEAD(file_lock_list);
-static HLIST_HEAD(blocked_list);

-/* Protects the two list heads above */
+/* Protects the file_lock_list and the blocked_hash */
static DEFINE_SPINLOCK(file_lock_lock);

static struct kmem_cache *filelock_cache __read_mostly;
@@ -475,13 +485,13 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
static inline void
locks_insert_global_blocked(struct file_lock *waiter)
{
- hlist_add_head(&waiter->fl_link, &blocked_list);
+ hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner);
}

static inline void
__locks_delete_global_blocked(struct file_lock *waiter)
{
- hlist_del_init(&waiter->fl_link);
+ hash_del(&waiter->fl_link);
}

static inline void
@@ -729,7 +739,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
{
struct file_lock *fl;

- hlist_for_each_entry(fl, &blocked_list, fl_link) {
+ hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
if (posix_same_owner(fl, block_fl))
return fl->fl_next;
}
@@ -865,7 +875,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
/*
* New lock request. Walk all POSIX locks and look for conflicts. If
* there are any, either return error or put the request on the
- * blocker's list of waiters and the global blocked_list.
+ * blocker's list of waiters and the global blocked_hash.
*/
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
@@ -2284,13 +2294,14 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,

static int locks_show(struct seq_file *f, void *v)
{
+ int bkt;
struct file_lock *fl, *bfl;

fl = hlist_entry(v, struct file_lock, fl_link);

lock_get_status(f, fl, *((loff_t *)f->private), "");

- hlist_for_each_entry(bfl, &blocked_list, fl_link) {
+ hash_for_each(blocked_hash, bkt, bfl, fl_link) {
if (bfl->fl_next == fl)
lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
}
--
1.7.1

2013-06-11 11:09:47

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 06/14] locks: don't walk inode->i_flock list in locks_show

When we convert over to using the i_lock to protect the i_flock list,
that will introduce a potential lock inversion problem in locks_show:
the seq_file code already holds the file_lock_lock when locks_show runs,
so walking the i_flock list there would mean taking the i_lock under it,
while the locking code takes the i_lock before the file_lock_lock.

Rather than do that, just walk the global blocked_list and print out any
locks that are blocked on the given lock.

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index e451d18..3fd27f0 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2249,8 +2249,10 @@ static int locks_show(struct seq_file *f, void *v)

lock_get_status(f, fl, *((loff_t *)f->private), "");

- list_for_each_entry(bfl, &fl->fl_block, fl_block)
- lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
+ list_for_each_entry(bfl, &blocked_list, fl_link) {
+ if (bfl->fl_next == fl)
+ lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
+ }

return 0;
}
--
1.7.1

2013-06-11 11:14:48

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 02/14] locks: make generic_add_lease and generic_delete_lease static

Signed-off-by: Jeff Layton <[email protected]>
Acked-by: J. Bruce Fields <[email protected]>
---
fs/locks.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 7a02064..e3140b8 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp)
return type;
}

-int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
+static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
{
struct file_lock *fl, **before, **my_before = NULL, *lease;
struct dentry *dentry = filp->f_path.dentry;
@@ -1402,7 +1402,7 @@ out:
return error;
}

-int generic_delete_lease(struct file *filp, struct file_lock **flp)
+static int generic_delete_lease(struct file *filp, struct file_lock **flp)
{
struct file_lock *fl, **before;
struct dentry *dentry = filp->f_path.dentry;
--
1.7.1

2013-06-11 11:15:20

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 04/14] locks: make "added" in __posix_lock_file a bool

...save 3 bytes of stack space.

Signed-off-by: Jeff Layton <[email protected]>
Acked-by: J. Bruce Fields <[email protected]>
---
fs/locks.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 1e6301b..c186649 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -791,7 +791,8 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
struct file_lock *left = NULL;
struct file_lock *right = NULL;
struct file_lock **before;
- int error, added = 0;
+ int error;
+ bool added = false;

/*
* We may need two file_lock structures for this operation,
@@ -885,7 +886,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
continue;
}
request = fl;
- added = 1;
+ added = true;
}
else {
/* Processing for different lock types is a bit
@@ -896,7 +897,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
if (fl->fl_start > request->fl_end)
break;
if (request->fl_type == F_UNLCK)
- added = 1;
+ added = true;
if (fl->fl_start < request->fl_start)
left = fl;
/* If the next lock in the list has a higher end
@@ -926,7 +927,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
locks_release_private(fl);
locks_copy_private(fl, request);
request = fl;
- added = 1;
+ added = true;
}
}
/* Go on to next lock.
--
1.7.1

2013-06-11 11:15:18

by Jeff Layton

[permalink] [raw]
Subject: [PATCH v2 03/14] locks: comment cleanups and clarifications

Signed-off-by: Jeff Layton <[email protected]>
---
fs/locks.c | 21 +++++++++++++--------
include/linux/fs.h | 18 ++++++++++++++++++
2 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index e3140b8..1e6301b 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -518,9 +518,10 @@ static void locks_insert_block(struct file_lock *blocker,
list_add(&waiter->fl_link, &blocked_list);
}

-/* Wake up processes blocked waiting for blocker.
- * If told to wait then schedule the processes until the block list
- * is empty, otherwise empty the block list ourselves.
+/*
+ * Wake up processes blocked waiting for blocker.
+ *
+ * Must be called with the file_lock_lock held!
*/
static void locks_wake_up_blocks(struct file_lock *blocker)
{
@@ -806,6 +807,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
}

lock_flocks();
+ /*
+ * New lock request. Walk all POSIX locks and look for conflicts. If
+ * there are any, either return error or put the request on the
+ * blocker's list of waiters and the global blocked_list.
+ */
if (request->fl_type != F_UNLCK) {
for_each_lock(inode, before) {
fl = *before;
@@ -844,7 +850,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
before = &fl->fl_next;
}

- /* Process locks with this owner. */
+ /* Process locks with this owner. */
while ((fl = *before) && posix_same_owner(request, fl)) {
/* Detect adjacent or overlapping regions (if same lock type)
*/
@@ -930,10 +936,9 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
}

/*
- * The above code only modifies existing locks in case of
- * merging or replacing. If new lock(s) need to be inserted
- * all modifications are done bellow this, so it's safe yet to
- * bail out.
+ * The above code only modifies existing locks in case of merging or
+ * replacing. If new lock(s) need to be inserted all modifications are
+ * done below this, so it's safe yet to bail out.
*/
error = -ENOLCK; /* "no luck" */
if (right && left == right && !new_fl2)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9d7816..94105d2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -926,6 +926,24 @@ int locks_in_grace(struct net *);
/* that will die - we need it for nfs_lock_info */
#include <linux/nfs_fs_i.h>

+/*
+ * struct file_lock represents a generic "file lock". It's used to represent
+ * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
+ * note that the same struct is used to represent both a request for a lock and
+ * the lock itself, but the same object is never used for both.
+ *
+ * FIXME: should we create a separate "struct lock_request" to help distinguish
+ * these two uses?
+ *
+ * The i_flock list is ordered by:
+ *
+ * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
+ * 2) lock owner
+ * 3) lock range start
+ * 4) lock range end
+ *
+ * Obviously, the last two criteria only matter for POSIX locks.
+ */
struct file_lock {
struct file_lock *fl_next; /* singly linked list for this inode */
struct list_head fl_link; /* doubly linked list of all locks */
--
1.7.1

2013-06-11 16:04:53

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] locks: scalability improvements for file locking

On Tue, Jun 11, 2013 at 07:08:54AM -0400, Jeff Layton wrote:
> Summary of Significant Changes:
> -------------------------------
> v2:
> - Fix potential races in deadlock detection. Manipulation of global
> blocked_hash and deadlock detection are now atomic. This is a
> little slower than the earlier set, but is provably correct. Also,
> the patch that converts to using the i_lock has been split out from
> most of the other changes. That should make it easier to review, but
> it does leave a potential race in the deadlock detection that is fixed
> up by the following patch. It may make sense to fold patches 7 and 8
> together before merging.
>
> - Add percpu hlists and lglocks for global file_lock_list. This gives
> us some speedup since this list is seldom read.
>
> Abstract (tl;dr version):
> -------------------------
> This patchset represents an overhaul of the file locking code with an
> aim toward improving its scalability and making the code a bit easier to
> understand.
>
> Longer version:
> ---------------
> When the BKL was finally ripped out of the kernel in 2010, the strategy
> taken for the file locking code was to simply turn it into a new
> file_lock_locks spinlock. It was an expedient way to deal with the file
> locking code at the time, but having a giant spinlock around all of this
> code is clearly not great for scalability. Red Hat has bug reports that
> go back into the 2.6.18 era that point to BKL scalability problems in
> the file locking code and the file_lock_lock suffers from the same
> issues.
>
> This patchset is my first attempt to make this code less dependent on
> global locking. The main change is to switch most of the file locking
> code to be protected by the inode->i_lock instead of the file_lock_lock.
>
> While that works for most things, there are a couple of global data
> structures (lists in the current code) that need a global lock to
> protect them. So we still need a global lock in order to deal with
> those. The remaining patches are intended to make that global locking
> less painful. The big gainis are made by turning the blocked_list into a
> hashtable, which greatly speeds up the deadlock detection code.
>
> This is not the first attempt at doing this. The conversion to the
> i_lock was originally attempted by Bruce Fields a few years ago. His
> approach was NAK'ed since it involved ripping out the deadlock
> detection. People also really seem to like /proc/locks for debugging, so
> keeping that in is probably worthwhile.
>
> There's more work to be done in this area and this patchset is just a
> start. There's a horrible thundering herd problem when a blocking lock
> is released, for instance. There was also interest in solving the goofy
> "unlock on any close" POSIX lock semantics at this year's LSF. I think
> this patchset will help lay the groundwork for those changes as well.
>
> While file locking is not usually considered to be a high-performance
> codepath, it *is* an IPC mechanism and I think it behooves us to try to
> make it as fast as possible.
>
> I'd like to see this considered for 3.11. Comments and suggestions
> welcome.
>
> Performance testing and results:
> --------------------------------
> In order to measure performance here, I've written some locking
> performance tests that I've made available here:
>
> git://git.samba.org/jlayton/lockperf.git
>
> ...there are only 4 tests so far that measure performance of BSD and
> POSIX locking. I ran each test 100 times. The first number in each set
> of results is the unpatched (baseline) and the second is the patched
> results. The arch in both cases was x86_64. I've also done some testing
> on i686 and the results are more or less inline with the ones below.
> The files were all on tmpfs, to eliminate any storage-related latencies
> that might creep in.
>
> Here's a description of each test with their results. I've also included
> the standard deviation with each:
>
> posix01: fork off 128 children, and have them all open the same file and
> lock and unlock it 10k times (whole file, FL_WRLCK). Measure the time
> that it takes to perform all of the locks and unlocks in each process,
> and the total them all up. Note that the "mean time" here is not the
> time to run the test, but the total time that all the tasks spent
> locking and unlocking:

Why that and not the time to run the test?

Maybe what you're doing is right, but measuring total time seems more
realistic and simpler, and less likely to accumulate errors.

If we're measuring individual calls, I think I'd rather see the mean and
variance of the individual call times rather than of their sum.

It might also be interesting to see how this changes as the number of
cores varies. (Could you do that by pinning the test threads to some
subset of the 32 cores?)
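
For reference, restricting the children to the first N cores is only a few
lines with sched_setaffinity(), and forked children inherit the mask. This
is just a sketch, not part of lockperf; the core count and the locking loop
are placeholders:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_to_first_n_cpus(int n)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 0; cpu < n; cpu++)
        CPU_SET(cpu, &mask);

    /* pid 0 means "this task"; children inherit the mask across fork() */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    int ncpus = argc > 1 ? atoi(argv[1]) : 1;

    pin_to_first_n_cpus(ncpus);
    printf("test children restricted to %d cpu(s)\n", ncpus);
    /* ... fork the children and run the lock/unlock loop here ... */
    return 0;
}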

Admittedly the difference seems clear enough already.... But it'd be
interesting to see how we scale with the new code in addition to just
seeing that it's an improvement.

(And, nit: round these numbers a little?:)

> 2-way Intel machine, 1G RAM, UMA:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 780.50511606027 20.0773085693901
> 3.10.0-rc4-00049-ge0b71e1 420.178537863755 24.4692621642102
>
> 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 30889.1603525361 267.767126085587
> 3.10.0-rc4-00049-ge0b71e1 25504.7748006928 336.232893246214
>
> The first thing to note here is how painful this test is as the number
> of CPUs scales out, and when NUMA is thrown into the mix. It takes a
> staggeringly long time to run on such a machine. The patchset helps this
> use-case a lot, but it's still painful. I think these numbers help
> illustrate why we need to worry about this as the machines get more
> parallel.
>
> posix02: fork off 128 children, and have them each open their own file
> and lock and unlock it 20k times. Measure the time that it takes to
> complete the locks and unlocks in each process and then total them up:
>
> 2-way Intel machine, 1G RAM, UMA:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 319.86610077757 11.0827165395713
> 3.10.0-rc4-00049-ge0b71e1 257.24231139548 8.0772950753018
>
> 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 1354.73702552958 40.5202787251742
> 3.10.0-rc4-00049-ge0b71e1 541.35526474265 12.0492755786782
>
> ...so again, multiple CPUs and NUMA take their toll on this workload,
> but the results are still universally positive from the patchset.
>
> flock01: fork off 128 children, and have them all open the same file and
> lock (LOCK_EX) and unlock it 10k times. Measure the time that it takes
> to perform all of the locks and unlocks in each process, and the total
> them all up. Note that the "mean time" here is not the time to run the
> test, but the total time that all the tasks spent locking and unlocking:
>
> 2-way Intel machine, 1G RAM, UMA:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 704.60957498018 10.6182455520549
> 3.10.0-rc4-00049-ge0b71e1 729.30782576048 4.62185698266119
>
> 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 23617.876013741 221.445828515533
> 3.10.0-rc4-00049-ge0b71e1 24743.3849476248 340.664907591012
>
> ...so here we see some slight slowdown (3-5%). In the case of the UMA
> box, it's on the order of the standard deviation, and in a similar test
> on a different box, the patched kernel turned out to be faster. Go
> figure...
>
> In any case, I'm suspect that this is due to the fact that with this
> set, we need to take two spinlocks in order to do the work (the i_lock
> and the file_lock_lglock). Since the flock codepath doesn't use the
> global blocked_list, it sees no benefit to the conversion of it to a
> hash table.
>
> flock02: fork off 128 children, and have them each open their own file
> and lock (LOCK_EX) and unlock it 20k times. Measure the time that it
> takes to complete the locks and unlocks in each process and then total
> them up:
>
> 2-way Intel machine, 1G RAM, UMA:
> kernel mean time std. deviation
> -------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 241.26319896997 7.69747220750147
> 3.10.0-rc4-00049-ge0b71e1 167.23062027739 5.1643997156006
>
> 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> kernel mean time std. deviation
> --------------------------------------------------------------------------
> 3.10.0-rc4-00034-g042dd60 1348.64594042422 39.5186585988528
> 3.10.0-rc4-00049-ge0b71e1 8.4762987879 0.347121187462215
>
> ...the speedup here is really dramatic, especially in the NUMA case.
> This was so much faster that I had to double check that flock locking
> still worked properly afterward.

I guess deadlock detection is the most likely explanation of the
difference between the posix and flock cases?

If so that'd make a pretty strong case for eliminating (or somehow
greatly reducing) the deadlock detection.

With /proc/locks using a percpu lock and everything else a per-file
lock, in theory the deadlock detection is the only thing that should
cause contention between cores, so the test should scale perfectly in
the flock case. (Might be interesting to confirm that by looking at how
it changes over # of cores.)

--b.

>
> Jeff Layton (14):
> cifs: use posix_unblock_lock instead of locks_delete_block
> locks: make generic_add_lease and generic_delete_lease static
> locks: comment cleanups and clarifications
> locks: make "added" in __posix_lock_file a bool
> locks: encapsulate the fl_link list handling
> locks: don't walk inode->i_flock list in locks_show
> locks: convert to i_lock to protect i_flock list
> locks: ensure that deadlock detection is atomic with respect to
> blocked_list modification
> locks: convert fl_link to a hlist_node
> locks: turn the blocked_list into a hashtable
> locks: add a new "lm_owner_key" lock operation
> locks: give the blocked_hash its own spinlock
> seq_file: add seq_list_*_percpu helpers
> locks: move file_lock_list to a set of percpu hlist_heads and convert
> file_lock_lock to an lglock
>
> Documentation/filesystems/Locking | 27 +++-
> fs/afs/flock.c | 5 +-
> fs/ceph/locks.c | 2 +-
> fs/ceph/mds_client.c | 8 +-
> fs/cifs/cifsfs.c | 2 +-
> fs/cifs/file.c | 15 +-
> fs/gfs2/file.c | 2 +-
> fs/lockd/svclock.c | 12 ++
> fs/lockd/svcsubs.c | 12 +-
> fs/locks.c | 306 +++++++++++++++++++++++++------------
> fs/nfs/delegation.c | 11 +-
> fs/nfs/nfs4state.c | 8 +-
> fs/nfsd/nfs4state.c | 8 +-
> fs/seq_file.c | 54 +++++++
> include/linux/fs.h | 38 +++--
> include/linux/seq_file.h | 6 +
> 16 files changed, 362 insertions(+), 154 deletions(-)
>

2013-06-11 16:46:53

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 00/14] locks: scalability improvements for file locking

On Tue, 11 Jun 2013 12:04:05 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Jun 11, 2013 at 07:08:54AM -0400, Jeff Layton wrote:
> > Summary of Significant Changes:
> > -------------------------------
> > v2:
> > - Fix potential races in deadlock detection. Manipulation of global
> > blocked_hash and deadlock detection are now atomic. This is a
> > little slower than the earlier set, but is provably correct. Also,
> > the patch that converts to using the i_lock has been split out from
> > most of the other changes. That should make it easier to review, but
> > it does leave a potential race in the deadlock detection that is fixed
> > up by the following patch. It may make sense to fold patches 7 and 8
> > together before merging.
> >
> > - Add percpu hlists and lglocks for global file_lock_list. This gives
> > us some speedup since this list is seldom read.
> >
> > Abstract (tl;dr version):
> > -------------------------
> > This patchset represents an overhaul of the file locking code with an
> > aim toward improving its scalability and making the code a bit easier to
> > understand.
> >
> > Longer version:
> > ---------------
> > When the BKL was finally ripped out of the kernel in 2010, the strategy
> > taken for the file locking code was to simply turn it into a new
> > file_lock_locks spinlock. It was an expedient way to deal with the file
> > locking code at the time, but having a giant spinlock around all of this
> > code is clearly not great for scalability. Red Hat has bug reports that
> > go back into the 2.6.18 era that point to BKL scalability problems in
> > the file locking code and the file_lock_lock suffers from the same
> > issues.
> >
> > This patchset is my first attempt to make this code less dependent on
> > global locking. The main change is to switch most of the file locking
> > code to be protected by the inode->i_lock instead of the file_lock_lock.
> >
> > While that works for most things, there are a couple of global data
> > structures (lists in the current code) that need a global lock to
> > protect them. So we still need a global lock in order to deal with
> > those. The remaining patches are intended to make that global locking
> > less painful. The big gainis are made by turning the blocked_list into a
> > hashtable, which greatly speeds up the deadlock detection code.
> >
> > This is not the first attempt at doing this. The conversion to the
> > i_lock was originally attempted by Bruce Fields a few years ago. His
> > approach was NAK'ed since it involved ripping out the deadlock
> > detection. People also really seem to like /proc/locks for debugging, so
> > keeping that in is probably worthwhile.
> >
> > There's more work to be done in this area and this patchset is just a
> > start. There's a horrible thundering herd problem when a blocking lock
> > is released, for instance. There was also interest in solving the goofy
> > "unlock on any close" POSIX lock semantics at this year's LSF. I think
> > this patchset will help lay the groundwork for those changes as well.
> >
> > While file locking is not usually considered to be a high-performance
> > codepath, it *is* an IPC mechanism and I think it behooves us to try to
> > make it as fast as possible.
> >
> > I'd like to see this considered for 3.11. Comments and suggestions
> > welcome.
> >
> > Performance testing and results:
> > --------------------------------
> > In order to measure performance here, I've written some locking
> > performance tests that I've made available here:
> >
> > git://git.samba.org/jlayton/lockperf.git
> >
> > ...there are only 4 tests so far that measure performance of BSD and
> > POSIX locking. I ran each test 100 times. The first number in each set
> > of results is the unpatched (baseline) and the second is the patched
> > results. The arch in both cases was x86_64. I've also done some testing
> > on i686 and the results are more or less inline with the ones below.
> > The files were all on tmpfs, to eliminate any storage-related latencies
> > that might creep in.
> >
> > Here's a description of each test with their results. I've also included
> > the standard deviation with each:
> >
> > posix01: fork off 128 children, and have them all open the same file and
> > lock and unlock it 10k times (whole file, FL_WRLCK). Measure the time
> > that it takes to perform all of the locks and unlocks in each process,
> > and the total them all up. Note that the "mean time" here is not the
> > time to run the test, but the total time that all the tasks spent
> > locking and unlocking:
>
> Why that and not the time to run the test?
>

Yeah, the fact that there is time spent both sleeping to wait for the
lock and time spent processing while awake makes this difficult to
quantify.

I was really trying to eliminate measuring anything *but* the locking
performance. In most cases, the open/close/fork overhead is pretty
insignificant, but by measuring it this way, I know they don't matter.

I did look at the times to run the test and in general, the time to run
the test is smaller when the numbers above are smaller.
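
In other words, each child times only the lock/unlock loop, roughly like
the sketch below (the real lockperf tests may differ in the details; the
file name and iteration count here are made up):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    struct timespec start, end;
    int fd = open("/tmp/lockperf.dat", O_RDWR | O_CREAT, 0644);
    int i;

    if (fd < 0)
        return 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < 10000; i++) {
        fl.l_type = F_WRLCK;
        fcntl(fd, F_SETLKW, &fl);       /* whole-file write lock */
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* only the locking loop is inside the timed region */
    printf("locking time: %.6f s\n",
           (end.tv_sec - start.tv_sec) +
           (end.tv_nsec - start.tv_nsec) / 1e9);
    close(fd);
    return 0;
}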

> Maybe what you're doing is right, but measuring total time seems more
> realistic and simpler, and less likely to accumulate errors.
>
> If we're measuring individual calls I think I'd rather see mean and
> variance of the time of individual calls instead of of the sum.
>
> It might also be interested to see how this changes as the number of
> cores varies. (Could you do that by pinning the test threads to some
> subset of the 32 cores?)
>
> Admittedly the difference seems clear enough already.... But it'd be
> interesting to see how we scale with the new code in addition to just
> seeing that's in an improvement.
>
> (And, nit: round these numbers a little?:)
>

I'll see...getting these numbers took long enough. I'm not sure I
really want to spend that much of my life running these tests and
collecting results. ;)

If I get the time, I'll do so but I'd also like to start work on some
patches to help the thundering herd problem when a lock is released.
I'm fairly convinced at this point that these patches do help
performance overall.

And yeah, I just cut and pasted the numbers from the simple perl script
I use to gather these stats; they obviously could use some rounding...

> > 2-way Intel machine, 1G RAM, UMA:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 780.50511606027 20.0773085693901
> > 3.10.0-rc4-00049-ge0b71e1 420.178537863755 24.4692621642102
> >
> > 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 30889.1603525361 267.767126085587
> > 3.10.0-rc4-00049-ge0b71e1 25504.7748006928 336.232893246214
> >
> > The first thing to note here is how painful this test is as the number
> > of CPUs scales out, and when NUMA is thrown into the mix. It takes a
> > staggeringly long time to run on such a machine. The patchset helps this
> > use-case a lot, but it's still painful. I think these numbers help
> > illustrate why we need to worry about this as the machines get more
> > parallel.
> >
> > posix02: fork off 128 children, and have them each open their own file
> > and lock and unlock it 20k times. Measure the time that it takes to
> > complete the locks and unlocks in each process and then total them up:
> >
> > 2-way Intel machine, 1G RAM, UMA:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 319.86610077757 11.0827165395713
> > 3.10.0-rc4-00049-ge0b71e1 257.24231139548 8.0772950753018
> >
> > 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 1354.73702552958 40.5202787251742
> > 3.10.0-rc4-00049-ge0b71e1 541.35526474265 12.0492755786782
> >
> > ...so again, multiple CPUs and NUMA take their toll on this workload,
> > but the results are still universally positive from the patchset.
> >
> > flock01: fork off 128 children, and have them all open the same file and
> > lock (LOCK_EX) and unlock it 10k times. Measure the time that it takes
> > to perform all of the locks and unlocks in each process, and the total
> > them all up. Note that the "mean time" here is not the time to run the
> > test, but the total time that all the tasks spent locking and unlocking:
> >
> > 2-way Intel machine, 1G RAM, UMA:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 704.60957498018 10.6182455520549
> > 3.10.0-rc4-00049-ge0b71e1 729.30782576048 4.62185698266119
> >
> > 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 23617.876013741 221.445828515533
> > 3.10.0-rc4-00049-ge0b71e1 24743.3849476248 340.664907591012
> >
> > ...so here we see some slight slowdown (3-5%). In the case of the UMA
> > box, it's on the order of the standard deviation, and in a similar test
> > on a different box, the patched kernel turned out to be faster. Go
> > figure...
> >
> > In any case, I'm suspect that this is due to the fact that with this
> > set, we need to take two spinlocks in order to do the work (the i_lock
> > and the file_lock_lglock). Since the flock codepath doesn't use the
> > global blocked_list, it sees no benefit to the conversion of it to a
> > hash table.
> >
> > flock02: fork off 128 children, and have them each open their own file
> > and lock (LOCK_EX) and unlock it 20k times. Measure the time that it
> > takes to complete the locks and unlocks in each process and then total
> > them up:
> >
> > 2-way Intel machine, 1G RAM, UMA:
> > kernel mean time std. deviation
> > -------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 241.26319896997 7.69747220750147
> > 3.10.0-rc4-00049-ge0b71e1 167.23062027739 5.1643997156006
> >
> > 32-way AMD Opteron machine, 32G RAM in 4 NUMA nodes:
> > kernel mean time std. deviation
> > --------------------------------------------------------------------------
> > 3.10.0-rc4-00034-g042dd60 1348.64594042422 39.5186585988528
> > 3.10.0-rc4-00049-ge0b71e1 8.4762987879 0.347121187462215
> >
> > ...the speedup here is really dramatic, especially in the NUMA case.
> > This was so much faster that I had to double check that flock locking
> > still worked properly afterward.
>
> I guess deadlock detection is the most likely explanation of the
> difference between the posix and flock cases?
>
> If so that'd make a pretty strong case for eliminating (or somehow
> greatly reducing) the deadlock detection.
>

Well yes, the deadlock detection itself and the addition/removal from
the blocked_hash/blocked_list. The conversion to a hashtable helps the
deadlock detection itself, but we can't really do much about the list
manipulation.

I'm not opposed to some sort of CONFIG_* option to allow compiling out
the deadlock detection, but I'd probably prefer to leave that for
another set. It might also make sense to have some sort of boot-time
switch to enable it?
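
A hypothetical sketch of what a boot-time switch could look like (not part
of this series; the parameter name and its placement in fs/locks.c are
invented for illustration). Since fs/locks.c is built in, the parameter
would be settable at boot as "locks.deadlock_detect=0", and
posix_locks_deadlock() would simply return 0 early when it's clear:

static bool deadlock_detect = true;
module_param(deadlock_detect, bool, 0644);
MODULE_PARM_DESC(deadlock_detect, "Enable POSIX lock deadlock detection");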

For a distro like Fedora or RHEL, I could envision turning off deadlock
detection in "production" kernels and leaving it enabled in -debug
kernels. I have my doubts as to how useful deadlock detection really is
though...

> With /proc/locks using a percpu lock and everything else a per-file
> lock, in theory the deadlock detectionis the only thing that should
> cause contention between cores, so the test should scale perfectly in
> the flock case. (Might be interesting to confirm that by looking at how
> it changes over # of cores.)

Sure, if you assume that the locks are uncontended. You can still have
contention for the i_lock of course if tasks running on different CPUs
are trying to lock the same file.

Also, the file_lock_lock becomes a percpu lglock with this set, but
occasionally a task acquires a lock on one CPU and then gets migrated to
another. At that point you have to "reach across" to the CPU the entry
was inserted on to take it off that CPU's percpu hlist.
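
To make the "reach across" concrete, here's roughly the shape of the percpu
insert/delete pair: remember which CPU the entry was inserted on and take
just that CPU's part of the lglock when removing it. Symbol and field names
here are illustrative, not necessarily what the final patch uses:

static DEFINE_PER_CPU(struct hlist_head, file_lock_list);
DEFINE_STATIC_LGLOCK(file_lock_lglock);

static void locks_insert_global_locks(struct file_lock *fl)
{
    lg_local_lock(&file_lock_lglock);
    fl->fl_link_cpu = smp_processor_id();   /* remember where it went */
    hlist_add_head(&fl->fl_link, this_cpu_ptr(&file_lock_list));
    lg_local_unlock(&file_lock_lglock);
}

static void locks_delete_global_locks(struct file_lock *fl)
{
    /* "reach across" to the CPU the entry was inserted on */
    lg_local_lock_cpu(&file_lock_lglock, fl->fl_link_cpu);
    hlist_del_init(&fl->fl_link);
    lg_local_unlock_cpu(&file_lock_lglock, fl->fl_link_cpu);
}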

But yeah...for the most part these changes make the code have very
little contention, and the flock02 test results seem to really bear
that out.

Thanks for taking a look!
--
Jeff Layton <[email protected]>

2013-06-13 14:42:30

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] locks: convert to i_lock to protect i_flock list

On Tue, Jun 11, 2013 at 07:09:01AM -0400, Jeff Layton wrote:
> Having a global lock that protects all of this code is a clear
> scalability problem. Instead of doing that, move most of the code to be
> protected by the i_lock instead.
>
> The exceptions are the global lists that file_lock->fl_link sits on.
> Those still need a global lock of some sort, so wrap just those accesses
> in the file_lock_lock().
>
> Note that this patch introduces a potential race in the deadlock
> detection code which could result in either false positives or missed
> file_lock loops. The blocked_list management needs to be done atomically
> with the deadlock detection. This will be fixed in a later patch.

Actually, the way I think I'd try this:

1. Introduce a new spinlock global_lock_lists_lock and take it
in all the places we need a global lock.
2. Do the big lock_flocks()-to-i_lock conversion, deleting the old
global file_lock_lock in the process.

Does that work?

Yeah, more patch-ordering bike-shedding, but... I think that would be
simple and bisectable, and for a confusing, long-untouched bit of code
it's worth getting these little steps right if we can.

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> Signed-off-by: J. Bruce Fields <[email protected]>
> ---
> Documentation/filesystems/Locking | 23 +++++--
> fs/afs/flock.c | 5 +-
> fs/ceph/locks.c | 2 +-
> fs/ceph/mds_client.c | 8 +-
> fs/cifs/cifsfs.c | 2 +-
> fs/cifs/file.c | 13 ++--
> fs/gfs2/file.c | 2 +-
> fs/lockd/svcsubs.c | 12 ++--
> fs/locks.c | 110 ++++++++++++++++++++-----------------
> fs/nfs/delegation.c | 11 ++--
> fs/nfs/nfs4state.c | 8 +-
> fs/nfsd/nfs4state.c | 8 +-
> include/linux/fs.h | 11 ----
> 13 files changed, 113 insertions(+), 102 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 0706d32..13f91ab 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -344,7 +344,7 @@ prototypes:
>
>
> locking rules:
> - file_lock_lock may block
> + inode->i_lock may block
> fl_copy_lock: yes no
> fl_release_private: maybe no
>
> @@ -357,12 +357,21 @@ prototypes:
> int (*lm_change)(struct file_lock **, int);
>
> locking rules:
> - file_lock_lock may block
> -lm_compare_owner: yes no
> -lm_notify: yes no
> -lm_grant: no no
> -lm_break: yes no
> -lm_change yes no
> +
> + inode->i_lock file_lock_lock may block
> +lm_compare_owner: yes maybe no
> +lm_notify: yes no no
> +lm_grant: no no no
> +lm_break: yes no no
> +lm_change yes no no
> +
> + ->lm_compare_owner is generally called with *an* inode->i_lock
> +held. It may not be the i_lock of the inode for either file_lock being
> +compared! This is the case with deadlock detection, since the code has
> +to chase down the owners of locks that may be entirely unrelated to the
> +one on which the lock is being acquired. For deadlock detection however,
> +the file_lock_lock is also held. The locks primarily ensure that neither
> +file_lock disappear out from under you while doing the comparison.
>
> --------------------------- buffer_head -----------------------------------
> prototypes:
> diff --git a/fs/afs/flock.c b/fs/afs/flock.c
> index 2497bf3..03fc0d1 100644
> --- a/fs/afs/flock.c
> +++ b/fs/afs/flock.c
> @@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key)
> */
> static int afs_do_setlk(struct file *file, struct file_lock *fl)
> {
> + struct inode *inode = file_inode(file);
> struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
> afs_lock_type_t type;
> struct key *key = file->private_data;
> @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl)
>
> type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
>
> /* make sure we've got a callback on this file and that our view of the
> * data version is up to date */
> @@ -420,7 +421,7 @@ given_lock:
> afs_vnode_fetch_status(vnode, NULL, key);
>
> error:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> _leave(" = %d", ret);
> return ret;
>
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index 202dd3d..cd0a664 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -194,7 +194,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
> * Encode the flock and fcntl locks for the given inode into the pagelist.
> * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
> * sequential flock locks.
> - * Must be called with lock_flocks() already held.
> + * Must be called with inode->i_lock already held.
> * If we encounter more of a specific lock type than expected,
> * we return the value 1.
> */
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 4f22671..ae621b5 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2482,13 +2482,13 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
>
> ceph_pagelist_set_cursor(pagelist, &trunc_point);
> do {
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> ceph_count_locks(inode, &num_fcntl_locks,
> &num_flock_locks);
> rec.v2.flock_len = (2*sizeof(u32) +
> (num_fcntl_locks+num_flock_locks) *
> sizeof(struct ceph_filelock));
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> /* pre-alloc pagelist */
> ceph_pagelist_truncate(pagelist, &trunc_point);
> @@ -2499,12 +2499,12 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
>
> /* encode locks */
> if (!err) {
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> err = ceph_encode_locks(inode,
> pagelist,
> num_fcntl_locks,
> num_flock_locks);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> }
> } while (err == -ENOSPC);
> } else {
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index 3752b9f..e81655c 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -765,7 +765,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)
>
> static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
> {
> - /* note that this is called by vfs setlease with lock_flocks held
> + /* note that this is called by vfs setlease with i_lock held
> to protect *lease from going away */
> struct inode *inode = file_inode(file);
> struct cifsFileInfo *cfile = file->private_data;
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 44a4f18..0dd10cd 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -1092,6 +1092,7 @@ struct lock_to_push {
> static int
> cifs_push_posix_locks(struct cifsFileInfo *cfile)
> {
> + struct inode *inode = cfile->dentry->d_inode;
> struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
> struct file_lock *flock, **before;
> unsigned int count = 0, i = 0;
> @@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
>
> xid = get_xid();
>
> - lock_flocks();
> - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> + spin_lock(&inode->i_lock);
> + cifs_for_each_lock(inode, before) {
> if ((*before)->fl_flags & FL_POSIX)
> count++;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> INIT_LIST_HEAD(&locks_to_send);
>
> @@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> }
>
> el = locks_to_send.next;
> - lock_flocks();
> - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> + spin_lock(&inode->i_lock);
> + cifs_for_each_lock(inode, before) {
> flock = *before;
> if ((flock->fl_flags & FL_POSIX) == 0)
> continue;
> @@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> lck->offset = flock->fl_start;
> el = el->next;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) {
> int stored_rc;
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index acd1676..9e634e0 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -889,7 +889,7 @@ out_uninit:
> * cluster; until we do, disable leases (by just returning -EINVAL),
> * unless the administrator has requested purely local locking.
> *
> - * Locking: called under lock_flocks
> + * Locking: called under i_lock
> *
> * Returns: errno
> */
> diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> index 97e8741..dc5c759 100644
> --- a/fs/lockd/svcsubs.c
> +++ b/fs/lockd/svcsubs.c
> @@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file,
>
> again:
> file->f_locks = 0;
> - lock_flocks(); /* protects i_flock list */
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> if (fl->fl_lmops != &nlmsvc_lock_operations)
> continue;
> @@ -181,7 +181,7 @@ again:
> if (match(lockhost, host)) {
> struct file_lock lock = *fl;
>
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> lock.fl_type = F_UNLCK;
> lock.fl_start = 0;
> lock.fl_end = OFFSET_MAX;
> @@ -193,7 +193,7 @@ again:
> goto again;
> }
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> return 0;
> }
> @@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file)
> if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares)
> return 1;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> if (fl->fl_lmops == &nlmsvc_lock_operations) {
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return 1;
> }
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> file->f_locks = 0;
> return 0;
> }
> diff --git a/fs/locks.c b/fs/locks.c
> index 3fd27f0..d7342a3 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -155,22 +155,9 @@ int lease_break_time = 45;
>
> static LIST_HEAD(file_lock_list);
> static LIST_HEAD(blocked_list);
> -static DEFINE_SPINLOCK(file_lock_lock);
> -
> -/*
> - * Protects the two list heads above, plus the inode->i_flock list
> - */
> -void lock_flocks(void)
> -{
> - spin_lock(&file_lock_lock);
> -}
> -EXPORT_SYMBOL_GPL(lock_flocks);
>
> -void unlock_flocks(void)
> -{
> - spin_unlock(&file_lock_lock);
> -}
> -EXPORT_SYMBOL_GPL(unlock_flocks);
> +/* Protects the two list heads above */
> +static DEFINE_SPINLOCK(file_lock_lock);
>
> static struct kmem_cache *filelock_cache __read_mostly;
>
> @@ -488,25 +475,33 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> static inline void
> locks_insert_global_blocked(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_add(&waiter->fl_link, &blocked_list);
> + spin_unlock(&file_lock_lock);
> }
>
> static inline void
> locks_delete_global_blocked(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_del_init(&waiter->fl_link);
> + spin_unlock(&file_lock_lock);
> }
>
> static inline void
> locks_insert_global_locks(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_add_tail(&waiter->fl_link, &file_lock_list);
> + spin_unlock(&file_lock_lock);
> }
>
> static inline void
> locks_delete_global_locks(struct file_lock *waiter)
> {
> + spin_lock(&file_lock_lock);
> list_del_init(&waiter->fl_link);
> + spin_unlock(&file_lock_lock);
> }
>
> /* Remove waiter from blocker's block list.
> @@ -523,9 +518,11 @@ static void __locks_delete_block(struct file_lock *waiter)
> */
> static void locks_delete_block(struct file_lock *waiter)
> {
> - lock_flocks();
> + struct inode *inode = file_inode(waiter->fl_file);
> +
> + spin_lock(&inode->i_lock);
> __locks_delete_block(waiter);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> }
>
> /* Insert waiter into blocker's block list.
> @@ -544,7 +541,7 @@ static void locks_insert_block(struct file_lock *blocker,
> /*
> * Wake up processes blocked waiting for blocker.
> *
> - * Must be called with the file_lock_lock held!
> + * Must be called with the inode->i_lock of the blocker held!
> */
> static void locks_wake_up_blocks(struct file_lock *blocker)
> {
> @@ -649,8 +646,9 @@ void
> posix_test_lock(struct file *filp, struct file_lock *fl)
> {
> struct file_lock *cfl;
> + struct inode *inode = file_inode(filp);
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) {
> if (!IS_POSIX(cfl))
> continue;
> @@ -663,7 +661,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> fl->fl_pid = pid_vnr(cfl->fl_nspid);
> } else
> fl->fl_type = F_UNLCK;
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return;
> }
> EXPORT_SYMBOL(posix_test_lock);
> @@ -742,7 +740,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> return -ENOMEM;
> }
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> if (request->fl_flags & FL_ACCESS)
> goto find_conflict;
>
> @@ -772,9 +770,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> * give it the opportunity to lock the file.
> */
> if (found) {
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> cond_resched();
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> }
>
> find_conflict:
> @@ -801,7 +799,7 @@ find_conflict:
> error = 0;
>
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> if (new_fl)
> locks_free_lock(new_fl);
> return error;
> @@ -831,7 +829,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> new_fl2 = locks_alloc_lock();
> }
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> /*
> * New lock request. Walk all POSIX locks and look for conflicts. If
> * there are any, either return error or put the request on the
> @@ -850,8 +848,14 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> if (!(request->fl_flags & FL_SLEEP))
> goto out;
> error = -EDEADLK;
> + /*
> + * XXX: potential race here. We should be adding the
> + * file_lock to the global list before releasing lock.
> + */
> + spin_lock(&file_lock_lock);
> if (posix_locks_deadlock(request, fl))
> goto out;
> + spin_unlock(&file_lock_lock);
> error = FILE_LOCK_DEFERRED;
> locks_insert_block(fl, request);
> locks_insert_global_blocked(request);
> @@ -1004,7 +1008,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> locks_wake_up_blocks(left);
> }
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> /*
> * Free any unused locks.
> */
> @@ -1079,14 +1083,14 @@ int locks_mandatory_locked(struct inode *inode)
> /*
> * Search the lock list for this inode for any POSIX locks.
> */
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!IS_POSIX(fl))
> continue;
> if (fl->fl_owner != owner)
> break;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return fl ? -EAGAIN : 0;
> }
>
> @@ -1229,7 +1233,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
> if (IS_ERR(new_fl))
> return PTR_ERR(new_fl);
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
>
> time_out_leases(inode);
>
> @@ -1279,10 +1283,10 @@ restart:
> break_time++;
> }
> locks_insert_block(flock, new_fl);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> error = wait_event_interruptible_timeout(new_fl->fl_wait,
> !new_fl->fl_next, break_time);
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> __locks_delete_block(new_fl);
> if (error >= 0) {
> if (error == 0)
> @@ -1300,7 +1304,7 @@ restart:
> }
>
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> locks_free_lock(new_fl);
> return error;
> }
> @@ -1353,9 +1357,10 @@ EXPORT_SYMBOL(lease_get_mtime);
> int fcntl_getlease(struct file *filp)
> {
> struct file_lock *fl;
> + struct inode *inode = file_inode(filp);
> int type = F_UNLCK;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> time_out_leases(file_inode(filp));
> for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl);
> fl = fl->fl_next) {
> @@ -1364,7 +1369,7 @@ int fcntl_getlease(struct file *filp)
> break;
> }
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return type;
> }
>
> @@ -1458,7 +1463,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp)
> * The (input) flp->fl_lmops->lm_break function is required
> * by break_lease().
> *
> - * Called with file_lock_lock held.
> + * Called with inode->i_lock held.
> */
> int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> {
> @@ -1527,11 +1532,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
>
> int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> {
> + struct inode *inode = file_inode(filp);
> int error;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> error = __vfs_setlease(filp, arg, lease);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> return error;
> }
> @@ -1549,6 +1555,7 @@ static int do_fcntl_delete_lease(struct file *filp)
> static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> {
> struct file_lock *fl, *ret;
> + struct inode *inode = file_inode(filp);
> struct fasync_struct *new;
> int error;
>
> @@ -1562,10 +1569,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> return -ENOMEM;
> }
> ret = fl;
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> error = __vfs_setlease(filp, arg, &ret);
> if (error) {
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> locks_free_lock(fl);
> goto out_free_fasync;
> }
> @@ -1582,7 +1589,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> new = NULL;
>
> error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
>
> out_free_fasync:
> if (new)
> @@ -2106,7 +2113,7 @@ void locks_remove_flock(struct file *filp)
> fl.fl_ops->fl_release_private(&fl);
> }
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> before = &inode->i_flock;
>
> while ((fl = *before) != NULL) {
> @@ -2124,7 +2131,7 @@ void locks_remove_flock(struct file *filp)
> }
> before = &fl->fl_next;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> }
>
> /**
> @@ -2137,14 +2144,15 @@ void locks_remove_flock(struct file *filp)
> int
> posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> {
> + struct inode *inode = file_inode(filp);
> int status = 0;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> if (waiter->fl_next)
> __locks_delete_block(waiter);
> else
> status = -ENOENT;
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return status;
> }
>
> @@ -2261,7 +2269,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
> {
> loff_t *p = f->private;
>
> - lock_flocks();
> + spin_lock(&file_lock_lock);
> *p = (*pos + 1);
> return seq_list_start(&file_lock_list, *pos);
> }
> @@ -2275,7 +2283,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
>
> static void locks_stop(struct seq_file *f, void *v)
> {
> - unlock_flocks();
> + spin_unlock(&file_lock_lock);
> }
>
> static const struct seq_operations locks_seq_operations = {
> @@ -2322,7 +2330,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_flocks();
> +
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if (fl->fl_type == F_RDLCK)
> @@ -2339,7 +2348,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return result;
> }
>
> @@ -2362,7 +2371,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> {
> struct file_lock *fl;
> int result = 1;
> - lock_flocks();
> +
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (IS_POSIX(fl)) {
> if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> @@ -2377,7 +2387,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> result = 0;
> break;
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return result;
> }
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index 57db324..43ee7f9 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -73,20 +73,21 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_
> if (inode->i_flock == NULL)
> goto out;
>
> - /* Protect inode->i_flock using the file locks lock */
> - lock_flocks();
> + /* Protect inode->i_flock using the i_lock */
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> continue;
> if (nfs_file_open_context(fl->fl_file) != ctx)
> continue;
> - unlock_flocks();
> + /* FIXME: safe to drop lock here while walking list? */
> + spin_unlock(&inode->i_lock);
> status = nfs4_lock_delegation_recall(fl, state, stateid);
> if (status < 0)
> goto out;
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> out:
> return status;
> }
> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> index 1fab140..ff10b4a 100644
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> /* Guard against delegation returns and new lock/unlock calls */
> down_write(&nfsi->rwsem);
> /* Protect inode->i_flock using the BKL */
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> continue;
> if (nfs_file_open_context(fl->fl_file)->state != state)
> continue;
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> status = ops->recover_lock(state, fl);
> switch (status) {
> case 0:
> @@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> /* kill_proc(fl->fl_pid, SIGLOST, 1); */
> status = 0;
> }
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> }
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> out:
> up_write(&nfsi->rwsem);
> return status;
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 316ec84..f170518 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
>
> list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);
>
> - /* only place dl_time is set. protected by lock_flocks*/
> + /* Only place dl_time is set; protected by i_lock: */
> dp->dl_time = get_seconds();
>
> nfsd4_cb_recall(dp);
> }
>
> -/* Called from break_lease() with lock_flocks() held. */
> +/* Called from break_lease() with i_lock held. */
> static void nfsd_break_deleg_cb(struct file_lock *fl)
> {
> struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner;
> @@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> struct inode *inode = filp->fi_inode;
> int status = 0;
>
> - lock_flocks();
> + spin_lock(&inode->i_lock);
> for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) {
> if ((*flpp)->fl_owner == (fl_owner_t)lowner) {
> status = 1;
> @@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> }
> }
> out:
> - unlock_flocks();
> + spin_unlock(&inode->i_lock);
> return status;
> }
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 94105d2..e2f896d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1024,8 +1024,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
> extern int lease_modify(struct file_lock **, int);
> extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
> extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
> -extern void lock_flocks(void);
> -extern void unlock_flocks(void);
> #else /* !CONFIG_FILE_LOCKING */
> static inline int fcntl_getlk(struct file *file, struct flock __user *user)
> {
> @@ -1167,15 +1165,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> {
> return 1;
> }
> -
> -static inline void lock_flocks(void)
> -{
> -}
> -
> -static inline void unlock_flocks(void)
> -{
> -}
> -
> #endif /* !CONFIG_FILE_LOCKING */
>
>
> --
> 1.7.1
>

2013-06-13 14:50:40

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 10/14] locks: turn the blocked_list into a hashtable

On Tue, Jun 11, 2013 at 07:09:04AM -0400, Jeff Layton wrote:
> Break up the blocked_list into a hashtable, using the fl_owner as a key.
> This speeds up searching the hash chains, which is especially significant
> for deadlock detection.
>
> Note that the initial implementation assumes that hashing on fl_owner is
> sufficient. In most cases it should be, with the notable exception being
> server-side lockd, which compares ownership using a tuple of the
> nlm_host and the pid sent in the lock request. So, this may degrade to a
> single hash bucket when you only have a single NFS client. That will be
> addressed in a later patch.
>
> The careful observer may note that this patch leaves the file_lock_list
> alone. There's much less of a case for turning the file_lock_list into a
> hashtable. The only user of that list is the code that generates
> /proc/locks, and it always walks the entire list.

Makes sense to me, ACK to this and the previous patch.

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 25 ++++++++++++++++++-------
> 1 files changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 28959bc..76fb7af 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -126,6 +126,7 @@
> #include <linux/time.h>
> #include <linux/rcupdate.h>
> #include <linux/pid_namespace.h>
> +#include <linux/hashtable.h>
>
> #include <asm/uaccess.h>
>
> @@ -153,10 +154,19 @@ int lease_break_time = 45;
> #define for_each_lock(inode, lockp) \
> for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
>
> +/*
> + * By breaking up the blocked locks list into a hashtable, we speed up the
> + * deadlock detection.
> + *
> + * FIXME: make this value scale via some heuristic?
> + */
> +#define BLOCKED_HASH_BITS 7
> +
> +static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
> +
> static HLIST_HEAD(file_lock_list);
> -static HLIST_HEAD(blocked_list);
>
> -/* Protects the two list heads above */
> +/* Protects the file_lock_list and the blocked_hash */
> static DEFINE_SPINLOCK(file_lock_lock);
>
> static struct kmem_cache *filelock_cache __read_mostly;
> @@ -475,13 +485,13 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> static inline void
> locks_insert_global_blocked(struct file_lock *waiter)
> {
> - hlist_add_head(&waiter->fl_link, &blocked_list);
> + hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner);
> }
>
> static inline void
> __locks_delete_global_blocked(struct file_lock *waiter)
> {
> - hlist_del_init(&waiter->fl_link);
> + hash_del(&waiter->fl_link);
> }
>
> static inline void
> @@ -729,7 +739,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> {
> struct file_lock *fl;
>
> - hlist_for_each_entry(fl, &blocked_list, fl_link) {
> + hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
> if (posix_same_owner(fl, block_fl))
> return fl->fl_next;
> }
> @@ -865,7 +875,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> /*
> * New lock request. Walk all POSIX locks and look for conflicts. If
> * there are any, either return error or put the request on the
> - * blocker's list of waiters and the global blocked_list.
> + * blocker's list of waiters and the global blocked_hash.
> */
> if (request->fl_type != F_UNLCK) {
> for_each_lock(inode, before) {
> @@ -2284,13 +2294,14 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
>
> static int locks_show(struct seq_file *f, void *v)
> {
> + int bkt;
> struct file_lock *fl, *bfl;
>
> fl = hlist_entry(v, struct file_lock, fl_link);
>
> lock_get_status(f, fl, *((loff_t *)f->private), "");
>
> - hlist_for_each_entry(bfl, &blocked_list, fl_link) {
> + hash_for_each(blocked_hash, bkt, bfl, fl_link) {
> if (bfl->fl_next == fl)
> lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> }
> --
> 1.7.1
>

2013-06-13 15:01:06

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 11/14] locks: add a new "lm_owner_key" lock operation

On Tue, Jun 11, 2013 at 07:09:05AM -0400, Jeff Layton wrote:
> Currently, the hashing that the locking code uses to add these values
> to the blocked_hash is simply calculated using the fl_owner field. That's
> valid in most cases except for server-side lockd, which validates the
> owner of a lock based on fl_owner and fl_pid.
>
> In the case where you have a small number of NFS clients doing a lot
> of locking between different processes, you could end up with all
> the blocked requests sitting in a very small number of hash buckets.
>
> Add a new lm_owner_key operation to the lock_manager_operations that
> will generate an unsigned long to use as the key in the hashtable.
> That function is only implemented for server-side lockd, and simply
> XORs the fl_owner and fl_pid.

Like I've said, I think we should look into defining a lock_owner struct
that lockd can allocate as necessary, so that the lock code can just do a
pointer comparison on struct lock_owner *'s. But maybe that doesn't work
out, and in any case it can be future work, so it looks fine. ACK.
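
A rough sketch of that idea, purely hypothetical -- neither the struct nor
the helpers below exist in this series or in lockd today. The point is that
lockd would allocate one lock_owner per (host, pid) tuple and point fl_owner
at it, after which the generic code could compare and hash owners as plain
pointers and would no longer need lm_compare_owner or lm_owner_key:

#include <linux/fs.h>
#include <linux/lockd/lockd.h>

struct lock_owner {
        struct hlist_node  lo_link;   /* hashed per-host inside lockd */
        struct nlm_host   *lo_host;
        pid_t              lo_pid;
};

static fl_owner_t nlmsvc_get_lock_owner(struct nlm_host *host, pid_t pid)
{
        /* lookup_or_create_lock_owner() is a made-up helper */
        struct lock_owner *lo = lookup_or_create_lock_owner(host, pid);

        /* with this in place, fl1->fl_owner == fl2->fl_owner suffices */
        return (fl_owner_t)lo;
}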

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> Documentation/filesystems/Locking | 18 +++++++++++-------
> fs/lockd/svclock.c | 12 ++++++++++++
> fs/locks.c | 12 ++++++++++--
> include/linux/fs.h | 1 +
> 4 files changed, 34 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 13f91ab..ee351ac 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -351,6 +351,7 @@ fl_release_private: maybe no
> ----------------------- lock_manager_operations ---------------------------
> prototypes:
> int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
> + unsigned long (*lm_owner_key)(struct file_lock *);
> void (*lm_notify)(struct file_lock *); /* unblock callback */
> int (*lm_grant)(struct file_lock *, struct file_lock *, int);
> void (*lm_break)(struct file_lock *); /* break_lease callback */
> @@ -360,18 +361,21 @@ locking rules:
>
> inode->i_lock file_lock_lock may block
> lm_compare_owner: yes maybe no
> +lm_owner_key yes yes no
> lm_notify: yes no no
> lm_grant: no no no
> lm_break: yes no no
> lm_change yes no no
>
> - ->lm_compare_owner is generally called with *an* inode->i_lock
> -held. It may not be the i_lock of the inode for either file_lock being
> -compared! This is the case with deadlock detection, since the code has
> -to chase down the owners of locks that may be entirely unrelated to the
> -one on which the lock is being acquired. For deadlock detection however,
> -the file_lock_lock is also held. The locks primarily ensure that neither
> -file_lock disappear out from under you while doing the comparison.
> + ->lm_compare_owner and ->lm_owner_key are generally called with
> +*an* inode->i_lock held. It may not be the i_lock of the inode
> +associated with either file_lock argument! This is the case with deadlock
> +detection, since the code has to chase down the owners of locks that may
> +be entirely unrelated to the one on which the lock is being acquired.
> +For deadlock detection however, the file_lock_lock is also held. The
> +fact that these locks are held ensures that the file_locks do not
> +disappear out from under you while doing the comparison or generating an
> +owner key.
>
> --------------------------- buffer_head -----------------------------------
> prototypes:
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index e703318..ce2cdab 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -744,8 +744,20 @@ static int nlmsvc_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> return fl1->fl_owner == fl2->fl_owner && fl1->fl_pid == fl2->fl_pid;
> }
>
> +/*
> + * Since NLM uses two "keys" for tracking locks, we need to hash them down
> + * to one for the blocked_hash. Here, we're just xor'ing the host address
> + * with the pid in order to create a key value for picking a hash bucket.
> + */
> +static unsigned long
> +nlmsvc_owner_key(struct file_lock *fl)
> +{
> + return (unsigned long)fl->fl_owner ^ (unsigned long)fl->fl_pid;
> +}
> +
> const struct lock_manager_operations nlmsvc_lock_operations = {
> .lm_compare_owner = nlmsvc_same_owner,
> + .lm_owner_key = nlmsvc_owner_key,
> .lm_notify = nlmsvc_notify_blocked,
> .lm_grant = nlmsvc_grant_deferred,
> };
> diff --git a/fs/locks.c b/fs/locks.c
> index 76fb7af..11e7784 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -481,11 +481,19 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> return fl1->fl_owner == fl2->fl_owner;
> }
>
> +static unsigned long
> +posix_owner_key(struct file_lock *fl)
> +{
> + if (fl->fl_lmops && fl->fl_lmops->lm_owner_key)
> + return fl->fl_lmops->lm_owner_key(fl);
> + return (unsigned long)fl->fl_owner;
> +}
> +
> /* Remove a blocker or lock from one of the global lists */
> static inline void
> locks_insert_global_blocked(struct file_lock *waiter)
> {
> - hash_add(blocked_hash, &waiter->fl_link, (unsigned long)waiter->fl_owner);
> + hash_add(blocked_hash, &waiter->fl_link, posix_owner_key(waiter));
> }
>
> static inline void
> @@ -739,7 +747,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> {
> struct file_lock *fl;
>
> - hash_for_each_possible(blocked_hash, fl, fl_link, (unsigned long)block_fl->fl_owner) {
> + hash_for_each_possible(blocked_hash, fl, fl_link, posix_owner_key(block_fl)) {
> if (posix_same_owner(fl, block_fl))
> return fl->fl_next;
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3b340f7..232a345 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -908,6 +908,7 @@ struct file_lock_operations {
>
> struct lock_manager_operations {
> int (*lm_compare_owner)(struct file_lock *, struct file_lock *);
> + unsigned long (*lm_owner_key)(struct file_lock *);
> void (*lm_notify)(struct file_lock *); /* unblock callback */
> int (*lm_grant)(struct file_lock *, struct file_lock *, int);
> void (*lm_break)(struct file_lock *);
> --
> 1.7.1
>

2013-06-13 15:03:19

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] locks: give the blocked_hash its own spinlock

On Tue, Jun 11, 2013 at 07:09:06AM -0400, Jeff Layton wrote:
> There's no reason we have to protect the blocked_hash and file_lock_list
> with the same spinlock. With the tests I have, breaking it in two gives
> a barely measurable performance benefit, but it seems reasonable to make
> this locking as granular as possible.

Out of curiosity... In the typical case when adding/removing a lock,
aren't both lists being modified in rapid succession?

I wonder if it would be better to instead stick with one lock and take
care to acquire it only once to cover both manipulations.

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> Documentation/filesystems/Locking | 16 ++++++++--------
> fs/locks.c | 25 +++++++++++++------------
> 2 files changed, 21 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index ee351ac..8d8d040 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -359,20 +359,20 @@ prototypes:
>
> locking rules:
>
> - inode->i_lock file_lock_lock may block
> -lm_compare_owner: yes maybe no
> -lm_owner_key yes yes no
> -lm_notify: yes no no
> -lm_grant: no no no
> -lm_break: yes no no
> -lm_change yes no no
> + inode->i_lock blocked_hash_lock may block
> +lm_compare_owner: yes maybe no
> +lm_owner_key yes yes no
> +lm_notify: yes no no
> +lm_grant: no no no
> +lm_break: yes no no
> +lm_change yes no no
>
> ->lm_compare_owner and ->lm_owner_key are generally called with
> *an* inode->i_lock held. It may not be the i_lock of the inode
> associated with either file_lock argument! This is the case with deadlock
> detection, since the code has to chase down the owners of locks that may
> be entirely unrelated to the one on which the lock is being acquired.
> -For deadlock detection however, the file_lock_lock is also held. The
> +For deadlock detection however, the blocked_hash_lock is also held. The
> fact that these locks are held ensures that the file_locks do not
> disappear out from under you while doing the comparison or generating an
> owner key.
> diff --git a/fs/locks.c b/fs/locks.c
> index 11e7784..8124fc1 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -162,12 +162,11 @@ int lease_break_time = 45;
> */
> #define BLOCKED_HASH_BITS 7
>
> +static DEFINE_SPINLOCK(blocked_hash_lock);
> static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
>
> -static HLIST_HEAD(file_lock_list);
> -
> -/* Protects the file_lock_list and the blocked_hash */
> static DEFINE_SPINLOCK(file_lock_lock);
> +static HLIST_HEAD(file_lock_list);
>
> static struct kmem_cache *filelock_cache __read_mostly;
>
> @@ -505,9 +504,9 @@ __locks_delete_global_blocked(struct file_lock *waiter)
> static inline void
> locks_delete_global_blocked(struct file_lock *waiter)
> {
> - spin_lock(&file_lock_lock);
> + spin_lock(&blocked_hash_lock);
> __locks_delete_global_blocked(waiter);
> - spin_unlock(&file_lock_lock);
> + spin_unlock(&blocked_hash_lock);
> }
>
> static inline void
> @@ -581,14 +580,14 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
>
> /*
> * Wake up processes blocked waiting for blocker. In the FL_POSIX case, we must
> - * also take the global file_lock_lock and dequeue it from the global blocked
> - * list as we wake the processes.
> + * also take the global blocked_hash_lock and dequeue it from the global
> + * blocked list as we wake the processes.
> *
> * Must be called with the inode->i_lock of the blocker held!
> */
> static void locks_wake_up_posix_blocks(struct file_lock *blocker)
> {
> - spin_lock(&file_lock_lock);
> + spin_lock(&blocked_hash_lock);
> while (!list_empty(&blocker->fl_block)) {
> struct file_lock *waiter;
>
> @@ -601,7 +600,7 @@ static void locks_wake_up_posix_blocks(struct file_lock *blocker)
> else
> wake_up(&waiter->fl_wait);
> }
> - spin_unlock(&file_lock_lock);
> + spin_unlock(&blocked_hash_lock);
> }
> /* Insert file lock fl into an inode's lock list at the position indicated
> * by pos. At the same time add the lock to the global file lock list.
> @@ -754,7 +753,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> return NULL;
> }
>
> -/* Must be called with the file_lock_lock held! */
> +/* Must be called with the blocked_hash_lock held! */
> static int posix_locks_deadlock(struct file_lock *caller_fl,
> struct file_lock *block_fl)
> {
> @@ -898,13 +897,13 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> if (!(request->fl_flags & FL_SLEEP))
> goto out;
> error = -EDEADLK;
> - spin_lock(&file_lock_lock);
> + spin_lock(&blocked_hash_lock);
> if (likely(!posix_locks_deadlock(request, fl))) {
> error = FILE_LOCK_DEFERRED;
> locks_insert_block(fl, request);
> locks_insert_global_blocked(request);
> }
> - spin_unlock(&file_lock_lock);
> + spin_unlock(&blocked_hash_lock);
> goto out;
> }
> }
> @@ -2309,10 +2308,12 @@ static int locks_show(struct seq_file *f, void *v)
>
> lock_get_status(f, fl, *((loff_t *)f->private), "");
>
> + spin_lock(&blocked_hash_lock);
> hash_for_each(blocked_hash, bkt, bfl, fl_link) {
> if (bfl->fl_next == fl)
> lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> }
> + spin_unlock(&blocked_hash_lock);
>
> return 0;
> }
> --
> 1.7.1
>

2013-06-13 15:10:20

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 07/14] locks: convert to i_lock to protect i_flock list

On Thu, 13 Jun 2013 10:41:40 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Jun 11, 2013 at 07:09:01AM -0400, Jeff Layton wrote:
> > Having a global lock that protects all of this code is a clear
> > scalability problem. Instead of doing that, move most of the code to be
> > protected by the i_lock instead.
> >
> > The exceptions are the global lists that file_lock->fl_link sits on.
> > Those still need a global lock of some sort, so wrap just those accesses
> > in the file_lock_lock().
> >
> > Note that this patch introduces a potential race in the deadlock
> > detection code which could result in either false positives or missed
> > file_lock loops. The blocked_list management needs to be done atomically
> > with the deadlock detection. This will be fixed in a later patch.
>
> Actually, the way I think I'd try this:
>
> 1. Introduce a new spinlock global_lock_lists_lock and take it
> in all the places we need a global lock.
> 2. Do the big flock_lock-to-i_lock conversion, deleting flock_lock
> in the process.
>
> Does that work?
>
> Yeah, more patch-ordering bike-shedding, but... I think that would be
> simple and bisectable and for a confusing long-untouched bit of code
> it's worth getting these little steps right if we can.
>
>

I don't know... that really sounds more like it'll just add to the
churn without making anything simpler. If we're concerned about
bisectability, I think it'd just be best to merge patches 7 and 8 and
call it a day...

> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > Signed-off-by: J. Bruce Fields <[email protected]>
> > ---
> > Documentation/filesystems/Locking | 23 +++++--
> > fs/afs/flock.c | 5 +-
> > fs/ceph/locks.c | 2 +-
> > fs/ceph/mds_client.c | 8 +-
> > fs/cifs/cifsfs.c | 2 +-
> > fs/cifs/file.c | 13 ++--
> > fs/gfs2/file.c | 2 +-
> > fs/lockd/svcsubs.c | 12 ++--
> > fs/locks.c | 110 ++++++++++++++++++++-----------------
> > fs/nfs/delegation.c | 11 ++--
> > fs/nfs/nfs4state.c | 8 +-
> > fs/nfsd/nfs4state.c | 8 +-
> > include/linux/fs.h | 11 ----
> > 13 files changed, 113 insertions(+), 102 deletions(-)
> >
> > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > index 0706d32..13f91ab 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -344,7 +344,7 @@ prototypes:
> >
> >
> > locking rules:
> > - file_lock_lock may block
> > + inode->i_lock may block
> > fl_copy_lock: yes no
> > fl_release_private: maybe no
> >
> > @@ -357,12 +357,21 @@ prototypes:
> > int (*lm_change)(struct file_lock **, int);
> >
> > locking rules:
> > - file_lock_lock may block
> > -lm_compare_owner: yes no
> > -lm_notify: yes no
> > -lm_grant: no no
> > -lm_break: yes no
> > -lm_change yes no
> > +
> > + inode->i_lock file_lock_lock may block
> > +lm_compare_owner: yes maybe no
> > +lm_notify: yes no no
> > +lm_grant: no no no
> > +lm_break: yes no no
> > +lm_change yes no no
> > +
> > + ->lm_compare_owner is generally called with *an* inode->i_lock
> > +held. It may not be the i_lock of the inode for either file_lock being
> > +compared! This is the case with deadlock detection, since the code has
> > +to chase down the owners of locks that may be entirely unrelated to the
> > +one on which the lock is being acquired. For deadlock detection however,
> > +the file_lock_lock is also held. The locks primarily ensure that neither
> > +file_lock disappear out from under you while doing the comparison.
> >
> > --------------------------- buffer_head -----------------------------------
> > prototypes:
> > diff --git a/fs/afs/flock.c b/fs/afs/flock.c
> > index 2497bf3..03fc0d1 100644
> > --- a/fs/afs/flock.c
> > +++ b/fs/afs/flock.c
> > @@ -252,6 +252,7 @@ static void afs_defer_unlock(struct afs_vnode *vnode, struct key *key)
> > */
> > static int afs_do_setlk(struct file *file, struct file_lock *fl)
> > {
> > + struct inode *inode = file_inode(file);
> > struct afs_vnode *vnode = AFS_FS_I(file->f_mapping->host);
> > afs_lock_type_t type;
> > struct key *key = file->private_data;
> > @@ -273,7 +274,7 @@ static int afs_do_setlk(struct file *file, struct file_lock *fl)
> >
> > type = (fl->fl_type == F_RDLCK) ? AFS_LOCK_READ : AFS_LOCK_WRITE;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> >
> > /* make sure we've got a callback on this file and that our view of the
> > * data version is up to date */
> > @@ -420,7 +421,7 @@ given_lock:
> > afs_vnode_fetch_status(vnode, NULL, key);
> >
> > error:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > _leave(" = %d", ret);
> > return ret;
> >
> > diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> > index 202dd3d..cd0a664 100644
> > --- a/fs/ceph/locks.c
> > +++ b/fs/ceph/locks.c
> > @@ -194,7 +194,7 @@ void ceph_count_locks(struct inode *inode, int *fcntl_count, int *flock_count)
> > * Encode the flock and fcntl locks for the given inode into the pagelist.
> > * Format is: #fcntl locks, sequential fcntl locks, #flock locks,
> > * sequential flock locks.
> > - * Must be called with lock_flocks() already held.
> > + * Must be called with inode->i_lock already held.
> > * If we encounter more of a specific lock type than expected,
> > * we return the value 1.
> > */
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 4f22671..ae621b5 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2482,13 +2482,13 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
> >
> > ceph_pagelist_set_cursor(pagelist, &trunc_point);
> > do {
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > ceph_count_locks(inode, &num_fcntl_locks,
> > &num_flock_locks);
> > rec.v2.flock_len = (2*sizeof(u32) +
> > (num_fcntl_locks+num_flock_locks) *
> > sizeof(struct ceph_filelock));
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > /* pre-alloc pagelist */
> > ceph_pagelist_truncate(pagelist, &trunc_point);
> > @@ -2499,12 +2499,12 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
> >
> > /* encode locks */
> > if (!err) {
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > err = ceph_encode_locks(inode,
> > pagelist,
> > num_fcntl_locks,
> > num_flock_locks);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > }
> > } while (err == -ENOSPC);
> > } else {
> > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > index 3752b9f..e81655c 100644
> > --- a/fs/cifs/cifsfs.c
> > +++ b/fs/cifs/cifsfs.c
> > @@ -765,7 +765,7 @@ static loff_t cifs_llseek(struct file *file, loff_t offset, int whence)
> >
> > static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
> > {
> > - /* note that this is called by vfs setlease with lock_flocks held
> > + /* note that this is called by vfs setlease with i_lock held
> > to protect *lease from going away */
> > struct inode *inode = file_inode(file);
> > struct cifsFileInfo *cfile = file->private_data;
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > index 44a4f18..0dd10cd 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -1092,6 +1092,7 @@ struct lock_to_push {
> > static int
> > cifs_push_posix_locks(struct cifsFileInfo *cfile)
> > {
> > + struct inode *inode = cfile->dentry->d_inode;
> > struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
> > struct file_lock *flock, **before;
> > unsigned int count = 0, i = 0;
> > @@ -1102,12 +1103,12 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> >
> > xid = get_xid();
> >
> > - lock_flocks();
> > - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> > + spin_lock(&inode->i_lock);
> > + cifs_for_each_lock(inode, before) {
> > if ((*before)->fl_flags & FL_POSIX)
> > count++;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > INIT_LIST_HEAD(&locks_to_send);
> >
> > @@ -1126,8 +1127,8 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> > }
> >
> > el = locks_to_send.next;
> > - lock_flocks();
> > - cifs_for_each_lock(cfile->dentry->d_inode, before) {
> > + spin_lock(&inode->i_lock);
> > + cifs_for_each_lock(inode, before) {
> > flock = *before;
> > if ((flock->fl_flags & FL_POSIX) == 0)
> > continue;
> > @@ -1152,7 +1153,7 @@ cifs_push_posix_locks(struct cifsFileInfo *cfile)
> > lck->offset = flock->fl_start;
> > el = el->next;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > list_for_each_entry_safe(lck, tmp, &locks_to_send, llist) {
> > int stored_rc;
> > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > index acd1676..9e634e0 100644
> > --- a/fs/gfs2/file.c
> > +++ b/fs/gfs2/file.c
> > @@ -889,7 +889,7 @@ out_uninit:
> > * cluster; until we do, disable leases (by just returning -EINVAL),
> > * unless the administrator has requested purely local locking.
> > *
> > - * Locking: called under lock_flocks
> > + * Locking: called under i_lock
> > *
> > * Returns: errno
> > */
> > diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
> > index 97e8741..dc5c759 100644
> > --- a/fs/lockd/svcsubs.c
> > +++ b/fs/lockd/svcsubs.c
> > @@ -169,7 +169,7 @@ nlm_traverse_locks(struct nlm_host *host, struct nlm_file *file,
> >
> > again:
> > file->f_locks = 0;
> > - lock_flocks(); /* protects i_flock list */
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> > if (fl->fl_lmops != &nlmsvc_lock_operations)
> > continue;
> > @@ -181,7 +181,7 @@ again:
> > if (match(lockhost, host)) {
> > struct file_lock lock = *fl;
> >
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > lock.fl_type = F_UNLCK;
> > lock.fl_start = 0;
> > lock.fl_end = OFFSET_MAX;
> > @@ -193,7 +193,7 @@ again:
> > goto again;
> > }
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > return 0;
> > }
> > @@ -228,14 +228,14 @@ nlm_file_inuse(struct nlm_file *file)
> > if (file->f_count || !list_empty(&file->f_blocks) || file->f_shares)
> > return 1;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl; fl = fl->fl_next) {
> > if (fl->fl_lmops == &nlmsvc_lock_operations) {
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return 1;
> > }
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > file->f_locks = 0;
> > return 0;
> > }
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 3fd27f0..d7342a3 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -155,22 +155,9 @@ int lease_break_time = 45;
> >
> > static LIST_HEAD(file_lock_list);
> > static LIST_HEAD(blocked_list);
> > -static DEFINE_SPINLOCK(file_lock_lock);
> > -
> > -/*
> > - * Protects the two list heads above, plus the inode->i_flock list
> > - */
> > -void lock_flocks(void)
> > -{
> > - spin_lock(&file_lock_lock);
> > -}
> > -EXPORT_SYMBOL_GPL(lock_flocks);
> >
> > -void unlock_flocks(void)
> > -{
> > - spin_unlock(&file_lock_lock);
> > -}
> > -EXPORT_SYMBOL_GPL(unlock_flocks);
> > +/* Protects the two list heads above */
> > +static DEFINE_SPINLOCK(file_lock_lock);
> >
> > static struct kmem_cache *filelock_cache __read_mostly;
> >
> > @@ -488,25 +475,33 @@ static int posix_same_owner(struct file_lock *fl1, struct file_lock *fl2)
> > static inline void
> > locks_insert_global_blocked(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_add(&waiter->fl_link, &blocked_list);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static inline void
> > locks_delete_global_blocked(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_del_init(&waiter->fl_link);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static inline void
> > locks_insert_global_locks(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_add_tail(&waiter->fl_link, &file_lock_list);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static inline void
> > locks_delete_global_locks(struct file_lock *waiter)
> > {
> > + spin_lock(&file_lock_lock);
> > list_del_init(&waiter->fl_link);
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > /* Remove waiter from blocker's block list.
> > @@ -523,9 +518,11 @@ static void __locks_delete_block(struct file_lock *waiter)
> > */
> > static void locks_delete_block(struct file_lock *waiter)
> > {
> > - lock_flocks();
> > + struct inode *inode = file_inode(waiter->fl_file);
> > +
> > + spin_lock(&inode->i_lock);
> > __locks_delete_block(waiter);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > }
> >
> > /* Insert waiter into blocker's block list.
> > @@ -544,7 +541,7 @@ static void locks_insert_block(struct file_lock *blocker,
> > /*
> > * Wake up processes blocked waiting for blocker.
> > *
> > - * Must be called with the file_lock_lock held!
> > + * Must be called with the inode->i_lock of the blocker held!
> > */
> > static void locks_wake_up_blocks(struct file_lock *blocker)
> > {
> > @@ -649,8 +646,9 @@ void
> > posix_test_lock(struct file *filp, struct file_lock *fl)
> > {
> > struct file_lock *cfl;
> > + struct inode *inode = file_inode(filp);
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (cfl = file_inode(filp)->i_flock; cfl; cfl = cfl->fl_next) {
> > if (!IS_POSIX(cfl))
> > continue;
> > @@ -663,7 +661,7 @@ posix_test_lock(struct file *filp, struct file_lock *fl)
> > fl->fl_pid = pid_vnr(cfl->fl_nspid);
> > } else
> > fl->fl_type = F_UNLCK;
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return;
> > }
> > EXPORT_SYMBOL(posix_test_lock);
> > @@ -742,7 +740,7 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> > return -ENOMEM;
> > }
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > if (request->fl_flags & FL_ACCESS)
> > goto find_conflict;
> >
> > @@ -772,9 +770,9 @@ static int flock_lock_file(struct file *filp, struct file_lock *request)
> > * give it the opportunity to lock the file.
> > */
> > if (found) {
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > cond_resched();
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > }
> >
> > find_conflict:
> > @@ -801,7 +799,7 @@ find_conflict:
> > error = 0;
> >
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > if (new_fl)
> > locks_free_lock(new_fl);
> > return error;
> > @@ -831,7 +829,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > new_fl2 = locks_alloc_lock();
> > }
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > /*
> > * New lock request. Walk all POSIX locks and look for conflicts. If
> > * there are any, either return error or put the request on the
> > @@ -850,8 +848,14 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > if (!(request->fl_flags & FL_SLEEP))
> > goto out;
> > error = -EDEADLK;
> > + /*
> > + * XXX: potential race here. We should be adding the
> > + * file_lock to the global list before releasing lock.
> > + */
> > + spin_lock(&file_lock_lock);
> > if (posix_locks_deadlock(request, fl))
> > goto out;
> > + spin_unlock(&file_lock_lock);
> > error = FILE_LOCK_DEFERRED;
> > locks_insert_block(fl, request);
> > locks_insert_global_blocked(request);
> > @@ -1004,7 +1008,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > locks_wake_up_blocks(left);
> > }
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > /*
> > * Free any unused locks.
> > */
> > @@ -1079,14 +1083,14 @@ int locks_mandatory_locked(struct inode *inode)
> > /*
> > * Search the lock list for this inode for any POSIX locks.
> > */
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (!IS_POSIX(fl))
> > continue;
> > if (fl->fl_owner != owner)
> > break;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return fl ? -EAGAIN : 0;
> > }
> >
> > @@ -1229,7 +1233,7 @@ int __break_lease(struct inode *inode, unsigned int mode)
> > if (IS_ERR(new_fl))
> > return PTR_ERR(new_fl);
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> >
> > time_out_leases(inode);
> >
> > @@ -1279,10 +1283,10 @@ restart:
> > break_time++;
> > }
> > locks_insert_block(flock, new_fl);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > error = wait_event_interruptible_timeout(new_fl->fl_wait,
> > !new_fl->fl_next, break_time);
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > __locks_delete_block(new_fl);
> > if (error >= 0) {
> > if (error == 0)
> > @@ -1300,7 +1304,7 @@ restart:
> > }
> >
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > locks_free_lock(new_fl);
> > return error;
> > }
> > @@ -1353,9 +1357,10 @@ EXPORT_SYMBOL(lease_get_mtime);
> > int fcntl_getlease(struct file *filp)
> > {
> > struct file_lock *fl;
> > + struct inode *inode = file_inode(filp);
> > int type = F_UNLCK;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > time_out_leases(file_inode(filp));
> > for (fl = file_inode(filp)->i_flock; fl && IS_LEASE(fl);
> > fl = fl->fl_next) {
> > @@ -1364,7 +1369,7 @@ int fcntl_getlease(struct file *filp)
> > break;
> > }
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return type;
> > }
> >
> > @@ -1458,7 +1463,7 @@ static int generic_delete_lease(struct file *filp, struct file_lock **flp)
> > * The (input) flp->fl_lmops->lm_break function is required
> > * by break_lease().
> > *
> > - * Called with file_lock_lock held.
> > + * Called with inode->i_lock held.
> > */
> > int generic_setlease(struct file *filp, long arg, struct file_lock **flp)
> > {
> > @@ -1527,11 +1532,12 @@ static int __vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> >
> > int vfs_setlease(struct file *filp, long arg, struct file_lock **lease)
> > {
> > + struct inode *inode = file_inode(filp);
> > int error;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > error = __vfs_setlease(filp, arg, lease);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > return error;
> > }
> > @@ -1549,6 +1555,7 @@ static int do_fcntl_delete_lease(struct file *filp)
> > static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> > {
> > struct file_lock *fl, *ret;
> > + struct inode *inode = file_inode(filp);
> > struct fasync_struct *new;
> > int error;
> >
> > @@ -1562,10 +1569,10 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> > return -ENOMEM;
> > }
> > ret = fl;
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > error = __vfs_setlease(filp, arg, &ret);
> > if (error) {
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > locks_free_lock(fl);
> > goto out_free_fasync;
> > }
> > @@ -1582,7 +1589,7 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
> > new = NULL;
> >
> > error = __f_setown(filp, task_pid(current), PIDTYPE_PID, 0);
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> >
> > out_free_fasync:
> > if (new)
> > @@ -2106,7 +2113,7 @@ void locks_remove_flock(struct file *filp)
> > fl.fl_ops->fl_release_private(&fl);
> > }
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > before = &inode->i_flock;
> >
> > while ((fl = *before) != NULL) {
> > @@ -2124,7 +2131,7 @@ void locks_remove_flock(struct file *filp)
> > }
> > before = &fl->fl_next;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > }
> >
> > /**
> > @@ -2137,14 +2144,15 @@ void locks_remove_flock(struct file *filp)
> > int
> > posix_unblock_lock(struct file *filp, struct file_lock *waiter)
> > {
> > + struct inode *inode = file_inode(filp);
> > int status = 0;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > if (waiter->fl_next)
> > __locks_delete_block(waiter);
> > else
> > status = -ENOENT;
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return status;
> > }
> >
> > @@ -2261,7 +2269,7 @@ static void *locks_start(struct seq_file *f, loff_t *pos)
> > {
> > loff_t *p = f->private;
> >
> > - lock_flocks();
> > + spin_lock(&file_lock_lock);
> > *p = (*pos + 1);
> > return seq_list_start(&file_lock_list, *pos);
> > }
> > @@ -2275,7 +2283,7 @@ static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
> >
> > static void locks_stop(struct seq_file *f, void *v)
> > {
> > - unlock_flocks();
> > + spin_unlock(&file_lock_lock);
> > }
> >
> > static const struct seq_operations locks_seq_operations = {
> > @@ -2322,7 +2330,8 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> > {
> > struct file_lock *fl;
> > int result = 1;
> > - lock_flocks();
> > +
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (IS_POSIX(fl)) {
> > if (fl->fl_type == F_RDLCK)
> > @@ -2339,7 +2348,7 @@ int lock_may_read(struct inode *inode, loff_t start, unsigned long len)
> > result = 0;
> > break;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return result;
> > }
> >
> > @@ -2362,7 +2371,8 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> > {
> > struct file_lock *fl;
> > int result = 1;
> > - lock_flocks();
> > +
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (IS_POSIX(fl)) {
> > if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
> > @@ -2377,7 +2387,7 @@ int lock_may_write(struct inode *inode, loff_t start, unsigned long len)
> > result = 0;
> > break;
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return result;
> > }
> >
> > diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> > index 57db324..43ee7f9 100644
> > --- a/fs/nfs/delegation.c
> > +++ b/fs/nfs/delegation.c
> > @@ -73,20 +73,21 @@ static int nfs_delegation_claim_locks(struct nfs_open_context *ctx, struct nfs4_
> > if (inode->i_flock == NULL)
> > goto out;
> >
> > - /* Protect inode->i_flock using the file locks lock */
> > - lock_flocks();
> > + /* Protect inode->i_flock using the i_lock */
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> > continue;
> > if (nfs_file_open_context(fl->fl_file) != ctx)
> > continue;
> > - unlock_flocks();
> > + /* FIXME: safe to drop lock here while walking list? */
> > + spin_unlock(&inode->i_lock);
> > status = nfs4_lock_delegation_recall(fl, state, stateid);
> > if (status < 0)
> > goto out;
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > out:
> > return status;
> > }
> > diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> > index 1fab140..ff10b4a 100644
> > --- a/fs/nfs/nfs4state.c
> > +++ b/fs/nfs/nfs4state.c
> > @@ -1373,13 +1373,13 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> > /* Guard against delegation returns and new lock/unlock calls */
> > down_write(&nfsi->rwsem);
> > /* Protect inode->i_flock using the BKL */
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
> > if (!(fl->fl_flags & (FL_POSIX|FL_FLOCK)))
> > continue;
> > if (nfs_file_open_context(fl->fl_file)->state != state)
> > continue;
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > status = ops->recover_lock(state, fl);
> > switch (status) {
> > case 0:
> > @@ -1406,9 +1406,9 @@ static int nfs4_reclaim_locks(struct nfs4_state *state, const struct nfs4_state_
> > /* kill_proc(fl->fl_pid, SIGLOST, 1); */
> > status = 0;
> > }
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > }
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > out:
> > up_write(&nfsi->rwsem);
> > return status;
> > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > index 316ec84..f170518 100644
> > --- a/fs/nfsd/nfs4state.c
> > +++ b/fs/nfsd/nfs4state.c
> > @@ -2645,13 +2645,13 @@ static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
> >
> > list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);
> >
> > - /* only place dl_time is set. protected by lock_flocks*/
> > + /* Only place dl_time is set; protected by i_lock: */
> > dp->dl_time = get_seconds();
> >
> > nfsd4_cb_recall(dp);
> > }
> >
> > -/* Called from break_lease() with lock_flocks() held. */
> > +/* Called from break_lease() with i_lock held. */
> > static void nfsd_break_deleg_cb(struct file_lock *fl)
> > {
> > struct nfs4_file *fp = (struct nfs4_file *)fl->fl_owner;
> > @@ -4520,7 +4520,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> > struct inode *inode = filp->fi_inode;
> > int status = 0;
> >
> > - lock_flocks();
> > + spin_lock(&inode->i_lock);
> > for (flpp = &inode->i_flock; *flpp != NULL; flpp = &(*flpp)->fl_next) {
> > if ((*flpp)->fl_owner == (fl_owner_t)lowner) {
> > status = 1;
> > @@ -4528,7 +4528,7 @@ check_for_locks(struct nfs4_file *filp, struct nfs4_lockowner *lowner)
> > }
> > }
> > out:
> > - unlock_flocks();
> > + spin_unlock(&inode->i_lock);
> > return status;
> > }
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 94105d2..e2f896d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1024,8 +1024,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
> > extern int lease_modify(struct file_lock **, int);
> > extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
> > extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
> > -extern void lock_flocks(void);
> > -extern void unlock_flocks(void);
> > #else /* !CONFIG_FILE_LOCKING */
> > static inline int fcntl_getlk(struct file *file, struct flock __user *user)
> > {
> > @@ -1167,15 +1165,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
> > {
> > return 1;
> > }
> > -
> > -static inline void lock_flocks(void)
> > -{
> > -}
> > -
> > -static inline void unlock_flocks(void)
> > -{
> > -}
> > -
> > #endif /* !CONFIG_FILE_LOCKING */
> >
> >
> > --
> > 1.7.1
> >


--
Jeff Layton <[email protected]>

2013-06-13 15:19:22

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] locks: give the blocked_hash its own spinlock

On Thu, 13 Jun 2013 11:02:47 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Jun 11, 2013 at 07:09:06AM -0400, Jeff Layton wrote:
> > There's no reason we have to protect the blocked_hash and file_lock_list
> > with the same spinlock. With the tests I have, breaking it in two gives
> > a barely measurable performance benefit, but it seems reasonable to make
> > this locking as granular as possible.
>
> Out of curiosity... In the typical case when adding/removing a lock,
> aren't both lists being modified in rapid succession?
>
> I wonder if it would be better to instead stick with one lock and take
> care to acquire it only once to cover both manipulations.
>
> --b.
>

That's not really the case...

Typically, when we call into __posix_lock_file with FL_SLEEP set, we
either end up blocking on the lock or acquiring it. In either case,
we only end up taking one of the global spinlocks. The reason is that
the blocker is what dequeues a waiter from the blocked_hash before
waking it up (in locks_wake_up_posix_blocks).
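
Roughly, in pseudocode (a condensed sketch of the post-series code paths,
not the literal fs/locks.c; "conflicting_lock_found" is just shorthand):

        spin_lock(&inode->i_lock);
        if (conflicting_lock_found) {
                /* blocking path: only blocked_hash_lock is taken */
                spin_lock(&blocked_hash_lock);
                if (!posix_locks_deadlock(request, fl)) {
                        locks_insert_block(fl, request);
                        locks_insert_global_blocked(request);
                }
                spin_unlock(&blocked_hash_lock);
        } else {
                /* acquiring path: only file_lock_lock is taken, inside
                 * locks_insert_global_locks() via locks_insert_lock() */
                locks_insert_lock(before, new_fl);
        }
        spin_unlock(&inode->i_lock);

        /* the blocker dequeues and wakes waiters later, under
         * blocked_hash_lock, in locks_wake_up_posix_blocks() */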

Also, while this patch description doesn't spell it out, we require a
truly global lock for deadlock detection. In a later patch though, I
convert the file_lock_lock to a per-cpu spinlock. So we really do need
to separate the locks here in order to make the per-cpu file_lock_list
worthwhile.
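
As a rough illustration of that later conversion (names and details,
including the fl_link_cpu field, are approximations rather than the
actual patch):

#include <linux/fs.h>
#include <linux/lglock.h>
#include <linux/percpu.h>

/* each CPU gets its own list head; the lglock keeps insertion and
 * removal cheap (per-cpu), while /proc/locks takes lg_global_lock()
 * to walk every CPU's list */
static DEFINE_PER_CPU(struct hlist_head, file_lock_list);
static DEFINE_LGLOCK(file_lock_lglock);

static void locks_insert_global_locks(struct file_lock *fl)
{
        lg_local_lock(&file_lock_lglock);
        fl->fl_link_cpu = smp_processor_id();   /* remember the CPU */
        hlist_add_head(&fl->fl_link, this_cpu_ptr(&file_lock_list));
        lg_local_unlock(&file_lock_lglock);
}

static void locks_delete_global_locks(struct file_lock *fl)
{
        /* must take the lock of the CPU the entry was inserted on */
        lg_local_lock_cpu(&file_lock_lglock, fl->fl_link_cpu);
        hlist_del_init(&fl->fl_link);
        lg_local_unlock_cpu(&file_lock_lglock, fl->fl_link_cpu);
}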

> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > Documentation/filesystems/Locking | 16 ++++++++--------
> > fs/locks.c | 25 +++++++++++++------------
> > 2 files changed, 21 insertions(+), 20 deletions(-)
> >
> > diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> > index ee351ac..8d8d040 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -359,20 +359,20 @@ prototypes:
> >
> > locking rules:
> >
> > - inode->i_lock file_lock_lock may block
> > -lm_compare_owner: yes maybe no
> > -lm_owner_key yes yes no
> > -lm_notify: yes no no
> > -lm_grant: no no no
> > -lm_break: yes no no
> > -lm_change yes no no
> > + inode->i_lock blocked_hash_lock may block
> > +lm_compare_owner: yes maybe no
> > +lm_owner_key yes yes no
> > +lm_notify: yes no no
> > +lm_grant: no no no
> > +lm_break: yes no no
> > +lm_change yes no no
> >
> > ->lm_compare_owner and ->lm_owner_key are generally called with
> > *an* inode->i_lock held. It may not be the i_lock of the inode
> > associated with either file_lock argument! This is the case with deadlock
> > detection, since the code has to chase down the owners of locks that may
> > be entirely unrelated to the one on which the lock is being acquired.
> > -For deadlock detection however, the file_lock_lock is also held. The
> > +For deadlock detection however, the blocked_hash_lock is also held. The
> > fact that these locks are held ensures that the file_locks do not
> > disappear out from under you while doing the comparison or generating an
> > owner key.
> > diff --git a/fs/locks.c b/fs/locks.c
> > index 11e7784..8124fc1 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -162,12 +162,11 @@ int lease_break_time = 45;
> > */
> > #define BLOCKED_HASH_BITS 7
> >
> > +static DEFINE_SPINLOCK(blocked_hash_lock);
> > static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
> >
> > -static HLIST_HEAD(file_lock_list);
> > -
> > -/* Protects the file_lock_list and the blocked_hash */
> > static DEFINE_SPINLOCK(file_lock_lock);
> > +static HLIST_HEAD(file_lock_list);
> >
> > static struct kmem_cache *filelock_cache __read_mostly;
> >
> > @@ -505,9 +504,9 @@ __locks_delete_global_blocked(struct file_lock *waiter)
> > static inline void
> > locks_delete_global_blocked(struct file_lock *waiter)
> > {
> > - spin_lock(&file_lock_lock);
> > + spin_lock(&blocked_hash_lock);
> > __locks_delete_global_blocked(waiter);
> > - spin_unlock(&file_lock_lock);
> > + spin_unlock(&blocked_hash_lock);
> > }
> >
> > static inline void
> > @@ -581,14 +580,14 @@ static void locks_wake_up_blocks(struct file_lock *blocker)
> >
> > /*
> > * Wake up processes blocked waiting for blocker. In the FL_POSIX case, we must
> > - * also take the global file_lock_lock and dequeue it from the global blocked
> > - * list as we wake the processes.
> > + * also take the global blocked_hash_lock and dequeue it from the global
> > + * blocked list as we wake the processes.
> > *
> > * Must be called with the inode->i_lock of the blocker held!
> > */
> > static void locks_wake_up_posix_blocks(struct file_lock *blocker)
> > {
> > - spin_lock(&file_lock_lock);
> > + spin_lock(&blocked_hash_lock);
> > while (!list_empty(&blocker->fl_block)) {
> > struct file_lock *waiter;
> >
> > @@ -601,7 +600,7 @@ static void locks_wake_up_posix_blocks(struct file_lock *blocker)
> > else
> > wake_up(&waiter->fl_wait);
> > }
> > - spin_unlock(&file_lock_lock);
> > + spin_unlock(&blocked_hash_lock);
> > }
> > /* Insert file lock fl into an inode's lock list at the position indicated
> > * by pos. At the same time add the lock to the global file lock list.
> > @@ -754,7 +753,7 @@ static struct file_lock *what_owner_is_waiting_for(struct file_lock *block_fl)
> > return NULL;
> > }
> >
> > -/* Must be called with the file_lock_lock held! */
> > +/* Must be called with the blocked_hash_lock held! */
> > static int posix_locks_deadlock(struct file_lock *caller_fl,
> > struct file_lock *block_fl)
> > {
> > @@ -898,13 +897,13 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
> > if (!(request->fl_flags & FL_SLEEP))
> > goto out;
> > error = -EDEADLK;
> > - spin_lock(&file_lock_lock);
> > + spin_lock(&blocked_hash_lock);
> > if (likely(!posix_locks_deadlock(request, fl))) {
> > error = FILE_LOCK_DEFERRED;
> > locks_insert_block(fl, request);
> > locks_insert_global_blocked(request);
> > }
> > - spin_unlock(&file_lock_lock);
> > + spin_unlock(&blocked_hash_lock);
> > goto out;
> > }
> > }
> > @@ -2309,10 +2308,12 @@ static int locks_show(struct seq_file *f, void *v)
> >
> > lock_get_status(f, fl, *((loff_t *)f->private), "");
> >
> > + spin_lock(&blocked_hash_lock);
> > hash_for_each(blocked_hash, bkt, bfl, fl_link) {
> > if (bfl->fl_next == fl)
> > lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> > }
> > + spin_unlock(&blocked_hash_lock);
> >
> > return 0;
> > }
> > --
> > 1.7.1
> >


--
Jeff Layton <[email protected]>

2013-06-13 15:21:14

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 12/14] locks: give the blocked_hash its own spinlock

On Thu, Jun 13, 2013 at 11:18:44AM -0400, Jeff Layton wrote:
> On Thu, 13 Jun 2013 11:02:47 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Tue, Jun 11, 2013 at 07:09:06AM -0400, Jeff Layton wrote:
> > > There's no reason we have to protect the blocked_hash and file_lock_list
> > > with the same spinlock. With the tests I have, breaking it in two gives
> > > a barely measurable performance benefit, but it seems reasonable to make
> > > this locking as granular as possible.
> >
> > Out of curiosity... In the typical case when adding/removing a lock,
> > aren't both lists being modified in rapid succession?
> >
> > I wonder if it would be better to instead stick with one lock and take
> > care to acquire it only once to cover both manipulations.
> >
> > --b.
> >
>
> That's not really the case...
>
> Typically, when doing a call into __posix_lock_file with FL_SLEEP set,
> we either end up blocking on the lock or acquiring it. In either case,
> we'll only end up taking one of the global spinlocks. The reason for
> this is that the blocker is what dequeues a waiter from the blocked_hash
> before waking it up (in locks_wake_up_posix_blocks).
>
> Also, while this patch description doesn't spell it out, we require a
> truly global lock for deadlock detection. In a later patch though, I
> convert the file_lock_lock to a per-cpu spinlock. So we really do need
> to separate the locks here in order to make the per-cpu file_lock_list
> worthwhile.

Oh, right, got it!

--b.

2013-06-13 15:28:09

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 13/14] seq_file: add seq_list_*_percpu helpers

On Tue, Jun 11, 2013 at 07:09:07AM -0400, Jeff Layton wrote:
> When we convert the file_lock_list to a set of percpu lists, we'll need
> a way to iterate over them in order to output /proc/locks info. Add
> some seq_list_*_percpu helpers to handle that.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/seq_file.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/seq_file.h | 6 +++++
> 2 files changed, 60 insertions(+), 0 deletions(-)
>
> diff --git a/fs/seq_file.c b/fs/seq_file.c
> index 774c1eb..ce9f97a 100644
> --- a/fs/seq_file.c
> +++ b/fs/seq_file.c
> @@ -921,3 +921,57 @@ struct hlist_node *seq_hlist_next_rcu(void *v,
> return rcu_dereference(node->next);
> }
> EXPORT_SYMBOL(seq_hlist_next_rcu);
> +
> +/**
> + * seq_hlist_start_percpu - start an iteration of a hlist

Should be "of a percpu array of hlists"?

> + * @head: pointer to percpu array of struct hlist_heads
> + * @cpu: pointer to cpu "cursor"
> + * @pos: start position of sequence
> + *
> + * Called at seq_file->op->start().
> + */
> +struct hlist_node *
> +seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos)
> +{
> + struct hlist_node *node;
> +
> + for_each_possible_cpu(*cpu) {
> + hlist_for_each(node, per_cpu_ptr(head, *cpu)) {
> + if (pos-- == 0)
> + return node;
> + }
> + }
> + return NULL;
> +}
> +EXPORT_SYMBOL(seq_hlist_start_percpu);
> +
> +/**
> + * seq_hlist_next_percpu - move to the next position of the hlist

Ditto?

Just thinking that if the doc scripts collected up a bunch of these
summaries into an index somewhere, it'd be confusing for these to have
the same description as seq_hlist_start/next.

Otherwise, ACK.

--b.

> + * @v: pointer to current hlist_node
> + * @head: pointer to percpu array of struct hlist_heads
> + * @cpu: pointer to cpu "cursor"
> + * @pos: start position of sequence
> + *
> + * Called at seq_file->op->next().
> + */
> +struct hlist_node *
> +seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head,
> + int *cpu, loff_t *pos)
> +{
> + struct hlist_node *node = v;
> +
> + ++*pos;
> +
> + if (node->next)
> + return node->next;
> +
> + for (*cpu = cpumask_next(*cpu, cpu_possible_mask); *cpu < nr_cpu_ids;
> + *cpu = cpumask_next(*cpu, cpu_possible_mask)) {
> + struct hlist_head *bucket = per_cpu_ptr(head, *cpu);
> +
> + if (!hlist_empty(bucket))
> + return bucket->first;
> + }
> + return NULL;
> +}
> +EXPORT_SYMBOL(seq_hlist_next_percpu);
> diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
> index 2da29ac..4e32edc 100644
> --- a/include/linux/seq_file.h
> +++ b/include/linux/seq_file.h
> @@ -173,4 +173,10 @@ extern struct hlist_node *seq_hlist_start_head_rcu(struct hlist_head *head,
> extern struct hlist_node *seq_hlist_next_rcu(void *v,
> struct hlist_head *head,
> loff_t *ppos);
> +
> +/* Helpers for iterating over per-cpu hlist_head-s in seq_files */
> +extern struct hlist_node *seq_hlist_start_percpu(struct hlist_head __percpu *head, int *cpu, loff_t pos);
> +
> +extern struct hlist_node *seq_hlist_next_percpu(void *v, struct hlist_head __percpu *head, int *cpu, loff_t *pos);
> +
> #endif
> --
> 1.7.1
>

2013-06-13 15:38:12

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 14/14] locks: move file_lock_list to a set of percpu hlist_heads and convert file_lock_lock to an lglock

On Tue, Jun 11, 2013 at 07:09:08AM -0400, Jeff Layton wrote:
> The file_lock_list is only used for /proc/locks. The vastly common case
> is for locks to be put onto the list and come off again, without ever
> being traversed.
>
> Help optimize for this use-case by moving to percpu hlist_head-s. At the
> same time, we can make the locking less contentious by moving to an
> lglock. When iterating over the lists for /proc/locks, we must take the
> global lock and then iterate over each CPU's list in turn.
>
> This change necessitates a new fl_link_cpu field to keep track of which
> CPU the entry is on. On x86_64 at least, this field is placed within an
> existing hole in the struct to avoid growing its real size.

ACK.

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 57 +++++++++++++++++++++++++++++++++++----------------
> include/linux/fs.h | 1 +
> 2 files changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 8124fc1..094eb4d 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -127,6 +127,8 @@
> #include <linux/rcupdate.h>
> #include <linux/pid_namespace.h>
> #include <linux/hashtable.h>
> +#include <linux/percpu.h>
> +#include <linux/lglock.h>
>
> #include <asm/uaccess.h>
>
> @@ -165,8 +167,8 @@ int lease_break_time = 45;
> static DEFINE_SPINLOCK(blocked_hash_lock);
> static DEFINE_HASHTABLE(blocked_hash, BLOCKED_HASH_BITS);
>
> -static DEFINE_SPINLOCK(file_lock_lock);
> -static HLIST_HEAD(file_lock_list);
> +DEFINE_STATIC_LGLOCK(file_lock_lglock);
> +static DEFINE_PER_CPU(struct hlist_head, file_lock_list);
>
> static struct kmem_cache *filelock_cache __read_mostly;
>
> @@ -512,17 +514,21 @@ locks_delete_global_blocked(struct file_lock *waiter)
> static inline void
> locks_insert_global_locks(struct file_lock *waiter)
> {
> - spin_lock(&file_lock_lock);
> - hlist_add_head(&waiter->fl_link, &file_lock_list);
> - spin_unlock(&file_lock_lock);
> + lg_local_lock(&file_lock_lglock);
> + waiter->fl_link_cpu = smp_processor_id();
> + hlist_add_head(&waiter->fl_link, this_cpu_ptr(&file_lock_list));
> + lg_local_unlock(&file_lock_lglock);
> }
>
> static inline void
> locks_delete_global_locks(struct file_lock *waiter)
> {
> - spin_lock(&file_lock_lock);
> + /* avoid taking lock if already unhashed */
> + if (hlist_unhashed(&waiter->fl_link))
> + return;
> + lg_local_lock_cpu(&file_lock_lglock, waiter->fl_link_cpu);
> hlist_del_init(&waiter->fl_link);
> - spin_unlock(&file_lock_lock);
> + lg_local_unlock_cpu(&file_lock_lglock, waiter->fl_link_cpu);
> }
>
> /* Remove waiter from blocker's block list.
> @@ -2228,6 +2234,11 @@ EXPORT_SYMBOL_GPL(vfs_cancel_lock);
> #include <linux/proc_fs.h>
> #include <linux/seq_file.h>
>
> +struct locks_iterator {
> + int li_cpu;
> + loff_t li_pos;
> +};
> +
> static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> loff_t id, char *pfx)
> {
> @@ -2302,16 +2313,17 @@ static void lock_get_status(struct seq_file *f, struct file_lock *fl,
> static int locks_show(struct seq_file *f, void *v)
> {
> int bkt;
> + struct locks_iterator *iter = f->private;
> struct file_lock *fl, *bfl;
>
> fl = hlist_entry(v, struct file_lock, fl_link);
>
> - lock_get_status(f, fl, *((loff_t *)f->private), "");
> + lock_get_status(f, fl, iter->li_pos, "");
>
> spin_lock(&blocked_hash_lock);
> hash_for_each(blocked_hash, bkt, bfl, fl_link) {
> if (bfl->fl_next == fl)
> - lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> + lock_get_status(f, bfl, iter->li_pos, " ->");
> }
> spin_unlock(&blocked_hash_lock);
>
> @@ -2320,23 +2332,24 @@ static int locks_show(struct seq_file *f, void *v)
>
> static void *locks_start(struct seq_file *f, loff_t *pos)
> {
> - loff_t *p = f->private;
> + struct locks_iterator *iter = f->private;
>
> - spin_lock(&file_lock_lock);
> - *p = (*pos + 1);
> - return seq_hlist_start(&file_lock_list, *pos);
> + iter->li_pos = *pos + 1;
> + lg_global_lock(&file_lock_lglock);
> + return seq_hlist_start_percpu(&file_lock_list, &iter->li_cpu, *pos);
> }
>
> static void *locks_next(struct seq_file *f, void *v, loff_t *pos)
> {
> - loff_t *p = f->private;
> - ++*p;
> - return seq_hlist_next(v, &file_lock_list, pos);
> + struct locks_iterator *iter = f->private;
> +
> + ++iter->li_pos;
> + return seq_hlist_next_percpu(v, &file_lock_list, &iter->li_cpu, pos);
> }
>
> static void locks_stop(struct seq_file *f, void *v)
> {
> - spin_unlock(&file_lock_lock);
> + lg_global_unlock(&file_lock_lglock);
> }
>
> static const struct seq_operations locks_seq_operations = {
> @@ -2348,7 +2361,8 @@ static const struct seq_operations locks_seq_operations = {
>
> static int locks_open(struct inode *inode, struct file *filp)
> {
> - return seq_open_private(filp, &locks_seq_operations, sizeof(loff_t));
> + return seq_open_private(filp, &locks_seq_operations,
> + sizeof(struct locks_iterator));
> }
>
> static const struct file_operations proc_locks_operations = {
> @@ -2448,9 +2462,16 @@ EXPORT_SYMBOL(lock_may_write);
>
> static int __init filelock_init(void)
> {
> + int i;
> +
> filelock_cache = kmem_cache_create("file_lock_cache",
> sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
>
> + lg_lock_init(&file_lock_lglock, "file_lock_lglock");
> +
> + for_each_possible_cpu(i)
> + INIT_HLIST_HEAD(per_cpu_ptr(&file_lock_list, i));
> +
> return 0;
> }
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 232a345..18e59b8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -953,6 +953,7 @@ struct file_lock {
> unsigned int fl_flags;
> unsigned char fl_type;
> unsigned int fl_pid;
> + int fl_link_cpu; /* what cpu's list is this on? */
> struct pid *fl_nspid;
> wait_queue_head_t fl_wait;
> struct file *fl_file;
> --
> 1.7.1
>

2013-06-13 19:46:24

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 06/14] locks: don't walk inode->i_flock list in locks_show

On Tue, Jun 11, 2013 at 07:09:00AM -0400, Jeff Layton wrote:
> When we convert over to using the i_lock to protect the i_flock list,
> that will introduce a potential lock inversion problem in locks_show.
> When we want to walk the i_flock list, we'll need to take the i_lock.
>
> Rather than do that, just walk the global blocked_locks list and print
> out any that are blocked on the given lock.

I'm OK with this as obviously /proc/locks shouldn't be the common case,
but it still bugs me a bit that we're suddenly making it something like

O(number of held locks * number of waiters)

where it used to be

O(number of held locks + number of waiters)

I wonder if there's any solution that's just as easy and avoids scanning
the blocked list each time.

--b.

>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/locks.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index e451d18..3fd27f0 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2249,8 +2249,10 @@ static int locks_show(struct seq_file *f, void *v)
>
> lock_get_status(f, fl, *((loff_t *)f->private), "");
>
> - list_for_each_entry(bfl, &fl->fl_block, fl_block)
> - lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> + list_for_each_entry(bfl, &blocked_list, fl_link) {
> + if (bfl->fl_next == fl)
> + lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> + }
>
> return 0;
> }
> --
> 1.7.1
>

2013-06-13 20:27:25

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 06/14] locks: don't walk inode->i_flock list in locks_show

On Thu, 13 Jun 2013 15:45:46 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Tue, Jun 11, 2013 at 07:09:00AM -0400, Jeff Layton wrote:
> > When we convert over to using the i_lock to protect the i_flock list,
> > that will introduce a potential lock inversion problem in locks_show.
> > When we want to walk the i_flock list, we'll need to take the i_lock.
> >
> > Rather than do that, just walk the global blocked_locks list and print
> > out any that are blocked on the given lock.
>
> I'm OK with this as obviously /proc/locks shouldn't be the common case,
> but it still bugs me a bit that we're suddenly making it something like
>
> O(number of held locks * number of waiters)
>
> where it used to be
>
> O(number of held locks + number of waiters)
>
> I wonder if there's any solution that's just as easy and avoids scanning
> the blocked list each time.
>
> --b.
>
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/locks.c | 6 ++++--
> > 1 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/locks.c b/fs/locks.c
> > index e451d18..3fd27f0 100644
> > --- a/fs/locks.c
> > +++ b/fs/locks.c
> > @@ -2249,8 +2249,10 @@ static int locks_show(struct seq_file *f, void *v)
> >
> > lock_get_status(f, fl, *((loff_t *)f->private), "");
> >
> > - list_for_each_entry(bfl, &fl->fl_block, fl_block)
> > - lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> > + list_for_each_entry(bfl, &blocked_list, fl_link) {
> > + if (bfl->fl_next == fl)
> > + lock_get_status(f, bfl, *((loff_t *)f->private), " ->");
> > + }
> >
> > return 0;
> > }
> > --
> > 1.7.1
> >

Yeah, it's ugly, but I don't see a real alternative. We could try to
use RCU for this, but that would slow down the list manipulation just
to optimize a rarely read procfile.

Now that I look though...it occurs to me that we have a problem here
anyway. Only blocked POSIX requests go onto that list currently, so
this misses seeing any blocked flock requests.

The only real solution I can think of is to put flock locks into the
blocked_list/blocked_hash too, or maybe giving them a simple hlist to
sit on.

I'll fix that up in the next iteration. It'll probably make flock()
tests run slower, but such is the cost of preserving this procfile...

--
Jeff Layton <[email protected]>

2013-06-15 11:06:19

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 06/14] locks: don't walk inode->i_flock list in locks_show

On Fri, 14 Jun 2013 07:52:44 -0400
Simo <[email protected]> wrote:

> On 06/13/2013 04:26 PM, Jeff Layton wrote:
> > The only real solution I can think of is to put flock locks into the
> > blocked_list/blocked_hash too, or maybe giving them a simple hlist to
> > sit on.
> >
> > I'll fix that up in the next iteration. It'll probably make flock()
> > tests run slower, but such is the cost of preserving this procfile...
>
> How hard would it be to make the procfile stuff optional ?
> So that those that need performance can decide to not use it ?
> Maybe even something that can be disabled at run time ? Not just compile
> time.
>

(re-adding back the cc lists...)

It'd be tricky, especially if you want to do it at runtime. The
procfile itself is not a problem per se. The real problem is the
tracking you have to do in order to eventually present the procfile. So
a boot-time or compile-time switch might be reasonable, but a runtime
switch probably never will be.

I have a new patchset that I'm testing now though that should address
Bruce's concerns about iterating over that global list. So far, it
seems to be at least as fast as the latest patchset I posted.

It makes the (spin)locking a bit more complex, but hopefully I can
document this well enough that it's not a great concern.

Stay tuned...

--
Jeff Layton <[email protected]>

2013-06-15 15:12:59

by simo

[permalink] [raw]
Subject: Re: [PATCH v2 06/14] locks: don't walk inode->i_flock list in locks_show

On 06/15/2013 07:05 AM, Jeff Layton wrote:
> On Fri, 14 Jun 2013 07:52:44 -0400
> Simo <[email protected]> wrote:
>
>> On 06/13/2013 04:26 PM, Jeff Layton wrote:
>>> The only real solution I can think of is to put flock locks into the
>>> blocked_list/blocked_hash too, or maybe giving them a simple hlist to
>>> sit on.
>>>
>>> I'll fix that up in the next iteration. It'll probably make flock()
>>> tests run slower, but such is the cost of preserving this procfile...
>> How hard would it be to make the procfile stuff optional ?
>> So that those that need performance can decide to not use it ?
>> Maybe even something that can be disabled at run time ? Not just compile
>> time.
>>
> (re-adding back the cc lists...)
>
> It'd be tricky, especially if you want to do it at runtime. The
> procfile itself is not a problem per se. The real problem is the
> tracking you have to do in order to eventually present the procfile. So
> a boot-time or compile-time switch might be reasonable, but a runtime
> switch probably never will be.

Just to be clear, I meant a switch to turn it off at runtime; I
understand very well that it would be way too hard to turn it on at
runtime. But killing the perf problem might be desirable on a system you
cannot just reboot.

> I have a new patchset that I'm testing now though that should address
> Bruce's concerns about iterating over that global list. So far, it
> seems to be at least as fast as the latest patchset I posted.
>
> It makes the (spin)locking a bit more complex, but hopefully I can
> document this well enough that it's not a great concern.
>
> Stay tuned...

Thanks Jeff,
this is very valuable work.

Simo.