Change log
----------
v1->v2:
- Rename the new seqlock primitives to read_seqexcl_lock* and
read_seqexcl_unlock*.
- Clarify in the commit log and comments about the exclusive nature
of the read lock.
Waiman Long (2):
seqlock: Add a new locking reader type
dcache: get/release read lock in read_seqbegin_or_lock() & friend
fs/dcache.c | 31 +++++++++++----------
include/linux/seqlock.h | 68 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 79 insertions(+), 20 deletions(-)
The sequence lock (seqlock) was originally designed for cases where
the readers do not need to block the writers; instead, the readers
retry the read operation if the data change underneath them.
Since then, the use cases have expanded to include situations where
a thread does not need to change the data (effectively a reader)
at all but has to take the writer lock because it can't tolerate
changes to the protected structure. Some examples are the d_path()
function and the getcwd() syscall in fs/dcache.c, where the functions
take the writer lock on rename_lock even though they don't need
to change anything in the protected data structure at all. This is
inefficient, as such a reader blocks the other sequence readers from
moving forward by pretending to be a writer.
This patch eliminates this inefficiency by introducing a new type
of locking reader to the seqlock mechanism. This new locking reader
takes an exclusive lock, preventing writers and other locking readers
from going forward. However, it does not affect the progress of the
sequence readers, as the sequence number is not changed.
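For illustration only (this snippet is not part of the patch and the
lock name is just an example), a pure reader such as d_path() today
has to pretend to be a writer roughly like this:

        /*
         * Illustrative sketch: a read-only walk that cannot tolerate a
         * concurrent rename takes the write side of rename_lock today.
         * Both the lock and the unlock bump the sequence count, forcing
         * unrelated read_seqbegin()/read_seqretry() readers to retry.
         */
        write_seqlock(&rename_lock);
        /* ... walk the dentry tree without modifying it ... */
        write_sequnlock(&rename_lock);

With the new locking reader, the same critical section still blocks
writers (and other locking readers) but leaves the sequence count
untouched, so concurrent sequence readers keep making progress.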
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/seqlock.h | 68 +++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 63 insertions(+), 5 deletions(-)
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index 1829905..9bd84b5 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -3,15 +3,21 @@
/*
* Reader/writer consistent mechanism without starving writers. This type of
* lock for data where the reader wants a consistent set of information
- * and is willing to retry if the information changes. Readers never
- * block but they may have to retry if a writer is in
- * progress. Writers do not wait for readers.
+ * and is willing to retry if the information changes. There are two types
+ * of readers:
+ * 1. Sequence readers which never block a writer but they may have to retry
+ * if a writer is in progress by detecting change in sequence number.
+ * Writers do not wait for a sequence reader.
+ * 2. Locking readers which will wait if a writer or another locking reader
+ * is in progress. A locking reader in progress will also block a writer
+ * from going forward. Unlike the regular rwlock, the read lock here is
+ * exclusive so that only one locking reader can get it.
*
- * This is not as cache friendly as brlock. Also, this will not work
+ * This is not as cache friendly as brlock. Also, this may not work well
* for data that contains pointers, because any writer could
* invalidate a pointer that a reader was following.
*
- * Expected reader usage:
+ * Expected non-blocking reader usage:
* do {
* seq = read_seqbegin(&foo);
* ...
@@ -268,4 +274,56 @@ write_sequnlock_irqrestore(seqlock_t *sl, unsigned long flags)
spin_unlock_irqrestore(&sl->lock, flags);
}
+/*
+ * A locking reader exclusively locks out other writers and locking readers,
+ * but doesn't update the sequence number. Acts like a normal spin_lock/unlock.
+ * Don't need preempt_disable() because that is in the spin_lock already.
+ */
+static inline void read_seqexcl_lock(seqlock_t *sl)
+{
+ spin_lock(&sl->lock);
+}
+
+static inline void read_seqexcl_unlock(seqlock_t *sl)
+{
+ spin_unlock(&sl->lock);
+}
+
+static inline void read_seqexcl_lock_bh(seqlock_t *sl)
+{
+ spin_lock_bh(&sl->lock);
+}
+
+static inline void read_seqexcl_unlock_bh(seqlock_t *sl)
+{
+ spin_unlock_bh(&sl->lock);
+}
+
+static inline void read_seqexcl_lock_irq(seqlock_t *sl)
+{
+ spin_lock_irq(&sl->lock);
+}
+
+static inline void read_seqexcl_unlock_irq(seqlock_t *sl)
+{
+ spin_unlock_irq(&sl->lock);
+}
+
+static inline unsigned long __read_seqexcl_lock_irqsave(seqlock_t *sl)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&sl->lock, flags);
+ return flags;
+}
+
+#define read_seqexcl_lock_irqsave(lock, flags) \
+ do { flags = __read_seqexcl_lock_irqsave(lock); } while (0)
+
+static inline void
+read_seqexcl_unlock_irqrestore(seqlock_t *sl, unsigned long flags)
+{
+ spin_unlock_irqrestore(&sl->lock, flags);
+}
+
#endif /* __LINUX_SEQLOCK_H */
--
1.7.1
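As a usage sketch (illustrative only; the lock name is an example and
not taken from the patch), the irqsave variant follows the familiar
spin_lock_irqsave() pattern, with the flags assigned through the macro:

        unsigned long flags;

        read_seqexcl_lock_irqsave(&foo_lock, flags);
        /*
         * Read the protected data. Writers and other locking readers
         * are blocked for the duration; sequence readers are not.
         */
        read_seqexcl_unlock_irqrestore(&foo_lock, flags);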
This patch modifies read_seqbegin_or_lock() and done_seqretry() to use
the newly introduced read_seqexcl_lock() and read_seqexcl_unlock()
primitives so that they won't change the sequence number even if
they fall back to take the lock. This is OK as no change to the
protected data structure is being made. This prevents one fallback to
lock taking from cascading into a series of further lock takings (with
the attendant loss of performance) caused by the sequence number
change. It will also allow
other sequence readers to go forward while an exclusive reader lock
is taken.
This patch also updates some of the inaccurate comments in the code.
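For reference, the caller pattern in fs/dcache.c (a simplified sketch;
the RCU handling and buffer management of the real prepend_path() are
omitted) looks like this:

        int seq = 0;            /* even: start with a lockless pass */

restart:
        read_seqbegin_or_lock(&rename_lock, &seq);
        /* ... lockless walk of the protected structure ... */
        if (need_seqretry(&rename_lock, seq)) {
                seq = 1;        /* odd: retry with the exclusive read lock */
                goto restart;
        }
        done_seqretry(&rename_lock, seq);

With this patch, the locked retry no longer bumps the sequence number,
so one unlucky walk cannot force every concurrent walker into the
locked slow path.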
Signed-off-by: Waiman Long <[email protected]>
---
fs/dcache.c | 31 ++++++++++++++++---------------
1 files changed, 16 insertions(+), 15 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 4d9df3c..9e88367 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -90,8 +90,8 @@ static struct kmem_cache *dentry_cache __read_mostly;
/**
* read_seqbegin_or_lock - begin a sequence number check or locking block
- * lock: sequence lock
- * seq : sequence number to be checked
+ * @lock: sequence lock
+ * @seq : sequence number to be checked
*
* First try it once optimistically without taking the lock. If that fails,
* take the lock. The sequence number is also used as a marker for deciding
@@ -103,7 +103,7 @@ static inline void read_seqbegin_or_lock(seqlock_t *lock, int *seq)
if (!(*seq & 1)) /* Even */
*seq = read_seqbegin(lock);
else /* Odd */
- write_seqlock(lock);
+ read_seqexcl_lock(lock);
}
static inline int need_seqretry(seqlock_t *lock, int seq)
@@ -114,7 +114,7 @@ static inline int need_seqretry(seqlock_t *lock, int seq)
static inline void done_seqretry(seqlock_t *lock, int seq)
{
if (seq & 1)
- write_sequnlock(lock);
+ read_seqexcl_unlock(lock);
}
/*
@@ -2673,9 +2673,9 @@ static int prepend(char **buffer, int *buflen, const char *str, int namelen)
/**
* prepend_name - prepend a pathname in front of current buffer pointer
- * buffer: buffer pointer
- * buflen: allocated length of the buffer
- * name: name string and length qstr structure
+ * @buffer: buffer pointer
+ * @buflen: allocated length of the buffer
+ * @name: name string and length qstr structure
*
* With RCU path tracing, it may race with d_move(). Use ACCESS_ONCE() to
* make sure that either the old or the new name pointer and length are
@@ -2713,14 +2713,15 @@ static int prepend_name(char **buffer, int *buflen, struct qstr *name)
* @buffer: pointer to the end of the buffer
* @buflen: pointer to buffer length
*
- * The function tries to write out the pathname without taking any lock other
- * than the RCU read lock to make sure that dentries won't go away. It only
- * checks the sequence number of the global rename_lock as any change in the
- * dentry's d_seq will be preceded by changes in the rename_lock sequence
- * number. If the sequence number had been change, it will restart the whole
- * pathname back-tracing sequence again. It performs a total of 3 trials of
- * lockless back-tracing sequences before falling back to take the
- * rename_lock.
+ * The function will first try to write out the pathname without taking any
+ * lock other than the RCU read lock to make sure that dentries won't go away.
+ * It only checks the sequence number of the global rename_lock as any change
+ * in the dentry's d_seq will be preceded by changes in the rename_lock
+ * sequence number. If the sequence number had been changed, it will restart
+ * the whole pathname back-tracing sequence again by taking the rename_lock.
+ * In this case, there is no need to take the RCU read lock as the recursive
+ * parent pointer references will keep the dentry chain alive as long as no
+ * rename operation is performed.
*/
static int prepend_path(const struct path *path,
const struct path *root,
--
1.7.1
On Thu, Sep 12, 2013 at 7:55 AM, Waiman Long <[email protected]> wrote:
> Change log
> ----------
> v1->v2:
> - Rename the new seqlock primitives to read_seqexcl_lock* and
> read_seqexcl_unlock*.
Applied. Except I peed in the snow and renamed the functions
again. That whole "seqexcl" looked too odd to me. It not only looks a
bit too much like random noise, but it makes it seem a whole different
lock from the "seqlock" thing.
I wanted to pattern the name after "write_seq[un]lock()", since it
most resembles that (not just in implementation, but in usage: the
traditional read-sequence isn't a lock, it's a begin/retry sequence,
so the usage pattern is totally different too, and the naming is
different).
I ended up picking "read_seq[un]lock_excl()". I could have gone with
"excl_" as a prefix too, I guess. Whatever. Now the "_excl" thing
looks a bit like the "_bh"/"_irqX" context modifier, and I think it
matches our normal lock naming pattern better.
Linus
On Thu, Sep 12, 2013 at 9:38 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Sep 12, 2013 at 7:55 AM, Waiman Long <[email protected]> wrote:
>> Change log
>> ----------
>> v1->v2:
>> - Rename the new seqlock primitives to read_seqexcl_lock* and
>> read_seqexcl_unlock*.
>
> Applied.
Btw, when I tried to benchmark this, I failed miserably.
Why?
If you do a threaded benchmark of "getcwd()", you end up spending all
your time in a spinlock anyway: get_fs_root_and_pwd() takes the
fs->lock to get the root/pwd.
Now, AIM7 probably uses processes, not threads, so you don't see this,
and maybe I shouldn't care. But looking at it, it annoys me
enormously, because the whole get_fs_root_and_pwd() is just stupid.
Putting it all under the RCU lock and then changing it to use
get_fs_root_and_pwd_rcu() that just uses the fs->seq sequence
read-lock looks absolutely trivial.
Linus
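What Linus describes would look roughly like the sketch below (the
helper name follows his mail; the exact fs_struct layout, with an
fs->seq seqcount protecting fs->root and fs->pwd, is an assumption
here, not a quote of the committed code):

        static inline void get_fs_root_and_pwd_rcu(struct fs_struct *fs,
                                                    struct path *root,
                                                    struct path *pwd)
        {
                unsigned seq;

                /* Retry the snapshot if a writer changed root/pwd meanwhile. */
                do {
                        seq = read_seqcount_begin(&fs->seq);
                        *root = fs->root;
                        *pwd = fs->pwd;
                } while (read_seqcount_retry(&fs->seq, seq));
        }

The caller would hold rcu_read_lock() across this and the subsequent
path walk, so no spinlock and no reference counts are needed.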
On 09/12/2013 12:38 PM, Linus Torvalds wrote:
> On Thu, Sep 12, 2013 at 7:55 AM, Waiman Long<[email protected]> wrote:
>> Change log
>> ----------
>> v1->v2:
>> - Rename the new seqlock primitives to read_seqexcl_lock* and
>> read_seqexcl_unlock*.
> Applied. Except I peed in the snow and renamed the functions
> again. That whole "seqexcl" looked too odd to me. It not only looks a
> bit too much like random noise, but it makes it seem a whole different
> lock from the "seqlock" thing.
>
> I wanted to pattern the name after "write_seq[un]lock()", since it
> most resembles that (not just in implementation, but in usage: the
> traditional read-sequence isn't a lock, it's a begin/retry sequence,
> so the usage pattern is totally different too, and the naming is
> different).
>
> I ended up picking "read_seq[un]lock_excl()". I could have gone with
> "excl_" as a prefix too, I guess. Whatever. Now the "_excl" thing
> looks a bit like the "_bh"/"_irqX" context modifier, and I think it
> matches our normal lock naming pattern better.
>
> Linus
I think your new names are better than mine. I am not good at naming
stuff. Thanks for the merge and the rename.
-Longman
On 09/12/2013 01:30 PM, Linus Torvalds wrote:
> On Thu, Sep 12, 2013 at 9:38 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Sep 12, 2013 at 7:55 AM, Waiman Long<[email protected]> wrote:
>>> Change log
>>> ----------
>>> v1->v2:
>>> - Rename the new seqlock primitives to read_seqexcl_lock* and
>>> read_seqexcl_unlock*.
>> Applied.
> Btw, when I tried to benchmark this, I failed miserably.
>
> Why?
This patch is just a safety guard to prevent occasional bad performance
caused by unlucky timing. It will not improve performance in most cases
because the seqbegin/seqretry sequence usually succeeds without an
actual retry.
> If you do a threaded benchmark of "getcwd()", you end up spending all
> your time in a spinlock anyway: get_fs_root_and_pwd() takes the
> fs->lock to get the root/pwd.
I am aware that there is another spinlock bottleneck in the fs struct
for getcwd().
> Now, AIM7 probably uses processes, not threads, so you don't see this,
> and maybe I shouldn't care. But looking at it, it annoys me
> enormously, because the whole get_fs_root_and_pwd() is just stupid.
AIM7 doesn't make many getcwd() calls, so it is not a real bottleneck
for the benchmark. The lockref patch boosts the short workload
performance. The prepend_path patch was to fix the incorrect perf
record data, as perf makes heavy use of d_path(). The change made to
getcwd() was just a side benefit, but it still has other spinlock
bottlenecks.
> Putting it all under the RCU lock and then changing it to use
> get_fs_root_and_pwd_rcu() that just uses the fs->seq sequence
> read-lock looks absolutely trivial.
Yes, I think we can do something similar for this. I will take a look to
see how it can be fixed.
-Longman
On Thu, Sep 12, 2013 at 12:01 PM, Waiman Long <[email protected]> wrote:
>
> Yes, I think we can do something similar for this. I will take a look to see
> how it can be fixed.
Actually, it was so trivial that I already did it, and only had one
stupid brown-bag moment while doing so.
Now my threaded test-case spends most of its time allocating the
temporary page for the result.
It's pushed out.
Btw, Al, I really want a pull request from you to fix the "returning
to user space with semaphore held" thing. Or I'm going to apply the
patch myself. I'm on the road starting very early tomorrow morning,
and I don't want to have that hanging over me..
Linus
On Thu, Sep 12, 2013 at 12:04:56PM -0700, Linus Torvalds wrote:
> On Thu, Sep 12, 2013 at 12:01 PM, Waiman Long <[email protected]> wrote:
> >
> > Yes, I think we can do something similar for this. I will take a look to see
> > how it can be fixed.
>
> Actually, it was so trivial that I already did it, and only had one
> stupid brown-bag moment while doing so.
>
> Now my threaded test-case spends most of its time allocating the
> temporary page for the result.
>
> It's pushed out.
>
> Btw, Al, I really want a pull request from you to fix the "returning
> to user space with semaphore held" thing. Or I'm going to apply the
> patch myself. I'm on the road starting very early tomorrow morning,
> and I don't want to have that hanging over me..
Well... The pull request contains quite a bit more than that - list_lru
pile, mostly. Please, pull from the usual place -
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for-linus
Shortlog:
Al Viro (1):
... and fold the renamed __vfs_follow_link() into its only caller
Andrew Morton (2):
xfs-convert-buftarg-lru-to-generic-code-fix
xfs-convert-dquot-cache-lru-to-list_lru-fix
Christoph Hellwig (1):
fs: remove vfs_follow_link
Dave Chinner (20):
dcache: convert dentry_stat.nr_unused to per-cpu counters
dentry: move to per-sb LRU locks
dcache: remove dentries from LRU before putting on dispose list
mm: new shrinker API
shrinker: convert superblock shrinkers to new API
list: add a new LRU list type
inode: convert inode lru list to generic lru list code.
dcache: convert to use new lru list infrastructure
list_lru: per-node list infrastructure
list_lru: fix broken LRU_RETRY behaviour
shrinker: add node awareness
fs: convert inode and dentry shrinking to be node aware
xfs: convert buftarg LRU to generic code
xfs: rework buffer dispose list tracking
xfs: convert dquot cache lru to list_lru
xfs: fix dquot isolation hang
fs: convert fs shrinkers to new scan/count API
drivers: convert shrinkers to new count/scan API
shrinker: convert remaining shrinkers to count/scan API
shrinker: Kill old ->shrink API.
Dave Jones (1):
Add missing unlocks to error paths of mountpoint_last.
Glauber Costa (10):
fs: bump inode and dentry counters to long
super: fix calculation of shrinkable objects for small numbers
inode: move inode to a different list inside lock
list_lru: per-node API
list_lru: remove special case function list_lru_dispose_all.
vmscan: per-node deferred work
i915: bail out earlier when shrinker cannot acquire mutex
hugepage: convert huge zero page shrinker to new shrinker API
list_lru: dynamically adjust node arrays
super: fix for destroy lrus
Peng Tao (4):
staging/lustre/ldlm: convert to shrinkers to count/scan API
staging/lustre/obdclass: convert lu_object shrinker to count/scan API
staging/lustre/ptlrpc: convert to new shrinker API
staging/lustre/libcfs: cleanup linux-mem.h
Diffstat:
Documentation/filesystems/porting | 4 +
arch/x86/kvm/mmu.c | 25 ++-
drivers/gpu/drm/i915/i915_dma.c | 4 +-
drivers/gpu/drm/i915/i915_gem.c | 82 ++++--
drivers/gpu/drm/ttm/ttm_page_alloc.c | 44 ++-
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 51 +++--
drivers/md/bcache/btree.c | 43 ++--
drivers/md/bcache/sysfs.c | 2 +-
drivers/md/dm-bufio.c | 64 +++--
drivers/staging/android/ashmem.c | 44 ++-
drivers/staging/android/lowmemorykiller.c | 43 ++--
.../lustre/include/linux/libcfs/linux/linux-mem.h | 38 ---
drivers/staging/lustre/lustre/ldlm/ldlm_pool.c | 148 ++++++-----
drivers/staging/lustre/lustre/obdclass/lu_object.c | 98 ++++---
drivers/staging/lustre/lustre/ptlrpc/sec_bulk.c | 76 +++---
fs/dcache.c | 278 ++++++++++++-------
fs/drop_caches.c | 1 +
fs/ext4/extents_status.c | 33 ++-
fs/gfs2/glock.c | 30 ++-
fs/gfs2/main.c | 3 +-
fs/gfs2/quota.c | 18 +-
fs/gfs2/quota.h | 6 +-
fs/inode.c | 193 ++++++--------
fs/internal.h | 6 +-
fs/mbcache.c | 49 ++--
fs/namei.c | 49 ++--
fs/nfs/dir.c | 16 +-
fs/nfs/internal.h | 6 +-
fs/nfs/super.c | 3 +-
fs/nfsd/nfscache.c | 32 ++-
fs/quota/dquot.c | 34 +--
fs/super.c | 111 +++++---
fs/ubifs/shrinker.c | 29 ++-
fs/ubifs/super.c | 3 +-
fs/ubifs/ubifs.h | 5 +-
fs/xfs/xfs_buf.c | 253 +++++++++--------
fs/xfs/xfs_buf.h | 17 +-
fs/xfs/xfs_dquot.c | 7 +-
fs/xfs/xfs_icache.c | 4 +-
fs/xfs/xfs_icache.h | 2 +-
fs/xfs/xfs_qm.c | 287 ++++++++++---------
fs/xfs/xfs_qm.h | 4 +-
fs/xfs/xfs_super.c | 12 +-
include/linux/dcache.h | 14 +-
include/linux/fs.h | 26 +-
include/linux/list_lru.h | 131 +++++++++
include/linux/shrinker.h | 54 +++-
include/trace/events/vmscan.h | 4 +-
include/uapi/linux/fs.h | 6 +-
kernel/sysctl.c | 6 +-
mm/Makefile | 2 +-
mm/huge_memory.c | 17 +-
mm/list_lru.c | 139 ++++++++++
mm/memory-failure.c | 2 +
mm/vmscan.c | 241 ++++++++++-------
net/sunrpc/auth.c | 41 ++-
56 files changed, 1778 insertions(+), 1162 deletions(-)
create mode 100644 include/linux/list_lru.h
create mode 100644 mm/list_lru.c
On Thu, Sep 12, 2013 at 2:56 PM, Al Viro <[email protected]> wrote:
> On Thu, Sep 12, 2013 at 12:04:56PM -0700, Linus Torvalds wrote:
>>
>> Btw, Al, I really want a pull request from you to fix the "returning
>> to user space with semaphore held" thing. Or I'm going to apply the
>> patch myself. I'm on the road starting very early tomorrow morning,
>> and I don't want to have that hanging over me..
>
> Well... The pull request contains quite a bit more than that - list_lru
> pile, mostly.
Ok, I'm going to really hope this is it for 3.12, because while I will
have my laptop, I will spend Friday traveling, the weekend diving,
before the week of LinuxCon.
So if you were planning a "pile 4", I hope we can leave it for 3.13.
Linus
On Thu, Sep 12, 2013 at 3:00 PM, Linus Torvalds
<[email protected]> wrote:
>
> So if you were planning a "pile 4", I hope we can leave it for 3.13.
.. actually, this would be "pile 4", so it's "pile 5" that I hope not to get..
Linus