2015-04-30 21:13:08

by Waiman Long

Subject: [PATCH v4 0/2] locking/rwsem: optimize rwsem_wakeup()

v3->v4:
 - Break out the active writer check into a separate patch and move
   it from __rwsem_do_wake() to rwsem_wake().
 - Use smp_rmb() instead of the incorrect smp_mb__after_atomic() as
   suggested by PeterZ.

v2->v3:
 - Fix errors in commit log.

v1->v2:
 - Add a memory barrier before calling spin_trylock for proper memory
   ordering.

This patch set aims to reduce spinlock contention on the wait_lock
caused by excessive activity in the rwsem_wake() code path. This, in
turn, reduces up_write()/up_read() latency and improves performance
when the rwsem is heavily contended.
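
For reference, the combined effect of the two patches on the
rwsem_wake() wakeup path looks roughly like the sketch below. This is
an illustrative restatement only (it uses an if/else in place of the
goto in patch 1); the authoritative diffs are in the individual
patches that follow:

	/* illustrative sketch, not the actual diff */
	struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
	{
		unsigned long flags;

		if (rwsem_has_spinner(sem)) {
			/* a spinner will retry the lock, so a missed wakeup is safe */
			smp_rmb();	/* read spinner state before the wait_lock */
			if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
				return sem;	/* wait_lock busy: skip the wakeup */
		} else {
			raw_spin_lock_irqsave(&sem->wait_lock, flags);
		}

		/* patch 2: also skip if a writer has already stolen the lock */
		if (!list_empty(&sem->wait_list) && !rwsem_has_active_writer(sem))
			sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);

		raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
		return sem;
	}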

On an 8-socket Westmere-EX server (80 cores, HT off), running AIM7's
high_systime workload (1000 users) on a vanilla 4.0 kernel produced
the following perf profile for spinlock contention:

  9.23%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
          |--97.39%-- rwsem_wake
          |--0.69%-- try_to_wake_up
          |--0.52%-- release_pages
           --1.40%-- [...]

  1.70%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irq
          |--96.61%-- rwsem_down_write_failed
          |--2.03%-- __schedule
          |--0.50%-- run_timer_softirq
           --0.86%-- [...]

Here the contended rwsems are the mmap_sem (mm_struct) and the
i_mmap_rwsem (address_space) with mostly write locking. With a
patched 4.0 kernel, the perf profile became:

  1.87%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
          |--87.64%-- rwsem_wake
          |--2.80%-- release_pages
          |--2.56%-- try_to_wake_up
          |--1.10%-- __wake_up
          |--1.06%-- pagevec_lru_move_fn
          |--0.93%-- prepare_to_wait_exclusive
          |--0.71%-- free_pid
          |--0.58%-- get_page_from_freelist
          |--0.57%-- add_device_randomness
           --2.04%-- [...]

  0.80%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irq
          |--92.49%-- rwsem_down_write_failed
          |--4.24%-- __schedule
          |--1.37%-- run_timer_softirq
           --1.91%-- [...]

The table below shows the % improvement in throughput (1100-2000
users) in the various AIM7 workloads:

  Workload      % increase in throughput
  --------      ------------------------
  custom                  3.8%
  five-sec                3.5%
  fserver                 4.1%
  high_systime           22.2%
  shared                  2.1%
  short                  10.1%

Waiman Long (2):
  locking/rwsem: reduce spinlock contention in wakeup after
    up_read/up_write
  locking/rwsem: check for active writer before wakeup

 include/linux/osq_lock.h    |    5 +++
 kernel/locking/rwsem-xadd.c |   65 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 68 insertions(+), 2 deletions(-)


2015-04-30 21:13:10

by Waiman Long

Subject: [PATCH v4 1/2] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do
the real wakeup. For a heavily contended rwsem, doing a spin_lock()
on the wait_lock will cause further contention on the heavily
contended rwsem cacheline, resulting in delays in the completion of
the up_read()/up_write() operations.

This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With a spinning writer present,
rwsem_wake() will now try to acquire the wait_lock using a trylock.
If that fails, it will just quit.

Signed-off-by: Waiman Long <[email protected]>
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
---
 include/linux/osq_lock.h    |    5 ++++
 kernel/locking/rwsem-xadd.c |   44 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 3a6490e..703ea5c 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+
 #endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2f7cc40..2bb25e2 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -391,11 +391,24 @@ done:
 	return taken;
 }
 
+/*
+ * Return true if the rwsem has an active spinner
+ */
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+	return osq_is_locked(&sem->osq);
+}
+
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	return false;
 }
+
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+	return false;
+}
 #endif
 
 /*
@@ -478,7 +491,38 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 
+	/*
+	 * If a spinner is present, it is not necessary to do the wakeup.
+	 * Try to do wakeup only if the trylock succeeds to minimize
+	 * spinlock contention which may introduce too much delay in the
+	 * unlock operation.
+	 *
+	 *       spinning writer           up_write/up_read caller
+	 *       ---------------           -----------------------
+	 * [S]   osq_unlock()              [L]   osq
+	 *       MB                        RMB
+	 * [RmW] rwsem_try_write_lock()    [RmW] spin_trylock(wait_lock)
+	 *
+	 * Here, it is important to make sure that there won't be a missed
+	 * wakeup while the rwsem is free and the only spinning writer goes
+	 * to sleep without taking the rwsem. Even when the spinning writer
+	 * is just going to break out of the waiting loop, it will still do
+	 * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
+	 * rwsem_has_spinner() is true, it will guarantee at least one
+	 * trylock attempt on the rwsem later on.
+	 */
+	if (rwsem_has_spinner(sem)) {
+		/*
+		 * The smp_rmb() here is to make sure that the spinner
+		 * state is consulted before reading the wait_lock.
+		 */
+		smp_rmb();
+		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
+			return sem;
+		goto locked;
+	}
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+locked:
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
--
1.7.1

2015-04-30 21:13:14

by Waiman Long

Subject: [PATCH v4 2/2] locking/rwsem: check for active writer before wakeup

On a highly contended rwsem, spinlock contention due to the slow
rwsem_wake() call can be a significant portion of the total CPU
cycles used. With writer lock stealing and writer optimistic
spinning, there is also a chance that the lock may have been stolen
by the time the wait_lock is acquired.

This patch adds a low-cost check after acquiring the wait_lock to
look for an active writer. The presence of an active writer will
abort the wakeup operation.
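
To illustrate, one interleaving that this check catches looks roughly
like this (a sketch of the scenario described above, in the style of
the ordering diagram from patch 1, not actual code):

	/*
	 * up_write/up_read caller      optimistic-spinning writer
	 * -----------------------      --------------------------
	 * releases the rwsem
	 * sees waiters, calls
	 * rwsem_wake()
	 *                              takes the free rwsem and
	 *                              sets sem->owner
	 * acquires the wait_lock
	 * sees sem->owner != NULL,
	 * skips __rwsem_do_wake();
	 * the new owner will do the
	 * wakeup in its own up_write()
	 */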

Signed-off-by: Waiman Long <[email protected]>
---
 kernel/locking/rwsem-xadd.c |   21 +++++++++++++++++++--
 1 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2bb25e2..815f0cc 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -399,6 +399,15 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
 	return osq_is_locked(&sem->osq);
 }
 
+/*
+ * Return true if there is an active writer by checking the owner field which
+ * should be set if there is one.
+ */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+	return READ_ONCE(sem->owner) != NULL;
+}
+
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
@@ -409,6 +418,11 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
 {
 	return false;
 }
+
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+	return false;	/* Assume it has no active writer */
+}
 #endif
 
 /*
@@ -524,8 +538,11 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
 locked:
 
-	/* do nothing if list empty */
-	if (!list_empty(&sem->wait_list))
+	/*
+	 * Do nothing if list empty or the lock has just been stolen by a
+	 * writer after a possibly long wait in getting the wait_lock.
+	 */
+	if (!list_empty(&sem->wait_list) && !rwsem_has_active_writer(sem))
 		sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
 
 	raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
--
1.7.1

2015-04-30 21:21:16

by Jason Low

Subject: Re: [PATCH v4 1/2] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Thu, 2015-04-30 at 17:12 -0400, Waiman Long wrote:
> In up_write()/up_read(), rwsem_wake() will be called whenever it
> detects that some writers/readers are waiting. The rwsem_wake()
> function will take the wait_lock and call __rwsem_do_wake() to do
> the real wakeup. For a heavily contended rwsem, doing a spin_lock()
> on the wait_lock will cause further contention on the heavily
> contended rwsem cacheline, resulting in delays in the completion of
> the up_read()/up_write() operations.
>
> This patch makes taking the wait_lock and calling __rwsem_do_wake()
> optional if at least one spinning writer is present. The spinning
> writer will be able to take the rwsem and call rwsem_wake() later
> when it calls up_write(). With a spinning writer present,
> rwsem_wake() will now try to acquire the wait_lock using a trylock.
> If that fails, it will just quit.
>
> Signed-off-by: Waiman Long <[email protected]>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>

Acked-by: Jason Low <[email protected]>

2015-05-01 10:15:09

by Peter Zijlstra

Subject: Re: [PATCH v4 1/2] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Thu, Apr 30, 2015 at 05:12:16PM -0400, Waiman Long wrote:
> In up_write()/up_read(), rwsem_wake() will be called whenever it
> detects that some writers/readers are waiting. The rwsem_wake()
> function will take the wait_lock and call __rwsem_do_wake() to do
> the real wakeup. For a heavily contended rwsem, doing a spin_lock()
> on the wait_lock will cause further contention on the heavily
> contended rwsem cacheline, resulting in delays in the completion of
> the up_read()/up_write() operations.
>
> This patch makes taking the wait_lock and calling __rwsem_do_wake()
> optional if at least one spinning writer is present. The spinning
> writer will be able to take the rwsem and call rwsem_wake() later
> when it calls up_write(). With a spinning writer present,
> rwsem_wake() will now try to acquire the wait_lock using a trylock.
> If that fails, it will just quit.
>
> Signed-off-by: Waiman Long <[email protected]>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> ---

Thanks!

2015-05-06 11:18:37

by Davidlohr Bueso

Subject: Re: [PATCH v4 1/2] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Thu, 2015-04-30 at 17:12 -0400, Waiman Long wrote:
> In up_write()/up_read(), rwsem_wake() will be called whenever it
> detects that some writers/readers are waiting. The rwsem_wake()
> function will take the wait_lock and call __rwsem_do_wake() to do
> the real wakeup. For a heavily contended rwsem, doing a spin_lock()
> on the wait_lock will cause further contention on the heavily
> contended rwsem cacheline, resulting in delays in the completion of
> the up_read()/up_write() operations.
>
> This patch makes taking the wait_lock and calling __rwsem_do_wake()
> optional if at least one spinning writer is present. The spinning
> writer will be able to take the rwsem and call rwsem_wake() later
> when it calls up_write(). With a spinning writer present,
> rwsem_wake() will now try to acquire the wait_lock using a trylock.
> If that fails, it will just quit.
>
> Signed-off-by: Waiman Long <[email protected]>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>

Reviewed-by: Davidlohr Bueso <[email protected]>

2015-05-06 11:20:30

by Davidlohr Bueso

Subject: Re: [PATCH v4 1/2] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Wed, 2015-05-06 at 04:18 -0700, Davidlohr Bueso wrote:
> Reviewed-by: Davidlohr Bueso <[email protected]>

A nit, but it would be useful if the benchmark/perf numbers were also
in this changelog, for future reference.

Thanks,
Davidlohr

Subject: [tip:locking/core] locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()

Commit-ID: 59aabfc7e959f5f213e4e5cc7567ab4934da2adf
Gitweb: http://git.kernel.org/tip/59aabfc7e959f5f213e4e5cc7567ab4934da2adf
Author: Waiman Long <[email protected]>
AuthorDate: Thu, 30 Apr 2015 17:12:16 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 8 May 2015 12:27:59 +0200

locking/rwsem: Reduce spinlock contention in wakeup after up_read()/up_write()

In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do
the real wakeup. For a heavily contended rwsem, doing a spin_lock()
on the wait_lock will cause further contention on the heavily
contended rwsem cacheline, resulting in delays in the completion of
the up_read()/up_write() operations.

This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With a spinning writer present,
rwsem_wake() will now try to acquire the wait_lock using a trylock.
If that fails, it will just quit.

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Davidlohr Bueso <[email protected]>
Acked-by: Jason Low <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Douglas Hatch <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
 include/linux/osq_lock.h    |  5 +++++
 kernel/locking/rwsem-xadd.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 3a6490e..703ea5c 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
 extern bool osq_lock(struct optimistic_spin_queue *lock);
 extern void osq_unlock(struct optimistic_spin_queue *lock);
 
+static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+	return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+
 #endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3417d01..0f18971 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -409,11 +409,24 @@ done:
 	return taken;
 }
 
+/*
+ * Return true if the rwsem has an active spinner
+ */
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+	return osq_is_locked(&sem->osq);
+}
+
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
 	return false;
 }
+
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+	return false;
+}
 #endif
 
 /*
@@ -496,7 +509,38 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
 {
 	unsigned long flags;
 
+	/*
+	 * If a spinner is present, it is not necessary to do the wakeup.
+	 * Try to do wakeup only if the trylock succeeds to minimize
+	 * spinlock contention which may introduce too much delay in the
+	 * unlock operation.
+	 *
+	 *       spinning writer           up_write/up_read caller
+	 *       ---------------           -----------------------
+	 * [S]   osq_unlock()              [L]   osq
+	 *       MB                        RMB
+	 * [RmW] rwsem_try_write_lock()    [RmW] spin_trylock(wait_lock)
+	 *
+	 * Here, it is important to make sure that there won't be a missed
+	 * wakeup while the rwsem is free and the only spinning writer goes
+	 * to sleep without taking the rwsem. Even when the spinning writer
+	 * is just going to break out of the waiting loop, it will still do
+	 * a trylock in rwsem_down_write_failed() before sleeping. IOW, if
+	 * rwsem_has_spinner() is true, it will guarantee at least one
+	 * trylock attempt on the rwsem later on.
+	 */
+	if (rwsem_has_spinner(sem)) {
+		/*
+		 * The smp_rmb() here is to make sure that the spinner
+		 * state is consulted before reading the wait_lock.
+		 */
+		smp_rmb();
+		if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
+			return sem;
+		goto locked;
+	}
 	raw_spin_lock_irqsave(&sem->wait_lock, flags);
+locked:
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))