2015-04-18 02:04:15

by Waiman Long

[permalink] [raw]
Subject: [PATCH] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

In up_read()/up_write(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do
the real wakeup. This can be a problem especially for up_read()
where many readers can potentially call rwsem_wake() at more or less
the same time even though a single call should be enough. This will
cause contention in the wait_lock cacheline resulting in delay of
the up_read/up_write operations.

This patch makes the wait_lock taking and the call to __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With the presence of a spinning writer,
rwsem_wake() will now try to acquire the lock using trylock. If that
fails, it will just quit.

This patch also checks one more time to see if the rwsem was stolen
just before doing the expensive wakeup operation which will be
unnecessary if the lock was stolen.

On an 8-socket Westmere-EX server (80 cores, HT off), running AIM7's
high_systime workload (1000 users) on a vanilla 4.0 kernel produced
the following perf profile for spinlock contention:

9.23% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--97.39%-- rwsem_wake
|--0.69%-- try_to_wake_up
|--0.52%-- release_pages
--1.40%-- [...]

1.70% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
|--96.61%-- rwsem_down_write_failed
|--2.03%-- __schedule
|--0.50%-- run_timer_softirq
--0.86%-- [...]

With a patched 4.0 kernel, the perf profile became:

1.87% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--87.64%-- rwsem_wake
|--2.80%-- release_pages
|--2.56%-- try_to_wake_up
|--1.10%-- __wake_up
|--1.06%-- pagevec_lru_move_fn
|--0.93%-- prepare_to_wait_exclusive
|--0.71%-- free_pid
|--0.58%-- get_page_from_freelist
|--0.57%-- add_device_randomness
--2.04%-- [...]

0.80% reaim [kernel.kallsyms] [k] _raw_spin_lock_irq
|--92.49%-- rwsem_down_write_failed
|--4.24%-- __schedule
|--1.37%-- run_timer_softirq
--1.91%-- [...]

The table below shows the % improvement in throughput (1100-2000 users)
in the various AIM7's workloads:

Workload % increase in throughput
-------- ------------------------
custom 3.8%
five-sec 3.5%
fserver 4.1%
high_systime 22.2%
shared 2.1%
short 10.1%

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/osq_lock.h | 5 +++
kernel/locking/rwsem-xadd.c | 60 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 64 insertions(+), 1 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 3a6490e..703ea5c 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
extern bool osq_lock(struct optimistic_spin_queue *lock);
extern void osq_unlock(struct optimistic_spin_queue *lock);

+static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+ return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+
#endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2f7cc40..d088045 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -107,6 +107,35 @@ enum rwsem_wake_type {
RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
};

+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * return true if there is an active writer by checking the owner field which
+ * should be set if there is one.
+ */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+ return READ_ONCE(sem->owner) != NULL;
+}
+
+/*
+ * Return true if the rwsem has active spinner
+ */
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+ return osq_is_locked(&sem->osq);
+}
+#else /* CONFIG_RWSEM_SPIN_ON_OWNER */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+ return false; /* Assume it has no active writer */
+}
+
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+ return false;
+}
+#endif /* CONFIG_RWSEM_SPIN_ON_OWNER */
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then:
@@ -125,6 +154,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
struct list_head *next;
long oldcount, woken, loop, adjustment;

+ /*
+ * up_write() cleared the owner field before calling this function.
+ * If that field is now set, a writer must have stolen the lock and
+ * the wakeup operation should be aborted.
+ */
+ if (rwsem_has_active_writer(sem))
+ goto out;
+
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
@@ -478,7 +515,28 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;

- raw_spin_lock_irqsave(&sem->wait_lock, flags);
+ /*
+ * If a spinner is present, it is not necessary to do the wakeup.
+ * Try to do wakeup only if the trylock succeeds to minimize
+ * spinlock contention which may introduce too much delay in the
+ * unlock operation.
+ *
+ * In case the spinning writer is just going to break out of the
+ * waiting loop, it will still do a trylock in
+ * rwsem_down_write_failed() before sleeping.
+ * IOW, if rwsem_has_spinner() is true, it will guarantee at least
+ * one trylock attempt on the rwsem.
+ *
+ * spinning writer
+ * ---------------
+ * [S] osq_unlock()
+ * MB
+ * [RmW] rwsem_try_write_lock()
+ */
+ if (!rwsem_has_spinner(sem))
+ raw_spin_lock_irqsave(&sem->wait_lock, flags);
+ else if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
+ return sem;

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
--
1.7.1


2015-04-18 19:35:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Fri, Apr 17, 2015 at 10:03:18PM -0400, Waiman Long wrote:
> @@ -478,7 +515,28 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
> {
> unsigned long flags;
>
> - raw_spin_lock_irqsave(&sem->wait_lock, flags);
> + /*
> + * If a spinner is present, it is not necessary to do the wakeup.
> + * Try to do wakeup only if the trylock succeeds to minimize
> + * spinlock contention which may introduce too much delay in the
> + * unlock operation.
> + *
> + * In case the spinning writer is just going to break out of the
> + * waiting loop, it will still do a trylock in
> + * rwsem_down_write_failed() before sleeping.
> + * IOW, if rwsem_has_spinner() is true, it will guarantee at least
> + * one trylock attempt on the rwsem.

successful trylock? I think we're having 'issues' on if failed trylocks
(and cmpxchg) imply full barriers.

> + *
> + * spinning writer
> + * ---------------
> + * [S] osq_unlock()
> + * MB
> + * [RmW] rwsem_try_write_lock()
> + */

Ordering comes in pairs, this is incomplete.

> + if (!rwsem_has_spinner(sem))
> + raw_spin_lock_irqsave(&sem->wait_lock, flags);
> + else if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> + return sem;
>
> /* do nothing if list empty */
> if (!list_empty(&sem->wait_list))
> --
> 1.7.1
>

2015-04-20 20:23:55

by Jason Low

[permalink] [raw]
Subject: Re: [PATCH] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Fri, 2015-04-17 at 22:03 -0400, Waiman Long wrote:

> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 3a6490e..703ea5c 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
> extern bool osq_lock(struct optimistic_spin_queue *lock);
> extern void osq_unlock(struct optimistic_spin_queue *lock);
>
> +static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
> +{
> + return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
> +}

Would it be better to separate the addition of osq_is_locked() into its
own patch, since this can be useful for other situations and isn't just
specific to the rwsem optimization.

2015-04-22 20:19:30

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On 04/20/2015 04:23 PM, Jason Low wrote:
> On Fri, 2015-04-17 at 22:03 -0400, Waiman Long wrote:
>
>> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
>> index 3a6490e..703ea5c 100644
>> --- a/include/linux/osq_lock.h
>> +++ b/include/linux/osq_lock.h
>> @@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
>> extern bool osq_lock(struct optimistic_spin_queue *lock);
>> extern void osq_unlock(struct optimistic_spin_queue *lock);
>>
>> +static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
>> +{
>> + return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
>> +}
> Would it be better to separate the addition of osq_is_locked() into its
> own patch, since this can be useful for other situations and isn't just
> specific to the rwsem optimization.
>

I think the osq_lock.h change is too simple and straight forward to
warrant a separate patch.

Cheers,
Longman

2015-04-23 18:42:14

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On 04/18/2015 11:40 AM, Peter Zijlstra wrote:
> On Fri, Apr 17, 2015 at 10:03:18PM -0400, Waiman Long wrote:
>> @@ -478,7 +515,28 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
>> {
>> unsigned long flags;
>>
>> - raw_spin_lock_irqsave(&sem->wait_lock, flags);
>> + /*
>> + * If a spinner is present, it is not necessary to do the wakeup.
>> + * Try to do wakeup only if the trylock succeeds to minimize
>> + * spinlock contention which may introduce too much delay in the
>> + * unlock operation.
>> + *
>> + * In case the spinning writer is just going to break out of the
>> + * waiting loop, it will still do a trylock in
>> + * rwsem_down_write_failed() before sleeping.
>> + * IOW, if rwsem_has_spinner() is true, it will guarantee at least
>> + * one trylock attempt on the rwsem.
> successful trylock? I think we're having 'issues' on if failed trylocks
> (and cmpxchg) imply full barriers.
>
>> + *
>> + * spinning writer
>> + * ---------------
>> + * [S] osq_unlock()
>> + * MB
>> + * [RmW] rwsem_try_write_lock()
>> + */
> Ordering comes in pairs, this is incomplete.

I am sorry that I am a bit sloppy here. I have just sent out an updated
patch to remedy this. I have added a smp_mb__after_atomic() to ensure
proper memory ordering. However, I am not so sure if this primitive or
just a simple smp_rmb() will be more expensive in other non-x86
architectures.

Cheers,
Longman