2015-04-24 17:55:19

by Waiman Long

Subject: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

In up_write()/up_read(), rwsem_wake() is called whenever it detects
that some writers/readers are waiting. The rwsem_wake() function takes
the wait_lock and calls __rwsem_do_wake() to do the real wakeup. For a
heavily contended rwsem, doing a spin_lock() on wait_lock causes further
contention on the already hot rwsem cacheline, delaying the completion
of the up_read()/up_write() operations.
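
For reference, the pre-patch rwsem_wake() being modified below is
essentially the following (condensed from kernel/locking/rwsem-xadd.c
in 4.0):

struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
        unsigned long flags;

        /* Every up_read()/up_write() that sees waiters ends up here, so
         * this spin_lock() hits the contended rwsem cacheline again. */
        raw_spin_lock_irqsave(&sem->wait_lock, flags);

        /* do nothing if list empty */
        if (!list_empty(&sem->wait_list))
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);

        raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

        return sem;
}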

This patch makes taking the wait_lock and calling __rwsem_do_wake()
optional when at least one spinning writer is present. The spinning
writer will be able to take the rwsem and will call rwsem_wake() itself
later, when it calls up_write(). With a spinning writer present,
rwsem_wake() now tries to acquire the wait_lock with a trylock and
simply quits if that fails.

This patch also adds one more check in __rwsem_do_wake() to see if the
rwsem was stolen just before doing the expensive wakeup operation, which
would be wasted work if the lock has indeed been stolen.

On an 8-socket Westmere-EX server (80 cores, HT off), running AIM7's
high_systime workload (1000 users) on a vanilla 4.0 kernel produced
the following perf profile for spinlock contention:

 9.23%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |--97.39%-- rwsem_wake
            |--0.69%-- try_to_wake_up
            |--0.52%-- release_pages
             --1.40%-- [...]

 1.70%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irq
            |--96.61%-- rwsem_down_write_failed
            |--2.03%-- __schedule
            |--0.50%-- run_timer_softirq
             --0.86%-- [...]

Here the contended rwsems are the mmap_sem (mm_struct) and the
i_mmap_rwsem (address_space) with mostly write locking. With a
patched 4.0 kernel, the perf profile became:

 1.87%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |--87.64%-- rwsem_wake
            |--2.80%-- release_pages
            |--2.56%-- try_to_wake_up
            |--1.10%-- __wake_up
            |--1.06%-- pagevec_lru_move_fn
            |--0.93%-- prepare_to_wait_exclusive
            |--0.71%-- free_pid
            |--0.58%-- get_page_from_freelist
            |--0.57%-- add_device_randomness
             --2.04%-- [...]

 0.80%  reaim  [kernel.kallsyms]  [k] _raw_spin_lock_irq
            |--92.49%-- rwsem_down_write_failed
            |--4.24%-- __schedule
            |--1.37%-- run_timer_softirq
             --1.91%-- [...]

The table below shows the % improvement in throughput (1100-2000 users)
in the various AIM7 workloads:

Workload        % increase in throughput
--------        ------------------------
custom                   3.8%
five-sec                 3.5%
fserver                  4.1%
high_systime            22.2%
shared                   2.1%
short                   10.1%

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/osq_lock.h | 5 +++
kernel/locking/rwsem-xadd.c | 72 ++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 76 insertions(+), 1 deletions(-)

v2->v3:
- Fix errors in commit log.

v1->v2:
- Add a memory barrier before calling spin_trylock for proper memory
ordering.

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 3a6490e..703ea5c 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -32,4 +32,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
extern bool osq_lock(struct optimistic_spin_queue *lock);
extern void osq_unlock(struct optimistic_spin_queue *lock);

+static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
+{
+ return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
+}
+
#endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2f7cc40..f52c69a 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -107,6 +107,35 @@ enum rwsem_wake_type {
RWSEM_WAKE_READ_OWNED /* Waker thread holds the read lock */
};

+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * return true if there is an active writer by checking the owner field which
+ * should be set if there is one.
+ */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+ return READ_ONCE(sem->owner) != NULL;
+}
+
+/*
+ * Return true if the rwsem has active spinner
+ */
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+ return osq_is_locked(&sem->osq);
+}
+#else /* CONFIG_RWSEM_SPIN_ON_OWNER */
+static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
+{
+ return false; /* Assume it has no active writer */
+}
+
+static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
+{
+ return false;
+}
+#endif /* CONFIG_RWSEM_SPIN_ON_OWNER */
+
/*
* handle the lock release when processes blocked on it that can now run
* - if we come here from up_xxxx(), then:
@@ -125,6 +154,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
struct list_head *next;
long oldcount, woken, loop, adjustment;

+ /*
+ * up_write() cleared the owner field before calling this function.
+ * If that field is now set, a writer must have stolen the lock and
+ * the wakeup operation should be aborted.
+ */
+ if (rwsem_has_active_writer(sem))
+ goto out;
+
waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
if (wake_type == RWSEM_WAKE_ANY)
@@ -478,7 +515,40 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;

- raw_spin_lock_irqsave(&sem->wait_lock, flags);
+ /*
+ * If a spinner is present, it is not necessary to do the wakeup.
+ * Try to do wakeup only if the trylock succeeds to minimize
+ * spinlock contention which may introduce too much delay in the
+ * unlock operation.
+ *
+ * spinning writer up_write/up_read caller
+ * --------------- -----------------------
+ * [S] osq_unlock() [L] osq
+ * MB MB
+ * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
+ *
+ * Here, it is important to make sure that there won't be a missed
+ * wakeup while the rwsem is free and the only spinning writer goes
+ * to sleep without taking the rwsem. In case the spinning writer is
+ * just going to break out of the waiting loop, it will still do a
+ * trylock in rwsem_down_write_failed() before sleeping. IOW, if
+ * rwsem_has_spinner() is true, it will guarantee at least one
+ * trylock attempt on the rwsem.
+ */
+ if (!rwsem_has_spinner(sem)) {
+ raw_spin_lock_irqsave(&sem->wait_lock, flags);
+ } else {
+ /*
+ * rwsem_has_spinner() is an atomic read while spin_trylock
+ * does not guarantee a full memory barrier. Insert a memory
+ * barrier here to make sure that wait_lock isn't read until
+ * after osq.
+ * Note: smp_rmb__after_atomic() should be used if available.
+ */
+ smp_mb__after_atomic();
+ if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
+ return sem;
+ }

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
--
1.7.1


2015-04-24 20:39:45

by Davidlohr Bueso

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Fri, 2015-04-24 at 13:54 -0400, Waiman Long wrote:
> This patch also checks one more time in __rwsem_do_wake() to see if
> the rwsem was stolen just before doing the expensive wakeup operation
> which will be unnecessary if the lock was stolen.

It strikes me that this should be another patch, as the optimization is
independent of the wait_lock change (comments below).

[...]

> +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER

Could you please reuse the CONFIG_RWSEM_SPIN_ON_OWNER ifdeffery we
already have? Just add these where we define rwsem_spin_on_owner().

[...]

> /*
> * handle the lock release when processes blocked on it that can now run
> * - if we come here from up_xxxx(), then:
> @@ -125,6 +154,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
> struct list_head *next;
> long oldcount, woken, loop, adjustment;
>
> + /*
> + * up_write() cleared the owner field before calling this function.
> + * If that field is now set, a writer must have stolen the lock and
> + * the wakeup operation should be aborted.
> + */
> + if (rwsem_has_active_writer(sem))
> + goto out;

We currently allow small races between rwsem owner and counter checks.
And __rwsem_do_wake() can be called by checking the former -- and lock
stealing is done with the counter as well. Please see below how we back
out of such cases, as it is very much considered when granting the next
reader. So nack to this as is, sorry.
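
For reference, the back-out being referred to is roughly the following
(condensed from the 4.0 __rwsem_do_wake() reader-grant path):

 try_reader_grant:
        oldcount = rwsem_atomic_update(adjustment, sem) - adjustment;
        if (unlikely(oldcount < RWSEM_WAITING_BIAS)) {
                /* A writer stole the lock. Undo our reader grant. */
                if (rwsem_atomic_update(-adjustment, sem) & RWSEM_ACTIVE_MASK)
                        goto out;
                /* Last active locker left. Retry waking readers. */
                goto try_reader_grant;
        }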

Thanks,
Davidlohr

2015-04-27 20:25:17

by Waiman Long

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On 04/24/2015 04:39 PM, Davidlohr Bueso wrote:
> On Fri, 2015-04-24 at 13:54 -0400, Waiman Long wrote:
>> This patch also checks one more time in __rwsem_do_wake() to see if
>> the rwsem was stolen just before doing the expensive wakeup operation
>> which will be unnecessary if the lock was stolen.
> It strikes me that this should be another patch, as the optimization is
> independent of the wake_lock (comments below).
>
> [...]
>

I could separate that into a separate patch.

>> +#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
> Could you please reuse the CONFIG_RWSEM_SPIN_ON_OWNER ifdefiry we
> already have? Just add these where we define rwsem_spin_on_owner().

I can move rwsem_has_spinner() down there, but
rwsem_has_active_writer() has to stay at the top or I will need to add a
forward declaration for it.

> [...]
>
>> /*
>> * handle the lock release when processes blocked on it that can now run
>> * - if we come here from up_xxxx(), then:
>> @@ -125,6 +154,14 @@ __rwsem_do_wake(struct rw_semaphore *sem, enum rwsem_wake_type wake_type)
>> struct list_head *next;
>> long oldcount, woken, loop, adjustment;
>>
>> + /*
>> + * up_write() cleared the owner field before calling this function.
>> + * If that field is now set, a writer must have stolen the lock and
>> + * the wakeup operation should be aborted.
>> + */
>> + if (rwsem_has_active_writer(sem))
>> + goto out;
> We currently allow small races between rwsem owner and counter checks.
> And __rwsem_do_wake() can be called by checking the former -- and lock
> stealing is done with the counter as well. Please see below how we back
> out of such cases, as it is very much considered when granting the next
> reader. So nack to this as is, sorry.

If the first one in the queue is a writer, wake_up_process() may be
called directly, which can be quite expensive if the lock has already
been stolen, as the task will just have to sleep again. Even for a
reader, the counter check is done by means of an atomic instruction,
which can cost hundreds of CPU cycles. The rwsem_has_active_writer()
check, however, adds just a few cycles on top of the rwsem cacheline
read that the following list_entry() call needs anyway. I consider this
a low-cost way to save hundreds of wasted CPU cycles when the lock has
been stolen and the owner field set.

Cheers,
Longman

2015-04-28 16:52:40

by Peter Zijlstra

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Fri, Apr 24, 2015 at 01:54:29PM -0400, Waiman Long wrote:
> +static inline bool rwsem_has_active_writer(struct rw_semaphore *sem)
> +{
> + return READ_ONCE(sem->owner) != NULL;
> +}

> +static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
> +{
> + return osq_is_locked(&sem->osq);
> +}

> + /*
> + * If a spinner is present, it is not necessary to do the wakeup.
> + * Try to do wakeup only if the trylock succeeds to minimize
> + * spinlock contention which may introduce too much delay in the
> + * unlock operation.
> + *
> + * spinning writer up_write/up_read caller
> + * --------------- -----------------------
> + * [S] osq_unlock() [L] osq
> + * MB MB
> + * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
> + *
> + * Here, it is important to make sure that there won't be a missed
> + * wakeup while the rwsem is free and the only spinning writer goes
> + * to sleep without taking the rwsem. In case the spinning writer is
> + * just going to break out of the waiting loop, it will still do a
> + * trylock in rwsem_down_write_failed() before sleeping. IOW, if
> + * rwsem_has_spinner() is true, it will guarantee at least one
> + * trylock attempt on the rwsem.
> + */
> + if (!rwsem_has_spinner(sem)) {
> + raw_spin_lock_irqsave(&sem->wait_lock, flags);
> + } else {
> + /*
> + * rwsem_has_spinner() is an atomic read while spin_trylock
> + * does not guarantee a full memory barrier. Insert a memory
> + * barrier here to make sure that wait_lock isn't read until
> + * after osq.
> + * Note: smp_rmb__after_atomic() should be used if available.
> + */
> + smp_mb__after_atomic();

Sorry, that's wrong. The smp_mb__{before,after}_atomic() barriers are
for all atomic (RmW) ops that do not return a value.

They end up as whatever barrier is required to make real atomic (RmW)
ops (LOCK on x86, LL/SC on RISC, etc.) ordered. And all atomic (RmW) ops
that return a value are already guaranteed to imply full ordering
semantics.

Note, the (RmW) part is important here; atomic_{set,read}() are _NOT_
read-modify-write ops.
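
A minimal illustration of the distinction (illustrative only, not from
the patch; the function and variable names are placeholders, kernel
context with <linux/atomic.h> assumed):

static int barrier_examples(atomic_t *v)
{
        int val;

        atomic_inc(v);                  /* non-value-returning RmW: no ordering implied */
        smp_mb__after_atomic();         /* this is what the __after_atomic form pairs with */

        val = atomic_read(v);           /* a plain load, not an RmW op ... */
        smp_rmb();                      /* ... so a regular barrier is needed if ordering matters */

        if (atomic_dec_and_test(v))     /* value-returning RmW: full ordering is implied */
                return 0;               /* no extra barrier needed here */

        return val;
}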

> + if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> + return sem;
> + }
>

2015-04-28 17:17:52

by Peter Zijlstra

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Fri, Apr 24, 2015 at 01:54:29PM -0400, Waiman Long wrote:
> @@ -478,7 +515,40 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
> {
> unsigned long flags;
>
> - raw_spin_lock_irqsave(&sem->wait_lock, flags);
> + /*
> + * If a spinner is present, it is not necessary to do the wakeup.
> + * Try to do wakeup only if the trylock succeeds to minimize
> + * spinlock contention which may introduce too much delay in the
> + * unlock operation.
> + *
> + * spinning writer up_write/up_read caller
> + * --------------- -----------------------
> + * [S] osq_unlock() [L] osq
> + * MB MB
> + * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
> + *
> + * Here, it is important to make sure that there won't be a missed
> + * wakeup while the rwsem is free and the only spinning writer goes
> + * to sleep without taking the rwsem. In case the spinning writer is
> + * just going to break out of the waiting loop, it will still do a
> + * trylock in rwsem_down_write_failed() before sleeping. IOW, if
> + * rwsem_has_spinner() is true, it will guarantee at least one
> + * trylock attempt on the rwsem.
> + */
> + if (!rwsem_has_spinner(sem)) {
> + raw_spin_lock_irqsave(&sem->wait_lock, flags);
> + } else {
> + /*
> + * rwsem_has_spinner() is an atomic read while spin_trylock
> + * does not guarantee a full memory barrier. Insert a memory
> + * barrier here to make sure that wait_lock isn't read until
> + * after osq.
> + * Note: smp_rmb__after_atomic() should be used if available.
> + */
> + smp_mb__after_atomic();
> + if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> + return sem;
> + }
>
> /* do nothing if list empty */
> if (!list_empty(&sem->wait_list))

To me it makes more sense to reverse these two branches (identical code
wise of course) and put the special case first.

Alternatively we could also do something like the below, which to my
eyes looks a little better still, but I don't care too much.

        if (rwsem_has_spinner(sem)) {
                /*
                 * comment ...
                 */
                smp_rmb();
                if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
                        return sem;
                goto locked;
        }

        raw_spin_lock_irqsave(&sem->wait_lock, flags);
locked:

2015-04-28 17:50:07

by Jason Low

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Tue, 2015-04-28 at 19:17 +0200, Peter Zijlstra wrote:

> To me it makes more sense to reverse these two branches (identical code
> wise of course) and put the special case first.
>
> Alternatively we could also do something like the below, which to my
> eyes looks a little better still, but I don't care too much.
>
> if (rwsem_has_spinner(sem)) {
> /*
> * comment ...
> */
> smp_rmb();
> if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> return sem;
> goto locked;
> }
>
> raw_spin_lock_irqsave(&sem->wait_lock, flags);
> locked:

How about putting this into its own function:

static inline bool __rwsem_wake_acquire_wait_lock(sem)
{
        /*
         *
         * Comments
         *
         */
        if (unlikely(rwsem_has_spinner(sem))) {
                /*
                 * Comments
                 */
                smp_rmb();
                if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
                        return false;
        }

        return true;
}

2015-04-28 17:59:25

by Jason Low

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Tue, 2015-04-28 at 10:50 -0700, Jason Low wrote:
> On Tue, 2015-04-28 at 19:17 +0200, Peter Zijlstra wrote:
>
> > To me it makes more sense to reverse these two branches (identical code
> > wise of course) and put the special case first.
> >
> > Alternatively we could also do something like the below, which to my
> > eyes looks a little better still, but I don't care too much.
> >
> > if (rwsem_has_spinner(sem)) {
> > /*
> > * comment ...
> > */
> > smp_rmb();
> > if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> > return sem;
> > goto locked;
> > }
> >
> > raw_spin_lock_irqsave(&sem->wait_lock, flags);
> > locked:
>
> How about putting this into its own function:
>
> static inline bool __rwsem_wake_acquire_wait_lock(sem)
> {
> /*
> *
> * Comments
> *
> */
> if (unlikely(rwsem_has_spinner(sem))) {
> /*
> * Comments
> */
> smp_rmb();
> if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> return false;
> }
>
> return true;
> }

That is, with the raw_spin_lock_irqsave() too :)

static inline bool __rwsem_wake_acquire_wait_lock(sem)
{
        /*
         *
         * Comments
         *
         */
        if (unlikely(rwsem_has_spinner(sem))) {
                /*
                 * Comments
                 */
                smp_rmb();
                if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
                        return false;

                /* trylock successful */
                return true;
        }

        raw_spin_lock_irqsave(&sem->wait_lock, flags);
        return true;
}
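
For completeness, a fully-typed version of the sketch above might look
like this (hypothetical; it assumes flags is passed by pointer so the
caller keeps the irqsave state):

static inline bool __rwsem_wake_acquire_wait_lock(struct rw_semaphore *sem,
                                                  unsigned long *flags)
{
        if (unlikely(rwsem_has_spinner(sem))) {
                /* Order the osq read before the wait_lock access. */
                smp_rmb();
                /* Spinner present: only do the wakeup if the trylock succeeds. */
                return raw_spin_trylock_irqsave(&sem->wait_lock, *flags);
        }

        raw_spin_lock_irqsave(&sem->wait_lock, *flags);
        return true;
}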

2015-04-28 18:06:11

by Davidlohr Bueso

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Tue, 2015-04-28 at 10:50 -0700, Jason Low wrote:
> On Tue, 2015-04-28 at 19:17 +0200, Peter Zijlstra wrote:
>
> > To me it makes more sense to reverse these two branches (identical code
> > wise of course) and put the special case first.
> >
> > Alternatively we could also do something like the below, which to my
> > eyes looks a little better still, but I don't care too much.
> >
> > if (rwsem_has_spinner(sem)) {
> > /*
> > * comment ...
> > */
> > smp_rmb();
> > if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> > return sem;
> > goto locked;
> > }
> >
> > raw_spin_lock_irqsave(&sem->wait_lock, flags);
> > locked:
>
> How about putting this into its own function:

I'd actually prefer reusing the current function. We have so many
utility functions that we've over-encapsulated the code a bit; it makes
reading _harder_, imho.

Thanks,
Davidlohr

2015-04-28 18:17:31

by Davidlohr Bueso

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Mon, 2015-04-27 at 16:25 -0400, Waiman Long wrote:
> >> + /*
> >> + * up_write() cleared the owner field before calling this function.
> >> + * If that field is now set, a writer must have stolen the lock and
> >> + * the wakeup operation should be aborted.
> >> + */
> >> + if (rwsem_has_active_writer(sem))
> >> + goto out;
> > We currently allow small races between rwsem owner and counter checks.
> > And __rwsem_do_wake() can be called by checking the former -- and lock
> > stealing is done with the counter as well. Please see below how we back
> > out of such cases, as it is very much considered when granting the next
> > reader. So nack to this as is, sorry.
>
> If the first one in the queue is a writer, wake_up_process() may be
> called directly which can be quite expensive if the lock has already
> been stolen as the task will have to sleep again.

But how can this occur? Lock stealing happens in two places:

1) fastpath: only if the counter is 0 -- which, since we are discussing
the code that wakes up waiter(s), obviously cannot occur.

2) With the cmpxchg() in rwsem_try_write_lock(), which is serialized
with the wait_lock, so again this cannot occur.

Which is why this is not considered in __rwsem_do_wake() when waking the
writer first in the queue.

Thanks,
Davidlohr

2015-04-29 19:58:51

by Waiman Long

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On 04/28/2015 02:17 PM, Davidlohr Bueso wrote:
> On Mon, 2015-04-27 at 16:25 -0400, Waiman Long wrote:
>>>> + /*
>>>> + * up_write() cleared the owner field before calling this function.
>>>> + * If that field is now set, a writer must have stolen the lock and
>>>> + * the wakeup operation should be aborted.
>>>> + */
>>>> + if (rwsem_has_active_writer(sem))
>>>> + goto out;
>>> We currently allow small races between rwsem owner and counter checks.
>>> And __rwsem_do_wake() can be called by checking the former -- and lock
>>> stealing is done with the counter as well. Please see below how we back
>>> out of such cases, as it is very much considered when granting the next
>>> reader. So nack to this as is, sorry.
>> If the first one in the queue is a writer, wake_up_process() may be
>> called directly which can be quite expensive if the lock has already
>> been stolen as the task will have to sleep again.
> But how can this occur? Lock stealing takes form in two places:
>
> 1) fastpath: only if the counter is 0 -- which, since we are discussing
> waking up waiter(s) code, obviously cannot occur.
>
> 2) With the cmpxchg() in rwsem_try_write_lock(), which is serialized
> with the wait_lock, so again this cannot occur.
>
> Which is why this is not considered in __rwsem_do_wake() when waking the
> writer fist in the queue.
>
> Thanks,
> Davidlohr
>

A write lock can also be acquired by a spinning writer in
rwsem_try_write_lock_unqueued(), where the wait_lock isn't used. With
multiple down_read()'s, it is possible that the first exiting reader
wakes up a writer that acquires the write lock while the other readers
are still waiting to acquire the wait_lock.

Cheers,
Longman

2015-04-30 14:12:19

by Davidlohr Bueso

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On Wed, 2015-04-29 at 15:58 -0400, Waiman Long wrote:
> A write lock can also be acquired by a spinning writer in
> rwsem_try_write_lock_unqueued() where wait_lock isn't used. With
> multiple down_read's, it is possible that the first exiting reader wakes
> up a writer who acquires the write lock while the other readers are
> waiting for acquiring the wait_lock.

Except that readers that do the wakeup do not call __rwsem_do_wake() if
there is an active writer:

        /* If there are no active locks, wake the front queued process(es).
         *
         * If there are no writers and we are first in the queue,
         * wake our own waiter to join the existing active readers !
         */
        if (count == RWSEM_WAITING_BIAS ||
            (count > RWSEM_WAITING_BIAS &&
             adjustment != -RWSEM_ACTIVE_READ_BIAS))
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);

And for the reader part, your rwsem_has_active_writer() check, while
avoiding the counter atomic update, would break current semantics in
that we still do the next reader grant -- also note the unlikely()
predictor ;) And this is done with the counter, so by using the owner
you would have a race between the cmpxchg and rwsem_set_owner(). Yes,
it's a small window (especially after commit 7a215f89), but it is there
nonetheless and could cause even more bogus wakeups. Again, I simply do
not like mixing these two -- we get away with it in the optimistic
spinning code because we just iterate again, but that is not the case
here.

Ultimately, is this really an issue? Do you have numbers that could
justify such a change? I suspect all the benchmark results you posted in
the patch are from reducing the spinlock contention, not from this.

Thanks,
Davidlohr

2015-04-30 20:25:40

by Waiman Long

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On 04/30/2015 10:12 AM, Davidlohr Bueso wrote:
> On Wed, 2015-04-29 at 15:58 -0400, Waiman Long wrote:
>> A write lock can also be acquired by a spinning writer in
>> rwsem_try_write_lock_unqueued() where wait_lock isn't used. With
>> multiple down_read's, it is possible that the first exiting reader wakes
>> up a writer who acquires the write lock while the other readers are
>> waiting for acquiring the wait_lock.
> Except that readers that do the wakeup do not call __rwsem_do_wake() if
> there is an active writer:
>
>         /* If there are no active locks, wake the front queued process(es).
>          *
>          * If there are no writers and we are first in the queue,
>          * wake our own waiter to join the existing active readers !
>          */
>         if (count == RWSEM_WAITING_BIAS ||
>             (count > RWSEM_WAITING_BIAS &&
>              adjustment != -RWSEM_ACTIVE_READ_BIAS))
>                 sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
>
> And for the reader part, your rwsem_has_active_writer() check, while
> avoiding the counter atomic update, would break current semantics in
> that we still do the next reader grant -- also note the unlikely()
> predictor ;) And this is done with the counter, so by using the owner,
> you would have a race between the cmpxchg and rwsem_set_owner(). Yes,
> its a small window (specially after commit 7a215f89), but there
> nonetheless and could cause even more bogus wakeups. Again, I simply do
> not like mixing these two -- we get away with it with the optimistic
> spinning because we just iterate again, but this is not the case.

I think I know what you are talking about. However, I don't think it is
really a race condition. In up_write():

rwsem_clear_owner(sem);
__up_write(sem);

The owner field is cleared before the lock-holding writer decrements the
count. So there is simply no way a reader can infer from the count that
there is no writer in the rwsem while the owner field is still set,
unless the lock has just been stolen by another writer. In that case,
__rwsem_do_wake() will have to back out of the reader grant anyway. It
is possible that the second writer releases the lock just before
__rwsem_do_wake() is called, perhaps because an interrupt happens. In
that case, the second writer will be waiting for the wait_lock in
rwsem_wake() to do the wakeup duty, so there should be no missed wakeup.

If you are still worried about this, I can put the owner check in the
rwsem_wake() function after taking the wait_lock instead. That should
remove any concern you have.
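
For illustration, that alternative -- checking the owner only after the
wait_lock has been taken -- might look roughly like this in rwsem_wake()
(hypothetical sketch, not the posted patch):

        raw_spin_lock_irqsave(&sem->wait_lock, flags);

        /* A writer may have stolen the rwsem already; skip the wakeup. */
        if (rwsem_has_active_writer(sem))
                goto unlock;

        /* do nothing if list empty */
        if (!list_empty(&sem->wait_list))
                sem = __rwsem_do_wake(sem, RWSEM_WAKE_ANY);
 unlock:
        raw_spin_unlock_irqrestore(&sem->wait_lock, flags);

        return sem;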

> Ultimately, is this really an issue? Do you have numbers that could
> justify such a change? I suspect all the benchmark results you posted in
> the patch are from reducing the spinlock contention, not from this.

This is true. This change makes no statistically significant
contribution (just a tiny bit, perhaps) to improving the benchmark
results.

Cheers,
Longman

2015-04-30 21:34:43

by Waiman Long

Subject: Re: [PATCH v3] locking/rwsem: reduce spinlock contention in wakeup after up_read/up_write

On 04/28/2015 01:17 PM, Peter Zijlstra wrote:
> On Fri, Apr 24, 2015 at 01:54:29PM -0400, Waiman Long wrote:
>> @@ -478,7 +515,40 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
>> {
>> unsigned long flags;
>>
>> - raw_spin_lock_irqsave(&sem->wait_lock, flags);
>> + /*
>> + * If a spinner is present, it is not necessary to do the wakeup.
>> + * Try to do wakeup only if the trylock succeeds to minimize
>> + * spinlock contention which may introduce too much delay in the
>> + * unlock operation.
>> + *
>> + * spinning writer up_write/up_read caller
>> + * --------------- -----------------------
>> + * [S] osq_unlock() [L] osq
>> + * MB MB
>> + * [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
>> + *
>> + * Here, it is important to make sure that there won't be a missed
>> + * wakeup while the rwsem is free and the only spinning writer goes
>> + * to sleep without taking the rwsem. In case the spinning writer is
>> + * just going to break out of the waiting loop, it will still do a
>> + * trylock in rwsem_down_write_failed() before sleeping. IOW, if
>> + * rwsem_has_spinner() is true, it will guarantee at least one
>> + * trylock attempt on the rwsem.
>> + */
>> + if (!rwsem_has_spinner(sem)) {
>> + raw_spin_lock_irqsave(&sem->wait_lock, flags);
>> + } else {
>> + /*
>> + * rwsem_has_spinner() is an atomic read while spin_trylock
>> + * does not guarantee a full memory barrier. Insert a memory
>> + * barrier here to make sure that wait_lock isn't read until
>> + * after osq.
>> + * Note: smp_rmb__after_atomic() should be used if available.
>> + */
>> + smp_mb__after_atomic();
>> + if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
>> + return sem;
>> + }
>>
>> /* do nothing if list empty */
>> if (!list_empty(&sem->wait_list))
> To me it makes more sense to reverse these two branches (identical code
> wise of course) and put the special case first.
>
> Alternatively we could also do something like the below, which to my
> eyes looks a little better still, but I don't care too much.
>
> if (rwsem_has_spinner(sem)) {
> /*
> * comment ...
> */
> smp_rmb();
> if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
> return sem;
> goto locked;
> }
>
> raw_spin_lock_irqsave(&sem->wait_lock, flags);
> locked:

Thanks for the suggestion. I have implemented that in the v4 patch. Also
thanks for correcting my misconception about how to use the
smp_mb__after_atomic() macro.

Cheers,
Longman