by Waiman Long

Subject: Re: [PATCH v4 6/8] locking/qrwlock: make use of acquire/release/relaxed atomics

On 08/03/2015 01:02 PM, Will Deacon wrote:
> The qrwlock implementation is slightly heavy in its use of memory
> barriers, mainly through the use of cmpxchg and _return atomics, which
> imply full barrier semantics.
>
> This patch modifies the qrwlock code to use the more relaxed atomic
> routines so that we can reduce the unnecessary barrier overhead on
> weakly-ordered architectures.
>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> include/asm-generic/qrwlock.h | 13 ++++++-------
> kernel/locking/qrwlock.c | 23 +++++++++++++++--------
> 2 files changed, 21 insertions(+), 15 deletions(-)
>
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index eb673dde8879..54a8e65e18b6 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -68,7 +68,7 @@ static inline int queued_read_trylock(struct qrwlock *lock)
>
> cnts = atomic_read(&lock->cnts);
> if (likely(!(cnts & _QW_WMASK))) {
> - cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
> + cnts = (u32)atomic_add_return_acquire(_QR_BIAS, &lock->cnts);
> if (likely(!(cnts & _QW_WMASK)))
> return 1;
> atomic_sub(_QR_BIAS, &lock->cnts);
> @@ -89,8 +89,8 @@ static inline int queued_write_trylock(struct qrwlock *lock)
> if (unlikely(cnts))
> return 0;
>
> - return likely(atomic_cmpxchg(&lock->cnts,
> - cnts, cnts | _QW_LOCKED) == cnts);
> + return likely(atomic_cmpxchg_acquire(&lock->cnts,
> + cnts, cnts | _QW_LOCKED) == cnts);
> }
> /**
> * queued_read_lock - acquire read lock of a queue rwlock
> @@ -100,7 +100,7 @@ static inline void queued_read_lock(struct qrwlock *lock)
> {
> u32 cnts;
>
> - cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
> + cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts);
> if (likely(!(cnts & _QW_WMASK)))
> return;
>
> @@ -115,7 +115,7 @@ static inline void queued_read_lock(struct qrwlock *lock)
> static inline void queued_write_lock(struct qrwlock *lock)
> {
> /* Optimize for the unfair lock case where the fair flag is 0. */
> - if (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0)
> + if (atomic_cmpxchg_acquire(&lock->cnts, 0, _QW_LOCKED) == 0)
> return;
>
> queued_write_lock_slowpath(lock);
> @@ -130,8 +130,7 @@ static inline void queued_read_unlock(struct qrwlock *lock)
> /*
> * Atomically decrement the reader count
> */
> - smp_mb__before_atomic();
> - atomic_sub(_QR_BIAS, &lock->cnts);
> + (void)atomic_sub_return_release(_QR_BIAS, &lock->cnts);
> }
>
> /**
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index d9c36c5f5711..fb4ef2d636f2 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -55,7 +55,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
> {
> while ((cnts & _QW_WMASK) == _QW_LOCKED) {
> cpu_relax_lowlatency();
> - cnts = smp_load_acquire((u32 *)&lock->cnts);
> + cnts = atomic_read_acquire(&lock->cnts);
> }
> }
>
> @@ -74,8 +74,9 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> * Readers in interrupt context will get the lock immediately
> * if the writer is just waiting (not holding the lock yet).
> * The rspin_until_writer_unlock() function returns immediately
> - * in this case. Otherwise, they will spin until the lock
> - * is available without waiting in the queue.
> + * in this case. Otherwise, they will spin (with ACQUIRE
> + * semantics) until the lock is available without waiting in
> + * the queue.
> */
> rspin_until_writer_unlock(lock, cnts);
> return;
> @@ -97,7 +98,13 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> while (atomic_read(&lock->cnts) & _QW_WMASK)
> cpu_relax_lowlatency();
>
> - cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> + cnts = atomic_add_return_relaxed(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> +
> + /*
> + * The ACQUIRE semantics of the spinning code ensure that
> + * accesses can't leak upwards out of our subsequent critical
> + * section.
> + */

Maybe the comment should be more specific and mention the
arch_spin_lock() call above. Other than that,

Reviewed-by: Waiman Long <[email protected]>
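
As a rough illustration of the ordering the patch description asks for, here is a minimal userspace sketch using C11 `<stdatomic.h>` rather than the kernel's atomic API. The `toy_*` names and the `QW_LOCKED`/`QW_WMASK`/`QR_BIAS` values are assumptions made up for this example and are not taken from the kernel sources. The idea is only that lock-side operations need ACQUIRE and unlock-side operations need RELEASE, instead of full barriers on both sides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constants for illustration only. */
#define QW_LOCKED 0xffu   /* a writer holds the lock */
#define QW_WMASK  0xffu   /* writer byte of the counter */
#define QR_BIAS   0x100u  /* increment for one reader */

struct toy_rwlock {
	atomic_uint cnts;
};

/* Try to take the lock for reading; ACQUIRE on the successful path. */
static bool toy_read_trylock(struct toy_rwlock *l)
{
	unsigned cnts = atomic_load_explicit(&l->cnts, memory_order_relaxed);

	if (!(cnts & QW_WMASK)) {
		/* fetch_add returns the old value, so add the bias back. */
		cnts = atomic_fetch_add_explicit(&l->cnts, QR_BIAS,
						 memory_order_acquire) + QR_BIAS;
		if (!(cnts & QW_WMASK))
			return true;
		/* A writer got in first: undo our increment, no ordering needed. */
		atomic_fetch_sub_explicit(&l->cnts, QR_BIAS, memory_order_relaxed);
	}
	return false;
}

/* Try to take the lock for writing; only succeeds if nobody holds it. */
static bool toy_write_trylock(struct toy_rwlock *l)
{
	unsigned expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->cnts, &expected,
						       QW_LOCKED,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Drop a read lock; RELEASE keeps the critical section before the decrement. */
static void toy_read_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub_explicit(&l->cnts, QR_BIAS, memory_order_release);
}

/* Drop the write lock with RELEASE semantics. */
static void toy_write_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub_explicit(&l->cnts, QW_LOCKED, memory_order_release);
}
```

On a strongly-ordered machine this probably compiles to much the same code as the fully-ordered version; the saving is expected on weakly-ordered architectures.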

2015-08-04 11:20:32

by Will Deacon

Subject: Re: [PATCH v4 6/8] locking/qrwlock: make use of acquire/release/relaxed atomics

Hi Waiman,

Thanks for having a look.

On Mon, Aug 03, 2015 at 09:49:26PM +0100, Waiman Long wrote:
> On 08/03/2015 01:02 PM, Will Deacon wrote:
> > The qrwlock implementation is slightly heavy in its use of memory
> > barriers, mainly through the use of cmpxchg and _return atomics, which
> > imply full barrier semantics.
> >
> > This patch modifies the qrwlock code to use the more relaxed atomic
> > routines so that we can reduce the unnecessary barrier overhead on
> > weakly-ordered architectures.
> >
> > Signed-off-by: Will Deacon <[email protected]>

[...]

> > @@ -74,8 +74,9 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> > * Readers in interrupt context will get the lock immediately
> > * if the writer is just waiting (not holding the lock yet).
> > * The rspin_until_writer_unlock() function returns immediately
> > - * in this case. Otherwise, they will spin until the lock
> > - * is available without waiting in the queue.
> > + * in this case. Otherwise, they will spin (with ACQUIRE
> > + * semantics) until the lock is available without waiting in
> > + * the queue.
> > */
> > rspin_until_writer_unlock(lock, cnts);
> > return;
> > @@ -97,7 +98,13 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> > while (atomic_read(&lock->cnts) & _QW_WMASK)
> > cpu_relax_lowlatency();
> >
> > - cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> > + cnts = atomic_add_return_relaxed(_QR_BIAS, &lock->cnts) - _QR_BIAS;
> > +
> > + /*
> > + * The ACQUIRE semantics of the spinning code ensure that
> > + * accesses can't leak upwards out of our subsequent critical
> > + * section.
> > + */
>
> Maybe the comment should be more specific and mention the
> arch_spin_lock() call above. Other than that,

Actually, I think you've uncovered a bug! Initially, I based this on top
of my qrwlock series that made the acquire unconditional in
rspin_until_writer_unlock, but you (reasonably) objected to the extra
overhead on the interrupt path, so now we only get an acquire if the
initial test of ((cnts & _QW_WMASK) == _QW_LOCKED) succeeds.

So actually, the atomic_add_return_relaxed in the slowpath needs to be
an atomic_add_return_acquire. I'll make that change and adjust the
comment accordingly.
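
To make the problem concrete, here is a small userspace sketch in C11 atomics (not the kernel API). The `toy_*` names, the constants and the `acquire_loads` counter are hypothetical and exist only for illustration. It shows that when no writer holds the lock, the spin loop body never runs, so the only ordering we get is whatever the add itself provides.

```c
#include <stdatomic.h>

/* Hypothetical constants for illustration only. */
#define QW_LOCKED 0xffu   /* a writer holds the lock */
#define QW_WMASK  0xffu   /* writer byte of the counter */
#define QR_BIAS   0x100u  /* increment for one reader */

/*
 * Spin while a writer holds the lock. The ACQUIRE load lives inside the
 * loop, so it is only performed if the initial test succeeds.
 * acquire_loads counts how often that happens (instrumentation only).
 */
static unsigned toy_rspin_until_writer_unlock(atomic_uint *cnts_p,
					       unsigned cnts,
					       unsigned *acquire_loads)
{
	while ((cnts & QW_WMASK) == QW_LOCKED) {
		cnts = atomic_load_explicit(cnts_p, memory_order_acquire);
		(*acquire_loads)++;
	}
	return cnts;
}

/*
 * Reader at the head of the queue: bump the reader count, then wait for
 * any writer to go away. The add carries ACQUIRE itself, because the
 * loop above may not execute a single acquire load.
 */
static unsigned toy_read_lock_queued(atomic_uint *cnts_p,
				     unsigned *acquire_loads)
{
	/* fetch_add returns the value from before the increment. */
	unsigned cnts = atomic_fetch_add_explicit(cnts_p, QR_BIAS,
						  memory_order_acquire);

	return toy_rspin_until_writer_unlock(cnts_p, cnts, acquire_loads);
}
```

Calling `toy_read_lock_queued()` on a counter that starts at zero leaves `acquire_loads` at zero, which is the case a relaxed add would leave unordered.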

Fixup below.

Cheers,

Will

--->8

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index fb4ef2d636f2..1724eac4c84b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -98,13 +98,12 @@ void queued_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
while (atomic_read(&lock->cnts) & _QW_WMASK)
cpu_relax_lowlatency();

- cnts = atomic_add_return_relaxed(_QR_BIAS, &lock->cnts) - _QR_BIAS;
-
/*
- * The ACQUIRE semantics of the spinning code ensure that
- * accesses can't leak upwards out of our subsequent critical
- * section.
+ * The ACQUIRE semantics of the following spinning code ensure
+ * that accesses can't leak upwards out of our subsequent critical
+ * section in the case that the lock is currently held for write.
*/
+ cnts = atomic_add_return_acquire(_QR_BIAS, &lock->cnts) - _QR_BIAS;
rspin_until_writer_unlock(lock, cnts);

/*