2014-06-15 13:11:52

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

From: Waiman Long <[email protected]>

This patch extracts the logic for the exchange of new and previous tail
code words into a new xchg_tail() function which can be optimized in a
later patch.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/asm-generic/qspinlock_types.h | 2 +
kernel/locking/qspinlock.c | 58 +++++++++++++++++++++-------------
2 files changed, 38 insertions(+), 22 deletions(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -61,6 +61,8 @@ typedef struct qspinlock {
#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)

+#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)

/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+ u32 old, new, val = atomic_read(&lock->val);
+
+ for (;;) {
+ new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+ old = atomic_cmpxchg(&lock->val, val, new);
+ if (old == val)
+ break;
+
+ val = old;
+ }
+ return old;
+}
+
+/**
* queue_spin_lock_slowpath - acquire the queue spinlock
* @lock: Pointer to queue spinlock structure
* @val: Current value of the queue spinlock 32-bit word
@@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp
node->next = NULL;

/*
- * we already touched the queueing cacheline; don't bother with pending
- * stuff.
- *
- * trylock || xchg(lock, node)
- *
- * 0,0,0 -> 0,0,1 ; trylock
- * p,y,x -> n,y,x ; prev = xchg(lock, node)
+ * We touched a (possibly) cold cacheline in the per-cpu queue node;
+ * attempt the trylock once more in the hope someone let go while we
+ * weren't watching.
*/
- for (;;) {
- new = _Q_LOCKED_VAL;
- if (val)
- new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
-
- val = old;
- }
+ if (queue_spin_trylock(lock))
+ goto release;

/*
- * we won the trylock; forget about queueing.
+ * we already touched the queueing cacheline; don't bother with pending
+ * stuff.
+ *
+ * p,*,* -> n,*,*
*/
- if (new == _Q_LOCKED_VAL)
- goto release;
+ old = xchg_tail(lock, tail);

/*
* if there was a previous node; link it and wait.
*/
- if (old & ~_Q_LOCKED_PENDING_MASK) {
+ if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
ACCESS_ONCE(prev->next) = node;



2014-06-17 20:56:32

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
> From: Waiman Long <[email protected]>
>
> This patch extracts the logic for the exchange of new and previous tail
> code words into a new xchg_tail() function which can be optimized in a
> later patch.

And also adds a third try on acquiring the lock. That I think should
be a seperate patch.

And instead of saying 'later patch' you should spell out the name
of the patch. Especially as this might not be obvious from somebody
doing git bisection.

>
> Signed-off-by: Waiman Long <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/asm-generic/qspinlock_types.h | 2 +
> kernel/locking/qspinlock.c | 58 +++++++++++++++++++++-------------
> 2 files changed, 38 insertions(+), 22 deletions(-)
>
> --- a/include/asm-generic/qspinlock_types.h
> +++ b/include/asm-generic/qspinlock_types.h
> @@ -61,6 +61,8 @@ typedef struct qspinlock {
> #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
> #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
>
> +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
> +
> #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
> #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
>
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod
> #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
>
> /**
> + * xchg_tail - Put in the new queue tail code word & retrieve previous one
> + * @lock : Pointer to queue spinlock structure
> + * @tail : The new queue tail code word
> + * Return: The previous queue tail code word
> + *
> + * xchg(lock, tail)
> + *
> + * p,*,* -> n,*,* ; prev = xchg(lock, node)
> + */
> +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
> +{
> + u32 old, new, val = atomic_read(&lock->val);
> +
> + for (;;) {
> + new = (val & _Q_LOCKED_PENDING_MASK) | tail;
> + old = atomic_cmpxchg(&lock->val, val, new);
> + if (old == val)
> + break;
> +
> + val = old;
> + }
> + return old;
> +}
> +
> +/**
> * queue_spin_lock_slowpath - acquire the queue spinlock
> * @lock: Pointer to queue spinlock structure
> * @val: Current value of the queue spinlock 32-bit word
> @@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp
> node->next = NULL;
>
> /*
> - * we already touched the queueing cacheline; don't bother with pending
> - * stuff.
> - *
> - * trylock || xchg(lock, node)
> - *
> - * 0,0,0 -> 0,0,1 ; trylock
> - * p,y,x -> n,y,x ; prev = xchg(lock, node)
> + * We touched a (possibly) cold cacheline in the per-cpu queue node;
> + * attempt the trylock once more in the hope someone let go while we
> + * weren't watching.
> */
> - for (;;) {
> - new = _Q_LOCKED_VAL;
> - if (val)
> - new = tail | (val & _Q_LOCKED_PENDING_MASK);
> -
> - old = atomic_cmpxchg(&lock->val, val, new);
> - if (old == val)
> - break;
> -
> - val = old;
> - }
> + if (queue_spin_trylock(lock))
> + goto release;

So now are three of them? One in queue_spin_lock, then at the start
of this function when checking for the pending bit, and the once more
here. And that is because the local cache line might be cold for the
'mcs_index' struct?

That all seems to be a bit of experimental. But then we are already
in the slowpath so we could as well do:

for (i = 0; i < 10; i++)
if (queue_spin_trylock(lock))
goto release;

And would have the same effect.


>
> /*
> - * we won the trylock; forget about queueing.
> + * we already touched the queueing cacheline; don't bother with pending
> + * stuff.

I guess we could also just erase the pending bit if we wanted too. The
optimistic spinning will still hit go to the queue label as lock->val will
have the tail value.

> + *
> + * p,*,* -> n,*,*
> */
> - if (new == _Q_LOCKED_VAL)
> - goto release;
> + old = xchg_tail(lock, tail);
>
> /*
> * if there was a previous node; link it and wait.
> */
> - if (old & ~_Q_LOCKED_PENDING_MASK) {
> + if (old & _Q_TAIL_MASK) {
> prev = decode_tail(old);
> ACCESS_ONCE(prev->next) = node;
>
>
>

2014-06-18 11:38:44

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
> On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
>> From: Waiman Long <[email protected]>
>>
>> This patch extracts the logic for the exchange of new and previous tail
>> code words into a new xchg_tail() function which can be optimized in a
>> later patch.
>
> And also adds a third try on acquiring the lock. That I think should
> be a seperate patch.

It doesn't really add a new try, the old code is:


- for (;;) {
- new = _Q_LOCKED_VAL;
- if (val)
- new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
- old = atomic_cmpxchg(&lock->val, val, new);
- if (old == val)
- break;
-
- val = old;
- }

/*
- * we won the trylock; forget about queueing.
*/
- if (new == _Q_LOCKED_VAL)
- goto release;

The trylock happens if the "if (val)" hits the else branch.

What the patch does is change it from attempting two transition with a
single cmpxchg:

- * 0,0,0 -> 0,0,1 ; trylock
- * p,y,x -> n,y,x ; prev = xchg(lock, node)

to first doing the trylock, then the xchg. If the trylock passes and
the xchg returns prev=0,0,0, the next step of the algorithm goes to the
locked/uncontended state

+ /*
+ * claim the lock:
+ *
+ * n,0 -> 0,1 : lock, uncontended

Similar to your suggestion of patch 3, it's expected that the xchg will
*not* return prev=0,0,0 after a failed trylock.

However, I *do* agree with you that it's simpler to just squash this
patch into 01/11.

Paolo

> And instead of saying 'later patch' you should spell out the name
> of the patch. Especially as this might not be obvious from somebody
> doing git bisection.
>
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>> ---
>> include/asm-generic/qspinlock_types.h | 2 +
>> kernel/locking/qspinlock.c | 58 +++++++++++++++++++++-------------
>> 2 files changed, 38 insertions(+), 22 deletions(-)
>>
>> --- a/include/asm-generic/qspinlock_types.h
>> +++ b/include/asm-generic/qspinlock_types.h
>> @@ -61,6 +61,8 @@ typedef struct qspinlock {
>> #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
>> #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
>>
>> +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
>> +
>> #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
>> #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
>>
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod
>> #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
>>
>> /**
>> + * xchg_tail - Put in the new queue tail code word & retrieve previous one
>> + * @lock : Pointer to queue spinlock structure
>> + * @tail : The new queue tail code word
>> + * Return: The previous queue tail code word
>> + *
>> + * xchg(lock, tail)
>> + *
>> + * p,*,* -> n,*,* ; prev = xchg(lock, node)
>> + */
>> +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
>> +{
>> + u32 old, new, val = atomic_read(&lock->val);
>> +
>> + for (;;) {
>> + new = (val & _Q_LOCKED_PENDING_MASK) | tail;
>> + old = atomic_cmpxchg(&lock->val, val, new);
>> + if (old == val)
>> + break;
>> +
>> + val = old;
>> + }
>> + return old;
>> +}
>> +
>> +/**
>> * queue_spin_lock_slowpath - acquire the queue spinlock
>> * @lock: Pointer to queue spinlock structure
>> * @val: Current value of the queue spinlock 32-bit word
>> @@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp
>> node->next = NULL;
>>
>> /*
>> - * we already touched the queueing cacheline; don't bother with pending
>> - * stuff.
>> - *
>> - * trylock || xchg(lock, node)
>> - *
>> - * 0,0,0 -> 0,0,1 ; trylock
>> - * p,y,x -> n,y,x ; prev = xchg(lock, node)
>> + * We touched a (possibly) cold cacheline in the per-cpu queue node;
>> + * attempt the trylock once more in the hope someone let go while we
>> + * weren't watching.
>> */
>> - for (;;) {
>> - new = _Q_LOCKED_VAL;
>> - if (val)
>> - new = tail | (val & _Q_LOCKED_PENDING_MASK);
>> -
>> - old = atomic_cmpxchg(&lock->val, val, new);
>> - if (old == val)
>> - break;
>> -
>> - val = old;
>> - }
>> + if (queue_spin_trylock(lock))
>> + goto release;
>
> So now are three of them? One in queue_spin_lock, then at the start
> of this function when checking for the pending bit, and the once more
> here. And that is because the local cache line might be cold for the
> 'mcs_index' struct?
>
> That all seems to be a bit of experimental. But then we are already
> in the slowpath so we could as well do:
>
> for (i = 0; i < 10; i++)
> if (queue_spin_trylock(lock))
> goto release;
>
> And would have the same effect.
>
>
>>
>> /*
>> - * we won the trylock; forget about queueing.
>> + * we already touched the queueing cacheline; don't bother with pending
>> + * stuff.
>
> I guess we could also just erase the pending bit if we wanted too. The
> optimistic spinning will still hit go to the queue label as lock->val will
> have the tail value.
>
>> + *
>> + * p,*,* -> n,*,*
>> */
>> - if (new == _Q_LOCKED_VAL)
>> - goto release;
>> + old = xchg_tail(lock, tail);
>>
>> /*
>> * if there was a previous node; link it and wait.
>> */
>> - if (old & ~_Q_LOCKED_PENDING_MASK) {
>> + if (old & _Q_TAIL_MASK) {
>> prev = decode_tail(old);
>> ACCESS_ONCE(prev->next) = node;
>>
>>
>>

2014-06-18 13:52:10

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
> Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
> >On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
> >>From: Waiman Long <[email protected]>
> >>
> >>This patch extracts the logic for the exchange of new and previous tail
> >>code words into a new xchg_tail() function which can be optimized in a
> >>later patch.
> >
> >And also adds a third try on acquiring the lock. That I think should
> >be a seperate patch.
>
> It doesn't really add a new try, the old code is:
>
>
> - for (;;) {
> - new = _Q_LOCKED_VAL;
> - if (val)
> - new = tail | (val & _Q_LOCKED_PENDING_MASK);
> -
> - old = atomic_cmpxchg(&lock->val, val, new);
> - if (old == val)
> - break;
> -
> - val = old;
> - }
>
> /*
> - * we won the trylock; forget about queueing.
> */
> - if (new == _Q_LOCKED_VAL)
> - goto release;
>
> The trylock happens if the "if (val)" hits the else branch.
>
> What the patch does is change it from attempting two transition with a
> single cmpxchg:
>
> - * 0,0,0 -> 0,0,1 ; trylock
> - * p,y,x -> n,y,x ; prev = xchg(lock, node)
>
> to first doing the trylock, then the xchg. If the trylock passes and the
> xchg returns prev=0,0,0, the next step of the algorithm goes to the
> locked/uncontended state
>
> + /*
> + * claim the lock:
> + *
> + * n,0 -> 0,1 : lock, uncontended
>
> Similar to your suggestion of patch 3, it's expected that the xchg will
> *not* return prev=0,0,0 after a failed trylock.

I do like your explanation. I hope that Peter will put it in the
description as it explains the change quite well.

>
> However, I *do* agree with you that it's simpler to just squash this patch
> into 01/11.

Uh, did I say that? Oh I said why don't make it right the first time!

I meant in terms of seperating the slowpath (aka the bytelock on the pending
bit) from the queue (MCS code). Or renaming the function to be called
'complex' instead of 'slowpath' as it is getting quite hairy.

The #1 patch is nice by itself - as it lays out the foundation of the
MCS-similar code - and if Ingo decides he does not want this pending
byte-lock bit business - it can be easily reverted or dropped.

In terms of squashing this in #1 - I would advocate against that.

Thanks!

2014-06-18 15:46:13

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On 06/18/2014 09:50 AM, Konrad Rzeszutek Wilk wrote:
> On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
>> Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto:
>>> On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote:
>>>> From: Waiman Long<[email protected]>
>>>>
>>>> This patch extracts the logic for the exchange of new and previous tail
>>>> code words into a new xchg_tail() function which can be optimized in a
>>>> later patch.
>>> And also adds a third try on acquiring the lock. That I think should
>>> be a seperate patch.
>> It doesn't really add a new try, the old code is:
>>
>>
>> - for (;;) {
>> - new = _Q_LOCKED_VAL;
>> - if (val)
>> - new = tail | (val& _Q_LOCKED_PENDING_MASK);
>> -
>> - old = atomic_cmpxchg(&lock->val, val, new);
>> - if (old == val)
>> - break;
>> -
>> - val = old;
>> - }
>>
>> /*
>> - * we won the trylock; forget about queueing.
>> */
>> - if (new == _Q_LOCKED_VAL)
>> - goto release;
>>
>> The trylock happens if the "if (val)" hits the else branch.
>>
>> What the patch does is change it from attempting two transition with a
>> single cmpxchg:
>>
>> - * 0,0,0 -> 0,0,1 ; trylock
>> - * p,y,x -> n,y,x ; prev = xchg(lock, node)
>>
>> to first doing the trylock, then the xchg. If the trylock passes and the
>> xchg returns prev=0,0,0, the next step of the algorithm goes to the
>> locked/uncontended state
>>
>> + /*
>> + * claim the lock:
>> + *
>> + * n,0 -> 0,1 : lock, uncontended
>>
>> Similar to your suggestion of patch 3, it's expected that the xchg will
>> *not* return prev=0,0,0 after a failed trylock.
> I do like your explanation. I hope that Peter will put it in the
> description as it explains the change quite well.
>
>> However, I *do* agree with you that it's simpler to just squash this patch
>> into 01/11.
> Uh, did I say that? Oh I said why don't make it right the first time!
>
> I meant in terms of seperating the slowpath (aka the bytelock on the pending
> bit) from the queue (MCS code). Or renaming the function to be called
> 'complex' instead of 'slowpath' as it is getting quite hairy.
>
> The #1 patch is nice by itself - as it lays out the foundation of the
> MCS-similar code - and if Ingo decides he does not want this pending
> byte-lock bit business - it can be easily reverted or dropped.

The pending bit code is needed for performance parity with ticket
spinlock for light load. My own measurement indicates that the queuing
overhead will cause the queue spinlock to be slower than ticket spinlock
with 2-4 contending tasks. The pending bit solves the performance
problem with 2 contending tasks, leave only the 3-4 tasks cases being a
bit slower than the ticket spinlock which should be more than
compensated by its superior performance with heavy contention and
slightly better performance with no contention.

-Longman

2014-06-18 15:50:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

Il 18/06/2014 17:46, Waiman Long ha scritto:
>>
>>
>> The #1 patch is nice by itself - as it lays out the foundation of the
>> MCS-similar code - and if Ingo decides he does not want this pending
>> byte-lock bit business - it can be easily reverted or dropped.
>
> The pending bit code is needed for performance parity with ticket
> spinlock for light load. My own measurement indicates that the queuing
> overhead will cause the queue spinlock to be slower than ticket spinlock
> with 2-4 contending tasks. The pending bit solves the performance
> problem with 2 contending tasks, leave only the 3-4 tasks cases being a
> bit slower than the ticket spinlock which should be more than
> compensated by its superior performance with heavy contention and
> slightly better performance with no contention.

Note that this patch is not related to the pending bit, only to the
trylock bit which is already in patch 1. It serializes two
previously-parallel checks for transitions. This is why I thought it
could already belong in patch 1.

Paolo

2014-06-18 18:29:06

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

> >>However, I *do* agree with you that it's simpler to just squash this patch
> >>into 01/11.
> >Uh, did I say that? Oh I said why don't make it right the first time!
> >
> >I meant in terms of seperating the slowpath (aka the bytelock on the pending
> >bit) from the queue (MCS code). Or renaming the function to be called
> >'complex' instead of 'slowpath' as it is getting quite hairy.
> >
> >The #1 patch is nice by itself - as it lays out the foundation of the
> >MCS-similar code - and if Ingo decides he does not want this pending
> >byte-lock bit business - it can be easily reverted or dropped.
>
> The pending bit code is needed for performance parity with ticket spinlock
> for light load. My own measurement indicates that the queuing overhead will
> cause the queue spinlock to be slower than ticket spinlock with 2-4
> contending tasks. The pending bit solves the performance problem with 2

Aha!

> contending tasks, leave only the 3-4 tasks cases being a bit slower than the
> ticket spinlock which should be more than compensated by its superior
> performance with heavy contention and slightly better performance with no
> contention.

That should be mentioned in the commit description as the rationale for
the patch "qspinlock: Add pending bit" and also in the code.

Thank you!

2014-06-24 10:47:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
> However, I *do* agree with you that it's simpler to just squash this patch
> into 01/11.

So I explicitly broke out these optimizations into separate patches so
that we can see them independently and agree they're idempotent wrt the
state machine.

The initial patches by Waiman were totally unreadable, partly because
the optimizations made the code terribly complex.

Luckily waiman then dropped the most horrible optimizations
(optimization for the very large nr_cpus case, were we cannot have a
pending byte), so the end result isn't quite as complex as it used to
be.