2015-12-03 12:46:46

by Peter Zijlstra

Subject: [PATCH 3/4] locking: Introduce smp_cond_acquire()

Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.

This primitive has two benefits:
- it documents control dependencies,
- it's typically cheaper than using smp_load_acquire() in a loop.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/compiler.h   | 17 +++++++++++++++++
kernel/locking/qspinlock.c |  3 +--
kernel/sched/core.c        |  8 +-------
kernel/sched/sched.h       |  2 +-
4 files changed, 20 insertions(+), 10 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -299,6 +299,23 @@ static __always_inline void __write_once
__u.__val; \
})

+/**
+ * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * @cond: boolean expression to wait for
+ *
+ * Equivalent to using smp_load_acquire() on the condition variable but employs
+ * the control dependency of the wait to reduce the barrier on many platforms.
+ *
+ * The control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. ACQUIRE.
+ */
+#define smp_cond_acquire(cond)	do {		\
+	while (!(cond))				\
+		cpu_relax();			\
+	smp_rmb(); /* ctrl + rmb := acquire */	\
+} while (0)
+
#endif /* __KERNEL__ */

#endif /* __ASSEMBLY__ */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -446,8 +446,7 @@ void queued_spin_lock_slowpath(struct qs
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;

- while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
- cpu_relax();
+ smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));

locked:
/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1979,19 +1979,13 @@ try_to_wake_up(struct task_struct *p, un
/*
* If the owning (remote) cpu is still in the middle of schedule() with
* this task as prev, wait until it's done referencing the task.
- */
- while (p->on_cpu)
- cpu_relax();
- /*
- * Combined with the control dependency above, we have an effective
- * smp_load_acquire() without the need for full barriers.
*
* Pairs with the smp_store_release() in finish_lock_switch().
*
* This ensures that tasks getting woken will be fully ordered against
* their previous state and preserve Program Order.
*/
- smp_rmb();
+ smp_cond_acquire(!p->on_cpu);

p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1079,7 +1079,7 @@ static inline void finish_lock_switch(st
* In particular, the load of prev->state in finish_task_switch() must
* happen before this.
*
- * Pairs with the control dependency and rmb in try_to_wake_up().
+ * Pairs with the smp_cond_acquire() in try_to_wake_up().
*/
smp_store_release(&prev->on_cpu, 0);
#endif


2015-12-03 16:37:28

by Will Deacon

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

Hi Peter,

On Thu, Dec 03, 2015 at 01:40:13PM +0100, Peter Zijlstra wrote:
> Introduce smp_cond_acquire() which combines a control dependency and a
> read barrier to form acquire semantics.
>
> This primitive has two benefits:
> - it documents control dependencies,
> - it's typically cheaper than using smp_load_acquire() in a loop.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/compiler.h | 17 +++++++++++++++++
> kernel/locking/qspinlock.c | 3 +--
> kernel/sched/core.c | 8 +-------
> kernel/sched/sched.h | 2 +-
> 4 files changed, 20 insertions(+), 10 deletions(-)
>
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -299,6 +299,23 @@ static __always_inline void __write_once
> __u.__val; \
> })
>
> +/**
> + * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
> + * @cond: boolean expression to wait for
> + *
> + * Equivalent to using smp_load_acquire() on the condition variable but employs
> + * the control dependency of the wait to reduce the barrier on many platforms.
> + *
> + * The control dependency provides a LOAD->STORE order, the additional RMB
> + * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
> + * aka. ACQUIRE.
> + */
> +#define smp_cond_acquire(cond) do { \
> + while (!(cond)) \
> + cpu_relax(); \
> + smp_rmb(); /* ctrl + rmb := acquire */ \
> +} while (0)
> +
> #endif /* __KERNEL__ */
>
> #endif /* __ASSEMBLY__ */
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -446,8 +446,7 @@ void queued_spin_lock_slowpath(struct qs
> if ((val = pv_wait_head_or_lock(lock, node)))
> goto locked;
>
> - while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
> - cpu_relax();
> + smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));

I think we spoke about this before, but what would work really well for
arm64 here is if we could override smp_cond_acquire in such a way that
the atomic_read could be performed explicitly in the macro. That would
allow us to use an LDXR to set the exclusive monitor, which in turn
means we can issue a WFE and get a cheap wakeup when lock->val is
actually modified.

With the current scheme, there's not enough information expressed in the
"cond" parameter to perform this optimisation.

Cheers,

Will

2015-12-03 19:41:51

by Davidlohr Bueso

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Thu, 03 Dec 2015, Peter Zijlstra wrote:

>+/**
>+ * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
>+ * @cond: boolean expression to wait for
>+ *
>+ * Equivalent to using smp_load_acquire() on the condition variable but employs
>+ * the control dependency of the wait to reduce the barrier on many platforms.
>+ *
>+ * The control dependency provides a LOAD->STORE order, the additional RMB
>+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
>+ * aka. ACQUIRE.
>+ */
>+#define smp_cond_acquire(cond) do { \
>+ while (!(cond)) \
>+ cpu_relax(); \
>+ smp_rmb(); /* ctrl + rmb := acquire */ \
>+} while (0)

So this hides the fact that we actually are waiting on the cond, as opposed
to conditional acquiring. Could it be renamed to something like smp_waitcond_acquire()?

Thanks,
Davidlohr

2015-12-03 20:26:39

by Peter Zijlstra

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Thu, Dec 03, 2015 at 04:37:26PM +0000, Will Deacon wrote:
> > +#define smp_cond_acquire(cond) do { \
> > + while (!(cond)) \
> > + cpu_relax(); \
> > + smp_rmb(); /* ctrl + rmb := acquire */ \
> > +} while (0)

> > + smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
>
> I think we spoke about this before, but what would work really well for
> arm64 here is if we could override smp_cond_acquire in such a way that
> the atomic_read could be performed explicitly in the macro. That would
> allow us to use an LDXR to set the exclusive monitor, which in turn
> means we can issue a WFE and get a cheap wakeup when lock->val is
> actually modified.
>
> With the current scheme, there's not enough information expressed in the
> "cond" parameter to perform this optimisation.

Right, but I'm having a hard time constructing something pretty that can
do that. Lambda functions would be lovely, but we don't have those :/

While we can easily pass a pointer to an arbitrary type, we need
an expression to evaluate the result of the pointer load to act as our
condition.

smp_cond_acquire(&lock->val.counter,
		 [](int val){ return !(val & _Q_LOCKED_PENDING_MASK); });

Would be nice, but alas.

The best we can do is hardcode a variable name; maybe something like:

#define smp_cond_acquire(ptr, expr) do {	\
	typeof(*ptr) val;			\
	while ((val = READ_ONCE(*ptr)), !(expr))\
		cpu_relax();			\
	smp_rmb(); /* ctrl + rmb := acquire */	\
} while (0)

Which would let us write:

smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));


Thoughts?

2015-12-03 20:32:10

by Peter Zijlstra

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Thu, Dec 03, 2015 at 11:41:39AM -0800, Davidlohr Bueso wrote:

> >+#define smp_cond_acquire(cond) do { \
> >+ while (!(cond)) \
> >+ cpu_relax(); \
> >+ smp_rmb(); /* ctrl + rmb := acquire */ \
> >+} while (0)
>
> So this hides the fact that we actually are waiting on the cond, as opposed
> to conditional acquiring. Could it be renamed to something like smp_waitcond_acquire()?

Right, I'm conflicted about that. On the one hand you're right; on the
other hand we spin-wait, so the next person will want it called
smp_spin_wait_cond_acquire(). Also, it gets terribly long either way :/

bike-shed away I imagine.

2015-12-03 21:16:58

by Peter Zijlstra

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Thu, Dec 03, 2015 at 09:26:27PM +0100, Peter Zijlstra wrote:
> The best we can do is hardcode a variable name; maybe something like:
>
> #define smp_cond_acquire(ptr, expr) do {	\
> 	typeof(*ptr) val;			\
> 	while ((val = READ_ONCE(*ptr)), !(expr))\
> 		cpu_relax();			\
> 	smp_rmb(); /* ctrl + rmb := acquire */	\
> } while (0)
>
> Which would let us write:
>
> smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));

So I'm thinking that might still be hard to use for you if the LDXR+WFE
has the regular LL/SC constraint of not allowing other loads in.

2015-12-04 14:57:56

by Will Deacon

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Thu, Dec 03, 2015 at 09:26:27PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 04:37:26PM +0000, Will Deacon wrote:
> > > +#define smp_cond_acquire(cond) do { \
> > > + while (!(cond)) \
> > > + cpu_relax(); \
> > > + smp_rmb(); /* ctrl + rmb := acquire */ \
> > > +} while (0)
>
> > > + smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
> >
> > I think we spoke about this before, but what would work really well for
> > arm64 here is if we could override smp_cond_acquire in such a way that
> > the atomic_read could be performed explicitly in the macro. That would
> > allow us to use an LDXR to set the exclusive monitor, which in turn
> > means we can issue a WFE and get a cheap wakeup when lock->val is
> > actually modified.
> >
> > With the current scheme, there's not enough information expressed in the
> > "cond" parameter to perform this optimisation.
>
> Right, but I'm having a hard time constructing something pretty that can
> do that. Lambda functions would be lovely, but we don't have those :/
>
> While we can easily pass a pointer to an arbitrary type, we need
> an expression to evaluate the result of the pointer load to act as our
> condition.
>
> smp_cond_acquire(&lock->val.counter,
> [](int val){ return !(val & _Q_LOCKED_PENDING_MASK); });
>
> Would be nice, but alas.
>
> The best we can do is hardcode a variable name; maybe something like:
>
> #define smp_cond_acquire(ptr, expr) do {	\
> 	typeof(*ptr) val;			\
> 	while ((val = READ_ONCE(*ptr)), !(expr))\
> 		cpu_relax();			\
> 	smp_rmb(); /* ctrl + rmb := acquire */	\
> } while (0)
>
> Which would let us write:
>
> smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));
>
>
> Thoughts?

That would certainly work for me, but I appreciate it's not pretty. We
could have an extra macro parameter for the name of the temporary
variable, if you wanted to make it explicit.

Will

2015-12-04 20:51:33

by Waiman Long

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On 12/03/2015 03:26 PM, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 04:37:26PM +0000, Will Deacon wrote:
>>> +#define smp_cond_acquire(cond) do { \
>>> + while (!(cond)) \
>>> + cpu_relax(); \
>>> + smp_rmb(); /* ctrl + rmb := acquire */ \
>>> +} while (0)
>>> + smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
>> I think we spoke about this before, but what would work really well for
>> arm64 here is if we could override smp_cond_acquire in such a way that
>> the atomic_read could be performed explicitly in the macro. That would
>> allow us to use an LDXR to set the exclusive monitor, which in turn
>> means we can issue a WFE and get a cheap wakeup when lock->val is
>> actually modified.
>>
>> With the current scheme, there's not enough information expressed in the
>> "cond" parameter to perform this optimisation.
> Right, but I'm having a hard time constructing something pretty that can
> do that. Lambda functions would be lovely, but we don't have those :/
>
> While we can easily pass a pointer to an arbitrary type, we need
> an expression to evaluate the result of the pointer load to act as our
> condition.
>
> smp_cond_acquire(&lock->val.counter,
> [](int val){ return !(val & _Q_LOCKED_PENDING_MASK); });
>
> Would be nice, but alas.
>
> The best we can do is hardcode a variable name; maybe something like:
>
> #define smp_cond_acquire(ptr, expr) do {	\
> 	typeof(*ptr) val;			\
> 	while ((val = READ_ONCE(*ptr)), !(expr))\
> 		cpu_relax();			\
> 	smp_rmb(); /* ctrl + rmb := acquire */	\
> } while (0)
>
> Which would let us write:
>
> smp_cond_acquire(&lock->val.counter, !(val & _Q_LOCKED_PENDING_MASK));
>
>
> Thoughts?
Will the following work?

#define smp_cond_load_acquire(ptr, cond_expr, neg) ({	\
	typeof(*ptr) _val;				\
	while (neg ((_val = READ_ONCE(*ptr)) cond_expr))\
		cpu_relax();				\
	smp_rmb();					\
	_val;						\
})

val = smp_cond_load_acquire(&lock->val.counter, & _Q_LOCKED_PENDING_MASK, !);

Cheers,
Longman

2015-12-04 22:05:55

by Linus Torvalds

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Fri, Dec 4, 2015 at 12:51 PM, Waiman Long <[email protected]> wrote:
>
> Will the following work?

Are we trying to win some obfuscated C contest here?

Just make it do something like (skipping backslashes to make it easier
to type and read)

#define smp_cond_load_acquire(ptr, cond_expr) ({
	typeof(*ptr) VAL;
	for (;;) {
		VAL = READ_ONCE(*ptr);
		if (cond_expr) break;
		cpu_relax();
	}
	smp_rmb();
	VAL;
})

and then you'd have it be

val = smp_cond_load_acquire(&lock->val.counter,
			    !(VAL & _Q_LOCKED_PENDING_MASK));

which is at least halfway legible. Not some odd "fragments of
expressions" interfaces unless absolutely required, please.

Of course, I suspect we should not use READ_ONCE(), but some
architecture-overridable version that just defaults to READ_ONCE().
Same goes for that "smp_rmb()". Because maybe some architectures will
just prefer an explicit acquire, and I suspect we do *not* want
architectures having to recreate and override that crazy loop.

How much does this all actually end up mattering, btw?

Linus

2015-12-04 22:48:19

by Waiman Long

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On 12/04/2015 05:05 PM, Linus Torvalds wrote:
On Fri, Dec 4, 2015 at 12:51 PM, Waiman Long <[email protected]> wrote:
>> Will the following work?
> Are we trying to win some obfuscated C contest here?
>
> Just make it do something like (skipping backslashes to make it easier
> to type and read)
>
> #define smp_cond_load_acquire(ptr, cond_expr) ({
> 	typeof(*ptr) VAL;
> 	for (;;) {
> 		VAL = READ_ONCE(*ptr);
> 		if (cond_expr) break;
> 		cpu_relax();
> 	}
> 	smp_rmb();
> 	VAL;
> })
>
> and then you'd have it be
>
> val = smp_cond_load_acquire(&lock->val.counter,
> 			    !(VAL & _Q_LOCKED_PENDING_MASK));
>
> which is at least halfway legible. Not some odd "fragments of
> expressions" interfaces unless absolutely required, please.

It was just a random thought I had. I am not saying that it is
the right way to go.

> Of course, I suspect we should not use READ_ONCE(), but some
> architecture-overridable version that just defaults to READ_ONCE().
> Same goes for that "smp_rmb()". Because maybe some architectures will
> just prefer an explicit acquire, and I suspect we do *not* want
> architectures having to recreate and override that crazy loop.
>
> How much does this all actually end up mattering, btw?
>
> Linus

I think what Will wants to do is provide an architecture-specific
replacement for the whole macro, not just part of it. So using READ_ONCE()
should be fine.

Cheers,
Longman

2015-12-04 23:43:44

by Peter Zijlstra

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Fri, Dec 04, 2015 at 02:05:49PM -0800, Linus Torvalds wrote:
> Of course, I suspect we should not use READ_ONCE(), but some
> architecture-overridable version that just defaults to READ_ONCE().
> Same goes for that "smp_rmb()". Because maybe some architectures will
> just prefer an explicit acquire, and I suspect we do *not* want
> architectures having to recreate and override that crazy loop.
>
> How much does this all actually end up mattering, btw?

Not sure, I'll have to let Will quantify that. But the whole reason
we're having this discussion is that ARM64 has a MONITOR+MWAIT-like
construct that they'd like to use to avoid the spinning.

Of course, in order to use that, they _have_ to override the crazy loop.

Now, Will and I spoke earlier today, and the version proposed by me (and
you, since that is roughly similar) will indeed work for them in that it
would allow them to rewrite the thing something like:


typeof(*ptr) VAL;
for (;;) {
	VAL = READ_ONCE(*ptr);
	if (expr)
		break;
	cmp_and_wait(ptr, VAL);
}


Where their cmp_and_wait(ptr, val) looks a little like:

asm volatile(
"	ldxr	%w0, %1		\n"
"	sub	%w0, %w0, %2	\n"
"	cbnz	%w0, 1f		\n"
"	wfe			\n"
"1:"

	: "=&r" (tmp)
	: "Q" (*ptr), "r" (val)
);

(excuse my poor ARM asm foo)

Which sets up a load-exclusive monitor, compares if the value loaded
matches what we previously saw, and if so, wait-for-event.

WFE will wake on any event that would also have invalidated a subsequent
STXR (store-exclusive).

ARM64 also of course can choose to use load-acquire instead of the
READ_ONCE(), or still issue the smp_rmb(), dunno what is best for them.
The load-acquire would (potentially) be issued multiple times, vs the
rmb only once. I'll let Will sort that.


In any case, WFE is both better for power consumption and lowers the
cacheline pressure, i.e. nobody keeps trying to pull the line into shared
state all the time while you're trying to get a store done.

2015-12-07 15:18:41

by Will Deacon

Subject: Re: [PATCH 3/4] locking: Introduce smp_cond_acquire()

On Sat, Dec 05, 2015 at 12:43:37AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 04, 2015 at 02:05:49PM -0800, Linus Torvalds wrote:
> > Of course, I suspect we should not use READ_ONCE(), but some
> > architecture-overridable version that just defaults to READ_ONCE().
> > Same goes for that "smp_rmb()". Because maybe some architectures will
> > just prefer an explicit acquire, and I suspect we do *not* want
> > architectures having to recreate and override that crazy loop.
> >
> > How much does this all actually end up mattering, btw?
>
> Not sure, I'll have to let Will quantify that. But the whole reason
> we're having this discussion is that ARM64 has a MONITOR+MWAIT-like
> construct that they'd like to use to avoid the spinning.
>
> Of course, in order to use that, they _have_ to override the crazy loop.

Right. This also removes one of the few hurdles standing between us (arm64)
and generic locking routines such as qspinlock, where we really don't
want busy-wait loops (since cpu_relax() doesn't give us the opportunity
to use wfe safely).

> Now, Will and I spoke earlier today, and the version proposed by me (and
> you, since that is roughly similar) will indeed work for them in that it
> would allow them to rewrite the thing something like:
>
>
> typeof(*ptr) VAL;
> for (;;) {
> 	VAL = READ_ONCE(*ptr);
> 	if (expr)
> 		break;
> 	cmp_and_wait(ptr, VAL);
> }
>
>
> Where their cmp_and_wait(ptr, val) looks a little like:
>
> asm volatile(
> "	ldxr	%w0, %1		\n"
> "	sub	%w0, %w0, %2	\n"
> "	cbnz	%w0, 1f		\n"
> "	wfe			\n"
> "1:"
>
> 	: "=&r" (tmp)
> 	: "Q" (*ptr), "r" (val)
> );
>
> (excuse my poor ARM asm foo)
>
> Which sets up a load-exclusive monitor, compares if the value loaded
> matches what we previously saw, and if so, wait-for-event.
>
> WFE will wake on any event that would've also invalidated a subsequent
> stxr or store-exclusive.
>
> ARM64 also of course can choose to use load-acquire instead of the
> READ_ONCE(), or still issue the smp_rmb(), dunno what is best for them.
> The load-acquire would (potentially) be issued multiple times, vs the
> rmb only once. I'll let Will sort that.

Yup. I'll just override the whole thing using something much like what
you're suggesting.

Cheers,

Will