2019-10-10 10:42:33

by Manfred Spraul

Subject: wake_q memory ordering

Hi,

Waiman Long noticed that the memory barriers in sem_lock() are not
really documented, and while adding documentation, I ended up with one
case where I'm not certain about the wake_q code:

Questions:
- Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
  ordering guarantee?
- Is it ok that wake_up_q just writes wake_q->next, shouldn't
  smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
  happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
  provides any ordering.

Example:
- CPU2 never touches lock a. It is just an unrelated wake_q user that also
  wants to wake up task 1234.
- I've noticed already that smp_store_acquire() doesn't exist.
  So smp_store_mb() is required. But from a semantic point of view, we
  would need an ACQUIRE: the wake_up_process() must happen after the
  cmpxchg().
- May wake_up_q() rely on the spinlocks/memory barriers in try_to_wake_up,
  or should the function be safe by itself?

CPU1: /current=1234, inside do_semtimedop()/
        g_wakee = current;
        current->state = TASK_INTERRUPTIBLE;
        spin_unlock(a);

CPU2: / arbitrary kernel thread that uses wake_q /
                wake_q_add(&unrelated_q, 1234);
                wake_up_q(&unrelated_q);
                <...ongoing>

CPU3: / do_semtimedop() + wake_up_sem_queue_prepare() /
                        spin_lock(a);
                        wake_q_add(,g_wakee);
                        < within wake_q_add() >:
                          smp_mb__before_atomic();
                          if (unlikely(cmpxchg_relaxed(&node->next, NULL,
                                                       WAKE_Q_TAIL)))
                              return false; /* -> this happens */

CPU2:
                <within wake_up_q>
                1234->wake_q.next = NULL; <<<<<<<<< Ok? Is store_acquire() missing? >>>>>>>>>>>>
                wake_up_process(1234);
                < within wake_up_process/try_to_wake_up():
                    raw_spin_lock_irqsave()
                    smp_mb__after_spinlock()
                    if (1234->state == TASK_RUNNING) return;
                 >


rewritten:

start condition: A = 1; B = 0;

CPU1:
    B = 1;
    RELEASE, unlock LockX;

CPU2:
    lock LockX, ACQUIRE
    if (LOAD A == 1) return; /* using cmpxchg_relaxed */

CPU2:
    A = 0;
    ACQUIRE, lock LockY
    smp_mb__after_spinlock();
    READ B

Question: is A = 1, B = 0 possible?

--

    Manfred


2019-10-10 11:44:23

by Peter Zijlstra

Subject: Re: wake_q memory ordering

On Thu, Oct 10, 2019 at 12:41:11PM +0200, Manfred Spraul wrote:
> Hi,
>
> Waiman Long noticed that the memory barriers in sem_lock() are not really
> documented, and while adding documentation, I ended up with one case where
> I'm not certain about the wake_q code:
>
> Questions:
> - Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
>   ordering guarantee?

Yep. Either the atomic instruction implies ordering (eg. x86 LOCK
prefix) or it doesn't (most RISC LL/SC). If it does,
smp_mb__{before,after}_atomic() are a NO-OP and the ordering is
unconditional; if it does not, then smp_mb__{before,after}_atomic() are
unconditional barriers.

IOW, the only way to get a cmpxchg without barriers on failure, is with
LL/SC, and in that case smp_mb__{before,after}_atomic() are
unconditional.

For instance, the way ARM64 does cmpxchg() is:

	cmpxchg(p, o, n)
		do {
			v = LL(p);
			if (v != o)
				return v;
		} while (!SC_RELEASE(p, n))
		smp_mb();
		return v;

And you'll note how on success the store-release constraints all prior
memory operations, and the smp_mb() constraints all later memops. But on
failure there's not a barrier to be found.

> - Is it ok that wake_up_q just writes wake_q->next, shouldn't
>   smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
>   happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
>   provides any ordering.

There is no such thing as store_acquire, it is either load_acquire or
store_release. But just like how we can write load-acquire like
load+smp_mb(), so too I suppose we could write store-acquire like
store+smp_mb(), and that is exactly what is there (through the implied
barrier of wake_up_process()).

(arguably it should've been WRITE_ONCE() I suppose)

>
> Example:
> - CPU2 never touches lock a. It is just an unrelated wake_q user that also
>   wants to wake up task 1234.
> - I've noticed already that smp_store_acquire() doesn't exist.
>   So smp_store_mb() is required. But from a semantic point of view, we
>   would need an ACQUIRE: the wake_up_process() must happen after the
>   cmpxchg().
> - May wake_up_q() rely on the spinlocks/memory barriers in try_to_wake_up,
>   or should the function be safe by itself?
>
> CPU1: /current=1234, inside do_semtimedop()/
>         g_wakee = current;
>         current->state = TASK_INTERRUPTIBLE;
>         spin_unlock(a);
>
> CPU2: / arbitrary kernel thread that uses wake_q /
>                 wake_q_add(&unrelated_q, 1234);
>                 wake_up_q(&unrelated_q);
>                 <...ongoing>
>
> CPU3: / do_semtimedop() + wake_up_sem_queue_prepare() /
>                         spin_lock(a);
>                         wake_q_add(,g_wakee);
>                         < within wake_q_add() >:
>                           smp_mb__before_atomic();
>                           if (unlikely(cmpxchg_relaxed(&node->next, NULL,
>                                                        WAKE_Q_TAIL)))
>                               return false; /* -> this happens */
>
> CPU2:
>                 <within wake_up_q>
>                 1234->wake_q.next = NULL; <<<<<<<<< Ok? Is store_acquire() missing? >>>>>>>>>>>>

/* smp_mb(); implied by the following wake_up_process() */

>                 wake_up_process(1234);
>                 < within wake_up_process/try_to_wake_up():
>                     raw_spin_lock_irqsave()
>                     smp_mb__after_spinlock()
>                     if (1234->state == TASK_RUNNING) return;
>                 >
>
>
> rewritten:
>
> start condition: A = 1; B = 0;
>
> CPU1:
>     B = 1;
>     RELEASE, unlock LockX;
>
> CPU2:
>     lock LockX, ACQUIRE
>     if (LOAD A == 1) return; /* using cmpxchg_relaxed */
>
> CPU2:
>     A = 0;
>     ACQUIRE, lock LockY
>     smp_mb__after_spinlock();
>     READ B
>
> Question: is A = 1, B = 0 possible?

Your example is incomplete (there is no A=1 assignment for example), but
I'm thinking I can guess where that should go given the earlier text.

I don't think this is broken.

2019-10-10 12:14:30

by Manfred Spraul

Subject: Re: wake_q memory ordering

Hi Peter,

On 10/10/19 1:42 PM, Peter Zijlstra wrote:
> On Thu, Oct 10, 2019 at 12:41:11PM +0200, Manfred Spraul wrote:
>> Hi,
>>
>> Waiman Long noticed that the memory barriers in sem_lock() are not really
>> documented, and while adding documentation, I ended up with one case where
>> I'm not certain about the wake_q code:
>>
>> Questions:
>> - Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
>>   ordering guarantee?
> Yep. Either the atomic instruction implies ordering (eg. x86 LOCK
> prefix) or it doesn't (most RISC LL/SC), if it does,
> smp_mb__{before,after}_atomic() are a NO-OP and the ordering is
> unconditional; if it does not, then smp_mb__{before,after}_atomic() are
> unconditional barriers.

And _relaxed() differs from "normal" cmpxchg only for LL/SC
architectures, correct?

Therefore smp_mb__{before,after}_atomic() may be combined with
cmpxchg_relaxed, to form a full memory barrier, on all archs.

[...]


>> - Is it ok that wake_up_q just writes wake_q->next, shouldn't
>>   smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
>>   happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
>>   provides any ordering.
> There is no such thing as store_acquire, it is either load_acquire or
> store_release. But just like how we can write load-acquire like
> load+smp_mb(), so too I suppose we could write store-acquire like
> store+smp_mb(), and that is exactly what is there (through the implied
> barrier of wake_up_process()).

Thanks for confirming my assumption:
The code is correct, due to the implied barrier inside wake_up_process().

[...]
>> rewritten:
>>
>> start condition: A = 1; B = 0;
>>
>> CPU1:
>>     B = 1;
>>     RELEASE, unlock LockX;
>>
>> CPU2:
>>     lock LockX, ACQUIRE
>>     if (LOAD A == 1) return; /* using cmpxchg_relaxed */
>>
>> CPU2:
>>     A = 0;
>>     ACQUIRE, lock LockY
>>     smp_mb__after_spinlock();
>>     READ B
>>
>> Question: is A = 1, B = 0 possible?
> Your example is incomplete (there is no A=1 assignment for example), but
> I'm thinking I can guess where that should go given the earlier text.

A=1 is listed as the start condition: way before, someone did wake_q_add().


> I don't think this is broken.

Thanks.

--

    Manfred

2019-10-10 12:33:30

by Peter Zijlstra

Subject: Re: wake_q memory ordering

On Thu, Oct 10, 2019 at 02:13:47PM +0200, Manfred Spraul wrote:
> Hi Peter,
>
> On 10/10/19 1:42 PM, Peter Zijlstra wrote:
> > On Thu, Oct 10, 2019 at 12:41:11PM +0200, Manfred Spraul wrote:
> > > Hi,
> > >
> > > Waiman Long noticed that the memory barriers in sem_lock() are not really
> > > documented, and while adding documentation, I ended up with one case where
> > > I'm not certain about the wake_q code:
> > >
> > > Questions:
> > > - Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
> > >   ordering guarantee?
> > Yep. Either the atomic instruction implies ordering (eg. x86 LOCK
> > prefix) or it doesn't (most RISC LL/SC), if it does,
> > smp_mb__{before,after}_atomic() are a NO-OP and the ordering is
> > unconditional; if it does not, then smp_mb__{before,after}_atomic() are
> > unconditional barriers.
>
> And _relaxed() differs from "normal" cmpxchg only for LL/SC architectures,
> correct?

Indeed.

> Therefore smp_mb__{before,after}_atomic() may be combined with
> cmpxchg_relaxed, to form a full memory barrier, on all archs.

Just so.

> > > - Is it ok that wake_up_q just writes wake_q->next, shouldn't
> > >   smp_store_acquire() be used? I.e.: guarantee that wake_up_process()
> > >   happens after cmpxchg_relaxed(), assuming that a failed cmpxchg_relaxed
> > >   provides any ordering.
> > There is no such thing as store_acquire, it is either load_acquire or
> > store_release. But just like how we can write load-acquire like
> > load+smp_mb(), so too I suppose we could write store-acquire like
> > store+smp_mb(), and that is exactly what is there (through the implied
> > barrier of wake_up_process()).
>
> Thanks for confirming my assumption:
> The code is correct, due to the implied barrier inside wake_up_process().

It has a comment there, trying to state this.

	task->wake_q.next = NULL;

	/*
	 * wake_up_process() executes a full barrier, which pairs with
	 * the queueing in wake_q_add() so as not to miss wakeups.
	 */
	wake_up_process(task);

2019-10-10 19:27:20

by Davidlohr Bueso

Subject: Re: wake_q memory ordering

On Thu, 10 Oct 2019, Peter Zijlstra wrote:

>On Thu, Oct 10, 2019 at 02:13:47PM +0200, Manfred Spraul wrote:
>> Hi Peter,
>>
>> On 10/10/19 1:42 PM, Peter Zijlstra wrote:
>> > On Thu, Oct 10, 2019 at 12:41:11PM +0200, Manfred Spraul wrote:
>> > > Hi,
>> > >
>> > > Waiman Long noticed that the memory barriers in sem_lock() are not really
>> > > documented, and while adding documentation, I ended up with one case where
>> > > I'm not certain about the wake_q code:
>> > >
>> > > Questions:
>> > > - Does smp_mb__before_atomic() + a (failed) cmpxchg_relaxed provide an
>> > >   ordering guarantee?
>> > Yep. Either the atomic instruction implies ordering (eg. x86 LOCK
>> > prefix) or it doesn't (most RISC LL/SC), if it does,
>> > smp_mb__{before,after}_atomic() are a NO-OP and the ordering is
>> > unconditional; if it does not, then smp_mb__{before,after}_atomic() are
>> > unconditional barriers.
>>
>> And _relaxed() differs from "normal" cmpxchg only for LL/SC architectures,
>> correct?
>
>Indeed.
>
>> Therefore smp_mb__{before,after}_atomic() may be combined with
>> cmpxchg_relaxed, to form a full memory barrier, on all archs.
>
>Just so.

We might want something like this?

----8<---------------------------------------------------------

From: Davidlohr Bueso <[email protected]>
Subject: [PATCH] Documentation/memory-barriers.txt: Mention smp_mb__{before,after}_atomic() and CAS

Explicitly mention possible usages to guarantee serialization even upon
failed cmpxchg (or similar) calls along with smp_mb__{before,after}_atomic().

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Documentation/memory-barriers.txt | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..5d2873d4b442 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1890,6 +1890,18 @@ There are some more advanced barrier functions:
      This makes sure that the death mark on the object is perceived to be set
      *before* the reference counter is decremented.
 
+     Similarly, these barriers can be used to guarantee serialization for atomic
+     RMW calls on architectures which may not imply memory barriers upon failure.
+
+    obj->next = NULL;
+    smp_mb__before_atomic()
+    if (cmpxchg(&obj->ptr, NULL, val))
+        return;
+
+     This makes sure that the store to the next pointer always has smp_store_mb()
+     semantics. As such, smp_mb__{before,after}_atomic() calls allow optimizing
+     the barrier usage by finer grained serialization.
+
      See Documentation/atomic_{t,bitops}.txt for more information.


--
2.16.4

2019-10-11 08:58:19

by Manfred Spraul

Subject: Re: wake_q memory ordering

Hi Davidlohr,

On 10/10/19 9:25 PM, Davidlohr Bueso wrote:
> On Thu, 10 Oct 2019, Peter Zijlstra wrote:
>
>> On Thu, Oct 10, 2019 at 02:13:47PM +0200, Manfred Spraul wrote:
>>
>>> Therefore smp_mb__{before,after}_atomic() may be combined with
>>> cmpxchg_relaxed, to form a full memory barrier, on all archs.
>>
>> Just so.
>
> We might want something like this?
>
> ----8<---------------------------------------------------------
>
> From: Davidlohr Bueso <[email protected]>
> Subject: [PATCH] Documentation/memory-barriers.txt: Mention smp_mb__{before,after}_atomic() and CAS
>
> Explicitly mention possible usages to guarantee serialization even upon
> failed cmpxchg (or similar) calls along with smp_mb__{before,after}_atomic().
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> Documentation/memory-barriers.txt | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 1adbb8a371c7..5d2873d4b442 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1890,6 +1890,18 @@ There are some more advanced barrier functions:
>      This makes sure that the death mark on the object is perceived to be set
>      *before* the reference counter is decremented.
>
> +     Similarly, these barriers can be used to guarantee serialization for atomic
> +     RMW calls on architectures which may not imply memory barriers upon failure.
> +
> +    obj->next = NULL;
> +    smp_mb__before_atomic()
> +    if (cmpxchg(&obj->ptr, NULL, val))
> +        return;
> +
> +     This makes sure that the store to the next pointer always has smp_store_mb()
> +     semantics. As such, smp_mb__{before,after}_atomic() calls allow optimizing
> +     the barrier usage by finer grained serialization.
> +
>      See Documentation/atomic_{t,bitops}.txt for more information.
>
>
I don't know. The new documentation would not have answered my question
(is it ok to combine smp_mb__before_atomic() with atomic_relaxed()?).
And it copies content already present in atomic_t.txt.

Thus: I would prefer it if the first sentence of the paragraph were
replaced: the list of operations should end with "...", and it should
match what is in atomic_t.txt.

Ok?

--

    Manfred



Attachments:
0004-Documentation-memory-barriers.txt-Clarify-cmpxchg.patch (2.50 kB)

2019-10-11 15:48:21

by Davidlohr Bueso

Subject: Re: wake_q memory ordering

On Fri, 11 Oct 2019, Manfred Spraul wrote:
>I don't know. The new documentation would not have answered my
>question (is it ok to combine smp_mb__before_atomic() with
>atomic_relaxed()?). And it copies content already present in
>atomic_t.txt.

Well, the _relaxed (and release/acquire) extensions refer to a
_successful_ operation (LL/SC), and to whether it has barriers on
either side, before and after. I thought you were mainly worried
about the failed CAS scenario, not the relaxed semantics themselves.

I don't know how this copies content from atomic_t.txt; at no
point does it talk about a failed CAS.

>
>Thus: I would prefer if the first sentence of the paragraph is
>replaced: The list of operations should end with "...", and it should
>match what is in atomic_t.txt

I'll see about combining some of your changes into patch 5/5 of
your new series, but I have to say I still prefer my documentation
change.

Thanks,
Davidlohr