2015-02-17 10:47:26

by Kirill Tkhai

[permalink] [raw]
Subject: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles


We migrate a task using TASK_ON_RQ_MIGRATING state of on_rq:

raw_spin_lock(&old_rq->lock);
deactivate_task(old_rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);

I.e.:

write TASK_ON_RQ_MIGRATING
smp_wmb() (in __set_task_cpu)
write new_cpu

But {,__}task_rq_lock() don't use smp_rmb(), and they may see
the cpu and TASK_ON_RQ_MIGRATING in opposite order. In this case
{,__}task_rq_lock() lock new_rq before the task is actually queued
on it.

Fix that using ordered reading.

Fixes: cca26e8009d1 "sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"
Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/sched/core.c | 8 ++++++--
kernel/sched/sched.h | 8 ++++++--
2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc12a1d..a42fb88 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -319,8 +319,12 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
raw_spin_lock_irqsave(&p->pi_lock, *flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
- return rq;
+ if (likely(rq == task_rq(p))) {
+ /* Pairs with smp_wmb() in __set_task_cpu() */
+ smp_rmb();
+ if (likely(!task_on_rq_migrating(p)))
+ return rq;
+ }
raw_spin_unlock(&rq->lock);
raw_spin_unlock_irqrestore(&p->pi_lock, *flags);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f65f57c..4d7b03c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1031,8 +1031,12 @@ static inline struct rq *__task_rq_lock(struct task_struct *p)
for (;;) {
rq = task_rq(p);
raw_spin_lock(&rq->lock);
- if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
- return rq;
+ if (likely(rq == task_rq(p))) {
+ /* Pairs with smp_wmb() in __set_task_cpu() */
+ smp_rmb();
+ if (likely(!task_on_rq_migrating(p)))
+ return rq;
+ }
raw_spin_unlock(&rq->lock);

while (unlikely(task_on_rq_migrating(p)))



2015-02-17 12:13:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 01:47:01PM +0300, Kirill Tkhai wrote:
>
> We migrate a task using TASK_ON_RQ_MIGRATING state of on_rq:
>
> raw_spin_lock(&old_rq->lock);
> deactivate_task(old_rq, p, 0);
> p->on_rq = TASK_ON_RQ_MIGRATING;
> set_task_cpu(p, new_cpu);
> raw_spin_unlock(&rq->lock);
>
> I.e.:
>
> write TASK_ON_RQ_MIGRATING
> smp_wmb() (in __set_task_cpu)
> write new_cpu
>
> But {,__}task_rq_lock() don't use smp_rmb(), and they may see
> the cpu and TASK_ON_RQ_MIGRATING in opposite order. In this case
> {,__}task_rq_lock() lock new_rq before the task is actually queued
> on it.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fc12a1d..a42fb88 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -319,8 +319,12 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
> raw_spin_lock_irqsave(&p->pi_lock, *flags);
> rq = task_rq(p);
> raw_spin_lock(&rq->lock);
> - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
> - return rq;
> + if (likely(rq == task_rq(p))) {
> + /* Pairs with smp_wmb() in __set_task_cpu() */

That comment really is insufficient; but aside from that:

If we observe the old cpu value we've just acquired the old rq->lock and
therefore we must observe the new cpu value and retry -- we don't care
about the migrate value in this case.

If we observe the new cpu value, we've acquired the new rq->lock and its
ACQUIRE will pair with the WMB to ensure we see the migrate value.

So I think the current code is correct; albeit it could use a comment.

> + smp_rmb();
> + if (likely(!task_on_rq_migrating(p)))
> + return rq;
> + }


---
Subject: sched: Clarify ordering between task_rq_lock() and move_queued_task()
From: Peter Zijlstra <[email protected]>
Date: Tue Feb 17 13:07:38 CET 2015

There was a wee bit of confusion around the exact ordering here;
clarify things.

Cc: Oleg Nesterov <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Reported-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct ta
raw_spin_lock_irqsave(&p->pi_lock, *flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
+        /*
+         * move_queued_task()             task_rq_lock()
+         *
+         *   ACQUIRE (rq->lock)
+         *   [S] ->on_rq = MIGRATING      [L] rq = task_rq()
+         *   WMB (__set_task_cpu())       ACQUIRE (rq->lock);
+         *   [S] ->cpu = new_cpu          [L] task_rq()
+         *                                [L] ->on_rq
+         *   RELEASE (rq->lock)
+         *
+         * If we observe the old cpu in task_rq_lock, the acquire of
+         * the old rq->lock will fully serialize against the stores.
+         *
+         * If we observe the new cpu in task_rq_lock, the acquire will
+         * pair with the WMB to ensure we must then also see migrating.
+         */
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
return rq;
raw_spin_unlock(&rq->lock);

2015-02-17 12:37:04

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, 17/02/2015 at 13:12 +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 01:47:01PM +0300, Kirill Tkhai wrote:
> >
> > We migrate a task using TASK_ON_RQ_MIGRATING state of on_rq:
> >
> > raw_spin_lock(&old_rq->lock);
> > deactivate_task(old_rq, p, 0);
> > p->on_rq = TASK_ON_RQ_MIGRATING;
> > set_task_cpu(p, new_cpu);
> > raw_spin_unlock(&rq->lock);
> >
> > I.e.:
> >
> > write TASK_ON_RQ_MIGRATING
> > smp_wmb() (in __set_task_cpu)
> > write new_cpu
> >
> > But {,__}task_rq_lock() don't use smp_rmb(), and they may see
> > the cpu and TASK_ON_RQ_MIGRATING in opposite order. In this case
> > {,__}task_rq_lock() lock new_rq before the task is actually queued
> > on it.
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fc12a1d..a42fb88 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -319,8 +319,12 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
> > raw_spin_lock_irqsave(&p->pi_lock, *flags);
> > rq = task_rq(p);
> > raw_spin_lock(&rq->lock);
> > - if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
> > - return rq;
> > + if (likely(rq == task_rq(p))) {
> > + /* Pairs with smp_wmb() in __set_task_cpu() */
>
> That comment really is insufficient; but aside from that:
>
> If we observe the old cpu value we've just acquired the old rq->lock and
> therefore we must observe the new cpu value and retry -- we don't care
> about the migrate value in this case.
>
> If we observe the new cpu value, we've acquired the new rq->lock and its
> ACQUIRE will pair with the WMB to ensure we see the migrate value.

Yes, I worried about the new_cpu case.

So, spin_lock() implies smp_rmb(). I used to think it does not
(I was confused by smp_mb__before_spin_lock(), but that is for STOREs).

Thanks for the explanation :)


> So I think the current code is correct; albeit it could use a comment.
>
> > + smp_rmb();
> > + if (likely(!task_on_rq_migrating(p)))
> > + return rq;
> > + }
>
>
> ---
> Subject: sched: Clarify ordering between task_rq_lock() and move_queued_task()
> From: Peter Zijlstra <[email protected]>
> Date: Tue Feb 17 13:07:38 CET 2015
>
> There was a wee bit of confusion around the exact ordering here;
> clarify things.
>
> Cc: Oleg Nesterov <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Reported-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> kernel/sched/core.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct ta
> raw_spin_lock_irqsave(&p->pi_lock, *flags);
> rq = task_rq(p);
> raw_spin_lock(&rq->lock);
> + /*
> + * move_queued_task() task_rq_lock()
> + *
> + * ACQUIRE (rq->lock)
> + * [S] ->on_rq = MIGRATING [L] rq = task_rq()
> + * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
> + * [S] ->cpu = new_cpu [L] task_rq()
> + * [L] ->on_rq
> + * RELEASE (rq->lock)
> + *
> + * If we observe the old cpu in task_rq_lock, the acquire of
> + * the old rq->lock will fully serialize against the stores.
> + *
> + * If we observe the new cpu in task_rq_lock, the acquire will
> + * pair with the WMB to ensure we must then also see migrating.
> + */
> if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
> return rq;
> raw_spin_unlock(&rq->lock);

2015-02-17 12:45:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 03:36:50PM +0300, Kirill Tkhai wrote:

> > If we observe the new cpu value, we've acquired the new rq->lock and its
> > ACQUIRE will pair with the WMB to ensure we see the migrate value.
>
> Yes, I worried about the new_cpu case.
>
> So, spin_lock() implies smp_rmb(). I used to think it does not
> (I was confused by smp_mb__before_spin_lock(), but that is for STOREs).
>
> Thanks for the explanation :)

No, it does not imply RMB; it's an ACQUIRE barrier.

From Documentation/memory-barriers.txt:

(5) ACQUIRE operations.

This acts as a one-way permeable barrier. It guarantees that all memory
operations after the ACQUIRE operation will appear to happen after the
ACQUIRE operation with respect to the other components of the system.
ACQUIRE operations include LOCK operations and smp_load_acquire()
operations.

Memory operations that occur before an ACQUIRE operation may appear to
happen after it completes.

An ACQUIRE operation should almost always be paired with a RELEASE
operation.

And note that the [L] rq = task_rq() cannot happen _after_ the acquire
because the acquire has an address dependency on it, we must complete
the load in order to actually take the lock.

(of course, taking the lock implies further stores and atomic ops, but
those need not actually imply more than the ACQUIRE barrier -- even
though they will on x86).
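
For reference, the address dependency being described is the one between the
task_rq() load and the lock acquisition itself. A simplified sketch of
__task_rq_lock(), reduced from the hunks quoted earlier (the on_rq check and
the migration wait are left out):

    static struct rq *__task_rq_lock(struct task_struct *p)
    {
        struct rq *rq;

        for (;;) {
            rq = task_rq(p);               /* [L] p->cpu, then compute its rq */
            raw_spin_lock(&rq->lock);      /* ACQUIRE on an address derived
                                            * from that load; the load must
                                            * complete before the lock word
                                            * can even be touched             */
            if (likely(rq == task_rq(p)))  /* re-check: did p migrate?        */
                return rq;
            raw_spin_unlock(&rq->lock);
        }
    }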

2015-02-17 13:05:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 01:12:58PM +0100, Peter Zijlstra wrote:
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct ta
> raw_spin_lock_irqsave(&p->pi_lock, *flags);
> rq = task_rq(p);
> raw_spin_lock(&rq->lock);
> + /*
> + * move_queued_task() task_rq_lock()
> + *
> + * ACQUIRE (rq->lock)
> + * [S] ->on_rq = MIGRATING [L] rq = task_rq()
> + * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
> + * [S] ->cpu = new_cpu [L] task_rq()
> + * [L] ->on_rq
> + * RELEASE (rq->lock)
> + *
> + * If we observe the old cpu in task_rq_lock, the acquire of
> + * the old rq->lock will fully serialize against the stores.
> + *
> + * If we observe the new cpu in task_rq_lock, the acquire will
> + * pair with the WMB to ensure we must then also see migrating.
> + */
> if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
> return rq;
> raw_spin_unlock(&rq->lock);


Hey Paul, remember this: https://lkml.org/lkml/2014/7/16/310

I just used a creative one :-)

BTW, should we attempt to include that table in memory-barriers.txt like
Mathieu said? As a cheat sheet with references to longer explanations
for the 'interesting' ones?

FWIW, we should probably update that table to include control
dependencies too; we didn't (formally) have those back then I think.

The blob under SMP BARRIER PAIRING does not mention pairing with control
dependencies; and I'm rather sure I've done so.

2015-02-17 16:05:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 02:05:23PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 01:12:58PM +0100, Peter Zijlstra wrote:
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct ta
> > raw_spin_lock_irqsave(&p->pi_lock, *flags);
> > rq = task_rq(p);
> > raw_spin_lock(&rq->lock);
> > + /*
> > + * move_queued_task() task_rq_lock()
> > + *
> > + * ACQUIRE (rq->lock)
> > + * [S] ->on_rq = MIGRATING [L] rq = task_rq()
> > + * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
> > + * [S] ->cpu = new_cpu [L] task_rq()
> > + * [L] ->on_rq
> > + * RELEASE (rq->lock)
> > + *
> > + * If we observe the old cpu in task_rq_lock, the acquire of
> > + * the old rq->lock will fully serialize against the stores.
> > + *
> > + * If we observe the new cpu in task_rq_lock, the acquire will
> > + * pair with the WMB to ensure we must then also see migrating.
> > + */
> > if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
> > return rq;
> > raw_spin_unlock(&rq->lock);
>
> Hey Paul, remember this: https://lkml.org/lkml/2014/7/16/310

I do now. ;-)

> I just used a creative one :-)

The scenario above?

> BTW, should we attempt to include that table in memory-barriers.txt like
> Mathieu said? As a cheat sheet with references to longer explanations
> for the 'interesting' ones?
>
> FWIW, we should probably update that table to include control
> dependencies too; we didn't (formally) have those back then I think.
>
> The blob under SMP BARRIER PAIRING does not mention pairing with control
> dependencies; and I'm rather sure I've done so.

Yep, they should pair as well, though the pairing is limited.
No transitivity, of course.

So the straightforward approach requires eighteen bits per cell, though
some of them are a bit, ummm, "unusual". Sixteen of these are given by
Scenarios 0-15 in http://lwn.net/Articles/573436/, with the barrier on
the side corresponding to the first column and the barrier on the top
corresponding to the second column. The seventeenth bit says whether
you get transitivity chaining after the top access, assuming that it
happens later. The eighteenth bit says whether you get transitivity
chaining before the side access, assuming that it happens earlier.

The following is a rough first guess, filling in only the diagonal.
Some of the entries are no doubt wrong, and getting them right requires
something like 7*7*18 test cases, which will take some time. So, is
something like this really helpful?


     |  mb   |  wmb  |  rmb  |  rbd  |  acq  |  rel  |  ctl  |
-----+-------+-------+-------+-------+-------+-------+-------+
  mb | 3ffff |   X   |   X   |   X   |   X   |   X   |   X   +
-----+-------+-------+-------+-------+-------+-------+-------+
 wmb |   X   | 01000 |   X   |   X   |   X   |   X   |   X   +
-----+-------+-------+-------+-------+-------+-------+-------+
 rmb |   X   |   X   | 00000 |   X   |   X   |   X   |   X   +
-----+-------+-------+-------+-------+-------+-------+-------+
 rbd |   X   |   X   |   X   | 00000 |   X   |   X   |   X   +
-----+-------+-------+-------+-------+-------+-------+-------+
 acq |   X   |   X   |   X   |   X   | 00020 |   X   |   X   +
-----+-------+-------+-------+-------+-------+-------+-------+
 rel |   X   |   X   |   X   |   X   |   X   | 0cc00 |   X   +
-----+-------+-------+-------+-------+-------+-------+-------+
 ctl |   X   |   X   |   X   |   X   |   X   |   X   | 00020 +
-----+-------+-------+-------+-------+-------+-------+-------+

Thanx, Paul

2015-02-17 18:02:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 08:05:32AM -0800, Paul E. McKenney wrote:
> On Tue, Feb 17, 2015 at 02:05:23PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 17, 2015 at 01:12:58PM +0100, Peter Zijlstra wrote:
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct ta
> > > raw_spin_lock_irqsave(&p->pi_lock, *flags);
> > > rq = task_rq(p);
> > > raw_spin_lock(&rq->lock);
> > > + /*
> > > + * move_queued_task() task_rq_lock()
> > > + *
> > > + * ACQUIRE (rq->lock)
> > > + * [S] ->on_rq = MIGRATING [L] rq = task_rq()
> > > + * WMB (__set_task_cpu()) ACQUIRE (rq->lock);
> > > + * [S] ->cpu = new_cpu [L] task_rq()
> > > + * [L] ->on_rq
> > > + * RELEASE (rq->lock)
> > > + *
> > > + * If we observe the old cpu in task_rq_lock, the acquire of
> > > + * the old rq->lock will fully serialize against the stores.
> > > + *
> > > + * If we observe the new cpu in task_rq_lock, the acquire will
> > > + * pair with the WMB to ensure we must then also see migrating.
> > > + */
> > > if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
> > > return rq;
> > > raw_spin_unlock(&rq->lock);
> >
> > Hey Paul, remember this: https://lkml.org/lkml/2014/7/16/310
>
> I do now. ;-)
>
> > I just used a creative one :-)
>
> The scenario above?
>
> > BTW, should we attempt to include that table in memory-barriers.txt like
> > Mathieu said? As a cheat sheet with references to longer explanations
> > for the 'interesting' ones?
> >
> > FWIW, we should probably update that table to include control
> > dependencies too; we didn't (formally) have those back then I think.
> >
> > The blob under SMP BARRIER PAIRING does not mention pairing with control
> > dependencies; and I'm rather sure I've done so.

And here is a patch for the control-dependency pairing. Thoughts?

Thanx, Paul

------------------------------------------------------------------------

documentation: Clarify control-dependency pairing

This commit explicitly states that control dependencies pair normally
with other barriers, and gives an example of such pairing.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index ca2387ef27ab..b09880086d96 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -592,9 +592,9 @@ See also the subsection on "Cache Coherency" for a more thorough example.
CONTROL DEPENDENCIES
--------------------

-A control dependency requires a full read memory barrier, not simply a data
-dependency barrier to make it work correctly. Consider the following bit of
-code:
+A load-load control dependency requires a full read memory barrier, not
+simply a data dependency barrier to make it work correctly. Consider the
+following bit of code:

q = ACCESS_ONCE(a);
if (q) {
@@ -615,14 +615,15 @@ case what's actually required is:
}

However, stores are not speculated. This means that ordering -is- provided
-in the following example:
+for load-store control dependencies, as in the following example:

q = ACCESS_ONCE(a);
if (q) {
ACCESS_ONCE(b) = p;
}

-Please note that ACCESS_ONCE() is not optional! Without the
+Control dependencies pair normally with other types of barriers.
+That said, please note that ACCESS_ONCE() is not optional! Without the
ACCESS_ONCE(), might combine the load from 'a' with other loads from
'a', and the store to 'b' with other stores to 'b', with possible highly
counterintuitive effects on ordering.
@@ -813,6 +814,8 @@ In summary:
barrier() can help to preserve your control dependency. Please
see the Compiler Barrier section for more information.

+ (*) Control dependencies pair normally with other types of barriers.
+
(*) Control dependencies do -not- provide transitivity. If you
need transitivity, use smp_mb().

@@ -850,6 +853,19 @@ Or:
<data dependency barrier>
y = *x;

+Or even:
+
+      CPU 1                    CPU 2
+      ===============          ===============================
+      r1 = ACCESS_ONCE(y);
+      <write barrier>
+      ACCESS_ONCE(y) = 1;      if (r2 = ACCESS_ONCE(x)) {
+                                  <implicit control dependency>
+                                  ACCESS_ONCE(y) = 1;
+                               }
+
+      assert(r1 == 0 || r2 == 0);
+
Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.

2015-02-17 18:23:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 10:01:54AM -0800, Paul E. McKenney wrote:
> > > The blob under SMP BARRIER PAIRING does not mention pairing with control
> > > dependencies; and I'm rather sure I've done so.
>
> And here is a patch for the control-dependency pairing. Thoughts?

The proposed patch does not change the blob under SMP BARRIER PAIRING,
which would be the first place I would look for this.

> @@ -850,6 +853,19 @@ Or:
> <data dependency barrier>
> y = *x;
>
> +Or even:
> +
> +      CPU 1                    CPU 2
> +      ===============          ===============================
> +      r1 = ACCESS_ONCE(y);
> +      <write barrier>
> +      ACCESS_ONCE(y) = 1;      if (r2 = ACCESS_ONCE(x)) {
> +                                  <implicit control dependency>
> +                                  ACCESS_ONCE(y) = 1;
> +                               }
> +
> +      assert(r1 == 0 || r2 == 0);
> +
> Basically, the read barrier always has to be there, even though it can be of
> the "weaker" type.

Does that want to be a <general barrier>? CPU1 looks to do a LOAD
followed by a STORE, separated by a WMB, which is of course odd.

To me the pairing with a general barrier is also the most intuitive,
since the control dependency is a LOAD -> STORE order.

Then again, I'm rather tired and maybe I'm missing the obvious :-)
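
As an illustration of that LOAD -> STORE ordering, here is a minimal
kernel-style sketch (variable and function names invented) of a control
dependency on one side pairing with a general barrier on the other:

    int x, y;                     /* both initially zero                    */
    int r1, r2;                   /* the values each CPU observes           */

    void cpu_a(void)              /* pairs via a general barrier            */
    {
        r1 = READ_ONCE(y);
        smp_mb();                 /* orders the load of y before the store  */
        WRITE_ONCE(x, 1);
    }

    void cpu_b(void)              /* pairs via a control dependency         */
    {
        r2 = READ_ONCE(x);
        if (r2)
            WRITE_ONCE(y, 1);     /* the branch orders this store after the
                                   * load of x; stores are not speculated   */
    }

    /* After both have run, r1 == 0 || r2 == 0 must hold. */

The forbidden outcome is r1 == 1 && r2 == 1; each side's ordering breaks one
link of that cycle, and an smp_wmb() on cpu_a() would not be enough because
it does not order a load against a later store.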

2015-02-17 18:36:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 08:05:32AM -0800, Paul E. McKenney wrote:
> > FWIW, we should probably update that table to include control
> > dependencies too; we didn't (formally) have those back then I think.
> >
> > The blob under SMP BARRIER PAIRING does not mention pairing with control
> > dependencies; and I'm rather sure I've done so.
>
> Yep, they should pair as well, though the pairing is limited.
> No transitivity, of course.
>
> So the straightforward approach requires eighteen bits per cell, though
> some of them are a bit, ummm, "unusual".

Right, I think the idea was to not mark with 'X' when very unusual,
otherwise you do indeed obtain the below 'trivial' matrix.

> Sixteen of these are given by
> Scenarios 0-15 in http://lwn.net/Articles/573436/, with the barrier on
> the side corresponding to the first column and the barrier on the top
> corresponding to the second column. The seventeenth bit says whether
> you get transitivity chaining after the top access, assuming that it
> happens later. The eighteenth bit says whether you get transitivity
> chaining before the side access, assuming that it happens earlier.
>
> The following is a rough first guess, filling in only the diagonal.
> Some of the entries are no doubt wrong, and getting them right requires
> something like 7*7*18 test cases, which will take some time. So, is
> something like this really helpful?


> | mb | wmb | rmb | rbd | acq | rel | ctl |
> -----+-------+-------+-------+-------+-------+-------+-------+
> mb | 3ffff | X | X | X | X | X | X +
> -----+-------+-------+-------+-------+-------+-------+-------+
> wmb | X | 01000 | X | X | X | X | X +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rmb | X | X | 00000 | X | X | X | X +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rbd | X | X | X | 00000 | X | X | X +
> -----+-------+-------+-------+-------+-------+-------+-------+
> acq | X | X | X | X | 00020 | X | X +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rel | X | X | X | X | X | 0cc00 | X +
> -----+-------+-------+-------+-------+-------+-------+-------+
> ctl | X | X | X | X | X | X | 00020 +
> -----+-------+-------+-------+-------+-------+-------+-------+

So maybe make two tables; one with 'obvious' pairings, which would
include things like mb - {mb,rmb,wmb}; rmb-wmb; acq-rel; ctl-mb; etc.

That table is for people to quickly check 'easy'; like yes wmb-rbd makes
sense and rmb-rbd doesn't appear to make sense, I need more reading up.

After that do the 'funny' table, which will explain further possible
pairings in more detail, like the rmb-rbd pairing.

I'm not entirely sure we want to do the 7*7*18 state table, that's a lot
of work to exhaustively generate. We should be lazy and demand fill when
people come to us.

2015-02-17 21:45:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 07:23:35PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 10:01:54AM -0800, Paul E. McKenney wrote:
> > > > The blob under SMP BARRIER PAIRING does not mention pairing with control
> > > > dependencies; and I'm rather sure I've done so.
> >
> > And here is a patch for the control-dependency pairing. Thoughts?
>
> The proposed patch does not change the blob under SMP BARRIER PAIRING,
> which would be the first place I would look for this.

OK, updated below.

> > @@ -850,6 +853,19 @@ Or:
> > <data dependency barrier>
> > y = *x;
> >
> > +Or even:
> > +
> > + CPU 1 CPU 2
> > + =============== ===============================
> > + r1 = ACCESS_ONCE(y);
> > + <write barrier>
> > + ACCESS_ONCE(y) = 1; if (r2 = ACCESS_ONCE(x)) {
> > + <implicit control dependency>
> > + ACCESS_ONCE(y) = 1;
> > + }
> > +
> > + assert(r1 == 0 || r2 == 0);
> > +
> > Basically, the read barrier always has to be there, even though it can be of
> > the "weaker" type.
>
> Does that want to be a <general barrier>? CPU1 looks to do a LOAD
> followed by a STORE, separated by a WMB, which is of course odd.
>
> To me the pairing with a general barrier is also the most intuitive,
> since the control dependency is a LOAD -> STORE order.
>
> Then again, I'm rather tired and maybe I'm missing the obvious :-)

Nope, you are correct. There either must be a general barrier or
CPU 1's write must be an smp_store_release(). I went with the
general barrier. Please see below for an update.

Thanx, Paul

------------------------------------------------------------------------

documentation: Clarify control-dependency pairing

This commit explicitly states that control dependencies pair normally
with other barriers, and gives an example of such pairing.

Reported-by: Peter Zijlstra <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index ca2387ef27ab..6974f1c2b4e1 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -592,9 +592,9 @@ See also the subsection on "Cache Coherency" for a more thorough example.
CONTROL DEPENDENCIES
--------------------

-A control dependency requires a full read memory barrier, not simply a data
-dependency barrier to make it work correctly. Consider the following bit of
-code:
+A load-load control dependency requires a full read memory barrier, not
+simply a data dependency barrier to make it work correctly. Consider the
+following bit of code:

q = ACCESS_ONCE(a);
if (q) {
@@ -615,14 +615,15 @@ case what's actually required is:
}

However, stores are not speculated. This means that ordering -is- provided
-in the following example:
+for load-store control dependencies, as in the following example:

q = ACCESS_ONCE(a);
if (q) {
ACCESS_ONCE(b) = p;
}

-Please note that ACCESS_ONCE() is not optional! Without the
+Control dependencies pair normally with other types of barriers.
+That said, please note that ACCESS_ONCE() is not optional! Without the
ACCESS_ONCE(), might combine the load from 'a' with other loads from
'a', and the store to 'b' with other stores to 'b', with possible highly
counterintuitive effects on ordering.
@@ -813,6 +814,8 @@ In summary:
barrier() can help to preserve your control dependency. Please
see the Compiler Barrier section for more information.

+ (*) Control dependencies pair normally with other types of barriers.
+
(*) Control dependencies do -not- provide transitivity. If you
need transitivity, use smp_mb().

@@ -823,14 +826,14 @@ SMP BARRIER PAIRING
When dealing with CPU-CPU interactions, certain types of memory barrier should
always be paired. A lack of appropriate pairing is almost certainly an error.

-General barriers pair with each other, though they also pair with
-most other types of barriers, albeit without transitivity. An acquire
-barrier pairs with a release barrier, but both may also pair with other
-barriers, including of course general barriers. A write barrier pairs
-with a data dependency barrier, an acquire barrier, a release barrier,
-a read barrier, or a general barrier. Similarly a read barrier or a
-data dependency barrier pairs with a write barrier, an acquire barrier,
-a release barrier, or a general barrier:
+General barriers pair with each other, though they also pair with most
+other types of barriers, albeit without transitivity. An acquire barrier
+pairs with a release barrier, but both may also pair with other barriers,
+including of course general barriers. A write barrier pairs with a data
+dependency barrier, a control dependency, an acquire barrier, a release
+barrier, a read barrier, or a general barrier. Similarly a read barrier,
+control dependency, or a data dependency barrier pairs with a write
+barrier, an acquire barrier, a release barrier, or a general barrier:

CPU 1 CPU 2
=============== ===============
@@ -850,6 +853,19 @@ Or:
<data dependency barrier>
y = *x;

+Or even:
+
+      CPU 1                    CPU 2
+      ===============          ===============================
+      r1 = ACCESS_ONCE(y);
+      <general barrier>
+      ACCESS_ONCE(y) = 1;      if (r2 = ACCESS_ONCE(x)) {
+                                  <implicit control dependency>
+                                  ACCESS_ONCE(y) = 1;
+                               }
+
+      assert(r1 == 0 || r2 == 0);
+
Basically, the read barrier always has to be there, even though it can be of
the "weaker" type.

2015-02-17 21:52:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 07:36:36PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 08:05:32AM -0800, Paul E. McKenney wrote:
> > > FWIW, we should probably update that table to include control
> > > dependencies too; we didn't (formally) have those back then I think.
> > >
> > > The blob under SMP BARRIER PAIRING does not mention pairing with control
> > > dependencies; and I'm rather sure I've done so.
> >
> > Yep, they should pair as well, though the pairing is limited.
> > No transitivity, of course.
> >
> > So the straightforward approach requires eighteen bits per cell, though
> > some of them are a bit, ummm, "unusual".
>
> Right, I think the idea was to not mark with 'X' when very unusual,
> otherwise you do indeed obtain the below 'trivial' matrix.
>
> > Sixteen of these are given by
> > Scenarios 0-15 in http://lwn.net/Articles/573436/, with the barrier on
> > the side corresponding to the first column and the barrier on the top
> > corresponding to the second column. The seventeenth bit says whether
> > you get transitivity chaining after the top access, assuming that it
> > happens later. The eighteenth bit says whether you get transitivity
> > chaining before the side access, assuming that it happens earlier.
> >
> > The following is a rough first guess, filling in only the diagonal.
> > Some of the entries are no doubt wrong, and getting them right requires
> > something like 7*7*18 test cases, which will take some time. So, is
> > something like this really helpful?
>
>
> > | mb | wmb | rmb | rbd | acq | rel | ctl |
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > mb | 3ffff | X | X | X | X | X | X +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > wmb | X | 01000 | X | X | X | X | X +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rmb | X | X | 00000 | X | X | X | X +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rbd | X | X | X | 00000 | X | X | X +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > acq | X | X | X | X | 00020 | X | X +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rel | X | X | X | X | X | 0cc00 | X +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > ctl | X | X | X | X | X | X | 00020 +
> > -----+-------+-------+-------+-------+-------+-------+-------+
>
> So maybe make two tables; one with 'obvious' pairings, which would
> include things like mb - {mb,rmb,wmb}; rmb-wmb; acq-rel; ctl-mb; etc.
>
> That table is for people to quickly check 'easy'; like yes wmb-rbd makes
> sense and rmb-rbd doesn't appear to make sense, I need more reading up.
>
> After that do the 'funny' table, which will explain further possible
> pairings in more detail, like the rmb-rbd pairing.
>
> I'm not entirely sure we want to do the 7*7*18 state table, that's a lot
> of work to exhaustively generate. We should be lazy and demand fill when
> people come to us.

I could do a table per communication style. For example, message
passing looks like this (give or take likely errors in the table):

Side CPU                     Top CPU
--------                     -------
X = 1;                       r1 = Y;
<some barrier>               <some barrier>
Y = 1;                       r2 = X;

assert(r1 == 0 || r2 == 1);


     |  mb   |  wmb  |  rmb  |  rbd  |  acq  |  rel  |  ctl  |
-----+-------+-------+-------+-------+-------+-------+-------+
  mb |   Y   |       |   Y   |   y   |   Y   |       |   Y   +
-----+-------+-------+-------+-------+-------+-------+-------+
 wmb |   Y   |       |   Y   |   y   |   Y   |       |   Y   +
-----+-------+-------+-------+-------+-------+-------+-------+
 rmb |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+
 rbd |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+
 acq |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+
 rel |   Y   |       |   Y   |   y   |   Y   |       |   Y   +
-----+-------+-------+-------+-------+-------+-------+-------+
 ctl |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+

Here "Y" says that the barrier pair works, "y" says that it can
work but requires an artificial dependency, and " " says that
it does not work.

Thanx, Paul
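
To spell out what a "Y" cell means in practice, here is a minimal kernel-style
sketch of the message-passing test for the wmb/rmb pairing (variable and
function names invented):

    int payload;                  /* the data being handed over (the "X")   */
    int flag;                     /* the "message ready" flag (the "Y")     */

    void side_cpu(void)           /* producer */
    {
        WRITE_ONCE(payload, 1);
        smp_wmb();                /* order the payload store before the
                                   * flag store                             */
        WRITE_ONCE(flag, 1);
    }

    void top_cpu(void)            /* consumer */
    {
        int r1 = READ_ONCE(flag);

        smp_rmb();                /* order the flag load before the
                                   * payload load                           */
        if (r1)
            BUG_ON(READ_ONCE(payload) != 1);   /* i.e. r2 == 1              */
    }

Replacing the flag load with smp_load_acquire() (the acq column) or the flag
store with smp_store_release() (the rel row) gives the same guarantee without
the standalone barrier.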

2015-02-18 13:41:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 01:45:22PM -0800, Paul E. McKenney wrote:
>
> documentation: Clarify control-dependency pairing
>
> This commit explicitly states that control dependencies pair normally
> with other barriers, and gives an example of such pairing.
>
> Reported-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

Thanks!

2015-02-18 13:47:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Feb 17, 2015 at 01:52:31PM -0800, Paul E. McKenney wrote:
> I could do a table per communication style. For example, message
> passing looks like this (give or take likely errors in the table):
>
> Side CPU Top CPU
> -------- -------
> X = 1; r1 = Y;
> <some barrier> <some barrier>
> Y = 1; r2 = X;
>
> assert(r1 == 0 || r2 == 1);
>
>
> | mb | wmb | rmb | rbd | acq | rel | ctl |
> -----+-------+-------+-------+-------+-------+-------+-------+
> mb | Y | | Y | y | Y | | Y +
> -----+-------+-------+-------+-------+-------+-------+-------+
> wmb | Y | | Y | y | Y | | Y +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rmb | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rbd | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
> acq | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rel | Y | | Y | y | Y | | Y +
> -----+-------+-------+-------+-------+-------+-------+-------+
> ctl | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
>
> Here "Y" says that the barrier pair works, "y" says that it can
> work but requires an artificial dependency, and " " says that
> it does not work.

I would maybe do s/artificial/additional/, the pointer deref in RCU is
not really artificial, is it?

Also, how many communication styles do you envision to enumerate?

2015-02-18 15:55:30

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

Thanks Paul and Peter, this was interesting reading ;)

This is almost off-topic (but see below), but perhaps memory-barriers.txt
could also mention spin_unlock_wait() to explain that _most probably_ it is
pointless without the memory barrier(s), and that the barrier before-or-after
unlock_wait() pairs with release-or-acquire.

At the same time, the code like

spin_unlock_wait();
STORE;

_can_ be correct because this implies the load-store control dependency.

On 02/17, Paul E. McKenney wrote:
>
> | mb | wmb | rmb | rbd | acq | rel | ctl |
> -----+-------+-------+-------+-------+-------+-------+-------+
> mb | Y | | Y | y | Y | | Y +
> -----+-------+-------+-------+-------+-------+-------+-------+
> wmb | Y | | Y | y | Y | | Y +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rmb | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rbd | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
> acq | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+
> rel | Y | | Y | y | Y | | Y +
> -----+-------+-------+-------+-------+-------+-------+-------+
> ctl | | | | | | | +
> -----+-------+-------+-------+-------+-------+-------+-------+

OK, so "acq" can't pair with "acq", and I am not sure I understand.

First of all, it is not clear to me how you can even try to pair them
unless you do something like spin_unlock_wait(). I would like to see
an example which is not "obviously wrong".

At the same time, if you play with spin_unlock_wait() or spin_is_locked()
then acq can pair with acq?

Let's look at sem_lock(). I never looked at this code before, I can be
easily wrong. Manfred will correct me. But at first glance we can write
the oversimplified pseudo-code:

spinlock_t local, global;

bool my_lock(bool try_local)
{
    if (try_local) {
        spin_lock(&local);
        if (!spin_is_locked(&global))
            return true;
        spin_unlock(&local);
    }

    spin_lock(&global);
    spin_unlock_wait(&local);
    return false;
}

void my_unlock(bool drop_local)
{
    if (drop_local)
        spin_unlock(&local);
    else
        spin_unlock(&global);
}

it assumes that the "local" lock is cheaper than "global", the usage is

bool xxx = my_lock(condition);
/* CRITICAL SECTION */
my_unlock(xxx);

Now. Unless I missed something, my_lock() does NOT need a barrier BEFORE
spin_unlock_wait() (or spin_is_locked()). Either my_lock(true) should see
spin_is_locked(global) == T, or my_lock(false)->spin_unlock_wait() should
see that "local" is locked and wait.

Doesn't this mean that acq can pair with acq or I am totally confused?



Another question is do we need a barrier AFTER spin_unlock_wait(). I do not
know what ipc/sem.c actually needs, but in general (I think) this does need
mb(). Otherwise my_lock / my_unlock itself does not have the proper acq/rel
semantics. For example, my_lock(false) can miss the changes which were done
under my_lock(true).

So I think that (in theory) sem_wait_array() need smp_mb() at the end. But,
given that we have the control dependency, perhaps smp_rmb() is enough?

Oleg.

2015-02-18 16:01:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

(Forgot to add Manfred, resending)


2015-02-18 16:11:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Wed, Feb 18, 2015 at 04:53:34PM +0100, Oleg Nesterov wrote:
> On 02/17, Paul E. McKenney wrote:
> >
> > | mb | wmb | rmb | rbd | acq | rel | ctl |
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > mb | Y | | Y | y | Y | | Y +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > wmb | Y | | Y | y | Y | | Y +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rmb | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rbd | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > acq | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rel | Y | | Y | y | Y | | Y +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > ctl | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
>
> OK, so "acq" can't pair with "acq", and I am not sure I understand.

Please consider the table in the context of message passing; that is
what Paul proposed. Your example from sysvsems, while interesting, would
not fit the general scenario of message passing.

This too illustrates a problem with that approach, people can't read, so
they'll pick the wrong table to look at.

I really think we should have just the _one_ table, with the obvious pairings
marked; everything not marked in the table needs careful reading.

And acq-acq pairing would be a careful one in my book.

2015-02-18 16:36:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 02/18, Peter Zijlstra wrote:
>
> On Wed, Feb 18, 2015 at 04:53:34PM +0100, Oleg Nesterov wrote:
> > On 02/17, Paul E. McKenney wrote:
> > >
> > > | mb | wmb | rmb | rbd | acq | rel | ctl |
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > mb | Y | | Y | y | Y | | Y +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > wmb | Y | | Y | y | Y | | Y +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > rmb | | | | | | | +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > rbd | | | | | | | +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > acq | | | | | | | +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > rel | Y | | Y | y | Y | | Y +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > ctl | | | | | | | +
> > > -----+-------+-------+-------+-------+-------+-------+-------+
> >
> > OK, so "acq" can't pair with "acq", and I am not sure I understand.
>
> Please consider the table in the context of message passing; that is
> what Paul proposed. Your example from sysvsems, while interesting, would
> not fit the general scenario of message passing.

Ah, yeeees, sorry. Somehow I completely missed that part of Paul's email.

> This too illustrates a problem with that approach, people can't read, so
> they'll pick the wrong table to look at.

At least we know that I certainly can't ;)

Thanks,

Oleg.

Subject: [tip:sched/core] sched: Clarify ordering between task_rq_lock() and move_queued_task()

Commit-ID: 74b8a4cb6ce3685049ee124243a52238c5cabe55
Gitweb: http://git.kernel.org/tip/74b8a4cb6ce3685049ee124243a52238c5cabe55
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 17 Feb 2015 13:07:38 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 18 Feb 2015 14:27:28 +0100

sched: Clarify ordering between task_rq_lock() and move_queued_task()

There was a wee bit of confusion around the exact ordering here;
clarify things.

Reported-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1f37fe7..fd0a6ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -341,6 +341,22 @@ static struct rq *task_rq_lock(struct task_struct *p, unsigned long *flags)
raw_spin_lock_irqsave(&p->pi_lock, *flags);
rq = task_rq(p);
raw_spin_lock(&rq->lock);
+        /*
+         * move_queued_task()             task_rq_lock()
+         *
+         *   ACQUIRE (rq->lock)
+         *   [S] ->on_rq = MIGRATING      [L] rq = task_rq()
+         *   WMB (__set_task_cpu())       ACQUIRE (rq->lock);
+         *   [S] ->cpu = new_cpu          [L] task_rq()
+         *                                [L] ->on_rq
+         *   RELEASE (rq->lock)
+         *
+         * If we observe the old cpu in task_rq_lock, the acquire of
+         * the old rq->lock will fully serialize against the stores.
+         *
+         * If we observe the new cpu in task_rq_lock, the acquire will
+         * pair with the WMB to ensure we must then also see migrating.
+         */
if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
return rq;
raw_spin_unlock(&rq->lock);

2015-02-18 18:44:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Wed, Feb 18, 2015 at 02:47:06PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 17, 2015 at 01:52:31PM -0800, Paul E. McKenney wrote:
> > I could do a table per communication style. For example, message
> > passing looks like this (give or take likely errors in the table):
> >
> > Side CPU Top CPU
> > -------- -------
> > X = 1; r1 = Y;
> > <some barrier> <some barrier>
> > Y = 1; r2 = X;
> >
> > assert(r1 == 0 || r2 == 1);
> >
> >
> > | mb | wmb | rmb | rbd | acq | rel | ctl |
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > mb | Y | | Y | y | Y | | Y +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > wmb | Y | | Y | y | Y | | Y +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rmb | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rbd | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > acq | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > rel | Y | | Y | y | Y | | Y +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> > ctl | | | | | | | +
> > -----+-------+-------+-------+-------+-------+-------+-------+
> >
> > Here "Y" says that the barrier pair works, "y" says that it can
> > work but requires an artificial dependency, and " " says that
> > it does not work.
>
> I would maybe do s/artificial/additional/, the pointer deref in RCU is
> not really artificial, is it?

Ah, but RCU is a slightly different pattern because you do have to have
a dependency, so you need a different litmus test:

Initial state: X = 0, Y = &Z, Z = 2

Side CPU                     Top CPU
--------                     -------
X = 1;                       r1 = READ_ONCE(Y);
<some barrier>               <some barrier>
WRITE_ONCE(Y, &X);           r2 = *r1;

assert(r1 == &Z || r2 == 1);

     |  mb   |  wmb  |  rmb  |  rbd  |  acq  |  rel  |  ctl  |
-----+-------+-------+-------+-------+-------+-------+-------+
  mb |   Y   |       |   Y   |   Y   |   Y   |       |   Y   +
-----+-------+-------+-------+-------+-------+-------+-------+
 wmb |   Y   |       |   Y   |   Y   |   Y   |       |   Y   +
-----+-------+-------+-------+-------+-------+-------+-------+
 rmb |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+
 rbd |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+
 acq |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+
 rel |   Y   |       |   Y   |   Y   |   Y   |       |   Y   +
-----+-------+-------+-------+-------+-------+-------+-------+
 ctl |       |       |       |       |       |       |       +
-----+-------+-------+-------+-------+-------+-------+-------+

But of course you should instead use rcu_assign_pointer() and
rcu_dereference(). And I should also have used READ_ONCE() and
WRITE_ONCE() in my earlier example.

> Also, how many communication styles do you envision to enumerate?

As few as possible, but yes, that is likely to be a bit of a problem.
To get an idea of how big it could get if not contained somehow, take
a look at http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf.
20 of them on the first page alone! ;-)

We most definitely need to stick to the lowest common denominator
if we are taking this approach.

Thanx, Paul
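
For completeness, the rcu_assign_pointer()/rcu_dereference() form mentioned
above, written out as a minimal sketch of the same litmus test (struct and
variable names invented, sparse annotations glossed over):

    struct foo {
        int a;
    };

    struct foo z = { .a = 2 };    /* plays the role of "Z"                  */
    struct foo x = { .a = 0 };    /* plays the role of "X"                  */
    struct foo __rcu *y = &z;     /* plays the role of "Y"                  */

    void side_cpu(void)           /* updater */
    {
        x.a = 1;                         /* "X = 1"                         */
        rcu_assign_pointer(y, &x);       /* release: the store to x.a is
                                          * ordered before the publication  */
    }

    void top_cpu(void)            /* reader */
    {
        struct foo *p;
        int r2;

        rcu_read_lock();
        p = rcu_dereference(y);   /* "r1 = READ_ONCE(Y)"                    */
        r2 = p->a;                /* "r2 = *r1": the address dependency
                                   * orders this load after the one above   */
        BUG_ON(p != &z && r2 != 1);      /* assert(r1 == &Z || r2 == 1)     */
        rcu_read_unlock();
    }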

2015-02-18 19:14:08

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

Hi Oleg,

On 02/18/2015 04:59 PM, Oleg Nesterov wrote:
> Let's look at sem_lock(). I never looked at this code before, I can be
> easily wrong. Manfred will correct me. But at first glance we can write
> the oversimplified pseudo-code:
>
> spinlock_t local, global;
>
> bool my_lock(bool try_local)
> {
>     if (try_local) {
>         spin_lock(&local);
>         if (!spin_is_locked(&global))
>             return true;
>         spin_unlock(&local);
>     }
>
>     spin_lock(&global);
>     spin_unlock_wait(&local);
>     return false;
> }
>
> void my_unlock(bool drop_local)
> {
>     if (drop_local)
>         spin_unlock(&local);
>     else
>         spin_unlock(&global);
> }
>
> it assumes that the "local" lock is cheaper than "global", the usage is
>
> bool xxx = my_lock(condition);
> /* CRITICAL SECTION */
> my_unlock(xxx);
>
> Now. Unless I missed something, my_lock() does NOT need a barrier BEFORE
> spin_unlock_wait() (or spin_is_locked()). Either my_lock(true) should see
> spin_is_locked(global) == T, or my_lock(false)->spin_unlock_wait() should
> see that "local" is locked and wait.
I would agree:
There is no need for a barrier. spin_unlock_wait() is just a read, the
barriers are from spin_lock() and spin_unlock().

The barrier exists to protect something like a "force_global" flag
(complex_count)

> spinlock_t local, global;
> bool force_global;
>
> bool my_lock(bool try_local)
> {
>     if (try_local) {
>         spin_lock(&local);
>         if (!spin_is_locked(&global)) {
>             if (!force_global) {
>                 return true;
>             }
>         }
>         spin_unlock(&local);
>     }
>
>     spin_lock(&global);
>     spin_unlock_wait(&local);
>     return false;
> }
>
> void my_unlock(bool drop_local)
> {
>     if (drop_local)
>         spin_unlock(&local);
>     else
>         spin_unlock(&global);
> }

force_global can only be set by the owner of &global.

> Another question is do we need a barrier AFTER spin_unlock_wait(). I do not
> know what ipc/sem.c actually needs, but in general (I think) this does need
> mb(). Otherwise my_lock / my_unlock itself does not have the proper acq/rel
> semantics. For example, my_lock(false) can miss the changes which were done
> under my_lock(true).
How could that happen?
I thought that

    thread A:
        protected_var = 1234;
        spin_unlock(&lock_a)

    thread B:
        spin_lock(&lock_b)
        if (protected_var)

is safe, i.e., that there is no need for the acquire and the release to be
done on the same lock.

--
Manfred

2015-02-18 19:23:52

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Wed, Feb 18, 2015 at 05:32:56PM +0100, Oleg Nesterov wrote:
> On 02/18, Peter Zijlstra wrote:
> >
> > On Wed, Feb 18, 2015 at 04:53:34PM +0100, Oleg Nesterov wrote:
> > > On 02/17, Paul E. McKenney wrote:
> > > >
> > > > | mb | wmb | rmb | rbd | acq | rel | ctl |
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > mb | Y | | Y | y | Y | | Y +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > wmb | Y | | Y | y | Y | | Y +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > rmb | | | | | | | +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > rbd | | | | | | | +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > acq | | | | | | | +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > rel | Y | | Y | y | Y | | Y +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > > > ctl | | | | | | | +
> > > > -----+-------+-------+-------+-------+-------+-------+-------+
> > >
> > > OK, so "acq" can't pair with "acq", and I am not sure I understand.
> >
> > Please consider the table in the context of message passing; that is
> > what Paul proposed. Your example from sysvsems, while interesting, would
> > not fit the general scenario of message passing.
>
> Ah, yeeees, sorry. Somehow I completely missed that part of Paul's email.
>
> > This too illustrates a problem with that approach, people can't read, so
> > they'll pick the wrong table to look at.
>
> At least we know that I certainly can't ;)

Nobody can! ;-)

Thanx, Paul

2015-02-18 22:43:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Wed, Feb 18, 2015 at 08:14:01PM +0100, Manfred Spraul wrote:

> > spinlock_t local, global;
> > bool force_global;
> >
> > bool my_lock(bool try_local)
> > {
> >     if (try_local) {
> >         spin_lock(&local);
> >         if (!spin_is_locked(&global)) {
> >             if (!force_global) {
> >                 return true;
> >             }
> >         }
> >         spin_unlock(&local);
> >     }
> >
> >     spin_lock(&global);
> >     spin_unlock_wait(&local);
> >     return false;
> > }
> >
> > void my_unlock(bool drop_local)
> > {
> >     if (drop_local)
> >         spin_unlock(&local);
> >     else
> >         spin_unlock(&global);
> > }

> >Another question is do we need a barrier AFTER spin_unlock_wait(). I do not
> >know what ipc/sem.c actually needs, but in general (I think) this does need
> >mb(). Otherwise my_lock / my_unlock itself does not have the proper acq/rel
> >semantics. For example, my_lock(false) can miss the changes which were done
> >under my_lock(true).

> How could that happen?
> I thought that
> thread A:
> protected_var = 1234;
> spin_unlock(&lock_a)
>
> thread B:
> spin_lock(&lock_b)
> if (protected_var)

> is safe. i.e, there is no need that acquire and releases is done on the same pointer.

Well, just those four statements can of course be executed like:

CPU0                         CPU1

                             spin_lock(&b)
                             if (prot_var)

prot_var = 1;
spin_unlock(&a);

And you would see the old var. Lock a and b are completely independent
here.

Now of course the local/global thing in sysvsem is more complex.

As to what Oleg meant:

X := 0

CPU0                         CPU1

spin_lock(&global);
                             spin_lock(&local);
                             X = 1;
                             spin_unlock(&local);
spin_unlock_wait(&local);

assert(X == 1); /* BOOM */

that assert can trigger, because spin_unlock_wait() is just reads: the read
of X can be lifted over and above them, before the assignment of X on CPU1.

Again, the sysvsem code is slightly more complex, but I think Oleg is
right, there is no guarantee you'll observe the full critical section of
sem->lock if sem_lock() takes the slow path and does sem_wait_array(),
because of the above.
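
Concretely, the fix Oleg was pointing at earlier would be a barrier right
after the spin_unlock_wait() in the slow path of the my_lock() pseudo-code;
a rough sketch only, not the actual ipc/sem.c code:

    spin_lock(&global);
    spin_unlock_wait(&local);    /* wait for a concurrent my_lock(true)
                                  * holder to drop the per-object lock ...  */
    smp_mb();                    /* ... and keep the loads and stores of our
                                  * critical section from being reordered
                                  * before that wait; Oleg suggests smp_rmb()
                                  * may be enough, since the control
                                  * dependency already orders later stores  */
    return false;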

2015-02-19 14:21:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 02/18, Peter Zijlstra wrote:
>
> On Wed, Feb 18, 2015 at 08:14:01PM +0100, Manfred Spraul wrote:
>
> > >spinlock_t local, global;
> > >bool force_global;

Yes, force_global (sma->complex_count) adds more complications, but
I think we can ignore it in this discussion.

> > >bool my_lock(bool try_local)
> > >{
> > >	if (try_local) {
> > >		spin_lock(&local);
> > >		if (!spin_is_locked(&global)) {
> > >			if (!force_global) {
> > >				return true;
> > >			}
> > >		}
> > >		spin_unlock(&local);
> > >	}
> > >
> > >	spin_lock(&global);
> > >	spin_unlock_wait(&local);
> > >	return false;
> > >}
> > >
> > >void my_unlock(bool drop_local)
> > >{
> > >	if (drop_local)
> > >		spin_unlock(&local);
> > >	else
> > >		spin_unlock(&global);
> > >}
>
> > >Another question is do we need a barrier AFTER spin_unlock_wait(). I do not
> > >know what ipc/sem.c actually needs, but in general (I think) this does need
> > >mb(). Otherwise my_lock / my_unlock itself does not have the proper acq/rel
> > >semantics. For example, my_lock(false) can miss the changes which were done
> > >under my_lock(true).
>
> > How could that happen?
> > I thought that
> > thread A:
> > protected_var = 1234;
> > spin_unlock(&lock_a)
> >
> > thread B:
> > spin_lock(&lock_b)
> > if (protected_var)
>
> > is safe, i.e. there is no need for the acquire and the release to be done on the same pointer.
>
> Well, just those four statements can of course be executed like:
>
> CPU0 CPU1
>
> spin_lock(&b)
> if (prot_var)
>
> prot_var = 1;
> spin_unlock(&a);
>
> And you would see the old var. Lock a and b are completely independent
> here.
>
> Now of course the local/global thing in sysvsem is more complex.
>
> As to what Oleg meant:
>
> X := 0
>
> CPU0 CPU1
>
> spin_lock(&global);
> spin_lock(&local);
> X = 1;
> spin_unlock(&local);
> spin_unlock_wait(&local);
>
> assert(X == 1); /* BOOM */
>
> that assert can trigger, because spin_unlock_wait() is only reads: the read
> of X can be lifted above those reads and so happen before the assignment of
> X on CPU1 becomes visible.
>
> Again, the sysvsem code is slightly more complex, but I think Oleg is
> right, there is no guarantee you'll observe the full critical section of
> sem->lock if sem_lock() takes the slow path and does sem_wait_array(),
> because of the above.

Yes, thanks Peter.

Or another artificial example,

int X = 0, Y = 0;

void func(void)
{
bool xxx = my_lock(rand());

BUG_ON(X != Y);

++X; ++Y;

my_unlock(xxx);
}

If func() can race with itself it can hit BUG_ON() above unless my_lock()
has the barriers after spin_unlock_wait() and spin_is_locked().

We need the full barrier to serialize STOREs as well, but probably we can
rely on the control dependency and thus we only need rmb().
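
To make the placement concrete, here is a sketch of the earlier my_lock()
pseudo-code with the rmb()-only variant applied (assuming, as above, that the
control dependency is enough to order the subsequent stores):

bool my_lock(bool try_local)
{
	if (try_local) {
		spin_lock(&local);
		if (!spin_is_locked(&global) && !force_global) {
			/* Order the loads of the critical section after
			   the spin_is_locked() check. */
			smp_rmb();
			return true;
		}
		spin_unlock(&local);
	}

	spin_lock(&global);
	spin_unlock_wait(&local);
	/* Order the loads of the critical section after observing
	   &local unlocked. */
	smp_rmb();
	return false;
}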

And note that sem_lock() already has rmb() after spin_is_locked() to ensure
that we can't miss ->complex_count != 0, which can be changed under
sem_perm.lock ("global" lock in the pseudo code above). This is correct, but
the same reasoning applies to any other change made under the "global" lock:
it is only because of that rmb() that we can't miss it.

And the same is true for spin_unlock_wait() in the "slow" path.

Again, again, I do not know, perhaps sem.c is fine. For example, perhaps
sem_perm.lock doesn't need to fully serialize with sem_base[sem_num].lock.
But this is not obvious, and spin_unlock_wait() without a barrier looks
suspicious, at least this needs a comment imo.

Especially because it looks as if sem_base[sem_num].lock can't even fully
serialize with itself, sem_lock(nsops => -1) on the 3rd CPU can force one
of the lockers to switch to the "global" lock.

Oleg.

2015-02-20 18:28:22

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

Hi Oleg,

my example was bad, let's continue with your example.

And: If sem_lock() needs another smp_xmb(), then we must add it:
Some apps do not have a user space hot path, i.e. it seems that on some
setups, we have millions of calls per second.
If there is a race, then it will happen.

I've tried to merge your example:
>
> int X = 0, Y = 0;
>
> void func(void)
> {
> bool ll = rand();
>
> if (ll) {
> spin_lock(&local);
> if (!spin_is_locked(&global))
> goto done;
> spin_unlock(&local);
> }
> ll = false;
> spin_lock(&global);
> spin_unlock_wait(&local);
> done:
> smp_rmb(); <<<<<<<<<<<<<<<
> BUG_ON(X != Y);
>
> ++X; ++Y;
>
> if (ll)
> spin_unlock(&local);
> else
> spin_unlock(&global);
> }
I agree, we need the smp_rmb().
I'll write a patch.

> We need the full barrier to serialize STOREs as well, but probably we can
> rely on the control dependency and thus we only need rmb().
Do we need a full barrier or not?

I don't manage to create a proper line of reasoning.
--
Manfred

2015-02-20 18:46:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Fri, Feb 20, 2015 at 07:28:16PM +0100, Manfred Spraul wrote:

> >We need the full barrier to serialize STOREs as well, but probably we can
> >rely on the control dependency and thus we only need rmb().
> Do we need a full barrier or not?
>
> I don't manage to create a proper line of reasoning.

I think I agree with Oleg in that we only need the smp_rmb(); of course
that wants a somewhat elaborate comment to go along with it. How about
something like so:

spin_unlock_wait(&local);
/*
* The above spin_unlock_wait() forms a control dependency with
* any following stores; because we must first observe the lock
* unlocked and we cannot speculate stores.
*
* Subsequent loads however can easily pass through the loads
* represented by spin_unlock_wait() and therefore we need the
* read barrier.
*
* This together is stronger than ACQUIRE for @local and
* therefore we will observe the complete prior critical section
* of @local.
*/
smp_rmb();

The obvious alternative is using spin_unlock_wait() with an
smp_load_acquire(), but that might be more expensive on some archs due
to repeated issuing of memory barriers.
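
For comparison, a sketch of that alternative: a hypothetical unlock-wait loop
built on smp_load_acquire(). No such variant exists in mainline and the
->locked field name is an assumption; the point is only that every iteration
of the loop issues an acquire barrier.

static inline void sketch_spin_unlock_wait_acquire(arch_spinlock_t *lock)
{
	/* ->locked is a made-up field name for illustration. */
	while (smp_load_acquire(&lock->locked))
		cpu_relax();
}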

2015-02-20 20:26:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 02/20, Peter Zijlstra wrote:
>
> I think I agree with Oleg in that we only need the smp_rmb(); of course
> that wants a somewhat elaborate comment to go along with it. How about
> something like so:
>
> spin_unlock_wait(&local);
> /*
> * The above spin_unlock_wait() forms a control dependency with
> * any following stores; because we must first observe the lock
> * unlocked and we cannot speculate stores.
> *
> * Subsequent loads however can easily pass through the loads
> * represented by spin_unlock_wait() and therefore we need the
> * read barrier.
> *
> * This together is stronger than ACQUIRE for @local and
> * therefore we will observe the complete prior critical section
> * of @local.
> */
> smp_rmb();
>
> The obvious alternative is using spin_unlock_wait() with an
> smp_load_acquire(), but that might be more expensive on some archs due
> to repeated issuing of memory barriers.

Yes, yes, thanks!

But note that we need the same comment after sem_lock()->spin_is_locked().

So perhaps we can add this comment into include/linux/spinlock.h? In this
case perhaps it makes sense to add, say,

#define smp_mb__after_unlock_wait() smp_rmb()

with this comment above? Another potential user is task_work_run(). It could
use rmb() too, but this again needs the same fat comment.

What do you think?
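
For illustration, a sketch of how such a helper might look in
include/linux/spinlock.h, with the comment folded in (purely a sketch; no
such macro exists in mainline):

/*
 * spin_unlock_wait() and spin_is_locked() only perform loads, but they
 * form a control dependency with any following stores: we must first
 * observe the lock unlocked, and stores cannot be speculated.
 *
 * Subsequent loads, however, can easily pass through those loads, so a
 * read barrier is needed.  Together this is stronger than ACQUIRE for
 * the lock in question, and therefore we will observe the complete
 * prior critical section.
 */
#define smp_mb__after_unlock_wait()	smp_rmb()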

Oleg.

2015-02-21 03:26:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Fri, Feb 20, 2015 at 07:28:16PM +0100, Manfred Spraul wrote:
> Hi Oleg,
>
> my example was bad, let's continue with your example.
>
> And: If sem_lock() needs another smp_xmb(), then we must add it:
> Some apps do not have a user space hot path, i.e. it seems that on
> some setups, we have millions of calls per second.
> If there is a race, then it will happen.
>
> I've tried to merge your example:
> >
> > int X = 0, Y = 0;
> >
> > void func(void)
> > {
> > bool ll = rand();
> >
> > if (ll) {
> > spin_lock(&local);
> > if (!spin_is_locked(&global))
> > goto done;
> > spin_unlock(&local);
> > }
> > ll = false;
> > spin_lock(&global);
> > spin_unlock_wait(&local);
> > done:
> > smp_rmb(); <<<<<<<<<<<<<<<
> > BUG_ON(X != Y);
> >
> > ++X; ++Y;
> >
> > if (ll)
> > spin_unlock(&local);
> > else
> > spin_unlock(&global);
> > }
> I agree, we need the smp_rmb().
> I'll write a patch.
>
> >We need the full barrier to serialize STOREs as well, but probably we can
> >rely on the control dependency and thus we only need rmb().
> Do we need a full barrier or not?
>
> I don't manage to create a proper line of reasoning.

This has to be one of the more bizarre forms of Dekker's algorithm
that I have seen. ;-)

I am going to have to put this through one of the tools...

Thanx, Paul

2015-02-21 12:54:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Fri, Feb 20, 2015 at 09:23:19PM +0100, Oleg Nesterov wrote:
> On 02/20, Peter Zijlstra wrote:
> >
> > I think I agree with Oleg in that we only need the smp_rmb(); of course
> > that wants a somewhat elaborate comment to go along with it. How about
> > something like so:
> >
> > spin_unlock_wait(&local);
> > /*
> > * The above spin_unlock_wait() forms a control dependency with
> > * any following stores; because we must first observe the lock
> > * unlocked and we cannot speculate stores.
> > *
> > * Subsequent loads however can easily pass through the loads
> > * represented by spin_unlock_wait() and therefore we need the
> > * read barrier.
> > *
> > * This together is stronger than ACQUIRE for @local and
> > * therefore we will observe the complete prior critical section
> > * of @local.
> > */
> > smp_rmb();
> >
> > The obvious alternative is using spin_unlock_wait() with an
> > smp_load_acquire(), but that might be more expensive on some archs due
> > to repeated issuing of memory barriers.
>
> Yes, yes, thanks!
>
> But note that we need the same comment after sem_lock()->spin_is_locked().
>
> So perhaps we can add this comment into include/linux/spinlock.h? In this
> case perhaps it makes sense to add, say,
>
> #define smp_mb__after_unlock_wait() smp_rmb()
>
> with this comment above? Another potential user is task_work_run(). It could
> use rmb() too, but this again needs the same fat comment.
>
> What do you think?

Sure, that works.

2015-02-23 18:29:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Fri, Feb 20, 2015 at 07:26:21PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 20, 2015 at 07:28:16PM +0100, Manfred Spraul wrote:
> > Hi Oleg,
> >
> > my example was bad, let's continue with your example.
> >
> > And: If sem_lock() needs another smp_xmb(), then we must add it:
> > Some apps do not have a user space hot path, i.e. it seems that on
> > some setups, we have millions of calls per second.
> > If there is a race, then it will happen.
> >
> > I've tried to merge your example:
> > >
> > > int X = 0, Y = 0;
> > >
> > > void func(void)
> > > {
> > > bool ll = rand();
> > >
> > > if (ll) {
> > > spin_lock(&local);
> > > if (!spin_is_locked(&global))
> > > goto done;
> > > spin_unlock(&local);
> > > }
> > > ll = false;
> > > spin_lock(&global);
> > > spin_unlock_wait(&local);
> > > done:
> > > smp_rmb(); <<<<<<<<<<<<<<<
> > > BUG_ON(X != Y);
> > >
> > > ++X; ++Y;
> > >
> > > if (ll)
> > > spin_unlock(&local);
> > > else
> > > spin_unlock(&global);
> > > }
> > I agree, we need the smp_rmb().
> > I'll write a patch.
> >
> > >We need the full barrier to serialize STOREs as well, but probably we can
> > >rely on the control dependency and thus we only need rmb().
> > Do we need a full barrier or not?
> >
> > I don't manage to create a proper line of reasoning.
>
> This has to be one of the more bizarre forms of Dekker's algorithm
> that I have seen. ;-)
>
> I am going to have to put this through one of the tools...

And this was just me getting confused by memories of old versions of
the code. This will work given current mainline code. Please accept
my apologies for the noise.

And yes, you do need the smp_rmb() to ensure that the BUG_ON() happens
after the other guy releases his spinlock.

Thanx, Paul

2015-04-26 11:08:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Fri, Feb 20, 2015 at 07:45:51PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 20, 2015 at 07:28:16PM +0100, Manfred Spraul wrote:
>
> > >We need the full barrier to serialize STOREs as well, but probably we can
> > >rely on the control dependency and thus we only need rmb().
> > Do we need a full barrier or not?
> >
> > I don't manage to create a proper line of reasoning.
>
> I think I agree with Oleg in that we only need the smp_rmb(); of course
> that wants a somewhat elaborate comment to go along with it. How about
> something like so:
>
> spin_unlock_wait(&local);
> /*
> * The above spin_unlock_wait() forms a control dependency with
> * any following stores; because we must first observe the lock
> * unlocked and we cannot speculate stores.
> *
> * Subsequent loads however can easily pass through the loads
> * represented by spin_unlock_wait() and therefore we need the
> * read barrier.
> *
> * This together is stronger than ACQUIRE for @local and
> * therefore we will observe the complete prior critical section
> * of @local.
> */
> smp_rmb();
>
> The obvious alternative is using spin_unlock_wait() with an
> smp_load_acquire(), but that might be more expensive on some archs due
> to repeated issuing of memory barriers.

Please accept my apologies for the late reply. And thank you to Peter
for prodding me (again!) to think more carefully about this. Because we
have a bit of a comedy of errors at the moment. I will start with
mine, namely that we need to support DEC Alpha, which means we cannot
use ACCESS_ONCE() or READ_ONCE() to head a control-dependency chain.
Alpha needs an mb instruction for control dependencies, just as it does
for data dependencies.

I have the following (untested, probably does not even build) patch
queued to introduce READ_ONCE_CTRL() to head control dependencies, making
them safe for DEC Alpha. And, just as important, also making them more
self-documenting.

But the fun doesn't end there! Some architectures have "interesting"
implementations of spin_unlock_wait() and arch_spin_is_locked(). I will
deal with these in a follow-up email.

In the meantime, thoughts on the patch below?

Thanx, Paul

------------------------------------------------------------------------

smp: Make control dependencies work on Alpha, improve documentation

The current formulation of control dependencies fails on DEC Alpha, which
does not respect dependencies of any kind unless an explicit memory is
provided. This means that the current formulation of control dependencies
fails on Alpha. This commit therefore creates a READ_ONCE_CTRL() that
has the same overhead on non-Alpha systems, but causes Alpha to produce
the needed ordering. This commit also applies READ_ONCE_CTRL() to the
one known use of control dependencies.

Use of READ_ONCE_CTRL() also has the beneficial effect of adding a bit
of self-documentation to control dependencies.

Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 52c320e3f107..16897552dc1c 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -617,16 +617,16 @@ case what's actually required is:
However, stores are not speculated. This means that ordering -is- provided
for load-store control dependencies, as in the following example:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q) {
ACCESS_ONCE(b) = p;
}

-Control dependencies pair normally with other types of barriers.
-That said, please note that ACCESS_ONCE() is not optional! Without the
-ACCESS_ONCE(), might combine the load from 'a' with other loads from
-'a', and the store to 'b' with other stores to 'b', with possible highly
-counterintuitive effects on ordering.
+Control dependencies pair normally with other types of barriers. That
+said, please note that READ_ONCE_CTRL() is not optional! Without the
+READ_ONCE_CTRL(), the compiler might combine the load from 'a' with
+other loads from 'a', and the store to 'b' with other stores to 'b',
+with possible highly counterintuitive effects on ordering.

Worse yet, if the compiler is able to prove (say) that the value of
variable 'a' is always non-zero, it would be well within its rights
@@ -636,12 +636,15 @@ as follows:
q = a;
b = p; /* BUG: Compiler and CPU can both reorder!!! */

-So don't leave out the ACCESS_ONCE().
+Finally, the READ_ONCE_CTRL() includes an smp_read_barrier_depends()
+that DEC Alpha needs in order to respect control depedencies.
+
+So don't leave out the READ_ONCE_CTRL().

It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q) {
barrier();
ACCESS_ONCE(b) = p;
@@ -655,7 +658,7 @@ branches of the "if" statement as follows:
Unfortunately, current compilers will transform this as follows at high
optimization levels:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
barrier();
ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
if (q) {
@@ -685,7 +688,7 @@ memory barriers, for example, smp_store_release():
In contrast, without explicit memory barriers, two-legged-if control
ordering is guaranteed only when the stores differ, for example:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q) {
ACCESS_ONCE(b) = p;
do_something();
@@ -694,14 +697,14 @@ ordering is guaranteed only when the stores differ, for example:
do_something_else();
}

-The initial ACCESS_ONCE() is still required to prevent the compiler from
-proving the value of 'a'.
+The initial READ_ONCE_CTRL() is still required to prevent the compiler
+from proving the value of 'a'.

In addition, you need to be careful what you do with the local variable 'q',
otherwise the compiler might be able to guess the value and again remove
the needed conditional. For example:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q % MAX) {
ACCESS_ONCE(b) = p;
do_something();
@@ -714,7 +717,7 @@ If MAX is defined to be 1, then the compiler knows that (q % MAX) is
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
ACCESS_ONCE(b) = p;
do_something_else();

@@ -725,7 +728,7 @@ is gone, and the barrier won't bring it back. Therefore, if you are
relying on this ordering, you should make sure that MAX is greater than
one, perhaps as follows:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
if (q % MAX) {
ACCESS_ONCE(b) = p;
@@ -742,14 +745,15 @@ of the 'if' statement.
You must also be careful not to rely too much on boolean short-circuit
evaluation. Consider this example:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (a || 1 > 0)
ACCESS_ONCE(b) = 1;

-Because the second condition is always true, the compiler can transform
-this example as following, defeating control dependency:
+Because the first condition cannot fault and the second condition is
+always true, the compiler can transform this example as following,
+defeating control dependency:

- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
ACCESS_ONCE(b) = 1;

This example underscores the need to ensure that the compiler cannot
@@ -762,8 +766,8 @@ demonstrated by two related examples, with the initial values of
x and y both being zero:

CPU 0 CPU 1
- ===================== =====================
- r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y);
+ ======================= =======================
+ r1 = READ_ONCE_CTRL(x); r2 = READ_ONCE_CTRL(y);
if (r1 > 0) if (r2 > 0)
ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1;

@@ -783,7 +787,8 @@ But because control dependencies do -not- provide transitivity, the above
assertion can fail after the combined three-CPU example completes. If you
need the three-CPU example to provide ordering, you will need smp_mb()
between the loads and stores in the CPU 0 and CPU 1 code fragments,
-that is, just before or just after the "if" statements.
+that is, just before or just after the "if" statements. Furthermore,
+the original two-CPU example is very fragile and should be avoided.

These two examples are the LB and WWC litmus tests from this paper:
http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this
@@ -791,6 +796,12 @@ site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html.

In summary:

+ (*) Control dependencies must be headed by READ_ONCE_CTRL().
+ Or, as a much less preferable alternative, the chain can be
+ headed by READ_ONCE() or an ACCESS_ONCE() read, with an
+ smp_read_barrier_depends() interposed between this read and
+ the control-dependent write.
+
(*) Control dependencies can order prior loads against later stores.
However, they do -not- guarantee any other sort of ordering:
Not prior loads against later loads, nor prior stores against
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 1b45e4a0519b..a57eacde2b84 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -264,6 +264,22 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
#define WRITE_ONCE(x, val) \
({ typeof(x) __val = (val); __write_once_size(&(x), &__val, sizeof(__val)); __val; })

+/**
+ * READ_ONCE_CTRL - Read a value heading a control dependency
+ * @x: The value to be read, heading the control dependency
+ *
+ * Control dependencies are tricky. See Documentation/memory-barriers.txt
+ * for important information on how to use them. Note that in many cases,
+ * use of smp_load_acquire() will be much simpler. Control dependencies
+ * should be avoided except on the hottest of hotpaths.
+ */
+#define READ_ONCE_CTRL(x) \
+({ \
+ typeof(x) __val = READ_ONCE(x); \
+ smp_read_barrier_depends(); /* Enforce control dependency. */ \
+ __val; \
+})
+
#endif /* __KERNEL__ */

#endif /* __ASSEMBLY__ */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index eadb95ce7aac..67548d5de4cb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -141,7 +141,7 @@ int perf_output_begin(struct perf_output_handle *handle,
perf_output_get_handle(handle);

do {
- tail = ACCESS_ONCE(rb->user_page->data_tail);
+ tail = READ_ONCE_CTRL(rb->user_page->data_tail);
offset = head = local_read(&rb->head);
if (!rb->overwrite &&
unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))

2015-04-26 11:03:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Sat, Apr 25, 2015 at 12:56:02PM -0700, Paul E. McKenney wrote:
> On Fri, Feb 20, 2015 at 07:45:51PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 20, 2015 at 07:28:16PM +0100, Manfred Spraul wrote:
> >
> > > >We need the full barrier to serialize STOREs as well, but probably we can
> > > >rely on the control dependency and thus we only need rmb().
> > > Do we need a full barrier or not?
> > >
> > > I don't manage to create a proper line of reasoning.
> >
> > I think I agree with Oleg in that we only need the smp_rmb(); of course
> > that wants a somewhat elaborate comment to go along with it. How about
> > something like so:
> >
> > spin_unlock_wait(&local);

And then an smp_read_barrier_depends() would be needed either here
or embedded in spin_unlock_wait(). But we also need to check the
spin_unlock_wait() implementations to see if any are potentially
vulnerable to compiler misbehavior due to lack of ACCESS_ONCE(),
READ_ONCE(), or other sources of the required volatility:

o alpha: lacks ACCESS_ONCE() or READ_ONCE(), though admittedly the
potential for damage is at least somewhat limited by the fact
that smp_read_barrier_depends() is a full memory barrier on Alpha.

o arc: lacks ACCESS_ONCE() or READ_ONCE(), but field is volatile,
so OK.

o arm: looks good, the READ_ONCE() in arch_spin_is_locked()
should do it.

o arm64: ditto.

o blackfin: Drops into assembly, which should keep the compiler
suitably constrained.

o cris: lacks ACCESS_ONCE() or READ_ONCE(). Not sure where SMP
cris is getting its arch_spinlock_t from, though. Except for
UP, which gets volatile fields.

o hexagon: lacks ACCESS_ONCE() or READ_ONCE(), but field is volatile,
so OK.

o ia64: Uses ACCESS_ONCE(), so OK.

o m32r: Volatile cast, so OK. READ_ONCE() might be nicer.

o metag: Drops into assembly, so OK.

o metag #2: volatile declaration on field, so OK.

o mips: ACCESS_ONCE(), so OK.

o mn10300: Volatile cast, so OK. READ_ONCE() might be nicer.

o parisc: Obscure but very real volatile cast, so OK.
Or, in another branch of #ifdef, volatile field, so also OK.

o powerpc: Volatile field, so OK.

o s390: ACCESS_ONCE(), so OK.

o sh: Volatile field, so OK.

o sparc: Volatile cast, so OK. READ_ONCE() might be nicer.
Or, in another path, volatile field, so also OK.

o tile: For 32-bit, looks like a bug. Compares ->current_ticket and
->next_ticket with no obvious protection. The compiler is free to
load them in either order, so it is possible that the two fields
could compare equal despite never having actually been equal at
any given time. Needs something like arm, arm64, mips, or x86
to do single fetch, then compare fields in quantity fetched.

Except that this appears to be using int on a 32-bit system,
thus might not have a 64-bit load. If that is the case, the
trick would be to load them in order. Except that this can be
defeated by overflow. Are there really 32-bit tile systems with
enough CPUs to overflow an unsigned short?

For 64-bit, a READ_ONCE() appears to be in order -- no obvious
volatility present.

o x86: Uses READ_ONCE(), so OK.

o xtensa: Uses volatile field, so OK.

o UP: Uses volatile fields, so OK.

So tile needs help in addition to the need for smp_read_barrier_depends().

One could argue that only Alpha really needs smp_read_barrier_depends(),
so the others can avoid it. No opinion myself, at least not at the moment.

Thanx, Paul

> > /*
> > * The above spin_unlock_wait() forms a control dependency with
> > * any following stores; because we must first observe the lock
> > * unlocked and we cannot speculate stores.
> > *
> > * Subsequent loads however can easily pass through the loads
> > * represented by spin_unlock_wait() and therefore we need the
> > * read barrier.
> > *
> > * This together is stronger than ACQUIRE for @local and
> > * therefore we will observe the complete prior critical section
> > * of @local.
> > */
> > smp_rmb();
> >
> > The obvious alternative is using spin_unlock_wait() with an
> > smp_load_acquire(), but that might be more expensive on some archs due
> > to repeated issuing of memory barriers.
>
> Please accept my apologies for the late reply. And thank you to Peter
> for prodding me (again!) to think more carefully about this. Because we
> have a bit of a comedy of errors at the moment. I will start with
> mine, namely that we need to support DEC Alpha, which means we cannot
> use ACCESS_ONCE() or READ_ONCE() to head a control-dependency chain.
> Alpha needs an mb instruction for control dependencies, just as it does
> for data dependencies.
>
> I have the following (untested, probably does not even build) patch
> queued to introduce READ_ONCE_CTRL() to head control dependencies, making
> them safe for DEC Alpha. And, just as important, also making them more
> self-documenting.
>
> But the fun doesn't end there! Some architectures have "interesting"
> implementations of spin_unlock_wait() and arch_spin_is_locked(). I will
> deal with these in a follow-up email.
>
> In the meantime, thoughts on the patch below?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> smp: Make control dependencies work on Alpha, improve documentation
>
> The current formulation of control dependencies fails on DEC Alpha, which
> does not respect dependencies of any kind unless an explicit memory is
> provided. This means that the current formulation of control dependencies
> fails on Alpha. This commit therefore creates a READ_ONCE_CTRL() that
> has the same overhead on non-Alpha systems, but causes Alpha to produce
> the needed ordering. This commit also applies READ_ONCE_CTRL() to the
> one known use of control dependencies.
>
> Use of READ_ONCE_CTRL() also has the beneficial effect of adding a bit
> of self-documentation to control dependencies.
>
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 52c320e3f107..16897552dc1c 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -617,16 +617,16 @@ case what's actually required is:
> However, stores are not speculated. This means that ordering -is- provided
> for load-store control dependencies, as in the following example:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> if (q) {
> ACCESS_ONCE(b) = p;
> }
>
> -Control dependencies pair normally with other types of barriers.
> -That said, please note that ACCESS_ONCE() is not optional! Without the
> -ACCESS_ONCE(), might combine the load from 'a' with other loads from
> -'a', and the store to 'b' with other stores to 'b', with possible highly
> -counterintuitive effects on ordering.
> +Control dependencies pair normally with other types of barriers. That
> +said, please note that READ_ONCE_CTRL() is not optional! Without the
> +READ_ONCE_CTRL(), the compiler might combine the load from 'a' with
> +other loads from 'a', and the store to 'b' with other stores to 'b',
> +with possible highly counterintuitive effects on ordering.
>
> Worse yet, if the compiler is able to prove (say) that the value of
> variable 'a' is always non-zero, it would be well within its rights
> @@ -636,12 +636,15 @@ as follows:
> q = a;
> b = p; /* BUG: Compiler and CPU can both reorder!!! */
>
> -So don't leave out the ACCESS_ONCE().
> +Finally, the READ_ONCE_CTRL() includes an smp_read_barrier_depends()
> +that DEC Alpha needs in order to respect control depedencies.
> +
> +So don't leave out the READ_ONCE_CTRL().
>
> It is tempting to try to enforce ordering on identical stores on both
> branches of the "if" statement as follows:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> if (q) {
> barrier();
> ACCESS_ONCE(b) = p;
> @@ -655,7 +658,7 @@ branches of the "if" statement as follows:
> Unfortunately, current compilers will transform this as follows at high
> optimization levels:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> barrier();
> ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
> if (q) {
> @@ -685,7 +688,7 @@ memory barriers, for example, smp_store_release():
> In contrast, without explicit memory barriers, two-legged-if control
> ordering is guaranteed only when the stores differ, for example:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> if (q) {
> ACCESS_ONCE(b) = p;
> do_something();
> @@ -694,14 +697,14 @@ ordering is guaranteed only when the stores differ, for example:
> do_something_else();
> }
>
> -The initial ACCESS_ONCE() is still required to prevent the compiler from
> -proving the value of 'a'.
> +The initial READ_ONCE_CTRL() is still required to prevent the compiler
> +from proving the value of 'a'.
>
> In addition, you need to be careful what you do with the local variable 'q',
> otherwise the compiler might be able to guess the value and again remove
> the needed conditional. For example:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> if (q % MAX) {
> ACCESS_ONCE(b) = p;
> do_something();
> @@ -714,7 +717,7 @@ If MAX is defined to be 1, then the compiler knows that (q % MAX) is
> equal to zero, in which case the compiler is within its rights to
> transform the above code into the following:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> ACCESS_ONCE(b) = p;
> do_something_else();
>
> @@ -725,7 +728,7 @@ is gone, and the barrier won't bring it back. Therefore, if you are
> relying on this ordering, you should make sure that MAX is greater than
> one, perhaps as follows:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
> if (q % MAX) {
> ACCESS_ONCE(b) = p;
> @@ -742,14 +745,15 @@ of the 'if' statement.
> You must also be careful not to rely too much on boolean short-circuit
> evaluation. Consider this example:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> if (a || 1 > 0)
> ACCESS_ONCE(b) = 1;
>
> -Because the second condition is always true, the compiler can transform
> -this example as following, defeating control dependency:
> +Because the first condition cannot fault and the second condition is
> +always true, the compiler can transform this example as following,
> +defeating control dependency:
>
> - q = ACCESS_ONCE(a);
> + q = READ_ONCE_CTRL(a);
> ACCESS_ONCE(b) = 1;
>
> This example underscores the need to ensure that the compiler cannot
> @@ -762,8 +766,8 @@ demonstrated by two related examples, with the initial values of
> x and y both being zero:
>
> CPU 0 CPU 1
> - ===================== =====================
> - r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y);
> + ======================= =======================
> + r1 = READ_ONCE_CTRL(x); r2 = READ_ONCE_CTRL(y);
> if (r1 > 0) if (r2 > 0)
> ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1;
>
> @@ -783,7 +787,8 @@ But because control dependencies do -not- provide transitivity, the above
> assertion can fail after the combined three-CPU example completes. If you
> need the three-CPU example to provide ordering, you will need smp_mb()
> between the loads and stores in the CPU 0 and CPU 1 code fragments,
> -that is, just before or just after the "if" statements.
> +that is, just before or just after the "if" statements. Furthermore,
> +the original two-CPU example is very fragile and should be avoided.
>
> These two examples are the LB and WWC litmus tests from this paper:
> http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this
> @@ -791,6 +796,12 @@ site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html.
>
> In summary:
>
> + (*) Control dependencies must be headed by READ_ONCE_CTRL().
> + Or, as a much less preferable alternative, the chain can be
> + headed by READ_ONCE() or an ACCESS_ONCE() read, with an
> + smp_read_barrier_depends() interposed between this read and
> + the control-dependent write.
> +
> (*) Control dependencies can order prior loads against later stores.
> However, they do -not- guarantee any other sort of ordering:
> Not prior loads against later loads, nor prior stores against
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 1b45e4a0519b..a57eacde2b84 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -264,6 +264,22 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> #define WRITE_ONCE(x, val) \
> ({ typeof(x) __val = (val); __write_once_size(&(x), &__val, sizeof(__val)); __val; })
>
> +/**
> + * READ_ONCE_CTRL - Read a value heading a control dependency
> + * @x: The value to be read, heading the control dependency
> + *
> + * Control dependencies are tricky. See Documentation/memory-barriers.txt
> + * for important information on how to use them. Note that in many cases,
> + * use of smp_load_acquire() will be much simpler. Control dependencies
> + * should be avoided except on the hottest of hotpaths.
> + */
> +#define READ_ONCE_CTRL(x) \
> +({ \
> + typeof(x) __val = READ_ONCE(x); \
> + smp_read_barrier_depends(); /* Enforce control dependency. */ \
> + __val; \
> +})
> +
> #endif /* __KERNEL__ */
>
> #endif /* __ASSEMBLY__ */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index eadb95ce7aac..67548d5de4cb 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -141,7 +141,7 @@ int perf_output_begin(struct perf_output_handle *handle,
> perf_output_get_handle(handle);
>
> do {
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> + tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> offset = head = local_read(&rb->head);
> if (!rb->overwrite &&
> unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))

2015-04-28 14:32:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Sat, Apr 25, 2015 at 12:56:03PM -0700, Paul E. McKenney wrote:
> smp: Make control dependencies work on Alpha, improve documentation
>
> The current formulation of control dependencies fails on DEC Alpha, which
> does not respect dependencies of any kind unless an explicit memory is

+ barrier ?

> provided. This means that the current formulation of control dependencies
> fails on Alpha. This commit therefore creates a READ_ONCE_CTRL() that
> has the same overhead on non-Alpha systems, but causes Alpha to produce
> the needed ordering. This commit also applies READ_ONCE_CTRL() to the
> one known use of control dependencies.
>
> Use of READ_ONCE_CTRL() also has the beneficial effect of adding a bit
> of self-documentation to control dependencies.
>
> Signed-off-by: Paul E. McKenney <[email protected]>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 1b45e4a0519b..a57eacde2b84 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -264,6 +264,22 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> #define WRITE_ONCE(x, val) \
> ({ typeof(x) __val = (val); __write_once_size(&(x), &__val, sizeof(__val)); __val; })
>
> +/**
> + * READ_ONCE_CTRL - Read a value heading a control dependency
> + * @x: The value to be read, heading the control dependency
> + *
> + * Control dependencies are tricky. See Documentation/memory-barriers.txt
> + * for important information on how to use them. Note that in many cases,
> + * use of smp_load_acquire() will be much simpler. Control dependencies
> + * should be avoided except on the hottest of hotpaths.
> + */
> +#define READ_ONCE_CTRL(x) \
> +({ \
> + typeof(x) __val = READ_ONCE(x); \
> + smp_read_barrier_depends(); /* Enforce control dependency. */ \
> + __val; \
> +})
> +
> #endif /* __KERNEL__ */
>
> #endif /* __ASSEMBLY__ */

We mostly try and align the \ for multi-line macros.

> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index eadb95ce7aac..67548d5de4cb 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -141,7 +141,7 @@ int perf_output_begin(struct perf_output_handle *handle,
> perf_output_get_handle(handle);
>
> do {
> - tail = ACCESS_ONCE(rb->user_page->data_tail);
> + tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> offset = head = local_read(&rb->head);
> if (!rb->overwrite &&
> unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))

Right. I could not remember any other current usage.

2015-04-28 14:34:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Sun, Apr 26, 2015 at 03:52:13AM -0700, Paul E. McKenney wrote:

> And then an smp_read_barrier_depends() would be needed either here
> or embedded in spin_unlock_wait(). But we also need to check the
> spin_unlock_wait() implementations to see if any are potentially
> vulnerable to compiler misbehavior due to lack of ACCESS_ONCE(),
> READ_ONCE(), or other sources of the required volatility:
>

> o tile: For 32-bit, looks like a bug. Compares ->current_ticket and
> ->next_ticket with no obvious protection. The compiler is free to
> load them in either order, so it is possible that the two fields
> could compare equal despite never having actually been equal at
> any given time. Needs something like arm, arm64, mips, or x86
> to do single fetch, then compare fields in quantity fetched.
>
> Except that this appears to be using int on a 32-bit system,
> thus might not have a 64-bit load. If that is the case, the
> trick would be to load them in order. Except that this can be
> defeated by overflow. Are there really 32-bit tile systems with
> enough CPUs to overflow an unsigned short?
>
> For 64-bit, a READ_ONCE() appears to be in order -- no obvious
> volatility present.
>

Chris?

2015-04-28 15:53:38

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 04/28/2015 10:33 AM, Peter Zijlstra wrote:
> On Sun, Apr 26, 2015 at 03:52:13AM -0700, Paul E. McKenney wrote:
>
>> And then an smp_read_barrier_depends() would be needed either here
>> or embedded in spin_unlock_wait(). But we also need to check the
>> spin_unlock_wait() implementations to see if any are potentially
>> vulnerable to compiler misbehavior due to lack of ACCESS_ONCE(),
>> READ_ONCE(), or other sources of the required volatility:
>>
>> o tile: For 32-bit, looks like a bug. Compares ->current_ticket and
>> ->next_ticket with no obvious protection. The compiler is free to
>> load them in either order, so it is possible that the two fields
>> could compare equal despite never having actually been equal at
>> any given time. Needs something like arm, arm64, mips, or x86
>> to do single fetch, then compare fields in quantity fetched.
>>
>> Except that this appears to be using int on a 32-bit system,
>> thus might not have a 64-bit load. If that is the case, the
>> trick would be to load them in order. Except that this can be
>> defeated by overflow. Are there really 32-bit tile systems with
>> enough CPUs to overflow an unsigned short?

As you surmise, tilepro doesn't have 64-bit loads. So we are stuck with
32-bit loads on these two fields. It's true that spin_unlock_wait() can
therefore falsely claim that the lock is unlocked, but it should be only a
hint anyway, since by the time the caller tries to act on that information
the lock may have been retaken anyway, right? If spin_unlock_wait() is
really trying to guarantee that the lock was available at some point in
the interval between when it was called and when it returned, we could use
READ_ONCE() to read the current ticket value first; is that a necessary
part of the semantics?

(Even with READ_ONCE we'd still be exposed to a technical risk that
other cores had taken and released the lock 2 billion times in between
the two loads of the core running spin_unlock_wait, without ever having
the lock actually be free, so technically the only solution is for that core
to actually acquire and release the lock, but that seems a bit extreme
in practice.)

The reason we use two 32-bit fields on tilepro is that the only available
atomic instruction is tns (test and set), which sets a 32-bit "1" value
into the target memory and returns the old 32-bit value. So we need to
be able to safely "corrupt" the next_ticket value with a "1", load the
current_ticket value, and if they don't match, rewrite next_ticket with
its old value. We can't safely do this if next_ticket and current_ticket
are 16-bit fields in one 32-bit word, since the "tns" operation would
corrupt the current_ticket value in that case, making someone
waiting on ticket 0 think they owned the lock.
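
As a purely illustrative sketch of why the packed layout would break (tns()
and both layouts are schematic assumptions, not the real tile definitions):

/* Two separate 32-bit words: tns writes its "1" over next_ticket only,
 * so current_ticket is never touched and next_ticket can simply be
 * rewritten with its old value if the lock turns out to be held. */
struct split_lock {
	unsigned int current_ticket;
	unsigned int next_ticket;
};

/* Packed 16/16 layout: a 32-bit tns over the word overwrites *both*
 * halves (one with 1, the other with 0), so a waiter whose ticket
 * happens to match the corrupted current_ticket value -- e.g. ticket
 * 0 -- would wrongly conclude it now owns the lock. */
struct packed_lock {
	unsigned short current_ticket;
	unsigned short next_ticket;
};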

On tilegx we made the atomic instruction situation much, much better :-)

>> For 64-bit, a READ_ONCE() appears to be in order -- no obvious
>> volatility present.
>>

It depends, I guess. If you are spinning on arch_spin_is_locked(), yes,
you need to make sure to do something to ensure the value is re-read.
But arch_spin_unlock_wait() already calls delay_backoff(), which calls
relax(), which includes a barrier(), so we're OK there. But if
stylistically the consensus calls for a READ_ONCE() in
arch_spin_is_locked(), I can certainly add that. What do folks think?

Assuming the answers to both questions is "change the code",
how does this look?

diff --git a/arch/tile/include/asm/spinlock_32.h b/arch/tile/include/asm/spinlock_32.h
index c0a77b38d39a..7c7b80bd83db 100644
--- a/arch/tile/include/asm/spinlock_32.h
+++ b/arch/tile/include/asm/spinlock_32.h
@@ -41,8 +41,23 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
* to claim the lock is held, since it will be momentarily
* if not already. There's no need to wait for a "valid"
* lock->next_ticket to become available.
+ *
+ * We order the reads here so that if we claim the lock is
+ * unlocked, we know it actually was for at least a moment.
+ * Since current_ticket is never incremented above
+ * next_ticket, by reading current first, then next, and
+ * finding them equal, we know that during that window between
+ * the reads the lock was unlocked.
+ *
+ * There is a technical risk in this that between reading
+ * current and reading next, other cores locked and unlocked
+ * two billion times without the lock ever being unlocked, and
+ * therefore it looks like the lock was at some point unlocked
+ * but it never was. But this seems highly improbable.
*/
- return lock->next_ticket != lock->current_ticket;
+ int current = READ_ONCE(lock->current_ticket);
+ int next = READ_ONCE(lock->next_ticket);
+ return next != current;
}

void arch_spin_lock(arch_spinlock_t *lock);
diff --git a/arch/tile/include/asm/spinlock_64.h b/arch/tile/include/asm/spinlock_64.h
index 9a12b9c7e5d3..b9718fb4e74a 100644
--- a/arch/tile/include/asm/spinlock_64.h
+++ b/arch/tile/include/asm/spinlock_64.h
@@ -18,6 +18,8 @@
#ifndef _ASM_TILE_SPINLOCK_64_H
#define _ASM_TILE_SPINLOCK_64_H

+#include <linux/compiler.h>
+
/* Shifts and masks for the various fields in "lock". */
#define __ARCH_SPIN_CURRENT_SHIFT 17
#define __ARCH_SPIN_NEXT_MASK 0x7fff
@@ -44,7 +46,8 @@ static inline u32 arch_spin_next(u32 val)
/* The lock is locked if a task would have to wait to get it. */
static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
- u32 val = lock->lock;
+ /* Use READ_ONCE() to ensure that calling this in a loop is OK. */
+ u32 val = READ_ONCE(lock->lock);
return arch_spin_current(val) != arch_spin_next(val);
}

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-04-28 16:25:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Apr 28, 2015 at 11:53:21AM -0400, Chris Metcalf wrote:
> As you surmise, tilepro doesn't have 64-bit loads. So we are stuck with
> 32-bit loads on these two fields. It's true that spin_unlock_wait() can
> therefore falsely claim that the lock is unlocked, but it should be only a
> hint anyway, since by the time the caller tries to act on that information
> the lock may have been retaken anyway, right? If spin_unlock_wait() is
> really trying to guarantee that the lock was available at some point in
> the interval between when it was called and when it returned, we could use
> READ_ONCE() to read the current ticket value first; is that a necessary
> part of the semantics?

I think it must not return before the lock holder that is current at the
time of calling releases. Anything thereafter is indeed fair game as per
your logic above.


2015-04-28 16:40:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Apr 28, 2015 at 11:53:21AM -0400, Chris Metcalf wrote:

> The reason we use two 32-bit fields on tilepro is that the only available
> atomic instruction is tns (test and set), which sets a 32-bit "1" value
> into the target memory and returns the old 32-bit value.

And you want a ticket lock as opposed to the test-and-set lock because
with 64 tiles starvation under contention is a real worry?

Whereas sparc32 (which only has the load-store-unsigned-byte instruction,
which is similar to your tns except it writes 0xff) has the benefit of
not actually having many CPUs, you get to do 64 CPUs!

2015-04-28 16:45:06

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH] spinlock: clarify doc for raw_spin_unlock_wait()

The intent of the function is to wait for the current locker.
Thus, we don't have to worry about potential later lockers,
but we also have to ensure that the arch implementation
doesn't return a false positive for the current locker.

Signed-off-by: Chris Metcalf <[email protected]>
---
On 04/28/2015 12:24 PM, Peter Zijlstra wrote:
> I think it must not return before the lock holder that is current at the
> time of calling releases. Anything thereafter is indeed fair game as per
> your logic above.

Great, that seems like a workable definition. How does this modified
language seem? With this definition I can actually modify the
implementation of tile's arch_raw_spin_unlock_wait() to just read
current_ticket once and just wait until it changes (assuming the
lock is, in fact, locked).

Not sure whose tree this should go through; any takers?

include/linux/spinlock.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3e18379dfa6f..36de5fc86647 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -141,7 +141,8 @@ do { \
#endif

/**
- * raw_spin_unlock_wait - wait until the spinlock gets unlocked
+ * raw_spin_unlock_wait - wait until the lock holder that is current at the
+ * time of calling releases the lock (or return immediately if unlocked).
* @lock: the spinlock in question.
*/
#define raw_spin_unlock_wait(lock) arch_spin_unlock_wait(&(lock)->raw_lock)
--
2.1.2

2015-04-28 16:59:13

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 04/28/2015 12:40 PM, Peter Zijlstra wrote:
> On Tue, Apr 28, 2015 at 11:53:21AM -0400, Chris Metcalf wrote:
>
>> The reason we use two 32-bit fields on tilepro is that the only available
>> atomic instruction is tns (test and set), which sets a 32-bit "1" value
>> into the target memory and returns the old 32-bit value.
> And you want a ticket lock as opposed to the test-and-set lock because
> with 64 tiles starvation under contention is a real worry?

We see substantial unfairness under load with a plain spinlock,
basically because nearer cores on the mesh network can exponentially
crowd out further cores. The ticket lock avoids that, though we
have to be careful to do backoff when checking the lock to avoid
DDoS in the mesh network.
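
As a purely illustrative sketch of that backoff (not the actual tile
delay_backoff(); the cap and the relax() cycle count are assumptions made up
for this example):

/* Spin progressively longer between polls of the lock word, so that many
 * waiters do not flood the mesh network with read requests. */
static void sketch_delay_backoff(unsigned int iterations)
{
	unsigned int exponent = iterations < 8 ? iterations : 8;

	/* relax() burns some cycles and contains a compiler barrier, so
	 * the lock word is re-read on the caller's next iteration. */
	relax(1 << exponent);
}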

> Whereas sparc32 (which only has the load-store-unsigned-byte instruction,
> which is similar to your tns except it writes 0xff) has the benefit of
> not actually having many CPUs, you get to do 64 CPUs!

Yes, the 64-core thing is awesome, but the atomic instructions
we built into the first generation chip weren't really adequate
to the task.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-04-28 17:34:15

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 1/2] tile: modify arch_spin_unlock_wait() semantics

Rather than trying to wait until all possible lockers have
unlocked the lock, we now only wait until the current locker
(if any) has released the lock.

The old code was correct, but the new code works more like the x86
code and thus hopefully is more appropriate under contention.
See commit 78bff1c8684f ("x86/ticketlock: Fix spin_unlock_wait()
livelock") for x86.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/lib/spinlock_32.c | 11 ++++++++++-
arch/tile/lib/spinlock_64.c | 11 ++++++++++-
2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index b34f79aada48..c3df3c237496 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -65,8 +65,17 @@ EXPORT_SYMBOL(arch_spin_trylock);
void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
u32 iterations = 0;
- while (arch_spin_is_locked(lock))
+ int current = READ_ONCE(lock->current_ticket);
+ int next = READ_ONCE(lock->next_ticket);
+
+ /* Return immediately if unlocked. */
+ if (next == current)
+ return;
+
+ /* Wait until the current locker has released the lock. */
+ do {
delay_backoff(iterations++);
+ } while (READ_ONCE(lock->current_ticket) == current);
}
EXPORT_SYMBOL(arch_spin_unlock_wait);

diff --git a/arch/tile/lib/spinlock_64.c b/arch/tile/lib/spinlock_64.c
index d6fb9581e980..aaf12813d4db 100644
--- a/arch/tile/lib/spinlock_64.c
+++ b/arch/tile/lib/spinlock_64.c
@@ -65,8 +65,17 @@ EXPORT_SYMBOL(arch_spin_trylock);
void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
u32 iterations = 0;
- while (arch_spin_is_locked(lock))
+ u32 val = READ_ONCE(lock->lock);
+ u32 current = arch_spin_current(val);
+
+ /* Return immediately if unlocked. */
+ if (arch_spin_next(val) == current)
+ return;
+
+ /* Wait until the current locker has released the lock. */
+ do {
delay_backoff(iterations++);
+ } while (arch_spin_current(READ_ONCE(lock->lock)) == current);
}
EXPORT_SYMBOL(arch_spin_unlock_wait);

--
2.1.2

2015-04-28 17:34:29

by Chris Metcalf

[permalink] [raw]
Subject: [PATCH 2/2] tile: use READ_ONCE() in arch_spin_is_locked()

This avoids potential issues if callers were to loop on these
routines without some kind of memory barrier. Currently there
are no such users in-tree, but it seems better safe than sorry.

Also, in the tilepro case we read "current" before "next",
which gives us a slightly better guarantee that the lock was
actually unlocked at least momentarily if we return claiming
that it is not locked. None of the callers actually rely on
this behavior, as far as I know, however.

Signed-off-by: Chris Metcalf <[email protected]>
---
arch/tile/include/asm/spinlock_32.h | 5 ++++-
arch/tile/include/asm/spinlock_64.h | 5 ++++-
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/tile/include/asm/spinlock_32.h b/arch/tile/include/asm/spinlock_32.h
index c0a77b38d39a..f7c1c0ebcf1d 100644
--- a/arch/tile/include/asm/spinlock_32.h
+++ b/arch/tile/include/asm/spinlock_32.h
@@ -41,8 +41,12 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
* to claim the lock is held, since it will be momentarily
* if not already. There's no need to wait for a "valid"
* lock->next_ticket to become available.
+ * Use READ_ONCE() to ensure that calling this in a loop is OK.
*/
- return lock->next_ticket != lock->current_ticket;
+ int current = READ_ONCE(lock->current_ticket);
+ int next = READ_ONCE(lock->next_ticket);
+
+ return next != current;
}

void arch_spin_lock(arch_spinlock_t *lock);
diff --git a/arch/tile/include/asm/spinlock_64.h b/arch/tile/include/asm/spinlock_64.h
index 9a12b9c7e5d3..b9718fb4e74a 100644
--- a/arch/tile/include/asm/spinlock_64.h
+++ b/arch/tile/include/asm/spinlock_64.h
@@ -18,6 +18,8 @@
#ifndef _ASM_TILE_SPINLOCK_64_H
#define _ASM_TILE_SPINLOCK_64_H

+#include <linux/compiler.h>
+
/* Shifts and masks for the various fields in "lock". */
#define __ARCH_SPIN_CURRENT_SHIFT 17
#define __ARCH_SPIN_NEXT_MASK 0x7fff
@@ -44,7 +46,8 @@ static inline u32 arch_spin_next(u32 val)
/* The lock is locked if a task would have to wait to get it. */
static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
- u32 val = lock->lock;
+ /* Use READ_ONCE() to ensure that calling this in a loop is OK. */
+ u32 val = READ_ONCE(lock->lock);
return arch_spin_current(val) != arch_spin_next(val);
}

--
2.1.2

2015-04-28 17:43:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Apr 28, 2015 at 12:58:55PM -0400, Chris Metcalf wrote:
> On 04/28/2015 12:40 PM, Peter Zijlstra wrote:
> >On Tue, Apr 28, 2015 at 11:53:21AM -0400, Chris Metcalf wrote:
> >
> >>The reason we use two 32-bit fields on tilepro is that the only available
> >>atomic instruction is tns (test and set), which sets a 32-bit "1" value
> >>into the target memory and returns the old 32-bit value.
> >And you want a ticket lock as opposed to the test-and-set lock because
> >with 64 tiles starvation under contention is a real worry?
>
> We see substantial unfairness under load with a plain spinlock,
> basically because nearer cores on the mesh network can exponentially
> crowd out further cores. The ticket lock avoids that, though we
> have to be careful to do backoff when checking the lock to avoid
> DDoS in the mesh network.

Does your arch have 16bit atomic load/stores ? If so, would something
like the below not make sense?

typedef struct {
	union {
		struct {
			unsigned short head;
			unsigned short tail;
		};
		unsigned int tickets;
	};
	unsigned int lock;
} arch_spinlock_t;

static inline void ___tns_lock(unsigned int *lock)
{
	while (tns(lock))
		cpu_relax();
}

static inline void ___tns_unlock(unsigned int *lock)
{
	WRITE_ONCE(*lock, 0);
}

static inline void arch_spin_lock(arch_spinlock_t *lock)
{
	unsigned short head, tail;

	___tns_lock(&lock->lock); /* XXX does the TNS imply a ___sync? */
	head = lock->head;
	lock->head++;
	___tns_unlock(&lock->lock);

	while (READ_ONCE(lock->tail) != head)
		cpu_relax();
}

static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	/*
	 * can do with regular load/store because the lock owner
	 * is the only one going to do stores to the tail
	 */
	unsigned short tail = READ_ONCE(lock->tail);
	smp_mb(); /* MB is stronger than RELEASE */
	WRITE_ONCE(lock->tail, tail + 1);
}

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
	union {
		struct {
			unsigned short head;
			unsigned short tail;
		};
		unsigned int tickets;
	} x;

	for (;;) {
		x.tickets = READ_ONCE(lock->tickets);
		if (x.head == x.tail)
			break;
		cpu_relax();
	}
}

2015-04-28 18:01:05

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 04/28/2015 01:43 PM, Peter Zijlstra wrote:
> On Tue, Apr 28, 2015 at 12:58:55PM -0400, Chris Metcalf wrote:
>> On 04/28/2015 12:40 PM, Peter Zijlstra wrote:
>>> On Tue, Apr 28, 2015 at 11:53:21AM -0400, Chris Metcalf wrote:
>>>
>>>> The reason we use two 32-bit fields on tilepro is that the only available
>>>> atomic instruction is tns (test and set), which sets a 32-bit "1" value
>>>> into the target memory and returns the old 32-bit value.
>>> And you want a ticket lock as opposed to the test-and-set lock because
>>> with 64 tiles starvation under contention is a real worry?
>> We see substantial unfairness under load with a plain spinlock,
>> basically because nearer cores on the mesh network can exponentially
>> crowd out further cores. The ticket lock avoids that, though we
>> have to be careful to do backoff when checking the lock to avoid
>> DDoS in the mesh network.
> Does your arch have 16bit atomic load/stores ? If so, would something
> like the below not make sense?

Yes, tilepro can do 16-bit atomic load/stores. The reason we didn't use
your approach (basically having tns provide locking for the head/tail)
is just a perceived efficiency gain from rolling the tns lock into the head.

The current tilepro arch_spin_lock() is just three mesh network transactions
(tns, store, load). Your proposed spin lock is five (tns, load, store,
store, load).

Or, looking at it from a core-centric perspective, the current
arch_spin_lock() only has to wait on requests from the mesh network twice
(tns, load), basically once for each member of the lock structure; your
proposed version has to wait three times (tns, load, load).

I don't honestly know how critical this difference is, but that's why I
designed it the way I did.

I think your goal with your proposed redesign is being able to atomically
read head and tail together for arch_spin_unlock_wait(), but I don't see
why that's better than just reading head, checking it's not equal to tail
with a separate read, then spinning waiting for head to change.
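
Spelled out (using the tilepro field names rather than the head/tail naming
above, to avoid confusion between the two conventions), the alternative Chris
is describing might look like this; it is a sketch, not the code that was
eventually posted, and it omits backoff:

	static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
	{
		/* Sample the ticket currently being served, once. */
		int curr = READ_ONCE(lock->current_ticket);

		/* Not locked: nothing to wait for. */
		if (READ_ONCE(lock->next_ticket) == curr)
			return;

		/*
		 * Locked: the current holder is serving exactly 'curr', so
		 * waiting for current_ticket to move on is enough; later
		 * lockers are not our concern.
		 */
		while (READ_ONCE(lock->current_ticket) == curr)
			cpu_relax();
	}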

>
> typedef struct {
> union {
> struct {
> unsigned short head;
> unsigned short tail;
> };
> unsigned int tickets;
> };
> unsigned int lock;
> } arch_spinlock_t;
>
> static inline void ___tns_lock(unsigned int *lock)
> {
> while (tns(lock))
> cpu_relax();
> }
>
> static inline void ___tns_unlock(unsigned int *lock)
> {
> WRITE_ONCE(*lock, 0);
> }
>
> static inline void arch_spin_lock(arch_spinlock_t *lock)
> {
> unsigned short head, tail;
>
> ___tns_lock(&lock->lock); /* XXX does the TNS imply a ___sync? */
> head = lock->head;
> lock->head++;
> ___tns_unlock(&lock->lock);
>
> while (READ_ONCE(lock->tail) != head)
> cpu_relax();
> }
>
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> /*
> * can do with regular load/store because the lock owner
> * is the only one going to do stores to the tail
> */
> unsigned short tail = READ_ONCE(lock->tail);
> smp_mb(); /* MB is stronger than RELEASE */
> WRITE_ONCE(lock->tail, tail + 1);
> }
>
> static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> {
> union {
> struct {
> unsigned short head;
> unsigned short tail;
> };
> unsigned int tickets;
> } x;
>
> for (;;) {
> x.tickets = READ_ONCE(lock->tickets);
> if (x.head == x.tail)
> break;
> cpu_relax();
> }
> }

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-04-28 18:24:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Apr 28, 2015 at 02:00:48PM -0400, Chris Metcalf wrote:
> Yes, tilepro can do 16-bit atomic load/stores. The reason we didn't use
> your approach (basically having tns provide locking for the head/tail)
> is just a perceived efficiency gain from rolling the tns lock into the head.
>
> The current tilepro arch_spin_lock() is just three mesh network transactions
> (tns, store, load). Your proposed spin lock is five (tns, load, store,
> store, load).
>
> Or, looking at it from a core-centric perspective, the current
> arch_spin_lock() only has to wait on requests from the mesh network twice
> (tns, load), basically once for each member of the lock structure; your
> proposed version has to wait three times (tns, load, load).
>
> I don't honestly know how critical this difference is, but that's why I
> designed it the way I did.

Makes sense. Good reason ;-)

> I think your goal with your proposed redesign is being able to atomically
> read head and tail together for arch_spin_unlock_wait(), but I don't see
> why that's better than just reading head, checking it's not equal to tail
> with a separate read, then spinning waiting for head to change.

Right, that should be perfectly fine indeed.

A few questions:

> >static inline void arch_spin_lock(arch_spinlock_t *lock)
> >{
> > unsigned short head, tail;
> >
> > ___tns_lock(&lock->lock); /* XXX does the TNS imply a ___sync? */

Does it? Something needs to provide the ACQUIRE semantics.

> > head = lock->head;
> > lock->head++;
> > ___tns_unlock(&lock->lock);
> >
> > while (READ_ONCE(lock->tail) != head)
> > cpu_relax();
> >}
> >
> >static inline void arch_spin_unlock(arch_spinlock_t *lock)
> >{
> > /*
> > * can do with regular load/store because the lock owner
> > * is the only one going to do stores to the tail
> > */
> > unsigned short tail = READ_ONCE(lock->tail);
> > smp_mb(); /* MB is stronger than RELEASE */

Note that your code uses wmb(); wmb() is strictly speaking not correct,
as it's weaker than RELEASE.

_However_ it doesn't make any practical difference since all three
barriers end up emitting __sync(), so it's not a bug per se.

> > WRITE_ONCE(lock->tail, tail + 1);
> >}

2015-04-28 18:39:11

by Chris Metcalf

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On 04/28/2015 02:24 PM, Peter Zijlstra wrote:
> A few questions:
> On Tue, Apr 28, 2015 at 02:00:48PM -0400, Chris Metcalf wrote:
>
>>> static inline void arch_spin_lock(arch_spinlock_t *lock)
>>> {
>>> unsigned short head, tail;
>>>
>>> ___tns_lock(&lock->lock); /* XXX does the TNS imply a ___sync? */
> Does it? Something needs to provide the ACQUIRE semantics.

Yes, __insn_xxx() is a compiler barrier on tile.

Tile architectures do not need any hardware-level "acquire" semantics
since a control dependency is normally sufficient for lock acquisition.
Loads and stores are issued in-order into the mesh network, but issued
loads don't block further instruction issue until a register read dependency
requires it. There is no speculative execution.
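
Restated as code (this is my reading of the explanation above, with names of
my own choosing, and not a guarantee that holds on other architectures), the
claim is that the ticket-wait loop from Peter's earlier sketch already
provides the acquire ordering on tile:

	static inline void tile_ticket_wait(arch_spinlock_t *lock,
					    unsigned short head)
	{
		/*
		 * The branch cannot resolve until the load of lock->tail
		 * returns; with in-order issue and no speculation, nothing
		 * from the critical section can be performed before that
		 * load, so the control dependency acts as ACQUIRE here.
		 */
		while (READ_ONCE(lock->tail) != head)
			cpu_relax();
	}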


>>> head = lock->head;
>>> lock->head++;
>>> ___tns_unlock(&lock->lock);
>>>
>>> while (READ_ONCE(lock->tail) != head)
>>> cpu_relax();
>>> }
>>>
>>> static inline void arch_spin_unlock(arch_spinlock_t *lock)
>>> {
>>> /*
>>> * can do with regular load/store because the lock owner
>>> * is the only one going to do stores to the tail
>>> */
>>> unsigned short tail = READ_ONCE(lock->tail);
>>> smp_mb(); /* MB is stronger than RELEASE */
> Note that your code uses wmb(); wmb() is strictly speaking not correct,
> as it's weaker than RELEASE.
>
> _However_ it doesn't make any practical difference since all three
> barriers end up emitting __sync(), so it's not a bug per se.

Yes, this code dates back to 2.6.18 or so; today I would use
smp_store_release(). I like the trend in the kernel to move more
towards the C11 memory order model; I think it will help a lot.
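
For comparison, an unlock written the way Chris says he would write it today
might look like the following; this is a sketch against Peter's proposed
layout above, not a patch:

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		unsigned short tail = READ_ONCE(lock->tail);

		/*
		 * RELEASE is exactly what an unlock needs: every access in
		 * the critical section must be visible before the new tail
		 * value.  smp_store_release() documents that intent and
		 * lets the architecture emit the cheapest barrier that
		 * provides it, rather than a full smp_mb().
		 */
		smp_store_release(&lock->tail, tail + 1);
	}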

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com

2015-04-30 07:43:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] sched: Add smp_rmb() in task rq locking cycles

On Tue, Apr 28, 2015 at 04:32:35PM +0200, Peter Zijlstra wrote:
> On Sat, Apr 25, 2015 at 12:56:03PM -0700, Paul E. McKenney wrote:
> > smp: Make control dependencies work on Alpha, improve documentation
> >
> > The current formulation of control dependencies fails on DEC Alpha, which
> > does not respect dependencies of any kind unless an explicit memory is
>
> + barrier ?

Good catch, fixed!

> > provided. This means that the current formulation of control dependencies
> > fails on Alpha. This commit therefore creates a READ_ONCE_CTRL() that
> > has the same overhead on non-Alpha systems, but causes Alpha to produce
> > the needed ordering. This commit also applies READ_ONCE_CTRL() to the
> > one known use of control dependencies.
> >
> > Use of READ_ONCE_CTRL() also has the beneficial effect of adding a bit
> > of self-documentation to control dependencies.
> >
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Applied, thank you!

Thanx, Paul

> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index 1b45e4a0519b..a57eacde2b84 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -264,6 +264,22 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> > #define WRITE_ONCE(x, val) \
> > ({ typeof(x) __val = (val); __write_once_size(&(x), &__val, sizeof(__val)); __val; })
> >
> > +/**
> > + * READ_ONCE_CTRL - Read a value heading a control dependency
> > + * @x: The value to be read, heading the control dependency
> > + *
> > + * Control dependencies are tricky. See Documentation/memory-barriers.txt
> > + * for important information on how to use them. Note that in many cases,
> > + * use of smp_load_acquire() will be much simpler. Control dependencies
> > + * should be avoided except on the hottest of hotpaths.
> > + */
> > +#define READ_ONCE_CTRL(x) \
> > +({ \
> > + typeof(x) __val = READ_ONCE(x); \
> > + smp_read_barrier_depends(); /* Enforce control dependency. */ \
> > + __val; \
> > +})
> > +
> > #endif /* __KERNEL__ */
> >
> > #endif /* __ASSEMBLY__ */
>
> We mostly try and align the \ for multi-line macros.
>
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index eadb95ce7aac..67548d5de4cb 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -141,7 +141,7 @@ int perf_output_begin(struct perf_output_handle *handle,
> > perf_output_get_handle(handle);
> >
> > do {
> > - tail = ACCESS_ONCE(rb->user_page->data_tail);
> > + tail = READ_ONCE_CTRL(rb->user_page->data_tail);
> > offset = head = local_read(&rb->head);
> > if (!rb->overwrite &&
> > unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
>
> Right. I could not remember any other current usage.
>

2015-04-29 17:34:08

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] spinlock: clarify doc for raw_spin_unlock_wait()

On 04/28/2015 06:44 PM, Chris Metcalf wrote:
> The intent of the function is to wait for the current locker.
> Thus, we don't have to worry about potential later lockers,
> but we also have to ensure that the arch implementation
> doesn't return a false positive for the current locker.
>
> Signed-off-by: Chris Metcalf <[email protected]>
sysvsem depends on this definition, i.e. a false early return can cause
a corrupted semaphore state.
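
(For readers who don't have ipc/sem.c paged in, the dependency is roughly the
pattern sketched below; this is a heavily simplified illustration, not the
actual sem_lock() code, and it leaves out the complex_count handling and the
additional ordering the real code needs.)

	/*
	 * Sketch: the per-semaphore fast path takes its fine-grained lock
	 * and then waits out any holder of the array-wide lock.  If
	 * spin_unlock_wait() returns while the current holder of
	 * array_lock is still inside its critical section, both paths
	 * believe they have exclusive access and the semaphore state can
	 * be corrupted.
	 */
	static void sem_fastpath_lock_sketch(spinlock_t *array_lock,
					     spinlock_t *per_sem_lock)
	{
		spin_lock(per_sem_lock);
		spin_unlock_wait(array_lock);
	}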

Acked-by: Manfred Spraul <[email protected]>
> ---
> On 04/28/2015 12:24 PM, Peter Zijlstra wrote:
>> I think it must not return before the lock holder that is current at the
>> time of calling releases. Anything thereafter is indeed fair game as per
>> your logic above.
> Great, that seems like a workable definition. How does this modified
> language seem? With this definition I can actually modify the
> implementation of tile's arch_spin_unlock_wait() to just read
> current_ticket once and wait until it changes (assuming the
> lock is, in fact, locked).
>
> Not sure whose tree this should go through; any takers?
>
> include/linux/spinlock.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 3e18379dfa6f..36de5fc86647 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -141,7 +141,8 @@ do { \
> #endif
>
> /**
> - * raw_spin_unlock_wait - wait until the spinlock gets unlocked
> + * raw_spin_unlock_wait - wait until the lock holder that is current at the
> + * time of calling releases the lock (or return immediately if unlocked).
> * @lock: the spinlock in question.
> */
> #define raw_spin_unlock_wait(lock) arch_spin_unlock_wait(&(lock)->raw_lock)