2007-01-28 12:12:05

by Peter Zijlstra

Subject: [PATCH 3/7] barrier: a scalable synchronisation barrier

This barrier thing is constructed so that it will not write in the sync()
condition (the hot path) when there are no active lock sections; thus avoiding
cacheline bouncing. -- I'm just not sure how this will work out in relation to
PI. We might track those in the barrier scope and boost those by the max prio
of the blockers.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/barrier.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)

Index: linux/include/linux/barrier.h
===================================================================
--- /dev/null
+++ linux/include/linux/barrier.h
@@ -0,0 +1,65 @@
+/*
+ * Copyright (C) 2006, Red Hat, Inc., Peter Zijlstra <[email protected]>
+ * Licenced under the GPLv2.
+ *
+ * simple synchronisation barrier
+ *
+ * The sync() operation will wait for completion of all lock sections if any.
+ *
+ * The lock sections are intended to be rare and the sync operation frequent.
+ * This construct is created to be scalable and does only 1 read in the fast
+ * path (sync), hence avoiding cacheline bounces.
+ *
+ * NOTE: it is _synchronisation_ only, so if there are serialisation requirements
+ * those must be met by something external to this construct.
+ */
+#ifndef _LINUX_BARRIER_H
+#define _LINUX_BARRIER_H
+
+#ifdef __KERNEL__
+
+#include <linux/wait.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
+
+struct barrier {
+	atomic_t count;
+	wait_queue_head_t wait;
+};
+
+static inline void init_barrier(struct barrier *b)
+{
+	atomic_set(&b->count, 0);
+	init_waitqueue_head(&b->wait);
+	__acquire(b);
+}
+
+static inline void barrier_lock(struct barrier *b)
+{
+	__release(b);
+	atomic_inc(&b->count);
+	smp_wmb();
+}
+
+static inline void barrier_unlock(struct barrier *b)
+{
+	smp_wmb();
+	if (atomic_dec_and_test(&b->count))
+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
+}
+
+static inline void barrier_sync(struct barrier *b)
+{
+	might_sleep();
+
+	if (unlikely(atomic_read(&b->count))) {
+		DEFINE_WAIT(wait);
+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
+		while (atomic_read(&b->count))
+			schedule();
+		finish_wait(&b->wait, &wait);
+	}
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_BARRIER_H */
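
For illustration, a minimal usage sketch of the intended pattern (hypothetical
names, not part of the patch): the rare, slow path brackets its work in a lock
section, while the frequent path calls barrier_sync(), which only reads
b->count as long as no lock sections are active:

struct barrier mybar;	/* init_barrier(&mybar) at setup time */
int myconf;

/* rare, slow path: sample myconf inside a lock section */
int myconf_read(void)
{
	int val;

	barrier_lock(&mybar);
	val = myconf;
	barrier_unlock(&mybar);
	return val;
}

/* frequent, hot path: publish a new value, then wait out any
 * lock sections that might still be using the old one */
void myconf_update(int val)
{
	myconf = val;
	barrier_sync(&mybar);
}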

--


2007-01-28 14:39:50

by Christoph Hellwig

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> This barrier thing is constructed so that it will not write in the sync()
> condition (the hot path) when there are no active lock sections; thus avoiding
> cacheline bouncing. -- I'm just not sure how this will work out in relation to
> PI. We might track those in the barrier scope and boost those by the max prio
> of the blockers.

Is this really needed? We seem to grow new funky locking algorithms
exponentially, while people already have a hard time understanding
the existing ones.

2007-01-28 15:26:19

by Ingo Molnar

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier


* Christoph Hellwig <[email protected]> wrote:

> On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > This barrier thing is constructed so that it will not write in the
> > sync() condition (the hot path) when there are no active lock
> > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > this will work out in relation to PI. We might track those in the
> > barrier scope and boost those by the max prio of the blockers.
>
> Is this really needed? We seem to grow new funky locking algorithms
> exponentially, while people already have a hard time understanding the
> existing ones.

yes, it's needed.

Ingo

2007-01-28 15:34:20

by Christoph Hellwig

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > This barrier thing is constructed so that it will not write in the
> > > sync() condition (the hot path) when there are no active lock
> > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > this will work out in relation to PI. We might track those in the
> > > barrier scope and boost those by the max prio of the blockers.
> >
> > Is this really needed? We seem to grow new funky locking algorithms
> > exponentially, while people already have a hard time understanding the
> > existing ones.
>
> yes, it's needed.

Thanks for the wonderful and in-depth explanation </sarcasm>

2007-01-31 19:12:30

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
>
> * Christoph Hellwig <[email protected]> wrote:
>
> > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > This barrier thing is constructed so that it will not write in the
> > > sync() condition (the hot path) when there are no active lock
> > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > this will work out in relation to PI. We might track those in the
> > > barrier scope and boost those by the max prio of the blockers.
> >
> > Is this really needed? We seem to grow new funky locking algorithms
> > exponentially, while people already have a hard time understanding the
> > existing ones.
>
> yes, it's needed.

Would it be possible to come up with something common between this primitive
and the one that Oleg Nesterov put together for Jens Axboe?

http://lkml.org/lkml/2006/11/29/330

Oleg's approach acquires a lock on the update side, which Peter would
not want in the uncontended case -- but perhaps there is some way to
make Oleg's approach be able to safely test both counters so as to
avoid acquiring the lock if there are no readers.

Oleg, any chance of this working? I believe it does, but have not
thought it through fully.

If it does turn out that we cannot converge these, I believe that
Peter's implementation needs an smp_mb() at both the beginning
and the end of barrier_sync(). Without the first smp_mb(), the
test in barrier_sync() might precede the prior change, and without
the second smp_mb() the barrier_sync() might slide after the following
cleanup code. (But I could easily be misunderstanding the code
using barrier_sync().)

Thanx, Paul

2007-01-31 21:14:44

by Oleg Nesterov

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On 01/31, Paul E. McKenney wrote:
>
> On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> >
> > * Christoph Hellwig <[email protected]> wrote:
> >
> > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > This barrier thing is constructed so that it will not write in the
> > > > sync() condition (the hot path) when there are no active lock
> > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > > this will work out in relation to PI. We might track those in the
> > > > barrier scope and boost those by the max prio of the blockers.
> > >
> > > Is this really needed? We seem to grow new funky locking algorithms
> > > exponentially, while people already have a hard time understanding the
> > > existing ones.
> >
> > yes, it's needed.
>
> Would it be possible to come up with something common between this primitive
> and the one that Oleg Nesterov put together for Jens Axboe?
>
> http://lkml.org/lkml/2006/11/29/330
>
> Oleg's approach acquires a lock on the update side, which Peter would
> not want in the uncontended case -- but perhaps there is some way to
> make Oleg's approach be able to safely test both counters so as to
> avoid acquiring the lock if there are no readers.
>
> Oleg, any chance of this working? I believe it does, but have not
> thought it through fully.

I think no. From a quick reading, barrier_sync() and qrcu/srcu are
quite different. Consider:

	CPU 0				CPU 1

	barrier_lock();
					barrier_sync();
	barrier_unlock();
		... wake up ...
	barrier_lock();
					schedule again

The last "schedule again" would be a BUG for qrcu/srcu, but probably
it is ok for barrier_sync(). It looks like barrier_sync() is more a
rw semaphore biased to readers.
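
In rwsem terms the analogy reads roughly like this (an illustrative sketch
only; note that it writes to the semaphore on every sync, which is exactly
what barrier_sync() avoids in the uncontended case):

	/* a "lock section" maps to down_read(&sem) ... up_read(&sem) */

	static inline void rwsem_style_sync(struct rw_semaphore *sem)
	{
		down_write(sem);	/* wait for all current "lock sections" */
		up_write(sem);		/* then let new ones in again */
	}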

A couple of minor off-topic notes,

+static inline void barrier_unlock(struct barrier *b)
+{
+	smp_wmb();
+	if (atomic_dec_and_test(&b->count))
+		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);

This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could be NULL.

+static inline void barrier_sync(struct barrier *b)
+{
+	might_sleep();
+
+	if (unlikely(atomic_read(&b->count))) {
+		DEFINE_WAIT(wait);
+		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
+		while (atomic_read(&b->count))
+			schedule();
+		finish_wait(&b->wait, &wait);
+	}
+}

This should be an open-coded wait_event(), but it is wrong! With the scenario above this
can hang forever, because the first wake_up removes the task from &b->wait.

Oleg.

2007-01-31 21:31:11

by Oleg Nesterov

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On 02/01, Oleg Nesterov wrote:
>
> +static inline void barrier_sync(struct barrier *b)
> +{
> +	might_sleep();
> +
> +	if (unlikely(atomic_read(&b->count))) {
> +		DEFINE_WAIT(wait);
> +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +		while (atomic_read(&b->count))
> +			schedule();
> +		finish_wait(&b->wait, &wait);
> +	}
> +}
>
> This should be an open-coded wait_event(), but it is wrong! With the scenario above this
> can hang forever, because the first wake_up removes the task from &b->wait.

I am afraid I was not clear (as usual :). prepare_to_wait means autoremove_wake_function.
So, if barrier_unlock() wakes up the caller of barrier_sync(), it will be removed
from b->wait. If it goes to schedule() because another barrier_lock() incremented
b->count, we can't wake it via __wake_up(&b->wait, ...).

Oleg.
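
A minimal sketch of the fix this observation implies, keeping the rest of the
patch as posted: wait_event() re-adds the sleeper to the wait queue on every
iteration of its internal loop, which closes the window, and the open-coded
wakeup becomes a plain wake_up_all():

static inline void barrier_sync(struct barrier *b)
{
	might_sleep();
	wait_event(b->wait, !atomic_read(&b->count));
}

static inline void barrier_unlock(struct barrier *b)
{
	smp_wmb();
	if (atomic_dec_and_test(&b->count))
		wake_up_all(&b->wait);
}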

2007-01-31 21:48:33

by Peter Zijlstra

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > >
> > > * Christoph Hellwig <[email protected]> wrote:
> > >
> > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > This barrier thing is constructed so that it will not write in the
> > > > > sync() condition (the hot path) when there are no active lock
> > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > > > this will work out in relation to PI. We might track those in the
> > > > > barrier scope and boost those by the max prio of the blockers.
> > > >
> > > > Is this really needed? We seem to grow new funky locking algorithms
> > > > exponentially, while people already have a hard time understanding the
> > > > existing ones.
> > >
> > > yes, it's needed.
> >
> > Would it be possible to come up with something common between this primitive
> > and the one that Oleg Nesterov put together for Jens Axboe?
> >
> > http://lkml.org/lkml/2006/11/29/330
> >
> > Oleg's approach acquires a lock on the update side, which Peter would
> > not want in the uncontended case -- but perhaps there is some way to
> > make Oleg's approach be able to safely test both counters so as to
> > avoid acquiring the lock if there are no readers.
> >
> > Oleg, any chance of this working? I believe it does, but have not
> > thought it through fully.
>
> I think no. From a quick reading, barrier_sync() and qrcu/srcu are
> quite different. Consider:
>
> 	CPU 0				CPU 1
>
> 	barrier_lock();
> 					barrier_sync();
> 	barrier_unlock();
> 		... wake up ...
> 	barrier_lock();
> 					schedule again
>
> The last "schedule again" would be a BUG for qrcu/srcu, but probably
> it is ok for barrier_sync().

Yes, that would be ok.

> It looks like barrier_sync() is more a
> rw semaphore biased to readers.

Indeed, the locked sections are designed to be the rare case.

> A couple of minor off-topic notes,
>
> +static inline void barrier_unlock(struct barrier *b)
> +{
> +	smp_wmb();
> +	if (atomic_dec_and_test(&b->count))
> +		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
>
> This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could be NULL.
>
> +static inline void barrier_sync(struct barrier *b)
> +{
> +	might_sleep();
> +
> +	if (unlikely(atomic_read(&b->count))) {
> +		DEFINE_WAIT(wait);
> +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> +		while (atomic_read(&b->count))
> +			schedule();
> +		finish_wait(&b->wait, &wait);
> +	}
> +}
>
> This should be an open-coded wait_event(), but it is wrong! With the scenario above this
> can hang forever, because the first wake_up removes the task from &b->wait.

This would be me struggling with the waitqueue API, it's all a tad
confusing at first look.

2007-01-31 23:32:42

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Wed, Jan 31, 2007 at 10:48:21PM +0100, Peter Zijlstra wrote:
> On Thu, 2007-02-01 at 00:13 +0300, Oleg Nesterov wrote:
> > On 01/31, Paul E. McKenney wrote:
> > > On Sun, Jan 28, 2007 at 04:24:35PM +0100, Ingo Molnar wrote:
> > > > * Christoph Hellwig <[email protected]> wrote:
> > > > > On Sun, Jan 28, 2007 at 12:51:21PM +0100, Peter Zijlstra wrote:
> > > > > > This barrier thing is constructed so that it will not write in the
> > > > > > sync() condition (the hot path) when there are no active lock
> > > > > > sections; thus avoiding cacheline bouncing. -- I'm just not sure how
> > > > > > this will work out in relation to PI. We might track those in the
> > > > > > barrier scope and boost those by the max prio of the blockers.
> > > > >
> > > > > Is this really needed? We seem to grow new funky locking algorithms
> > > > > exponentially, while people already have a hard time understanding the
> > > > > existing ones.
> > > >
> > > > yes, it's needed.
> > >
> > > Would it be possible to come up with something common between this primitive
> > > and the one that Oleg Nesterov put together for Jens Axboe?
> > >
> > > http://lkml.org/lkml/2006/11/29/330
> > >
> > > Oleg's approach acquires a lock on the update side, which Peter would
> > > not want in the uncontended case -- but perhaps there is some way to
> > > make Oleg's approach be able to safely test both counters so as to
> > > avoid acquiring the lock if there are no readers.
> > >
> > > Oleg, any chance of this working? I believe it does, but have not
> > > thought it through fully.
> >
> > I think no. From a quick reading, barrier_sync() and qrcu/srcu are
> > quite different. Consider:
> >
> > 	CPU 0				CPU 1
> >
> > 	barrier_lock();
> > 					barrier_sync();
> > 	barrier_unlock();
> > 		... wake up ...
> > 	barrier_lock();
> > 					schedule again
> >
> > The last "schedule again" would be a BUG for qrcu/srcu, but probably
> > it is ok for barrier_sync().
>
> Yes, that would be ok.

The wakeup in barrier_sync() would mean that the counter was zero
at some point in the past. The counter would then be rechecked, and
if it were still zero, barrier_sync() would invoke finish_wait() and
then return -- but the counter might well become non-zero in the
meantime, right?

So given that barrier_sync() is permitted to return after the counter
becomes non-zero, why can't it just rely on the fact that barrier_unlock()
saw it as zero not long in the past?

> > It looks like barrier_sync() is more a
> > rw semaphore biased to readers.
>
> Indeed, the locked sections are designed to be the rare case.

OK -- but barrier_sync() just waits for readers, it doesn't exclude them.

If all barrier_sync() needs to do is wait for all pre-existing
barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
be compatible with qrcu's semantics.

So what am I missing here?

Thanx, Paul

> > A couple of minor off-topic notes,
> >
> > +static inline void barrier_unlock(struct barrier *b)
> > +{
> > +	smp_wmb();
> > +	if (atomic_dec_and_test(&b->count))
> > +		__wake_up(&b->wait, TASK_INTERRUPTIBLE|TASK_UNINTERRUPTIBLE, 0, b);
> >
> > This is wake_up_all(&b->wait), yes? I don't understand why key == b, it could be NULL.
> >
> > +static inline void barrier_sync(struct barrier *b)
> > +{
> > +	might_sleep();
> > +
> > +	if (unlikely(atomic_read(&b->count))) {
> > +		DEFINE_WAIT(wait);
> > +		prepare_to_wait(&b->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +		while (atomic_read(&b->count))
> > +			schedule();
> > +		finish_wait(&b->wait, &wait);
> > +	}
> > +}
> >
> > This should be an open-coded wait_event(), but it is wrong! With the scenario above this
> > can hang forever, because the first wake_up removes the task from &b->wait.
>
> This would be me struggling with the waitqueue API, it's all a tad
> confusing at first look.
>

2007-02-01 00:03:20

by Peter Zijlstra

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:

> The wakeup in barrier_sync() would mean that the counter was zero
> at some point in the past. The counter would then be rechecked, and
> if it were still zero, barrier_sync() would invoke finish_wait() and
> then return -- but the counter might well become non-zero in the
> meantime, right?
>
> So given that barrier_sync() is permitted to return after the counter
> becomes non-zero, why can't it just rely on the fact that barrier_unlock()
> saw it as zero not long in the past?
>
> > > It looks like barrier_sync() is more a
> > > rw semaphore biased to readers.
> >
> > Indeed, the locked sections are designed to be the rare case.
>
> OK -- but barrier_sync() just waits for readers, it doesn't exclude them.
>
> > If all barrier_sync() needs to do is wait for all pre-existing
> barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
> be compatible with qrcu's semantics.
>
> So what am I missing here?

I might be the one missing stuff, I'll have a hard look at qrcu.

The intent was that barrier_sync() would not write to memory when there
are no active locked sections, so that the cacheline can stay shared,
thus keeping it fast.

If qrcu does exactly this, then yes we have a match.

2007-02-01 00:49:00

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Thu, Feb 01, 2007 at 01:03:09AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-01-31 at 15:32 -0800, Paul E. McKenney wrote:
>
> > The wakeup in barrier_sync() would mean that the counter was zero
> > at some point in the past. The counter would then be rechecked, and
> > if it were still zero, barrier_sync() would invoke finish_wait() and
> > then return -- but the counter might well become non-zero in the
> > meantime, right?
> >
> > So given that barrier_sync() is permitted to return after the counter
> > becomes non-zero, why can't it just rely on the fact that barrier_unlock()
> > saw it as zero not long in the past?
> >
> > > > It looks like barrier_sync() is more a
> > > > rw semaphore biased to readers.
> > >
> > > Indeed, the locked sections are designed to be the rare case.
> >
> > OK -- but barrier_sync() just waits for readers, it doesn't exclude them.
> >
> > If all barrier_sync() needs to do is wait for all pre-existing
> > barrier_lock()/barrier_unlock() pairs to complete, it seems to me to
> > be compatible with qrcu's semantics.
> >
> > So what am I missing here?
>
> I might be the one missing stuff, I'll have a hard look at qrcu.
>
> The intent was that barrier_sync() would not write to memory when there
> are no active locked sections, so that the cacheline can stay shared,
> thus keeping it fast.
>
> If qrcu does exactly this, then yes we have a match.

QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
do what you want, as it acquires the lock unconditionally. I am proposing
that synchronize_qrcu() change to something like the following:

void synchronize_qrcu(struct qrcu_struct *qp)
{
	int idx;

	smp_mb();

	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
		smp_rmb();
		if (atomic_read(&qp->ctr[0]) +
		    atomic_read(&qp->ctr[1]) <= 1)
			goto out;
	}

	mutex_lock(&qp->mutex);
	idx = qp->completed & 0x1;
	atomic_inc(qp->ctr + (idx ^ 0x1));
	/* Reduce the likelihood that qrcu_read_lock() will loop */
	smp_mb__after_atomic_inc();
	qp->completed++;

	atomic_dec(qp->ctr + idx);
	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
	mutex_unlock(&qp->mutex);
out:
	smp_mb();
}

For the first "if" to give a false positive, a concurrent switch had
to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
was two at the time of the first atomic_read(), but then qp->completed
switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
of the second atomic_read. The only way the second "if" can give us a
false positive is if there was another change to qp->completed in the
meantime -- but that means that all of the pre-existing qrcu_read_lock()
holders must have gotten done, otherwise the second switch could not
have happened. Yes, you do incur three memory barriers on the fast
path, but the best you could hope for with your approach was two of them
(unless I am confused about how you were using barrier_sync()).
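
For context, the read side of qrcu (paraphrased from the patch linked above,
so details may differ) shows why qrcu_read_lock() can loop: if
synchronize_qrcu() flips qp->completed and drains the old-side counter between
the reader's index load and its increment, atomic_inc_not_zero() fails and the
reader retries:

int qrcu_read_lock(struct qrcu_struct *qp)
{
	for (;;) {
		int idx = qp->completed & 0x1;
		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
			return idx;
	}
}

void qrcu_read_unlock(struct qrcu_struct *qp, int idx)
{
	if (atomic_dec_and_test(qp->ctr + idx))
		wake_up(&qp->wq);
}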

Oleg, does this look safe?

Ugly at best, I know, but I do very much sympathize with Christoph's
desire to keep the number of synchronization primitives down to a
dull roar. ;-)

Thanx, Paul

2007-02-01 16:00:48

by Oleg Nesterov

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On 01/31, Paul E. McKenney wrote:
>
> QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> do what you want, as it acquires the lock unconditionally. I am proposing
> that synchronize_qrcu() change to something like the following:
>
> void synchronize_qrcu(struct qrcu_struct *qp)
> {
> 	int idx;
>
> 	smp_mb();
>
> 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> 		smp_rmb();
> 		if (atomic_read(&qp->ctr[0]) +
> 		    atomic_read(&qp->ctr[1]) <= 1)
> 			goto out;
> 	}
>
> 	mutex_lock(&qp->mutex);
> 	idx = qp->completed & 0x1;
> 	atomic_inc(qp->ctr + (idx ^ 0x1));
> 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> 	smp_mb__after_atomic_inc();
> 	qp->completed++;
>
> 	atomic_dec(qp->ctr + idx);
> 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> 	mutex_unlock(&qp->mutex);
> out:
> 	smp_mb();
> }
>
> For the first "if" to give a false positive, a concurrent switch had
> to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> was two at the time of the first atomic_read(), but then qp->completed
> switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> of the second atomic_read. The only way the second "if" can give us a
> false positive is if there was another change to qp->completed in the
> meantime -- but that means that all of the pre-existing qrcu_read_lock()
> holders must have gotten done, otherwise the second switch could not
> have happened. Yes, you do incur three memory barriers on the fast
> path, but the best you could hope for with your approach was two of them
> (unless I am confused about how you were using barrier_sync()).

While doing qrcu, somehow I convinced myself we can't optimize out taking
qp->mutex. Now I think I was wrong. Good!

Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
was this needed for this optimization to work? I am asking because I can't
understand how it can make any difference.

> Oleg, does this look safe?

Yes. But let me think more about this later, I've got a fever, and can't
think properly today :)

Oleg.

2007-02-01 21:38:40

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Thu, Feb 01, 2007 at 07:00:10PM +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> > do what you want, as it acquires the lock unconditionally. I am proposing
> > that synchronize_qrcu() change to something like the following:
> >
> > void synchronize_qrcu(struct qrcu_struct *qp)
> > {
> > 	int idx;
> >
> > 	smp_mb();
> >
> > 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > 		smp_rmb();
> > 		if (atomic_read(&qp->ctr[0]) +
> > 		    atomic_read(&qp->ctr[1]) <= 1)
> > 			goto out;
> > 	}
> >
> > 	mutex_lock(&qp->mutex);
> > 	idx = qp->completed & 0x1;
> > 	atomic_inc(qp->ctr + (idx ^ 0x1));
> > 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> > 	smp_mb__after_atomic_inc();
> > 	qp->completed++;
> >
> > 	atomic_dec(qp->ctr + idx);
> > 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > 	mutex_unlock(&qp->mutex);
> > out:
> > 	smp_mb();
> > }
> >
> > For the first "if" to give a false positive, a concurrent switch had
> > to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> > was two at the time of the first atomic_read(), but then qp->completed
> > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > of the second atomic_read. The only way the second "if" can give us a
> > false positive is if there was another change to qp->completed in the
> > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > holders must have gotten done, otherwise the second switch could not
> > have happened. Yes, you do incur three memory barriers on the fast
> > path, but the best you could hope for with your approach was two of them
> > (unless I am confused about how you were using barrier_sync()).
>
> While doing qrcu, somehow I convinced myself we can't optimize out taking
> qp->mutex. Now I think I was wrong. Good!

Me, I didn't want to worry about it unless someone needed it. Which
it now appears they do. ;-)

> Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> was this needed for this optimization to work? I am asking because I can't
> understand how it can make any difference.

Before, we held the lock, so we could just check the single current
element. Now we don't hold the lock, so we need to check both elements.
So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
nested "if" statements that test both elements.

> > Oleg, does this look safe?
>
> Yes. But let me think more about this later, I've got a fever, and can't
> think properly today :)

Get well!!! ;-)

And yes, the concurrency on the fastpath is nontrivial...

Thanx, Paul

2007-02-02 11:57:12

by Oleg Nesterov

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On 02/01, Paul E. McKenney wrote:
> > >
> > > void synchronize_qrcu(struct qrcu_struct *qp)
> > > {
> > > 	int idx;
> > >
> > > 	smp_mb();
> > >
> > > 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > 		smp_rmb();
> > > 		if (atomic_read(&qp->ctr[0]) +
> > > 		    atomic_read(&qp->ctr[1]) <= 1)
> > > 			goto out;
> > > 	}
> > >
> > > 	mutex_lock(&qp->mutex);
> > > 	idx = qp->completed & 0x1;
> > > 	atomic_inc(qp->ctr + (idx ^ 0x1));
> > > 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > 	smp_mb__after_atomic_inc();
> > > 	qp->completed++;
> > >
> > > 	atomic_dec(qp->ctr + idx);
> > > 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > 	mutex_unlock(&qp->mutex);
> > > out:
> > > 	smp_mb();
> > > }
> > >
> > > For the first "if" to give a false positive, a concurrent switch had
> > > to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> > > was two at the time of the first atomic_read(), but then qp->completed
> > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > of the second atomic_read. The only way the second "if" can give us a
> > > false positive is if there was another change to qp->completed in the
> > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > holders must have gotten done, otherwise the second switch could not
> > > have happened. Yes, you do incur three memory barriers on the fast
> > > path, but the best you could hope for with your approach was two of them
> > > (unless I am confused about how you were using barrier_sync()).

Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
another >= 1. == 1 means we have no active readers. So the false positive really
means a concurrent switch. And we can check twice - excellent idea!

> > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > qp->mutex. Now I think I was wrong. Good!
>
> Me, I didn't want to worry about it unless someone needed it. Which
> it now appears they do. ;-)

No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
So I decided it is not possible. And now you show that I just don't have enough
brains! (of course, I hate you :)

> > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > was this needed for this optimization to work? I am asking because I can't
> > understand how it can make any difference.
>
> Before, we held the lock, so we could just check the single current
> element. Now we don't hold the lock, so we need to check both elements.
> So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> nested "if" statements that test both elements.

Ah, my question was different. The current version of qrcu does

	mutex_lock(&qp->mutex);

	idx = qp->completed & 0x1;
	if (atomic_read(qp->ctr + idx) == 1)	// fast path
		return;

	...

and it seems to me that we can retain this fastpath even with your optimization,
no? Surely, it is not so important, but it is nearly free.
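
Concretely, the combined version might look as follows (an illustrative sketch
only; the out_unlock label is added here and appears in neither posted
version):

void synchronize_qrcu(struct qrcu_struct *qp)
{
	int idx;

	smp_mb();

	/* lockless fast path: no readers on either side */
	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
		smp_rmb();
		if (atomic_read(&qp->ctr[0]) +
		    atomic_read(&qp->ctr[1]) <= 1)
			goto out;
	}

	mutex_lock(&qp->mutex);
	idx = qp->completed & 0x1;
	if (atomic_read(&qp->ctr[idx]) == 1)	/* retained fast path */
		goto out_unlock;

	atomic_inc(qp->ctr + (idx ^ 0x1));
	/* Reduce the likelihood that qrcu_read_lock() will loop */
	smp_mb__after_atomic_inc();
	qp->completed++;

	atomic_dec(qp->ctr + idx);
	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
out_unlock:
	mutex_unlock(&qp->mutex);
out:
	smp_mb();
}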

Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
P-4 ht).

Peter, do you think you can use qrcu?

Oleg.

2007-02-02 12:02:11

by Peter Zijlstra

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Fri, 2007-02-02 at 14:56 +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> > > >
> > > > void synchronize_qrcu(struct qrcu_struct *qp)
> > > > {
> > > > 	int idx;
> > > >
> > > > 	smp_mb();
> > > >
> > > > 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > > 		smp_rmb();
> > > > 		if (atomic_read(&qp->ctr[0]) +
> > > > 		    atomic_read(&qp->ctr[1]) <= 1)
> > > > 			goto out;
> > > > 	}
> > > >
> > > > 	mutex_lock(&qp->mutex);
> > > > 	idx = qp->completed & 0x1;
> > > > 	atomic_inc(qp->ctr + (idx ^ 0x1));
> > > > 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > > 	smp_mb__after_atomic_inc();
> > > > 	qp->completed++;
> > > >
> > > > 	atomic_dec(qp->ctr + idx);
> > > > 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > > 	mutex_unlock(&qp->mutex);
> > > > out:
> > > > 	smp_mb();
> > > > }
> > > >
> > > > For the first "if" to give a false positive, a concurrent switch had
> > > > to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> > > > was two at the time of the first atomic_read(), but then qp->completed
> > > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > > of the second atomic_read. The only way the second "if" can give us a
> > > > false positive is if there was another change to qp->completed in the
> > > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > > holders must have gotten done, otherwise the second switch could not
> > > > have happened. Yes, you do incur three memory barriers on the fast
> > > > path, but the best you could hope for with your approach was two of them
> > > > (unless I am confused about how you were using barrier_sync()).
>
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive really
> means a concurrent switch. And we can check twice - excellent idea!
>
> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong. Good!
> >
> > Me, I didn't want to worry about it unless someone needed it. Which
> > it now appears they do. ;-)
>
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have enough
> brains! (of course, I hate you :)
>
> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > > was this needed for this optimization to work? I am asking because I can't
> > > understand how it can make any difference.
> >
> > Before, we held the lock, so we could just check the single current
> > element. Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
>
> Ah, my question was different. The current version of qrcu does
>
> 	mutex_lock(&qp->mutex);
>
> 	idx = qp->completed & 0x1;
> 	if (atomic_read(qp->ctr + idx) == 1)	// fast path
> 		return;
>
> 	...
>
> and it seems to me that we can retain this fastpath even with your optimization,
> no? Surely, it is not so important, but it is nearly free.
>
> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
> P-4 ht).
>
> Peter, do you think you can use qrcu?

Yes, this looks very much like what I need. Awesome work!

2007-02-02 17:13:12

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Fri, Feb 02, 2007 at 02:56:12PM +0300, Oleg Nesterov wrote:
> On 02/01, Paul E. McKenney wrote:
> > > >
> > > > void synchronize_qrcu(struct qrcu_struct *qp)
> > > > {
> > > > 	int idx;
> > > >
> > > > 	smp_mb();
> > > >
> > > > 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > > > 		smp_rmb();
> > > > 		if (atomic_read(&qp->ctr[0]) +
> > > > 		    atomic_read(&qp->ctr[1]) <= 1)
> > > > 			goto out;
> > > > 	}
> > > >
> > > > 	mutex_lock(&qp->mutex);
> > > > 	idx = qp->completed & 0x1;
> > > > 	atomic_inc(qp->ctr + (idx ^ 0x1));
> > > > 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> > > > 	smp_mb__after_atomic_inc();
> > > > 	qp->completed++;
> > > >
> > > > 	atomic_dec(qp->ctr + idx);
> > > > 	__wait_event(qp->wq, !atomic_read(qp->ctr + idx));
> > > > 	mutex_unlock(&qp->mutex);
> > > > out:
> > > > 	smp_mb();
> > > > }
> > > >
> > > > For the first "if" to give a false positive, a concurrent switch had
> > > > to have happened. For example, qp->ctr[0] was zero and qp->ctr[1]
> > > > was two at the time of the first atomic_read(), but then qp->completed
> > > > switched so that both qp->ctr[0] and qp->ctr[1] were one at the time
> > > > of the second atomic_read. The only way the second "if" can give us a
> > > > false positive is if there was another change to qp->completed in the
> > > > meantime -- but that means that all of the pre-existing qrcu_read_lock()
> > > > holders must have gotten done, otherwise the second switch could not
> > > > have happened. Yes, you do incur three memory barriers on the fast
> > > > path, but the best you could hope for with your approach was two of them
> > > > (unless I am confused about how you were using barrier_sync()).
>
> Yes. Without synchronize_qrcu() in between, one of the counters should be == 0,
> another >= 1. == 1 means we have no active readers. So the false positive really
> means a concurrent switch. And we can check twice - excellent idea!

Well, if it ends up really working. ;-)

This one needs more than just testing -- I will put together a Promela
model that does a full state-space search for races.

I do have one fall-back, namely putting both counters into a single
aligned long. But I would like to avoid this, since there might be
architectures out there that cannot cleanly store into half-longs.
Such architectures would have to use atomic ops. :-/
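
A sketch of that fall-back, assuming a hypothetical ctr_pair field packing
both counters into a single long (the catch, as noted, is that readers would
then have to update half of the word atomically):

#define QRCU_HALF_SHIFT	(BITS_PER_LONG / 2)
#define QRCU_HALF_MASK	((1UL << QRCU_HALF_SHIFT) - 1)

static int qrcu_readers_active(struct qrcu_struct *qp)
{
	/* one load yields a mutually consistent snapshot of both counters */
	unsigned long pair = atomic_long_read(&qp->ctr_pair);

	return (pair & QRCU_HALF_MASK) + (pair >> QRCU_HALF_SHIFT) > 1;
}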

> > > While doing qrcu, somehow I convinced myself we can't optimize out taking
> > > qp->mutex. Now I think I was wrong. Good!
> >
> > Me, I didn't want to worry about it unless someone needed it. Which
> > it now appears they do. ;-)
>
> No. I do remember I tried hard to optimize out taking qp->mutex, but failed.
> So I decided it is not possible. And now you show that I just don't have enough
> brains! (of course, I hate you :)

Coming from you, that is high praise indeed!!! ;-)

Now if it really does work...

> > > Q: you deleted "if (atomic_read(qp->ctr + idx) == 1)" fastpath under ->mutex,
> > > was this needed for this optimization to work? I am asking because I can't
> > > understand how it can make any difference.
> >
> > Before, we held the lock, so we could just check the single current
> > element. Now we don't hold the lock, so we need to check both elements.
> > So I replaced the "if (atomic_read(qp->ctr + idx) == 1)" with the
> > nested "if" statements that test both elements.
>
> Ah, my question was different. The current version of qrcu does
>
> 	mutex_lock(&qp->mutex);
>
> 	idx = qp->completed & 0x1;
> 	if (atomic_read(qp->ctr + idx) == 1)	// fast path
> 		return;
>
> 	...
>
> and it seems to me that we can retain this fastpath even with your optimization,
> no? Surely, it is not so important, but it is nearly free.

Ah! This does make sense, excellent point!!!

> Paul, could you make a patch? (I'll do rcutorture test tomorrow, but I only have
> P-4 ht).

I will do the Promela model, and if that works, I will submit a patch.

Thanx, Paul

> Peter, do you think you can use qrcu?
>
> Oleg.
>

2007-02-03 16:39:25

by Oleg Nesterov

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On 01/31, Paul E. McKenney wrote:
>
> QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> do what you want, as it acquires the lock unconditionally. I am proposing
> that synchronize_qrcu() change to something like the following:
>
> void synchronize_qrcu(struct qrcu_struct *qp)
> {
> 	int idx;
>
> 	smp_mb();
>
> 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> 		smp_rmb();
> 		if (atomic_read(&qp->ctr[0]) +
> 		    atomic_read(&qp->ctr[1]) <= 1)
> 			goto out;
> 	}
>
> 	mutex_lock(&qp->mutex);
> 	idx = qp->completed & 0x1;
> 	atomic_inc(qp->ctr + (idx ^ 0x1));
> 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> 	smp_mb__after_atomic_inc();

I almost forgot. Currently this smp_mb__after_atomic_inc() is not strictly
needed, and the comment is correct. However, it becomes mandatory with your
optimization. Without this barrier, it is possible that both checks above
mutex_lock() will see the result of atomic_dec(), but not the atomic_inc().

So, may I ask you to also update this comment?

	/*
	 * Reduce the likelihood that qrcu_read_lock() will loop
	 * AND
	 * make sure the second re-check above will see the result
	 * of atomic_inc() if it sees the result of atomic_dec()
	 */

Something like this, I hope you will make it better.

And another note: this all assumes that STORE-MB-LOAD works "correctly", yes?
We have other code which relies on that, should not be a problem.
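
For reference, the pattern in question is the classic store-buffering litmus
test; with smp_mb() between each store and load, the outcome in which both
loads miss both stores must be forbidden:

	CPU 0			CPU 1
	X = 1;			Y = 1;
	smp_mb();		smp_mb();
	r1 = Y;			r2 = X;

	/* forbidden final state: r1 == 0 && r2 == 0 */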

(Alan Stern cc'ed).

Oleg.

2007-02-04 00:24:10

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Sat, Feb 03, 2007 at 07:38:50PM +0300, Oleg Nesterov wrote:
> On 01/31, Paul E. McKenney wrote:
> >
> > QRCU as currently written (http://lkml.org/lkml/2006/11/29/330) doesn't
> > do what you want, as it acquires the lock unconditionally. I am proposing
> > that synchronize_qrcu() change to something like the following:
> >
> > void synchronize_qrcu(struct qrcu_struct *qp)
> > {
> > 	int idx;
> >
> > 	smp_mb();
> >
> > 	if (atomic_read(&qp->ctr[0]) + atomic_read(&qp->ctr[1]) <= 1) {
> > 		smp_rmb();
> > 		if (atomic_read(&qp->ctr[0]) +
> > 		    atomic_read(&qp->ctr[1]) <= 1)
> > 			goto out;
> > 	}
> >
> > 	mutex_lock(&qp->mutex);
> > 	idx = qp->completed & 0x1;
> > 	atomic_inc(qp->ctr + (idx ^ 0x1));
> > 	/* Reduce the likelihood that qrcu_read_lock() will loop */
> > 	smp_mb__after_atomic_inc();
>
> I almost forgot. Currently this smp_mb__after_atomic_inc() is not strictly
> needed, and the comment is correct. However, it becomes mandatory with your
> optimization. Without this barrier, it is possible that both checks above
> mutex_lock() will see the result of atomic_dec(), but not the atomic_inc().
>
> So, may I ask you to also update this comment?
>
> 	/*
> 	 * Reduce the likelihood that qrcu_read_lock() will loop
> 	 * AND
> 	 * make sure the second re-check above will see the result
> 	 * of atomic_inc() if it sees the result of atomic_dec()
> 	 */
>
> Something like this, I hope you will make it better.

Good catch!!! I will make sure that this is reflected.

Also, validation is proceeding nicely, if extremely hoggishly.
The validation program and output thus far are attached in case someone
is interested. The three-readers/three-updaters case has worked its
way up to 16% of the memory on a 48GB machine. ;-)

If it overflows, I will see if I can get someone to convert it to
VHDL and run hardware validation tools on it. This turned out to
be necessary for the -rt implementation of RCU...

> And another note: this all assumes that STORE-MB-LOAD works "correctly", yes?
> We have other code which relies on that, should not be a problem.

We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt of
University of Pennsylvania, and Vijay Saraswat of IBM Research towards
a "universal memory model" that accommodates all machines. Currently,
it does in fact handle store-mb-load the way we all want, thankfully!
Actually, the other guys are doing most of the formalism, my main role
has been programming a very stupid checker based on their formalisms
and yelling at them when something bad happens.

See the cccc directory in the x10 project on SourceForge if you want more
info, but be warned that the checker's UI really sucks. It's input and
output formats are abominations that could only have been dreamed up by
someone who started out on punchcards and early-70s BASIC. Not pretty,
but it does a good job of letting people know what I think the formalism
is saying!

There are some people working on a Prolog program called jmmsolve that
has a much nicer input format, but I need to prototype memory barriers
before they will incorporate them (currently, each statement acts as
if it had smp_mb() before and after it). Also, their output is as yet
incomprehensible to me.

Thanx, Paul

> (Alan Stern cc'ed).
>
> Oleg.
>


Attachments:
qrcuval.2007.02.03a.tgz (7.08 kB)

2007-02-04 03:24:08

by Alan Stern

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Sat, 3 Feb 2007, Paul E. McKenney wrote:

> > And another note: this all assumes that STORE-MB-LOAD works "correctly", yes?
> > We have other code which relies on that, should not be a problem.
>
> > We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt of
> University of Pennsylvania, and Vijay Saraswat of IBM Research towards
> a "universal memory model" that accommodates all machines. Currently,
> it does in fact handle store-mb-load the way we all want, thankfully!

We should add that many places in the kernel do depend on proper behavior
for this data access pattern. So whatever "universal memory model" we end
up with, it had better handle the pattern correctly if Linux is to support
it.

It's interesting to note, however, that this does exclude simple MESI.

Alan Stern

2007-02-04 05:46:29

by Paul E. McKenney

Subject: Re: [PATCH 3/7] barrier: a scalable synchronisation barrier

On Sat, Feb 03, 2007 at 10:24:04PM -0500, Alan Stern wrote:
> On Sat, 3 Feb 2007, Paul E. McKenney wrote:
>
> > > And another note: this all assumes that STORE-MB-LOAD works "correctly", yes?
> > > We have other code which relies on that, should not be a problem.
> >
> > > We have been working with Doug Lea of SUNY Oswego, Sebastian Burckhardt of
> > University of Pennsylvania, and Vijay Saraswat of IBM Research towards
> > a "universal memory model" that accommodates all machines. Currently,
> > it does in fact handle store-mb-load the way we all want, thankfully!
>
> We should add that many places in the kernel do depend on proper behavior
> for this data access pattern. So whatever "universal memory model" we end
> up with, it had better handle the pattern correctly if Linux is to support
> it.

Agreed!

> It's interesting to note, however, that this does exclude simple MESI.

Yep! And also a number of compiler optimizations, as it turns out. ;-)

There is a tension between nice-to-software memory-barrier properties
on the one hand and easily understood code on the other. But I guess
that this is true of pretty much any software tool.

Thanx, Paul