2023-01-26 13:42:33

by Jonas Oberhauser

Subject: [PATCH v2 0/2] Streamlining treatment of smp_mb__after_unlock_lock

Hi all,

This patchset first makes smp_mb__after_unlock_lock use po-unlock-lock-po
to express its ordering property on UNLOCK+LOCK pairs, which is more
uniform with the rest of the model. This does not affect the guarantees
the fence provides, but if the solution proposed here is unsatisfactory,
two alternatives spring to mind:

1) one could think about rephrasing the guarantee of this fence to
more obviously match the code

2) one could redefine po-unlock-lock-po to more obviously match the
definition of an UNLOCK+LOCK pair in rcupdate.h (note: I haven't
yet audited every use of po-unlock-lock-po, so this would need to
be checked first)

The second patch then makes ppo a subset of po by moving the extended
behavior of this fence (which can relate events of two different threads)
out of ppo.


I'm also not sure how to correctly credit the helpful comments of the
reviewers on the first iteration of the patch.

Changes since the last patch proposal:

1) As suggested by Andrea Parri, the patches no longer introduce a new
relation and instead rely purely on strong-fence. Unlike his
suggestion, these patches do not redefine strong-fence but instead
use mb | gp directly in the case where the extended fence is to be
excluded.

2) Following another suggestion by Andrea Parri, the patches no longer
update any documentation since there are no new relations.

3) We split the patch logically into two steps as mentioned above.

4) As suggested by Alan Stern, we explain in the model why the
mismatch between the UNLOCK+LOCK definition in rcupdate.h and the
definition of the fence's semantics in the model does not cause
any difference in behavior.


Jonas Oberhauser (2):
tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po
tools/memory-model: Make ppo a subrelation of po

tools/memory-model/linux-kernel.cat | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)

--
2.17.1



2023-01-26 13:42:43

by Jonas Oberhauser

Subject: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

LKMM uses two relations for talking about UNLOCK+LOCK pairings:

1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
on the same CPU or immediate lock handovers on the same
lock variable

2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
literally as described in rcupdate.h#L1002, i.e., even
after a sequence of handovers on the same lock variable.
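
For reference, the first of these is defined in linux-kernel.cat as
(quoting from memory, so the exact form may differ slightly):

let po-unlock-lock-po = po ; [UL] ; (po | rf) ; [LKR] ; po

i.e., the handover is a single po or rf step, rather than a walk along
the lock's coherence order as in the second relation.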

The latter relation is used only once, to provide the guarantee
defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
makes any UNLOCK+LOCK pair followed by the fence behave like a full
barrier.

This patch drops this use in favor of using po-unlock-lock-po
everywhere, which unifies the way the model talks about UNLOCK+LOCK
pairings. At first glance this seems to weaken the guarantee given
by LKMM: When considering a long sequence of lock handovers
such as below, where P0 hands the lock to P1, which hands it to P2,
which finally executes such an after_unlock_lock fence, the mb
relation currently links any stores in the critical section of P0
to instructions P2 executes after its fence, but not so after the
patch.

P0(int *x, int *y, spinlock_t *mylock)
{
        spin_lock(mylock);
        WRITE_ONCE(*x, 2);
        spin_unlock(mylock);
        WRITE_ONCE(*y, 1);
}

P1(int *y, int *z, spinlock_t *mylock)
{
        int r0 = READ_ONCE(*y); // reads 1
        spin_lock(mylock);
        spin_unlock(mylock);
        WRITE_ONCE(*z, 1);
}

P2(int *z, int *d, spinlock_t *mylock)
{
        int r1 = READ_ONCE(*z); // reads 1
        spin_lock(mylock);
        spin_unlock(mylock);
        smp_mb__after_unlock_lock();
        WRITE_ONCE(*d, 1);
}

P3(int *x, int *d)
{
        WRITE_ONCE(*d, 2);
        smp_mb();
        WRITE_ONCE(*x, 1);
}

exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)

Nevertheless, the ordering guarantee given in rcupdate.h is actually
not weakened. This is because the unlock operations along the
sequence of handovers are A-cumulative fences. They ensure that any
stores that propagate to the CPU performing the first unlock
operation in the sequence must also propagate to every CPU that
performs a subsequent lock operation in the sequence. Therefore any
such stores will also be ordered correctly by the fence even if only
the final handover is considered a full barrier.

Indeed this patch does not affect the behaviors allowed by LKMM at
all. The mb relation is used to define ordering through:
1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
lock-release, rfe, and unlock-acquire orderings each provide hb
2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
lock-release orderings simply add more fine-grained cumul-fence
edges to substitute a single strong-fence edge provided by a long
lock handover sequence
3) mb/strong-fence/pb and various similar uses in the definition of
data races, where as discussed above any long handover sequence
can be turned into a sequence of cumul-fence edges that provide
the same ordering.
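
To make point 2) concrete for the test above: before the patch there is
a single mb edge

  Wx=2 ->mb Wd=1

while after the patch the same ordering can be pieced together
(sketching the edges by hand from the definitions, so the exact
decomposition should be taken with a grain of salt) as

  Wx=2 ->po-rel P0:spin_unlock ->rfe P1:spin_lock ->mb Wd=1

where the po-rel step is one of the cumul-fence edges, the rfe step is
the trailing rfe admitted by prop, and the final mb edge is the narrower
po-unlock-lock-po-based one from P1's spin_lock (po-before the unlock
that hands the lock directly to P2) to P2's store after the fence.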

Signed-off-by: Jonas Oberhauser <[email protected]>
---
tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 07f884f9b2bf..6e531457bb73 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
- ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
- fencerel(After-unlock-lock) ; [M])
+(*
+ * Note: The po-unlock-lock-po relation only passes the lock to the direct
+ * successor, perhaps giving the impression that the ordering of the
+ * smp_mb__after_unlock_lock() fence only affects a single lock handover.
+ * However, in a longer sequence of lock handovers, the implicit
+ * A-cumulative release fences of lock-release ensure that any stores that
+ * propagate to one of the involved CPUs before it hands over the lock to
+ * the next CPU will also propagate to the final CPU handing over the lock
+ * to the CPU that executes the fence. Therefore, all those stores are
+ * also affected by the fence.
+ *)
+ ([M] ; po-unlock-lock-po ;
+ [After-unlock-lock] ; po ; [M])
let gp = po ; [Sync-rcu | Sync-srcu] ; po?
let strong-fence = mb | gp

--
2.17.1


2023-01-26 13:42:52

by Jonas Oberhauser

Subject: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions. However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.

This happens due to the mb/strong-fence relations, which (as one
case) provide order when locks are passed between threads followed
by an smp_mb__after_unlock_lock() fence. This is illustrated in
the following litmus test (as can be seen when using herd7 with
`doshow ppo`):

P0(int *x, int *y)
{
        spin_lock(x);
        spin_unlock(x);
}

P1(int *x, int *y)
{
        spin_lock(x);
        smp_mb__after_unlock_lock();
        *y = 1;
}

The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
P0 passes a lock to P1 which then uses this fence.

The patch makes ppo a subrelation of po by eliminating this possibility
from mb (but not strong-fence) and relying explicitly on mb|gp instead
of strong-fence when defining ppo.
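
For reference, the path by which such a cross-CPU pair currently
reaches ppo is through the fence relation; roughly (the ppo line is
quoted from memory and may differ in detail):

let nonrw-fence = strong-fence | po-rel | acq-po
let fence = nonrw-fence | wmb | rmb
let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)

so anything admitted by mb (and hence strong-fence) ends up in fence
and therefore in ppo.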

Signed-off-by: Jonas Oberhauser <[email protected]>
---
tools/memory-model/linux-kernel.cat | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 6e531457bb73..815fdafacaef 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
let mb = ([M] ; fencerel(Mb) ; [M]) |
([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
- ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
+ ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
+let gp = po ; [Sync-rcu | Sync-srcu] ; po?
+let strong-fence = mb | gp |
(*
* Note: The po-unlock-lock-po relation only passes the lock to the direct
* successor, perhaps giving the impression that the ordering of the
@@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
*)
([M] ; po-unlock-lock-po ;
[After-unlock-lock] ; po ; [M])
-let gp = po ; [Sync-rcu | Sync-srcu] ; po?
-let strong-fence = mb | gp

-let nonrw-fence = strong-fence | po-rel | acq-po
+
+let nonrw-fence = mb | gp | po-rel | acq-po
let fence = nonrw-fence | wmb | rmb
let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
Before-atomic | After-atomic | Acquire | Release |
--
2.17.1


2023-01-26 16:36:29

by Alan Stern

Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions. However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
>
> This happens due to the mb/strong-fence relations, which (as one
> case) provide order when locks are passed between threads followed
> by an smp_mb__after_unlock_lock() fence. This is illustrated in
> the following litmus test (as can be seen when using herd7 with
> `doshow ppo`):
>
> P0(int *x, int *y)
> {
> spin_lock(x);
> spin_unlock(x);
> }
>
> P1(int *x, int *y)
> {
> spin_lock(x);
> smp_mb__after_unlock_lock();
> *y = 1;
> }
>
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
>
> The patch makes ppo a subrelation of po by eliminating this possibility
> from mb (but not strong-fence) and relying explicitly on mb|gp instead
> of strong-fence when defining ppo.
>
> Signed-off-by: Jonas Oberhauser <[email protected]>
> ---

This changes the meaning of the fence relation, which is used in
w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar. Have you
checked that they won't be affected by the change?

> tools/memory-model/linux-kernel.cat | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 6e531457bb73..815fdafacaef 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -36,7 +36,9 @@ let wmb = [W] ; fencerel(Wmb) ; [W]
> let mb = ([M] ; fencerel(Mb) ; [M]) |
> ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> - ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> + ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M])
> +let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> +let strong-fence = mb | gp |
> (*
> * Note: The po-unlock-lock-po relation only passes the lock to the direct
> * successor, perhaps giving the impression that the ordering of the
> @@ -50,10 +52,9 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> *)
> ([M] ; po-unlock-lock-po ;
> [After-unlock-lock] ; po ; [M])
> -let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> -let strong-fence = mb | gp
>
> -let nonrw-fence = strong-fence | po-rel | acq-po
> +

Extra blank line.

> +let nonrw-fence = mb | gp | po-rel | acq-po
> let fence = nonrw-fence | wmb | rmb
> let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
> Before-atomic | After-atomic | Acquire | Release |
> --
> 2.17.1

Alan

2023-01-26 16:37:07

by Alan Stern

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>
> 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> on the same CPU or immediate lock handovers on the same
> lock variable
>
> 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> literally as described in rcupdate.h#L1002, i.e., even
> after a sequence of handovers on the same lock variable.
>
> The latter relation is used only once, to provide the guarantee
> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> makes any UNLOCK+LOCK pair followed by the fence behave like a full
> barrier.
>
> This patch drops this use in favor of using po-unlock-lock-po
> everywhere, which unifies the way the model talks about UNLOCK+LOCK
> pairings. At first glance this seems to weaken the guarantee given
> by LKMM: When considering a long sequence of lock handovers
> such as below, where P0 hands the lock to P1, which hands it to P2,
> which finally executes such an after_unlock_lock fence, the mb
> relation currently links any stores in the critical section of P0
> to instructions P2 executes after its fence, but not so after the
> patch.
>
> P0(int *x, int *y, spinlock_t *mylock)
> {
> spin_lock(mylock);
> WRITE_ONCE(*x, 2);
> spin_unlock(mylock);
> WRITE_ONCE(*y, 1);
> }
>
> P1(int *y, int *z, spinlock_t *mylock)
> {
> int r0 = READ_ONCE(*y); // reads 1
> spin_lock(mylock);
> spin_unlock(mylock);
> WRITE_ONCE(*z,1);
> }
>
> P2(int *z, int *d, spinlock_t *mylock)
> {
> int r1 = READ_ONCE(*z); // reads 1
> spin_lock(mylock);
> spin_unlock(mylock);
> smp_mb__after_unlock_lock();
> WRITE_ONCE(*d,1);
> }
>
> P3(int *x, int *d)
> {
> WRITE_ONCE(*d,2);
> smp_mb();
> WRITE_ONCE(*x,1);
> }
>
> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>
> Nevertheless, the ordering guarantee given in rcupdate.h is actually
> not weakened. This is because the unlock operations along the
> sequence of handovers are A-cumulative fences. They ensure that any
> stores that propagate to the CPU performing the first unlock
> operation in the sequence must also propagate to every CPU that
> performs a subsequent lock operation in the sequence. Therefore any
> such stores will also be ordered correctly by the fence even if only
> the final handover is considered a full barrier.
>
> Indeed this patch does not affect the behaviors allowed by LKMM at
> all. The mb relation is used to define ordering through:
> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> lock-release, rfe, and unlock-acquire orderings each provide hb
> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> lock-release orderings simply add more fine-grained cumul-fence
> edges to substitute a single strong-fence edge provided by a long
> lock handover sequence
> 3) mb/strong-fence/pb and various similar uses in the definition of
> data races, where as discussed above any long handover sequence
> can be turned into a sequence of cumul-fence edges that provide
> the same ordering.
>
> Signed-off-by: Jonas Oberhauser <[email protected]>
> ---

Reviewed-by: Alan Stern <[email protected]>

> tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 07f884f9b2bf..6e531457bb73 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> - ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> - fencerel(After-unlock-lock) ; [M])
> +(*
> + * Note: The po-unlock-lock-po relation only passes the lock to the direct
> + * successor, perhaps giving the impression that the ordering of the
> + * smp_mb__after_unlock_lock() fence only affects a single lock handover.
> + * However, in a longer sequence of lock handovers, the implicit
> + * A-cumulative release fences of lock-release ensure that any stores that
> + * propagate to one of the involved CPUs before it hands over the lock to
> + * the next CPU will also propagate to the final CPU handing over the lock
> + * to the CPU that executes the fence. Therefore, all those stores are
> + * also affected by the fence.
> + *)
> + ([M] ; po-unlock-lock-po ;
> + [After-unlock-lock] ; po ; [M])
> let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> let strong-fence = mb | gp
>
> --
> 2.17.1
>

2023-01-26 20:08:38

by Paul E. McKenney

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> >
> > 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > on the same CPU or immediate lock handovers on the same
> > lock variable
> >
> > 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > literally as described in rcupdate.h#L1002, i.e., even
> > after a sequence of handovers on the same lock variable.
> >
> > The latter relation is used only once, to provide the guarantee
> > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > barrier.
> >
> > This patch drops this use in favor of using po-unlock-lock-po
> > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > pairings. At first glance this seems to weaken the guarantee given
> > by LKMM: When considering a long sequence of lock handovers
> > such as below, where P0 hands the lock to P1, which hands it to P2,
> > which finally executes such an after_unlock_lock fence, the mb
> > relation currently links any stores in the critical section of P0
> > to instructions P2 executes after its fence, but not so after the
> > patch.
> >
> > P0(int *x, int *y, spinlock_t *mylock)
> > {
> > spin_lock(mylock);
> > WRITE_ONCE(*x, 2);
> > spin_unlock(mylock);
> > WRITE_ONCE(*y, 1);
> > }
> >
> > P1(int *y, int *z, spinlock_t *mylock)
> > {
> > int r0 = READ_ONCE(*y); // reads 1
> > spin_lock(mylock);
> > spin_unlock(mylock);
> > WRITE_ONCE(*z,1);
> > }
> >
> > P2(int *z, int *d, spinlock_t *mylock)
> > {
> > int r1 = READ_ONCE(*z); // reads 1
> > spin_lock(mylock);
> > spin_unlock(mylock);
> > smp_mb__after_unlock_lock();
> > WRITE_ONCE(*d,1);
> > }
> >
> > P3(int *x, int *d)
> > {
> > WRITE_ONCE(*d,2);
> > smp_mb();
> > WRITE_ONCE(*x,1);
> > }
> >
> > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> >
> > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > not weakened. This is because the unlock operations along the
> > sequence of handovers are A-cumulative fences. They ensure that any
> > stores that propagate to the CPU performing the first unlock
> > operation in the sequence must also propagate to every CPU that
> > performs a subsequent lock operation in the sequence. Therefore any
> > such stores will also be ordered correctly by the fence even if only
> > the final handover is considered a full barrier.
> >
> > Indeed this patch does not affect the behaviors allowed by LKMM at
> > all. The mb relation is used to define ordering through:
> > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > lock-release, rfe, and unlock-acquire orderings each provide hb
> > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > lock-release orderings simply add more fine-grained cumul-fence
> > edges to substitute a single strong-fence edge provided by a long
> > lock handover sequence
> > 3) mb/strong-fence/pb and various similar uses in the definition of
> > data races, where as discussed above any long handover sequence
> > can be turned into a sequence of cumul-fence edges that provide
> > the same ordering.
> >
> > Signed-off-by: Jonas Oberhauser <[email protected]>
> > ---
>
> Reviewed-by: Alan Stern <[email protected]>

A quick spot check showed no change in performance, so thank you both!

Queued for review and further testing.

Thanx, Paul

> > tools/memory-model/linux-kernel.cat | 15 +++++++++++++--
> > 1 file changed, 13 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 07f884f9b2bf..6e531457bb73 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -37,8 +37,19 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> > ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) |
> > ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) |
> > ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> > - ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> > - fencerel(After-unlock-lock) ; [M])
> > +(*
> > + * Note: The po-unlock-lock-po relation only passes the lock to the direct
> > + * successor, perhaps giving the impression that the ordering of the
> > + * smp_mb__after_unlock_lock() fence only affects a single lock handover.
> > + * However, in a longer sequence of lock handovers, the implicit
> > + * A-cumulative release fences of lock-release ensure that any stores that
> > + * propagate to one of the involved CPUs before it hands over the lock to
> > + * the next CPU will also propagate to the final CPU handing over the lock
> > + * to the CPU that executes the fence. Therefore, all those stores are
> > + * also affected by the fence.
> > + *)
> > + ([M] ; po-unlock-lock-po ;
> > + [After-unlock-lock] ; po ; [M])
> > let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> > let strong-fence = mb | gp
> >
> > --
> > 2.17.1
> >

2023-01-26 23:21:56

by Paul E. McKenney

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > >
> > > 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > on the same CPU or immediate lock handovers on the same
> > > lock variable
> > >
> > > 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > literally as described in rcupdate.h#L1002, i.e., even
> > > after a sequence of handovers on the same lock variable.
> > >
> > > The latter relation is used only once, to provide the guarantee
> > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > barrier.
> > >
> > > This patch drops this use in favor of using po-unlock-lock-po
> > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > pairings. At first glance this seems to weaken the guarantee given
> > > by LKMM: When considering a long sequence of lock handovers
> > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > which finally executes such an after_unlock_lock fence, the mb
> > > relation currently links any stores in the critical section of P0
> > > to instructions P2 executes after its fence, but not so after the
> > > patch.
> > >
> > > P0(int *x, int *y, spinlock_t *mylock)
> > > {
> > > spin_lock(mylock);
> > > WRITE_ONCE(*x, 2);
> > > spin_unlock(mylock);
> > > WRITE_ONCE(*y, 1);
> > > }
> > >
> > > P1(int *y, int *z, spinlock_t *mylock)
> > > {
> > > int r0 = READ_ONCE(*y); // reads 1
> > > spin_lock(mylock);
> > > spin_unlock(mylock);
> > > WRITE_ONCE(*z,1);
> > > }
> > >
> > > P2(int *z, int *d, spinlock_t *mylock)
> > > {
> > > int r1 = READ_ONCE(*z); // reads 1
> > > spin_lock(mylock);
> > > spin_unlock(mylock);
> > > smp_mb__after_unlock_lock();
> > > WRITE_ONCE(*d,1);
> > > }
> > >
> > > P3(int *x, int *d)
> > > {
> > > WRITE_ONCE(*d,2);
> > > smp_mb();
> > > WRITE_ONCE(*x,1);
> > > }
> > >
> > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > >
> > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > not weakened. This is because the unlock operations along the
> > > sequence of handovers are A-cumulative fences. They ensure that any
> > > stores that propagate to the CPU performing the first unlock
> > > operation in the sequence must also propagate to every CPU that
> > > performs a subsequent lock operation in the sequence. Therefore any
> > > such stores will also be ordered correctly by the fence even if only
> > > the final handover is considered a full barrier.
> > >
> > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > all. The mb relation is used to define ordering through:
> > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > lock-release, rfe, and unlock-acquire orderings each provide hb
> > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > lock-release orderings simply add more fine-grained cumul-fence
> > > edges to substitute a single strong-fence edge provided by a long
> > > lock handover sequence
> > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > data races, where as discussed above any long handover sequence
> > > can be turned into a sequence of cumul-fence edges that provide
> > > the same ordering.
> > >
> > > Signed-off-by: Jonas Oberhauser <[email protected]>
> > > ---
> >
> > Reviewed-by: Alan Stern <[email protected]>
>
> A quick spot check showed no change in performance, so thank you both!
>
> Queued for review and further testing.

And testing on https://github.com/paulmckrcu/litmus for litmus tests up
to ten processes and allowing 10 minutes per litmus test got this:

Exact output matches: 5208
!!! Timed out: 38
!!! Unknown primitive: 7

This test compared output with and without your patch.

For the tests with a Results clause, these failed:

manual/kernel/C-srcu-nest-7.litmus
manual/kernel/C-srcu-nest-5.litmus
manual/kernel/C-srcu-nest-6.litmus
manual/kernel/C-srcu-nest-8.litmus

But all of these will continue to fail until we get Alan's new-age SRCU
patch applied.

Therefore, this constitutes success, so good show thus far on testing! ;-)

Also, I am going to be pushing the scripts I use to mainline. They might
not be perfect, but they will be quite useful for this sort of change
to the memory model.

Thanx, Paul

2023-01-27 13:20:33

by Jonas Oberhauser

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po



On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
>> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
>>> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
>>>> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>>>>
>>>> 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
>>>> on the same CPU or immediate lock handovers on the same
>>>> lock variable
>>>>
>>>> 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
>>>> literally as described in rcupdate.h#L1002, i.e., even
>>>> after a sequence of handovers on the same lock variable.
>>>>
>>>> The latter relation is used only once, to provide the guarantee
>>>> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
>>>> makes any UNLOCK+LOCK pair followed by the fence behave like a full
>>>> barrier.
>>>>
>>>> This patch drops this use in favor of using po-unlock-lock-po
>>>> everywhere, which unifies the way the model talks about UNLOCK+LOCK
>>>> pairings. At first glance this seems to weaken the guarantee given
>>>> by LKMM: When considering a long sequence of lock handovers
>>>> such as below, where P0 hands the lock to P1, which hands it to P2,
>>>> which finally executes such an after_unlock_lock fence, the mb
>>>> relation currently links any stores in the critical section of P0
>>>> to instructions P2 executes after its fence, but not so after the
>>>> patch.
>>>>
>>>> P0(int *x, int *y, spinlock_t *mylock)
>>>> {
>>>> spin_lock(mylock);
>>>> WRITE_ONCE(*x, 2);
>>>> spin_unlock(mylock);
>>>> WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *y, int *z, spinlock_t *mylock)
>>>> {
>>>> int r0 = READ_ONCE(*y); // reads 1
>>>> spin_lock(mylock);
>>>> spin_unlock(mylock);
>>>> WRITE_ONCE(*z,1);
>>>> }
>>>>
>>>> P2(int *z, int *d, spinlock_t *mylock)
>>>> {
>>>> int r1 = READ_ONCE(*z); // reads 1
>>>> spin_lock(mylock);
>>>> spin_unlock(mylock);
>>>> smp_mb__after_unlock_lock();
>>>> WRITE_ONCE(*d,1);
>>>> }
>>>>
>>>> P3(int *x, int *d)
>>>> {
>>>> WRITE_ONCE(*d,2);
>>>> smp_mb();
>>>> WRITE_ONCE(*x,1);
>>>> }
>>>>
>>>> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>>>>
>>>> Nevertheless, the ordering guarantee given in rcupdate.h is actually
>>>> not weakened. This is because the unlock operations along the
>>>> sequence of handovers are A-cumulative fences. They ensure that any
>>>> stores that propagate to the CPU performing the first unlock
>>>> operation in the sequence must also propagate to every CPU that
>>>> performs a subsequent lock operation in the sequence. Therefore any
>>>> such stores will also be ordered correctly by the fence even if only
>>>> the final handover is considered a full barrier.
>>>>
>>>> Indeed this patch does not affect the behaviors allowed by LKMM at
>>>> all. The mb relation is used to define ordering through:
>>>> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>>>> lock-release, rfe, and unlock-acquire orderings each provide hb
>>>> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>>>> lock-release orderings simply add more fine-grained cumul-fence
>>>> edges to substitute a single strong-fence edge provided by a long
>>>> lock handover sequence
>>>> 3) mb/strong-fence/pb and various similar uses in the definition of
>>>> data races, where as discussed above any long handover sequence
>>>> can be turned into a sequence of cumul-fence edges that provide
>>>> the same ordering.
>>>>
>>>> Signed-off-by: Jonas Oberhauser <[email protected]>
>>>> ---
>>> Reviewed-by: Alan Stern <[email protected]>
>> A quick spot check showed no change in performance, so thank you both!
>>
>> Queued for review and further testing.
> And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> to ten processes and allowing 10 minutes per litmus test got this:
>
> Exact output matches: 5208
> !!! Timed out: 38
> !!! Unknown primitive: 7
>
> This test compared output with and without your patch.
>
> For the tests with a Results clause, these failed:

Gave me a heart attack there for a second!

> Also, I am going to be pushing the scripts I use to mainline. They might
> not be perfect, but they will be quite useful for this sort of change
> to the memory model.

I could also provide Coq proofs, although those are ignoring the
srcu/data race parts at the moment.

Have fun, jonas


2023-01-27 14:32:21

by Jonas Oberhauser

Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po



On 1/26/2023 5:36 PM, Alan Stern wrote:
> On Thu, Jan 26, 2023 at 02:46:04PM +0100, Jonas Oberhauser wrote:
>> As stated in the documentation and implied by its name, the ppo
>> (preserved program order) relation is intended to link po-earlier
>> to po-later instructions under certain conditions. However, a
>> corner case currently allows instructions to be linked by ppo that
>> are not executed by the same thread, i.e., instructions are being
>> linked that have no po relation.
>>
>> This happens due to the mb/strong-fence relations, which (as one
>> case) provide order when locks are passed between threads followed
>> by an smp_mb__after_unlock_lock() fence. This is illustrated in
>> the following litmus test (as can be seen when using herd7 with
>> `doshow ppo`):
>>
>> P0(int *x, int *y)
>> {
>> spin_lock(x);
>> spin_unlock(x);
>> }
>>
>> P1(int *x, int *y)
>> {
>> spin_lock(x);
>> smp_mb__after_unlock_lock();
>> *y = 1;
>> }
>>
>> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
>> P0 passes a lock to P1 which then uses this fence.
>>
>> The patch makes ppo a subrelation of po by eliminating this possibility
>> from mb (but not strong-fence) and relying explicitly on mb|gp instead
>> of strong-fence when defining ppo.
>>
>> Signed-off-by: Jonas Oberhauser <[email protected]>
>> ---
> This changes the meaning of the fence relation, which is used in
> w-pre-bounded, w-post-bounded, ww-vis, wr-vis, and rw-xbstar. Have you
> checked that they won't be affected by the change?

Good point; in fact it changes the nonrw-fence as well, which is used in
the r-* relations.
I had missed this completely. That's what I get for not looking at the
data race relations!


Let's go through the individual cases.
If we had, e.g., *-post-bounded because of the
po-unlock-lock-po;[After-unlock-lock];po edge, this may be due either to
the fence rule or to (...fence ; vis ; *-pre-bounded).
In the first case we have po-rel ; rfe ; acq-po and get
fence;rfe;(xbstar&int);fence, which gives us *-post-bounded.
In the second case we now have strong-fence; since vis <= xbstar, we see
that **-vis is preserved here by switching from the fence rule to the
strong-fence;xbstar;... case.

For *-pre-bounded, the situation is tricky because of the xbstar&int
that can be at the end of vis, when *-pre-bounded is used to define
w*-vis through (w-post-bounded;vis;*-pre-bounded). In this case, the
xbstar&int can point in the opposite direction of po, which means that
the unlock that creates the strong fence might not be po-after the
instruction that starts the link.

Here's a litmus test illustrating the difference, where P1 has a
backwards-pointing xbstar&int. Currently there's no data race, but with
the proposed patch there is.

P0(int *x, int *y)
{
    *x = 1;
    smp_store_release(y, 1);
}

P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
{
    spin_lock(l);
    int r1 = READ_ONCE(*dy);
    if (r1==1)
        spin_unlock(l);

    int r0 = smp_load_acquire(y);
    if (r0 == 1) {
        WRITE_ONCE(*dx,1);
    }
}

P2(int *dx, int *dy)
{
    WRITE_ONCE(*dy,READ_ONCE(*dx));
}


P3(int *x, spinlock_t *l)
{
    spin_lock(l);
    smp_mb__after_unlock_lock();
    *x = 2;
}
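
For completeness: to feed this to herd7, the four processes still need
the usual litmus wrapper around them, along these lines (the test name
below is arbitrary, and the final-state clause matches the
"exists (x=2)" variant that comes up further down in the thread):

C unlock-lock-vis-race

{}

(* P0 .. P3 exactly as above *)

exists (x=2)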


This actually makes me wonder. I thought the reason for the xbstar & int
is that it ensures that the overall relation, after shuffling around a
little bit, becomes prop&int ; hb*.

For example, in case the *x=2 is co-before the *x=1, we get
  Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on the same CPU
as Wx2)  ->fence Wx2
which is
  Wx2 -> prop&int some other event ->hb Wx2
which must be irreflexive.

However, that's not the case at all, because the fence relation
currently doesn't actually have to relate events of the same CPU.
So we get
  Wx2 ->overwrite Wx1 ->cumul-fence*;rfe  (some event on some other CPU
than Wx2's) ->(hb*&int);fence Wx2
i.e.,
  Wx2 ->prop&ext;hb*;strong-fence Wx2

which shouldn't provide any ordering in general.
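
(For reference, the definitions this argument revolves around are
roughly the following, quoted from memory and possibly differing in
detail:

let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ; [Marked] ;
        rfe? ; [Marked]
let hb = [Marked] ; (ppo | rfe | ((prop \ id) & int)) ; [Marked]
let pb = prop ; strong-fence ; hb* ; [Marked]

with acyclic hb and acyclic pb as the happens-before and propagation
axioms.)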

In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs (pilot errors
notwithstanding), both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in
the current LKMM (in graphs where all other edges are equal).

Shouldn't this actually *be* a data race? And potentially the same with
rcu-fence?

The other cases of *-pre-bounded seem to work out: they all link stuff
via xbstar to the instruction that is linked via po-unlock-lock-po ;
[After-unlock-lock] ; po to the potentially racy access, and
po-unlock-lock-po;po is xbstar ; acq-po, which allows closing the gap.

Best wishes, jonas


2023-01-27 15:13:49

by Paul E. McKenney

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
> On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> > On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> > > On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > > > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > > >
> > > > > 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > > > on the same CPU or immediate lock handovers on the same
> > > > > lock variable
> > > > >
> > > > > 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > > > literally as described in rcupdate.h#L1002, i.e., even
> > > > > after a sequence of handovers on the same lock variable.
> > > > >
> > > > > The latter relation is used only once, to provide the guarantee
> > > > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > > > barrier.
> > > > >
> > > > > This patch drops this use in favor of using po-unlock-lock-po
> > > > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > > > pairings. At first glance this seems to weaken the guarantee given
> > > > > by LKMM: When considering a long sequence of lock handovers
> > > > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > > > which finally executes such an after_unlock_lock fence, the mb
> > > > > relation currently links any stores in the critical section of P0
> > > > > to instructions P2 executes after its fence, but not so after the
> > > > > patch.
> > > > >
> > > > > P0(int *x, int *y, spinlock_t *mylock)
> > > > > {
> > > > > spin_lock(mylock);
> > > > > WRITE_ONCE(*x, 2);
> > > > > spin_unlock(mylock);
> > > > > WRITE_ONCE(*y, 1);
> > > > > }
> > > > >
> > > > > P1(int *y, int *z, spinlock_t *mylock)
> > > > > {
> > > > > int r0 = READ_ONCE(*y); // reads 1
> > > > > spin_lock(mylock);
> > > > > spin_unlock(mylock);
> > > > > WRITE_ONCE(*z,1);
> > > > > }
> > > > >
> > > > > P2(int *z, int *d, spinlock_t *mylock)
> > > > > {
> > > > > int r1 = READ_ONCE(*z); // reads 1
> > > > > spin_lock(mylock);
> > > > > spin_unlock(mylock);
> > > > > smp_mb__after_unlock_lock();
> > > > > WRITE_ONCE(*d,1);
> > > > > }
> > > > >
> > > > > P3(int *x, int *d)
> > > > > {
> > > > > WRITE_ONCE(*d,2);
> > > > > smp_mb();
> > > > > WRITE_ONCE(*x,1);
> > > > > }
> > > > >
> > > > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > > >
> > > > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > > > not weakened. This is because the unlock operations along the
> > > > > sequence of handovers are A-cumulative fences. They ensure that any
> > > > > stores that propagate to the CPU performing the first unlock
> > > > > operation in the sequence must also propagate to every CPU that
> > > > > performs a subsequent lock operation in the sequence. Therefore any
> > > > > such stores will also be ordered correctly by the fence even if only
> > > > > the final handover is considered a full barrier.
> > > > >
> > > > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > > > all. The mb relation is used to define ordering through:
> > > > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > > > lock-release, rfe, and unlock-acquire orderings each provide hb
> > > > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > > > lock-release orderings simply add more fine-grained cumul-fence
> > > > > edges to substitute a single strong-fence edge provided by a long
> > > > > lock handover sequence
> > > > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > > > data races, where as discussed above any long handover sequence
> > > > > can be turned into a sequence of cumul-fence edges that provide
> > > > > the same ordering.
> > > > >
> > > > > Signed-off-by: Jonas Oberhauser <[email protected]>
> > > > > ---
> > > > Reviewed-by: Alan Stern <[email protected]>
> > > A quick spot check showed no change in performance, so thank you both!
> > >
> > > Queued for review and further testing.
> > And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> > to ten processes and allowing 10 minutes per litmus test got this:
> >
> > Exact output matches: 5208
> > !!! Timed out: 38
> > !!! Unknown primitive: 7
> >
> > This test compared output with and without your patch.
> >
> > For the tests with a Results clause, these failed:
>
> Gave me a heart attack there for a second!

Sorry for the scare!!!

> > Also, I am going to be pushing the scripts I use to mainline. They might
> > not be perfect, but they will be quite useful for this sort of change
> > to the memory model.
>
> I could also provide Coq proofs, although those are ignoring the srcu/data
> race parts at the moment.

Can such proofs serve as regression tests for future changes?

Thanx, Paul

2023-01-27 15:58:29

by Jonas Oberhauser

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po



On 1/27/2023 4:13 PM, Paul E. McKenney wrote:
> On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
>> On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
>>> On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
>>>> On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
>>>>> On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
>>>>>> LKMM uses two relations for talking about UNLOCK+LOCK pairings:
>>>>>>
>>>>>> 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
>>>>>> on the same CPU or immediate lock handovers on the same
>>>>>> lock variable
>>>>>>
>>>>>> 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
>>>>>> literally as described in rcupdate.h#L1002, i.e., even
>>>>>> after a sequence of handovers on the same lock variable.
>>>>>>
>>>>>> The latter relation is used only once, to provide the guarantee
>>>>>> defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
>>>>>> makes any UNLOCK+LOCK pair followed by the fence behave like a full
>>>>>> barrier.
>>>>>>
>>>>>> This patch drops this use in favor of using po-unlock-lock-po
>>>>>> everywhere, which unifies the way the model talks about UNLOCK+LOCK
>>>>>> pairings. At first glance this seems to weaken the guarantee given
>>>>>> by LKMM: When considering a long sequence of lock handovers
>>>>>> such as below, where P0 hands the lock to P1, which hands it to P2,
>>>>>> which finally executes such an after_unlock_lock fence, the mb
>>>>>> relation currently links any stores in the critical section of P0
>>>>>> to instructions P2 executes after its fence, but not so after the
>>>>>> patch.
>>>>>>
>>>>>> P0(int *x, int *y, spinlock_t *mylock)
>>>>>> {
>>>>>> spin_lock(mylock);
>>>>>> WRITE_ONCE(*x, 2);
>>>>>> spin_unlock(mylock);
>>>>>> WRITE_ONCE(*y, 1);
>>>>>> }
>>>>>>
>>>>>> P1(int *y, int *z, spinlock_t *mylock)
>>>>>> {
>>>>>> int r0 = READ_ONCE(*y); // reads 1
>>>>>> spin_lock(mylock);
>>>>>> spin_unlock(mylock);
>>>>>> WRITE_ONCE(*z,1);
>>>>>> }
>>>>>>
>>>>>> P2(int *z, int *d, spinlock_t *mylock)
>>>>>> {
>>>>>> int r1 = READ_ONCE(*z); // reads 1
>>>>>> spin_lock(mylock);
>>>>>> spin_unlock(mylock);
>>>>>> smp_mb__after_unlock_lock();
>>>>>> WRITE_ONCE(*d,1);
>>>>>> }
>>>>>>
>>>>>> P3(int *x, int *d)
>>>>>> {
>>>>>> WRITE_ONCE(*d,2);
>>>>>> smp_mb();
>>>>>> WRITE_ONCE(*x,1);
>>>>>> }
>>>>>>
>>>>>> exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
>>>>>>
>>>>>> Nevertheless, the ordering guarantee given in rcupdate.h is actually
>>>>>> not weakened. This is because the unlock operations along the
>>>>>> sequence of handovers are A-cumulative fences. They ensure that any
>>>>>> stores that propagate to the CPU performing the first unlock
>>>>>> operation in the sequence must also propagate to every CPU that
>>>>>> performs a subsequent lock operation in the sequence. Therefore any
>>>>>> such stores will also be ordered correctly by the fence even if only
>>>>>> the final handover is considered a full barrier.
>>>>>>
>>>>>> Indeed this patch does not affect the behaviors allowed by LKMM at
>>>>>> all. The mb relation is used to define ordering through:
>>>>>> 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
>>>>>> lock-release, rfe, and unlock-acquire orderings each provide hb
>>>>>> 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
>>>>>> lock-release orderings simply add more fine-grained cumul-fence
>>>>>> edges to substitute a single strong-fence edge provided by a long
>>>>>> lock handover sequence
>>>>>> 3) mb/strong-fence/pb and various similar uses in the definition of
>>>>>> data races, where as discussed above any long handover sequence
>>>>>> can be turned into a sequence of cumul-fence edges that provide
>>>>>> the same ordering.
>>>>>>
>>>>>> Signed-off-by: Jonas Oberhauser <[email protected]>
>>>>>> ---
>>>>> Reviewed-by: Alan Stern <[email protected]>
>>>> A quick spot check showed no change in performance, so thank you both!
>>>>
>>>> Queued for review and further testing.
>>> And testing on https://github.com/paulmckrcu/litmus for litmus tests up
>>> to ten processes and allowing 10 minutes per litmus test got this:
>>>
>>> Exact output matches: 5208
>>> !!! Timed out: 38
>>> !!! Unknown primitive: 7
>>>
>>> This test compared output with and without your patch.
>>>
>>> For the tests with a Results clause, these failed:
>> Gave me a heart attack there for a second!
> Sorry for the scare!!!
>
>>> Also, I am going to be pushing the scripts I use to mainline. They might
>>> not be perfect, but they will be quite useful for this sort of change
>>> to the memory model.
>> I could also provide Coq proofs, although those are ignoring the srcu/data
>> race parts at the moment.
> Can such proofs serve as regression tests for future changes?
>
> Thanx, Paul

So-so. On the upside, it would be easy to make them raise an alarm if
the future change breaks stuff.
On the downside, they will often need maintenance together with any
change. Sometimes a lot, sometimes very little.
I think for the proofs that show the equivalence between two models, the
maintenance is quite a bit higher because every change needs to be
reflected in both versions. So if you do 10 equivalent transformations
and want to show that they remain equivalent with any future changes you
do, you need to keep at least 10 additional models around ("current LKMM
where ppo isn't in po, current LKMM where the unlock fence still relies
on co, ...").

Right now, each equivalence proof I did (e.g., for using
po-unlock-lock-po here, or the ppo<=po patch I originally proposed) is
on average in the ballpark of 500 lines of proof script. And as
evidenced by my discussion with Alan, these proofs only cover the "core
model".

So for this kind of thing, I think it's better to look at them to have
more confidence in the patch, and do the patch more based on which model
is more reasonable (as Alan enforces). Then consider the simplified
version as the more natural one, and not worry about future changes that
break the equivalence (that would usually indicate a problem with the
old model, rather than a problem with the patch).

For regressions, I would rather consider some desirable properties of
LKMM, like "DRF-SC" or "monotonicity of barriers" or "ppo <= po" and try
to prove those. This has the upside of not having to carry additional
models around, so much less than half the maintenance effort, and if the
property should be broken this usually would indicate a problem with the
patch. So I think the bang for the buck is much higher there.

Those are my current thoughts anyways : )

have fun, jonas


2023-01-27 16:49:02

by Paul E. McKenney

Subject: Re: [PATCH v2 1/2] tools/memory-model: Unify UNLOCK+LOCK pairings to po-unlock-lock-po

On Fri, Jan 27, 2023 at 04:57:43PM +0100, Jonas Oberhauser wrote:
>
>
> On 1/27/2023 4:13 PM, Paul E. McKenney wrote:
> > On Fri, Jan 27, 2023 at 02:18:41PM +0100, Jonas Oberhauser wrote:
> > > On 1/27/2023 12:21 AM, Paul E. McKenney wrote:
> > > > On Thu, Jan 26, 2023 at 12:08:28PM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Jan 26, 2023 at 11:36:51AM -0500, Alan Stern wrote:
> > > > > > On Thu, Jan 26, 2023 at 02:46:03PM +0100, Jonas Oberhauser wrote:
> > > > > > > LKMM uses two relations for talking about UNLOCK+LOCK pairings:
> > > > > > >
> > > > > > > 1) po-unlock-lock-po, which handles UNLOCK+LOCK pairings
> > > > > > > on the same CPU or immediate lock handovers on the same
> > > > > > > lock variable
> > > > > > >
> > > > > > > 2) po;[UL];(co|po);[LKW];po, which handles UNLOCK+LOCK pairs
> > > > > > > literally as described in rcupdate.h#L1002, i.e., even
> > > > > > > after a sequence of handovers on the same lock variable.
> > > > > > >
> > > > > > > The latter relation is used only once, to provide the guarantee
> > > > > > > defined in rcupdate.h#L1002 by smp_mb__after_unlock_lock(), which
> > > > > > > makes any UNLOCK+LOCK pair followed by the fence behave like a full
> > > > > > > barrier.
> > > > > > >
> > > > > > > This patch drops this use in favor of using po-unlock-lock-po
> > > > > > > everywhere, which unifies the way the model talks about UNLOCK+LOCK
> > > > > > > pairings. At first glance this seems to weaken the guarantee given
> > > > > > > by LKMM: When considering a long sequence of lock handovers
> > > > > > > such as below, where P0 hands the lock to P1, which hands it to P2,
> > > > > > > which finally executes such an after_unlock_lock fence, the mb
> > > > > > > relation currently links any stores in the critical section of P0
> > > > > > > to instructions P2 executes after its fence, but not so after the
> > > > > > > patch.
> > > > > > >
> > > > > > > P0(int *x, int *y, spinlock_t *mylock)
> > > > > > > {
> > > > > > > spin_lock(mylock);
> > > > > > > WRITE_ONCE(*x, 2);
> > > > > > > spin_unlock(mylock);
> > > > > > > WRITE_ONCE(*y, 1);
> > > > > > > }
> > > > > > >
> > > > > > > P1(int *y, int *z, spinlock_t *mylock)
> > > > > > > {
> > > > > > > int r0 = READ_ONCE(*y); // reads 1
> > > > > > > spin_lock(mylock);
> > > > > > > spin_unlock(mylock);
> > > > > > > WRITE_ONCE(*z,1);
> > > > > > > }
> > > > > > >
> > > > > > > P2(int *z, int *d, spinlock_t *mylock)
> > > > > > > {
> > > > > > > int r1 = READ_ONCE(*z); // reads 1
> > > > > > > spin_lock(mylock);
> > > > > > > spin_unlock(mylock);
> > > > > > > smp_mb__after_unlock_lock();
> > > > > > > WRITE_ONCE(*d,1);
> > > > > > > }
> > > > > > >
> > > > > > > P3(int *x, int *d)
> > > > > > > {
> > > > > > > WRITE_ONCE(*d,2);
> > > > > > > smp_mb();
> > > > > > > WRITE_ONCE(*x,1);
> > > > > > > }
> > > > > > >
> > > > > > > exists (1:r0=1 /\ 2:r1=1 /\ x=2 /\ d=2)
> > > > > > >
> > > > > > > Nevertheless, the ordering guarantee given in rcupdate.h is actually
> > > > > > > not weakened. This is because the unlock operations along the
> > > > > > > sequence of handovers are A-cumulative fences. They ensure that any
> > > > > > > stores that propagate to the CPU performing the first unlock
> > > > > > > operation in the sequence must also propagate to every CPU that
> > > > > > > performs a subsequent lock operation in the sequence. Therefore any
> > > > > > > such stores will also be ordered correctly by the fence even if only
> > > > > > > the final handover is considered a full barrier.
> > > > > > >
> > > > > > > Indeed this patch does not affect the behaviors allowed by LKMM at
> > > > > > > all. The mb relation is used to define ordering through:
> > > > > > > 1) mb/.../ppo/hb, where the ordering is subsumed by hb+ where the
> > > > > > > lock-release, rfe, and unlock-acquire orderings each provide hb
> > > > > > > 2) mb/strong-fence/cumul-fence/prop, where the rfe and A-cumulative
> > > > > > > lock-release orderings simply add more fine-grained cumul-fence
> > > > > > > edges to substitute a single strong-fence edge provided by a long
> > > > > > > lock handover sequence
> > > > > > > 3) mb/strong-fence/pb and various similar uses in the definition of
> > > > > > > data races, where as discussed above any long handover sequence
> > > > > > > can be turned into a sequence of cumul-fence edges that provide
> > > > > > > the same ordering.
> > > > > > >
> > > > > > > Signed-off-by: Jonas Oberhauser <[email protected]>
> > > > > > > ---
> > > > > > Reviewed-by: Alan Stern <[email protected]>
> > > > > A quick spot check showed no change in performance, so thank you both!
> > > > >
> > > > > Queued for review and further testing.
> > > > And testing on https://github.com/paulmckrcu/litmus for litmus tests up
> > > > to ten processes and allowing 10 minutes per litmus test got this:
> > > >
> > > > Exact output matches: 5208
> > > > !!! Timed out: 38
> > > > !!! Unknown primitive: 7
> > > >
> > > > This test compared output with and without your patch.
> > > >
> > > > For the tests with a Results clause, these failed:
> > > Gave me a heart attack there for a second!
> > Sorry for the scare!!!
> >
> > > > Also, I am going to be pushing the scripts I use to mainline. They might
> > > > not be perfect, but they will be quite useful for this sort of change
> > > > to the memory model.
> > > I could also provide Coq proofs, although those are ignoring the srcu/data
> > > race parts at the moment.
> > Can such proofs serve as regression tests for future changes?
> >
> > Thanx, Paul
>
> So-so. On the upside, it would be easy to make them raise an alarm if the
> future change breaks stuff.
> On the downside, they will often need maintenance together with any change.
> Sometimes a lot, sometimes very little.
> I think for the proofs that show the equivalence between two models, the
> maintenance is quite a bit higher because every change needs to be reflected
> in both versions. So if you do 10 equivalent transformations and want to
> show that they remain equivalent with any future changes you do, you need to
> keep at least 10 additional models around ("current LKMM where ppo isn't in
> po, current LKMM where the unlock fence still relies on co, ...").
>
> Right now, each equivalence proof I did (e.g., for using po-unlock-lock-po
> here, or the ppo<=po patch I originally proposed) is at average in the
> ballpark of 500 lines of proof script. And as evidenced by my discussion
> with Alan, these proofs only cover the "core model".
>
> So for this kind of thing, I think it's better to look at them to have more
> confidence in the patch, and do the patch more based on which model is more
> reasonable (as Alan enforces). Then consider the simplified version as the
> more natural one, and not worry about future changes that break the
> equivalence (that would usually indicate a problem with the old model,
> rather than a problem with the patch).
>
> For regressions, I would rather consider some desirable properties of LKMM,
> like "DRF-SC" or "monotonicity of barriers" or "ppo <= po" and try to prove
> those. This has the upside of not requiring to carry additional models
> around, so much less than half the maintenance effort, and if the property
> should be broken this usually would indicate a problem with the patch. So I
> think the bang for the buck is much higher there.
>
> Those are my current thoughts anyways : )

That matches my experience, for whatever that is worth. (I first
used Promela/spin in the early 1990s, which proved to be an excellent
cautionary tale.)

Thanx, Paul

2023-01-28 19:57:07

by Alan Stern

Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Fri, Jan 27, 2023 at 03:31:25PM +0100, Jonas Oberhauser wrote:
> Here's a litmus test illustrating the difference, where P1 has a
> backwards-pointing xbstar&int. Currently there's no data race, but with the
> proposed patch there is.
>
> P0(int *x, int *y)
> {
>     *x = 1;
>     smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
>     spin_lock(l);
>     int r1 = READ_ONCE(*dy);
>     if (r1==1)
>         spin_unlock(l);
>
>     int r0 = smp_load_acquire(y);
>     if (r0 == 1) {
>         WRITE_ONCE(*dx,1);
>     }
> }
>
> P2(int *dx, int *dy)
> {
>     WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
>
>
> P3(int *x, spinlock_t *l)
> {
>     spin_lock(l);
>     smp_mb__after_unlock_lock();
>     *x = 2;
> }

I don't understand why the current LKMM doesn't say there is a data
race. In fact, I don't understand what herd7 is doing with this litmus
test at all. Evidently the plain-coherence check rules out x=1 at the
end, because when I relax that check, x=1 becomes a possible result.
Furthermore, the graphical output confirms that this execution has a
ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
to Wx=2! How can this be possible? It seems like a bug in herd7.

Furthermore, the execution with x=2 at the end doesn't have either a
ww-vis or a ww-nonrace edge between Wx=1 and Wx=2. So why isn't there a
ww-race edge?

> This actually makes me wonder. I thought the reason for the xbstar & int is
> that it ensures that the overall relation, after shuffling around a little
> bit, becomes prop&int ; hb*.

No, that is not the reason for it. See below.

> Like in case the *x=2 is co-before the *x=1, we get
>   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe? (some event on the same CPU as
> Wx2) ->fence Wx2
> which is
>   Wx2 -> prop&int some other event ->hb Wx2
> which must be irreflexive.
>
> However, that's not the case at all, because the fence relation currently
> doesn't actually have to relate events of the same CPU.
> So we get
>   Wx2 ->overwrite Wx1 ->cumul-fence*;rfe? (some event on some other CPU than
> Wx2's) ->(hb*&int);fence Wx2
> i.e.,
>   Wx2 ->prop&ext;hb*;strong-fence Wx2
>
> which shouldn't provide any ordering in general.
>
> In fact, replacing the *x=1 and *x=2 with WRITE_ONCEs, (pilot errors
> notwithstanding) both Wx1 ->co Wx2 and Wx2 ->co Wx1 become allowed in the
> current LKMM (in graphs where all other edges are equal).
>
> Shouldn't this actually *be* a data race? And potentially the same with
> rcu-fence?

I think that herd7 _should_ say there is a data race.

On the other hand, I also think that the operational model says there
isn't. This is a case where the formal model fails to match the
operational model.

The operational model says that if W is a release-store on CPU C and W'
is another store which propagates to C before W executes, then W'
propagates to every CPU before W does. (In other words, releases are
A-cumulative). But the formal model enforces this rule only in cases
where the event reading W' on C is po-before W.
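
The covered case, in its simplest form, is the classic WRC shape, in which
the event reading W' is po-before the release; a sketch (one would expect
herd7 to report the cycle below as forbidden, precisely because the release
is A-cumulative):

C wrc-rel-acq

{}

P0(int *x)
{
WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
int r1;

r1 = READ_ONCE(*x);
smp_store_release(y, 1);
}

P2(int *x, int *y)
{
int r2;
int r3;

r2 = smp_load_acquire(y);
r3 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)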

In your litmus test, the event reading y=1 on P1 is po-after the
spin_unlock (which is a release). Nevertheless, any feasible execution
requires that P1 must execute Ry=1 before the unlock. So the
operational model says that y=1 must propagate to P3 before the unlock
does, i.e., before P3 executes the spin_lock(). But the formal model
doesn't require this.

Ideally we would fix this by changing the definition of po-rel to:

[M] ; (xbstar & int) ; [Release]

(This is closely related to the use of (xbstar & int) in the definition
of vis that you asked about.) Unfortunately we can't do this, because
po-rel has to be defined long before xbstar.
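
Spelled out in cat, the contrast would be roughly as follows (a sketch; the
current form is quoted from memory from the "Release Acquire" section of
linux-kernel.cat):

(* current *)
let po-rel = [M] ; po ; [Release]

(* ideal, but xbstar is only defined much later in the file *)
let po-rel = [M] ; (xbstar & int) ; [Release]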

> The other cases of *-pre-bounded seem to work out: they all link stuff via
> xbstar to the instruction that is linked via po-unlock-lock-po ;
> [After-unlock-lock] ; po to the potentially racy access, and
> po-unlock-lock-po;po is xbstar ; acq-po, which allows closing the gap.

I could not follow your arguments at all; the writing was too confusing.

Alan

2023-01-28 22:14:31

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

> Evidently the plain-coherence check rules out x=1 at the
> end, because when I relax that check, x=1 becomes a possible result.
> Furthermore, the graphical output confirms that this execution has a
> ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> to Wx=2! How can this be possible? It seems like a bug in herd7.

By default, herd7 performs some edges removal when generating the
graphical outputs. The option -showraw can be useful to increase
the "verbosity", for example,

[with "exists (x=2)", output in /tmp/T.dot]
$ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis


> Furthermore, the execution with x=2 at the end doesn't have either a
> ww-vis or a ww-nonrace edge between Wx=1 and Wx=2. So why isn't there a
> ww-race edge?

And similarly

[with "exists (x=2)"]
$ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace

Andrea

2023-01-28 22:22:03

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > Evidently the plain-coherence check rules out x=1 at the
> > end, because when I relax that check, x=1 becomes a possible result.
> > Furthermore, the graphical output confirms that this execution has a
> > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> > to Wx=2! How can this be possible? It seems like a bug in herd7.
>
> By default, herd7 performs some edges removal when generating the
> graphical outputs. The option -showraw can be useful to increase
> the "verbosity", for example,
>
> [with "exists (x=2)", output in /tmp/T.dot]

This was meant to be "exists (x=1)".

Andrea


> $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
>
>
> > Furthermore, the execution with x=2 at the end doesn't have either a
> > ww-vis or a ww-nonrace edge between Wx=1 and Wx=2. So why isn't there a
> > ww-race edge?
>
> And similarly
>
> [with "exists (x=2)"]
> $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -doshow ww-vis,ww-nonrace -showraw ww-vis,ww-nonrace
>
> Andrea

2023-01-28 22:59:58

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > Evidently the plain-coherence check rules out x=1 at the
> > end, because when I relax that check, x=1 becomes a possible result.
> > Furthermore, the graphical output confirms that this execution has a
> > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> > to Wx=2! How can this be possible? It seems like a bug in herd7.
>
> By default, herd7 performs some edges removal when generating the
> graphical outputs. The option -showraw can be useful to increase
> the "verbosity", for example,
>
> [with "exists (x=2)", output in /tmp/T.dot]
> $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis

Okay, thanks, that helps a lot.

So here's what we've got. The litmus test:


C hb-and-int
{}

P0(int *x, int *y)
{
*x = 1;
smp_store_release(y, 1);
}

P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
{
spin_lock(l);
int r1 = READ_ONCE(*dy);
if (r1==1)
spin_unlock(l);

int r0 = smp_load_acquire(y);
if (r0 == 1) {
WRITE_ONCE(*dx,1);
}
}

P2(int *dx, int *dy)
{
WRITE_ONCE(*dy,READ_ONCE(*dx));
}


P3(int *x, spinlock_t *l)
{
spin_lock(l);
smp_mb__after_unlock_lock();
*x = 2;
}

exists (x=2)


The reason why Wx=1 ->ww-vis Wx=2:

0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.

0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
(rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.

1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.

Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.

This explains why the memory model says there isn't a data race. This
doesn't use the smp_mb__after_unlock_lock at all.

Alan

2023-01-29 05:20:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > Evidently the plain-coherence check rules out x=1 at the
> > > end, because when I relax that check, x=1 becomes a possible result.
> > > Furthermore, the graphical output confirms that this execution has a
> > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> > > to Wx=2! How can this be possible? It seems like a bug in herd7.
> >
> > By default, herd7 performs some edges removal when generating the
> > graphical outputs. The option -showraw can be useful to increase
> > the "verbosity", for example,
> >
> > [with "exists (x=2)", output in /tmp/T.dot]
> > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
>
> Okay, thanks, that helps a lot.
>
> So here's what we've got. The litmus test:
>
>
> C hb-and-int
> {}
>
> P0(int *x, int *y)
> {
> *x = 1;
> smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
> spin_lock(l);
> int r1 = READ_ONCE(*dy);
> if (r1==1)
> spin_unlock(l);
>
> int r0 = smp_load_acquire(y);
> if (r0 == 1) {
> WRITE_ONCE(*dx,1);
> }

The lack of a spin_unlock() when r1!=1 is intentional?

It is admittedly a cute way to prevent P3 from doing anything
when r1!=1. And P1 won't do anything if P3 runs first.

> }
>
> P2(int *dx, int *dy)
> {
> WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
>
>
> P3(int *x, spinlock_t *l)
> {
> spin_lock(l);
> smp_mb__after_unlock_lock();
> *x = 2;
> }
>
> exists (x=2)
>
>
> The reason why Wx=1 ->ww-vis Wx=2:
>
> 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
>
> 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
>
> 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
>
> Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
>
> This explains why the memory model says there isn't a data race. This
> doesn't use the smp_mb__after_unlock_lock at all.

You lost me on this one.

Suppose that P3 starts first, then P0. P1 is then stuck at the
spin_lock() because P3 does not release that lock. P2 goes out for a
pizza.

Why can't the two stores to x by P0 and P3 conflict, resulting in a
data race?

Thanx, Paul

2023-01-29 16:03:51

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > Evidently the plain-coherence check rules out x=1 at the
> > > > end, because when I relax that check, x=1 becomes a possible result.
> > > > Furthermore, the graphical output confirms that this execution has a
> > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> > > > to Wx=2! How can this be possible? It seems like a bug in herd7.
> > >
> > > By default, herd7 performs some edges removal when generating the
> > > graphical outputs. The option -showraw can be useful to increase
> > > the "verbosity", for example,
> > >
> > > [with "exists (x=2)", output in /tmp/T.dot]
> > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> >
> > Okay, thanks, that helps a lot.
> >
> > So here's what we've got. The litmus test:
> >
> >
> > C hb-and-int
> > {}
> >
> > P0(int *x, int *y)
> > {
> > *x = 1;
> > smp_store_release(y, 1);
> > }
> >
> > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > {
> > spin_lock(l);
> > int r1 = READ_ONCE(*dy);
> > if (r1==1)
> > spin_unlock(l);
> >
> > int r0 = smp_load_acquire(y);
> > if (r0 == 1) {
> > WRITE_ONCE(*dx,1);
> > }
>
> The lack of a spin_unlock() when r1!=1 is intentional?

I assume so.

> It is admittedly a cute way to prevent P3 from doing anything
> when r1!=1. And P1 won't do anything if P3 runs first.

Right.

> > }
> >
> > P2(int *dx, int *dy)
> > {
> > WRITE_ONCE(*dy,READ_ONCE(*dx));
> > }
> >
> >
> > P3(int *x, spinlock_t *l)
> > {
> > spin_lock(l);
> > smp_mb__after_unlock_lock();
> > *x = 2;
> > }
> >
> > exists (x=2)
> >
> >
> > The reason why Wx=1 ->ww-vis Wx=2:
> >
> > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> >
> > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> >
> > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> >
> > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> >
> > This explains why the memory model says there isn't a data race. This
> > doesn't use the smp_mb__after_unlock_lock at all.
>
> You lost me on this one.
>
> Suppose that P3 starts first, then P0. P1 is then stuck at the
> spin_lock() because P3 does not release that lock. P2 goes out for a
> pizza.

That wouldn't be a valid execution. One of the rules in lock.cat says
that a spin_lock() call must read from a spin_unlock() or from an
initial write, which rules out executions in which P3 acquires the lock
first.

> Why can't the two stores to x by P0 and P3 conflict, resulting in a
> data race?

That can't happen in executions where P1 acquires the lock first for the
reason outlined above (P0's store to x propagates to P3 before P3 writes
to x). And there are no other executions -- basically, herd7 ignores
deadlock scenarios.

Alan

2023-01-29 16:22:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > > Evidently the plain-coherence check rules out x=1 at the
> > > > > end, because when I relax that check, x=1 becomes a possible result.
> > > > > Furthermore, the graphical output confirms that this execution has a
> > > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> > > > > to Wx=2! How can this be possible? It seems like a bug in herd7.
> > > >
> > > > By default, herd7 performs some edges removal when generating the
> > > > graphical outputs. The option -showraw can be useful to increase
> > > > the "verbosity", for example,
> > > >
> > > > [with "exists (x=2)", output in /tmp/T.dot]
> > > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > >
> > > Okay, thanks, that helps a lot.
> > >
> > > So here's what we've got. The litmus test:
> > >
> > >
> > > C hb-and-int
> > > {}
> > >
> > > P0(int *x, int *y)
> > > {
> > > *x = 1;
> > > smp_store_release(y, 1);
> > > }
> > >
> > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > > {
> > > spin_lock(l);
> > > int r1 = READ_ONCE(*dy);
> > > if (r1==1)
> > > spin_unlock(l);
> > >
> > > int r0 = smp_load_acquire(y);
> > > if (r0 == 1) {
> > > WRITE_ONCE(*dx,1);
> > > }
> >
> > The lack of a spin_unlock() when r1!=1 is intentional?
>
> I assume so.
>
> > It is admittedly a cute way to prevent P3 from doing anything
> > when r1!=1. And P1 won't do anything if P3 runs first.
>
> Right.
>
> > > }
> > >
> > > P2(int *dx, int *dy)
> > > {
> > > WRITE_ONCE(*dy,READ_ONCE(*dx));
> > > }
> > >
> > >
> > > P3(int *x, spinlock_t *l)
> > > {
> > > spin_lock(l);
> > > smp_mb__after_unlock_lock();
> > > *x = 2;
> > > }
> > >
> > > exists (x=2)
> > >
> > >
> > > The reason why Wx=1 ->ww-vis Wx=2:
> > >
> > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > >
> > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > >
> > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > >
> > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > >
> > > This explains why the memory model says there isn't a data race. This
> > > doesn't use the smp_mb__after_unlock_lock at all.
> >
> > You lost me on this one.
> >
> > Suppose that P3 starts first, then P0. P1 is then stuck at the
> > spin_lock() because P3 does not release that lock. P2 goes out for a
> > pizza.
>
> That wouldn't be a valid execution. One of the rules in lock.cat says
> that a spin_lock() call must read from a spin_unlock() or from an
> initial write, which rules out executions in which P3 acquires the lock
> first.

OK, I will bite...

Why can't P3's spin_lock() read from that initial write?

> > Why can't the two stores to x by P0 and P3 conflict, resulting in a
> > data race?
>
> That can't happen in executions where P1 acquires the lock first for the
> reason outlined above (P0's store to x propagates to P3 before P3 writes
> to x). And there are no other executions -- basically, herd7 ignores
> deadlock scenarios.

True enough, if P1 gets there first, then P3 never stores to x.

What I don't understand is why P1 must always get there first.

Thanx, Paul

2023-01-29 17:11:24

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

> The reason why Wx=1 ->ww-vis Wx=2:
>
> 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
>
> 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
>
> 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
>
> Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.

To clarify, po-unlock-lock-po is not a subrelation of mb; see what
happens without the smp_mb__after_unlock_lock().

Andrea

2023-01-29 17:28:36

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

> Why can't P3's spin_lock() read from that initial write?

Mmh, sounds like you want to play with something like below?

Andrea

diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
index 6b52f365d73ac..20c3af4511255 100644
--- a/tools/memory-model/lock.cat
+++ b/tools/memory-model/lock.cat
@@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock

(* Allow up to one unmatched LKW per location; more must deadlock *)
let UNMATCHED-LKW = LKW \ domain(critical)
-empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks

(* rfi for LF events: link each LKW to the LF events in its critical section *)
let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
@@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
let rf = rf | rf-lf | rf-ru

(* Generate all co relations, including LKW events but not UL *)
-let co0 = co0 | ([IW] ; loc ; [LKW]) |
- (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
+let co0 = co0 | ([IW] ; loc ; [LKW])
include "cos-opt.cat"
let W = W | UL
let M = R | W


2023-01-29 18:44:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > Why can't P3's spin_lock() read from that initial write?
>
> Mmh, sounds like you want to play with something like below?
>
> Andrea
>
> diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> index 6b52f365d73ac..20c3af4511255 100644
> --- a/tools/memory-model/lock.cat
> +++ b/tools/memory-model/lock.cat
> @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
>
> (* Allow up to one unmatched LKW per location; more must deadlock *)
> let UNMATCHED-LKW = LKW \ domain(critical)
> -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
>
> (* rfi for LF events: link each LKW to the LF events in its critical section *)
> let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> let rf = rf | rf-lf | rf-ru
>
> (* Generate all co relations, including LKW events but not UL *)
> -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> - (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> +let co0 = co0 | ([IW] ; loc ; [LKW])
> include "cos-opt.cat"
> let W = W | UL
> let M = R | W

No idea. But the following litmus test gets no executions whatsoever,
so point taken about my missing at least one corner case. ;-)

Adding a spin_unlock() to the end of either process allows both to
run.

One could argue that this is a bug, but one could equally well argue
that if you have a deadlock, you have a deadlock.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

C lock

{
}


P0(int *a, int *b, spinlock_t *l)
{
spin_lock(l);
WRITE_ONCE(*a, 1);
}

P1(int *a, int *b, spinlock_t *l)
{
spin_lock(l);
WRITE_ONCE(*b, 1);
}

exists (a=1 /\ b=1)
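
For comparison, here is a sketch of the variant with spin_unlock() added to
both processes; with the critical sections closed, one would expect herd7 to
enumerate executions again and to report the condition as reached in all of
them:

C lock-with-unlock

{
}


P0(int *a, int *b, spinlock_t *l)
{
spin_lock(l);
WRITE_ONCE(*a, 1);
spin_unlock(l);
}

P1(int *a, int *b, spinlock_t *l)
{
spin_lock(l);
WRITE_ONCE(*b, 1);
spin_unlock(l);
}

exists (a=1 /\ b=1)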

2023-01-29 19:18:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 08:21:56AM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 11:03:46AM -0500, Alan Stern wrote:
> > On Sat, Jan 28, 2023 at 09:17:34PM -0800, Paul E. McKenney wrote:
> > > On Sat, Jan 28, 2023 at 05:59:52PM -0500, Alan Stern wrote:
> > > > On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
> > > > > > Evidently the plain-coherence check rules out x=1 at the
> > > > > > end, because when I relax that check, x=1 becomes a possible result.
> > > > > > Furthermore, the graphical output confirms that this execution has a
> > > > > > ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
> > > > > > to Wx=2! How can this be possible? It seems like a bug in herd7.
> > > > >
> > > > > By default, herd7 performs some edges removal when generating the
> > > > > graphical outputs. The option -showraw can be useful to increase
> > > > > the "verbosity", for example,
> > > > >
> > > > > [with "exists (x=2)", output in /tmp/T.dot]
> > > > > $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> > > >
> > > > Okay, thanks, that helps a lot.
> > > >
> > > > So here's what we've got. The litmus test:
> > > >
> > > >
> > > > C hb-and-int
> > > > {}
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > > *x = 1;
> > > > smp_store_release(y, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> > > > {
> > > > spin_lock(l);
> > > > int r1 = READ_ONCE(*dy);
> > > > if (r1==1)
> > > > spin_unlock(l);
> > > >
> > > > int r0 = smp_load_acquire(y);
> > > > if (r0 == 1) {
> > > > WRITE_ONCE(*dx,1);
> > > > }
> > >
> > > The lack of a spin_unlock() when r1!=1 is intentional?
> >
> > I assume so.
> >
> > > It is admittedly a cute way to prevent P3 from doing anything
> > > when r1!=1. And P1 won't do anything if P3 runs first.
> >
> > Right.
> >
> > > > }
> > > >
> > > > P2(int *dx, int *dy)
> > > > {
> > > > WRITE_ONCE(*dy,READ_ONCE(*dx));
> > > > }
> > > >
> > > >
> > > > P3(int *x, spinlock_t *l)
> > > > {
> > > > spin_lock(l);
> > > > smp_mb__after_unlock_lock();
> > > > *x = 2;
> > > > }
> > > >
> > > > exists (x=2)
> > > >
> > > >
> > > > The reason why Wx=1 ->ww-vis Wx=2:
> > > >
> > > > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> > > >
> > > > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > > > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> > > >
> > > > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > > > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > > > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> > > >
> > > > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
> > > >
> > > > This explains why the memory model says there isn't a data race. This
> > > > doesn't use the smp_mb__after_unlock_lock at all.
> > >
> > > You lost me on this one.
> > >
> > > Suppose that P3 starts first, then P0. P1 is then stuck at the
> > > spin_lock() because P3 does not release that lock. P2 goes out for a
> > > pizza.
> >
> > That wouldn't be a valid execution. One of the rules in lock.cat says
> > that a spin_lock() call must read from a spin_unlock() or from an
> > initial write, which rules out executions in which P3 acquires the lock
> > first.
>
> OK, I will bite...
>
> Why can't P3's spin_lock() read from that initial write?
>
> > > Why can't the two stores to x by P0 and P3 conflict, resulting in a
> > > data race?
> >
> > That can't happen in executions where P1 acquires the lock first for the
> > reason outlined above (P0's store to x propagates to P3 before P3 writes
> > to x). And there are no other executions -- basically, herd7 ignores
> > deadlock scenarios.
>
> True enough, if P1 gets there first, then P3 never stores to x.
>
> What I don't understand is why P1 must always get there first.

Ah, is the rule that all processes must complete?

If so, then the only way all processes can complete is if P1 loads 1
from dy, thus releasing the lock.

But that dy=1 load can only happen after P2 has copied dx to dy, and has
stored a 1. Which in turn only happens if P1's write to dx is ordered
before the lock release. Which only executes if P1's load from y returned
1, which only happens after P0 stored 1 to y.

Which means that P3 gets the lock only after P0 completes its plain
write to x, so there is no data race.

The reason that P3 cannot go first is that this will prevent P1 from
completing, in turn preventing herd7 from counting that execution.

Or am I still missing a turn in there somewhere?

Thanx, Paul

2023-01-29 21:44:56

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > > Why can't P3's spin_lock() read from that initial write?
> >
> > Mmh, sounds like you want to play with something like below?
> >
> > Andrea
> >
> > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> > index 6b52f365d73ac..20c3af4511255 100644
> > --- a/tools/memory-model/lock.cat
> > +++ b/tools/memory-model/lock.cat
> > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
> >
> > (* Allow up to one unmatched LKW per location; more must deadlock *)
> > let UNMATCHED-LKW = LKW \ domain(critical)
> > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> >
> > (* rfi for LF events: link each LKW to the LF events in its critical section *)
> > let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> > let rf = rf | rf-lf | rf-ru
> >
> > (* Generate all co relations, including LKW events but not UL *)
> > -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> > - (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> > +let co0 = co0 | ([IW] ; loc ; [LKW])
> > include "cos-opt.cat"
> > let W = W | UL
> > let M = R | W
>
> No idea. But the following litmus test gets no executions whatsoever,
> so point taken about my missing at least one corner case. ;-)
>
> Adding a spin_unlock() to the end of either process allows both to
> run.
>
> One could argue that this is a bug, but one could equally well argue
> that if you have a deadlock, you have a deadlock.
>

in lock.cat:

(* Allow up to one unmatched LKW per location; more must deadlock *)
let UNMATCHED-LKW = LKW \ domain(critical)
empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks

we rule out deadlocks from the execution candidates we care about.

Regards,
Boqun

> Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> C lock
>
> {
> }
>
>
> P0(int *a, int *b, spinlock_t *l)
> {
> spin_lock(l);
> WRITE_ONCE(*a, 1);
> }
>
> P1(int *a, int *b, spinlock_t *l)
> {
> spin_lock(l);
> WRITE_ONCE(*b, 1);
> }
>
> exists (a=1 /\ b=1)

2023-01-29 22:10:55

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 06:11:12PM +0100, Andrea Parri wrote:
> > The reason why Wx=1 ->ww-vis Wx=2:
> >
> > 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
> >
> > 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> > (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
> >
> > 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> > so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> > and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
> >
> > Finally, w-post-bounded ; vis ; w-pre-bounded <= ww-vis.
>
> To clarify, po-unlock-lock-po is not a subrelation of mb; see what
> happens without the smp_mb__after_unlock_lock().

Ah, thank you again. That was what I got wrong, and it explains why the
data race appears with Jonas's patch.

This also points out an important unstated fact: All of the inter-CPU
extended fences in the memory model are A-cumulative. In this case the
fence links Rdy=1 on P1 to Wx=2 on P3. We know that 0:Wx=1 is visible
to P1 before the Rdy executes, but what we need is that it is visible to
P3 before Wx=2 executes. The fact that the extended fence is strong
(and therefore A-cumulative) provides this extra guarantee.

Alan

2023-01-29 22:20:18

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po


Hi all, apologies for the confusion about the litmus test.
I should have explained it better but it seems you mostly figured it out.
As Alan said I'm tricking a little bit by not unlocking in certain
places to filter out all executions that aren't what I'm looking for.
I didn't have much time when I sent it (hence also the lack of
explanation and why I haven't responded earlier), so I didn't have time
to play around with the filter feature to do this the "proper"/non-cute way.
As such it really isn't about deadlocks.

I think one question is whether the distinction between the models could
be reproduced without using any kind of filtering at all.
I have a feeling it should be possible but I haven't had time to think
up a litmus test that does that.


On 1/28/2023 11:59 PM, Alan Stern wrote:
> On Sat, Jan 28, 2023 at 11:14:17PM +0100, Andrea Parri wrote:
>>> Evidently the plain-coherence check rules out x=1 at the
>>> end, because when I relax that check, x=1 becomes a possible result.
>>> Furthermore, the graphical output confirms that this execution has a
>>> ww-incoh edge from Wx=2 to Wx=1. But there is no ww-vis edge from Wx=1
>>> to Wx=2! How can this be possible? It seems like a bug in herd7.
>> By default, herd7 performs some edges removal when generating the
>> graphical outputs. The option -showraw can be useful to increase
>> the "verbosity", for example,
>>
>> [with "exists (x=2)", output in /tmp/T.dot]
>> $ herd7 -conf linux-kernel.cfg T.litmus -show prop -o /tmp -skipchecks plain-coherence -doshow ww-vis -showraw ww-vis
> Okay, thanks, that helps a lot.
>
> So here's what we've got. The litmus test:
>
>
> C hb-and-int
> {}
>
> P0(int *x, int *y)
> {
> *x = 1;
> smp_store_release(y, 1);
> }
>
> P1(int *x, int *y, int *dx, int *dy, spinlock_t *l)
> {
> spin_lock(l);
> int r1 = READ_ONCE(*dy);
> if (r1==1)
> spin_unlock(l);
>
> int r0 = smp_load_acquire(y);
> if (r0 == 1) {
> WRITE_ONCE(*dx,1);
> }
> }
>
> P2(int *dx, int *dy)
> {
> WRITE_ONCE(*dy,READ_ONCE(*dx));
> }
>
>
> P3(int *x, spinlock_t *l)
> {
> spin_lock(l);
> smp_mb__after_unlock_lock();
> *x = 2;
> }
>
> exists (x=2)
>
>
> The reason why Wx=1 ->ww-vis Wx=2:
>
> 0:Wx=1 ->po-rel 0:Wy=1 and po-rel < fence < ww-post-bounded.
>
> 0:Wy=1 ->rfe 1:Ry=1 ->(hb* & int) 1:Rdy=1 and
> (rfe ; hb* & int) <= (rfe ; xbstar & int) <= vis.
>
> 1:Rdy=1 ->po 1:unlock ->rfe 3:lock ->po 3:Wx=2
> so 1:Rdy=1 ->po-unlock-lock-po 3:Wx=2
> and po-unlock-lock-po <= mb <= fence <= w-pre-bounded.
>
> [...]
>
> This explains why the memory model says there isn't a data race. This
> doesn't use the smp_mb__after_unlock_lock at all.

Note, as Andrea said, that po-unlock-lock-po is generally not in mb (and
also not otherwise in fence).
Only po-unlock-lock-po;[After-unlock-lock];po is in mb (resp. ...&int in
fence).
While the example uses smp_mb__after_unlock_lock, the anomaly should
generally cover any extended fence.


[Snipping in a part of an earlier e-mail you sent]



> I think that herd7 _should_ say there is a data race.
>
> On the other hand, I also think that the operational model says there
> isn't. This is a case where the formal model fails to match the
> operational model.
>
> The operational model says that if W is a release-store on CPU C and W'
> is another store which propagates to C before W executes, then W'
> propagates to every CPU before W does. (In other words, releases are
> A-cumulative). But the formal model enforces this rule only in cases
> where the event reading W' on C is po-before W.
>
> In your litmus test, the event reading y=1 on P1 is po-after the
> spin_unlock (which is a release). Nevertheless, any feasible execution
> requires that P1 must execute Ry=1 before the unlock. So the
> operational model says that y=1 must propagate to P3 before the unlock
> does, i.e., before P3 executes the spin_lock(). But the formal model
> doesn't require this.


I see now. Somehow I thought stores must execute in program order, but I
guess it doesn't make sense.
In that sense, W ->xbstar&int X always means W propagates to X's CPU
before X executes.


> Ideally we would fix this by changing the definition of po-rel to:
>
> [M] ; (xbstar & int) ; [Release]
>
> (This is closely related to the use of (xbstar & int) in the definition
> of vis that you asked about.)

This misses the property of release stores that any po-earlier store
must also execute before the release store.
Perhaps it could be changed to the old  po-rel | [M] ; (xbstar & int) ;
[Release] but then one could instead move this into the definition of
cumul-fence.
In fact you'd probably want this for all the propagation fences, so
cumul-fence and pb should be the right place.

> Unfortunately we can't do this, because
> po-rel has to be defined long before xbstar.

You could do it, by turning the relation into one massive recursive
definition.

Thinking about what the options are:
1) accept the difference and run with it by making it consistent inside
the axiomatic model
2) fix it through the recursive definition, which seems to be quite ugly
but also consistent with the power operational model as far as I can tell
3) weaken the operational model... somehow
4) just ignore the anomaly
5) ???

Currently my least favorite option is 4) since it seems a bit off that
the reasoning applies in one specific case of LKMM, more specifically
the data race definition which should be equivalent to "the order of the
two races isn't fixed", but here the order isn't fixed but it's a data race.
I think the patch happens to almost do 1) because the xbstar&int at the
end should already imply ordering through the prop&int <= hb rule.
What would remain is to also exclude rcu-fence somehow.

2) seems the most correct choice but would also complicate LKMM a lot.

Seems like being between a rock and a hard place.
jonas

PS:
>> The other cases of *-pre-bounded seem to work out: they all link stuff via
>> xbstar to the instruction that is linked via po-unlock-lock-po ;
>> [After-unlock-lock] ; po to the potentially racy access, and
>> po-unlock-lock-po;po is xbstar ; acq-po, which allows closing the gap.
> I could not follow your arguments at all; the writing was too confusing.

Sorry, I collapsed some cases. I'll show an example. I think all the
other cases are the same.
Let's pick an edge that links two events with ww-vis through
  w-post-bounded ; vis ; w-pre-bounded
where the vis comes from
  cumul-fence* ; rfe? ; [Marked] ;
    (strong-fence ; [Marked] ; xbstar)  <= vis
and the w-pre-bounded came from po-unlock-lock-po;[After-unlock-lock];po
but not po-unlock-lock-po & int.

Note that such po-unlock-lock-po;[After-unlock-lock];po must come from
  po-rel ; rfe ; acq-po

So we have
  w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
     strong-fence ; [Marked] ; xbstar ; po-rel ; rfe ; acq-po
<=
  w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
     strong-fence ; [Marked] ; xbstar ; hb ; hb ;  acq-po
<=
  w-post-bounded ;  cumul-fence* ; rfe? ; [Marked] ;
     strong-fence ; [Marked] ; xbstar ;                fence
<= ww-vis

All the other cases where w-pre-bounded are used have the shape
    ... ; xbstar ; w-pre-bounded
which can all be handled in the same manner.

2023-01-29 23:09:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> On Sun, Jan 29, 2023 at 10:44:03AM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 06:28:27PM +0100, Andrea Parri wrote:
> > > > Why can't P3's spin_lock() read from that initial write?
> > >
> > > Mmh, sounds like you want to play with something like below?
> > >
> > > Andrea
> > >
> > > diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat
> > > index 6b52f365d73ac..20c3af4511255 100644
> > > --- a/tools/memory-model/lock.cat
> > > +++ b/tools/memory-model/lock.cat
> > > @@ -74,7 +74,6 @@ flag ~empty UL \ range(critical) as unmatched-unlock
> > >
> > > (* Allow up to one unmatched LKW per location; more must deadlock *)
> > > let UNMATCHED-LKW = LKW \ domain(critical)
> > > -empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > >
> > > (* rfi for LF events: link each LKW to the LF events in its critical section *)
> > > let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc)
> > > @@ -120,8 +119,7 @@ let rf-ru = rfe-ru | rfi-ru
> > > let rf = rf | rf-lf | rf-ru
> > >
> > > (* Generate all co relations, including LKW events but not UL *)
> > > -let co0 = co0 | ([IW] ; loc ; [LKW]) |
> > > - (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
> > > +let co0 = co0 | ([IW] ; loc ; [LKW])
> > > include "cos-opt.cat"
> > > let W = W | UL
> > > let M = R | W
> >
> > No idea. But the following litmus test gets no executions whatsoever,
> > so point taken about my missing at least one corner case. ;-)
> >
> > Adding a spin_unlock() to the end of either process allows both to
> > run.
> >
> > One could argue that this is a bug, but one could equally well argue
> > that if you have a deadlock, you have a deadlock.
> >
>
> in lock.cat:
>
> (* Allow up to one unmatched LKW per location; more must deadlock *)
> let UNMATCHED-LKW = LKW \ domain(critical)
> empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
>
> we rule out deadlocks from the execution candidates we care about.

Thank you, Boqun!

Thanx, Paul

> Regards,
> Boqun
>
> > Thoughts?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > C lock
> >
> > {
> > }
> >
> >
> > P0(int *a, int *b, spinlock_t *l)
> > {
> > spin_lock(l);
> > WRITE_ONCE(*a, 1);
> > }
> >
> > P1(int *a, int *b, spinlock_t *l)
> > {
> > spin_lock(l);
> > WRITE_ONCE(*b, 1);
> > }
> >
> > exists (a=1 /\ b=1)

2023-01-30 02:18:29

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > in lock.cat:
> >
> > (* Allow up to one unmatched LKW per location; more must deadlock *)
> > let UNMATCHED-LKW = LKW \ domain(critical)
> > empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> >
> > we rule out deadlocks from the execution candidates we care about.
>
> Thank you, Boqun!

Actually that's only part of it. The other part is rather obscure:

(* Generate all co relations, including LKW events but not UL *)
let co0 = co0 | ([IW] ; loc ; [LKW]) |
(([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])

Implicitly this says that any lock with no corresponding unlock must
come last in the coherence order, which implies the unmatched-locks rule
(since only one lock event can be last). By itself, the unmatched-locks
rule would not prevent P3 from executing before P1, provided P1 executes
both its lock and unlock.

Alan


2023-01-30 02:39:23

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> I see now. Somehow I thought stores must execute in program order, but I
> guess it doesn't make sense.
> In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> X executes.

It also means any write that propagates to W's CPU before W executes
also propagates to X's CPU before X executes (because it's the same CPU
and W executes before X).

> > Ideally we would fix this by changing the definition of po-rel to:
> >
> > [M] ; (xbstar & int) ; [Release]
> >
> > (This is closely related to the use of (xbstar & int) in the definition
> > of vis that you asked about.)
>
> This misses the property of release stores that any po-earlier store must
> also execute before the release store.

I should have written:

[M] ; (po | (xbstar & int)) ; [Release]

> Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ;
> [Release] but then one could instead move this into the definition of
> cumul-fence.
> In fact you'd probably want this for all the propagation fences, so
> cumul-fence and pb should be the right place.
>
> > Unfortunately we can't do this, because
> > po-rel has to be defined long before xbstar.
>
> You could do it, by turning the relation into one massive recursive
> definition.

Which would make pretty much the entire memory model one big recursion.
I do not want to do that.

> Thinking about what the options are:
> 1) accept the difference and run with it by making it consistent inside the
> axiomatic model
> 2) fix it through the recursive definition, which seems to be quite ugly but
> also consistent with the power operational model as far as I can tell
> 3) weaken the operational model... somehow
> 4) just ignore the anomaly
> 5) ???
>
> Currently my least favorite option is 4) since it seems a bit off that the
> reasoning applies in one specific case of LKMM, more specifically the data
> race definition which should be equivalent to "the order of the two races
> isn't fixed", but here the order isn't fixed but it's a data race.
> I think the patch happens to almost do 1) because the xbstar&int at the end
> should already imply ordering through the prop&int <= hb rule.
> What would remain is to also exclude rcu-fence somehow.

IMO 1) is the best choice.

Alan

PS: For the record, here's a simpler litmus test to illustrate the
failing. The idea is that Wz=1 is reordered before the store-release,
so it ought to propagate before Wy=1. The LKMM does not require this.


C before-release

{}

P0(int *x, int *y, int *z)
{
int r1;

r1 = READ_ONCE(*x);
smp_store_release(y, 1);
WRITE_ONCE(*z, 1);
}

P1(int *x, int *y, int *z)
{
int r2;

r2 = READ_ONCE(*z);
WRITE_ONCE(*x, r2);
}

P2(int *x, int *y, int *z)
{
int r3;
int r4;

r3 = READ_ONCE(*y);
smp_rmb();
r4 = READ_ONCE(*z);
}

exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

2023-01-30 04:36:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > I see now. Somehow I thought stores must execute in program order, but I
> > guess it doesn't make sense.
> > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > X executes.
>
> It also means any write that propagates to W's CPU before W executes
> also propagates to X's CPU before X executes (because it's the same CPU
> and W executes before X).
>
> > > Ideally we would fix this by changing the definition of po-rel to:
> > >
> > > [M] ; (xbstar & int) ; [Release]
> > >
> > > (This is closely related to the use of (xbstar & int) in the definition
> > > of vis that you asked about.)
> >
> > This misses the property of release stores that any po-earlier store must
> > also execute before the release store.
>
> I should have written:
>
> [M] ; (po | (xbstar & int)) ; [Release]
>
> > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ;
> > [Release] but then one could instead move this into the definition of
> > cumul-fence.
> > In fact you'd probably want this for all the propagation fences, so
> > cumul-fence and pb should be the right place.
> >
> > > Unfortunately we can't do this, because
> > > po-rel has to be defined long before xbstar.
> >
> > You could do it, by turning the relation into one massive recursive
> > definition.
>
> Which would make pretty much the entire memory model one big recursion.
> I do not want to do that.
>
> > Thinking about what the options are:
> > 1) accept the difference and run with it by making it consistent inside the
> > axiomatic model
> > 2) fix it through the recursive definition, which seems to be quite ugly but
> > also consistent with the power operational model as far as I can tell
> > 3) weaken the operational model... somehow
> > 4) just ignore the anomaly
> > 5) ???
> >
> > Currently my least favorite option is 4) since it seems a bit off that the
> > reasoning applies in one specific case of LKMM, more specifically the data
> > race definition which should be equivalent to "the order of the two races
> > isn't fixed", but here the order isn't fixed but it's a data race.
> > I think the patch happens to almost do 1) because the xbstar&int at the end
> > should already imply ordering through the prop&int <= hb rule.
> > What would remain is to also exclude rcu-fence somehow.
>
> IMO 1) is the best choice.
>
> Alan
>
> PS: For the record, here's a simpler litmus test to illustrate the
> failing. The idea is that Wz=1 is reordered before the store-release,
> so it ought to propagate before Wy=1. The LKMM does not require this.

In PowerPC terms, would this be like having the Wz=1 being reordered
before the Wy=1, but not before the lwsync instruction preceding the
Wy=1 that made it a release store?

If so, we might have to keep this quirk.

Thanx, Paul

> C before-release
>
> {}
>
> P0(int *x, int *y, int *z)
> {
> int r1;
>
> r1 = READ_ONCE(*x);
> smp_store_release(y, 1);
> WRITE_ONCE(*z, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> int r2;
>
> r2 = READ_ONCE(*z);
> WRITE_ONCE(*x, r2);
> }
>
> P2(int *x, int *y, int *z)
> {
> int r3;
> int r4;
>
> r3 = READ_ONCE(*y);
> smp_rmb();
> r4 = READ_ONCE(*z);
> }
>
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

2023-01-30 04:43:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 09:18:24PM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 03:09:00PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 01:43:53PM -0800, Boqun Feng wrote:
> > > in lock.cat:
> > >
> > > (* Allow up to one unmatched LKW per location; more must deadlock *)
> > > let UNMATCHED-LKW = LKW \ domain(critical)
> > > empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks
> > >
> > > we rule out deadlocks from the execution candidates we care about.
> >
> > Thank you, Boqun!
>
> Actually that's only part of it. The other part is rather obscure:
>
> (* Generate all co relations, including LKW events but not UL *)
> let co0 = co0 | ([IW] ; loc ; [LKW]) |
> (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW])
>
> Implicitly this says that any lock with no corresponding unlock must
> come last in the coherence order, which implies the unmatched-locks rule
> (since only one lock event can be last). By itself, the unmatched-locks
> rule would not prevent P3 from executing before P1, provided P1 executes
> both its lock and unlock.

And thank you, Alan, as well!

And RCU looks to operate in a similar manner:

------------------------------------------------------------------------

C rcudeadlock

{
}


P0(int *a, int *b)
{
rcu_read_lock();
WRITE_ONCE(*a, 1);
synchronize_rcu();
WRITE_ONCE(*b, 1);
rcu_read_unlock();
}

P1(int *a, int *b)
{
int r1;
int r2;

r1 = READ_ONCE(*b);
smp_mb();
r2 = READ_ONCE(*a);
}

exists (1:r1=1 /\ 1:r2=0)

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg rcudeadlock.litmus
Test rcudeadlock Allowed
States 0
No
Witnesses
Positive: 0 Negative: 0
Condition exists (1:r1=1 /\ 1:r2=0)
Observation rcudeadlock Never 0 0
Time rcudeadlock 0.00
Hash=4f7f336ad39d724d93b089133b00d1e2

------------------------------------------------------------------------

So good enough! ;-)

Thanx, Paul

2023-01-30 04:46:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
>
> Hi all, apologies for the confusion about the litmus test.
> I should have explained it better but it seems you mostly figured it out.
> As Alan said I'm tricking a little bit by not unlocking in certain places to
> filter out all executions that aren't what I'm looking for.
> I didn't have much time when I sent it (hence also the lack of explanation
> and why I haven't responded earlier), so I didn't have time to play around
> with the filter feature to do this the "proper"/non-cute way.
> As such it really isn't about deadlocks.

Not a problem!

> I think one question is whether the distinction between the models could be
> reproduced without using any kind of filtering at all.
> I have a feeling it should be possible but I haven't had time to think up a
> litmus test that does that.

Here is an example litmus test using filter, if this helps.

You put it right before the "exists" clause, and its condition uses the
same syntax as the condition in the "exists" clause. If an execution does
not satisfy the condition in the filter clause, it is tossed.

Thanx, Paul

------------------------------------------------------------------------

C C-srcu-nest-6

(*
* Result: Never
*
* This would be valid for srcu_down_read() and srcu_up_read().
*)

{}

P0(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
int r2;
int r3;

r3 = srcu_down_read(s1);
WRITE_ONCE(*idx, r3);
r2 = READ_ONCE(*y);
smp_store_release(f, 1);
}

P1(int *x, int *y, struct srcu_struct *s1, int *idx, int *f)
{
int r1;
int r3;
int r4;

r4 = smp_load_acquire(f);
r1 = READ_ONCE(*x);
r3 = READ_ONCE(*idx);
srcu_up_read(s1, r3);
}

P2(int *x, int *y, struct srcu_struct *s1)
{
WRITE_ONCE(*y, 1);
synchronize_srcu(s1);
WRITE_ONCE(*x, 1);
}

locations [0:r1]
filter (1:r4=1)
exists (1:r1=1 /\ 0:r2=0)

2023-01-30 16:47:57

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > I see now. Somehow I thought stores must execute in program order, but I
> > > guess it doesn't make sense.
> > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > > X executes.
> >
> > It also means any write that propagates to W's CPU before W executes
> > also propagates to X's CPU before X executes (because it's the same CPU
> > and W executes before X).
> >
> > > > Ideally we would fix this by changing the definition of po-rel to:
> > > >
> > > > [M] ; (xbstar & int) ; [Release]
> > > >
> > > > (This is closely related to the use of (xbstar & int) in the definition
> > > > of vis that you asked about.)
> > >
> > > This misses the property of release stores that any po-earlier store must
> > > also execute before the release store.
> >
> > I should have written:
> >
> > [M] ; (po | (xbstar & int)) ; [Release]
> >
> > > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ;
> > > [Release] but then one could instead move this into the definition of
> > > cumul-fence.
> > > In fact you'd probably want this for all the propagation fences, so
> > > cumul-fence and pb should be the right place.
> > >
> > > > Unfortunately we can't do this, because
> > > > po-rel has to be defined long before xbstar.
> > >
> > > You could do it, by turning the relation into one massive recursive
> > > definition.
> >
> > Which would make pretty much the entire memory model one big recursion.
> > I do not want to do that.
> >
> > > Thinking about what the options are:
> > > 1) accept the difference and run with it by making it consistent inside the
> > > axiomatic model
> > > 2) fix it through the recursive definition, which seems to be quite ugly but
> > > also consistent with the power operational model as far as I can tell
> > > 3) weaken the operational model... somehow
> > > 4) just ignore the anomaly
> > > 5) ???
> > >
> > > Currently my least favorite option is 4) since it seems a bit off that the
> > > reasoning applies in one specific case of LKMM, more specifically the data
> > > race definition which should be equivalent to "the order of the two races
> > > isn't fixed", but here the order isn't fixed but it's a data race.
> > > I think the patch happens to almost do 1) because the xbstar&int at the end
> > > should already imply ordering through the prop&int <= hb rule.
> > > What would remain is to also exclude rcu-fence somehow.
> >
> > IMO 1) is the best choice.
> >
> > Alan
> >
> > PS: For the record, here's a simpler litmus test to illustrate the
> > failing. The idea is that Wz=1 is reordered before the store-release,
> > so it ought to propagate before Wy=1. The LKMM does not require this.
>
> > In PowerPC terms, would this be like having the Wz=1 being reordered
> > before the Wy=1, but not before the lwsync instruction preceding the
> > Wy=1 that made it a release store?

No, it would be like having the Wz=1 reordered before the Rx=1,
therefore before the lwsync. Obviously this can't ever happen on
PowerPC.

Alan

> If so, we might have to keep this quirk.
>
> Thanx, Paul
>
> > C before-release
> >
> > {}
> >
> > P0(int *x, int *y, int *z)
> > {
> > int r1;
> >
> > r1 = READ_ONCE(*x);
> > smp_store_release(y, 1);
> > WRITE_ONCE(*z, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > int r2;
> >
> > r2 = READ_ONCE(*z);
> > WRITE_ONCE(*x, r2);
> > }
> >
> > P2(int *x, int *y, int *z)
> > {
> > int r3;
> > int r4;
> >
> > r3 = READ_ONCE(*y);
> > smp_rmb();
> > r4 = READ_ONCE(*z);
> > }
> >
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

2023-01-30 16:50:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Mon, Jan 30, 2023 at 11:47:50AM -0500, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 08:36:45PM -0800, Paul E. McKenney wrote:
> > On Sun, Jan 29, 2023 at 09:39:17PM -0500, Alan Stern wrote:
> > > On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
> > > > I see now. Somehow I thought stores must execute in program order, but I
> > > > guess it doesn't make sense.
> > > > In that sense, W ->xbstar&int X always means W propagates to X's CPU before
> > > > X executes.
> > >
> > > It also means any write that propagates to W's CPU before W executes
> > > also propagates to X's CPU before X executes (because it's the same CPU
> > > and W executes before X).
> > >
> > > > > Ideally we would fix this by changing the definition of po-rel to:
> > > > >
> > > > > [M] ; (xbstar & int) ; [Release]
> > > > >
> > > > > (This is closely related to the use of (xbstar & int) in the definition
> > > > > of vis that you asked about.)
> > > >
> > > > This misses the property of release stores that any po-earlier store must
> > > > also execute before the release store.
> > >
> > > I should have written:
> > >
> > > [M] ; (po | (xbstar & int)) ; [Release]
> > >
> > > > Perhaps it could be changed to the old po-rel | [M] ; (xbstar & int) ;
> > > > [Release] but then one could instead move this into the definition of
> > > > cumul-fence.
> > > > In fact you'd probably want this for all the propagation fences, so
> > > > cumul-fence and pb should be the right place.
> > > >
> > > > > Unfortunately we can't do this, because
> > > > > po-rel has to be defined long before xbstar.
> > > >
> > > > You could do it, by turning the relation into one massive recursive
> > > > definition.
> > >
> > > Which would make pretty much the entire memory model one big recursion.
> > > I do not want to do that.
> > >
> > > > Thinking about what the options are:
> > > > 1) accept the difference and run with it by making it consistent inside the
> > > > axiomatic model
> > > > 2) fix it through the recursive definition, which seems to be quite ugly but
> > > > also consistent with the power operational model as far as I can tell
> > > > 3) weaken the operational model... somehow
> > > > 4) just ignore the anomaly
> > > > 5) ???
> > > >
> > > > Currently my least favorite option is 4) since it seems a bit off that the
> > > > reasoning applies in one specific case of LKMM, more specifically the data
> > > > race definition which should be equivalent to "the order of the two races
> > > > isn't fixed", but here the order isn't fixed but it's a data race.
> > > > I think the patch happens to almost do 1) because the xbstar&int at the end
> > > > should already imply ordering through the prop&int <= hb rule.
> > > > What would remain is to also exclude rcu-fence somehow.
> > >
> > > IMO 1) is the best choice.
> > >
> > > Alan
> > >
> > > PS: For the record, here's a simpler litmus test to illustrate the
> > > failing. The idea is that Wz=1 is reordered before the store-release,
> > > so it ought to propagate before Wy=1. The LKMM does not require this.
> >
> > In PowerPC terms, would this be like having the Wz=1 being reordered
> > before the Wy=1, but not before the lwsync instruction preceding the
> > Wy=1 that made it be a release store?
>
> No, it would be like having the Wz=1 reordered before the Rx=1,
> therefore before the lwsync. Obviously this can't ever happen on
> PowerPC.

Whew!!! ;-)

Thanx, Paul

> Alan
>
> > If so, we might have to keep this quirk.
> >
> > Thanx, Paul
> >
> > > C before-release
> > >
> > > {}
> > >
> > > P0(int *x, int *y, int *z)
> > > {
> > > int r1;
> > >
> > > r1 = READ_ONCE(*x);
> > > smp_store_release(y, 1);
> > > WRITE_ONCE(*z, 1);
> > > }
> > >
> > > P1(int *x, int *y, int *z)
> > > {
> > > int r2;
> > >
> > > r2 = READ_ONCE(*z);
> > > WRITE_ONCE(*x, r2);
> > > }
> > >
> > > P2(int *x, int *y, int *z)
> > > {
> > > int r3;
> > > int r4;
> > >
> > > r3 = READ_ONCE(*y);
> > > smp_rmb();
> > > r4 = READ_ONCE(*z);
> > > }
> > >
> > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

2023-01-31 13:56:43

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po



On 1/30/2023 3:39 AM, Alan Stern wrote:
> On Sun, Jan 29, 2023 at 11:19:32PM +0100, Jonas Oberhauser wrote:
>> You could do it, by turning the relation into one massive recursive
>> definition.
> Which would make pretty much the entire memory model one big recursion.
> I do not want to do that.

Neither do I :D

>
>> Thinking about what the options are:
>> 1) accept the difference and run with it by making it consistent inside the
>> axiomatic model
>> 2) fix it through the recursive definition, which seems to be quite ugly but
>> also consistent with the power operational model as far as I can tell
>> 3) weaken the operational model... somehow
>> 4) just ignore the anomaly
>> 5) ???
>>
>> Currently my least favorite option is 4) since it seems a bit off that the
>> reasoning applies in one specific case of LKMM, more specifically the data
>> race definition which should be equivalent to "the order of the two races
>> isn't fixed", but here the order isn't fixed but it's a data race.
>> I think the patch happens to almost do 1) because the xbstar&int at the end
>> should already imply ordering through the prop&int <= hb rule.
>> What would remain is to also exclude rcu-fence somehow.
> IMO 1) is the best choice.

I have some additional thoughts now. It seems that you could weaken the
operational model by stating that an A-cumulative fence orders
propagation of all *external* stores (in addition to all po-earlier
stores) that propagated to you before the fence is executed.

It seems that on power, from an operational model perspective, there's
currently no difference between propagation fences ordering all stores
vs only external stores that propagated to the CPU before the fence is
executed, because they only have bidirectional (*->W) fences (sync,
lwsync) and not uni-directional (acquire, release), and so it is not
possible for a store that is po-later than the barrier to be executed
before the barrier; i.e., on power, every internal store that propagates
to a CPU before the fence executes is also po-earlier than the fence.

If power did introduce release stores, I think you could potentially
create implementations that allow the behavior in the example you have
given, but I don't think they are the most natural ones:

> {}
>
> P0(int *x, int *y, int *z)
> {
> int r1;
>
> r1 = READ_ONCE(*x);
> smp_store_release(y, 1);
> WRITE_ONCE(*z, 1);
> }
>
> P1(int *x, int *y, int *z)
> {
> int r2;
>
> r2 = READ_ONCE(*z);
> WRITE_ONCE(*x, r2);
> }
>
> P2(int *x, int *y, int *z)
> {
> int r3;
> int r4;
>
> r3 = READ_ONCE(*y);
> smp_rmb();
> r4 = READ_ONCE(*z);
> }
>
> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)

I could imagine that P0 posts both of its stores in a shared store
buffer before reading *x, but marks the release store as "not ready".
Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0
reads, and subsequently marks its release store as "ready".
Then the release store is sent to the cache, where P2 reads *y=1 and
then *z=0.
Finally P0 sends its *z=1 store to the cache.
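
(As a very rough sketch -- all of the names below are made up purely for
illustration, not taken from any real implementation -- such a shared
store buffer entry could look like this in C:)

#include <stdbool.h>

struct sb_entry {
	int cpu;	/* CPU that posted the store */
	int *addr;	/* location being written, e.g. y */
	int val;	/* value being written */
	bool ready;	/* a release store is posted with ready == false and
			   is neither forwarded nor drained to the cache
			   until its po-earlier reads have completed */
};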

However, a perhaps more natural implementation would not post the
release store to the store buffer until it is "ready", in which case the
order in the store buffer would be *z=1 before *y=1, and in this case
the release ordering would presumably work like your current operational
model.

Nevertheless, perhaps this slightly weaker operational model isn't as
absurd as it sounds. And I think many people wouldn't be shocked if the
release store didn't provide ordering with *z=1.

Best wishes, jonas


2023-01-31 15:15:58

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> I have some additional thoughts now. It seems that you could weaken the
> operational model by stating that an A-cumulative fence orders propagation
> of all *external* stores (in addition to all po-earlier stores) that
> propagated to you before the fence is executed.

How is that a weakening of the operational model? It's what the
operational model says right now. From explanation.txt:

Release fences, such as smp_store_release(), force the CPU to
execute all po-earlier instructions before the store
associated with the fence (e.g., the store part of an
smp_store_release()).

... When a fence instruction is executed on CPU C:

...

For each other CPU C', any store which propagates to C before
a release fence is executed (including all po-earlier
stores executed on C) is forced to propagate to C' before the
store associated with the release fence does.

In theory, we could weaken the operational model by saying that pfences
order propagation of stores from other CPUs only when those stores are
read-from by instructions po-before the fence. But I suspect that's not
such a good idea.

> It seems that on power, from an operational model perspective, there's
> currently no difference between propagation fences ordering all stores vs
> only external stores that propagated to the CPU before the fence is
> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
> and not uni-directional (acquire, release), and so it is not possible for a
> store that is po-later than the barrier to be executed before the barrier;
> i.e., on power, every internal store that propagates to a CPU before the
> fence executes is also po-earlier than the fence.
>
> If power did introduce release stores, I think you could potentially create
> implementations that allow the behavior in the example you have given, but I
> don't think they are the most natural ones:

Maybe so. In any case, it's a moot point. In fact, I don't know if any
architecture supporting Linux allows a write that is po-after a release
store to be reordered before the release store.

> > P0(int *x, int *y, int *z)
> > {
> > int r1;
> >
> > r1 = READ_ONCE(*x);
> > smp_store_release(y, 1);
> > WRITE_ONCE(*z, 1);
> > }
> >
> > P1(int *x, int *y, int *z)
> > {
> > int r2;
> >
> > r2 = READ_ONCE(*z);
> > WRITE_ONCE(*x, r2);
> > }
> >
> > P2(int *x, int *y, int *z)
> > {
> > int r3;
> > int r4;
> >
> > r3 = READ_ONCE(*y);
> > smp_rmb();
> > r4 = READ_ONCE(*z);
> > }
> >
> > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>
> I could imagine that P0 posts both of its stores in a shared store buffer
> before reading *x, but marks the release store as "not ready".
> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> and subsequently marks its release store as "ready".

That isn't how release stores are meant to work. The read of x is
supposed to be complete before the release store becomes visible to any
other CPU. This is true even in C11.

> Then the release store is sent to the cache, where P2 reads *y=1 and then
> *z=0.
> Finally P0 sends its *z=1 store to the cache.
>
> However, a perhaps more natural implementation would not post the release
> store to the store buffer until it is "ready", in which case the order in
> the store buffer would be *z=1 before *y=1, and in this case the release
> ordering would presumably work like your current operational model.
>
> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
> as it sounds. And I think many people wouldn't be shocked if the release
> store didn't provide ordering with *z=1.

This issue is one we should discuss with all the other people involved
in maintaining the LKMM.

Alan

2023-01-31 15:34:07

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po



On 1/31/2023 4:06 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
>> I have some additional thoughts now. It seems that you could weaken the
>> operational model by stating that an A-cumulative fence orders propagation
>> of all *external* stores (in addition to all po-earlier stores) that
>> propagated to you before the fence is executed.
> How is that a weakening of the operational model? It's what the
> operational model says right now.

No, as in the part that you have quoted, it is stated that an
A-cumulative fence orders propagation of *all* stores that propagated
to you before the fence is executed.
I'm saying you could weaken this to only cover all *external* stores.

More precisely, I would change

> For each other CPU C', any store which propagates to C before
> a release fence is executed (including all po-earlier
> stores executed on C) is forced to propagate to C' before the
> store associated with the release fence does.

Into something like


     For each other CPU C', any *external* store which propagates to C
     before a release fence is executed, as well as any po-earlier
     store executed on C, is forced to propagate to C' before the
     store associated with the release fence does.

The difference is that po-later stores that happen to propagate to C
before the release fence is executed would no longer be ordered.
That should be consistent with the axiomatic model.


>
> In theory, we could weaken the operational model by saying that pfences
> order propagation of stores from other CPUs only when those stores are
> read-from by instructions po-before the fence. But I suspect that's not
> such a good idea.

That indeed looks too confusing.


>> It seems that on power, from an operational model perspective, there's
>> currently no difference between propagation fences ordering all stores vs
>> only external stores that propagated to the CPU before the fence is
>> executed, because they only have bidirectional (*->W) fences (sync, lwsync)
>> and not uni-directional (acquire, release), and so it is not possible for a
>> store that is po-later than the barrier to be executed before the barrier;
>> i.e., on power, every internal store that propagates to a CPU before the
>> fence executes is also po-earlier than the fence.
>>
>> If power did introduce release stores, I think you could potentially create
>> implementations that allow the behavior in the example you have given, but I
>> don't think they are the most natural ones:
> Maybe so. In any case, it's a moot point. In fact, I don't know if any
> architecture supporting Linux allows a write that is po-after a release
> store to be reordered before the release store.

Arm and RISC-V do, but they are multi-copy-atomic anyway.

>
>>> P0(int *x, int *y, int *z)
>>> {
>>> int r1;
>>>
>>> r1 = READ_ONCE(*x);
>>> smp_store_release(y, 1);
>>> WRITE_ONCE(*z, 1);
>>> }
>>>
>>> P1(int *x, int *y, int *z)
>>> {
>>> int r2;
>>>
>>> r2 = READ_ONCE(*z);
>>> WRITE_ONCE(*x, r2);
>>> }
>>>
>>> P2(int *x, int *y, int *z)
>>> {
>>> int r3;
>>> int r4;
>>>
>>> r3 = READ_ONCE(*y);
>>> smp_rmb();
>>> r4 = READ_ONCE(*z);
>>> }
>>>
>>> exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
>> I could imagine that P0 posts both of its stores in a shared store buffer
>> before reading *x, but marks the release store as "not ready".
>> Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
>> and subsequently marks its release store as "ready".
> That isn't how release stores are meant to work. The read of x is
> supposed to be complete before the release store becomes visible to any
> other CPU.

Note that the release store isn't observed until it becomes "ready", so
it really is indistinguishable whether it has become visible to any
other CPU.
Indeed stores that aren't marked "ready" would be ignored during
forwarding, and not allowed to be pushed to the cache.

The reason this kind of implementation seems less natural to me is that
such an "not ready" store would need to be pushed back in the buffer (if
it is the head of the buffer and the cache is ready to take a store),
stall the later stores, or be aborted until it becomes ready.
That just seems to create a lot of hassle for no discernible benefit.
A "not ready" store probably shouldn't be put into a store queue, even
if the only reason it is not ready is that there are some otherwise
unrelated reads that haven't completed yet.



> This is true even in C11.

Arguable... The following pseudo-code litmus test should demonstrate this:

P0 {
   int r = read_relaxed(&x);
   store_release(&y,1);
}


P1 {
   int s = read_relaxed(&y);
   store_release(&x,1);
}

In C11, it should be possible to read r==s==1.
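
For concreteness, one way to spell this pseudo-code out with C11 atomics
(assuming x and y are atomic_int and both start at 0) would be:

#include <stdatomic.h>

atomic_int x, y;		/* both start at 0 */

void p0(void)			/* corresponds to P0 above */
{
	int r = atomic_load_explicit(&x, memory_order_relaxed);
	atomic_store_explicit(&y, 1, memory_order_release);
	(void)r;		/* r == 1 together with s == 1 is allowed */
}

void p1(void)			/* corresponds to P1 above */
{
	int s = atomic_load_explicit(&y, memory_order_relaxed);
	atomic_store_explicit(&x, 1, memory_order_release);
	(void)s;
}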


>> Then the release store is sent to the cache, where P2 reads *y=1 and then
>> *z=0.
>> Finally P0 sends its *z=1 store to the cache.
>>
>> However, a perhaps more natural implementation would not post the release
>> store to the store buffer until it is "ready", in which case the order in
>> the store buffer would be *z=1 before *y=1, and in this case the release
>> ordering would presumably work like your current operational model.
>>
>> Nevertheless, perhaps this slightly weaker operational model isn't as absurd
>> as it sounds. And I think many people wouldn't be shocked if the release
>> store didn't provide ordering with *z=1.
> This issue is one we should discuss with all the other people involved
> in maintaining the LKMM.
>
> Alan

Sure.

Btw, how to proceed for your SRCU patch and this one?
Are you planning to make any changes? I think the version you have is ok
if you don't think the patch is improved by anything I brought up.

Any additional concerns/changes for this patch?

Best wishes, jonas


2023-01-31 16:55:24

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po

On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>
>
> On 1/31/2023 4:06 PM, Alan Stern wrote:
> > On Tue, Jan 31, 2023 at 02:56:00PM +0100, Jonas Oberhauser wrote:
> > > I have some additional thoughts now. It seems that you could weaken the
> > > operational model by stating that an A-cumulative fence orders propagation
> > > of all *external* stores (in addition to all po-earlier stores) that
> > > propagated to you before the fence is executed.
> > How is that a weakening of the operational model? It's what the
> > operational model says right now.
>
> No, as in the part that you have quoted, it is stated that an A-cumulative
> fence orders propagation of *all* stores that propagated to you before the
> fence is executed.
> I'm saying you could weaken this to only cover all *external* stores.

Okay, now I understand.

> More precisely, I would change
>
> > For each other CPU C', any store which propagates to C before
> > a release fence is executed (including all po-earlier
> > stores executed on C) is forced to propagate to C' before the
> > store associated with the release fence does.
>
> Into something like
>
>
>      For each other CPU C', any *external* store which propagates to C
>      before a release fence is executed, as well as any po-earlier
>      store executed on C, is forced to propagate to C' before the
>      store associated with the release fence does.
>
> The difference is that po-later stores that happen to propagate to C before
> the release fence is executed would no longer be ordered.
> That should be consistent with the axiomatic model.

I had to check that it wouldn't affect the (xbstar & int) part of vis,
but it looks all right. This seems like a reasonable change.

However, it only fixes part of the problem. Suppose an external write
is read by an instruction po-after the release-store, but the read
executes before the release-store. The operational model would still
say the external write has to obey the propagation ordering, whereas the
formal model doesn't require it.

> > > > P0(int *x, int *y, int *z)
> > > > {
> > > > int r1;
> > > >
> > > > r1 = READ_ONCE(*x);
> > > > smp_store_release(y, 1);
> > > > WRITE_ONCE(*z, 1);
> > > > }
> > > >
> > > > P1(int *x, int *y, int *z)
> > > > {
> > > > int r2;
> > > >
> > > > r2 = READ_ONCE(*z);
> > > > WRITE_ONCE(*x, r2);
> > > > }
> > > >
> > > > P2(int *x, int *y, int *z)
> > > > {
> > > > int r3;
> > > > int r4;
> > > >
> > > > r3 = READ_ONCE(*y);
> > > > smp_rmb();
> > > > r4 = READ_ONCE(*z);
> > > > }
> > > >
> > > > exists (0:r1=1 /\ 2:r3=1 /\ 2:r4=0)
> > > I could imagine that P0 posts both of its stores in a shared store buffer
> > > before reading *x, but marks the release store as "not ready".
> > > Then P1 forwards *z=1 from the store buffer and posts *x=1, which P0 reads,
> > > and subsequently marks its release store as "ready".
> > That isn't how release stores are meant to work. The read of x is
> > supposed to be complete before the release store becomes visible to any
> > other CPU.
>
> Note that the release store isn't observed until it becomes "ready", so it
> really is indistinguishable whether it has become visible to any other
> CPU.
> Indeed stores that aren't marked "ready" would be ignored during forwarding,
> and not allowed to be pushed to the cache.

Oops, I mixed up a couple of the accesses. Okay, yes, this mechanism
will allow writes that are po-after a release store but execute before
it to evade the propagation restriction.

> The reason this kind of implementation seems less natural to me is that such
> an "not ready" store would need to be pushed back in the buffer (if it is
> the head of the buffer and the cache is ready to take a store), stall the
> later stores, or be aborted until it becomes ready.
> That just seems to create a lot of hassle for no discernible benefit.
> A "not ready" store probably shouldn't be put into a store queue, even if
> the only reason it is not ready is that there are some otherwise unrelated
> reads that haven't completed yet.
>
>
>
> > This is true even in C11.
>
> Arguable... The following pseudo-code litmus test should demonstrate this:
>
> P0 {
>    int r = read_relaxed(&x);
>    store_release(&y,1);
> }
>
>
> P1 {
>    int s = read_relaxed(&y);
>    store_release(&x,1);
> }
>
> In C11, it should be possible to read r==s==1.

True, in C11 releases don't mean anything unless they're paired with
acquires. But if your P1 had been

int s = read_acquire(&y);
write_relaxed(&x, 1);

then r = s = 1 would not be allowed. And presumably the same object
code would be generated for P0 either way, particularly if P1 was in a
separate compilation unit (link-time optimization notwithstanding).
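
A C11 sketch of that variant, reusing the hypothetical atomic_int x and y
from the earlier sketch, would be:

#include <stdatomic.h>

extern atomic_int x, y;		/* same hypothetical variables as before */

void p1_acquire(void)
{
	int s = atomic_load_explicit(&y, memory_order_acquire);
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	(void)s;		/* r == s == 1 is now forbidden: if s == 1,
				   P0's read of x happens before this write
				   and so cannot observe it */
}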

> Btw, how to proceed for your SRCU patch and this one?
> Are you planning to make any changes? I think the version you have is ok if
> you don't think the patch is improved by anything I brought up.

I don't see any need to change the SRCU patch at this point, other than
to improve the attributions.

> Any additional concerns/changes for this patch?

It should give the same data-race diagnostics as the current LKMM. This
probably means the patch will need to punch up the definitions of
*-pre-bounded and *-post-bounded, unless you can think of a better
approach.

Alan

2023-02-01 10:39:17

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tools/memory-model: Make ppo a subrelation of po



On 1/31/2023 5:55 PM, Alan Stern wrote:
> On Tue, Jan 31, 2023 at 04:33:25PM +0100, Jonas Oberhauser wrote:
>>
>> More precisely, I would change
>>
>>> For each other CPU C', any store which propagates to C before
>>> a release fence is executed (including all po-earlier
>>> stores executed on C) is forced to propagate to C' before the
>>> store associated with the release fence does.
>> Into something like
>>
>>
>>      For each other CPU C', any *external* store which propagates to C
>>      before a release fence is executed, as well as any po-earlier
>>      store executed on C, is forced to propagate to C' before the
>>      store associated with the release fence does.
>>
>> The difference is that po-later stores that happen to propagate to C before
>> the release fence is executed would no longer be ordered.
>> That should be consistent with the axiomatic model.
> I had to check that it wouldn't affect the (xbstar & int) part of vis,
> but it looks all right. This seems like a reasonable change.
>
> However, it only fixes part of the problem. Suppose an external write
> is read by an instruction po-after the release-store, but the read
> executes before the release-store. The operational model would still
> say the external write has to obey the propagation ordering, whereas the
> formal model doesn't require it.


Ouch. I had only thought about the [W];(xbstar & int);[Release] case,
but there's also the rfe;(xbstar & int);[Release] case....


>> Any additional concerns/changes for this patch?
> It should give the same data-race diagnostics as the current LKMM. This
> probably means the patch will need to punch up the definitions of
> *-pre-bounded and *-post-bounded, unless you can think of a better
> approach.
>
> Alan

Seems the one-thing-per-patch mentality hasn't yet become ingrained in me.
Thanks!

jonas