2023-02-24 13:49:18

by Jonas Oberhauser

Subject: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

As stated in the documentation and implied by its name, the ppo
(preserved program order) relation is intended to link po-earlier
to po-later instructions under certain conditions. However, a
corner case currently allows instructions to be linked by ppo that
are not executed by the same thread, i.e., instructions are being
linked that have no po relation.

This happens due to the mb/strong-fence/fence relations, which (as
one case) provide order when locks are passed between threads
followed by an smp_mb__after_unlock_lock() fence. This is
illustrated in the following litmus test (as can be seen when using
herd7 with `doshow ppo`):

P0(int *x, int *y)
{
	spin_lock(x);
	spin_unlock(x);
}

P1(int *x, int *y)
{
	spin_lock(x);
	smp_mb__after_unlock_lock();
	*y = 1;
}

The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
P0 passes a lock to P1 which then uses this fence.
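
For reference, a complete litmus file for this example might look as
follows; the test name, the spinlock_t typing of the lock, and the
trivial exists clause are illustrative additions:

C after-unlock-lock-ppo

{}

P0(spinlock_t *x, int *y)
{
	spin_lock(x);
	spin_unlock(x);
}

P1(spinlock_t *x, int *y)
{
	spin_lock(x);
	smp_mb__after_unlock_lock();
	*y = 1;
}

exists (y=1)

The ppo edges can then be displayed with something like
"herd7 -conf linux-kernel.cfg -doshow ppo -gv after-unlock-lock-ppo.litmus"
(assuming linux-kernel.cfg and the model files are on herd7's search
path).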

The patch makes ppo a subrelation of po by letting fence contribute
to ppo only in case the fence links events of the same thread.

Signed-off-by: Jonas Oberhauser <[email protected]>
---
tools/memory-model/linux-kernel.cat | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index cfc1b8fd46da..adf3c4f41229 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
+let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)

(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = (rfe ; [Marked])? ; r
--
2.17.1



2023-02-24 15:32:49

by Alan Stern

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions. However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
>
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence. This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
>
> P0(int *x, int *y)
> {
> 	spin_lock(x);
> 	spin_unlock(x);
> }
>
> P1(int *x, int *y)
> {
> 	spin_lock(x);
> 	smp_mb__after_unlock_lock();
> 	*y = 1;
> }
>
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
>
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
>
> Signed-off-by: Jonas Oberhauser <[email protected]>
> ---
> tools/memory-model/linux-kernel.cat | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = (rfe ; [Marked])? ; r

Acked-by: Alan Stern <[email protected]>

2023-02-24 18:38:03

by Paul E. McKenney

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > As stated in the documentation and implied by its name, the ppo
> > (preserved program order) relation is intended to link po-earlier
> > to po-later instructions under certain conditions. However, a
> > corner case currently allows instructions to be linked by ppo that
> > are not executed by the same thread, i.e., instructions are being
> > linked that have no po relation.
> >
> > This happens due to the mb/strong-fence/fence relations, which (as
> > one case) provide order when locks are passed between threads
> > followed by an smp_mb__after_unlock_lock() fence. This is
> > illustrated in the following litmus test (as can be seen when using
> > herd7 with `doshow ppo`):
> >
> > P0(int *x, int *y)
> > {
> > 	spin_lock(x);
> > 	spin_unlock(x);
> > }
> >
> > P1(int *x, int *y)
> > {
> > 	spin_lock(x);
> > 	smp_mb__after_unlock_lock();
> > 	*y = 1;
> > }
> >
> > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > P0 passes a lock to P1 which then uses this fence.
> >
> > The patch makes ppo a subrelation of po by letting fence contribute
> > to ppo only in case the fence links events of the same thread.
> >
> > Signed-off-by: Jonas Oberhauser <[email protected]>
> > ---
> > tools/memory-model/linux-kernel.cat | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index cfc1b8fd46da..adf3c4f41229 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> > let overwrite = co | fr
> > let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> >
> > (* Propagation: Ordering from release operations and strong fences. *)
> > let A-cumul(r) = (rfe ; [Marked])? ; r
>
> Acked-by: Alan Stern <[email protected]>

Queued for the v6.4 merge window (not the current one), thank you both!

Thanx, Paul

2023-02-26 01:02:58

by Paul E. McKenney

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Fri, Feb 24, 2023 at 10:37:58AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > > As stated in the documentation and implied by its name, the ppo
> > > (preserved program order) relation is intended to link po-earlier
> > > to po-later instructions under certain conditions. However, a
> > > corner case currently allows instructions to be linked by ppo that
> > > are not executed by the same thread, i.e., instructions are being
> > > linked that have no po relation.
> > >
> > > This happens due to the mb/strong-fence/fence relations, which (as
> > > one case) provide order when locks are passed between threads
> > > followed by an smp_mb__after_unlock_lock() fence. This is
> > > illustrated in the following litmus test (as can be seen when using
> > > herd7 with `doshow ppo`):
> > >
> > > P0(int *x, int *y)
> > > {
> > > 	spin_lock(x);
> > > 	spin_unlock(x);
> > > }
> > >
> > > P1(int *x, int *y)
> > > {
> > > 	spin_lock(x);
> > > 	smp_mb__after_unlock_lock();
> > > 	*y = 1;
> > > }
> > >
> > > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > > P0 passes a lock to P1 which then uses this fence.
> > >
> > > The patch makes ppo a subrelation of po by letting fence contribute
> > > to ppo only in case the fence links events of the same thread.
> > >
> > > Signed-off-by: Jonas Oberhauser <[email protected]>
> > > ---
> > > tools/memory-model/linux-kernel.cat | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > > index cfc1b8fd46da..adf3c4f41229 100644
> > > --- a/tools/memory-model/linux-kernel.cat
> > > +++ b/tools/memory-model/linux-kernel.cat
> > > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> > > let overwrite = co | fr
> > > let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > > let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > >
> > > (* Propagation: Ordering from release operations and strong fences. *)
> > > let A-cumul(r) = (rfe ; [Marked])? ; r
> >
> > Acked-by: Alan Stern <[email protected]>
>
> Queued for the v6.4 merge window (not the current one), thank you both!

I tested both Alan's and Jonas's commits. They do not seem to produce
any significant differences in behavior, which is of course a good thing.

Here are the differences and a few oddities:

auto/C-RR-G+RR-R+RR-G+RR-G+RR-R+RR-R+RR-R+RR-R.litmus

Timed out with changes, completed without them. But it completed
in 558.29 seconds against a limit of 600 seconds, so never mind.

auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus

Timed out with changes, completed without them. But it completed
in 580.01 seconds against a limit of 600 seconds, so never mind. *

auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-R+RR-R.litmus

Timed out with changes, completed without them. But it completed
in 522.29 seconds against a limit of 600 seconds, so never mind.

auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-G+RR-R.litmus

Timed out with changes, completed without them. But it completed
in 588.70 seconds against a limit of 600 seconds, so never mind.

All tests that didn't time out matched Results comments.

The reason I am so cavalier about the times is that I was foolishly
running rcutorture concurrently with the new-version testing. I re-ran
them, and of those, only auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
timed out the second time. I re-ran it again, but without a time limit,
and it completed properly in 364.8 seconds compared to 580. A rerun
took 360.1 seconds. So things have slowed down a bit.

A few other oddities:

litmus/auto/C-LB-Lww+R-OC.litmus

Both versions flag a data race, which I am not seeing. It appears
to me that P1's store to u0 cannot happen unless P0's store
has completed. So what am I missing here?

litmus/auto/C-LB-Lrw+R-OC.litmus
litmus/auto/C-LB-Lww+R-Oc.litmus
litmus/auto/C-LB-Lrw+R-Oc.litmus
litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
litmus/auto/C-LB-Lww+R-A+R-OC.litmus

Ditto. (There are likely more.)

Thoughts?

Thanx, Paul

2023-02-26 02:29:57

by Alan Stern

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> A few other oddities:
>
> litmus/auto/C-LB-Lww+R-OC.litmus
>
> Both versions flag a data race, which I am not seeing. It appears
> to me that P1's store to u0 cannot happen unless P0's store
> has completed. So what am I missing here?

The LKMM doesn't believe that a control or data dependency orders a
plain write after a marked read. Hence in this test it thinks that P1's
store to u0 can happen before the load of x1. I don't remember why we
did it this way -- probably we just wanted to minimize the restrictions
on when plain accesses can execute. (I do remember the reason for
making address dependencies induce order; it was so RCU would work.)
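
To make the pattern concrete, here is a minimal sketch in the same
spirit (an illustrative test, not the actual C-LB-Lww+R-OC.litmus,
whose source isn't quoted in this thread):

C lb-marked-read-plain-write

{}

P0(int *x, int *y)
{
	int r0;

	r0 = READ_ONCE(*y);
	if (r0)
		WRITE_ONCE(*x, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 1; /* plain write after a marked read */
}

exists (0:r0=1 /\ 1:r1=1)

Because the control dependency in P1 does not order the plain write,
the LKMM allows the exists clause and flags a data race between P1's
plain write and P0's marked read of *y.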

The patch below will change what the LKMM believes. It eliminates the
positive outcome of the litmus test and the data race. Should it be
adopted into the memory model?

> litmus/auto/C-LB-Lrw+R-OC.litmus
> litmus/auto/C-LB-Lww+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> litmus/auto/C-LB-Lww+R-A+R-OC.litmus
>
> Ditto. (There are likely more.)

I haven't looked at these but they're probably similar.

Alan



--- usb-devel.orig/tools/memory-model/linux-kernel.cat
+++ usb-devel/tools/memory-model/linux-kernel.cat
@@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
((strong-fence ; [Marked] ; xbstar) | (xbstar & int))

(* Boundaries for lifetimes of plain accesses *)
-let w-pre-bounded = [Marked] ; (addr | fence)?
+let w-pre-bounded = [Marked] ; (rwdep | fence)?
let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
let w-post-bounded = fence? ; [Marked] ; rmw-sequence

2023-02-26 02:58:56

by Paul E. McKenney

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 24, 2023 at 10:37:58AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 24, 2023 at 10:32:43AM -0500, Alan Stern wrote:
> > > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > > > As stated in the documentation and implied by its name, the ppo
> > > > (preserved program order) relation is intended to link po-earlier
> > > > to po-later instructions under certain conditions. However, a
> > > > corner case currently allows instructions to be linked by ppo that
> > > > are not executed by the same thread, i.e., instructions are being
> > > > linked that have no po relation.
> > > >
> > > > This happens due to the mb/strong-fence/fence relations, which (as
> > > > one case) provide order when locks are passed between threads
> > > > followed by an smp_mb__after_unlock_lock() fence. This is
> > > > illustrated in the following litmus test (as can be seen when using
> > > > herd7 with `doshow ppo`):
> > > >
> > > > P0(int *x, int *y)
> > > > {
> > > > 	spin_lock(x);
> > > > 	spin_unlock(x);
> > > > }
> > > >
> > > > P1(int *x, int *y)
> > > > {
> > > > 	spin_lock(x);
> > > > 	smp_mb__after_unlock_lock();
> > > > 	*y = 1;
> > > > }
> > > >
> > > > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > > > P0 passes a lock to P1 which then uses this fence.
> > > >
> > > > The patch makes ppo a subrelation of po by letting fence contribute
> > > > to ppo only in case the fence links events of the same thread.
> > > >
> > > > Signed-off-by: Jonas Oberhauser <[email protected]>
> > > > ---
> > > > tools/memory-model/linux-kernel.cat | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > > > index cfc1b8fd46da..adf3c4f41229 100644
> > > > --- a/tools/memory-model/linux-kernel.cat
> > > > +++ b/tools/memory-model/linux-kernel.cat
> > > > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> > > > let overwrite = co | fr
> > > > let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > > > let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > > > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > > > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > > >
> > > > (* Propagation: Ordering from release operations and strong fences. *)
> > > > let A-cumul(r) = (rfe ; [Marked])? ; r
> > >
> > > Acked-by: Alan Stern <[email protected]>
> >
> > Queued for the v6.4 merge window (not the current one), thank you both!
>
> I tested both Alan's and Jonas's commits. They do not seem to produce
> any significant differences in behavior, which is of course a good thing.
>
> Here are the differences and a few oddities:
>
> auto/C-RR-G+RR-R+RR-G+RR-G+RR-R+RR-R+RR-R+RR-R.litmus
>
> Timed out with changes, completed without them. But it completed
> in 558.29 seconds against a limit of 600 seconds, so never mind.
>
> auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
>
> Timed out with changes, completed without them. But it completed
> in 580.01 seconds against a limit of 600 seconds, so never mind. *
>
> auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-R+RR-R.litmus
>
> Timed out with changes, completed without them. But it completed
> in 522.29 seconds against a limit of 600 seconds, so never mind.
>
> auto/C-RR-G+RR-R+RR-R+RR-R+RR-R+RR-G+RR-G+RR-R.litmus
>
> Timed out with changes, completed without them. But it completed
> in 588.70 seconds against a limit of 600 seconds, so never mind.
>
> All tests that didn't time out matched Results comments.
>
> The reason I am so cavalier about the times is that I was foolishly
> running rcutorture concurrently with the new-version testing. I re-ran
> them, and of those, only auto/C-RR-G+RR-R+RR-R+RR-G+RR-R+RR-R+RR-G+RR-R.litmus
> timed out the second time. I re-ran it again, but without a time limit,
> and it completed properly in 364.8 seconds compared to 580. A rerun
> took 360.1 seconds. So things have slowed down a bit.
>
> A few other oddities:
>
> litmus/auto/C-LB-Lww+R-OC.litmus
>
> Both versions flag a data race, which I am not seeing. It appears
> to me that P1's store to u0 cannot happen unless P0's store
> has completed. So what am I missing here?
>
> litmus/auto/C-LB-Lrw+R-OC.litmus
> litmus/auto/C-LB-Lww+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-Oc.litmus
> litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> litmus/auto/C-LB-Lww+R-A+R-OC.litmus
>
> Ditto. (There are likely more.)
>
> Thoughts?

And what happened here was that I conflated LKMM with the C++ memory
model, producing something stronger than either.

Never mind!!!

Thanx, Paul

2023-02-26 03:03:19

by Paul E. McKenney

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > A few other oddities:
> >
> > litmus/auto/C-LB-Lww+R-OC.litmus
> >
> > Both versions flag a data race, which I am not seeing. It appears
> > to me that P1's store to u0 cannot happen unless P0's store
> > has completed. So what am I missing here?
>
> The LKMM doesn't believe that a control or data dependency orders a
> plain write after a marked read. Hence in this test it thinks that P1's
> store to u0 can happen before the load of x1. I don't remember why we
> did it this way -- probably we just wanted to minimize the restrictions
> on when plain accesses can execute. (I do remember the reason for
> making address dependencies induce order; it was so RCU would work.)
>
> The patch below will change what the LKMM believes. It eliminates the
> positive outcome of the litmus test and the data race. Should it be
> adopted into the memory model?

Excellent question!

As noted separately, I was conflating the C++ memory model and LKMM.

> > litmus/auto/C-LB-Lrw+R-OC.litmus
> > litmus/auto/C-LB-Lww+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> >
> > Ditto. (There are likely more.)
>
> I haven't looked at these but they're probably similar.

Let me give this patch a go and see what it does.

Thanx, Paul

> Alan
>
>
>
> --- usb-devel.orig/tools/memory-model/linux-kernel.cat
> +++ usb-devel/tools/memory-model/linux-kernel.cat
> @@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
> ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
>
> (* Boundaries for lifetimes of plain accesses *)
> -let w-pre-bounded = [Marked] ; (addr | fence)?
> +let w-pre-bounded = [Marked] ; (rwdep | fence)?
> let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
> ([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
> let w-post-bounded = fence? ; [Marked] ; rmw-sequence

2023-02-26 03:09:44

by Boqun Feng

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > A few other oddities:
> >
> > litmus/auto/C-LB-Lww+R-OC.litmus
> >
> > Both versions flag a data race, which I am not seeing. It appears
> > to me that P1's store to u0 cannot happen unless P0's store
> > has completed. So what am I missing here?
>
> The LKMM doesn't believe that a control or data dependency orders a
> plain write after a marked read. Hence in this test it thinks that P1's
> store to u0 can happen before the load of x1. I don't remember why we
> did it this way -- probably we just wanted to minimize the restrictions
> on when plain accesses can execute. (I do remember the reason for
> making address dependencies induce order; it was so RCU would work.)
>

Because a plain store can be optimized into a "store only if not equal"?
As in the following sentences from explanation.txt:

The need to distinguish between r- and w-bounding raises yet another
issue. When the source code contains a plain store, the compiler is
allowed to put plain loads of the same location into the object code.
For example, given the source code:

	x = 1;

the compiler is theoretically allowed to generate object code that
looks like:

	if (x != 1)
		x = 1;

thereby adding a load (and possibly replacing the store entirely).
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

Regards,
Boqun

> The patch below will change what the LKMM believes. It eliminates the
> positive outcome of the litmus test and the data race. Should it be
> adopted into the memory model?
>
> > litmus/auto/C-LB-Lrw+R-OC.litmus
> > litmus/auto/C-LB-Lww+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> >
> > Ditto. (There are likely more.)
>
> I haven't looked at these but they're probably similar.
>
> Alan
>
>
>
> --- usb-devel.orig/tools/memory-model/linux-kernel.cat
> +++ usb-devel/tools/memory-model/linux-kernel.cat
> @@ -172,7 +172,7 @@ let vis = cumul-fence* ; rfe? ; [Marked]
> ((strong-fence ; [Marked] ; xbstar) | (xbstar & int))
>
> (* Boundaries for lifetimes of plain accesses *)
> -let w-pre-bounded = [Marked] ; (addr | fence)?
> +let w-pre-bounded = [Marked] ; (rwdep | fence)?
> let r-pre-bounded = [Marked] ; (addr | nonrw-fence |
> ([R4rmb] ; fencerel(Rmb) ; [~Noreturn]))?
> let w-post-bounded = fence? ; [Marked] ; rmw-sequence

2023-02-26 03:30:18

by Alan Stern

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > A few other oddities:
> > >
> > > litmus/auto/C-LB-Lww+R-OC.litmus
> > >
> > > Both versions flag a data race, which I am not seeing. It appears
> > > to me that P1's store to u0 cannot happen unless P0's store
> > > has completed. So what am I missing here?
> >
> > The LKMM doesn't believe that a control or data dependency orders a
> > plain write after a marked read. Hence in this test it thinks that P1's
> > store to u0 can happen before the load of x1. I don't remember why we
> > did it this way -- probably we just wanted to minimize the restrictions
> > on when plain accesses can execute. (I do remember the reason for
> > making address dependencies induce order; it was so RCU would work.)
> >
>
> Because a plain store can be optimized into a "store only if not equal"?
> As in the following sentences from explanation.txt:
>
> The need to distinguish between r- and w-bounding raises yet another
> issue. When the source code contains a plain store, the compiler is
> allowed to put plain loads of the same location into the object code.
> For example, given the source code:
>
> 	x = 1;
>
> the compiler is theoretically allowed to generate object code that
> looks like:
>
> 	if (x != 1)
> 		x = 1;
>
> thereby adding a load (and possibly replacing the store entirely).
> For this reason, whenever the LKMM requires a plain store to be
> w-pre-bounded or w-post-bounded by a marked access, it also requires
> the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> where the compiler adds a load.

Good guess; maybe that was the reason. Ironically, in this case the
store _is_ r-pre-bounded, because there's an smp_rmb() in the litmus
test just before it. So perhaps the original reason is not valid now
that the memory model explicitly includes tests for stores being
r-pre/post-bounded.
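
That is, the relevant part of P1 presumably has this shape (a sketch;
the test itself isn't quoted in this thread):

	r1 = READ_ONCE(*x1);
	smp_rmb();
	*u0 = 1; /* plain store */

Any load the compiler manufactures for the plain store sits after the
smp_rmb() and is therefore ordered after the marked read, which is
exactly what r-pre-bounded captures.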

Alan

2023-02-26 11:18:28

by Jonas Oberhauser

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/26/2023 4:30 AM, Alan Stern wrote:
> On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
>> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
>>> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
>>>> A few other oddities:
>>>>
>>>> litmus/auto/C-LB-Lww+R-OC.litmus
>>>>
>>>> Both versions flag a data race, which I am not seeing. It appears
>>>> to me that P1's store to u0 cannot happen unless P0's store
>>>> has completed. So what am I missing here?
>>> The LKMM doesn't believe that a control or data dependency orders a
>>> plain write after a marked read. Hence in this test it thinks that P1's
>>> store to u0 can happen before the load of x1. I don't remember why we
>>> did it this way -- probably we just wanted to minimize the restrictions
>>> on when plain accesses can execute. (I do remember the reason for
>>> making address dependencies induce order; it was so RCU would work.)
>>>
>> Because a plain store can be optimized into a "store only if not equal"?
>> As in the following sentences from explanation.txt:
>>
>> The need to distinguish between r- and w-bounding raises yet another
>> issue. When the source code contains a plain store, the compiler is
>> allowed to put plain loads of the same location into the object code.
>> For example, given the source code:
>>
>> 	x = 1;
>>
>> the compiler is theoretically allowed to generate object code that
>> looks like:
>>
>> 	if (x != 1)
>> 		x = 1;
>>
>> thereby adding a load (and possibly replacing the store entirely).
>> For this reason, whenever the LKMM requires a plain store to be
>> w-pre-bounded or w-post-bounded by a marked access, it also requires
>> the store to be r-pre-bounded or r-post-bounded, so as to handle cases
>> where the compiler adds a load.
> Good guess; maybe that was the reason. [...]
> So perhaps the original reason is not valid now
> that the memory model explicitly includes tests for stores being
> r-pre/post-bounded.
>
> Alan

I agree, I think you could relax that condition.

Note there's also rw-xbstar (used with fr), which doesn't check for
r-pre-bounded, but it should be ok. That's because only reads would be
unordered; as a result, the read (in the "if (x != ..) x = ..") should
provide the correct value. The store would be issued as necessary, and
the issued store would still be ordered correctly w.r.t. the read.

Best wishes,
jonas


2023-02-26 16:23:49

by Joel Fernandes

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions. However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
>
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence. This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
>
> P0(int *x, int *y)
> {
> 	spin_lock(x);
> 	spin_unlock(x);
> }
>
> P1(int *x, int *y)
> {
> 	spin_lock(x);
> 	smp_mb__after_unlock_lock();
> 	*y = 1;
> }
>
> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
>
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
>
> Signed-off-by: Jonas Oberhauser <[email protected]>
> ---
> tools/memory-model/linux-kernel.cat | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)

Alternatively, could it be the following appended diff? It requires only
a single 'int' in ->ppo and prevents future similar issues caused by
subrelations. It also makes clear that ->ppo can only be CPU-internal.

Or would that not work for some reason? For the test you shared at least, the
graphs are the same.

Either way:

Tested-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>

---8<-----------------------

diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 07f884f9b2bf..63052d1628e9 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
let overwrite = co | fr
let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
let to-r = addr | (dep ; [Marked] ; rfi)
-let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
+let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int

(* Propagation: Ordering from release operations and strong fences. *)
let A-cumul(r) = (rfe ; [Marked])? ; r

2023-02-26 16:51:21

by Alan Stern

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
>
>
> On 2/26/2023 4:30 AM, Alan Stern wrote:
> > On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> > > On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > > > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > > > A few other oddities:
> > > > >
> > > > > litmus/auto/C-LB-Lww+R-OC.litmus
> > > > >
> > > > > Both versions flag a data race, which I am not seeing. It appears
> > > > > to me that P1's store to u0 cannot happen unless P0's store
> > > > > has completed. So what am I missing here?
> > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > plain write after a marked read. Hence in this test it thinks that P1's
> > > > store to u0 can happen before the load of x1. I don't remember why we
> > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > on when plain accesses can execute. (I do remember the reason for
> > > > making address dependencies induce order; it was so RCU would work.)
> > > >
> > > Because a plain store can be optimized into a "store only if not equal"?
> > > As in the following sentences from explanation.txt:
> > >
> > > The need to distinguish between r- and w-bounding raises yet another
> > > issue. When the source code contains a plain store, the compiler is
> > > allowed to put plain loads of the same location into the object code.
> > > For example, given the source code:
> > >
> > > 	x = 1;
> > >
> > > the compiler is theoretically allowed to generate object code that
> > > looks like:
> > >
> > > 	if (x != 1)
> > > 		x = 1;
> > >
> > > thereby adding a load (and possibly replacing the store entirely).
> > > For this reason, whenever the LKMM requires a plain store to be
> > > w-pre-bounded or w-post-bounded by a marked access, it also requires
> > > the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> > > where the compiler adds a load.
> > Good guess; maybe that was the reason. [...]
> > So perhaps the original reason is not valid now
> > that the memory model explicitly includes tests for stores being
> > r-pre/post-bounded.
> >
> > Alan
>
> I agree, I think you could relax that condition.

Here's a related question to think about. Suppose a compiler does make
this change, adding a load-and-test in front of a store. Can that load
cause a data race?

Normally I'd say no, because compilers aren't allowed to create data
races where one didn't already exist. But that restriction is part of
the C/C++ standard, and what we consider to be a data race differs from
what the standard considers.

So what's the answer? Is the compiler allowed to translate:

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 1;

into something resembling:

	r1 = READ_ONCE(*x);
	rtemp = *y;
	if (r1) {
		if (rtemp != 1)
			*y = 1;
	}

(Note that whether the load to rtemp occurs inside the "if (r1)"
conditional or not makes no difference; either way the CPU can execute
it before testing the condition. Even before reading the value of *x.)

_If_ we assume that these manufactured loads can never cause a data race
then it should be safe to remove the r-pre/post-bounded tests for plain
writes.

But what if rtemp reads from a plain write that was torn, and the
intermediate value it observes happens to be 1, even though neither the
initial nor the final value of *y was 1?

> Note there's also rw-xbstar (used with fr), which doesn't check for
> r-pre-bounded, but it should be ok. That's because only reads would be
> unordered; as a result, the read (in the "if (x != ..) x = ..") should
> provide the correct value. The store would be issued as necessary, and
> the issued store would still be ordered correctly w.r.t. the read.

That isn't the reason I left r-pre-bounded out from rw-xbstar. If the
write gets changed to a read there's no need for rw-xbstar to check
r-pre-bounded, because then rw-race would be comparing a read with
another read (instead of with a write) and so there would be no
possibility of a race in any case.

Alan

2023-02-26 18:45:37

by Paul E. McKenney

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sun, Feb 26, 2023 at 11:51:15AM -0500, Alan Stern wrote:
> On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
> > On 2/26/2023 4:30 AM, Alan Stern wrote:
> > > On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
> > > > On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > > > > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > > > > A few other oddities:
> > > > > >
> > > > > > litmus/auto/C-LB-Lww+R-OC.litmus
> > > > > >
> > > > > > Both versions flag a data race, which I am not seeing. It appears
> > > > > > to me that P1's store to u0 cannot happen unless P0's store
> > > > > > has completed. So what am I missing here?
> > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > plain write after a marked read. Hence in this test it thinks that P1's
> > > > > store to u0 can happen before the load of x1. I don't remember why we
> > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > on when plain accesses can execute. (I do remember the reason for
> > > > > making address dependencies induce order; it was so RCU would work.)
> > > > >
> > > > Because a plain store can be optimized into a "store only if not equal"?
> > > > As in the following sentences from explanation.txt:
> > > >
> > > > The need to distinguish between r- and w-bounding raises yet another
> > > > issue. When the source code contains a plain store, the compiler is
> > > > allowed to put plain loads of the same location into the object code.
> > > > For example, given the source code:
> > > >
> > > > 	x = 1;
> > > >
> > > > the compiler is theoretically allowed to generate object code that
> > > > looks like:
> > > >
> > > > 	if (x != 1)
> > > > 		x = 1;
> > > >
> > > > thereby adding a load (and possibly replacing the store entirely).
> > > > For this reason, whenever the LKMM requires a plain store to be
> > > > w-pre-bounded or w-post-bounded by a marked access, it also requires
> > > > the store to be r-pre-bounded or r-post-bounded, so as to handle cases
> > > > where the compiler adds a load.
> > > Good guess; maybe that was the reason. [...]
> > > So perhaps the original reason is not valid now
> > > that the memory model explicitly includes tests for stores being
> > > r-pre/post-bounded.
> > >
> > > Alan
> >
> > I agree, I think you could relax that condition.
>
> Here's a related question to think about. Suppose a compiler does make
> this change, adding a load-and-test in front of a store. Can that load
> cause a data race?
>
> Normally I'd say no, because compilers aren't allowed to create data
> races where one didn't already exist. But that restriction is part of
> the C/C++ standard, and what we consider to be a data race differs from
> what the standard considers.
>
> So what's the answer? Is the compiler allowed to translate:
>
> 	r1 = READ_ONCE(*x);
> 	if (r1)
> 		*y = 1;
>
> into something resembling:
>
> 	r1 = READ_ONCE(*x);
> 	rtemp = *y;
> 	if (r1) {
> 		if (rtemp != 1)
> 			*y = 1;
> 	}
>
> (Note that whether the load to rtemp occurs inside the "if (r1)"
> conditional or not makes no difference; either way the CPU can execute
> it before testing the condition. Even before reading the value of *x.)
>
> _If_ we assume that these manufactured loads can never cause a data race
> then it should be safe to remove the r-pre/post-bounded tests for plain
> writes.
>
> But what if rtemp reads from a plain write that was torn, and the
> intermediate value it observes happens to be 1, even though neither the
> initial nor the final value of *y was 1?

I am not worried about compilers creating data races, so that assignment
to rtemp would be within the "if (r1)" statement. Not that this matters,
as you say, from a hardware ordering viewpoint.

However, tearing is a concern. Just to make sure I understand, one
scenario might be where the initial value of *y was zero and the final
value was 0x10001, correct? In that case, we have seen compilers that
would write that constant 16 bits at a time, resulting in a transitory
value of 0x1.
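
For instance, on a little-endian machine the generated code might
amount to something like this (purely illustrative):

	/* *y = 0x10001; emitted as two 16-bit halves */
	*(u16 *)y = 0x0001;		/* *y transiently holds 0x1 */
	*((u16 *)y + 1) = 0x0001;	/* *y now holds 0x10001 */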

But in this case, we would need the value to -not- be 1 for bad things
to happen, correct?

And in that case, we would need the value to initially be 1 and the
desired value to be 1 and some other store to redundantly set it to
1, but tear in such a way that the transitory value is not 1, correct?
Plus we should detect the data race in that case, no?

Or am I missing yet another opportunity for a mischievous compiler?

> > Note there's also rw-xbstar (used with fr), which doesn't check for
> > r-pre-bounded, but it should be ok. That's because only reads would be
> > unordered; as a result, the read (in the "if (x != ..) x = ..") should
> > provide the correct value. The store would be issued as necessary, and
> > the issued store would still be ordered correctly w.r.t. the read.
>
> That isn't the reason I left r-pre-bounded out from rw-xbstar. If the
> write gets changed to a read there's no need for rw-xbstar to check
> r-pre-bounded, because then rw-race would be comparing a read with
> another read (instead of with a write) and so there would be no
> possibility of a race in any case.

True, and if there was a racing write, it would be a data race in
any case.

Thanx, Paul

2023-02-26 18:49:31

by Paul E. McKenney

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sat, Feb 25, 2023 at 07:03:11PM -0800, Paul E. McKenney wrote:
> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
> > On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
> > > A few other oddities:
> > >
> > > litmus/auto/C-LB-Lww+R-OC.litmus
> > >
> > > Both versions flag a data race, which I am not seeing. It appears
> > > to me that P1's store to u0 cannot happen unless P0's store
> > > has completed. So what am I missing here?
> >
> > The LKMM doesn't believe that a control or data dependency orders a
> > plain write after a marked read. Hence in this test it thinks that P1's
> > store to u0 can happen before the load of x1. I don't remember why we
> > did it this way -- probably we just wanted to minimize the restrictions
> > on when plain accesses can execute. (I do remember the reason for
> > making address dependencies induce order; it was so RCU would work.)
> >
> > The patch below will change what the LKMM believes. It eliminates the
> > positive outcome of the litmus test and the data race. Should it be
> > adopted into the memory model?
>
> Excellent question!
>
> As noted separately, I was conflating the C++ memory model and LKMM.
>
> > > litmus/auto/C-LB-Lrw+R-OC.litmus
> > > litmus/auto/C-LB-Lww+R-Oc.litmus
> > > litmus/auto/C-LB-Lrw+R-Oc.litmus
> > > litmus/auto/C-LB-Lrw+R-A+R-Oc.litmus
> > > litmus/auto/C-LB-Lww+R-A+R-OC.litmus
> > >
> > > Ditto. (There are likely more.)
> >
> > I haven't looked at these but they're probably similar.
>
> Let me give this patch a go and see what it does.

And it operates as expected, converting Sometimes/data-race results
into Never.

Leaving the question of whether that is what we need. ;-)

Thanx, Paul

2023-02-26 19:32:41

by Alan Stern

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Sun, Feb 26, 2023 at 10:45:28AM -0800, Paul E. McKenney wrote:
> On Sun, Feb 26, 2023 at 11:51:15AM -0500, Alan Stern wrote:
> > Here's a related question to think about. Suppose a compiler does make
> > this change, adding a load-and-test in front of a store. Can that load
> > cause a data race?
> >
> > Normally I'd say no, because compilers aren't allowed to create data
> > races where one didn't already exist. But that restriction is part of
> > the C/C++ standard, and what we consider to be a data race differs from
> > what the standard considers.
> >
> > So what's the answer? Is the compiler allowed to translate:
> >
> > 	r1 = READ_ONCE(*x);
> > 	if (r1)
> > 		*y = 1;
> >
> > into something resembling:
> >
> > 	r1 = READ_ONCE(*x);
> > 	rtemp = *y;
> > 	if (r1) {
> > 		if (rtemp != 1)
> > 			*y = 1;
> > 	}
> >
> > (Note that whether the load to rtemp occurs inside the "if (r1)"
> > conditional or not makes no difference; either way the CPU can execute
> > it before testing the condition. Even before reading the value of *x.)
> >
> > _If_ we assume that these manufactured loads can never cause a data race
> > then it should be safe to remove the r-pre/post-bounded tests for plain
> > writes.
> >
> > But what if rtemp reads from a plain write that was torn, and the
> > intermediate value it observes happens to be 1, even though neither the
> > initial nor the final value of *y was 1?
>
> I am not worried about compilers creating data races, so that assignment
> to rtemp would be within the "if (r1)" statement. Not that this matters,
> as you say, from a hardware ordering viewpoint.
>
> However, tearing is a concern. Just to make sure I understand, one
> scenario might be where the initial value of *y was zero and the final
> value was 0x10001, correct? In that case, we have seen compilers that
> would write that constant 16 bits at a time, resulting in a transitory
> value of 0x1.
>
> But in this case, we would need the value to -not- be 1 for bad things
> to happen, correct?
>
> And in that case, we would need the value to initially be 1 and the
> desired value to be 1 and some other store to redundantly set it to
> 1, but tear in such a way that the transitory value is not 1, correct?
> Plus we should detect the data race in that case, no?
>
> Or am I missing yet another opportunity for a mischievous compiler?

Let's try a better example:

P0(int *x, int *y)
{
	*y = 0x10001;
	smp_store_release(x, 1);
}

P1(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	if (r1)
		*y = 1;
}

exists (1:r1=1 /\ y=0x10001)

Assume the compiler translates "*y = 1;" to:

{
	rtemp = *y;
	if (rtemp != 1)
		*y = 1;
}

Then there is nothing preventing P1's CPU from loading *y into rtemp
before it reads *x. If the plain write in P0 is torn then rtemp could
end up getting set to 1, so P1 wouldn't write anything to *y and the
litmus test would succeed.

If the LKMM should think this litmus test has no data race then it
should also think the test will never succeed. But I've just shown how
it might succeed. Ergo the LKMM should say this test has a data race,
and thus we shouldn't remove the r-pre-bounded tests for plain writes.

Of course, this is a separate question from whether w-pre-bounded should
be changed to use rwdep instead of addr.

Alan

2023-02-27 14:04:08

by Jonas Oberhauser

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/26/2023 5:51 PM, Alan Stern wrote:
> On Sun, Feb 26, 2023 at 12:17:31PM +0100, Jonas Oberhauser wrote:
>>
>> On 2/26/2023 4:30 AM, Alan Stern wrote:
>>> On Sat, Feb 25, 2023 at 07:09:05PM -0800, Boqun Feng wrote:
>>>> On Sat, Feb 25, 2023 at 09:29:51PM -0500, Alan Stern wrote:
>>>>> On Sat, Feb 25, 2023 at 05:01:10PM -0800, Paul E. McKenney wrote:
>>>>>> A few other oddities:
>>>>>>
>>>>>> litmus/auto/C-LB-Lww+R-OC.litmus
>>>>>>
>>>>>> Both versions flag a data race, which I am not seeing. It appears
>>>>>> to me that P1's store to u0 cannot happen unless P0's store
>>>>>> has completed. So what am I missing here?
>>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>>> plain write after a marked read. Hence in this test it thinks that P1's
>>>>> store to u0 can happen before the load of x1. I don't remember why we
>>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>>> on when plain accesses can execute. (I do remember the reason for
>>>>> making address dependencies induce order; it was so RCU would work.)
>>>>>
>>>> Because a plain store can be optimized into a "store only if not equal"?
>>>> As in the following sentences from explanation.txt:
>>>>
>>>> The need to distinguish between r- and w-bounding raises yet another
>>>> issue. When the source code contains a plain store, the compiler is
>>>> allowed to put plain loads of the same location into the object code.
>>>> For example, given the source code:
>>>>
>>>> 	x = 1;
>>>>
>>>> the compiler is theoretically allowed to generate object code that
>>>> looks like:
>>>>
>>>> 	if (x != 1)
>>>> 		x = 1;
>>>>
>>>> thereby adding a load (and possibly replacing the store entirely).
>>>> For this reason, whenever the LKMM requires a plain store to be
>>>> w-pre-bounded or w-post-bounded by a marked access, it also requires
>>>> the store to be r-pre-bounded or r-post-bounded, so as to handle cases
>>>> where the compiler adds a load.
>>> Good guess; maybe that was the reason. [...]
>>> So perhaps the original reason is not valid now
>>> that the memory model explicitly includes tests for stores being
>>> r-pre/post-bounded.
>>>
>>> Alan
>> I agree, I think you could relax that condition.
> Here's a related question to think about. Suppose a compiler does make
> this change, adding a load-and-test in front of a store. Can that load
> cause a data race?
>
> Normally I'd say no, because compilers aren't allowed to create data
> races where one didn't already exist.

I don't think that's completely true. At least N1570 says:

    Transformations that introduce a speculative read of a potentially
    shared memory location may not preserve the semantics of the program
    as defined in this standard, since they potentially introduce a data
    race. However, they are typically valid in the context of an
    optimizing compiler that targets a specific machine with well-defined
    semantics for data races.

In fact I vaguely recall certain optimizations that do introduce data
races, like load hoisting; see the sketch below. The issue is that some
optimizations are only valid in the absence of data races, so those
optimizations and the ones that introduce data races can't be applied to
the same portion of code at the same time.
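
The kind of transformation that footnote has in mind might look like
this (purely illustrative):

	/* source */
	while (c) {
		if (READ_ONCE(*flag))
			s += *data;
	}

	/* after hoisting: the plain load of *data has become
	   speculative and may race with a concurrent writer */
	tmp = *data;
	while (c) {
		if (READ_ONCE(*flag))
			s += tmp;
	}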

https://cs.nju.edu.cn/hongjin/teaching/concurrency/03_C11MM_ViktorVafeiadis.pdf

I think in general it's not about whether the compiler introduces races
or not, but whether those races create new behaviors that you don't see
in LKMM.


> But that restriction is part of
> the C/C++ standard, and what we consider to be a data race differs from
> what the standard considers.
>
> So what's the answer? Is the compiler allowed to translate:
>
> 	r1 = READ_ONCE(*x);
> 	if (r1)
> 		*y = 1;
>
> into something resembling:
>
> 	r1 = READ_ONCE(*x);
> 	rtemp = *y;
> 	if (r1) {
> 		if (rtemp != 1)
> 			*y = 1;
> 	}
>
> (Note that whether the load to rtemp occurs inside the "if (r1)"
> conditional or not makes no difference; either way the CPU can execute
> it before testing the condition. Even before reading the value of *x.)
>
> _If_ we assume that these manufactured loads can never cause a data race
> then it should be safe to remove the r-pre/post-bounded tests for plain
> writes.

Note that I don't want to remove the r-pre/post-bounded tests.
What I agreed to is that the restriction to only addr for plain writes
is an overly conservative "r-pre/post-bounded-style" test which is made
redundant by the existence of the actual r-pre/post-bounded test.

>
> But what if rtemp reads from a plain write that was torn, and the
> intermediate value it observes happens to be 1, even though neither the
> initial nor the final value of *y was 1?

Of course, and you don't even need a torn write to cause problems
without requiring r-bounding.
For example, if the other side does *y=1; *y=2; mb(); *x=&y;, then
r1==&y seems to imply that y always ends up as ==1 from a
compiler-oblivious perspective, but because of the data race it could be
==2 instead.


>> Note there's also rw-xbstar (used with fr), which doesn't check for
>> r-pre-bounded, but it should be ok. That's because only reads would be
>> unordered; as a result, the read (in the "if (x != ..) x = ..") should
>> provide the correct value. The store would be issued as necessary, and
>> the issued store would still be ordered correctly w.r.t. the read.
> That isn't the reason I left r-pre-bounded out from rw-xbstar. If the
> write gets changed to a read there's no need for rw-xbstar to check
> r-pre-bounded, because then rw-race would be comparing a read with
> another read (instead of with a write) and so there would be no
> possibility of a race in any case.

That is the first part of my explanation (only reads would be unordered)
but I don't think it's sufficient in general.
Imagine a hypothetical memory model with a release fence that upgrades
only the next memory operation (and only if it is a store) to release (e.g., to
save some opcode design space you do weird_release_fence;str x1 x2
instead of stlr x1 x2).
Then in the message passing pattern

T1 {
   u = a;
   release(&x, 1);
}

T2 {
  t = READ_ONCE(x);
  weird_release_fence();
  a = 1;
}

where T2 is changed by the compiler to

T2 {
  t = READ_ONCE(x);
  weird_release_fence();
  if (a!=1) a = 1;
}

In the specific executions where t==1, there wouldn't be a data race
when just considering the equivalent of rw-xbstar, but u==1 and t==1
would be possible (even though it might not seem so from the first example).

Of course in LKMM there's no such fence, but I think to make the
argument complete you still need to go through every case that provides
the w-pre-bounding and make sure it still provides the necessary order
in the compiler-generated version. (or you can try a more complicated
argument of the form "there would be another execution of the same
program that would have a data race", which works at least for this
example, not sure in general)

Have fun,
jonas


2023-02-27 14:39:53

by Jonas Oberhauser

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/26/2023 5:23 PM, Joel Fernandes wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
>> As stated in the documentation and implied by its name, the ppo
>> (preserved program order) relation is intended to link po-earlier
>> to po-later instructions under certain conditions. However, a
>> corner case currently allows instructions to be linked by ppo that
>> are not executed by the same thread, i.e., instructions are being
>> linked that have no po relation.
>>
>> This happens due to the mb/strong-fence/fence relations, which (as
>> one case) provide order when locks are passed between threads
>> followed by an smp_mb__after_unlock_lock() fence. This is
>> illustrated in the following litmus test (as can be seen when using
>> herd7 with `doshow ppo`):
>>
>> P0(int *x, int *y)
>> {
>> 	spin_lock(x);
>> 	spin_unlock(x);
>> }
>>
>> P1(int *x, int *y)
>> {
>> 	spin_lock(x);
>> 	smp_mb__after_unlock_lock();
>> 	*y = 1;
>> }
>>
>> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
>> P0 passes a lock to P1 which then uses this fence.
>>
>> The patch makes ppo a subrelation of po by letting fence contribute
>> to ppo only in case the fence links events of the same thread.
>>
>> Signed-off-by: Jonas Oberhauser <[email protected]>
>> ---
>> tools/memory-model/linux-kernel.cat | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
>> index cfc1b8fd46da..adf3c4f41229 100644
>> --- a/tools/memory-model/linux-kernel.cat
>> +++ b/tools/memory-model/linux-kernel.cat
>> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
>> let overwrite = co | fr
>> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
>> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
>> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
>> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> Alternatively, could it be the following appended diff? It requires only
> a single 'int' in ->ppo and prevents future similar issues caused by
> subrelations. It also makes clear that ->ppo can only be CPU-internal.

I had thought about going in that direction, but it doesn't prevent
future similar issues, at best makes them less likely.
For example, you could have an xfence that somehow goes back to the
original thread, but to a po-earlier event (e.g., like prop).

Given that to-r and to-w are unlikely to ever become inconsistent
with po, I am not sure it even really helps much.

Personally I'm not too happy with the ad-hoc '& int' because it's like
adding some unused stuff (via ... | unused-stuff) and then cutting it
back out with '& int', unlike prop & int, which has a real semantic
meaning (propagate back to the thread). The fastest move is the one we
avoid making, so I'd rather not add those parts in the first place.

However, fixing the fence relation turned out to be a lot trickier, both
because of the missed data race and because of rmw-sequences;
essentially, I would have had to disambiguate between xfences and fences
already in this patch. So I did this minimal local fix for now, and we
can discuss whether it makes sense to get rid of the '& int' once/if we
have xfence etc.

Best wishes,
jonas

PS:
> ---8<-----------------------

haha that's so clever :D

>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 07f884f9b2bf..63052d1628e9 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> let to-r = addr | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = (rfe ; [Marked])? ; r


2023-02-27 16:16:42

by Alan Stern

Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Mon, Feb 27, 2023 at 03:03:16PM +0100, Jonas Oberhauser wrote:
> Note that I don't want to remove the r-pre/post-bounded tests.
> What I agreed to is that the restriction to only addr for plain writes is an
> overly conservative "r-pre/post-bounded-style" test which is made redundant
> by the existence of the actual r-pre/post-bounded test.

Good, that agrees with what I've been thinking.

> > > Note there's also rw-xbstar (used with fr), which doesn't check for
> > > r-pre-bounded, but it should be ok. That's because only reads would be
> > > unordered; as a result, the read (in the "if (x != ..) x = ..") should
> > > provide the correct value. The store would be issued as necessary, and
> > > the issued store would still be ordered correctly w.r.t. the read.
> > That isn't the reason I left r-pre-bounded out from rw-xbstar. If the
> > write gets changed to a read there's no need for rw-xbstar to check
> > r-pre-bounded, because then rw-race would be comparing a read with
> > another read (instead of with a write) and so there would be no
> > possibility of a race in any case.
>
> That is the first part of my explanation (only reads would be unordered) but

It is? I couldn't tell from what you wrote that this was supposed to
imply we didn't have to worry about a data race.

> I don't think it's sufficient in general.
> Imagine a hypothetical memory model with a release fence that upgrades the
> next memory operation only (and only stores) to release (e.g., to save some
> opcode design space you do weird_release_fence;str x1 x2 instead of stlr x1
> x2).
> Then in the message passing pattern
>
> T1 {
>    u = a;
>    release(&x, 1);
> }
>
> T2 {
>   t = READ_ONCE(&x);
>   weird_release_fence;
>   a = 1;
> }

[That isn't the Message Passing pattern. In the MP pattern, one thread
does two writes and the other thread does two reads. This is the Read
Buffering (RB) pattern: Each thread does a read followed by a write.]

>
> where T2 is changed by the compiler to
>
> T2 {
> ? t = READ_ONCE(&x);
> ? weird_release_fence();
> ? if (a!=1) a = 1;
> }
>
> In the specific executions where t==1, there wouldn't be a data race when
> just considering the equivalent of rw-xbstar, but u==1 and t==1 would be
> possible (even though it might not seem so from the first example).

If such a fence existed in the LKMM, we would handle this case by saying
that weird_release_fence() does not give release semantics to an
immediately following plain store; only to an immediately following
marked store. The reason being that the compiler is allowed to muck
around with the code generated for plain accesses, so there's no
guarantee that the first machine instruction generated for "a = 1;" will
be a store.

As a result, there would not be an rw-xbstar link from T1 to T2.

> Of course in LKMM there's no such fence, but I think to make the argument
> complete you still need to go through every case that provides the
> w-pre-bounding and make sure it still provides the necessary order in the
> compiler-generated version. (or you can try a more complicated argument of
> the form "there would be another execution of the same program that would
> have a data race", which works at least for this example, not sure in
> general)

So I don't see this as a valid argument for not using rw-xbstar in
rw-race. Even theoretically.

Alan

2023-02-27 16:51:03

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/27/2023 5:16 PM, Alan Stern wrote:
> On Mon, Feb 27, 2023 at 03:03:16PM +0100, Jonas Oberhauser wrote:
>
>>>> Note there's also rw-xbstar (used with fr) which doesn't check for
>>>> r-pre-bounded, but it should be ok. That's because only reads would be
>>>> unordered, as a result the read (in the if (x != ..) x=..) should provide
>>>> the correct value. The store would be issued as necessary, and the issued
>>>> store would still be ordered correctly w.r.t the read.
>>> That isn't the reason I left r-pre-bounded out from rw-xbstar. If the
>>> write gets changed to a read there's no need for rw-xbstar to check
>>> r-pre-bounded, because then rw-race would be comparing a read with
>>> another read (instead of with a write) and so there would be no
>>> possibility of a race in any case.
>> That is the first part of my explanation (only reads would be unordered) but
> It is? I couldn't tell from what you wrote that this was supposed to
> imply we didn't have to worry about a data race.

Because it wasn't supposed to imply that, in the sense that by itself,
the fact that the things that are left unordered are reads does not
immediately imply that we don't have to worry about a data race. It's
just a step in the sequence of reasoning. But of course I won't claim
that I laid out that sequence in full enough detail to make sense.

>
>> I don't think it's sufficient in general.
>> Imagine a hypothetical memory model with a release fence that upgrades the
>> next memory operation only (and only stores) to release (e.g., to save some
>> opcode design space you do weird_release_fence;str x1 x2 instead of stlr x1
>> x2).
>> Then in the message passing pattern
>>
>> T1 {
>>    u = a;
>>    release(&x, 1);
>> }
>>
>> T2 {
>>   t = READ_ONCE(&x);
>>   weird_release_fence;
>>   a = 1;
>> }
> [That isn't the Message Passing pattern. In the MP pattern, one thread
> does two writes and the other thread does two reads. This is the Read
> Buffering (RB) pattern: Each thread does a read followed by a write.]

Ah, thanks. Somehow I misremembered MP to be any case of [do stuff on a]
; x = ...  || read that update from x; [do stuff on a].

>
>> where T2 is changed by the compiler to
>>
>> T2 {
>>   t = READ_ONCE(&x);
>>   weird_release_fence();
>>   if (a!=1) a = 1;
>> }
>>
>> In the specific executions where t==1, there wouldn't be a data race when
>> just considering the equivalent of rw-xbstar, but u==1 and t==1 would be
>> possible (even though it might not seem so from the first example).
> If such a fence existed in the LKMM, we would handle this case by saying
> that weird_release_fence() does not give release semantics to an
> immediately following plain store; only to an immediately following
> marked store. The reason being that the compiler is allowed to muck
> around with the code generated for plain accesses, so there's no
> guarantee that the first machine instruction generated for "a = 1;" will
> be a store.
>
> As a result, there would not be an rw-xbstar link from T1 to T2.

Sure, but IMHO a well-designed high-level model like LKMM shouldn't have
such a fence at all :D This is something you might use in the
implementation of a release store in inline assembly; it doesn't belong
in the source code. But it's just for the sake of argument anyway.


>
>> Of course in LKMM there's no such fence, but I think to make the argument
>> complete you still need to go through every case that provides the
>> w-pre-bounding and make sure it still provides the necessary order in the
>> compiler-generated version. (or you can try a more complicated argument of
>> the form "there would be another execution of the same program that would
>> have a data race", which works at least for this example, not sure in
>> general)
> So I don't see this as a valid argument for not using rw-xbstar in
> rw-race. Even theoretically.

There's nothing wrong with using rw-xbstar in rw-race, especially in
current LKMM, and I'm not arguing against that.

I'm saying that the argument
"if rw-xbstar links a read R to a plain write W, and that plain write is
replaced by a read R', and in case R' reads a value different from W,
followed by a store W' (with some dependency from R' to W')  by the
compiler, then the fact that R and R' can't have a data race means that
it's safe to use rw-xbstar in rw-race"
is incomplete. (Of course that doesn't mean the claim is wrong.)
To make the argument complete, you also need that W' is generated if
necessary, and more crucially that W' is still ordered behind R!
Otherwise you would now have a data race between R and W', like in the
hypothetical example I mentioned, even though R and R' don't race.

And if you do that second step in LKMM (even with the change of
w-pre-bounded we are discussing) you quickly find that W' is indeed
still ordered, so rw-xbstar is perfectly fine.

Perhaps that step is so trivial to you that you don't feel it needs
mentioning : ) But speaking about LKMM-like models in general, some
might have some subtle case where rw-xbstar links R and W but would not
R and W'.

jonas


2023-02-27 17:57:29

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Mon, Feb 27, 2023 at 9:39 AM Jonas Oberhauser
<[email protected]> wrote:
>
>
>
> On 2/26/2023 5:23 PM, Joel Fernandes wrote:
> > On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> >> As stated in the documentation and implied by its name, the ppo
> >> (preserved program order) relation is intended to link po-earlier
> >> to po-later instructions under certain conditions. However, a
> >> corner case currently allows instructions to be linked by ppo that
> >> are not executed by the same thread, i.e., instructions are being
> >> linked that have no po relation.
> >>
> >> This happens due to the mb/strong-fence/fence relations, which (as
> >> one case) provide order when locks are passed between threads
> >> followed by an smp_mb__after_unlock_lock() fence. This is
> >> illustrated in the following litmus test (as can be seen when using
> >> herd7 with `doshow ppo`):
> >>
> >> P0(int *x, int *y)
> >> {
> >> spin_lock(x);
> >> spin_unlock(x);
> >> }
> >>
> >> P1(int *x, int *y)
> >> {
> >> spin_lock(x);
> >> smp_mb__after_unlock_lock();
> >> *y = 1;
> >> }
> >>
> >> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> >> P0 passes a lock to P1 which then uses this fence.
> >>
> >> The patch makes ppo a subrelation of po by letting fence contribute
> >> to ppo only in case the fence links events of the same thread.
> >>
> >> Signed-off-by: Jonas Oberhauser <[email protected]>
> >> ---
> >> tools/memory-model/linux-kernel.cat | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> >> index cfc1b8fd46da..adf3c4f41229 100644
> >> --- a/tools/memory-model/linux-kernel.cat
> >> +++ b/tools/memory-model/linux-kernel.cat
> >> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> >> let overwrite = co | fr
> >> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> >> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> >> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> >> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> > Alternatively can be the following appended diff? Requires only single 'int'
> > in ->ppo then and prevents future similar issues caused by sub relations.
> > Also makes clear that ->ppo can only be CPU-internal.
>
> I had thought about going in that direction, but it doesn't prevent
> future similar issues, at best makes them less likely.

Making it less likely still sounds like a win to me.

> For example, you could have an xfence that somehow goes back to the
> original thread, but to a po-earlier event (e.g., like prop).
>
> Given that to-r and to-w are unlikely to ever become inconsistent
> with po, I am not sure it even really helps much.

I am not sure I understand: what is the problem with enforcing that
ppo is only ever supposed to be internal (int)? Sounds like that makes
it super clear.

> Personally I'm not too happy with the ad-hoc '& int' because it's like

So, with the idea I suggest, you will have fewer ints so you should be happy ;-)

> adding some unused stuff (via ... | unused-stuff) and then cutting it
> back out with &int, unlike prop & int which has a real semantic meaning
> (propagate back to the thread). The fastest move is the move we avoid
> doing, so I'd rather not add those parts in the first place.
>
> However fixing the fence relation turned out to be a lot trickier, both
> because of the missed data race and also rmw-sequences, essentially I
> would have had to disambiguate between xfences and fences already in
> this patch. So I did this minimal local fix for now and we can discuss
> whether it makes sense to get rid of the '& int' once/if we have xfence etc.

I see. Ok, I'll defer to your expertise on this since you know more
than I. I have only relatively recently even opened up the CAT code.

Cheers,

- Joel


>
> Best wishes,
> jonas
>
> PS:
> > ---8<-----------------------
>
> haha that's so clever :D
>
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 07f884f9b2bf..63052d1628e9 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -70,7 +70,7 @@ let rwdep = (dep | ctrl) ; [W]
> > let overwrite = co | fr
> > let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > let to-r = addr | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = (to-r | to-w | fence | po-unlock-lock-po) & int
> >
> > (* Propagation: Ordering from release operations and strong fences. *)
> > let A-cumul(r) = (rfe ; [Marked])? ; r
>

2023-02-27 18:41:31

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Mon, Feb 27, 2023 at 05:50:15PM +0100, Jonas Oberhauser wrote:
> > So I don't see this as a valid argument for not using rw-xbstar in
> > rw-race. Even theoretically.
>
> There's nothing wrong with using rw-xbstar in rw-race, especially in current
> LKMM, and I'm not arguing against that.
>
> I'm saying that the argument
> "if rw-xbstar links a read R to a plain write W, and that plain write is
> replaced by a read R', and in case R' reads a value different from W,
> followed by a store W' (with some dependency from R' to W') by the
> compiler, then the fact that R and R' can't have a data race means that it's
> safe to use rw-xbstar in rw-race"
> is incomplete. (Of course that doesn't mean the claim is wrong.)
> To make the argument complete, you also need that W' is generated if
> necessary, and more crucially that W' is still ordered behind R!
> Otherwise you would now have a data race between R and W', like in the
> hypothetical example I mentioned, even though R and R' don't race.
>
> And if you do that second step in LKMM (even with the change of
> w-pre-bounded we are discussing) you quickly find that W' is indeed still
> ordered, so rw-xbstar is perfectly fine.
>
> Perhaps that step is so trivial to you that you don't feel it needs
> mentioning : ) But speaking about LKMM-like models in general, some might
> have some subtle case where rw-xbstar links R and W but would not R and W'.

Ah, okay. Yes, it is a subtle point. And by the reasoning I just used,
if such a case did exist then one could conclude it would be an example
demonstrating that rw-xbstar should not have linked R and W in the first
place.

And it looks like I should write up and submit a patch allowing more
dependencies in the definition of w-pre-bounded.

Alan

2023-02-27 19:35:22

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> As stated in the documentation and implied by its name, the ppo
> (preserved program order) relation is intended to link po-earlier
> to po-later instructions under certain conditions. However, a
> corner case currently allows instructions to be linked by ppo that
> are not executed by the same thread, i.e., instructions are being
> linked that have no po relation.
>
> This happens due to the mb/strong-fence/fence relations, which (as
> one case) provide order when locks are passed between threads
> followed by an smp_mb__after_unlock_lock() fence. This is
> illustrated in the following litmus test (as can be seen when using
> herd7 with `doshow ppo`):
>
> P0(int *x, int *y)
> {
> spin_lock(x);
> spin_unlock(x);
> }
>
> P1(int *x, int *y)
> {
> spin_lock(x);
> smp_mb__after_unlock_lock();
> *y = 1;
> }

nit: for readability if nothing else, s/int *x/spinlock_t *x/.
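
With that substitution the test would read:

P0(spinlock_t *x, int *y)
{
	spin_lock(x);
	spin_unlock(x);
}

P1(spinlock_t *x, int *y)
{
	spin_lock(x);
	smp_mb__after_unlock_lock();
	*y = 1;
}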


> The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> P0 passes a lock to P1 which then uses this fence.
>
> The patch makes ppo a subrelation of po by letting fence contribute
> to ppo only in case the fence links events of the same thread.
>
> Signed-off-by: Jonas Oberhauser <[email protected]>

Acked-by: Andrea Parri <[email protected]>

Andrea


> ---
> tools/memory-model/linux-kernel.cat | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index cfc1b8fd46da..adf3c4f41229 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> let overwrite = co | fr
> let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
>
> (* Propagation: Ordering from release operations and strong fences. *)
> let A-cumul(r) = (rfe ; [Marked])? ; r
> --
> 2.17.1
>

2023-02-27 19:40:30

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

> The LKMM doesn't believe that a control or data dependency orders a
> plain write after a marked read. Hence in this test it thinks that P1's
> store to u0 can happen before the load of x1. I don't remember why we
> did it this way -- probably we just wanted to minimize the restrictions
> on when plain accesses can execute. (I do remember the reason for
> making address dependencies induce order; it was so RCU would work.)
>
> The patch below will change what the LKMM believes. It eliminates the
> positive outcome of the litmus test and the data race. Should it be
> adopted into the memory model?

(Unpopular opinion I know,) it should drop dependencies ordering, not
add/promote it.

Andrea

2023-02-27 20:13:48

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/27/2023 8:40 PM, Andrea Parri wrote:
>> The LKMM doesn't believe that a control or data dependency orders a
>> plain write after a marked read. Hence in this test it thinks that P1's
>> store to u0 can happen before the load of x1. I don't remember why we
>> did it this way -- probably we just wanted to minimize the restrictions
>> on when plain accesses can execute. (I do remember the reason for
>> making address dependencies induce order; it was so RCU would work.)
>>
>> The patch below will change what the LKMM believes. It eliminates the
>> positive outcome of the litmus test and the data race. Should it be
>> adopted into the memory model?
> (Unpopular opinion I know,) it should drop dependencies ordering, not
> add/promote it.
>
> Andrea

Maybe not as unpopular as you think... :)
But either way IMHO it should be consistent; either take all the
dependencies that are true and add them, or drop them all.
In the latter case, RCU should change to an acquire barrier. (also, one
would have to deal with OOTA in some yet different way).

Generally my position is that unless there's a real-world benchmark with
proven performance benefits of relying on dependency ordering, one
should use an acquire barrier. I haven't yet met such a case, but maybe
one of you has...
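
For concreteness, the choice I mean is roughly the following
kernel-C sketch (read_dep()/read_acq() and the node type are made
up for illustration):

struct node { int data; };
struct node *head;

/* relies on dependency ordering: the address dependency from the
 * marked load of head to the load of n->data keeps the two loads
 * ordered (this is what rcu_dereference() builds on) */
int read_dep(void)
{
	struct node *n = READ_ONCE(head);

	return n ? n->data : -1;
}

/* the conservative alternative: an acquire load, which costs a
 * fence on architectures without a native load-acquire */
int read_acq(void)
{
	struct node *n = smp_load_acquire(&head);

	return n ? n->data : -1;
}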

Best wishes,
jonas



2023-02-27 20:25:25

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/27/2023 6:57 PM, Joel Fernandes wrote:
> On Mon, Feb 27, 2023 at 9:39 AM Jonas Oberhauser
> <[email protected]> wrote:
>>
>>
>> On 2/26/2023 5:23 PM, Joel Fernandes wrote:
>>> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
>>>> [...]
>>> Alternatively can be the following appended diff? Requires only single 'int'
>>> in ->ppo then and prevents future similar issues caused by sub relations.
>>> Also makes clear that ->ppo can only be CPU-internal.
>> I had thought about going in that direction, but it doesn't prevent
>> future similar issues, at best makes them less likely.
> Making it less likely still sounds like a win to me.
>
>> For example, you could have an xfence that somehow goes back to the
>> original thread, but to a po-earlier event (e.g., like prop).
>>
>> Given that to-r and to-w are unlikely to ever become inconsistent
>> with po, I am not sure it even really helps much.
> I am not sure I understand: what is the problem with enforcing that
> ppo is only ever supposed to be internal (int)? Sounds like that makes
> it super clear.

You could go further and do ... & po.
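
Spelled out, that would be:

let ppo = (to-r | to-w | fence | po-unlock-lock-po) & po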

But it would still make me feel that it's a plaster used to hold
together a steampipe.
It reminds me a bit of college, when some of my classmates passed the
nightly tests in the programming lab by doing
"if input == (the input of the specific test case they were having
problems with) return (value expected by the testcase)".
Or making everything atomic so that tsan does not complain about data
races anymore.

If there's something in one of these relations tying together events of
different threads, is it intentional or a bug?
I prefer to be very conscious about what is being tied together by the
relations.

I'd rather take Boqun's suggestion to add some "debug/testing" flags to
see if a litmus test violates a property assumed by LKMM.
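
In cat, such a flag could be a one-line check, e.g. (a sketch, the
name is made up):

flag ~empty (ppo \ po) as ppo-not-in-po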

Yet I think the ideal way is to have a mechanized proof that LKMM
satisfies these properties and use that to avoid regressions.


>
>> Personally I'm not too happy with the ad-hoc '& int' because it's like
> So, with the idea I suggest, you will have fewer ints so you should be happy ;-)

haha : ) An alternative perspective is that the &int now covers more
cases of the relation ; )


>
>> adding some unused stuff (via ... | unused-stuff) and then cutting it
>> back out with &int, unlike prop & int which has a real semantic meaning
>> (propagate back to the thread). The fastest move is the move we avoid
>> doing, so I'd rather not add those parts in the first place.
>>
>> However fixing the fence relation turned out to be a lot trickier, both
>> because of the missed data race and also rmw-sequences, essentially I
>> would have had to disambiguate between xfences and fences already in
>> this patch. So I did this minimal local fix for now and we can discuss
>> whether it makes sense to get rid of the '& int' once/if we have xfence etc.
> I see. Ok, I'll defer to your expertise on this since you know more
> than I. I have only relatively recently even opened up the CAT code.

Enjoy : )

jonas


2023-02-27 22:21:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>
>
> On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > The LKMM doesn't believe that a control or data dependency orders a
> > > plain write after a marked read. Hence in this test it thinks that P1's
> > > store to u0 can happen before the load of x1. I don't remember why we
> > > did it this way -- probably we just wanted to minimize the restrictions
> > > on when plain accesses can execute. (I do remember the reason for
> > > making address dependencies induce order; it was so RCU would work.)
> > >
> > > The patch below will change what the LKMM believes. It eliminates the
> > > positive outcome of the litmus test and the data race. Should it be
> > > adopted into the memory model?
> > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > add/promote it.
> >
> > Andrea
>
> Maybe not as unpopular as you think... :)
> But either way IMHO it should be consistent; either take all the
> dependencies that are true and add them, or drop them all.
> In the latter case, RCU should change to an acquire barrier. (also, one
> would have to deal with OOTA in some yet different way).
>
> Generally my position is that unless there's a real-world benchmark with
> proven performance benefits of relying on dependency ordering, one should
> use an acquire barrier. I haven't yet met such a case, but maybe one of you
> has...

https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).

Though this is admittedly for ARMv7 and PowerPC.

Thanx, Paul

2023-02-28 08:50:06

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>>
>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>> plain write after a marked read. Hence in this test it thinks that P1's
>>>> store to u0 can happen before the load of x1. I don't remember why we
>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>> on when plain accesses can execute. (I do remember the reason for
>>>> making address dependencies induce order; it was so RCU would work.)
>>>>
>>>> The patch below will change what the LKMM believes. It eliminates the
>>>> positive outcome of the litmus test and the data race. Should it be
>>>> adopted into the memory model?
>>> (Unpopular opinion I know,) it should drop dependencies ordering, not
>>> add/promote it.
>>>
>>> Andrea
>> Maybe not as unpopular as you think... :)
>> But either way IMHO it should be consistent; either take all the
>> dependencies that are true and add them, or drop them all.
>> In the latter case, RCU should change to an acquire barrier. (also, one
>> would have to deal with OOTA in some yet different way).
>>
>> Generally my position is that unless there's a real-world benchmark with
>> proven performance benefits of relying on dependency ordering, one should
>> use an acquire barrier. I haven't yet met such a case, but maybe one of you
>> has...
> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>
> Though this is admittedly for ARMv7 and PowerPC.
>

Thanks for the link.

It's true that on architectures that don't have an acquire load (and
have to use a fence), the penalty will be bigger.

But the more obvious discussion would be what constitutes a real-world
benchmark : )
In my experience you can get a lot of performance benefits out of
optimizing barriers in code if all you execute is that code.
But once you embed that into a real-world application, often 90%-99% of
time spent will be in the business logic, not in the data structure.

And then the benefits suddenly disappear.
Note that a lot of barriers are a lot cheaper as well when there's no
contention.

Because of that, making optimization decisions based on microbenchmarks
can sometimes lead to a very poor "time invested" vs "total product
improvement" ratio.

Best wishes,
jonas


2023-02-28 15:40:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
>
>
> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > >
> > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > plain write after a marked read. Hence in this test it thinks that P1's
> > > > > store to u0 can happen before the load of x1. I don't remember why we
> > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > on when plain accesses can execute. (I do remember the reason for
> > > > > making address dependencies induce order; it was so RCU would work.)
> > > > >
> > > > > The patch below will change what the LKMM believes. It eliminates the
> > > > > positive outcome of the litmus test and the data race. Should it be
> > > > > adopted into the memory model?
> > > > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > > > add/promote it.
> > > >
> > > > Andrea
> > > Maybe not as unpopular as you think... :)
> > > But either way IMHO it should be consistent; either take all the
> > > dependencies that are true and add them, or drop them all.
> > > In the latter case, RCU should change to an acquire barrier. (also, one
> > > would have to deal with OOTA in some yet different way).
> > >
> > > Generally my position is that unless there's a real-world benchmark with
> > > proven performance benefits of relying on dependency ordering, one should
> > > use an acquire barrier. I haven't yet met such a case, but maybe one of you
> > > has...
> > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> >
> > Though this is admittedly for ARMv7 and PowerPC.
> >
>
> Thanks for the link.
>
> It's true that on architectures that don't have an acquire load (and have to
> use a fence), the penalty will be bigger.
>
> But the more obvious discussion would be what constitutes a real-world
> benchmark : )
> In my experience you can get a lot of performance benefits out of optimizing
> barriers in code if all you execute is that code.
> But once you embed that into a real-world application, often 90%-99% of time
> spent will be in the business logic, not in the data structure.
>
> And then the benefits suddenly disappear.
> Note that a lot of barriers are a lot cheaper as well when there's no
> contention.
>
> Because of that, making optimization decisions based on microbenchmarks can
> sometimes lead to a very poor "time invested" vs "total product improvement"
> ratio.

All true, though that 2x and 4x should be worth something.

The real-world examples I know of involved garbage collectors, and the
improvement was said to be a few percent system-wide. But that was a
verbal exchange, so I don't have a citation for you.

Thanx, Paul

2023-02-28 22:01:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Mon, Feb 27, 2023 at 08:35:11PM +0100, Andrea Parri wrote:
> On Fri, Feb 24, 2023 at 02:52:51PM +0100, Jonas Oberhauser wrote:
> > As stated in the documentation and implied by its name, the ppo
> > (preserved program order) relation is intended to link po-earlier
> > to po-later instructions under certain conditions. However, a
> > corner case currently allows instructions to be linked by ppo that
> > are not executed by the same thread, i.e., instructions are being
> > linked that have no po relation.
> >
> > This happens due to the mb/strong-fence/fence relations, which (as
> > one case) provide order when locks are passed between threads
> > followed by an smp_mb__after_unlock_lock() fence. This is
> > illustrated in the following litmus test (as can be seen when using
> > herd7 with `doshow ppo`):
> >
> > P0(int *x, int *y)
> > {
> > spin_lock(x);
> > spin_unlock(x);
> > }
> >
> > P1(int *x, int *y)
> > {
> > spin_lock(x);
> > smp_mb__after_unlock_lock();
> > *y = 1;
> > }
>
> nit: for readability if nothing else, s/int *x/spinlock_t *x/.

Unless there are objections, I will make this change on my next
rebase.

> > The ppo relation will link P0's spin_lock(x) and P1's *y=1, because
> > P0 passes a lock to P1 which then uses this fence.
> >
> > The patch makes ppo a subrelation of po by letting fence contribute
> > to ppo only in case the fence links events of the same thread.
> >
> > Signed-off-by: Jonas Oberhauser <[email protected]>
>
> Acked-by: Andrea Parri <[email protected]>

And add this as well, thank you!

Thanx, Paul

> Andrea
>
>
> > ---
> > tools/memory-model/linux-kernel.cat | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index cfc1b8fd46da..adf3c4f41229 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -82,7 +82,7 @@ let rwdep = (dep | ctrl) ; [W]
> > let overwrite = co | fr
> > let to-w = rwdep | (overwrite & int) | (addr ; [Plain] ; wmb)
> > let to-r = (addr ; [R]) | (dep ; [Marked] ; rfi)
> > -let ppo = to-r | to-w | fence | (po-unlock-lock-po & int)
> > +let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
> >
> > (* Propagation: Ordering from release operations and strong fences. *)
> > let A-cumul(r) = (rfe ; [Marked])? ; r
> > --
> > 2.17.1
> >

2023-03-01 10:53:00

by Jonas Oberhauser

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po



On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
>>
>> On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
>>> On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
>>>> On 2/27/2023 8:40 PM, Andrea Parri wrote:
>>>>>> The LKMM doesn't believe that a control or data dependency orders a
>>>>>> plain write after a marked read. Hence in this test it thinks that P1's
>>>>>> store to u0 can happen before the load of x1. I don't remember why we
>>>>>> did it this way -- probably we just wanted to minimize the restrictions
>>>>>> on when plain accesses can execute. (I do remember the reason for
>>>>>> making address dependencies induce order; it was so RCU would work.)
>>>>>>
>>>>>> The patch below will change what the LKMM believes. It eliminates the
>>>>>> positive outcome of the litmus test and the data race. Should it be
>>>>>> adopted into the memory model?
>>>>> (Unpopular opinion I know,) it should drop dependencies ordering, not
>>>>> add/promote it.
>>>>>
>>>>> Andrea
>>>> Maybe not as unpopular as you think... :)
>>>> But either way IMHO it should be consistent; either take all the
>>>> dependencies that are true and add them, or drop them all.
>>>> In the latter case, RCU should change to an acquire barrier. (also, one
>>>> would have to deal with OOTA in some yet different way).
>>>>
>>>> Generally my position is that unless there's a real-world benchmark with
>>>> proven performance benefits of relying on dependency ordering, one should
>>>> use an acquire barrier. I haven't yet met such a case, but maybe one of you
>>>> has...
>>> https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
>>>
>>> Though this is admittedly for ARMv7 and PowerPC.
>>>
>> Thanks for the link.
>>
>> It's true that on architectures that don't have an acquire load (and have to
>> use a fence), the penalty will be bigger.
>>
>> But the more obvious discussion would be what constitutes a real-world
>> benchmark : )
>> In my experience you can get a lot of performance benefits out of optimizing
>> barriers in code if all you execute is that code.
>> But once you embed that into a real-world application, often 90%-99% of time
>> spent will be in the business logic, not in the data structure.
>>
>> And then the benefits suddenly disappear.
>> Note that a lot of barriers are a lot cheaper as well when there's no
>> contention.
>>
>> Because of that, making optimization decisions based on microbenchmarks can
>> sometimes lead to a very poor "time invested" vs "total product improvement"
>> ratio.
> All true, though that 2x and 4x should be worth something.

I think the most egregious case we had (not barrier related, but cache
related) was something like ~100x in specific benchmarks and then I
think somewhere around 1% system-wide. I think the issue was that in the
larger system, we couldn't keep the cache hot, so our greatly improved
data locality was being diluted.
But of course it always depends on how much that component actually
contributes to the overall system performance.
Which IMHO should be one of the measurements taken before starting to
invest heavily into optimizations.

Surprisingly, many people don't want to do that. I think it's something
about the misleading calculus of "2 months implementing the optimization
+ 2 weeks building robust benchmarks & profiling infrastructure > 2
months implementing the optimization" instead of "2 weeks building
robust benchmarks & profiling infrastructure + 0 months implementing a
useless optimization < 2 months implementing the optimization" (and the
useless optimization seems to be the more common outcome).


> The real-world examples I know of involved garbage collectors, and the
> improvement was said to be a few percent system-wide. But that was a
> verbal exchange, so I don't have a citation for you.

Thanks for the example, it sounds very reasonable (at least for
platforms like PowerPC).
GC has all the hallmarks of a good optimization target: measurable
impact on system wide throughput and latency, and pointer chasing
(=dependency ordering) being a hot part inside the algorithm.

Best wishes,
jonas


2023-03-02 01:43:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v3] tools/memory-model: Make ppo a subrelation of po

On Wed, Mar 01, 2023 at 11:52:09AM +0100, Jonas Oberhauser wrote:
>
>
> On 2/28/2023 4:40 PM, Paul E. McKenney wrote:
> > On Tue, Feb 28, 2023 at 09:49:07AM +0100, Jonas Oberhauser wrote:
> > >
> > > On 2/27/2023 11:21 PM, Paul E. McKenney wrote:
> > > > On Mon, Feb 27, 2023 at 09:13:01PM +0100, Jonas Oberhauser wrote:
> > > > > On 2/27/2023 8:40 PM, Andrea Parri wrote:
> > > > > > > The LKMM doesn't believe that a control or data dependency orders a
> > > > > > > plain write after a marked read. Hence in this test it thinks that P1's
> > > > > > > store to u0 can happen before the load of x1. I don't remember why we
> > > > > > > did it this way -- probably we just wanted to minimize the restrictions
> > > > > > > on when plain accesses can execute. (I do remember the reason for
> > > > > > > making address dependencies induce order; it was so RCU would work.)
> > > > > > >
> > > > > > > The patch below will change what the LKMM believes. It eliminates the
> > > > > > > positive outcome of the litmus test and the data race. Should it be
> > > > > > > adopted into the memory model?
> > > > > > (Unpopular opinion I know,) it should drop dependencies ordering, not
> > > > > > add/promote it.
> > > > > >
> > > > > > Andrea
> > > > > Maybe not as unpopular as you think... :)
> > > > > But either way IMHO it should be consistent; either take all the
> > > > > dependencies that are true and add them, or drop them all.
> > > > > In the latter case, RCU should change to an acquire barrier. (also, one
> > > > > would have to deal with OOTA in some yet different way).
> > > > >
> > > > > Generally my position is that unless there's a real-world benchmark with
> > > > > proven performance benefits of relying on dependency ordering, one should
> > > > > use an acquire barrier. I haven't yet met such a case, but maybe one of you
> > > > > has...
> > > > https://www.msully.net/thesis/thesis.pdf page 128 (PDF page 141).
> > > >
> > > > Though this is admittedly for ARMv7 and PowerPC.
> > > >
> > > Thanks for the link.
> > >
> > > It's true that on architectures that don't have an acquire load (and have to
> > > use a fence), the penalty will be bigger.
> > >
> > > But the more obvious discussion would be what constitutes a real-world
> > > benchmark : )
> > > In my experience you can get a lot of performance benefits out of optimizing
> > > barriers in code if all you execute is that code.
> > > But once you embed that into a real-world application, often 90%-99% of time
> > > spent will be in the business logic, not in the data structure.
> > >
> > > And then the benefits suddenly disappear.
> > > Note that a lot of barriers are a lot cheaper as well when there's no
> > > contention.
> > >
> > > Because of that, making optimization decisions based on microbenchmarks can
> > > sometimes lead to a very poor "time invested" vs "total product improvement"
> > > ratio.
> > All true, though that 2x and 4x should be worth something.
>
> I think the most egregious case we had (not barrier related, but cache
> related) was something like ~100x in specific benchmarks and then I think
> somewhere around 1% system-wide. I think the issue was that in the larger
> system, we couldn't keep the cache hot, so our greatly improved data
> locality was being diluted.
> But of course it always depends on how much that component actually
> contributes to the overall system performance.
> Which IMHO should be one of the measurements taken before starting to invest
> heavily into optimizations.

Agreed, it is all too easy for profound local optimizations to have
near-zero global effect.

> Surprisingly, many people don't want to do that. I think it's something
> about the misleading calculus of "2 months implementing the optimization + 2
> weeks building robust benchmarks & profiling infrastructure > 2 months
> implementing the optimization" instead of "2 weeks building robust
> benchmarks & profiling infrastructure + 0 months implementing a useless
> optimization < 2 months implementing the optimization" (and the useless
> optimization seems to be the more common outcome).

Or, for that matter, having the investigation locate a five-minute change
that produces large benefits.

> > The real-world examples I know of involved garbage collectors, and the
> > improvement was said to be a few percent system-wide. But that was a
> > verbal exchange, so I don't have a citation for you.
>
> Thanks for the example, it sounds very reasonable (at least for platforms
> like PowerPC).
> GC has all the hallmarks of a good optimization target: measurable impact on
> system wide throughput and latency, and pointer chasing (=dependency
> ordering) being a hot part inside the algorithm.

I believe that the target was ARM. It was definitely not PowerPC.

But yes, GCs are indeed an intense exercise in pointer chasing.

Thanx, Paul