2022-08-26 13:04:59

by Paul E. McKenney

Subject: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

Hello!

I have not yet done more than glance at this one, but figured I should
send it along sooner rather than later.

"Verifying and Optimizing Compact NUMA-Aware Locks on Weak
Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
https://arxiv.org/abs/2111.15240

The claim is that the queued spinlocks implementation with CNA violates
LKMM but actually works on all architectures having a formal hardware
memory model.

Thoughts?

Thanx, Paul


2022-08-26 16:38:20

by Boqun Feng

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>

I translated one of their litmus tests into a runnable one (there is a
typo in the original):

C queued-spin-lock

(*
* P0 is lock-release
* P1 is xchg_tail()
* P2 is lock-acquire
*)

{}

P0(int *x, atomic_t *l)
{
	WRITE_ONCE(*x, 1);
	atomic_set_release(l, 1);
}

P1(int *x, atomic_t *l)
{
	int val;
	int new;
	int old;

	val = atomic_read(l);
	new = val + 2;
	old = atomic_cmpxchg_relaxed(l, val, new);
}

P2(int *x, atomic_t *l)
{
	int r0 = atomic_read_acquire(l);
	int r1 = READ_ONCE(*x);
}

exists (2:r0=3 /\ 2:r1=0)

According to LKMM, the exists-clause could be triggered because

	po-rel; coe; rfe; acq-po

is not happens-before in LKMM. Alan actually explained why in a response
to a GitHub issue:

https://github.com/paulmckrcu/litmus/issues/11#issuecomment-1115235860

(Pasting Alan's reply)

"""
As for why the LKMM doesn't require ordering for this test... I do not
believe this is related to 2+2W. Think instead in terms of the LKMM's
operational model:

The store-release in P0 means that the x=1 write will propagate
to each CPU before the y=1 write does.

Since y=3 at the end, we know that y=1 (and hence x=1 too)
propagates to P1 before the addition occurs. And we know that
y=3 propagates to P2 before the load-acquire executes.

But we _don't_ know that either y=1 or x=1 propagates to P2
before y=3 does! If the store in P1 were a store-release then
this would have to be true (as you saw in your tests), but it
isn't.

In other words, the litmus test could execute with the following
temporal ordering:

P0            P1            P2
----------    ----------    ----------
Write x=1
Write y=1

    [x=1 and y=1 propagate to P1]

              Read y=1
              Write y=3

    [y=3 propagates to P2]

                            Read y=3
                            Read x=0

    [x=1 and y=1 propagate to P2]

(Note that when y=1 propagates to P2, it doesn't do anything because it
won't overwrite the coherence-later store y=3.)

It may be true that none of the architectures supported by Linux will
allow this outcome for the test (although I suspect that the PPC-weird
version _would_ be allowed if you fixed it). At any rate, disallowing
this result in the memory model would probably require more effort than
would be worthwhile.

Alan
"""

The question is whether we "fix" LKMM because of this, or whether we
explicitly note that this is something not yet "supported" by LKMM.

Regards,
Boqun

> Thoughts?
>
> Thanx, Paul

2022-08-26 17:02:50

by Peter Zijlstra

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>
> Thoughts?

So the paper mentions the following defects:

- LKMM doesn't carry a release-acquire chain across a relaxed op

- some babbling about a missing propagation -- ISTR Linux is stuffed
full of them, specifically we require stores to auto propagate
without help from barriers

- some handoff that is CNA specific and I've not looked too hard at
presently.


I think we should address that first one in LKMM, it seems very weird to
me a RmW would break the chain like that. Is there actual hardware that
doesn't behave?


2022-08-26 17:22:44

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > I have not yet done more than glance at this one, but figured I should
> > send it along sooner rather than later.
> >
> > "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > https://arxiv.org/abs/2111.15240
> >
> > The claim is that the queued spinlocks implementation with CNA violates
> > LKMM but actually works on all architectures having a formal hardware
> > memory model.
> >
> > Thoughts?
>
> So the paper mentions the following defects:
>
> - LKMM doesn't carry a release-acquire chain across a relaxed op

That's right, although I'm not so sure this should be considered a
defect...

> - some babbling about a missing propagation -- ISTR Linux is stuffed
> full of them, specifically we require stores to auto propagate
> without help from barriers

Not a missing propagation; a late one.

Don't understand what you mean by "auto propagate without help from
barriers".

> - some handoff that is CNA specific and I've not looked too hard at
> presently.
>
>
> I think we should address that first one in LKMM, it seems very weird to
> me a RmW would break the chain like that.

An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).

If the authors wanted to keep the release-acquire chain intact, why not
use a cmpxchg version that has release semantics instead of going out of
their way to use a relaxed version?

To put it another way, RMW accesses and release-acquire accesses are
unrelated concepts. You can have one without the other (in principle,
anyway). So a relaxed RMW is just as capable of breaking a
release-acquire chain as any other relaxed operation is.
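For comparison, here is a sketch of the earlier litmus test with the
relaxed cmpxchg upgraded to a release one. (This is an illustrative
reconstruction, not run through herd7 as written here; the expectation
is that LKMM then forbids the exists-clause, because the A-cumulative
release keeps the release-acquire chain intact.)

	C queued-spin-lock-release

	(*
	 * Same shape as the earlier test, but the xchg_tail() step
	 * uses a release cmpxchg instead of a relaxed one.
	 *)

	{}

	P0(int *x, atomic_t *l)
	{
		WRITE_ONCE(*x, 1);
		atomic_set_release(l, 1);
	}

	P1(atomic_t *l)
	{
		int val;
		int old;

		val = atomic_read(l);
		old = atomic_cmpxchg_release(l, val, val + 2);
	}

	P2(int *x, atomic_t *l)
	{
		int r0 = atomic_read_acquire(l);
		int r1 = READ_ONCE(*x);
	}

	exists (2:r0=3 /\ 2:r1=0)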

> Is there actual hardware that
> doesn't behave?

Not as far as I know, although that isn't very far. Certainly an
other-multicopy-atomic architecture would make the litmus test succeed.
But the LKMM does not require other-multicopy-atomicity.

Alan

2022-08-26 21:29:36

by Paul E. McKenney

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > > Hello!
> > >
> > > I have not yet done more than glance at this one, but figured I should
> > > send it along sooner rather than later.
> > >
> > > "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > > Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > > Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > > https://arxiv.org/abs/2111.15240
> > >
> > > The claim is that the queued spinlocks implementation with CNA violates
> > > LKMM but actually works on all architectures having a formal hardware
> > > memory model.
> > >
> > > Thoughts?
> >
> > So the paper mentions the following defects:
> >
> > - LKMM doesn't carry a release-acquire chain across a relaxed op
>
> That's right, although I'm not so sure this should be considered a
> defect...
>
> > - some babbling about a missing propagation -- ISTR Linux is stuffed
> > full of them, specifically we require stores to auto propagate
> > without help from barriers
>
> Not a missing propagation; a late one.
>
> Don't understand what you mean by "auto propagate without help from
> barriers".
>
> > - some handoff that is CNA specific and I've not looked too hard at
> > presently.
> >
> >
> > I think we should address that first one in LKMM, it seems very weird to
> > me a RmW would break the chain like that.
>
> An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
>
> If the authors wanted to keep the release-acquire chain intact, why not
> use a cmpxchg version that has release semantics instead of going out of
> their way to use a relaxed version?
>
> To put it another way, RMW accesses and release-acquire accesses are
> unrelated concepts. You can have one without the other (in principle,
> anyway). So a relaxed RMW is just as capable of breaking a
> release-acquire chain as any other relaxed operation is.
>
> > Is there actual hardware that
> > doesn't behave?
>
> Not as far as I know, although that isn't very far. Certainly an
> other-multicopy-atomic architecture would make the litmus test succeed.
> But the LKMM does not require other-multicopy-atomicity.

My first attempt with ppcmem suggests that powerpc does -not- behave
this way. But that surprises me, just on general principles. Most likely
I blew the litmus test shown below.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

PPC MP+lwsyncs+atomic
"LwSyncdWW Rfe LwSyncdRR Fre"
Cycle=Rfe LwSyncdRR Fre LwSyncdWW
{
0:r2=x; 0:r4=y;
1:r2=y; 1:r5=2;
2:r2=y; 2:r4=x;
}
P0 | P1 | P2 ;
li r1,1 | lwarx r1,r0,r2 | lwz r1,0(r2) ;
stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync ;
lwsync | | lwz r3,0(r4) ;
li r3,1 | | ;
stw r3,0(r4) | | ;
exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)

------------------------------------------------------------------------

$ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
...
Test MP+lwsyncs+atomic Allowed
States 9
1:r1=0; 2:r1=0; 2:r3=0;
1:r1=0; 2:r1=0; 2:r3=1;
1:r1=0; 2:r1=1; 2:r3=1;
1:r1=0; 2:r1=2; 2:r3=0;
1:r1=0; 2:r1=2; 2:r3=1;
1:r1=1; 2:r1=0; 2:r3=0;
1:r1=1; 2:r1=0; 2:r3=1;
1:r1=1; 2:r1=1; 2:r3=1;
1:r1=1; 2:r1=2; 2:r3=1;
No (allowed not found)
Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
Observation MP+lwsyncs+atomic Never 0 9

2022-08-27 00:31:05

by Peter Zijlstra

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:

> > - some babbling about a missing propagation -- ISTR Linux is stuffed
> > full of them, specifically we require stores to auto propagate
> > without help from barriers
>
> Not a missing propagation; a late one.
>
> Don't understand what you mean by "auto propagate without help from
> barriers".

Linux hard relies on:

	CPU0			CPU1

	WRITE_ONCE(foo, 1);	while (!READ_ONCE(foo));

making forward progress.
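The pattern can be sketched as a self-contained C11 program. This is
only an illustrative sketch: the names (`foo`, `waiter`,
`spin_wait_demo`) are made up, and the kernel's WRITE_ONCE/READ_ONCE
are approximated here with relaxed C11 atomics.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int foo;

static void *waiter(void *arg)
{
	(void)arg;
	/* CPU1's side: spin until CPU0's store becomes visible.
	 * Termination relies only on the store eventually propagating,
	 * not on any barrier. */
	while (!atomic_load_explicit(&foo, memory_order_relaxed))
		;
	return NULL;
}

/* Returns 1 once the waiter has observed the store and exited. */
static int spin_wait_demo(void)
{
	pthread_t t;

	atomic_store_explicit(&foo, 0, memory_order_relaxed);
	if (pthread_create(&t, NULL, waiter, NULL))
		return 0;
	/* CPU0's side: a plain relaxed store, no fence.  Forward
	 * progress of the loop above depends on this store reaching
	 * the other thread eventually. */
	atomic_store_explicit(&foo, 1, memory_order_relaxed);
	pthread_join(t, NULL);
	return 1;
}
```

On any real implementation the join returns, because stores do
propagate in finite time; the point of the thread is that LKMM itself
never states this requirement.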

There were a few 'funny' uarchs that were broken, see for example commit
a30718868915f.

2022-08-27 16:06:28

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > I think we should address that first one in LKMM, it seems very weird to
> > > me a RmW would break the chain like that.
> >
> > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> >
> > If the authors wanted to keep the release-acquire chain intact, why not
> > use a cmpxchg version that has release semantics instead of going out of
> > their way to use a relaxed version?
> >
> > To put it another way, RMW accesses and release-acquire accesses are
> > unrelated concepts. You can have one without the other (in principle,
> > anyway). So a relaxed RMW is just as capable of breaking a
> > release-acquire chain as any other relaxed operation is.
> >
> > > Is there actual hardware that
> > > doesn't behave?
> >
> > Not as far as I know, although that isn't very far. Certainly an
> > other-multicopy-atomic architecture would make the litmus test succeed.
> > But the LKMM does not require other-multicopy-atomicity.
>
> My first attempt with ppcmem suggests that powerpc does -not- behave
> this way. But that surprises me, just on general principles. Most likely
> I blew the litmus test shown below.
>
> Thoughts?

The litmus test looks okay.

As for your surprise, remember that PPC is B-cumulative, another
property which the LKMM does not require. B-cumulativity will also
force the original litmus test to succeed. (The situation is like ISA2
in the infamous test6.pdf, except that y and z are separate variables in
ISA2 but are the same here. The RMW nature of lwarx/stwcx provides
the necessary R-W ordering in P1.)

Alan

> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> PPC MP+lwsyncs+atomic
> "LwSyncdWW Rfe LwSyncdRR Fre"
> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> {
> 0:r2=x; 0:r4=y;
> 1:r2=y; 1:r5=2;
> 2:r2=y; 2:r4=x;
> }
> P0 | P1 | P2 ;
> li r1,1 | lwarx r1,r0,r2 | lwz r1,0(r2) ;
> stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync ;
> lwsync | | lwz r3,0(r4) ;
> li r3,1 | | ;
> stw r3,0(r4) | | ;
> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
>
> ------------------------------------------------------------------------
>
> $ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
> ...
> Test MP+lwsyncs+atomic Allowed
> States 9
> 1:r1=0; 2:r1=0; 2:r3=0;
> 1:r1=0; 2:r1=0; 2:r3=1;
> 1:r1=0; 2:r1=1; 2:r3=1;
> 1:r1=0; 2:r1=2; 2:r3=0;
> 1:r1=0; 2:r1=2; 2:r3=1;
> 1:r1=1; 2:r1=0; 2:r3=0;
> 1:r1=1; 2:r1=0; 2:r3=1;
> 1:r1=1; 2:r1=1; 2:r3=1;
> 1:r1=1; 2:r1=2; 2:r3=1;
> No (allowed not found)
> Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
> Observation MP+lwsyncs+atomic Never 0 9

2022-08-27 16:42:06

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Sat, Aug 27, 2022 at 01:47:48AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
>
> > > - some babbling about a missing propagation -- ISTR Linux if stuffed
> > > full of them, specifically we require stores to auto propagate
> > > without help from barriers
> >
> > Not a missing propagation; a late one.
> >
> > Don't understand what you mean by "auto propagate without help from
> > barriers".
>
> Linux hard relies on:
>
> CPU0 CPU1
>
> WRITE_ONCE(foo, 1); while (!READ_ONCE(foo));
>
> making forward progress.

Indeed yes. As far as I can tell, this requirement is not explicitly
mentioned in the LKMM, although it certainly is implicit. I can't even
think of a way to express it in a form Herd could verify.

> There were a few 'funny' uarchs that were broken, see for example commit
> a30718868915f.

Ha! That commit should be a lesson in something, although I'm not sure
what. :-)

Alan

2022-08-27 16:58:59

by Boqun Feng

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Sat, Aug 27, 2022 at 12:05:33PM -0400, Alan Stern wrote:
> On Sat, Aug 27, 2022 at 01:47:48AM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> >
> > > > - some babbling about a missing propagation -- ISTR Linux if stuffed
> > > > full of them, specifically we require stores to auto propagate
> > > > without help from barriers
> > >
> > > Not a missing propagation; a late one.
> > >
> > > Don't understand what you mean by "auto propagate without help from
> > > barriers".
> >
> > Linux hard relies on:
> >
> > CPU0 CPU1
> >
> > WRITE_ONCE(foo, 1); while (!READ_ONCE(foo));
> >
> > making forward progress.
>
> Indeed yes. As far as I can tell, this requirement is not explicitly
> mentioned in the LKMM, although it certainly is implicit. I can't even
> think of a way to express it in a form Herd could verify.
>

FWIW, C++ defines this as (in https://eel.is/c++draft/atomics#order-11):

Implementations should make atomic stores visible to atomic
loads within a reasonable amount of time.

in other words:

if one thread does an atomic store, then all other threads must see that
store eventually.

(from: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/Rust.20forward.20progress.20guarantees/near/294702950)

Should we add something somewhere in our model, maybe in the
explanation.txt?

Plus, I think we cannot express this in Herd because Herd uses a
graph-based (axiomatic) model rather than an operational model: an
axiomatic model cannot describe "something will eventually happen".
There was also some discussion in the Zulip stream of the Rust
unsafe-code-guidelines group.

Regards,
Boqun

> > There were a few 'funny' uarchs that were broken, see for example commit
> > a30718868915f.
>
> Ha! That commit should be a lesson in something, although I'm not sure
> what. :-)
>
> Alan

2022-08-27 17:08:17

by Paul E. McKenney

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Sat, Aug 27, 2022 at 12:00:15PM -0400, Alan Stern wrote:
> On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > > I think we should address that first one in LKMM, it seems very weird to
> > > > me a RmW would break the chain like that.
> > >
> > > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> > >
> > > If the authors wanted to keep the release-acquire chain intact, why not
> > > use a cmpxchg version that has release semantics instead of going out of
> > > their way to use a relaxed version?
> > >
> > > To put it another way, RMW accesses and release-acquire accesses are
> > > unrelated concepts. You can have one without the other (in principle,
> > > anyway). So a relaxed RMW is just as capable of breaking a
> > > release-acquire chain as any other relaxed operation is.
> > >
> > > > Is there actual hardware that
> > > > doesn't behave?
> > >
> > > Not as far as I know, although that isn't very far. Certainly an
> > > other-multicopy-atomic architecture would make the litmus test succeed.
> > > But the LKMM does not require other-multicopy-atomicity.
> >
> > My first attempt with ppcmem suggests that powerpc does -not- behave
> > this way. But that surprises me, just on general principles. Most likely
> > I blew the litmus test shown below.
> >
> > Thoughts?
>
> The litmus test looks okay.
>
> As for your surprise, remember that PPC is B-cumulative, another
> property which the LKMM does not require. B-cumulativity will also
> force the original litmus test to succeed. (The situation is like ISA2
> in the infamous test6.pdf, except that y and z are separate variables in
> ISA2 but are the same here. The RMW nature of lwarx/stwcx provides
> the necessary R-W ordering in P1.)

Got it, thank you!

Thanx, Paul

> > ------------------------------------------------------------------------
> >
> > PPC MP+lwsyncs+atomic
> > "LwSyncdWW Rfe LwSyncdRR Fre"
> > Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> > {
> > 0:r2=x; 0:r4=y;
> > 1:r2=y; 1:r5=2;
> > 2:r2=y; 2:r4=x;
> > }
> > P0 | P1 | P2 ;
> > li r1,1 | lwarx r1,r0,r2 | lwz r1,0(r2) ;
> > stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync ;
> > lwsync | | lwz r3,0(r4) ;
> > li r3,1 | | ;
> > stw r3,0(r4) | | ;
> > exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> >
> > ------------------------------------------------------------------------
> >
> > $ ./ppcmem -model lwsync_read_block -model coherence_points MP+lwsyncs+atomic.litmus
> > ...
> > Test MP+lwsyncs+atomic Allowed
> > States 9
> > 1:r1=0; 2:r1=0; 2:r3=0;
> > 1:r1=0; 2:r1=0; 2:r3=1;
> > 1:r1=0; 2:r1=1; 2:r3=1;
> > 1:r1=0; 2:r1=2; 2:r3=0;
> > 1:r1=0; 2:r1=2; 2:r3=1;
> > 1:r1=1; 2:r1=0; 2:r3=0;
> > 1:r1=1; 2:r1=0; 2:r3=1;
> > 1:r1=1; 2:r1=1; 2:r3=1;
> > 1:r1=1; 2:r1=2; 2:r3=1;
> > No (allowed not found)
> > Condition exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> > Hash=b7cec0e2ecbd1cb68fe500d6fe362f9c
> > Observation MP+lwsyncs+atomic Never 0 9

2022-08-29 02:30:33

by Andrea Parri

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

> FWIW, C++ defines this as (in https://eel.is/c++draft/atomics#order-11):
>
> Implementations should make atomic stores visible to atomic
> loads within a reasonable amount of time.
>
> in other words:
>
> if one thread does an atomic store, then all other threads must see that
> store eventually.
>
> (from: https://rust-lang.zulipchat.com/#narrow/stream/136281-t-lang.2Fwg-unsafe-code-guidelines/topic/Rust.20forward.20progress.20guarantees/near/294702950)
>
> Should we add something somewhere in our model, maybe in the
> explanation.txt?

FYI, that's briefly mentioned in Section 11, "CACHE COHERENCE AND THE
COHERENCE ORDER RELATION: co, coi, and coe":

"sequential consistency per variable and cache coherence mean the
same thing except that cache coherence includes an extra requirement
that every store eventually becomes visible to every CPU"

Andrea


> Plus, I think we cannot express this in Herd because Herd uses a
> graph-based (axiomatic) model rather than an operational model: an
> axiomatic model cannot describe "something will eventually happen".
> There was also some discussion in the Zulip stream of the Rust
> unsafe-code-guidelines group.

2022-08-29 02:45:27

by Andrea Parri

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> Hello!
>
> I have not yet done more than glance at this one, but figured I should
> send it along sooner rather than later.
>
> "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> https://arxiv.org/abs/2111.15240
>
> The claim is that the queued spinlocks implementation with CNA violates
> LKMM but actually works on all architectures having a formal hardware
> memory model.
>
> Thoughts?

Section 4 ends with a discussion about certain "spurious" data races.
Do we have litmus tests with them? (I could repro with Dartagnan...)

Thanks,
Andrea

2022-08-29 12:58:32

by Paul E. McKenney

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Mon, Aug 29, 2022 at 04:33:23AM +0200, Andrea Parri wrote:
> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > I have not yet done more than glance at this one, but figured I should
> > send it along sooner rather than later.
> >
> > "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > https://arxiv.org/abs/2111.15240
> >
> > The claim is that the queued spinlocks implementation with CNA violates
> > LKMM but actually works on all architectures having a formal hardware
> > memory model.
> >
> > Thoughts?
>
> Section 4 ends with a discussion about certain "spurious" data races.
> Do we have litmus tests with them? (I could repro with Dartagnan...)

Their Figure 5 clearly shows a data race, but agreed, their claim is
that this race is prevented by other code not in this litmus test.
Me, I currently suspect that the spurious data race might be due to the
failure to guarantee mutual exclusion, though I have not yet read that
paper carefully. That is scheduled for tomorrow morning.

Thanx, Paul

2022-09-10 15:11:14

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Sat, Sep 10, 2022 at 12:11:36PM +0000, Hernan Luis Ponce de Leon wrote:
>
> What they mean seems to be that a prop relation followed only by wmb
> (not mb) doesn't enforce the order of some writes to the same
> location, leading to the claimed hang in qspinlock (at least as far as
> LKMM is concerned).

You were quoting Jonas here, right? The email doesn't make this obvious
because it doesn't have two levels of "> > " markings.

> What we mean is that wmb does not give the same propagation properties as mb.

In general, _no_ two distinct relations in the LKMM have the same
propagation properties. If wmb always behaved the same way as mb, we
wouldn't use two separate words for them.

> The claim is based on these relations from the memory model
>
> let strong-fence = mb | gp
> ...
> let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
> po-unlock-lock-po) ; [Marked]
> let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
> [Marked] ; rfe? ; [Marked]

Please be more specific. What difference between mb and wmb are you
concerned about? Can you give a small litmus test that illustrates this
difference? Can you explain in more detail how this difference affects
the qspinlock implementation?

> From an engineering perspective, I think the only issue is that cat
> *currently* does not have any syntax for this,

Syntax for what? The difference between wmb and mb?

> nor does herd currently
> implement the await model checking techniques proposed in those works
> (c.f. Theorem 5.3. in the "making weak memory models fair" paper,
> which says that for this kind of loop, iff the mo-maximal reads in
> some graph are read in a loop iteration that does not exit the loop,
> the loop can run forever). However GenMC and I believe also Dat3M and
> recently also Nidhugg support such techniques. It may not even be too
> much effort to implement something like this in herd if desired.

I believe that herd has no way to express the idea of a program running
forever. On the other hand, it's certainly true (in all of these
models) that for any finite number N, there is a feasible execution in
which a loop runs for more than N iterations before the termination
condition eventually becomes true.

Alan

> The Dartagnan model checker uses the Theorem 5.3 from above to detect
> liveness violations.
>
> We did not try to come up with a litmus test about the behavior
> because herd7 cannot reason about liveness.
> However, if anybody is interested, the violating execution is shown here
> https://github.com/huawei-drc/cna-verification/blob/master/verification-output/BUG1.png
>
> Hernan

2022-09-11 10:23:07

by Joel Fernandes

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Sat, Sep 10, 2022 at 4:41 PM Hernan Luis Ponce de Leon
<[email protected]> wrote:
>
> > You were quoting Jonas here, right? The email doesn't make this obvious
> > because it doesn't have two levels of "> > " markings.
>
> Yes, I was quoting Jonas.
> It seems my mail client did not format the email correctly and I did not notice.
> Sorry for that.
>
> > In general, _no_ two distinct relations in the LKMM have the same propagation
> > properties. If wmb always behaved the same way as mb, we wouldn't use two
> > separate words for them.
>
> I understand that relations with different names are intended to be different.
> What I meant was
> "wmb gives weaker propagation guarantees than mb and because of this, liveness of qspinlock is not guaranteed in LKMM"
>

I wonder if this sort of liveness guarantee (or lack thereof) is
really a problem in practice, where writes will eventually propagate
even if it takes a while. Is it possible to write a liveness
test case on any hardware, or is this more in the realm of theory?
Either way, quite intriguing!

Thanks,

- Joel

2022-09-11 15:36:53

by Andrea Parri

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

> Here is a litmus test showing the problem (I hope the comments are enough to relate it back to qspinlock)
>
> C Liveness
> {
> atomic_t x = ATOMIC_INIT(0);
> atomic_t y = ATOMIC_INIT(0);
> }
>
> P0(atomic_t *x) {
> // clear_pending_set_locked
> int r0 = atomic_fetch_add(2,x) ;
> }
>
> P1(atomic_t *x, int *z) {
> // queued_spin_trylock
> int r0 = atomic_read(x);
> // barrier after the initialization of nodes
> smp_wmb();
> // xchg_tail
> int r1 = atomic_cmpxchg_relaxed(x,r0,42);
> // link node into the waitqueue
> WRITE_ONCE(*z, 1);
> }
>
> P2(atomic_t *x,atomic_t *z) {
> // node initialization
> WRITE_ONCE(*z, 2);
> // queued_spin_trylock
> int r0 = atomic_read(x);
> // barrier after the initialization of nodes
> smp_wmb();
> // xchg_tail
> int r1 = atomic_cmpxchg_relaxed(x,r0,24);
> }
>
> exists (0:r0 = 24 /\ 1:r0 = 26 /\ z=2)
>
> herd7 says that the behavior is observable.
> However, if you change wmb to mb, it is no longer observable.

Indeed. For more context, this is a 3-threads extension of the 2+2W
test/behavior, cf.

https://github.com/paulmckrcu/litmus/blob/master/manual/lwn573436/C-2+2w+o-wb-o+o-wb-o.litmus

which is also allowed by the LKMM. The basic rationale for allowing
such behaviors was that we "don't think we need to care" (cf. the
comment in the link above), except that it seems the developers of the
code at stake disagreed. ;-) So this does look like a good time to
review that design choice/code.

Andrea
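For reference, the 2+2W shape behind that link looks roughly like the
following. (Reconstructed from memory of the standard test; consult the
URL above for the exact file.)

	C C-2+2w+o-wb-o+o-wb-o

	{}

	P0(int *x, int *y)
	{
		WRITE_ONCE(*x, 1);
		smp_wmb();
		WRITE_ONCE(*y, 2);
	}

	P1(int *x, int *y)
	{
		WRITE_ONCE(*y, 1);
		smp_wmb();
		WRITE_ONCE(*x, 2);
	}

	exists (x=1 /\ y=1)

That is, each thread's second (wmb-ordered) store loses the coherence
race on both variables, an outcome LKMM allows because wmb is not
A-cumulative the way a full mb is.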

2022-09-12 12:29:30

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Mon, Sep 12, 2022 at 10:46:38AM +0000, Jonas Oberhauser wrote:
> Hi Alan,
>
> Sorry for the confusion.
>
> > [...] it's certainly true (in all of these
> models) that for any finite number N, there is a feasible execution in which a loop runs for more than N iterations before the termination condition eventually becomes true.
>
> Indeed; but more interestingly, as the Theorem 5.3 in "making weak memory models fair" states, under certain conditions it suffices to look at graphs where N=1 to decide whether a loop can run forever (despite all writes propagating to all cores eventually) or will always terminate.

Cool!

> And since herd can generate all such graphs, herd could be extended to make that decision and output it, just like many other tools already do.
>
> To illuminate this on an example, consider the program sent by Peter earlier:
> WRITE_ONCE(foo, 1); while (!READ_ONCE(foo));
>
> Without the assumption that fr is prefix finite, the graph with infinitely many reads of thread 2 all reading the initial value (and hence being fr-before the store foo=1) would be allowed. However, the tools used to analyze the qspinlock all assume that fr is prefix finite, and hence that such a graph with infinitely many fr-before events does not exist. Because of that, all of the tools will say that this loop always terminates.
>
> However, if you change the code into the following:
>
> WRITE_ONCE(foo, 1); WRITE_ONCE(foo, 0); while (!READ_ONCE(foo));
>
> then even under the assumption of fr-prefix-finiteness, the coherence order in which WRITE_ONCE(foo, 1); is overwritten by WRITE_ONCE(foo, 0); of thread 2 would lead to non-terminating behaviors, and these are detected by those tools. Furthermore, if we synchronize the two stores as follows:
>
> while (! READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1); WRITE_ONCE(foo, 0); smp_store_release(&bar,1); while (!READ_ONCE(foo));
>
> then the graphs with that co become prohibited as they all have hb cycles, and the tools again will not detect any liveness violations. But if we go further and relax the release barrier as below
>
>
> while (! READ_ONCE(bar)) {}; WRITE_ONCE(foo, 1); WRITE_ONCE(foo, 0); WRITE_ONCE(bar,1); while (!READ_ONCE(foo));
>
> then the hb cycle disappears, and the coherence order in which foo=0 overwrites foo=1 becomes allowed. Again, the tools will detect this and state that thread 2 could be stuck in its while loop forever.

Neat stuff; I'll have to think about this.

> In each of these cases, the decision can be made by looking for a graph in which the loop is executed for one iteration which reads from foo=0, and checking whether that store is co-maximal. So in some sense this is all that is needed to express the idea that a program can run forever, even though only in some very limited (but common) circumstances, namely that the loop iteration that repeats forever does not create modifications to state that are observed outside the loop. Luckily this is a very common case, so these checks have proven quite useful in practice.
>
> The same kind of check could be implemented in herd together with either the implicit assumption that fr is prefix finite (like in other tools) or with some special syntax like
>
> prefix-finite fr | co as fairness
>
> which hopefully clears up the question below:
>
> > > From an engineering perspective, I think the only issue is that cat
> > > *currently* does not have any syntax for this,
>
> > Syntax for what? The difference between wmb and mb?
> > [...]
>
>
> Thanks for your patience and hoping I explained it more clearly,

Yes indeed, thank you very much.

Alan

2022-09-12 12:31:14

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Mon, Sep 12, 2022 at 10:13:33AM +0000, Jonas Oberhauser wrote:
> As I tried to explain before, this problem has nothing to do with
> stores propagating within a given time to another core. Rather it is
> due to two stores to the same location happening in a surprising
> order. I.e., both stores propagate quickly to other cores, but in a
> surprising coherence order. And if a wmb in the code is replaced by an
> mb, then this co will create a pb cycle and become forbidden.
>
> Therefore this hang should be observable on a hypothetical LKMM
> processor which makes use of all the relaxed liberty the LKMM allows.
> However according to the authors of that paper (who are my colleagues
> but I haven't been involved deeply in that work), not even Power+gcc
> allow this reordering to happen, and if that's true it is probably
> because the wmb is mapped to lwsync which is fully cumulative in Power
> but not in LKMM.

Yes, that's right. On arm64 the reordering is forbidden by
other-multicopy-atomicity, and on PPC it is forbidden by
B-cumulativity (neither of which is part of the LKMM).

If I'm not mistaken, another way to forbid it is to replace one of the
relaxed atomic accesses with an atomic access having release semantics.
Perhaps this change will find its way into the kernel source, since it
has less overhead than replacing wmb with mb.

Alan
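
For concreteness, Alan's suggestion could be sketched as a variant of Boqun's litmus test from earlier in the thread, with the relaxed cmpxchg in xchg_tail() strengthened to a release. This is only an illustration of the idea, not the actual kernel change:

```
C queued-spin-lock-release

(*
 * Variant of the earlier queued-spin-lock test: P1's
 * atomic_cmpxchg_relaxed() is replaced by atomic_cmpxchg_release(),
 * so the release-acquire chain from P0 through P1 to P2 is no
 * longer broken by a relaxed RMW.
 *)

{}

P0(int *x, atomic_t *l)
{
	WRITE_ONCE(*x, 1);
	atomic_set_release(l, 1);
}

P1(atomic_t *l)
{
	int val;
	int new;
	int old;

	val = atomic_read(l);
	new = val + 2;
	old = atomic_cmpxchg_release(l, val, new);
}

P2(int *x, atomic_t *l)
{
	int r0 = atomic_read_acquire(l);
	int r1 = READ_ONCE(*x);
}

exists (2:r0=3 /\ 2:r1=0)
```

Whether a release here (rather than upgrading wmb to mb) is the route the kernel ultimately takes is left open.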

2022-09-13 12:15:08

by Will Deacon

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
> > On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
> > > On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > I have not yet done more than glance at this one, but figured I should
> > > > send it along sooner rather than later.
> > > >
> > > > "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
> > > > Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
> > > > Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
> > > > https://arxiv.org/abs/2111.15240
> > > >
> > > > The claim is that the queued spinlocks implementation with CNA violates
> > > > LKMM but actually works on all architectures having a formal hardware
> > > > memory model.
> > > >
> > > > Thoughts?
> > >
> > > So the paper mentions the following defects:
> > >
> > > - LKMM doesn't carry a release-acquire chain across a relaxed op
> >
> > That's right, although I'm not so sure this should be considered a
> > defect...
> >
> > > - some babbling about a missing propagation -- ISTR Linux is stuffed
> > > full of them, specifically we require stores to auto propagate
> > > without help from barriers
> >
> > Not a missing propagation; a late one.
> >
> > Don't understand what you mean by "auto propagate without help from
> > barriers".
> >
> > > - some handoff that is CNA specific and I've not looked too hard at
> > > presently.
> > >
> > >
> > > I think we should address that first one in LKMM, it seems very weird to
> > > me a RmW would break the chain like that.
> >
> > An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
> >
> > If the authors wanted to keep the release-acquire chain intact, why not
> > use a cmpxchg version that has release semantics instead of going out of
> > their way to use a relaxed version?
> >
> > To put it another way, RMW accesses and release-acquire accesses are
> > unrelated concepts. You can have one without the other (in principle,
> > anyway). So a relaxed RMW is just as capable of breaking a
> > release-acquire chain as any other relaxed operation is.
> >
> > > Is there actual hardware that
> > > doesn't behave?
> >
> > Not as far as I know, although that isn't very far. Certainly an
> > other-multicopy-atomic architecture would make the litmus test succeed.
> > But the LKMM does not require other-multicopy-atomicity.
>
> My first attempt with ppcmem suggests that powerpc does -not- behave
> this way. But that surprises me, just on general principles. Most likely
> I blew the litmus test shown below.
>
> Thoughts?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> PPC MP+lwsyncs+atomic
> "LwSyncdWW Rfe LwSyncdRR Fre"
> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> {
> 0:r2=x; 0:r4=y;
> 1:r2=y; 1:r5=2;
> 2:r2=y; 2:r4=x;
> }
> P0 | P1 | P2 ;
> li r1,1 | lwarx r1,r0,r2 | lwz r1,0(r2) ;
> stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync ;
> lwsync | | lwz r3,0(r4) ;
> li r3,1 | | ;
> stw r3,0(r4) | | ;
> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)

Just catching up on this, but one possible gotcha here is if you have an
architecture with native load-acquire on P2 and then you move P2 to the end
of P1. e.g. at a high-level:


P0 P1
Wx = 1 RmW(y) // xchg() 1 => 2
WyRel = 1 RyAcq = 2
Rx = 0

arm64 forbids this, but it's not "natural" to the hardware and I don't
know what e.g. risc-v would say about it.

Will
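
Will's high-level shape could be rendered in the C litmus style used earlier in the thread roughly as follows (a sketch: the choice of atomic_cmpxchg_relaxed() for the RmW and the exact exists-clause are assumptions, not taken from his mail):

```
C MP+rel+rmw-acq

(*
 * P0 is Wx=1; WyRel=1.
 * P1 folds the old reader thread onto the RmW thread: an
 * xchg()-style update of y (1 => 2), then an acquire load of y
 * and a plain load of x.
 *)

{}

P0(int *x, atomic_t *y)
{
	WRITE_ONCE(*x, 1);
	atomic_set_release(y, 1);
}

P1(int *x, atomic_t *y)
{
	int r0 = atomic_cmpxchg_relaxed(y, 1, 2);
	int r1 = atomic_read_acquire(y);
	int r2 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=2 /\ 1:r2=0)
```

Per Will's remark, arm64 forbids this outcome even though that is not "natural" to the hardware; what the various tools say about it is left to the reader.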

2022-09-13 12:24:10

by Daniel Lustig

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On 9/13/2022 7:24 AM, Will Deacon wrote:
> On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
>> On Fri, Aug 26, 2022 at 01:10:39PM -0400, Alan Stern wrote:
>>> On Fri, Aug 26, 2022 at 06:23:24PM +0200, Peter Zijlstra wrote:
>>>> On Fri, Aug 26, 2022 at 05:48:12AM -0700, Paul E. McKenney wrote:
>>>>> Hello!
>>>>>
>>>>> I have not yet done more than glance at this one, but figured I should
>>>>> send it along sooner rather than later.
>>>>>
>>>>> "Verifying and Optimizing Compact NUMA-Aware Locks on Weak
>>>>> Memory Models", Antonio Paolillo, Hernán Ponce-de-León, Thomas
>>>>> Haas, Diogo Behrens, Rafael Chehab, Ming Fu, and Roland Meyer.
>>>>> https://arxiv.org/abs/2111.15240
>>>>>
>>>>> The claim is that the queued spinlocks implementation with CNA violates
>>>>> LKMM but actually works on all architectures having a formal hardware
>>>>> memory model.
>>>>>
>>>>> Thoughts?
>>>>
>>>> So the paper mentions the following defects:
>>>>
>>>> - LKMM doesn't carry a release-acquire chain across a relaxed op
>>>
>>> That's right, although I'm not so sure this should be considered a
>>> defect...
>>>
>>>> - some babbling about a missing propagation -- ISTR Linux is stuffed
>>>> full of them, specifically we require stores to auto propagate
>>>> without help from barriers
>>>
>>> Not a missing propagation; a late one.
>>>
>>> Don't understand what you mean by "auto propagate without help from
>>> barriers".
>>>
>>>> - some handoff that is CNA specific and I've not looked too hard at
>>>> presently.
>>>>
>>>>
>>>> I think we should address that first one in LKMM, it seems very weird to
>>>> me a RmW would break the chain like that.
>>>
>>> An explicitly relaxed RMW (atomic_cmpxchg_relaxed(), to be precise).
>>>
>>> If the authors wanted to keep the release-acquire chain intact, why not
>>> use a cmpxchg version that has release semantics instead of going out of
>>> their way to use a relaxed version?
>>>
>>> To put it another way, RMW accesses and release-acquire accesses are
>>> unrelated concepts. You can have one without the other (in principle,
>>> anyway). So a relaxed RMW is just as capable of breaking a
>>> release-acquire chain as any other relaxed operation is.
>>>
>>>> Is there actual hardware that
>>>> doesn't behave?
>>>
>>> Not as far as I know, although that isn't very far. Certainly an
>>> other-multicopy-atomic architecture would make the litmus test succeed.
>>> But the LKMM does not require other-multicopy-atomicity.
>>
>> My first attempt with ppcmem suggests that powerpc does -not- behave
>> this way. But that surprises me, just on general principles. Most likely
>> I blew the litmus test shown below.
>>
>> Thoughts?
>>
>> Thanx, Paul
>>
>> ------------------------------------------------------------------------
>>
>> PPC MP+lwsyncs+atomic
>> "LwSyncdWW Rfe LwSyncdRR Fre"
>> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
>> {
>> 0:r2=x; 0:r4=y;
>> 1:r2=y; 1:r5=2;
>> 2:r2=y; 2:r4=x;
>> }
>> P0 | P1 | P2 ;
>> li r1,1 | lwarx r1,r0,r2 | lwz r1,0(r2) ;
>> stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync ;
>> lwsync | | lwz r3,0(r4) ;
>> li r3,1 | | ;
>> stw r3,0(r4) | | ;
>> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
>
> Just catching up on this, but one possible gotcha here is if you have an
> architecture with native load-acquire on P2 and then you move P2 to the end
> of P1. e.g. at a high-level:
>
>
> P0 P1
> Wx = 1 RmW(y) // xchg() 1 => 2
> WyRel = 1 RyAcq = 2
> Rx = 0
>
> arm64 forbids this, but it's not "natural" to the hardware and I don't
> know what e.g. risc-v would say about it.
>
> Will

RISC-V doesn't currently have native load-acquire instructions other than
load-reserve-acquire, but if it did, it would forbid this outcome as well.

To the broader question, RISC-V is other-multi-copy-atomic, so questions
about propagation order and B-cumulativity and so on aren't generally
problematic, just like they generally aren't an issue for ARMv8.

Dan

2022-09-14 15:08:01

by Alan Stern

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Mon, Sep 12, 2022 at 11:10:17AM +0000, Hernan Luis Ponce de Leon wrote:
> I think it is somehow possible to show the liveness violation using herd7 and the following variant of the code
>
> ------------------------------------------------------------------------
> C Liveness
> {
> atomic_t x = ATOMIC_INIT(0);
> atomic_t y = ATOMIC_INIT(0);
> }
>
>
> P0(atomic_t *x) {
> // clear_pending_set_locked
> int r0 = atomic_fetch_add(2,x) ;
> }
>
> P1(atomic_t *x, int *z, int *y) {
> // this store breaks liveness
> WRITE_ONCE(*y, 1);
> // queued_spin_trylock
> int r0 = atomic_read(x);
> // barrier after the initialisation of nodes
> smp_wmb();
> // xchg_tail
> int r1 = atomic_cmpxchg_relaxed(x,r0,42);
> // link node into the waitqueue
> WRITE_ONCE(*z, 1);
> }
>
> P2(atomic_t *x,int *z, int *y) {
> // node initialisation
> WRITE_ONCE(*z, 2);
> // queued_spin_trylock
> int r0 = atomic_read(x);
> // barrier after the initialisation of nodes
> smp_wmb();
> // if we read z==2 we expect to read this store
> WRITE_ONCE(*y, 0);
> // xchg_tail
> int r1 = atomic_cmpxchg_relaxed(x,r0,24);
> // spinloop
> int r2 = READ_ONCE(*y);
> int r3 = READ_ONCE(*z);
> }
>
> exists (z=2 /\ y=1 /\ 2:r2=1 /\ 2:r3=2)
> ------------------------------------------------------------------------
>
> Condition "2:r3=2" forces the spinloop to read from the first write in P2 and "z=2" forces this write
> to be last in the coherence order. Conditions "2:r2=1" and "y=1" force the same for the other read.
> herd7 says this behavior is allowed by LKMM, showing that liveness can be violated.
>
> In all the examples above, if we use mb() instead of wmb(), LKMM does not accept
> the behavior and thus liveness is guaranteed.

That's right.

The issue may be somewhat confused by the fact that there have been
_two_ separate problems covered in this discussion. One has to do with
the use of relaxed vs. non-relaxed atomic accesses, and the other --
this one -- has to do with liveness (eventual termination of a spin
loop).

You can see the distinction quite clearly by noticing in the litmus test
above, the variable x plays absolutely no role. There are no
dependencies from it, it isn't accessed by any instructions that include
an acquire or release memory barrier, and it isn't used in the "exists"
clause. If we remove x from the test (and remove P0, which is now
vacuous), and we also remove the unneeded reads at the end of P2
(unneeded because they observe the co-final values stored in y and z),
what remains is:

P1(int *z, int *y) {
WRITE_ONCE(*y, 1);
smp_wmb();
WRITE_ONCE(*z, 1);
}

P2(int *z, int *y) {
WRITE_ONCE(*z, 2);
smp_wmb();
WRITE_ONCE(*y, 0);
}

exists (z=2 /\ y=1)

which is exactly the 2+2W pattern. As others have pointed out, this
pattern is permitted by LKMM because we never found a good reason to
forbid it, even though it cannot occur on any real hardware that
supports Linux.
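
To make the 2+2W point concrete, a small sketch (in Python, not part of the thread's toolchain) enumerates every sequentially consistent interleaving of the two stores per thread and checks the final state. No interleaving reaches z=2 /\ y=1, so the outcome is only reachable on models weaker than SC:

```python
from itertools import combinations

# Program order of each thread's stores (the 2+2W pattern above)
P1 = [("y", 1), ("z", 1)]
P2 = [("z", 2), ("y", 0)]

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's order."""
    n = len(a) + len(b)
    for idx in combinations(range(n), len(a)):
        seq, ai, bi = [], 0, 0
        for i in range(n):
            if i in idx:
                seq.append(a[ai]); ai += 1
            else:
                seq.append(b[bi]); bi += 1
        yield seq

hits = 0
for seq in interleavings(P1, P2):
    mem = {"y": 0, "z": 0}
    for var, val in seq:
        mem[var] = val            # apply stores in interleaved order
    if mem == {"z": 2, "y": 1}:   # the exists-clause of the litmus test
        hits += 1

print(hits)  # -> 0: no SC interleaving yields z=2 /\ y=1
```

Intuitively: for y to end at 1, P1's y=1 must follow P2's y=0; but y=1 precedes z=1 in program order and z=2 precedes y=0, so z=1 would have to follow z=2, contradicting z ending at 2.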

On the other hand, the simplicity of this "liveness" test leads me to
wonder if it isn't missing some crucial feature of the actual qspinlock
implementation.

Alan

2022-09-16 09:19:23

by Will Deacon

Subject: Re: "Verifying and Optimizing Compact NUMA-Aware Locks on Weak Memory Models"

On Tue, Sep 13, 2022 at 08:21:02AM -0400, Dan Lustig wrote:
> On 9/13/2022 7:24 AM, Will Deacon wrote:
> > On Fri, Aug 26, 2022 at 01:42:19PM -0700, Paul E. McKenney wrote:
> >> PPC MP+lwsyncs+atomic
> >> "LwSyncdWW Rfe LwSyncdRR Fre"
> >> Cycle=Rfe LwSyncdRR Fre LwSyncdWW
> >> {
> >> 0:r2=x; 0:r4=y;
> >> 1:r2=y; 1:r5=2;
> >> 2:r2=y; 2:r4=x;
> >> }
> >> P0 | P1 | P2 ;
> >> li r1,1 | lwarx r1,r0,r2 | lwz r1,0(r2) ;
> >> stw r1,0(r2) | stwcx. r5,r0,r2 | lwsync ;
> >> lwsync | | lwz r3,0(r4) ;
> >> li r3,1 | | ;
> >> stw r3,0(r4) | | ;
> >> exists (1:r1=1 /\ 2:r1=2 /\ 2:r3=0)
> >
> > Just catching up on this, but one possible gotcha here is if you have an
> > architecture with native load-acquire on P2 and then you move P2 to the end
> > of P1. e.g. at a high-level:
> >
> >
> > P0 P1
> > Wx = 1 RmW(y) // xchg() 1 => 2
> > WyRel = 1 RyAcq = 2
> > Rx = 0
> >
> > arm64 forbids this, but it's not "natural" to the hardware and I don't
> > know what e.g. risc-v would say about it.
> >
>
> RISC-V doesn't currently have native load-acquire instructions other than
> load-reserve-acquire, but if it did, it would forbid this outcome as well.

Thanks for chipping in, Dan. Somehow, I hadn't realised that you didn't
have native load-acquire instructions.

Will