2017-12-01 15:32:18

by Alan Stern

Subject: Re: Unlock-lock questions and the Linux Kernel Memory Model

On Fri, 1 Dec 2017, Boqun Feng wrote:

> > > But in case of AMOs, which directly send the addition request to memory
> > > controller, so there wouldn't be any read part or even write part of the
> > > atomic_inc() executed by CPU. Would this be allowed then?
> >
> > Firstly, sending the addition request to the memory controller _is_ a
> > write operation.
> >
> > Secondly, even though the CPU hardware might not execute a read
> > operation during an AMO, the LKMM and herd nevertheless represent the
> > atomic update as a specially-annotated read event followed by a write
> > event.
> >
>
> Ah, right! From the point of view of the model, there are read events
> and write events for the atomics.
>
> > In an other-multicopy-atomic system, P0's write to y must become
> > visible to P1 before P1 executes the smp_load_acquire, because the
> > write was visible to the memory controller when the controller carried
> > out the AMO, and the write becomes visible to the memory controller and
> > to P1 at the same time (by other-multicopy-atomicity). That's why I
> > said the test would be forbidden on ARM.
> >
>
> Agreed.
>
> > But even on a non-other-multicopy-atomic system, there has to be some
> > synchronization between the memory controller and P1's CPU. Otherwise,
> > how could the system guarantee that P1's smp_load_acquire would see the
> > post-increment value of y? It seems reasonable to assume that this
> > synchronization would also cause P1 to see x=1.
> >
>
> I agree with you the "reasonable" part ;-) So basically, memory
> controller could only do the write of AMO until P0's second write
> propagated to the memory controller (and because of the wmb(), P0's first
> write must be already propagated to the memory controller, too), so it
> makes sense when the write of AMO propagated from memory controller to
> P1, P0's first write is also propagated to P1. IOW, the write of AMO on
> memory controller acts at least like a release.
>
> However, some part of myself is still a little paranoid, because to my
> understanding, the point of AMO is to get atomic operations executing
> as fast as possible, so maybe, AMO has some fast path for the memory
> controller to forward a write to the CPU that issues the AMO, in that
> way, it will become unreasonable ;-)

It's true that a hardware design in the future might behave differently
from current hardware. If that ever happens, we will need to rethink
the situation. Maybe the designers will change their hardware to make
it match the memory model. Or maybe the memory model will change.

And it's certainly possible to write a litmus test which emulates this
situation:

C MP+wmb+emulated-amo-acq

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y, int *u, int *v)
{
	WRITE_ONCE(*u, 1);
	r1 = READ_ONCE(*v);
	smp_rmb();
	r2 = smp_load_acquire(y);
	r3 = READ_ONCE(*x);
}

P2(int *y, int *u, int *v)
{
	r4 = READ_ONCE(*u);
	if (r4 != 0) {
		atomic_inc(y);
		smp_wmb();
		WRITE_ONCE(*v, 1);
	}
}

exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0 /\ 2:r4=1)

Here P1 tells P2 to perform the atomic increment by setting u to 1, and
P2 tells P1 that the increment is finished by setting v to 1. This
test is allowed by the LKMM, because the wmb in P2 is not A-cumulative.
On the other hand, store-release is A-cumulative -- the test would be
forbidden if P2 did "smp_store_release(v, 1)" rather than "smp_wmb() ;
WRITE_ONCE(*v, 1)".
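
For comparison, the store-release variant described above can be sketched
as follows. The test name is made up for illustration, and the result has
not been checked with herd here; per the A-cumulativity argument above,
the "exists" clause should not be satisfiable.

```
C MP+wmb+emulated-amo-acq-release

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y, int *u, int *v)
{
	WRITE_ONCE(*u, 1);
	r1 = READ_ONCE(*v);
	smp_rmb();
	r2 = smp_load_acquire(y);
	r3 = READ_ONCE(*x);
}

P2(int *y, int *u, int *v)
{
	r4 = READ_ONCE(*u);
	if (r4 != 0) {
		atomic_inc(y);
		smp_store_release(v, 1);
	}
}

exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0 /\ 2:r4=1)
```

The only change from the test above is in P2, where the "smp_wmb() ;
WRITE_ONCE(*v, 1)" pair is replaced by a single store-release to v.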

> With that in mind, I think it's better if herd could provide the type
> annotations of atomics for the read and write parts, and we handle it
> inside the LKMM's cats and bells, rather than letting herd provide the
> internal dependency by default.

herd already does provide this information via the rmw relation.

Alan


2017-12-01 16:17:07

by Daniel Lustig

Subject: Re: Unlock-lock questions and the Linux Kernel Memory Model

On 12/1/2017 7:32 AM, Alan Stern wrote:
> On Fri, 1 Dec 2017, Boqun Feng wrote:
>>> But even on a non-other-multicopy-atomic system, there has to be some
>>> synchronization between the memory controller and P1's CPU. Otherwise,
>>> how could the system guarantee that P1's smp_load_acquire would see the
>>> post-increment value of y? It seems reasonable to assume that this
>>> synchronization would also cause P1 to see x=1.
>>>
>>
>> I agree with you the "reasonable" part ;-) So basically, memory
>> controller could only do the write of AMO until P0's second write
>> propagated to the memory controller (and because of the wmb(), P0's first
>> write must be already propagated to the memory controller, too), so it
>> makes sense when the write of AMO propagated from memory controller to
>> P1, P0's first write is also propagated to P1. IOW, the write of AMO on
>> memory controller acts at least like a release.
>>
>> However, some part of myself is still a little paranoid, because to my
>> understanding, the point of AMO is to get atomic operations executing
>> as fast as possible, so maybe, AMO has some fast path for the memory
>> controller to forward a write to the CPU that issues the AMO, in that
>> way, it will become unreasonable ;-)
>
> It's true that a hardware design in the future might behave differently
> from current hardware. If that ever happens, we will need to rethink
> the situation. Maybe the designers will change their hardware to make
> it match the memory model. Or maybe the memory model will change.

Do you mean all of the above in the context of increment etc, as opposed
to swap? ARM hardware in the wild is already documented as forwarding
SWP values to subsequent loads early, even past control dependencies.
Paul sent this link earlier in the thread.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html

The reason swap is special is because its store value is available to be
forwarded even before the AMO goes out to the memory controller or
wherever else it gets its load value from.

Also, the case I described is an acquire rather than a control
dependency, but it's similar enough that it doesn't seem completely
unrealistic to think hardware might try to do this.

Dan

2017-12-01 16:24:00

by Will Deacon

Subject: Re: Unlock-lock questions and the Linux Kernel Memory Model

On Fri, Dec 01, 2017 at 08:17:04AM -0800, Daniel Lustig wrote:
> On 12/1/2017 7:32 AM, Alan Stern wrote:
> > On Fri, 1 Dec 2017, Boqun Feng wrote:
> >>> But even on a non-other-multicopy-atomic system, there has to be some
> >>> synchronization between the memory controller and P1's CPU. Otherwise,
> >>> how could the system guarantee that P1's smp_load_acquire would see the
> >>> post-increment value of y? It seems reasonable to assume that this
> >>> synchronization would also cause P1 to see x=1.
> >>>
> >>
> >> I agree with you the "reasonable" part ;-) So basically, memory
> >> controller could only do the write of AMO until P0's second write
> >> propagated to the memory controller(and because of the wmb(), P0's first
> >> write must be already propagated to the memory controller, too), so it
> >> makes sense when the write of AMO propagated from memory controller to
> >> P1, P0's first write is also propagated to P1. IOW, the write of AMO on
> >> memory controller acts at least like a release.
> >>
> >> However, some part of myself is still a little paranoid, because to my
> >> understanding, the point of AMO is to get atomic operations executing
> >> as fast as possible, so maybe, AMO has some fast path for the memory
> >> controller to forward a write to the CPU that issues the AMO, in that
> >> way, it will become unreasonable ;-)
> >
> > It's true that a hardware design in the future might behave differently
> > from current hardware. If that ever happens, we will need to rethink
> > the situation. Maybe the designers will change their hardware to make
> > it match the memory model. Or maybe the memory model will change.
>
> Do you mean all of the above in the context of increment etc, as opposed
> to swap? ARM hardware in the wild is already documented as forwarding
> SWP values to subsequent loads early, even past control dependencies.
> Paul sent this link earlier in the thread.
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html
>
> The reason swap is special is because its store value is available to be
> forwarded even before the AMO goes out to the memory controller or
> wherever else it gets its load value from.
>
> Also, the case I described is an acquire rather than a control
> dependency, but it's similar enough that it doesn't seem completely
> unrealistic to think hardware might try to do this.

To be clear: we don't forward from a SWP to a load with Acquire semantics,
so the distinction is an important one.

Will

2017-12-01 17:18:39

by Alan Stern

Subject: Re: Unlock-lock questions and the Linux Kernel Memory Model

On Fri, 1 Dec 2017, Daniel Lustig wrote:

> On 12/1/2017 7:32 AM, Alan Stern wrote:
> > On Fri, 1 Dec 2017, Boqun Feng wrote:
> >>> But even on a non-other-multicopy-atomic system, there has to be some
> >>> synchronization between the memory controller and P1's CPU. Otherwise,
> >>> how could the system guarantee that P1's smp_load_acquire would see the
> >>> post-increment value of y? It seems reasonable to assume that this
> >>> synchronization would also cause P1 to see x=1.
> >>>
> >>
> >> I agree with you the "reasonable" part ;-) So basically, memory
> >> controller could only do the write of AMO until P0's second write
> >> propagated to the memory controller(and because of the wmb(), P0's first
> >> write must be already propagated to the memory controller, too), so it
> >> makes sense when the write of AMO propagated from memory controller to
> >> P1, P0's first write is also propagated to P1. IOW, the write of AMO on
> >> memory controller acts at least like a release.
> >>
> >> However, some part of myself is still a little paranoid, because to my
> >> understanding, the point of AMO is to get atomic operations executing
> >> as fast as possible, so maybe, AMO has some fast path for the memory
> >> controller to forward a write to the CPU that issues the AMO, in that
> >> way, it will become unreasonable ;-)
> >
> > It's true that a hardware design in the future might behave differently
> > from current hardware. If that ever happens, we will need to rethink
> > the situation. Maybe the designers will change their hardware to make
> > it match the memory model. Or maybe the memory model will change.
>
> Do you mean all of the above in the context of increment etc, as opposed
> to swap? ARM hardware in the wild is already documented as forwarding
> SWP values to subsequent loads early, even past control dependencies.
> Paul sent this link earlier in the thread.
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html
>
> The reason swap is special is because its store value is available to be
> forwarded even before the AMO goes out to the memory controller or
> wherever else it gets its load value from.

I believe the current intention for herd is as follows:

	xchg() and similar RMW operations do not generate an internal
	dependency;

	cmpxchg() and similar RMW operations generate an internal
	control dependency;

	atomic_add() and similar RMW operations generate an internal
	data dependency.

If herd adds support for saturating operations, they will generate at
least a data dependency and maybe also a control dependency.

Alan

> Also, the case I described is an acquire rather than a control
> dependency, but it's similar enough that it doesn't seem completely
> unrealistic to think hardware might try to do this.
>
> Dan