2023-07-03 20:16:13

by Olivier Dion

Subject: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

Hi all,

This is a request for comments on extending the atomic builtins API to
help avoid redundant memory barriers. Indeed, there are discrepancies
between the Linux kernel memory consistency model (LKMM) and the
C11/C++11 memory consistency model [0]. For example, fully-ordered
atomic operations like xchg and cmpxchg success in LKMM have implicit
memory barriers before/after the operations [1-2], while atomic
operations using the __ATOMIC_SEQ_CST memory order in C11/C++11 do not
provide the same ordering guarantees as an atomic thread fence
__ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].

For a bit of context, we are porting liburcu [4] to the atomic
builtins. Before that, liburcu used its own implementation of atomic
operations, and its CMM memory consistency model mimicked the LKMM.
liburcu is now extending its CMM memory consistency model to come
closer to the C11/C++11 memory consistency model, with the exception of
the extra SEQ_CST_FENCE memory order, which is similar to SEQ_CST but
ensures that a thread fence is emitted. This is necessary for backward
compatibility of the liburcu uatomic API, but also for closing the gap
between the LKMM and the C11/C++11 memory consistency model. For
example, to make Read-Modify-Write (RMW) operations match the Linux
kernel "full barrier before/after" semantics, liburcu's uatomic API has
to emit both a SEQ_CST RMW operation and a subsequent SEQ_CST thread
fence, which leads to duplicated barriers in some cases.

Consider for example the following Dekker and the resulting assembly
generated:

int x = 0;
int y = 0;
int r0, r1;

int dummy;

void t0(void)
{
        __atomic_store_n(&x, 1, __ATOMIC_RELAXED);

        __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);

        r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
}

void t1(void)
{
        __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
}

// BUG_ON(r0 == 0 && r1 == 0)

On x86-64 (gcc 13.1 -O2) we get:

t0():
        movl $1, x(%rip)
        movl $1, %eax
        xchgl dummy(%rip), %eax
        lock orq $0, (%rsp) ;; Redundant with previous exchange.
        movl y(%rip), %eax
        movl %eax, r0(%rip)
        ret
t1():
        movl $1, y(%rip)
        lock orq $0, (%rsp)
        movl x(%rip), %eax
        movl %eax, r1(%rip)
        ret

On x86-64 (clang 16 -O2) we get:

t0():
        movl $1, x(%rip)
        movl $1, %eax
        xchgl %eax, dummy(%rip)
        mfence ;; Redundant with previous exchange.
        movl y(%rip), %eax
        movl %eax, r0(%rip)
        retq
t1():
        movl $1, y(%rip)
        mfence
        movl x(%rip), %eax
        movl %eax, r1(%rip)
        retq

On armv8-a (gcc 13.1 -O2) we get:

t0():
        adrp x0, .LANCHOR0
        mov w1, 1
        add x0, x0, :lo12:.LANCHOR0
        str w1, [x0]
        add x1, x0, 4
        mov w2, 1
.L3:
        ldaxr w3, [x1]
        stlxr w4, w2, [x1]
        cbnz w4, .L3
        dmb ish ;; Okay!
        add x1, x0, 8
        ldr w1, [x1]
        str w1, [x0, 12]
        ret
t1():
        adrp x0, .LANCHOR0
        add x0, x0, :lo12:.LANCHOR0
        add x1, x0, 8
        mov w2, 1
        str w2, [x1]
        dmb ish
        ldr w1, [x0]
        str w1, [x0, 16]
        ret

On armv8.1-a (gcc 13.1 -O2) we get:

t0():
        adrp x0, .LANCHOR0
        mov w1, 1
        add x0, x0, :lo12:.LANCHOR0
        str w1, [x0]
        add x2, x0, 4
        swpal w1, w1, [x2]
        dmb ish ;; Okay!
        add x1, x0, 8
        ldr w1, [x1]
        str w1, [x0, 12]
        ret
t1():
        adrp x0, .LANCHOR0
        add x0, x0, :lo12:.LANCHOR0
        add x1, x0, 8
        mov w2, 1
        str w2, [x1]
        dmb ish
        ldr w1, [x0]
        str w1, [x0, 16]
        ret

For the initial transition to the atomic builtins in liburcu, we plan
on emitting the extra memory barriers to ensure correctness, at the
expense of performance. However, new primitives in the atomic builtins
API would help avoid these redundant thread fences.

Indeed, eliminating redundant memory fences is often done in the Linux
kernel. For example in kernel/sched/core.c:try_to_wake_up():

/*
* smp_mb__after_spinlock() provides the equivalent of a full memory
* barrier between program-order earlier lock acquisitions and
* program-order later memory accesses.
* ...
* Since most load-store architectures implement ACQUIRE with an
* smp_mb() after the LL/SC loop, they need no further barriers.
* Similarly all our TSO architectures imply an smp_mb() for each
* atomic instruction and equally don't need more.
*/
raw_spin_lock_irqsave(&p->pi_lock, flags);
smp_mb__after_spinlock();

Therefore, we propose to add the following primitives to the atomic
builtins API:

__atomic_thread_fence_before_load(int load_memorder, int fence_memorder)
__atomic_thread_fence_after_load(int load_memorder, int fence_memorder)

__atomic_thread_fence_before_store(int store_memorder, int fence_memorder)
__atomic_thread_fence_after_store(int store_memorder, int fence_memorder)

__atomic_thread_fence_before_clear(int clear_memorder, int fence_memorder)
__atomic_thread_fence_after_clear(int clear_memorder, int fence_memorder)

__atomic_thread_fence_before_rmw(int rmw_memorder, int fence_memorder)
__atomic_thread_fence_after_rmw(int rmw_memorder, int fence_memorder)

All primitives follow the template:

__atomic_thread_fence_{before|after}_<operation>(int OM, int FM)

Depending on the implementation of <operation> with memory order OM, a
thread fence may be emitted {before|after} it with memory order FM. The
`rmw' operation is an umbrella term for all operations which perform a
RMW sequence, used here to keep the proposal short. More primitives
could be added, allowing greater flexibility for the toolchains in
their implementation.

In the previous Dekker, t0() can now be re-written as:

void t0(void)
{
        __atomic_store_n(&x, 1, __ATOMIC_RELAXED);

        __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
        __atomic_thread_fence_after_rmw(__ATOMIC_SEQ_CST,
                                        __ATOMIC_SEQ_CST);

        r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
}

because the exchange has a memory order of SEQ_CST and we want a thread
fence with memory order SEQ_CST after the operation.
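
For reference, here is a minimal sketch -- not the actual liburcu code --
of how a fully-ordered exchange wrapper has to be written today with the
existing builtins; with the proposed primitives, the explicit fence below
would simply become __atomic_thread_fence_after_rmw(__ATOMIC_SEQ_CST,
__ATOMIC_SEQ_CST), letting the toolchain elide it on x86-64:

/* Minimal sketch, not the actual liburcu implementation: a
 * "fully-ordered" exchange in the LKMM sense, written with today's
 * builtins. The explicit SEQ_CST fence is redundant on x86-64, where
 * the lock-prefixed xchg already acts as a full barrier. */
static inline int xchg_full_barrier(int *addr, int val)
{
        int old = __atomic_exchange_n(addr, val, __ATOMIC_SEQ_CST);

        /* With the proposed API, this line would become:
         * __atomic_thread_fence_after_rmw(__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
         */
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        return old;
}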

Finally, here is what we could expect for some architectures:

On x86-64 we would have (based on gcc 13.1 [5]):

// Always emit thread fence.
__atomic_thread_fence_{before,after}_load(int load_memorder,
                                          int fence_memorder)

// NOP for store_memorder == SEQ_CST.
// Otherwise, emit thread fence.
__atomic_thread_fence_{before,after}_store(int store_memorder,
                                           int fence_memorder)

// NOP for clear_memorder == SEQ_CST.
// Otherwise, emit thread fence.
__atomic_thread_fence_{before,after}_clear(int clear_memorder,
                                           int fence_memorder)

// Always NOP.
__atomic_thread_fence_{before,after}_rmw(int rmw_memorder,
                                         int fence_memorder)

On aarch64 we would have (based on gcc 13.1 [6]):

// Always emit thread fence.
__atomic_thread_fence_{before,after}_load(int load_memorder,
                                          int fence_memorder)

// Always emit thread fence.
__atomic_thread_fence_{before,after}_store(int store_memorder,
                                           int fence_memorder)

// Always emit thread fence.
__atomic_thread_fence_{before,after}_clear(int clear_memorder,
                                           int fence_memorder)

// Always emit thread fence.
__atomic_thread_fence_{before,after}_rmw(int rmw_memorder,
                                         int fence_memorder)

NOTE: On x86-64, we found at least one corner case [7] with Clang where
a RELEASE exchange is optimized to a RELEASE store, when the returned
value of the exchange is unused, breaking the above expectations.
Although this type of optimization respects the standard's "as-if"
rule, we question its pertinence, since a user should simply do a
RELEASE store instead of an exchange in that case. With the
introduction of these new primitives, this type of optimization should
be revisited.

NOTE 2: You can test the Dekker -- without the memory barrier in t0() --
in CppMem [8] with this code [9] to see that r0 == 0 && r1 == 0 is
possible. The exchange is replaced by a store in this example because
the tool does not support exchange.

[0] https://www.open-std.org/JTC1/SC22/WG21/docs/papers/2020/p0124r7.html#So%20You%20Want%20Your%20Arch%20To%20Use%20C11%20Atomics...

[1] Linux kernel Documentation/atomic_t.txt

[2] Linux kernel Documentation/memory-barriers.txt

[3] https://en.cppreference.com/w/c/atomic/memory_order#Sequentially-consistent_ordering

[4] https://liburcu.org

[5] https://godbolt.org/z/ch1j3bW93

Reproduce with x86-64 gcc 13.1 -O2:
--8<---------------cut here---------------start------------->8---
int load_relaxed(int *x)
{
return __atomic_load_n(x, __ATOMIC_RELAXED);
}

int load_consume(int *x)
{
return __atomic_load_n(x, __ATOMIC_CONSUME);
}

int load_acquire(int *x)
{
return __atomic_load_n(x, __ATOMIC_ACQUIRE);
}

int load_seq_cst(int *x)
{
return __atomic_load_n(x, __ATOMIC_SEQ_CST);
}

void store_relaxed(int *x, int y)
{
__atomic_store_n(x, y, __ATOMIC_RELAXED);
}

void store_release(int *x, int y)
{
__atomic_store_n(x, y, __ATOMIC_RELEASE);
}

void store_seq_cst(int *x, int y)
{
__atomic_store_n(x, y, __ATOMIC_SEQ_CST);
}

void clear_relaxed(char *x)
{
__atomic_clear(x, __ATOMIC_RELAXED);
}

void clear_release(char *x)
{
__atomic_clear(x, __ATOMIC_RELEASE);
}

void clear_seq_cst(char *x)
{
__atomic_clear(x, __ATOMIC_SEQ_CST);
}

int exchange_relaxed(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_RELAXED);
}

int exchange_consume(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_CONSUME);
}

int exchange_acquire(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_ACQUIRE);
}

int exchange_release(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_RELEASE);
}

int exchange_acq_rel(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_ACQ_REL);
}

int exchange_seq_cst(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_SEQ_CST);
}
--8<---------------cut here---------------end--------------->8---

[6] https://godbolt.org/z/WT4qj636b

Reproduce with ARM64 gcc 13.1.0 -O2 -march=armv8.1-a:
--8<---------------cut here---------------start------------->8---
int load_relaxed(int *x)
{
return __atomic_load_n(x, __ATOMIC_RELAXED);
}

int load_consume(int *x)
{
return __atomic_load_n(x, __ATOMIC_CONSUME);
}

int load_acquire(int *x)
{
return __atomic_load_n(x, __ATOMIC_ACQUIRE);
}

int load_seq_cst(int *x)
{
return __atomic_load_n(x, __ATOMIC_SEQ_CST);
}

void store_relaxed(int *x, int y)
{
__atomic_store_n(x, y, __ATOMIC_RELAXED);
}

void store_release(int *x, int y)
{
__atomic_store_n(x, y, __ATOMIC_RELEASE);
}

void store_seq_cst(int *x, int y)
{
__atomic_store_n(x, y, __ATOMIC_SEQ_CST);
}

void clear_relaxed(char *x)
{
__atomic_clear(x, __ATOMIC_RELAXED);
}

void clear_release(char *x)
{
__atomic_clear(x, __ATOMIC_RELEASE);
}

void clear_seq_cst(char *x)
{
__atomic_clear(x, __ATOMIC_SEQ_CST);
}

int exchange_relaxed(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_RELAXED);
}

int exchange_consume(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_CONSUME);
}

int exchange_acquire(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_ACQUIRE);
}

int exchange_release(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_RELEASE);
}

int exchange_acq_rel(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_ACQ_REL);
}

int exchange_seq_cst(int *x, int y)
{
return __atomic_exchange_n(x, y, __ATOMIC_SEQ_CST);
}
--8<---------------cut here---------------end--------------->8---

[7] https://godbolt.org/z/ssvbr9qvE

Reproduce with x86-64 clang 16.0.0 -O2:
--8<---------------cut here---------------start------------->8---
void exchange_release_no_return(int *x, int y)
{
        __atomic_exchange_n(x, y, __ATOMIC_RELEASE);
}

int exchange_release(int *x, int y)
{
        return __atomic_exchange_n(x, y, __ATOMIC_RELEASE);
}
--8<---------------cut here---------------end--------------->8---

[8] http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

[9]:
--8<---------------cut here---------------start------------->8---
int main()
{
        atomic_int x = 0;
        atomic_int y = 0;
        atomic_int dummy = 0;

        {{{
                {
                        x.store(1, memory_order_relaxed);
                        // CppMem does not support exchange.
                        dummy.store(1, memory_order_seq_cst);
                        r0 = y.load(memory_order_relaxed);
                }

                |||

                {
                        y.store(1, memory_order_relaxed);
                        atomic_thread_fence(memory_order_seq_cst);
                        r1 = x.load(memory_order_relaxed);
                }
        }}}

        return 0;
}
--8<---------------cut here---------------end--------------->8---
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com



2023-07-03 20:36:32

by Alan Stern

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
> Hi all,
>
> This is a request for comments on extending the atomic builtins API to
> help avoiding redundant memory barriers. Indeed, there are

What atomic builtins API are you talking about? The kernel's? That's
what it sounded like when I first read this sentence -- why else post
your message on a kernel mailing list?

> discrepancies between the Linux kernel consistency memory model (LKMM)
> and the C11/C++11 memory consistency model [0]. For example,

Indeed. The kernel's usage of C differs from the standard in several
respects, and there's no particular reason for its memory model to match
the standard's.

> fully-ordered atomic operations like xchg and cmpxchg success in LKMM
> have implicit memory barriers before/after the operations [1-2], while
> atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
> do not have any ordering guarantees of an atomic thread fence
> __ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].

After reading what you wrote below, I realized that the API you're
thinking of modifying is the one used by liburcu for user programs.
It's a shame you didn't mention this in either the subject line or the
first few paragraphs of the email; that would have made understanding
the message a little easier.

In any case, your proposal seems reasonable to me at first glance, with
two possible exceptions:

1. I can see why you have special fences for before/after load,
store, and rmw operations. But why clear? In what way is
clearing an atomic variable different from storing a 0 in it?

2. You don't have a special fence for use after initializing an
atomic. This operation can be treated specially, because at the
point where an atomic is initialized, it generally has not yet
been made visible to any other threads. Therefore the fence
which would normally appear after a store (or clear) generally
need not appear after an initialization, and you might want to
add a special API to force the generation of such a fence.

Alan Stern

2023-07-04 10:01:32

by Peter Zijlstra

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:

> int x = 0;
> int y = 0;
> int r0, r1;
>
> int dummy;
>
> void t0(void)
> {
> __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
>
> __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
> __atomic_thread_fence(__ATOMIC_SEQ_CST);
>
> r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
> }
>
> void t1(void)
> {
> __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
> __atomic_thread_fence(__ATOMIC_SEQ_CST);
> r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
> }
>
> // BUG_ON(r0 == 0 && r1 == 0)
>
> On x86-64 (gcc 13.1 -O2) we get:
>
> t0():
> movl $1, x(%rip)
> movl $1, %eax
> xchgl dummy(%rip), %eax
> lock orq $0, (%rsp) ;; Redundant with previous exchange.
> movl y(%rip), %eax
> movl %eax, r0(%rip)
> ret
> t1():
> movl $1, y(%rip)
> lock orq $0, (%rsp)
> movl x(%rip), %eax
> movl %eax, r1(%rip)
> ret

So I would expect the compilers to do better here. It should know those
__atomic_thread_fence() thingies are superfluous and simply not emit
them. This could even be done as a peephole pass later, where it sees
consecutive atomic ops and the second being a no-op.

> On x86-64 (clang 16 -O2) we get:
>
> t0():
> movl $1, x(%rip)
> movl $1, %eax
> xchgl %eax, dummy(%rip)
> mfence ;; Redundant with previous exchange.

And that's just terrible :/ Nobody should be using MFENCE for this. And
using MFENCE after a LOCK-prefixed instruction (implicit in this case)
is just fail, because I don't think C++ atomics cover MMIO and other
such 'lovely' things.

> movl y(%rip), %eax
> movl %eax, r0(%rip)
> retq
> t1():
> movl $1, y(%rip)
> mfence
> movl x(%rip), %eax
> movl %eax, r1(%rip)
> retq


2023-07-04 10:41:44

by Jonathan Wakely

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote:
>
> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
>
> > int x = 0;
> > int y = 0;
> > int r0, r1;
> >
> > int dummy;
> >
> > void t0(void)
> > {
> > __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
> >
> > __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
> > __atomic_thread_fence(__ATOMIC_SEQ_CST);
> >
> > r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
> > }
> >
> > void t1(void)
> > {
> > __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
> > __atomic_thread_fence(__ATOMIC_SEQ_CST);
> > r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
> > }
> >
> > // BUG_ON(r0 == 0 && r1 == 0)
> >
> > On x86-64 (gcc 13.1 -O2) we get:
> >
> > t0():
> > movl $1, x(%rip)
> > movl $1, %eax
> > xchgl dummy(%rip), %eax
> > lock orq $0, (%rsp) ;; Redundant with previous exchange.
> > movl y(%rip), %eax
> > movl %eax, r0(%rip)
> > ret
> > t1():
> > movl $1, y(%rip)
> > lock orq $0, (%rsp)
> > movl x(%rip), %eax
> > movl %eax, r1(%rip)
> > ret
>
> So I would expect the compilers to do better here. It should know those
> __atomic_thread_fence() thingies are superfluous and simply not emit
> them. This could even be done as a peephole pass later, where it sees
> consecutive atomic ops and the second being a no-op.

Right, I don't see why we need a whole set of new built-ins that say
"this fence isn't needed if the adjacent atomic op already implies a
fence". If the adjacent atomic op already implies a fence for a given
ISA, then the compiler should already be able to elide the explicit
fence.

So just write your code with the explicit fence, and rely on the
compiler to optimize it properly. Admittedly, today's compilers don't
do that optimization well, but they also don't support your proposed
built-ins, so you're going to have to wait for compilers to make
improvements either way.

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
discusses that compilers could (and should) optimize around atomics
better.

>
> > On x86-64 (clang 16 -O2) we get:
> >
> > t0():
> > movl $1, x(%rip)
> > movl $1, %eax
> > xchgl %eax, dummy(%rip)
> > mfence ;; Redundant with previous exchange.
>
> And that's just terrible :/ Nobody should be using MFENCE for this. And
> using MFENCE after a LOCK prefixes instruction (implicit in this case)
> is just fail, because I don't think C++ atomics cover MMIO and other
> such 'lovely' things.
>
> > movl y(%rip), %eax
> > movl %eax, r0(%rip)
> > retq
> > t1():
> > movl $1, y(%rip)
> > mfence
> > movl x(%rip), %eax
> > movl %eax, r1(%rip)
> > retq
>

2023-07-04 17:36:23

by Olivier Dion

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Mon, 03 Jul 2023, Alan Stern <[email protected]> wrote:
> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
>> This is a request for comments on extending the atomic builtins API to
>> help avoiding redundant memory barriers. Indeed, there are
>
> What atomic builtins API are you talking about? The kernel's? That's
> what it sounded like when I first read this sentence -- why else post
> your message on a kernel mailing list?

Good point, we meant the `__atomic' builtins from GCC and Clang. Sorry
for the confusion.

[...]

>> fully-ordered atomic operations like xchg and cmpxchg success in LKMM
>> have implicit memory barriers before/after the operations [1-2], while
>> atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
>> do not have any ordering guarantees of an atomic thread fence
>> __ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].
>
> After reading what you wrote below, I realized that the API you're
> thinking of modifying is the one used by liburcu for user programs.
> It's a shame you didn't mention this in either the subject line or the
> first few paragraphs of the email; that would have made understanding
> the message a little easier.

Indeed, our intent is to discuss the Userspace RCU uatomic API by extending
the toolchain's atomic builtins and not the LKMM itself. The reason why
we've reached out to the Linux kernel developers is because the
original Userspace RCU uatomic API is based on the LKMM.

> In any case, your proposal seems reasonable to me at first glance, with
> two possible exceptions:
>
> 1. I can see why you have special fences for before/after load,
> store, and rmw operations. But why clear? In what way is
> clearing an atomic variable different from storing a 0 in it?

We could indeed group the clear with the store.

We had two approaches in mind:

a) A before/after pair by category of operation:

- load
- store
- RMW

b) A before/after pair for every operation:

- load
- store
- exchange
- compare_exchange
- {add,sub,and,xor,or,nand}_fetch
- fetch_{add,sub,and,xor,or,nand}
- test_and_set
- clear

If we go for the grouping in a), we have to take into account that the
barriers emitted need to cover the worst-case scenario. As an example,
Clang can emit a store for an exchange with SEQ_CST on x86-64, if the
returned value is not used.

Therefore, for the grouping in a), all RMW operations would need to
emit a memory barrier (with Clang on x86-64). But with the scheme in
b), we can emit the barrier explicitly for the exchange operation. We
however question the usefulness of this kind of optimization made by
the compiler, since a user should use a store operation instead.
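
To illustrate the difference between the two groupings, here are
hypothetical declarations; only the *_rmw name below appears in the
original proposal, the per-operation names are made up for this sketch:

/* a) One pair per category: since an exchange whose returned value is
 *    unused may be turned into a store by the compiler, this primitive
 *    has to emit a fence pessimistically for every RMW (with Clang on
 *    x86-64). */
void __atomic_thread_fence_after_rmw(int rmw_memorder, int fence_memorder);

/* b) One pair per operation: the pessimism stays confined to the
 *    exchange primitive, while e.g. fetch_add could remain a NOP on
 *    x86-64. */
void __atomic_thread_fence_after_exchange(int xchg_memorder, int fence_memorder);
void __atomic_thread_fence_after_fetch_add(int rmw_memorder, int fence_memorder);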

> 2. You don't have a special fence for use after initializing an
> atomic. This operation can be treated specially, because at the
> point where an atomic is initialized, it generally has not yet
> been made visible to any other threads.

I assume that you're referring to something like std::atomic_init,
introduced in C++11 and deprecated in C++20? I do not see any scenario
on any architecture where a compiler would emit an atomic operation for
the initialization of an atomic variable. If a memory barrier is
required in this situation, then an explicit one can be emitted using
the existing API.

In our case -- with the compiler's atomic builtins -- the initialization
of a variable can be done without any atomic operations and does not
require any memory barrier. This is a consequence of being able to work
with integral-scalar/pointer types without an atomic qualifier.
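
As an illustration of the pattern we have in mind (a minimal sketch,
with made-up names, using only existing builtins):

struct node { int data; };

struct node *head;      /* shared pointer, initially NULL */

/* Writer: plain (non-atomic) initialization is fine while the object
 * is still private; the RELEASE store is what publishes it. */
void publish(struct node *n)
{
        n->data = 42;                                   /* plain store */
        __atomic_store_n(&head, n, __ATOMIC_RELEASE);   /* publication */
}

/* Reader: the ACQUIRE load pairs with the RELEASE store above, so the
 * plain initialization is visible once the pointer is observed. */
int observe(void)
{
        struct node *n = __atomic_load_n(&head, __ATOMIC_ACQUIRE);

        return n ? n->data : -1;
}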

> Therefore the fence which would normally appear after a store (or
> clear) generally need not appear after an initialization, and you
> might want to add a special API to force the generation of such a
> fence.

I am puzzled by this. Initialization of a shared variable does not need
to be atomic until its publication. Could you expand on this?

Thanks for the feedback,
Olivier

--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

2023-07-04 20:45:32

by Alan Stern

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Tue, Jul 04, 2023 at 01:19:23PM -0400, Olivier Dion wrote:
> On Mon, 03 Jul 2023, Alan Stern <[email protected]> wrote:
> > On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
> >> This is a request for comments on extending the atomic builtins API to
> >> help avoiding redundant memory barriers. Indeed, there are
> >
> > What atomic builtins API are you talking about? The kernel's? That's
> > what it sounded like when I first read this sentence -- why else post
> > your message on a kernel mailing list?
>
> Good point, we meant the `__atomic' builtins from GCC and Clang. Sorry
> for the confusion.

Oh, is that it? Then I misunderstood entirely; I thought you were
talking about augmenting the set of functions or macros made available
in liburcu. I did not realize you intended to change the compilers.

> Indeed, our intent is to discuss the Userspace RCU uatomic API by extending
> the toolchain's atomic builtins and not the LKMM itself. The reason why
> we've reached out to the Linux kernel developers is because the
> original Userspace RCU uatomic API is based on the LKMM.

But why do you want to change the compilers to better support urcu?
That seems like going about things backward; wouldn't it make more sense
to change urcu to better match the facilities offered by the current
compilers?

What if everybody started to do this: modifying the compilers to better
support their pet projects? The end result would be chaos!

> > 1. I can see why you have special fences for before/after load,
> > store, and rmw operations. But why clear? In what way is
> > clearing an atomic variable different from storing a 0 in it?
>
> We could indeed group the clear with the store.
>
> We had two approaches in mind:
>
> a) A before/after pair by category of operation:
>
> - load
> - store
> - RMW
>
> b) A before/after pair for every operation:
>
> - load
> - store
> - exchange
> - compare_exchange
> - {add,sub,and,xor,or,nand}_fetch
> - fetch_{add,sub,and,xor,or,nand}
> - test_and_set
> - clear
>
> If we go for the grouping in a), we have to take into account that the
> barriers emitted need to cover the worse case scenario. As an example,
> Clang can emit a store for a exchange with SEQ_CST on x86-64, if the
> returned value is not used.
>
> Therefore, for the grouping in a), all RMW would need to emit a memory
> barrier (with Clang on x86-64). But with the scheme in b), we can emit
> the barrier explicitly for the exchange operation. We however question
> the usefulness of this kind of optimization made by the compiler, since
> a user should use a store operation instead.

So in the end you settled on a compromise?

> > 2. You don't have a special fence for use after initializing an
> > atomic. This operation can be treated specially, because at the
> > point where an atomic is initialized, it generally has not yet
> > been made visible to any other threads.
>
> I assume that you're referring to something like std::atomic_init from
> C++11 and deprecated in C++20? I do not see any scenario on any
> architecture where a compiler would emit an atomic operation for the
> initialization of an atomic variable. If a memory barrier is required
> in this situation, then an explicit one can be emitted using the
> existing API.
>
> In our case -- with the compiler's atomic builtins -- the initialization
> of a variable can be done without any atomic operations and does not
> require any memory barrier. This is a consequence of being capable of
> working with integral-scalar/pointer type without an atomic qualifier.
>
> > Therefore the fence which would normally appear after a store (or
> > clear) generally need not appear after an initialization, and you
> > might want to add a special API to force the generation of such a
> > fence.
>
> I am puzzled by this. Initialization of a shared variable does not need
> to be atomic until its publication. Could you expand on this?

In the kernel, I believe it sometimes happens that an atomic variable
may be published before it is initialized. (If that's wrong, Paul or
Peter can correct me.) But since this doesn't apply to the situations
you're concerned with, you can forget I mentioned it.

Alan

2023-07-04 21:38:08

by Paul E. McKenney

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Tue, Jul 04, 2023 at 04:25:45PM -0400, Alan Stern wrote:
> On Tue, Jul 04, 2023 at 01:19:23PM -0400, Olivier Dion wrote:

[ . . . ]

> > I am puzzled by this. Initialization of a shared variable does not need
> > to be atomic until its publication. Could you expand on this?
>
> In the kernel, I believe it sometimes happens that an atomic variable
> may be published before it is initialized. (If that's wrong, Paul or
> Peter can correct me.) But since this doesn't apply to the situations
> you're concerned with, you can forget I mentioned it.

Both use cases exist.

A global atomic is implicitly published at compile time. If the desired
initial value is not known until multiple threads are running, then it
is necessary to be careful. Hence double-checked locking and its various
replacements. (Clearly, if you can determine the initial value before
going multithreaded, life is simpler.)
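
For concreteness, a minimal sketch of the double-checked pattern being
referred to (names made up, not taken from any particular project):

#include <pthread.h>

extern int compute_value(void);        /* runs only after threads exist */

static int value;                      /* written once, under the lock */
static int initialized;
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

int get_value(void)
{
        if (!__atomic_load_n(&initialized, __ATOMIC_ACQUIRE)) {
                pthread_mutex_lock(&init_lock);
                if (!__atomic_load_n(&initialized, __ATOMIC_RELAXED)) {
                        value = compute_value();   /* still private here */
                        __atomic_store_n(&initialized, 1, __ATOMIC_RELEASE);
                }
                pthread_mutex_unlock(&init_lock);
        }
        return value;
}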

And dynamically allocated or on-stack storage is the other case, where
there is a point in time when the storage is private even after multiple
threads are running.

Or am I missing the point?

Thanx, Paul

2023-07-05 07:52:07

by Boqun Feng

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
[...]
> NOTE: On x86-64, we found at least one corner case [7] with Clang where
> a RELEASE exchange is optimized to a RELEASE store, when the returned
> value of the exchange is unused, breaking the above expectations.
> Although this type of optimization respect the standard "as-if"
> statement, we question its pertinence since a user should simply do a
> RELEASE store instead of an exchange in that case. With the
> introduction of these new primitives, these type of optimizations should
> be revisited.
>

FWIW, this is actually an LLVM bug:

https://github.com/llvm/llvm-project/issues/60418

Regards,
Boqun

2023-07-05 13:33:03

by Mathieu Desnoyers

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On 7/5/23 03:05, Boqun Feng wrote:
> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
> [...]
>> NOTE: On x86-64, we found at least one corner case [7] with Clang where
>> a RELEASE exchange is optimized to a RELEASE store, when the returned
>> value of the exchange is unused, breaking the above expectations.
>> Although this type of optimization respect the standard "as-if"
>> statement, we question its pertinence since a user should simply do a
>> RELEASE store instead of an exchange in that case. With the
>> introduction of these new primitives, these type of optimizations should
>> be revisited.
>>
>
> FWIW, this is actually a LLVM bug:
>
> https://github.com/llvm/llvm-project/issues/60418

So it was more than a dubious optimization, it's actually broken as well.

I am worried about adding to the compiler's ability to optimize those
atomics because of the subtle corner cases/bugs that can creep in.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-07-06 16:51:08

by Olivier Dion

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Tue, 04 Jul 2023, Alan Stern <[email protected]> wrote:
> On Tue, Jul 04, 2023 at 01:19:23PM -0400, Olivier Dion wrote:
>> On Mon, 03 Jul 2023, Alan Stern <[email protected]> wrote:
>> > On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
[...]
> Oh, is that it? Then I misunderstood entirely; I thought you were
> talking about augmenting the set of functions or macros made available
> in liburcu. I did not realize you intended to change the compilers.

Yes. We want to extend the atomic builtins API of the toolchains.

>> Indeed, our intent is to discuss the Userspace RCU uatomic API by extending
>> the toolchain's atomic builtins and not the LKMM itself. The reason why
>> we've reached out to the Linux kernel developers is because the
>> original Userspace RCU uatomic API is based on the LKMM.
>
> But why do you want to change the compilers to better support urcu?
> That seems like going about things backward; wouldn't it make more sense
> to change urcu to better match the facilities offered by the current
> compilers?

The initial motivation for the migration of the Userspace RCU atomics
API from custom inline assembler (mimicking the LKMM) to the C11/C++11
memory model was to support userspace tools such as TSAN.

We did that by porting everything to the compiler's atomic builtins API.
However, because of the "fully-ordered" atomic semantics of the LKMM, we
had no other choice than to add memory fences which are redundant on
some strongly ordered architectures.

> What if everybody started to do this: modifying the compilers to better
> support their pet projects? The end result would be chaos!

This is why we are starting this discussion, which involves members of
the kernel and toolchain communities. We have prior experience here,
e.g. with asm goto, which was implemented in GCC, and in Clang
afterward, in response to the Linux kernel tracepoints' requirements.

Note that the motivation for supporting TSAN in Userspace RCU comes
from the requirements of the ISC for the BIND 9 project.

[...]
>> If we go for the grouping in a), we have to take into account that the
>> barriers emitted need to cover the worse case scenario. As an example,
>> Clang can emit a store for a exchange with SEQ_CST on x86-64, if the
>> returned value is not used.
>>
>> Therefore, for the grouping in a), all RMW would need to emit a memory
>> barrier (with Clang on x86-64). But with the scheme in b), we can emit
>> the barrier explicitly for the exchange operation. We however question
>> the usefulness of this kind of optimization made by the compiler, since
>> a user should use a store operation instead.
>
> So in the end you settled on a compromise?

We have not settled on anything yet. Choosing between options a) and b)
is open to discussion.

[...]


Thanks,
Olivier
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

2023-07-07 10:52:18

by Jonas Oberhauser

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

Hi all,


Am 7/3/2023 um 9:20 PM schrieb Olivier Dion:
> Hi all,
>
> This is a request for comments on extending the atomic builtins API to
> help avoiding redundant memory barriers. Indeed, there are
> discrepancies between the Linux kernel consistency memory model (LKMM)
> and the C11/C++11 memory consistency model [0]. For example,
> fully-ordered atomic operations like xchg and cmpxchg success in LKMM
> have implicit memory barriers before/after the operations [1-2], while
> atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
> do not have any ordering guarantees of an atomic thread fence
> __ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].


The issues run quite a bit deeper than this. The two models have two
completely different perspectives that are largely incompatible.
I think all you can really do is bridge the gap at the level of the
generated assembly.
I.e., don't bridge the gap between LKMM and the C11 MCM. Bridge the gap
between the assembly code generated by C11 atomics and the one generated
by LKMM. But I'm not sure that's really the task here.


>
> [...] For example, to make Read-Modify-Write (RMW) operations match
> the Linux kernel "full barrier before/after" semantics, the liburcu's
> uatomic API has to emit both a SEQ_CST RMW operation and a subsequent
> thread fence SEQ_CST, which leads to duplicated barriers in some cases.


Does it have to though? Can't you just do e.g. a release RMW operation
followed by an after_atomic fence? And for loads, a SEQ_CST fence
followed by an acquire load? Analogously (but: mirrored) for stores.



> // Always emit thread fence.
> __atomic_thread_fence_{before,after}_load(int load_memorder,
> int fence_memorder)
>
> // NOP for store_memorder == SEQ_CST.
> // Otherwise, emit thread fence.
> __atomic_thread_fence_{before,after}_store(int store_memorder,
> int fence_memorder)
>
> // NOP for clear_memorder == SEQ_CST.
> // Otherwise, emit thread fence.
> __atomic_thread_fence_{before,after}_clear(int clear_memorder,
> int fence_memorder)
>
> // Always NOP.
> __atomic_thread_fence_{before,after}_rmw(int rmw_memorder,
> int fence_memorder)


I currently don't feel comfortable adding such extensions to LKMM (or a
compiler API for that matter).


You mentioned that the goal is to check some code written using LKMM
primitives with TSAN due to some formal requirements. What exactly do
these requirements entail? Do you need to check the code exactly as it
will be executed (modulo the TSAN instrumentation)? Is it an option to
map to normal builtins with suboptimal performance just for the
verification purpose, but then run the slightly more optimized original
code later?

Specifically for TSAN's ordering requirements, you may need to make
LKMM's RMWs into acq+rel with an extra mb, even if all that extra
ordering isn't necessary at the assembler level.


Also note that no matter what you do, due to the two different
perspectives, TSAN's hb relation may introduce false positive data races
w.r.t. LKMM.  For example, if the happens-before ordering is guaranteed
through pb starting with coe/fre.

Without thinking too hard, it seems to me no matter what fences and
barriers you introduce, TSAN will not see this kind of ordering and
consider the situation a data race.

(And at least in our own verification methodology for rcu/smr, ordering
through fre appears when the rscs reads something that is later
overwritten by the writer. Not sure off the top of my head if this fre
ordering is what prevents the data race though, or there's some
additional ordering that TSAN may also detect.

As a side note, according to LKMM we would also have data races in our
critical sections, but I believe use of rcu_dereference would fix that,
so you may not experience such data races in your code).


best wishes,
jonas


2023-07-07 14:51:15

by Olivier Dion

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Tue, 04 Jul 2023, Peter Zijlstra <[email protected]> wrote:
> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
[...]
>> On x86-64 (gcc 13.1 -O2) we get:
>>
>> t0():
>> movl $1, x(%rip)
>> movl $1, %eax
>> xchgl dummy(%rip), %eax
>> lock orq $0, (%rsp) ;; Redundant with previous exchange.
>> movl y(%rip), %eax
>> movl %eax, r0(%rip)
>> ret
>> t1():
>> movl $1, y(%rip)
>> lock orq $0, (%rsp)
>> movl x(%rip), %eax
>> movl %eax, r1(%rip)
>> ret
>
> So I would expect the compilers to do better here. It should know those
> __atomic_thread_fence() thingies are superfluous and simply not emit
> them. This could even be done as a peephole pass later, where it sees
> consecutive atomic ops and the second being a no-op.

Indeed, a peephole optimization could work for this Dekker, if the
compiler adds the pattern for it. However, AFAIK, a peephole cannot be
applied when the two fences are in different basic blocks -- for
example, when a fence is only emitted on the success path of a
compare_exchange. This limitation implies that the optimization cannot
be done across functions/modules (shared libraries). For example, it
would be interesting to be able to promote the acquire fence of a
pthread_mutex_lock() to a full fence on weakly ordered architectures
while preventing a redundant fence on strongly ordered architectures.
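
As a small made-up illustration of that first limitation, the fence
below sits in a different basic block than the RMW which could subsume
it, so a peephole that only looks at adjacent instructions cannot pair
them:

/* Made-up example: the SEQ_CST fence is only wanted on the success
 * path of the compare-exchange, i.e. in a different basic block than
 * the RMW itself, so an adjacent-instruction peephole cannot elide it. */
#include <stdbool.h>

bool set_flag_once(int *flag)
{
        int expected = 0;

        if (__atomic_compare_exchange_n(flag, &expected, 1, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED)) {
                __atomic_thread_fence(__ATOMIC_SEQ_CST); /* success path only */
                return true;
        }
        return false;
}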

We know that at least Clang has such peephole optimizations for some
architecture backends. It seems however that they do not recognize
lock-prefixed instructions as fences. AFAIK, GCC does not have that
kind of optimization.

We are also aware that some research has been done on this topic [0].
The idea is to use PRE for the elimination of redundant fences. This
would work across multiple basic blocks, although the paper focuses on
intra-procedural eliminations. However, it seems that the latest work
on that [1] has never been completed [2].

Our proposed approach provides a means for the user to express -- and
document -- the wanted semantics in the source code. This allows the
compiler to emit only the wanted fences, therefore not relying on
architecture-specific backend optimizations. In other words, this
applies even to unoptimized binaries.

[...]

Thanks,
Olivier

[0] https://dl.acm.org/doi/10.1145/3033019.3033021

[1] https://discourse.llvm.org/t/fence-elimination-pass-proposal/33679

[2] https://reviews.llvm.org/D5758
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

2023-07-07 15:47:20

by Mathieu Desnoyers

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On 7/4/23 06:23, Jonathan Wakely wrote:
> On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote:
>>
>> On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
>>
>>> int x = 0;
>>> int y = 0;
>>> int r0, r1;
>>>
>>> int dummy;
>>>
>>> void t0(void)
>>> {
>>> __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
>>>
>>> __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
>>> __atomic_thread_fence(__ATOMIC_SEQ_CST);
>>>
>>> r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
>>> }
>>>
>>> void t1(void)
>>> {
>>> __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
>>> __atomic_thread_fence(__ATOMIC_SEQ_CST);
>>> r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
>>> }
>>>
>>> // BUG_ON(r0 == 0 && r1 == 0)
>>>
>>> On x86-64 (gcc 13.1 -O2) we get:
>>>
>>> t0():
>>> movl $1, x(%rip)
>>> movl $1, %eax
>>> xchgl dummy(%rip), %eax
>>> lock orq $0, (%rsp) ;; Redundant with previous exchange.
>>> movl y(%rip), %eax
>>> movl %eax, r0(%rip)
>>> ret
>>> t1():
>>> movl $1, y(%rip)
>>> lock orq $0, (%rsp)
>>> movl x(%rip), %eax
>>> movl %eax, r1(%rip)
>>> ret
>>
>> So I would expect the compilers to do better here. It should know those
>> __atomic_thread_fence() thingies are superfluous and simply not emit
>> them. This could even be done as a peephole pass later, where it sees
>> consecutive atomic ops and the second being a no-op.
>
> Right, I don't see why we need a whole set of new built-ins that say
> "this fence isn't needed if the adjacent atomic op already implies a
> fence". If the adjacent atomic op already implies a fence for a given
> ISA, then the compiler should already be able to elide the explicit
> fence.
>
> So just write your code with the explicit fence, and rely on the
> compiler to optimize it properly. Admittedly, today's compilers don't
> do that optimization well, but they also don't support your proposed
> built-ins, so you're going to have to wait for compilers to make
> improvements either way.

Emitting the redundant fences is the plan we have for liburcu. The
current situation unfortunately requires users to choose between
generating inefficient code with C11 and implementing their own inline
assembler until the compilers catch up.

>
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
> discusses that compilers could (and should) optimize around atomics
> better.

Our understanding of the C11/C++11 memory model is that it aims at
defining the weakest possible guarantees for each ordering, so as to be
as efficient as possible on weakly ordered architectures. However, when
writing portable code in practice, the C11/C++11 memory model forces
the programmer to insert memory fences which are redundant on strongly
ordered architectures.

We want something that can apply across procedures from different
modules: e.g. a mutex lock operation (glibc) has acquire semantics,
implemented with a RMW operation, which the caller could promote to a
full fence. The peephole optimizations cannot do this because they
focus on a single basic block. PRE can apply across procedures, but
would rely on LTO and possibly function annotations across modules. I
am not aware of any progress in that research field in the past 6
years. [1-2]

The new atomic builtins we propose allow users to better express their
intent to the compiler, allowing for better code generation and
therefore reducing the number of redundant fences emitted, without
having to rely on optimizations.

It should be noted that the builtin extensions we propose are not
entirely free. Here are our perceived downsides of introducing those
APIs:

- They add complexity to the atomic builtins API.

- They add constraints which need to be taken into account for future
architecture-specific backend optimizations, for example the (broken)
xchg RELEASE | RELAXED -> store optimization on x86 (Clang) [3].

If an atomic op class (e.g. rmw) can be optimized to a weaker
instruction by the architecture backend, then the emission of a
before/after fence associated with this class of atomic op must be
pessimistic and assume the weakest instruction pattern which can
be generated.

There are optimizations of atomics and redundant fences in Clang. The
redundant-fence optimizations appear to be limited to a peephole, which
does not seem to leverage the fact that lock-prefixed atomic operations
act as implicit fences on x86. Perhaps this could be low-hanging fruit
for optimization.

We have not observed any similar optimizations in GCC as of today,
which appears to be a concern for many users [4-7].

Thanks,

Mathieu

[1] https://dl.acm.org/doi/10.1145/3033019.3033021
[2] https://reviews.llvm.org/D5758
[3] https://github.com/llvm/llvm-project/issues/60418
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056
[5] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68622
[6] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86072
[7] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-07-07 15:55:41

by Peter Zijlstra

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Fri, Jul 07, 2023 at 10:04:06AM -0400, Olivier Dion wrote:
> On Tue, 04 Jul 2023, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
> [...]
> >> On x86-64 (gcc 13.1 -O2) we get:
> >>
> >> t0():
> >> movl $1, x(%rip)
> >> movl $1, %eax
> >> xchgl dummy(%rip), %eax
> >> lock orq $0, (%rsp) ;; Redundant with previous exchange.
> >> movl y(%rip), %eax
> >> movl %eax, r0(%rip)
> >> ret
> >> t1():
> >> movl $1, y(%rip)
> >> lock orq $0, (%rsp)
> >> movl x(%rip), %eax
> >> movl %eax, r1(%rip)
> >> ret
> >
> > So I would expect the compilers to do better here. It should know those
> > __atomic_thread_fence() thingies are superfluous and simply not emit
> > them. This could even be done as a peephole pass later, where it sees
> > consecutive atomic ops and the second being a no-op.
>
> Indeed, a peephole optimization could work for this Dekker, if the
> compiler adds the pattern for it. However, AFAIK, a peephole can not be
> applied when the two fences are in different basic blocks. For example,
> only emitting a fence on a compare_exchange success. This limitation
> implies that the optimization can not be done across functions/modules
> (shared libraries).

LTO FTW :-)

> For example, it would be interesting to be able to
> promote an acquire fence of a pthread_mutex_lock() to a full fence on
> weakly ordered architectures while preventing a redundant fence on
> strongly ordered architectures.

That's a very non-trivial thing to do. I know Linux has
smp_mb__after_spinlock() and that x86 has it as a no-op, but even on
x86 adding a full fence after a lock has observable differences IIRC.

Specifically, the actual store that acquires the lock is not well
ordered vs the critical section itself for non-trivial spinlock
implementations (notably qspinlock).

For RCU you mostly care about RCsc locks (IIRC), and upgrading unlock is
a 'simpler' (IMO) approach to achieve that (which is what RCU does with
smp_mb__after_unlock_lock()).

> We know that at least Clang has such peephole optimizations for some
> architecture backends. It seems however that they do not recognize
> lock-prefixed instructions as fence.

They seem confused in general for emitting MFENCE.

> AFAIK, GCC does not have that kind
> of optimization.

> We are also aware that some research has been done on this topic [0].
> The idea is to use PRE for elimiation of redundant fences. This would
> work across multiple basic blocks, although the paper focus on
> intra-procedural eliminations. However, it seems that the latest work
> on that [1] has never been completed [2].
>
> Our proposed approach provides a mean for the user to express -- and
> document -- the wanted semantic in the source code. This allows the
> compiler to only emit wanted fences, therefore not relying on
> architecture specific backend optimizations. In other words, this
> applies even on unoptimized binaries.

I'm not a tool person, but if I were, I'd be very hesitant to add
__builtin functions that 'conflict'/'overlap' with what an optimizer
should be able to do.

Either way around you need work done on the compilers, and I'm thinking
'fixing' the optimizer will benefit far more people than adding
__builtin's.

Then again, I'm not a tools person, so you don't need to convince me.
But one of the selling points of the whole Atomics as a language feature
was that whole optimizer angle. Otherwise you might as well do as we do,
inline asm the world.

I'll shut up now, thanks for that PRE reference [0], that seems a fun
read for when I'm bored.

2023-07-07 17:59:44

by Olivier Dion

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

On Fri, 07 Jul 2023, Jonas Oberhauser <[email protected]> wrote:
[...]
>> This is a request for comments on extending the atomic builtins API to
>> help avoiding redundant memory barriers. Indeed, there are
>> discrepancies between the Linux kernel consistency memory model (LKMM)
>> and the C11/C++11 memory consistency model [0]. For example,
>> fully-ordered atomic operations like xchg and cmpxchg success in LKMM
>> have implicit memory barriers before/after the operations [1-2], while
>> atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
>> do not have any ordering guarantees of an atomic thread fence
>> __ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].
>
>
> The issues run quite a bit deeper than this. The two models have two
> completely different perspectives that are quite much incompatible.

Agreed. Our intent is not to close the gap completely, but to reduce
the gap between the two models, by supporting the "full barrier
before/after" semantic of LKMM in the C11/C++11 memory model.

> I think all you can really do is bridge the gap at the level of the
> generated assembly. I.e., don't bridge the gap between LKMM and the
> C11 MCM. Bridge the gap between the assembly code generated by C11
> atomics and the one generated by LKMM. But I'm not sure that's really
> the task here.

We have considered analyzing the assembler output of different
toolchain versions to manually generate our own before/after fences.
However, nothing prevents a toolchain from changing the emitted
assembler in the future, which would make things fragile. The only
things that are guaranteed not to change are the definitions in the
standard (C11/C++11). Anything else is fair game for optimizations.

>> [...] For example, to make Read-Modify-Write (RMW) operations match
>> the Linux kernel "full barrier before/after" semantics, the liburcu's
>> uatomic API has to emit both a SEQ_CST RMW operation and a subsequent
>> thread fence SEQ_CST, which leads to duplicated barriers in some cases.
>
> Does it have to though? Can't you just do e.g. an release RMW
> operation followed by an after_atomic  fence? And for loads, a
> SEQ_CST fence followed by an acquire load? Analogously (but: mirrored)
> for stores.

That would not improve anything for RMW. Consider the following example
and its resulting assembler on x86-64 gcc 13.1 -O2:

int exchange(int *x, int y)
{
        int r = __atomic_exchange_n(x, y, __ATOMIC_RELEASE);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);

        return r;
}

exchange:
        movl %esi, %eax
        xchgl (%rdi), %eax
        lock orq $0, (%rsp) ;; Redundant with previous exchange
        ret

Also, that would make the exchange weaker in the sense of the C11/C++11
memory model, by losing its acquire and its sequential consistency
semantics.

[...]
>> // Always NOP.
>> __atomic_thread_fence_{before,after}_rmw(int rmw_memorder,
>> int fence_memorder)
>
>
> I currently don't feel comfortable adding such extensions to LKMM (or a
> compiler API for that matter).

There is no plan to add such extensions to the LKMM, only to extend the
toolchains' current atomic builtins API.

> You mentioned that the goal is to check some code written using LKMM
> primitives with TSAN due to some formal requirements. What exactly do
> these requirements entail? Do you need to check the code exactly as it
> will be executed (modulo the TSAN instrumentation)? Is it an option to
> map to normal builtins with suboptimal performance just for the
> verification purpose, but then run the slightly more optimized
> original code later?

We aim to validate with TSAN the code that will run during production,
minus TSAN itself.

> Specifically for TSAN's ordering requirements, you may need to make
> LKMM's RMWs into acq+rel with an extra mb, even if all that extra
> ordering isn't necessary at the assembler level.
>
>
> Also note that no matter what you do, due to the two different
> perspectives, TSAN's hb relation may introduce false positive data
> races w.r.t. LKMM.  For example, if the happens-before ordering is
> guaranteed through pb starting with coe/fre.

This is why we have implemented our primitives and changed our
algorithms so that they use the acquire/release semantics of the
C11/C++11 memory model.

> Without thinking too hard, it seems to me no matter what fences and
> barriers you introduce, TSAN will not see this kind of ordering and
> consider the situation a data race.

We have come to the same conclusion, mainly because TSAN does not
support thread fences in its verifications. This is why we have
implemented an annotation layer that groups relaxed memory accesses
into a single acquire/release event. This layer makes TSAN aware of the
happens-before relations of the RCU implementations -- and lock-free
data structures -- in Userspace RCU.

This came with the downside of introducing redundant fences on strongly
ordered architectures, because the C11/C++11 memory model does not
provide any means of expressing fully-ordered atomic operations without
relying on explicit thread fences.
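
To give a rough idea of the kind of pattern involved -- this is a
simplified sketch, not the actual Userspace RCU annotation layer -- the
runtime ordering comes from the fences, while a TSAN build promotes the
accesses to the grouping variable to RELEASE/ACQUIRE so that the tool
sees the happens-before edge it would otherwise miss:

/* Simplified sketch (names made up): group RELAXED accesses under one
 * RELEASE/ACQUIRE pair on `ready' for TSAN, since TSAN ignores the
 * thread fences that provide the ordering at runtime. */
#ifdef BUILD_FOR_TSAN
#define GROUP_PUBLISH_ORDER __ATOMIC_RELEASE
#define GROUP_OBSERVE_ORDER __ATOMIC_ACQUIRE
#else
#define GROUP_PUBLISH_ORDER __ATOMIC_RELAXED
#define GROUP_OBSERVE_ORDER __ATOMIC_RELAXED
#endif

static int data;   /* accessed with RELAXED atomics, grouped under `ready' */
static int ready;

void writer(void)
{
        __atomic_store_n(&data, 1, __ATOMIC_RELAXED);
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        __atomic_store_n(&ready, 1, GROUP_PUBLISH_ORDER);
}

int reader(void)
{
        if (__atomic_load_n(&ready, GROUP_OBSERVE_ORDER)) {
                __atomic_thread_fence(__ATOMIC_SEQ_CST);
                return __atomic_load_n(&data, __ATOMIC_RELAXED);
        }
        return 0;
}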

[...]

Thanks,
Olivier
--
Olivier Dion
EfficiOS Inc.
https://www.efficios.com

2023-07-10 15:50:01

by Jonas Oberhauser

Subject: Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics


Am 7/7/2023 um 7:25 PM schrieb Olivier Dion:
> On Fri, 07 Jul 2023, Jonas Oberhauser <[email protected]> wrote:
> [...]
>>> This is a request for comments on extending the atomic builtins API to
>>> help avoiding redundant memory barriers. Indeed, there are
>>> discrepancies between the Linux kernel consistency memory model (LKMM)
>>> and the C11/C++11 memory consistency model [0]. For example,
>>> fully-ordered atomic operations like xchg and cmpxchg success in LKMM
>>> have implicit memory barriers before/after the operations [1-2], while
>>> atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
>>> do not have any ordering guarantees of an atomic thread fence
>>> __ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].
>>
>> The issues run quite a bit deeper than this. The two models have two
>> completely different perspectives that are quite much incompatible.
> Agreed. Our intent is not to close the gap completely, but to reduce
> the gap between the two models, by supporting the "full barrier
> before/after" semantic of LKMM in the C11/C++11 memory model.


I think what you're trying to achieve has nothing to do with the gap at
all. (But do check out the IMM paper https://plv.mpi-sws.org/imm/ for
what is involved in bridging the gap between LKMM-like and C11-like models).

What you're trying to achieve is to certify some urcu algorithms,
without making the code unnecessarily slower.
Your problem is that the algorithm is implemented using the LKMM API,
and you want to check it with a tool (TSAN) meant to (dynamically)
analyze C11 code that relies on a subset of C11's memory model.

What I still don't understand is whether using TSAN as-is is a formal
requirement from the certification you are trying to achieve, or whether
you could either slightly modify the TSAN toolchain to give answers
consistent with the behavior on LKMM, or use a completely different tool.

For example, you could eliminate the worry about the unnecessary
barriers by including the extra barriers only in the TSAN' (modified
TSAN) analysis.
In that case TSAN' adds additional, redundant barriers in some cases
during the analysis process, but those barriers would be gone the moment
you stop using TSAN'.

You would need to argue that this additional instrumentation doesn't
hide any data races, but I suspect that wouldn't be too hard.

Another possibility is to use a tool like Dat3M that supports LKMM to
try and verify your code, but I'm not sure if these tools are
feature-complete enough to verify the specific algorithms you have in
mind (e.g., mixed-size accesses are an issue, and Paul told me there's a
few of those in (U)RCU. But maybe the cost of changing the code to
full-sized accesses might be cheaper than relying on extra barriers.)


FWIW your current solution of adding a whole class of fences to
essentially C11 and the toolchain, and modifying the code to use these
fences, isn't a way I would want to take.


>> I think all you can really do is bridge the gap at the level of the
>> generated assembly. I.e., don't bridge the gap between LKMM and the
>> C11 MCM. Bridge the gap between the assembly code generated by C11
>> atomics and the one generated by LKMM. But I'm not sure that's really
>> the task here.
> [...]
> However, nothing prevents a toolchain from changing the emitted
> assembler in the future, which would make things fragile. The only
> thing that is guaranteed to not change is the definitions in the
> standard (C11/C++11). Anything else is fair game for optimizations.

Not quite. If you rely on the LKMM API to generate the final code, you
can definitely prove that the LKMM API implementation has a C11-like
memory model abstraction.
For example, you might be able to prove that the LKMM implementation of
a strong xchg guarantees at least the same ordering as a seq_cst fence ;
seq_cst xchg ; seq_cst fence sequence in C11.
I don't think it's that fragile since 1) it's a manually written
volatile assembler mapping, so there's not really a lot the toolchains
can do about it and 2) the LKMM implementation of atomics rarely
changes, and should still have similar guarantees after the change.

The main issue will be as we discussed before and below that TSAN will
still trigger false positives.


>>> [...] For example, to make Read-Modify-Write (RMW) operations match
>>> the Linux kernel "full barrier before/after" semantics, the liburcu's
>>> uatomic API has to emit both a SEQ_CST RMW operation and a subsequent
>>> thread fence SEQ_CST, which leads to duplicated barriers in some cases.
>> Does it have to though? Can't you just do e.g. an release RMW
>> operation followed by an after_atomic  fence? And for loads, a
>> SEQ_CST fence followed by an acquire load? Analogously (but: mirrored)
>> for stores.
> That would not improve anything for RMW. Consider the following example
> and its resulting assembler on x86-64 gcc 13.1 -O2:
>
> int exchange(int *x, int y)
> {
> int r = __atomic_exchange_n(x, y, __ATOMIC_RELEASE);
> __atomic_thread_fence(__ATOMIC_SEQ_CST);
>
> return r;
> }
>
> exchange:
> movl %esi, %eax
> xchgl (%rdi), %eax
> lock orq $0, (%rsp) ;; Redundant with previous exchange
> ret


I specifically meant the after_atomic fence from LKMM, which will
compile to nothing on x86 (not even a compiler barrier).
However, at that point I also wasn't clear on what you're trying to
achieve. I see now that using LKMM barriers doesn't help you here.



>> You mentioned that the goal is to check some code written using LKMM
>> primitives with TSAN due to some formal requirements. What exactly do
>> these requirements entail? Do you need to check the code exactly as it
>> will be executed (modulo the TSAN instrumentation)? Is it an option to
>> map to normal builtins with suboptimal performance just for the
>> verification purpose, but then run the slightly more optimized
>> original code later?
> We aim to validate with TSAN the code that will run during production,
> minus TSAN itself.
>
>> Specifically for TSAN's ordering requirements, you may need to make
>> LKMM's RMWs into acq+rel with an extra mb, even if all that extra
>> ordering isn't necessary at the assembler level.
>>
>>
>> Also note that no matter what you do, due to the two different
>> perspectives, TSAN's hb relation may introduce false positive data
>> races w.r.t. LKMM.  For example, if the happens-before ordering is
>> guaranteed through pb starting with coe/fre.
> This is why we have implemented our primitives and changed our
> algorithms so that they use the acquire/release semantics of the
> C11/C++11 memory model.
>
>> Without thinking too hard, it seems to me no matter what fences and
>> barriers you introduce, TSAN will not see this kind of ordering and
>> consider the situation a data race.
> We have come to the same conclusion, mainly because TSAN does not
> support thread fence in its verifications.

That's also a concern (although I thought they fixed that a year or two
ago, but I must be mistaken).

What I mean is that even if TSAN appropriately used all fences for
hb-analysis, and even if you added strong fences all over your code,
there are (as far as I can see) still cases where TSAN will tell you
there's a data race (on C11) but there isn't one on LKMM.


good luck

jonas