2021-12-13 18:48:49

by Florian Weimer

Subject: rseq + membarrier programming model

I've been studying Jann Horn's biased locking example:

Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
<https://lore.kernel.org/linux-api/CAG48ez02UDn_yeLuLF4c=kX0=h2Qq8Fdb0cer1yN8atbXSNjkQ@mail.gmail.com/>

It uses MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ as part of the biased lock
revocation.

How does this code know that the process has called
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ? Could it fall back to
MEMBARRIER_CMD_GLOBAL instead? And why is it that MEMBARRIER_CMD_GLOBAL
(the broader/more expensive barrier) does not require registration, but
the more restricted versions do?

Or put differently, why wouldn't we request
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ unconditionally at
process start in glibc, once we start biased locking in a few places?

Thanks,
Florian



2021-12-13 19:19:44

by Mathieu Desnoyers

Subject: Re: rseq + membarrier programming model

----- On Dec 13, 2021, at 1:47 PM, Florian Weimer [email protected] wrote:

> I've been studying Jann Horn's biased locking example:
>
> Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
> <https://lore.kernel.org/linux-api/CAG48ez02UDn_yeLuLF4c=kX0=h2Qq8Fdb0cer1yN8atbXSNjkQ@mail.gmail.com/>
>
> It uses MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ as part of the biased lock
> revocation.
>
> How does this code know that the process has called
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ?

I won't speak for this code snippet in particular, but in general
issuing MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ from a thread which
belongs to a process which has not performed
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ will result in
membarrier returning -EPERM. If the kernel is built without CONFIG_RSEQ
support, it will return -EINVAL:

membarrier_private_expedited():

    } else if (flags == MEMBARRIER_FLAG_RSEQ) {
        if (!IS_ENABLED(CONFIG_RSEQ))
            return -EINVAL;
        if (!(atomic_read(&mm->membarrier_state) &
              MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY))
            return -EPERM;

If you want to create code which optionally depends on availability
of EXPEDITED_RSEQ membarrier, I suspect you will want to perform
registration from a library constructor, and keep track of registration
success/failure in a static variable within the library.

> Could it fall back to
> MEMBARRIER_CMD_GLOBAL instead?

No. CMD_GLOBAL does not issue the required rseq fence used by the
algorithm discussed. Also, CMD_GLOBAL has quite a few other shortcomings:
it takes a while to execute, and is incompatible with nohz_full kernels.

> Why is it that MEMBARRIER_CMD_GLOBAL
> does not require registration (the broader/more expensive barrier), but
> the more restricted versions do?

The more restricted versions (which require explicit registration) are
closely integrated with the Linux scheduler, and in some cases require
additional code to be executed when scheduling between threads which
belong to different processes, for instance for the "SYNC_CORE" membarrier,
which is useful for JITs:

static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
{
    if (current->mm != mm)
        return;
    if (likely(!(atomic_read(&mm->membarrier_state) &
                 MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
        return;
    sync_core_before_usermode();
}

Also, the "global-expedited" commands can generate IPIs which will
interrupt the flow of threads running on behalf of a registered process.
Therefore, in order to make sure we do not add delays to real-time sensitive
applications, we made this registration "opt-in".

In order to make sure the programming model is the same for expedited
private/global plain/sync-core/rseq membarrier commands, we require that
each process perform a registration beforehand.

>
> Or put differently, why wouldn't we request
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ unconditionally at
> process start in glibc, once we start biased locking in a few places?

The registration of membarrier expedited can be either performed immediately
when the process starts, or later, possibly when there are other threads
running concurrently. Note however that the registration scheme has been
optimized for the scenario where it is called when a single thread is
running in the process (see sync_runqueues_membarrier_state()). Otherwise
we need to use the more heavyweight synchronize_rcu(). So my advice would
be to perform the membarrier expedited registration while the process
is still single-threaded if possible, rather than postpone this and
do it entirely lazily on first use, which may happen while other
threads are already running.

Thanks,

Mathieu

>
> Thanks,
> Florian

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2021-12-13 19:28:15

by Jann Horn

Subject: Re: rseq + membarrier programming model

On Mon, Dec 13, 2021 at 7:48 PM Florian Weimer <[email protected]> wrote:
> I've been studying Jann Horn's biased locking example:
>
> Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
> <https://lore.kernel.org/linux-api/CAG48ez02UDn_yeLuLF4c=kX0=h2Qq8Fdb0cer1yN8atbXSNjkQ@mail.gmail.com/>
>
> It uses MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ as part of the biased lock
> revocation.
>
> How does this code know that the process has called
> MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ? Could it fall back to
> MEMBARRIER_CMD_GLOBAL instead?

AFAIK no - MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ specifically
forces the threads of the targeted process to go through an RSEQ
preemption. That only happens when this special membarrier command is
used and when an actual task switch happens; other membarrier flavors
don't guarantee that.


Also, MEMBARRIER_CMD_GLOBAL can take really long in terms of wall
clock time - it's basically just synchronize_rcu(), and as the
documentation at
https://www.kernel.org/doc/html/latest/RCU/Design/Requirements/Requirements.html
says:

"The synchronize_rcu() grace-period-wait primitive is optimized for
throughput. It may therefore incur several milliseconds of latency in
addition to the duration of the longest RCU read-side critical
section."


You can see that synchronize_rcu() indeed takes quite long in terms of
wall clock time (but not in terms of CPU time - as the documentation
says, it's optimized for throughput in a parallel context) with a
simple test program:

jannh@laptop:~/test/rcu$ cat rcu_membarrier.c
#define _GNU_SOURCE
#include <stdio.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <err.h>

int main(void) {
  for (int i = 0; i < 20; i++) {
    struct timespec ts1;
    if (clock_gettime(CLOCK_MONOTONIC, &ts1))
      err(1, "time");

    if (syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0, 0))
      err(1, "membarrier");

    struct timespec ts2;
    if (clock_gettime(CLOCK_MONOTONIC, &ts2))
      err(1, "time");

    unsigned long delta_ns = (ts2.tv_nsec - ts1.tv_nsec) +
        (1000UL*1000*1000) * (ts2.tv_sec - ts1.tv_sec);
    printf("MEMBARRIER_CMD_GLOBAL took %lu nanoseconds\n", delta_ns);
  }
}
jannh@laptop:~/test/rcu$ gcc -o rcu_membarrier rcu_membarrier.c -Wall
jannh@laptop:~/test/rcu$ time ./rcu_membarrier
MEMBARRIER_CMD_GLOBAL took 17155142 nanoseconds
MEMBARRIER_CMD_GLOBAL took 19207001 nanoseconds
MEMBARRIER_CMD_GLOBAL took 16087350 nanoseconds
MEMBARRIER_CMD_GLOBAL took 15963711 nanoseconds
MEMBARRIER_CMD_GLOBAL took 16336149 nanoseconds
MEMBARRIER_CMD_GLOBAL took 15931331 nanoseconds
MEMBARRIER_CMD_GLOBAL took 16020315 nanoseconds
MEMBARRIER_CMD_GLOBAL took 15873814 nanoseconds
MEMBARRIER_CMD_GLOBAL took 15945667 nanoseconds
MEMBARRIER_CMD_GLOBAL took 23815452 nanoseconds
MEMBARRIER_CMD_GLOBAL took 23626444 nanoseconds
MEMBARRIER_CMD_GLOBAL took 19911435 nanoseconds
MEMBARRIER_CMD_GLOBAL took 23967343 nanoseconds
MEMBARRIER_CMD_GLOBAL took 15943147 nanoseconds
MEMBARRIER_CMD_GLOBAL took 23914809 nanoseconds
MEMBARRIER_CMD_GLOBAL took 32498986 nanoseconds
MEMBARRIER_CMD_GLOBAL took 19450932 nanoseconds
MEMBARRIER_CMD_GLOBAL took 16281308 nanoseconds
MEMBARRIER_CMD_GLOBAL took 24045168 nanoseconds
MEMBARRIER_CMD_GLOBAL took 15406698 nanoseconds

real 0m0.458s
user 0m0.058s
sys 0m0.031s
jannh@laptop:~/test/rcu$

Every invocation of MEMBARRIER_CMD_GLOBAL on my laptop took >10 ms.

2021-12-13 19:30:06

by Florian Weimer

Subject: Re: rseq + membarrier programming model

* Mathieu Desnoyers:

>> Could it fall back to
>> MEMBARRIER_CMD_GLOBAL instead?
>
> No. CMD_GLOBAL does not issue the required rseq fence used by the
> algorithm discussed. Also, CMD_GLOBAL has quite a few other shortcomings:
> it takes a while to execute, and is incompatible with nohz_full kernels.

What about using sched_setcpu to move the current thread to the same CPU
(and move it back afterwards)? Surely that implies the required sort of
rseq barrier that MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
MEMBARRIER_CMD_FLAG_CPU performs?

That is possible even without membarrier, so I wonder why registration
of intent is needed for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.

> In order to make sure the programming model is the same for expedited
> private/global plain/sync-core/rseq membarrier commands, we require that
> each process perform a registration beforehand.

Hmm. At least it's not possible to unregister again.

But I think it would be really useful to have some of these barriers
available without registration, possibly in a more expensive form.

Thanks,
Florian


2021-12-13 19:31:48

by Mathieu Desnoyers

Subject: Re: rseq + membarrier programming model

----- On Dec 13, 2021, at 1:47 PM, Florian Weimer [email protected] wrote:

> I've been studying Jann Horn's biased locking example:
>
> Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
> <https://lore.kernel.org/linux-api/CAG48ez02UDn_yeLuLF4c=kX0=h2Qq8Fdb0cer1yN8atbXSNjkQ@mail.gmail.com/>
>
> It uses MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ as part of the biased lock
> revocation.

By the way, there might be something good to salvage from this prototype I did a
while back:

https://github.com/compudj/rseq-test/blob/adapt-lock/test-rseq-adaptative-lock.c

The neat trick there is to use a combination of the Zero Flag and rbx==0/1 to detect
whether the rseq critical section was aborted before or after successful execution
of the CAS. This allows the rseq c.s. to cover an entire loop, which contains a CAS
instruction, without requiring that the critical section ends with a "commit"
instruction.

Some characteristics of this prototype:

- Don't busy-wait in user-space if the lock owner runs on the same CPU as the
waiter. Immediately use futex.
- Adaptive busy-wait delay (per-lock).
- If busy-spinning is preempted, it jumps to the abort immediately when resumed.
Therefore, the loop count for adaptive busy-spinning is very precise.

Of course, much more work would be needed, but I suspect a few ideas there can be
useful.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2021-12-13 19:57:07

by Mathieu Desnoyers

Subject: Re: rseq + membarrier programming model

----- On Dec 13, 2021, at 2:29 PM, Florian Weimer [email protected] wrote:

> * Mathieu Desnoyers:
>
>>> Could it fall back to
>>> MEMBARRIER_CMD_GLOBAL instead?
>>
>> No. CMD_GLOBAL does not issue the required rseq fence used by the
>> algorithm discussed. Also, CMD_GLOBAL has quite a few other shortcomings:
>> it takes a while to execute, and is incompatible with nohz_full kernels.
>
> What about using sched_setcpu to move the current thread to the same CPU
> (and move it back afterwards)? Surely that implies the required sort of
> rseq barrier that MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
> MEMBARRIER_CMD_FLAG_CPU performs?

I guess you refer to using sched_setaffinity(2) there? There are various
reasons why this may fail. For one, the affinity mask is a shared global
resource which can be changed by external applications. Also, setting
the affinity is really just a hint. In the presence of cpu hotplug and/or
cgroup cpusets, it is known to lead to situations where the kernel
just gives up and provides an affinity mask including all CPUs.
Therefore, using sched_setaffinity() and expecting to be pinned to
a specific CPU for correctness purposes seems brittle.

But _if_ we'd have something like a sched_setaffinity which we can
trust, yes, temporarily migrating to the target CPU, and observing that
we indeed run there, would AFAIU provide the same guarantee as the rseq
fence provided by membarrier. It would have a higher overhead than
membarrier as well.

>
> That is possible even without membarrier, so I wonder why registration
> of intent is needed for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.

I would answer that it is not possible to do this _reliably_ today
without membarrier (see above discussion of cpu hotplug, cgroups, and
modification of cpu affinity by external processes).

AFAIR, registration of intent for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
is mainly there to provide a programming model similar to private expedited
plain and core-sync cmds.

The registration of intent allows the kernel to further tweak what is
done internally and make tradeoffs which only impact applications
performing the registration.

>
>> In order to make sure the programming model is the same for expedited
>> private/global plain/sync-core/rseq membarrier commands, we require that
>> each process perform a registration beforehand.
>
> Hmm. At least it's not possible to unregister again.
>
> But I think it would be really useful to have some of these barriers
> available without registration, possibly in a more expensive form.

What would be wrong with doing a membarrier private-expedited-rseq registration
on libc startup, and exposing a glibc tunable to allow disabling this?

Thanks,

Mathieu


>
> Thanks,
> Florian

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2021-12-13 20:12:25

by Florian Weimer

Subject: Re: rseq + membarrier programming model

* Mathieu Desnoyers:

> ----- On Dec 13, 2021, at 2:29 PM, Florian Weimer [email protected] wrote:
>
>> * Mathieu Desnoyers:
>>
>>>> Could it fall back to
>>>> MEMBARRIER_CMD_GLOBAL instead?
>>>
>>> No. CMD_GLOBAL does not issue the required rseq fence used by the
>>> algorithm discussed. Also, CMD_GLOBAL has quite a few other shortcomings:
>>> it takes a while to execute, and is incompatible with nohz_full kernels.
>>
>> What about using sched_setcpu to move the current thread to the same CPU
>> (and move it back afterwards)? Surely that implies the required sort of
>> rseq barrier that MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
>> MEMBARRIER_CMD_FLAG_CPU performs?
>
> I guess you refer to using sched_setaffinity(2) there ? There are various
> reasons why this may fail. For one, the affinity mask is a shared global
> resource which can be changed by external applications.

So is process memory …

> Also, setting the affinity is really just a hint. In the presence of
> cpu hotplug and or cgroup cpuset, it is known to lead to situations
> where the kernel just gives up and provides an affinity mask including
> all CPUs.

How does CPU hotplug impact this negatively?

The cgroup cpuset issue clearly is a bug.

> Therefore, using sched_setaffinity() and expecting to be pinned to
> a specific CPU for correctness purposes seems brittle.

I'm pretty sure it used to work reliably for some forms of concurrency
control.

> But _if_ we'd have something like a sched_setaffinity which we can
> trust, yes, temporarily migrating to the target CPU, and observing that
> we indeed run there, would AFAIU provide the same guarantee as the rseq
> fence provided by membarrier. It would have a higher overhead than
> membarrier as well.

Presumably a signal could do it as well.

>> That is possible even without membarrier, so I wonder why registration
>> of intent is needed for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
>
> I would answer that it is not possible to do this _reliably_ today
> without membarrier (see above discussion of cpu hotplug, cgroups, and
> modification of cpu affinity by external processes).
>
> AFAIR, registration of intent for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
> is mainly there to provide a programming model similar to private expedited
> plain and core-sync cmds.
>
> The registration of intent allows the kernel to further tweak what is
> done internally and make tradeoffs which only impact applications
> performing the registration.

But if there is no strong performance argument to do so, this introduces
additional complexity into userspace. Surely we could say we just do
MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ at process start and document
failure (in case of seccomp etc.), but then why do this at all?

>>> In order to make sure the programming model is the same for expedited
>>> private/global plain/sync-core/rseq membarrier commands, we require that
>>> each process perform a registration beforehand.
>>
>> Hmm. At least it's not possible to unregister again.
>>
>> But I think it would be really useful to have some of these barriers
>> available without registration, possibly in a more expensive form.
>
> What would be wrong with doing a membarrier private-expedited-rseq
> registration on libc startup, and exposing a glibc tunable to allow
> disabling this ?

The configurations that need to be supported go from “no rseq“/“rseq”
to “no rseq“/“rseq”/“rseq with membarrier”. Everyone now needs to
think about implementing support for all three instead just the obvious
two.

Thanks,
Florian


2021-12-14 20:25:24

by Mathieu Desnoyers

Subject: Re: rseq + membarrier programming model



----- On Dec 13, 2021, at 3:12 PM, Florian Weimer [email protected] wrote:

> * Mathieu Desnoyers:
>
>> ----- On Dec 13, 2021, at 2:29 PM, Florian Weimer [email protected] wrote:
>>
>>> * Mathieu Desnoyers:
>>>
>>>>> Could it fall back to
>>>>> MEMBARRIER_CMD_GLOBAL instead?
>>>>
>>>> No. CMD_GLOBAL does not issue the required rseq fence used by the
>>>> algorithm discussed. Also, CMD_GLOBAL has quite a few other shortcomings:
>>>> it takes a while to execute, and is incompatible with nohz_full kernels.
>>>
>>> What about using sched_setcpu to move the current thread to the same CPU
>>> (and move it back afterwards)? Surely that implies the required sort of
>>> rseq barrier that MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with
>>> MEMBARRIER_CMD_FLAG_CPU performs?
>>
>> I guess you refer to using sched_setaffinity(2) there ? There are various
>> reasons why this may fail. For one, the affinity mask is a shared global
>> resource which can be changed by external applications.
>
> So is process memory …

Fair point.

>
>> Also, setting the affinity is really just a hint. In the presence of
>> cpu hotplug and or cgroup cpuset, it is known to lead to situations
>> where the kernel just gives up and provides an affinity mask including
>> all CPUs.
>
> How does CPU hotplug impact this negatively?

It may be OK for the rseq fence use-case specifically, but in general
relying on cpu affinity to "pin" to a specific CPU is problematic with
a hotplug scenario like this:

- Userspace thread sets affinity to CPU 3 (only)
- echo 0 > /sys/devices/system/cpu/cpu3/online (as root)

-> scheduler will hit:

select_fallback_rq():
    if (cpuset_cpus_allowed_fallback(p)) { -> false
        do_set_cpus_allowed(p, task_cpu_possible_mask(p));

thus setting the cpus allowed mask to "any of the possible cpus".

This can be confirmed by doing "cat /proc/${pid}/status | grep Cpus_allowed_list:"
before/after unplugging cpu 3. (side-note: in my test, the target application was
"sleep 5000", which never gets picked by the scheduler unless we force some activity
on it by delivering a signal. I used a SIGSTOP/SIGCONT.):

before:
Cpus_allowed_list: 3

after:
Cpus_allowed_list: 0-3

>
> The cgroup cpuset issue clearly is a bug.

For cgroupv2, there are cpuset.cpus (invariant wrt hotplug),
cpuset.cpus.effective (affected by hotplug) and cpuset.cpus.partition
(takes away from parent effective cpuset, invariant after creation).
cgroup controllers can be either threaded controllers or domain
controllers. Unfortunately cpuset is a threaded controller, which
means each thread can have its own cgroup cpuset.

I do not have a full understanding of the interaction between
sched_setaffinity and concurrent changes to the cgroup cpuset,
but I am concerned that scenarios where affinity is first "pinned"
to a specific cpu, and then an external process manager changes the
cpuset.cpus mask to exclude that cpu may cause issues.

I am also concerned for the rseq fence use-case (done with explicit
sched_setaffinity) about what would happen if threads belong to
different cgroup cpusets with threaded controllers. There we may
have situations where a thread fails to run on a specific CPU just
because it is not part of its cpuset, but another thread within the
same process successfully runs there while executing an rseq critical
section.

>
>> Therefore, using sched_setaffinity() and expecting to be pinned to
>> a specific CPU for correctness purposes seems brittle.
>
> I'm pretty sure it used to work reliably for some forms of concurrency
> control.

That being said, it may be OK for the specific case of an rseq-fence, considering
that if we affine to CPU A, and later discover that we run anywhere except on
CPU A while we explicitly requested to be pinned to that CPU, this means the
kernel had to take action and move us away from CPU A's runqueue because we
are not allowed to run there. So we could consider that this CPU is "quiescent"
in terms of rseq because no other thread belonging to our process runs there.
This appears to work only for cpusets applying to the entire process though,
not for threaded cpusets.

>
>> But _if_ we'd have something like a sched_setaffinity which we can
>> trust, yes, temporarily migrating to the target CPU, and observing that
>> we indeed run there, would AFAIU provide the same guarantee as the rseq
>> fence provided by membarrier. It would have a higher overhead than
>> membarrier as well.
>
> Presumably a signal could do it as well.

Fair point, but then you would have to send a signal to every thread, and
wait for each signal handler to have executed. membarrier improves on this
kind of scheme by integrating with the scheduler, leveraging its knowledge
of which threads are actively running. If a thread is not running, there
is no need to awaken it. This makes a huge performance difference for
heavily multi-threaded applications.

>
>>> That is possible even without membarrier, so I wonder why registration
>>> of intent is needed for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ.
>>
>> I would answer that it is not possible to do this _reliably_ today
>> without membarrier (see above discussion of cpu hotplug, cgroups, and
>> modification of cpu affinity by external processes).
>>
>> AFAIR, registration of intent for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
>> is mainly there to provide a programming model similar to private expedited
>> plain and core-sync cmds.
>>
>> The registration of intent allows the kernel to further tweak what is
>> done internally and make tradeoffs which only impact applications
>> performing the registration.
>
> But if there is no strong performance argument to do so, this introduces
> additional complexity into userspace. Surely we could say we just do
> MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ at process start and document
> failure (in case of seccomp etc.), but then why do this at all?

There are many performance gains we can get by having membarrier-expedited-rseq
registered. Some of those use-cases may be doable either by sending signals
to all threads, or by doing cpu affinity tricks, but using membarrier
is much more lightweight thanks to its integration with the Linux
scheduler. When a thread is not running, there is really no need to awaken
it.

In terms of use-cases, the rseq-fence is a compelling use-case enabling
various algorithms with rseq.

Other use-cases involve the "plain" memory barrier capabilities of membarrier.
These generally allow turning algorithms that require pairing memory
barrier instructions on fast and slow paths into even faster fast paths,
by pairing compiler barriers (asm memory clobber) on the fast paths
with membarrier system calls on the slow paths.

Finally, other use-cases involve the SYNC_CORE membarrier. This is mainly
for JITs, allowing them to issue a process-wide "fence" so they can
re-use memory after reclaim of JITted code.

In terms of overhead added into the process when membarrier-expedited
is registered, only specific cases are affected:

- SYNC_CORE: processes which have registered membarrier expedited sync-core
will issue sync_core_before_usermode() after each scheduling between
threads belonging to different processes (see membarrier_mm_sync_core_before_usermode).
It is a no-op for all architectures except x86, which implements its
return to user-space with sysexit, sysretl and sysretq, which are not core
serializing.

Because of the runtime overhead of the sync-core registration on x86,
I would recommend that only JITs register with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE.

- Plain memory barrier and RSEQ: Registering those adds no overhead except
on powerpc (see membarrier_arch_switch_mm). There, when context switching
between two user-space processes, an additional memory barrier is needed
because it is not implicitly issued by the architecture switch_mm.

I expect that the impact of this runtime overhead will be much more
limited than for the SYNC_CORE. Therefore having glibc auto-register
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ would make sense
considering the fast-path improvements this enables.

All of the expedited membarrier commands issue inter-processor interrupts
(IPIs) to CPUs running threads belonging to the same process. This may be
unexpected for hard-real-time applications, so this may be something they
will want to opt-out from with a tunable.

There are also the "global-expedited" membarrier commands, which are
done to deal with shared memory across processes. There, the processes
wishing to receive the IPIs need to be registered explicitly. This
ensures we don't disrupt other hard-real-time processes with unexpected
IPIs. The processes registered for global-expedited membarrier also have
the same overhead discussed above for plain/rseq membarrier registration
on powerpc. I do not expect the global-expedited registration to be done
automatically, it should really be opt-in by the applications/libraries
requiring membarrier to interact with other processes across shared memory.

>
>>>> In order to make sure the programming model is the same for expedited
>>>> private/global plain/sync-core/rseq membarrier commands, we require that
>>>> each process perform a registration beforehand.
>>>
>>> Hmm. At least it's not possible to unregister again.
>>>
>>> But I think it would be really useful to have some of these barriers
>>> available without registration, possibly in a more expensive form.
>>
>> What would be wrong with doing a membarrier private-expedited-rseq
>> registration on libc startup, and exposing a glibc tunable to allow
>> disabling this ?
>
> The configurations that need to be supported go from “no rseq“/“rseq”
> to “no rseq“/“rseq”/“rseq with membarrier”. Everyone now needs to
> think about implementing support for all three instead just the obvious
> two.

One thing to keep in mind is that within the Linux kernel, CONFIG_RSEQ
always selects CONFIG_MEMBARRIER. I've done this on purpose to simplify
the user-space programming model. Therefore, if the rseq system call is
implemented, membarrier is available, unless it's forbidden by seccomp.

However, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ only appears in kernel v5.10.

This means that starting from kernel v5.10, glibc can rely on having
both rseq and MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ available,
or else not bother with any of the registration.

This would simplify the programming model from a user perspective. If
glibc registers rseq, this guarantees that
MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ is available.

You can check for rseq availability with e.g.:

int rseq_available(void)
{
    int rc;

    rc = sys_rseq(NULL, 0, 0, 0);
    if (rc != -1)
        abort();
    switch (errno) {
    case ENOSYS:
        return 0;
    case EINVAL:
        return 1;
    default:
        abort();
    }
}

and check for membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ availability
by inspecting the mask returned by MEMBARRIER_CMD_QUERY, e.g.:

int status;

status = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
if (status < 0 || !(status & MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ))
    return false;
else
    return true;

I guess it all depends on how much you care about registering rseq on
kernels between 4.18 and 5.9 inclusive.

Thanks,

Mathieu


>
> Thanks,
> Florian

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com