LinuxLists.cc - Restartable Sequences system call merged into Linux

2018-06-11 20:06:01

Subject: Restartable Sequences system call merged into Linux

Hi!

Good news! The restartable sequences (rseq) system call is now merged into the master
branch of the Linux kernel within the 4.18 merge window:

https://github.com/torvalds/linux/commit/d82991a8688ad128b46db1b42d5d84396487a508

It would be important to discuss how we should proceed to integrate the library part
of rseq (see tools/testing/selftests/rseq/rseq*.{ch}) into glibc, or if it should
live in a standalone project.

It should be noted that there can be only one rseq TLS area registered per thread,
which can then be used by many libraries and by the executable, so this is a
process-wide (per-thread) resource that we need to manage carefully.

Thoughts ?

Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-11 20:12:55

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 11, 2018, at 3:55 PM, Florian Weimer [email protected] wrote:

> On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
>> It should be noted that there can be only one rseq TLS area registered per
>> thread,
>> which can then be used by many libraries and by the executable, so this is a
>> process-wide (per-thread) resource that we need to manage carefully.
>
> Is it possible to resize the area after thread creation, perhaps even
> from other threads?

I'm not sure why we would want to resize it. The per-thread area is fixed-size.
Its layout is here: include/uapi/linux/rseq.h: struct rseq

The ABI is designed so that all users (program and libraries) can interact
through this per-thread TLS area.
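
For readers who have not opened the header, the shape of such a fixed-size area can be sketched as below. This is a simplified mirror written for illustration, not a copy of include/uapi/linux/rseq.h: the type name rseq_layout_sketch is made up, and rseq_cs is shown as a plain 64-bit integer. The cpu_id_start, cpu_id and rseq_cs names appear in the kernel comment quoted later in this thread; the flags field and the 32-byte alignment are assumptions to be checked against the header.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical mirror of the per-thread rseq area. */
struct rseq_layout_sketch {
	uint32_t cpu_id_start;	/* CPU number read before a critical section */
	uint32_t cpu_id;	/* CPU number compared inside the critical section */
	uint64_t rseq_cs;	/* address of the active descriptor, or 0 */
	uint32_t flags;		/* assumed: per-thread behaviour flags */
} __attribute__((aligned(32)));

/* The area is fixed-size, so there is nothing to resize after thread creation. */
_Static_assert(sizeof(struct rseq_layout_sketch) == 32, "fixed 32-byte area");
_Static_assert(offsetof(struct rseq_layout_sketch, rseq_cs) == 8,
	       "rseq_cs follows the two CPU fields");
```

Because the size and field offsets are compile-time constants, every user in the process only needs the address of the thread's instance.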

>
> If there is only one contiguous area, this generally means there needs
> to be linker support, similar to what we have for initial-exec TLS today.

Not entirely sure what you imply by "one contiguous area". All we need is
a single fixed-size TLS area for each thread.

Thanks,

Mathieu

>
> Thanks,
> Florian

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-11 20:49:18

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
> It should be noted that there can be only one rseq TLS area registered per thread,
> which can then be used by many libraries and by the executable, so this is a
> process-wide (per-thread) resource that we need to manage carefully.

Is it possible to resize the area after thread creation, perhaps even
from other threads?

If there is only one contiguous area, this generally means there needs
to be linker support, similar to what we have for initial-exec TLS today.

Thanks,
Florian

2018-06-12 13:12:25

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/11/2018 10:04 PM, Mathieu Desnoyers wrote:
> ----- On Jun 11, 2018, at 3:55 PM, Florian Weimer [email protected] wrote:
>
>> On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
>>> It should be noted that there can be only one rseq TLS area registered per
>>> thread,
>>> which can then be used by many libraries and by the executable, so this is a
>>> process-wide (per-thread) resource that we need to manage carefully.
>>
>> Is it possible to resize the area after thread creation, perhaps even
>> from other threads?
>
> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
> Its layout is here: include/uapi/linux/rseq.h: struct rseq

Looks like I was mistaken and this is very similar to the robust mutex list.

Should we treat it the same way? Always allocate it for each new thread
and register it with the kernel?

> The ABI is designed so that all users (program and libraries) can interact
> through this per-thread TLS area.

Then the user code needs just the address of the structure.

How much coordination is needed between different users of this
interface? Looking at the section hacks, I don't think we want to
put this into glibc at this stage. It looks more like something for
which we traditionally require compiler support.

Thanks,
Florian

2018-06-12 16:32:24

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 12, 2018, at 9:11 AM, Florian Weimer [email protected] wrote:

> On 06/11/2018 10:04 PM, Mathieu Desnoyers wrote:
>> ----- On Jun 11, 2018, at 3:55 PM, Florian Weimer [email protected] wrote:
>>
>>> On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
>>>> It should be noted that there can be only one rseq TLS area registered per
>>>> thread,
>>>> which can then be used by many libraries and by the executable, so this is a
>>>> process-wide (per-thread) resource that we need to manage carefully.
>>>
>>> Is it possible to resize the area after thread creation, perhaps even
>>> from other threads?
>>
>> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
>> Its layout is here: include/uapi/linux/rseq.h: struct rseq
>
> Looks I was mistaken and this is very similar to the robust mutex list.
>
> Should we treat it the same way? Always allocate it for each new thread
> and register it with the kernel?

That would be an efficient way to do it, indeed. There is very little
performance overhead to have rseq registered for all threads, whether or
not they intend to run rseq critical sections.

>
>> The ABI is designed so that all users (program and libraries) can interact
>> through this per-thread TLS area.
>
> Then the user code needs just the address of the structure.

Yes.

>
> How much coordination is needed between different users of this
> interface? Looking at the the section hacks, I don't think we want to
> put this into glibc at this stage. It looks more like something for
> which we traditionally require compiler support.

I really don't mind maintaining a separate project containing librseq
along with the headers needed to facilitate declaration of rseq critical
sections. This specifically does not need much coordination between users of
the interface.

The part which really requires coordination between users is registration
to the kernel (and ownership) of the rseq TLS area.

I have a few possible approaches in mind (feel free to suggest other
options):

A) glibc exposes a strong __rseq_abi TLS symbol:

- should ideally *not* be global-dynamic for performance reasons,
- registration to kernel can either be handled explicitly by requiring
  application or libraries to call an API, or implicitly at thread
  creation,
- requires all rseq users to upgrade to newer glibc. Early rseq users
  (libs and applications) registering their own rseq TLS will conflict
  with newer glibc.

B) librseq.so exposes a strong __rseq_abi symbol:

- should ideally *not* be global-dynamic for performance reasons, but
  testing shows that using initial-exec causes issues in situations where
  librseq.so ends up being dlopen'd (e.g. java virtual machine dlopening
  the lttng-ust tracer linked against librseq.so),
- registration/unregistration of area to kernel can either be performed
  lazily on first use, destruction done using pthread_key, or require an
  explicit API call from application,
- a per-thread refcount in a TLS could allow many users to call the
  registration/unregistration API, and lazy registration,
- an early-user application which also exposes a __rseq_abi strong symbol
  would conflict with librseq.so.

C) __rseq_abi symbol declared weak within each user (application, librseq,
other libraries, glibc):

- should ideally *not* be global-dynamic for performance reasons,
- however, initial-exec causes issues when librseq or early user libraries
  are dlopen'd (e.g. java runtime dlopening lttng-ust),
- a weak symbol allows combining early user libs/apps with glibc/librseq
  exposing the same symbol,
- considering that glibc is AFAIK never dlopen'd, does not cause exhaustion
  of initial-exec TLS entries in cases where librseq.so or early adopter
  libs are dlopen'd,
- if glibc implicitly registers the rseq area, *and* librseq.so also wants
  to register it, *and* early adopters also want to register it, we should
  come up with a refcount scheme in the TLS ensuring that registration and
  unregistration are only done when the first/last user comes/goes away.
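
The refcount scheme mentioned in the last point can be sketched as follows. This is only an illustration under stated assumptions: the names rseq_refcount_sketch, rseq_user_get_sketch and rseq_user_put_sketch are hypothetical, and the two kernel_*_sketch helpers merely count calls where a real implementation would issue the rseq system call. The counter is declared weak, in the spirit of option C, so that several objects could carry the same definition.

```c
/* Hypothetical per-thread counter shared by every rseq user in the thread.
 * Weak, so that each user can define it and one instance is picked. */
__attribute__((weak)) __thread unsigned int rseq_refcount_sketch;

/* Counters standing in for the real registration calls. */
static __thread unsigned int nr_register_calls;
static __thread unsigned int nr_unregister_calls;

/* Placeholders: real code would issue the rseq system call here. */
static int kernel_register_sketch(void)
{
	nr_register_calls++;
	return 0;
}

static int kernel_unregister_sketch(void)
{
	nr_unregister_calls++;
	return 0;
}

/* Called by each user before its first critical section on this thread. */
int rseq_user_get_sketch(void)
{
	if (rseq_refcount_sketch++ > 0)
		return 0;		/* another user already registered */
	int ret = kernel_register_sketch();
	if (ret)
		rseq_refcount_sketch--;	/* undo on failure */
	return ret;
}

/* Called by each user when it no longer needs rseq on this thread. */
int rseq_user_put_sketch(void)
{
	if (rseq_refcount_sketch == 0)
		return -1;		/* unbalanced put */
	if (--rseq_refcount_sketch > 0)
		return 0;		/* other users remain */
	return kernel_unregister_sketch();
}
```

With this shape, glibc, librseq.so and an early adopter could each call the get/put pair without knowing about one another; only the first get and the last put on a thread reach the kernel. The sketch ignores signal handlers that nest inside get or put.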

Thoughts ?

Thanks!

Mathieu

>
> Thanks,
> Florian

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-13 08:22:30

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/12/2018 06:31 PM, Mathieu Desnoyers wrote:
> ----- On Jun 12, 2018, at 9:11 AM, Florian Weimer [email protected] wrote:
>
>> On 06/11/2018 10:04 PM, Mathieu Desnoyers wrote:
>>> ----- On Jun 11, 2018, at 3:55 PM, Florian Weimer [email protected] wrote:
>>>
>>>> On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
>>>>> It should be noted that there can be only one rseq TLS area registered per
>>>>> thread,
>>>>> which can then be used by many libraries and by the executable, so this is a
>>>>> process-wide (per-thread) resource that we need to manage carefully.
>>>>
>>>> Is it possible to resize the area after thread creation, perhaps even
>>>> from other threads?
>>>
>>> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
>>> Its layout is here: include/uapi/linux/rseq.h: struct rseq
>>
>> Looks I was mistaken and this is very similar to the robust mutex list.
>>
>> Should we treat it the same way? Always allocate it for each new thread
>> and register it with the kernel?
>
> That would be an efficient way to do it, indeed. There is very little
> performance overhead to have rseq registered for all threads, whether or
> not they intend to run rseq critical sections.
>
>>
>>> The ABI is designed so that all users (program and libraries) can interact
>>> through this per-thread TLS area.
>>
>> Then the user code needs just the address of the structure.
>
> Yes.

So we'd add

struct rseq *rseq_location (void);

and be done with it? It would return the address of the thread-local
variable, similar to __errno_location.

Or we could add something like this:

extern __thread struct rseq pthread_rseq_area_np
        __attribute__ ((__tls_model__ ("initial-exec")));

But of course only for recent-enough GNU compilers (and Clang, which
identifies itself as GNU).

The advantage of the function call is that it often results in more
compact code. Making the initial-exec nature part of the ABI has the
advantage that the applications could use the fact of the constant
offset to the thread pointer if they desire to do so.
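
The two variants can be put side by side in a small sketch. Everything here is hypothetical: struct rseq_sketch stands in for the kernel's struct rseq, and rseq_area_np_sketch and rseq_location_sketch are illustrative names, not a glibc interface.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct rseq; only its existence matters here. */
struct rseq_sketch {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
} __attribute__((aligned(32)));

/* Variable variant: the TLS symbol itself is the interface, and the
 * initial-exec model gives it a constant offset from the thread pointer. */
__thread struct rseq_sketch rseq_area_np_sketch
	__attribute__((tls_model("initial-exec")));

/* Function variant: callers only see an address, like __errno_location(). */
struct rseq_sketch *rseq_location_sketch(void)
{
	return &rseq_area_np_sketch;
}
```

A caller of the function variant would write rseq_location_sketch()->cpu_id, while a caller of the variable variant would access rseq_area_np_sketch.cpu_id directly and let the compiler emit a thread-pointer-relative load.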

Would we need to document which glibc functions use
pthread_rseq_area_np, so that applications do not call them when they
themselves use the area?

Do we actually need to use RSEQ_FLAG_UNREGISTER prior to thread exit?
Why can't the kernel do it for us?

> - requires all rseq users to upgrade to newer glibc. Early rseq users
> (libs and applications) registering their own rseq TLS will conflict
> with newer glibc.

We will need to do something about stack unwinding and longjmp anyway (I
assume the kernel already handles signals for us), so it may not be
possible to use restartable sequences in any substantial way without a
system upgrade anyway.

> B) librseq.so exposes a strong __rseq_abi symbol:
>
> - should ideally *not* be global-dynamic for performance reasons, but
> testing shows that using initial-exec causes issues in situations where
> librseq.so ends up being dlopen'd (e.g. java virtual machine dlopening
> the lttng-ust tracer linked against librseq.so),

Just an aside:

You can work around that using preloading. On the glibc side, we could
also make the initial reserve configurable. On 64-bit, there really is
no reason not to use a different TCB allocation scheme which would allow
you to create a few threads before the initial-exec TLS area cannot be
extended.

The existing approach dates back to LinuxThreads and its TCB collocated
with the stack. But changes in the next few months are not very likely.

> C) __rseq_abi symbol declared weak within each user (application, librseq,
> other libraries, glibc):

We can have multiple non-weak definitions for the symbol. It should work
as long as only the definition in glibc has a symbol version.

__rseq_abi as a name is problematic because it's in the internal namespace.

Thanks,
Florian

2018-06-13 11:49:08

by Heiko Carstens

Subject: Re: Restartable Sequences system call merged into Linux

On Mon, Jun 11, 2018 at 03:49:18PM -0400, Mathieu Desnoyers wrote:
> Hi!
>
> Good news! The restartable sequences (rseq) system call is now merged into the master
> branch of the Linux kernel within the 4.18 merge window:
>
> https://github.com/torvalds/linux/commit/d82991a8688ad128b46db1b42d5d84396487a508
>
> It would be important to discuss how we should proceed to integrate the library part
> of rseq (see tools/testing/selftests/rseq/rseq*.{ch}) into glibc, or if it should
> live in a standalone project.

Is there any documentation available of what is the exact semantics of the
functions that have to be implemented for additional architectures?

I could look at rseq-skip.h and e.g. rseq-x86.h and try to figure out what
would be the correct implementation for s390. But having that somewhere
written down, e.g. as comments in one of the implementations, would be very
helpful.

2018-06-13 16:15:26

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 13, 2018, at 7:48 AM, heiko carstens [email protected] wrote:

> On Mon, Jun 11, 2018 at 03:49:18PM -0400, Mathieu Desnoyers wrote:
>> Hi!
>>
>> Good news! The restartable sequences (rseq) system call is now merged into the
>> master
>> branch of the Linux kernel within the 4.18 merge window:
>>
>> https://github.com/torvalds/linux/commit/d82991a8688ad128b46db1b42d5d84396487a508
>>
>> It would be important to discuss how we should proceed to integrate the library
>> part
>> of rseq (see tools/testing/selftests/rseq/rseq*.{ch}) into glibc, or if it
>> should
>> live in a standalone project.
>
> Is there any documentation available of what is the exact semantics of the
> functions that have to be implemented for additional architectures?

It's documented on top of kernel/rseq.c:

/*
*
* Restartable sequences are a lightweight interface that allows
* user-level code to be executed atomically relative to scheduler
* preemption and signal delivery. Typically used for implementing
* per-cpu operations.
*
* It allows user-space to perform update operations on per-cpu data
* without requiring heavy-weight atomic operations.
*
* Detailed algorithm of rseq user-space assembly sequences:
*
*                     init(rseq_cs)
*                     cpu = TLS->rseq::cpu_id_start
*   [1]               TLS->rseq::rseq_cs = rseq_cs
*   [start_ip]        ----------------------------
*   [2]               if (cpu != TLS->rseq::cpu_id)
*                             goto abort_ip;
*   [3]               <last_instruction_in_cs>
*   [post_commit_ip]  ----------------------------
*
* The address of jump target abort_ip must be outside the critical
* region, i.e.:
*
* [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
*
* Steps [2]-[3] (inclusive) need to be a sequence of instructions in
* userspace that can handle being interrupted between any of those
* instructions, and then resumed to the abort_ip.
*
* 1. Userspace stores the address of the struct rseq_cs assembly
* block descriptor into the rseq_cs field of the registered
* struct rseq TLS area. This update is performed through a single
* store within the inline assembly instruction sequence.
* [start_ip]
*
* 2. Userspace tests to check whether the current cpu_id field match
* the cpu number loaded before start_ip, branching to abort_ip
* in case of a mismatch.
*
* If the sequence is preempted or interrupted by a signal
* at or after start_ip and before post_commit_ip, then the kernel
* clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
* ip to abort_ip before returning to user-space, so the preempted
* execution resumes at abort_ip.
*
* 3. Userspace critical section final instruction before
* post_commit_ip is the commit. The critical section is
* self-terminating.
* [post_commit_ip]
*
* 4. <success>
*
* On failure at [2], or if interrupted by preempt or signal delivery
* between [1] and [3]:
*
* [abort_ip]
* F1. <failure>
*/
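
The address rules in the comment above can be restated as plain C predicates. This is a model of the stated rules only, not kernel code: struct cs_bounds_sketch and the three function names are hypothetical, and real critical sections are written in inline assembly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor holding the three addresses named in the comment. */
struct cs_bounds_sketch {
	uintptr_t start_ip;
	uintptr_t post_commit_ip;
	uintptr_t abort_ip;
};

/* True when ip is at or after start_ip and before post_commit_ip. */
bool ip_in_critical_section(const struct cs_bounds_sketch *cs, uintptr_t ip)
{
	return ip >= cs->start_ip && ip < cs->post_commit_ip;
}

/* abort_ip must lie outside the critical region. */
bool abort_ip_is_valid(const struct cs_bounds_sketch *cs)
{
	return cs->abort_ip < cs->start_ip ||
	       cs->abort_ip >= cs->post_commit_ip;
}

/* Where user-space resumes after preemption or signal delivery at ip. */
uintptr_t resume_ip(const struct cs_bounds_sketch *cs, uintptr_t ip)
{
	return ip_in_critical_section(cs, ip) ? cs->abort_ip : ip;
}
```

For example, with start_ip = 100, post_commit_ip = 120 and abort_ip = 200, an interruption at address 119 resumes at 200, while an interruption at 120 (after the commit) resumes at 120 unchanged.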

> I could look at rseq-skip.h and e.g. rseq-x86.h and try to figure out what
> would be the correct implementation for s390. But having that somewhere
> written down, e.g. as comments in one of the implementations, would be very
> helpful.

The first architecture implemented was rseq-x86.h. Boqun derived rseq-ppc.h
from it, and I derived rseq-arm.h from it. Feel free to start from whichever
architecture has the instruction set which is most similar to yours.

Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-13 19:54:23

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 13, 2018, at 12:14 PM, Mathieu Desnoyers [email protected] wrote:

> ----- On Jun 13, 2018, at 7:48 AM, heiko carstens [email protected]
> wrote:
[...]
>>
>> Is there any documentation available of what is the exact semantics of the
>> functions that have to be implemented for additional architectures?
>
> It's documented on top of kernel/rseq.c:
>

[...]

>
> The first architecture implemented was rseq-x86.h. Boqun derived rseq-ppc.h
> from it, and I derived rseq-arm.h from it. Feel free to start from whichever
> architecture has the instruction set which is most similar to yours.

One more thing: adding full support for your architecture to the rseq selftests
also requires extending tools/testing/selftests/rseq/param_test.c to implement
the RSEQ_INJECT_INPUT, INJECT_ASM_REG, RSEQ_INJECT_CLOBBER and RSEQ_INJECT_ASM
macros for your architecture. Those simply define the inline asm operands
and assembly code needed to inject delay loops within the rseq critical sections,
which greatly facilitates testing.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-14 12:30:35

by Pavel Machek

Subject: Re: Restartable Sequences system call merged into Linux

On Tue 2018-06-12 12:31:24, Mathieu Desnoyers wrote:
> ----- On Jun 12, 2018, at 9:11 AM, Florian Weimer [email protected] wrote:
>
> > On 06/11/2018 10:04 PM, Mathieu Desnoyers wrote:
> >> ----- On Jun 11, 2018, at 3:55 PM, Florian Weimer [email protected] wrote:
> >>
> >>> On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
> >>>> It should be noted that there can be only one rseq TLS area registered per
> >>>> thread,
> >>>> which can then be used by many libraries and by the executable, so this is a
> >>>> process-wide (per-thread) resource that we need to manage carefully.
> >>>
> >>> Is it possible to resize the area after thread creation, perhaps even
> >>> from other threads?
> >>
> >> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
> >> Its layout is here: include/uapi/linux/rseq.h: struct rseq
> >
> > Looks I was mistaken and this is very similar to the robust mutex list.
> >
> > Should we treat it the same way? Always allocate it for each new thread
> > and register it with the kernel?
>
> That would be an efficient way to do it, indeed. There is very little
> performance overhead to have rseq registered for all threads, whether or
> not they intend to run rseq critical sections.

People with slow / low memory machines would prefer not to see
overhead they don't need...

> I have a few possible approaches in mind (feel free to suggest other
> options):
>
> A) glibc exposes a strong __rseq_abi TLS symbol:
>
> - should ideally *not* be global-dynamic for performance reasons,
> - registration to kernel can either be handled explicitly by requiring
> application or libraries to call an API, or implicitly at thread
> creation,

...so I'd prefer explicit API call.

> B) librseq.so exposes a strong __rseq_abi symbol:

Works for me.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2018-06-14 13:02:43

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 14, 2018, at 8:27 AM, Pavel Machek [email protected] wrote:

> On Tue 2018-06-12 12:31:24, Mathieu Desnoyers wrote:
>> ----- On Jun 12, 2018, at 9:11 AM, Florian Weimer [email protected] wrote:
>>
>> > On 06/11/2018 10:04 PM, Mathieu Desnoyers wrote:
>> >> ----- On Jun 11, 2018, at 3:55 PM, Florian Weimer [email protected] wrote:
>> >>
>> >>> On 06/11/2018 09:49 PM, Mathieu Desnoyers wrote:
>> >>>> It should be noted that there can be only one rseq TLS area registered per
>> >>>> thread,
>> >>>> which can then be used by many libraries and by the executable, so this is a
>> >>>> process-wide (per-thread) resource that we need to manage carefully.
>> >>>
>> >>> Is it possible to resize the area after thread creation, perhaps even
>> >>> from other threads?
>> >>
>> >> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
>> >> Its layout is here: include/uapi/linux/rseq.h: struct rseq
>> >
>> > Looks I was mistaken and this is very similar to the robust mutex list.
>> >
>> > Should we treat it the same way? Always allocate it for each new thread
>> > and register it with the kernel?
>>
>> That would be an efficient way to do it, indeed. There is very little
>> performance overhead to have rseq registered for all threads, whether or
>> not they intend to run rseq critical sections.
>
> People with slow / low memory machines would prefer not to see
> overhead they don't need...

In terms of memory usage, if people don't want the extra few bytes of memory
used by rseq in the kernel, they should use CONFIG_RSEQ=n.

In terms of overhead, let's have a closer look at what it means: when a thread
is registered to rseq, but does not enter rseq critical sections, only this
extra work is done by the kernel:

- rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
  flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
  section when returning to user-space,
- rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
  whether it's in a rseq critical section,
- rseq_migrate(): on migration, the scheduler sets TIF_NOTIFY_RESUME as well.

>
>> I have a few possible approaches in mind (feel free to suggest other
>> options):
>>
>> A) glibc exposes a strong __rseq_abi TLS symbol:
>>
>> - should ideally *not* be global-dynamic for performance reasons,
>> - registration to kernel can either be handled explicitly by requiring
>> application or libraries to call an API, or implicitly at thread
>> creation,
>
> ...so I'd prefer explicit API call.

I have use-cases where a library wants to link against librseq and have rseq
critical sections, without requiring the application to explicitly add rseq
registration calls on thread creation/destruction. Is there a way to register
callbacks to glibc which could be invoked on thread creation/destruction ?

Then if we include dynamic loading of libraries (dlopen/dlclose) in the
picture, this gets even worse, as we'd need to be able to iterate on all
existing threads to invoke registration/unregistration callbacks.

One alternative approach would be to let the user library lazily register rseq
when needed, and use a pthread_key for unregistration. However, this does not
allow dlclose of the user library without figuring out a way to iterate on all
threads.
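
A minimal sketch of that lazy scheme, assuming POSIX threads, could look like this. The names ending in _sketch are hypothetical, and the two atomic counters stand in for the actual registration and unregistration system calls.

```c
#include <pthread.h>
#include <stdatomic.h>

static pthread_key_t rseq_key_sketch;
static pthread_once_t rseq_key_once = PTHREAD_ONCE_INIT;
static __thread int rseq_registered_sketch;

/* Counters standing in for the real system calls. */
static atomic_int nr_register_calls;
static atomic_int nr_unregister_calls;

/* Runs at thread exit for every thread that set a non-NULL key value. */
static void rseq_thread_exit_sketch(void *unused)
{
	(void)unused;
	atomic_fetch_add(&nr_unregister_calls, 1); /* real code: unregister */
	rseq_registered_sketch = 0;
}

static void rseq_key_init_sketch(void)
{
	/* Error handling omitted; a failed key makes setspecific fail below. */
	pthread_key_create(&rseq_key_sketch, rseq_thread_exit_sketch);
}

/* Call before entering a critical section; cheap after the first call. */
int rseq_ensure_registered_sketch(void)
{
	if (rseq_registered_sketch)
		return 0;
	if (pthread_once(&rseq_key_once, rseq_key_init_sketch))
		return -1;
	/* Any non-NULL value arms the destructor for this thread. */
	if (pthread_setspecific(rseq_key_sketch, &rseq_registered_sketch))
		return -1;
	atomic_fetch_add(&nr_register_calls, 1); /* real code: register */
	rseq_registered_sketch = 1;
	return 0;
}
```

The destructor is code that lives in the user library, which illustrates the dlclose problem: once the library is unloaded, threads that registered lazily would still hold a key whose destructor points into unmapped text.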

Another alternative would be to somehow let glibc handle the registration,
perhaps only doing it for applications expressing their interest for rseq.

Thoughts ?

Thanks,

Mathieu

>
>> B) librseq.so exposes a strong __rseq_abi symbol:
>
> Works for me.
> Pavel
>
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures)
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-14 13:26:42

by Pavel Machek

Subject: Re: Restartable Sequences system call merged into Linux

Hi!

> >> >>>> It should be noted that there can be only one rseq TLS area registered per
> >> >>>> thread,
> >> >>>> which can then be used by many libraries and by the executable, so this is a
> >> >>>> process-wide (per-thread) resource that we need to manage carefully.
> >> >>>
> >> >>> Is it possible to resize the area after thread creation, perhaps even
> >> >>> from other threads?
> >> >>
> >> >> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
> >> >> Its layout is here: include/uapi/linux/rseq.h: struct rseq
> >> >
> >> > Looks I was mistaken and this is very similar to the robust mutex list.
> >> >
> >> > Should we treat it the same way? Always allocate it for each new thread
> >> > and register it with the kernel?
> >>
> >> That would be an efficient way to do it, indeed. There is very little
> >> performance overhead to have rseq registered for all threads, whether or
> >> not they intend to run rseq critical sections.
> >
> > People with slow / low memory machines would prefer not to see
> > overhead they don't need...
>
> In terms of memory usage, if people don't want the extra few bytes of memory
> used by rseq in the kernel, they should use CONFIG_RSEQ=n.
>
> In terms of overhead, let's have a closer look at what it means: when a thread
> is registered to rseq, but does not enter rseq critical sections, only this
> extra work is done by the kernel:
>
> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
> section when returning to user-space,
> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
> whether it's in a rseq critical section,
> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,

Yes, this is not likely to be noticeable.

But the proposal wanted to add a syscall to thread creation, right?
And I believe that may be noticeable.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2018-06-14 13:34:25

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/14/2018 03:25 PM, Pavel Machek wrote:

> But the proposal wanted to add a syscall to thread creation, right?
> And I believe that may be noticeable.

We already call set_robust_list, so we could just pass a larger area to
that and the kernel could use it. Then no additional system call would
be needed in the common case (new kernel which recognizes the new area
size).

But then we cannot use an initial-exec thread local variable for it
(although the offset from the thread pointer will still be constant, of
course).

Thanks,
Florian

2018-06-14 13:40:40

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 14, 2018, at 9:25 AM, Pavel Machek [email protected] wrote:

> Hi!
>
>> >> >>>> It should be noted that there can be only one rseq TLS area registered per
>> >> >>>> thread,
>> >> >>>> which can then be used by many libraries and by the executable, so this is a
>> >> >>>> process-wide (per-thread) resource that we need to manage carefully.
>> >> >>>
>> >> >>> Is it possible to resize the area after thread creation, perhaps even
>> >> >>> from other threads?
>> >> >>
>> >> >> I'm not sure why we would want to resize it. The per-thread area is fixed-size.
>> >> >> Its layout is here: include/uapi/linux/rseq.h: struct rseq
>> >> >
>> >> > Looks I was mistaken and this is very similar to the robust mutex list.
>> >> >
>> >> > Should we treat it the same way? Always allocate it for each new thread
>> >> > and register it with the kernel?
>> >>
>> >> That would be an efficient way to do it, indeed. There is very little
>> >> performance overhead to have rseq registered for all threads, whether or
>> >> not they intend to run rseq critical sections.
>> >
>> > People with slow / low memory machines would prefer not to see
>> > overhead they don't need...
>>
>> In terms of memory usage, if people don't want the extra few bytes of memory
>> used by rseq in the kernel, they should use CONFIG_RSEQ=n.
>>
>> In terms of overhead, let's have a closer look at what it means: when a thread
>> is registered to rseq, but does not enter rseq critical sections, only this
>> extra work is done by the kernel:
>>
>> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
>> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
>> section when returning to user-space,
>> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
>> whether it's in a rseq critical section,
>> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,
>
> Yes, this is not likely to be noticeable.
>
> But the proposal wanted to add a syscall to thread creation, right?
> And I believe that may be noticeable.

Fair point! Do we have a standard benchmark that would stress this ?

If it ends up being noticeable overhead, I wonder whether we could extend clone() with a
new CLONE_RSEQ flag so glibc could pass a pointer to the rseq TLS area through an extra
argument to the clone system call rather than do an extra syscall on thread creation ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-14 13:47:36

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 14, 2018, at 9:32 AM, Florian Weimer [email protected] wrote:

> On 06/14/2018 03:25 PM, Pavel Machek wrote:
>
>> But the proposal wanted to add a syscall to thread creation, right?
>> And I believe that may be noticeable.
>
> We already call set_robust_list, so we could just pass a larger area to
> that and the kernel could use it. Then no additional system call would
> be needed in the common case (new kernel which recognizes the new area
> size).
>
> But then we cannot use an initial-exec thread local variable for it
> (although the offset from the thread pointer will still be constant, of
> course).

I'm wondering whether we could turn the problem around: expose a new
system call allowing to register an array of pointers to per-thread data,
which would be used rather than set_robust_list when available. This way,
we could register both the robust list and rseq with a single system call,
e.g.:

enum linux_tls_area_type {
        LINUX_TLS_ROBUST_LIST,
        LINUX_TLS_RSEQ,
};

struct linux_tls_area_item {
        enum linux_tls_area_type type;
        void *p;
};

long sys_register_tls_areas(struct linux_tls_area_item *array, size_t nb);

This would allow registering various TLS data structures with a single
system call without hindering flexibility on the user-space side. For
instance, we could still use initial-exec and the __rseq_abi symbol for
rseq with this approach.
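
To illustrate how a caller might use such an interface, here is a user-space model of the proposed call. No such system call exists; the all-or-nothing behaviour, the -EBUSY and -EINVAL results, the LINUX_TLS_NR_TYPES sentinel and the tls_registry_sketch type are assumptions made for this sketch.

```c
#include <errno.h>
#include <stddef.h>

enum linux_tls_area_type {
	LINUX_TLS_ROBUST_LIST,
	LINUX_TLS_RSEQ,
	LINUX_TLS_NR_TYPES,	/* added for the sketch: number of known types */
};

struct linux_tls_area_item {
	enum linux_tls_area_type type;
	void *p;
};

/* What a thread ends up with registered, one slot per area type. */
struct tls_registry_sketch {
	void *area[LINUX_TLS_NR_TYPES];
};

/* Model of the proposed call: either every item registers, or none does. */
long register_tls_areas_sketch(struct tls_registry_sketch *reg,
			       const struct linux_tls_area_item *array,
			       size_t nb)
{
	struct tls_registry_sketch tmp = *reg;	/* work on a copy */
	size_t i;

	for (i = 0; i < nb; i++) {
		unsigned int type = (unsigned int)array[i].type;

		if (type >= LINUX_TLS_NR_TYPES || !array[i].p)
			return -EINVAL;		/* unknown type or NULL area */
		if (tmp.area[type])
			return -EBUSY;		/* only one area per type */
		tmp.area[type] = array[i].p;
	}
	*reg = tmp;				/* commit all items at once */
	return 0;
}
```

A thread-creation path would build a two-element array, one item for the robust list head and one for the rseq area, and pass it in a single call instead of issuing set_robust_list and rseq separately.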

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-14 13:50:40

by Pavel Machek

Subject: Re: Restartable Sequences system call merged into Linux

Hi!

> >> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
> >> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
> >> section when returning to user-space,
> >> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
> >> whether it's in a rseq critical section,
> >> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,
> >
> > Yes, this is not likely to be noticeable.
> >
> > But the proposal wanted to add a syscall to thread creation, right?
> > And I believe that may be noticeable.
>
> Fair point! Do we have a standard benchmark that would stress this ?

Web server performance benchmarks basically test clone() performance
in many cases.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2018-06-14 14:01:53

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/14/2018 03:49 PM, Pavel Machek wrote:
> Hi!
>
>>>> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
>>>> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
>>>> section when returning to user-space,
>>>> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
>>>> whether it's in a rseq critical section,
>>>> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,
>>>
>>> Yes, this is not likely to be noticeable.
>>>
>>> But the proposal wanted to add a syscall to thread creation, right?
>>> And I believe that may be noticeable.
>>
>> Fair point! Do we have a standard benchmark that would stress this ?
>
> Web server performance benchmarks basically test clone() performance
> in many cases.

Isn't that fork? I expect that the rseq arena is inherited on fork and
fork-type clone, otherwise it's going to be painful.

Thanks,
Florian

2018-06-14 14:38:32

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 14, 2018, at 10:00 AM, Florian Weimer wrote:

> On 06/14/2018 03:49 PM, Pavel Machek wrote:
>> Hi!
>>
>>>>> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
>>>>> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
>>>>> section when returning to user-space,
>>>>> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
>>>>> whether it's in a rseq critical section,
>>>>> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,
>>>>
>>>> Yes, this is not likely to be noticeable.
>>>>
>>>> But the proposal wanted to add a syscall to thread creation, right?
>>>> And I believe that may be noticeable.
>>>
>>> Fair point! Do we have a standard benchmark that would stress this ?
>>
>> Web server performance benchmarks basically test clone() performance
>> in many cases.
>
> Isn't that fork? I expect that the rseq arena is inherited on fork and
> fork-type clone, otherwise it's going to be painful.

On fork or clone creating a new process, the rseq tls area is inherited
from the thread that does the fork syscall.

On creation of a new thread with clone, there is no such inheritance.
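
Since a thread created with clone starts without a registration, each
new thread has to issue the rseq system call itself. A minimal sketch
of that per-thread call follows. It assumes a GCC or Clang toolchain
on Linux; struct rseq_area is a local copy of the original 32-byte
layout of struct rseq, EXAMPLE_RSEQ_SIG is an arbitrary signature
picked for this example, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Local copy of the original 32-byte layout from
 * include/uapi/linux/rseq.h, so the sketch builds without that header.
 */
struct rseq_area {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint32_t flags;
} __attribute__((aligned(32)));

/* Arbitrary signature value chosen for this example. */
#define EXAMPLE_RSEQ_SIG 0x53053053

/* One area per thread, placed in TLS. */
static __thread struct rseq_area thread_rseq_area;

/*
 * Register the calling thread's area with the kernel.
 * Returns 0 on success, or a positive errno value on failure.
 */
static int register_rseq_current_thread(void)
{
#ifdef __NR_rseq
	thread_rseq_area.cpu_id = (uint32_t)-1;	/* RSEQ_CPU_ID_UNINITIALIZED */
	if (syscall(__NR_rseq, &thread_rseq_area,
		    (uint32_t)sizeof(thread_rseq_area), 0,
		    EXAMPLE_RSEQ_SIG) == 0)
		return 0;
	return errno;
#else
	return ENOSYS;	/* headers predate the rseq system call */
#endif
}
```

A thread library would call this once from its thread start routine.
If an area is already registered for the thread, for instance by the C
library, the kernel rejects the second registration (EBUSY or EINVAL),
which is the "only one area per thread" constraint in practice.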

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-14 14:42:46

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/14/2018 04:36 PM, Mathieu Desnoyers wrote:
> ----- On Jun 14, 2018, at 10:00 AM, Florian Weimer wrote:
>
>> On 06/14/2018 03:49 PM, Pavel Machek wrote:
>>> Hi!
>>>
>>>>>> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
>>>>>> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
>>>>>> section when returning to user-space,
>>>>>> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
>>>>>> whether it's in a rseq critical section,
>>>>>> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,
>>>>>
>>>>> Yes, this is not likely to be noticeable.
>>>>>
>>>>> But the proposal wanted to add a syscall to thread creation, right?
>>>>> And I believe that may be noticeable.
>>>>
>>>> Fair point! Do we have a standard benchmark that would stress this ?
>>>
>>> Web server performance benchmarks basically test clone() performance
>>> in many cases.
>>
>> Isn't that fork? I expect that the rseq arena is inherited on fork and
>> fork-type clone, otherwise it's going to be painful.
>
> On fork or clone creating a new process, the rseq tls area is inherited
> from the thread that does the fork syscall.
>
> On creation of a new thread with clone, there is no such inheritance.

Makes sense. So fork-based (web) servers will not be impacted by the
additional system call, and thread-based servers likely use a thread
pool anyway. I'm not really concerned about the additional system call
here.

Thanks,
Florian

2018-06-14 15:10:26

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 14, 2018, at 10:41 AM, Florian Weimer wrote:

> On 06/14/2018 04:36 PM, Mathieu Desnoyers wrote:
>> ----- On Jun 14, 2018, at 10:00 AM, Florian Weimer wrote:
>>
>>> On 06/14/2018 03:49 PM, Pavel Machek wrote:
>>>> Hi!
>>>>
>>>>>>> - rseq_preempt(): on preemption, the scheduler sets the TIF_NOTIFY_RESUME thread
>>>>>>> flag, so rseq_handle_notify_resume() can check whether it's in a rseq critical
>>>>>>> section when returning to user-space,
>>>>>>> - rseq_signal_deliver(): on signal delivery, rseq_handle_notify_resume() checks
>>>>>>> whether it's in a rseq critical section,
>>>>>>> - rseq_migrate: on migration, the scheduler sets TIF_NOTIFY_RESUME as well,
>>>>>>
>>>>>> Yes, this is not likely to be noticeable.
>>>>>>
>>>>>> But the proposal wanted to add a syscall to thread creation, right?
>>>>>> And I believe that may be noticeable.
>>>>>
>>>>> Fair point! Do we have a standard benchmark that would stress this ?
>>>>
>>>> Web server performance benchmarks basically test clone() performance
>>>> in many cases.
>>>
>>> Isn't that fork? I expect that the rseq arena is inherited on fork and
>>> fork-type clone, otherwise it's going to be painful.
>>
>> On fork or clone creating a new process, the rseq tls area is inherited
>> from the thread that does the fork syscall.
>>
>> On creation of a new thread with clone, there is no such inheritance.
>
> Makes sense. So fork-based (web) servers will not be impacted by the
> additional system call, and thread-based servers likely use a thread
> pool anyway. I'm not really concerned about the additional system call
> here.

Just for the sake of completeness, there is (of course) no inheritance
on exec(). So glibc would also have to register the rseq TLS in its
constructors.
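
As a rough illustration of registering from a constructor, the sketch
below uses the GCC/Clang constructor attribute so that the initial
thread is handled before main() runs. This is only an assumption about
how it could be wired up, not how glibc does it: the helper
register_rseq_for_this_thread() is hypothetical and merely counts
calls where a real library would issue the rseq system call.

```c
#include <assert.h>

/* Counts registrations; stands in for the real rseq system call. */
static int registrations_done;

/* Hypothetical helper: a real library would call rseq here. */
static void register_rseq_for_this_thread(void)
{
	registrations_done++;
}

/*
 * Runs in the initial thread before main(). After exec the new program
 * image has no rseq registration, so it has to be done again here.
 */
__attribute__((constructor))
static void rseq_init_main_thread(void)
{
	register_rseq_for_this_thread();
}

/* Accessor so callers can check that the constructor ran. */
static int rseq_registration_count(void)
{
	return registrations_done;
}
```

Threads created later would call the same helper from their start
routine, since they do not inherit the registration either.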

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-15 05:10:43

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/14/2018 02:27 PM, Pavel Machek wrote:

>>> Should we treat it the same way? Always allocate it for each new thread
>>> and register it with the kernel?
>>
>> That would be an efficient way to do it, indeed. There is very little
>> performance overhead to have rseq registered for all threads, whether or
>> not they intend to run rseq critical sections.
>
> People with slow / low memory machines would prefer not to see
> overhead they don't need...

I can try to get rid of the >500 byte per-thread area for the stub
resolver. That should compensate for the overhead introduced.

Thanks,
Florian

2018-06-15 05:12:48

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/14/2018 03:01 PM, Mathieu Desnoyers wrote:
> Another alternative would be to somehow let glibc handle the registration,
> perhaps only doing it for applications expressing their interest for rseq.

That's not really possible. We can't rely on the visibility of symbol
bindings due to lazy binding and hidden visibility. Registration of
intent by other means will not work because if it is done from user
code, some other library may have already launched a thread at this point.

(It's also a moot point if we want to use restartable sequences in glibc
itself.)

Thanks,
Florian

2018-06-15 05:14:37

by Florian Weimer

Subject: Re: Restartable Sequences system call merged into Linux

On 06/14/2018 03:46 PM, Mathieu Desnoyers wrote:
> This would allow registering various TLS data structures with a single
> system call without hindering flexibility on the user-space side. For
> instance, we could still use initial-exec and the __rseq_abi symbol for
> rseq with this approach.
>
> Thoughts ?

Isn't this just a very narrow case of the usual batched syscalls
proposal? 8-)

Florian

2018-06-15 17:45:21

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 15, 2018, at 1:10 AM, Florian Weimer wrote:

> On 06/14/2018 03:46 PM, Mathieu Desnoyers wrote:
>> This would allow registering various TLS data structures with a single
>> system call without hindering flexibility on the user-space side. For
>> instance, we could still use initial-exec and the __rseq_abi symbol for
>> rseq with this approach.
>>
>> Thoughts ?
>
> Isn't this just a very narrow case of the usual batched syscalls
> proposal? 8-)

Pretty much. But let's not go there unless this is really needed.
It looks like the added syscall on thread creation is not an issue
so far.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2018-06-15 17:51:09

by Mathieu Desnoyers

Subject: Re: Restartable Sequences system call merged into Linux

----- On Jun 15, 2018, at 1:09 AM, Florian Weimer wrote:

> On 06/14/2018 03:01 PM, Mathieu Desnoyers wrote:
>> Another alternative would be to somehow let glibc handle the registration,
>> perhaps only doing it for applications expressing their interest for rseq.
>
> That's not really possible. We can't rely on the visibility of symbol
> bindings due to lazy binding and hidden visibility. Registration of
> intent by other means will not work because if it is done from user
> code, some other library may have already launched a thread at this point.
>
> (It's also a moot point if we want to use restartable sequences in glibc
> itself.)

Considering that we can expect the glibc memory allocator to use rseq
to speed up allocation, pretty much any application linked against glibc
*will* end up using rseq indirectly.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com