2020-02-20 21:34:01

by Mathieu Desnoyers

[permalink] [raw]
Subject: Rseq registration: Google tcmalloc vs glibc

Hi Chris,

As one of the maintainers of the Rseq system call in the Linux kernel, I would
like to thank the Google team for open sourcing a tcmalloc implementation based
on Rseq!

I've looked into some critical integration aspects of the tcmalloc implementation,
and would like to bring up a topic which involves both tcmalloc developers and the
glibc community.

I have been discussing aspects of co-existence between early Rseq adopter libraries
and glibc for the past year with the glibc community, and tcmalloc happens to be the
first project to publicly use Rseq outside of prototype branches or selftests code.
Considering that there can only be one Rseq registration per thread (as imposed by
the rseq ABI), there needs to be some kind of protocol between libraries to ensure we
don't introduce regressions when we eventually combine a newer glibc which takes care
of registration of the __rseq_abi TLS along with tcmalloc which also try to perform
that registration within the same thread.

Throughout the various rounds of review of the glibc Rseq integration patch set,
there were a few solutions envisioned. Here is a brief history:

1) Introduce a __rseq_refcount TLS variable.

- Currently used by Linux tools/testing/selftests/rseq/rseq.c,
- Currently used by Google tcmalloc,
- Emitted by glibc as well my the original patchset (but was later removed),

A user incrementing the refcount from 0 -> 1 performs rseq registration.
The last user decrementing from 1 -> 0 performs rseq unregistration.

Works for co-existence of dlopen'd/dlclose'd libraries, for dynamically
linked libraries, and for the main executable.

The refcounting was deemed too complex for glibc's needs (it always
exists for the entire executable's lifetime), so we moved to
__rseq_handled instead.


2) Introduce a __rseq_handled global variable.

- Currently used by Linux tools/testing/selftests/rseq/rseq.c,
- At some point emitted by glibc as well in my patch set (but was later
removed),

A library may take rseq ownership if it is still 0 when executing the
library constructor. Set to 1 by library constructor when handling rseq.
Set to 0 in destructor if handling rseq.

Not meant to be set by dlopen'd/dlclose'd libraries, only by libraries
existing for the whole lifetime of the executable and/or the main executable.

This __rseq_handled symbol has been identified as being somewhat redundant
with the information provided in the __rseq_abi.cpu_id field (uninitialized
state), which motivated removing this symbol from the glibc integration
entirely. The only reason for having __rseq_handled separate from
__rseq_abi.cpu_id was because it was then impossible to touch TLS data
early in the glibc initialization. This issue was later resolved within
glibc.


3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
registered.

- Current protocol in the most recent glibc integration patch set.
- Not supported yet by Linux kernel rseq selftests,
- Not supported yet by tcmalloc,

Use the per-thread state to figure out whether each thread need to register
Rseq individually.

Works for integration between a library which exists for the entire lifetime
of the executable (e.g. glibc) and other libraries. However, it does not
allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
having a library like glibc handling the registration present.

So overall, I suspect the protocol we want for early adopters is that they
only register Rseq if __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED, which
ensure they do not get -1, errno = EBUSY when linked against a newer glibc
which handles Rseq registration. In order to handle multiple early adopters
dlopen'd/dlclose'd in the same executable, those should synchronize with a
__rseq_refcount TLS reference count, but it does not have to be taken into
account by the main executable or libraries present for the entire executable
lifetime (like glibc).

Based on this, what I think would be missing from the current Google tcmalloc
implementation is a check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED
in InitThreadPerCpu().

Is tcmalloc ever meant to be dlopen'd/dlclose'd (either directly or indirectly),
or is it required to exist for the entire executable lifetime ? The check and
increment of __rseq_refcount is only useful to co-exist with dlopen'd/dlclose'd
libraries, but it would not allow discovering the presence of a glibc which
takes care of the rseq registration with the planned protocol. A dlopen'd
library should then only perform rseq unregistration if if brings the
__rseq_refcount back to 0 (e.g. in a pthread_key destructor).

Adding this check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED is something
I need to do in the Linux rseq selftests, but I refrained from submitting any
further change to those tests until the glibc rseq integration gets finally
merged.

Is it something that could be easily changed at this stage in Google tcmalloc,
or should we reconsider adding back __rseq_refcount within the glibc integration
patch set, even though it is not strictly useful to glibc ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


2020-02-21 13:09:04

by Florian Weimer

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

Quoting in full to get this message to libc-alpha, past the text/html
filter. I wish we had a different list configuration …

Please Note that we have integration patches for glibc which need
review. A fair number of them have been written by me, so I can't help
with that.

* Chris Kennelly:

> On Thu, Feb 20, 2020 at 4:33 PM Mathieu Desnoyers <[email protected]>
> wrote:
>
> Hi Chris,
>
> As one of the maintainers of the Rseq system call in the Linux kernel, I would
> like to thank the Google team for open sourcing a tcmalloc implementation based
> on Rseq!
>
> I've looked into some critical integration aspects of the tcmalloc implementation,
> and would like to bring up a topic which involves both tcmalloc developers and the
> glibc community.
>
> Thanks. To answer the later questions first:
> * I implemented TCMalloc's upstream rseq based on the conventions I could find from
> (mostly) the kernel self tests. This is probably why it looks like #1 :)
>
> This is less of an intentional preference and more of "it's important that early adopters follow
> a convention" for future glibc upgrades. Initializing __rseq_abi is the most important, but
> there are some other ones, mostly for debugging/tracing
> (https://github.com/compudj/librseq/issues/1), that I'd like to get right too.
>
> * The TCMalloc project does not provide ABI stability, so TCMalloc can change the convention
> it follows.
>
> I have been discussing aspects of co-existence between early Rseq adopter libraries
> and glibc for the past year with the glibc community, and tcmalloc happens to be the
> first project to publicly use Rseq outside of prototype branches or selftests code.
> Considering that there can only be one Rseq registration per thread (as imposed by
> the rseq ABI), there needs to be some kind of protocol between libraries to ensure we
> don't introduce regressions when we eventually combine a newer glibc which takes care
> of registration of the __rseq_abi TLS along with tcmalloc which also try to perform
> that registration within the same thread.
>
> Throughout the various rounds of review of the glibc Rseq integration patch set,
> there were a few solutions envisioned. Here is a brief history:
>
> 1) Introduce a __rseq_refcount TLS variable.
>
> - Currently used by Linux tools/testing/selftests/rseq/rseq.c,
> - Currently used by Google tcmalloc,
> - Emitted by glibc as well my the original patchset (but was later removed),
>
> A user incrementing the refcount from 0 -> 1 performs rseq registration.
> The last user decrementing from 1 -> 0 performs rseq unregistration.
>
> Works for co-existence of dlopen'd/dlclose'd libraries, for dynamically
> linked libraries, and for the main executable.
>
> The refcounting was deemed too complex for glibc's needs (it always
> exists for the entire executable's lifetime), so we moved to
> __rseq_handled instead.
>
> 2) Introduce a __rseq_handled global variable.
>
> - Currently used by Linux tools/testing/selftests/rseq/rseq.c,
> - At some point emitted by glibc as well in my patch set (but was later
> removed),
>
> A library may take rseq ownership if it is still 0 when executing the
> library constructor. Set to 1 by library constructor when handling rseq.
> Set to 0 in destructor if handling rseq.
>
> Not meant to be set by dlopen'd/dlclose'd libraries, only by libraries
> existing for the whole lifetime of the executable and/or the main executable.
>
> This __rseq_handled symbol has been identified as being somewhat redundant
> with the information provided in the __rseq_abi.cpu_id field (uninitialized
> state), which motivated removing this symbol from the glibc integration
> entirely. The only reason for having __rseq_handled separate from
> __rseq_abi.cpu_id was because it was then impossible to touch TLS data
> early in the glibc initialization. This issue was later resolved within
> glibc.
>
> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> registered.
>
> - Current protocol in the most recent glibc integration patch set.
> - Not supported yet by Linux kernel rseq selftests,
> - Not supported yet by tcmalloc,
>
> Use the per-thread state to figure out whether each thread need to register
> Rseq individually.
>
> Works for integration between a library which exists for the entire lifetime
> of the executable (e.g. glibc) and other libraries. However, it does not
> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> having a library like glibc handling the registration present.
>
> Overall, I like #3 the most due to its simplicity, but I also do not need to support the
> dlopen/dlclose use case (below).
>
> So overall, I suspect the protocol we want for early adopters is that they
> only register Rseq if __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED, which
> ensure they do not get -1, errno = EBUSY when linked against a newer glibc
> which handles Rseq registration. In order to handle multiple early adopters
> dlopen'd/dlclose'd in the same executable, those should synchronize with a
> __rseq_refcount TLS reference count, but it does not have to be taken into
> account by the main executable or libraries present for the entire executable
> lifetime (like glibc).
>
> Based on this, what I think would be missing from the current Google tcmalloc
> implementation is a check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED
> in InitThreadPerCpu().
>
> TCMalloc does not get to InitThreadPerCpu without that check.
>
> Before initialization happens, we
> * end up on a slow path
> https://github.com/google/tcmalloc/blob/master/tcmalloc/tcmalloc.cc#L1486
> * which checks UsePerCpuCache
> https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.h#L222
> * and inspects the TLS variable in IsFast
> https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/percpu.h#L171
> ...which triggers per-thread rseq registration if __rseq_abi.cpu_id is uninitialized.
>
> Otherwise, the IsOnFastPath() call checks also inspects __rseq_abi.cpu_id via IsFastNoInit
> (same thing, but no registration triggered).
>
> Is tcmalloc ever meant to be dlopen'd/dlclose'd (either directly or indirectly),
> or is it required to exist for the entire executable lifetime ? The check and
> increment of __rseq_refcount is only useful to co-exist with dlopen'd/dlclose'd
> libraries, but it would not allow discovering the presence of a glibc which
> takes care of the rseq registration with the planned protocol. A dlopen'd
> library should then only perform rseq unregistration if if brings the
> __rseq_refcount back to 0 (e.g. in a pthread_key destructor).
>
> TCMalloc cannot practically be dlopen'd or dlclose'd.
> * Once memory is allocated with one instance of malloc (or operator new), it needs to be
> free'd to the same heap.
> * dlclose is explicitly not supported by our dependencies ("do not rely on dynamic
> unloading" https://abseil.io/about/compatibility)
>
> Thanks,
> Chris
>
> Adding this check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED is something
> I need to do in the Linux rseq selftests, but I refrained from submitting any
> further change to those tests until the glibc rseq integration gets finally
> merged.
>
> Is it something that could be easily changed at this stage in Google tcmalloc,
> or should we reconsider adding back __rseq_refcount within the glibc integration
> patch set, even though it is not strictly useful to glibc ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

2020-02-21 15:50:17

by Joel Fernandes

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

On Thu, Feb 20, 2020 at 04:33:30PM -0500, Mathieu Desnoyers wrote:
> Hi Chris,
>
> As one of the maintainers of the Rseq system call in the Linux kernel, I would
> like to thank the Google team for open sourcing a tcmalloc implementation based
> on Rseq!
>
> I've looked into some critical integration aspects of the tcmalloc implementation,
> and would like to bring up a topic which involves both tcmalloc developers and the
> glibc community.
>
> I have been discussing aspects of co-existence between early Rseq adopter libraries
> and glibc for the past year with the glibc community, and tcmalloc happens to be the
> first project to publicly use Rseq outside of prototype branches or selftests code.
> Considering that there can only be one Rseq registration per thread (as imposed by
> the rseq ABI), there needs to be some kind of protocol between libraries to ensure we
> don't introduce regressions when we eventually combine a newer glibc which takes care
> of registration of the __rseq_abi TLS along with tcmalloc which also try to perform
> that registration within the same thread.
>
> Throughout the various rounds of review of the glibc Rseq integration patch set,
> there were a few solutions envisioned. Here is a brief history:
>
> 1) Introduce a __rseq_refcount TLS variable.
>
> - Currently used by Linux tools/testing/selftests/rseq/rseq.c,
> - Currently used by Google tcmalloc,
> - Emitted by glibc as well my the original patchset (but was later removed),
>
> A user incrementing the refcount from 0 -> 1 performs rseq registration.
> The last user decrementing from 1 -> 0 performs rseq unregistration.
>
> Works for co-existence of dlopen'd/dlclose'd libraries, for dynamically
> linked libraries, and for the main executable.
>
> The refcounting was deemed too complex for glibc's needs (it always
> exists for the entire executable's lifetime), so we moved to
> __rseq_handled instead.
>
>
> 2) Introduce a __rseq_handled global variable.
>
> - Currently used by Linux tools/testing/selftests/rseq/rseq.c,
> - At some point emitted by glibc as well in my patch set (but was later
> removed),
>
> A library may take rseq ownership if it is still 0 when executing the
> library constructor. Set to 1 by library constructor when handling rseq.
> Set to 0 in destructor if handling rseq.
>
> Not meant to be set by dlopen'd/dlclose'd libraries, only by libraries
> existing for the whole lifetime of the executable and/or the main executable.
>
> This __rseq_handled symbol has been identified as being somewhat redundant
> with the information provided in the __rseq_abi.cpu_id field (uninitialized
> state), which motivated removing this symbol from the glibc integration
> entirely. The only reason for having __rseq_handled separate from
> __rseq_abi.cpu_id was because it was then impossible to touch TLS data
> early in the glibc initialization. This issue was later resolved within
> glibc.
>
>
> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> registered.
>
> - Current protocol in the most recent glibc integration patch set.
> - Not supported yet by Linux kernel rseq selftests,
> - Not supported yet by tcmalloc,
>
> Use the per-thread state to figure out whether each thread need to register
> Rseq individually.
>
> Works for integration between a library which exists for the entire lifetime
> of the executable (e.g. glibc) and other libraries. However, it does not
> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> having a library like glibc handling the registration present.

Mathieu, could you share more details about why during dlopen/close
libraries we cannot use the same __rseq_abi TLS to detect that rseq was
registered?

Thanks!

- Joel


>
> So overall, I suspect the protocol we want for early adopters is that they
> only register Rseq if __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED, which
> ensure they do not get -1, errno = EBUSY when linked against a newer glibc
> which handles Rseq registration. In order to handle multiple early adopters
> dlopen'd/dlclose'd in the same executable, those should synchronize with a
> __rseq_refcount TLS reference count, but it does not have to be taken into
> account by the main executable or libraries present for the entire executable
> lifetime (like glibc).
>
> Based on this, what I think would be missing from the current Google tcmalloc
> implementation is a check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED
> in InitThreadPerCpu().
>
> Is tcmalloc ever meant to be dlopen'd/dlclose'd (either directly or indirectly),
> or is it required to exist for the entire executable lifetime ? The check and
> increment of __rseq_refcount is only useful to co-exist with dlopen'd/dlclose'd
> libraries, but it would not allow discovering the presence of a glibc which
> takes care of the rseq registration with the planned protocol. A dlopen'd
> library should then only perform rseq unregistration if if brings the
> __rseq_refcount back to 0 (e.g. in a pthread_key destructor).
>
> Adding this check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED is something
> I need to do in the Linux rseq selftests, but I refrained from submitting any
> further change to those tests until the glibc rseq integration gets finally
> merged.
>
> Is it something that could be easily changed at this stage in Google tcmalloc,
> or should we reconsider adding back __rseq_refcount within the glibc integration
> patch set, even though it is not strictly useful to glibc ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

2020-02-21 16:14:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google [email protected] wrote:

[...]
>>
>> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
>> registered.
>>
>> - Current protocol in the most recent glibc integration patch set.
>> - Not supported yet by Linux kernel rseq selftests,
>> - Not supported yet by tcmalloc,
>>
>> Use the per-thread state to figure out whether each thread need to register
>> Rseq individually.
>>
>> Works for integration between a library which exists for the entire lifetime
>> of the executable (e.g. glibc) and other libraries. However, it does not
>> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
>> having a library like glibc handling the registration present.
>
> Mathieu, could you share more details about why during dlopen/close
> libraries we cannot use the same __rseq_abi TLS to detect that rseq was
> registered?

Sure,

A library which is only loaded and never closed during the execution of the
program can let the kernel implicitly unregister rseq at thread exit. For
the dlopen/dlclose use-case, we need to be able to explicitly unregister
each thread's __rseq_abi which sit in a library which is going to be
dlclose'd.

The issue is that __rseq_abi.cpu_id does not track any reference counting
of rseq user libraries, which becomes an issue if we have many of those
libraries around with different life-time.

As an example scenario, let's suppose we have a single-threaded application
which does the following:

main()
dlopen(liba)
-> liba's constructor observes uninitialized __rseq_abi.cpu_id, thus
performs rseq registration
dlopen(libb)
-> libb's constructor observes that rseq is already registered.

dlclose(libb)
-> libb's destructor unregisters rseq.

-> at this point, liba is still loaded, and would still expect rseq to
be registered. But unfortunately rseq has been unregistered by libb.

dlclose(liba)
-> rseq is already unregistered, which is unexpected.

The TLS __rseq_refcount solves this by tracking the number of users of
rseq for the thread, so rseq is only unregistered when the very last user
decrements the reference count.

As soon as there is at least one library taking care of registering rseq
for the entire thread's duration (e.g. glibc), and that this library
guarantees to never be dlclose'd, the __rseq_refcount becomes unneeded.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-02-26 03:26:55

by Joel Fernandes

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google [email protected] wrote:
>
> [...]
> >>
> >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> >> registered.
> >>
> >> - Current protocol in the most recent glibc integration patch set.
> >> - Not supported yet by Linux kernel rseq selftests,
> >> - Not supported yet by tcmalloc,
> >>
> >> Use the per-thread state to figure out whether each thread need to register
> >> Rseq individually.
> >>
> >> Works for integration between a library which exists for the entire lifetime
> >> of the executable (e.g. glibc) and other libraries. However, it does not
> >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> >> having a library like glibc handling the registration present.
> >
> > Mathieu, could you share more details about why during dlopen/close
> > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
> > registered?
>
> Sure,
>
> A library which is only loaded and never closed during the execution of the
> program can let the kernel implicitly unregister rseq at thread exit. For
> the dlopen/dlclose use-case, we need to be able to explicitly unregister
> each thread's __rseq_abi which sit in a library which is going to be
> dlclose'd.

Mathieu, Thanks a lot for the explanation, it makes complete sense. It
sounds from Chris's reply that tcmalloc already checks
__rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
seems to already handle things properly - CMIIW.

Chris, Brian, is there any other concern to upgrading of tcmalloc
version in ChromeOS? I believe there was some concern about detection
of rseq kernel support. A quick look at tcmalloc shows it does not do
such detection, but I can stand corrected. One more thing, currently
tcmalloc does not use rseq on ARM. If I recall, ARM does have rseq
support as well. So we ought to enable it for that arch as well if
possible. Why not enable it on all arches and then dynamically detect
at runtime if needed support is available?

Thanks,
Joel

2020-02-26 03:39:22

by Chris Kennelly

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <[email protected]> wrote:
>
> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google [email protected] wrote:
> >
> > [...]
> > >>
> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> > >> registered.
> > >>
> > >> - Current protocol in the most recent glibc integration patch set.
> > >> - Not supported yet by Linux kernel rseq selftests,
> > >> - Not supported yet by tcmalloc,
> > >>
> > >> Use the per-thread state to figure out whether each thread need to register
> > >> Rseq individually.
> > >>
> > >> Works for integration between a library which exists for the entire lifetime
> > >> of the executable (e.g. glibc) and other libraries. However, it does not
> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> > >> having a library like glibc handling the registration present.
> > >
> > > Mathieu, could you share more details about why during dlopen/close
> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
> > > registered?
> >
> > Sure,
> >
> > A library which is only loaded and never closed during the execution of the
> > program can let the kernel implicitly unregister rseq at thread exit. For
> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
> > each thread's __rseq_abi which sit in a library which is going to be
> > dlclose'd.
>
> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
> sounds from Chris's reply that tcmalloc already checks
> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
> seems to already handle things properly - CMIIW.

I'll make a note about this, since we can probably benefit from some
more comments about the assumptions/invariants the fastpath uses.

> Chris, Brian, is there any other concern to upgrading of tcmalloc
> version in ChromeOS? I believe there was some concern about detection
> of rseq kernel support. A quick look at tcmalloc shows it does not do
> such detection, but I can stand corrected.

The build time configuration is because we need to have an assembly
implementation of the restartable sequence (example: x86_64
[https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/percpu_rseq_x86_64.S]),
although I've been moving these to inline assembly recently.

If we have build time support for the code path, we'll try to invoke
the rseq() call and see if registration succeeds:
https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/percpu.cc#L85-L107

> One more thing, currently
> tcmalloc does not use rseq on ARM. If I recall, ARM does have rseq
> support as well. So we ought to enable it for that arch as well if
> possible. Why not enable it on all arches and then dynamically detect
> at runtime if needed support is available?

As far as ARM support goes for TCMalloc's per-CPU/rseq path, I haven't
had the bandwidth to write an assembly implementation (although I'd
welcome one).

Thanks,
Chris Kennelly

2020-02-26 17:02:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly [email protected] wrote:

> On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <[email protected]> wrote:
>>
>> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
>> <[email protected]> wrote:
>> >
>> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
>> > [email protected] wrote:
>> >
>> > [...]
>> > >>
>> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
>> > >> registered.
>> > >>
>> > >> - Current protocol in the most recent glibc integration patch set.
>> > >> - Not supported yet by Linux kernel rseq selftests,
>> > >> - Not supported yet by tcmalloc,
>> > >>
>> > >> Use the per-thread state to figure out whether each thread need to register
>> > >> Rseq individually.
>> > >>
>> > >> Works for integration between a library which exists for the entire lifetime
>> > >> of the executable (e.g. glibc) and other libraries. However, it does not
>> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
>> > >> having a library like glibc handling the registration present.
>> > >
>> > > Mathieu, could you share more details about why during dlopen/close
>> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
>> > > registered?
>> >
>> > Sure,
>> >
>> > A library which is only loaded and never closed during the execution of the
>> > program can let the kernel implicitly unregister rseq at thread exit. For
>> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
>> > each thread's __rseq_abi which sit in a library which is going to be
>> > dlclose'd.
>>
>> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
>> sounds from Chris's reply that tcmalloc already checks
>> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
>> seems to already handle things properly - CMIIW.
>
> I'll make a note about this, since we can probably benefit from some
> more comments about the assumptions/invariants the fastpath uses.

I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will not
behave correctly with the current tcmalloc implementation.

Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As long
as this is the only expected caller, having IsFast comparing the RseqCpuId detects
whether glibc (or some other library) has already registered rseq for the current
thread.

However, if the application chooses to invoke InitFastPerCpu() directly, things become
expected, because it invokes:

absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);

which AFAIU invokes InitPerCpu once after execution of the current program. Which
does:

static bool InitThreadPerCpu() {
if (__rseq_refcount++ > 0) {
return true;
}

auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
PERCPU_RSEQ_SIGNATURE);
if (ret == 0) {
return true;
} else {
__rseq_refcount--;
}

return false;
}

static void InitPerCpu() {
// Based on the results of successfully initializing the first thread, mark
// init_status to initialize all subsequent threads.
if (InitThreadPerCpu()) {
init_status = kFastMode;
}
}

In a scenario where glibc has already registered Rseq, the __rseq_refcount will
be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the refcount
will be immediately decremented and it will return false. Therefore, "init_status" will
never be set fo kFastMode, leaving it in kSlowMode for the entire lifetime of this
program. That being said, even though this state can come as a surprise, it seems to
be entirely bypassed by the fast-paths IsFast() and IsFastNoInit(), so maybe it won't
have any observable side-effects other than leaving init_status in a state that does not
match reality.

In the other use-case where tcmalloc co-exist with a dlopened/dlclosed library, but glibc
does not provide Rseq registration, we run into issues as well if the dlopened library
registers rseq first for a given thread. The IsFastNoInit() expects that if Rseq has been
observed as registered in the past for a thread, it stays registered. However, if a
dlclosed library unregisters Rseq, we need to be prepared to re-register it. So either
tcmalloc needs to express its use of Rseq by incrementing __rseq_refcount even when Rseq
is registered (this would hurt the fast-path however, and I would hate to have to do this),
or tcmalloc needs to be able to handle the fact that Rseq may be unregistered by a dlclosed
library which was the actual owner of the Rseq registration.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-02-26 17:28:23

by Chris Kennelly

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

On Wed, Feb 26, 2020 at 12:01 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly [email protected] wrote:
>
> > On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <[email protected]> wrote:
> >>
> >> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
> >> <[email protected]> wrote:
> >> >
> >> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
> >> > [email protected] wrote:
> >> >
> >> > [...]
> >> > >>
> >> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> >> > >> registered.
> >> > >>
> >> > >> - Current protocol in the most recent glibc integration patch set.
> >> > >> - Not supported yet by Linux kernel rseq selftests,
> >> > >> - Not supported yet by tcmalloc,
> >> > >>
> >> > >> Use the per-thread state to figure out whether each thread need to register
> >> > >> Rseq individually.
> >> > >>
> >> > >> Works for integration between a library which exists for the entire lifetime
> >> > >> of the executable (e.g. glibc) and other libraries. However, it does not
> >> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> >> > >> having a library like glibc handling the registration present.
> >> > >
> >> > > Mathieu, could you share more details about why during dlopen/close
> >> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
> >> > > registered?
> >> >
> >> > Sure,
> >> >
> >> > A library which is only loaded and never closed during the execution of the
> >> > program can let the kernel implicitly unregister rseq at thread exit. For
> >> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
> >> > each thread's __rseq_abi which sit in a library which is going to be
> >> > dlclose'd.
> >>
> >> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
> >> sounds from Chris's reply that tcmalloc already checks
> >> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
> >> seems to already handle things properly - CMIIW.
> >
> > I'll make a note about this, since we can probably benefit from some
> > more comments about the assumptions/invariants the fastpath uses.
>
> I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will not
> behave correctly with the current tcmalloc implementation.
>
> Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As long
> as this is the only expected caller, having IsFast comparing the RseqCpuId detects
> whether glibc (or some other library) has already registered rseq for the current
> thread.
>
> However, if the application chooses to invoke InitFastPerCpu() directly, things become
> expected, because it invokes:
>
> absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);
>
> which AFAIU invokes InitPerCpu once after execution of the current program. Which
> does:
>
> static bool InitThreadPerCpu() {
> if (__rseq_refcount++ > 0) {
> return true;
> }
>
> auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
> PERCPU_RSEQ_SIGNATURE);
> if (ret == 0) {
> return true;
> } else {
> __rseq_refcount--;
> }
>
> return false;
> }
>
> static void InitPerCpu() {
> // Based on the results of successfully initializing the first thread, mark
> // init_status to initialize all subsequent threads.
> if (InitThreadPerCpu()) {
> init_status = kFastMode;
> }
> }
>
> In a scenario where glibc has already registered Rseq, the __rseq_refcount will
> be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the refcount
> will be immediately decremented and it will return false. Therefore, "init_status" will
> never be set fo kFastMode, leaving it in kSlowMode for the entire lifetime of this
> program. That being said, even though this state can come as a surprise, it seems to
> be entirely bypassed by the fast-paths IsFast() and IsFastNoInit(), so maybe it won't
> have any observable side-effects other than leaving init_status in a state that does not
> match reality.

I agree that this could potentially violate inviarants, but
InitFastPerCpu is not intended to be called by the application.

> In the other use-case where tcmalloc co-exist with a dlopened/dlclosed library, but glibc
> does not provide Rseq registration, we run into issues as well if the dlopened library
> registers rseq first for a given thread. The IsFastNoInit() expects that if Rseq has been
> observed as registered in the past for a thread, it stays registered. However, if a
> dlclosed library unregisters Rseq, we need to be prepared to re-register it. So either
> tcmalloc needs to express its use of Rseq by incrementing __rseq_refcount even when Rseq
> is registered (this would hurt the fast-path however, and I would hate to have to do this),
> or tcmalloc needs to be able to handle the fact that Rseq may be unregistered by a dlclosed
> library which was the actual owner of the Rseq registration.

We have a bit of an opportunity to figure out whether this is the
first time--from TCMalloc's perspective--a thread is doing per-CPU and
bump the __rseq_count accordingly. I think this could be done off of
the fast path.

Chris

2020-02-26 18:57:27

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

----- On Feb 26, 2020, at 12:27 PM, Chris Kennelly [email protected] wrote:

> On Wed, Feb 26, 2020 at 12:01 PM Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> ----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly [email protected] wrote:
>>
>> > On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <[email protected]> wrote:
>> >>
>> >> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
>> >> <[email protected]> wrote:
>> >> >
>> >> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
>> >> > [email protected] wrote:
>> >> >
>> >> > [...]
>> >> > >>
>> >> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
>> >> > >> registered.
>> >> > >>
>> >> > >> - Current protocol in the most recent glibc integration patch set.
>> >> > >> - Not supported yet by Linux kernel rseq selftests,
>> >> > >> - Not supported yet by tcmalloc,
>> >> > >>
>> >> > >> Use the per-thread state to figure out whether each thread need to register
>> >> > >> Rseq individually.
>> >> > >>
>> >> > >> Works for integration between a library which exists for the entire lifetime
>> >> > >> of the executable (e.g. glibc) and other libraries. However, it does not
>> >> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
>> >> > >> having a library like glibc handling the registration present.
>> >> > >
>> >> > > Mathieu, could you share more details about why during dlopen/close
>> >> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
>> >> > > registered?
>> >> >
>> >> > Sure,
>> >> >
>> >> > A library which is only loaded and never closed during the execution of the
>> >> > program can let the kernel implicitly unregister rseq at thread exit. For
>> >> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
>> >> > each thread's __rseq_abi which sit in a library which is going to be
>> >> > dlclose'd.
>> >>
>> >> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
>> >> sounds from Chris's reply that tcmalloc already checks
>> >> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
>> >> seems to already handle things properly - CMIIW.
>> >
>> > I'll make a note about this, since we can probably benefit from some
>> > more comments about the assumptions/invariants the fastpath uses.
>>
>> I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will
>> not
>> behave correctly with the current tcmalloc implementation.
>>
>> Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As
>> long
>> as this is the only expected caller, having IsFast comparing the RseqCpuId
>> detects
>> whether glibc (or some other library) has already registered rseq for the
>> current
>> thread.
>>
>> However, if the application chooses to invoke InitFastPerCpu() directly, things
>> become
>> expected, because it invokes:
>>
>> absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);
>>
>> which AFAIU invokes InitPerCpu once after execution of the current program.
>> Which
>> does:
>>
>> static bool InitThreadPerCpu() {
>> if (__rseq_refcount++ > 0) {
>> return true;
>> }
>>
>> auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
>> PERCPU_RSEQ_SIGNATURE);
>> if (ret == 0) {
>> return true;
>> } else {
>> __rseq_refcount--;
>> }
>>
>> return false;
>> }
>>
>> static void InitPerCpu() {
>> // Based on the results of successfully initializing the first thread, mark
>> // init_status to initialize all subsequent threads.
>> if (InitThreadPerCpu()) {
>> init_status = kFastMode;
>> }
>> }
>>
>> In a scenario where glibc has already registered Rseq, the __rseq_refcount will
>> be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the
>> refcount
>> will be immediately decremented and it will return false. Therefore,
>> "init_status" will
>> never be set fo kFastMode, leaving it in kSlowMode for the entire lifetime of
>> this
>> program. That being said, even though this state can come as a surprise, it
>> seems to
>> be entirely bypassed by the fast-paths IsFast() and IsFastNoInit(), so maybe it
>> won't
>> have any observable side-effects other than leaving init_status in a state that
>> does not
>> match reality.
>
> I agree that this could potentially violate inviarants, but
> InitFastPerCpu is not intended to be called by the application.

OK, explicitly documenting this would be a good thing. In my own projects,
I prefix those symbols with double-underscores (__) to indicate that those
are not meant to be called by other means than the static inlines in the API.

There may be use-cases which justify exposing InitFastPerCpu as a public API for
applications though, especially for those which require some level of
real-time guarantees from the malloc/free APIs. I've run into this situation
with liburcu which I maintain.

>
>> In the other use-case where tcmalloc co-exist with a dlopened/dlclosed library,
>> but glibc
>> does not provide Rseq registration, we run into issues as well if the dlopened
>> library
>> registers rseq first for a given thread. The IsFastNoInit() expects that if Rseq
>> has been
>> observed as registered in the past for a thread, it stays registered. However,
>> if a
>> dlclosed library unregisters Rseq, we need to be prepared to re-register it. So
>> either
>> tcmalloc needs to express its use of Rseq by incrementing __rseq_refcount even
>> when Rseq
>> is registered (this would hurt the fast-path however, and I would hate to have
>> to do this),
>> or tcmalloc needs to be able to handle the fact that Rseq may be unregistered by
>> a dlclosed
>> library which was the actual owner of the Rseq registration.
>
> We have a bit of an opportunity to figure out whether this is the
> first time--from TCMalloc's perspective--a thread is doing per-CPU and
> bump the __rseq_count accordingly. I think this could be done off of
> the fast path.

Is there an explicit tcmalloc API call that each thread need to do before starting
to use tcmalloc to allocate and free memory ? If not, you'll probably need to add
at least a load of __rseq_refcount (or some other TLS variable), test and conditional
branch on the fast-path, which is an additional cost I would ideally prefer to avoid.
Or do you have something else in mind ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-02-26 19:13:35

by Chris Kennelly

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

On Wed, Feb 26, 2020 at 1:56 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Feb 26, 2020, at 12:27 PM, Chris Kennelly [email protected] wrote:
>
> > On Wed, Feb 26, 2020 at 12:01 PM Mathieu Desnoyers
> > <[email protected]> wrote:
> >>
> >> ----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly [email protected] wrote:
> >>
> >> > On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <[email protected]> wrote:
> >> >>
> >> >> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
> >> >> > [email protected] wrote:
> >> >> >
> >> >> > [...]
> >> >> > >>
> >> >> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> >> >> > >> registered.
> >> >> > >>
> >> >> > >> - Current protocol in the most recent glibc integration patch set.
> >> >> > >> - Not supported yet by Linux kernel rseq selftests,
> >> >> > >> - Not supported yet by tcmalloc,
> >> >> > >>
> >> >> > >> Use the per-thread state to figure out whether each thread need to register
> >> >> > >> Rseq individually.
> >> >> > >>
> >> >> > >> Works for integration between a library which exists for the entire lifetime
> >> >> > >> of the executable (e.g. glibc) and other libraries. However, it does not
> >> >> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> >> >> > >> having a library like glibc handling the registration present.
> >> >> > >
> >> >> > > Mathieu, could you share more details about why during dlopen/close
> >> >> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
> >> >> > > registered?
> >> >> >
> >> >> > Sure,
> >> >> >
> >> >> > A library which is only loaded and never closed during the execution of the
> >> >> > program can let the kernel implicitly unregister rseq at thread exit. For
> >> >> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
> >> >> > each thread's __rseq_abi which sit in a library which is going to be
> >> >> > dlclose'd.
> >> >>
> >> >> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
> >> >> sounds from Chris's reply that tcmalloc already checks
> >> >> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
> >> >> seems to already handle things properly - CMIIW.
> >> >
> >> > I'll make a note about this, since we can probably benefit from some
> >> > more comments about the assumptions/invariants the fastpath uses.
> >>
> >> I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will
> >> not
> >> behave correctly with the current tcmalloc implementation.
> >>
> >> Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As
> >> long
> >> as this is the only expected caller, having IsFast comparing the RseqCpuId
> >> detects
> >> whether glibc (or some other library) has already registered rseq for the
> >> current
> >> thread.
> >>
> >> However, if the application chooses to invoke InitFastPerCpu() directly, things
> >> become
> >> expected, because it invokes:
> >>
> >> absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);
> >>
> >> which AFAIU invokes InitPerCpu once after execution of the current program.
> >> Which
> >> does:
> >>
> >> static bool InitThreadPerCpu() {
> >> if (__rseq_refcount++ > 0) {
> >> return true;
> >> }
> >>
> >> auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
> >> PERCPU_RSEQ_SIGNATURE);
> >> if (ret == 0) {
> >> return true;
> >> } else {
> >> __rseq_refcount--;
> >> }
> >>
> >> return false;
> >> }
> >>
> >> static void InitPerCpu() {
> >> // Based on the results of successfully initializing the first thread, mark
> >> // init_status to initialize all subsequent threads.
> >> if (InitThreadPerCpu()) {
> >> init_status = kFastMode;
> >> }
> >> }
> >>
> >> In a scenario where glibc has already registered Rseq, the __rseq_refcount will
> >> be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the
> >> refcount
> >> will be immediately decremented and it will return false. Therefore,
> >> "init_status" will
> >> never be set fo kFastMode, leaving it in kSlowMode for the entire lifetime of
> >> this
> >> program. That being said, even though this state can come as a surprise, it
> >> seems to
> >> be entirely bypassed by the fast-paths IsFast() and IsFastNoInit(), so maybe it
> >> won't
> >> have any observable side-effects other than leaving init_status in a state that
> >> does not
> >> match reality.
> >
> > I agree that this could potentially violate inviarants, but
> > InitFastPerCpu is not intended to be called by the application.
>
> OK, explicitly documenting this would be a good thing. In my own projects,
> I prefix those symbols with double-underscores (__) to indicate that those
> are not meant to be called by other means than the static inlines in the API.
>
> There may be use-cases which justify exposing InitFastPerCpu as a public API for
> applications though, especially for those which require some level of
> real-time guarantees from the malloc/free APIs. I've run into this situation
> with liburcu which I maintain.
>
> >
> >> In the other use-case where tcmalloc co-exist with a dlopened/dlclosed library,
> >> but glibc
> >> does not provide Rseq registration, we run into issues as well if the dlopened
> >> library
> >> registers rseq first for a given thread. The IsFastNoInit() expects that if Rseq
> >> has been
> >> observed as registered in the past for a thread, it stays registered. However,
> >> if a
> >> dlclosed library unregisters Rseq, we need to be prepared to re-register it. So
> >> either
> >> tcmalloc needs to express its use of Rseq by incrementing __rseq_refcount even
> >> when Rseq
> >> is registered (this would hurt the fast-path however, and I would hate to have
> >> to do this),
> >> or tcmalloc needs to be able to handle the fact that Rseq may be unregistered by
> >> a dlclosed
> >> library which was the actual owner of the Rseq registration.
> >
> > We have a bit of an opportunity to figure out whether this is the
> > first time--from TCMalloc's perspective--a thread is doing per-CPU and
> > bump the __rseq_count accordingly. I think this could be done off of
> > the fast path.
>
> Is there an explicit tcmalloc API call that each thread need to do before starting
> to use tcmalloc to allocate and free memory ? If not, you'll probably need to add
> at least a load of __rseq_refcount (or some other TLS variable), test and conditional
> branch on the fast-path, which is an additional cost I would ideally prefer to avoid.
> Or do you have something else in mind ?

No explicit call is necessary. This is something that can be done in
the slow path, since we can recognize the transition from slow -> fast
path for that thread

2020-02-26 21:22:54

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

----- On Feb 26, 2020, at 2:12 PM, Chris Kennelly [email protected] wrote:

> On Wed, Feb 26, 2020 at 1:56 PM Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> ----- On Feb 26, 2020, at 12:27 PM, Chris Kennelly [email protected] wrote:
>>
>> > On Wed, Feb 26, 2020 at 12:01 PM Mathieu Desnoyers
>> > <[email protected]> wrote:
>> >>
>> >> ----- On Feb 25, 2020, at 10:38 PM, Chris Kennelly [email protected] wrote:
>> >>
>> >> > On Tue, Feb 25, 2020 at 10:25 PM Joel Fernandes <[email protected]> wrote:
>> >> >>
>> >> >> On Fri, Feb 21, 2020 at 11:13 AM Mathieu Desnoyers
>> >> >> <[email protected]> wrote:
>> >> >> >
>> >> >> > ----- On Feb 21, 2020, at 10:49 AM, Joel Fernandes, Google
>> >> >> > [email protected] wrote:
>> >> >> >
>> >> >> > [...]
>> >> >> > >>
>> >> >> > >> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
>> >> >> > >> registered.
>> >> >> > >>
>> >> >> > >> - Current protocol in the most recent glibc integration patch set.
>> >> >> > >> - Not supported yet by Linux kernel rseq selftests,
>> >> >> > >> - Not supported yet by tcmalloc,
>> >> >> > >>
>> >> >> > >> Use the per-thread state to figure out whether each thread need to register
>> >> >> > >> Rseq individually.
>> >> >> > >>
>> >> >> > >> Works for integration between a library which exists for the entire lifetime
>> >> >> > >> of the executable (e.g. glibc) and other libraries. However, it does not
>> >> >> > >> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
>> >> >> > >> having a library like glibc handling the registration present.
>> >> >> > >
>> >> >> > > Mathieu, could you share more details about why during dlopen/close
>> >> >> > > libraries we cannot use the same __rseq_abi TLS to detect that rseq was
>> >> >> > > registered?
>> >> >> >
>> >> >> > Sure,
>> >> >> >
>> >> >> > A library which is only loaded and never closed during the execution of the
>> >> >> > program can let the kernel implicitly unregister rseq at thread exit. For
>> >> >> > the dlopen/dlclose use-case, we need to be able to explicitly unregister
>> >> >> > each thread's __rseq_abi which sit in a library which is going to be
>> >> >> > dlclose'd.
>> >> >>
>> >> >> Mathieu, Thanks a lot for the explanation, it makes complete sense. It
>> >> >> sounds from Chris's reply that tcmalloc already checks
>> >> >> __rseq_abi.cpu_id and is not dlopened/closed. Considering these, it
>> >> >> seems to already handle things properly - CMIIW.
>> >> >
>> >> > I'll make a note about this, since we can probably benefit from some
>> >> > more comments about the assumptions/invariants the fastpath uses.
>> >>
>> >> I suspect the integration with glibc and with dlopen'd/dlclose'd libraries will
>> >> not
>> >> behave correctly with the current tcmalloc implementation.
>> >>
>> >> Based on the tcmalloc code-base, InitFastPerCpu is only called from IsFast. As
>> >> long
>> >> as this is the only expected caller, having IsFast comparing the RseqCpuId
>> >> detects
>> >> whether glibc (or some other library) has already registered rseq for the
>> >> current
>> >> thread.
>> >>
>> >> However, if the application chooses to invoke InitFastPerCpu() directly, things
>> >> become
>> >> expected, because it invokes:
>> >>
>> >> absl::base_internal::LowLevelCallOnce(&init_per_cpu_once, InitPerCpu);
>> >>
>> >> which AFAIU invokes InitPerCpu once after execution of the current program.
>> >> Which
>> >> does:
>> >>
>> >> static bool InitThreadPerCpu() {
>> >> if (__rseq_refcount++ > 0) {
>> >> return true;
>> >> }
>> >>
>> >> auto ret = syscall(__NR_rseq, &__rseq_abi, sizeof(__rseq_abi), 0,
>> >> PERCPU_RSEQ_SIGNATURE);
>> >> if (ret == 0) {
>> >> return true;
>> >> } else {
>> >> __rseq_refcount--;
>> >> }
>> >>
>> >> return false;
>> >> }
>> >>
>> >> static void InitPerCpu() {
>> >> // Based on the results of successfully initializing the first thread, mark
>> >> // init_status to initialize all subsequent threads.
>> >> if (InitThreadPerCpu()) {
>> >> init_status = kFastMode;
>> >> }
>> >> }
>> >>
>> >> In a scenario where glibc has already registered Rseq, the __rseq_refcount will
>> >> be incremented, the __NR_rseq syscall will fail with -1, errno=EBUSY, so the
>> >> refcount
>> >> will be immediately decremented and it will return false. Therefore,
>> >> "init_status" will
>> >> never be set fo kFastMode, leaving it in kSlowMode for the entire lifetime of
>> >> this
>> >> program. That being said, even though this state can come as a surprise, it
>> >> seems to
>> >> be entirely bypassed by the fast-paths IsFast() and IsFastNoInit(), so maybe it
>> >> won't
>> >> have any observable side-effects other than leaving init_status in a state that
>> >> does not
>> >> match reality.
>> >
>> > I agree that this could potentially violate inviarants, but
>> > InitFastPerCpu is not intended to be called by the application.
>>
>> OK, explicitly documenting this would be a good thing. In my own projects,
>> I prefix those symbols with double-underscores (__) to indicate that those
>> are not meant to be called by other means than the static inlines in the API.
>>
>> There may be use-cases which justify exposing InitFastPerCpu as a public API for
>> applications though, especially for those which require some level of
>> real-time guarantees from the malloc/free APIs. I've run into this situation
>> with liburcu which I maintain.
>>
>> >
>> >> In the other use-case where tcmalloc co-exist with a dlopened/dlclosed library,
>> >> but glibc
>> >> does not provide Rseq registration, we run into issues as well if the dlopened
>> >> library
>> >> registers rseq first for a given thread. The IsFastNoInit() expects that if Rseq
>> >> has been
>> >> observed as registered in the past for a thread, it stays registered. However,
>> >> if a
>> >> dlclosed library unregisters Rseq, we need to be prepared to re-register it. So
>> >> either
>> >> tcmalloc needs to express its use of Rseq by incrementing __rseq_refcount even
>> >> when Rseq
>> >> is registered (this would hurt the fast-path however, and I would hate to have
>> >> to do this),
>> >> or tcmalloc needs to be able to handle the fact that Rseq may be unregistered by
>> >> a dlclosed
>> >> library which was the actual owner of the Rseq registration.
>> >
>> > We have a bit of an opportunity to figure out whether this is the
>> > first time--from TCMalloc's perspective--a thread is doing per-CPU and
>> > bump the __rseq_count accordingly. I think this could be done off of
>> > the fast path.
>>
>> Is there an explicit tcmalloc API call that each thread need to do before
>> starting
>> to use tcmalloc to allocate and free memory ? If not, you'll probably need to
>> add
>> at least a load of __rseq_refcount (or some other TLS variable), test and
>> conditional
>> branch on the fast-path, which is an additional cost I would ideally prefer to
>> avoid.
>> Or do you have something else in mind ?
>
> No explicit call is necessary. This is something that can be done in
> the slow path, since we can recognize the transition from slow -> fast
> path for that thread

Got it, it should work. Thanks!

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-02-26 21:51:40

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

----- On Feb 25, 2020, at 10:24 PM, Joel Fernandes, Google [email protected] wrote:
[..]
>
> Chris, Brian, is there any other concern to upgrading of tcmalloc
> version in ChromeOS? I believe there was some concern about detection
> of rseq kernel support. A quick look at tcmalloc shows it does not do
> such detection, but I can stand corrected. One more thing, currently
> tcmalloc does not use rseq on ARM. If I recall, ARM does have rseq
> support as well. So we ought to enable it for that arch as well if
> possible. Why not enable it on all arches and then dynamically detect
> at runtime if needed support is available?

Please allow me to raise a concern with respect to the implementation
of the SlowFence() function in tcmalloc/internal/percpu.cc. It uses
sched_setaffinity to move the thread around to each CPU part of the
cpu mask.

There are a couple of corner-cases where I think it can malfunction:

- Interaction with concurrent sched_setaffinity invoked by an external
manager process: If an external manager process attempts to limit this
thread's ability to run onto specific CPUs, either before the thread
starts or concurrently while the thread executes, I suspect the
SlowFence() algorithm will simply handle errors while trying to set
affinity by skipping CPUs, which results in a skipped rseq fence,
which in turn can cause corruption.

The comments in this function state:

// If we can't pin ourselves there, then no one else can run there, so
// that's fine.

But AFAIU the thread's cpu affinity is a per-thread attribute, so saying
that no other thread from the same process can run there seems wrong. What
am I missing ? Maybe it is a difference between cpusets and sched_setaffinity ?

The code below opens /proc/self/cpuset to deal with concurrent affinity
updates by cpuset seems to rely on CONFIG_CPUSETS=y, and does not seem to
take into account CPU affinity changes through sched_setaffinity.

Moreover, reading through the comments there, depending on internal kernel
synchronization implementation details for dealing with concurrent cpuset
updates seems very fragile. Those details about internal locking of
cpuset.cpus within the kernel should not be expected to be ABI.

- Interaction with CPU hotplug. If a target CPU is unplugged and plugged
again (offline, then online) concurrently, this algorithm may skip that
CPU and thus skip a rseq fence, which can also cause corruption.

Those limitations of sched_setaffinity() are the reasons why I have
proposed a new "pin_on_cpu()" system call [1]. Feedback in that area
is very welcome.

Thanks,

Mathieu

[1] https://lore.kernel.org/r/[email protected]

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-02-27 10:19:44

by Szabolcs Nagy

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

On 26/02/2020 18:56, Mathieu Desnoyers wrote:
> ----- On Feb 26, 2020, at 12:27 PM, Chris Kennelly [email protected] wrote:
>> I agree that this could potentially violate inviarants, but
>> InitFastPerCpu is not intended to be called by the application.
>
> OK, explicitly documenting this would be a good thing. In my own projects,
> I prefix those symbols with double-underscores (__) to indicate that those
> are not meant to be called by other means than the static inlines in the API.

use a different convention for that, __ prefix is always
reserved for the implementation for arbitrary use.

ideally internals would not be exposed in the user api
and then there is no such issue.

2020-02-27 10:33:45

by Florian Weimer

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

* Szabolcs Nagy:

> On 26/02/2020 18:56, Mathieu Desnoyers wrote:
>> ----- On Feb 26, 2020, at 12:27 PM, Chris Kennelly [email protected] wrote:
>>> I agree that this could potentially violate inviarants, but
>>> InitFastPerCpu is not intended to be called by the application.
>>
>> OK, explicitly documenting this would be a good thing. In my own projects,
>> I prefix those symbols with double-underscores (__) to indicate that those
>> are not meant to be called by other means than the static inlines in the API.
>
> use a different convention for that, __ prefix is always
> reserved for the implementation for arbitrary use.

tcmalloc is *not* the implementation in that sense. It must not use the
__ prefix for its identifiers.

Thanks,
Florian

2020-02-27 21:12:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: Rseq registration: Google tcmalloc vs glibc

Apologies for top-posting, and adding David Goldblatt of the jemalloc()
group on CC. I have bounced the rest of the email chain to him.

Thanx, Paul

On Thu, Feb 20, 2020 at 04:33:30PM -0500, Mathieu Desnoyers wrote:
> Hi Chris,
>
> As one of the maintainers of the Rseq system call in the Linux kernel, I would
> like to thank the Google team for open sourcing a tcmalloc implementation based
> on Rseq!
>
> I've looked into some critical integration aspects of the tcmalloc implementation,
> and would like to bring up a topic which involves both tcmalloc developers and the
> glibc community.
>
> I have been discussing aspects of co-existence between early Rseq adopter libraries
> and glibc for the past year with the glibc community, and tcmalloc happens to be the
> first project to publicly use Rseq outside of prototype branches or selftests code.
> Considering that there can only be one Rseq registration per thread (as imposed by
> the rseq ABI), there needs to be some kind of protocol between libraries to ensure we
> don't introduce regressions when we eventually combine a newer glibc which takes care
> of registration of the __rseq_abi TLS along with tcmalloc which also try to perform
> that registration within the same thread.
>
> Throughout the various rounds of review of the glibc Rseq integration patch set,
> there were a few solutions envisioned. Here is a brief history:
>
> 1) Introduce a __rseq_refcount TLS variable.
>
> - Currently used by Linux tools/testing/selftests/rseq/rseq.c,
> - Currently used by Google tcmalloc,
> - Emitted by glibc as well my the original patchset (but was later removed),
>
> A user incrementing the refcount from 0 -> 1 performs rseq registration.
> The last user decrementing from 1 -> 0 performs rseq unregistration.
>
> Works for co-existence of dlopen'd/dlclose'd libraries, for dynamically
> linked libraries, and for the main executable.
>
> The refcounting was deemed too complex for glibc's needs (it always
> exists for the entire executable's lifetime), so we moved to
> __rseq_handled instead.
>
>
> 2) Introduce a __rseq_handled global variable.
>
> - Currently used by Linux tools/testing/selftests/rseq/rseq.c,
> - At some point emitted by glibc as well in my patch set (but was later
> removed),
>
> A library may take rseq ownership if it is still 0 when executing the
> library constructor. Set to 1 by library constructor when handling rseq.
> Set to 0 in destructor if handling rseq.
>
> Not meant to be set by dlopen'd/dlclose'd libraries, only by libraries
> existing for the whole lifetime of the executable and/or the main executable.
>
> This __rseq_handled symbol has been identified as being somewhat redundant
> with the information provided in the __rseq_abi.cpu_id field (uninitialized
> state), which motivated removing this symbol from the glibc integration
> entirely. The only reason for having __rseq_handled separate from
> __rseq_abi.cpu_id was because it was then impossible to touch TLS data
> early in the glibc initialization. This issue was later resolved within
> glibc.
>
>
> 3) Use the __rseq_abi TLS cpu_id field to know whether Rseq has been
> registered.
>
> - Current protocol in the most recent glibc integration patch set.
> - Not supported yet by Linux kernel rseq selftests,
> - Not supported yet by tcmalloc,
>
> Use the per-thread state to figure out whether each thread need to register
> Rseq individually.
>
> Works for integration between a library which exists for the entire lifetime
> of the executable (e.g. glibc) and other libraries. However, it does not
> allow a set of libraries which are dlopen'd/dlclose'd to co-exist without
> having a library like glibc handling the registration present.
>
> So overall, I suspect the protocol we want for early adopters is that they
> only register Rseq if __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED, which
> ensure they do not get -1, errno = EBUSY when linked against a newer glibc
> which handles Rseq registration. In order to handle multiple early adopters
> dlopen'd/dlclose'd in the same executable, those should synchronize with a
> __rseq_refcount TLS reference count, but it does not have to be taken into
> account by the main executable or libraries present for the entire executable
> lifetime (like glibc).
>
> Based on this, what I think would be missing from the current Google tcmalloc
> implementation is a check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED
> in InitThreadPerCpu().
>
> Is tcmalloc ever meant to be dlopen'd/dlclose'd (either directly or indirectly),
> or is it required to exist for the entire executable lifetime ? The check and
> increment of __rseq_refcount is only useful to co-exist with dlopen'd/dlclose'd
> libraries, but it would not allow discovering the presence of a glibc which
> takes care of the rseq registration with the planned protocol. A dlopen'd
> library should then only perform rseq unregistration if if brings the
> __rseq_refcount back to 0 (e.g. in a pthread_key destructor).
>
> Adding this check for __rseq_abi.cpu_id == RSEQ_CPU_ID_UNINITIALIZED is something
> I need to do in the Linux rseq selftests, but I refrained from submitting any
> further change to those tests until the glibc rseq integration gets finally
> merged.
>
> Is it something that could be easily changed at this stage in Google tcmalloc,
> or should we reconsider adding back __rseq_refcount within the glibc integration
> patch set, even though it is not strictly useful to glibc ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com