2020-07-06 20:50:20

by Mathieu Desnoyers

[permalink] [raw]
Subject: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks")
addresses an issue with cpu_id field of newly created processes. Expose
a flag which can be used by user-space to query whether the kernel
implements this fix.

Considering that this issue can cause corruption of user-space per-cpu
data updated with rseq, it is recommended that user-space detects
availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag
either combined with registration or on its own before using rseq.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Boqun Feng <[email protected]>
Cc: "H . Peter Anvin" <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Neel Natu <[email protected]>
Cc: [email protected]
---
include/uapi/linux/rseq.h | 5 +++++
kernel/rseq.c | 4 ++++
2 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 3b5fba25461a..a548b77c9520 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -21,13 +21,18 @@ enum rseq_cpu_id_state {
/*
* RSEQ_FLAG_UNREGISTER: Unregister rseq ABI for caller thread.
* RSEQ_FLAG_REGISTER: Register rseq ABI for caller thread.
+ * RSEQ_FLAG_RELIABLE_CPU_ID: rseq provides a reliable cpu_id field.
*
* Flag value 0 has the same behavior as RSEQ_FLAG_REGISTER, but cannot be
* combined with other flags. This behavior is kept for backward compatibility.
+ *
+ * The flag RSEQ_FLAG_REGISTER can be combined with the RSEQ_FLAG_RELIABLE_CPU_ID
+ * flag.
*/
enum rseq_flags {
RSEQ_FLAG_UNREGISTER = (1 << 0),
RSEQ_FLAG_REGISTER = (1 << 1),
+ RSEQ_FLAG_RELIABLE_CPU_ID = (1 << 2),
};

enum rseq_cs_flags_bit {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 47ce221cd6f9..32cc2e0d961f 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -333,6 +333,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
current->rseq_sig = 0;
break;
case RSEQ_FLAG_REGISTER:
+ fallthrough;
+ case RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID:
if (current->rseq) {
/*
* If rseq is already registered, check whether
@@ -365,6 +367,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
*/
rseq_set_notify_resume(current);
break;
+ case RSEQ_FLAG_RELIABLE_CPU_ID:
+ break;
default:
return -EINVAL;
}
--
2.17.1


2020-07-07 07:32:23

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

* Mathieu Desnoyers:

> commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks")
> addresses an issue with cpu_id field of newly created processes. Expose
> a flag which can be used by user-space to query whether the kernel
> implements this fix.
>
> Considering that this issue can cause corruption of user-space per-cpu
> data updated with rseq, it is recommended that user-space detects
> availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag
> either combined with registration or on its own before using rseq.

Presumably, the intent is that glibc uses RSEQ_FLAG_RELIABLE_CPU_ID to
register the rseq area. That will surely prevent glibc itself from
activating rseq on broken kernels. But if another rseq library
performs registration and has not been updated to use
RSEQ_FLAG_RELIABLE_CPU_ID, we still end up with an active rseq area
(and incorrect CPU IDs from sched_getcpu in glibc). So further glibc
changes will be needed. I suppose we could block third-party rseq
registration with a registration of a hidden rseq area (not
__rseq_abi). But then the question is if any of the third-party rseq
users are expecting the EINVAL error code from their failed
registration.

The rseq registration state machine is quite tricky already, and the
need to use RSEQ_FLAG_RELIABLE_CPU_ID would make it even more
complicated. Even if we implemented all the changes, it's all going
to be essentially dead, untestable code in a few months, when the
broken kernels are out of circulation. It does not appear to be good
investment to me.

2020-07-07 10:51:09

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

----- On Jul 7, 2020, at 3:29 AM, Florian Weimer [email protected] wrote:

> * Mathieu Desnoyers:
>
>> commit 93b585c08d16 ("Fix: sched: unreliable rseq cpu_id for new tasks")
>> addresses an issue with cpu_id field of newly created processes. Expose
>> a flag which can be used by user-space to query whether the kernel
>> implements this fix.
>>
>> Considering that this issue can cause corruption of user-space per-cpu
>> data updated with rseq, it is recommended that user-space detects
>> availability of this fix by using the RSEQ_FLAG_RELIABLE_CPU_ID flag
>> either combined with registration or on its own before using rseq.
>
> Presumably, the intent is that glibc uses RSEQ_FLAG_RELIABLE_CPU_ID to
> register the rseq area. That will surely prevent glibc itself from
> activating rseq on broken kernels. But if another rseq library
> performs registration and has not been updated to use
> RSEQ_FLAG_RELIABLE_CPU_ID, we still end up with an active rseq area
> (and incorrect CPU IDs from sched_getcpu in glibc). So further glibc
> changes will be needed. I suppose we could block third-party rseq
> registration with a registration of a hidden rseq area (not
> __rseq_abi). But then the question is if any of the third-party rseq
> users are expecting the EINVAL error code from their failed
> registration.
>
> The rseq registration state machine is quite tricky already, and the
> need to use RSEQ_FLAG_RELIABLE_CPU_ID would make it even more
> complicated. Even if we implemented all the changes, it's all going
> to be essentially dead, untestable code in a few months, when the
> broken kernels are out of circulation. It does not appear to be good
> investment to me.

Those are very good points. One possibility we have would be to let
glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
flag. On kernels with the bug present, the cpu_id field is still good
enough for typical uses of sched_getcpu() which does not appear to
have a very strict correctness requirement on returning the right
cpu number.

Then libraries and applications which require a reliable cpu_id field
could check this on their own by calling rseq with the
RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
complex in __rseq_abi, and let each rseq user decide about its own fate:
whether it uses rseq or keeps using an rseq-free fallback.

I am still tempted to allow combining RSEQ_FLAG_REGISTER | RSEQ_FLAG_RELIABLE_CPU_ID
for applications which would not be using glibc, and want to check this flag on
thread registration.

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-07-07 11:33:38

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

* Mathieu Desnoyers:

> Those are very good points. One possibility we have would be to let
> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
> flag. On kernels with the bug present, the cpu_id field is still good
> enough for typical uses of sched_getcpu() which does not appear to
> have a very strict correctness requirement on returning the right
> cpu number.
>
> Then libraries and applications which require a reliable cpu_id
> field could check this on their own by calling rseq with the
> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
> complex in __rseq_abi, and let each rseq user decide about its own
> fate: whether it uses rseq or keeps using an rseq-free fallback.
>
> I am still tempted to allow combining RSEQ_FLAG_REGISTER |
> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using
> glibc, and want to check this flag on thread registration.

Well, you could add a bug fix level field to the __rseq_abi variable.
Then applications could check if the kernel has the appropriate level
of non-buggyness. But the same thing could be useful for many other
kernel interfaces, and I haven't seen such a fix level value for them.
What makes rseq so special?

It won't help with the present bug, but maybe we should add an rseq
API sub-call that blocks future rseq registration for the thread.
Then we can add a glibc tunable that flips off rseq reliably if people
do not want to use it for some reason (and switch to the non-rseq
fallback code instead). But that's going to help with future bugs
only.

2020-07-07 12:06:55

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

----- On Jul 7, 2020, at 7:32 AM, Florian Weimer [email protected] wrote:

> * Mathieu Desnoyers:
>
>> Those are very good points. One possibility we have would be to let
>> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
>> flag. On kernels with the bug present, the cpu_id field is still good
>> enough for typical uses of sched_getcpu() which does not appear to
>> have a very strict correctness requirement on returning the right
>> cpu number.
>>
>> Then libraries and applications which require a reliable cpu_id
>> field could check this on their own by calling rseq with the
>> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
>> complex in __rseq_abi, and let each rseq user decide about its own
>> fate: whether it uses rseq or keeps using an rseq-free fallback.
>>
>> I am still tempted to allow combining RSEQ_FLAG_REGISTER |
>> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using
>> glibc, and want to check this flag on thread registration.
>
> Well, you could add a bug fix level field to the __rseq_abi variable.

Even though I initially planned to make the struct rseq_abi extensible,
the __rseq_abi variable ends up being fix-sized, so we need to be very
careful in choosing what we place in the remaining 12 bytes of padding.
I suspect we'd want to keep 8 bytes to express a pointer to an
"extended" structure.

I wonder if a bug fix level "version" is the right approach. We could
instead have a bitmask of fixes, which the application could independently
check. For instance, some applications may care about cpu_id field
reliability, and others not.

> Then applications could check if the kernel has the appropriate level
> of non-buggyness. But the same thing could be useful for many other
> kernel interfaces, and I haven't seen such a fix level value for them.
> What makes rseq so special?

I guess my only answer is because I care as a user of the system call, and
what is a system call without users ? I have real applications which work
today with end users deploying them on various kernels, old and new, and I
want to take advantage of this system call to speed them up. However, if I
have to choose between speed and correctness (in other words, not crashing
a critical application), I will choose correctness. So if I cannot detect
that I can safely use the system call, it becomes pretty much useless even
for my own use-cases.

> It won't help with the present bug, but maybe we should add an rseq
> API sub-call that blocks future rseq registration for the thread.
> Then we can add a glibc tunable that flips off rseq reliably if people
> do not want to use it for some reason (and switch to the non-rseq
> fallback code instead). But that's going to help with future bugs
> only.

I don't think it's needed. All I really need is to have _some_ way to
let lttng-ust or liburcu query whether the cpu_id field is reliable. This
state does not have to be made quickly accessible to other libraries,
nor does it have to be shared between libraries. It would allow each
user library or application to make its own mind on whether it can use
rseq or not.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-07-07 18:54:59

by Carlos O'Donell

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

On 7/7/20 8:06 AM, Mathieu Desnoyers wrote:
> ----- On Jul 7, 2020, at 7:32 AM, Florian Weimer [email protected] wrote:
>
>> * Mathieu Desnoyers:
>>
>>> Those are very good points. One possibility we have would be to let
>>> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
>>> flag. On kernels with the bug present, the cpu_id field is still good
>>> enough for typical uses of sched_getcpu() which does not appear to
>>> have a very strict correctness requirement on returning the right
>>> cpu number.
>>>
>>> Then libraries and applications which require a reliable cpu_id
>>> field could check this on their own by calling rseq with the
>>> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
>>> complex in __rseq_abi, and let each rseq user decide about its own
>>> fate: whether it uses rseq or keeps using an rseq-free fallback.
>>>
>>> I am still tempted to allow combining RSEQ_FLAG_REGISTER |
>>> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using
>>> glibc, and want to check this flag on thread registration.
>>
>> Well, you could add a bug fix level field to the __rseq_abi variable.
>
> Even though I initially planned to make the struct rseq_abi extensible,
> the __rseq_abi variable ends up being fix-sized, so we need to be very
> careful in choosing what we place in the remaining 12 bytes of padding.
> I suspect we'd want to keep 8 bytes to express a pointer to an
> "extended" structure.
>
> I wonder if a bug fix level "version" is the right approach. We could
> instead have a bitmask of fixes, which the application could independently
> check. For instance, some applications may care about cpu_id field
> reliability, and others not.

I agree with Florian.

Developers are not interested in a bitmask of fixes.

They want a version they can check and that at a given version everything
works as expected.

In reality today this is the kernel version.

So rseq is broken from a developer perspective until they can get a new
kernel or an agreement from their downstream vendor that revision Z of
the kernel they are using has the fix you've just committed.

Essentially this problem solves itself at level higher than the interfaces
we're talking about today.

Encoding this as a bitmask of fixes is an overengineered solution for a
problem that the downstream communities already know how to solve.

I would strongly suggest a "version" or nothing.

>> Then applications could check if the kernel has the appropriate level
>> of non-buggyness. But the same thing could be useful for many other
>> kernel interfaces, and I haven't seen such a fix level value for them.
>> What makes rseq so special?
>
> I guess my only answer is because I care as a user of the system call, and
> what is a system call without users ? I have real applications which work
> today with end users deploying them on various kernels, old and new, and I
> want to take advantage of this system call to speed them up. However, if I
> have to choose between speed and correctness (in other words, not crashing
> a critical application), I will choose correctness. So if I cannot detect
> that I can safely use the system call, it becomes pretty much useless even
> for my own use-cases.

Yes.

In the case of RHEL we have *tons* of users in the same predicament.

They just wait until the RHEL kernel team releases a fixed kernel version
and check for that version and deploy with that version.

Likewise every other user of a kernel. They solve it by asking their
kernel provider, internal or external, to verify the fix is applied to
the deployment kernel.

If they are an ISV they have to test and deploy similar strategies for
multiple kernel version.

By trying to automate this you are encoding into the API some level of
package management concepts e.g. patch level, revision level, etc.

This is difficult to do without a more generalized API. Why do it just
for rseq? Why do it with the few bits you have?

It's not a great fit IMO. Just let the kernel version be the arbiter of
correctness.

>> It won't help with the present bug, but maybe we should add an rseq
>> API sub-call that blocks future rseq registration for the thread.
>> Then we can add a glibc tunable that flips off rseq reliably if people
>> do not want to use it for some reason (and switch to the non-rseq
>> fallback code instead). But that's going to help with future bugs
>> only.
>
> I don't think it's needed. All I really need is to have _some_ way to
> let lttng-ust or liburcu query whether the cpu_id field is reliable. This
> state does not have to be made quickly accessible to other libraries,
> nor does it have to be shared between libraries. It would allow each
> user library or application to make its own mind on whether it can use
> rseq or not.

That check is "kernel version > x.y.z" :-)

or

"My kernel vendor told me it's fixed."

--
Cheers,
Carlos.

2020-07-07 19:00:03

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

----- On Jul 7, 2020, at 2:53 PM, Carlos O'Donell [email protected] wrote:

> On 7/7/20 8:06 AM, Mathieu Desnoyers wrote:
>> ----- On Jul 7, 2020, at 7:32 AM, Florian Weimer [email protected] wrote:
>>
>>> * Mathieu Desnoyers:
>>>
>>>> Those are very good points. One possibility we have would be to let
>>>> glibc do the rseq registration without the RSEQ_FLAG_RELIABLE_CPU_ID
>>>> flag. On kernels with the bug present, the cpu_id field is still good
>>>> enough for typical uses of sched_getcpu() which does not appear to
>>>> have a very strict correctness requirement on returning the right
>>>> cpu number.
>>>>
>>>> Then libraries and applications which require a reliable cpu_id
>>>> field could check this on their own by calling rseq with the
>>>> RSEQ_FLAG_RELIABLE_CPU_ID flag. This would not make the state more
>>>> complex in __rseq_abi, and let each rseq user decide about its own
>>>> fate: whether it uses rseq or keeps using an rseq-free fallback.
>>>>
>>>> I am still tempted to allow combining RSEQ_FLAG_REGISTER |
>>>> RSEQ_FLAG_RELIABLE_CPU_ID for applications which would not be using
>>>> glibc, and want to check this flag on thread registration.
>>>
>>> Well, you could add a bug fix level field to the __rseq_abi variable.
>>
>> Even though I initially planned to make the struct rseq_abi extensible,
>> the __rseq_abi variable ends up being fix-sized, so we need to be very
>> careful in choosing what we place in the remaining 12 bytes of padding.
>> I suspect we'd want to keep 8 bytes to express a pointer to an
>> "extended" structure.
>>
>> I wonder if a bug fix level "version" is the right approach. We could
>> instead have a bitmask of fixes, which the application could independently
>> check. For instance, some applications may care about cpu_id field
>> reliability, and others not.
>
> I agree with Florian.
>
> Developers are not interested in a bitmask of fixes.
>
> They want a version they can check and that at a given version everything
> works as expected.
>
> In reality today this is the kernel version.
>
> So rseq is broken from a developer perspective until they can get a new
> kernel or an agreement from their downstream vendor that revision Z of
> the kernel they are using has the fix you've just committed.
>
> Essentially this problem solves itself at level higher than the interfaces
> we're talking about today.
>
> Encoding this as a bitmask of fixes is an overengineered solution for a
> problem that the downstream communities already know how to solve.
>
> I would strongly suggest a "version" or nothing.

That's a good point.

>
>>> Then applications could check if the kernel has the appropriate level
>>> of non-buggyness. But the same thing could be useful for many other
>>> kernel interfaces, and I haven't seen such a fix level value for them.
>>> What makes rseq so special?
>>
>> I guess my only answer is because I care as a user of the system call, and
>> what is a system call without users ? I have real applications which work
>> today with end users deploying them on various kernels, old and new, and I
>> want to take advantage of this system call to speed them up. However, if I
>> have to choose between speed and correctness (in other words, not crashing
>> a critical application), I will choose correctness. So if I cannot detect
>> that I can safely use the system call, it becomes pretty much useless even
>> for my own use-cases.
>
> Yes.
>
> In the case of RHEL we have *tons* of users in the same predicament.
>
> They just wait until the RHEL kernel team releases a fixed kernel version
> and check for that version and deploy with that version.
>
> Likewise every other user of a kernel. They solve it by asking their
> kernel provider, internal or external, to verify the fix is applied to
> the deployment kernel.
>
> If they are an ISV they have to test and deploy similar strategies for
> multiple kernel version.
>
> By trying to automate this you are encoding into the API some level of
> package management concepts e.g. patch level, revision level, etc.
>
> This is difficult to do without a more generalized API. Why do it just
> for rseq? Why do it with the few bits you have?
>
> It's not a great fit IMO. Just let the kernel version be the arbiter of
> correctness.
>
>>> It won't help with the present bug, but maybe we should add an rseq
>>> API sub-call that blocks future rseq registration for the thread.
>>> Then we can add a glibc tunable that flips off rseq reliably if people
>>> do not want to use it for some reason (and switch to the non-rseq
>>> fallback code instead). But that's going to help with future bugs
>>> only.
>>
>> I don't think it's needed. All I really need is to have _some_ way to
>> let lttng-ust or liburcu query whether the cpu_id field is reliable. This
>> state does not have to be made quickly accessible to other libraries,
>> nor does it have to be shared between libraries. It would allow each
>> user library or application to make its own mind on whether it can use
>> rseq or not.
>
> That check is "kernel version > x.y.z" :-)
>
> or
>
> "My kernel vendor told me it's fixed."

Allright, thanks for the insight! I'll drop these patches and focus only
on the bugfix.

Thanks,

Mathieu

>
> --
> Cheers,
> Carlos.

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-07-07 19:56:21

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

* Carlos O'Donell:

> It's not a great fit IMO. Just let the kernel version be the arbiter of
> correctness.

For manual review, sure. But checking it programmatically does not
yield good results due to backports. Even those who use the stable
kernel series sometimes pick up critical fixes beforehand, so it's not
reliable possible for a program to say, “I do not want to run on this
kernel because it has a bad version”. We had a recent episode of this
with the Go runtime, which tried to do exactly this.

2020-07-08 08:32:15

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

* Mathieu Desnoyers:

> Allright, thanks for the insight! I'll drop these patches and focus only
> on the bugfix.

Thanks, much appreciated!

2020-07-08 15:35:07

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

[ Context for Linus: I am dropping this RFC patch, but am curious to
hear your point of view on exposing to user-space which system call
behavior fixes are present in the kernel, either through feature
flags or system-call versioning. The intent is to allow user-space
to make better decisions on whether it should use a system call or
rely on fallback behavior. ]

----- On Jul 7, 2020, at 3:55 PM, Florian Weimer [email protected] wrote:

> * Carlos O'Donell:
>
>> It's not a great fit IMO. Just let the kernel version be the arbiter of
>> correctness.
>
> For manual review, sure. But checking it programmatically does not
> yield good results due to backports. Even those who use the stable
> kernel series sometimes pick up critical fixes beforehand, so it's not
> reliable possible for a program to say, “I do not want to run on this
> kernel because it has a bad version”. We had a recent episode of this
> with the Go runtime, which tried to do exactly this.

FWIW, the kernel fix backport issue would also be a concern if we exposed
a numeric "fix level version" with specific system calls: what should
we do if a distribution chooses to include one fix in the sequence,
but not others ? Identifying fixes are "feature flags" allow
cherry-picking specific fixes in a backport, but versions would not
allow that.

That being said, maybe it's not such a bad thing to _require_ the
entire series of fixes to be picked in backports, which would be a
fortunate side-effect of the per-syscall-fix-version approach.

But I'm under the impression that such a scheme ends up versioning
a system call, which I suspect will be a no-go from Linus' perspective.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-07-08 16:25:17

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

On Wed, Jul 08, 2020 at 11:33:51AM -0400, Mathieu Desnoyers wrote:
> [ Context for Linus: I am dropping this RFC patch, but am curious to
> hear your point of view on exposing to user-space which system call
> behavior fixes are present in the kernel, either through feature
> flags or system-call versioning. The intent is to allow user-space
> to make better decisions on whether it should use a system call or
> rely on fallback behavior. ]
>
> ----- On Jul 7, 2020, at 3:55 PM, Florian Weimer [email protected] wrote:
>
> > * Carlos O'Donell:
> >
> >> It's not a great fit IMO. Just let the kernel version be the arbiter of
> >> correctness.
> >
> > For manual review, sure. But checking it programmatically does not
> > yield good results due to backports. Even those who use the stable
> > kernel series sometimes pick up critical fixes beforehand, so it's not
> > reliable possible for a program to say, “I do not want to run on this
> > kernel because it has a bad version”. We had a recent episode of this
> > with the Go runtime, which tried to do exactly this.
>
> FWIW, the kernel fix backport issue would also be a concern if we exposed
> a numeric "fix level version" with specific system calls: what should
> we do if a distribution chooses to include one fix in the sequence,
> but not others ? Identifying fixes are "feature flags" allow
> cherry-picking specific fixes in a backport, but versions would not
> allow that.
>
> That being said, maybe it's not such a bad thing to _require_ the
> entire series of fixes to be picked in backports, which would be a
> fortunate side-effect of the per-syscall-fix-version approach.
>
> But I'm under the impression that such a scheme ends up versioning
> a system call, which I suspect will be a no-go from Linus' perspective.

I've been following this a little bit. The kernel version itself doesn't
really mean anything and the kernel version is imho not at all
interesting to userspace applications. Especially for cross-distro
programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
openSUSE and god knows who what other distro what their fixed kernel
version is. That's not feasible at all and not how must programs do it.
Sure, a lot of programs name a minimal kernel version they require but
realistically we can't keep bumping it all the time. So the best
strategy for userspace imho has been to introduce a re-versioned flag or
enum that indicates the fixed behavior.

So I would suggest to just introduce
RSEQ_FLAG_REGISTER_2 = (1 << 2),
that's how these things are usually done (Netlink etc.). So not
introducing a fix bit or whatever but simply reversion your flag/enum.
We already deal with this today.

(Also, as a side-note. I see that you're passing struct rseq *rseq with
a length argument but you are not versioning by size. Is that
intentional? That basically somewhat locks you to the current struct
rseq layout and means users might run into problems when you extend
struct rseq in the future as they can't pass the new struct down to
older kernels. The way we deal with this is now - rseq might preceed
this - is copy_struct_from_user() (for example in sched_{get,set}attr(),
openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to
keep rseq extensible? Users can detect the new rseq version by just
passing a larger struct down to the kernel with the extra bytes set to 0
and if rseq doesn't complain they know they're dealing with an rseq that
knows larger struct sizes. Might be worth it if you have any reason to
belive that struct rseq might need to grow.)

Christian

2020-07-08 16:39:40

by Florian Weimer

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

* Christian Brauner:

> I've been following this a little bit. The kernel version itself doesn't
> really mean anything and the kernel version is imho not at all
> interesting to userspace applications. Especially for cross-distro
> programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
> openSUSE and god knows who what other distro what their fixed kernel
> version is.

And Red Hat Enterprise Linux only has a dozen or two kernel branches
under active maintenance, each with their own succession of version
numbers. It's just not feasible. Even figuring out the branch based
on the kernel version can be tricky!

> (Also, as a side-note. I see that you're passing struct rseq *rseq with
> a length argument but you are not versioning by size. Is that
> intentional? That basically somewhat locks you to the current struct
> rseq layout and means users might run into problems when you extend
> struct rseq in the future as they can't pass the new struct down to
> older kernels. The way we deal with this is now - rseq might preceed
> this - is copy_struct_from_user()

The kernel retains the pointer after the system call returns.
Basically, ownership of the memory area is transferred to the kernel.
It's like set_robust_list in this regard.

2020-07-08 17:40:31

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

----- On Jul 8, 2020, at 12:22 PM, Christian Brauner [email protected] wrote:
[...]
> I've been following this a little bit. The kernel version itself doesn't
> really mean anything and the kernel version is imho not at all
> interesting to userspace applications. Especially for cross-distro
> programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
> openSUSE and god knows who what other distro what their fixed kernel
> version is. That's not feasible at all and not how must programs do it.
> Sure, a lot of programs name a minimal kernel version they require but
> realistically we can't keep bumping it all the time. So the best
> strategy for userspace imho has been to introduce a re-versioned flag or
> enum that indicates the fixed behavior.
>
> So I would suggest to just introduce
> RSEQ_FLAG_REGISTER_2 = (1 << 2),
> that's how these things are usually done (Netlink etc.). So not
> introducing a fix bit or whatever but simply reversion your flag/enum.
> We already deal with this today.

Because rseq is effectively a per-thread resource shared across application
and libraries, it is not practical to merge the notion of version with the
registration. Typically __rseq_abi is registered by libc, and can be used
by the application and by many libraries. Early adopter libraries and
applications (e.g. librseq, tcmalloc) can also choose to handle registration
if it's not already done by libc.

For instance, it is acceptable for glibc to register rseq for all threads,
even in the presence of the cpu_id field inaccuracy, for use by the
sched_getcpu(3) implementation. However, users of rseq which need to
implement critical sections performing per-cpu data updates may want
to know whether the cpu_id field is reliable to ensure they do not crash
the process due to per-cpu data corruption.

This led me to consider exposing a feature-specific flag which can be
queried by specific users to know whether rseq has specific set of correct
behaviors implemented.

> (Also, as a side-note. I see that you're passing struct rseq *rseq with
> a length argument but you are not versioning by size. Is that
> intentional? That basically somewhat locks you to the current struct
> rseq layout and means users might run into problems when you extend
> struct rseq in the future as they can't pass the new struct down to
> older kernels. The way we deal with this is now - rseq might preceed
> this - is copy_struct_from_user() (for example in sched_{get,set}attr(),
> openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to
> keep rseq extensible? Users can detect the new rseq version by just
> passing a larger struct down to the kernel with the extra bytes set to 0
> and if rseq doesn't complain they know they're dealing with an rseq that
> knows larger struct sizes. Might be worth it if you have any reason to
> belive that struct rseq might need to grow.)

In the initial iterations of the rseq patch set, we initially had the rseq_len
argument hoping we would eventually be able to extend struct rseq. However,
it was finally decided against making it extensible, so the rseq ABI merged
into the Linux kernel with a fixed-size.

One of the key reasons for this is explained in
commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct")

The rseq system call, when invoked with flags of "0" or
"RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
specified in uapi linux/rseq.h.

Expecting a fixed size for rseq_len is a design choice that ensures
multiple libraries and application defining __rseq_abi in the same
process agree on its exact size.

The issue here is caused by the fact that the __rseq_abi variable is
shared across application/libraries for a given thread. So it's not
enough to agree between kernel and user-space on the extensibility
scheme, but we'd also have to find a way for all users within a process
to agree on the layout.

The "rseq_len" parameter could eventually become useful when combined
with additional flags, but currently its size is fixed.

There are indeed clear use-cases for extending struct rseq (or add a new
similar TLS structure), for instance exposing the current numa node id,
or to implement a fast signal-disabling scheme. But the fact that struct rseq
is shared across application/libraries makes it tricky to "just extend" struct
rseq.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-07-09 12:50:25

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

On Wed, Jul 08, 2020 at 01:34:48PM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner [email protected] wrote:
> [...]
> > I've been following this a little bit. The kernel version itself doesn't
> > really mean anything and the kernel version is imho not at all
> > interesting to userspace applications. Especially for cross-distro
> > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
> > openSUSE and god knows who what other distro what their fixed kernel
> > version is. That's not feasible at all and not how must programs do it.
> > Sure, a lot of programs name a minimal kernel version they require but
> > realistically we can't keep bumping it all the time. So the best
> > strategy for userspace imho has been to introduce a re-versioned flag or
> > enum that indicates the fixed behavior.
> >
> > So I would suggest to just introduce
> > RSEQ_FLAG_REGISTER_2 = (1 << 2),
> > that's how these things are usually done (Netlink etc.). So not
> > introducing a fix bit or whatever but simply reversion your flag/enum.
> > We already deal with this today.
>
> Because rseq is effectively a per-thread resource shared across application
> and libraries, it is not practical to merge the notion of version with the
> registration. Typically __rseq_abi is registered by libc, and can be used
> by the application and by many libraries. Early adopter libraries and
> applications (e.g. librseq, tcmalloc) can also choose to handle registration
> if it's not already done by libc.

I'm probably missing the elephant in the room but I was briefly looking
at github.com/compudj/librseq and it seems to me that the registration
you're talking about is:

extern __thread struct rseq __rseq_abi;
extern int __rseq_handled;

and it's done in int rseq_register_current_thread(void) afaict and
currently registration is done with flags set to 0.

What is the problem with either adding a - I don't know -
RSEG_FLAG_REGISTER/RSEQ_RELIABLE_CPU_FIELD flag that is also recorded in
__rseq_abi.flags. If the kernel doesn't support the flag it will fail
registration with EINVAL. So the registering program can detect it. If a
caller needs to know whether another thread uses the new flag it can
query __rseq_abi.flags. Some form of coordination must be possible in
userspace otherwise you'll have trouble with any new feature you add. I
general, I don't see how this is different from adding a new feature to
rseq. It should be the same principle.

I also don't understand the "not practical to merge the notion of
version with the registration". I'm not sure what that means to be
honest. :)
But just thinking about adding a new feature to rseq. Then you're in the
same spot, I think. When you register a bumped rseq - because you added
a new flag or whatever - you register a new version one way or the other
since a new feature - imho - is always a version bump. In fact, you
could think of your "reliable cpu" as a new feature not a bug. ;)

Also, you seem to directly version struct rseq_cs already through the
"version" member. So even if you are against the new flag I wouldn't
know what would stop you from directly versioning struct rseq itself.

And it's not that we don't version syscalls. We're doing it in multiple
ways to be honest, syscalls with a flag argument that reject unknown
flags are bumped in their version every time you add a new flag that
they accept. We don't spell this out but this is effectively what it is.
Think of it as a minor version bump. Extensible syscalls are versioned
by size and when their struct grows are bumped in their (minor) version.
In fact extensible syscalls with flags argument embedded in the struct
can be version bumped in two ways: growing a new flag argument or
growing a new struct member.

>
> For instance, it is acceptable for glibc to register rseq for all threads,
> even in the presence of the cpu_id field inaccuracy, for use by the
> sched_getcpu(3) implementation. However, users of rseq which need to
> implement critical sections performing per-cpu data updates may want
> to know whether the cpu_id field is reliable to ensure they do not crash
> the process due to per-cpu data corruption.
>
> This led me to consider exposing a feature-specific flag which can be
> queried by specific users to know whether rseq has specific set of correct
> behaviors implemented.
>
> > (Also, as a side-note. I see that you're passing struct rseq *rseq with
> > a length argument but you are not versioning by size. Is that
> > intentional? That basically somewhat locks you to the current struct
> > rseq layout and means users might run into problems when you extend
> > struct rseq in the future as they can't pass the new struct down to
> > older kernels. The way we deal with this is now - rseq might preceed
> > this - is copy_struct_from_user() (for example in sched_{get,set}attr(),
> > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to
> > keep rseq extensible? Users can detect the new rseq version by just
> > passing a larger struct down to the kernel with the extra bytes set to 0
> > and if rseq doesn't complain they know they're dealing with an rseq that
> > knows larger struct sizes. Might be worth it if you have any reason to
> > belive that struct rseq might need to grow.)
>
> In the initial iterations of the rseq patch set, we initially had the rseq_len
> argument hoping we would eventually be able to extend struct rseq. However,
> it was finally decided against making it extensible, so the rseq ABI merged
> into the Linux kernel with a fixed-size.
>
> One of the key reasons for this is explained in
> commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct")
>
> The rseq system call, when invoked with flags of "0" or
> "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
> be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
> specified in uapi linux/rseq.h.
>
> Expecting a fixed size for rseq_len is a design choice that ensures
> multiple libraries and application defining __rseq_abi in the same
> process agree on its exact size.
>
> The issue here is caused by the fact that the __rseq_abi variable is
> shared across application/libraries for a given thread. So it's not
> enough to agree between kernel and user-space on the extensibility
> scheme, but we'd also have to find a way for all users within a process
> to agree on the layout.

But you're in the same boat if you add any new feature, no? In your
model, wouldn't you need all users to agree on the feature set as well?
Not just the struct rseq size. If that's the case then rseq would be
unextendable (for now).

But specifically about the size-versioning part. Well, one way to solve
this - imho - would be to add a output size parameter to struct rseq and
introduce a little more vetting than we have right now.
So the kernel is what ultimately registers struct rseq iiuc. If there
were a size output parameter the kernel could set the size of the struct
it knows about before registering it.
So a caller passing down a larger struct with e.g. a new field set to a
non-zero value would get an error from the kernel and the size of the
supported struct. The caller can then adjust and simply zero out the
unsupported field and retry. Other callers in userspace can query the
size and find out what size of struct is registered. If it's larger than
what they know about they can infer they are on a newer kernel with new
features but they can simply ignore the unknown fields. If it's smaller
than they know what fields to ignore.

I hope I'm not derailing this discussion. I'm just trying to show that
there's some hope.
In short, imho, I think adding a flag indicating the new "reliable cpu"
feature is probably the best way to go. Think of it not as the kernel
indicating the absence of a bug but as the presence of a feature. :)

Thanks!
Christian

2020-07-09 15:17:16

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

----- On Jul 9, 2020, at 8:49 AM, Christian Brauner [email protected] wrote:

> On Wed, Jul 08, 2020 at 01:34:48PM -0400, Mathieu Desnoyers wrote:
>> ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner
>> [email protected] wrote:
>> [...]
>> > I've been following this a little bit. The kernel version itself doesn't
>> > really mean anything and the kernel version is imho not at all
>> > interesting to userspace applications. Especially for cross-distro
>> > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
>> > openSUSE and god knows who what other distro what their fixed kernel
>> > version is. That's not feasible at all and not how must programs do it.
>> > Sure, a lot of programs name a minimal kernel version they require but
>> > realistically we can't keep bumping it all the time. So the best
>> > strategy for userspace imho has been to introduce a re-versioned flag or
>> > enum that indicates the fixed behavior.
>> >
>> > So I would suggest to just introduce
>> > RSEQ_FLAG_REGISTER_2 = (1 << 2),
>> > that's how these things are usually done (Netlink etc.). So not
>> > introducing a fix bit or whatever but simply reversion your flag/enum.
>> > We already deal with this today.
>>
>> Because rseq is effectively a per-thread resource shared across application
>> and libraries, it is not practical to merge the notion of version with the
>> registration. Typically __rseq_abi is registered by libc, and can be used
>> by the application and by many libraries. Early adopter libraries and
>> applications (e.g. librseq, tcmalloc) can also choose to handle registration
>> if it's not already done by libc.
>
> I'm probably missing the elephant in the room but I was briefly looking
> at github.com/compudj/librseq and it seems to me that the registration
> you're talking about is:
>
> extern __thread struct rseq __rseq_abi;
> extern int __rseq_handled;

Note that __rseq_handled has now vanished, adapting to glibc's ABI. I just
updated librseq's header accordingly.

>
> and it's done in int rseq_register_current_thread(void) afaict and
> currently registration is done with flags set to 0.

Correct, however that registration will become a no-op when linked against a
glibc 2.32+, because the glibc will already have handled the registration
at thread creation.

>
> What is the problem with either adding a - I don't know -
> RSEG_FLAG_REGISTER/RSEQ_RELIABLE_CPU_FIELD flag that is also recorded in
> __rseq_abi.flags. If the kernel doesn't support the flag it will fail
> registration with EINVAL. So the registering program can detect it. If a
> caller needs to know whether another thread uses the new flag it can
> query __rseq_abi.flags. Some form of coordination must be possible in
> userspace otherwise you'll have trouble with any new feature you add. I
> general, I don't see how this is different from adding a new feature to
> rseq. It should be the same principle.

The problem with "extending" struct rseq is that it becomes complex
because it is shared between libraries and application. Let's suppose
the library doing the rseq registration does the scheme you describe:
queries the kernel for features, and stores them in the __rseq_abi.flags.
We end up with the following upgrade transition headhaches for an
application using __rseq_abi:

Kernel | glibc | librseq | __rseq_abi registration owner
----------------------------------------------------------------------
4.18 | 2.31 | no | application (reliable cpu_id = false)
4.18 | 2.31 | yes | librseq (reliable cpu_id = false)
5.8 | 2.31 | yes | librseq (reliable cpu_id = true)
5.8 | 2.32 | yes | glibc (reliable cpu_id = false)
5.8 | 2.33+ | yes | glibc (reliable cpu_id = true)

This kind of transition regressing feature-wise when upgrading a glibc
can be confusing for users.

One possibility would be to have the kernel store the "reliable cpu_id"
flag directly into a new __rseq_abi.kernel_flags (because __rseq_abi.flags
is documented as only read by the kernel). This would remove the registration
owner from the upgrade scenarios. But what would we gain by exposing this
flag within struct rseq ? The only real reason for doing so over using an
explicit system call is typically speed, and querying the kernel for a
feature does not need to be done often, so this is why I originally favored
exposing this information through a new system call flag without changing
the content of struct rseq_cs.

One additional thing to keep in mind: the application can itself choose
to define the __rseq_abi TLS, which AFAIU (please let me know if I am
wrong) would take precedence over glibc's copy. So extending the
size of struct rseq seems rather tricky because the application may
provide a smaller __rseq_abi, even if both the kernel and glibc agree
on a larger size.

>
> I also don't understand the "not practical to merge the notion of
> version with the registration". I'm not sure what that means to be
> honest. :)

The notion of "version" here would be to replace the "RELIABLE_CPU_FIELD"
flag I proposed with a steadily-increasing "fix" version instead.

For both approaches, we could either pass them as parameters with rseq
registration, and make rseq registration success conditional on the
kernel implementing those feature/fix-version, or validate the flag/version
separately from registration.

If this is done on registration, it means glibc will eventually have to
handle this. This prevents user libraries with specific needs to query
whether their features are available. Doing the feature/version validation
separately from registration allows each user library to make its own
queries and take advantage of new kernel features before glibc is
upgraded to be made aware of them.

> But just thinking about adding a new feature to rseq. Then you're in the
> same spot, I think. When you register a bumped rseq - because you added
> a new flag or whatever - you register a new version one way or the other
> since a new feature - imho - is always a version bump. In fact, you
> could think of your "reliable cpu" as a new feature not a bug. ;)

Indeed.

> Also, you seem to directly version struct rseq_cs already through the
> "version" member. So even if you are against the new flag I wouldn't
> know what would stop you from directly versioning struct rseq itself.

struct rseq needs to be shared between application and libraries, with
issues about what to do if size changes when we have an application
defining a small struct rseq (taking precedence over glibc's), and glibc
agreeing with the kernel on a larger structure. So there is little hope
in changing that layout.

The case of struct rseq_cs is simpler: it is only used as interface between
a specific library/application user and the kernel, which allows us to
version the structure and create new layouts as needed.

>
> And it's not that we don't version syscalls. We're doing it in multiple
> ways to be honest, syscalls with a flag argument that reject unknown
> flags are bumped in their version every time you add a new flag that
> they accept. We don't spell this out but this is effectively what it is.
> Think of it as a minor version bump. Extensible syscalls are versioned
> by size and when their struct grows are bumped in their (minor) version.
> In fact extensible syscalls with flags argument embedded in the struct
> can be version bumped in two ways: growing a new flag argument or
> growing a new struct member.

The issue with struct rseq extensibility is not so much ABI between kernel
and user-space, but rather ABI between userspace libraries/application users.

>
>>
>> For instance, it is acceptable for glibc to register rseq for all threads,
>> even in the presence of the cpu_id field inaccuracy, for use by the
>> sched_getcpu(3) implementation. However, users of rseq which need to
>> implement critical sections performing per-cpu data updates may want
>> to know whether the cpu_id field is reliable to ensure they do not crash
>> the process due to per-cpu data corruption.
>>
>> This led me to consider exposing a feature-specific flag which can be
>> queried by specific users to know whether rseq has specific set of correct
>> behaviors implemented.
>>
>> > (Also, as a side-note. I see that you're passing struct rseq *rseq with
>> > a length argument but you are not versioning by size. Is that
>> > intentional? That basically somewhat locks you to the current struct
>> > rseq layout and means users might run into problems when you extend
>> > struct rseq in the future as they can't pass the new struct down to
>> > older kernels. The way we deal with this is now - rseq might preceed
>> > this - is copy_struct_from_user() (for example in sched_{get,set}attr(),
>> > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to
>> > keep rseq extensible? Users can detect the new rseq version by just
>> > passing a larger struct down to the kernel with the extra bytes set to 0
>> > and if rseq doesn't complain they know they're dealing with an rseq that
>> > knows larger struct sizes. Might be worth it if you have any reason to
>> > belive that struct rseq might need to grow.)
>>
>> In the initial iterations of the rseq patch set, we initially had the rseq_len
>> argument hoping we would eventually be able to extend struct rseq. However,
>> it was finally decided against making it extensible, so the rseq ABI merged
>> into the Linux kernel with a fixed-size.
>>
>> One of the key reasons for this is explained in
>> commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct")
>>
>> The rseq system call, when invoked with flags of "0" or
>> "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
>> be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
>> specified in uapi linux/rseq.h.
>>
>> Expecting a fixed size for rseq_len is a design choice that ensures
>> multiple libraries and application defining __rseq_abi in the same
>> process agree on its exact size.
>>
>> The issue here is caused by the fact that the __rseq_abi variable is
>> shared across application/libraries for a given thread. So it's not
>> enough to agree between kernel and user-space on the extensibility
>> scheme, but we'd also have to find a way for all users within a process
>> to agree on the layout.
>
> But you're in the same boat if you add any new feature, no? In your
> model, wouldn't you need all users to agree on the feature set as well?
> Not just the struct rseq size. If that's the case then rseq would be
> unextendable (for now).

As a consequence of this, my current approach to add a "node_id" field to rseq
(in a prototype patch) is far from ideal: it defines another TLS symbol, e.g.
__rseq_abi2, with an extended layout, and registers it with new rseq flags.

I would really like to be able to extend struct rseq, but because of ABI
compatibility required between user-space libraries/applications, it seems
rather tricky to do so.

> But specifically about the size-versioning part. Well, one way to solve
> this - imho - would be to add a output size parameter to struct rseq and
> introduce a little more vetting than we have right now.
> So the kernel is what ultimately registers struct rseq iiuc. If there
> were a size output parameter the kernel could set the size of the struct
> it knows about before registering it.
> So a caller passing down a larger struct with e.g. a new field set to a
> non-zero value would get an error from the kernel and the size of the
> supported struct. The caller can then adjust and simply zero out the
> unsupported field and retry. Other callers in userspace can query the
> size and find out what size of struct is registered. If it's larger than
> what they know about they can infer they are on a newer kernel with new
> features but they can simply ignore the unknown fields. If it's smaller
> than they know what fields to ignore.

How would that work in the case of an application defining its own copy of
struct rseq __rseq_abi TLS with sizeof(struct rseq) == 32, and then upgrading
its glibc to a new glibc which implements a larger structure, which agrees with
the kernel on that larger layout ?

>
> I hope I'm not derailing this discussion. I'm just trying to show that
> there's some hope.
> In short, imho, I think adding a flag indicating the new "reliable cpu"
> feature is probably the best way to go. Think of it not as the kernel
> indicating the absence of a bug but as the presence of a feature. :)

Indeed, the "reliable cpu_id" feature is simpler, because it does not require
any layout change nor size extension whatsoever. But I think it's good that
we discuss those extensibility challenges right away, because there are a
lot of moving pieces involved on the user-space side.

Thanks,

Mathieu

>
> Thanks!
> Christian

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-07-11 15:58:15

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

On Thu, Jul 09, 2020 at 11:15:57AM -0400, Mathieu Desnoyers wrote:
> ----- On Jul 9, 2020, at 8:49 AM, Christian Brauner [email protected] wrote:
>
> > On Wed, Jul 08, 2020 at 01:34:48PM -0400, Mathieu Desnoyers wrote:
> >> ----- On Jul 8, 2020, at 12:22 PM, Christian Brauner
> >> [email protected] wrote:
> >> [...]
> >> > I've been following this a little bit. The kernel version itself doesn't
> >> > really mean anything and the kernel version is imho not at all
> >> > interesting to userspace applications. Especially for cross-distro
> >> > programs. We can't go around and ask Red Hat, SUSE, Ubuntu, Archlinux,
> >> > openSUSE and god knows who what other distro what their fixed kernel
> >> > version is. That's not feasible at all and not how must programs do it.
> >> > Sure, a lot of programs name a minimal kernel version they require but
> >> > realistically we can't keep bumping it all the time. So the best
> >> > strategy for userspace imho has been to introduce a re-versioned flag or
> >> > enum that indicates the fixed behavior.
> >> >
> >> > So I would suggest to just introduce
> >> > RSEQ_FLAG_REGISTER_2 = (1 << 2),
> >> > that's how these things are usually done (Netlink etc.). So not
> >> > introducing a fix bit or whatever but simply reversion your flag/enum.
> >> > We already deal with this today.
> >>
> >> Because rseq is effectively a per-thread resource shared across application
> >> and libraries, it is not practical to merge the notion of version with the
> >> registration. Typically __rseq_abi is registered by libc, and can be used
> >> by the application and by many libraries. Early adopter libraries and
> >> applications (e.g. librseq, tcmalloc) can also choose to handle registration
> >> if it's not already done by libc.
> >
> > I'm probably missing the elephant in the room but I was briefly looking
> > at github.com/compudj/librseq and it seems to me that the registration
> > you're talking about is:
> >
> > extern __thread struct rseq __rseq_abi;
> > extern int __rseq_handled;
>
> Note that __rseq_handled has now vanished, adapting to glibc's ABI. I just
> updated librseq's header accordingly.
>
> >
> > and it's done in int rseq_register_current_thread(void) afaict and
> > currently registration is done with flags set to 0.
>
> Correct, however that registration will become a no-op when linked against a
> glibc 2.32+, because the glibc will already have handled the registration
> at thread creation.

The registration is a thread-group property I'll assume, right, i.e. all
threads will have rseq TLS or no thread will have it? Some things I seem
to be able to assume (correct me if I'm wrong):
- rseq registration is expected to be done at thread creation time
- rseq registration _should_ only be done once, i.e. if a caller detects
that rseq is already registered for a thread, then they could
technically re-register rseq - risking a feature mismatch if doing so
- but they shouldn't re-register but simply assume that someone else
is in control of rseq. If they violate that assumption than you're
hosed anyway.
So it seems as long as callers leave rseq registration alone whenever
they detect that it is already registered then one can assume that rseq
is under consistent control of a single library/program. If that's the
case it should safe to assume that the library will use the same rseq
registration for all threads bounded by the available kernel features or
by the set of features it is aware of.

>
> >
> > What is the problem with either adding a - I don't know -
> > RSEG_FLAG_REGISTER/RSEQ_RELIABLE_CPU_FIELD flag that is also recorded in
> > __rseq_abi.flags. If the kernel doesn't support the flag it will fail
> > registration with EINVAL. So the registering program can detect it. If a
> > caller needs to know whether another thread uses the new flag it can
> > query __rseq_abi.flags. Some form of coordination must be possible in
> > userspace otherwise you'll have trouble with any new feature you add. I
> > general, I don't see how this is different from adding a new feature to
> > rseq. It should be the same principle.
>
> The problem with "extending" struct rseq is that it becomes complex
> because it is shared between libraries and application. Let's suppose
> the library doing the rseq registration does the scheme you describe:
> queries the kernel for features, and stores them in the __rseq_abi.flags.
> We end up with the following upgrade transition headhaches for an
> application using __rseq_abi:
>
> Kernel | glibc | librseq | __rseq_abi registration owner
> ----------------------------------------------------------------------
> 4.18 | 2.31 | no | application (reliable cpu_id = false)
> 4.18 | 2.31 | yes | librseq (reliable cpu_id = false)
> 5.8 | 2.31 | yes | librseq (reliable cpu_id = true)
> 5.8 | 2.32 | yes | glibc (reliable cpu_id = false)
> 5.8 | 2.33+ | yes | glibc (reliable cpu_id = true)
>
> This kind of transition regressing feature-wise when upgrading a glibc
> can be confusing for users.

Sure, but that's seems like a problem that is not specific to rseq.
That's a problem that any interface has which introduces new features or
fixes bugs.

>
> One possibility would be to have the kernel store the "reliable cpu_id"
> flag directly into a new __rseq_abi.kernel_flags (because __rseq_abi.flags
> is documented as only read by the kernel). This would remove the registration

Ah, there we go. That's a missing piece of information I didn't have.
Thank you.

> owner from the upgrade scenarios. But what would we gain by exposing this
> flag within struct rseq ? The only real reason for doing so over using an

I proposed that specific scheme because I was under the impression that
you are in need of a mechanism for a caller (at thread creation I
assume) to check what feature set is supported by rseq _without_
issung a system call. If you were to record the requested flags in
struct rseq or in some other way, then another library which tries to
register rseq for a thread but detects it has already been registered
can look at e.g. whether the reliable cpu feature is around and then
adjust it's behavior accordingly.
Another reason why this seems worthwhile is because of how rseq works in
general. Since it registers a piece of userspace memory which userspace
can trivially access enforcing that userspace issue a syscall to get at
the feature list seems odd when you can just record it in the struct.
But that's a matter of style, I guess.

Also, I'm thinking about the case of adding one or two new features that
introduce mutually exclusive behavior for rseq, i.e. if you register
rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and
FEAT2 would lead to incompatible behavior for an aspect or all of rseq.
Even if you had a way to query the kernel for FEAT1 and FEAT2 in the
rseq syscall it still be a problem since a caller wouldn't know at rseq
registration time whether the library registered rseq with FEAT1 or
FEAT2. If you record the behavior somewhere - kernel_flags or whatever -
then the caller could check at registration time "ah, rseq is registered
with this behavior" I need to adjust my behavior.

> explicit system call is typically speed, and querying the kernel for a
> feature does not need to be done often, so this is why I originally favored
> exposing this information through a new system call flag without changing
> the content of struct rseq_cs.

Sure, there are multiple ways of doing this. What you're proposing is
one. Though that is not as simple as it seems, I fear. :)
If you add a new flag to the system call that effectively only exists to
query the supported features then you need to decide whether checking
for a feature means the system call just returns and all it does is to
give you back the features it supports, i.e. it doesn't do any real work
or if it actually does real work.
That becomes a little more tricky actually. If the system call does the
feature checking and registration in one shot you need to decide whether
it should fail registration if a feature is not supported or whether it
should just move one. If the latter then you likely want a way for the
kernel to report back on the features it supports. That's not a big deal
if it's only a single feature but it becomes more tricky if you
introduce multiple features because then you don't know what feature
the kernel doesn't support and you likely want to know what feature made
the system call fail, i.e. what feature is missing. :)
Thinking about this a little bit I would think what makes the most sense
for such a scheme is the following:
- introduce a flag CHECK_FEATURES flag wich by default makes the system
call a noop effectively and just reports back (somewhere in struct
rseq) the features it has and e.g. masking out any features it doesn't
know about.
- introduce another flag (again, dummy names here)
IGNORE_UNKNOWN_FEATURES which makes the system call succeed making use
of all the features requested that it supports, masking out any
features it doesn't know about.
This way a library/caller can decide what behavior it wants to have.
That's advanced behavior though and maybe that "just introduce a flag
and make it fail if not supported by the kernel" is enough for your
use-case now and the other stuff can be added when more features happen.

>
> One additional thing to keep in mind: the application can itself choose
> to define the __rseq_abi TLS, which AFAIU (please let me know if I am
> wrong) would take precedence over glibc's copy. So extending the

You mean either that an application could simply choose to ignore that e.g.
glibc has already registered rseq and e.g. unregister it and register
it's own or that it registers it's own rseq before glibc comes into the
play?
I mean, if I interpreted what you're trying to say correctly, I think
one needs to work under the assumption that if someone else has already
registered rseq than it becomes the single source of truth. I think that
basically needs to be the contract. Same way you expect a user of
pthreads to not suddenly go and call clone3() with CLONE_THREAD |
CLONE_VM | [...] and assume that this won't mess with glibc's internal
state.

> size of struct rseq seems rather tricky because the application may
> provide a smaller __rseq_abi, even if both the kernel and glibc agree
> on a larger size.
>
> >
> > I also don't understand the "not practical to merge the notion of
> > version with the registration". I'm not sure what that means to be
> > honest. :)
>
> The notion of "version" here would be to replace the "RELIABLE_CPU_FIELD"
> flag I proposed with a steadily-increasing "fix" version instead.

Sure.

>
> For both approaches, we could either pass them as parameters with rseq
> registration, and make rseq registration success conditional on the
> kernel implementing those feature/fix-version, or validate the flag/version
> separately from registration.
>
> If this is done on registration, it means glibc will eventually have to
> handle this. This prevents user libraries with specific needs to query
> whether their features are available. Doing the feature/version validation
> separately from registration allows each user library to make its own
> queries and take advantage of new kernel features before glibc is
> upgraded to be made aware of them.

Why isn't there a "dual scheme"? I.e. you record the features somewhere
in struct rseq or some other place so userspace can query an already
registered thread for the features it was registered with and have a way
to query the kernel for the supported features through the system call
(See what I suggested above with the feature checking flags.).

>
> > But just thinking about adding a new feature to rseq. Then you're in the
> > same spot, I think. When you register a bumped rseq - because you added
> > a new flag or whatever - you register a new version one way or the other
> > since a new feature - imho - is always a version bump. In fact, you
> > could think of your "reliable cpu" as a new feature not a bug. ;)
>
> Indeed.
>
> > Also, you seem to directly version struct rseq_cs already through the
> > "version" member. So even if you are against the new flag I wouldn't
> > know what would stop you from directly versioning struct rseq itself.
>
> struct rseq needs to be shared between application and libraries, with
> issues about what to do if size changes when we have an application
> defining a small struct rseq (taking precedence over glibc's), and glibc
> agreeing with the kernel on a larger structure. So there is little hope
> in changing that layout.

I really think this is not an rseq specific problem. This seems to be a
generic problem any interface has when it e.g. makes use of an extended
struct and e.g. decides to make its own syscalls without going through
the glibc wrappers (If there are any...). That's the reality right now
and will likely always be that way short of us blocking non-libc
syscalls similar to some bsds at which point we need a 1:1 kernel + libc
development. :) That's not going to happen. The problem ranges from
struct statx to struct clone_args and struct open_how and so on.

But one question. Musn't the assumption be that all threads in a
thread-group if they have registered rseq then the same
application/library has done that registration? And if that's the case
then the application/library doing the registration is what defines the
layout for that thread-group and becomes the one source of truth.
Basically, if an application uses it's own rseq then glibc must be out
of the picture. If that's not part of the contract then it feels to me
that rseq cannot be extended (for now).

>
> The case of struct rseq_cs is simpler: it is only used as interface between
> a specific library/application user and the kernel, which allows us to
> version the structure and create new layouts as needed.
>
> >
> > And it's not that we don't version syscalls. We're doing it in multiple
> > ways to be honest, syscalls with a flag argument that reject unknown
> > flags are bumped in their version every time you add a new flag that
> > they accept. We don't spell this out but this is effectively what it is.
> > Think of it as a minor version bump. Extensible syscalls are versioned
> > by size and when their struct grows are bumped in their (minor) version.
> > In fact extensible syscalls with flags argument embedded in the struct
> > can be version bumped in two ways: growing a new flag argument or
> > growing a new struct member.
>
> The issue with struct rseq extensibility is not so much ABI between kernel
> and user-space, but rather ABI between userspace libraries/application users.
>
> >
> >>
> >> For instance, it is acceptable for glibc to register rseq for all threads,
> >> even in the presence of the cpu_id field inaccuracy, for use by the
> >> sched_getcpu(3) implementation. However, users of rseq which need to
> >> implement critical sections performing per-cpu data updates may want
> >> to know whether the cpu_id field is reliable to ensure they do not crash
> >> the process due to per-cpu data corruption.
> >>
> >> This led me to consider exposing a feature-specific flag which can be
> >> queried by specific users to know whether rseq has specific set of correct
> >> behaviors implemented.
> >>
> >> > (Also, as a side-note. I see that you're passing struct rseq *rseq with
> >> > a length argument but you are not versioning by size. Is that
> >> > intentional? That basically somewhat locks you to the current struct
> >> > rseq layout and means users might run into problems when you extend
> >> > struct rseq in the future as they can't pass the new struct down to
> >> > older kernels. The way we deal with this is now - rseq might preceed
> >> > this - is copy_struct_from_user() (for example in sched_{get,set}attr(),
> >> > openat2(), bpf(), clone3(), etc.). Maybe you want to switch to that to
> >> > keep rseq extensible? Users can detect the new rseq version by just
> >> > passing a larger struct down to the kernel with the extra bytes set to 0
> >> > and if rseq doesn't complain they know they're dealing with an rseq that
> >> > knows larger struct sizes. Might be worth it if you have any reason to
> >> > belive that struct rseq might need to grow.)
> >>
> >> In the initial iterations of the rseq patch set, we initially had the rseq_len
> >> argument hoping we would eventually be able to extend struct rseq. However,
> >> it was finally decided against making it extensible, so the rseq ABI merged
> >> into the Linux kernel with a fixed-size.
> >>
> >> One of the key reasons for this is explained in
> >> commit 83b0b15bcb0f ("rseq: Remove superfluous rseq_len from task_struct")
> >>
> >> The rseq system call, when invoked with flags of "0" or
> >> "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
> >> be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
> >> specified in uapi linux/rseq.h.
> >>
> >> Expecting a fixed size for rseq_len is a design choice that ensures
> >> multiple libraries and application defining __rseq_abi in the same
> >> process agree on its exact size.
> >>
> >> The issue here is caused by the fact that the __rseq_abi variable is
> >> shared across application/libraries for a given thread. So it's not
> >> enough to agree between kernel and user-space on the extensibility
> >> scheme, but we'd also have to find a way for all users within a process
> >> to agree on the layout.
> >
> > But you're in the same boat if you add any new feature, no? In your
> > model, wouldn't you need all users to agree on the feature set as well?
> > Not just the struct rseq size. If that's the case then rseq would be
> > unextendable (for now).
>
> As a consequence of this, my current approach to add a "node_id" field to rseq
> (in a prototype patch) is far from ideal: it defines another TLS symbol, e.g.
> __rseq_abi2, with an extended layout, and registers it with new rseq flags.
>
> I would really like to be able to extend struct rseq, but because of ABI
> compatibility required between user-space libraries/applications, it seems
> rather tricky to do so.
>
> > But specifically about the size-versioning part. Well, one way to solve
> > this - imho - would be to add a output size parameter to struct rseq and
> > introduce a little more vetting than we have right now.
> > So the kernel is what ultimately registers struct rseq iiuc. If there
> > were a size output parameter the kernel could set the size of the struct
> > it knows about before registering it.
> > So a caller passing down a larger struct with e.g. a new field set to a
> > non-zero value would get an error from the kernel and the size of the
> > supported struct. The caller can then adjust and simply zero out the
> > unsupported field and retry. Other callers in userspace can query the
> > size and find out what size of struct is registered. If it's larger than
> > what they know about they can infer they are on a newer kernel with new
> > features but they can simply ignore the unknown fields. If it's smaller
> > than they know what fields to ignore.
>
> How would that work in the case of an application defining its own copy of
> struct rseq __rseq_abi TLS with sizeof(struct rseq) == 32, and then upgrading
> its glibc to a new glibc which implements a larger structure, which agrees with
> the kernel on that larger layout ?

Wouldn't the non-updated application just access fields and methods of
the smaller struct? The way struct extensions work is that we only
extend them after the last member and always correctly 64 bit aligned.
And as long as you only extend the struct at the end wouldn't that be
ok? An application with a non-updated/smaller struct rseq would just
access fields that the larger struct supports, no?

Thanks!
Christian

2020-07-13 18:43:05

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH for 5.8 3/4] rseq: Introduce RSEQ_FLAG_RELIABLE_CPU_ID

----- On Jul 11, 2020, at 11:54 AM, Christian Brauner [email protected] wrote:

>
> The registration is a thread-group property I'll assume, right, i.e. all
> threads will have rseq TLS or no thread will have it?

No, rseq registration is a per-thread property, but it would arguably be
a bit weird for a thread-group to observe different registration states
for different threads.

> Some things I seem to be able to assume (correct me if I'm wrong):
> - rseq registration is expected to be done at thread creation time

True.

> - rseq registration _should_ only be done once, i.e. if a caller detects
> that rseq is already registered for a thread, then they could
> technically re-register rseq - risking a feature mismatch if doing so
> - but they shouldn't re-register but simply assume that someone else
> is in control of rseq. If they violate that assumption than you're
> hosed anyway.

Right.

> So it seems as long as callers leave rseq registration alone whenever
> they detect that it is already registered then one can assume that rseq
> is under consistent control of a single library/program. If that's the
> case it should safe to assume that the library will use the same rseq
> registration for all threads bounded by the available kernel features or
> by the set of features it is aware of.

The rseq registration is per-thread. But yes, as soon as one user registers
rseq, other users for that thread are expected to piggy-back on that
registration.

> I proposed that specific scheme because I was under the impression that
> you are in need of a mechanism for a caller (at thread creation I
> assume) to check what feature set is supported by rseq _without_
> issung a system call. If you were to record the requested flags in
> struct rseq or in some other way, then another library which tries to
> register rseq for a thread but detects it has already been registered
> can look at e.g. whether the reliable cpu feature is around and then
> adjust it's behavior accordingly.
> Another reason why this seems worthwhile is because of how rseq works in
> general. Since it registers a piece of userspace memory which userspace
> can trivially access enforcing that userspace issue a syscall to get at
> the feature list seems odd when you can just record it in the struct.
> But that's a matter of style, I guess.

Good points.

>
> Also, I'm thinking about the case of adding one or two new features that
> introduce mutually exclusive behavior for rseq, i.e. if you register
> rseq with FEAT1 and someone else registers it with FEAT2 and FEAT1 and
> FEAT2 would lead to incompatible behavior for an aspect or all of rseq.
> Even if you had a way to query the kernel for FEAT1 and FEAT2 in the
> rseq syscall it still be a problem since a caller wouldn't know at rseq
> registration time whether the library registered rseq with FEAT1 or
> FEAT2. If you record the behavior somewhere - kernel_flags or whatever -
> then the caller could check at registration time "ah, rseq is registered
> with this behavior" I need to adjust my behavior.

I think what we want here is to be able to extend the feature set, but not
"pick and choose" different incompatible features.

[...]
>>
>> One additional thing to keep in mind: the application can itself choose
>> to define the __rseq_abi TLS, which AFAIU (please let me know if I am
>> wrong) would take precedence over glibc's copy. So extending the
>
> You mean either that an application could simply choose to ignore that e.g.
> glibc has already registered rseq and e.g. unregister it and register
> it's own or that it registers it's own rseq before glibc comes into the
> play?

No quite.

I mean that an application binary or a preloaded library is allowed to
interpose with glibc and expose a __rseq_abi symbol with a size smaller
than glibc's __rseq_abi. The issue is that glibc (or other library
responsible for rseq registration) is unaware of that symbol's size.

This means that extending __rseq_abi cannot be done by means of additional
flags or parameters passed directly from the registration owner to the
rseq system call.

> I mean, if I interpreted what you're trying to say correctly, I think
> one needs to work under the assumption that if someone else has already
> registered rseq than it becomes the single source of truth. I think that
> basically needs to be the contract. Same way you expect a user of
> pthreads to not suddenly go and call clone3() with CLONE_THREAD |
> CLONE_VM | [...] and assume that this won't mess with glibc's internal
> state.

Right. The issue is not about which library owns the registration, but
rather making sure everyone agree on the size of __rseq_abi, including
interposition scenarios.

[...]
>>
>> For both approaches, we could either pass them as parameters with rseq
>> registration, and make rseq registration success conditional on the
>> kernel implementing those feature/fix-version, or validate the flag/version
>> separately from registration.
>>
>> If this is done on registration, it means glibc will eventually have to
>> handle this. This prevents user libraries with specific needs to query
>> whether their features are available. Doing the feature/version validation
>> separately from registration allows each user library to make its own
>> queries and take advantage of new kernel features before glibc is
>> upgraded to be made aware of them.
>
> Why isn't there a "dual scheme"? I.e. you record the features somewhere
> in struct rseq or some other place so userspace can query an already
> registered thread for the features it was registered with and have a way
> to query the kernel for the supported features through the system call
> (See what I suggested above with the feature checking flags.).

This discussion got my head into gears over the weekend, and I think
I came up with something that could elegantly solve all the "rseq extensibility"
problem. More below.

[...]

> I really think this is not an rseq specific problem. This seems to be a
> generic problem any interface has when it e.g. makes use of an extended
> struct and e.g. decides to make its own syscalls without going through
> the glibc wrappers (If there are any...). That's the reality right now
> and will likely always be that way short of us blocking non-libc
> syscalls similar to some bsds at which point we need a 1:1 kernel + libc
> development. :) That's not going to happen. The problem ranges from
> struct statx to struct clone_args and struct open_how and so on.

AFAIU the only "special" thing about rseq is that its __rseq_abi is
a TLS symbol shared between application/libraries, and interposition is
allowed.

>
> But one question. Musn't the assumption be that all threads in a
> thread-group if they have registered rseq then the same
> application/library has done that registration?

No, __rseq_abi is a TLS, and the registration is per-thread.

> And if that's the case
> then the application/library doing the registration is what defines the
> layout for that thread-group and becomes the one source of truth.
> Basically, if an application uses it's own rseq then glibc must be out
> of the picture. If that's not part of the contract then it feels to me
> that rseq cannot be extended (for now).

Indeed, the new scheme I have in mind for rseq extensibility would allow
new features to be used between new application/library and kernel even
with an older glibc which would contain the rseq registration code, but
be unaware of those new features.

[...]
>
> Wouldn't the non-updated application just access fields and methods of
> the smaller struct? The way struct extensions work is that we only
> extend them after the last member and always correctly 64 bit aligned.
> And as long as you only extend the struct at the end wouldn't that be
> ok? An application with a non-updated/smaller struct rseq would just
> access fields that the larger struct supports, no?

The issue is symbol interposition, as discussed above.

So here is the idea which emerged through the weekend as I was thinking
about your email:

* Current technical constraints
- struct rseq __rseq_abi can be interposed by preloaded libraries and
application, without knowledge from the registration "owner" (typically
glibc).

* Objectives:
- Allow extending the size of struct rseq to add additional features,
such as node_id field, signal-disabling fields, and so on.
- Allow extending this size without requiring an upgrade of the library
performing rseq registration. Simply upgrading the rseq "user" application
or library and the kernel should suffice.

* Proposed ABI
- Reserve a bit in the field (struct rseq *)->flags of the TLS __rseq_abi,
named e.g.: RSEQ_CS_FLAG_SIZE = (1U << 3).
- A definition wishing to extend struct rseq would be required to initialize
__rseq_abi with this bit set in the flags field.
- When RSEQ_CS_FLAG_SIZE is set, struct rseq is guaranteed to have two new
fields after the flags field: a __u32 user_size and a __u32 kernel_size field.
- The user_size field is meant to be initialized to sizeof(struct rseq) by the
__rseq_abi definition. In an interposition scenario, a kernel supporting this
size feature will know about the size of the symbol by checking both the
RSEQ_CS_FLAG_SIZE flag and the user_size field.
- On registration, if the __rseq_abi.flags RSEQ_CS_FLAG_SIZE flag is set (and this
flag is supported by the kernel), the kernel updates the kernel_size field to
min(sizeof(struct rseq), __rseq_abi.user_size), effectively the subset of size
supported by both the user-space __rseq_abi definition and by the kernel.
- Users wishing to use additional fields beyond __rseq_abi.flags would need to check
that __rseq_abi->flags & RSEQ_CS_FLAG_SIZE is true, and that
__rseq_abi.kernel_size >= offsetof(struct rseq, feature_field) + sizeof(__rseq_abi.feature_field)
This would ensure the fields are only used if the symbol is large enough to
hold them *and* the kernel supports them.

With this kind of scheme, we could then easily extend struct rseq to cover additional
use-cases such as:

- adding a new "node_id" field to speed up getcpu(3), user-space locking on NUMA,
and possibly memory allocators,
- adding fields allowing to quickly disable/enable signals,
- adding a "__u64 features" field, which would contain for instance
RSEQ_FEATURE_RELIABLE_CPU_ID.

I'm not sure why I did not think of this earlier, but it all seems to fit nicely.

Any comments ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com