LinuxLists.cc - Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

2022-02-02 09:38:46

Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

* Mathieu Desnoyers:

> ----- On Feb 1, 2022, at 3:32 PM, Florian Weimer [email protected] wrote:
> [...]
>>
>>>> Is the switch really useful? I suspect it's faster to just write as
>>>> much as possible all the time. The switch should be well-predictable
>>>> if running uniform userspace, but still …
>>>
>>> The switch ensures the kernel don't try to write to a memory area beyond
>>> the rseq size which has been registered by user-space. So it seems to be
>>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>>> point ?
>>
>> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
>> now?
>
> Yes, but I would expect the rseq registration arguments to have a rseq_len
> of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
> feature to be supported (but not the following features).

But if rseq is managed by libc, it really has to use the full size
unconditionally. I would even expect that eventually, the kernel only
supports the initial 32, maybe 64 for a few early extension, and the
size indicated by the auxiliary vector.

Not all of that area would be ABI, some of it would be used by the
vDSO only and opaque to userspace application (with applications/libcs
passing __rseq_offset as an argument to these functions).

>> I'd appreciate if you could put the maximm supported size and possibly
>> the alignment in the auxiliary vector, so that we don't have to rseq
>> system calls in a loop on process startup.
>
> Yes, it's a good idea. I'm not too familiar with the auxiliary vector.
> Are we talking about the kernel's
>
> fs/binfmt_elf.c:fill_auxv_note()
>
> ?

Indeed.

2022-02-02 16:10:45

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

----- On Feb 1, 2022, at 4:30 PM, Florian Weimer [email protected] wrote:

> * Mathieu Desnoyers:
>
>> ----- On Feb 1, 2022, at 3:32 PM, Florian Weimer [email protected] wrote:
>> [...]
>>>
>>>>> Is the switch really useful? I suspect it's faster to just write as
>>>>> much as possible all the time. The switch should be well-predictable
>>>>> if running uniform userspace, but still …
>>>>
>>>> The switch ensures the kernel don't try to write to a memory area beyond
>>>> the rseq size which has been registered by user-space. So it seems to be
>>>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>>>> point ?
>>>
>>> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
>>> now?
>>
>> Yes, but I would expect the rseq registration arguments to have a rseq_len
>> of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
>> feature to be supported (but not the following features).
>
> But if rseq is managed by libc, it really has to use the full size
> unconditionally. I would even expect that eventually, the kernel only
> supports the initial 32, maybe 64 for a few early extension, and the
> size indicated by the auxiliary vector.
>
> Not all of that area would be ABI, some of it would be used by the
> vDSO only and opaque to userspace application (with applications/libcs
> passing __rseq_offset as an argument to these functions).
>

I think one aspect leading to our misunderstanding here is the distinction
between the size of the rseq area _allocation_, and the offset after the last
field supported by the given kernel.

With this in mind, let's state a bit more clearly our expected aux. vector
extensibility scheme.

With CONFIG_RSEQ=y, the kernel would pass the following information through
the ELF auxv:

- rseq allocation size (auxv_rseq_alloc_size),
- rseq allocation alignment (auxv_rseq_alloc_align),
- offset after the end of the last rseq field supported by this kernel (auxv_rseq_offset_end),

We always have auxv_rseq_alloc_size >= auxv_rseq_offset_end.

I would expect libc to use this information to allocate a memory area
at least auxv_rseq_alloc_size in size, with an alignment respecting
auxv_rseq_alloc_align. It would use a value >= auvx_rseq_alloc_size
as rseq_len argument for the rseq registration.

But I would expect libc to use the auxv_rseq_offset_end value to populate __rseq_size,
so rseq users can rely on this to check whether the fields they are trying to access
is indeed populated by the kernel.

Of course, the kernel would still allow the original 32-byte rseq_len argument
for the rseq registration, so the original ABI still works. It would however
reject any rseq registration with size smaller than auxv_rseq_alloc_size (other
than the 32-byte special-case).

Is that in line with what you have in mind ? Do we really need to expose those 3
auxv variables independently or can we somehow remove auxv_rseq_alloc_size and
use auxv_rseq_offset_end as a min value for allocation instead ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com