2022-02-02 10:26:28

by Mathieu Desnoyers

Subject: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

If a thread group has fewer threads than cores, or is limited to run on
few cores concurrently through sched affinity or cgroup cpusets, the
virtual cpu ids will be values close to 0, thus allowing efficient use
of user-space memory for per-cpu data structures.
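
For illustration, a user-space consumer of this field might look roughly
like the sketch below (hypothetical names, not part of this series),
assuming the thread registered its struct rseq with an rseq_len covering
tg_vcpu_id:

#include <linux/rseq.h>

#define MAX_VCPUS	512	/* e.g. sized from sysconf(_SC_NPROCESSORS_CONF) */

static long per_vcpu_counter[MAX_VCPUS];

/* "rseq_area" is the struct rseq this thread registered with rseq(2). */
static inline void count_event(volatile struct rseq *rseq_area)
{
	__u32 vcpu = rseq_area->tg_vcpu_id;	/* single-copy atomic read */

	/*
	 * Illustration only: a real implementation would perform the
	 * update inside an rseq critical section so it stays consistent
	 * if the thread migrates to another vcpu id.
	 */
	per_vcpu_counter[vcpu]++;
}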

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
 include/uapi/linux/rseq.h | 15 +++++++++++++++
 kernel/rseq.c             | 16 +++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 386c25b5bbdb..d687ac79e62c 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -154,6 +154,21 @@ struct rseq {
* rseq_len. Use the offset immediately after the node_id field as
* rseq_len.
*/
+
+ /*
+ * Restartable sequences tg_vcpu_id field. Updated by the kernel. Read by
+ * user-space with single-copy atomicity semantics. This field should
+ * only be read by the thread which registered this data structure.
+ * Aligned on 32-bit. Contains the current thread's virtual CPU ID
+ * (allocated uniquely within thread group).
+ */
+ __u32 tg_vcpu_id;
+
+ /*
+ * This is a valid end of rseq ABI for the purpose of rseq registration
+ * rseq_len. Use the offset immediately after the tg_vcpu_id field as
+ * rseq_len.
+ */
} __attribute__((aligned(4 * sizeof(__u64))));

#endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 13f6d0419f31..37b43735a400 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -86,10 +86,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
struct rseq __user *rseq = t->rseq;
u32 cpu_id = raw_smp_processor_id();
u32 node_id = cpu_to_node(cpu_id);
+ u32 tg_vcpu_id = task_tg_vcpu_id(t);

if (!user_write_access_begin(rseq, t->rseq_len))
goto efault;
switch (t->rseq_len) {
+ case offsetofend(struct rseq, tg_vcpu_id):
+ unsafe_put_user(tg_vcpu_id, &rseq->tg_vcpu_id, efault_end);
+ fallthrough;
case offsetofend(struct rseq, node_id):
unsafe_put_user(node_id, &rseq->node_id, efault_end);
fallthrough;
@@ -112,9 +116,17 @@ static int rseq_update_cpu_node_id(struct task_struct *t)

static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
{
- u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;
+ u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
+ tg_vcpu_id = 0;

switch (t->rseq_len) {
+ case offsetofend(struct rseq, tg_vcpu_id):
+ /*
+ * Reset tg_vcpu_id to its initial state (0).
+ */
+ if (put_user(tg_vcpu_id, &t->rseq->tg_vcpu_id))
+ return -EFAULT;
+ fallthrough;
case offsetofend(struct rseq, node_id):
/*
* Reset node_id to its initial state (0).
@@ -396,6 +408,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)))
return -EINVAL;
switch (rseq_len) {
+ case offsetofend(struct rseq, tg_vcpu_id):
+ fallthrough;
case offsetofend(struct rseq, node_id):
fallthrough;
case offsetofend(struct rseq, padding1):
--
2.17.1


2022-02-02 11:59:22

by Florian Weimer

Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

* Mathieu Desnoyers:

> ----- On Feb 1, 2022, at 3:03 PM, Florian Weimer [email protected] wrote:
>
>> * Mathieu Desnoyers:
>>
>>> If a thread group has fewer threads than cores, or is limited to run on
>>> few cores concurrently through sched affinity or cgroup cpusets, the
>>> virtual cpu ids will be values close to 0, thus allowing efficient use
>>> of user-space memory for per-cpu data structures.
>>
>> From a userspace programmer perspective, what's a good way to obtain a
>> reasonable upper bound for the possible tg_vcpu_id values?
>
> Some effective upper bounds:
>
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,
> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
> except in corner-case situations such as cpu hotplug removing all cpus from
> the affinity set,
> - cgroup cpuset "partition" limits,

Affinity masks and _SC_NPROCESSORS_CONF can be off by more than an
order of magnitude compared to the cgroup cpuset, I think, and such
configurations aren't even that atypical.

The number of concurrent threads sounds more tractable, but I'm
worried about things creating threads behind libc's back (perhaps
io_uring?). So it couldn't be a hard upper bound.

I'm worried about querying anything cgroup-related because these APIs
have a reputation for being slow, convoluted, and unstable
(effectively not subject to the “don't break userspace” rule).
Hopefully I'm wrong about that.

>> I believe not all users of cgroup cpusets change the affinity mask.
>
> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
> Those are two mechanisms both affecting the scheduler task placement.

There are container hosts out there that synthesize an affinity mask
that matches the CPU allocation, assuming that anyone who calls
sched_getaffinity only does so for counting the number of set bits.
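
The counting idiom I have in mind is roughly this (just a sketch, not
lifted from any particular runtime):

#define _GNU_SOURCE
#include <sched.h>

/*
 * Count the CPUs in the current affinity mask, the way many runtimes
 * size their thread pools.  Returns 0 on error.
 */
static int affinity_cpu_count(void)
{
	cpu_set_t set;

	if (sched_getaffinity(0, sizeof(set), &set) != 0)
		return 0;
	return CPU_COUNT(&set);
}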

> I would expect the user-space code to use some sensible upper bound as a
> hint about how many per-vcpu data structure elements to expect (and how many
> to pre-allocate), but have a "lazy initialization" fall-back in case the
> vcpu id goes up to the number of configured processors - 1. And I suspect
> that even the number of configured processors may change with CRIU.

Sounds reasonable.

>> Is the switch really useful? I suspect it's faster to just write as
>> much as possible all the time. The switch should be well-predictable
>> if running uniform userspace, but still …
>
> The switch ensures the kernel doesn't try to write to a memory area beyond
> the rseq size which has been registered by user-space. So it seems to be
> useful to ensure we don't corrupt user-space memory. Or am I missing your
> point?

Due to the alignment, I think you'd only ever see 32 and 64 bytes for
now?

I'd appreciate it if you could put the maximum supported size and possibly
the alignment in the auxiliary vector, so that we don't have to issue rseq
system calls in a loop on process startup.
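
For illustration, the startup query I have in mind would look something
like this, with made-up AT_* names and values (no such auxv entries exist
today):

#include <sys/auxv.h>

/* Hypothetical auxv entries exposing the supported rseq size/alignment. */
#define AT_RSEQ_FEATURE_SIZE	27
#define AT_RSEQ_ALIGN		28

static void rseq_query_abi(unsigned long *feature_size, unsigned long *align)
{
	*feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);	/* 0 if absent */
	*align = getauxval(AT_RSEQ_ALIGN);
	if (*feature_size == 0) {
		/* Fall back to the original 32-byte rseq ABI. */
		*feature_size = 32;
		*align = 32;
	}
}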

2022-02-02 17:42:44

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

----- On Feb 1, 2022, at 3:32 PM, Florian Weimer [email protected] wrote:
[...]
>
>>> Is the switch really useful? I suspect it's faster to just write as
>>> much as possible all the time. The switch should be well-predictable
>>> if running uniform userspace, but still …
>>
>> The switch ensures the kernel doesn't try to write to a memory area beyond
>> the rseq size which has been registered by user-space. So it seems to be
>> useful to ensure we don't corrupt user-space memory. Or am I missing your
>> point?
>
> Due to the alignment, I think you'd only ever see 32 and 64 bytes for
> now?

Yes, but I would expect the rseq registration arguments to have a rseq_len
of offsetofend(struct rseq, tg_vcpu_id) when userspace wants the tg_vcpu_id
feature to be supported (but not the following features).

Then, as we append additional features as follow-up fields, glibc would
eventually request them by increasing the requested size.

Then it's kind of weird to receive a registration size which is not
aligned on a 32-byte boundary, but then use internal knowledge of the
structure alignment in the kernel code to write beyond the requested size.
And all of this happens when returning to user-space after a preemption,
so I don't expect this extra switch/case to cause significant overhead.
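
Concretely, with updated uapi headers, I'd expect a registration along
these lines (sketch only; RSEQ_SIG and the offsetofend() fallback are
supplied by the caller, and glibc's own registration would obviously look
different):

#include <linux/rseq.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Conventional signature value agreed on within the process. */
#define RSEQ_SIG	0x53053053

/* Userspace counterpart of the kernel's offsetofend(). */
#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

static __thread struct rseq rseq_area __attribute__((aligned(4 * sizeof(__u64))));

static long register_rseq_up_to_tg_vcpu_id(void)
{
	/* Request the ABI up to and including the new field. */
	return syscall(__NR_rseq, &rseq_area,
		       offsetofend(struct rseq, tg_vcpu_id), 0, RSEQ_SIG);
}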

>
> I'd appreciate it if you could put the maximum supported size and possibly
> the alignment in the auxiliary vector, so that we don't have to issue rseq
> system calls in a loop on process startup.

Yes, it's a good idea. I'm not too familiar with the auxiliary vector.
Are we talking about the kernel's

fs/binfmt_elf.c:fill_auxv_note()

?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2022-02-02 23:59:06

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

----- On Feb 1, 2022, at 3:03 PM, Florian Weimer [email protected] wrote:

> * Mathieu Desnoyers:
>
>> If a thread group has fewer threads than cores, or is limited to run on
>> few cores concurrently through sched affinity or cgroup cpusets, the
>> virtual cpu ids will be values close to 0, thus allowing efficient use
>> of user-space memory for per-cpu data structures.
>
> From a userspace programmer perspective, what's a good way to obtain a
> reasonable upper bound for the possible tg_vcpu_id values?

Some effective upper bounds:

- sysconf(3) _SC_NPROCESSORS_CONF,
- the number of threads which exist concurrently in the process,
- the number of cpus in the cpu affinity mask applied by sched_setaffinity,
except in corner-case situations such as cpu hotplug removing all cpus from
the affinity set,
- cgroup cpuset "partition" limits,

Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
additional cores from the rest of the system if they are idle, therefore
allowing the number of concurrent threads to go beyond the specified limit.

>
> I believe not all users of cgroup cpusets change the affinity mask.

AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
Those are two mechanisms both affecting the scheduler task placement.

I would expect the user-space code to use some sensible upper bound as a
hint about how many per-vcpu data structure elements to expect (and how many
to pre-allocate), but have a "lazy initialization" fall-back in case the
vcpu id goes up to the number of configured processors - 1. And I suspect
that even the number of configured processors may change with CRIU.
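
A rough sketch of that scheme (hypothetical helper names):

#include <stdlib.h>
#include <unistd.h>

struct per_vcpu_data {
	long counter;
	/* ... */
};

/* One slot per possible vcpu id; elements allocated lazily. */
static struct per_vcpu_data **vcpu_slots;
static long nr_vcpu_slots;

static void per_vcpu_init(long hint)
{
	long i;

	/* Hard upper bound: the number of configured processors. */
	nr_vcpu_slots = sysconf(_SC_NPROCESSORS_CONF);
	vcpu_slots = calloc(nr_vcpu_slots, sizeof(*vcpu_slots));

	/* Pre-allocate up to the hint (e.g. the expected thread count). */
	for (i = 0; i < hint && i < nr_vcpu_slots; i++)
		vcpu_slots[i] = calloc(1, sizeof(struct per_vcpu_data));
}

/* Error handling and thread-safety of the lazy path omitted. */
static struct per_vcpu_data *per_vcpu_get(unsigned int vcpu_id)
{
	if (!vcpu_slots[vcpu_id])	/* lazy fall-back beyond the hint */
		vcpu_slots[vcpu_id] = calloc(1, sizeof(struct per_vcpu_data));
	return vcpu_slots[vcpu_id];
}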

>
>> diff --git a/kernel/rseq.c b/kernel/rseq.c
>> index 13f6d0419f31..37b43735a400 100644
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -86,10 +86,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
>> struct rseq __user *rseq = t->rseq;
>> u32 cpu_id = raw_smp_processor_id();
>> u32 node_id = cpu_to_node(cpu_id);
>> + u32 tg_vcpu_id = task_tg_vcpu_id(t);
>>
>> if (!user_write_access_begin(rseq, t->rseq_len))
>> goto efault;
>> switch (t->rseq_len) {
>> + case offsetofend(struct rseq, tg_vcpu_id):
>> + unsafe_put_user(tg_vcpu_id, &rseq->tg_vcpu_id, efault_end);
>> + fallthrough;
>> case offsetofend(struct rseq, node_id):
>> unsafe_put_user(node_id, &rseq->node_id, efault_end);
>> fallthrough;
>
> Is the switch really useful? I suspect it's faster to just write as
> much as possible all the time. The switch should be well-predictable
> if running uniform userspace, but still …

The switch ensures the kernel doesn't try to write to a memory area beyond
the rseq size which has been registered by user-space. So it seems to be
useful to ensure we don't corrupt user-space memory. Or am I missing your
point?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2022-02-03 09:51:05

by Florian Weimer

Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id

* Mathieu Desnoyers:

> If a thread group has fewer threads than cores, or is limited to run on
> few cores concurrently through sched affinity or cgroup cpusets, the
> virtual cpu ids will be values close to 0, thus allowing efficient use
> of user-space memory for per-cpu data structures.

From a userspace programmer perspective, what's a good way to obtain a
reasonable upper bound for the possible tg_vcpu_id values?

I believe not all users of cgroup cpusets change the affinity mask.

> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 13f6d0419f31..37b43735a400 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -86,10 +86,14 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
> struct rseq __user *rseq = t->rseq;
> u32 cpu_id = raw_smp_processor_id();
> u32 node_id = cpu_to_node(cpu_id);
> + u32 tg_vcpu_id = task_tg_vcpu_id(t);
>
> if (!user_write_access_begin(rseq, t->rseq_len))
> goto efault;
> switch (t->rseq_len) {
> + case offsetofend(struct rseq, tg_vcpu_id):
> + unsafe_put_user(tg_vcpu_id, &rseq->tg_vcpu_id, efault_end);
> + fallthrough;
> case offsetofend(struct rseq, node_id):
> unsafe_put_user(node_id, &rseq->node_id, efault_end);
> fallthrough;

Is the switch really useful? I suspect it's faster to just write as
much as possible all the time. The switch should be well-predictable
if running uniform userspace, but still …