Upstreaming efforts aiming to integrate rseq support into glibc led to
interesting discussions, where we identified a clear need to extend the
size of the per-thread structure shared between kernel and user-space
(struct rseq). This is something that is not possible with the current
rseq ABI. The fact that the current non-extensible rseq kernel ABI
would also prevent glibc's ABI to be extended prevents its integration
into glibc.
Discussions with glibc maintainers led to the following design, which we
are calling "Kernel Thread Local Storage" or KTLS:
- at glibc library init:
- glibc queries the size and alignment of the KTLS area supported by the
kernel,
- glibc reserves the memory area required by the kernel for main
thread,
- glibc registers the offset from thread pointer where the KTLS area
will be placed for all threads belonging to the threads group which
are created with clone3 CLONE_RSEQ_KTLS,
- at nptl thread creation:
- glibc reserves the memory area required by the kernel,
- application/libraries can query glibc for the offset/size of the
KTLS area, and offset from the thread pointer to access that area.
The basic idea is that the kernel UAPI will define the layout of that
per-thread area, and glibc only deals with its allocation.
It should fulfill the extensibility needs for many extensions which can
benefit from a per-thread shared memory area between kernel and
user-space, e.g.:
- Exposing:
- current NUMA node number,
- current cpu number,
- current user id, group id, thread id, process id, process group id, ...
- signal block flag (fast-path for signal blocking),
- pretty much anything a vDSO can be used for, without requiring the
function call.
This patch implements a crude prototype of extensible rseq ABI to show
what interfaces we need to expose. It's currently wired up using rseq
flags, turning rseq into a system call multiplexer, which was easy for a
prototype.
It should be ready for some bikeshedding on how that ABI should look
like. Is this extension using flags OK, or should this appear as new
system calls instead ?
Signed-off-by: Mathieu Desnoyers <[email protected]>
Cc: Carlos O'Donell <[email protected]>
Cc: "Florian Weimer <[email protected]>
Cc: Peter Zijlstra (Intel) <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Boqun Feng <[email protected]>
---
arch/x86/kernel/process_64.c | 1 +
include/linux/sched.h | 122 +-----------------------
include/linux/sched/signal.h | 150 +++++++++++++++++++++++++++++
include/uapi/linux/rseq.h | 16 +++-
include/uapi/linux/sched.h | 1 +
kernel/fork.c | 14 ++-
kernel/rseq.c | 177 +++++++++++++++++++++++++++--------
7 files changed, 321 insertions(+), 160 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9a97415b2139..f0c822a78d01 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -670,6 +670,7 @@ long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
task->thread.fsindex = 0;
x86_fsbase_write_task(task, arg2);
}
+ rseq_set_thread_pointer(task, arg2);
preempt_enable();
break;
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 683372943093..7b11ee7dee1a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1148,6 +1148,8 @@ struct task_struct {
* with respect to preemption.
*/
unsigned long rseq_event_mask;
+ unsigned long rseq_thread_pointer;
+ unsigned int rseq_ktls:1;
#endif
struct tlbflush_unmap_batch tlb_ubc;
@@ -1907,114 +1909,6 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif
-#ifdef CONFIG_RSEQ
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
- RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
- RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
- RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
- RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
- RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
- RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
- if (t->rseq)
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
-
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
-{
- if (current->rseq)
- __rseq_handle_notify_resume(ksig, regs);
-}
-
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
-{
- preempt_disable();
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
- preempt_enable();
- rseq_handle_notify_resume(ksig, regs);
-}
-
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
-{
- __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
- __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
-}
-
-/*
- * If parent process has a registered restartable sequences area, the
- * child inherits. Unregister rseq for a clone with CLONE_VM set.
- */
-static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
-{
- if (clone_flags & CLONE_VM) {
- t->rseq = NULL;
- t->rseq_sig = 0;
- t->rseq_event_mask = 0;
- } else {
- t->rseq = current->rseq;
- t->rseq_sig = current->rseq_sig;
- t->rseq_event_mask = current->rseq_event_mask;
- }
-}
-
-static inline void rseq_execve(struct task_struct *t)
-{
- t->rseq = NULL;
- t->rseq_sig = 0;
- t->rseq_event_mask = 0;
-}
-
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
-
-#endif
-
void __exit_umh(struct task_struct *tsk);
static inline void exit_umh(struct task_struct *tsk)
@@ -2023,18 +1917,6 @@ static inline void exit_umh(struct task_struct *tsk)
__exit_umh(tsk);
}
-#ifdef CONFIG_DEBUG_RSEQ
-
-void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
-
const struct sched_avg *sched_trace_cfs_rq_avg(struct cfs_rq *cfs_rq);
char *sched_trace_cfs_rq_path(struct cfs_rq *cfs_rq, char *str, int len);
int sched_trace_cfs_rq_cpu(struct cfs_rq *cfs_rq);
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 0ee5e696c5d8..22fc2a73f679 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -234,6 +234,11 @@ struct signal_struct {
* updated during exec, and may have
* inconsistent permissions.
*/
+#ifdef CONFIG_RSEQ
+ long rseq_ktls_offset;
+ unsigned int rseq_has_sig;
+ u32 rseq_sig;
+#endif
} __randomize_layout;
/*
@@ -715,4 +720,149 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}
+#ifdef CONFIG_RSEQ
+
+/*
+ * Map the event mask on the user-space ABI enum rseq_cs_flags
+ * for direct mask checks.
+ */
+enum rseq_event_mask_bits {
+ RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
+ RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
+ RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
+};
+
+enum rseq_event_mask {
+ RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
+ RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
+ RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
+};
+
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+ if (t->rseq_ktls || t->rseq)
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+
+static inline void rseq_handle_notify_resume(struct ksignal *ksig,
+ struct pt_regs *regs)
+{
+ if (current->rseq_ktls || current->rseq)
+ __rseq_handle_notify_resume(ksig, regs);
+}
+
+static inline void rseq_signal_deliver(struct ksignal *ksig,
+ struct pt_regs *regs)
+{
+ preempt_disable();
+ __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ preempt_enable();
+ rseq_handle_notify_resume(ksig, regs);
+}
+
+/* rseq_preempt() requires preemption to be disabled. */
+static inline void rseq_preempt(struct task_struct *t)
+{
+ __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
+ rseq_set_notify_resume(t);
+}
+
+/* rseq_migrate() requires preemption to be disabled. */
+static inline void rseq_migrate(struct task_struct *t)
+{
+ __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
+ rseq_set_notify_resume(t);
+}
+
+/*
+ * If parent process has a registered restartable sequences area, the
+ * child inherits. Unregister rseq for a clone with CLONE_VM set.
+ */
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags,
+ unsigned long tls)
+{
+ if (clone_flags & CLONE_VM) {
+ t->rseq = NULL;
+ t->rseq_sig = 0;
+ t->rseq_event_mask = 0;
+ } else {
+ t->rseq = current->rseq;
+ t->rseq_sig = current->rseq_sig;
+ t->rseq_event_mask = current->rseq_event_mask;
+ }
+ /*
+ * TODO: we should also set t->rseq_thread_pointer for architectures
+ * implementing set_thread_area.
+ */
+ if (clone_flags & CLONE_SETTLS)
+ t->rseq_thread_pointer = tls;
+ else
+ t->rseq_thread_pointer = 0;
+ t->rseq_ktls = !!(clone_flags & CLONE_RSEQ_KTLS);
+}
+
+static inline void rseq_execve(struct task_struct *t)
+{
+ t->rseq = NULL;
+ t->rseq_sig = 0;
+ t->rseq_event_mask = 0;
+ t->rseq_ktls = false;
+ t->signal->rseq_ktls_offset = 0;
+ t->signal->rseq_has_sig = 0;
+ t->signal->rseq_sig = 0;
+}
+
+static inline void rseq_set_thread_pointer(struct task_struct *t,
+ unsigned long thread_pointer)
+{
+ t->rseq_thread_pointer = thread_pointer;
+}
+
+#else
+
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void rseq_handle_notify_resume(struct ksignal *ksig,
+ struct pt_regs *regs)
+{
+}
+static inline void rseq_signal_deliver(struct ksignal *ksig,
+ struct pt_regs *regs)
+{
+}
+static inline void rseq_preempt(struct task_struct *t)
+{
+}
+static inline void rseq_migrate(struct task_struct *t)
+{
+}
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags,
+ unsigned long tls)
+{
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+}
+static inline void rseq_set_thread_pointer(struct task_struct *t,
+ unsigned long thread_pointer)
+{
+}
+
+#endif
+
+#ifdef CONFIG_DEBUG_RSEQ
+
+void rseq_syscall(struct pt_regs *regs);
+
+#else
+
+static inline void rseq_syscall(struct pt_regs *regs)
+{
+}
+
+#endif
+
#endif /* _LINUX_SCHED_SIGNAL_H */
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 9a402fdb60e9..f0fa947d6344 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -18,8 +18,22 @@ enum rseq_cpu_id_state {
RSEQ_CPU_ID_REGISTRATION_FAILED = -2,
};
+struct rseq_ktls_layout {
+ __u32 size;
+ __u32 alignment;
+ __u64 offset; /* 0 if not set. */
+};
+
+struct rseq_ktls_offset {
+ __s64 offset;
+};
+
enum rseq_flags {
- RSEQ_FLAG_UNREGISTER = (1 << 0),
+ RSEQ_FLAG_UNREGISTER = (1 << 0),
+ RSEQ_FLAG_GET_KTLS_LAYOUT = (1 << 1), /* Get ktls size, alignment, offset from thread pointer (if set). */
+ RSEQ_FLAG_SET_KTLS_OFFSET = (1 << 2), /* Set ktls offset from thread pointer for process. */
+ RSEQ_FLAG_SET_SIG = (1 << 3), /* Set rseq signature for process. */
+ RSEQ_FLAG_SET_KTLS_THREAD = (1 << 4), /* Set ktls enabled for thread (same as done by CLONE_RSEQ_KTLS). */
};
enum rseq_cs_flags_bit {
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..96a88899c9b4 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -36,6 +36,7 @@
/* Flags for the clone3() syscall. */
#define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
#define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_RSEQ_KTLS 0x400000000ULL /* Child has rseq ktls area. */
/*
* cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index efc5493203ae..8a68601e4384 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1604,6 +1604,12 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;
+#ifdef CONFIG_RSEQ
+ sig->rseq_ktls_offset = current->signal->rseq_ktls_offset;
+ sig->rseq_has_sig = current->signal->rseq_has_sig;
+ sig->rseq_sig = current->signal->rseq_sig;
+#endif
+
mutex_init(&sig->cred_guard_mutex);
mutex_init(&sig->exec_update_mutex);
@@ -2237,7 +2243,7 @@ static __latent_entropy struct task_struct *copy_process(
*/
copy_seccomp(p);
- rseq_fork(p, clone_flags);
+ rseq_fork(p, clone_flags, args->tls);
/* Don't start children in a dying pid namespace */
if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
@@ -2712,7 +2718,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
{
/* Verify that no unknown flags are passed along. */
if (kargs->flags &
- ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+ ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP | CLONE_RSEQ_KTLS))
return false;
/*
@@ -2730,6 +2736,10 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
kargs->exit_signal)
return false;
+ if ((kargs->flags & CLONE_RSEQ_KTLS) &&
+ (!(kargs->flags & CLONE_THREAD) || !(kargs->flags & CLONE_SETTLS)))
+ return false;
+
if (!clone3_stack_valid(kargs))
return false;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index a4f86a9d6937..40d24d3e404e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -21,6 +21,18 @@
#define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)
+static struct rseq __user *rseq_get_ktls(struct task_struct *t)
+{
+ unsigned long thread_pointer;
+ long ktls_offset;
+
+ thread_pointer = t->rseq_thread_pointer;
+ ktls_offset = t->signal->rseq_ktls_offset;
+ if (!thread_pointer || !ktls_offset)
+ return NULL;
+ return (struct rseq __user *) (thread_pointer + ktls_offset);
+}
+
/*
*
* Restartable sequences are a lightweight interface that allows
@@ -117,7 +129,7 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
struct rseq_cs __user *urseq_cs;
u64 ptr;
u32 __user *usig;
- u32 sig;
+ u32 sig, ksig;
int ret;
if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
@@ -149,10 +161,12 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
if (ret)
return ret;
- if (current->rseq_sig != sig) {
+ ksig = current->signal->rseq_has_sig ?
+ current->signal->rseq_sig : current->rseq_sig;
+ if (ksig != sig) {
printk_ratelimited(KERN_WARNING
"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq_sig, current->pid, usig);
+ sig, ksig, current->pid, usig);
return -EINVAL;
}
return 0;
@@ -266,7 +280,9 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
- if (unlikely(!access_ok(t->rseq, sizeof(*t->rseq))))
+ if (unlikely(!t->rseq && t->rseq_ktls))
+ t->rseq = rseq_get_ktls(t);
+ if (unlikely(!access_ok(t->rseq, offsetofend(struct rseq, flags))))
goto error;
ret = rseq_ip_fixup(regs);
if (unlikely(ret < 0))
@@ -292,44 +308,19 @@ void rseq_syscall(struct pt_regs *regs)
struct task_struct *t = current;
struct rseq_cs rseq_cs;
- if (!t->rseq)
- return;
- if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
- rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
- force_sig(SIGSEGV);
+ if (!t->rseq && t->rseq_ktls)
+ t->rseq = rseq_get_ktls(t);
+ if (t->rseq) {
+ if (!access_ok(t->rseq, sizeof(*t->rseq)) ||
+ rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+ force_sig(SIGSEGV);
+ }
}
#endif
-/*
- * sys_rseq - setup restartable sequences for caller thread.
- */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
- int, flags, u32, sig)
+static int rseq_register(struct rseq __user *rseq, u32 rseq_len, u32 sig)
{
- int ret;
-
- if (flags & RSEQ_FLAG_UNREGISTER) {
- if (flags & ~RSEQ_FLAG_UNREGISTER)
- return -EINVAL;
- /* Unregister rseq for current thread. */
- if (current->rseq != rseq || !current->rseq)
- return -EINVAL;
- if (rseq_len != sizeof(*rseq))
- return -EINVAL;
- if (current->rseq_sig != sig)
- return -EPERM;
- ret = rseq_reset_rseq_cpu_id(current);
- if (ret)
- return ret;
- current->rseq = NULL;
- current->rseq_sig = 0;
- return 0;
- }
-
- if (unlikely(flags))
- return -EINVAL;
-
if (current->rseq) {
/*
* If rseq is already registered, check whether
@@ -353,6 +344,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
return -EINVAL;
if (!access_ok(rseq, rseq_len))
return -EFAULT;
+
current->rseq = rseq;
current->rseq_sig = sig;
/*
@@ -364,3 +356,114 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
return 0;
}
+
+static int rseq_unregister(struct rseq __user *rseq, u32 rseq_len, u32 sig)
+{
+ int ret;
+
+ /* Unregister rseq for current thread. */
+ if (current->rseq != rseq || !current->rseq)
+ return -EINVAL;
+ if (rseq_len != sizeof(*rseq))
+ return -EINVAL;
+ if (current->rseq_sig != sig)
+ return -EPERM;
+ ret = rseq_reset_rseq_cpu_id(current);
+ if (ret)
+ return ret;
+ current->rseq = NULL;
+ current->rseq_sig = 0;
+ return 0;
+}
+
+static int rseq_get_ktls_layout(struct rseq_ktls_layout __user *ktls_layout)
+{
+ u32 size, alignment;
+ u64 offset;
+
+ size = offsetofend(struct rseq, flags);
+ alignment = __alignof__(struct rseq);
+ offset = (u64) current->signal->rseq_ktls_offset;
+
+ if (put_user(size, &ktls_layout->size))
+ return -EFAULT;
+ if (put_user(alignment, &ktls_layout->alignment))
+ return -EFAULT;
+ if (put_user(offset, &ktls_layout->offset))
+ return -EFAULT;
+ return 0;
+}
+
+static int rseq_set_ktls_offset(struct rseq_ktls_offset __user *ktls_offset)
+{
+ s64 offset;
+ int ret = 0;
+
+ if (get_user(offset, &ktls_offset->offset))
+ return -EFAULT;
+ /*
+ * TODO: do we want to return EINVAL if a compat syscall provides
+ * an offset larger than INT_MAX/smaller than INT_MIN on a 64-bit kernel ?
+ */
+ if (offset > LONG_MAX || offset < LONG_MIN)
+ return -EINVAL;
+ /* Only allow setting ktls offset when single-threaded. */
+ if (get_nr_threads(current) != 1)
+ return -EBUSY;
+ if (current->signal->rseq_ktls_offset)
+ return -EBUSY;
+ current->signal->rseq_ktls_offset = (long) offset;
+ current->rseq_ktls = true;
+ /*
+ * If rseq was previously inactive, and has just been
+ * registered, ensure the cpu_id_start and cpu_id fields
+ * are updated before returning to user-space.
+ */
+ rseq_set_notify_resume(current);
+
+ return ret;
+}
+
+static int rseq_set_sig(u32 sig)
+{
+ /* Only allow setting signature when single-threaded. */
+ if (get_nr_threads(current) != 1)
+ return -EBUSY;
+ if (current->signal->rseq_has_sig)
+ return -EPERM;
+ current->signal->rseq_sig = sig;
+ current->signal->rseq_has_sig = true;
+ return 0;
+}
+
+static int rseq_set_ktls_thread(void)
+{
+ if (!rseq_get_ktls(current))
+ return -EFAULT;
+ current->rseq_ktls = true;
+ return 0;
+}
+
+/*
+ * sys_rseq - setup restartable sequences for caller thread.
+ */
+SYSCALL_DEFINE4(rseq, void __user *, uptr, u32, rseq_len,
+ int, flags, u32, sig)
+{
+ switch (flags) {
+ case 0:
+ return rseq_register((struct rseq __user *) uptr, rseq_len, sig);
+ case RSEQ_FLAG_UNREGISTER:
+ return rseq_unregister((struct rseq __user *) uptr, rseq_len, sig);
+ case RSEQ_FLAG_GET_KTLS_LAYOUT:
+ return rseq_get_ktls_layout((struct rseq_ktls_layout __user *) uptr);
+ case RSEQ_FLAG_SET_KTLS_OFFSET:
+ return rseq_set_ktls_offset((struct rseq_ktls_offset __user *) uptr);
+ case RSEQ_FLAG_SET_SIG:
+ return rseq_set_sig(sig);
+ case RSEQ_FLAG_SET_KTLS_THREAD:
+ return rseq_set_ktls_thread();
+ default:
+ return -EINVAL;
+ }
+}
--
2.17.1
* Mathieu Desnoyers:
> Upstreaming efforts aiming to integrate rseq support into glibc led to
> interesting discussions, where we identified a clear need to extend the
> size of the per-thread structure shared between kernel and user-space
> (struct rseq). This is something that is not possible with the current
> rseq ABI. The fact that the current non-extensible rseq kernel ABI
> would also prevent glibc's ABI to be extended prevents its integration
> into glibc.
>
> Discussions with glibc maintainers led to the following design, which we
> are calling "Kernel Thread Local Storage" or KTLS:
>
> - at glibc library init:
> - glibc queries the size and alignment of the KTLS area supported by the
> kernel,
> - glibc reserves the memory area required by the kernel for main
> thread,
> - glibc registers the offset from thread pointer where the KTLS area
> will be placed for all threads belonging to the threads group which
> are created with clone3 CLONE_RSEQ_KTLS,
> - at nptl thread creation:
> - glibc reserves the memory area required by the kernel,
> - application/libraries can query glibc for the offset/size of the
> KTLS area, and offset from the thread pointer to access that area.
One remaining challenge see is that we want to use vDSO functions to
abstract away the exact layout of the KTLS area. For example, there are
various implementation strategies for getuid optimizations, some of them
exposing a shared struct cred in a thread group, and others not doing
that.
The vDSO has access to the thread pointer because it's ABI (something
that we recently (and quite conveniently) clarified for x86). What it
does not know is the offset of the KTLS area from the thread pointer.
In the original rseq implementation, this offset could vary from thread
to thread in a process, although the submitted glibc implementation did
not use this level of flexibility and the offset is constant. The vDSO
is not relocated by the run-time dynamic loader, so it can't use ELF TLS
data.
Furthermore, not all threads in a thread group may have an associated
KTLS area. In a potential glibc implementation, only the threads
created by pthread_create would have it; threads created directly using
clone would lack it (and would not even run with a correctly set up
userspace TCB).
So we have a bootstrap issue here that needs to be solved, I think.
In most cases, I would not be too eager to bypass the vDSO completely,
and having the kernel expose a data-only interface. I could perhaps
make an exception for the current TID because that's so convenient to
use in mutex implementations, and errno. With the latter, we could
directly expose the vDSO implementation to applications, assuming that
we agree that the vDSO will not fail with ENOSYS to request fallback to
the system call, but will itself perform the system call.
Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
----- On Sep 28, 2020, at 11:13 AM, Florian Weimer [email protected] wrote:
> * Mathieu Desnoyers:
>
>> Upstreaming efforts aiming to integrate rseq support into glibc led to
>> interesting discussions, where we identified a clear need to extend the
>> size of the per-thread structure shared between kernel and user-space
>> (struct rseq). This is something that is not possible with the current
>> rseq ABI. The fact that the current non-extensible rseq kernel ABI
>> would also prevent glibc's ABI to be extended prevents its integration
>> into glibc.
>>
>> Discussions with glibc maintainers led to the following design, which we
>> are calling "Kernel Thread Local Storage" or KTLS:
>>
>> - at glibc library init:
>> - glibc queries the size and alignment of the KTLS area supported by the
>> kernel,
>> - glibc reserves the memory area required by the kernel for main
>> thread,
>> - glibc registers the offset from thread pointer where the KTLS area
>> will be placed for all threads belonging to the threads group which
>> are created with clone3 CLONE_RSEQ_KTLS,
>> - at nptl thread creation:
>> - glibc reserves the memory area required by the kernel,
>> - application/libraries can query glibc for the offset/size of the
>> KTLS area, and offset from the thread pointer to access that area.
>
> One remaining challenge see is that we want to use vDSO functions to
> abstract away the exact layout of the KTLS area. For example, there are
> various implementation strategies for getuid optimizations, some of them
> exposing a shared struct cred in a thread group, and others not doing
> that.
>
> The vDSO has access to the thread pointer because it's ABI (something
> that we recently (and quite conveniently) clarified for x86). What it
> does not know is the offset of the KTLS area from the thread pointer.
> In the original rseq implementation, this offset could vary from thread
> to thread in a process, although the submitted glibc implementation did
> not use this level of flexibility and the offset is constant. The vDSO
> is not relocated by the run-time dynamic loader, so it can't use ELF TLS
> data.
In the context of this prototype, the KTLS offset is the same for all threads
belonging to a thread group.
>
> Furthermore, not all threads in a thread group may have an associated
> KTLS area. In a potential glibc implementation, only the threads
> created by pthread_create would have it; threads created directly using
> clone would lack it (and would not even run with a correctly set up
> userspace TCB).
Right.
>
> So we have a bootstrap issue here that needs to be solved, I think.
The one thing I'm not sure about is whether the vDSO interface is indeed
superior to KTLS, or if it is just the model we are used to.
AFAIU, the current use-cases for vDSO is that an application calls into
glibc, which then calls the vDSO function exposed by the kernel. I wonder
whether the vDSO indirection is really needed if we typically have a glibc
function used as indirection ? For an end user, what is the benefit of vDSO
over accessing KTLS data directly from glibc ?
If we decide that using KTLS from a vDSO function is indeed a requirement,
then, as you point out, the thread_pointer is available as ABI, but we miss
the KTLS offset.
Some ideas on how we could solve this: we could either make the KTLS
offset part of the ABI (fixed offset), or save the offset near the thread pointer
at a location that would become ABI. It would have to be already populated with
something which can help detect the case where a vDSO is called from a thread
which does not populate KTLS though. Is that even remotely doable ?
>
> In most cases, I would not be too eager to bypass the vDSO completely,
> and having the kernel expose a data-only interface. I could perhaps
> make an exception for the current TID because that's so convenient to
> use in mutex implementations, and errno.
Indeed, using a KTLS field to store errno is another use-case I forgot to
mention. That would make life easier for errno handling in vDSO as well.
> With the latter, we could
> directly expose the vDSO implementation to applications, assuming that
> we agree that the vDSO will not fail with ENOSYS to request fallback to
> the system call, but will itself perform the system call.
We should not forget the fields needed by rseq as well: the rseq_cs pointer and
the cpu_id fields need to be accessed directly from the rseq critical section,
without function call. Those use-cases require that applications and library can
know the KTLS offset and size and use those fields directly. That being said,
there are certainly plenty of use-cases where it makes sense to use the KTLS
data through a vDSO, and only expose the vDSO interface, if the cost of the
extra vDSO call indirection is not prohibitive.
Thanks,
Mathieu
>
> Thanks,
> Florian
> --
> Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Mathieu Desnoyers:
>> So we have a bootstrap issue here that needs to be solved, I think.
>
> The one thing I'm not sure about is whether the vDSO interface is indeed
> superior to KTLS, or if it is just the model we are used to.
>
> AFAIU, the current use-cases for vDSO is that an application calls into
> glibc, which then calls the vDSO function exposed by the kernel. I wonder
> whether the vDSO indirection is really needed if we typically have a glibc
> function used as indirection ? For an end user, what is the benefit of vDSO
> over accessing KTLS data directly from glibc ?
I think the kernel can only reasonably maintain a single userspace data
structure. It's not reasonable to update several versions of the data
structure in parallel.
This means that glibc would have to support multiple kernel data
structures, and users might lose userspace acceleration after a kernel
update, until they update glibc as well. The glibc update should be
ABI-compatible, but someone would still have to backport it, apply it to
container images, etc.
What's worse, the glibc code would be quite hard to test because we
would have to keep around multiple kernel versions to exercise all the
different data structure variants.
In contrast, the vDSO code always matches the userspace data structures,
is always updated at the same time, and tested together. That looks
like a clear win to me.
> If we decide that using KTLS from a vDSO function is indeed a requirement,
> then, as you point out, the thread_pointer is available as ABI, but we miss
> the KTLS offset.
>
> Some ideas on how we could solve this: we could either make the KTLS
> offset part of the ABI (fixed offset), or save the offset near the
> thread pointer at a location that would become ABI. It would have to
> be already populated with something which can help detect the case
> where a vDSO is called from a thread which does not populate KTLS
> though. Is that even remotely doable ?
I don't know.
We could decide that these accelerated system calls must only be called
with a valid TCB. That's unavoidable if the vDSO sets errno directly,
so it's perhaps not a big loss. It's also backwards-compatible because
existing TCB-less code won't know about those new vDSO entrypoints.
Calling into glibc from a TCB-less thread has always been undefined.
TCB-less code would have to make direct, non-vDSO system calls, as today.
For discovering the KTLS offset, a per-process page at a fixed offset
from the vDSO code (i.e., what real shared objects already do for global
data) could store this offset. This way, we could entirely avoid an ABI
dependency.
We'll see what will break once we have the correct TID after vfork. 8->
glibc currently supports malloc-after-vfork as an extension, and
a lot of software depends on it (OpenJDK, for example).
>> With the latter, we could
>> directly expose the vDSO implementation to applications, assuming that
>> we agree that the vDSO will not fail with ENOSYS to request fallback to
>> the system call, but will itself perform the system call.
>
> We should not forget the fields needed by rseq as well: the rseq_cs
> pointer and the cpu_id fields need to be accessed directly from the
> rseq critical section, without function call. Those use-cases require
> that applications and library can know the KTLS offset and size and
> use those fields directly.
Yes, but those offsets could be queried using a function from the vDSO
(or using a glibc interface, to simplify linking).
Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
On Mon, Sep 28, 2020 at 8:14 AM Florian Weimer <[email protected]> wrote:
>
> * Mathieu Desnoyers:
>
> > Upstreaming efforts aiming to integrate rseq support into glibc led to
> > interesting discussions, where we identified a clear need to extend the
> > size of the per-thread structure shared between kernel and user-space
> > (struct rseq). This is something that is not possible with the current
> > rseq ABI. The fact that the current non-extensible rseq kernel ABI
> > would also prevent glibc's ABI to be extended prevents its integration
> > into glibc.
> >
> > Discussions with glibc maintainers led to the following design, which we
> > are calling "Kernel Thread Local Storage" or KTLS:
> >
> > - at glibc library init:
> > - glibc queries the size and alignment of the KTLS area supported by the
> > kernel,
> > - glibc reserves the memory area required by the kernel for main
> > thread,
> > - glibc registers the offset from thread pointer where the KTLS area
> > will be placed for all threads belonging to the threads group which
> > are created with clone3 CLONE_RSEQ_KTLS,
> > - at nptl thread creation:
> > - glibc reserves the memory area required by the kernel,
> > - application/libraries can query glibc for the offset/size of the
> > KTLS area, and offset from the thread pointer to access that area.
>
> One remaining challenge see is that we want to use vDSO functions to
> abstract away the exact layout of the KTLS area. For example, there are
> various implementation strategies for getuid optimizations, some of them
> exposing a shared struct cred in a thread group, and others not doing
> that.
>
> The vDSO has access to the thread pointer because it's ABI (something
> that we recently (and quite conveniently) clarified for x86). What it
> does not know is the offset of the KTLS area from the thread pointer.
> In the original rseq implementation, this offset could vary from thread
> to thread in a process, although the submitted glibc implementation did
> not use this level of flexibility and the offset is constant. The vDSO
> is not relocated by the run-time dynamic loader, so it can't use ELF TLS
> data.
I assume that, by "thread pointer", you mean the pointer stored in
GSBASE on x86_32, FSBASE on x86_64, and elsewhere on other
architectures?
The vDSO has done pretty well so far having the vDSO not touch FS, GS,
or their bases at all. If we want to change that, I would be very
nervous about doing so in existing vDSO functions. Regardless of
anything an ABI document might say and anything that existing or
previous glibc versions may or may not have done, there are plenty of
bizarre programs out there that don't really respect the psABI
document. Go and various not-ready-for-prime-time-but-released-anyway
Bionic branches come to mind. So we would need to tread very, very
carefully.
One way to side-step much of this would be to make the interface explicit:
long __vdso_do_whatever(void *ktls_ptr, ...);
Sadly, on x86, actually generating the ktls ptr is bit nasty due to
the fact that lea %fs:(offset) doesn't do what one might have liked it
to do. I suppose this could also be:
long __vdso_do_whatever(unsigned long ktls_offset);
which will generate quite nice code on x86_64. I can't speak for the
asm capabilities of other architectures.
What I *don't* want to do is to accidentally repeat anything like the
%gs:0x28 mess we have with the stack cookie on x86_32. (The stack
cookie is, in kernel code, in a completely nonsensical location. I'm
quite surprised that any of the maintainers ever accepted the current
stack cookie implementation. I assume there's some history there, but
I don't know it. The end result is a festering mess in the x86_32
kernel code that only persists because no one cares quite enough about
x86_32 to fix it.) We obviously won't end up with precisely the same
type of mistake here, but a mis-step here certainly does have the
possibility of promoting an unfortunate-in-hindsight design decision
in glibc and/or psABI to something that every other x86_64 Linux
software stack has to copy to be compatible with the vDSO.
As for errno itself, with all due respect to those who designed errno
before I was born, IMO it was a mistake. Why exactly should the vDSO
know about errno?
----- On Sep 29, 2020, at 4:13 AM, Florian Weimer [email protected] wrote:
> * Mathieu Desnoyers:
>
>>> So we have a bootstrap issue here that needs to be solved, I think.
>>
>> The one thing I'm not sure about is whether the vDSO interface is indeed
>> superior to KTLS, or if it is just the model we are used to.
>>
>> AFAIU, the current use-cases for vDSO is that an application calls into
>> glibc, which then calls the vDSO function exposed by the kernel. I wonder
>> whether the vDSO indirection is really needed if we typically have a glibc
>> function used as indirection ? For an end user, what is the benefit of vDSO
>> over accessing KTLS data directly from glibc ?
>
> I think the kernel can only reasonably maintain a single userspace data
> structure. It's not reasonable to update several versions of the data
> structure in parallel.
I disagree with your statement. Considering that the kernel needs to keep
ABI compatibility for whatever it exposes to user-space, claiming that it
should never update several versions of data structures exposed to user-space
in parallel means that once a data structure is exposed to user-space as ABI
in a certain way, it can never ever change in the future, even if we find
a better way to do things.
It makes more sense to allow multiple data structures to be updated
in parallel until older ones become deprecated/unused/irrelevant, at
which point those can be configured out at build time and eventually
phased out after years of deprecation. Having the ability to update multiple
data structures in user-space with replicated information is IMHO necessary
to allow creation of new/better accelerated ABIs.
>
> This means that glibc would have to support multiple kernel data
> structures, and users might lose userspace acceleration after a kernel
> update, until they update glibc as well. The glibc update should be
> ABI-compatible, but someone would still have to backport it, apply it to
> container images, etc.
No. If the kernel ever exposes a data structure to user-space as ABI,
then it needs to stay there, and not break userspace. Hence the need to
duplicate information provided to user-space if need be, so we can move
on to better ABIs without breaking the old ones.
>
> What's worse, the glibc code would be quite hard to test because we
> would have to keep around multiple kernel versions to exercise all the
> different data structure variants.
>
> In contrast, the vDSO code always matches the userspace data structures,
> is always updated at the same time, and tested together. That looks
> like a clear win to me.
For cases where the overhead of vDSO is not an issue, I agree that it
makes things tidier than directly accessing a data structure. The
documentation of the ABI becomes much simpler as well.
>
>> If we decide that using KTLS from a vDSO function is indeed a requirement,
>> then, as you point out, the thread_pointer is available as ABI, but we miss
>> the KTLS offset.
>>
>> Some ideas on how we could solve this: we could either make the KTLS
>> offset part of the ABI (fixed offset), or save the offset near the
>> thread pointer at a location that would become ABI. It would have to
>> be already populated with something which can help detect the case
>> where a vDSO is called from a thread which does not populate KTLS
>> though. Is that even remotely doable ?
>
> I don't know.
>
> We could decide that these accelerated system calls must only be called
> with a valid TCB. That's unavoidable if the vDSO sets errno directly,
> so it's perhaps not a big loss. It's also backwards-compatible because
> existing TCB-less code won't know about those new vDSO entrypoints.
> Calling into glibc from a TCB-less thread has always been undefined.
> TCB-less code would have to make direct, non-vDSO system calls, as today.
>
> For discovering the KTLS offset, a per-process page at a fixed offset
> from the vDSO code (i.e., what real shared objects already do for global
> data) could store this offset. This way, we could entirely avoid an ABI
> dependency.
Or as Andy mentioned, we would simply pass the ktls offset as argument to
the vDSO ? It seems simple enough. Would it fit all our use-cases including
errno ?
>
> We'll see what will break once we have the correct TID after vfork. 8->
> glibc currently supports malloc-after-vfork as an extension, and
> a lot of software depends on it (OpenJDK, for example).
I am not sure to see how that is related to ktls ?
>
>>> With the latter, we could
>>> directly expose the vDSO implementation to applications, assuming that
>>> we agree that the vDSO will not fail with ENOSYS to request fallback to
>>> the system call, but will itself perform the system call.
>>
>> We should not forget the fields needed by rseq as well: the rseq_cs
>> pointer and the cpu_id fields need to be accessed directly from the
>> rseq critical section, without function call. Those use-cases require
>> that applications and library can know the KTLS offset and size and
>> use those fields directly.
>
> Yes, but those offsets could be queried using a function from the vDSO
> (or using a glibc interface, to simplify linking).
Good point!
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Mathieu Desnoyers:
> ----- On Sep 29, 2020, at 4:13 AM, Florian Weimer [email protected] wrote:
>
>> * Mathieu Desnoyers:
>>
>>>> So we have a bootstrap issue here that needs to be solved, I think.
>>>
>>> The one thing I'm not sure about is whether the vDSO interface is indeed
>>> superior to KTLS, or if it is just the model we are used to.
>>>
>>> AFAIU, the current use-cases for vDSO is that an application calls into
>>> glibc, which then calls the vDSO function exposed by the kernel. I wonder
>>> whether the vDSO indirection is really needed if we typically have a glibc
>>> function used as indirection ? For an end user, what is the benefit of vDSO
>>> over accessing KTLS data directly from glibc ?
>>
>> I think the kernel can only reasonably maintain a single userspace data
>> structure. It's not reasonable to update several versions of the data
>> structure in parallel.
>
> I disagree with your statement. Considering that the kernel needs to
> keep ABI compatibility for whatever it exposes to user-space, claiming
> that it should never update several versions of data structures
> exposed to user-space in parallel means that once a data structure is
> exposed to user-space as ABI in a certain way, it can never ever
> change in the future, even if we find a better way to do things.
I think it's possible to put data into userspace without making it ABI.
Think about the init_module system call. The module blob comes from
userspace, but its (deeper) internal structure does not have a stable
ABI. Similar for many BPF use cases.
If the internal KTLS blob structure turns into ABI, including the parts
that need to be updated on context switch, each versioning change has a
performance impact.
>> This means that glibc would have to support multiple kernel data
>> structures, and users might lose userspace acceleration after a kernel
>> update, until they update glibc as well. The glibc update should be
>> ABI-compatible, but someone would still have to backport it, apply it to
>> container images, etc.
>
> No. If the kernel ever exposes a data structure to user-space as ABI,
> then it needs to stay there, and not break userspace. Hence the need to
> duplicate information provided to user-space if need be, so we can move
> on to better ABIs without breaking the old ones.
It can expose the data as an opaque blob.
> Or as Andy mentioned, we would simply pass the ktls offset as argument to
> the vDSO ? It seems simple enough. Would it fit all our use-cases including
> errno ?
That would work, yes. It's neat, but it won't give you a way to provide
traditional syscall wrappers directly from the vDSO.
>> We'll see what will break once we have the correct TID after vfork. 8->
>> glibc currently supports malloc-after-vfork as an extension, and
>> a lot of software depends on it (OpenJDK, for example).
>
> I am not sure to see how that is related to ktls ?
The mutex implementation could switch to the KTLS TID because it always
correct. But then locking in a vfork'ed subprocess would no longer look
like locking from the parent thread because the TID would be different.
Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill