This is a fairly small series demonstrating a feature we've found to be quite
powerful in practice, "restartable sequences".
Most simply: these sequences comprise small snippets of user-code that are
guaranteed to be (effectively) executed serially, with support for restart (or
other handling) in the event of preemption or interruption.
This (when combined with an in-memory cpu-id; potentially optional on some
architectures) can be used to implement extremely fast per-cpu operations that
do not rely on the use of actual processor atomics. We've used this to back
performance critical code such as our malloc implementation with good results.
We previously discussed this feature at LPC:
http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Mathieu Desnoyers posted an alternate implementation based on the ideas above
at:
https://lkml.org/lkml/2015/5/21/472
This RFC is based on the approach we currently use internally. However, I'll
likely posit that neither this approach, nor the one posted above, is what we
should ultimately adopt (provided sufficient interest exists).
The implementation in this series can be summarized as follows:
- We allow a single (per-address space) critical section (and associated
handler) to be defined.
- When a thread with RSEQ configured (via the new syscall) is interrupted within
a critical section, we modify its return address, either during signal
delivery or on the notify-resume path. The scheduler touchpoint is only a
preempt_notifier which (potentially, dependent on config) examines the
kernel copy of pt_regs.
There are a few core requirements which motivate the approach taken here:
1) We must not interfere with regular scheduling. Unlike preemption protection
(or similar), there are no special considerations for code running within a
critical region beyond the requirement that we redirect it to the restart
handler if it is interrupted.
2) The code executing in scheduler context must be tightly constrained, both in
terms of overhead and that it must not require access to user memory.
3) The fast-path must be fast. That is, user entry/execution/exit of a
non-interrupted critical section is the most important case. The restart
handler is a 'slow' path that should run comparatively infrequently; a
sketch of the fast path follows below.
We're strongly motivated here by high-performance, low-level primitives:
think malloc, or rcu_read_lock.
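To make (3) concrete, here is the logical shape of a per-cpu counter increment
as a restartable sequence, sketched in C. This is illustrative only: the
cpu-id accessor is a stand-in for the TLS value maintained by the kernel, and
(as discussed under the limitations below) a real sequence must be written in
assembly so that the commit is a single store at a known address.

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdint.h>

  #define NR_CPUS_MAX 4096

  /* Stand-in for the kernel-maintained TLS cpu-id configured by this series. */
  int rseq_current_cpu(void)
  {
          return sched_getcpu();
  }

  static intptr_t counters[NR_CPUS_MAX];

  /*
   * With rseq, this body would live inside the registered critical region;
   * preemption or signal delivery anywhere before the final store causes the
   * kernel to redirect execution to the restart handler, which re-enters at
   * the top and retries on the (possibly new) cpu.
   */
  void percpu_inc(void)
  {
          int cpu = rseq_current_cpu();   /* 1. observe our cpu          */
          intptr_t old = counters[cpu];   /* 2. arbitrary preparation    */
          counters[cpu] = old + 1;        /* 3. single-store commit      */
  }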
While the contained implementation performs well under these constraints, it
has some notable limitations which we should consider for more general usage:
1) The definition of a single region works well for statically linked binaries
but can be challenging when shared libraries want to use this feature. This
is partially mitigated by our experience that a library of primitives is
generally more useful than a myriad of custom sequences, but it's still a
major concern.
2) Due to the nature of restart and the explicit location requirements, it's
only really reasonable to write this critical section in assembly, which makes
porting and maintenance less convenient. (One of the included tests shows a
reasonable example of what this looks like.)
3) TLS spreads the entrails of its implementation all over compile _and_ link.
This makes properly handling it within the critical section cumbersome in
the shared binary case.
We've been thinking about how to address these issues and are considering some
alternate ABIs that still satisfy (1)-(3), but have a more convenient
programming model. I'll send this as a follow-up, but wanted to first share
the approach we've used to date.
Thanks,
- Paul
Introduce the notion of a 'restartable sequence'. This is a user-defined range
within which user execution is guaranteed to occur serially with respect
to scheduling events such as migration or competition with other threads.
Preemption, or other interruption within this region, results in control being
transferred to a user-defined restart handler when rescheduled. This handler
may arrange for the original operation to be retried, including potentially
resynchronizing with dependent state that may have been updated in the interim.
This may be used in combination with an in-memory cpu-id to allow user programs
to implement cpu-local data-structures and primitives, without the use/overhead
of any atomics.
The kernel ABI generally consists of:
- A single (per-address space) critical region
- A restart handler which pairs with the region above
- A (per-thread) memory location which will be kept current with its cpu
The definition of the above is performed via a new syscall,
SYSCALL_DEFINE5(restartable_sequences,
int, op, int, flags, long, val1, long, val2, long, val3)
There are currently 2 possible operations,
1) Configure the critical region (and restart handler)
2) Configure the per-thread cpu pointer
[ See kernel/restartable_sequences.c for full documentation ]
A thread that has not configured (2) will not be restarted when executing in
(1).
Note that while the kernel only sees a single critical region, arbitrarily many
sequences can be composed via multiplexing of the user-space restart handler.
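For illustration, minimal user-space configuration might look roughly as
follows (the syscall number and wrappers here are illustrative; the selftests
added later in this series provide a working version):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <unistd.h>

  #define __NR_restartable_sequences    323     /* x86-64, from this series */
  #define SYS_RSEQ_SET_CRITICAL         0
  #define SYS_RSEQ_SET_CPU_POINTER      1

  /* Per-thread location the kernel keeps updated with our current cpu. */
  __thread volatile int rseq_cpu = -1;

  /* (1) Once per process: register the critical region and its handler. */
  int rseq_setup_region(void *start, void *end, void *restart)
  {
          return syscall(__NR_restartable_sequences, SYS_RSEQ_SET_CRITICAL, 0,
                         (intptr_t)start, (intptr_t)end, (intptr_t)restart);
  }

  /* (2) Once per thread: tell the kernel where to publish our cpu number. */
  int rseq_setup_cpu_pointer(void)
  {
          return syscall(__NR_restartable_sequences, SYS_RSEQ_SET_CPU_POINTER,
                         0, (intptr_t)&rseq_cpu, 0, 0);
  }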
This patch introduces the general framework for configuration, as well as
exposing the syscall. We minimally expose x86 as having support (even though
the actual ABI is added by a subsequent patch) so that this can be
compile-tested in isolation.
Signed-off-by: Paul Turner <[email protected]>
---
arch/Kconfig | 7 +
arch/x86/Kconfig | 1
arch/x86/syscalls/syscall_64.tbl | 1
fs/exec.c | 1
include/linux/sched.h | 28 ++++++
include/uapi/asm-generic/unistd.h | 5 +
init/Kconfig | 9 ++
kernel/Makefile | 1
kernel/restartable_sequences.c | 185 +++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 4 +
kernel/sched/sched.h | 3 +
kernel/sys_ni.c | 3 +
12 files changed, 246 insertions(+), 2 deletions(-)
create mode 100644 kernel/restartable_sequences.c
diff --git a/arch/Kconfig b/arch/Kconfig
index a65eafb..fb31981 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -229,6 +229,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
declared in asm/ptrace.h
For example the kprobes-based event tracer needs this API.
+config HAVE_RESTARTABLE_SEQUENCE_SUPPORT
+ bool
+ depends on HAVE_REGS_AND_STACK_ACCESS_API
+ help
+ This symbol should be selected by an architecture if it supports an
+ implementation of restartable sequences.
+
config HAVE_CLK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8fec044..9c9c92f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -67,6 +67,7 @@ config X86
select HAVE_EFFICIENT_UNALIGNED_ACCESS
select USER_STACKTRACE_SUPPORT
select HAVE_REGS_AND_STACK_ACCESS_API
+ select HAVE_RESTARTABLE_SEQUENCE_SUPPORT
select HAVE_DMA_API_DEBUG
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_BZIP2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 9ef32d5..1de5cbc 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common restartable_sequences sys_restartable_sequences
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a..acd38f6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1590,6 +1590,7 @@ static int do_execveat_common(int fd, struct filename *filename,
current->in_execve = 0;
acct_update_integrals(current);
task_numa_free(current);
+ rseq_clear_state_exec();
free_bprm(bprm);
kfree(pathbuf);
putname(filename);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index af0eeba..0540735 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1178,6 +1178,22 @@ struct mempolicy;
struct pipe_inode_info;
struct uts_namespace;
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+struct restartable_sequence_state {
+ /* Start and end of an address space's critical section. */
+ void __user *crit_start, __user *crit_end;
+ /* Where preempted threads will be restarted. */
+ void __user *crit_restart;
+ /* Thread's current CPU, typically in TLS. */
+ int __user *cpu_pointer;
+ struct preempt_notifier notifier;
+};
+
+void rseq_clear_state_exec(void);
+#else
+static inline void rseq_clear_state_exec(void) {}
+#endif
+
struct load_weight {
unsigned long weight;
u32 inv_weight;
@@ -1793,6 +1809,9 @@ struct task_struct {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
unsigned long task_state_change;
#endif
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+ struct restartable_sequence_state rseq_state;
+#endif
int pagefault_disabled;
};
@@ -3167,4 +3186,13 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+static inline int rseq_active(struct task_struct *p)
+{
+ return p->rseq_state.cpu_pointer != NULL;
+}
+#else
+static inline int rseq_active(struct task_struct *p) { return 0; }
+#endif
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..6173f56 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,10 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
-
+#define __NR_restartable_sequences 282
+__SYSCALL(__NR_restartable_sequences, sys_restartable_sequences)
#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283
/*
* All syscalls below here should go away really,
diff --git a/init/Kconfig b/init/Kconfig
index 81050e4..a597e30 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2010,6 +2010,15 @@ source "block/Kconfig"
config PREEMPT_NOTIFIERS
bool
+config RESTARTABLE_SEQUENCES
+ bool "Userspace Restartable Sequences (RSEQ)"
+ default n
+ depends on HAVE_RESTARTABLE_SEQUENCE_SUPPORT && PREEMPT_NOTIFIERS
+ help
+ Allows binaries to define a region of user-text within which
+ execution will be restarted in the event of signal delivery or
+ preemption.
+
config PADATA
depends on SMP
bool
diff --git a/kernel/Makefile b/kernel/Makefile
index 60c302c..01eea12 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
+obj-$(CONFIG_RESTARTABLE_SEQUENCES) += restartable_sequences.o
$(obj)/configs.o: $(obj)/config_data.h
diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
new file mode 100644
index 0000000..72945f2
--- /dev/null
+++ b/kernel/restartable_sequences.c
@@ -0,0 +1,185 @@
+/*
+ * Restartable Sequences are a lightweight interface that allows user-level
+ * code to be executed atomically relative to scheduler preemption. Typically
+ * used for implementing per-cpu operations.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <[email protected]> and Andrew Hunter <[email protected]>
+ *
+ */
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+
+#include <linux/uaccess.h>
+#include <linux/preempt.h>
+#include <linux/syscalls.h>
+
+static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
+static void rseq_sched_out_nop(struct preempt_notifier *pn,
+ struct task_struct *next) {}
+
+static __read_mostly struct preempt_ops rseq_preempt_ops = {
+ .sched_in = rseq_sched_in_nop,
+ .sched_out = rseq_sched_out_nop,
+};
+
+int rseq_register_cpu_pointer_current(int __user *cpu_pointer)
+{
+ struct restartable_sequence_state *rseq_state =
+ &current->rseq_state;
+ int registered = 0, rc = 0;
+
+ if (cpu_pointer == rseq_state->cpu_pointer)
+ return 0;
+
+ if (cpu_pointer && !access_ok(VERIFY_WRITE, cpu_pointer, sizeof(int)))
+ return -EINVAL;
+
+ rcu_read_lock();
+ /* Group leader always holds critical section definition. */
+ if (cpu_pointer && !current->group_leader->rseq_state.crit_restart) {
+ rc = -EINVAL;
+ goto out_unlock;
+ }
+ smp_rmb(); /* Pairs with setting crit_restart. */
+
+ if (rseq_state->cpu_pointer)
+ registered = 1;
+ rseq_state->cpu_pointer = cpu_pointer;
+
+ if (cpu_pointer && !registered) {
+ preempt_notifier_init(&rseq_state->notifier,
+ &rseq_preempt_ops);
+ preempt_notifier_register(&rseq_state->notifier);
+ } else if (!cpu_pointer && registered) {
+ preempt_notifier_unregister(&rseq_state->notifier);
+ }
+
+ /* Will update *cpu_pointer on return. */
+ if (cpu_pointer)
+ set_thread_flag(TIF_NOTIFY_RESUME);
+out_unlock:
+ rcu_read_unlock();
+
+ return rc;
+}
+
+void rseq_clear_state_exec(void)
+{
+ /* Ensure notifier is disabled. */
+ rseq_register_cpu_pointer_current(NULL);
+ memset(&current->rseq_state, 0, sizeof(current->rseq_state));
+}
+
+static DEFINE_MUTEX(rseq_state_mutex);
+
+int rseq_register_critical_current(__user void *start, __user void *end,
+ __user void *restart)
+{
+ struct restartable_sequence_state *rseq_state;
+ int rc = 0;
+
+ rcu_read_lock();
+ /* The critical section is shared by all threads in a process. */
+ rseq_state = &current->group_leader->rseq_state;
+
+ /* [start,end) must not overlap with the restart handler. */
+ if (start >= end || (restart >= start && restart < end)) {
+ rc = -EINVAL;
+ goto out_rcu;
+ }
+
+ if (!access_ok(VERIFY_READ, start, end - start) ||
+ !access_ok(VERIFY_READ, restart, 1)) {
+ rc = -EINVAL;
+ goto out_rcu;
+ }
+
+ mutex_lock(&rseq_state_mutex);
+ /*
+ * We (currently) only allow RSEQ to be configured once. This
+ * simplifies synchronization with updates and reduces the risk of
+ * colliding critical sections.
+ */
+ if (rseq_state->crit_restart) {
+ rc = -EBUSY;
+ } else {
+ rseq_state->crit_start = start;
+ rseq_state->crit_end = end;
+ smp_wmb(); /* synchronize on visibility of crit_restart. */
+ rseq_state->crit_restart = restart;
+ }
+ mutex_unlock(&rseq_state_mutex);
+out_rcu:
+ rcu_read_unlock();
+ return rc;
+}
+
+#define SYS_RSEQ_SET_CRITICAL 0
+#define SYS_RSEQ_SET_CPU_POINTER 1
+
+/*
+ * RSEQ syscall interface.
+ *
+ * Usage:
+ * restartable_sequences(SYS_RSEQ_SET_CRITICAL, flags,
+ *                       crit_start, crit_end, crit_restart)
+ *  A thread with user rip in [crit_start, crit_end) that has called
+ *  RSEQ_SET_CPU_POINTER will have its execution resumed at crit_restart
+ *  when interrupted by preemption or signal delivery.
+ *
+ * restartable_sequences(SYS_RSEQ_SET_CPU_POINTER, flags, cpu_pointer_address)
+ *  Configures a (typically per-thread) location which is kept up to date
+ *  with the cpu that thread is currently executing on.
+ *  REQUIRES: SYS_RSEQ_SET_CRITICAL must have previously been called.
+ *
+ * flags is currently unused.
+ *
+ * Note: RSEQ_SET_CRITICAL may currently only be called once within an address
+ * space. This could be made more general (e.g. by using RCU to synchronize
+ * region updates). However, that also introduces the risk of corruption in
+ * the case that more than one caller competes for control of the critical
+ * section.
+ */
+SYSCALL_DEFINE5(restartable_sequences,
+ int, op, int, flags, long, val1, long, val2, long, val3)
+{
+ int rc = -EINVAL;
+
+ if (op == SYS_RSEQ_SET_CRITICAL) {
+ /* Defines (process-wide) critical section. */
+ __user void *crit_start = (__user void *)val1;
+ __user void *crit_end = (__user void *)val2;
+ __user void *crit_restart = (__user void *)val3;
+ rc = rseq_register_critical_current(
+ crit_start, crit_end, crit_restart);
+ } else if (op == SYS_RSEQ_SET_CPU_POINTER) {
+ /*
+ * Enables RSEQ for this thread; sets location for CPU update
+ * to val1.
+ */
+ int __user *cpu = (int __user *)val1;
+ rc = rseq_register_cpu_pointer_current(cpu);
+ }
+
+ return rc;
+}
+#else
+SYSCALL_DEFINE0(restartable_sequences)
+{
+ return -ENOSYS;
+}
+#endif
+
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 921a754..1113565 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1868,6 +1868,10 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+ memset(&p->rseq_state, 0, sizeof(p->rseq_state));
+#endif
}
#ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f10a445..24d4fac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -947,6 +947,9 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
set_task_rq(p, cpu);
#ifdef CONFIG_SMP
+ if (rseq_active(p))
+ set_tsk_thread_flag(p, TIF_NOTIFY_RESUME);
+
/*
* After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
* successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..4b109d9 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -243,3 +243,6 @@ cond_syscall(sys_bpf);
/* execveat */
cond_syscall(sys_execveat);
+
+/* restartable sequences */
+cond_syscall(sys_restartable_sequences);
Implements the x86 (i386 & x86-64) ABIs for interrupting and restarting
execution within restartable sequence sections.
With respect to the x86-specific ABI:
On 32-bit: Upon restart, the interrupted rip is placed in %ecx
On 64-bit (or x32): Upon restart, the interrupted rip is placed in %r10
While potentially surprising at first glance, this choice is strongly motivated
by the fact that the available scratch registers under the i386 function call
ABI overlap with those used as argument registers under x86_64.
Given that sequences are already personality specific and that we always want
the arguments to be available for sequence restart, it's much more natural to
ultimately differentiate the ABI in these two cases.
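To sketch how a (multiplexing) restart handler might consume this register,
consider the following illustrative C. In practice the handler is written in
assembly (see the selftests later in this series) so that the interrupted rip
can be used before the register is clobbered, and the table contents below are
hypothetical:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct rseq_region {
          uintptr_t start, end;   /* [start, end) of one multiplexed sequence */
          uintptr_t restart;      /* where that sequence should resume/retry  */
  };

  /* Hypothetical table; real entries come from assembly labels. */
  static const struct rseq_region regions[1];

  /*
   * Given the interrupted rip (delivered in %r10 on x86-64, %ecx on i386),
   * return the address at which the matching sequence should resume.
   */
  uintptr_t rseq_dispatch_restart(uintptr_t interrupted_ip)
  {
          size_t i;

          for (i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
                  if (interrupted_ip >= regions[i].start &&
                      interrupted_ip < regions[i].end)
                          return regions[i].restart;
          }
          fprintf(stderr, "rseq: unknown restart address %p\n",
                  (void *)interrupted_ip);
          abort();
  }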
Signed-off-by: Paul Turner <[email protected]>
---
arch/x86/include/asm/restartable_sequences.h | 50 +++++++++++++++++++
arch/x86/kernel/Makefile | 2 +
arch/x86/kernel/restartable_sequences.c | 69 ++++++++++++++++++++++++++
arch/x86/kernel/signal.c | 12 +++++
kernel/restartable_sequences.c | 11 +++-
5 files changed, 141 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/include/asm/restartable_sequences.h
create mode 100644 arch/x86/kernel/restartable_sequences.c
diff --git a/arch/x86/include/asm/restartable_sequences.h b/arch/x86/include/asm/restartable_sequences.h
new file mode 100644
index 0000000..0ceb024
--- /dev/null
+++ b/arch/x86/include/asm/restartable_sequences.h
@@ -0,0 +1,50 @@
+#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
+#define _ASM_X86_RESTARTABLE_SEQUENCES_H
+
+#include <asm/processor.h>
+#include <asm/ptrace.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_RESTARTABLE_SEQUENCES
+
+static inline bool arch_rseq_in_crit_section(struct task_struct *p,
+ struct pt_regs *regs)
+{
+ struct task_struct *leader = p->group_leader;
+ struct restartable_sequence_state *rseq_state = &leader->rseq_state;
+
+ unsigned long ip = (unsigned long)regs->ip;
+ if (unlikely(ip < (unsigned long)rseq_state->crit_end &&
+ ip >= (unsigned long)rseq_state->crit_start))
+ return true;
+
+ return false;
+}
+
+static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
+{
+#ifdef CONFIG_PREEMPT
+ /*
+ * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
+ * case that we took an interrupt during syscall entry. Avoid this by
+ * always deferring to our notify-resume handler.
+ */
+ return true;
+#else
+ return arch_rseq_in_crit_section(p, task_pt_regs(p));
+#endif
+}
+
+void arch_rseq_handle_notify_resume(struct pt_regs *regs);
+void arch_rseq_check_critical_section(struct task_struct *p,
+ struct pt_regs *regs);
+
+#else /* !CONFIG_RESTARTABLE_SEQUENCES */
+
+static inline void arch_rseq_handle_notify_resume(struct pt_regs *regs) {}
+static inline void arch_rseq_check_critical_section(struct task_struct *p,
+ struct pt_regs *regs) {}
+
+#endif
+
+#endif /* _ASM_X86_RESTARTABLE_SEQUENCES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index febaf18..bd7827d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -113,6 +113,8 @@ obj-$(CONFIG_TRACING) += tracepoint.o
obj-$(CONFIG_IOSF_MBI) += iosf_mbi.o
obj-$(CONFIG_PMC_ATOM) += pmc_atom.o
+obj-$(CONFIG_RESTARTABLE_SEQUENCES) += restartable_sequences.o
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/restartable_sequences.c b/arch/x86/kernel/restartable_sequences.c
new file mode 100644
index 0000000..3b38013
--- /dev/null
+++ b/arch/x86/kernel/restartable_sequences.c
@@ -0,0 +1,69 @@
+/*
+ * Restartable Sequences: x86 ABI.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <[email protected]> and Andrew Hunter <[email protected]>
+ *
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <asm/restartable_sequences.h>
+
+void arch_rseq_check_critical_section(struct task_struct *p,
+ struct pt_regs *regs)
+{
+ if (!arch_rseq_in_crit_section(p, regs))
+ return;
+
+ /* RSEQ only applies to user-mode execution */
+ BUG_ON(!user_mode(regs));
+
+ /*
+ * The ABI is slightly different for {32,64}-bit threads on x86
+ *
+ * Short version:
+ * x86-64 (or x32): interrupted rip => %r10
+ * i386: interrupted rip => %ecx
+ *
+ * Longer version:
+ * The scratch registers available under the i386 function call ABI
+ * overlap with those used by argument registers under the x86_64 ABI.
+ *
+ * Given that the sequence block is already personality specific in
+ * that it must be entered by 'call' and that we always want the
+ * arguments available for a sequence restart; it's more natural to
+ * differentiate the ABI in these two cases.
+ */
+ if (unlikely(test_tsk_thread_flag(p, TIF_IA32)))
+ regs->cx = regs->ip; /* i386 */
+ else
+ regs->r10 = regs->ip; /* x86-64/x32 */
+
+ regs->ip = (unsigned long)p->group_leader->rseq_state.crit_restart;
+}
+
+void arch_rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ struct restartable_sequence_state *rseq_state = &current->rseq_state;
+
+ /* If this update fails our user-state is incoherent. */
+ if (put_user(task_cpu(current), rseq_state->cpu_pointer))
+ force_sig(SIGSEGV, current);
+
+ arch_rseq_check_critical_section(current, regs);
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 206996c..987c50b 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -31,6 +31,7 @@
#include <asm/vdso.h>
#include <asm/mce.h>
#include <asm/sighandling.h>
+#include <asm/restartable_sequences.h>
#ifdef CONFIG_X86_64
#include <asm/proto.h>
@@ -617,6 +618,15 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
sigset_t *set = sigmask_to_save();
compat_sigset_t *cset = (compat_sigset_t *) set;
+ /*
+ * If we are executing in the critical section of a restartable
+ * sequence we need to fix up the user's stack saved ip at this point
+ * so that signal handler return does not allow us to jump back into
+ * the block across a context switch boundary.
+ */
+ if (rseq_active(current))
+ arch_rseq_check_critical_section(current, regs);
+
/* Set up the stack frame */
if (is_ia32_frame()) {
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -755,6 +765,8 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+ if (rseq_active(current))
+ arch_rseq_handle_notify_resume(regs);
}
if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
fire_user_return_notifiers();
diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
index 72945f2..9102241 100644
--- a/kernel/restartable_sequences.c
+++ b/kernel/restartable_sequences.c
@@ -24,17 +24,22 @@
#ifdef CONFIG_RESTARTABLE_SEQUENCES
+#include <asm/restartable_sequences.h>
#include <linux/uaccess.h>
#include <linux/preempt.h>
#include <linux/syscalls.h>
static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
-static void rseq_sched_out_nop(struct preempt_notifier *pn,
- struct task_struct *next) {}
+static void rseq_sched_out(struct preempt_notifier *pn,
+ struct task_struct *next)
+{
+ if (arch_rseq_needs_notify_resume(current))
+ set_thread_flag(TIF_NOTIFY_RESUME);
+}
static __read_mostly struct preempt_ops rseq_preempt_ops = {
.sched_in = rseq_sched_in_nop,
- .sched_out = rseq_sched_out_nop,
+ .sched_out = rseq_sched_out,
};
int rseq_register_cpu_pointer_current(int __user *cpu_pointer)
Implements two basic tests of RSEQ functionality.
The first, "basic_test", only asserts that RSEQ works moderately correctly.
E.g. that:
- The CPUID pointer works
- Code infinitely looping within a critical section will eventually be
interrupted.
"basic_percpu_ops_test" is a slightly more "realistic" variant, implementing a
few simple per-cpu operations and testing their correctness. It also includes
a trivial example of how user-space may multiplex the critical section via the
restart handler.
Signed-off-by: Paul Turner <[email protected]>
---
tools/testing/selftests/rseq/Makefile | 15 +
.../testing/selftests/rseq/basic_percpu_ops_test.S | 131 ++++++++++
.../testing/selftests/rseq/basic_percpu_ops_test.c | 250 ++++++++++++++++++++
tools/testing/selftests/rseq/basic_test.c | 76 ++++++
tools/testing/selftests/rseq/rseq.c | 48 ++++
tools/testing/selftests/rseq/rseq.h | 28 ++
6 files changed, 548 insertions(+)
create mode 100644 tools/testing/selftests/rseq/Makefile
create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.S
create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
create mode 100644 tools/testing/selftests/rseq/basic_test.c
create mode 100644 tools/testing/selftests/rseq/rseq.c
create mode 100644 tools/testing/selftests/rseq/rseq.h
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 0000000..c5a2b47
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,15 @@
+CFLAGS += -Wall
+LDFLAGS += -lpthread
+
+TESTS = basic_test basic_percpu_ops_test
+
+basic_percpu_ops_test: basic_percpu_ops_test.c basic_percpu_ops_test.S
+
+all: $(TESTS)
+%: %.c
+ $(CC) $(CFLAGS) -o $@ $^ rseq.c $(LDFLAGS)
+
+include ../lib.mk
+
+clean:
+ $(RM) $(TESTS)
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.S b/tools/testing/selftests/rseq/basic_percpu_ops_test.S
new file mode 100644
index 0000000..7da7781
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.S
@@ -0,0 +1,131 @@
+#include "rseq.h"
+
+#ifdef __x86_64__
+ .text
+ .code64
+
+#define FETCH_CPU(dest) movl %fs:__rseq_current_cpu@TPOFF, dest
+#define CRITICAL_SECTION_OFFSET(label) $label
+
+/* If start <= %RESTART_ADDR_REG < %end, jump to jump_to */
+#define HANDLE_REGION(start, end, jump_to) \
+ cmpq CRITICAL_SECTION_OFFSET(end), %RESTART_ADDR_REG; \
+ jge 1f; \
+ cmpq CRITICAL_SECTION_OFFSET(start), %RESTART_ADDR_REG; \
+ jge jump_to; \
+ 1:;
+
+#define HANDLE_REGION_PREFIX(prefix, start, end, jump_to) \
+ HANDLE_REGION(prefix##start, prefix##end, prefix##jump_to)
+
+/*-----------------------------------------------------------------------------
+ * Start of actual restartable sequences.
+ *---------------------------------------------------------------------------*/
+ .align 8
+ .globl RSEQ_CRITICAL_SECTION_START
+RSEQ_CRITICAL_SECTION_START:
+/* int rseq_percpu_lock() */
+ .globl rseq_percpu_lock
+ .type rseq_percpu_lock, @function
+rseq_percpu_lock:
+ .cfi_startproc
+rseq_percpu_lock_region0:
+ FETCH_CPU(%eax)
+ leaq (,%eax,8), %RESTART_ADDR_REG
+ leaq (%rdi,%RESTART_ADDR_REG,8), %RESTART_ADDR_REG
+rseq_percpu_lock_retry:
+ cmpw $0, (%RESTART_ADDR_REG)
+ jne rseq_percpu_lock_retry
+ movw $1, (%RESTART_ADDR_REG) /* 1 => lock owned */
+rseq_percpu_lock_region1:
+ ret
+rseq_percpu_lock_region2:
+ .cfi_endproc
+
+/*
+ * int rseq_cmpxchg(int cpu, intptr_t *p, intptr_t old, intptr_t new)
+ * int rseq_percpu_cmpxchgcheck(int cpu, intptr_t *p,
+ * intptr_t old, intptr_t new,
+ * intptr_t *check_ptr, intptr_t check_val)
+ *
+ * NOTE: We don't use cmpxchg in the implementation below as that would make
+ * checking the success of our commit operation dependent on flags (which
+ * are in turn clobbered by the restart region) -- furthermore, we can't just
+ * retry to fill in the flags since the restarted cmpxchg may have actually
+ * succeeded, spuriously failing subsequent attempts.
+ */
+
+ .globl rseq_percpu_cmpxchg
+ .type rseq_percpu_cmpxchg, @function
+rseq_percpu_cmpxchg:
+ .cfi_startproc
+rseq_percpu_cmpxchg_region0:
+ FETCH_CPU(%eax)
+ cmp %eax, %edi /* check cpu vs current_cpu */
+ jne rseq_percpu_cmpxchg_region1
+ cmp %rdx, (%rsi) /* verify *p == old */
+ jne rseq_percpu_cmpxchg_region2
+ mov %rcx, (%rsi)
+rseq_percpu_cmpxchg_region1:
+ ret /* return current cpu, indicating mismatch OR success */
+rseq_percpu_cmpxchg_region2:
+ mov $-1, %eax /* mismatch versus "old" or "check", return -1 */
+ ret
+rseq_percpu_cmpxchg_region3:
+ .cfi_endproc
+
+ .globl rseq_percpu_cmpxchgcheck
+ .type rseq_percpu_cmpxchgcheck, @function
+rseq_percpu_cmpxchgcheck:
+ .cfi_startproc
+rseq_percpu_cmpxchgcheck_region0:
+ FETCH_CPU(%eax)
+ cmp %eax, %edi /* check cpu vs current_cpu */
+ jne rseq_percpu_cmpxchgcheck_region1
+ cmp %rdx, (%rsi) /* verify *p == old */
+ jne rseq_percpu_cmpxchgcheck_region2
+ cmp %r9, (%r8) /* verify *check_ptr == check_val */
+ jne rseq_percpu_cmpxchgcheck_region2
+ mov %rcx, (%rsi)
+rseq_percpu_cmpxchgcheck_region1:
+ ret /* return current cpu, indicating mismatch OR success */
+rseq_percpu_cmpxchgcheck_region2:
+ mov $-1, %eax /* mismatch versus "old" or "check", return -1 */
+ ret
+rseq_percpu_cmpxchgcheck_region3:
+ .cfi_endproc
+
+ .align 8
+ .globl RSEQ_CRITICAL_SECTION_END
+RSEQ_CRITICAL_SECTION_END:
+
+/*-----------------------------------------------------------------------------
+ * Restart handler
+ * NOTE: per the ABI, %RESTART_ADDR_REG holds the program-counter we were interrupted at.
+ *----------------------------------------------------------------------------
+ */
+
+ .align 8
+ .globl RSEQ_RESTART_HANDLER
+ .type RSEQ_RESTART_HANDLER, @function
+RSEQ_RESTART_HANDLER:
+ .cfi_startproc
+ /* There are several ways to implement this more efficiently. */
+ HANDLE_REGION_PREFIX(rseq_percpu_lock_region, 0, 1, 0)
+ HANDLE_REGION_PREFIX(rseq_percpu_lock_region, 1, 2, 1)
+
+ HANDLE_REGION_PREFIX(rseq_percpu_cmpxchg_region, 0, 1, 0)
+ HANDLE_REGION_PREFIX(rseq_percpu_cmpxchg_region, 1, 2, 1)
+ HANDLE_REGION_PREFIX(rseq_percpu_cmpxchg_region, 2, 3, 2)
+
+ HANDLE_REGION_PREFIX(rseq_percpu_cmpxchgcheck_region, 0, 1, 0)
+ HANDLE_REGION_PREFIX(rseq_percpu_cmpxchgcheck_region, 1, 2, 1)
+ HANDLE_REGION_PREFIX(rseq_percpu_cmpxchgcheck_region, 2, 3, 2)
+ /* Not one of our regions: report via the C helper (which exits). */
+ mov %RESTART_ADDR_REG, %rdi
+ call rseq_unknown_restart_addr@PLT
+ .cfi_endproc
+
+/* Don't need/want an executable stack. */
+.section .note.GNU-stack,"",@progbits
+#endif
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 0000000..c6d7e4e
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,250 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "rseq.h"
+
+/* We restrict on !__PIC__ as it greatly simplifies handling of TLS. */
+#if defined(__x86_64__) && !defined(__PIC__)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+
+/* Implemented by percpu_ops.S */
+struct percpu_lock {
+ int word[CPU_SETSIZE][16]; /* cache aligned; lock-word is [cpu][0] */
+};
+
+/* A simple percpu spinlock. Returns the cpu lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock);
+
+/*
+ * cmpxchg [with an additional check value].
+ *
+ * Returns:
+ *  -1 if *p != old [ || *check_ptr != check_val ], otherwise the
+ *  cpu on which rseq_percpu_cmpxchg[check] was executed.
+ * - If this is different from the passed cpu, no modifications were made.
+ *
+ * Note: When specified, check_ptr is dereferenced iff *p == old
+ */
+int rseq_percpu_cmpxchg(int cpu, intptr_t *p, intptr_t old, intptr_t new);
+int rseq_percpu_cmpxchgcheck(int cpu, intptr_t *p, intptr_t old, intptr_t new,
+ intptr_t *check_ptr, intptr_t check_val);
+
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+ barrier(); /* need a release-store here, this suffices on x86. */
+ assert(lock->word[cpu][0] == 1);
+ lock->word[cpu][0] = 0;
+}
+
+void rseq_unknown_restart_addr(void *addr)
+{
+ fprintf(stderr, "rseq: unrecognized restart address %p\n", addr);
+ exit(1);
+}
+
+struct spinlock_test_data {
+ struct percpu_lock lock;
+ int counts[CPU_SETSIZE];
+ int reps;
+};
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+ struct spinlock_test_data *data = arg;
+
+ int i, cpu;
+ rseq_configure_cpu_pointer();
+ for (i = 0; i < data->reps; i++) {
+ cpu = rseq_percpu_lock(&data->lock);
+ data->counts[cpu]++;
+ rseq_percpu_unlock(&data->lock, cpu);
+ }
+
+ return 0;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu lock.
+ * Obviously real applications might prefer to simply use a per-cpu increment;
+ * however, this is reasonable for a test and the lock can be extended to
+ * synchronize more complicated operations.
+ */
+void test_percpu_spinlock()
+{
+ int i, sum;
+ pthread_t test_threads[200];
+ struct spinlock_test_data data;
+
+ memset(&data, 0, sizeof(data));
+ data.reps = 5000;
+
+ for (i = 0; i < 200; i++)
+ pthread_create(&test_threads[i], NULL,
+ test_percpu_spinlock_thread, &data);
+
+ for (i = 0; i < 200; i++)
+ pthread_join(test_threads[i], NULL);
+
+ sum = 0;
+ for (i = 0; i < CPU_SETSIZE; i++)
+ sum += data.counts[i];
+
+ assert(sum == data.reps * 200);
+}
+
+struct percpu_list_node {
+ intptr_t data;
+ struct percpu_list_node *next;
+};
+
+struct percpu_list {
+ struct percpu_list_node *heads[CPU_SETSIZE];
+};
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node)
+{
+ int cpu;
+
+ do {
+ cpu = rseq_current_cpu();
+ node->next = list->heads[cpu];
+ } while (cpu != rseq_percpu_cmpxchg(cpu,
+ (intptr_t *)&list->heads[cpu], (intptr_t)node->next,
+ (intptr_t)node));
+
+ return cpu;
+}
+
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list)
+{
+ int cpu;
+ struct percpu_list_node *head, *next;
+
+ do {
+ cpu = rseq_current_cpu();
+ head = list->heads[cpu];
+ /*
+ * Unlike a traditional lock-less linked list; the availability
+ * of a cmpxchg-check primitive allows us to implement pop
+ * without concerns over ABA-type races.
+ */
+ if (!head) return 0;
+ next = head->next;
+ } while (cpu != rseq_percpu_cmpxchgcheck(cpu,
+ (intptr_t *)&list->heads[cpu], (intptr_t)head, (intptr_t)next,
+ (intptr_t *)&head->next, (intptr_t)next));
+
+ return head;
+}
+
+
+void *test_percpu_list_thread(void *arg)
+{
+ int i;
+ struct percpu_list *list = (struct percpu_list *)arg;
+
+ rseq_configure_cpu_pointer();
+ for (i = 0; i < 100000; i++) {
+ struct percpu_list_node *node = percpu_list_pop(list);
+ sched_yield(); /* encourage shuffling */
+ if (node) percpu_list_push(list, node);
+ }
+
+ return 0;
+}
+
+/*
+ * Implements a per-cpu linked list then shuffles it via popping and pushing
+ * from many threads.
+ */
+void test_percpu_list()
+{
+ int i, j;
+ long sum = 0, expected_sum = 0;
+ struct percpu_list list;
+ pthread_t test_threads[200];
+ cpu_set_t allowed_cpus;
+
+ memset(&list, 0, sizeof(list));
+
+ /* Generate list entries for every usable cpu. */
+ sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &allowed_cpus)) continue;
+ for (j = 1; j <= 100; j++) {
+ struct percpu_list_node *node;
+
+ expected_sum += j;
+
+ node = malloc(sizeof(*node));
+ assert(node);
+ node->data = j;
+ node->next = list.heads[i];
+ list.heads[i] = node;
+ }
+ }
+
+ for (i = 0; i < 200; i++)
+ assert(pthread_create(&test_threads[i], NULL,
+ test_percpu_list_thread, &list) == 0);
+
+ for (i = 0; i < 200; i++)
+ pthread_join(test_threads[i], NULL);
+
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ cpu_set_t pin_mask;
+ struct percpu_list_node *node;
+
+ if (!CPU_ISSET(i, &allowed_cpus)) continue;
+
+ CPU_ZERO(&pin_mask);
+ CPU_SET(i, &pin_mask);
+ sched_setaffinity(0, sizeof(pin_mask), &pin_mask);
+
+ while ((node = percpu_list_pop(&list))) {
+ sum += node->data;
+ free(node);
+ }
+ }
+
+ /*
+ * All entries should now be accounted for (unless some external actor
+ * is interfering with our allowed affinity while this test is
+ * running).
+ */
+ assert(sum == expected_sum);
+}
+
+/* defined by basic_percpu_ops_test.S */
+extern void *RSEQ_CRITICAL_SECTION_START;
+extern void *RSEQ_CRITICAL_SECTION_END;
+extern void *RSEQ_RESTART_HANDLER;
+
+int main(int argc, char **argv)
+{
+ rseq_configure_region(&RSEQ_CRITICAL_SECTION_START,
+ &RSEQ_CRITICAL_SECTION_END,
+ &RSEQ_RESTART_HANDLER);
+ rseq_configure_cpu_pointer();
+
+ test_percpu_spinlock();
+ test_percpu_list();
+
+ return 0;
+}
+
+#else
+int main(int argc, char **argv)
+{
+ fprintf(stderr, "architecture not supported\n");
+ return 0;
+}
+#endif
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index 0000000..cca8edb
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,76 @@
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <stdio.h>
+
+#include "rseq.h"
+
+#define _STRINGIFY(x) #x
+#define STRINGIFY(x) _STRINGIFY(x)
+
+extern void *RSEQ_CRITICAL_SECTION_START;
+extern void *RSEQ_CRITICAL_SECTION_END;
+extern void *RSEQ_RESTART_HANDLER;
+
+/*
+ * Asserts simply that we eventually see *some* event which interrupts our
+ * critical section (which otherwise loops infinitely). This could be
+ * preemption or signal delivery.
+ */
+int test_critical_section()
+{
+ void* restart_address = 0;
+#if defined(__i386__) || defined(__x86_64__)
+ __asm__(
+ ".globl RSEQ_CRITICAL_SECTION_START\n"
+ "RSEQ_CRITICAL_SECTION_START:\n"
+ " jmp RSEQ_CRITICAL_SECTION_START\n" /* while(1) */
+ ".globl RSEQ_CRITICAL_SECTION_END\n"
+ "RSEQ_CRITICAL_SECTION_END:\n"
+ ".globl RSEQ_RESTART_HANDLER\n"
+ "RSEQ_RESTART_HANDLER:\n"
+ " movq %%" STRINGIFY(RESTART_ADDR_REG) ", %0\n"
+ : "=a"(restart_address) ::);
+ assert(restart_address == &RSEQ_CRITICAL_SECTION_START);
+#else
+ fprintf(stderr, "architecture not supported\n");
+#endif
+ return 0;
+}
+
+void test_cpu_pointer()
+{
+ cpu_set_t affinity, test_affinity;
+ int i;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ CPU_ZERO(&test_affinity);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (CPU_ISSET(i, &affinity)) {
+ CPU_SET(i, &test_affinity);
+ sched_setaffinity(0, sizeof(test_affinity),
+ &test_affinity);
+ assert(rseq_current_cpu() == sched_getcpu());
+ assert(rseq_current_cpu() == i);
+ CPU_CLR(i, &test_affinity);
+ }
+ }
+ sched_setaffinity(0, sizeof(affinity), &affinity);
+}
+
+int main(int argc, char **argv)
+{
+ rseq_configure_region(&RSEQ_CRITICAL_SECTION_START,
+ &RSEQ_CRITICAL_SECTION_END,
+ &RSEQ_RESTART_HANDLER);
+ rseq_configure_cpu_pointer();
+
+ test_critical_section();
+ test_cpu_pointer();
+
+ return 0;
+}
+
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
new file mode 100644
index 0000000..c1ea5d8
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -0,0 +1,48 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "rseq.h"
+
+__thread volatile const int __rseq_current_cpu = -1;
+
+#define __NR_rseq 323
+#define SYS_RSEQ_SET_CRITICAL 0
+#define SYS_RSEQ_SET_CPU_POINTER 1
+
+int sys_rseq(int op, int flags, void* val1, void* val2, void* val3)
+{
+ return syscall(__NR_rseq, op, flags,
+ (intptr_t)val1, (intptr_t)val2, (intptr_t)val3);
+}
+
+static void sys_rseq_checked(int op, int flags,
+ void* val1, void* val2, void* val3)
+{
+ int rc = sys_rseq(op, flags, val1, val2, val3);
+ if (rc) {
+ fprintf(stderr,"sys_rseq(%d, %d, %p, %p, %p) failed(%d): %s\n",
+ op, flags, val1, val2, val3, errno, strerror(errno));
+ exit(1);
+ }
+}
+
+void rseq_configure_region(void *rseq_text_start, void *rseq_text_end,
+ void *rseq_restart_handler)
+{
+ sys_rseq_checked(SYS_RSEQ_SET_CRITICAL, 0,
+ rseq_text_start, rseq_text_end, rseq_restart_handler);
+}
+
+void rseq_configure_cpu_pointer(void)
+{
+ sys_rseq_checked(SYS_RSEQ_SET_CPU_POINTER, 0,
+ (void*)&__rseq_current_cpu, 0, 0);
+ assert(rseq_current_cpu() != -1); /* always updated prior to return. */
+}
+
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
new file mode 100644
index 0000000..91bb655
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -0,0 +1,28 @@
+#ifndef RSEQ_TEST_H
+#define RSEQ_TEST_H
+
+#if defined(__i386__)
+#define RESTART_ADDR_REG ecx
+#elif defined(__x86_64__)
+#define RESTART_ADDR_REG r10
+#else
+#define RESTART_ADDR_REG unknown
+#endif
+
+#ifndef __ASSEMBLER__
+int sys_rseq(int op, int flags, void* val1, void* val2, void* val3);
+/* RSEQ provided thread-local current_cpu */
+
+void rseq_configure_cpu_pointer(void);
+
+void rseq_configure_region(void *rseq_text_start, void *rseq_text_end,
+ void *rseq_restart_handler);
+
+
+extern __thread volatile const int __rseq_current_cpu;
+static inline int rseq_current_cpu(void) { return __rseq_current_cpu; }
+
+void run_tests();
+#endif
+
+#endif
On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner <[email protected]> wrote:
> This is a fairly small series demonstrating a feature we've found to be quite
> powerful in practice, "restartable sequences".
>
On an extremely short glance, I'm starting to think that the right
approach, at least for x86, is to implement per-cpu gsbase. Then you
could do cmpxchg with a gs prefix to atomically take a percpu lock and
atomically release a percpu lock and check whether someone else stole
the lock from you. (Note: cmpxchg, unlike lock cmpxchg, is very
fast.)
This is totally useless for other architectures, but I think it would
be reasonably clean on x86. Thoughts?
I can elaborate if necessary.
--Andy
On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner <[email protected]> wrote:
>> This is a fairly small series demonstrating a feature we've found to be quite
>> powerful in practice, "restartable sequences".
>>
>
> On an extremely short glance, I'm starting to think that the right
> approach, at least for x86, is to implement per-cpu gsbase. Then you
> could do cmpxchg with a gs prefix to atomically take a percpu lock and
> atomically release a percpu lock and check whether someone else stole
> the lock from you. (Note: cmpxchg, unlike lock cmpxchg, is very
> fast.)
>
> This is totally useless for other architectures, but I think it would
> be reasonably clean on x86. Thoughts?
So this gives semantics that are obviously similar to this_cpu().
This allows reasonable per-cpu counters (which alone is almost
sufficient for a strong user-space RCU implementation, giving
this some legs).
However, unless there's a nice implementation trick I'm missing, the
thing that stands out to me for locks (or other primitives) is that
this forces a two-phase commit. There's no way (short of say,
cmpxchg16b) to perform a write conditional on the lock not having been
stolen from us (and subsequently release the lock).
e.g.
1) We take the lock in some sort of speculative mode, such that
another thread on the same cpu is still allowed to steal it from us
2) We prepare what we want to commit
3) At this point we have to promote the lock taken in (1) to perform
our actual commit, or see that someone else has stolen (1)
4) Release the promoted lock in (3)
However, this means that if we're preempted at (3) then no other
thread on that cpu can make progress until we've been rescheduled and
released the lock; a nice property of the model we have today is that
threads sharing a cpu can not impede each other beyond what the
scheduler allows.
A lesser concern, but worth mentioning, is that there are also
potential pitfalls in the interaction with signal handlers,
particularly if a 2-phase commit is used.
- Paul
----- On Jun 24, 2015, at 10:54 PM, Paul Turner [email protected] wrote:
> On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner <[email protected]> wrote:
>>> This is a fairly small series demonstrating a feature we've found to be quite
>>> powerful in practice, "restartable sequences".
>>>
>>
>> On an extremely short glance, I'm starting to think that the right
>> approach, at least for x86, is to implement per-cpu gsbase. Then you
>> could do cmpxchg with a gs prefix to atomically take a percpu lock and
>> atomically release a percpu lock and check whether someone else stole
>> the lock from you. (Note: cmpxchg, unlike lock cmpxchg, is very
>> fast.)
>>
>> This is totally useless for other architectures, but I think it would
>> be reasonably clean on x86. Thoughts?
>
> So this gives semantics that are obviously similar to this_cpu().
> This allows reasonable per-cpu counters (which alone is almost
> sufficient for a strong user-space RCU implementation, giving
> this some legs).
>
> However, unless there's a nice implementation trick I'm missing, the
> thing that stands out to me for locks (or other primitives) is that
> this forces a two-phase commit. There's no way (short of say,
> cmpxchg16b) to perform a write conditional on the lock not having been
> stolen from us (and subsequently release the lock).
>
> e.g.
> 1) We take the lock in some sort of speculative mode, such that
> another thread on the same cpu is still allowed to steal it from us
> 2) We prepare what we want to commit
> 3) At this point we have to promote the lock taken in (1) to perform
> our actual commit, or see that someone else has stolen (1)
> 4) Release the promoted lock in (3)
>
> However, this means that if we're preempted at (3) then no other
> thread on that cpu can make progress until we've been rescheduled and
> released the lock; a nice property of the model we have today is that
> threads sharing a cpu can not impede each other beyond what the
> scheduler allows.
>
> A lesser concern, but worth mentioning, is that there are also
> potential pitfalls in the interaction with signal handlers,
> particularly if a 2-phase commit is used.
Assuming we have a gs segment we can use to address per-cpu locks
in userspace, would the following scheme take care of some of your
concerns ?
per-cpu int32_t: each lock initialized to "cpu_nr" value
per-cpu lock:
get current cpu number. Remember this value as "CPU lock nr".
use cmpxchg on gs:lock to grab the lock.
- Expect old value to be "CPU lock nr".
- Update with a lock flag in most significant bit, "CPU lock nr"
in lower bits.
- Retry if fails. Can be caused by migration or lock being already
held.
per-cpu unlock:
clear lock flag within the "CPU lock nr" lock.
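In rough C, using the __atomic builtins and a plain per-cpu array as a
stand-in for the %gs-addressed mapping (with a real %gs mapping, the
expected-value check is what catches migration between reading the cpu
number and the cmpxchg), something like:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdint.h>

  #define NR_CPUS_MAX   4096
  #define LOCK_FLAG     (1u << 31)

  /* Stand-in for the proposed %gs-addressed per-cpu lock words; per the
   * scheme above, each entry must first be initialized to its cpu number. */
  static uint32_t percpu_lock[NR_CPUS_MAX];

  int percpu_lock_acquire(void)
  {
          for (;;) {
                  uint32_t cpu = sched_getcpu();  /* "CPU lock nr" */
                  uint32_t expected = cpu;        /* unlocked value */

                  if (__atomic_compare_exchange_n(&percpu_lock[cpu], &expected,
                                                  cpu | LOCK_FLAG, 0,
                                                  __ATOMIC_ACQUIRE,
                                                  __ATOMIC_RELAXED))
                          return cpu;
                  /* Failed: migrated, or the lock is already held; retry. */
          }
  }

  void percpu_lock_release(int cpu)
  {
          /* Clear the lock flag within the "CPU lock nr" lock word. */
          __atomic_store_n(&percpu_lock[cpu], (uint32_t)cpu, __ATOMIC_RELEASE);
  }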
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Jun 25, 2015 at 6:15 PM, Mathieu Desnoyers
<[email protected]> wrote:
> ----- On Jun 24, 2015, at 10:54 PM, Paul Turner [email protected] wrote:
>
>> On Wed, Jun 24, 2015 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Wed, Jun 24, 2015 at 3:26 PM, Paul Turner <[email protected]> wrote:
>>>> This is a fairly small series demonstrating a feature we've found to be quite
>>>> powerful in practice, "restartable sequences".
>>>>
>>>
>>> On an extremely short glance, I'm starting to think that the right
>>> approach, at least for x86, is to implement per-cpu gsbase. Then you
>>> could do cmpxchg with a gs prefix to atomically take a percpu lock and
>>> atomically release a percpu lock and check whether someone else stole
>>> the lock from you. (Note: cmpxchg, unlike lock cmpxchg, is very
>>> fast.)
>>>
>>> This is totally useless for other architectures, but I think it would
>>> be reasonably clean on x86. Thoughts?
>>
>> So this gives semantics that are obviously similar to this_cpu().
>> This allows reasonable per-cpu counters (which alone is almost
>> sufficient for a strong user-space RCU implementation, giving
>> this some legs).
>>
>> However, unless there's a nice implementation trick I'm missing, the
>> thing that stands out to me for locks (or other primitives) is that
>> this forces a two-phase commit. There's no way (short of say,
>> cmpxchg16b) to perform a write conditional on the lock not having been
>> stolen from us (and subsequently release the lock).
>>
>> e.g.
>> 1) We take the lock in some sort of speculative mode, such that
>> another thread on the same cpu is still allowed to steal it from us
>> 2) We prepare what we want to commit
>> 3) At this point we have to promote the lock taken in (1) to perform
>> our actual commit, or see that someone else has stolen (1)
>> 4) Release the promoted lock in (3)
>>
>> However, this means that if we're preempted at (3) then no other
>> thread on that cpu can make progress until we've been rescheduled and
>> released the lock; a nice property of the model we have today is that
>> threads sharing a cpu can not impede each other beyond what the
>> scheduler allows.
>>
>> A lesser concern, but worth mentioning, is that there are also
>> potential pitfalls in the interaction with signal handlers,
>> particularly if a 2-phase commit is used.
>
> Assuming we have a gs segment we can use to address per-cpu locks
> in userspace, would the following scheme take care of some of your
> concerns ?
>
> per-cpu int32_t: each lock initialized to "cpu_nr" value
>
> per-cpu lock:
> get current cpu number. Remember this value as "CPU lock nr".
> use cmpxchg on gs:lock to grab the lock.
> - Expect old value to be "CPU lock nr".
> - Update with a lock flag in most significant bit, "CPU lock nr"
> in lower bits.
> - Retry if fails. Can be caused by migration or lock being already
> held.
So this is equivalent to just taking the lock in a non-speculative
mode (e.g. non-stealable) from the start.
The challenge remains exactly the same: as soon as an exclusive mode
exists you have some 'critical' inner critical section around the
commit, which impedes the progress of other threads if it is interrupted.
[The only reason you'd take the lock in any sort of speculative mode
above would be to reduce the length of this.]
I definitely agree that this gets us locks and counters; there are lots
of good things we can do with those, and they alone may be sufficient
to justify implementing it this way. But compared to what can be done
with full lockless semantics, I feel this leaves a lot of power on
the table. As a concrete example, consider what this does to the
complexity and usefulness of Pop().
>
> per-cpu unlock:
> clear lock flag within the "CPU lock nr" lock.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
----- On Jun 24, 2015, at 6:26 PM, Paul Turner [email protected] wrote:
> Implements the x86 (i386 & x86-64) ABIs for interrupting and restarting
> execution within restartable sequence sections.
>
> With respect to the x86-specific ABI:
> On 32-bit: Upon restart, the interrupted rip is placed in %ecx
> On 64-bit (or x32): Upon restart, the interrupted rip is placed in %r10
>
> While potentially surprising at first glance, this choice is strongly motivated
> by the fact that the available scratch registers under the i386 function call
> ABI overlap with those used as argument registers under x86_64.
>
> Given that sequences are already personality specific and that we always want
> the arguments to be available for sequence restart, it's much more natural to
> ultimately differentiate the ABI in these two cases.
>
> Signed-off-by: Paul Turner <[email protected]>
> ---
> arch/x86/include/asm/restartable_sequences.h | 50 +++++++++++++++++++
> arch/x86/kernel/Makefile | 2 +
> arch/x86/kernel/restartable_sequences.c | 69 ++++++++++++++++++++++++++
> arch/x86/kernel/signal.c | 12 +++++
> kernel/restartable_sequences.c | 11 +++-
> 5 files changed, 141 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/include/asm/restartable_sequences.h
> create mode 100644 arch/x86/kernel/restartable_sequences.c
>
> diff --git a/arch/x86/include/asm/restartable_sequences.h
> b/arch/x86/include/asm/restartable_sequences.h
> new file mode 100644
> index 0000000..0ceb024
> --- /dev/null
> +++ b/arch/x86/include/asm/restartable_sequences.h
> @@ -0,0 +1,50 @@
> +#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H
> +#define _ASM_X86_RESTARTABLE_SEQUENCES_H
> +
> +#include <asm/processor.h>
> +#include <asm/ptrace.h>
> +#include <linux/sched.h>
> +
> +#ifdef CONFIG_RESTARTABLE_SEQUENCES
> +
> +static inline bool arch_rseq_in_crit_section(struct task_struct *p,
> + struct pt_regs *regs)
> +{
> + struct task_struct *leader = p->group_leader;
> + struct restartable_sequence_state *rseq_state = &leader->rseq_state;
> +
> + unsigned long ip = (unsigned long)regs->ip;
> + if (unlikely(ip < (unsigned long)rseq_state->crit_end &&
> + ip >= (unsigned long)rseq_state->crit_start))
> + return true;
> +
> + return false;
> +}
> +
> +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
> +{
> +#ifdef CONFIG_PREEMPT
> + /*
> + * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
> + * case that we took an interrupt during syscall entry. Avoid this by
> + * always deferring to our notify-resume handler.
> + */
> + return true;
I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf
code, task_pt_regs() seems to return the user-space pt_regs for a task with
a current->mm set (iow, not a kernel thread), even if an interrupt nests on
top of a system call. The only corner-case is NMIs, where an NMI may interrupt
in the middle of setting up the task pt_regs, but scheduling should never happen
there, right ?
Since it's impossible for kernel threads to have an rseq critical section,
we should be able to assume that every time task_pt_regs() returns a
non-userspace pt_regs (user_mode(regs) == 0), the scheduling event applies
to a kernel thread. Therefore, following this line of thought,
arch_rseq_in_crit_section() should work for CONFIG_PREEMPT kernels too.
So what am I missing here?
Thanks,
Mathieu
> +#else
> + return arch_rseq_in_crit_section(p, task_pt_regs(p));
> +#endif
> +}
> +
> +void arch_rseq_handle_notify_resume(struct pt_regs *regs);
> +void arch_rseq_check_critical_section(struct task_struct *p,
> + struct pt_regs *regs);
> +
> +#else /* !CONFIG_RESTARTABLE_SEQUENCES */
> +
> +static inline void arch_rseq_handle_notify_resume(struct pt_regs *regs) {}
> +static inline void arch_rseq_check_critical_section(struct task_struct *p,
> + struct pt_regs *regs) {}
> +
> +#endif
> +
> +#endif /* _ASM_X86_RESTARTABLE_SEQUENCES_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index febaf18..bd7827d 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -113,6 +113,8 @@ obj-$(CONFIG_TRACING) += tracepoint.o
> obj-$(CONFIG_IOSF_MBI) += iosf_mbi.o
> obj-$(CONFIG_PMC_ATOM) += pmc_atom.o
>
> +obj-$(CONFIG_RESTARTABLE_SEQUENCES) += restartable_sequences.o
> +
> ###
> # 64 bit specific files
> ifeq ($(CONFIG_X86_64),y)
> diff --git a/arch/x86/kernel/restartable_sequences.c
> b/arch/x86/kernel/restartable_sequences.c
> new file mode 100644
> index 0000000..3b38013
> --- /dev/null
> +++ b/arch/x86/kernel/restartable_sequences.c
> @@ -0,0 +1,69 @@
> +/*
> + * Restartable Sequences: x86 ABI.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
> + *
> + * Copyright (C) 2015, Google, Inc.,
> + * Paul Turner <[email protected]> and Andrew Hunter <[email protected]>
> + *
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/uaccess.h>
> +#include <asm/restartable_sequences.h>
> +
> +void arch_rseq_check_critical_section(struct task_struct *p,
> + struct pt_regs *regs)
> +{
> + if (!arch_rseq_in_crit_section(p, regs))
> + return;
> +
> + /* RSEQ only applies to user-mode execution */
> + BUG_ON(!user_mode(regs));
> +
> + /*
> + * The ABI is slightly different for {32,64}-bit threads on x86
> + *
> + * Short version:
> + * x86-64 (or x32): interrupted rip => %r10
> + * i386: interrupted rip => %ecx
> + *
> + * Longer version:
> + * The scratch registers available under the i386 function call ABI
> + * overlap with those used by argument registers under the x86_64 ABI.
> + *
> + * Given that the sequence block is already personality specific in
> + * that it must be entered by 'call' and that we always want the
> + * arguments available for a sequence restart; it's more natural to
> + * differentiate the ABI in these two cases.
> + */
> + if (unlikely(test_tsk_thread_flag(p, TIF_IA32)))
> + regs->cx = regs->ip; /* i386 */
> + else
> + regs->r10 = regs->ip; /* x86-64/x32 */
> +
> + regs->ip = (unsigned long)p->group_leader->rseq_state.crit_restart;
> +}
> +
> +void arch_rseq_handle_notify_resume(struct pt_regs *regs)
> +{
> + struct restartable_sequence_state *rseq_state = &current->rseq_state;
> +
> + /* If this update fails our user-state is incoherent. */
> + if (put_user(task_cpu(current), rseq_state->cpu_pointer))
> + force_sig(SIGSEGV, current);
> +
> + arch_rseq_check_critical_section(current, regs);
> +}
> diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
> index 206996c..987c50b 100644
> --- a/arch/x86/kernel/signal.c
> +++ b/arch/x86/kernel/signal.c
> @@ -31,6 +31,7 @@
> #include <asm/vdso.h>
> #include <asm/mce.h>
> #include <asm/sighandling.h>
> +#include <asm/restartable_sequences.h>
>
> #ifdef CONFIG_X86_64
> #include <asm/proto.h>
> @@ -617,6 +618,15 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
> sigset_t *set = sigmask_to_save();
> compat_sigset_t *cset = (compat_sigset_t *) set;
>
> + /*
> + * If we are executing in the critical section of a restartable
> + * sequence we need to fix up the user's stack saved ip at this point
> + * so that signal handler return does not allow us to jump back into
> + * the block across a context switch boundary.
> + */
> + if (rseq_active(current))
> + arch_rseq_check_critical_section(current, regs);
> +
> /* Set up the stack frame */
> if (is_ia32_frame()) {
> if (ksig->ka.sa.sa_flags & SA_SIGINFO)
> @@ -755,6 +765,8 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32
> thread_info_flags)
> if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> clear_thread_flag(TIF_NOTIFY_RESUME);
> tracehook_notify_resume(regs);
> + if (rseq_active(current))
> + arch_rseq_handle_notify_resume(regs);
> }
> if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
> fire_user_return_notifiers();
> diff --git a/kernel/restartable_sequences.c b/kernel/restartable_sequences.c
> index 72945f2..9102241 100644
> --- a/kernel/restartable_sequences.c
> +++ b/kernel/restartable_sequences.c
> @@ -24,17 +24,22 @@
>
> #ifdef CONFIG_RESTARTABLE_SEQUENCES
>
> +#include <asm/restartable_sequences.h>
> #include <linux/uaccess.h>
> #include <linux/preempt.h>
> #include <linux/syscalls.h>
>
> static void rseq_sched_in_nop(struct preempt_notifier *pn, int cpu) {}
> -static void rseq_sched_out_nop(struct preempt_notifier *pn,
> - struct task_struct *next) {}
> +static void rseq_sched_out(struct preempt_notifier *pn,
> + struct task_struct *next)
> +{
> + if (arch_rseq_needs_notify_resume(current))
> + set_thread_flag(TIF_NOTIFY_RESUME);
> +}
>
> static __read_mostly struct preempt_ops rseq_preempt_ops = {
> .sched_in = rseq_sched_in_nop,
> - .sched_out = rseq_sched_out_nop,
> + .sched_out = rseq_sched_out,
> };
>
> int rseq_register_cpu_pointer_current(int __user *cpu_pointer)
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Jun 26, 2015, at 2:09 PM, Mathieu Desnoyers [email protected] wrote:
> ----- On Jun 24, 2015, at 6:26 PM, Paul Turner [email protected] wrote:
>
>> Implements the x86 (i386 & x86-64) ABIs for interrupting and restarting
>> execution within restartable sequence sections.
>>
>> With respect to the x86-specific ABI:
>> On 32-bit: Upon restart, the interrupted rip is placed in %ecx
>> On 64-bit (or x32): Upon restart, the interrupted rip is placed in %r10
>>
>> While potentially surprising at first glance, this choice is strongly motivated
>> by the fact that the available scratch registers under the i386 function call
>> ABI overlap with those used as argument registers under x86_64.
>>
>> Given that sequences are already personality specific and that we always want
>> the arguments to be available for sequence restart, it's much more natural to
>> ultimately differentiate the ABI in these two cases.
>>
>> Signed-off-by: Paul Turner <[email protected]>
>> [...]
>>
>> +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
>> +{
>> +#ifdef CONFIG_PREEMPT
>> + /*
>> + * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
>> + * case that we took an interrupt during syscall entry. Avoid this by
>> + * always deferring to our notify-resume handler.
>> + */
>> + return true;
>
> I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf
> code, task_pt_regs() seems to return the user-space pt_regs for a task with
> a current->mm set (iow, not a kernel thread), even if an interrupt nests on
> top of a system call. The only corner-case is NMIs, where an NMI may interrupt
> in the middle of setting up the task pt_regs, but scheduling should never happen
> there, right ?
>
> Since it's impossible for kernel threads to have an rseq critical section,
> we should be able to assume that whenever task_pt_regs() returns a
> non-userspace pt_regs (user_mode(regs) == 0), we are scheduling a kernel
> thread. Therefore, following this line of thought,
> arch_rseq_in_crit_section() should work for CONFIG_PREEMPT kernels too.
>
> So what am I missing here?
AFAIU, the comment near this check in perf_get_regs_user() is bogus.
It applies not only to NMIs, but also to normal interrupt handlers
that nest over the stack setup on syscall entry (below
entry_SYSCALL_64_after_swapgs in entry_64.S):
	struct pt_regs *user_regs = task_pt_regs(current);

	/*
	 * If we're in an NMI that interrupted task_pt_regs setup, then
	 * we can't sample user regs at all. This check isn't really
	 * sufficient, though, as we could be in an NMI inside an interrupt
	 * that happened during task_pt_regs setup.
	 */
	if (regs->sp > (unsigned long)&user_regs->r11 &&
	    regs->sp <= (unsigned long)(user_regs + 1)) {
		regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
		regs_user->regs = NULL;
		return;
	}
That is how those races are avoided for tracing. It might not be a huge
issue for perf to lose a sample once in a while, but I understand that
this statistical approach would be incorrect in the context of RSEQ.
Moving ENABLE_INTERRUPTS(CLBR_NONE) 3 instructions down, just after
pushq %rcx /* pt_regs->ip */
might solve your issue here. (in entry_SYSCALL_64_after_swapgs)
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Fri, Jun 26, 2015 at 11:09 AM, Mathieu Desnoyers
<[email protected]> wrote:
> ----- On Jun 24, 2015, at 6:26 PM, Paul Turner [email protected] wrote:
>
>> [...]
>>
>> +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
>> +{
>> +#ifdef CONFIG_PREEMPT
>> + /*
>> + * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
>> + * case that we took an interrupt during syscall entry. Avoid this by
>> + * always deferring to our notify-resume handler.
>> + */
>> + return true;
>
> I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf
> code, task_pt_regs() seems to return the user-space pt_regs for a task with
> a current->mm set (iow, not a kernel thread), even if an interrupt nests on
> top of a system call. The only corner-case is NMIs, where an NMI may interrupt
> in the middle of setting up the task pt_regs, but scheduling should never happen
> there, right ?
Careful here! task_pt_regs returns a pointer to the place where regs
would be if they were fully initialized. We can certainly take an
interrupt in the middle of pt_regs setup (entry_SYSCALL_64 enables
interrupts very early, for example). To me, the question is whether
we can ever be preemptible at such a time.
It's a bit worse, though: we can certainly be preemptible when other
code is accessing pt_regs. clone, execve, sigreturn, and signal
delivery come to mind.
Why don't we give up on poking at user state from the scheduler and do
it on exit to user mode instead? Starting in 4.3 (hopefully landing
in -tip in a week or two), we should have a nice function
prepare_exit_to_usermode that runs with well-defined state,
non-reentrantly, that can do whatever you want here, *including user
memory access*.
The remaining question would be what the ABI should be.
Could we get away with a vDSO function along the lines of "set *A=B
and *X=Y if we're on cpu N and *X=Z"? Straight-up cmpxchg would be
even simpler.
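Purely as a strawman for the shape such a call could take (the name and
signature below are invented, not an existing interface):

	/*
	 * Hypothetical vDSO helper: store 'newval' into *p and return 0 only
	 * if the caller is currently running on 'cpu' and *p still equals
	 * 'oldval'; otherwise store nothing and return nonzero.
	 */
	int __vdso_percpu_cmpxchg(int cpu, unsigned long *p,
				  unsigned long oldval, unsigned long newval);

A push would then just be a loop around that call, re-reading the cpu on
failure.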
--Andy
On Fri, Jun 26, 2015 at 12:31 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jun 26, 2015 at 11:09 AM, Mathieu Desnoyers
> <[email protected]> wrote:
>> ----- On Jun 24, 2015, at 6:26 PM, Paul Turner [email protected] wrote:
>>
>>> [...]
>>>
>>> +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p)
>>> +{
>>> +#ifdef CONFIG_PREEMPT
>>> + /*
>>> + * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the
>>> + * case that we took an interrupt during syscall entry. Avoid this by
>>> + * always deferring to our notify-resume handler.
>>> + */
>>> + return true;
>>
>> I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf
>> code, task_pt_regs() seems to return the user-space pt_regs for a task with
>> a current->mm set (iow, not a kernel thread), even if an interrupt nests on
>> top of a system call. The only corner-case is NMIs, where an NMI may interrupt
>> in the middle of setting up the task pt_regs, but scheduling should never happen
>> there, right ?
>
> Careful here! task_pt_regs returns a pointer to the place where regs
> would be if they were fully initialized. We can certainly take an
> interrupt in the middle of pt_regs setup (entry_SYSCALL_64 enables
> interrupts very early, for example). To me, the question is whether
> we can ever be preemptible at such a time.
>
> It's a bit worse, though: we can certainly be preemptible when other
> code is accessing pt_regs. clone, execve, sigreturn, and signal
> delivery come to mind.
Yeah, Andy covered it exactly: an interrupt during pt_regs setup.
With respect to whether we can be preemptible: I think we were
concerned about rescheduling during syscall entry, but I'd have to
re-audit the current state of entry_64.S :)
Mathieu also wrote:
> Moving ENABLE_INTERRUPTS(CLBR_NONE) 3 instructions down, just after
> pushq %rcx /* pt_regs->ip */
> might solve your issue here. (in entry_SYSCALL_64_after_swapgs)
We considered doing something exactly like this, but I think any
potential changes here should be made in isolation from this series.
>
> Why don't we give up on poking at user state from the scheduler and do
> it on exit to user mode instead? Starting in 4.3 (hopefully landing
> in -tip in a week or two), we should have a nice function
> prepare_exit_to_usermode that runs with well-defined state,
> non-reentrantly, that can do whatever you want here, *including user
> memory access*.
So this series already does essentially that:
The only thing we touch in the scheduler is the kernel copy of
pt_regs, and only in the cases where we know it's safe to.
The entirety of *any* poking at user state (both the cpu pointer
updates and any potential rip manipulation) happens in the
exit-to-user path via TIF_NOTIFY_RESUME.
>
> The remaining question would be what the ABI should be.
>
> Could we get away with a vDSO function along the lines of "set *A=B
> and *X=Y if we're on cpu N and *X=Z"? Straight-up cmpxchg would be
> even simpler.
The short answer is yes [*]; but I don't think it should live in the vDSO.
a) vDSO call overhead is fairly high
b) I don't think there are any properties of being in the vDSO that we
benefit from.
c) It would be nice if these sequences were inlinable.
I have an alternate implementation that satisfies (c) which I'm
looking to propose early next week (I've got it baking on some tests
over the weekend).
[*] I mean that the very simplest implementation, taking this patch and
putting the implementation of the critical section in the vDSO, is
clearly sufficient.
Let me try to summarize some of the approaches with their pros and cons:
--- percpu segment ---
This is probably the simplest and might make sense regardless.
cmpxchg can be used to do an atomic push onto a linked list. I think
that unlocked cmpxchg16b can be used to get an atomic pop. (You'd
have the list head pointer next to an auxiliary pointer to the second
element in the list, perhaps.)
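A very rough, untested sketch of the push half on x86-64 (HEAD_OFF is a
made-up offset; a real ABI would have to define how users of the segment
coordinate their offsets):

	struct node { struct node *next; };

	#define HEAD_OFF 0	/* hypothetical offset of the list head */

	/* One push attempt; the caller retries on failure. */
	static inline int percpu_try_push(struct node *n)
	{
		struct node *head, *prev;

		asm volatile("movq %%gs:%c1, %0" : "=r" (head) : "i" (HEAD_OFF));
		n->next = head;
		/* no lock prefix: no other cpu ever touches this slot */
		asm volatile("cmpxchgq %2, %%gs:%c3"
			     : "=a" (prev)
			     : "0" (head), "r" (n), "i" (HEAD_OFF)
			     : "memory", "cc");
		return prev == head;
	}

Pop is where the auxiliary pointer and cmpxchg16b come in, to dodge the
usual ABA problem.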
You can also use this for limited forms of speculative locking.
Aborting cleanly if your lock is stolen might require the kernel's
help, though (you're now on the wrong cpu, so you can't atomically
poke the lock variable any more).
The ABI is straightforward, and the only limitation on multiple users
in the same process is that they need to coordinate their offsets into
the percpu segment.
--- vdso-provided atomic ops ---
This could be quite flexible. The upside is that the ABI would be
straightforward (call a function with clearly-specified behavior).
The downside is that implementing it well might require percpu
segments and a certain amount of coordination, and it requires a
function call.
One nice thing about doing it in the vdso is that we can change the
implementation down the road.
--- kernel preemption hooks ---
I'm defining a preemption hook as an action taken by the kernel when a
user task is preempted during a critical section.
As an upside, we get extremely efficient, almost arbitrary percpu
operations. We don't need to worry about memory ordering at all,
because the whole sequence aborts if anything else might run on the
same cpu. Push and pop are both easy.
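To make that concrete, here is a loose sketch of a per-cpu push under the
ABI from Paul's patch (my own illustration, not taken from the series'
tests): rseq_cpu stands for the int the kernel keeps up to date through
the registered cpu pointer, the region below is the one registered as
[crit_start, crit_end), and crit_restart simply retries (the interrupted
rip arrives in %r10 and is ignored here).

	# void percpu_push(struct node *node /* %rdi */,
	#                  struct node **heads /* %rsi, one slot per cpu */)
	.globl	percpu_push
percpu_push:
crit_restart:				# restart handler: just try again
crit_start:
	movl	rseq_cpu(%rip), %eax	# cpu id maintained by the kernel
	movq	(%rsi,%rax,8), %rdx	# old = heads[cpu]
	movq	%rdx, (%rdi)		# node->next = old
	movq	%rdi, (%rsi,%rax,8)	# commit: heads[cpu] = node
crit_end:
	ret

The single committing store is what makes this work: preempted anywhere
before it, the ip is still inside the region and we get bounced to
crit_restart; once it has executed, the ip is at crit_end and the kernel
leaves us alone.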
One con is that actually defining where the critical section is might
be nasty. If there's a single IP range, then two libraries could
fight over it. We could have a variable somewhere that you write to
arm the critical section, but that's a bit slower.
Another con is that you can't single-step through this type of
critical section. It will be preempted every time.
--- kernel migration hooks ---
I'm not sure either Paul or Mathieu discussed this, but another option
would be to have some special handling if a task is migrated during a
critical section or to allow a task to prevent migration entirely
during a critical section. From the user's point of view, this is
weaker than preemption hooks: it's possible to start your critical
section, be preempted, and have another thread enter its own critical
section, then get rescheduled on the same cpu without aborting. Users
would have to use local atomics (like cmpxchg) to make it useful.
As a major advantage, single-stepping still works.
This shares the coordination downside with preemption hooks (users
have to tell the kernel about their critical sections somehow).
Push can certainly be implemented using cmpxchg. The gs prefix isn't
even needed. Pop might be harder to implement directly without
resorting to cmpxchg16b or similar.
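As a sketch of that (my own, untested; NR_CPUS and the layout are made
up), using the plain GCC atomic builtins, which emit a lock prefix even
though only threads on the same cpu are expected to contend for a slot:

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

#define NR_CPUS 64			/* illustration only */

struct node { struct node *next; };

static struct node *heads[NR_CPUS];	/* one list head per cpu */

static void percpu_push(struct node *n)
{
	int cpu = sched_getcpu();	/* or a kernel-maintained cpu id */
	struct node *old = __atomic_load_n(&heads[cpu], __ATOMIC_RELAXED);

	do {
		n->next = old;
	} while (!__atomic_compare_exchange_n(&heads[cpu], &old, n, false,
					      __ATOMIC_RELEASE,
					      __ATOMIC_RELAXED));
}

With the lock prefix a stray migration merely lands the push on the stale
cpu's list; dropping the prefix is only safe if the kernel aborts the
sequence on migration, which is the whole point of the hook.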
--- Unnamed trick ---
On entry to a critical section, try to take a per-cpu lock that stores
the holder's tid. This might require percpu segments.
If you get the lock, then start doing your thing. For example, you
could pop by reading head->next and writing it back to head.
If, however, you miss the lock, then you need to either wait or
forcibly abort the lock holder. You could do the latter by sending a
signal or possibly using a new syscall that atomically aborts the lock
holder and takes the lock. You don't need to wait, though -- all you
need to do is queue the signal and, if the lock holder is actually
running, wait for signal delivery to start.
Thoughts? I personally like the other options better than preemption
hooks. I prefer solutions that don't interfere with debugging.
--Andy
----- On Jun 27, 2015, at 12:25 PM, Andy Lutomirski [email protected] wrote:
> Let me try to summarize some of the approaches with their pros and cons:
>
I can try summarizing the desiderata that I gather from this
thread so far:
- *very fast* accesses for per-cpu push/pop, per-cpu lock
acquisition, and per-cpu counters,
- guaranteed progress (don't loop forever when single-stepped),
- critical section implementation flexibility:
- can be written only in assembly or also in C,
- provide a single building-block (e.g. cmpxchg) or allow
applications/libraries to do arbitrary critical sections,
- portability to non-x86 architectures,
- low-intrusiveness for injection into applications (no signal
handler, no segment selector used by pre-existing applications),
- can be used by either a single or many shared objects per process,
- don't disturb real-time scheduling,
- minimal slowdown of kernel scheduling execution.
More comments on each approach below:
> --- percpu segment ---
>
> This is probably the simplest and might make sense regardless.
> cmpxchg can be used to do an atomic push onto a linked list. I think
> that unlocked cmpxchg16b can be used to get an atomic pop. (You'd
> have the list head pointer next to an auxiliary pointer to the second
> element in the list, perhaps.)
Based on http://www.agner.org/optimize/instruction_tables.pdf,
on Intel Haswell:
instruction latency (cycles)
cmpxchg: 8
lock cmpxchg: 19
cmpxchg16b: 15
cmpxchg16b does not appear to be particularly fast (twice the latency of
cmpxchg).
>
> You can also use this for limited forms of speculative locking.
> Aborting cleanly if your lock is stolen might require the kernel's
> help, though (you're now on the wrong cpu, so you can't atomically
> poke the lock variable any more).
>
> The ABI is straightforward, and the only limitation on multiple users
> in the same process is that they need to coordinate their offsets into
> the percpu segment.
One more downside of this approach: some applications already use
the gs segment (AFAIK the wine emulator uses it), so it can be prohibitive
to use it from tracing code injected into pre-existing applications.
Another downside is that it is x86-specific.
>
> --- vdso-provided atomic ops ---
>
> This could be quite flexible. The upside is that the ABI would be
> straightforward (call a function with clearly-specified behavior).
Following the same reference as above, the call/ret pair alone would cost
about 5 cycles.
> The downside is that implementing it well might require percpu
> segments and a certain amount of coordination, and it requires a
> function call.
Same downside as above about gs segment being already used by some
applications.
>
> One nice thing about doing it in the vdso is that we can change the
> implementation down the road.
Yes, this is clearly an advantage over letting applications inline
their cmpxchg on gs:.
Same downside as above about being x86-specific.
>
> --- kernel preemption hooks ---
>
> I'm defining a preemption hook as an action taken by the kernel when a
> user task is preempted during a critical section.
>
> As an upside, we get extremely efficient, almost arbitrary percpu
> operations. We don't need to worry about memory ordering at all,
> because the whole sequence aborts if anything else might run on the
> same cpu. Push and pop are both easy.
>
> One con is that actually defining where the critical section is might
> be nasty. If there's a single IP range, then two libraries could
> fight over it. We could have a variable somewhere that you write to
> arm the critical section, but that's a bit slower.
My current understanding is that we have two means to tell the kernel
"we are in a critical section": either through register content or
through a per-thread memory area. Paul's implementation uses the
instruction pointer, but it could perhaps use another reserved register
state, which might help us do the critical functions in C code rather
than assembly. It might be tricky to find a register that is guaranteed
not to be used though, hence the per-thread memory area.
The per-thread memory area has the advantage of allowing the critical
sections to be implemented in C code rather than assembly, but, as you
say, its downside is that we need at the very least to set/clear a TLS
flag (or have a nesting counter) surrounding the critical section. This
approach is quite similar to preempt_disable()/preempt_enable() in the
Linux kernel.
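For illustration, the user-space side of that could be as small as the
following (all names invented here; how the kernel learns the address of
the per-thread area is left out):

struct rseq_thread_state {
	int	in_cs;		/* nonzero while inside a critical section */
	void	*restart_ip;	/* where the kernel should redirect on abort */
};

static __thread struct rseq_thread_state rseq_ts;

static inline void rseq_cs_enter(void *restart_ip)
{
	rseq_ts.restart_ip = restart_ip;
	rseq_ts.in_cs = 1;
	asm volatile("" ::: "memory");	/* keep the flag ahead of the c.s. body */
}

static inline void rseq_cs_exit(void)
{
	asm volatile("" ::: "memory");	/* keep the c.s. body ahead of the clear */
	rseq_ts.in_cs = 0;
}

Compiler barriers are enough here because the kernel only inspects the
area once this thread is off the cpu.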
Another advantage of preemption hooks over migration hooks
is that the critical section can assume it has mutually exclusive
access, which is not the case for migration hooks, because it can
be preempted and continue execution afterward. This means that what
can be achieved with e.g. "load+test+store" under a preemption hook
needs to be performed with "cmpxchg" under migration hooks. This
might not be a huge issue for x86, but can become more expensive on
other architectures.
>
> Another con is that you can't single-step through this type of
> critical section. It will be preempted every time.
Could we simply make single-stepping skip over those critical
sections ? AFAIU the kernel should have all the information needed
to do that.
One more advantage: could be implemented for all architectures.
Then the part that is missing from this summary is how to handle
the "restart". Paul's approach moves execution to a restart ip,
which is fine since the entire critical section is written in
assembly. My approach uses page protection on the per-thread
virtual memory mapping to the per-cpu data (e.g. the lock) to
make the thread fault. My approach is more intrusive (it requires
implementing a SIGSEGV handler in user-space), but allows writing
the critical sections in plain C.
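Roughly, the user-space half of that page-protection scheme could look
like the sketch below (the kernel-side revocation of the mapping is not
shown, and every name here is invented):

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <ucontext.h>

extern char cs_restart[];	/* start of the retry path */
extern void *percpu_window;	/* per-thread mapping of the per-cpu data */
extern size_t percpu_window_len;

static void segv_handler(int sig, siginfo_t *si, void *uctx)
{
	ucontext_t *uc = uctx;
	char *addr = si->si_addr;

	if (addr >= (char *)percpu_window &&
	    addr < (char *)percpu_window + percpu_window_len) {
		/* fault on the revoked window: restart the critical section */
		uc->uc_mcontext.gregs[REG_RIP] = (greg_t)cs_restart;
		return;
	}
	/* not ours: restore default handling and re-raise */
	signal(sig, SIG_DFL);
	raise(sig);
}

The handler would be registered with sigaction() and SA_SIGINFO; the fast
path itself never pays for any of this.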
>
> --- kernel migration hooks ---
>
> I'm not sure either Paul or Mattieu discussed this, but another option
> would be to have some special handling if a task is migrated during a
> critical section
I don't think we discussed this yet. See my point above about not
having exclusive access within the critical section. This can be
limiting, especially on architectures that don't provide per-cpu
atomic ops as efficient as x86 cmpxchg.
> or to allow a task to prevent migration entirely
> during a critical section.
Following past discussion with Peter Zijlstra, I understood that this
approach would receive a huge pushback from real-time people, since
letting user-space hold off migration for an arbitrary amount of time
destroys many nice real-time guarantees. However, since the kernel
also allows threads to pin themselves to specific cores explicitly,
I still don't get how preventing migration from userspace for an
arbitrary amount of time is different from setting CPU affinity.
> From the user's point of view, this is
> weaker than preemption hooks: it's possible to start your critical
> section, be preempted, and have another thread enter its own critical
> section, then get rescheduled on the same cpu without aborting. Users
> would have to use local atomics (like cmpxchg) to make it useful.
>
> As a major advantage, single-stepping still works.
>
> This shares the coordination downside with preemption hooks (users
> have to tell the kernel about their critical sections somehow).
>
> Push can certainly be implemented using cmpxchg. The gs prefix isn't
> even needed. Pop might be harder to implement directly without
> resorting to cmpxchg16b or similar.
>
> --- Unnamed trick ---
>
> On entry to a critical section, try to take a per-cpu lock that stores
> the holder's tid. This might require percpu segments.
>
> If you get the lock, then start doing your thing. For example, you
> could pop by reading head->next and writing it back to head.
>
> If, however, you miss the lock, then you need to either wait or
> forcibly abort the lock holder. You could do the latter by sending a
> signal or possibly using a new syscall that atomically aborts the lock
> holder and takes the lock. You don't need to wait, though -- all you
> need to do is queue the signal and, if the lock holder is actually
> running, wait for signal delivery to start.
Aborting the critical section can be tricky in the general case. See
my point above about critical sections written in assembly or C.
Would you require that critical sections be written in assembly for this?
>
> Thoughts? I personally like the other options better than preemption
> hooks. I prefer solutions that don't interfere with debugging.
Would skipping over the critical section be doable/acceptable for the
single-stepping case ?
One more card we have is to design algorithms that have a restartable
fast path whose restart point goes to a path that is guaranteed to
progress. For instance, lock acquisition can be achieved in this way
using a "2-threads" Peterson's algorithm for each core where the variable
used for thread 1 is always guaranteed to have exclusive access (can be
restarted), but the 2nd thread variable is not restarted (guaranteed
progress), and updates to this variable can go through a lock-prefixed
cmpxchg. Basically, we can see this as a 2-class Peterson algorithm
(rather than 2 threads), where the first class has exclusive access,
and the 2nd class serializes many threads through an atomic instruction.
This approach should work well for locks, but I'm not sure it would
do for push/pop of a linked list.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com