Hi,
I'm respinning this series for another RFC round. It is based on the
v4.16-rc7 tag. I am now targeting the 4.17 merge window.
This series contains:
- Restartable sequences system call (x86 32/64, powerpc 32/64, arm 32),
- CPU operation vector system call (x86 32/64, powerpc 32/64, arm 32).
The main changes introduced in this updated version are:
- Add a new cpu_opv flag to query the number of operations supported by the
system call (added associated test-cases),
- Fix error handling when cpu_opv receives a cpu number not part of the
allowed cpu mask (added associated test-cases).
Feedback is welcome!
Thanks,
Mathieu
Boqun Feng (2):
powerpc: Add support for restartable sequences
powerpc: Wire up restartable sequences system call
Mathieu Desnoyers (19):
uapi headers: Provide types_32_64.h
rseq: Introduce restartable sequences system call (v12)
arm: Add restartable sequences support
arm: Wire up restartable sequences system call
x86: Add support for restartable sequences
x86: Wire up restartable sequence system call
sched: Implement push_task_to_cpu (v2)
cpu_opv: Provide cpu_opv system call (v6)
x86: Wire up cpu_opv system call
powerpc: Wire up cpu_opv system call
arm: Wire up cpu_opv system call
selftests: lib.mk: Introduce OVERRIDE_TARGETS
cpu_opv: selftests: Implement selftests (v7)
rseq: selftests: Provide rseq library (v5)
rseq: selftests: Provide percpu_op API
rseq: selftests: Provide basic test
rseq: selftests: Provide basic percpu ops test
rseq: selftests: Provide parametrized tests
rseq: selftests: Provide Makefile, scripts, gitignore
MAINTAINERS | 20 +
arch/Kconfig | 7 +
arch/arm/Kconfig | 1 +
arch/arm/kernel/signal.c | 7 +
arch/arm/tools/syscall.tbl | 2 +
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/systbl.h | 2 +
arch/powerpc/include/asm/unistd.h | 2 +-
arch/powerpc/include/uapi/asm/unistd.h | 2 +
arch/powerpc/kernel/signal.c | 3 +
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 2 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
arch/x86/kernel/signal.c | 6 +
fs/exec.c | 1 +
include/linux/sched.h | 118 ++
include/linux/syscalls.h | 6 +
include/trace/events/rseq.h | 56 +
include/uapi/linux/cpu_opv.h | 120 ++
include/uapi/linux/rseq.h | 150 +++
include/uapi/linux/types_32_64.h | 67 +
init/Kconfig | 31 +
kernel/Makefile | 2 +
kernel/cpu_opv.c | 1083 ++++++++++++++++
kernel/fork.c | 2 +
kernel/rseq.c | 358 +++++
kernel/sched/core.c | 43 +
kernel/sched/sched.h | 10 +
kernel/sys_ni.c | 4 +
tools/testing/selftests/Makefile | 2 +
tools/testing/selftests/cpu-opv/.gitignore | 1 +
tools/testing/selftests/cpu-opv/Makefile | 17 +
.../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 1368 ++++++++++++++++++++
tools/testing/selftests/cpu-opv/cpu-op.c | 352 +++++
tools/testing/selftests/cpu-opv/cpu-op.h | 59 +
tools/testing/selftests/lib.mk | 4 +
tools/testing/selftests/rseq/.gitignore | 7 +
tools/testing/selftests/rseq/Makefile | 37 +
.../testing/selftests/rseq/basic_percpu_ops_test.c | 296 +++++
tools/testing/selftests/rseq/basic_test.c | 55 +
tools/testing/selftests/rseq/param_test.c | 1163 +++++++++++++++++
tools/testing/selftests/rseq/percpu-op.h | 163 +++
tools/testing/selftests/rseq/rseq-arm.h | 732 +++++++++++
tools/testing/selftests/rseq/rseq-ppc.h | 688 ++++++++++
tools/testing/selftests/rseq/rseq-skip.h | 82 ++
tools/testing/selftests/rseq/rseq-x86.h | 1149 ++++++++++++++++
tools/testing/selftests/rseq/rseq.c | 116 ++
tools/testing/selftests/rseq/rseq.h | 164 +++
tools/testing/selftests/rseq/run_param_test.sh | 130 ++
50 files changed, 8694 insertions(+), 1 deletion(-)
create mode 100644 include/trace/events/rseq.h
create mode 100644 include/uapi/linux/cpu_opv.h
create mode 100644 include/uapi/linux/rseq.h
create mode 100644 include/uapi/linux/types_32_64.h
create mode 100644 kernel/cpu_opv.c
create mode 100644 kernel/rseq.c
create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
create mode 100644 tools/testing/selftests/cpu-opv/Makefile
create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h
create mode 100644 tools/testing/selftests/rseq/.gitignore
create mode 100644 tools/testing/selftests/rseq/Makefile
create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
create mode 100644 tools/testing/selftests/rseq/basic_test.c
create mode 100644 tools/testing/selftests/rseq/param_test.c
create mode 100644 tools/testing/selftests/rseq/percpu-op.h
create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
create mode 100644 tools/testing/selftests/rseq/rseq-skip.h
create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
create mode 100644 tools/testing/selftests/rseq/rseq.c
create mode 100644 tools/testing/selftests/rseq/rseq.h
create mode 100755 tools/testing/selftests/rseq/run_param_test.sh
--
2.11.0
Introduce OVERRIDE_TARGETS to allow tests to express dependencies on
header files and .so files, which requires overriding the selftests
lib.mk targets.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
tools/testing/selftests/lib.mk | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 7de482a0519d..3d4bdeb072b2 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -105,6 +105,9 @@ COMPILE.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(TARGET_ARCH) -c
LINK.S = $(CC) $(ASFLAGS) $(CPPFLAGS) $(LDFLAGS) $(TARGET_ARCH)
endif
+# Selftest makefiles can override those targets by setting
+# OVERRIDE_TARGETS = 1.
+ifeq ($(OVERRIDE_TARGETS),)
$(OUTPUT)/%:%.c
$(LINK.c) $^ $(LDLIBS) -o $@
@@ -113,5 +116,6 @@ $(OUTPUT)/%.o:%.S
$(OUTPUT)/%:%.S
$(LINK.S) $^ $(LDLIBS) -o $@
+endif
.PHONY: run_tests all clean install emit_tests
--
2.11.0
"basic_test" only asserts that RSEQ works moderately correctly, e.g.
that the CPUID pointer works.
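As a rough illustration of the kind of check basic_test performs, the
sketch below pins the calling thread to each allowed CPU in turn and
verifies that sched_getcpu() agrees. It deliberately leaves out the rseq
calls (rseq_current_cpu() and friends come from the selftests' rseq.h and
are not reproduced here). check_each_allowed_cpu() is a hypothetical
helper name used only for this example; it is not part of the series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper (not part of the series): pin the calling thread
 * to each CPU of its affinity mask in turn and check that sched_getcpu()
 * reports that CPU. Returns the number of CPUs visited, or -1 if a
 * system call fails or the reported CPU does not match. The original
 * affinity mask is restored on the success path.
 */
static int check_each_allowed_cpu(void)
{
	cpu_set_t affinity, one;
	int i, visited = 0;

	if (sched_getaffinity(0, sizeof(affinity), &affinity))
		return -1;
	for (i = 0; i < CPU_SETSIZE; i++) {
		if (!CPU_ISSET(i, &affinity))
			continue;
		/* Build a mask containing only CPU i. */
		CPU_ZERO(&one);
		CPU_SET(i, &one);
		if (sched_setaffinity(0, sizeof(one), &one))
			return -1;
		/* Once pinned, the thread is expected to run on CPU i. */
		if (sched_getcpu() != i)
			return -1;
		visited++;
	}
	if (sched_setaffinity(0, sizeof(affinity), &affinity))
		return -1;
	return visited;
}
```

A caller would treat -1 as a failure and any positive value as the
number of CPUs checked. The rseq test performs the same walk and
additionally compares the rseq-provided CPU number against i.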
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
tools/testing/selftests/rseq/basic_test.c | 55 +++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
create mode 100644 tools/testing/selftests/rseq/basic_test.c
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
new file mode 100644
index 000000000000..e2086b3885d7
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -0,0 +1,55 @@
+/*
+ * Basic test coverage for critical regions and rseq_current_cpu().
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/time.h>
+
+#include "rseq.h"
+
+void test_cpu_pointer(void)
+{
+ cpu_set_t affinity, test_affinity;
+ int i;
+
+ sched_getaffinity(0, sizeof(affinity), &affinity);
+ CPU_ZERO(&test_affinity);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (CPU_ISSET(i, &affinity)) {
+ CPU_SET(i, &test_affinity);
+ sched_setaffinity(0, sizeof(test_affinity),
+ &test_affinity);
+ assert(sched_getcpu() == i);
+ assert(rseq_current_cpu() == i);
+ assert(rseq_current_cpu_raw() == i);
+ assert(rseq_cpu_start() == i);
+ CPU_CLR(i, &test_affinity);
+ }
+ }
+ sched_setaffinity(0, sizeof(affinity), &affinity);
+}
+
+int main(int argc, char **argv)
+{
+ if (rseq_register_current_thread()) {
+ fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ goto init_thread_error;
+ }
+ printf("testing current cpu\n");
+ test_cpu_pointer();
+ if (rseq_unregister_current_thread()) {
+ fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ goto init_thread_error;
+ }
+ return 0;
+
+init_thread_error:
+ return -1;
+}
--
2.11.0
Introduce the percpu-op.h API. It uses rseq internally as the fast path
when invoked on the right CPU, and falls back to cpu_opv as the slow
path when called from the wrong CPU or when rseq fails.
This allows acting on per-cpu data from various CPUs transparently from
user-space: cpu_opv will take care of migrating the thread to the
requested CPU. Use-cases such as rebalancing memory across per-cpu
memory pools, or migrating tasks for a user-space scheduler, are thus
facilitated. This also handles debugger single-stepping.
Example use from user-space, e.g. for a counter increment:
	int cpu, ret;
	cpu = rseq_cpu_start();
	ret = percpu_addv(&data->c[cpu].count, 1, cpu);
	if (unlikely(ret)) {
		perror("percpu_addv");
		return -1;
	}
	return 0;
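The fast-path/slow-path dispatch can be sketched in isolation as follows.
This is only a model under stated assumptions: fast_cmpeqv_storev_stub()
and slow_cmpeqv_storev_stub() are hypothetical stand-ins for the rseq and
cpu_opv helpers, current_cpu_stub fakes "the CPU we run on", and the
return convention (0 on success, > 0 when the comparison fails, < 0 when
the fast path could not complete) follows the wrappers in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bookkeeping, only used to observe which path ran. */
static int fast_calls, slow_calls;
static int current_cpu_stub;	/* fake "current CPU" for the stub */

/* Stand-in for the rseq fast path: only works on the "current" CPU. */
static int fast_cmpeqv_storev_stub(intptr_t *v, intptr_t expect,
				   intptr_t newv, int cpu)
{
	fast_calls++;
	if (cpu != current_cpu_stub)
		return -1;	/* wrong CPU: caller must fall back */
	if (*v != expect)
		return 1;	/* comparison failed */
	*v = newv;
	return 0;
}

/* Stand-in for the cpu_opv slow path: works for any CPU. */
static int slow_cmpeqv_storev_stub(intptr_t *v, intptr_t expect,
				   intptr_t newv, int cpu)
{
	(void)cpu;
	slow_calls++;
	if (*v != expect)
		return 1;
	*v = newv;
	return 0;
}

/* Same shape as the wrappers in percpu-op.h, using the stubs above. */
static int percpu_cmpeqv_storev_sketch(intptr_t *v, intptr_t expect,
				       intptr_t newv, int cpu)
{
	int ret;

	assert(v);
	ret = fast_cmpeqv_storev_stub(v, expect, newv, cpu);
	if (ret > 0)
		return ret;	/* comparison failure is final: no fallback */
	if (ret < 0)
		return slow_cmpeqv_storev_stub(v, expect, newv, cpu);
	return 0;
}
```

The point of the split is that a positive return is reported to the
caller directly, so only failures of the fast path itself pay for the
slow path.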
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
tools/testing/selftests/rseq/percpu-op.h | 163 +++++++++++++++++++++++++++++++
1 file changed, 163 insertions(+)
create mode 100644 tools/testing/selftests/rseq/percpu-op.h
diff --git a/tools/testing/selftests/rseq/percpu-op.h b/tools/testing/selftests/rseq/percpu-op.h
new file mode 100644
index 000000000000..c17d165438a6
--- /dev/null
+++ b/tools/testing/selftests/rseq/percpu-op.h
@@ -0,0 +1,163 @@
+/*
+ * percpu-op.h
+ *
+ * (C) Copyright 2017 - Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef PERCPU_OP_H
+#define PERCPU_OP_H
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <errno.h>
+#include <stdlib.h>
+#include "rseq.h"
+#include "cpu-op.h"
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+ int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpeqv_storev(v, expect, newv, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpeqv_storev(v, expect, newv, cpu);
+ }
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpnev_storeoffp_load(v, expectnot, voffp, load, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpnev_storeoffp_load(v, expectnot, voffp,
+ load, cpu);
+ }
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_addv(intptr_t *v, intptr_t count, int cpu)
+{
+ if (rseq_unlikely(rseq_addv(v, count, cpu)))
+ return cpu_op_addv(v, count, cpu);
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpeqv_trystorev_storev(v, expect, v2, newv2,
+ newv, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpeqv_storev_storev(v, expect, v2, newv2,
+ newv, cpu);
+ }
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_storev_storev_release(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpeqv_trystorev_storev_release(v, expect, v2, newv2,
+ newv, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpeqv_storev_mb_storev(v, expect, v2, newv2,
+ newv, cpu);
+ }
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpeqv_cmpeqv_storev(v, expect, v2, expect2, newv, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpeqv_cmpeqv_storev(v, expect, v2, expect2,
+ newv, cpu);
+ }
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpeqv_trymemcpy_storev(v, expect, dst, src, len,
+ newv, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpeqv_memcpy_storev(v, expect, dst, src, len,
+ newv, cpu);
+ }
+ return 0;
+}
+
+static inline __attribute__((always_inline))
+int percpu_cmpeqv_memcpy_storev_release(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ int ret;
+
+ ret = rseq_cmpeqv_trymemcpy_storev_release(v, expect, dst, src, len,
+ newv, cpu);
+ if (rseq_unlikely(ret)) {
+ if (ret > 0)
+ return ret;
+ return cpu_op_cmpeqv_memcpy_mb_storev(v, expect, dst, src, len,
+ newv, cpu);
+ }
+ return 0;
+}
+
+#endif /* PERCPU_OP_H */
--
2.11.0
"param_test" is a parametrizable restartable sequences test. See
the "--help" output for usage.
"param_test_benchmark" is the same as "param_test", but it removes
testing book-keeping code to allow accurate benchmarks.
"param_test_skip_fastpath" is the same as "param_test", but it skips
the rseq fast-path, and only calls the cpu_opv slow path.
"param_test_compare_twice" is the same as "param_test", but it performs
each comparison within the rseq critical section twice, thus validating
invariants. If any of the second comparisons fails, an error message
is printed and the test aborts.
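The "compare twice" idea can be modelled in plain C as below. This is a
sketch only: the series performs the comparisons inside the rseq
critical section, which is not reproduced here, and
cmpeqv_storev_compare_twice() is a hypothetical name chosen for the
example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Plain C model (not the series' implementation) of a compare-and-store
 * that repeats its comparison. Returns 0 on success and 1 if the first
 * comparison fails. If the first comparison succeeds but the second one
 * does not, the value changed inside what is supposed to be a critical
 * section, so an error is printed and the program aborts.
 */
static int cmpeqv_storev_compare_twice(intptr_t *v, intptr_t expect,
				       intptr_t newv)
{
	/* First comparison: an ordinary failure is reported to the caller. */
	if (*(volatile intptr_t *)v != expect)
		return 1;
	/* Second comparison: a mismatch here is an invariant violation. */
	if (*(volatile intptr_t *)v != expect) {
		fprintf(stderr, "error: second comparison failed\n");
		abort();
	}
	*v = newv;
	return 0;
}
```

In a single-threaded run the second comparison can never fail; it only
becomes meaningful when another thread or a restart could have modified
the value between the two reads.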
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
tools/testing/selftests/rseq/param_test.c | 1163 +++++++++++++++++++++++++++++
1 file changed, 1163 insertions(+)
create mode 100644 tools/testing/selftests/rseq/param_test.c
diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
new file mode 100644
index 000000000000..0a7c05f506be
--- /dev/null
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -0,0 +1,1163 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+#include <poll.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <errno.h>
+#include <stddef.h>
+
+static inline pid_t gettid(void)
+{
+ return syscall(__NR_gettid);
+}
+
+#define NR_INJECT 9
+static int loop_cnt[NR_INJECT + 1];
+
+static int loop_cnt_1 asm("asm_loop_cnt_1") __attribute__((used));
+static int loop_cnt_2 asm("asm_loop_cnt_2") __attribute__((used));
+static int loop_cnt_3 asm("asm_loop_cnt_3") __attribute__((used));
+static int loop_cnt_4 asm("asm_loop_cnt_4") __attribute__((used));
+static int loop_cnt_5 asm("asm_loop_cnt_5") __attribute__((used));
+static int loop_cnt_6 asm("asm_loop_cnt_6") __attribute__((used));
+
+static int opt_modulo, verbose;
+
+static int opt_yield, opt_signal, opt_sleep,
+ opt_disable_rseq, opt_threads = 200,
+ opt_disable_mod = 0, opt_test = 's', opt_mb = 0;
+
+static long long opt_reps = 5000;
+
+static __thread __attribute__((tls_model("initial-exec")))
+unsigned int signals_delivered;
+
+#ifndef BENCHMARK
+
+static __thread __attribute__((tls_model("initial-exec"), unused))
+unsigned int yield_mod_cnt, nr_abort;
+
+#define printf_verbose(fmt, ...) \
+ do { \
+ if (verbose) \
+ printf(fmt, ## __VA_ARGS__); \
+ } while (0)
+
+#if defined(__x86_64__) || defined(__i386__)
+
+#define INJECT_ASM_REG "eax"
+
+#define RSEQ_INJECT_CLOBBER \
+ , INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+ "mov asm_loop_cnt_" #n ", %%" INJECT_ASM_REG "\n\t" \
+ "test %%" INJECT_ASM_REG ",%%" INJECT_ASM_REG "\n\t" \
+ "jz 333f\n\t" \
+ "222:\n\t" \
+ "dec %%" INJECT_ASM_REG "\n\t" \
+ "jnz 222b\n\t" \
+ "333:\n\t"
+
+#elif defined(__ARMEL__)
+
+#define RSEQ_INJECT_INPUT \
+ , [loop_cnt_1]"m"(loop_cnt[1]) \
+ , [loop_cnt_2]"m"(loop_cnt[2]) \
+ , [loop_cnt_3]"m"(loop_cnt[3]) \
+ , [loop_cnt_4]"m"(loop_cnt[4]) \
+ , [loop_cnt_5]"m"(loop_cnt[5]) \
+ , [loop_cnt_6]"m"(loop_cnt[6])
+
+#define INJECT_ASM_REG "r4"
+
+#define RSEQ_INJECT_CLOBBER \
+ , INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+ "ldr " INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+ "cmp " INJECT_ASM_REG ", #0\n\t" \
+ "beq 333f\n\t" \
+ "222:\n\t" \
+ "subs " INJECT_ASM_REG ", #1\n\t" \
+ "bne 222b\n\t" \
+ "333:\n\t"
+
+#elif __PPC__
+
+#define RSEQ_INJECT_INPUT \
+ , [loop_cnt_1]"m"(loop_cnt[1]) \
+ , [loop_cnt_2]"m"(loop_cnt[2]) \
+ , [loop_cnt_3]"m"(loop_cnt[3]) \
+ , [loop_cnt_4]"m"(loop_cnt[4]) \
+ , [loop_cnt_5]"m"(loop_cnt[5]) \
+ , [loop_cnt_6]"m"(loop_cnt[6])
+
+#define INJECT_ASM_REG "r18"
+
+#define RSEQ_INJECT_CLOBBER \
+ , INJECT_ASM_REG
+
+#define RSEQ_INJECT_ASM(n) \
+ "lwz %%" INJECT_ASM_REG ", %[loop_cnt_" #n "]\n\t" \
+ "cmpwi %%" INJECT_ASM_REG ", 0\n\t" \
+ "beq 333f\n\t" \
+ "222:\n\t" \
+ "subic. %%" INJECT_ASM_REG ", %%" INJECT_ASM_REG ", 1\n\t" \
+ "bne 222b\n\t" \
+ "333:\n\t"
+#else
+#error unsupported target
+#endif
+
+#define RSEQ_INJECT_FAILED \
+ nr_abort++;
+
+#define RSEQ_INJECT_C(n) \
+{ \
+ int loc_i, loc_nr_loops = loop_cnt[n]; \
+ \
+ for (loc_i = 0; loc_i < loc_nr_loops; loc_i++) { \
+ rseq_barrier(); \
+ } \
+ if (loc_nr_loops == -1 && opt_modulo) { \
+ if (yield_mod_cnt == opt_modulo - 1) { \
+ if (opt_sleep > 0) \
+ poll(NULL, 0, opt_sleep); \
+ if (opt_yield) \
+ sched_yield(); \
+ if (opt_signal) \
+ raise(SIGUSR1); \
+ yield_mod_cnt = 0; \
+ } else { \
+ yield_mod_cnt++; \
+ } \
+ } \
+}
+
+#else
+
+#define printf_verbose(fmt, ...)
+
+#endif /* BENCHMARK */
+
+#include "percpu-op.h"
+
+struct percpu_lock_entry {
+ intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+ struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+ intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+ struct percpu_lock lock;
+ struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct spinlock_thread_test_data {
+ struct spinlock_test_data *data;
+ long long reps;
+ int reg;
+};
+
+struct inc_test_data {
+ struct test_data_entry c[CPU_SETSIZE];
+};
+
+struct inc_thread_test_data {
+ struct inc_test_data *data;
+ long long reps;
+ int reg;
+};
+
+struct percpu_list_node {
+ intptr_t data;
+ struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+ struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+ struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+#define BUFFER_ITEM_PER_CPU 100
+
+struct percpu_buffer_node {
+ intptr_t data;
+};
+
+struct percpu_buffer_entry {
+ intptr_t offset;
+ intptr_t buflen;
+ struct percpu_buffer_node **array;
+} __attribute__((aligned(128)));
+
+struct percpu_buffer {
+ struct percpu_buffer_entry c[CPU_SETSIZE];
+};
+
+#define MEMCPY_BUFFER_ITEM_PER_CPU 100
+
+struct percpu_memcpy_buffer_node {
+ intptr_t data1;
+ uint64_t data2;
+};
+
+struct percpu_memcpy_buffer_entry {
+ intptr_t offset;
+ intptr_t buflen;
+ struct percpu_memcpy_buffer_node *array;
+} __attribute__((aligned(128)));
+
+struct percpu_memcpy_buffer {
+ struct percpu_memcpy_buffer_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock. */
+static void rseq_percpu_lock(struct percpu_lock *lock, int cpu)
+{
+ for (;;) {
+ int ret;
+
+ ret = percpu_cmpeqv_storev(&lock->c[cpu].v,
+ 0, 1, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ /*
+ * Acquire semantic when taking lock after control dependency.
+ * Matches rseq_smp_store_release().
+ */
+ rseq_smp_acquire__after_ctrl_dep();
+}
+
+static void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+ assert(lock->c[cpu].v == 1);
+ /*
+ * Release lock, with release semantic. Matches
+ * rseq_smp_acquire__after_ctrl_dep().
+ */
+ rseq_smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+ struct spinlock_thread_test_data *thread_data = arg;
+ struct spinlock_test_data *data = thread_data->data;
+ long long i, reps;
+
+ if (!opt_disable_rseq && thread_data->reg &&
+ rseq_register_current_thread())
+ abort();
+ reps = thread_data->reps;
+ for (i = 0; i < reps; i++) {
+ int cpu = rseq_cpu_start();
+
+ rseq_percpu_lock(&data->lock, cpu);
+ data->c[cpu].count++;
+ rseq_percpu_unlock(&data->lock, cpu);
+#ifndef BENCHMARK
+ if (i != 0 && !(i % (reps / 10)))
+ printf_verbose("tid %d: count %lld\n", (int) gettid(), i);
+#endif
+ }
+ printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+ (int) gettid(), nr_abort, signals_delivered);
+ if (!opt_disable_rseq && thread_data->reg &&
+ rseq_unregister_current_thread())
+ abort();
+ return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock. Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+ const int num_threads = opt_threads;
+ int i, ret;
+ uint64_t sum;
+ pthread_t test_threads[num_threads];
+ struct spinlock_test_data data;
+ struct spinlock_thread_test_data thread_data[num_threads];
+
+ memset(&data, 0, sizeof(data));
+ for (i = 0; i < num_threads; i++) {
+ thread_data[i].reps = opt_reps;
+ if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+ thread_data[i].reg = 1;
+ else
+ thread_data[i].reg = 0;
+ thread_data[i].data = &data;
+ ret = pthread_create(&test_threads[i], NULL,
+ test_percpu_spinlock_thread,
+ &thread_data[i]);
+ if (ret) {
+ errno = ret;
+ perror("pthread_create");
+ abort();
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_join(test_threads[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("pthread_join");
+ abort();
+ }
+ }
+
+ sum = 0;
+ for (i = 0; i < CPU_SETSIZE; i++)
+ sum += data.c[i].count;
+
+ assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+void *test_percpu_inc_thread(void *arg)
+{
+ struct inc_thread_test_data *thread_data = arg;
+ struct inc_test_data *data = thread_data->data;
+ long long i, reps;
+
+ if (!opt_disable_rseq && thread_data->reg &&
+ rseq_register_current_thread())
+ abort();
+ reps = thread_data->reps;
+ for (i = 0; i < reps; i++) {
+ int cpu, ret;
+
+ cpu = rseq_cpu_start();
+ ret = percpu_addv(&data->c[cpu].count, 1, cpu);
+ if (rseq_unlikely(ret)) {
+ perror("cpu_opv");
+ abort();
+ }
+#ifndef BENCHMARK
+ if (i != 0 && !(i % (reps / 10)))
+ printf_verbose("tid %d: count %lld\n", (int) gettid(), i);
+#endif
+ }
+ printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+ (int) gettid(), nr_abort, signals_delivered);
+ if (!opt_disable_rseq && thread_data->reg &&
+ rseq_unregister_current_thread())
+ abort();
+ return NULL;
+}
+
+void test_percpu_inc(void)
+{
+ const int num_threads = opt_threads;
+ int i, ret;
+ uint64_t sum;
+ pthread_t test_threads[num_threads];
+ struct inc_test_data data;
+ struct inc_thread_test_data thread_data[num_threads];
+
+ memset(&data, 0, sizeof(data));
+ for (i = 0; i < num_threads; i++) {
+ thread_data[i].reps = opt_reps;
+ if (opt_disable_mod <= 0 || (i % opt_disable_mod))
+ thread_data[i].reg = 1;
+ else
+ thread_data[i].reg = 0;
+ thread_data[i].data = &data;
+ ret = pthread_create(&test_threads[i], NULL,
+ test_percpu_inc_thread,
+ &thread_data[i]);
+ if (ret) {
+ errno = ret;
+ perror("pthread_create");
+ abort();
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_join(test_threads[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("pthread_join");
+ abort();
+ }
+ }
+
+ sum = 0;
+ for (i = 0; i < CPU_SETSIZE; i++)
+ sum += data.c[i].count;
+
+ assert(sum == (uint64_t)opt_reps * num_threads);
+}
+
+void percpu_list_push(struct percpu_list *list,
+ struct percpu_list_node *node,
+ int cpu)
+{
+ for (;;) {
+ intptr_t *targetptr, newval, expect;
+ int ret;
+
+ /* Load list->c[cpu].head with single-copy atomicity. */
+ expect = (intptr_t)RSEQ_READ_ONCE(list->c[cpu].head);
+ newval = (intptr_t)node;
+ targetptr = (intptr_t *)&list->c[cpu].head;
+ node->next = (struct percpu_list_node *)expect;
+ ret = percpu_cmpeqv_storev(targetptr, expect, newval, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+}
+
+/*
+ * Unlike a traditional lock-less linked list, the availability of an
+ * rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list,
+ int cpu)
+{
+ struct percpu_list_node *head;
+ intptr_t *targetptr, expectnot, *load;
+ off_t offset;
+ int ret;
+
+ targetptr = (intptr_t *)&list->c[cpu].head;
+ expectnot = (intptr_t)NULL;
+ offset = offsetof(struct percpu_list_node, next);
+ load = (intptr_t *)&head;
+ ret = percpu_cmpnev_storeoffp_load(targetptr, expectnot,
+ offset, load, cpu);
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ if (ret > 0)
+ return NULL;
+ return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+ long long i, reps;
+ struct percpu_list *list = (struct percpu_list *)arg;
+
+ if (!opt_disable_rseq && rseq_register_current_thread())
+ abort();
+
+ reps = opt_reps;
+ for (i = 0; i < reps; i++) {
+ struct percpu_list_node *node;
+
+ node = percpu_list_pop(list, rseq_cpu_start());
+ if (opt_yield)
+ sched_yield(); /* encourage shuffling */
+ if (node)
+ percpu_list_push(list, node, rseq_cpu_start());
+ }
+
+ printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+ (int) gettid(), nr_abort, signals_delivered);
+ if (!opt_disable_rseq && rseq_unregister_current_thread())
+ abort();
+
+ return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads. */
+void test_percpu_list(void)
+{
+ const int num_threads = opt_threads;
+ int i, j, ret;
+ uint64_t sum = 0, expected_sum = 0;
+ struct percpu_list list;
+ pthread_t test_threads[num_threads];
+ cpu_set_t allowed_cpus;
+
+ memset(&list, 0, sizeof(list));
+
+ /* Generate list entries for every usable cpu. */
+ sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+ for (j = 1; j <= 100; j++) {
+ struct percpu_list_node *node;
+
+ expected_sum += j;
+
+ node = malloc(sizeof(*node));
+ assert(node);
+ node->data = j;
+ node->next = list.c[i].head;
+ list.c[i].head = node;
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_create(&test_threads[i], NULL,
+ test_percpu_list_thread, &list);
+ if (ret) {
+ errno = ret;
+ perror("pthread_create");
+ abort();
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_join(test_threads[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("pthread_join");
+ abort();
+ }
+ }
+
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ struct percpu_list_node *node;
+
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+
+ while ((node = percpu_list_pop(&list, i))) {
+ sum += node->data;
+ free(node);
+ }
+ }
+
+ /*
+ * All entries should now be accounted for (unless some external
+ * actor is interfering with our allowed affinity while this
+ * test is running).
+ */
+ assert(sum == expected_sum);
+}
+
+bool percpu_buffer_push(struct percpu_buffer *buffer,
+ struct percpu_buffer_node *node,
+ int cpu)
+{
+ for (;;) {
+ intptr_t *targetptr_spec, newval_spec;
+ intptr_t *targetptr_final, newval_final;
+ intptr_t offset;
+ int ret;
+
+ offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+ if (offset == buffer->c[cpu].buflen)
+ return false;
+ newval_spec = (intptr_t)node;
+ targetptr_spec = (intptr_t *)&buffer->c[cpu].array[offset];
+ newval_final = offset + 1;
+ targetptr_final = &buffer->c[cpu].offset;
+ if (opt_mb)
+ ret = percpu_cmpeqv_storev_storev_release(
+ targetptr_final, offset, targetptr_spec,
+ newval_spec, newval_final, cpu);
+ else
+ ret = percpu_cmpeqv_storev_storev(targetptr_final,
+ offset, targetptr_spec, newval_spec,
+ newval_final, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ return true;
+}
+
+struct percpu_buffer_node *percpu_buffer_pop(struct percpu_buffer *buffer,
+ int cpu)
+{
+ struct percpu_buffer_node *head;
+
+ for (;;) {
+ intptr_t *targetptr, newval;
+ intptr_t offset;
+ int ret;
+
+ /* Load offset with single-copy atomicity. */
+ offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+ if (offset == 0)
+ return NULL;
+ head = RSEQ_READ_ONCE(buffer->c[cpu].array[offset - 1]);
+ newval = offset - 1;
+ targetptr = (intptr_t *)&buffer->c[cpu].offset;
+ ret = percpu_cmpeqv_cmpeqv_storev(targetptr, offset,
+ (intptr_t *)&buffer->c[cpu].array[offset - 1],
+ (intptr_t)head, newval, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ return head;
+}
+
+void *test_percpu_buffer_thread(void *arg)
+{
+ long long i, reps;
+ struct percpu_buffer *buffer = (struct percpu_buffer *)arg;
+
+ if (!opt_disable_rseq && rseq_register_current_thread())
+ abort();
+
+ reps = opt_reps;
+ for (i = 0; i < reps; i++) {
+ struct percpu_buffer_node *node;
+
+ node = percpu_buffer_pop(buffer, rseq_cpu_start());
+ if (opt_yield)
+ sched_yield(); /* encourage shuffling */
+ if (node) {
+ if (!percpu_buffer_push(buffer, node,
+ rseq_cpu_start())) {
+ /* Should increase buffer size. */
+ abort();
+ }
+ }
+ }
+
+ printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+ (int) gettid(), nr_abort, signals_delivered);
+ if (!opt_disable_rseq && rseq_unregister_current_thread())
+ abort();
+
+ return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads. */
+void test_percpu_buffer(void)
+{
+ const int num_threads = opt_threads;
+ int i, j, ret;
+ uint64_t sum = 0, expected_sum = 0;
+ struct percpu_buffer buffer;
+ pthread_t test_threads[num_threads];
+ cpu_set_t allowed_cpus;
+
+ memset(&buffer, 0, sizeof(buffer));
+
+ /* Generate list entries for every usable cpu. */
+ sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+ /* Worst-case is every item in same CPU. */
+ buffer.c[i].array =
+ malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE *
+ BUFFER_ITEM_PER_CPU);
+ assert(buffer.c[i].array);
+ buffer.c[i].buflen = CPU_SETSIZE * BUFFER_ITEM_PER_CPU;
+ for (j = 1; j <= BUFFER_ITEM_PER_CPU; j++) {
+ struct percpu_buffer_node *node;
+
+ expected_sum += j;
+
+ /*
+ * We could theoretically put the word-sized
+ * "data" directly in the buffer. However, we
+ * want to model objects that would not fit
+ * within a single word, so allocate an object
+ * for each node.
+ */
+ node = malloc(sizeof(*node));
+ assert(node);
+ node->data = j;
+ buffer.c[i].array[j - 1] = node;
+ buffer.c[i].offset++;
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_create(&test_threads[i], NULL,
+ test_percpu_buffer_thread, &buffer);
+ if (ret) {
+ errno = ret;
+ perror("pthread_create");
+ abort();
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_join(test_threads[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("pthread_join");
+ abort();
+ }
+ }
+
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ struct percpu_buffer_node *node;
+
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+
+ while ((node = percpu_buffer_pop(&buffer, i))) {
+ sum += node->data;
+ free(node);
+ }
+ free(buffer.c[i].array);
+ }
+
+ /*
+ * All entries should now be accounted for (unless some external
+ * actor is interfering with our allowed affinity while this
+ * test is running).
+ */
+ assert(sum == expected_sum);
+}
+
+bool percpu_memcpy_buffer_push(struct percpu_memcpy_buffer *buffer,
+ struct percpu_memcpy_buffer_node item, int cpu)
+{
+ for (;;) {
+ intptr_t *targetptr_final, newval_final, offset;
+ char *destptr, *srcptr;
+ size_t copylen;
+ int ret;
+
+ /* Load offset with single-copy atomicity. */
+ offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+ if (offset == buffer->c[cpu].buflen)
+ return false;
+ destptr = (char *)&buffer->c[cpu].array[offset];
+ srcptr = (char *)&item;
+ /* copylen must be <= 4kB. */
+ copylen = sizeof(item);
+ newval_final = offset + 1;
+ targetptr_final = &buffer->c[cpu].offset;
+ if (opt_mb)
+ ret = percpu_cmpeqv_memcpy_storev_release(
+ targetptr_final, offset,
+ destptr, srcptr, copylen,
+ newval_final, cpu);
+ else
+ ret = percpu_cmpeqv_memcpy_storev(targetptr_final,
+ offset, destptr, srcptr, copylen,
+ newval_final, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ return true;
+}
+
+bool percpu_memcpy_buffer_pop(struct percpu_memcpy_buffer *buffer,
+ struct percpu_memcpy_buffer_node *item, int cpu)
+{
+ for (;;) {
+ intptr_t *targetptr_final, newval_final, offset;
+ char *destptr, *srcptr;
+ size_t copylen;
+ int ret;
+
+ /* Load offset with single-copy atomicity. */
+ offset = RSEQ_READ_ONCE(buffer->c[cpu].offset);
+ if (offset == 0)
+ return false;
+ destptr = (char *)item;
+ srcptr = (char *)&buffer->c[cpu].array[offset - 1];
+ /* copylen must be <= 4kB. */
+ copylen = sizeof(*item);
+ newval_final = offset - 1;
+ targetptr_final = &buffer->c[cpu].offset;
+ ret = percpu_cmpeqv_memcpy_storev(targetptr_final,
+ offset, destptr, srcptr, copylen,
+ newval_final, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ return true;
+}
+
+void *test_percpu_memcpy_buffer_thread(void *arg)
+{
+ long long i, reps;
+ struct percpu_memcpy_buffer *buffer = (struct percpu_memcpy_buffer *)arg;
+
+ if (!opt_disable_rseq && rseq_register_current_thread())
+ abort();
+
+ reps = opt_reps;
+ for (i = 0; i < reps; i++) {
+ struct percpu_memcpy_buffer_node item;
+ bool result;
+
+ result = percpu_memcpy_buffer_pop(buffer, &item,
+ rseq_cpu_start());
+ if (opt_yield)
+ sched_yield(); /* encourage shuffling */
+ if (result) {
+ if (!percpu_memcpy_buffer_push(buffer, item,
+ rseq_cpu_start())) {
+ /* Should increase buffer size. */
+ abort();
+ }
+ }
+ }
+
+ printf_verbose("tid %d: number of rseq abort: %d, signals delivered: %u\n",
+ (int) gettid(), nr_abort, signals_delivered);
+ if (!opt_disable_rseq && rseq_unregister_current_thread())
+ abort();
+
+ return NULL;
+}
+
+/* Simultaneous modification to a per-cpu buffer from many threads. */
+void test_percpu_memcpy_buffer(void)
+{
+ const int num_threads = opt_threads;
+ int i, j, ret;
+ uint64_t sum = 0, expected_sum = 0;
+ struct percpu_memcpy_buffer buffer;
+ pthread_t test_threads[num_threads];
+ cpu_set_t allowed_cpus;
+
+ memset(&buffer, 0, sizeof(buffer));
+
+ /* Generate list entries for every usable cpu. */
+ sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+ /* Worst-case is every item in same CPU. */
+ buffer.c[i].array =
+ malloc(sizeof(*buffer.c[i].array) * CPU_SETSIZE *
+ MEMCPY_BUFFER_ITEM_PER_CPU);
+ assert(buffer.c[i].array);
+ buffer.c[i].buflen = CPU_SETSIZE * MEMCPY_BUFFER_ITEM_PER_CPU;
+ for (j = 1; j <= MEMCPY_BUFFER_ITEM_PER_CPU; j++) {
+ expected_sum += 2 * j + 1;
+
+ /*
+ * Unlike test_percpu_buffer(), no object is
+ * allocated per node here: each item carries
+ * two fields (data1 and data2) and is stored by
+ * value in the array, so push and pop copy it
+ * with the memcpy variant of the operation.
+ */
+ buffer.c[i].array[j - 1].data1 = j;
+ buffer.c[i].array[j - 1].data2 = j + 1;
+ buffer.c[i].offset++;
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_create(&test_threads[i], NULL,
+ test_percpu_memcpy_buffer_thread,
+ &buffer);
+ if (ret) {
+ errno = ret;
+ perror("pthread_create");
+ abort();
+ }
+ }
+
+ for (i = 0; i < num_threads; i++) {
+ ret = pthread_join(test_threads[i], NULL);
+ if (ret) {
+ errno = ret;
+ perror("pthread_join");
+ abort();
+ }
+ }
+
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ struct percpu_memcpy_buffer_node item;
+
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+
+ while (percpu_memcpy_buffer_pop(&buffer, &item, i)) {
+ sum += item.data1;
+ sum += item.data2;
+ }
+ free(buffer.c[i].array);
+ }
+
+ /*
+ * All entries should now be accounted for (unless some external
+ * actor is interfering with our allowed affinity while this
+ * test is running).
+ */
+ assert(sum == expected_sum);
+}
+
+static void test_signal_interrupt_handler(int signo)
+{
+ signals_delivered++;
+}
+
+static int set_signal_handler(void)
+{
+ int ret = 0;
+ struct sigaction sa;
+ sigset_t sigset;
+
+ ret = sigemptyset(&sigset);
+ if (ret < 0) {
+ perror("sigemptyset");
+ return ret;
+ }
+
+ sa.sa_handler = test_signal_interrupt_handler;
+ sa.sa_mask = sigset;
+ sa.sa_flags = 0;
+ ret = sigaction(SIGUSR1, &sa, NULL);
+ if (ret < 0) {
+ perror("sigaction");
+ return ret;
+ }
+
+ printf_verbose("Signal handler set for SIGUSR1\n");
+
+ return ret;
+}
+
+static void show_usage(int argc, char **argv)
+{
+ printf("Usage : %s <OPTIONS>\n",
+ argv[0]);
+ printf("OPTIONS:\n");
+ printf(" [-1 loops] Number of loops for delay injection 1\n");
+ printf(" [-2 loops] Number of loops for delay injection 2\n");
+ printf(" [-3 loops] Number of loops for delay injection 3\n");
+ printf(" [-4 loops] Number of loops for delay injection 4\n");
+ printf(" [-5 loops] Number of loops for delay injection 5\n");
+ printf(" [-6 loops] Number of loops for delay injection 6\n");
+ printf(" [-7 loops] Number of loops for delay injection 7 (-1 to enable -m)\n");
+ printf(" [-8 loops] Number of loops for delay injection 8 (-1 to enable -m)\n");
+ printf(" [-9 loops] Number of loops for delay injection 9 (-1 to enable -m)\n");
+ printf(" [-m N] Yield/sleep/kill every modulo N (default 0: disabled) (>= 0)\n");
+ printf(" [-y] Yield\n");
+ printf(" [-k] Kill thread with signal\n");
+ printf(" [-s S] S: =0: disabled (default), >0: sleep time (ms)\n");
+ printf(" [-t N] Number of threads (default 200)\n");
+ printf(" [-r N] Number of repetitions per thread (default 5000)\n");
+ printf(" [-d] Disable rseq system call (no initialization)\n");
+ printf(" [-D M] Disable rseq for each M threads\n");
+ printf(" [-T test] Choose test: (s)pinlock, (l)ist, (b)uffer, (m)emcpy, (i)ncrement\n");
+ printf(" [-M] Push into buffer and memcpy buffer with memory barriers.\n");
+ printf(" [-v] Verbose output.\n");
+ printf(" [-h] Show this help.\n");
+ printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+ int i;
+
+ for (i = 1; i < argc; i++) {
+ if (argv[i][0] != '-')
+ continue;
+ switch (argv[i][1]) {
+ case '1':
+ case '2':
+ case '3':
+ case '4':
+ case '5':
+ case '6':
+ case '7':
+ case '8':
+ case '9':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ loop_cnt[argv[i][1] - '0'] = atol(argv[i + 1]);
+ i++;
+ break;
+ case 'm':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ opt_modulo = atol(argv[i + 1]);
+ if (opt_modulo < 0) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ i++;
+ break;
+ case 's':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ opt_sleep = atol(argv[i + 1]);
+ if (opt_sleep < 0) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ i++;
+ break;
+ case 'y':
+ opt_yield = 1;
+ break;
+ case 'k':
+ opt_signal = 1;
+ break;
+ case 'd':
+ opt_disable_rseq = 1;
+ break;
+ case 'D':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ opt_disable_mod = atol(argv[i + 1]);
+ if (opt_disable_mod < 0) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ i++;
+ break;
+ case 't':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ opt_threads = atol(argv[i + 1]);
+ if (opt_threads < 0) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ i++;
+ break;
+ case 'r':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ opt_reps = atoll(argv[i + 1]);
+ if (opt_reps < 0) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ i++;
+ break;
+ case 'h':
+ show_usage(argc, argv);
+ goto end;
+ case 'T':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ goto error;
+ }
+ opt_test = *argv[i + 1];
+ switch (opt_test) {
+ case 's':
+ case 'l':
+ case 'i':
+ case 'b':
+ case 'm':
+ break;
+ default:
+ show_usage(argc, argv);
+ goto error;
+ }
+ i++;
+ break;
+ case 'v':
+ verbose = 1;
+ break;
+ case 'M':
+ opt_mb = 1;
+ break;
+ default:
+ show_usage(argc, argv);
+ goto error;
+ }
+ }
+
+ loop_cnt_1 = loop_cnt[1];
+ loop_cnt_2 = loop_cnt[2];
+ loop_cnt_3 = loop_cnt[3];
+ loop_cnt_4 = loop_cnt[4];
+ loop_cnt_5 = loop_cnt[5];
+ loop_cnt_6 = loop_cnt[6];
+
+ if (set_signal_handler())
+ goto error;
+
+ if (!opt_disable_rseq && rseq_register_current_thread())
+ goto error;
+ switch (opt_test) {
+ case 's':
+ printf_verbose("spinlock\n");
+ test_percpu_spinlock();
+ break;
+ case 'l':
+ printf_verbose("linked list\n");
+ test_percpu_list();
+ break;
+ case 'b':
+ printf_verbose("buffer\n");
+ test_percpu_buffer();
+ break;
+ case 'm':
+ printf_verbose("memcpy buffer\n");
+ test_percpu_memcpy_buffer();
+ break;
+ case 'i':
+ printf_verbose("counter increment\n");
+ test_percpu_inc();
+ break;
+ }
+ if (!opt_disable_rseq && rseq_unregister_current_thread())
+ abort();
+end:
+ return 0;
+
+error:
+ return -1;
+}
--
2.11.0
This rseq helper library provides a user-space API to the rseq()
system call.
The rseq fast-path exposes the instruction pointer addresses where the
rseq assembly blocks begin and end, as well as the associated abort
instruction pointer, in the __rseq_table section. This section lets
debuggers know where to place breakpoints when single-stepping
through assembly blocks which may be aborted at any point by the kernel.
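For illustration only (not part of this patch), here is a minimal sketch
of how a tool could interpret one __rseq_table entry. The struct layout is
an assumption inferred from the .word directives emitted by
__RSEQ_ASM_DEFINE_TABLE in rseq-arm.h (two 32-bit words followed by three
64-bit fields, on a little-endian target). The names rseq_table_entry and
ip_in_rseq_cs() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of one __rseq_table entry (layout is an assumption). */
struct rseq_table_entry {
	uint32_t version;
	uint32_t flags;
	uint64_t start_ip;		/* first instruction of the block */
	uint64_t post_commit_offset;	/* length of the block, in bytes */
	uint64_t abort_ip;		/* where execution resumes on abort */
};

/* True if ip lies within [start_ip, start_ip + post_commit_offset). */
static bool ip_in_rseq_cs(const struct rseq_table_entry *e, uint64_t ip)
{
	/* Unsigned subtraction wraps, so ip < start_ip also yields false. */
	return ip - e->start_ip < e->post_commit_offset;
}
```

A debugger following this approach would presumably avoid single-stepping
while ip_in_rseq_cs() holds, and instead place breakpoints at abort_ip and
just past the end of the block.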
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Provide abort-ip signature: The abort-ip signature is located just
before the abort-ip target. It is currently hardcoded, but a
user-space application could use the __rseq_table to iterate on all
abort-ip targets and use a random value as signature if needed in the
future.
- Add rseq_prepare_unload(): Libraries and JIT code using rseq critical
sections need to issue rseq_prepare_unload() on each thread at least
once before reclaim of struct rseq_cs.
- Use initial-exec TLS model, non-weak symbol: The initial-exec model is
signal-safe, whereas the global-dynamic model is not. Remove the
"weak" symbol attribute from the __rseq_abi in rseq.c. The librseq.so
library will have ownership of that symbol, and there is no reason for
an application or user library to try to define that symbol.
The expected use is to link against librseq.so, which owns and provides
that symbol.
- Set cpu_id to -2 on register error
- Add rseq_len syscall parameter, rseq_cs version
- Ensure disassembler-friendly signature: x86 32/64 disassemblers have a
hard time decoding the instruction stream after a bad instruction. Use
a nopl instruction to encode the signature. Suggested by Andy Lutomirski.
- Exercise parametrized test variants in a shell script.
- Restartable sequences selftests: Remove use of event counter.
- Use cpu_id_start field: With the cpu_id_start field, the C
preparation phase of the fast-path does not need to compare cpu_id < 0
anymore.
- Signal-safe registration and refcounting: Allow libraries using
librseq.so to register it from signal handlers.
- Use OVERRIDE_TARGETS in makefile.
- Use "m" constraints for rseq_cs field.
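To illustrate the cpu_id_start and "cpu_id set to -2" items above, here is
a minimal sketch with mock types. The names mock_rseq_abi,
mock_cpu_start() and mock_fastpath_ok() are hypothetical and are not this
library's API. The sketch assumes that cpu_id_start always holds a valid
CPU index (0 when the thread is not registered), while cpu_id holds -1
(never registered) or -2 (registration failed) in the unregistered cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock of the two fields discussed above; not the real ABI struct. */
struct mock_rseq_abi {
	uint32_t cpu_id_start;	/* always usable as an array index */
	uint32_t cpu_id;	/* current CPU, or (uint32_t)-1 / -2 */
};

/* Preparation phase: pick a per-cpu slot without any "< 0" check. */
static int mock_cpu_start(const struct mock_rseq_abi *abi)
{
	return (int)abi->cpu_id_start;
}

/* Fast-path check: fails for an unregistered thread or after migration. */
static bool mock_fastpath_ok(const struct mock_rseq_abi *abi, int cpu)
{
	return (uint32_t)cpu == abi->cpu_id;
}
```

Under these assumptions an unregistered thread still indexes a valid
per-cpu slot, and the comparison against cpu_id alone is enough to send it
to the fallback path.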
Changes since v2:
- Update based on Thomas Gleixner's comments.
Changes since v3:
- Generate param_test_skip_fastpath and param_test_benchmark with
-DSKIP_FASTPATH and -DBENCHMARK (respectively). Add param_test_fastpath
to run_param_test.sh.
Changes since v4:
- Fold arm: workaround gcc asm size guess,
- Namespace barrier() -> rseq_barrier() in library header,
- Take into account coding style feedback from Peter Zijlstra,
- Split rseq selftests into logical commits.
---
tools/testing/selftests/rseq/rseq-arm.h | 732 +++++++++++++++++++
tools/testing/selftests/rseq/rseq-ppc.h | 688 ++++++++++++++++++
tools/testing/selftests/rseq/rseq-skip.h | 82 +++
tools/testing/selftests/rseq/rseq-x86.h | 1149 ++++++++++++++++++++++++++++++
tools/testing/selftests/rseq/rseq.c | 116 +++
tools/testing/selftests/rseq/rseq.h | 164 +++++
6 files changed, 2931 insertions(+)
create mode 100644 tools/testing/selftests/rseq/rseq-arm.h
create mode 100644 tools/testing/selftests/rseq/rseq-ppc.h
create mode 100644 tools/testing/selftests/rseq/rseq-skip.h
create mode 100644 tools/testing/selftests/rseq/rseq-x86.h
create mode 100644 tools/testing/selftests/rseq/rseq.c
create mode 100644 tools/testing/selftests/rseq/rseq.h
diff --git a/tools/testing/selftests/rseq/rseq-arm.h b/tools/testing/selftests/rseq/rseq-arm.h
new file mode 100644
index 000000000000..adcaa6cbbd01
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-arm.h
@@ -0,0 +1,732 @@
+/*
+ * rseq-arm.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#define RSEQ_SIG 0x53053053
+
+#define rseq_smp_mb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
+#define rseq_smp_rmb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
+#define rseq_smp_wmb() __asm__ __volatile__ ("dmb" ::: "memory", "cc")
+
+#define rseq_smp_load_acquire(p) \
+__extension__ ({ \
+ __typeof(*p) ____p1 = RSEQ_READ_ONCE(*p); \
+ rseq_smp_mb(); \
+ ____p1; \
+})
+
+#define rseq_smp_acquire__after_ctrl_dep() rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v) \
+do { \
+ rseq_smp_mb(); \
+ RSEQ_WRITE_ONCE(*p, v); \
+} while (0)
+
+#ifdef RSEQ_SKIP_FASTPATH
+#include "rseq-skip.h"
+#else /* !RSEQ_SKIP_FASTPATH */
+
+#define __RSEQ_ASM_DEFINE_TABLE(version, flags, start_ip, \
+ post_commit_offset, abort_ip) \
+ ".pushsection __rseq_table, \"aw\"\n\t" \
+ ".balign 32\n\t" \
+ ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+ ".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_TABLE(start_ip, post_commit_ip, abort_ip) \
+ __RSEQ_ASM_DEFINE_TABLE(0x0, 0x0, start_ip, \
+ (post_commit_ip - start_ip), abort_ip)
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \
+ RSEQ_INJECT_ASM(1) \
+ "adr r0, " __rseq_str(cs_label) "\n\t" \
+ "str r0, %[" __rseq_str(rseq_cs) "]\n\t" \
+ __rseq_str(label) ":\n\t"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \
+ RSEQ_INJECT_ASM(2) \
+ "ldr r0, %[" __rseq_str(current_cpu_id) "]\n\t" \
+ "cmp %[" __rseq_str(cpu_id) "], r0\n\t" \
+ "bne " __rseq_str(label) "\n\t"
+
+#define __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown, \
+ abort_label, version, flags, \
+ start_ip, post_commit_offset, abort_ip) \
+ __rseq_str(table_label) ":\n\t" \
+ ".word " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+ ".word " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+ ".word " __rseq_str(RSEQ_SIG) "\n\t" \
+ __rseq_str(label) ":\n\t" \
+ teardown \
+ "b %l[" __rseq_str(abort_label) "]\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown, abort_label, \
+ start_ip, post_commit_ip, abort_ip) \
+ __RSEQ_ASM_DEFINE_ABORT(table_label, label, teardown, \
+ abort_label, 0x0, 0x0, start_ip, \
+ (post_commit_ip - start_ip), abort_ip)
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, teardown, cmpfail_label) \
+ __rseq_str(label) ":\n\t" \
+ teardown \
+ "b %l[" __rseq_str(cmpfail_label) "]\n\t"
+
+#define rseq_workaround_gcc_asm_size_guess() __asm__ __volatile__("")
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[error2]\n\t"
+#endif
+ /* final store */
+ "str %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(5)
+ "b 5f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4, "", abort, 1b, 2b, 4f)
+ "5:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expectnot], r0\n\t"
+ "beq %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "ldr r0, %[v]\n\t"
+ "cmp %[expectnot], r0\n\t"
+ "beq %l[error2]\n\t"
+#endif
+ "str r0, %[load]\n\t"
+ "add r0, %[voffp]\n\t"
+ "ldr r0, [r0]\n\t"
+ /* final store */
+ "str r0, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(5)
+ "b 5f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4, "", abort, 1b, 2b, 4f)
+ "5:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expectnot] "r" (expectnot),
+ [voffp] "Ir" (voffp),
+ [load] "m" (*load)
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+#endif
+ "ldr r0, %[v]\n\t"
+ "add r0, %[count]\n\t"
+ /* final store */
+ "str r0, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(4)
+ "b 5f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4, "", abort, 1b, 2b, 4f)
+ "5:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ [v] "m" (*v),
+ [count] "Ir" (count)
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort
+#ifdef RSEQ_COMPARE_TWICE
+ , error1
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[error2]\n\t"
+#endif
+ /* try store */
+ "str %[newv2], %[v2]\n\t"
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ "str %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ "b 5f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4, "", abort, 1b, 2b, 4f)
+ "5:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "r" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[error2]\n\t"
+#endif
+ /* try store */
+ "str %[newv2], %[v2]\n\t"
+ RSEQ_INJECT_ASM(5)
+ "dmb\n\t" /* full mb provides store-release */
+ /* final store */
+ "str %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ "b 5f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4, "", abort, 1b, 2b, 4f)
+ "5:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "r" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+ "ldr r0, %[v2]\n\t"
+ "cmp %[expect2], r0\n\t"
+ "bne %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(5)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne %l[error2]\n\t"
+ "ldr r0, %[v2]\n\t"
+ "cmp %[expect2], r0\n\t"
+ "bne %l[error3]\n\t"
+#endif
+ /* final store */
+ "str %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ "b 5f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4, "", abort, 1b, 2b, 4f)
+ "5:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* cmp2 input */
+ [v2] "m" (*v2),
+ [expect2] "r" (expect2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2, error3
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("1st expected value comparison failed");
+error3:
+ rseq_bug("2nd expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ uint32_t rseq_scratch[3];
+
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ "str %[src], %[rseq_scratch0]\n\t"
+ "str %[dst], %[rseq_scratch1]\n\t"
+ "str %[len], %[rseq_scratch2]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne 5f\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 6f)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne 7f\n\t"
+#endif
+ /* try memcpy */
+ "cmp %[len], #0\n\t"
+ "beq 333f\n\t"
+ "222:\n\t"
+ "ldrb %%r0, [%[src]]\n\t"
+ "strb %%r0, [%[dst]]\n\t"
+ "adds %[src], #1\n\t"
+ "adds %[dst], #1\n\t"
+ "subs %[len], #1\n\t"
+ "bne 222b\n\t"
+ "333:\n\t"
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ "str %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t"
+ "b 8f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ abort, 1b, 2b, 4f)
+ RSEQ_ASM_DEFINE_CMPFAIL(5,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ cmpfail)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_DEFINE_CMPFAIL(6,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ error1)
+ RSEQ_ASM_DEFINE_CMPFAIL(7,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ error2)
+#endif
+ "8:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len),
+ [rseq_scratch0] "m" (rseq_scratch[0]),
+ [rseq_scratch1] "m" (rseq_scratch[1]),
+ [rseq_scratch2] "m" (rseq_scratch[2])
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_workaround_gcc_asm_size_guess();
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_workaround_gcc_asm_size_guess();
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ uint32_t rseq_scratch[3];
+
+ RSEQ_INJECT_C(9)
+
+ rseq_workaround_gcc_asm_size_guess();
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(1f, 2f, 4f) /* start, commit, abort */
+ "str %[src], %[rseq_scratch0]\n\t"
+ "str %[dst], %[rseq_scratch1]\n\t"
+ "str %[len], %[rseq_scratch2]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3f, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne 5f\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 6f)
+ "ldr r0, %[v]\n\t"
+ "cmp %[expect], r0\n\t"
+ "bne 7f\n\t"
+#endif
+ /* try memcpy */
+ "cmp %[len], #0\n\t"
+ "beq 333f\n\t"
+ "222:\n\t"
+ "ldrb %%r0, [%[src]]\n\t"
+ "strb %%r0, [%[dst]]\n\t"
+ "adds %[src], #1\n\t"
+ "adds %[dst], #1\n\t"
+ "subs %[len], #1\n\t"
+ "bne 222b\n\t"
+ "333:\n\t"
+ RSEQ_INJECT_ASM(5)
+ "dmb\n\t" /* full mb provides store-release */
+ /* final store */
+ "str %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t"
+ "b 8f\n\t"
+ RSEQ_ASM_DEFINE_ABORT(3, 4,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ abort, 1b, 2b, 4f)
+ RSEQ_ASM_DEFINE_CMPFAIL(5,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ cmpfail)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_DEFINE_CMPFAIL(6,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ error1)
+ RSEQ_ASM_DEFINE_CMPFAIL(7,
+ /* teardown */
+ "ldr %[len], %[rseq_scratch2]\n\t"
+ "ldr %[dst], %[rseq_scratch1]\n\t"
+ "ldr %[src], %[rseq_scratch0]\n\t",
+ error2)
+#endif
+ "8:\n\t"
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len),
+ [rseq_scratch0] "m" (rseq_scratch[0]),
+ [rseq_scratch1] "m" (rseq_scratch[1]),
+ [rseq_scratch2] "m" (rseq_scratch[2])
+ RSEQ_INJECT_INPUT
+ : "r0", "memory", "cc"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ rseq_workaround_gcc_asm_size_guess();
+ return 0;
+abort:
+ rseq_workaround_gcc_asm_size_guess();
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ rseq_workaround_gcc_asm_size_guess();
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_workaround_gcc_asm_size_guess();
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_workaround_gcc_asm_size_guess();
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+#endif /* !RSEQ_SKIP_FASTPATH */
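
Note (not part of the patch): the _release variant of the try-memcpy helper differs from the plain one only by the barrier issued before the final store, so the copied bytes become visible before the new value of @v. The C11 sketch below illustrates that ordering only. It is not restartable, does no cpu check, and `struct slot` and `publish_payload()` are hypothetical names introduced for this illustration. Returning -1 for an oversized length is also an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container: a payload buffer guarded by a published word. */
struct slot {
	char buf[16];
	_Atomic intptr_t v;
};

/*
 * Copy the payload, then publish newv with release ordering, provided
 * v still holds expect. Returns 0 on success, 1 on comparison failure,
 * -1 if len does not fit the buffer (assumption of this sketch).
 */
static int publish_payload(struct slot *s, intptr_t expect,
			   const void *src, size_t len, intptr_t newv)
{
	if (len > sizeof(s->buf))
		return -1;
	if (atomic_load_explicit(&s->v, memory_order_relaxed) != expect)
		return 1;
	/* "try memcpy" step */
	memcpy(s->buf, src, len);
	/* "final store" step, ordered after the copy */
	atomic_store_explicit(&s->v, newv, memory_order_release);
	return 0;
}
```

A reader that loads `v` with acquire ordering and sees `newv` can then rely on the payload being complete.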
diff --git a/tools/testing/selftests/rseq/rseq-ppc.h b/tools/testing/selftests/rseq/rseq-ppc.h
new file mode 100644
index 000000000000..2d6d2dfa1235
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -0,0 +1,688 @@
+/*
+ * rseq-ppc.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <[email protected]>
+ * (C) Copyright 2016 - Boqun Feng <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#define RSEQ_SIG 0x53053053
+
+#define rseq_smp_mb() __asm__ __volatile__ ("sync" ::: "memory", "cc")
+#define rseq_smp_lwsync() __asm__ __volatile__ ("lwsync" ::: "memory", "cc")
+#define rseq_smp_rmb() rseq_smp_lwsync()
+#define rseq_smp_wmb() rseq_smp_lwsync()
+
+#define rseq_smp_load_acquire(p) \
+__extension__ ({ \
+ __typeof(*p) ____p1 = RSEQ_READ_ONCE(*p); \
+ rseq_smp_lwsync(); \
+ ____p1; \
+})
+
+#define rseq_smp_acquire__after_ctrl_dep() rseq_smp_lwsync()
+
+#define rseq_smp_store_release(p, v) \
+do { \
+ rseq_smp_lwsync(); \
+ RSEQ_WRITE_ONCE(*p, v); \
+} while (0)
+
+#ifdef RSEQ_SKIP_FASTPATH
+#include "rseq-skip.h"
+#else /* !RSEQ_SKIP_FASTPATH */
+
+/*
+ * The __rseq_table section can be used by debuggers to better handle
+ * single-stepping through the restartable critical sections.
+ */
+
+#ifdef __PPC64__
+
+#define STORE_WORD "std "
+#define LOAD_WORD "ld "
+#define LOADX_WORD "ldx "
+#define CMP_WORD "cmpd "
+
+#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \
+ start_ip, post_commit_offset, abort_ip) \
+ ".pushsection __rseq_table, \"aw\"\n\t" \
+ ".balign 32\n\t" \
+ __rseq_str(label) ":\n\t" \
+ ".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+ ".quad " __rseq_str(start_ip) ", " __rseq_str(post_commit_offset) ", " __rseq_str(abort_ip) "\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \
+ RSEQ_INJECT_ASM(1) \
+ "lis %%r17, (" __rseq_str(cs_label) ")@highest\n\t" \
+ "ori %%r17, %%r17, (" __rseq_str(cs_label) ")@higher\n\t" \
+ "rldicr %%r17, %%r17, 32, 31\n\t" \
+ "oris %%r17, %%r17, (" __rseq_str(cs_label) ")@high\n\t" \
+ "ori %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \
+ "std %%r17, %[" __rseq_str(rseq_cs) "]\n\t" \
+ __rseq_str(label) ":\n\t"
+
+#else /* #ifdef __PPC64__ */
+
+#define STORE_WORD "stw "
+#define LOAD_WORD "lwz "
+#define LOADX_WORD "lwzx "
+#define CMP_WORD "cmpw "
+
+#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \
+ start_ip, post_commit_offset, abort_ip) \
+ ".pushsection __rseq_table, \"aw\"\n\t" \
+ ".balign 32\n\t" \
+ __rseq_str(label) ":\n\t" \
+ ".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+ /* 32-bit only supported on BE */ \
+ ".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \
+ RSEQ_INJECT_ASM(1) \
+ "lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t" \
+ "addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \
+ "stw %%r17, %[" __rseq_str(rseq_cs) "]\n\t" \
+ __rseq_str(label) ":\n\t"
+
+#endif /* #ifdef __PPC64__ */
+
+#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \
+ __RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip, \
+ (post_commit_ip - start_ip), abort_ip)
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \
+ RSEQ_INJECT_ASM(2) \
+ "lwz %%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \
+ "cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t" \
+ "bne- cr7, " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, abort_label) \
+ ".pushsection __rseq_failure, \"ax\"\n\t" \
+ ".long " __rseq_str(RSEQ_SIG) "\n\t" \
+ __rseq_str(label) ":\n\t" \
+ "b %l[" __rseq_str(abort_label) "]\n\t" \
+ ".popsection\n\t"
+
+/*
+ * RSEQ_ASM_OPs: asm operations for rseq
+ * RSEQ_ASM_OP_R_*: uses hard-coded registers
+ * RSEQ_ASM_OP_* (else): uses no hard-coded registers (except cr7)
+ */
+#define RSEQ_ASM_OP_CMPEQ(var, expect, label) \
+ LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" \
+ CMP_WORD "cr7, %%r17, %[" __rseq_str(expect) "]\n\t" \
+ "bne- cr7, " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_OP_CMPNE(var, expectnot, label) \
+ LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" \
+ CMP_WORD "cr7, %%r17, %[" __rseq_str(expectnot) "]\n\t" \
+ "beq- cr7, " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_OP_STORE(value, var) \
+ STORE_WORD "%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n\t"
+
+/* Load @var to r17 */
+#define RSEQ_ASM_OP_R_LOAD(var) \
+ LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"
+
+/* Store r17 to @var */
+#define RSEQ_ASM_OP_R_STORE(var) \
+ STORE_WORD "%%r17, %[" __rseq_str(var) "]\n\t"
+
+/* Add @count to r17 */
+#define RSEQ_ASM_OP_R_ADD(count) \
+ "add %%r17, %[" __rseq_str(count) "], %%r17\n\t"
+
+/* Load (r17 + voffp) to r17 */
+#define RSEQ_ASM_OP_R_LOADX(voffp) \
+ LOADX_WORD "%%r17, %[" __rseq_str(voffp) "], %%r17\n\t"
+
+/* TODO: implement a faster memcpy. */
+#define RSEQ_ASM_OP_R_MEMCPY() \
+ "cmpdi %%r19, 0\n\t" \
+ "beq 333f\n\t" \
+ "addi %%r20, %%r20, -1\n\t" \
+ "addi %%r21, %%r21, -1\n\t" \
+ "222:\n\t" \
+ "lbzu %%r18, 1(%%r20)\n\t" \
+ "stbu %%r18, 1(%%r21)\n\t" \
+ "addi %%r19, %%r19, -1\n\t" \
+ "cmpdi %%r19, 0\n\t" \
+ "bne 222b\n\t" \
+ "333:\n\t"
+
+#define RSEQ_ASM_OP_R_FINAL_STORE(var, post_commit_label) \
+ STORE_WORD "%%r17, %[" __rseq_str(var) "]\n\t" \
+ __rseq_str(post_commit_label) ":\n\t"
+
+#define RSEQ_ASM_OP_FINAL_STORE(value, var, post_commit_label) \
+ STORE_WORD "%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n\t" \
+ __rseq_str(post_commit_label) ":\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[error2])
+#endif
+ /* final store */
+ RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+ RSEQ_INJECT_ASM(5)
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v not equal to @expectnot */
+ RSEQ_ASM_OP_CMPNE(v, expectnot, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v not equal to @expectnot */
+ RSEQ_ASM_OP_CMPNE(v, expectnot, %l[error2])
+#endif
+ /* load the value of @v */
+ RSEQ_ASM_OP_R_LOAD(v)
+ /* store it in @load */
+ RSEQ_ASM_OP_R_STORE(load)
+ /* dereference voffp(v) */
+ RSEQ_ASM_OP_R_LOADX(voffp)
+ /* final store the value at voffp(v) */
+ RSEQ_ASM_OP_R_FINAL_STORE(v, 2)
+ RSEQ_INJECT_ASM(5)
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expectnot] "r" (expectnot),
+ [voffp] "b" (voffp),
+ [load] "m" (*load)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+#endif
+ /* load the value of @v */
+ RSEQ_ASM_OP_R_LOAD(v)
+ /* add @count to it */
+ RSEQ_ASM_OP_R_ADD(count)
+ /* final store */
+ RSEQ_ASM_OP_R_FINAL_STORE(v, 2)
+ RSEQ_INJECT_ASM(4)
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [count] "r" (count)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17"
+ RSEQ_INJECT_CLOBBER
+ : abort
+#ifdef RSEQ_COMPARE_TWICE
+ , error1
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[error2])
+#endif
+ /* try store */
+ RSEQ_ASM_OP_STORE(newv2, v2)
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "r" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[error2])
+#endif
+ /* try store */
+ RSEQ_ASM_OP_STORE(newv2, v2)
+ RSEQ_INJECT_ASM(5)
+ /* for 'release' */
+ "lwsync\n\t"
+ /* final store */
+ RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "r" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+ /* cmp @v2 equal to @expect2 */
+ RSEQ_ASM_OP_CMPEQ(v2, expect2, %l[cmpfail])
+ RSEQ_INJECT_ASM(5)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[error2])
+ /* cmp @v2 equal to @expect2 */
+ RSEQ_ASM_OP_CMPEQ(v2, expect2, %l[error3])
+#endif
+ /* final store */
+ RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* cmp2 input */
+ [v2] "m" (*v2),
+ [expect2] "r" (expect2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2, error3
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("1st expected value comparison failed");
+error3:
+ rseq_bug("2nd expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* setup for memcpy */
+ "mr %%r19, %[len]\n\t"
+ "mr %%r20, %[src]\n\t"
+ "mr %%r21, %[dst]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[error2])
+#endif
+ /* try memcpy */
+ RSEQ_ASM_OP_R_MEMCPY()
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+ RSEQ_INJECT_ASM(6)
+ /* no teardown: the copy loop only modifies clobbered r18-r21 */
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17", "r18", "r19", "r20", "r21"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* setup for memcpy */
+ "mr %%r19, %[len]\n\t"
+ "mr %%r20, %[src]\n\t"
+ "mr %%r21, %[dst]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ /* cmp cpuid */
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ /* cmp @v equal to @expect */
+ RSEQ_ASM_OP_CMPEQ(v, expect, %l[error2])
+#endif
+ /* try memcpy */
+ RSEQ_ASM_OP_R_MEMCPY()
+ RSEQ_INJECT_ASM(5)
+ /* for 'release' */
+ "lwsync\n\t"
+ /* final store */
+ RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
+ RSEQ_INJECT_ASM(6)
+ /* no teardown: the copy loop only modifies clobbered r18-r21 */
+ RSEQ_ASM_DEFINE_ABORT(4, abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len)
+ RSEQ_INJECT_INPUT
+ : "memory", "cc", "r17", "r18", "r19", "r20", "r21"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+#undef STORE_WORD
+#undef LOAD_WORD
+#undef LOADX_WORD
+#undef CMP_WORD
+
+#endif /* !RSEQ_SKIP_FASTPATH */
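
Note (not part of the patch): on PPC64, RSEQ_ASM_STORE_RSEQ_CS builds the 64-bit address of the table entry from four 16-bit pieces (@highest, @higher, @high, @l) using lis, ori, rldicr, oris and ori. The C sketch below models what each instruction does to the register, as a way to check the sequence by hand. `build_addr64()` is a hypothetical helper written for this note. It models the arithmetic only, not the assembler relocations.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the PPC64 lis/ori/rldicr/oris/ori sequence on a 64-bit register. */
static uint64_t build_addr64(uint64_t addr)
{
	uint16_t highest = (uint16_t)(addr >> 48);	/* @highest */
	uint16_t higher = (uint16_t)(addr >> 32);	/* @higher */
	uint16_t high = (uint16_t)(addr >> 16);		/* @high */
	uint16_t low = (uint16_t)addr;			/* @l */
	uint64_t r;

	/* lis: r = highest << 16, sign-extended to 64 bits */
	r = (uint64_t)highest << 16;
	if (highest & 0x8000)
		r |= 0xffffffff00000000ULL;
	/* ori: OR in the next 16 bits */
	r |= higher;
	/* rldicr r, r, 32, 31: rotate left by 32, keep only the upper 32 bits */
	r = ((r << 32) | (r >> 32)) & 0xffffffff00000000ULL;
	/* oris: OR in high << 16 */
	r |= (uint64_t)high << 16;
	/* ori: OR in the low 16 bits */
	r |= low;
	return r;
}
```

The rotate-and-clear step discards the sign-extension bits left by lis, which is why the sequence also works for addresses with the top bit set.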
diff --git a/tools/testing/selftests/rseq/rseq-skip.h b/tools/testing/selftests/rseq/rseq-skip.h
new file mode 100644
index 000000000000..dc8f8e74b737
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-skip.h
@@ -0,0 +1,82 @@
+/*
+ * rseq-skip.h
+ *
+ * (C) Copyright 2017 - Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ return -1;
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ return -1;
+}
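
Note (not part of the patch): every helper in rseq-skip.h returns -1, which is the same value the real fast paths return on abort (0 means committed, 1 means the comparison failed). A caller written against that convention therefore always takes its fallback when RSEQ_SKIP_FASTPATH is defined. The sketch below shows that control flow with stand-ins. `fastpath_cmpeqv_storev_stub()`, `slowpath_cmpeqv_storev()` and `cmpeqv_storev_with_fallback()` are hypothetical names, and the slow path is a plain, non-atomic placeholder rather than whatever fallback the series actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in with the rseq-skip.h behaviour: the fast path always aborts. */
static int fastpath_cmpeqv_storev_stub(intptr_t *v, intptr_t expect,
				       intptr_t newv, int cpu)
{
	(void)v;
	(void)expect;
	(void)newv;
	(void)cpu;
	return -1;
}

/* Hypothetical slow path: plain compare-and-store, not restartable. */
static int slowpath_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv)
{
	if (*v != expect)
		return 1;
	*v = newv;
	return 0;
}

/* Returns 0 when the store happened, 1 when *v did not match expect. */
static int cmpeqv_storev_with_fallback(intptr_t *v, intptr_t expect,
				       intptr_t newv, int cpu)
{
	int ret = fastpath_cmpeqv_storev_stub(v, expect, newv, cpu);

	if (ret >= 0)
		return ret;	/* 0: committed, 1: comparison failed */
	/* -1: aborted (always the case with the stub), take the slow path */
	return slowpath_cmpeqv_storev(v, expect, newv);
}
```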
diff --git a/tools/testing/selftests/rseq/rseq-x86.h b/tools/testing/selftests/rseq/rseq-x86.h
new file mode 100644
index 000000000000..02d853ae2ce1
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq-x86.h
@@ -0,0 +1,1149 @@
+/*
+ * rseq-x86.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <stdint.h>
+
+#define RSEQ_SIG 0x53053053
+
+#ifdef __x86_64__
+
+#define rseq_smp_mb() \
+ __asm__ __volatile__ ("lock; addl $0,-128(%%rsp)" ::: "memory", "cc")
+#define rseq_smp_rmb() rseq_barrier()
+#define rseq_smp_wmb() rseq_barrier()
+
+#define rseq_smp_load_acquire(p) \
+__extension__ ({ \
+ __typeof(*p) ____p1 = RSEQ_READ_ONCE(*p); \
+ rseq_barrier(); \
+ ____p1; \
+})
+
+#define rseq_smp_acquire__after_ctrl_dep() rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v) \
+do { \
+ rseq_barrier(); \
+ RSEQ_WRITE_ONCE(*p, v); \
+} while (0)
+
+#ifdef RSEQ_SKIP_FASTPATH
+#include "rseq-skip.h"
+#else /* !RSEQ_SKIP_FASTPATH */
+
+#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \
+ start_ip, post_commit_offset, abort_ip) \
+ ".pushsection __rseq_table, \"aw\"\n\t" \
+ ".balign 32\n\t" \
+ __rseq_str(label) ":\n\t" \
+ ".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+ ".quad " __rseq_str(start_ip) ", " __rseq_str(post_commit_offset) ", " __rseq_str(abort_ip) "\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \
+ __RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip, \
+ (post_commit_ip - start_ip), abort_ip)
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \
+ RSEQ_INJECT_ASM(1) \
+ "leaq " __rseq_str(cs_label) "(%%rip), %%rax\n\t" \
+ "movq %%rax, %[" __rseq_str(rseq_cs) "]\n\t" \
+ __rseq_str(label) ":\n\t"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \
+ RSEQ_INJECT_ASM(2) \
+ "cmpl %[" __rseq_str(cpu_id) "], %[" __rseq_str(current_cpu_id) "]\n\t" \
+ "jnz " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, teardown, abort_label) \
+ ".pushsection __rseq_failure, \"ax\"\n\t" \
+ /* Disassembler-friendly signature: nopl <sig>(%rip). */\
+ ".byte 0x0f, 0x1f, 0x05\n\t" \
+ ".long " __rseq_str(RSEQ_SIG) "\n\t" \
+ __rseq_str(label) ":\n\t" \
+ teardown \
+ "jmp %l[" __rseq_str(abort_label) "]\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, teardown, cmpfail_label) \
+ ".pushsection __rseq_failure, \"ax\"\n\t" \
+ __rseq_str(label) ":\n\t" \
+ teardown \
+ "jmp %l[" __rseq_str(cmpfail_label) "]\n\t" \
+ ".popsection\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpq %[v], %[expect]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "cmpq %[v], %[expect]\n\t"
+ "jnz %l[error2]\n\t"
+#endif
+ /* final store */
+ "movq %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(5)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ : "memory", "cc", "rax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+/*
+ * Compare *@v against @expectnot. When they do _not_ match, save *@v
+ * into *@load, then store the word at address (*@v + @voffp) into *@v.
+ */
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "movq %[v], %%rbx\n\t"
+ "cmpq %%rbx, %[expectnot]\n\t"
+ "je %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "movq %[v], %%rbx\n\t"
+ "cmpq %%rbx, %[expectnot]\n\t"
+ "je %l[error2]\n\t"
+#endif
+ "movq %%rbx, %[load]\n\t"
+ "addq %[voffp], %%rbx\n\t"
+ "movq (%%rbx), %%rbx\n\t"
+ /* final store */
+ "movq %%rbx, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(5)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expectnot] "r" (expectnot),
+ [voffp] "er" (voffp),
+ [load] "m" (*load)
+ : "memory", "cc", "rax", "rbx"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+#endif
+ /* final store */
+ "addq %[count], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(4)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [count] "er" (count)
+ : "memory", "cc", "rax"
+ RSEQ_INJECT_CLOBBER
+ : abort
+#ifdef RSEQ_COMPARE_TWICE
+ , error1
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpq %[v], %[expect]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "cmpq %[v], %[expect]\n\t"
+ "jnz %l[error2]\n\t"
+#endif
+ /* try store */
+ "movq %[newv2], %[v2]\n\t"
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ "movq %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "r" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ : "memory", "cc", "rax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+/* x86-64 is TSO. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ return rseq_cmpeqv_trystorev_storev(v, expect, v2, newv2, newv, cpu);
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpq %[v], %[expect]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+ "cmpq %[v2], %[expect2]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(5)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "cmpq %[v], %[expect]\n\t"
+ "jnz %l[error2]\n\t"
+ "cmpq %[v2], %[expect2]\n\t"
+ "jnz %l[error3]\n\t"
+#endif
+ /* final store */
+ "movq %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* cmp2 input */
+ [v2] "m" (*v2),
+ [expect2] "r" (expect2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ : "memory", "cc", "rax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2, error3
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("1st expected value comparison failed");
+error3:
+ rseq_bug("2nd expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ uint64_t rseq_scratch[3];
+
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ "movq %[src], %[rseq_scratch0]\n\t"
+ "movq %[dst], %[rseq_scratch1]\n\t"
+ "movq %[len], %[rseq_scratch2]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpq %[v], %[expect]\n\t"
+ "jnz 5f\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 6f)
+ "cmpq %[v], %[expect]\n\t"
+ "jnz 7f\n\t"
+#endif
+ /* try memcpy */
+ "test %[len], %[len]\n\t" \
+ "jz 333f\n\t" \
+ "222:\n\t" \
+ "movb (%[src]), %%al\n\t" \
+ "movb %%al, (%[dst])\n\t" \
+ "inc %[src]\n\t" \
+ "inc %[dst]\n\t" \
+ "dec %[len]\n\t" \
+ "jnz 222b\n\t" \
+ "333:\n\t" \
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ "movq %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ /* teardown */
+ "movq %[rseq_scratch2], %[len]\n\t"
+ "movq %[rseq_scratch1], %[dst]\n\t"
+ "movq %[rseq_scratch0], %[src]\n\t"
+ RSEQ_ASM_DEFINE_ABORT(4,
+ "movq %[rseq_scratch2], %[len]\n\t"
+ "movq %[rseq_scratch1], %[dst]\n\t"
+ "movq %[rseq_scratch0], %[src]\n\t",
+ abort)
+ RSEQ_ASM_DEFINE_CMPFAIL(5,
+ "movq %[rseq_scratch2], %[len]\n\t"
+ "movq %[rseq_scratch1], %[dst]\n\t"
+ "movq %[rseq_scratch0], %[src]\n\t",
+ cmpfail)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_DEFINE_CMPFAIL(6,
+ "movq %[rseq_scratch2], %[len]\n\t"
+ "movq %[rseq_scratch1], %[dst]\n\t"
+ "movq %[rseq_scratch0], %[src]\n\t",
+ error1)
+ RSEQ_ASM_DEFINE_CMPFAIL(7,
+ "movq %[rseq_scratch2], %[len]\n\t"
+ "movq %[rseq_scratch1], %[dst]\n\t"
+ "movq %[rseq_scratch0], %[src]\n\t",
+ error2)
+#endif
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len),
+ [rseq_scratch0] "m" (rseq_scratch[0]),
+ [rseq_scratch1] "m" (rseq_scratch[1]),
+ [rseq_scratch2] "m" (rseq_scratch[2])
+ : "memory", "cc", "rax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+/* x86-64 is TSO. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ return rseq_cmpeqv_trymemcpy_storev(v, expect, dst, src, len,
+ newv, cpu);
+}
+
+#endif /* !RSEQ_SKIP_FASTPATH */
+
+#elif __i386__
+
+#define rseq_smp_mb() \
+ __asm__ __volatile__ ("lock; addl $0,-128(%%esp)" ::: "memory", "cc")
+#define rseq_smp_rmb() \
+ __asm__ __volatile__ ("lock; addl $0,-128(%%esp)" ::: "memory", "cc")
+#define rseq_smp_wmb() \
+ __asm__ __volatile__ ("lock; addl $0,-128(%%esp)" ::: "memory", "cc")
+
+#define rseq_smp_load_acquire(p) \
+__extension__ ({ \
+ __typeof(*p) ____p1 = RSEQ_READ_ONCE(*p); \
+ rseq_smp_mb(); \
+ ____p1; \
+})
+
+#define rseq_smp_acquire__after_ctrl_dep() rseq_smp_rmb()
+
+#define rseq_smp_store_release(p, v) \
+do { \
+ rseq_smp_mb(); \
+ RSEQ_WRITE_ONCE(*p, v); \
+} while (0)
+
+#ifdef RSEQ_SKIP_FASTPATH
+#include "rseq-skip.h"
+#else /* !RSEQ_SKIP_FASTPATH */
+
+/*
+ * Use eax as scratch register and take memory operands as input to
+ * lessen register pressure. Especially needed when compiling at -O0.
+ */
+#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \
+ start_ip, post_commit_offset, abort_ip) \
+ ".pushsection __rseq_table, \"aw\"\n\t" \
+ ".balign 32\n\t" \
+ __rseq_str(label) ":\n\t" \
+ ".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
+ ".long " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) ", 0x0\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \
+ __RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip, \
+ (post_commit_ip - start_ip), abort_ip)
+
+#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \
+ RSEQ_INJECT_ASM(1) \
+ "movl $" __rseq_str(cs_label) ", %[rseq_cs]\n\t" \
+ __rseq_str(label) ":\n\t"
+
+#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \
+ RSEQ_INJECT_ASM(2) \
+ "cmpl %[" __rseq_str(cpu_id) "], %[" __rseq_str(current_cpu_id) "]\n\t" \
+ "jnz " __rseq_str(label) "\n\t"
+
+#define RSEQ_ASM_DEFINE_ABORT(label, teardown, abort_label) \
+ ".pushsection __rseq_failure, \"ax\"\n\t" \
+ /* Disassembler-friendly signature: nopl <sig>. */ \
+ ".byte 0x0f, 0x1f, 0x05\n\t" \
+ ".long " __rseq_str(RSEQ_SIG) "\n\t" \
+ __rseq_str(label) ":\n\t" \
+ teardown \
+ "jmp %l[" __rseq_str(abort_label) "]\n\t" \
+ ".popsection\n\t"
+
+#define RSEQ_ASM_DEFINE_CMPFAIL(label, teardown, cmpfail_label) \
+ ".pushsection __rseq_failure, \"ax\"\n\t" \
+ __rseq_str(label) ":\n\t" \
+ teardown \
+ "jmp %l[" __rseq_str(cmpfail_label) "]\n\t" \
+ ".popsection\n\t"
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpl %[v], %[expect]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "cmpl %[v], %[expect]\n\t"
+ "jnz %l[error2]\n\t"
+#endif
+ /* final store */
+ "movl %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(5)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+/*
+ * Compare *@v against @expectnot. When they do _not_ match, save *@v
+ * into *@load, then store the word at address (*@v + @voffp) into *@v.
+ */
+static inline __attribute__((always_inline))
+int rseq_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "movl %[v], %%ebx\n\t"
+ "cmpl %%ebx, %[expectnot]\n\t"
+ "je %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "movl %[v], %%ebx\n\t"
+ "cmpl %%ebx, %[expectnot]\n\t"
+ "je %l[error2]\n\t"
+#endif
+ "movl %%ebx, %[load]\n\t"
+ "addl %[voffp], %%ebx\n\t"
+ "movl (%%ebx), %%ebx\n\t"
+ /* final store */
+ "movl %%ebx, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(5)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expectnot] "r" (expectnot),
+ [voffp] "ir" (voffp),
+ [load] "m" (*load)
+ : "memory", "cc", "eax", "ebx"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_addv(intptr_t *v, intptr_t count, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+#endif
+ /* final store */
+ "addl %[count], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(4)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [count] "ir" (count)
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort
+#ifdef RSEQ_COMPARE_TWICE
+ , error1
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpl %[v], %[expect]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "cmpl %[v], %[expect]\n\t"
+ "jnz %l[error2]\n\t"
+#endif
+ /* try store */
+ "movl %[newv2], %%eax\n\t"
+ "movl %%eax, %[v2]\n\t"
+ RSEQ_INJECT_ASM(5)
+ /* final store */
+ "movl %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "m" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "r" (newv)
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trystorev_storev_release(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "movl %[expect], %%eax\n\t"
+ "cmpl %[v], %%eax\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "movl %[expect], %%eax\n\t"
+ "cmpl %[v], %%eax\n\t"
+ "jnz %l[error2]\n\t"
+#endif
+ /* try store */
+ "movl %[newv2], %[v2]\n\t"
+ RSEQ_INJECT_ASM(5)
+ "lock; addl $0,-128(%%esp)\n\t"
+ /* final store */
+ "movl %[newv], %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* try store input */
+ [v2] "m" (*v2),
+ [newv2] "r" (newv2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "m" (expect),
+ [newv] "r" (newv)
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+
+}
+
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "cmpl %[v], %[expect]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(4)
+ "cmpl %[expect2], %[v2]\n\t"
+ "jnz %l[cmpfail]\n\t"
+ RSEQ_INJECT_ASM(5)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, %l[error1])
+ "cmpl %[v], %[expect]\n\t"
+ "jnz %l[error2]\n\t"
+ "cmpl %[expect2], %[v2]\n\t"
+ "jnz %l[error3]\n\t"
+#endif
+ "movl %[newv], %%eax\n\t"
+ /* final store */
+ "movl %%eax, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ RSEQ_ASM_DEFINE_ABORT(4, "", abort)
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* cmp2 input */
+ [v2] "m" (*v2),
+ [expect2] "r" (expect2),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "r" (expect),
+ [newv] "m" (newv)
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2, error3
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("1st expected value comparison failed");
+error3:
+ rseq_bug("2nd expected value comparison failed");
+#endif
+}
+
+/* TODO: implement a faster memcpy. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ uint32_t rseq_scratch[3];
+
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ "movl %[src], %[rseq_scratch0]\n\t"
+ "movl %[dst], %[rseq_scratch1]\n\t"
+ "movl %[len], %[rseq_scratch2]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "movl %[expect], %%eax\n\t"
+ "cmpl %%eax, %[v]\n\t"
+ "jnz 5f\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 6f)
+ "movl %[expect], %%eax\n\t"
+ "cmpl %%eax, %[v]\n\t"
+ "jnz 7f\n\t"
+#endif
+ /* try memcpy */
+ "test %[len], %[len]\n\t" \
+ "jz 333f\n\t" \
+ "222:\n\t" \
+ "movb (%[src]), %%al\n\t" \
+ "movb %%al, (%[dst])\n\t" \
+ "inc %[src]\n\t" \
+ "inc %[dst]\n\t" \
+ "dec %[len]\n\t" \
+ "jnz 222b\n\t" \
+ "333:\n\t" \
+ RSEQ_INJECT_ASM(5)
+ "movl %[newv], %%eax\n\t"
+ /* final store */
+ "movl %%eax, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ /* teardown */
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t"
+ RSEQ_ASM_DEFINE_ABORT(4,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ abort)
+ RSEQ_ASM_DEFINE_CMPFAIL(5,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ cmpfail)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_DEFINE_CMPFAIL(6,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ error1)
+ RSEQ_ASM_DEFINE_CMPFAIL(7,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ error2)
+#endif
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "m" (expect),
+ [newv] "m" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len),
+ [rseq_scratch0] "m" (rseq_scratch[0]),
+ [rseq_scratch1] "m" (rseq_scratch[1]),
+ [rseq_scratch2] "m" (rseq_scratch[2])
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+/* TODO: implement a faster memcpy. */
+static inline __attribute__((always_inline))
+int rseq_cmpeqv_trymemcpy_storev_release(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ uint32_t rseq_scratch[3];
+
+ RSEQ_INJECT_C(9)
+
+ __asm__ __volatile__ goto (
+ RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
+ "movl %[src], %[rseq_scratch0]\n\t"
+ "movl %[dst], %[rseq_scratch1]\n\t"
+ "movl %[len], %[rseq_scratch2]\n\t"
+ /* Start rseq by storing table entry pointer into rseq_cs. */
+ RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
+ RSEQ_INJECT_ASM(3)
+ "movl %[expect], %%eax\n\t"
+ "cmpl %%eax, %[v]\n\t"
+ "jnz 5f\n\t"
+ RSEQ_INJECT_ASM(4)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 6f)
+ "movl %[expect], %%eax\n\t"
+ "cmpl %%eax, %[v]\n\t"
+ "jnz 7f\n\t"
+#endif
+ /* try memcpy */
+ "test %[len], %[len]\n\t" \
+ "jz 333f\n\t" \
+ "222:\n\t" \
+ "movb (%[src]), %%al\n\t" \
+ "movb %%al, (%[dst])\n\t" \
+ "inc %[src]\n\t" \
+ "inc %[dst]\n\t" \
+ "dec %[len]\n\t" \
+ "jnz 222b\n\t" \
+ "333:\n\t" \
+ RSEQ_INJECT_ASM(5)
+ "lock; addl $0,-128(%%esp)\n\t"
+ "movl %[newv], %%eax\n\t"
+ /* final store */
+ "movl %%eax, %[v]\n\t"
+ "2:\n\t"
+ RSEQ_INJECT_ASM(6)
+ /* teardown */
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t"
+ RSEQ_ASM_DEFINE_ABORT(4,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ abort)
+ RSEQ_ASM_DEFINE_CMPFAIL(5,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ cmpfail)
+#ifdef RSEQ_COMPARE_TWICE
+ RSEQ_ASM_DEFINE_CMPFAIL(6,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ error1)
+ RSEQ_ASM_DEFINE_CMPFAIL(7,
+ "movl %[rseq_scratch2], %[len]\n\t"
+ "movl %[rseq_scratch1], %[dst]\n\t"
+ "movl %[rseq_scratch0], %[src]\n\t",
+ error2)
+#endif
+ : /* gcc asm goto does not allow outputs */
+ : [cpu_id] "r" (cpu),
+ [current_cpu_id] "m" (__rseq_abi.cpu_id),
+ [rseq_cs] "m" (__rseq_abi.rseq_cs),
+ /* final store input */
+ [v] "m" (*v),
+ [expect] "m" (expect),
+ [newv] "m" (newv),
+ /* try memcpy input */
+ [dst] "r" (dst),
+ [src] "r" (src),
+ [len] "r" (len),
+ [rseq_scratch0] "m" (rseq_scratch[0]),
+ [rseq_scratch1] "m" (rseq_scratch[1]),
+ [rseq_scratch2] "m" (rseq_scratch[2])
+ : "memory", "cc", "eax"
+ RSEQ_INJECT_CLOBBER
+ : abort, cmpfail
+#ifdef RSEQ_COMPARE_TWICE
+ , error1, error2
+#endif
+ );
+ return 0;
+abort:
+ RSEQ_INJECT_FAILED
+ return -1;
+cmpfail:
+ return 1;
+#ifdef RSEQ_COMPARE_TWICE
+error1:
+ rseq_bug("cpu_id comparison failed");
+error2:
+ rseq_bug("expected value comparison failed");
+#endif
+}
+
+#endif /* !RSEQ_SKIP_FASTPATH */
+
+#endif
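
The helpers above share one return convention: 0 when the final store was committed, 1 when a value comparison failed (cmpfail), and -1 when the sequence was aborted, for instance because @cpu is no longer the current CPU. The sketch below is a plain-C model of that contract for rseq_cmpeqv_storev(), with a caller-side retry loop. It is not restartable and is not the fast path from this patch. The names model_current_cpu, model_cmpeqv_storev and model_percpu_trylock are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the CPU number published in __rseq_abi.cpu_id. */
static int model_current_cpu;

/*
 * Plain-C model of the rseq_cmpeqv_storev() return contract (assumption:
 * single-threaded, no preemption, so nothing here is actually restartable):
 *   -1  @cpu is not the current CPU (abort path)
 *    1  *@v differs from @expect (cmpfail path)
 *    0  @newv was stored into *@v (commit)
 */
static int model_cmpeqv_storev(intptr_t *v, intptr_t expect,
			       intptr_t newv, int cpu)
{
	if (cpu != model_current_cpu)
		return -1;
	if (*v != expect)
		return 1;
	*v = newv;
	return 0;
}

/*
 * Caller-side retry loop: try to take a per-CPU lock word (0 = free,
 * 1 = held). Returns the CPU whose lock was taken, or -1 after
 * @max_tries failed attempts.
 */
static int model_percpu_trylock(intptr_t *lock_per_cpu, int max_tries)
{
	while (max_tries-- > 0) {
		/* Real code would re-read the CPU number on every attempt. */
		int cpu = model_current_cpu;
		int ret = model_cmpeqv_storev(&lock_per_cpu[cpu], 0, 1, cpu);

		if (ret == 0)
			return cpu;
		/* ret == 1: lock is held; ret == -1: aborted. Retry. */
	}
	return -1;
}
```

A caller is expected to treat both non-zero results as "try again", re-reading the CPU number first, which is why the loop does not distinguish them.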
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
new file mode 100644
index 000000000000..ba65c14ef3ae
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -0,0 +1,116 @@
+/*
+ * rseq.c
+ *
+ * Copyright (C) 2016 Mathieu Desnoyers <[email protected]>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "rseq.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+__attribute__((tls_model("initial-exec"))) __thread
+volatile struct rseq __rseq_abi = {
+ .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+static __attribute__((tls_model("initial-exec"))) __thread
+volatile int refcount;
+
+static void signal_off_save(sigset_t *oldset)
+{
+ sigset_t set;
+ int ret;
+
+ sigfillset(&set);
+ ret = pthread_sigmask(SIG_BLOCK, &set, oldset);
+ if (ret)
+ abort();
+}
+
+static void signal_restore(sigset_t oldset)
+{
+ int ret;
+
+ ret = pthread_sigmask(SIG_SETMASK, &oldset, NULL);
+ if (ret)
+ abort();
+}
+
+static int sys_rseq(volatile struct rseq *rseq_abi, uint32_t rseq_len,
+ int flags, uint32_t sig)
+{
+ return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
+}
+
+int rseq_register_current_thread(void)
+{
+ int rc, ret = 0;
+ sigset_t oldset;
+
+ signal_off_save(&oldset);
+ if (refcount++)
+ goto end;
+ rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
+ if (!rc) {
+ assert(rseq_current_cpu_raw() >= 0);
+ goto end;
+ }
+ if (errno != EBUSY)
+ __rseq_abi.cpu_id = -2; /* rseq initialization has failed */
+ ret = -1;
+ refcount--;
+end:
+ signal_restore(oldset);
+ return ret;
+}
+
+int rseq_unregister_current_thread(void)
+{
+ int rc, ret = 0;
+ sigset_t oldset;
+
+ signal_off_save(&oldset);
+ if (--refcount)
+ goto end;
+ rc = sys_rseq(&__rseq_abi, sizeof(struct rseq),
+ RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
+ if (!rc)
+ goto end;
+ ret = -1;
+end:
+ signal_restore(oldset);
+ return ret;
+}
+
+int32_t rseq_fallback_current_cpu(void)
+{
+ int32_t cpu;
+
+ cpu = sched_getcpu();
+ if (cpu < 0) {
+ perror("sched_getcpu()");
+ abort();
+ }
+ return cpu;
+}
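
rseq_register_current_thread() and rseq_unregister_current_thread() above keep a per-thread reference count, so only the outermost register/unregister pair issues the system call. The sketch below models that nesting with a stubbed system call. It leaves out the signal masking and the errno handling, and the assert() on unbalanced unregistration is an addition that the code above does not have. All names (model_refcount, model_syscalls, stub_sys_rseq, model_register, model_unregister) are hypothetical.

```c
#include <assert.h>

static int model_refcount;	/* per-thread (__thread) in the real code */
static int model_syscalls;	/* number of calls to the stubbed system call */

/* Hypothetical stub standing in for sys_rseq(); it always succeeds. */
static int stub_sys_rseq(void)
{
	model_syscalls++;
	return 0;
}

static int model_register(void)
{
	if (model_refcount++)
		return 0;	/* nested registration: no system call */
	if (!stub_sys_rseq())
		return 0;
	model_refcount--;	/* undo the increment on failure */
	return -1;
}

static int model_unregister(void)
{
	/* Assumption of this sketch: unbalanced calls are a caller bug. */
	assert(model_refcount > 0);
	if (--model_refcount)
		return 0;	/* still registered on behalf of an outer caller */
	return stub_sys_rseq() ? -1 : 0;
}
```

With this scheme a library and the application that links it can each register the same thread without coordinating, as long as every registration is eventually matched by an unregistration.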
diff --git a/tools/testing/selftests/rseq/rseq.h b/tools/testing/selftests/rseq/rseq.h
new file mode 100644
index 000000000000..d199f0beee1b
--- /dev/null
+++ b/tools/testing/selftests/rseq/rseq.h
@@ -0,0 +1,164 @@
+/*
+ * rseq.h
+ *
+ * (C) Copyright 2016 - Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef RSEQ_H
+#define RSEQ_H
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <pthread.h>
+#include <signal.h>
+#include <sched.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sched.h>
+#include <linux/rseq.h>
+
+/*
+ * Code injection macros are empty by default; override them when testing.
+ * Note that the ASM injection macros must be fully reentrant
+ * (e.g. they must not modify the stack).
+ */
+#ifndef RSEQ_INJECT_ASM
+#define RSEQ_INJECT_ASM(n)
+#endif
+
+#ifndef RSEQ_INJECT_C
+#define RSEQ_INJECT_C(n)
+#endif
+
+#ifndef RSEQ_INJECT_INPUT
+#define RSEQ_INJECT_INPUT
+#endif
+
+#ifndef RSEQ_INJECT_CLOBBER
+#define RSEQ_INJECT_CLOBBER
+#endif
+
+#ifndef RSEQ_INJECT_FAILED
+#define RSEQ_INJECT_FAILED
+#endif
+
+extern __thread volatile struct rseq __rseq_abi;
+
+#define rseq_likely(x) __builtin_expect(!!(x), 1)
+#define rseq_unlikely(x) __builtin_expect(!!(x), 0)
+#define rseq_barrier() __asm__ __volatile__("" : : : "memory")
+
+#define RSEQ_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x))
+#define RSEQ_WRITE_ONCE(x, v) __extension__ ({ RSEQ_ACCESS_ONCE(x) = (v); })
+#define RSEQ_READ_ONCE(x) RSEQ_ACCESS_ONCE(x)
+
+#define __rseq_str_1(x) #x
+#define __rseq_str(x) __rseq_str_1(x)
+
+#define rseq_log(fmt, args...) \
+ fprintf(stderr, fmt "(in %s() at " __FILE__ ":" __rseq_str(__LINE__) ")\n", \
+ ## args, __func__)
+
+#define rseq_bug(fmt, args...) \
+ do { \
+ rseq_log(fmt, ##args); \
+ abort(); \
+ } while (0)
+
+#if defined(__x86_64__) || defined(__i386__)
+#include <rseq-x86.h>
+#elif defined(__ARMEL__)
+#include <rseq-arm.h>
+#elif defined(__PPC__)
+#include <rseq-ppc.h>
+#else
+#error unsupported target
+#endif
+
+/*
+ * Register rseq for the current thread. This needs to be called once
+ * by any thread which uses restartable sequences, before they start
+ * using restartable sequences, to ensure restartable sequences
+ * succeed. A restartable sequence executed from a non-registered
+ * thread will always fail.
+ */
+int rseq_register_current_thread(void);
+
+/*
+ * Unregister rseq for current thread.
+ */
+int rseq_unregister_current_thread(void);
+
+/*
+ * Restartable sequence fallback for reading the current CPU number.
+ */
+int32_t rseq_fallback_current_cpu(void);
+
+/*
+ * Values returned can be either the current CPU number, -1 (rseq is
+ * uninitialized), or -2 (rseq initialization has failed).
+ */
+static inline int32_t rseq_current_cpu_raw(void)
+{
+ return RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id);
+}
+
+/*
+ * Returns a possible CPU number, which is typically the current CPU.
+ * The returned CPU number can be used to prepare for an rseq critical
+ * section, which will confirm whether the cpu number is indeed the
+ * current one, and whether rseq is initialized.
+ *
+ * The CPU number returned by rseq_cpu_start should always be validated
+ * by passing it to a rseq asm sequence, or by comparing it to the
+ * return value of rseq_current_cpu_raw() if the rseq asm sequence
+ * does not need to be invoked.
+ */
+static inline uint32_t rseq_cpu_start(void)
+{
+ return RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start);
+}
+
+static inline uint32_t rseq_current_cpu(void)
+{
+ int32_t cpu;
+
+ cpu = rseq_current_cpu_raw();
+ if (rseq_unlikely(cpu < 0))
+ cpu = rseq_fallback_current_cpu();
+ return cpu;
+}
+
+/*
+ * rseq_prepare_unload() should be invoked by each thread using rseq_finish*()
+ * at least once between their last rseq_finish*() and library unload of the
+ * library defining the rseq critical section (struct rseq_cs). This also
+ * applies to use of rseq in code generated by JIT: rseq_prepare_unload()
+ * should be invoked at least once by each thread using rseq_finish*() before
+ * reclaim of the memory holding the struct rseq_cs.
+ */
+static inline void rseq_prepare_unload(void)
+{
+ __rseq_abi.rseq_cs = 0;
+}
+
+#endif /* RSEQ_H */
--
2.11.0
Implement cpu_opv selftests. The tests need to express dependencies on
header files and on a shared object (.so), which requires overriding the
selftests lib.mk targets. Use the OVERRIDE_TARGETS define for this.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Acked-by: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
Changes since v1:
- Expose a library API closely matching the rseq APIs, following
removal of the event counter from the rseq kernel API.
- Update makefile to fix make run_tests dependency on "all".
- Introduce OVERRIDE_TARGETS.
Changes since v2:
- Test page faults.
Changes since v3:
- Move lib.mk OVERRIDE_TARGETS change to its own patch.
- Printout TAP output.
Changes since v4:
- Retry internally within cpu_op_cmpnev_storeoffp_load().
Changes since v5:
- Test huge pages.
Changes since v6:
- Test CPU_OP_NR_FLAG,
- Invoke ksft_test_result_fail rather than ksft_exit_fail_msg,
- Test CPU parameter outside of possible CPUs range,
- Test CPU parameter outside of allowed CPUs.
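
Note for readers of the tests below: most helpers re-read the current
CPU number and reissue the call for as long as it fails with EAGAIN.
The following is only a stand-alone sketch of that retry pattern. It
does not invoke cpu_opv; fake_get_current_cpu(), fake_cpu_op() and
fake_failures_left are hypothetical stubs introduced for illustration,
standing in for cpu_op_get_current_cpu() and the system call wrapper.

```c
#include <errno.h>

/* Hypothetical stand-in for cpu_op_get_current_cpu(); always reports CPU 0. */
static int fake_get_current_cpu(void)
{
	return 0;
}

/* Number of EAGAIN failures the stub will still report (illustration only). */
static int fake_failures_left = 3;

/*
 * Hypothetical stand-in for a cpu_opv()-style call: fails with -1/EAGAIN
 * until fake_failures_left reaches zero, then returns 0.
 */
static int fake_cpu_op(int cpu)
{
	(void)cpu;
	if (fake_failures_left > 0) {
		fake_failures_left--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}

/*
 * Retry loop in the same shape as the test helpers: refresh the CPU
 * number on every attempt and stop on success or on any error other
 * than EAGAIN. The attempt count is reported through *attempts.
 */
static int retry_cpu_op(int *attempts)
{
	int ret, cpu;

	*attempts = 0;
	do {
		cpu = fake_get_current_cpu();
		ret = fake_cpu_op(cpu);
		(*attempts)++;
	} while (ret == -1 && errno == EAGAIN);

	return ret;
}
```

Re-reading the CPU number inside the loop matters because the thread
may have migrated between attempts; reusing a stale CPU number would
keep targeting the wrong CPU.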
---
MAINTAINERS | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/cpu-opv/.gitignore | 1 +
tools/testing/selftests/cpu-opv/Makefile | 17 +
.../testing/selftests/cpu-opv/basic_cpu_opv_test.c | 1368 ++++++++++++++++++++
tools/testing/selftests/cpu-opv/cpu-op.c | 352 +++++
tools/testing/selftests/cpu-opv/cpu-op.h | 59 +
7 files changed, 1799 insertions(+)
create mode 100644 tools/testing/selftests/cpu-opv/.gitignore
create mode 100644 tools/testing/selftests/cpu-opv/Makefile
create mode 100644 tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.c
create mode 100644 tools/testing/selftests/cpu-opv/cpu-op.h
diff --git a/MAINTAINERS b/MAINTAINERS
index e32d4415081b..936ff672d5fb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3749,6 +3749,7 @@ L: [email protected]
S: Supported
F: kernel/cpu_opv.c
F: include/uapi/linux/cpu_opv.h
+F: tools/testing/selftests/cpu-opv/
CRAMFS FILESYSTEM
M: Nicolas Pitre <[email protected]>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 7442dfb73b7f..1322e63f5963 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -5,6 +5,7 @@ TARGETS += breakpoints
TARGETS += capabilities
TARGETS += cpufreq
TARGETS += cpu-hotplug
+TARGETS += cpu-opv
TARGETS += efivarfs
TARGETS += exec
TARGETS += firmware
diff --git a/tools/testing/selftests/cpu-opv/.gitignore b/tools/testing/selftests/cpu-opv/.gitignore
new file mode 100644
index 000000000000..c7186eb95cf5
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/.gitignore
@@ -0,0 +1 @@
+basic_cpu_opv_test
diff --git a/tools/testing/selftests/cpu-opv/Makefile b/tools/testing/selftests/cpu-opv/Makefile
new file mode 100644
index 000000000000..21e63545d521
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/Makefile
@@ -0,0 +1,17 @@
+CFLAGS += -O2 -Wall -g -I./ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+
+# Own dependencies because we only want to build against 1st prerequisite, but
+# still track changes to header files and depend on shared object.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_cpu_opv_test
+
+TEST_GEN_PROGS_EXTENDED = libcpu-op.so
+
+include ../lib.mk
+
+$(OUTPUT)/libcpu-op.so: cpu-op.c cpu-op.h
+ $(CC) $(CFLAGS) -shared -fPIC $< -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) cpu-op.h
+ $(CC) $(CFLAGS) $< -lcpu-op -o $@
diff --git a/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
new file mode 100644
index 000000000000..792b68f4f330
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/basic_cpu_opv_test.c
@@ -0,0 +1,1368 @@
+/*
+ * Basic test coverage for cpu_opv system call.
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <sys/time.h>
+#include <sys/mman.h>
+#include <stdint.h>
+
+#include "../kselftest.h"
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define TESTBUFLEN 4096
+#define TESTBUFLEN_CMP 16
+
+#define TESTBUFLEN_PAGE_MAX 65536
+
+#define NR_PF_ARRAY 16384
+#define PF_ARRAY_LEN 4096
+
+#define NR_HUGE_ARRAY 512
+#define HUGEMAPLEN (NR_HUGE_ARRAY * PF_ARRAY_LEN)
+
+/* 64 MB arrays for page fault testing. */
+char pf_array_dst[NR_PF_ARRAY][PF_ARRAY_LEN];
+char pf_array_src[NR_PF_ARRAY][PF_ARRAY_LEN];
+
+static int test_ops_supported(void)
+{
+ const char *test_name = "test_ops_supported";
+ int ret;
+
+ ret = cpu_opv(NULL, 0, -1, CPU_OP_NR_FLAG);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret < NR_CPU_OPS) {
+ ksft_test_result_fail("%s test: only %d operations supported, expecting at least %d\n",
+ test_name, ret, NR_CPU_OPS);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_compare_eq_op(char *a, char *b, size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_compare_eq_same(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN];
+ char buf2[TESTBUFLEN];
+ const char *test_name = "test_compare_eq same";
+
+ /* Test compare_eq */
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf1[i] = (char)i;
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf2[i] = (char)i;
+ ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret > 0) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 0);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_compare_eq_diff(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN];
+ char buf2[TESTBUFLEN];
+ const char *test_name = "test_compare_eq different";
+
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf1[i] = (char)i;
+ memset(buf2, 0, TESTBUFLEN);
+ ret = test_compare_eq_op(buf2, buf1, TESTBUFLEN);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret == 0) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 1);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_compare_ne_op(char *a, char *b, size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_NE_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_compare_ne_same(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN];
+ char buf2[TESTBUFLEN];
+ const char *test_name = "test_compare_ne same";
+
+ /* Test compare_ne */
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf1[i] = (char)i;
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf2[i] = (char)i;
+ ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret == 0) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 1);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_compare_ne_diff(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN];
+ char buf2[TESTBUFLEN];
+ const char *test_name = "test_compare_ne different";
+
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf1[i] = (char)i;
+ memset(buf2, 0, TESTBUFLEN);
+ ret = test_compare_ne_op(buf2, buf1, TESTBUFLEN);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 0) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 0);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_2compare_eq_op(char *a, char *b, char *c, char *d,
+ size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, c),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, d),
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_2compare_eq_index(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN_CMP];
+ char buf2[TESTBUFLEN_CMP];
+ char buf3[TESTBUFLEN_CMP];
+ char buf4[TESTBUFLEN_CMP];
+ const char *test_name = "test_2compare_eq index";
+
+ for (i = 0; i < TESTBUFLEN_CMP; i++)
+ buf1[i] = (char)i;
+ memset(buf2, 0, TESTBUFLEN_CMP);
+ memset(buf3, 0, TESTBUFLEN_CMP);
+ memset(buf4, 0, TESTBUFLEN_CMP);
+
+ /* First compare failure is op[0], expect 1. */
+ ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 1) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 1);
+ return -1;
+ }
+
+ /* All compares succeed. */
+ for (i = 0; i < TESTBUFLEN_CMP; i++)
+ buf2[i] = (char)i;
+ ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 0) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 0);
+ return -1;
+ }
+
+ /* First compare failure is op[1], expect 2. */
+ for (i = 0; i < TESTBUFLEN_CMP; i++)
+ buf3[i] = (char)i;
+ ret = test_2compare_eq_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 2) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 2);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_2compare_ne_op(char *a, char *b, char *c, char *d,
+ size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_NE_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, a),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, b),
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_COMPARE_NE_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.a, c),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.compare_op.b, d),
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_2compare_ne_index(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN_CMP];
+ char buf2[TESTBUFLEN_CMP];
+ char buf3[TESTBUFLEN_CMP];
+ char buf4[TESTBUFLEN_CMP];
+ const char *test_name = "test_2compare_ne index";
+
+ memset(buf1, 0, TESTBUFLEN_CMP);
+ memset(buf2, 0, TESTBUFLEN_CMP);
+ memset(buf3, 0, TESTBUFLEN_CMP);
+ memset(buf4, 0, TESTBUFLEN_CMP);
+
+ /* First compare ne failure is op[0], expect 1. */
+ ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 1) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 1);
+ return -1;
+ }
+
+ /* All compare ne succeed. */
+ for (i = 0; i < TESTBUFLEN_CMP; i++)
+ buf1[i] = (char)i;
+ for (i = 0; i < TESTBUFLEN_CMP; i++)
+ buf3[i] = (char)i;
+ ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 0) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 0);
+ return -1;
+ }
+
+ /* First compare failure is op[1], expect 2. */
+ for (i = 0; i < TESTBUFLEN_CMP; i++)
+ buf4[i] = (char)i;
+ ret = test_2compare_ne_op(buf2, buf1, buf4, buf3, TESTBUFLEN_CMP);
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret != 2) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, ret, 2);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_memcpy_op(void *dst, void *src, size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_memcpy(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN];
+ char buf2[TESTBUFLEN];
+ const char *test_name = "test_memcpy";
+
+ /* Test memcpy */
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf1[i] = (char)i;
+ memset(buf2, 0, TESTBUFLEN);
+ ret = test_memcpy_op(buf2, buf1, TESTBUFLEN);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ for (i = 0; i < TESTBUFLEN; i++) {
+ if (buf2[i] != (char)i) {
+ ksft_test_result_fail("%s test: unexpected value at offset %d. Found %d. Should be %d.\n",
+ test_name, i, buf2[i], (char)i);
+ return -1;
+ }
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_memcpy_u32(void)
+{
+ int ret;
+ uint32_t v1, v2;
+ const char *test_name = "test_memcpy_u32";
+
+ /* Test memcpy_u32 */
+ v1 = 42;
+ v2 = 0;
+ ret = test_memcpy_op(&v2, &v1, sizeof(v1));
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v1 != v2) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v2, v1);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_memcpy_mb_memcpy_op(void *dst1, void *src1,
+ void *dst2, void *src2, size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst1),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src1),
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [1] = {
+ .op = CPU_MB_OP,
+ },
+ [2] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst2),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src2),
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_memcpy_mb_memcpy(void)
+{
+ int ret;
+ int v1, v2, v3;
+ const char *test_name = "test_memcpy_mb_memcpy";
+
+ /* Test memcpy */
+ v1 = 42;
+ v2 = v3 = 0;
+ ret = test_memcpy_mb_memcpy_op(&v2, &v1, &v3, &v2, sizeof(int));
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v3 != v1) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v3, v1);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_add_op(int *v, int64_t increment)
+{
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_op_add(v, increment, sizeof(*v), cpu);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_add(void)
+{
+ int orig_v = 42, v, ret;
+ int increment = 1;
+ const char *test_name = "test_add";
+
+ v = orig_v;
+ ret = test_add_op(&v, increment);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != orig_v + increment) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v,
+ orig_v + increment);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_two_add_op(int *v, int64_t *increments)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_ADD_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.arithmetic_op.p, v),
+ .u.arithmetic_op.count = increments[0],
+ .u.arithmetic_op.expect_fault_p = 0,
+ },
+ [1] = {
+ .op = CPU_ADD_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.arithmetic_op.p, v),
+ .u.arithmetic_op.count = increments[1],
+ .u.arithmetic_op.expect_fault_p = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_two_add(void)
+{
+ int orig_v = 42, v, ret;
+ int64_t increments[2] = { 99, 123 };
+ const char *test_name = "test_two_add";
+
+ v = orig_v;
+ ret = test_two_add_op(&v, increments);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != orig_v + increments[0] + increments[1]) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v,
+ (int)(orig_v + increments[0] + increments[1]));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_or_op(int *v, uint64_t mask)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_OR_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.bitwise_op.p, v),
+ .u.bitwise_op.mask = mask,
+ .u.bitwise_op.expect_fault_p = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_or(void)
+{
+ int orig_v = 0xFF00000, v, ret;
+ uint32_t mask = 0xFFF;
+ const char *test_name = "test_or";
+
+ v = orig_v;
+ ret = test_or_op(&v, mask);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != (orig_v | mask)) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v, orig_v | mask);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_and_op(int *v, uint64_t mask)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_AND_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.bitwise_op.p, v),
+ .u.bitwise_op.mask = mask,
+ .u.bitwise_op.expect_fault_p = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_and(void)
+{
+ int orig_v = 0xF00, v, ret;
+ uint32_t mask = 0xFFF;
+ const char *test_name = "test_and";
+
+ v = orig_v;
+ ret = test_and_op(&v, mask);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != (orig_v & mask)) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v, orig_v & mask);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_xor_op(int *v, uint64_t mask)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_XOR_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.bitwise_op.p, v),
+ .u.bitwise_op.mask = mask,
+ .u.bitwise_op.expect_fault_p = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_xor(void)
+{
+ int orig_v = 0xF00, v, ret;
+ uint32_t mask = 0xFFF;
+ const char *test_name = "test_xor";
+
+ v = orig_v;
+ ret = test_xor_op(&v, mask);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != (orig_v ^ mask)) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v, orig_v ^ mask);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_lshift_op(int *v, uint32_t bits)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_LSHIFT_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.shift_op.p, v),
+ .u.shift_op.bits = bits,
+ .u.shift_op.expect_fault_p = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_lshift(void)
+{
+ int orig_v = 0xF00, v, ret;
+ uint32_t bits = 5;
+ const char *test_name = "test_lshift";
+
+ v = orig_v;
+ ret = test_lshift_op(&v, bits);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != (orig_v << bits)) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v, orig_v << bits);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_rshift_op(int *v, uint32_t bits)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_RSHIFT_OP,
+ .len = sizeof(*v),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(
+ .u.shift_op.p, v),
+ .u.shift_op.bits = bits,
+ .u.shift_op.expect_fault_p = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_rshift(void)
+{
+ int orig_v = 0xF00, v, ret;
+ uint32_t bits = 5;
+ const char *test_name = "test_rshift";
+
+ v = orig_v;
+ ret = test_rshift_op(&v, bits);
+ if (ret) {
+ ksft_test_result_fail("%s test: returned with %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (v != (orig_v >> bits)) {
+ ksft_test_result_fail("%s test: unexpected value %d. Should be %d.\n",
+ test_name, v, orig_v >> bits);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_cmpxchg_op(void *v, void *expect, void *old, void *n,
+ size_t len)
+{
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_op_cmpxchg(v, expect, old, n, len, cpu);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_cmpxchg_success(void)
+{
+ int ret;
+ uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+ const char *test_name = "test_cmpxchg success";
+
+ v = orig_v;
+ ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret) {
+ ksft_test_result_fail("%s returned %d, expecting %d\n",
+ test_name, ret, 0);
+ return -1;
+ }
+ if (v != n) {
+ ksft_test_result_fail("%s v is %lld, expecting %lld\n",
+ test_name, (long long)v, (long long)n);
+ return -1;
+ }
+ if (old != orig_v) {
+ ksft_test_result_fail("%s old is %lld, expecting %lld\n",
+ test_name, (long long)old,
+ (long long)orig_v);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_cmpxchg_fail(void)
+{
+ int ret;
+ uint64_t orig_v = 1, v, expect = 123, old = 0, n = 3;
+ const char *test_name = "test_cmpxchg fail";
+
+ v = orig_v;
+ ret = test_cmpxchg_op(&v, &expect, &old, &n, sizeof(uint64_t));
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (ret == 0) {
+ ksft_test_result_fail("%s returned %d, expecting %d\n",
+ test_name, ret, 1);
+ return -1;
+ }
+ if (v != orig_v) {
+ ksft_test_result_fail("%s v is %lld, expecting %lld\n",
+ test_name, (long long)v,
+ (long long)orig_v);
+ return -1;
+ }
+ if (old != orig_v) {
+ ksft_test_result_fail("%s old is %lld, expecting %lld\n",
+ test_name, (long long)old,
+ (long long)orig_v);
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_memcpy_expect_fault_op(void *dst, void *src, size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+ .u.memcpy_op.expect_fault_dst = 0,
+ /* Return EAGAIN on fault. */
+ .u.memcpy_op.expect_fault_src = 1,
+ },
+ };
+ int cpu;
+
+ cpu = cpu_op_get_current_cpu();
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_memcpy_fault(void)
+{
+ int ret;
+ char buf1[TESTBUFLEN];
+ const char *test_name = "test_memcpy_fault";
+
+ /* Test memcpy */
+ ret = test_memcpy_op(buf1, NULL, TESTBUFLEN);
+ if (!ret || (ret < 0 && errno != EFAULT)) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ /* Test memcpy expect fault */
+ ret = test_memcpy_expect_fault_op(buf1, NULL, TESTBUFLEN);
+ if (!ret || (ret < 0 && errno != EAGAIN)) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int do_test_unknown_op(void)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = -1, /* Unknown */
+ .len = 0,
+ },
+ };
+ int cpu;
+
+ cpu = cpu_op_get_current_cpu();
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_unknown_op(void)
+{
+ int ret;
+ const char *test_name = "test_unknown_op";
+
+ ret = do_test_unknown_op();
+ if (!ret || (ret < 0 && errno != EINVAL)) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int do_test_max_ops(void)
+{
+ struct cpu_op opvec[] = {
+ [0] = { .op = CPU_MB_OP, },
+ [1] = { .op = CPU_MB_OP, },
+ [2] = { .op = CPU_MB_OP, },
+ [3] = { .op = CPU_MB_OP, },
+ [4] = { .op = CPU_MB_OP, },
+ [5] = { .op = CPU_MB_OP, },
+ [6] = { .op = CPU_MB_OP, },
+ [7] = { .op = CPU_MB_OP, },
+ [8] = { .op = CPU_MB_OP, },
+ [9] = { .op = CPU_MB_OP, },
+ [10] = { .op = CPU_MB_OP, },
+ [11] = { .op = CPU_MB_OP, },
+ [12] = { .op = CPU_MB_OP, },
+ [13] = { .op = CPU_MB_OP, },
+ [14] = { .op = CPU_MB_OP, },
+ [15] = { .op = CPU_MB_OP, },
+ };
+ int cpu;
+
+ cpu = cpu_op_get_current_cpu();
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_max_ops(void)
+{
+ int ret;
+ const char *test_name = "test_max_ops";
+
+ ret = do_test_max_ops();
+ if (ret < 0) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int do_test_too_many_ops(void)
+{
+ struct cpu_op opvec[] = {
+ [0] = { .op = CPU_MB_OP, },
+ [1] = { .op = CPU_MB_OP, },
+ [2] = { .op = CPU_MB_OP, },
+ [3] = { .op = CPU_MB_OP, },
+ [4] = { .op = CPU_MB_OP, },
+ [5] = { .op = CPU_MB_OP, },
+ [6] = { .op = CPU_MB_OP, },
+ [7] = { .op = CPU_MB_OP, },
+ [8] = { .op = CPU_MB_OP, },
+ [9] = { .op = CPU_MB_OP, },
+ [10] = { .op = CPU_MB_OP, },
+ [11] = { .op = CPU_MB_OP, },
+ [12] = { .op = CPU_MB_OP, },
+ [13] = { .op = CPU_MB_OP, },
+ [14] = { .op = CPU_MB_OP, },
+ [15] = { .op = CPU_MB_OP, },
+ [16] = { .op = CPU_MB_OP, },
+ };
+ int cpu;
+
+ cpu = cpu_op_get_current_cpu();
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int test_too_many_ops(void)
+{
+ int ret;
+ const char *test_name = "test_too_many_ops";
+
+ ret = do_test_too_many_ops();
+ if (!ret || (ret < 0 && errno != EINVAL)) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+/* Use 64kB len, largest page size known on Linux. */
+static int test_memcpy_single_too_large(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN_PAGE_MAX + 1];
+ char buf2[TESTBUFLEN_PAGE_MAX + 1];
+ const char *test_name = "test_memcpy_single_too_large";
+
+ /* Test memcpy */
+ for (i = 0; i < TESTBUFLEN_PAGE_MAX + 1; i++)
+ buf1[i] = (char)i;
+ memset(buf2, 0, TESTBUFLEN_PAGE_MAX + 1);
+ ret = test_memcpy_op(buf2, buf1, TESTBUFLEN_PAGE_MAX + 1);
+ if (!ret || (ret < 0 && errno != EINVAL)) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+static int test_memcpy_single_ok_sum_too_large_op(void *dst, void *src,
+ size_t len)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.dst, dst),
+ LINUX_FIELD_u32_u64_INIT_ONSTACK(.u.memcpy_op.src, src),
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+ int ret, cpu;
+
+ do {
+ cpu = cpu_op_get_current_cpu();
+ ret = cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_memcpy_single_ok_sum_too_large(void)
+{
+ int i, ret;
+ char buf1[TESTBUFLEN];
+ char buf2[TESTBUFLEN];
+ const char *test_name = "test_memcpy_single_ok_sum_too_large";
+
+ /* Test memcpy */
+ for (i = 0; i < TESTBUFLEN; i++)
+ buf1[i] = (char)i;
+ memset(buf2, 0, TESTBUFLEN);
+ ret = test_memcpy_single_ok_sum_too_large_op(buf2, buf1, TESTBUFLEN);
+ if (!ret || (ret < 0 && errno != EINVAL)) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+/*
+ * Iterate over large uninitialized arrays to trigger page faults.
+ * This includes reading from zero pages.
+ */
+int test_page_fault(void)
+{
+ int ret = 0;
+ uint64_t i;
+ const char *test_name = "test_page_fault";
+
+ for (i = 0; i < NR_PF_ARRAY; i++) {
+ ret = test_memcpy_op(pf_array_dst[i],
+ pf_array_src[i],
+ PF_ARRAY_LEN);
+ if (ret) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return ret;
+ }
+ }
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+}
+
+/*
+ * Try to use 2MB huge pages.
+ */
+int test_hugetlb(void)
+{
+ int ret = 0;
+ uint64_t i;
+ const char *test_name = "test_hugetlb";
+ int *dst, *src;
+
+ dst = mmap(NULL, HUGEMAPLEN, PROT_READ | PROT_WRITE,
+ MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (dst == MAP_FAILED) {
+ switch (errno) {
+ case ENOMEM:
+ case ENOENT:
+ case EINVAL:
+ ksft_test_result_skip("%s test.\n", test_name);
+ goto end;
+ default:
+ break;
+ }
+ perror("mmap");
+ abort();
+ }
+ src = mmap(NULL, HUGEMAPLEN, PROT_READ | PROT_WRITE,
+ MAP_HUGETLB | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (src == MAP_FAILED) {
+ if (errno == ENOMEM) {
+ ksft_test_result_skip("%s test.\n", test_name);
+ goto unmap_dst;
+ }
+ perror("mmap");
+ abort();
+ }
+
+ /* Read/write from/to huge zero pages. */
+ for (i = 0; i < NR_HUGE_ARRAY; i++) {
+ ret = test_memcpy_op(dst + (i * PF_ARRAY_LEN / sizeof(int)),
+ src + (i * PF_ARRAY_LEN / sizeof(int)),
+ PF_ARRAY_LEN);
+ if (ret) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return ret;
+ }
+ }
+ for (i = 0; i < NR_HUGE_ARRAY * (PF_ARRAY_LEN / sizeof(int)); i++)
+ src[i] = i;
+
+ for (i = 0; i < NR_HUGE_ARRAY; i++) {
+ ret = test_memcpy_op(dst + (i * PF_ARRAY_LEN / sizeof(int)),
+ src + (i * PF_ARRAY_LEN / sizeof(int)),
+ PF_ARRAY_LEN);
+ if (ret) {
+ ksft_test_result_fail("%s test: ret = %d, errno = %s\n",
+ test_name, ret, strerror(errno));
+ return ret;
+ }
+ }
+
+ for (i = 0; i < NR_HUGE_ARRAY * (PF_ARRAY_LEN / sizeof(int)); i++) {
+ if (dst[i] != i) {
+ ksft_test_result_fail("%s mismatch, expect %d, got %d\n",
+ test_name, (int)i, dst[i]);
+ return -1;
+ }
+ }
+
+ ksft_test_result_pass("%s test\n", test_name);
+
+ if (munmap(src, HUGEMAPLEN)) {
+ perror("munmap");
+ abort();
+ }
+unmap_dst:
+ if (munmap(dst, HUGEMAPLEN)) {
+ perror("munmap");
+ abort();
+ }
+end:
+ return 0;
+}
+
+static int test_cmpxchg_op_cpu(void *v, void *expect, void *old, void *n,
+ size_t len, int cpu)
+{
+ int ret;
+
+ do {
+ ret = cpu_op_cmpxchg(v, expect, old, n, len, cpu);
+ } while (ret == -1 && errno == EAGAIN);
+
+ return ret;
+}
+
+static int test_over_possible_cpu(void)
+{
+ int ret;
+ uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+ const char *test_name = "test_over_possible_cpu";
+
+ v = orig_v;
+ ret = test_cmpxchg_op_cpu(&v, &expect, &old, &n, sizeof(uint64_t),
+ 0xFFFFFFFF);
+ if (ret == 0) {
+ ksft_test_result_fail("%s test: ret = %d\n",
+ test_name, ret);
+ return -1;
+ }
+ if (ret < 0 && errno == EINVAL) {
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+ }
+ ksft_test_result_fail("%s returned %d, errno %s, expecting %d, errno %s\n",
+ test_name, ret, strerror(errno),
+ -1, strerror(EINVAL));
+ return -1;
+}
+
+static int test_allowed_affinity(void)
+{
+ int ret;
+ uint64_t orig_v = 1, v, expect = 1, old = 0, n = 3;
+ const char *test_name = "test_allowed_affinity";
+ cpu_set_t allowed_cpus, cpuset;
+
+ ret = sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+ if (ret) {
+ ksft_test_result_fail("%s returned %d, errno %s\n",
+ test_name, ret, strerror(errno));
+ return -1;
+ }
+ if (!(CPU_ISSET(0, &allowed_cpus) && CPU_ISSET(1, &allowed_cpus))) {
+ ksft_test_result_skip("%s test. Requiring allowed CPUs 0 and 1.\n",
+ test_name);
+ return 0;
+ }
+ CPU_ZERO(&cpuset);
+ CPU_SET(0, &cpuset);
+ if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0) {
+ ksft_test_result_fail("%s test. Unable to set affinity. errno = %s\n",
+ test_name, strerror(errno));
+ return -1;
+ }
+ v = orig_v;
+ ret = test_cmpxchg_op_cpu(&v, &expect, &old, &n, sizeof(uint64_t),
+ 1);
+ if (sched_setaffinity(0, sizeof(allowed_cpus), &allowed_cpus) != 0) {
+ ksft_test_result_fail("%s test. Unable to set affinity. errno = %s\n",
+ test_name, strerror(errno));
+ return -1;
+ }
+ if (ret == 0) {
+ ksft_test_result_fail("%s test: ret = %d\n",
+ test_name, ret);
+ return -1;
+ }
+
+ if (ret < 0 && errno == EINVAL) {
+ ksft_test_result_pass("%s test\n", test_name);
+ return 0;
+ }
+ ksft_test_result_fail("%s returned %d, errno %s, expecting %d, errno %s\n",
+ test_name, ret, strerror(errno),
+ -1, strerror(EINVAL));
+ return -1;
+}
+
+int main(int argc, char **argv)
+{
+ ksft_print_header();
+
+ test_ops_supported();
+ test_compare_eq_same();
+ test_compare_eq_diff();
+ test_compare_ne_same();
+ test_compare_ne_diff();
+ test_2compare_eq_index();
+ test_2compare_ne_index();
+ test_memcpy();
+ test_memcpy_u32();
+ test_memcpy_mb_memcpy();
+ test_add();
+ test_two_add();
+ test_or();
+ test_and();
+ test_xor();
+ test_lshift();
+ test_rshift();
+ test_cmpxchg_success();
+ test_cmpxchg_fail();
+ test_memcpy_fault();
+ test_unknown_op();
+ test_max_ops();
+ test_too_many_ops();
+ test_memcpy_single_too_large();
+ test_memcpy_single_ok_sum_too_large();
+ test_page_fault();
+ test_hugetlb();
+ test_over_possible_cpu();
+ test_allowed_affinity();
+
+ return ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.c b/tools/testing/selftests/cpu-opv/cpu-op.c
new file mode 100644
index 000000000000..5981895df25a
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.c
@@ -0,0 +1,352 @@
+/*
+ * cpu-op.c
+ *
+ * Copyright (C) 2017 Mathieu Desnoyers <[email protected]>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; only
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <syscall.h>
+#include <assert.h>
+#include <signal.h>
+
+#include "cpu-op.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x))
+#define WRITE_ONCE(x, v) __extension__ ({ ACCESS_ONCE(x) = (v); })
+#define READ_ONCE(x) ACCESS_ONCE(x)
+
+int cpu_opv(struct cpu_op *cpu_opv, int cpuopcnt, int cpu, int flags)
+{
+ return syscall(__NR_cpu_opv, cpu_opv, cpuopcnt, cpu, flags);
+}
+
+int cpu_op_get_current_cpu(void)
+{
+ int cpu;
+
+ cpu = sched_getcpu();
+ if (cpu < 0) {
+ perror("sched_getcpu()");
+ abort();
+ }
+ return cpu;
+}
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *n, size_t len,
+ int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ .u.memcpy_op.dst = (unsigned long)old,
+ .u.memcpy_op.src = (unsigned long)v,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [1] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = len,
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [2] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)n,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_ADD_OP,
+ .len = len,
+ .u.arithmetic_op.p = (unsigned long)v,
+ .u.arithmetic_op.count = count,
+ .u.arithmetic_op.expect_fault_p = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv,
+ int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)&newv,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+static int cpu_op_cmpeqv_storep_expect_fault(intptr_t *v, intptr_t expect,
+ intptr_t *newp, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)newp,
+ .u.memcpy_op.expect_fault_dst = 0,
+ /* Return EAGAIN on src fault. */
+ .u.memcpy_op.expect_fault_src = 1,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu)
+{
+ int ret;
+
+ do {
+ intptr_t oldv = READ_ONCE(*v);
+ intptr_t *newp = (intptr_t *)(oldv + voffp);
+
+ if (oldv == expectnot)
+ return 1;
+ ret = cpu_op_cmpeqv_storep_expect_fault(v, oldv, newp, cpu);
+ if (!ret) {
+ *load = oldv;
+ return 0;
+ }
+ } while (ret > 0);
+
+ return -1;
+}
+
+int cpu_op_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v2,
+ .u.memcpy_op.src = (unsigned long)&newv2,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [2] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)&newv,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_storev_mb_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v2,
+ .u.memcpy_op.src = (unsigned long)&newv2,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [2] = {
+ .op = CPU_MB_OP,
+ },
+ [3] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)&newv,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v2,
+ .u.compare_op.b = (unsigned long)&expect2,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [2] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)&newv,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ .u.memcpy_op.dst = (unsigned long)dst,
+ .u.memcpy_op.src = (unsigned long)src,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [2] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)&newv,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_cmpeqv_memcpy_mb_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu)
+{
+ struct cpu_op opvec[] = {
+ [0] = {
+ .op = CPU_COMPARE_EQ_OP,
+ .len = sizeof(intptr_t),
+ .u.compare_op.a = (unsigned long)v,
+ .u.compare_op.b = (unsigned long)&expect,
+ .u.compare_op.expect_fault_a = 0,
+ .u.compare_op.expect_fault_b = 0,
+ },
+ [1] = {
+ .op = CPU_MEMCPY_OP,
+ .len = len,
+ .u.memcpy_op.dst = (unsigned long)dst,
+ .u.memcpy_op.src = (unsigned long)src,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ [2] = {
+ .op = CPU_MB_OP,
+ },
+ [3] = {
+ .op = CPU_MEMCPY_OP,
+ .len = sizeof(intptr_t),
+ .u.memcpy_op.dst = (unsigned long)v,
+ .u.memcpy_op.src = (unsigned long)&newv,
+ .u.memcpy_op.expect_fault_dst = 0,
+ .u.memcpy_op.expect_fault_src = 0,
+ },
+ };
+
+ return cpu_opv(opvec, ARRAY_SIZE(opvec), cpu, 0);
+}
+
+int cpu_op_addv(intptr_t *v, int64_t count, int cpu)
+{
+ return cpu_op_add(v, count, sizeof(intptr_t), cpu);
+}
diff --git a/tools/testing/selftests/cpu-opv/cpu-op.h b/tools/testing/selftests/cpu-opv/cpu-op.h
new file mode 100644
index 000000000000..762a38d6e0d0
--- /dev/null
+++ b/tools/testing/selftests/cpu-opv/cpu-op.h
@@ -0,0 +1,59 @@
+/*
+ * cpu-op.h
+ *
+ * (C) Copyright 2017 - Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef CPU_OPV_H
+#define CPU_OPV_H
+
+#include <stdlib.h>
+#include <stdint.h>
+#include <linux/cpu_opv.h>
+
+int cpu_opv(struct cpu_op *cpuopv, int cpuopcnt, int cpu, int flags);
+int cpu_op_get_current_cpu(void);
+
+int cpu_op_cmpxchg(void *v, void *expect, void *old, void *_new, size_t len,
+ int cpu);
+int cpu_op_add(void *v, int64_t count, size_t len, int cpu);
+
+int cpu_op_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu);
+int cpu_op_cmpnev_storeoffp_load(intptr_t *v, intptr_t expectnot,
+ off_t voffp, intptr_t *load, int cpu);
+int cpu_op_cmpeqv_storev_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu);
+int cpu_op_cmpeqv_storev_mb_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t newv2,
+ intptr_t newv, int cpu);
+int cpu_op_cmpeqv_cmpeqv_storev(intptr_t *v, intptr_t expect,
+ intptr_t *v2, intptr_t expect2,
+ intptr_t newv, int cpu);
+int cpu_op_cmpeqv_memcpy_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu);
+int cpu_op_cmpeqv_memcpy_mb_storev(intptr_t *v, intptr_t expect,
+ void *dst, void *src, size_t len,
+ intptr_t newv, int cpu);
+int cpu_op_addv(intptr_t *v, int64_t count, int cpu);
+
+#endif /* CPU_OPV_H */
--
2.11.0
"basic_percpu_ops_test" is a slightly more "realistic" variant,
implementing a few simple per-cpu operations and testing their
correctness.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
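For readers new to the pattern, the per-cpu lock and sharded counter
exercised by the spinlock test can be approximated as below. This is only
a sketch under stated assumptions: C11 atomics stand in for
percpu_cmpeqv_storev() and rseq_smp_store_release(), the slot index is
passed in by the caller instead of coming from rseq_cpu_start(), and
NR_SLOTS plus every function name here are hypothetical and not part of
the selftests.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_SLOTS 4	/* hypothetical; the test uses CPU_SETSIZE slots */

struct slot_lock_entry {
	_Atomic intptr_t v;	/* 0 = unlocked, 1 = locked */
};

struct sharded_counter {
	struct slot_lock_entry lock[NR_SLOTS];
	intptr_t count[NR_SLOTS];
};

/*
 * Stand-in for percpu_cmpeqv_storev(): store newv if *v == expect.
 * Returns 0 when the store happened, 1 when the comparison failed.
 */
static int cmpeqv_storev(_Atomic intptr_t *v, intptr_t expect, intptr_t newv)
{
	return atomic_compare_exchange_strong_explicit(v, &expect, newv,
			memory_order_acquire, memory_order_relaxed) ? 0 : 1;
}

static void slot_acquire(struct sharded_counter *c, int slot)
{
	/* Retry if comparison fails, as the test's lock loop does. */
	while (cmpeqv_storev(&c->lock[slot].v, 0, 1))
		;
}

static void slot_release(struct sharded_counter *c, int slot)
{
	assert(atomic_load_explicit(&c->lock[slot].v,
				    memory_order_relaxed) == 1);
	/* Release store pairs with the acquire in cmpeqv_storev(). */
	atomic_store_explicit(&c->lock[slot].v, 0, memory_order_release);
}

static void counter_inc(struct sharded_counter *c, int slot)
{
	slot_acquire(c, slot);
	c->count[slot]++;
	slot_release(c, slot);
}

/* Only meaningful once all writers are done (e.g. after pthread_join). */
static intptr_t counter_sum(struct sharded_counter *c)
{
	intptr_t sum = 0;
	int i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += c->count[i];
	return sum;
}
```

Unlike this sketch, the real test relies on rseq/cpu_opv so that the
compare-and-store is performed against the slot of the cpu the thread is
actually running on.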
.../testing/selftests/rseq/basic_percpu_ops_test.c | 296 +++++++++++++++++++++
1 file changed, 296 insertions(+)
create mode 100644 tools/testing/selftests/rseq/basic_percpu_ops_test.c
diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
new file mode 100644
index 000000000000..e585bba0bf8d
--- /dev/null
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -0,0 +1,296 @@
+#define _GNU_SOURCE
+#include <assert.h>
+#include <pthread.h>
+#include <sched.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+
+#include "percpu-op.h"
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+struct percpu_lock_entry {
+ intptr_t v;
+} __attribute__((aligned(128)));
+
+struct percpu_lock {
+ struct percpu_lock_entry c[CPU_SETSIZE];
+};
+
+struct test_data_entry {
+ intptr_t count;
+} __attribute__((aligned(128)));
+
+struct spinlock_test_data {
+ struct percpu_lock lock;
+ struct test_data_entry c[CPU_SETSIZE];
+ int reps;
+};
+
+struct percpu_list_node {
+ intptr_t data;
+ struct percpu_list_node *next;
+};
+
+struct percpu_list_entry {
+ struct percpu_list_node *head;
+} __attribute__((aligned(128)));
+
+struct percpu_list {
+ struct percpu_list_entry c[CPU_SETSIZE];
+};
+
+/* A simple percpu spinlock. Returns the cpu the lock was acquired on. */
+int rseq_percpu_lock(struct percpu_lock *lock)
+{
+ int cpu;
+
+ for (;;) {
+ int ret;
+
+ cpu = rseq_cpu_start();
+ ret = percpu_cmpeqv_storev(&lock->c[cpu].v,
+ 0, 1, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ /*
+ * Acquire semantic when taking lock after control dependency.
+ * Matches rseq_smp_store_release().
+ */
+ rseq_smp_acquire__after_ctrl_dep();
+ return cpu;
+}
+
+void rseq_percpu_unlock(struct percpu_lock *lock, int cpu)
+{
+ assert(lock->c[cpu].v == 1);
+ /*
+ * Release lock, with release semantic. Matches
+ * rseq_smp_acquire__after_ctrl_dep().
+ */
+ rseq_smp_store_release(&lock->c[cpu].v, 0);
+}
+
+void *test_percpu_spinlock_thread(void *arg)
+{
+ struct spinlock_test_data *data = arg;
+ int i, cpu;
+
+ if (rseq_register_current_thread()) {
+ fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ abort();
+ }
+ for (i = 0; i < data->reps; i++) {
+ cpu = rseq_percpu_lock(&data->lock);
+ data->c[cpu].count++;
+ rseq_percpu_unlock(&data->lock, cpu);
+ }
+ if (rseq_unregister_current_thread()) {
+ fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ abort();
+ }
+
+ return NULL;
+}
+
+/*
+ * A simple test which implements a sharded counter using a per-cpu
+ * lock. Obviously real applications might prefer to simply use a
+ * per-cpu increment; however, this is reasonable for a test and the
+ * lock can be extended to synchronize more complicated operations.
+ */
+void test_percpu_spinlock(void)
+{
+ const int num_threads = 200;
+ int i;
+ uint64_t sum;
+ pthread_t test_threads[num_threads];
+ struct spinlock_test_data data;
+
+ memset(&data, 0, sizeof(data));
+ data.reps = 5000;
+
+ for (i = 0; i < num_threads; i++)
+ pthread_create(&test_threads[i], NULL,
+ test_percpu_spinlock_thread, &data);
+
+ for (i = 0; i < num_threads; i++)
+ pthread_join(test_threads[i], NULL);
+
+ sum = 0;
+ for (i = 0; i < CPU_SETSIZE; i++)
+ sum += data.c[i].count;
+
+ assert(sum == (uint64_t)data.reps * num_threads);
+}
+
+int percpu_list_push(struct percpu_list *list, struct percpu_list_node *node,
+ int cpu)
+{
+ for (;;) {
+ intptr_t *targetptr, newval, expect;
+ int ret;
+
+ /* Load list->c[cpu].head with single-copy atomicity. */
+ expect = (intptr_t)RSEQ_READ_ONCE(list->c[cpu].head);
+ newval = (intptr_t)node;
+ targetptr = (intptr_t *)&list->c[cpu].head;
+ node->next = (struct percpu_list_node *)expect;
+ ret = percpu_cmpeqv_storev(targetptr, expect, newval, cpu);
+ if (rseq_likely(!ret))
+ break;
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ /* Retry if comparison fails. */
+ }
+ return cpu;
+}
+
+/*
+ * Unlike with a traditional lock-less linked list, the availability of
+ * an rseq primitive allows us to implement pop without concerns over
+ * ABA-type races.
+ */
+struct percpu_list_node *percpu_list_pop(struct percpu_list *list,
+ int cpu)
+{
+ struct percpu_list_node *head;
+ intptr_t *targetptr, expectnot, *load;
+ off_t offset;
+ int ret;
+
+ targetptr = (intptr_t *)&list->c[cpu].head;
+ expectnot = (intptr_t)NULL;
+ offset = offsetof(struct percpu_list_node, next);
+ load = (intptr_t *)&head;
+ ret = percpu_cmpnev_storeoffp_load(targetptr, expectnot,
+ offset, load, cpu);
+ if (rseq_unlikely(ret < 0)) {
+ perror("cpu_opv");
+ abort();
+ }
+ if (ret > 0)
+ return NULL;
+ return head;
+}
+
+void *test_percpu_list_thread(void *arg)
+{
+ int i;
+ struct percpu_list *list = (struct percpu_list *)arg;
+
+ if (rseq_register_current_thread()) {
+ fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ abort();
+ }
+
+ for (i = 0; i < 100000; i++) {
+ struct percpu_list_node *node;
+
+ node = percpu_list_pop(list, rseq_cpu_start());
+ sched_yield(); /* encourage shuffling */
+ if (node)
+ percpu_list_push(list, node, rseq_cpu_start());
+ }
+
+ if (rseq_unregister_current_thread()) {
+ fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ abort();
+ }
+
+ return NULL;
+}
+
+/* Simultaneous modification to a per-cpu linked list from many threads. */
+void test_percpu_list(void)
+{
+ int i, j;
+ uint64_t sum = 0, expected_sum = 0;
+ struct percpu_list list;
+ pthread_t test_threads[200];
+ cpu_set_t allowed_cpus;
+
+ memset(&list, 0, sizeof(list));
+
+ /* Generate list entries for every usable cpu. */
+ sched_getaffinity(0, sizeof(allowed_cpus), &allowed_cpus);
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+ for (j = 1; j <= 100; j++) {
+ struct percpu_list_node *node;
+
+ expected_sum += j;
+
+ node = malloc(sizeof(*node));
+ assert(node);
+ node->data = j;
+ node->next = list.c[i].head;
+ list.c[i].head = node;
+ }
+ }
+
+ for (i = 0; i < 200; i++)
+ pthread_create(&test_threads[i], NULL,
+ test_percpu_list_thread, &list);
+
+ for (i = 0; i < 200; i++)
+ pthread_join(test_threads[i], NULL);
+
+ for (i = 0; i < CPU_SETSIZE; i++) {
+ struct percpu_list_node *node;
+
+ if (!CPU_ISSET(i, &allowed_cpus))
+ continue;
+
+ while ((node = percpu_list_pop(&list, i))) {
+ sum += node->data;
+ free(node);
+ }
+ }
+
+ /*
+ * All entries should now be accounted for (unless some external
+ * actor is interfering with our allowed affinity while this
+ * test is running).
+ */
+ assert(sum == expected_sum);
+}
+
+int main(int argc, char **argv)
+{
+ if (rseq_register_current_thread()) {
+ fprintf(stderr, "Error: rseq_register_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ goto error;
+ }
+ printf("spinlock\n");
+ test_percpu_spinlock();
+ printf("percpu_list\n");
+ test_percpu_list();
+ if (rseq_unregister_current_thread()) {
+ fprintf(stderr, "Error: rseq_unregister_current_thread(...) failed(%d): %s\n",
+ errno, strerror(errno));
+ goto error;
+ }
+ return 0;
+
+error:
+ return -1;
+}
+
--
2.11.0
A run_param_test.sh script runs many variants of the parametrizable
tests.
Wire up the rseq Makefile and add a directory entry to the MAINTAINERS file.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Shuah Khan <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
CC: [email protected]
---
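The Makefile in this patch builds param_test.c four times: once as-is and
once each with -DRSEQ_SKIP_FASTPATH, -DBENCHMARK and -DRSEQ_COMPARE_TWICE.
The sketch below shows one way a single source file can tell those builds
apart. It is illustrative only: param_test.c is not part of this patch,
so how it actually consumes the macros is an assumption, and the enum and
helper names are hypothetical.

```c
#include <assert.h>

enum param_variant {
	VARIANT_DEFAULT,
	VARIANT_SKIP_FASTPATH,
	VARIANT_COMPARE_TWICE,
	VARIANT_BENCHMARK,
};

/* Resolved at compile time from the -D flag passed by the Makefile. */
static enum param_variant build_variant(void)
{
#if defined(RSEQ_SKIP_FASTPATH)
	return VARIANT_SKIP_FASTPATH;
#elif defined(RSEQ_COMPARE_TWICE)
	return VARIANT_COMPARE_TWICE;
#elif defined(BENCHMARK)
	return VARIANT_BENCHMARK;
#else
	return VARIANT_DEFAULT;
#endif
}

/* Human-readable label, mirroring the wording used by run_param_test.sh. */
static const char *variant_name(enum param_variant v)
{
	static const char * const names[] = {
		[VARIANT_DEFAULT] = "default",
		[VARIANT_SKIP_FASTPATH] = "skip fast-path",
		[VARIANT_COMPARE_TWICE] = "compare-twice",
		[VARIANT_BENCHMARK] = "benchmark",
	};

	assert(v >= VARIANT_DEFAULT && v <= VARIANT_BENCHMARK);
	return names[v];
}
```

Compiled without any of the three flags, build_variant() reports the
default variant, which corresponds to the plain param_test binary.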
MAINTAINERS | 1 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rseq/.gitignore | 7 ++
tools/testing/selftests/rseq/Makefile | 37 +++++++
tools/testing/selftests/rseq/run_param_test.sh | 130 +++++++++++++++++++++++++
5 files changed, 176 insertions(+)
create mode 100644 tools/testing/selftests/rseq/.gitignore
create mode 100644 tools/testing/selftests/rseq/Makefile
create mode 100755 tools/testing/selftests/rseq/run_param_test.sh
diff --git a/MAINTAINERS b/MAINTAINERS
index 936ff672d5fb..ba630a386a88 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11830,6 +11830,7 @@ S: Supported
F: kernel/rseq.c
F: include/uapi/linux/rseq.h
F: include/trace/events/rseq.h
+F: tools/testing/selftests/rseq/
RFKILL
M: Johannes Berg <[email protected]>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 1322e63f5963..c1c9323d4e50 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -26,6 +26,7 @@ TARGETS += nsfs
TARGETS += powerpc
TARGETS += pstore
TARGETS += ptrace
+TARGETS += rseq
TARGETS += seccomp
TARGETS += sigaltstack
TARGETS += size
diff --git a/tools/testing/selftests/rseq/.gitignore b/tools/testing/selftests/rseq/.gitignore
new file mode 100644
index 000000000000..73dcf05b69fd
--- /dev/null
+++ b/tools/testing/selftests/rseq/.gitignore
@@ -0,0 +1,7 @@
+basic_percpu_ops_test
+basic_test
+basic_rseq_op_test
+param_test
+param_test_benchmark
+param_test_compare_twice
+param_test_skip_fastpath
diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
new file mode 100644
index 000000000000..69f90d718774
--- /dev/null
+++ b/tools/testing/selftests/rseq/Makefile
@@ -0,0 +1,37 @@
+CFLAGS += -O2 -Wall -g -I./ -I../cpu-opv/ -I../../../../usr/include/ -L./ -Wl,-rpath=./
+LDLIBS += -lpthread
+
+# Define our own dependencies: we only want to build against the first
+# prerequisite, but still track changes to header files and the shared objects.
+OVERRIDE_TARGETS = 1
+
+TEST_GEN_PROGS = basic_test basic_percpu_ops_test \
+ param_test param_test_skip_fastpath \
+ param_test_benchmark param_test_compare_twice
+
+TEST_GEN_PROGS_EXTENDED = librseq.so libcpu-op.so
+
+TEST_PROGS = run_param_test.sh
+
+include ../lib.mk
+
+$(OUTPUT)/librseq.so: rseq.c rseq.h rseq-*.h
+ $(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/libcpu-op.so: ../cpu-opv/cpu-op.c ../cpu-opv/cpu-op.h
+ $(CC) $(CFLAGS) -shared -fPIC $< $(LDLIBS) -o $@
+
+$(OUTPUT)/%: %.c $(TEST_GEN_PROGS_EXTENDED) rseq.h rseq-*.h ../cpu-opv/cpu-op.h percpu-op.h
+ $(CC) $(CFLAGS) $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
+$(OUTPUT)/param_test_skip_fastpath: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+ rseq.h rseq-*.h ../cpu-opv/cpu-op.h percpu-op.h
+ $(CC) $(CFLAGS) -DRSEQ_SKIP_FASTPATH $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
+$(OUTPUT)/param_test_benchmark: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+ rseq.h rseq-*.h ../cpu-opv/cpu-op.h percpu-op.h
+ $(CC) $(CFLAGS) -DBENCHMARK $< $(LDLIBS) -lrseq -lcpu-op -o $@
+
+$(OUTPUT)/param_test_compare_twice: param_test.c $(TEST_GEN_PROGS_EXTENDED) \
+ rseq.h rseq-*.h ../cpu-opv/cpu-op.h percpu-op.h
+ $(CC) $(CFLAGS) -DRSEQ_COMPARE_TWICE $< $(LDLIBS) -lrseq -lcpu-op -o $@
diff --git a/tools/testing/selftests/rseq/run_param_test.sh b/tools/testing/selftests/rseq/run_param_test.sh
new file mode 100755
index 000000000000..64d4c7d722c9
--- /dev/null
+++ b/tools/testing/selftests/rseq/run_param_test.sh
@@ -0,0 +1,130 @@
+#!/bin/bash
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+ "-T s"
+ "-T l"
+ "-T b"
+ "-T b -M"
+ "-T m"
+ "-T m -M"
+ "-T i"
+)
+
+TEST_NAME=(
+ "spinlock"
+ "list"
+ "buffer"
+ "buffer with barrier"
+ "memcpy"
+ "memcpy with barrier"
+ "increment"
+)
+IFS="$OLDIFS"
+
+REPS=1000
+
+function do_tests()
+{
+ local i=0
+ while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
+ echo "Running test ${TEST_NAME[$i]}"
+ ./param_test ${TEST_LIST[$i]} -r ${REPS} ${@} ${EXTRA_ARGS} || exit 1
+ echo "Running skip fast-path test ${TEST_NAME[$i]}"
+ ./param_test_skip_fastpath ${TEST_LIST[$i]} -r ${REPS} ${@} ${EXTRA_ARGS} || exit 1
+ echo "Running compare-twice test ${TEST_NAME[$i]}"
+ ./param_test_compare_twice ${TEST_LIST[$i]} -r ${REPS} ${@} ${EXTRA_ARGS} || exit 1
+ let "i++"
+ done
+}
+
+echo "Default parameters"
+do_tests
+
+echo "Loop injection: 10000 loops"
+
+OLDIFS="$IFS"
+IFS=$'\n'
+INJECT_LIST=(
+ "1"
+ "2"
+ "3"
+ "4"
+ "5"
+ "6"
+ "7"
+ "8"
+ "9"
+)
+IFS="$OLDIFS"
+
+NR_LOOPS=10000
+
+i=0
+while [ "$i" -lt "${#INJECT_LIST[@]}" ]; do
+ echo "Injecting at <${INJECT_LIST[$i]}>"
+ do_tests -${INJECT_LIST[i]} ${NR_LOOPS}
+ let "i++"
+done
+NR_LOOPS=
+
+function inject_blocking()
+{
+ OLDIFS="$IFS"
+ IFS=$'\n'
+ INJECT_LIST=(
+ "7"
+ "8"
+ "9"
+ )
+ IFS="$OLDIFS"
+
+ NR_LOOPS=-1
+
+ i=0
+ while [ "$i" -lt "${#INJECT_LIST[@]}" ]; do
+ echo "Injecting at <${INJECT_LIST[$i]}>"
+ do_tests -${INJECT_LIST[i]} -1 ${@}
+ let "i++"
+ done
+ NR_LOOPS=
+}
+
+echo "Yield injection (25%)"
+inject_blocking -m 4 -y
+
+echo "Yield injection (50%)"
+inject_blocking -m 2 -y
+
+echo "Yield injection (100%)"
+inject_blocking -m 1 -y
+
+echo "Kill injection (25%)"
+inject_blocking -m 4 -k
+
+echo "Kill injection (50%)"
+inject_blocking -m 2 -k
+
+echo "Kill injection (100%)"
+inject_blocking -m 1 -k
+
+echo "Sleep injection (1ms, 25%)"
+inject_blocking -m 4 -s 1
+
+echo "Sleep injection (1ms, 50%)"
+inject_blocking -m 2 -s 1
+
+echo "Sleep injection (1ms, 100%)"
+inject_blocking -m 1 -s 1
+
+echo "Disable rseq for 25% threads"
+do_tests -D 4
+
+echo "Disable rseq for 50% threads"
+do_tests -D 2
+
+echo "Disable rseq"
+do_tests -d
--
2.11.0
Wire up the rseq system call on x86 32/64.
This provides an ABI improving the speed of a user-space getcpu
operation on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path, as well as improving the
speed of user-space operations on per-cpu data.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 2a5e99cff859..b76cbd25854f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
382 i386 pkey_free sys_pkey_free
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
+385 i386 rseq sys_rseq
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..3ad03495bbb9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
330 common pkey_alloc sys_pkey_alloc
331 common pkey_free sys_pkey_free
332 common statx sys_statx
+333 common rseq sys_rseq
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.11.0
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: [email protected]
---
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 2 +-
arch/powerpc/include/uapi/asm/unistd.h | 1 +
3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 45d4d37495fd..4131825b5a05 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -393,3 +393,4 @@ SYSCALL(pkey_alloc)
SYSCALL(pkey_free)
SYSCALL(pkey_mprotect)
SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 1e9708632dce..c19379f0a32e 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
#include <uapi/asm/unistd.h>
-#define NR_syscalls 388
+#define NR_syscalls 389
#define __NR__exit __NR_exit
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index ac5ba55066dd..f7a221bdb5df 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -399,5 +399,6 @@
#define __NR_pkey_free 385
#define __NR_pkey_mprotect 386
#define __NR_rseq 387
+#define __NR_cpu_opv 388
#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
--
2.11.0
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/arm/tools/syscall.tbl | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index fbc74b5fa3ed..213ccfc2c437 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -413,3 +413,4 @@
396 common pkey_free sys_pkey_free
397 common statx sys_statx
398 common rseq sys_rseq
+399 common cpu_opv sys_cpu_opv
--
2.11.0
The cpu_opv system call executes a vector of operations on behalf of
user-space on a specific CPU with preemption disabled. It is inspired
by readv() and writev() system calls which take a "struct iovec"
array as argument.
The operations available are: comparison, memcpy, add, or, and, xor,
left shift, right shift, and memory barrier. The system call receives
a CPU number from user-space as argument, which is the CPU on which
those operations need to be performed. All pointers in the ops must
have been set up to point to the per CPU memory of the CPU on which
the operations should be executed. The "comparison" operation can be
used to check that the data used in the preparation step did not
change between preparation of system call inputs and operation
execution within the preempt-off critical section.
The reason why we require all pointer offsets to be calculated by
user-space beforehand is because we need to use get_user_pages_fast()
to first pin all pages touched by each operation. This takes care of
faulting-in the pages. Then, preemption is disabled, and the
operations are performed atomically with respect to other thread
execution on that CPU, without generating any page fault.
An overall maximum of 4216 bytes is enforced on the sum of operation
lengths within an operation vector, so user-space cannot generate an
overly long preempt-off critical section (cache cold critical section
duration measured as 4.7µs on x86-64). Each operation is also limited
to a length of 4096 bytes, meaning that an operation can touch a
maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
destination if addresses are not aligned on page boundaries).
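The page count above can be checked with a little arithmetic. The
following is only an illustrative user-space sketch, not code from this
series: the sketch_* names are hypothetical and a 4096-byte page size is
assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch (typical, not universal). */
#define SKETCH_PAGE_SIZE 4096u

/* Number of pages spanned by the byte range [addr, addr + len). */
static unsigned int sketch_pages_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first, last;

	if (len == 0)
		return 0;
	first = addr / SKETCH_PAGE_SIZE;
	last = (addr + len - 1) / SKETCH_PAGE_SIZE;
	return (unsigned int)(last - first + 1);
}

/* Pages a memcpy operation may need pinned: destination plus source. */
static unsigned int sketch_memcpy_pages(uintptr_t dst, uintptr_t src,
					size_t len)
{
	assert(len <= 4096);	/* per-operation limit stated above */
	return sketch_pages_spanned(dst, len) +
	       sketch_pages_spanned(src, len);
}
```

With len = 4096, an aligned range spans one page and an unaligned range
spans two, hence the worst case of 4 pages for an unaligned memcpy.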
If the thread is not running on the requested CPU, it is migrated to
it.
**** Justification for cpu_opv ****
Here are a few reasons justifying why the cpu_opv system call is
needed in addition to rseq:
1) Allow algorithms to perform per-cpu data migration without relying on
sched_setaffinity()
The use-cases are migrating memory between per-cpu memory free-lists, or
stealing tasks from other per-cpu work queues: each requires accesses
to remote per-cpu data structures.
Just rseq is not enough to cover those use-cases without additionally
relying on sched_setaffinity, which is unfortunately not
CPU-hotplug-safe.
The cpu_opv system call receives a CPU number as argument, and migrates
the current task to the right CPU to perform the operation sequence. If
the requested CPU is offline, it performs the operations from the
current CPU while preventing CPU hotplug, and with a mutex held.
2) Handling single-stepping from tools
Tools like debuggers and simulators use single-stepping to run through
existing programs. If core libraries start to use restartable sequences
for e.g. memory allocation, this means pre-existing programs cannot be
single-stepped, simply because the underlying glibc or jemalloc has
changed.
The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they
want. However, this requires upgrading tools, and still breaks
single-stepping in case where glibc or jemalloc is updated, but not the
tooling.
Having a performance-related library improvement break tooling is likely
to cause a big push-back against wide adoption of rseq.
3) Forward-progress guarantee
Having a piece of user-space code that stops progressing due to external
conditions is pretty bad. Developers are used to thinking of fast-path and
slow-path (e.g. for locking), where the contended vs uncontended cases
have different performance characteristics, but each needs to provide
some level of progress guarantees.
There are concerns about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for
trouble when real life happens: page faults, uprobes, and other
unforeseen conditions could, on rare occasions, cause a rseq fast-path
to never progress.
4) Handling page faults
It's pretty easy to come up with corner-case scenarios where rseq does
not progress without the help from cpu_opv. For instance, a system with
swap enabled which is under high memory pressure could trigger page
faults at pretty much every rseq attempt. Although this scenario
is extremely unlikely, rseq becomes the weak link of the chain.
5) Comparison with LL/SC
The layman versed in the load-link/store-conditional instructions in
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.
First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern
in all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.
The main difference between LL/SC and rseq is that debuggers had
to support single-stepping through LL/SC critical sections from the
get go in order to support a given architecture. For rseq, we're
adding critical sections into pre-existing applications/libraries,
so the user expectation is that tools don't break due to a library
optimization.
6) Perform maintenance operations on per-cpu data
rseq c.s. are quite limited feature-wise: they need to end with a
*single* commit instruction that updates a memory location. On the other
hand, the cpu_opv system call can combine a sequence of operations that
need to be executed with preemption disabled. While slower than rseq,
this allows for more complex maintenance operations to be performed on
per-cpu data concurrently with rseq fast-paths, in cases where it's not
possible to map those sequences of ops to a rseq.
7) Use cpu_opv as generic implementation for architectures not
implementing rseq assembly code
rseq critical sections require architecture-specific user-space code to
be crafted in order to port an algorithm to a given architecture. In
addition, it requires that the kernel architecture implementation adds
hooks into signal delivery and resume to user-space.
In order to facilitate integration of rseq into user-space, cpu_opv can
provide a (relatively slower) architecture-agnostic implementation of
rseq. This means that user-space code can be ported to all architectures
through use of cpu_opv initially, and have the fast-path use rseq
whenever the asm code is implemented.
8) Allow libraries with multi-part algorithms to work on same per-cpu
data without affecting the allowed cpu mask
The lttng-ust tracer presents an interesting use-case for per-cpu
buffers: the algorithm needs to update a "reserve" counter, serialize
data into the buffer, and then update a "commit" counter _on the same
per-cpu buffer_. Using rseq for both reserve and commit can bring
significant performance benefits.
Clearly, if rseq reserve fails, the algorithm can retry on a different
per-cpu buffer. However, it's not that easy for the commit. It needs to
be performed on the same per-cpu buffer as the reserve.
The cpu_opv system call solves that problem by receiving the cpu number
on which the operation needs to be performed as argument. It can push
the task to the right CPU if needed, and perform the operations there
with preemption disabled.
Changing the allowed cpu mask for the current thread is not an
acceptable alternative for a tracing library, because the application
being traced does not expect that mask to be changed by libraries.
9) Ensure that data structures don't need store-release/load-acquire
semantic to handle fall-back
cpu_opv performs the fall-back on the requested CPU by migrating the
task to that CPU. Executing the slow-path on the right CPU ensures that
store-release/load-acquire semantic is required on neither the
fast-path nor the slow-path.
**** rseq and cpu_opv use-cases ****
1) per-cpu spinlock
A per-cpu spinlock can be implemented as a rseq consisting of a
comparison operation (== 0) on a word, and a word store (1), followed
by an acquire barrier after control dependency. The unlock path can be
performed with a simple store-release of 0 to the word, which does
not require rseq.
The cpu_opv fallback requires a single-word comparison (== 0) and a
single-word store (1).
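As an illustration of the compare-then-store vector, here is a small
user-space model. It is only a sketch: the sketch_* types are
hypothetical stand-ins (not the uapi struct from this series), the ops
run as plain C rather than in a preempt-off kernel section, and
returning "index + 1" on a failed comparison is a convention assumed
for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the op types used in this sketch. */
enum sketch_op_type { SKETCH_COMPARE_EQ_OP, SKETCH_MEMCPY_OP };

struct sketch_op {
	enum sketch_op_type op;
	uint32_t len;
	union {
		struct { const void *a; const void *b; } compare_op;
		struct { void *dst; const void *src; } memcpy_op;
	} u;
};

/* Run the ops in order; stop at the first failed comparison. */
static int sketch_run_opv(const struct sketch_op *opv, int cnt)
{
	assert(opv != NULL || cnt == 0);
	for (int i = 0; i < cnt; i++) {
		const struct sketch_op *op = &opv[i];

		switch (op->op) {
		case SKETCH_COMPARE_EQ_OP:
			if (memcmp(op->u.compare_op.a, op->u.compare_op.b,
				   op->len))
				return i + 1;	/* assumed convention */
			break;
		case SKETCH_MEMCPY_OP:
			memcpy(op->u.memcpy_op.dst, op->u.memcpy_op.src,
			       op->len);
			break;
		default:
			return -EINVAL;
		}
	}
	return 0;
}

/* Per-cpu lock fallback: compare the word (== 0), then store (1). */
static int sketch_percpu_trylock(intptr_t *lock)
{
	intptr_t expect = 0, newval = 1;
	struct sketch_op opv[2] = {
		{ .op = SKETCH_COMPARE_EQ_OP, .len = sizeof(intptr_t),
		  .u.compare_op = { .a = lock, .b = &expect } },
		{ .op = SKETCH_MEMCPY_OP, .len = sizeof(intptr_t),
		  .u.memcpy_op = { .dst = lock, .src = &newval } },
	};

	return sketch_run_opv(opv, 2);
}
```

A zero return means the lock word went from 0 to 1; a non-zero return
means the comparison failed and the lock was already held.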
2) per-cpu statistics counters
Per-cpu statistics counters can be implemented as a rseq consisting
of a final "add" instruction on a word as commit.
The cpu_opv fallback can be implemented as an "ADD" operation.
Besides statistics tracking, these counters can be used to implement
user-space RCU per-cpu grace period tracking for both single and
multi-process user-space RCU.
3) per-cpu LIFO linked-list (unlimited size stack)
A per-cpu LIFO linked-list has a "push" and "pop" operation,
which respectively adds an item to the list, and removes an
item from the list.
The "push" operation can be implemented as a rseq consisting of
a word comparison instruction against head followed by a word store
(commit) to head. Its cpu_opv fallback can be implemented as a
word-compare followed by word-store as well.
The "pop" operation can be implemented as a rseq consisting of
loading head, comparing it against NULL, loading the next pointer
at the right offset within the head item, and storing the next pointer
as a new head, returning the old head on success.
The cpu_opv fallback for "pop" differs from its rseq algorithm:
cpu_opv needs to know all pointers at system call entry so it can
pin all pages, so it cannot simply load head and then load the
head->next address within the preempt-off critical section.
User-space needs to pass the head and head->next addresses to the
kernel, and the kernel needs to check that the head address is
unchanged since it has been loaded by user-space. However, when
accessing head->next in an ABA situation, it's possible that head is
unchanged, but loading head->next can result in a page fault due to
a concurrently freed head object. This is why the "expect_fault"
operation field is introduced: if a fault is triggered by this
access, -EAGAIN will be returned by cpu_opv rather than -EFAULT,
thus indicating that the operation vector should be attempted again.
The "pop" operation can thus be implemented as a word comparison of
head against the head loaded by user-space, followed by a load of
the head->next pointer (which may fault), and a store of that
pointer as a new head.
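The "pop" fallback steps can be modelled in plain user-space C as
follows. This is a sketch under assumptions: the sketch_* names are
hypothetical, the compare and store run sequentially instead of inside
cpu_opv, and no fault is simulated, so the retry path only stands in
for the -EAGAIN case described above.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_node {
	struct sketch_node *next;
};

/*
 * Model of the preempt-off section: compare *headp against the head
 * value loaded earlier by user-space, then copy the word found at
 * next_addr into *headp. Returns 0 on success, 1 if the compare fails.
 */
static int sketch_cmp_then_store(struct sketch_node **headp,
				 struct sketch_node *expect,
				 struct sketch_node *const *next_addr)
{
	if (*headp != expect)
		return 1;
	/* In cpu_opv, this is the load that may fault. */
	*headp = *next_addr;
	return 0;
}

/* Pop: returns the old head, or NULL if the list is empty. */
static struct sketch_node *sketch_list_pop(struct sketch_node **headp)
{
	assert(headp != NULL);
	for (;;) {
		struct sketch_node *head = *headp; /* user-space load */
		struct sketch_node *const *next_addr;

		if (!head)
			return NULL;
		/* Address computed without loading head->next. */
		next_addr = &head->next;
		if (sketch_cmp_then_store(headp, head, next_addr) == 0)
			return head;
		/* Head changed (or -EAGAIN with the real call): retry. */
	}
}
```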
4) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
This structure is useful for passing around allocated objects
by passing pointers through per-cpu fixed-sized stack.
The "push" side can be implemented with a check of the current
offset against the maximum buffer length, followed by a rseq
consisting of a comparison of the previously loaded offset
against the current offset, a word "try store" operation into the
next ring buffer array index (it's OK to abort after a try-store,
since it's not the commit, and its side-effect can be overwritten),
then followed by a word-store to increment the current offset (commit).
The "push" cpu_opv fallback can be done with the comparison, and
two consecutive word stores, all within the preempt-off section.
The "pop" side can be implemented with a check that offset is not
0 (whether the buffer is empty), a load of the "head" pointer before the
offset array index, followed by a rseq consisting of a word
comparison checking that the offset is unchanged since previously
loaded, another check ensuring that the "head" pointer is unchanged,
followed by a store decrementing the current offset.
The cpu_opv "pop" can be implemented with the same algorithm
as the rseq fast-path (compare, compare, store).
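The fixed-sized stack "push" and "pop" sequences can be sketched in
user-space C as below. This only illustrates the order of the
compare/store steps: the sketch_* names and the capacity are
hypothetical, and in this single-threaded model the comparisons always
succeed, whereas in the real scheme they detect concurrent changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_STACK_LEN 4	/* hypothetical capacity */

struct sketch_stack {
	size_t offset;			/* number of valid entries */
	void *array[SKETCH_STACK_LEN];
};

/* Push: compare offset, try-store the pointer, store offset + 1. */
static bool sketch_stack_push(struct sketch_stack *s, void *obj)
{
	size_t offset;

	assert(s != NULL);
	offset = s->offset;		/* loaded before the sequence */
	if (offset >= SKETCH_STACK_LEN)
		return false;		/* full */
	if (s->offset != offset)	/* comparison: offset unchanged? */
		return false;
	s->array[offset] = obj;		/* try-store, may be overwritten */
	s->offset = offset + 1;		/* commit */
	return true;
}

/*
 * Pop: compare offset, compare head pointer, store offset - 1.
 * Returns NULL when empty; this sketch also returns NULL when a
 * comparison fails, where a real caller would retry instead.
 */
static void *sketch_stack_pop(struct sketch_stack *s)
{
	size_t offset;
	void *head;

	assert(s != NULL);
	offset = s->offset;
	if (offset == 0)
		return NULL;		/* empty */
	head = s->array[offset - 1];
	if (s->offset != offset || s->array[offset - 1] != head)
		return NULL;
	s->offset = offset - 1;		/* commit */
	return head;
}
```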
5) per-cpu LIFO ring buffer with pointers to objects (fixed-sized stack)
supporting "peek" from remote CPU
In order to implement work queues with work-stealing between CPUs, it is
useful to ensure the offset "commit" in scenario 4) "push" has a
store-release semantic, thus allowing a remote CPU to load the offset
with acquire semantic, and load the top pointer, in order to check if
work-stealing should be performed. The task (work queue item) existence
should be protected by other means, e.g. RCU.
If the peek operation notices that work-stealing should indeed be
performed, a thread can use cpu_opv to move the task between per-cpu
workqueues, by first invoking cpu_opv passing the remote work queue
cpu number as argument to pop the task, and then again as "push" with
the target work queue CPU number.
6) per-cpu LIFO ring buffer with data copy (fixed-sized stack)
(with and without acquire-release)
This structure is useful for passing around data without requiring
memory allocation by copying the data content into per-cpu fixed-sized
stack.
The "push" operation is performed with an offset comparison against
the buffer size (figuring out if the buffer is full), followed by
a rseq consisting of a comparison of the offset, a try-memcpy attempting
to copy the data content into the buffer (which can be aborted and
overwritten), and a final store incrementing the offset.
The cpu_opv fallback needs the same operations, except that the memcpy
is guaranteed to complete, given that it is performed with preemption
disabled. This requires a memcpy operation supporting length up to 4kB.
The "pop" operation is similar to the "push", except that the offset
is first compared to 0 to ensure the buffer is not empty. The
copy source is the ring buffer, and the destination is an output
buffer.
7) per-cpu FIFO ring buffer (fixed-sized queue)
This structure is useful wherever a FIFO behavior (queue) is needed.
One major use-case is a tracer ring buffer.
An implementation of this ring buffer has a "reserve", followed by
serialization of multiple bytes into the buffer, ended by a "commit".
The "reserve" can be implemented as a rseq consisting of a word
comparison followed by a word store. The reserve operation moves the
producer "head". The multi-byte serialization can be performed
non-atomically. Finally, the "commit" update can be performed with
a rseq "add" commit instruction with store-release semantic. The
ring buffer consumer reads the commit value with load-acquire
semantic to know whenever it is safe to read from the ring buffer.
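The reserve/serialize/commit sequence can be sketched with C11 atomics
as follows. This is a simplified user-space model, not the lttng-ust
implementation: the sketch_* names and buffer size are hypothetical,
the buffer never wraps, and the reserve step is plain C rather than a
rseq critical section.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BUF_SIZE 64	/* hypothetical per-cpu buffer size */

struct sketch_ring {
	size_t reserve;		/* producer head, moved by "reserve" */
	atomic_size_t commit;	/* number of bytes fully written */
	char data[SKETCH_BUF_SIZE];
};

/* Reserve len bytes: word comparison followed by a word store. */
static bool sketch_reserve(struct sketch_ring *r, size_t len, size_t *pos)
{
	size_t head = r->reserve;

	if (len > SKETCH_BUF_SIZE - head)
		return false;	/* no room; this sketch never wraps */
	if (r->reserve != head)
		return false;	/* head moved; caller would retry */
	r->reserve = head + len;
	*pos = head;
	return true;
}

/* Commit: add to the commit counter with store-release semantic. */
static void sketch_commit(struct sketch_ring *r, size_t len)
{
	atomic_fetch_add_explicit(&r->commit, len, memory_order_release);
}

/* Consumer: load-acquire of the commit counter. */
static size_t sketch_readable(struct sketch_ring *r)
{
	return atomic_load_explicit(&r->commit, memory_order_acquire);
}

/* Producer path: reserve, serialize non-atomically, commit. */
static bool sketch_write(struct sketch_ring *r, const void *src, size_t len)
{
	size_t pos;

	assert(r != NULL && src != NULL);
	if (!sketch_reserve(r, len, &pos))
		return false;
	memcpy(r->data + pos, src, len);
	sketch_commit(r, len);
	return true;
}
```

The release on the commit counter pairs with the consumer's acquire
load, so bytes counted as committed are visible to the reader.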
This use-case requires that both "reserve" and "commit" operations
be performed on the same per-cpu ring buffer, even if a migration
happens between those operations. In the typical case, both operations
will happen on the same CPU and use rseq. In the unlikely event of a
migration, the cpu_opv system call will ensure the commit can be
performed on the right CPU by migrating the task to that CPU.
On the consumer side, an alternative to using store-release and
load-acquire on the commit counter would be to use cpu_opv to
ensure the commit counter load is performed on the right CPU. This
effectively allows moving a consumer thread between CPUs to execute
close to the ring buffer cache lines it will read.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Michael Kerrisk <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
Changes since v1:
- handle CPU hotplug,
- cleanup implementation using function pointers: We can use function
pointers to implement the operations rather than duplicating all the
user-access code.
- refuse device pages: Performing cpu_opv operations on io map'd pages
with preemption disabled could generate long preempt-off critical
sections, which leads to unwanted scheduler latency. Return EFAULT if
a device page is received as parameter
- restrict op vector to 4216 bytes length sum: Restrict the operation
vector to length sum of:
- 4096 bytes (typical page size on most architectures, should be
enough for a string, or structures)
- 15 * 8 bytes (typical operations on integers or pointers).
The goal here is to keep the duration of preempt off critical section
short, so we don't add significant scheduler latency.
- Add INIT_ONSTACK macro: Introduce the
CPU_OP_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
correctly initialize the upper bits of CPU_OP_FIELD_u32_u64() on their
stack to 0 on 32-bit architectures.
- Add CPU_MB_OP operation:
Use-cases with:
- two consecutive stores,
- a memcpy followed by a store,
require a memory barrier before the final store operation. A typical
use-case is a store-release on the final store. Given that this is a
slow path, just providing an explicit full barrier instruction should
be sufficient.
- Add expect fault field:
The use-case of list_pop brings interesting challenges. With rseq, we
can use rseq_cmpnev_storeoffp_load(), and therefore load a pointer,
compare it against NULL, add an offset, and load the target "next"
pointer from the object, all within a single rseq critical section.
Life is not so easy for cpu_opv in this use-case, mainly because we
need to pin all pages we are going to touch in the preempt-off
critical section beforehand. So we need to know the target object (in
which we apply an offset to fetch the next pointer) when we pin pages
before disabling preemption.
So the approach is to load the head pointer and compare it against
NULL in user-space, before doing the cpu_opv syscall. User-space can
then compute the address of the head->next field, *without loading it*.
The cpu_opv system call will first need to pin all pages associated
with input data. This includes the page backing the head->next object,
which may have been concurrently deallocated and unmapped. Therefore,
in this case, getting -EFAULT when trying to pin those pages may
happen: it just means they have been concurrently unmapped. This is
an expected situation, and should just return -EAGAIN to user-space,
so user-space can distinguish between "should retry" type of
situations and actual errors that should be handled with extreme
prejudice to the program (e.g. abort()).
Therefore, add "expect_fault" fields along with op input address
pointers, so user-space can identify whether a fault when getting a
field should return EAGAIN rather than EFAULT.
- Add compiler barrier between operations: Adding a compiler barrier
between store operations in a cpu_opv sequence can be useful when
paired with membarrier system call.
An algorithm with a paired slow path and fast path can use
sys_membarrier on the slow path to replace fast-path memory barriers
by compiler barrier.
Adding an explicit compiler barrier between operations allows
cpu_opv to be used as fallback for operations meant to match
the membarrier system call.
Changes since v2:
- Fix memory leak by introducing struct cpu_opv_pinned_pages.
Suggested by Boqun Feng.
- Cast argument 1 passed to access_ok from integer to void __user *,
fixing sparse warning.
Changes since v3:
- Fix !SMP by adding push_task_to_cpu() empty static inline.
- Add missing sys_cpu_opv() asmlinkage declaration to
include/linux/syscalls.h.
Changes since v4:
- Cleanup based on Thomas Gleixner's feedback.
- Handle retry in case where the scheduler migrates the thread away
from the target CPU after migration within the syscall rather than
returning EAGAIN to user-space.
- Move push_task_to_cpu() to its own patch.
- New scheme for touching user-space memory:
1) get_user_pages_fast() to pin/get all pages (which can sleep),
2) vm_map_ram() those pages
3) grab mmap_sem (read lock)
4) __get_user_pages_fast() (or get_user_pages() on failure)
-> Confirm that the same page pointers are returned. This
catches cases where COW mappings are changed concurrently.
-> If page pointers differ, or on gup failure, release mmap_sem,
vm_unmap_ram/put_page and retry from step (1).
-> perform put_page on the extra reference immediately for each
page.
5) preempt disable
6) Perform operations on vmap. Those operations are normal
loads/stores/memcpy.
7) preempt enable
8) release mmap_sem
9) vm_unmap_ram() all virtual addresses
10) put_page() all pages
- Handle architectures with VIVT caches along with vmap(): call
flush_kernel_vmap_range() after each "write" operation. This
ensures that the user-space mapping and vmap reach a consistent
state between each operation.
- Depend on MMU for is_zero_pfn(). e.g. Blackfin and SH architectures
don't provide the zero_pfn symbol.
Changes since v5:
- Fix handling of push_task_to_cpu() when argument is a cpu which is
not part of the task's allowed cpu mask.
- Add CPU_OP_NR_FLAG flag, which returns the number of operations
supported by the system call.
---
Man page associated:
CPU_OPV(2) Linux Programmer's Manual CPU_OPV(2)
NAME
cpu_opv - CPU preempt-off operation vector system call
SYNOPSIS
#include <linux/cpu_opv.h>
int cpu_opv(struct cpu_op * cpu_opv, int cpuopcnt, int cpu, int flags);
DESCRIPTION
The cpu_opv system call executes a vector of operations on
behalf of user-space on a specific CPU with preemption dis‐
abled.
The operations available are: comparison, memcpy, add, or, and,
xor, left shift, right shift, and memory barrier. The system
call receives a CPU number from user-space as argument, which
is the CPU on which those operations need to be performed. All
pointers in the ops must have been set up to point to the per
CPU memory of the CPU on which the operations should be exe‐
cuted. The "comparison" operation can be used to check that the
data used in the preparation step did not change between prepa‐
ration of system call inputs and operation execution within the
preempt-off critical section.
An overall maximum of 4216 bytes is enforced on the sum of
operation lengths within an operation vector, so user-space
cannot generate an overly long preempt-off critical section. Each
operation is also limited to a length of 4096 bytes. A maximum
limit of 16 operations per cpu_opv syscall invocation is
enforced.
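These three limits can be expressed as a small validation routine. The
following is only a user-space sketch: the SKETCH_* and sketch_* names
are hypothetical, and returning -EINVAL is an assumption of this
example rather than a statement about the system call's error codes.
The 4216 figure is 4096 bytes plus 15 operations of 8 bytes each.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Limits quoted from the text above; macro names are hypothetical. */
#define SKETCH_OP_MAX_LEN	4096
#define SKETCH_VEC_MAX_SUM	(4096 + 15 * 8)	/* 4216 bytes */
#define SKETCH_VEC_MAX_OPS	16

/* Returns 0 if the operation lengths respect all three limits. */
static int sketch_check_lens(const size_t *lens, int cnt)
{
	size_t sum = 0;

	assert(lens != NULL || cnt == 0);
	if (cnt < 0 || cnt > SKETCH_VEC_MAX_OPS)
		return -EINVAL;		/* too many operations */
	for (int i = 0; i < cnt; i++) {
		if (lens[i] > SKETCH_OP_MAX_LEN)
			return -EINVAL;	/* single operation too long */
		sum += lens[i];
	}
	return sum > SKETCH_VEC_MAX_SUM ? -EINVAL : 0;
}
```

For instance, one 4096-byte memcpy plus fifteen 8-byte operations is
accepted, while two 4096-byte operations are rejected.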
If the thread is not running on the requested CPU, it is
migrated to it.
The layout of struct cpu_op is as follows:
Fields
op Operation of type enum cpu_op_type to perform. This
operation type selects the associated "u" union field.
len
Length (in bytes) of data to consider for this opera‐
tion.
u.compare_op
For a CPU_COMPARE_EQ_OP , and CPU_COMPARE_NE_OP , con‐
tains the a and b pointers to compare. The
expect_fault_a and expect_fault_b fields indicate
whether a page fault should be expected for each of
those pointers. If expect_fault_a , or expect_fault_b
is set, EAGAIN is returned on fault, else EFAULT is
returned. The len field is allowed to take values from 0
to 4096 for comparison operations.
u.memcpy_op
For a CPU_MEMCPY_OP , contains the dst and src pointers,
expressing a copy of src into dst. The expect_fault_dst
and expect_fault_src fields indicate whether a page
fault should be expected for each of those pointers. If
expect_fault_dst , or expect_fault_src is set, EAGAIN is
returned on fault, else EFAULT is returned. The len
field is allowed to take values from 0 to 4096 for mem‐
cpy operations.
u.arithmetic_op
For CPU_ADD_OP, contains the p, count, and expect_fault_p
fields, which are respectively a pointer to the memory
location to increment, the 64-bit signed integer value to add,
and whether a page fault should be expected for p. If
expect_fault_p is set, EAGAIN is returned on fault, else
EFAULT is returned. The len field is allowed to take values of
1, 2, 4, or 8 bytes for arithmetic operations.
u.bitwise_op
For CPU_OR_OP, CPU_AND_OP, and CPU_XOR_OP, contains the p,
mask, and expect_fault_p fields, which are respectively a
pointer to the memory location to target, the mask to apply,
and whether a page fault should be expected for p. If
expect_fault_p is set, EAGAIN is returned on fault, else
EFAULT is returned. The len field is allowed to take values of
1, 2, 4, or 8 bytes for bitwise operations.
u.shift_op
For CPU_LSHIFT_OP and CPU_RSHIFT_OP, contains the p, bits, and
expect_fault_p fields, which are respectively a pointer to the
memory location to target, the number of bits to shift either
left or right, and whether a page fault should be expected for
p. If expect_fault_p is set, EAGAIN is returned on fault, else
EFAULT is returned. The len field is allowed to take values of
1, 2, 4, or 8 bytes for shift operations. The bits field must
be smaller than the operand width in bits: it is allowed to
take values between 0 and (8 * len) - 1, hence at most 63.
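The shift argument limits can be sketched in user-space as follows. This mirrors the len and bits checks that cpu_opv_check_op() performs for CPU_LSHIFT_OP and CPU_RSHIFT_OP in this patch, where the largest accepted shift count depends on len. The helper names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest shift count accepted for an operand of len bytes.
 * len must be one of the sizes accepted for shift operations.
 */
uint32_t cpu_op_shift_bits_max(uint32_t len)
{
	assert(len == 1 || len == 2 || len == 4 || len == 8);
	return len * 8 - 1;
}

/*
 * Hypothetical pre-check (not in the patch): return 1 if (len, bits)
 * would pass the kernel-side shift validation, 0 if the system call
 * would fail with EINVAL.
 */
int cpu_op_shift_args_valid(uint32_t len, uint32_t bits)
{
	if (len != 1 && len != 2 && len != 4 && len != 8)
		return 0;
	return bits <= cpu_op_shift_bits_max(len);
}
```

For example, a 1-byte operand accepts shift counts 0 to 7, and an 8-byte operand accepts 0 to 63.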
The enum cpu_op_type contains the following operations:
· CPU_COMPARE_EQ_OP: Compare whether two memory locations are
equal,
· CPU_COMPARE_NE_OP: Compare whether two memory locations
differ,
· CPU_MEMCPY_OP: Copy a source memory location into a
destination,
· CPU_ADD_OP: Increment a target memory location by a given
count,
· CPU_OR_OP: Apply an "or" mask to a memory location,
· CPU_AND_OP: Apply an "and" mask to a memory location,
· CPU_XOR_OP: Apply an "xor" mask to a memory location,
· CPU_LSHIFT_OP: Shift a memory location left by a given number
of bits,
· CPU_RSHIFT_OP: Shift a memory location right by a given
number of bits,
· CPU_MB_OP: Issue a memory barrier.
All of the operations above provide single-copy atomicity
guarantees for word-sized, word-aligned target pointers, for
both loads and stores.
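As an illustration of how an operation vector might be assembled, the sketch below builds a two-element vector: compare a word against an expected value, then store a new value if the comparison succeeded. It is a sketch under stated assumptions, not the series' own API: struct cpu_op_sketch is a simplified stand-in for struct cpu_op that stores every pointer as a plain uint64_t (the patch uses LINUX_FIELD_u32_u64() for those fields) and only carries the two union members used here. cpu_opv_build_cmpeq_store() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Operation types, in the order given by enum cpu_op_type in the patch. */
enum cpu_op_type {
	CPU_COMPARE_EQ_OP, CPU_COMPARE_NE_OP, CPU_MEMCPY_OP, CPU_ADD_OP,
	CPU_OR_OP, CPU_AND_OP, CPU_XOR_OP, CPU_LSHIFT_OP, CPU_RSHIFT_OP,
	CPU_MB_OP,
};

/* Simplified stand-in for struct cpu_op (assumption: uint64_t pointers). */
struct cpu_op_sketch {
	int32_t op;
	uint32_t len;
	union {
		struct {
			uint64_t a, b;
			uint8_t expect_fault_a, expect_fault_b;
		} compare_op;
		struct {
			uint64_t dst, src;
			uint8_t expect_fault_dst, expect_fault_src;
		} memcpy_op;
		char padding[24];
	} u;
};

/* Fill ops[0..1] with "if (*v == *expect) *v = *newv". */
void cpu_opv_build_cmpeq_store(struct cpu_op_sketch ops[2], intptr_t *v,
			       const intptr_t *expect, const intptr_t *newv)
{
	assert(ops && v && expect && newv);
	/* Zero everything, so that no expect_fault field is set. */
	memset(ops, 0, 2 * sizeof(ops[0]));

	ops[0].op = CPU_COMPARE_EQ_OP;
	ops[0].len = sizeof(*v);
	ops[0].u.compare_op.a = (uint64_t)(uintptr_t)v;
	ops[0].u.compare_op.b = (uint64_t)(uintptr_t)expect;

	ops[1].op = CPU_MEMCPY_OP;
	ops[1].len = sizeof(*v);
	ops[1].u.memcpy_op.dst = (uint64_t)(uintptr_t)v;
	ops[1].u.memcpy_op.src = (uint64_t)(uintptr_t)newv;
}
```

Because intptr_t is word-sized and naturally aligned, both operations fall under the single-copy atomicity guarantee stated above.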
The cpuopcnt argument is the number of elements in the cpu_opv
array. It can take values from 0 to 16.
The cpu argument is the CPU number on which the operation
sequence needs to be executed.
The flags argument is a bitmask. When CPU_OP_NR_FLAG is set,
the cpu_opv() system call returns the number of operations
available. When flags is 0, the sequence of operations received
as parameter is performed.
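A raw invocation through syscall(2) could look like the sketch below. It is an assumption-laden illustration: no C library wrapper exists, so the system call number is taken as a parameter rather than hard-coded (the number depends on the architecture tables wired up elsewhere in this series). Passing a NULL vector, a count of 0 and CPU 0 together with CPU_OP_NR_FLAG is also an assumption; this excerpt does not state which arguments are inspected when the flag is set. The wrapper names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Mirrors enum cpu_op_flags from the patch. */
#define CPU_OP_NR_FLAG	(1U << 0)

/* Thin wrapper; nr is the architecture-specific system call number. */
long cpu_opv_raw(long nr, void *ops, int cpuopcnt, int cpu, int flags)
{
	assert(cpuopcnt >= 0 && cpuopcnt <= 16);
	return syscall(nr, ops, cpuopcnt, cpu, flags);
}

/*
 * Ask the kernel how many operations it supports. Returns -1 with
 * errno set (e.g. ENOSYS) when the system call is unavailable.
 */
long cpu_opv_query_nr_ops(long nr)
{
	return cpu_opv_raw(nr, NULL, 0, 0, (int)CPU_OP_NR_FLAG);
}
```

Querying the operation count first lets user-space detect a kernel without cpu_opv() and fall back to another mechanism.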
RETURN VALUE
A return value of 0 indicates success. On error, -1 is
returned, and errno is set appropriately. If a comparison
operation fails, execution of the operation vector is stopped,
and the return value is the index of the failing comparison
operation plus one (values between 1 and 16).
ERRORS
EAGAIN cpu_opv() system call should be attempted again.
EINVAL Either flags contains an invalid value, or cpu contains
an invalid value or a value not allowed by the current
thread's allowed cpu mask, or cpuopcnt contains an
invalid value, or the cpu_opv operation vector contains
an invalid op value, or the cpu_opv operation vector
contains an invalid len value, or the cpu_opv operation
vector sum of len values is too large.
ENOSYS The cpu_opv() system call is not implemented by this
kernel.
EFAULT cpu_opv is an invalid address, or a pointer contained
within an operation is invalid (and a fault is not
expected for that pointer).
VERSIONS
The cpu_opv() system call was added in Linux 4.X (TODO).
CONFORMING TO
cpu_opv() is Linux-specific.
SEE ALSO
membarrier(2), rseq(2)
Linux 2018-03-22 CPU_OPV(2)
---
MAINTAINERS | 7 +
include/linux/syscalls.h | 3 +
include/uapi/linux/cpu_opv.h | 120 +++++
init/Kconfig | 17 +
kernel/Makefile | 1 +
kernel/cpu_opv.c | 1083 ++++++++++++++++++++++++++++++++++++++++++
kernel/sys_ni.c | 1 +
7 files changed, 1232 insertions(+)
create mode 100644 include/uapi/linux/cpu_opv.h
create mode 100644 kernel/cpu_opv.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 789463978181..e32d4415081b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3743,6 +3743,13 @@ B: https://bugzilla.kernel.org
F: drivers/cpuidle/*
F: include/linux/cpuidle.h
+CPU NON-PREEMPTIBLE OPERATION VECTOR SUPPORT
+M: Mathieu Desnoyers <[email protected]>
+L: [email protected]
+S: Supported
+F: kernel/cpu_opv.c
+F: include/uapi/linux/cpu_opv.h
+
CRAMFS FILESYSTEM
M: Nicolas Pitre <[email protected]>
S: Maintained
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 340650b4ec54..32d289f41f62 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -67,6 +67,7 @@ struct perf_event_attr;
struct file_handle;
struct sigaltstack;
struct rseq;
+struct cpu_op;
union bpf_attr;
#include <linux/types.h>
@@ -943,5 +944,7 @@ asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
int flags, uint32_t sig);
+asmlinkage long sys_cpu_opv(struct cpu_op __user *ucpuopv, int cpuopcnt,
+ int cpu, int flags);
#endif
diff --git a/include/uapi/linux/cpu_opv.h b/include/uapi/linux/cpu_opv.h
new file mode 100644
index 000000000000..4901e5704db6
--- /dev/null
+++ b/include/uapi/linux/cpu_opv.h
@@ -0,0 +1,120 @@
+#ifndef _UAPI_LINUX_CPU_OPV_H
+#define _UAPI_LINUX_CPU_OPV_H
+
+/*
+ * linux/cpu_opv.h
+ *
+ * CPU preempt-off operation vector system call API
+ *
+ * Copyright (c) 2017 Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else
+# include <stdint.h>
+#endif
+
+#include <linux/types_32_64.h>
+
+#define CPU_OP_VEC_LEN_MAX 16
+#define CPU_OP_ARG_LEN_MAX 24
+/* Maximum data len per operation. */
+#define CPU_OP_DATA_LEN_MAX 4096
+/*
+ * Maximum data len for overall vector. Restrict the amount of user-space
+ * data touched by the kernel in non-preemptible context, so it does not
+ * introduce long scheduler latencies.
+ * This allows one copy of up to 4096 bytes, and 15 operations touching 8
+ * bytes each.
+ * This limit is applied to the sum of length specified for all operations
+ * in a vector.
+ */
+#define CPU_OP_MEMCPY_EXPECT_LEN 4096
+#define CPU_OP_EXPECT_LEN 8
+#define CPU_OP_VEC_DATA_LEN_MAX \
+ (CPU_OP_MEMCPY_EXPECT_LEN + \
+ (CPU_OP_VEC_LEN_MAX - 1) * CPU_OP_EXPECT_LEN)
+
+enum cpu_op_flags {
+ CPU_OP_NR_FLAG = (1U << 0),
+};
+
+enum cpu_op_type {
+ /* compare */
+ CPU_COMPARE_EQ_OP,
+ CPU_COMPARE_NE_OP,
+ /* memcpy */
+ CPU_MEMCPY_OP,
+ /* arithmetic */
+ CPU_ADD_OP,
+ /* bitwise */
+ CPU_OR_OP,
+ CPU_AND_OP,
+ CPU_XOR_OP,
+ /* shift */
+ CPU_LSHIFT_OP,
+ CPU_RSHIFT_OP,
+ /* memory barrier */
+ CPU_MB_OP,
+
+ NR_CPU_OPS,
+};
+
+/* Vector of operations to perform. Limited to 16. */
+struct cpu_op {
+ /* enum cpu_op_type. */
+ int32_t op;
+ /* data length, in bytes. */
+ uint32_t len;
+ union {
+ struct {
+ LINUX_FIELD_u32_u64(a);
+ LINUX_FIELD_u32_u64(b);
+ uint8_t expect_fault_a;
+ uint8_t expect_fault_b;
+ } compare_op;
+ struct {
+ LINUX_FIELD_u32_u64(dst);
+ LINUX_FIELD_u32_u64(src);
+ uint8_t expect_fault_dst;
+ uint8_t expect_fault_src;
+ } memcpy_op;
+ struct {
+ LINUX_FIELD_u32_u64(p);
+ int64_t count;
+ uint8_t expect_fault_p;
+ } arithmetic_op;
+ struct {
+ LINUX_FIELD_u32_u64(p);
+ uint64_t mask;
+ uint8_t expect_fault_p;
+ } bitwise_op;
+ struct {
+ LINUX_FIELD_u32_u64(p);
+ uint32_t bits;
+ uint8_t expect_fault_p;
+ } shift_op;
+ char __padding[CPU_OP_ARG_LEN_MAX];
+ } u;
+};
+
+#endif /* _UAPI_LINUX_CPU_OPV_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9610d3def25c..e8d538c47fde 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1422,6 +1422,8 @@ config RSEQ
bool "Enable rseq() system call" if EXPERT
default y
depends on HAVE_RSEQ
+ depends on MMU
+ select CPU_OPV
select MEMBARRIER
help
Enable the restartable sequences system call. It provides a
@@ -1432,6 +1434,21 @@ config RSEQ
If unsure, say Y.
+# CPU_OPV depends on MMU for is_zero_pfn()
+config CPU_OPV
+ bool "Enable cpu_opv() system call" if EXPERT
+ default y
+ depends on MMU
+ help
+ Enable the CPU preempt-off operation vector system call.
+ It allows user-space to perform a sequence of operations on
+ per-cpu data with preemption disabled. Useful as
+ single-stepping fall-back for restartable sequences, and for
+ performing more complex operations on per-cpu data that would
+ not be otherwise possible to do with restartable sequences.
+
+ If unsure, say Y.
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 7085c841c413..9075436afc2e 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
obj-$(CONFIG_HAS_IOMEM) += memremap.o
obj-$(CONFIG_RSEQ) += rseq.o
+obj-$(CONFIG_CPU_OPV) += cpu_opv.o
$(obj)/configs.o: $(obj)/config_data.h
diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
new file mode 100644
index 000000000000..197339e4ea67
--- /dev/null
+++ b/kernel/cpu_opv.c
@@ -0,0 +1,1083 @@
+/*
+ * CPU preempt-off operation vector system call
+ *
+ * It allows user-space to perform a sequence of operations on per-cpu
+ * data with preemption disabled. Useful as single-stepping fall-back
+ * for restartable sequences, and for performing more complex operations
+ * on per-cpu data that would not be otherwise possible to do with
+ * restartable sequences.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2017, EfficiOS Inc.,
+ * Mathieu Desnoyers <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/cpu_opv.h>
+#include <linux/types.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/mm.h>
+#include <asm/ptrace.h>
+#include <asm/byteorder.h>
+#include <asm/cacheflush.h>
+
+#include "sched/sched.h"
+
+/*
+ * Typical invocations of cpu_opv need few virtual address pointers. Keep
+ * those in an array on the stack of the cpu_opv system call up to
+ * this limit, beyond which the array is dynamically allocated.
+ */
+#define NR_VADDR_ON_STACK 8
+
+/* Maximum pages per op. */
+#define CPU_OP_MAX_PAGES 4
+
+/* Maximum number of virtual addresses per operation vector. */
+#define CPU_OP_VEC_MAX_ADDR (2 * CPU_OP_VEC_LEN_MAX)
+
+union op_fn_data {
+ uint8_t _u8;
+ uint16_t _u16;
+ uint32_t _u32;
+ uint64_t _u64;
+#if (BITS_PER_LONG < 64)
+ uint32_t _u64_split[2];
+#endif
+};
+
+struct vaddr {
+ unsigned long mem;
+ unsigned long uaddr;
+ struct page *pages[2];
+ unsigned int nr_pages;
+ int write;
+};
+
+struct cpu_opv_vaddr {
+ struct vaddr *addr;
+ size_t nr_vaddr;
+ bool is_kmalloc;
+};
+
+typedef int (*op_fn_t)(union op_fn_data *data, uint64_t v, uint32_t len);
+
+/*
+ * Provide mutual exclusion for threads executing a cpu_opv against an
+ * offline CPU.
+ */
+static DEFINE_MUTEX(cpu_opv_offline_lock);
+
+/*
+ * The cpu_opv system call executes a vector of operations on behalf of
+ * user-space on a specific CPU with preemption disabled. It is inspired
+ * by readv() and writev() system calls which take a "struct iovec"
+ * array as argument.
+ *
+ * The operations available are: comparison, memcpy, add, or, and, xor,
+ * left shift, right shift, and memory barrier. The system call receives
+ * a CPU number from user-space as argument, which is the CPU on which
+ * those operations need to be performed. All pointers in the ops must
+ * have been set up to point to the per CPU memory of the CPU on which
+ * the operations should be executed. The "comparison" operation can be
+ * used to check that the data used in the preparation step did not
+ * change between preparation of system call inputs and operation
+ * execution within the preempt-off critical section.
+ *
+ * The reason why we require all pointer offsets to be calculated by
+ * user-space beforehand is because we need to use get_user_pages_fast()
+ * to first pin all pages touched by each operation. This takes care of
+ * faulting-in the pages. Then, preemption is disabled, and the
+ * operations are performed atomically with respect to other thread
+ * execution on that CPU, without generating any page fault.
+ *
+ * An overall maximum of 4216 bytes is enforced on the sum of operation
+ * length within an operation vector, so user-space cannot generate a
+ * too long preempt-off critical section (cache cold critical section
+ * duration measured as 4.7µs on x86-64). Each operation is also limited
+ * to a length of 4096 bytes, meaning that an operation can touch a
+ * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
+ * destination if addresses are not aligned on page boundaries).
+ *
+ * If the thread is not running on the requested CPU, it is migrated to
+ * it.
+ */
+
+static unsigned long cpu_op_range_nr_pages(unsigned long addr,
+ unsigned long len)
+{
+ return ((addr + len - 1) >> PAGE_SHIFT) - (addr >> PAGE_SHIFT) + 1;
+}
+
+static int cpu_op_count_pages(unsigned long addr, unsigned long len)
+{
+ unsigned long nr_pages;
+
+ if (!len)
+ return 0;
+ nr_pages = cpu_op_range_nr_pages(addr, len);
+ if (nr_pages > 2) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+ return nr_pages;
+}
+
+static struct vaddr *cpu_op_alloc_vaddr_vector(int nr_vaddr)
+{
+ return kzalloc(nr_vaddr * sizeof(struct vaddr), GFP_KERNEL);
+}
+
+/*
+ * Check operation types and length parameters. Count number of pages.
+ */
+static int cpu_opv_check_op(struct cpu_op *op, int *nr_vaddr, uint32_t *sum)
+{
+ int ret;
+
+ switch (op->op) {
+ case CPU_MB_OP:
+ break;
+ default:
+ *sum += op->len;
+ }
+
+ /* Validate inputs. */
+ switch (op->op) {
+ case CPU_COMPARE_EQ_OP:
+ case CPU_COMPARE_NE_OP:
+ case CPU_MEMCPY_OP:
+ if (op->len > CPU_OP_DATA_LEN_MAX)
+ return -EINVAL;
+ break;
+ case CPU_ADD_OP:
+ case CPU_OR_OP:
+ case CPU_AND_OP:
+ case CPU_XOR_OP:
+ switch (op->len) {
+ case 1:
+ case 2:
+ case 4:
+ case 8:
+ break;
+ default:
+ return -EINVAL;
+ }
+ break;
+ case CPU_LSHIFT_OP:
+ case CPU_RSHIFT_OP:
+ switch (op->len) {
+ case 1:
+ if (op->u.shift_op.bits > 7)
+ return -EINVAL;
+ break;
+ case 2:
+ if (op->u.shift_op.bits > 15)
+ return -EINVAL;
+ break;
+ case 4:
+ if (op->u.shift_op.bits > 31)
+ return -EINVAL;
+ break;
+ case 8:
+ if (op->u.shift_op.bits > 63)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+ break;
+ case CPU_MB_OP:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ /* Count pages and virtual addresses. */
+ switch (op->op) {
+ case CPU_COMPARE_EQ_OP:
+ case CPU_COMPARE_NE_OP:
+ ret = cpu_op_count_pages(op->u.compare_op.a, op->len);
+ if (ret < 0)
+ return ret;
+ ret = cpu_op_count_pages(op->u.compare_op.b, op->len);
+ if (ret < 0)
+ return ret;
+ *nr_vaddr += 2;
+ break;
+ case CPU_MEMCPY_OP:
+ ret = cpu_op_count_pages(op->u.memcpy_op.dst, op->len);
+ if (ret < 0)
+ return ret;
+ ret = cpu_op_count_pages(op->u.memcpy_op.src, op->len);
+ if (ret < 0)
+ return ret;
+ *nr_vaddr += 2;
+ break;
+ case CPU_ADD_OP:
+ ret = cpu_op_count_pages(op->u.arithmetic_op.p, op->len);
+ if (ret < 0)
+ return ret;
+ (*nr_vaddr)++;
+ break;
+ case CPU_OR_OP:
+ case CPU_AND_OP:
+ case CPU_XOR_OP:
+ ret = cpu_op_count_pages(op->u.bitwise_op.p, op->len);
+ if (ret < 0)
+ return ret;
+ (*nr_vaddr)++;
+ break;
+ case CPU_LSHIFT_OP:
+ case CPU_RSHIFT_OP:
+ ret = cpu_op_count_pages(op->u.shift_op.p, op->len);
+ if (ret < 0)
+ return ret;
+ (*nr_vaddr)++;
+ break;
+ case CPU_MB_OP:
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+/*
+ * Check every operation of the vector and the overall data length limit.
+ */
+static int cpu_opv_check(struct cpu_op *cpuopv, int cpuopcnt, int *nr_vaddr)
+{
+ uint32_t sum = 0;
+ int i, ret;
+
+ for (i = 0; i < cpuopcnt; i++) {
+ ret = cpu_opv_check_op(&cpuopv[i], nr_vaddr, &sum);
+ if (ret)
+ return ret;
+ }
+ if (sum > CPU_OP_VEC_DATA_LEN_MAX)
+ return -EINVAL;
+ return 0;
+}
+
+static int cpu_op_check_page(struct page *page, int write)
+{
+ struct address_space *mapping;
+
+ if (is_zone_device_page(page))
+ return -EFAULT;
+
+ /*
+ * The page lock protects many things but in this context the page
+ * lock stabilizes mapping, prevents inode freeing in the shared
+ * file-backed region case and guards against movement to swap
+ * cache.
+ *
+ * Strictly speaking the page lock is not needed in all cases being
+ * considered here and page lock forces unnecessary serialization.
+ * From this point on, mapping will be re-verified if necessary and
+ * page lock will be acquired only if it is unavoidable.
+ *
+ * Mapping checks require the head page for any compound page so the
+ * head page and mapping is looked up now.
+ */
+ page = compound_head(page);
+ mapping = READ_ONCE(page->mapping);
+
+ /*
+ * If page->mapping is NULL, then it cannot be a PageAnon page;
+ * but it might be the ZERO_PAGE (which is OK to read from), or
+ * in the gate area or in a special mapping (for which this
+ * check should fail); or it may have been a good file page when
+ * get_user_pages_fast found it, but truncated or holepunched or
+ * subjected to invalidate_complete_page2 before the page lock
+ * is acquired (also cases which should fail). Given that a
+ * reference to the page is currently held, refcount care in
+ * invalidate_complete_page's remove_mapping prevents
+ * drop_caches from setting mapping to NULL concurrently.
+ *
+ * The case to guard against is when memory pressure causes
+ * shmem_writepage to move the page from filecache to swapcache
+ * concurrently: an unlikely race, but a retry for page->mapping
+ * is required in that situation.
+ */
+ if (!mapping) {
+ int shmem_swizzled;
+
+ /*
+ * Check again with page lock held to guard against
+ * memory pressure making shmem_writepage move the page
+ * from filecache to swapcache.
+ */
+ lock_page(page);
+ shmem_swizzled = PageSwapCache(page) || page->mapping;
+ unlock_page(page);
+ if (shmem_swizzled)
+ return -EAGAIN;
+ /*
+ * It is valid to read from, but invalid to write to the
+ * ZERO_PAGE.
+ */
+ if (!(is_zero_pfn(page_to_pfn(page)) ||
+ is_huge_zero_page(page)) || write)
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static int cpu_op_check_pages(struct page **pages,
+ unsigned long nr_pages,
+ int write)
+{
+ unsigned long i;
+
+ for (i = 0; i < nr_pages; i++) {
+ int ret;
+
+ ret = cpu_op_check_page(pages[i], write);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
+ struct cpu_opv_vaddr *vaddr_ptrs,
+ unsigned long *vaddr, int write)
+{
+ struct page *pages[2];
+ int ret, nr_pages, nr_put_pages, n;
+ unsigned long _vaddr;
+ struct vaddr *va;
+
+ nr_pages = cpu_op_count_pages(addr, len);
+ if (!nr_pages)
+ return 0;
+again:
+ ret = get_user_pages_fast(addr, nr_pages, write, pages);
+ if (ret < nr_pages) {
+ if (ret >= 0) {
+ nr_put_pages = ret;
+ ret = -EFAULT;
+ } else {
+ nr_put_pages = 0;
+ }
+ goto error;
+ }
+ ret = cpu_op_check_pages(pages, nr_pages, write);
+ if (ret) {
+ nr_put_pages = nr_pages;
+ goto error;
+ }
+ va = &vaddr_ptrs->addr[vaddr_ptrs->nr_vaddr++];
+ _vaddr = (unsigned long)vm_map_ram(pages, nr_pages, numa_node_id(),
+ PAGE_KERNEL);
+ if (!_vaddr) {
+ nr_put_pages = nr_pages;
+ ret = -ENOMEM;
+ goto error;
+ }
+ va->mem = _vaddr;
+ va->uaddr = addr;
+ for (n = 0; n < nr_pages; n++)
+ va->pages[n] = pages[n];
+ va->nr_pages = nr_pages;
+ va->write = write;
+ *vaddr = _vaddr + (addr & ~PAGE_MASK);
+ return 0;
+
+error:
+ for (n = 0; n < nr_put_pages; n++)
+ put_page(pages[n]);
+ /*
+ * Retry if a page has been faulted in, or is being swapped in.
+ */
+ if (ret == -EAGAIN)
+ goto again;
+ return ret;
+}
+
+static int cpu_opv_pin_pages_op(struct cpu_op *op,
+ struct cpu_opv_vaddr *vaddr_ptrs,
+ bool *expect_fault)
+{
+ int ret;
+ unsigned long vaddr = 0;
+
+ switch (op->op) {
+ case CPU_COMPARE_EQ_OP:
+ case CPU_COMPARE_NE_OP:
+ ret = -EFAULT;
+ *expect_fault = op->u.compare_op.expect_fault_a;
+ if (!access_ok(VERIFY_READ,
+ (void __user *)op->u.compare_op.a,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.compare_op.a, op->len,
+ vaddr_ptrs, &vaddr, 0);
+ if (ret)
+ return ret;
+ op->u.compare_op.a = vaddr;
+ ret = -EFAULT;
+ *expect_fault = op->u.compare_op.expect_fault_b;
+ if (!access_ok(VERIFY_READ,
+ (void __user *)op->u.compare_op.b,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.compare_op.b, op->len,
+ vaddr_ptrs, &vaddr, 0);
+ if (ret)
+ return ret;
+ op->u.compare_op.b = vaddr;
+ break;
+ case CPU_MEMCPY_OP:
+ ret = -EFAULT;
+ *expect_fault = op->u.memcpy_op.expect_fault_dst;
+ if (!access_ok(VERIFY_WRITE,
+ (void __user *)op->u.memcpy_op.dst,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.memcpy_op.dst, op->len,
+ vaddr_ptrs, &vaddr, 1);
+ if (ret)
+ return ret;
+ op->u.memcpy_op.dst = vaddr;
+ ret = -EFAULT;
+ *expect_fault = op->u.memcpy_op.expect_fault_src;
+ if (!access_ok(VERIFY_READ,
+ (void __user *)op->u.memcpy_op.src,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.memcpy_op.src, op->len,
+ vaddr_ptrs, &vaddr, 0);
+ if (ret)
+ return ret;
+ op->u.memcpy_op.src = vaddr;
+ break;
+ case CPU_ADD_OP:
+ ret = -EFAULT;
+ *expect_fault = op->u.arithmetic_op.expect_fault_p;
+ if (!access_ok(VERIFY_WRITE,
+ (void __user *)op->u.arithmetic_op.p,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.arithmetic_op.p, op->len,
+ vaddr_ptrs, &vaddr, 1);
+ if (ret)
+ return ret;
+ op->u.arithmetic_op.p = vaddr;
+ break;
+ case CPU_OR_OP:
+ case CPU_AND_OP:
+ case CPU_XOR_OP:
+ ret = -EFAULT;
+ *expect_fault = op->u.bitwise_op.expect_fault_p;
+ if (!access_ok(VERIFY_WRITE,
+ (void __user *)op->u.bitwise_op.p,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.bitwise_op.p, op->len,
+ vaddr_ptrs, &vaddr, 1);
+ if (ret)
+ return ret;
+ op->u.bitwise_op.p = vaddr;
+ break;
+ case CPU_LSHIFT_OP:
+ case CPU_RSHIFT_OP:
+ ret = -EFAULT;
+ *expect_fault = op->u.shift_op.expect_fault_p;
+ if (!access_ok(VERIFY_WRITE,
+ (void __user *)op->u.shift_op.p,
+ op->len))
+ return ret;
+ ret = cpu_op_pin_pages(op->u.shift_op.p, op->len,
+ vaddr_ptrs, &vaddr, 1);
+ if (ret)
+ return ret;
+ op->u.shift_op.p = vaddr;
+ break;
+ case CPU_MB_OP:
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int cpu_opv_pin_pages(struct cpu_op *cpuop, int cpuopcnt,
+ struct cpu_opv_vaddr *vaddr_ptrs)
+{
+ int ret, i;
+ bool expect_fault = false;
+
+ /* Check access, pin pages. */
+ for (i = 0; i < cpuopcnt; i++) {
+ ret = cpu_opv_pin_pages_op(&cpuop[i], vaddr_ptrs,
+ &expect_fault);
+ if (ret)
+ goto error;
+ }
+ return 0;
+
+error:
+ /*
+ * If faulting access is expected, return EAGAIN to user-space.
+ * It allows user-space to distinguish between a fault caused by
+ * an access which is expected to fault (e.g. due to concurrent
+ * unmapping of underlying memory) from an unexpected fault from
+ * which a retry would not recover.
+ */
+ if (ret == -EFAULT && expect_fault)
+ return -EAGAIN;
+ return ret;
+}
+
+static int __op_get(union op_fn_data *data, void *p, size_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 = READ_ONCE(*(uint8_t *)p);
+ break;
+ case 2:
+ data->_u16 = READ_ONCE(*(uint16_t *)p);
+ break;
+ case 4:
+ data->_u32 = READ_ONCE(*(uint32_t *)p);
+ break;
+ case 8:
+#if (BITS_PER_LONG == 64)
+ data->_u64 = READ_ONCE(*(uint64_t *)p);
+#else
+ {
+ data->_u64_split[0] = READ_ONCE(*(uint32_t *)p);
+ data->_u64_split[1] = READ_ONCE(*((uint32_t *)p + 1));
+ }
+#endif
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int __op_put(union op_fn_data *data, void *p, size_t len)
+{
+ switch (len) {
+ case 1:
+ WRITE_ONCE(*(uint8_t *)p, data->_u8);
+ break;
+ case 2:
+ WRITE_ONCE(*(uint16_t *)p, data->_u16);
+ break;
+ case 4:
+ WRITE_ONCE(*(uint32_t *)p, data->_u32);
+ break;
+ case 8:
+#if (BITS_PER_LONG == 64)
+ WRITE_ONCE(*(uint64_t *)p, data->_u64);
+#else
+ {
+ WRITE_ONCE(*(uint32_t *)p, data->_u64_split[0]);
+ WRITE_ONCE(*((uint32_t *)p + 1), data->_u64_split[1]);
+ }
+#endif
+ break;
+ default:
+ return -EINVAL;
+ }
+ flush_kernel_vmap_range(p, len);
+ return 0;
+}
+
+/* Return 0 if same, > 0 if different, < 0 on error. */
+static int do_cpu_op_compare(unsigned long _a, unsigned long _b, uint32_t len)
+{
+ void *a = (void *)_a;
+ void *b = (void *)_b;
+ union op_fn_data tmp[2];
+ int ret;
+
+ switch (len) {
+ case 1:
+ case 2:
+ case 4:
+ case 8:
+ if (!IS_ALIGNED(_a, len) || !IS_ALIGNED(_b, len))
+ goto memcmp;
+ break;
+ default:
+ goto memcmp;
+ }
+
+ ret = __op_get(&tmp[0], a, len);
+ if (ret)
+ return ret;
+ ret = __op_get(&tmp[1], b, len);
+ if (ret)
+ return ret;
+
+ switch (len) {
+ case 1:
+ ret = !!(tmp[0]._u8 != tmp[1]._u8);
+ break;
+ case 2:
+ ret = !!(tmp[0]._u16 != tmp[1]._u16);
+ break;
+ case 4:
+ ret = !!(tmp[0]._u32 != tmp[1]._u32);
+ break;
+ case 8:
+ ret = !!(tmp[0]._u64 != tmp[1]._u64);
+ break;
+ default:
+ return -EINVAL;
+ }
+ return ret;
+
+memcmp:
+ if (memcmp(a, b, len))
+ return 1;
+ return 0;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_memcpy(unsigned long _dst, unsigned long _src,
+ uint32_t len)
+{
+ void *dst = (void *)_dst;
+ void *src = (void *)_src;
+ union op_fn_data tmp;
+ int ret;
+
+ switch (len) {
+ case 1:
+ case 2:
+ case 4:
+ case 8:
+ if (!IS_ALIGNED(_dst, len) || !IS_ALIGNED(_src, len))
+ goto memcpy;
+ break;
+ default:
+ goto memcpy;
+ }
+
+ ret = __op_get(&tmp, src, len);
+ if (ret)
+ return ret;
+ return __op_put(&tmp, dst, len);
+
+memcpy:
+ memcpy(dst, src, len);
+ flush_kernel_vmap_range(dst, len);
+ return 0;
+}
+
+static int op_add_fn(union op_fn_data *data, uint64_t count, uint32_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 += (uint8_t)count;
+ break;
+ case 2:
+ data->_u16 += (uint16_t)count;
+ break;
+ case 4:
+ data->_u32 += (uint32_t)count;
+ break;
+ case 8:
+ data->_u64 += (uint64_t)count;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int op_or_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 |= (uint8_t)mask;
+ break;
+ case 2:
+ data->_u16 |= (uint16_t)mask;
+ break;
+ case 4:
+ data->_u32 |= (uint32_t)mask;
+ break;
+ case 8:
+ data->_u64 |= (uint64_t)mask;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int op_and_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 &= (uint8_t)mask;
+ break;
+ case 2:
+ data->_u16 &= (uint16_t)mask;
+ break;
+ case 4:
+ data->_u32 &= (uint32_t)mask;
+ break;
+ case 8:
+ data->_u64 &= (uint64_t)mask;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int op_xor_fn(union op_fn_data *data, uint64_t mask, uint32_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 ^= (uint8_t)mask;
+ break;
+ case 2:
+ data->_u16 ^= (uint16_t)mask;
+ break;
+ case 4:
+ data->_u32 ^= (uint32_t)mask;
+ break;
+ case 8:
+ data->_u64 ^= (uint64_t)mask;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int op_lshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 <<= (uint8_t)bits;
+ break;
+ case 2:
+ data->_u16 <<= (uint16_t)bits;
+ break;
+ case 4:
+ data->_u32 <<= (uint32_t)bits;
+ break;
+ case 8:
+ data->_u64 <<= (uint64_t)bits;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int op_rshift_fn(union op_fn_data *data, uint64_t bits, uint32_t len)
+{
+ switch (len) {
+ case 1:
+ data->_u8 >>= (uint8_t)bits;
+ break;
+ case 2:
+ data->_u16 >>= (uint16_t)bits;
+ break;
+ case 4:
+ data->_u32 >>= (uint32_t)bits;
+ break;
+ case 8:
+ data->_u64 >>= (uint64_t)bits;
+ break;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+/* Return 0 on success, < 0 on error. */
+static int do_cpu_op_fn(op_fn_t op_fn, unsigned long _p, uint64_t v,
+ uint32_t len)
+{
+ union op_fn_data tmp;
+ void *p = (void *)_p;
+ int ret;
+
+ ret = __op_get(&tmp, p, len);
+ if (ret)
+ return ret;
+ ret = op_fn(&tmp, v, len);
+ if (ret)
+ return ret;
+ ret = __op_put(&tmp, p, len);
+ if (ret)
+ return ret;
+ return 0;
+}
+
+/*
+ * Return negative value on error, positive value if comparison
+ * fails, 0 on success.
+ */
+static int __do_cpu_opv_op(struct cpu_op *op)
+{
+ /* Guarantee a compiler barrier between each operation. */
+ barrier();
+
+ switch (op->op) {
+ case CPU_COMPARE_EQ_OP:
+ return do_cpu_op_compare(op->u.compare_op.a,
+ op->u.compare_op.b,
+ op->len);
+ case CPU_COMPARE_NE_OP:
+ {
+ int ret;
+
+ ret = do_cpu_op_compare(op->u.compare_op.a,
+ op->u.compare_op.b,
+ op->len);
+ if (ret < 0)
+ return ret;
+ /*
+ * Stop execution, return positive value if comparison
+ * is identical.
+ */
+ if (ret == 0)
+ return 1;
+ return 0;
+ }
+ case CPU_MEMCPY_OP:
+ return do_cpu_op_memcpy(op->u.memcpy_op.dst,
+ op->u.memcpy_op.src,
+ op->len);
+ case CPU_ADD_OP:
+ return do_cpu_op_fn(op_add_fn, op->u.arithmetic_op.p,
+ op->u.arithmetic_op.count, op->len);
+ case CPU_OR_OP:
+ return do_cpu_op_fn(op_or_fn, op->u.bitwise_op.p,
+ op->u.bitwise_op.mask, op->len);
+ case CPU_AND_OP:
+ return do_cpu_op_fn(op_and_fn, op->u.bitwise_op.p,
+ op->u.bitwise_op.mask, op->len);
+ case CPU_XOR_OP:
+ return do_cpu_op_fn(op_xor_fn, op->u.bitwise_op.p,
+ op->u.bitwise_op.mask, op->len);
+ case CPU_LSHIFT_OP:
+ return do_cpu_op_fn(op_lshift_fn, op->u.shift_op.p,
+ op->u.shift_op.bits, op->len);
+ case CPU_RSHIFT_OP:
+ return do_cpu_op_fn(op_rshift_fn, op->u.shift_op.p,
+ op->u.shift_op.bits, op->len);
+ case CPU_MB_OP:
+ /* Memory barrier provided by this operation. */
+ smp_mb();
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
+static int __do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt)
+{
+ int i, ret;
+
+ for (i = 0; i < cpuopcnt; i++) {
+ ret = __do_cpu_opv_op(&cpuop[i]);
+ /* If comparison fails, stop execution and return index + 1. */
+ if (ret > 0)
+ return i + 1;
+ /* On error, stop execution. */
+ if (ret < 0)
+ return ret;
+ }
+ return 0;
+}
+
+/*
+ * Check that the page pointers pinned by get_user_pages_fast()
+ * are still in the page table. Invoked with mmap_sem held.
+ * Return 0 if pointers match, -EAGAIN if they don't.
+ */
+static int vaddr_check(struct vaddr *vaddr)
+{
+ struct page *pages[2];
+ int ret, n;
+
+ ret = __get_user_pages_fast(vaddr->uaddr, vaddr->nr_pages,
+ vaddr->write, pages);
+ for (n = 0; n < ret; n++)
+ put_page(pages[n]);
+ if (ret < vaddr->nr_pages) {
+ ret = get_user_pages(vaddr->uaddr, vaddr->nr_pages,
+ vaddr->write ? FOLL_WRITE : 0,
+ pages, NULL);
+ if (ret < 0)
+ return -EAGAIN;
+ for (n = 0; n < ret; n++)
+ put_page(pages[n]);
+ if (ret < vaddr->nr_pages)
+ return -EAGAIN;
+ }
+ for (n = 0; n < vaddr->nr_pages; n++) {
+ if (pages[n] != vaddr->pages[n])
+ return -EAGAIN;
+ }
+ return 0;
+}
+
+static int vaddr_ptrs_check(struct cpu_opv_vaddr *vaddr_ptrs)
+{
+ int i;
+
+ for (i = 0; i < vaddr_ptrs->nr_vaddr; i++) {
+ int ret;
+
+ ret = vaddr_check(&vaddr_ptrs->addr[i]);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static int do_cpu_opv(struct cpu_op *cpuop, int cpuopcnt,
+ struct cpu_opv_vaddr *vaddr_ptrs, int cpu)
+{
+ struct mm_struct *mm = current->mm;
+ int ret;
+
+retry:
+ if (cpu != raw_smp_processor_id()) {
+ ret = push_task_to_cpu(current, cpu);
+ if (ret)
+ goto check_online;
+ }
+ down_read(&mm->mmap_sem);
+ ret = vaddr_ptrs_check(vaddr_ptrs);
+ if (ret)
+ goto end;
+ preempt_disable();
+ if (cpu != smp_processor_id()) {
+ preempt_enable();
+ up_read(&mm->mmap_sem);
+ goto retry;
+ }
+ ret = __do_cpu_opv(cpuop, cpuopcnt);
+ preempt_enable();
+end:
+ up_read(&mm->mmap_sem);
+ return ret;
+
+check_online:
+ /*
+ * push_task_to_cpu() returns -EINVAL if the requested cpu is not part
+ * of the current thread's cpus_allowed mask.
+ */
+ if (ret == -EINVAL)
+ return ret;
+ get_online_cpus();
+ if (cpu_online(cpu)) {
+ put_online_cpus();
+ goto retry;
+ }
+ /*
+ * CPU is offline. Perform operation from the current CPU with
+ * cpu_online read lock held, preventing that CPU from coming online,
+ * and with mutex held, providing mutual exclusion against other
+ * CPUs also finding out about an offline CPU.
+ */
+ down_read(&mm->mmap_sem);
+ ret = vaddr_ptrs_check(vaddr_ptrs);
+ if (ret)
+ goto offline_end;
+ mutex_lock(&cpu_opv_offline_lock);
+ ret = __do_cpu_opv(cpuop, cpuopcnt);
+ mutex_unlock(&cpu_opv_offline_lock);
+offline_end:
+ up_read(&mm->mmap_sem);
+ put_online_cpus();
+ return ret;
+}
+
+/*
+ * cpu_opv - execute operation vector on a given CPU with preempt off.
+ *
+ * Userspace should pass current CPU number as parameter.
+ */
+SYSCALL_DEFINE4(cpu_opv, struct cpu_op __user *, ucpuopv, int, cpuopcnt,
+ int, cpu, int, flags)
+{
+ struct vaddr vaddr_on_stack[NR_VADDR_ON_STACK];
+ struct cpu_op cpuopv[CPU_OP_VEC_LEN_MAX];
+ struct cpu_opv_vaddr vaddr_ptrs = {
+ .addr = vaddr_on_stack,
+ .nr_vaddr = 0,
+ .is_kmalloc = false,
+ };
+ int ret, i, nr_vaddr = 0;
+ bool retry = false;
+
+ if (unlikely(flags & ~CPU_OP_NR_FLAG))
+ return -EINVAL;
+ if (flags & CPU_OP_NR_FLAG)
+ return NR_CPU_OPS;
+ if (unlikely(cpu < 0))
+ return -EINVAL;
+ if (cpuopcnt < 0 || cpuopcnt > CPU_OP_VEC_LEN_MAX)
+ return -EINVAL;
+ if (copy_from_user(cpuopv, ucpuopv, cpuopcnt * sizeof(struct cpu_op)))
+ return -EFAULT;
+ ret = cpu_opv_check(cpuopv, cpuopcnt, &nr_vaddr);
+ if (ret)
+ return ret;
+ if (nr_vaddr > NR_VADDR_ON_STACK) {
+ vaddr_ptrs.addr = cpu_op_alloc_vaddr_vector(nr_vaddr);
+ if (!vaddr_ptrs.addr) {
+ ret = -ENOMEM;
+ goto end;
+ }
+ vaddr_ptrs.is_kmalloc = true;
+ }
+again:
+ ret = cpu_opv_pin_pages(cpuopv, cpuopcnt, &vaddr_ptrs);
+ if (ret)
+ goto end;
+ ret = do_cpu_opv(cpuopv, cpuopcnt, &vaddr_ptrs, cpu);
+ if (ret == -EAGAIN)
+ retry = true;
+end:
+ for (i = 0; i < vaddr_ptrs.nr_vaddr; i++) {
+ struct vaddr *vaddr = &vaddr_ptrs.addr[i];
+ int j;
+
+ vm_unmap_ram((void *)vaddr->mem, vaddr->nr_pages);
+ for (j = 0; j < vaddr->nr_pages; j++) {
+ if (vaddr->write)
+ set_page_dirty(vaddr->pages[j]);
+ put_page(vaddr->pages[j]);
+ }
+ }
+ if (retry) {
+ retry = false;
+ vaddr_ptrs.nr_vaddr = 0;
+ goto again;
+ }
+ if (vaddr_ptrs.is_kmalloc)
+ kfree(vaddr_ptrs.addr);
+ return ret;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bfa1ee1bf669..59e622296dc3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -262,3 +262,4 @@ cond_syscall(sys_pkey_free);
/* restartable sequence */
cond_syscall(sys_rseq);
+cond_syscall(sys_cpu_opv);
--
2.11.0
From: Boqun Feng <[email protected]>
Call the rseq_handle_notify_resume() function on return to userspace if
the TIF_NOTIFY_RESUME thread flag is set.
Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: [email protected]
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/kernel/signal.c | 3 +++
2 files changed, 4 insertions(+)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 73ce5dd07642..90700b6918ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+ select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
+ rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+ rseq_handle_notify_resume(regs);
}
user_enter();
--
2.11.0
Implement push_task_to_cpu(), which moves the task received as argument
to the destination cpu's runqueue. It only does so if the CPU is within
the CPU allowed mask of the task and if the CPU is active. If the CPU is
not part of the allowed mask, -EINVAL is returned. If the CPU is not
active, -EBUSY is returned.
It does not change the CPU allowed mask, and can therefore be used
within applications which rely on owning the sched_setaffinity() state.
It does not pin the task to the destination CPU, which means that the
scheduler may choose to move the task away from that CPU before the
task executes. Code invoking push_task_to_cpu() must be prepared to
retry in that case.
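The retry contract described above can be sketched in user-space C. This
is only a model under stated assumptions: fake_push_task_to_cpu(),
fake_cpu, fake_steal_count and the FAKE_NR_* limits are hypothetical
stand-ins for the scheduler, not kernel API. A real in-kernel caller
would also re-check the current CPU with preemption disabled.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical scheduler model: none of these names are kernel API. */
#define FAKE_NR_ALLOWED	4	/* CPUs 0..3 are in the allowed mask */
#define FAKE_NR_ACTIVE	2	/* CPUs 0..1 are active */

static int fake_cpu;		/* CPU the modeled task currently runs on */
static int fake_steal_count;	/* times the scheduler moves the task away */

/* Models the documented return values: 0, -EINVAL or -EBUSY. */
static int fake_push_task_to_cpu(int dest_cpu)
{
	if (dest_cpu < 0 || dest_cpu >= FAKE_NR_ALLOWED)
		return -EINVAL;		/* not in the allowed mask */
	if (dest_cpu >= FAKE_NR_ACTIVE)
		return -EBUSY;		/* CPU is not active */
	fake_cpu = dest_cpu;
	/* The task is not pinned: the scheduler may move it away again. */
	if (fake_steal_count > 0) {
		fake_steal_count--;
		fake_cpu = (dest_cpu + 1) % FAKE_NR_ACTIVE;
	}
	return 0;
}

/*
 * Push, then confirm that the task really is on the requested CPU.
 * Returns the number of attempts on success, or a negative errno.
 */
static int model_run_on_cpu(int cpu)
{
	int attempts = 0;

	for (;;) {
		int ret;

		attempts++;
		ret = fake_push_task_to_cpu(cpu);
		if (ret)
			return ret;	/* -EINVAL or -EBUSY: do not retry */
		if (fake_cpu == cpu)
			return attempts;
		/* Moved away before we could run: retry. */
	}
}
```

Only a successful push followed by a migration is retried; -EINVAL and
-EBUSY are handed back to the caller, which decides how to proceed.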
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Michael Kerrisk <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
Change since v1:
- Return -EBUSY if CPU is not active.
---
kernel/sched/core.c | 42 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 9 +++++++++
2 files changed, 51 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 771caa7e95c6..ef7f5eb5d56e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1062,6 +1062,48 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
set_curr_task(rq, p);
}
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+ int ret = 0;
+
+ rq = task_rq_lock(p, &rf);
+ update_rq_clock(rq);
+
+ if (!cpumask_test_cpu(dest_cpu, &p->cpus_allowed)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (!cpumask_test_cpu(dest_cpu, cpu_active_mask)) {
+ ret = -EBUSY;
+ goto out;
+ }
+
+ if (task_cpu(p) == dest_cpu)
+ goto out;
+
+ if (task_running(rq, p) || p->state == TASK_WAKING) {
+ struct migration_arg arg = { p, dest_cpu };
+ /* Need help from migration thread: drop lock and wait. */
+ task_rq_unlock(rq, p, &rf);
+ stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);
+ tlb_migrate_finish(p->mm);
+ return 0;
+ } else if (task_on_rq_queued(p)) {
+ /*
+ * OK, since we're going to drop the lock immediately
+ * afterwards anyway.
+ */
+ rq = move_queued_task(rq, &rf, p, dest_cpu);
+ }
+out:
+ task_rq_unlock(rq, p, &rf);
+
+ return ret;
+}
+
/*
* Change a given task's CPU affinity. Migrate the thread to a
* proper CPU and schedule it away if the CPU it's executing on
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 66b070444a7e..4aaf70d54afc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1252,6 +1252,15 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
rseq_migrate(p);
}
+#ifdef CONFIG_SMP
+int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu);
+#else
+static inline int push_task_to_cpu(struct task_struct *p, unsigned int dest_cpu)
+{
+ return 0;
+}
+#endif
+
/*
* Tunables that become constants when CONFIG_SCHED_DEBUG is off:
*/
--
2.11.0
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Michael Kerrisk <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index b76cbd25854f..5772b343f7d5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -392,3 +392,4 @@
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
385 i386 rseq sys_rseq
+386 i386 cpu_opv sys_cpu_opv
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 3ad03495bbb9..ab5d1f9f9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -340,6 +340,7 @@
331 common pkey_free sys_pkey_free
332 common statx sys_statx
333 common rseq sys_rseq
+334 common cpu_opv sys_cpu_opv
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.11.0
Expose a new system call allowing each thread to register one userspace
memory area to be used as an ABI between kernel and user-space for two
purposes: user-space restartable sequences and quick access to read the
current CPU number value from user-space.
* Restartable sequences (per-cpu atomics)
Restartable sequences allow user-space to perform update operations on
per-cpu data without requiring heavy-weight atomic operations.
The restartable critical sections (percpu atomics) work was started
by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
critical sections. [1] [2] The re-implementation proposed here brings a
few simplifications to the ABI which facilitate porting to other
architectures and speed up the user-space fast path. A second system
call, cpu_opv(), is proposed as fallback to deal with debugger
single-stepping. cpu_opv() executes a sequence of operations on behalf
of user-space with preemption disabled.
Here are benchmarks of various rseq use-cases.
Test hardware:
arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
x86-64: Intel E5-2630 [email protected], 16-core, hyperthreading
The following benchmarks were all performed on a single thread.
* Per-CPU statistic counter increment
                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                344.0                 31.4          11.0
x86-64:                15.3                  2.0           7.7
* LTTng-UST: write event 32-bit header, 32-bit payload into tracer
per-cpu buffer
                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:               2502.0               2250.0           1.1
x86-64:               117.4                 98.0           1.2
* liburcu percpu: lock-unlock pair, dereference, read/compare word
                getcpu+atomic (ns/op)    rseq (ns/op)    speedup
arm32:                751.0                128.5           5.8
x86-64:                53.4                 28.6           1.9
* jemalloc memory allocator adapted to use rseq
Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
the 2016 rseq implementation):
On the production workload, average response-time latency improves by
1-2%, and overall P99 latency drops by 2-3%.
* Reading the current CPU number
Speeding up reading the current CPU number on which the caller thread is
running is done by keeping the current CPU number up to date within the
cpu_id field of the memory area registered by the thread. This is done
by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within the registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
Keeping the current cpu id in a memory area shared between kernel and
user-space has the following benefits over the mechanisms currently
available to read the current CPU number:
- 35x speedup on ARM vs system call through glibc,
- 20x speedup on x86 compared to calling glibc, which calls vdso
executing a "lsl" instruction,
- 14x speedup on x86 compared to inlined "lsl" instruction,
- Unlike vdso approaches, this cpu_id value can be read from an inline
assembly, which makes it a useful building block for restartable
sequences.
- The approach of reading the cpu id through memory mapping shared
between kernel and user-space is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 4 (v7l)
Machine model: Cubietruck
- Baseline (empty loop): 8.4 ns
- Read CPU from rseq cpu_id: 16.7 ns
- Read CPU from rseq cpu_id (lazy register): 19.8 ns
- glibc 2.19-0ubuntu6.6 getcpu: 301.8 ns
- getcpu system call: 234.9 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from rseq cpu_id: 0.8 ns
- Read CPU from rseq cpu_id (lazy register): 0.8 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.6 ns
- getcpu system call: 53.9 ns
- Speed (benchmark taken on v8 of patchset)
Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
expectations, that enabling CONFIG_RSEQ slightly accelerates the
scheduler:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
restartable sequences series applied.
* CONFIG_RSEQ=n
avg.: 41.37 s
std.dev.: 0.36 s
* CONFIG_RSEQ=y
avg.: 40.46 s
std.dev.: 0.33 s
- Size
On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
567 bytes, and the data size increase of vmlinux is 5696 bytes.
On x86-64, between CONFIG_CPU_OPV=n/y, the text size increase of vmlinux is
5576 bytes, and the data size increase of vmlinux is 6164 bytes.
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Michael Kerrisk <[email protected]>
CC: Boqun Feng <[email protected]>
CC: Alexander Viro <[email protected]>
CC: [email protected]
---
Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
sizeof(int32_t).
- Update man page to describe the pointer alignment requirements and
update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.
Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
defining this enumeration.
- Split resume notifier architecture implementation from the system call
wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
set the current cpu cache pointer before doing the cache update, and
set it back to NULL if the update fails. Setting it back to NULL on
error ensures that no resume notifier will trigger a SIGSEGV if a
migration happened concurrently.
Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.
Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
to change log.
Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", which allows extending
this system call to cover future features such as restartable critical
sections. Generalizing this system call ensures that we can add
features similar to the cpu_id field within the same cache-line
without having to track one pointer per feature within the task
struct.
- Add a tlabi_nr parameter to the system call, which allows extending
the ABI beyond the initial 64-byte structure by registering structures
with tlabi_nr greater than 0. The initial ABI structure is associated
with tlabi_nr 0.
- Rebased on kernel v4.5.
Changes since v6:
- Integrate "restartable sequences" v2 patchset from Paul Turner.
- Add handling of single-stepping purely in user-space, with a
fallback to locking after 2 rseq failures to ensure progress, and
by exposing a __rseq_table section to debuggers so they know where
to put breakpoints when dealing with rseq assembly blocks which
can be aborted at any point.
- make the code and ABI generic: porting the kernel implementation
simply requires wiring up the signal handler and return to user-space
hooks, and allocating the syscall number.
- extend testing with a fully configurable test program. See
param_spinlock_test -h for details.
- handling of rseq ENOSYS in user-space, also with a fallback
to locking.
- modify Paul Turner's rseq ABI to only require a single TLS store on
the user-space fast-path, removing the need to populate two additional
registers. This is made possible by introducing struct rseq_cs into
the ABI to describe a critical section start_ip, post_commit_ip, and
abort_ip.
- Rebased on kernel v4.7-rc7.
Changes since v7:
- Documentation updates.
- Integrated powerpc architecture support.
- Compare rseq critical section start_ip, which allows shrinking the
user-space fast-path code size.
- Added Peter Zijlstra, Paul E. McKenney and Boqun Feng as
co-maintainers.
- Added do_rseq2 and do_rseq_memcpy to test program helper library.
- Code cleanup based on review from Peter Zijlstra, Andy Lutomirski and
Boqun Feng.
- Rebase on kernel v4.8-rc2.
Changes since v8:
- clear rseq_cs even if non-nested. Speeds up user-space fast path by
removing the final "rseq_cs=NULL" assignment.
- add enum rseq_flags: critical sections and threads can set migration,
preemption and signal "disable" flags to inhibit rseq behavior.
- rseq_event_counter needs to be updated with a pre-increment: Otherwise
misses an increment after exec (when TLS and in-kernel states are
initially 0).
Changes since v9:
- Update changelog.
- Fold instrumentation patch.
- check abort-ip signature: Add a signature before the abort-ip landing
address. This signature is also received as a new parameter to the
rseq system call. The kernel uses it to ensure that rseq cannot be
used as an exploit vector to redirect execution to arbitrary code.
- Use rseq pointer for both register and unregister. This is more
symmetric, and eventually allows supporting a linked list of rseq
structs per thread if needed in the future.
- Unregistration of a rseq structure is now done with
RSEQ_FLAG_UNREGISTER.
- Remove reference counting. Return "EBUSY" to the caller if rseq is
already registered for the current thread. This simplifies
implementation while still allowing user-space to perform lazy
registration in multi-lib use-cases. (suggested by Ben Maurer)
- Clear rseq_cs upon unregister.
- Set cpu_id back to -1 on unregister, so if rseq user libraries follow
an unregister, and they expect to lazily register rseq, they can do
so.
- Document rseq_cs clear requirement: JIT should reset the rseq_cs
pointer before reclaiming memory of rseq_cs structure.
- Introduce rseq_len syscall parameter, rseq_cs version field:
Allow keeping track of the registered rseq struct length, for future
extensions. Add rseq_cs version as first field. Will allow future
extensions.
- Use offset and unsigned arithmetic to save a branch: Save a
conditional branch when comparing instruction pointer against a
rseq_cs descriptor's address range by having post_commit_ip as an
offset from start_ip, and using unsigned integer comparison.
Suggested by Ben Maurer.
- Remove event counter from ABI. Suggested by Andy Lutomirski.
- Add INIT_ONSTACK macro: Introduce the
RSEQ_FIELD_u32_u64_INIT_ONSTACK() macros to ensure that users
correctly initialize the upper bits of RSEQ_FIELD_u32_u64() on their
stack to 0 on 32-bit architectures.
- Select MEMBARRIER: Allows user-space rseq fast-paths to use the value
of cpu_id field (inherently required by the rseq algorithm) to figure
out whether membarrier can be expected to be available.
This effectively allows user-space fast-paths to remove extra
comparisons and branch testing whether membarrier is enabled, and thus
whether a full barrier is required (e.g. in userspace RCU
implementation after rcu_read_lock/before rcu_read_unlock).
- Expose cpu_id_start field: Checking whether (cpu_id < 0) in the C
preparation part of the rseq fast-path brings significant overhead at
least on arm32. We can remove this extra comparison by exposing two
distinct cpu_id fields in the rseq TLS:
The field cpu_id_start always contains a *possible* cpu number, although
it may not be the current one if, for instance, rseq is not initialized
for the current thread. cpu_id_start is meant to be used in the C code
for the pointer chasing to figure out which per-cpu data structure
should be passed to the rseq asm sequence.
A cpu_id field value of -1 means rseq is not initialized, and -2 means
initialization failed. That field is used in the rseq asm sequence to
confirm that the cpu_id_start value was indeed the current cpu number.
It also ends up confirming that rseq is initialized for the current
thread, because values -1 and -2 will never match the cpu_id_start
possible cpu number values.
This allows checking the current CPU number and rseq initialization
state with a single comparison on the fast-path.
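The cpu_id_start/cpu_id scheme from the last item above can be modeled
in plain C. This is a sketch, not code from this series: struct
rseq_model, struct percpu_count_model and MODEL_NR_POSSIBLE_CPUS are
hypothetical names. In the real fast path, the comparison and the store
run inside the rseq assembly block so that the kernel can abort them;
this model has no such protection.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_POSSIBLE_CPUS	4	/* hypothetical possible-CPU count */

/* Hypothetical mirror of the two CPU fields; not the uapi struct rseq. */
struct rseq_model {
	uint32_t cpu_id_start;	/* always a possible CPU number */
	uint32_t cpu_id;	/* current CPU, (uint32_t)-1 or (uint32_t)-2 */
};

struct percpu_count_model {
	long count[MODEL_NR_POSSIBLE_CPUS];
};

/*
 * Returns 0 if the increment was applied, -1 if the caller must take a
 * slow path (rseq not registered, or cpu_id_start was stale).
 */
static int model_percpu_inc(const struct rseq_model *rs,
			    struct percpu_count_model *c)
{
	/* C preparation: cpu_id_start can index per-cpu data unchecked. */
	uint32_t cpu = rs->cpu_id_start;
	long *slot;

	assert(cpu < MODEL_NR_POSSIBLE_CPUS);
	slot = &c->count[cpu];

	/*
	 * Single comparison: it fails for a stale CPU number and for
	 * the -1/-2 markers, which never equal a possible CPU number.
	 */
	if (rs->cpu_id != cpu)
		return -1;
	(*slot)++;
	return 0;
}
```

An unregistered thread (cpu_id == -1) therefore falls through to the
slow path without a separate "is rseq initialized" branch.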
Changes since v10:
- Update rseq.c comment, removing reference to event_counter.
Changes since v11:
- Replace task struct rseq_preempt, rseq_signal, and rseq_migrate
bool by u32 rseq_event_mask.
- Add missing sys_rseq() asmlinkage declaration to
include/linux/syscalls.h.
- Copy event mask on process fork, set to 0 on exec and thread-fork.
- Cleanups based on review from Peter Zijlstra.
- Cleanups based on review from Thomas Gleixner.
- Fix: rseq_cs needs to be cleared only when:
- Nested over non-critical-section userspace code,
- Nested over rseq_cs _and_ handling abort.
Basically, we should never clear rseq_cs when the rseq resume to
userspace handler is called and it is not handling abort: the
problematic case is if any of the __get_user()/__put_user done
by the handler trigger a page fault (e.g. page protection
done by NUMA page migration work), which triggers preemption:
the next call to the rseq resume to userspace handler needs to
perform the abort.
- Perform rseq event mask updates atomically wrt preemption,
- Move rseq_migrate to __set_task_cpu(), thus catching migration
scenarios that bypass set_task_cpu(): fork and wake_up_new_task.
- Merge content of rseq_sched_out into rseq_preempt. There is no
need to have two hook sites. Both setting the rseq event mask
preempt bit and setting the notify resume thread flag can be
done from rseq_preempt().
- Issue rseq_preempt() from fork(), thus ensuring that we handle
abort if needed.
Man page associated:
RSEQ(2) Linux Programmer's Manual RSEQ(2)
NAME
rseq - Restartable sequences and cpu number cache
SYNOPSIS
#include <linux/rseq.h>
int rseq(struct rseq * rseq, uint32_t rseq_len, int flags, uint32_t sig);
DESCRIPTION
The rseq() ABI accelerates user-space operations on per-cpu
data by defining a shared data structure ABI between each
user-space thread and the kernel.
It allows user-space to perform update operations on per-cpu
data without requiring heavy-weight atomic operations.
Restartable sequences are atomic with respect to preemption
(making them atomic with respect to other threads running on the
same CPU), as well as signal delivery (user-space execution
contexts nested over the same thread).
It is suited for update operations on per-cpu data.
It can be used on data structures shared between threads within
a process, and on data structures shared between threads across
different processes.
Some examples of operations that can be accelerated or improved
by this ABI:
· Memory allocator per-cpu free-lists,
· Querying the current CPU number,
· Incrementing per-CPU counters,
· Modifying data protected by per-CPU spinlocks,
· Inserting/removing elements in per-CPU linked-lists,
· Writing/reading per-CPU ring buffers content.
· Accurately reading performance monitoring unit counters with
respect to thread migration.
The rseq argument is a pointer to the thread-local rseq structure
to be shared between kernel and user-space. A NULL rseq
value unregisters the current thread rseq structure.
The layout of struct rseq is as follows:
Structure alignment
This structure is aligned on multiples of 32 bytes.
Structure size
This structure is extensible. Its size is passed as
parameter to the rseq system call.
Fields
cpu_id_start
Optimistic cache of the CPU number on which the current
thread is running. Its value is guaranteed to always be
a possible CPU number, even when rseq is not initialized.
The value it contains should always be confirmed
by reading the cpu_id field.
cpu_id
Cache of the CPU number on which the current thread is
running. -1 if uninitialized.
rseq_cs
The rseq_cs field is a pointer to a struct rseq_cs. It
is NULL when no rseq assembly block critical section is
active for the current thread. Setting it to point to a
critical section descriptor (struct rseq_cs) marks the
beginning of the critical section.
flags
Flags indicating the restart behavior for the current
thread. This is mainly used for debugging purposes. Can
be either:
· RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
· RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
· RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
The layout of struct rseq_cs version 0 is as follows:
Structure alignment
This structure is aligned on multiples of 32 bytes.
Structure size
This structure has a fixed size of 32 bytes.
Fields
version
Version of this structure.
flags
Flags indicating the restart behavior of this structure.
Can be either:
· RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
· RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
· RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
start_ip
Instruction pointer address of the first instruction of
the sequence of consecutive assembly instructions.
post_commit_offset
Offset (from start_ip address) of the address after the
last instruction of the sequence of consecutive assembly
instructions.
abort_ip
Instruction pointer address where to move the execution
flow in case of abort of the sequence of consecutive
assembly instructions.
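As an illustration of how start_ip, post_commit_offset and abort_ip fit
together, the range check can be written with a single unsigned
comparison. This is a sketch under stated assumptions: struct
rseq_cs_model and the model_* helpers are hypothetical names, and only
the three address fields are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of a critical section descriptor. */
struct rseq_cs_model {
	uint64_t start_ip;		/* first instruction of the sequence */
	uint64_t post_commit_offset;	/* length, relative to start_ip */
	uint64_t abort_ip;		/* where to resume on abort */
};

/*
 * True if ip lies in [start_ip, start_ip + post_commit_offset).
 * When ip < start_ip, the unsigned subtraction wraps to a huge value,
 * so one comparison covers both the lower and the upper bound.
 */
static bool model_ip_in_cs(uint64_t ip, const struct rseq_cs_model *cs)
{
	assert(cs != NULL);
	return ip - cs->start_ip < cs->post_commit_offset;
}

/* Address at which execution resumes after an abort-triggering event. */
static uint64_t model_resume_ip(uint64_t ip, const struct rseq_cs_model *cs)
{
	return model_ip_in_cs(ip, cs) ? cs->abort_ip : ip;
}
```

Storing the end of the sequence as an offset rather than an absolute
address is what makes the single-comparison form possible.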
The rseq_len argument is the size of the struct rseq to
register.
The flags argument is 0 for registration, and
RSEQ_FLAG_UNREGISTER for unregistration.
The sig argument is the 32-bit signature to be expected before
the abort handler code.
A single library per process should keep the rseq structure in
a thread-local storage variable. The cpu_id field should be
initialized to -1, and the cpu_id_start field should be
initialized to a possible CPU value (typically 0).
Each thread is responsible for registering and unregistering
its rseq structure. No more than one rseq structure address can
be registered per thread at a given time.
In a typical usage scenario, the thread registering the rseq
structure will be performing loads and stores from/to that
structure. It is however also allowed to read that structure
from other threads. The rseq field updates performed by the
kernel provide relaxed atomicity semantics, which guarantee
that other threads performing relaxed atomic reads of the cpu
number cache will always observe a consistent value.
RETURN VALUE
A return value of 0 indicates success. On error, -1 is
returned, and errno is set appropriately.
ERRORS
EINVAL Either flags contains an invalid value, or rseq contains
an address which is not appropriately aligned, or
rseq_len contains a size that does not match the size
received on registration.
ENOSYS The rseq() system call is not implemented by this
kernel.
EFAULT rseq is an invalid address.
EBUSY Restartable sequence is already registered for this
thread.
EPERM The sig argument on unregistration does not match the
signature received on registration.
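The error conditions listed above can be summarized as a small
user-space model of the per-thread registration state. This is a sketch
of one possible reading of the ERRORS section, not the kernel
implementation: struct thread_model, model_rseq() and the MODEL_*
constants are hypothetical, and the handling of address 0 and of
unregistering when nothing is registered are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FLAG_UNREGISTER	1	/* stands in for RSEQ_FLAG_UNREGISTER */
#define MODEL_RSEQ_ALIGN	32	/* "aligned on multiples of 32 bytes" */

/* Hypothetical per-thread registration state. */
struct thread_model {
	uintptr_t rseq;		/* 0 when nothing is registered */
	uint32_t rseq_len;
	uint32_t sig;
};

/* Returns 0 on success or a negative errno value. */
static int model_rseq(struct thread_model *t, uintptr_t rseq,
		      uint32_t rseq_len, int flags, uint32_t sig)
{
	assert(t != NULL);
	if (flags & ~MODEL_FLAG_UNREGISTER)
		return -EINVAL;		/* invalid flags */
	if (flags & MODEL_FLAG_UNREGISTER) {
		/* Assumption: address and length must match registration. */
		if (!t->rseq || t->rseq != rseq || t->rseq_len != rseq_len)
			return -EINVAL;
		if (t->sig != sig)
			return -EPERM;	/* signature mismatch */
		t->rseq = 0;
		t->rseq_len = 0;
		t->sig = 0;
		return 0;
	}
	if (t->rseq)
		return -EBUSY;		/* already registered */
	if (!rseq)
		return -EFAULT;		/* assumption: 0 is an invalid address */
	if (rseq % MODEL_RSEQ_ALIGN)
		return -EINVAL;		/* misaligned structure */
	t->rseq = rseq;
	t->rseq_len = rseq_len;
	t->sig = sig;
	return 0;
}
```

A library performing lazy registration can treat -EBUSY as "another
library already registered rseq for this thread" and carry on.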
VERSIONS
The rseq() system call was added in Linux 4.X (TODO).
CONFORMING TO
rseq() is Linux-specific.
SEE ALSO
sched_getcpu(3)
Linux 2017-11-06 RSEQ(2)
---
MAINTAINERS | 11 ++
arch/Kconfig | 7 +
fs/exec.c | 1 +
include/linux/sched.h | 118 +++++++++++++++
include/linux/syscalls.h | 3 +
include/trace/events/rseq.h | 56 +++++++
include/uapi/linux/rseq.h | 150 +++++++++++++++++++
init/Kconfig | 14 ++
kernel/Makefile | 1 +
kernel/fork.c | 2 +
kernel/rseq.c | 358 ++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/sched.h | 1 +
kernel/sys_ni.c | 3 +
14 files changed, 726 insertions(+)
create mode 100644 include/trace/events/rseq.h
create mode 100644 include/uapi/linux/rseq.h
create mode 100644 kernel/rseq.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 73c0cdabf755..789463978181 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -11812,6 +11812,17 @@ F: include/dt-bindings/reset/
F: include/linux/reset.h
F: include/linux/reset-controller.h
+RESTARTABLE SEQUENCES SUPPORT
+M: Mathieu Desnoyers <[email protected]>
+M: Peter Zijlstra <[email protected]>
+M: "Paul E. McKenney" <[email protected]>
+M: Boqun Feng <[email protected]>
+L: [email protected]
+S: Supported
+F: kernel/rseq.c
+F: include/uapi/linux/rseq.h
+F: include/trace/events/rseq.h
+
RFKILL
M: Johannes Berg <[email protected]>
L: [email protected]
diff --git a/arch/Kconfig b/arch/Kconfig
index 76c0b54443b1..b9b252b1e97a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -272,6 +272,13 @@ config HAVE_REGS_AND_STACK_ACCESS_API
declared in asm/ptrace.h
For example the kprobes-based event tracer needs this API.
+config HAVE_RSEQ
+ bool
+ depends on HAVE_REGS_AND_STACK_ACCESS_API
+ help
+ This symbol should be selected by an architecture if it
+ supports an implementation of restartable sequences.
+
config HAVE_CLK
bool
help
diff --git a/fs/exec.c b/fs/exec.c
index 7eb8d21bcab9..3eb74db04ee7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1807,6 +1807,7 @@ static int do_execveat_common(int fd, struct filename *filename,
current->fs->in_exec = 0;
current->in_execve = 0;
membarrier_execve(current);
+ rseq_execve(current);
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a902e..708d8e9e0821 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -27,6 +27,7 @@
#include <linux/signal_types.h>
#include <linux/mm_types_task.h>
#include <linux/task_io_accounting.h>
+#include <linux/rseq.h>
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -979,6 +980,17 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_RSEQ
+ struct rseq __user *rseq;
+ u32 rseq_len;
+ u32 rseq_sig;
+ /*
+ * RmW on rseq_event_mask must be performed atomically
+ * with respect to preemption.
+ */
+ unsigned long rseq_event_mask;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
struct rcu_head rcu;
@@ -1688,4 +1700,110 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
#define TASK_SIZE_OF(tsk) TASK_SIZE
#endif
+#ifdef CONFIG_RSEQ
+
+/*
+ * Map the event mask on the user-space ABI enum rseq_cs_flags
+ * for direct mask checks.
+ */
+enum rseq_event_mask_bits {
+ RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
+ RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
+ RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
+};
+
+enum rseq_event_mask {
+ RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
+ RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
+ RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
+};
+
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+ if (t->rseq)
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs);
+
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ if (current->rseq)
+ __rseq_handle_notify_resume(regs);
+}
+
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+ set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
+ rseq_handle_notify_resume(regs);
+}
+
+static inline void rseq_preempt(struct task_struct *t)
+{
+ set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
+ rseq_set_notify_resume(t);
+}
+
+static inline void rseq_migrate(struct task_struct *t)
+{
+ set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
+ rseq_set_notify_resume(t);
+}
+
+/*
+ * If the parent process has a registered restartable sequences area, the
+ * child inherits it. Only applies when forking a process, not a thread. If
+ * the parent calls fork() in the middle of a restartable sequence, set the
+ * resume notifier to force the child to retry.
+ */
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+ if (clone_flags & CLONE_THREAD) {
+ t->rseq = NULL;
+ t->rseq_len = 0;
+ t->rseq_sig = 0;
+ t->rseq_event_mask = 0;
+ } else {
+ t->rseq = current->rseq;
+ t->rseq_len = current->rseq_len;
+ t->rseq_sig = current->rseq_sig;
+ t->rseq_event_mask = current->rseq_event_mask;
+ rseq_preempt(t);
+ }
+}
+
+static inline void rseq_execve(struct task_struct *t)
+{
+ t->rseq = NULL;
+ t->rseq_len = 0;
+ t->rseq_sig = 0;
+ t->rseq_event_mask = 0;
+}
+
+#else
+
+static inline void rseq_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+{
+}
+static inline void rseq_signal_deliver(struct pt_regs *regs)
+{
+}
+static inline void rseq_preempt(struct task_struct *t)
+{
+}
+static inline void rseq_migrate(struct task_struct *t)
+{
+}
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
+{
+}
+static inline void rseq_execve(struct task_struct *t)
+{
+}
+
+#endif
+
#endif
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d826d7..340650b4ec54 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -66,6 +66,7 @@ struct old_linux_dirent;
struct perf_event_attr;
struct file_handle;
struct sigaltstack;
+struct rseq;
union bpf_attr;
#include <linux/types.h>
@@ -940,5 +941,7 @@ asmlinkage long sys_pkey_alloc(unsigned long flags, unsigned long init_val);
asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);
+asmlinkage long sys_rseq(struct rseq __user *rseq, uint32_t rseq_len,
+ int flags, uint32_t sig);
#endif
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
new file mode 100644
index 000000000000..c4609a3f5008
--- /dev/null
+++ b/include/trace/events/rseq.h
@@ -0,0 +1,56 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM rseq
+
+#if !defined(_TRACE_RSEQ_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_RSEQ_H
+
+#include <linux/tracepoint.h>
+#include <linux/types.h>
+
+TRACE_EVENT(rseq_update,
+
+ TP_PROTO(struct task_struct *t),
+
+ TP_ARGS(t),
+
+ TP_STRUCT__entry(
+ __field(s32, cpu_id)
+ ),
+
+ TP_fast_assign(
+ __entry->cpu_id = raw_smp_processor_id();
+ ),
+
+ TP_printk("cpu_id=%d", __entry->cpu_id)
+);
+
+TRACE_EVENT(rseq_ip_fixup,
+
+ TP_PROTO(unsigned long regs_ip, unsigned long start_ip,
+ unsigned long post_commit_offset, unsigned long abort_ip),
+
+ TP_ARGS(regs_ip, start_ip, post_commit_offset, abort_ip),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, regs_ip)
+ __field(unsigned long, start_ip)
+ __field(unsigned long, post_commit_offset)
+ __field(unsigned long, abort_ip)
+ ),
+
+ TP_fast_assign(
+ __entry->regs_ip = regs_ip;
+ __entry->start_ip = start_ip;
+ __entry->post_commit_offset = post_commit_offset;
+ __entry->abort_ip = abort_ip;
+ ),
+
+ TP_printk("regs_ip=0x%lx start_ip=0x%lx post_commit_offset=%lu abort_ip=0x%lx",
+ __entry->regs_ip, __entry->start_ip,
+ __entry->post_commit_offset, __entry->abort_ip)
+);
+
+#endif /* _TRACE_RSEQ_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
new file mode 100644
index 000000000000..3895ec940059
--- /dev/null
+++ b/include/uapi/linux/rseq.h
@@ -0,0 +1,150 @@
+#ifndef _UAPI_LINUX_RSEQ_H
+#define _UAPI_LINUX_RSEQ_H
+
+/*
+ * linux/rseq.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2016 Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else
+# include <stdint.h>
+#endif
+
+#include <linux/types_32_64.h>
+
+enum rseq_cpu_id_state {
+ RSEQ_CPU_ID_UNINITIALIZED = -1,
+ RSEQ_CPU_ID_REGISTRATION_FAILED = -2,
+};
+
+enum rseq_flags {
+ RSEQ_FLAG_UNREGISTER = (1 << 0),
+};
+
+enum rseq_cs_flags_bit {
+ RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0,
+ RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1,
+ RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2,
+};
+
+enum rseq_cs_flags {
+ RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT =
+ (1U << RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT),
+ RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL =
+ (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
+ RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE =
+ (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+};
+
+/*
+ * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line. It is usually declared as
+ * link-time constant data.
+ */
+struct rseq_cs {
+ /* Version of this structure. */
+ uint32_t version;
+ /* enum rseq_cs_flags */
+ uint32_t flags;
+ LINUX_FIELD_u32_u64(start_ip);
+ /* Offset from start_ip. */
+ LINUX_FIELD_u32_u64(post_commit_offset);
+ LINUX_FIELD_u32_u64(abort_ip);
+} __attribute__((aligned(4 * sizeof(uint64_t))));
+
+/*
+ * struct rseq is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line.
+ *
+ * A single struct rseq per thread is allowed.
+ */
+struct rseq {
+ /*
+ * Restartable sequences cpu_id_start field. Updated by the
+ * kernel, and read by user-space with single-copy atomicity
+ * semantics. Aligned on 32-bit. Always contains a value in the
+ * range of possible CPUs, although the value may not be the
+ * actual current CPU (e.g. if rseq is not initialized). This
+ * CPU number value should always be compared against the value
+ * of the cpu_id field before performing a rseq commit or
+ * returning a value read from a data structure indexed using
+ * the cpu_id_start value.
+ */
+ uint32_t cpu_id_start;
+ /*
+ * Restartable sequences cpu_id field. Updated by the kernel,
+ * and read by user-space with single-copy atomicity semantics.
+ * Aligned on 32-bit. Values RSEQ_CPU_ID_UNINITIALIZED and
+ * RSEQ_CPU_ID_REGISTRATION_FAILED have a special semantic: the
+ * former means "rseq uninitialized", and the latter means "rseq
+ * initialization failed". This value is meant to be read within
+ * rseq critical sections and compared with the cpu_id_start
+ * value previously read, before performing the commit instruction,
+ * or read and compared with the cpu_id_start value before returning
+ * a value loaded from a data structure indexed using the
+ * cpu_id_start value.
+ */
+ uint32_t cpu_id;
+ /*
+ * Restartable sequences rseq_cs field.
+ *
+ * Contains NULL when no critical section is active for the current
+ * thread, or holds a pointer to the currently active struct rseq_cs.
+ *
+ * Updated by user-space, which sets the address of the currently
+ * active rseq_cs at the beginning of assembly instruction sequence
+ * block, and set to NULL by the kernel when it restarts an assembly
+ * instruction sequence block, as well as when the kernel detects that
+ * it is preempting or delivering a signal outside of the range
+ * targeted by the rseq_cs. Also needs to be set to NULL by user-space
+ * before reclaiming memory that contains the targeted struct rseq_cs.
+ *
+ * Read and set by the kernel with single-copy atomicity semantics.
+ * Set by user-space with single-copy atomicity semantics. Aligned
+ * on 64-bit.
+ */
+ LINUX_FIELD_u32_u64(rseq_cs);
+ /*
+ * - RSEQ_DISABLE flag:
+ *
+ * Fallback fast-track flag for single-stepping.
+ * Set by user-space if lack of progress is detected.
+ * Cleared by user-space after rseq finish.
+ * Read by the kernel.
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * Inhibit instruction sequence block restart and event
+ * counter increment on preemption for this thread.
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * Inhibit instruction sequence block restart and event
+ * counter increment on signal delivery for this thread.
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+ * Inhibit instruction sequence block restart and event
+ * counter increment on migration for this thread.
+ */
+ uint32_t flags;
+} __attribute__((aligned(4 * sizeof(uint64_t))));
+
+#endif /* _UAPI_LINUX_RSEQ_H */
diff --git a/init/Kconfig b/init/Kconfig
index e37f4b2a6445..9610d3def25c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1418,6 +1418,20 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
config ARCH_HAS_MEMBARRIER_SYNC_CORE
bool
+config RSEQ
+ bool "Enable rseq() system call" if EXPERT
+ default y
+ depends on HAVE_RSEQ
+ select MEMBARRIER
+ help
+ Enable the restartable sequences system call. It provides a
+ user-space cache for the current CPU number value, which
+ speeds up getting the current CPU number from user-space,
+ as well as an ABI to speed up user-space operations on
+ per-CPU data.
+
+ If unsure, say Y.
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index f85ae5dfa474..7085c841c413 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -113,6 +113,7 @@ obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_RSEQ) += rseq.o
$(obj)/configs.o: $(obj)/config_data.h
diff --git a/kernel/fork.c b/kernel/fork.c
index e5d9d405ae4e..3970526f7b45 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1898,6 +1898,8 @@ static __latent_entropy struct task_struct *copy_process(
*/
copy_seccomp(p);
+ rseq_fork(p, clone_flags);
+
/*
* Process group and session signals need to be delivered to just the
* parent before the fork or both the parent and the child after the
diff --git a/kernel/rseq.c b/kernel/rseq.c
new file mode 100644
index 000000000000..93f3f169e112
--- /dev/null
+++ b/kernel/rseq.c
@@ -0,0 +1,358 @@
+/*
+ * Restartable sequences system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Copyright (C) 2015, Google, Inc.,
+ * Paul Turner <[email protected]> and Andrew Hunter <[email protected]>
+ * Copyright (C) 2015-2016, EfficiOS Inc.,
+ * Mathieu Desnoyers <[email protected]>
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/rseq.h>
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/rseq.h>
+
+/*
+ *
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ * init(rseq_cs)
+ * cpu = TLS->rseq::cpu_id_start
+ * [1] TLS->rseq::rseq_cs = rseq_cs
+ * [start_ip] ----------------------------
+ * [2] if (cpu != TLS->rseq::cpu_id)
+ * goto abort_ip;
+ * [3] <last_instruction_in_cs>
+ * [post_commit_ip] ----------------------------
+ *
+ * The address of jump target abort_ip must be outside the critical
+ * region, i.e.:
+ *
+ * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
+ *
+ * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ * userspace that can handle being interrupted between any of those
+ * instructions, and then resumed to the abort_ip.
+ *
+ * 1. Userspace stores the address of the struct rseq_cs assembly
+ * block descriptor into the rseq_cs field of the registered
+ * struct rseq TLS area. This update is performed through a single
+ * store within the inline assembly instruction sequence.
+ * [start_ip]
+ *
+ * 2. Userspace tests to check whether the current cpu_id field matches
+ * the cpu number loaded before start_ip, branching to abort_ip
+ * in case of a mismatch.
+ *
+ * If the sequence is preempted or interrupted by a signal
+ * at or after start_ip and before post_commit_ip, then the kernel
+ * clears TLS->rseq::rseq_cs, and sets the user-space return
+ * ip to abort_ip before returning to user-space, so the preempted
+ * execution resumes at abort_ip.
+ *
+ * 3. Userspace critical section final instruction before
+ * post_commit_ip is the commit. The critical section is
+ * self-terminating.
+ * [post_commit_ip]
+ *
+ * 4. <success>
+ *
+ * On failure at [2], or if interrupted by preempt or signal delivery
+ * between [1] and [3]:
+ *
+ * [abort_ip]
+ * F1. <failure>
+ */
+
+static int rseq_update_cpu_id(struct task_struct *t)
+{
+ uint32_t cpu_id = raw_smp_processor_id();
+
+ if (__put_user(cpu_id, &t->rseq->cpu_id_start))
+ return -EFAULT;
+ if (__put_user(cpu_id, &t->rseq->cpu_id))
+ return -EFAULT;
+ trace_rseq_update(t);
+ return 0;
+}
+
+static int rseq_reset_rseq_cpu_id(struct task_struct *t)
+{
+ uint32_t cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+
+ /*
+ * Reset cpu_id_start to its initial state (0).
+ */
+ if (__put_user(cpu_id_start, &t->rseq->cpu_id_start))
+ return -EFAULT;
+ /*
+ * Reset cpu_id to RSEQ_CPU_ID_UNINITIALIZED, so any user coming
+ * in after unregistration can figure out that rseq needs to be
+ * registered again.
+ */
+ if (__put_user(cpu_id, &t->rseq->cpu_id))
+ return -EFAULT;
+ return 0;
+}
+
+static int rseq_get_rseq_cs(struct task_struct *t,
+ unsigned long *start_ip,
+ unsigned long *post_commit_offset,
+ unsigned long *abort_ip,
+ uint32_t *cs_flags)
+{
+ struct rseq_cs __user *urseq_cs;
+ struct rseq_cs rseq_cs;
+ unsigned long ptr;
+ u32 __user *usig;
+ u32 sig;
+ int ret;
+
+ ret = __get_user(ptr, &t->rseq->rseq_cs);
+ if (ret)
+ return ret;
+ if (!ptr)
+ return 0;
+ urseq_cs = (struct rseq_cs __user *)ptr;
+ if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
+ return -EFAULT;
+ if (rseq_cs.version > 0)
+ return -EINVAL;
+
+ /* Ensure that abort_ip is not in the critical section. */
+ if (rseq_cs.abort_ip - rseq_cs.start_ip < rseq_cs.post_commit_offset)
+ return -EINVAL;
+
+ *cs_flags = rseq_cs.flags;
+ *start_ip = rseq_cs.start_ip;
+ *post_commit_offset = rseq_cs.post_commit_offset;
+ *abort_ip = rseq_cs.abort_ip;
+
+ usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
+ ret = get_user(sig, usig);
+ if (ret)
+ return ret;
+
+ if (current->rseq_sig != sig) {
+ printk_ratelimited(KERN_WARNING
+ "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+ sig, current->rseq_sig, current->pid, usig);
+ return -EPERM;
+ }
+ return 0;
+}
+
+static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
+{
+ uint32_t flags, event_mask;
+ int ret;
+
+ /* Get thread flags. */
+ ret = __get_user(flags, &t->rseq->flags);
+ if (ret)
+ return ret;
+
+ /* Take critical section flags into account. */
+ flags |= cs_flags;
+
+ /*
+ * Restart on signal can only be inhibited when restart on
+ * preempt and restart on migrate are inhibited too. Otherwise,
+ * a preempted signal handler could fail to restart the prior
+ * execution context on sigreturn.
+ */
+ if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
+ if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+ | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
+ (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+ | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
+ return -EINVAL;
+ }
+
+ /*
+ * Load and clear event mask atomically with respect to
+ * scheduler preemption.
+ */
+ preempt_disable();
+ event_mask = t->rseq_event_mask;
+ t->rseq_event_mask = 0;
+ preempt_enable();
+
+ event_mask &= ~flags;
+ if (event_mask)
+ return 1;
+ return 0;
+}
+
+static int clear_rseq_cs(struct task_struct *t)
+{
+ unsigned long ptr = 0;
+
+ /*
+ * The rseq_cs field is set to NULL on preemption or signal
+ * delivery on top of rseq assembly block, as well as on top
+ * of code outside of the rseq assembly block. This performs
+ * a lazy clear of the rseq_cs field.
+ *
+ * Set rseq_cs to NULL with single-copy atomicity.
+ */
+ return __put_user(ptr, &t->rseq->rseq_cs);
+}
+
+static int rseq_ip_fixup(struct pt_regs *regs)
+{
+ unsigned long ip = instruction_pointer(regs), start_ip = 0,
+ post_commit_offset = 0, abort_ip = 0;
+ struct task_struct *t = current;
+ uint32_t cs_flags = 0;
+ bool in_rseq_cs = false;
+ int ret;
+
+ ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
+ &cs_flags);
+ if (ret)
+ return ret;
+
+ /*
+ * Handle potentially not being within a critical section.
+ * Unsigned comparison will be true when
+ * ip >= start_ip, and when ip < start_ip + post_commit_offset.
+ */
+ if (ip - start_ip < post_commit_offset)
+ in_rseq_cs = true;
+
+ /*
+ * If not nested over a rseq critical section, restart is
+ * useless. Clear the rseq_cs pointer and return.
+ */
+ if (!in_rseq_cs)
+ return clear_rseq_cs(t);
+ ret = rseq_need_restart(t, cs_flags);
+ if (ret <= 0)
+ return ret;
+ ret = clear_rseq_cs(t);
+ if (ret)
+ return ret;
+ trace_rseq_ip_fixup(ip, start_ip, post_commit_offset, abort_ip);
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ return 0;
+}
+
+/*
+ * This resume handler must always be executed between any of:
+ * - preemption,
+ * - signal delivery,
+ * and return to user-space.
+ *
+ * This is how we can ensure that the entire rseq critical section,
+ * consisting of both the C part and the assembly instruction sequence,
+ * will issue the commit instruction only if executed atomically with
+ * respect to other threads scheduled on the same CPU, and with respect
+ * to signal handlers.
+ */
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ struct task_struct *t = current;
+ int ret;
+
+ if (unlikely(t->flags & PF_EXITING))
+ return;
+ if (unlikely(!access_ok(VERIFY_WRITE, t->rseq, sizeof(*t->rseq))))
+ goto error;
+ ret = rseq_ip_fixup(regs);
+ if (unlikely(ret < 0))
+ goto error;
+ if (unlikely(rseq_update_cpu_id(t)))
+ goto error;
+ return;
+
+error:
+ force_sig(SIGSEGV, t);
+}
+
+/*
+ * sys_rseq - setup restartable sequences for caller thread.
+ */
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, uint32_t, rseq_len,
+ int, flags, uint32_t, sig)
+{
+ int ret;
+
+ if (flags & RSEQ_FLAG_UNREGISTER) {
+ /* Unregister rseq for current thread. */
+ if (current->rseq != rseq || !current->rseq)
+ return -EINVAL;
+ if (current->rseq_len != rseq_len)
+ return -EINVAL;
+ if (current->rseq_sig != sig)
+ return -EPERM;
+ ret = rseq_reset_rseq_cpu_id(current);
+ if (ret)
+ return ret;
+ current->rseq = NULL;
+ current->rseq_len = 0;
+ current->rseq_sig = 0;
+ return 0;
+ }
+
+ if (unlikely(flags))
+ return -EINVAL;
+
+ if (current->rseq) {
+ /*
+ * If rseq is already registered, check whether
+ * the provided address differs from the prior
+ * one.
+ */
+ if (current->rseq != rseq || current->rseq_len != rseq_len)
+ return -EINVAL;
+ if (current->rseq_sig != sig)
+ return -EPERM;
+ /* Already registered. */
+ return -EBUSY;
+ }
+
+ /*
+ * If there was no rseq previously registered,
+ * ensure the provided rseq is properly aligned and valid.
+ */
+ if (!IS_ALIGNED((unsigned long)rseq, __alignof__(*rseq)) ||
+ rseq_len != sizeof(*rseq))
+ return -EINVAL;
+ if (!access_ok(VERIFY_WRITE, rseq, rseq_len))
+ return -EFAULT;
+ current->rseq = rseq;
+ current->rseq_len = rseq_len;
+ current->rseq_sig = sig;
+ /*
+ * If rseq was previously inactive, and has just been
+ * registered, ensure the cpu_id_start and cpu_id fields
+ * are updated before returning to user-space.
+ */
+ rseq_set_notify_resume(current);
+
+ return 0;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c94895bc5a2c..771caa7e95c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2648,6 +2648,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
{
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
+ rseq_preempt(prev);
fire_sched_out_preempt_notifiers(prev, next);
prepare_task(next);
prepare_arch_switch(next);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb5fc458547f..66b070444a7e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1249,6 +1249,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
#endif
p->wake_cpu = cpu;
#endif
+ rseq_migrate(p);
}
/*
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index b5189762d275..bfa1ee1bf669 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -259,3 +259,6 @@ cond_syscall(sys_membarrier);
cond_syscall(sys_pkey_mprotect);
cond_syscall(sys_pkey_alloc);
cond_syscall(sys_pkey_free);
+
+/* restartable sequence */
+cond_syscall(sys_rseq);
--
2.11.0
Provide helper macros for fields which represent pointers in
kernel-userspace ABI. This facilitates handling of 32-bit
user-space by 64-bit kernels by defining those fields as
32-bit 0-padding and 32-bit integer on 32-bit architectures,
which allows the kernel to treat those as 64-bit integers.
The order of padding and 32-bit integer depends on the
endianness.
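To make the layout argument concrete, here is a small user-space
sketch. It does not use the macros from this patch: struct field32 and
field32_as_u64() are hypothetical names, and the compiler-provided
__BYTE_ORDER__ macro stands in for the byte-order detection done in the
header. It shows that a 32-bit value plus zero padding, ordered by
endianness, reads back as the same number when the 8 bytes are
interpreted as one 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit view of a pointer field: value plus zero padding. */
struct field32 {
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
	uint32_t ptr_padding, ptr;	/* big endian: padding first */
#else
	uint32_t ptr, ptr_padding;	/* little endian: value first */
#endif
};

/* How a 64-bit kernel would read the same 8 bytes: as one integer. */
static uint64_t field32_as_u64(const struct field32 *f)
{
	uint64_t v;

	_Static_assert(sizeof(*f) == sizeof(v),
		       "both views must span 8 bytes");
	memcpy(&v, f, sizeof(v));
	return v;
}
```

Because the padding always occupies the most significant half, the
kernel side does not need a separate compat structure for these fields.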
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Michael Kerrisk <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
include/uapi/linux/types_32_64.h | 67 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
create mode 100644 include/uapi/linux/types_32_64.h
diff --git a/include/uapi/linux/types_32_64.h b/include/uapi/linux/types_32_64.h
new file mode 100644
index 000000000000..18dc8808d026
--- /dev/null
+++ b/include/uapi/linux/types_32_64.h
@@ -0,0 +1,67 @@
+#ifndef _UAPI_LINUX_TYPES_32_64_H
+#define _UAPI_LINUX_TYPES_32_64_H
+
+/*
+ * linux/types_32_64.h
+ *
+ * Integer type declaration for pointers across 32-bit and 64-bit systems.
+ *
+ * Copyright (c) 2015-2017 Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else
+# include <stdint.h>
+#endif
+
+#include <asm/byteorder.h>
+
+#ifdef __BYTE_ORDER
+# if (__BYTE_ORDER == __BIG_ENDIAN)
+# define LINUX_BYTE_ORDER_BIG_ENDIAN
+# else
+# define LINUX_BYTE_ORDER_LITTLE_ENDIAN
+# endif
+#else
+# ifdef __BIG_ENDIAN
+# define LINUX_BYTE_ORDER_BIG_ENDIAN
+# else
+# define LINUX_BYTE_ORDER_LITTLE_ENDIAN
+# endif
+#endif
+
+#ifdef __LP64__
+# define LINUX_FIELD_u32_u64(field) uint64_t field
+# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v) field = (intptr_t)(v)
+#else
+# ifdef LINUX_BYTE_ORDER_BIG_ENDIAN
+# define LINUX_FIELD_u32_u64(field) uint32_t field ## _padding, field
+# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v) \
+ field ## _padding = 0, field = (intptr_t)(v)
+# else
+# define LINUX_FIELD_u32_u64(field) uint32_t field, field ## _padding
+# define LINUX_FIELD_u32_u64_INIT_ONSTACK(field, v) \
+ field = (intptr_t)(v), field ## _padding = 0
+# endif
+#endif
+
+#endif /* _UAPI_LINUX_TYPES_32_64_H */
--
2.11.0
Wire up the rseq system call on 32-bit ARM.
This provides an ABI improving the speed of a user-space getcpu
operation on ARM by skipping the getcpu system call on the fast path, as
well as improving the speed of user-space operations on per-cpu data
compared to using load-linked/store-conditional.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/arm/tools/syscall.tbl | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 0bb0e9c6376c..fbc74b5fa3ed 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -412,3 +412,4 @@
395 common pkey_alloc sys_pkey_alloc
396 common pkey_free sys_pkey_free
397 common statx sys_statx
+398 common rseq sys_rseq
--
2.11.0
From: Boqun Feng <[email protected]>
Wire up the rseq system call on powerpc.
This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.
Signed-off-by: Boqun Feng <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Michael Ellerman <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: [email protected]
---
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 2 +-
arch/powerpc/include/uapi/asm/unistd.h | 1 +
3 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
SYSCALL(pkey_alloc)
SYSCALL(pkey_free)
SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
#include <uapi/asm/unistd.h>
-#define NR_syscalls 387
+#define NR_syscalls 388
#define __NR__exit __NR_exit
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
#define __NR_pkey_alloc 384
#define __NR_pkey_free 385
#define __NR_pkey_mprotect 386
+#define __NR_rseq 387
#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
--
2.11.0
Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.
Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.
Signed-off-by: Mathieu Desnoyers <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 1 +
arch/x86/kernel/signal.c | 6 ++++++
3 files changed, 8 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0fa71a78ec99..47a2b14fcc7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -178,6 +178,7 @@ config X86
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if X86_64 && UNWINDER_FRAME_POINTER && STACK_VALIDATION
select HAVE_STACK_VALIDATION if X86_64
+ select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 74f6eee15179..ad348b28bcec 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -164,6 +164,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
if (cached_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+ rseq_handle_notify_resume(regs);
}
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 4cdc0b27ec82..0f549cbd8b46 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -687,6 +687,12 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
sigset_t *set = sigmask_to_save();
compat_sigset_t *cset = (compat_sigset_t *) set;
+ /*
+ * Increment event counter and perform fixup for the pre-signal
+ * frame.
+ */
+ rseq_signal_deliver(regs);
+
/* Set up the stack frame */
if (is_ia32_frame(ksig)) {
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
--
2.11.0
Call the rseq_handle_notify_resume() function on return to
userspace if TIF_NOTIFY_RESUME thread flag is set.
Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/arm/Kconfig | 1 +
arch/arm/kernel/signal.c | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7e3d53575486..1897d40ddd87 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -90,6 +90,7 @@ config ARM
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE if (SMP && ARM_LPAE)
select HAVE_REGS_AND_STACK_ACCESS_API
+ select HAVE_RSEQ
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UID16
select HAVE_VIRT_CPU_ACCOUNTING_GEN
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index bd8810d4acb3..5879ab3f53c1 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -541,6 +541,12 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
int ret;
/*
+ * Increment event counter and perform fixup for the pre-signal
+ * frame.
+ */
+ rseq_signal_deliver(regs);
+
+ /*
* Set up the stack frame
*/
if (ksig->ka.sa.sa_flags & SA_SIGINFO)
@@ -660,6 +666,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
} else {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+ rseq_handle_notify_resume(regs);
}
}
local_irq_disable();
--
2.11.0
On Tue, Mar 27, 2018 at 12:05:21PM -0400, Mathieu Desnoyers wrote:
> Hi,
>
> I'm respinning this series for another RFC round. It is based on the
> v4.16-rc7 tag. I am now targeting the 4.17 merge window.
I'll go over the thing again in more detail, but I'm basically ok with
rseq, but I hate that cpu_opv thing with a passion (as you well know).
On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
[...]
> Changes since v11:
>
> - Replace task struct rseq_preempt, rseq_signal, and rseq_migrate
> bool by u32 rseq_event_mask.
[...]
> @@ -979,6 +980,17 @@ struct task_struct {
> unsigned long numa_pages_migrated;
> #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_RSEQ
> + struct rseq __user *rseq;
> + u32 rseq_len;
> + u32 rseq_sig;
> + /*
> + * RmW on rseq_event_mask must be performed atomically
> + * with respect to preemption.
> + */
> + unsigned long rseq_event_mask;
s/unsigned long/u32
> +#endif
> +
> struct tlbflush_unmap_batch tlb_ubc;
>
> struct rcu_head rcu;
> @@ -1688,4 +1700,110 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
> #define TASK_SIZE_OF(tsk) TASK_SIZE
> #endif
>
[...]
> +
> +static int rseq_ip_fixup(struct pt_regs *regs)
> +{
> + unsigned long ip = instruction_pointer(regs), start_ip = 0,
> + post_commit_offset = 0, abort_ip = 0;
> + struct task_struct *t = current;
> + uint32_t cs_flags = 0;
> + bool in_rseq_cs = false;
This seems unnecessary? Because..
> + int ret;
> +
> + ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
> + &cs_flags);
> + if (ret)
> + return ret;
> +
> + /*
> + * Handle potentially not being within a critical section.
> + * Unsigned comparison will be true when
> + * ip >= start_ip, and when ip < start_ip + post_commit_offset.
> + */
> + if (ip - start_ip < post_commit_offset)
> + in_rseq_cs = true;
> +
> + /*
> + * If not nested over a rseq critical section, restart is
> + * useless. Clear the rseq_cs pointer and return.
> + */
> + if (!in_rseq_cs)
> + return clear_rseq_cs(t);
we can write
if (ip - start_ip >= post_commit_offset)
return clear_rseq_cs(t);
Regards,
Boqun
> + ret = rseq_need_restart(t, cs_flags);
> + if (ret <= 0)
> + return ret;
> + ret = clear_rseq_cs(t);
> + if (ret)
> + return ret;
> + trace_rseq_ip_fixup(ip, start_ip, post_commit_offset, abort_ip);
> + instruction_pointer_set(regs, (unsigned long)abort_ip);
> + return 0;
> +}
> +
[...]
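The single-comparison check suggested above relies on unsigned wrap-around. Here is a minimal userspace sketch of it; the helper name in_rseq_cs_range() is hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: true when ip lies in
 * [start_ip, start_ip + post_commit_offset).
 *
 * When ip < start_ip, the unsigned subtraction wraps around to a huge
 * value, so the single comparison rejects it without a second test.
 */
bool in_rseq_cs_range(uintptr_t ip, uintptr_t start_ip,
		      uintptr_t post_commit_offset)
{
	return ip - start_ip < post_commit_offset;
}

/* Usage example: a 0x20-byte critical section starting at 0x1000. */
void in_rseq_cs_range_example(void)
{
	assert(in_rseq_cs_range(0x1000, 0x1000, 0x20));		/* first byte */
	assert(in_rseq_cs_range(0x101f, 0x1000, 0x20));		/* last byte */
	assert(!in_rseq_cs_range(0x1020, 0x1000, 0x20));	/* just past the end */
	assert(!in_rseq_cs_range(0x0fff, 0x1000, 0x20));	/* before start: wraps */
}
```

Negating the comparison (>=) gives the early-return form proposed in the review, with no need for an intermediate boolean.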
On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
> +#ifdef CONFIG_RSEQ
> + struct rseq __user *rseq;
> + u32 rseq_len;
> + u32 rseq_sig;
> + /*
> + * RmW on rseq_event_mask must be performed atomically
> + * with respect to preemption.
> + */
> + unsigned long rseq_event_mask;
> +#endif
> +static inline void rseq_signal_deliver(struct pt_regs *regs)
> +{
> + set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
> + rseq_handle_notify_resume(regs);
> +}
> +
> +static inline void rseq_preempt(struct task_struct *t)
> +{
> + set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> + rseq_set_notify_resume(t);
> +}
> +
> +static inline void rseq_migrate(struct task_struct *t)
> +{
> + set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
> + rseq_set_notify_resume(t);
> +}
Given that comment above, do you really need the full atomic set bit?
Isn't __set_bit() sufficient?
On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
> +/*
> + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
> + * contained within a single cache-line. It is usually declared as
> + * link-time constant data.
> + */
> +struct rseq_cs {
> + /* Version of this structure. */
> + uint32_t version;
> + /* enum rseq_cs_flags */
> + uint32_t flags;
> + LINUX_FIELD_u32_u64(start_ip);
> + /* Offset from start_ip. */
> + LINUX_FIELD_u32_u64(post_commit_offset);
> + LINUX_FIELD_u32_u64(abort_ip);
> +} __attribute__((aligned(4 * sizeof(uint64_t))));
What's with the uint32_t ? The normal Linux API type is __u32 afaik.
On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
> +static int rseq_update_cpu_id(struct task_struct *t)
> +{
> + uint32_t cpu_id = raw_smp_processor_id();
u32
> +
> + if (__put_user(cpu_id, &t->rseq->cpu_id_start))
> + return -EFAULT;
> + if (__put_user(cpu_id, &t->rseq->cpu_id))
> + return -EFAULT;
> + trace_rseq_update(t);
> + return 0;
> +}
> +
> +static int rseq_reset_rseq_cpu_id(struct task_struct *t)
> +{
> + uint32_t cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
u32
> +
> + /*
> + * Reset cpu_id_start to its initial state (0).
> + */
> + if (__put_user(cpu_id_start, &t->rseq->cpu_id_start))
> + return -EFAULT;
> + /*
> + * Reset cpu_id to RSEQ_CPU_ID_UNINITIALIZED, so any user coming
> + * in after unregistration can figure out that rseq needs to be
> + * registered again.
> + */
> + if (__put_user(cpu_id, &t->rseq->cpu_id))
> + return -EFAULT;
> + return 0;
> +}
> +
> +static int rseq_get_rseq_cs(struct task_struct *t,
> + unsigned long *start_ip,
> + unsigned long *post_commit_offset,
> + unsigned long *abort_ip,
> + uint32_t *cs_flags)
> +{
> + struct rseq_cs __user *urseq_cs;
> + struct rseq_cs rseq_cs;
> + unsigned long ptr;
> + u32 __user *usig;
> + u32 sig;
> + int ret;
> +
> + ret = __get_user(ptr, &t->rseq->rseq_cs);
> + if (ret)
> + return ret;
> + if (!ptr)
> + return 0;
> + urseq_cs = (struct rseq_cs __user *)ptr;
> + if (copy_from_user(&rseq_cs, urseq_cs, sizeof(rseq_cs)))
> + return -EFAULT;
> + if (rseq_cs.version > 0)
> + return -EINVAL;
> +
> + /* Ensure that abort_ip is not in the critical section. */
> + if (rseq_cs.abort_ip - rseq_cs.start_ip < rseq_cs.post_commit_offset)
> + return -EINVAL;
The kernel will not crash if userspace messes that up right? So why do
we care to check?
> +
> + *cs_flags = rseq_cs.flags;
> + *start_ip = rseq_cs.start_ip;
> + *post_commit_offset = rseq_cs.post_commit_offset;
> + *abort_ip = rseq_cs.abort_ip;
Then this becomes a straight struct assignment.
> +
> + usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
> + ret = get_user(sig, usig);
> + if (ret)
> + return ret;
> +
> + if (current->rseq_sig != sig) {
> + printk_ratelimited(KERN_WARNING
> + "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
> + sig, current->rseq_sig, current->pid, usig);
> + return -EPERM;
> + }
Is there any text that explains the threat model and possible attack
that this signature prevents? I failed to find any, which raises the
question, why is it there..
> + return 0;
> +}
> +
> +static int rseq_need_restart(struct task_struct *t, uint32_t cs_flags)
u32
> +{
> + uint32_t flags, event_mask;
u32
> + int ret;
> +
> + /* Get thread flags. */
> + ret = __get_user(flags, &t->rseq->flags);
> + if (ret)
> + return ret;
> +
> + /* Take critical section flags into account. */
> + flags |= cs_flags;
> +
> + /*
> + * Restart on signal can only be inhibited when restart on
> + * preempt and restart on migrate are inhibited too. Otherwise,
> + * a preempted signal handler could fail to restart the prior
> + * execution context on sigreturn.
> + */
> + if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
> + if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> + | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
> + (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> + | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
> + return -EINVAL;
Please put operators at the end of the previous line, not at the start
of the new line when you have to break statements.
Also, that's unreadable.
#define RSEQ_CS_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
(flags & RSEQ_CS_FLAGS) != RSEQ_CS_FLAGS))
return -EINVAL;
> + }
> +
> + /*
> + * Load and clear event mask atomically with respect to
> + * scheduler preemption.
> + */
> + preempt_disable();
> + event_mask = t->rseq_event_mask;
> + t->rseq_event_mask = 0;
> + preempt_enable();
> +
> + event_mask &= ~flags;
> + if (event_mask)
> + return 1;
> + return 0;
return !!(event_mask & ~flags);
> +}
> +
> +static int clear_rseq_cs(struct task_struct *t)
> +{
> + unsigned long ptr = 0;
> +
> + /*
> + * The rseq_cs field is set to NULL on preemption or signal
> + * delivery on top of rseq assembly block, as well as on top
> + * of code outside of the rseq assembly block. This performs
> + * a lazy clear of the rseq_cs field.
> + *
> + * Set rseq_cs to NULL with single-copy atomicity.
> + */
> + return __put_user(ptr, &t->rseq->rseq_cs);
__put_user(0UL, &t->rseq->rseq_cs); ?
> +}
> +
> +static int rseq_ip_fixup(struct pt_regs *regs)
> +{
> + unsigned long ip = instruction_pointer(regs), start_ip = 0,
> + post_commit_offset = 0, abort_ip = 0;
valid C, but yuck. Just have two 'unsigned long' lines.
Also, why the =0? The call to rseq_get_rseq_cs() below will either
initialize them or fail.
> + struct task_struct *t = current;
> + uint32_t cs_flags = 0;
u32
> + bool in_rseq_cs = false;
> + int ret;
> +
> + ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
> + &cs_flags);
ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset,
&abort_ip, &cs_flags);
> + if (ret)
> + return ret;
> +
> + /*
> + * Handle potentially not being within a critical section.
> + * Unsigned comparison will be true when
> + * ip >= start_ip, and when ip < start_ip + post_commit_offset.
> + */
> + if (ip - start_ip < post_commit_offset)
> + in_rseq_cs = true;
> +
> + /*
> + * If not nested over a rseq critical section, restart is
> + * useless. Clear the rseq_cs pointer and return.
> + */
> + if (!in_rseq_cs)
> + return clear_rseq_cs(t);
That all seems needlessly complicated; isn't:
if (ip - start_ip >= post_commit_offset)
return clear_rseq_cs(t);
equivalent? Nothing seems to use that variable after this.
> + ret = rseq_need_restart(t, cs_flags);
> + if (ret <= 0)
> + return ret;
> + ret = clear_rseq_cs(t);
> + if (ret)
> + return ret;
> + trace_rseq_ip_fixup(ip, start_ip, post_commit_offset, abort_ip);
> + instruction_pointer_set(regs, (unsigned long)abort_ip);
> + return 0;
> +}
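The two simplifications suggested above for rseq_need_restart() (a combined flag mask and a single return expression) can be modelled in userspace roughly as follows. This is a sketch under assumptions: the numeric flag values and the function name need_restart_model() are illustrative, and it assumes each event bit shares its position with the matching NO_RESTART flag, which is what makes masking with ~flags meaningful.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed bit values, for illustration; the patch defines the real ones. */
#define RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	(1U << 0)
#define RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	(1U << 1)
#define RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	(1U << 2)

#define RSEQ_CS_FLAGS	(RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
			 RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
			 RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)

/*
 * Userspace model of the decision (hypothetical name).
 * Returns -EINVAL for an invalid flag combination, 1 when a restart is
 * needed, 0 otherwise.
 */
int need_restart_model(uint32_t flags, uint32_t event_mask)
{
	/* Inhibiting restart on signal requires inhibiting the other two. */
	if ((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
	    (flags & RSEQ_CS_FLAGS) != RSEQ_CS_FLAGS)
		return -EINVAL;

	/* Any pending event that is not inhibited triggers a restart. */
	return !!(event_mask & ~flags);
}

/* Usage example; event bits reuse the flag positions (assumption). */
void need_restart_model_example(void)
{
	assert(need_restart_model(0, RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT) == 1);
	assert(need_restart_model(RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT,
				  RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT) == 0);
	assert(need_restart_model(RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL, 0) == -EINVAL);
	assert(need_restart_model(RSEQ_CS_FLAGS, RSEQ_CS_FLAGS) == 0);
}
```

The model leaves out the user-space flag load and the preempt-disabled read-and-clear of the event mask, which the kernel code still needs.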
On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index fb5fc458547f..66b070444a7e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1249,6 +1249,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
> #endif
> p->wake_cpu = cpu;
> #endif
> + rseq_migrate(p);
> }
I think you want that in set_task_cpu(), right next to nr_migrations++.
On Wed, Mar 28, 2018 at 02:29:46PM +0200, Peter Zijlstra wrote:
> > +static int rseq_get_rseq_cs(struct task_struct *t,
> > + unsigned long *start_ip,
> > + unsigned long *post_commit_offset,
> > + unsigned long *abort_ip,
> > + uint32_t *cs_flags)
> > +{
>
> > +
> > + *cs_flags = rseq_cs.flags;
> > + *start_ip = rseq_cs.start_ip;
> > + *post_commit_offset = rseq_cs.post_commit_offset;
> > + *abort_ip = rseq_cs.abort_ip;
>
> Then this becomes a straight struct assignment.
I initially suggested passing a structure instead of many arguments, but
then reconsidered, mostly because it will be inlined (due to having only
the one caller) anyway. Still, maybe a struct will work better, I dunno.
----- On Mar 28, 2018, at 2:47 AM, Boqun Feng [email protected] wrote:
> On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
> [...]
>> Changes since v11:
>>
>> - Replace task struct rseq_preempt, rseq_signal, and rseq_migrate
>> bool by u32 rseq_event_mask.
> [...]
>> @@ -979,6 +980,17 @@ struct task_struct {
>> unsigned long numa_pages_migrated;
>> #endif /* CONFIG_NUMA_BALANCING */
>>
>> +#ifdef CONFIG_RSEQ
>> + struct rseq __user *rseq;
>> + u32 rseq_len;
>> + u32 rseq_sig;
>> + /*
>> + * RmW on rseq_event_mask must be performed atomically
>> + * with respect to preemption.
>> + */
>> + unsigned long rseq_event_mask;
>
> s/unsigned long/u32
good point, fixed.
>
>> +#endif
>> +
>> struct tlbflush_unmap_batch tlb_ubc;
>>
>> struct rcu_head rcu;
>> @@ -1688,4 +1700,110 @@ extern long sched_getaffinity(pid_t pid, struct cpumask
>> *mask);
>> #define TASK_SIZE_OF(tsk) TASK_SIZE
>> #endif
>>
>
> [...]
>
>> +
>> +static int rseq_ip_fixup(struct pt_regs *regs)
>> +{
>> + unsigned long ip = instruction_pointer(regs), start_ip = 0,
>> + post_commit_offset = 0, abort_ip = 0;
>> + struct task_struct *t = current;
>> + uint32_t cs_flags = 0;
>> + bool in_rseq_cs = false;
>
> This seems unnecessary? Because..
>
>> + int ret;
>> +
>> + ret = rseq_get_rseq_cs(t, &start_ip, &post_commit_offset, &abort_ip,
>> + &cs_flags);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * Handle potentially not being within a critical section.
>> + * Unsigned comparison will be true when
>> + * ip >= start_ip, and when ip < start_ip + post_commit_offset.
>> + */
>> + if (ip - start_ip < post_commit_offset)
>> + in_rseq_cs = true;
>> +
>> + /*
>> + * If not nested over a rseq critical section, restart is
>> + * useless. Clear the rseq_cs pointer and return.
>> + */
>> + if (!in_rseq_cs)
>> + return clear_rseq_cs(t);
>
> we can write
>
> if (ip - start_ip >= post_commit_offset)
> return clear_rseq_cs(t);
Good point. In a previous version, rseq_get_rseq_cs() had to conditionally
update in_rseq_cs, but it's not the case anymore, so your approach
indeed cleans up the code.
Thanks!
Mathieu
>
> Regards,
> Boqun
>
>> + ret = rseq_need_restart(t, cs_flags);
>> + if (ret <= 0)
>> + return ret;
>> + ret = clear_rseq_cs(t);
>> + if (ret)
>> + return ret;
>> + trace_rseq_ip_fixup(ip, start_ip, post_commit_offset, abort_ip);
>> + instruction_pointer_set(regs, (unsigned long)abort_ip);
>> + return 0;
>> +}
>> +
> [...]
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Mar 28, 2018, at 7:19 AM, Peter Zijlstra [email protected] wrote:
> On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
>> +#ifdef CONFIG_RSEQ
>> + struct rseq __user *rseq;
>> + u32 rseq_len;
>> + u32 rseq_sig;
>> + /*
>> + * RmW on rseq_event_mask must be performed atomically
>> + * with respect to preemption.
>> + */
>> + unsigned long rseq_event_mask;
>> +#endif
>
>> +static inline void rseq_signal_deliver(struct pt_regs *regs)
>> +{
>> + set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
>> + rseq_handle_notify_resume(regs);
>> +}
>> +
>> +static inline void rseq_preempt(struct task_struct *t)
>> +{
>> + set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
>> + rseq_set_notify_resume(t);
>> +}
>> +
>> +static inline void rseq_migrate(struct task_struct *t)
>> +{
>> + set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
>> + rseq_set_notify_resume(t);
>> +}
>
> Given that comment above, do you really need the full atomic set bit?
> Isn't __set_bit() sufficient?
For each of rseq_signal_deliver, rseq_preempt, and rseq_migrate, we should
confirm that their callers guarantee preemption is disabled before
we can use __set_bit() in each of those functions.
Is that the case ? If so, we should also document the requirement
about preemption for each function.
AFAIU, rseq_migrate is only invoked from __set_task_cpu, which I *think*
always has preemption disabled. rseq_preempt() is called by the scheduler,
so this one is fine. On x86, rseq_signal_deliver is called from setup_rt_frame,
with preemption enabled.
So one approach would be to use __set_bit in both rseq_preempt and rseq_migrate,
but keep the atomic set_bit() in rseq_signal_deliver.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
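The concern about a non-atomic read-modify-write running with preemption enabled can be illustrated with a deterministic userspace model. This is only an analogy: C11 atomic_fetch_or() stands in for the kernel's set_bit(), a plain load/OR/store stands in for __set_bit(), and the preemption point is simulated by hand. The bit values and function names are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EVENT_PREEMPT	(1U << 0)	/* assumed bit, illustration only */
#define EVENT_SIGNAL	(1U << 1)	/* assumed bit, illustration only */

/*
 * Non-atomic RmW interrupted between its load and its store: the bit set
 * by the "scheduler" in the middle is overwritten and lost.
 */
uint32_t nonatomic_rmw_lost_update(void)
{
	uint32_t mask = 0;
	uint32_t tmp;

	tmp = mask;		/* signal path: load */
	mask |= EVENT_PREEMPT;	/* simulated preemption: scheduler sets its bit */
	tmp |= EVENT_SIGNAL;	/* signal path resumes: modifies a stale copy */
	mask = tmp;		/* store: EVENT_PREEMPT is gone */
	return mask;
}

/*
 * Atomic RmW: there is no window between load and store, so both bits
 * survive whichever update happens first.
 */
uint32_t atomic_rmw_keeps_both(void)
{
	_Atomic uint32_t mask = 0;

	atomic_fetch_or(&mask, EVENT_PREEMPT);
	atomic_fetch_or(&mask, EVENT_SIGNAL);
	return atomic_load(&mask);
}

/* Usage example comparing the two outcomes. */
void rmw_model_example(void)
{
	assert(nonatomic_rmw_lost_update() == EVENT_SIGNAL);
	assert(atomic_rmw_keeps_both() == (EVENT_PREEMPT | EVENT_SIGNAL));
}
```

In terms of the thread, the first function models what could go wrong if rseq_signal_deliver() used a non-atomic bit set while preemption is enabled. With preemption disabled around the update, that window does not exist.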
----- On Mar 28, 2018, at 10:06 AM, Mathieu Desnoyers [email protected] wrote:
> ----- On Mar 28, 2018, at 2:47 AM, Boqun Feng [email protected] wrote:
>
>> On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
>> [...]
>>> Changes since v11:
>>>
>>> - Replace task struct rseq_preempt, rseq_signal, and rseq_migrate
>>> bool by u32 rseq_event_mask.
>> [...]
>>> @@ -979,6 +980,17 @@ struct task_struct {
>>> unsigned long numa_pages_migrated;
>>> #endif /* CONFIG_NUMA_BALANCING */
>>>
>>> +#ifdef CONFIG_RSEQ
>>> + struct rseq __user *rseq;
>>> + u32 rseq_len;
>>> + u32 rseq_sig;
>>> + /*
>>> + * RmW on rseq_event_mask must be performed atomically
>>> + * with respect to preemption.
>>> + */
>>> + unsigned long rseq_event_mask;
>>
>> s/unsigned long/u32
>
> good point, fixed.
>
Actually, by having a u32 instead of unsigned long here, it triggers those
warnings:
In file included from ./include/linux/bitops.h:38:0,
from ./include/linux/kernel.h:11,
from certs/system_keyring.c:13:
./arch/x86/include/asm/bitops.h:73:1: note: expected ‘volatile long unsigned int *’ but argument is of type ‘u32 *’
set_bit(long nr, volatile unsigned long *addr)
^
I suspect that casting the u32 * to an unsigned long * is not a safe approach, because
the code can generate a load/store on unallocated memory (kasan might complain).
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Mar 28, 2018, at 8:50 AM, Peter Zijlstra [email protected] wrote:
> On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index fb5fc458547f..66b070444a7e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1249,6 +1249,7 @@ static inline void __set_task_cpu(struct task_struct *p,
>> unsigned int cpu)
>> #endif
>> p->wake_cpu = cpu;
>> #endif
>> + rseq_migrate(p);
>> }
>
> I think you want that in set_task_cpu(), right next to nr_migrations++.
This would miss the __set_task_cpu() call from sched_fork() and wake_up_new_task().
Those cases are not accounted as explicit "migrations", but it does change the CPU
of the current task. So if for some weird reason userspace wants to fork() while in
a rseq critical section, we want to trigger a rseq restart.
Note that rseq_fork() implies rseq_preempt(), but userspace can request to
track only migrations for a given rseq critical section (by using the
RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT flag), so the rseq_preempt() in rseq_fork()
is not enough to restart if a migration between CPUs is done across a fork.
An alternative to this would be to call rseq_migrate() in rseq_fork().
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Mar 28, 2018 at 10:47:54AM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 28, 2018, at 8:50 AM, Peter Zijlstra [email protected] wrote:
>
> > On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index fb5fc458547f..66b070444a7e 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -1249,6 +1249,7 @@ static inline void __set_task_cpu(struct task_struct *p,
> >> unsigned int cpu)
> >> #endif
> >> p->wake_cpu = cpu;
> >> #endif
> >> + rseq_migrate(p);
> >> }
> >
> > I think you want that in set_task_cpu(), right next to nr_migrations++.
>
> This would miss the __set_task_cpu() call from sched_fork() and wake_up_new_task().
Correct; but since those are _new_ tasks they _SHOULD_ not have an
active RSEQ to begin with.
> Those cases are not accounted as explicit "migrations", but it does change the CPU
> of the current task. So if for some weird reason userspace wants to fork() while in
> a rseq critical section, we want to trigger a rseq restart.
If at all possible I would make it SIGSEGV when issuing SYSCALL()s from
within an RSEQ.
> An alternative to this would be to call rseq_migrate() in rseq_fork().
>
> Thoughts ?
Yes, don't try and support that at all. It's _insane_.
----- On Mar 28, 2018, at 8:52 AM, Peter Zijlstra [email protected] wrote:
> On Wed, Mar 28, 2018 at 02:29:46PM +0200, Peter Zijlstra wrote:
>> > +static int rseq_get_rseq_cs(struct task_struct *t,
>> > + unsigned long *start_ip,
>> > + unsigned long *post_commit_offset,
>> > + unsigned long *abort_ip,
>> > + uint32_t *cs_flags)
>> > +{
>
>>
>> > +
>> > + *cs_flags = rseq_cs.flags;
>> > + *start_ip = rseq_cs.start_ip;
>> > + *post_commit_offset = rseq_cs.post_commit_offset;
>> > + *abort_ip = rseq_cs.abort_ip;
>>
>> Then this becomes a straight struct assignment.
>
> I initially suggested passing a structure instead of many arguments, but
> then reconsidered, mostly because it will be inlined (due to having only
> the one caller) anyway. Still, maybe a struct will work better, I dunno.
I find the result of struct pointer argument cleaner indeed. I'll go for that
approach.
I'll memset rseq_cs to 0 in the following case though, because the caller
expects the content of the structure to be set when rseq_get_rseq_cs() succeeds.
static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
{
struct rseq_cs __user *urseq_cs;
unsigned long ptr;
u32 __user *usig;
u32 sig;
int ret;
ret = __get_user(ptr, &t->rseq->rseq_cs);
if (ret)
return ret;
if (!ptr) {
memset(rseq_cs, 0, sizeof(*rseq_cs));
return 0;
}
[...]
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Mar 28, 2018, at 10:59 AM, Peter Zijlstra [email protected] wrote:
> On Wed, Mar 28, 2018 at 10:47:54AM -0400, Mathieu Desnoyers wrote:
>> ----- On Mar 28, 2018, at 8:50 AM, Peter Zijlstra [email protected] wrote:
>>
>> > On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index fb5fc458547f..66b070444a7e 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -1249,6 +1249,7 @@ static inline void __set_task_cpu(struct task_struct *p,
>> >> unsigned int cpu)
>> >> #endif
>> >> p->wake_cpu = cpu;
>> >> #endif
>> >> + rseq_migrate(p);
>> >> }
>> >
>> > I think you want that in set_task_cpu(), right next to nr_migrations++.
>>
>> This would miss the __set_task_cpu() call from sched_fork() and
>> wake_up_new_task().
>
> Correct; but since those are _new_ tasks they _SHOULD_ not have an
> active RSEQ to begin with.
As long as fork() can be issued from a rseq critical section, nothing
actually prevents this. This is a fork(), not an exec(), so the new tasks
may very well be going through a restartable sequence when fork() happens.
>
>> Those cases are not accounted as explicit "migrations", but it does change the
>> CPU
>> of the current task. So if for some weird reason userspace wants to fork() while
>> in
>> a rseq critical section, we want to trigger a rseq restart.
>
> If at all possible I would make it SIGSEGV when issuing SYSCALL()s from
> within an RSEQ.
What's the goal there ? rseq critical sections can technically do system calls
if they wish. Why prevent this ?
How would you handle signal handlers that issue system calls while nested
on top of a rseq critical section in the userspace thread ? SIGSEGV on
SYSCALLs will break this case.
>
>> An alternative to this would be to call rseq_migrate() in rseq_fork().
>>
>> Thoughts ?
>
> Yes, don't try and support that at all. It's _insane_.
Thomas told me those fork corner-cases should be correctly handled
in a previous version of the patchset. I'm following his advice here.
So either we disallow fork() within rseq critical sections completely with
some kind of validation, or we need to provide a non-bogus behavior when this
happens. Given that fork(2) is async-signal-safe, this means a signal handler
can do a fork() while nested on top of a userspace thread's rseq critical section.
So prohibiting fork() from being called over a rseq c.s. does not seem like
something we can do here.
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Tue, Mar 27, 2018 at 12:05:31PM -0400, Mathieu Desnoyers wrote:
> 1) Allow algorithms to perform per-cpu data migration without relying on
> sched_setaffinity()
>
> The use-cases are migrating memory between per-cpu memory free-lists, or
> stealing tasks from other per-cpu work queues: each requires that
> accesses to remote per-cpu data structures are performed.
I think that one completely reduces to the per-cpu (spin)lock case,
right? Because, as per the below, your logging case (8) can 'easily' be
done without the cpu_opv monstrosity.
And if you can construct a per-cpu lock, that can be used to construct
arbitrary logic.
And the difficult case for the per-cpu lock is the remote acquire; all
the other cases are (relatively) trivial.
I've not really managed to get anything sensible to work, I've tried
several variations of split lock, but you invariably end up with
barriers in the fast (local) path, which sucks.
But I feel this should be solvable without cpu_opv. As in, I really hate
that thing ;-)
> 8) Allow libraries with multi-part algorithms to work on same per-cpu
> data without affecting the allowed cpu mask
>
> The lttng-ust tracer presents an interesting use-case for per-cpu
> buffers: the algorithm needs to update a "reserve" counter, serialize
> data into the buffer, and then update a "commit" counter _on the same
> per-cpu buffer_. Using rseq for both reserve and commit can bring
> significant performance benefits.
>
> Clearly, if rseq reserve fails, the algorithm can retry on a different
> per-cpu buffer. However, it's not that easy for the commit. It needs to
> be performed on the same per-cpu buffer as the reserve.
>
> The cpu_opv system call solves that problem by receiving the cpu number
> on which the operation needs to be performed as argument. It can push
> the task to the right CPU if needed, and perform the operations there
> with preemption disabled.
>
> Changing the allowed cpu mask for the current thread is not an
> acceptable alternative for a tracing library, because the application
> being traced does not expect that mask to be changed by libraries.
We talked about this use-case, and it can be solved without cpu_opv if
you keep a dual commit counter, one local and one (atomic) remote.
We retain the cpu_id from the first rseq, and the second part will, when
it (unlikely) finds it runs remotely, do an atomic increment on the
remote counter. The consumer of the counter will then have to sum both
the local and remote counter parts.
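A rough userspace sketch of the dual commit counter described above, assuming a hypothetical struct layout and function names. The plain increment on the local part stands in for an rseq critical section, and the current CPU is passed in explicitly so the sketch stays self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout, one instance per CPU. */
struct commit_counter {
	uint64_t local;			/* only written while on the owning CPU */
	_Atomic uint64_t remote;	/* atomic adds from any other CPU */
};

/*
 * Commit on the buffer of reserve_cpu (the cpu_id retained from the
 * reserve step). The local path models the rseq fast path; the remote
 * path is the unlikely case where the thread migrated in between.
 */
void commit_add(struct commit_counter *c, int reserve_cpu, int current_cpu,
		uint64_t count)
{
	if (current_cpu == reserve_cpu)
		c->local += count;			/* fast local path */
	else
		atomic_fetch_add(&c->remote, count);	/* remote path */
}

/* Consumer side: the committed total is the sum of both parts. */
uint64_t commit_read(struct commit_counter *c)
{
	return c->local + atomic_load(&c->remote);
}

/* Usage example: reserve on CPU 2, commit once locally, once remotely. */
void commit_counter_example(void)
{
	struct commit_counter c = { 0 };

	commit_add(&c, 2, 2, 1);	/* still on CPU 2: local part */
	commit_add(&c, 2, 5, 1);	/* migrated to CPU 5: remote part */
	assert(c.local == 1);
	assert(commit_read(&c) == 2);
}
```

A real implementation would also need a well-defined way for the consumer to read the local part while its owner updates it. The sketch is single-threaded and ignores that.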
----- On Mar 28, 2018, at 11:28 AM, Peter Zijlstra [email protected] wrote:
> On Wed, Mar 28, 2018 at 11:14:05AM -0400, Mathieu Desnoyers wrote:
>
>> > If at all possible I would make it SIGSEGV when issuing SYSCALL()s from
>> > within an RSEQ.
>>
>> What's the goal there ? rseq critical sections can technically do system calls
>> if they wish. Why prevent this ?
>
> This all started as a way to do 'small' _fast_ per-cpu ops. System calls
> do NOT fit in that pattern. If you're willing to do a system call, the
> cost of atomics is not a problem.
I'm not arguing that a typical rseq would do a system call. I'm merely
pointing out that if we start putting arbitrary limitations like "SIGSEGV
when a fork or system call is encountered on top of rseq", this will cause
pain in user-space.
>
>> How would you handle signal handlers that issue system calls while nested
>> on top of a rseq critical section in the userspace thread ? SIGSEGV on
>> SYSCALLs will break this case.
>
> Have the rseq thing aborted prior to delivering the signal ?
Not if the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL flag is set either in the TLS
or in the rseq_cs structure.
How about we simply add a rseq_migrate() within rseq_fork() (when
forking to a new process), which will allow me to move the rseq_migrate
from __set_task_cpu() to set_task_cpu() as you request. Is that solution
acceptable for you ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Mar 28, 2018, at 7:22 AM, Peter Zijlstra [email protected] wrote:
> On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
>> +/*
>> + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
>> + * contained within a single cache-line. It is usually declared as
>> + * link-time constant data.
>> + */
>> +struct rseq_cs {
>> + /* Version of this structure. */
>> + uint32_t version;
>> + /* enum rseq_cs_flags */
>> + uint32_t flags;
>> + LINUX_FIELD_u32_u64(start_ip);
>> + /* Offset from start_ip. */
>> + LINUX_FIELD_u32_u64(post_commit_offset);
>> + LINUX_FIELD_u32_u64(abort_ip);
>> +} __attribute__((aligned(4 * sizeof(uint64_t))));
>
> What's with the uint32_t ? The normal Linux API type is __u32 afaik.
Will fix. Working on both kernel and user-space code in parallel kind
of does that to the brain. ;-)
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Mar 28, 2018 at 11:14:05AM -0400, Mathieu Desnoyers wrote:
> > If at all possible I would make it SIGSEGV when issueing SYSCALL()s from
> > within an RSEQ.
>
> What's the goal there ? rseq critical sections can technically do system calls
> if they wish. Why prevent this ?
This all started as a way to do 'small' _fast_ per-cpu ops. System calls
do NOT fit in that pattern. If you're willing to do a system call, the
cost of atomics is not a problem.
> How would you handle signal handlers that issue system calls while nested
> on top of a rseq critical section in the userspace thread ? SIGSEGV on
> SYSCALLs will break this case.
Have the rseq thing aborted prior to delivering the signal ?
----- On Mar 28, 2018, at 8:29 AM, Peter Zijlstra [email protected] wrote:
> On Tue, Mar 27, 2018 at 12:05:23PM -0400, Mathieu Desnoyers wrote:
[...]
>> + /* Ensure that abort_ip is not in the critical section. */
>> + if (rseq_cs.abort_ip - rseq_cs.start_ip < rseq_cs.post_commit_offset)
>> + return -EINVAL;
>
> The kernel will not crash if userspace messes that up right? So why do
> we care to check?
That's because the kernel clears the TLS @rseq_cs pointer whenever it restarts
a rseq critical section. Therefore, if the abort_ip points somewhere within
the rseq critical section, the kernel will clear the @rseq_cs pointer, move the
instruction pointer to the abort_ip, and return to user-space. At that stage,
user-space will still be running within a rseq critical section, but the
@rseq_cs pointer is NULL. So if the kernel preempts again before that critical
section completes, it will completely miss it.
So having the kernel segfault userspace when this pattern is encountered
is an extra safety net ensuring that if user-space ever implements such a
construct, at least it will segfault quickly, and will therefore be easier
to debug, because it is easier to reproduce.
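As an illustration of the check being discussed, here is a small user-space model of the unsigned range comparison. The struct cs_desc type and the helper names are made up for this sketch; only the comparison itself mirrors the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the descriptor fields discussed above. */
struct cs_desc {
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/* True when addr lies in [start_ip, start_ip + post_commit_offset). */
static bool in_cs(const struct cs_desc *cs, uint64_t addr)
{
	/*
	 * Unsigned subtraction wraps around when addr < start_ip, giving a
	 * very large value, so a single comparison covers both bounds.
	 */
	return addr - cs->start_ip < cs->post_commit_offset;
}

/* An abort_ip inside the critical section would be rejected. */
static bool abort_ip_is_valid(const struct cs_desc *cs)
{
	return !in_cs(cs, cs->abort_ip);
}
```

With start_ip = 0x1000 and post_commit_offset = 0x20, an abort_ip of 0x1010 is rejected, while 0x1040 or 0x0ff0 is accepted.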
>> +
>> + usig = (u32 __user *)(rseq_cs.abort_ip - sizeof(u32));
>> + ret = get_user(sig, usig);
>> + if (ret)
>> + return ret;
>> +
>
>> + if (current->rseq_sig != sig) {
>> + printk_ratelimited(KERN_WARNING
>> + "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
>> + sig, current->rseq_sig, current->pid, usig);
>> + return -EPERM;
>> + }
>
> Is there any text that explains the thread model and possible attack
> that this signature prevents? I failed to find any, which raises the
> question, why is it there..
The threat model is an attacker partly controlling a user-space process, trying
to execute his own code by abusing the rseq restart mechanism to make the kernel
jump to a user-space address of his choice, which contains either an injected shell
code or specific library functions, thus escalating to full control of the process
execution.
Where should I document this ?
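To make the mechanism concrete, here is a user-space model of the signature check: the 4 bytes immediately preceding the abort handler must match the signature the thread registered. The byte buffer standing in for process text and the function name check_abort_signature() are assumptions of this sketch, not kernel code.

```c
#include <stdint.h>
#include <string.h>

/*
 * 'text' models the process's code, 'abort_off' is the offset of the
 * abort handler within it. Returns 0 when the 4 bytes before the abort
 * handler match the registered signature, -1 otherwise.
 */
static int check_abort_signature(const uint8_t *text, size_t text_len,
				 size_t abort_off, uint32_t registered_sig)
{
	uint32_t sig;

	/* The signature must fit entirely before the abort handler. */
	if (abort_off < sizeof(sig) || abort_off > text_len)
		return -1;
	memcpy(&sig, text + abort_off - sizeof(sig), sizeof(sig));
	return sig == registered_sig ? 0 : -1;
}
```

An attacker who can overwrite the rseq_cs descriptor but cannot place the signature right before the chosen target address fails this check, so the kernel refuses to redirect execution there.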
>
>> + int ret;
>> +
>> + /* Get thread flags. */
>> + ret = __get_user(flags, &t->rseq->flags);
>> + if (ret)
>> + return ret;
>> +
>> + /* Take critical section flags into account. */
>> + flags |= cs_flags;
>> +
>> + /*
>> + * Restart on signal can only be inhibited when restart on
>> + * preempt and restart on migrate are inhibited too. Otherwise,
>> + * a preempted signal handler could fail to restart the prior
>> + * execution context on sigreturn.
>> + */
>> + if (unlikely(flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL)) {
>> + if ((flags & (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>> + | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)) !=
>> + (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>> + | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT))
>> + return -EINVAL;
>
> Please put operators at the end of the previous line, not at the start
> of the new line when you have to break statements.
>
> Also, that's unreadable.
>
> #define RSEQ_CS_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
> RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
> RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
>
> if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
> (flags & RSEQ_CS_FLAGS) != RSEQ_CS_FLAGS))
> return -EINVAL;
>
Based on your suggestion:
#define RSEQ_CS_PREEMPT_MIGRATE_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | \
                                       RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)

        if (unlikely((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
                     (flags & RSEQ_CS_PREEMPT_MIGRATE_FLAGS) !=
                     RSEQ_CS_PREEMPT_MIGRATE_FLAGS))
                return -EINVAL;
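For reference, the same condition as a stand-alone function that can be exercised from user-space. The bit values assigned to the flags below are assumptions made for this sketch, and check_flags() is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>

/* Bit values are assumptions for this sketch. */
#define RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT (1U << 0)
#define RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL  (1U << 1)
#define RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE (1U << 2)

#define RSEQ_CS_PREEMPT_MIGRATE_FLAGS \
	(RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE | RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT)

/*
 * Inhibiting restart on signal is only accepted when restart on
 * preempt and restart on migrate are both inhibited as well.
 */
static int check_flags(uint32_t flags)
{
	if ((flags & RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL) &&
	    (flags & RSEQ_CS_PREEMPT_MIGRATE_FLAGS) !=
	    RSEQ_CS_PREEMPT_MIGRATE_FLAGS)
		return -EINVAL;
	return 0;
}
```

Setting the signal flag alone, or with only one of the two other flags, yields -EINVAL; every other combination passes.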
>> +}
>> +
>> +static int clear_rseq_cs(struct task_struct *t)
>> +{
>> + unsigned long ptr = 0;
>> +
>> + /*
>> + * The rseq_cs field is set to NULL on preemption or signal
>> + * delivery on top of rseq assembly block, as well as on top
>> + * of code outside of the rseq assembly block. This performs
>> + * a lazy clear of the rseq_cs field.
>> + *
>> + * Set rseq_cs to NULL with single-copy atomicity.
>> + */
>> + return __put_user(ptr, &t->rseq->rseq_cs);
>
> __put_user(0UL, &t->rseq->rseq_cs); ?
Yes.
>
>> +}
>> +
>> +static int rseq_ip_fixup(struct pt_regs *regs)
>> +{
>> + unsigned long ip = instruction_pointer(regs), start_ip = 0,
>> + post_commit_offset = 0, abort_ip = 0;
>
> valid C, but yuck. Just have two 'unsigned long' lines.
>
> Also, why the =0, the below call to rseq_get_rseq_cs() will either
> initialize of fail.
rseq_get_rseq_cs() can return 0 (success) if __get_user finds a
NULL pointer in the @rseq_cs TLS field. I'll use a
struct rseq_cs * parameter instead, and memset to 0 only in that
specific success case within rseq_get_rseq_cs() rather than always
initializing to 0 in its caller.
>
>
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * Handle potentially not being within a critical section.
>> + * Unsigned comparison will be true when
>> + * ip >= start_ip, and when ip < start_ip + post_commit_offset.
>> + */
>> + if (ip - start_ip < post_commit_offset)
>> + in_rseq_cs = true;
>> +
>> + /*
>> + * If not nested over a rseq critical section, restart is
>> + * useless. Clear the rseq_cs pointer and return.
>> + */
>> + if (!in_rseq_cs)
>> + return clear_rseq_cs(t);
>
>
> That all seems needlessly complicated; isn't:
>
> if (ip - start_ip >= post_commit_offset)
> return clear_rseq_cs();
>
> equivalent? Nothing seems to use that variable after this.
Yep, Boqun already pointed it out. Fixed.
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Mar 28, 2018 at 11:37:06AM -0400, Mathieu Desnoyers wrote:
> ----- On Mar 28, 2018, at 11:28 AM, Peter Zijlstra [email protected] wrote:
>
> > On Wed, Mar 28, 2018 at 11:14:05AM -0400, Mathieu Desnoyers wrote:
> >
> >> > If at all possible I would make it SIGSEGV when issueing SYSCALL()s from
> >> > within an RSEQ.
> >>
> >> What's the goal there ? rseq critical sections can technically do system calls
> >> if they wish. Why prevent this ?
> >
> > This all started as a way to do 'small' _fast_ per-cpu ops, System calls
> > do NOT fit in that pattern. If you're willing to do a system calls the
> > cost of atomics is not a problem.
>
> I'm not arguing that a typical rseq would do a system call. I'm merely
> pointing out that if we start putting arbitrary limitations like "SIGSEGV
> when a fork or system call is encountered on top of rseq", this will cause
> pain in user-space.
I don't think disallowing system calls is arbitrary. And I think that is
something we really want to enforce, because it's batshit insane to
allow.
And if we allow now, people _will_ use it and we can't ever take it
away again.
----- On Mar 28, 2018, at 11:22 AM, Peter Zijlstra [email protected] wrote:
> On Tue, Mar 27, 2018 at 12:05:31PM -0400, Mathieu Desnoyers wrote:
>
>> 1) Allow algorithms to perform per-cpu data migration without relying on
>> sched_setaffinity()
>>
>> The use-cases are migrating memory between per-cpu memory free-lists, or
>> stealing tasks from other per-cpu work queues: each require that
>> accesses to remote per-cpu data structures are performed.
>
> I think that one completely reduces to the per-cpu (spin)lock case,
> right? Because, as per the below, your logging case (8) can 'easily' be
> done without the cpu_opv monstrosity.
>
> And if you can construct a per-cpu lock, that can be used to construct
> aribtrary logic.
The per-cpu spinlock does not have the same performance characteristics
as lock-free alternatives for various operations. A rseq compare-and-store
is faster than a rseq spinlock for linked-list operations.
>
> And the difficult case for the per-cpu lock is the remote acquire; all
> the other cases are (relatively) trivial.
>
> I've not really managed to get anything sensible to work, I've tried
> several variations of split lock, but you invariably end up with
> barriers in the fast (local) path, which sucks.
>
> But I feel this should be solvable without cpu_opv. As in, I really hate
> that thing ;-)
I have not developed cpu_opv out of any kind of love for that solution.
I just realized that it did solve all my issues after failing for quite
some time to implement acceptable solutions for the remote access
problem, and for ensuring progress of single-stepping with current
debuggers that don't know about the rseq_table section.
>
>> 8) Allow libraries with multi-part algorithms to work on same per-cpu
>> data without affecting the allowed cpu mask
>>
>> The lttng-ust tracer presents an interesting use-case for per-cpu
>> buffers: the algorithm needs to update a "reserve" counter, serialize
>> data into the buffer, and then update a "commit" counter _on the same
>> per-cpu buffer_. Using rseq for both reserve and commit can bring
>> significant performance benefits.
>>
>> Clearly, if rseq reserve fails, the algorithm can retry on a different
>> per-cpu buffer. However, it's not that easy for the commit. It needs to
>> be performed on the same per-cpu buffer as the reserve.
>>
>> The cpu_opv system call solves that problem by receiving the cpu number
>> on which the operation needs to be performed as argument. It can push
>> the task to the right CPU if needed, and perform the operations there
>> with preemption disabled.
>>
>> Changing the allowed cpu mask for the current thread is not an
>> acceptable alternative for a tracing library, because the application
>> being traced does not expect that mask to be changed by libraries.
>
> We talked about this use-case, and it can be solved without cpu_opv if
> you keep a dual commit counter, one local and one (atomic) remote.
Right.
>
> We retain the cpu_id from the first rseq, and the second part will, when
> it (unlikely) finds it runs remotely, do an atomic increment on the
> remote counter. The consumer of the counter will then have to sum both
> the local and remote counter parts.
Yes, I did a prototype of this specific case with split-counters a while
ago. However, if we need cpu_opv as fallback for other reasons (e.g. remote
accesses), then the split-counters are not needed, and there is no need to
change the layout of user-space data to accommodate the extra per-cpu
counter.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Mar 28, 2018, at 1:49 PM, Peter Zijlstra [email protected] wrote:
> On Wed, Mar 28, 2018 at 11:37:06AM -0400, Mathieu Desnoyers wrote:
>> ----- On Mar 28, 2018, at 11:28 AM, Peter Zijlstra [email protected] wrote:
>>
>> > On Wed, Mar 28, 2018 at 11:14:05AM -0400, Mathieu Desnoyers wrote:
>> >
>> >> > If at all possible I would make it SIGSEGV when issueing SYSCALL()s from
>> >> > within an RSEQ.
>> >>
>> >> What's the goal there ? rseq critical sections can technically do system calls
>> >> if they wish. Why prevent this ?
>> >
>> > This all started as a way to do 'small' _fast_ per-cpu ops, System calls
>> > do NOT fit in that pattern. If you're willing to do a system calls the
>> > cost of atomics is not a problem.
>>
>> I'm not arguing that a typical rseq would do a system call. I'm merely
>> pointing out that if we start putting arbitrary limitations like "SIGSEGV
>> when a fork or system call is encountered on top of rseq", this will cause
>> pain in user-space.
>
> I don't think disallowing system calls is arbitrary. And I think that is
> something we really want to enforce, because it's batshit insane to
> allow.
>
> And if we allow now, people _will_ use it and we can't ever take it
> away again.
Here are some examples of how I would like to use system calls within
rseq critical sections, for testing purposes:
- Issue poll(NULL, 0, ms_timeout) from a rseq critical section, to introduce
a delay in the critical section and test the effect,
- Issue sched_yield() from a rseq critical section, to introduce preemption at
that point,
- Issue kill() on self, thus testing interruption by signals over rseq c.s.,
- Invoke sched_setaffinity to tweak the cpu affinity mask to force thread
migration within a rseq c.s.
I currently have only implemented the poll(), sched_yield() and kill()
test-cases outside of the rseq critical sections, instead relying on
assembly loops to introduce delays in rseq c.s. However, if we disallow
system calls in rseq critical sections, I'll never be able to use those
system calls to extend the test matrix.
I see other use-cases where having a system call in a rseq critical section
could make sense: if vDSO data shared between kernel and user-space rely
on rseq for synchronization, but a fallback sometimes needs to issue a system
call for part of the operation.
Therefore I'd really want to keep allowing system calls within rseq critical
sections, even though we don't expect this to be the typical use-case.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, 28 Mar 2018, Mathieu Desnoyers wrote:
> ----- On Mar 28, 2018, at 1:49 PM, Peter Zijlstra [email protected] wrote:
> > I don't think disallowing system calls is arbitrary. And I think that is
> > something we really want to enforce, because it's batshit insane to
> > allow.
> >
> > And if we allow now, people _will_ use it and we can't ever take it
> > away again.
>
> Here are some examples of how I would like to use system calls within
> rseq critical sections, for testing purposes:
>
> - Issue poll(NULL, 0, ms_timeout) from a rseq critical section, to introduce
> a delay in the critical section and test the effect,
It's simple enough to use a delay loop for that. It's testing after all.
> - Issue sched_yield() from a rseq critical section, to introduce preemption at
> that point,
Make it loop on a variable and use secondary threads to force preemption.
> - Issue kill() on self, thus testing interruption by signals over rseq c.s.,
Second thread can do that
> - Invoke sched_setaffinity to tweak the cpu affinity mask to force thread
> migration within a rseq c.s.
Second thread can do that
> I currently have only implemented the poll(), sched_yield() and kill()
> test-cases outside of the rseq critical sections, instead relying on
> assembly loops to introduce delays in rseq c.s.. However, if we disallow
> system calls in rseq critical sections, I'll never be able to use those
> systems calls to extend the test matrix.
All of these tests can be implemented without system calls and there is no
justification to allow system calls just because it makes writing test
cases simpler. Nice try.
> I see other use-cases where having a system call in a rseq critical section
> could make sense: if vDSO data shared between kernel and user-space rely
> on rseq for synchronization, but a fallback sometimes needs to issue a system
> call for part of the operation.
What in the VDSO relies on rseqs? Nothing AFAICT. If the VDSO ever goes to
use that then it's going to be a kernel/vdso specific variant and we'll
figure out how that needs to be handled if at all.
But we are not misdesigning now to accommodate artificial scenarios dreamed
up for argumentation's sake.
> Therefore I'd really want to keep allowing system calls within rseq critical
> sections, even though we don't expect this to be the typical use-case.
Syscalls inside rseq sections make no sense whatsoever, unless you can
rollback the message you just sent through the intertubes when the rseq
loop failed the taste test. If that works we might reconsider.
Thanks,
tglx
----- On Mar 28, 2018, at 5:25 PM, Thomas Gleixner [email protected] wrote:
> On Wed, 28 Mar 2018, Mathieu Desnoyers wrote:
>> ----- On Mar 28, 2018, at 1:49 PM, Peter Zijlstra [email protected] wrote:
>> > I don't think disallowing system calls is arbitrary. And I think that is
>> > something we really want to enforce, because it's batshit insane to
>> > allow.
>> >
>> > And if we allow now, people _will_ use it and we can't ever take it
>> > away again.
>>
>> Here are some examples of how I would like to use system calls within
>> rseq critical sections, for testing purposes:
>>
>> - Issue poll(NULL, 0, ms_timeout) from a rseq critical section, to introduce
>> a delay in the critical section and test the effect,
>
> It's simple enough to use a delay loop for that. It's testing after all.
>
>> - Issue sched_yield() from a rseq critical section, to introduce preemption at
>> that point,
>
> Make it loop on a varible and use secondary threads to force preemption.
>
>> - Issue kill() on self, thus testing interruption by signals over rseq c.s.,
>
> Second thread can do that
>
>> - Invoke sched_setaffinity to tweak the cpu affinity mask to force thread
>> migration within a rseq c.s.
>
> Second thread can do that
>
>> I currently have only implemented the poll(), sched_yield() and kill()
>> test-cases outside of the rseq critical sections, instead relying on
>> assembly loops to introduce delays in rseq c.s.. However, if we disallow
>> system calls in rseq critical sections, I'll never be able to use those
>> systems calls to extend the test matrix.
>
> All of these tests can be implemented without system calls and there is no
> justification to allow system calls just because it makes writing test
> cases simpler. Nice try.
You bring good points. As a logical consequence, I indeed don't need to issue
system calls from rseq c.s. for testing.
>
>> I see other use-cases where having a system call in a rseq critical section
>> could make sense: if vDSO data shared between kernel and user-space rely
>> on rseq for synchronization, but a fallback sometimes needs to issue a system
>> call for part of the operation.
>
> What in the VDSO relies on rseqs? Nothing AFAICT. If the VDSO ever goes to
> use that then it's going to be a kernel/vdso specific variant and we'll
> figure out how that needs to be handled if at all.
Sure, and we can craft the vDSO so the system call does not need to be
issued within a rseq c.s.. So this one is a non-issue I think.
>
> But we are not misdesigning now to accomodate artificial scenarios dreamed
> up for argumentation sake,
If we decide to impose limitations on the rseq c.s. abilities, we need to
think this through very carefully.
Let's say we disallow system calls from rseq critical sections. A few points
arise:
- We still need to allow traps (page faults, breakpoints, ...) within rseq c.s.,
- We still need to allow interrupts within rseq c.s.,
- We need to decide whether we just document that syscalls within rseq c.s.
are not supported, or we enforce a behavior if this happens (e.g. SIGSEGV).
If we enforce a SIGSEGV, we'd have to figure out whether it's worth it to
add extra branches to the system call fast path to validate this.
- If we document that syscalls are not supported within rseq c.s., we should
specify whether doing so terminates the process, or if it merely does not
guarantee proper abort behavior of the critical section.
- We need to carefully consider the case of system calls issued within signal
handlers nested on top of rseq. When RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL is
_not_ set, neither in the rseq c.s. descriptor nor in the TLS @flags,
it's pretty much straightforward: upon signal delivery, the kernel moves the
ip to abort, and clears the tls @rseq_cs pointer. This means that any system
call issued within the signal handler is not actually within the rseq c.s.
upon which the signal is nested.
The case I worry about is if a thread sets the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
flag in its TLS @flags field (useful in a debugging scenario where we want a
debugger to single-step through the rseq c.s. and observe registers at each step).
Arguably, this is only ever used in development. However, it does allow a situation
where a system call executed within a signal handler can nest over a rseq c.s..
So if we choose to be very strict and SIGSEGV any syscall nested over rseq
c.s., we may very well end up killing the process for no good reason in this
scenario.
- We need to decide whether all syscalls are disallowed, or if we want to pick
specific ones (e.g. fork()).
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Mar 29, 2018 at 09:54:01AM -0400, Mathieu Desnoyers wrote:
> Let's say we disallow system calls from rseq critical sections. A few points
> arise:
>
> - We still need to allow traps (page faults, breakpoints, ...) within rseq c.s.,
>
> - We still need to allow interrupts within rseq c.s.,
Sure, but all those are different entry points, so that shouldn't be a
problem.
> - We need to decide whether we just document that syscalls within rseq c.s.
> are not supported, or we enforce a behavior if this happens (e.g. SIGSEGV).
> If we enforce a SIGSEGV, we'd have to figure out whether it's worth it to
> add extra branches to the system call fast path to validate this.
Without enforcement someone will eventually do this :/ We might (maybe)
get away with it being a debug option somewhere, but even that sounds
like trouble.
> - We need to carefully consider the case of system calls issued within signal
> handlers nested on top of rseq. When RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL is
> _not_ set, neither in the rseq c.s. descriptor nor in the TLS @flags,
> it's pretty much straightforward: upon signal delivery, the kernel moves the
> ip to abort, and clears the tls @rseq_cs pointer. This means that any system
> call issued within the signal handler is not actually within the rseq c.s.
> upon which the signal is nested.
>
> The case I worry about is if a thread sets the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> flag in its TLS @flags field (useful in a debugging scenario where we want a
> debugger to single-step through the rseq c.s. and observe registers at each step).
> Arguably, this is only ever used in development. However, it does allow a situation
> where a system call executed within a signal handler can nest over a rseq c.s..
> So if we choose to be very strict and SIGSEGV any syscall nested over rseq
> c.s., we may very well end up killing the process for no good reason in this
> scenario.
Yes, that needs a little thought; but when we run the signal handler,
the IP would no longer be inside the active RSEQ, right?
> - We need to decide whether all syscalls are disallowed, or if we want to pick
> specific ones (e.g. fork()).
All.
----- On Mar 29, 2018, at 10:23 AM, Peter Zijlstra [email protected] wrote:
> On Thu, Mar 29, 2018 at 09:54:01AM -0400, Mathieu Desnoyers wrote:
>> Let's say we disallow system calls from rseq critical sections. A few points
>> arise:
>>
>> - We still need to allow traps (page faults, breakpoints, ...) within rseq c.s.,
>>
>> - We still need to allow interrupts within rseq c.s.,
>
> Sure, but all those are different entry points, so that shouldn't be a
> problem.
Yes, indeed.
>
>> - We need to decide whether we just document that syscalls within rseq c.s.
>> are not supported, or we enforce a behavior if this happens (e.g. SIGSEGV).
>> If we enforce a SIGSEGV, we'd have to figure out whether it's worth it to
>> add extra branches to the system call fast path to validate this.
>
> Without enforcement someone will eventually do this :/ We might (maybe)
> get away with it being a debug option somewhere, but even that sounds
> like trouble.
I find it unlikely that someone will issue a syscall from a rseq critical
section without really intending it. The system call would need to be
crafted within the rseq assembly block.
Enforcing SIGSEGV on syscall entry when nested in a rseq critical section
will not be free, either in terms of syscall overhead or in terms of code
maintenance: we'd need to add those checks into entry.S for each architecture
supported, which pretty much doubles the amount of architecture-specific
code we need to implement for rseq. Currently, all we need is to hook in
signal delivery and wire up the system call numbers.
If there is some clever arch-agnostic way to enforce SIGSEGV in those
situations, I'm all ears. But I don't think it's worthwhile to enforce
this if it costs in terms of system call speed and adds extra arch-specific
code to maintain.
We could simply document that issuing a system call within a rseq critical
section will cause the restart behavior (whether the critical section is
restarted or not) to be undefined.
>
>> - We need to carefully consider the case of system calls issued within signal
>> handlers nested on top of rseq. When RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL is
>> _not_ set, neither in the rseq c.s. descriptor nor in the TLS @flags,
>> it's pretty much straightforward: upon signal delivery, the kernel moves the
>> ip to abort, and clears the tls @rseq_cs pointer. This means that any system
>> call issued within the signal handler is not actually within the rseq c.s.
>> upon which the signal is nested.
>>
>> The case I worry about is if a thread sets the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
>> flag in its TLS @flags field (useful in a debugging scenario where we want a
>> debugger to single-step through the rseq c.s. and observe registers at each step).
>> Arguably, this is only ever used in development. However, it does allow a situation
>> where a system call executed within a signal handler can nest over a rseq c.s..
>> So if we choose to be very strict and SIGSEGV any syscall nested over rseq
>> c.s., we may very well end up killing the process for no good reason in this
>> scenario.
>
> Yes, that needs a little thought; but when we run the signal handler,
> the IP would no longer be inside the active RSEQ, right?
Good point, I missed that. So yes, even with the RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
flag set, the instruction pointer comparison would detect that we're not actually
running in the rseq critical section if a syscall is issued from the signal handler.
>
>> - We need to decide whether all syscalls are disallowed, or if we want to pick
>> specific ones (e.g. fork()).
>
> All.
I'm fine with that.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, 29 Mar 2018 11:39:00 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:
> Enforcing SIGSEGV on syscall entry when nested in a rseq critical section
> will not be free both in terms of syscall overhead, and in terms of code
> maintenance: we'd need to add those checks into entry.S for each architecture
> supported, which pretty much doubles the amount of architecture-specific
> code we need to implement for rseq. Currently, all we need is to hook in
> signal delivery and wire up the system call numbers.
Why not have the check on syscall exit? Then we could use the ptrace
type mechanism to only go that path when a rseq exists for the program.
-- Steve
On Thu, 29 Mar 2018 14:02:33 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:
> Currently, anyone using ptrace on a process has pretty much given up all
> hopes of performance. Processes will use rseq to gain performance, not the
> opposite, so this deterioration will be unwelcome.
The ptrace path has nothing to do with ptrace anymore, and it would probably
be hard to notice the performance hit. You simply set a TIF flag, and on
exit of the syscall it jumps to a path that checks special cases
(tracing system calls being one of them). It's called the ptrace path
because ptrace was the first one to use it (I'm guessing, I haven't
actually looked at the history).
This is used to add any system call checks that are not done during
normal operation. And this certainly falls under that category.
-- Steve
----- On Mar 29, 2018, at 2:07 PM, rostedt [email protected] wrote:
> On Thu, 29 Mar 2018 14:02:33 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> Currently, anyone using ptrace on a process has pretty much given up all
>> hopes of performance. Processes will use rseq to gain performance, not the
>> opposite, so this deterioration will be unwelcome.
>
> The ptrace path has nothing to do with ptrace anymore, and probably be
> hard to notice the performance hit. You simply set a TIF flag, and on
> exit of the syscall it jumps to a path that checks special cases
> (tracing system calls being one of them). It's called the ptrace path
> because ptrace was the first one to use it (I'm guessing, I haven't
> actually looked at the history).
Last time I checked, it's not only a jump, it's actually saving/restoring
tons of registers. Did this change recently ?
I use it for LTTng syscall tracing too. My experience so far is that it's really
terribly slow. I've been waiting on Andy Lutomirski to complete his changes in that
area to look into making this faster for syscall tracepoints.
>
> This is used to add any system call checks that are not done during
> normal operation. And this certainly falls under that category.
I know it's used for stuff like seccomp too. My guess has always been that security
people care much more about robustness than performance.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Mar 29, 2018, at 12:24 PM, rostedt [email protected] wrote:
> On Thu, 29 Mar 2018 11:39:00 -0400 (EDT)
> Mathieu Desnoyers <[email protected]> wrote:
>
>> Enforcing SIGSEGV on syscall entry when nested in a rseq critical section
>> will not be free both in terms of syscall overhead, and in terms of code
>> maintenance: we'd need to add those checks into entry.S for each architecture
>> supported, which pretty much doubles the amount of architecture-specific
>> code we need to implement for rseq. Currently, all we need is to hook in
>> signal delivery and wire up the system call numbers.
>
> Why not have the check on syscall exit? Then we could use the ptrace
> type mechanism to only go that path when a rseq exists for the program.
Currently, anyone using ptrace on a process has pretty much given up all
hopes of performance. Processes will use rseq to gain performance, not the
opposite, so this deterioration will be unwelcome.
One thing I would find more acceptable is to only compile in this check under
a CONFIG_DEBUG_RSEQ option or something similar. This means we can then put
the check at the most convenient location without caring too much about its
performance impact.
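
As a rough illustration of that option, here is a minimal userspace sketch
of a check that only exists when a debug option is set. The names
struct task_sketch, in_rseq_cs and rseq_debug_syscall_violation() are
hypothetical and not part of this series; CONFIG_DEBUG_RSEQ is the option
name suggested above, assumed here to be a plain preprocessor symbol.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-thread state; not a kernel structure. */
struct task_sketch {
	bool in_rseq_cs;	/* thread is inside a rseq critical section */
};

#ifdef CONFIG_DEBUG_RSEQ
/* Debug builds: flag a system call issued from within a critical section. */
static inline bool rseq_debug_syscall_violation(const struct task_sketch *t)
{
	return t->in_rseq_cs;
}
#else
/* Other builds: the check compiles away and never reports a violation. */
static inline bool rseq_debug_syscall_violation(const struct task_sketch *t)
{
	(void)t;
	return false;
}
#endif
```

With the option unset, callers pay nothing for the check, which is why its
placement would not need to be tuned for performance.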
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, 29 Mar 2018 14:35:16 -0400 (EDT)
Mathieu Desnoyers <[email protected]> wrote:
> ----- On Mar 29, 2018, at 2:07 PM, rostedt [email protected] wrote:
>
> > On Thu, 29 Mar 2018 14:02:33 -0400 (EDT)
> > Mathieu Desnoyers <[email protected]> wrote:
> >
> >> Currently, anyone using ptrace on a process has pretty much given up all
> >> hopes of performance. Processes will use rseq to gain performance, not the
> >> opposite, so this deterioration will be unwelcome.
> >
> > The ptrace path has nothing to do with ptrace anymore, and it would
> > probably be hard to notice the performance hit. You simply set a TIF
> > flag, and on exit of the syscall it jumps to a path that checks special
> > cases (tracing system calls being one of them). It's called the ptrace
> > path because ptrace was the first one to use it (I'm guessing, I haven't
> > actually looked at the history).
>
> Last time I checked, it's not only a jump, it's actually saving/restoring
> tons of registers. Did this change recently ?
>
> I use it for LTTng syscall tracing too. My experience so far is that it's really
> terribly slow. I've been waiting on Andy Lutomirski to complete his changes in that
> area to look into making this faster for syscall tracepoints.
This gives us more incentive to help Andy make it faster ;-)
-- Steve
>
> >
> > This is used to add any system call checks that are not done during
> > normal operation. And this certainly falls under that category.
>
> I know it's used for stuff like seccomp too. My guess has always been that security
> people care much more about robustness than performance.
>
> Thanks,
>
> Mathieu
>
>
On Thu, 29 Mar 2018 14:46:05 -0400
Steven Rostedt <[email protected]> wrote:
> This gives us more incentive to help Andy make it faster ;-)
And with Meltdown, I doubt this makes as big of a difference anymore :-/
-- Steve
On Tue, 27 Mar 2018 12:05:23 -0400
Mathieu Desnoyers <[email protected]> wrote:
> Expose a new system call allowing each thread to register one userspace
> memory area to be used as an ABI between kernel and user-space for two
> purposes: user-space restartable sequences and quick access to read the
> current CPU number value from user-space.
What is the *worst* case timing achievable by using the atomics ? What
does it do to real time performance requirements ? For cpu_opv you now
give an answer but your answer is assuming there isn't another thread
actively thrashing the cache or store buffers, and that the user didn't
sneakily pass in a page of uncacheable memory (eg framebuffer, or GPU
space).
I don't see anything that restricts it to cached pages. With that check
in place for x86 at least it would probably be ok and I think the sneaky
attacks to make it uncacheable would fail because you've got the pages
locked so trying to give them to an accelerator will block until you are
done.
I still like the idea; it's just the latencies that concern me.
> Restartable sequences are atomic with respect to preemption
> (making it atomic with respect to other threads running on the
> same CPU), as well as signal delivery (user-space execution
> contexts nested over the same thread).
CPU generally means 'big lump with legs on it'. You are not atomic to the
same CPU, because that CPU may have 30+ cores with 8 threads per core.
It could do with some better terminology (hardware thread, CPU context ?)
> In a typical usage scenario, the thread registering the rseq
> structure will be performing loads and stores from/to that
> structure. It is however also allowed to read that structure
> from other threads. The rseq field updates performed by the
> kernel provide relaxed atomicity semantics, which guarantee
> that other threads performing relaxed atomic reads of the cpu
> number cache will always observe a consistent value.
So what happens to your API if the kernel atomics get improved ? You are
effectively exporting rseq behaviour from private to public.
Alan
On Sun, 1 Apr 2018, Alan Cox wrote:
> > Restartable sequences are atomic with respect to preemption
> > (making it atomic with respect to other threads running on the
> > same CPU), as well as signal delivery (user-space execution
> > contexts nested over the same thread).
>
> CPU generally means 'big lump with legs on it'. You are not atomic to the
> same CPU, because that CPU may have 30+ cores with 8 threads per core.
>
> It could do with some better terminology (hardware thread, CPU context ?)
Well we call it a "CPU" in the scheduler context I think. We could use
better terminology throughout the kernel tools and source.
Hardware Execution Context?
> > In a typical usage scenario, the thread registering the rseq
> > structure will be performing loads and stores from/to that
> > structure. It is however also allowed to read that structure
> > from other threads. The rseq field updates performed by the
> > kernel provide relaxed atomicity semantics, which guarantee
> > that other threads performing relaxed atomic reads of the cpu
> > number cache will always observe a consistent value.
>
> So what happens to your API if the kernel atomics get improved ? You are
> effectively exporting rseq behaviour from private to public.
There is already a pretty complex coherency model guiding kernel atomics.
Improvements/changes to that are difficult and the effect will ripple
throughout the kernel. So I would suggest that these areas of the kernel
are pretty "petrified" (or written in stone).
On Mon, Apr 02, 2018 at 10:03:58AM -0500, Christopher Lameter wrote:
> On Sun, 1 Apr 2018, Alan Cox wrote:
>
> > > Restartable sequences are atomic with respect to preemption
> > > (making it atomic with respect to other threads running on the
> > > same CPU), as well as signal delivery (user-space execution
> > > contexts nested over the same thread).
> >
> > CPU generally means 'big lump with legs on it'. You are not atomic to the
> > same CPU, because that CPU may have 30+ cores with 8 threads per core.
> >
> > It could do with some better terminology (hardware thread, CPU context ?)
>
> Well we call it a "CPU" in the scheduler context I think. We could use
> better terminology throughout the kernel tools and source.
Agreed, it has been "CPU" for "single hardware thread" for a very long
time. People tend to use "core" for "group of hardware threads" and
"socket" for "big lump with legs on it".
> Hardware Execution Context?
Should be even more fun when non-CPU hardware execution contexts show
up in force within each core. ;-)
But the terminology in place for non-CPU hardware execution contexts
should be able to survive that event.
> > > In a typical usage scenario, the thread registering the rseq
> > > structure will be performing loads and stores from/to that
> > > structure. It is however also allowed to read that structure
> > > from other threads. The rseq field updates performed by the
> > > kernel provide relaxed atomicity semantics, which guarantee
> > > that other threads performing relaxed atomic reads of the cpu
> > > number cache will always observe a consistent value.
> >
> > So what happens to your API if the kernel atomics get improved ? You are
> > effectively exporting rseq behaviour from private to public.
>
> There is already a pretty complex coherency model guiding kernel atomics.
> Improvements/changes to that are difficult and the effect will ripple
> throughout the kernel. So I would suggest that these areas of the kernel
> are pretty "petrified" (or written in stone).
I suspect that there are much more pressing areas of confusion in any
case!
Thanx, Paul
----- On Apr 1, 2018, at 12:13 PM, One Thousand Gnomes [email protected] wrote:
> On Tue, 27 Mar 2018 12:05:23 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
>> Expose a new system call allowing each thread to register one userspace
>> memory area to be used as an ABI between kernel and user-space for two
>> purposes: user-space restartable sequences and quick access to read the
>> current CPU number value from user-space.
>
> What is the *worst* case timing achievable by using the atomics ? What
> does it do to real time performance requirements ?
Given that there are two system calls introduced in this series (rseq and
cpu_opv), can you clarify which system call you refer to in the two questions
above ?
For rseq, given that its userspace works pretty much like a read seqlock
(it retries on failure), it has no impact whatsoever on scheduler behavior.
So characterizing its worst case timing does not appear to be relevant.
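
To illustrate the comparison, here is a minimal sketch of a seqlock-style
reader that retries on failure, written with C11 atomics. It is an analogy
only: rseq itself does not use a sequence counter, and struct seq_pair and
the seq_pair_*() helpers are hypothetical names. A single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

struct seq_pair {
	atomic_uint seq;	/* even: stable, odd: write in progress */
	_Atomic uint64_t a;
	_Atomic uint64_t b;
};

static void seq_pair_init(struct seq_pair *p)
{
	atomic_init(&p->seq, 0);
	atomic_init(&p->a, 0);
	atomic_init(&p->b, 0);
}

/* Single writer: make seq odd, update the data, then make seq even again. */
static void seq_pair_write(struct seq_pair *p, uint64_t a, uint64_t b)
{
	unsigned int s = atomic_load_explicit(&p->seq, memory_order_relaxed);

	atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->a, a, memory_order_relaxed);
	atomic_store_explicit(&p->b, b, memory_order_relaxed);
	atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader: never blocks the writer, simply retries if it raced with one. */
static void seq_pair_read(struct seq_pair *p, uint64_t *a, uint64_t *b)
{
	unsigned int s0, s1;

	do {
		s0 = atomic_load_explicit(&p->seq, memory_order_acquire);
		*a = atomic_load_explicit(&p->a, memory_order_relaxed);
		*b = atomic_load_explicit(&p->b, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&p->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);
}
```

The point of the analogy is that the cost of a conflict is paid by the
retrying reader in user-space, not by the scheduler.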
> For cpu_opv you now
> give an answer but your answer is assuming there isn't another thread
> actively thrashing the cache or store buffers, and that the user didn't
> sneakily pass in a page of uncacheable memory (eg framebuffer, or GPU
> space).
Are those considered as device pages ?
>
> I don't see anything that restricts it to cached pages. With that check
> in place for x86 at least it would probably be ok and I think the sneaky
> attacks to make it uncacheable would fail because you've got the pages
> locked so trying to give them to an accelerator will block until you are
> done.
>
> I still like the idea; it's just the latencies that concern me.
Indeed, cpu_opv touches pages that are shared with user-space with
preemption off, so this one affects the scheduler latency. The worst-case
timings I measured for cpu_opv were with cache-cold memory. So I expect that
another thread actively thrashing the cache would be in the same ballpark
figure. It does not account for a concurrent thread thrashing the store
buffers though.
The checks enforcing which pages can be touched by cpu_opv operations are
done within cpu_op_check_page(). is_zone_device_page() is used to ensure no
device page is touched with preempt disabled. I understand that you would
prefer to disallow pages of uncacheable memory as well, which I'm fine with.
Is there an API similar to is_zone_device_page() to check whether a page is
uncacheable ?
>
>> Restartable sequences are atomic with respect to preemption
>> (making it atomic with respect to other threads running on the
>> same CPU), as well as signal delivery (user-space execution
>> contexts nested over the same thread).
>
> CPU generally means 'big lump with legs on it'. You are not atomic to the
> same CPU, because that CPU may have 30+ cores with 8 threads per core.
>
> It could do with some better terminology (hardware thread, CPU context ?)
Would you be OK with Christoph's terminology of "Hardware Execution Context" ?
>
>> In a typical usage scenario, the thread registering the rseq
>> structure will be performing loads and stores from/to that
>> structure. It is however also allowed to read that structure
>> from other threads. The rseq field updates performed by the
>> kernel provide relaxed atomicity semantics, which guarantee
>> that other threads performing relaxed atomic reads of the cpu
>> number cache will always observe a consistent value.
>
> So what happens to your API if the kernel atomics get improved ? You are
> effectively exporting rseq behaviour from private to public.
Relaxed atomics are pretty much the loosest kind of consistency we can
provide before we start allowing the compiler to do load/store tearing
(it's basically a volatile store of a word-aligned word). They do not
involve any kind of memory barrier whatsoever. I expect that the atomics
that may evolve in the future will be those with release/acquire and
implicit barrier semantics. The relaxed atomicity does not cover any of
these.
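
For illustration, a userspace analogue of that guarantee using C11 relaxed
atomics could look like the sketch below. struct cpu_cache_sketch and the
two helpers are hypothetical names, not the rseq ABI; the only property
shown is that the word is stored and loaded without tearing and without
any ordering against other memory accesses.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the cpu number cache; not the rseq layout. */
struct cpu_cache_sketch {
	_Atomic uint32_t cpu_id;
};

/* Updater side: one untorn word store, no barrier implied. */
static void publish_cpu(struct cpu_cache_sketch *c, uint32_t cpu)
{
	atomic_store_explicit(&c->cpu_id, cpu, memory_order_relaxed);
}

/* Any other thread: one untorn word load, no barrier implied. */
static uint32_t peek_cpu(struct cpu_cache_sketch *c)
{
	return atomic_load_explicit(&c->cpu_id, memory_order_relaxed);
}
```

A reader using peek_cpu() sees either the old or the new cpu number, never
a mix of both, but learns nothing about the ordering of surrounding stores.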
Thanks,
Mathieu
>
> Alan
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Apr 2, 2018, at 11:33 AM, Mathieu Desnoyers [email protected] wrote:
> ----- On Apr 1, 2018, at 12:13 PM, One Thousand Gnomes
> [email protected] wrote:
>
>> On Tue, 27 Mar 2018 12:05:23 -0400
>> Mathieu Desnoyers <[email protected]> wrote:
>>
>>> Expose a new system call allowing each thread to register one userspace
>>> memory area to be used as an ABI between kernel and user-space for two
>>> purposes: user-space restartable sequences and quick access to read the
>>> current CPU number value from user-space.
>>
>> What is the *worst* case timing achievable by using the atomics ? What
>> does it do to real time performance requirements ?
>
> Given that there are two system calls introduced in this series (rseq and
> cpu_opv), can you clarify which system call you refer to in the two questions
> above ?
>
> For rseq, given that its userspace works pretty much like a read seqlock
> (it retries on failure), it has no impact whatsoever on scheduler behavior.
> So characterizing its worst case timing does not appear to be relevant.
>
>> For cpu_opv you now
>> give an answer but your answer is assuming there isn't another thread
>> actively thrashing the cache or store buffers, and that the user didn't
>> sneakily pass in a page of uncacheable memory (eg framebuffer, or GPU
>> space).
>
> Are those considered as device pages ?
>
>>
>> I don't see anything that restricts it to cached pages. With that check
>> in place for x86 at least it would probably be ok and I think the sneaky
>> attacks to make it uncacheable would fail because you've got the pages
>> locked so trying to give them to an accelerator will block until you are
>> done.
>>
>> I still like the idea; it's just the latencies that concern me.
>
> Indeed, cpu_opv touches pages that are shared with user-space with
> preemption off, so this one affects the scheduler latency. The worst-case
> timings I measured for cpu_opv were with cache-cold memory. So I expect that
> another thread actively thrashing the cache would be in the same ballpark
> figure. It does not account for a concurrent thread thrashing the store
> buffers though.
>
> The checks enforcing which pages can be touched by cpu_opv operations are
> done within cpu_op_check_page(). is_zone_device_page() is used to ensure no
> device page is touched with preempt disabled. I understand that you would
> prefer to disallow pages of uncacheable memory as well, which I'm fine with.
> Is there an API similar to is_zone_device_page() to check whether a page is
> uncacheable ?
Looking into this a bit more, I notice the following: The pgprot_noncached
(_PAGE_NOCACHE on x86) pgprot is part of the vma->vm_page_prot. Therefore,
in order to have userspace provide pointers to noncached pages as input
to cpu_opv, they need to be part of a userspace vma which has a
pgprot_noncached vm_page_prot.
The cpu_opv system call uses get_user_pages_fast() to grab the struct page
from the userspace addresses, and then passes those pages to vm_map_ram(),
with a PAGE_KERNEL pgprot. This creates a temporary kernel mapping to those
pages, which is then used to read/write from/to those pages with preemption
disabled.
Therefore, with the proposed cpu_opv implementation, the kernel is not
touching noncached mappings with preemption disabled, which should take
care of your latency concern.
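
The underlying idea is that attributes belong to a mapping rather than to
the pages behind it. The sketch below is a userspace analogy only, using
protection bits instead of cacheability and a temporary file instead of
pinned user pages: the same page is mapped twice with different attributes,
and a write through one mapping is visible through the other.
alias_mapping_demo() is a hypothetical name; none of this is cpu_opv code.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 when a write through one mapping is seen through the other. */
static int alias_mapping_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char *ro_view, *rw_view;
	int fd, ret = -1;
	FILE *f;

	if (page <= 0)
		return -1;
	f = tmpfile();
	if (!f)
		return -1;
	fd = fileno(f);
	if (ftruncate(fd, page) != 0)
		goto out_close;

	/* First mapping: read-only view of the page. */
	ro_view = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	if (ro_view == MAP_FAILED)
		goto out_close;

	/* Second mapping of the same page, with its own attributes. */
	rw_view = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (rw_view == MAP_FAILED)
		goto out_unmap_ro;

	strcpy(rw_view, "cpu_opv");
	ret = strcmp(ro_view, "cpu_opv") == 0 ? 0 : -1;

	munmap(rw_view, (size_t)page);
out_unmap_ro:
	munmap(ro_view, (size_t)page);
out_close:
	fclose(f);
	return ret;
}
```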
Am I missing something ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Apr 3, 2018, at 12:36 PM, Mathieu Desnoyers [email protected] wrote:
> ----- On Apr 2, 2018, at 11:33 AM, Mathieu Desnoyers
> [email protected] wrote:
>
>> ----- On Apr 1, 2018, at 12:13 PM, One Thousand Gnomes
>> [email protected] wrote:
>>
[...]
>>> I still like the idea; it's just the latencies that concern me.
>>
[...]
>
> Looking into this a bit more, I notice the following: The pgprot_noncached
> (_PAGE_NOCACHE on x86) pgprot is part of the vma->vm_page_prot. Therefore,
> in order to have userspace provide pointers to noncached pages as input
> to cpu_opv, they need to be part of a userspace vma which has a
> pgprot_noncached vm_page_prot.
>
> The cpu_opv system call uses get_user_pages_fast() to grab the struct page
> from the userspace addresses, and then passes those pages to vm_map_ram(),
> with a PAGE_KERNEL pgprot. This creates a temporary kernel mapping to those
> pages, which is then used to read/write from/to those pages with preemption
> disabled.
>
> Therefore, with the proposed cpu_opv implementation, the kernel is not
> touching noncached mappings with preemption disabled, which should take
> care of your latency concern.
[...]
The following extra check should let userspace know it's trying to
provide a pointer to noncached memory by returning -1, errno=EFAULT.
Is the approach acceptable ?
Thanks,
Mathieu
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..0245481 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2425,6 +2425,18 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 	return follow_page_mask(vma, address, foll_flags, &unused_page_mask);
 }
 
+static inline bool is_vma_noncached(struct vm_area_struct *vma)
+{
+	pgprot_t pgprot = vma->vm_page_prot;
+
+	/* Check whether architecture implements noncached pages. */
+	if (pgprot_val(pgprot_noncached(PAGE_KERNEL)) == pgprot_val(PAGE_KERNEL))
+		return false;
+	if (pgprot_val(pgprot) != pgprot_val(pgprot_noncached(pgprot)))
+		return false;
+	return true;
+}
+
 #define FOLL_WRITE	0x01	/* check pte is writable */
 #define FOLL_TOUCH	0x02	/* mark page accessed */
 #define FOLL_GET	0x04	/* do get_page on page */
diff --git a/kernel/cpu_opv.c b/kernel/cpu_opv.c
index 197339e..e4395b4 100644
--- a/kernel/cpu_opv.c
+++ b/kernel/cpu_opv.c
@@ -362,7 +362,19 @@ static int cpu_op_pin_pages(unsigned long addr, unsigned long len,
 	int ret, nr_pages, nr_put_pages, n;
 	unsigned long _vaddr;
 	struct vaddr *va;
+	struct vm_area_struct *vma;
 
+	vma = find_vma_intersection(current->mm, addr, addr + len);
+	if (!vma)
+		return -EFAULT;
+	/*
+	 * cpu_opv() accesses its own cached mapping of the userspace pages.
+	 * Considering that concurrent noncached and cached accesses may yield
+	 * unexpected results in terms of memory consistency, explicitly
+	 * disallow cpu_opv on noncached memory.
+	 */
+	if (is_vma_noncached(vma))
+		return -EFAULT;
 	nr_pages = cpu_op_count_pages(addr, len);
 	if (!nr_pages)
 		return 0;
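
The test in is_vma_noncached() above relies on the noncached transform
being idempotent: a protection value left unchanged by it is already
noncached. A minimal userspace sketch of that logic follows. The bit layout
(SKETCH_PROT_KERNEL, SKETCH_PROT_NOCACHE) and the sketch_*() names are
hypothetical and chosen only for illustration; real pgprot encodings are
architecture-specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical protection encoding: one bit marks a noncached mapping. */
typedef uint32_t prot_sketch_t;
#define SKETCH_PROT_KERNEL	0x03u
#define SKETCH_PROT_NOCACHE	0x10u

/* Stand-in for pgprot_noncached(): add the noncached attribute. */
static prot_sketch_t sketch_noncached(prot_sketch_t prot)
{
	return prot | SKETCH_PROT_NOCACHE;
}

static bool sketch_is_noncached(prot_sketch_t prot)
{
	/* A no-op transform means this encoding has no noncached mappings. */
	if (sketch_noncached(SKETCH_PROT_KERNEL) == SKETCH_PROT_KERNEL)
		return false;
	/* Unchanged by the transform: the attribute was already set. */
	return prot == sketch_noncached(prot);
}
```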
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com