Hi,
This is the v7 of syscall user dispatch. This version is a bit
different from v6 on the following points, after the modifications
requested on that submission.
* The interface no longer receives <start>,<end> end parameters, but
<start>,<length> as suggested by PeterZ.
* Syscall User Dispatch is now done before ptrace, and this means there
is some SYSCALL_WORK_EXIT work that needs to be done. No challenges
there, but I'd like to draw attention to that region of the code that
is new in this submission.
* The previous TIF_SYSCALL_USER_DISPATCH is now handled through
SYSCALL_WORK flags.
* Introduced a new test as patch 6, which benchmarks the fast submission
path and test the return in blocked selector state.
* Nothing is architecture dependent anymore. No config switches. It
only depends on CONFIG_GENERIC_ENTRY.
Other smaller changes are documented one each commit.
This was tested using the kselftests tests in patch 5 and 6 and compiled
tested with !CONFIG_GENERIC_ENTRY.
This patchset is based on the core/entry branch of the TIP tree.
A working tree with this patchset is available at:
https://gitlab.collabora.com/krisman/linux -b syscall-user-dispatch-v7
Previous submissions are archived at:
RFC/v1: https://lkml.org/lkml/2020/7/8/96
v2: https://lkml.org/lkml/2020/7/9/17
v3: https://lkml.org/lkml/2020/7/12/4
v4: https://www.spinics.net/lists/linux-kselftest/msg16377.html
v5: https://lkml.org/lkml/2020/8/10/1320
v6: https://lkml.org/lkml/2020/9/4/1122
Gabriel Krisman Bertazi (7):
x86: vdso: Expose sigreturn address on vdso to the kernel
signal: Expose SYS_USER_DISPATCH si_code type
kernel: Implement selective syscall userspace redirection
entry: Support Syscall User Dispatch on common syscall entry
selftests: Add kselftest for syscall user dispatch
selftests: Add benchmark for syscall user dispatch
docs: Document Syscall User Dispatch
.../admin-guide/syscall-user-dispatch.rst | 87 +++++
arch/x86/entry/vdso/vdso2c.c | 2 +
arch/x86/entry/vdso/vdso32/sigreturn.S | 2 +
arch/x86/entry/vdso/vma.c | 15 +
arch/x86/include/asm/elf.h | 2 +
arch/x86/include/asm/vdso.h | 2 +
arch/x86/kernel/signal_compat.c | 2 +-
fs/exec.c | 3 +
include/linux/entry-common.h | 2 +
include/linux/sched.h | 2 +
include/linux/syscall_user_dispatch.h | 40 +++
include/linux/thread_info.h | 2 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/prctl.h | 5 +
kernel/entry/Makefile | 2 +-
kernel/entry/common.c | 17 +
kernel/entry/common.h | 16 +
kernel/entry/syscall_user_dispatch.c | 102 ++++++
kernel/fork.c | 1 +
kernel/sys.c | 5 +
tools/testing/selftests/Makefile | 1 +
.../syscall_user_dispatch/.gitignore | 3 +
.../selftests/syscall_user_dispatch/Makefile | 9 +
.../selftests/syscall_user_dispatch/config | 1 +
.../syscall_user_dispatch/sud_benchmark.c | 200 +++++++++++
.../syscall_user_dispatch/sud_test.c | 310 ++++++++++++++++++
26 files changed, 833 insertions(+), 3 deletions(-)
create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
create mode 100644 include/linux/syscall_user_dispatch.h
create mode 100644 kernel/entry/common.h
create mode 100644 kernel/entry/syscall_user_dispatch.c
create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_test.c
--
2.29.2
Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place. In addition, either the syscall is
dispatched back to userspace, in which case there is no resource for to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.
Since SUD runs before tracepoints, it needs to be a SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, the on_syscall_dispatch() examines context to skip the
tracepoint, audit and other work.
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
Changes since v6:
- Update do_syscall_intercept signature (Christian Brauner)
- Move it to before tracepoints
- Use SYSCALL_WORK flags
---
include/linux/entry-common.h | 2 ++
kernel/entry/common.c | 17 +++++++++++++++++
2 files changed, 19 insertions(+)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b216e4e..a6e98b4ba8e9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -44,10 +44,12 @@
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_EXIT)
/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 91e8fd50adf4..cef136fc7cb5 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -5,6 +5,8 @@
#include <linux/livepatch.h>
#include <linux/audit.h>
+#include "common.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>
@@ -46,6 +48,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
{
long ret = 0;
+ /*
+ * Handle Syscall User Dispatch. This must comes first, since
+ * the ABI here can be something that doesn't make sense for
+ * other syscall_work features.
+ */
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (do_syscall_user_dispatch(regs))
+ return -1L;
+ }
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = arch_syscall_enter_tracehook(regs);
@@ -230,6 +242,11 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
{
bool step;
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (on_syscall_dispatch())
+ return;
+ }
+
audit_syscall_exit(regs);
if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
--
2.29.2
Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
The range [<offset>,<offset>+len] is a part of the process memory map
that is allowed to by-pass the redirection code and dispatch syscalls
directly, such that in fast paths a process doesn't need to disable the
trap nor the kernel has to check the selector. This is essential to
return from SIGSYS to a blocked area without triggering another SIGSYS
from rt_sigreturn.
selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant. The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access. In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Cc: Matthew Wilcox <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Paul Gofman <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: [email protected]
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
Changes since v6:
(Matthew Wilcox)
- Use unsigned long for mode
(peterZ)
- Change interface to {offset,len}
- Use SYSCALL_WORK interface instead of TIF flags
Changes since v4:
(Andy Lutomirski)
- Allow sigreturn coming from vDSO
- Exit with SIGSYS instead of SIGSEGV on bad selector
(Thomas Gleixner)
- Use sizeof selector in access_ok
- Document usage of __get_user
- Use constant for state value
- Split out x86 parts
- Rebase on top of Gleixner's common entry code
- Don't expose do_syscall_user_dispatch
Changes since v3:
- NTR.
Changes since v2:
(Matthew Wilcox suggestions)
- Drop __user on non-ptr type.
- Move #define closer to similar defs
- Allow a memory region that can dispatch directly
(Kees Cook suggestions)
- Improve kconfig summary line
- Move flag cleanup on execve to begin_new_exec
- Hint branch predictor in the syscall path
(Me)
- Convert selector to char
Changes since RFC:
(Kees Cook suggestions)
- Don't mention personality while explaining the feature
- Use syscall_get_nr
- Remove header guard on several places
- Convert WARN_ON to WARN_ON_ONCE
- Explicit check for state values
- Rename to syscall user dispatcher
---
fs/exec.c | 3 +
include/linux/sched.h | 2 +
include/linux/syscall_user_dispatch.h | 40 ++++++++++
include/linux/thread_info.h | 2 +
include/uapi/linux/prctl.h | 5 ++
kernel/entry/Makefile | 2 +-
kernel/entry/common.h | 16 ++++
kernel/entry/syscall_user_dispatch.c | 102 ++++++++++++++++++++++++++
kernel/fork.c | 1 +
kernel/sys.c | 5 ++
10 files changed, 177 insertions(+), 1 deletion(-)
create mode 100644 include/linux/syscall_user_dispatch.h
create mode 100644 kernel/entry/common.h
create mode 100644 kernel/entry/syscall_user_dispatch.c
diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..aee36e5733ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -64,6 +64,7 @@
#include <linux/compat.h>
#include <linux/vmalloc.h>
#include <linux/io_uring.h>
+#include <linux/syscall_user_dispatch.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1302,6 +1303,8 @@ int begin_new_exec(struct linux_binprm * bprm)
flush_thread();
me->personality &= ~bprm->per_clear;
+ clear_syscall_work_syscall_user_dispatch(me);
+
/*
* We have to apply CLOEXEC before we change whether the process is
* dumpable (in setup_new_exec) to avoid a race with a process in userspace
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 063cd120b459..6d1c1f5e74fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -34,6 +34,7 @@
#include <linux/rseq.h>
#include <linux/seqlock.h>
#include <linux/kcsan.h>
+#include <linux/syscall_user_dispatch.h>
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -965,6 +966,7 @@ struct task_struct {
unsigned int sessionid;
#endif
struct seccomp seccomp;
+ struct syscall_user_dispatch syscall_dispatch;
/* Thread group tracking: */
u64 parent_exec_id;
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
new file mode 100644
index 000000000000..9517ea16f090
--- /dev/null
+++ b/include/linux/syscall_user_dispatch.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#ifndef _SYSCALL_USER_DISPATCH_H
+#define _SYSCALL_USER_DISPATCH_H
+
+#include <linux/thread_info.h>
+
+#ifdef CONFIG_GENERIC_ENTRY
+
+struct syscall_user_dispatch {
+ char __user *selector;
+ unsigned long offset;
+ unsigned long len;
+ bool on_dispatch;
+};
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+ unsigned long len, char __user *selector);
+
+#define clear_syscall_work_syscall_user_dispatch(tsk) \
+ clear_task_syscall_work(tsk, SYSCALL_USER_DISPATCH)
+
+#else
+struct syscall_user_dispatch {};
+
+static inline int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+ unsigned long len, char __user *selector)
+{
+ return -EINVAL;
+}
+
+static inline void clear_syscall_work_syscall_user_dispatch(struct task_struct *tsk)
+{
+}
+
+#endif /* CONFIG_GENERIC_ENTRY */
+
+#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 317363212ae9..45708a2602b9 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -41,6 +41,7 @@ enum syscall_work_bit {
SYSCALL_WORK_BIT_SYSCALL_TRACE,
SYSCALL_WORK_BIT_SYSCALL_EMU,
SYSCALL_WORK_BIT_SYSCALL_AUDIT,
+ SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH,
};
#define SYSCALL_WORK_SECCOMP BIT(SYSCALL_WORK_BIT_SECCOMP)
@@ -48,6 +49,7 @@ enum syscall_work_bit {
#define SYSCALL_WORK_SYSCALL_TRACE BIT(SYSCALL_WORK_BIT_SYSCALL_TRACE)
#define SYSCALL_WORK_SYSCALL_EMU BIT(SYSCALL_WORK_BIT_SYSCALL_EMU)
#define SYSCALL_WORK_SYSCALL_AUDIT BIT(SYSCALL_WORK_BIT_SYSCALL_AUDIT)
+#define SYSCALL_WORK_SYSCALL_USER_DISPATCH BIT(SYSCALL_WORK_BIT_SYSCALL_USER_DISPATCH)
#include <asm/thread_info.h>
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 7f0827705c9a..90deb41c8a34 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -247,4 +247,9 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58
+/* Dispatch syscalls to a userspace handler */
+#define PR_SET_SYSCALL_USER_DISPATCH 59
+# define PR_SYS_DISPATCH_OFF 0
+# define PR_SYS_DISPATCH_ON 1
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/entry/Makefile b/kernel/entry/Makefile
index 34c8a3f1c735..095c775e001e 100644
--- a/kernel/entry/Makefile
+++ b/kernel/entry/Makefile
@@ -9,5 +9,5 @@ KCOV_INSTRUMENT := n
CFLAGS_REMOVE_common.o = -fstack-protector -fstack-protector-strong
CFLAGS_common.o += -fno-stack-protector
-obj-$(CONFIG_GENERIC_ENTRY) += common.o
+obj-$(CONFIG_GENERIC_ENTRY) += common.o syscall_user_dispatch.o
obj-$(CONFIG_KVM_XFER_TO_GUEST_WORK) += kvm.o
diff --git a/kernel/entry/common.h b/kernel/entry/common.h
new file mode 100644
index 000000000000..cd0c4e5f143e
--- /dev/null
+++ b/kernel/entry/common.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _COMMON_H
+#define _COMMON_H
+
+bool do_syscall_user_dispatch(struct pt_regs *regs);
+
+static inline bool on_syscall_dispatch(void)
+{
+ if (unlikely(current->syscall_dispatch.on_dispatch)) {
+ current->syscall_dispatch.on_dispatch = false;
+ return true;
+ }
+ return false;
+}
+
+#endif
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
new file mode 100644
index 000000000000..131c38a0b628
--- /dev/null
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Collabora Ltd.
+ */
+#include <linux/sched.h>
+#include <linux/prctl.h>
+#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess.h>
+#include <linux/signal.h>
+#include <linux/elf.h>
+
+#include <asm/syscall.h>
+
+#include <linux/sched/signal.h>
+#include <linux/sched/task_stack.h>
+
+static void trigger_sigsys(struct pt_regs *regs)
+{
+ struct kernel_siginfo info;
+
+ clear_siginfo(&info);
+ info.si_signo = SIGSYS;
+ info.si_code = SYS_USER_DISPATCH;
+ info.si_call_addr = (void __user *)KSTK_EIP(current);
+ info.si_errno = 0;
+ info.si_arch = syscall_get_arch(current);
+ info.si_syscall = syscall_get_nr(current, regs);
+
+ force_sig_info(&info);
+}
+
+bool do_syscall_user_dispatch(struct pt_regs *regs)
+{
+ struct syscall_user_dispatch *sd = ¤t->syscall_dispatch;
+ char state;
+
+ if (likely(instruction_pointer(regs) - sd->offset < sd->len))
+ return false;
+
+ if (unlikely(arch_syscall_is_vdso_sigreturn(regs)))
+ return false;
+
+ if (likely(sd->selector)) {
+ /*
+ * access_ok() is performed once, at prctl time, when
+ * the selector is loaded by userspace.
+ */
+ if (unlikely(__get_user(state, sd->selector)))
+ do_exit(SIGSEGV);
+
+ if (likely(state == PR_SYS_DISPATCH_OFF))
+ return false;
+
+ if (state != PR_SYS_DISPATCH_ON)
+ do_exit(SIGSYS);
+ }
+
+ sd->on_dispatch = true;
+ syscall_rollback(current, regs);
+ trigger_sigsys(regs);
+
+ return true;
+}
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+ unsigned long len, char __user *selector)
+{
+ switch (mode) {
+ case PR_SYS_DISPATCH_OFF:
+ if (offset || len || selector)
+ return -EINVAL;
+ break;
+ case PR_SYS_DISPATCH_ON:
+ /*
+ * Validate the direct dispatcher region just for basic
+ * sanity against overflow and a 0-sized dispatcher
+ * region. If the user is able to submit a syscall from
+ * an address, that address is obviously valid.
+ */
+ if (offset && offset + len <= offset)
+ return -EINVAL;
+
+ if (selector && !access_ok(selector, sizeof(*selector)))
+ return -EFAULT;
+
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ current->syscall_dispatch.selector = selector;
+ current->syscall_dispatch.offset = offset;
+ current->syscall_dispatch.len = len;
+ current->syscall_dispatch.on_dispatch = false;
+
+ if (mode == PR_SYS_DISPATCH_ON)
+ set_syscall_work(SYSCALL_USER_DISPATCH);
+ else
+ clear_syscall_work(SYSCALL_USER_DISPATCH);
+
+ return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index 02b689a23457..4a5ecb41f440 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -906,6 +906,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
set_task_stack_end_magic(tsk);
+ clear_syscall_work_syscall_user_dispatch(tsk);
#ifdef CONFIG_STACKPROTECTOR
tsk->stack_canary = get_random_canary();
diff --git a/kernel/sys.c b/kernel/sys.c
index a730c03ee607..51f00fe20e4d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
#include <linux/syscore_ops.h>
#include <linux/version.h>
#include <linux/ctype.h>
+#include <linux/syscall_user_dispatch.h>
#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2530,6 +2531,10 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+ case PR_SET_SYSCALL_USER_DISPATCH:
+ error = set_syscall_user_dispatch(arg2, arg3, arg4,
+ (char __user *) arg5);
+ break;
default:
error = -EINVAL;
break;
--
2.29.2
This is the patch I'm using to evaluate the impact syscall user dispatch
has on native syscall (syscalls not redirected to userspace) when
enabled for the process and submiting syscalls though the unblocked
dispatch selector. It works by running a step to define a baseline of
the cost of executing sysinfo, then enabling SUD, and rerunning that
step.
On my test machine, an AMD Ryzen 5 1500X, I have the following results
with the latest version of syscall user dispatch patches.
root@olga:~# syscall_user_dispatch/sud_benchmark
Calibrating test set to last ~5 seconds...
test iterations = 37500000
Avg syscall time 134ns.
Caught sys_ff00
trapped_call_count 1, native_call_count 0.
Avg syscall time 147ns.
Interception overhead: 9.7% (+13ns).
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
.../selftests/syscall_user_dispatch/Makefile | 2 +-
.../syscall_user_dispatch/sud_benchmark.c | 200 ++++++++++++++++++
2 files changed, 201 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
diff --git a/tools/testing/selftests/syscall_user_dispatch/Makefile b/tools/testing/selftests/syscall_user_dispatch/Makefile
index 8e15fa42bcda..03c120270953 100644
--- a/tools/testing/selftests/syscall_user_dispatch/Makefile
+++ b/tools/testing/selftests/syscall_user_dispatch/Makefile
@@ -5,5 +5,5 @@ LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/
CFLAGS += -Wall -I$(LINUX_HDR_PATH)
-TEST_GEN_PROGS := sud_test
+TEST_GEN_PROGS := sud_test sud_benchmark
include ../lib.mk
diff --git a/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
new file mode 100644
index 000000000000..6689f1183dbf
--- /dev/null
+++ b/tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020 Collabora Ltd.
+ *
+ * Benchmark and test syscall user dispatch
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <errno.h>
+#include <time.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <sys/sysinfo.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>
+
+#ifndef PR_SET_SYSCALL_USER_DISPATCH
+# define PR_SET_SYSCALL_USER_DISPATCH 59
+# define PR_SYS_DISPATCH_OFF 0
+# define PR_SYS_DISPATCH_ON 1
+#endif
+
+#ifdef __NR_syscalls
+# define MAGIC_SYSCALL_1 (__NR_syscalls + 1) /* Bad Linux syscall number */
+#else
+# define MAGIC_SYSCALL_1 (0xff00) /* Bad Linux syscall number */
+#endif
+
+/*
+ * To test returning from a sigsys with selector blocked, the test
+ * requires some per-architecture support (i.e. knowledge about the
+ * signal trampoline address). On i386, we know it is on the vdso, and
+ * a small trampoline is open-coded for x86_64. Other architectures
+ * that have a trampoline in the vdso will support TEST_BLOCKED_RETURN
+ * out of the box, but don't enable them until they support syscall user
+ * dispatch.
+ */
+#if defined(__x86_64__) || defined(__i386__)
+#define TEST_BLOCKED_RETURN
+#endif
+
+#ifdef __x86_64__
+void* (syscall_dispatcher_start)(void);
+void* (syscall_dispatcher_end)(void);
+#else
+unsigned long syscall_dispatcher_start = 0;
+unsigned long syscall_dispatcher_end = 0;
+#endif
+
+unsigned long trapped_call_count = 0;
+unsigned long native_call_count = 0;
+
+char selector;
+#define SYSCALL_BLOCK (selector = PR_SYS_DISPATCH_ON)
+#define SYSCALL_UNBLOCK (selector = PR_SYS_DISPATCH_OFF)
+
+#define CALIBRATION_STEP 100000
+#define CALIBRATE_TO_SECS 5
+int factor;
+
+static double one_sysinfo_step(void)
+{
+ struct timespec t1, t2;
+ int i;
+ struct sysinfo info;
+
+ clock_gettime(CLOCK_MONOTONIC, &t1);
+ for (i = 0; i < CALIBRATION_STEP; i++)
+ sysinfo(&info);
+ clock_gettime(CLOCK_MONOTONIC, &t2);
+ return (t2.tv_sec - t1.tv_sec) + 1.0e-9 * (t2.tv_nsec - t1.tv_nsec);
+}
+
+static void calibrate_set(void)
+{
+ double elapsed = 0;
+
+ printf("Calibrating test set to last ~%d seconds...\n", CALIBRATE_TO_SECS);
+
+ while (elapsed < 1) {
+ elapsed += one_sysinfo_step();
+ factor += CALIBRATE_TO_SECS;
+ }
+
+ printf("test iterations = %d\n", CALIBRATION_STEP * factor);
+}
+
+static double perf_syscall(void)
+{
+ unsigned int i;
+ double partial = 0;
+
+ for (i = 0; i < factor; ++i)
+ partial += one_sysinfo_step()/(CALIBRATION_STEP*factor);
+ return partial;
+}
+
+static void handle_sigsys(int sig, siginfo_t *info, void *ucontext)
+{
+ char buf[1024];
+ int len;
+
+ SYSCALL_UNBLOCK;
+
+ /* printf and friends are not signal-safe. */
+ len = snprintf(buf, 1024, "Caught sys_%x\n", info->si_syscall);
+ write(1, buf, len);
+
+ if (info->si_syscall == MAGIC_SYSCALL_1)
+ trapped_call_count++;
+ else
+ native_call_count++;
+
+#ifdef TEST_BLOCKED_RETURN
+ SYSCALL_BLOCK;
+#endif
+
+#ifdef __x86_64__
+ __asm__ volatile("movq $0xf, %rax");
+ __asm__ volatile("leaveq");
+ __asm__ volatile("add $0x8, %rsp");
+ __asm__ volatile("syscall_dispatcher_start:");
+ __asm__ volatile("syscall");
+ __asm__ volatile("nop"); /* Landing pad within dispatcher area */
+ __asm__ volatile("syscall_dispatcher_end:");
+#endif
+
+}
+
+int main(void)
+{
+ struct sigaction act;
+ double time1, time2;
+ int ret;
+ sigset_t mask;
+
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&mask);
+
+ act.sa_sigaction = handle_sigsys;
+ act.sa_flags = SA_SIGINFO;
+ act.sa_mask = mask;
+
+ calibrate_set();
+
+ time1 = perf_syscall();
+ printf("Avg syscall time %.0lfns.\n", time1 * 1.0e9);
+
+ ret = sigaction(SIGSYS, &act, NULL);
+ if (ret) {
+ perror("Error sigaction:");
+ exit(-1);
+ }
+
+ fprintf(stderr, "Enabling syscall trapping.\n");
+
+ if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
+ syscall_dispatcher_start,
+ (syscall_dispatcher_end - syscall_dispatcher_start + 1),
+ &selector)) {
+ perror("prctl failed\n");
+ exit(-1);
+ }
+
+ SYSCALL_BLOCK;
+ syscall(MAGIC_SYSCALL_1);
+
+#ifdef TEST_BLOCKED_RETURN
+ if (selector == PR_SYS_DISPATCH_OFF) {
+ fprintf(stderr, "Failed to return with selector blocked.\n");
+ exit(-1);
+ }
+#endif
+
+ SYSCALL_UNBLOCK;
+
+ if (!trapped_call_count) {
+ fprintf(stderr, "syscall trapping does not work.\n");
+ exit(-1);
+ }
+
+ time2 = perf_syscall();
+
+ if (native_call_count) {
+ perror("syscall trapping intercepted more syscalls than expected\n");
+ exit(-1);
+ }
+
+ printf("trapped_call_count %lu, native_call_count %lu.\n",
+ trapped_call_count, native_call_count);
+ printf("Avg syscall time %.0lfns.\n", time2 * 1.0e9);
+ printf("Interception overhead: %.1lf%% (+%.0lfns).\n",
+ 100.0 * (time2 / time1 - 1.0), 1.0e9 * (time2 - time1));
+ return 0;
+
+}
--
2.29.2
Explain the interface, provide some background and security notes.
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
.../admin-guide/syscall-user-dispatch.rst | 87 +++++++++++++++++++
1 file changed, 87 insertions(+)
create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
new file mode 100644
index 000000000000..e2fb36926f97
--- /dev/null
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -0,0 +1,87 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Syscall User Dispatch
+=====================
+
+Background
+----------
+
+Compatibility layers like Wine need a way to efficiently emulate system
+calls of only a part of their process - the part that has the
+incompatible code - while being able to execute native syscalls without
+a high performance penalty on the native part of the process. Seccomp
+falls short on this task, since it has limited support to efficiently
+filter syscalls based on memory regions, and it doesn't support removing
+filters. Therefore a new mechanism is necessary.
+
+Syscall User Dispatch brings the filtering of the syscall dispatcher
+address back to userspace. The application is in control of a flip
+switch, indicating the current personality of the process. A
+multiple-personality application can then flip the switch without
+invoking the kernel, when crossing the compatibility layer API
+boundaries, to enable/disable the syscall redirection and execute
+syscalls directly (disabled) or send them to be emulated in userspace
+through a SIGSYS.
+
+The goal of this design is to provide very quick compatibility layer
+boundary crosses, which is achieved by not executing a syscall to change
+personality every time the compatibility layer executes. Instead, a
+userspace memory region exposed to the kernel indicates the current
+personality, and the application simply modifies that variable to
+configure the mechanism.
+
+There is a relatively high cost associated with handling signals on most
+architectures, like x86, but at least for Wine, syscalls issued by
+native Windows code are currently not known to be a performance problem,
+since they are quite rare, at least for modern gaming applications.
+
+Since this mechanism is designed to capture syscalls issued by
+non-native applications, it must function on syscalls whose invocation
+ABI is completely unexpected to Linux. Syscall User Dispatch, therefore
+doesn't rely on any of the syscall ABI to make the filtering. It uses
+only the syscall dispatcher address and the userspace key.
+
+Interface
+---------
+
+A process can setup this mechanism on supported kernels
+CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
+
+ prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
+
+<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
+disable the mechanism globally for that thread. When
+PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
+
+<offset> and <offset+length> delimit a closed memory region interval
+from which syscalls are always executed directly, regardless of the
+userspace selector. This provides a fast path for the C library, which
+includes the most common syscall dispatchers in the native code
+applications, and also provides a way for the signal handler to return
+without triggering a nested SIGSYS on (rt_)sigreturn. Users of this
+interface should make sure that at least the signal trampoline code is
+included in this region. In addition, for syscalls that implement the
+trampoline code on the vDSO, that trampoline is never intercepted.
+
+[selector] is a pointer to a char-sized region in the process memory
+region, that provides a quick way to enable disable syscall redirection
+thread-wide, without the need to invoke the kernel directly. selector
+can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
+value should terminate the program with a SIGSYS.
+
+Security Notes
+--------------
+
+Syscall User Dispatch provides functionality for compatibility layers to
+quickly capture system calls issued by a non-native part of the
+application, while not impacting the Linux native regions of the
+process. It is not a mechanism for sandboxing system calls, and it
+should not be seen as a security mechanism, since it is trivial for a
+malicious application to subvert the mechanism by jumping to an allowed
+dispatcher region prior to executing the syscall, or to discover the
+address and modify the selector value. If the use case requires any
+kind of security sandboxing, Seccomp should be used instead.
+
+Any fork or exec of the existing process resets the mechanism to
+PR_SYS_DISPATCH_OFF.
--
2.29.2
* Gabriel Krisman Bertazi:
> +Interface
> +---------
> +
> +A process can setup this mechanism on supported kernels
> +CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
> +
> + prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
> +
> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
> +disable the mechanism globally for that thread. When
> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
> +
> +<offset> and <offset+length> delimit a closed memory region interval
> +from which syscalls are always executed directly, regardless of the
> +userspace selector. This provides a fast path for the C library, which
> +includes the most common syscall dispatchers in the native code
> +applications, and also provides a way for the signal handler to return
> +without triggering a nested SIGSYS on (rt_)sigreturn. Users of this
> +interface should make sure that at least the signal trampoline code is
> +included in this region. In addition, for syscalls that implement the
> +trampoline code on the vDSO, that trampoline is never intercepted.
> +
> +[selector] is a pointer to a char-sized region in the process memory
> +region, that provides a quick way to enable disable syscall redirection
> +thread-wide, without the need to invoke the kernel directly. selector
> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
> +value should terminate the program with a SIGSYS.
Is this a process property or a task/thread property? The last
paragraph says “thread-wide”, but the first paragraph says “process”.
* Gabriel Krisman Bertazi:
> This is the v7 of syscall user dispatch. This version is a bit
> different from v6 on the following points, after the modifications
> requested on that submission.
Is this supposed to work with existing (Linux) libcs, or do you bring
your own low-level run-time libraries?
Florian Weimer <[email protected]> writes:
> * Gabriel Krisman Bertazi:
>
>> This is the v7 of syscall user dispatch. This version is a bit
>> different from v6 on the following points, after the modifications
>> requested on that submission.
>
> Is this supposed to work with existing (Linux) libcs, or do you bring
> your own low-level run-time libraries?
Hi Florian,
The main use case is to intercept Windows system calls of an application
running over Wine. While Wine is using an unmodified glibc to execute
its own native Linux syscalls, the Windows libraries might be directly
issuing syscalls that we need to capture. So there is a mix. While this
mechanism is compatible with existing libc, we might have other
libraries executing a syscall instruction directly.
--
Gabriel Krisman Bertazi
Florian Weimer <[email protected]> writes:
> * Gabriel Krisman Bertazi:
>
>> +Interface
>> +---------
>> +
>> +A process can setup this mechanism on supported kernels
>> +CONFIG_SYSCALL_USER_DISPATCH) by executing the following prctl:
>> +
>> + prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])
>> +
>> +<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
>> +disable the mechanism globally for that thread. When
>> +PR_SYS_DISPATCH_OFF is used, the other fields must be zero.
>> +
>> +<offset> and <offset+length> delimit a closed memory region interval
>> +from which syscalls are always executed directly, regardless of the
>> +userspace selector. This provides a fast path for the C library, which
>> +includes the most common syscall dispatchers in the native code
>> +applications, and also provides a way for the signal handler to return
>> +without triggering a nested SIGSYS on (rt_)sigreturn. Users of this
>> +interface should make sure that at least the signal trampoline code is
>> +included in this region. In addition, for syscalls that implement the
>> +trampoline code on the vDSO, that trampoline is never intercepted.
>> +
>> +[selector] is a pointer to a char-sized region in the process memory
>> +region, that provides a quick way to enable disable syscall redirection
>> +thread-wide, without the need to invoke the kernel directly. selector
>> +can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF. Any other
>> +value should terminate the program with a SIGSYS.
>
> Is this a process property or a task/thread property? The last
> paragraph says “thread-wide”, but the first paragraph says “process”.
It is per-thread, as it doesn't survive across clone/fork syscalls. I
will fix the first paragraph of this text.
--
Gabriel Krisman Bertazi
* Gabriel Krisman Bertazi:
> The main use case is to intercept Windows system calls of an application
> running over Wine. While Wine is using an unmodified glibc to execute
> its own native Linux syscalls, the Windows libraries might be directly
> issuing syscalls that we need to capture. So there is a mix. While this
> mechanism is compatible with existing libc, we might have other
> libraries executing a syscall instruction directly.
Please raise this on libc-alpha, it's an unexpected compatibility
constraint on glibc. Thanks.
On Tue, Nov 17, 2020 at 10:28:36PM -0500, Gabriel Krisman Bertazi wrote:
> prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])
>
> The range [<offset>,<offset>+len] is a part of the process memory map
> + if (likely(instruction_pointer(regs) - sd->offset < sd->len))
> + return false;
The actual implementation ^ is: [<offset>, <offset>+<length>).
Which seems consistent and right, so I would suggest simply changing the
Changelog, something that could be done when applying.
On Tue, Nov 17, 2020 at 10:28:33PM -0500, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi (7):
> x86: vdso: Expose sigreturn address on vdso to the kernel
> signal: Expose SYS_USER_DISPATCH si_code type
> kernel: Implement selective syscall userspace redirection
> entry: Support Syscall User Dispatch on common syscall entry
> selftests: Add kselftest for syscall user dispatch
> selftests: Add benchmark for syscall user dispatch
> docs: Document Syscall User Dispatch
Aside from the one little nit this looks good to me.
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Gabriel Krisman Bertazi <[email protected]> writes:
> Introduce a mechanism to quickly disable/enable syscall handling for a
> specific process and redirect to userspace via SIGSYS. This is useful
> for processes with parts that require syscall redirection and parts that
> don't, but who need to perform this boundary crossing really fast,
> without paying the cost of a system call to reconfigure syscall handling
> on each boundary transition. This is particularly important for Windows
> games running over Wine.
I raised a discussion about this on libc-alpha, as requested by Florian.
At the moment, there was some back and forth on why the use-case is not
done by seccomp, but a more interesting point about user_notif was
raised by Rich Felker (cc'ed).
SIGSYS, as a signal handler, is limited in what can be done inside it.
Rich suggested the user_notif design is a better solution. I understand
that from a Wine perspective, SIGSYS suffices for their work, but would
it make sense to extend SUD interface to support a user_notif-like
interface? Would this be acceptable as future work to be added when/if
needed, or should we design it from the start?
The existing interface could be extended with a flags field as part of
the opcode passed in argument 2, which is currently reserved, and then
return a FD, just like seccomp(2) does. So it is not like the current
patches couldn't be extended in the future if needed, unless I'm
mistaken.
--
Gabriel Krisman Bertazi
On Thu, Nov 19, 2020 at 12:43:05PM -0500, Gabriel Krisman Bertazi wrote:
> The existing interface could be extended with a flags field as part of
> the opcode passed in argument 2, which is currently reserved, and then
> return a FD, just like seccomp(2) does. So it is not like the current
> patches couldn't be extended in the future if needed, unless I'm
> mistaken.
Yes, I'd prefer this series go in as-is, and if there is a need for
extending the API, arg2 can have more values added.
--
Kees Cook
On Thu, Nov 19, 2020 at 01:38:27PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 10:28:33PM -0500, Gabriel Krisman Bertazi wrote:
> > Gabriel Krisman Bertazi (7):
> > x86: vdso: Expose sigreturn address on vdso to the kernel
> > signal: Expose SYS_USER_DISPATCH si_code type
> > kernel: Implement selective syscall userspace redirection
> > entry: Support Syscall User Dispatch on common syscall entry
> > selftests: Add kselftest for syscall user dispatch
> > selftests: Add benchmark for syscall user dispatch
> > docs: Document Syscall User Dispatch
>
> Aside from the one little nit this looks good to me.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
Agreed, and thank you Gabriel for the SYSCALL_WORK series too. :) That's
so nice to have!
--
Kees Cook
On Fri, Nov 20, 2020 at 4:18 PM Kees Cook <[email protected]> wrote:
>
> On Thu, Nov 19, 2020 at 12:43:05PM -0500, Gabriel Krisman Bertazi wrote:
> > The existing interface could be extended with a flags field as part of
> > the opcode passed in argument 2, which is currently reserved, and then
> > return a FD, just like seccomp(2) does. So it is not like the current
> > patches couldn't be extended in the future if needed, unless I'm
> > mistaken.
>
> Yes, I'd prefer this series go in as-is, and if there is a need for
> extending the API, arg2 can have more values added.
I agree.
>
> --
> Kees Cook