Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS. This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition. This is particularly important for Windows
games running over Wine.
The proposed interface looks like this:
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <dispatcher>, [selector])
Dispatcher is the address of a syscall instruction that is allowed to
by-pass the blockage, such that in fast paths you don't need to disable
the trap nor check the selector. This is essential to return from
SIGSYS to a blocked area without triggering another SIGSYS from the
rt_sigreturn.
selector is an optional pointer to an aligned int-sized userspace memory
region that has a key switch for the mechanism. This key switch can be
PR_SYSCALL_DISPATCH_ENABLE, PR_SYSCALL_DISPATCH_DISABLE.
The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.
Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support. I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism. And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.
For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only dispatcher to issue syscalls is pretty much
irrelevant. The overhead of selector goes around 40ns for a
native (untrapped) syscall in my system, and it is (as expected)
dominated by the supervisor-mode user-address access. In fact, with
SMAP off, the overhead is consistently less than 5ns on my test box.
Right now, it is only supported by x86_64 and x86, but it should be
easily enabled for other architectures.
An example code using this interface can be found at:
https://gitlab.collabora.com/krisman/syscall-disable-personality
Cc: Matthew Wilcox <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Paul Gofman <[email protected]>
Cc: Kees Cook <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
Changes since RFC:
(Kees Cook suggestions)
- Don't mention personality while explaining the feature
- Use syscall_get_nr
- Remove header guard on several places
- Convert WARN_ON to WARN_ON_ONCE
- Explicit check for state values
- Rename to syscall user dispatcher
RFC/v1: https://lkml.org/lkml/2020/7/8/96
arch/Kconfig | 20 ++++++
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 5 ++
arch/x86/include/asm/thread_info.h | 4 +-
fs/exec.c | 2 +
include/linux/sched.h | 3 +
include/linux/syscall_user_dispatch.h | 45 ++++++++++++++
include/uapi/asm-generic/siginfo.h | 1 +
include/uapi/linux/prctl.h | 4 ++
kernel/Makefile | 1 +
kernel/fork.c | 1 +
kernel/sys.c | 8 +++
kernel/syscall_user_dispatch.c | 90 +++++++++++++++++++++++++++
13 files changed, 184 insertions(+), 1 deletion(-)
create mode 100644 include/linux/syscall_user_dispatch.h
create mode 100644 kernel/syscall_user_dispatch.c
diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..c182344374a1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,26 @@ config SECCOMP_FILTER
See Documentation/userspace-api/seccomp_filter.rst for details.
+config HAVE_ARCH_SYSCALL_USER_DISPATCH
+ bool
+ help
+ An arch should select this symbol if it provides all of these things:
+ - TIF_SYSCALL_USER_DISPATCH
+ - syscall_get_arch
+ - syscall_rollback
+ - syscall_get_nr
+ - SIGSYS siginfo_t support
+
+config SYSCALL_USER_DISPATCH
+ bool "Support rejecting syscalls not coming from a dispatcher"
+ depends on HAVE_ARCH_SYSCALL_USER_DISPATCH
+ help
+ Enable tasks to ask the kernel to redirect syscalls not
+ issued from a predefined dispatcher back to userspace,
+ depending on a userspace selector.
+
+ This option is useful to optimize games running over Wine.
+
config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 883da0abf779..466a3a9c0708 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,7 @@ config X86
select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_SYSCALL_USER_DISPATCH
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_STACKLEAK
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index bd3f14175193..6c1360a7f260 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -138,6 +138,11 @@ static long syscall_trace_enter(struct pt_regs *regs)
return -1L;
}
+ if (work & _TIF_SYSCALL_USER_DISPATCH) {
+ if (do_syscall_user_dispatch(regs))
+ return -1L;
+ }
+
#ifdef CONFIG_SECCOMP
/*
* Do seccomp after ptrace, to catch any tracer changes.
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..b26a9f2f0491 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* IA32 compatibility process */
#define TIF_SLD 18 /* Restore split lock detection on context switch */
+#define TIF_SYSCALL_USER_DISPATCH 19 /* Redirect syscall for userspace handling */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_SLD (1 << TIF_SLD)
+#define _TIF_SYSCALL_USER_DISPATCH (1 << TIF_SYSCALL_USER_DISPATCH)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
@@ -136,7 +138,7 @@ struct thread_info {
/* Work to do before invoking the actual syscall. */
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
- _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
+ _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | _TIF_SYSCALL_USER_DISPATCH)
/* flags to check in __switch_to() */
#define _TIF_WORK_CTXSW_BASE \
diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..44f0ce352a0d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1502,6 +1502,8 @@ void setup_new_exec(struct linux_binprm * bprm)
arch_setup_new_exec();
+ clear_tsk_syscall_user_dispatch(me);
+
/* Set the new mm task size. We have to do that late because it may
* depend on TIF_32BIT which is only updated in flush_thread() on
* some architectures like powerpc
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 692e327d7455..407b868146e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -32,6 +32,7 @@
#include <linux/posix-timers.h>
#include <linux/rseq.h>
#include <linux/kcsan.h>
+#include <linux/syscall_user_dispatch.h>
/* task_struct member predeclarations (sorted alphabetically): */
struct audit_context;
@@ -953,6 +954,8 @@ struct task_struct {
#endif
struct seccomp seccomp;
+ struct syscall_user_dispatch syscall_dispatch;
+
/* Thread group tracking: */
u64 parent_exec_id;
u64 self_exec_id;
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
new file mode 100644
index 000000000000..dc7d8bb3ae13
--- /dev/null
+++ b/include/linux/syscall_user_dispatch.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SYSCALL_USER_DISPATCH_H
+#define _SYSCALL_USER_DISPATCH_H
+
+struct task_struct;
+static void clear_tsk_thread_flag(struct task_struct *tsk, int flag);
+
+#ifdef CONFIG_SYSCALL_USER_DISPATCH
+struct syscall_user_dispatch {
+ int __user *selector;
+ unsigned long __user dispatcher;
+};
+
+int do_syscall_user_dispatch(struct pt_regs *regs);
+int set_syscall_user_dispatch(int mode, unsigned long __user dispatcher,
+ int __user *selector);
+
+static inline void clear_tsk_syscall_user_dispatch(struct task_struct *tsk)
+{
+ clear_tsk_thread_flag(tsk, TIF_SYSCALL_USER_DISPATCH);
+}
+
+#else
+struct syscall_user_dispatch {};
+
+static inline int set_syscall_user_dispatch(int mode,
+ unsigned long __user dispatcher,
+ int __user *selector)
+{
+ return -EINVAL;
+}
+
+static inline int do_syscall_user_dispatch(struct pt_regs *regs)
+{
+ WARN_ON_ONCE(1);
+ return 0;
+}
+
+static inline void clear_tsk_syscall_user_dispatch(struct task_struct *tsk)
+{
+}
+
+#endif /* CONFIG_SYSCALL_USER_DISPATCH */
+
+#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c267181..16337270467d 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -285,6 +285,7 @@ typedef struct siginfo {
*/
#define SYS_SECCOMP 1 /* seccomp triggered */
#define NSIGSYS 1
+#define SYS_USER_REDIRECT 2
/*
* SIGEMT si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..335ea1b5785e 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,8 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58
+#define PR_SET_SYSCALL_USER_DISPATCH 59
+# define PR_SYSCALL_DISPATCH_DISABLE 0
+# define PR_SYSCALL_DISPATCH_ENABLE 1
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index f3218bc5ec69..158b8c61592f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SYSCALL_USER_DISPATCH) += syscall_user_dispatch.o
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..c6b64a849fec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -921,6 +921,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
set_task_stack_end_magic(tsk);
+ clear_tsk_syscall_user_dispatch(tsk);
#ifdef CONFIG_STACKPROTECTOR
tsk->stack_canary = get_random_canary();
diff --git a/kernel/sys.c b/kernel/sys.c
index 00a96746e28a..549f4dc9340f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -42,6 +42,7 @@
#include <linux/syscore_ops.h>
#include <linux/version.h>
#include <linux/ctype.h>
+#include <linux/syscall_user_dispatch.h>
#include <linux/compat.h>
#include <linux/syscalls.h>
@@ -2527,6 +2528,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+ case PR_SET_SYSCALL_USER_DISPATCH:
+ if (arg5 != 0)
+ return -EINVAL;
+ error = set_syscall_user_dispatch((int) arg2,
+ (unsigned long __user) arg3,
+ (int __user *) arg4);
+ break;
default:
error = -EINVAL;
break;
diff --git a/kernel/syscall_user_dispatch.c b/kernel/syscall_user_dispatch.c
new file mode 100644
index 000000000000..dd0867964a82
--- /dev/null
+++ b/kernel/syscall_user_dispatch.c
@@ -0,0 +1,90 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/sched.h>
+#include <linux/prctl.h>
+#include <linux/syscall_user_dispatch.h>
+#include <linux/uaccess.h>
+#include <linux/signal.h>
+
+#include <asm/syscall.h>
+
+#include <linux/sched/signal.h>
+#include <linux/sched/task_stack.h>
+
+static void trigger_sigsys(struct pt_regs *regs)
+{
+ struct kernel_siginfo info;
+
+ clear_siginfo(&info);
+ info.si_signo = SIGSYS;
+ info.si_code = SYS_USER_REDIRECT;
+ info.si_call_addr = (void __user *)KSTK_EIP(current);
+ info.si_errno = 0;
+ info.si_arch = syscall_get_arch(current);
+ info.si_syscall = syscall_get_nr(current, regs);
+
+ force_sig_info(&info);
+}
+
+int do_syscall_user_dispatch(struct pt_regs *regs)
+{
+ int state;
+
+ if (current->syscall_dispatch.dispatcher == instruction_pointer(regs))
+ return 0;
+
+ if (current->syscall_dispatch.selector) {
+ if (__get_user(state, current->syscall_dispatch.selector))
+ do_exit(SIGSEGV);
+
+ switch (state) {
+ case PR_SYSCALL_DISPATCH_DISABLE:
+ return 0;
+ case PR_SYSCALL_DISPATCH_ENABLE:
+ break;
+ default:
+ do_exit(SIGSEGV);
+ }
+ }
+
+ syscall_rollback(current, regs);
+ trigger_sigsys(regs);
+
+ return 1;
+}
+
+int set_syscall_user_dispatch(int mode, unsigned long __user dispatcher,
+ int __user *selector)
+{
+ switch (mode) {
+ case PR_SYSCALL_DISPATCH_DISABLE:
+ if (dispatcher || selector)
+ return -EINVAL;
+ break;
+ case PR_SYSCALL_DISPATCH_ENABLE:
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (selector) {
+ if (!access_ok(selector, sizeof(int)))
+ return -EFAULT;
+ }
+
+ spin_lock_irq(¤t->sighand->siglock);
+
+ current->syscall_dispatch.selector = selector;
+ current->syscall_dispatch.dispatcher = dispatcher;
+
+ /* make sure fastlock is committed before setting the flag. */
+ smp_mb__before_atomic();
+
+ if (mode == PR_SYSCALL_DISPATCH_ENABLE)
+ set_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
+ else
+ clear_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
+
+ spin_unlock_irq(¤t->sighand->siglock);
+
+ return 0;
+}
--
2.27.0
On Thu, Jul 09, 2020 at 12:38:40AM -0400, Gabriel Krisman Bertazi wrote:
> The proposed interface looks like this:
>
> prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <dispatcher>, [selector])
>
> Dispatcher is the address of a syscall instruction that is allowed to
> by-pass the blockage, such that in fast paths you don't need to disable
> the trap nor check the selector. This is essential to return from
> SIGSYS to a blocked area without triggering another SIGSYS from the
> rt_sigreturn.
Should <dispatcher> be a single pointer or should the interface specify
a range from which syscalls may be made without being redirected? eg,
one could specify the whole of libc.
prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start>, <inclusive-end>, [selector])
> +++ b/include/linux/syscall_user_dispatch.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SYSCALL_USER_DISPATCH_H
> +#define _SYSCALL_USER_DISPATCH_H
> +
> +struct task_struct;
> +static void clear_tsk_thread_flag(struct task_struct *tsk, int flag);
> +
> +#ifdef CONFIG_SYSCALL_USER_DISPATCH
> +struct syscall_user_dispatch {
> + int __user *selector;
> + unsigned long __user dispatcher;
The __user annotation is on the pointer, not the value. ie, it's
unsigned long foo;
unsigned long __user *p;
get_user(foo, p)
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -285,6 +285,7 @@ typedef struct siginfo {
> */
> #define SYS_SECCOMP 1 /* seccomp triggered */
> #define NSIGSYS 1
> +#define SYS_USER_REDIRECT 2
I'd suggest that SYS_USER_REDIRECT should be moved up by one line.
> +int set_syscall_user_dispatch(int mode, unsigned long __user dispatcher,
> + int __user *selector)
> +{
> + switch (mode) {
> + case PR_SYSCALL_DISPATCH_DISABLE:
> + if (dispatcher || selector)
> + return -EINVAL;
> + break;
> + case PR_SYSCALL_DISPATCH_ENABLE:
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + if (selector) {
> + if (!access_ok(selector, sizeof(int)))
> + return -EFAULT;
> + }
You're not enforcing the alignment requirement here.
> + spin_lock_irq(¤t->sighand->siglock);
> +
> + current->syscall_dispatch.selector = selector;
> + current->syscall_dispatch.dispatcher = dispatcher;
> +
> + /* make sure fastlock is committed before setting the flag. */
fastlock? ;-)
I don't think you actually need this. You're setting per-thread state on
yourself, so what's the race that you're concerned about?
> + smp_mb__before_atomic();
> +
> + if (mode == PR_SYSCALL_DISPATCH_ENABLE)
> + set_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
> + else
> + clear_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
> +
> + spin_unlock_irq(¤t->sighand->siglock);
> +
> + return 0;
> +}
> --
> 2.27.0
>
Matthew Wilcox <[email protected]> writes:
> On Thu, Jul 09, 2020 at 12:38:40AM -0400, Gabriel Krisman Bertazi wrote:
>> The proposed interface looks like this:
>>
>> prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <dispatcher>, [selector])
>>
>> Dispatcher is the address of a syscall instruction that is allowed to
>> by-pass the blockage, such that in fast paths you don't need to disable
>> the trap nor check the selector. This is essential to return from
>> SIGSYS to a blocked area without triggering another SIGSYS from the
>> rt_sigreturn.
>
> Should <dispatcher> be a single pointer or should the interface specify
> a range from which syscalls may be made without being redirected? eg,
> one could specify the whole of libc.
>
> prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <start>, <inclusive-end>,
> [selector])
I liked this suggestion a lot, since user can just pass a single address
to get the original interface, but it still let us not pay the cost of
__get_user on more paths. I will add it to v3.
>
>> +++ b/include/linux/syscall_user_dispatch.h
>> @@ -0,0 +1,45 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _SYSCALL_USER_DISPATCH_H
>> +#define _SYSCALL_USER_DISPATCH_H
>> +
>> +struct task_struct;
>> +static void clear_tsk_thread_flag(struct task_struct *tsk, int flag);
>> +
>> +#ifdef CONFIG_SYSCALL_USER_DISPATCH
>> +struct syscall_user_dispatch {
>> + int __user *selector;
>> + unsigned long __user dispatcher;
>
> The __user annotation is on the pointer, not the value. ie, it's
>
> unsigned long foo;
> unsigned long __user *p;
>
> get_user(foo, p)
>
>> +++ b/include/uapi/asm-generic/siginfo.h
>> @@ -285,6 +285,7 @@ typedef struct siginfo {
>> */
>> #define SYS_SECCOMP 1 /* seccomp triggered */
>> #define NSIGSYS 1
>> +#define SYS_USER_REDIRECT 2
>
> I'd suggest that SYS_USER_REDIRECT should be moved up by one line.
>
>> +int set_syscall_user_dispatch(int mode, unsigned long __user dispatcher,
>> + int __user *selector)
>> +{
>> + switch (mode) {
>> + case PR_SYSCALL_DISPATCH_DISABLE:
>> + if (dispatcher || selector)
>> + return -EINVAL;
>> + break;
>> + case PR_SYSCALL_DISPATCH_ENABLE:
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> +
>> + if (selector) {
>> + if (!access_ok(selector, sizeof(int)))
>> + return -EFAULT;
>> + }
>
> You're not enforcing the alignment requirement here.
>
>> + spin_lock_irq(¤t->sighand->siglock);
>> +
>> + current->syscall_dispatch.selector = selector;
>> + current->syscall_dispatch.dispatcher = dispatcher;
>> +
>> + /* make sure fastlock is committed before setting the flag. */
>
> fastlock? ;-)
Gee, keeping variable renames updated on comments is hard, compiler
won't catch those. :)
> I don't think you actually need this. You're setting per-thread state on
> yourself, so what's the race that you're concerned about?
Good point. I was assuming this would be modifiable from under it, but
it is not the case.
>
>> + smp_mb__before_atomic();
>> +
>> + if (mode == PR_SYSCALL_DISPATCH_ENABLE)
>> + set_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
>> + else
>> + clear_tsk_thread_flag(current, TIF_SYSCALL_USER_DISPATCH);
>> +
>> + spin_unlock_irq(¤t->sighand->siglock);
>> +
>> + return 0;
>> +}
>> --
>> 2.27.0
>>
--
Gabriel Krisman Bertazi
On Thu, Jul 09, 2020 at 12:38:40AM -0400, Gabriel Krisman Bertazi wrote:
> [...]
> +config SYSCALL_USER_DISPATCH
> + bool "Support rejecting syscalls not coming from a dispatcher"
bike shed: this doesn't really describe why it's useful. Maybe:
bool "Support syscall redirection to userspace dispatcher"
> + depends on HAVE_ARCH_SYSCALL_USER_DISPATCH
> + help
> + Enable tasks to ask the kernel to redirect syscalls not
> + issued from a predefined dispatcher back to userspace,
> + depending on a userspace selector.
depending on a userspace memory selector.
?
> [...]
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -138,6 +138,11 @@ static long syscall_trace_enter(struct pt_regs *regs)
> return -1L;
> }
>
> + if (work & _TIF_SYSCALL_USER_DISPATCH) {
> + if (do_syscall_user_dispatch(regs))
> + return -1L;
> + }
> +
Nice; I like this! It's very small, and now I want to refactor seccomp
to be so pretty. :)
> [...]
> diff --git a/fs/exec.c b/fs/exec.c
> index e6e8a9a70327..44f0ce352a0d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1502,6 +1502,8 @@ void setup_new_exec(struct linux_binprm * bprm)
>
> arch_setup_new_exec();
>
> + clear_tsk_syscall_user_dispatch(me);
> +
To keep this as arch-agnostic as possible, I actually think this belongs
in begin_new_exec() instead, after the personality clearing. If it were
to be less arch-agnostic, I'd recommend it live in arch_setup_new_exec()
(which most other TIF flags get cleared), but I'd like to have this
feature as a good example of an "easy" refactor into arch-agnostic in
the future. :P
> [...]
> @@ -285,6 +285,7 @@ typedef struct siginfo {
> */
> #define SYS_SECCOMP 1 /* seccomp triggered */
> #define NSIGSYS 1
> +#define SYS_USER_REDIRECT 2
More than just what willy suggested, but you also need to bump NSIGSYS
(I missed that in the RFC). This should read as:
#define SYS_SECCOMP 1 /* seccomp triggered */
#define SYS_USER_REDIRECT 2
#define NSIGSYS 2
(i.e. NSIGSYS is "how many si codes are there for sigsys?")
> [...]
> +int do_syscall_user_dispatch(struct pt_regs *regs)
> +{
> + int state;
> +
> + if (current->syscall_dispatch.dispatcher == instruction_pointer(regs))
> + return 0;
Just to clarify what willy was talking about -- since you're using
"dispatcher" as a scalar (and not dereferencing it, etc), it can just
stay "unsigned long" without __user.
In the documentation for the future "range inclusive" check, maybe also
note that it's the inclusive ranger covering the _starting_ address of
the syscall entry, in case anyone thinks they need to decode instruction
lengths to get the right range, which they don't and I imagine you don't
care about.
> +
> + if (current->syscall_dispatch.selector) {
> + if (__get_user(state, current->syscall_dispatch.selector))
> + do_exit(SIGSEGV);
> +
> + switch (state) {
> + case PR_SYSCALL_DISPATCH_DISABLE:
> + return 0;
> + case PR_SYSCALL_DISPATCH_ENABLE:
> + break;
> + default:
> + do_exit(SIGSEGV);
> + }
> + }
> +
> + syscall_rollback(current, regs);
> + trigger_sigsys(regs);
> +
> + return 1;
How should do_syscall_user_dispatch() approach branch hinting? For example,
is dispatcher == instruction_pointer going to be the common case for
users of this? I would expect not. So maybe, at the top:
if (unlikely(current->syscall_dispatch.dispatcher == instruction_pointer(regs)))
return 0;
and what about the selector? I assume it would be common for the selector
to be set, and enabled:
if (likely(current->syscall_dispatch.selector)) {
if (__get_user(state, current->syscall_dispatch.selector))
do_exit(SIGSEGV);
if (unlikely(state != PR_SYSCALL_DISPATCH_ENABLE)) {
if (likely(state == PR_SYSCALL_DISPATCH_DISABLE))
return 0;
do_exit(SIGSEGV);
}
}
syscall_rollback...
Or, perhaps micro-optimization doesn't matter at all here give the
overhead of signal delivery, and you can just ignore me and leave this
as-is, which is fine too. :)
This continues to look really nice. Very simple, very powerful. I think
I'd like to see one more thing: a selftest! I think it should be very
easy to add; model it after the seccomp selftests:
tools/testing/selftests/seccomp/seccomp_bpf.c
The testing harness there should make it very easy to produce a test,
and it's easy to wire up to the Makefiles. I'm happy to help point you
in the right directions. If you want, you could even share seccomp's
directory, but maybe you want your own. :)
--
Kees Cook
Kees Cook <[email protected]> writes:
> On Thu, Jul 09, 2020 at 12:38:40AM -0400, Gabriel Krisman Bertazi wrote:
>> [...]
>> +config SYSCALL_USER_DISPATCH
>> + bool "Support rejecting syscalls not coming from a dispatcher"
>
> bike shed: this doesn't really describe why it's useful. Maybe:
>
> bool "Support syscall redirection to userspace dispatcher"
>
>> + depends on HAVE_ARCH_SYSCALL_USER_DISPATCH
>> + help
>> + Enable tasks to ask the kernel to redirect syscalls not
>> + issued from a predefined dispatcher back to userspace,
>> + depending on a userspace selector.
>
> depending on a userspace memory selector.
>
> ?
>
>> [...]
>> --- a/arch/x86/entry/common.c
>> +++ b/arch/x86/entry/common.c
>> @@ -138,6 +138,11 @@ static long syscall_trace_enter(struct pt_regs *regs)
>> return -1L;
>> }
>>
>> + if (work & _TIF_SYSCALL_USER_DISPATCH) {
>> + if (do_syscall_user_dispatch(regs))
>> + return -1L;
>> + }
>> +
>
> Nice; I like this! It's very small, and now I want to refactor seccomp
> to be so pretty. :)
>
>> [...]
>> diff --git a/fs/exec.c b/fs/exec.c
>> index e6e8a9a70327..44f0ce352a0d 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1502,6 +1502,8 @@ void setup_new_exec(struct linux_binprm * bprm)
>>
>> arch_setup_new_exec();
>>
>> + clear_tsk_syscall_user_dispatch(me);
>> +
>
> To keep this as arch-agnostic as possible, I actually think this belongs
> in begin_new_exec() instead, after the personality clearing. If it were
> to be less arch-agnostic, I'd recommend it live in arch_setup_new_exec()
> (which most other TIF flags get cleared), but I'd like to have this
> feature as a good example of an "easy" refactor into arch-agnostic in
> the future. :P
>
>> [...]
>> @@ -285,6 +285,7 @@ typedef struct siginfo {
>> */
>> #define SYS_SECCOMP 1 /* seccomp triggered */
>> #define NSIGSYS 1
>> +#define SYS_USER_REDIRECT 2
>
> More than just what willy suggested, but you also need to bump NSIGSYS
> (I missed that in the RFC). This should read as:
>
> #define SYS_SECCOMP 1 /* seccomp triggered */
> #define SYS_USER_REDIRECT 2
> #define NSIGSYS 2
>
> (i.e. NSIGSYS is "how many si codes are there for sigsys?")
Hey Kees,
Thanks for the very detailed feedback.
I was actually asking myself what NSIGSYS was and, if it meant something
else, whether it was a bug to have the same value as SYS_SECCOMP. The
name isn't very meaningful to me :). I'll slip in a comment here in my
next version.
>> [...]
>> +int do_syscall_user_dispatch(struct pt_regs *regs)
>> +{
>> + int state;
>> +
>> + if (current->syscall_dispatch.dispatcher == instruction_pointer(regs))
>> + return 0;
>
> Just to clarify what willy was talking about -- since you're using
> "dispatcher" as a scalar (and not dereferencing it, etc), it can just
> stay "unsigned long" without __user.
Thanks, i think I got the __user flag correctly now.
I was actually wondering more if this should be (void*), (uintptr) or
some other type, and if ulong was the right call, but looking at other
code in the kernel ulong seems fine.
> In the documentation for the future "range inclusive" check, maybe also
> note that it's the inclusive ranger covering the _starting_ address of
> the syscall entry, in case anyone thinks they need to decode instruction
> lengths to get the right range, which they don't and I imagine you don't
> care about.
This is something I was wondering. For the end of the range, shouldn't
we document it must include exactly the __end__ of the syscall
instruction, i.e., the start of the next instruction, instead of the
start? If syscall is in the last address of the range, we get IP for
whatever is next and in that case, a syscall issued from a "valid"
address is rejected. This is a corner case, for sure, but for
simplicity I think we could just let the kernel reject the syscall,
otherwise we need to find the actual syscall address when checking the
range. Is that fine?
>
>> +
>> + if (current->syscall_dispatch.selector) {
>> + if (__get_user(state, current->syscall_dispatch.selector))
>> + do_exit(SIGSEGV);
>> +
>> + switch (state) {
>> + case PR_SYSCALL_DISPATCH_DISABLE:
>> + return 0;
>> + case PR_SYSCALL_DISPATCH_ENABLE:
>> + break;
>> + default:
>> + do_exit(SIGSEGV);
>> + }
>> + }
>> +
>> + syscall_rollback(current, regs);
>> + trigger_sigsys(regs);
>> +
>> + return 1;
>
> How should do_syscall_user_dispatch() approach branch hinting? For example,
> is dispatcher == instruction_pointer going to be the common case for
> users of this? I would expect not. So maybe, at the top:
>
> if (unlikely(current->syscall_dispatch.dispatcher == instruction_pointer(regs)))
> return 0;
>
> and what about the selector? I assume it would be common for the selector
> to be set, and enabled:
>
> if (likely(current->syscall_dispatch.selector)) {
> if (__get_user(state, current->syscall_dispatch.selector))
> do_exit(SIGSEGV);
>
> if (unlikely(state != PR_SYSCALL_DISPATCH_ENABLE)) {
> if (likely(state == PR_SYSCALL_DISPATCH_DISABLE))
> return 0;
> do_exit(SIGSEGV);
> }
> }
>
> syscall_rollback...
My first version had all the unlikely/likely added, but I didn't observe
high branch misses when profiling so I dropped before sending the
RFC. The fast path, considering the new range scheme (specially when you
allow the glibc range), is entering the first 'if' condition, so that
should be made likely. Once we are reading selector then __get_user has
such a high cost with SMAP that I didn't notice any difference in the
wall-clock time.
Also, I think placing the DISABLE check before ENABLE might be
better, since we won't pay the high cost of SIGSYS in that case. Once
we know that we will either SIGSEGV or SIGSYS, a branch miss won't do
much difference.. Does the following sound good?
if (likely(current->syscall_dispatch.dispatcher == instruction_pointer(regs)))
return 0;
if (likely(current->syscall_dispatch.selector)) {
if (__get_user(state, current->syscall_dispatch.selector))
do_exit(SIGSEGV);
if (likely(state == PR_SYSCALL_DISPATCH_DISABLE)
return 0;
if (state != PR_SYSCALL_DISPATCH_ENABLE)
do_exit(SIGSEGV);
}
// syscall_rollback
>
> Or, perhaps micro-optimization doesn't matter at all here give the
> overhead of signal delivery, and you can just ignore me and leave this
> as-is, which is fine too. :)
>
> This continues to look really nice. Very simple, very powerful. I think
> I'd like to see one more thing: a selftest! I think it should be very
> easy to add; model it after the seccomp selftests:
>
> tools/testing/selftests/seccomp/seccomp_bpf.c
>
> The testing harness there should make it very easy to produce a test,
> and it's easy to wire up to the Makefiles. I'm happy to help point you
> in the right directions. If you want, you could even share seccomp's
> directory, but maybe you want your own. :)
Definitely. Added to my todo list for the next iteration.
Thanks!
--
Gabriel Krisman Bertazi