2020-11-28 22:23:49

by Gabriel Krisman Bertazi

Subject: [PATCH v8 0/7] Syscall User Dispatch

Hi,

This is v8 of syscall user dispatch. Last version got some acks but
there was one small documentation fix I wanted to do, as requested by
Florian. This also addresses the commit message fixup Peter requested.

The only actual code change from v7 is solving a trivial merge conflict
I myself created with the entry code fixup I made last week and with
something else in the TIP tree.

I also shared this with glibc and there weren't any complaints other than
the matter about user-notif vs. siginfo, which was discussed in v7 and
the understanding is that it is not necessary now and can be added
later, if needed, on the same infrastructure without a new API.

I'm not sure about the TIP rules, but is it too late to be queued for
the next merge window? I'd love to have this in 5.11 if possible, since
it has been flying for quite a while.

This is based on tip/master.

As usual, a working tree with this patchset is available at:

https://gitlab.collabora.com/krisman/linux -b syscall-user-dispatch-v8

Previous submissions are archived at:

RFC/v1: https://lkml.org/lkml/2020/7/8/96
v2: https://lkml.org/lkml/2020/7/9/17
v3: https://lkml.org/lkml/2020/7/12/4
v4: https://www.spinics.net/lists/linux-kselftest/msg16377.html
v5: https://lkml.org/lkml/2020/8/10/1320
v6: https://lkml.org/lkml/2020/9/4/1122
v7: https://lwn.net/Articles/837598/

Gabriel Krisman Bertazi (7):
x86: vdso: Expose sigreturn address on vdso to the kernel
signal: Expose SYS_USER_DISPATCH si_code type
kernel: Implement selective syscall userspace redirection
entry: Support Syscall User Dispatch on common syscall entry
selftests: Add kselftest for syscall user dispatch
selftests: Add benchmark for syscall user dispatch
docs: Document Syscall User Dispatch

.../admin-guide/syscall-user-dispatch.rst | 87 +++++
arch/x86/entry/vdso/vdso2c.c | 2 +
arch/x86/entry/vdso/vdso32/sigreturn.S | 2 +
arch/x86/entry/vdso/vma.c | 15 +
arch/x86/include/asm/elf.h | 2 +
arch/x86/include/asm/vdso.h | 2 +
arch/x86/kernel/signal_compat.c | 2 +-
fs/exec.c | 3 +
include/linux/entry-common.h | 2 +
include/linux/sched.h | 2 +
include/linux/syscall_user_dispatch.h | 40 +++
include/linux/thread_info.h | 2 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/prctl.h | 5 +
kernel/entry/Makefile | 2 +-
kernel/entry/common.c | 17 +
kernel/entry/common.h | 16 +
kernel/entry/syscall_user_dispatch.c | 102 ++++++
kernel/fork.c | 1 +
kernel/sys.c | 5 +
tools/testing/selftests/Makefile | 1 +
.../syscall_user_dispatch/.gitignore | 3 +
.../selftests/syscall_user_dispatch/Makefile | 9 +
.../selftests/syscall_user_dispatch/config | 1 +
.../syscall_user_dispatch/sud_benchmark.c | 200 +++++++++++
.../syscall_user_dispatch/sud_test.c | 310 ++++++++++++++++++
26 files changed, 833 insertions(+), 3 deletions(-)
create mode 100644 Documentation/admin-guide/syscall-user-dispatch.rst
create mode 100644 include/linux/syscall_user_dispatch.h
create mode 100644 kernel/entry/common.h
create mode 100644 kernel/entry/syscall_user_dispatch.c
create mode 100644 tools/testing/selftests/syscall_user_dispatch/.gitignore
create mode 100644 tools/testing/selftests/syscall_user_dispatch/Makefile
create mode 100644 tools/testing/selftests/syscall_user_dispatch/config
create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_benchmark.c
create mode 100644 tools/testing/selftests/syscall_user_dispatch/sud_test.c

--
2.29.2


2020-11-28 22:25:49

by Gabriel Krisman Bertazi

Subject: [PATCH v8 4/7] entry: Support Syscall User Dispatch on common syscall entry

Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place. In addition, either the syscall is
dispatched back to userspace, in which case there is no resource to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.

Since SUD runs before tracepoints, it needs to be in SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, on_syscall_dispatch() examines the context to skip the
tracepoint, audit and other work.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
---
Changes since v6:
- Update do_syscall_intercept signature (Christian Brauner)
- Move it to before tracepoints
- Use SYSCALL_WORK flags
---
include/linux/entry-common.h | 2 ++
kernel/entry/common.c | 17 +++++++++++++++++
2 files changed, 19 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b216e4e..a6e98b4ba8e9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -44,10 +44,12 @@
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_EXIT)

/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f1b12dc32ff4..ec20aba3b890 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -6,6 +6,8 @@
#include <linux/livepatch.h>
#include <linux/audit.h>

+#include "common.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

@@ -47,6 +49,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
{
long ret = 0;

+ /*
+ * Handle Syscall User Dispatch. This must come first, since
+ * the ABI here can be something that doesn't make sense for
+ * other syscall_work features.
+ */
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (do_syscall_user_dispatch(regs))
+ return -1L;
+ }
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = arch_syscall_enter_tracehook(regs);
@@ -232,6 +244,11 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
{
bool step;

+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (on_syscall_dispatch())
+ return;
+ }
+
audit_syscall_exit(regs);

if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
--
2.29.2
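
The ordering argued for in the commit message of patch 4/7 can be modelled
outside the kernel. The following is only a minimal userspace sketch, not
kernel code: the WORK_* bits, struct model_task and every model_* function
are hypothetical stand-ins for the SYSCALL_WORK_* flags, the task state and
the entry/exit paths touched by the patch. It shows that a dispatched
syscall returns -1 before any ptrace, seccomp or tracepoint stage runs, and
that the matching exit work is skipped exactly once.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for SYSCALL_WORK_* bits; the values are illustrative. */
enum {
	WORK_USER_DISPATCH = 1u << 0,
	WORK_TRACE         = 1u << 1,
	WORK_SECCOMP       = 1u << 2,
	WORK_TRACEPOINT    = 1u << 3,
};

/* Hypothetical per-task state for the model. */
struct model_task {
	bool on_dispatch;	/* set when a syscall was bounced to userspace */
	unsigned int ran;	/* later entry stages that actually ran */
};

/* Assumption: the dispatch decision is reduced to a single boolean input. */
static bool model_user_dispatch(struct model_task *t, bool bounce)
{
	if (bounce)
		t->on_dispatch = true;
	return bounce;
}

/* Entry side: SUD is checked first; a dispatched syscall skips the rest. */
static long model_trace_enter(struct model_task *t, unsigned int work,
			      bool bounce)
{
	t->ran = 0;
	if ((work & WORK_USER_DISPATCH) && model_user_dispatch(t, bounce))
		return -1L;

	/* Only reached when the syscall is going to be executed. */
	t->ran = work & (WORK_TRACE | WORK_SECCOMP | WORK_TRACEPOINT);
	return 0;
}

/* Exit side: returns true when audit/tracepoint exit work would run. */
static bool model_exit_work(struct model_task *t, unsigned int work)
{
	if ((work & WORK_USER_DISPATCH) && t->on_dispatch) {
		t->on_dispatch = false;	/* consume the marker */
		return false;
	}
	return true;
}
```

Calling model_trace_enter() with bounce set to true leaves t->ran at zero,
and the next model_exit_work() call reports that exit work is skipped; a
second call reports that it runs again, because the marker is cleared.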

2020-12-01 23:02:41

by Kees Cook

Subject: Re: [PATCH v8 4/7] entry: Support Syscall User Dispatch on common syscall entry

On Fri, Nov 27, 2020 at 02:32:35PM -0500, Gabriel Krisman Bertazi wrote:
> Syscall User Dispatch (SUD) must take precedence over seccomp and
> ptrace, since the use case is emulation (it can be invoked with a
> different ABI) such that seccomp filtering by syscall number doesn't
> make sense in the first place. In addition, either the syscall is
> dispatched back to userspace, in which case there is no resource to
> trace, or the syscall will be executed, and seccomp/ptrace will execute
> next.
>
> Since SUD runs before tracepoints, it needs to be in SYSCALL_WORK_EXIT as
> well, just to prevent a trace exit event when dispatch was triggered.
> For that, on_syscall_dispatch() examines the context to skip the
> tracepoint, audit and other work.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>

Acked-by: Kees Cook <[email protected]>


--
Kees Cook

2020-12-02 00:08:53

by Andy Lutomirski

Subject: Re: [PATCH v8 0/7] Syscall User Dispatch

On Fri, Nov 27, 2020 at 11:32 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Hi,
>
> This is v8 of syscall user dispatch. Last version got some acks but
> there was one small documentation fix I wanted to do, as requested by
> Florian. This also addresses the commit message fixup Peter requested.
>
> The only actual code change from v7 is solving a trivial merge conflict
> I myself created with the entry code fixup I made last week and with
> something else in the TIP tree.
>
> I also shared this with glibc and there weren't any complaints other than
> the matter about user-notif vs. siginfo, which was discussed in v7 and
> the understanding is that it is not necessary now and can be added
> later, if needed, on the same infrastructure without a new API.
>
> I'm not sure about the TIP rules, but is it too late to be queued for
> the next merge window? I'd love to have this in 5.11 if possible, since
> it has been flying for quite a while.
>

Other than my little nitpick about on_syscall_dispatch(), the whole series is:

Reviewed-by: Andy Lutomirski <[email protected]>

2020-12-02 00:08:53

by Andy Lutomirski

Subject: Re: [PATCH v8 4/7] entry: Support Syscall User Dispatch on common syscall entry

On Fri, Nov 27, 2020 at 11:33 AM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> Syscall User Dispatch (SUD) must take precedence over seccomp and
> ptrace, since the use case is emulation (it can be invoked with a
> different ABI) such that seccomp filtering by syscall number doesn't
> make sense in the first place. In addition, either the syscall is
> dispatched back to userspace, in which case there is no resource to
> trace, or the syscall will be executed, and seccomp/ptrace will execute
> next.
>
> Since SUD runs before tracepoints, it needs to be in SYSCALL_WORK_EXIT as
> well, just to prevent a trace exit event when dispatch was triggered.
> For that, on_syscall_dispatch() examines the context to skip the
> tracepoint, audit and other work.
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> Changes since v6:
> - Update do_syscall_intercept signature (Christian Brauner)
> - Move it to before tracepoints
> - Use SYSCALL_WORK flags
> ---
> include/linux/entry-common.h | 2 ++
> kernel/entry/common.c | 17 +++++++++++++++++
> 2 files changed, 19 insertions(+)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index 49b26b216e4e..a6e98b4ba8e9 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -44,10 +44,12 @@
> SYSCALL_WORK_SYSCALL_TRACE | \
> SYSCALL_WORK_SYSCALL_EMU | \
> SYSCALL_WORK_SYSCALL_AUDIT | \
> + SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
> ARCH_SYSCALL_WORK_ENTER)
> #define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
> SYSCALL_WORK_SYSCALL_TRACE | \
> SYSCALL_WORK_SYSCALL_AUDIT | \
> + SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
> ARCH_SYSCALL_WORK_EXIT)
>
> /*
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index f1b12dc32ff4..ec20aba3b890 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -6,6 +6,8 @@
> #include <linux/livepatch.h>
> #include <linux/audit.h>
>
> +#include "common.h"
> +
> #define CREATE_TRACE_POINTS
> #include <trace/events/syscalls.h>
>
> @@ -47,6 +49,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> {
> long ret = 0;
>
> + /*
> + * Handle Syscall User Dispatch. This must come first, since
> + * the ABI here can be something that doesn't make sense for
> + * other syscall_work features.
> + */
> + if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
> + if (do_syscall_user_dispatch(regs))
> + return -1L;
> + }
> +
> /* Handle ptrace */
> if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
> ret = arch_syscall_enter_tracehook(regs);
> @@ -232,6 +244,11 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
> {
> bool step;
>
> + if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
> + if (on_syscall_dispatch())
> + return;
> + }

I think this would be less confusing if you just open-coded the body
of on_syscall_dispatch here and got rid of the helper.

--Andy
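
To make the suggestion above concrete, here is a small userspace sketch of
the two shapes being compared. It is an illustration only: the thread does
not show the body of on_syscall_dispatch(), so the sketch assumes it is a
test-and-clear of an on_dispatch flag, and struct model_dispatch plus all
function names below are hypothetical. One plausible reading of the review
comment is that a helper named like a predicate hides the fact that it also
clears state, whereas the open-coded form shows the write at the call site.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task dispatch state. */
struct model_dispatch {
	bool on_dispatch;
};

/* Helper form: reads like a pure predicate, but also clears the flag. */
static bool model_on_syscall_dispatch(struct model_dispatch *d)
{
	if (d->on_dispatch) {
		d->on_dispatch = false;
		return true;
	}
	return false;
}

/* Exit path using the helper; returns true when exit work is skipped. */
static bool exit_skip_with_helper(struct model_dispatch *d)
{
	return model_on_syscall_dispatch(d);
}

/* Exit path with the body open-coded; same behaviour, visible side effect. */
static bool exit_skip_open_coded(struct model_dispatch *d)
{
	if (d->on_dispatch) {
		d->on_dispatch = false;	/* the state change is explicit here */
		return true;
	}
	return false;
}
```

Both variants skip the exit work once and leave the flag cleared, so in
this model the choice between them affects readability only, not behaviour.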

2020-12-02 09:43:03

by tip-bot2 for Gabriel Krisman Bertazi

Subject: [tip: core/entry] entry: Support Syscall User Dispatch on common syscall entry

The following commit has been merged into the core/entry branch of tip:

Commit-ID: 5a5c45c624b8851cbfd269d5b0a8856a2b728502
Gitweb: https://git.kernel.org/tip/5a5c45c624b8851cbfd269d5b0a8856a2b728502
Author: Gabriel Krisman Bertazi <[email protected]>
AuthorDate: Fri, 27 Nov 2020 14:32:35 -05:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 02 Dec 2020 10:32:17 +01:00

entry: Support Syscall User Dispatch on common syscall entry

Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place. In addition, either the syscall is
dispatched back to userspace, in which case there is no resource to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.

Since SUD runs before tracepoints, it needs to be in SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, on_syscall_dispatch() examines the context to skip the
tracepoint, audit and other work.

[ tglx: Add a comment on the exit side ]

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/entry-common.h | 2 ++
kernel/entry/common.c | 25 +++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b2..a6e98b4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -44,10 +44,12 @@
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_EXIT)

/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 91e8fd5..e661e70 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -5,6 +5,8 @@
#include <linux/livepatch.h>
#include <linux/audit.h>

+#include "common.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

@@ -46,6 +48,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
{
long ret = 0;

+ /*
+ * Handle Syscall User Dispatch. This must come first, since
+ * the ABI here can be something that doesn't make sense for
+ * other syscall_work features.
+ */
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (syscall_user_dispatch(regs))
+ return -1L;
+ }
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = arch_syscall_enter_tracehook(regs);
@@ -230,6 +242,19 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
{
bool step;

+ /*
+ * If the syscall was rolled back due to syscall user dispatching,
+ * then the tracers below are not invoked for the same reason as
+ * the entry side was not invoked in syscall_trace_enter(): The ABI
+ * of these syscalls is unknown.
+ */
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (unlikely(current->syscall_dispatch.on_dispatch)) {
+ current->syscall_dispatch.on_dispatch = false;
+ return;
+ }
+ }
+
audit_syscall_exit(regs);

if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)

2020-12-02 14:16:33

by tip-bot2 for Gabriel Krisman Bertazi

Subject: [tip: core/entry] entry: Support Syscall User Dispatch on common syscall entry

The following commit has been merged into the core/entry branch of tip:

Commit-ID: 11894468e39def270199f845b76df6c36d4ed133
Gitweb: https://git.kernel.org/tip/11894468e39def270199f845b76df6c36d4ed133
Author: Gabriel Krisman Bertazi <[email protected]>
AuthorDate: Fri, 27 Nov 2020 14:32:35 -05:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 02 Dec 2020 15:07:56 +01:00

entry: Support Syscall User Dispatch on common syscall entry

Syscall User Dispatch (SUD) must take precedence over seccomp and
ptrace, since the use case is emulation (it can be invoked with a
different ABI) such that seccomp filtering by syscall number doesn't
make sense in the first place. In addition, either the syscall is
dispatched back to userspace, in which case there is no resource to
trace, or the syscall will be executed, and seccomp/ptrace will execute
next.

Since SUD runs before tracepoints, it needs to be in SYSCALL_WORK_EXIT as
well, just to prevent a trace exit event when dispatch was triggered.
For that, on_syscall_dispatch() examines the context to skip the
tracepoint, audit and other work.

[ tglx: Add a comment on the exit side ]

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/entry-common.h | 2 ++
kernel/entry/common.c | 25 +++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 49b26b2..a6e98b4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -44,10 +44,12 @@
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_EMU | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
+ SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_EXIT)

/*
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 91e8fd5..e661e70 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -5,6 +5,8 @@
#include <linux/livepatch.h>
#include <linux/audit.h>

+#include "common.h"
+
#define CREATE_TRACE_POINTS
#include <trace/events/syscalls.h>

@@ -46,6 +48,16 @@ static long syscall_trace_enter(struct pt_regs *regs, long syscall,
{
long ret = 0;

+ /*
+ * Handle Syscall User Dispatch. This must come first, since
+ * the ABI here can be something that doesn't make sense for
+ * other syscall_work features.
+ */
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (syscall_user_dispatch(regs))
+ return -1L;
+ }
+
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = arch_syscall_enter_tracehook(regs);
@@ -230,6 +242,19 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
{
bool step;

+ /*
+ * If the syscall was rolled back due to syscall user dispatching,
+ * then the tracers below are not invoked for the same reason as
+ * the entry side was not invoked in syscall_trace_enter(): The ABI
+ * of these syscalls is unknown.
+ */
+ if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
+ if (unlikely(current->syscall_dispatch.on_dispatch)) {
+ current->syscall_dispatch.on_dispatch = false;
+ return;
+ }
+ }
+
audit_syscall_exit(regs);

if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)