From: YiFei Zhu <[email protected]>
Alternative: https://lore.kernel.org/lkml/[email protected]/T/
Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
except for MIPS with its sparse syscall numbers.
* Configurable per-build for future different cache modes.
This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.
The overhead of running Seccomp filters has come up in several past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters, which further enlarges the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that it is non-negligible and has a non-trivial
impact on application performance.
We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each
bit represents a syscall makes the most sense for these filters.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of the syscall arguments.
When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.
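As a concrete illustration (a minimal userspace sketch, not taken
from this series; an x86_64-only build is assumed, and the BPF_STMT/
BPF_JUMP macros come from linux/filter.h), the following filter
examines only "arch" and "nr", so the attach-time emulation can mark
every syscall other than ptrace as always-allowed in the bitmap:

#include <stddef.h>
#include <errno.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static struct sock_filter filter[] = {
	/* Kill the process unless the reported arch is x86_64. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Deny ptrace with EPERM; allow every other syscall. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ptrace, 0, 1),
	BPF_STMT(BPF_RET | BPF_K,
		 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
static struct sock_fprog prog = {
	.len = sizeof(filter) / sizeof(filter[0]),
	.filter = filter,
};

A filter that additionally loaded seccomp_data->args[] would not be
cacheable and would keep being executed on every syscall, as before.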
Ongoing work is to further support syscall arguments with fast hash
table lookups. We are investigating the performance of doing so [6],
and how to best integrate it with the existing seccomp infrastructure.
Benchmarks were performed; the results are in patch 5 and copied below:
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 100000000 syscalls...
63.896255358 - 0.008504529 = 63887750829 (63.9s)
getpid native: 638 ns
130.383312423 - 63.897315189 = 66485997234 (66.5s)
getpid RET_ALLOW 1 filter (bitmap): 664 ns
196.789080421 - 130.384414983 = 66404665438 (66.4s)
getpid RET_ALLOW 2 filters (bitmap): 664 ns
268.844643304 - 196.790234168 = 72054409136 (72.1s)
getpid RET_ALLOW 3 filters (full): 720 ns
342.627472515 - 268.845799103 = 73781673412 (73.8s)
getpid RET_ALLOW 4 filters (full): 737 ns
Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
Estimated total seccomp overhead for 3 full filters: 82 ns
Estimated total seccomp overhead for 4 full filters: 99 ns
Estimated seccomp entry overhead: 26 ns
Estimated seccomp per-filter overhead (last 2 diff): 17 ns
Estimated seccomp per-filter overhead (filters / 4): 18 ns
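For reference, the estimates above are simple differences of the
measured per-syscall times:
	entry overhead           = 664 - 638     = 26 ns
	per-filter (last 2 diff) = 737 - 720     = 17 ns
	per-filter (filters / 4) = (99 - 26) / 4 ≈ 18 ns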
Expectations:
native ≤ 1 bitmap (638 ≤ 664): ✔️
native ≤ 1 filter (638 ≤ 720): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
entry ≈ 1 bitmapped (26 ≈ 26): ✔️
entry ≈ 2 bitmapped (26 ≈ 26): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️
RFC -> v1:
* Config is on by default across all arches that could support it.
* Added an arch numbers array; the filter is emulated for each arch
  number, with a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache at Jann's suggestion.
Patch 1 moves the SECCOMP Kconfig option to arch/Kconfig.
Patch 2 adds a syscall_arches array so the emulator can enumerate it.
Patch 3 implements the emulator that finds if a filter must return allow.
Patch 4 implements the test_bit lookup against the bitmaps.
Patch 5 updates the selftest to better show the new semantics.
Patch 6 implements /proc/pid/seccomp_cache.
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
Kees Cook (1):
selftests/seccomp: Compare bitmap vs filter overhead
YiFei Zhu (5):
seccomp: Move config option SECCOMP to arch/Kconfig
asm/syscall.h: Add syscall_arches[] array
seccomp/cache: Add "emulator" to check if filter is arg-dependent
seccomp/cache: Lookup syscall allowlist for fast path
seccomp/cache: Report cache data through /proc/pid/seccomp_cache
arch/Kconfig | 56 ++++
arch/alpha/include/asm/syscall.h | 4 +
arch/arc/include/asm/syscall.h | 24 +-
arch/arm/Kconfig | 15 +-
arch/arm/include/asm/syscall.h | 4 +
arch/arm64/Kconfig | 13 -
arch/arm64/include/asm/syscall.h | 4 +
arch/c6x/include/asm/syscall.h | 13 +-
arch/csky/Kconfig | 13 -
arch/csky/include/asm/syscall.h | 4 +
arch/h8300/include/asm/syscall.h | 4 +
arch/hexagon/include/asm/syscall.h | 4 +
arch/ia64/include/asm/syscall.h | 4 +
arch/m68k/include/asm/syscall.h | 4 +
arch/microblaze/Kconfig | 18 +-
arch/microblaze/include/asm/syscall.h | 4 +
arch/mips/Kconfig | 17 --
arch/mips/include/asm/syscall.h | 16 ++
arch/nds32/include/asm/syscall.h | 13 +-
arch/nios2/include/asm/syscall.h | 4 +
arch/openrisc/include/asm/syscall.h | 4 +
arch/parisc/Kconfig | 16 --
arch/parisc/include/asm/syscall.h | 7 +
arch/powerpc/Kconfig | 17 --
arch/powerpc/include/asm/syscall.h | 14 +
arch/riscv/Kconfig | 13 -
arch/riscv/include/asm/syscall.h | 14 +-
arch/s390/Kconfig | 17 --
arch/s390/include/asm/syscall.h | 7 +
arch/sh/Kconfig | 16 --
arch/sh/include/asm/syscall_32.h | 17 +-
arch/sparc/Kconfig | 18 +-
arch/sparc/include/asm/syscall.h | 9 +
arch/um/Kconfig | 16 --
arch/x86/Kconfig | 16 --
arch/x86/include/asm/syscall.h | 11 +
arch/x86/um/asm/syscall.h | 14 +-
arch/xtensa/Kconfig | 14 -
arch/xtensa/include/asm/syscall.h | 4 +
fs/proc/base.c | 7 +-
include/linux/seccomp.h | 5 +
kernel/seccomp.c | 259 +++++++++++++++++-
.../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++--
tools/testing/selftests/seccomp/settings | 2 +-
44 files changed, 641 insertions(+), 265 deletions(-)
--
2.28.0
From: YiFei Zhu <[email protected]>
SECCOMP_CACHE_NR_ONLY will only cache results for syscalls where the
filter does not access any syscall arguments or the instruction
pointer. To facilitate this, we need a static analyser to know
whether a filter will return allow regardless of syscall arguments
for a given architecture number / syscall number pair. This is
implemented here with a pseudo-emulator, and the result is stored in
a per-filter bitmap.
Each common BPF instruction (stolen from Kees's list [1]) is
emulated. Any weirdness, or a load from a syscall argument, will
cause the emulator to bail.
The emulation is also halted if it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.
Filter dependency is resolved at attach time. If a filter depends on
another filter, then we AND its bitmap with its dependee's: if the
dependee does not guarantee to allow the syscall, then the depender
is also marked as not guaranteed to allow the syscall.
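For example (illustrative; this instruction is not from the patch
itself), a filter containing

	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, args[0])),

loads a syscall argument, so the emulator bails at that instruction
and the corresponding bits simply remain unset; such syscalls keep
going through normal filter execution.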
[1] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: YiFei Zhu <[email protected]>
---
arch/Kconfig | 25 ++++++
kernel/seccomp.c | 196 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 220 insertions(+), 1 deletion(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 6dfc5673215d..8cc3dc87f253 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -489,6 +489,31 @@ config SECCOMP_FILTER
See Documentation/userspace-api/seccomp_filter.rst for details.
+choice
+ prompt "Seccomp filter cache"
+ default SECCOMP_CACHE_NONE
+ depends on SECCOMP_FILTER
+ help
+	  Seccomp filters can potentially incur large overhead for each
+	  system call. Caching filter results can alleviate some of it.
+
+ If in doubt, select 'syscall numbers only'.
+
+config SECCOMP_CACHE_NONE
+ bool "None"
+ help
+ No caching is done. Seccomp filters will be called each time
+ a system call occurs in a seccomp-guarded task.
+
+config SECCOMP_CACHE_NR_ONLY
+ bool "Syscall number only"
+ depends on !HAVE_SPARSE_SYSCALL_NR
+ help
+ For each syscall number, if the seccomp filter has a fixed
+ result, store that result in a bitmap to speed up system calls.
+
+endchoice
+
config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3ee59ce0a323..7c286d66f983 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,32 @@ struct notification {
struct list_head notifications;
};
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_cache_filter_data - container for cache's per-filter data
+ *
+ * @syscall_ok: A bitmap for each architecture number, where each bit
+ * represents whether the filter will always allow the syscall.
+ */
+struct seccomp_cache_filter_data {
+ DECLARE_BITMAP(syscall_ok[ARRAY_SIZE(syscall_arches)], NR_syscalls);
+};
+
+#define SECCOMP_EMU_MAX_PENDING_STATES 64
+#else
+struct seccomp_cache_filter_data { };
+
+static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+ return 0;
+}
+
+static inline void seccomp_cache_inherit(struct seccomp_filter *sfilter,
+ const struct seccomp_filter *prev)
+{
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
/**
* struct seccomp_filter - container for seccomp BPF programs
*
@@ -185,6 +211,7 @@ struct seccomp_filter {
struct notification *notif;
struct mutex notify_lock;
wait_queue_head_t wqh;
+ struct seccomp_cache_filter_data cache;
};
/* Limit any path through the tree to 256KB worth of instructions. */
@@ -530,6 +557,139 @@ static inline void seccomp_sync_threads(unsigned long flags)
}
}
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * struct seccomp_emu_env - container for seccomp emulator environment
+ *
+ * @filter: The cBPF filter instructions.
+ * @nr: The syscall number we are emulating.
+ * @arch: The architecture number we are emulating.
+ * @syscall_ok: Emulation result, whether it is okay for seccomp to cache the
+ * syscall.
+ */
+struct seccomp_emu_env {
+ struct sock_filter *filter;
+ int arch;
+ int nr;
+ bool syscall_ok;
+};
+
+/**
+ * struct seccomp_emu_state - container for seccomp emulator state
+ *
+ * @next: The next pending state. This structure is a linked list.
+ * @pc: The current program counter.
+ * @areg: The value of the A register.
+ */
+struct seccomp_emu_state {
+ struct seccomp_emu_state *next;
+ int pc;
+ u32 areg;
+};
+
+/**
+ * seccomp_emu_step - step one instruction in the emulator
+ * @env: The emulator environment
+ * @state: The emulator state
+ *
+ * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
+ */
+static int seccomp_emu_step(struct seccomp_emu_env *env,
+ struct seccomp_emu_state *state)
+{
+ struct sock_filter *ftest = &env->filter[state->pc++];
+ u16 code = ftest->code;
+ u32 k = ftest->k;
+ bool compare;
+
+ switch (code) {
+ case BPF_LD | BPF_W | BPF_ABS:
+ if (k == offsetof(struct seccomp_data, nr))
+ state->areg = env->nr;
+ else if (k == offsetof(struct seccomp_data, arch))
+ state->areg = env->arch;
+ else
+ return 1;
+
+ return 0;
+ case BPF_JMP | BPF_JA:
+ state->pc += k;
+ return 0;
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ switch (BPF_OP(code)) {
+ case BPF_JEQ:
+ compare = state->areg == k;
+ break;
+ case BPF_JGT:
+ compare = state->areg > k;
+ break;
+ case BPF_JGE:
+ compare = state->areg >= k;
+ break;
+ case BPF_JSET:
+ compare = state->areg & k;
+ break;
+ default:
+ WARN_ON(true);
+ return -EINVAL;
+ }
+
+ state->pc += compare ? ftest->jt : ftest->jf;
+ return 0;
+ case BPF_ALU | BPF_AND | BPF_K:
+ state->areg &= k;
+ return 0;
+ case BPF_RET | BPF_K:
+ env->syscall_ok = k == SECCOMP_RET_ALLOW;
+ return 1;
+ default:
+ return 1;
+ }
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cachable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Returns 0 if successful or -errno if error occurred.
+ */
+static int seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+ struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+ struct sock_filter *filter = fprog->filter;
+ int arch, nr, res = 0;
+
+ for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+ for (nr = 0; nr < NR_syscalls; nr++) {
+ struct seccomp_emu_env env = {0};
+ struct seccomp_emu_state state = {0};
+
+ env.filter = filter;
+ env.arch = syscall_arches[arch];
+ env.nr = nr;
+
+ while (true) {
+ res = seccomp_emu_step(&env, &state);
+ if (res)
+ break;
+ }
+
+ if (res < 0)
+ goto out;
+
+ if (env.syscall_ok)
+ set_bit(nr, sfilter->cache.syscall_ok[arch]);
+ }
+ }
+
+out:
+ return res;
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
/**
* seccomp_prepare_filter: Prepares a seccomp filter for use.
* @fprog: BPF program to install
@@ -540,7 +700,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
struct seccomp_filter *sfilter;
int ret;
- const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+ const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
+ IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY);
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
return ERR_PTR(-EINVAL);
@@ -571,6 +732,13 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
return ERR_PTR(ret);
}
+ ret = seccomp_cache_prepare(sfilter);
+ if (ret < 0) {
+ bpf_prog_destroy(sfilter->prog);
+ kfree(sfilter);
+ return ERR_PTR(ret);
+ }
+
refcount_set(&sfilter->refs, 1);
refcount_set(&sfilter->users, 1);
init_waitqueue_head(&sfilter->wqh);
@@ -606,6 +774,31 @@ seccomp_prepare_user_filter(const char __user *user_filter)
return filter;
}
+#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
+/**
+ * seccomp_cache_inherit - inherit the cache bitmaps from the previous filter
+ * @sfilter: The seccomp filter being prepared
+ * @prev: The previous filter that @sfilter depends on
+ *
+ * ANDs the bitmaps: a syscall is cacheable only if every filter allows it.
+ */
+static void seccomp_cache_inherit(struct seccomp_filter *sfilter,
+ const struct seccomp_filter *prev)
+{
+ int arch;
+
+ if (!prev)
+ return;
+
+ for (arch = 0; arch < ARRAY_SIZE(syscall_arches); arch++) {
+ bitmap_and(sfilter->cache.syscall_ok[arch],
+ sfilter->cache.syscall_ok[arch],
+ prev->cache.syscall_ok[arch],
+ NR_syscalls);
+ }
+}
+#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
+
/**
* seccomp_attach_filter: validate and attach filter
* @flags: flags to change filter behavior
@@ -655,6 +848,7 @@ static long seccomp_attach_filter(unsigned int flags,
* task reference.
*/
filter->prev = current->seccomp.filter;
+ seccomp_cache_inherit(filter, filter->prev);
current->seccomp.filter = filter;
	atomic_inc(&current->seccomp.filter_count);
--
2.28.0
From: YiFei Zhu <[email protected]>
The seccomp cache emulator needs to know all the architecture numbers
that syscall_get_arch() could return for the kernel build, in order
to generate a cache for all of them.
The array is declared in the header as static __maybe_unused const
to maximize compiler optimization opportunities such as loop
unrolling.
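As a sketch of the intended consumer (the real user is the cache
preparation code elsewhere in this series; emulate_one() below is a
hypothetical stand-in):

	int i;

	/*
	 * syscall_arches[] is a compile-time constant, so the
	 * compiler can unroll this loop and constant-fold the
	 * per-arch comparisons.
	 */
	for (i = 0; i < ARRAY_SIZE(syscall_arches); i++)
		emulate_one(syscall_arches[i]);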
Signed-off-by: YiFei Zhu <[email protected]>
---
arch/alpha/include/asm/syscall.h | 4 ++++
arch/arc/include/asm/syscall.h | 24 +++++++++++++++++++-----
arch/arm/include/asm/syscall.h | 4 ++++
arch/arm64/include/asm/syscall.h | 4 ++++
arch/c6x/include/asm/syscall.h | 13 +++++++++++--
arch/csky/include/asm/syscall.h | 4 ++++
arch/h8300/include/asm/syscall.h | 4 ++++
arch/hexagon/include/asm/syscall.h | 4 ++++
arch/ia64/include/asm/syscall.h | 4 ++++
arch/m68k/include/asm/syscall.h | 4 ++++
arch/microblaze/include/asm/syscall.h | 4 ++++
arch/mips/include/asm/syscall.h | 16 ++++++++++++++++
arch/nds32/include/asm/syscall.h | 13 +++++++++++--
arch/nios2/include/asm/syscall.h | 4 ++++
arch/openrisc/include/asm/syscall.h | 4 ++++
arch/parisc/include/asm/syscall.h | 7 +++++++
arch/powerpc/include/asm/syscall.h | 14 ++++++++++++++
arch/riscv/include/asm/syscall.h | 14 ++++++++++----
arch/s390/include/asm/syscall.h | 7 +++++++
arch/sh/include/asm/syscall_32.h | 17 +++++++++++------
arch/sparc/include/asm/syscall.h | 9 +++++++++
arch/x86/include/asm/syscall.h | 11 +++++++++++
arch/x86/um/asm/syscall.h | 14 ++++++++++----
arch/xtensa/include/asm/syscall.h | 4 ++++
24 files changed, 184 insertions(+), 23 deletions(-)
diff --git a/arch/alpha/include/asm/syscall.h b/arch/alpha/include/asm/syscall.h
index 11c688c1d7ec..625ac9b23f37 100644
--- a/arch/alpha/include/asm/syscall.h
+++ b/arch/alpha/include/asm/syscall.h
@@ -4,6 +4,10 @@
#include <uapi/linux/audit.h>
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_ALPHA
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_ALPHA;
diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
index 94529e89dff0..899c13cbf5cc 100644
--- a/arch/arc/include/asm/syscall.h
+++ b/arch/arc/include/asm/syscall.h
@@ -65,14 +65,28 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
}
}
+#ifdef CONFIG_ISA_ARCOMPACT
+# ifdef CONFIG_CPU_BIG_ENDIAN
+# define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACTBE
+# else
+# define SYSCALL_ARCH AUDIT_ARCH_ARCOMPACT
+# endif /* CONFIG_CPU_BIG_ENDIAN */
+#else
+# ifdef CONFIG_CPU_BIG_ENDIAN
+# define SYSCALL_ARCH AUDIT_ARCH_ARCV2BE
+# else
+# define SYSCALL_ARCH AUDIT_ARCH_ARCV2
+# endif /* CONFIG_CPU_BIG_ENDIAN */
+#endif /* CONFIG_ISA_ARCOMPACT */
+
+static __maybe_unused const int syscall_arches[] = {
+ SYSCALL_ARCH
+};
+
static inline int
syscall_get_arch(struct task_struct *task)
{
- return IS_ENABLED(CONFIG_ISA_ARCOMPACT)
- ? (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
- ? AUDIT_ARCH_ARCOMPACTBE : AUDIT_ARCH_ARCOMPACT)
- : (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
- ? AUDIT_ARCH_ARCV2BE : AUDIT_ARCH_ARCV2);
+ return SYSCALL_ARCH;
}
#endif
diff --git a/arch/arm/include/asm/syscall.h b/arch/arm/include/asm/syscall.h
index fd02761ba06c..33ade26e3956 100644
--- a/arch/arm/include/asm/syscall.h
+++ b/arch/arm/include/asm/syscall.h
@@ -73,6 +73,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
	memcpy(&regs->ARM_r0 + 1, args, 5 * sizeof(args[0]));
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_ARM
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
/* ARM tasks don't change audit architectures on the fly. */
diff --git a/arch/arm64/include/asm/syscall.h b/arch/arm64/include/asm/syscall.h
index cfc0672013f6..77f3d300e7a0 100644
--- a/arch/arm64/include/asm/syscall.h
+++ b/arch/arm64/include/asm/syscall.h
@@ -82,6 +82,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
	memcpy(&regs->regs[1], args, 5 * sizeof(args[0]));
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_ARM, AUDIT_ARCH_AARCH64
+};
+
/*
* We don't care about endianness (__AUDIT_ARCH_LE bit) here because
* AArch64 has the same system calls both on little- and big- endian.
diff --git a/arch/c6x/include/asm/syscall.h b/arch/c6x/include/asm/syscall.h
index 38f3e2284ecd..0d78c67ee1fc 100644
--- a/arch/c6x/include/asm/syscall.h
+++ b/arch/c6x/include/asm/syscall.h
@@ -66,10 +66,19 @@ static inline void syscall_set_arguments(struct task_struct *task,
regs->a9 = *args;
}
+#ifdef CONFIG_CPU_BIG_ENDIAN
+#define SYSCALL_ARCH AUDIT_ARCH_C6XBE
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_C6X
+#endif
+
+static __maybe_unused const int syscall_arches[] = {
+ SYSCALL_ARCH
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
- return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
- ? AUDIT_ARCH_C6XBE : AUDIT_ARCH_C6X;
+ return SYSCALL_ARCH;
}
#endif /* __ASM_C6X_SYSCALLS_H */
diff --git a/arch/csky/include/asm/syscall.h b/arch/csky/include/asm/syscall.h
index f624fa3bbc22..86242d2850d7 100644
--- a/arch/csky/include/asm/syscall.h
+++ b/arch/csky/include/asm/syscall.h
@@ -68,6 +68,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_CSKY
+};
+
static inline int
syscall_get_arch(struct task_struct *task)
{
diff --git a/arch/h8300/include/asm/syscall.h b/arch/h8300/include/asm/syscall.h
index 01666b8bb263..775f6ac8fde3 100644
--- a/arch/h8300/include/asm/syscall.h
+++ b/arch/h8300/include/asm/syscall.h
@@ -28,6 +28,10 @@ syscall_get_arguments(struct task_struct *task, struct pt_regs *regs,
*args = regs->er6;
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_H8300
+};
+
static inline int
syscall_get_arch(struct task_struct *task)
{
diff --git a/arch/hexagon/include/asm/syscall.h b/arch/hexagon/include/asm/syscall.h
index f6e454f18038..6ee21a76f6a3 100644
--- a/arch/hexagon/include/asm/syscall.h
+++ b/arch/hexagon/include/asm/syscall.h
@@ -45,6 +45,10 @@ static inline long syscall_get_return_value(struct task_struct *task,
return regs->r00;
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_HEXAGON
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_HEXAGON;
diff --git a/arch/ia64/include/asm/syscall.h b/arch/ia64/include/asm/syscall.h
index 6c6f16e409a8..19456125c89a 100644
--- a/arch/ia64/include/asm/syscall.h
+++ b/arch/ia64/include/asm/syscall.h
@@ -71,6 +71,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
ia64_syscall_get_set_arguments(task, regs, args, 1);
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_IA64
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_IA64;
diff --git a/arch/m68k/include/asm/syscall.h b/arch/m68k/include/asm/syscall.h
index 465ac039be09..031b051f9026 100644
--- a/arch/m68k/include/asm/syscall.h
+++ b/arch/m68k/include/asm/syscall.h
@@ -4,6 +4,10 @@
#include <uapi/linux/audit.h>
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_M68K
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_M68K;
diff --git a/arch/microblaze/include/asm/syscall.h b/arch/microblaze/include/asm/syscall.h
index 3a6924f3cbde..28cde14056d1 100644
--- a/arch/microblaze/include/asm/syscall.h
+++ b/arch/microblaze/include/asm/syscall.h
@@ -105,6 +105,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
asmlinkage unsigned long do_syscall_trace_enter(struct pt_regs *regs);
asmlinkage void do_syscall_trace_leave(struct pt_regs *regs);
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_MICROBLAZE
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_MICROBLAZE;
diff --git a/arch/mips/include/asm/syscall.h b/arch/mips/include/asm/syscall.h
index 25fa651c937d..29e4c1c47c54 100644
--- a/arch/mips/include/asm/syscall.h
+++ b/arch/mips/include/asm/syscall.h
@@ -140,6 +140,22 @@ extern const unsigned long sys_call_table[];
extern const unsigned long sys32_call_table[];
extern const unsigned long sysn32_call_table[];
+static __maybe_unused const int syscall_arches[] = {
+#ifdef __LITTLE_ENDIAN
+ AUDIT_ARCH_MIPSEL,
+# ifdef CONFIG_64BIT
+ AUDIT_ARCH_MIPSEL64,
+ AUDIT_ARCH_MIPSEL64N32,
+# endif /* CONFIG_64BIT */
+#else
+ AUDIT_ARCH_MIPS,
+# ifdef CONFIG_64BIT
+ AUDIT_ARCH_MIPS64,
+ AUDIT_ARCH_MIPS64N32,
+# endif /* CONFIG_64BIT */
+#endif /* __LITTLE_ENDIAN */
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
int arch = AUDIT_ARCH_MIPS;
diff --git a/arch/nds32/include/asm/syscall.h b/arch/nds32/include/asm/syscall.h
index 7b5180d78e20..2dd5e33bcfcb 100644
--- a/arch/nds32/include/asm/syscall.h
+++ b/arch/nds32/include/asm/syscall.h
@@ -154,11 +154,20 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
	memcpy(&regs->uregs[0] + 1, args, 5 * sizeof(args[0]));
}
+#ifdef CONFIG_CPU_BIG_ENDIAN
+#define SYSCALL_ARCH AUDIT_ARCH_NDS32BE
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_NDS32
+#endif
+
+static __maybe_unused const int syscall_arches[] = {
+ SYSCALL_ARCH
+};
+
static inline int
syscall_get_arch(struct task_struct *task)
{
- return IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)
- ? AUDIT_ARCH_NDS32BE : AUDIT_ARCH_NDS32;
+ return SYSCALL_ARCH;
}
#endif /* _ASM_NDS32_SYSCALL_H */
diff --git a/arch/nios2/include/asm/syscall.h b/arch/nios2/include/asm/syscall.h
index 526449edd768..8fa2716cac5a 100644
--- a/arch/nios2/include/asm/syscall.h
+++ b/arch/nios2/include/asm/syscall.h
@@ -69,6 +69,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
regs->r9 = *args;
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_NIOS2
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_NIOS2;
diff --git a/arch/openrisc/include/asm/syscall.h b/arch/openrisc/include/asm/syscall.h
index e6383be2a195..4eb28ad08042 100644
--- a/arch/openrisc/include/asm/syscall.h
+++ b/arch/openrisc/include/asm/syscall.h
@@ -64,6 +64,10 @@ syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
	memcpy(&regs->gpr[3], args, 6 * sizeof(args[0]));
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_OPENRISC
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_OPENRISC;
diff --git a/arch/parisc/include/asm/syscall.h b/arch/parisc/include/asm/syscall.h
index 00b127a5e09b..2915f140c9fd 100644
--- a/arch/parisc/include/asm/syscall.h
+++ b/arch/parisc/include/asm/syscall.h
@@ -55,6 +55,13 @@ static inline void syscall_rollback(struct task_struct *task,
/* do nothing */
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_PARISC,
+#ifdef CONFIG_64BIT
+ AUDIT_ARCH_PARISC64,
+#endif
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
int arch = AUDIT_ARCH_PARISC;
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index fd1b518eed17..781deb211e3d 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -104,6 +104,20 @@ static inline void syscall_set_arguments(struct task_struct *task,
regs->orig_gpr3 = args[0];
}
+static __maybe_unused const int syscall_arches[] = {
+#ifdef __LITTLE_ENDIAN__
+ AUDIT_ARCH_PPC | __AUDIT_ARCH_LE,
+# ifdef CONFIG_PPC64
+ AUDIT_ARCH_PPC64LE,
+# endif /* CONFIG_PPC64 */
+#else
+ AUDIT_ARCH_PPC,
+# ifdef CONFIG_PPC64
+ AUDIT_ARCH_PPC64,
+# endif /* CONFIG_PPC64 */
+#endif /* __LITTLE_ENDIAN__ */
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
int arch;
diff --git a/arch/riscv/include/asm/syscall.h b/arch/riscv/include/asm/syscall.h
index 49350c8bd7b0..4b36d358243e 100644
--- a/arch/riscv/include/asm/syscall.h
+++ b/arch/riscv/include/asm/syscall.h
@@ -73,13 +73,19 @@ static inline void syscall_set_arguments(struct task_struct *task,
	memcpy(&regs->a1, args, 5 * sizeof(regs->a1));
}
-static inline int syscall_get_arch(struct task_struct *task)
-{
#ifdef CONFIG_64BIT
- return AUDIT_ARCH_RISCV64;
+#define SYSCALL_ARCH AUDIT_ARCH_RISCV64
#else
- return AUDIT_ARCH_RISCV32;
+#define SYSCALL_ARCH AUDIT_ARCH_RISCV32
#endif
+
+static __maybe_unused const int syscall_arches[] = {
+ SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+ return SYSCALL_ARCH;
}
#endif /* _ASM_RISCV_SYSCALL_H */
diff --git a/arch/s390/include/asm/syscall.h b/arch/s390/include/asm/syscall.h
index d9d5de0f67ff..4cb9da36610a 100644
--- a/arch/s390/include/asm/syscall.h
+++ b/arch/s390/include/asm/syscall.h
@@ -89,6 +89,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
regs->orig_gpr2 = args[0];
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_S390X,
+#ifdef CONFIG_COMPAT
+ AUDIT_ARCH_S390,
+#endif
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
#ifdef CONFIG_COMPAT
diff --git a/arch/sh/include/asm/syscall_32.h b/arch/sh/include/asm/syscall_32.h
index cb51a7528384..4780f2339c72 100644
--- a/arch/sh/include/asm/syscall_32.h
+++ b/arch/sh/include/asm/syscall_32.h
@@ -69,13 +69,18 @@ static inline void syscall_set_arguments(struct task_struct *task,
regs->regs[4] = args[0];
}
-static inline int syscall_get_arch(struct task_struct *task)
-{
- int arch = AUDIT_ARCH_SH;
-
#ifdef CONFIG_CPU_LITTLE_ENDIAN
- arch |= __AUDIT_ARCH_LE;
+#define SYSCALL_ARCH AUDIT_ARCH_SHEL
+#else
+#define SYSCALL_ARCH AUDIT_ARCH_SH
#endif
- return arch;
+
+static __maybe_unused const int syscall_arches[] = {
+ SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+ return SYSCALL_ARCH;
}
#endif /* __ASM_SH_SYSCALL_32_H */
diff --git a/arch/sparc/include/asm/syscall.h b/arch/sparc/include/asm/syscall.h
index 62a5a78804c4..a458992cdcfe 100644
--- a/arch/sparc/include/asm/syscall.h
+++ b/arch/sparc/include/asm/syscall.h
@@ -127,6 +127,15 @@ static inline void syscall_set_arguments(struct task_struct *task,
regs->u_regs[UREG_I0 + i] = args[i];
}
+static __maybe_unused const int syscall_arches[] = {
+#ifdef CONFIG_SPARC64
+ AUDIT_ARCH_SPARC64,
+#endif
+#if !defined(CONFIG_SPARC64) || defined(CONFIG_COMPAT)
+ AUDIT_ARCH_SPARC,
+#endif
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
#if defined(CONFIG_SPARC64) && defined(CONFIG_COMPAT)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7cbf733d11af..e13bb2a65b6f 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -97,6 +97,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
	memcpy(&regs->bx + i, args, n * sizeof(args[0]));
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_I386
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_I386;
@@ -152,6 +156,13 @@ static inline void syscall_set_arguments(struct task_struct *task,
}
}
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_X86_64,
+#ifdef CONFIG_IA32_EMULATION
+ AUDIT_ARCH_I386,
+#endif
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h
index 56a2f0913e3c..590a31e22b99 100644
--- a/arch/x86/um/asm/syscall.h
+++ b/arch/x86/um/asm/syscall.h
@@ -9,13 +9,19 @@ typedef asmlinkage long (*sys_call_ptr_t)(unsigned long, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
-static inline int syscall_get_arch(struct task_struct *task)
-{
#ifdef CONFIG_X86_32
- return AUDIT_ARCH_I386;
+#define SYSCALL_ARCH AUDIT_ARCH_I386
#else
- return AUDIT_ARCH_X86_64;
+#define SYSCALL_ARCH AUDIT_ARCH_X86_64
#endif
+
+static __maybe_unused const int syscall_arches[] = {
+ SYSCALL_ARCH
+};
+
+static inline int syscall_get_arch(struct task_struct *task)
+{
+ return SYSCALL_ARCH;
}
#endif /* __UM_ASM_SYSCALL_H */
diff --git a/arch/xtensa/include/asm/syscall.h b/arch/xtensa/include/asm/syscall.h
index f9a671cbf933..3d334fb0d329 100644
--- a/arch/xtensa/include/asm/syscall.h
+++ b/arch/xtensa/include/asm/syscall.h
@@ -14,6 +14,10 @@
#include <asm/ptrace.h>
#include <uapi/linux/audit.h>
+static __maybe_unused const int syscall_arches[] = {
+ AUDIT_ARCH_XTENSA
+};
+
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_XTENSA;
--
2.28.0
From: YiFei Zhu <[email protected]>
Alternative: https://lore.kernel.org/lkml/[email protected]/T/
Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
except for MIPS with its sparse syscall numbers.
* Configurable per-build for future different cache modes.
This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.
The overhead of running Seccomp filters has come up in several past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters, which further enlarges the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that it is non-negligible and has a non-trivial
impact on application performance.
We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each
bit represents a syscall makes the most sense for these filters.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of the syscall arguments.
When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.
Ongoing work is to further support syscall arguments with fast hash
table lookups. We are investigating the performance of doing so [6],
and how to best integrate it with the existing seccomp infrastructure.
Benchmarks were performed; the results are in patch 5 and copied below:
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 100000000 syscalls...
63.896255358 - 0.008504529 = 63887750829 (63.9s)
getpid native: 638 ns
130.383312423 - 63.897315189 = 66485997234 (66.5s)
getpid RET_ALLOW 1 filter (bitmap): 664 ns
196.789080421 - 130.384414983 = 66404665438 (66.4s)
getpid RET_ALLOW 2 filters (bitmap): 664 ns
268.844643304 - 196.790234168 = 72054409136 (72.1s)
getpid RET_ALLOW 3 filters (full): 720 ns
342.627472515 - 268.845799103 = 73781673412 (73.8s)
getpid RET_ALLOW 4 filters (full): 737 ns
Estimated total seccomp overhead for 1 bitmapped filter: 26 ns
Estimated total seccomp overhead for 2 bitmapped filters: 26 ns
Estimated total seccomp overhead for 3 full filters: 82 ns
Estimated total seccomp overhead for 4 full filters: 99 ns
Estimated seccomp entry overhead: 26 ns
Estimated seccomp per-filter overhead (last 2 diff): 17 ns
Estimated seccomp per-filter overhead (filters / 4): 18 ns
Expectations:
native ≤ 1 bitmap (638 ≤ 664): ✔️
native ≤ 1 filter (638 ≤ 720): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (17 ≈ 18): ✔️
1 bitmapped ≈ 2 bitmapped (26 ≈ 26): ✔️
entry ≈ 1 bitmapped (26 ≈ 26): ✔️
entry ≈ 2 bitmapped (26 ≈ 26): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (732 ≈ 737): ✔️
RFC -> v1:
* Config is on by default across all arches that could support it.
* Added an arch numbers array; the filter is emulated for each arch
  number, with a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache at Jann's suggestion.
v1 -> v2:
* Corrected one outdated function documentation.
Patch 1 moves the SECCOMP Kconfig option to arch/Kconfig.
Patch 2 adds a syscall_arches array so the emulator can enumerate it.
Patch 3 implements the emulator that finds if a filter must return allow.
Patch 4 implements the test_bit lookup against the bitmaps.
Patch 5 updates the selftest to better show the new semantics.
Patch 6 implements /proc/pid/seccomp_cache.
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
Kees Cook (1):
selftests/seccomp: Compare bitmap vs filter overhead
YiFei Zhu (5):
seccomp: Move config option SECCOMP to arch/Kconfig
asm/syscall.h: Add syscall_arches[] array
seccomp/cache: Add "emulator" to check if filter is arg-dependent
seccomp/cache: Lookup syscall allowlist for fast path
seccomp/cache: Report cache data through /proc/pid/seccomp_cache
arch/Kconfig | 56 ++++
arch/alpha/include/asm/syscall.h | 4 +
arch/arc/include/asm/syscall.h | 24 +-
arch/arm/Kconfig | 15 +-
arch/arm/include/asm/syscall.h | 4 +
arch/arm64/Kconfig | 13 -
arch/arm64/include/asm/syscall.h | 4 +
arch/c6x/include/asm/syscall.h | 13 +-
arch/csky/Kconfig | 13 -
arch/csky/include/asm/syscall.h | 4 +
arch/h8300/include/asm/syscall.h | 4 +
arch/hexagon/include/asm/syscall.h | 4 +
arch/ia64/include/asm/syscall.h | 4 +
arch/m68k/include/asm/syscall.h | 4 +
arch/microblaze/Kconfig | 18 +-
arch/microblaze/include/asm/syscall.h | 4 +
arch/mips/Kconfig | 17 --
arch/mips/include/asm/syscall.h | 16 ++
arch/nds32/include/asm/syscall.h | 13 +-
arch/nios2/include/asm/syscall.h | 4 +
arch/openrisc/include/asm/syscall.h | 4 +
arch/parisc/Kconfig | 16 --
arch/parisc/include/asm/syscall.h | 7 +
arch/powerpc/Kconfig | 17 --
arch/powerpc/include/asm/syscall.h | 14 +
arch/riscv/Kconfig | 13 -
arch/riscv/include/asm/syscall.h | 14 +-
arch/s390/Kconfig | 17 --
arch/s390/include/asm/syscall.h | 7 +
arch/sh/Kconfig | 16 --
arch/sh/include/asm/syscall_32.h | 17 +-
arch/sparc/Kconfig | 18 +-
arch/sparc/include/asm/syscall.h | 9 +
arch/um/Kconfig | 16 --
arch/x86/Kconfig | 16 --
arch/x86/include/asm/syscall.h | 11 +
arch/x86/um/asm/syscall.h | 14 +-
arch/xtensa/Kconfig | 14 -
arch/xtensa/include/asm/syscall.h | 4 +
fs/proc/base.c | 7 +-
include/linux/seccomp.h | 5 +
kernel/seccomp.c | 257 +++++++++++++++++-
.../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++--
tools/testing/selftests/seccomp/settings | 2 +-
44 files changed, 639 insertions(+), 265 deletions(-)
--
2.28.0