From: YiFei Zhu <[email protected]>
Alternative: https://lore.kernel.org/lkml/[email protected]/T/
Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
* Architectures supported by default through arch number array,
except for MIPS with its sparse syscall numbers.
* Configurable per-build, to allow for different cache modes in the future.
This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. Recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers. As past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of syscall arguments.
When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.
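As a concrete illustration (a hedged sketch, not code from this
series), a typical allow-list filter of this cacheable shape is shown
below; it loads only "arch" and "nr", so the emulator can prove a
constant allow for each permitted syscall. The specific syscalls
(__NR_getpid, __NR_write) are arbitrary example choices:

#include <stddef.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static struct sock_filter example_filter[] = {
	/* Load seccomp_data->arch; kill on an unexpected architecture. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, arch)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* Load seccomp_data->nr; allow a fixed set of syscalls. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 1, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};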
Ongoing work aims to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.
Benchmarks were performed; the results appear in patch 5 and are copied below:
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
native ≤ 1 bitmap (646 ≤ 675): ✔️
native ≤ 1 filter (646 ≤ 732): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
entry ≈ 1 bitmapped (29 ≈ 29): ✔️
entry ≈ 2 bitmapped (29 ≈ 29): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️
v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array or reliance on loop unrolling. Arches
are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, based on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.
v1 -> v2:
* Corrected one outdated function documentation.
RFC -> v1:
* Config is on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache at Jann's suggestion.
Patch 1 adds the arch macros for x86.
Patch 2 implements the emulator that determines whether a filter must return allow.
Patch 3 implements the test_bit against the bitmaps.
Patch 4 updates the selftest to better show the new semantics.
Patch 5 implements /proc/pid/seccomp_cache.
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
Kees Cook (2):
x86: Enable seccomp architecture tracking
selftests/seccomp: Compare bitmap vs filter overhead
YiFei Zhu (3):
seccomp/cache: Add "emulator" to check if filter is constant allow
seccomp/cache: Lookup syscall allowlist for fast path
seccomp/cache: Report cache data through /proc/pid/seccomp_cache
arch/Kconfig | 49 ++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/seccomp.h | 15 +
fs/proc/base.c | 3 +
include/linux/seccomp.h | 5 +
kernel/seccomp.c | 265 +++++++++++++++++-
.../selftests/seccomp/seccomp_benchmark.c | 151 ++++++++--
tools/testing/selftests/seccomp/settings | 2 +-
8 files changed, 467 insertions(+), 24 deletions(-)
--
2.28.0
From: YiFei Zhu <[email protected]>
Alternative: https://lore.kernel.org/lkml/[email protected]/T/
Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. Recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers. As past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of syscall arguments.
When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.
Ongoing work aims to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.
Benchmarks were performed; the results appear in patch 5 and are copied below:
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
native ≤ 1 bitmap (646 ≤ 675): ✔️
native ≤ 1 filter (646 ≤ 732): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
entry ≈ 1 bitmapped (29 ≈ 29): ✔️
entry ≈ 2 bitmapped (29 ≈ 29): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️
v3 -> v4:
* Reordered patches
* Naming changes
* Fixed a race in /proc/pid/seccomp_cache against the filter being
released from the task, using Jann's suggestion of the sighand spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic of clearing bits from a filled bitmap, rather than setting bits.
v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array or reliance on loop unrolling. Arches
are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, based on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.
v1 -> v2:
* Corrected one outdated function documentation.
RFC -> v1:
* Config is on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache at Jann's suggestion.
Patch 1 implements the test_bit against the bitmaps.
Patch 2 implements the emulator that determines whether a filter must return allow.
Patch 3 adds the arch macros for x86.
Patch 4 updates the selftest to better show the new semantics.
Patch 5 implements /proc/pid/seccomp_cache.
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
Kees Cook (2):
x86: Enable seccomp architecture tracking
selftests/seccomp: Compare bitmap vs filter overhead
YiFei Zhu (3):
seccomp/cache: Lookup syscall allowlist bitmap for fast path
seccomp/cache: Add "emulator" to check if filter is constant allow
seccomp/cache: Report cache data through /proc/pid/seccomp_cache
arch/Kconfig | 24 ++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/seccomp.h | 15 +
fs/proc/base.c | 6 +
include/linux/seccomp.h | 5 +
kernel/seccomp.c | 289 +++++++++++++++++-
.../selftests/seccomp/seccomp_benchmark.c | 151 +++++++--
tools/testing/selftests/seccomp/settings | 2 +-
8 files changed, 469 insertions(+), 24 deletions(-)
--
2.28.0
From: Kees Cook <[email protected]>
Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.
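For context, other architectures opt in by defining the same macros in
their asm/seccomp.h. As a hedged illustration (not part of this patch;
macro values assumed from arm64's existing audit and compat support),
an arm64 equivalent might look like:

/* Hypothetical arch/arm64/include/asm/seccomp.h addition. */
#define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_AARCH64
#define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
#ifdef CONFIG_COMPAT
# define SECCOMP_ARCH_COMPAT		AUDIT_ARCH_ARM
# define SECCOMP_ARCH_COMPAT_NR	__NR_compat_syscalls
#endif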
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: YiFei Zhu <[email protected]>
Signed-off-by: YiFei Zhu <[email protected]>
---
arch/x86/include/asm/seccomp.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..03365af6165d 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,18 @@
#define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
#endif
+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+# ifdef CONFIG_COMPAT
+# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386
+# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls
+# endif
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386
+# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+#endif
+
#include <asm-generic/seccomp.h>
#endif /* _ASM_X86_SECCOMP_H */
--
2.28.0
From: YiFei Zhu <[email protected]>
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. Recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers. As past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.
The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.
When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.
Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before the bit is tested, with the syscall number
used as the index into the bitmap; if the bit is set, seccomp
returns allow. The bitmaps are all clear in this patch and will
be initialized in the next commit.
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
Co-developed-by: Dimitrios Skarlatos <[email protected]>
Signed-off-by: Dimitrios Skarlatos <[email protected]>
Signed-off-by: YiFei Zhu <[email protected]>
---
kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..73f6b6e9a3b0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,34 @@ struct notification {
struct list_head notifications;
};
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * struct action_cache - per-filter cache of seccomp actions per
+ * arch/syscall pair
+ *
+ * @allow_native: A bitmap where each bit represents whether the
+ * filter will always allow the syscall, for the
+ * native architecture.
+ * @allow_compat: A bitmap where each bit represents whether the
+ * filter will always allow the syscall, for the
+ * compat architecture.
+ */
+struct action_cache {
+ DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
+#ifdef SECCOMP_ARCH_COMPAT
+ DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+#else
+struct action_cache { };
+
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+ const struct seccomp_data *sd)
+{
+ return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
/**
* struct seccomp_filter - container for seccomp BPF programs
*
@@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
return 0;
}
+#ifdef SECCOMP_ARCH_NATIVE
+static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
+ size_t bitmap_size,
+ int syscall_nr)
+{
+ if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+ return false;
+ syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+ return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check_allow - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+ const struct seccomp_data *sd)
+{
+ int syscall_nr = sd->nr;
+ const struct action_cache *cache = &sfilter->cache;
+
+ if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
+ return seccomp_cache_check_allow_bitmap(cache->allow_native,
+ SECCOMP_ARCH_NATIVE_NR,
+ syscall_nr);
+#ifdef SECCOMP_ARCH_COMPAT
+ if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+ return seccomp_cache_check_allow_bitmap(cache->allow_compat,
+ SECCOMP_ARCH_COMPAT_NR,
+ syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+ WARN_ON_ONCE(true);
+ return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
/**
* seccomp_run_filters - evaluates all seccomp filters against @sd
* @sd: optional seccomp data to be passed to filters
@@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
if (WARN_ON(f == NULL))
return SECCOMP_RET_KILL_PROCESS;
+ if (seccomp_cache_check_allow(f, sd))
+ return SECCOMP_RET_ALLOW;
+
/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
--
2.28.0
From: YiFei Zhu <[email protected]>
SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, whose results are stored in a
per-filter bitmap.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of syscall arguments.
Nearly all seccomp filters are built from these cBPF instructions:
BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K
Each of these instructions is emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.
The emulation is also halted if it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.
Emulator structure and comments are from Kees [1] and Jann [2].
Emulation is done at attach time. If a filter is stacked on top of
earlier filters, and those filters do not already guarantee to allow
the syscall, the emulation for that syscall is skipped.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
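To make the walk concrete, here is a hedged example (not from the
patch) of a tiny filter and the emulation steps for one arch/nr pair;
__NR_getpid is an arbitrary example:

#include <stddef.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static struct sock_filter tiny_filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
};

/*
 * For sd->nr == __NR_getpid:
 *   pc 0: reg_value = sd->nr (treated as a constant)
 *   pc 1: JEQ matches, pc += jt (0); falls through to pc 2
 *   pc 2: BPF_RET with k == SECCOMP_RET_ALLOW -> constant allow,
 *         so the getpid bit stays set in the bitmap.
 * For any other sd->nr, pc 1 jumps to pc 3 (SECCOMP_RET_TRAP), the
 * emulation returns false, and that syscall's bit is cleared.
 */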
Suggested-by: Jann Horn <[email protected]>
Co-developed-by: Kees Cook <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: YiFei Zhu <[email protected]>
---
kernel/seccomp.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 157 insertions(+), 1 deletion(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 73f6b6e9a3b0..51032b41fe59 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
{
return false;
}
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
#endif /* SECCOMP_ARCH_NATIVE */
/**
@@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
* this filter after reaching 0. The @users count is always smaller
* or equal to @refs. Hence, reaching 0 for @users does not mean
* the filter can be freed.
+ * @cache: cache of arch/syscall mappings to actions
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
* @prev: points to a previously installed, or inherited, filter
* @prog: the BPF program to evaluate
@@ -208,6 +213,7 @@ struct seccomp_filter {
refcount_t refs;
refcount_t users;
bool log;
+ struct action_cache cache;
struct seccomp_filter *prev;
struct bpf_prog *prog;
struct notification *notif;
@@ -616,7 +622,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
struct seccomp_filter *sfilter;
int ret;
- const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+ const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
+ true;
+#else
+ false;
+#endif
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
return ERR_PTR(-EINVAL);
@@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter)
return filter;
}
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * seccomp_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number are arch
+ * number are considered constant.
+ */
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+ struct seccomp_data *sd)
+{
+ unsigned int insns;
+ unsigned int reg_value = 0;
+ unsigned int pc;
+ bool op_res;
+
+ if (WARN_ON_ONCE(!fprog))
+ return false;
+
+ insns = bpf_classic_proglen(fprog);
+ for (pc = 0; pc < insns; pc++) {
+ struct sock_filter *insn = &fprog->filter[pc];
+ u16 code = insn->code;
+ u32 k = insn->k;
+
+ switch (code) {
+ case BPF_LD | BPF_W | BPF_ABS:
+ switch (k) {
+ case offsetof(struct seccomp_data, nr):
+ reg_value = sd->nr;
+ break;
+ case offsetof(struct seccomp_data, arch):
+ reg_value = sd->arch;
+ break;
+ default:
+ /* can't optimize (non-constant value load) */
+ return false;
+ }
+ break;
+ case BPF_RET | BPF_K:
+ /* reached return with constant values only, check allow */
+ return k == SECCOMP_RET_ALLOW;
+ case BPF_JMP | BPF_JA:
+ pc += insn->k;
+ break;
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ switch (BPF_OP(code)) {
+ case BPF_JEQ:
+ op_res = reg_value == k;
+ break;
+ case BPF_JGE:
+ op_res = reg_value >= k;
+ break;
+ case BPF_JGT:
+ op_res = reg_value > k;
+ break;
+ case BPF_JSET:
+ op_res = !!(reg_value & k);
+ break;
+ default:
+ /* can't optimize (unknown jump) */
+ return false;
+ }
+
+ pc += op_res ? insn->jt : insn->jf;
+ break;
+ case BPF_ALU | BPF_AND | BPF_K:
+ reg_value &= k;
+ break;
+ default:
+ /* can't optimize (unknown insn) */
+ return false;
+ }
+ }
+
+ /* ran off the end of the filter?! */
+ WARN_ON(1);
+ return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+ void *bitmap, const void *bitmap_prev,
+ size_t bitmap_size, int arch)
+{
+ struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+ struct seccomp_data sd;
+ int nr;
+
+ if (bitmap_prev) {
+ /* The new filter must be as restrictive as the last. */
+ bitmap_copy(bitmap, bitmap_prev, bitmap_size);
+ } else {
+ /* Before any filters, all syscalls are always allowed. */
+ bitmap_fill(bitmap, bitmap_size);
+ }
+
+ for (nr = 0; nr < bitmap_size; nr++) {
+ /* No bitmap change: not a cacheable action. */
+ if (!test_bit(nr, bitmap))
+ continue;
+
+ sd.nr = nr;
+ sd.arch = arch;
+
+ /* No bitmap change: continue to always allow. */
+ if (seccomp_is_const_allow(fprog, &sd))
+ continue;
+
+ /*
+ * Not a cacheable action: always run filters.
+ * atomic clear_bit() not needed, filter not visible yet.
+ */
+ __clear_bit(nr, bitmap);
+ }
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cacheable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Populates @sfilter's action cache bitmaps in place.
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+ struct action_cache *cache = &sfilter->cache;
+ const struct action_cache *cache_prev =
+ sfilter->prev ? &sfilter->prev->cache : NULL;
+
+ seccomp_cache_prepare_bitmap(sfilter, cache->allow_native,
+ cache_prev ? cache_prev->allow_native : NULL,
+ SECCOMP_ARCH_NATIVE_NR,
+ SECCOMP_ARCH_NATIVE);
+
+#ifdef SECCOMP_ARCH_COMPAT
+ seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat,
+ cache_prev ? cache_prev->allow_compat : NULL,
+ SECCOMP_ARCH_COMPAT_NR,
+ SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
/**
* seccomp_attach_filter: validate and attach filter
* @flags: flags to change filter behavior
@@ -731,6 +886,7 @@ static long seccomp_attach_filter(unsigned int flags,
* task reference.
*/
filter->prev = current->seccomp.filter;
+ seccomp_cache_prepare(filter);
current->seccomp.filter = filter;
atomic_inc(¤t->seccomp.filter_count);
--
2.28.0
On Fri, Oct 9, 2020 at 10:15 AM YiFei Zhu <[email protected]> wrote:
>
> From: Kees Cook <[email protected]>
>
> Provide seccomp internals with the details to calculate which syscall
> table the running kernel is expecting to deal with. This allows for
> efficient architecture pinning and paves the way for constant-action
> bitmaps.
>
> Signed-off-by: Kees Cook <[email protected]>
> Co-developed-by: YiFei Zhu <[email protected]>
> Signed-off-by: YiFei Zhu <[email protected]>
> ---
> arch/x86/include/asm/seccomp.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
> index 2bd1338de236..03365af6165d 100644
> --- a/arch/x86/include/asm/seccomp.h
> +++ b/arch/x86/include/asm/seccomp.h
> @@ -16,6 +16,18 @@
> #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
> #endif
>
> +#ifdef CONFIG_X86_64
> +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64
> +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
> +# ifdef CONFIG_COMPAT
> +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386
> +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls
> +# endif
> +#else /* !CONFIG_X86_64 */
> +# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386
> +# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
> +#endif
Is the idea that any syscall that's out of range for this (e.g. all of
the x32 syscalls) is unoptimized? I'm okay with this, but I think it
could use a comment.
> +
> #include <asm-generic/seccomp.h>
>
> #endif /* _ASM_X86_SECCOMP_H */
> --
> 2.28.0
>
--
Andy Lutomirski
AMA Capital Management, LLC
On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <[email protected]> wrote:
>
> From: YiFei Zhu <[email protected]>
>
> SECCOMP_CACHE will only operate on syscalls that do not access
> any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
>
> In order to build this bitmap at filter attach time, each filter is
> emulated for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
> the program returns allow, then we can be sure that the filter must
> return allow independent from syscall arguments.
>
> Nearly all seccomp filters are built from these cBPF instructions:
>
> BPF_LD | BPF_W | BPF_ABS
> BPF_JMP | BPF_JEQ | BPF_K
> BPF_JMP | BPF_JGE | BPF_K
> BPF_JMP | BPF_JGT | BPF_K
> BPF_JMP | BPF_JSET | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K
> BPF_ALU | BPF_AND | BPF_K
>
> Each of these instructions are emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
>
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
>
> Emulator structure and comments are from Kees [1] and Jann [2].
>
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
[...]
> @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter)
> return filter;
> }
>
> +#ifdef SECCOMP_ARCH_NATIVE
> +/**
> + * seccomp_is_const_allow - check if filter is constant allow with given data
> + * @fprog: The BPF programs
> + * @sd: The seccomp data to check against, only syscall number are arch
> + * number are considered constant.
nit: s/syscall number are arch number/syscall number and arch number/
> + */
> +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
> + struct seccomp_data *sd)
> +{
> + unsigned int insns;
> + unsigned int reg_value = 0;
> + unsigned int pc;
> + bool op_res;
> +
> + if (WARN_ON_ONCE(!fprog))
> + return false;
> +
> + insns = bpf_classic_proglen(fprog);
bpf_classic_proglen() is defined as:
#define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
so this is wrong - what you want is the number of instructions in the
program, what you actually have is the size of the program in bytes.
Please instead check for `pc < fprog->len` in the loop condition.
> + for (pc = 0; pc < insns; pc++) {
> + struct sock_filter *insn = &fprog->filter[pc];
> + u16 code = insn->code;
> + u32 k = insn->k;
[...]
> + }
> +
> + /* ran off the end of the filter?! */
> + WARN_ON(1);
> + return false;
> +}
On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <[email protected]> wrote:
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.
>
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
>
> When it can be concluded that an allow must occur for the given
> architecture and syscall pair (this determination is introduced in
> the next commit), seccomp will immediately allow the syscall,
> bypassing further BPF execution.
>
> Each architecture number has its own bitmap. The architecture
> number in seccomp_data is checked against the defined architecture
> number constant before proceeding to test the bit against the
> bitmap with the syscall number as the index of the bit in the
> bitmap, and if the bit is set, seccomp returns allow. The bitmaps
> are all clear in this patch and will be initialized in the next
> commit.
[...]
> Co-developed-by: Dimitrios Skarlatos <[email protected]>
> Signed-off-by: Dimitrios Skarlatos <[email protected]>
> Signed-off-by: YiFei Zhu <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
On Fri, Oct 09, 2020 at 12:14:29PM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <[email protected]>
>
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.
>
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
>
> When it can be concluded that an allow must occur for the given
> architecture and syscall pair (this determination is introduced in
> the next commit), seccomp will immediately allow the syscall,
> bypassing further BPF execution.
>
> Each architecture number has its own bitmap. The architecture
> number in seccomp_data is checked against the defined architecture
> number constant before proceeding to test the bit against the
> bitmap with the syscall number as the index of the bit in the
> bitmap, and if the bit is set, seccomp returns allow. The bitmaps
> are all clear in this patch and will be initialized in the next
> commit.
>
> [1] https://lore.kernel.org/linux-security-module/[email protected]/T/
> [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
> [3] https://github.com/seccomp/libseccomp/issues/116
> [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
> [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
> [6] Draco: Architectural and Operating System Support for System Call Security
> https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
>
> Co-developed-by: Dimitrios Skarlatos <[email protected]>
> Signed-off-by: Dimitrios Skarlatos <[email protected]>
> Signed-off-by: YiFei Zhu <[email protected]>
> ---
> kernel/seccomp.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 72 insertions(+)
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index ae6b40cc39f4..73f6b6e9a3b0 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,34 @@ struct notification {
> struct list_head notifications;
> };
>
> +#ifdef SECCOMP_ARCH_NATIVE
> +/**
> + * struct action_cache - per-filter cache of seccomp actions per
> + * arch/syscall pair
> + *
> + * @allow_native: A bitmap where each bit represents whether the
> + * filter will always allow the syscall, for the
> + * native architecture.
> + * @allow_compat: A bitmap where each bit represents whether the
> + * filter will always allow the syscall, for the
> + * compat architecture.
> + */
> +struct action_cache {
> + DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
> +#ifdef SECCOMP_ARCH_COMPAT
> + DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
> +#endif
> +};
> +#else
> +struct action_cache { };
> +
> +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
> + const struct seccomp_data *sd)
> +{
> + return false;
> +}
> +#endif /* SECCOMP_ARCH_NATIVE */
> +
> /**
> * struct seccomp_filter - container for seccomp BPF programs
> *
> @@ -298,6 +326,47 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
> return 0;
> }
>
> +#ifdef SECCOMP_ARCH_NATIVE
> +static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
> + size_t bitmap_size,
> + int syscall_nr)
> +{
> + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
> + return false;
> + syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
> +
> + return test_bit(syscall_nr, bitmap);
> +}
> +
> +/**
> + * seccomp_cache_check_allow - lookup seccomp cache
> + * @sfilter: The seccomp filter
> + * @sd: The seccomp data to lookup the cache with
> + *
> + * Returns true if the seccomp_data is cached and allowed.
> + */
> +static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
> + const struct seccomp_data *sd)
> +{
> + int syscall_nr = sd->nr;
> + const struct action_cache *cache = &sfilter->cache;
> +
> + if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
> + return seccomp_cache_check_allow_bitmap(cache->allow_native,
> + SECCOMP_ARCH_NATIVE_NR,
> + syscall_nr);
> +#ifdef SECCOMP_ARCH_COMPAT
> + if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
> + return seccomp_cache_check_allow_bitmap(cache->allow_compat,
> + SECCOMP_ARCH_COMPAT_NR,
> + syscall_nr);
> +#endif /* SECCOMP_ARCH_COMPAT */
> +
> + WARN_ON_ONCE(true);
> + return false;
> +}
> +#endif /* SECCOMP_ARCH_NATIVE */
A small optimization for the non-compat case might be to do this to
avoid the sd->arch test (which should have no way to ever change in such
builds):
static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
const struct seccomp_data *sd)
{
const struct action_cache *cache = &sfilter->cache;
#ifndef SECCOMP_ARCH_COMPAT
/* A native-only architecture doesn't need to check sd->arch. */
return seccomp_cache_check_allow_bitmap(cache->allow_native,
SECCOMP_ARCH_NATIVE_NR,
sd->nr);
#else /* SECCOMP_ARCH_COMPAT */
if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
return seccomp_cache_check_allow_bitmap(cache->allow_native,
SECCOMP_ARCH_NATIVE_NR,
sd->nr);
if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
return seccomp_cache_check_allow_bitmap(cache->allow_compat,
SECCOMP_ARCH_COMPAT_NR,
sd->nr);
#endif
WARN_ON_ONCE(true);
return false;
}
> +
> /**
> * seccomp_run_filters - evaluates all seccomp filters against @sd
> * @sd: optional seccomp data to be passed to filters
> @@ -320,6 +389,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
> if (WARN_ON(f == NULL))
> return SECCOMP_RET_KILL_PROCESS;
>
> + if (seccomp_cache_check_allow(f, sd))
> + return SECCOMP_RET_ALLOW;
> +
> /*
> * All filters in the list are evaluated and the lowest BPF return
> * value always takes priority (ignoring the DATA).
> --
> 2.28.0
>
This is all looking good; thank you! I'm doing some test builds/runs
now. :)
--
Kees Cook
On Fri, Oct 09, 2020 at 11:30:18PM +0200, Jann Horn wrote:
> On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <[email protected]> wrote:
> >
> > From: YiFei Zhu <[email protected]>
> >
> > SECCOMP_CACHE will only operate on syscalls that do not access
> > any syscall arguments or instruction pointer. To facilitate
> > this we need a static analyser to know whether a filter will
> > return allow regardless of syscall arguments for a given
> > architecture number / syscall number pair. This is implemented
> > here with a pseudo-emulator, and stored in a per-filter bitmap.
> >
> > In order to build this bitmap at filter attach time, each filter is
> > emulated for every syscall (under each possible architecture), and
> > checked for any accesses of struct seccomp_data that are not the "arch"
> > nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
> > the program returns allow, then we can be sure that the filter must
> > return allow independent from syscall arguments.
> >
> > Nearly all seccomp filters are built from these cBPF instructions:
> >
> > BPF_LD | BPF_W | BPF_ABS
> > BPF_JMP | BPF_JEQ | BPF_K
> > BPF_JMP | BPF_JGE | BPF_K
> > BPF_JMP | BPF_JGT | BPF_K
> > BPF_JMP | BPF_JSET | BPF_K
> > BPF_JMP | BPF_JA
> > BPF_RET | BPF_K
> > BPF_ALU | BPF_AND | BPF_K
> >
> > Each of these instructions are emulated. Any weirdness or loading
> > from a syscall argument will cause the emulator to bail.
> >
> > The emulation is also halted if it reaches a return. In that case,
> > if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
> >
> > Emulator structure and comments are from Kees [1] and Jann [2].
> >
> > Emulation is done at attach time. If a filter depends on more
> > filters, and if the dependee does not guarantee to allow the
> > syscall, then we skip the emulation of this syscall.
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
> > [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
> [...]
> > @@ -682,6 +693,150 @@ seccomp_prepare_user_filter(const char __user *user_filter)
> > return filter;
> > }
> >
> > +#ifdef SECCOMP_ARCH_NATIVE
> > +/**
> > + * seccomp_is_const_allow - check if filter is constant allow with given data
> > + * @fprog: The BPF programs
> > + * @sd: The seccomp data to check against, only syscall number are arch
> > + * number are considered constant.
>
> nit: s/syscall number are arch number/syscall number and arch number/
>
> > + */
> > +static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
> > + struct seccomp_data *sd)
> > +{
> > + unsigned int insns;
> > + unsigned int reg_value = 0;
> > + unsigned int pc;
> > + bool op_res;
> > +
> > + if (WARN_ON_ONCE(!fprog))
> > + return false;
> > +
> > + insns = bpf_classic_proglen(fprog);
>
> bpf_classic_proglen() is defined as:
>
> #define bpf_classic_proglen(fprog) (fprog->len * sizeof(fprog->filter[0]))
>
> so this is wrong - what you want is the number of instructions in the
> program, what you actually have is the size of the program in bytes.
> Please instead check for `pc < fprog->len` in the loop condition.
Oh yes, good catch. I had this wrong in my v1.
--
Kees Cook
From: YiFei Zhu <[email protected]>
Alternative: https://lore.kernel.org/lkml/[email protected]/T/
Major differences from the linked alternative by Kees:
* No x32 special-case handling -- not worth the complexity
* No caching of denylist -- not worth the complexity
* No seccomp arch pinning -- I think this is an independent feature
* The bitmaps are part of the filters rather than the task.
This series adds a bitmap to cache seccomp filter results if the
result permits a syscall and is independent of syscall arguments.
This visibly decreases seccomp overhead for most common seccomp
filters with very little memory footprint.
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. Recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed that some common filters, such as docker's [4] or
systemd's [5], make most decisions based only on the syscall
numbers. As past discussions considered, a bitmap where each bit
represents a syscall makes the most sense for these filters.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of syscall arguments.
When it is concluded that an allow must occur for the given
architecture and syscall pair, seccomp will immediately allow
the syscall, bypassing further BPF execution.
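For reference, a hedged userspace sketch of how such a filter is
typically installed (standard seccomp filter attachment, not code from
this series; install_filter is a hypothetical helper). The kernel
builds the bitmap during this attach step:

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* `filter` is any nr-only allow-list, e.g. docker's or systemd's. */
static int install_filter(struct sock_filter *filter, unsigned short len)
{
	struct sock_fprog prog = { .len = len, .filter = filter };

	/* Required so an unprivileged task may attach a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	/* Attach; the new filter is emulated here to fill its bitmap. */
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}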
Ongoing work aims to further support arguments with fast hash table
lookups. We are investigating the performance of doing so [6], and how
to best integrate with the existing seccomp infrastructure.
Benchmarks were performed; the results appear in patch 5 and are copied below:
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
native ≤ 1 bitmap (646 ≤ 675): ✔️
native ≤ 1 filter (646 ≤ 732): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
entry ≈ 1 bitmapped (29 ≈ 29): ✔️
entry ≈ 2 bitmapped (29 ≈ 29): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️
v4 -> v5:
* Typo and wording fixes.
* Skip the arch number test when there is only one arch.
* Fixed prog instruction number check.
* Added a comment about the behavior of x32.
* /proc/pid/seccomp_cache returns -ESRCH for an exiting process.
* Fixed /proc/pid/seccomp_cache to depend on the architecture support.
* Fixed struct seq_file visibility issue reported by the kernel test robot.
v3 -> v4:
* Reordered patches
* Naming changes
* Fixed a race in /proc/pid/seccomp_cache against the filter being
released from the task, using Jann's suggestion of the sighand spinlock.
* Cache no longer configurable.
* Copied some description from cover letter to commit messages.
* Used Kees's logic of clearing bits from a filled bitmap, rather than setting bits.
v2 -> v3:
* Added array_index_nospec guards
* No more syscall_arches[] array or reliance on loop unrolling. Arches
are configured with per-arch seccomp.h.
* Moved filter emulation to attach time (from prepare time).
* Further simplified emulator, based on Kees's code.
* Guard /proc/pid/seccomp_cache with CAP_SYS_ADMIN.
v1 -> v2:
* Corrected one outdated function documentation.
RFC -> v1:
* Config is on by default across all arches that could support it.
* Added arch numbers array and emulate filter for each arch number, and
have a per-arch bitmap.
* Massively simplified the emulator so it would only support the common
instructions in Kees's list.
* Fixed inheriting bitmap across filters (filter->prev is always NULL
during prepare).
* Stole the selftest from Kees.
* Added a /proc/pid/seccomp_cache at Jann's suggestion.
Patch 1 implements the test_bit against the bitmaps.
Patch 2 implements the emulator that determines whether a filter must return allow.
Patch 3 adds the arch macros for x86.
Patch 4 updates the selftest to better show the new semantics.
Patch 5 implements /proc/pid/seccomp_cache.
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
Kees Cook (2):
x86: Enable seccomp architecture tracking
selftests/seccomp: Compare bitmap vs filter overhead
YiFei Zhu (3):
seccomp/cache: Lookup syscall allowlist bitmap for fast path
seccomp/cache: Add "emulator" to check if filter is constant allow
seccomp/cache: Report cache data through /proc/pid/seccomp_cache
arch/Kconfig | 24 ++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/seccomp.h | 20 ++
fs/proc/base.c | 6 +
include/linux/seccomp.h | 7 +
kernel/seccomp.c | 292 +++++++++++++++++-
.../selftests/seccomp/seccomp_benchmark.c | 151 +++++++--
tools/testing/selftests/seccomp/settings | 2 +-
8 files changed, 479 insertions(+), 24 deletions(-)
--
2.28.0
From: YiFei Zhu <[email protected]>
SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.
In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independent of syscall arguments.
Nearly all seccomp filters are built from these cBPF instructions:
BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND | BPF_K
Each of these instructions is emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.
The emulation is also halted if it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.
Emulator structure and comments are from Kees [1] and Jann [2].
Emulation is done at attach time. If a filter is stacked on top of
earlier filters, and those filters do not already guarantee to allow
the syscall, the emulation for that syscall is skipped.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
Suggested-by: Jann Horn <[email protected]>
Co-developed-by: Kees Cook <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Signed-off-by: YiFei Zhu <[email protected]>
---
kernel/seccomp.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 155 insertions(+), 1 deletion(-)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d67a8b61f2bf..236e7b367d4e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -169,6 +169,10 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
{
return false;
}
+
+static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+}
#endif /* SECCOMP_ARCH_NATIVE */
/**
@@ -187,6 +191,7 @@ static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilte
* this filter after reaching 0. The @users count is always smaller
* or equal to @refs. Hence, reaching 0 for @users does not mean
* the filter can be freed.
+ * @cache: cache of arch/syscall mappings to actions
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged
* @prev: points to a previously installed, or inherited, filter
* @prog: the BPF program to evaluate
@@ -208,6 +213,7 @@ struct seccomp_filter {
refcount_t refs;
refcount_t users;
bool log;
+ struct action_cache cache;
struct seccomp_filter *prev;
struct bpf_prog *prog;
struct notification *notif;
@@ -621,7 +627,12 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
{
struct seccomp_filter *sfilter;
int ret;
- const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE);
+ const bool save_orig =
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)
+ true;
+#else
+ false;
+#endif
if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
return ERR_PTR(-EINVAL);
@@ -687,6 +698,148 @@ seccomp_prepare_user_filter(const char __user *user_filter)
return filter;
}
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * seccomp_is_const_allow - check if filter is constant allow with given data
+ * @fprog: The BPF programs
+ * @sd: The seccomp data to check against, only syscall number and arch
+ * number are considered constant.
+ */
+static bool seccomp_is_const_allow(struct sock_fprog_kern *fprog,
+ struct seccomp_data *sd)
+{
+ unsigned int reg_value = 0;
+ unsigned int pc;
+ bool op_res;
+
+ if (WARN_ON_ONCE(!fprog))
+ return false;
+
+ for (pc = 0; pc < fprog->len; pc++) {
+ struct sock_filter *insn = &fprog->filter[pc];
+ u16 code = insn->code;
+ u32 k = insn->k;
+
+ switch (code) {
+ case BPF_LD | BPF_W | BPF_ABS:
+ switch (k) {
+ case offsetof(struct seccomp_data, nr):
+ reg_value = sd->nr;
+ break;
+ case offsetof(struct seccomp_data, arch):
+ reg_value = sd->arch;
+ break;
+ default:
+ /* can't optimize (non-constant value load) */
+ return false;
+ }
+ break;
+ case BPF_RET | BPF_K:
+ /* reached return with constant values only, check allow */
+ return k == SECCOMP_RET_ALLOW;
+ case BPF_JMP | BPF_JA:
+ pc += insn->k;
+ break;
+ case BPF_JMP | BPF_JEQ | BPF_K:
+ case BPF_JMP | BPF_JGE | BPF_K:
+ case BPF_JMP | BPF_JGT | BPF_K:
+ case BPF_JMP | BPF_JSET | BPF_K:
+ switch (BPF_OP(code)) {
+ case BPF_JEQ:
+ op_res = reg_value == k;
+ break;
+ case BPF_JGE:
+ op_res = reg_value >= k;
+ break;
+ case BPF_JGT:
+ op_res = reg_value > k;
+ break;
+ case BPF_JSET:
+ op_res = !!(reg_value & k);
+ break;
+ default:
+ /* can't optimize (unknown jump) */
+ return false;
+ }
+
+ pc += op_res ? insn->jt : insn->jf;
+ break;
+ case BPF_ALU | BPF_AND | BPF_K:
+ reg_value &= k;
+ break;
+ default:
+ /* can't optimize (unknown insn) */
+ return false;
+ }
+ }
+
+ /* ran off the end of the filter?! */
+ WARN_ON(1);
+ return false;
+}
+
+static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter,
+ void *bitmap, const void *bitmap_prev,
+ size_t bitmap_size, int arch)
+{
+ struct sock_fprog_kern *fprog = sfilter->prog->orig_prog;
+ struct seccomp_data sd;
+ int nr;
+
+ if (bitmap_prev) {
+ /* The new filter must be as restrictive as the last. */
+ bitmap_copy(bitmap, bitmap_prev, bitmap_size);
+ } else {
+ /* Before any filters, all syscalls are always allowed. */
+ bitmap_fill(bitmap, bitmap_size);
+ }
+
+ for (nr = 0; nr < bitmap_size; nr++) {
+ /* No bitmap change: not a cacheable action. */
+ if (!test_bit(nr, bitmap))
+ continue;
+
+ sd.nr = nr;
+ sd.arch = arch;
+
+ /* No bitmap change: continue to always allow. */
+ if (seccomp_is_const_allow(fprog, &sd))
+ continue;
+
+ /*
+ * Not a cacheable action: always run filters.
+ * atomic clear_bit() not needed, filter not visible yet.
+ */
+ __clear_bit(nr, bitmap);
+ }
+}
+
+/**
+ * seccomp_cache_prepare - emulate the filter to find cacheable syscalls
+ * @sfilter: The seccomp filter
+ *
+ * Populates @sfilter's action cache bitmaps in place.
+ */
+static void seccomp_cache_prepare(struct seccomp_filter *sfilter)
+{
+ struct action_cache *cache = &sfilter->cache;
+ const struct action_cache *cache_prev =
+ sfilter->prev ? &sfilter->prev->cache : NULL;
+
+ seccomp_cache_prepare_bitmap(sfilter, cache->allow_native,
+ cache_prev ? cache_prev->allow_native : NULL,
+ SECCOMP_ARCH_NATIVE_NR,
+ SECCOMP_ARCH_NATIVE);
+
+#ifdef SECCOMP_ARCH_COMPAT
+ seccomp_cache_prepare_bitmap(sfilter, cache->allow_compat,
+ cache_prev ? cache_prev->allow_compat : NULL,
+ SECCOMP_ARCH_COMPAT_NR,
+ SECCOMP_ARCH_COMPAT);
+#endif /* SECCOMP_ARCH_COMPAT */
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
/**
* seccomp_attach_filter: validate and attach filter
* @flags: flags to change filter behavior
@@ -736,6 +889,7 @@ static long seccomp_attach_filter(unsigned int flags,
* task reference.
*/
filter->prev = current->seccomp.filter;
+ seccomp_cache_prepare(filter);
current->seccomp.filter = filter;
atomic_inc(¤t->seccomp.filter_count);
--
2.28.0
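To illustrate the bitmap inheritance in seccomp_cache_prepare_bitmap()
above (a hedged worked example, not from the patch): a syscall keeps
its fast-path bit only if every filter in the stack constant-allows it.

  filter A constant-allows {getpid, write} -> bitmap A = {getpid, write}
  filter B constant-allows {getpid, read}  -> bitmap B starts as a copy
      of A; the write bit is cleared (B is not constant-allow for it),
      and the read bit was never set (A already cleared it)
  result: bitmap B = {getpid}; only getpid takes the bitmap fast path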
From: Kees Cook <[email protected]>
Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.
Signed-off-by: Kees Cook <[email protected]>
Co-developed-by: YiFei Zhu <[email protected]>
Signed-off-by: YiFei Zhu <[email protected]>
---
arch/x86/include/asm/seccomp.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..b17d037c72ce 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -16,6 +16,23 @@
#define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
#endif
+#ifdef CONFIG_X86_64
+# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64
+# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+# ifdef CONFIG_COMPAT
+# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386
+# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls
+# endif
+/*
+ * x32 will have __X32_SYSCALL_BIT set in syscall number. We don't support
+ * caching them and they are treated as out of range syscalls, which will
+ * always pass through the BPF filter.
+ */
+#else /* !CONFIG_X86_64 */
+# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386
+# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+#endif
+
#include <asm-generic/seccomp.h>
#endif /* _ASM_X86_SECCOMP_H */
--
2.28.0
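To illustrate the x32 comment above with concrete numbers (the bit
value is from the x86 uapi headers; the NR_syscalls figure here is an
assumed ballpark, not exact): an x32 syscall number always fails the
bitmap bounds check, so it can never be treated as cached.

/* Illustration only: x32 numbers fall outside the bitmap range. */
#include <stdio.h>

#define X32_SYSCALL_BIT 0x40000000	/* __X32_SYSCALL_BIT */

int main(void)
{
	int nr = 39 | X32_SYSCALL_BIT;	/* x32 getpid */
	int nr_syscalls = 450;		/* stand-in for NR_syscalls */

	printf("nr=%d in cacheable range: %s\n", nr,
	       (nr >= 0 && nr < nr_syscalls) ? "yes" : "no, runs filter");
	return 0;
}

Since the cache lookup performs exactly this bounds check before
testing the bit, x32 syscalls simply fall through to full BPF
evaluation, which is why no special case is needed.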
From: YiFei Zhu <[email protected]>
The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters which further enlarge the
overhead. A recent work [6] comprehensively measures the Seccomp
overhead and shows that the overhead is non-negligible and has a
non-trivial impact on application performance.
We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.
The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.
When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.
Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.
When only one architecture exists, the check against architecture
number is skipped, suggested by Kees Cook [7].
[1] https://lore.kernel.org/linux-security-module/[email protected]/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/
Co-developed-by: Dimitrios Skarlatos <[email protected]>
Signed-off-by: Dimitrios Skarlatos <[email protected]>
Signed-off-by: YiFei Zhu <[email protected]>
---
kernel/seccomp.c | 77 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 77 insertions(+)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index ae6b40cc39f4..d67a8b61f2bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,6 +143,34 @@ struct notification {
struct list_head notifications;
};
+#ifdef SECCOMP_ARCH_NATIVE
+/**
+ * struct action_cache - per-filter cache of seccomp actions per
+ * arch/syscall pair
+ *
+ * @allow_native: A bitmap where each bit represents whether the
+ * filter will always allow the syscall, for the
+ * native architecture.
+ * @allow_compat: A bitmap where each bit represents whether the
+ * filter will always allow the syscall, for the
+ * compat architecture.
+ */
+struct action_cache {
+ DECLARE_BITMAP(allow_native, SECCOMP_ARCH_NATIVE_NR);
+#ifdef SECCOMP_ARCH_COMPAT
+ DECLARE_BITMAP(allow_compat, SECCOMP_ARCH_COMPAT_NR);
+#endif
+};
+#else
+struct action_cache { };
+
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+ const struct seccomp_data *sd)
+{
+ return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
/**
* struct seccomp_filter - container for seccomp BPF programs
*
@@ -298,6 +326,52 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen)
return 0;
}
+#ifdef SECCOMP_ARCH_NATIVE
+static inline bool seccomp_cache_check_allow_bitmap(const void *bitmap,
+ size_t bitmap_size,
+ int syscall_nr)
+{
+ if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size))
+ return false;
+ syscall_nr = array_index_nospec(syscall_nr, bitmap_size);
+
+ return test_bit(syscall_nr, bitmap);
+}
+
+/**
+ * seccomp_cache_check_allow - lookup seccomp cache
+ * @sfilter: The seccomp filter
+ * @sd: The seccomp data to lookup the cache with
+ *
+ * Returns true if the seccomp_data is cached and allowed.
+ */
+static inline bool seccomp_cache_check_allow(const struct seccomp_filter *sfilter,
+ const struct seccomp_data *sd)
+{
+ int syscall_nr = sd->nr;
+ const struct action_cache *cache = &sfilter->cache;
+
+#ifndef SECCOMP_ARCH_COMPAT
+ /* A native-only architecture doesn't need to check sd->arch. */
+ return seccomp_cache_check_allow_bitmap(cache->allow_native,
+ SECCOMP_ARCH_NATIVE_NR,
+ syscall_nr);
+#else
+ if (likely(sd->arch == SECCOMP_ARCH_NATIVE))
+ return seccomp_cache_check_allow_bitmap(cache->allow_native,
+ SECCOMP_ARCH_NATIVE_NR,
+ syscall_nr);
+ if (likely(sd->arch == SECCOMP_ARCH_COMPAT))
+ return seccomp_cache_check_allow_bitmap(cache->allow_compat,
+ SECCOMP_ARCH_COMPAT_NR,
+ syscall_nr);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+ WARN_ON_ONCE(true);
+ return false;
+}
+#endif /* SECCOMP_ARCH_NATIVE */
+
/**
* seccomp_run_filters - evaluates all seccomp filters against @sd
* @sd: optional seccomp data to be passed to filters
@@ -320,6 +394,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
if (WARN_ON(f == NULL))
return SECCOMP_RET_KILL_PROCESS;
+ if (seccomp_cache_check_allow(f, sd))
+ return SECCOMP_RET_ALLOW;
+
/*
* All filters in the list are evaluated and the lowest BPF return
* value always takes priority (ignoring the DATA).
--
2.28.0
From: YiFei Zhu <[email protected]>
Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.
This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.
For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]
This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.
Suggested-by: Jann Horn <[email protected]>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <[email protected]>
---
arch/Kconfig | 24 ++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/seccomp.h | 3 ++
fs/proc/base.c | 6 ++++
include/linux/seccomp.h | 7 ++++
kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++
6 files changed, 100 insertions(+)
diff --git a/arch/Kconfig b/arch/Kconfig
index 21a3675a7a3a..6157c3ce0662 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER
results in the system call being skipped immediately.
- seccomp syscall wired up
+config HAVE_ARCH_SECCOMP_CACHE
+ bool
+ help
+ An arch should select this symbol if it provides all of these things:
+ - all the requirements for HAVE_ARCH_SECCOMP_FILTER
+ - SECCOMP_ARCH_NATIVE
+ - SECCOMP_ARCH_NATIVE_NR
+ - SECCOMP_ARCH_NATIVE_NAME
+
config SECCOMP
prompt "Enable seccomp to safely execute untrusted bytecode"
def_bool y
@@ -498,6 +507,21 @@ config SECCOMP_FILTER
See Documentation/userspace-api/seccomp_filter.rst for details.
+config SECCOMP_CACHE_DEBUG
+ bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+ depends on SECCOMP
+ depends on SECCOMP_FILTER && HAVE_ARCH_SECCOMP_CACHE
+ depends on PROC_FS
+ help
+ This enables the /proc/pid/seccomp_cache interface to monitor
+ seccomp cache data. The file format is subject to change. Reading
+ the file requires CAP_SYS_ADMIN.
+
+ This option is for debugging only. Enabling presents the risk that
+ an adversary may be able to infer the seccomp filter logic.
+
+ If unsure, say N.
+
config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ab22869a765..1a807f89ac77 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -150,6 +150,7 @@ config X86
select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_SECCOMP_CACHE
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_STACKLEAK
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index b17d037c72ce..fef16e398161 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -19,9 +19,11 @@
#ifdef CONFIG_X86_64
# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64
# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME "x86_64"
# ifdef CONFIG_COMPAT
# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386
# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls
+# define SECCOMP_ARCH_COMPAT_NAME "ia32"
# endif
/*
* x32 will have __X32_SYSCALL_BIT set in syscall number. We don't support
@@ -31,6 +33,7 @@
#else /* !CONFIG_X86_64 */
# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386
# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME "ia32"
#endif
#include <asm-generic/seccomp.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..a4990410ff05 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_PROC_PID_ARCH_STATUS
ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+ ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
};
static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_PROC_PID_ARCH_STATUS
ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+ ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
};
static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..76963ec4641a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,11 @@ static inline long seccomp_get_metadata(struct task_struct *task,
return -EINVAL;
}
#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+struct seq_file;
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task);
+#endif
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 236e7b367d4e..1df2fac281da 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -553,6 +553,9 @@ void seccomp_filter_release(struct task_struct *tsk)
{
struct seccomp_filter *orig = tsk->seccomp.filter;
+ /* We are effectively holding the siglock by not having any sighand. */
+ WARN_ON(tsk->sighand != NULL);
+
/* Detach task from its filter tree. */
tsk->seccomp.filter = NULL;
__seccomp_filter_release(orig);
@@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void)
device_initcall(seccomp_sysctl_init)
#endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */
+static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
+ const void *bitmap, size_t bitmap_size)
+{
+ int nr;
+
+ for (nr = 0; nr < bitmap_size; nr++) {
+ bool cached = test_bit(nr, bitmap);
+ char *status = cached ? "ALLOW" : "FILTER";
+
+ seq_printf(m, "%s %d %s\n", name, nr, status);
+ }
+}
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ struct seccomp_filter *f;
+ unsigned long flags;
+
+ /*
+ * We don't want some sandboxed process to know what their seccomp
+ * filters consist of.
+ */
+ if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (!lock_task_sighand(task, &flags))
+ return -ESRCH;
+
+ f = READ_ONCE(task->seccomp.filter);
+ if (!f) {
+ unlock_task_sighand(task, &flags);
+ return 0;
+ }
+
+ /* prevent filter from being freed while we are printing it */
+ __get_seccomp_filter(f);
+ unlock_task_sighand(task, &flags);
+
+ proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME,
+ f->cache.allow_native,
+ SECCOMP_ARCH_NATIVE_NR);
+
+#ifdef SECCOMP_ARCH_COMPAT
+ proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
+ f->cache.allow_compat,
+ SECCOMP_ARCH_COMPAT_NR);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+ __put_seccomp_filter(f);
+ return 0;
+}
+#endif /* CONFIG_SECCOMP_CACHE_DEBUG */
--
2.28.0
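A small usage sketch for the interface above (assumes
CONFIG_SECCOMP_CACHE_DEBUG=y and a caller with CAP_SYS_ADMIN; the
path and the "<arch name> <decimal syscall number> <ALLOW | FILTER>"
line format are as documented in the commit message):

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
	char path[64], arch[32], status[16];
	int nr, allow = 0, filter = 0;
	FILE *f;

	/* Default to the current task if no PID is given. */
	snprintf(path, sizeof(path), "/proc/%s/seccomp_cache",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fscanf(f, "%31s %d %15s", arch, &nr, status) == 3) {
		if (!strcmp(status, "ALLOW"))
			allow++;
		else
			filter++;
	}
	fclose(f);
	printf("ALLOW: %d, FILTER: %d\n", allow, filter);
	return 0;
}

For the docker default profile shown above, the ALLOW count is the
number of syscalls that will take the bitmap fast path.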
From: Kees Cook <[email protected]>
As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.
Example output with constant action bitmaps on x86:
$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
native ≤ 1 bitmap (646 ≤ 675): ✔️
native ≤ 1 filter (646 ≤ 732): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
entry ≈ 1 bitmapped (29 ≈ 29): ✔️
entry ≈ 2 bitmapped (29 ≈ 29): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️
Signed-off-by: Kees Cook <[email protected]>
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu <[email protected]>
---
.../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++---
tools/testing/selftests/seccomp/settings | 2 +-
2 files changed, 130 insertions(+), 23 deletions(-)
diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
*/
#define _GNU_SOURCE
#include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
+#include <sys/param.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
return samples * seconds;
}
+bool approx(int i_one, int i_two)
+{
+ double one = i_one, one_bump = one * 0.01;
+ double two = i_two, two_bump = two * 0.01;
+
+ one_bump = one + MAX(one_bump, 2.0);
+ two_bump = two + MAX(two_bump, 2.0);
+
+ /* Equal to, or within 1% or 2 digits */
+ if (one == two ||
+ (one > two && one <= two_bump) ||
+ (two > one && two <= one_bump))
+ return true;
+ return false;
+}
+
+bool le(int i_one, int i_two)
+{
+ if (i_one <= i_two)
+ return true;
+ return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+ unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+ bool good;
+
+ printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+ (long long)one, name_eval, (long long)two);
+ if (one > INT_MAX) {
+ printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+ return 1;
+ }
+ if (two > INT_MAX) {
+ printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+ return 1;
+ }
+
+ good = eval(one, two);
+ printf("%s\n", good ? "✔️" : "❌");
+
+ return good ? 0 : 1;
+}
+
int main(int argc, char *argv[])
{
+ struct sock_filter bitmap_filter[] = {
+ BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog bitmap_prog = {
+ .len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+ .filter = bitmap_filter,
+ };
struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = (unsigned short)ARRAY_SIZE(filter),
.filter = filter,
};
- long ret;
- unsigned long long samples;
- unsigned long long native, filter1, filter2;
+
+ long ret, bits;
+ unsigned long long samples, calc;
+ unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+ unsigned long long entry, per_filter1, per_filter2;
printf("Current BPF sysctl settings:\n");
system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
assert(ret == 0);
- /* One filter */
- ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+ /* One filter resulting in a bitmap */
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
assert(ret == 0);
- filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
- printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+ bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+ /* Second filter resulting in a bitmap */
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+ assert(ret == 0);
- if (filter1 == native)
- printf("No overhead measured!? Try running again with more samples.\n");
+ bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
- /* Two filters */
+ /* Third filter, can no longer be converted to bitmap */
ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
assert(ret == 0);
- filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
- printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
- /* Calculations */
- printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
- filter1 - native);
+ filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
- printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
- filter2 - native);
+ /* Fourth filter, can not be converted to bitmap because of filter 3 */
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+ assert(ret == 0);
- printf("Estimated seccomp per-filter overhead: %llu ns\n",
- filter2 - filter1);
+ filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+ /* Estimations */
+#define ESTIMATE(fmt, var, what) do { \
+ var = (what); \
+ printf("Estimated " fmt ": %llu ns\n", var); \
+ if (var > INT_MAX) \
+ goto more_samples; \
+ } while (0)
+
+ ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+ bitmap1 - native);
+ ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+ bitmap2 - native);
+ ESTIMATE("total seccomp overhead for 3 full filters", calc,
+ filter1 - native);
+ ESTIMATE("total seccomp overhead for 4 full filters", calc,
+ filter2 - native);
+ ESTIMATE("seccomp entry overhead", entry,
+ bitmap1 - native - (bitmap2 - bitmap1));
+ ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+ filter2 - filter1);
+ ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+ (filter2 - native - entry) / 4);
+
+ printf("Expectations:\n");
+ ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+ bits = compare("native", "≤", "1 filter", native, le, filter1);
+ if (bits)
+ goto more_samples;
+
+ ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+ per_filter1, approx, per_filter2);
+
+ bits = compare("1 bitmapped", "≈", "2 bitmapped",
+ bitmap1 - native, approx, bitmap2 - native);
+ if (bits) {
+ printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+ goto out;
+ }
- printf("Estimated seccomp entry overhead: %llu ns\n",
- filter1 - native - (filter2 - filter1));
+ ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+ ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+ ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+ entry + (per_filter1 * 4) + native, approx, filter2);
+ if (ret == 0)
+ goto out;
+more_samples:
+ printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
return 0;
}
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
--
2.28.0
On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <[email protected]> wrote:
> SECCOMP_CACHE will only operate on syscalls that do not access
> any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> return allow regardless of syscall arguments for a given
> architecture number / syscall number pair. This is implemented
> here with a pseudo-emulator, and stored in a per-filter bitmap.
>
> In order to build this bitmap at filter attach time, each filter is
> emulated for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, and
> the program returns allow, then we can be sure that the filter must
> return allow independent from syscall arguments.
>
> Nearly all seccomp filters are built from these cBPF instructions:
>
> BPF_LD | BPF_W | BPF_ABS
> BPF_JMP | BPF_JEQ | BPF_K
> BPF_JMP | BPF_JGE | BPF_K
> BPF_JMP | BPF_JGT | BPF_K
> BPF_JMP | BPF_JSET | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K
> BPF_ALU | BPF_AND | BPF_K
>
> Each of these instructions are emulated. Any weirdness or loading
> from a syscall argument will cause the emulator to bail.
>
> The emulation is also halted if it reaches a return. In that case,
> if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good.
>
> Emulator structure and comments are from Kees [1] and Jann [2].
>
> Emulation is done at attach time. If a filter depends on more
> filters, and if the dependee does not guarantee to allow the
> syscall, then we skip the emulation of this syscall.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/
>
> Suggested-by: Jann Horn <[email protected]>
> Co-developed-by: Kees Cook <[email protected]>
> Signed-off-by: Kees Cook <[email protected]>
> Signed-off-by: YiFei Zhu <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
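For contrast with the instruction list quoted above, a sketch of a
filter the emulator must bail on (the benchmark in this series uses
the same construction for its "full" filters): the very first
instruction loads a syscall argument, so no (arch, nr) pair can be
proven constant-allow and nothing is cached.

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>

int main(void)
{
	struct sock_filter insns[] = {
		/* Loading args[0] makes the result argument-dependent,
		 * so the attach-time emulator bails for every syscall. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[0])),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
		.filter = insns,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return 1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

With such a filter attached, /proc/pid/seccomp_cache would report
FILTER for every syscall number, and later filters cannot re-allow
anything, since each new bitmap starts from the previous filter's
bitmap and bits are only ever cleared.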
On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <[email protected]> wrote:
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and filter means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
>
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
>
> Suggested-by: Jann Horn <[email protected]>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
On Sun, Oct 11, 2020 at 5:48 PM YiFei Zhu <[email protected]> wrote:
> The overhead of running Seccomp filters has been part of some past
> discussions [1][2][3]. Oftentimes, the filters have a large number
> of instructions that check syscall numbers one by one and jump based
> on that. Some users chain BPF filters which further enlarge the
> overhead. A recent work [6] comprehensively measures the Seccomp
> overhead and shows that the overhead is non-negligible and has a
> non-trivial impact on application performance.
>
> We observed some common filters, such as docker's [4] or
> systemd's [5], will make most decisions based only on the syscall
> numbers, and as past discussions considered, a bitmap where each bit
> represents a syscall makes most sense for these filters.
>
> The fast (common) path for seccomp should be that the filter permits
> the syscall to pass through, and failing seccomp is expected to be
> an exceptional case; it is not expected for userspace to call a
> denylisted syscall over and over.
>
> When it can be concluded that an allow must occur for the given
> architecture and syscall pair (this determination is introduced in
> the next commit), seccomp will immediately allow the syscall,
> bypassing further BPF execution.
>
> Each architecture number has its own bitmap. The architecture
> number in seccomp_data is checked against the defined architecture
> number constant before proceeding to test the bit against the
> bitmap with the syscall number as the index of the bit in the
> bitmap, and if the bit is set, seccomp returns allow. The bitmaps
> are all clear in this patch and will be initialized in the next
> commit.
>
> When only one architecture exists, the check against architecture
> number is skipped, suggested by Kees Cook [7].
>
> [1] https://lore.kernel.org/linux-security-module/[email protected]/T/
> [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
> [3] https://github.com/seccomp/libseccomp/issues/116
> [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
> [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
> [6] Draco: Architectural and Operating System Support for System Call Security
> https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
> [7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/
>
> Co-developed-by: Dimitrios Skarlatos <[email protected]>
> Signed-off-by: Dimitrios Skarlatos <[email protected]>
> Signed-off-by: YiFei Zhu <[email protected]>
Reviewed-by: Jann Horn <[email protected]>
On Sun, 11 Oct 2020 10:47:41 -0500, YiFei Zhu wrote:
> Alternative: https://lore.kernel.org/lkml/[email protected]/T/
>
> Major differences from the linked alternative by Kees:
> * No x32 special-case handling -- not worth the complexity
> * No caching of denylist -- not worth the complexity
> * No seccomp arch pinning -- I think this is an independent feature
> * The bitmaps are part of the filters rather than the task.
>
> [...]
Applied to for-next/seccomp, thanks! I left off patch 5 for now until
we sort out the rest of the SECCOMP_FILTER architectures, and tweaked
patch 3 to include the architecture names.
[1/4] seccomp/cache: Lookup syscall allowlist bitmap for fast path
https://git.kernel.org/kees/c/f94defb8bf46
[2/4] seccomp/cache: Add "emulator" to check if filter is constant allow
https://git.kernel.org/kees/c/e7dc9f1e5f6b
[3/4] x86: Enable seccomp architecture tracking
https://git.kernel.org/kees/c/1f68a4d393fe
[4/4] selftests/seccomp: Compare bitmap vs filter overhead
https://git.kernel.org/kees/c/57a339117e52
--
Kees Cook
Hi Yifei,
On Sun, Oct 11, 2020 at 8:08 PM YiFei Zhu <[email protected]> wrote:
> From: YiFei Zhu <[email protected]>
>
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and filter means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
>
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
>
> Suggested-by: Jann Horn <[email protected]>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <[email protected]>
> @@ -2311,3 +2314,59 @@ static int __init seccomp_sysctl_init(void)
> device_initcall(seccomp_sysctl_init)
>
> #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_CACHE_DEBUG
> +/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */
Should there be a dependency on SECCOMP_ARCH_NATIVE?
Should all architectures that implement seccomp have this?
E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't
have SECCOMP_ARCH_NATIVE?
(noticed with a preliminary out-of-tree seccomp implementation for m68k,
which doesn't have SECCOMP_ARCH_NATIVE)
> +static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
> + const void *bitmap, size_t bitmap_size)
> +{
> + int nr;
> +
> + for (nr = 0; nr < bitmap_size; nr++) {
> + bool cached = test_bit(nr, bitmap);
> + char *status = cached ? "ALLOW" : "FILTER";
> +
> + seq_printf(m, "%s %d %s\n", name, nr, status);
> + }
> +}
> +
> +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
> + struct pid *pid, struct task_struct *task)
> +{
> + struct seccomp_filter *f;
> + unsigned long flags;
> +
> + /*
> + * We don't want some sandboxed process to know what their seccomp
> + * filters consist of.
> + */
> + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
> + return -EACCES;
> +
> + if (!lock_task_sighand(task, &flags))
> + return -ESRCH;
> +
> + f = READ_ONCE(task->seccomp.filter);
> + if (!f) {
> + unlock_task_sighand(task, &flags);
> + return 0;
> + }
> +
> + /* prevent filter from being freed while we are printing it */
> + __get_seccomp_filter(f);
> + unlock_task_sighand(task, &flags);
> +
> + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME,
> + f->cache.allow_native,
error: ‘struct action_cache’ has no member named ‘allow_native’
struct action_cache is empty if SECCOMP_ARCH_NATIVE is not
defined (so there are checks for it).
> + SECCOMP_ARCH_NATIVE_NR);
> +
> +#ifdef SECCOMP_ARCH_COMPAT
> + proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
> + f->cache.allow_compat,
> + SECCOMP_ARCH_COMPAT_NR);
> +#endif /* SECCOMP_ARCH_COMPAT */
> +
> + __put_seccomp_filter(f);
> + return 0;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_DEBUG */
> --
> 2.28.0
>
--
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
On Thu, Dec 17, 2020 at 6:14 AM Geert Uytterhoeven <[email protected]> wrote:
> Should there be a dependency on SECCOMP_ARCH_NATIVE?
> Should all architectures that implement seccomp have this?
>
> E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't
> have SECCOMP_ARCH_NATIVE?
>
> (noticed with a preliminary out-of-tree seccomp implementation for m68k,
> which doesn't have SECCOMP_ARCH_NATIVE)
Hi Geert
You are correct. This specific patch in this series was not applied,
and this was addressed in a follow up patch series [1]. MIPS does not
define SECCOMP_ARCH_NATIVE because the bitmap expects syscall numbers
to start from 0, whereas MIPS does not (defines
CONFIG_HAVE_SPARSE_SYSCALL_NR). The follow up patch makes it so that
any arch with HAVE_SPARSE_SYSCALL_NR (currently just MIPS) cannot have
CONFIG_SECCOMP_CACHE_DEBUG on, by the depend on clause.
I see that you are doing an out-of-tree seccomp implementation for
m68k. Assuming an unchanged arch/m68k/include/asm/syscall.h, adding
something like this to arch/m68k/include/asm/seccomp.h should make it
work:
#define SECCOMP_ARCH_NATIVE AUDIT_ARCH_M68K
#define SECCOMP_ARCH_NATIVE_NR NR_syscalls
#define SECCOMP_ARCH_NATIVE_NAME "m68k"
If the file does not exist already, arch/xtensa/include/asm/seccomp.h
is a good example of what the file should look like, and remember to
remove `generic-y += seccomp.h` from arch/m68k/include/asm/Kbuild.
[1] https://lore.kernel.org/lkml/[email protected]/T/
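Spelled out, the complete header would look roughly like the sketch
below (following the xtensa example; this assumes AUDIT_ARCH_M68K
exists in include/uapi/linux/audit.h and that m68k syscall numbers
are dense starting from 0, it is not a tested patch):

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_M68K_SECCOMP_H
#define _ASM_M68K_SECCOMP_H

#include <asm-generic/seccomp.h>

#define SECCOMP_ARCH_NATIVE		AUDIT_ARCH_M68K
#define SECCOMP_ARCH_NATIVE_NR		NR_syscalls
#define SECCOMP_ARCH_NATIVE_NAME	"m68k"

#endif /* _ASM_M68K_SECCOMP_H */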
YiFei Zhu
Hi YiFei,
On Thu, Dec 17, 2020 at 7:34 PM YiFei Zhu <[email protected]> wrote:
> On Thu, Dec 17, 2020 at 6:14 AM Geert Uytterhoeven <[email protected]> wrote:
> > Should there be a dependency on SECCOMP_ARCH_NATIVE?
> > Should all architectures that implement seccomp have this?
> >
> > E.g. mips does select HAVE_ARCH_SECCOMP_FILTER, but doesn't
> > have SECCOMP_ARCH_NATIVE?
> >
> > (noticed with preliminary out-of-tree seccomp implementation for m68k,
> > which doesn't have SECCOMP_ARCH_NATIVE
>
> You are correct. This specific patch in this series was not applied,
> and this was addressed in a follow up patch series [1]. MIPS does not
> define SECCOMP_ARCH_NATIVE because the bitmap expects syscall numbers
> to start from 0, whereas MIPS does not (defines
> CONFIG_HAVE_SPARSE_SYSCALL_NR). The follow up patch makes it so that
> any arch with HAVE_SPARSE_SYSCALL_NR (currently just MIPS) cannot have
> CONFIG_SECCOMP_CACHE_DEBUG on, by the depend on clause.
>
> I see that you are doing an out of tree seccomp implementation for
> m68k. Assuming unchanged arch/xtensa/include/asm/syscall.h, something
> like this to arch/m68k/include/asm/seccomp.h should make it work:
>
> #define SECCOMP_ARCH_NATIVE AUDIT_ARCH_M68K
> #define SECCOMP_ARCH_NATIVE_NR NR_syscalls
> #define SECCOMP_ARCH_NATIVE_NAME "m68k"
>
> If the file does not exist already, arch/xtensa/include/asm/seccomp.h
> is a good example of how the file should look like, and remember to
> remove `generic-y += seccomp.h` from arch/m68k/include/asm/Kbuild.
>
> [1] https://lore.kernel.org/lkml/[email protected]/T/
Thank you for your extensive explanation.
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds