2020-06-16 07:52:12

by Kees Cook

Subject: [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps

Hi,

This is my initial stab at making constant-time seccomp bitmaps that
are automatically generated from filters as they are added. This version
is x86 only (and not x86_x32), but it should be easy to expand this to
other architectures. I'd like to get arm64 working, but it has some
NR_syscalls shenanigans I haven't sorted out yet.

The first two patches are small clean-ups that I intend to land in
for-next/seccomp unless there are objections. Patch 3 is another
experimental feature to perform architecture-pinning. Patch 4 is the
bulk of the bitmap code. Patch 5 is benchmark updates. Patches 6 and
7 perform the x86 enablement. Patch 8 is just a debugging example,
in case anyone wants to play with this and would find it helpful.

Repeating the commit log from patch 4:


One of the most common pain points with seccomp filters has been dealing
with the overhead of processing the filters, especially for "always allow"
or "always reject" cases. While BPF is extremely fast[1], it will always
have overhead associated with it. Additionally, due to seccomp's design,
filters are layered, which means processing time goes up as the number
of filters attached goes up.

In the past, efforts have been focused on making filter execution complete
in a shorter amount of time. For example, filters were rewritten from
using linear if/then/else syscall search to using balanced binary trees,
or moving tests for syscalls common to the process's workload to the
front of the filter. However, there are limits to this, especially when
some processes are dealing with tens of filters[2], or when some
architectures have a less efficient BPF engine[3].

The most common use of seccomp, constructing syscall block/allow-lists
where syscalls are always allowed or always rejected (without regard
to any arguments), also tends to produce the most pathological runtime
problems, in that a large number of syscall checks in the filter must
be performed to come to a determination.

In order to optimize these cases from O(n) to O(1), seccomp can
use bitmaps to immediately determine the desired action. A critical
observation in the prior paragraph bears repeating: the common case for
syscall tests does not check arguments. For any given filter, there is a
constant mapping from the combination of architecture and syscall to the
seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
there is a single architecture.) As such, it is possible to construct
a mapping of arch/syscall to action, which can be updated as new filters
are attached to a process.
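
As a minimal sketch (using the seccomp_bitmaps type added by this
series, and ignoring the kill_process/kill_thread bitmaps the real
lookup also consults), the fast path becomes a single bitmap test:

        /* Sketch: O(1) answer for a syscall with a known-constant action. */
        static inline bool bitmap_allows(struct seccomp_bitmaps *bitmaps, u32 nr)
        {
                /* Out-of-range or unmarked syscalls fall back to running filters. */
                return nr < NR_syscalls && test_bit(nr, bitmaps->allow);
        }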

In order to build this mapping at filter attach time, each filter is
executed for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
there is a constant mapping for that syscall, and bitmaps can be updated
accordingly. If any accesses happen outside of those struct members,
seccomp must not bypass filter execution for that syscall, since program
state will be used to determine filter action result.

During syscall action probing, in order to determine whether other members
of struct seccomp_data are being accessed during a filter execution,
the struct is placed across a page boundary with the "arch" and "nr"
members in the first page, and everything else in the second page. The
"page accessed" flag is cleared in the second page's PTE, and the filter
is run. If the "page accessed" flag appears as set after running the
filter, we can determine that the filter looked beyond the "arch" and
"nr" members, and exclude that syscall from the constant action bitmaps.

For architectures to support this optimization, they must declare
their architectures for seccomp to see (via SECCOMP_ARCH and
SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Areas needing more attention:

On x86, this currently adds 168 bytes (or 336 bytes under CONFIG_COMPAT)
to the size of task_struct. Allocating these on demand may be a better
use of memory, but may not result in good cache locality.

For architectures with "synthetic" sub-architectures, like x86_x32,
additional work is needed. It should be possible to define a simple
mechanism based on the masking done in the x86 syscall entry path to
create another set of bitmaps for seccomp to key off of. I am, however,
considering just making HAVE_ARCH_SECCOMP_BITMAP depend on !X86_X32.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
[3] https://lore.kernel.org/bpf/[email protected]/



Thanks!

-Kees

Kees Cook (8):
selftests/seccomp: Improve calibration loop
seccomp: Use pr_fmt
seccomp: Introduce SECCOMP_PIN_ARCHITECTURE
seccomp: Implement constant action bitmaps
selftests/seccomp: Compare bitmap vs filter overhead
x86: Provide API for local kernel TLB flushing
x86: Enable seccomp constant action bitmaps
[DEBUG] seccomp: Report bitmap coverage ranges

arch/Kconfig | 7 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/syscall.h | 5 +
arch/x86/include/asm/tlbflush.h | 2 +
arch/x86/mm/tlb.c | 12 +-
include/linux/seccomp.h | 18 +
include/uapi/linux/seccomp.h | 1 +
kernel/seccomp.c | 374 +++++++++++++++++-
.../selftests/seccomp/seccomp_benchmark.c | 197 +++++++--
tools/testing/selftests/seccomp/settings | 2 +-
10 files changed, 571 insertions(+), 48 deletions(-)

--
2.25.1


2020-06-16 07:52:14

by Kees Cook

Subject: [PATCH 5/8] selftests/seccomp: Compare bitmap vs filter overhead

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 30344920
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 30344920 syscalls...
22.113430452 - 0.005691205 = 22107739247 (22.1s)
getpid native: 728 ns
44.867669556 - 22.113755935 = 22753913621 (22.8s)
getpid RET_ALLOW 1 filter (bitmap): 749 ns
67.649040358 - 44.868003056 = 22781037302 (22.8s)
getpid RET_ALLOW 2 filters (bitmap): 750 ns
92.555661414 - 67.650328959 = 24905332455 (24.9s)
getpid RET_ALLOW 3 filters (full): 820 ns
118.170831065 - 92.556057543 = 25614773522 (25.6s)
getpid RET_ALLOW 4 filters (full): 844 ns
Estimated total seccomp overhead for 1 bitmapped filter: 21 ns
Estimated total seccomp overhead for 2 bitmapped filters: 22 ns
Estimated total seccomp overhead for 3 full filters: 92 ns
Estimated total seccomp overhead for 4 full filters: 116 ns
Estimated seccomp entry overhead: 20 ns
Estimated seccomp per-filter overhead (last 2 diff): 24 ns
Estimated seccomp per-filter overhead (filters / 4): 24 ns
Expectations:
native ≤ 1 bitmap (728 ≤ 749): ✔️
native ≤ 1 filter (728 ≤ 820): ✔️
per-filter (last 2 diff) ≈ per-filter (filters / 4) (24 ≈ 24): ✔️
1 bitmapped ≈ 2 bitmapped (21 ≈ 22): ✔️
entry ≈ 1 bitmapped (20 ≈ 21): ✔️
entry ≈ 2 bitmapped (20 ≈ 22): ✔️
native + entry + (per filter * 4) ≈ 4 filters total (844 ≈ 844): ✔️
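
To make the estimate arithmetic concrete, the "seccomp entry overhead"
line above is derived from the raw measurements as (matching the
ESTIMATE() expressions in the diff below):

        entry = bitmap1 - native - (bitmap2 - bitmap1)
              = 749 - 728 - (750 - 749)
              = 20 ns

The incremental per-bitmapped-filter cost (bitmap2 - bitmap1) cancels
out, leaving the fixed cost of entering seccomp itself.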

Signed-off-by: Kees Cook <[email protected]>
---
.../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++---
tools/testing/selftests/seccomp/settings | 2 +-
2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
*/
#define _GNU_SOURCE
#include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
+#include <sys/param.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
return samples * seconds;
}

+bool approx(int i_one, int i_two)
+{
+ double one = i_one, one_bump = one * 0.01;
+ double two = i_two, two_bump = two * 0.01;
+
+ one_bump = one + MAX(one_bump, 2.0);
+ two_bump = two + MAX(two_bump, 2.0);
+
+ /* Equal to, or within 1% or 2 digits */
+ if (one == two ||
+ (one > two && one <= two_bump) ||
+ (two > one && two <= one_bump))
+ return true;
+ return false;
+}
+
+bool le(int i_one, int i_two)
+{
+ if (i_one <= i_two)
+ return true;
+ return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+ unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+ bool good;
+
+ printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+ (long long)one, name_eval, (long long)two);
+ if (one > INT_MAX) {
+ printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+ return 1;
+ }
+ if (two > INT_MAX) {
+ printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+ return 1;
+ }
+
+ good = eval(one, two);
+ printf("%s\n", good ? "✔️" : "❌");
+
+ return good ? 0 : 1;
+}
+
int main(int argc, char *argv[])
{
+ struct sock_filter bitmap_filter[] = {
+ BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog bitmap_prog = {
+ .len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+ .filter = bitmap_filter,
+ };
struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = (unsigned short)ARRAY_SIZE(filter),
.filter = filter,
};
- long ret;
- unsigned long long samples;
- unsigned long long native, filter1, filter2;
+
+ long ret, bits;
+ unsigned long long samples, calc;
+ unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+ unsigned long long entry, per_filter1, per_filter2;

printf("Current BPF sysctl settings:\n");
system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
assert(ret == 0);

- /* One filter */
- ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+ /* One filter resulting in a bitmap */
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
assert(ret == 0);

- filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
- printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+ bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+ /* Second filter resulting in a bitmap */
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+ assert(ret == 0);

- if (filter1 == native)
- printf("No overhead measured!? Try running again with more samples.\n");
+ bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);

- /* Two filters */
+ /* Third filter, can no longer be converted to bitmap */
ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
assert(ret == 0);

- filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
- printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
- /* Calculations */
- printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
- filter1 - native);
+ filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);

- printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
- filter2 - native);
+ /* Fourth filter, can not be converted to bitmap because of filter 3 */
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+ assert(ret == 0);

- printf("Estimated seccomp per-filter overhead: %llu ns\n",
- filter2 - filter1);
+ filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+ printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+ /* Estimations */
+#define ESTIMATE(fmt, var, what) do { \
+ var = (what); \
+ printf("Estimated " fmt ": %llu ns\n", var); \
+ if (var > INT_MAX) \
+ goto more_samples; \
+ } while (0)
+
+ ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+ bitmap1 - native);
+ ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+ bitmap2 - native);
+ ESTIMATE("total seccomp overhead for 3 full filters", calc,
+ filter1 - native);
+ ESTIMATE("total seccomp overhead for 4 full filters", calc,
+ filter2 - native);
+ ESTIMATE("seccomp entry overhead", entry,
+ bitmap1 - native - (bitmap2 - bitmap1));
+ ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+ filter2 - filter1);
+ ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+ (filter2 - native - entry) / 4);
+
+ printf("Expectations:\n");
+ ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+ bits = compare("native", "≤", "1 filter", native, le, filter1);
+ if (bits)
+ goto more_samples;
+
+ ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+ per_filter1, approx, per_filter2);
+
+ bits = compare("1 bitmapped", "≈", "2 bitmapped",
+ bitmap1 - native, approx, bitmap2 - native);
+ if (bits) {
+ printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+ goto out;
+ }

- printf("Estimated seccomp entry overhead: %llu ns\n",
- filter1 - native - (filter2 - filter1));
+ ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+ ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+ ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+ entry + (per_filter1 * 4) + native, approx, filter2);
+ if (ret == 0)
+ goto out;

+more_samples:
+ printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
return 0;
}
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120
--
2.25.1

2020-06-16 07:52:41

by Kees Cook

Subject: [PATCH 6/8] x86: Provide API for local kernel TLB flushing

The seccomp constant action bitmap filter evaluation routine depends
on being able to quickly clear the PTE "accessed" bit for a temporary
allocation. Provide access to the existing CPU-local kernel memory TLB
flushing routines.

Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/mm/tlb.c | 12 +++++++++---
2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..ae853e77d6bc 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -228,6 +228,8 @@ extern void flush_tlb_all(void);
extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
+extern void local_flush_tlb_kernel_range(unsigned long start,
+ unsigned long end);
extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);

static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1a3569b43aa5..ffcf2bd0ce1c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -959,16 +959,22 @@ void flush_tlb_all(void)
on_each_cpu(do_flush_tlb_all, NULL, 1);
}

-static void do_kernel_range_flush(void *info)
+void local_flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
- struct flush_tlb_info *f = info;
unsigned long addr;

/* flush range by one by one 'invlpg' */
- for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
+ for (addr = start; addr < end; addr += PAGE_SIZE)
flush_tlb_one_kernel(addr);
}

+static void do_kernel_range_flush(void *info)
+{
+ struct flush_tlb_info *f = info;
+
+ local_flush_tlb_kernel_range(f->start, f->end);
+}
+
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
/* Balance as user space task's flush, a bit conservative */
--
2.25.1

2020-06-16 07:52:42

by Kees Cook

Subject: [PATCH 7/8] x86: Enable seccomp constant action bitmaps

Now that CPU-local kernel TLB flushes are available to seccomp, define
the specific architectures seccomp should be expected to reason about,
so that constant action bitmaps can be enabled for x86.

TODO: handle x32 via a "synthetic architecture" check, as done in
syscall entry.

Signed-off-by: Kees Cook <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/syscall.h | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6a0cc524882d..0f7a0abab88f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -149,6 +149,7 @@ config X86
select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_SECCOMP_BITMAP if !X86_X32
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_STACKLEAK
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index 7cbf733d11af..b89e86f4c061 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -97,6 +97,7 @@ static inline void syscall_set_arguments(struct task_struct *task,
memcpy(&regs->bx + i, args, n * sizeof(args[0]));
}

+#define SECCOMP_ARCH AUDIT_ARCH_I386
static inline int syscall_get_arch(struct task_struct *task)
{
return AUDIT_ARCH_I386;
@@ -152,6 +153,10 @@ static inline void syscall_set_arguments(struct task_struct *task,
}
}

+#define SECCOMP_ARCH AUDIT_ARCH_X86_64
+#ifdef CONFIG_COMPAT
+#define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386
+#endif
static inline int syscall_get_arch(struct task_struct *task)
{
/* x32 tasks should be considered AUDIT_ARCH_X86_64. */
--
2.25.1

2020-06-16 07:52:45

by Kees Cook

Subject: [PATCH 8/8] [DEBUG] seccomp: Report bitmap coverage ranges

This is what I've been using to explore actual bitmap results for
real-world filters.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 107 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 2fbe7d2260f7..370b7ed9273b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -715,6 +715,85 @@ static void seccomp_update_bitmap(struct seccomp_filter *filter,
}
}

+static void __report_bitmap(const char *arch, u32 ret, int start, int finish)
+{
+ int gap;
+ char *name;
+
+ if (finish == -1)
+ return;
+
+ switch (ret) {
+ case UINT_MAX:
+ name = "filter";
+ break;
+ case SECCOMP_RET_ALLOW:
+ name = "SECCOMP_RET_ALLOW";
+ break;
+ case SECCOMP_RET_KILL_PROCESS:
+ name = "SECCOMP_RET_KILL_PROCESS";
+ break;
+ case SECCOMP_RET_KILL_THREAD:
+ name = "SECCOMP_RET_KILL_THREAD";
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ name = "unknown";
+ break;
+ }
+
+ gap = 0;
+ if (start < 100)
+ gap++;
+ if (start < 10)
+ gap++;
+ if (finish < 100)
+ gap++;
+ if (finish < 10)
+ gap++;
+
+ if (start == finish)
+ pr_info("%s %3d: %s\n", arch, start, name);
+ else if (start + 1 == finish)
+ pr_info("%s %*s%d,%d: %s\n", arch, gap, "", start, finish, name);
+ else
+ pr_info("%s %*s%d-%d: %s\n", arch, gap, "", start, finish, name);
+}
+
+static void report_bitmap(struct seccomp_bitmaps *bitmaps, const char *arch)
+{
+ u32 nr;
+ int start = 0, finish = -1;
+ u32 ret = UINT_MAX;
+ struct report_states {
+ unsigned long *bitmap;
+ u32 ret;
+ } states[] = {
+ { .bitmap = bitmaps->allow, .ret = SECCOMP_RET_ALLOW, },
+ { .bitmap = bitmaps->kill_process, .ret = SECCOMP_RET_KILL_PROCESS, },
+ { .bitmap = bitmaps->kill_thread, .ret = SECCOMP_RET_KILL_THREAD, },
+ { .bitmap = NULL, .ret = UINT_MAX, },
+ };
+
+ for (nr = 0; nr < NR_syscalls; nr++) {
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(states); i++) {
+ if (!states[i].bitmap || test_bit(nr, states[i].bitmap)) {
+ if (ret != states[i].ret) {
+ __report_bitmap(arch, ret, start, finish);
+ ret = states[i].ret;
+ start = nr;
+ }
+ finish = nr;
+ break;
+ }
+ }
+ }
+ if (start != nr)
+ __report_bitmap(arch, ret, start, finish);
+}
+
static void seccomp_update_bitmaps(struct seccomp_filter *filter,
void *pagepair)
{
@@ -724,6 +803,20 @@ static void seccomp_update_bitmaps(struct seccomp_filter *filter,
seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH_COMPAT,
&current->seccomp.compat);
#endif
+ if (strncmp(current->comm, "test-", 5) == 0 ||
+ strcmp(current->comm, "seccomp_bpf") == 0 ||
+ /*
+ * Why are systemd's process names head-truncated to 8 bytes
+ * and wrapped in parens!?
+ */
+ (current->comm[0] == '(' && strrchr(current->comm, ')') != NULL)) {
+ pr_info("reporting syscall bitmap usage for %d (%s):\n",
+ task_pid_nr(current), current->comm);
+ report_bitmap(&current->seccomp.native, "native");
+#ifdef CONFIG_COMPAT
+ report_bitmap(&current->seccomp.compat, "compat");
+#endif
+ }
}
#else
static void seccomp_update_bitmaps(struct seccomp_filter *filter,
@@ -783,6 +876,10 @@ static long seccomp_attach_filter(unsigned int flags,
filter->prev = current->seccomp.filter;
current->seccomp.filter = filter;
atomic_inc(&current->seccomp.filter_count);
+ if (atomic_read(&current->seccomp.filter_count) > 10)
+ pr_info("%d filters: %d (%s)\n",
+ atomic_read(&current->seccomp.filter_count),
+ task_pid_nr(current), current->comm);

/* Evaluate filter for new known-outcome syscalls */
seccomp_update_bitmaps(filter, pagepair);
@@ -2131,6 +2228,16 @@ static int __init seccomp_sysctl_init(void)
pr_warn("sysctl registration failed\n");
else
kmemleak_not_leak(hdr);
+#ifndef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+ pr_info("arch lacks support for constant action bitmaps\n");
+#else
+ pr_info("NR_syscalls: %d\n", NR_syscalls);
+ pr_info("arch: 0x%x\n", SECCOMP_ARCH);
+#ifdef CONFIG_COMPAT
+ pr_info("compat arch: 0x%x\n", SECCOMP_ARCH_COMPAT);
+#endif
+#endif
+ pr_info("sizeof(struct seccomp_bitmaps): %zu\n", sizeof(struct seccomp_bitmaps));

return 0;
}
--
2.25.1

2020-06-16 07:53:29

by Kees Cook

Subject: [PATCH 2/8] seccomp: Use pr_fmt

Avoid open-coding "seccomp: " prefixes for pr_*() calls.

Signed-off-by: Kees Cook <[email protected]>
---
kernel/seccomp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 0016cad0e605..a319700c04c4 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -13,6 +13,7 @@
* Mode 2 allows user-defined system call filters in the form
* of Berkeley Packet Filters/Linux Socket Filters.
*/
+#define pr_fmt(fmt) "seccomp: " fmt

#include <linux/refcount.h>
#include <linux/audit.h>
@@ -1873,7 +1874,7 @@ static int __init seccomp_sysctl_init(void)

hdr = register_sysctl_paths(seccomp_sysctl_path, seccomp_sysctl_table);
if (!hdr)
- pr_warn("seccomp: sysctl registration failed\n");
+ pr_warn("sysctl registration failed\n");
else
kmemleak_not_leak(hdr);

--
2.25.1

2020-06-16 07:53:43

by Kees Cook

Subject: [PATCH 4/8] seccomp: Implement constant action bitmaps

One of the most common pain points with seccomp filters has been dealing
with the overhead of processing the filters, especially for "always allow"
or "always reject" cases. While BPF is extremely fast[1], it will always
have overhead associated with it. Additionally, due to seccomp's design,
filters are layered, which means processing time goes up as the number
of filters attached goes up.

In the past, efforts have been focused on making filter execution complete
in a shorter amount of time. For example, filters were rewritten from
using linear if/then/else syscall search to using balanced binary trees,
or moving tests for syscalls common to the process's workload to the
front of the filter. However, there are limits to this, especially when
some processes are dealing with tens of filters[2], or when some
architectures have a less efficient BPF engine[3].

The most common use of seccomp, constructing syscall block/allow-lists
where syscalls are always allowed or always rejected (without regard
to any arguments), also tends to produce the most pathological runtime
problems, in that a large number of syscall checks in the filter must
be performed to come to a determination.

In order to optimize these cases from O(n) to O(1), seccomp can
use bitmaps to immediately determine the desired action. A critical
observation in the prior paragraph bears repeating: the common case for
syscall tests does not check arguments. For any given filter, there is a
constant mapping from the combination of architecture and syscall to the
seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
there is a single architecture.) As such, it is possible to construct
a mapping of arch/syscall to action, which can be updated as new filters
are attached to a process.

In order to build this mapping at filter attach time, each filter is
executed for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data that are not the "arch"
nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
there is a constant mapping for that syscall, and bitmaps can be updated
accordingly. If any accesses happen outside of those struct members,
seccomp must not bypass filter execution for that syscall, since program
state will be used to determine filter action result.

During syscall action probing, in order to determine whether other members
of struct seccomp_data are being accessed during a filter execution,
the struct is placed across a page boundary with the "arch" and "nr"
members in the first page, and everything else in the second page. The
"page accessed" flag is cleared in the second page's PTE, and the filter
is run. If the "page accessed" flag appears as set after running the
filter, we can determine that the filter looked beyond the "arch" and
"nr" members, and exclude that syscall from the constant action bitmaps.

For architectures to support this optimization, they must declare
their architectures for seccomp to see (via SECCOMP_ARCH and
SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Areas needing more attention:

On x86, this currently adds 168 bytes (or 336 bytes under CONFIG_COMPAT)
to the size of task_struct. Allocating these on demand may be a better
use of memory, but may not result in good cache locality.

For architectures with "synthetic" sub-architectures, like x86_x32,
additional work is needed. It should be possible to define a simple
mechanism based on the masking done in the x86 syscall entry path to
create another set of bitmaps for seccomp to key off of. I am, however,
considering just making HAVE_ARCH_SECCOMP_BITMAP depend on !X86_X32.

[1] https://lore.kernel.org/bpf/[email protected]/
[2] https://lore.kernel.org/bpf/20200601101137.GA121847@gardel-login/
[3] https://lore.kernel.org/bpf/[email protected]/

Signed-off-by: Kees Cook <[email protected]>
---
arch/Kconfig | 7 ++
include/linux/seccomp.h | 15 +++
kernel/seccomp.c | 227 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 246 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..4e692b7a4435 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,13 @@ config SECCOMP_FILTER

See Documentation/userspace-api/seccomp_filter.rst for details.

+config HAVE_ARCH_SECCOMP_BITMAP
+ bool
+ help
+ An arch should select this symbol if it provides all of these things:
+ - SECCOMP_ARCH (and SECCOMP_ARCH_COMPAT if appropriate)
+ - local_flush_tlb_kernel_range()
+
config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6525ddec177a..31ee2d6f4ec0 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -16,6 +16,17 @@
#include <linux/atomic.h>
#include <asm/seccomp.h>

+/* When no bits are set for a syscall, filters are run. */
+struct seccomp_bitmaps {
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+ /* "allow" are initialized to set and only ever get cleared. */
+ DECLARE_BITMAP(allow, NR_syscalls);
+ /* These are initialized to clear and only ever get set. */
+ DECLARE_BITMAP(kill_thread, NR_syscalls);
+ DECLARE_BITMAP(kill_process, NR_syscalls);
+#endif
+};
+
struct seccomp_filter;
/**
* struct seccomp - the state of a seccomp'ed process
@@ -35,6 +46,10 @@ struct seccomp {
#endif
atomic_t filter_count;
struct seccomp_filter *filter;
+ struct seccomp_bitmaps native;
+#ifdef CONFIG_COMPAT
+ struct seccomp_bitmaps compat;
+#endif
};

#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 43edf53c2d84..2fbe7d2260f7 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -44,6 +44,11 @@
#include <linux/anon_inodes.h>
#include <linux/lockdep.h>

+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+#include <linux/pgtable.h>
+#include <asm/tlbflush.h>
+#endif
+
enum notify_state {
SECCOMP_NOTIFY_INIT,
SECCOMP_NOTIFY_SENT,
@@ -476,6 +481,16 @@ static inline void seccomp_sync_threads(unsigned long flags)
atomic_set(&thread->seccomp.filter_count,
atomic_read(&thread->seccomp.filter_count));

+ /* Copy syscall filter bitmaps. */
+ memcpy(&thread->seccomp.native,
+ &caller->seccomp.native,
+ sizeof(caller->seccomp.native));
+#ifdef CONFIG_COMPAT
+ memcpy(&thread->seccomp.compat,
+ &caller->seccomp.compat,
+ sizeof(caller->seccomp.compat));
+#endif
+
/*
* Don't let an unprivileged task work around
* the no_new_privs restriction by creating
@@ -578,6 +593,144 @@ seccomp_prepare_user_filter(const char __user *user_filter)
return filter;
}

+static inline bool sd_touched(pte_t *ptep)
+{
+ return !!pte_young(*(READ_ONCE(ptep)));
+}
+
+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+/*
+ * We can build bitmaps only when an arch/nr combination reads nothing more
+ * than sd->nr and sd->arch, since those have a constant mapping to the
+ * syscall. To do this, we can run the filters for each syscall number, and
+ * examine the page table entry that is aligned to everything past sd->arch,
+ * checking for the ACCESSED flag.
+ *
+ * This approach could also be used to test for access to sd->arch too,
+ * if we wanted to warn about compat-unsafe filters.
+ */
+static void seccomp_update_bitmap(struct seccomp_filter *filter,
+ void *pagepair, u32 arch,
+ struct seccomp_bitmaps *bitmaps)
+{
+ struct seccomp_data *sd;
+ unsigned long vaddr;
+ u32 nr, ret;
+ pte_t *ptep;
+ u64 check;
+
+ /* Initialize bitmaps for first filter. */
+ if (!filter->prev)
+ bitmap_fill(bitmaps->allow, NR_syscalls);
+ /*
+ * Prepare to detect memory accesses: find the PTE for the second page
+ * in the page pair.
+ */
+ vaddr = (unsigned long)(pagepair + PAGE_SIZE);
+ ptep = virt_to_kpte(vaddr);
+ /*
+ * Split struct seccomp_data across two pages, with everything after
+ * sd->arch (i.e. starting with sd->instruction_pointer), in the second
+ * page of the page pair.
+ */
+ sd = pagepair + PAGE_SIZE - offsetof(struct seccomp_data, instruction_pointer);
+
+ /* Mark the second page as untouched (i.e. "old") */
+ preempt_disable();
+ set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
+ local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+ preempt_enable();
+ /* Make sure the PTE agrees that it is untouched. */
+ if (WARN_ON_ONCE(sd_touched(ptep)))
+ return;
+ /* Read a portion of struct seccomp_data from the second page. */
+ check = sd->instruction_pointer;
+ /* First, verify the contents are zero from vzalloc(). */
+ if (WARN_ON_ONCE(check))
+ return;
+ /* Now make sure the ACCESSED bit has been set after the read. */
+ if (!sd_touched(ptep)) {
+ /*
+ * If autodetection fails, fall back to standard behavior by
+ * clearing the entire "allow" bitmap.
+ */
+ pr_warn_once("cannot build automatic syscall filters\n");
+ bitmap_zero(bitmaps->allow, NR_syscalls);
+ return;
+ }
+
+ /*
+ * For every syscall, if we don't already know we need to run
+ * the full filter, simulate the filter with our static values.
+ */
+ for (nr = 0; nr < NR_syscalls; nr++) {
+ /* Are we already at the maximal rejection state? */
+ if (test_bit(nr, bitmaps->kill_process))
+ continue;
+
+ sd->nr = nr;
+ sd->arch = arch;
+
+ /* Do we need to reset the ACCESSED bit? */
+ if (sd_touched(ptep)) {
+ preempt_disable();
+ set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
+ local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+ preempt_enable();
+ }
+
+ /* Evaluate filter for this syscall. */
+ ret = bpf_prog_run_pin_on_cpu(filter->prog, sd);
+ /*
+ * If this run through the filter didn't access
+ * beyond "arch", we know the result is a constant
+ * mapping for arch/nr -> ret.
+ */
+ if (!sd_touched(ptep)) {
+ /* Constant evaluation. Mark appropriate bitmaps. */
+ switch (ret) {
+ case SECCOMP_RET_KILL_PROCESS:
+ set_bit(nr, bitmaps->kill_process);
+ break;
+ case SECCOMP_RET_KILL_THREAD:
+ set_bit(nr, bitmaps->kill_thread);
+ break;
+ default:
+ break;
+ case SECCOMP_RET_ALLOW:
+ /*
+ * If we always map to allow, there are
+ * no changes needed to the bitmaps.
+ */
+ continue;
+ }
+ }
+
+ /*
+ * Dynamic evaluation of syscall, or non-allow constant
+ * mapping to something other than SECCOMP_RET_ALLOW: we
+ * must not short-circuit-allow it anymore.
+ */
+ clear_bit(nr, bitmaps->allow);
+ }
+}
+
+static void seccomp_update_bitmaps(struct seccomp_filter *filter,
+ void *pagepair)
+{
+ seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH,
+ &current->seccomp.native);
+#ifdef CONFIG_COMPAT
+ seccomp_update_bitmap(filter, pagepair, SECCOMP_ARCH_COMPAT,
+ &current->seccomp.compat);
+#endif
+}
+#else
+static void seccomp_update_bitmaps(struct seccomp_filter *filter,
+ void *pagepair)
+{ }
+#endif
+
/**
* seccomp_attach_filter: validate and attach filter
* @flags: flags to change filter behavior
@@ -591,7 +744,8 @@ seccomp_prepare_user_filter(const char __user *user_filter)
* - in NEW_LISTENER mode: the fd of the new listener
*/
static long seccomp_attach_filter(unsigned int flags,
- struct seccomp_filter *filter)
+ struct seccomp_filter *filter,
+ void *pagepair)
{
unsigned long total_insns;
struct seccomp_filter *walker;
@@ -630,6 +784,9 @@ static long seccomp_attach_filter(unsigned int flags,
current->seccomp.filter = filter;
atomic_inc(&current->seccomp.filter_count);

+ /* Evaluate filter for new known-outcome syscalls */
+ seccomp_update_bitmaps(filter, pagepair);
+
/* Now that the new filter is in place, synchronize to all threads. */
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
seccomp_sync_threads(flags);
@@ -857,6 +1014,56 @@ static int seccomp_do_user_notification(int this_syscall,
return -1;
}

+#ifdef CONFIG_HAVE_ARCH_SECCOMP_BITMAP
+static inline bool __bypass_filter(struct seccomp_bitmaps *bitmaps,
+ u32 nr, u32 *filter_ret)
+{
+ if (nr < NR_syscalls) {
+ if (test_bit(nr, current->seccomp.native.allow)) {
+ *filter_ret = SECCOMP_RET_ALLOW;
+ return true;
+ }
+ if (test_bit(nr, current->seccomp.native.kill_process)) {
+ *filter_ret = SECCOMP_RET_KILL_PROCESS;
+ return true;
+ }
+ if (test_bit(nr, current->seccomp.native.kill_thread)) {
+ *filter_ret = SECCOMP_RET_KILL_THREAD;
+ return true;
+ }
+ }
+ return false;
+}
+
+static inline u32 check_syscall(const struct seccomp_data *sd,
+ struct seccomp_filter **match)
+{
+ u32 filter_ret = SECCOMP_RET_KILL_PROCESS;
+
+#ifdef CONFIG_COMPAT
+ if (sd->arch == SECCOMP_ARCH) {
+#endif
+ if (__bypass_filter(&current->seccomp.native, sd->nr, &filter_ret))
+ return filter_ret;
+#ifdef CONFIG_COMPAT
+ } else if (sd->arch == SECCOMP_ARCH_COMPAT) {
+ if (__bypass_filter(&current->seccomp.compat, sd->nr, &filter_ret))
+ return filter_ret;
+ } else {
+ WARN_ON_ONCE(1);
+ return filter_ret;
+ }
+#endif
+ return seccomp_run_filters(sd, match);
+}
+#else
+static inline u32 check_syscall(const struct seccomp_data *sd,
+ struct seccomp_filter **match)
+{
+ return seccomp_run_filters(sd, match);
+}
+#endif
+
static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
const bool recheck_after_trace)
{
@@ -876,7 +1083,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
sd = &sd_local;
}

- filter_ret = seccomp_run_filters(sd, &match);
+ filter_ret = check_syscall(sd, &match);
data = filter_ret & SECCOMP_RET_DATA;
action = filter_ret & SECCOMP_RET_ACTION_FULL;

@@ -1346,6 +1553,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
long ret = -EINVAL;
int listener = -1;
struct file *listener_f = NULL;
+ void *pagepair;

/* Validate flags. */
if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -1391,12 +1599,24 @@ static long seccomp_set_mode_filter(unsigned int flags,
mutex_lock_killable(&current->signal->cred_guard_mutex))
goto out_put_fd;

+ /*
+ * This memory will be needed for bitmap testing, but we'll
+ * be holding a spinlock at that point. Do the allocation
+ * (and free) outside of the lock.
+ *
+ * Alternative: we could do the bitmap update before attach
+ * to avoid spending too much time under lock.
+ */
+ pagepair = vzalloc(PAGE_SIZE * 2);
+ if (!pagepair)
+ goto out_put_fd;
+
spin_lock_irq(&current->sighand->siglock);

if (!seccomp_may_assign_mode(seccomp_mode))
goto out;

- ret = seccomp_attach_filter(flags, prepared);
+ ret = seccomp_attach_filter(flags, prepared, pagepair);
if (ret)
goto out;
/* Do not free the successfully attached filter. */
@@ -1405,6 +1625,7 @@ static long seccomp_set_mode_filter(unsigned int flags,
seccomp_assign_mode(current, seccomp_mode, flags);
out:
spin_unlock_irq(&current->sighand->siglock);
+ vfree(pagepair);
if (flags & SECCOMP_FILTER_FLAG_TSYNC)
mutex_unlock(&current->signal->cred_guard_mutex);
out_put_fd:
--
2.25.1

2020-06-16 07:53:54

by Kees Cook

Subject: [PATCH 1/8] selftests/seccomp: Improve calibration loop

The seccomp benchmark calibration loop did not need to take so long.
Instead, use a simple 1 second timeout and multiply up to the target. It
does not need to be accurate.
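
For example, if roughly 2,000,000 getpid() calls complete within the
1 second calibration window, the benchmark will run 2,000,000 * 15 =
30,000,000 syscalls, i.e. about 15 seconds worth at the measured rate.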

Signed-off-by: Kees Cook <[email protected]>
---
.../selftests/seccomp/seccomp_benchmark.c | 50 ++++++++++++-------
1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index eca13fe1fba9..91f5a89cadac 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -18,9 +18,9 @@

unsigned long long timing(clockid_t clk_id, unsigned long long samples)
{
- pid_t pid, ret;
- unsigned long long i;
struct timespec start, finish;
+ unsigned long long i;
+ pid_t pid, ret;

pid = getpid();
assert(clock_gettime(clk_id, &start) == 0);
@@ -31,30 +31,43 @@ unsigned long long timing(clockid_t clk_id, unsigned long long samples)
assert(clock_gettime(clk_id, &finish) == 0);

i = finish.tv_sec - start.tv_sec;
- i *= 1000000000;
+ i *= 1000000000ULL;
i += finish.tv_nsec - start.tv_nsec;

- printf("%lu.%09lu - %lu.%09lu = %llu\n",
+ printf("%lu.%09lu - %lu.%09lu = %llu (%.1fs)\n",
finish.tv_sec, finish.tv_nsec,
start.tv_sec, start.tv_nsec,
- i);
+ i, (double)i / 1000000000.0);

return i;
}

unsigned long long calibrate(void)
{
- unsigned long long i;
-
- printf("Calibrating reasonable sample size...\n");
+ struct timespec start, finish;
+ unsigned long long i, samples, step = 9973;
+ pid_t pid, ret;
+ int seconds = 15;

- for (i = 5; ; i++) {
- unsigned long long samples = 1 << i;
+ printf("Calibrating sample size for %d seconds worth of syscalls ...\n", seconds);

- /* Find something that takes more than 5 seconds to run. */
- if (timing(CLOCK_REALTIME, samples) / 1000000000ULL > 5)
- return samples;
- }
+ samples = 0;
+ pid = getpid();
+ assert(clock_gettime(CLOCK_MONOTONIC, &start) == 0);
+ do {
+ for (i = 0; i < step; i++) {
+ ret = syscall(__NR_getpid);
+ assert(pid == ret);
+ }
+ assert(clock_gettime(CLOCK_MONOTONIC, &finish) == 0);
+
+ samples += step;
+ i = finish.tv_sec - start.tv_sec;
+ i *= 1000000000ULL;
+ i += finish.tv_nsec - start.tv_nsec;
+ } while (i < 1000000000ULL);
+
+ return samples * seconds;
}

int main(int argc, char *argv[])
@@ -70,15 +83,16 @@ int main(int argc, char *argv[])
unsigned long long samples;
unsigned long long native, filter1, filter2;

+ printf("Current BPF sysctl settings:\n");
+ system("sysctl net.core.bpf_jit_enable");
+ system("sysctl net.core.bpf_jit_harden");
+
if (argc > 1)
samples = strtoull(argv[1], NULL, 0);
else
samples = calibrate();

- printf("Current BPF sysctl settings:\n");
- system("sysctl net.core.bpf_jit_enable");
- system("sysctl net.core.bpf_jit_harden");
- printf("Benchmarking %llu samples...\n", samples);
+ printf("Benchmarking %llu syscalls...\n", samples);

/* Native call */
native = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
--
2.25.1

2020-06-16 07:54:56

by Kees Cook

Subject: [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE

For systems that provide multiple syscall maps based on architectures
(e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via CONFIG_COMPAT), allow
a fast way to pin the process to a specific syscall mapping, instead of
needing to generate all filters with an architecture check as the first
filter action.
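
A minimal userspace sketch of the new operation (hypothetical wrapper;
per the do_seccomp() change below, flags must be 0 and uargs NULL):

        #include <linux/seccomp.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Pin this task to its current syscall architecture. */
        static int pin_architecture(void)
        {
                return syscall(__NR_seccomp, SECCOMP_PIN_ARCHITECTURE, 0, NULL);
        }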

Cc: Andy Lutomirski <[email protected]>
Cc: Will Drewry <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
---
include/linux/seccomp.h | 3 +++
include/uapi/linux/seccomp.h | 1 +
kernel/seccomp.c | 37 ++++++++++++++++++++++++++++++++++--
3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index babcd6c02d09..6525ddec177a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -30,6 +30,9 @@ struct seccomp_filter;
*/
struct seccomp {
int mode;
+#ifdef CONFIG_COMPAT
+ u32 arch;
+#endif
atomic_t filter_count;
struct seccomp_filter *filter;
};
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index c1735455bc53..84e89bb201ae 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -16,6 +16,7 @@
#define SECCOMP_SET_MODE_FILTER 1
#define SECCOMP_GET_ACTION_AVAIL 2
#define SECCOMP_GET_NOTIF_SIZES 3
+#define SECCOMP_PIN_ARCHITECTURE 4

/* Valid flags for SECCOMP_SET_MODE_FILTER */
#define SECCOMP_FILTER_FLAG_TSYNC (1UL << 0)
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index a319700c04c4..43edf53c2d84 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -268,9 +268,16 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
struct seccomp_filter **match)
{
u32 ret = SECCOMP_RET_ALLOW;
+ struct seccomp_filter *f;
+
+#ifdef CONFIG_COMPAT
+ /* Block mismatched architectures. */
+ if (current->seccomp.arch && current->seccomp.arch != sd->arch)
+ return SECCOMP_RET_KILL_PROCESS;
+#endif
+
/* Make sure cross-thread synced filter points somewhere sane. */
- struct seccomp_filter *f =
- READ_ONCE(current->seccomp.filter);
+ f = READ_ONCE(current->seccomp.filter);

/* Ensure unexpected behavior doesn't result in failing open. */
if (WARN_ON(f == NULL))
@@ -478,6 +485,11 @@ static inline void seccomp_sync_threads(unsigned long flags)
if (task_no_new_privs(caller))
task_set_no_new_privs(thread);

+#ifdef CONFIG_COMPAT
+ /* Copy any pinned architecture. */
+ thread->seccomp.arch = caller->seccomp.arch;
+#endif
+
/*
* Opt the other thread into seccomp if needed.
* As threads are considered to be trust-realm
@@ -1456,6 +1468,20 @@ static long seccomp_get_notif_sizes(void __user *usizes)
return 0;
}

+static long seccomp_pin_architecture(void)
+{
+#ifdef CONFIG_COMPAT
+ u32 arch = syscall_get_arch(current);
+
+ /* How did you even get here? */
+ if (current->seccomp.arch && current->seccomp.arch != arch)
+ return -EBUSY;
+
+ current->seccomp.arch = arch;
+#endif
+ return 0;
+}
+
/* Common entry point for both prctl and syscall. */
static long do_seccomp(unsigned int op, unsigned int flags,
void __user *uargs)
@@ -1477,6 +1503,13 @@ static long do_seccomp(unsigned int op, unsigned int flags,
return -EINVAL;

return seccomp_get_notif_sizes(uargs);
+ case SECCOMP_PIN_ARCHITECTURE:
+ if (flags != 0)
+ return -EINVAL;
+ if (uargs != NULL)
+ return -EINVAL;
+
+ return seccomp_pin_architecture();
default:
return -EINVAL;
}
--
2.25.1

2020-06-16 12:17:49

by Jann Horn

Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <[email protected]> wrote:
> One of the most common pain points with seccomp filters has been dealing
> with the overhead of processing the filters, especially for "always allow"
> or "always reject" cases. While BPF is extremely fast[1], it will always
> have overhead associated with it. Additionally, due to seccomp's design,
> filters are layered, which means processing time goes up as the number
> of filters attached goes up.
>
> In the past, efforts have been focused on making filter execution complete
> in a shorter amount of time. For example, filters were rewritten from
> using linear if/then/else syscall search to using balanced binary trees,
> or moving tests for syscalls common to the process's workload to the
> front of the filter. However, there are limits to this, especially when
> some processes are dealing with tens of filters[2], or when some
> architectures have a less efficient BPF engine[3].
>
> The most common use of seccomp, constructing syscall block/allow-lists
> where syscalls are always allowed or always rejected (without regard
> to any arguments), also tends to produce the most pathological runtime
> problems, in that a large number of syscall checks in the filter must
> be performed to come to a determination.
>
> In order to optimize these cases from O(n) to O(1), seccomp can
> use bitmaps to immediately determine the desired action. A critical
> observation in the prior paragraph bears repeating: the common case for
> syscall tests does not check arguments. For any given filter, there is a
> constant mapping from the combination of architecture and syscall to the
> seccomp action result. (For kernels/architectures without CONFIG_COMPAT,
> there is a single architecture.) As such, it is possible to construct
> a mapping of arch/syscall to action, which can be updated as new filters
> are attached to a process.
>
> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.
>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.
>
> For architectures to support this optimization, they must declare
> their architectures for seccomp to see (via SECCOMP_ARCH and
> SECCOMP_ARCH_COMPAT macros), and provide a way to perform efficient
> CPU-local kernel TLB flushes (via local_flush_tlb_kernel_range()),
> and then set HAVE_ARCH_SECCOMP_BITMAP in their Kconfig.

Wouldn't it be simpler to use a function that can run a subset of
seccomp cBPF and bails out on anything that indicates that a syscall's
handling is complex or on instructions it doesn't understand? For
syscalls that have a fixed policy, a typical seccomp filter doesn't
even use any of the BPF_ALU ops, the scratch space, or the X register;
it just uses something like the following set of operations, which is
easy to emulate without much code:

BPF_LD | BPF_W | BPF_ABS
BPF_JMP | BPF_JEQ | BPF_K
BPF_JMP | BPF_JGE | BPF_K
BPF_JMP | BPF_JGT | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K

Something like (completely untested):

/*
 * Try to statically determine whether @filter will always return a fixed result
 * when run for syscall @nr under architecture @arch.
 * Returns true if the result could be determined; if so, the result will be
 * stored in @action.
 */
static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
                                  unsigned int nr, unsigned int *action)
{
        int pc;
        unsigned int reg_value = 0;

        for (pc = 0; 1; pc++) {
                struct sock_filter *insn = &filter[pc];
                u16 code = insn->code;
                u32 k = insn->k;

                switch (code) {
                case BPF_LD | BPF_W | BPF_ABS:
                        if (k == offsetof(struct seccomp_data, nr)) {
                                reg_value = nr;
                        } else if (k == offsetof(struct seccomp_data, arch)) {
                                reg_value = arch;
                        } else {
                                return false; /* can't optimize (non-constant value load) */
                        }
                        break;
                case BPF_RET | BPF_K:
                        *action = insn->k;
                        return true; /* success: reached return with constant values only */
                case BPF_JMP | BPF_JA:
                        pc += insn->k;
                        break;
                case BPF_JMP | BPF_JEQ | BPF_K:
                case BPF_JMP | BPF_JGE | BPF_K:
                case BPF_JMP | BPF_JGT | BPF_K:
                default:
                        if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
                                u16 op = BPF_OP(code);
                                bool op_res;

                                switch (op) {
                                case BPF_JEQ:
                                        op_res = reg_value == k;
                                        break;
                                case BPF_JGE:
                                        op_res = reg_value >= k;
                                        break;
                                case BPF_JGT:
                                        op_res = reg_value > k;
                                        break;
                                default:
                                        return false; /* can't optimize (unknown insn) */
                                }

                                pc += op_res ? insn->jt : insn->jf;
                                break;
                        }
                        return false; /* can't optimize (unknown insn) */
                }
        }
}

That way, you won't need any of this complicated architecture-specific stuff.

2020-06-16 14:44:13

by Dave Hansen

Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

On 6/16/20 12:49 AM, Kees Cook wrote:
> + /* Mark the second page as untouched (i.e. "old") */
> + preempt_disable();
> + set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
> + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> + preempt_enable();

If you can, I'd wrap that nugget up in a helper. I'd also suggest being
very explicit in a comment about what it is trying to do: ensure no TLB
entries exist so that a future access will always set the Accessed bit.

> + /* Make sure the PTE agrees that it is untouched. */
> + if (WARN_ON_ONCE(sd_touched(ptep)))
> + return;
> + /* Read a portion of struct seccomp_data from the second page. */
> + check = sd->instruction_pointer;
> + /* First, verify the contents are zero from vzalloc(). */
> + if (WARN_ON_ONCE(check))
> + return;
> + /* Now make sure the ACCESSED bit has been set after the read. */
> + if (!sd_touched(ptep)) {
> + /*
> + * If autodetection fails, fall back to standard behavior by
> + * clearing the entire "allow" bitmap.
> + */
> + pr_warn_once("cannot build automatic syscall filters\n");
> + bitmap_zero(bitmaps->allow, NR_syscalls);
> + return;
> + }

I can't find any big holes with this. It's the kind of code that makes
me nervous, but mostly because it's pretty different than anything else
we have in the kernel.

It's also clear to me here that you probably have a slightly different
expectation of what the PTE accessed flag means versus the hardware
guys. What you are looking for it to mean is roughly: "a retired
instruction touched this page".

The hardware guys would probably say it's closer to "a TLB entry was
established for this page." Remember that TLB entries can be
established speculatively or from things like prefetchers. While I
don't know of anything microarchitectural today which would trip this
mechanism, it's entirely possible that something in the future might.
Accessing close to the page boundary is the exact kind of place folks
might want to optimize.

*But*, at least it would err in the direction of being conservative. It
would say "somebody touched the page!" more often than it should, but
never _less_ often than it should.

One thing about the implementation (which is roughly):

// Touch the data:
check = sd->instruction_pointer;
// Examine the PTE mapping that data:
if (!sd_touched(ptep)) {
// something
}

There aren't any barriers in there, which could lead to the sd_touched()
check being ordered before the data touch. I think a rmb() will
suffice. You could even do it inside sd_touched().
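
For instance (a sketch of that suggestion, modifying the patch's helper):

        static inline bool sd_touched(pte_t *ptep)
        {
                /* Order the preceding data access before the PTE read. */
                rmb();
                return !!pte_young(*(READ_ONCE(ptep)));
        }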

Was there a reason you chose to export a ranged TLB flush? I probably
would have just used the single-page flush_tlb_one_kernel() for this
purpose if I were working in arch-specific code.

2020-06-16 15:53:07

by Kees Cook

Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> Wouldn't it be simpler to use a function that can run a subset of
> seccomp cBPF and bails out on anything that indicates that a syscall's
> handling is complex or on instructions it doesn't understand? For
> syscalls that have a fixed policy, a typical seccomp filter doesn't
> even use any of the BPF_ALU ops, the scratch space, or the X register;
> it just uses something like the following set of operations, which is
> easy to emulate without much code:
>
> BPF_LD | BPF_W | BPF_ABS
> BPF_JMP | BPF_JEQ | BPF_K
> BPF_JMP | BPF_JGE | BPF_K
> BPF_JMP | BPF_JGT | BPF_K
> BPF_JMP | BPF_JA
> BPF_RET | BPF_K

Initially, I started down this path. It needed a bit of plumbing into
BPF to better control the lifetime of the cBPF "saved original filter"
(normally used for CHECKPOINT_RESTORE), and then I needed to keep
making exceptions (same list you have: ALU, X register, scratch, etc)
in the name of avoiding too much complexity in the emulator. I decided
I'd rather reuse the existing infrastructure to actually execute the
filter (no cBPF copy needed to be saved, no separate code, and full
instruction coverage).

>
> Something like (completely untested):
>
> /*
>  * Try to statically determine whether @filter will always return a fixed result
>  * when run for syscall @nr under architecture @arch.
>  * Returns true if the result could be determined; if so, the result will be
>  * stored in @action.
>  */
> static bool seccomp_check_syscall(struct sock_filter *filter, unsigned int arch,
>                                   unsigned int nr, unsigned int *action)
> {
>         int pc;
>         unsigned int reg_value = 0;
>
>         for (pc = 0; 1; pc++) {
>                 struct sock_filter *insn = &filter[pc];
>                 u16 code = insn->code;
>                 u32 k = insn->k;
>
>                 switch (code) {
>                 case BPF_LD | BPF_W | BPF_ABS:
>                         if (k == offsetof(struct seccomp_data, nr)) {
>                                 reg_value = nr;
>                         } else if (k == offsetof(struct seccomp_data, arch)) {
>                                 reg_value = arch;
>                         } else {
>                                 return false; /* can't optimize (non-constant value load) */
>                         }
>                         break;
>                 case BPF_RET | BPF_K:
>                         *action = insn->k;
>                         return true; /* success: reached return with constant values only */
>                 case BPF_JMP | BPF_JA:
>                         pc += insn->k;
>                         break;
>                 case BPF_JMP | BPF_JEQ | BPF_K:
>                 case BPF_JMP | BPF_JGE | BPF_K:
>                 case BPF_JMP | BPF_JGT | BPF_K:
>                 default:
>                         if (BPF_CLASS(code) == BPF_JMP && BPF_SRC(code) == BPF_K) {
>                                 u16 op = BPF_OP(code);
>                                 bool op_res;
>
>                                 switch (op) {
>                                 case BPF_JEQ:
>                                         op_res = reg_value == k;
>                                         break;
>                                 case BPF_JGE:
>                                         op_res = reg_value >= k;
>                                         break;
>                                 case BPF_JGT:
>                                         op_res = reg_value > k;
>                                         break;
>                                 default:
>                                         return false; /* can't optimize (unknown insn) */
>                                 }
>
>                                 pc += op_res ? insn->jt : insn->jf;
>                                 break;
>                         }
>                         return false; /* can't optimize (unknown insn) */
>                 }
>         }
> }

I didn't actually finish going down the emulator path (I stopped right
around the time I verified that libseccomp does use BPF_ALU -- though
only BPF_AND), so I didn't actually evaluate the filter contents for other
filter builders (e.g. Chrome).

But, if BPF_ALU | BPF_AND were added to your code above, it would cover
everything libseccomp generates (which covers a lot of the seccomp
filters, e.g. systemd, docker). I just felt funny about an "incomplete"
emulator.

Though now you've got me looking. It seems this is the core
of Chrome's BPF instruction generation:
https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
It also uses ALU|AND, but adds JMP|JSET.

So... that's only 2 more instructions to cover what I think are likely
the two largest seccomp instruction generators.
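
Extending your sketch to cover those two would be something like
(equally untested):

        case BPF_ALU | BPF_AND | BPF_K:
                reg_value &= k;
                break;

plus one more case in the conditional-jump decoding:

        case BPF_JSET:
                op_res = (reg_value & k) != 0;
                break;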

> That way, you won't need any of this complicated architecture-specific stuff.

There are two arch-specific needs, and using a cBPF-subset emulator
just gets rid of the local TLB flush. The other part is distinguishing
the archs. Neither requirement is onerous (TLB flush usually just
needs little more than an extern, arch is already documented in the
per-arch syscall_get_arch()). The awkward part I ran into for arm64
was a header include loop for compat, due to how unistd is handled when
getting NR_syscalls for the bitmap sizing (which I'm sure is solvable,
but I just wanted to get the x86 RFC posted first).

--
Kees Cook

2020-06-16 16:06:30

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 07:40:17AM -0700, Dave Hansen wrote:
> On 6/16/20 12:49 AM, Kees Cook wrote:
> > + /* Mark the second page as untouched (i.e. "old") */
> > + preempt_disable();
> > + set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
> > + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> > + preempt_enable();
>
> If you can, I'd wrap that nugget up in a helper. I'd also suggest being
> very explicit in a comment about what it is trying to do: ensure no TLB
> entries exist so that a future access will always set the Accessed bit.

Yeah, good idea!
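
Something like this, maybe (helper name invented here; the body is as in
the patch, with your suggested comment folded in):

	/*
	 * Clear the Accessed bit and make sure no TLB entries exist for
	 * @vaddr, so that any future access will always set the Accessed
	 * bit again.
	 */
	static void seccomp_pte_mkold_flush(unsigned long vaddr, pte_t *ptep)
	{
		preempt_disable();
		set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
		local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
		preempt_enable();
	}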

>
> > + /* Make sure the PTE agrees that it is untouched. */
> > + if (WARN_ON_ONCE(sd_touched(ptep)))
> > + return;
> > + /* Read a portion of struct seccomp_data from the second page. */
> > + check = sd->instruction_pointer;
> > + /* First, verify the contents are zero from vzalloc(). */
> > + if (WARN_ON_ONCE(check))
> > + return;
> > + /* Now make sure the ACCESSED bit has been set after the read. */
> > + if (!sd_touched(ptep)) {
> > + /*
> > + * If autodetection fails, fall back to standard behavior by
> > + * clearing the entire "allow" bitmap.
> > + */
> > + pr_warn_once("seccomp: cannot build automatic syscall filters\n");
> > + bitmap_zero(bitmaps->allow, NR_syscalls);
> > + return;
> > + }
>
> I can't find any big holes with this. It's the kind of code that makes
> me nervous, but mostly because it's pretty different from anything else
> we have in the kernel.
>
> It's also clear to me here that you probably have a slightly different
> expectation of what the PTE accessed flag means versus the hardware
> guys. What you are looking for it to mean is roughly: "a retired
> instruction touched this page".
>
> The hardware guys would probably say it's closer to "a TLB entry was
> established for this page." Remember that TLB entries can be
> established speculatively or from things like prefetchers. While I
> don't know of anything microarchitectural today which would trip this
> mechanism, it's entirely possible that something in the future might.
> Accessing close to the page boundary is the exact kind of place folks
> might want to optimize.

Yeah, and to that end, going the cBPF emulator route removes this kind
of "weird" behavior.

>
> *But*, at least it would err in the direction of being conservative. It
> would say "somebody touched the page!" more often than it should, but
> never _less_ often than it should.

Right -- I made sure to design the bitmaps and the direction of the
checking to fail towards running the filter instead of bypassing it.

> One thing about the implementation (which is roughly):
>
> // Touch the data:
> check = sd->instruction_pointer;
> // Examine the PTE mapping that data:
> if (!sd_touched(ptep)) {
> // something
> }
>
> There aren't any barriers in there, which could lead to the sd_touched()
> check being ordered before the data touch. I think a rmb() will
> suffice. You could even do it inside sd_touched().

Ah yeah, I had convinced myself that READ_ONCE() gained me that
coverage, but I guess that's not actually true here.
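
I'll add one -- probably inside sd_touched(), something like this
(assuming sd_touched() boils down to a pte_young() test):

	static bool sd_touched(pte_t *ptep)
	{
		/*
		 * Order the seccomp_data read above before the PTE read,
		 * so a hardware Accessed-bit update can't be missed.
		 */
		rmb();
		return pte_young(*ptep);
	}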

> Was there a reason you chose to export a ranged TLB flush? I probably
> would have just used the single-page flush_tlb_one_kernel() for this
> purpose if I were working in arch-specific code.

No particular reason -- it just seemed easiest to make available given
the interfaces. I could do the single-page version instead, if this way
of doing things survives review. ;)

Thanks for looking at it!

--
Kees Cook

2020-06-16 17:00:45

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE

On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <[email protected]> wrote:
>
> For systems that provide multiple syscall maps based on architectures
> (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via CONFIG_COMPAT), allow
> a fast way to pin the process to a specific syscall mapping, instead of
> needing to generate all filters with an architecture check as the first
> filter action.

Can you allow specification of the reject action? I can see people
wanting TRAP instead, for example.
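
Something like this, maybe (completely hypothetical; the struct and
field names are made up):

	struct seccomp_pin_arch {
		__u32 flags;
		__u32 reject_action;	/* e.g. SECCOMP_RET_TRAP */
	};

	struct seccomp_pin_arch pin = { .reject_action = SECCOMP_RET_TRAP };
	syscall(__NR_seccomp, SECCOMP_PIN_ARCHITECTURE, 0, &pin);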

2020-06-16 17:01:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 6/8] x86: Provide API for local kernel TLB flushing

On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <[email protected]> wrote:
>
> The seccomp constant action bitmap filter evaluation routine depends
> on being able to quickly clear the PTE "accessed" bit for a temporary
> allocation. Provide access to the existing CPU-local kernel memory TLB
> flushing routines.

Can you write a better justification? Also, unless I'm just
incompetent this morning, I can't find anyone calling this in the
series.

2020-06-16 17:04:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <[email protected]> wrote:
>
> Hi,
>

> In order to build this mapping at filter attach time, each filter is
> executed for every syscall (under each possible architecture), and
> checked for any accesses of struct seccomp_data that are not the "arch"
> nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> there is a constant mapping for that syscall, and bitmaps can be updated
> accordingly. If any accesses happen outside of those struct members,
> seccomp must not bypass filter execution for that syscall, since program
> state will be used to determine filter action result.

>
> During syscall action probing, in order to determine whether other members
> of struct seccomp_data are being accessed during a filter execution,
> the struct is placed across a page boundary with the "arch" and "nr"
> members in the first page, and everything else in the second page. The
> "page accessed" flag is cleared in the second page's PTE, and the filter
> is run. If the "page accessed" flag appears as set after running the
> filter, we can determine that the filter looked beyond the "arch" and
> "nr" members, and exclude that syscall from the constant action bitmaps.

This is... evil. I don't know how I feel about it. It's also
potentially quite slow.

I don't suppose you could, instead, instrument the BPF code to get at
this without TLB hackery? Or maybe try to do some real symbolic
execution of the BPF code?

--Andy

2020-06-16 18:38:10

by Kees Cook

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 10:01:43AM -0700, Andy Lutomirski wrote:
> On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <[email protected]> wrote:
> >
> > Hi,
> >
>
> > In order to build this mapping at filter attach time, each filter is
> > executed for every syscall (under each possible architecture), and
> > checked for any accesses of struct seccomp_data that are not the "arch"
> > nor "nr" (syscall) members. If only "arch" and "nr" are examined, then
> > there is a constant mapping for that syscall, and bitmaps can be updated
> > accordingly. If any accesses happen outside of those struct members,
> > seccomp must not bypass filter execution for that syscall, since program
> > state will be used to determine filter action result.
>
> >
> > During syscall action probing, in order to determine whether other members
> > of struct seccomp_data are being accessed during a filter execution,
> > the struct is placed across a page boundary with the "arch" and "nr"
> > members in the first page, and everything else in the second page. The
> > "page accessed" flag is cleared in the second page's PTE, and the filter
> > is run. If the "page accessed" flag appears as set after running the
> > filter, we can determine that the filter looked beyond the "arch" and
> > "nr" members, and exclude that syscall from the constant action bitmaps.
>
> This is... evil. I don't know how I feel about it. It's also

Thank you! ;)

> potentially quite slow.

I got the impression that ~900 _local_ TLB flushes per filter attach
(the worst case: a "full" filter probed for every arch/syscall combo,
i.e. roughly 440 syscalls under each of the two audit arches on x86)
wouldn't be very slow at all. (And the code is optimized to avoid
needless flushes.)

> I don't suppose you could, instead, instrument the BPF code to get at
> this without TLB hackery? Or maybe try to do some real symbolic
> execution of the BPF code?

I think the "simple emulator" path[1] might get us a realistically large
coverage. I'm going to try it out, and see what it looks like.

-Kees

[1] https://lore.kernel.org/lkml/202006160757.99FD9B785@keescook/

--
Kees Cook

2020-06-16 18:40:10

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <[email protected]> wrote:
> On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> > Wouldn't it be simpler to use a function that can run a subset of
> > seccomp cBPF and bails out on anything that indicates that a syscall's
> > handling is complex or on instructions it doesn't understand? For
> > syscalls that have a fixed policy, a typical seccomp filter doesn't
> > even use any of the BPF_ALU ops, the scratch space, or the X register;
> > it just uses something like the following set of operations, which is
> > easy to emulate without much code:
> >
> > BPF_LD | BPF_W | BPF_ABS
> > BPF_JMP | BPF_JEQ | BPF_K
> > BPF_JMP | BPF_JGE | BPF_K
> > BPF_JMP | BPF_JGT | BPF_K
> > BPF_JMP | BPF_JA
> > BPF_RET | BPF_K
>
> Initially, I started down this path. It needed a bit of plumbing into
> BPF to better control the lifetime of the cBPF "saved original filter"
> (normally kept for CHECKPOINT_RESTORE users)

I don't think you need that? When a filter is added, you can compute
the results of the added individual filter, and then merge the state.

> and then I needed to keep
> making exceptions (same list you have: ALU, X register, scratch, etc)
> in the name of avoiding too much complexity in the emulator. I decided
> I'd rather reuse the existing infrastructure to actually execute the
> filter (no cBPF copy needed to be saved, no separate code, and full
> instruction coverage).

If you really think that this bit of emulation is so bad, you could
also make a copy of the BPF filter in which you replace all load
instructions from syscall arguments with "return NON_CONSTANT_RESULT",
and then run that through the normal BPF infrastructure.
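
Roughly like this (untested; the sentinel name is made up):

	/* Sentinel "action": result depends on program state. */
	#define NON_CONSTANT_RESULT 0xffffffffU

	static void neuter_arg_loads(struct sock_filter *copy, unsigned short len)
	{
		int pc;

		for (pc = 0; pc < len; pc++) {
			struct sock_filter *insn = &copy[pc];

			/* Any absolute load past "nr"/"arch" can't be constant. */
			if (BPF_CLASS(insn->code) == BPF_LD &&
			    BPF_MODE(insn->code) == BPF_ABS &&
			    insn->k != offsetof(struct seccomp_data, nr) &&
			    insn->k != offsetof(struct seccomp_data, arch)) {
				insn->code = BPF_RET | BPF_K;
				insn->k = NON_CONSTANT_RESULT;
			}
		}
	}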

> > Something like (completely untested):
[...]
> I didn't actually finish going down the emulator path (I stopped right
> around the time I verified that libseccomp does use BPF_ALU -- though
> only BPF_AND), so I didn't actually evaluate the filter contents for other
> filter builders (e.g. Chrome).
>
> But, if BPF_ALU | BPF_AND were added to your code above, it would cover
> everything libseccomp generates (which covers a lot of the seccomp
> filters, e.g. systemd, docker). I just felt funny about an "incomplete"
> emulator.
>
> Though now you've got me looking. It seems this is the core
> of Chrome's BPF instruction generation:
> https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
> It also uses ALU|AND, but adds JMP|JSET.
>
> So... that's only 2 more instructions to cover what I think are likely
> the two largest seccomp instruction generators.
>
> > That way, you won't need any of this complicated architecture-specific stuff.
>
> There are two arch-specific needs, and using a cBPF-subset emulator
> just gets rid of the local TLB flush. The other part is distinguishing
> the archs. Neither requirement is onerous (TLB flush usually just
> needs little more than an extern, arch is already documented in the
> per-arch syscall_get_arch()).

But it's also somewhat layer-breaking and reliant on very specific
assumptions. Normal kernel code doesn't mess around with page table
magic, outside of very specific low-level things. And your method
would break if the fixed-value members were not all packed together at
the start of the structure.


And from a hardening perspective: The more code we add that fiddles
around with PTEs directly, rather than going through higher-level
abstractions, the higher the chance that something gets horribly
screwed up. For example, this bit from your patch looks *really*
suspect:

+ preempt_disable();
+ set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
+ local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
+ preempt_enable();

First off, that set_pte_at() is just a memory write; I don't see why
you put it inside a preempt_disable() region.
But more importantly, sticking a local TLB flush inside a
preempt_disable() region with nothing else in there looks really
shady. How is that supposed to work? If we migrate from CPU0 to CPU1
directly before this region, and then from CPU1 back to CPU0 directly
afterwards, the local TLB flush will have no effect.

2020-06-16 18:42:25

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 6/8] x86: Provide API for local kernel TLB flushing

On Tue, Jun 16, 2020 at 09:59:29AM -0700, Andy Lutomirski wrote:
> On Tue, Jun 16, 2020 at 12:49 AM Kees Cook <[email protected]> wrote:
> >
> > The seccomp constant action bitmap filter evaluation routine depends
> > on being able to quickly clear the PTE "accessed" bit for a temporary
> > allocation. Provide access to the existing CPU-local kernel memory TLB
> > flushing routines.
>
> Can you write a better justification? Also, unless I'm just

Er, dunno? That's the entire reason this series needs it.

> incompetent this morning, I can't find anyone calling this in the
> series.

It's in patch 4, seccomp_update_bitmap():
https://lore.kernel.org/lkml/[email protected]/

--
Kees Cook

2020-06-16 18:51:54

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps

On Tue, Jun 16, 2020 at 08:36:28PM +0200, Jann Horn wrote:
> On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <[email protected]> wrote:
> > On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
> > > Wouldn't it be simpler to use a function that can run a subset of
> > > seccomp cBPF and bails out on anything that indicates that a syscall's
> > > handling is complex or on instructions it doesn't understand? For
> > > syscalls that have a fixed policy, a typical seccomp filter doesn't
> > > even use any of the BPF_ALU ops, the scratch space, or the X register;
> > > it just uses something like the following set of operations, which is
> > > easy to emulate without much code:
> > >
> > > BPF_LD | BPF_W | BPF_ABS
> > > BPF_JMP | BPF_JEQ | BPF_K
> > > BPF_JMP | BPF_JGE | BPF_K
> > > BPF_JMP | BPF_JGT | BPF_K
> > > BPF_JMP | BPF_JA
> > > BPF_RET | BPF_K
> >
> > Initially, I started down this path. It needed a bit of plumbing into
> > BPF to better control the lifetime of the cBPF "saved original filter"
> > (normally kept for CHECKPOINT_RESTORE users)
>
> I don't think you need that? When a filter is added, you can compute
> the results of the added individual filter, and then merge the state.

That's what I thought too, but unfortunately not (unless I missed
something) -- the seccomp verifier is run as a callback from the BPF
internals, so seccomp only sees what the user sends (which is unverified)
and the final eBPF filter. There isn't state I can attach during the
callback, so I opted to just do the same thing as CHECKPOINT_RESTORE,
but to then explicitly free the cBPF after bitmap generation.
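
i.e. roughly (simplified; the final free helper is hypothetical -- today
the saved filter is only dropped when the prog is destroyed):

	/* Always keep the cBPF copy, as CHECKPOINT_RESTORE does. */
	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
					seccomp_check_filter, true);
	...
	seccomp_update_bitmap(sfilter);			/* walk the saved cBPF */
	seccomp_release_orig_filter(sfilter->prog);	/* hypothetical */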

> > and then I needed to keep
> > making exceptions (same list you have: ALU, X register, scratch, etc)
> > in the name of avoiding too much complexity in the emulator. I decided
> > I'd rather reuse the existing infrastructure to actually execute the
> > filter (no cBPF copy needed to be saved, no separate code, and full
> > instruction coverage).
>
> If you really think that this bit of emulation is so bad, you could
> also make a copy of the BPF filter in which you replace all load
> instructions from syscall arguments with "return NON_CONSTANT_RESULT",
> and then run that through the normal BPF infrastructure.
>
> > > Something like (completely untested):
> [...]
> > I didn't actually finish going down the emulator path (I stopped right
> > around the time I verified that libseccomp does use BPF_ALU -- though
> > only BPF_AND), so I didn't actually evaluate the filter contents for other
> > filter builders (e.g. Chrome).
> >
> > But, if BPF_ALU | BPF_AND were added to your code above, it would cover
> > everything libseccomp generates (which covers a lot of the seccomp
> > filters, e.g. systemd, docker). I just felt funny about an "incomplete"
> > emulator.
> >
> > Though now you've got me looking. It seems this is the core
> > of Chrome's BPF instruction generation:
> > https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
> > It also uses ALU|AND, but adds JMP|JSET.
> >
> > So... that's only 2 more instructions to cover what I think are likely
> > the two largest seccomp instruction generators.
> >
> > > That way, you won't need any of this complicated architecture-specific stuff.
> >
> > There are two arch-specific needs, and using a cBPF-subset emulator
> > just gets rid of the local TLB flush. The other part is distinguishing
> > the archs. Neither requirement is onerous (TLB flush usually just
> > needs little more than an extern, arch is already documented in the
> > per-arch syscall_get_arch()).
>
> But it's also somewhat layer-breaking and reliant on very specific
> assumptions. Normal kernel code doesn't mess around with page table
> magic, outside of very specific low-level things. And your method
> would break if the fixed-value members were not all packed together at
> the start of the structure.

Right -- that was lucky. I suspect the emulation route will win out
here.

> And from a hardening perspective: The more code we add that fiddles
> around with PTEs directly, rather than going through higher-level
> abstractions, the higher the chance that something gets horribly
> screwed up. For example, this bit from your patch looks *really*
> suspect:
>
> + preempt_disable();
> + set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
> + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> + preempt_enable();
>
> First off, that set_pte_at() is just a memory write; I don't see why
> you put it inside a preempt_disable() region.
> But more importantly, sticking a local TLB flush inside a
> preempt_disable() region with nothing else in there looks really
> shady. How is that supposed to work? If we migrate from CPU0 to CPU1
> directly before this region, and then from CPU1 back to CPU0 directly
> afterwards, the local TLB flush will have no effect.

Yeah, true, that's another good reason not to do this.

--
Kees Cook

2020-06-16 21:17:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/8] seccomp: Implement constant action bitmaps



> On Jun 16, 2020, at 11:36 AM, Jann Horn <[email protected]> wrote:
>
> On Tue, Jun 16, 2020 at 5:49 PM Kees Cook <[email protected]> wrote:
>>> On Tue, Jun 16, 2020 at 02:14:47PM +0200, Jann Horn wrote:
>>> Wouldn't it be simpler to use a function that can run a subset of
>>> seccomp cBPF and bails out on anything that indicates that a syscall's
>>> handling is complex or on instructions it doesn't understand? For
>>> syscalls that have a fixed policy, a typical seccomp filter doesn't
>>> even use any of the BPF_ALU ops, the scratch space, or the X register;
>>> it just uses something like the following set of operations, which is
>>> easy to emulate without much code:
>>>
>>> BPF_LD | BPF_W | BPF_ABS
>>> BPF_JMP | BPF_JEQ | BPF_K
>>> BPF_JMP | BPF_JGE | BPF_K
>>> BPF_JMP | BPF_JGT | BPF_K
>>> BPF_JMP | BPF_JA
>>> BPF_RET | BPF_K
>>
>> Initially, I started down this path. It needed a bit of plumbing into
>> BPF to better control the lifetime of the cBPF "saved original filter"
>> (normally kept for CHECKPOINT_RESTORE users)
>
> I don't think you need that? When a filter is added, you can compute
> the results of the added individual filter, and then merge the state.
>
>> and then I needed to keep
>> making exceptions (same list you have: ALU, X register, scratch, etc)
>> in the name of avoiding too much complexity in the emulator. I decided
>> I'd rather reuse the existing infrastructure to actually execute the
>> filter (no cBPF copy needed to be saved, no separate code, and full
>> instruction coverage).
>
> If you really think that this bit of emulation is so bad, you could
> also make a copy of the BPF filter in which you replace all load
> instructions from syscall arguments with "return NON_CONSTANT_RESULT",
> and then run that through the normal BPF infrastructure.
>
>>> Something like (completely untested):
> [...]
>> I didn't actually finish going down the emulator path (I stopped right
>> around the time I verified that libseccomp does use BPF_ALU -- though
>> only BPF_AND), so I didn't actually evaluate the filter contents for other
>> filter builders (e.g. Chrome).
>>
>> But, if BPF_ALU | BPF_AND were added to your code above, it would cover
>> everything libseccomp generates (which covers a lot of the seccomp
>> filters, e.g. systemd, docker). I just felt funny about an "incomplete"
>> emulator.
>>
>> Though now you've got me looking. It seems this is the core
>> of Chrome's BPF instruction generation:
>> https://github.com/chromium/chromium/blob/master/sandbox/linux/bpf_dsl/policy_compiler.cc
>> It also uses ALU|AND, but adds JMP|JSET.
>>
>> So... that's only 2 more instructions to cover what I think are likely
>> the two largest seccomp instruction generators.
>>
>>> That way, you won't need any of this complicated architecture-specific stuff.
>>
>> There are two arch-specific needs, and using a cBPF-subset emulator
>> just gets rid of the local TLB flush. The other part is distinguishing
>> the archs. Neither requirement is onerous (TLB flush usually just
>> needs little more than an extern, arch is already documented in the
>> per-arch syscall_get_arch()).
>
> But it's also somewhat layer-breaking and reliant on very specific
> assumptions. Normal kernel code doesn't mess around with page table
> magic, outside of very specific low-level things. And your method
> would break if the fixed-value members were not all packed together at
> the start of the structure.
>
>
> And from a hardening perspective: The more code we add that fiddles
> around with PTEs directly, rather than going through higher-level
> abstractions, the higher the chance that something gets horribly
> screwed up. For example, this bit from your patch looks *really*
> suspect:
>
> + preempt_disable();
> + set_pte_at(&init_mm, vaddr, ptep, pte_mkold(*(READ_ONCE(ptep))));
> + local_flush_tlb_kernel_range(vaddr, vaddr + PAGE_SIZE);
> + preempt_enable();
>
> First off, that set_pte_at() is just a memory write; I don't see why
> you put it inside a preempt_disable() region.
> But more importantly, sticking a local TLB flush inside a
> preempt_disable() region with nothing else in there looks really
> shady. How is that supposed to work? If we migrate from CPU0 to CPU1
> directly before this region, and then from CPU1 back to CPU0 directly
> afterwards, the local TLB flush will have no effect.


Indeed.

With my x86/mm maintainer hat on, this is highly questionable. Either the real API should be used, or there should be a sane API. The former will have really atrocious performance, and the latter would need some thought.

Basically, if you pin the entire process to one CPU, you can clear the accessed bit, flush, do some magic, and read it back. This is only valid if the operation is short enough that running with preemption off is reasonable. Otherwise you need to arrange to flush when you schedule in, which could be done with a voluntary-preemption style or with scheduler hooks.
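
Very roughly (helper names hypothetical, and the filter run must not
sleep):

	preempt_disable();	/* no migration: the local flush stays meaningful */
	clear_accessed_and_flush_local(vaddr, ptep);	/* hypothetical */
	ret = run_filter_once(filter, sd);		/* hypothetical; must not sleep */
	touched = pte_young(*ptep);
	preempt_enable();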

I’m not convinced this is worthwhile.

2020-06-17 15:27:52

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE

On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <[email protected]> wrote:
> For systems that provide multiple syscall maps based on architectures
> (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via CONFIG_COMPAT), allow
> a fast way to pin the process to a specific syscall mapping, instead of
> needing to generate all filters with an architecture check as the first
> filter action.

This seems reasonable; but can we maybe also add X86-specific handling
for that X32 mess? AFAIK there are four ways to do syscalls with
AUDIT_ARCH_X86_64:

1. normal x86-64 syscall, X32 bit unset (native case)
2. normal x86-64 syscall, X32 bit set (for X32 code calling syscalls
with no special X32 version)
3. x32-specific syscall, X32 bit unset (never happens legitimately)
4. x32-specific syscall, X32 bit set (for X32 code calling syscalls
with special X32 version)

(I got this wrong when I wrote the notes on x32 in the seccomp manpage...)

Can we add a flag for AUDIT_ARCH_X86_64 that says either "I want
> native x86-64" (enforcing case 1) or "I want X32" (enforcing case 2 or
4, and in case 2 checking that the syscall has no X32 equivalent)? (Of
course, if the kernel is built without X32 support, we can leave out
these extra checks.)

> +static long seccomp_pin_architecture(void)
> +{
> +#ifdef CONFIG_COMPAT
> + u32 arch = syscall_get_arch(current);
> +
> + /* How did you even get here? */
> + if (current->seccomp.arch && current->seccomp.arch != arch)
> + return -EBUSY;
> +
> + current->seccomp.arch = arch;
> +#endif
> + return 0;
> +}

Are you intentionally writing this such that SECCOMP_PIN_ARCHITECTURE
only has an effect once you've installed a filter, and propagation to
other threads happens when a filter is installed with TSYNC? I guess
that is a possible way to design the API, but it seems like something
that should at least be pointed out explicitly.

2020-06-17 15:34:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE

On Wed, Jun 17, 2020 at 8:25 AM Jann Horn <[email protected]> wrote:
>
> On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <[email protected]> wrote:
> > For systems that provide multiple syscall maps based on architectures
> > (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via CONFIG_COMPAT), allow
> > a fast way to pin the process to a specific syscall mapping, instead of
> > needing to generate all filters with an architecture check as the first
> > filter action.
>
> This seems reasonable; but can we maybe also add X86-specific handling
> for that X32 mess? AFAIK there are four ways to do syscalls with
> AUDIT_ARCH_X86_64:

You're out of date :) I fixed the mess.

commit 6365b842aae4490ebfafadfc6bb27a6d3cc54757
Author: Andy Lutomirski <[email protected]>
Date: Wed Jul 3 13:34:04 2019 -0700

x86/syscalls: Split the x32 syscalls into their own table



>
> 1. normal x86-64 syscall, X32 bit unset (native case)
> 2. normal x86-64 syscall, X32 bit set (for X32 code calling syscalls
> with no special X32 version)

Returns -ENOSYS now if an x32 version was supposed to be used.

> 3. x32-specific syscall, X32 bit unset (never happens legitimately)

Returns -ENOSYS now.

> 4. x32-specific syscall, X32 bit set (for X32 code calling syscalls
> with special X32 version)
>
> (I got this wrong when I wrote the notes on x32 in the seccomp manpage...)
>
> Can we add a flag for AUDIT_ARCH_X86_64 that says either "I want
> > native x86-64" (enforcing case 1) or "I want X32" (enforcing case 2 or
> 4, and in case 2 checking that the syscall has no X32 equivalent)? (Of
> course, if the kernel is built without X32 support, we can leave out
> these extra checks.)

No extra checks needed. Trying to do a syscall with a wrongly-encoded
x32 nr just generates -ENOSYS now.

Henceforth, all new syscalls will have the same number for native and
x32 and will differ only in the presence of the x32 bit.
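
With the split tables, a checker only needs the bit test; illustrative
helper (not from the commit):

	static inline bool is_x32_syscall_nr(unsigned int nr)
	{
		return nr & __X32_SYSCALL_BIT;	/* 0x40000000 */
	}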

--Andy

2020-06-17 15:36:46

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 3/8] seccomp: Introduce SECCOMP_PIN_ARCHITECTURE

On Wed, Jun 17, 2020 at 5:30 PM Andy Lutomirski <[email protected]> wrote:
>
> On Wed, Jun 17, 2020 at 8:25 AM Jann Horn <[email protected]> wrote:
> >
> > On Tue, Jun 16, 2020 at 9:49 AM Kees Cook <[email protected]> wrote:
> > > For systems that provide multiple syscall maps based on architectures
> > > (e.g. AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 via CONFIG_COMPAT), allow
> > > a fast way to pin the process to a specific syscall mapping, instead of
> > > needing to generate all filters with an architecture check as the first
> > > filter action.
> >
> > This seems reasonable; but can we maybe also add X86-specific handling
> > for that X32 mess? AFAIK there are four ways to do syscalls with
> > AUDIT_ARCH_X86_64:
>
> You're out of date :) I fixed the mess.
>
> commit 6365b842aae4490ebfafadfc6bb27a6d3cc54757
> Author: Andy Lutomirski <[email protected]>
> Date: Wed Jul 3 13:34:04 2019 -0700
>
> x86/syscalls: Split the x32 syscalls into their own table

Oooooh, beautiful. Thank you very much for that.