Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755773Ab2BVTr2 (ORCPT ); Wed, 22 Feb 2012 14:47:28 -0500 Received: from mail-bk0-f46.google.com ([209.85.214.46]:44011 "EHLO mail-bk0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755505Ab2BVTrR convert rfc822-to-8bit (ORCPT ); Wed, 22 Feb 2012 14:47:17 -0500 MIME-Version: 1.0 In-Reply-To: <38d58caa17befe422065efe5dc451a34.squirrel@webmail.greenhost.nl> References: <1329845435-2313-1-git-send-email-wad@chromium.org> <1329845435-2313-5-git-send-email-wad@chromium.org> <38d58caa17befe422065efe5dc451a34.squirrel@webmail.greenhost.nl> Date: Wed, 22 Feb 2012 13:47:16 -0600 Message-ID: Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF From: Will Drewry To: Indan Zupancic Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-doc@vger.kernel.org, kernel-hardening@lists.openwall.com, netdev@vger.kernel.org, x86@kernel.org, arnd@arndb.de, davem@davemloft.net, hpa@zytor.com, mingo@redhat.com, oleg@redhat.com, peterz@infradead.org, rdunlap@xenotime.net, mcgrathr@chromium.org, tglx@linutronix.de, luto@mit.edu, eparis@redhat.com, serge.hallyn@canonical.com, djm@mindrot.org, scarybeasts@gmail.com, pmoore@redhat.com, akpm@linux-foundation.org, corbet@lwn.net, eric.dumazet@gmail.com, markus@chromium.org, keescook@chromium.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 32450 Lines: 877 On Wed, Feb 22, 2012 at 2:19 AM, Indan Zupancic wrote: > Hello, > > On Tue, February 21, 2012 18:30, Will Drewry wrote: >> [This patch depends on luto@mit.edu's no_new_privs patch: >> https://lkml.org/lkml/2012/1/30/264 >> ] >> >> This patch adds support for seccomp mode 2. ?Mode 2 introduces the >> ability for unprivileged processes to install system call filtering >> policy expressed in terms of a Berkeley Packet Filter (BPF) program. >> This program will be evaluated in the kernel for each system call >> the task makes and computes a result based on data in the format >> of struct seccomp_data. >> >> A filter program may be installed by calling: >> ?struct sock_fprog fprog = { ... }; >> ?... >> ?prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog); >> >> The return value of the filter program determines if the system call is >> allowed to proceed or denied. ?If the first filter program installed >> allows prctl(2) calls, then the above call may be made repeatedly >> by a task to further reduce its access to the kernel. ?All attached >> programs must be evaluated before a system call will be allowed to >> proceed. >> >> Filter programs will be inherited across fork/clone and execve. >> However, if the task attaching the filter is unprivileged >> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. ?This >> ensures that unprivileged tasks cannot attach filters that affect >> privileged tasks (e.g., setuid binary). >> >> There are a number of benefits to this approach. A few of which are >> as follows: >> - BPF has been exposed to userland for a long time >> - BPF optimization (and JIT'ing) are well understood >> - Userland already knows its ABI: system call numbers and desired >> ?arguments >> - No time-of-check-time-of-use vulnerable data accesses are possible. >> - system call arguments are loaded on access only to minimize copying >> ?required for system call policy decisions. >> >> Mode 2 support is restricted to architectures that enable >> HAVE_ARCH_SECCOMP_FILTER. ?In this patch, the primary dependency is on >> syscall_get_arguments(). ?The full desired scope of this feature will >> add a few minor additional requirements expressed later in this series. >> Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be >> the desired additional functionality. >> >> No architectures are enabled in this patch. >> >> v10: - seccomp_data has changed again to be more aesthetically pleasing >> ? ? ? (hpa@zytor.com) >> ? ? - calling convention is noted in a new u32 field using syscall_get_arch. >> ? ? ? This allows for cross-calling convention tasks to use seccomp filters. >> ? ? ? (hpa@zytor.com) > > I highly disagree with every filter having to check the mode: Filters that > don't check the arch on e.g. x86 are buggy, so they have to check it, even > if it's a 32-bit or 64-bit only system, the filters can't know that and > needs to check the arch at every syscall entry. All other info in the data > depends on the arch, because of this there isn't much code to share between > the two archs, so you can as well have one filter for each arch. > > Alternative approach: Tell the arch at filter install time and only run the > filters with the same arch as the current system call. If no filters are run, > deny the systemcall. This was roughly how I first implemented compat and non-compat support. It causes some implicit behavior across inheritance that is not nice though. > Advantages: > > - Filters don't have to check the arch every syscall entry. This I like. > - Secure by default. Filters don't have to do anything arch specific to > ?be secure, no surprises possible. This is partially true, but it is exactly why I hid compat before. > - If a new arch comes into existence, there is no chance of old filters > ?becoming buggy and insecure. This is especially true for archs that > ?had only one mode, but added another one later on: Old filters had no > ?need to check the mode at all. Perhaps. A buggy filter that works on x86-64 might be exposed on a new x32 ABI. It's hard to predict how audit_arch and the syscall abi will develop with new platforms. > - For kernels supporting only one arch, the check can be optimised away, > ?by not installing unsupported arch filters at all. Somewhat. Without having a dedicated arch helper, you'd have to guess that arches only support one or two arches (if compat is supported), but I don't know if that is a safe assumption to make. > It's more secure, faster and simpler for the filters. > > If something like this is implemented it's fine to expose the arch info > in the syscall data too, and have a way to install filters for all archs, > for the few cases where that might be useful, although I can't think of > any reason why people would like to do unnecessary work in the filters. It seems to just add complexity to support both. I think we'll probably end up with it in the filters for better or worse. Possibly JITing will be useful since at least a 32-bit load and je is pretty cheap in native instructions. > All that's needed is an extra argument to the prctl() call. I propose > 0 for the current arch, -1 for all archs and anything else to specify > the arch. Installing a filter for an unsupported arch could return > ENOEXEC. Without adding a per-arch call, there is no way to know all the supported arches at install time. Current arch, at least, can be determined with a call to syscall_get_arch(). As is, I'm not sure it makes sense to try to reserve two extra input types: 0 and -1. 0 would be sane to treat as either a wildcard or current because it is unlikely to be used by AUDIT_ARCH_* ever since EM_NONE is assigned to 0. However, I have no such insight into whether it will ever be possible to compose 0xffffffff as an AUDIT_ARCH_. > As far as the implementation goes, either have a list per supported arch > or store the arch per filter and check that before running the filter. You can't do it per arch without adding even more per-arch dependencies. Keeping them annotated in the same list is the clearest way I've seen so far, but it comes with its own burdens. >> ? ? - lots of clean up (thanks, Indan!) >> v9: - n/a >> v8: - use bpf_chk_filter, bpf_run_filter. update load_fns >> ? ? - Lots of fixes courtesy of indan@nul.nu: >> ? ? -- fix up load behavior, compat fixups, and merge alloc code, >> ? ? -- renamed pc and dropped __packed, use bool compat. >> ? ? -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch >> ? ? ? ?dependencies >> v7: ?(massive overhaul thanks to Indan, others) >> ? ? - added CONFIG_HAVE_ARCH_SECCOMP_FILTER >> ? ? - merged into seccomp.c >> ? ? - minimal seccomp_filter.h >> ? ? - no config option (part of seccomp) >> ? ? - no new prctl >> ? ? - doesn't break seccomp on systems without asm/syscall.h >> ? ? ? (works but arg access always fails) >> ? ? - dropped seccomp_init_task, extra free functions, ... >> ? ? - dropped the no-asm/syscall.h code paths >> ? ? - merges with network sk_run_filter and sk_chk_filter >> v6: - fix memory leak on attach compat check failure >> ? ? - require no_new_privs || CAP_SYS_ADMIN prior to filter >> ? ? ? installation. (luto@mit.edu) >> ? ? - s/seccomp_struct_/seccomp_/ for macros/functions (amwang@redhat.com) >> ? ? - cleaned up Kconfig (amwang@redhat.com) >> ? ? - on block, note if the call was compat (so the # means something) >> v5: - uses syscall_get_arguments >> ? ? ? (indan@nul.nu,oleg@redhat.com, mcgrathr@chromium.org) >> ? ? ?- uses union-based arg storage with hi/lo struct to >> ? ? ? ?handle endianness. ?Compromises between the two alternate >> ? ? ? ?proposals to minimize extra arg shuffling and account for >> ? ? ? ?endianness assuming userspace uses offsetof(). >> ? ? ? ?(mcgrathr@chromium.org, indan@nul.nu) >> ? ? ?- update Kconfig description >> ? ? ?- add include/seccomp_filter.h and add its installation >> ? ? ?- (naive) on-demand syscall argument loading >> ? ? ?- drop seccomp_t (eparis@redhat.com) >> v4: ?- adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS >> ? ? ?- now uses current->no_new_privs >> ? ? ? ?(luto@mit.edu,torvalds@linux-foundation.com) >> ? ? ?- assign names to seccomp modes (rdunlap@xenotime.net) >> ? ? ?- fix style issues (rdunlap@xenotime.net) >> ? ? ?- reworded Kconfig entry (rdunlap@xenotime.net) >> v3: ?- macros to inline (oleg@redhat.com) >> ? ? ?- init_task behavior fixed (oleg@redhat.com) >> ? ? ?- drop creator entry and extra NULL check (oleg@redhat.com) >> ? ? ?- alloc returns -EINVAL on bad sizing (serge.hallyn@canonical.com) >> ? ? ?- adds tentative use of "always_unprivileged" as per >> ? ? ? ?torvalds@linux-foundation.org and luto@mit.edu >> v2: ?- (patch 2 only) >> >> Signed-off-by: Will Drewry >> --- >> arch/Kconfig ? ? ? ? ? ?| ? 18 +++ >> include/linux/Kbuild ? ?| ? ?1 + >> include/linux/seccomp.h | ? 76 +++++++++++- >> kernel/fork.c ? ? ? ? ? | ? ?3 + >> kernel/seccomp.c ? ? ? ?| ?321 ++++++++++++++++++++++++++++++++++++++++++++--- >> kernel/sys.c ? ? ? ? ? ?| ? ?2 +- >> 6 files changed, 399 insertions(+), 22 deletions(-) >> >> diff --git a/arch/Kconfig b/arch/Kconfig >> index 4f55c73..8150fa2 100644 >> --- a/arch/Kconfig >> +++ b/arch/Kconfig >> @@ -199,4 +199,22 @@ config HAVE_CMPXCHG_LOCAL >> config HAVE_CMPXCHG_DOUBLE >> ? ? ? bool >> >> +config HAVE_ARCH_SECCOMP_FILTER >> + ? ? bool >> + ? ? help >> + ? ? ? This symbol should be selected by an architecure if it provides >> + ? ? ? asm/syscall.h, specifically syscall_get_arguments() and >> + ? ? ? syscall_get_arch(). >> + >> +config SECCOMP_FILTER >> + ? ? def_bool y >> + ? ? depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET >> + ? ? help >> + ? ? ? Enable tasks to build secure computing environments defined >> + ? ? ? in terms of Berkeley Packet Filter programs which implement >> + ? ? ? task-defined system call filtering polices. >> + >> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more >> + ? ? ? information on the topic of seccomp filtering. > > The last part is redundant, the topic is clear. I'll drop it. >> + >> source "kernel/gcov/Kconfig" >> diff --git a/include/linux/Kbuild b/include/linux/Kbuild >> index c94e717..d41ba12 100644 >> --- a/include/linux/Kbuild >> +++ b/include/linux/Kbuild >> @@ -330,6 +330,7 @@ header-y += scc.h >> header-y += sched.h >> header-y += screen_info.h >> header-y += sdla.h >> +header-y += seccomp.h >> header-y += securebits.h >> header-y += selinux_netlink.h >> header-y += sem.h >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >> index d61f27f..001f883 100644 >> --- a/include/linux/seccomp.h >> +++ b/include/linux/seccomp.h >> @@ -1,14 +1,67 @@ >> #ifndef _LINUX_SECCOMP_H >> #define _LINUX_SECCOMP_H >> >> +#include >> +#include >> + >> + >> +/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, ) */ >> +#define SECCOMP_MODE_DISABLED ? ? ? ?0 /* seccomp is not in use. */ >> +#define SECCOMP_MODE_STRICT ?1 /* uses hard-coded filter. */ >> +#define SECCOMP_MODE_FILTER ?2 /* uses user-supplied filter. */ >> + >> +/* >> + * BPF programs may return a 32-bit value. > > They have to return a 32-bit value, no "may" about it. True. >> + * The bottom 16-bits are reserved for future use. >> + * The upper 16-bits are ordered from least permissive values to most. >> + * >> + * The ordering ensures that a min_t() over composed return values always >> + * selects the least permissive choice. >> + */ >> +#define SECCOMP_RET_KILL ? ? 0x00000000U /* kill the task immediately */ >> +#define SECCOMP_RET_ALLOW ? ?0x7fff0000U /* allow */ >> + >> +/* Masks for the return value sections. */ >> +#define SECCOMP_RET_ACTION ? 0xffff0000U >> +#define SECCOMP_RET_DATA ? ? 0x0000ffffU >> + >> +/** >> + * struct seccomp_data - the format the BPF program executes over. >> + * @args: up to 6 system call arguments. ?When the calling convention is >> + * ? ? ? ?32-bit, the arguments will still be at each args[X] offset. > > What does this mean? Do you mean the data layout will always be "LE" for > 32-bit archs? I hope not, because that would make it incompatible with > the 64-bit code for BE archs, so it will be confusing. Except if the data > layout is always LE, but then you should document that. If neither is the > case, then the comment is just confusing. Just say that the data layout > depends on the arch's endianness. I'll rephrase. I just wanted to call out that the argument values will always be treated as a 64-bit value even if the calling convention is 32-bit. This doesn't matter for LE systems, except to acknowledge the padding, but for BE systems they might load the wrong half. >> + * @instruction_pointer: at the time of the system call. >> + * @arch: indicates system call convention as an AUDIT_ARCH_* value >> + * ? ? ? ?as defined in . >> + * @nr: the system call number >> + */ >> +struct seccomp_data { >> + ? ? __u64 args[6]; >> + ? ? __u64 instruction_pointer; >> + ? ? __u32 arch; >> + ? ? int nr; >> +}; > > I agree this looks a hell of a lot nicer. I just hope it's worth it. > Oh well, a bit more ugliness in userspace to make the kernel code a > bit nicer isn't too bad. Just document the endianness issue properly. > > What use is the instruction pointer considering it tells nothing about > the call path? > >> >> +#ifdef __KERNEL__ >> #ifdef CONFIG_SECCOMP >> >> #include >> #include >> >> +struct seccomp_filter; >> +/** >> + * struct seccomp - the state of a seccomp'ed process >> + * >> + * @mode: ?indicates one of the valid values above for controlled >> + * ? ? ? ? system calls available to a process. >> + * @filter: The metadata and ruleset for determining what system calls >> + * ? ? ? ? ?are allowed for a task. >> + * >> + * ? ? ? ? ?@filter must only be accessed from the context of current as there >> + * ? ? ? ? ?is no locking. >> + */ >> struct seccomp { >> ? ? ? int mode; >> + ? ? struct seccomp_filter *filter; >> }; >> >> extern void __secure_computing(int); >> @@ -19,7 +72,7 @@ static inline void secure_computing(int this_syscall) >> } >> >> extern long prctl_get_seccomp(void); >> -extern long prctl_set_seccomp(unsigned long); >> +extern long prctl_set_seccomp(unsigned long, char __user *); >> >> static inline int seccomp_mode(struct seccomp *s) >> { >> @@ -31,15 +84,16 @@ static inline int seccomp_mode(struct seccomp *s) >> #include >> >> struct seccomp { }; >> +struct seccomp_filter { }; >> >> -#define secure_computing(x) do { } while (0) >> +#define secure_computing(x) 0 >> >> static inline long prctl_get_seccomp(void) >> { >> ? ? ? return -EINVAL; >> } >> >> -static inline long prctl_set_seccomp(unsigned long arg2) >> +static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3) >> { >> ? ? ? return -EINVAL; >> } >> @@ -48,7 +102,21 @@ static inline int seccomp_mode(struct seccomp *s) >> { >> ? ? ? return 0; >> } >> - >> #endif /* CONFIG_SECCOMP */ >> >> +#ifdef CONFIG_SECCOMP_FILTER >> +extern void put_seccomp_filter(struct seccomp_filter *); >> +extern void copy_seccomp(struct seccomp *child, >> + ? ? ? ? ? ? ? ? ? ? ?const struct seccomp *parent); > > This is 80 chars long, why break it up? Please, stop your bad habit of > breaking up (slightly too) long lines. > >> +#else ?/* CONFIG_SECCOMP_FILTER */ >> +/* The macro consumes the ->filter reference. */ >> +#define put_seccomp_filter(_s) do { } while (0) >> + >> +static inline void copy_seccomp(struct seccomp *child, >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? const struct seccomp *prev) >> +{ >> + ? ? return; >> +} >> +#endif /* CONFIG_SECCOMP_FILTER */ >> +#endif /* __KERNEL__ */ >> #endif /* _LINUX_SECCOMP_H */ >> diff --git a/kernel/fork.c b/kernel/fork.c >> index b77fd55..a5187b7 100644 >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -34,6 +34,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk) >> ? ? ? free_thread_info(tsk->stack); >> ? ? ? rt_mutex_debug_task_free(tsk); >> ? ? ? ftrace_graph_exit_task(tsk); >> + ? ? put_seccomp_filter(tsk->seccomp.filter); > > So that's why you use macro's sometimes, to make it compile with > CONFIG_SECCOMP disabled where there is no seccomp.filter. Exactly! >> ? ? ? free_task_struct(tsk); >> } >> EXPORT_SYMBOL(free_task); >> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, >> ? ? ? ? ? ? ? goto fork_out; >> >> ? ? ? ftrace_graph_init_task(p); >> + ? ? copy_seccomp(&p->seccomp, ¤t->seccomp); >> >> ? ? ? rt_mutex_init_task(p); >> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> index e8d76c5..0043b7e 100644 >> --- a/kernel/seccomp.c >> +++ b/kernel/seccomp.c >> @@ -3,16 +3,287 @@ >> ?* >> ?* Copyright 2004-2005 ?Andrea Arcangeli >> ?* >> - * This defines a simple but solid secure-computing mode. >> + * Copyright (C) 2012 Google, Inc. >> + * Will Drewry >> + * >> + * This defines a simple but solid secure-computing facility. >> + * >> + * Mode 1 uses a fixed list of allowed system calls. >> + * Mode 2 allows user-defined system call filters in the form >> + * ? ? ? ?of Berkeley Packet Filters/Linux Socket Filters. >> ?*/ >> >> +#include >> #include >> -#include >> -#include >> #include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include >> +#include >> >> /* #define SECCOMP_DEBUG 1 */ >> -#define NR_SECCOMP_MODES 1 >> + >> +#ifdef CONFIG_SECCOMP_FILTER >> +/** >> + * struct seccomp_filter - container for seccomp BPF programs >> + * >> + * @usage: reference count to manage the object liftime. >> + * ? ? ? ? get/put helpers should be used when accessing an instance >> + * ? ? ? ? outside of a lifetime-guarded section. ?In general, this >> + * ? ? ? ? is only needed for handling filters shared across tasks. >> + * @prev: points to a previously installed, or inherited, filter >> + * @compat: indicates the value of is_compat_task() at creation time > > You're not really using 'compat', except for logging. Oops - I should've dropped it altogether! > But you could use it to run only the filters with the right arch. Well an int arch would do the trick. >> + * @insns: the BPF program instructions to evaluate >> + * @len: the number of instructions in the program >> + * >> + * seccomp_filter objects are organized in a tree linked via the @prev >> + * pointer. ?For any task, it appears to be a singly-linked list starting >> + * with current->seccomp.filter, the most recently attached or inherited filter. >> + * However, multiple filters may share a @prev node, by way of fork(), which >> + * results in a unidirectional tree existing in memory. ?This is similar to >> + * how namespaces work. >> + * >> + * seccomp_filter objects should never be modified after being attached >> + * to a task_struct (other than @usage). >> + */ >> +struct seccomp_filter { >> + ? ? atomic_t usage; >> + ? ? struct seccomp_filter *prev; >> + ? ? bool compat; >> + ? ? unsigned short len; ?/* Instruction count */ >> + ? ? struct sock_filter insns[]; >> +}; >> + >> +static void seccomp_filter_log_failure(int syscall) >> +{ >> + ? ? int compat = 0; >> +#ifdef CONFIG_COMPAT >> + ? ? compat = is_compat_task(); >> +#endif >> + ? ? pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n", >> + ? ? ? ? ? ? current->comm, task_pid_nr(current), >> + ? ? ? ? ? ? (compat ? "compat " : ""), >> + ? ? ? ? ? ? syscall, KSTK_EIP(current)); >> +} >> + >> +/** >> + * get_u32 - returns a u32 offset into data >> + * @data: a unsigned 64 bit value >> + * @index: 0 or 1 to return the first or second 32-bits >> + * >> + * This inline exists to hide the length of unsigned long. >> + * If a 32-bit unsigned long is passed in, it will be extended >> + * and the top 32-bits will be 0. If it is a 64-bit unsigned >> + * long, then whatever data is resident will be properly returned. >> + */ >> +static inline u32 get_u32(u64 data, int index) >> +{ >> + ? ? return ((u32 *)&data)[index]; >> +} >> + >> +/* Helper for bpf_load below. */ >> +#define BPF_DATA(_name) offsetof(struct seccomp_data, _name) >> +/** >> + * bpf_load: checks and returns a pointer to the requested offset >> + * @nr: int syscall passed as a void * to bpf_run_filter >> + * @off: index into struct seccomp_data to load from >> + * @size: load width requested >> + * @buffer: temporary storage supplied by bpf_run_filter >> + * >> + * Returns a pointer to @buffer where the value was stored. >> + * On failure, returns NULL. >> + */ >> +static void *bpf_load(const void *nr, int off, unsigned int size, void *buf) >> +{ >> + ? ? unsigned long value; >> + ? ? u32 *A = buf; >> + >> + ? ? if (size != sizeof(u32)) >> + ? ? ? ? ? ? return NULL; >> + >> + ? ? if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) { >> + ? ? ? ? ? ? struct pt_regs *regs = task_pt_regs(current); >> + ? ? ? ? ? ? int arg = off >> 3; ?/* args[0] is at offset 0. */ > > Probably clearer if you just do off / 8, you can count on compilers to > get that right and turn it into a shift. > >> + ? ? ? ? ? ? int index = (off % sizeof(u64)) ? 1 : 0; > > Considering the previous line I expected to see (off & 4). > > Anyway, this code mostly ignores the lowest three bits and instead of > either returning an error or the requested data, it returns the aligned > value instead. Not good. Hrm true - I got sloppy. I'll fix that up! >> + ? ? ? ? ? ? syscall_get_arguments(current, regs, arg, 1, &value); >> + ? ? ? ? ? ? *A = get_u32(value, index); >> + ? ? } else if (off == BPF_DATA(nr)) { >> + ? ? ? ? ? ? *A = (u32)(uintptr_t)nr; >> + ? ? } else if (off == BPF_DATA(arch)) { >> + ? ? ? ? ? ? struct pt_regs *regs = task_pt_regs(current); >> + ? ? ? ? ? ? *A = syscall_get_arch(current, regs); >> + ? ? } else if (off == BPF_DATA(instruction_pointer)) { >> + ? ? ? ? ? ? *A = get_u32(KSTK_EIP(current), 0); >> + ? ? } else if (off == BPF_DATA(instruction_pointer) + sizeof(u32)) { >> + ? ? ? ? ? ? *A = get_u32(KSTK_EIP(current), 1); >> + ? ? } else { >> + ? ? ? ? ? ? return NULL; >> + ? ? } >> + ? ? return buf; >> +} >> + >> +/** >> + * seccomp_run_filters - evaluates all seccomp filters against @syscall >> + * @syscall: number of the current system call >> + * >> + * Returns valid seccomp BPF response codes. >> + */ >> +static u32 seccomp_run_filters(int syscall) >> +{ >> + ? ? struct seccomp_filter *f; >> + ? ? u32 ret = SECCOMP_RET_KILL; >> + ? ? static const struct bpf_load_fn fns = { >> + ? ? ? ? ? ? bpf_load, >> + ? ? ? ? ? ? sizeof(struct seccomp_data), > > I suppose this could be used to check for new fields if struct seccomp_data > ever gets extended in the future. Yeah since the only other indicator might be @arch. >> + ? ? }; >> + ? ? const void *sc_ptr = (const void *)(uintptr_t)syscall; >> + >> + ? ? /* >> + ? ? ?* All filters are evaluated in order of youngest to oldest. The lowest >> + ? ? ?* BPF return value always takes priority. >> + ? ? ?*/ >> + ? ? for (f = current->seccomp.filter; f; f = f->prev) { >> + ? ? ? ? ? ? ret = bpf_run_filter(sc_ptr, f->insns, &fns); >> + ? ? ? ? ? ? if (ret != SECCOMP_RET_ALLOW) >> + ? ? ? ? ? ? ? ? ? ? break; >> + ? ? } >> + ? ? return ret; >> +} >> + >> +/** >> + * seccomp_attach_filter: Attaches a seccomp filter to current. >> + * @fprog: BPF program to install >> + * >> + * Returns 0 on success or an errno on failure. >> + */ >> +static long seccomp_attach_filter(struct sock_fprog *fprog) >> +{ >> + ? ? struct seccomp_filter *filter; >> + ? ? unsigned long fp_size = fprog->len * sizeof(struct sock_filter); >> + ? ? long ret; >> + >> + ? ? if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) >> + ? ? ? ? ? ? return -EINVAL; >> + >> + ? ? /* Allocate a new seccomp_filter */ >> + ? ? filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL); >> + ? ? if (!filter) >> + ? ? ? ? ? ? return -ENOMEM; >> + ? ? atomic_set(&filter->usage, 1); >> + ? ? filter->len = fprog->len; >> + >> + ? ? /* Copy the instructions from fprog. */ >> + ? ? ret = -EFAULT; >> + ? ? if (copy_from_user(filter->insns, fprog->filter, fp_size)) >> + ? ? ? ? ? ? goto out; >> + >> + ? ? /* Check the fprog */ >> + ? ? ret = bpf_chk_filter(filter->insns, filter->len, BPF_CHK_FLAGS_NO_SKB); >> + ? ? if (ret) >> + ? ? ? ? ? ? goto out; >> + >> + ? ? /* >> + ? ? ?* Installing a seccomp filter requires that the task >> + ? ? ?* have CAP_SYS_ADMIN in its namespace or be running with >> + ? ? ?* no_new_privs. ?This avoids scenarios where unprivileged >> + ? ? ?* tasks can affect the behavior of privileged children. >> + ? ? ?*/ >> + ? ? ret = -EACCES; >> + ? ? if (!current->no_new_privs && >> + ? ? ? ? security_capable_noaudit(current_cred(), current_user_ns(), >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?CAP_SYS_ADMIN) != 0) >> + ? ? ? ? ? ? goto out; >> + >> + ? ? /* >> + ? ? ?* If there is an existing filter, make it the prev >> + ? ? ?* and don't drop its task reference. >> + ? ? ?*/ >> + ? ? filter->prev = current->seccomp.filter; >> + ? ? current->seccomp.filter = filter; >> + ? ? return 0; >> +out: >> + ? ? put_seccomp_filter(filter); ?/* for get or task, on err */ >> + ? ? return ret; >> +} >> + >> +/** >> + * seccomp_attach_user_filter - attaches a user-supplied sock_fprog >> + * @user_filter: pointer to the user data containing a sock_fprog. >> + * >> + * This function may be called repeatedly to install additional filters. >> + * Every filter successfully installed will be evaluated (in reverse order) >> + * for each system call the task makes. >> + * >> + * Returns 0 on success and non-zero otherwise. >> + */ >> +long seccomp_attach_user_filter(char __user *user_filter) >> +{ >> + ? ? struct sock_fprog fprog; >> + ? ? long ret = -EFAULT; >> + >> + ? ? if (!user_filter) >> + ? ? ? ? ? ? goto out; >> +#ifdef CONFIG_COMPAT >> + ? ? if (is_compat_task()) { >> + ? ? ? ? ? ? /* XXX: Share with net/compat.c (compat_sock_fprog) */ > > Then do so as part of your BPF sharing patch. Makes sense. Queuing it up. >> + ? ? ? ? ? ? struct { >> + ? ? ? ? ? ? ? ? ? ? u16 len; >> + ? ? ? ? ? ? ? ? ? ? compat_uptr_t filter; ? /* struct sock_filter */ >> + ? ? ? ? ? ? } fprog32; >> + ? ? ? ? ? ? if (copy_from_user(&fprog32, user_filter, sizeof(fprog32))) >> + ? ? ? ? ? ? ? ? ? ? goto out; >> + ? ? ? ? ? ? fprog.len = fprog32.len; >> + ? ? ? ? ? ? fprog.filter = compat_ptr(fprog32.filter); >> + ? ? } else /* falls through to the if below. */ >> +#endif >> + ? ? if (copy_from_user(&fprog, user_filter, sizeof(fprog))) >> + ? ? ? ? ? ? goto out; >> + ? ? ret = seccomp_attach_filter(&fprog); >> +out: >> + ? ? return ret; >> +} >> + >> +/* get_seccomp_filter - increments the reference count of @orig. */ >> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig) >> +{ >> + ? ? if (!orig) >> + ? ? ? ? ? ? return NULL; >> + ? ? /* Reference count is bounded by the number of total processes. */ >> + ? ? atomic_inc(&orig->usage); >> + ? ? return orig; >> +} >> + >> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */ >> +void put_seccomp_filter(struct seccomp_filter *orig) >> +{ >> + ? ? /* Clean up single-reference branches iteratively. */ >> + ? ? while (orig && atomic_dec_and_test(&orig->usage)) { >> + ? ? ? ? ? ? struct seccomp_filter *freeme = orig; >> + ? ? ? ? ? ? orig = orig->prev; >> + ? ? ? ? ? ? kfree(freeme); >> + ? ? } >> +} >> + >> +/** >> + * copy_seccomp: manages inheritance on fork >> + * @child: forkee's seccomp >> + * @prev: forker's seccomp >> + * >> + * Ensures that @child inherits seccomp mode and state if >> + * seccomp filtering is in use. >> + */ >> +void copy_seccomp(struct seccomp *child, >> + ? ? ? ? ? ? ? const struct seccomp *prev) > > One line please. Alright :) >> +{ >> + ? ? child->mode = prev->mode; >> + ? ? child->filter = get_seccomp_filter(prev->filter); >> +} >> +#endif ? ? ? /* CONFIG_SECCOMP_FILTER */ >> >> /* >> ?* Secure computing mode 1 allows only read/write/exit/sigreturn. >> @@ -34,10 +305,10 @@ static int mode1_syscalls_32[] = { >> void __secure_computing(int this_syscall) >> { >> ? ? ? int mode = current->seccomp.mode; >> - ? ? int * syscall; >> + ? ? int *syscall; >> >> ? ? ? switch (mode) { >> - ? ? case 1: >> + ? ? case SECCOMP_MODE_STRICT: >> ? ? ? ? ? ? ? syscall = mode1_syscalls; >> #ifdef CONFIG_COMPAT >> ? ? ? ? ? ? ? if (is_compat_task()) >> @@ -48,6 +319,13 @@ void __secure_computing(int this_syscall) >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return; >> ? ? ? ? ? ? ? } while (*++syscall); >> ? ? ? ? ? ? ? break; >> +#ifdef CONFIG_SECCOMP_FILTER >> + ? ? case SECCOMP_MODE_FILTER: >> + ? ? ? ? ? ? if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW) >> + ? ? ? ? ? ? ? ? ? ? return; >> + ? ? ? ? ? ? seccomp_filter_log_failure(this_syscall); >> + ? ? ? ? ? ? break; >> +#endif >> ? ? ? default: >> ? ? ? ? ? ? ? BUG(); >> ? ? ? } >> @@ -64,25 +342,34 @@ long prctl_get_seccomp(void) >> ? ? ? return current->seccomp.mode; >> } >> >> -long prctl_set_seccomp(unsigned long seccomp_mode) >> +long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter) >> { >> - ? ? long ret; >> + ? ? long ret = -EINVAL; >> >> - ? ? /* can set it only once to be even more secure */ >> - ? ? ret = -EPERM; >> - ? ? if (unlikely(current->seccomp.mode)) >> + ? ? if (current->seccomp.mode && >> + ? ? ? ? current->seccomp.mode != seccomp_mode) > > Wouldn't it make sense to allow going from mode 2 to 1? > After all, the filter could have blocked it if it didn't > want to permit it, and mode 1 is more restrictive than > mode 2. Nope - that might allow a downgrade that bypasses write/read restrictions. E.g., a filter could only allow a read to a certain buf or of a certain size. Allowing a downgrade would allow bypassing those filters, whether they are the most sane things or not :) >> ? ? ? ? ? ? ? goto out; >> >> - ? ? ret = -EINVAL; >> - ? ? if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) { >> - ? ? ? ? ? ? current->seccomp.mode = seccomp_mode; >> - ? ? ? ? ? ? set_thread_flag(TIF_SECCOMP); >> + ? ? switch (seccomp_mode) { >> + ? ? case SECCOMP_MODE_STRICT: >> + ? ? ? ? ? ? ret = 0; >> #ifdef TIF_NOTSC >> ? ? ? ? ? ? ? disable_TSC(); >> #endif >> - ? ? ? ? ? ? ret = 0; >> + ? ? ? ? ? ? break; >> +#ifdef CONFIG_SECCOMP_FILTER >> + ? ? case SECCOMP_MODE_FILTER: >> + ? ? ? ? ? ? ret = seccomp_attach_user_filter(filter); >> + ? ? ? ? ? ? if (ret) >> + ? ? ? ? ? ? ? ? ? ? goto out; >> + ? ? ? ? ? ? break; >> +#endif >> + ? ? default: >> + ? ? ? ? ? ? goto out; >> ? ? ? } >> >> - out: >> + ? ? current->seccomp.mode = seccomp_mode; >> + ? ? set_thread_flag(TIF_SECCOMP); >> +out: >> ? ? ? return ret; >> } >> diff --git a/kernel/sys.c b/kernel/sys.c >> index 4070153..905031e 100644 >> --- a/kernel/sys.c >> +++ b/kernel/sys.c >> @@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, > unsigned long, arg3, >> ? ? ? ? ? ? ? ? ? ? ? error = prctl_get_seccomp(); >> ? ? ? ? ? ? ? ? ? ? ? break; >> ? ? ? ? ? ? ? case PR_SET_SECCOMP: >> - ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp(arg2); >> + ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp(arg2, (char __user *)arg3); >> ? ? ? ? ? ? ? ? ? ? ? break; >> ? ? ? ? ? ? ? case PR_GET_TSC: >> ? ? ? ? ? ? ? ? ? ? ? error = GET_TSC_CTL(arg2); > > Out of curiosity, did you measure the kernel size differences before and > after these patches? Would be sad if sharing it with the networking code > didn't reduce the actual kernel size. Oh yeah - it was a serious reduction. Initially, seccomp_filter.o added 8kb by itself. With the merged seccomp.o, continued code trimming (as suggested), and all the SECCOMP_RET_* variations, the total kernel growth is 2972 bytes for the same kernel config. This is shared across ~2000 bytes in seccomp.o and ~800 bytes in filter.o. Thanks! will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/