Date: Thu, 2 Jun 2011 10:36:54 -0700
From: "Paul E. McKenney"
To: Will Drewry
Cc: linux-kernel@vger.kernel.org, kees.cook@canonical.com,
	torvalds@linux-foundation.org, tglx@linutronix.de, mingo@elte.hu,
	rostedt@goodmis.org, jmorris@namei.org, Peter Zijlstra,
	Frederic Weisbecker, linux-security-module@vger.kernel.org
Subject: Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters
Message-ID: <20110602173654.GI2271@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <1306897845-9393-3-git-send-email-wad@chromium.org>
In-Reply-To: <1306897845-9393-3-git-send-email-wad@chromium.org>

On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
> This change adds a new seccomp mode which specifies the allowed system
> calls dynamically. When in the new mode (2), all system calls are
> checked against process-defined filters - first by system call number,
> then by a filter string. If an entry exists for a given system call and
> all filter predicates evaluate to true, then the task may proceed.
> Otherwise, the task is killed.

A few questions below -- I can't say that I understand the RCU usage.

							Thanx, Paul

> Filter string parsing and evaluation is handled by the ftrace filter
> engine. Related patches tweak to the perf filter trace and free
> allowing the calls to be shared. Filters inherit their understanding of
> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
> subsystem which already populates this information in syscall_metadata
> associated enter_event (and exit_event) structures. If
> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
> will be allowed.
>
> The net result is a process may have its system calls filtered using the
> ftrace filter engine's inherent understanding of system calls. The set
> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
> prctl(). For example, a filterset for a process, like pdftotext, that
> should only process read-only input could (roughly) look like:
>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>   prctl(PR_SET_SECCOMP, 2);
>
> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
> be &&'d together to ensure that attack surface may only be reduced:
>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
>
> With the earlier example, the active filter becomes:
>   "(fd == 1 || fd == 2) && fd != 2"
>
> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
> The latter returns the current filter for a system call to userspace:
>
>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
>
> while the former clears any filters for a given system call changing it
> back to a default deny:
>
>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
>
> v3: - always block execve calls (as per linus torvalds)
>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
>     - ensure compat tasks can't reach ftrace:syscalls
>     - dropped new defines for seccomp modes.
>     - two level array instead of hlists (sugg. by olof johansson)
>     - added generic Kconfig entry that is not connected.
>     - dropped internal seccomp.h
>     - move prctl helpers to seccomp_filter
>     - killed seccomp_t typedef (as per checkpatch)
> v2: - changed to use the existing syscall number ABI.
>     - prctl changes to minimize parsing in the kernel:
>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
>     - added flags
>     - provide a default fail syscall_nr_to_meta in ftrace
>     - provides fallback for unhooked system calls
>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
>     - moved to a hlist and 4 bit hash of linked lists
>     - added support to operate without CONFIG_FTRACE_SYSCALLS
>     - moved Kconfig support next to SECCOMP
>     - made Kconfig entries dependent on EXPERIMENTAL
>     - added macros to avoid ifdefs from kernel/fork.c
>     - added compat task/filter matching
>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
>     - added Filtering to "show" output
>     - added on_exec state dup'ing when enabling after a fast-path accept.
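
The &&-composition of filter strings described in the commit message
above can be sketched in userspace C. This is only an illustration under
stated assumptions, not the patch's kernel code: demo_extend_filter() and
DEMO_MAX_FILTER_LEN are hypothetical names, and the only behavior modeled
is the "(old) && new" formatting with a length check, similar in spirit to
what seccomp_extend_filter() does with snprintf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limit for this sketch; the patch uses SECCOMP_MAX_FILTER_LENGTH. */
#define DEMO_MAX_FILTER_LEN 256

/*
 * Write "(old) && extra" into out.
 * Returns 0 on success and -1 if the merged string would not fit,
 * in which case the contents of out must not be used.
 */
static int demo_extend_filter(char *out, size_t outsz,
			      const char *old, const char *extra)
{
	int n;

	assert(out && old && extra);
	n = snprintf(out, outsz, "(%s) && %s", old, extra);
	if (n < 0 || (size_t)n >= outsz)
		return -1;
	return 0;
}
```

Because the existing predicate is always wrapped in parentheses and joined
with &&, the merged filter can only match the same calls or fewer, which
is the "attack surface may only be reduced" property the description
relies on.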
>
> Signed-off-by: Will Drewry
> ---
>  include/linux/prctl.h   |    5 +
>  include/linux/sched.h   |    2 +-
>  include/linux/seccomp.h |   98 ++++++-
>  include/trace/syscall.h |    7 +
>  kernel/Makefile         |    3 +
>  kernel/fork.c           |    3 +
>  kernel/seccomp.c        |   38 ++-
>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sys.c            |   13 +-
>  security/Kconfig        |   17 +
>  10 files changed, 954 insertions(+), 16 deletions(-)
>  create mode 100644 kernel/seccomp_filter.c
>
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index a3baeb2..44723ce 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -64,6 +64,11 @@
>  #define PR_GET_SECCOMP 21
>  #define PR_SET_SECCOMP 22
>
> +/* Get/set process seccomp filters */
> +#define PR_GET_SECCOMP_FILTER 35
> +#define PR_SET_SECCOMP_FILTER 36
> +#define PR_CLEAR_SECCOMP_FILTER 37
> +
>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>  #define PR_CAPBSET_READ 23
>  #define PR_CAPBSET_DROP 24
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 18d63ce..3f0bc8d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1374,7 +1374,7 @@ struct task_struct {
>  	uid_t loginuid;
>  	unsigned int sessionid;
>  #endif
> -	seccomp_t seccomp;
> +	struct seccomp_struct seccomp;
>
>  	/* Thread group tracking */
>  	u32 parent_exec_id;
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 167c333..f4434ca 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -1,13 +1,33 @@
>  #ifndef _LINUX_SECCOMP_H
>  #define _LINUX_SECCOMP_H
>
> +struct seq_file;
>
>  #ifdef CONFIG_SECCOMP
>
> +#include
>  #include
> +#include
>  #include
>
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_filters;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:
> + *     if this is 1, the process is under standard seccomp rules
> + *             is 2, the process is only allowed to make system calls where
> + *                   associated
filters evaluate successfully. > + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER. > + * filters assignment/use should be RCU-protected and its contents > + * should never be modified when attached to a seccomp_struct. > + */ > +struct seccomp_struct { > + uint16_t mode; > +#ifdef CONFIG_SECCOMP_FILTER > + struct seccomp_filters *filters; > +#endif > +}; > > extern void __secure_computing(int); > static inline void secure_computing(int this_syscall) > @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall) > __secure_computing(this_syscall); > } > > -extern long prctl_get_seccomp(void); > extern long prctl_set_seccomp(unsigned long); > +extern long prctl_get_seccomp(void); > > #else /* CONFIG_SECCOMP */ > > #include > > -typedef struct { } seccomp_t; > - > +struct seccomp_struct { }; > #define secure_computing(x) do { } while (0) > > static inline long prctl_get_seccomp(void) > @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void) > return -EINVAL; > } > > -static inline long prctl_set_seccomp(unsigned long arg2) > +static inline long prctl_set_seccomp(unsigned long a2); > { > return -EINVAL; > } > > #endif /* CONFIG_SECCOMP */ > > +#ifdef CONFIG_SECCOMP_FILTER > + > +#define inherit_tsk_seccomp(_child, _orig) do { \ > + _child->seccomp.mode = _orig->seccomp.mode; \ > + _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \ > + } while (0) > +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters) > + > +extern int seccomp_show_filters(struct seccomp_filters *filters, > + struct seq_file *); > +extern long seccomp_set_filter(int, char *); > +extern long seccomp_clear_filter(int); > +extern long seccomp_get_filter(int, char *, unsigned long); > + > +extern long prctl_set_seccomp_filter(unsigned long, char __user *); > +extern long prctl_get_seccomp_filter(unsigned long, char __user *, > + unsigned long); > +extern long prctl_clear_seccomp_filter(unsigned long); > + > +extern struct 
seccomp_filters *get_seccomp_filters(struct seccomp_filters *); > +extern void put_seccomp_filters(struct seccomp_filters *); > + > +extern int seccomp_test_filters(int); > +extern void seccomp_filter_log_failure(int); > + > +#else /* CONFIG_SECCOMP_FILTER */ > + > +struct seccomp_filters { }; > +#define inherit_tsk_seccomp(_child, _orig) do { } while (0) > +#define put_tsk_seccomp(_tsk) do { } while (0) > + > +static inline int seccomp_show_filters(struct seccomp_filters *filters, > + struct seq_file *m) > +{ > + return -ENOSYS; > +} > + > +static inline long seccomp_set_filter(int syscall_nr, char *filter) > +{ > + return -ENOSYS; > +} > + > +static inline long seccomp_clear_filter(int syscall_nr) > +{ > + return -ENOSYS; > +} > + > +static inline long seccomp_get_filter(int syscall_nr, > + char *buf, unsigned long available) > +{ > + return -ENOSYS; > +} > + > +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3) > +{ > + return -ENOSYS; > +} > + > +static inline long prctl_clear_seccomp_filter(unsigned long a2) > +{ > + return -ENOSYS; > +} > + > +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3, > + unsigned long a4) > +{ > + return -ENOSYS; > +} > +#endif /* CONFIG_SECCOMP_FILTER */ > #endif /* _LINUX_SECCOMP_H */ > diff --git a/include/trace/syscall.h b/include/trace/syscall.h > index 242ae04..e061ad0 100644 > --- a/include/trace/syscall.h > +++ b/include/trace/syscall.h > @@ -35,6 +35,8 @@ struct syscall_metadata { > extern unsigned long arch_syscall_addr(int nr); > extern int init_syscall_trace(struct ftrace_event_call *call); > > +extern struct syscall_metadata *syscall_nr_to_meta(int); > + > extern int reg_event_syscall_enter(struct ftrace_event_call *call); > extern void unreg_event_syscall_enter(struct ftrace_event_call *call); > extern int reg_event_syscall_exit(struct ftrace_event_call *call); > @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags, > 
struct trace_event *event); > enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags, > struct trace_event *event); > +#else > +static inline struct syscall_metadata *syscall_nr_to_meta(int nr) > +{ > + return NULL; > +} > #endif > > #ifdef CONFIG_PERF_EVENTS > diff --git a/kernel/Makefile b/kernel/Makefile > index 85cbfb3..84e7dfb 100644 > --- a/kernel/Makefile > +++ b/kernel/Makefile > @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o > obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o > obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ > obj-$(CONFIG_SECCOMP) += seccomp.o > +ifeq ($(CONFIG_SECCOMP_FILTER),y) > +obj-$(CONFIG_SECCOMP) += seccomp_filter.o > +endif > obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o > obj-$(CONFIG_TREE_RCU) += rcutree.o > obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o > diff --git a/kernel/fork.c b/kernel/fork.c > index e7548de..6f835e0 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -34,6 +34,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk) > free_thread_info(tsk->stack); > rt_mutex_debug_task_free(tsk); > ftrace_graph_exit_task(tsk); > + put_tsk_seccomp(tsk); > free_task_struct(tsk); > } > EXPORT_SYMBOL(free_task); > @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) > if (err) > goto out; > > + inherit_tsk_seccomp(tsk, orig); > setup_thread_stack(tsk, orig); > clear_user_return_notifier(tsk); > clear_tsk_need_resched(tsk); > diff --git a/kernel/seccomp.c b/kernel/seccomp.c > index 57d4b13..0a942be 100644 > --- a/kernel/seccomp.c > +++ b/kernel/seccomp.c > @@ -2,16 +2,20 @@ > * linux/kernel/seccomp.c > * > * Copyright 2004-2005 Andrea Arcangeli > + * Copyright (C) 2011 The Chromium OS Authors > * > * This defines a simple but solid secure-computing mode. 
> */ > > #include > #include > +#include > #include > +#include > +#include > > +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL > /* #define SECCOMP_DEBUG 1 */ > -#define NR_SECCOMP_MODES 1 > > /* > * Secure computing mode 1 allows only read/write/exit/sigreturn. > @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = { > > void __secure_computing(int this_syscall) > { > - int mode = current->seccomp.mode; > int * syscall; > > - switch (mode) { > + switch (current->seccomp.mode) { > case 1: > syscall = mode1_syscalls; > #ifdef CONFIG_COMPAT > @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall) > return; > } while (*++syscall); > break; > +#ifdef CONFIG_SECCOMP_FILTER > + case 2: > + if (this_syscall >= NR_syscalls || this_syscall < 0) > + break; > + > + if (!seccomp_test_filters(this_syscall)) > + return; > + > + seccomp_filter_log_failure(this_syscall); > + break; > +#endif > default: > BUG(); > } > @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode) > if (unlikely(current->seccomp.mode)) > goto out; > > - ret = -EINVAL; > - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) { > - current->seccomp.mode = seccomp_mode; > - set_thread_flag(TIF_SECCOMP); > + ret = 0; > + switch (seccomp_mode) { > + case 1: > #ifdef TIF_NOTSC > disable_TSC(); > #endif > - ret = 0; > +#ifdef CONFIG_SECCOMP_FILTER > + case 2: > +#endif > + current->seccomp.mode = seccomp_mode; > + set_thread_flag(TIF_SECCOMP); > + break; > + default: > + ret = -EINVAL; > } > > - out: > +out: > return ret; > } > diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c > new file mode 100644 > index 0000000..9782f25 > --- /dev/null > +++ b/kernel/seccomp_filter.c > @@ -0,0 +1,784 @@ > +/* filter engine-based seccomp system call filtering > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + 
* (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program; if not, write to the Free Software > + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. > + * > + * Copyright (C) 2011 The Chromium OS Authors > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > + > + > +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL > + > +#define SECCOMP_FILTER_ALLOW "1" > +#define SECCOMP_ACTION_DENY 0xffff > +#define SECCOMP_ACTION_ALLOW 0xfffe > + > +/** > + * struct seccomp_filters - container for seccomp filterset > + * > + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr > + * May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW > + * @event_filters: array of pointers to ftrace event objects > + * @count: size of @event_filters > + * @flags: anonymous struct to wrap filters-specific flags > + * @usage: reference count to simplify use. 
> + */ > +struct seccomp_filters { > + uint16_t syscalls[NR_syscalls]; > + struct event_filter **event_filters; > + uint16_t count; > + struct { > + uint32_t compat:1, > + __reserved:31; > + } flags; > + atomic_t usage; > +}; > + > +/* Handle ftrace symbol non-existence */ > +#ifdef CONFIG_FTRACE_SYSCALLS > +#define create_event_filter(_ef_pptr, _event_type, _str) \ > + ftrace_parse_filter(_ef_pptr, _event_type, _str) > +#define get_filter_string(_ef) ftrace_get_filter_string(_ef) > +#define free_event_filter(_f) ftrace_free_filter(_f) > + > +#else > + > +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS) > +#define get_filter_string(_ef) (NULL) > +#define free_event_filter(_f) do { } while (0) > +#endif > + > +/** > + * seccomp_filters_new - allocates a new filters object > + * @count: count to allocate for the event_filters array > + * > + * Returns ERR_PTR on error or an allocated object. > + */ > +static struct seccomp_filters *seccomp_filters_new(uint16_t count) > +{ > + struct seccomp_filters *f; > + > + if (count >= SECCOMP_ACTION_ALLOW) > + return ERR_PTR(-EINVAL); > + > + f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL); > + if (!f) > + return ERR_PTR(-ENOMEM); > + > + /* Lazy SECCOMP_ACTION_DENY assignment. */ > + memset(f->syscalls, 0xff, sizeof(f->syscalls)); > + atomic_set(&f->usage, 1); > + > + f->event_filters = NULL; > + f->count = count; > + if (!count) > + return f; > + > + f->event_filters = kzalloc(count * sizeof(struct event_filter *), > + GFP_KERNEL); > + if (!f->event_filters) { > + kfree(f); > + f = ERR_PTR(-ENOMEM); > + } > + return f; > +} > + > +/** > + * seccomp_filters_free - cleans up the filter list and frees the table > + * @filters: NULL or live object to be completely destructed. 
> + */ > +static void seccomp_filters_free(struct seccomp_filters *filters) > +{ > + uint16_t count = 0; > + if (!filters) > + return; > + while (count < filters->count) { > + struct event_filter *f = filters->event_filters[count]; > + free_event_filter(f); > + count++; > + } > + kfree(filters->event_filters); > + kfree(filters); > +} > + > +static void __put_seccomp_filters(struct seccomp_filters *orig) > +{ > + WARN_ON(atomic_read(&orig->usage)); > + seccomp_filters_free(orig); > +} > + > +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW) > +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY) > +#define seccomp_filter_dynamic(_id) \ > + (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id)) > +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f, > + int syscall_nr) > +{ > + if (!f) > + return SECCOMP_ACTION_DENY; > + return f->syscalls[syscall_nr]; > +} > + > +static inline struct event_filter *seccomp_dynamic_filter( > + const struct seccomp_filters *filters, uint16_t id) > +{ > + if (!seccomp_filter_dynamic(id)) > + return NULL; > + return filters->event_filters[id]; > +} > + > +static inline void set_seccomp_filter_id(struct seccomp_filters *filters, > + int syscall_nr, uint16_t id) > +{ > + filters->syscalls[syscall_nr] = id; > +} > + > +static inline void set_seccomp_filter(struct seccomp_filters *filters, > + int syscall_nr, uint16_t id, > + struct event_filter *dynamic_filter) > +{ > + filters->syscalls[syscall_nr] = id; > + if (seccomp_filter_dynamic(id)) > + filters->event_filters[id] = dynamic_filter; > +} > + > +static struct event_filter *alloc_event_filter(int syscall_nr, > + const char *filter_string) > +{ > + struct syscall_metadata *data; > + struct event_filter *filter = NULL; > + int err; > + > + data = syscall_nr_to_meta(syscall_nr); > + /* Argument-based filtering only works on ftrace-hooked syscalls. 
*/ > + err = -ENOSYS; > + if (!data) > + goto fail; > + err = create_event_filter(&filter, > + data->enter_event->event.type, > + filter_string); > + if (err) > + goto fail; > + > + return filter; > +fail: > + kfree(filter); > + return ERR_PTR(err); > +} > + > +/** > + * seccomp_filters_copy - copies filters from src to dst. > + * > + * @dst: seccomp_filters to populate. > + * @src: table to read from. > + * @skip: specifies an entry, by system call, to skip. > + * > + * Returns non-zero on failure. > + * Both the source and the destination should have no simultaneous > + * writers, and dst should be exclusive to the caller. > + * If @skip is < 0, it is ignored. > + */ > +static int seccomp_filters_copy(struct seccomp_filters *dst, > + const struct seccomp_filters *src, > + int skip) > +{ > + int id = 0, ret = 0, nr; > + memcpy(&dst->flags, &src->flags, sizeof(src->flags)); > + memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls)); > + if (!src->count) > + goto done; > + for (nr = 0; nr < NR_syscalls; ++nr) { > + struct event_filter *filter; > + const char *str; > + uint16_t src_id = seccomp_filter_id(src, nr); > + if (nr == skip) { > + set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY, > + NULL); > + continue; > + } > + if (!seccomp_filter_dynamic(src_id)) > + continue; > + if (id >= dst->count) { > + ret = -EINVAL; > + goto done; > + } > + str = get_filter_string(seccomp_dynamic_filter(src, src_id)); > + filter = alloc_event_filter(nr, str); > + if (IS_ERR(filter)) { > + ret = PTR_ERR(filter); > + goto done; > + } > + set_seccomp_filter(dst, nr, id, filter); > + id++; > + } > + > +done: > + return ret; > +} > + > +/** > + * seccomp_extend_filter - appends more text to a syscall_nr's filter > + * @filters: unattached filter object to operate on > + * @syscall_nr: syscall number to update filters for > + * @filter_string: string to append to the existing filter > + * > + * The new string will be &&'d to the original filter string to ensure that it > + * always 
matches the existing predicates or less: > + * (old_filter) && @filter_string > + * A new seccomp_filters instance is returned on success and a ERR_PTR on > + * failure. > + */ > +static int seccomp_extend_filter(struct seccomp_filters *filters, > + int syscall_nr, char *filter_string) > +{ > + struct event_filter *filter; > + uint16_t id = seccomp_filter_id(filters, syscall_nr); > + char *merged = NULL; > + int ret = -EINVAL, expected; > + > + /* No extending with a "1". */ > + if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) > + goto out; > + > + filter = seccomp_dynamic_filter(filters, id); > + ret = -ENOENT; > + if (!filter) > + goto out; > + > + merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL); > + ret = -ENOMEM; > + if (!merged) > + goto out; > + > + expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s", > + get_filter_string(filter), filter_string); > + ret = -E2BIG; > + if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0) > + goto out; > + > + /* Free the old filter */ > + free_event_filter(filter); > + set_seccomp_filter(filters, syscall_nr, id, NULL); > + > + /* Replace it */ > + filter = alloc_event_filter(syscall_nr, merged); > + if (IS_ERR(filter)) { > + ret = PTR_ERR(filter); > + goto out; > + } > + set_seccomp_filter(filters, syscall_nr, id, filter); > + ret = 0; > + > +out: > + kfree(merged); > + return ret; > +} > + > +/** > + * seccomp_add_filter - adds a filter for an unfiltered syscall > + * @filters: filters object to add a filter/action to > + * @syscall_nr: system call number to add a filter for > + * @filter_string: the filter string to apply > + * > + * Returns 0 on success and non-zero otherwise. 
> + */ > +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr, > + char *filter_string) > +{ > + struct event_filter *filter; > + int ret = 0; > + > + if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) { > + set_seccomp_filter(filters, syscall_nr, > + SECCOMP_ACTION_ALLOW, NULL); > + goto out; > + } > + > + filter = alloc_event_filter(syscall_nr, filter_string); > + if (IS_ERR(filter)) { > + ret = PTR_ERR(filter); > + goto out; > + } > + /* Always add to the last slot available since additions are > + * are only done one at a time. > + */ > + set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter); > +out: > + return ret; > +} > + > +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */ > +static int filter_match_current(struct event_filter *event_filter) > +{ > + int err = 0; > +#ifdef CONFIG_FTRACE_SYSCALLS > + uint8_t syscall_state[64]; > + > + memset(syscall_state, 0, sizeof(syscall_state)); > + > + /* The generic tracing entry can remain zeroed. 
*/
> +	err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
> +					 NULL);
> +	if (err)
> +		return 0;
> +
> +	err = filter_match_preds(event_filter, syscall_state);
> +#endif
> +	return err;
> +}
> +
> +static const char *syscall_nr_to_name(int syscall)
> +{
> +	const char *syscall_name = "unknown";
> +	struct syscall_metadata *data = syscall_nr_to_meta(syscall);
> +	if (data)
> +		syscall_name = data->name;
> +	return syscall_name;
> +}
> +
> +static void filters_set_compat(struct seccomp_filters *filters)
> +{
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		filters->flags.compat = 1;
> +#endif
> +}
> +
> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
> +{
> +	int ret = 0;
> +	if (!filters)
> +		return 0;
> +#ifdef CONFIG_COMPAT
> +	if (!!(is_compat_task()) == filters->flags.compat)
> +		ret = 1;
> +#endif
> +	return ret;
> +}
> +
> +static inline int syscall_is_execve(int syscall)
> +{
> +	int nr = __NR_execve;
> +#ifdef CONFIG_COMPAT
> +	if (is_compat_task())
> +		nr = __NR_seccomp_execve_32;
> +#endif
> +	return syscall == nr;
> +}
> +
> +#ifndef KSTK_EIP
> +#define KSTK_EIP(x) 0L
> +#endif
> +
> +void seccomp_filter_log_failure(int syscall)
> +{
> +	pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
> +		current->comm, task_pid_nr(current), syscall,
> +		syscall_nr_to_name(syscall), KSTK_EIP(current));
> +}
> +
> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
> +void put_seccomp_filters(struct seccomp_filters *orig)
> +{
> +	if (!orig)
> +		return;
> +
> +	if (atomic_dec_and_test(&orig->usage))
> +		__put_seccomp_filters(orig);
> +}
> +
> +/* get_seccomp_state - increments the reference count of @orig */
> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)

Nit: the name does not match the comment.

> +{
> +	if (!orig)
> +		return NULL;
> +	atomic_inc(&orig->usage);
> +	return orig;

This is called in an RCU read-side critical section.  What exactly is RCU
protecting?  I would expect an rcu_dereference() or one of the RCU
list-traversal primitives somewhere, either here or at the caller.

> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @state: seccomp_state of current to use.
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> +	uint16_t id;
> +	struct event_filter *filter;
> +	struct seccomp_filters *filters;
> +	int ret = -EACCES;
> +
> +	rcu_read_lock();
> +	filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	if (!filters)
> +		goto out;
> +
> +	if (filters_compat_mismatch(filters)) {
> +		pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
> +			current->comm, task_pid_nr(current));
> +		goto out;
> +	}
> +
> +	/* execve is never allowed. */
> +	if (syscall_is_execve(syscall))
> +		goto out;
> +
> +	ret = 0;
> +	id = seccomp_filter_id(filters, syscall);
> +	if (seccomp_filter_allow(id))
> +		goto out;
> +
> +	ret = -EACCES;
> +	if (!seccomp_filter_dynamic(id))
> +		goto out;
> +
> +	filter = seccomp_dynamic_filter(filters, id);
> +	if (filter && filter_match_current(filter))
> +		ret = 0;
> +out:
> +	put_seccomp_filters(filters);
> +	return ret;
> +}
> +
> +/**
> + * seccomp_show_filters - prints the current filter state to a seq_file
> + * @filters: properly get()'d filters object
> + * @m: the prepared seq_file to receive the data
> + *
> + * Returns 0 on a successful write.
> + */ > +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m) > +{ > + int syscall; > + seq_printf(m, "Mode: %d\n", current->seccomp.mode); > + if (!filters) > + goto out; > + > + for (syscall = 0; syscall < NR_syscalls; ++syscall) { > + uint16_t id = seccomp_filter_id(filters, syscall); > + const char *filter_string = SECCOMP_FILTER_ALLOW; > + if (seccomp_filter_deny(id)) > + continue; > + seq_printf(m, "%d (%s): ", > + syscall, > + syscall_nr_to_name(syscall)); > + if (seccomp_filter_dynamic(id)) > + filter_string = get_filter_string( > + seccomp_dynamic_filter(filters, id)); > + seq_printf(m, "%s\n", filter_string); > + } > +out: > + return 0; > +} > +EXPORT_SYMBOL_GPL(seccomp_show_filters); > + > +/** > + * seccomp_get_filter - copies the filter_string into "buf" > + * @syscall_nr: system call number to look up > + * @buf: destination buffer > + * @bufsize: available space in the buffer. > + * > + * Context: User context only. This function may sleep on allocation and > + * operates on current. current must be attempting a system call > + * when this is called. > + * > + * Looks up the filter for the given system call number on current. If found, > + * the string length of the NUL-terminated buffer is returned and < 0 is > + * returned on error. The NUL byte is not included in the length. 
> + */ > +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize) > +{ > + struct seccomp_filters *filters; > + struct event_filter *filter; > + long ret = -EINVAL; > + uint16_t id; > + > + if (bufsize > SECCOMP_MAX_FILTER_LENGTH) > + bufsize = SECCOMP_MAX_FILTER_LENGTH; > + > + rcu_read_lock(); > + filters = get_seccomp_filters(current->seccomp.filters); > + rcu_read_unlock(); > + > + if (!filters) > + goto out; > + > + ret = -ENOENT; > + id = seccomp_filter_id(filters, syscall_nr); > + if (seccomp_filter_deny(id)) > + goto out; > + > + if (seccomp_filter_allow(id)) { > + ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize); > + goto copied; > + } > + > + filter = seccomp_dynamic_filter(filters, id); > + if (!filter) > + goto out; > + ret = strlcpy(buf, get_filter_string(filter), bufsize); > + > +copied: > + if (ret >= bufsize) { > + ret = -ENOSPC; > + goto out; > + } > + /* Zero out any remaining buffer, just in case. */ > + memset(buf + ret, 0, bufsize - ret); > +out: > + put_seccomp_filters(filters); > + return ret; > +} > +EXPORT_SYMBOL_GPL(seccomp_get_filter); > + > +/** > + * seccomp_clear_filter: clears the seccomp filter for a syscall. > + * @syscall_nr: the system call number to clear filters for. > + * > + * Context: User context only. This function may sleep on allocation and > + * operates on current. current must be attempting a system call > + * when this is called. > + * > + * Returns 0 on success. 
> + */
> +long seccomp_clear_filter(int syscall_nr)
> +{
> +	struct seccomp_filters *filters = NULL, *orig_filters;
> +	uint16_t id;
> +	int ret = -EINVAL;
> +
> +	rcu_read_lock();
> +	orig_filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	if (!orig_filters)
> +		goto out;
> +
> +	if (filters_compat_mismatch(orig_filters))
> +		goto out;
> +
> +	id = seccomp_filter_id(orig_filters, syscall_nr);
> +	if (seccomp_filter_deny(id))
> +		goto out;
> +
> +	/* Create a new filters object for the task */
> +	if (seccomp_filter_dynamic(id))
> +		filters = seccomp_filters_new(orig_filters->count - 1);
> +	else
> +		filters = seccomp_filters_new(orig_filters->count);
> +
> +	if (IS_ERR(filters)) {
> +		ret = PTR_ERR(filters);
> +		goto out;
> +	}
> +
> +	/* Copy, but drop the requested entry. */
> +	ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
> +	if (ret)
> +		goto out;
> +	get_seccomp_filters(filters);  /* simplify the out: path */
> +
> +	rcu_assign_pointer(current->seccomp.filters, filters);

What prevents two copies of seccomp_clear_filter() from running
concurrently?

> +	synchronize_rcu();
> +	put_seccomp_filters(orig_filters);  /* for the task */
> +out:
> +	put_seccomp_filters(orig_filters);  /* for the get */
> +	put_seccomp_filters(filters);  /* for the extra get */
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
> +
> +/**
> + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
> + * @syscall_nr: system call number to apply the filter to.
> + * @filter: ftrace filter string to apply.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + *          operates on current. current must be attempting a system call
> + *          when this is called.
> + *
> + * New filters may be added for system calls when the current task is
> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
> + * be extended.
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +long seccomp_set_filter(int syscall_nr, char *filter)
> +{
> +	struct seccomp_filters *filters = NULL, *orig_filters = NULL;
> +	uint16_t id;
> +	long ret = -EINVAL;
> +	uint16_t filters_needed;
> +
> +	if (!filter)
> +		goto out;
> +
> +	filter = strstrip(filter);
> +	/* Disallow empty strings. */
> +	if (filter[0] == 0)
> +		goto out;
> +
> +	rcu_read_lock();
> +	orig_filters = get_seccomp_filters(current->seccomp.filters);
> +	rcu_read_unlock();
> +
> +	/* After the first call, compatibility mode is selected permanently. */
> +	ret = -EACCES;
> +	if (filters_compat_mismatch(orig_filters))
> +		goto out;
> +
> +	filters_needed = orig_filters ? orig_filters->count : 0;
> +	id = seccomp_filter_id(orig_filters, syscall_nr);
> +	if (seccomp_filter_deny(id)) {
> +		/* Don't allow DENYs to be changed when in a seccomp mode */
> +		ret = -EACCES;
> +		if (current->seccomp.mode)
> +			goto out;
> +		filters_needed++;
> +	}
> +
> +	filters = seccomp_filters_new(filters_needed);
> +	if (IS_ERR(filters)) {
> +		ret = PTR_ERR(filters);
> +		goto out;
> +	}
> +
> +	filters_set_compat(filters);
> +	if (orig_filters) {
> +		ret = seccomp_filters_copy(filters, orig_filters, -1);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	if (seccomp_filter_deny(id))
> +		ret = seccomp_add_filter(filters, syscall_nr, filter);
> +	else
> +		ret = seccomp_extend_filter(filters, syscall_nr, filter);
> +	if (ret)
> +		goto out;
> +	get_seccomp_filters(filters);  /* simplify the error paths */
> +
> +	rcu_assign_pointer(current->seccomp.filters, filters);

Again, what prevents two copies of seccomp_set_filter() from running
concurrently?
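
To make the question concrete: the update path takes a reference to
the current filters, builds a modified copy, and publishes the copy
with rcu_assign_pointer().  If two such updates could ever overlap on
the same pointer, both would copy the same original, and the second
publish would silently discard the first one's change.  Below is a
minimal userspace sketch of that interleaving.  It is not kernel code
and not part of the patch: struct filters, bit(), snapshot_copy(),
publish() and the replay_*() helpers are hypothetical stand-ins, a
bitmask replaces the real filter table, and a pthread mutex stands in
for whatever update-side exclusion the real code relies on.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters: one bit per syscall. */
struct filters {
	unsigned long allowed;
};

static struct filters *published;	/* models current->seccomp.filters */
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;

/* Bit for syscall nr; nr must fit in the illustrative bitmask. */
static unsigned long bit(int nr)
{
	assert(nr >= 0 && nr < (int)(sizeof(unsigned long) * CHAR_BIT));
	return 1UL << nr;
}

/* Update step 1: private copy of the published set (NULL means empty). */
static struct filters *snapshot_copy(void)
{
	struct filters *copy = calloc(1, sizeof(*copy));

	if (copy && published)
		*copy = *published;
	return copy;
}

/* Update step 2: publish the new set and release the old one. */
static void publish(struct filters *next)
{
	struct filters *old = published;

	published = next;
	free(old);
}

/* Serialized update: copy, modify and publish under a single lock. */
static int add_allowed_locked(int nr)
{
	struct filters *copy;
	int ret = -1;

	pthread_mutex_lock(&update_lock);
	copy = snapshot_copy();
	if (copy) {
		copy->allowed |= bit(nr);
		publish(copy);
		ret = 0;
	}
	pthread_mutex_unlock(&update_lock);
	return ret;
}

/*
 * Replays the problematic interleaving by hand: both updaters take
 * their snapshot before either publishes.  Returns the final mask.
 */
static unsigned long replay_unserialized(int nr_a, int nr_b)
{
	struct filters *a, *b;

	publish(NULL);			/* start from an empty set */
	a = snapshot_copy();
	b = snapshot_copy();
	if (!a || !b) {
		free(a);
		free(b);
		return 0;
	}
	a->allowed |= bit(nr_a);
	b->allowed |= bit(nr_b);
	publish(a);
	publish(b);			/* overwrites a's update */
	return published->allowed;
}

/* The same two updates, one after the other, under the lock. */
static unsigned long replay_serialized(int nr_a, int nr_b)
{
	publish(NULL);
	if (add_allowed_locked(nr_a) || add_allowed_locked(nr_b))
		return 0;
	return published->allowed;
}
```

In the unserialized replay only the second updater's bit survives; in
the serialized replay both do.  If the pointer is only ever written by
the owning task, updates are already serialized and the sketch's lock
is unnecessary; the question is which of the two cases applies here.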

> +	synchronize_rcu();
> +	put_seccomp_filters(orig_filters);  /* for the task */
> +out:
> +	put_seccomp_filters(orig_filters);  /* for the get */
> +	put_seccomp_filters(filters);  /* for get or task, on err */
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
> +
> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
> +			      char __user *user_filter)
> +{
> +	int nr;
> +	long ret;
> +	char *filter = NULL;
> +
> +	ret = -EINVAL;
> +	if (syscall_nr >= NR_syscalls)
> +		goto out;
> +
> +	ret = -EFAULT;
> +	if (!user_filter)
> +		goto out;
> +
> +	filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> +	ret = -ENOMEM;
> +	if (!filter)
> +		goto out;
> +
> +	ret = -EFAULT;
> +	if (strncpy_from_user(filter, user_filter,
> +			      SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
> +		goto out;
> +
> +	nr = (int) syscall_nr;
> +	ret = seccomp_set_filter(nr, filter);
> +
> +out:
> +	kfree(filter);
> +	return ret;
> +}
> +
> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
> +{
> +	int nr = -1;
> +	long ret;
> +
> +	ret = -EINVAL;
> +	if (syscall_nr >= NR_syscalls)
> +		goto out;
> +
> +	nr = (int) syscall_nr;
> +	ret = seccomp_clear_filter(nr);
> +
> +out:
> +	return ret;
> +}
> +
> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
> +			      unsigned long available)
> +{
> +	int ret, nr;
> +	unsigned long copied;
> +	char *buf = NULL;
> +	ret = -EINVAL;
> +	if (!available)
> +		goto out;
> +	/* Ignore extra buffer space. */
> +	if (available > SECCOMP_MAX_FILTER_LENGTH)
> +		available = SECCOMP_MAX_FILTER_LENGTH;
> +
> +	ret = -EINVAL;
> +	if (syscall_nr >= NR_syscalls)
> +		goto out;
> +	nr = (int) syscall_nr;
> +
> +	ret = -ENOMEM;
> +	buf = kmalloc(available, GFP_KERNEL);
> +	if (!buf)
> +		goto out;
> +
> +	ret = seccomp_get_filter(nr, buf, available);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Include the NUL byte in the copy. */
> +	copied = copy_to_user(dst, buf, ret + 1);
> +	ret = -ENOSPC;
> +	if (copied)
> +		goto out;
> +	ret = 0;
> +out:
> +	kfree(buf);
> +	return ret;
> +}
> diff --git a/kernel/sys.c b/kernel/sys.c
> index af468ed..ed60d06 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>  		case PR_SET_ENDIAN:
>  			error = SET_ENDIAN(me, arg2);
>  			break;
> -
>  		case PR_GET_SECCOMP:
>  			error = prctl_get_seccomp();
>  			break;
>  		case PR_SET_SECCOMP:
>  			error = prctl_set_seccomp(arg2);
>  			break;
> +		case PR_SET_SECCOMP_FILTER:
> +			error = prctl_set_seccomp_filter(arg2,
> +							 (char __user *) arg3);
> +			break;
> +		case PR_CLEAR_SECCOMP_FILTER:
> +			error = prctl_clear_seccomp_filter(arg2);
> +			break;
> +		case PR_GET_SECCOMP_FILTER:
> +			error = prctl_get_seccomp_filter(arg2,
> +							 (char __user *) arg3,
> +							 arg4);
> +			break;
>  		case PR_GET_TSC:
>  			error = GET_TSC_CTL(arg2);
>  			break;
> diff --git a/security/Kconfig b/security/Kconfig
> index 95accd4..c76adf2 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -2,6 +2,10 @@
>  # Security configuration
>  #
>  
> +# Make seccomp filter Kconfig switch below available
> +config HAVE_SECCOMP_FILTER
> +	bool
> +
>  menu "Security options"
>  
>  config KEYS
> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
>  
>  	  If you are unsure how to answer this question, answer N.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	select SECCOMP
> +	depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
> +	help
> +	  This kernel feature expands CONFIG_SECCOMP to allow computing
> +	  in environments with reduced kernel access dictated by the
> +	  application itself through prctl calls.  If
> +	  CONFIG_FTRACE_SYSCALLS is available, then system call
> +	  argument-based filtering predicates may be used.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> --
> 1.7.0.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/