Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753525Ab1FBU24 (ORCPT ); Thu, 2 Jun 2011 16:28:56 -0400 Received: from mail-fx0-f46.google.com ([209.85.161.46]:46284 "EHLO mail-fx0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752286Ab1FBU2v convert rfc822-to-8bit (ORCPT ); Thu, 2 Jun 2011 16:28:51 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=chromium.org; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:content-transfer-encoding; b=A6GhlfJjjIhUTuP2ZIph8jHafk1+IIvtSFFAqw3iJb/tewSo/kYHxou2mwBUBlOaAN 9Okbnmgqn7Lt6BntRPYKpUM+uOAnxrlt8aNlp9jBUSGUeMWzvcR/L3mXKH0PgiVvHkmA s4CK0F361PsjtNlfe+xfhYSbtIeVCms6w9rGA= MIME-Version: 1.0 In-Reply-To: <20110602194231.GL2271@linux.vnet.ibm.com> References: <1306897845-9393-3-git-send-email-wad@chromium.org> <20110602173654.GI2271@linux.vnet.ibm.com> <20110602194231.GL2271@linux.vnet.ibm.com> Date: Thu, 2 Jun 2011 15:28:48 -0500 Message-ID: Subject: Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters From: Will Drewry To: paulmck@linux.vnet.ibm.com Cc: linux-kernel@vger.kernel.org, kees.cook@canonical.com, torvalds@linux-foundation.org, tglx@linutronix.de, mingo@elte.hu, rostedt@goodmis.org, jmorris@namei.org, Peter Zijlstra , Frederic Weisbecker , linux-security-module@vger.kernel.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 53212 Lines: 1390 On Thu, Jun 2, 2011 at 2:42 PM, Paul E. McKenney wrote: > On Thu, Jun 02, 2011 at 01:14:54PM -0500, Will Drewry wrote: >> On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney >> wrote: >> > On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote: >> >> This change adds a new seccomp mode which specifies the allowed system >> >> calls dynamically. ?When in the new mode (2), all system calls are >> >> checked against process-defined filters - first by system call number, >> >> then by a filter string. ?If an entry exists for a given system call and >> >> all filter predicates evaluate to true, then the task may proceed. >> >> Otherwise, the task is killed. >> > >> > A few questions below -- I can't say that I understand the RCU usage. >> > >> > ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Thanx, Paul >> > >> >> Filter string parsing and evaluation is handled by the ftrace filter >> >> engine. ?Related patches tweak to the perf filter trace and free >> >> allowing the calls to be shared. Filters inherit their understanding of >> >> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS >> >> subsystem which already populates this information in syscall_metadata >> >> associated enter_event (and exit_event) structures. If >> >> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1" >> >> will be allowed. >> >> >> >> The net result is a process may have its system calls filtered using the >> >> ftrace filter engine's inherent understanding of systems calls. ?The set >> >> of filters is specified through the PR_SET_SECCOMP_FILTER argument in >> >> prctl(). For example, a filterset for a process, like pdftotext, that >> >> should only process read-only input could (roughly) look like: >> >> ? sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1"); >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 | fd == 2)"); >> >> ? prctl(PR_SET_SECCOMP, 2); >> >> >> >> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will >> >> be &&'d together to ensure that attack surface may only be reduced: >> >> ? prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2"); >> >> >> >> With the earlier example, the active filter becomes: >> >> ? "(fd == 1 || fd == 2) && fd != 2" >> >> >> >> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER. >> >> The latter returns the current filter for a system call to userspace: >> >> >> >> ? prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize); >> >> >> >> while the former clears any filters for a given system call changing it >> >> back to a defaulty deny: >> >> >> >> ? prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write); >> >> >> >> v3: - always block execve calls (as per linus torvalds) >> >> ? ? - add __NR_seccomp_execve(_32) to seccomp-supporting arches >> >> ? ? - ensure compat tasks can't reach ftrace:syscalls >> >> ? ? - dropped new defines for seccomp modes. >> >> ? ? - two level array instead of hlists (sugg. by olof johansson) >> >> ? ? - added generic Kconfig entry that is not connected. >> >> ? ? - dropped internal seccomp.h >> >> ? ? - move prctl helpers to seccomp_filter >> >> ? ? - killed seccomp_t typedef (as per checkpatch) >> >> v2: - changed to use the existing syscall number ABI. >> >> ? ? - prctl changes to minimize parsing in the kernel: >> >> ? ? ? prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC }); >> >> ? ? ? prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5"); >> >> ? ? ? prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read); >> >> ? ? ? prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize); >> >> ? ? - defined PR_SECCOMP_MODE_STRICT and ..._FILTER >> >> ? ? - added flags >> >> ? ? - provide a default fail syscall_nr_to_meta in ftrace >> >> ? ? - provides fallback for unhooked system calls >> >> ? ? - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality >> >> ? ? - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c >> >> ? ? - moved to a hlist and 4 bit hash of linked lists >> >> ? ? - added support to operate without CONFIG_FTRACE_SYSCALLS >> >> ? ? - moved Kconfig support next to SECCOMP >> >> ? ? - made Kconfig entries dependent on EXPERIMENTAL >> >> ? ? - added macros to avoid ifdefs from kernel/fork.c >> >> ? ? - added compat task/filter matching >> >> ? ? - drop seccomp.h inclusion in sched.h and drop seccomp_t >> >> ? ? - added Filtering to "show" output >> >> ? ? - added on_exec state dup'ing when enabling after a fast-path accept. >> >> >> >> Signed-off-by: Will Drewry >> >> --- >> >> ?include/linux/prctl.h ? | ? ?5 + >> >> ?include/linux/sched.h ? | ? ?2 +- >> >> ?include/linux/seccomp.h | ? 98 ++++++- >> >> ?include/trace/syscall.h | ? ?7 + >> >> ?kernel/Makefile ? ? ? ? | ? ?3 + >> >> ?kernel/fork.c ? ? ? ? ? | ? ?3 + >> >> ?kernel/seccomp.c ? ? ? ?| ? 38 ++- >> >> ?kernel/seccomp_filter.c | ?784 +++++++++++++++++++++++++++++++++++++++++++++++ >> >> ?kernel/sys.c ? ? ? ? ? ?| ? 13 +- >> >> ?security/Kconfig ? ? ? ?| ? 17 + >> >> ?10 files changed, 954 insertions(+), 16 deletions(-) >> >> ?create mode 100644 kernel/seccomp_filter.c >> >> >> >> diff --git a/include/linux/prctl.h b/include/linux/prctl.h >> >> index a3baeb2..44723ce 100644 >> >> --- a/include/linux/prctl.h >> >> +++ b/include/linux/prctl.h >> >> @@ -64,6 +64,11 @@ >> >> ?#define PR_GET_SECCOMP ? ? ? 21 >> >> ?#define PR_SET_SECCOMP ? ? ? 22 >> >> >> >> +/* Get/set process seccomp filters */ >> >> +#define PR_GET_SECCOMP_FILTER ? ? ? ?35 >> >> +#define PR_SET_SECCOMP_FILTER ? ? ? ?36 >> >> +#define PR_CLEAR_SECCOMP_FILTER ? ? ?37 >> >> + >> >> ?/* Get/set the capability bounding set (as per security/commoncap.c) */ >> >> ?#define PR_CAPBSET_READ 23 >> >> ?#define PR_CAPBSET_DROP 24 >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> >> index 18d63ce..3f0bc8d 100644 >> >> --- a/include/linux/sched.h >> >> +++ b/include/linux/sched.h >> >> @@ -1374,7 +1374,7 @@ struct task_struct { >> >> ? ? ? uid_t loginuid; >> >> ? ? ? unsigned int sessionid; >> >> ?#endif >> >> - ? ? seccomp_t seccomp; >> >> + ? ? struct seccomp_struct seccomp; >> >> >> >> ?/* Thread group tracking */ >> >> ? ? ? u32 parent_exec_id; >> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h >> >> index 167c333..f4434ca 100644 >> >> --- a/include/linux/seccomp.h >> >> +++ b/include/linux/seccomp.h >> >> @@ -1,13 +1,33 @@ >> >> ?#ifndef _LINUX_SECCOMP_H >> >> ?#define _LINUX_SECCOMP_H >> >> >> >> +struct seq_file; >> >> >> >> ?#ifdef CONFIG_SECCOMP >> >> >> >> +#include >> >> ?#include >> >> +#include >> >> ?#include >> >> >> >> -typedef struct { int mode; } seccomp_t; >> >> +struct seccomp_filters; >> >> +/** >> >> + * struct seccomp_struct - the state of a seccomp'ed process >> >> + * >> >> + * @mode: >> >> + * ? ? if this is 1, the process is under standard seccomp rules >> >> + * ? ? ? ? ? ? is 2, the process is only allowed to make system calls where >> >> + * ? ? ? ? ? ? ? ? ? associated filters evaluate successfully. >> >> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER. >> >> + * ? ? ? ? ? filters assignment/use should be RCU-protected and its contents >> >> + * ? ? ? ? ? should never be modified when attached to a seccomp_struct. >> >> + */ >> >> +struct seccomp_struct { >> >> + ? ? uint16_t mode; >> >> +#ifdef CONFIG_SECCOMP_FILTER >> >> + ? ? struct seccomp_filters *filters; >> >> +#endif >> >> +}; >> >> >> >> ?extern void __secure_computing(int); >> >> ?static inline void secure_computing(int this_syscall) >> >> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall) >> >> ? ? ? ? ? ? ? __secure_computing(this_syscall); >> >> ?} >> >> >> >> -extern long prctl_get_seccomp(void); >> >> ?extern long prctl_set_seccomp(unsigned long); >> >> +extern long prctl_get_seccomp(void); >> >> >> >> ?#else /* CONFIG_SECCOMP */ >> >> >> >> ?#include >> >> >> >> -typedef struct { } seccomp_t; >> >> - >> >> +struct seccomp_struct { }; >> >> ?#define secure_computing(x) do { } while (0) >> >> >> >> ?static inline long prctl_get_seccomp(void) >> >> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void) >> >> ? ? ? return -EINVAL; >> >> ?} >> >> >> >> -static inline long prctl_set_seccomp(unsigned long arg2) >> >> +static inline long prctl_set_seccomp(unsigned long a2); >> >> ?{ >> >> ? ? ? return -EINVAL; >> >> ?} >> >> >> >> ?#endif /* CONFIG_SECCOMP */ >> >> >> >> +#ifdef CONFIG_SECCOMP_FILTER >> >> + >> >> +#define inherit_tsk_seccomp(_child, _orig) do { \ >> >> + ? ? _child->seccomp.mode = _orig->seccomp.mode; \ >> >> + ? ? _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \ >> >> + ? ? } while (0) >> >> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters) >> >> + >> >> +extern int seccomp_show_filters(struct seccomp_filters *filters, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct seq_file *); >> >> +extern long seccomp_set_filter(int, char *); >> >> +extern long seccomp_clear_filter(int); >> >> +extern long seccomp_get_filter(int, char *, unsigned long); >> >> + >> >> +extern long prctl_set_seccomp_filter(unsigned long, char __user *); >> >> +extern long prctl_get_seccomp_filter(unsigned long, char __user *, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long); >> >> +extern long prctl_clear_seccomp_filter(unsigned long); >> >> + >> >> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *); >> >> +extern void put_seccomp_filters(struct seccomp_filters *); >> >> + >> >> +extern int seccomp_test_filters(int); >> >> +extern void seccomp_filter_log_failure(int); >> >> + >> >> +#else ?/* CONFIG_SECCOMP_FILTER */ >> >> + >> >> +struct seccomp_filters { }; >> >> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0) >> >> +#define put_tsk_seccomp(_tsk) do { } while (0) >> >> + >> >> +static inline int seccomp_show_filters(struct seccomp_filters *filters, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct seq_file *m) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> + >> >> +static inline long seccomp_set_filter(int syscall_nr, char *filter) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> + >> >> +static inline long seccomp_clear_filter(int syscall_nr) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> + >> >> +static inline long seccomp_get_filter(int syscall_nr, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? char *buf, unsigned long available) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> + >> >> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> + >> >> +static inline long prctl_clear_seccomp_filter(unsigned long a2) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> + >> >> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long a4) >> >> +{ >> >> + ? ? return -ENOSYS; >> >> +} >> >> +#endif ?/* CONFIG_SECCOMP_FILTER */ >> >> ?#endif /* _LINUX_SECCOMP_H */ >> >> diff --git a/include/trace/syscall.h b/include/trace/syscall.h >> >> index 242ae04..e061ad0 100644 >> >> --- a/include/trace/syscall.h >> >> +++ b/include/trace/syscall.h >> >> @@ -35,6 +35,8 @@ struct syscall_metadata { >> >> ?extern unsigned long arch_syscall_addr(int nr); >> >> ?extern int init_syscall_trace(struct ftrace_event_call *call); >> >> >> >> +extern struct syscall_metadata *syscall_nr_to_meta(int); >> >> + >> >> ?extern int reg_event_syscall_enter(struct ftrace_event_call *call); >> >> ?extern void unreg_event_syscall_enter(struct ftrace_event_call *call); >> >> ?extern int reg_event_syscall_exit(struct ftrace_event_call *call); >> >> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags, >> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct trace_event *event); >> >> ?enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags, >> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct trace_event *event); >> >> +#else >> >> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr) >> >> +{ >> >> + ? ? return NULL; >> >> +} >> >> ?#endif >> >> >> >> ?#ifdef CONFIG_PERF_EVENTS >> >> diff --git a/kernel/Makefile b/kernel/Makefile >> >> index 85cbfb3..84e7dfb 100644 >> >> --- a/kernel/Makefile >> >> +++ b/kernel/Makefile >> >> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o >> >> ?obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o >> >> ?obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ >> >> ?obj-$(CONFIG_SECCOMP) += seccomp.o >> >> +ifeq ($(CONFIG_SECCOMP_FILTER),y) >> >> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o >> >> +endif >> >> ?obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o >> >> ?obj-$(CONFIG_TREE_RCU) += rcutree.o >> >> ?obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o >> >> diff --git a/kernel/fork.c b/kernel/fork.c >> >> index e7548de..6f835e0 100644 >> >> --- a/kernel/fork.c >> >> +++ b/kernel/fork.c >> >> @@ -34,6 +34,7 @@ >> >> ?#include >> >> ?#include >> >> ?#include >> >> +#include >> >> ?#include >> >> ?#include >> >> ?#include >> >> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk) >> >> ? ? ? free_thread_info(tsk->stack); >> >> ? ? ? rt_mutex_debug_task_free(tsk); >> >> ? ? ? ftrace_graph_exit_task(tsk); >> >> + ? ? put_tsk_seccomp(tsk); >> >> ? ? ? free_task_struct(tsk); >> >> ?} >> >> ?EXPORT_SYMBOL(free_task); >> >> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig) >> >> ? ? ? if (err) >> >> ? ? ? ? ? ? ? goto out; >> >> >> >> + ? ? inherit_tsk_seccomp(tsk, orig); >> >> ? ? ? setup_thread_stack(tsk, orig); >> >> ? ? ? clear_user_return_notifier(tsk); >> >> ? ? ? clear_tsk_need_resched(tsk); >> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c >> >> index 57d4b13..0a942be 100644 >> >> --- a/kernel/seccomp.c >> >> +++ b/kernel/seccomp.c >> >> @@ -2,16 +2,20 @@ >> >> ? * linux/kernel/seccomp.c >> >> ? * >> >> ? * Copyright 2004-2005 ?Andrea Arcangeli >> >> + * Copyright (C) 2011 The Chromium OS Authors >> >> ? * >> >> ? * This defines a simple but solid secure-computing mode. >> >> ? */ >> >> >> >> ?#include >> >> ?#include >> >> +#include >> >> ?#include >> >> +#include >> >> +#include >> >> >> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL >> >> ?/* #define SECCOMP_DEBUG 1 */ >> >> -#define NR_SECCOMP_MODES 1 >> >> >> >> ?/* >> >> ? * Secure computing mode 1 allows only read/write/exit/sigreturn. >> >> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = { >> >> >> >> ?void __secure_computing(int this_syscall) >> >> ?{ >> >> - ? ? int mode = current->seccomp.mode; >> >> ? ? ? int * syscall; >> >> >> >> - ? ? switch (mode) { >> >> + ? ? switch (current->seccomp.mode) { >> >> ? ? ? case 1: >> >> ? ? ? ? ? ? ? syscall = mode1_syscalls; >> >> ?#ifdef CONFIG_COMPAT >> >> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall) >> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return; >> >> ? ? ? ? ? ? ? } while (*++syscall); >> >> ? ? ? ? ? ? ? break; >> >> +#ifdef CONFIG_SECCOMP_FILTER >> >> + ? ? case 2: >> >> + ? ? ? ? ? ? if (this_syscall >= NR_syscalls || this_syscall < 0) >> >> + ? ? ? ? ? ? ? ? ? ? break; >> >> + >> >> + ? ? ? ? ? ? if (!seccomp_test_filters(this_syscall)) >> >> + ? ? ? ? ? ? ? ? ? ? return; >> >> + >> >> + ? ? ? ? ? ? seccomp_filter_log_failure(this_syscall); >> >> + ? ? ? ? ? ? break; >> >> +#endif >> >> ? ? ? default: >> >> ? ? ? ? ? ? ? BUG(); >> >> ? ? ? } >> >> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode) >> >> ? ? ? if (unlikely(current->seccomp.mode)) >> >> ? ? ? ? ? ? ? goto out; >> >> >> >> - ? ? ret = -EINVAL; >> >> - ? ? if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) { >> >> - ? ? ? ? ? ? current->seccomp.mode = seccomp_mode; >> >> - ? ? ? ? ? ? set_thread_flag(TIF_SECCOMP); >> >> + ? ? ret = 0; >> >> + ? ? switch (seccomp_mode) { >> >> + ? ? case 1: >> >> ?#ifdef TIF_NOTSC >> >> ? ? ? ? ? ? ? disable_TSC(); >> >> ?#endif >> >> - ? ? ? ? ? ? ret = 0; >> >> +#ifdef CONFIG_SECCOMP_FILTER >> >> + ? ? case 2: >> >> +#endif >> >> + ? ? ? ? ? ? current->seccomp.mode = seccomp_mode; >> >> + ? ? ? ? ? ? set_thread_flag(TIF_SECCOMP); >> >> + ? ? ? ? ? ? break; >> >> + ? ? default: >> >> + ? ? ? ? ? ? ret = -EINVAL; >> >> ? ? ? } >> >> >> >> - out: >> >> +out: >> >> ? ? ? return ret; >> >> ?} >> >> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c >> >> new file mode 100644 >> >> index 0000000..9782f25 >> >> --- /dev/null >> >> +++ b/kernel/seccomp_filter.c >> >> @@ -0,0 +1,784 @@ >> >> +/* filter engine-based seccomp system call filtering >> >> + * >> >> + * This program is free software; you can redistribute it and/or modify >> >> + * it under the terms of the GNU General Public License as published by >> >> + * the Free Software Foundation; either version 2 of the License, or >> >> + * (at your option) any later version. >> >> + * >> >> + * This program is distributed in the hope that it will be useful, >> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ?See the >> >> + * GNU General Public License for more details. >> >> + * >> >> + * You should have received a copy of the GNU General Public License >> >> + * along with this program; if not, write to the Free Software >> >> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. >> >> + * >> >> + * Copyright (C) 2011 The Chromium OS Authors >> >> + */ >> >> + >> >> +#include >> >> +#include >> >> +#include >> >> +#include >> >> +#include >> >> +#include >> >> +#include >> >> +#include >> >> +#include >> >> + >> >> +#include >> >> +#include >> >> + >> >> + >> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL >> >> + >> >> +#define SECCOMP_FILTER_ALLOW "1" >> >> +#define SECCOMP_ACTION_DENY 0xffff >> >> +#define SECCOMP_ACTION_ALLOW 0xfffe >> >> + >> >> +/** >> >> + * struct seccomp_filters - container for seccomp filterset >> >> + * >> >> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr >> >> + * ? ? ? ? ? ?May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW >> >> + * @event_filters: array of pointers to ftrace event objects >> >> + * @count: size of @event_filters >> >> + * @flags: anonymous struct to wrap filters-specific flags >> >> + * @usage: reference count to simplify use. >> >> + */ >> >> +struct seccomp_filters { >> >> + ? ? uint16_t syscalls[NR_syscalls]; >> >> + ? ? struct event_filter **event_filters; >> >> + ? ? uint16_t count; >> >> + ? ? struct { >> >> + ? ? ? ? ? ? uint32_t compat:1, >> >> + ? ? ? ? ? ? ? ? ? ? ?__reserved:31; >> >> + ? ? } flags; >> >> + ? ? atomic_t usage; >> >> +}; >> >> + >> >> +/* Handle ftrace symbol non-existence */ >> >> +#ifdef CONFIG_FTRACE_SYSCALLS >> >> +#define create_event_filter(_ef_pptr, _event_type, _str) \ >> >> + ? ? ftrace_parse_filter(_ef_pptr, _event_type, _str) >> >> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef) >> >> +#define free_event_filter(_f) ftrace_free_filter(_f) >> >> + >> >> +#else >> >> + >> >> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS) >> >> +#define get_filter_string(_ef) (NULL) >> >> +#define free_event_filter(_f) do { } while (0) >> >> +#endif >> >> + >> >> +/** >> >> + * seccomp_filters_new - allocates a new filters object >> >> + * @count: count to allocate for the event_filters array >> >> + * >> >> + * Returns ERR_PTR on error or an allocated object. >> >> + */ >> >> +static struct seccomp_filters *seccomp_filters_new(uint16_t count) >> >> +{ >> >> + ? ? struct seccomp_filters *f; >> >> + >> >> + ? ? if (count >= SECCOMP_ACTION_ALLOW) >> >> + ? ? ? ? ? ? return ERR_PTR(-EINVAL); >> >> + >> >> + ? ? f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL); >> >> + ? ? if (!f) >> >> + ? ? ? ? ? ? return ERR_PTR(-ENOMEM); >> >> + >> >> + ? ? /* Lazy SECCOMP_ACTION_DENY assignment. */ >> >> + ? ? memset(f->syscalls, 0xff, sizeof(f->syscalls)); >> >> + ? ? atomic_set(&f->usage, 1); >> >> + >> >> + ? ? f->event_filters = NULL; >> >> + ? ? f->count = count; >> >> + ? ? if (!count) >> >> + ? ? ? ? ? ? return f; >> >> + >> >> + ? ? f->event_filters = kzalloc(count * sizeof(struct event_filter *), >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GFP_KERNEL); >> >> + ? ? if (!f->event_filters) { >> >> + ? ? ? ? ? ? kfree(f); >> >> + ? ? ? ? ? ? f = ERR_PTR(-ENOMEM); >> >> + ? ? } >> >> + ? ? return f; >> >> +} >> >> + >> >> +/** >> >> + * seccomp_filters_free - cleans up the filter list and frees the table >> >> + * @filters: NULL or live object to be completely destructed. >> >> + */ >> >> +static void seccomp_filters_free(struct seccomp_filters *filters) >> >> +{ >> >> + ? ? uint16_t count = 0; >> >> + ? ? if (!filters) >> >> + ? ? ? ? ? ? return; >> >> + ? ? while (count < filters->count) { >> >> + ? ? ? ? ? ? struct event_filter *f = filters->event_filters[count]; >> >> + ? ? ? ? ? ? free_event_filter(f); >> >> + ? ? ? ? ? ? count++; >> >> + ? ? } >> >> + ? ? kfree(filters->event_filters); >> >> + ? ? kfree(filters); >> >> +} >> >> + >> >> +static void __put_seccomp_filters(struct seccomp_filters *orig) >> >> +{ >> >> + ? ? WARN_ON(atomic_read(&orig->usage)); >> >> + ? ? seccomp_filters_free(orig); >> >> +} >> >> + >> >> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW) >> >> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY) >> >> +#define seccomp_filter_dynamic(_id) \ >> >> + ? ? (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id)) >> >> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int syscall_nr) >> >> +{ >> >> + ? ? if (!f) >> >> + ? ? ? ? ? ? return SECCOMP_ACTION_DENY; >> >> + ? ? return f->syscalls[syscall_nr]; >> >> +} >> >> + >> >> +static inline struct event_filter *seccomp_dynamic_filter( >> >> + ? ? ? ? ? ? const struct seccomp_filters *filters, uint16_t id) >> >> +{ >> >> + ? ? if (!seccomp_filter_dynamic(id)) >> >> + ? ? ? ? ? ? return NULL; >> >> + ? ? return filters->event_filters[id]; >> >> +} >> >> + >> >> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int syscall_nr, uint16_t id) >> >> +{ >> >> + ? ? filters->syscalls[syscall_nr] = id; >> >> +} >> >> + >> >> +static inline void set_seccomp_filter(struct seccomp_filters *filters, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int syscall_nr, uint16_t id, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct event_filter *dynamic_filter) >> >> +{ >> >> + ? ? filters->syscalls[syscall_nr] = id; >> >> + ? ? if (seccomp_filter_dynamic(id)) >> >> + ? ? ? ? ? ? filters->event_filters[id] = dynamic_filter; >> >> +} >> >> + >> >> +static struct event_filter *alloc_event_filter(int syscall_nr, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?const char *filter_string) >> >> +{ >> >> + ? ? struct syscall_metadata *data; >> >> + ? ? struct event_filter *filter = NULL; >> >> + ? ? int err; >> >> + >> >> + ? ? data = syscall_nr_to_meta(syscall_nr); >> >> + ? ? /* Argument-based filtering only works on ftrace-hooked syscalls. */ >> >> + ? ? err = -ENOSYS; >> >> + ? ? if (!data) >> >> + ? ? ? ? ? ? goto fail; >> >> + ? ? err = create_event_filter(&filter, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? data->enter_event->event.type, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? filter_string); >> >> + ? ? if (err) >> >> + ? ? ? ? ? ? goto fail; >> >> + >> >> + ? ? return filter; >> >> +fail: >> >> + ? ? kfree(filter); >> >> + ? ? return ERR_PTR(err); >> >> +} >> >> + >> >> +/** >> >> + * seccomp_filters_copy - copies filters from src to dst. >> >> + * >> >> + * @dst: seccomp_filters to populate. >> >> + * @src: table to read from. >> >> + * @skip: specifies an entry, by system call, to skip. >> >> + * >> >> + * Returns non-zero on failure. >> >> + * Both the source and the destination should have no simultaneous >> >> + * writers, and dst should be exclusive to the caller. >> >> + * If @skip is < 0, it is ignored. >> >> + */ >> >> +static int seccomp_filters_copy(struct seccomp_filters *dst, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? const struct seccomp_filters *src, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? int skip) >> >> +{ >> >> + ? ? int id = 0, ret = 0, nr; >> >> + ? ? memcpy(&dst->flags, &src->flags, sizeof(src->flags)); >> >> + ? ? memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls)); >> >> + ? ? if (!src->count) >> >> + ? ? ? ? ? ? goto done; >> >> + ? ? for (nr = 0; nr < NR_syscalls; ++nr) { >> >> + ? ? ? ? ? ? struct event_filter *filter; >> >> + ? ? ? ? ? ? const char *str; >> >> + ? ? ? ? ? ? uint16_t src_id = seccomp_filter_id(src, nr); >> >> + ? ? ? ? ? ? if (nr == skip) { >> >> + ? ? ? ? ? ? ? ? ? ? set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL); >> >> + ? ? ? ? ? ? ? ? ? ? continue; >> >> + ? ? ? ? ? ? } >> >> + ? ? ? ? ? ? if (!seccomp_filter_dynamic(src_id)) >> >> + ? ? ? ? ? ? ? ? ? ? continue; >> >> + ? ? ? ? ? ? if (id >= dst->count) { >> >> + ? ? ? ? ? ? ? ? ? ? ret = -EINVAL; >> >> + ? ? ? ? ? ? ? ? ? ? goto done; >> >> + ? ? ? ? ? ? } >> >> + ? ? ? ? ? ? str = get_filter_string(seccomp_dynamic_filter(src, src_id)); >> >> + ? ? ? ? ? ? filter = alloc_event_filter(nr, str); >> >> + ? ? ? ? ? ? if (IS_ERR(filter)) { >> >> + ? ? ? ? ? ? ? ? ? ? ret = PTR_ERR(filter); >> >> + ? ? ? ? ? ? ? ? ? ? goto done; >> >> + ? ? ? ? ? ? } >> >> + ? ? ? ? ? ? set_seccomp_filter(dst, nr, id, filter); >> >> + ? ? ? ? ? ? id++; >> >> + ? ? } >> >> + >> >> +done: >> >> + ? ? return ret; >> >> +} >> >> + >> >> +/** >> >> + * seccomp_extend_filter - appends more text to a syscall_nr's filter >> >> + * @filters: unattached filter object to operate on >> >> + * @syscall_nr: syscall number to update filters for >> >> + * @filter_string: string to append to the existing filter >> >> + * >> >> + * The new string will be &&'d to the original filter string to ensure that it >> >> + * always matches the existing predicates or less: >> >> + * ? (old_filter) && @filter_string >> >> + * A new seccomp_filters instance is returned on success and a ERR_PTR on >> >> + * failure. >> >> + */ >> >> +static int seccomp_extend_filter(struct seccomp_filters *filters, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int syscall_nr, char *filter_string) >> >> +{ >> >> + ? ? struct event_filter *filter; >> >> + ? ? uint16_t id = seccomp_filter_id(filters, syscall_nr); >> >> + ? ? char *merged = NULL; >> >> + ? ? int ret = -EINVAL, expected; >> >> + >> >> + ? ? /* No extending with a "1". */ >> >> + ? ? if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? filter = seccomp_dynamic_filter(filters, id); >> >> + ? ? ret = -ENOENT; >> >> + ? ? if (!filter) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL); >> >> + ? ? ret = -ENOMEM; >> >> + ? ? if (!merged) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s", >> >> + ? ? ? ? ? ? ? ? ? ? ? ? get_filter_string(filter), filter_string); >> >> + ? ? ret = -E2BIG; >> >> + ? ? if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? /* Free the old filter */ >> >> + ? ? free_event_filter(filter); >> >> + ? ? set_seccomp_filter(filters, syscall_nr, id, NULL); >> >> + >> >> + ? ? /* Replace it */ >> >> + ? ? filter = alloc_event_filter(syscall_nr, merged); >> >> + ? ? if (IS_ERR(filter)) { >> >> + ? ? ? ? ? ? ret = PTR_ERR(filter); >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + ? ? set_seccomp_filter(filters, syscall_nr, id, filter); >> >> + ? ? ret = 0; >> >> + >> >> +out: >> >> + ? ? kfree(merged); >> >> + ? ? return ret; >> >> +} >> >> + >> >> +/** >> >> + * seccomp_add_filter - adds a filter for an unfiltered syscall >> >> + * @filters: filters object to add a filter/action to >> >> + * @syscall_nr: system call number to add a filter for >> >> + * @filter_string: the filter string to apply >> >> + * >> >> + * Returns 0 on success and non-zero otherwise. >> >> + */ >> >> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? char *filter_string) >> >> +{ >> >> + ? ? struct event_filter *filter; >> >> + ? ? int ret = 0; >> >> + >> >> + ? ? if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) { >> >> + ? ? ? ? ? ? set_seccomp_filter(filters, syscall_nr, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?SECCOMP_ACTION_ALLOW, NULL); >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + >> >> + ? ? filter = alloc_event_filter(syscall_nr, filter_string); >> >> + ? ? if (IS_ERR(filter)) { >> >> + ? ? ? ? ? ? ret = PTR_ERR(filter); >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + ? ? /* Always add to the last slot available since additions are >> >> + ? ? ?* are only done one at a time. >> >> + ? ? ?*/ >> >> + ? ? set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter); >> >> +out: >> >> + ? ? return ret; >> >> +} >> >> + >> >> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */ >> >> +static int filter_match_current(struct event_filter *event_filter) >> >> +{ >> >> + ? ? int err = 0; >> >> +#ifdef CONFIG_FTRACE_SYSCALLS >> >> + ? ? uint8_t syscall_state[64]; >> >> + >> >> + ? ? memset(syscall_state, 0, sizeof(syscall_state)); >> >> + >> >> + ? ? /* The generic tracing entry can remain zeroed. */ >> >> + ? ? err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state), >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL); >> >> + ? ? if (err) >> >> + ? ? ? ? ? ? return 0; >> >> + >> >> + ? ? err = filter_match_preds(event_filter, syscall_state); >> >> +#endif >> >> + ? ? return err; >> >> +} >> >> + >> >> +static const char *syscall_nr_to_name(int syscall) >> >> +{ >> >> + ? ? const char *syscall_name = "unknown"; >> >> + ? ? struct syscall_metadata *data = syscall_nr_to_meta(syscall); >> >> + ? ? if (data) >> >> + ? ? ? ? ? ? syscall_name = data->name; >> >> + ? ? return syscall_name; >> >> +} >> >> + >> >> +static void filters_set_compat(struct seccomp_filters *filters) >> >> +{ >> >> +#ifdef CONFIG_COMPAT >> >> + ? ? if (is_compat_task()) >> >> + ? ? ? ? ? ? filters->flags.compat = 1; >> >> +#endif >> >> +} >> >> + >> >> +static inline int filters_compat_mismatch(struct seccomp_filters *filters) >> >> +{ >> >> + ? ? int ret = 0; >> >> + ? ? if (!filters) >> >> + ? ? ? ? ? ? return 0; >> >> +#ifdef CONFIG_COMPAT >> >> + ? ? if (!!(is_compat_task()) == filters->flags.compat) >> >> + ? ? ? ? ? ? ret = 1; >> >> +#endif >> >> + ? ? return ret; >> >> +} >> >> + >> >> +static inline int syscall_is_execve(int syscall) >> >> +{ >> >> + ? ? int nr = __NR_execve; >> >> +#ifdef CONFIG_COMPAT >> >> + ? ? if (is_compat_task()) >> >> + ? ? ? ? ? ? nr = __NR_seccomp_execve_32; >> >> +#endif >> >> + ? ? return syscall == nr; >> >> +} >> >> + >> >> +#ifndef KSTK_EIP >> >> +#define KSTK_EIP(x) 0L >> >> +#endif >> >> + >> >> +void seccomp_filter_log_failure(int syscall) >> >> +{ >> >> + ? ? pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n", >> >> + ? ? ? ? ? ? current->comm, task_pid_nr(current), syscall, >> >> + ? ? ? ? ? ? syscall_nr_to_name(syscall), KSTK_EIP(current)); >> >> +} >> >> + >> >> +/* put_seccomp_state - decrements the reference count of @orig and may free. */ >> >> +void put_seccomp_filters(struct seccomp_filters *orig) >> >> +{ >> >> + ? ? if (!orig) >> >> + ? ? ? ? ? ? return; >> >> + >> >> + ? ? if (atomic_dec_and_test(&orig->usage)) >> >> + ? ? ? ? ? ? __put_seccomp_filters(orig); >> >> +} >> >> + >> >> +/* get_seccomp_state - increments the reference count of @orig */ >> >> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig) >> > >> > Nit: the name does not match the comment. >> >> Will fix it here and above. Thanks! >> >> >> +{ >> >> + ? ? if (!orig) >> >> + ? ? ? ? ? ? return NULL; >> >> + ? ? atomic_inc(&orig->usage); >> >> + ? ? return orig; >> > >> > This is called in an RCU read-side critical section. ?What exactly is >> > RCU protecting? ?I would expect an rcu_dereference() or one of the >> > RCU list-traversal primitives somewhere, either here or at the caller. >> >> Ah, I spaced on rcu_dereference(). ?The goal was to make the >> assignment and replacement of the seccomp_filters pointer >> RCU-protected (in seccomp_state) so there's no concern over it being >> replaced partial on platforms where pointer assignments are non-atomic >> - such as via /proc//seccomp_filters access or a call via the >> exported symbols. ?Object lifetime is managed by reference counting so >> that I don't have to worry about extending the RCU read-side critical >> section by much or deal with pre-allocations. >> >> I'll add rcu_dereference() to all the get_seccomp_filters() uses where >> it makes sense, so that it is called safely. ?Just to make sure, does >> it make sense to continue to rcu protect the specific pointer? > > It might. ?The usual other options is to use a lock outside of the element > containing the reference count to protect reference-count manipulation. > If there is some convenient lock, especially if it is already held where > needed, then locking is more straightforward. ?Otherwise, RCU is usually > a reasonable option. I was concerned about the overhead a lock would have at each system call entry, but I didn't benchmark it to see. I'll add the rcu_dereference right away, then look into seeing whether there's a cleaner approach. I was trying to be overly protective of mutating any data internal to the filters through complete replacement on any change. I'll take a step back and see if >> >> +} >> >> + >> >> +/** >> >> + * seccomp_test_filters - tests 'current' against the given syscall >> >> + * @state: seccomp_state of current to use. >> >> + * @syscall: number of the system call to test >> >> + * >> >> + * Returns 0 on ok and non-zero on error/failure. >> >> + */ >> >> +int seccomp_test_filters(int syscall) >> >> +{ >> >> + ? ? uint16_t id; >> >> + ? ? struct event_filter *filter; >> >> + ? ? struct seccomp_filters *filters; >> >> + ? ? int ret = -EACCES; >> >> + >> >> + ? ? rcu_read_lock(); >> >> + ? ? filters = get_seccomp_filters(current->seccomp.filters); >> >> + ? ? rcu_read_unlock(); >> >> + >> >> + ? ? if (!filters) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? if (filters_compat_mismatch(filters)) { >> >> + ? ? ? ? ? ? pr_info("%s[%d]: seccomp_filter compat() mismatch.\n", >> >> + ? ? ? ? ? ? ? ? ? ? current->comm, task_pid_nr(current)); >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + >> >> + ? ? /* execve is never allowed. */ >> >> + ? ? if (syscall_is_execve(syscall)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? ret = 0; >> >> + ? ? id = seccomp_filter_id(filters, syscall); >> >> + ? ? if (seccomp_filter_allow(id)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? ret = -EACCES; >> >> + ? ? if (!seccomp_filter_dynamic(id)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? filter = seccomp_dynamic_filter(filters, id); >> >> + ? ? if (filter && filter_match_current(filter)) >> >> + ? ? ? ? ? ? ret = 0; >> >> +out: >> >> + ? ? put_seccomp_filters(filters); >> >> + ? ? return ret; >> >> +} >> >> + >> >> +/** >> >> + * seccomp_show_filters - prints the current filter state to a seq_file >> >> + * @filters: properly get()'d filters object >> >> + * @m: the prepared seq_file to receive the data >> >> + * >> >> + * Returns 0 on a successful write. >> >> + */ >> >> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m) >> >> +{ >> >> + ? ? int syscall; >> >> + ? ? seq_printf(m, "Mode: %d\n", current->seccomp.mode); >> >> + ? ? if (!filters) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? for (syscall = 0; syscall < NR_syscalls; ++syscall) { >> >> + ? ? ? ? ? ? uint16_t id = seccomp_filter_id(filters, syscall); >> >> + ? ? ? ? ? ? const char *filter_string = SECCOMP_FILTER_ALLOW; >> >> + ? ? ? ? ? ? if (seccomp_filter_deny(id)) >> >> + ? ? ? ? ? ? ? ? ? ? continue; >> >> + ? ? ? ? ? ? seq_printf(m, "%d (%s): ", >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? syscall, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? syscall_nr_to_name(syscall)); >> >> + ? ? ? ? ? ? if (seccomp_filter_dynamic(id)) >> >> + ? ? ? ? ? ? ? ? ? ? filter_string = get_filter_string( >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seccomp_dynamic_filter(filters, id)); >> >> + ? ? ? ? ? ? seq_printf(m, "%s\n", filter_string); >> >> + ? ? } >> >> +out: >> >> + ? ? return 0; >> >> +} >> >> +EXPORT_SYMBOL_GPL(seccomp_show_filters); >> >> + >> >> +/** >> >> + * seccomp_get_filter - copies the filter_string into "buf" >> >> + * @syscall_nr: system call number to look up >> >> + * @buf: destination buffer >> >> + * @bufsize: available space in the buffer. >> >> + * >> >> + * Context: User context only. This function may sleep on allocation and >> >> + * ? ? ? ? ?operates on current. current must be attempting a system call >> >> + * ? ? ? ? ?when this is called. >> >> + * >> >> + * Looks up the filter for the given system call number on current. ?If found, >> >> + * the string length of the NUL-terminated buffer is returned and < 0 is >> >> + * returned on error. The NUL byte is not included in the length. >> >> + */ >> >> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize) >> >> +{ >> >> + ? ? struct seccomp_filters *filters; >> >> + ? ? struct event_filter *filter; >> >> + ? ? long ret = -EINVAL; >> >> + ? ? uint16_t id; >> >> + >> >> + ? ? if (bufsize > SECCOMP_MAX_FILTER_LENGTH) >> >> + ? ? ? ? ? ? bufsize = SECCOMP_MAX_FILTER_LENGTH; >> >> + >> >> + ? ? rcu_read_lock(); >> >> + ? ? filters = get_seccomp_filters(current->seccomp.filters); >> >> + ? ? rcu_read_unlock(); >> >> + >> >> + ? ? if (!filters) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? ret = -ENOENT; >> >> + ? ? id = seccomp_filter_id(filters, syscall_nr); >> >> + ? ? if (seccomp_filter_deny(id)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? if (seccomp_filter_allow(id)) { >> >> + ? ? ? ? ? ? ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize); >> >> + ? ? ? ? ? ? goto copied; >> >> + ? ? } >> >> + >> >> + ? ? filter = seccomp_dynamic_filter(filters, id); >> >> + ? ? if (!filter) >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? ret = strlcpy(buf, get_filter_string(filter), bufsize); >> >> + >> >> +copied: >> >> + ? ? if (ret >= bufsize) { >> >> + ? ? ? ? ? ? ret = -ENOSPC; >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + ? ? /* Zero out any remaining buffer, just in case. */ >> >> + ? ? memset(buf + ret, 0, bufsize - ret); >> >> +out: >> >> + ? ? put_seccomp_filters(filters); >> >> + ? ? return ret; >> >> +} >> >> +EXPORT_SYMBOL_GPL(seccomp_get_filter); >> >> + >> >> +/** >> >> + * seccomp_clear_filter: clears the seccomp filter for a syscall. >> >> + * @syscall_nr: the system call number to clear filters for. >> >> + * >> >> + * Context: User context only. This function may sleep on allocation and >> >> + * ? ? ? ? ?operates on current. current must be attempting a system call >> >> + * ? ? ? ? ?when this is called. >> >> + * >> >> + * Returns 0 on success. >> >> + */ >> >> +long seccomp_clear_filter(int syscall_nr) >> >> +{ >> >> + ? ? struct seccomp_filters *filters = NULL, *orig_filters; >> >> + ? ? uint16_t id; >> >> + ? ? int ret = -EINVAL; >> >> + >> >> + ? ? rcu_read_lock(); >> >> + ? ? orig_filters = get_seccomp_filters(current->seccomp.filters); >> >> + ? ? rcu_read_unlock(); >> >> + >> >> + ? ? if (!orig_filters) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? if (filters_compat_mismatch(orig_filters)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? id = seccomp_filter_id(orig_filters, syscall_nr); >> >> + ? ? if (seccomp_filter_deny(id)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? /* Create a new filters object for the task */ >> >> + ? ? if (seccomp_filter_dynamic(id)) >> >> + ? ? ? ? ? ? filters = seccomp_filters_new(orig_filters->count - 1); >> >> + ? ? else >> >> + ? ? ? ? ? ? filters = seccomp_filters_new(orig_filters->count); >> >> + >> >> + ? ? if (IS_ERR(filters)) { >> >> + ? ? ? ? ? ? ret = PTR_ERR(filters); >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + >> >> + ? ? /* Copy, but drop the requested entry. */ >> >> + ? ? ret = seccomp_filters_copy(filters, orig_filters, syscall_nr); >> >> + ? ? if (ret) >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? get_seccomp_filters(filters); ?/* simplify the out: path */ >> >> + >> >> + ? ? rcu_assign_pointer(current->seccomp.filters, filters); >> > >> > What prevents two copies of seccomp_clear_filter() from running >> > concurrently? >> >> Nothing - the last one wins assignment, but the objects themselves >> should be internally consistent to the parallel calls. ?If that's a >> concern, a per-task writer mutex could be used just to ensure >> simultaneous calls to clear and set are performed serially. ?Would >> that make more sense? > > Here is the sequence of events that I am concerned about: > > o ? ? ? CPU 0 sets orig_filters to point to the current filters. > > o ? ? ? CPU 1 sets its local orig_filters to point to the current > ? ? ? ?set of filters. > > o ? ? ? Both CPUs allocate new filters and use rcu_assign_pointer() > ? ? ? ?to do the update. ?As you say, the last one wins, but it appears > ? ? ? ?to me that the first one leaks memory. > > o ? ? ? Both CPUs free the object referenced by their orig_filters, > ? ? ? ?which might or might not result in a double free, depending > ? ? ? ?on exactly what happens below. ?(You might actually be OK, > ? ? ? ?I didn't check -- leaking memory was enough for me to call > ? ? ? ?attention to this.) > > So yes, please use some kind of mutual exclusion. ?Not sure what you > mean by "per-task mutex", but whatever it is must prevent two different > tasks from acting on the same set of filters at the same time. ?The > thing that I call "per-task mutex" would -not- do that. Ah nice. Yeah that would most definitely leak as there would be a remaining increment for the task that isn't dropped. Since those interfaces acquire the filter itself from current->seccomp.filters, I was thinking a mutex in current, e.g., current->seccomp.write_mutex, would fit the bill to ensure that pointer isn't accessed for replacement. I'll look at this and the rcu usage to see if I'm taking the most logical approach. I pretty much always get locking wrong in some way and perhaps I can simplify further and get the correct guarantees. I really appreciate the keen observation and explanation! >> >> + ? ? synchronize_rcu(); >> >> + ? ? put_seccomp_filters(orig_filters); ?/* for the task */ >> >> +out: >> >> + ? ? put_seccomp_filters(orig_filters); ?/* for the get */ >> >> + ? ? put_seccomp_filters(filters); ?/* for the extra get */ >> >> + ? ? return ret; >> >> +} >> >> +EXPORT_SYMBOL_GPL(seccomp_clear_filter); >> >> + >> >> +/** >> >> + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall. >> >> + * @syscall_nr: system call number to apply the filter to. >> >> + * @filter: ftrace filter string to apply. >> >> + * >> >> + * Context: User context only. This function may sleep on allocation and >> >> + * ? ? ? ? ?operates on current. current must be attempting a system call >> >> + * ? ? ? ? ?when this is called. >> >> + * >> >> + * New filters may be added for system calls when the current task is >> >> + * not in a secure computing mode (seccomp). ?Otherwise, existing filters may >> >> + * be extended. >> >> + * >> >> + * Returns 0 on success or an errno on failure. >> >> + */ >> >> +long seccomp_set_filter(int syscall_nr, char *filter) >> >> +{ >> >> + ? ? struct seccomp_filters *filters = NULL, *orig_filters = NULL; >> >> + ? ? uint16_t id; >> >> + ? ? long ret = -EINVAL; >> >> + ? ? uint16_t filters_needed; >> >> + >> >> + ? ? if (!filter) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? filter = strstrip(filter); >> >> + ? ? /* Disallow empty strings. */ >> >> + ? ? if (filter[0] == 0) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? rcu_read_lock(); >> >> + ? ? orig_filters = get_seccomp_filters(current->seccomp.filters); >> >> + ? ? rcu_read_unlock(); >> >> + >> >> + ? ? /* After the first call, compatibility mode is selected permanently. */ >> >> + ? ? ret = -EACCES; >> >> + ? ? if (filters_compat_mismatch(orig_filters)) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? filters_needed = orig_filters ? orig_filters->count : 0; >> >> + ? ? id = seccomp_filter_id(orig_filters, syscall_nr); >> >> + ? ? if (seccomp_filter_deny(id)) { >> >> + ? ? ? ? ? ? /* Don't allow DENYs to be changed when in a seccomp mode */ >> >> + ? ? ? ? ? ? ret = -EACCES; >> >> + ? ? ? ? ? ? if (current->seccomp.mode) >> >> + ? ? ? ? ? ? ? ? ? ? goto out; >> >> + ? ? ? ? ? ? filters_needed++; >> >> + ? ? } >> >> + >> >> + ? ? filters = seccomp_filters_new(filters_needed); >> >> + ? ? if (IS_ERR(filters)) { >> >> + ? ? ? ? ? ? ret = PTR_ERR(filters); >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + >> >> + ? ? filters_set_compat(filters); >> >> + ? ? if (orig_filters) { >> >> + ? ? ? ? ? ? ret = seccomp_filters_copy(filters, orig_filters, -1); >> >> + ? ? ? ? ? ? if (ret) >> >> + ? ? ? ? ? ? ? ? ? ? goto out; >> >> + ? ? } >> >> + >> >> + ? ? if (seccomp_filter_deny(id)) >> >> + ? ? ? ? ? ? ret = seccomp_add_filter(filters, syscall_nr, filter); >> >> + ? ? else >> >> + ? ? ? ? ? ? ret = seccomp_extend_filter(filters, syscall_nr, filter); >> >> + ? ? if (ret) >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? get_seccomp_filters(filters); ?/* simplify the error paths */ >> >> + >> >> + ? ? rcu_assign_pointer(current->seccomp.filters, filters); >> > >> > Again, what prevents two copies of seccomp_set_filter() from running >> > concurrently? >> >> Same deal - nothing, but I'd be happy to add a guard if it makes sense. >> >> Thanks! >> >> >> + ? ? synchronize_rcu(); >> >> + ? ? put_seccomp_filters(orig_filters); ?/* for the task */ >> >> +out: >> >> + ? ? put_seccomp_filters(orig_filters); ?/* for the get */ >> >> + ? ? put_seccomp_filters(filters); ?/* for get or task, on err */ >> >> + ? ? return ret; >> >> +} >> >> +EXPORT_SYMBOL_GPL(seccomp_set_filter); >> >> + >> >> +long prctl_set_seccomp_filter(unsigned long syscall_nr, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? char __user *user_filter) >> >> +{ >> >> + ? ? int nr; >> >> + ? ? long ret; >> >> + ? ? char *filter = NULL; >> >> + >> >> + ? ? ret = -EINVAL; >> >> + ? ? if (syscall_nr >= NR_syscalls) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? ret = -EFAULT; >> >> + ? ? if (!user_filter) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL); >> >> + ? ? ret = -ENOMEM; >> >> + ? ? if (!filter) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? ret = -EFAULT; >> >> + ? ? if (strncpy_from_user(filter, user_filter, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? SECCOMP_MAX_FILTER_LENGTH - 1) < 0) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? nr = (int) syscall_nr; >> >> + ? ? ret = seccomp_set_filter(nr, filter); >> >> + >> >> +out: >> >> + ? ? kfree(filter); >> >> + ? ? return ret; >> >> +} >> >> + >> >> +long prctl_clear_seccomp_filter(unsigned long syscall_nr) >> >> +{ >> >> + ? ? int nr = -1; >> >> + ? ? long ret; >> >> + >> >> + ? ? ret = -EINVAL; >> >> + ? ? if (syscall_nr >= NR_syscalls) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? nr = (int) syscall_nr; >> >> + ? ? ret = seccomp_clear_filter(nr); >> >> + >> >> +out: >> >> + ? ? return ret; >> >> +} >> >> + >> >> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long available) >> >> +{ >> >> + ? ? int ret, nr; >> >> + ? ? unsigned long copied; >> >> + ? ? char *buf = NULL; >> >> + ? ? ret = -EINVAL; >> >> + ? ? if (!available) >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? /* Ignore extra buffer space. */ >> >> + ? ? if (available > SECCOMP_MAX_FILTER_LENGTH) >> >> + ? ? ? ? ? ? available = SECCOMP_MAX_FILTER_LENGTH; >> >> + >> >> + ? ? ret = -EINVAL; >> >> + ? ? if (syscall_nr >= NR_syscalls) >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? nr = (int) syscall_nr; >> >> + >> >> + ? ? ret = -ENOMEM; >> >> + ? ? buf = kmalloc(available, GFP_KERNEL); >> >> + ? ? if (!buf) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? ret = seccomp_get_filter(nr, buf, available); >> >> + ? ? if (ret < 0) >> >> + ? ? ? ? ? ? goto out; >> >> + >> >> + ? ? /* Include the NUL byte in the copy. */ >> >> + ? ? copied = copy_to_user(dst, buf, ret + 1); >> >> + ? ? ret = -ENOSPC; >> >> + ? ? if (copied) >> >> + ? ? ? ? ? ? goto out; >> >> + ? ? ret = 0; >> >> +out: >> >> + ? ? kfree(buf); >> >> + ? ? return ret; >> >> +} >> >> diff --git a/kernel/sys.c b/kernel/sys.c >> >> index af468ed..ed60d06 100644 >> >> --- a/kernel/sys.c >> >> +++ b/kernel/sys.c >> >> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, >> >> ? ? ? ? ? ? ? case PR_SET_ENDIAN: >> >> ? ? ? ? ? ? ? ? ? ? ? error = SET_ENDIAN(me, arg2); >> >> ? ? ? ? ? ? ? ? ? ? ? break; >> >> - >> >> ? ? ? ? ? ? ? case PR_GET_SECCOMP: >> >> ? ? ? ? ? ? ? ? ? ? ? error = prctl_get_seccomp(); >> >> ? ? ? ? ? ? ? ? ? ? ? break; >> >> ? ? ? ? ? ? ? case PR_SET_SECCOMP: >> >> ? ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp(arg2); >> >> ? ? ? ? ? ? ? ? ? ? ? break; >> >> + ? ? ? ? ? ? case PR_SET_SECCOMP_FILTER: >> >> + ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp_filter(arg2, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?(char __user *) arg3); >> >> + ? ? ? ? ? ? ? ? ? ? break; >> >> + ? ? ? ? ? ? case PR_CLEAR_SECCOMP_FILTER: >> >> + ? ? ? ? ? ? ? ? ? ? error = prctl_clear_seccomp_filter(arg2); >> >> + ? ? ? ? ? ? ? ? ? ? break; >> >> + ? ? ? ? ? ? case PR_GET_SECCOMP_FILTER: >> >> + ? ? ? ? ? ? ? ? ? ? error = prctl_get_seccomp_filter(arg2, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?(char __user *) arg3, >> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?arg4); >> >> + ? ? ? ? ? ? ? ? ? ? break; >> >> ? ? ? ? ? ? ? case PR_GET_TSC: >> >> ? ? ? ? ? ? ? ? ? ? ? error = GET_TSC_CTL(arg2); >> >> ? ? ? ? ? ? ? ? ? ? ? break; >> >> diff --git a/security/Kconfig b/security/Kconfig >> >> index 95accd4..c76adf2 100644 >> >> --- a/security/Kconfig >> >> +++ b/security/Kconfig >> >> @@ -2,6 +2,10 @@ >> >> ?# Security configuration >> >> ?# >> >> >> >> +# Make seccomp filter Kconfig switch below available >> >> +config HAVE_SECCOMP_FILTER >> >> + ? ? ? bool >> >> + >> >> ?menu "Security options" >> >> >> >> ?config KEYS >> >> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT >> >> >> >> ? ? ? ? If you are unsure how to answer this question, answer N. >> >> >> >> +config SECCOMP_FILTER >> >> + ? ? bool "Enable seccomp-based system call filtering" >> >> + ? ? select SECCOMP >> >> + ? ? depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL >> >> + ? ? help >> >> + ? ? ? This kernel feature expands CONFIG_SECCOMP to allow computing >> >> + ? ? ? in environments with reduced kernel access dictated by the >> >> + ? ? ? application itself through prctl calls. ?If >> >> + ? ? ? CONFIG_FTRACE_SYSCALLS is available, then system call >> >> + ? ? ? argument-based filtering predicates may be used. >> >> + >> >> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more detail. >> >> + >> >> ?config SECURITY >> >> ? ? ? bool "Enable different security models" >> >> ? ? ? depends on SYSFS >> >> -- >> >> 1.7.0.4 >> >> >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> >> the body of a message to majordomo@vger.kernel.org >> >> More majordomo info at ?http://vger.kernel.org/majordomo-info.html >> >> Please read the FAQ at ?http://www.tux.org/lkml/ >> > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at ?http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at ?http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/