2011-04-28 03:10:27

by Will Drewry

[permalink] [raw]
Subject: [PATCH 2/7] tracing: split out syscall_trace_enter construction

perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
infrastructure. As such, many of the helpers that target perf can be
split into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
consumer interface.

This change splits out syscall_trace_enter construction from
perf_syscall_enter for current into two helpers:
- ftrace_syscall_enter_state
- ftrace_syscall_enter_state_size

And adds another helper for completeness:
- ftrace_syscall_exit_state_size

These helpers allow for shared code between perf ftrace events and
any other consumers of CONFIG_FTRACE_SYSCALLS events. The proposed
seccomp_filter patches use this code.

Signed-off-by: Will Drewry <[email protected]>
---
include/trace/syscall.h | 4 ++
kernel/trace/trace_syscalls.c | 96 +++++++++++++++++++++++++++++++++++------
2 files changed, 86 insertions(+), 14 deletions(-)

diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 31966a4..242ae04 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -41,6 +41,10 @@ extern int reg_event_syscall_exit(struct ftrace_event_call *call);
extern void unreg_event_syscall_exit(struct ftrace_event_call *call);
extern int
ftrace_format_syscall(struct ftrace_event_call *call, struct trace_seq *s);
+extern int ftrace_syscall_enter_state(u8 *buf, size_t available,
+ struct trace_entry **entry);
+extern size_t ftrace_syscall_enter_state_size(int nb_args);
+extern size_t ftrace_syscall_exit_state_size(void);
enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event);
enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index ee7b5a0..f37f120 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -95,7 +95,7 @@ find_syscall_meta(unsigned long syscall)
return NULL;
}

-static struct syscall_metadata *syscall_nr_to_meta(int nr)
+struct syscall_metadata *syscall_nr_to_meta(int nr)
{
if (!syscalls_metadata || nr >= NR_syscalls || nr < 0)
return NULL;
@@ -498,7 +498,7 @@ static int sys_perf_refcount_exit;
static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
{
struct syscall_metadata *sys_data;
- struct syscall_trace_enter *rec;
+ void *buf;
struct hlist_head *head;
int syscall_nr;
int rctx;
@@ -513,25 +513,22 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
return;

/* get the size after alignment with the u32 buffer size field */
- size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
- size = ALIGN(size + sizeof(u32), sizeof(u64));
- size -= sizeof(u32);
+ size = ftrace_syscall_enter_state_size(sys_data->nb_args);

if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
"perf buffer not large enough"))
return;

- rec = (struct syscall_trace_enter *)perf_trace_buf_prepare(size,
- sys_data->enter_event->event.type, regs, &rctx);
- if (!rec)
+ buf = perf_trace_buf_prepare(size, sys_data->enter_event->event.type,
+ regs, &rctx);
+ if (!buf)
return;

- rec->nr = syscall_nr;
- syscall_get_arguments(current, regs, 0, sys_data->nb_args,
- (unsigned long *)&rec->args);
+ /* The only error conditions in this helper are handled above. */
+ ftrace_syscall_enter_state(buf, size, NULL);

head = this_cpu_ptr(sys_data->enter_event->perf_events);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+ perf_trace_buf_submit(buf, size, rctx, 0, 1, regs, head);
}

int perf_sysenter_enable(struct ftrace_event_call *call)
@@ -587,8 +584,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
return;

/* We can probably do that at build time */
- size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
- size -= sizeof(u32);
+ size = ftrace_syscall_exit_state_size();

/*
* Impossible, but be paranoid with the future
@@ -688,3 +684,75 @@ static int syscall_exit_register(struct ftrace_event_call *event,
}
return 0;
}
+
+/* ftrace_syscall_enter_state_size - returns the state size required.
+ *
+ * @nb_args: number of system call args expected.
+ * a negative value implies the maximum allowed.
+ */
+size_t ftrace_syscall_enter_state_size(int nb_args)
+{
+ /* syscall_get_arguments only supports up to 6 arguments. */
+ int arg_count = (nb_args >= 0 ? nb_args : 6);
+ size_t size = (sizeof(unsigned long) * arg_count) +
+ sizeof(struct syscall_trace_enter);
+ size = ALIGN(size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+ return size;
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_enter_state_size);
+
+size_t ftrace_syscall_exit_state_size(void)
+{
+ return ALIGN(sizeof(struct syscall_trace_exit) + sizeof(u32),
+ sizeof(u64)) - sizeof(u32);
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_exit_state_size);
+
+/* ftrace_syscall_enter_state - build state for filter matching
+ *
+ * @buf: buffer to populate with current task state for matching
+ * @available: size available for use in the buffer.
+ * @entry: optional pointer to the trace_entry member of the state.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ * If @entry is NULL, it will be ignored.
+ */
+int ftrace_syscall_enter_state(u8 *buf, size_t available,
+ struct trace_entry **entry)
+{
+ struct syscall_trace_enter *sys_enter;
+ struct syscall_metadata *sys_data;
+ int size;
+ int syscall_nr;
+ struct pt_regs *regs = task_pt_regs(current);
+
+ syscall_nr = syscall_get_nr(current, regs);
+ if (syscall_nr < 0)
+ return -EINVAL;
+
+ sys_data = syscall_nr_to_meta(syscall_nr);
+ if (!sys_data)
+ return -EINVAL;
+
+ /* Determine the actual size needed. */
+ size = sizeof(unsigned long) * sys_data->nb_args +
+ sizeof(struct syscall_trace_enter);
+ size = ALIGN(size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+
+ BUG_ON(size > available);
+ sys_enter = (struct syscall_trace_enter *)buf;
+
+ /* Populating the struct trace_sys_enter is left to the caller, but
+ * a pointer is returned to encourage opacity.
+ */
+ if (entry)
+ *entry = &sys_enter->ent;
+
+ sys_enter->nr = syscall_nr;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+ sys_enter->args);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_enter_state);
--
1.7.0.4


2011-04-28 03:11:14

by Will Drewry

[permalink] [raw]
Subject: [PATCH 4/7] seccomp_filter: add process state reporting

Adds seccomp and seccomp_filter status reporting to proc.
/proc/<pid>/status will include a Seccomp field, and
/proc/<pid>/seccomp_filter will provide read-only access
to the current filter and bitmask set for seccomp_filters.

Signed-off-by: Will Drewry <[email protected]>
---
fs/proc/array.c | 21 +++++++++++++++++++++
fs/proc/base.c | 25 +++++++++++++++++++++++++
2 files changed, 46 insertions(+), 0 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 5e4f776..c35ec60 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -77,6 +77,7 @@
#include <linux/cpuset.h>
#include <linux/rcupdate.h>
#include <linux/delayacct.h>
+#include <linux/seccomp.h>
#include <linux/seq_file.h>
#include <linux/pid_namespace.h>
#include <linux/ptrace.h>
@@ -337,6 +338,19 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
seq_putc(m, '\n');
}

+static void task_show_seccomp(struct seq_file *m, struct task_struct *p) {
+#if defined(CONFIG_SECCOMP)
+ int mode;
+ struct seccomp_state* state;
+ rcu_read_lock();
+ state = get_seccomp_state(rcu_dereference(p->seccomp.state));
+ mode = state ? state->mode : 0;
+ rcu_read_unlock();
+ put_seccomp_state(state);
+ seq_printf(m, "Seccomp:\t%d\n", mode);
+#endif
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -354,6 +368,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_cpus_allowed(m, task);
cpuset_task_status_allowed(m, task);
task_context_switch_counts(m, task);
+ task_show_seccomp(m, task);
return 0;
}

@@ -544,3 +559,9 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,

return 0;
}
+
+int proc_pid_seccomp(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ return 0;
+}
diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..4b6f0c7 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -73,6 +73,7 @@
#include <linux/security.h>
#include <linux/ptrace.h>
#include <linux/tracehook.h>
+#include <linux/seccomp.h>
#include <linux/cgroup.h>
#include <linux/cpuset.h>
#include <linux/audit.h>
@@ -579,6 +580,24 @@ static int proc_pid_syscall(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */

+#ifdef CONFIG_SECCOMP_FILTER
+/*
+ * Print out the current seccomp filter set for the task.
+ */
+int proc_pid_seccomp_filter_show(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ struct seccomp_state *state;
+
+ rcu_read_lock();
+ state = get_seccomp_state(task->seccomp.state);
+ rcu_read_unlock();
+ seccomp_show_filters(state, m);
+ put_seccomp_state(state);
+ return 0;
+}
+#endif
+
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
@@ -2838,6 +2857,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUGO, proc_pid_syscall),
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ ONE("seccomp_filter", S_IRUSR, proc_pid_seccomp_filter_show),
+#endif
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
@@ -3180,6 +3202,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUGO, proc_pid_syscall),
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ ONE("seccomp_filter", S_IRUSR, proc_pid_seccomp_filter_show),
+#endif
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
--
1.7.0.4

2011-04-28 03:11:19

by Will Drewry

[permalink] [raw]
Subject: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
implemented presently, and what it may be used for. In addition,
the limitations and caveats of the proposed implementation are
included.

Signed-off-by: Will Drewry <[email protected]>
---
Documentation/trace/seccomp_filter.txt | 75 ++++++++++++++++++++++++++++++++
1 files changed, 75 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/seccomp_filter.txt

diff --git a/Documentation/trace/seccomp_filter.txt b/Documentation/trace/seccomp_filter.txt
new file mode 100644
index 0000000..6a0fd33
--- /dev/null
+++ b/Documentation/trace/seccomp_filter.txt
@@ -0,0 +1,75 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the
+application. As system calls change and mature, bugs are found and
+quashed. A certain subset of userland applications benefit by having
+a reduce set of available system calls. The reduced set reduces the
+total kernel surface exposed to the application. System call filtering
+is meant for use with those applications.
+
+The implementation currently leverages both the existing seccomp
+infrastructure and the kernel tracing infrastructure. By centralizing
+hooks for attack surface reduction in seccomp, it is possible to assure
+attention to security that is less relevant in normal ftrace scenarios,
+such as time of check, time of use attacks. However, ftrace provides a
+rich, human-friendly environment for specifying system calls by name and
+expected arguments. (As such, this requires FTRACE_SYSCALLS.)
+
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that, policy for
+logical behavior and information flow should be managed with an LSM of your
+choosing.
+
+
+Usage
+-----
+
+An additional seccomp mode is exposed through mode '2'. This mode
+depends on CONFIG_SECCOMP_FILTER which in turn depends on
+CONFIG_FTRACE_SYSCALLS.
+
+A collection of filters may be supplied via prctl, and the current set of
+filters is exposed in /proc/<pid>/seccomp_filter.
+
+For instance,
+ const char filters[] =
+ "sys_read: (fd == 1) || (fd == 2)\n"
+ "sys_write: (fd == 0)\n"
+ "sys_exit: 1\n"
+ "sys_exit_group: 1\n"
+ "on_next_syscall: 1";
+ prctl(PR_SET_SECCOMP, 2, filters);
+
+This will setup system call filters for read, write, and exit where reading can
+be done only from fds 1 and 2 and writing to fd 0. The "on_next_syscall" directive tells
+seccomp to not enforce the ruleset until after the next system call is run. This allows
+for launchers to apply system call filters to a binary before executing it.
+
+Once enabled, the access may only be reduced. For example, a set of filters may be:
+
+ sys_read: 1
+ sys_write: 1
+ sys_mmap: 1
+ sys_prctl: 1
+
+Then it may call the following to drop mmap access:
+ prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
+
+
+Caveats
+-------
+
+The system call names come from ftrace events. At present, many system
+calls are not hooked - such as x86's ptregs wrapped system calls.
+
+In addition compat_task()s will not be supported until a sys32s begin
+being hooked.
--
1.7.0.4

2011-04-28 03:11:45

by Will Drewry

[permalink] [raw]
Subject: [PATCH 7/7] arch/x86: hook int returning system calls

Using the newly specified __SYSCALL_DEFINEx helpers, redefine
int-returning system calls with the macro to enable ftrace access.

Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/kernel/ldt.c | 5 +++--
arch/x86/kernel/tls.c | 7 +++++--
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index ea69726..3f1160c 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -10,6 +10,7 @@
#include <linux/gfp.h>
#include <linux/sched.h>
#include <linux/string.h>
+#include <linux/syscalls.h>
#include <linux/mm.h>
#include <linux/smp.h>
#include <linux/vmalloc.h>
@@ -245,8 +246,8 @@ out:
return error;
}

-asmlinkage int sys_modify_ldt(int func, void __user *ptr,
- unsigned long bytecount)
+__SYSCALL_DEFINEx(asmlinkage, int, 3, _modify_ldt,
+ int, func, void __user *, ptr, unsigned long, bytecount)
{
int ret = -ENOSYS;

diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c
index 6bb7b85..0f27a6b 100644
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -1,6 +1,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>
#include <linux/user.h>
#include <linux/regset.h>

@@ -90,7 +91,8 @@ int do_set_thread_area(struct task_struct *p, int idx,
return 0;
}

-asmlinkage int sys_set_thread_area(struct user_desc __user *u_info)
+__SYSCALL_DEFINEx(asmlinkage, int, 1, _set_thread_area,
+ struct user_desc __user *, u_info)
{
int ret = do_set_thread_area(current, -1, u_info, 1);
asmlinkage_protect(1, ret, u_info);
@@ -140,7 +142,8 @@ int do_get_thread_area(struct task_struct *p, int idx,
return 0;
}

-asmlinkage int sys_get_thread_area(struct user_desc __user *u_info)
+__SYSCALL_DEFINEx(asmlinkage, int, 1, _get_thread_area,
+ struct user_desc __user *, u_info)
{
int ret = do_get_thread_area(current, -1, u_info);
asmlinkage_protect(1, ret, u_info);
--
1.7.0.4

2011-04-28 03:16:09

by Will Drewry

[permalink] [raw]
Subject: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

This change adds a new seccomp mode based on the work by
[email protected]. This mode comes with a bitmask of NR_syscalls size and
an optional linked list of seccomp_filter objects. When in mode 2, all
system calls are first checked against the bitmask to determine if they
are allowed or denied. If allowed, the list of filters is checked for
the given syscall number. If all filter predicates for the system call
match or the system call was allowed without restriction, the process
continues. Otherwise, it is killed and a KERN_INFO notification is
posted.

The filter language itself is provided by the ftrace filter engine.
Related patches tweak to the perf filter trace and free allow the calls
to be shared. Filters inherit their understanding of types and arguments
for each system call from the CONFIG_FTRACE_SYSCALLS subsystem which
predefines this information in syscall_metadata associated enter_event
(and exit_event) structures.

The result is that a process may reduce its available interfaces to
the kernel through prctl() without knowing the appropriate system call
number a priori and with the flexibility of filtering based on
register-stored arguments. (String checks suffer from TOCTOU issues and
should be left to LSMs to provide policy for! Don't get greedy :)

A sample filterset for a process that only needs to interact over stdin
and stdout and exit cleanly is shown below:
sys_read: fd == 0
sys_write: fd == 1
sys_exit_group: 1

The filters may be specified once prior to entering the reduced access
state:
prctl(PR_SET_SECCOMP, 2, filters);
If prctl() is in the allowed bitmask, the process may futher reduce
privileges by dropping system calls from the allowed set.

The only other twist is that it is possible to delay enforcement by one
system call by supplying a "on_next_syscall: 1" 'filter'. This allows
for a launcher process to fork(), prctl(), then execve() leaving the
launched binary in a filtered state.

Implementation-wise, seccomp.c now uses seccomp_state struct which is
managed using RCU primitives. It contains the system call bitmask and
a linked list of seccomp_filters (which are also managed as an RCU
list). All mutations (barring one optional bit flip) to the
seccomp_state are always done on a duplicate of the current state value
which is swapped on prior to use.

Signed-off-by: Will Drewry <[email protected]>
---
include/linux/seccomp.h | 98 +++++++++++-
include/trace/syscall.h | 2 +
kernel/Makefile | 1 +
kernel/fork.c | 8 +
kernel/seccomp.c | 144 +++++++++++++++--
kernel/seccomp_filter.c | 428 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 11 +-
kernel/trace/Kconfig | 10 +
8 files changed, 687 insertions(+), 15 deletions(-)
create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..16703c4 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -4,10 +4,33 @@

#ifdef CONFIG_SECCOMP

+#include <linux/list.h>
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 1, the process is under standard seccomp rules
+ * is 2, the process is only allowed to make system calls where
+ * the corresponding bit is set in bitmask and any
+ * associated filters evaluate successfully.
+ * @usage: number of references to the current instance.
+ * @bitmask: a mask of allowed or filtered system calls and additional flags.
+ * @filter_count: number of seccomp filters in @filters.
+ * @filters: list of seccomp_filter entries for system calls.
+ */
+struct seccomp_state {
+ uint16_t mode;
+ atomic_t usage;
+ DECLARE_BITMAP(bitmask, NR_syscalls + 1);
+ int filter_count;
+ struct list_head filters;
+};
+
+typedef struct { struct seccomp_state *state; } seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -16,8 +39,14 @@ static inline void secure_computing(int this_syscall)
__secure_computing(this_syscall);
}

+extern struct seccomp_state *seccomp_state_new(void);
+extern struct seccomp_state *seccomp_state_dup(const struct seccomp_state *);
+extern struct seccomp_state *get_seccomp_state(struct seccomp_state *);
+extern void put_seccomp_state(struct seccomp_state *);
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+extern long prctl_set_seccomp(unsigned long, char *);
+
+#define SECCOMP_MAX_FILTER_LENGTH 16384

#else /* CONFIG_SECCOMP */

@@ -27,16 +56,79 @@ typedef struct { } seccomp_t;

#define secure_computing(x) do { } while (0)

+#define SECCOMP_MAX_FILTER_LENGTH 0
+
static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}

-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long arg2, unsigned char *filters)
{
return -EINVAL;
}

+static inline struct seccomp_state *seccomp_state_new(void)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *seccomp_state_dup(
+ const struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *get_seccomp_state(
+ struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline void put_seccomp_state(struct seccomp_state *state)
+{
+}
+
#endif /* CONFIG_SECCOMP */

+#if defined(CONFIG_SECCOMP) && defined(CONFIG_SECCOMP_FILTER)
+
+extern int seccomp_copy_all_filters(struct list_head *,
+ const struct list_head *);
+extern void seccomp_drop_all_filters(struct seccomp_state *);
+
+extern int seccomp_parse_filters(struct seccomp_state *, char *);
+extern int seccomp_test_filters(struct seccomp_state *, int);
+
+extern int seccomp_show_filters(struct seccomp_state *, struct seq_file *);
+
+#else /* CONFIG_SECCOMP && CONFIG_SECCOMP_FILTER */
+
+static inline int seccomp_test_filters(struct seccomp_state *state, int nr)
+{
+ return -EINVAL;
+}
+
+static inline int seccomp_parse_filters(struct seccomp_state *state,
+ char *filters)
+{
+ return -EINVAL;
+}
+
+static inline int seccomp_show_filters(struct seccomp_state *state,
+ struct seq_file *m)
+{
+ return -EINVAL;
+}
+
+extern inline int seccomp_copy_all_filters(struct list_head *dst,
+ const struct list_head *src)
+{
+ return 0;
+}
+
+static inline void seccomp_drop_all_filters(struct seccomp_state *state) { }
+
+#endif /* CONFIG_SECCOMP && CONFIG_SECCOMP_FILTER */
+
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..1f72fce 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
extern unsigned long arch_syscall_addr(int nr);
extern int init_syscall_trace(struct ftrace_event_call *call);

+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
extern int reg_event_syscall_enter(struct ftrace_event_call *call);
extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
extern int reg_event_syscall_exit(struct ftrace_event_call *call);
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..a4b21fb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SECCOMP_FILTER) += seccomp_filter.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..bdcf70b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,9 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+#ifdef CONFIG_SECCOMP
+ put_seccomp_state(tsk->seccomp.state);
+#endif
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -280,6 +284,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
if (err)
goto out;

+#ifdef CONFIG_SECCOMP
+ tsk->seccomp.state = get_seccomp_state(orig->seccomp.state);
+#endif
+
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..1bee87c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -8,10 +8,11 @@

#include <linux/seccomp.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/compat.h>

/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+#define NR_SECCOMP_MODES 2

/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,9 +33,11 @@ static int mode1_syscalls_32[] = {

void __secure_computing(int this_syscall)
{
- int mode = current->seccomp.mode;
+ int mode = -1;
int * syscall;
-
+ /* Do we need an RCU read lock to access current's state? */
+ if (current->seccomp.state)
+ mode = current->seccomp.state->mode;
switch (mode) {
case 1:
syscall = mode1_syscalls;
@@ -47,6 +50,16 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+ case 2:
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ /* XXX: No compat support yet. */
+ break;
+#endif
+ if (!seccomp_test_filters(current->seccomp.state,
+ this_syscall))
+ return;
+ break;
default:
BUG();
}
@@ -57,30 +70,139 @@ void __secure_computing(int this_syscall)
do_exit(SIGKILL);
}

+/* seccomp_state_new - allocate a new state object. */
+struct seccomp_state *seccomp_state_new()
+{
+ struct seccomp_state *new = kzalloc(sizeof(struct seccomp_state),
+ GFP_KERNEL);
+ if (!new)
+ return NULL;
+ atomic_set(&new->usage, 1);
+ INIT_LIST_HEAD(&new->filters);
+ return new;
+}
+
+/* seccomp_state_dup - copies an existing state object. */
+struct seccomp_state *seccomp_state_dup(const struct seccomp_state *orig)
+{
+ int err;
+ struct seccomp_state *new = seccomp_state_new();
+
+ err = -ENOMEM;
+ if (!new)
+ goto fail;
+ new->mode = orig->mode;
+ memcpy(new->bitmask, orig->bitmask, sizeof(new->bitmask));
+ err = seccomp_copy_all_filters(&new->filters,
+ &orig->filters);
+ if (err)
+ goto fail;
+ new->filter_count = orig->filter_count;
+
+ return new;
+fail:
+ put_seccomp_state(new);
+ return NULL;
+}
+
+/* get_seccomp_state - increments the reference count of @orig */
+struct seccomp_state *get_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return NULL;
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+static void __put_seccomp_state(struct seccomp_state *orig)
+{
+ WARN_ON(atomic_read(&orig->usage));
+ seccomp_drop_all_filters(orig);
+ kfree(orig);
+}
+
+/* put_seccomp_state - decrements the reference count of @orig and may free. */
+void put_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return;
+
+ if (atomic_dec_and_test(&orig->usage))
+ __put_seccomp_state(orig);
+}
+
long prctl_get_seccomp(void)
{
- return current->seccomp.mode;
+ if (!current->seccomp.state)
+ return 0;
+ return current->seccomp.state->mode;
}

-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, char *filters)
{
long ret;
+ struct seccomp_state *state, *orig_state;

- /* can set it only once to be even more secure */
+ rcu_read_lock();
+ orig_state = get_seccomp_state(rcu_dereference(current->seccomp.state));
+ rcu_read_unlock();
+
+ /* mode 1 can only be set once, but mode 2 may have access reduced. */
ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (orig_state && orig_state->mode != 2 && seccomp_mode != 2)
goto out;

ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ if (!seccomp_mode || seccomp_mode > NR_SECCOMP_MODES)
+ goto out;
+
+ /* Prepare to modify the seccomp state by dropping the shared
+ * reference after duplicating it.
+ */
+ state = (orig_state ? seccomp_state_dup(orig_state) :
+ seccomp_state_new());
+ if (!state) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ ret = 0;
+ switch (seccomp_mode) {
+ case 1:
#ifdef TIF_NOTSC
disable_TSC();
#endif
+ state->mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
ret = 0;
+ break;
+ case 2:
+ if (filters) {
+ ret = seccomp_parse_filters(state, filters);
+ /* No partial applications. */
+ if (ret)
+ goto free_state;
+ }
+
+ if (!state->mode) {
+ state->mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+ }
+ break;
+ default:
+ ret = -EINVAL;
}

- out:
+ rcu_assign_pointer(current->seccomp.state, state);
+ synchronize_rcu();
+ put_seccomp_state(orig_state); /* for the get */
+
+out:
+ put_seccomp_state(orig_state); /* for the task */
+ return ret;
+
+free_state:
+ put_seccomp_state(orig_state); /* for the get */
+ put_seccomp_state(state); /* drop the dup */
return ret;
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..8f7878b
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,428 @@
+/* filter engine-based seccomp system call filtering.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
+ */
+
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/compat.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+#define SECCOMP_MAX_FILTER_COUNT 16
+#define SECCOMP_DIRECTIVE_ON_NEXT "on_next_syscall"
+#define SECCOMP_ON_NEXT_BIT NR_syscalls
+
+struct seccomp_filter {
+ struct list_head list;
+ struct rcu_head rcu;
+ int syscall_nr;
+ struct syscall_metadata *data;
+ struct event_filter *event_filter;
+};
+
+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+ const char *filter_string)
+{
+ int err;
+ struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+ GFP_KERNEL);
+ if (!filter)
+ goto fail;
+
+ filter->syscall_nr = syscall_nr;
+ filter->data = syscall_nr_to_meta(syscall_nr);
+ if (!filter->data)
+ goto fail;
+
+ err = ftrace_parse_filter(&filter->event_filter,
+ filter->data->enter_event->event.type,
+ filter_string);
+ if (err)
+ goto fail;
+
+ return filter;
+
+fail:
+ kfree(filter);
+ return NULL;
+}
+
+static void free_seccomp_filter(struct seccomp_filter *filter)
+{
+ ftrace_free_filter(filter->event_filter);
+ kfree(filter);
+}
+
+static void free_seccomp_filter_rcu(struct rcu_head *head)
+{
+ free_seccomp_filter(container_of(head,
+ struct seccomp_filter,
+ rcu));
+}
+
+static struct seccomp_filter *copy_seccomp_filter(struct seccomp_filter *orig)
+{
+ return alloc_seccomp_filter(orig->syscall_nr,
+ ftrace_get_filter_string(orig->event_filter));
+}
+
+/* Safely drops all filters for a given syscall. */
+static void drop_matching_filters(struct seccomp_state *state, int syscall_nr)
+{
+ struct list_head *this, *temp;
+ list_for_each_safe(this, temp, &state->filters) {
+ struct seccomp_filter *f = list_entry(this,
+ struct seccomp_filter,
+ list);
+ if (f->syscall_nr == syscall_nr) {
+ WARN_ON(state->filter_count == 0);
+ state->filter_count--;
+ list_del_rcu(this);
+ call_rcu(&f->rcu, free_seccomp_filter_rcu);
+ }
+ }
+}
+
+static int add_filter(struct seccomp_state *state, int syscall_nr,
+ char *filter_string)
+{
+ struct syscall_metadata *data;
+ struct seccomp_filter *filter;
+
+ if (state->filter_count == SECCOMP_MAX_FILTER_COUNT - 1)
+ return -EINVAL;
+
+ /* This check will catch flag bits, like on_next_syscall. */
+ data = syscall_nr_to_meta(syscall_nr);
+ if (!data)
+ return -EINVAL;
+
+ filter = alloc_seccomp_filter(syscall_nr,
+ filter_string);
+ if (!filter)
+ return -EINVAL;
+
+ state->filter_count++;
+ list_add_tail_rcu(&filter->list, &state->filters);
+ return 0;
+}
+
+static int syscall_name_to_nr(const char *name)
+{
+ int i;
+ for (i = 0; i < NR_syscalls; i++) {
+ struct syscall_metadata *meta = syscall_nr_to_meta(i);
+ if (!meta)
+ continue;
+ if (!strcmp(meta->name, name))
+ return i;
+ }
+ return -1;
+}
+
+static int directive_to_bit(const char *syscall)
+{
+ int syscall_nr = -1;
+ if (!strcmp(syscall, SECCOMP_DIRECTIVE_ON_NEXT))
+ return SECCOMP_ON_NEXT_BIT;
+ syscall_nr = syscall_name_to_nr(syscall);
+ return syscall_nr;
+}
+
+/* 1 on match, 0 otherwise. */
+static int filter_match_current(struct seccomp_filter *filter)
+{
+ int err;
+ uint8_t syscall_state[64];
+
+ memset(syscall_state, 0, sizeof(syscall_state));
+
+ /* The generic tracing entry can remain zeroed. */
+ err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+ NULL);
+ if (err)
+ return 0;
+
+ return filter_match_preds(filter->event_filter, syscall_state);
+}
+
+/**
+ * parse_seccomp_filter() - update a system call mask
+ * @state: the seccomp_state to update.
+ * @filter: a rule expressed in a char array.
+ */
+static int parse_seccomp_filter(struct seccomp_state *state,
+ char *filter_spec)
+{
+ char *filter_string;
+ char *syscall = filter_spec;
+ int syscall_nr = -1;
+ int ret = -EINVAL;
+
+ if (!state)
+ goto done;
+
+ if (!filter_spec)
+ goto done;
+
+ /* Expected format is
+ * directive: filter_string
+ * where directive may be a system call name (sys_*) or
+ * on_next_syscall.
+ */
+ filter_string = strchr(filter_spec, ':');
+ if (!filter_string)
+ goto done;
+
+ *filter_string++ = '\0';
+
+ /* Map the name to the syscalls defined name. */
+ syscall_nr = directive_to_bit(syscall);
+ if (syscall_nr < 0) {
+ goto done;
+ }
+
+ /* Short-circuit filter parsing and check for "0" to clear. */
+ ret = 0;
+ if (!strcmp(strstrip(filter_string), "0")) {
+ clear_bit(syscall_nr, state->bitmask);
+ drop_matching_filters(state, syscall_nr);
+ goto done;
+ }
+ /* Do not allow any allowed syscalls or filter additions when a
+ * secure computing mode is enabled.
+ */
+ ret = -EPERM;
+ if (state->mode)
+ goto done;
+
+ /* Short-circuit parsing for the allow case. */
+ if (!strcmp(strstrip(filter_string), "1"))
+ goto allow;
+
+ /* Attempt to create a new filter, but don't do it if it doesn't parse.
+ */
+ /* TODO either use append_filter_string or replace old matches. */
+ if (add_filter(state, syscall_nr, filter_string)) {
+ ret = -EINVAL;
+ goto done;
+ }
+
+allow:
+ ret = 0;
+ set_bit(syscall_nr, state->bitmask);
+
+done:
+ return ret;
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+static void log_failure(int syscall)
+{
+ const char *syscall_name = "unknown";
+ struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+ if (data)
+ syscall_name = data->name;
+ printk(KERN_INFO
+ "%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+ current->comm, task_pid_nr(current), syscall, syscall_name,
+ KSTK_EIP(current));
+}
+
+/* seccomp_drop_all_filters - cleans up the filter list
+ *
+ * @state: the seccomp_state to destroy the filters in.
+ */
+void seccomp_drop_all_filters(struct seccomp_state *state)
+{
+ struct list_head *this, *temp;
+ state->filter_count = 0;
+ if (list_empty(&state->filters))
+ return;
+ list_for_each_safe(this, temp, &state->filters) {
+ struct seccomp_filter *f = list_entry(this,
+ struct seccomp_filter,
+ list);
+ list_del_rcu(this);
+ /* Schedules freeing on the RCU. */
+ call_rcu(&f->rcu, free_seccomp_filter_rcu);
+ }
+}
+EXPORT_SYMBOL_GPL(seccomp_drop_all_filters);
+
+/* seccomp_copy_all_filters - copies all filters from src to dst.
+ *
+ * @dst: the list_head for seccomp_filters to populate.
+ * @src: the list_head for seccomp_filters to copy from.
+ * Returns non-zero on failure.
+ */
+int seccomp_copy_all_filters(struct list_head *dst,
+ const struct list_head *src)
+{
+ struct seccomp_filter *filter;
+ int ret = 0;
+ BUG_ON(!dst || !src);
+ if (list_empty(src))
+ goto done;
+ rcu_read_lock();
+ list_for_each_entry(filter, src, list) {
+ struct seccomp_filter *new_filter = copy_seccomp_filter(filter);
+ if (!new_filter) {
+ ret = -ENOMEM;
+ goto done;
+ }
+ list_add_tail_rcu(&new_filter->list, dst);
+ }
+
+done:
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_copy_all_filters);
+
+/* seccomp_show_filters - prints the filter state to a seq_file
+ * @state: the seccomp_state to enumerate the filter and bitmask of.
+ * @m: the prepared seq_file to receive the data.
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_state *state, struct seq_file *m)
+{
+ int i = 0;
+ if (!state) {
+ return 0;
+ }
+
+ if (test_bit(SECCOMP_ON_NEXT_BIT, state->bitmask))
+ seq_printf(m, SECCOMP_DIRECTIVE_ON_NEXT ": 1\n");
+ /* Lazy but effective. */
+ do {
+ struct syscall_metadata *meta;
+ struct seccomp_filter *filter;
+ int filtered = 0;
+ if (!test_bit(i, state->bitmask))
+ continue;
+ rcu_read_lock();
+ list_for_each_entry_rcu(filter, &state->filters, list) {
+ if (filter->syscall_nr == i) {
+ filtered = 1;
+ seq_printf(m, "%s: %s\n",
+ filter->data->name,
+ ftrace_get_filter_string(
+ filter->event_filter));
+ }
+ }
+ rcu_read_unlock();
+
+ if (!filtered) {
+ meta = syscall_nr_to_meta(i);
+ BUG_ON(!meta);
+ seq_printf(m, "%s: 1\n", meta->name);
+ }
+ } while (++i < NR_syscalls);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/* seccomp_test_filters - tests 'current' agaist the given syscall
+ *
+ * @state: seccomp_state of current to use.
+ * @syscall: number of the system call to test
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(struct seccomp_state *state, int syscall)
+{
+ struct seccomp_filter *filter = NULL;
+ int ret = 0;
+ if (syscall >= NR_syscalls) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* The non-atomic version is used. Since the goal of this bit is to
+ * provide a means to set a filter set prior to exec*(), it there
+ * should not be a race nor should it matter if one occurs.
+ */
+ if (__test_and_clear_bit(NR_syscalls, state->bitmask)) {
+ ret = 0;
+ goto out;
+ }
+
+ if (!test_bit(syscall, state->bitmask)) {
+ ret = -EACCES;
+ goto out;
+ }
+
+ /* Check if the syscall is filtered, if not, allow it. */
+ rcu_read_lock();
+ /* XXX: using two bitmasks would avoid searching the filter list
+ * unnecessarily.
+ */
+ list_for_each_entry(filter, &state->filters, list) {
+ /* Multiple filters are allowed per system call.
+ * They are logically ANDed in order of addition.
+ */
+ if (filter->syscall_nr == syscall) {
+ if (!filter_match_current(filter)) {
+ ret = -EACCES;
+ goto out_unlock;
+ }
+ }
+ }
+
+out_unlock:
+ rcu_read_unlock();
+out:
+ if (ret)
+ log_failure(syscall);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_test_filters);
+
+/* seccomp_parse_filters - populates state with seccomp_filters
+ *
+ * @state: seccomp state to manage the filters on.
+ * @buf: list of seccomp filters of the format:
+ * directive: ftrace_filter_string
+ * Returns 0 on success or non-zero if any filter is invalid.
+ *
+ * A directive may be any valid syscalls enter event name,
+ * like sys_newuname, or SECCOMP_DIRECTIVE_ON_NEXT.
+ *
+ * See parse_seccomp_filter for exact behavior.
+ */
+int seccomp_parse_filters(struct seccomp_state *state, char *buf)
+{
+ char *filter, *next;
+ int err = 0;
+ next = buf;
+ while ((filter = strsep(&next, "\n")) != NULL) {
+ if (!filter[0])
+ continue;
+ if ((err = parse_seccomp_filter(state, filter)))
+ break;
+ }
+ return err;
+}
+EXPORT_SYMBOL_GPL(seccomp_parse_filters);
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..1d3f27e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1620,6 +1620,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
{
struct task_struct *me = current;
unsigned char comm[sizeof(me->comm)];
+ char *seccomp_filter = NULL;
long error;

error = security_task_prctl(option, arg2, arg3, arg4, arg5);
@@ -1703,7 +1704,15 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ if (arg3 && SECCOMP_MAX_FILTER_LENGTH) {
+ seccomp_filter = strndup_user(
+ (char __user *)arg3,
+ SECCOMP_MAX_FILTER_LENGTH);
+ if (!seccomp_filter)
+ return -ENOMEM;
+ }
+ error = prctl_set_seccomp(arg2, seccomp_filter);
+ kfree(seccomp_filter);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61d7d59..e21b9eb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -240,6 +240,16 @@ config FTRACE_SYSCALLS
help
Basic tracer to catch the syscall entry and exit events.

+config SECCOMP_FILTER
+ bool "Enable seccomp trace event-based filtering"
+ depends on FTRACE_SYSCALLS
+ select CONFIG_SECCOMP
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls.
+
+ See Documentation/trace/seccomp_filter.txt for more detail.
+
config TRACE_BRANCH_PROFILING
bool
select GENERIC_TRACER
--
1.7.0.4

2011-04-28 03:19:27

by Will Drewry

[permalink] [raw]
Subject: [PATCH 6/7] include/linux/syscalls.h: add __ layer of macros with return types.

This change addresses two issues directly:
1. The inability for ftrace to hook system calls that don't return
a long.
2. The fact that SYSCALL_DEFINE() does not touch ftrace at all

1 is fixed by adding __SYSCALL_DEFINEx/0 and __SYSCALL_TRACEx/0
macros which lay underneath the normal call,
SYSCALL_DEFINE0, and SYSCALL_DEFINE[1-6]. All existing calls will
continue to work normally and SYSCALL_DEFINE calls will become
ftrace-able but without argument inspection support.

2 is addressed by separating out SYSCALL_TRACE0 and pulling it under
SYSCALL_DEFINE. It means that calls that may lack argument
introspection will still be traceable in a binary way without any
additional code changes.

There are still some challenges for wrapping calls that have a
trampoline directly in the assembly, like the ptregs calls in
arch/x86/kernel (e.g., sys_clone). Given that the arguments to the
function do not directly map to those of the system call, the main
macros are not friendly for use. However, the SYSCALL_TRACE0 macro in
this change could be used as a placeholder (declared above the function
definition) to allow limited tracing functionality until a better
solution is presented.

Signed-off-by: Will Drewry <[email protected]>
---
include/linux/syscalls.h | 52 +++++++++++++++++++++++++++++++---------------
1 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 83ecc17..1c4a3fb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -171,7 +171,7 @@ extern struct trace_event_functions exit_syscall_print_funcs;
__attribute__((section("__syscalls_metadata"))) \
*__p_syscall_meta_##sname = &__syscall_meta_##sname;

-#define SYSCALL_DEFINE0(sname) \
+#define __SYSCALL_TRACE0(linkage, ret, sname) \
SYSCALL_TRACE_ENTER_EVENT(_##sname); \
SYSCALL_TRACE_EXIT_EVENT(_##sname); \
static struct syscall_metadata __used \
@@ -185,12 +185,16 @@ extern struct trace_event_functions exit_syscall_print_funcs;
}; \
static struct syscall_metadata __used \
__attribute__((section("__syscalls_metadata"))) \
- *__p_syscall_meta_##sname = &__syscall_meta__##sname; \
- asmlinkage long sys_##sname(void)
+ *__p_syscall_meta_##sname = &__syscall_meta__##sname;
#else
-#define SYSCALL_DEFINE0(name) asmlinkage long sys_##name(void)
+#define __SYSCALL_TRACE0(linkage, ret, sname)
#endif

+#define __SYSCALL_DEFINE0(linkage, ret, sname) \
+ __SYSCALL_TRACE0(linkage, ret, sname) \
+ linkage ret sys_##sname(void)
+
+#define SYSCALL_DEFINE0(name) __SYSCALL_DEFINE0(asmlinkage, long, name)
#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
@@ -214,39 +218,53 @@ extern struct trace_event_functions exit_syscall_print_funcs;

#ifdef CONFIG_FTRACE_SYSCALLS
#define SYSCALL_DEFINEx(x, sname, ...) \
+ __SYSCALL_DEFINEx(asmlinkage, long, x, sname, __VA_ARGS__)
+
+#define __SYSCALL_TRACEx(linkage, ret, x, sname, ...) \
static const char *types_##sname[] = { \
__SC_STR_TDECL##x(__VA_ARGS__) \
}; \
static const char *args_##sname[] = { \
__SC_STR_ADECL##x(__VA_ARGS__) \
}; \
- SYSCALL_METADATA(sname, x); \
- __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
+ SYSCALL_METADATA(sname, x);
#else
#define SYSCALL_DEFINEx(x, sname, ...) \
- __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
+ __SYSCALL_DEFINEx(asmlinkage, long, x, sname, __VA_ARGS__)
+#define __SYSCALL_TRACEx(linkage, ret, x, sname, ...)
#endif

#ifdef CONFIG_HAVE_SYSCALL_WRAPPERS

-#define SYSCALL_DEFINE(name) static inline long SYSC_##name
-
-#define __SYSCALL_DEFINEx(x, name, ...) \
- asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__)); \
- static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__)); \
- asmlinkage long SyS##name(__SC_LONG##x(__VA_ARGS__)) \
+#define SYSCALL_DEFINE(name) \
+ __SYSCALL_DEFINE(long, name)
+#define __SYSCALL_DEFINE(ret, name) \
+ __SYSCALL_TRACE0(, ret, name)
+ static inline ret SYSC_##name
+
+#define __SYSCALL_DEFINEx(linkage, ret, x, name, ...) \
+ __SYSCALL_TRACEx(linkage, ret, x, name, __VA_ARGS__) \
+ linkage ret sys##name(__SC_DECL##x(__VA_ARGS__)); \
+ static inline ret SYSC##name(__SC_DECL##x(__VA_ARGS__)); \
+ linkage ret SyS##name(__SC_LONG##x(__VA_ARGS__)) \
{ \
__SC_TEST##x(__VA_ARGS__); \
- return (long) SYSC##name(__SC_CAST##x(__VA_ARGS__)); \
+ return (ret) SYSC##name(__SC_CAST##x(__VA_ARGS__)); \
} \
SYSCALL_ALIAS(sys##name, SyS##name); \
- static inline long SYSC##name(__SC_DECL##x(__VA_ARGS__))
+ static inline ret SYSC##name(__SC_DECL##x(__VA_ARGS__))

#else /* CONFIG_HAVE_SYSCALL_WRAPPERS */

-#define SYSCALL_DEFINE(name) asmlinkage long sys_##name
-#define __SYSCALL_DEFINEx(x, name, ...) \
- asmlinkage long sys##name(__SC_DECL##x(__VA_ARGS__))
+#define SYSCALL_DEFINE(name) \
+ __SYSCALL_DEFINE(asmlinkage, long, name)
+#define __SYSCALL_DEFINE(linkage, ret, name) \
+ __SYSCALL_TRACE0(linkage, ret, name) \
+ linkage ret sys_##name
+
+#define __SYSCALL_DEFINEx(linkage, ret, x, name, ...) \
+ __SYSCALL_TRACEx(linkage, ret, x, name, __VA_ARGS__) \
+ linkage ret sys##name(__SC_DECL##x(__VA_ARGS__))

#endif /* CONFIG_HAVE_SYSCALL_WRAPPERS */

--
1.7.0.4

2011-04-28 03:21:49

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/7] seccomp_filter: add process state reporting

> Adds seccomp and seccomp_filter status reporting to proc.
> /proc/<pid>/status will include a Seccomp field, and
> /proc/<pid>/seccomp_filter will provide read-only access
> to the current filter and bitmask set for seccomp_filters.
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> fs/proc/array.c | 21 +++++++++++++++++++++
> fs/proc/base.c | 25 +++++++++++++++++++++++++
> 2 files changed, 46 insertions(+), 0 deletions(-)

I'm not againt seccomp_filter. but I dislike to increase /proc/<pid>/status mess.
1) it's read from a lot of applications, I don't want to worry about performance
thing. 2) 99.99% user never use seccomp. this field is useless for them.

Can't you make individual seccomp specific file?


2011-04-28 03:24:22

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 4/7] seccomp_filter: add process state reporting

On Wed, Apr 27, 2011 at 10:21 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Adds seccomp and seccomp_filter status reporting to proc.
>> /proc/<pid>/status will include a Seccomp field, and
>> /proc/<pid>/seccomp_filter will provide read-only access
>> to the current filter and bitmask set for seccomp_filters.
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> ?fs/proc/array.c | ? 21 +++++++++++++++++++++
>> ?fs/proc/base.c ?| ? 25 +++++++++++++++++++++++++
>> ?2 files changed, 46 insertions(+), 0 deletions(-)
>
> I'm not againt seccomp_filter. but I dislike to increase /proc/<pid>/status mess.
> 1) it's read from a lot of applications, I don't want to worry about performance
> thing. 2) 99.99% user never use seccomp. this field is useless for them.
>
> Can't you make individual seccomp specific file?

Definitely. Would it make sense to have /proc/<pid>/seccomp and
/proc/<pid>/seccomp_filter?

2011-04-28 03:40:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 4/7] seccomp_filter: add process state reporting

On Wed, Apr 27, 2011 at 10:24:20PM -0500, Will Drewry wrote:

> Definitely. Would it make sense to have /proc/<pid>/seccomp and
> /proc/<pid>/seccomp_filter?

Just one question: WTF bother with S_IRUSR? Note that it's absolutely
_useless_ in procfs; any kind of permission checks must be done in
read(2) time since doing it in open(2) is worthless. Consider execve()
on suid binary; oops, there goes your uid and there goes the effect
of checks done by open(2). And if you *do* checks in read(2), why bother
duplicating them in open(2)?

2011-04-28 03:43:08

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 4/7] seccomp_filter: add process state reporting

On Wed, Apr 27, 2011 at 10:40 PM, Al Viro <[email protected]> wrote:
> On Wed, Apr 27, 2011 at 10:24:20PM -0500, Will Drewry wrote:
>
>> Definitely. ?Would it make sense to have /proc/<pid>/seccomp and
>> /proc/<pid>/seccomp_filter?
>
> Just one question: WTF bother with S_IRUSR? ?Note that it's absolutely
> _useless_ in procfs; any kind of permission checks must be done in
> read(2) time since doing it in open(2) is worthless. ?Consider execve()
> on suid binary; oops, there goes your uid and there goes the effect
> of checks done by open(2). ?And if you *do* checks in read(2), why bother
> duplicating them in open(2)?

In earlier versions I was allowing filter/bitmask updating via the
proc file (which I nixed :). Is S_IRUGO preferred? I don't see any
crazy information leakage by sharing the filters/ruleset. I'll fold
it into the next version of this patch.

thanks!
will

2011-04-28 07:07:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.


* Will Drewry <[email protected]> wrote:

> +A collection of filters may be supplied via prctl, and the current set of
> +filters is exposed in /proc/<pid>/seccomp_filter.
> +
> +For instance,
> + const char filters[] =
> + "sys_read: (fd == 1) || (fd == 2)\n"
> + "sys_write: (fd == 0)\n"
> + "sys_exit: 1\n"
> + "sys_exit_group: 1\n"
> + "on_next_syscall: 1";
> + prctl(PR_SET_SECCOMP, 2, filters);
> +
> +This will setup system call filters for read, write, and exit where reading can
> +be done only from fds 1 and 2 and writing to fd 0. The "on_next_syscall" directive tells
> +seccomp to not enforce the ruleset until after the next system call is run. This allows
> +for launchers to apply system call filters to a binary before executing it.
> +
> +Once enabled, the access may only be reduced. For example, a set of filters may be:
> +
> + sys_read: 1
> + sys_write: 1
> + sys_mmap: 1
> + sys_prctl: 1
> +
> +Then it may call the following to drop mmap access:
> + prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");

Ok, color me thoroughly impressed - AFAICS you implemented my suggestions in:

http://lwn.net/Articles/332974/

and you made it work in practice!

We could split out the ftrace filter engine some more and make it more
independent of ftrace. It's basically an in-kernel interpreter able to run off
tracepoints.

I've Cc:-ed Linus and Andrew: are you guys opposed to such flexible, dynamic
filters conceptually? I think we should really think hard about the actual ABI
as this could easily spread to more applications than Chrome/Chromium.

Btw., i also think that such an approach is actually the sane(r) design to
implement security modules: using such filters is far more flexible than the
typical LSM approach of privileged user-space uploading various nasty objects
into kernel space and implementing silly (and limited and intrusive) hooks
there, like SElinux and the other security modules do.

This approach also has the ability to become recursive (gets inherited by child
tasks, which could add their own filters) and unprivileged - unlike LSMs.

I like this *a lot* more than any security sandboxing approach i've seen
before.

Thanks,

Ingo

2011-04-28 13:50:36

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Wed, 2011-04-27 at 22:08 -0500, Will Drewry wrote:

> -typedef struct { int mode; } seccomp_t;
> +/**
> + * struct seccomp_state - the state of a seccomp'ed process
> + *
> + * @mode:
> + * if this is 1, the process is under standard seccomp rules
> + * is 2, the process is only allowed to make system calls where
> + * the corresponding bit is set in bitmask and any
> + * associated filters evaluate successfully.

I hate arbitrary numbers. This is not a big deal, but can we create an
enum or define that puts names for the modes.

SECCOMP_MODE_BASIC
SECCOMP_MODE_FILTER

??

> + * @usage: number of references to the current instance.
> + * @bitmask: a mask of allowed or filtered system calls and additional flags.
> + * @filter_count: number of seccomp filters in @filters.
> + * @filters: list of seccomp_filter entries for system calls.
> + */
> +struct seccomp_state {
> + uint16_t mode;
> + atomic_t usage;
> + DECLARE_BITMAP(bitmask, NR_syscalls + 1);
> + int filter_count;
> + struct list_head filters;
> +};
> +
> +typedef struct { struct seccomp_state *state; } seccomp_t;
>
> extern void __secure_computing(int);
> static inline void secure_computing(int this_syscall)
> @@ -16,8 +39,14 @@ static inline void secure_computing(int this_syscall)
> __secure_computing(this_syscall);
> }
>


> diff --git a/kernel/fork.c b/kernel/fork.c
> index e7548de..bdcf70b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
> #include <linux/cgroup.h>
> #include <linux/security.h>
> #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
> #include <linux/swap.h>
> #include <linux/syscalls.h>
> #include <linux/jiffies.h>
> @@ -169,6 +170,9 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> +#ifdef CONFIG_SECCOMP
> + put_seccomp_state(tsk->seccomp.state);
> +#endif
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -280,6 +284,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
> if (err)
> goto out;
>
> +#ifdef CONFIG_SECCOMP
> + tsk->seccomp.state = get_seccomp_state(orig->seccomp.state);
> +#endif

I wonder if we could macro these to get rid of the #ifdefs in fork.c.

#define put_seccomp_state(state) do { } while (0)
#define assign_seccomp_state(src, state) do { } while (0)

??

> +
> setup_thread_stack(tsk, orig);
> clear_user_return_notifier(tsk);
> clear_tsk_need_resched(tsk);
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 57d4b13..1bee87c 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -8,10 +8,11 @@
>
> #include <linux/seccomp.h>
> #include <linux/sched.h>
> +#include <linux/slab.h>
> #include <linux/compat.h>
>
> /* #define SECCOMP_DEBUG 1 */
> -#define NR_SECCOMP_MODES 1
> +#define NR_SECCOMP_MODES 2
>
> /*
> * Secure computing mode 1 allows only read/write/exit/sigreturn.
> @@ -32,9 +33,11 @@ static int mode1_syscalls_32[] = {
>
> void __secure_computing(int this_syscall)
> {
> - int mode = current->seccomp.mode;
> + int mode = -1;
> int * syscall;
> -
> + /* Do we need an RCU read lock to access current's state? */

I'm actually confused to why you are using RCU. What are you protecting.
Currently, I see the state is always accessed from current->seccomp. But
current should not be fighting with itself.

Maybe I'm missing something.

-- Steve

> + if (current->seccomp.state)
> + mode = current->seccomp.state->mode;
> switch (mode) {
> case 1:
> syscall = mode1_syscalls;
> @@ -47,6 +50,16 @@ void __secure_computing(int this_syscall)
> return;
> } while (*++syscall);
> break;
> + case 2:
> +#ifdef CONFIG_COMPAT
> + if (is_compat_task())
> + /* XXX: No compat support yet. */
> + break;
> +#endif
> + if (!seccomp_test_filters(current->seccomp.state,
> + this_syscall))
> + return;
> + break;
> default:
> BUG();
> }
> @@ -57,30 +70,139 @@ void __secure_computing(int this_syscall)
> do_exit(SIGKILL);

2011-04-28 14:29:12

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
> This change adds a new seccomp mode based on the work by
> [email protected]. This mode comes with a bitmask of NR_syscalls size and
> an optional linked list of seccomp_filter objects. When in mode 2, all

Since you now use the filters. Why not using them to filter syscalls
entirely rather than using a bitmap of allowed syscalls?

You have the "nr" field in syscall tracepoints.

2011-04-28 14:56:52

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, 2011-04-28 at 09:06 +0200, Ingo Molnar wrote:
> * Will Drewry <[email protected]> wrote:
>
> > +A collection of filters may be supplied via prctl, and the current set of
> > +filters is exposed in /proc/<pid>/seccomp_filter.
> > +
> > +For instance,
> > + const char filters[] =
> > + "sys_read: (fd == 1) || (fd == 2)\n"
> > + "sys_write: (fd == 0)\n"
> > + "sys_exit: 1\n"
> > + "sys_exit_group: 1\n"
> > + "on_next_syscall: 1";
> > + prctl(PR_SET_SECCOMP, 2, filters);
> > +
> > +This will setup system call filters for read, write, and exit where reading can
> > +be done only from fds 1 and 2 and writing to fd 0. The "on_next_syscall" directive tells
> > +seccomp to not enforce the ruleset until after the next system call is run. This allows
> > +for launchers to apply system call filters to a binary before executing it.
> > +
> > +Once enabled, the access may only be reduced. For example, a set of filters may be:
> > +
> > + sys_read: 1
> > + sys_write: 1
> > + sys_mmap: 1
> > + sys_prctl: 1
> > +
> > +Then it may call the following to drop mmap access:
> > + prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
>
> Ok, color me thoroughly impressed

Me too!

> I've Cc:-ed Linus and Andrew: are you guys opposed to such flexible, dynamic
> filters conceptually? I think we should really think hard about the actual ABI
> as this could easily spread to more applications than Chrome/Chromium.

I'll definitely port QEMU to use this new interface rather than my more
rigid flexible (haha "rigid flexible") seccomp. I'll see if I run into
any issues with this ABI in that porting...

> Btw., i also think that such an approach is actually the sane(r) design to
> implement security modules: using such filters is far more flexible than the
> typical LSM approach of privileged user-space uploading various nasty objects
> into kernel space and implementing silly (and limited and intrusive) hooks
> there, like SElinux and the other security modules do.

Then you are wrong. There's no question that this interface can provide
great extensions to the current discretionary functionality provided by
legacy security controls but if you actually want to mediate what tasks
can do to other tasks or can do to arbitrary objects on the system this
doesn't cut it. Every system call that takes or uses a structure as an
argument or that uses copy_from_user (for something other than just
unparsed data) is uncontrollable.

This approach is great and with careful coding of userspace apps can be
made very useful in constraining those apps, but a replacement for
mandatory access control it is not.

> This approach also has the ability to become recursive (gets inherited by child
> tasks, which could add their own filters) and unprivileged - unlike LSMs.

LSMs have that ability. There's nothing to prevent a module loading
service to allow unpriv applications to further constrain themselves.
It's just the different between DAC and MAC. You are clearly a DAC guy,
and there is no question this change is great in that mindset, but you
don't seem to understand either the flexibility of the LSM or the
purpose of some of the modules implemented on top of the LSM.

> I like this *a lot* more than any security sandboxing approach i've seen
> before.

I like this *a lot*. It will be a HUGE addition to the security
sandboxing approaches I've seen before.

-Eric

2011-04-28 15:12:54

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
> This change adds a new seccomp mode based on the work by
> [email protected]. This mode comes with a bitmask of NR_syscalls size and
> an optional linked list of seccomp_filter objects. When in mode 2, all
> system calls are first checked against the bitmask to determine if they
> are allowed or denied. If allowed, the list of filters is checked for
> the given syscall number. If all filter predicates for the system call
> match or the system call was allowed without restriction, the process
> continues. Otherwise, it is killed and a KERN_INFO notification is
> posted.
>
> The filter language itself is provided by the ftrace filter engine.
> Related patches tweak to the perf filter trace and free allow the calls
> to be shared. Filters inherit their understanding of types and arguments
> for each system call from the CONFIG_FTRACE_SYSCALLS subsystem which
> predefines this information in syscall_metadata associated enter_event
> (and exit_event) structures.
>
> The result is that a process may reduce its available interfaces to
> the kernel through prctl() without knowing the appropriate system call
> number a priori and with the flexibility of filtering based on
> register-stored arguments. (String checks suffer from TOCTOU issues and
> should be left to LSMs to provide policy for! Don't get greedy :)
>
> A sample filterset for a process that only needs to interact over stdin
> and stdout and exit cleanly is shown below:
> sys_read: fd == 0
> sys_write: fd == 1
> sys_exit_group: 1
>
> The filters may be specified once prior to entering the reduced access
> state:
> prctl(PR_SET_SECCOMP, 2, filters);

Instead of having such multiline filter definition with syscall
names prepended, it would be nicer to make the parsing simplier.

You could have either:

prctl(PR_SET_SECCOMP, mode);
/* Works only if we are in mode 2 */
prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);

or:
/*
* If mode == 2, set the filter to syscall_nr
* Recall this for each syscall that need a filter.
* If a filter was previously set on the targeted syscall,
* it will be overwritten.
*/
prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);

One can erase a previous filter by setting the new filter "1".

Also, instead of having a bitmap of syscall to accept. You could
simply set "0" as a filter to those you want to deactivate:

prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1

Hm?

2011-04-28 15:15:09

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 9:29 AM, Frederic Weisbecker <[email protected]> wrote:
> On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
>> This change adds a new seccomp mode based on the work by
>> [email protected]. This mode comes with a bitmask of NR_syscalls size and
>> an optional linked list of seccomp_filter objects. When in mode 2, all
>
> Since you now use the filters. Why not using them to filter syscalls
> entirely rather than using a bitmap of allowed syscalls?

The current approach just uses a linked list of filters. While a more
efficient data structure could be used, the bitmask provides a quick
binary decision, and optimizes for the relatively common case where
there won't be many non-binary filters to evaluate so we don't have to
walk the list for a larger number of yes/no decisions versus more
complex predicates. Though that may be a short-sighted view! I'm
happy to change it up.

> You have the "nr" field in syscall tracepoints.

I'n not sure I follow. Do you mean moving entirely to using the
actual tracepoint infrastructure instead of using the seccomp hooks,
or just looking up proper filter by syscall nr? If there's a sane and
better way to do the latter, I'm all ears :) As far as using the
tracepoints themselves, I looked to how the perf/ftrace interactions
worked and while I could've registered with the syscalls tracepoints
for enter and exit, it would mean later evaluation of the system call
interception, possibly out-of-order with respect to other registered
event sinks, and there is complexity in just killing current from
within the notifier-like list registered syscall events (as Eric Paris
ran into when expanding filtering into perf itself). To get around
that, the tracepoint handler would have to pump the data somewhere
else (like it does for perf), and it just seemed messy. I think it's
doable, but I don't know that the pure syscall tracepoint
infrastructure should be burdened with the added requirements that
come with seccomp-filtering. If I didn't properly understand the
code, though, please set me on the right path.

thanks!
will

2011-04-28 15:20:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 05:12:44PM +0200, Frederic Weisbecker wrote:
> On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
> > This change adds a new seccomp mode based on the work by
> > [email protected]. This mode comes with a bitmask of NR_syscalls size and
> > an optional linked list of seccomp_filter objects. When in mode 2, all
> > system calls are first checked against the bitmask to determine if they
> > are allowed or denied. If allowed, the list of filters is checked for
> > the given syscall number. If all filter predicates for the system call
> > match or the system call was allowed without restriction, the process
> > continues. Otherwise, it is killed and a KERN_INFO notification is
> > posted.
> >
> > The filter language itself is provided by the ftrace filter engine.
> > Related patches tweak to the perf filter trace and free allow the calls
> > to be shared. Filters inherit their understanding of types and arguments
> > for each system call from the CONFIG_FTRACE_SYSCALLS subsystem which
> > predefines this information in syscall_metadata associated enter_event
> > (and exit_event) structures.
> >
> > The result is that a process may reduce its available interfaces to
> > the kernel through prctl() without knowing the appropriate system call
> > number a priori and with the flexibility of filtering based on
> > register-stored arguments. (String checks suffer from TOCTOU issues and
> > should be left to LSMs to provide policy for! Don't get greedy :)
> >
> > A sample filterset for a process that only needs to interact over stdin
> > and stdout and exit cleanly is shown below:
> > sys_read: fd == 0
> > sys_write: fd == 1
> > sys_exit_group: 1
> >
> > The filters may be specified once prior to entering the reduced access
> > state:
> > prctl(PR_SET_SECCOMP, 2, filters);
>
> Instead of having such multiline filter definition with syscall
> names prepended, it would be nicer to make the parsing simplier.
>
> You could have either:
>
> prctl(PR_SET_SECCOMP, mode);
> /* Works only if we are in mode 2 */
> prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);
>
> or:
> /*
> * If mode == 2, set the filter to syscall_nr
> * Recall this for each syscall that need a filter.
> * If a filter was previously set on the targeted syscall,
> * it will be overwritten.
> */
> prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
>
> One can erase a previous filter by setting the new filter "1".
>
> Also, instead of having a bitmap of syscall to accept. You could
> simply set "0" as a filter to those you want to deactivate:
>
> prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1


I meant "0" and not 0. Because a NULL filter would actually mean we
don't have a filter, which would be the same as "1".

>
> Hm?

2011-04-28 15:29:14

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 10:12 AM, Frederic Weisbecker
<[email protected]> wrote:
> On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
>> This change adds a new seccomp mode based on the work by
>> [email protected]. This mode comes with a bitmask of NR_syscalls size and
>> an optional linked list of seccomp_filter objects. When in mode 2, all
>> system calls are first checked against the bitmask to determine if they
>> are allowed or denied. ?If allowed, the list of filters is checked for
>> the given syscall number. If all filter predicates for the system call
>> match or the system call was allowed without restriction, the process
>> continues. Otherwise, it is killed and a KERN_INFO notification is
>> posted.
>>
>> The filter language itself is provided by the ftrace filter engine.
>> Related patches tweak to the perf filter trace and free allow the calls
>> to be shared. Filters inherit their understanding of types and arguments
>> for each system call from the CONFIG_FTRACE_SYSCALLS subsystem which
>> predefines this information in syscall_metadata associated enter_event
>> (and exit_event) structures.
>>
>> The result is that a process may reduce its available interfaces to
>> the kernel through prctl() without knowing the appropriate system call
>> number a priori and with the flexibility of filtering based on
>> register-stored arguments. ?(String checks suffer from TOCTOU issues and
>> should be left to LSMs to provide policy for! Don't get greedy :)
>>
>> A sample filterset for a process that only needs to interact over stdin
>> and stdout and exit cleanly is shown below:
>> ? sys_read: fd == 0
>> ? sys_write: fd == 1
>> ? sys_exit_group: 1
>>
>> The filters may be specified once prior to entering the reduced access
>> state:
>> ? prctl(PR_SET_SECCOMP, 2, filters);
>
> Instead of having such multiline filter definition with syscall
> names prepended, it would be nicer to make the parsing simplier.
>
> You could have either:
>
> ? ? ? ?prctl(PR_SET_SECCOMP, mode);
> ? ? ? ?/* Works only if we are in mode 2 */
> ? ? ? ?prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);

It'd need to be syscall_name instead of syscall_nr. Otherwise we're
right back to where Adam's patch was 2+ years ago :) Using the event
names from the syscalls infrastructure means the consumer of the
interface doesn't need to be confident of the syscall number. That
said, it would be nice to be able to specify the number as well. If
there were no complaints, it'd be nice to support both, imo.

> or:
> ? ? ? ?/*
> ? ? ? ? * If mode == 2, set the filter to syscall_nr
> ? ? ? ? * Recall this for each syscall that need a filter.
> ? ? ? ? * If a filter was previously set on the targeted syscall,
> ? ? ? ? * it will be overwritten.
> ? ? ? ? */
> ? ? ? ?prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
>
> One can erase a previous filter by setting the new filter "1".
>
> Also, instead of having a bitmap of syscall to accept. You could
> simply set "0" as a filter to those you want to deactivate:
>
> prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1
>
> Hm?

I like the simplicity in not needing to parse anything extra, but it
does add the need for extra state - either a bit or a new field - to
represent "enabled/enforcing".

The only way to do it without a third mode would be to take a
blacklist model - where all syscalls are allowed by default and the
caller has to enumerate them all and drop them. That would definitely
not be the right approach :)

If a new bit of state was added, it could be used as:
prctl(PR_SET_SECCOMP, 2);
prctl(PR_SET_SECCOMP, 2, "sys_read", "fd == 1"); /* add a read filter */
prctl(PR_SET_SECCOMP, 2, "sys_write", "fd == 0"); /* add a read filter */
...
prctl(PR_SET_SECCOMP, 2, "sys_read", "0"); /* clear the sys_read
filters and block it */ (or NULL?)
prctl(PR_SET_SECCOMP, 2, "enable"); /* Start enforcing */
prctl(PR_SET_SECCOMP, 2, "sys_write", "0"); /* Reduce attack
surface on the fly */


As to the "0" filter instead of a bitmask, would it make sense to just
cut over to an hlist now and drop the bitmask? It looks like perf
uses that model, and I'd hope it wouldn't incur too much additional
overhead. (The linked list approach now is certainly not scalable for
a large number of filters!)

If that interface seems sane, I can certainly start exploring it and
see if I hit any surprises (and put it in the next version of the
patch :). I think it'll simplify a fair amount of the add/drop code!

thanks!
will

2011-04-28 15:31:00

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 8:50 AM, Steven Rostedt <[email protected]> wrote:
> On Wed, 2011-04-27 at 22:08 -0500, Will Drewry wrote:
>
>> -typedef struct { int mode; } seccomp_t;
>> +/**
>> + * struct seccomp_state - the state of a seccomp'ed process
>> + *
>> + * @mode:
>> + * ? ? if this is 1, the process is under standard seccomp rules
>> + * ? ? ? ? ? ? is 2, the process is only allowed to make system calls where
>> + * ? ? ? ? ? ? ? ? ? the corresponding bit is set in bitmask and any
>> + * ? ? ? ? ? ? ? ? ? associated filters evaluate successfully.
>
> I hate arbitrary numbers. This is not a big deal, but can we create an
> enum or define that puts names for the modes.
>
> SECCOMP_MODE_BASIC
> SECCOMP_MODE_FILTER
>
> ??

Works for me.

>> + * @usage: number of references to the current instance.
>> + * @bitmask: a mask of allowed or filtered system calls and additional flags.
>> + * @filter_count: number of seccomp filters in @filters.
>> + * @filters: list of seccomp_filter entries for ?system calls.
>> + */
>> +struct seccomp_state {
>> + ? ? uint16_t mode;
>> + ? ? atomic_t usage;
>> + ? ? DECLARE_BITMAP(bitmask, NR_syscalls + 1);
>> + ? ? int filter_count;
>> + ? ? struct list_head filters;
>> +};
>> +
>> +typedef struct { struct seccomp_state *state; } seccomp_t;
>>
>> ?extern void __secure_computing(int);
>> ?static inline void secure_computing(int this_syscall)
>> @@ -16,8 +39,14 @@ static inline void secure_computing(int this_syscall)
>> ? ? ? ? ? ? ? __secure_computing(this_syscall);
>> ?}
>>
>
>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e7548de..bdcf70b 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>> ?#include <linux/cgroup.h>
>> ?#include <linux/security.h>
>> ?#include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>> ?#include <linux/swap.h>
>> ?#include <linux/syscalls.h>
>> ?#include <linux/jiffies.h>
>> @@ -169,6 +170,9 @@ void free_task(struct task_struct *tsk)
>> ? ? ? free_thread_info(tsk->stack);
>> ? ? ? rt_mutex_debug_task_free(tsk);
>> ? ? ? ftrace_graph_exit_task(tsk);
>> +#ifdef CONFIG_SECCOMP
>> + ? ? put_seccomp_state(tsk->seccomp.state);
>> +#endif
>> ? ? ? free_task_struct(tsk);
>> ?}
>> ?EXPORT_SYMBOL(free_task);
>> @@ -280,6 +284,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>> ? ? ? if (err)
>> ? ? ? ? ? ? ? goto out;
>>
>> +#ifdef CONFIG_SECCOMP
>> + ? ? tsk->seccomp.state = get_seccomp_state(orig->seccomp.state);
>> +#endif
>
> I wonder if we could macro these to get rid of the #ifdefs in fork.c.
>
> #define put_seccomp_state(state) ? ? ? ?do { } while (0)
> #define assign_seccomp_state(src, state) do { } while (0)
>
> ??

Makes sense to me. I'll add it to the next update.

>> +
>> ? ? ? setup_thread_stack(tsk, orig);
>> ? ? ? clear_user_return_notifier(tsk);
>> ? ? ? clear_tsk_need_resched(tsk);
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..1bee87c 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -8,10 +8,11 @@
>>
>> ?#include <linux/seccomp.h>
>> ?#include <linux/sched.h>
>> +#include <linux/slab.h>
>> ?#include <linux/compat.h>
>>
>> ?/* #define SECCOMP_DEBUG 1 */
>> -#define NR_SECCOMP_MODES 1
>> +#define NR_SECCOMP_MODES 2
>>
>> ?/*
>> ? * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -32,9 +33,11 @@ static int mode1_syscalls_32[] = {
>>
>> ?void __secure_computing(int this_syscall)
>> ?{
>> - ? ? int mode = current->seccomp.mode;
>> + ? ? int mode = -1;
>> ? ? ? int * syscall;
>> -
>> + ? ? /* Do we need an RCU read lock to access current's state? */
>
> I'm actually confused to why you are using RCU. What are you protecting.
> Currently, I see the state is always accessed from current->seccomp. But
> current should not be fighting with itself.
>
> Maybe I'm missing something.

I'm sure it's me that's missing something. I believe the seccomp
pointer can be accessed from:
- current
- via /proc/<pid>/seccomp_filter (read-only)

Given those cases, would it make sense to ditch the RCU interface for it?

Thanks!
will

>
>> + ? ? if (current->seccomp.state)
>> + ? ? ? ? ? ? mode = current->seccomp.state->mode;
>> ? ? ? switch (mode) {
>> ? ? ? case 1:
>> ? ? ? ? ? ? ? syscall = mode1_syscalls;
>> @@ -47,6 +50,16 @@ void __secure_computing(int this_syscall)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return;
>> ? ? ? ? ? ? ? } while (*++syscall);
>> ? ? ? ? ? ? ? break;
>> + ? ? case 2:
>> +#ifdef CONFIG_COMPAT
>> + ? ? ? ? ? ? if (is_compat_task())
>> + ? ? ? ? ? ? ? ? ? ? /* XXX: No compat support yet. */
>> + ? ? ? ? ? ? ? ? ? ? break;
>> +#endif
>> + ? ? ? ? ? ? if (!seccomp_test_filters(current->seccomp.state,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? this_syscall))
>> + ? ? ? ? ? ? ? ? ? ? return;
>> + ? ? ? ? ? ? break;
>> ? ? ? default:
>> ? ? ? ? ? ? ? BUG();
>> ? ? ? }
>> @@ -57,30 +70,139 @@ void __secure_computing(int this_syscall)
>> ? ? ? do_exit(SIGKILL);
>
>
>

2011-04-28 15:46:26

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 27 Apr 2011 22:08:49 -0500 Will Drewry wrote:

> Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
> implemented presently, and what it may be used for. In addition,
> the limitations and caveats of the proposed implementation are
> included.
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> Documentation/trace/seccomp_filter.txt | 75 ++++++++++++++++++++++++++++++++
> 1 files changed, 75 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/trace/seccomp_filter.txt
>
> diff --git a/Documentation/trace/seccomp_filter.txt b/Documentation/trace/seccomp_filter.txt
> new file mode 100644
> index 0000000..6a0fd33
> --- /dev/null
> +++ b/Documentation/trace/seccomp_filter.txt
> @@ -0,0 +1,75 @@
> + Seccomp filtering
> + =================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the
> +application. As system calls change and mature, bugs are found and
> +quashed. A certain subset of userland applications benefit by having
> +a reduce set of available system calls. The reduced set reduces the

reduced

> +total kernel surface exposed to the application. System call filtering
> +is meant for use with those applications.
> +
> +The implementation currently leverages both the existing seccomp
> +infrastructure and the kernel tracing infrastructure. By centralizing
> +hooks for attack surface reduction in seccomp, it is possible to assure
> +attention to security that is less relevant in normal ftrace scenarios,
> +such as time of check, time of use attacks. However, ftrace provides a
> +rich, human-friendly environment for specifying system calls by name and
> +expected arguments. (As such, this requires FTRACE_SYSCALLS.)
> +
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox. It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface. Beyond that, policy for
> +logical behavior and information flow should be managed with an LSM of your
> +choosing.
> +
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is exposed through mode '2'. This mode
> +depends on CONFIG_SECCOMP_FILTER which in turn depends on
> +CONFIG_FTRACE_SYSCALLS.
> +
> +A collection of filters may be supplied via prctl, and the current set of
> +filters is exposed in /proc/<pid>/seccomp_filter.
> +
> +For instance,
> + const char filters[] =
> + "sys_read: (fd == 1) || (fd == 2)\n"
> + "sys_write: (fd == 0)\n"
> + "sys_exit: 1\n"
> + "sys_exit_group: 1\n"
> + "on_next_syscall: 1";
> + prctl(PR_SET_SECCOMP, 2, filters);
> +
> +This will setup system call filters for read, write, and exit where reading can
> +be done only from fds 1 and 2 and writing to fd 0. The "on_next_syscall" directive tells
> +seccomp to not enforce the ruleset until after the next system call is run. This allows
> +for launchers to apply system call filters to a binary before executing it.
> +
> +Once enabled, the access may only be reduced. For example, a set of filters may be:
> +
> + sys_read: 1
> + sys_write: 1
> + sys_mmap: 1
> + sys_prctl: 1
> +
> +Then it may call the following to drop mmap access:
> + prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
> +
> +
> +Caveats
> +-------
> +
> +The system call names come from ftrace events. At present, many system
> +calls are not hooked - such as x86's ptregs wrapped system calls.
> +
> +In addition compat_task()s will not be supported until a sys32s begin
> +being hooked.

Last sentence is hard to read IMO:
a. what are compat_task()s?
b. what is a sys32s begin?
c. awkward wording, maybe change to: until a sys32s begin has been hooked.


thanks,
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

2011-04-28 15:58:04

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 10:15:04AM -0500, Will Drewry wrote:
> On Thu, Apr 28, 2011 at 9:29 AM, Frederic Weisbecker <[email protected]> wrote:
> > On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
> >> This change adds a new seccomp mode based on the work by
> >> [email protected]. This mode comes with a bitmask of NR_syscalls size and
> >> an optional linked list of seccomp_filter objects. When in mode 2, all
> >
> > Since you now use the filters. Why not using them to filter syscalls
> > entirely rather than using a bitmap of allowed syscalls?
>
> The current approach just uses a linked list of filters. While a more
> efficient data structure could be used, the bitmask provides a quick
> binary decision, and optimizes for the relatively common case where
> there won't be many non-binary filters to evaluate so we don't have to
> walk the list for a larger number of yes/no decisions versus more
> complex predicates. Though that may be a short-sighted view! I'm
> happy to change it up.

Well, using a hlist that points to the filters may be not that slower.
Dunno, that needs to be measured perhaps.

No big deal for now.

>
> > You have the "nr" field in syscall tracepoints.
>
> I'n not sure I follow. Do you mean moving entirely to using the
> actual tracepoint infrastructure instead of using the seccomp hooks,
> or just looking up proper filter by syscall nr? If there's a sane and
> better way to do the latter, I'm all ears :) As far as using the
> tracepoints themselves, I looked to how the perf/ftrace interactions
> worked and while I could've registered with the syscalls tracepoints
> for enter and exit, it would mean later evaluation of the system call
> interception, possibly out-of-order with respect to other registered
> event sinks, and there is complexity in just killing current from
> within the notifier-like list registered syscall events (as Eric Paris
> ran into when expanding filtering into perf itself). To get around
> that, the tracepoint handler would have to pump the data somewhere
> else (like it does for perf), and it just seemed messy. I think it's
> doable, but I don't know that the pure syscall tracepoint
> infrastructure should be burdened with the added requirements that
> come with seccomp-filtering. If I didn't properly understand the
> code, though, please set me on the right path.

No, my bad I was confused. I always post questions that show my
misunderstanding of a new (or not) patchset. It's like a tradition ;)

2011-04-28 16:05:16

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 10:57 AM, Frederic Weisbecker
<[email protected]> wrote:
> On Thu, Apr 28, 2011 at 10:15:04AM -0500, Will Drewry wrote:
>> On Thu, Apr 28, 2011 at 9:29 AM, Frederic Weisbecker <[email protected]> wrote:
>> > On Wed, Apr 27, 2011 at 10:08:47PM -0500, Will Drewry wrote:
>> >> This change adds a new seccomp mode based on the work by
>> >> [email protected]. This mode comes with a bitmask of NR_syscalls size and
>> >> an optional linked list of seccomp_filter objects. When in mode 2, all
>> >
>> > Since you now use the filters. Why not using them to filter syscalls
>> > entirely rather than using a bitmap of allowed syscalls?
>>
>> The current approach just uses a linked list of filters. ?While a more
>> efficient data structure could be used, the bitmask provides a quick
>> binary decision, and optimizes for the relatively common case where
>> there won't be many non-binary filters to evaluate so we don't have to
>> walk the list for a larger number of yes/no decisions versus more
>> complex predicates. ?Though that may be a short-sighted view! I'm
>> happy to change it up.
>
> Well, using a hlist that points to the filters may be not that slower.
> Dunno, that needs to be measured perhaps.
>
> No big deal for now.

Cool - that makes sense. I just haven't used hlist before and was
reticent to dive in given how much other new territory this was. I'll
check it out, though.

>>
>> > You have the "nr" field in syscall tracepoints.
>>
>> I'n not sure I follow. ?Do you mean moving entirely to using the
>> actual tracepoint infrastructure instead of using the seccomp hooks,
>> or just looking up proper filter by syscall nr? ?If there's a sane and
>> better way to do the latter, I'm all ears :) ?As far as using the
>> tracepoints themselves, I looked to how the perf/ftrace interactions
>> worked and while I could've registered with the syscalls tracepoints
>> for enter and exit, it would mean later evaluation of the system call
>> interception, possibly out-of-order with respect to other registered
>> event sinks, and there is complexity in just killing current from
>> within the notifier-like list registered syscall events (as Eric Paris
>> ran into when expanding filtering into perf itself). ?To get around
>> that, the tracepoint handler would have to pump the data somewhere
>> else (like it does for perf), and it just seemed messy. ?I think it's
>> doable, but I don't know that the pure syscall tracepoint
>> infrastructure should be burdened with the added requirements that
>> come with seccomp-filtering. ? If I didn't properly understand the
>> code, though, please set me on the right path.
>
> No, my bad I was confused. I always post questions that show my
> misunderstanding of a new (or not) patchset. It's like a tradition ;)

Awesome - I was getting worried I'd missed something terribly obvious!

2011-04-28 16:13:19

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 10:29:11AM -0500, Will Drewry wrote:
> On Thu, Apr 28, 2011 at 10:12 AM, Frederic Weisbecker
> <[email protected]> wrote:
> > Instead of having such multiline filter definition with syscall
> > names prepended, it would be nicer to make the parsing simplier.
> >
> > You could have either:
> >
> > ? ? ? ?prctl(PR_SET_SECCOMP, mode);
> > ? ? ? ?/* Works only if we are in mode 2 */
> > ? ? ? ?prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);
>
> It'd need to be syscall_name instead of syscall_nr. Otherwise we're
> right back to where Adam's patch was 2+ years ago :) Using the event
> names from the syscalls infrastructure means the consumer of the
> interface doesn't need to be confident of the syscall number.

Is it really a problem? There are libraries that can resolve that. Of
course I can't recall their name.

>
> > or:
> > ? ? ? ?/*
> > ? ? ? ? * If mode == 2, set the filter to syscall_nr
> > ? ? ? ? * Recall this for each syscall that need a filter.
> > ? ? ? ? * If a filter was previously set on the targeted syscall,
> > ? ? ? ? * it will be overwritten.
> > ? ? ? ? */
> > ? ? ? ?prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
> >
> > One can erase a previous filter by setting the new filter "1".
> >
> > Also, instead of having a bitmap of syscall to accept. You could
> > simply set "0" as a filter to those you want to deactivate:
> >
> > prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1
> >
> > Hm?
>
> I like the simplicity in not needing to parse anything extra, but it
> does add the need for extra state - either a bit or a new field - to
> represent "enabled/enforcing".

And by the way I'm really puzzled about these. I don't understand
well why we need this.

As for the enable_on_next_syscall. The documentation says
it's useful if you want the filter to only apply to the
child. So if fork immediately follows, you will be able to fork
but if the child doesn't have the right to exec, it won't be able
to do so. Same for the mmap() that involves...

So I'm a bit confused about that.

But yeah if that's really needed, it looks to me better to
reduce the parsing and cut it that way:

prctl(PR_SET_SECCOMP, 2, syscall_name_or_nr, filter);
prctl(PR_SECCOMP_APPLY_FILTERS, enable_on_next_syscall?)

or something...

>
> The only way to do it without a third mode would be to take a
> blacklist model - where all syscalls are allowed by default and the
> caller has to enumerate them all and drop them. That would definitely
> not be the right approach :)
>
> If a new bit of state was added, it could be used as:
> prctl(PR_SET_SECCOMP, 2);
> prctl(PR_SET_SECCOMP, 2, "sys_read", "fd == 1"); /* add a read filter */
> prctl(PR_SET_SECCOMP, 2, "sys_write", "fd == 0"); /* add a read filter */
> ...
> prctl(PR_SET_SECCOMP, 2, "sys_read", "0"); /* clear the sys_read
> filters and block it */ (or NULL?)
> prctl(PR_SET_SECCOMP, 2, "enable"); /* Start enforcing */
> prctl(PR_SET_SECCOMP, 2, "sys_write", "0"); /* Reduce attack
> surface on the fly */
>
>
> As to the "0" filter instead of a bitmask, would it make sense to just
> cut over to an hlist now and drop the bitmask?
> It looks like perf
> uses that model, and I'd hope it wouldn't incur too much additional
> overhead. (The linked list approach now is certainly not scalable for
> a large number of filters!)

The linked list certainly doesn't scale there. But either a hlist for
everything, or a hlist + bitmap to check if the syscall is enabled,
why not. May be start with a pure hlist for any filters and if performance
issues that really matter are pinpointed, then move the "1" and "0"
implementation to a bitmap.

My guess is that doesn't really matter. If it's "1" then you can
just have an empty set of filters for the syscall and it goes ahead
quickly. If it's "0" then the app fails, which is not what I would
call a fast path.

> If that interface seems sane, I can certainly start exploring it and
> see if I hit any surprises (and put it in the next version of the
> patch :). I think it'll simplify a fair amount of the add/drop code!

Yup.

2011-04-28 16:20:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

Quoting Will Drewry ([email protected]):
> >> ?void __secure_computing(int this_syscall)
> >> ?{
> >> - ? ? int mode = current->seccomp.mode;
> >> + ? ? int mode = -1;
> >> ? ? ? int * syscall;
> >> -
> >> + ? ? /* Do we need an RCU read lock to access current's state? */
> >
> > I'm actually confused to why you are using RCU. What are you protecting.
> > Currently, I see the state is always accessed from current->seccomp. But
> > current should not be fighting with itself.
> >
> > Maybe I'm missing something.
>
> I'm sure it's me that's missing something. I believe the seccomp
> pointer can be accessed from:
> - current
> - via /proc/<pid>/seccomp_filter (read-only)
>
> Given those cases, would it make sense to ditch the RCU interface for it?

ISTM you need them to protect the reader.

-serge

2011-04-28 16:28:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Wed, 2011-04-27 at 22:08 -0500, Will Drewry wrote:

> The only other twist is that it is possible to delay enforcement by one
> system call by supplying a "on_next_syscall: 1" 'filter'. This allows
> for a launcher process to fork(), prctl(), then execve() leaving the
> launched binary in a filtered state.

I wonder if the more "unixy" thing to do is, instead of on_next_sycall,
have "enable_on_exec". Where the user could do multiple syscalls but the
filter will not take place until an exec is made?

-- Steve

2011-04-28 16:48:51

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 11:13 AM, Frederic Weisbecker
<[email protected]> wrote:
> On Thu, Apr 28, 2011 at 10:29:11AM -0500, Will Drewry wrote:
>> On Thu, Apr 28, 2011 at 10:12 AM, Frederic Weisbecker
>> <[email protected]> wrote:
>> > Instead of having such multiline filter definition with syscall
>> > names prepended, it would be nicer to make the parsing simplier.
>> >
>> > You could have either:
>> >
>> > ? ? ? ?prctl(PR_SET_SECCOMP, mode);
>> > ? ? ? ?/* Works only if we are in mode 2 */
>> > ? ? ? ?prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);
>>
>> It'd need to be syscall_name instead of syscall_nr. ?Otherwise we're
>> right back to where Adam's patch was 2+ years ago :) ?Using the event
>> names from the syscalls infrastructure means the consumer of the
>> interface doesn't need to be confident of the syscall number.
>
> Is it really a problem?

I'd prefer using numbers myself since it gives more flexibility around
filtering entries that haven't yet made it into ftrace syscalls hooks.
However, I think there's some complexity in ensuring it matches the
host kernel. (It would also allow compat_task support which doesn't
seem to be possible now.)

> There are libraries that can resolve that. Of course I can't recall their name.

asm-generic/unistd.h will provide the mapping, if available. It would
mean that any helper libraries would need to be matched to the locally
running kernel.

If syscall names and numbers are supported, then it would mean that
this change could do both what Adam's original bitmask patch did, by
default, and when CONFIG_FTRACE_SYSCALLS are enabled, it could do name
resolution and apply more complex filters based on the syscalls
events.

Would that make sense? I wonder if the inconsistency between
sometimes using names and filters and sometimes using numbers and 0|1
would become onerous if userspace applications started using this.
Maybe something like
prctl(PR_SECCOMP_HAS_FILTERS);
would be desirable? I dunno.

>> > or:
>> > ? ? ? ?/*
>> > ? ? ? ? * If mode == 2, set the filter to syscall_nr
>> > ? ? ? ? * Recall this for each syscall that need a filter.
>> > ? ? ? ? * If a filter was previously set on the targeted syscall,
>> > ? ? ? ? * it will be overwritten.
>> > ? ? ? ? */
>> > ? ? ? ?prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
>> >
>> > One can erase a previous filter by setting the new filter "1".
>> >
>> > Also, instead of having a bitmap of syscall to accept. You could
>> > simply set "0" as a filter to those you want to deactivate:
>> >
>> > prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1
>> >
>> > Hm?
>>
>> I like the simplicity in not needing to parse anything extra, but it
>> does add the need for extra state - either a bit or a new field - to
>> represent "enabled/enforcing".
>
> And by the way I'm really puzzled about these. I don't understand
> well why we need this.

Right now, if you are in seccomp mode !=0 , your system calls will be
hooked (if TIF_SECCOMP is set too). The current patch begins filtering
immediately. And once filtering is enabled, the process is not
allowed to expand its access to the kernel system cals - only shrink
it. An "enabled" bit would be needed to tell it that even though it
is running in mode=2, it should/shouldn't kill the process for system
call filter violations and it should allow filters to be added, not
just dropped. It's why the current patch does it in one step: set the
filters and the mode. While an enabled bit could be hidden behind
whether TIF_SECCOMP is applied, I'm not sure if that would just be
confusing. There'd still be a need to APPLY_FILTERS to get the
TIF_SECCOMP flag set to start hooking the system calls.

> As for the enable_on_next_syscall. The documentation says
> it's useful if you want the filter to only apply to the
> child. So if fork immediately follows, you will be able to fork
> but if the child doesn't have the right to exec, it won't be able
> to do so. Same for the mmap() that involves...
>
> So I'm a bit confused about that.

Here's an example:
http://static.dataspill.org/seccomp_filter/v1/seccomp_launcher.c

You'd use it by calling fork() ... prctl(..., ON_NEXT_SYSCALL) then
execve since you'll already have fork()d. It seems to work, but maybe
I'm missing something :)

> But yeah if that's really needed, it looks to me better to
> reduce the parsing and cut it that way:
>
> ? ? ? ?prctl(PR_SET_SECCOMP, 2, syscall_name_or_nr, filter);
> ? ? ? ?prctl(PR_SECCOMP_APPLY_FILTERS, enable_on_next_syscall?)
>
> or something...

Cool - if there are any other flags, it could be something like:
prctl(PR_SECCOMP_SET_FLAGS, SECCOMP_ENFORCE|SECCOMP_DELAYED_ENFORCEMENT);

But only if there are other flags worth considering. I had a few
others in mind, but now I've completely forgotten :/

>>
>> The only way to do it without a third mode would be to take a
>> blacklist model - where all syscalls are allowed by default and the
>> caller has to enumerate them all and drop them. That would definitely
>> not be the right approach :)
>>
>> If a new bit of state was added, ?it could be used as:
>> ? prctl(PR_SET_SECCOMP, 2);
>> ? prctl(PR_SET_SECCOMP, 2, "sys_read", "fd == 1"); ?/* add a read filter */
>> ? prctl(PR_SET_SECCOMP, 2, "sys_write", "fd == 0"); ?/* add a read filter */
>> ? ...
>> ? prctl(PR_SET_SECCOMP, 2, "sys_read", "0"); ?/* clear the sys_read
>> filters and block it */ ?(or NULL?)
>> ? prctl(PR_SET_SECCOMP, 2, "enable"); ?/* Start enforcing */
>> ? prctl(PR_SET_SECCOMP, 2, "sys_write", "0"); ?/* Reduce attack
>> surface on the fly */
>>
>>
>> As to the "0" filter instead of a bitmask, would it make sense to just
>> cut over to an hlist now and drop the bitmask?
>> It looks like perf
>> uses that model, and I'd hope it wouldn't incur too much additional
>> overhead. ?(The linked list approach now is certainly not scalable for
>> a large number of filters!)
>
> The linked list certainly doesn't scale there. But either a hlist for
> everything, or a hlist + bitmap to check if the syscall is enabled,
> why not. May be start with a pure hlist for any filters and if performance
> issues that really matter are pinpointed, then move the "1" and "0"
> implementation to a bitmap.
>
> My guess is that doesn't really matter. If it's "1" then you can
> just have an empty set of filters for the syscall and it goes ahead
> quickly. If it's "0" then the app fails, which is not what I would
> call a fast path.

Sounds reasonable to me. I'll start to iterate on the suggested
changes and see what emerges.

>> If that interface seems sane, I can certainly start exploring it and
>> see if I hit any surprises (and put it in the next version of the
>> patch :). ?I think it'll simplify a fair amount of the add/drop code!
>
> Yup.

Thanks!
will

2011-04-28 16:53:21

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 11:28 AM, Steven Rostedt <[email protected]> wrote:
> On Wed, 2011-04-27 at 22:08 -0500, Will Drewry wrote:
>
>> The only other twist is that it is possible to delay enforcement by one
>> system call by supplying a "on_next_syscall: 1" 'filter'. ?This allows
>> for a launcher process to fork(), prctl(), then execve() leaving the
>> launched binary in a filtered state.
>
> I wonder if the more "unixy" thing to do is, instead of on_next_sycall,
> have "enable_on_exec". Where the user could do multiple syscalls but the
> filter will not take place until an exec is made?

That's what it was originally, but since ftrace syscalls doesn't wrap
sys_execve on x86, I opted to just say "next syscall". But of course
I can just check the _NR_execve == syscall_nr and do the right thing.
Duh.

thanks!

2011-04-28 16:55:28

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

Quoting Will Drewry ([email protected]):

Thanks very much for going through with this, Will. I really hope to see
this go upstream soon.

> This change adds a new seccomp mode based on the work by
> [email protected]. This mode comes with a bitmask of NR_syscalls size and
> an optional linked list of seccomp_filter objects. When in mode 2, all
> system calls are first checked against the bitmask to determine if they
> are allowed or denied. If allowed, the list of filters is checked for
> the given syscall number. If all filter predicates for the system call
> match or the system call was allowed without restriction, the process
> continues. Otherwise, it is killed and a KERN_INFO notification is
> posted.
>
> The filter language itself is provided by the ftrace filter engine.
> Related patches tweak to the perf filter trace and free allow the calls
> to be shared. Filters inherit their understanding of types and arguments
> for each system call from the CONFIG_FTRACE_SYSCALLS subsystem which
> predefines this information in syscall_metadata associated enter_event
> (and exit_event) structures.
>
> The result is that a process may reduce its available interfaces to
> the kernel through prctl() without knowing the appropriate system call
> number a priori and with the flexibility of filtering based on
> register-stored arguments. (String checks suffer from TOCTOU issues and
> should be left to LSMs to provide policy for! Don't get greedy :)
>
> A sample filterset for a process that only needs to interact over stdin
> and stdout and exit cleanly is shown below:
> sys_read: fd == 0
> sys_write: fd == 1
> sys_exit_group: 1
>
> The filters may be specified once prior to entering the reduced access
> state:
> prctl(PR_SET_SECCOMP, 2, filters);
> If prctl() is in the allowed bitmask, the process may futher reduce
> privileges by dropping system calls from the allowed set.
>
> The only other twist is that it is possible to delay enforcement by one
> system call by supplying a "on_next_syscall: 1" 'filter'. This allows
> for a launcher process to fork(), prctl(), then execve() leaving the
> launched binary in a filtered state.
>
> Implementation-wise, seccomp.c now uses seccomp_state struct which is
> managed using RCU primitives. It contains the system call bitmask and
> a linked list of seccomp_filters (which are also managed as an RCU
> list). All mutations (barring one optional bit flip) to the
> seccomp_state are always done on a duplicate of the current state value
> which is swapped on prior to use.
>
> Signed-off-by: Will Drewry <[email protected]>

...

> void __secure_computing(int this_syscall)
> {
> - int mode = current->seccomp.mode;
> + int mode = -1;
> int * syscall;
> -
> + /* Do we need an RCU read lock to access current's state? */

Nope.

...

> +struct seccomp_state *get_seccomp_state(struct seccomp_state *orig)
> +{
> + if (!orig)
> + return NULL;
> + atomic_inc(&orig->usage);
> + return orig;
> +}
> +
> +static void __put_seccomp_state(struct seccomp_state *orig)
> +{
> + WARN_ON(atomic_read(&orig->usage));
> + seccomp_drop_all_filters(orig);
> + kfree(orig);
> +}
> +
> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
> +void put_seccomp_state(struct seccomp_state *orig)
> +{
> + if (!orig)
> + return;
> +
> + if (atomic_dec_and_test(&orig->usage))
> + __put_seccomp_state(orig);
> +}
> +
> long prctl_get_seccomp(void)
> {
> - return current->seccomp.mode;
> + if (!current->seccomp.state)
> + return 0;
> + return current->seccomp.state->mode;
> }
>
> -long prctl_set_seccomp(unsigned long seccomp_mode)
> +long prctl_set_seccomp(unsigned long seccomp_mode, char *filters)
> {
> long ret;
> + struct seccomp_state *state, *orig_state;
>
> - /* can set it only once to be even more secure */
> + rcu_read_lock();
> + orig_state = get_seccomp_state(rcu_dereference(current->seccomp.state));
> + rcu_read_unlock();
> +
> + /* mode 1 can only be set once, but mode 2 may have access reduced. */
> ret = -EPERM;
> - if (unlikely(current->seccomp.mode))
> + if (orig_state && orig_state->mode != 2 && seccomp_mode != 2)
> goto out;
>
> ret = -EINVAL;
> - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> - current->seccomp.mode = seccomp_mode;
> - set_thread_flag(TIF_SECCOMP);
> + if (!seccomp_mode || seccomp_mode > NR_SECCOMP_MODES)
> + goto out;
> +
> + /* Prepare to modify the seccomp state by dropping the shared
> + * reference after duplicating it.
> + */
> + state = (orig_state ? seccomp_state_dup(orig_state) :
> + seccomp_state_new());
> + if (!state) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = 0;
> + switch (seccomp_mode) {
> + case 1:
> #ifdef TIF_NOTSC
> disable_TSC();
> #endif
> + state->mode = seccomp_mode;
> + set_thread_flag(TIF_SECCOMP);
> ret = 0;
> + break;
> + case 2:
> + if (filters) {
> + ret = seccomp_parse_filters(state, filters);
> + /* No partial applications. */
> + if (ret)
> + goto free_state;
> + }
> +
> + if (!state->mode) {
> + state->mode = seccomp_mode;
> + set_thread_flag(TIF_SECCOMP);
> + }
> + break;
> + default:
> + ret = -EINVAL;
> }
>
> - out:
> + rcu_assign_pointer(current->seccomp.state, state);
> + synchronize_rcu();
> + put_seccomp_state(orig_state); /* for the get */
> +
> +out:
> + put_seccomp_state(orig_state); /* for the task */
> + return ret;
> +
> +free_state:
> + put_seccomp_state(orig_state); /* for the get */
> + put_seccomp_state(state); /* drop the dup */
> return ret;
> }

This looks exactly right. The only case where put_seccomp_state()
might actually lead to freeing the state is where the current's
state gets reassigned. So you need to synchronize_rcu() before
that (as you do). The other cases will only decrement the usage
counter, can race with a reader doing (inc; get) but not with a
final free, which can only be done here.

(Rambling above is just me pursuading myself)

> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c

Unfortunately your use of filters doesn't seem exactly right.

> +/* seccomp_copy_all_filters - copies all filters from src to dst.
> + *
> + * @dst: the list_head for seccomp_filters to populate.
> + * @src: the list_head for seccomp_filters to copy from.
> + * Returns non-zero on failure.
> + */
> +int seccomp_copy_all_filters(struct list_head *dst,
> + const struct list_head *src)
> +{
> + struct seccomp_filter *filter;
> + int ret = 0;
> + BUG_ON(!dst || !src);
> + if (list_empty(src))
> + goto done;
> + rcu_read_lock();
> + list_for_each_entry(filter, src, list) {
> + struct seccomp_filter *new_filter = copy_seccomp_filter(filter);

copy_seccomp_filter() causes kzalloc to be called. You can't do that under
rcu_read_lock().

I actually thought you were going to be more extreme about the seccomp
state than you are: I thought you were going to tie a filter list to
seccomp state. So adding or removing a filter would have required
duping the seccomp state, duping all the filters, making the change in
the copy, and then swapping the new state into place. Slow in the
hopefully rare update case, but safe.

You don't have to do that, but then I'm pretty sure you'll need to add
reference counts to each filter and use rcu cycles to a reader from
having the filter disappear mid-read.

-serge

2011-04-28 16:56:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, 2011-04-28 at 10:30 -0500, Will Drewry wrote:

> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index 57d4b13..1bee87c 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -8,10 +8,11 @@
> >>
> >> #include <linux/seccomp.h>
> >> #include <linux/sched.h>
> >> +#include <linux/slab.h>
> >> #include <linux/compat.h>
> >>
> >> /* #define SECCOMP_DEBUG 1 */
> >> -#define NR_SECCOMP_MODES 1
> >> +#define NR_SECCOMP_MODES 2
> >>
> >> /*
> >> * Secure computing mode 1 allows only read/write/exit/sigreturn.
> >> @@ -32,9 +33,11 @@ static int mode1_syscalls_32[] = {
> >>
> >> void __secure_computing(int this_syscall)
> >> {
> >> - int mode = current->seccomp.mode;
> >> + int mode = -1;
> >> int * syscall;
> >> -
> >> + /* Do we need an RCU read lock to access current's state? */
> >
> > I'm actually confused to why you are using RCU. What are you protecting.
> > Currently, I see the state is always accessed from current->seccomp. But
> > current should not be fighting with itself.
> >
> > Maybe I'm missing something.
>
> I'm sure it's me that's missing something. I believe the seccomp
> pointer can be accessed from:
> - current
> - via /proc/<pid>/seccomp_filter (read-only)
>
> Given those cases, would it make sense to ditch the RCU interface for it?

Looking at this in a bit more detail. I think you can ditch the
rcu_readlocks where current accesses its own seccomp state. As current
is the one that duplicates it (and ups the refcount) on fork, and
current wont free it until after it performs a rcu_synchronization. No
one else can free current's seccomp state while current has a reference
to it.

You still need the rcu_readlocks on the readers for the proc system.
Otherwise the handle can be freed while its still in use. With the
rcu_readlocks, these readers will always get the refcount before current
frees it. And then the dec_and_test should work as expected when the
readers do the put.

-- Steve

2011-04-28 17:16:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, 2011-04-28 at 11:55 -0500, Serge E. Hallyn wrote:

> ...
>
> > void __secure_computing(int this_syscall)
> > {
> > - int mode = current->seccomp.mode;
> > + int mode = -1;
> > int * syscall;
> > -
> > + /* Do we need an RCU read lock to access current's state? */
>
> Nope.

Correct.

> > - out:
> > + rcu_assign_pointer(current->seccomp.state, state);
> > + synchronize_rcu();
> > + put_seccomp_state(orig_state); /* for the get */
> > +
> > +out:
> > + put_seccomp_state(orig_state); /* for the task */
> > + return ret;
> > +
> > +free_state:
> > + put_seccomp_state(orig_state); /* for the get */
> > + put_seccomp_state(state); /* drop the dup */
> > return ret;
> > }
>
> This looks exactly right. The only case where put_seccomp_state()
> might actually lead to freeing the state is where the current's
> state gets reassigned. So you need to synchronize_rcu() before
> that (as you do). The other cases will only decrement the usage
> counter, can race with a reader doing (inc; get) but not with a
> final free, which can only be done here.

Technically incorrect ;)

"final free, which can only be done here."

This is not the only place that a free will happen. But the code is
correct none-the-less.

Reader on another CPU ups the orig_state refcount under rcu_readlock,
but after it ups the refcount it releases the rcu_readlock and continues
to read this state.

Current on this CPU calls this function does the synchronize_rcu() and
calls put on the state. But since the reader still has a ref count on
it, it does not get freed here.

When the reader is finally done with the state it calls the put() which
does the final free on it.

The code still looks correct, I'm just nitpicking your analysis.

>
> (Rambling above is just me pursuading myself)

Me rambling too.

>
> > diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>
> Unfortunately your use of filters doesn't seem exactly right.
>
> > +/* seccomp_copy_all_filters - copies all filters from src to dst.
> > + *
> > + * @dst: the list_head for seccomp_filters to populate.
> > + * @src: the list_head for seccomp_filters to copy from.
> > + * Returns non-zero on failure.
> > + */
> > +int seccomp_copy_all_filters(struct list_head *dst,
> > + const struct list_head *src)
> > +{
> > + struct seccomp_filter *filter;
> > + int ret = 0;
> > + BUG_ON(!dst || !src);
> > + if (list_empty(src))
> > + goto done;
> > + rcu_read_lock();
> > + list_for_each_entry(filter, src, list) {
> > + struct seccomp_filter *new_filter = copy_seccomp_filter(filter);
>
> copy_seccomp_filter() causes kzalloc to be called. You can't do that under
> rcu_read_lock().

Unless you change the kzalloc to do GFP_ATOMIC. Not sure I'd recommend
doing that.

>
> I actually thought you were going to be more extreme about the seccomp
> state than you are: I thought you were going to tie a filter list to
> seccomp state. So adding or removing a filter would have required
> duping the seccomp state, duping all the filters, making the change in
> the copy, and then swapping the new state into place. Slow in the
> hopefully rare update case, but safe.
>
> You don't have to do that, but then I'm pretty sure you'll need to add
> reference counts to each filter and use rcu cycles to a reader from
> having the filter disappear mid-read.

Or you can preallocate the new filters, call rcu_read_lock(), check if
the number of old filters is the same or less, if more, call
rcu_read_unlock, and try allocating more, and then call rcu_read_lock()
again and repeat. Then just copy the filters to the preallocate ones.
rcu_read_unlock() and then free any unused allocated filters.

Maybe a bit messy, but not that bad.

-- Steve

2011-04-28 17:36:29

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 11:48:47AM -0500, Will Drewry wrote:
> On Thu, Apr 28, 2011 at 11:13 AM, Frederic Weisbecker
> <[email protected]> wrote:
> > On Thu, Apr 28, 2011 at 10:29:11AM -0500, Will Drewry wrote:
> >> On Thu, Apr 28, 2011 at 10:12 AM, Frederic Weisbecker
> >> <[email protected]> wrote:
> >> > Instead of having such multiline filter definition with syscall
> >> > names prepended, it would be nicer to make the parsing simplier.
> >> >
> >> > You could have either:
> >> >
> >> > ? ? ? ?prctl(PR_SET_SECCOMP, mode);
> >> > ? ? ? ?/* Works only if we are in mode 2 */
> >> > ? ? ? ?prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);
> >>
> >> It'd need to be syscall_name instead of syscall_nr. ?Otherwise we're
> >> right back to where Adam's patch was 2+ years ago :) ?Using the event
> >> names from the syscalls infrastructure means the consumer of the
> >> interface doesn't need to be confident of the syscall number.
> >
> > Is it really a problem?
>
> I'd prefer using numbers myself since it gives more flexibility around
> filtering entries that haven't yet made it into ftrace syscalls hooks.
> However, I think there's some complexity in ensuring it matches the
> host kernel. (It would also allow compat_task support which doesn't
> seem to be possible now.)

You'd have some problems with compat and syscall naming, I think some
of them are not the same between compat architectures.

Note that your new prctl() request is turning syscall handler names (in-kernel API)
into a user ABI. We won't be able to change the name of those kernel handlers
after that. I'm not sure that's desired.

OTOH, syscall numbers are a user ABI,


>
> > There are libraries that can resolve that. Of course I can't recall their name.
>
> asm-generic/unistd.h will provide the mapping, if available. It would
> mean that any helper libraries would need to be matched to the locally
> running kernel.

Yeah. But syscall numbers don't move inside a given architecture (could they?)
Or if so then you'd a dynamic library.


> >> > or:
> >> > ? ? ? ?/*
> >> > ? ? ? ? * If mode == 2, set the filter to syscall_nr
> >> > ? ? ? ? * Recall this for each syscall that need a filter.
> >> > ? ? ? ? * If a filter was previously set on the targeted syscall,
> >> > ? ? ? ? * it will be overwritten.
> >> > ? ? ? ? */
> >> > ? ? ? ?prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
> >> >
> >> > One can erase a previous filter by setting the new filter "1".
> >> >
> >> > Also, instead of having a bitmap of syscall to accept. You could
> >> > simply set "0" as a filter to those you want to deactivate:
> >> >
> >> > prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1
> >> >
> >> > Hm?
> >>
> >> I like the simplicity in not needing to parse anything extra, but it
> >> does add the need for extra state - either a bit or a new field - to
> >> represent "enabled/enforcing".
> >
> > And by the way I'm really puzzled about these. I don't understand
> > well why we need this.
>
> Right now, if you are in seccomp mode !=0 , your system calls will be
> hooked (if TIF_SECCOMP is set too). The current patch begins filtering
> immediately. And once filtering is enabled, the process is not
> allowed to expand its access to the kernel system cals - only shrink
> it. An "enabled" bit would be needed to tell it that even though it
> is running in mode=2, it should/shouldn't kill the process for system
> call filter violations and it should allow filters to be added, not
> just dropped. It's why the current patch does it in one step: set the
> filters and the mode. While an enabled bit could be hidden behind
> whether TIF_SECCOMP is applied, I'm not sure if that would just be
> confusing. There'd still be a need to APPLY_FILTERS to get the
> TIF_SECCOMP flag set to start hooking the system calls.

It seems the filter is only ever needed for the childs, right?
And moreover when they exec. So Steve's suggestion to apply
the filters only after the next exec makes sense.

That would solve the problem of both ON_NEXT_SYSCALL and APPLY_FILTER.

I may be missing something obvious though. Do you see a limitation there?

>
> > As for the enable_on_next_syscall. The documentation says
> > it's useful if you want the filter to only apply to the
> > child. So if fork immediately follows, you will be able to fork
> > but if the child doesn't have the right to exec, it won't be able
> > to do so. Same for the mmap() that involves...
> >
> > So I'm a bit confused about that.
>
> Here's an example:
> http://static.dataspill.org/seccomp_filter/v1/seccomp_launcher.c
>
> You'd use it by calling fork() ... prctl(..., ON_NEXT_SYSCALL) then
> execve since you'll already have fork()d. It seems to work, but maybe
> I'm missing something :)

Nope, I see.

>
> > But yeah if that's really needed, it looks to me better to
> > reduce the parsing and cut it that way:
> >
> > ? ? ? ?prctl(PR_SET_SECCOMP, 2, syscall_name_or_nr, filter);
> > ? ? ? ?prctl(PR_SECCOMP_APPLY_FILTERS, enable_on_next_syscall?)
> >
> > or something...
>
> Cool - if there are any other flags, it could be something like:
> prctl(PR_SECCOMP_SET_FLAGS, SECCOMP_ENFORCE|SECCOMP_DELAYED_ENFORCEMENT);
>
> But only if there are other flags worth considering. I had a few
> others in mind, but now I've completely forgotten :/

2011-04-28 17:40:01

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

Quoting Steven Rostedt ([email protected]):
> On Thu, 2011-04-28 at 11:55 -0500, Serge E. Hallyn wrote:
>
> > ...
> >
> > > void __secure_computing(int this_syscall)
> > > {
> > > - int mode = current->seccomp.mode;
> > > + int mode = -1;
> > > int * syscall;
> > > -
> > > + /* Do we need an RCU read lock to access current's state? */
> >
> > Nope.
>
> Correct.
>
> > > - out:
> > > + rcu_assign_pointer(current->seccomp.state, state);
> > > + synchronize_rcu();
> > > + put_seccomp_state(orig_state); /* for the get */
> > > +
> > > +out:
> > > + put_seccomp_state(orig_state); /* for the task */
> > > + return ret;
> > > +
> > > +free_state:
> > > + put_seccomp_state(orig_state); /* for the get */
> > > + put_seccomp_state(state); /* drop the dup */
> > > return ret;
> > > }
> >
> > This looks exactly right. The only case where put_seccomp_state()
> > might actually lead to freeing the state is where the current's
> > state gets reassigned. So you need to synchronize_rcu() before
> > that (as you do). The other cases will only decrement the usage
> > counter, can race with a reader doing (inc; get) but not with a
> > final free, which can only be done here.
>
> Technically incorrect ;)
>
> "final free, which can only be done here."
>
> This is not the only place that a free will happen. But the code is
> correct none-the-less.
>
> Reader on another CPU ups the orig_state refcount under rcu_readlock,
> but after it ups the refcount it releases the rcu_readlock and continues
> to read this state.
>
> Current on this CPU calls this function does the synchronize_rcu() and
> calls put on the state. But since the reader still has a ref count on
> it, it does not get freed here.
>
> When the reader is finally done with the state it calls the put() which
> does the final free on it.
>
> The code still looks correct, I'm just nitpicking your analysis.

:) I appreciate the precision.

> > (Rambling above is just me pursuading myself)
>
> Me rambling too.
>
> >
> > > diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> >
> > Unfortunately your use of filters doesn't seem exactly right.
> >
> > > +/* seccomp_copy_all_filters - copies all filters from src to dst.
> > > + *
> > > + * @dst: the list_head for seccomp_filters to populate.
> > > + * @src: the list_head for seccomp_filters to copy from.
> > > + * Returns non-zero on failure.
> > > + */
> > > +int seccomp_copy_all_filters(struct list_head *dst,
> > > + const struct list_head *src)
> > > +{
> > > + struct seccomp_filter *filter;
> > > + int ret = 0;
> > > + BUG_ON(!dst || !src);
> > > + if (list_empty(src))
> > > + goto done;
> > > + rcu_read_lock();
> > > + list_for_each_entry(filter, src, list) {
> > > + struct seccomp_filter *new_filter = copy_seccomp_filter(filter);
> >
> > copy_seccomp_filter() causes kzalloc to be called. You can't do that under
> > rcu_read_lock().
>
> Unless you change the kzalloc to do GFP_ATOMIC. Not sure I'd recommend
> doing that.
>
> >
> > I actually thought you were going to be more extreme about the seccomp
> > state than you are: I thought you were going to tie a filter list to
> > seccomp state. So adding or removing a filter would have required
> > duping the seccomp state, duping all the filters, making the change in
> > the copy, and then swapping the new state into place. Slow in the
> > hopefully rare update case, but safe.
> >
> > You don't have to do that, but then I'm pretty sure you'll need to add
> > reference counts to each filter and use rcu cycles to a reader from
> > having the filter disappear mid-read.
>
> Or you can preallocate the new filters, call rcu_read_lock(), check if
> the number of old filters is the same or less, if more, call
> rcu_read_unlock, and try allocating more, and then call rcu_read_lock()
> again and repeat. Then just copy the filters to the preallocate ones.
> rcu_read_unlock() and then free any unused allocated filters.
>
> Maybe a bit messy, but not that bad.

Sounds good.

thanks,
-serge

2011-04-28 17:43:36

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

Quoting Ingo Molnar ([email protected]):
> I've Cc:-ed Linus and Andrew: are you guys opposed to such flexible, dynamic
> filters conceptually? I think we should really think hard about the actual ABI
> as this could easily spread to more applications than Chrome/Chromium.

We want to use it for containers, to try and provide some bit of
mitigation for the fact that they are sharing a kernel with the host.

thanks,
-serge

2011-04-28 18:01:14

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 12:39 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Steven Rostedt ([email protected]):
>> On Thu, 2011-04-28 at 11:55 -0500, Serge E. Hallyn wrote:
>>
>> > ...
>> >
>> > > ?void __secure_computing(int this_syscall)
>> > > ?{
>> > > - int mode = current->seccomp.mode;
>> > > + int mode = -1;
>> > > ? int * syscall;
>> > > -
>> > > + /* Do we need an RCU read lock to access current's state? */
>> >
>> > Nope.
>>
>> Correct.
>>
>> > > - out:
>> > > + rcu_assign_pointer(current->seccomp.state, state);
>> > > + synchronize_rcu();
>> > > + put_seccomp_state(orig_state); ?/* for the get */
>> > > +
>> > > +out:
>> > > + put_seccomp_state(orig_state); ?/* for the task */
>> > > + return ret;
>> > > +
>> > > +free_state:
>> > > + put_seccomp_state(orig_state); ?/* for the get */
>> > > + put_seccomp_state(state); ?/* drop the dup */
>> > > ? return ret;
>> > > ?}
>> >
>> > This looks exactly right. ?The only case where put_seccomp_state()
>> > might actually lead to freeing the state is where the current's
>> > state gets reassigned. ?So you need to synchronize_rcu() before
>> > that (as you do). ?The other cases will only decrement the usage
>> > counter, can race with a reader doing (inc; get) but not with a
>> > final free, which can only be done here.
>>
>> Technically incorrect ;)
>>
>> "final free, which can only be done here."
>>
>> This is not the only place that a free will happen. But the code is
>> correct none-the-less.
>>
>> Reader on another CPU ups the orig_state refcount under rcu_readlock,
>> but after it ups the refcount it releases the rcu_readlock and continues
>> to read this state.
>>
>> Current on this CPU calls this function does the synchronize_rcu() and
>> calls put on the state. But since the reader still has a ref count on
>> it, it does not get freed here.
>>
>> When the reader is finally done with the state it calls the put() which
>> does the final free on it.
>>
>> The code still looks correct, I'm just nitpicking your analysis.
>
> :) ?I appreciate the precision.
>
>> > (Rambling above is just me pursuading myself)
>>
>> Me rambling too.
>>
>> >
>> > > diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> >
>> > Unfortunately your use of filters doesn't seem exactly right.
>> >
>> > > +/* seccomp_copy_all_filters - copies all filters from src to dst.
>> > > + *
>> > > + * @dst: the list_head for seccomp_filters to populate.
>> > > + * @src: the list_head for seccomp_filters to copy from.
>> > > + * Returns non-zero on failure.
>> > > + */
>> > > +int seccomp_copy_all_filters(struct list_head *dst,
>> > > + ? ? ? ? ? ? ? ? ? ? ?const struct list_head *src)
>> > > +{
>> > > + struct seccomp_filter *filter;
>> > > + int ret = 0;
>> > > + BUG_ON(!dst || !src);
>> > > + if (list_empty(src))
>> > > + ? ? ? ? goto done;
>> > > + rcu_read_lock();
>> > > + list_for_each_entry(filter, src, list) {
>> > > + ? ? ? ? struct seccomp_filter *new_filter = copy_seccomp_filter(filter);
>> >
>> > copy_seccomp_filter() causes kzalloc to be called. ?You can't do that under
>> > rcu_read_lock().
>>
>> Unless you change the kzalloc to do GFP_ATOMIC. Not sure I'd recommend
>> doing that.

Good to know! My question (below) is if I should even be using an RCU
guard at all. I may have been a bit too overzealous.

>> >
>> > I actually thought you were going to be more extreme about the seccomp
>> > state than you are: ?I thought you were going to tie a filter list to
>> > seccomp state. ?So adding or removing a filter would have required
>> > duping the seccomp state, duping all the filters, making the change in
>> > the copy, and then swapping the new state into place. ?Slow in the
>> > hopefully rare update case, but safe.

Hrm, I think I'm confused now! This is exactly what I *thought* the
code was doing.

At present, seccomp_state can be shared across predecessor/ancestor
relationships using refcounting in fork.c (get/put). However, the
only way to change a given seccomp_state or its filters is either
through the one-bit on_next_syscall change or through
prctl_set_seccomp. In prctl_set_seccomp, it does:
state = (orig_state ? seccomp_state_dup(orig_state) :
seccomp_state_new());
operates on the new state and then rcu_assign_pointer()s it to the
task. I didn't intentionally provide any way to drop filters from an
existing state object nor change the filtered syscalls on an in-use
object. That _dup call should hit the impromperly rcu_locked
copy_all_filters returning duplicates of the original filters by
reparsing the filter_string.

Did I accidentally provide a means to mutate a state object or filter
list without dup()ing? :/

>> > You don't have to do that, but then I'm pretty sure you'll need to add
>> > reference counts to each filter and use rcu cycles to a reader from
>> > having the filter disappear mid-read.

Right now, I don't think it is possible for seccomp_copy_all_filters()
to be called with a src list that changes since every change is
guarded by a seccomp_state_dup(). If that's not true, then I violated
my own invariant :/ If that is the case, should I not treat the list
as an RCU list? There should never be any simultaneous
reader/writers, just a single reader/writer or multiple readers.

>>
>> Or you can preallocate the new filters, call rcu_read_lock(), check if
>> the number of old filters is the same or less, if more, call
>> rcu_read_unlock, and try allocating more, and then call rcu_read_lock()
>> again and repeat. Then just copy the filters to the preallocate ones.
>> rcu_read_unlock() and then free any unused allocated filters.
>>
>> Maybe a bit messy, but not that bad.
>
> Sounds good.

I'd prefer a heavy-weight copy ;)

I think I'm a bit lost -- am I missing something obvious here? I was
hoping by using a swapped-in-seccomp_state-pointer, locking and
consistency internal to the state objects would be a tad easier -
though expensive.

thanks!
will

2011-04-28 18:02:17

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 11:56 AM, Steven Rostedt <[email protected]> wrote:
> On Thu, 2011-04-28 at 10:30 -0500, Will Drewry wrote:
>
>> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> >> index 57d4b13..1bee87c 100644
>> >> --- a/kernel/seccomp.c
>> >> +++ b/kernel/seccomp.c
>> >> @@ -8,10 +8,11 @@
>> >>
>> >> ?#include <linux/seccomp.h>
>> >> ?#include <linux/sched.h>
>> >> +#include <linux/slab.h>
>> >> ?#include <linux/compat.h>
>> >>
>> >> ?/* #define SECCOMP_DEBUG 1 */
>> >> -#define NR_SECCOMP_MODES 1
>> >> +#define NR_SECCOMP_MODES 2
>> >>
>> >> ?/*
>> >> ? * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> >> @@ -32,9 +33,11 @@ static int mode1_syscalls_32[] = {
>> >>
>> >> ?void __secure_computing(int this_syscall)
>> >> ?{
>> >> - ? ? int mode = current->seccomp.mode;
>> >> + ? ? int mode = -1;
>> >> ? ? ? int * syscall;
>> >> -
>> >> + ? ? /* Do we need an RCU read lock to access current's state? */
>> >
>> > I'm actually confused to why you are using RCU. What are you protecting.
>> > Currently, I see the state is always accessed from current->seccomp. But
>> > current should not be fighting with itself.
>> >
>> > Maybe I'm missing something.
>>
>> I'm sure it's me that's missing something. ?I believe the seccomp
>> pointer can be accessed from:
>> - current
>> - via /proc/<pid>/seccomp_filter (read-only)
>>
>> Given those cases, would it make sense to ditch the RCU interface for it?
>
> Looking at this in a bit more detail. I think you can ditch the
> rcu_readlocks where current accesses its own seccomp state. As current
> is the one that duplicates it (and ups the refcount) on fork, and
> current wont free it until after it performs a rcu_synchronization. No
> one else can free current's seccomp state while current has a reference
> to it.
>
> You still need the rcu_readlocks on the readers for the proc system.
> Otherwise the handle can be freed while its still in use. With the
> rcu_readlocks, these readers will always get the refcount before current
> frees it. And then the dec_and_test should work as expected when the
> readers do the put.

Great, I'll do that!

2011-04-28 18:21:21

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, 2011-04-28 at 13:01 -0500, Will Drewry wrote:

> Good to know! My question (below) is if I should even be using an RCU
> guard at all. I may have been a bit too overzealous.
>
> >> >
> >> > I actually thought you were going to be more extreme about the seccomp
> >> > state than you are: I thought you were going to tie a filter list to
> >> > seccomp state. So adding or removing a filter would have required
> >> > duping the seccomp state, duping all the filters, making the change in
> >> > the copy, and then swapping the new state into place. Slow in the
> >> > hopefully rare update case, but safe.
>
> Hrm, I think I'm confused now! This is exactly what I *thought* the
> code was doing.

I guess my thought by looking at the code is the call_rcu() to free the
filters in drop_matching_filters().

Also, you have seccomp_drop_all_filters() as a standalone function that
is even exported to modules.

This to me, seems that the filters can disappear at any time. Because
the freeing is done with a rcu_call() the access to the filters needs a
ref count.

>
> At present, seccomp_state can be shared across predecessor/ancestor
> relationships using refcounting in fork.c (get/put). However, the
> only way to change a given seccomp_state or its filters is either
> through the one-bit on_next_syscall change or through
> prctl_set_seccomp. In prctl_set_seccomp, it does:
> state = (orig_state ? seccomp_state_dup(orig_state) :
> seccomp_state_new());
> operates on the new state and then rcu_assign_pointer()s it to the
> task. I didn't intentionally provide any way to drop filters from an
> existing state object nor change the filtered syscalls on an in-use
> object. That _dup call should hit the impromperly rcu_locked
> copy_all_filters returning duplicates of the original filters by
> reparsing the filter_string.
>
> Did I accidentally provide a means to mutate a state object or filter
> list without dup()ing? :/

That seccomp_drop_all_filters() looks like you can.

>
> >> > You don't have to do that, but then I'm pretty sure you'll need to add
> >> > reference counts to each filter and use rcu cycles to a reader from
> >> > having the filter disappear mid-read.
>
> Right now, I don't think it is possible for seccomp_copy_all_filters()
> to be called with a src list that changes since every change is
> guarded by a seccomp_state_dup(). If that's not true, then I violated
> my own invariant :/ If that is the case, should I not treat the list
> as an RCU list? There should never be any simultaneous
> reader/writers, just a single reader/writer or multiple readers.

Again, that seccomp_copy_all_filters() is called free standing (exported
to modules). Which to me means that it can be called by anyone at
anytime. There is no protection of this src list.

>
> >>
> >> Or you can preallocate the new filters, call rcu_read_lock(), check if
> >> the number of old filters is the same or less, if more, call
> >> rcu_read_unlock, and try allocating more, and then call rcu_read_lock()
> >> again and repeat. Then just copy the filters to the preallocate ones.
> >> rcu_read_unlock() and then free any unused allocated filters.
> >>
> >> Maybe a bit messy, but not that bad.
> >
> > Sounds good.
>
> I'd prefer a heavy-weight copy ;)
>
> I think I'm a bit lost -- am I missing something obvious here? I was
> hoping by using a swapped-in-seccomp_state-pointer, locking and
> consistency internal to the state objects would be a tad easier -
> though expensive.

Perhaps, but those free standing functions (the ones that are exported
to modules) seem like they can destroy state.

Your code would have been correct if you could call kzalloc under
rcu_read_lock() (which you can on some kernel configurations but not
all). The issue is that you need to pull out that allocation from the
rcu_read_lock() because rcu_read_lock assumes you can't preempt, and
that allocation can schedule out. The access to the filters must be done
under rcu_read_lock(), other than that, you're fine.

-- Steve

2011-04-28 18:21:29

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 12:36 PM, Frederic Weisbecker
<[email protected]> wrote:
> On Thu, Apr 28, 2011 at 11:48:47AM -0500, Will Drewry wrote:
>> On Thu, Apr 28, 2011 at 11:13 AM, Frederic Weisbecker
>> <[email protected]> wrote:
>> > On Thu, Apr 28, 2011 at 10:29:11AM -0500, Will Drewry wrote:
>> >> On Thu, Apr 28, 2011 at 10:12 AM, Frederic Weisbecker
>> >> <[email protected]> wrote:
>> >> > Instead of having such multiline filter definition with syscall
>> >> > names prepended, it would be nicer to make the parsing simplier.
>> >> >
>> >> > You could have either:
>> >> >
>> >> > ? ? ? ?prctl(PR_SET_SECCOMP, mode);
>> >> > ? ? ? ?/* Works only if we are in mode 2 */
>> >> > ? ? ? ?prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter);
>> >>
>> >> It'd need to be syscall_name instead of syscall_nr. ?Otherwise we're
>> >> right back to where Adam's patch was 2+ years ago :) ?Using the event
>> >> names from the syscalls infrastructure means the consumer of the
>> >> interface doesn't need to be confident of the syscall number.
>> >
>> > Is it really a problem?
>>
>> I'd prefer using numbers myself since it gives more flexibility around
>> filtering entries that haven't yet made it into ftrace syscalls hooks.
>> However, I think there's some complexity in ensuring it matches the
>> host kernel. ?(It would also allow compat_task support which doesn't
>> seem to be possible now.)
>
> You'd have some problems with compat and syscall naming, I think some
> of them are not the same between compat architectures.
>
> Note that your new prctl() request is turning syscall handler names (in-kernel API)
> into a user ABI. We won't be able to change the name of those kernel handlers
> after that. I'm not sure that's desired.
>
> OTOH, syscall numbers are a user ABI,

Makes sense. I don't mind going the syscall nr route and using a
library for name resolution in userspace, but other users could just
include the libc provided names. If it doesn't work, the SIGKILL will
make it apparent pretty quickly :p I don't think it would be a
problem and it makes the kernel side even simpler again.

>
>
>>
>> > There are libraries that can resolve that. Of course I can't recall their name.
>>
>> asm-generic/unistd.h will provide the mapping, if available. It would
>> mean that any helper libraries would need to be matched to the locally
>> running kernel.
>
> Yeah. But syscall numbers don't move inside a given architecture (could they?)
> Or if so then you'd a dynamic library.

Works for me.

>
>> >> > or:
>> >> > ? ? ? ?/*
>> >> > ? ? ? ? * If mode == 2, set the filter to syscall_nr
>> >> > ? ? ? ? * Recall this for each syscall that need a filter.
>> >> > ? ? ? ? * If a filter was previously set on the targeted syscall,
>> >> > ? ? ? ? * it will be overwritten.
>> >> > ? ? ? ? */
>> >> > ? ? ? ?prctl(PR_SET_SECCOMP, mode, syscall_nr, filter);
>> >> >
>> >> > One can erase a previous filter by setting the new filter "1".
>> >> >
>> >> > Also, instead of having a bitmap of syscall to accept. You could
>> >> > simply set "0" as a filter to those you want to deactivate:
>> >> >
>> >> > prctl(PR_SET_SECCOMP, 2, 1, 0); <- deactivate the syscall_nr 1
>> >> >
>> >> > Hm?
>> >>
>> >> I like the simplicity in not needing to parse anything extra, but it
>> >> does add the need for extra state - either a bit or a new field - to
>> >> represent "enabled/enforcing".
>> >
>> > And by the way I'm really puzzled about these. I don't understand
>> > well why we need this.
>>
>> Right now, if you are in seccomp mode !=0 , your system calls will be
>> hooked (if TIF_SECCOMP is set too). The current patch begins filtering
>> immediately. ?And once filtering is enabled, the process is not
>> allowed to expand its access to the kernel system cals - only shrink
>> it. ?An "enabled" bit would be needed to tell it that even though it
>> is running in mode=2, it should/shouldn't kill the process for system
>> call filter violations and it should allow filters to be added, not
>> just dropped. ?It's why the current patch does it in one step: set the
>> filters and the mode. ?While an enabled bit could be hidden behind
>> whether TIF_SECCOMP is applied, I'm not sure if that would just be
>> confusing. ?There'd still be a need to APPLY_FILTERS to get the
>> TIF_SECCOMP flag set to start hooking the system calls.
>
> It seems the filter is only ever needed for the childs, right?
> And moreover when they exec. So Steve's suggestion to apply
> the filters only after the next exec makes sense.
>
> That would solve the problem of both ON_NEXT_SYSCALL and APPLY_FILTER.
>
> I may be missing something obvious though. Do you see a limitation there?
>
>>
>> > As for the enable_on_next_syscall. The documentation says
>> > it's useful if you want the filter to only apply to the
>> > child. So if fork immediately follows, you will be able to fork
>> > but if the child doesn't have the right to exec, it won't be able
>> > to do so. Same for the mmap() that involves...
>> >
>> > So I'm a bit confused about that.
>>
>> Here's an example:
>> ? http://static.dataspill.org/seccomp_filter/v1/seccomp_launcher.c
>>
>> You'd use it by calling fork() ... prctl(..., ON_NEXT_SYSCALL) then
>> execve since you'll already have fork()d. ?It seems to work, but maybe
>> I'm missing something :)
>
> Nope, I see.
>
>>
>> > But yeah if that's really needed, it looks to me better to
>> > reduce the parsing and cut it that way:
>> >
>> > ? ? ? ?prctl(PR_SET_SECCOMP, 2, syscall_name_or_nr, filter);
>> > ? ? ? ?prctl(PR_SECCOMP_APPLY_FILTERS, enable_on_next_syscall?)
>> >
>> > or something...
>>
>> Cool - if there are any other flags, it could be something like:
>> ? prctl(PR_SECCOMP_SET_FLAGS, SECCOMP_ENFORCE|SECCOMP_DELAYED_ENFORCEMENT);
>>
>> But only if there are other flags worth considering. I had a few
>> others in mind, but now I've completely forgotten :/
>

2011-04-28 18:23:04

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, Apr 28, 2011 at 10:46 AM, Randy Dunlap <[email protected]> wrote:
> On Wed, 27 Apr 2011 22:08:49 -0500 Will Drewry wrote:
>
>> Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
>> implemented presently, and what it may be used for. ?In addition,
>> the limitations and caveats of the proposed implementation are
>> included.
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> ?Documentation/trace/seccomp_filter.txt | ? 75 ++++++++++++++++++++++++++++++++
>> ?1 files changed, 75 insertions(+), 0 deletions(-)
>> ?create mode 100644 Documentation/trace/seccomp_filter.txt
>>
>> diff --git a/Documentation/trace/seccomp_filter.txt b/Documentation/trace/seccomp_filter.txt
>> new file mode 100644
>> index 0000000..6a0fd33
>> --- /dev/null
>> +++ b/Documentation/trace/seccomp_filter.txt
>> @@ -0,0 +1,75 @@
>> + ? ? ? ? ? ? Seccomp filtering
>> + ? ? ? ? ? ? =================
>> +
>> +Introduction
>> +------------
>> +
>> +A large number of system calls are exposed to every userland process
>> +with many of them going unused for the entire lifetime of the
>> +application. ?As system calls change and mature, bugs are found and
>> +quashed. ?A certain subset of userland applications benefit by having
>> +a reduce set of available system calls. ?The reduced set reduces the
>
> ? ? reduced
>
>> +total kernel surface exposed to the application. ?System call filtering
>> +is meant for use with those applications.
>> +
>> +The implementation currently leverages both the existing seccomp
>> +infrastructure and the kernel tracing infrastructure. ?By centralizing
>> +hooks for attack surface reduction in seccomp, it is possible to assure
>> +attention to security that is less relevant in normal ftrace scenarios,
>> +such as time of check, time of use attacks. ?However, ftrace provides a
>> +rich, human-friendly environment for specifying system calls by name and
>> +expected arguments. ?(As such, this requires FTRACE_SYSCALLS.)
>> +
>> +
>> +What it isn't
>> +-------------
>> +
>> +System call filtering isn't a sandbox. ?It provides a clearly defined
>> +mechanism for minimizing the exposed kernel surface. ?Beyond that, policy for
>> +logical behavior and information flow should be managed with an LSM of your
>> +choosing.
>> +
>> +
>> +Usage
>> +-----
>> +
>> +An additional seccomp mode is exposed through mode '2'. ?This mode
>> +depends on CONFIG_SECCOMP_FILTER which in turn depends on
>> +CONFIG_FTRACE_SYSCALLS.
>> +
>> +A collection of filters may be supplied via prctl, and the current set of
>> +filters is exposed in /proc/<pid>/seccomp_filter.
>> +
>> +For instance,
>> + ?const char filters[] =
>> + ? ?"sys_read: (fd == 1) || (fd == 2)\n"
>> + ? ?"sys_write: (fd == 0)\n"
>> + ? ?"sys_exit: 1\n"
>> + ? ?"sys_exit_group: 1\n"
>> + ? ?"on_next_syscall: 1";
>> + ?prctl(PR_SET_SECCOMP, 2, filters);
>> +
>> +This will setup system call filters for read, write, and exit where reading can
>> +be done only from fds 1 and 2 and writing to fd 0. ?The "on_next_syscall" directive tells
>> +seccomp to not enforce the ruleset until after the next system call is run. ?This allows
>> +for launchers to apply system call filters to a binary before executing it.
>> +
>> +Once enabled, the access may only be reduced. ?For example, a set of filters may be:
>> +
>> + ?sys_read: 1
>> + ?sys_write: 1
>> + ?sys_mmap: 1
>> + ?sys_prctl: 1
>> +
>> +Then it may call the following to drop mmap access:
>> + ?prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
>> +
>> +
>> +Caveats
>> +-------
>> +
>> +The system call names come from ftrace events. ?At present, many system
>> +calls are not hooked - such as x86's ptregs wrapped system calls.
>> +
>> +In addition compat_task()s will not be supported until a sys32s begin
>> +being hooked.
>
> Last sentence is hard to read IMO:
> a. what are compat_task()s?
> b. what is a sys32s begin?
> c. awkward wording, maybe change to: ? until a sys32s begin has been hooked.

I'll clean it up and try again. I believe the other thread discussing
the interface will change this last sentence anyway, so once it
settles, I'll update this patch to reflect the new reality.

thanks!

2011-04-28 18:34:19

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, Apr 28, 2011 at 1:21 PM, Steven Rostedt <[email protected]> wrote:
> On Thu, 2011-04-28 at 13:01 -0500, Will Drewry wrote:
>
>> Good to know! My question (below) is if I should even be using an RCU
>> guard at all. I may have been a bit too overzealous.
>>
>> >> >
>> >> > I actually thought you were going to be more extreme about the seccomp
>> >> > state than you are: ?I thought you were going to tie a filter list to
>> >> > seccomp state. ?So adding or removing a filter would have required
>> >> > duping the seccomp state, duping all the filters, making the change in
>> >> > the copy, and then swapping the new state into place. ?Slow in the
>> >> > hopefully rare update case, but safe.
>>
>> Hrm, I think I'm confused now! ?This is exactly what I *thought* the
>> code was doing.
>
> I guess my thought by looking at the code is the call_rcu() to free the
> filters in drop_matching_filters().
>
> Also, you have seccomp_drop_all_filters() as a standalone function that
> is even exported to modules.
>
> This to me, seems that the filters can disappear at any time. Because
> the freeing is done with a rcu_call() the access to the filters needs a
> ref count.
>
>>
>> At present, seccomp_state can be shared across predecessor/ancestor
>> relationships using refcounting in fork.c ?(get/put). However, the
>> only way to change a given seccomp_state or its filters is either
>> through the one-bit on_next_syscall change or through
>> prctl_set_seccomp. ?In prctl_set_seccomp, it does:
>> state = (orig_state ? seccomp_state_dup(orig_state) :
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seccomp_state_new());
>> operates on the new state and then rcu_assign_pointer()s it to the
>> task. ?I didn't intentionally provide any way to drop filters from an
>> existing state object nor change the filtered syscalls on an in-use
>> object. ?That _dup call should hit the impromperly rcu_locked
>> copy_all_filters returning duplicates of the original filters by
>> reparsing the filter_string.
>>
>> Did I accidentally provide a means to mutate a state object or filter
>> list without dup()ing? :/
>
> That seccomp_drop_all_filters() looks like you can.
>
>>
>> >> > You don't have to do that, but then I'm pretty sure you'll need to add
>> >> > reference counts to each filter and use rcu cycles to a reader from
>> >> > having the filter disappear mid-read.
>>
>> Right now, I don't think it is possible for seccomp_copy_all_filters()
>> to be called with a src list that changes since every change is
>> guarded by a seccomp_state_dup(). If that's not true, then I violated
>> my own invariant :/ ?If that is the case, should I not treat the list
>> as an RCU list? ?There should never be any simultaneous
>> reader/writers, just a single reader/writer or multiple readers.
>
> Again, that seccomp_copy_all_filters() is called free standing (exported
> to modules). Which to me means that it can be called by anyone at
> anytime. There is no protection of this src list.
>
>>
>> >>
>> >> Or you can preallocate the new filters, call rcu_read_lock(), check if
>> >> the number of old filters is the same or less, if more, call
>> >> rcu_read_unlock, and try allocating more, and then call rcu_read_lock()
>> >> again and repeat. Then just copy the filters to the preallocate ones.
>> >> rcu_read_unlock() and then free any unused allocated filters.
>> >>
>> >> Maybe a bit messy, but not that bad.
>> >
>> > Sounds good.
>>
>> I'd prefer a heavy-weight copy ;)
>>
>> I think I'm a bit lost -- am I missing something obvious here? ?I was
>> hoping by using a swapped-in-seccomp_state-pointer, locking and
>> consistency internal to the state objects would be a tad easier -
>> though expensive.
>
> Perhaps, but those free standing functions (the ones that are exported
> to modules) seem like they can destroy state.

My intent was to make them available for use by seccomp.c during state
teardown/dup. I don't think there's benefit to exposing them outside
of that. Would dropping the export, and adding an local seccomp.h
with the shared functions in them resolve that more cleanly?

> Your code would have been correct if you could call kzalloc under
> rcu_read_lock() (which you can on some kernel configurations but not
> all). The issue is that you need to pull out that allocation from the
> rcu_read_lock() because rcu_read_lock assumes you can't preempt, and
> that allocation can schedule out. The access to the filters must be done
> under rcu_read_lock(), other than that, you're fine.

That makes sense. I think I'd prefer to not share those functions
rather than guard the list just in case a future consumer of the
interface comes along. Would that make sense to you? Since I don't
see any other users right now other than seccomp.c, it might make
sense to tackle the impact when an actual need arises.

I'll go whichever way pointed on this, though.
thanks again,
will

2011-04-28 18:37:36

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, Apr 28, 2011 at 9:56 AM, Eric Paris <[email protected]> wrote:
> On Thu, 2011-04-28 at 09:06 +0200, Ingo Molnar wrote:
>> * Will Drewry <[email protected]> wrote:
>>
>> > +A collection of filters may be supplied via prctl, and the current set of
>> > +filters is exposed in /proc/<pid>/seccomp_filter.
>> > +
>> > +For instance,
>> > + ?const char filters[] =
>> > + ? ?"sys_read: (fd == 1) || (fd == 2)\n"
>> > + ? ?"sys_write: (fd == 0)\n"
>> > + ? ?"sys_exit: 1\n"
>> > + ? ?"sys_exit_group: 1\n"
>> > + ? ?"on_next_syscall: 1";
>> > + ?prctl(PR_SET_SECCOMP, 2, filters);
>> > +
>> > +This will setup system call filters for read, write, and exit where reading can
>> > +be done only from fds 1 and 2 and writing to fd 0. ?The "on_next_syscall" directive tells
>> > +seccomp to not enforce the ruleset until after the next system call is run. ?This allows
>> > +for launchers to apply system call filters to a binary before executing it.
>> > +
>> > +Once enabled, the access may only be reduced. ?For example, a set of filters may be:
>> > +
>> > + ?sys_read: 1
>> > + ?sys_write: 1
>> > + ?sys_mmap: 1
>> > + ?sys_prctl: 1
>> > +
>> > +Then it may call the following to drop mmap access:
>> > + ?prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
>>
>> Ok, color me thoroughly impressed
>
> Me too!
>
>> I've Cc:-ed Linus and Andrew: are you guys opposed to such flexible, dynamic
>> filters conceptually? I think we should really think hard about the actual ABI
>> as this could easily spread to more applications than Chrome/Chromium.

Would it make sense to start, as Frederic has pointed out, by using
the existing ABI - system call numbers - and not system call names?
We could leave name resolution to userspace as it is for all other
system call consumers now. It might leave the interface for this
support looking more like:
prctl(PR_SET_SECCOMP, 2, _NR_mmap, "fd == 1");
prctl(PR_SET_SECCOMP_FILTER_APPLY, now|on_exec);

which may be less of a dramatic ABI change to start with.

> I'll definitely port QEMU to use this new interface rather than my more
> rigid flexible (haha "rigid flexible") seccomp. ?I'll see if I run into
> any issues with this ABI in that porting...

Great - also let me know if you have a preference on the interface
Frederic proposed, as it might reduce the parsing footprint and
overall kernel-side complexity but add a little bit more burden on the
userspace side.

>> Btw., i also think that such an approach is actually the sane(r) design to
>> implement security modules: using such filters is far more flexible than the
>> typical LSM approach of privileged user-space uploading various nasty objects
>> into kernel space and implementing silly (and limited and intrusive) hooks
>> there, like SElinux and the other security modules do.
>
> Then you are wrong. ?There's no question that this interface can provide
> great extensions to the current discretionary functionality provided by
> legacy security controls but if you actually want to mediate what tasks
> can do to other tasks or can do to arbitrary objects on the system this
> doesn't cut it. ?Every system call that takes or uses a structure as an
> argument or that uses copy_from_user (for something other than just
> unparsed data) is uncontrollable.

I think it'd take a fair amount of additional work to turn pure system
call filtering into a robust policy engine (like systrace). It seems
to me that right now, it would just be a great addition to the
existing LSM model by providing infrastructure the LSMs can use with
their higher level logic. (Plumbing in all the bits to understand the
system call arguments completely and avoid time-of-check-time-of-use
attacks would be a sizable undertaking - and that's without making
sure the existing LSMs could live happily on top of it .)

> This approach is great and with careful coding of userspace apps can be
> made very useful in constraining those apps, but a replacement for
> mandatory access control it is not.
>
>> This approach also has the ability to become recursive (gets inherited by child
>> tasks, which could add their own filters) and unprivileged - unlike LSMs.
>
> LSMs have that ability. ?There's nothing to prevent a module loading
> service to allow unpriv applications to further constrain themselves.
> It's just the different between DAC and MAC. ?You are clearly a DAC guy,
> and there is no question this change is great in that mindset, ?but you
> don't seem to understand either the flexibility of the LSM or the
> purpose of some of the modules implemented on top of the LSM.
>
>> I like this *a lot* more than any security sandboxing approach i've seen
>> before.
>
> I like this *a lot*. ?It will be a HUGE addition to the security
> sandboxing approaches I've seen before.

Thanks!
will

2011-04-28 18:51:33

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

Quoting Will Drewry ([email protected]):
> On Thu, Apr 28, 2011 at 12:39 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Steven Rostedt ([email protected]):
> >> On Thu, 2011-04-28 at 11:55 -0500, Serge E. Hallyn wrote:
> >> > I actually thought you were going to be more extreme about the seccomp
> >> > state than you are: ?I thought you were going to tie a filter list to
> >> > seccomp state. ?So adding or removing a filter would have required
> >> > duping the seccomp state, duping all the filters, making the change in
> >> > the copy, and then swapping the new state into place. ?Slow in the
> >> > hopefully rare update case, but safe.
>
> Hrm, I think I'm confused now! This is exactly what I *thought* the
> code was doing.
>
> At present, seccomp_state can be shared across predecessor/ancestor
> relationships using refcounting in fork.c (get/put). However, the
> only way to change a given seccomp_state or its filters is either
> through the one-bit on_next_syscall change or through
> prctl_set_seccomp. In prctl_set_seccomp, it does:
> state = (orig_state ? seccomp_state_dup(orig_state) :
> seccomp_state_new());
> operates on the new state and then rcu_assign_pointer()s it to the
> task. I didn't intentionally provide any way to drop filters from an
> existing state object nor change the filtered syscalls on an in-use
> object. That _dup call should hit the impromperly rcu_locked
> copy_all_filters returning duplicates of the original filters by
> reparsing the filter_string.
>
> Did I accidentally provide a means to mutate a state object or filter
> list without dup()ing? :/

Ah - probably not, I probably misread your intent for the helpers.

So in that case you don't need to rcu-protect the filters at all
when copying. You just need to inc the seccomp struct refcount
under rcu lock, to make sure it all stays put. Drop rcu lock, take
your time copying, then when you dec the refcount the whole seccomp
struct either gets freed or not.

Excellent.

-serge

2011-04-28 18:54:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

Quoting Will Drewry ([email protected]):
> My intent was to make them available for use by seccomp.c during state
> teardown/dup. I don't think there's benefit to exposing them outside
> of that. Would dropping the export, and adding an local seccomp.h
> with the shared functions in them resolve that more cleanly?

And add a clear comment explaining :)

> > Your code would have been correct if you could call kzalloc under
> > rcu_read_lock() (which you can on some kernel configurations but not
> > all). The issue is that you need to pull out that allocation from the
> > rcu_read_lock() because rcu_read_lock assumes you can't preempt, and
> > that allocation can schedule out. The access to the filters must be done
> > under rcu_read_lock(), other than that, you're fine.
>
> That makes sense. I think I'd prefer to not share those functions
> rather than guard the list just in case a future consumer of the
> interface comes along. Would that make sense to you? Since I don't
> see any other users right now other than seccomp.c, it might make
> sense to tackle the impact when an actual need arises.
>
> I'll go whichever way pointed on this, though.

Complicating the locking for nonexistent users doesn't seem like the
right way.

-serge

2011-04-28 19:07:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, 2011-04-28 at 13:34 -0500, Will Drewry wrote:

> > Perhaps, but those free standing functions (the ones that are exported
> > to modules) seem like they can destroy state.
>
> My intent was to make them available for use by seccomp.c during state
> teardown/dup. I don't think there's benefit to exposing them outside
> of that. Would dropping the export, and adding an local seccomp.h
> with the shared functions in them resolve that more cleanly?

Yes, having a local seccomp.h in the same directory as seccomp.c is a
way of saying "these are internal functions, don't use them directly".
Adding a function to EXPORT_SYMBOL*() is saying "here you go, have a
field day with it!".


>
> > Your code would have been correct if you could call kzalloc under
> > rcu_read_lock() (which you can on some kernel configurations but not
> > all). The issue is that you need to pull out that allocation from the
> > rcu_read_lock() because rcu_read_lock assumes you can't preempt, and
> > that allocation can schedule out. The access to the filters must be done
> > under rcu_read_lock(), other than that, you're fine.
>
> That makes sense. I think I'd prefer to not share those functions
> rather than guard the list just in case a future consumer of the
> interface comes along. Would that make sense to you? Since I don't
> see any other users right now other than seccomp.c, it might make
> sense to tackle the impact when an actual need arises.
>
> I'll go whichever way pointed on this, though.
> thanks again,

I'm fine either way. If you make them local internal functions, then you
don't need the locking if they are safe in their current context.

-- Steve

2011-04-28 19:07:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On Thu, 2011-04-28 at 13:54 -0500, Serge E. Hallyn wrote:
> Quoting Will Drewry ([email protected]):
> > My intent was to make them available for use by seccomp.c during state
> > teardown/dup. I don't think there's benefit to exposing them outside
> > of that. Would dropping the export, and adding an local seccomp.h
> > with the shared functions in them resolve that more cleanly?
>
> And add a clear comment explaining :)

Yes that always helps.

>
> > > Your code would have been correct if you could call kzalloc under
> > > rcu_read_lock() (which you can on some kernel configurations but not
> > > all). The issue is that you need to pull out that allocation from the
> > > rcu_read_lock() because rcu_read_lock assumes you can't preempt, and
> > > that allocation can schedule out. The access to the filters must be done
> > > under rcu_read_lock(), other than that, you're fine.
> >
> > That makes sense. I think I'd prefer to not share those functions
> > rather than guard the list just in case a future consumer of the
> > interface comes along. Would that make sense to you? Since I don't
> > see any other users right now other than seccomp.c, it might make
> > sense to tackle the impact when an actual need arises.
> >
> > I'll go whichever way pointed on this, though.
>
> Complicating the locking for nonexistent users doesn't seem like the
> right way.

I agree.

-- Steve

2011-04-28 22:54:58

by James Morris

[permalink] [raw]
Subject: Re: [PATCH 4/7] seccomp_filter: add process state reporting

On Wed, 27 Apr 2011, Will Drewry wrote:

> > Can't you make individual seccomp specific file?
>
> Definitely. Would it make sense to have /proc/<pid>/seccomp and
> /proc/<pid>/seccomp_filter?

Do you need the separate seccomp file vs. just checking what's in
seccomp_filter ?


--
James Morris
<[email protected]>

2011-04-29 13:19:06

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, Apr 28, 2011 at 01:37:33PM -0500, Will Drewry wrote:
> On Thu, Apr 28, 2011 at 9:56 AM, Eric Paris <[email protected]> wrote:
> > On Thu, 2011-04-28 at 09:06 +0200, Ingo Molnar wrote:
> >> * Will Drewry <[email protected]> wrote:
> >>
> >> > +A collection of filters may be supplied via prctl, and the current set of
> >> > +filters is exposed in /proc/<pid>/seccomp_filter.
> >> > +
> >> > +For instance,
> >> > + ?const char filters[] =
> >> > + ? ?"sys_read: (fd == 1) || (fd == 2)\n"
> >> > + ? ?"sys_write: (fd == 0)\n"
> >> > + ? ?"sys_exit: 1\n"
> >> > + ? ?"sys_exit_group: 1\n"
> >> > + ? ?"on_next_syscall: 1";
> >> > + ?prctl(PR_SET_SECCOMP, 2, filters);
> >> > +
> >> > +This will setup system call filters for read, write, and exit where reading can
> >> > +be done only from fds 1 and 2 and writing to fd 0. ?The "on_next_syscall" directive tells
> >> > +seccomp to not enforce the ruleset until after the next system call is run. ?This allows
> >> > +for launchers to apply system call filters to a binary before executing it.
> >> > +
> >> > +Once enabled, the access may only be reduced. ?For example, a set of filters may be:
> >> > +
> >> > + ?sys_read: 1
> >> > + ?sys_write: 1
> >> > + ?sys_mmap: 1
> >> > + ?sys_prctl: 1
> >> > +
> >> > +Then it may call the following to drop mmap access:
> >> > + ?prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
> >>
> >> Ok, color me thoroughly impressed
> >
> > Me too!
> >
> >> I've Cc:-ed Linus and Andrew: are you guys opposed to such flexible, dynamic
> >> filters conceptually? I think we should really think hard about the actual ABI
> >> as this could easily spread to more applications than Chrome/Chromium.
>
> Would it make sense to start, as Frederic has pointed out, by using
> the existing ABI - system call numbers - and not system call names?
> We could leave name resolution to userspace as it is for all other
> system call consumers now. It might leave the interface for this
> support looking more like:
> prctl(PR_SET_SECCOMP, 2, _NR_mmap, "fd == 1");
> prctl(PR_SET_SECCOMP_FILTER_APPLY, now|on_exec);

PR_SET_SECCOMP_FILTER_APPLY seems only useful if you think there
are other cases than enable_on_exec that would be useful for these
filters.

We can think about a default enable on exec behaviour as Steve pointed
out.

But I have no idea if other cases may be desirable to apply these
filters.

2011-04-29 16:13:47

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Fri, Apr 29, 2011 at 8:18 AM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Apr 28, 2011 at 01:37:33PM -0500, Will Drewry wrote:
>> On Thu, Apr 28, 2011 at 9:56 AM, Eric Paris <[email protected]> wrote:
>> > On Thu, 2011-04-28 at 09:06 +0200, Ingo Molnar wrote:
>> >> * Will Drewry <[email protected]> wrote:
>> >>
>> >> > +A collection of filters may be supplied via prctl, and the current set of
>> >> > +filters is exposed in /proc/<pid>/seccomp_filter.
>> >> > +
>> >> > +For instance,
>> >> > + ?const char filters[] =
>> >> > + ? ?"sys_read: (fd == 1) || (fd == 2)\n"
>> >> > + ? ?"sys_write: (fd == 0)\n"
>> >> > + ? ?"sys_exit: 1\n"
>> >> > + ? ?"sys_exit_group: 1\n"
>> >> > + ? ?"on_next_syscall: 1";
>> >> > + ?prctl(PR_SET_SECCOMP, 2, filters);
>> >> > +
>> >> > +This will setup system call filters for read, write, and exit where reading can
>> >> > +be done only from fds 1 and 2 and writing to fd 0. ?The "on_next_syscall" directive tells
>> >> > +seccomp to not enforce the ruleset until after the next system call is run. ?This allows
>> >> > +for launchers to apply system call filters to a binary before executing it.
>> >> > +
>> >> > +Once enabled, the access may only be reduced. ?For example, a set of filters may be:
>> >> > +
>> >> > + ?sys_read: 1
>> >> > + ?sys_write: 1
>> >> > + ?sys_mmap: 1
>> >> > + ?sys_prctl: 1
>> >> > +
>> >> > +Then it may call the following to drop mmap access:
>> >> > + ?prctl(PR_SET_SECCOMP, 2, "sys_mmap: 0");
>> >>
>> >> Ok, color me thoroughly impressed
>> >
>> > Me too!
>> >
>> >> I've Cc:-ed Linus and Andrew: are you guys opposed to such flexible, dynamic
>> >> filters conceptually? I think we should really think hard about the actual ABI
>> >> as this could easily spread to more applications than Chrome/Chromium.
>>
>> Would it make sense to start, as Frederic has pointed out, by using
>> the existing ABI - system call numbers - and not system call names?
>> We could leave name resolution to userspace as it is for all other
>> system call consumers now. ?It might leave the interface for this
>> support looking more like:
>> ? prctl(PR_SET_SECCOMP, 2, _NR_mmap, "fd == 1");
>> ? prctl(PR_SET_SECCOMP_FILTER_APPLY, now|on_exec);
>
> PR_SET_SECCOMP_FILTER_APPLY seems only useful if you think there
> are other cases than enable_on_exec that would be useful for these
> filters.
>
> We can think about a default enable on exec behaviour as Steve pointed
> out.
>
> But I have no idea if other cases may be desirable to apply these
> filters.

I nearly have all of the changes in, but I'm still updating my tests.
In general, I think having both on_exec and now is reasonable is
because you can write a much tighter filter set if it is embedded in
the application. E.g., it may load all its shared libraries, which
you allow, then lock itself down before touching untrusted content.
That said, if the default behavior is enable_on_exec, then you'd only
call PR_SET_SECCOMP_FILTER_APPLY when you want to apply _now_. I like
that.

That said, I have a general interface question :) Right now I have:
prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_ADD, syscall_nr, filter_string);
prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, syscall_nr,
filter_string_or_NULL);
prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_APPLY, apply_flags);
(I will change this to default to apply_on_exec and let FILTER_APPLY
make it apply _now_ exclusively. :)

This can easily be mapped to:
prctl(PR_SET_SECCOMP
PR_SET_SECOMP_FILTER_ADD
PR_SET_SECOMP_FILTER_DROP
PR_SET_SECOMP_FILTER_APPLY
if that'd be preferable (to keep it all in the prctl.h world).

Following along the suggestion of reducing custom parsing, it seemed
to make a lot of sense to make add and drop actions very explicit.
There is no guesswork so a system call filtered process will only be
able to perform DROP operations (if prctl is allowed) to reduce the
allowed system calls. This also allows more fine grained flexibility
in addition to the in-kernel complexity reduction. E.g.,
Process starts with
__NR_read, "fd == 1"
__NR_read, "fd == 2"
later it can call:
prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, __NR_read, "fd == 2");
to drop one of the filters without disabling "fd == 1" reading. (Or
it could pass in NULL to drop all filters).

FWIW, I also am updating the Kconfig to be dependent EXPERIMENTAL, as
it might make sense to get some use prior to considering it finalized
:) I'm not sure if that is appropriate though.

Thanks! I'll try to post the v2s today once they're working properly :)
will

2011-05-02 10:08:25

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 4/7] seccomp_filter: add process state reporting

On Thu, Apr 28, 2011 at 3:54 PM, James Morris <[email protected]> wrote:
> On Wed, 27 Apr 2011, Will Drewry wrote:
>
>> > Can't you make individual seccomp specific file?
>>
>> Definitely. ?Would it make sense to have /proc/<pid>/seccomp and
>> /proc/<pid>/seccomp_filter?
>
> Do you need the separate seccomp file vs. just checking what's in
> seccomp_filter ?

Initially, I would've said yes, because I had modeled it such that any
entries in /proc/<pid>/seccomp_filter could be fed right back into as
filters. However, as I've reworked it from Frederic and other's
feedback, I don't need to keep the separation - all the relevant info
can just be in seccomp_filter and no secondary file will be useful.

Thanks for pointing it out! I'm tracking down what is, no doubt, a
dumb bug (still), but once I sort it, I'll repost the series with this
change included.

Cheers!
will

2011-05-03 01:29:08

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Fri, Apr 29, 2011 at 11:13:44AM -0500, Will Drewry wrote:
> On Fri, Apr 29, 2011 at 8:18 AM, Frederic Weisbecker <[email protected]> wrote:
> > PR_SET_SECCOMP_FILTER_APPLY seems only useful if you think there
> > are other cases than enable_on_exec that would be useful for these
> > filters.
> >
> > We can think about a default enable on exec behaviour as Steve pointed
> > out.
> >
> > But I have no idea if other cases may be desirable to apply these
> > filters.
>
> I nearly have all of the changes in, but I'm still updating my tests.
> In general, I think having both on_exec and now is reasonable is
> because you can write a much tighter filter set if it is embedded in
> the application. E.g., it may load all its shared libraries, which
> you allow, then lock itself down before touching untrusted content.

Well, that makes sense.

> That said, if the default behavior is enable_on_exec, then you'd only
> call PR_SET_SECCOMP_FILTER_APPLY when you want to apply _now_. I like
> that.

It could be the default behaviour, which could be overriden with
PR_SET_SECCOMP_FILTER_APPLY. However I'm wondering about that enable_on_exec.

Say you want to accept only stdin/stdout read/write, and you blocked
mmap, open, etc... How can ld load the app and mmap all its shared libraries?
The filters are going to be applied once the interpreter is launched. This
makes me wonder now about the general usability of this and also about
the relevance in a default enable on exec behaviour here.

>
> That said, I have a general interface question :) Right now I have:
> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_ADD, syscall_nr, filter_string);
> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, syscall_nr,
> filter_string_or_NULL);
> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_APPLY, apply_flags);
> (I will change this to default to apply_on_exec and let FILTER_APPLY
> make it apply _now_ exclusively. :)
>
> This can easily be mapped to:
> prctl(PR_SET_SECCOMP
> PR_SET_SECOMP_FILTER_ADD
> PR_SET_SECOMP_FILTER_DROP
> PR_SET_SECOMP_FILTER_APPLY
> if that'd be preferable (to keep it all in the prctl.h world).
>
> Following along the suggestion of reducing custom parsing, it seemed
> to make a lot of sense to make add and drop actions very explicit.
> There is no guesswork so a system call filtered process will only be
> able to perform DROP operations (if prctl is allowed) to reduce the
> allowed system calls. This also allows more fine grained flexibility
> in addition to the in-kernel complexity reduction. E.g.,
> Process starts with
> __NR_read, "fd == 1"
> __NR_read, "fd == 2"
> later it can call:
> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, __NR_read, "fd == 2");
> to drop one of the filters without disabling "fd == 1" reading. (Or
> it could pass in NULL to drop all filters).

Hm, but then you don't let the childs be able to restrict further
what you allowed before.

Say I have foo(int a, int b), and I apply these filters:

__NR_foo, "a == 1";
__NR_foo, "a == 2";

This is basically "a == 1 || a == 2".

Now I apply the filters and I fork. How can the child
(or current task after the filter is applied) restrict
further by only allowing "b == 2", such that with the
inherited parent filters we have:

"(a == 1 || a == 2) && b == 2"

So what you propose seems to me too limited. I'd rather have this:

SECCOMP_FILTER_SET = remove previous filter entirely and set a new one
SECCOMP_FILTER_GET = get the string of the current filter

The rule would be that you can only set a filter that is intersected
with the one that was previously applied.

It means that if you set filter A and you apply it. If you want to set
filter B thereafter, it must be:

A && B

OTOH, as long as you haven't applied A, you can override it as you wish.
Like you can have "A || B" instead. Or you can remove it with "1". Of course
if a previous filter was applied before A, then your new filter must be
concatenated: "previous && (A || B)".

Right? And note in this scheme you can reproduce your DROP trick. If
"A || B" is the current filter applied, then you can restrict B by
doing: "(A || B) && A".

So the role of SECCOMP_FILTER_GET is to get the string that matches
the current applied filter.

The effect of this is infinite of course. If you apply A, then apply
B then you need A && B. If later you want to apply C, then you need
A && B && C, etc...

Does that look sane?

2011-05-03 01:47:05

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

2011/5/3 Frederic Weisbecker <[email protected]>:
> On Fri, Apr 29, 2011 at 11:13:44AM -0500, Will Drewry wrote:
>> That said, I have a general interface question :) ?Right now I have:
>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_ADD, syscall_nr, filter_string);
>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, syscall_nr,
>> filter_string_or_NULL);
>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_APPLY, apply_flags);
>> ? (I will change this to default to apply_on_exec and let FILTER_APPLY
>> make it apply _now_ exclusively. :)
>>
>> This can easily be mapped to:
>> prctl(PR_SET_SECCOMP
>> ? ? ? ?PR_SET_SECOMP_FILTER_ADD
>> ? ? ? ?PR_SET_SECOMP_FILTER_DROP
>> ? ? ? ?PR_SET_SECOMP_FILTER_APPLY
>> if that'd be preferable (to keep it all in the prctl.h world).
>>
>> Following along the suggestion of reducing custom parsing, it seemed
>> to make a lot of sense to make add and drop actions very explicit.
>> There is no guesswork so a system call filtered process will only be
>> able to perform DROP operations (if prctl is allowed) to reduce the
>> allowed system calls. ?This also allows more fine grained flexibility
>> in addition to the in-kernel complexity reduction. ?E.g.,
>> Process starts with
>> ? __NR_read, "fd == 1"
>> ? __NR_read, "fd == 2"
>> later it can call:
>> ? prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, __NR_read, "fd == 2");
>> to drop one of the filters without disabling "fd == 1" reading. ?(Or
>> it could pass in NULL to drop all filters).
>
> Hm, but then you don't let the childs be able to restrict further
> what you allowed before.
>
> Say I have foo(int a, int b), and I apply these filters:
>
> ? ? ? ?__NR_foo, "a == 1";
> ? ? ? ?__NR_foo, "a == 2";
>
> This is basically "a == 1 || a == 2".
>
> Now I apply the filters and I fork. How can the child
> (or current task after the filter is applied) restrict
> further by only allowing "b == 2", such that with the
> inherited parent filters we have:
>
> ? ? ? ?"(a == 1 || a == 2) && b == 2"
>
> So what you propose seems to me too limited. I'd rather have this:
>
> SECCOMP_FILTER_SET = remove previous filter entirely and set a new one
> SECCOMP_FILTER_GET = get the string of the current filter
>
> The rule would be that you can only set a filter that is intersected
> with the one that was previously applied.
>
> It means that if you set filter A and you apply it. If you want to set
> filter B thereafter, it must be:
>
> ? ? ? ?A && B
>
> OTOH, as long as you haven't applied A, you can override it as you wish.
> Like you can have "A || B" instead. Or you can remove it with "1". Of course
> if a previous filter was applied before A, then your new filter must be
> concatenated: "previous && (A || B)".
>
> Right? And note in this scheme you can reproduce your DROP trick. If
> "A || B" is the current filter applied, then you can restrict B by
> doing: "(A || B) && A".
>
> So the role of SECCOMP_FILTER_GET is to get the string that matches
> the current applied filter.
>
> The effect of this is infinite of course. If you apply A, then apply
> B then you need A && B. If later you want to apply C, then you need
> A && B && C, etc...
>
> Does that look sane?
>

Even better: applying a filter would always automatically be an
intersection of the previous one.

If you do:

SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
SECCOMP_FILTER_APPLY
SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_APPLY
SECCOMP_FILTER_SET, __NR_foo, "c == 3"
SECCOMP_FILTER_APPLY

The end result is:

"(a == 1 || a == 2) && b == 2 && c == 3"

So that we don't push the burden in the kernel to compare the applied
expression with a new one that may or may not be embraced by parenthesis
and other trickies like that. We simply append to the working one.

Ah and OTOH this:

SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
SECCOMP_FILTER_APPLY
SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_SET, __NR_foo, "c == 3"

has the following end result:

"(a == 1 || a == 2) && c == 3"

As long as you don't apply the filter, the temporary part is
overriden, but still we keep
the applied part.

Still sane? (or completely nuts?)

2011-05-03 08:40:10

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering

On 04/28/2011 06:08 AM, Will Drewry wrote:
> This change adds a new seccomp mode based on the work by
> [email protected]. This mode comes with a bitmask of NR_syscalls size and
> an optional linked list of seccomp_filter objects. When in mode 2, all
> system calls are first checked against the bitmask to determine if they
> are allowed or denied. If allowed, the list of filters is checked for
> the given syscall number. If all filter predicates for the system call
> match or the system call was allowed without restriction, the process
> continues. Otherwise, it is killed and a KERN_INFO notification is
> posted.
>
> The filter language itself is provided by the ftrace filter engine.
> Related patches tweak to the perf filter trace and free allow the calls
> to be shared. Filters inherit their understanding of types and arguments
> for each system call from the CONFIG_FTRACE_SYSCALLS subsystem which
> predefines this information in syscall_metadata associated enter_event
> (and exit_event) structures.
>
> The result is that a process may reduce its available interfaces to
> the kernel through prctl() without knowing the appropriate system call
> number a priori and with the flexibility of filtering based on
> register-stored arguments. (String checks suffer from TOCTOU issues and
> should be left to LSMs to provide policy for! Don't get greedy :)

This is potentially very useful for qemu/kvm.

--
error compiling committee.c: too many arguments to function

2011-05-04 09:15:50

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Mon, May 2, 2011 at 6:47 PM, Frederic Weisbecker <[email protected]> wrote:
> 2011/5/3 Frederic Weisbecker <[email protected]>:
>> On Fri, Apr 29, 2011 at 11:13:44AM -0500, Will Drewry wrote:
>>> That said, I have a general interface question :) ?Right now I have:
>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_ADD, syscall_nr, filter_string);
>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, syscall_nr,
>>> filter_string_or_NULL);
>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_APPLY, apply_flags);
>>> ? (I will change this to default to apply_on_exec and let FILTER_APPLY
>>> make it apply _now_ exclusively. :)
>>>
>>> This can easily be mapped to:
>>> prctl(PR_SET_SECCOMP
>>> ? ? ? ?PR_SET_SECOMP_FILTER_ADD
>>> ? ? ? ?PR_SET_SECOMP_FILTER_DROP
>>> ? ? ? ?PR_SET_SECOMP_FILTER_APPLY
>>> if that'd be preferable (to keep it all in the prctl.h world).
>>>
>>> Following along the suggestion of reducing custom parsing, it seemed
>>> to make a lot of sense to make add and drop actions very explicit.
>>> There is no guesswork so a system call filtered process will only be
>>> able to perform DROP operations (if prctl is allowed) to reduce the
>>> allowed system calls. ?This also allows more fine grained flexibility
>>> in addition to the in-kernel complexity reduction. ?E.g.,
>>> Process starts with
>>> ? __NR_read, "fd == 1"
>>> ? __NR_read, "fd == 2"
>>> later it can call:
>>> ? prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, __NR_read, "fd == 2");
>>> to drop one of the filters without disabling "fd == 1" reading. ?(Or
>>> it could pass in NULL to drop all filters).
>>
>> Hm, but then you don't let the childs be able to restrict further
>> what you allowed before.
>>
>> Say I have foo(int a, int b), and I apply these filters:
>>
>> ? ? ? ?__NR_foo, "a == 1";
>> ? ? ? ?__NR_foo, "a == 2";
>>
>> This is basically "a == 1 || a == 2".
>>
>> Now I apply the filters and I fork. How can the child
>> (or current task after the filter is applied) restrict
>> further by only allowing "b == 2", such that with the
>> inherited parent filters we have:
>>
>> ? ? ? ?"(a == 1 || a == 2) && b == 2"
>>
>> So what you propose seems to me too limited. I'd rather have this:
>>
>> SECCOMP_FILTER_SET = remove previous filter entirely and set a new one
>> SECCOMP_FILTER_GET = get the string of the current filter
>>
>> The rule would be that you can only set a filter that is intersected
>> with the one that was previously applied.
>>
>> It means that if you set filter A and you apply it. If you want to set
>> filter B thereafter, it must be:
>>
>> ? ? ? ?A && B
>>
>> OTOH, as long as you haven't applied A, you can override it as you wish.
>> Like you can have "A || B" instead. Or you can remove it with "1". Of course
>> if a previous filter was applied before A, then your new filter must be
>> concatenated: "previous && (A || B)".
>>
>> Right? And note in this scheme you can reproduce your DROP trick. If
>> "A || B" is the current filter applied, then you can restrict B by
>> doing: "(A || B) && A".
>>
>> So the role of SECCOMP_FILTER_GET is to get the string that matches
>> the current applied filter.
>>
>> The effect of this is infinite of course. If you apply A, then apply
>> B then you need A && B. If later you want to apply C, then you need
>> A && B && C, etc...
>>
>> Does that look sane?
>>
>
> Even better: applying a filter would always automatically be an
> intersection of the previous one.
>
> If you do:
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> SECCOMP_FILTER_APPLY
>
> The end result is:
>
> "(a == 1 || a == 2) && b == 2 ?&& c == 3"
>
> So that we don't push the burden in the kernel to compare the applied
> expression with a new one that may or may not be embraced by parenthesis
> and other trickies like that. We simply append to the working one.
>
> Ah and OTOH this:
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> has the following end result:
>
> "(a == 1 || a == 2) && c == 3"
>
> As long as you don't apply the filter, the temporary part is
> overriden, but still we keep
> the applied part.
>
> Still sane? (or completely nuts?)

Okay - so I *think* I'm following. I really like the use of SET and
GET to allow for further constraint based on additional argument
restrictions instead of purely reducing the filters available. The
only part I'm stumbling on is using APPLY on a per-filter basis. In
my current implementation, I consider APPLY to be the global enable
bit. Whatever filters are set become set in stone and only &&s are
handled. I'm not sure I understand why it would make sense to do a
per-syscall-filter apply call. It's certainly doable, but it will
mean that we may be logically storing something like:

__NR_foo: (a == 1 || a == 2), applied
__NR_foo: b == 2, not applied
__NR_foo: c == 3, not applied

after

SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
SECCOMP_FILTER_APPLY
SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_SET, __NR_foo, "c == 3"

In that case, would a call to sys_foo even be tested against the
non-applied constraints of b==2 or c==3? Or would the call to set "c
== 3" replace the "b == 2" entry. I'm not sure I see that the benefit
exceeds the ambiguity that might introduce. However, if the default
behavior it to always extend with &&, then a consumer of the interface
could just do:

SECCOMP_FILTER_SET, __NR_prctl, "option == 2"
SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_APPLY

This would yield a final filter for foo of "(a == 1 || a == 2) && b ==
2". The call to APPLY would initiate the enforcement of the syscall
filtering and enforce that no new filters may be added for syscalls
that aren't already constrained. So you could still call

SECCOMP_FILTER_SET, __NR_foo, "0"

which is logically "((a == 1 || a == 2) && b == 2) && 0" and would be
interpreted as just a DROP. But you could not do,

SECCOMP_FILTER_SET, __NR_foo, "0"
SECCOMP_FILTER_SET, __NR_foo, "1"
or
SECCOMP_FILTER_SET, __NR_read, "1"

(since the implicit filter for all syscalls after an APPLY call should
be "0" and additions would just be "0 && whatever").

Am I missing something? If that makes sense, then we may even be able
to reduce the extra directives by one and get a resulting interface
that looks something like:

/* Appends (&&) a new filter string for a syscall to the current
filter value. "0" clears the filter. */
prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter_string);
/* Returns the current explicit filter string for a syscall */
prctl(PR_GET_SECCOMP_FILTER, syscall_nr, buf, buflen);
/* Transition to a secure computing mode:
* 1 - enables traditional seccomp behavior
* 2 - enables seccomp filter enforcement and changes the implicit
filter for syscalls from "1" to "0"
*/
prctl(PR_SET_SECCOMP, mode);

I'll set aside the v2 of the patch that uses ADD, DROP, and APPLY and
work up a version with this interface. Do you (or anyone else :) feel
strongly about per-syscall APPLY? I like the above version, but I'm
certainly willing to explore the other path. As is, I'll go through
my usecases (and tests once I have a new cut of the patch) and see how
it feels. At first blush, this appears more succinct _and_ more
expressive than the prior versions!

~~

As to the use of apply_on_exec, even if you whitelist: mmap, fstat64,
brk, uname, open, read, close, set_thread_area, mprotect, munmap, and
access _just_ to allow a process to be exec'd, it is still a
significant reduction in kernel attack surface. Pairing that with a
LSM, delayed chroot, etc could fill in the gaps with respect to a
greater sandboxing solution. I'd certainly take that tradeoff for
running binaries that I don't control the source for :) That said,
LD_PRELOAD, ptrace injection, and other tricks could allow for the
injection of a very targeted filterset, but I don't think that
invalidates the on-exec case given the brittleness relying exclusively
on such tactics. Does that seem reasonable?


Thanks!
will

2011-05-04 09:29:24

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, May 4, 2011 at 2:15 AM, Will Drewry <[email protected]> wrote:
> On Mon, May 2, 2011 at 6:47 PM, Frederic Weisbecker <[email protected]> wrote:
>> 2011/5/3 Frederic Weisbecker <[email protected]>:
>>> On Fri, Apr 29, 2011 at 11:13:44AM -0500, Will Drewry wrote:
>>>> That said, I have a general interface question :) ?Right now I have:
>>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_ADD, syscall_nr, filter_string);
>>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, syscall_nr,
>>>> filter_string_or_NULL);
>>>> prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_APPLY, apply_flags);
>>>> ? (I will change this to default to apply_on_exec and let FILTER_APPLY
>>>> make it apply _now_ exclusively. :)
>>>>
>>>> This can easily be mapped to:
>>>> prctl(PR_SET_SECCOMP
>>>> ? ? ? ?PR_SET_SECOMP_FILTER_ADD
>>>> ? ? ? ?PR_SET_SECOMP_FILTER_DROP
>>>> ? ? ? ?PR_SET_SECOMP_FILTER_APPLY
>>>> if that'd be preferable (to keep it all in the prctl.h world).
>>>>
>>>> Following along the suggestion of reducing custom parsing, it seemed
>>>> to make a lot of sense to make add and drop actions very explicit.
>>>> There is no guesswork so a system call filtered process will only be
>>>> able to perform DROP operations (if prctl is allowed) to reduce the
>>>> allowed system calls. ?This also allows more fine grained flexibility
>>>> in addition to the in-kernel complexity reduction. ?E.g.,
>>>> Process starts with
>>>> ? __NR_read, "fd == 1"
>>>> ? __NR_read, "fd == 2"
>>>> later it can call:
>>>> ? prctl(PR_SET_SECCOMP, 2, SECCOMP_FILTER_DROP, __NR_read, "fd == 2");
>>>> to drop one of the filters without disabling "fd == 1" reading. ?(Or
>>>> it could pass in NULL to drop all filters).
>>>
>>> Hm, but then you don't let the childs be able to restrict further
>>> what you allowed before.
>>>
>>> Say I have foo(int a, int b), and I apply these filters:
>>>
>>> ? ? ? ?__NR_foo, "a == 1";
>>> ? ? ? ?__NR_foo, "a == 2";
>>>
>>> This is basically "a == 1 || a == 2".
>>>
>>> Now I apply the filters and I fork. How can the child
>>> (or current task after the filter is applied) restrict
>>> further by only allowing "b == 2", such that with the
>>> inherited parent filters we have:
>>>
>>> ? ? ? ?"(a == 1 || a == 2) && b == 2"
>>>
>>> So what you propose seems to me too limited. I'd rather have this:
>>>
>>> SECCOMP_FILTER_SET = remove previous filter entirely and set a new one
>>> SECCOMP_FILTER_GET = get the string of the current filter
>>>
>>> The rule would be that you can only set a filter that is intersected
>>> with the one that was previously applied.
>>>
>>> It means that if you set filter A and you apply it. If you want to set
>>> filter B thereafter, it must be:
>>>
>>> ? ? ? ?A && B
>>>
>>> OTOH, as long as you haven't applied A, you can override it as you wish.
>>> Like you can have "A || B" instead. Or you can remove it with "1". Of course
>>> if a previous filter was applied before A, then your new filter must be
>>> concatenated: "previous && (A || B)".
>>>
>>> Right? And note in this scheme you can reproduce your DROP trick. If
>>> "A || B" is the current filter applied, then you can restrict B by
>>> doing: "(A || B) && A".
>>>
>>> So the role of SECCOMP_FILTER_GET is to get the string that matches
>>> the current applied filter.
>>>
>>> The effect of this is infinite of course. If you apply A, then apply
>>> B then you need A && B. If later you want to apply C, then you need
>>> A && B && C, etc...
>>>
>>> Does that look sane?
>>>
>>
>> Even better: applying a filter would always automatically be an
>> intersection of the previous one.
>>
>> If you do:
>>
>> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>> SECCOMP_FILTER_APPLY
>>
>> The end result is:
>>
>> "(a == 1 || a == 2) && b == 2 ?&& c == 3"
>>
>> So that we don't push the burden in the kernel to compare the applied
>> expression with a new one that may or may not be embraced by parenthesis
>> and other trickies like that. We simply append to the working one.
>>
>> Ah and OTOH this:
>>
>> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
>> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>>
>> has the following end result:
>>
>> "(a == 1 || a == 2) && c == 3"
>>
>> As long as you don't apply the filter, the temporary part is
>> overriden, but still we keep
>> the applied part.
>>
>> Still sane? (or completely nuts?)
>
> Okay - so I *think* I'm following. ?I really like the use of SET and
> GET to allow for further constraint based on additional argument
> restrictions instead of purely reducing the filters available. ?The
> only part I'm stumbling on is using APPLY on a per-filter basis. ?In
> my current implementation, I consider APPLY to be the global enable
> bit. ?Whatever filters are set become set in stone and only &&s are
> handled. ?I'm not sure I understand why it would make sense to do a
> per-syscall-filter apply call. ?It's certainly doable, but it will
> mean that we may be logically storing something like:
>
> __NR_foo: (a == 1 || a == 2), applied
> __NR_foo: b == 2, not applied
> __NR_foo: c == 3, not applied
>
> after
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> In that case, would a call to sys_foo even be tested against the
> non-applied constraints of b==2 or c==3? Or would the call to set "c
> == 3" replace the "b == 2" entry. ?I'm not sure I see that the benefit
> exceeds the ambiguity that might introduce. ?However, if the default
> behavior it to always extend with &&, then a consumer of the interface
> could just do:
>
> SECCOMP_FILTER_SET, __NR_prctl, "option == 2"
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_APPLY
>
> This would yield a final filter for foo of "(a == 1 || a == 2) && b ==
> 2". ?The call to APPLY would initiate the enforcement of the syscall
> filtering and enforce that no new filters may be added for syscalls
> that aren't already constrained. ?So you could still call
>
> SECCOMP_FILTER_SET, __NR_foo, "0"
>
> which is logically "((a == 1 || a == 2) && b == 2) && 0" and would be
> interpreted as just a DROP. But you could not do,
>
> SECCOMP_FILTER_SET, __NR_foo, "0"
> SECCOMP_FILTER_SET, __NR_foo, "1"
> or
> SECCOMP_FILTER_SET, __NR_read, "1"
>
> (since the implicit filter for all syscalls after an APPLY call should
> be "0" and additions would just be "0 && whatever").
>
> Am I missing something? ?If that makes sense, then we may even be able
> to reduce the extra directives by one and get a resulting interface
> that looks something like:
>
> /* Appends (&&) a new filter string for a syscall to the current
> filter value. "0" clears the filter. */
> prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter_string);
> /* Returns the current explicit filter string for a syscall */
> prctl(PR_GET_SECCOMP_FILTER, syscall_nr, buf, buflen);
> /* Transition to a secure computing mode:
> ?* 1 - enables traditional seccomp behavior
> ?* 2 - enables seccomp filter enforcement and changes the implicit
> filter for syscalls from "1" to "0"
> ?*/
> prctl(PR_SET_SECCOMP, mode);
>
> I'll set aside the v2 of the patch that uses ADD, DROP, and APPLY and
> work up a version with this interface. ?Do you (or anyone else :) feel
> strongly about per-syscall APPLY? ?I like the above version, but I'm
> certainly willing to explore the other path. ?As is, I'll go through
> my usecases (and tests once I have a new cut of the patch) and see how
> it feels. ?At first blush, this appears more succinct _and_ more
> expressive than the prior versions!
>
> ~~
>
> As to the use of apply_on_exec, even if you whitelist: mmap, fstat64,
> brk, uname, open, read, close, set_thread_area, mprotect, munmap, and
> access _just_ to allow a process to be exec'd, it is still a
> significant reduction in kernel attack surface. ?Pairing that with a
> LSM, delayed chroot, etc could fill in the gaps with respect to a
> greater sandboxing solution. ?I'd certainly take that tradeoff for
> running binaries that I don't control the source for :) ?That said,
> LD_PRELOAD, ptrace injection, and other tricks could allow for the
> injection of a very targeted filterset, but I don't think that
> invalidates the on-exec case given the brittleness relying exclusively
> on such tactics. ?Does that seem reasonable?

With a little more thinking :), I don't think there's an obvious hook
for "on-exec" bit in the interface proposed above. I think that might
be a worthwhile tradeoff given how much cleaner it is. It'd still be
possible to write a launcher to provide a reduced kernel surface, it'd
just also need a filter for the exec syscall.

cheers!
will

2011-05-04 12:16:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Tue, 2011-05-03 at 03:47 +0200, Frederic Weisbecker wrote:
> 2011/5/3 Frederic Weisbecker <[email protected]>:

> Even better: applying a filter would always automatically be an
> intersection of the previous one.
>
> If you do:
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> SECCOMP_FILTER_APPLY
>
> The end result is:
>
> "(a == 1 || a == 2) && b == 2 && c == 3"
>

I'm a little confused. Why do we have both a FILTER_SET and a
FILTER_APPLY? Maybe this was discussed earlier in the thread and I
missed it or simply forgot.

Why not just apply on the set call?

-- Steve

2011-05-04 15:55:43

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 08:16 -0400, Steven Rostedt wrote:
> On Tue, 2011-05-03 at 03:47 +0200, Frederic Weisbecker wrote:
> > 2011/5/3 Frederic Weisbecker <[email protected]>:
>
> > Even better: applying a filter would always automatically be an
> > intersection of the previous one.
> >
> > If you do:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> > SECCOMP_FILTER_APPLY
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > SECCOMP_FILTER_APPLY
> > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> > SECCOMP_FILTER_APPLY
> >
> > The end result is:
> >
> > "(a == 1 || a == 2) && b == 2 && c == 3"
> >
>
> I'm a little confused. Why do we have both a FILTER_SET and a
> FILTER_APPLY? Maybe this was discussed earlier in the thread and I
> missed it or simply forgot.
>
> Why not just apply on the set call?

As this is a deny by default interface which only allows you to further
restrict you couldn't add more than 1 syscall if you didn't have an
explict 'apply' action.

SECCOMP_FILTER_SET, __NR_fo, "a=0"
SECCOMP_FILTER_SET, __NR_read, "1" == EPERM

Maybe apply on set is fine after the first apply, but we definitely need
some way to do more than 1 set before the rules are applied....

-Eric

2011-05-04 16:06:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 11:54 -0400, Eric Paris wrote:

> As this is a deny by default interface which only allows you to further
> restrict you couldn't add more than 1 syscall if you didn't have an
> explict 'apply' action.
>
> SECCOMP_FILTER_SET, __NR_fo, "a=0"
> SECCOMP_FILTER_SET, __NR_read, "1" == EPERM
>
> Maybe apply on set is fine after the first apply, but we definitely need
> some way to do more than 1 set before the rules are applied....

So we could have SET be 'or' and APPLY be 'and'.

SECCOMP_FILTER_SET, __NR_foo, "a=0"
SECCOMP_FILTER_SET, __NR_read, "1" == EPERM
SECCOPM_FILTER_APPLY

SECCOMP_FILTER_SET, __NR_foo, "b=0"
SECCOPM_FILTER_APPLY

Will end up being:

(foo: a == 0 || read: "1" == EPERM) && (foo: b == 0)

The second set/apply now removes the read option, and foo only works if
a is 0 and b is 0.

This would also work for children, as they can only restrict (with
'and') and can not add more control.

-- Steve

2011-05-04 16:23:15

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 12:06 -0400, Steven Rostedt wrote:
> On Wed, 2011-05-04 at 11:54 -0400, Eric Paris wrote:
>
> > As this is a deny by default interface which only allows you to further
> > restrict you couldn't add more than 1 syscall if you didn't have an
> > explict 'apply' action.
> >
> > SECCOMP_FILTER_SET, __NR_fo, "a=0"
> > SECCOMP_FILTER_SET, __NR_read, "1" == EPERM
> >
> > Maybe apply on set is fine after the first apply, but we definitely need
> > some way to do more than 1 set before the rules are applied....
>
> So we could have SET be 'or' and APPLY be 'and'.
>
> SECCOMP_FILTER_SET, __NR_foo, "a=0"
> SECCOMP_FILTER_SET, __NR_read, "1" == EPERM

When I said "== EPERM" I meant that the given prctl call would return
EPERM. I'm going to pretend that you didn't type it.

> SECCOPM_FILTER_APPLY
>
> SECCOMP_FILTER_SET, __NR_foo, "b=0"
> SECCOPM_FILTER_APPLY
>
> Will end up being:
>
> (foo: a == 0 || read: "1") && (foo: b == 0)
>
> The second set/apply now removes the read option, and foo only works if
> a is 0 and b is 0.
>
> This would also work for children, as they can only restrict (with
> 'and') and can not add more control.

I think we pretty much agree although I'm pretty that we will have 1
filter per syscall. So the rules would really be (in your syntax)

Rule1: (foo: a == 0 && b == 0)
OR
Rule2: (read: "1")

Although logically the same, it's not just one huge rule. I don't see
any need for any operation other than an &&. Before the first "set" you
can add new syscalls. After the first set you can only && onto existing
syscalls. So the following set of operations:

SECCOMP_FILTER_SET, __NR_foo, "a=0"
SECCOMP_FILTER_SET, __NR_read, "1"
SECCOPM_FILTER_APPLY

SECCOMP_FILTER_SET, __NR_foo, "b=0"
SECCOMP_FILTER_APPLY

SECCOMP_FILTER_SET, __NR_write, "1"
SECCOMP_FILTER_APPLY

Would return EPERM for the __NR_write entry since it was a new syscall
after a set. I think we agree on all this.

I do have a question on some syntax proposed a while back. Given:
SECCOMP_FILTER_SET, __NR_foo, "a=0"
SECCOMP_FILTER_SET, __NR_foo, "b=0"
SECCOMP_FILTER_APPLY

I would think to keep the interface consistent that should result in
foo: (a=0) && (b=0)

But I think the proposal was that we should instead have just
foo: (b=0)

What's the logic behind having a second call overwrite uncommitted
changes? I sorta feel like if I put it in there, I must have wanted it
in there :)

-Eric

2011-05-04 16:39:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 12:22 -0400, Eric Paris wrote:

> > SECCOMP_FILTER_SET, __NR_foo, "a=0"
> > SECCOMP_FILTER_SET, __NR_read, "1" == EPERM
>
> When I said "== EPERM" I meant that the given prctl call would return
> EPERM. I'm going to pretend that you didn't type it.

I was confused about that, but I didn't type that. You did. I just cut
and pasted it ;)

>
> > SECCOPM_FILTER_APPLY
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b=0"
> > SECCOPM_FILTER_APPLY
> >
> > Will end up being:
> >
> > (foo: a == 0 || read: "1") && (foo: b == 0)
> >
> > The second set/apply now removes the read option, and foo only works if
> > a is 0 and b is 0.
> >
> > This would also work for children, as they can only restrict (with
> > 'and') and can not add more control.
>
> I think we pretty much agree although I'm pretty that we will have 1
> filter per syscall. So the rules would really be (in your syntax)
>
> Rule1: (foo: a == 0 && b == 0)
> OR
> Rule2: (read: "1")
>
> Although logically the same, it's not just one huge rule. I don't see
> any need for any operation other than an &&. Before the first "set" you
> can add new syscalls. After the first set you can only && onto existing
> syscalls. So the following set of operations:
>
> SECCOMP_FILTER_SET, __NR_foo, "a=0"
> SECCOMP_FILTER_SET, __NR_read, "1"
> SECCOPM_FILTER_APPLY
>
> SECCOMP_FILTER_SET, __NR_foo, "b=0"
> SECCOMP_FILTER_APPLY
>
> SECCOMP_FILTER_SET, __NR_write, "1"
> SECCOMP_FILTER_APPLY
>
> Would return EPERM for the __NR_write entry since it was a new syscall
> after a set. I think we agree on all this.

Do you mean "after a apply"? As the second line above is a new syscall
after the first set.


>
> I do have a question on some syntax proposed a while back. Given:
> SECCOMP_FILTER_SET, __NR_foo, "a=0"
> SECCOMP_FILTER_SET, __NR_foo, "b=0"
> SECCOMP_FILTER_APPLY
>
> I would think to keep the interface consistent that should result in
> foo: (a=0) && (b=0)

I agree.

>
> But I think the proposal was that we should instead have just
> foo: (b=0)

Yeah, that's what it looked like Frederic showed. I rather have the
first instance.

Perhaps we could have a "unset"? that would only work on things that
haven't been applied yet.

>
> What's the logic behind having a second call overwrite uncommitted
> changes? I sorta feel like if I put it in there, I must have wanted it
> in there :)

Perhaps for making the user code simpler?

SET a=1
SET b=2

[ some nasty if logic ]

UNSET b=2

APPLY


Thus a default setting can be made and then we can selectively remove
settings before we do the apply based on information about the process
that we will exec. We can start out with "limit the hell out of it" and
then selectively remove things. I think this is a simply way to
understand what is being done. Kind of like iptables, where you can set
up default rules, but then selectively override them.

One thing I know about security, the easier it is to set up, the more
secure it becomes.

-- Steve

2011-05-04 17:04:04

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, May 04, 2011 at 12:22:40PM -0400, Eric Paris wrote:
> On Wed, 2011-05-04 at 12:06 -0400, Steven Rostedt wrote:
> > On Wed, 2011-05-04 at 11:54 -0400, Eric Paris wrote:
> >
> > > As this is a deny by default interface which only allows you to further
> > > restrict you couldn't add more than 1 syscall if you didn't have an
> > > explict 'apply' action.
> > >
> > > SECCOMP_FILTER_SET, __NR_fo, "a=0"
> > > SECCOMP_FILTER_SET, __NR_read, "1" == EPERM
> > >
> > > Maybe apply on set is fine after the first apply, but we definitely need
> > > some way to do more than 1 set before the rules are applied....
> >
> > So we could have SET be 'or' and APPLY be 'and'.
> >
> > SECCOMP_FILTER_SET, __NR_foo, "a=0"
> > SECCOMP_FILTER_SET, __NR_read, "1" == EPERM
>
> When I said "== EPERM" I meant that the given prctl call would return
> EPERM. I'm going to pretend that you didn't type it.
>
> > SECCOPM_FILTER_APPLY
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b=0"
> > SECCOPM_FILTER_APPLY
> >
> > Will end up being:
> >
> > (foo: a == 0 || read: "1") && (foo: b == 0)
> >
> > The second set/apply now removes the read option, and foo only works if
> > a is 0 and b is 0.
> >
> > This would also work for children, as they can only restrict (with
> > 'and') and can not add more control.
>
> I think we pretty much agree although I'm pretty that we will have 1
> filter per syscall. So the rules would really be (in your syntax)
>
> Rule1: (foo: a == 0 && b == 0)
> OR
> Rule2: (read: "1")
>
> Although logically the same, it's not just one huge rule. I don't see
> any need for any operation other than an &&. Before the first "set" you
> can add new syscalls. After the first set you can only && onto existing
> syscalls. So the following set of operations:
>
> SECCOMP_FILTER_SET, __NR_foo, "a=0"
> SECCOMP_FILTER_SET, __NR_read, "1"
> SECCOPM_FILTER_APPLY
>
> SECCOMP_FILTER_SET, __NR_foo, "b=0"
> SECCOMP_FILTER_APPLY
>
> SECCOMP_FILTER_SET, __NR_write, "1"
> SECCOMP_FILTER_APPLY
>
> Would return EPERM for the __NR_write entry since it was a new syscall
> after a set. I think we agree on all this.

No, why?

The default filter for a syscall, if none have been given for it, is "0".

Thus, if you write "1" later, the entire filter is going to be:

"0 && 1"

Which is fine, we are not overriding already applied permissions there.

So where is the need to return -EPERM in such a specific case? Is it
worth the corner case to check in the kernel, and to handle in userspace?
And for what reason?

2011-05-04 17:52:39

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, May 04, 2011 at 02:15:47AM -0700, Will Drewry wrote:
> Okay - so I *think* I'm following. I really like the use of SET and
> GET to allow for further constraint based on additional argument
> restrictions instead of purely reducing the filters available. The
> only part I'm stumbling on is using APPLY on a per-filter basis. In
> my current implementation, I consider APPLY to be the global enable
> bit. Whatever filters are set become set in stone and only &&s are
> handled. I'm not sure I understand why it would make sense to do a
> per-syscall-filter apply call.

Nope it's still a global apply..

> It's certainly doable, but it will
> mean that we may be logically storing something like:
>
> __NR_foo: (a == 1 || a == 2), applied
> __NR_foo: b == 2, not applied
> __NR_foo: c == 3, not applied
>
> after
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"

No, the c == 3 would override b == 2.

> In that case, would a call to sys_foo even be tested against the
> non-applied constraints of b==2 or c==3?

No, not as long as it's not applied.

> Or would the call to set "c
> == 3" replace the "b == 2" entry. I'm not sure I see that the benefit
> exceeds the ambiguity that might introduce.

The rationale behind it is that as long as you haven't applied your filter,
you should be able to override it.

And the simplest way to override it is to do it completely. Remove what
was there before (that wasn't applied).

You could opt for a FILTER_DROP.
Let's take that example:

SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_SET, __NR_foo, "c == 3"

Following your logic, we should have "b == 2 && c == 3" after that.
How are you going to remove only the c == 3 sequence?

You would need SECCOMP_FILTER_DROP, __NR_foo, "c == 3"

That said, using a string with a filter expression as an id looks a
bit odd to me. That doesn't work anymore with "((c == 3))", or "c== 3",
unless you compare against the resulting tree but that's complicated.

I mean, that works, most of the time you keep your expression
and pass it as is. But the expression should be identified by its
meaning, not by some random language layout.

That also forces you to scatter your filter representation:

SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_SET, __NR_foo, "c == 3"

For me this shouldn't be different than

SECCOMP_FILTER_SET, __NR_foo, "b == 2 && c == 3"

Still nobody will be able to remove a part of the above expression.

So, I'm more for having something that removes everything not applied
than just a part of the non-applied filter.

Two possibilities I see:

SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_SET, __NR_foo, "c == 3"

// And result is "c == 3"

Or:

SECCOMP_FILTER_SET, __NR_foo, "b == 2"
SECCOMP_FILTER_SET, __NR_foo, "c == 3"

// And result is "b == 2 && c == 3"

SECCOMP_FILTER_RESET, __NR_foo //rewind to filter "0"

SECCOMP_FILTER_SET, __NR_foo, "c == 4"

// And result is "c == 4"

How does that look?



> However, if the default
> behavior it to always extend with &&, then a consumer of the interface
> could just do:
>
> SECCOMP_FILTER_SET, __NR_prctl, "option == 2"
> SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_APPLY
>
> This would yield a final filter for foo of "(a == 1 || a == 2) && b ==
> 2". The call to APPLY would initiate the enforcement of the syscall
> filtering and enforce that no new filters may be added for syscalls
> that aren't already constrained. So you could still call
>
> SECCOMP_FILTER_SET, __NR_foo, "0"
>
> which is logically "((a == 1 || a == 2) && b == 2) && 0" and would be
> interpreted as just a DROP. But you could not do,
>
> SECCOMP_FILTER_SET, __NR_foo, "0"
> SECCOMP_FILTER_SET, __NR_foo, "1"
> or
> SECCOMP_FILTER_SET, __NR_read, "1"

Why?

"((a == 1 || a == 2) && b == 2) && 0 && 1"

or

"((a == 1 || a == 2) && b == 2) && 1"

does still follow our previous constraints.

>
> (since the implicit filter for all syscalls after an APPLY call should
> be "0" and additions would just be "0 && whatever").

If the default filter is 0 after apply, setting whatever afterward is
not an issue. You'll have "0 && whatever" which still means 0, that's fine.

>
> Am I missing something? If that makes sense, then we may even be able
> to reduce the extra directives by one and get a resulting interface
> that looks something like:
>
> /* Appends (&&) a new filter string for a syscall to the current
> filter value. "0" clears the filter. */
> prctl(PR_SET_SECCOMP_FILTER, syscall_nr, filter_string);
> /* Returns the current explicit filter string for a syscall */
> prctl(PR_GET_SECCOMP_FILTER, syscall_nr, buf, buflen);

So GET is eventually optional here, as a new filter automatically
append to the previous applied one.

But if a process needs some information about current filter, this
makes sense.

> /* Transition to a secure computing mode:
> * 1 - enables traditional seccomp behavior
> * 2 - enables seccomp filter enforcement and changes the implicit
> filter for syscalls from "1" to "0"
> */
> prctl(PR_SET_SECCOMP, mode);
>
> I'll set aside the v2 of the patch that uses ADD, DROP, and APPLY and
> work up a version with this interface. Do you (or anyone else :) feel
> strongly about per-syscall APPLY?

I'm still not sure what you meant by per syscall APPLY :)

> ~~
>
> As to the use of apply_on_exec, even if you whitelist: mmap, fstat64,
> brk, uname, open, read, close, set_thread_area, mprotect, munmap, and
> access _just_ to allow a process to be exec'd, it is still a
> significant reduction in kernel attack surface. Pairing that with a
> LSM, delayed chroot, etc could fill in the gaps with respect to a
> greater sandboxing solution. I'd certainly take that tradeoff for
> running binaries that I don't control the source for :) That said,
> LD_PRELOAD, ptrace injection, and other tricks could allow for the
> injection of a very targeted filterset, but I don't think that
> invalidates the on-exec case given the brittleness relying exclusively
> on such tactics. Does that seem reasonable?

Right. But then, does a default enable_on_exec behaviour really matches your
needs? Given your description, it seems that applying it on the next
fork would work too, as a random example.

2011-05-04 17:56:00

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 19:03 +0200, Frederic Weisbecker wrote:
> On Wed, May 04, 2011 at 12:22:40PM -0400, Eric Paris wrote:
> So the following set of operations:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "a=0"
> > SECCOMP_FILTER_SET, __NR_read, "1"
> > SECCOPM_FILTER_APPLY
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b=0"
> > SECCOMP_FILTER_APPLY
> >
> > SECCOMP_FILTER_SET, __NR_write, "1"
> > SECCOMP_FILTER_APPLY
> >
> > Would return EPERM for the __NR_write entry since it was a new syscall
> > after a set. I think we agree on all this.
>
> No, why?
>
> The default filter for a syscall, if none have been given for it, is "0".
>
> Thus, if you write "1" later, the entire filter is going to be:
>
> "0 && 1"
>
> Which is fine, we are not overriding already applied permissions there.
>
> So where is the need to return -EPERM in such a specific case? Is it
> worth the corner case to check in the kernel, and to handle in userspace?
> And for what reason?

I assumed without looking at the code (always a bad idea) that he wasn't
going to explicitly create a rule with "0" and was going to implicitly
deny anything without a rule. If there is an explicit "0" rule then you
are right, i don't see a need to deny the set operation in the kernel.
But if it is implicit in the non-existence of a filter then it should be
easy to tell userspace it isn't allowed any more.

-Eric

-Eric

2011-05-04 18:03:21

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 12:39 -0400, Steven Rostedt wrote:
> On Wed, 2011-05-04 at 12:22 -0400, Eric Paris wrote:

[rewriting history]
> > Although logically the same, it's not just one huge rule. I don't see
> > any need for any operation other than an &&. Before the first "apply" you
> > can add new syscalls. After the first apply you can only && onto existing
> > syscalls. So the following set of operations:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "a=0"
> > SECCOMP_FILTER_SET, __NR_read, "1"
> > SECCOPM_FILTER_APPLY
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b=0"
> > SECCOMP_FILTER_APPLY
> >
> > SECCOMP_FILTER_SET, __NR_write, "1"
> > SECCOMP_FILTER_APPLY
> >
> > Would return EPERM for the __NR_write entry since it was a new syscall
> > after an apply. I think we agree on all this.
>
> Do you mean "after a apply"? As the second line above is a new syscall
> after the first set.

Clearly. Especially since given my revisionist history it's what I
said!

> >
> > I do have a question on some syntax proposed a while back. Given:
> > SECCOMP_FILTER_SET, __NR_foo, "a=0"
> > SECCOMP_FILTER_SET, __NR_foo, "b=0"
> > SECCOMP_FILTER_APPLY
> >
> > I would think to keep the interface consistent that should result in
> > foo: (a=0) && (b=0)
>
> I agree.
>
> >
> > But I think the proposal was that we should instead have just
> > foo: (b=0)
>
> Yeah, that's what it looked like Frederic showed. I rather have the
> first instance.
>
> Perhaps we could have a "unset"? that would only work on things that
> haven't been applied yet.
>
> >
> > What's the logic behind having a second call overwrite uncommitted
> > changes? I sorta feel like if I put it in there, I must have wanted it
> > in there :)
>
> Perhaps for making the user code simpler?
>
> SET a=1
> SET b=2
>
> [ some nasty if logic ]
>
> UNSET b=2
>
> APPLY
>
>
> Thus a default setting can be made and then we can selectively remove
> settings before we do the apply based on information about the process
> that we will exec. We can start out with "limit the hell out of it" and
> then selectively remove things. I think this is a simply way to
> understand what is being done. Kind of like iptables, where you can set
> up default rules, but then selectively override them.
>
> One thing I know about security, the easier it is to set up, the more
> secure it becomes.

I'm ok with an explicit unset/remove/delete for rules since the last
apply if anyone thinks it will be useful. But I don't like an implicit
'overwrite.'

I'm starting to debate if I think between rules should be an implicit &&
or should be an implicit ||. After an 'apply' I believe the next block
definitely needs an &&. I feel like the code in userspace becomes
simpler with || but maybe it would be more confusing for the human coder
to have to know the distinctions. The example in my head, which I think
will be common, involves handling multiple fd's for read. We could
either do:

int handle_read_fd(int fd)
{
char buf[4096];
snprintf(buf, 4096, "a0=%d", fd)
return prctl(SECCOMP_FILTER_SET, __NR_read, buf);
}

main()
{
....
fd1 = open("readfile1", O_RDONLY);
if (fd1 < 0) return
if (handle_read_fd(fd1)) return
....
fd2 = open("readfile2, O_RDONLY);
if (fd2 < 0) return
if (handle_read_fd(fd2)) return
....
prctl(SECCOMP_FILTER_APPLY);
}

Or case2 with the implicit && like we talked about to handle an
arbitrary number of read fd's you need (pseudocode):
int handle_read_fds(int *fds, int len)
{
char buf[4096];
size_t saved;
while(len--) {
saved = snprintf(buf, 4096, "a0=%d ||");
buf += saved;
}
return prctl(SECCOMP_FILTER_SET, __NR_read, buf)
}
main()
{
int fd[2];
....
fd[0] = open("readfile1", O_RDONLY);
if (fd[0] < 0) return
....
fd[1] = open("readfile2, O_RDONLY);
if (fd[1] < 0) return
....
handle_read_fds(fd, 2)
prctl(SECCOMP_FILTER_APPLY);
}

The latter of the examples requires what I think to be the common case
of complex rules to be required to hold state whereas the other is done
in the kernel. It's a lot easier to code the first one but it might be
harder on the coder to decide "now is than an && or an ||"? Like I
said, I'm ok with just declaring everything &&, especially if people
think complex filters are likely to be anything more than rule1 || rule2
|| rule3 || rule4 where ruleX is an independent clause, but I figured
I'd throw it out there....

No matter what after the APPLY I think that:
SECCOMP_FILTER, __NR_read, "0" should result in:
(rule1 || rule2 || rule3 || rule4) && 0

-Eric

2011-05-04 18:23:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 19:52 +0200, Frederic Weisbecker wrote:

> > It's certainly doable, but it will
> > mean that we may be logically storing something like:
> >
> > __NR_foo: (a == 1 || a == 2), applied
> > __NR_foo: b == 2, not applied
> > __NR_foo: c == 3, not applied
> >
> > after
> >
> > SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> > SECCOMP_FILTER_APPLY
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> No, the c == 3 would override b == 2.

I honestly hate the "override" mode. I like that SETs are or'd among
each other and an APPLY is a "commit"; meaning that you can only limit
it further, but can not extend it (an explicit &&)

>
> > In that case, would a call to sys_foo even be tested against the
> > non-applied constraints of b==2 or c==3?
>
> No, not as long as it's not applied.
>
> > Or would the call to set "c
> > == 3" replace the "b == 2" entry. I'm not sure I see that the benefit
> > exceeds the ambiguity that might introduce.
>
> The rationale behind it is that as long as you haven't applied your filter,
> you should be able to override it.

We need a "UNSET" (I like that better than DROP).

>
> And the simplest way to override it is to do it completely. Remove what
> was there before (that wasn't applied).
>
> You could opt for a FILTER_DROP.
> Let's take that example:
>
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> Following your logic, we should have "b == 2 && c == 3" after that.
> How are you going to remove only the c == 3 sequence?
>
> You would need SECCOMP_FILTER_DROP, __NR_foo, "c == 3"
>
> That said, using a string with a filter expression as an id looks a
> bit odd to me. That doesn't work anymore with "((c == 3))", or "c== 3",
> unless you compare against the resulting tree but that's complicated.

Nah, it should be easy. Actually, in "set" mode (before the apply) we
should keep a link list of "trees" of sets. Then we just find the "tree"
that matches the UNSET and remove it.

>
> I mean, that works, most of the time you keep your expression
> and pass it as is. But the expression should be identified by its
> meaning, not by some random language layout.
>
> That also forces you to scatter your filter representation:
>
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> For me this shouldn't be different than
>
> SECCOMP_FILTER_SET, __NR_foo, "b == 2 && c == 3"
>
> Still nobody will be able to remove a part of the above expression.

Correct, as the sets will be just link lists of sets. Once we apply it,
it goes into the main tree.

Thus, to find the UNSET set, we would evaluate the tree of that unset,
and search for it. If it is found, remove it, otherwise return EINVAL or
something.

>
> So, I'm more for having something that removes everything not applied
> than just a part of the non-applied filter.
>
> Two possibilities I see:
>
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> // And result is "c == 3"
>
> Or:
>
> SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>
> // And result is "b == 2 && c == 3"
>
> SECCOMP_FILTER_RESET, __NR_foo //rewind to filter "0"
>
> SECCOMP_FILTER_SET, __NR_foo, "c == 4"
>
> // And result is "c == 4"
>
> How does that look?

The thing is, I don't understand where we would use (or want) an
override. I could understand a UNSET, which I described in another
reply. Also, I like the fact that sets can be self contained because
then the user can add wrapper functions for them.

add_filter("foo: c == 4");
add_filter("bar: b < 3");
apply_filter();

That wont work with an override, unless we did the work in userspace to
keep string, and then we need to worry about "string matches" as you
stated if we want to implement a remove_filter("foo: c==4"). (see it
would fail, because this version is missing spaces)

-- Steve

2011-05-04 18:31:00

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, May 04, 2011 at 02:23:02PM -0400, Steven Rostedt wrote:
> On Wed, 2011-05-04 at 19:52 +0200, Frederic Weisbecker wrote:
>
> > > It's certainly doable, but it will
> > > mean that we may be logically storing something like:
> > >
> > > __NR_foo: (a == 1 || a == 2), applied
> > > __NR_foo: b == 2, not applied
> > > __NR_foo: c == 3, not applied
> > >
> > > after
> > >
> > > SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> > > SECCOMP_FILTER_APPLY
> > > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> >
> > No, the c == 3 would override b == 2.
>
> I honestly hate the "override" mode. I like that SETs are or'd among
> each other and an APPLY is a "commit"; meaning that you can only limit
> it further, but can not extend it (an explicit &&)

I'm confused with what you just said...

Somehow I could understand it that way:

SECCOMP_FILTER_SET, __NR_foo, "a == 1"
SECCOMP_FILTER_SET, __NR_foo, "a == 2"
SECCOMP_FILTER_APPLY
SECCOMP_FILTER_SET, __NR_foo, "b == 1"

Would produce:

"(a == 1 || a == 2) && b == 1"

That makes a pretty confusing behaviour for users I think.

>
> >
> > > In that case, would a call to sys_foo even be tested against the
> > > non-applied constraints of b==2 or c==3?
> >
> > No, not as long as it's not applied.
> >
> > > Or would the call to set "c
> > > == 3" replace the "b == 2" entry. I'm not sure I see that the benefit
> > > exceeds the ambiguity that might introduce.
> >
> > The rationale behind it is that as long as you haven't applied your filter,
> > you should be able to override it.
>
> We need a "UNSET" (I like that better than DROP).

What about a complete erase (RESET) of the temporary filter? Like I explained below
from my previous mail.

>
> >
> > And the simplest way to override it is to do it completely. Remove what
> > was there before (that wasn't applied).
> >
> > You could opt for a FILTER_DROP.
> > Let's take that example:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> >
> > Following your logic, we should have "b == 2 && c == 3" after that.
> > How are you going to remove only the c == 3 sequence?
> >
> > You would need SECCOMP_FILTER_DROP, __NR_foo, "c == 3"
> >
> > That said, using a string with a filter expression as an id looks a
> > bit odd to me. That doesn't work anymore with "((c == 3))", or "c== 3",
> > unless you compare against the resulting tree but that's complicated.
>
> Nah, it should be easy. Actually, in "set" mode (before the apply) we
> should keep a link list of "trees" of sets. Then we just find the "tree"
> that matches the UNSET and remove it.
>
> >
> > I mean, that works, most of the time you keep your expression
> > and pass it as is. But the expression should be identified by its
> > meaning, not by some random language layout.
> >
> > That also forces you to scatter your filter representation:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> >
> > For me this shouldn't be different than
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2 && c == 3"
> >
> > Still nobody will be able to remove a part of the above expression.
>
> Correct, as the sets will be just link lists of sets. Once we apply it,
> it goes into the main tree.
>
> Thus, to find the UNSET set, we would evaluate the tree of that unset,
> and search for it. If it is found, remove it, otherwise return EINVAL or
> something.
>
> >
> > So, I'm more for having something that removes everything not applied
> > than just a part of the non-applied filter.
> >
> > Two possibilities I see:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> >
> > // And result is "c == 3"
> >
> > Or:
> >
> > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> >
> > // And result is "b == 2 && c == 3"
> >
> > SECCOMP_FILTER_RESET, __NR_foo //rewind to filter "0"
> >
> > SECCOMP_FILTER_SET, __NR_foo, "c == 4"
> >
> > // And result is "c == 4"
> >
> > How does that look?
>
> The thing is, I don't understand where we would use (or want) an
> override. I could understand a UNSET, which I described in another
> reply. Also, I like the fact that sets can be self contained because
> then the user can add wrapper functions for them.
>
> add_filter("foo: c == 4");
> add_filter("bar: b < 3");
> apply_filter();
>
> That wont work with an override, unless we did the work in userspace to
> keep string, and then we need to worry about "string matches" as you
> stated if we want to implement a remove_filter("foo: c==4"). (see it
> would fail, because this version is missing spaces)
>
> -- Steve
>
>

2011-05-04 18:46:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, 2011-05-04 at 20:30 +0200, Frederic Weisbecker wrote:
> On Wed, May 04, 2011 at 02:23:02PM -0400, Steven Rostedt wrote:
> > On Wed, 2011-05-04 at 19:52 +0200, Frederic Weisbecker wrote:
> >
> > > > It's certainly doable, but it will
> > > > mean that we may be logically storing something like:
> > > >
> > > > __NR_foo: (a == 1 || a == 2), applied
> > > > __NR_foo: b == 2, not applied
> > > > __NR_foo: c == 3, not applied
> > > >
> > > > after
> > > >
> > > > SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
> > > > SECCOMP_FILTER_APPLY
> > > > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
> > > > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
> > >
> > > No, the c == 3 would override b == 2.
> >
> > I honestly hate the "override" mode. I like that SETs are or'd among
> > each other and an APPLY is a "commit"; meaning that you can only limit
> > it further, but can not extend it (an explicit &&)
>
> I'm confused with what you just said...
>
> Somehow I could understand it that way:
>
> SECCOMP_FILTER_SET, __NR_foo, "a == 1"
> SECCOMP_FILTER_SET, __NR_foo, "a == 2"
> SECCOMP_FILTER_APPLY
> SECCOMP_FILTER_SET, __NR_foo, "b == 1"
>
> Would produce:
>
> "(a == 1 || a == 2) && b == 1"

No, it would produce:

(a == 1 || a == 2)

The b == 1 will not be added until the next apply.

>
> That makes a pretty confusing behaviour for users I think.


Not really, if it is documented well. Or we can call it:

SECCOMP_FILTER_SET_OR and SECCOMP_FILTER_APPLY_AND

to remove the ambiguity. The reason is actually quite simple. Before you
do an apply, you can modify it to whatever you want. But once you do an
apply, you just limited yourself. An apply can not be reversed.

>
> >
> > >
> > > > In that case, would a call to sys_foo even be tested against the
> > > > non-applied constraints of b==2 or c==3?
> > >
> > > No, not as long as it's not applied.
> > >
> > > > Or would the call to set "c
> > > > == 3" replace the "b == 2" entry. I'm not sure I see that the benefit
> > > > exceeds the ambiguity that might introduce.
> > >
> > > The rationale behind it is that as long as you haven't applied your filter,
> > > you should be able to override it.
> >
> > We need a "UNSET" (I like that better than DROP).
>
> What about a complete erase (RESET) of the temporary filter? Like I explained below
> from my previous mail.

What is a temporary filter? And a RESET could be there too to just
remove all sets that are pending.

I was thinking that we add "SETS" which the kernel can verify are
correct and let the user know at the time of the command if it is valid.
But these sets are not actually implemented until an APPLY is hit.

-- Steve

2011-05-05 09:21:10

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, May 4, 2011 at 11:46 AM, Steven Rostedt <[email protected]> wrote:
> On Wed, 2011-05-04 at 20:30 +0200, Frederic Weisbecker wrote:
>> On Wed, May 04, 2011 at 02:23:02PM -0400, Steven Rostedt wrote:
>> > On Wed, 2011-05-04 at 19:52 +0200, Frederic Weisbecker wrote:
>> >
>> > > > ?It's certainly doable, but it will
>> > > > mean that we may be logically storing something like:
>> > > >
>> > > > __NR_foo: (a == 1 || a == 2), applied
>> > > > __NR_foo: b == 2, not applied
>> > > > __NR_foo: c == 3, not applied
>> > > >
>> > > > after
>> > > >
>> > > > SECCOMP_FILTER_SET, __NR_foo, "a == 1 || a == 2"
>> > > > SECCOMP_FILTER_APPLY
>> > > > SECCOMP_FILTER_SET, __NR_foo, "b == 2"
>> > > > SECCOMP_FILTER_SET, __NR_foo, "c == 3"
>> > >
>> > > No, the c == 3 would override b == 2.
>> >
>> > I honestly hate the "override" mode. I like that SETs are or'd among
>> > each other and an APPLY is a "commit"; meaning that you can only limit
>> > it further, but can not extend it (an explicit &&)
>>
>> I'm confused with what you just said...
>>
>> Somehow I could understand it that way:
>>
>> SECCOMP_FILTER_SET, __NR_foo, "a == 1"
>> SECCOMP_FILTER_SET, __NR_foo, "a == 2"
>> SECCOMP_FILTER_APPLY
>> SECCOMP_FILTER_SET, __NR_foo, "b == 1"
>>
>> Would produce:
>>
>> "(a == 1 || a == 2) && b == 1"
>
> No, it would produce:
>
> ?(a == 1 || a == 2)
>
> The b == 1 will not be added until the next apply.
>
>>
>> That makes a pretty confusing behaviour for users I think.
>
>
> Not really, if it is documented well. Or we can call it:
>
> SECCOMP_FILTER_SET_OR and SECCOMP_FILTER_APPLY_AND
>
> to remove the ambiguity. The reason is actually quite simple. Before you
> do an apply, you can modify it to whatever you want. But once you do an
> apply, you just limited yourself. An apply can not be reversed.
>
>>
>> >
>> > >
>> > > > In that case, would a call to sys_foo even be tested against the
>> > > > non-applied constraints of b==2 or c==3?
>> > >
>> > > No, not as long as it's not applied.
>> > >
>> > > > Or would the call to set "c
>> > > > == 3" replace the "b == 2" entry. ?I'm not sure I see that the benefit
>> > > > exceeds the ambiguity that might introduce.
>> > >
>> > > The rationale behind it is that as long as you haven't applied your filter,
>> > > you should be able to override it.
>> >
>> > We need a "UNSET" (I like that better than DROP).
>>
>> What about a complete erase (RESET) of the temporary filter? Like I explained below
>> from my previous mail.
>
> What is a temporary filter? And a RESET could be there too to just
> remove all sets that are pending.
>
> I was thinking that we add "SETS" which the kernel can verify are
> correct and let the user know at the time of the command if it is valid.
> But these sets are not actually implemented until an APPLY is hit.
>

The past few mails have proposed quite a few interesting variants, but
I wonder if Eric's code samples show something useful about the
not-yet-applied filters in addition to the behavioral differences
between changing implicit operations (&& versus ||).

In particular, if the userspace code wants to stage some filters and
apply them all at once, when ready, I'm not sure that it makes sense
to me to put that complexity in the kernel itself. For instance,
Eric's second sample showed a call that took an array of ints and
coalesced them into "fd == %d || ...". That simple example shows that
we could easily get by with a pretty minimal kernel-supported
interface as long as the richer behavior could live userspace side --
even if just in a simple helper library. It'd be pretty easy to
implement a userspace library that exposed add_filter(syscall_nr,
filter) and apply_filters() such that it could manage building the
final filter string for a given syscall and pushing it to prctl on
apply.

I think that could also help simplify the primitives. For instance,
if any separate SET called on a system call resulting in an &&
operation, then the behavior could be consistent prior to enforcement
of the filtering and after. E.g.,
SET, __NR_read, "fd == 1"
SET, __NR_read, "len < 4097"
would result in an evaluated "fd == 1 && len < 4097". It would do so
after a single APPLY call too:
SET, __NR_read, "1"
APPLY
SET, __NR_read, "fd == 1"
SET, __NR_read, "len < 4097"
Results in: "1 && fd == 1 && len < 4097", and SET, nr, "0" would
nullify the syscall filter in total. It seems like that would be
enough to build the SET-SET-...-APPLY, SET-SET-...-SET-APPLY logic
into a userspace library so that all temporary unapplied state doesn't
have to be explicitly managed by the kernel.

While I completely agree with the comment around ease-of-use as being
key to security, I also find that the more the state diagram explodes,
the harder it is to feel confident that a solution is actually secure.
To try to achieve both objectives, I'd like to limit the kernel
interface to the bare minimum of primitives and build any API
fanciness into userspace.

Does it seem that the tradeoff isn't worth it, or are there some
specific behaviors that aren't addressed using that model?

While writing that, another option occurred to me that touches on the
other proposals but makes the behaviors much more explicit.
A prctl prototype could be provided:
prctl(<SET|GET>, <AND|OR>, <syscall_nr>, <filter string>)
e.g.,
prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_OR, __NR_read, "fd == 2");

The explicit prctl argument list would allow the filter strings to be
self-referential and allow the userspace app to decide what behaviors
are allowed and when. If we followed that route, all implicit filters
would be "0" and the initial call to get things started might be:
#define SET 33
#define OR 0
#define AND 1
SET, OR, __NR_prctl, "option == 33 && (arg1 == 0 || arg1 == 1)"
prctl(PR_SET_SECCOMP, 2);

So now the "locked down" binary can call prctl to set an OR or AND
filter for any syscall. A subsequent call could change that:
SET, OR, __NR_read, "fd == 2" /* => "0 || fd == 2" */
SET, AND, __NR_prctl, "(arg2 != 63 || arg1 != 0)" /* __NR_read == 63 */

This would OR in a __NR_read filter, then disallow a future call to
prctl to OR in more NR_read filters, but for other syscalls ANDing and
ORing is still possible until you pass in something like:

SET, AND, __NR_prctl, "arg1 == 1"

which would lock down all future prctl calls to only ANDing filters
in. (The numbers in the examples could then be properly managed in a
userspace library to ensure platform correctness.)

While this would reduce the primitives a bit further, I'm not sure if
this would be the right approach either, but it would open the door to
pushing even more down to userspace very explicitly and further
removing magic policy logic from the kernel-side. Is this vaguely
interesting or just another layer of confusing-ness?


I'll follow Eric's lead and try out a few different interfaces
proposed earlier and the ones I laid out above and see if it seems to
come out any clearer (for me at least). I'd love to know if anyone
else thinks we can get away with less primitives and put more of the
complex/delayed logic in userspace exclusively.

thanks!
will

2011-05-05 13:14:09

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

Quoting Will Drewry ([email protected]):
> In particular, if the userspace code wants to stage some filters and
> apply them all at once, when ready, I'm not sure that it makes sense
> to me to put that complexity in the kernel itself. For instance,

Hi Will,

just one note - in my original comment I wasn't actually suggesting
disabling setting of filters through a writeable file - I was only
suggesting restricting writing to one's own filters file.

All the better if it is possible to get a nice prctl-only
interface, but if it ends up limiting rule expressiveness (or taking
years to define an interface) then perhaps we should stick with
prctl for setting seccomp mode, and a more expressive file interface
for defining filters.

-serge

2011-05-06 11:53:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, 2011-05-05 at 02:21 -0700, Will Drewry wrote:

> In particular, if the userspace code wants to stage some filters and
> apply them all at once, when ready, I'm not sure that it makes sense
> to me to put that complexity in the kernel itself. For instance,
> Eric's second sample showed a call that took an array of ints and
> coalesced them into "fd == %d || ...". That simple example shows that
> we could easily get by with a pretty minimal kernel-supported
> interface as long as the richer behavior could live userspace side --
> even if just in a simple helper library. It'd be pretty easy to
> implement a userspace library that exposed add_filter(syscall_nr,
> filter) and apply_filters() such that it could manage building the
> final filter string for a given syscall and pushing it to prctl on
> apply.

I'm fine with a single kernel call and the "temporary filter" be done in
userspace. Making the kernel code less complex is better :)

>
> I think that could also help simplify the primitives. For instance,
> if any separate SET called on a system call resulting in an &&
> operation, then the behavior could be consistent prior to enforcement
> of the filtering and after. E.g.,
> SET, __NR_read, "fd == 1"
> SET, __NR_read, "len < 4097"
> would result in an evaluated "fd == 1 && len < 4097". It would do so
> after a single APPLY call too:
> SET, __NR_read, "1"
> APPLY
> SET, __NR_read, "fd == 1"
> SET, __NR_read, "len < 4097"
> Results in: "1 && fd == 1 && len < 4097", and SET, nr, "0" would
> nullify the syscall filter in total.

Only that that was not applied? We can't let tasks nullify their
restrictions once they have been applied. This keeps the kernel code
simpler.

> It seems like that would be
> enough to build the SET-SET-...-APPLY, SET-SET-...-SET-APPLY logic
> into a userspace library so that all temporary unapplied state doesn't
> have to be explicitly managed by the kernel.

Thus, the SETs are done in the userspace library that does not need to
interact with the kernel (besides perhaps allocating memory). Then the
apply would send all the filters to the kernel which would restrict the
task (or the task on exec) further.

>
> While I completely agree with the comment around ease-of-use as being
> key to security, I also find that the more the state diagram explodes,
> the harder it is to feel confident that a solution is actually secure.
> To try to achieve both objectives, I'd like to limit the kernel
> interface to the bare minimum of primitives and build any API
> fanciness into userspace.

Fair enough.

>
> Does it seem that the tradeoff isn't worth it, or are there some
> specific behaviors that aren't addressed using that model?
>
> While writing that, another option occurred to me that touches on the
> other proposals but makes the behaviors much more explicit.
> A prctl prototype could be provided:
> prctl(<SET|GET>, <AND|OR>, <syscall_nr>, <filter string>)
> e.g.,
> prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_OR, __NR_read, "fd == 2");
>
> The explicit prctl argument list would allow the filter strings to be
> self-referential and allow the userspace app to decide what behaviors
> are allowed and when. If we followed that route, all implicit filters
> would be "0" and the initial call to get things started might be:
> #define SET 33
> #define OR 0
> #define AND 1
> SET, OR, __NR_prctl, "option == 33 && (arg1 == 0 || arg1 == 1)"
> prctl(PR_SET_SECCOMP, 2);
>
> So now the "locked down" binary can call prctl to set an OR or AND
> filter for any syscall. A subsequent call could change that:
> SET, OR, __NR_read, "fd == 2" /* => "0 || fd == 2" */
> SET, AND, __NR_prctl, "(arg2 != 63 || arg1 != 0)" /* __NR_read == 63 */
>
> This would OR in a __NR_read filter, then disallow a future call to
> prctl to OR in more NR_read filters, but for other syscalls ANDing and
> ORing is still possible until you pass in something like:
>
> SET, AND, __NR_prctl, "arg1 == 1"
>
> which would lock down all future prctl calls to only ANDing filters
> in. (The numbers in the examples could then be properly managed in a
> userspace library to ensure platform correctness.)

I don't know about this. It seems to be starting to get too complex, and
thus error prone. Is there any reason we should allow an OR to the task?
Why would we want to restrict a task where the task could easily
unrestrict itself?

>
> While this would reduce the primitives a bit further, I'm not sure if
> this would be the right approach either, but it would open the door to
> pushing even more down to userspace very explicitly and further
> removing magic policy logic from the kernel-side. Is this vaguely
> interesting or just another layer of confusing-ness?

I'm confused, thus I must have hit that layer ;)

-- Steve

2011-05-06 13:36:14

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Fri, 2011-05-06 at 07:53 -0400, Steven Rostedt wrote:
> On Thu, 2011-05-05 at 02:21 -0700, Will Drewry wrote:
>
> > In particular, if the userspace code wants to stage some filters and
> > apply them all at once, when ready, I'm not sure that it makes sense
> > to me to put that complexity in the kernel itself. For instance,
> > Eric's second sample showed a call that took an array of ints and
> > coalesced them into "fd == %d || ...". That simple example shows that
> > we could easily get by with a pretty minimal kernel-supported
> > interface as long as the richer behavior could live userspace side --
> > even if just in a simple helper library. It'd be pretty easy to
> > implement a userspace library that exposed add_filter(syscall_nr,
> > filter) and apply_filters() such that it could manage building the
> > final filter string for a given syscall and pushing it to prctl on
> > apply.
>
> I'm fine with a single kernel call and the "temporary filter" be done in
> userspace. Making the kernel code less complex is better :)

I'm also starting to think a userspace library is a good idea as well.
Everything in the kernel is an && with nothing but SET and APPLY. We
provide interfaces to build the strings as we go along including UNSET
and stuff like that if there are users....

-Eric

2011-05-06 17:38:48

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, 2011-05-05 at 02:21 -0700, Will Drewry wrote:
> On Wed, May 4, 2011 at 11:46 AM, Steven Rostedt <[email protected]> wrote:

> In particular, if the userspace code wants to stage some filters and
> apply them all at once, when ready, I'm not sure that it makes sense
> to me to put that complexity in the kernel itself. For instance,
> Eric's second sample showed a call that took an array of ints and
> coalesced them into "fd == %d || ...". That simple example shows that
> we could easily get by with a pretty minimal kernel-supported
> interface as long as the richer behavior could live userspace side --
> even if just in a simple helper library. It'd be pretty easy to
> implement a userspace library that exposed add_filter(syscall_nr,
> filter) and apply_filters() such that it could manage building the
> final filter string for a given syscall and pushing it to prctl on
> apply.

Had a few minutes this morning so I started writing something that looks
sorta like a userspace library. It dosn't have unset yet but I don't
think it'll be that hard for me to implement. You can build some pretty
complex filters easily. Not sure when I'll get to work on it more, but
it at least shows the idea of doing the complex work in a library and
keeping a simple kernel interface.... Any thoughts?

-Eric


Attachments:
libseccomp.c (6.45 kB)
libseccomp.h (821.00 B)
Makefile (281.00 B)
test.c (829.00 B)
Download all attachments

2011-05-07 01:58:20

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Fri, May 6, 2011 at 4:53 AM, Steven Rostedt <[email protected]> wrote:
> On Thu, 2011-05-05 at 02:21 -0700, Will Drewry wrote:
>
>> In particular, if the userspace code wants to stage some filters and
>> apply them all at once, when ready, I'm not sure that it makes sense
>> to me to put that complexity in the kernel itself. ?For instance,
>> Eric's second sample showed a call that took an array of ints and
>> coalesced them into "fd == %d || ...". ?That simple example shows that
>> we could easily get by with a pretty minimal kernel-supported
>> interface as long as the richer behavior could live userspace side --
>> even if just in a simple helper library. ?It'd be pretty easy to
>> implement a userspace library that exposed add_filter(syscall_nr,
>> filter) and apply_filters() such that it could manage building the
>> final filter string for a given syscall and pushing it to prctl on
>> apply.
>
> I'm fine with a single kernel call and the "temporary filter" be done in
> userspace. Making the kernel code less complex is better :)
>
>>
>> I think that could also help simplify the primitives. ?For instance,
>> if any separate SET called on a system call resulting in an &&
>> operation, then the behavior could be consistent prior to enforcement
>> of the filtering and after. ?E.g.,
>> ? SET, __NR_read, "fd == 1"
>> ? SET, __NR_read, "len < 4097"
>> would result in an evaluated "fd == 1 && len < 4097". ?It would do so
>> after a single APPLY call too:
>> ? SET, __NR_read, "1"
>> ? APPLY
>> ? SET, __NR_read, "fd == 1"
>> ? SET, __NR_read, "len < 4097"
>> Results in: "1 && fd == 1 && len < 4097", and SET, nr, "0" would
>> nullify the syscall filter in total.
>
> Only that that was not applied? We can't let tasks nullify their
> restrictions once they have been applied. This keeps the kernel code
> simpler.

Ah - so I really need to be more explicit when discussing these
things! In the "simplification" effort, I was thinking any syscall
with no entry has a "0" rule. So if if nullify it, it becomes a
complete block and if you can't OR, then you can't add permissions.

>> ? It seems like that would be
>> enough to build the SET-SET-...-APPLY, SET-SET-...-SET-APPLY logic
>> into a userspace library so that all temporary unapplied state doesn't
>> have to be explicitly managed by the kernel.
>
> Thus, the SETs are done in the userspace library that does not need to
> interact with the kernel (besides perhaps allocating memory). Then the
> apply would send all the filters to the kernel which would restrict the
> task (or the task on exec) further.

Exactly. Smaller patch and less state per-filter entry (I hope!).


>>
>> While I completely agree with the comment around ease-of-use as being
>> key to security, I also find that the more the state diagram explodes,
>> the harder it is to feel confident that a solution is actually secure.
>> ?To try to achieve both objectives, I'd like to limit the kernel
>> interface to the bare minimum of primitives and build any API
>> fanciness into userspace.
>
> Fair enough.
>
>>
>> Does it seem that the tradeoff isn't worth it, or are there some
>> specific behaviors that aren't addressed using that model?
>>
>> While writing that, another option occurred to me that touches on the
>> other proposals but makes the behaviors much more explicit.
>> A prctl prototype could be provided:
>> ?prctl(<SET|GET>, <AND|OR>, <syscall_nr>, <filter string>)
>> e.g.,
>> ?prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_OR, __NR_read, "fd == 2");
>>
>> The explicit prctl argument list would allow the filter strings to be
>> self-referential and allow the userspace app to decide what behaviors
>> are allowed and when. If we followed that route, all implicit filters
>> would be "0" and the initial call to get things started might be:
>> ? ?#define SET 33
>> ? ?#define OR 0
>> ? ?#define AND 1
>> ? ?SET, OR, __NR_prctl, "option == 33 && (arg1 == 0 || arg1 == 1)"
>> ? ?prctl(PR_SET_SECCOMP, 2);
>>
>> So now the "locked down" binary can call prctl to set an OR or AND
>> filter for any syscall. ?A subsequent call could change that:
>> ? SET, OR, __NR_read, "fd == 2" ?/* => "0 || fd == 2" */
>> ? SET, AND, __NR_prctl, "(arg2 != 63 || arg1 != 0)" ?/* __NR_read == 63 */
>>
>> This would OR in a __NR_read filter, then disallow a future call to
>> prctl to OR in more NR_read filters, but for other syscalls ANDing and
>> ORing is still possible until you pass in something like:
>>
>> ? SET, AND, __NR_prctl, "arg1 == 1"
>>
>> which would lock down all future prctl calls to only ANDing filters
>> in. ?(The numbers in the examples could then be properly managed in a
>> userspace library to ensure platform correctness.)
>
> I don't know about this. It seems to be starting to get too complex, and
> thus error prone. Is there any reason we should allow an OR to the task?
> Why would we want to restrict a task where the task could easily
> unrestrict itself?

No idea! I can't think of any good examples where you'd want to do
it, just contrived ones. In general, I think the above approach would
rarely be used since I expect that something like 80% of the places
where this will be used will just be one-time, upfront filter installs
without any surface reduction after the fact.

That said, if there's no reason to support OR after the fact, then the
interface can just _only_ support &&s and leave the installation to
userspace. It might makes the multiple-fd-ORing case less fun in
userspace, but it should work for most cases I think.

>>
>> While this would reduce the primitives a bit further, I'm not sure if
>> this would be the right approach either, but it would open the door to
>> pushing even more down to userspace very explicitly and further
>> removing magic policy logic from the kernel-side. ?Is this vaguely
>> interesting or just another layer of confusing-ness?
>
> I'm confused, thus I must have hit that layer ;)

Sounds like it. I'm always a sucker for self-referential mechanisms.
I've been travelling a bit recently so my code output has been a bit
low, but I'll pull together the most minimal approach that I think
we've been iterating toward and hopefully post something in the not
too distant future.

thanks!

2011-05-07 02:11:45

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Fri, May 6, 2011 at 6:35 AM, Eric Paris <[email protected]> wrote:
> I'm also starting to think a userspace library is a good idea as well.
> Everything in the kernel is an && with nothing but SET and APPLY. ?We
> provide interfaces to build the strings as we go along including UNSET
> and stuff like that if there are users....

That sounds ideal to me. I think it'll be worth tagging this config
option as EXPERIMENTAL to begin with so that there's a chance to see
those of us who want it using it more widely and find out if there are
missing pieces or dumb limitations. Hopefully by starting minimal, it
won't be a huge departure to evolve it if needed.

On Fri, May 6, 2011 at 9:30 AM, Eric Paris <[email protected]> wrote:

> On Thu, 2011-05-05 at 02:21 -0700, Will Drewry wrote:
>> On Wed, May 4, 2011 at 11:46 AM, Steven Rostedt <[email protected]> wrote:
>
>> In particular, if the userspace code wants to stage some filters and
>> apply them all at once, when ready, I'm not sure that it makes sense
>> to me to put that complexity in the kernel itself. ?For instance,
>> Eric's second sample showed a call that took an array of ints and
>> coalesced them into "fd == %d || ...". ?That simple example shows that
>> we could easily get by with a pretty minimal kernel-supported
>> interface as long as the richer behavior could live userspace side --
>> even if just in a simple helper library. ?It'd be pretty easy to
>> implement a userspace library that exposed add_filter(syscall_nr,
>> filter) and apply_filters() such that it could manage building the
>> final filter string for a given syscall and pushing it to prctl on
>> apply.
>
> Had a few minutes this morning so I started writing something that looks
> sorta like a userspace library. ?It dosn't have unset yet but I don't
> think it'll be that hard for me to implement. ?You can build some pretty
> complex filters easily. ?Not sure when I'll get to work on it more, but
> it at least shows the idea of doing the complex work in a library and
> keeping a simple kernel interface.... ?Any thoughts?

Cool! I was thrown a little bit by spelling out the operations (eq,
ne, ..), but I can see how it might be useful for being able to
synthesize good filters before passing them off to the kernel.

It sounds like we might be moving towards a good starting point. I'll
start playing with the interface some (and start on the kernel support
too :).

thanks!

2011-05-12 03:07:14

by Will Drewry

[permalink] [raw]
Subject: [PATCH 4/5] v2 seccomp_filter: add process state reporting

Adds seccomp_filter status reporting to proc.
/proc/<pid>/seccomp_filter will provide read-only access to the current
filter set.

v2: removed status entry, added seccomp file.
(requested by [email protected])
allowed S_IRUGO reading of entries
(requested by [email protected])
added flags
got rid of the seccomp_t type
dropped seccomp file

Signed-off-by: Will Drewry <[email protected]>
---
fs/proc/base.c | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..f991d1a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -73,6 +73,7 @@
#include <linux/security.h>
#include <linux/ptrace.h>
#include <linux/tracehook.h>
+#include <linux/seccomp.h>
#include <linux/cgroup.h>
#include <linux/cpuset.h>
#include <linux/audit.h>
@@ -579,6 +580,24 @@ static int proc_pid_syscall(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */

+/*
+ * Print out the current seccomp filter set for the task.
+ */
+#ifdef CONFIG_SECCOMP_FILTER
+int proc_pid_seccomp_filter_show(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ struct seccomp_state *state;
+
+ rcu_read_lock();
+ state = get_seccomp_state(task->seccomp);
+ rcu_read_unlock();
+ seccomp_show_filters(state, m);
+ put_seccomp_state(state);
+ return 0;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
+
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
@@ -2838,6 +2857,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUGO, proc_pid_syscall),
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ ONE("seccomp_filter", S_IRUGO, proc_pid_seccomp_filter_show),
+#endif
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
@@ -3180,6 +3202,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUGO, proc_pid_syscall),
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ ONE("seccomp_filter", S_IRUGO, proc_pid_seccomp_filter_show),
+#endif
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
--
1.7.0.4

2011-05-12 03:07:52

by Will Drewry

[permalink] [raw]
Subject: [PATCH 5/5] v2 seccomp_filter: Document what seccomp_filter is and how it works.

Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
implemented presently, and what it may be used for. In addition,
the limitations and caveats of the proposed implementation are
included.

v2: moved to prctl/
updated for the v2 syntax.
adds a note about compat behavior

Signed-off-by: Will Drewry <[email protected]>
---
Documentation/prctl/seccomp_filter.txt | 156 ++++++++++++++++++++++++++++++++
1 files changed, 156 insertions(+), 0 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..4c1686a
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,156 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduce set
+of available system calls. The reduced set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+The implementation currently leverages both the existing seccomp
+infrastructure and the kernel tracing infrastructure. By centralizing
+hooks for attack surface reduction in seccomp, it is possible to assure
+attention to security that is less relevant in normal ftrace scenarios,
+such as time-of-check, time-of-use attacks. However, ftrace provides a
+rich, human-friendly environment for interfacing with system call
+specific arguments. (As such, this requires FTRACE_SYSCALLS for any
+introspective filtering support.)
+
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+an LSM of your choosing. Filtering based on the ftrace filter engine
+provides further options down this path (avoiding pathological sizes,
+for instance), but it could be misconstrued for a real sandbox.
+
+
+Usage
+-----
+
+An additional seccomp mode is exposed through mode '2',
+PR_SECCOMP_MODE_FILTER. This mode depends on CONFIG_SECCOMP_FILTER
+which in turn depends on CONFIG_FTRACE_SYSCALLS.
+
+A collection of filters may be supplied via prctl, and the current set
+of filters is exposed in /proc/<pid>/seccomp_filter.
+
+Interacting with seccomp filters can be done through three new prctl calls
+and one existing one.
+
+PR_SET_SECCOMP: A pre-existing option for enabling strict seccomp
+ mode (1) or filtering seccomp. This option now takes an
+ additional "flags" argument.
+
+ Usage:
+ prctl(PR_SET_SECCOMP, 1);
+ prctl(PR_SET_SECCOMP, PR_SECCOMP_MODE_FILTER, 0);
+ Flags:
+ - 0: Empty set.
+ - PR_SECCOMP_FLAG_FILTER_ON_EXEC: Delays enforcement of seccomp
+ enforcment only on MODE_FILTER until an exec() call is seen.
+
+PR_SET_SECCOMP_FILTER: Allows the specification of a new filter for
+ a given system call, by number, and filter string. If
+ CONFIG_FTRACE_SYSCALLS is supported, the filter string may be
+ any valid value for the given system call. If it is not
+ supported, the filter string may only be "1" or "0".
+
+ All calls to PR_SET_SECCOMP_FILTER for a given system
+ call will append the supplied string to any existing filters.
+ Filter construction looks as follows:
+ (Nothing) + "fd == 1 || fd == 2" => fd == 1 || fd == 2
+ ... + "fd != 2" => (fd == 1 || fd == 2) && fd != 2
+ ... + "size < 100" =>
+ ((fd == 1 || fd == 2) && fd != 2) && size < 100
+ If there is no filter and the seccomp mode has already
+ transitioned to filtering, additions cannot be made. Filters
+ may only be added that reduce the available kernel surface.
+
+ Usage (per the construction example above):
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "size < 100");
+
+PR_CLEAR_SECCOMP_FILTER: Removes all filter entries for a given system
+ call number. When called prior to entering seccomp filtering
+ mode, it allows for new filters to be applied to the same system
+ call. After transition, however, it completely drops access to
+ the call.
+
+ Usage:
+ prctl(PR_CLEAR_SECCOMP_FILTER, __NR_open);
+
+PR_GET_SECCOMP_FILTER: Returns the aggregated filter string for a system
+ call into a user-supplied buffer of a given length.
+
+ Usage:
+ prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf,
+ sizeof(buf));
+
+All of the above calls return 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err
+as well as access its filters after seccomp enforcement begins. This
+may be done as follows:
+
+ prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_prctl, "1");
+
+ prctl(PR_SET_SECCOMP, PR_SECCOMP_MODE_FILTER, 0);
+
+ /* Do stuff with fdset . . .*/
+
+ /* Drop read access and keep only write access to fd 1. */
+ prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
+
+ /* Perform any final processing . . . */
+ syscall(__NR_exit, 0);
+
+If the initial setup had been handled through a launcher of some sort,
+the call to PR_SET_SECCOMP may have been replaced with:
+ prctl(PR_SET_SECCOMP, PR_SECCOMP_MODE_FILTER, PR_SECCOMP_FLAG_FILTER_ON_EXEC);
+ /* ... */
+ execve(path, args);
+
+This will continue to allow system calls to proceed uninspected until an
+exec*() call is seen. From that point onward, the calling process will
+have filters enforced.
+
+
+Caveats
+-------
+
+- The filter event subsystem comes from CONFIG_TRACE_EVENTS, and the
+system call events come from CONFIG_FTRACE_SYSCALLS. However, if
+neither are available, a filter string of "1" will be honored, and it may
+be removed using PR_CLEAR_SECCOMP_FILTER. With ftrace filtering,
+calling PR_SET_SECCOMP_FILTER with a filter of "0" would have similar
+affect but would not be consistent on a kernel without the support.
+
+- Some platforms support a 32-bit userspace with 64-bit kernels. In
+these cases (CONFIG_COMPAT), system call numbers may not match across
+64-bit and 32-bit system calls. This may be especially relevant when
+filters are inherited across execution contexts. If filters are created
+in a non-compat context then inherited into a compat context, the
+inheriting process will be terminated if seccomp filtering mode is
+enabled. If it is not yet enabled, the inheriting process may iterate
+over the available system calls clearing any existing values. Once no
+filters remain, it can begin setting new filters based on its own
+context. (This behavior is bidirectional: compat->non-compat,
+non-compat->compat.)
--
1.7.0.4

2011-05-12 03:20:30

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

On Thu, May 5, 2011 at 6:14 AM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Will Drewry ([email protected]):
>> In particular, if the userspace code wants to stage some filters and
>> apply them all at once, when ready, I'm not sure that it makes sense
>> to me to put that complexity in the kernel itself. ?For instance,
>
> Hi Will,
>
> just one note - in my original comment I wasn't actually suggesting
> disabling setting of filters through a writeable file - I was only
> suggesting restricting writing to one's own filters file.
>
> All the better if it is possible to get a nice prctl-only
> interface, but if it ends up limiting rule expressiveness (or taking
> years to define an interface) then perhaps we should stick with
> prctl for setting seccomp mode, and a more expressive file interface
> for defining filters.

Didn't want you to think I missed this -- thanks for clarifying! I've
attempted to pull together a prctl interface that balances the
directions proposed by Eric, Steven, Frederic, and co. Upon
reflection of the /proc interface, it seems to have similar
challenges, but if the new patchset tanks and a /proc interface would
have more flexibility, I'll definitely explore that route. I'd
certainly like to avoid spending years defining this, especially
upfront, and I'll take any guidance as to how to best reach a
reasonable starting place! (Of course, I'd appreciate feedback on
this round of patches too :)

Thanks!
will